Feature: Implement Data Analysis and Enhance Model Management

Alex Rudaev requested to merge github/fork/HlexNC/main into feature/data-analysis

This pull request adds data analysis capabilities to the Assistant Systems Project and reorganizes how machine learning models are stored. It covers data handling, visualization, and the workflow for training and managing models, and it refines the chatbot's handling of user interactions.

Key Changes:

  1. Data Processing and Analysis:

    • Data Loading and Preprocessing: Added data_loader.py and data_preprocessor.py to handle data loading, missing value imputation, and preprocessing tasks essential for model training.
    • Data Augmentation: Introduced data_augmentation.py to generate synthetic records with Faker and to oversample under-represented classes with SMOTE, giving the models a larger and more balanced training set.
    • Data Visualization: Created data_visualization.py using Altair to generate visualizations such as correlation heatmaps and distribution plots, helping users understand the data better.
  2. Model Management:

    • Organized Model Directories: Structured models into separate folders (models/data_analysis and models/chatbot) and updated .gitignore to exclude large or sensitive model files, maintaining a clean repository.
    • Training Workflow: Enhanced data_analysis.py to include steps for data splitting, augmentation, preprocessing, and model training with evaluation metrics. Implemented functionality to save and load models reliably.
    • Evaluation Metrics: Automated the saving of evaluation metrics into CSV files, making it easier to review model performance within the Streamlit application.
  3. Streamlit Application Enhancements:

    • Sidebar Updates: Modified app.py to display separate sections for Data Analysis Models and Chatbot Models. Included model evaluation summaries to provide clear insights directly in the application.
    • Data Analysis Module: Implemented the DataAnalysis class to manage data analysis tasks, including loading, filtering, visualization, training, and making sample predictions.
    • Chatbot Improvements: Updated rasa_chatbot.py to handle message suggestions effectively and display previous chat interactions, enhancing the user experience.
  4. Infrastructure and Configuration:

    • Docker Configuration: Updated docker-compose.yml to mount the specific model directories (models/data_analysis and models/chatbot) and the data directory, so containers see the same layout as the repository.
    • Dependency Management: Updated requirements.txt and requirements-actions.txt to include necessary packages such as scikit-learn, imblearn, faker, seaborn, and altair, supporting the new data analysis and augmentation functionalities.
  5. Cleanup and Optimization:

    • Removed Unused Files: Deleted outdated training scripts and Dockerfiles (train_and_rename.sh and Dockerfile.train) related to Rasa model training, simplifying the repository.
    • Dockerignore Configuration: Added .dockerignore to exclude unnecessary files and directories from Docker builds, optimizing container sizes and build times.
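
For reviewers, the Evaluation Metrics item under Model Management can be pictured with the rough sketch below. It is not the code in data_analysis.py: the function names save_metrics and load_metrics, the two-column metric,value layout, and the four-decimal rounding are all assumptions made for illustration, and only the Python standard library is used.

```python
import csv
from pathlib import Path


def save_metrics(metrics: dict[str, float], path: Path) -> None:
    """Write one metric per row under a 'metric,value' header (assumed layout)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["metric", "value"])
        for name, value in metrics.items():
            # Round to four decimals so the file stays readable in a sidebar.
            writer.writerow([name, f"{value:.4f}"])


def load_metrics(path: Path) -> dict[str, float]:
    """Read the CSV back into a plain dict, e.g. for display in the app."""
    with path.open(newline="", encoding="utf-8") as handle:
        return {row["metric"]: float(row["value"]) for row in csv.DictReader(handle)}
```

A caller would write the file once after training, for example save_metrics({"accuracy": 0.91}, Path("models/data_analysis/metrics.csv")), and the Streamlit sidebar could then call load_metrics on the same path. The file name metrics.csv is hypothetical.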

These updates enhance Project Apero by providing robust data analysis tools and a more organized approach to managing machine learning models. Users can now perform comprehensive data analysis, visualize key metrics, and interact with the chatbot more effectively. The improved structure and workflows facilitate easier maintenance and scalability for future developments.
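
The save and load step mentioned under Training Workflow could look roughly like the sketch below. It assumes pickle as the serialization format and a .pkl file extension, neither of which is stated in this description, and the helper names save_model and load_model are hypothetical. Only the models/data_analysis directory comes from the change list.

```python
import pickle
from pathlib import Path
from typing import Any

# Directory taken from the change list; the file naming scheme is an assumption.
MODEL_DIR = Path("models/data_analysis")


def save_model(model: Any, name: str, root: Path = Path(".")) -> Path:
    """Pickle the model to <root>/models/data_analysis/<name>.pkl."""
    target = root / MODEL_DIR / f"{name}.pkl"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then rename it into place, so an
    # interrupted save does not leave a truncated model file behind.
    tmp = target.with_name(target.name + ".tmp")
    with tmp.open("wb") as handle:
        pickle.dump(model, handle)
    tmp.replace(target)
    return target


def load_model(name: str, root: Path = Path(".")) -> Any:
    """Load a model previously written by save_model."""
    with (root / MODEL_DIR / f"{name}.pkl").open("rb") as handle:
        return pickle.load(handle)
```

Because unpickling can execute arbitrary code, a loader like this should only read files the project itself produced, which is one more reason to keep model files out of version control via .gitignore.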
