Feature: Enhanced Data Analysis, Model Management, and Streamlined Development (!3) · Merge requests · Alex Rudaev / Assistant-Systems-Project

Alex Rudaev requested to merge feature/data-analysis into main Nov 12, 2024

This merge request introduces extensive updates to the data analysis capabilities, model management, and overall project structure of the Assistant Systems project. Key changes include improved Docker configuration, refined data processing, and enhanced chatbot functionality.

Project Structure and Configuration Updates:

Updated .gitignore and .dockerignore: New ignore rules prevent unnecessary files from being tracked in Git or included in Docker images, streamlining development and reducing image sizes. This includes ignoring cache files, virtual environments, processed data, and specific model directories.
Refined Dockerfile and docker-compose.yml: The main Dockerfile now includes a data/ directory, allowing direct access to essential files. Additional mounts in docker-compose.yml enable organized management of data_analysis and chatbot models within the container, optimizing deployment and model usage across services.
Removed Dockerfile.train and train_and_rename.sh: These deprecated training files have been deleted. All training functionality is now centralized, which simplifies management and reduces clutter.

Data Analysis Enhancements:

New DataAnalysis Class: Added to data_analysis.py, this class encapsulates data processing, visualization, and model training functions. Key updates:
- Filtering and Preprocessing: The display_filters method filters the data in real time and stores the selected settings in session state, so the analysis always reflects the user's current inputs.
- Visualization: A separate data_visualization.py module integrates Altair plots, including correlation heatmaps and data distribution charts, enabling interactive visual analysis in Streamlit.
- Model Management: Includes handling of both real and augmented data, with model evaluation summaries stored in model_evaluations.csv for transparency.
- Augmentation: Introduces data_augmentation.py using Faker for data synthesis, enhancing model robustness in low-sample scenarios.
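The filtering behaviour described above can be illustrated with a minimal sketch. It is not the actual display_filters implementation: the helper names, the "filters" key, and the sample columns are all hypothetical, and a plain dict stands in for Streamlit's dict-like st.session_state.

```python
from collections.abc import MutableMapping

import pandas as pd


def store_filters(state: MutableMapping, **selected) -> None:
    """Persist the current filter choices under a hypothetical 'filters' key."""
    # Empty selections are dropped so they do not filter out every row.
    state["filters"] = {column: values for column, values in selected.items() if values}


def apply_filters(df: pd.DataFrame, state: MutableMapping) -> pd.DataFrame:
    """Keep only rows whose values match every stored selection."""
    mask = pd.Series(True, index=df.index)
    for column, allowed in state.get("filters", {}).items():
        mask &= df[column].isin(allowed)
    return df[mask]


# Illustrative data; the column names are made up for this sketch.
df = pd.DataFrame({"city": ["Riga", "Oslo", "Riga"], "age": [25, 40, 31]})

state: dict = {}  # assumption: in the app this role is played by st.session_state
store_filters(state, city=["Riga"])
filtered = apply_filters(df, state)
```

Keeping the selections in a state mapping, rather than in local variables, is what lets them survive Streamlit's rerun-on-interaction model.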
Model Training and Evaluation:
- Multiple models (e.g., Logistic Regression, SVM, and Random Forest) are trained on both real and augmented data, with performance metrics stored in model_evaluations.csv.
- A new preprocessing pipeline built on StandardScaler and OneHotEncoder handles feature scaling and categorical encoding in one consistent step before training.
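
The training and evaluation flow above can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in data_analysis.py: the function name, the column names, the toy data, and the choice of accuracy as the metric are hypothetical. Only the model families, the two transformers, and the model_evaluations.csv file name come from this description.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC


def evaluate_models(df, target, numeric_cols, categorical_cols, data_label):
    """Train several classifiers behind one preprocessing step and score each."""
    X = df[numeric_cols + categorical_cols]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": SVC(),
        "random_forest": RandomForestClassifier(random_state=0),
    }
    rows = []
    for name, model in models.items():
        # A fresh preprocessor per model keeps the fitted pipelines independent.
        preprocess = ColumnTransformer([
            ("num", StandardScaler(), numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ])
        pipeline = Pipeline([("preprocess", preprocess), ("model", model)])
        pipeline.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, pipeline.predict(X_test))
        rows.append({"model": name, "data": data_label, "accuracy": accuracy})
    return pd.DataFrame(rows)


# Toy data standing in for the "real" dataset (hypothetical columns).
toy = pd.DataFrame({
    "income": [20, 25, 30, 35, 80, 85, 90, 95] * 5,
    "segment": ["a", "a", "b", "b", "c", "c", "d", "d"] * 5,
    "label": [0, 0, 0, 0, 1, 1, 1, 1] * 5,
})
summary = evaluate_models(toy, "label", ["income"], ["segment"], data_label="real")

# Passing a path such as "model_evaluations.csv" would write the file instead.
csv_text = summary.to_csv(index=False)
```

Calling the same function a second time with the augmented dataset and data_label="augmented", then concatenating the two summaries, would give one table comparing both data sources.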

Updated Chatbot Functionality:

Enhanced Rasa Integration: The rasa_chatbot.py module now includes improved chatbot interaction within Streamlit, displaying previous interactions and offering message suggestions for a better user experience.
Refined Action Files: Updates in actions.py improve chatbot responses, data analysis sharing, and recommendation generation by loading and using specific models based on user inputs. Additionally, the updated requirements-actions.txt includes dependencies needed for new chatbot functionalities.

Documentation and README Updates:

Detailed README: Enhanced with instructions for setting up and running the application via Docker, along with options for local setup. Key additions:
- Step-by-step guidance on model training, chatbot integration, and data analysis.
- Improved clarity on Docker dependencies and service management.

This merge request adds comprehensive data handling, interactive visualization, and more robust chatbot interactions to the Assistant Systems project. Together, these updates streamline the workflow, reduce overhead, and give users a more interactive and informative experience.
