Feature: Implement Data Analysis and Enhance Model Management
This pull request introduces key updates to the Assistant Systems Project, focusing on enhancing data analysis capabilities and organizing machine learning models more effectively. These changes aim to provide better data handling, insightful visualizations, and streamlined workflows for training and managing models, while also refining the chatbot functionalities for improved user interactions.
Key Changes:
- Data Processing and Analysis:
  - Data Loading and Preprocessing: Added `data_loader.py` and `data_preprocessor.py` to handle data loading, missing value imputation, and preprocessing tasks essential for model training.
  - Data Augmentation: Introduced `data_augmentation.py` to create synthetic data using Faker and SMOTE, increasing the dataset's diversity and supporting better model performance.
  - Data Visualization: Created `data_visualization.py` using Altair to generate visualizations such as correlation heatmaps and distribution plots, helping users understand the data better.
- Model Management:
  - Organized Model Directories: Structured models into separate folders (`models/data_analysis` and `models/chatbot`) and updated `.gitignore` to exclude large or sensitive model files, maintaining a clean repository.
  - Training Workflow: Enhanced `data_analysis.py` to include steps for data splitting, augmentation, preprocessing, and model training with evaluation metrics. Implemented functionality to save and load models reliably.
  - Evaluation Metrics: Automated the saving of evaluation metrics into CSV files, making it easier to review model performance within the Streamlit application.
- Streamlit Application Enhancements:
  - Sidebar Updates: Modified `app.py` to display separate sections for Data Analysis Models and Chatbot Models. Included model evaluation summaries to provide clear insights directly in the application.
  - Data Analysis Module: Implemented the `DataAnalysis` class to manage data analysis tasks, including loading, filtering, visualization, training, and making sample predictions.
  - Chatbot Improvements: Updated `rasa_chatbot.py` to handle message suggestions effectively and display previous chat interactions, enhancing the user experience.
- Infrastructure and Configuration:
  - Docker Configuration: Updated `docker-compose.yml` to mount specific model directories (`models/data_analysis` and `models/chatbot`) and include the data directory, ensuring organized and efficient deployment environments.
  - Dependency Management: Updated `requirements.txt` and `requirements-actions.txt` to include necessary packages such as `scikit-learn`, `imblearn`, `faker`, `seaborn`, and `altair`, supporting the new data analysis and augmentation functionalities.
- Cleanup and Optimization:
  - Removed Unused Files: Deleted outdated training scripts and Dockerfiles (`train_and_rename.sh` and `Dockerfile.train`) related to Rasa model training, simplifying the repository.
  - Dockerignore Configuration: Added `.dockerignore` to exclude unnecessary files and directories from Docker builds, optimizing container sizes and build times.
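
For reviewers who want a concrete picture of the splitting and imputation steps named under Data Loading and Preprocessing and Training Workflow, here is a minimal sketch. It is not the code in `data_preprocessor.py` or `data_analysis.py`: the function names `split_rows`, `column_mean`, and `fill_missing` are hypothetical, mean imputation is an assumed strategy, and the sketch uses only the standard library rather than scikit-learn. It splits first and computes the fill value from the training rows only, so the test rows do not influence preprocessing.

```python
import random
from statistics import mean


def split_rows(rows: list, test_fraction: float = 0.25, seed: int = 0) -> tuple[list, list]:
    """Shuffle a copy with a fixed seed, then hold out the tail as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def column_mean(rows: list[dict], column: str) -> float:
    """Mean of the observed (non-None) values in one column."""
    observed = [r[column] for r in rows if r[column] is not None]
    if not observed:
        raise ValueError(f"column {column!r} has no observed values")
    return mean(observed)


def fill_missing(rows: list[dict], column: str, value: float) -> list[dict]:
    """Return new rows with None in `column` replaced by `value`; input is untouched."""
    return [{**r, column: value if r[column] is None else r[column]} for r in rows]


if __name__ == "__main__":
    # Placeholder rows for illustration only; not project data.
    rows = [{"age": 30.0}, {"age": None}, {"age": 50.0}, {"age": 40.0}]
    train, test = split_rows(rows, test_fraction=0.25, seed=0)
    fill = column_mean(train, "age")  # fitted on training rows only
    train = fill_missing(train, "age", fill)
    test = fill_missing(test, "age", fill)
    print(len(train), len(test))
```

Reusing the training-set mean for the test rows mirrors the fit-then-transform pattern common in preprocessing pipelines; whether the project follows it is not stated in this description.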
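
The Model Management items mention saving and loading models and writing evaluation metrics to CSV. One way this could look is sketched below, under stated assumptions: the helper names (`save_model`, `load_model`, `save_metrics_csv`, `load_metrics_csv`), the file names, the `metric,value` CSV layout, and the use of `pickle` are all illustrative choices, not taken from the PR. Only the `models/data_analysis` directory name comes from the description.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def save_model(model: object, path: Path) -> None:
    """Serialize a model object, creating the target directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path: Path) -> object:
    """Load a model. Only unpickle files you trust; pickle can run arbitrary code."""
    with path.open("rb") as f:
        return pickle.load(f)


def save_metrics_csv(metrics: dict[str, float], path: Path) -> None:
    """Write one metric per row under a `metric,value` header."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["metric", "value"])
        for name, value in metrics.items():
            writer.writerow([name, value])


def load_metrics_csv(path: Path) -> dict[str, float]:
    """Read the CSV back into a name-to-value mapping."""
    with path.open(newline="") as f:
        return {row["metric"]: float(row["value"]) for row in csv.DictReader(f)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp) / "models" / "data_analysis"
        # A plain dict stands in for a trained model; metric values are placeholders.
        save_model({"weights": [0.1, 0.2]}, model_dir / "model.pkl")
        save_metrics_csv({"accuracy": 0.9}, model_dir / "metrics.csv")
        print(load_model(model_dir / "model.pkl"))
        print(load_metrics_csv(model_dir / "metrics.csv"))
```

A flat CSV like this can be read straight into a table for display, which fits the stated goal of reviewing model performance within the Streamlit application.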
These updates enhance Project Apero by providing robust data analysis tools and a more organized approach to managing machine learning models. Users can now perform comprehensive data analysis, visualize key metrics, and interact with the chatbot more effectively. The improved structure and workflows facilitate easier maintenance and scalability for future developments.