Update Model training, authored by Mykyta Kostohryz
@@ -18,7 +18,16 @@
- **Random Under-Sampling (Optional):**
- After SMOTE, random under-sampling can be applied to reduce the size of the majority class, ensuring better balance between classes.
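The under-sampling step can be sketched as follows. This is a minimal NumPy-only illustration, not the pipeline's actual code: the helper name `random_under_sample`, the toy data, and the choice to shrink every class down to the size of the rarest class are all assumptions made for the example.

```python
import numpy as np


def random_under_sample(X, y, random_state=0):
    """Randomly drop rows so each class keeps as many samples as the rarest class.

    Hypothetical helper for illustration; the target ratio is an assumption.
    """
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Sample n_keep row indices per class without replacement.
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=n_keep, replace=False)
        for cls in classes
    ])
    keep_idx.sort()  # keep the original row order
    return X[keep_idx], y[keep_idx]


# Toy data: 90 majority-class rows and 10 minority-class rows.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_res, y_res = random_under_sample(X, y)
```

In a pipeline that applies SMOTE first, the same helper would be called on the over-sampled arrays instead of the raw ones.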
-### Step 3: Model Selection and Training
+### Step 3: Dimensionality Reduction and Feature Selection
+- **Principal Component Analysis (PCA):**
+- PCA is applied to reduce the dimensionality of the feature space while preserving most of the variance in the data.
+- This technique helps to avoid overfitting, speeds up model training, and can improve model performance when dealing with highly correlated features.
+- The number of components can be selected based on the amount of variance to retain (e.g., 95% variance retention).
+- **Recursive Feature Elimination (RFE):**
+- RFE is used to select the most important features by recursively removing the least important features and retraining the model to identify the best subset.
+- This method helps improve model interpretability and can reduce overfitting by removing redundant or irrelevant features.
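Both techniques can be sketched with scikit-learn as shown below. The synthetic data, the `LogisticRegression` base estimator for RFE, and the target of 8 features are assumptions for illustration; the page does not say which estimator drives RFE or whether PCA and RFE are chained or used as alternatives, so the sketch runs them independently.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=10, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# A float n_components asks PCA to keep enough components to reach that
# fraction of explained variance (here 95%).
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

# RFE repeatedly fits the estimator and drops the weakest feature
# until the requested number of features remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=8)
X_rfe = rfe.fit_transform(X_scaled, y)
selected_mask = rfe.support_  # boolean mask over the original 20 features

print("PCA components kept:", pca.n_components_)
print("RFE selected feature indices:", selected_mask.nonzero()[0].tolist())
```

Note the difference in output: PCA returns new, combined axes, while RFE returns a subset of the original columns, which is why only RFE helps interpretability.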
+### Step 4: Model Selection and Training
- **Models Trained:**
1. **Support Vector Machine (SVM):** Effective for high-dimensional spaces and performs well for binary classification tasks.
2. **Random Forest (RF):** A robust ensemble method that aggregates the predictions of multiple decision trees, ensuring better generalization.
@@ -38,7 +47,7 @@
- **kNN:** Number of neighbors (n_neighbors) and distance metric.
- **Ensemble Model:** Hyperparameters for both SVM and Random Forest models are optimized separately before being combined in the ensemble method.
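The "tune separately, then combine" approach for the ensemble might look like the sketch below. The page does not name the search strategy or the combination method, so `GridSearchCV`, the small parameter grids, and a hard-voting `VotingClassifier` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the prepared training data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Tune each base model on its own grid (grids are illustrative only).
svm_search = GridSearchCV(
    SVC(random_state=0),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=3,
)
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
svm_search.fit(X_train, y_train)
rf_search.fit(X_train, y_train)

# Combine the two tuned models; hard voting takes the majority class label.
ensemble = VotingClassifier(
    estimators=[
        ("svm", svm_search.best_estimator_),
        ("rf", rf_search.best_estimator_),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
test_accuracy = ensemble.score(X_test, y_test)

print("Best SVM params:", svm_search.best_params_)
print("Best RF params:", rf_search.best_params_)
```

The kNN search would follow the same pattern, with `n_neighbors` and `metric` as the grid keys of a `KNeighborsClassifier`.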
-### Step 4: Model Evaluation
+### Step 5: Model Evaluation
- After training, models are evaluated using multiple metrics:
- **Accuracy:** Overall correctness of the model.
- **Classification Report:** Provides precision, recall, F1-score for each class.
@@ -49,15 +58,11 @@
- **Evaluation Tools:**
- **Precision-Recall and ROC Curves:** Plots showing the trade-off between precision and recall, and the model’s performance across different classification thresholds.
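The metrics and curve data named above can be computed with `sklearn.metrics`, roughly as follows. The synthetic data and the Random Forest used as the model under evaluation are assumptions; the sketch covers only the metrics visible in this excerpt and computes the curve points without plotting them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the real data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)  # per-class precision/recall/F1

# Curve points across all classification thresholds.
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

print(f"Accuracy: {accuracy:.3f}")
print(report)
```

The `precision`/`recall` and `fpr`/`tpr` arrays can be passed directly to a plotting library to draw the two curves.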
-### Step 5: Model Deployment
+### Step 6: Model Deployment
- **Saving the Model:**
- The trained models are serialized and saved using `joblib` for later deployment.
- Saved models are stored in a directory (`./runs/`) for easy access during predictions.
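Saving and reloading a model with `joblib` can be sketched as below. The `./runs/` directory comes from the text above; the file name `svm_example.joblib` and the toy SVM are hypothetical, since the page does not state how the saved files are named.

```python
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Train a small placeholder model (assumption: any fitted estimator works the same way).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC().fit(X, y)

# Serialize the fitted model into ./runs/.
run_dir = Path("./runs")
run_dir.mkdir(parents=True, exist_ok=True)
model_path = run_dir / "svm_example.joblib"  # hypothetical file name
joblib.dump(model, model_path)

# Later, at prediction time, load it back and use it as usual.
restored = joblib.load(model_path)
predictions = restored.predict(X[:5])
```

A model loaded this way should be used with the same scikit-learn version it was saved with, because the serialized format is not guaranteed to be compatible across versions.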
-### Step 6: Performance Monitoring and Updates
-- **Monitoring:** The models' performance is monitored over time, with periodic evaluations on new data.
-- **Model Updating:** Based on new data or performance degradation, models can be retrained and updated with improved features or hyperparameters.
---
This pipeline ensures that the models are trained on clean, balanced data with optimized hyperparameters, while providing robust evaluation and monitoring mechanisms to guarantee the model’s long-term performance.
\ No newline at end of file