Update Model training authored by Mykyta Kostohryz
@@ -18,7 +18,16 @@
- **Random Under-Sampling (Optional):**
- After SMOTE, random under-sampling can be applied to reduce the size of the majority class, ensuring better balance between classes.
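The under-sampling step can be sketched as follows. The page does not name the implementation (imbalanced-learn's `RandomUnderSampler` is a common choice), so this is a NumPy-only illustration; the function name `random_under_sample`, the `target_size` parameter, and the toy data are assumptions made for the example.

```python
import numpy as np


def random_under_sample(X, y, majority_label, target_size, seed=0):
    """Randomly keep `target_size` majority-class rows; keep every other row."""
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    # Never request more rows than the majority class actually has.
    n_keep = min(target_size, majority_idx.size)
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    # Sort so the surviving rows keep their original order.
    keep = np.sort(np.concatenate([kept_majority, other_idx]))
    return X[keep], y[keep]


# Hypothetical toy data: 90 majority samples (label 0), 30 minority samples (label 1).
X = np.arange(240, dtype=float).reshape(120, 2)
y = np.array([0] * 90 + [1] * 30)

X_res, y_res = random_under_sample(X, y, majority_label=0, target_size=30)
class_counts = np.bincount(y_res)  # number of samples per class after resampling
```

Resampling of this kind is normally applied to the training split only, so that evaluation data keeps its original class distribution.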
### Step 3: Model Selection and Training
### Step 3: Dimensionality Reduction and Feature Selection
- **Principal Component Analysis (PCA):**
- PCA reduces the dimensionality of the feature space while retaining as much of the data's variance as possible.
- It can reduce overfitting, speed up model training, and improve performance when features are highly correlated.
- The number of components can be chosen by the fraction of variance to retain (e.g., 95%).
- **Recursive Feature Elimination (RFE):**
- RFE selects features by repeatedly training the model, ranking the features by importance, and removing the weakest until the desired number remains.
- This improves model interpretability and can reduce overfitting by removing redundant or irrelevant features.
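Both techniques could look roughly like this with scikit-learn. This is a sketch under assumptions: the synthetic data, the standardization step, the logistic-regression estimator inside RFE, and the choice of 8 features are illustrative and not taken from the project. The page also does not say whether PCA and RFE are chained or used as alternatives, so each is applied here to the same scaled features independently.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data with deliberately correlated (redundant) features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=10, random_state=0
)

# PCA is scale-sensitive, so standardize first (an assumption of this sketch).
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction (here 95%).
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)
retained_variance = pca.explained_variance_ratio_.sum()

# RFE refits the estimator and drops the weakest feature (step=1)
# until n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=8, step=1)
X_rfe = rfe.fit_transform(X_scaled, y)
selected_mask = rfe.support_  # boolean mask over the original 20 features

print(f"PCA kept {X_pca.shape[1]} components ({retained_variance:.1%} variance)")
print(f"RFE kept {int(selected_mask.sum())} of {X.shape[1]} features")
```

PCA produces new combined axes, whereas RFE keeps a subset of the original columns, so RFE is the option that preserves feature-level interpretability.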
### Step 4: Model Selection and Training
- **Models Trained:**
1. **Support Vector Machine (SVM):** Effective in high-dimensional spaces and well suited to binary classification.
2. **Random Forest (RF):** An ensemble method that aggregates the predictions of many decision trees, which typically generalizes better than a single tree.
@@ -38,7 +47,7 @@
- **kNN:** Number of neighbors (n_neighbors) and distance metric.
- **Ensemble Model:** Hyperparameters for both SVM and Random Forest models are optimized separately before being combined in the ensemble method.
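A minimal sketch of tuning each model separately and then combining the tuned SVM and Random Forest is shown here. The search strategy (`GridSearchCV`), the parameter grids, the F1 scoring, and the soft-voting `VotingClassifier` are assumptions; the page only states that the models are optimized separately before being combined.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical binary classification data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each model gets its own search over an illustrative grid.
svm_search = GridSearchCV(
    SVC(probability=True, random_state=0),  # probability=True enables soft voting
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=3,
    scoring="f1",
)
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
    scoring="f1",
)
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7], "metric": ["euclidean", "manhattan"]},
    cv=3,
    scoring="f1",
)
for search in (svm_search, rf_search, knn_search):
    search.fit(X_train, y_train)

# Combine the separately tuned SVM and Random Forest; the ensemble
# refits clones of both estimators on the training data.
ensemble = VotingClassifier(
    estimators=[("svm", svm_search.best_estimator_), ("rf", rf_search.best_estimator_)],
    voting="soft",
)
ensemble.fit(X_train, y_train)
test_accuracy = ensemble.score(X_test, y_test)
```

Soft voting averages predicted class probabilities, which is why the SVM is created with `probability=True`; hard voting on predicted labels would be an equally valid reading of the text.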
### Step 4: Model Evaluation
### Step 5: Model Evaluation
- After training, models are evaluated using multiple metrics:
- **Accuracy:** Overall correctness of the model.
- **Classification Report:** Provides precision, recall, F1-score for each class.
@@ -49,15 +58,11 @@
- **Evaluation Tools:**
- **Precision-Recall and ROC Curves:** Plots of the precision-recall trade-off and of the true-positive rate against the false-positive rate across classification thresholds.
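The metrics and curve data above can be computed along these lines with `sklearn.metrics`. The data and the logistic-regression stand-in model are hypothetical; any fitted classifier that exposes `predict_proba` (or `decision_function`) would work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced binary data and a stand-in model.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels for accuracy / report
y_score = model.predict_proba(X_test)[:, 1]  # positive-class scores for the curves

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)  # precision, recall, F1 per class

# Curve points, one per candidate threshold; these arrays can be passed
# to any plotting library.
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

print(f"accuracy={accuracy:.3f}  ROC AUC={auc:.3f}")
print(report)
```

On imbalanced data the precision-recall curve is usually the more informative of the two, because the ROC curve can look favorable even when precision on the minority class is low.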
### Step 5: Model Deployment
### Step 6: Model Deployment
- **Saving the Model:**
- The trained models are serialized and saved using `joblib` for later deployment.
- Saved models are stored in a directory (`./runs/`) for easy access during predictions.
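Saving and reloading a model with `joblib` might look like the sketch below. The helper names `save_model` and `load_model`, the `<name>.joblib` file naming, and the stand-in Random Forest are assumptions; only the use of `joblib` and the `./runs/` directory come from the text above. The demonstration writes to a temporary directory so it leaves no files behind.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def save_model(model, name, run_dir="./runs"):
    """Serialize `model` to <run_dir>/<name>.joblib and return the path."""
    run_path = Path(run_dir)
    run_path.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    model_path = run_path / f"{name}.joblib"
    joblib.dump(model, model_path)
    return model_path


def load_model(model_path):
    """Load a model previously written by save_model."""
    return joblib.load(model_path)


# Hypothetical data and model for the round trip.
X, y = make_classification(n_samples=100, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_model(model, "random_forest", run_dir=tmp_dir)
    restored = load_model(path)
    # The restored model should reproduce the original predictions.
    same_predictions = bool((restored.predict(X) == model.predict(X)).all())
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same library versions that produced them.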
### Step 6: Performance Monitoring and Updates
- **Monitoring:** The models' performance is monitored over time, with periodic evaluations on new data.
- **Model Updating:** Based on new data or performance degradation, models can be retrained and updated with improved features or hyperparameters.
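One way to sketch the monitoring check is to compare the score on newly labelled data against a stored baseline and flag the model for retraining when the drop is too large. The helper `needs_retraining`, the F1 metric, the `max_drop` tolerance of 0.05, and the data splits are all hypothetical; the page does not specify how degradation is detected.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def needs_retraining(model, X_new, y_new, baseline_f1, max_drop=0.05):
    """Return (flag, current_f1); flag is True when F1 fell more than
    `max_drop` below the baseline."""
    current_f1 = float(f1_score(y_new, model.predict(X_new)))
    return bool((baseline_f1 - current_f1) > max_drop), current_f1


# Hypothetical data: a training split, a reference split, and a later batch.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train = X[:300], y[:300]
X_ref, y_ref = X[300:450], y[300:450]
X_new, y_new = X[450:], y[450:]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_f1 = float(f1_score(y_ref, model.predict(X_ref)))

retrain, current_f1 = needs_retraining(model, X_new, y_new, baseline_f1)
if retrain:
    # Retrain on all labelled data collected so far (one possible policy).
    model = LogisticRegression(max_iter=1000).fit(X, y)
```

In practice the check would run on a schedule, and the baseline would be refreshed whenever a retrained model is accepted.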
---
This pipeline trains the models on clean, balanced data with optimized hyperparameters, and its evaluation and monitoring steps help maintain model performance over time.
\ No newline at end of file