Changes

Mykyta Kostohryz · 14101392
--- a/Model-training.md
+++ b/Model-training.md
+```markdown
+## Model Training Pipeline
+### Step 1: Data Loading and Preprocessing
+- **File Sources:**
+  - **Original dataset:** `water_potability.csv` contains the core water potability data.
+  - **Synthetic dataset:** `extended_water_potability.csv` adds fake features used to experiment with model predictions.
+- **Feature Engineering:**
+  - **Binning (Optional):** Numerical features are grouped into bins, which can smooth out noise and help the model generalize.
+  - **Polynomial Features (Optional):** Polynomial transformations generate new features from powers and interaction terms of existing features, helping to capture non-linear relationships.
+- **Data Splitting:**
+  - The dataset is split into training and testing sets using an 80/20 split.
+  - **Stratified splitting** ensures the distribution of target classes (Potability) is preserved across both training and test datasets.
+### Step 2: Data Balancing
+- **SMOTE (Synthetic Minority Oversampling Technique):**
+  - Used to address class imbalance by synthetically generating examples of the minority class (potable water, which is the smaller class in the standard `water_potability.csv` data).
+- **Random Under-Sampling (Optional):**
+  - After SMOTE, random under-sampling can be applied to reduce the size of the majority class, ensuring better balance between classes.
+### Step 3: Model Selection and Training
+- **Models Trained:**
+  1. **Support Vector Machine (SVM):** Effective in high-dimensional spaces and well suited to binary classification.
+  2. **Random Forest (RF):** An ensemble method that aggregates the predictions of multiple decision trees, which reduces overfitting relative to a single tree.
+  3. **Logistic Regression:** A linear baseline for binary classification that outputs class probabilities.
+  4. **k-Nearest Neighbors (kNN):** A non-parametric model that predicts the majority class among the closest training samples.
+- **Hyperparameter Tuning:**
+  - **GridSearchCV:** A search over a specified parameter grid is used to identify the best hyperparameters for each model.
+  - The hyperparameter tuning includes:
+    - **SVM:** Kernel type, regularization parameter (C), and gamma values.
+    - **Random Forest:** Number of trees (n_estimators), maximum depth (max_depth), and minimum samples required to split a node (min_samples_split).
+    - **Logistic Regression:** Regularization strength (C) and solver.
+    - **kNN:** Number of neighbors (n_neighbors) and distance metric.
+### Step 4: Model Evaluation
+- After training, models are evaluated using multiple metrics:
+  - **Accuracy:** Overall correctness of the model.
+  - **Classification Report:** Provides precision, recall, F1-score for each class.
+  - **Confusion Matrix:** Displays true positives, false positives, true negatives, and false negatives.
+  - **ROC AUC Score:** A measure of the model’s ability to distinguish between classes.
+  - **Cross-Validation:** Cross-validation scores are computed to estimate how well the model generalizes to unseen data.
+- **Evaluation Tools:**
+  - **Precision-Recall and ROC Curves:** Plots showing the trade-off between precision and recall, and the model’s performance across different classification thresholds.
+### Step 5: Model Deployment
+- **Saving the Model:**
+  - The trained models are serialized and saved using `joblib` for later deployment.
+  - Saved models are stored in a directory (`./runs/`) for easy access during predictions.
+### Step 6: Performance Monitoring and Updates
+- **Monitoring:** The models' performance is monitored over time, with periodic evaluations on new data.
+- **Model Updating:** Based on new data or performance degradation, models can be retrained and updated with improved features or hyperparameters.
+---
+This pipeline trains the models on balanced data with tuned hyperparameters and provides evaluation and monitoring steps that help maintain the models’ performance over time.
+```
\ No newline at end of file
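
For reviewers, Steps 1, 3, 4 and 5 of the page added above can be sketched roughly as follows. This is a minimal illustration, not the repository's training script. It assumes scikit-learn and joblib are installed, replaces `water_potability.csv` with synthetic data from `make_classification`, tunes only logistic regression, and writes to a temporary directory rather than `./runs/`. The file name `logreg.joblib` and the grid values are hypothetical. SMOTE (Step 2) is left out because it needs the separate `imbalanced-learn` package.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the real dataset (9 features, ~60/40 classes).
X, y = make_classification(
    n_samples=500, n_features=9, n_informative=5, weights=[0.6, 0.4], random_state=0
)

# Step 1: 80/20 split, stratified so both sets keep the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 3: grid search over the regularization strength C (hypothetical grid).
pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
)
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Step 4: evaluate the refit best model on the held-out test set.
accuracy = accuracy_score(y_test, search.predict(X_test))
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"best params: {search.best_params_}, accuracy: {accuracy:.3f}, AUC: {auc:.3f}")

# Step 5: serialize the best model with joblib and load it back.
model_dir = Path(tempfile.mkdtemp()) / "runs"
model_dir.mkdir()
model_path = model_dir / "logreg.joblib"  # hypothetical file name
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
```

Wrapping the scaler and classifier in one `Pipeline` means the scaler is fit only on the training folds during cross-validation, which avoids leaking test-fold statistics into the tuning step.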