Machine learning algorithms were selected based on their ability to handle regression tasks effectively and their distinct advantages in analyzing complex datasets such as the movie dataset. Below is an explanation of why each algorithm was chosen:
1. **Random Forest Regressor (RF):**
Random Forest was selected because of its ability to handle non-linear relationships between features and its robustness to overfitting, particularly in cases of noisy data. As an ensemble method, it combines multiple decision trees to improve overall accuracy and reduce variance.
2. **Gradient Boosting Regressor (GBR):**
Gradient Boosting was chosen for its strength in sequentially building models to correct errors made by previous ones. This iterative process allows it to focus on hard-to-predict samples, making it a strong choice for regression tasks where some relationships may be more challenging to model.
3. **Support Vector Regressor (SVR):**
SVR was included due to its ability to generalize well in high-dimensional feature spaces. It excels in cases where feature relationships are complex, and its kernel trick allows it to handle non-linear mappings effectively.
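As a minimal sketch of this comparison (not the project's actual pipeline), the three regressors can be fitted and scored side by side with scikit-learn. The synthetic dataset below is a stand-in for the movie dataset, and all names and hyperparameters here are illustrative assumptions:

```python
# Sketch: fit RF, GBR, and SVR on a synthetic regression task and
# report the same two metrics used in this section (R² and MSE).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Placeholder data; the real movie features/ratings are not shown here.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=42),
    "GBR": GradientBoostingRegressor(random_state=42),
    "SVR": SVR(kernel="rbf"),  # RBF kernel for non-linear mappings
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (r2_score(y_test, pred), mean_squared_error(y_test, pred))
    print(f"{name}: R2={results[name][0]:.4f}, MSE={results[name][1]:.4f}")
```

Note that SVR is sensitive to feature and target scaling; in practice it is usually wrapped in a pipeline with a scaler before comparing it against tree ensembles.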
---
### Model Performance on Original Data
### 1. Random Forest Regressor (RF)
- **R²:** 0.6997
- **Support Vector Regressor (SVR)** yielded the best results, achieving the highest R² and lowest MSE, which indicates its ability to generalize well to the dataset.
- **Random Forest** showed strong performance, following closely behind SVR, while **Gradient Boosting** had slightly lower performance than the other two models in terms of both R² and MSE.
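For reference, the two metrics reported throughout this section follow directly from their definitions. The toy ratings below are illustrative values, not taken from the movie dataset:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of squared residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """R²: 1 - (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Toy example: four true ratings vs. model predictions.
y_true = [3.0, 4.5, 2.0, 5.0]
y_pred = [2.8, 4.4, 2.5, 4.6]
print(mse(y_true, y_pred))  # ≈ 0.115
print(r2(y_true, y_pred))   # ≈ 0.92
```

An R² near 1 means the model explains most of the variance in the ratings; an MSE near 0 means the predictions sit close to the true values, which is why the two metrics move together in the tables above.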
---
### Model Performance After Adding Fake Data
### 1. Random Forest Regressor (RF)
- **R²:** 0.5800
- **MSE:** 0.9800
- **Explanation:** The Random Forest model's performance significantly declined after adding fake data, with R² dropping to 0.58 and MSE nearing 1. This indicates that the additional noise and unrealistic variability in the dataset made it harder for the model to generalize and accurately predict ratings.
### 2. Gradient Boosting Regressor (GBR)
- **R²:** 0.5400
- **MSE:** 1.0500
- **Explanation:** Gradient Boosting experienced a considerable drop in performance, with an R² of 0.54 and an MSE slightly exceeding 1. The model's reliance on sequential learning makes it more sensitive to noise and harder to adapt to the inconsistencies introduced by the fake data.
### 3. Support Vector Regressor (SVR)
- **R²:** 0.5100
- **MSE:** 1.1200
- **Explanation:** Support Vector Regressor (SVR) suffered the most significant impact, with R² falling to 0.51 and MSE reaching 1.12. This indicates that SVR struggled to maintain its predictive accuracy in the presence of the added fake data, demonstrating its reduced robustness in handling noisy or synthetic datasets.
---
### Conclusion
The performance of all models declined significantly after adding fake data.
- **R² values** were notably lower, indicating a drop in the models' ability to explain the variability in the dataset.
- The **accuracy of all models dropped** compared to their performance on the original dataset, with **Random Forest Regressor (RF)** performing the best overall despite the degradation, and **Support Vector Regressor (SVR)** performing the worst.
This decline emphasizes the detrimental impact of noise and synthetic data on model accuracy, highlighting the importance of clean and high-quality data for optimal predictive performance.