- Missing values were handled with **SimpleImputer** from `sklearn.impute`, configured with `strategy="most_frequent"` so that each missing value is replaced by the most frequent value in its column.
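A minimal sketch of this step, assuming the data has already been loaded into a pandas DataFrame named `data` (the variable name is an assumption):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Fill each missing value with the most frequent value in its column.
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a NumPy array, so rewrap it as a DataFrame;
# numeric dtypes may need re-casting afterwards.
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
```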
#### 2.2. Removing Duplicates
- Duplicate entries were removed using the `drop_duplicates()` method to ensure the dataset contains unique rows.
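For reference, the deduplication step is a one-liner (same assumed `data` DataFrame):

```python
# Drop exact duplicate rows and reset the index so labels stay contiguous.
data = data.drop_duplicates().reset_index(drop=True)
```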
#### 2.3. Feature Selection
- The following features were selected for the analysis:
  - **Year**
  - **Runtime (Minutes)**
  - **Metascore**
  - **Votes**
  - **Rating** (target variable)
#### 2.4. Removing Invalid Data
- Rows with zero votes were removed with `data[data["Votes"] > 0]`, since entries that no one has voted on carry no reliable rating signal; see the sketch below.
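A sketch combining the feature selection of section 2.3 with this filter (column names follow the list above; the DataFrame name is an assumption):

```python
# Keep only the modelling columns listed in section 2.3.
# Add "Genre" here if it is retained for the encoding step in section 3.
features = ["Year", "Runtime (Minutes)", "Metascore", "Votes", "Rating"]
data = data[features]

# Remove rows with zero votes.
data = data[data["Votes"] > 0]
```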
### 3. Data Transformation
- **Encoding Categorical Variables**:
- Categorical features, such as the `Genre` column, were encoded using one-hot encoding with `pd.get_dummies()`.
- The `drop_first=True` argument was used to avoid multicollinearity.
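A sketch of the encoding step (`Genre` is named in the text; the DataFrame name is an assumption):

```python
import pandas as pd

# One-hot encode Genre; drop_first=True drops one dummy per category
# so the remaining columns are not perfectly collinear.
data = pd.get_dummies(data, columns=["Genre"], drop_first=True)
```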
### 4. Handling Outliers
- **Outlier Detection**:
- Outliers were detected using the **Interquartile Range (IQR)** method, where values outside the range `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` were considered outliers.
  - Detected outliers were either removed or transformed (e.g., capped at the fences), depending on their influence on the dataset; a removal sketch follows.
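A sketch of the removal variant for one numeric column (the choice of `Votes` is illustrative):

```python
# Compute the IQR fences for the column.
q1 = data["Votes"].quantile(0.25)
q3 = data["Votes"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
data = data[data["Votes"].between(lower, upper)]
```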
### 5. Data Augmentation
#### 5.1. Generating Fake Data
- A synthetic dataset was created by adding **25% fake data** generated with the **Faker** library, simulating movie attributes such as genre, rating, and votes. The fake data was introduced to test model performance and robustness.
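One way this augmentation might look with **Faker** (a sketch; the column names and value ranges are assumptions, while the 25% figure comes from the text):

```python
import pandas as pd
from faker import Faker

fake = Faker()
n_fake = int(0.25 * len(data))  # 25% of the real row count

# Synthesize plausible-looking movie rows; ranges are assumptions.
fake_rows = pd.DataFrame({
    "Genre": [fake.word() for _ in range(n_fake)],  # placeholder genre tokens
    "Year": [fake.random_int(1990, 2023) for _ in range(n_fake)],
    "Runtime (Minutes)": [fake.random_int(60, 200) for _ in range(n_fake)],
    "Metascore": [fake.random_int(0, 100) for _ in range(n_fake)],
    "Votes": [fake.random_int(1, 1_000_000) for _ in range(n_fake)],
    "Rating": [round(fake.pyfloat(min_value=1, max_value=10), 1)
               for _ in range(n_fake)],
})
data_augmented = pd.concat([data, fake_rows], ignore_index=True)
```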
#### 5.2. Impact of Fake Data
- The impact of the fake data was evaluated by training the machine learning models with and without it and comparing the resulting **Mean Squared Error (MSE)** and **R²** scores.
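The comparison itself reduces to two metric calls from `sklearn.metrics` (a sketch; the prediction variables are hypothetical placeholders):

```python
from sklearn.metrics import mean_squared_error, r2_score

# preds_real_only / preds_with_fake are assumed to come from models
# trained without and with the 25% fake data, respectively.
mse_real = mean_squared_error(y_test, preds_real_only)
mse_fake = mean_squared_error(y_test, preds_with_fake)
r2_real = r2_score(y_test, preds_real_only)
r2_fake = r2_score(y_test, preds_with_fake)
print(f"MSE: {mse_real:.3f} vs {mse_fake:.3f}")
print(f"R²:  {r2_real:.3f} vs {r2_fake:.3f}")
```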
### 6. Data Splitting
- The dataset was split into **training** and **testing** sets using a **70:30** ratio for model evaluation.
- The split was done using the `train_test_split()` function from `sklearn.model_selection`.
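In code, the split looks like this (`Rating` as target follows section 2.3; the random seed is an assumption added for reproducibility):

```python
from sklearn.model_selection import train_test_split

X = data.drop(columns=["Rating"])  # predictor features
y = data["Rating"]                 # target variable

# 70% train / 30% test, with a fixed seed for reproducible splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```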