---
title: Data
---
## Data Chapter
### 1. Data Source
- The dataset used for this project is sourced from **Kaggle**.
- [Dataset](https://www.kaggle.com/datasets/PromptCloudHQ/imdb-data).
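
A minimal loading sketch is shown below. The CSV filename is an assumption based on the usual Kaggle download and may differ locally.

```python
import pandas as pd

# Load the IMDb dataset downloaded from Kaggle.
# "IMDB-Movie-Data.csv" is an assumed filename; adjust it to the local download.
data = pd.read_csv("IMDB-Movie-Data.csv")

print(data.shape)
print(data.columns.tolist())
```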
### 2. Data Preprocessing
#### 2.1. Handling Missing Values
- Missing values were handled with **SimpleImputer** from `sklearn.impute`, configured to replace each missing entry with the most frequent value of its column.
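
A minimal sketch of this imputation step, assuming the dataset is already loaded into a DataFrame named `data`:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Replace missing values with the most frequent value of each column.
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a NumPy array, so the result is wrapped back into a
# DataFrame; column dtypes may need to be restored afterwards.
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
```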
#### 2.2. Removing Duplicates
- Duplicate entries were removed using the `drop_duplicates()` method to ensure the dataset contains unique rows.
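
A one-line sketch of the deduplication step:

```python
# Drop exact duplicate rows and reset the index so it stays contiguous.
data = data.drop_duplicates().reset_index(drop=True)
```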
#### 2.3. Feature Selection
- The following features were selected for the analysis:
- **Year**
- **Runtime (Minutes)**
- **Metascore**
- **Votes**
- **Rating** (target variable)
#### 2.4. Removing Invalid Data
- Rows with zero votes were removed using `data[data["Votes"] > 0]`, since entries with no votes carry no reliable rating signal.
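
A minimal sketch of the feature selection and filtering steps from sections 2.3 and 2.4, assuming `data` is the cleaned DataFrame (note that the `Genre` column would also need to be retained here if the encoding step in the next section is applied):

```python
# Restrict the DataFrame to the selected features plus the target.
selected = ["Year", "Runtime (Minutes)", "Metascore", "Votes", "Rating"]
data = data[selected]

# Drop rows that received no votes.
data = data[data["Votes"] > 0]
```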
### 3. Data Transformation
- **Encoding Categorical Variables**:
- Categorical features, such as the `Genre` column, were encoded using one-hot encoding with `pd.get_dummies()`.
- The `drop_first=True` argument was used to avoid multicollinearity.
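
A minimal encoding sketch, assuming the `Genre` column is still present in `data`:

```python
import pandas as pd

# One-hot encode the Genre column; drop_first=True drops one indicator column
# per category to avoid perfect multicollinearity between the dummies.
data = pd.get_dummies(data, columns=["Genre"], drop_first=True)
```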
### 4. Handling Outliers
- **Outlier Detection**:
- Outliers were detected using the **Interquartile Range (IQR)** method, where values outside the range `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` were considered outliers.
- Outliers were either removed or transformed based on their influence on the dataset.
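
A minimal sketch of IQR-based outlier removal; the list of numeric columns is an assumption based on the features chosen in section 2, and removal is shown here even though the text notes that some outliers were transformed instead.

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column.
numeric_cols = ["Year", "Runtime (Minutes)", "Metascore", "Votes", "Rating"]

for col in numeric_cols:
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep only rows inside the IQR fences for this column.
    data = data[(data[col] >= lower) & (data[col] <= upper)]
```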
### 5. Data Augmentation
#### 5.1. Generating Fake Data
- The dataset was augmented with **25% fake rows** generated with the **Faker** library, simulating movie attributes such as genre, rating, and votes. These synthetic rows were introduced to test model performance and robustness.
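
A minimal sketch of how the 25% synthetic rows could be generated with Faker; the exact columns and value ranges are assumptions, since the source does not specify them.

```python
import pandas as pd
from faker import Faker

fake = Faker()
n_fake = int(len(data) * 0.25)  # 25% of the real rows

# Generate synthetic rows with plausible-looking movie attributes
# (ranges below are assumptions, not taken from the original pipeline).
fake_rows = pd.DataFrame({
    "Year": [fake.random_int(min=2006, max=2016) for _ in range(n_fake)],
    "Runtime (Minutes)": [fake.random_int(min=60, max=200) for _ in range(n_fake)],
    "Metascore": [fake.random_int(min=0, max=100) for _ in range(n_fake)],
    "Votes": [fake.random_int(min=1, max=2_000_000) for _ in range(n_fake)],
    "Rating": [round(fake.pyfloat(min_value=1, max_value=10), 1) for _ in range(n_fake)],
})

# Append the fake rows to the real data.
augmented = pd.concat([data, fake_rows], ignore_index=True)
```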
#### 5.2. Impact of Fake Data
- The impact of the fake data was evaluated by comparing machine-learning model performance on the original and the augmented datasets. The resulting drop or improvement was analyzed using **Mean Squared Error (MSE)** and **R²** scores.
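
A minimal evaluation sketch, assuming two already-fitted regressors `model_real` and `model_augmented` and a shared held-out test set `X_test`, `y_test` (all of these names are hypothetical):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Compare the model trained on real data only against the model
# trained on real + fake data, on the same held-out test set.
for name, model in [("real only", model_real), ("real + fake", model_augmented)]:
    preds = model.predict(X_test)
    print(name,
          "MSE:", mean_squared_error(y_test, preds),
          "R2:", r2_score(y_test, preds))
```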
### 6. Data Splitting
- The dataset was split into **training** and **testing** sets using a **70:30** ratio for model evaluation.
- The split was done using the `train_test_split()` function from `sklearn.model_selection`.
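
A minimal splitting sketch, assuming a feature matrix `X` and target vector `y` built from the columns selected in section 2; the `random_state` value is an assumption added for reproducibility.

```python
from sklearn.model_selection import train_test_split

# 70:30 train/test split of the prepared features and target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```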
---