---
title: Data
---
## Data Chapter
### 1. Data Source
- The dataset used for this project is sourced from **Kaggle**.
- [Dataset](https://www.kaggle.com/datasets/PromptCloudHQ/imdb-data).
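
A minimal loading sketch is shown below. The CSV filename is an assumption based on the usual Kaggle download and may differ locally.

```python
import pandas as pd

# Load the IMDb dataset downloaded from Kaggle.
# "IMDB-Movie-Data.csv" is an assumed filename; adjust it to the local download.
data = pd.read_csv("IMDB-Movie-Data.csv")

print(data.shape)
print(data.columns.tolist())
```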
### 2. Data Preprocessing
#### 2.1. Handling Missing Values
- Missing values were handled with **SimpleImputer** from `sklearn.impute`, configured to replace each missing entry with the most frequent value of its column.
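
A minimal sketch of this imputation step, assuming the dataset is already loaded into a DataFrame named `data`:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Replace missing values with the most frequent value of each column.
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a NumPy array, so the result is wrapped back into a
# DataFrame; column dtypes may need to be restored afterwards.
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
```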
#### 2.2. Removing Duplicates
- Duplicate entries were removed using the `drop_duplicates()` method to ensure the dataset contains unique rows.
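
A one-line sketch of the deduplication step:

```python
# Drop exact duplicate rows and reset the index so it stays contiguous.
data = data.drop_duplicates().reset_index(drop=True)
```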
#### 2.3. Feature Selection
- The following features were selected for the analysis:
- **Year**
- **Runtime (Minutes)**
- **Metascore**
- **Votes**
- **Rating** (target variable)
#### 2.4. Removing Invalid Data
- Rows with zero votes were removed using `data[data["Votes"] > 0]`, since entries with no votes carry no reliable rating signal.
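
A minimal sketch of the feature selection and filtering steps from sections 2.3 and 2.4, assuming `data` is the cleaned DataFrame (note that the `Genre` column would also need to be retained here if the encoding step in the next section is applied):

```python
# Restrict the DataFrame to the selected features plus the target.
selected = ["Year", "Runtime (Minutes)", "Metascore", "Votes", "Rating"]
data = data[selected]

# Drop rows that received no votes.
data = data[data["Votes"] > 0]
```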
### 3. Data Transformation
- **Encoding Categorical Variables**:
- Categorical features, such as the `Genre` column, were encoded using one-hot encoding with `pd.get_dummies()`.
- The `drop_first=True` argument was used to avoid multicollinearity.
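
A minimal encoding sketch, assuming the `Genre` column is still present in `data`:

```python
import pandas as pd

# One-hot encode the Genre column; drop_first=True drops one indicator column
# per category to avoid perfect multicollinearity between the dummies.
data = pd.get_dummies(data, columns=["Genre"], drop_first=True)
```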
### 4. Handling Outliers
- **Outlier Detection**:
- Outliers were detected using the **Interquartile Range (IQR)** method, where values outside the range `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` were considered outliers.
- Outliers were either removed or transformed based on their influence on the dataset.
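
A minimal sketch of IQR-based outlier removal; the list of numeric columns is an assumption based on the features chosen in section 2, and removal is shown here even though the text notes that some outliers were transformed instead.

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column.
numeric_cols = ["Year", "Runtime (Minutes)", "Metascore", "Votes", "Rating"]

for col in numeric_cols:
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep only rows inside the IQR fences for this column.
    data = data[(data[col] >= lower) & (data[col] <= upper)]
```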
### 5. Data Augmentation
#### 5.1. Generating Fake Data
- The dataset was augmented with **25% fake rows** generated with the **Faker** library, simulating movie attributes such as genre, rating, and votes. These synthetic rows were introduced to test model performance and robustness.
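
A minimal sketch of how the 25% synthetic rows could be generated with Faker; the exact columns and value ranges are assumptions, since the source does not specify them.

```python
import pandas as pd
from faker import Faker

fake = Faker()
n_fake = int(len(data) * 0.25)  # 25% of the real rows

# Generate synthetic rows with plausible-looking movie attributes
# (ranges below are assumptions, not taken from the original pipeline).
fake_rows = pd.DataFrame({
    "Year": [fake.random_int(min=2006, max=2016) for _ in range(n_fake)],
    "Runtime (Minutes)": [fake.random_int(min=60, max=200) for _ in range(n_fake)],
    "Metascore": [fake.random_int(min=0, max=100) for _ in range(n_fake)],
    "Votes": [fake.random_int(min=1, max=2_000_000) for _ in range(n_fake)],
    "Rating": [round(fake.pyfloat(min_value=1, max_value=10), 1) for _ in range(n_fake)],
})

# Append the fake rows to the real data.
augmented = pd.concat([data, fake_rows], ignore_index=True)
```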
#### 5.2. Impact of Fake Data
- The impact of the fake data was evaluated by comparing machine-learning model performance on the original and the augmented datasets. The resulting drop or improvement was analyzed using **Mean Squared Error (MSE)** and **R²** scores.
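
A minimal evaluation sketch, assuming two already-fitted regressors `model_real` and `model_augmented` and a shared held-out test set `X_test`, `y_test` (all of these names are hypothetical):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Compare the model trained on real data only against the model
# trained on real + fake data, on the same held-out test set.
for name, model in [("real only", model_real), ("real + fake", model_augmented)]:
    preds = model.predict(X_test)
    print(name,
          "MSE:", mean_squared_error(y_test, preds),
          "R2:", r2_score(y_test, preds))
```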
### 6. Data Splitting
- The dataset was split into **training** and **testing** sets using a **70:30** ratio for model evaluation.
- The split was done using the `train_test_split()` function from `sklearn.model_selection`.
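
A minimal splitting sketch, assuming a feature matrix `X` and target vector `y` built from the columns selected in section 2; the `random_state` value is an assumption added for reproducibility.

```python
from sklearn.model_selection import train_test_split

# 70:30 train/test split of the prepared features and target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```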
---