Changes

Adham Beshr · 54de9e11
--- a/Data.md
+++ b/Data.md
@@ -2,6 +2,7 @@
 title: Data
 ---
+https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.png?ref_type=heads
 ## Data Chapter
@@ -9,6 +10,10 @@ title: Data
 - The dataset used for this project is sourced from **Kaggle**.
 - [Dataset](https://www.kaggle.com/datasets/PromptCloudHQ/imdb-data). 
+![Dataset sample 1](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_1.png?ref_type=heads) 
+![Dataset sample 2](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.png?ref_type=heads) 
 ### 2. Data Preprocessing
 #### 2.1. Handling Missing Values
 - Missing values were handled using **SimpleImputer** from `sklearn.impute` with the `most_frequent` strategy, which replaces each missing value with the most frequent value in its column.
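The imputation step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy DataFrame and its `Genre` and `Rating` columns are assumptions made for the example, and `strategy="most_frequent"` is inferred from the description.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the real project loads the Kaggle IMDB dataset.
df = pd.DataFrame({
    "Genre": ["Drama", np.nan, "Drama", "Comedy"],
    "Rating": [7.0, 8.0, np.nan, 8.0],
})

# "most_frequent" fills each gap with the mode of its column and,
# unlike the default "mean" strategy, also works on string columns.
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Because the input mixes text and numeric columns, `fit_transform` returns an object array, so numeric columns may need to be cast back (for example with `astype(float)`) before modelling.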
@@ -37,6 +42,25 @@ title: Data
  - Outliers were detected using the **Interquartile Range (IQR)** method, where values outside the range `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` were considered outliers.
  - Outliers were either removed or transformed based on their influence on the dataset.
+  ```python
+  import numpy as np
+
+  def handle_outliers(df, column):
+      # Calculate the first and third quartiles (Q1, Q3)
+      Q1 = df[column].quantile(0.25)
+      Q3 = df[column].quantile(0.75)
+      # Compute the Interquartile Range (IQR)
+      IQR = Q3 - Q1
+      # Define the lower and upper bounds for outliers
+      lower_bound = Q1 - 1.5 * IQR
+      upper_bound = Q3 + 1.5 * IQR
+      # Cap the outliers to the calculated bounds (modifies df in place)
+      df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
+      df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
+      return df
+  ```
 ### 5. Data Augmentation
 #### 5.1. Generating Fake Data
 - A synthetic dataset was created by adding **25% fake data** using the **Faker** library to simulate movie attributes like genre, rating, and votes. The fake data was introduced to test model performance and robustness.