Update Data authored by Adham Beshr
@@ -2,6 +2,7 @@
title: Data
---
https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.png?ref_type=heads
## Data Chapter
@@ -9,6 +10,10 @@ title: Data
- The dataset used for this project is sourced from **Kaggle**.
- [Dataset](https://www.kaggle.com/datasets/PromptCloudHQ/imdb-data).
![Dataset sample 1](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_1.png?ref_type=heads)
![Dataset sample 2](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.png?ref_type=heads)
### 2. Data Preprocessing
#### 2.1. Handling Missing Values
- Missing values were imputed with **SimpleImputer** from `sklearn.impute`, which replaces each missing entry with the most frequent value in its column.
@@ -37,6 +42,25 @@ title: Data
- Outliers were detected using the **Interquartile Range (IQR)** method, where values outside the range `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` were considered outliers.
- Outliers were either removed or transformed based on their influence on the dataset.
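
A minimal sketch of most-frequent imputation, assuming a toy frame with hypothetical `Genre` and `Metascore` columns (the project's real column names and pipeline are not shown here). `strategy="most_frequent"` is the scikit-learn setting that matches the description above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical columns; not the project's actual dataset.
df = pd.DataFrame({
    "Genre": pd.Series(["Drama", np.nan, "Drama", "Comedy"], dtype=object),
    "Metascore": [76.0, 65.0, np.nan, 65.0],
})

# strategy="most_frequent" fills each gap with the mode of its column
# and works for both string and numeric columns.
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

Because the frame mixes strings and numbers, the imputed columns come back with `object` dtype; numeric columns may need to be cast back (for example with `astype(float)`) before modelling.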
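
The bullets above mention removal as one way of treating outliers. A possible sketch of that variant follows, using the same IQR bounds. The function name `remove_outliers` and the `Votes` toy column are illustrative assumptions, not taken from the project.

```python
import pandas as pd

def remove_outliers(df, column):
    """Return a copy of df without rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # between() is inclusive on both ends by default; rows with a missing
    # value in `column` evaluate to False and are dropped as well.
    mask = df[column].between(lower, upper)
    return df[mask].reset_index(drop=True)

# Hypothetical toy data: one extreme vote count among ordinary values.
votes = pd.DataFrame({"Votes": [10, 12, 11, 13, 12, 500]})
cleaned = remove_outliers(votes, "Votes")
print(cleaned)
```

Removal shrinks the dataset, so it is most defensible when the extreme values are likely errors; capping keeps every row but distorts the tail.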
```python
import numpy as np

def handle_outliers(df, column):
    # Calculate the first and third quartiles (Q1, Q3)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    # Compute the Interquartile Range (IQR)
    IQR = Q3 - Q1
    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Cap the outliers to the calculated bounds
    df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
    df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
    # Return the frame (modified in place) so calls can be chained
    return df
```
### 5. Data Augmentation
#### 5.1. Generating Fake Data
- A synthetic dataset was created by adding **25% fake data** using the **Faker** library to simulate movie attributes like genre, rating, and votes. The fake data was introduced to test model performance and robustness.