Update Data authored by Adham Beshr
......@@ -2,12 +2,17 @@
title: Data
---
## Data Chapter
### 1. Data Source
- The dataset used for this project is sourced from **Kaggle**.
- [Dataset](https://www.kaggle.com/datasets/PromptCloudHQ/imdb-data).
![Dataset sample 1](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_1.png?ref_type=heads)
![Dataset sample 2](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.png?ref_type=heads)
### 2. Data Preprocessing
#### 2.1. Handling Missing Values
......@@ -36,6 +41,25 @@ title: Data
- **Outlier Detection**:
- Outliers were detected using the **Interquartile Range (IQR)** method, where values outside the range `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` were considered outliers.
- Outliers were either removed or transformed based on their influence on the dataset.
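
The detection and removal steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `iqr_outlier_mask` and the `runtime` column with its values are hypothetical, and pandas is assumed to be the data-handling library.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series) -> pd.Series:
    """Return a boolean mask that is True where a value is an IQR outlier."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return (series < lower) | (series > upper)

# Hypothetical sample data: one value lies far above the rest
sample = pd.DataFrame({"runtime": [90, 95, 100, 105, 110, 400]})
mask = iqr_outlier_mask(sample["runtime"])

# Removal variant: keep only the rows that are not flagged
cleaned = sample[~mask]
```

Dropping flagged rows is only one option; capping values to the bounds keeps the row count unchanged, which may matter for small datasets.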
```python
import numpy as np

def handle_outliers(df, column):
    # Calculate the first and third quartiles (Q1, Q3)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    # Compute the Interquartile Range (IQR)
    IQR = Q3 - Q1
    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Cap the outliers to the calculated bounds
    df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
    df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
    # Return the DataFrame (also modified in place) for convenience
    return df
```
### 5. Data Augmentation
#### 5.1. Generating Fake Data
......
......