---
title: Data
---
## Data Chapter
### 1. Data Source
- The movie dataset used in this project was obtained from Kaggle; a sample of the raw data is shown below.
![Dataset Kaggle](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Kaggle_Dataset_sample.png)
### 2. Data Preprocessing
#### 2.1. Handling Missing Values
- Missing values were handled using the **SimpleImputer** from `sklearn.impute`, which imputes missing values using the most frequent value in the column.
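As an illustration only (not the project's exact script), a minimal sketch of most-frequent imputation on a toy frame:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing Metascore value (illustrative data)
data = pd.DataFrame({
    "Year": [2012, 2014, 2016],
    "Metascore": [76.0, None, 76.0],
    "Votes": [1000, 2000, 1500],
})

# Fill each missing entry with the most frequent value of its column
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(imputed)
```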
#### 2.2. Removing Duplicates
- Duplicate entries were removed using the `drop_duplicates()` method to ensure the dataset contains unique rows.
#### 2.3. Feature Selection
- The following features were selected for the analysis:
- **Year**
- **Runtime (Minutes)**
- **Metascore**
- **Votes**
- **Rating** (target variable)
#### 2.4. Removing Invalid Data
- Rows with zero votes were removed using `data[data["Votes"] > 0]` to eliminate irrelevant entries.
### 3. Data Transformation
- **Encoding Categorical Variables**:
- Categorical features, such as the `Genre` column, were encoded using one-hot encoding with `pd.get_dummies()`.
- The `drop_first=True` argument was used to avoid multicollinearity.
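A minimal sketch of the encoding on a toy frame (illustrative values, not the project's script):

```python
import pandas as pd

# Toy frame with a categorical Genre column (illustrative data)
data = pd.DataFrame({"Genre": ["Action", "Drama", "Comedy", "Drama"],
                     "Rating": [7.1, 8.0, 6.5, 7.7]})

# One-hot encode Genre; drop_first=True drops one dummy column to avoid multicollinearity
encoded = pd.get_dummies(data, columns=["Genre"], drop_first=True)
print(encoded.columns.tolist())  # ['Rating', 'Genre_Comedy', 'Genre_Drama']
```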
### 4. Handling Outliers
- **Outlier Detection**:
- Outliers were detected using the **Interquartile Range (IQR)** method, where values outside the range `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` were considered outliers.
- Detected outliers were capped to these bounds rather than dropped, as implemented in the function below.
```python
import numpy as np

def handle_outliers(df, column):
    # Calculate the first and third quartiles (Q1, Q3)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    # Compute the Interquartile Range (IQR)
    IQR = Q3 - Q1
    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Cap the outliers to the calculated bounds
    df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
    df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
    return df
```
### 5. Data Augmentation
#### 5.1. Generating Fake Data
- A synthetic dataset was created by adding **25% fake data** using the **Faker** library to simulate movie attributes like genre, rating, and votes. The fake data was introduced to test model performance and robustness.
```python
import random

import pandas as pd
from faker import Faker

fake = Faker()

def generate_fake_data(real_data, fake_percentage=0.25):
    data = []
    genres = real_data['Genre'].dropna().unique()
    num_samples = int(len(real_data) * fake_percentage)
    for _ in range(num_samples):
        # Generate one synthetic movie record
        title = fake.bs().title()
        genre = random.choice(genres)
        description = fake.sentence(nb_words=12)
        director = fake.name()
        actors = fake.name() + ', ' + fake.name()
        year = random.randint(2000, 2023)
        runtime = random.randint(80, 180)
        rating = round(random.uniform(1, 10), 1)
        votes = random.randint(50000, 1000000)
        revenue = round(random.uniform(10, 500), 2)
        metascore = random.randint(0, 100)
        data.append([title, genre, description, director, actors, year, runtime,
                     rating, votes, revenue, metascore])
    columns = ['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
               'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', 'Metascore']
    fake_df = pd.DataFrame(data, columns=columns)
    return fake_df
```
#### 5.2. Impact of Fake Data
- The inclusion of fake data was carefully evaluated as it significantly influenced the model's accuracy. Adding more than 25% fake data caused the accuracy to drop drastically, making the model unreliable.
- To maintain balance and robustness, the fake data was limited to 25%. This ensured enough variety without heavily compromising model performance.
- The evaluation showed that **Mean Squared Error (MSE)** rose and **R²** fell once fake data exceeded this threshold, underlining the importance of keeping synthetic data to a minimum.
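For reference, these metrics come from `sklearn.metrics`; a hedged sketch with placeholder predictions (not the project's actual evaluation results):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Placeholder true ratings and model predictions, for illustration only
y_test = [7.1, 6.5, 8.0, 5.9]
y_pred = [6.8, 6.9, 7.6, 6.2]

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}, R²: {r2:.3f}")
```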
![Data preprocessing screenshot 1](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Data_pp_1.png?ref_type=heads)
![Data preprocessing screenshot 2](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Data_pp_2.png?ref_type=heads)
![Data preprocessing screenshot 3](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Data_pp3.png?ref_type=heads)
### 6. Data Splitting
- The dataset was split into **training** and **testing** sets using a **70:30** ratio for model evaluation.
- The split was done using the `train_test_split()` function from `sklearn.model_selection`.
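A minimal sketch of the split, assuming a preprocessed DataFrame named `data` with `Rating` as the target:

```python
from sklearn.model_selection import train_test_split

# Selected features (Section 2.3) and the Rating target
X = data[["Year", "Runtime (Minutes)", "Metascore", "Votes"]]
y = data["Rating"]

# 70:30 split; random_state is illustrative, for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```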
---
---
## Data Description
The dataset consists of movie information, including features such as movie rankings, titles, genres, descriptions, directors, actors, years, runtime, ratings, votes, revenue, and metascores.
### Feature Variables
- **Rank**: Integer, movie ranking based on various factors like rating, box office performance, etc.
- **Title**: String, the title of the movie.
- **Genre**: String, comma-separated list of genres (e.g., Action, Adventure, Sci-Fi).
- **Description**: String, a brief description of the movie's plot.
- **Director**: String, the director of the movie.
- **Actors**: String, the main actors involved in the movie.
- **Year**: Integer, the release year of the movie.
- **Runtime (Minutes)**: Integer, the length of the movie in minutes.
- **Rating**: Float, the IMDb rating of the movie (scale 1-10).
- **Votes**: Integer, the number of votes the movie received on IMDb.
- **Revenue (Millions)**: Float, the total revenue of the movie in millions.
- **Metascore**: Integer, the metascore rating given by critics (scale 0-100).
---
## Target Variable
- **Rating**: This is the target variable for regression, representing the movie's IMDb rating. The goal is to predict this value based on other features such as runtime, genre, and revenue.
---
## Data Transformations
- **Outliers**: The dataset contains potential outliers, such as unusually high revenues or ratings. These should be detected and handled, for example with the IQR method used above or with Z-scores.
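As an illustration of the Z-score alternative (the chapter above uses the IQR method), a hedged sketch with an assumed file path:

```python
import pandas as pd

# Load the movie data (path is a placeholder)
data = pd.read_csv("movies.csv")

# Keep rows whose revenue lies within 3 standard deviations of the mean
# (assumes missing revenues were already imputed or dropped)
col = "Revenue (Millions)"
z = (data[col] - data[col].mean()) / data[col].std()
data = data[z.abs() < 3]
```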
---
## Additional Fake Data
- **Fake Data Generation**: Synthetic records amounting to roughly 25% of the original data were generated with the Faker library, including fake titles, genres, and other features that simulate realistic movie data. When original and fake data are combined, the mix must be handled carefully during model training to avoid bias.
---
## Data Preprocessing Explanation
### Step 1: Loading and Exploring the Dataset
- **Load the dataset:**
- The dataset is loaded using `pd.read_csv()` from a specified file path. This operation loads the movie data into a DataFrame, which can then be manipulated and analyzed.
- **Display the first few rows:**
- `data.head()` is used to show a preview of the first few rows to understand the structure and content of the dataset.
- **Dataset Overview:**
- `data.info()` provides an overview of the dataset, including the number of entries and data types of the columns.
- **Check for missing values:**
- `data.isnull().sum()` is used to check if there are any missing values in the dataset, which helps identify potential issues that need to be addressed.
- **Check for duplicates:**
- `data.duplicated().sum()` is used to check for duplicate rows in the dataset. Duplicates can lead to skewed analysis and need to be removed.
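A condensed sketch of this step (the file name is a placeholder):

```python
import pandas as pd

# Load the movie dataset (path is a placeholder)
data = pd.read_csv("movies.csv")

print(data.head())              # preview the first rows
data.info()                     # column types and non-null counts
print(data.isnull().sum())      # missing values per column
print(data.duplicated().sum())  # number of duplicate rows
```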
### Step 2: Data Cleaning
- **Handle missing values:**
- In cases where missing values are detected, they can be removed or filled in. In this case, we use `data.dropna()` to remove rows with missing values.
- **Drop duplicates:**
- `data.drop_duplicates()` is used to remove any duplicate rows found in the dataset.
- **Outlier Detection and Handling:**
- The Interquartile Range (IQR) method is used to detect and handle outliers by adjusting values that fall outside the calculated lower and upper bounds. This ensures that extreme values do not negatively impact the model’s performance.
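A sketch of the cleaning pass, reusing the `handle_outliers` helper from the chapter above (the column list is an assumption):

```python
# Drop rows with missing values and remove duplicate rows
data = data.dropna()
data = data.drop_duplicates()

# Cap extreme values in the numerical columns (assumed column list)
for col in ["Runtime (Minutes)", "Rating", "Votes", "Revenue (Millions)", "Metascore"]:
    data = handle_outliers(data, col)
```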
### Step 3: Feature Engineering
- **Normalization of Numerical Features:**
- The `MinMaxScaler` is used to normalize the numerical features, ensuring that all values are scaled between 0 and 1. This is particularly important for algorithms like SVM that are sensitive to feature scaling.
- **One-Hot Encoding for Categorical Features:**
- For categorical features such as `Genre` and `Director`, one-hot encoding is applied. This creates binary columns for each category, allowing them to be used in machine learning models.
- **Handling the `Runtime (Minutes)` Feature:**
- The `Runtime (Minutes)` feature is normalized using `MinMaxScaler` to scale it within a range that is compatible with other features.
- **Log Transformation of `Revenue (Millions)` Feature:**
- A log transformation (`np.log1p()`) is applied to the `Revenue (Millions)` feature to reduce skewness and make the data more suitable for machine learning models.
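A sketch combining these transformations, assuming a cleaned DataFrame `data` (column names follow the dataset description above; the project's exact pipeline may differ):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Log-transform revenue to reduce skewness
data["Revenue (Millions)"] = np.log1p(data["Revenue (Millions)"])

# Scale numerical features to the [0, 1] range
scaler = MinMaxScaler()
num_cols = ["Year", "Runtime (Minutes)", "Votes", "Revenue (Millions)", "Metascore"]
data[num_cols] = scaler.fit_transform(data[num_cols])

# One-hot encode categorical features
data = pd.get_dummies(data, columns=["Genre", "Director"], drop_first=True)
```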
### Step 4: Fake Data Generation and Combination
- **Generate Fake Data:**
- Fake data is generated using the `Faker` library to simulate additional movie data. This helps in testing models when more data is needed.
- **Combine Original and Fake Data:**
- The original dataset and the generated fake data are concatenated to create a larger dataset, which may help improve the model's performance by providing more variety.
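A minimal sketch of the combination step, assuming `generate_fake_data` from Section 5.1 and the original DataFrame `data`:

```python
import pandas as pd

# Generate ~25% synthetic rows and append them to the original data
fake_df = generate_fake_data(data, fake_percentage=0.25)
combined_data = pd.concat([data, fake_df], ignore_index=True)
```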
### Step 5: Preprocessing on Combined Data
- **Data Cleaning on Combined Data:**
- Similar to the original dataset, missing values and duplicates are removed from the combined data.
- **Outlier Handling:**
- The outlier handling function is applied to the numerical columns of the combined dataset to ensure that extreme values are adjusted appropriately.
- **Feature Scaling and Encoding on Combined Data:**
- After cleaning, the combined data undergoes feature scaling and one-hot encoding for categorical features like `Genre` and `Director`.