---
title: Data
---

## Data Chapter

### 1. Data Source

![Dataset Kaggle](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Kaggle_Dataset_sample.png)

### 2. Data Description

The dataset consists of movie information, including features such as movie rankings, titles, genres, descriptions, directors, actors, release years, runtimes, ratings, votes, revenue, and metascores.

#### 2.1. Feature Variables

- **Rank**: Integer, the movie's ranking, based on factors such as rating and box-office performance.
- **Title**: String, the title of the movie.
- **Genre**: String, a comma-separated list of genres (e.g., Action, Adventure, Sci-Fi).
- **Description**: String, a brief description of the movie's plot.
- **Director**: String, the director of the movie.
- **Actors**: String, the main actors involved in the movie.
- **Year**: Integer, the release year of the movie.
- **Runtime (Minutes)**: Integer, the length of the movie in minutes.
- **Rating**: Float, the IMDb rating of the movie (scale 1-10).
- **Votes**: Integer, the number of votes the movie received on IMDb.
- **Revenue (Millions)**: Float, the total revenue of the movie in millions.
- **Metascore**: Integer, the Metascore given by critics (scale 0-100).

#### 2.2. Target Variable

- **Rating** is the target variable for regression, representing the movie's IMDb rating. The goal is to predict this value from the other features, such as runtime, genre, and revenue.

### 3. Data Preprocessing

#### 3.1. Handling Missing Values

- Missing values were handled using the **SimpleImputer** from `sklearn.impute`, which imputes missing values using the most frequent value in the column.
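
A minimal sketch of this imputation step; the CSV file name is a placeholder and the imputed column list is an assumption, not the project's actual code:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.read_csv("IMDB-Movie-Data.csv")  # hypothetical file name

# Fill missing values with the most frequent value of each column
imputer = SimpleImputer(strategy="most_frequent")
cols = ["Year", "Runtime (Minutes)", "Metascore", "Votes", "Rating"]
data[cols] = imputer.fit_transform(data[cols])
```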

#### 3.2. Removing Duplicates

- Duplicate entries were removed using the `drop_duplicates()` method to ensure the dataset contains unique rows.

#### 3.3. Feature Selection

- The following features were selected for the analysis:
  - **Year**
  - **Runtime (Minutes)**
  - **Metascore**
  - **Votes**
  - **Rating** (target variable)

#### 3.4. Removing Invalid Data

- Rows with zero votes were removed using `data[data["Votes"] > 0]` to eliminate irrelevant entries.
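
A short sketch of these cleaning and selection steps, continuing from the imputation snippet above:

```python
# Remove duplicate rows so that only unique entries remain
data = data.drop_duplicates()

# Keep the selected features; 'Rating' is the regression target
selected_features = ["Year", "Runtime (Minutes)", "Metascore", "Votes", "Rating"]
data = data[selected_features]

# Remove invalid entries with zero votes
data = data[data["Votes"] > 0]
```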

### 4. Data Transformation

- **Encoding categorical variables**:
  - Categorical features, such as the `Genre` column, were encoded using one-hot encoding with `pd.get_dummies()`.
  - The `drop_first=True` argument was used to avoid multicollinearity.
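
A minimal sketch of this encoding step, assuming it is applied before the numeric feature selection above (so that `Genre` is still present); the file name is a placeholder:

```python
import pandas as pd

data = pd.read_csv("IMDB-Movie-Data.csv")  # hypothetical file name

# One-hot encode the Genre column; drop_first=True avoids multicollinearity
data = pd.get_dummies(data, columns=["Genre"], drop_first=True)
```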

### 5. Handling Outliers

- **Outlier detection**:
  - The dataset contains some potential outliers, such as unusually high revenues or ratings. Outliers were detected using the **Interquartile Range (IQR)** method, where values outside the range `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` were considered outliers; a Z-score threshold is a possible alternative.
  - Outliers were either removed or capped to these bounds, depending on their influence on the dataset, so that extreme values do not negatively impact the model's performance.

```python
import numpy as np


def handle_outliers(df, column):
    # Calculate the first and third quartiles (Q1, Q3)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)

    # Compute the Interquartile Range (IQR)
    IQR = Q3 - Q1

    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Cap the outliers to the calculated bounds
    df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
    df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
    return df
```
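
As a usage sketch, the function could be applied to the numerical columns like this; the column list follows the feature description above and is an assumption about the actual code:

```python
numeric_cols = ["Year", "Runtime (Minutes)", "Rating", "Votes", "Revenue (Millions)", "Metascore"]
for col in numeric_cols:
    data = handle_outliers(data, col)  # 'data' is the cleaned DataFrame from the steps above
```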

### 6. Data Augmentation

#### 6.1. Generating Fake Data

- A synthetic share of **25% fake data** was added using the **Faker** library to simulate movie attributes such as title, genre, rating, and votes. The fake data was introduced to test model performance and robustness.
- Because the resulting dataset mixes original and generated records, the fake portion has to be treated carefully during model training to avoid bias.

```python
import random

import pandas as pd
from faker import Faker

fake = Faker()


def generate_fake_data(real_data, fake_percentage=0.25):
    data = []
    genres = real_data['Genre'].dropna().unique()
    num_samples = int(len(real_data) * fake_percentage)

    for _ in range(num_samples):
        title = fake.bs().title()
        genre = random.choice(genres)
        description = fake.sentence(nb_words=12)
        director = fake.name()
        actors = fake.name() + ', ' + fake.name()
        year = random.randint(2000, 2023)
        runtime = random.randint(80, 180)
        rating = round(random.uniform(1, 10), 1)
        votes = random.randint(50000, 1000000)
        revenue = round(random.uniform(10, 500), 2)
        metascore = random.randint(0, 100)
        data.append([title, genre, description, director, actors, year, runtime,
                     rating, votes, revenue, metascore])

    columns = ['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
               'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', 'Metascore']
    fake_df = pd.DataFrame(data, columns=columns)
    return fake_df
```
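
For example, the helper can be called on the original DataFrame; a sketch, assuming `data` is the loaded dataset:

```python
fake_df = generate_fake_data(data, fake_percentage=0.25)
print(f"Original rows: {len(data)}, fake rows generated: {len(fake_df)}")
```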

#### 6.2. Impact of Fake Data

- The inclusion of fake data was evaluated carefully, as it significantly influenced the model's accuracy: adding more than 25% fake data caused the accuracy to drop drastically, making the model unreliable.
- To maintain a balance between variety and robustness, the fake data was therefore limited to 25%.
- The evaluation showed a decline in performance metrics such as **Mean Squared Error (MSE)** and **R²** when the fake share exceeded this threshold, emphasizing the importance of keeping synthetic data to a minimum.

![Preprocessing output 1](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Data_pp_1.png?ref_type=heads)
![Preprocessing output 2](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Data_pp_2.png?ref_type=heads)
![Preprocessing output 3](https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Data_pp3.png?ref_type=heads)

### 7. Data Splitting

- The dataset was split into **training** and **testing** sets using a **70:30** ratio for model evaluation.
- The split was done using the `train_test_split()` function from `sklearn.model_selection`.
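
A minimal sketch of this split, assuming the preprocessed DataFrame is called `data` and `Rating` is the target column:

```python
from sklearn.model_selection import train_test_split

X = data.drop(columns=["Rating"])
y = data["Rating"]

# 70:30 split; the random_state is an arbitrary choice for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```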

### 8. Data Preprocessing Explanation

The following steps explain the preprocessing pipeline in more detail.

#### Step 1: Loading and Exploring the Dataset

- **Load the dataset:**
  - The dataset is loaded using `pd.read_csv()` from a specified file path. This loads the movie data into a DataFrame, which can then be manipulated and analyzed.
- **Display the first few rows:**
  - `data.head()` shows a preview of the first few rows to understand the structure and content of the dataset.
- **Dataset overview:**
  - `data.info()` provides an overview of the dataset, including the number of entries and the data types of the columns.
- **Check for missing values:**
  - `data.isnull().sum()` checks whether there are any missing values in the dataset and helps identify issues that need to be addressed.
- **Check for duplicates:**
  - `data.duplicated().sum()` counts duplicate rows in the dataset. Duplicates can skew the analysis and need to be removed.

#### Step 2: Data Cleaning

- **Handle missing values:**
  - Where missing values are detected, they can either be removed or filled in. In this case, rows with missing values are removed with `data.dropna()`.
- **Drop duplicates:**
  - `data.drop_duplicates()` removes any duplicate rows found in the dataset.
- **Outlier detection and handling:**
  - The Interquartile Range (IQR) method is used to detect and handle outliers by capping values that fall outside the calculated lower and upper bounds, so that extreme values do not negatively impact the model's performance.
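
A compact sketch of Steps 1 and 2; the CSV file name is a placeholder, not the project's actual path:

```python
import pandas as pd

# Step 1: load and explore (hypothetical file name)
data = pd.read_csv("IMDB-Movie-Data.csv")
print(data.head())
data.info()
print(data.isnull().sum())
print(data.duplicated().sum())

# Step 2: basic cleaning
data = data.dropna()
data = data.drop_duplicates()
```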

#### Step 3: Feature Engineering

- **Normalization of numerical features:**
  - `MinMaxScaler` is used to normalize the numerical features so that all values are scaled between 0 and 1. This is particularly important for algorithms such as SVM that are sensitive to feature scaling.
- **One-hot encoding of categorical features:**
  - Categorical features such as `Genre` and `Director` are one-hot encoded. This creates a binary column for each category, allowing the features to be used in machine learning models.
- **Handling the `Runtime (Minutes)` feature:**
  - The `Runtime (Minutes)` feature is normalized with `MinMaxScaler` to bring it onto a scale compatible with the other features.
- **Log transformation of `Revenue (Millions)`:**
  - A log transformation (`np.log1p()`) is applied to `Revenue (Millions)` to reduce skewness and make the feature more suitable for the models.
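
A sketch of these feature-engineering steps; the exact column lists are assumptions based on the feature description above, and `data` is the cleaned DataFrame from the previous sketch:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Log-transform revenue to reduce skewness
data["Revenue (Millions)"] = np.log1p(data["Revenue (Millions)"])

# Scale the numerical features to the range [0, 1]
numeric_cols = ["Year", "Runtime (Minutes)", "Votes", "Revenue (Millions)", "Metascore"]
data[numeric_cols] = MinMaxScaler().fit_transform(data[numeric_cols])

# One-hot encode the categorical features
data = pd.get_dummies(data, columns=["Genre", "Director"])
```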

#### Step 4: Fake Data Generation and Combination

- **Generate fake data:**
  - Fake data is generated using the `Faker` library (see the `generate_fake_data()` function above) to simulate additional movie records. This helps in testing the models when more data is needed.
- **Combine original and fake data:**
  - The original dataset and the generated fake data are concatenated to create a larger dataset, which may help improve model performance by providing more variety.
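
A minimal sketch of the combination step, assuming `fake_df` was produced by the `generate_fake_data()` helper shown earlier and `data` is the original, not-yet-encoded DataFrame:

```python
import pandas as pd

# Concatenate the original and the generated fake records into one larger dataset
combined_data = pd.concat([data, fake_df], ignore_index=True)
```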

#### Step 5: Preprocessing on Combined Data

- **Data cleaning on combined data:**
  - As with the original dataset, missing values and duplicates are removed from the combined data.
- **Outlier handling:**
  - The outlier-handling function is applied to the numerical columns of the combined dataset to ensure that extreme values are adjusted appropriately.
- **Feature scaling and encoding on combined data:**
  - After cleaning, the combined data undergoes feature scaling and one-hot encoding of categorical features such as `Genre` and `Director`.
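
A sketch of how the combined dataset could be pushed through the same preprocessing; it reuses the `handle_outliers()` helper from above, and the column lists are assumptions:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Cleaning on the combined data
combined_data = combined_data.dropna().drop_duplicates()

# Outlier handling on the numerical columns
numeric_cols = ["Year", "Runtime (Minutes)", "Rating", "Votes", "Revenue (Millions)", "Metascore"]
for col in numeric_cols:
    combined_data = handle_outliers(combined_data, col)

# Feature scaling and one-hot encoding of the categorical features
combined_data[numeric_cols] = MinMaxScaler().fit_transform(combined_data[numeric_cols])
combined_data = pd.get_dummies(combined_data, columns=["Genre", "Director"])
```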