The project uses the Netflix Titles dataset sourced from Kaggle. It contains information about movies and TV shows available on Netflix, including attributes such as title, type, director, release year, genre, and duration. The dataset was preprocessed and transformed to prepare it for analysis, predictions, and model training. To enhance the dataset, 20% additional fake data was generated.
---
## Feature Variables
### **1. show_id:** (object)
A unique identifier for each Netflix title.
### **2. type:** (object)
Indicates whether the title is a "Movie" or a "TV Show".
### **3. title:** (object)
The name of the title.
### **4. director:** (object)
The director of the title. Missing values were filled with "Unknown".
The director of the title. Missing values were removed during preprocessing to ensure data consistency.
cast: object
The cast of the title. Missing values were filled with "Unknown".
### **5. duration:** (object)
The duration of the title:
- Movies: Represented in minutes (e.g., "90 min").
- TV Shows: Represented by the number of seasons (e.g., "2 Seasons").
This column was cleaned and converted into a numeric format for analysis.
country: object
The country where the title was produced. Missing values were filled with "Unknown".
### **6. release_year:** (int64)
The year the title was released. This column was scaled to standardize its values for training models.
date_added: datetime64
The date the title was added to Netflix. Missing values were replaced with a default date (2020-01-01).
### **7. listed_in:** (object)
Categories or genres of the title (e.g., "Drama, Comedy").
- The first listed genre was extracted as the **primary_genre** for model training.
- Rare genres were grouped into an "Other" category.
release_year: int64
The year the title was released. Ranges from 1920 to 2023. Missing values were replaced with the median year.
### **8. genre_encoded:** (int64)
A numerical encoding of the primary genre using the LabelEncoder for machine learning compatibility.
rating: object
The rating of the title (e.g., "PG", "R"). Missing values were filled with "Not Rated".
### **9. director_count:** (int64)
The frequency of each director in the dataset, used as a proxy for popularity.
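A per-row director frequency like the one described above could be computed roughly as follows. This is a minimal sketch on a toy frame, assuming a pandas DataFrame with a `director` column; the sample values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data for illustration only; the real dataset has a `director` column.
df = pd.DataFrame({"director": ["A", "B", "A", "C", "A", "B"]})

# Count how often each director appears, then map the count back onto every row.
director_frequency = df["director"].value_counts()
df["director_count"] = df["director"].map(director_frequency)

print(df)
```

Mapping `value_counts()` back onto the column keeps one count per row, so the feature can be passed to a model alongside the other columns.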
duration: object
Describes the duration:
- Movies: Duration in minutes (e.g., "90 min").
- TV Shows: Number of seasons (e.g., "2 Seasons").
### **10. duration_scaled:** (float64)
The scaled version of the duration column, standardized using the StandardScaler.
listed_in: object
Categories or genres of the title (e.g., "Drama, Comedy"). These were split and cleaned for analysis.
### **11. release_year_scaled:** (float64)
The scaled version of the release year column, standardized using the StandardScaler.
is_fake: int64
An indicator variable where:
- 0: Original data.
- 1: Augmented (fake) data.
---
## Target Variable
release_year: int64
Year of release, used as the target variable for regression.
### **primary_genre:** (int64)
The primary genre of each title, encoded numerically, is used as the target variable for classification.

---
## Transformations
Duration Column:
Converted to numeric values:
- Movies: Integer duration in minutes (e.g., "90 min" → 90).
- TV Shows: Integer number of seasons (e.g., "2 Seasons" → 2).
Genres:
Split and cleaned from the listed_in column to create individual genre categories.
### **1. Duration Column:**
- Converted from strings (e.g., "90 min") to integers (e.g., 90).
- Non-recognizable values (e.g., "Seasons") were replaced with 0 or excluded.
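The duration conversion described above can be sketched as a small parser. This is an illustration under stated assumptions, not the project's actual code: the helper name `parse_duration` is hypothetical, and it takes the "replaced with 0" option for values that are not in the "NN min" form (the text also mentions excluding such rows instead).

```python
import re


def parse_duration(value):
    """Return the minutes from an 'NN min' string, or 0 for anything else.

    Hypothetical helper: values such as "2 Seasons" or missing entries
    are mapped to 0 here; dropping those rows is the other option.
    """
    if not isinstance(value, str):
        return 0
    match = re.fullmatch(r"\s*(\d+)\s*min\s*", value)
    return int(match.group(1)) if match else 0


# Invented sample values in the formats the column is described as using.
durations = ["90 min", "2 Seasons", None, "125 min"]
parsed = [parse_duration(d) for d in durations]
print(parsed)
```

With a pandas DataFrame, the same function could be applied with `df["duration"].apply(parse_duration)`.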
Date Added:
Converted to datetime format, with missing values replaced by a default date.
### **2. Genres:**
- Extracted and cleaned from the `listed_in` column.
- Rare genres were grouped under the "Other" category for better modeling performance.
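The genre steps above (take the first listed genre, then fold rare genres into "Other") might look like the sketch below. The rarity threshold `MIN_COUNT` is an assumption, since the text does not say how "rare" was defined, and the sample rows are invented.

```python
import pandas as pd

# Invented sample rows in the "Genre A, Genre B" format of `listed_in`.
df = pd.DataFrame({
    "listed_in": [
        "Dramas, Comedies",
        "Dramas",
        "Comedies, Dramas",
        "Dramas, Thrillers",
        "Anime Series",
        "Comedies",
    ]
})

# Keep only the first listed genre and trim surrounding whitespace.
df["primary_genre"] = df["listed_in"].str.split(",").str[0].str.strip()

# Assumed threshold: genres seen fewer than MIN_COUNT times count as rare.
MIN_COUNT = 2
counts = df["primary_genre"].value_counts()
rare_genres = counts[counts < MIN_COUNT].index

# Replace rare genres with the catch-all "Other" label.
df["primary_genre"] = df["primary_genre"].where(
    ~df["primary_genre"].isin(rare_genres), "Other"
)

print(df["primary_genre"].tolist())
```

Grouping rare classes this way reduces the number of target labels with very few training examples.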
Numerical Encoding:
Categorical variables (e.g., type, rating) were encoded into numerical values for model training.
### **3. Numerical Encoding:**
- Categorical variables such as `primary_genre` and `director` were encoded numerically using LabelEncoder.
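As a minimal sketch of the encoding step, scikit-learn's `LabelEncoder` assigns each distinct label an integer index based on the sorted order of the labels. The genre values below are invented for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample labels; in the project these would come from `primary_genre`.
genres = ["Dramas", "Comedies", "Other", "Dramas"]

encoder = LabelEncoder()
genre_encoded = encoder.fit_transform(genres)

# `classes_` holds the sorted labels; a label's position is its integer code.
print(list(encoder.classes_))
print(list(genre_encoded))

# The fitted encoder can map codes back to the original labels.
print(list(encoder.inverse_transform(genre_encoded)))
```

Keeping the fitted encoder around is what allows predicted class indices to be translated back into genre names later.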
### **4. Scaling:**
- **duration_scaled** and **release_year_scaled** were standardized using the StandardScaler to improve model performance.
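The scaling step can be sketched as follows. `StandardScaler` subtracts each column's mean and divides by its standard deviation, so the scaled columns have mean 0 and unit variance. The sample values are invented, and fitting both columns with one scaler is an assumption about how the step was organized.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented sample values for the two numeric columns.
df = pd.DataFrame({
    "duration": [90, 120, 60, 150],
    "release_year": [2015, 2019, 2008, 2021],
})

# Fit on both columns at once; each column is standardized independently.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["duration", "release_year"]])

df["duration_scaled"] = scaled[:, 0]
df["release_year_scaled"] = scaled[:, 1]

print(df)
```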
---
## Outliers
Outliers were identified in numerical columns (e.g., duration_numeric and release_year). Further analysis to handle outliers used:
- quantile() from the pandas library.
- zscore() from the scipy library.
See this Stack Overflow thread for additional details.
Outliers in numerical columns, such as duration and release year, were identified and analyzed using:
- **`quantile()`** to determine extreme values.
- **Z-Score Analysis** to evaluate deviations from the mean.
These outliers were handled by scaling or filtering to ensure data consistency.
---
## Data Augmentation
No additional synthetic data was generated in this case. Strategies such as oversampling and class balancing could be considered in future iterations to improve model robustness and genre diversity.
---
# Additional fake data
To enhance the dataset, an additional 20% of fake data was generated using random functions. This fake data replicates the structure of the original dataset while introducing variability.
## Summary
This data preprocessing pipeline ensures the Netflix dataset is clean, consistent, and ready for predictive modeling. Key transformations such as genre encoding, duration scaling, and outlier handling have been applied to optimize the dataset for logistic regression-based genre prediction.