Data Description
The project uses the Netflix Titles dataset from Kaggle, which describes movies and TV shows available on Netflix through attributes such as title, director, release year, genre, and duration. The dataset was preprocessed, and additional transformations were applied to prepare it for model training.
Feature Variables
1. show_id: (object)
A unique identifier for each Netflix title.
2. type: (object)
Indicates whether the title is a "Movie" or a "TV Show."
3. title: (object)
The name of the title.
4. director: (object)
The director of the title. Missing values were removed during preprocessing to ensure data consistency.
5. duration: (object)
The duration of the title:
- Movies: Represented in minutes (e.g., "90 min").
- TV Shows: Represented by the number of seasons (e.g., "2 Seasons").
This column was cleaned and converted into a numeric format for analysis.
6. release_year: (int64)
The year the title was released. This column was scaled to standardize its values for training models.
7. listed_in: (object)
Categories or genres of the title (e.g., "Drama, Comedy").
- The first listed genre was extracted as the primary_genre for model training.
- Rare genres were grouped into an "Other" category.
8. genre_encoded: (int64)
A numerical encoding of the primary genre using the LabelEncoder for machine learning compatibility.
9. director_count: (int64)
The frequency of each director in the dataset, used as a proxy for popularity.
10. duration_scaled: (float64)
The scaled version of the duration column, standardized using the StandardScaler.
11. release_year_scaled: (float64)
The scaled version of the release year column, standardized using the StandardScaler.
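
As an illustration of the director_count feature described above, the following is a minimal sketch, assuming the data is loaded into a pandas DataFrame with a director column. The toy rows and director names are made up and are not taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for the Netflix Titles data (values are made up).
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Jane Doe", "John Roe", "Jane Doe", "Jane Doe"],
})

# Count how often each director appears, then map that count back onto every row.
df["director_count"] = df["director"].map(df["director"].value_counts())

print(df[["director", "director_count"]])
```

A frequency count like this only reflects how many titles a director has in the dataset, which is why it is described as a proxy for popularity rather than a direct measure.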
Target Variable
primary_genre: (int64)
The primary genre of each title, encoded numerically, is used as the target variable for classification.
Transformations
1. Duration Column:
- Converted from strings (e.g., "90 min") to integers (e.g., 90).
- Values that could not be parsed as minutes (e.g., those containing "Seasons") were replaced with 0 or excluded.
2. Genres:
- Extracted and cleaned from the listed_in column.
- Rare genres were grouped under the "Other" category for better modeling performance.
3. Numerical Encoding:
- Categorical variables such as primary_genre and director were encoded numerically using LabelEncoder.
4. Scaling:
- duration_scaled and release_year_scaled were standardized using the StandardScaler to improve model performance.
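
The four transformation steps above could be sketched roughly as follows. This is a hedged illustration with pandas and scikit-learn, not the project's actual code: the toy rows are made up, the intermediate column name duration_min and the MIN_COUNT rarity threshold are hypothetical, and the sketch takes the "replace with 0" option for non-minute durations.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy rows (made-up values) using the column names described above.
df = pd.DataFrame({
    "duration": ["90 min", "2 Seasons", "120 min", "100 min", "95 min"],
    "release_year": [2015, 2018, 2020, 2019, 2021],
    "listed_in": ["Dramas, Comedies", "TV Dramas", "Dramas",
                  "Comedies, Dramas", "Comedies"],
})

# 1. Duration: keep the number only for "<n> min" strings; anything else
#    (e.g. "2 Seasons") becomes 0. "duration_min" is a hypothetical name.
minutes = df["duration"].str.extract(r"^(\d+)\s*min$", expand=False)
df["duration_min"] = pd.to_numeric(minutes, errors="coerce").fillna(0).astype(int)

# 2. Genres: take the first listed genre, then group rare ones into "Other".
MIN_COUNT = 2  # assumed threshold; the source does not state one
primary = df["listed_in"].str.split(",").str[0].str.strip()
counts = primary.value_counts()
rare = counts[counts < MIN_COUNT].index
df["primary_genre"] = primary.where(~primary.isin(rare), "Other")

# 3. Numerical encoding: LabelEncoder assigns integers to the sorted class names.
encoder = LabelEncoder()
df["genre_encoded"] = encoder.fit_transform(df["primary_genre"])

# 4. Scaling: StandardScaler centers each column to mean 0 and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["duration_min", "release_year"]])
df["duration_scaled"] = scaled[:, 0]
df["release_year_scaled"] = scaled[:, 1]

print(df[["duration_min", "primary_genre", "genre_encoded"]])
```

The director column could be encoded the same way with a second LabelEncoder instance. Keeping the fitted encoder and scaler objects around makes it possible to apply the identical mapping to new data later.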
Outliers
Outliers in numerical columns, such as duration and release year, were identified and analyzed using:
- quantile() to determine extreme values.
- Z-Score Analysis to evaluate deviations from the mean.
These outliers were handled by scaling or filtering to ensure data consistency.
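
The two checks above might look like the following sketch. The source does not say which quantiles or cutoffs were used, so the 1.5 * IQR fences and the z-score threshold of 3 are assumptions (both are common conventions), and the duration values are made up.

```python
import pandas as pd

# Made-up durations in minutes: 20 typical values plus one extreme value.
durations = pd.Series(list(range(80, 120, 2)) + [400], name="duration")

# Quantile-based check: flag values outside the 1.5 * IQR fences (assumed rule).
q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1
outside_fences = (durations < q1 - 1.5 * iqr) | (durations > q3 + 1.5 * iqr)
iqr_outliers = durations[outside_fences]

# Z-score check: flag values more than Z_THRESHOLD standard deviations from the mean.
Z_THRESHOLD = 3.0  # assumed cutoff
z_scores = (durations - durations.mean()) / durations.std(ddof=0)
z_outliers = durations[z_scores.abs() > Z_THRESHOLD]

print(iqr_outliers.tolist(), z_outliers.tolist())
```

On very small samples the z-score check can fail to flag anything, because a single extreme value also inflates the standard deviation; the quantile-based check is less sensitive to that effect.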
Data Augmentation
No synthetic data was generated for this project. Strategies such as oversampling and class balancing could be considered in future iterations to improve model robustness and the representation of less common genres.
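
Since oversampling is only mentioned as a possible future step, the following is purely a sketch of one simple variant, random oversampling with pandas. The toy data, the choice to resample every class up to the size of the largest one, and the fixed random_state are all assumptions for illustration.

```python
import pandas as pd

# Made-up, imbalanced genre labels with one placeholder feature.
df = pd.DataFrame({
    "primary_genre": ["Dramas"] * 6 + ["Comedies"] * 3 + ["Other"] * 1,
    "duration_scaled": [0.1 * i for i in range(10)],
})

# Resample each class with replacement up to the size of the largest class.
target_size = df["primary_genre"].value_counts().max()
parts = [
    group.sample(n=target_size, replace=True, random_state=0)
    for _, group in df.groupby("primary_genre")
]
balanced = pd.concat(parts, ignore_index=True)

print(balanced["primary_genre"].value_counts())
```

If this approach were adopted, it would normally be applied to the training split only, so that duplicated rows do not leak into evaluation data. An alternative that avoids duplicating rows is the class_weight="balanced" option of scikit-learn's LogisticRegression.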
Summary
This data preprocessing pipeline ensures the Netflix dataset is clean, consistent, and ready for predictive modeling. Key transformations such as genre encoding, duration scaling, and outlier handling have been applied to optimize the dataset for logistic regression-based genre prediction.