The project uses the Netflix Titles dataset sourced from Kaggle. It contains information about movies and TV shows available on Netflix, including their title, type, duration, release year, and more. To enlarge the dataset, fake rows amounting to an additional 20% of the original data were generated for use in analysis and predictions.
# Introduction
This is the wiki for the Netflix Dataset Analysis and Augmentation project. It documents the development process, features, and internals of the project, and serves as a guide to the functionality, implementation, and usage of the application.
# Content
[Data](data)
# Feature variables
- show_id (object): A unique identifier for each Netflix title.
- type (object): Indicates whether the title is a "Movie" or a "TV Show".
- title (object): The name of the title.
- director (object): The director of the title. Missing values were filled with "Unknown".
- cast (object): The cast of the title. Missing values were filled with "Unknown".
- country (object): The country where the title was produced. Missing values were filled with "Unknown".
- date_added (datetime64): The date the title was added to Netflix. Missing values were replaced with a default date (2020-01-01).
- release_year (int64): The year the title was released. Ranges from 1920 to 2023. Missing values were replaced with the median year.
- rating (object): The rating of the title (e.g., "PG", "R"). Missing values were filled with "Not Rated".
- duration (object): Describes the duration:
  - Movies: Duration in minutes (e.g., "90 min").
  - TV Shows: Number of seasons (e.g., "2 Seasons").
- listed_in (object): Categories or genres of the title (e.g., "Drama, Comedy"). These were split and cleaned for analysis.
- is_fake (int64): An indicator variable where:
  - 0: Original data.
  - 1: Augmented (fake) data.
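
The missing-value rules listed above could be applied roughly as follows. This is a minimal sketch, not the project's actual code: it assumes pandas, uses a few invented rows in place of the Kaggle CSV, and assumes dates arrive as strings like "September 25, 2021" (the format string is a guess and may need adjusting).

```python
import pandas as pd

# Invented rows standing in for the real dataset (illustration only).
df = pd.DataFrame({
    "director": ["Jane Doe", None, "Sam Roe"],
    "cast": [None, "A. Actor", "B. Actor"],
    "country": ["India", None, "Japan"],
    "date_added": ["September 25, 2021", None, "January 1, 2019"],
    "release_year": [2019.0, None, 2001.0],
    "rating": [None, "PG", "R"],
})

# Text columns: fill gaps with the placeholder values described above.
for col in ["director", "cast", "country"]:
    df[col] = df[col].fillna("Unknown")
df["rating"] = df["rating"].fillna("Not Rated")

# date_added: parse to datetime64 (assumed format), then use the default date.
df["date_added"] = pd.to_datetime(
    df["date_added"], format="%B %d, %Y", errors="coerce"
)
df["date_added"] = df["date_added"].fillna(pd.Timestamp("2020-01-01"))

# release_year: impute with the median, then store as int64.
median_year = df["release_year"].median()
df["release_year"] = df["release_year"].fillna(median_year).astype("int64")
```

Filling with a fixed placeholder keeps every row usable, at the cost of making "Unknown" and the default date look like real values, so it may be worth excluding them from frequency counts.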
# Target variable
- release_year (int64): Year of release, used as the target variable for regression.
# Transformations
- Duration column: converted to numeric values:
  - Movies: Integer duration in minutes (e.g., "90 min" → 90).
  - TV Shows: Integer number of seasons (e.g., "2 Seasons" → 2).
- Genres: split and cleaned from the listed_in column to create individual genre categories.
- Date added: converted to datetime format, with missing values replaced by a default date.
- Numerical encoding: categorical variables (e.g., type, rating) were encoded into numerical values for model training.
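
As an illustration of the duration, genre, and encoding steps, here is a hedged sketch using pandas on invented rows. The `genres` column name, the `_code` suffix, and the use of integer category codes are assumptions made for this example; the project may use a different encoding.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "rating": ["PG", "TV-MA", "R"],
    "duration": ["90 min", "2 Seasons", "104 min"],
    "listed_in": ["Dramas, Comedies", "Crime TV Shows", "Dramas"],
})

# Duration: keep the leading integer ("90 min" -> 90, "2 Seasons" -> 2).
df["duration_numeric"] = (
    df["duration"].str.extract(r"(\d+)", expand=False).astype("int64")
)

# Genres: split on commas and strip whitespace, giving a list per title.
df["genres"] = df["listed_in"].str.split(",").apply(
    lambda parts: [p.strip() for p in parts]
)

# Numerical encoding (assumed scheme): one integer code per category.
for col in ["type", "rating"]:
    df[col + "_code"] = df[col].astype("category").cat.codes
```

Note that duration_numeric mixes two units (minutes for movies, seasons for TV shows), so it is usually read together with the type column.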
# Outliers
Outliers were identified in the numerical columns (e.g., duration_numeric and release_year). They were detected and handled using:
- quantile() from the pandas library.
- zscore() from scipy.stats.
See this Stack Overflow thread for additional details.
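
The two approaches mentioned above can be sketched as follows. This is an illustrative example on invented values, not the project's code. The 1.5 × IQR fences and the z-score threshold of 2.5 are assumptions. The z-score is computed by hand with `ddof=0` to keep the sketch free of extra dependencies; `scipy.stats.zscore` uses the same default and should give the same values.

```python
import pandas as pd

# Invented durations in minutes; 300 is a planted outlier.
s = pd.Series(
    [88, 90, 92, 95, 97, 100, 102, 105, 110, 300], name="duration_numeric"
)

# IQR rule via quantile(): flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: flag values far from the mean in standard-deviation units.
threshold = 2.5  # assumed cutoff; 3 is also common on larger samples
z = (s - s.mean()) / s.std(ddof=0)
z_outliers = s[z.abs() > threshold]
```

On small samples a single extreme value inflates the standard deviation, so the z-score rule can miss outliers that the IQR rule catches; this is why the cutoff here is lower than the usual 3.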
# Additional fake data
To enhance the dataset, an additional 20% of fake data was generated using random functions. This fake data replicates the structure of the original dataset while introducing variability.
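
The source does not say which random functions were used, so the following is only one possible sketch. It assumes pandas and builds fake rows by resampling each column independently from the original values. The toy data, the seeds, and the `fake{i}` identifiers are hypothetical.

```python
import pandas as pd

# Invented stand-in for the original dataset.
original = pd.DataFrame({
    "show_id": [f"s{i}" for i in range(1, 11)],
    "type": ["Movie", "TV Show"] * 5,
    "release_year": [2001, 2005, 2010, 2012, 2015, 2016, 2018, 2019, 2020, 2021],
})
original["is_fake"] = 0

n_fake = int(len(original) * 0.20)  # 20% of the original row count

# Draw each column independently (with replacement) so fake rows share the
# schema and value ranges of the original but are not copies of whole rows.
fake = pd.DataFrame({
    col: original[col].sample(n=n_fake, replace=True, random_state=seed).to_numpy()
    for seed, col in enumerate(["type", "release_year"])
})
fake["show_id"] = [f"fake{i}" for i in range(1, n_fake + 1)]  # hypothetical ids
fake["is_fake"] = 1

augmented = pd.concat([original, fake], ignore_index=True)
```

Sampling columns independently preserves each column's marginal distribution but breaks correlations between columns, which matters if the fake rows are later used to train the regression model.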