Changes

Asif Khan · 28f67bc6
--- a/Home/{Data}.md
+++ b/Home/{Data}.md
 ---
 title: Data
 ---
+# Data description
+The project uses the Netflix Titles dataset sourced from Kaggle. It contains information about movies and TV shows available on Netflix, including their title, type, duration, release year, and more. To enhance the dataset, an additional 20% of fake data was generated and flagged with the is_fake column, so that it can be used alongside the original rows for analysis and prediction.

-# Introduction
-This is the Wiki for the Netflix Dataset Analysis and Augmentation project. It documents the development process, features, and internal aspects of the project. This wiki serves as a comprehensive guide for understanding the functionality, implementation, and usage of the application.

-# Content
-[Data](data)
\ No newline at end of file
+# Feature variables
+
+show_id: object
+A unique identifier for each Netflix title.
+
+type: object
+Indicates whether the title is a "Movie" or a "TV Show".
+
+title: object
+The name of the title.
+
+director: object
+The director of the title. Missing values were filled with "Unknown".
+
+cast: object
+The cast of the title. Missing values were filled with "Unknown".
+
+country: object
+The country where the title was produced. Missing values were filled with "Unknown".
+
+date_added: datetime64
+The date the title was added to Netflix. Missing values were replaced with a default date (2020-01-01).
+
+release_year: int64
+The year the title was released. Ranges from 1920 to 2023. Missing values were replaced with the median year.
+
+rating: object
+The rating of the title (e.g., "PG", "R"). Missing values were filled with "Not Rated".
+
+duration: object
+Describes the duration:
+Movies: Duration in minutes (e.g., "90 min").
+TV Shows: Number of seasons (e.g., "2 Seasons").
+
+listed_in: object
+Categories or genres of the title (e.g., "Drama, Comedy"). These were split and cleaned for analysis.
+
+is_fake: int64
+An indicator variable where:
+0: Original data.
+1: Augmented (fake) data.
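The missing-value rules listed for the feature variables above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual cleaning code: the three-row frame is made up, and only the fill values ("Unknown", "Not Rated", median year) come from the descriptions above.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the Kaggle CSV.
df = pd.DataFrame({
    "director": ["Jane Doe", None, "John Roe"],
    "rating": [None, "PG", "R"],
    "release_year": [2010.0, None, 2020.0],
})

# Text columns: fill gaps with the placeholder named in the column description.
df = df.fillna({"director": "Unknown", "rating": "Not Rated"})

# release_year: fill gaps with the median year, then store as int64.
median_year = df["release_year"].median()
df["release_year"] = df["release_year"].fillna(median_year).astype("int64")
```

The same `fillna` pattern would apply to cast and country, which also use "Unknown" as the placeholder.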
+
+# Target variable
+
+release_year: int64
+Year of release, used as the target variable for regression.
+
+# Transformations
+
+Duration Column:
+Converted to numeric values:
+Movies: Integer duration in minutes (e.g., "90 min" → 90).
+TV Shows: Integer number of seasons (e.g., "2 Seasons" → 2).
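One way to perform the duration conversion described above is to take the leading integer from the string. The helper name `parse_duration` is hypothetical and chosen for illustration; the project may implement this step differently.

```python
import re


def parse_duration(value: str) -> int:
    """Return the leading integer of a duration string.

    Works for both "90 min" (minutes) and "2 Seasons" (season count).
    """
    match = re.match(r"\s*(\d+)", value)
    if match is None:
        raise ValueError(f"no leading number in duration: {value!r}")
    return int(match.group(1))


minutes = parse_duration("90 min")      # movie duration in minutes
seasons = parse_duration("2 Seasons")   # TV show season count
```

Note that the resulting number means minutes for movies and seasons for TV shows, so it should be interpreted together with the type column.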
+
+Genres:
+Split and cleaned from the listed_in column to create individual genre categories.
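A minimal sketch of the genre split, assuming that "cleaned" means trimming whitespace and dropping empty entries (the original does not spell out the cleaning rules). The function name `split_genres` is hypothetical.

```python
def split_genres(listed_in: str) -> list[str]:
    """Split a comma-separated listed_in value into individual genres."""
    # Trim whitespace around each genre and skip empty fragments.
    return [genre.strip() for genre in listed_in.split(",") if genre.strip()]


genres = split_genres("Drama, Comedy")
```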
+
+Date Added:
+Converted to datetime format, with missing values replaced by a default date.
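The date conversion can be sketched in plain Python as below. The default date 2020-01-01 is the one given for the date_added column; the input format "September 25, 2021" is an assumption about how the raw column is written, and `parse_date_added` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional

# Default taken from the date_added column description.
DEFAULT_DATE = datetime(2020, 1, 1)


def parse_date_added(value: Optional[str]) -> datetime:
    """Parse a date_added string, falling back to the default when missing."""
    if value is None or not value.strip():
        return DEFAULT_DATE
    # Assumed raw format, e.g. "September 25, 2021".
    return datetime.strptime(value.strip(), "%B %d, %Y")


parsed = parse_date_added(" September 25, 2021")
fallback = parse_date_added(None)
```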
+
+Numerical Encoding:
+Categorical variables (e.g., type, rating) were encoded into numerical values for model training.
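The original does not say which encoding scheme was used. As one possibility, the sketch below shows simple label encoding, where each distinct category gets an integer code; the function name `label_encode` and the alphabetical ordering of codes are assumptions made for illustration.

```python
def label_encode(values: list[str]) -> tuple[list[int], dict[str, int]]:
    """Map each distinct category to an integer code (alphabetical order)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


codes, mapping = label_encode(["Movie", "TV Show", "Movie"])
```

Label encoding implies an ordering between categories, which some models may pick up on; one-hot encoding is a common alternative when that is a concern.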
+
+# Outliers
+
+Outliers were identified in numerical columns (e.g., duration_numeric and release_year). The follow-up analysis for handling them used:
+
+quantile() from the pandas library.
+zscore() from the scipy library.
+See this Stack Overflow thread for additional details.
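The two functions named above can be combined as in the following sketch. The sample values are made up, and the cut-offs (1.5 times the interquartile range, absolute z-score above 3) are common conventions assumed here, not thresholds confirmed by the project.

```python
import pandas as pd
from scipy.stats import zscore

# Hypothetical durations: 19 ordinary values plus one extreme value.
durations = pd.Series(list(range(80, 118, 2)) + [900], name="duration_numeric")

# IQR rule using pandas quantile().
q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1
iqr_mask = (durations < q1 - 1.5 * iqr) | (durations > q3 + 1.5 * iqr)

# Z-score rule using scipy zscore().
z_mask = abs(zscore(durations.to_numpy())) > 3

iqr_outliers = durations[iqr_mask].tolist()
z_outliers = durations[z_mask].tolist()
```

On this sample both rules flag only the extreme value. On small samples the z-score rule can miss outliers, because a single extreme value also inflates the standard deviation.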
+
+# Additional fake data
+
+To enhance the dataset, an additional 20% of fake data was generated using random functions. These rows are marked with is_fake = 1. They replicate the structure of the original dataset while introducing variability.
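The original does not describe the generation procedure beyond "random functions". The sketch below shows one possible approach under that description: each fake row draws every column independently from values seen in the original rows. The helper name `make_fake_rows`, the sampling strategy, and the sample rows are all assumptions.

```python
import random


def make_fake_rows(rows: list[dict], fraction: float = 0.2, seed: int = 0) -> list[dict]:
    """Create fake rows with the same columns as the originals."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_fake = round(len(rows) * fraction)
    columns = [col for col in rows[0] if col != "is_fake"]
    fake_rows = []
    for _ in range(n_fake):
        # Draw each column independently from an original row.
        row = {col: rng.choice(rows)[col] for col in columns}
        row["is_fake"] = 1  # mark as augmented data
        fake_rows.append(row)
    return fake_rows


# Hypothetical original rows (is_fake = 0).
original = [
    {"type": "Movie" if i % 2 else "TV Show", "release_year": 2000 + i, "is_fake": 0}
    for i in range(10)
]
augmented = original + make_fake_rows(original)
```

Sampling columns independently keeps each column's distribution but breaks relationships between columns, and an identifier such as show_id would need to be generated separately to stay unique.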
\ No newline at end of file