## Data Description
The dataset consists of movie information, including features such as movie rankings, titles, genres, descriptions, directors, actors, years, runtime, ratings, votes, revenue, and metascores.
### Feature Variables
The dataset includes the following feature variables:

- **Rank**: Integer, the movie's ranking based on factors such as rating and box-office performance.
- **Title**: String, the title of the movie.
- **Genre**: String, a comma-separated list of genres (e.g., Action, Adventure, Sci-Fi).
- **Description**: String, a brief description of the movie's plot.
- **Director**: String, the director of the movie.
- **Actors**: String, the main actors involved in the movie.
- **Year**: Integer, the release year of the movie.
- **Runtime (Minutes)**: Integer, the length of the movie in minutes.
- **Rating**: Float, the IMDb rating of the movie (scale 1-10).
- **Votes**: Integer, the number of votes the movie received on IMDb.
- **Revenue (Millions)**: Float, the total revenue of the movie in millions.
- **Metascore**: Integer, the Metascore rating given by critics (scale 0-100).

---

## Target Variable

- **Rating**: This is the target variable for regression, representing the movie's IMDb rating. The goal is to predict this value based on other features such as runtime, genre, and revenue.
---
## Data Transformations

- **Outliers**: There are potential outliers in the dataset, such as unusually high revenues or ratings, which should be detected and handled. Methods such as the IQR method or Z-scores can be used to identify and exclude these outliers.

---

## Additional Fake Data

- **Fake Data Generation**: Up to 25% of the data may be fake. For example, fake titles, genres, and other features may have been generated to simulate realistic movie data. When combining original and fake data, the fake portion should be treated carefully during model training to avoid bias.

---
### Step 1: Loading and Inspecting the Data

- The dataset is loaded using `pd.read_csv()` from a specified file path. This loads the movie data into a DataFrame, which can then be manipulated and analyzed.
- `data.info()` provides an overview of the dataset, including the number of entries and the data types of the columns.
- **Check for missing values:**
  - `data.isnull().sum()` is used to check for missing values in the dataset, which helps identify potential issues that need to be addressed.
- **Check for duplicates:**
  - `data.duplicated().sum()` is used to check for duplicate rows in the dataset. Duplicates can skew the analysis and need to be removed.
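A minimal sketch of this loading and inspection step (the file name is hypothetical):

```python
import pandas as pd

# Load the movie data into a DataFrame (the file name is an assumption)
data = pd.read_csv("movies.csv")

# Overview of entries and column data types
data.info()

# Number of missing values per column
print(data.isnull().sum())

# Number of duplicate rows
print(data.duplicated().sum())
```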
### Step 2: Data Cleaning

#### 2.1. Handling Missing Values
- Missing values were handled using the **SimpleImputer** from `sklearn.impute`, which imputes missing values using the most frequent value in the column.
- Where imputation is not appropriate, rows with missing values can instead be removed with `data.dropna()`.

#### 2.2. Removing Duplicates
- Duplicate entries were removed using the `drop_duplicates()` method to ensure the dataset contains only unique rows.

#### 2.3. Feature Selection
- The following features were selected for the analysis:
  - **Year**
  - **Runtime (Minutes)**
  - **Metascore**
  - **Votes**
  - **Rating** (target variable)

#### 2.4. Removing Invalid Data
- Rows with zero votes were removed using `data[data["Votes"] > 0]` to eliminate irrelevant entries.
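A minimal sketch of these cleaning steps, assuming the column names described above (the set of imputed columns is illustrative):

```python
from sklearn.impute import SimpleImputer

# Impute missing values with each column's most frequent value
# (the imputed columns here are an illustrative assumption)
imputer = SimpleImputer(strategy="most_frequent")
cols = ["Revenue (Millions)", "Metascore"]
data[cols] = imputer.fit_transform(data[cols])

# Alternatively, drop rows with missing values entirely:
# data = data.dropna()

# Remove duplicate rows
data = data.drop_duplicates()

# Remove invalid entries with zero votes
data = data[data["Votes"] > 0]
```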
### 3. Data Transformation

- **Encoding Categorical Variables**:
  - Categorical features, such as the `Genre` and `Director` columns, were encoded using one-hot encoding with `pd.get_dummies()`. This creates binary columns for each category, allowing them to be used in machine learning models.
  - The `drop_first=True` argument was used to avoid multicollinearity.
- **Feature Scaling**:
  - `MinMaxScaler` is used to normalize the numerical features, ensuring that all values are scaled between 0 and 1. This is particularly important for algorithms like SVM that are sensitive to feature scaling.
  - In particular, the `Runtime (Minutes)` feature is normalized with `MinMaxScaler` to scale it into a range compatible with the other features.
- **Log Transformation of `Revenue (Millions)`**:
  - A log transformation (`np.log1p()`) is applied to the `Revenue (Millions)` feature to reduce skewness and make the data more suitable for machine learning models.
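A sketch of these transformations (the numeric column list is an assumption based on the features described above):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# One-hot encode Genre; drop_first=True avoids multicollinearity
data = pd.get_dummies(data, columns=["Genre"], drop_first=True)

# Log-transform revenue to reduce skewness
data["Revenue (Millions)"] = np.log1p(data["Revenue (Millions)"])

# Scale the numerical features into the [0, 1] range
numeric_cols = ["Runtime (Minutes)", "Votes", "Revenue (Millions)", "Metascore"]
data[numeric_cols] = MinMaxScaler().fit_transform(data[numeric_cols])
```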
### 4. Handling Outliers

- **Outlier Detection**:
  - Outliers were detected using the **Interquartile Range (IQR)** method, where values outside the range `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` were considered outliers.
  - Detected outliers were either removed or adjusted to the nearest bound, depending on their influence on the dataset, so that extreme values do not degrade the model's performance.

```python
def handle_outliers(df, column):
    # Calculate the first and third quartiles (Q1, Q3)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)

    # Compute the Interquartile Range (IQR)
    IQR = Q3 - Q1

    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Adjust values outside the bounds back to the nearest bound
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
    return df
```
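The helper can then be applied to each numerical column; the column list here is illustrative:

```python
# Clip outliers in each numerical column (the column list is an assumption)
for col in ["Runtime (Minutes)", "Votes", "Revenue (Millions)"]:
    data = handle_outliers(data, col)
```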
### 5. Fake Data

#### 5.1. Generating Fake Data
- Fake data is generated using the `Faker` library to simulate additional movie data. This helps in testing models when more data is needed.
```python
import random

import pandas as pd
from faker import Faker

fake = Faker()

def generate_fake_data(real_data, fake_percentage):
    data = []
    genres = real_data['Genre'].dropna().unique()
    num_samples = int(len(real_data) * fake_percentage)
    for _ in range(num_samples):
        title = fake.bs().title()
        genre = random.choice(genres)
        description = fake.sentence(nb_words=12)
        director = fake.name()
        actors = fake.name() + ', ' + fake.name()
        year = random.randint(2000, 2023)
        runtime = random.randint(80, 180)
        rating = round(random.uniform(1, 10), 1)
        votes = random.randint(50000, 1000000)
        revenue = round(random.uniform(10, 500), 2)
        metascore = random.randint(0, 100)
        # Collect the generated values as one row (keys follow the dataset schema)
        data.append({
            'Title': title, 'Genre': genre, 'Description': description,
            'Director': director, 'Actors': actors, 'Year': year,
            'Runtime (Minutes)': runtime, 'Rating': rating, 'Votes': votes,
            'Revenue (Millions)': revenue, 'Metascore': metascore,
        })
    fake_df = pd.DataFrame(data)
    return fake_df
```
- **Combine Original and Fake Data:**
  - The original dataset and the generated fake data are concatenated to create a larger dataset, which may help improve the model's performance by providing more variety.
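A sketch of the combination step, assuming `real_data` holds the original DataFrame and using the `generate_fake_data()` helper defined above:

```python
import pandas as pd

# Generate fake rows equal to 25% of the real data and append them
fake_df = generate_fake_data(real_data, fake_percentage=0.25)
combined_data = pd.concat([real_data, fake_df], ignore_index=True)
```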
#### 5.2. Impact of Fake Data

- The inclusion of fake data was carefully evaluated, as it significantly influenced the model's accuracy. Adding more than 25% fake data caused the accuracy to drop drastically, making the model unreliable.
- To maintain balance and robustness, the fake data was limited to 25%. This ensured enough variety without heavily compromising model performance.
- The evaluation showed a decline in performance metrics such as **Mean Squared Error (MSE)** and **R²** scores when fake data exceeded this threshold, emphasizing the importance of minimizing synthetic data.
### 6. Data Splitting
- The dataset was split into **training** and **testing** sets using a **70:30** ratio for model evaluation.
- The split was done using the `train_test_split()` function from `sklearn.model_selection`.
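A sketch of the split (the `random_state` value is an assumption for reproducibility):

```python
from sklearn.model_selection import train_test_split

# 70:30 train/test split; Rating is the regression target
X = combined_data.drop(columns=["Rating"])
y = combined_data["Rating"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```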
---

### Step 5: Preprocessing on Combined Data
- **Data Cleaning on Combined Data:**
  - As with the original dataset, missing values and duplicates are removed from the combined data.
- **Outlier Handling:**
  - The outlier handling function is applied to the numerical columns of the combined dataset to ensure that extreme values are adjusted appropriately.
- **Feature Scaling and Encoding on Combined Data:**
  - After cleaning, the combined data undergoes feature scaling and one-hot encoding for categorical features like `Genre` and `Director`.
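A sketch of these combined-data steps, reusing the `handle_outliers()` helper defined earlier (the column lists are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Repeat cleaning on the combined data
combined_data = combined_data.dropna().drop_duplicates()

# Adjust extreme values in the numerical columns
num_cols = ["Runtime (Minutes)", "Votes", "Revenue (Millions)", "Metascore"]
for col in num_cols:
    combined_data = handle_outliers(combined_data, col)

# One-hot encode categoricals and rescale numerical features
combined_data = pd.get_dummies(
    combined_data, columns=["Genre", "Director"], drop_first=True
)
combined_data[num_cols] = MinMaxScaler().fit_transform(combined_data[num_cols])
```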