Update Data authored by Adham Beshr
......@@ -2,6 +2,7 @@
title: Data
---
https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.png?ref_type=heads
## Data Chapter
......@@ -65,6 +66,34 @@ https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.
#### 5.1. Generating Fake Data
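- A synthetic dataset was created with the **Faker** library. The number of fake rows equals **25%** of the number of real rows. Each fake row simulates movie attributes such as genre, rating, and votes; genres are drawn from the genres present in the real data, while the other fields are randomly generated. The fake data was introduced to test model performance and robustness.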
```python
import random

import pandas as pd
from faker import Faker

fake = Faker()


def generate_fake_data(real_data, fake_percentage=0.25):
    """Build a DataFrame of fake movie rows sized as a fraction of real_data."""
    data = []
    # Reuse the genres that actually occur in the real dataset.
    genres = real_data['Genre'].dropna().unique()
    num_samples = int(len(real_data) * fake_percentage)
    for _ in range(num_samples):
        title = fake.bs().title()
        genre = random.choice(genres)
        description = fake.sentence(nb_words=12)
        director = fake.name()
        actors = fake.name() + ', ' + fake.name()
        year = random.randint(2000, 2023)
        runtime = random.randint(80, 180)
        rating = round(random.uniform(1, 10), 1)
        votes = random.randint(50000, 1000000)
        revenue = round(random.uniform(10, 500), 2)
        metascore = random.randint(0, 100)
        data.append([title, genre, description, director, actors, year,
                     runtime, rating, votes, revenue, metascore])
    columns = ['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
               'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', 'Metascore']
    fake_df = pd.DataFrame(data, columns=columns)
    return fake_df
```
#### 5.2. Impact of Fake Data
- The effect of the fake data was evaluated by training the machine learning models with and without it and comparing the results. Whether performance dropped or improved was judged by the **Mean Squared Error (MSE)** (lower is better) and **R²** (higher is better) scores.
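
As a reminder of what the two scores measure, here is a minimal sketch of both metrics in plain Python. The helper names `mse` and `r_squared` and the numbers in the example are made-up placeholders for illustration only; they are not results from this project, and the project itself may compute the scores with a library instead.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R²: 1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical ratings and predictions (placeholder values, not project results).
actual_ratings = [6.0, 7.0, 8.0, 9.0]
pred_without_fake = [6.5, 7.0, 7.5, 9.0]  # model trained on real data only
pred_with_fake = [7.0, 7.0, 7.0, 8.0]     # model trained on real + fake data

scores = {
    "without_fake": (mse(actual_ratings, pred_without_fake),
                     r_squared(actual_ratings, pred_without_fake)),
    "with_fake": (mse(actual_ratings, pred_with_fake),
                  r_squared(actual_ratings, pred_with_fake)),
}
for name, (mse_value, r2_value) in scores.items():
    print(f"{name}: MSE={mse_value:.3f}, R2={r2_value:.3f}")
```

In a comparison like this, both models should be scored on the same held-out rows so that the difference in MSE and R² reflects only the change in training data.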
......@@ -73,5 +102,3 @@ https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.
- The split was done using the `train_test_split()` function from `sklearn.model_selection`.
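
A minimal sketch of such a split is shown below. The toy `features` and `targets` lists, the 80/20 ratio (`test_size=0.2`), and the fixed `random_state` are assumptions made for illustration; the visible text does not state which values the project used.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real feature matrix and target column (hypothetical data).
features = [[i, i * 2] for i in range(10)]
targets = [float(i) for i in range(10)]

# Assumed settings: hold out 20% of the rows and fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

Fixing `random_state` keeps the same rows in the test set across runs, which makes comparisons between models trained with and without fake data easier to reproduce.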
---