Update Data, authored Jan 25, 2025 by Adham Beshr
Data.md @ b86854f4
...
...
@@ -2,6 +2,7 @@
title: Data
---
https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.png?ref_type=heads
## Data Chapter
...
...
@@ -65,6 +66,34 @@ https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.
#### 5.1. Generating Fake Data
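- A synthetic dataset was created by adding fake rows equal to **25%** of the real dataset's size. The **Faker** library generates the text fields (title, description, director, actors), genre is sampled from the genres present in the real data, and numeric attributes such as rating and votes are drawn with Python's `random` module. The fake data was introduced to test model performance and robustness.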
```python
# Assumed setup: the original snippet uses `random`, `pd`, and `fake`
# without showing where they are defined.
import random

import pandas as pd
from faker import Faker

fake = Faker()


def generate_fake_data(real_data, fake_percentage=0.25):
    """Return a DataFrame of fake movies sized as a fraction of real_data."""
    data = []
    # Reuse the genres found in the real data; convert to a plain list so
    # random.choice works regardless of Python/NumPy version.
    genres = real_data['Genre'].dropna().unique().tolist()
    num_samples = int(len(real_data) * fake_percentage)

    for _ in range(num_samples):
        title = fake.bs().title()
        genre = random.choice(genres)
        description = fake.sentence(nb_words=12)
        director = fake.name()
        actors = fake.name() + ', ' + fake.name()
        year = random.randint(2000, 2023)
        runtime = random.randint(80, 180)
        rating = round(random.uniform(1, 10), 1)
        votes = random.randint(50000, 1000000)
        revenue = round(random.uniform(10, 500), 2)
        metascore = random.randint(0, 100)
        data.append([title, genre, description, director, actors, year,
                     runtime, rating, votes, revenue, metascore])

    columns = ['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
               'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
               'Metascore']
    fake_df = pd.DataFrame(data, columns=columns)
    return fake_df
```
#### 5.2. Impact of Fake Data
- The impact of the fake data was evaluated by comparing the performance of the machine learning models trained with and without it. The resulting drop or improvement in performance was analyzed using the **Mean Squared Error (MSE)** and **R²** scores.
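
The comparison described above can be sketched in plain Python to make the two metrics concrete. This is a minimal illustration, not the project's evaluation code: the helper names (`mse`, `r2`, `compare_runs`) are hypothetical, the numbers are made-up placeholders rather than measured results, and it assumes both models are scored on the same held-out set of real ratings.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def compare_runs(y_true, pred_real_only, pred_with_fake):
    """Score both models on the same targets and report the change."""
    real = {"mse": mse(y_true, pred_real_only), "r2": r2(y_true, pred_real_only)}
    mixed = {"mse": mse(y_true, pred_with_fake), "r2": r2(y_true, pred_with_fake)}
    # Positive MSE delta or negative R² delta means the fake data hurt.
    delta = {name: mixed[name] - real[name] for name in real}
    return real, mixed, delta


# Placeholder values for illustration only (not results from the project).
y_true = [6.0, 7.0, 8.0, 9.0]
pred_real_only = [6.5, 7.0, 7.5, 9.0]   # model trained on real data only
pred_with_fake = [5.0, 7.0, 9.0, 9.0]   # model trained on real + fake data

real, mixed, delta = compare_runs(y_true, pred_real_only, pred_with_fake)
```

A higher MSE together with a lower R² for the mixed run would indicate that the added fake rows degraded the model; the opposite pattern would indicate an improvement.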
...
...
@@ -73,5 +102,3 @@ https://mygit.th-deg.de/ab11885/watch-wise/-/raw/main/Images/Dataset_sample_-_2.
- The split was done using the `train_test_split()` function from `sklearn.model_selection`.
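
As a rough sketch of such a split: the toy arrays, the 80/20 ratio (`test_size=0.2`), and the fixed `random_state` below are assumptions for illustration, since the page does not state the actual features or split ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, plus a target.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the
# shuffle reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With 10 samples and `test_size=0.2`, this yields 8 training rows and 2 test rows.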
---
---
---