... | ... | @@ -69,10 +69,48 @@ There were no transformations appear to have been performed on the dataset. |
|
|
Yes, there are several opportunities for data transformation in our dataset. Here are some suggestions:
|
|
|
|
|
|
1. Departure Date: Convert to a datetime format, then extract components such as year, month, and day of the week for analysis.
|
|
|
2. Flight Status: Encode as numerical values (e.g., 0 = Delayed, 1 = On Time, 2 = Cancelled).
|
|
|
|
|
|
3. **`First Name`** and **`Last Name`** could be dropped unless you are performing text or name-based analysis.
|
|
|
4. Merge **`Airport Name`** and **`Airport Country Code`** into a single feature for uniqueness.
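
The four suggestions above can be sketched with pandas. This is a minimal illustration, not the project's actual pipeline: the sample rows, the date format, the exact status labels, and the new column names (`Year`, `Month`, `Day of Week`, `Flight Status Code`, `Airport`) are all assumptions.

```python
import pandas as pd

# Tiny illustrative frame; values and date format are made up for the example.
df = pd.DataFrame({
    "First Name": ["Ana", "Ben", "Cy"],
    "Last Name": ["Diaz", "Evans", "Ford"],
    "Departure Date": ["2022-06-28", "2022-12-26", "2022-01-18"],
    "Airport Name": ["Coldfoot Airport", "Kugluktuk Airport", "Grenoble Airport"],
    "Airport Country Code": ["US", "CA", "FR"],
    "Flight Status": ["On Time", "Delayed", "Cancelled"],
})


def transform(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()

    # 1. Parse the date, then extract calendar components.
    out["Departure Date"] = pd.to_datetime(out["Departure Date"])
    out["Year"] = out["Departure Date"].dt.year
    out["Month"] = out["Departure Date"].dt.month
    out["Day of Week"] = out["Departure Date"].dt.day_name()

    # 2. Encode the status with the mapping suggested in the text.
    #    Labels missing from the mapping would become NaN.
    status_codes = {"Delayed": 0, "On Time": 1, "Cancelled": 2}
    out["Flight Status Code"] = out["Flight Status"].map(status_codes)

    # 3. Drop the name columns (only if no name-based analysis is planned).
    out = out.drop(columns=["First Name", "Last Name"])

    # 4. Combine airport name and country code into one feature.
    out["Airport"] = out["Airport Name"] + " (" + out["Airport Country Code"] + ")"
    return out


transformed = transform(df)
```

Keeping the original `Flight Status` column next to the encoded one makes it easy to check the mapping before the text column is dropped.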
|
|
|
|
|
|
These are suggested data transformations that may be applicable to our airline dataset.
|
|
|
|
|
|
|
|
|
|
|
# Realistic Fake Dataset Analysis
|
|
|
|
|
|
Here is an analysis of the fake data and its potential effects:
|
|
|
|
|
|
##### Effect on Model Training
|
|
|
|
|
|
1. The original dataset has 1,499 rows. After adding 25% realistic fake data, it grew to 1,873 rows, an increase of 374 rows.
|
|
|
2. Variability in categorical features like `Nationality`, `Airport Continent`, and `Flight Status` increased.
|
|
|
3. In `Gender`, the male frequency increased from 763 to 895, and in `Flight Status`, the "On Time" frequency increased from 515 to 641, altering the target variable distribution.
|
|
|
4. If the fake data is realistic, it might introduce diverse patterns and improve the model's generalization. If it is unrealistic, it could mislead the model, introducing errors or causing it to overfit to artifacts in the fake data.
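
The counts above are easier to compare as shares of the dataset. The short calculation below uses only the figures quoted in this list. It assumes the 25% increment was truncated to a whole number of rows, which matches the stated 374.

```python
# Figures taken from the list above.
original_rows = 1499
fake_rows = int(original_rows * 0.25)    # 374.75 truncated to 374
total_rows = original_rows + fake_rows   # 1873


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


male_before = share(763, original_rows)
male_after = share(895, total_rows)
on_time_before = share(515, original_rows)
on_time_after = share(641, total_rows)

print(f"Male:    {male_before}% -> {male_after}%")
print(f"On Time: {on_time_before}% -> {on_time_after}%")
```

On these figures the male share moves from roughly 50.9% to 47.8%, while the "On Time" share stays near 34%, so it is worth checking proportions as well as raw counts when judging how much the target distribution changed.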
|
|
|
|
|
|
##### Approach for generating fake data
|
|
|
|
|
|
1. **Steps Taken**:
|
|
|
* Used realistic values for categorical features like `Nationality`, `Airport Continent`, and `Flight Status`, ensuring alignment with existing distributions.
|
|
|
   * For numerical features (e.g., `Age`), data was sampled or generated within valid ranges (e.g., 1–90).
|
|
|
* Target labels (`Flight Status`) were assigned proportions similar to the original data.
|
|
|
2. **Key Considerations**:
|
|
|
   * Ensured the fake data follows the same distributions as the real data.
|
|
|
* Preserved the balance across target classes to avoid skewing the dataset.
|
|
|
* Introduced some variability in features like `Nationality` or `Airport Country Code` to enrich the dataset.
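
One way to follow the steps above is to sample each categorical column according to its observed frequencies and to draw ages uniformly from the valid range. The sketch below is an assumption about how this could be done, not the code actually used. The function name `generate_fake_rows`, the uniform age sampling, and the small `real` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def generate_fake_rows(real: pd.DataFrame, fraction: float = 0.25,
                       seed: int = 0) -> pd.DataFrame:
    """Sample fake rows that mirror the categorical distributions of `real`."""
    rng = np.random.default_rng(seed)
    n_new = int(len(real) * fraction)

    fake = {}
    for col in ["Nationality", "Airport Continent", "Flight Status"]:
        # Empirical category frequencies, used as sampling probabilities.
        freqs = real[col].value_counts(normalize=True)
        fake[col] = rng.choice(freqs.index.to_numpy(), size=n_new,
                               p=freqs.to_numpy())

    # Ages drawn uniformly from 1 to 90 inclusive (upper bound is exclusive).
    fake["Age"] = rng.integers(1, 91, size=n_new)
    return pd.DataFrame(fake)


# Hypothetical stand-in for the real dataset.
real = pd.DataFrame({
    "Nationality": ["Japan", "Brazil", "China", "Brazil"] * 5,
    "Airport Continent": ["AS", "SAM", "AS", "EU"] * 5,
    "Flight Status": ["On Time", "Delayed", "Cancelled", "On Time"] * 5,
    "Age": list(range(20, 40)),
})

fake = generate_fake_rows(real)
augmented = pd.concat([real, fake], ignore_index=True)
```

Sampling each column independently preserves the marginal distributions but not relationships between columns, which is one way fake rows can end up less realistic than they look.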
|
|
|
|
|
|
##### Influence on the Model
|
|
|
|
|
|
The fake data can influence our model both positively and negatively:
|
|
|
|
|
|
_Positive Influence:_
|
|
|
|
|
|
* If the fake data diversifies the dataset, it could help the model generalize better to unseen scenarios.
|
|
|
* If the fake data corrects issues in a feature like `Flight Status`, it may improve predictive performance.
|
|
|
* It adds more examples for the model to learn from.
|
|
|
|
|
|
_Negative Influence:_
|
|
|
|
|
|
* Unrealistic patterns in the fake data (e.g., unusual ages or impossible categories) could introduce noise, reducing model accuracy.
|
|
|
* If the fake data is bad (unrealistic), the model learns "wrong lessons" and performs worse.
|
|
|
* If the fake data introduces too much randomness, the model might struggle to distinguish between important patterns and irrelevant differences.
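
A simple guard against the first two risks is to flag fake rows whose values fall outside what the real data allows. The check below is a hedged sketch. The helper name `find_unrealistic_rows`, the 1 to 90 age range, and the choice to validate only `Flight Status` are assumptions. Columns where new values were introduced on purpose, such as `Nationality`, are left out of the check.

```python
import pandas as pd


def find_unrealistic_rows(fake: pd.DataFrame, real: pd.DataFrame,
                          age_range=(1, 90),
                          categorical_cols=("Flight Status",)) -> pd.DataFrame:
    """Return fake rows with out-of-range ages or categories unseen in `real`."""
    low, high = age_range
    # between() is inclusive on both ends by default.
    bad = ~fake["Age"].between(low, high)
    for col in categorical_cols:
        bad |= ~fake[col].isin(real[col].unique())
    return fake[bad]


# Hypothetical example data.
real = pd.DataFrame({
    "Age": [34, 58, 21],
    "Flight Status": ["On Time", "Delayed", "Cancelled"],
})
fake = pd.DataFrame({
    "Age": [25, 140, 30],
    "Flight Status": ["On Time", "Delayed", "Boarding"],
})

flagged = find_unrealistic_rows(fake, real)
print(flagged)
```

Here the second row is flagged for its age and the third for a status that never appears in the real data. Such rows can be dropped or regenerated before training.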
|
|
|
|
|
|