---
title: data
---

# Data Description

The Airline Dataset contains 15 features describing passenger personal details, flight details, and flight performance. Below is a detailed description of each feature.

# Feature Variables

1. **Passenger ID**: _int64_
   * Unique identifier for passengers.
2. **First Name**: _string_
   * The first name of the passenger.
3. **Last Name**: _string_
   * The last name of the passenger.
4. **Gender**: _string_
   * Values are categorical (Male, Female).
   * Transformation: May be encoded as binary (e.g., Male = 0, Female = 1) or one-hot encoded for analysis.
5. **Age**: _int64_
   * Age of the passenger.
   * Range: \[1, 90\]
   * Min: 1, Max: 90, Mean: 46.25
6. **Nationality**: _string_
   * The country of origin of the passenger.
7. **Airport Name**: _string_
   * Name of the originating or destination airport.
8. **Airport Country Code**: _string_
   * Country code of the airport (e.g., USA, CANADA, FRANCE).
9. **Country Name**: _string_
   * Full country name corresponding to the airport.
10. **Airport Continent**: _string_
    * Values include NAM (North America), EU (Europe), etc.
    * Transformation: May be encoded as categorical numerical values or one-hot encoded.
11. **Continents**: _string_
    * Generalized continent grouping.
    * Transformation: Similar to **Airport Continent**; encode numerically if necessary.
12. **Departure Date**: _string_
    * The date of the flight's departure.
    * Transformation: Convert to a `datetime` format for time-based analysis.
13. **Arrival Airport**: _string_
    * Code of the destination airport.
14. **Pilot Name**: _string_
    * Name of the pilot for the flight.
15. **Flight Status**: _string_
    * Categorical values (On Time, Delayed).
    * Transformation: Encode numerically for machine learning (e.g., On Time = 0, Delayed = 1).

# Target Variables

In the Airline dataset, the target variable is Flight Status.

**Flight Status**: _Categorical -\> Binary_

* The primary target variable, indicating whether the flight was on time or delayed.
* It helps to understand how well the airline operates and what causes delays.
* Transformation: Convert to binary values for classification models with only two possible outcomes (e.g., On Time = 0, Delayed = 1).

##### Application

* **Operational Analysis:** Helps identify patterns in delays based on flight details such as departure date, airport, or pilot name.
* **Predictive Modelling:** Makes it possible to build models that predict delays, which helps improve customer comfort, planning, and experience.

Factors that can influence flight status include airport congestion (the busier the airport, the more likely a delay), flights during peak hours, and weather conditions.

# Data Transformation

No transformations appear to have been performed on the dataset. However, there are several opportunities for data transformation. Here are some suggestions:

1. **`Departure Date`**: Convert to a `datetime` format and extract components such as year, month, and day of the week for analysis.
2. **`Flight Status`**: Encode as numerical values (e.g., On Time = 0, Delayed = 1, matching the target encoding above; a further code such as Cancelled = 2 can be added if that status is present).
3. **`First Name`** and **`Last Name`** could be dropped unless you are performing text or name-based analysis.
4. Merge **`Airport Name`** and **`Airport Country Code`** into a single feature for uniqueness.

# Realistic Fake Dataset Analysis

Here is an analysis of the fake data and its potential effects:

##### Effect on Model Training

1. The original dataset has 1,499 rows. After adding 25% realistic fake data, it grew by 374 rows to 1,873 rows.
2. Variability in categorical features like `Nationality`, `Airport Continent`, and `Flight Status` increased.
3. In `Gender`, the Male frequency increased from 763 to 895; in `Flight Status`, the "On Time" frequency increased from 515 to 641, altering the class counts of the target variable.
4. If the fake data is realistic, it might introduce diverse patterns, potentially improving the model's generalization; if it is unrealistic, it could mislead the model, introducing errors or overfitting to artifacts in the fake data.

##### Approach for generating fake data

1. **Steps Taken**:
   * Used realistic values for categorical features like `Nationality`, `Airport Continent`, and `Flight Status`, ensuring alignment with existing distributions.
   * For numerical features (e.g., `Age`), data was sampled or generated within valid ranges (e.g., 1–90).
   * Target labels (`Flight Status`) were assigned in proportions similar to the original data.
2. **Key Considerations**:
   * Ensured the fake data follows the same distributions as the real data.
   * Preserved the balance across target classes to avoid skewing the dataset.
   * Introduced some variability in features like `Nationality` or `Airport Country Code` to enrich the dataset.

##### Influence on the Model

The fake data can influence the model both positively and negatively:

_Positive Influence:_

* If the fake data diversifies the dataset, it could help the model generalize better to unseen scenarios.
* If the fake data helps balance a feature such as `Flight Status`, it may improve predictive performance.
* It adds more examples for the model to learn from.

_Negative Influence:_

* Unrealistic patterns in fake data (e.g., unusual ages or impossible categories) could introduce noise, reducing model accuracy.
* If the fake data is unrealistic, the model learns "wrong lessons" and performs worse.
* If the fake data introduces too much randomness, the model might struggle to distinguish between important patterns and irrelevant differences.
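
The generation approach described above (categorical values drawn in line with existing distributions, ages drawn within the valid range) can be sketched as follows. This is a minimal illustration and not necessarily the procedure that was actually used: the function name `make_fake_rows`, the subset of columns, the toy rows, and the choice of uniform age sampling are all assumptions made for the example.

```python
import random
from collections import Counter

# Hypothetical subset of the categorical columns, chosen for illustration.
CATEGORICAL = ["Gender", "Nationality", "Airport Continent", "Flight Status"]


def make_fake_rows(rows, fraction=0.25, age_range=(1, 90), seed=0):
    """Return int(len(rows) * fraction) synthetic rows.

    Each categorical value is drawn with probability proportional to its
    frequency in `rows`, so class proportions stay close to the original.
    """
    rng = random.Random(seed)
    n_new = int(len(rows) * fraction)

    # Empirical value counts per categorical column.
    counts = {col: Counter(r[col] for r in rows) for col in CATEGORICAL}

    fake = []
    for _ in range(n_new):
        row = {}
        for col in CATEGORICAL:
            values = list(counts[col].keys())
            weights = list(counts[col].values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        # Assumption: ages are drawn uniformly; randint is inclusive on both ends.
        row["Age"] = rng.randint(*age_range)
        fake.append(row)
    return fake


# Toy stand-in for the real dataset (values are made up for the example).
original = [
    {"Gender": g, "Nationality": n, "Airport Continent": c,
     "Flight Status": s, "Age": a}
    for g, n, c, s, a in [
        ("Male", "Canada", "NAM", "On Time", 34),
        ("Female", "France", "EU", "Delayed", 51),
        ("Male", "France", "EU", "On Time", 27),
        ("Female", "Canada", "NAM", "Delayed", 63),
        ("Male", "Canada", "NAM", "Delayed", 45),
        ("Female", "France", "EU", "On Time", 19),
        ("Male", "France", "EU", "Delayed", 72),
        ("Female", "Canada", "NAM", "On Time", 38),
    ]
]

fake = make_fake_rows(original, fraction=0.25, seed=1)
augmented = original + fake
print(len(original), len(fake), len(augmented))
print(Counter(r["Flight Status"] for r in augmented))
```

With `fraction=0.25`, a 1,499-row input yields `int(1499 * 0.25) = 374` new rows, which matches the increment reported above. Comparing the value counts before and after augmentation, as in the last line, is a quick check that the target balance has not drifted.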