Data Description
The Airline Dataset contains 15 features describing passenger personal details, flight details, and flight performance. Each feature is described in detail below.
Feature Variables
-
Passenger ID: int64
- Unique identifier for passengers.
-
First Name: string
- The first name of the passenger.
-
Last Name: string
- The last name of the passenger.
-
Gender: string
- Values are categorical (Male, Female).
- Transformation: May be binary encoded (e.g., Male = 0, Female = 1) or one-hot encoded for analysis.
-
Age: int64
- Age of the passenger.
- Range: [1, 90]
- Min: 1, Max: 90, Mean: 46.25
-
Nationality: string
- The country of origin for the passenger.
-
Airport Name: string
- Name of the originating or destination airport.
-
Airport Country Code: string
- Country code of the airport (e.g., USA, CANADA, FRANCE).
-
Country Name: string
- Full country name corresponding to the airport.
-
Airport Continent: string
- Values include NAM (North America), EU (Europe), etc.
- Transformation: May be encoded as categorical numerical values or one-hot encoded.
-
Continents: string
- Generalized continent grouping.
- Transformation: Similar to Airport Continent, encode numerically if necessary.
-
Departure Date: string
- The date of the flight's departure.
- Transformation: Convert to a datetime format for time-based analysis.
-
Arrival Airport: string
- Code of the destination airport.
-
Pilot Name: string
- Name of the pilot for the flight.
-
Flight Status: string
- Categorical values (On Time, Delayed).
- Transformation: Encode numerically for machine learning (e.g., On Time = 0, Delayed = 1).
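The binary and one-hot encodings mentioned for Gender and Airport Continent can be sketched as follows. This is a minimal illustration using only the Python standard library; the sample rows, the `GENDER_CODES` mapping, and the `one_hot` helper are assumptions made for the example, not part of the dataset or of any existing pipeline.

```python
# Hypothetical sample rows; the values follow the categories named above.
rows = [
    {"Gender": "Male", "Airport Continent": "NAM"},
    {"Gender": "Female", "Airport Continent": "EU"},
    {"Gender": "Female", "Airport Continent": "NAM"},
]

# Binary encoding for a two-category feature (mapping taken from the example above).
GENDER_CODES = {"Male": 0, "Female": 1}


def one_hot(value, categories, prefix):
    """Return one 0/1 indicator per category, keyed as '<prefix>_<category>'."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


# Collect the categories that actually occur, in a stable (sorted) order.
continents = sorted({r["Airport Continent"] for r in rows})

encoded = []
for r in rows:
    out = {"Gender": GENDER_CODES[r["Gender"]]}
    out.update(one_hot(r["Airport Continent"], continents, "Airport Continent"))
    encoded.append(out)

for row in encoded:
    print(row)
```

One-hot encoding avoids implying an order between continents, which a single numeric code would do; the trade-off is one extra column per category.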
Target Variables
For the Airline Dataset, the target variable is Flight Status.
Flight Status: Categorical -> Binary
- The primary target variable, indicating whether the flight was on time or delayed.
- It helps to understand how well the airline operates and what causes delays.
- Transformation: Convert to binary values for classification models with only two possible outcomes (e.g., On Time = 0, Delayed = 1).
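A minimal sketch of the binary target encoding, assuming the On Time = 0, Delayed = 1 mapping given above. The `encode_status` helper is hypothetical. It raises an error for any label outside the mapping, so that a third category in the data is noticed rather than silently mislabelled.

```python
# Mapping taken from the example above.
STATUS_TO_LABEL = {"On Time": 0, "Delayed": 1}


def encode_status(status):
    """Map a Flight Status string to 0 or 1; reject labels outside the mapping."""
    key = status.strip()
    if key not in STATUS_TO_LABEL:
        raise ValueError(f"Unexpected Flight Status: {status!r}")
    return STATUS_TO_LABEL[key]


labels = [encode_status(s) for s in ["On Time", "Delayed", "On Time"]]
print(labels)
```

If the data contains further statuses, they would need either their own class (making the problem multi-class) or an explicit rule for folding them into one of the two outcomes.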
Application
- Operational Analysis: Helps identify patterns in delays based on flight details like departure time, airport, or pilot name.
- Predictive Modelling: Makes it possible to build models that predict delays, which helps improve planning as well as customer comfort and experience.
Factors that can influence flight status include airport congestion (the busier the airport, the more likely a delay), flights during peak hours, and weather conditions.
Data Transformation
No transformations appear to have been performed on the dataset.
However, there are several opportunities for data transformation. Here are some suggestions:
- Departure Date: Convert to a datetime format and extract components like year, month, and day of the week for analysis.
- Flight Status: Encode as numerical values (e.g., 0 = Delayed, 1 = On Time, 2 = Cancelled).
- First Name and Last Name could be dropped unless you are performing text or name-based analysis.
- Merge Airport Name and Airport Country Code into a single feature for uniqueness.
These are suggestions only; each transformation should be applied where it is feasible for the Airline Dataset.
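The suggested transformations can be sketched on a single row as follows. This is an illustration only: the date format (`%m/%d/%Y`), the sample values, the `transform` helper, and the `Airport Key` column name are all assumptions, and the real Departure Date strings may need a different format.

```python
from datetime import datetime

# Assumption: dates are written month/day/year, e.g. "6/28/2022".
DATE_FORMAT = "%m/%d/%Y"

# Hypothetical row with made-up values.
sample_row = {
    "First Name": "Jane",
    "Last Name": "Doe",
    "Airport Name": "Example Airport",
    "Airport Country Code": "US",
    "Departure Date": "6/28/2022",
}


def transform(row):
    """Apply the suggested transformations to one row and return a new dict."""
    out = dict(row)

    # Convert the date string to a datetime and extract components.
    departure = datetime.strptime(row["Departure Date"], DATE_FORMAT)
    out["Departure Date"] = departure
    out["Year"] = departure.year
    out["Month"] = departure.month
    out["Day of Week"] = departure.weekday()  # 0 = Monday ... 6 = Sunday

    # Drop the name columns, which are not needed outside name-based analysis.
    out.pop("First Name", None)
    out.pop("Last Name", None)

    # Merge airport name and country code into one identifying feature.
    out["Airport Key"] = f'{out.pop("Airport Name")} ({out.pop("Airport Country Code")})'
    return out


transformed = transform(sample_row)
print(transformed)
```

The function returns a copy, so the original row stays available for comparison.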
Realistic Fake Dataset Analysis
Below is an analysis of the fake data and its potential effects:
Effect on Model Training
- The original dataset has 1,499 rows. After adding 25% realistic fake data, the size increased to 1,873 rows, an increment of 374 rows.
- Variability in categorical features like Nationality, Airport Continent, and Flight Status increased.
- For Gender, the Male frequency increased from 763 to 895. For Flight Status, the "On Time" frequency increased from 515 to 641, altering the target variable distribution.
- If the fake data is realistic, it might introduce diverse patterns, potentially improving the model's generalization. If it is unrealistic, it could mislead the model, introducing errors or overfitting to artifacts in the fake data.
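To see how much the distributions actually shift, the counts reported above can be turned into shares of the total. The short sketch below uses only the figures from this section; the `share` helper is introduced here for illustration.

```python
original_rows = 1499
augmented_rows = 1873

added_rows = augmented_rows - original_rows
growth = added_rows / original_rows


def share(count, total):
    """Fraction of rows that fall into a category."""
    return count / total


male_before = share(763, original_rows)
male_after = share(895, augmented_rows)
on_time_before = share(515, original_rows)
on_time_after = share(641, augmented_rows)

print(f"Rows added: {added_rows} ({growth:.1%} of the original)")
print(f"Male share: {male_before:.1%} -> {male_after:.1%}")
print(f"On Time share: {on_time_before:.1%} -> {on_time_after:.1%}")
```

With these counts the Male share falls from roughly 50.9% to 47.8%, while the "On Time" share stays close to 34% before and after. Comparing shares rather than raw counts makes it easier to judge whether the added rows changed the balance of a feature.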
Approach for generating fake data
-
Steps Taken:
- Used realistic values for categorical features like Nationality, Airport Continent, and Flight Status, ensuring alignment with existing distributions.
- For numerical features (e.g., Age), data was sampled or generated within valid ranges (e.g., 1–90).
- Target labels (Flight Status) were assigned proportions similar to the original data.
-
Key Considerations:
- Ensured the fake data follows the same distributions as the real data.
- Preserved the balance across target classes to avoid skewing the dataset.
- Introduced some variability in features like Nationality or Airport Country Code to enrich the dataset.
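One way to follow this approach is to sample each categorical feature in proportion to its observed frequencies and draw Age within the valid range. The sketch below is a hedged illustration, not the procedure actually used: the `sample_like` and `make_fake_rows` helpers, the tiny `real_rows` sample, and the uniform draw for Age are all assumptions.

```python
import random
from collections import Counter


def sample_like(values, k, rng):
    """Draw k values with probabilities proportional to their observed frequencies."""
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=k)


def make_fake_rows(real_rows, fraction=0.25, seed=0):
    """Generate about `fraction` * len(real_rows) fake rows resembling real_rows."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    k = int(len(real_rows) * fraction)
    columns = ["Nationality", "Airport Continent", "Flight Status"]
    sampled = {c: sample_like([r[c] for r in real_rows], k, rng) for c in columns}

    fake = []
    for i in range(k):
        row = {c: sampled[c][i] for c in columns}
        # Assumption: Age is drawn uniformly from the valid range [1, 90].
        row["Age"] = rng.randint(1, 90)
        fake.append(row)
    return fake


# Hypothetical miniature stand-in for the real dataset.
real_rows = [
    {"Nationality": "Japan", "Airport Continent": "NAM", "Flight Status": "On Time"},
    {"Nationality": "Brazil", "Airport Continent": "EU", "Flight Status": "Delayed"},
    {"Nationality": "Japan", "Airport Continent": "NAM", "Flight Status": "Delayed"},
    {"Nationality": "France", "Airport Continent": "EU", "Flight Status": "On Time"},
    {"Nationality": "Brazil", "Airport Continent": "NAM", "Flight Status": "On Time"},
    {"Nationality": "Japan", "Airport Continent": "EU", "Flight Status": "Delayed"},
    {"Nationality": "France", "Airport Continent": "NAM", "Flight Status": "On Time"},
    {"Nationality": "Japan", "Airport Continent": "EU", "Flight Status": "On Time"},
]

fake_rows = make_fake_rows(real_rows)
print(len(fake_rows), fake_rows)
```

Each column is sampled independently here, which keeps the per-feature distributions but ignores any relationship between features (for example, between airport and delay rate). If those relationships matter for the model, whole rows could be resampled and perturbed instead.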
Influence on the Model
The fake data can influence the model both positively and negatively:
Positive Influence:
- If the fake data diversifies the dataset, it could help the model generalize better to unseen scenarios.
- If the fake data corrects a feature like Flight Status (for example, by balancing its classes), it may improve predictive performance.
- It adds more examples for the model to learn from.
Negative Influence:
- Unrealistic patterns in fake data (e.g., unusual ages or impossible categories) could introduce noise, reducing model accuracy.
- If the fake data is unrealistic, the model learns the wrong patterns and performs worse.
- If the fake data introduces too much randomness, the model might struggle to distinguish important patterns from irrelevant differences.