Our app implements two ways of generating fake data: **Random** and **Proportional**.
### Random
This approach draws every value independently at random from its allowed range, so the generated data follows no real-world proportions and has no correlations between features.
Input: The range of acceptable values for each vehicle parameter - for example, a price range, a list of all possible vehicles, etc.
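
The idea can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the `PARAM_SPEC` dictionary, the parameter names, the ranges, and the choice of uniform sampling are all assumptions made for the example.

```python
import random

# Hypothetical parameter specification; names and ranges are illustrative only.
PARAM_SPEC = {
    "price": (500, 100_000),                    # numeric range (min, max)
    "power_hp": (40, 600),                      # numeric range (min, max)
    "brand": ["audi", "bmw", "ford", "skoda"],  # list of allowed values
}


def generate_random_row(spec, rng):
    """Draw every feature independently from its allowed range or list."""
    row = {}
    for name, allowed in spec.items():
        if isinstance(allowed, tuple):
            low, high = allowed
            row[name] = rng.randint(low, high)  # uniform integer in [low, high]
        else:
            row[name] = rng.choice(allowed)     # uniform pick from the list
    return row


def generate_random_rows(spec, n, seed=None):
    """Generate n independent rows; pass a seed for reproducible output."""
    rng = random.Random(seed)
    return [generate_random_row(spec, rng) for _ in range(n)]


rows = generate_random_rows(PARAM_SPEC, 5, seed=0)
```

Because each feature is drawn on its own, a row can easily pair a very low price with a very high power value, which is exactly the kind of unrealistic combination this approach produces.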
### Proportional
This method is based on generating data that reflects the proportions and distributions observed in an existing dataset. Instead of generating purely random values, the proportional approach ensures that the fake data is statistically similar to the original dataset by maintaining the same probability distributions for categorical variables and similar ranges or distributions for numerical variables.
Input: An existing dataset containing realistic data for vehicles.
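
One possible way to do this is sketched below, using only the standard library. It is an assumption-laden illustration rather than the app's implementation: the tiny `dataset`, the column names, and the choice to resample numeric values from the observed ones (instead of fitting a parametric distribution) are all hypothetical.

```python
import random
from collections import Counter


def sample_categorical(values, n, rng):
    """Sample categories with probability proportional to observed frequency."""
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


def sample_numeric(values, n, rng):
    """Resample observed values with replacement (empirical distribution)."""
    return rng.choices(values, k=n)


def generate_proportional_rows(dataset, n, seed=None):
    """Sample each column separately so its marginal distribution is kept."""
    rng = random.Random(seed)
    columns = {}
    for name in dataset[0]:
        observed = [row[name] for row in dataset]
        if isinstance(observed[0], str):
            columns[name] = sample_categorical(observed, n, rng)
        else:
            columns[name] = sample_numeric(observed, n, rng)
    return [{name: columns[name][i] for name in columns} for i in range(n)]


# Hypothetical miniature dataset standing in for the real vehicle data.
dataset = [
    {"brand": "skoda", "price": 4500},
    {"brand": "skoda", "price": 5200},
    {"brand": "skoda", "price": 6100},
    {"brand": "bmw", "price": 18000},
]

fake_rows = generate_proportional_rows(dataset, 10, seed=0)
```

In this sketch "skoda" is three times as likely to be drawn as "bmw", mirroring the input proportions, but the brand and price of a generated row are still chosen independently of each other.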
### Comparison of the Two Approaches
The **Random** approach provides completely unstructured data by generating values independently for each feature within defined ranges, ensuring no dependencies or patterns. In contrast, the **Proportional** approach generates data that mimics the distributions of features in an existing dataset, making it more realistic by reflecting the original data's proportions and variability. However, neither method considers correlations between features, meaning that relationships like higher power corresponding to higher price are not captured, which can limit their utility in creating realistic datasets.
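
The loss of correlation can be demonstrated with a small experiment. The sketch below uses made-up data in which price rises with power; the linear relationship, the noise level, and the sample size are assumptions chosen only to make the effect visible.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)

# Hypothetical "real" data: price grows with power, plus some noise.
power = [rng.uniform(50, 400) for _ in range(2000)]
price = [100 * p + rng.gauss(0, 3000) for p in power]

# Per-column resampling, as in the proportional approach: each column keeps
# its marginal distribution, but values are no longer paired row by row.
fake_power = rng.choices(power, k=2000)
fake_price = rng.choices(price, k=2000)

real_corr = pearson(power, price)            # strong positive correlation
fake_corr = pearson(fake_power, fake_price)  # close to zero
```

The synthetic columns look plausible individually, yet the power-to-price relationship present in the source data disappears once the columns are sampled separately.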
### Final Decision
After implementing and testing both approaches, we used neither in the final version of the code. Both methods introduced additional noise and inconsistencies that degraded model training results. Since the original dataset already contains ~37,000 rows of real data, augmenting it with synthetic data was unnecessary and counterproductive. The real data provided sufficient diversity and relevance for accurate and robust training.