## Data Generation Process
The data generation process involves augmenting the original dataset with synthetic data to increase the dataset size and provide more varied examples for model training. This process is divided into several steps to ensure that the generated data is realistic, maintains consistency with the original dataset, and introduces beneficial noise to help improve model generalization.

#### Step 1: Determine Number of Synthetic Rows to Add

The synthetic data will be generated to augment the dataset by approximately 25%. This is done by calculating 25% of the total number of rows in the original dataset.
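A minimal sketch of this calculation, assuming the original dataset has been loaded into a pandas `DataFrame` named `data` (the frame below is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical stand-in for the original dataset
data = pd.DataFrame({'Close': range(100)})

# Synthetic rows = 25% of the original row count
num_synthetic_rows = int(len(data) * 0.25)
# num_synthetic_rows == 25
```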
Purpose: This step ensures that the synthetic data constitutes 25% of the total dataset, allowing the model to benefit from the additional data without overwhelming the original dataset.

#### Step 2: Generate Synthetic Dates

The synthetic data needs to have dates that extend from the last date in the original dataset. The dates are generated sequentially, starting from the last date, adding one day at a time for the new rows.

```python
from datetime import datetime, timedelta

last_date = datetime.strptime(data['Date'].iloc[-1], "%Y-%m-%d")
synthetic_dates = [(last_date + timedelta(days=i + 1)).strftime("%Y-%m-%d")
                   for i in range(num_synthetic_rows)]
```

Purpose: This ensures that the synthetic data is temporally consistent, starting immediately after the last real data point in the dataset.
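A self-contained illustration of the date logic, with a hypothetical last date standing in for `data['Date'].iloc[-1]`:

```python
from datetime import datetime, timedelta

last_date = datetime.strptime("2024-03-01", "%Y-%m-%d")  # hypothetical last real date
num_synthetic_rows = 3

synthetic_dates = [(last_date + timedelta(days=i + 1)).strftime("%Y-%m-%d")
                   for i in range(num_synthetic_rows)]
# synthetic_dates == ['2024-03-02', '2024-03-03', '2024-03-04']
```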

#### Step 3: Generate Synthetic Data for Each Feature

For each of the numerical features (Open, High, Low, Close, Adj Close, Volume), synthetic data is generated by adding random noise based on the mean and standard deviation of the respective feature from the original dataset.

```python
import numpy as np

# Generate synthetic prices by adding random noise based on each
# feature's mean and standard deviation
np.random.seed(0)  # For reproducible results

open_synthetic = data['Open'].mean() + np.random.normal(0, data['Open'].std() * 0.05, num_synthetic_rows)
high_synthetic = data['High'].mean() + np.random.normal(0, data['High'].std() * 0.05, num_synthetic_rows)
low_synthetic = data['Low'].mean() + np.random.normal(0, data['Low'].std() * 0.05, num_synthetic_rows)
close_synthetic = data['Close'].mean() + np.random.normal(0, data['Close'].std() * 0.05, num_synthetic_rows)
adj_close_synthetic = data['Adj Close'].mean() + np.random.normal(0, data['Adj Close'].std() * 0.05, num_synthetic_rows)
```

Purpose:

- Random noise: each synthetic value is drawn from a normal distribution centered on the feature's mean, with a standard deviation scaled to 5% of the original feature's standard deviation. This introduces slight variations that simulate natural fluctuations.
- Ensuring realism: because the noise is derived from the original statistics, the generated data keeps a similar range and distribution to the original data, making it realistic and useful for model training.

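To see that 5%-of-std noise keeps synthetic values close to the source statistics, here is a small check on a hypothetical price series:

```python
import numpy as np

np.random.seed(0)
source = np.random.normal(100.0, 10.0, 1000)  # hypothetical 'Close' series

synthetic = source.mean() + np.random.normal(0, source.std() * 0.05, 1000)

# The synthetic values cluster tightly around the source mean,
# with a spread of roughly 5% of the source's spread
spread_ratio = synthetic.std() / source.std()
```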

#### Step 4: Generate Synthetic Volume Data with Spikes

Volume data is particularly important in financial markets because it reflects trading activity. Synthetic volume data is therefore generated with added variability to simulate real-world market behavior.

```python
original_volume_mean = data['Volume'].mean()
original_volume_std = data['Volume'].std()

# Generate realistic synthetic volume data
volume_synthetic = np.random.normal(loc=original_volume_mean,
                                    scale=original_volume_std * 0.2,
                                    size=num_synthetic_rows)

# Introduce spikes to simulate real market behavior
# (5% of the synthetic rows receive a spike)
spike_indices = np.random.choice(num_synthetic_rows,
                                 size=int(num_synthetic_rows * 0.05),
                                 replace=False)
volume_synthetic[spike_indices] *= np.random.uniform(1.5, 3.5, size=len(spike_indices))

# Ensure no negative values
volume_synthetic = np.abs(volume_synthetic).astype(int)
```

Purpose:

- Base volume with noise: volume values are drawn from a normal distribution whose mean and standard deviation come from the original dataset, with the scale set to 20% of the original standard deviation to create realistic variation.
- Spikes: about 5% of the synthetic volume rows are randomly selected and multiplied by a factor between 1.5 and 3.5, simulating sudden surges in market activity (e.g., after news releases or major events). This introduces further variation and realism into the data.

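A runnable sketch of the spike mechanism, using hypothetical volume statistics in place of the original dataset's:

```python
import numpy as np

np.random.seed(0)
n = 1000
# Hypothetical mean/std standing in for the original volume statistics
volume = np.random.normal(loc=1_000_000, scale=200_000, size=n)

# Exactly 5% of rows are boosted by a random factor in [1.5, 3.5)
spike_idx = np.random.choice(n, size=int(n * 0.05), replace=False)
volume[spike_idx] *= np.random.uniform(1.5, 3.5, size=len(spike_idx))

# Clamp to non-negative integers, as in the pipeline above
volume = np.abs(volume).astype(int)
```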

#### Step 6: Create DataFrame for Synthetic Data

The synthetic data for all features (Open, High, Low, Close, Adj Close, and Volume) is compiled into a DataFrame, which is then concatenated with the original data.

```python
import pandas as pd

synthetic_data = pd.DataFrame({
    'Date': synthetic_dates,
    'Open': open_synthetic,
    'High': high_synthetic,
    'Low': low_synthetic,
    'Close': close_synthetic,
    'Adj Close': adj_close_synthetic,
    'Volume': volume_synthetic
})
```

Purpose: This organizes the synthetic data into a structured format, making it ready for analysis and model training.
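For instance, with two hypothetical rows the resulting frame has the expected columns and shape:

```python
import pandas as pd

# Hypothetical two-row synthetic values
synthetic_data = pd.DataFrame({
    'Date': ['2024-03-02', '2024-03-03'],
    'Open': [101.2, 100.8],
    'High': [102.5, 101.9],
    'Low': [100.1, 99.7],
    'Close': [101.8, 100.9],
    'Adj Close': [101.8, 100.9],
    'Volume': [1200000, 950000],
})
# synthetic_data.shape == (2, 7)
```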

#### Step 7: Concatenate Original and Synthetic Data

Finally, the synthetic data is concatenated with the original dataset to create an enhanced dataset.

```python
enhanced_data = pd.concat([data, synthetic_data], ignore_index=True)
```

Purpose: This step combines the original data and synthetic data, creating a larger, augmented dataset that will be used for model training.
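A tiny example of the effect of `ignore_index=True`, using hypothetical two-row frames:

```python
import pandas as pd

data = pd.DataFrame({'Close': [1.0, 2.0]})            # hypothetical original
synthetic_data = pd.DataFrame({'Close': [2.1, 1.9]})  # hypothetical synthetic

enhanced_data = pd.concat([data, synthetic_data], ignore_index=True)
# the combined frame gets a fresh 0..3 index instead of repeating 0, 1, 0, 1
```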

### Conclusion

The synthetic data generation process introduces additional rows into the dataset, enriching the feature space and providing the model with more examples to learn from. By carefully generating realistic noise for each feature and simulating market behavior through volume spikes, this approach helps the model generalize better to unseen data. However, it’s crucial to ensure that the synthetic data aligns well with the real-world data distribution and is handled correctly during training to avoid introducing biases or overfitting.