|
|
### Target Variable

Close: Predicting the Close price for future days can be useful for traders aiming to anticipate market moves. This is a typical regression problem, where the goal is to predict a continuous price value.
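As a minimal sketch (with made-up prices and a hypothetical `Target` column name), next-day Close prediction can be framed as regression by shifting the Close column back one step:

```python
import pandas as pd

# Hypothetical illustration: tomorrow's Close becomes today's regression target.
data = pd.DataFrame({'Close': [100.0, 101.5, 99.8, 102.3, 103.1]})
data['Target'] = data['Close'].shift(-1)  # next day's Close
data = data.dropna()                      # last row has no known future price

X = data[['Close']]  # features (here just the current Close)
y = data['Target']   # regression target
```

Any regressor that accepts a feature matrix and a continuous target can then be fit on `X` and `y`.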
|
|
|
|
|
|
|
|
|
## Data Generation Process
|
|
|
|
The data generation process involves augmenting the original dataset with synthetic data to increase the dataset size and provide more varied examples for model training. This process is divided into several steps to ensure that the generated data is realistic, maintains consistency with the original dataset, and introduces beneficial noise to help improve model generalization.
|
|
|
|
|
|
|
|
#### Step 1: Determine the Number of Synthetic Rows to Add
|
|
|
|
The synthetic data will be generated to augment the dataset by approximately 25%. This is done by calculating 25% of the total number of rows in the original dataset.
|
|
|
|
|
|
|
|
```python
num_synthetic_rows = int(len(data) * 0.25)
```
|
|
|
|
Purpose: This step sizes the synthetic portion at 25% of the original dataset (about 20% of the augmented total), allowing the model to benefit from the additional data without the synthetic rows overwhelming the original dataset.
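A quick worked example with a hypothetical 1,000-row dataset makes the proportions concrete:

```python
# Hypothetical dataset size used only for illustration.
data_len = 1000
num_synthetic_rows = int(data_len * 0.25)          # 25% of the original rows
augmented_len = data_len + num_synthetic_rows      # size after augmentation
print(num_synthetic_rows, augmented_len)           # 250 1250
```

Note that 250 synthetic rows are 25% of the original 1,000 rows but only 20% (250/1250) of the augmented dataset.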
|
|
|
|
#### Step 2: Generate Synthetic Dates
|
|
|
|
The synthetic data needs to have dates that extend from the last date in the original dataset. The dates are generated sequentially, starting from the last date, adding one day at a time for the new rows.
|
|
|
|
|
|
|
|
```python
from datetime import datetime, timedelta

last_date = datetime.strptime(data['Date'].iloc[-1], "%Y-%m-%d")
synthetic_dates = [(last_date + timedelta(days=i + 1)).strftime("%Y-%m-%d")
                   for i in range(num_synthetic_rows)]
```
|
|
|
|
Purpose: This ensures that the synthetic data is temporally consistent, starting immediately after the last real data point in the dataset.
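A toy run with a hypothetical last date shows the rollover across month and year boundaries. Note that this simple approach generates consecutive calendar days, whereas real trading data skips weekends and holidays:

```python
from datetime import datetime, timedelta

# Hypothetical last date; generate three synthetic dates after it.
last_date = datetime.strptime("2023-12-29", "%Y-%m-%d")
synthetic_dates = [(last_date + timedelta(days=i + 1)).strftime("%Y-%m-%d")
                   for i in range(3)]
print(synthetic_dates)  # ['2023-12-30', '2023-12-31', '2024-01-01']
```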
|
|
|
|
#### Step 3: Generate Synthetic Data for Each Feature
|
|
|
|
For each of the numerical price features (Open, High, Low, Close, Adj Close), synthetic data is generated by adding random noise based on the mean and standard deviation of the respective feature from the original dataset; Volume is handled separately in Step 4.
|
|
|
|
|
|
|
|
```python
import numpy as np

# Generate synthetic data by adding random noise based on mean and standard deviation
np.random.seed(0)  # For reproducible results

open_synthetic = data['Open'].mean() + np.random.normal(0, data['Open'].std() * 0.05, num_synthetic_rows)
high_synthetic = data['High'].mean() + np.random.normal(0, data['High'].std() * 0.05, num_synthetic_rows)
low_synthetic = data['Low'].mean() + np.random.normal(0, data['Low'].std() * 0.05, num_synthetic_rows)
close_synthetic = data['Close'].mean() + np.random.normal(0, data['Close'].std() * 0.05, num_synthetic_rows)
adj_close_synthetic = data['Adj Close'].mean() + np.random.normal(0, data['Adj Close'].std() * 0.05, num_synthetic_rows)
```
|
|
|
|
Purpose:

- Random noise: Values are drawn from a normal distribution centered on the feature's mean, with a standard deviation scaled to 5% of the original data's standard deviation. This introduces slight variations to simulate natural fluctuations.
- Ensuring realism: This noise keeps the generated data in a similar range and distribution as the original data, making it realistic and useful for model training.
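The claim about range and spread can be sanity-checked on toy data (the numbers below are made up): values drawn around the original mean with 5% of its standard deviation stay close to that mean and have a much tighter spread than the original series.

```python
import numpy as np

# Hypothetical original series, standing in for a real price column.
original = np.array([10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.4, 10.0])

np.random.seed(0)
synthetic = original.mean() + np.random.normal(0, original.std() * 0.05, 1000)

# The synthetic sample shares the original centre with far less dispersion.
print(abs(synthetic.mean() - original.mean()) < 0.01)
print(synthetic.std() < original.std() * 0.1)
```

One design consequence worth noting: because values are centered on the overall mean rather than on recent prices, the synthetic series does not continue any trend in the original data.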
|
|
|
|
#### Step 4: Generate Synthetic Volume Data with Spikes

Volume data is particularly important in financial markets, as it reflects trading activity. Synthetic volume data is generated with added variability to simulate real-world market behavior.
|
|
|
|
|
|
|
|
|
|
|
|
```python
original_volume_mean = data['Volume'].mean()
original_volume_std = data['Volume'].std()

# Generate realistic synthetic volume data
volume_synthetic = np.random.normal(loc=original_volume_mean, scale=original_volume_std * 0.2, size=num_synthetic_rows)

# Introduce spikes to simulate real market behavior (5% of synthetic rows get spikes)
spike_indices = np.random.choice(num_synthetic_rows, size=int(num_synthetic_rows * 0.05), replace=False)
volume_synthetic[spike_indices] *= np.random.uniform(1.5, 3.5, size=len(spike_indices))

# Ensure no negative values
volume_synthetic = np.abs(volume_synthetic).astype(int)
```
|
|
|
|
Purpose:

- Base volume with noise: Volume values are generated from a normal distribution with a mean and standard deviation calculated from the original dataset. The scale is set to 20% of the original standard deviation to create realistic variations.
- Spikes: About 5% of the synthetic volume rows are randomly selected to receive artificially high values (spikes), simulating sudden surges in market activity (e.g., after news releases or major events). This introduces further variation and realism.
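The spike logic can be exercised end to end on hypothetical volume statistics (the mean and standard deviation below are invented for illustration):

```python
import numpy as np

np.random.seed(0)
n = 1000                                   # hypothetical number of synthetic rows
mean_v, std_v = 1_000_000.0, 200_000.0     # hypothetical volume statistics

# Base volume with noise at 20% of the original standard deviation.
volume_synthetic = np.random.normal(loc=mean_v, scale=std_v * 0.2, size=n)

# 5% of rows receive a 1.5x-3.5x spike multiplier.
spike_indices = np.random.choice(n, size=int(n * 0.05), replace=False)
volume_synthetic[spike_indices] *= np.random.uniform(1.5, 3.5, size=len(spike_indices))

# No negative volumes after the final abs/int conversion.
volume_synthetic = np.abs(volume_synthetic).astype(int)

print(len(spike_indices))           # 50
print((volume_synthetic >= 0).all())
```

The spiked rows stand well above the unspiked ones, which is exactly the heavy right tail real volume series tend to show.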
|
|
|
|
#### Step 5: Handle Missing or Infinite Values
|
|
|
|
After generating the synthetic data, any missing or invalid values (such as NaN or inf) are replaced with the respective feature's mean from the original dataset to ensure data integrity.
|
|
|
|
|
|
|
|
|
|
|
|
```python
# Replace NaN or infinite values in each column of the synthetic data with
# the mean of the respective column from the original dataset.
# (Assumes the synthetic values have already been assembled into the
# synthetic_data DataFrame shown in Step 6.)
synthetic_data = synthetic_data.replace([np.inf, -np.inf], np.nan)
synthetic_data = synthetic_data.fillna({
    'Open': data['Open'].mean(),
    'High': data['High'].mean(),
    'Low': data['Low'].mean(),
    'Close': data['Close'].mean(),
    'Adj Close': data['Adj Close'].mean(),
    'Volume': data['Volume'].mean()
})
```
|
|
|
|
Purpose: This step ensures that the synthetic data is clean and free of missing or extreme values that could interfere with model training.
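The cleaning behaves as described, as a toy single-column example shows (the column values and stand-in mean are hypothetical):

```python
import numpy as np
import pandas as pd

# One hypothetical column containing a valid value, an inf, and a NaN.
synthetic_data = pd.DataFrame({'Open': [10.0, np.inf, np.nan]})
col_mean = 10.0  # stands in for data['Open'].mean()

# inf -> NaN, then NaN -> column mean, matching the step above.
synthetic_data = synthetic_data.replace([np.inf, -np.inf], np.nan)
synthetic_data = synthetic_data.fillna({'Open': col_mean})

print(synthetic_data['Open'].tolist())  # [10.0, 10.0, 10.0]
```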
|
|
|
|
#### Step 6: Create DataFrame for Synthetic Data
|
|
|
|
The synthetic data for all features (Open, High, Low, Close, Adj Close, and Volume) is compiled into a DataFrame, which is then concatenated with the original data.
|
|
|
|
|
|
|
|
|
|
|
|
```python
import pandas as pd

synthetic_data = pd.DataFrame({
    'Date': synthetic_dates,
    'Open': open_synthetic,
    'High': high_synthetic,
    'Low': low_synthetic,
    'Close': close_synthetic,
    'Adj Close': adj_close_synthetic,
    'Volume': volume_synthetic
})
```
|
|
|
|
Purpose: This organizes the synthetic data into a structured format, making it ready for analysis and model training.
|
|
|
|
#### Step 7: Concatenate Original and Synthetic Data
|
|
|
|
Finally, the synthetic data is concatenated with the original dataset to create an enhanced dataset.
|
|
|
|
|
|
|
|
|
|
|
|
```python
enhanced_data = pd.concat([data, synthetic_data], ignore_index=True)
```
|
|
|
|
Purpose: This step combines the original data and synthetic data, creating a larger, augmented dataset that will be used for model training.
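A toy check (with made-up rows) confirms that the concatenation keeps every row from both frames and, thanks to `ignore_index=True`, reindexes the result from zero:

```python
import pandas as pd

# Hypothetical two-row original and one-row synthetic frame.
data = pd.DataFrame({'Date': ['2024-01-01', '2024-01-02'],
                     'Close': [100.0, 101.0]})
synthetic_data = pd.DataFrame({'Date': ['2024-01-03'],
                               'Close': [100.5]})

enhanced_data = pd.concat([data, synthetic_data], ignore_index=True)
print(len(enhanced_data))         # 3
print(list(enhanced_data.index))  # [0, 1, 2]
```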
|
|
|
|
### Conclusion
|
|
|
|
The synthetic data generation process introduces additional rows into the dataset, enriching the feature space and providing the model with more examples to learn from. By carefully generating realistic noise for each feature and simulating market behavior through volume spikes, this approach helps the model generalize better to unseen data. However, it’s crucial to ensure that the synthetic data aligns well with the real-world data distribution and is handled correctly during training to avoid introducing biases or overfitting.
|
|
|
|
|