Commit 8e7797f3 authored by Edward Mawuko Samlafo-Adams

readme

parent 5825d92f
@@ -96,10 +96,35 @@ SAS-EN-TEST/
- **Salary**: Numeric data for analysis and prediction.
- **Description**: Job descriptions used for text analysis.
- **Data Preprocessing**:
  - Removed null values and duplicates.
  - Tokenized job descriptions for machine learning.
  - Derived insights like job distribution and salary trends.
### Approach for Handling Outliers
1. **Detection Methods**:
   - **Quantile-based filtering**: Values below the 5th percentile or above the 95th percentile were flagged as outliers.
   - **Z-score analysis**: Values with a Z-score greater than 3 were flagged as potential outliers.
2. **Handling**:
   - Detected outliers were either removed or replaced with the median value of the respective column.
   - Columns with many outliers were monitored closely for distribution changes after outlier treatment.
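
The two detection rules and the median replacement can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `flag_outliers` and `replace_with_median`, the sample salary values, the use of the population standard deviation, and the comparison on the absolute Z-score are all assumptions made for the example.

```python
import pandas as pd


def flag_outliers(values: pd.Series, lower_q: float = 0.05,
                  upper_q: float = 0.95, z_threshold: float = 3.0) -> pd.DataFrame:
    """Return one boolean column per detection rule (hypothetical helper)."""
    # Quantile rule: anything outside the [5th, 95th] percentile band is flagged.
    lower, upper = values.quantile(lower_q), values.quantile(upper_q)
    quantile_flag = (values < lower) | (values > upper)

    # Z-score rule. Assumptions: population std (ddof=0) and absolute value,
    # so unusually low values are flagged as well as unusually high ones.
    std = values.std(ddof=0)
    z_scores = (values - values.mean()) / std if std > 0 else values * 0.0
    z_flag = z_scores.abs() > z_threshold

    return pd.DataFrame({"quantile_outlier": quantile_flag,
                         "zscore_outlier": z_flag})


def replace_with_median(values: pd.Series, mask: pd.Series) -> pd.Series:
    """Replace flagged entries with the median of the whole column."""
    return values.where(~mask, values.median())


# Made-up salary column: 30 typical values plus one extreme entry.
salaries = pd.Series([50_000.0 + 1_000.0 * i for i in range(30)] + [1_000_000.0])
flags = flag_outliers(salaries)
cleaned = replace_with_median(salaries, flags["zscore_outlier"])
```

The quantile rule always flags roughly 10% of the rows by construction, while the Z-score rule only flags values far from the mean, so the two masks are best inspected separately before deciding whether to drop or replace.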
---
### Approach for Creating Fake Data
1. **Synthetic Data Generation**:
   - Used Python libraries such as `numpy` and `pandas` to generate random values within realistic ranges.
   - Randomized fields include:
     - **Salary range**: Values were generated within observed, realistic salary intervals.
     - **Experience level**: Levels were generated with an even distribution to balance the dataset.
2. **Textual Data**:
   - Synthetic job descriptions were constructed from predefined templates combined with randomly generated keywords.
3. **Purpose**:
   - Balancing underrepresented categories.
   - Expanding the dataset to improve model robustness during training and testing.
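
A rough sketch of this kind of generator is shown below. Everything specific in it is a placeholder chosen for illustration: the function name `make_fake_jobs`, the `Experience` column name, the level labels, the keyword list, the templates, and the salary bounds are assumptions, not values taken from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholders; the real levels, keywords, and templates may differ.
EXPERIENCE_LEVELS = ["Entry", "Mid", "Senior"]
KEYWORDS = ["Python", "SQL", "statistics", "reporting", "machine learning"]
TEMPLATES = [
    "We are hiring a {level}-level analyst with skills in {kw1} and {kw2}.",
    "This {level}-level role focuses on {kw1}; experience with {kw2} is a plus.",
]


def make_fake_jobs(n_rows: int, salary_min: float, salary_max: float,
                   seed: int = 0) -> pd.DataFrame:
    """Generate synthetic job rows (illustrative helper, not project code)."""
    rng = np.random.default_rng(seed)

    # Salaries drawn uniformly from the supplied interval.
    salaries = rng.uniform(salary_min, salary_max, size=n_rows).round(2)

    # Cycle through the levels, then shuffle, so each level is about equally common.
    levels = np.resize(EXPERIENCE_LEVELS, n_rows)
    rng.shuffle(levels)

    # Fill a randomly chosen template with two distinct random keywords.
    descriptions = []
    for level in levels:
        kw1, kw2 = rng.choice(KEYWORDS, size=2, replace=False)
        template = TEMPLATES[rng.integers(len(TEMPLATES))]
        descriptions.append(template.format(level=level, kw1=kw1, kw2=kw2))

    return pd.DataFrame({"Salary": salaries,
                         "Experience": levels,
                         "Description": descriptions})


fake_jobs = make_fake_jobs(n_rows=9, salary_min=30_000, salary_max=120_000)
```

Fixing the seed makes the synthetic rows reproducible, which helps when comparing model runs trained with and without the added data.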
---