readme

Commit 8e7797f3 authored 2 months ago by Edward Mawuko Samlafo-Adams
--- a/README.md
+++ b/README.md
@@ -96,10 +96,35 @@ SAS-EN-TEST/
  - **Salary**: Numeric data for analysis and prediction.
  - **Description**: Job descriptions used for text analysis.

- **Data Preprocessing**:
-  - Removed null values and duplicates.
-  - Tokenized job descriptions for machine learning.
-  - Derived insights like job distribution and salary trends.
+### Approach for Handling Outliers
+
+1. **Detection Methods**:
+
+   - **Quantile-based filtering**: Values below the 5th percentile or above the 95th percentile were flagged as outliers.
+   - **Z-score analysis**: Values with a Z-score greater than 3 were flagged as potential outliers.
+
+2. **Handling**:
+   - Detected outliers were either removed or replaced with the median value of the respective column.
+   - For columns with many outliers, the distribution was checked again after treatment to confirm it had not shifted unexpectedly.
+
+---
+
+### Approach for Creating Fake Data
+
+1. **Synthetic Data Generation**:
+
+   - Used Python libraries like `numpy` and `pandas` to generate random values within realistic ranges.
+   - Randomized fields include:
+     - **Salary range**: Values were created within observed realistic salary intervals.
+     - **Experience level**: Generated evenly distributed levels to balance the dataset.
+
+2. **Textual Data**:
+
+   - Synthetic job descriptions were constructed using predefined templates combined with randomly generated keywords.
+
+3. **Purpose**:
+   - Balancing underrepresented categories.
+   - Expanding the dataset to improve model robustness during training and testing.

 ---
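
The outlier handling and synthetic data steps added in this commit are described only in prose, so here is a minimal sketch of how they might look with `numpy` and `pandas`. It is not the project's actual code. The function names, the `Experience` column, the template text, and the keyword list are hypothetical. The sketch also assumes the Z-score rule applies to the absolute value (so unusually low values are flagged too) and that salaries are drawn uniformly between two bounds, neither of which the README states.

```python
import numpy as np
import pandas as pd

# Hypothetical template and keywords; the README does not list the real ones.
TEMPLATE = "We are hiring a {level} engineer with experience in {skill}."
SKILLS = ["Python", "SQL", "statistics", "data visualization"]


def flag_outliers(values: pd.Series, lower_q: float = 0.05,
                  upper_q: float = 0.95, z_thresh: float = 3.0) -> pd.DataFrame:
    """Return one boolean mask per detection rule (quantile and Z-score)."""
    low, high = values.quantile([lower_q, upper_q])
    quantile_mask = (values < low) | (values > high)

    std = values.std(ddof=0)
    # Guard against constant columns, where the Z-score is undefined.
    z = (values - values.mean()) / std if std > 0 else values * 0.0
    # Assumption: use |z| so both tails are flagged.
    zscore_mask = z.abs() > z_thresh
    return pd.DataFrame({"quantile": quantile_mask, "zscore": zscore_mask})


def replace_with_median(values: pd.Series, mask: pd.Series) -> pd.Series:
    """Replace flagged entries with the column median; others stay unchanged."""
    return values.mask(mask, values.median())


def make_synthetic_rows(n: int, salary_low: float, salary_high: float,
                        levels: list, seed: int = 0) -> pd.DataFrame:
    """Generate n synthetic rows with evenly distributed experience levels."""
    rng = np.random.default_rng(seed)
    # Assumption: salaries are uniform within the given bounds.
    salaries = rng.uniform(salary_low, salary_high, size=n).round(2)
    # Cycle through the levels so counts differ by at most one, then shuffle.
    experience = rng.permutation([levels[i % len(levels)] for i in range(n)])
    descriptions = [
        TEMPLATE.format(level=level, skill=rng.choice(SKILLS))
        for level in experience
    ]
    return pd.DataFrame({
        "Salary": salaries,
        "Experience": experience,  # hypothetical column name
        "Description": descriptions,
    })


if __name__ == "__main__":
    salaries = pd.Series([50.0] * 20 + [10_000.0])
    masks = flag_outliers(salaries)
    cleaned = replace_with_median(salaries, masks["zscore"])
    print(cleaned.describe())
    print(make_synthetic_rows(6, 30_000, 90_000, ["junior", "mid", "senior"]))
```

Keeping the two detection rules as separate masks leaves the choice between removal (`values[~mask]`) and median replacement to the caller, which matches the "either removed or replaced" wording in the README.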