Assignment 2 – Developer Salary Prediction

Stack Overflow Developer Survey 2025 | Classification, Regression & Clustering


Overview

This project uses the Stack Overflow Annual Developer Survey 2025 (49,123 responses, 170 features) to predict a software developer's annual salary. The pipeline covers end-to-end data science: exploratory analysis, feature engineering, unsupervised clustering, regression, and multi-class classification.

Research Question: Can we predict a software developer's annual salary from their professional profile, and which factors matter most?


Dataset

| Property | Value |
| --- | --- |
| Source | Stack Overflow Developer Survey 2025 (Kaggle) |
| Raw rows | 49,123 |
| Raw columns | 170 |
| Target column | ConvertedCompYearly (annual salary in USD) |
| Final feature count | 253 (after engineering + cluster feature) |

Part 1 – Setup

  • Environment: Google Colab compatible
  • Reproducibility seed: SEED = 42
  • Key libraries: pandas, numpy, scikit-learn, matplotlib
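A minimal sketch of the reproducibility setup (the seed value is the one stated above; everything else is generic boilerplate):

```python
import random

import numpy as np

SEED = 42  # single reproducibility seed used throughout the pipeline

random.seed(SEED)
np.random.seed(SEED)
# scikit-learn objects additionally receive random_state=SEED explicitly,
# e.g. train_test_split(X, y, random_state=SEED)
```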

Part 2 – Exploratory Data Analysis

2.1 Data Cleaning

  • Removed rows with no salary value
  • Clipped extreme outliers at the 1st and 99th percentiles (final median salary ≈ $75K)
  • Removed 44 columns with >60% missing values (170 → 126 columns), protecting the top-15 salary correlates regardless of missingness
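The three cleaning steps can be sketched in pandas. The toy frame and the `protected` set below are illustrative stand-ins, not the real survey columns:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw survey frame (the real one is 49,123 x 170).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "ConvertedCompYearly": rng.lognormal(11, 1, 500),
    "MostlyMissing": [np.nan] * 450 + list(range(50)),  # 90% missing
    "YearsCode": rng.integers(0, 40, 500).astype(float),
})
df.loc[::10, "ConvertedCompYearly"] = np.nan  # simulate missing salaries

# 1) Remove rows with no salary value
df = df.dropna(subset=["ConvertedCompYearly"])

# 2) Clip extreme outliers at the 1st and 99th percentiles
lo, hi = df["ConvertedCompYearly"].quantile([0.01, 0.99])
df["ConvertedCompYearly"] = df["ConvertedCompYearly"].clip(lo, hi)

# 3) Drop columns with >60% missing, protecting key salary correlates
protected = {"ConvertedCompYearly", "YearsCode"}  # stands in for the top-15 list
df = df[[c for c in df.columns
         if c in protected or df[c].isna().mean() <= 0.60]]
```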

2.2 Missing Value Analysis

01_Bar_chart_top-40_columns_by_missing

Several columns exceed 60% missingness and are dropped. The protected essential columns are retained despite high missingness and imputed later.

2.3 Descriptive Statistics

| Statistic | Value |
| --- | --- |
| Median salary | ~$75,000 |
| Distribution | Right-skewed |
| Outlier treatment | 1st–99th percentile clip |

2.4 Salary Distribution

02_Distribution_of_Annual_Developer_Salary

The raw distribution is heavily right-skewed with a long tail above $200K.

2.5 Research Questions & Findings

Q1: Does coding experience predict salary?

03_Does_Coding_Experience_Predict_Salary

Salary increases steeply through the first 15–20 years of experience then flattens. There is wide variance at every experience level, suggesting experience alone is not sufficient to predict salary.

Q2: How does education level affect salary?

04_How_Does_Education_Level_Affect_Salary

Median salary rises with education level, but the gap between a Bachelor's and Master's degree is smaller than expected. Professional degrees and doctoral holders show the highest median salaries.

Q3: Which countries pay developers the most?

05_Which_Countries_Pay_Developers_the_Most

The US dominates with a median salary roughly 2–3× the global median. Israeli, Western European, and Australian developers cluster in a second tier, while developers in Asia and South America earn less.

Q4: Do remote workers earn more?

06_Do_Remote_Workers_Earn_More

Fully remote developers show a slight salary premium over hybrid and in-office roles. The difference is modest, suggesting remote work correlates with higher-paying companies rather than being a direct cause.

Q5: How does salary vary across developer roles?

07_How_Does_Salary_Vary_Across_Developer_Roles

C-Suite and ML/Data Science roles have the widest salary ranges and highest medians. Full-stack and front-end developers cluster around the global median with less variance.

2.6 Final Feature Selection (~20 Columns)

| Category | Features |
| --- | --- |
| Target | ConvertedCompYearly |
| Numeric | YearsCode, WorkExp, JobSat, JobSatPoints_11, JobSatPoints_4 |
| Demographics | Age, Country, EdLevel, MainBranch |
| Work profile | Employment, RemoteWork, DevType, OrgSize |
| Tech & AI | LanguageHaveWorkedWith, AISelect |
| Learning | LearnCodeChoose, SOVisitFreq |

2.7 EDA Takeaways

  1. Salary is right-skewed; median ~$75K after cleaning
  2. Work experience (WorkExp) and coding experience (YearsCode) are the strongest numeric predictors
  3. Country is the dominant signal — geography explains more variance than any other feature
  4. Remote work carries a small positive premium
  5. Developer role and education have meaningful but secondary effects

Part 3 – Baseline Model

A simple Linear Regression trained on raw numeric columns only — no encoding, no feature engineering.

Train/Test Split

  • 80/20 random split, SEED=42
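The baseline can be sketched on synthetic numeric data; the weak linear signal plus heavy noise below is an illustrative stand-in for how little the raw numeric survey columns explain on their own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic stand-in for the raw numeric columns
X = rng.normal(size=(1000, 3))
y = 60_000 + 8_000 * X[:, 0] + rng.normal(scale=20_000, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=SEED)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
```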

Results

| Metric | Baseline |
| --- | --- |
| MAE | $45,810 |
| RMSE | $61,947 |
| R² | 0.1598 |

Predicted vs. Actual

08_Plot_1_Predicted_vs_Actual

The baseline model struggles with high earners — predictions cluster around the mean and fail to capture the upper salary range. The scatter is wide, consistent with an R² of only 0.16.


Part 4 – Feature Engineering & Clustering

Engineering Steps

| Step | Description |
| --- | --- |
| 4.1 Numeric features | Derived ratio/interaction features |
| 4.2 Ordinal encoding | EdLevel, OrgSize mapped to integers |
| 4.3 One-hot encoding | Country, RemoteWork, Employment, MainBranch, AISelect, SOVisitFreq, Age, PrimaryDevType |
| 4.4 Language flags | Binary flag for each of the top-10 programming languages |
| 4.5 Imputation & scaling | Median imputation + StandardScaler → 249 features |
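Steps 4.2–4.5 can be sketched with a ColumnTransformer; the tiny frame, the category ordering, and the column subset are illustrative assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Tiny frame with survey-like columns (illustrative values only)
df = pd.DataFrame({
    "YearsCode": [5.0, 12.0, None, 30.0],
    "EdLevel": ["Bachelor", "Master", "Bachelor", "PhD"],
    "Country": ["US", "DE", "IN", "US"],
})

# 4.5: median imputation + scaling for numeric columns
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])

pre = ColumnTransformer([
    ("num", numeric, ["YearsCode"]),
    # 4.2: ordinal encoding with an explicit category order
    ("ord", OrdinalEncoder(categories=[["Bachelor", "Master", "PhD"]]),
     ["EdLevel"]),
    # 4.3: one-hot encoding, tolerant of unseen categories at predict time
    ("ohe", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
])
Xt = pre.fit_transform(df)  # 1 numeric + 1 ordinal + 3 country dummies

# 4.4-style language flags from a semicolon-separated column
lang = pd.Series(["Python;SQL", "Rust", "Python;Rust"])
lang_flags = lang.str.get_dummies(sep=";")
```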

KMeans Elbow Method

09_46_KMeans_Elbow

The inertia curve decreases gradually without a sharp elbow, reflecting the high-dimensional and overlapping nature of the data. k=4 was selected as a reasonable balance between cluster granularity and interpretability.
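The elbow computation can be sketched as an inertia sweep; make_blobs stands in for the real 249-feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
# Inertia always decreases as k grows; the "elbow" is where the decrease
# flattens, which motivated choosing k=4 on the real data.
```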

Silhouette Score Comparison

10_Silhouette_scores_across_k_for_each_clustering_method

Silhouette scores are low across all values of k, confirming that natural cluster separation is weak in this dataset. Agglomerative clustering consistently outperforms KMeans, peaking around k=4.

Three Clustering Algorithms

| Algorithm | k / params | Silhouette |
| --- | --- | --- |
| KMeans | k=4 | 0.0109 |
| DBSCAN | eps=5, min_samples=25 | 0.0912 (7 clusters) |
| Agglomerative (Ward) | k=4 | 0.0224 |

The data's high dimensionality (249 features) makes density-based clustering (DBSCAN) impractical — inter-point distances are too large for meaningful core-point detection.
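The silhouette comparison can be sketched like this, again on synthetic blobs rather than the real 249-feature matrix. Note that DBSCAN labels noise points as -1, which matters when comparing its score against the others:

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=42)

labels = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X),
    "agglomerative": AgglomerativeClustering(n_clusters=4,
                                             linkage="ward").fit_predict(X),
}
scores = {name: silhouette_score(X, lab) for name, lab in labels.items()}

# DBSCAN marks outliers as -1; a silhouette computed over its labels
# describes only the points it chose to cluster.
db_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)
```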

Cluster Visualisations (PCA 2D)

11_48_Separate_scatter_plots

KMeans splits the data into four roughly equal blobs with significant overlap in the PCA projection. The clusters correspond loosely to salary level but boundaries are indistinct.

12_48_Separate_scatter_plots

DBSCAN classifies the vast majority of points as noise, forming 7 clusters. High dimensionality makes distance-based density estimation ineffective on this dataset.

13_48_Separate_scatter_plots

Agglomerative clustering produces the clearest separation, isolating a distinct high-salary cluster on the right of the PCA plot. The four tiers align visually with low, mainstream, high-mid, and elite salary groups.

Cluster Profiles – Agglomerative (Chosen)

| Cluster | Mean Salary | Median Salary | Count |
| --- | --- | --- | --- |
| 0 | $86,041 | $74,000 | 18,192 |
| 1 | $109,980 | $95,000 | 4,069 |
| 2 | $27,574 | $13,949 | 877 |
| 3 | $101,735 | $93,387 | 317 |

Winner: Agglomerative Ward (k=4) — the best silhouette among the usable methods (DBSCAN's nominally higher score reflects only the minority of points it does not discard as noise) and four interpretable salary tiers (low-income, mainstream, high-mid, elite).

Cluster Feature Added

cluster_id one-hot encoded and appended → 253 final features
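Appending the cluster feature can be sketched as a one-hot concat; the small frame and column names are illustrative stand-ins:

```python
import numpy as np
import pandas as pd

# Stand-ins: X for the 249-column engineered matrix, cluster_id for the
# Agglomerative labels from Part 4.
X = pd.DataFrame(np.random.default_rng(42).normal(size=(8, 3)),
                 columns=["f0", "f1", "f2"])
cluster_id = pd.Series([0, 1, 2, 3, 0, 1, 2, 3], name="cluster_id")

cluster_dummies = pd.get_dummies(cluster_id, prefix="cluster")
X_full = pd.concat([X, cluster_dummies], axis=1)
# On the real data: 249 engineered features + 4 cluster dummies = 253
```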


Part 5 – Improved Regression Models

Three models trained on the full 253-feature matrix (249 engineered features + 4 cluster dummies).

Results

| Model | MAE | RMSE | R² |
| --- | --- | --- | --- |
| Baseline Linear Regression | $45,810 | $61,947 | 0.1598 |
| Improved Linear Regression | $30,688 | $44,314 | 0.5701 |
| Random Forest (200 trees) | $31,998 | $45,784 | 0.5411 |
| HistGradientBoosting (300 iters) | $28,991 | $43,039 | 0.5944 |

Model Performance Comparison

14_Comparison_table

HistGradientBoosting wins on all three metrics. The jump from baseline to improved linear regression is dramatic — encoding Country alone accounts for the majority of the R² improvement from 0.16 to 0.57.

Feature Importance

15_Feature_importance_for_all_three_models

Country dummies (especially US) dominate feature importance across all three models. Work experience and years of coding rank consistently high. The cluster feature appears in the top 20 for linear regression, validating the clustering step.

Winning Model – Predicted vs. Actual

16_Declare_the_winner_based_on_R²_highest_and_MAE_lowest

The HistGradientBoosting model tracks the perfect-prediction diagonal much more closely than the baseline. It still under-predicts some very high earners above $300K but captures the mid-range salary distribution well.

Discussion

  • Baseline → Improved Linear Regression (+0.41 R²): One-hot encoding Country was the single biggest improvement. Geography is the dominant salary signal.
  • Random Forest vs. Linear Regression: Non-linear feature interactions (e.g. senior developer × US location) are captured naturally by trees.
  • HistGradientBoosting wins: Sequential boosting focuses on the hardest predictions. It natively handles missing values and is 10–100× faster than standard GradientBoosting.
  • Cluster feature: The pre-computed salary-tier signal from Part 4 particularly boosts Linear Regression.

Winner: HistGradientBoosting Regressor

| Metric | Value |
| --- | --- |
| MAE | $28,991 |
| RMSE | $43,039 |
| R² | 0.5944 |

Part 6 – Winning Regression Model Export

The winning regression model is saved to winning_model_regression.pkl.
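A minimal round-trip sketch of the export, with a DummyRegressor standing in for the trained model (joblib.dump is an equally common choice for scikit-learn estimators):

```python
import pickle

from sklearn.dummy import DummyRegressor

# Stand-in for the trained HistGradientBoosting regressor
model = DummyRegressor(strategy="mean").fit([[0.0], [1.0]], [1.0, 3.0])

with open("winning_model_regression.pkl", "wb") as f:
    pickle.dump(model, f)

with open("winning_model_regression.pkl", "rb") as f:
    restored = pickle.load(f)
```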


Part 7 – Salary Classification Setup

The continuous salary target is binned into four ordered classes:

| Class | Label | Range (USD/year) |
| --- | --- | --- |
| 0 | Low | < $30,000 |
| 1 | Mid | $30,000 – $90,000 |
| 2 | High | $90,000 – $160,000 |
| 3 | Very High | > $160,000 |
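The binning can be sketched with pd.cut; note that pd.cut's default right-closed intervals place an exact $30,000 in the lower bin, a boundary convention the table above leaves unspecified:

```python
import numpy as np
import pandas as pd

salary = pd.Series([12_000, 55_000, 120_000, 250_000])

bins = [-np.inf, 30_000, 90_000, 160_000, np.inf]
labels = [0, 1, 2, 3]  # Low, Mid, High, Very High
salary_class = pd.cut(salary, bins=bins, labels=labels)
```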

Class Distribution

17_Bar_chart

Mid-salary developers make up nearly 40% of the dataset. Very High earners are the smallest class at 13.6%, creating a mild class imbalance that the models must handle.

Salary Distribution per Class

18_Salary_Distribution_per_Class

Each bin shows a clean salary range with minimal overlap at the boundaries, confirming the thresholds were well-chosen. The Very High class has the widest spread, reflecting high variability among top earners.


Part 8 – Classification Models

Same 253-feature matrix as regression, with a stratified 80/20 train/test split.

Precision vs. Recall & False Positives vs. False Negatives

Recall is prioritised over precision in this task. Misclassifying a developer into a lower salary tier (a false negative) carries real-world cost — under-negotiation, poor benchmarking, missed career leverage — whereas a false positive (over-predicting a tier) is relatively benign.

False Negatives are more critical than False Positives. Predicting "Mid" when a developer is truly "High" or "Very High" obscures their earning potential. Evaluation therefore uses the weighted F1-score, which balances precision and recall across all four classes, with per-class recall checked separately for the minority tiers (Low and Very High), since support-weighted averaging on its own under-represents them.

Results

| Model | Accuracy | F1 (weighted) |
| --- | --- | --- |
| Logistic Regression | 0.598 | 0.597 |
| Random Forest (200 trees) | 0.605 | 0.597 |
| HistGradientBoosting (300 iters) | 0.611 | 0.610 |

Classification Model Comparison

19_Summary_table

HistGradientBoosting leads on both accuracy and weighted F1, though the margin between all three models is narrow. Its edge over Random Forest is larger on F1 (0.610 vs. 0.597) than on accuracy, reflecting better handling of the minority classes.

Confusion Matrices

20_graph

All three models struggle most with the High class ($90K–$160K), frequently confusing it with Mid. HistGradientBoosting shows the best recall on the Low and Very High tiers — the most actionable classes — with misclassifications mostly occurring between adjacent salary bands.

Per-Class Performance – HistGradientBoosting (Winner)

| Class | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Low (<$30K) | 0.66 | 0.76 | 0.71 |
| Mid ($30K–$90K) | 0.69 | 0.59 | 0.64 |
| High ($90K–$160K) | 0.53 | 0.49 | 0.51 |
| Very High (>$160K) | 0.51 | 0.67 | 0.58 |

Winner: HistGradientBoosting Classifier

Sequential boosting handles the sparse one-hot encoded feature space well, focuses capacity on the most difficult salary boundaries, and outperforms both Logistic Regression and Random Forest on accuracy and F1.


Final Model Files

| File | Contents |
| --- | --- |
| winning_model_regression.pkl | HistGradientBoosting Regressor (MAE $28,991, R² 0.59) |
| winning_model_classifier.pkl | HistGradientBoosting Classifier (Accuracy 0.61, F1 0.61) |