| # Design Choices |
|
|
| This document gives the technical justification for the architectural and engineering decisions made during development of the Hopcroft project, following professional MLOps and software engineering standards. |
|
|
| --- |
|
|
| ## Table of Contents |
|
|
| 1. [Inception (Requirements Engineering)](#1-inception-requirements-engineering) |
| 2. [Reproducibility (Versioning & Pipelines)](#2-reproducibility-versioning--pipelines) |
| 3. [Quality Assurance](#3-quality-assurance) |
| 4. [API (Inference Service)](#4-api-inference-service) |
| 5. [Deployment (Containerization & CI/CD)](#5-deployment-containerization--cicd) |
| 6. [Monitoring](#6-monitoring) |
|
|
| --- |
|
|
| ## 1. Inception (Requirements Engineering) |
|
|
| ### Machine Learning Canvas |
|
|
| The project adopted the **Machine Learning Canvas** framework to systematically define the problem space before implementation. This structured approach ensures alignment between business objectives and technical solutions. |
|
|
| | Canvas Section | Application | |
| |----------------|-------------| |
| | **Prediction Task** | Multi-label classification of 217 technical skills from GitHub issue text | |
| | **Decisions** | Automated developer assignment based on predicted skill requirements | |
| | **Value Proposition** | Reduced issue resolution time, optimized resource allocation | |
| | **Data Sources** | SkillScope DB (7,245 PRs from 11 Java repositories) | |
| | **Making Predictions** | Real-time classification upon issue creation | |
| | **Building Models** | Iterative improvement over RF+TF-IDF baseline | |
| | **Monitoring** | Continuous evaluation with drift detection | |
|
|
| The complete ML Canvas is documented in [ML Canvas.md](./ML%20Canvas.md). |
|
|
| ### Functional vs Non-Functional Requirements |
|
|
| #### Functional Requirements |
|
|
| | Requirement | Target | Metric | |
| |-------------|--------|--------| |
| | **Precision** | ≥ Baseline | True positives / Predicted positives | |
| | **Recall** | ≥ Baseline | True positives / Actual positives | |
| | **Micro-F1** | > Baseline | Harmonic mean of micro-averaged precision and recall (counts pooled over all labels) | |
| | **Multi-label Support** | 217 skills | Simultaneous prediction of multiple labels | |
|
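As an illustration of the metric definitions above, the following sketch computes micro-averaged precision, recall, and F1 for multi-label predictions by pooling true positives, false positives, and false negatives over every label. The function name `micro_scores` and the toy matrices are assumptions made for this example; they are not the project's evaluation code.

```python
from typing import List, Tuple


def micro_scores(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), micro-averaged over all labels.

    Both arguments are binary indicator matrices with one row per issue
    and one column per skill label.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy example: 3 issues, 4 skills (hypothetical data).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
precision, recall, f1 = micro_scores(y_true, y_pred)
```

Because the counts are pooled before dividing, frequent skills weigh more than rare ones, which is worth keeping in mind when comparing against the baseline on a label set as large as 217 skills.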
|
| #### Non-Functional Requirements |
|
|
| | Category | Requirement | Implementation | |
| |----------|-------------|----------------| |
| | **Reproducibility** | Auditable experiments | MLflow tracking, DVC versioning | |
| | **Explainability** | Interpretable predictions | Confidence scores per skill | |
| | **Performance** | Low latency inference | FastAPI async, model caching | |
| | **Scalability** | Batch processing | `/predict/batch` endpoint (max 100) | |
| | **Maintainability** | Clean code | Ruff linting, type hints, docstrings | |
|
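The batch limit in the table has a practical consequence for clients: larger workloads have to be split before they are sent. A minimal client-side sketch, assuming the limit of 100 applies to the number of issue texts per request; `chunk_issues` and `MAX_BATCH_SIZE` are hypothetical names, not part of the project.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 100  # limit stated for /predict/batch


def chunk_issues(
    issues: Sequence[str], size: int = MAX_BATCH_SIZE
) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` issue texts."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(issues), size):
        yield list(issues[start:start + size])


# Hypothetical workload of 250 issue texts.
batches = list(chunk_issues([f"issue {i}" for i in range(250)]))
batch_sizes = [len(batch) for batch in batches]
```

Each yielded list could then be posted to `/predict/batch` as one request.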
|
| ### System-First vs Model-First Development |
|
|
| The project adopted a **System-First** approach, prioritizing infrastructure and pipeline development before model optimization: |
|
|
| ``` |
| Timeline: |
| ┌──────────────────────────┬──────────────────────────────┐ |
| │ Phase 1: Infrastructure  │ Phase 2: Model Development   │ |
| │ - DVC/MLflow setup       │ - Feature engineering        │ |
| │ - CI/CD pipeline         │ - Hyperparameter tuning      │ |
| │ - Docker containers      │ - SMOTE/ADASYN experiments   │ |
| │ - API skeleton           │ - Performance optimization   │ |
| └──────────────────────────┴──────────────────────────────┘ |
| ``` |
|
|
| **Rationale:** |
| - Enables rapid iteration once infrastructure is stable |
| - Ensures reproducibility from day one |
| - Reduces technical debt during model development |
| - Facilitates team collaboration with shared tooling |
|
|
| --- |
|
|
| ## 2. Reproducibility (Versioning & Pipelines) |
|
|
| ### Code Versioning (Git) |
|
|
| Standard Git workflow with branch protection: |
|
|
| | Branch | Purpose | |
| |--------|---------| |
| | `main` | Production-ready code | |
| | `feature/*` | New development | |
| | `milestone/*` | Integration branch that groups completed features before merging into `main` | |
|
|
| ### Data & Model Versioning (DVC) |
|
|
| **Design Decision:** Use DVC (Data Version Control) with DagsHub remote storage for large file management. |
|
|
| ``` |
| .dvc/config |
| ├── remote: origin |
| ├── url: https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc |
| └── auth: basic (credentials via environment) |
| ``` |
|
|
| **Tracked Artifacts:** |
|
|
| | File | Purpose | |
| |------|---------| |
| | `data/raw/skillscope_data.db` | Original SQLite database | |
| | `data/processed/*.npy` | TF-IDF and embedding features | |
| | `models/*.pkl` | Trained models and vectorizers | |
|
|
| **Versioning Workflow:** |
| ```bash |
| # Track new data |
| dvc add data/raw/new_dataset.db |
| git add data/raw/.gitignore data/raw/new_dataset.db.dvc |
| |
| # Push to remote |
| dvc push |
| git commit -m "Add new dataset version" |
| git push |
| ``` |
|
|
| ### Experiment Tracking (MLflow) |
|
|
| **Design Decision:** Remote MLflow instance on DagsHub for collaborative experiment tracking. |
|
|
| | Configuration | Value | |
| |---------------|-------| |
| | Tracking URI | `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow` | |
| | Experiments | `skill_classification`, `skill_prediction_api` | |
|
|
| **Logged Metrics:** |
| - Training: precision, recall, F1-score, training time |
| - Inference: prediction latency, confidence scores, timestamps |
|
|
| **Artifact Storage:** |
| - Model binaries (`.pkl`) |
| - Vectorizers and scalers |
| - Hyperparameter configurations |
|
|
| ### Auditable ML Pipeline |
|
|
| The pipeline is designed for complete reproducibility: |
|
|
| ``` |
| ┌──────────────┐    ┌──────────────┐    ┌──────────────┐ |
| │  dataset.py  │───▶│ features.py  │───▶│   train.py   │ |
| │  (DVC pull)  │    │   (TF-IDF)   │    │   (MLflow)   │ |
| └──────────────┘    └──────────────┘    └──────────────┘ |
|         │                   │                   │ |
|         ▼                   ▼                   ▼ |
|    .dvc files          .dvc files          MLflow Run |
| ``` |
|
|
| --- |
|
|
| ## 3. Quality Assurance |
|
|
| ### Testing Strategy |
|
|
| #### Static Analysis (Ruff) |
|
|
| **Design Decision:** Use Ruff as the primary linter for speed and comprehensive rule coverage. |
|
|
| | Configuration | Value | |
| |---------------|-------| |
| | Line Length | 88 (Black compatible) | |
| | Target Python | 3.10+ | |
| | Rule Sets | pycodestyle (PEP 8), isort, Pyflakes | |
|
|
| **CI Integration:** |
| ```yaml |
| - name: Lint with Ruff |
|   run: make lint |
| ``` |
|
|
| #### Dynamic Testing (Pytest) |
|
|
| **Test Organization:** |
|
|
| ``` |
| tests/ |
| ├── unit/ # Isolated function tests |
| ├── integration/ # Component interaction tests |
| ├── system/ # End-to-end tests |
| ├── behavioral/ # ML-specific tests |
| ├── deepchecks/ # Data validation |
| └── great expectations/ # Schema validation |
| ``` |
|
|
| **Markers for Selective Execution:** |
| ```python |
| @pytest.mark.unit |
| @pytest.mark.integration |
| @pytest.mark.system |
| @pytest.mark.slow |
| ``` |
|
|
| ### Model Validation vs Model Verification |
|
|
| | Concept | Definition | Implementation | |
| |---------|------------|----------------| |
| | **Validation** | Does the model fit user needs? | Micro-F1 vs baseline comparison | |
| | **Verification** | Is the model correctly built? | Unit tests, behavioral tests | |
|
|
| ### Behavioral Testing |
|
|
| **Design Decision:** Implement CheckList-inspired behavioral tests to evaluate model robustness beyond accuracy metrics. |
|
|
| | Test Type | Count | Purpose | |
| |-----------|-------|---------| |
| | **Invariance** | 9 | Stability under perturbations (typos, case changes) | |
| | **Directional** | 10 | Expected behavior with keyword additions | |
| | **Minimum Functionality** | 17 | Basic sanity checks on clear examples | |
|
|
| **Example Invariance Test:** |
| ```python |
| def test_case_insensitivity(): |
|     """Model should predict the same skills regardless of case.""" |
|     assert predict("Fix BUG") == predict("fix bug") |
| ``` |
|
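The three test types in the table can be sketched against a toy stand-in model. The keyword-based `predict` function and its `KEYWORDS` mapping below are hypothetical placeholders for the real classifier, used only so that the checks are executable on their own.

```python
from typing import Dict, Set

# Hypothetical keyword-to-skill mapping standing in for the trained model.
KEYWORDS: Dict[str, str] = {
    "bug": "Debugging",
    "sql": "Databases",
    "thread": "Concurrency",
}


def predict(text: str) -> Set[str]:
    """Toy predictor: return the skills whose keyword occurs in the text."""
    lowered = text.lower()
    return {skill for keyword, skill in KEYWORDS.items() if keyword in lowered}


def test_invariance_to_case() -> None:
    """Invariance: changing letter case must not change the prediction."""
    assert predict("Fix BUG in SQL query") == predict("fix bug in sql query")


def test_directional_keyword_addition() -> None:
    """Directional: adding a skill keyword may add skills, never remove them."""
    base = predict("Fix bug")
    extended = predict("Fix bug in thread pool")
    assert base <= extended
    assert "Concurrency" in extended


def test_minimum_functionality() -> None:
    """Minimum functionality: a clear example yields the expected skill."""
    assert "Databases" in predict("Optimize slow SQL join")
```

With pytest, functions named `test_*` are collected automatically, so swapping the toy `predict` for the real inference call would turn the same structure into tests of the actual model.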
|
| ### Data Quality Checks |
|
|
| #### Great Expectations (10 Tests) |
|
|
| **Design Decision:** Validate data at pipeline boundaries to catch quality issues early. |
|
|
| | Validation Point | Tests | |
| |------------------|-------| |
| | Raw Database | Schema, row count, required columns | |
| | Feature Matrix | No NaN/Inf, sparsity, SMOTE compatibility | |
| | Label Matrix | Binary format, distribution, consistency | |
| | Train/Test Split | No leakage, stratification | |
|
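The sketch below expresses three of the checks listed above (no NaN/Inf in the features, binary labels, no train/test leakage) in plain Python. It deliberately does not use the Great Expectations API; the function names and the idea of identifying samples by an integer ID are assumptions made for illustration.

```python
import math
from typing import Iterable, List, Sequence


def features_are_finite(matrix: Sequence[Sequence[float]]) -> bool:
    """True when no feature value is NaN or infinite."""
    return all(math.isfinite(value) for row in matrix for value in row)


def labels_are_binary(matrix: Sequence[Sequence[int]]) -> bool:
    """True when every label entry is exactly 0 or 1."""
    return all(value in (0, 1) for row in matrix for value in row)


def leaked_ids(train_ids: Iterable[int], test_ids: Iterable[int]) -> List[int]:
    """Return sample IDs present in both splits (empty list means no leakage)."""
    return sorted(set(train_ids) & set(test_ids))


# Hypothetical toy data.
X = [[0.0, 0.3], [0.7, 0.0]]
Y = [[1, 0], [0, 1]]
assert features_are_finite(X)
assert labels_are_binary(Y)
assert leaked_ids([1, 2, 3], [4, 5]) == []
```

Running checks like these at each pipeline boundary means a malformed matrix fails fast instead of surfacing later as a confusing training error.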
|
| #### Deepchecks (24 Checks) |
|
|
| **Suites:** |
| - **Data Integrity Suite** (12 checks): Duplicates, nulls, correlations |
| - **Train-Test Validation Suite** (12 checks): Leakage, drift, distribution |
|
|
| **Status:** Production-ready (96% overall score) |
|
|
| --- |
|
|
| ## 4. API (Inference Service) |
|
|
| ### FastAPI Implementation |
|
|
| **Design Decision:** Use FastAPI for async request handling, automatic OpenAPI generation, and native Pydantic validation. |
|
|
| **Key Features:** |
| - Async lifespan management for model loading |
| - Middleware for Prometheus metrics collection |
| - Structured exception handling |
|
|
| ### RESTful Principles |
|
|
| **Design Decision:** Follow REST best practices for intuitive API design. |
|
|
| | Principle | Implementation | |
| |-----------|----------------| |
| | **Nouns, not verbs** | `/predictions` instead of `/getPrediction` | |
| | **Plural resources** | `/predictions`, `/issues` | |
| | **HTTP methods** | GET (retrieve), POST (create) | |
| | **Status codes** | 200 (OK), 201 (Created), 404 (Not Found), 500 (Internal Server Error) | |
|
|
| **Endpoint Design:** |
|
|
| | Method | Endpoint | Action | |
| |--------|----------|--------| |
| | `POST` | `/predict` | Create new prediction | |
| | `POST` | `/predict/batch` | Create batch predictions | |
| | `GET` | `/predictions` | List predictions | |
| | `GET` | `/predictions/{run_id}` | Get specific prediction | |
|
|
| ### OpenAPI/Swagger Documentation |
|
|
| **Auto-generated documentation at runtime:** |
| - Swagger UI: `/docs` |
| - ReDoc: `/redoc` |
| - OpenAPI JSON: `/openapi.json` |
|
|
| **Pydantic Models for Schema Enforcement:** |
| ```python |
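| from typing import List, Optional |
| from pydantic import BaseModel |
|
| class SkillPrediction(BaseModel): |
|     # Field names are assumed for illustration; the generated schema is authoritative. |
|     skill_name: str |
|     confidence: float |
|
| class IssueInput(BaseModel): |
|     issue_text: str |
|     repo_name: Optional[str] = None |
|     pr_number: Optional[int] = None |
|
| class PredictionResponse(BaseModel): |
|     run_id: str |
|     predictions: List[SkillPrediction] |
|     model_version: str |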
| ``` |
|
|
| --- |
|
|
| ## 5. Deployment (Containerization & CI/CD) |
|
|
| ### Docker Containerization |
|
|
| **Design Decision:** Multi-stage Docker builds with security best practices. |
|
|
| **Dockerfile Features:** |
| - Python 3.10 slim base image (minimal footprint) |
| - Non-root user for security |
| - DVC integration for model pulling |
| - Health check endpoint configuration |
|
|
| **Multi-Service Architecture:** |
|
|
| ``` |
| docker-compose.yml |
| ├── hopcroft-api (FastAPI) |
| │ ├── Port: 8080 |
| │ ├── Volumes: source code, logs |
| │ └── Health check: /health |
| │ |
| ├── hopcroft-gui (Streamlit) |
| │ ├── Port: 8501 |
| │ ├── Depends on: hopcroft-api |
| │ └── Environment: API_BASE_URL |
| │ |
| └── hopcroft-net (Bridge network) |
| ``` |
|
|
| **Design Rationale:** |
| - Separation of concerns (API vs GUI) |
| - Independent scaling |
| - Health-based dependency management |
| - Shared network for internal communication |
|
|
| ### CI/CD Pipeline (GitHub Actions) |
|
|
| **Design Decision:** Implement Continuous Delivery for ML (CD4ML) with automated testing and image builds. |
|
|
| **Pipeline Stages:** |
|
|
| ```yaml |
| Jobs: |
|   unit-tests: |
|     - Checkout code |
|     - Setup Python 3.10 |
|     - Install dependencies |
|     - Ruff linting |
|     - Pytest unit tests |
|     - Upload test report (on failure) |
|
|   build-image: |
|     - Needs: unit-tests |
|     - Configure DVC credentials |
|     - Pull models |
|     - Build Docker image |
| ``` |
|
|
| **Triggers:** |
| - Push to `main`, `feature/*` |
| - Pull requests to `main` |
|
|
| **Secrets Management:** |
| - `DAGSHUB_USERNAME`: DagsHub authentication |
| - `DAGSHUB_TOKEN`: DagsHub access token |
|
|
| ### Hugging Face Spaces Hosting |
|
|
| **Design Decision:** Deploy on HF Spaces for free hosting with Docker SDK support (the free tier runs on CPU; GPU hardware is an optional paid upgrade). |
|
|
| **Configuration:** |
| ```yaml |
| --- |
| title: Hopcroft Skill Classification |
| sdk: docker |
| app_port: 7860 |
| --- |
| ``` |
| |
| **Startup Flow:** |
| 1. `start_space.sh` configures DVC credentials |
| 2. Pull models from DagsHub |
| 3. Start FastAPI (port 8000) |
| 4. Start Streamlit (port 8501) |
| 5. Start Nginx (port 7860) for routing |
| |
| **Nginx Reverse Proxy:** |
| - `/` → Streamlit GUI |
| - `/docs`, `/predict`, `/predictions` → FastAPI |
| - `/prometheus` → Prometheus metrics |
| |
| --- |
|
|
| ## 6. Monitoring |
|
|
| ### Resource-Level Monitoring |
|
|
| **Design Decision:** Implement Prometheus metrics for real-time observability. |
|
|
| | Metric | Type | Purpose | |
| |--------|------|---------| |
| | `hopcroft_requests_total` | Counter | Request volume by endpoint | |
| | `hopcroft_request_duration_seconds` | Histogram | Latency distribution (P50, P90, P99) | |
| | `hopcroft_in_progress_requests` | Gauge | Concurrent request load | |
| | `hopcroft_prediction_processing_seconds` | Summary | Model inference time | |
|
|
| **Middleware Implementation:** |
| ```python |
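| @app.middleware("http") |
| async def monitor_requests(request, call_next): |
|     # Label values are derived from the incoming request. |
|     method, endpoint = request.method, request.url.path |
|     IN_PROGRESS.inc() |
|     try: |
|         with REQUEST_LATENCY.labels(method, endpoint).time(): |
|             response = await call_next(request) |
|         REQUESTS_TOTAL.labels(method, endpoint, response.status_code).inc() |
|         return response |
|     finally: |
|         IN_PROGRESS.dec()  # runs even if the handler raises |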
| ``` |
|
|
| ### Performance-Level Monitoring |
|
|
| **Model Staleness Indicators:** |
| - Prediction confidence trends over time |
| - Drift detection alerts |
| - Error rate monitoring |
|
|
| ### Drift Detection Strategy |
|
|
| **Design Decision:** Implement statistical drift detection using Kolmogorov-Smirnov test with Bonferroni correction. |
|
|
| | Component | Details | |
| |-----------|---------| |
| | **Algorithm** | KS Two-Sample Test | |
| | **Baseline** | 1000 samples from training data | |
| | **Threshold** | p-value < 0.05 / m, where m is the number of features tested (Bonferroni correction) | |
| | **Execution** | Scheduled via cron or manual trigger | |
|
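The decision rule above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's detector: `ks_statistic`, `ks_p_value`, and `detect_drift` are hypothetical names, and the p-value uses the asymptotic Kolmogorov series with a common small-sample correction, so it is an approximation. A library routine such as `scipy.stats.ks_2samp` would normally be preferred in practice.

```python
import math
from bisect import bisect_right
from typing import Dict, Mapping, Sequence, Tuple


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of the two samples."""
    ref, cur = sorted(reference), sorted(current)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )


def ks_p_value(d: float, n: int, m: int, terms: int = 100) -> float:
    """Approximate two-sided p-value from the asymptotic Kolmogorov series."""
    root = math.sqrt(n * m / (n + m))
    lam = (root + 0.12 + 0.11 / root) * d
    if lam < 0.1:  # series converges slowly here; the p-value is effectively 1
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


def detect_drift(
    reference: Mapping[str, Sequence[float]],
    current: Mapping[str, Sequence[float]],
    alpha: float = 0.05,
) -> Dict[str, Tuple[float, float, bool]]:
    """Map each feature to (KS statistic, p-value, drift flag)."""
    threshold = alpha / len(reference)  # Bonferroni: alpha divided by number of tests
    results = {}
    for name, ref_values in reference.items():
        cur_values = current[name]
        d = ks_statistic(ref_values, cur_values)
        p = ks_p_value(d, len(ref_values), len(cur_values))
        results[name] = (d, p, p < threshold)
    return results


# Hypothetical data: "f0" is unchanged, "f1" is shifted by 100.
reference = {"f0": list(range(200)), "f1": list(range(200))}
current = {"f0": list(range(200)), "f1": [x + 100 for x in range(200)]}
report = detect_drift(reference, current)
```

Dividing the significance level by the number of features keeps the chance of at least one false alarm across all tests at or below 0.05, at the cost of reduced sensitivity when many features are checked.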
|
| **Drift Types Monitored:** |
|
|
| | Type | Definition | Detection Method | |
| |------|------------|------------------| |
| | **Data Drift** | Feature distribution shift | KS test on input features | |
| | **Target Drift** | Label distribution shift | Chi-square test on predictions | |
| | **Concept Drift** | Relationship change | Performance degradation monitoring | |
|
|
| **Metrics Published to Pushgateway:** |
| - `drift_detected`: Binary indicator (0/1) |
| - `drift_p_value`: p-value of the KS test |
| - `drift_distance`: KS statistic (maximum distance between the two empirical CDFs) |
| - `drift_check_timestamp`: Last check time |
|
|
| ### Alerting Configuration |
|
|
| **Prometheus Alert Rules:** |
|
|
| | Alert | Condition | Severity | |
| |-------|-----------|----------| |
| | `ServiceDown` | Target down for 5m | Critical | |
| | `HighErrorRate` | 5xx rate > 10% | Warning | |
| | `SlowRequests` | P95 latency > 2s | Warning | |
| | `DriftDetected` | drift_detected = 1 | Warning | |
| |
| **Alertmanager Integration:** |
| - Severity-based routing |
| - Email notifications |
| - Inhibition rules to prevent alert storms |
| |
| ### Grafana Visualization |
| |
| **Dashboard Panels:** |
| 1. Request Rate (gauge) |
| 2. Request Latency p50/p95 (time series) |
| 3. In-Progress Requests (stat panel) |
| 4. Error Rate 5xx (stat panel) |
| 5. Model Prediction Time (time series) |
| 6. Requests by Endpoint (bar chart) |
| |
| **Data Sources:** |
| - Prometheus: Real-time metrics |
| - Pushgateway: Batch job metrics (drift detection) |
| |
| ### HF Spaces Deployment |
| |
| Both Prometheus and Grafana run inside the Hugging Face Space and are exposed through the Nginx reverse proxy: |
| |
| | Service | Production URL | |
| |---------|----------------| |
| | Prometheus | `https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/` | |
| | Grafana | `https://dacrow13-hopcroft-skill-classification.hf.space/grafana/` | |
| |
| This enables real-time monitoring of the production deployment without additional infrastructure. |
| |