Sravanth18/verity-h-prototype

Sravanth18 committed on Apr 26

Commit c858b2d (verified) · 1 Parent(s): 8f0f6ea

Upload README.md

Files changed (1)

README.md +41 -32

@@ -1,4 +1,4 @@
-# Project Verity-H v0.3.1
 **Teaching AI to say "I don't know."**
@@ -23,10 +23,6 @@ pip install -e ".[test]"
 # Run tests (mock mode, no API key needed)
 pytest
-# Run with a real LLM
-cp .env.example .env
-# Edit .env: set LLM_MODE=hf_api, add your HF_API_KEY
 ```
 ## Run Evaluation
@@ -43,12 +39,15 @@ python -m src.baseline_runner --mode normal --output results/baseline_normal.jso
 python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl
 # Pipeline
-python -m src.pipeline_runner --output results/verity_pipeline.jsonl
 # Report
 python -m src.report --normal results/baseline_normal.jsonl \
                      --honesty results/baseline_honesty.jsonl \
-                     --pipeline results/verity_pipeline.jsonl \
                      --output results/report.md
 ```
@@ -65,7 +64,7 @@ Question + Evidence
      • Filter junk/meta claims
      • Fix mislabeled claims via span matching
      • Detect inferential claims (4-tier)
-     • Detect contradictions (frame-based)
   5. Gate decision                       (deterministic)
        │
        ▼
@@ -80,12 +79,12 @@ Question + Evidence
 |-----------|----------|---------------|
 | All claims verified | `accept` | Clean answer from verified claims |
 | Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
-| Evidence contradicts itself | `contradiction` | Flags conflict, shows both sides |
 | No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
 | Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
 | Verifier failed to parse | `verifier_error` | Refuses to answer |
-## Inference Detection (v0.3.1)
 The verifier catches claims the LLM wrongly marks as SUPPORTED:
@@ -98,7 +97,7 @@ The verifier catches claims the LLM wrongly marks as SUPPORTED:
 Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08037), BioScope corpus.
-## Results (Qwen3-4B, 30 cases)
 | Metric | Baseline Normal | Baseline Honesty | Verity-H |
 |--------|:-:|:-:|:-:|
@@ -109,7 +108,9 @@ Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08
 | Pressure hypothesis (↑) | 0% | 0% | **100%** *(v0.2.1)* |
 | False contradiction (↓) | 0% | 0% | **0%** |
 | Partial coverage (↑) | 0% | 0% | **100%** |
-| Latency p50 | 3,525ms | 3,244ms | **6,495ms** |
 ## Environment Variables
@@ -122,30 +123,33 @@ Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08
 | `LLM_TEMPERATURE` | `0.0` | Temperature |
 | `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
 | `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
 ## Gold Cases
-30 cases across 6 categories:
 | Category | Count | Tests |
 |----------|:-----:|-------|
-| `grounded` | 5 | All claims in evidence → accept |
-| `missing_info` | 5 | Evidence doesn't cover question → abstain |
-| `contradiction` | 5 | Conflicting facts in evidence → flag |
-| `pressure` | 5 | Speculative question → hypothesis with caveats |
-| `filler_trap` | 5 | Tempts model to invent facts → abstain |
-| `partial_answer` | 5 | Some facts available, some not → partial |
 ## Tests
-154 tests covering all modules. Run with `pytest -v`.
 ```
 tests/
 ├── test_calibration.py          # Table-format probe validation
 ├── test_claim_filter.py         # Slot-aware relevance filtering
 ├── test_constants.py            # Shared stop words
-├── test_contradiction_checks.py # Frame-based + false positive prevention
 ├── test_evidence_spans.py       # Abbreviation-aware splitting
 ├── test_gate.py                 # All gate rules + edge cases
 ├── test_inference_detector.py   # All 4 tiers + exact failure cases
@@ -165,22 +169,27 @@ tests/
 This is a **research harness**, not a product.
-## Known Limitations
-1. **30 cases is a starter set** — not sufficient for statistical significance. Target: 100+.
-2. **Metrics are directional** — text heuristics for baselines, structured outputs for pipeline. Not directly comparable.
-3. **Verifier depends on LLM quality** — the deterministic layers fix many errors, but a very weak LLM will still produce poor claim extraction.
-4. **Inference detector is regex-based** — covers common patterns but cannot catch all forms of inferential reasoning.
-5. **Single evidence document** — no multi-document or multi-turn evidence handling.
 ## Next Steps
-- [ ] Run v0.3.1 eval (inference detector + contradiction fix)
-- [ ] Expand to 100 gold cases
-- [ ] Test on multiple models (1B, 4B, 70B+)
-- [ ] Per-claim-type metric breakdowns
 - [ ] Confidence calibration analysis
-- [ ] Inter-annotator agreement study
 ---

+# Project Verity-H v0.4
 **Teaching AI to say "I don't know."**
 # Run tests (mock mode, no API key needed)
 pytest
 ```
 ## Run Evaluation
 python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl
 # Pipeline
+python -m src.pipeline_runner --output results/verity_pipeline_v0.4.jsonl
+# Batched (resumable if interrupted)
+python run_pipeline_batched.py --delay 0.5 --output results/verity_pipeline_v0.4.jsonl
 # Report
 python -m src.report --normal results/baseline_normal.jsonl \
                      --honesty results/baseline_honesty.jsonl \
+                     --pipeline results/verity_pipeline_v0.4.jsonl \
                      --output results/report.md
 ```
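The new batched command is described as resumable if interrupted. One common way to get that behaviour is to append one JSON line per finished case and skip IDs already present on restart. The sketch below illustrates that pattern only; it is not the contents of `run_pipeline_batched.py`, and the `id` field, the `run_case` callable, and the `run_resumable` name are all hypothetical.

```python
import json
import os
import time


def run_resumable(cases, run_case, output_path, delay=0.5, sleep=time.sleep):
    """Append one JSON line per case, skipping ids already in output_path.

    Returns the number of cases processed in this invocation.
    """
    done = set()
    if os.path.exists(output_path):
        with open(output_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    done.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError):
                    # A truncated last line from an interrupted run is ignored,
                    # so that case is simply run again.
                    continue

    processed = 0
    with open(output_path, "a", encoding="utf-8") as out:
        for case in cases:
            if case["id"] in done:
                continue
            result = run_case(case)  # hypothetical per-case pipeline call
            out.write(json.dumps({"id": case["id"], "result": result}) + "\n")
            out.flush()  # keep finished work on disk if the run is interrupted
            processed += 1
            sleep(delay)  # assumed to play the role of --delay
    return processed
```

Rerunning with the same `output_path` then only processes the cases that have no line in the file yet.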
      • Filter junk/meta claims
      • Fix mislabeled claims via span matching
      • Detect inferential claims (4-tier)
+     • Detect contradictions (status-pair only; numeric/date logged for audit)
   5. Gate decision                       (deterministic)
        │
        ▼
 |-----------|----------|---------------|
 | All claims verified | `accept` | Clean answer from verified claims |
 | Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
+| Status-pair contradiction (open/closed, approved/rejected, etc.) | `contradiction` | Flags conflict, shows both sides |
 | No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
 | Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
 | Verifier failed to parse | `verifier_error` | Refuses to answer |
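The gate table above can be read as a fixed rule chain. The following is a minimal sketch of that reading, not the project's gate module: the `Claim` record, the `gate_decision` name, and above all the order in which the rules are checked are assumptions, since the README lists the outcomes but not their precedence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    supported: bool  # True if the verifier matched the claim to evidence


def gate_decision(claims: List[Claim], *, verifier_ok: bool = True,
                  contradiction: bool = False, pressure: int = 0) -> str:
    """Map verifier output to one of the gate labels from the table.

    The precedence below is an assumption made for illustration.
    """
    if not verifier_ok:
        return "verifier_error"   # verifier output could not be parsed
    if contradiction:
        return "contradiction"    # status-pair conflict found in the evidence
    if pressure == 1:
        return "hypothesis"       # speculative question
    supported = [c for c in claims if c.supported]
    if not supported:
        return "needs_info"       # nothing in the evidence answers the question
    if len(supported) == len(claims):
        return "accept"           # every claim verified
    return "partial"              # mix of verified and unverified claims
```

Because the function has no randomness and makes no model call, the same verifier output always yields the same label, which is what "deterministic" means in step 5.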
+## Inference Detection (v0.3.1+)
 The verifier catches claims the LLM wrongly marks as SUPPORTED:
 Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08037), BioScope corpus.
+## Results (Qwen3-4B, 30 cases, v0.2.1)
 | Metric | Baseline Normal | Baseline Honesty | Verity-H |
 |--------|:-:|:-:|:-:|
 | Pressure hypothesis (↑) | 0% | 0% | **100%** *(v0.2.1)* |
 | False contradiction (↓) | 0% | 0% | **0%** |
 | Partial coverage (↑) | 0% | 0% | **100%** |
+| Latency p50 | 3,525ms | 3,244ms | **6,495ms** *(v0.3, 2-call batch)* |
+See [RESULTS_ARCHIVE.md](RESULTS_ARCHIVE.md) for full version history.
 ## Environment Variables
 | `LLM_TEMPERATURE` | `0.0` | Temperature |
 | `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
 | `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
+| `LLM_MAX_CALLS_PER_MINUTE` | `30` | Per-minute rate limit |
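The two rate-limit variables can be combined in a single throttle: a minimum gap between consecutive calls plus a cap on calls inside any 60-second window. This is a sketch under assumptions. The `CallThrottle` class is hypothetical, and how the project actually interprets `LLM_CALL_DELAY` and `LLM_MAX_CALLS_PER_MINUTE` is not shown in this diff.

```python
import os
import time
from collections import deque


class CallThrottle:
    """Fixed delay between calls plus a sliding 60-second call cap."""

    def __init__(self, delay=None, max_per_minute=None,
                 clock=time.monotonic, sleep=time.sleep):
        # Defaults mirror the values documented in the table above.
        self.delay = (float(os.getenv("LLM_CALL_DELAY", "2"))
                      if delay is None else delay)
        self.max_per_minute = (int(os.getenv("LLM_MAX_CALLS_PER_MINUTE", "30"))
                               if max_per_minute is None else max_per_minute)
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of calls in the last 60 seconds

    def _prune(self, now):
        # Drop timestamps that have left the 60-second window.
        while self._calls and now - self._calls[0] >= 60:
            self._calls.popleft()

    def wait(self):
        """Block until the next call is allowed, then record it."""
        now = self._clock()
        if self._calls:
            gap = self.delay - (now - self._calls[-1])
            if gap > 0:  # enforce the minimum gap since the previous call
                self._sleep(gap)
                now = self._clock()
        self._prune(now)
        if len(self._calls) >= self.max_per_minute:
            # Window is full: wait until the oldest call falls out of it.
            self._sleep(60 - (now - self._calls[0]))
            now = self._clock()
            self._prune(now)
        self._calls.append(now)
```

With the documented defaults, a 2-second delay already limits throughput to 60 / 2 = 30 calls per minute, so the per-minute cap only becomes the binding constraint when the delay is lowered.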
 ## Gold Cases
+100 cases across 6 categories (development set):
 | Category | Count | Tests |
 |----------|:-----:|-------|
+| `grounded` | 17 | All claims in evidence → accept |
+| `missing_info` | 14 | Evidence doesn't cover question → abstain |
+| `contradiction` | 15 | Conflicting facts in evidence → flag |
+| `pressure` | 15 | Speculative question → hypothesis with caveats |
+| `filler_trap` | 15 | Tempts model to invent facts → abstain |
+| `partial_answer` | 24 | Some facts available, some not → partial |
+**100 total cases — development set only. Not a held-out evaluation.**
 ## Tests
+209 tests covering all modules. Run with `pytest -v`.
 ```
 tests/
 ├── test_calibration.py          # Table-format probe validation
 ├── test_claim_filter.py         # Slot-aware relevance filtering
 ├── test_constants.py            # Shared stop words
+├── test_contradiction_checks.py # Status-pair contradictions + possible_conflict audit
 ├── test_evidence_spans.py       # Abbreviation-aware splitting
 ├── test_gate.py                 # All gate rules + edge cases
 ├── test_inference_detector.py   # All 4 tiers + exact failure cases
 This is a **research harness**, not a product.
+## Known Limitations (v0.4)
+The v0.4 baseline intentionally trades some detection for **zero false positives** and **maintainable code**.
+| # | Limitation | Why | Mitigation |
+|---|-----------|-----|------------|
+| 1 | **Numeric contradictions not caught deterministically** | Money/percentage/count/date conflicts have too many false positives (e.g., revenue target vs actual revenue). | Relies on verifier LLM. If LLM misses, contradiction is not flagged. |
+| 2 | **Semantic relevance not enforced** | "How fast can the car go?" with only engine specs supported → `accept`. v0.3.2 had a 20-entry synonym-table guard but it was too rule-heavy for a baseline. | Acceptable for v0.4. Future: semantic similarity check (not synonym table). |
+| 3 | **100 cases = dev set only** | The deterministic rules were tuned against failures on this set. Results are directional, not publication-grade. | Create held-out 50-case test set for unbiased validation. |
+| 4 | **Inference detector is regex-based** | Covers common hedges but cannot catch all inferential reasoning. | Grounded in CogniBench + GME + BioScope; handles most common cases. |
+| 5 | **Single evidence document** | No multi-document consensus or evidence weighting. | Designed for single-pass evaluation. |
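Limitation 1 can be illustrated with a small sketch: opposite status values for the same subject are flagged as contradictions, while differing numeric values are only recorded as possible conflicts for audit. Everything here is an assumption made for illustration, including the `(subject, value)` claim shape, the `check_claims` name, and the pair table, which contains only the two pairs the README names.

```python
from itertools import combinations

# Only the pairs named in the README; the project's real list is not shown here.
OPPOSITES = {frozenset(p) for p in [("open", "closed"), ("approved", "rejected")]}


def check_claims(claims):
    """Split conflicting (subject, value) claims into two lists.

    Returns (contradictions, possible_conflicts). Only status pairs count as
    contradictions; numeric disagreements are logged, never flagged.
    """
    by_subject = {}
    for subject, value in claims:
        by_subject.setdefault(subject.lower(), set()).add(value.lower())

    contradictions, possible_conflicts = [], []
    for subject, values in by_subject.items():
        for a, b in combinations(sorted(values), 2):
            if frozenset((a, b)) in OPPOSITES:
                contradictions.append((subject, a, b))
            elif any(ch.isdigit() for ch in a) and any(ch.isdigit() for ch in b):
                # For example a revenue target vs actual revenue: two different
                # numbers that may both be true, so this is audit-only.
                possible_conflicts.append((subject, a, b))
    return contradictions, possible_conflicts
```

For example, `check_claims([("Ticket 42", "open"), ("ticket 42", "closed"), ("revenue", "$5M"), ("revenue", "$4M")])` puts the ticket pair in the first list and the revenue pair in the second, so only the ticket would lead to a `contradiction` decision.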
 ## Next Steps
+- [x] Simplify to v0.4 baseline — status-pair contradictions only, no frame detector
+- [x] Remove slot-mismatch guard (semantic relevance is known limitation)
+- [x] 209 tests pass, zero false contradictions
+- [ ] Run v0.4 eval on full 100-case development set
+- [ ] Test on multiple models (1B, 4B, 70B+) to prove model independence
+- [ ] Create held-out 50-case test set for unbiased evaluation
 - [ ] Confidence calibration analysis
 ---