Upload README.md
README.md
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
# Project Verity-H v0.
|
| 2 |
|
| 3 |
**Teaching AI to say "I don't know."**
|
| 4 |
|
|
@@ -23,10 +23,6 @@ pip install -e ".[test]"
|
|
| 23 |
|
| 24 |
# Run tests (mock mode, no API key needed)
|
| 25 |
pytest
|
| 26 |
-
|
| 27 |
-
# Run with a real LLM
|
| 28 |
-
cp .env.example .env
|
| 29 |
-
# Edit .env: set LLM_MODE=hf_api, add your HF_API_KEY
|
| 30 |
```
|
| 31 |
|
| 32 |
## Run Evaluation
|
|
@@ -43,12 +39,15 @@ python -m src.baseline_runner --mode normal --output results/baseline_normal.jso
|
|
| 43 |
python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl
|
| 44 |
|
| 45 |
# Pipeline
|
| 46 |
-
python -m src.pipeline_runner --output results/
|
|
|
|
|
|
|
|
|
|
| 47 |
|
| 48 |
# Report
|
| 49 |
python -m src.report --normal results/baseline_normal.jsonl \
|
| 50 |
--honesty results/baseline_honesty.jsonl \
|
| 51 |
-
--pipeline results/
|
| 52 |
--output results/report.md
|
| 53 |
```
|
| 54 |
|
|
@@ -65,7 +64,7 @@ Question + Evidence
|
|
| 65 |
• Filter junk/meta claims
|
| 66 |
• Fix mislabeled claims via span matching
|
| 67 |
• Detect inferential claims (4-tier)
|
| 68 |
-
• Detect contradictions (
|
| 69 |
5. Gate decision (deterministic)
|
| 70 |
│
|
| 71 |
▼
|
|
@@ -80,12 +79,12 @@ Question + Evidence
|
|
| 80 |
|-----------|----------|---------------|
|
| 81 |
| All claims verified | `accept` | Clean answer from verified claims |
|
| 82 |
| Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
|
| 83 |
-
|
|
| 84 |
| No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
|
| 85 |
| Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
|
| 86 |
| Verifier failed to parse | `verifier_error` | Refuses to answer |
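
The gate table can be read as a small deterministic function. The sketch below is only an illustration of that reading, not the project's gate code: the `Claim` type, the `gate_decision` name, its parameters, the order in which the rules are checked, and the mapping of "no verified claims" to `needs_info` are all assumptions. The `contradiction` label is the one added to the table in v0.4.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """Hypothetical claim record: text plus the verifier's verdict."""
    text: str
    verified: bool


def gate_decision(
    claims: Optional[List[Claim]],
    pressure: int = 0,
    has_contradiction: bool = False,
) -> str:
    """Map verifier output to one of the gate labels in the table.

    claims=None stands for verifier output that could not be parsed.
    The precedence of the checks below is an assumption of this sketch.
    """
    if claims is None:
        return "verifier_error"   # refuse to answer
    if has_contradiction:
        return "contradiction"    # flag the conflict, show both sides
    if pressure == 1:
        return "hypothesis"       # speculative question, caveated guess
    verified = [c for c in claims if c.verified]
    if not verified:
        return "needs_info"       # nothing in the evidence supports an answer
    if len(verified) == len(claims):
        return "accept"           # every claim verified
    return "partial"              # split into verified / unverified parts
```

Because the function has no randomness and no model call, the same verifier output always yields the same label, which is what makes the gate step testable in isolation.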
|
| 87 |
|
| 88 |
-
## Inference Detection (v0.3.1)
|
| 89 |
|
| 90 |
The verifier catches claims the LLM wrongly marks as SUPPORTED:
|
| 91 |
|
|
@@ -98,7 +97,7 @@ The verifier catches claims the LLM wrongly marks as SUPPORTED:
|
|
| 98 |
|
| 99 |
Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08037), BioScope corpus.
|
| 100 |
|
| 101 |
-
## Results (Qwen3-4B, 30 cases)
|
| 102 |
|
| 103 |
| Metric | Baseline Normal | Baseline Honesty | Verity-H |
|
| 104 |
|--------|:-:|:-:|:-:|
|
|
@@ -109,7 +108,9 @@ Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08
|
|
| 109 |
| Pressure hypothesis (↑) | 0% | 0% | **100%** *(v0.2.1)* |
|
| 110 |
| False contradiction (↓) | 0% | 0% | **0%** |
|
| 111 |
| Partial coverage (↑) | 0% | 0% | **100%** |
|
| 112 |
-
| Latency p50 | 3,525ms | 3,244ms | **6,495ms** |
|
|
|
|
|
|
|
| 113 |
|
| 114 |
## Environment Variables
|
| 115 |
|
|
@@ -122,30 +123,33 @@ Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08
|
|
| 122 |
| `LLM_TEMPERATURE` | `0.0` | Temperature |
|
| 123 |
| `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
|
| 124 |
| `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
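
A minimal sketch of how the rate-limit settings could be applied, assuming a fixed gap between calls (`LLM_CALL_DELAY`) combined with a cap per 60-second window (`LLM_MAX_CALLS_PER_MINUTE`, listed in the v0.4 table). The `CallThrottle` class and its `wait` method are hypothetical names for illustration; the project's own client may enforce these limits differently.

```python
import os
import time
from collections import deque


class CallThrottle:
    """Hypothetical helper: spaces out calls and caps calls per 60-second window."""

    def __init__(self, delay=None, max_per_minute=None,
                 clock=time.monotonic, sleep=time.sleep):
        # Defaults mirror the README tables; explicit arguments override the environment.
        if delay is None:
            delay = float(os.environ.get("LLM_CALL_DELAY", "2"))
        if max_per_minute is None:
            max_per_minute = int(os.environ.get("LLM_MAX_CALLS_PER_MINUTE", "30"))
        self.delay = delay
        self.max_per_minute = max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # start times of recent calls

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        # Forget calls that have left the 60-second window.
        while self._calls and now - self._calls[0] >= 60.0:
            self._calls.popleft()
        pause = 0.0
        if self._calls:
            # Enforce the fixed gap since the previous call.
            pause = max(pause, self.delay - (now - self._calls[-1]))
        if self._calls and len(self._calls) >= self.max_per_minute:
            # Window is full: wait until the oldest call expires.
            pause = max(pause, 60.0 - (now - self._calls[0]))
        if pause > 0:
            self._sleep(pause)
        self._calls.append(self._clock())
        return pause
```

Calling `throttle.wait()` before each request is enough to apply both limits. The injectable `clock` and `sleep` arguments exist only so the behavior can be tested without real waiting.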
|
|
|
|
| 125 |
|
| 126 |
## Gold Cases
|
| 127 |
|
| 128 |
-
|
| 129 |
|
| 130 |
| Category | Count | Tests |
|
| 131 |
|----------|:-----:|-------|
|
| 132 |
-
| `grounded` |
|
| 133 |
-
| `missing_info` |
|
| 134 |
-
| `contradiction` |
|
| 135 |
-
| `pressure` |
|
| 136 |
-
| `filler_trap` |
|
| 137 |
-
| `partial_answer` |
|
|
|
|
|
|
|
| 138 |
|
| 139 |
## Tests
|
| 140 |
|
| 141 |
-
|
| 142 |
|
| 143 |
```
|
| 144 |
tests/
|
| 145 |
├── test_calibration.py # Table-format probe validation
|
| 146 |
├── test_claim_filter.py # Slot-aware relevance filtering
|
| 147 |
├── test_constants.py # Shared stop words
|
| 148 |
-
├── test_contradiction_checks.py #
|
| 149 |
├── test_evidence_spans.py # Abbreviation-aware splitting
|
| 150 |
├── test_gate.py # All gate rules + edge cases
|
| 151 |
├── test_inference_detector.py # All 4 tiers + exact failure cases
|
|
@@ -165,22 +169,27 @@ tests/
|
|
| 165 |
|
| 166 |
This is a **research harness**, not a product.
|
| 167 |
|
| 168 |
-
## Known Limitations
|
|
|
|
|
|
|
| 169 |
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
|
|
|
|
|
|
| 175 |
|
| 176 |
## Next Steps
|
| 177 |
|
| 178 |
-
- [
|
| 179 |
-
- [
|
| 180 |
-
- [
|
| 181 |
-
- [ ]
|
|
|
|
|
|
|
| 182 |
- [ ] Confidence calibration analysis
|
| 183 |
-
- [ ] Inter-annotator agreement study
|
| 184 |
|
| 185 |
---
|
| 186 |
|
|
|
|
| 1 |
+
# Project Verity-H v0.4
|
| 2 |
|
| 3 |
**Teaching AI to say "I don't know."**
|
| 4 |
|
|
|
|
| 23 |
|
| 24 |
# Run tests (mock mode, no API key needed)
|
| 25 |
pytest
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
```
|
| 27 |
|
| 28 |
## Run Evaluation
|
|
|
|
| 39 |
python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl
|
| 40 |
|
| 41 |
# Pipeline
|
| 42 |
+
python -m src.pipeline_runner --output results/verity_pipeline_v0.4.jsonl
|
| 43 |
+
|
| 44 |
+
# Batched (resumable if interrupted)
|
| 45 |
+
python run_pipeline_batched.py --delay 0.5 --output results/verity_pipeline_v0.4.jsonl
|
| 46 |
|
| 47 |
# Report
|
| 48 |
python -m src.report --normal results/baseline_normal.jsonl \
|
| 49 |
--honesty results/baseline_honesty.jsonl \
|
| 50 |
+
--pipeline results/verity_pipeline_v0.4.jsonl \
|
| 51 |
--output results/report.md
|
| 52 |
```
|
| 53 |
|
|
|
|
| 64 |
• Filter junk/meta claims
|
| 65 |
• Fix mislabeled claims via span matching
|
| 66 |
• Detect inferential claims (4-tier)
|
| 67 |
+
• Detect contradictions (status-pair only; numeric/date logged for audit)
|
| 68 |
5. Gate decision (deterministic)
|
| 69 |
│
|
| 70 |
▼
|
|
|
|
| 79 |
|-----------|----------|---------------|
|
| 80 |
| All claims verified | `accept` | Clean answer from verified claims |
|
| 81 |
| Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
|
| 82 |
+
| Status-pair contradiction (open/closed, approved/rejected, etc.) | `contradiction` | Flags conflict, shows both sides |
|
| 83 |
| No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
|
| 84 |
| Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
|
| 85 |
| Verifier failed to parse | `verifier_error` | Refuses to answer |
|
| 86 |
|
| 87 |
+
## Inference Detection (v0.3.1+)
|
| 88 |
|
| 89 |
The verifier catches claims the LLM wrongly marks as SUPPORTED:
|
| 90 |
|
|
|
|
| 97 |
|
| 98 |
Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08037), BioScope corpus.
|
| 99 |
|
| 100 |
+
## Results (Qwen3-4B, 30 cases, v0.2.1)
|
| 101 |
|
| 102 |
| Metric | Baseline Normal | Baseline Honesty | Verity-H |
|
| 103 |
|--------|:-:|:-:|:-:|
|
|
|
|
| 108 |
| Pressure hypothesis (↑) | 0% | 0% | **100%** *(v0.2.1)* |
|
| 109 |
| False contradiction (↓) | 0% | 0% | **0%** |
|
| 110 |
| Partial coverage (↑) | 0% | 0% | **100%** |
|
| 111 |
+
| Latency p50 | 3,525ms | 3,244ms | **6,495ms** *(v0.3, 2-call batch)* |
|
| 112 |
+
|
| 113 |
+
See [RESULTS_ARCHIVE.md](RESULTS_ARCHIVE.md) for full version history.
|
| 114 |
|
| 115 |
## Environment Variables
|
| 116 |
|
|
|
|
| 123 |
| `LLM_TEMPERATURE` | `0.0` | Temperature |
|
| 124 |
| `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
|
| 125 |
| `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
|
| 126 |
+
| `LLM_MAX_CALLS_PER_MINUTE` | `30` | Per-minute rate limit |
|
| 127 |
|
| 128 |
## Gold Cases
|
| 129 |
|
| 130 |
+
100 cases across 6 categories (development set):
|
| 131 |
|
| 132 |
| Category | Count | Tests |
|
| 133 |
|----------|:-----:|-------|
|
| 134 |
+
| `grounded` | 17 | All claims in evidence → accept |
|
| 135 |
+
| `missing_info` | 14 | Evidence doesn't cover question → abstain |
|
| 136 |
+
| `contradiction` | 15 | Conflicting facts in evidence → flag |
|
| 137 |
+
| `pressure` | 15 | Speculative question → hypothesis with caveats |
|
| 138 |
+
| `filler_trap` | 15 | Tempts model to invent facts → abstain |
|
| 139 |
+
| `partial_answer` | 24 | Some facts available, some not → partial |
|
| 140 |
+
|
| 141 |
+
**100 total cases: development set only, not a held-out evaluation.**
|
| 142 |
|
| 143 |
## Tests
|
| 144 |
|
| 145 |
+
209 tests covering all modules. Run with `pytest -v`.
|
| 146 |
|
| 147 |
```
|
| 148 |
tests/
|
| 149 |
├── test_calibration.py # Table-format probe validation
|
| 150 |
├── test_claim_filter.py # Slot-aware relevance filtering
|
| 151 |
├── test_constants.py # Shared stop words
|
| 152 |
+
├── test_contradiction_checks.py # Status-pair contradictions + possible_conflict audit
|
| 153 |
├── test_evidence_spans.py # Abbreviation-aware splitting
|
| 154 |
├── test_gate.py # All gate rules + edge cases
|
| 155 |
├── test_inference_detector.py # All 4 tiers + exact failure cases
|
|
|
|
| 169 |
|
| 170 |
This is a **research harness**, not a product.
|
| 171 |
|
| 172 |
+
## Known Limitations (v0.4)
|
| 173 |
+
|
| 174 |
+
The v0.4 baseline intentionally trades some detection coverage for **zero false positives** and **maintainable code**.
|
| 175 |
|
| 176 |
+
| # | Limitation | Why | Mitigation |
|
| 177 |
+
|---|-----------|-----|------------|
|
| 178 |
+
| 1 | **Numeric contradictions not caught deterministically** | Money, percentage, count, and date conflicts produce too many false positives when checked by rule (e.g., revenue target vs. actual revenue). | Relies on the verifier LLM. If the LLM misses a conflict, the contradiction is not flagged. |
|
| 179 |
+
| 2 | **Semantic relevance not enforced** | A question such as "How fast can the car go?" still gets `accept` when only engine-spec claims are supported. v0.3.2 had a 20-entry synonym-table guard, but it was too rule-heavy for a baseline. | Acceptable for v0.4. Future: semantic similarity check (not a synonym table). |
|
| 180 |
+
| 3 | **100 cases = dev set only** | The deterministic rules were tuned against failures on this set. Results are directional, not publication-grade. | Create a held-out 50-case test set for unbiased validation. |
|
| 181 |
+
| 4 | **Inference detector is regex-based** | Covers common hedges but cannot catch all inferential reasoning. | Grounded in CogniBench + GME + BioScope; handles most common cases. |
|
| 182 |
+
| 5 | **Single evidence document** | No multi-document consensus or evidence weighting. | Designed for single-pass evaluation. |
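
Limitation 1 is easier to see with a toy example. The sketch below assumes a very naive detector that flags a contradiction only when both halves of a known status pair appear in the evidence, and merely records differing money amounts for audit. The function names, the `STATUS_PAIRS` list (only open/closed and approved/rejected are taken from the gate table), and the regex are illustrative assumptions, not the code in `src/`. It also does not check that both statuses refer to the same entity, which a real check would need.

```python
import re
from typing import List, Tuple

# Assumed pair list; the README names open/closed and approved/rejected.
STATUS_PAIRS = [("open", "closed"), ("approved", "rejected")]

# Matches amounts such as $5,000,000 or $12.50 (illustrative only).
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def status_contradictions(sentences: List[str]) -> List[Tuple[str, str]]:
    """Return every status pair whose two halves both occur in the evidence."""
    words = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    found = []
    for first, second in STATUS_PAIRS:
        has_first = any(first in w for w in words)
        has_second = any(second in w for w in words)
        if has_first and has_second:
            found.append((first, second))
    return found


def possible_numeric_conflicts(sentences: List[str]) -> List[str]:
    """Collect differing money amounts for the audit log; never gate on them."""
    amounts = sorted({m for s in sentences for m in MONEY.findall(s)})
    return amounts if len(amounts) > 1 else []


tickets = ["The ticket is open.", "The ticket was closed on Monday."]
revenue = ["The revenue target is $5,000,000.", "Actual revenue was $4,200,000."]

print(status_contradictions(tickets))       # [('open', 'closed')]
print(status_contradictions(revenue))       # []
print(possible_numeric_conflicts(revenue))  # ['$4,200,000', '$5,000,000']
```

The revenue example shows the false-positive risk: two different amounts are not a contradiction when one is a target and the other an actual figure, so they are logged rather than flagged.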
|
| 183 |
|
| 184 |
## Next Steps
|
| 185 |
|
| 186 |
+
- [x] Simplify to v0.4 baseline: status-pair contradictions only, no frame detector
|
| 187 |
+
- [x] Remove slot-mismatch guard (semantic relevance is a known limitation)
|
| 188 |
+
- [x] 209 tests pass, zero false contradictions
|
| 189 |
+
- [ ] Run v0.4 eval on full 100-case development set
|
| 190 |
+
- [ ] Test on multiple models (1B, 4B, 70B+) to check model independence
|
| 191 |
+
- [ ] Create a held-out 50-case test set for unbiased evaluation
|
| 192 |
- [ ] Confidence calibration analysis
|
|
|
|
| 193 |
|
| 194 |
---
|
| 195 |
|