Sravanth18 committed · Commit c858b2d · verified · 1 Parent(s): 8f0f6ea

Upload README.md

  1. README.md +41 -32
README.md CHANGED
@@ -1,4 +1,4 @@
1
- # Project Verity-H v0.3.1
2
 
3
  **Teaching AI to say "I don't know."**
4
 
@@ -23,10 +23,6 @@ pip install -e ".[test]"
23
 
24
  # Run tests (mock mode, no API key needed)
25
  pytest
26
-
27
- # Run with a real LLM
28
- cp .env.example .env
29
- # Edit .env: set LLM_MODE=hf_api, add your HF_API_KEY
30
  ```
31
 
32
  ## Run Evaluation
@@ -43,12 +39,15 @@ python -m src.baseline_runner --mode normal --output results/baseline_normal.jso
43
  python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl
44
 
45
  # Pipeline
46
- python -m src.pipeline_runner --output results/verity_pipeline.jsonl
 
 
 
47
 
48
  # Report
49
  python -m src.report --normal results/baseline_normal.jsonl \
50
  --honesty results/baseline_honesty.jsonl \
51
- --pipeline results/verity_pipeline.jsonl \
52
  --output results/report.md
53
  ```
54
 
@@ -65,7 +64,7 @@ Question + Evidence
65
  • Filter junk/meta claims
66
  • Fix mislabeled claims via span matching
67
  • Detect inferential claims (4-tier)
68
- • Detect contradictions (frame-based)
69
  5. Gate decision (deterministic)
70
  │
71
  ▼
@@ -80,12 +79,12 @@ Question + Evidence
80
  |-----------|----------|---------------|
81
  | All claims verified | `accept` | Clean answer from verified claims |
82
  | Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
83
- | Evidence contradicts itself | `contradiction` | Flags conflict, shows both sides |
84
  | No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
85
  | Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
86
  | Verifier failed to parse | `verifier_error` | Refuses to answer |
87
 
88
- ## Inference Detection (v0.3.1)
89
 
90
  The verifier catches claims the LLM wrongly marks as SUPPORTED:
91
 
@@ -98,7 +97,7 @@ The verifier catches claims the LLM wrongly marks as SUPPORTED:
98
 
99
  Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08037), BioScope corpus.
100
 
101
- ## Results (Qwen3-4B, 30 cases)
102
 
103
  | Metric | Baseline Normal | Baseline Honesty | Verity-H |
104
  |--------|:-:|:-:|:-:|
@@ -109,7 +108,9 @@ Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08
109
  | Pressure hypothesis (↑) | 0% | 0% | **100%** *(v0.2.1)* |
110
  | False contradiction (↓) | 0% | 0% | **0%** |
111
  | Partial coverage (↑) | 0% | 0% | **100%** |
112
- | Latency p50 | 3,525ms | 3,244ms | **6,495ms** |
 
 
113
 
114
  ## Environment Variables
115
 
@@ -122,30 +123,33 @@ Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08
122
  | `LLM_TEMPERATURE` | `0.0` | Temperature |
123
  | `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
124
  | `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
 
125
 
126
  ## Gold Cases
127
 
128
- 30 cases across 6 categories:
129
 
130
  | Category | Count | Tests |
131
  |----------|:-----:|-------|
132
- | `grounded` | 5 | All claims in evidence → accept |
133
- | `missing_info` | 5 | Evidence doesn't cover question → abstain |
134
- | `contradiction` | 5 | Conflicting facts in evidence → flag |
135
- | `pressure` | 5 | Speculative question → hypothesis with caveats |
136
- | `filler_trap` | 5 | Tempts model to invent facts → abstain |
137
- | `partial_answer` | 5 | Some facts available, some not → partial |
 
 
138
 
139
  ## Tests
140
 
141
- 154 tests covering all modules. Run with `pytest -v`.
142
 
143
  ```
144
  tests/
145
  ├── test_calibration.py # Table-format probe validation
146
  ├── test_claim_filter.py # Slot-aware relevance filtering
147
  ├── test_constants.py # Shared stop words
148
- ├── test_contradiction_checks.py # Frame-based + false positive prevention
149
  ├── test_evidence_spans.py # Abbreviation-aware splitting
150
  ├── test_gate.py # All gate rules + edge cases
151
  ├── test_inference_detector.py # All 4 tiers + exact failure cases
@@ -165,22 +169,27 @@ tests/
165
 
166
  This is a **research harness**, not a product.
167
 
168
- ## Known Limitations
 
 
169
 
170
- 1. **30 cases is a starter set**: not sufficient for statistical significance. Target: 100+.
171
- 2. **Metrics are directional**: text heuristics for baselines, structured outputs for pipeline. Not directly comparable.
172
- 3. **Verifier depends on LLM quality**: the deterministic layers fix many errors, but a very weak LLM will still produce poor claim extraction.
173
- 4. **Inference detector is regex-based**: covers common patterns but cannot catch all forms of inferential reasoning.
174
- 5. **Single evidence document**: no multi-document or multi-turn evidence handling.
 
 
175
 
176
  ## Next Steps
177
 
178
- - [ ] Run v0.3.1 eval (inference detector + contradiction fix)
179
- - [ ] Expand to 100 gold cases
180
- - [ ] Test on multiple models (1B, 4B, 70B+)
181
- - [ ] Per-claim-type metric breakdowns
 
 
182
  - [ ] Confidence calibration analysis
183
- - [ ] Inter-annotator agreement study
184
 
185
  ---
186
 
 
1
+ # Project Verity-H v0.4
2
 
3
  **Teaching AI to say "I don't know."**
4
 
 
23
 
24
  # Run tests (mock mode, no API key needed)
25
  pytest
 
 
 
 
26
  ```
27
 
28
  ## Run Evaluation
 
39
  python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl
40
 
41
  # Pipeline
42
+ python -m src.pipeline_runner --output results/verity_pipeline_v0.4.jsonl
43
+
44
+ # Batched (resumable if interrupted)
45
+ python run_pipeline_batched.py --delay 0.5 --output results/verity_pipeline_v0.4.jsonl
46
 
47
  # Report
48
  python -m src.report --normal results/baseline_normal.jsonl \
49
  --honesty results/baseline_honesty.jsonl \
50
+ --pipeline results/verity_pipeline_v0.4.jsonl \
51
  --output results/report.md
52
  ```
53
 
 
64
  • Filter junk/meta claims
65
  • Fix mislabeled claims via span matching
66
  • Detect inferential claims (4-tier)
67
+ • Detect contradictions (status-pair only; numeric/date logged for audit)
68
  5. Gate decision (deterministic)
69
  │
70
  ▼
 
79
  |-----------|----------|---------------|
80
  | All claims verified | `accept` | Clean answer from verified claims |
81
  | Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
82
+ | Status-pair contradiction (open/closed, approved/rejected, etc.) | `contradiction` | Flags conflict, shows both sides |
83
  | No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
84
  | Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
85
  | Verifier failed to parse | `verifier_error` | Refuses to answer |
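
The gate table above can be read as a deterministic function from verifier output to a decision label. The sketch below is a minimal illustration, not the project's gate module: the `VerifierResult` fields, the function name `gate_decision`, and the order in which the rules are checked are all assumptions. Only the six decision labels come from the table.

```python
from dataclasses import dataclass


@dataclass
class VerifierResult:
    """Hypothetical summary of one verifier pass (field names are assumptions)."""
    parsed: bool = True            # False if the verifier output could not be parsed
    supported: int = 0             # claims backed by the evidence
    unsupported: int = 0           # claims with no backing evidence
    status_conflict: bool = False  # e.g. "open" vs "closed" found in the evidence
    pressure: int = 0              # 1 marks a speculative question


def gate_decision(r: VerifierResult) -> str:
    """Map a verifier result to one of the six labels in the gate table.

    The rule order is an assumed precedence (errors first, then conflicts);
    the real gate may order its rules differently.
    """
    if not r.parsed:
        return "verifier_error"
    if r.status_conflict:
        return "contradiction"
    if r.pressure == 1:
        return "hypothesis"
    if r.supported == 0:
        return "needs_info"
    if r.unsupported > 0:
        return "partial"
    return "accept"
```

Because the function has no randomness and no model call, the same verifier result always produces the same label, which is what makes the gate testable in mock mode.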
86
 
87
+ ## Inference Detection (v0.3.1+)
88
 
89
  The verifier catches claims the LLM wrongly marks as SUPPORTED:
90
 
 
97
 
98
  Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08037), BioScope corpus.
99
 
100
+ ## Results (Qwen3-4B, 30 cases, v0.2.1)
101
 
102
  | Metric | Baseline Normal | Baseline Honesty | Verity-H |
103
  |--------|:-:|:-:|:-:|
 
108
  | Pressure hypothesis (↑) | 0% | 0% | **100%** *(v0.2.1)* |
109
  | False contradiction (↓) | 0% | 0% | **0%** |
110
  | Partial coverage (↑) | 0% | 0% | **100%** |
111
+ | Latency p50 | 3,525ms | 3,244ms | **6,495ms** *(v0.3, 2-call batch)* |
112
+
113
+ See [RESULTS_ARCHIVE.md](RESULTS_ARCHIVE.md) for full version history.
114
 
115
  ## Environment Variables
116
 
 
123
  | `LLM_TEMPERATURE` | `0.0` | Temperature |
124
  | `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
125
  | `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
126
+ | `LLM_MAX_CALLS_PER_MINUTE` | `30` | Per-minute rate limit |
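
One way the two rate-limit settings could interact is sketched below: `LLM_CALL_DELAY` enforces a minimum gap between consecutive calls, and `LLM_MAX_CALLS_PER_MINUTE` caps the number of calls in any 60-second window. This is an assumption about the semantics, not the project's implementation; the `RateLimiter` class and `limiter_from_env` helper are hypothetical names introduced for illustration.

```python
import os
import time
from collections import deque


class RateLimiter:
    """Hypothetical limiter: fixed delay between calls plus a per-minute cap."""

    def __init__(self, delay, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.max_per_minute = max_per_minute
        self.clock = clock    # injectable so tests need not really sleep
        self.sleep = sleep
        self.calls = deque()  # start times of recent calls

    def wait(self):
        """Block until the next call is allowed, then record it."""
        now = self.clock()
        # Drop calls that have left the 60-second window.
        while self.calls and now - self.calls[0] >= 60:
            self.calls.popleft()
        pause = 0.0
        if self.calls:
            # Enforce the minimum gap since the previous call.
            pause = max(pause, self.calls[-1] + self.delay - now)
        if len(self.calls) >= self.max_per_minute:
            # Window is full: wait until the oldest call expires.
            pause = max(pause, self.calls[0] + 60 - now)
        if pause > 0:
            self.sleep(pause)
        self.calls.append(self.clock())


def limiter_from_env(env=os.environ):
    """Read both settings, falling back to the defaults listed in the table."""
    return RateLimiter(
        delay=float(env.get("LLM_CALL_DELAY", "2")),
        max_per_minute=int(env.get("LLM_MAX_CALLS_PER_MINUTE", "30")),
    )
```

With the table defaults (2 seconds, 30 calls per minute) the delay alone already keeps the call rate at the cap, so the per-minute limit only bites when the delay is lowered, as in the batched run with `--delay 0.5`.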
127
 
128
  ## Gold Cases
129
 
130
+ 100 cases across 6 categories (development set):
131
 
132
  | Category | Count | Tests |
133
  |----------|:-----:|-------|
134
+ | `grounded` | 17 | All claims in evidence → accept |
135
+ | `missing_info` | 14 | Evidence doesn't cover question → abstain |
136
+ | `contradiction` | 15 | Conflicting facts in evidence → flag |
137
+ | `pressure` | 15 | Speculative question → hypothesis with caveats |
138
+ | `filler_trap` | 15 | Tempts model to invent facts → abstain |
139
+ | `partial_answer` | 24 | Some facts available, some not → partial |
140
+
141
+ **100 total cases, development set only. Not a held-out evaluation.**
142
 
143
  ## Tests
144
 
145
+ 209 tests covering all modules. Run with `pytest -v`.
146
 
147
  ```
148
  tests/
149
  ├── test_calibration.py # Table-format probe validation
150
  ├── test_claim_filter.py # Slot-aware relevance filtering
151
  ├── test_constants.py # Shared stop words
152
+ ├── test_contradiction_checks.py # Status-pair contradictions + possible_conflict audit
153
  ├── test_evidence_spans.py # Abbreviation-aware splitting
154
  ├── test_gate.py # All gate rules + edge cases
155
  ├── test_inference_detector.py # All 4 tiers + exact failure cases
 
169
 
170
  This is a **research harness**, not a product.
171
 
172
+ ## Known Limitations (v0.4)
173
+
174
+ The v0.4 baseline intentionally trades some detection for **zero false positives** and **maintainable code**.
175
 
176
+ | # | Limitation | Why | Mitigation |
177
+ |---|-----------|-----|------------|
178
+ | 1 | **Numeric contradictions not caught deterministically** | Money/percentage/count/date conflicts have too many false positives (e.g., revenue target vs actual revenue). | Relies on verifier LLM. If LLM misses, contradiction is not flagged. |
179
+ | 2 | **Semantic relevance not enforced** | "How fast can the car go?" with only engine specs supported → `accept`. v0.3.2 had a 20-entry synonym-table guard but it was too rule-heavy for a baseline. | Acceptable for v0.4. Future: semantic similarity check (not synonym table). |
180
+ | 3 | **100 cases = dev set only** | The deterministic rules were tuned against failures on this set. Results are directional, not publication-grade. | Create held-out 50-case test set for unbiased validation. |
181
+ | 4 | **Inference detector is regex-based** | Covers common hedges but cannot catch all inferential reasoning. | Grounded in CogniBench + GME + BioScope; handles most common cases. |
182
+ | 5 | **Single evidence document** | No multi-document consensus or evidence weighting. | Designed for single-pass evaluation. |
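
Limitation 1 follows from the v0.4 rule that only status-pair conflicts raise a `contradiction`, while numeric differences are logged for audit. A minimal sketch of that split is shown below. It is an illustration under stated assumptions: the `STATUS_PAIRS` list, the `(subject, text)` input shape, and the `check_conflicts` function are hypothetical, and the project's actual pair list and matching logic are not shown in this README.

```python
import re

# Illustrative antonym pairs (assumed; the real list is not given in the README).
STATUS_PAIRS = [("open", "closed"), ("approved", "rejected"), ("active", "inactive")]

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def check_conflicts(facts):
    """Split conflicts between facts into hard contradictions and audit-only items.

    `facts` is a list of (subject, text) tuples. Two facts about the same
    subject contradict when they carry opposite status words. If they merely
    carry different numbers, the pair is recorded as a possible conflict and
    never escalated, because such differences are often legitimate.
    """
    contradictions, possible = [], []
    for i, (subj_a, text_a) in enumerate(facts):
        for subj_b, text_b in facts[i + 1:]:
            if subj_a != subj_b:
                continue
            words_a = set(re.findall(r"[a-z]+", text_a.lower()))
            words_b = set(re.findall(r"[a-z]+", text_b.lower()))
            for x, y in STATUS_PAIRS:
                if (x in words_a and y in words_b) or (y in words_a and x in words_b):
                    contradictions.append((text_a, text_b))
                    break
            else:
                # No status conflict: compare numbers, but only log the result.
                nums_a = set(NUMBER.findall(text_a))
                nums_b = set(NUMBER.findall(text_b))
                if nums_a and nums_b and nums_a != nums_b:
                    possible.append((text_a, text_b))
    return contradictions, possible
```

For example, "Ticket 42 is open." against "Ticket 42 was closed on Monday." is returned as a contradiction, while "Revenue target is 5 million." against "Actual revenue was 3 million." lands only in the audit list, which mirrors the target-versus-actual false positive described in the table.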
183
 
184
  ## Next Steps
185
 
186
+ - [x] Simplify to v0.4 baseline: status-pair contradictions only, no frame detector
187
+ - [x] Remove slot-mismatch guard (semantic relevance is known limitation)
188
+ - [x] 209 tests pass, zero false contradictions
189
+ - [ ] Run v0.4 eval on full 100-case development set
190
+ - [ ] Test on multiple models (1B, 4B, 70B+) to prove model independence
191
+ - [ ] Create held-out 50-case test set for unbiased evaluation
192
  - [ ] Confidence calibration analysis
 
193
 
194
  ---
195