Post-hoc calibration without retraining for large language models. This toolkit turns a raw prompt into:
- a bounded hallucination risk using the Expectation-level Decompression Law (EDFL), and
- a decision to ANSWER or REFUSE under a target SLA, with transparent math (nats).
It supports two deployment modes:
- Evidence-based: prompts include evidence/context; rolling priors are built by erasing that evidence.
- Closed-book: prompts have no evidence; rolling priors are built by semantic masking of entities/numbers/titles.
All scoring relies only on the OpenAI Chat Completions API. No retraining required.
- Install & Setup
- Core Mathematical Framework
- Understanding System Behavior
- Two Ways to Build Rolling Priors
- API Surface
- Calibration & Validation
- Practical Considerations
- Project Layout
- Deployment Options
```bash
pip install --upgrade openai
export OPENAI_API_KEY=sk-...
```
The module uses `openai>=1.0.0` and the Chat Completions API (e.g., `gpt-4o`, `gpt-4o-mini`).
Let the binary event $\mathcal{A}$ be the outcome being scored (e.g., "the model answers" vs. "refuses", or "the answer is correct"). For each item:

- Build an ensemble of $m$ content-weakened prompts (the rolling priors) $\{S_k\}_{k=1}^{m}$.
- Information budget:
  $$\bar{\Delta} = \tfrac{1}{m}\sum_k \mathrm{clip}_+(\log P(y) - \log S_k(y), B)$$
  where $P$ is the full prompt, $S_k$ the $k$-th skeleton, and $y$ the realized output/decision (one-sided clipping; default $B=12$ nats to prevent outliers while maintaining conservative bounds).
- Prior masses: $q_k = S_k(\mathcal{A})$, with:
  - $\bar{q}=\tfrac{1}{m}\sum_k q_k$ (average prior for the EDFL bound)
  - $q_{\text{lo}}=\min_k q_k$ (worst-case prior for SLA gating)
- By EDFL, the achievable reliability $1-h$ is constrained by the information budget via $\mathrm{KL}(\mathrm{Ber}(1-h)\,\|\,\mathrm{Ber}(\bar{q})) \le \bar{\Delta}$. Thus the hallucination risk (error) is bounded accordingly; the per-item value is reported as the EDFL RoH bound.
- For target hallucination rate $h^*$, define:
  - Bits-to-Trust: $\mathrm{B2T} = \mathrm{KL}(\mathrm{Ber}(1-h^*)\,\|\,\mathrm{Ber}(q_{\text{lo}}))$
  - Information Sufficiency Ratio: $\mathrm{ISR} = \bar{\Delta}/\mathrm{B2T}$
  - ANSWER iff $\mathrm{ISR}\ge 1$ and $\bar{\Delta} \ge \mathrm{B2T} + \text{margin}$ (default margin ≈ 0.2 nats)

Why two priors? The gate uses the worst-case prior $q_{\text{lo}}$ for strict SLA compliance; the RoH bound uses the average prior $\bar{q}$, per EDFL theory. This dual approach ensures conservative safety while providing realistic risk bounds.
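To make the arithmetic concrete, here is a minimal standalone sketch of the decision gate. The helper functions and all numbers are illustrative; they are not the toolkit's internals.

```python
import math

def clip_plus(x: float, B: float = 12.0) -> float:
    """One-sided clip used in the information budget: clip_+(x, B) = min(max(x, 0), B)."""
    return min(max(x, 0.0), B)

def bern_kl(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q)) in nats, clamped away from 0/1 for numerical safety."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def gate(delta_bar: float, q_lo: float, h_star: float = 0.05, margin: float = 0.2):
    """ANSWER/REFUSE decision from the information budget and the worst-case prior."""
    b2t = bern_kl(1.0 - h_star, q_lo)          # Bits-to-Trust
    isr = delta_bar / b2t if b2t > 0 else float("inf")
    answer = isr >= 1.0 and delta_bar >= b2t + margin
    return b2t, isr, answer

# Illustrative numbers only (not produced by the toolkit):
log_P_y = -0.2                                     # log-prob of realized output under the full prompt
log_S_y = [-1.8, -2.4, -1.1, -3.0, -2.2, -1.6]     # log-probs under the m = 6 skeletons
delta_bar = sum(clip_plus(log_P_y - s) for s in log_S_y) / len(log_S_y)
b2t, isr, answer = gate(delta_bar, q_lo=0.2)
print(f"Δ̄={delta_bar:.3f} nats, B2T={b2t:.3f}, ISR={isr:.2f} → {'ANSWER' if answer else 'REFUSE'}")
```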
The toolkit exhibits different behaviors across query types, which is mathematically consistent with the framework:
Observation: The system may abstain despite apparent simplicity (e.g., closed-book arithmetic).

Explanation:
- Models often attempt answers even with masked numbers (pattern recognition)
- This yields low information lift ($\bar{\Delta} \approx 0$) between the full prompt and the skeletons
- Despite a potentially low EDFL risk bound, the worst-case prior gate triggers abstention (ISR < 1)
Observation: Factual queries with distinctive entities are generally answered with confidence.

Explanation:
- Masking entities/dates substantially reduces answer propensity in the skeletons
- Restoring them yields a large $\bar{\Delta}$ that clears the B2T threshold
- The system answers with a tight EDFL risk bound
This is not a bug but a feature: The framework prioritizes safety through worst-case guarantees while providing realistic average-case bounds.
If abstention rates are higher than desired for a task family, several adjustments can shift behavior:

- Switch Event Measurement
  - Use Correct/Incorrect instead of Answer/Refuse for factual QA
  - Skeletons without the key information rarely yield correct results → large $\bar{\Delta}$
- Enhance Skeleton Weakening
  - Implement a mask-aware decision head that refuses on redaction tokens
  - This ensures skeletons have strictly lower "Answer" mass than the full prompt
- Calibration Adjustments (see the sketch after this list)
  - Relax $h^*$ slightly (e.g., 0.10 instead of 0.05) for higher answer rates
  - Reduce the margin for less conservative gating
  - Increase sampling ($n = 7$–$10$) for stability
- Provide Evidence
  - Adding compact, relevant evidence increases $\bar{\Delta}$ while preserving the bounds
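For example, a more permissive run configuration might look like the following sketch; the specific values are illustrative, not recommendations, and the API calls are the ones documented in the API Surface section below.

```python
from scripts.hallucination_toolkit import OpenAIBackend, OpenAIItem, OpenAIPlanner

# Illustrative relaxed policy: h* = 0.10, smaller margin, more samples for stability.
backend = OpenAIBackend(model="gpt-4o-mini")
item = OpenAIItem(
    prompt="Who won the 2019 Nobel Prize in Physics?",
    n_samples=10,               # more samples -> more stable priors
    m=6,
    skeleton_policy="closed_book"
)
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run(
    [item],
    h_star=0.10,                # relaxed target hallucination rate
    isr_threshold=1.0,
    margin_extra_bits=0.1,      # less conservative gate
    B_clip=12.0,
    clip_mode="one-sided"
)
```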
- Prompt contains a field like `Evidence:` (or JSON keys)
- Skeletons erase the evidence content but preserve structure and roles, then permute blocks deterministically (seeded)
- Decision head: "Answer only if the provided evidence is sufficient; otherwise refuse."
Example
```python
from scripts.hallucination_toolkit import OpenAIBackend, OpenAIItem, OpenAIPlanner

backend = OpenAIBackend(model="gpt-4o-mini")

prompt = (
    """Task: Answer strictly based on the evidence below.
Question: Who won the Nobel Prize in Physics in 2019?
Evidence:
- Nobel Prize press release (2019): James Peebles (1/2); Michel Mayor & Didier Queloz (1/2).
Constraints: If evidence is insufficient or conflicting, refuse.
"""
)

item = OpenAIItem(
    prompt=prompt,
    n_samples=5,
    m=6,
    fields_to_erase=["Evidence"],
    skeleton_policy="auto"
)

planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run(
    [item],
    h_star=0.05,
    isr_threshold=1.0,
    margin_extra_bits=0.2,
    B_clip=12.0,
    clip_mode="one-sided"
)

for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"Rationale: {m.rationale}")
```
- Prompt has no evidence
- Skeletons apply semantic masking (illustrated in the sketch after this list) of:
  - Multi-word proper nouns (e.g., "James Peebles" → "[…]")
  - Years (e.g., "2019" → "[…]")
  - Numbers (e.g., "3.14" → "[…]")
  - Quoted spans (e.g., '"Nobel Prize"' → "[…]")
- Masking strengths: progressive levels (0.25, 0.35, 0.5, 0.65, 0.8, 0.9) across the skeleton ensemble
- A mask-aware decision head refuses if redaction tokens appear or key slots look missing
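The sketch below illustrates the masking idea with simple regexes and a per-skeleton strength; it is not the toolkit's implementation, just a picture of how progressively weakened closed-book skeletons can be produced.

```python
import random
import re

# Illustrative sketch of closed-book semantic masking (NOT the toolkit's implementation):
# redact quoted spans, multi-word proper nouns, years, and other numbers with probability
# equal to the masking strength, so stronger skeletons carry less content.
PATTERNS = [
    r'"[^"]+"',                              # quoted spans
    r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b",   # multi-word proper nouns
    r"\b(?:19|20)\d{2}\b",                   # years
    r"\b\d+(?:\.\d+)?\b",                    # other numbers
]

def mask_prompt(prompt: str, strength: float, seed: int = 0) -> str:
    """Replace matched spans with a redaction token at the given strength (deterministic per seed)."""
    rng = random.Random(seed)
    masked = prompt
    for pattern in PATTERNS:
        masked = re.sub(pattern, lambda m: "[…]" if rng.random() < strength else m.group(0), masked)
    return masked

# One skeleton per strength level, mirroring the progressive ensemble described above.
question = "Who won the 2019 Nobel Prize in Physics?"
for k, strength in enumerate([0.25, 0.35, 0.5, 0.65, 0.8, 0.9]):
    print(f"S_{k + 1}: {mask_prompt(question, strength, seed=k)}")
```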
Example
```python
from scripts.hallucination_toolkit import OpenAIBackend, OpenAIItem, OpenAIPlanner

backend = OpenAIBackend(model="gpt-4o-mini")

item = OpenAIItem(
    prompt="Who won the 2019 Nobel Prize in Physics?",
    n_samples=7,               # More samples for stability
    m=6,                       # Number of skeletons
    skeleton_policy="closed_book"
)

planner = OpenAIPlanner(backend, temperature=0.3, q_floor=None)
metrics = planner.run(
    [item],
    h_star=0.05,               # Target max 5% hallucination
    isr_threshold=1.0,         # Standard ISR gate
    margin_extra_bits=0.2,     # Safety margin in nats
    B_clip=12.0,               # Clipping bound
    clip_mode="one-sided"      # Conservative clipping
)

for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"Δ̄={m.delta_bar:.4f}, B2T={m.b2t:.4f}, ISR={m.isr:.3f}")
    print(f"EDFL RoH bound={m.roh_bound:.3f}")
```
Tuning knobs (closed-book):
- `n_samples=5–7` and `temperature≈0.3` stabilize priors
- `q_floor` (Laplace by default: $1/(n+2)$) prevents the worst-case prior from collapsing to 0
- Adjust masking strength levels if a task family remains too answerable under masking
- `OpenAIBackend(model, api_key=None)` – wraps the Chat Completions API
- `OpenAIItem(prompt, n_samples=5, m=6, fields_to_erase=None, skeleton_policy="auto")` – one evaluation item
- `OpenAIPlanner(backend, temperature=0.5, q_floor=None)` – runs the evaluation:
  - `run(items, h_star, isr_threshold, margin_extra_bits, B_clip=12.0, clip_mode="one-sided") -> List[ItemMetrics]`
  - `aggregate(items, metrics, alpha=0.05, h_star, ...) -> AggregateReport`
- `make_sla_certificate(report, model_name)` – creates a formal SLA certificate
- `save_sla_certificate_json(cert, path)` – exports the certificate for audit
- `generate_answer_if_allowed(backend, item, metric)` – only emits an answer if the decision was ANSWER
Every `ItemMetrics` includes:
- `delta_bar`: Information budget (nats)
- `q_conservative`: Worst-case prior $q_{\text{lo}}$
- `q_avg`: Average prior $\bar{q}$
- `b2t`: Bits-to-Trust requirement
- `isr`: Information Sufficiency Ratio
- `roh_bound`: EDFL hallucination risk bound
- `decision_answer`: Boolean decision
- `rationale`: Human-readable explanation
- `meta`: Dict with `q_list`, `S_list_y`, `P_y`, `closed_book`, etc.
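These fields can be persisted directly for an audit log. A small sketch, assuming `metrics` is the `List[ItemMetrics]` returned by `planner.run(...)` as in the examples above and that `meta` contains only JSON-serializable values:

```python
import json

# Dump the per-item audit trail using the documented ItemMetrics fields.
audit = [
    {
        "delta_bar": m.delta_bar,
        "q_conservative": m.q_conservative,
        "q_avg": m.q_avg,
        "b2t": m.b2t,
        "isr": m.isr,
        "roh_bound": m.roh_bound,
        "decision_answer": m.decision_answer,
        "rationale": m.rationale,
        "meta": m.meta,
    }
    for m in metrics
]
with open("audit_trail.json", "w") as f:
    json.dump(audit, f, indent=2)
```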
On a labeled validation set:
- Sweep the margin parameter from 0 to 1 nats
- For each margin, compute:
  - Empirical hallucination rate among answered items
  - Wilson upper bound at 95% confidence
- Select the smallest margin where the Wilson upper bound ≤ target $h^*$ (e.g., 5%)
- Freeze the policy: $(h^*, \tau, \text{margin}, B, \text{clip\_mode}, m, r, \text{skeleton\_policy})$
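A minimal sketch of this sweep, assuming you have stored per-item validation records of $(\bar{\Delta}, \mathrm{B2T}, \mathrm{ISR}, \text{hallucinated?})$; the function and record names here are hypothetical, not part of the toolkit.

```python
import math

def wilson_upper(k: int, n: int, z: float = 1.96) -> float:
    """Wilson score upper bound for k errors out of n answered items (95% by default)."""
    if n == 0:
        return 0.0
    p = k / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    radius = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + radius) / denom

def calibrate_margin(records, h_star=0.05, grid=None):
    """records: list of (delta_bar, b2t, isr, is_hallucination) from a labeled validation set.
    Returns the smallest margin whose Wilson upper bound on answered items is <= h_star."""
    grid = grid or [i / 20 for i in range(0, 21)]   # 0.0 .. 1.0 nats
    for margin in grid:
        # Re-apply the gate at this margin to decide which items would have been answered.
        answered = [r for r in records if r[2] >= 1.0 and r[0] >= r[1] + margin]
        errors = sum(1 for r in answered if r[3])
        if answered and wilson_upper(errors, len(answered)) <= h_star:
            return margin
    return None  # no margin in the grid meets the target
```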
The toolkit provides comprehensive metrics:
- Answer/abstention rates
- Empirical hallucination rate + Wilson bound
- Distribution of per-item EDFL RoH bounds
- Worst-case and median risk bounds
- Complete audit trail
The default event is the decision to answer (Answer vs. Refuse). Choose the event to match the task:

| Task Type | Recommended Event | Rationale |
|---|---|---|
| Factual QA | Correct/Incorrect | Directly measures hallucination |
| Decision Support | Answer/Refuse | Measures confidence to respond |
| Creative Writing | Answer/Refuse | Correctness often undefined |
For tasks where skeletons still trigger answers frequently (causing $\bar{\Delta} \approx 0$), switching to the Correct/Incorrect event usually restores a meaningful information budget.
- REFUSE despite a low RoH bound
  - Not a contradiction! The gate uses the worst-case prior $q_{\text{lo}}$, while the RoH bound uses the average prior $\bar{q}$
  - Solution: Increase `n_samples`, lower the decision temperature, and ensure skeletons truly weaken the event
- Information budget looks artificially capped
  - Cause: Clipping may be too aggressive
  - Solution: Increase `B_clip` (default 12) and use `clip_mode="one-sided"`
- Closed-book arithmetic keeps abstaining
  - Cause: Pattern recognition allows answers even with masked numbers
  - Solutions:
    - Switch to a Correctness event
    - Reduce masking strength for numbers on a subset of skeletons
    - Provide worked examples as evidence
- Worst-case prior collapses to zero
  - Cause: All skeletons strongly refuse
  - Solution: Apply a prior floor (default Laplace: $1/(n+2)$) or use a quantile prior
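As a worked example, with $n=7$ samples per skeleton the default Laplace floor is $1/(7+2) \approx 0.11$, which keeps $q_{\text{lo}}$ away from zero (and B2T finite) even when every skeleton refuses in every sample.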
| Metric | Typical Value | Notes |
|---|---|---|
| Latency per item | 2–5 seconds | 7 samples × 7 variants (1 full + 6 skeletons) |
| API calls | ~49 per item (7 × 7) | Can be parallelized |
| Accuracy | Wilson-bounded at 95% | Empirically validated |
| Cost | ~$0.01–0.03 per item | Using gpt-4o-mini |
- Sampling parameters:
  - Use $n \ge 5$ samples per variant
  - Keep temperature $\in [0.2, 0.5]$ for the decision head
  - Lower temperature → more stable priors
- Skeleton ensemble:
  - Use $m \ge 6$ skeletons
  - Ensure diversity in masking strengths
  - Verify skeletons are meaningfully weaker than the full prompt
- Clipping strategy:
  - Always use one-sided clipping for conservative bounds
  - Set $B \ge 10$ nats to avoid artificial ceilings
  - Monitor clipping frequency in logs
```
.
├── app/                      # Application entry points
│   ├── web/web_app.py        # Streamlit UI
│   ├── cli/frontend.py       # Interactive CLI
│   ├── examples/             # Example scripts
│   └── launcher/entry.py     # Unified launcher
├── scripts/                  # Core module
│   ├── hallucination_toolkit.py
│   └── build_offline_backend.sh
├── electron/                 # Desktop wrapper
├── launch/                   # Platform launchers
├── release/                  # Packaged artifacts
├── bin/                      # Offline backend binary
├── requirements.txt
├── pyproject.toml
└── README.md
```
```python
from scripts.hallucination_toolkit import (
    OpenAIBackend, OpenAIItem, OpenAIPlanner,
    make_sla_certificate, save_sla_certificate_json
)

# Configure and run
backend = OpenAIBackend(model="gpt-4o-mini")
items = [OpenAIItem(prompt="...", n_samples=7, m=6)]
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run(items, h_star=0.05)

# Generate SLA certificate
report = planner.aggregate(items, metrics)
cert = make_sla_certificate(report, model_name="GPT-4o-mini")
save_sla_certificate_json(cert, "sla.json")
```
```bash
streamlit run app/web/web_app.py
```
- Windows: Double-click `launch/Launch App.bat`
- macOS: Double-click `launch/Launch App.command`
- Linux: Run `bash launch/launch.sh`

First run creates `.venv` and installs dependencies automatically.
Development:
```bash
cd electron
npm install
npm run start
```
Build installers:
```bash
npm run build
```
Build single-file executable:
```bash
# macOS/Linux
bash scripts/build_offline_backend.sh

# Windows
scripts\build_offline_backend.bat
```
Creates `bin/hallucination-backend[.exe]` with bundled Python, Streamlit, and dependencies.
```python
from scripts.hallucination_toolkit import (
    OpenAIBackend, OpenAIItem, OpenAIPlanner,
    make_sla_certificate, save_sla_certificate_json,
    generate_answer_if_allowed
)

# Setup
backend = OpenAIBackend(model="gpt-4o-mini")

# Prepare items
items = [
    OpenAIItem(
        prompt="Who won the 2019 Nobel Prize in Physics?",
        n_samples=7,
        m=6,
        skeleton_policy="closed_book"
    ),
    OpenAIItem(
        prompt="If James has 5 apples and eats 3, how many remain?",
        n_samples=7,
        m=6,
        skeleton_policy="closed_book"
    )
]

# Run evaluation
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run(
    items,
    h_star=0.05,               # Target 5% hallucination max
    isr_threshold=1.0,         # Standard threshold
    margin_extra_bits=0.2,     # Safety margin
    B_clip=12.0,               # Clipping bound
    clip_mode="one-sided"      # Conservative mode
)

# Generate report and certificate
report = planner.aggregate(items, metrics, alpha=0.05, h_star=0.05)
cert = make_sla_certificate(report, model_name="GPT-4o-mini")
save_sla_certificate_json(cert, "sla_certificate.json")

# Show results
for item, m in zip(items, metrics):
    print(f"\nPrompt: {item.prompt[:50]}...")
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"Risk bound: {m.roh_bound:.3f}")
    print(f"Rationale: {m.rationale}")

    # Generate answer if allowed
    if m.decision_answer:
        answer = generate_answer_if_allowed(backend, item, m)
        print(f"Answer: {answer}")
```
This project is licensed under the MIT License — see the LICENSE file for details.
Developed by Hassana Labs (https://hassana.io).
This implementation follows the framework from the paper “Compression Failure in LLMs: Bayesian in Expectation, Not in Realization” (NeurIPS 2024 preprint) and related EDFL/ISR/B2T methodology.