Open, practical tools to understand LLM streaming & sampling and run a reproducible mini-eval with hash-attested results.
Measure → Evaluate → Prove.
Streaming logger (TTFT & tokens/sec) + coding eval (pass@1 / pass@k) + tamper-evident receipts.
- Node 18+
- A model endpoint:
  - an OpenAI-compatible hosted API (e.g., OpenAI), or
  - a local OpenAI-compatible server (vLLM / llama.cpp)
- `npm i`
- Copy `.env.example` → `.env`, then set `BASE_URL`, `API_KEY` (if hosted), and `MODEL`
- Run:

  ```bash
  npm run stream -- "Explain recursion in one paragraph with a short JS example."
  ```
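Under the hood, the streaming measurement comes down to timing the first streamed delta. Below is a minimal sketch of that idea, assuming the official `openai` Node SDK and the same `BASE_URL` / `API_KEY` / `MODEL` environment variables; the repo's actual `npm run stream` script may be structured differently.

```ts
// Illustrative TTFT sketch, not the repo's stream script.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: process.env.BASE_URL,             // e.g. a local vLLM / llama.cpp server
  apiKey: process.env.API_KEY ?? "sk-local", // hosted APIs need a real key
});

async function streamOnce(prompt: string) {
  const t0 = performance.now();
  let ttftMs: number | null = null;
  let text = "";

  const stream = await client.chat.completions.create({
    model: process.env.MODEL ?? "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    stream: true,
    temperature: 0,
    top_p: 1,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta && ttftMs === null) ttftMs = performance.now() - t0; // first token arrives
    text += delta;
  }

  return { ttft_ms: ttftMs, generation_ms: performance.now() - t0, text };
}

streamOnce("Explain recursion in one paragraph.").then(console.log);
```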
This writes (Streaming):

- `manifest.json` — model, tokenizer, sampling, prompt hash
- `results/run-*.json` — timings (`ttft_ms`, `generation_ms`), tokens (input/output/total), `rates.tokens_per_sec`, output text
Note: TTFT varies with network and queueing; tokens/sec is computed from exact token counts (via tiktoken), not estimated from characters.
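The "exact" part comes from counting the output with a tokenizer. Here is a minimal sketch, assuming the `tiktoken` npm package; the encoding name and field names are illustrative and may not match what the logger actually uses.

```ts
// Token-rate sketch: assumes the `tiktoken` npm package is installed.
import { get_encoding } from "tiktoken";

function rates(outputText: string, generationMs: number) {
  // Pick the encoding that matches your model
  // (e.g. "o200k_base" for the GPT-4o family, "cl100k_base" for older models).
  const enc = get_encoding("cl100k_base");
  const outputTokens = enc.encode(outputText).length; // exact count, not an estimate
  enc.free();                                         // the WASM encoder must be freed

  return {
    output_tokens: outputTokens,
    tokens_per_sec: outputTokens / (generationMs / 1000),
  };
}

console.log(rates("A short model reply used for illustration.", 1300));
```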
| temp | top_p | TTFT (ms) | tokens/sec | note |
|---|---|---|---|---|
| 0.0 | 1.0 | ~1205 | ~57.25 | baseline, deterministic |
| 0.8 | 0.9 | ~766 | ~31.62 | creative phrasing |
Run a deterministic eval (k=1), verify receipts, and print a summary row:
```bash
K_ATTEMPTS=1 npm run eval
npm run verify -- $(ls -t results/attest-*.jsonl | head -n1)
npm run summarize -- $(ls -t results/eval-*.json | head -n1)
```

This writes (Eval + Attestations):
- `results/eval-*.json` — per-task `attempts[]` with `latency_ms` and token counts, plus `totals` (pass@1 / pass@k)
- `results/attest-*.jsonl` — hash-chained receipts (one JSON line per task)
- The verifier prints `Attestation OK ✓` when the chain is intact
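The receipts are tamper-evident because each line commits to the hash of the previous line, so editing any record invalidates every hash after it. Below is a minimal verifier sketch under assumed field names (`payload`, `prev_hash`, `hash`) and an assumed SHA-256 scheme; the repo's `npm run verify` may use a different schema.

```ts
// Hash-chain verifier sketch: field names are assumptions, not the repo's exact schema.
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

function verifyChain(path: string): boolean {
  const lines = readFileSync(path, "utf8").split("\n").filter(Boolean);
  let prev = "GENESIS"; // first receipt chains from a fixed genesis value

  for (const [i, line] of lines.entries()) {
    const r = JSON.parse(line) as { payload: unknown; prev_hash: string; hash: string };

    // Each receipt must reference the previous receipt's hash...
    if (r.prev_hash !== prev) {
      console.error(`Chain broken at line ${i + 1}: prev_hash mismatch`);
      return false;
    }

    // ...and its own hash must cover (prev_hash + payload).
    const expected = createHash("sha256")
      .update(r.prev_hash + JSON.stringify(r.payload))
      .digest("hex");
    if (r.hash !== expected) {
      console.error(`Chain broken at line ${i + 1}: hash mismatch`);
      return false;
    }

    prev = r.hash;
  }
  return true;
}

console.log(verifyChain(process.argv[2]) ? "Attestation OK ✓" : "Attestation FAILED ✗");
```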
| model | setting | pass@1 | pass@k | median task latency | avg tokens (in / out) |
|---|---|---|---|---|---|
| GPT-4o-mini | temp=0, top_p=1, k=1 | 16/16 | 16/16 | ~3145 ms | ~25 / ~75 |
| GPT-4o | temp=0, top_p=1, k=1 | 16/16 | 16/16 | ~1573 ms | ~25 / ~79 |
Notes
- Both models pass all 16 tasks on the first attempt (pass@1 = 16/16).
- GPT-4o shows lower median latency than GPT-4o-mini on this suite (~1.6s vs ~3.1s in latest runs).
- Output token lengths are similar (GPT-4o slightly longer on average).
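With k = 1 the pass@k column is simply pass@1, but when more attempts are sampled (`K_ATTEMPTS > 1`), pass@k is conventionally reported with the unbiased estimator from the HumanEval paper, pass@k = 1 − C(n−c, k)/C(n, k) for n attempts and c passes per task. A small sketch of that formula follows; the repo's summarizer may compute its totals differently.

```ts
// pass@k sketch: unbiased estimator from the HumanEval paper (Chen et al., 2021).
function passAtK(n: number, c: number, k: number): number {
  // n = attempts sampled per task, c = attempts that passed, k = budget being scored
  if (n - c < k) return 1.0; // fewer than k failures, so any k draws contain a pass
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) prod *= 1 - k / i; // equals C(n-c, k) / C(n, k)
  return 1 - prod;
}

console.log(passAtK(1, 1, 1)); // 1.0: matches the deterministic k=1 runs above
console.log(passAtK(5, 2, 3)); // 0.9: example with multiple sampled attempts
```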