Self-hosted LLM evaluation. Docker compose. Multi-provider. Live results.
⚠️ Pre-release. Multi-model runs, four scoring methods, SSE streaming, cost/latency tracking, and regression compare are implemented end-to-end. Schemas may still shift — don't depend on this in production yet.
Point EvalKit at any combination of Claude, GPT-4o, and Gemini and get a side-by-side scored comparison of how each one handles your test suite — with token cost, latency, hallucination flags, and a baseline-vs-current regression view. No SaaS signup, no vendor SDK lock-in, no opaque scores. The whole thing runs on one docker compose up.
cp .env.example .env # add your API keys
docker compose up -d && docker compose run --rm seed

| | EvalKit | Promptfoo | OpenAI Evals | LangSmith | Helicone |
|---|---|---|---|---|---|
| Self-hosted, no account, no SaaS | ✅ | ✅ | ✅ | — | partial (self-host) |
| Multi-provider runs in parallel (Claude + GPT + Gemini) | ✅ | ✅ | partial | partial | n/a (passive logging) |
| 4 scoring methods in one suite (exact / semantic / LLM-judge / rubric) | ✅ | partial | partial | partial | — |
| Hallucination detection with judge rationale | ✅ | — | — | — | — |
| Live SSE streaming of results as they complete | ✅ | — | — | partial | — |
| Per-result token cost + latency, per-model rollups | ✅ | partial | — | ✅ | ✅ |
| Baseline-vs-current regression compare (≥10% drop highlight) | ✅ | partial | — | partial | — |
| Demo mode for portfolio / read-only display | ✅ | — | — | — | — |
| Web UI out of the box | ✅ | partial (viewer) | — | ✅ | ✅ |
| Cloud storage / hosted dashboard | — | — | — | ✅ | ✅ |
Where EvalKit fits: when you want a single dashboard that runs the same suite against three providers at once, scores it four different ways, and shows you the cost/latency/regression view without sending your prompts or completions to anyone else's server. Promptfoo gives you a powerful CLI; LangSmith gives you a hosted UI tied to the LangChain stack. EvalKit gives you a small, opinionated webapp you can docker compose up on a $5 VPS and share with your team.
1. Clone and configure.
git clone https://github.com/Danultimate/evalkit.git
cd evalkit
cp .env.example .env

Fill in at least ANTHROPIC_API_KEY, OPENAI_API_KEY, and GOOGLE_API_KEY. Add VOYAGE_API_KEY if you want semantic scoring.
2. Bring up the stack.
docker compose up -d
docker compose run --rm seed

seed is idempotent: it waits for the backend to be healthy, then loads a sample suite (General Knowledge QA) plus one demo run so the dashboard isn't empty on first boot.
3. Open the app.
http://localhost
You land on the dashboard. Pick a suite, choose the models you want to run, and click Run. Results stream in live via SSE — no refresh, no polling.
Run a suite from the API (no UI)
curl -X POST http://localhost/api/runs \
-H "Content-Type: application/json" \
-d '{
"suite_id": "<uuid>",
"models": ["claude-sonnet-4", "gpt-4o", "gemini-1.5-pro"]
}'

Then stream results:
curl -N http://localhost/api/runs/<run_id>/stream
# event: result
# data: {"case_id": "...", "model": "claude-sonnet-4", "score": 0.92, "cost_usd": 0.0031, "latency_ms": 1140}

The stream closes with event: done when every (case × model) cell has settled.
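For scripted use, the two curl calls above can be mirrored with the Python standard library. This is a minimal sketch, not EvalKit's own client: `build_run_request` and `parse_sse` are hypothetical helper names, the base URL assumes the default local setup, and the parser handles only the `event:` and `data:` fields with blank-line delimiters, which is a subset of the SSE wire format.

```python
import json
import urllib.request
from typing import Iterable, Iterator, Tuple

BASE_URL = "http://localhost"  # assumption: default PORT=80 on the local machine


def build_run_request(suite_id: str, models: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST /api/runs request shown above."""
    body = json.dumps({"suite_id": suite_id, "models": models}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/runs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from SSE text lines.

    Only `event:` and `data:` fields are handled; a blank line ends an event.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Emit the buffered event, then reset for the next one.
            if data or event != "message":
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```

Against a live server, something like `parse_sse(line.decode("utf-8") for line in urllib.request.urlopen(url))` should yield events as they arrive, since the response object iterates line by line. Stop reading once the `done` event shows up.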
Demo mode (portfolio / read-only)
Set DEMO_MODE=true in .env and restart. The seed run still loads, but the Run button is disabled, a banner appears, and any POST to /api/runs returns 403. Useful for sharing a public URL without burning API credits.
EvalKit ships four scoring methods. Pick one per test case — the suite can mix them freely.
| Method | How it works | Best for | Needs |
|---|---|---|---|
| `exact` | String equality on the trimmed completion | Classification, structured outputs, regex-able answers | — |
| `semantic` | Voyage AI embedding + cosine similarity (≥ 0.85 = pass) | Paraphrase, summarisation, "same meaning, different words" | `VOYAGE_API_KEY` |
| `llm_judge` | Claude Sonnet 4 scores the completion 0–1 against `expected_output` | Open-ended QA, tone, nuance | `ANTHROPIC_API_KEY` |
| `rubric` | Claude Sonnet 4 scores against a free-text rubric; no ground truth required | Creative writing, no single right answer | `ANTHROPIC_API_KEY` |
Each result records the chosen method, the raw score, and (for judge / rubric) the judge's written rationale. The rationale doubles as the hallucination explanation — see below.
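The two non-judge methods are simple enough to sketch in a few lines. The snippet below is an illustration under assumptions, not the shipped scorer: the function names are hypothetical, the expected output is assumed to be trimmed the same way as the completion, and the embedding vectors (which would come from Voyage AI in practice) are plain lists of floats here.

```python
import math
from typing import Sequence

SEMANTIC_PASS_THRESHOLD = 0.85  # pass mark from the table above


def score_exact(completion: str, expected: str) -> float:
    """Return 1.0 if the trimmed completion equals the expected output, else 0.0."""
    # Assumption: both sides are trimmed before comparison.
    return 1.0 if completion.strip() == expected.strip() else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def semantic_passes(completion_vec: Sequence[float], expected_vec: Sequence[float]) -> bool:
    """Apply the 0.85 pass threshold to the cosine similarity."""
    return cosine_similarity(completion_vec, expected_vec) >= SEMANTIC_PASS_THRESHOLD
```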
When a test case has an expected_output, every judge-scored result also returns a hallucination_flag plus an explanation. The judge prompt asks Claude to flag responses that introduce facts not supported by the expected answer — not just low-scoring ones. Flagged results turn red in the dashboard with the rationale expandable inline.
✗ gpt-4o · score 0.42 · ⚠ hallucination
rationale: "The response claims the treaty was signed in 1947;
the expected answer correctly states 1949."
Heuristic-only suites (no judge configured, no expected_output) don't run hallucination detection — the flag is null, not false.
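The null-versus-false distinction above is easy to lose in client code, so here is a small sketch of the three-state flag. The `JudgeVerdict` shape and the `hallucination_flag` helper are hypothetical names chosen for illustration; only the rules (judge-scored, has an expected_output, otherwise null) come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

JUDGE_METHODS = {"llm_judge", "rubric"}


@dataclass
class JudgeVerdict:
    """Hypothetical shape of a parsed judge response."""
    score: float
    hallucination: bool
    rationale: str


def hallucination_flag(
    method: str,
    expected_output: Optional[str],
    verdict: Optional[JudgeVerdict],
) -> Optional[bool]:
    """Return True/False when detection ran, None when it did not run."""
    if method not in JUDGE_METHODS or expected_output is None or verdict is None:
        return None  # detection skipped: not the same as "no hallucination"
    return verdict.hallucination
```

Downstream code should test `flag is None` explicitly rather than relying on truthiness, otherwise skipped cells get counted as clean ones.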
Every (test case × model) cell records:
- Input tokens and output tokens as returned by the provider
- USD cost, computed from a built-in pricing table (see Models) with longest-prefix matching so versioned IDs resolve correctly
- Latency, measured end-to-end from request dispatch to final token
Per-run rollups live on the run detail page: total cost, total latency, cost-per-model bar chart, and a per-model average score. Per-case rollups appear on the suite page so you can spot the cases that eat your budget.
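As a rough illustration of the per-model rollup, the sketch below aggregates cell records into totals and an average score. `CellResult` and `rollup_by_model` are hypothetical names; the field names mirror the SSE payload shown earlier, but the real schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CellResult:
    """Hypothetical record for one (test case x model) cell."""
    case_id: str
    model: str
    score: float
    cost_usd: float
    latency_ms: int


def rollup_by_model(results: Iterable[CellResult]) -> Dict[str, dict]:
    """Sum cost and latency and average the score for each model."""
    acc = defaultdict(lambda: {"cost": 0.0, "latency": 0, "score": 0.0, "n": 0})
    for r in results:
        a = acc[r.model]
        a["cost"] += r.cost_usd
        a["latency"] += r.latency_ms
        a["score"] += r.score
        a["n"] += 1
    return {
        model: {
            "total_cost_usd": a["cost"],
            "total_latency_ms": a["latency"],
            "avg_score": a["score"] / a["n"],
        }
        for model, a in acc.items()
    }
```

Grouping by `case_id` instead of `model` gives the per-case view in the same way.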
The Compare page takes two runs (a baseline and a current) and highlights any case where the current score drops ≥ 10% versus baseline, per model. Useful when you change a system prompt, swap a model version, or refactor your retrieval layer and want to know which specific cases got worse — not just whether the average moved.
case "translate idiom: 'cost an arm and a leg'"
baseline (run 2026-05-15): claude-sonnet-4 0.94 gpt-4o 0.91
current (run 2026-05-22): claude-sonnet-4 0.96 gpt-4o 0.67 ↓ regression
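The check behind that highlight can be sketched as follows. One loud assumption: the text does not say whether "≥ 10%" is a relative or an absolute drop, and this sketch treats it as relative to the baseline score. The function names and the `(case_id, model)` keying are likewise illustrative.

```python
from typing import Dict, List, Tuple

REGRESSION_THRESHOLD = 0.10  # the 10% figure from the Compare page
Key = Tuple[str, str]  # (case_id, model)


def is_regression(baseline: float, current: float,
                  threshold: float = REGRESSION_THRESHOLD) -> bool:
    """True when the score fell by at least `threshold`, relative to baseline."""
    if baseline <= 0:
        return False  # nothing to drop from
    drop = (baseline - current) / baseline
    # Small tolerance so an exact 10% drop is not lost to float rounding.
    return drop >= threshold - 1e-9


def find_regressions(baseline: Dict[Key, float], current: Dict[Key, float]) -> List[Key]:
    """Return the cells present in both runs whose score regressed."""
    return sorted(
        key for key, base in baseline.items()
        if key in current and is_regression(base, current[key])
    )
```

With the idiom example above, gpt-4o (0.91 to 0.67) is flagged and claude-sonnet-4 (0.94 to 0.96) is not.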
| Model | Provider | Input ($/1M) | Output ($/1M) |
|---|---|---|---|
| Claude Sonnet 4 | Anthropic | $3.00 | $15.00 |
| Claude Haiku 4 | Anthropic | $0.80 | $4.00 |
| GPT-4o | OpenAI | $2.50 | $10.00 |
| GPT-4o Mini | OpenAI | $0.15 | $0.60 |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 |
| Gemini 1.5 Flash | Google | $0.075 | $0.30 |
Pricing is read from backend/services/cost_tracker.py — edit the table there to add models or override negotiated rates. Unknown models cost 0 and emit a warning so the run still completes.
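To make the longest-prefix lookup concrete, here is a minimal sketch. It is not the contents of cost_tracker.py: the rates are copied from the table above, but the dictionary keys, function names, and the use of the `warnings` module are assumptions for illustration.

```python
import warnings
from typing import Optional, Tuple

# USD per 1M tokens as (input, output), copied from the table above.
# Assumption: these model-ID keys; the real table may key them differently.
PRICING = {
    "claude-sonnet-4": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gemini-1.5-pro": (1.25, 5.00),
    "gemini-1.5-flash": (0.075, 0.30),
}


def resolve_price(model_id: str) -> Optional[Tuple[float, float]]:
    """Pick the pricing entry whose key is the longest prefix of model_id."""
    matches = [prefix for prefix in PRICING if model_id.startswith(prefix)]
    if not matches:
        return None
    return PRICING[max(matches, key=len)]


def cost_usd(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the USD cost of one call; unknown models cost 0 with a warning."""
    price = resolve_price(model_id)
    if price is None:
        warnings.warn(f"no pricing for model {model_id!r}; recording cost as 0")
        return 0.0
    in_rate, out_rate = price
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

Longest-prefix matters because `gpt-4o` is itself a prefix of `gpt-4o-mini`: a first-match lookup could bill a Mini call at the full GPT-4o rate.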
nginx (host, TLS termination, /api proxy)
└── Docker Compose
    ├── frontend (Nginx, serves React SPA, proxies /api → backend)
    ├── backend (FastAPI + asyncpg, port 8000)
    ├── db (pgvector/pgvector:pg16, port 5432)
    └── seed (one-shot, idempotent; waits for backend health)
The backend fans out one HTTP request per (case, model) cell into asyncio.gather, scores each result as it returns, and pushes it onto an SSE channel keyed by run_id. The frontend's React Query client subscribes to that stream and renders cells incrementally — first arrival is usually under a second.
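The fan-out pattern described above can be sketched with nothing but asyncio. This is a simplified stand-in, not the backend code: `call_model` fakes the provider request with a short sleep, an `asyncio.Queue` plays the role of the per-run SSE channel, and all function names are hypothetical.

```python
import asyncio
import random
from typing import List, Optional, Tuple

Event = Tuple[str, Optional[dict]]


async def call_model(case_id: str, model: str) -> dict:
    """Stand-in for a provider call plus scoring; sleeps instead of doing HTTP."""
    await asyncio.sleep(random.uniform(0.0, 0.01))
    return {"case_id": case_id, "model": model, "score": 1.0}


async def run_cell(case_id: str, model: str, channel: asyncio.Queue) -> None:
    """Run one (case, model) cell and publish its result as soon as it lands."""
    result = await call_model(case_id, model)
    await channel.put(("result", result))


async def run_suite(cases: List[str], models: List[str]) -> List[Event]:
    """Fan out every cell concurrently and consume results as they arrive."""
    channel: asyncio.Queue = asyncio.Queue()

    async def producer() -> None:
        await asyncio.gather(*(run_cell(c, m, channel) for c in cases for m in models))
        await channel.put(("done", None))  # every cell has settled

    task = asyncio.create_task(producer())
    events: List[Event] = []
    while True:
        event = await channel.get()  # an SSE endpoint would forward this to the client
        events.append(event)
        if event[0] == "done":
            break
    await task
    return events
```

Because each cell publishes on its own, a slow model delays only its own cells; the consumer sees fast results first and the `done` event last.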
- Backend · Python 3.12, FastAPI, asyncpg, asyncio, sse-starlette
- Frontend · React 18, Vite, Tailwind CSS, React Query, Recharts
- Database · PostgreSQL 16 + pgvector (semantic similarity)
- LLM SDKs · `anthropic` (async), `openai` (async), `google-generativeai`
| Variable | Required | Default | Description |
|---|---|---|---|
| `ANTHROPIC_API_KEY` | ✅ | — | Anthropic API key; required for Claude runs and `llm_judge` / `rubric` scoring |
| `OPENAI_API_KEY` | ✅ | — | OpenAI API key |
| `GOOGLE_API_KEY` | ✅ | — | Google AI API key (Gemini) |
| `VOYAGE_API_KEY` | — | — | Voyage AI key; required for semantic scoring |
| `ALLOWED_ORIGIN` | — | `http://localhost:5173` | CORS allowed origin |
| `DEMO_MODE` | — | `false` | Disable live runs, show demo banner |
| `PORT` | — | `80` | Host port for Docker Compose |
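If you script around EvalKit, it can help to validate the same variables up front. The sketch below reads them into a typed object using the defaults from the table; the `Settings` class and `load_settings` function are hypothetical and are not how the backend itself loads configuration. Note that `PORT` is a Docker Compose host port, so the backend may never read it directly.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY")


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: str
    openai_api_key: str
    google_api_key: str
    voyage_api_key: Optional[str]  # None means semantic scoring is unavailable
    allowed_origin: str
    demo_mode: bool
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate required keys and apply the documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    return Settings(
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        openai_api_key=env["OPENAI_API_KEY"],
        google_api_key=env["GOOGLE_API_KEY"],
        voyage_api_key=env.get("VOYAGE_API_KEY") or None,
        allowed_origin=env.get("ALLOWED_ORIGIN", "http://localhost:5173"),
        # Assumption: only the literal string "true" (any case) enables demo mode.
        demo_mode=env.get("DEMO_MODE", "false").strip().lower() == "true",
        port=int(env.get("PORT", "80")),
    )
```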
- Install Docker + Docker Compose on your EC2 instance.
- Clone the repo and `cp .env.example .env`.
- Point DNS at the box: `evalkit.danblanco.dev` → EC2 public IP.
- Get a TLS cert: `certbot certonly --nginx -d evalkit.danblanco.dev`.
- Copy `nginx.conf` to `/etc/nginx/sites-available/evalkit` and enable it.
- Run `./deploy.sh`.
Subsequent deploys: ./deploy.sh (pulls, rebuilds, restarts, re-runs seed).
- No SaaS, no hosted dashboard. Runs live in your Postgres. Bring your own backups.
- No model fine-tuning or training loop. EvalKit grades; it does not train.
- No agent / tool-use harness. Test cases are single-turn `(system, user) → completion`. Multi-turn / tool-using agents are out of scope for v1; pair EvalKit with a tracer like TraceForge for that workflow.
- No auto-generated test cases. You write the suite; EvalKit scores it.
| Feature | Status |
|---|---|
| Multi-provider parallel runs (Claude + OpenAI + Gemini) | ✅ shipped |
| `exact` / `semantic` / `llm_judge` / `rubric` scoring | ✅ shipped |
| Hallucination detection (judge-flagged + rationale) | ✅ shipped |
| Per-result + per-run cost & latency tracking | ✅ shipped |
| SSE live result streaming | ✅ shipped |
| Baseline-vs-current regression compare (≥10% drop highlight) | ✅ shipped |
| Demo mode (read-only public deploy) | ✅ shipped |
| Idempotent seed container | ✅ shipped |
| API auth / multi-user | deferred |
| Scheduled / cron runs | deferred |
| Custom scoring plugins | deferred |
| Multi-turn / tool-use suites | non-goal (see Non-goals) |
Track progress and propose features via GitHub Issues.
