QuantitativeFinance-Bench
QFBench · Quantitative Finance Benchmark
The definitive benchmark for AI agents in quantitative finance. Hard problems. Real code. Verifiable outputs.
What Makes QFBench Different
Not another QA benchmark — agents must think like quants
vs HumanEval / MBPP
HumanEval tests algorithm logic with unit tests. QFBench requires domain knowledge: Black-Scholes, hazard rates, OU processes. The math must be right, not just the code structure.
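As an illustration of the kind of domain math involved, here is a minimal Black-Scholes sketch in plain Python (parameter values are illustrative, not taken from any QFBench task):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_price(S: float, K: float, T: float, r: float, sigma: float) -> float:
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Spot 100, strike 100, 1 year, 5% rate, 20% vol
price = bs_call_price(100.0, 100.0, 1.0, 0.05, 0.20)
print(round(price, 4))  # ~10.4506
```

Getting this closed form right is table stakes; QFBench tasks combine several such methods, where a sign error or mis-specified drift fails the verifier outright.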
vs FinanceBench (RAG)
The similarly named FinanceBench asks questions about financial documents. QFBench requires agents to write and execute quantitative code — no retrieval, no lookup, pure numerical implementation.
vs Terminal-Bench
Terminal-Bench evaluates CLI proficiency. QFBench evaluates whether agents can implement numerical methods correctly inside a Docker sandbox with Python financial libraries.
The Finance-Zero Rule
A non-agentic baseline: one LLM call, one script, one run. If Finance-Zero passes a task, that task is too easy and gets removed. Every task shown here has defeated Finance-Zero.
Leaderboard
Agent performance ranked by pass rate across 14 calibration tasks
claude-opus-4-6
via claude-code
claude-sonnet-4-6
via claude-code
claude-haiku-4-5
via claude-code
gemini-2.5-pro
via gemini-cli
Pass Rate Comparison
Task Performance Matrix
Each model was evaluated once. Results are single-run — statistical significance requires ≥5 runs per task. Finance-Zero baseline pending.
Key Findings
Insights from the calibration run
Opus Leads at 50%
claude-opus-4-6 passes 7 of 14 tasks (50%), outperforming claude-sonnet-4-6 at 36% (5/14). Opus shows consistent strength on complex multi-method tasks like kelly-var-sizing and mc-greeks-surface where Sonnet fails.
Regime Detection is the Bottleneck
Both models fail regime-riskparity-cvar, regime-cta-vol-target, and sentiment-factor-alpha. Tasks requiring multi-step numerical pipelines with cascading state (eigenvalue → regime → portfolio) remain unsolved.
Easy Tasks Confirm Calibration
Both Opus and Sonnet pass all easy/medium tasks (fama-french, momentum, bollinger). Hard tasks separate the models: Opus uniquely passes kelly-var-sizing and mc-greeks-surface; Sonnet times out on hull-white-swaption.
Task Catalog
Run It Yourself
Evaluate any agent on QFBench using the Harbor framework
pip install harbor
# Build sandbox base image (one-time, ~5 minutes)
docker build -t quantitativefinance-bench-sandbox:latest -f docker/sandbox.Dockerfile .

export ANTHROPIC_API_KEY=<your-key>
# Run Claude Code on all calibration tasks
harbor run --path ./tasks --agent claude-code --model anthropic/claude-sonnet-4-20250514
# Or target a single task
harbor run --path ./tasks --task-name cds-pricing --agent claude-code --model anthropic/claude-haiku-3-5-20241022

export GEMINI_API_KEY=<your-key>
# Run the Finance-Zero baseline
harbor run --path ./tasks --agent-import-path agents.finance_zero:FinanceZeroAgent --model gemini/gemini-2.0-flash

Results are saved to jobs/<timestamp>/result.json. Each run creates agent trajectories, test output, and token usage logs.
How It Works
Rigorous evaluation powered by Harbor. Binary pass/fail scoring with strict numerical tolerances.
Task Specification
Each agent receives instruction.md with input data and evaluation criteria. No hints, no examples — just the problem.
Sandbox Execution
The agent writes and runs code inside a Docker sandbox with Python, NumPy, Pandas, TA-Lib pre-installed. Full iteration allowed.
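A minimal sketch of the kind of code an agent might write in the sandbox — Bollinger bands from pandas rolling statistics (illustrative only, not a real task solution; real tasks supply dirty market data rather than a simulated series):

```python
import numpy as np
import pandas as pd

# Simulated price series as a stand-in for task input data
rng = np.random.default_rng(0)
close = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.01, 252)))

window = 20
mid = close.rolling(window).mean()
std = close.rolling(window).std()
upper, lower = mid + 2 * std, mid - 2 * std  # classic 20-day, 2-sigma bands

print(bool(upper.iloc[-1] > lower.iloc[-1]))  # True
```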
Verification
Harbor runs pytest against the agent's output. Strict numerical tolerances. Pass or fail — no partial credit.
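The verification step can be pictured with a self-contained sketch (the file name, output schema, and reference value here are hypothetical, not from a real task): the agent writes its answer to an agreed location, and the verifier checks it against the oracle value under a strict tolerance.

```python
import json, os, tempfile
import numpy as np

# Simulate an agent writing its answer per the task's output contract
out_path = os.path.join(tempfile.mkdtemp(), "result.json")
with open(out_path, "w") as f:
    json.dump({"call_price": 10.4509}, f)  # stand-in agent output

# Verifier side: load the output and apply a strict numerical check
with open(out_path) as f:
    result = json.load(f)

REFERENCE = 10.4506  # value the oracle solution would produce (illustrative)
passed = bool(np.isclose(result["call_price"], REFERENCE, atol=1e-3))
print("PASS" if passed else "FAIL")
```

Binary scoring means a result just outside tolerance fails exactly like no result at all — there is no partial credit to hide behind.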
Finance-Zero Baseline
A single-call non-agentic baseline: one LLM call, one script, one run. If Finance-Zero passes a task, that task is too easy.
What's Next
Expanding the leaderboard with more models and a baseline
GPT-4o
codex-cli
o3
codex-cli
Gemini 2.5 Pro
gemini-cli
Finance-Zero Baseline
gemini-2.0-flash
Full 90 Tasks
all models
Contributors
Thank you to everyone who has contributed to the benchmark or this website. Sourced from QFBench and quantitativefinance-bench-website.
How to Contribute
QFBench is community-built. Every task on this leaderboard was contributed by the community — yours could be next.
Submit a Task
Design a hard quant problem, write the verifier, and open a PR. Three merged tasks = authorship on the paper.
Contribution guide →
Evaluate a Model
Run any agent on all tasks and submit your results. Help us expand the leaderboard beyond the current models.
Model reference →
Join the Community
Found a bug? Have a task idea? Want to co-author? Open an issue or start a discussion on GitHub.
Open an issue →
What makes a good task?
QFBench tasks must require real quant expertise: numerical methods, dirty data, and verifiable outputs. Not trivia — real professional workflows that a senior quant would recognize. See the full guide for design principles and examples.
Task contribution guide →
Task requirements
- Tasks must be verifiable and easy to verify: an explicit output contract (what to produce and where to save it), programmatically checkable by code (e.g. np.isclose).
- instruction.md and task.toml must be written entirely by humans.
- instruction.md must not reference which skills to use — the agent must figure that out itself.
- The reference solution must not be leaked via skills or the Dockerfile; no task-specific hints that give away the answer.
- Oracle must pass 100%: the reference solution must pass all tests. Run harbor run --path ./tasks --task-name <task-id> --agent oracle and confirm every test passes before submitting.
- Finance-Zero must not pass: run the single-shot baseline; if it passes, the task is too easy.
- Deterministic: same input → same output; no external APIs at runtime.
- Use real data, not synthetic — real data has missing values, outliers, mixed formats.
- Tasks must represent realistic professional workflows without artificial difficulty. The problem itself should be fundamentally hard, not an ordinary problem made adversarial so that agents score low and one can claim hardness.
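The determinism requirement can be smoke-tested by running the pipeline twice with the same seed — a hypothetical sketch (the pipeline body is a stand-in, not a real task):

```python
import numpy as np

def run_pipeline(seed: int) -> float:
    """Stand-in for a task's numerical pipeline (hypothetical)."""
    rng = np.random.default_rng(seed)  # all randomness flows from the seed
    returns = rng.normal(0.0005, 0.01, size=252)
    return float(np.prod(1.0 + returns) - 1.0)  # compounded annual return

# Same input (seed) must give bit-identical output across runs
a, b = run_pipeline(42), run_pipeline(42)
assert a == b
print("deterministic:", a == b)
```

Any unseeded randomness, wall-clock dependence, or runtime API call breaks this property and makes the verifier flaky.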
Task format
Every task directory must include:
tasks/<task-id>/
+-- task.toml # Metadata & resource limits
+-- instruction.md # Agent-facing problem statement
+-- environment/
| +-- Dockerfile # Inherits from quantitativefinance-bench-sandbox
| +-- data/ # Input datasets
| \-- skills/ # OPTIONAL — skills available to agent
| \-- <skill-name>/
| +-- SKILL.md
| +-- scripts/ # optional
| \-- ...
+-- tests/
| +-- test.sh # Harbor verifier entry-point
| \-- test_outputs.py # Pytest assertions
\-- solution/
\-- solve.sh # Reference (oracle) solution

Workflow
- Design the task and implement all required files (instruction, metadata, environment, tests, reference solution).
- Run harbor run --path ./tasks --task-name <task-id> --agent oracle — oracle must pass 100%.
- Run the Finance-Zero baseline; it must fail (otherwise the task is too easy).
- Run at least two frontier agents from different companies (see the "Frontier (Strongest)" section in the model reference) and record results; include screenshots and a summary table in your PR.
- Open a PR with your task under tasks/.
FAQ
What kind of tasks are we looking for?
See the task design principles and difficulty guide in the task contribution guide.
How do I qualify for authorship?
Three high-quality tasks merged to main automatically qualify you for authorship. Your set must include at most one easy task and at least one hard task. Edge cases (e.g. two hard tasks) are reviewed case by case.
What if I contribute fewer tasks but help in other ways?
We count other contributions too: engineering (infrastructure, tooling, CI/CD), running experiments, and paper writing. We’re flexible — if you want to help, reach out.