QFBench · Quantitative Finance Benchmark

The definitive benchmark for AI agents in quantitative finance. Hard problems. Real code. Verifiable outputs.

View on GitHub

What Makes QFBench Different

Not another QA benchmark — agents must think like quants

General Coding

vs HumanEval / MBPP

HumanEval tests algorithm logic with unit tests. QFBench requires domain knowledge: Black-Scholes, hazard rates, OU processes. The math must be right, not just the code structure.
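
To make the contrast concrete, here is an illustrative sketch (not one of the benchmark's tasks) of the kind of domain math an agent must get right — the Black-Scholes closed-form price of a European call, using only the standard library:

```python
from math import erf, exp, log, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S: float, K: float, T: float, r: float, sigma: float) -> float:
    """Black-Scholes price of a European call option."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# At-the-money call: S=100, K=100, T=1y, r=5%, sigma=20% -> ~10.45
print(round(bs_call(100, 100, 1.0, 0.05, 0.2), 4))
```

A verifier can check such an output against the oracle value to a tight tolerance — a syntactically correct script with a sign error in d2 fails outright.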

RAG / QA

vs FinanceBench (RAG)

The other FinanceBench asks questions about financial documents. Ours requires agents to write and execute quantitative code — no retrieval, no lookup, pure numerical implementation.

Terminal Ops

vs Terminal-Bench

Terminal-Bench evaluates CLI proficiency. QFBench evaluates whether agents can implement numerical methods correctly inside a Docker sandbox with Python financial libraries.

Quality Control

The Finance-Zero Rule

A non-agentic baseline: one LLM call, one script, one run. If Finance-Zero passes a task, that task is too easy and gets removed. Every task shown here has defeated Finance-Zero.

Leaderboard

Agent performance ranked by pass rate across 14 calibration tasks

Rank  Date        Model              Agent        Pass Rate
#1    2026-03-07  claude-opus-4-6    claude-code  50% (7/14)
#2    2026-03-07  claude-sonnet-4-6  claude-code  36% (5/14)
#3    2026-03-08  claude-haiku-4-5   claude-code  29% (2/7 tested)
#4    2026-03-06  gemini-2.5-pro     gemini-cli    0% (0/7 tested)

Pass Rate Comparison

[Bar chart: per-model pass rates on a 0–100% scale; same data as the leaderboard above.]

Task Performance Matrix

[Heatmap: per-task pass/fail for each model across the 14 calibration tasks — AOF, BGV, BBA, CBC, FFM, HWS, KVS, MGS, MOM, RCV, RRC, SFA, SIS, SNR. Overall scores: claude-opus-4-6 50%, claude-sonnet-4-6 36%, claude-haiku-4-5 29%, gemini-2.5-pro 0%. Legend: Pass / Fail / Not tested; difficulty levels: easy, medium, medium-hard, hard, very hard.]

Each model was evaluated once. Results are single-run — statistical significance requires ≥5 runs per task. Finance-Zero baseline pending.

Key Findings

Insights from the calibration run

Winner

Opus Leads at 50%

claude-opus-4-6 passes 7 of 14 tasks (50%), outperforming claude-sonnet-4-6 at 36% (5/14). Opus shows consistent strength on complex multi-method tasks like kelly-var-sizing and mc-greeks-surface where Sonnet fails.

Common Failure

Regime Detection is the Bottleneck

Both models fail regime-riskparity-cvar, regime-cta-vol-target, and sentiment-factor-alpha. Tasks requiring multi-step numerical pipelines with cascading state (eigenvalue → regime → portfolio) remain unsolved.

Task Insight

Easy Tasks Confirm Calibration

Both Opus and Sonnet pass all easy/medium tasks (fama-french, momentum, bollinger). Hard tasks separate the models: Opus uniquely passes kelly-var-sizing and mc-greeks-surface; Sonnet times out on hull-white-swaption.

Task Catalog


Run It Yourself

Evaluate any agent on QFBench using the Harbor framework

# 1. Install & build sandbox
pip install harbor

# Build sandbox base image (one-time, ~5 minutes)
docker build -t quantitativefinance-bench-sandbox:latest \
  -f docker/sandbox.Dockerfile .

# 2. Run an agent
export ANTHROPIC_API_KEY=<your-key>

# Run Claude Code on all calibration tasks
harbor run --path ./tasks \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-20250514

# Or target a single task
harbor run --path ./tasks \
  --task-name cds-pricing \
  --agent claude-code \
  --model anthropic/claude-haiku-3-5-20241022

# 3. Finance-Zero baseline (free with Gemini)
export GEMINI_API_KEY=<your-key>

harbor run --path ./tasks \
  --agent-import-path agents.finance_zero:FinanceZeroAgent \
  --model gemini/gemini-2.0-flash

Results are saved to jobs/<timestamp>/result.json. Each run creates agent trajectories, test output, and token usage logs.

How It Works

Rigorous evaluation powered by Harbor. Binary pass/fail scoring with strict numerical tolerances.

01

Task Specification

Each agent receives instruction.md with input data and evaluation criteria. No hints, no examples — just the problem.

02

Sandbox Execution

The agent writes and runs code inside a Docker sandbox with Python, NumPy, Pandas, TA-Lib pre-installed. Full iteration allowed.

03

Verification

Harbor runs pytest against the agent's output. Strict numerical tolerances. Pass or fail — no partial credit.
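
A verifier of this kind boils down to loading the agent's output file and comparing each required field against the oracle value with np.isclose. A minimal sketch (file name, field names, and tolerances are illustrative, not the actual Harbor test layout):

```python
import json
import tempfile

import numpy as np

def check_output(path: str, expected: dict, rtol: float = 1e-4) -> bool:
    """Load the agent's JSON output and compare each required field
    to the oracle value within a relative tolerance."""
    with open(path) as f:
        out = json.load(f)
    return all(np.isclose(out[k], v, rtol=rtol) for k, v in expected.items())

# Demo with a fake agent output (values illustrative)
expected = {"price": 10.4506, "delta": 0.6368}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"price": 10.45059, "delta": 0.63683}, f)
print(check_output(f.name, expected))  # True
```

In the real harness each comparison lives in a pytest assertion, so a single out-of-tolerance field fails the whole task — binary pass/fail, no partial credit.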

04

Finance-Zero Baseline

A single-call non-agentic baseline: one LLM call, one script, one run. If Finance-Zero passes a task, that task is too easy.

What's Next

Expanding the leaderboard with more models and a baseline

Planned

GPT-4o

codex-cli

Planned

o3

codex-cli

Planned

Gemini 2.5 Pro

gemini-cli

In Progress

Finance-Zero Baseline

gemini-2.0-flash

Coming Soon

Full 90 Tasks

all models

Contributors

Thank you to everyone who has contributed to the benchmark or this website. Sourced from QFBench and quantitativefinance-bench-website.


How to Contribute

QFBench is community-built. Every task on this leaderboard was contributed by the community — yours could be next.

What makes a good task?

QFBench tasks must require real quant expertise: numerical methods, dirty data, and verifiable outputs. Not trivia — real professional workflows that a senior quant would recognize. See the full guide for design principles and examples.

Task contribution guide →

Task requirements

  • Tasks must be easy to verify: an explicit output contract (what to produce and where to save it), checkable programmatically by code (e.g. np.isclose).
  • instruction.md and task.toml must be written entirely by humans. instruction.md must not reference which skills to use — the agent must figure that out itself.
  • The reference solution must not be leaked via skills or the Dockerfile; no task-specific hints that give away the answer.
  • Oracle must pass 100%: the reference solution must pass all tests. Run harbor run --path ./tasks --task-name <task-id> --agent oracle and confirm every test passes before submitting.
  • Finance-Zero must not pass: run the single-shot baseline; if it passes, the task is too easy.
  • Deterministic: same input → same output; no external APIs at runtime.
  • Use real data, not synthetic — real data has missing values, outliers, mixed formats.
  • Tasks must represent realistic professional workflows without artificial difficulty. The problem itself should be fundamentally hard — not an ordinary problem made artificially adversarial just to depress agent scores and claim hardness.
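
The determinism requirement is usually just a matter of seeding every random source. A minimal sketch with a toy Monte Carlo call pricer (parameters illustrative, not from any benchmark task):

```python
import numpy as np

def mc_call_price(seed: int, n: int = 100_000) -> float:
    """Toy Monte Carlo price of a European call under GBM.
    Seeding the generator makes the output bit-identical across runs."""
    S0, K, T, r, sigma = 100.0, 100.0, 1.0, 0.05, 0.2
    rng = np.random.default_rng(seed)  # every random source is seeded
    z = rng.standard_normal(n)
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
    return float(np.exp(-r * T) * np.maximum(ST - K, 0.0).mean())

# Same seed -> same output, so the verifier's tolerance check is reproducible
assert mc_call_price(42) == mc_call_price(42)
```

An unseeded version would produce a slightly different price on each run, making strict numerical tolerances impossible to enforce.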

Task format

Every task directory must include:

tasks/<task-id>/
+-- task.toml                                 # Metadata & resource limits
+-- instruction.md                            # Agent-facing problem statement
+-- environment/
|   +-- Dockerfile                            # Inherits from quantitativefinance-bench-sandbox
|   +-- data/                                 # Input datasets
|   \-- skills/                               # OPTIONAL — skills available to agent
|       \-- <skill-name>/
|           +-- SKILL.md
|           +-- scripts/                      # optional
|           \-- ...
+-- tests/
|   +-- test.sh                               # Harbor verifier entry-point
|   \-- test_outputs.py                       # Pytest assertions
\-- solution/
    \-- solve.sh                              # Reference (oracle) solution

Workflow

  1. Design the task and implement all required files (instruction, metadata, environment, tests, reference solution).
  2. Run harbor run --path ./tasks --task-name <task-id> --agent oracle — oracle must pass 100%.
  3. Run Finance-Zero baseline; it must fail (otherwise the task is too easy).
  4. Run at least two frontier agents from different companies (see the "Frontier (Strongest)" section in the model reference) and record results; include screenshots and a summary table in your PR.
  5. Open a PR with your task under tasks/.

FAQ

What kind of tasks are we looking for?

See the task design principles and difficulty guide in the task contribution guide.

How do I qualify for authorship?

Three high-quality tasks merged to main qualify you for automatic authorship. Your set must include at most one easy task and at least one hard task. Edge cases (e.g. two hard tasks) are reviewed case by case.

What if I contribute fewer tasks but help in other ways?

We count other contributions too: engineering (infrastructure, tooling, CI/CD), running experiments, and paper writing. We’re flexible — if you want to help, reach out.

Resources