QFBench · Quantitative Finance Benchmark

The definitive benchmark for AI agents in quantitative finance. Hard problems. Real code. Verifiable outputs.

View on GitHub

What Makes QFBench Different

Not another QA benchmark — agents must think like quants

General Coding

vs HumanEval / MBPP

HumanEval tests algorithm logic with unit tests. QFBench requires domain knowledge: Black-Scholes, hazard rates, OU processes. The math must be right, not just the code structure.
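
To make the contrast concrete, here is an illustrative sketch (not one of the benchmark's tasks) of the kind of domain math an agent must get right — the Black-Scholes closed-form price of a European call, using only the standard library:

```python
from math import erf, exp, log, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S: float, K: float, T: float, r: float, sigma: float) -> float:
    """Black-Scholes price of a European call option."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# At-the-money call: S=100, K=100, T=1y, r=5%, sigma=20% -> ~10.45
print(round(bs_call(100, 100, 1.0, 0.05, 0.2), 4))
```

A verifier can check such an output against the oracle value to a tight tolerance — a syntactically correct script with a sign error in d2 fails outright.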

RAG / QA

vs FinanceBench (RAG)

The other FinanceBench asks questions about financial documents. Ours requires agents to write and execute quantitative code — no retrieval, no lookup, pure numerical implementation.

Terminal Ops

vs Terminal-Bench

Terminal-Bench evaluates CLI proficiency. QFBench evaluates whether agents can implement numerical methods correctly inside a Docker sandbox with Python financial libraries.

Quality Control

The Finance-Zero Rule

A non-agentic baseline: one LLM call, one script, one run. If Finance-Zero passes a task, that task is too easy and gets removed. Every task shown here has defeated Finance-Zero.

Leaderboard

Agent performance ranked by pass rate across 14 calibration tasks

Rank  Date        Model              Agent        Pass Rate
#1    2026-03-07  claude-opus-4-6    claude-code  50% (7/14)
#2    2026-03-07  claude-sonnet-4-6  claude-code  36% (5/14)
#3    2026-03-08  claude-haiku-4-5   claude-code  29% (2/7 tested)
#4    2026-03-06  gemini-2.5-pro     gemini-cli    0% (0/7 tested)

Pass Rate Comparison

[Bar chart: per-model pass rates on a 0–100% scale; same data as the leaderboard above.]

Task Performance Matrix

[Heatmap: per-task pass/fail for each model across the 14 calibration tasks — AOF, BGV, BBA, CBC, FFM, HWS, KVS, MGS, MOM, RCV, RRC, SFA, SIS, SNR. Overall scores: claude-opus-4-6 50%, claude-sonnet-4-6 36%, claude-haiku-4-5 29%, gemini-2.5-pro 0%. Legend: Pass / Fail / Not tested; difficulty levels: easy, medium, medium-hard, hard, very hard.]

Each model was evaluated once. Results are single-run — statistical significance requires ≥5 runs per task. Finance-Zero baseline pending.

Key Findings

Insights from the calibration run

Winner

Opus Leads at 50%

claude-opus-4-6 passes 7 of 14 tasks (50%), outperforming claude-sonnet-4-6 at 36% (5/14). Opus shows consistent strength on complex multi-method tasks like kelly-var-sizing and mc-greeks-surface where Sonnet fails.

Common Failure

Regime Detection is the Bottleneck

Both models fail regime-riskparity-cvar, regime-cta-vol-target, and sentiment-factor-alpha. Tasks requiring multi-step numerical pipelines with cascading state (eigenvalue → regime → portfolio) remain unsolved.

Task Insight

Easy Tasks Confirm Calibration

Both Opus and Sonnet pass all easy/medium tasks (fama-french, momentum, bollinger). Hard tasks separate the models: Opus uniquely passes kelly-var-sizing and mc-greeks-surface; Sonnet times out on hull-white-swaption.

Task Catalog


Run It Yourself

Evaluate any agent on QFBench using the Harbor framework

# 1. Install & build sandbox
pip install harbor

# Build sandbox base image (one-time, ~5 minutes)
docker build -t quantitativefinance-bench-sandbox:latest \
  -f docker/sandbox.Dockerfile .

# 2. Run an agent
export ANTHROPIC_API_KEY=<your-key>

# Run Claude Code on all calibration tasks
harbor run --path ./tasks \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-20250514

# Or target a single task
harbor run --path ./tasks \
  --task-name cds-pricing \
  --agent claude-code \
  --model anthropic/claude-haiku-3-5-20241022

# 3. Finance-Zero baseline (free with Gemini)
export GEMINI_API_KEY=<your-key>

harbor run --path ./tasks \
  --agent-import-path agents.finance_zero:FinanceZeroAgent \
  --model gemini/gemini-2.0-flash

Results are saved to jobs/<timestamp>/result.json. Each run creates agent trajectories, test output, and token usage logs.

How It Works

Rigorous evaluation powered by Harbor. Binary pass/fail scoring with strict numerical tolerances.

01

Task Specification

Each agent receives instruction.md with input data and evaluation criteria. No hints, no examples — just the problem.

02

Sandbox Execution

The agent writes and runs code inside a Docker sandbox with Python, NumPy, Pandas, TA-Lib pre-installed. Full iteration allowed.

03

Verification

Harbor runs pytest against the agent's output. Strict numerical tolerances. Pass or fail — no partial credit.
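
A verifier of this kind boils down to loading the agent's output file and comparing each required field against the oracle value with np.isclose. A minimal sketch (file name, field names, and tolerances are illustrative, not the actual Harbor test layout):

```python
import json
import tempfile

import numpy as np

def check_output(path: str, expected: dict, rtol: float = 1e-4) -> bool:
    """Load the agent's JSON output and compare each required field
    to the oracle value within a relative tolerance."""
    with open(path) as f:
        out = json.load(f)
    return all(np.isclose(out[k], v, rtol=rtol) for k, v in expected.items())

# Demo with a fake agent output (values illustrative)
expected = {"price": 10.4506, "delta": 0.6368}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"price": 10.45059, "delta": 0.63683}, f)
print(check_output(f.name, expected))  # True
```

In the real harness each comparison lives in a pytest assertion, so a single out-of-tolerance field fails the whole task — binary pass/fail, no partial credit.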

04

Finance-Zero Baseline

A single-call non-agentic baseline: one LLM call, one script, one run. If Finance-Zero passes a task, that task is too easy.

What's Next

Expanding the leaderboard with more models and a baseline

Planned

GPT-4o

codex-cli

Planned

o3

codex-cli

Planned

Gemini 2.5 Pro

gemini-cli

In Progress

Finance-Zero Baseline

gemini-2.0-flash

Coming Soon

Full 90 Tasks

all models

Contributors

Thank you to everyone who has contributed to the benchmark or this website. Sourced from QFBench and quantitativefinance-bench-website.


How to Contribute

QFBench is community-built. Every task on this leaderboard was contributed by the community — yours could be next.

What makes a good task?

QFBench tasks must require real quant expertise: numerical methods, dirty data, and verifiable outputs. Not trivia — real professional workflows that a senior quant would recognize. See the full guide for design principles and examples.

Task contribution guide →

Task requirements

  • Tasks must be easy to verify: an explicit output contract (what to produce and where to save it), checkable programmatically by code (e.g. np.isclose).
  • instruction.md and task.toml must be written entirely by humans. instruction.md must not reference which skills to use — the agent must figure that out itself.
  • The reference solution must not be leaked via skills or the Dockerfile; no task-specific hints that give away the answer.
  • Oracle must pass 100%: the reference solution must pass all tests. Run harbor run --path ./tasks --task-name <task-id> --agent oracle and confirm every test passes before submitting.
  • Finance-Zero must not pass: run the single-shot baseline; if it passes, the task is too easy.
  • Deterministic: same input → same output; no external APIs at runtime.
  • Use real data, not synthetic — real data has missing values, outliers, mixed formats.
  • Tasks must represent realistic professional workflows without artificial difficulty. The problem itself should be fundamentally hard — not an ordinary problem made artificially adversarial just to depress agent scores and claim hardness.
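
The determinism requirement is usually just a matter of seeding every random source. A minimal sketch with a toy Monte Carlo call pricer (parameters illustrative, not from any benchmark task):

```python
import numpy as np

def mc_call_price(seed: int, n: int = 100_000) -> float:
    """Toy Monte Carlo price of a European call under GBM.
    Seeding the generator makes the output bit-identical across runs."""
    S0, K, T, r, sigma = 100.0, 100.0, 1.0, 0.05, 0.2
    rng = np.random.default_rng(seed)  # every random source is seeded
    z = rng.standard_normal(n)
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
    return float(np.exp(-r * T) * np.maximum(ST - K, 0.0).mean())

# Same seed -> same output, so the verifier's tolerance check is reproducible
assert mc_call_price(42) == mc_call_price(42)
```

An unseeded version would produce a slightly different price on each run, making strict numerical tolerances impossible to enforce.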

Task format

Every task directory must include:

tasks/<task-id>/
+-- task.toml                                 # Metadata & resource limits
+-- instruction.md                            # Agent-facing problem statement
+-- environment/
|   +-- Dockerfile                            # Inherits from quantitativefinance-bench-sandbox
|   +-- data/                                 # Input datasets
|   \-- skills/                               # OPTIONAL — skills available to agent
|       \-- <skill-name>/
|           +-- SKILL.md
|           +-- scripts/                      # optional
|           \-- ...
+-- tests/
|   +-- test.sh                               # Harbor verifier entry-point
|   \-- test_outputs.py                       # Pytest assertions
\-- solution/
    \-- solve.sh                              # Reference (oracle) solution

Workflow

  1. Design the task and implement all required files (instruction, metadata, environment, tests, reference solution).
  2. Run harbor run --path ./tasks --task-name <task-id> --agent oracle — oracle must pass 100%.
  3. Run Finance-Zero baseline; it must fail (otherwise the task is too easy).
  4. Run at least two frontier agents from different companies (see the "Frontier (Strongest)" section in the model reference) and record results; include screenshots and a summary table in your PR.
  5. Open a PR with your task under tasks/.

FAQ

What kind of tasks are we looking for?

See the task design principles and difficulty guide in the task contribution guide.

How do I qualify for authorship?

Three high-quality tasks merged to main qualify you for automatic authorship. Your set must include at most one easy task and at least one hard task. Edge cases (e.g. two hard tasks) are reviewed case by case.

What if I contribute fewer tasks but help in other ways?

We count other contributions too: engineering (infrastructure, tooling, CI/CD), running experiments, and paper writing. We’re flexible — if you want to help, reach out.

Resources