DeepSWE Information Hub

DeepSWE Benchmark: Why GPT Leads Claude on Long-Horizon Coding Tasks

Why does DeepSWE show GPT outperforming Claude?

Reason | Explanation
More complete requirement coverage | GPT is less likely to miss explicit prompt requirements, especially when a task has multiple branches.
More stable interpretation | Across repeated runs of the same task, GPT more often converges on the same understanding.
Stronger long-horizon engineering | DeepSWE pairs short prompts with long implementations and multi-file changes, and GPT-5.5 leads in that setting.
Better efficiency | GPT-5.5 leads on score while also looking strong on token, time, and cost efficiency.
Less dependent on benchmark leakage | By removing gold-commit leakage, DeepSWE cuts off some of the advantage Claude showed on older benchmarks.

First, GPT is better at carrying out the full request in DeepSWE, not just the most obvious part of it.

DeepSWE tasks are often more than a simple bug fix. They regularly ask the model to handle multiple parallel cases at once: support the synchronous path and the asynchronous path, or handle one input format and a closely related one. Datacurve's analysis found that Claude often produced a solution that looked close to correct, but still dropped one branch. In plain terms, it might get the main path right while forgetting to mirror the same logic in the second scenario. By contrast, GPT-5.5 had the lowest rate of missing explicit requirements in DeepSWE, with GPT-5.4 very close behind. That suggests GPT is better at turning each requirement in the prompt into actual code changes.
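
To make the "dropped branch" failure mode concrete, here is a minimal Python sketch. The function names (`parse_port`, `load_port`, `load_port_async`) are hypothetical and are not taken from any DeepSWE task. The point is only that a requirement added to shared logic has to be reachable from both the synchronous and the asynchronous entry point.

```python
import asyncio


def parse_port(raw: str) -> int:
    """Shared validation: the new requirement (reject out-of-range ports) lives here."""
    port = int(raw.strip())
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def load_port(raw: str) -> int:
    """Synchronous path (hypothetical entry point)."""
    return parse_port(raw)


async def load_port_async(raw: str) -> int:
    """Asynchronous path: mirrors the same rule instead of re-implementing a looser one."""
    await asyncio.sleep(0)  # stand-in for real async I/O
    return parse_port(raw)


def both_paths_agree(raw: str) -> bool:
    """Return True when both entry points produce the same value or both reject the input."""
    def outcome(call):
        try:
            return ("ok", call())
        except ValueError:
            return ("error", None)

    sync_result = outcome(lambda: load_port(raw))
    async_result = outcome(lambda: asyncio.run(load_port_async(raw)))
    return sync_result == async_result
```

A solution that validated the range only in `load_port` would still pass a check on the main path, but `both_paths_agree("70000")` would expose the missing branch.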

Second, GPT's task understanding is more consistent, and less dependent on getting lucky.

DeepSWE does not only ask whether a model passes once. It also looks at how the same model behaves across multiple runs of the same task. Datacurve says GPT tends to converge on a similar interpretation and implementation direction from run to run. That matters in real development, because users want a coding agent they can predict, not one that reads the task as A in one run and B in the next. GPT is more likely to follow the user's prompt closely and work within the interfaces and structure that already exist in the repository, which makes its output easier to anticipate, review, and reuse.
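
One rough way to separate "passes on average" from "behaves the same way every time" is sketched below. The run data is invented for illustration, and the `consistency` measure is a simple proxy assumed here, not the metric Datacurve uses.

```python
# Hypothetical outcomes: task id -> pass/fail for each repeated run of one model.
runs = {
    "task-a": [True, True, True, True],
    "task-b": [True, False, True, False],
    "task-c": [False, False, False, False],
}


def pass_rate(outcomes: dict[str, list[bool]]) -> float:
    """Per-task pass rate, averaged over tasks."""
    per_task = [sum(r) / len(r) for r in outcomes.values()]
    return sum(per_task) / len(per_task)


def consistency(outcomes: dict[str, list[bool]]) -> float:
    """Fraction of tasks where every run reached the same outcome."""
    stable = [len(set(r)) == 1 for r in outcomes.values()]
    return sum(stable) / len(stable)


if __name__ == "__main__":
    print(f"pass rate:   {pass_rate(runs):.2f}")
    print(f"consistency: {consistency(runs):.2f}")
```

Two models can share the same pass rate while differing sharply on consistency, which is the distinction the paragraph above draws.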

Third, DeepSWE is a stronger test of long-horizon engineering, and GPT performs better in that setting.

DeepSWE is difficult because its prompts are short while the implementation work behind them is long. The average prompt is only 2,158 characters, less than half of SWE-Bench Pro's 4,614 characters. But the average reference solution in DeepSWE adds 668 lines of code and touches 7 files, versus 120 lines and 5 files in SWE-Bench Pro. That means the model cannot rely on detailed step-by-step instructions. It has to read the codebase, find the right entry points, understand the project structure, make changes across files, and avoid breaking existing behavior. GPT-5.5 posts the top score on exactly this mix of short prompt, long execution path, and multi-file change, which is a strong sign that it is better suited to realistic engineering work.
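
The gap between prompt length and solution size can be put in numbers. The short calculation below uses only the averages quoted above; the "lines added per prompt character" ratio is an illustrative measure introduced here, not one reported by Datacurve.

```python
# Averages quoted in the text above.
deepswe = {"prompt_chars": 2158, "lines_added": 668, "files": 7}
swe_bench_pro = {"prompt_chars": 4614, "lines_added": 120, "files": 5}


def lines_per_prompt_char(bench: dict[str, int]) -> float:
    """Reference-solution lines added per character of prompt."""
    return bench["lines_added"] / bench["prompt_chars"]


# How much shorter the prompts are, and how much larger the solutions are.
prompt_ratio = deepswe["prompt_chars"] / swe_bench_pro["prompt_chars"]
lines_ratio = deepswe["lines_added"] / swe_bench_pro["lines_added"]

# Combined effect: implementation work implied by each character of instruction.
density_ratio = lines_per_prompt_char(deepswe) / lines_per_prompt_char(swe_bench_pro)

if __name__ == "__main__":
    print(f"prompt length ratio: {prompt_ratio:.2f}")
    print(f"lines added ratio:   {lines_ratio:.2f}")
    print(f"density ratio:       {density_ratio:.1f}")
```

On these averages, DeepSWE prompts are a little under half as long, reference solutions add about 5.6 times as many lines, and each prompt character stands for roughly 12 times as much implementation.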

Fourth, GPT is not just higher-scoring. It is also more efficient.

DeepSWE compares more than pass rate. It also tracks how many tokens, how much time, and how much cost a model consumes to finish a task. Datacurve reports that GPT-5.5 reaches the top pass rate at 70%, while also posting a median output length of 47k tokens, the best token efficiency in the chart. Its median completion time is 20 minutes, which is also strong among the highest-scoring models. On cost, both GPT-5.4 and GPT-5.5 are marked as the most cost-efficient configurations in the figure. In other words, GPT's edge does not come from brute-forcing the task with more output, more runtime, or more spend. It comes from a better balance between accuracy and resource use.

Fifth, DeepSWE reduces benchmark leakage, which makes GPT's underlying ability easier to see.

Datacurve emphasizes that DeepSWE tasks are rewritten from scratch rather than adapted directly from existing GitHub commits, pull requests, or public patches, and those tasks are not merged back into the original projects. That makes it much harder for a model to guess the answer from memorized training data or public history. This differs from some older benchmarks. In its analysis of SWE-Bench Pro, Datacurve found that some tasks had a gold-commit leakage risk, and that some agents could recover the original fix from git history. Claude Opus configurations showed that behavior more often in the SWE-Bench Pro sample, while GPT-5.4 and GPT-5.5 did not. Once that shortcut is removed, DeepSWE looks more like a test of whether a model can solve a genuinely new problem, rather than whether it has seen the answer before.

Did Opus 4.8 catch GPT-5.5 on DeepSWE?

As of now, DeepSWE includes Claude Opus 4.8. The conclusion is fairly clear: Opus 4.8 improved, but it has not overtaken GPT-5.5. The top Opus 4.8 [max] setting is 58% ±5%, below GPT-5.5 [xhigh] at 70% ±4%; it is closer to GPT-5.4 [xhigh] at 56% ±5% and Opus 4.7 [max] at 54% ±5%.
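
The "closer to" reading can be checked with simple interval arithmetic. The sketch below treats each ± figure as a symmetric interval around the score. The source does not say how the margins were computed, so overlap here is a rough guide, not a formal significance test.

```python
# Scores and margins (percentage points) quoted in the text above.
scores = {
    "GPT-5.5 [xhigh]": (70, 4),
    "Claude Opus 4.8 [max]": (58, 5),
    "GPT-5.4 [xhigh]": (56, 5),
    "Claude Opus 4.7 [max]": (54, 5),
}


def interval(score: float, margin: float) -> tuple[float, float]:
    """Symmetric interval implied by 'score ± margin'."""
    return (score - margin, score + margin)


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlaps_with(name: str) -> dict[str, bool]:
    """Compare one configuration's interval against all the others."""
    base = interval(*scores[name])
    return {
        other: overlaps(base, interval(*value))
        for other, value in scores.items()
        if other != name
    }


if __name__ == "__main__":
    for other, hit in overlaps_with("Claude Opus 4.8 [max]").items():
        print(f"{other}: {'overlaps' if hit else 'separate'}")
```

Under this reading, the Opus 4.8 [max] interval of 53% to 63% overlaps GPT-5.4 [xhigh] and Opus 4.7 [max], but sits entirely below GPT-5.5 [xhigh] at 66% to 74%.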

From the chart below:

Figure: DeepSWE table comparing Claude Opus 4.8, Claude Opus 4.7, and GPT-5.5 across effort settings, pass rate, cost, output tokens, and time.
  • Do not default Opus 4.8 to max. Moving Opus 4.8 from medium to high to max raises its score from 47% to 51% to 58%. But relative to high, max raises average cost from $3.98 to $12.58, average output from 48k to 136k tokens, and runtime from around 21 minutes to 44 minutes. Max is stronger, but it is the expensive last gear: reserve it for high-value, long-horizon exploration tasks where failure is costly, not for every daily issue.
  • Opus 4.8's main gain is that its max setting beats Opus 4.7 max while costing less. Opus 4.8 [max] scores 58% against 54% for Opus 4.7 [max], and averages $12.58 against $18.19. So 4.8 did improve, but the improvement is an efficiency and ceiling gain along the same route, not a direct takedown of GPT-5.5.
  • GPT-5.5's advantage is its efficiency at a baseline effort setting. The chart uses GPT-5.5 [medium], not the leaderboard-leading GPT-5.5 [xhigh]. Even so, GPT-5.5 [medium] is already at 48%, with $2.34 cost, 10m 53s runtime, and 18.6k output tokens. It is close to Opus 4.8 [medium] at 47%, but cheaper, faster, and lighter on tokens. In practice, GPT-5.5 looks like the default route for simple to medium-complexity coding tasks, while Opus 4.8 fits tasks that need deeper reasoning, solution discussion, and judgment over complex context.
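
The trade-off in these bullets can be summarized as an expected cost per solved task: average cost per attempt divided by pass rate. This is a simplification assumed here for illustration. It treats every attempt as costing the average whether it passes or fails, and treats retries as independent; Datacurve does not report this figure.

```python
# (average cost per attempt in USD, pass rate) as quoted in the bullets above.
configs = {
    "Opus 4.8 [high]": (3.98, 0.51),
    "Opus 4.8 [max]": (12.58, 0.58),
    "Opus 4.7 [max]": (18.19, 0.54),
    "GPT-5.5 [medium]": (2.34, 0.48),
}


def cost_per_solve(avg_cost: float, pass_rate: float) -> float:
    """Expected spend to obtain one passing result, assuming independent retries."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return avg_cost / pass_rate


def ranked(table: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Configurations sorted from cheapest to most expensive per solved task."""
    costs = {name: cost_per_solve(*values) for name, values in table.items()}
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for name, cost in ranked(configs):
        print(f"{name}: ${cost:.2f} per solved task")
```

On these figures the order is GPT-5.5 [medium] at roughly $4.88, Opus 4.8 [high] at $7.80, Opus 4.8 [max] at $21.69, and Opus 4.7 [max] at $33.69, which matches the advice to treat max as a last gear.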

Reddit reactions are split. Some users say DeepSWE is one of the few benchmarks that matches their lived experience with GPT-5.5, Opus 4.7, and Opus 4.8; in r/developersIndia, one heavy GPT-5.5 user said DeepSWE explains why the model feels steadier on delegated tasks and /goal. Others question whether running every model through mini-swe-agent suppresses Opus's native ceiling. By task type, Opus 4.8 has a good reputation for low-level C, assembly, memory management, high concurrency, lock-free work, and solution discussion, but for business apps, React, SQL, and backend implementation, many users still find Codex/GPT-5.5 more stable in code quality and verification.

What is DeepSWE?

A benchmark built to test real repository-level engineering behavior, not just short-answer coding.

DeepSWE is a benchmark for evaluating frontier coding agents on original, long-horizon software engineering tasks. It was introduced by Datacurve to measure how well AI agents handle realistic coding work that requires repository exploration, multi-file changes, behavioral correctness, and verification.

Unlike benchmark tasks that are copied from existing pull requests or public commits, DeepSWE tasks are written from scratch. Datacurve says this design is intended to reduce training-data contamination and test problem-solving rather than recall.

What is DeepSWE used for?

It is useful when teams care about multi-file implementation, verification, and reliability under real constraints.

DeepSWE is used to compare AI coding agents on tasks closer to real software engineering work than short coding puzzles. It helps researchers, model providers, and engineering teams see which agents can follow a compact developer-style request, inspect an unfamiliar codebase, implement the change, and keep existing behavior working.

The benchmark can also be run by teams that want to score a new agent or reproduce the leaderboard. Datacurve publishes the task corpus, task metadata, verifier format, and instructions for running DeepSWE with Pier.

What are the advantages of DeepSWE?

The benchmark is shaped to reveal capability gaps that smaller or more saturated evaluations may hide.

DeepSWE stands out because it focuses on original tasks, broader repository coverage, and outcome-based verification. Together, those choices make it a stronger proxy for practical coding-agent work than a benchmark that mostly measures recall or tiny edits.

113 original software engineering tasks
91 active open-source repositories
5 languages: TypeScript, Go, Python, JavaScript, Rust
668 mean reference solution lines added

1. Original tasks reduce contamination risk

DeepSWE tasks are not adapted from public fixes. This makes the score less likely to reflect a model having seen the answer during training.

2. Long-horizon tasks resemble agentic development

Datacurve reports that DeepSWE prompts are shorter than SWE-Bench Pro prompts, while reference solutions require substantially more code and touch more files.

3. Broader repository coverage

The task set spans many active repositories instead of concentrating on a small number of flagship projects, making it a broader proxy for day-to-day coding-agent work.

4. Behavioral verifiers reward correct outcomes

DeepSWE verifiers are designed to test observable behavior rather than internal implementation shape, so different correct solutions can pass.
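
A behavioral verifier of this kind can be sketched in a few lines of Python. The task (order-preserving de-duplication), the function names, and the test cases are all hypothetical; this is not DeepSWE's verifier format, only an illustration of checking outputs rather than implementation shape.

```python
from typing import Callable, Iterable


def dedupe_loop(items: Iterable) -> list:
    """Implementation A: explicit loop with a 'seen' set."""
    seen, out = set(), []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_dict(items: Iterable) -> list:
    """Implementation B: relies on dict preserving insertion order."""
    return list(dict.fromkeys(items))


def dedupe_sorted(items: Iterable) -> list:
    """Incorrect attempt: removes duplicates but loses first-seen order."""
    return sorted(set(items))


def verify(dedupe: Callable[[Iterable], list]) -> bool:
    """Behavioral check: only inputs and outputs are inspected, never the code's structure."""
    cases = [
        ([], []),
        ([3, 1, 3, 2, 1], [3, 1, 2]),
        (["a", "a"], ["a"]),
    ]
    return all(dedupe(list(given)) == expected for given, expected in cases)
```

Both `dedupe_loop` and `dedupe_dict` pass `verify` despite looking nothing alike, while `dedupe_sorted` fails because its observable behavior is wrong.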

What are the DeepSWE benchmark results?

The main story is not just ranking, but the amount of separation between frontier model families.

Rank | Model | DeepSWE score | Signal
1 | GPT-5.5 [xhigh] | 70% ±4% | Top published pass rate on the official DeepSWE leaderboard.
2 | Claude Opus 4.8 [max] | 58% ±5% | Newest Opus result on the official leaderboard; above Opus 4.7 max, but still below GPT-5.5.
3 | GPT-5.4 [xhigh] | 56% ±5% | Close to Opus 4.8 within the stated margin and reported as cost-efficient by Datacurve.
4 | Claude Opus 4.7 [max] | 54% ±5% | Close to GPT-5.4 within the stated margin, but now below Opus 4.8 on this benchmark.
5 | Claude Sonnet 4.6 [high] | 32% ±4% | Lower pass rate on long-horizon DeepSWE tasks.

The main takeaway is separation. Datacurve reports that DeepSWE scores span a much wider range than SWE-Bench Pro scores among the same frontier model families, which suggests that long-horizon, original tasks can reveal capability gaps that shorter or more saturated public benchmarks may hide.

What does this mean for coding users?

Use the benchmark as a decision input, then pressure-test the finalists on your own repositories.

For users choosing an AI model for programming, DeepSWE points toward evaluating models on the work you actually need done. If your task is a multi-file change in an unfamiliar repository, a long-horizon benchmark can be a more relevant signal than a short coding quiz or a saturated leaderboard.

The result also suggests that pass rate is not the only practical signal. Datacurve tracks output tokens, wall-clock time, and cost per trial, and reports that more tokens, more time, or higher cost do not consistently produce better results. Developers should compare reliability, cost, latency, and how often a model misses requirements.

A sensible workflow is to use DeepSWE as one benchmark-specific data point, then test the top candidate models on your own repositories, languages, and review standards before standardizing on a coding assistant.

Signal 01

Match the benchmark to your workflow

Prioritize long-horizon evaluations when your developers mostly do repository exploration and multi-file changes.

Signal 02

Measure reliability, not only speed

Track missed requirements, rework, cost, and latency alongside raw pass rate before deciding on a default model.

Signal 03

Run your own bake-off

Benchmarks narrow the field, but your final choice should come from tests on your own repo, review bar, and risk tolerance.
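
A bake-off along these lines needs little more than a per-trial record and a summary function. The sketch below is one possible shape, with invented field names and placeholder model labels; adapt the fields to whatever your review process actually records.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Trial:
    """One attempt by one model on one of your own tasks (hypothetical schema)."""
    model: str
    passed: bool
    missed_requirements: int
    cost_usd: float
    minutes: float


def summarize(trials: list[Trial], model: str) -> dict[str, float]:
    """Aggregate the signals named above for a single model."""
    rows = [t for t in trials if t.model == model]
    if not rows:
        raise ValueError(f"no trials recorded for {model!r}")
    return {
        "pass_rate": sum(t.passed for t in rows) / len(rows),
        "missed_requirement_rate": sum(t.missed_requirements > 0 for t in rows) / len(rows),
        "median_cost_usd": median(t.cost_usd for t in rows),
        "median_minutes": median(t.minutes for t in rows),
    }


# Placeholder data for illustration only.
trials = [
    Trial("model-a", True, 0, 2.0, 10.0),
    Trial("model-a", False, 1, 3.0, 12.0),
    Trial("model-a", True, 0, 2.5, 11.0),
    Trial("model-a", True, 0, 4.0, 20.0),
]

if __name__ == "__main__":
    print(summarize(trials, "model-a"))
```

Medians are used for cost and time so that a single runaway trial does not dominate the comparison.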

DeepSWE tasks and how to run the benchmark

The benchmark covers diverse repository work, and the quickstart is designed for reproducible agent runs.

Task coverage

What tasks are included in DeepSWE?

DeepSWE includes 113 stable tasks across TypeScript, Go, Python, JavaScript, and Rust repositories. Examples published by Datacurve include work such as aborting pending body reads on shutdown, fixing PromQL label sorting, adding config-file parsing to command-line tools, adding deterministic conflict detection to Y.Map writes, and adding XML diff, patch, and merge operations.

Runtime behavior: Shutdown handling, cancellation, async lifecycle, and regression-sensitive behavior.
Data structures: Sorting, pagination, maps, snapshots, schema composition, and deterministic conflict rules.
Developer tooling: CLI config parsing, manifests, linting, profiling, caches, and generated reports.

Quickstart

How can you run DeepSWE?

Datacurve says DeepSWE tasks are Harbor-compatible and can be run with Pier, a framework for sandboxed coding-agent evaluations. The official quickstart clones the DeepSWE repository, installs Pier, and then runs a selected agent and model against the task directory.

git clone https://github.com/datacurve-ai/deep-swe
uv tool install git+https://github.com/datacurve-ai/pier

# GPT-5.5 with mini-swe-agent
export OPENAI_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model openai/gpt-5.5

# Claude Opus 4.7 with mini-swe-agent
export ANTHROPIC_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7
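
When comparing several models, it can help to generate these invocations from one place. The Python sketch below builds the same `pier run` command for each model and prints it without executing anything. It uses only the flags and model identifiers shown in the quickstart above; the helper names are hypothetical, and any further Pier options would need to be checked against Pier's own documentation.

```python
import shlex

# Values taken from the quickstart commands above.
TASK_DIR = "deep-swe/tasks"
AGENT = "mini-swe-agent"
MODELS = ["openai/gpt-5.5", "anthropic/claude-opus-4-7"]


def pier_command(model: str, task_dir: str = TASK_DIR, agent: str = AGENT) -> list[str]:
    """Build the argument list for one run, mirroring the quickstart flags."""
    return ["pier", "run", "-p", task_dir, "--agent", agent, "--model", model]


def render(command: list[str]) -> str:
    """Shell-safe string form of the command, for logging or copy-paste."""
    return shlex.join(command)


if __name__ == "__main__":
    # Dry run: print the commands instead of launching them.
    for model in MODELS:
        print(render(pier_command(model)))
```

Keeping the agent and task directory fixed across models preserves the like-for-like comparison; the matching API key still has to be exported before each command is actually run.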