DeepSWE Information Hub

Claude Opus 4.8 DeepSWE Result: What We Know So Far

Claude Opus 4.8 does not have an official DeepSWE leaderboard score yet. Early informal testing is directionally useful, but it is not leaderboard-grade evidence.

Quick answer

As of May 29, 2026, Claude Opus 4.8 does not have an official DeepSWE leaderboard score.

Source: Official source: DeepSWE leaderboard

That does not mean nobody has tried to test it. Product and AI writer Paweł Huryn ran an informal small-sample test after Opus 4.8 was released. In his single pass over roughly a dozen DeepSWE tasks, the result appeared to land in the same general range as Claude Opus 4.7. On tasks where the model had a fair shot, it solved 6 out of 7. On two tasks that no model on the leaderboard had solved before, it scored 0 out of 2.

Source: Third-party source: Paweł Huryn's X thread

However, this should not be treated as an official DeepSWE score. Huryn emphasized that his run was small, used only one attempt per task, and was affected by benchmark harness details such as timeout behavior and the "max effort" setting.

Source: Third-party source: Paweł Huryn's X thread

Claude Opus 4.8 has not yet been officially measured on DeepSWE, and early informal results should be read as directional rather than definitive.

Does Claude Opus 4.8 have an official DeepSWE score?

No. The official DeepSWE leaderboard currently lists Claude Opus 4.7 at 54% ± 5%, but it does not list Claude Opus 4.8.

Source: Official source: DeepSWE leaderboard

This distinction matters. The 54% figure belongs to Claude Opus 4.7, so it would be inaccurate to say that Opus 4.8 scored 54% on DeepSWE. The accurate statement is that Claude Opus 4.8 does not yet have an official DeepSWE score.

An informal small-sample run appeared to place Opus 4.8 in the same general range as Claude Opus 4.7, but that run was not leaderboard-grade.

What did Paweł Huryn's informal test suggest?

Paweł Huryn tested Claude Opus 4.8 informally after it shipped without a number on the DeepSWE leaderboard.

Source: Third-party source: Paweł Huryn's X thread

His rough result was that Opus 4.8 appeared to land in the same band as Opus 4.7 in a single pass over roughly a dozen tasks. But Huryn did not present this as a formal score. His main point was almost the opposite: a benchmark number is not just the model. It also reflects the harness, runtime environment, timeout rules, and model-specific API settings.

Source: Third-party source: Paweł Huryn's X thread

In a small, informal run, Opus 4.8 looked roughly comparable to Opus 4.7, but the run was too limited and too affected by harness details to be treated as a real leaderboard result.

What is Claude Opus 4.8?

Claude Opus 4.8 is Anthropic's newer Opus model, released as an upgrade over Claude Opus 4.7.

Source: Official source: Anthropic Opus 4.8 announcement

Anthropic describes Claude Opus 4.8 as its most capable generally available model for complex reasoning, long-horizon agentic coding, and high-autonomy work. These capabilities are relevant to DeepSWE because DeepSWE is not a short coding quiz. Its tasks require long-horizon repository exploration, multi-file code changes, self-verification, and repeated tool use.

Source: Official source: Anthropic Opus 4.8 docs

  • Model ID: claude-opus-4-8
  • 1M token context window by default on the Claude API, Amazon Bedrock, and Vertex AI.
  • 200k context on Microsoft Foundry at launch.
  • 128k max output tokens.
  • Adaptive thinking plus the same major platform features as Claude Opus 4.7.

What changed from Claude Opus 4.7?

Anthropic describes Opus 4.8 as a modest but tangible upgrade over Opus 4.7. For DeepSWE-style coding work, the most relevant changes are operational, not just raw intelligence.

Source: Official source: Anthropic Opus 4.8 announcement

1. Better long-horizon agentic coding

Anthropic says Claude Opus 4.8 targets improvements in long-horizon agentic coding, including better long-context handling, fewer compactions, and better recovery after compaction.

2. Better effort calibration

Opus 4.8 uses the effort parameter to control how much reasoning the model applies. Anthropic says the model defaults to high effort across major surfaces. Huryn noted that if a harness handles the max effort setting incorrectly, the benchmark result can be distorted.

3. Better tool triggering

Anthropic says Opus 4.8 is less likely to skip a tool call that a task requires. That matters in DeepSWE, where the harness expects the model to inspect files, run commands, edit code, and verify behavior.

4. Fast mode

Opus 4.8 introduces fast mode as a research preview on the Claude API. Anthropic says it can deliver up to 2.5x higher output tokens per second at premium pricing.

5. Lower prompt cache minimum

Opus 4.8 lowers the minimum cacheable prompt length to 1,024 tokens, which can reduce repeated input cost for long-running agent loops.

6. Honesty and self-correction

Anthropic highlights honesty as a prominent improvement. In Anthropic's evaluation, Opus 4.8 was around four times less likely than its predecessor to let flaws in its own generated code pass unremarked.

Official Opus 4.8 benchmark signals, but not DeepSWE

Anthropic's launch material reports stronger benchmark performance for Claude Opus 4.8 compared with Claude Opus 4.7. These numbers are useful context, but they should not be confused with DeepSWE results.

Source: Official source: Anthropic Opus 4.8 announcement

Benchmark          | Claude Opus 4.7 | Claude Opus 4.8 | What it suggests
SWE-bench Verified | 87.6%           | 88.6%           | Small improvement on a common coding benchmark.
SWE-bench Pro      | 64.3%           | 69.2%           | Larger improvement on a harder software engineering benchmark.
Terminal-Bench 2.1 | 66.1%           | 74.6%           | Stronger terminal-use and agentic execution signal.

These are not DeepSWE scores. They show that Opus 4.8 improved on several official or commonly cited evaluations, but they do not tell us where Opus 4.8 lands on DeepSWE.
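
The size of each gain is easier to compare as a percentage-point difference. The short Python sketch below only redoes the subtraction on the figures quoted above; the variable names are illustrative, and nothing here is a DeepSWE measurement.

```python
# Scores quoted above from Anthropic's launch material: (Opus 4.7, Opus 4.8), in percent.
reported = {
    "SWE-bench Verified": (87.6, 88.6),
    "SWE-bench Pro": (64.3, 69.2),
    "Terminal-Bench 2.1": (66.1, 74.6),
}

# Percentage-point gain of Opus 4.8 over Opus 4.7, rounded to one decimal place
# to hide floating-point noise from the subtraction.
gains = {name: round(new - old, 1) for name, (old, new) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The gains come out to 1.0, 4.9, and 8.5 points respectively, so the largest reported improvement is on the terminal-use benchmark rather than on SWE-bench Verified.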

Why DeepSWE is an important benchmark

DeepSWE is a coding benchmark from Datacurve. Its goal is to measure frontier coding agents on original, long-horizon engineering tasks.

Source: Official source: DeepSWE leaderboard

This makes DeepSWE especially relevant for models like Claude Opus 4.8. Anthropic says Opus 4.8 improves long-horizon agentic coding, and DeepSWE is built to test exactly that kind of work.

Source: Official source: Anthropic Opus 4.8 docs

1. Contamination-free tasks

Tasks are written from scratch rather than adapted from existing commits or pull requests, so models are less likely to have seen the solution during pretraining.

2. Broad repository coverage

DeepSWE spans 113 tasks, 91 repositories, and 5 languages: TypeScript, Go, Python, JavaScript, and Rust.

3. Real-world complexity

The official blog reports a mean prompt length of 2,158 characters, a mean reference solution size of 668 lines added, and a mean of 7 files edited.

4. Behavioral verification

Verifiers are hand-written to test software behavior rather than exact implementation details, so different correct solutions can pass.
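
The difference between behavioral verification and implementation matching can be shown with a toy example. The Python sketch below is hypothetical and not taken from DeepSWE: `dedupe_loop`, `dedupe_dict`, `dedupe_sorted`, and `verify_dedupe` are invented names, and a real verifier would exercise a whole repository rather than one function.

```python
from typing import Callable, Iterable, List

def dedupe_loop(items: Iterable[str]) -> List[str]:
    """Remove duplicates with an explicit loop, keeping first occurrences."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

def dedupe_dict(items: Iterable[str]) -> List[str]:
    """Same behavior, written differently: dict keys preserve insertion order."""
    return list(dict.fromkeys(items))

def dedupe_sorted(items: Iterable[str]) -> List[str]:
    """Incorrect for this spec: removes duplicates but loses the original order."""
    return sorted(set(items))

def verify_dedupe(candidate: Callable[[Iterable[str]], List[str]]) -> bool:
    """Behavioral check: looks only at outputs, never at how the code is written."""
    cases = [
        (["a", "b", "a", "c", "b"], ["a", "b", "c"]),
        (["b", "a", "b"], ["b", "a"]),
        ([], []),
    ]
    return all(candidate(list(inp)) == expected for inp, expected in cases)

for fn in (dedupe_loop, dedupe_dict, dedupe_sorted):
    print(fn.__name__, verify_dedupe(fn))
```

The two correct implementations share no code shape yet both pass, while the order-losing variant fails. That is the property a behavior-based verifier is meant to provide.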

What is the current official DeepSWE result for Claude models?

The current official DeepSWE leaderboard lists the following published scores. Claude Opus 4.8 is not currently listed.

Source: Official source: DeepSWE leaderboard

  • Official leaderboard score for Claude Opus 4.8: not available yet.
  • Informal third-party testing: useful as an early signal, but not definitive.

Model                    | DeepSWE score
GPT-5.5 [xhigh]          | 70% ± 4%
GPT-5.4 [xhigh]          | 56% ± 5%
Claude Opus 4.7 [max]    | 54% ± 5%
Claude Sonnet 4.6 [high] | 32% ± 4%
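
When reading the ± figures, it helps to check whether two models' ranges overlap before treating one as clearly ahead. The Python sketch below assumes each ± value is a symmetric margin around the score, which the leaderboard excerpt here does not spell out; `interval` and `overlaps` are illustrative helpers, and range overlap is a rough heuristic rather than a significance test.

```python
# Published scores listed above: (score, margin), in percentage points.
scores = {
    "GPT-5.5 [xhigh]": (70, 4),
    "GPT-5.4 [xhigh]": (56, 5),
    "Claude Opus 4.7 [max]": (54, 5),
    "Claude Sonnet 4.6 [high]": (32, 4),
}

def interval(name):
    """Return (low, high), assuming the margin is symmetric around the score."""
    score, margin = scores[name]
    return score - margin, score + margin

def overlaps(a, b):
    """True if the two models' score ranges share at least one point."""
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

print(overlaps("GPT-5.4 [xhigh]", "Claude Opus 4.7 [max]"))  # True: 51-61 vs 49-59
print(overlaps("GPT-5.5 [xhigh]", "Claude Opus 4.7 [max]"))  # False: 66-74 vs 49-59
```

Under that assumption, the gap between GPT-5.4 and Claude Opus 4.7 sits inside the stated uncertainty, while the gap to GPT-5.5 does not.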

Who is Paweł Huryn, and why is his test worth watching?

Paweł Huryn is the creator of The Product Compass, a product and AI-focused newsletter that describes itself as having over 135,000 subscribers.

Source: Background source: The Product Compass

His Opus 4.8 DeepSWE thread is worth watching because he did more than repeat a benchmark headline. He attempted to run the benchmark himself, spent money on the test, and then explained why the result should not be treated as a formal score.

Source: Third-party source: Paweł Huryn's X thread

The value of Huryn's test is not simply the finding that Opus 4.8 looked similar to Opus 4.7. The more important lesson is that a DeepSWE score depends on the model, the harness, timeout rules, the runtime environment, and model-specific API settings.

How did Huryn run the test?

According to Huryn's X thread, he ran a single pass over roughly a dozen DeepSWE tasks after Claude Opus 4.8 shipped without an official leaderboard number.

Source: Third-party source: Paweł Huryn's X thread

But this was not a full benchmark run. Huryn said his test used one attempt per task on a small sample he selected. By contrast, he said the official board-grade run uses four attempts across all 113 tasks.

Source: Third-party source: Paweł Huryn's X thread
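
The gap in scale follows directly from those figures. The Python arithmetic below uses the task and attempt counts Huryn cited; the value 12 stands in for "roughly a dozen" and is an approximation, not a number from his thread.

```python
FULL_TASKS = 113          # tasks in the full benchmark
ATTEMPTS_PER_TASK = 4     # attempts per task in a board-grade run, per Huryn
INFORMAL_TASKS = 12       # assumption: "roughly a dozen"
INFORMAL_ATTEMPTS = 1     # one attempt per task in the informal pass

full_runs = FULL_TASKS * ATTEMPTS_PER_TASK
informal_runs = INFORMAL_TASKS * INFORMAL_ATTEMPTS

print(full_runs)                            # 452
print(round(full_runs / informal_runs, 1))  # 37.7
```

On those assumptions, a board-grade run involves roughly 38 times as many task attempts as the informal pass.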

Why the informal Opus 4.8 result has limitations

Huryn's test is useful, but it has clear limitations.

1. Small sample size

A sample of roughly a dozen tasks is too small to represent the full DeepSWE benchmark, which contains 113 tasks across 91 repositories and 5 languages.

2. Only one attempt per task

Huryn said the official board-grade run uses four attempts per task, so a test with one attempt per task is not directly comparable.

3. Local runtime environment

A local run may not match the cloud environment used for official leaderboard runs, and timeout behavior can materially change the result.

4. Harness compatibility

When a new model changes API behavior or effort settings, the harness may need updates before the result is comparable.

5. Cost

Huryn said he had already spent a few hundred dollars on the small run. A full board-grade run can cost much more because it requires many tasks and multiple attempts per task.
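
To put the small-sample limitation in numbers, a Wilson score interval shows how wide the uncertainty is around a result like 6 out of 7. This is a standard statistical approximation applied here for illustration. It is not how DeepSWE computes its ± values, and `wilson_interval` is a name chosen for this Python sketch.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

low, high = wilson_interval(6, 7)
print(f"6/7 solved: plausible pass rate {low:.0%} to {high:.0%}")
```

For 6 out of 7, the interval spans roughly 49% to 97%, which is far too wide to separate Opus 4.8 from Opus 4.7. The seven tasks were also a subset of a sample that Huryn selected himself, so even this range should be read loosely.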

Why benchmark harnesses can change the result

A benchmark harness is the system around the model: the prompt, tools, runtime, timeout rules, execution environment, and API settings.

DeepSWE intentionally uses mini-swe-agent to keep this scaffolding fixed. That makes the comparison cleaner than giving each model its own native coding product, such as Claude Code, Codex CLI, Cursor, or Gemini CLI.

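Under this setup, every model works with:

  • A shared prompt.
  • A bash tool.
  • No vendor-specific edit tool.
  • No model-specific coding workflow.

The trade-off is that the score also depends on how well the harness talks to each model. If the harness does not call Opus 4.8 correctly, the score can understate or distort the model's real capability.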
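
One way to picture this is to treat the harness as a set of fixed settings that sit between the model and the score. The Python sketch below is purely illustrative: `HarnessConfig`, `TaskRun`, `pass_rate`, the timeout values, and the run durations are all invented, and none of it reflects mini-swe-agent's actual configuration or API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical harness settings held constant across models."""
    prompt: str
    tools: Tuple[str, ...]
    timeout_minutes: int
    effort: str

@dataclass(frozen=True)
class TaskRun:
    """Hypothetical outcome of one attempt, before the timeout is applied."""
    task_id: str
    minutes_needed: int
    solution_correct: bool

def pass_rate(runs: List[TaskRun], config: HarnessConfig) -> float:
    """A run counts only if it is correct AND finishes within the timeout."""
    passed = sum(
        1 for r in runs
        if r.solution_correct and r.minutes_needed <= config.timeout_minutes
    )
    return passed / len(runs)

# Invented attempts: identical model behavior in both scenarios below.
runs = [
    TaskRun("task-a", 20, True),
    TaskRun("task-b", 50, True),   # correct, but slow
    TaskRun("task-c", 35, False),
    TaskRun("task-d", 70, True),   # correct, but slower still
]

strict = HarnessConfig("shared prompt", ("bash",), timeout_minutes=30, effort="max")
lenient = HarnessConfig("shared prompt", ("bash",), timeout_minutes=60, effort="max")

print(pass_rate(runs, strict), pass_rate(runs, lenient))
```

With these made-up numbers, the same four attempts score 0.25 under the 30-minute limit and 0.5 under the 60-minute limit. The model did nothing differently; only the harness setting changed.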

Benchmark leakage, CHEATED labels, and why DeepSWE exists

One reason DeepSWE has attracted attention is that it directly addresses contamination and benchmark leakage.

Datacurve's DeepSWE blog argues that many existing coding benchmarks rely on public GitHub issues, pull requests, or commits. That creates a risk that the answer, or a close version of the answer, already exists in model training data or in the benchmark runtime environment.

DeepSWE's design responds to that risk in several ways:

  • DeepSWE tasks are written from scratch.
  • The reference solutions are not copied from existing public commits.
  • Verifiers test behavior, not exact code shape.
  • The benchmark tries to reduce the chance that a model can pass by recalling or retrieving an existing answer.

CHEATED is a verdict label produced by the benchmark's analyzer. It does not mean every Claude result is invalid, and it does not mean DeepSWE is accusing all Claude usage of cheating.

What this means for interpreting Opus 4.8 on DeepSWE

The early signal from Huryn's small informal sample is that Claude Opus 4.8 does not obviously leap ahead of Opus 4.7. But that is not the same as saying Opus 4.8 has failed DeepSWE.

Anthropic reports improvements in Opus 4.8 that should matter for long-horizon coding work: better agentic coding, better long-context behavior, better tool triggering, fast mode, effort control, and improved honesty.

But the currently available Opus 4.8 DeepSWE result is informal, small-sample, and affected by harness details. Until a full official run appears, the result should be treated as an early observation rather than a final score.

Bottom line

Claude Opus 4.8 does not yet have an official DeepSWE score.

Paweł Huryn's informal run suggests Opus 4.8 may be in the same general range as Opus 4.7 on a small sample, but the test is not comparable to a full leaderboard run.

The sample was small, each task was attempted only once, timeout behavior affected some results, and the harness still needs proper support for Opus 4.8's effort setting.

Claude Opus 4.8 has not yet been officially ranked on DeepSWE. Early informal testing is interesting, but not definitive. Wait for a full board-grade run before drawing a strong conclusion.

FAQ

Does Claude Opus 4.8 have an official DeepSWE score?

No. As of May 29, 2026, the official DeepSWE leaderboard does not list Claude Opus 4.8. It lists Claude Opus 4.7 at 54% ± 5%, but Opus 4.8 has not been added yet.

Did Paweł Huryn test Claude Opus 4.8 on DeepSWE?

Yes. He ran an informal small-sample test after Opus 4.8 was released. His result appeared to place Opus 4.8 in roughly the same range as Opus 4.7, but he explicitly said it was not a leaderboard-grade result.

Can we say Opus 4.8 scored the same as Opus 4.7?

No. A better statement is that in one informal small-sample run, Opus 4.8 appeared to land in the same general range as Opus 4.7.

Why is there no official Opus 4.8 DeepSWE result yet?

A proper DeepSWE run is expensive, time-consuming, and sensitive to harness compatibility. Huryn also noted that the current setup needed updates for Opus 4.8's max-effort setting.

Why does the benchmark harness matter?

DeepSWE uses a standardized harness so each model gets the same environment. This helps fairness, but it also means the score partly depends on how well the harness works with each model's API, effort settings, timeout behavior, and tool usage.

Is Opus 4.8 better than Opus 4.7?

Anthropic reports that Opus 4.8 improves over Opus 4.7 on several benchmarks and product behaviors, including agentic coding, effort calibration, tool triggering, and honesty. But those improvements do not automatically tell us the official DeepSWE score.

What should developers take away from this?

Do not treat a benchmark number as pure model capability. For long-horizon coding agents, the final score can be affected by the model, the harness, the prompt, the tools, timeout limits, environment speed, and effort settings.
