As of May 29, 2026, Claude Opus 4.8 does not have an official DeepSWE leaderboard score. Early informal testing is directionally useful, but it is not leaderboard-grade evidence.
Source: Official source: DeepSWE leaderboard
That does not mean nobody has tried to test it. Product and AI writer Paweł Huryn ran an informal small-sample test after Opus 4.8 was released. In his single pass over roughly a dozen DeepSWE tasks, the result appeared to land in the same general range as Claude Opus 4.7. On tasks where the model had a fair shot, it solved 6 out of 7. On two tasks that no model on the leaderboard had solved before, it scored 0 out of 2.
Source: Third-party source: Paweł Huryn's X thread
However, this should not be treated as an official DeepSWE score. Huryn emphasized that his run was small, used only one attempt per task, and was affected by benchmark harness details such as timeout behavior and the "max effort" setting.
Source: Third-party source: Paweł Huryn's X thread
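To make the small-sample caveat concrete, here is a minimal sketch that puts an approximate 95% Wilson score interval around the 6-out-of-7 and 0-out-of-2 counts quoted above. The function name `wilson_interval` and the choice of the Wilson method are illustrative assumptions; neither Huryn nor the DeepSWE leaderboard is known to report uncertainty this way.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial pass rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point drift at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Counts quoted from the informal run: 6 of 7 "fair shot" tasks, 0 of 2 unsolved tasks.
fair_low, fair_high = wilson_interval(6, 7)
hard_low, hard_high = wilson_interval(0, 2)

print(f"6/7 solved: plausible pass rate {fair_low:.0%} to {fair_high:.0%}")
print(f"0/2 solved: plausible pass rate {hard_low:.0%} to {hard_high:.0%}")
```

With so few tasks, both intervals are very wide, which is consistent with reading the run as directional rather than as a score.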
Claude Opus 4.8 has not yet been officially measured on DeepSWE, and early informal results should be read as directional rather than definitive.
The official DeepSWE leaderboard currently lists Claude Opus 4.7 at 54% ± 5%, but it does not list Claude Opus 4.8.
Source: Official source: DeepSWE leaderboard
This distinction matters. A page about Opus 4.8 and DeepSWE should not say that Opus 4.8 scored 54% on DeepSWE, because that figure is the listed score for Opus 4.7. A safer and more accurate statement is that Claude Opus 4.8 does not yet have an official DeepSWE score.
An informal small-sample run appeared to place Opus 4.8 in the same general range as Claude Opus 4.7, but that run was not leaderboard-grade.
Paweł Huryn tested Claude Opus 4.8 informally after it shipped without a number on the DeepSWE leaderboard.
Source: Third-party source: Paweł Huryn's X thread
His rough result was that Opus 4.8 appeared to land in the same band as Opus 4.7 in a single pass over roughly a dozen tasks. But Huryn did not present this as a formal score. His main point was almost the opposite: a benchmark number is not just the model. It also reflects the harness, runtime environment, timeout rules, and model-specific API settings.
Source: Third-party source: Paweł Huryn's X thread
In a small, informal run, Opus 4.8 looked roughly comparable to Opus 4.7, but the run was too limited and too affected by harness details to be treated as a real leaderboard result.
Claude Opus 4.8 is Anthropic's newer Opus model, released as an upgrade over Claude Opus 4.7.
Source: Official source: Anthropic Opus 4.8 announcement
Anthropic describes Claude Opus 4.8 as its most capable generally available model for complex reasoning, long-horizon agentic coding, and high-autonomy work. These capabilities are relevant to DeepSWE because DeepSWE is not a short coding quiz. Its tasks require long-horizon repository exploration, multi-file code changes, self-verification, and repeated tool use.
Source: Official source: Anthropic Opus 4.8 docs
Model ID: claude-opus-4-8
Anthropic describes Opus 4.8 as a modest but tangible upgrade over Opus 4.7. For DeepSWE-style coding work, the most relevant changes are operational, not just raw intelligence.
Source: Official source: Anthropic Opus 4.8 announcement
Anthropic says Claude Opus 4.8 targets improvements in long-horizon agentic coding, including better long-context handling, fewer compactions, and better recovery after compaction.
Opus 4.8 uses the effort parameter to control how much reasoning the model applies. Anthropic says the model defaults to high effort across major surfaces, and Huryn noted that a harness that handles the max effort setting incorrectly can distort benchmark results.
Anthropic says Opus 4.8 is less likely to skip a tool call that a task requires. That matters in DeepSWE, where the harness expects the model to inspect files, run commands, edit code, and verify behavior.
Opus 4.8 introduces fast mode as a research preview on the Claude API. Anthropic says it can deliver up to 2.5x higher output tokens per second at premium pricing.
Opus 4.8 lowers the minimum cacheable prompt length to 1,024 tokens, which can reduce repeated input cost for long-running agent loops.
Anthropic highlights honesty as a prominent improvement. In Anthropic's evaluation, Opus 4.8 was around four times less likely than its predecessor to let flaws in its own generated code pass unremarked.
Anthropic's launch material reports stronger benchmark performance for Claude Opus 4.8 compared with Claude Opus 4.7. These numbers are useful context, but they should not be confused with DeepSWE results.
Source: Official source: Anthropic Opus 4.8 announcement
| Benchmark | Claude Opus 4.7 | Claude Opus 4.8 | What it suggests |
|---|---|---|---|
| SWE-bench Verified | 87.6% | 88.6% | Small improvement on a common coding benchmark. |
| SWE-bench Pro | 64.3% | 69.2% | Larger improvement on a harder software engineering benchmark. |
| Terminal-Bench 2.1 | 66.1% | 74.6% | Stronger terminal-use and agentic execution signal. |
These are not DeepSWE scores. They show that Opus 4.8 improved on several official or commonly cited evaluations, but they do not tell us where Opus 4.8 lands on DeepSWE.
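As a quick check on the table, the sketch below computes the percentage-point gain on each benchmark from the reported figures. The dictionary layout and the helper name `point_gains` are illustrative choices, not part of any published tooling.

```python
# Reported scores from the table above, as (Opus 4.7, Opus 4.8) in percent.
reported = {
    "SWE-bench Verified": (87.6, 88.6),
    "SWE-bench Pro": (64.3, 69.2),
    "Terminal-Bench 2.1": (66.1, 74.6),
}


def point_gains(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return the percentage-point difference (new minus old) per benchmark."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


gains = point_gains(reported)
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The gains differ widely across the three benchmarks, which is one more reason not to extrapolate any single number to DeepSWE.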
DeepSWE is a coding benchmark from Datacurve. Its goal is to measure frontier coding agents on original, long-horizon engineering tasks.
Source: Official source: DeepSWE leaderboard
This makes DeepSWE especially relevant for models like Claude Opus 4.8. Anthropic says Opus 4.8 improves long-horizon agentic coding, and DeepSWE is built to test exactly that kind of work.
Source: Official source: Anthropic Opus 4.8 docs
Tasks are written from scratch rather than adapted from existing commits or pull requests, so models are less likely to have seen the solution during pretraining.
DeepSWE spans 113 tasks, 91 repositories, and 5 languages: TypeScript, Go, Python, JavaScript, and Rust.
The official blog reports a mean prompt length of 2,158 characters, a mean reference solution size of 668 lines added, and a mean of 7 files edited.
Verifiers are hand-written to test software behavior rather than exact implementation details, so different correct solutions can pass.
The current official DeepSWE leaderboard lists the following published scores. Claude Opus 4.8 is not currently listed.
Source: Official source: DeepSWE leaderboard
| Model | DeepSWE score |
|---|---|
| GPT-5.5 [xhigh] | 70% ± 4% |
| GPT-5.4 [xhigh] | 56% ± 5% |
| Claude Opus 4.7 [max] | 54% ± 5% |
| Claude Sonnet 4.6 [high] | 32% ± 4% |
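One rough way to read the ± values is to check whether two models' ranges overlap. The sketch below does that for the listed scores. It assumes the ± figure can be treated as a symmetric range around the score, and the overlap check is a heuristic rather than a formal significance test; the leaderboard's own definition of ± is not restated here.

```python
# Listed scores as (score, plus_minus) in percent, copied from the table above.
leaderboard = {
    "GPT-5.5 [xhigh]": (70, 4),
    "GPT-5.4 [xhigh]": (56, 5),
    "Claude Opus 4.7 [max]": (54, 5),
    "Claude Sonnet 4.6 [high]": (32, 4),
}


def ranges_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the intervals score ± margin of a and b share any point."""
    (score_a, margin_a), (score_b, margin_b) = a, b
    return abs(score_a - score_b) <= margin_a + margin_b


names = list(leaderboard)
# Compare each model with the one listed directly below it.
for upper, lower in zip(names, names[1:]):
    verdict = "overlap" if ranges_overlap(leaderboard[upper], leaderboard[lower]) else "are separated"
    print(f"{upper} vs {lower}: ranges {verdict}")
```

Under this reading, GPT-5.4 and Claude Opus 4.7 are not clearly separated, so a model landing "in the same band as Opus 4.7" could plausibly sit on either side of it.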
Paweł Huryn is the creator of The Product Compass, a product and AI-focused newsletter that describes itself as having over 135,000 subscribers.
Source: Background source: The Product Compass
His Opus 4.8 DeepSWE thread is worth watching because he did more than repeat a benchmark headline. He attempted to run the benchmark himself, spent money on the test, and then explained why the result should not be treated as a formal score.
Source: Third-party source: Paweł Huryn's X thread
The value of Huryn's test is not simply the finding that Opus 4.8 looks similar to Opus 4.7. The more important lesson is that DeepSWE scores depend on the model, the harness, timeout rules, runtime environment, and model-specific API settings.
According to Huryn's X thread, he ran a single pass over roughly a dozen DeepSWE tasks after Claude Opus 4.8 shipped without an official leaderboard number.
Source: Third-party source: Paweł Huryn's X thread
But this was not a full benchmark run. Huryn said his test used one attempt per task on a small sample he selected. By contrast, he said the official board-grade run uses four attempts across all 113 tasks.
Source: Third-party source: Paweł Huryn's X thread
Huryn's test is useful, but it has clear limitations.
A dozen tasks is too small to represent the full DeepSWE benchmark, which contains 113 tasks across 91 repositories and 5 languages.
If the official board runs multiple attempts per task, then a one-attempt-per-task test is not directly comparable.
A local run may not match the cloud environment used for official leaderboard runs, and timeout behavior can materially change the result.
When a new model changes API behavior or effort settings, the harness may need updates before the result is comparable.
Huryn said he had already spent a few hundred dollars on the small run. A full board-grade run can cost much more because it requires many tasks and multiple attempts per task.
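The first two limitations can be illustrated with a back-of-the-envelope standard error calculation. The sketch below compares roughly 12 tasks at one attempt each against 113 tasks at four attempts each. Two assumptions are made purely for illustration: the 54% pass rate is borrowed from the listed Opus 4.7 score, and every attempt is treated as an independent trial, which overstates the precision of the full run because repeated attempts on the same task are correlated.

```python
import math


def standard_error(pass_rate: float, trials: int) -> float:
    """Binomial standard error of an observed pass rate over independent trials."""
    return math.sqrt(pass_rate * (1 - pass_rate) / trials)


ILLUSTRATIVE_PASS_RATE = 0.54  # assumption: borrowed from the listed Opus 4.7 score

small_run = standard_error(ILLUSTRATIVE_PASS_RATE, trials=12 * 1)   # ~12 tasks, 1 attempt
full_run = standard_error(ILLUSTRATIVE_PASS_RATE, trials=113 * 4)   # 113 tasks, 4 attempts

print(f"small run standard error: {small_run:.1%}")
print(f"full run standard error:  {full_run:.1%}")
print(f"the small run is about {small_run / full_run:.1f}x noisier")
```

Even under these generous assumptions, the small run is several times noisier than a full run, so a difference of a few points between Opus 4.7 and Opus 4.8 would not be visible in it.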
A benchmark harness is the system around the model: the prompt, tools, runtime, timeout rules, execution environment, and API settings.
DeepSWE intentionally uses mini-swe-agent to keep this scaffolding fixed. That makes the comparison cleaner than giving each model its own native coding product, such as Claude Code, Codex CLI, Cursor, or Gemini CLI.
If the harness does not call Opus 4.8 correctly, the score can understate or distort the model's real capability.
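One way to think about comparability is to write the harness down as an explicit configuration and diff two runs. The sketch below is hypothetical: `HarnessConfig`, its field names, and the timeout values are illustrative placeholders, not the actual DeepSWE or mini-swe-agent configuration. Only the scaffold name, the attempt counts, the max effort label, and the local versus cloud distinction come from the text above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical description of everything around the model in a benchmark run."""
    scaffold: str
    attempts_per_task: int
    timeout_seconds: int  # placeholder values below; real limits are not stated here
    effort: str
    runtime: str


def config_differences(a: HarnessConfig, b: HarnessConfig) -> dict[str, tuple]:
    """Return every field whose value differs between two harness configurations."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


board_run = HarnessConfig("mini-swe-agent", attempts_per_task=4,
                          timeout_seconds=3600, effort="max", runtime="cloud")
informal_run = HarnessConfig("mini-swe-agent", attempts_per_task=1,
                             timeout_seconds=1800, effort="max", runtime="local")

differences = config_differences(board_run, informal_run)
for field_name, (board_value, informal_value) in differences.items():
    print(f"{field_name}: board={board_value!r} informal={informal_value!r}")
```

Any field that shows up in the diff is a reason the two scores may not be directly comparable, even when the scaffold and the model are the same.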
One reason DeepSWE has attracted attention is that it directly addresses contamination and benchmark leakage.
Datacurve's DeepSWE blog argues that many existing coding benchmarks rely on public GitHub issues, pull requests, or commits. That creates a risk that the answer, or a close version of the answer, already exists in model training data or in the benchmark runtime environment.
CHEATED is the benchmark analyzer's verdict label. It does not mean every Claude result is invalid, and it does not mean DeepSWE is accusing all Claude usage of cheating.
The early signal is that Claude Opus 4.8 may not obviously leap far ahead of Opus 4.7 on Huryn's small informal sample. But that is not the same as saying Opus 4.8 has failed DeepSWE.
Claude Opus 4.8 has official improvements that should matter for long-horizon coding work. Anthropic highlights better agentic coding, better long-context behavior, better tool triggering, fast mode, effort control, and improved honesty.
But the currently available Opus 4.8 DeepSWE result is informal, small-sample, and affected by harness details. Until a full official run appears, the result should be treated as an early observation rather than a final score.
Claude Opus 4.8 does not yet have an official DeepSWE score.
Paweł Huryn's informal run suggests Opus 4.8 may be in the same general range as Opus 4.7 on a small sample, but the test is not comparable to a full leaderboard run.
The sample was small, each task was attempted only once, timeout behavior affected some results, and the harness still needs proper support for Opus 4.8's effort setting.
Claude Opus 4.8 has not yet been officially ranked on DeepSWE. Early informal testing is interesting, but not definitive. Wait for a full board-grade run before drawing a strong conclusion.
As of May 29, 2026, the official DeepSWE leaderboard does not list Claude Opus 4.8. It lists Claude Opus 4.7 at 54% ± 5%, but Opus 4.8 has not been added yet.
Huryn did test it: he ran an informal small-sample test after Opus 4.8 was released. His result appeared to place Opus 4.8 in roughly the same range as Opus 4.7, but he explicitly said it was not a leaderboard-grade result.
It is not accurate to say that Opus 4.8 scored 54% on DeepSWE. A better statement is that in one informal small-sample run, Opus 4.8 appeared to land in the same general range as Opus 4.7.
A proper DeepSWE run is expensive, time-consuming, and sensitive to harness compatibility. Huryn also noted that the current setup needed updates for Opus 4.8's max-effort setting.
DeepSWE uses a standardized harness so each model gets the same environment. This helps fairness, but it also means the score partly depends on how well the harness works with each model's API, effort settings, timeout behavior, and tool usage.
Anthropic reports that Opus 4.8 improves over Opus 4.7 on several benchmarks and product behaviors, including agentic coding, effort calibration, tool triggering, and honesty. But those improvements do not automatically tell us the official DeepSWE score.
Do not treat a benchmark number as pure model capability. For long-horizon coding agents, the final score can be affected by the model, the harness, the prompt, the tools, timeout limits, environment speed, and effort settings.