First, GPT is better at carrying out the full request in DeepSWE, not just the most obvious part of it.
DeepSWE tasks are often more than a simple bug fix. They regularly ask the model to handle multiple parallel cases at once: support the synchronous path and the asynchronous path, or handle one input format and a closely related one. Datacurve's analysis found that Claude often produced a solution that looked close to correct, but still dropped one branch. In plain terms, it might get the main path right while forgetting to mirror the same logic in the second scenario. By contrast, GPT-5.5 had the lowest rate of missing explicit requirements in DeepSWE, with GPT-5.4 very close behind. That suggests GPT is better at turning each requirement in the prompt into actual code changes.