Claude Opus 4.5 performs strongly on software-engineering benchmarks. On the widely used SWE-bench Verified, Opus 4.5 achieves roughly 80.9% success, placing it at the top of current frontier models on real-world coding tasks. That result reflects improved reasoning, comprehension, and coding stability over prior Claude versions.
In practical terms, this means that many typical open-source issues, such as bug fixes, small feature additions, and code refactors, are within reach for an agent built around Opus 4.5. The score is meaningful because SWE-bench Verified draws its tasks from real issues in actual repositories rather than small toy problems.
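As a concrete illustration, here is a minimal sketch of how such an agent might hand an issue to Opus 4.5 through the Anthropic Messages API. The call shape (`client.messages.create`) follows the published Python SDK; the model ID string and the toy issue text are assumptions for illustration, so verify the ID against Anthropic's current model list.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# Hypothetical issue text, standing in for a real repository issue.
issue = """\
Title: `parse_date` crashes on ISO strings with a trailing 'Z'
Steps: parse_date("2024-05-01T12:00:00Z") raises ValueError.
"""

response = client.messages.create(
    model="claude-opus-4-5",  # assumed model ID; check the current model list
    max_tokens=2048,
    system="You are a software engineer. Propose a minimal patch as a unified diff.",
    messages=[{"role": "user", "content": issue}],
)

print(response.content[0].text)  # the model's proposed diff
```

A production agent would wrap a call like this in a loop that applies the diff, runs the test suite, and feeds failures back to the model, which is roughly the workflow SWE-bench-style evaluations measure.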
Beyond SWE-bench, Opus 4.5 also improves on “computer-use” benchmarks: its evaluation on OSWorld, a GUI-oriented benchmark, shows strong performance, demonstrating that it can operate tools beyond a code editor, such as spreadsheets, slide decks, and mixed workflows. These results suggest that Opus 4.5 is effective not only at code but at broader agentic tasks combining code, documents, UI manipulation, and tooling.
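For readers curious how that GUI interaction is driven in practice, the sketch below shows one way to request Anthropic's computer-use tool, in which the model emits actions (screenshots, clicks, keystrokes) for an agent loop to execute. The tool type and beta flag here follow Anthropic's published computer-use beta, but treat the exact identifiers, the model ID, and the display dimensions as assumptions to check against the current documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-opus-4-5",          # assumed model ID; verify against the docs
    max_tokens=1024,
    tools=[{
        "type": "computer_20250124",  # assumed tool version from the computer-use beta
        "name": "computer",
        "display_width_px": 1280,     # dimensions of the virtual display the agent controls
        "display_height_px": 800,
    }],
    betas=["computer-use-2025-01-24"],  # assumed beta flag; check current docs
    messages=[{"role": "user", "content": "Open the spreadsheet and sum column B."}],
)

# The model replies with tool_use blocks naming actions such as "screenshot"
# or "left_click"; the agent executes each one and returns a tool_result.
for block in response.content:
    print(block.type)
```

The design is deliberately indirect: the model never touches the screen itself, so the surrounding harness decides which actions to execute, which is what OSWorld-style benchmarks exercise end to end.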