In this newsletter:
- Claude Opus 4.7 checks its own work before responding
- Perplexity personal computer automates your to-do list
- Canva AI assistant creates designs using tools and layers
Plus, you’ll find new AI tools and this week’s top AI news headlines!
⚡ Anthropic's Claude Opus 4.7 Checks Its Own Work Before Responding
Anthropic released Claude Opus 4.7, its most powerful publicly available model, outperforming OpenAI's GPT-5.4 and Google's Gemini 3.1 Pro on key benchmarks, including agentic coding, tool use, computer use, and financial analysis.
What's new:
- Available today across Amazon Bedrock, Google Cloud's Vertex AI, and Microsoft Foundry
- API pricing unchanged at $5/$25 per million tokens
- Processes images up to 2,576 pixels on the longest edge (roughly 3.75 megapixels), a threefold increase from previous versions
- New "effort" parameter with xhigh (extra high) setting between high and max for granular control over depth of reasoning
- Task budgets in public beta let developers set a hard ceiling on token spend for autonomous agents
- Updated tokenizer improves text-processing efficiency but can inflate the token count of some inputs by up to 1.35x
- Claude Code gets a new /ultrareview command that flags subtle design flaws and logic gaps like a senior human reviewer
- Auto mode extended to Max plan users for autonomous decisions without constant permission prompts
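Since pricing is unchanged but the new tokenizer can inflate token counts, the two bullets above combine into a simple worst-case budget check. A minimal sketch (one assumption flagged in the comments: the 1.35x ceiling is applied to both input and output tokens, which Anthropic has not specified):

```python
# Back-of-the-envelope cost check for Opus 4.7's updated tokenizer.
# Assumption: the 1.35x worst-case inflation applies to both input and
# output tokens; the announcement does not say which side is affected.

INPUT_PRICE = 5.00 / 1_000_000    # $5 per million input tokens
OUTPUT_PRICE = 25.00 / 1_000_000  # $25 per million output tokens

def worst_case_cost(input_tokens: int, output_tokens: int,
                    multiplier: float = 1.35) -> float:
    """Dollar cost if the tokenizer inflates token counts by `multiplier`."""
    return (input_tokens * multiplier * INPUT_PRICE
            + output_tokens * multiplier * OUTPUT_PRICE)

# A 100k-token prompt with a 10k-token reply costs $0.75 at the old
# token counts, but up to ~$1.01 if every token hits the 1.35x ceiling.
print(f"${worst_case_cost(100_000, 10_000):.4f}")  # → $1.0125
```

In other words, unchanged per-token prices can still mean a real bill increase on tokenizer-sensitive inputs, which is worth checking before re-pointing high-volume pipelines at the new model.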
Performance highlights:
- GDPVal-AA knowledge work: Elo score of 1753, beating GPT-5.4 (1674) and Gemini 3.1 Pro (1314)
- SWE-bench Pro agentic coding: 64.3% task resolution vs 53.4% for Opus 4.6
- GPQA Diamond graduate-level reasoning: 94.2%
- arXiv visual reasoning with tools: 91.0% vs 84.7% for Opus 4.6
- XBOW visual-acuity tests: 98.5% vs 54.5% previously
- BigLaw Bench: 90.9%
- CyberGym vulnerability reproduction: 73.1%
- GPT-5.4 still leads in agentic search (89.3% vs 79.3%), multilingual Q&A, and raw terminal-based coding
How it works differently:
- Devises own verification steps before reporting task complete
- Internal test example: built a Rust-based text-to-speech engine from scratch, then independently fed generated audio through a separate speech recognizer to verify output against a Python reference
- Follows instructions literally, requiring legacy prompt libraries to be re-tuned
- Where older models "read between the lines," Opus 4.7 executes the exact text provided
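The self-verification behavior described above can be pictured as a generate-then-check loop. A toy sketch of that pattern (all function names here are hypothetical; Anthropic has not published the model's actual mechanism):

```python
# Minimal sketch of a generate-then-verify loop, illustrating the pattern
# of checking work through an independent channel before reporting done.
# Names and structure are illustrative, not Anthropic's implementation.

def run_with_verification(task, generate, verify, max_attempts=3):
    """Generate a result, then run an independent check before returning it."""
    feedback = None
    for _ in range(max_attempts):
        result = generate(task, feedback)
        ok, feedback = verify(result)
        if ok:
            return result
    raise RuntimeError(f"verification still failing: {feedback}")

# Toy demo: the "generator" only produces the right answer after it has
# seen the verifier's feedback, mimicking a self-correcting agent.
def toy_generate(task, feedback):
    return "4" if feedback else "5"

def toy_verify(result):
    return (result == "4", None if result == "4" else "expected 2 + 2 = 4")

print(run_with_verification("compute 2 + 2", toy_generate, toy_verify))  # → 4
```

The key design choice mirrored from the TTS example above is that the verifier is a separate path from the generator (there, a speech recognizer checking generated audio), so the same blind spot is unlikely to hit both steps.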
Enterprise feedback:
- Intuit VP of Technology: the ability to "catch its own logical faults during planning phase" is a game-changer for velocity
- Replit President: higher quality at lower cost for log analysis and bug hunting, "feels like a better coworker"
- Cognition CEO: can work coherently "for hours" and pushes through difficult problems that cause models to stall
- Notion AI Lead: 14% improvement in multi-step workflows, 66% reduction in tool-calling errors, feels like "true teammate"
- Factory Droids: carries work through to validation steps rather than "stopping halfway"
- Harvey, Head of Applied Research: "noticeably smarter handling of ambiguous document editing tasks"
Cybersecurity controls:
- Systems designed to detect and block requests suggesting high-risk cyberattacks like automated vulnerability exploitation
- New Cyber Verification Program allows vulnerability researchers, penetration testers, and red-teamers to apply for access for defensive purposes
- More powerful Mythos model remains restricted to a small number of external enterprise partners for cybersecurity testing through Project Glasswing
Why it matters:
The razor-thin margins between Opus 4.7, GPT-5.4, and Gemini 3.1 Pro signal that the frontier model race has plateaued, with improvements measured in single percentage points, forcing companies to compete on reliability rather than raw intelligence.
Self-verification addresses the hallucination problem that kept enterprises from trusting AI for autonomous work, shifting AI from a creative assistant to a tool that can run unsupervised.