It's been another big week in AI, with just about every major vendor having something to announce. Anthropic launched Claude Code Security and new plugins for Claude Cowork geared toward enterprise users, Google gave us its Gemini 3.1 Pro model, and OpenAI described the AI coding benchmark SWE-bench Verified as “contaminated.”
On This Week in AI, my colleague Ben O'Mahony and I discussed the implications of the week's big stories and tried to unpack the context to help the world make sense of the dizzying pace of change.
We were also lucky enough to be joined by fellow Thoughtworkers Shodhan Sheth and Alessio Ferri. Shodhan and Alessio gave us their perspective on the Anthropic COBOL modernization story that hit IBM’s stock value hard this week.
The pair have been exploring how AI can be used on legacy systems for some time now, which means they're perfectly placed to offer their insight on what Anthropic gets right and the nuances the company’s blog post misses.
Here are links to the stories we discussed this week:
Perspectives on this week's big AI stories
Model releases and benchmarks
- Google's Gemini 3.1 Pro model was released, showing frontier-level performance on reasoning benchmarks (like Humanity's Last Exam and ARC-AGI-2) and strong results on coding benchmarks (like SWE-bench Verified).
- Model naming conventions. There's no official methodology for determining minor vs. major version numbers. It's usually at the vendor's discretion, with a minor number often signaling incremental capability gains and a major number a new generation of large-scale model training.
- Benchmark contamination. The common coding benchmark, SWE-bench Verified, is said (by OpenAI) to be reaching saturation and facing contamination issues because the tasks are too specific and have likely been incorporated into training data. Attention is shifting to a newer benchmark, SWE-Bench Pro (a Scale AI initiative), for evaluating frontier models on coding.
Anthropic news
- Claude Code Security. Anthropic released a tool that helps find and fix security vulnerabilities, potentially patching them when connected with Claude Code. The announcement had a significant market impact on security startups.
- Cowork enterprise plugins. New plugins were released for Claude Cowork targeting specific vertical use cases such as financial analysis, investment banking, equity research, private equity and wealth management. This underlines Anthropic's strategy of being an AI company for the enterprise.
- Distillation attacks. Anthropic published a report on identifying and preventing distillation attacks, where fraudulent accounts try to use Claude model answers to train their own smaller models.
- Responsible scaling policy update. The company released version three of its responsible scaling policy, aiming to clarify which actions Anthropic commits to regardless of what competitors do, while also identifying actions that require collaboration with governments or the broader AI field.
OpenAI updates
- Frontier alliances. OpenAI introduced new partnerships to create an alliance network that helps companies with the complex transformation AI adoption requires, including enabling their people.
- AI alignment investment. OpenAI announced a $7.5 million grant to the Alignment Project, which is run by the UK's AI Security Institute (AISI), to support research in AI safety, security and bias mitigation.
AI and mainframe modernization
Shodhan and Alessio discussed Anthropic's announcement about using Claude Code for COBOL modernization:
- Generative AI impact. Using generative AI for legacy modernization isn't a new idea, but improving model output quality is increasing its impact.
- Modernization complexity. Mainframe modernization is challenging because it involves more than just the COBOL language; it requires addressing data, people, processes, dependencies and contracts with external boundaries.
- Human-machine combination. The most successful results in this space rely on the combination of human expertise and machine tools.
- Engineering rigor. For enterprise-level production, significant engineering rigor is needed to complement LLMs, focusing on providing accurate context, reducing hallucinations and managing token usage (especially with large codebases).
- Finding the starting point. A major challenge is determining the first small piece of a large legacy system to modernize incrementally; this involves both science and art.
- The goal of modernization. Modernization is defined as balancing faithfulness to the existing system against the intentional changes made as a new system is developed.