This week delivered a flurry of news, from dramatic geopolitical developments in AI to groundbreaking research on how models 'think'. We discussed as much as we possibly could in about an hour on the latest This Week in AI livestream. Watch below, or scroll further for my summary...
The race for performance at lower costs is accelerating. OpenAI unveiled GPT-5.3 Instant, a new model version designed to reduce "over-caveating" so answers are more direct; it also shows promise for stronger writing and diagramming, including UML/Mermaid-style diagrams. On the image generation front, NanoBanana 2 was released as a faster, cheaper model offering Pro-quality images at a fraction of the cost.
The general trend, exemplified by models like MiniMax 2.5 and Qwen Coder, is that frontier-level performance is becoming more accessible. We're actively evaluating these models for use in our AI-works offering, particularly for writing software and building at scale.
Geopolitics: The Department of War and AI giants
The ethics of AI in government use came to a head this week in the standoff between the Department of War and Anthropic. Anthropic's negotiations with the DOW stalled over two red lines: autonomous weapons and domestic surveillance of American citizens. The DOW responded by mandating that government agencies stop using Anthropic's models and designating the company a supply chain risk. Anthropic's Dario Amodei, however, noted in an interview that they had not received any legal paperwork regarding the designation and would litigate if it arrived.
Coincidentally, on the same evening as the DOW's announcement regarding Anthropic, OpenAI announced it had closed a deal with the DOW to use its models. Many viewed the timing as opportunistic, sparking a "quit GPT" movement among some users. OpenAI posted its contract, which included clear clauses prohibiting the AI system from being used to independently direct autonomous weapons or to conduct unconstrained monitoring of U.S. persons. Sam Altman reportedly regretted the announcement's timing.
Record funding and the 'many models' future
The largest AI funding round to date, a massive $110 billion, was announced. The round involved SoftBank, Nvidia and Amazon, establishing a strategic partnership for next-generation compute and downstream inference. The timing aligns with expectations that new hardware will be announced at Nvidia's main conference, GTC.
Ben argued the market will support "many models," noting that combining intelligences is often superior and that different models excel at different tasks. Open-source models, along with companies like Mistral, Cohere and the Chinese labs, are increasingly competitive with the frontier.
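One way to picture the "many models" pattern is a router that sends each task to the model best suited for it, plus a simple majority-vote ensemble for combining answers. A minimal sketch; the model names and task taxonomy here are invented for illustration, not real products or APIs:

```python
from collections import Counter

# Hypothetical task-to-model routing table (all names are illustrative).
ROUTES = {
    "summarize": "fast-cheap-model",       # low stakes: optimize for cost
    "code": "coding-model",                # specialist model for software
    "legal-review": "cautious-frontier-model",  # high stakes: optimize for care
}

def route(task_type):
    """Pick a model for a task, falling back to a general-purpose one."""
    return ROUTES.get(task_type, "general-model")

def ensemble(answers):
    """Combine several models' answers by majority vote."""
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

The point of the sketch is the shape of the system, not the specifics: cheap models handle routine work, specialists handle their niche, and disagreement between models becomes a signal rather than a problem.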
Research spotlight: LLM metacognition
A study was discussed that used the Eleusis card game — a proxy for the "game of science" — to test model metacognition and reasoning. In the game, models must hypothesize a hidden rule, refine their hypothesis based on feedback, and, critically, decide when to "commit to a guess".
Key findings included:
Recklessness Index: Most LLMs were "over-eager and reckless," guessing prematurely. GPT-5.2 Pro was the exception, acting as a "cautious scientist" by holding the correct rule for an average of 4.62 turns before making a guess.
Occam's Razor: Models often produced overly complicated tentative rules, violating the principle of parsimony.
Behavior Profiles: Models were categorized as "cautious," "bold," or "balanced" scientists. This suggests model preference should depend on the use case: for example, a bold model for low-stakes summarization and a cautious model for high-stakes decisions.
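The commit-or-keep-testing dynamic behind these profiles can be sketched as a toy loop. This is a minimal illustration, not the study's actual setup: the hidden rule, the hypothesis pool, the card sequence and the `patience` threshold are all invented for the example.

```python
from itertools import cycle

def hidden_rule(card):
    """The secret rule the 'scientist' must discover: the card is even."""
    return card % 2 == 0

# A small pool of candidate hypotheses (illustrative, not from the paper).
HYPOTHESES = [
    ("card > 5", lambda c: c > 5),
    ("card is odd", lambda c: c % 2 == 1),
    ("card is even", lambda c: c % 2 == 0),
]

def play(patience, max_turns=100):
    """Return (turn_committed, hypothesis_name).

    A higher `patience` models a 'cautious scientist': the agent commits
    only after its hypothesis survives that many consecutive observations.
    patience=1 models a reckless guesser.
    """
    deck = cycle([3, 6, 2, 7, 4, 8, 1, 10, 5, 9])  # fixed toy card sequence
    history, streak = [], 0
    candidates = list(HYPOTHESES)
    for turn in range(1, max_turns + 1):
        card = next(deck)
        history.append((card, hidden_rule(card)))
        if candidates[0][1](card) == history[-1][1]:
            streak += 1
        else:
            # Current hypothesis falsified: keep only hypotheses that
            # explain *every* observation so far, then start over.
            candidates = [(n, f) for n, f in candidates
                          if all(f(c) == ok for c, ok in history)]
            streak = 0
        if streak >= patience:
            return turn, candidates[0][0]
    return None
```

With `patience=1` the agent commits on turn one to "card > 5", the first rule that happens to fit — the "reckless" profile. With `patience=3` it survives a falsification, prunes to the parsimonious true rule and commits to "card is even" a few turns later.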
Final thought: Disposable software?
A comment about disposable software at the meeting that assembles the Thoughtworks Technology Radar sparked a discussion of "sacrificial architecture". Ben and I agreed that while some software has always been disposable (e.g., internal tools), the modern debate centers on balancing safe, small increments for production code against fast, discardable experimentation using AI.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.