Volume 34 | April 2026

Techniques

Adopt

  • 1. Context engineering

    Context engineering has evolved from an optimization tactic into a foundational architectural concern for modern AI systems. Unlike prompt engineering, which focuses on wording, context engineering treats the context window as a design surface and intentionally constructs the AI’s information environment.

    As agents tackle more complex tasks, dumping raw data into large context windows leads to "context rot" and degraded reasoning. To combat this, teams are shifting from static, monolithic prompts to progressive context disclosure. Instead of front-loading every instruction and reference an agent might need, these systems start with a lightweight index of what's available. The agent determines what prompts or contexts are relevant and pulls in only what’s needed, keeping the signal-to-noise ratio sharp at every step.

    We’re seeing several techniques mature in this space: Context setup leverages prompt caching to front-load static instructions, reducing costs and improving time to first token. Dynamic retrieval goes beyond basic RAG by selecting tools and loading only the necessary MCP servers, avoiding unnecessary context expansion. Context graphs model institutional reasoning — such as policies, exceptions and precedents — as structured, queryable data. Context management techniques use stateful compression and sub-agents to summarize intermediate outputs in long-running workflows.

    Treating AI context as a static text box is a fast track to hallucinations. To build resilient enterprise agents, teams must engineer context as a dynamic, precisely managed pipeline.

  • 2. Curated shared instructions for software teams

    As teams mature in their use of AI, relying on individual developers to write prompts from scratch is emerging as an anti-pattern. We advocate for curated shared instructions for software teams, treating AI guidance as a collaborative engineering asset rather than a personal workflow.

    Initially, this practice focused on maintaining general-purpose prompt libraries for common tasks. We’re now seeing a more effective evolution specifically for coding environments: anchoring these instructions directly into service templates. By placing instruction files such as CLAUDE.md, AGENTS.md or .cursorrules into the baseline repository used to scaffold new services, the template becomes a powerful distribution mechanism for AI guidance.

    During our Radar discussions, we also explored a related practice: anchoring coding agents to a reference application. Here, a live, compilable codebase serves as the source of truth. As architecture and coding standards evolve, both the reference application and embedded instructions can be updated. New repositories then inherit the latest agent workflows and rules by default. This approach ensures consistent, high-quality AI assistance is built into every project from day one, while separating general prompt libraries from repository-specific AI configuration.

  • 3. DORA metrics

    The metrics defined by the DORA research program have been widely adopted and have proven to be strong leading indicators of how a delivery organization is performing. These include change lead time, deployment frequency, mean time to restore (MTTR), change failure rate and a newer fifth metric, rework rate. Rework rate is a stability metric that measures how much of a team's delivery pipeline is consumed by unplanned rework to fix work previously considered complete, such as user-facing bugs or defects.

    In the era of AI-assisted software development, the DORA metrics are more important than ever. Measuring productivity by lines of code generated by AI is misleading; real improvement must be reflected in delivery flow and stability. If lead times don’t decrease and deployment frequency doesn't increase, faster code generation doesn’t translate into better outcomes. Conversely, degradation in stability metrics — particularly rework rate — provides an early warning sign of blind spots, technical debt and the risks of unchecked AI-assisted development.

    As always, we recommend using these metrics for team reflection and learning rather than just building complex dashboards. Simple mechanisms, such as check-ins during retrospectives, are often more effective than overly detailed tracking tools at improving capabilities.

  • 4. Passkeys

    Shepherded by the FIDO Alliance and backed by Apple, Google and Microsoft, passkeys have matured into Adopt. They are FIDO2 credentials that can replace passwords using public-key cryptography. The private key is stored in a hardware-backed secure enclave on the user's device, protected by biometrics or a PIN, and never leaves it. Each credential is origin-bound to its relying-party domain, making passkeys structurally phishing-resistant: a lookalike site receives nothing, unlike SMS OTP or TOTP codes that a phishing proxy can intercept.

    With phishing responsible for more than one third of all data breaches, this structural resistance is increasingly important. The FIDO Alliance Passkey Index 2025 reports there are over 15 billion eligible accounts globally, Google reports a 30% improvement in sign-in success rates across 800 million users and Amazon has seen sign-ins six times faster than using traditional methods. NIST SP 800-63-4 (July 2025) now classifies synced passkeys as AAL2-compliant, reversing earlier guidance, and regulators in the UAE, India and US federal agencies mandate phishing-resistant authentication for financial services and government systems.

    The FIDO Credential Exchange Protocol enables secure portability of passkeys between credential managers, addressing earlier vendor lock-in concerns. Major identity providers including Auth0, Okta and Azure AD now support passkeys as a first-class feature, and implementation has been simplified from a multi-month effort to a two-sprint project. We’ve adopted passkeys internally and treat them as the default starting point for new authentication implementations. Teams should design account recovery carefully and avoid phishable fallback paths such as SMS OTP, which reintroduce the vulnerabilities passkeys eliminate. Device-bound credentials on hardware security keys remain necessary for AAL3 scenarios such as privileged access.

  • 5. Structured output from LLMs

    Structured output from LLMs is the practice of constraining models to produce responses in a predefined format, such as JSON or a specific programming language class. We continue to see this technique deliver reliable results in production, and we now consider it a sensible default for applications that consume LLM responses programmatically. All major model providers now offer native structured output modes, but implementations differ in the JSON Schema subsets they support, and these APIs continue to evolve rapidly. We recommend using a library such as Instructor or a framework like Pydantic AI to provide a stable abstraction across providers with validation and automatic retries, or Outlines for constrained generation on self-hosted models.
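
    As a minimal illustration of the pattern (not the Instructor or Pydantic AI API), here's a stdlib-only Python sketch: a hypothetical `fake_llm` stands in for a model call, and any validation failure is fed back to the model as a retry prompt.

```python
import json
from dataclasses import dataclass

@dataclass
class Invoice:
    vendor: str
    total_cents: int

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; the first response is deliberately
    # malformed to exercise the retry path.
    fake_llm.calls += 1
    if fake_llm.calls == 1:
        return "Sure! The invoice is from Acme for $12.50."
    return '{"vendor": "Acme", "total_cents": 1250}'
fake_llm.calls = 0

def extract_invoice(text: str, max_retries: int = 2) -> Invoice:
    prompt = f"Return only JSON with keys vendor (str) and total_cents (int): {text}"
    for _ in range(max_retries + 1):
        raw = fake_llm(prompt)
        try:
            data = json.loads(raw)
            return Invoice(vendor=str(data["vendor"]),
                           total_cents=int(data["total_cents"]))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError) as err:
            # Feed the validation error back so the model can self-correct.
            prompt = f"Previous output was invalid ({err}). Respond with JSON only."
    raise ValueError("no valid structured output after retries")

invoice = extract_invoice("Invoice from Acme, $12.50")
```

    Libraries like Instructor implement essentially this validate-and-retry loop, with schema-aware prompting and provider-native structured output modes underneath.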

  • 6. Zero trust architecture

    As we enter the age of agents, many enterprises are grappling with how to build them while addressing the security risks of granting autonomy to unpredictable systems. Zero trust architecture (ZTA) remains a sensible default for securely building and operating agents. Principles such as "never trust, always verify," along with identity-based security and least-privilege access, should be treated as foundational for any agent deployment. Our teams are applying standards like SPIFFE to agents, establishing strong identity foundations and enabling fine-grained authentication in dynamic environments. Continuous monitoring and verification of agent behavior are also critical for proactively managing threats. Beyond agent deployments, our teams are adopting practices such as OIDC impersonation in GCP for different applications, including CI/CD pipelines, replacing long-lived static keys with short-lived tokens issued after identity verification. We recommend teams treat ZTA principles as non-negotiable defaults, regardless of the system being built.

Trial

  • 7. Agent Skills

    As AI agents evolve from simple chat interfaces toward autonomous task execution, context engineering has become a critical challenge. Agent Skills provide an open standard for modularizing context by packaging instructions, executable scripts and associated resources such as documentation. Agents load skills only when needed based on their descriptions, which reduces token consumption and mitigates context window exhaustion and problems such as agent instruction bloat.

    Skills have been adopted very quickly, not only in coding agents but also in personal assistants such as OpenClaw. They’re also one reason teams are becoming more cautious about defaulting to MCP, as many use cases can be addressed just as effectively by pointing an agent at a local CLI or script.

    As their popularity has grown, the surrounding ecosystem has expanded as well. Plugin marketplaces are emerging as a way to version and share skills, and multiple efforts are exploring how to evaluate skill effectiveness. We do, however, caution against unreviewed reuse of third-party skills, as they introduce serious supply chain security risks.

  • 8. Browser-based component testing

    In the past, when discussing component testing, we’ve generally advised against browser-based tools. They were difficult to set up, slow to run and often flaky. This has improved significantly. Today, browser-based component testing, using tools such as Playwright, is a viable and often preferable approach. Running tests in a real browser provides more consistency, as the test matches the environment where the code actually executes. The performance hit is now small enough that the trade-off is worthwhile. Flakiness has also decreased, and we’re seeing more value from real-browser tests than from emulated environments such as jsdom.

  • 9. Feedback sensors for coding agents

    To make coding agents more effective and reduce the load on human reviewers, teams need feedback loops that agents can directly access. These feedback sensors for coding agents act as a form of feedback backpressure, increasing trust in generated results. Developers have long relied on deterministic quality gates such as compilers, linters, structural tests and test suites; here, they’re wired into agentic workflows so that failures trigger timely self-correction.

    These checks reduce routine steering work for the human in the loop. Teams can implement them in different ways, such as introducing a reviewer agent responsible for running checks and triggering corrections, or by exposing the checks through a companion process, running in parallel, that agents can query efficiently. Coding agents also make it cheaper to build custom linters and structural tests, further strengthening these feedback loops. Whenever possible, these sensors should run during the coding session and report clean results before a commit is made, rather than relying on post-commit checks.
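
    A minimal sketch of the loop, with hypothetical `run_checks` and `agent_fix` functions standing in for real compilers, linters, test suites and the agent's correction step: checks run after each edit, and failures loop back until the result is clean.

```python
def run_checks(code: str) -> list[str]:
    # Hypothetical deterministic gates; in practice these shell out to a
    # compiler, linter, structural tests or the test suite.
    failures = []
    if "TODO" in code:
        failures.append("lint: unresolved TODO")
    if "def " not in code:
        failures.append("structure: no function defined")
    return failures

def agent_fix(code: str, failures: list[str]) -> str:
    # Stand-in for the coding agent's self-correction step.
    if "lint: unresolved TODO" in failures:
        code = code.replace("TODO", "resolved")
    if "structure: no function defined" in failures:
        code = "def main():\n    pass\n" + code
    return code

def code_until_clean(code: str, max_rounds: int = 5) -> str:
    for _ in range(max_rounds):
        failures = run_checks(code)
        if not failures:
            return code  # report clean results before any commit is made
        code = agent_fix(code, failures)
    raise RuntimeError("checks still failing after max rounds")
```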

  • 10. Mapping code smells to refactoring techniques

    Mapping code smells to refactoring techniques means instructing an agent to handle specific issues with a defined approach. The first layer typically points the agent to a generic reference, such as Refactoring, for common cases. For more specialized issues, teams can map unique smells to specific techniques using Agent Skills, slash commands or AGENTS.md. When integrated with linting tools, this creates deterministic feedback by triggering the appropriate refactoring approach whenever a smell is detected.

    This is particularly effective for legacy stacks like .NET Framework 2.0 or Java 8, where generic training data often falls short. It’s also useful for teams with distinctive engineering standards. Without these targeted instructions, an agent will tend to default to generic patterns rather than follow specific requirements.
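
    One way to sketch such a mapping, with hypothetical smell names and instructions; a real setup would wire this to linter output and distribute the instructions via Agent Skills, slash commands or AGENTS.md.

```python
# Hypothetical mapping from linter-detected smells to the refactoring
# instruction the coding agent should apply.
SMELL_TO_REFACTORING = {
    "long-method": "Apply Extract Method; keep each function under 20 lines.",
    "feature-envy": "Apply Move Method toward the data it uses most.",
    "magic-number": "Apply Replace Magic Number with Named Constant.",
}

def build_agent_instruction(detected_smells: list[str]) -> str:
    """Turn linter findings into a deterministic prompt fragment."""
    known = [s for s in detected_smells if s in SMELL_TO_REFACTORING]
    lines = [f"- {s}: {SMELL_TO_REFACTORING[s]}" for s in known]
    return "Refactor using these mapped techniques:\n" + "\n".join(lines)
```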

  • 11. Mutation testing

    Mutation testing remains the most honest signal for evaluating the real fault-detection capability of a test suite. Unlike traditional code coverage, which only tracks line execution, this technique introduces deliberate bugs, or mutations, into source code to verify that tests fail when behavior breaks. If a mutation goes undetected, it reveals a gap in validation rather than just a lack of coverage. This distinction is critical in an era of AI-assisted development, where high coverage percentages can mask logically hollow tests or generated code that has never been meaningfully asserted.

    With AI-generated test cases now commonplace, mutation testing acts as a reinforcement layer for catching "perpetually green" tests — those that pass regardless of logic changes due to missing assertions or decoupled mocks. By using tools such as Stryker, Pitest or cargo-mutants, we shift the focus from how much code is executed to how much code is actually verified, particularly in core domain logic. The goal is to ensure that a passing test suite is a reliable signal of functional correctness, rather than simply a report of which lines were executed.
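
    The core mechanic can be illustrated in a few lines of Python; real tools such as Stryker, Pitest or cargo-mutants do this at scale with many mutation operators. Here a single operator flips `+` to `-`, and a mutant counts as killed only if the suite fails on it.

```python
import ast

# Code under test, kept as a string so we can compile mutants of it.
FUNC_SRC = "def total(a, b):\n    return a + b\n"

def suite_passes(fn) -> bool:
    # The test suite: a meaningful assertion pins behavior.
    return fn(2, 3) == 5

class FlipAdd(ast.NodeTransformer):
    """Mutation operator: turn every `+` into `-`."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

def mutant_killed(src: str, name: str, suite) -> bool:
    tree = FlipAdd().visit(ast.parse(src))
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return not suite(namespace[name])  # killed = suite fails on the mutant
```

    A "perpetually green" suite, one with no real assertions, lets every mutant survive, which is exactly the gap this technique exposes.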

  • 12. Progressive context disclosure

    Progressive context disclosure is a technique within the practice of context engineering. Instead of overwhelming an agent with instructions upfront, you give it a lightweight discovery phase in which it selects what it needs based on the user’s prompt, loading detailed information into the context window only when it becomes relevant.

    This works great for RAG scenarios, where an agent first identifies the relevant domain from user queries and then retrieves specific instructions and data accordingly. It’s also how many agentic coding tools handle Agent Skills by first determining which skills are relevant to a task before loading detailed instructions, rather than providing a single, monolithic instruction set filled with conditions and caveats. When building agentic systems, it’s easy to fall into the trap of bloating instructions with endless "DO" and "DO NOT" rules in an attempt to control behavior, which can ultimately degrade performance. Progressive context disclosure avoids this by ensuring the agent receives the right guidance at the right moment, keeping the context window lean and preventing context rot.
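
    A minimal sketch of the two-phase flow, with a naive keyword match standing in for the agent's relevance judgment: only skill names and one-line descriptions are in context upfront, and full instructions are pulled in on demand.

```python
# Lightweight index: only names and descriptions are loaded at session start.
SKILL_INDEX = {
    "refund-policy": "How to process customer refunds and exceptions",
    "shipping": "Carrier selection and delivery estimates",
}

# Detailed instructions stay out of context until a skill is selected.
FULL_INSTRUCTIONS = {
    "refund-policy": "Step 1: verify the purchase. Step 2: check the 30-day window.",
    "shipping": "Step 1: look up the carrier SLA.",
}

def build_context(user_query: str) -> str:
    # Naive keyword overlap stands in for an LLM-driven selection step.
    words = [w for w in user_query.lower().split() if len(w) > 3]
    relevant = [name for name, desc in SKILL_INDEX.items()
                if any(w in desc.lower() for w in words)]
    return "\n\n".join(FULL_INSTRUCTIONS[name] for name in relevant)
```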

  • 13. Sandboxed execution for coding agents

    Sandboxed execution for coding agents is the practice of running agents inside isolated environments with restricted file system access, controlled network connectivity and bounded resource usage. As coding agents gain autonomy to execute code, run builds and interact with the file system, giving agents unrestricted access to a development environment introduces real risks, from accidental damage to credential exposure. We see sandboxing as a sensible default rather than an optional enhancement.

    The landscape of sandboxing options now spans a broad spectrum. At one end, many coding agents offer built-in sandbox modes, and Dev Containers provide familiar container-based isolation. At the other, purpose-built tools take different positions on the ephemeral versus persistent trade-off. Shuru boots disposable microVMs that reset on every run, while Sprites provides stateful environments with checkpoint and restore. For Linux-native isolation, Bubblewrap offers lightweight namespace-based sandboxing, and on macOS, sandbox-exec provides similar protection.

    Beyond basic isolation, teams should consider the practical requirements of a productive sandbox. This includes everything needed for building and testing, as well as secure, straightforward authentication with services like GitHub and model providers. Developers need port forwarding and sufficient CPU and memory for agent workloads. Whether the sandbox should be ephemeral by default or persistent for session recovery is a design decision that will depend on a team's priorities for security, cost and workflow continuity.
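
    As an illustration of limiting blast radius rather than a substitute for real isolation (Bubblewrap, sandbox-exec, microVMs), here's a Python sketch that runs agent-generated code in a scratch directory with a stripped environment and a hard timeout.

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: int = 5) -> subprocess.CompletedProcess:
    # A scratch working directory, a stripped environment and a hard
    # timeout. This only reduces blast radius; real isolation needs
    # namespaces or microVMs.
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated Python mode
            cwd=workdir,                         # no accidental repo access
            env={"PATH": os.defpath},            # no inherited secrets
            capture_output=True,
            text=True,
            timeout=timeout,
        )
```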

  • 14. Semantic layer

    Semantic layer is a data architecture technique that introduces a shared business logic layer between raw data stores and consuming applications, including business intelligence (BI) tools, AI agents and APIs. It centralizes metric definitions, joins, access rules and business terminology so consumers have shared definitions. The concept predates the modern data stack but has seen renewed interest with code-first approaches such as metrics stores.

    Without a semantic layer, business logic scatters across ad-hoc warehouse tables, dashboards and downstream applications, while metric definitions quietly diverge — particularly problematic when used to support business decisions. Our teams have seen this become more acute with agentic AI: using LLMs to perform naive text-to-SQL translations will frequently produce incorrect results, especially when business rules, such as revenue recognition, live outside the schema. Cloud platforms are now embedding semantic layers directly: Snowflake calls it Semantic Views and Databricks calls it Metric Views. Standalone tools such as dbt MetricFlow and Cube provide a portable layer across systems. The recent release of Open Semantic Interchange (OSI) v1.0, backed by multiple vendors, signals growing standardization and interoperability across analytics, AI and BI platforms.

    The main cost is upfront data modeling investment. Teams should start with a single domain rather than attempting an enterprise-wide rollout, as broad deployments often leave legacy reports running in parallel with the new layer, reintroducing inconsistent definitions.
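
    A toy code-first sketch of the idea, with hypothetical metric and table names: the metric definition lives in one place, and every consumer, whether a BI tool, an agent or an API, compiles its query from it rather than re-deriving the logic.

```python
# Single source of truth for metric logic.
METRICS = {
    "net_revenue": {
        "table": "orders",
        "expression": "SUM(amount_cents) - SUM(refund_cents)",
        "description": "Recognized revenue after refunds, in cents",
    },
}

def compile_metric(name: str, group_by: str = "") -> str:
    """Compile a consistent SQL query from the shared definition."""
    metric = METRICS[name]
    select = f"{metric['expression']} AS {name}"
    if group_by:
        return (f"SELECT {group_by}, {select} "
                f"FROM {metric['table']} GROUP BY {group_by}")
    return f"SELECT {select} FROM {metric['table']}"
```

    A text-to-SQL agent that calls `compile_metric` instead of writing raw aggregations can no longer silently invent its own definition of revenue.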

  • 15. Server-driven UI

    Server-driven UI is returning to the Trial ring as we see more teams successfully shortening their path to production. By separating rendering into a generic container while providing structure and data via the server, mobile teams can bypass lengthy app store review cycles for every iteration. We’ve seen this significantly improve time to market, with JSON-based formats enabling real-time updates. While we previously cautioned against the "horrendous, overly configurable messes" that proprietary frameworks can create, the emergence of more stable patterns from companies such as Airbnb and Lyft has helped reduce complexity.

    Our experience shows that the substantial investment required for a proprietary framework is now easier to justify for large-scale applications. However, it still requires a strong business case and disciplined engineering to avoid creating a "god-protocol" that becomes difficult to maintain. For teams looking to reduce client-side complexity, this approach provides a powerful way to scale across teams and synchronize logic across platforms. We recommend applying it to highly dynamic areas of an application rather than as a blanket replacement for all UI development.
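
    The core mechanic can be sketched simply, using a hypothetical JSON payload format: the server owns structure and data, while the client ships only generic component renderers and degrades gracefully when it meets a component type it doesn't know.

```python
# Hypothetical server payload: the server owns structure and data.
PAYLOAD = {
    "type": "column",
    "children": [
        {"type": "text", "value": "Order #1234"},
        {"type": "button", "label": "Track shipment", "action": "open:/track/1234"},
    ],
}

# The client ships only generic renderers; new screens need no app release.
RENDERERS = {
    "text": lambda node: node["value"],
    "button": lambda node: f"[{node['label']}]",
    "column": lambda node: "\n".join(render(child) for child in node["children"]),
}

def render(node: dict) -> str:
    renderer = RENDERERS.get(node["type"])
    if renderer is None:
        return ""  # unknown components degrade gracefully on old clients
    return renderer(node)
```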

Assess

  • 16. Agentic reinforcement learning environments

    Agentic reinforcement learning environments provide a training ground for LLM-based agents, combining the context, tools and feedback to complete multi-step tasks. This approach reframes post-training of LLMs from simple single-turn outputs to agentic behaviors such as reasoning and tool use, with rewards or penalties assigned to each action. Techniques such as RLVR help ensure these rewards are verifiable and resistant to gaming.

    AI research labs are currently driving the development of these environments, particularly for coding and computer-use agents. One example outside of the frontier labs is Cursor's Composer, a specialized coding model trained within their product environment. Organizations building agentic systems should consider whether creating reinforcement learning environments could help train more capable and domain-specific models.

    Setting up the required infrastructure can be complex. However, frameworks and platforms are emerging to simplify the process, including Prime Intellect's environments hub, Agent Lightning and NVIDIA NeMo Gym. We recommend exploring this approach where it can deliver more capable and cost-effective models for your domain.

  • 17. Architecture drift reduction with LLMs

    Increased use of AI coding agents can accelerate drift from the intended codebase and architecture designs. Left unchecked, this drift compounds as agents and humans replicate existing patterns, including degraded ones, creating a feedback loop where poor code begets poorer code. Some of our teams are now addressing architecture drift reduction with LLMs.

    This approach combines deterministic analysis tools (such as Spectral, ArchUnit or Spring Modulith) with LLM-powered evaluation to detect both structural and semantic violations. LLMs are then used to help fix these issues. Our teams have applied this to enforce API quality guidelines across services and to define architectural zones that guide agent-generated improvements.

    A few lessons learned: as with traditional linting, initial scans can surface a large number of violations that require triage and prioritization, and LLMs can assist with this process. Keeping agent-produced fixes small and focused makes review easier, and an additional verification loop is essential to confirm changes improve the system rather than introduce regressions.

    This technique extends the idea of feedback sensors for coding agents into the later stages of the delivery lifecycle. As one team at OpenAI describes it, drift reduction acts as a form of "garbage collection," reflecting the reality that entropy and decay emerge even in systems with strong early feedback loops.

  • 18. Code intelligence as agentic tooling

    LLMs process code as a stream of tokens; they have no native understanding of call graphs, type hierarchies or symbol relationships. For code navigation, most coding agents today default to text-based search, the most powerful common denominator across all languages. For refactorings that are a quick shortcut in an IDE, agents need to generate multiple textual diffs. As a result, agents end up spending significant tokens reconstructing information that already exists in the abstract syntax tree (AST).

    Code intelligence as agentic tooling gives agents access to tools that are aware of the AST, e.g. via the Language Server Protocol (LSP). Through these integrations, agents can perform operations such as "find all references to this symbol" or "rename this type everywhere" as first-class actions rather than relying on fragile text substitutions. Codemod tools such as OpenRewrite, which operate on the even richer Lossless Semantic Tree (LST) representation of the code, are another powerful code intelligence integration. The result is fewer hallucinated edits and lower token consumption by delegating appropriate tasks to deterministic tools.

    Claude Code, OpenCode and others integrate with locally running LSP servers; JetBrains provides an MCP server that exposes IDE navigation and refactoring capabilities to external agents, while the Serena MCP server offers semantic code retrieval and editing.

  • 19. Context graph

    A context graph is a knowledge representation technique where decisions, policies, exceptions, precedents, evidence and outcomes are modeled as first-class connected nodes in a graph, structured for AI consumption. Where systems of record capture what happened, a context graph captures why, turning institutional reasoning buried in Slack threads, approval chains and people's heads into a queryable, machine-readable structure. This is vital for agent effectiveness; an agent handling a discount exception, for example, cannot determine from the record alone whether it reflects standing policy or a one-time override and may reason incorrectly. A context graph can directly surface that provenance, enabling agents to traverse decision traces, apply relevant precedents and reason across multi-hop causal chains.

    Unlike GraphRAG, which builds from static document corpora, a context graph maintains temporal validity on every edge, so superseded facts are invalidated rather than overwritten. Context graphs are worth assessing for agentic applications that require persistent memory across sessions or traceable decision reasoning.
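
    A minimal sketch of temporal validity on edges, with hypothetical fact names: asserting a new fact closes the old edge's validity window instead of overwriting it, so the graph can still answer "what was true at time t?".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    valid_from: int
    valid_to: Optional[int] = None  # None = currently valid

class ContextGraph:
    def __init__(self):
        self.edges = []

    def assert_fact(self, subject, relation, obj, at):
        # Supersede, don't overwrite: close the old edge's validity window.
        for e in self.edges:
            if (e.subject, e.relation) == (subject, relation) and e.valid_to is None:
                e.valid_to = at
        self.edges.append(Edge(subject, relation, obj, at))

    def query(self, subject, relation, at):
        # Return the fact that was valid at time `at`, if any.
        for e in self.edges:
            if ((e.subject, e.relation) == (subject, relation)
                    and e.valid_from <= at
                    and (e.valid_to is None or at < e.valid_to)):
                return e.obj
        return None
```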

  • 20. Feedback flywheel

    Teams working with coding agents are increasingly adopting spec-driven development workflows. Whether they use a lightweight or opinionated framework, these workflows typically follow a similar flow of spec → plan → implement. The feedback flywheel extends this flow with an additional step focused on continuously improving the coding agent harness.

    The approach is similar to retrospectives: teams capture successes and failures during a coding agent session and use them to improve the predictability of future sessions, which compounds over time. It’s a meta-technique where a human on the loop focuses on improving the feedforward controls such as curated shared instructions as well as feedback sensors for coding agents. Our teams have found this effective, as it is analogous to code refactoring. The next level looks more like an agentic feedback flywheel, where, based on accumulated feedback, the agent decides what improvements are necessary. For now, however, teams still need a human-in-the-loop to avoid context rot and noisy feedback that could lead the agent astray.

    We suggest using this approach to evaluate the entire coding agent harness as your environment evolves, especially when adopting new models; what worked with one model may not be necessary with the next.

  • 21. HTML Tools

    Since agentic tools make it easy to build small, task-specific utilities, the main challenge is often how to deploy and share them. HTML Tools is an approach where a shareable script or utility is packaged as a single HTML file. You can run these directly in a browser, host them anywhere, or simply share the file. This approach avoids the overhead of distributing CLI tools, which require sharing binaries or using package managers. It’s also simpler than building a full web application with dedicated hosting. From a security perspective, running untrusted files still carries risk, but the browser sandbox and the ability to inspect source code provide some mitigation. For lightweight utilities, a single HTML file offers a highly accessible and portable way to share tools.

  • 22. LLM evaluation using semantic entropy

    Confabulation, a form of hallucination in LLM QA applications, is difficult to address with traditional evaluation methods. One approach uses information entropy as a measure of uncertainty by analyzing lexical variation in outputs for a given input. LLM evaluation using semantic entropy extends this idea by focusing on differences in meaning rather than surface-level variation.

    This approach evaluates meaning rather than word sequences, making it applicable across datasets and tasks without requiring prior knowledge. It generalizes well to unseen tasks, helping identify prompts likely to cause confabulations and encouraging caution when needed. Results show that naive entropy often fails to detect confabulations, while semantic entropy is more effective at filtering false claims.
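
    A toy illustration of the difference, where a hypothetical `meaning_cluster` function stands in for the entailment model or LLM judge that decides semantic equivalence in the real method: three differently worded answers look maximally uncertain to naive entropy but collapse to a single meaning under semantic entropy.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    # Shannon entropy over the label distribution.
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def meaning_cluster(answer: str) -> str:
    # Hypothetical stand-in: the real method uses bidirectional entailment
    # (an NLI model or LLM judge) to group answers by meaning.
    normalized = answer.lower().rstrip(".!")
    return "capital-is-paris" if "paris" in normalized else normalized

samples = ["Paris.", "The capital is Paris.", "It's Paris!"]
naive_entropy = entropy(samples)                                   # wordings differ
semantic_entropy = entropy([meaning_cluster(s) for s in samples])  # one meaning
```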

  • 23. Measuring collaboration quality with coding agents

    We’re seeing real productivity gains when using coding agents, but most evaluation metrics still focus too heavily on coding throughput, such as time to first output, lines of code generated and tasks completed. Measuring collaboration quality with coding agents helps teams avoid falling into "the speed trap" by shifting focus toward how effectively humans and agents work together. Metrics such as first-pass acceptance rate, iteration cycles per task, post-merge rework, failed builds and review burden provide more meaningful signals than speed alone. Teams using Claude Code can use the /insights command to generate reports reflecting on successes and challenges from agent sessions. Our teams have also experimented with tracking first-pass acceptance of a customized /review command.

    In practice, shorter feedback cycles and fewer failed builds indicate more effective interaction with coding agents. When teams find themselves in repeated back-and-forth with their agents, these metrics highlight opportunities to improve the feedback flywheel. We recommend tracking collaboration quality at the team level, rather than the individual level, alongside DORA metrics to build a more complete picture of coding agent adoption.

  • 24. MITRE ATLAS

    Agentic systems and coding tools introduce new architectures and emergent security threats. MITRE ATLAS is a knowledge base of adversarial tactics and techniques targeting AI and ML systems. More focused than the broader MITRE ATT&CK framework and designed to complement it, ATLAS provides a taxonomy of threats for ML pipelines, LLM applications and agentic systems.

    We've found that without a shared vocabulary, security risks are often overlooked or reduced to a checkbox exercise. This is where ATLAS can help. Grounded in research on real incidents and technology patterns, teams can use the framework to support threat modeling. Teams may also find it a natural complement to control frameworks such as SAIF, helping describe the evolving threat landscape for AI systems.

  • 25. Ralph loop

    The Ralph loop (also sometimes called the Wiggum loop) is an autonomous coding agent technique where a fixed prompt is fed to an agent in an infinite loop. Each iteration starts with a fresh context window: the agent selects a task from a specification or plan, implements it, and the loop restarts with a fresh context. The core insight is simplicity. Rather than orchestrating teams of coding agents or coding agent swarms, a single agent works autonomously against a specification, with the expectation that the codebase will converge toward the spec over repeated iterations. Using a fresh context window on each iteration avoids the quality degradation that comes from accumulated context, though at significant token cost. Tools such as goose have implemented the pattern, in some cases extending it with cross-model review between iterations.
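
    The control flow is simple enough to sketch in a few lines; here a stub `implement` callback stands in for invoking a coding agent with a fresh context window on each iteration, and the spec is reduced to a task list.

```python
# `spec` is the task list the codebase must converge on; `implement` stands
# in for a coding agent run that starts from a clean context every time.
def ralph_loop(spec: list[str], implement, max_iterations: int = 100) -> dict:
    codebase: dict = {}
    for _ in range(max_iterations):
        remaining = [task for task in spec if task not in codebase]
        if not remaining:
            break  # the codebase has converged on the spec
        task = remaining[0]  # fresh context: only the spec and repo state
        codebase[task] = implement(task)
    return codebase
```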

  • 26. Reverse engineering for design system

    Organizations often struggle with fragmented legacy interfaces where the "design standard" exists only as a loose collection of disjointed webpages, marketing materials and screenshots. Historically, auditing these artifacts to establish a unified foundation has been a manual and time-consuming process. With multimodal LLMs, this extraction can now be automated, effectively reverse-engineering design systems from existing visual assets.

    By feeding websites, screenshots and UI fragments into specialized tools or vision-capable AI models, teams can extract core design tokens — such as color palettes, typography scales and spacing rules — and identify recurring component patterns. The AI then synthesizes this unstructured visual data into a structured, semantic representation of a design system. When integrated with tools such as Figma, this output can significantly accelerate the creation of a formalized, maintainable component library.

    Beyond reducing effort in visual audits, this technique can serve as a stepping stone toward building "AI-ready" design systems. For enterprises burdened by brownfield design debt, using AI to establish a baseline design system is a practical starting point before a full redesign or front-end standardization.

  • 27. Role-based contextual isolation in RAG

    Role-based contextual isolation in RAG is an architectural technique that moves access control from the application layer down to the retrieval layer. Every data chunk is tagged with role-based permissions at indexing time. At query time, the retrieval engine restricts the search space based on the user's authenticated identity, which is matched against metadata on each chunk. This ensures the AI model cannot access unauthorized context, because it’s filtered out at the retrieval stage, providing a zero-trust foundation for internal knowledge bases. As many vector databases, such as Milvus or services built on Amazon S3, now support high-performance metadata filtering, this technique has become more practical to adopt, even for large knowledge bases.
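    A stripped-down illustration of the pattern, using a toy in-memory index and cosine similarity in place of a real vector database; the chunk contents and roles are invented. The key point is that the permission filter runs before similarity ranking, so unauthorized chunks never enter the candidate set handed to the model.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        text: str
        embedding: list
        allowed_roles: set  # tagged at indexing time

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        return dot / (norm_a * norm_b)

    def retrieve(query_embedding, user_roles, index, k=2):
        # Filter by role BEFORE ranking: unauthorized chunks never
        # become candidates, so the model cannot see them.
        visible = [c for c in index if c.allowed_roles & user_roles]
        ranked = sorted(visible,
                        key=lambda c: cosine(query_embedding, c.embedding),
                        reverse=True)
        return ranked[:k]

    index = [
        Chunk("Q3 payroll summary", [1.0, 0.0], {"hr"}),
        Chunk("Public holiday policy", [0.9, 0.1], {"hr", "employee"}),
    ]
    # An employee asking a payroll-adjacent question only ever
    # retrieves the chunks their role permits.
    results = retrieve([1.0, 0.0], {"employee"}, index)
    ```

    In a production system the same filter would be expressed as a metadata predicate pushed down into the vector database query rather than a Python list comprehension.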

  • 28. Skills as executable onboarding documentation

    Agent Skills, curated shared instructions and many other context engineering techniques appear throughout this edition of the Radar. One use case we want to highlight in the coding context is the use of skills as executable onboarding documentation. This technique applies at multiple levels. First, within a codebase, a /_setup skill can take on the role of a go.sh script and a README file, combining scripting with LLM-executed semantics for steps that cannot be scripted. It can also go beyond what a script can do by dynamically taking the current state of the codebase and environment into account. Second, library and API creators can provide skills for their consumers as part of their documentation through internal or external skill registries such as Tessl. And third, we’ve found this useful for onboarding teams to internal platforms to lower the barrier to using a key technology or reduce friction when adopting a design system. So far, our experience with this has relied heavily on MCP servers but is now shifting toward skills.

    As with other forms of documentation, the challenge of keeping this up to date doesn’t go away. However, unlike static documentation, executable documentation can help you notice staleness much earlier.
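    As an illustration, a hypothetical /_setup skill might look like the sketch below. It follows the common SKILL.md layout (YAML frontmatter plus instructions); the script names and steps are invented. Note that steps 2 and 3 rely on LLM-executed judgment about the current state of the environment, which is exactly what a plain shell script cannot provide.

    ```markdown
    ---
    name: setup
    description: Bootstrap a working dev environment for this repository
    ---

    1. Run `./scripts/install-deps.sh` and stop if Docker is not available.
    2. Copy `.env.example` to `.env`, asking the user for any secrets you
       cannot infer from the environment.
    3. Check `docker compose ps` and start only the services that are not
       already running.
    4. Run the smoke tests; summarize any failures instead of pasting raw logs.
    ```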

  • 29. Small language models

    Small language models (SLMs) continue to improve and are beginning to offer better intelligence per dollar than LLMs for certain use cases. We've seen teams evaluate SLMs to reduce inference costs and speed up agentic workflows. Recent progress shows steady gains in intelligence density, making SLMs competitive with older LLMs for tasks such as summarization and basic coding. This shift reflects a move away from "bigger is better" toward higher-quality data, model distillation and quantization. Models such as Phi-4-mini and Ministral 3 3B demonstrate how distilled models can retain many capabilities of larger teacher models. Even ultra-compact models such as Qwen3-0.6B and Gemma-3-270M are becoming viable for running models on edge devices. For agentic use cases where older LLMs have been sufficient, teams should consider SLMs as a lower-cost, lower-latency alternative with reduced resource requirements.

  • 30. Team of coding agents

    In the previous Radar, we described a team of coding agents as a technique where a developer orchestrates a small set of role-specific agents to collaborate on a coding task. Since then, the barrier to adoption has dropped. Subagent support has become table stakes across established coding agent tools, and Claude Code now includes an agent teams feature that provides built-in orchestration. In a team of agents, a primary orchestrator typically coordinates task sequencing and parallelization. Agents should be able to communicate not only with the orchestrator but also with one another. Common use cases include teams of reviewers or groups of implementers responsible for different parts of the application, such as backend and frontend.

    Although some in the industry are using the terms "agent teams" and "agent swarms" interchangeably (for example, Claude Code describes its agent teams feature as "our implementation of swarms"), we see value in distinguishing between them. A small, deliberate team of agents collaborating on a task differs significantly from a large swarm in terms of entry barriers, complexity and use cases.

  • 31. Temporal fakes

    Temporal fakes extend the idea of simulating real-world systems for development and testing, a practice long used in IoT and industrial platforms. With AI coding agents reducing the effort required to build such simulators, teams can now create high-fidelity replicas of external dependencies much more easily. Unlike traditional mocks that return static request–response pairs, temporal fakes maintain internal state machines and model the temporal evolution of real systems.

    One of our teams used this technique while developing an observability stack for large GPU data centers, avoiding the need to procure physical hardware. Testing alert rules, dashboards and anomaly detection against real systems can be impractical — for example, intentionally overheating a GPU to validate a thermal throttle alert. Instead, the team built fakes for hardware domains such as NVIDIA DCGM and InfiniBand fabric using Go. These simulators enabled failure scenarios such as thermal throttling, XID error storms, link flaps and PSU failures with configurable intensity and duration, orchestrated via a process-compose stack.

    A central registry defined valid failure scenarios, while an MCP server exposed scenario injection to the agent. The agent could trigger faults (for example, injecting a thermal throttle on a specific GPU) and verify that metrics changed, alerts fired and dashboards updated as expected. This temporal fidelity makes the technique valuable for testing complex systems where failures cascade. However, teams must ensure the fakes remain faithful to real-world behavior; otherwise, they risk creating false confidence in automated pipelines.
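    The essence of a temporal fake reduces to a small example (shown here in Python, although the team's real simulators were written in Go and far richer): a fake GPU whose temperature evolves tick by tick, with a fault-injection hook for a thermal-throttle scenario of configurable intensity and duration. The thresholds and cooling rate are invented.

    ```python
    class FakeGpu:
        """Toy temporal fake of one GPU's thermal behavior: it carries
        state that evolves with every simulated tick, unlike a static
        request-response mock."""

        def __init__(self, ambient_c=40.0, throttle_at_c=90.0):
            self.ambient_c = ambient_c
            self.throttle_at_c = throttle_at_c
            self.temp_c = ambient_c
            self._heat_per_tick = 0.0
            self._ticks_remaining = 0

        def inject_thermal_scenario(self, heat_per_tick, duration_ticks):
            # Fault injection with configurable intensity and duration,
            # as an MCP server might expose it to a testing agent.
            self._heat_per_tick = heat_per_tick
            self._ticks_remaining = duration_ticks

        def tick(self):
            if self._ticks_remaining > 0:
                self.temp_c += self._heat_per_tick
                self._ticks_remaining -= 1
            else:
                # No active fault: cool back toward ambient.
                self.temp_c = max(self.ambient_c, self.temp_c - 5.0)

        @property
        def throttled(self):
            return self.temp_c >= self.throttle_at_c


    gpu = FakeGpu()
    gpu.inject_thermal_scenario(heat_per_tick=15.0, duration_ticks=5)
    for _ in range(4):
        gpu.tick()
    # After four heated ticks the GPU has crossed its throttle threshold,
    # so alert rules watching this metric should now fire.
    ```

    An observability pipeline scraping `temp_c` from a fleet of such fakes sees a realistic rise, plateau and recovery curve rather than a single hard-coded value.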

  • 32. Toxic flow analysis for AI

    Agent capabilities are outpacing security practices. With the rise of permission-hungry agents like OpenClaw, teams are increasingly deploying agents in environments that expose them to the lethal trifecta: access to private data, exposure to untrusted content and the ability to communicate externally. As capabilities grow, so too does the attack surface, exposing systems to risks such as prompt injection and tool poisoning. We continue to see toxic flow analysis as a primary technique for examining agentic systems to identify unsafe data paths and potential attack vectors. These risks are no longer limited to MCP integrations; our teams have observed similar patterns in Agent Skills, where a malicious actor can package a seemingly useful skill that embeds hidden instructions to exfiltrate sensitive data. We strongly encourage teams working with agents to perform toxic flow analysis and use tools such as Agent Scan to identify unsafe data paths before they're exploited.
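    As a grossly simplified illustration of what such analysis looks for, the sketch below flags the lethal trifecta from capability tags on an agent's tools. Real toxic flow analysis traces actual data paths between tools rather than taking a union of capabilities, and the tags and tool names here are invented.

    ```python
    # Capability tags are illustrative; real analysis is tool-specific.
    PRIVATE = "reads_private_data"
    UNTRUSTED = "ingests_untrusted_content"
    EGRESS = "communicates_externally"
    TRIFECTA = {PRIVATE, UNTRUSTED, EGRESS}

    def lethal_trifecta(tools):
        """Return the trifecta if this agent's combined tool capabilities
        cover all three legs; otherwise an empty set."""
        combined = set()
        for capabilities in tools.values():
            combined |= capabilities
        return TRIFECTA if TRIFECTA <= combined else set()

    agent_tools = {
        "read_inbox": {PRIVATE, UNTRUSTED},  # email bodies are untrusted
        "post_webhook": {EGRESS},
    }
    flagged = lethal_trifecta(agent_tools)
    ```

    Even this crude check surfaces a useful design question: removing any one leg, for instance by denying the webhook tool in sessions that read email, breaks the toxic combination.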

  • 33. Vision language models for end-to-end document parsing

    Document parsing often relies on multi-stage pipelines combining layout detection, traditional OCR and post-processing scripts. These approaches often struggle with complex layouts and mathematical formulas. Vision language models (VLMs) for end-to-end document parsing simplify this architecture by treating the document image as a single input modality, preserving natural reading order and structured content. Open-source models specifically trained for this purpose — such as olmOCR-2, the token-efficient DeepSeek-OCR (3B) and the ultra-compact PaddleOCR-VL — have yielded highly efficient results. While VLMs reduce architectural complexity by replacing multi-stage pipelines, their generative nature makes them prone to hallucinations. Use cases with a low tolerance for error may still require a hybrid approach or deterministic OCR. Teams dealing with high-volume document ingestion should evaluate these unified approaches to determine whether they can replace complex legacy pipelines while maintaining accuracy and reducing long-term maintenance overhead.

Caution

  • 34. Agent instruction bloat

    Context files such as AGENTS.md and CLAUDE.md tend to accumulate over time as teams add codebase overviews, architectural explanations, conventions and rules. While each addition is useful in isolation, this often leads to agent instruction bloat. Instructions become long and sometimes conflict with each other. Models tend to attend less to content buried in the middle of long contexts, so guidance deep in a long conversation history can be missed. As instructions grow, the likelihood increases that important rules are ignored. We also see many teams using AI to generate AGENTS.md files, but research suggests that hand-written versions are often more effective than LLM-generated ones. When using agentic tools, be deliberate and selective with instructions, adding them only as needed and continuously refining toward a minimal, coherent set. Consider leveraging progressive context disclosure to surface only the instructions and capabilities an agent needs for its current task.
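    A toy sketch of progressive context disclosure for instructions, with invented file names and a crude keyword match: keep a lightweight index of one-line summaries in the always-loaded context and pull in full instruction files only when they match the task at hand.

    ```python
    # Hypothetical instruction index: only these one-line summaries live
    # in the agent's standing context; full text is loaded on demand
    # instead of concatenating everything into one bloated AGENTS.md.
    INSTRUCTION_INDEX = {
        "database-migrations": "Rules for writing reversible schema migrations",
        "frontend-styling": "Design-token and CSS conventions",
        "release-process": "How to cut and tag a release",
    }

    def context_for_task(task, load_instructions):
        """Select and load only the instruction files relevant to a task.
        A real agent would let the model pick from the index; a keyword
        match keeps this sketch deterministic."""
        relevant = [key for key in INSTRUCTION_INDEX
                    if any(word in task.lower() for word in key.split("-"))]
        return [load_instructions(key) for key in relevant]

    # Stub loader; a real one would read docs/instructions/<key>.md.
    docs = context_for_task("add a database migration for orders",
                            lambda key: f"<full text of {key}>")
    ```

    The styling and release rules never enter the context for this task, which is exactly the point: less competing guidance for the model to ignore.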

  • 35. AI-accelerated shadow IT

    AI continues to lower the barriers for noncoders to build complex systems. While this enables experimentation and early validation of requirements, it also introduces the risk of AI-accelerated shadow IT. In addition to no-code workflow platforms integrating AI APIs (e.g., OpenAI or Anthropic), more agentic tools are becoming available to noncoders, such as Claude Cowork.

    When the spreadsheet that quietly runs the business evolves into customized agentic workflows that lack governance, it introduces significant security risks and a proliferation of competing solutions to similar problems. Distinguishing between disposable, one-off workflows and critical processes that require durable, production-ready implementation is key to balancing experimentation with control.

    Organizations should prioritize governance as part of their AI adoption strategy by facilitating experimentation within controlled environments. Appropriately instrumented internal sandboxes give noncoders a place to deploy prototypes where usage can be tracked. Pairing these with a shared catalogue of existing workflows helps teams discover what's already been built before duplicating effort. Workflows that gain traction can then signal where to invest in more robust, production-grade applications.

  • 36. Codebase cognitive debt

    Codebase cognitive debt is the growing gap between a system’s implementation and a team’s shared understanding of how and why it works. As AI increases change velocity, especially with multiple contributors or coding agent swarms, teams can lose track of design intent and hidden coupling. This, combined with rising technical debt, creates a reinforcing loop that makes systems progressively harder to reason about.

    Weaker system understanding also reduces developers’ ability to guide AI effectively, making it harder to anticipate edge cases and steer agents away from architectural pitfalls. Left unmanaged, teams reach a tipping point where small changes trigger unexpected failures, fixes introduce regressions and cleanup efforts increase risk instead of reducing it.

    Teams should avoid complacency with AI-generated code and adopt explicit countermeasures: feedback sensors for coding agents, tracking team cognitive load and architectural fitness functions to continuously enforce key constraints as AI accelerates output.

  • 37. Coding agent swarms

    Where a team of coding agents is a small, deliberate group, a coding agent swarm applies dozens to hundreds of agents to a problem, with AI determining composition and size dynamically. Projects such as Gas Town and Ruflo (formerly Claude Flow) are good examples of this approach. Early patterns for swarm implementations are emerging: hierarchical role separation (orchestrators, supervisors and ephemeral workers), a durable work ledger that helps agents divide and coordinate work (Gas Town uses beads for this) and a merging mechanism to handle conflicts from parallel work.

    Two swarm experiments have drawn particular attention: Anthropic's C compiler generation and Cursor's agent scaling experiment that created a browser over a week. It's worth noting that both teams chose use cases that could rely on existing detailed specifications, and in the case of the C compiler, comprehensive test suites that provide clear, measurable feedback. Those conditions are not representative of typical product development, where requirements are less defined and verification is harder. Nevertheless, these experiments contribute to emerging patterns for making long-running swarms technically viable. They remain costly and are still far from mature, which is why we advise caution when adopting this technique.

  • 38. Coding throughput as a measure of productivity

    AI coding assistants are delivering real productivity gains and are rapidly becoming standard developer tooling. However, we’re increasingly seeing organizations measure success using superficial indicators such as lines of code generated or the number of pull requests (PRs). When these coding throughput metrics are used in isolation, they can negatively shape employee behavior. The result is often a flood of poorly aligned code that slows reviews, harms delivery throughput and introduces security risks. Cycle times increase as engineers raise PRs filled with insufficiently reviewed AI output, leading to repeated back-and-forth with reviewers. These metrics fail to capture the residual effort required to adapt AI-generated code to a team's architecture, conventions and patterns.

    More meaningful leading indicators exist, such as first-pass acceptance rate — how often AI output can be used with minimal rework. Measuring this exposes hidden effort and makes improvement actionable: teams can refine prompts, improve priming documents and strengthen design conversations to progressively increase acceptance over time. This creates a virtuous cycle in which AI output requires less correction. First-pass acceptance also connects naturally with DORA metrics: lower acceptance rates tend to increase change failure rates, while repeated iteration cycles extend lead time for changes. As AI assistants become ubiquitous, organizations should shift focus away from coding throughput alone toward metrics that reflect real impact and delivery outcomes.
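    For concreteness, first-pass acceptance rate reduces to simple arithmetic over per-change records. The field names below are invented; in practice the signal would come from review tooling that records whether a change needed a rework cycle.

    ```python
    def first_pass_acceptance_rate(changes):
        """Share of AI-generated changes that landed without a rework
        cycle, i.e. were usable with minimal correction."""
        if not changes:
            return 0.0
        accepted = sum(1 for c in changes if c["accepted_without_rework"])
        return accepted / len(changes)

    # One week of AI-assisted pull requests (illustrative data).
    week = [
        {"pr": 101, "accepted_without_rework": True},
        {"pr": 102, "accepted_without_rework": False},
        {"pr": 103, "accepted_without_rework": True},
        {"pr": 104, "accepted_without_rework": True},
    ]
    rate = first_pass_acceptance_rate(week)
    ```

    Tracked week over week, a rising rate indicates that prompt refinements and better priming documents are reducing correction effort, rather than just that more code is being generated.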

  • 39. Ignoring durability in agent workflows

    Ignoring durability in agent workflows is an anti-pattern we’ve seen across many teams, resulting in systems that work in development but fail in production. The challenges facing distributed systems are even more pronounced when building with agents. A mindset that expects failures and designs for graceful recovery serves teams far better than a reactive one.

    LLM and tool calls can fail due to network interruptions and server crashes, halting an agent's progress and leading to poor user experience and increased operational costs. Some systems can tolerate this when tasks are short-lived, but complex workflows that run for days or weeks require durability.

    Fortunately, durable execution is being integrated into agent frameworks such as LangGraph and Pydantic AI. It provides stateful persistence of progress and tool calls, enabling agents to resume tasks after failures. For workflows that involve a human in the loop, durable execution can suspend progress while awaiting input. Durable computing platforms such as Temporal, Restate and Golem also provide support for agents. Built-in observability of tool execution and decision tracking makes debugging easier and improves understanding of systems in production. Teams should start with native durable execution support in their agent framework and reach for standalone platforms as workflows become more critical or complex.
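    The core idea of durable execution can be shown in a few lines: checkpoint state after every step and resume from the checkpoint on restart. This toy sketch is not how LangGraph or Temporal actually persist state; it uses a JSON file, and a step that fails once stands in for a transient LLM outage.

    ```python
    import json
    import os

    def run_workflow(steps, state_path):
        """Durable-execution sketch: checkpoint progress after every step
        so a crashed run resumes where it left off, not from scratch."""
        state = {"next": 0, "outputs": []}
        if os.path.exists(state_path):
            with open(state_path) as f:
                state = json.load(f)      # resume from last checkpoint
        for i in range(state["next"], len(steps)):
            state["outputs"].append(steps[i](state["outputs"]))
            state["next"] = i + 1
            with open(state_path, "w") as f:
                json.dump(state, f)       # persist before moving on

        return state["outputs"]

    # Demo: the second step (a stand-in for an LLM call) fails once.
    if os.path.exists("demo_state.json"):
        os.remove("demo_state.json")
    attempts = {"n": 0}

    def plan_step(outputs):
        return "plan"

    def flaky_llm_call(outputs):
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise ConnectionError("transient LLM failure")
        return "summary of " + outputs[0]

    steps = [plan_step, flaky_llm_call]
    try:
        run_workflow(steps, "demo_state.json")
    except ConnectionError:
        pass                               # the process "crashes" here
    outputs = run_workflow(steps, "demo_state.json")  # resumes at step 2
    os.remove("demo_state.json")
    ```

    The second run never re-executes the completed planning step, which is the property that matters when each step is an expensive or side-effecting LLM or tool call.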

  • 40. MCP by default

    As the Model Context Protocol (MCP) gains traction, we're seeing teams and vendors reach for it as the default integration layer between AI agents and external systems, even when simpler alternatives exist. We caution against using MCP by default. MCP adds real value for structured tool contracts, OAuth-based authentication boundaries and governed multi-tenant access. However, it also introduces what Justin Poehnelt calls an "abstraction tax": every protocol layer between an agent and an API loses fidelity, and for complex APIs those losses compound.

    In practice, a well-designed CLI with good --help output, structured JSON responses and predictable error handling often gives agents everything they need without the protocol overhead. As Simon Willison notes, "almost everything I might achieve with an MCP can be handled by a CLI tool instead."

    This isn't a rejection of MCP. Teams should avoid adopting it by default and first ask whether their system actually requires protocol-level interoperability. MCP makes sense when its governance and integration benefits outweigh the added complexity and potential fidelity loss.
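    To make the alternative concrete, here is a sketch of an agent-friendly CLI in the spirit of that advice: self-describing via --help, structured JSON on stdout and a JSON error with a non-zero exit code on failure. The `ticket` tool and its data are hypothetical.

    ```python
    import argparse
    import json

    # Invented stand-in for a real backend the CLI would query.
    FAKE_DB = {"T-123": {"id": "T-123", "status": "open", "assignee": "sam"}}

    def build_parser():
        # argparse generates --help output for free, which is often all
        # the "tool discovery" an agent needs.
        p = argparse.ArgumentParser(prog="ticket",
                                    description="Query support tickets.")
        p.add_argument("ticket_id", help="ticket identifier, e.g. T-123")
        p.add_argument("--fields", default="id,status",
                       help="comma-separated fields to return")
        return p

    def run(argv):
        """Returns (exit_code, json_string) so behavior is easy to test;
        a real CLI would print the string and sys.exit with the code."""
        args = build_parser().parse_args(argv)
        record = FAKE_DB.get(args.ticket_id)
        if record is None:
            return 1, json.dumps({"error": "not_found",
                                  "ticket_id": args.ticket_id})
        wanted = args.fields.split(",")
        return 0, json.dumps({k: v for k, v in record.items() if k in wanted})

    code, payload = run(["T-123", "--fields", "id,status"])
    ```

    An agent can discover the interface from --help, parse the JSON reliably and branch on the exit code, with no protocol layer in between.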

  • 41. Pixel-streamed development environments

    Pixel-streamed development environments use VDI-style remote desktops or workstations for software development, with editing, builds and debugging performed through a streamed desktop rather than on a local machine or a code-centric remote environment. We continue to see organizations adopt them to meet security, standardization and onboarding goals, especially for offshore teams and lift-and-shift cloud programs. In practice, however, the trade-off is often poor: latency, input lag and inconsistent screen responsiveness create constant cognitive friction that slows delivery and makes everyday development work more tiring. Unlike development environments in the cloud such as Google Cloud Workstations or tools like Coder and VS Code Remote Development, which move compute closer to the code without streaming the entire desktop, pixel-streamed setups prioritize centralized control over developer flow and are often imposed with too little input from the engineers who use them. We advise against pixel-streamed development environments as a default choice for software delivery unless a compelling security or regulatory constraint clearly outweighs the productivity cost.

Unable to find something you expected to see?

 

Each edition of the Radar features blips reflecting what we came across during the previous six months. We might have already covered what you're looking for on a previous Radar. We sometimes cull things just because there are too many to talk about. A blip might also be missing because the Radar reflects our experience; it is not based on a comprehensive market analysis.
