The local inference boundary:

Apple’s AFM 3 and token economics

Alexandra Lovin and Richard Gall

Published: June 23, 2026

The announcements at WWDC 2026 mark a subtle but important shift in how we think about building intelligent software. For the past few years, the architectural paradigm for AI integration has been the remote API pattern: a thin client shipping text or images over the network to a massive, black-box model in the cloud, paying a tax on every single token generated.

With the unveiling of Siri AI and the third generation of Apple Foundation Models (AFM 3), we’re seeing an approach that’s more closely aligned with alternative ways of tackling memory bandwidth and the DRAM wall, such as local models and model selection.

This isn’t novel; it’s really the continuation of an approach Apple has taken for a number of years. The fact that it's gaining visibility is symptomatic, perhaps, of current resource constraints and shifting token economics. And while Apple has sometimes been criticised for its slow and steady pace when it comes to AI, this approach may be a signal of what the future of AI will look like.

The 20B illusion: Instruction-following pruning

The headline grabber from a system design perspective is AFM 3 Core Advanced. On paper, it’s a 20-billion-parameter model running locally on consumer hardware. Historically, a model of that size would choke a mobile device’s memory architecture, as keeping 20B parameters resident in active RAM (DRAM) leaves little room for anything else.

Apple's solution is an elegant architectural compromise using a technique they call instruction-following pruning (IFP).

The flash-to-RAM pipeline. The full 20B model resides in flash memory (NAND).
Dynamic routing. Instead of routing token-by-token, which would be bottlenecked by NAND-to-DRAM bandwidth limits, the system uses a lightweight, dense block to make routing decisions per prompt.
Active footprint. Only one to four billion parameters are activated and swapped into DRAM depending on the specific task.

From an engineering viewpoint, this is a classic trade-off. Apple is trading a slice of the peak reasoning capability of a dense 20B model for an aggressive reduction in memory footprint. The result is 9B-class quality with an elastic footprint: the system scales the active parameters it loads into the working DRAM buffer from one billion to four billion, depending on the task. It highlights a core principle of pragmatic software design, which is that you don't optimize the whole system; you optimize the hot path.
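
The per-prompt selection described above can be sketched roughly as follows. This is an illustrative toy, not Apple's implementation: the BLOCKS table, the keyword-based route_prompt scorer and the 4-bit weight assumption are all hypothetical. It only shows the shape of the idea, which is that one cheap routing decision per prompt picks which parameter blocks to load from flash, capped at a DRAM budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A hypothetical group of weights stored in flash (sizes in billions of parameters)."""
    name: str
    params_b: float
    keywords: tuple


# Illustrative layout only; the real partitioning is not public.
BLOCKS = [
    Block("core", 1.0, ()),  # always resident
    Block("code", 1.0, ("swift", "function", "bug")),
    Block("summarize", 1.0, ("summarize", "tl;dr", "recap")),
    Block("math", 1.0, ("solve", "equation", "integral")),
    Block("translate", 1.0, ("translate", "french", "spanish")),
]

MAX_ACTIVE_B = 4.0  # upper end of the 1B-4B active footprint described in the article


def route_prompt(prompt: str, max_active_b: float = MAX_ACTIVE_B) -> list:
    """Choose blocks once per prompt (not per token), within the DRAM budget."""
    text = prompt.lower()
    active = [BLOCKS[0]]
    used = BLOCKS[0].params_b
    # Keyword hits stand in for the lightweight dense router in this sketch.
    scored = sorted(
        ((sum(k in text for k in b.keywords), b) for b in BLOCKS[1:]),
        key=lambda pair: pair[0],
        reverse=True,
    )
    for hits, block in scored:
        if hits > 0 and used + block.params_b <= max_active_b:
            active.append(block)
            used += block.params_b
    return active


def footprint_gb(blocks, bits_per_weight: int = 4) -> float:
    """DRAM needed for the active weights; 4-bit quantization is an assumption."""
    params = sum(b.params_b for b in blocks) * 1e9
    return params * bits_per_weight / 8 / 1e9


active = route_prompt("Summarize this thread and translate it to French")
print([b.name for b in active], footprint_gb(active))
```

Because the decision is made once per prompt, the flash-to-DRAM transfer happens once up front rather than on every generated token, which is the bottleneck the article says per-token routing would hit.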

The system orchestrator pattern

Meanwhile, the architectural centerpiece of Siri AI is the system orchestrator. This component acts as a centralized event router: it understands the active application state, knows what task the user is trying to accomplish and maps the request to the most efficient tier in the model continuum.

How does the system orchestrator work?

The first point to recognize is that the system orchestrator weighs a number of decision variables:

Hardware: Checking the physical capability of the local device (A17 Pro and newer), SoC thermal state and battery reserves.
Context size: Measuring how much text or data needs processing.
Reasoning depth: Distinguishing between a simple single-step lookup and complex, multi-hop inference.
Latency thresholds: Evaluating whether a task demands real-time execution (voice/camera, for example) versus asynchronous processing (such as background summarization).
Modality complexity: Assessing whether a given prompt is plain text, requires deep image understanding or necessitates cross-app context.

Once these variables are processed, the orchestrator routes the workload down one of two primary pathways:

Local gatekeeper: A billion-parameter local model acts as a system-wide pre-filter. It determines what stays on-device by asking whether it can handle a task before Private Cloud Compute (PCC) is pinged.
Off-device/cloud escalation: When complexity exceeds local limits (for tasks with long context, multi-step reasoning, cross-source synthesis, or generative depth), modelmanagerd escalates to the 32,000-token server models. (For context, Gemini Nano explicitly escalates to the cloud once inputs exceed approximately 3,000 words.)

It’s worth noting there are some overrides, such as latency and privacy. For instance, when it comes to health data on the Apple Watch, core features such as AFib detection, crash detection and sleep staging are kept entirely on-device.
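
To make the decision flow concrete, here is a minimal sketch of how such a router might combine the variables and overrides above. Every name in it (Request, DeviceState, choose_route) and every threshold is a hypothetical stand-in; the real orchestrator's logic is not public. The 4,096-token figure is the local session window the article cites elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ON_DEVICE = "on_device"
    CLOUD = "cloud"


@dataclass
class Request:
    context_tokens: int
    multi_hop: bool = False      # reasoning depth
    cross_source: bool = False   # needs synthesis across several sources
    realtime: bool = False       # latency override (voice/camera)
    sensitive: bool = False      # privacy override (e.g. health data)


@dataclass
class DeviceState:
    supports_local_model: bool = True
    thermal_throttled: bool = False
    battery_pct: int = 100


LOCAL_CONTEXT_LIMIT = 4096  # local session window cited in the article
LOW_BATTERY_PCT = 10        # hypothetical threshold


def choose_route(req: Request, dev: DeviceState) -> Route:
    # Overrides come first: privacy- or latency-critical work stays local.
    if req.sensitive or req.realtime:
        return Route.ON_DEVICE
    # Hardware gate: an unsupported, hot or nearly drained device escalates.
    if (
        not dev.supports_local_model
        or dev.thermal_throttled
        or dev.battery_pct < LOW_BATTERY_PCT
    ):
        return Route.CLOUD
    # Local gatekeeper: escalate only when the task exceeds local limits.
    too_big = req.context_tokens > LOCAL_CONTEXT_LIMIT
    too_deep = req.multi_hop or req.cross_source
    return Route.CLOUD if (too_big or too_deep) else Route.ON_DEVICE


print(choose_route(Request(context_tokens=500), DeviceState()))
print(choose_route(Request(context_tokens=20_000), DeviceState()))
```

Checking the overrides before anything else is a design choice in this sketch: it guarantees that a sensitive request can never be escalated by a later rule.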

In short, the system orchestrator is a good example of the well-established principle of separation of concerns in practice. The client application doesn’t need to know where or how intelligence is computed; it interfaces with a unified orchestration layer that optimizes for latency, battery, cost and privacy behind the scenes.

How novel is this, really?

While this appears to be a novel approach to breaking the monolithic LLM/remote API pattern, it’s perhaps just aligned with how the industry is approaching the DRAM wall. Apple's current architectural leaps aren’t a sudden emergence but the industrialization of a trajectory that really began in 2017 with Core ML and the A11 Bionic's Neural Engine. By introducing early on-device training capabilities in 2019, Apple laid the groundwork for the exact battles the open-source community is fighting today.

Whether you’re a developer trying to squeeze open-weight models onto a consumer PC, an Apple engineer designing the 38-TOPS M4 chip or a mobile developer, you’re fighting the same physical adversary: memory bandwidth and the DRAM wall. In other words, the primary constraint for localized AI is not raw compute; it’s the inescapable physics of moving massive weight matrices and expanding key-value (KV) caches through a constrained memory bus.
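
A back-of-the-envelope calculation shows why bandwidth, not compute, sets the ceiling. In the bandwidth-bound regime, generating each token requires reading the active weights (plus the KV cache) from memory roughly once, so decode speed is approximately bandwidth divided by bytes read per token. The figures below (60 GB/s of usable bandwidth, 4-bit weights) are illustrative assumptions, not measured values for any Apple device.

```python
def decode_tokens_per_second(
    active_params_b: float,
    bits_per_weight: int,
    bandwidth_gb_s: float,
    kv_cache_gb: float = 0.0,
) -> float:
    """Upper bound on decode speed when memory bandwidth is the bottleneck."""
    weight_gb = active_params_b * bits_per_weight / 8  # GB of weights read per token
    return bandwidth_gb_s / (weight_gb + kv_cache_gb)


# Assumed numbers: 60 GB/s bandwidth, 4-bit quantized weights.
print(decode_tokens_per_second(20, 4, 60))  # dense 20B model: 6.0 tokens/s
print(decode_tokens_per_second(3, 4, 60))   # 3B active parameters: 40.0 tokens/s
```

Under these assumptions, shrinking the active set from 20B to 3B parameters raises the ceiling from 6 to 40 tokens per second without any change to the processor, and a growing KV cache eats into that ceiling in the same way.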

The reality of the token budget

To talk about software economics in 2026 is to talk about token budgets. In cloud-hosted environments, the token budget is financial — a variable operational cost that scales linearly with user engagement. On-device AI flips this model on its head. On-device inference offers $0 marginal token costs and total privacy, but it introduces a strict physical token budget in the form of limits on context windows, memory ceilings and battery life.

For instance, operating system constraints in Apple’s local framework frequently limit local model sessions to a rigid 4,096-token context window. If you’re building an application that needs to synthesize a long thread of emails, a multi-page receipt and user context, you’ll hit that ceiling very quickly. This forces developers to treat tokens not as cash, but as a scarce system resource, not unlike memory management in the early days of computing.

This means we need to become more disciplined in:

Aggressive context pruning. Programmatically stripping out boilerplate and irrelevant metadata before feeding input to the local model.
Semantic compression. Using smaller models to summarize information into highly dense semantic representations before passing them up the chain.
Structured outputs. Leveraging AFM 3's native capability to output typed Swift values directly, avoiding the token bloat of messy, conversational text that then requires regex parsing.
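
As a rough illustration of the first discipline, the sketch below strips boilerplate from a message thread and keeps only the newest messages that fit a token budget. The regular expression, the four-characters-per-token estimate and the 512-token output reserve are all assumptions for illustration; a real application would use the model's own tokenizer and its own definition of boilerplate.

```python
import re

CONTEXT_WINDOW = 4096      # local session limit cited above
RESERVED_FOR_OUTPUT = 512  # hypothetical headroom for the model's reply

# Hypothetical boilerplate patterns: quoted replies, signature dividers, footers.
BOILERPLATE = re.compile(
    r"^(>.*|--\s*$|sent from my .*|unsubscribe.*|confidentiality notice.*)$",
    re.IGNORECASE,
)


def estimate_tokens(text: str) -> int:
    """Crude heuristic of about four characters per token."""
    return (len(text) + 3) // 4


def strip_boilerplate(message: str) -> str:
    lines = [ln for ln in message.splitlines() if not BOILERPLATE.match(ln.strip())]
    return "\n".join(lines).strip()


def build_context(messages, budget=CONTEXT_WINDOW - RESERVED_FOR_OUTPUT):
    """Keep the newest messages that fit; messages are ordered oldest to newest."""
    kept, used = [], 0
    for msg in reversed(messages):
        cleaned = strip_boilerplate(msg)
        cost = estimate_tokens(cleaned)
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(cleaned)
        used += cost
    return list(reversed(kept)), used


thread = ["Hi team\n> old quoted text\nSee you at 3pm\nSent from my iPhone"]
context, tokens_used = build_context(thread)
print(context, tokens_used)
```

Dropping the oldest messages first is only one possible policy; the semantic compression step described above would instead summarize them so that some of their content survives.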

Hidden token economics and monetization

There are, however, some hidden economics that businesses and engineering teams need to understand properly.

First, developers in the App Store Small Business Program (fewer than two million total first-time downloads) receive zero-cost API access to PCC and Apple Foundation Models. By erasing the financial boundary on remote compute, Apple removes the primary incentive for local optimization. The technical necessity to design aggressive context pruning frameworks or semantic compression pipelines evaporates. This is ultimately a classic developer acquisition strategy; developers are able to eliminate operational token costs, but there’s a potential long-term cost of platform lock-in.

Second, Apple has also set a strict 12GB RAM hardware floor for its best on-device models (AFM 3 Core Advanced). As developers build localized AI apps that require that 12GB floor, consumers will be pushed to upgrade to more expensive "Pro" tier hardware, which will drive up Apple's average selling price (ASP).

Finally, heavy computational features, like the diffusion-based ADM 3 Cloud image generation models, carry strict daily usage limits. Increased token generation access is being tied directly to premium iCloud+ subscriptions, offsetting third-party server costs with recurring services revenue.

Is Apple rethinking its approach to openness?

One thing that’s particularly intriguing about this Apple announcement is how it appears to be thinking about openness. Apple is known for being a strict walled garden, but there’s more nuance at work here. One way to understand this is through a tripartite lens:

High accessibility

The Foundation Models Framework provides developers with high-level, unified access to route prompts through a single LanguageModelSession. This single session seamlessly handles on-device models, Private Cloud Compute (PCC) and third-party integrations like Claude and Gemini. It also offers powerful semantic levers, including the @Generable macro for structured Swift outputs, LoRA adapter loading and explicit tool protocols.

The low end: Embracing open source

The new Core AI framework welcomes the broader open-source community. It allows developers to bring arbitrary weights (such as Qwen or Mistral) and compile them directly for localized execution on Apple Silicon.

The gated middle: The black box

Apple strictly refuses to distribute its proprietary AFM weights, and developers are blocked from directly programming the Neural Engine (ANE). The mechanics of instruction-following pruning (IFP) remain completely obscured, and developers no longer have the ability to manually override the system's routing decisions.

Apple’s Virtual Research Environment

There’s also another part of this story: Apple has built and distributed its own official emulator, the Virtual Research Environment (VRE). This change in tack may well be driven by a desire to establish absolute trust in cloud AI inference for enterprise adoption.

Non-targetability and mathematical trust. Apple is trying to solve a key enterprise AI dilemma by developing a mathematical proof that they cannot selectively surveil a user.
Portable trust on competitor hardware. The 2026 expansion of PCC uses NVIDIA GPUs hosted on Google Cloud infrastructure. Apple is able to achieve portable trust on competitor hardware they cannot physically control by decoupling identity: oblivious HTTP (OHTTP) relays strip IP addresses, and one-time tokens (OTTs) mathematically blind the request authentication. Stateless inference guarantees memory buffers are aggressively recycled and expunged after every request, which means no user data remains.

Regional challenges for Apple

It’s worth noting the direction Apple has taken does pose further challenges for the company. Due to the European Union's Digital Markets Act (DMA), for instance, Apple must allow alternative app distribution and integrate with competing products. Apple argues that these interoperability requirements compromise Siri AI’s data security and product integrity, and the upshot is that Siri AI and PCC are now blocked in the EU.

For developers, the geographic limitation completely breaks the automated routing cascade, even though the local Foundation Models Framework APIs remain globally accessible.

Developers building global apps must manually architect fallback routes to third-party providers (Claude, Gemini) for their EU and Chinese users when the local model hits its limits.
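
A manual fallback of this kind might look something like the sketch below. The region codes, the LocalLimitExceeded exception and the three callables are placeholders rather than real SDK calls; the point is only that the app, not the platform, has to own the escalation decision where the cascade is unavailable.

```python
from typing import Callable

# Placeholder codes for regions where the automated cascade is unavailable.
CASCADE_BLOCKED_REGIONS = {"EU", "CN"}


class LocalLimitExceeded(Exception):
    """Raised by a (hypothetical) local model wrapper when a request is too large."""


def answer(
    prompt: str,
    region: str,
    local: Callable[[str], str],
    platform_cascade: Callable[[str], str],
    third_party: Callable[[str], str],
) -> str:
    """Try the local model first, then escalate according to region."""
    try:
        return local(prompt)
    except LocalLimitExceeded:
        if region in CASCADE_BLOCKED_REGIONS:
            # No platform escalation here, so the app routes to its own provider.
            return third_party(prompt)
        return platform_cascade(prompt)
```

Passing the three backends in as plain callables keeps the routing rule testable without any network access, and makes it easy to swap the third-party provider per market.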

Final thoughts

WWDC 2026 was interesting for many reasons, but the main story is that Apple is trying to respond to current industry challenges in a way that’s distinct from other major tech companies.

What’s more, it also signals that the period of monolithic LLM dependency could be ending — there’s little justification for building systems that default to sending every minor autocomplete or text summarization task to a multi-billion-parameter cloud model.

The design pattern of the immediate future is hybrid; the boundary between edge and remote compute is shifting. It will be up to us to design systems flexible enough to navigate that boundary gracefully.
