Semantic drift and semantic integrity

Stewarding meaning in the age of AI

Raul Vega

Published: May 26, 2026

As software development cycles accelerate and AI-assisted coding becomes the norm, we are building faster than we can think, and the amount of the system we can hold in our heads is shrinking. While our tools are better than ever at solving implementation details, we are losing our grip on meaning and purpose. This creates a dangerous misalignment: the distance between our technical execution and our business intent is widening at an accelerating rate.

Let’s take modern retail as an example. A “shopping bag” is rarely just a collection of items. It’s a dense convergence of gift card logic, promotional stacking, regional tax juggling and loyalty program rules. I once worked on a system where this logic had become so opaque that the development team treated the calculation engine as a black box. The calculation was exact given an applied set of options, but no one could explain why that specific number was the answer.

The only person who truly understood the system was a veteran QA engineer. The developers had stopped thinking about the why to focus instead on the how and the mechanics of the code. In doing so, they lost sight of the underlying business concepts.

This is a classic example of semantic drift.

Semantic drift occurs when the distance between a technical implementation and its real-world meaning increases over time. In the shopping bag example, everything was modeled and abstracted to be a “product” to save time on database design. While this provided a unified technical structure, it destroyed the conceptual distinction between a physical item, a digital gift card and a temporary promotion. When terms are overloaded and definitions are left vague, the system’s integrity begins to fail.
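As a minimal sketch of what that flattening costs, the Python below contrasts a catch-all record with three explicit types. Every class, field and function name is hypothetical and chosen for illustration; none of it is taken from the system described above.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional, Union


# Flattened model: every concept is a "product", so most fields are optional
# and their meaning depends on an informal convention (hypothetical schema).
@dataclass(frozen=True)
class Product:
    sku: str
    price: Decimal
    weight_grams: Optional[int] = None        # only meaningful for physical items
    balance: Optional[Decimal] = None         # only meaningful for gift cards
    discount_rate: Optional[Decimal] = None   # only meaningful for promotions


# Explicit model: each business concept keeps its own shape.
@dataclass(frozen=True)
class PhysicalItem:
    sku: str
    price: Decimal
    weight_grams: int


@dataclass(frozen=True)
class GiftCard:
    code: str
    balance: Decimal


@dataclass(frozen=True)
class Promotion:
    code: str
    discount_rate: Decimal


BagLine = Union[PhysicalItem, GiftCard, Promotion]


def shipping_weight_grams(lines: List[BagLine]) -> int:
    """Sum the weight of the lines that can actually be shipped."""
    return sum(line.weight_grams for line in lines if isinstance(line, PhysicalItem))


bag: List[BagLine] = [
    PhysicalItem(sku="TSHIRT-M", price=Decimal("20.00"), weight_grams=180),
    GiftCard(code="GC-1", balance=Decimal("50.00")),
    Promotion(code="SPRING10", discount_rate=Decimal("0.10")),
]
```

With the explicit types, a question such as "what does this bag weigh?" can only be asked of things that have a weight. With the flattened record, the same question depends on every caller remembering which optional fields apply to which kind of row.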

The impact is rarely an immediate system crash. Instead, the opaque implementation hardens into a de facto source of truth, a dogma we all must trust and follow. Teams struggle to implement new features because they are afraid of breaking an undocumented edge case. When the meaning of your data is locked in the head of a single individual rather than reflected in the architecture, the system inevitably becomes a tangled ball of wool that has to be unpicked with every change.

Defining semantic drift

Semantic drift is the growing gap between a system’s execution and its business intent. One symptom is the gradual divergence between a technical identifier and the business concept it represents.

In software engineering, we often assume that once a schema is defined and a domain model is implemented, the meaning of those elements remains static. However, in practice, meaning is fluid. As organizations scale, new requirements are added and teams reorganize, the original intent of a data structure often erodes.

This erosion typically occurs in three distinct stages:

Overloading. To avoid the cost of schema migrations or breaking changes, teams repurpose existing fields. A status column originally designed for shipping updates might start carrying values related to payment failures or fraud alerts. The technical container remains the same, but the semantic payload has shifted.
Context dilution. As data moves from its source to downstream consumers, the original context is lost. A “Transaction” to a developer in the billing team includes taxes and discounts, while a “Transaction” to a marketing analyst might only refer to the gross merchandise value. Without a shared definition, both parties use the same word to describe different realities.
Knowledge loss. This occurs when the documentation and code no longer reflect the business process, and the rules survive only in existing (and often limited) institutional memory. New developers join a project and interpret the code based on variable names rather than actual business rules, which leads them to build new features on top of a foundation they don’t properly understand.
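The overloading stage can be sketched in a few lines of Python. This is an illustration under stated assumptions: the status values, the two enums and the `classify_status` helper are all hypothetical, and a real migration would involve far more than a lookup.

```python
from enum import Enum
from typing import Union


class ShippingStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"


class PaymentStatus(Enum):
    FAILED = "failed"
    FLAGGED_FOR_FRAUD = "flagged_for_fraud"


# What an overloaded `status` column ends up holding: one technical
# container, two unrelated business concepts (hypothetical values).
OVERLOADED_STATUS_VALUES = ["pending", "shipped", "failed", "flagged_for_fraud"]


def classify_status(raw: str) -> Union[ShippingStatus, PaymentStatus]:
    """Recover which business concept a legacy status value belongs to."""
    for concept in (ShippingStatus, PaymentStatus):
        try:
            return concept(raw)
        except ValueError:
            continue
    raise ValueError(f"Unknown status value: {raw!r}")


for value in OVERLOADED_STATUS_VALUES:
    concept = classify_status(value)
    print(f"{value!r} is a {type(concept).__name__}")
```

Having to write a classifier like this at all is the signal: the column name no longer says what the value means, so every consumer has to rediscover it.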

Isn’t semantic drift just a data quality issue?

It’s important to distinguish semantic drift from traditional data quality issues. Data quality focuses on technical correctness: is the field a string, is it null or does it match a specific regex? Semantic drift, by contrast, is a failure of communication. The data might be technically perfect (clean, unique and well-formatted) yet conceptually wrong.
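To make the distinction concrete, here is a small Python sketch in which a record passes every structural check while two consumers still disagree about what it means. The record, the checks and the figures are invented for illustration. The two readings follow the billing and marketing definitions of “Transaction” given earlier.

```python
import re
from decimal import Decimal

# A hypothetical record that is clean, unique and well-formatted.
record = {"transaction_id": "TX-1001", "amount": "100.00", "currency": "EUR"}


def passes_quality_checks(rec: dict) -> bool:
    """Structural checks only: presence, type and format."""
    tx_id = rec.get("transaction_id")
    amount = rec.get("amount")
    return (
        isinstance(tx_id, str)
        and re.fullmatch(r"TX-\d+", tx_id) is not None
        and isinstance(amount, str)
        and re.fullmatch(r"\d+\.\d{2}", amount) is not None
        and rec.get("currency") in {"EUR", "USD"}
    )


# Hypothetical figures behind the same sale.
gross_merchandise_value = Decimal("100.00")
discount = Decimal("10.00")
tax = Decimal("18.00")

# Billing reads "amount" as the charged total, taxes and discounts included.
billing_expectation = gross_merchandise_value - discount + tax

# Marketing reads "amount" as gross merchandise value.
marketing_expectation = gross_merchandise_value

amount = Decimal(record["amount"])
print("quality checks pass:", passes_quality_checks(record))
print("matches billing reading:", amount == billing_expectation)
print("matches marketing reading:", amount == marketing_expectation)
```

No validator in this sketch can tell which reading is correct, because the answer is not in the data. It lives in a definition that was never written down.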

When semantic drift goes unmanaged, it creates a semantic tax on every delivery cycle. Developers spend a significant portion of their time performing digital archaeology, investigating old code and questioning stakeholders to determine what a specific value actually signifies today. This uncertainty slows down feature development and increases the risk of regressions, as changes are made based on assumptions that might no longer be true.

The two faces of semantic drift

To manage this entropy, we must recognize that semantic drift now manifests in two distinct versions. While the result—a loss of conceptual integrity—is the same, the drivers are fundamentally different.

The human face: linear

Historically, drift has been a human-led process driven by miscommunication, time and organizational turnover.

It happens in the cracks of the development lifecycle. Some common examples include:

A developer repurposing a status field to bypass a frozen schema.
A marketing team redefining “Active User”.

It’s linear and relatively slow. It follows the speed of human communication and usually creates “code smells.” We see variable names like temp_field_2 or user_type_NEW_FINAL. This drift is ugly, which actually makes it easier to spot during a code review.

The AI face: exponential

With the introduction of LLMs and generative AI, we are seeing a new, more dangerous version of drift.

AI acts as a stochastic engine. It doesn’t ‘understand’ business logic but predicts the most likely next token. When an AI generates code or maps data based on a slightly misaligned prompt, it introduces a plausible error.

It is instantaneous and industrial. An AI can generate thousands of lines of logic or millions of data mappings in minutes, applying a drifted understanding across the entire estate.

Unlike human drift, AI-driven drift is syntactically perfect. The code might look clean and the documentation grammatically correct. It lacks the “smell” of technical debt, allowing it to bypass human review and bake misunderstanding directly into the foundation.

How AI multiplies the flaws of the system

The primary danger of AI-driven semantic drift is the volume of information produced. A human team might create a few overloaded fields over a year; an AI can generate thousands of lines of code or millions of rows of synthetic data in minutes, all based on a slightly misaligned understanding of the domain.

This creates a high-velocity feedback loop where the system’s source of truth is constantly being rewritten by a model that prioritizes linguistic coherence over conceptual integrity. In this environment, the black box problem described earlier spreads from a single component to the entire organizational knowledge base.

Software is, and always has been, a linguistic artifact. If we lose control of the language, we lose control of the meaning it represents.

Raul Vega

Senior Consultant, Thoughtworks

Establishing semantic integrity

To counter the acceleration of drift, organizations must shift their focus from mere data availability to semantic integrity. Semantic integrity is a state where the system’s data and logic accurately and consistently represent the real-world business concepts they are intended to model.

Achieving this state requires moving beyond standard schema validation. While a schema ensures that a value is an integer or a string, it cannot verify that the value represents the correct business event. Semantic integrity requires three core technical shifts:

1. Contextual validation

Traditional data contracts often focus on structure — field names and data types. Semantic integrity requires contracts that include contextual constraints. For example, a contract for a “Refund” should not only validate that the amount is a decimal, but also ensure that the amount doesn’t exceed the original purchase value.
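A minimal sketch of such a contract in Python might look like the following. The `Purchase` and `Refund` shapes, the `ContractViolation` error and the positive-amount rule are assumptions made for illustration. Only the rule that a refund must not exceed the original purchase value comes from the text above.

```python
from dataclasses import dataclass
from decimal import Decimal


class ContractViolation(ValueError):
    """Raised when a message is well-formed but breaks a business rule."""


@dataclass(frozen=True)
class Purchase:  # hypothetical shape
    purchase_id: str
    total: Decimal


@dataclass(frozen=True)
class Refund:  # hypothetical shape
    purchase_id: str
    amount: Decimal


def validate_refund(refund: Refund, purchase: Purchase) -> None:
    # Structural check: what a traditional schema already covers.
    if not isinstance(refund.amount, Decimal):
        raise ContractViolation("amount must be a decimal")

    # Contextual checks: these need the business context, not just the type.
    if refund.purchase_id != purchase.purchase_id:
        raise ContractViolation("refund does not reference this purchase")
    if refund.amount <= 0:  # assumed rule, not stated in the article
        raise ContractViolation("refund amount must be positive")
    if refund.amount > purchase.total:
        raise ContractViolation("refund exceeds the original purchase value")


purchase = Purchase(purchase_id="P-1", total=Decimal("50.00"))
validate_refund(Refund(purchase_id="P-1", amount=Decimal("20.00")), purchase)  # passes
```

The structural check alone would accept a refund of 500.00 against a 50.00 purchase. The contextual check is what ties the number back to the event it is supposed to describe.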

2. Explicit domain mapping

When data moves between different parts of an organization, it often crosses context boundaries. A common cause of drift is the assumption that a term has a universal meaning across all services.

To maintain integrity, systems must use explicit translators or adapters at these boundaries. Instead of allowing a downstream marketing service to ingest a raw upstream billing object, the architecture should force an explicit mapping step. This step requires the engineer to define exactly how a “Billing Customer” maps to a “Marketing Lead.” If the mapping cannot be clearly defined, it’s a signal that semantic drift is occurring and needs to be addressed before the data is integrated.
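One way to picture that mapping step is a small translator at the boundary, sketched here in Python. All field names, the consent rule and the `UnmappableConcept` error are hypothetical. The point is only that the translation is written down in one place rather than assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillingCustomer:  # upstream billing context (hypothetical fields)
    customer_id: str
    legal_name: str
    email: str
    payment_method_token: str
    marketing_consent: bool


@dataclass(frozen=True)
class MarketingLead:  # downstream marketing context (hypothetical fields)
    lead_id: str
    display_name: str
    contact_email: str


class UnmappableConcept(Exception):
    """Raised when a billing customer has no meaning in the marketing context."""


def to_marketing_lead(customer: BillingCustomer) -> MarketingLead:
    """Explicit translation between the two contexts."""
    # Assumed rule: a customer without consent is not a lead at all.
    if not customer.marketing_consent:
        raise UnmappableConcept(
            f"customer {customer.customer_id} has not consented to marketing"
        )
    # Payment details are deliberately dropped: they mean nothing downstream.
    return MarketingLead(
        lead_id=f"lead-{customer.customer_id}",
        display_name=customer.legal_name,
        contact_email=customer.email,
    )


lead = to_marketing_lead(
    BillingCustomer(
        customer_id="C-42",
        legal_name="Ada Example",
        email="ada@example.com",
        payment_method_token="tok_hypothetical",
        marketing_consent=True,
    )
)
```

Because the downstream service only ever sees `MarketingLead`, a change to the billing object cannot silently change what marketing means by a lead. Someone has to edit the translator, and that edit is reviewable.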

3. Semantic observability

Historically, domain-driven design (DDD) was created as a design-first approach — a way to align stakeholders and engineers through whiteboards and static documents. The problem is that these artifacts are disconnected from the execution.

Instead of designing everything in advance, we could automatically run agents on every codebase change to assess its impact on our domain model. They could evaluate how the change alters user journeys or how it aligns with other domains, identifying possible misunderstandings early.

To solve the black box problem, we propose repurposing DDD tools to function as a telemetry mechanism that reflects the current state of our system. The goal is to stop writing documentation and start instrumenting the domain.
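What instrumenting the domain could look like is an open question. The Python below is a deliberately simple, deterministic stand-in rather than an agent: on each change it compares the class names found in the source against a glossary of agreed domain terms. The glossary, its definitions and the report format are all assumptions made for illustration.

```python
import ast
from typing import Dict, List, Set

# Hypothetical glossary of agreed domain terms and their definitions.
GLOSSARY: Dict[str, str] = {
    "ShoppingBag": "The set of lines a customer intends to buy.",
    "GiftCard": "A stored-value instrument, not a shippable item.",
}


def domain_terms_in(source: str) -> Set[str]:
    """Collect class names from Python source as a rough proxy for domain terms."""
    tree = ast.parse(source)
    return {node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef)}


def drift_report(source: str, glossary: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare the language used in the code with the agreed glossary."""
    in_code = domain_terms_in(source)
    in_glossary = set(glossary)
    return {
        "undefined_in_glossary": sorted(in_code - in_glossary),
        "missing_from_code": sorted(in_glossary - in_code),
    }


# A hypothetical change that introduces a catch-all concept.
CHANGED_SOURCE = """
class ShoppingBag:
    pass

class Product:
    pass
"""

print(drift_report(CHANGED_SOURCE, GLOSSARY))
```

A check like this only catches vocabulary that appears or disappears. It says nothing about a term whose meaning has shifted while its name stayed the same, and that harder case is where the agents described above would have to do the real work.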

Final thoughts

For decades, we treated black box systems as a failure of documentation or a byproduct of turnover, with tribal knowledge being our only hope. Today, with AI accelerating the rate at which we produce logic, the black box becomes a structural inevitability unless we change how we govern our systems.

Software is, and always has been, a linguistic artifact. If we lose control of the language, we lose control of the meaning it represents. Semantic integrity is not a one-time achievement; it’s a continuous act of keeping the language of our business alive within the heart of our code.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
