
The future of data is semantic

Most organizations will, at some point or other, have to deal with the "junk drawer" problem: the ever-growing collection of unstructured data in everything from PDFs to meeting notes that resist traditional attempts at organization.

 

Businesses have long recognized the value of structured data (customer databases, product catalogs, web analytics and so on) for the ways in which it can enable effective operations and support decision making. However, it’s only in recent years that organizations have started to explore the potential of their unstructured information (documents, images, audio and other miscellanea). The popularity of retrieval-augmented generation (RAG) and hybrid search software as applications of large language model (LLM) technologies reflects the widespread desire to make better sense of organizational junk drawers. These sensemaking activities also create a solid foundation on which to build new things, and are one form of so-called “AI readiness”.


However, it's not the generative power of LLMs that provides the real value for businesses, but rather their capacity to enable meaning-based associations, with semantic search acting as the crucial mechanism to unlock this potential.

Organizational data today: Where are we now?

 

Modelling and curating the conceptual relationships between digital entities has a rich history that underpins the idea of the Web itself, and its principles are rooted in much older fields such as information science and philosophy. However, thanks to technologies like vector databases and knowledge graphs it's now easier to surface, visualize and build upon semantic relationships in unstructured data.

 

The significance of this development shouldn’t be overlooked. We now have accessible tools that can turn unstructured data into "soft structured" data: highlighting existing semantic relationships between the ideas contained in technically unstructured datasets. 

 

Whether an organization uses these soft structures as passive signals to better understand the latent architecture of its own operations, takes action to shape these structures into formal ontologies or chooses an approach in between, the future of data is semantic. The digital junk drawer is no longer a problem to be managed, but a source of potential understanding and even organizational transformation.

What is semantic data?

 

In a very broad sense, semantic data refers to data enriched with meaning, context and structured relationships, allowing it to be interpreted and processed by computers in a way that aligns with human understanding. Semantic data explicitly captures the meaning behind the data through standardized vocabularies, clearly defined attributes and relationships. This explicit modeling makes it easier for machines to link and integrate information from diverse sources, in a manner that can be parsed by both humans and machines.

 

Traditionally, these structures and definitions are manually crafted by business analysts and data specialists working together to model the business in data. This manual approach to semantic enrichment is crucial for establishing clear, foundational knowledge structures.

 

However, a broader approach involves semantic associations uncovered through embedding models. These are relationships that arise organically from unstructured data itself, surfaced not through explicit modeling, but through learned patterns. This approach represents a shift from manual curation to a more dynamic method of extracting meaning, where the relationships and connections within data can be discovered rather than predetermined.

Why does semantic data matter?

 

Semantic data management techniques can help businesses stop fighting the existence of the organizational junk drawer, and begin to make better use of the things that are in it.

 

Practically every household sprouts a junk drawer where miscellaneous items gather and become increasingly difficult to retrieve, and the majority of organizations have digital equivalents. Think, for example, of:

 

  • Useful operational information that exists in departmental silos, with inconsistent naming conventions and storage practices. 

  • A key process that only exists in a pinned Slack conversation from seven years ago, or in a thrice-copied screenshot of a now-lost PDF. 

  • Strategic planning around products or brand positioning located in diffuse slide decks.

 

More than an inconvenience, this can lead to a significant loss of institutional knowledge, become a privacy and legal compliance risk, make version control impossible and impede accurate decision-making. Categorizing and labelling each file or message is an exhausting task, and any organizational system is bound to drift over time, just as the business and the individuals that make it up do not (cannot!) remain static.

 

However, taking a semantic approach to data makes it easier to locate the right things at the right time, no matter where in the junk drawer they are or how long they have been in there. Instead of struggling to piece together splinters of information, sources can be connected and analyzed holistically, which can lead people to clearer understandings and (hopefully) better decisions.

 

Consider, for instance, the useful observations that are shared with businesses through customer feedback. This type of data is often scattered across support tickets, social media posts, product reviews, survey responses, call recordings and more. By understanding these elements as data sources and using semantic techniques to work with them, the fragmented feedback can form a more comprehensive understanding of what customers are trying to say, revealing patterns and opportunities that would otherwise have remained invisible.

 

Rethinking our approach to AI readiness

 

Perhaps the most insidious consequence of unstructured data is the difficulty it creates for new tooling to help in its own excavation. Artificial intelligence and machine learning systems require clean, contextual and accessible information to function effectively. 

 

While hybrid search tools such as Redactive and Glean are powerful in their ability to index and retrieve unstructured information, if data is not already semantically enriched they are not likely to properly parse organization-specific terminology in context, find the most authoritative version of a document or distinguish between a work in progress and a finished product. 

 

They also don’t solve the inevitable junk-drawer drift: without someone invested in curating and corralling all that data over the long term, the muddle will recur. The alternative is to shift how we think about information.

Lilly Ryan, Thoughtworks
The demands of the modern business landscape make it tempting to believe that more data always leads to more insight. Semantic technologies show that it's not about how much we collect, but how we understand what we already have.
Lilly Ryan
Security Lead, Thoughtworks

Putting semantic data into practice

 

Putting this into practice is more than just a mindset shift. Semantic data is both founded upon and driving forward several key technologies. To properly understand it, it's worth having a mental model of this technology landscape.

 

Ontologies

 

Ontologies serve as the intellectual architecture of semantic data. They are formal, structured representations of knowledge that define not just the things within a domain, but their relationships to one another. While they require standardized vocabulary and syntax (partly to remove ambiguity for machines), effective ontologies reflect the natural language of an organization, instead of imposing artificial terminology. Tools like Protégé help to create these structures, but the real work lies in capturing the semantic richness of an organization's context through carefully defined concepts, attributes and relationships.

 

The word ontology is sometimes confused with taxonomy, but the distinction is important: where taxonomies primarily concern themselves with names and categories, ontologies go beyond this to describe the relationships between named entities.
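The distinction can be made concrete with a small sketch. The entity names and relation labels below (`is_a`, `connects_to`, `maintains`) are invented for illustration and are not drawn from any standard vocabulary. The taxonomy only records which category each name belongs to, while the ontology also records typed relationships between the named entities.

```python
# A taxonomy: names arranged into categories (each term has a single parent).
# All terms here are hypothetical examples.
TAXONOMY = {
    "Laptop": "Hardware",
    "Monitor": "Hardware",
    "Hardware": "Product",
}

# An ontology also records typed relationships between named entities,
# stored here as (subject, relation, object) triples.
ONTOLOGY = [
    ("Laptop", "is_a", "Hardware"),
    ("Monitor", "is_a", "Hardware"),
    ("Hardware", "is_a", "Product"),
    ("Laptop", "connects_to", "Monitor"),
    ("Support Team", "maintains", "Laptop"),
]


def ancestors(term, taxonomy):
    """Walk up the category tree and return every parent of the term."""
    chain = []
    while term in taxonomy:
        term = taxonomy[term]
        chain.append(term)
    return chain


def related(entity, ontology):
    """Return (relation, object) pairs where the entity is the subject."""
    return [(rel, obj) for subj, rel, obj in ontology if subj == entity]


print(ancestors("Laptop", TAXONOMY))  # category membership only
print(related("Laptop", ONTOLOGY))    # membership plus other relationships
```

The taxonomy can answer "what kind of thing is a laptop?", but only the ontology can also answer "what does a laptop connect to?". A real ontology built in a tool such as Protégé would be far richer than this list of triples.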

 

Vector search

 

Vector search represents perhaps the most transformative technology enabling semantic approaches to unstructured data. By using machine learning techniques to represent information numerically as vector embeddings, vector search allows similar data to be retrieved based on semantic relevance rather than exact keyword matches.

 

These embeddings function as points in a multi-dimensional space, where proximity reflects semantic similarity. When a search query or prompt is processed, it too becomes a vector. This allows the system to identify semantically similar content of many types (text, images, etc.), regardless of the specific keywords used, and retrieve information that is likely to be relevant. This capability is enabled by vector databases, a rapidly growing market with diverse offerings like Pinecone and Chroma that can be tailored to specific organizational needs.
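As a rough illustration of the idea, the sketch below ranks a few documents by cosine similarity to a query vector. The three-dimensional vectors are made up by hand for the example; a real system would obtain much larger embeddings from an embedding model and would typically store them in a vector database rather than a Python dictionary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical "embeddings", invented for illustration only.
DOCUMENTS = {
    "refund policy": [0.9, 0.1, 0.0],
    "returns process for faulty goods": [0.8, 0.2, 0.1],
    "office party planning": [0.0, 0.1, 0.9],
}


def search(query_vector, documents, top_k=2):
    """Return the names of the top_k documents closest to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in documents.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Assume this is the embedding of "how do I get my money back?"
QUERY = [0.9, 0.1, 0.05]
results = search(QUERY, DOCUMENTS)
print(results)
```

Note that the query shares no keywords with either of the documents it retrieves; the match comes entirely from the position of the vectors.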

 

Graph databases and knowledge graphs

 

Where ontologies are the conceptual underpinnings for semantic data, graph databases provide the technical foundation to implement these structures. Unlike traditional relational databases that store information in tables, graph databases organize data as interconnected nodes and edges, focusing primarily on efficiently representing and querying relationships.

 

Knowledge graphs extend this functionality to create semantically rich representations of real-world entities and their connections. They incorporate ontologies, inference rules and domain knowledge to not just store relationships between entities (like customers, products or actions), but to understand their meaning and enable reasoning about them. Where a graph database can show how two things are connected, knowledge graphs help users consider what those connections might mean.
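A minimal sketch of that difference, under invented assumptions: the entities, the relation names (`purchased`, `in_category`, `interested_in`) and the single inference rule are all hypothetical. The plain set of edges plays the role of the graph database, and the rule stands in for the reasoning a knowledge graph adds on top.

```python
# Edges as (source, relation, destination) triples: the "graph database" part.
EDGES = {
    ("alice", "purchased", "trail shoes"),
    ("bob", "purchased", "road bike"),
    ("trail shoes", "in_category", "running"),
    ("road bike", "in_category", "cycling"),
}


def neighbours(node, edges):
    """Show how a node is connected: its outgoing (relation, destination) pairs."""
    return sorted((rel, dst) for src, rel, dst in edges if src == node)


def infer_interests(edges):
    """Hypothetical rule: purchased(X, P) and in_category(P, C)
    together imply interested_in(X, C)."""
    categories = {src: dst for src, rel, dst in edges if rel == "in_category"}
    inferred = set()
    for src, rel, dst in edges:
        if rel == "purchased" and dst in categories:
            inferred.add((src, "interested_in", categories[dst]))
    return inferred


# The "knowledge graph" part: stored facts plus facts derived by the rule.
knowledge = EDGES | infer_interests(EDGES)
print(neighbours("alice", knowledge))
```

Nobody recorded that Alice is interested in running; that fact exists only because the rule gives the stored connections a meaning.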

 

Semantic layers

 

Semantic layers are a key element in semantic data. They bridge the gap between technical data structures and an organization's natural vocabulary, and allow people across the business to construct metrics that help them make better sense of their operations.
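To make this less abstract, here is a toy sketch of the idea. It does not reflect how any particular product works: the column names, status codes and metric definitions are all assumptions made for the example. The point is only that business terms are defined once, in one place, on top of technically named data.

```python
# Raw records with terse technical column names (hypothetical schema).
ORDERS = [
    {"cust_id": 1, "amt_cents": 2500, "sts_cd": "C"},  # completed
    {"cust_id": 2, "amt_cents": 4000, "sts_cd": "C"},  # completed
    {"cust_id": 1, "amt_cents": 1500, "sts_cd": "X"},  # cancelled
]


def completed(rows):
    """Keep only orders whose status code marks them as completed."""
    return [r for r in rows if r["sts_cd"] == "C"]


# The semantic layer: business vocabulary mapped to shared definitions.
METRICS = {
    "revenue": lambda rows: sum(r["amt_cents"] for r in completed(rows)) / 100,
    "active customers": lambda rows: len({r["cust_id"] for r in completed(rows)}),
    "cancellation rate": lambda rows: sum(r["sts_cd"] == "X" for r in rows) / len(rows),
}


def ask(metric_name, rows=ORDERS):
    """Look up a metric by its business name and compute it."""
    return METRICS[metric_name](rows)


print(ask("revenue"))
print(ask("active customers"))
```

Someone asking for "revenue" never needs to know that amounts are stored in cents or that "C" means completed, and everyone who asks gets the same definition.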

 

While semantic layers are well-established components of many enterprise business intelligence tools, they're also evolving to serve new functions, like integrating embedded web apps and chatbots with business data and supporting approaches like analytics-as-code. Semantic layers are most effective in conjunction with the other semantic technologies discussed here, because they make these structures accessible to users.


Tools like Cube, dbt Semantic Layer and GoodData are designed for metrics-focused semantic layers. Other solutions like OntoText, PoolParty, Sinequa and Semaphore offer their own distinct approaches, tailored to different use cases.

Potential approaches to semantic data

 

While semantic technologies are capable of helping an organization transform unstructured data into useful signals, any implementation should be approached strategically if it's to make a lasting difference. All approaches will be highly organization-specific, and can range from turning semantic insights into fully-structured ontologies to using them as passive signals about the soft structures of the business. We would recommend spending time observing any emerging patterns before incorporating them into a data strategy. Observe the desire paths in your organizational structures and pave them strategically.

 

As the current tooling landscape is highly volatile, a strategic approach should consider the health of the business's entire data ecosystem, both structured and unstructured. This means being careful not to let all of the insights, discovery and connection-making become locked into a single tool or platform that ingests from everything else and gives nothing back. It's worth the effort to ensure any labelling, categorization or insights about emergent soft structure are pushed back to enrich the sources where the data came from in the first place. 

 

Labelling and categorization are difficult, often tedious work (hence the inevitability of the junk drawer!), so feeding the results of that work back to the sources is one of the most valuable outcomes of whatever semantic technologies are adopted. Tooling will come and go, but business dynamics, concepts, and context will continue to exist and evolve in all of the places where work happens.

 

This approach can align well with data mesh architectures and data product thinking. Rather than trying to centralize everything, it leans into the tendency for smaller groups of motivated individuals and subject-matter experts to curate their own part of the world, and makes it easier for them to exchange that information with others.

 

Semantic technologies aren't really about eliminating the junk drawer; they’re about developing the capability to find what you need within it, understand how items relate to one another, and learn from the patterns of accumulation. By embracing these approaches, organizations can transform their unstructured data from a liability into a living archive of institutional knowledge that adapts and grows alongside the business itself.

Semantics over scale

 

The demands of the modern business landscape make it tempting to believe that more data always leads to more insight. Semantic technologies show that it's not about how much we collect, but how we understand what we already have. Vector search, knowledge graphs, and semantic layers allow businesses to see and make use of the connections and relationships already present in our organizational miscellanea. We don't need to fight the existence of the junk drawer; we can start to make better use of what is in it.

 

The advantages extend beyond "AI readiness". Ultimately, semantic data can bridge the gap between the messy and often unclear abstractions of which our data usually consists and the day-to-day language through which we actually work together and get things done. Organizational knowledge isn’t a problem to be solved, but a living ecosystem waiting to be understood.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
