Data locality is the practice of moving computation to the node where the data resides, rather than moving the data to the computation. This helps minimize network congestion and improve computation throughput.
Processing massive volumes of data can take a toll on your network and systems. Moving colossal data sets between nodes and systems uses a huge amount of bandwidth, slows other operations, and takes a lot of time. Data locality addresses that challenge by moving the significantly lighter processing code to the data instead.
The process of bringing computation closer to where data resides within your processing ecosystem.
Reduced network congestion, increased computation throughput and more efficient use of bandwidth.
It’s not always as efficient as it looks: challenges often surface when dealing with heterogeneous or simply very large clusters.
Today, it’s popular among teams working with large data sets as a straightforward way of combating excess bandwidth usage.
What is it?
Data locality is the concept of moving processing code to the data within your systems, instead of forcing huge volumes of data through the network to be processed.
It’s used when the code required to process a data set is smaller than the data set itself — meaning that it’s more efficient and cost-effective to move the code to the data, rather than vice versa.
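The size comparison above can be sketched as a rough calculation. This is only an illustrative model: the function names, the job sizes, and the assumption that the result of the computation is sent back over the network are all hypothetical and not taken from any particular framework.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def network_bytes(data_bytes: int, code_bytes: int, result_bytes: int,
                  move_code: bool) -> int:
    """Bytes that cross the network for one job under each strategy."""
    if move_code:
        # Ship the code to the node holding the data, then send back the result.
        return code_bytes + result_bytes
    # Ship the whole data set to a separate compute node;
    # the result is produced where it is needed.
    return data_bytes


def prefer_data_locality(data_bytes: int, code_bytes: int,
                         result_bytes: int) -> bool:
    """True when moving the code transfers fewer bytes than moving the data."""
    moved_code = network_bytes(data_bytes, code_bytes, result_bytes, True)
    moved_data = network_bytes(data_bytes, code_bytes, result_bytes, False)
    return moved_code < moved_data


# Hypothetical job: 10 GiB of data, 5 MiB of code, 1 MiB of results.
job = dict(data_bytes=10 * GIB, code_bytes=5 * MIB, result_bytes=1 * MIB)
print(prefer_data_locality(**job))
```

In this made-up job, moving the code transfers about 6 MiB instead of 10 GiB. When the data set is smaller than the code plus its results, the comparison flips and locality brings no benefit.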
It provides a simple way of reducing network traffic and optimizing bandwidth use, especially for organizations that frequently process very large data sets spread across multiple storage nodes.
What’s in it for you?
If your organization needs to process massive volumes of data, data locality can improve processing and execution times, and reduce network traffic. That can mean faster decision making, improved customer service and reduced costs.
It works by moving the computation close to where the data actually resides, instead of moving large volumes of data to the computation. This means less traffic moving through your systems, a lower network burden, and far more efficient use of limited bandwidth, which in turn helps reduce costs and increase overall network and system performance.
What are the trade-offs?
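One way to picture this is a scheduler that prefers to run each task on a node that already stores the block it needs. The following is a minimal sketch under simplifying assumptions, not how any specific framework implements scheduling; the block names, node names, and slot counts are hypothetical.

```python
from typing import Dict, Set, Tuple


def assign_tasks(block_locations: Dict[str, Set[str]],
                 free_slots: Dict[str, int]) -> Dict[str, Tuple[str, bool]]:
    """Map each block to (node, is_local), preferring nodes that hold the block."""
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignments: Dict[str, Tuple[str, bool]] = {}
    for block in sorted(block_locations):
        holders = sorted(block_locations[block])
        # First choice: a node that stores the block and has a free slot.
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fallback: any node with a free slot; the block must be fetched.
            remote = [node for node in sorted(slots) if slots[node] > 0]
            if not remote:
                raise RuntimeError(f"no free slot for {block}")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignments[block] = (node, is_local)
    return assignments


# Hypothetical cluster: node-c stores block-3 but has no free slot.
blocks = {
    "block-1": {"node-a", "node-b"},
    "block-2": {"node-b"},
    "block-3": {"node-c"},
}
slots = {"node-a": 1, "node-b": 1, "node-c": 0, "node-d": 1}
plan = assign_tasks(blocks, slots)
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

Here two of the three tasks run where their data lives, and only block-3 has to travel over the network because its node is busy.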
Data locality can’t be applied in every processing scenario. In some cases, the way data is distributed or placed means that data locality either doesn’t deliver a significant efficiency gain or isn’t applicable at all.
You will also find that deploying and maintaining your application can be more complicated as it becomes more distributed.
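The effect of data placement on locality can be illustrated with a small calculation. This sketch assumes one replica per block, one task per block, and that all tasks must start at the same time; real schedulers may instead wait for a local slot to free up. The node names and counts are hypothetical.

```python
from typing import Dict


def max_local_fraction(blocks_per_node: Dict[str, int],
                       slots_per_node: Dict[str, int]) -> float:
    """Largest share of tasks that can run on the node storing their block."""
    total = sum(blocks_per_node.values())
    if total == 0:
        return 1.0
    # A node can run at most as many local tasks as it has slots and blocks.
    local = sum(min(count, slots_per_node.get(node, 0))
                for node, count in blocks_per_node.items())
    return local / total


slots = {"node-a": 4, "node-b": 4, "node-c": 4}

# Evenly spread data: every task can run next to its block.
even = max_local_fraction({"node-a": 4, "node-b": 4, "node-c": 4}, slots)

# Skewed data: most blocks sit on one node, so many tasks must read remotely.
skewed = max_local_fraction({"node-a": 10, "node-b": 1, "node-c": 1}, slots)

print(even, skewed)
```

With evenly spread data every task can be local, while in the skewed layout only half of them can, even though the cluster has the same total capacity.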
How is it being used?
For thousands of teams and organizations that depend on Apache Hadoop or Spark as core parts of their data ecosystem, data locality is part of their everyday operations — helping them optimize bandwidth use and keep costs under control for routine data processing workloads.
Where applicable, it’s being used to bring computation closer to data, instead of continuously moving colossal data sets around — which clogs up your networks and impacts system performance.