Processing massive volumes of data can take a toll on your network and systems. Moving very large data sets between nodes and systems uses a huge amount of bandwidth, slows other operations, and takes a lot of time. Data locality addresses that challenge by moving the significantly lighter processing code to the data instead.
What is it?
Data locality is the concept of moving processing code to where the data sits within your systems, instead of forcing huge data volumes through the network to be processed.
It’s used when the code required to process a data set is smaller than the data set itself — meaning that it’s more efficient and cost-effective to move the code to the data, rather than vice versa.
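To make that comparison concrete, here is a minimal sketch of the decision in Python. The function names and the sizes used are hypothetical, chosen only for illustration, and the model counts nothing but bytes sent over the network for a single storage node.

```python
def bytes_over_network(data_bytes: int, code_bytes: int, result_bytes: int) -> dict:
    """Estimate network traffic for one storage node under each strategy."""
    return {
        # Ship the whole partition out to wherever the code runs.
        "move_data_to_code": data_bytes,
        # Ship the code in, then ship only the result back out.
        "move_code_to_data": code_bytes + result_bytes,
    }


def prefer_data_locality(data_bytes: int, code_bytes: int, result_bytes: int) -> bool:
    """Return True when moving the code sends fewer bytes than moving the data."""
    cost = bytes_over_network(data_bytes, code_bytes, result_bytes)
    return cost["move_code_to_data"] < cost["move_data_to_code"]


# Hypothetical sizes: a 10 GiB partition, 50 KiB of code, a 4 KiB result.
GIB = 1024 ** 3
KIB = 1024
print(prefer_data_locality(10 * GIB, 50 * KIB, 4 * KIB))
```

A real system would also weigh factors this sketch ignores, such as how busy the node holding the data already is.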
It provides a simple way of reducing network traffic and optimizing bandwidth use, especially for organizations that frequently process very large data sets spread across multiple storage nodes.
What’s in it for you?
If your organization needs to process massive volumes of data, data locality can improve processing and execution times, and reduce network traffic. That can mean faster decision making, improved customer service and reduced costs.
It works by moving the computation to the node where the data actually resides, instead of moving large data sets to the computation. This means less traffic moving through your systems, a lower network burden, and far more efficient use of limited bandwidth, which in turn helps to reduce costs and increase overall network and system performance.
What are the trade-offs?
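The idea can be illustrated with a small simulation in Python. This is only a sketch under simple assumptions: the `Node` class and both helper functions are hypothetical, the "network" is just a count of values that leave a node, and the task is a plain sum.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Node:
    """A hypothetical storage node holding one partition of the data set."""
    name: str
    partition: List[int]

    def run_local(self, task: Callable[[List[int]], int]) -> int:
        # The task runs where the data lives; only its small result leaves the node.
        return task(self.partition)


def sum_with_locality(nodes: List[Node]) -> Tuple[int, int]:
    """Send the task to every node and combine the partial results."""
    partials = [node.run_local(sum) for node in nodes]
    # One value per node crosses the network.
    return sum(partials), len(partials)


def sum_without_locality(nodes: List[Node]) -> Tuple[int, int]:
    """Copy every row to a central worker, then run the task there."""
    rows = [row for node in nodes for row in node.partition]
    # Every row crosses the network.
    return sum(rows), len(rows)


nodes = [Node("node-a", [1, 2, 3]), Node("node-b", [4, 5]), Node("node-c", [6])]
print(sum_with_locality(nodes))     # (total, values sent over the network)
print(sum_without_locality(nodes))
```

Both approaches return the same total, but with locality the number of values sent grows with the number of nodes rather than with the size of the data.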
Data locality can’t be applied in every processing scenario. Depending on how data is distributed or placed, it may not deliver a significant efficiency gain, or may not be applicable at all.
You will also find that deploying and maintaining your application can be more complicated as it becomes more distributed.
How is it being used?
For thousands of teams and organizations that depend on Apache Hadoop or Spark as core parts of their data ecosystem, data locality is part of their everyday operations — helping them optimize bandwidth use and keep costs under control for routine data processing workloads.
Where applicable, it’s being used to bring computation closer to data, instead of continuously moving colossal data sets around, which clogs up networks and degrades system performance.