BIG Data, Fast Data — Part Two | Thoughtworks Germany

Tom Glover

Published: October 23, 2019

In this, the second of a three-part series, we’ll look at the Internet of Things world from a data perspective in order to appreciate the design challenges we face in a world where everything will be connected. Read Part One here.

Processing of Things’ Data

No discussion of the data explosion delivered by IoT would be complete without thinking about cloud computing. Historically, any requirement to process off-device data would necessitate expensive servers with expensive licensing. There were always bottlenecks in terms of costs and time to provision new servers and new software. Those constraints are now evaporating. Cloud providers deliver readily available and affordable services that can be provisioned in an instant to collect and process the deluge of data communicated between things and the digital world.

You don’t know what you don’t know

One of the greatest challenges of designing IoT platforms is predicting future data processing requirements, given the considerable uncertainty involved. Whether it’s a startup with ambitious customer (and device) adoption projections or an established manufacturer wishing to transform its existing (and well-understood) machine monitoring solution, it’s incredibly difficult to plan for the long term.

The insights gleaned from IoT data can take time to accumulate (seasonality being a key example), while other scenarios may require existing data to be reprocessed after a machine learning algorithm has been tweaked. For example, computer vision technologies are evolving so rapidly that newer algorithms may provide completely new insights, significantly better predictive quality than their predecessors, or both. The flexibility to adapt the data processing, or to reprocess existing data, is a powerful capability when working with the potentially large data volumes we are considering.

Compute, network and storage are now practically unconstrained in real terms (fees may apply). More importantly, cloud providers such as Amazon Web Services (AWS) are acknowledging the increasing importance of IoT. They are providing more and more services tailored to these demands, which also allow IoT data to be coupled rapidly to their existing and emerging services.

Even in recent years, we’ve seen rapid technology innovation in how we consume IoT data. Serverless offerings, such as AWS Lambda, now provide completely dynamic compute capability that starts to deliver on the promise of utility computing. This is radically changing how we design architectures to manage this data, and how quickly we can build and adapt such architectures to changing data demands.
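To make the serverless idea a little more concrete, here is a minimal sketch of a function that could be invoked once per incoming device message. It follows the usual `(event, context)` handler convention for Python functions on AWS Lambda. The message fields, the validation rule and the assumption that the event arrives as an already-decoded dictionary are all hypothetical and only there for illustration.

```python
import json

# Hypothetical message schema, assumed for illustration only.
REQUIRED_FIELDS = ("device_id", "timestamp", "temperature")


def lambda_handler(event, context):
    """Validate one device message and normalise its fields."""
    missing = [field for field in REQUIRED_FIELDS if field not in event]
    if missing:
        return {"status": "rejected", "missing": missing}

    reading = {
        "device_id": event["device_id"],
        "timestamp": int(event["timestamp"]),
        "temperature_c": float(event["temperature"]),
    }
    # A real function would forward the reading to a stream or data store here.
    print(json.dumps(reading))
    return {"status": "accepted", "reading": reading}
```

There is no server to provision in this model: the platform runs as many copies of the function as the incoming message rate requires, and the code itself stays the same whether the fleet is ten devices or ten thousand.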

You’re gonna need a bigger boat

Estimating the data storage needs of an IoT solution can be even more challenging and complex than estimating its data processing requirements. We can make some rudimentary empirical predictions based on our understanding of the solution. But at the outset we’re unlikely to appreciate the complexity of all the data touchpoints and repositories present in a large solution design. The diagram below illustrates a simplified example of the role data storage plays at various points in the lifecycle of an IoT message arriving from a device.

Figure 1: IoT includes many hidden storage requirements

An incoming IoT message leaves a digital “footprint” multiple times throughout its lifecycle. This fact is commonly overlooked when assessing the commercial storage implications of the various IoT cloud platform providers.
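A rough way to see the effect of these repeated footprints is to add up what each storage touchpoint holds once the platform reaches steady state. The sketch below does exactly that. The stage names, message sizes, retention periods and fleet size are made-up assumptions, not figures from any real platform.

```python
from dataclasses import dataclass


@dataclass
class StorageTouchpoint:
    name: str               # hypothetical stage in the message lifecycle
    bytes_per_message: int  # size of one message as stored at this stage
    retention_days: int     # how long this stage keeps each message


def steady_state_bytes(devices, messages_per_device_per_day, touchpoints):
    """Bytes held per touchpoint once every retention window is full."""
    daily_messages = devices * messages_per_device_per_day
    return {
        tp.name: daily_messages * tp.bytes_per_message * tp.retention_days
        for tp in touchpoints
    }


# Illustrative assumptions: 10,000 devices sending one message per minute.
touchpoints = [
    StorageTouchpoint("ingest_queue", 200, 1),
    StorageTouchpoint("raw_archive", 200, 365),
    StorageTouchpoint("operational_db", 600, 30),
    StorageTouchpoint("analytics_store", 350, 90),
]
footprint = steady_state_bytes(10_000, 1_440, touchpoints)
total_bytes = sum(footprint.values())

for name, size in footprint.items():
    print(f"{name}: {size / 1e9:.1f} GB")
print(f"total: {total_bytes / 1e12:.2f} TB")
```

Even in this toy model the same 200-byte message is counted four times, and the stages with long retention or bulkier stored formats dominate the total.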

Speed comes at a price

In addition to the fact that message data may be present in many areas of the platform at one time, there is a powerful multiplier that can significantly affect the performance design of the platform: the frequency at which devices send messages and the rate at which those messages are processed. The tradeoffs to consider essentially come down to how quickly you want to receive incoming data (device frequency) and how quickly you want to process it (cloud platform processing). How important realtime (or near-realtime) data is to the business will determine the quantity and type of data processing components that must be enabled in the platform to handle the IoT “firehose” of streaming data.
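The multiplier can be illustrated with some simple arithmetic: the arrival rate is the fleet size divided by the reporting interval, and the number of messages being processed at any moment is roughly the arrival rate multiplied by the processing time per message. The sketch below assumes a steady, evenly spread message flow. The fleet size and timings are hypothetical.

```python
import math


def messages_per_second(devices, interval_seconds):
    """Average arrival rate for a fleet reporting at a fixed interval."""
    return devices / interval_seconds


def required_concurrency(devices, interval_seconds, processing_ms):
    """Parallel workers needed to keep up with a steady arrival rate."""
    in_flight = devices * processing_ms / (interval_seconds * 1000)
    return math.ceil(in_flight)


# Illustrative assumptions: 10,000 devices, 50 ms of processing per message.
for interval in (60, 10, 1):
    rate = messages_per_second(10_000, interval)
    workers = required_concurrency(10_000, interval, 50)
    print(f"every {interval}s: {rate:.0f} msg/s, about {workers} workers")
```

Moving the same fleet from one message per minute to one per second multiplies both the ingest rate and the processing capacity by sixty, which is why the business case for realtime data deserves scrutiny before the platform is sized.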

It’s all in the packaging

Things’ data may exist in many forms during its lifecycle from device to cloud. It may start its journey wrapped in a concise but human-unreadable binary encoding format, such as CBOR, and spend its final days hiding in the vastness of a DynamoDB database. As shown in the previous illustration, it can also exist in many other formats at the same time. Each of these data formats has its relative advantages and disadvantages, depending on the particular use case. For example, JSON may be easier for mortals to read and quick to encode and decode, but it can result in vastly larger data storage requirements in the long term. Mixing and matching data formats can be a good way of optimising for speed versus storage, but it can be very difficult to predict the resulting costs, owing to the unexpected overhead penalties incurred by various data storage services. You can get a long way with a spreadsheet model for data projections for an IoT design, but I’d highly recommend you validate this with measurements of the actual data usage when your platform of choice goes live.
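The size difference between a readable and a compact encoding is easy to measure. The sketch below encodes the same reading as JSON and as a fixed-layout binary record using Python’s standard `struct` module. Note that this is a stand-in for a compact binary format and not CBOR itself, which would need a CBOR library. The field names and the record layout are assumptions made for the example.

```python
import json
import struct

# Assumed layout: little-endian uint32 device id, uint32 epoch seconds,
# float32 temperature, float32 humidity. This is not CBOR.
READING_FORMAT = "<IIff"


def encode_json(device_id, timestamp, temperature, humidity):
    payload = {
        "device_id": device_id,
        "timestamp": timestamp,
        "temperature": temperature,
        "humidity": humidity,
    }
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


def encode_binary(device_id, timestamp, temperature, humidity):
    return struct.pack(READING_FORMAT, device_id, timestamp, temperature, humidity)


def decode_binary(blob):
    return struct.unpack(READING_FORMAT, blob)


sample = (1234, 1571788800, 21.5, 40.25)
as_json = encode_json(*sample)
as_binary = encode_binary(*sample)
print(f"JSON: {len(as_json)} bytes, binary: {len(as_binary)} bytes")
```

The binary record is 16 bytes and the JSON form is several times larger, because JSON repeats every field name in every message. The price of the compact form is that both ends must agree on the layout in advance, and nobody can read the payload without a decoder.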

Best before

We frequently talk about the temperature of data when describing how frequently data is accessed in a Big Data scenario. Terms such as hot, warm and cold are used to classify different types of data in terms of how they’re stored and accessed. This can help as part of an architecture design to ensure we get the best tradeoff of cost versus performance. These approaches still apply to IoT data but we have an extra level of complexity to consider and that is data freshness.

IoT data is generally considered to have a short lifespan, as far as value is concerned. A change in some variable in the physical world generates a data event that’ll be consumed by interested parties. The longer it takes to receive and process that event, the less relevant the information is and thus the less value it provides. Obviously, stale data still retains some degree of value, but mostly when aggregated for trend analysis and predictive scenarios. The design goal is to optimise the platform so that the “freshest” data is stored on “hot” storage mechanisms, while the “stalest” data is retained in the most cost-effective storage medium possible.
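One simple way to express this design goal is a rule that maps the age of an event to a storage tier. The sketch below is only an illustration: the 24-hour and 30-day thresholds are arbitrary assumptions, and a real platform would derive them from how quickly its own data loses value.

```python
from datetime import datetime, timedelta, timezone

# Assumed thresholds, chosen purely for illustration.
HOT_MAX_AGE = timedelta(hours=24)
WARM_MAX_AGE = timedelta(days=30)


def storage_tier(event_time, now):
    """Pick a storage tier from the age of an event."""
    age = now - event_time
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"


now = datetime(2019, 10, 23, 12, 0, tzinfo=timezone.utc)
for age in (timedelta(hours=1), timedelta(days=7), timedelta(days=90)):
    print(age, "->", storage_tier(now - age, now))
```

In practice a rule like this tends to be applied by a scheduled job or by the lifecycle policies of the storage service, rather than by the code that ingests each message.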

The Big Chill

We’ve already discussed how to ensure that the freshest data is available on demand as quickly as possible. The question remains: what do we do with all of the accumulating data that has passed its “best before date”? The approaches taken generally mirror the real-world scenario in which we have too much “stuff” that we’re reluctant to dispose of for many reasons, mostly sentimental. There’s a perception that we must keep everything just in case a tiny fragment of that data may be needed some day. So, we resort to choosing the most cost-effective long-term storage mechanism available, such as Amazon S3 Glacier. This is a valid design decision, but it’s all too easy to forget about this data, and the cost accumulates significantly over time. The longer we archive this data and the more data we have, the harder it is to make the decision to purge it. What’s often overlooked is the challenge of then “defrosting” this deep-frozen data, which has to be placed in a “warm” storage medium before it can be accessed. It’s difficult to predict how often we will need to perform these de-archiving activities, and each one carries a cost premium. Remember: this is cloud, everything comes at a price.
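The tradeoff between cheap archiving and expensive defrosting can be explored with a small cost model. The sketch below uses arbitrary cost units, not real prices from any provider, and assumes a constant stored volume and a flat per-gigabyte retrieval charge. All of the rates are placeholders to be replaced with measured figures.

```python
def storage_cost(stored_gb, months, storage_rate, retrievals=0,
                 gb_per_retrieval=0, retrieval_rate=0):
    """Total cost of keeping data in one tier, including retrieval charges."""
    storage = stored_gb * months * storage_rate
    retrieval = retrievals * gb_per_retrieval * retrieval_rate
    return storage + retrieval


def break_even_retrievals(stored_gb, months, cold_rate, warm_rate,
                          gb_per_retrieval, retrieval_rate):
    """Number of retrievals at which cold storage stops being cheaper."""
    saving = stored_gb * months * (warm_rate - cold_rate)
    return saving / (gb_per_retrieval * retrieval_rate)


# Placeholder figures in arbitrary cost units per GB.
warm = storage_cost(1_000, 12, storage_rate=5)
cold_untouched = storage_cost(1_000, 12, storage_rate=1)
cold_defrosted = storage_cost(1_000, 12, storage_rate=1, retrievals=5,
                              gb_per_retrieval=1_000, retrieval_rate=10)
limit = break_even_retrievals(1_000, 12, 1, 5, 1_000, 10)
print(warm, cold_untouched, cold_defrosted, limit)
```

With these placeholder rates the archive is far cheaper if it is never touched, but five full retrievals in a year are enough to make it the more expensive option. The exact numbers do not matter. What matters is that the answer depends on a retrieval frequency that is hard to know in advance.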

The recommendation here is not to default to what appears to be the simplest option. We’d normally recommend that customers accumulate large volumes of data for a limited time period, then analyse it to understand which parts of the data are actually useful and to evaluate the real storage costs on the platform. Instead of simply archiving all raw data, we have a number of mechanisms at our disposal, ranging from efficient compressed data formats to data sampling techniques.
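As one example of a sampling technique, raw readings can be collapsed into fixed time windows that keep only a summary of each window. The sketch below keeps the minimum, mean and maximum per window. The window size and the choice of summary statistics are assumptions for illustration, and whether such a summary is good enough depends on what the analysis of the raw data showed to be useful.

```python
from statistics import mean


def downsample(readings, window_seconds):
    """Collapse (epoch_seconds, value) readings into per-window summaries.

    Returns a list of (window_start, minimum, mean, maximum) tuples,
    ordered by window start.
    """
    buckets = {}
    for timestamp, value in readings:
        window_start = timestamp - timestamp % window_seconds
        buckets.setdefault(window_start, []).append(value)
    return [
        (start, min(values), mean(values), max(values))
        for start, values in sorted(buckets.items())
    ]


raw = [(0, 1.0), (10, 3.0), (59, 2.0), (60, 10.0), (119, 20.0)]
summary = downsample(raw, 60)
print(summary)
```

Five raw readings become two summary rows here. On a device reporting every second, one-minute windows would cut the archived row count by a factor of sixty while still preserving the range and average needed for trend analysis.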

In Part Three of this three-part series, we will focus on the data integration and application aspects of the Internet of Things.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
