
Big data, fast data — Part Three

In this final part of the three-part series, we’ll look at the Internet of Things world from a data integration and application perspective to appreciate the design challenges we’ll face in a world where everything is connected.
 

Integration of things data

The rapid emergence of IoT cannot be explained by advances in CPU processing alone. Increasing volumes of data are meaningless unless they can be consumed efficiently and quickly, and this is where standards have evolved to provide the necessary catalyst.

Historically, data was communicated predominantly in either binary or XML formats. These were adequate for their original purposes, such as interprocess communication and web services data exchange. Our modern “everything is digital” world is different: it needs data that is not only open but concise and comprehensible, capable of being understood by humans as well as machines.

The ability to observe data in transit, even at high speed, makes it possible to produce modern, high-quality IoT solutions that can be updated iteratively. This was unheard of in legacy IoT solutions, where data was encoded in highly proprietary, and often compressed, formats as a way to circumvent device and network limitations.

Many of these limitations weren’t simply technology constraints but commercial ones. The data in its original (and readable) form was simply too expensive to transmit over cellular networks — a handicap that telecommunications providers have only recently started to acknowledge and adapt to.


JSON — One format to rule them all

The poster child in the area of data visibility is JSON (JavaScript Object Notation), a lightweight data interchange format designed for both humans and machines.

Figure 1: An example of the JSON data format

While JSON is core to almost all HTTP/REST based software architectures, it has also found a home in the IoT world, mostly due to the ease with which it can be generated and manipulated. One argument against JSON has been its relative inefficiency compared to binary encoded protocols. The majority of these concerns can be alleviated by fast and efficient binary serialisation protocols such as MessagePack and CBOR (Concise Binary Object Representation), which provide the readability benefits of JSON combined with the compactness of binary objects. You’ll see an increasing number of IoT platform providers supporting these binary serialisation standards in addition to JSON.
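
To make that tradeoff concrete, here’s a minimal Python sketch that encodes the same record as JSON, MessagePack and CBOR and compares the wire sizes. The payload and field names are invented for illustration, and it assumes the third-party msgpack and cbor2 packages are installed.

```python
import json

import cbor2    # pip install cbor2
import msgpack  # pip install msgpack

# A hypothetical telemetry payload, similar in spirit to Figure 1.
reading = {
    "device_id": "sensor-42",
    "timestamp": "2021-01-01T12:00:00Z",
    "temperature_c": 21.4,
    "humidity_pct": 48,
}

# Encode the same record three ways and compare the wire size.
as_json = json.dumps(reading).encode("utf-8")
as_msgpack = msgpack.packb(reading)
as_cbor = cbor2.dumps(reading)

print(f"JSON:        {len(as_json)} bytes")  # human-readable, largest
print(f"MessagePack: {len(as_msgpack)} bytes")
print(f"CBOR:        {len(as_cbor)} bytes")
```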


MQTT — It’s all in the delivery

Just having a great data format doesn’t guarantee efficient communication and delivery of IoT data to where it can actually be consumed. We also need an effective, performant and secure mechanism for moving data both to and from devices. Standard protocols have emerged in this space but, as with JSON, a few dominant candidates have appeared.
Figure 2: An IoT data delivery mechanism

MQTT (Message Queue Telemetry Transport) is the main contender for the data transport protocol. The industry soon realised that HTTP, with its request-response approach, isn’t necessarily the best fit for fast-moving, bite-sized pieces of data that need some guarantee of delivery over lossy, unpredictable mobile networks. MQTT has advantages here and keeps power requirements under control too. Did I mention it’s also an open standard to boot?
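
As a rough sketch of what this looks like in practice, the snippet below publishes a JSON payload over MQTT with QoS 1 (“at least once” delivery) using the popular paho-mqtt Python client. The broker address, topic and payload are invented for illustration.

```python
import json

import paho.mqtt.publish as publish  # pip install paho-mqtt

BROKER_HOST = "broker.example.com"  # hypothetical broker address

payload = json.dumps({"device_id": "sensor-42", "temperature_c": 21.4})

# QoS 1 means "at least once": the broker must acknowledge receipt,
# a delivery guarantee that plain HTTP request-response doesn't give
# us cheaply over a lossy mobile link.
publish.single(
    topic="devices/sensor-42/telemetry",
    payload=payload,
    qos=1,
    hostname=BROKER_HOST,
    port=1883,
)
```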
 

Standards are our friends

We’ve already touched on some examples of the myriad of standards we encounter in the IoT world. At every single stage of the IoT data lifecycle, from event data generation on the device to data consumption, you’ll find a smorgasbord of available standards to choose from. Much like the offerings on a smorgasbord, they don’t all necessarily work well together.

Each standard you adopt will impact the structure and visibility of the underlying data in some way. Some standards are more prescriptive than others, and vendor-driven standards may sacrifice openness for the sake of (vendor device) compatibility. What’s clear is that there is no worthwhile pretender to the throne in terms of “one standard to rule them all”, and there never will be.

Despite this, there are some really interesting efforts to address this data standard fragmentation issue. One such example is the Web of Things (WoT), which is designed to let you communicate with the devices associated with physical objects in a way that is agnostic to hardware, software and underlying protocols. Adoption of standards such as this will play an important role in removing the constraints on the consumption of IoT data.
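
To give a flavour of the approach, here’s a simplified Thing Description for a light bulb, expressed as a Python dict. The field names follow the W3C WoT vocabulary, but this is an illustrative sketch rather than a complete, validated document, and the URLs are hypothetical.

```python
# A simplified W3C Web of Things "Thing Description" for a light bulb.
# Illustrative only: a real TD also carries security metadata, and the
# URLs below are hypothetical.
thing_description = {
    "@context": "https://www.w3.org/2019/wot/td/v1",
    "title": "KitchenLamp",
    "properties": {
        "on": {
            "type": "boolean",
            "forms": [{"href": "https://lamp.example.com/properties/on"}],
        },
    },
    "actions": {
        "toggle": {
            "forms": [{"href": "https://lamp.example.com/actions/toggle"}],
        },
    },
}
```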


Stay away from the edge

Despite the increasing move towards sending everything to the cloud, there are many “edge” cases where this doesn’t make sense on either a commercial or a technical basis. For example, it makes little sense to route every command from a light switch to a light bulb via the cloud (yes, that is being done, but that doesn’t make it right). The introduction of edge servers or gateways (yes, those gateways of old, but cheaper and faster) gives us another mechanism for the control and processing of data.
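
A minimal Python sketch of that idea is shown below: latency-sensitive commands are handled entirely on the gateway, while sensor readings are aggregated locally and only summaries are forwarded to the cloud. All device and cloud calls are hypothetical placeholders.

```python
import statistics
import time

def set_bulb(on: bool) -> None:
    print(f"bulb on={on}")  # stand-in for a local network call

def send_to_cloud(summary: dict) -> None:
    print(f"uploading {summary}")  # stand-in for an MQTT/HTTP upload

window: list[float] = []

def on_switch_pressed(switch_state: bool) -> None:
    # The light switch never needs a round trip to the cloud:
    # the gateway actuates the bulb directly on the local network.
    set_bulb(on=switch_state)

def on_sensor_reading(value: float) -> None:
    # Forward a summary of every 60 readings instead of each raw value,
    # trading granularity for bandwidth and cost.
    window.append(value)
    if len(window) >= 60:
        send_to_cloud({
            "ts": time.time(),
            "mean": statistics.mean(window),
            "max": max(window),
        })
        window.clear()
```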
 

Timing is everything

“Observe due measure, for right timing is in all things the most important factor.” This quotation from Hesiod, the Greek poet, neatly summarises the temporal aspect of IoT and its overall value in relation to time. We’ve discussed many technological aspects of the challenges in handling IoT data, but the most important aspect of all is time.

The end goal of any IoT solution is to make sense of the never-ending (and ever increasing) torrent of incoming data in order to derive actionable insights. The longer this data processing takes, the less useful those insights will be. A good example is that of a flood sensor: every additional second of delay in the sensor reporting a water leak can result in rapidly escalating water damage.

At the other end of the spectrum, consider the business’s ability to make informed judgements based on the intelligence received from IoT analytics. The longer these analytical activities take, the less likely it is that the insights will retain their potency and thus their competitive value. We discussed this concept in Part Two, where we used the term “freshness” in relation to data.

One additional point to consider: the longer you wait to process the data, the longer it takes to process, since you’ll have accumulated much greater data volumes in the meantime. The moral of the story is to make eliminating data latency a core design principle.
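
As a small illustration of this principle, the sketch below treats the age of each event as a first-class signal and routes anything past a freshness budget away from the real-time path. The budget and the handler functions are invented for illustration.

```python
from datetime import datetime, timezone

FRESHNESS_BUDGET_S = 30  # an invented figure for illustration

def act_on(event: dict) -> None:
    print(f"acting on {event}")  # stand-in for the real-time response path

def archive(event: dict) -> None:
    print(f"archiving {event}")  # stand-in for the analytical store

def handle_event(event: dict) -> None:
    produced = datetime.fromisoformat(event["timestamp"])
    age_s = (datetime.now(timezone.utc) - produced).total_seconds()
    # The older the event, the less its insight is worth; past the
    # budget it only has historical value.
    if age_s <= FRESHNESS_BUDGET_S:
        act_on(event)
    else:
        archive(event)

handle_event({"timestamp": "2021-01-01T12:00:00+00:00", "leak": True})
```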


Applications of things

In the preceding sections, we’ve described the long and winding data journey from the device to the cloud. We know that different problems will require different approaches to handling the device data along this journey. Certain solutions will mandate that the data be delivered as quickly as possible, while others may sacrifice urgency for data quality in order to provide the most accurate prediction for a mission-critical business analysis scenario.

Real-world IoT solutions generally combine several approaches, which usually means that data will be transmitted, processed and stored many times over. While this may appear inefficient, it’s an inevitable consequence of the architectural tradeoffs between cost, performance (timeliness) and storage (retention). Cost-efficient IoT data retention approaches (e.g. AWS S3) were never designed to be performant, while performance-efficient analytical data stores (e.g. AWS Redshift) were never designed to be cost-efficient. Each IoT solution design generally consists of a number of overlapping architecture design patterns, each optimised for a particular set of tradeoffs. The diagram below illustrates this at a high level.
Figure 3: A high-level overview of an IoT solution consisting of a number of overlapping architecture design patterns
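
In code terms, the overlap often shows up as a simple fan-out at ingestion time. The sketch below is a deliberately minimal illustration of that idea, with both sinks left as hypothetical placeholders.

```python
# A sketch of the fan-out implied by Figure 3: the same event is written
# more than once, each copy into a store optimised for a different
# tradeoff. Both sinks are hypothetical placeholders.

def write_to_object_store(event: dict) -> None:
    """Cheap, durable retention (e.g. AWS S3): optimised for cost, not speed."""
    ...

def write_to_analytics_store(event: dict) -> None:
    """Fast queries (e.g. AWS Redshift): optimised for speed, not cost."""
    ...

def ingest(event: dict) -> None:
    write_to_object_store(event)     # keep everything, cheaply
    write_to_analytics_store(event)  # keep what analysts query, hot
```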

While this is a high-level conceptual view of a typical IoT solution from a data standpoint, there are some key takeaways that apply to the majority of these solutions. 

An event-driven architecture governs the IoT data pipeline. We have at our disposal an ever-increasing selection of related design patterns, techniques and technologies that are fit for purpose for the problem at hand. More importantly, the IoT application engineer now has a great deal more flexibility in accessing and controlling the data for their end-user applications, whatever the design constraints.

We’ve already discussed the variety of applicable open data formats available for use. When you combine this with the emerging use of alternative IoT embedded programming languages (Rust, Go, Java, JavaScript, MicroPython), we can see that modern developers are no longer constrained to building IoT solutions in C/C++ using proprietary data encoding formats. This polyglot approach to interacting with IoT data not only enables considerably increased developer productivity but has had a demonstrable impact on the agility of IoT solutions.

This in turn has accelerated the rate of IoT innovation, as developers can use the tools and techniques they’re familiar with to interact with IoT data — that could be JavaScript on the device or Python in the data warehouse. A beneficial side effect of this is the active nurturing of IoT developer communities, as engineers are encouraged to exchange knowledge on the design and building of IoT solutions globally. The bar has been lowered significantly for developers exploring how to access, consume and control IoT data compared to just a few years ago. The more developers that join this community, the more exciting innovation and applications we can expect to see in the market that use the data of things.
 

Big data — Connecting the dots

Attempts to articulate the fundamental concepts of “Big Data” have outlined what is known as the “four V’s”: volume, velocity, variety and veracity. Various researchers have since extended these definitions to seven V’s, adding value, volatility and validity, and these align well with many of the topics we have discussed in these articles.

Fundamentally, all of the key aspects of designing data architectures for IoT solutions can be distilled into these seven areas. They can be considered the tradeoffs in any large IoT solution design, and we have already discussed all of them, perhaps using different terms. The illustration below summarises how these big data tradeoffs relate to the various topics we’ve covered in these articles.


Figure 4: The seven areas of tradeoffs in any large IoT solution 

Complex, not complicated

Over the course of this series we have described the many diverse aspects of the data generated and consumed in the Internet of Things world. We’ve explored the key areas to consider when evaluating the data requirements for an IoT solution, together with the challenges and associated recommendations for overcoming them.

There’s a popular misconception that IoT systems are complicated, when in reality they are mostly complex. Whenever we’re dealing with IoT solutions, we’re in fact dealing with things that exist in the real world, and these things are subject to the constraints of real-world systems, which are by their nature complex. The goal of a given IoT solution is simply to provide a digital abstraction of part of the real world in order to derive some benefit. Complication sets in when we become too ambitious and unrealistic about our ability to create this digital model: we often don’t fully understand what the right data to collect is, and we usually accumulate far too much of the wrong data before we realise it.

These commonly observed pitfalls are in fact very similar to those of a so-called Wicked Problem: a problem that is incredibly difficult to solve, if it can be solved at all. There is no perfect solution design for large-scale IoT data architectures; there are simply solutions that are better or worse. This area is innovating so fast that before a solution has even been partially completed, the goalposts have probably already moved. The exhaustive collection of available components for solving these problems is actually part of the problem itself, in that it impedes progress.

The recommendation for preventing your particular IoT solution from veering into Wicked Problem territory is to be as realistic as possible about the purpose of the solution and equally pragmatic about the technology approaches used to address the requirements. Accept that the available technology is continuously evolving, and adopt an iterative approach that is agile enough to adapt as both the problem and the solution inevitably evolve.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
