It's one of the most confusing industry terms at the moment... 'big data'. What does that really mean? Is there a classification model? What happens if it's big but not quite big enough... is that not-so-big-data?
The only thing that passes as a definition was coined in 2001 by Doug Laney, then an analyst at META Group (later bought by Gartner), who suggested the 3 Vs: volume, velocity and variety. Volume is obviously the amount of data; velocity is the rate of data churn, whether in arrival or in processing; and variety nowadays tends to apply to data from various sources and in various formats or structures.
In the years that followed, people tried to put numbers on the volume dimension, sometimes terabytes or even petabytes, and new Vs appeared. 'Veracity' is one of them: the correctness of the data (in other words, a measure of how much work is required to clean and process it). Another is 'value'. I think this is the most important of the Vs generally bandied about. I'll come back to that.
So how does this align with the world of analytics? The 3 Vs actually make no mention of the various types of analytics that have been in use for ages by statisticians, scientists, market researchers, mathematicians, and so on. Tools have existed there for years to analyse 'data'... after all, 'data' is all there is.
The wonderfully generic term 'analytics' actually means getting some chosen method and model of analysis as close to your (probably cleaned) data as possible. However, we know that seemingly new sources of data have become available in the last few years. This is commonly called the 'information explosion'. Indeed, it's estimated that 90% of all the world's data has been produced in the last two years. Data production rates are growing along with new sources of data, driven by the geometric growth of device adoption. 'Devices' used to mean mobile phones; now it means phones, tablets, fitness devices, home devices and various others, commonly aggregated as the 'internet of things'.
So this potentially presents traditional data analysis with the following problems:
1. What if we need to analyse real-time data feeds? Historically, most tools aren't really geared up for that.
2. What if we can't reduce the data set by sampling and still gain real insight?
Then we have to deal with larger volumes of data.
Many discussions on this topic jump quickly to certain technologies such as Apache Hadoop, which has become synonymous with big data. Sometimes these frameworks need to be deployed simply to cater to very large amounts of data and perform simple (or relatively simple) operations on them. More often, however, they are simply frameworks for running the analytical libraries of choice for the data set we are dealing with. Dealing with these frameworks has its drawbacks: developer testing cycle times are longer, and handling large amounts of data incurs logistical considerations.
So what can we do?
Remember the 'other' V... or one of them? 'Value'. If we start with the truism that data is all there is, then in fact data analysis to provide value is all there is. We need to do just that. If we get value by using small amounts of data, great. If we need to analyse mountains of data, OK, then we need to consider some of the bigger frameworks, but work up to it. Start small and build up. This is aligned with the tenets of agile analytics: brief test-and-learn cycles providing value early. If you have to wheel out the big guns, do so when you need to. It's the analysis that's important. Of course, in doing so one finds that the analytics have to be adapted to work in a MapReduce style if Hadoop is used. There are other ways of analysing large amounts of data, such as MPP databases, but that's another article!
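To give a feel for what adapting an analysis to a MapReduce style can involve, here is a minimal sketch in plain Python rather than Hadoop itself. It computes an average by reducing each chunk of data to a (sum, count) pair and then merging the pairs, which is the shape a framework would need in order to spread the work across machines. The function names and sample numbers are made up for illustration.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarise one chunk of numbers as a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Reduce step: merge two (sum, count) pairs into one."""
    return (a[0] + b[0], a[1] + b[1])


def mapreduce_mean(chunks):
    """Compute the overall mean without ever holding all the data at once."""
    total, count = reduce(combine, map(map_chunk, chunks), (0, 0))
    return total / count if count else None


# Hypothetical data, already split into chunks as a framework might do.
chunks = [[10, 20, 30], [40, 50], [60]]
print(mapreduce_mean(chunks))
```

The point of the sketch is that an average cannot simply be averaged again across chunks; the analysis has to be restated in terms of partial results that can be merged, and that restating is the adaptation work.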
Finally, the same is broadly true for types of analysis, which can be broken down into descriptive, predictive and prescriptive analytics. Again, a much deeper discussion would do this area justice, and this summary oversimplifies it. But one should only go as far as the problem at hand and the acceptable level of value dictate, i.e. go for the reward you need and no further. Many companies derive tremendous insight into their business by seeing a simple descriptive snapshot of themselves: counts, averages and other aggregates. Not complicated, but it can be very valuable.
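As a rough illustration of how little machinery a descriptive snapshot can need, the sketch below uses only the Python standard library. The order records and field names are invented for the example; any small table of business data would do.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical order records; the fields are assumptions for illustration.
orders = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 200.0},
    {"region": "east", "amount": 50.0},
]


def describe(records):
    """Return simple descriptive aggregates for a list of order records."""
    amounts = [r["amount"] for r in records]
    return {
        "count": len(amounts),
        "total": sum(amounts),
        "mean": mean(amounts),
        "median": median(amounts),
        "orders_by_region": dict(Counter(r["region"] for r in records)),
    }


snapshot = describe(orders)
for name, value in snapshot.items():
    print(name, value)
```

Nothing here needs a cluster. If a snapshot like this answers the business question, that may be as far as the analysis has to go.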
So hopefully it can be seen that big data is too diverse for simple definitions. Follow the value, and use the data in the best way dictated by the problem at hand.
Start small and build up.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.