Companies that aspire to achieve competitive advantage by using data as a key asset must build their execution plan around two phases, as described by HBR bloggers Redman & Sweeney: the Lab, where data scientists discover and refine models, and the Factory, where engineers turn those models into production applications.
For example, consider an online retail organisation trying to forecast the demand for the items in its inventory, with the objective of maximising sales conversion and minimising inventory carrying costs. Demand forecasting techniques that use historical data (primarily sales) to forecast future sales have been around for some time. However, the more aggressive players explore ways of honing these base models by exploiting the wealth of data at their disposal. Maybe the sales of a type of women’s accessory are seen to go up whenever the sales of a specific cut of jeans go up. Or the sales of a controversial book are observed to be affected by the sentiment expressed in tweets about it in the preceding few days.
During the Lab phase, the data scientists explore and experiment with data from various sources to identify the signals that impact sales of various items and how they could be correlated, with the objective of building a model capturing this interdependence. Once this model is codified (usually as a series of equations of some complexity), the next step is to build a robust and scalable application that runs the forecasting model every period: reading the data sources, extracting the defined signals, and producing the probable demand for the next period. This phase, where the data engineering team takes over to build the application, is referred to as the Factory phase.
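To make the Lab-to-Factory idea concrete, here is a minimal sketch of the retail example above: a simple least-squares model relating next-period sales of the accessory to current-period sales of the jeans. All figures and names are invented for illustration; a real model would involve many more signals and a far more sophisticated technique.

```python
def fit_ols(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

# Lab phase: hypothetical weekly figures used to discover the correlation.
jeans     = [120, 150, 130, 170, 160]   # jeans sales this week
accessory = [45,  58,  50,  66,  61]    # accessory sales the following week

a, b = fit_ols(jeans, accessory)

def forecast_accessory(jeans_sales):
    """Factory phase: the codified model, run every period on fresh data."""
    return a + b * jeans_sales

print(round(forecast_accessory(180), 1))  # → 69.8
```

The point of the two-phase split is visible even at this scale: `fit_ols` is the exploratory artefact, while `forecast_accessory` is what the engineering team would wrap in a robust, scheduled application.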
The so-called “Big Data technologies” of various strains have brought about a sea change in the approach to analytics, be it descriptive or predictive in nature. However, as with any technology solution, SMEs and startups demand a high degree of business responsiveness from an analytics solution as well. These high expectations can be met only if the tech team can achieve agility in both the phases described above. In the Lab phase, this means that data scientists need nimble, lightweight approaches and tools to explore and experiment with data, allowing analysts to fail fast at low cost. In the Factory phase, the engineering teams tasked with productionising the insights require platforms, frameworks, and tools that enable them to work iteratively and rapidly.
Approaches to adoption
Organisations are leveraging the ability of Big Data technologies to cheaply handle unstructured or semi-structured data in large volumes to bring agility into data mining and analysis. This enables the new breed of data scientists to experiment and fail fast with sophisticated modelling and/or machine learning techniques, shrinking the cycle time of taking newer models from conceptualisation to production. On one side, organisations like Amazon and Facebook have used these technologies to build complex applications that generate insights providing real competitive advantage, monetising the data they have collected. On the other, traditional organisations have also started adopting these technologies, revisiting the legacy approach of building Enterprise Data Warehouse (EDW) solutions. The traditional (waterfall-centric) approach to building EDWs, based on concepts like enterprise data modelling, holistic master data management strategies, and heavyweight enterprise data governance policies, is expensive and non-agile.
Given the open source nature of most Big Data technologies, many organisations, vendors and users alike, have been contributing back to the community. This rapid maturing of the stack is pushing adoption from innovators and early adopters to the mainstream in a very short period. We at Thoughtworks believe that a number of the advancements in the Big Data space over the last year will enable SMEs and startups to accelerate the adoption of advanced analytics.
Key trends enabling agility in Big Data Analytics
1. Lowering of the entry barrier:
Big Data on the Cloud: Capacity planning and operationalising an in-house Big Data environment takes considerable effort and becomes a barrier to entry for SMEs and startups. Several companies and open source projects have emerged to provide these infrastructural capabilities on the cloud, in both public and private flavours. In the Hadoop world, in addition to mature solutions like Amazon’s Elastic MapReduce, newcomers like Rackspace and OpenStack’s project Savanna, and startups like Qubole and Altiscale, are providing entire Hadoop ecosystems on the cloud. Additionally, most of the MPP database vendors, like Vertica and Teradata, have introduced cloud offerings in the recent past; most notable among them is Amazon’s Redshift. Value-added services on top of the basic Big Data environment augment the core infrastructure with critical functions like the ability to manage data processing workflows, schedule jobs, or import and export data from other data sources. With such plumbing work out of the way, organisations can quickly put their solutions into production and extract business value economically.
2. Deepening of the capabilities of the ecosystem to support data analysis
SQL-on-Hadoop: A key drawback of MapReduce, the dominant paradigm in the Hadoop world, is that it is very much a batch approach; the lack of interactivity puts a dampener on the agility of the analysis process, as it does not lend itself to the way analysts think. Most of the vendors have been working feverishly towards removing this impedance mismatch, and quite a few of their efforts have started to see the light of day in the last few months: Impala from Cloudera, Drill from MapR, Lingual from Cascading, Hadapt, Polybase from Microsoft, and Hawq from Pivotal HD, the latest entrant being Presto from Facebook.
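The kind of interactivity these engines aim for is the ability to fire off an ad-hoc aggregation and get an answer in seconds, iterating conversationally rather than waiting for a batch job. A minimal sketch, using Python’s built-in sqlite3 purely as a runnable stand-in for an engine like Impala or Presto (table and column names are invented):

```python
import sqlite3

# An in-memory table standing in for a large distributed sales data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, region TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("jeans", "north", 120), ("jeans", "south", 90),
     ("accessory", "north", 45), ("accessory", "south", 30)],
)

# The analyst tweaks and re-runs queries like this one interactively,
# instead of submitting a MapReduce job and waiting for the batch to finish.
rows = conn.execute(
    "SELECT item, SUM(qty) FROM sales GROUP BY item ORDER BY item"
).fetchall()
print(rows)  # → [('accessory', 75), ('jeans', 210)]
```

The SQL itself is the point: it is already the language analysts think in, which is why removing the batch-only constraint matters so much for agility.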
Machine Learning on Big Data: The availability of machine learning libraries for Big Data is reaching critical mass, enabling even smaller players like SMEs and startups to extract insights from very large data sets. In addition to Mahout, which has been around for some time, newer offerings like Oryx (Cloudera), Pattern (Cascading), and MLBase (Berkeley AMPLab) provide implementations of advanced algorithms like clustering, classification, regression, and collaborative filtering out of the box. The barrier to entry is being lowered, allowing organisations to focus more on building business functionality; think hyper-personalisation, recommendations, fraud detection, etc.
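As a toy illustration of one algorithm these libraries ship out of the box, here is k-means clustering in pure Python. Libraries like Mahout run this distributed over very large data sets; the sketch below, with made-up points, just shows the technique itself.

```python
import math

def kmeans(points, k, iters=20):
    """Cluster 2-D points into k groups; returns (centroids, assignments)."""
    centroids = points[:k]  # naive initialisation: first k points
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return centroids, assign

pts = [(1, 1), (1.2, 0.8), (0.9, 1.1),   # one cluster near (1, 1)
       (8, 8), (8.1, 7.9), (7.9, 8.2)]   # another near (8, 8)
centroids, labels = kmeans(pts, k=2)
print(labels)  # first three points share one label, last three the other
```

The value of the library offerings is precisely that teams get scalable, tuned versions of such algorithms without writing them, freeing effort for the business problem: which items to cluster, which users to recommend to, which transactions look fraudulent.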
3. Deepening of the capabilities of the ecosystem to support engineering
Hadoop 2.0 and separation of concerns: In Hadoop 2.0, via a new resource management framework called YARN, it is now possible to run a variety of workloads alongside traditional MapReduce on the same Hadoop cluster, sharing data through the underlying distributed file system. For example, this can be used to run graph-oriented processing (Giraph) or stream-based processing (Storm) for real-time analytics. This trend is only likely to accelerate further in 2014. The ability to run multiple frameworks on the same infrastructure will help users select the right framework for their analytics problem, and such consolidation will potentially enhance the appeal of the Hadoop stack to the mainstream market.
Proliferation of small open source components: There has been a regular stream of smaller projects contributed to the open source community that focus on solving repetitive, niche problems in the analytics space. For example, incremental data processing is a common problem in several Big Data aggregation systems; in October 2013, LinkedIn released an open source system called Hourglass that makes it easier to solve. Usually, such projects are published by the originating companies after being used in production for a while, which lends credibility to the work and gives startups and SMEs the opportunity to “stand on the shoulders of giants” to achieve their aspirations.
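The incremental-processing pattern that Hourglass addresses can be shown in miniature: instead of re-aggregating all historical data on every run, fold only the newly arrived partition into the previous aggregate. The data and function names below are illustrative, not Hourglass’s actual API.

```python
from collections import Counter

def incremental_update(prior_counts, new_partition):
    """Merge a new day's events into the running per-item totals."""
    updated = Counter(prior_counts)
    for item in new_partition:
        updated[item] += 1
    return updated

# yesterday's aggregate, computed by an earlier run and persisted
totals = Counter({"jeans": 210, "accessory": 75})

# only today's partition is read; the historical raw data is untouched
totals = incremental_update(totals, ["jeans", "jeans", "accessory"])
print(totals["jeans"], totals["accessory"])  # → 212 76
```

At Big Data scale the saving is substantial: a daily job touches one day’s worth of raw data rather than the full history, which is exactly the repetitive, niche problem such small components exist to solve once, well.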
Traditional data science libraries / applications and Big Data: Until recently, data scientists and analysts had to choose between the power of Hadoop and the wealth of open source statistical libraries and applications, chief among them R and NumPy / SciPy. In the past year the community has built frameworks that enable these sophisticated libraries to be used in conjunction with Hadoop, democratising access to a sophisticated statistical modelling and machine learning environment.
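One long-standing bridge between such scripting languages and Hadoop is Hadoop Streaming, where any program that reads lines on stdin and emits key/value lines on stdout can act as a mapper or reducer. The sketch below simulates the two stages in-process on a wordcount; in production, Hadoop would run the mapper and reducer on different machines and perform the sort/shuffle between them.

```python
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) for every word, like a streaming wordcount mapper."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Sum counts per key; input arrives sorted by key, as Hadoop guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

docs = ["big data big agility", "data lab factory data"]
shuffled = sorted(mapper(docs))   # simulate Hadoop's sort/shuffle step
counts = dict(reducer(shuffled))
print(counts)  # → {'agility': 1, 'big': 2, 'data': 3, 'factory': 1, 'lab': 1}
```

Because the mapper and reducer are plain Python functions, an analyst can pull in the full NumPy / SciPy toolkit inside them while Hadoop handles distribution, which is the democratisation the paragraph above describes.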
Most of these patterns are ushering in a high degree of agility in the Lab phase of the model described at the beginning of the article, helping nimble players adopt Big Data and Agile Analytics even with limited resources.
For analytics initiatives to be truly agile, the ecosystem should also be mature enough to provide tools that aid agile software development, not just in project management practices but in engineering practices as well. Only then will agility fully percolate to the Factory phase of the analytics value stream. A case in point is the testability of advanced analytics applications: MRUnit is a good start in that direction, but the breadth and depth of the available testing tools are still very limited. The ability to build a comprehensive test suite as a safety net is essential to supporting iterative development cycles. Conventional software application development has matured to a stage where a number of tools support agile engineering practices like Test Driven Development, Continuous Integration, and Refactoring. One should start seeing the emergence of parallel concepts and tools in the analytics space as adoption continues through the “Slope of Enlightenment.”
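The testability point can be sketched in the style MRUnit encourages: keep the mapper a pure function so it can be exercised in an ordinary unit test, with no cluster involved. The mapper and log format below are invented purely for illustration.

```python
import unittest

def clickstream_mapper(log_line):
    """Map a raw 'user,url' log line to a (user, url) key/value pair."""
    user, url = log_line.strip().split(",", 1)
    return (user, url)

class ClickstreamMapperTest(unittest.TestCase):
    def test_splits_user_and_url(self):
        self.assertEqual(clickstream_mapper("u42,/checkout"),
                         ("u42", "/checkout"))

    def test_url_may_itself_contain_commas(self):
        # split(",", 1) keeps everything after the first comma intact
        self.assertEqual(clickstream_mapper("u7,/a,b"), ("u7", "/a,b"))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClickstreamMapperTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # → True
```

A suite of such fast, cluster-free tests is the safety net that makes iterative development of analytics pipelines practical, which is exactly the gap in the ecosystem the paragraph above identifies.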
Learn about our Big Data Analytics practice.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.