ThoughtWorks
Data Science & Engineering | Bangalore | Technology

Trends in Big Data

Shyam Kurien

Published: Mar 4, 2014

Companies that aspire to achieve competitive advantage by using data as a key asset must build their execution plan around two phases, as described by HBR bloggers Redman & Sweeney:

  1. In the Lab phase, they must find interesting, novel, and useful insights about the real world from data.
  2. Thereafter, the focus moves to the Factory phase, where the challenge is to turn those insights into products and services, in most cases supported by a robust, scalable, and high-performance data analytics platform.

For example, let us consider an online retail organisation trying to forecast the demand for various items in its inventory, with the objective of maximising sales conversion and minimising inventory carrying costs. Demand forecasting techniques that use historical data (primarily sales) to forecast future sales have been around for some time. However, the more aggressive players explore ways of honing these base models by exploiting the wealth of data they have at their disposal. Maybe sales of a type of women’s accessory are seen to go up whenever sales of a specific cut of jeans go up. Or sales of a controversial book are observed to be affected by the sentiment expressed in tweets about it over the past few days.

During the lab phase, the data scientists explore and experiment with data from various sources to identify the right signals that impact sales of various items and how they could be correlated, with the objective of building a model capturing this interdependence. Once this model is codified (which is usually in the form of a series of equations of some complexity), the next step is to build a robust and scalable application that runs the forecasting model every period looking at the data sources, extracting the defined signals, and providing the probable demand for the next period. This phase, where the data engineering team takes over to build the application, is referred to as the Factory phase.
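As a rough illustration of the two phases, the sketch below fits a one-signal linear model (the "codified" output of the Lab phase) and then applies it to a new period (the repeatable job the Factory phase would productionise). The data, the function names `fit_line` and `forecast`, and the choice of a simple least-squares line are all assumptions made for illustration; a real forecasting model would combine many signals and be considerably more complex.

```python
from statistics import mean


def fit_line(signal, sales):
    """Least-squares fit of: sales = intercept + slope * signal."""
    mean_x, mean_y = mean(signal), mean(sales)
    variance = sum((x - mean_x) ** 2 for x in signal)
    if variance == 0:
        raise ValueError("signal must vary to fit a slope")
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(signal, sales))
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def forecast(model, next_signal):
    """Apply a fitted (slope, intercept) model to the latest signal value."""
    slope, intercept = model
    return intercept + slope * next_signal


# Made-up history: jeans units sold in a period, and accessory units
# sold in the period that followed.
jeans_sales = [10, 20, 30, 40]
accessory_sales = [25, 45, 65, 85]

model = fit_line(jeans_sales, accessory_sales)  # Lab phase: discover and codify
next_demand = forecast(model, 50)               # Factory phase: run every period
```

In this toy setup the Lab phase ends once `model` is agreed upon; everything that calls `forecast` on fresh data each period belongs to the Factory phase.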

The so-called “Big Data technologies” of various strains have brought about a sea change in the approach to analytics, be it descriptive or predictive in nature. However, as with any kind of technology solution, SMEs and startups demand a high degree of business responsiveness from an analytics solution as well. These high expectations can be met only if the tech team can achieve agility in both of the phases described above. In the lab phase, this implies that the data scientists need nimble and lightweight approaches and tools to explore and experiment with data, allowing them to fail fast at low cost. In the factory phase, the engineering teams tasked with productionising the insights require platforms, frameworks, and tools that enable them to work iteratively and rapidly.

Approaches to adoption

The ability of Big Data technologies to cheaply handle unstructured or semi-structured data in large volumes is being leveraged by organisations to bring agility to data mining and analysis. This is enabling the new breed of data scientists to experiment and fail fast with sophisticated modelling, machine learning techniques, and analytics, shrinking the cycle time of taking newer models from conceptualisation to production. On the one side, organisations like Amazon, Facebook, etc. have used these technologies to build complex applications that generate insights providing real competitive advantage to the business, monetising the data they have collected. On the other side, traditional organisations have also started adopting these technologies, re-examining the legacy approach of building Enterprise Data Warehouse (EDW) solutions. The traditional (waterfall-centric) approach to building EDWs, based on concepts like enterprise data modelling, holistic master data management strategies, heavyweight enterprise data governance policies, etc., is expensive and non-agile.

Given the open source nature of most Big Data technologies, many organisations, vendors, and users alike have been contributing back to the community. This rapid maturing of the stack is pushing the adoption from innovators and early adopters to the mainstream in a very short period. We at ThoughtWorks believe that a number of the advancements in the Big Data space in the last year will enable SMEs and startups to accelerate the adoption of advanced analytics.

Key trends enabling agility in Big Data Analytics

1. Lowering of the entry barrier

Big Data on the Cloud: Capacity planning and operationalising an in-house Big Data environment take considerable effort and become a barrier to entry for SMEs and startups. Several companies and open source projects have emerged to provide these infrastructural capabilities on the cloud, in both public and private flavours. For the Hadoop world, in addition to mature solutions like Amazon’s Elastic MapReduce, newcomers like Rackspace and OpenStack’s project Savanna and startups like Qubole, Altiscale, etc. are providing entire Hadoop ecosystems on the cloud. Additionally, most of the MPP database vendors, like Vertica and Teradata, have introduced cloud offerings in the recent past; the most notable offering in this category is Amazon’s Redshift. Value-added services on top of the basic Big Data environment augment the core infrastructure with critical functions such as the ability to manage data processing workflows, schedule jobs, or import and export data from other data sources. With such plumbing work out of the way, organisations can quickly put their solutions into production and extract business value economically.

2. Deepening of the capabilities of the ecosystem to support data analysis

SQL-on-Hadoop: A key drawback of MapReduce, the dominant paradigm in the Hadoop world, is that it is very much a batch approach; the lack of interactivity puts a dampener on the agility of the analysis process, as it does not lend itself to the way analysts think. Most of the vendors have been working feverishly towards the goal of removing this impedance mismatch, and quite a few of their efforts have seen the light of day in the last few months. Impala from Cloudera, Drill from MapR, Lingual from Cascading, Hadapt, Polybase from Microsoft, and Hawq from Pivotal HD are but a few of them, the latest entrant being Presto from Facebook.
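To make the impedance mismatch concrete, here is a minimal in-memory sketch of how a simple per-item total has to be decomposed into a map step and a reduce step, next to the one-line SQL an analyst would rather type interactively. The `sales` table, the record layout, and the helper `run_job` are hypothetical and only simulate the programming model on a single machine; this is not Hadoop's API.

```python
from collections import defaultdict

# What the analyst wants to ask interactively (hypothetical table):
#   SELECT item, SUM(quantity) FROM sales GROUP BY item;


def map_sale(record):
    """Map step: emit a (key, value) pair for one input record."""
    item, quantity = record
    yield item, quantity


def reduce_sales(item, quantities):
    """Reduce step: combine all values that share a key."""
    return item, sum(quantities)


def run_job(records):
    """Single-machine stand-in for the shuffle between map and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_sale(record):
            groups[key].append(value)
    return dict(reduce_sales(key, values) for key, values in groups.items())


sales = [("jeans", 2), ("belt", 1), ("jeans", 3)]
totals = run_job(sales)
```

Even in this toy form, a change to the question (say, an extra filter or grouping column) means editing and rerunning a batch job, whereas the SQL version is a one-line change.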

Machine Learning on Big Data: The availability of Machine Learning libraries for Big Data is reaching critical mass, enabling even smaller players like SMEs and startups to move into the realm of extracting insights from very large data sets. In addition to Mahout, which has been around for some time, newer offerings like Oryx (Cloudera), Pattern (Cascading), and MLBase (Berkeley AMPLab) provide implementations of advanced algorithms like clustering, classification, regression, and collaborative filtering out of the box. The barrier to entry is being reduced, allowing organisations to focus more on building business functionality; think hyper-personalisation, recommendations, fraud detection, etc.

3. Deepening of the capabilities of the ecosystem to support engineering

Hadoop 2.0 and separation of concerns: In Hadoop 2.0, via a new resource management framework called YARN, it is now possible to run a variety of workloads alongside traditional MapReduce on the same Hadoop cluster, sharing data using the underlying distributed file system. For example, this can be used to run graph-oriented processing (Giraph) or stream-based processing (Storm) for real-time analytics. This trend is only likely to accelerate further in 2014. The ability to run multiple frameworks on the same infrastructure will help users select the right framework for solving their analytics problem. Such consolidation will potentially enhance the appeal of the Hadoop stack to the mainstream market.

Proliferation of small open source components: There has been a regular stream of smaller open source projects contributed to the community that focus on solving repetitive, niche problems in the analytics space. For example, incremental data processing is a common problem in several Big Data aggregation systems. In October 2013, LinkedIn released an open source system called Hourglass that makes it easier to solve this problem. Usually, such projects get published by the originating companies after being used for a while in production, which lends credibility to the work and gives startups and SMEs the opportunity to “stand on the shoulders of giants” to achieve their aspirations.
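The incremental processing problem mentioned above can be sketched in a few lines: instead of recomputing an aggregate over the whole window every day, keep per-day partial aggregates, add the newest day, and subtract the day that falls out of the window. The class `SlidingWindowCounter` and its data layout are hypothetical and written only to illustrate the idea; this is not Hourglass's API or implementation.

```python
from collections import deque


class SlidingWindowCounter:
    """Keeps counts over the last `window_days` days without full recomputation."""

    def __init__(self, window_days):
        self.window_days = window_days
        self.daily = deque()  # per-day partial aggregates, oldest first
        self.total = {}       # running aggregate over the current window

    def add_day(self, day_counts):
        """Fold in one new day of counts and return the window totals."""
        self.daily.append(day_counts)
        for key, count in day_counts.items():
            self.total[key] = self.total.get(key, 0) + count

        # Drop the oldest day once the window is exceeded.
        if len(self.daily) > self.window_days:
            expired = self.daily.popleft()
            for key, count in expired.items():
                self.total[key] -= count
                if self.total[key] == 0:
                    del self.total[key]
        return dict(self.total)


counter = SlidingWindowCounter(window_days=2)
day1 = counter.add_day({"jeans": 1})
day2 = counter.add_day({"jeans": 2, "belt": 1})
day3 = counter.add_day({"belt": 4})  # day 1 drops out of the window
```

Each call touches only the newest and the expiring day, so the cost per update does not grow with the length of the window.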

Traditional data science libraries / applications and Big Data: Until recently, data scientists and analysts had to choose between the power of Hadoop and the wealth of open source libraries and applications for data science, R and NumPy / SciPy being the chief ones. In the past year, the community has built frameworks that enable these sophisticated libraries to be used in conjunction with Hadoop, democratising access to a sophisticated statistical modelling and machine learning environment.

Looking Ahead

Most of these patterns are ushering in a high degree of agility in the Lab phase of the model described at the beginning of the article, helping nimble players to adopt Big Data and Agile Analytics even with limited resources.

For analytics initiatives to be truly agile, the ecosystem should also be mature enough to provide tools that aid agile software development, not just in project management practices but in engineering practices as well. Only then will agility fully percolate to the factory phase of the analytics value stream. A case in point is the testability of advanced analytics applications. MRUnit is a good start in that direction. However, the breadth and depth of the testing tools are very limited. The ability to build a comprehensive test suite as a safety net is essential in supporting iterative development cycles. Conventional software application development has matured to a stage where there are a number of tools to support agile engineering practices like Test Driven Development, Continuous Integration, Refactoring, etc. One should start seeing the emergence of parallel concepts and tools supporting the same in the Analytics space as adoption continues through the “Slope of Enlightenment.”
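As a small illustration of what such a safety net could look like, the sketch below unit-tests a map step and a reduce step in isolation with tiny hand-made inputs, in the spirit of what MRUnit offers for Java MapReduce jobs. It uses Python's standard `unittest` module rather than MRUnit itself, and the tweet record layout and the functions `sentiment_mapper` and `sentiment_reducer` are hypothetical.

```python
import unittest


def sentiment_mapper(tweet):
    """Map step: emit (book_id, score) for one tweet record."""
    yield tweet["book_id"], tweet["score"]


def sentiment_reducer(book_id, scores):
    """Reduce step: average the sentiment scores collected for one book."""
    scores = list(scores)
    return book_id, sum(scores) / len(scores)


class SentimentJobTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_tweet(self):
        tweet = {"book_id": "b1", "score": 0.5}
        self.assertEqual(list(sentiment_mapper(tweet)), [("b1", 0.5)])

    def test_reducer_averages_scores(self):
        self.assertEqual(sentiment_reducer("b1", [1.0, 0.0, 0.5]), ("b1", 0.5))
```

Saved as a module, the tests can be run with `python -m unittest`, which makes them easy to hook into a Continuous Integration build.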

