ThoughtWorks

To Hadoop or Not to Hadoop?

Anand Krishnaswamy

Published: Aug 13, 2013

Hadoop is often positioned as the one framework your business needs to solve nearly all of its problems. Mention “Big Data” or “Analytics” and the reply comes back at once: Hadoop! Hadoop, however, was purpose-built for a specific set of problems. For some problems it is, at best, a poor fit; for others it is an outright mistake. Data transformation (or, more broadly, ETL) benefits significantly from a Hadoop setup, but if your business needs fall into any of the following five categories, Hadoop might be a misfit.



1. Big Data cravings

Businesses like to believe that they have a Big Data dataset, but often that is not the case. The research article Nobody Ever Got Fired For Buying a Cluster reports that, although Hadoop was designed for tera- and petabyte-scale computation, the majority of real-world jobs process less than 100 GB of input (median jobs at Microsoft and Yahoo were under 14 GB, and 90% of jobs at Facebook were well under 100 GB). On that basis, the article makes the case for a single “scale-up” server over a “scale-out” cluster running Hadoop.

Ask Yourself:

  • Do I have several terabytes of data or more?
  • Do I have a steady, huge influx of data?
  • How much of my data am I going to operate on?

2. You are in the queue

When you submit a job, Hadoop's minimum latency is about a minute. If, for example, the system has to respond to a customer’s purchase with recommendations, it will take a minute or more to do so. It would be a loyal and patient customer who would stare at the screen for 60+ seconds waiting for a response. One option is to use Hadoop to pre-compute the related items for every item in the inventory, and then give the web site or mobile app immediate, one-second-or-less access to the stored result. Hadoop is an excellent Big Data pre-computation engine. Of course, as the nature of your response gets more complicated, complete pre-computation becomes very inefficient.
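
The pre-computation idea can be sketched in a few lines of Python. This is a minimal illustration, not a Hadoop job: the `baskets` data, the co-purchase counting rule and the function names `precompute_related` and `recommend` are all assumptions made for the example. The point is only the split between a slow batch step and a fast lookup at request time.

```python
from collections import Counter, defaultdict
from itertools import permutations


def precompute_related(baskets, top_n=2):
    """Batch step: count co-purchases and keep the top_n related items per item."""
    co_counts = defaultdict(Counter)
    for basket in baskets:
        # Every ordered pair of distinct items in a basket counts once.
        for a, b in permutations(set(basket), 2):
            co_counts[a][b] += 1
    related = {}
    for item, counts in co_counts.items():
        # Highest count first; ties broken by name so the result is deterministic.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        related[item] = [other for other, _ in ranked[:top_n]]
    return related


def recommend(related, item):
    """Serving step: a dictionary lookup, with no job submission involved."""
    return related.get(item, [])


# Hypothetical purchase history, invented for this sketch.
baskets = [
    ["tent", "stove", "lantern"],
    ["tent", "stove"],
    ["tent", "lantern"],
    ["stove", "kettle"],
]
related = precompute_related(baskets)
print(recommend(related, "tent"))
```

In a real deployment the batch step would run on the cluster on a schedule and the `related` table would live in a key-value store; the sketch keeps both in one process to stay self-contained.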

Ask Yourself:

  • What are user expectations around response time?
  • Which of my jobs can be batched up?

3. Your call will be answered in...

Hadoop has not served well those businesses that require real-time responses to their queries. Jobs that go through the map-reduce cycle also spend time in the shuffle phase. None of these phases is time-bound, which makes developing real-time applications on top of Hadoop very difficult. Volume-weighted average price trading is one example where responses must be time-bound in order to place buy orders.

Analysts sorely miss SQL. Hadoop does not work well for random access to its datasets (even with Hive, which essentially turns your query into MapReduce jobs). Google’s Dremel architecture (and by extension, BigQuery) is designed to support ad-hoc queries over huge row sets in a matter of seconds. And SQL lets you do joins. Shark, from the AMPLab at the University of California, Berkeley, and the Stinger initiative led by Hortonworks are other alternatives to look out for.

Ask Yourself: 

  • What is the level of interaction users/analysts expect with my data?
  • Do they wish to have interactivity with terabytes of data or just a subset?

Let’s say it together: Hadoop works in batch mode. That means that as new data is added, the jobs need to run over the entire set again, so analysis time keeps increasing. Chunks of fresh data, mere updates or small changes might flow in in real time, and businesses often need to make decisions based on these events. However rapidly the incoming data is ingested, Hadoop will still process it in batch mode. YARN promises to address this in the future. Twitter’s Storm is already a popular and available alternative. Combining Storm with a distributed messaging system like Kafka opens up a variety of use cases for stream aggregation and processing. But load balancing is sorely missing in Storm, while it is available in Yahoo’s S4.
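
The cost difference between the two modes shows up even in a toy aggregate. The sketch below is plain Python, not Hadoop or Storm code, and the names `batch_mean` and `RunningMean` are invented for the illustration. It computes the same average twice: once by rereading the whole history on every arrival, and once by updating a small piece of state per event.

```python
def batch_mean(all_values):
    """Batch style: rereads the whole data set on every run."""
    return sum(all_values) / len(all_values)


class RunningMean:
    """Streaming style: each new event updates a small state in constant time."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


history = []
stream = RunningMean()
for price in [10.0, 20.0, 30.0, 40.0]:
    history.append(price)
    batch_result = batch_mean(history)    # work grows with len(history)
    stream_result = stream.update(price)  # work stays constant per event
    assert batch_result == stream_result

print(stream_result)
```

Both approaches agree on the answer; what differs is how much data has to be touched each time a new value arrives.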

Ask Yourself:   

  • What is the shelf-life of my data?
  • How rapidly should my business produce value from incoming data?
  • How important is it for my business to respond to live changes or updates?

Real-time advertisements and the monitoring of sensor data mandate real-time processing of streaming input. But Hadoop and the tools built on top of it are not the only options. SAP’s HANA in-memory database was used in the McLaren team’s ATLAS suite of analytics tools during the recent Indy 500, along with MATLAB, to run simulations and respond to telemetry during the race. Many analysts are of the opinion that the future of Hadoop is interactive and real-time.

4. I Just Broke Up With My Social Network

Hadoop, especially MapReduce, is best suited to data that can be decomposed into key-value pairs without fear of losing context or any implicit relationship. Graphs possess implicit relationships (edges, sub-trees, child and parent relationships, weights, etc.), and not all of these will exist on a single node. As a result, most graph algorithms have to carry a portion of the graph, or the entire graph, through each iteration. This is often not feasible in MapReduce, or at least convoluted to realize. There is also the problem of choosing a strategy for partitioning the data across nodes. If your primary data structure is a graph or a network, then you are probably better off using a graph database like Neo4j or Dex, or you could explore recent entries on the scene like Google’s Pregel or Apache Giraph.
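
To see why the graph has to travel with the computation, consider breadth-first search written in a MapReduce style. The following is a single-process Python simulation under stated assumptions: the function names, the tuple layout and the example graph are all invented here, and every node is assumed to appear as a key in the `graph` dictionary. Note that the map step must re-emit each node's adjacency list on every round, otherwise the structure would be lost after the shuffle.

```python
from collections import defaultdict

INF = float("inf")


def map_node(node, state):
    """Re-emit the node's own record, plus a distance offer to each neighbour."""
    dist, neighbours = state
    yield node, ("node", dist, neighbours)  # carries the graph structure forward
    if dist != INF:
        for n in neighbours:
            yield n, ("offer", dist + 1, None)


def reduce_node(node, values):
    """Rebuild the node record, keeping the smallest distance seen so far."""
    best, neighbours = INF, []
    for kind, dist, adj in values:
        if kind == "node":
            neighbours = adj
        best = min(best, dist)
    return best, neighbours


def bfs_rounds(graph, source):
    """Run map/shuffle/reduce rounds until no distance changes."""
    state = {n: (0 if n == source else INF, adj) for n, adj in graph.items()}
    rounds = 0
    while True:
        grouped = defaultdict(list)  # stands in for the shuffle phase
        for node, node_state in state.items():
            for key, value in map_node(node, node_state):
                grouped[key].append(value)
        new_state = {n: reduce_node(n, vals) for n, vals in grouped.items()}
        rounds += 1
        if new_state == state:
            return {n: d for n, (d, _) in state.items()}, rounds
        state = new_state


# Hypothetical directed graph, invented for this sketch.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances, rounds = bfs_rounds(graph, "a")
print(distances, rounds)
```

Each round here would be a separate job on a cluster, with the whole graph written out and read back in between, which is the overhead that vertex-centric systems are designed to avoid.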

Ask Yourself:

  • Is the underlying structure of my data as vital as the data itself?
  • Is the insight I wish to gain reflective of the structure as much as or more than the data?

5. The Mold of MapReduce

Some tasks, jobs and algorithms simply do not yield to the MapReduce programming model. One such set of problems was touched upon in the previous section. Tasks that need the results of intermediate steps to compute the result of the current step are another category (an academic example is computing the Fibonacci series). Some machine learning algorithms (gradient-based learning or expectation maximization) do not fit well into the MapReduce paradigm either. Researchers have suggested specific optimisations and strategies (global state, passing along data structures for reference, etc.) for each of these issues, but the resulting implementations are still less intuitive and more complicated than necessary.
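
The Fibonacci example can be made concrete with a small Python contrast. This is an illustrative sketch only; `fibonacci` and `squares` are names chosen for the example. The first function has a chain of dependencies between steps, while the second treats every input independently, which is the shape MapReduce expects.

```python
def fibonacci(n):
    """Each step needs the two previous results, so the steps cannot run independently."""
    previous, current = 0, 1
    for _ in range(n):
        previous, current = current, previous + current
    return previous


def squares(values):
    """Each output depends only on its own input, so the work could be split across mappers."""
    return [v * v for v in values]


print([fibonacci(i) for i in range(7)])
print(squares(range(5)))
```

Splitting `squares` across machines changes nothing about the result, whereas any split of the `fibonacci` loop leaves one part waiting for the other to finish.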

Ask Yourself:  

  • Does my business place great emphasis on highly specialised algorithms or domain-specific processes?
  • Wouldn’t the technical team be better equipped to analyse whether the algorithms are MapReducible or not?

Added to these are business cases where the data is not significantly large, or where the total data set is large but made up of billions of small files that can’t be concatenated (e.g. many image files which need to be scanned for a particular shape). As already mentioned, jobs that do not lend themselves to the MapReduce paradigm of divide and aggregate also make adopting Hadoop contrived.

Now that we have explored when Hadoop might be a misfit, let's look at when it might make sense.

Ask Yourself:  

Does your organization...

  1. Want to extract information from piles of text logs?
  2. Want to transform largely unstructured or semi-structured data into some other useable and structured format?
  3. Have tasks that can run over the entire set of data, overnight (like credit card companies do with the day’s transactions)?
  4. Treat conclusions drawn from a single processing of data as valid till the next scheduled processing (unlike stock market prices which definitely change between end of day values)?

Then, most certainly you should explore Hadoop.

These represent a sizeable list of categories of business problems that fit well into the Hadoop model (although reports suggest that, even for those, taking it to production is a non-trivial challenge). Jobs that have to go over huge quantities of unstructured or semi-structured data, and either summarise the contents or transform relevant observations into a structured form to be used by other components in the system, are very well suited to the Hadoop model. If your collected data has elements that can easily be captured as an identifier with its corresponding value (key-value pairs, in Hadoop-speak), you can use that simple association to perform several kinds of aggregations.
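
As a rough illustration of the key-value idea, here is a single-process Python sketch of a map, shuffle and reduce pass that summarises text logs by severity level. It is not Hadoop API code: the log layout (date, time, level, message), the sample lines and the function names are assumptions made for the example.

```python
from collections import defaultdict


def map_line(line):
    """Map: turn one raw log line into zero or one (key, value) pairs."""
    parts = line.split()
    if len(parts) >= 3:
        level = parts[2]  # assumed layout: date time LEVEL message...
        yield level, 1


def reduce_counts(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Hypothetical log lines, invented for this sketch.
logs = [
    "2013-08-13 10:00:01 INFO user login",
    "2013-08-13 10:00:02 ERROR payment failed",
    "2013-08-13 10:00:03 INFO page view",
    "malformed",
]
summary = run_job(logs)
print(summary)
```

Because each line is mapped on its own and each key is reduced on its own, the same logic can be spread over many machines without changing the result, which is what makes this kind of job a natural fit.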

At the end of the day, the key is to recognise the business resources available and to understand the nature of the problem you wish to solve. That, together with the elaboration above, should help you choose the best tools for your business.

And it may very well be Hadoop.

What has been your experience? Share in the comments section.
