Over the past year, there have been a lot of new product / project announcements related to running Hadoop in Cloud environments. While Amazon’s Elastic MapReduce continued with enhancements over its base platform, players like Qubole, Mirantis, VMWare, and Rackspace all announced product or service offerings marrying the elephant with the cloud. Projects like Apache Whirr, Netflix’s Genie started getting noticed. Recent announcements at the HadoopWorld + StrataConf summit prompted analysts to claim that Hadoop is taking over the cloud.
It has become evident that Hadoop in the cloud is a trending topic. This post explores six of the reasons why this association makes sense, and why customers are seeing increased value in this model.
Running Hadoop on the cloud makes sense for similar reasons as running any other software offering on the cloud. For companies still testing the waters with Hadoop, the low capacity investment in the cloud is a no-brainer. The cloud also makes sense for a quick, one time use case involving big data computation. As early as in 2007, the New York Times used the power of Amazon EC2 instances and Hadoop for just one day to do a one time conversion of TIFF documents to PDFs in a digitisation effort. Procuring scalable compute resources on demand is appealing as well.
The point above of quick resource procurement needs some elaboration. Hadoop and the platforms it was inspired from made the vision of linear storage and compute using commodity hardware a reality. Internet giants like Google, who always operated at web-scale, knew that there would be a need for running on more and more hardware resources. They invested in building this hardware themselves.
In the enterprise though, this was not necessarily an option. Hadoop adoption in some enterprises grew organically from a few departments running tens of nodes to a consolidated medium-sized or large cluster with a few hundreds of nodes. Usually such clusters started getting managed by a ‘data platform’ team different from the Infrastructure team, the latter being responsible for procurement and management of the hardware in the data centre. As the analytics demand within enterprises grew, the need to expand the capacity of the Hadoop clusters also grew. The data platform teams started hitting a bottleneck of a different kind. While the software itself had proven its capability of handling linear scale, the time it took for hardware to materialise in the cluster due to IT policies varied from several weeks to several months, stifling innovation and growth. It became evident that ‘throwing more hardware at Hadoop’ wasn’t as easy or fast as it should be.
The cloud, with its promise of instant access to hardware resources, is very appealing to leaders who wish to provide a platform that scales fast to meet growing business needs. For instance, what could take several weeks to get 50 more machines into a data centre would become available in a cloud platform like Amazon EMR in tens of minutes.
Being a batch-oriented system, typical usage patterns for Hadoop involve scheduled jobs processing new incoming data on a fixed, temporal basis. Companies collecting activity data from devices or web server logs ingest this data into an analytics application on Hadoop once or couple of times in a day and extract insights from them. The load on compute resources of a Hadoop cluster varies based on the timings of these scheduled runs or rate of incoming data. A fixed capacity Hadoop cluster built on physical machines is always on whether it is used or not – consuming power, leased space, etc. and incurring cost.
The cloud, with its pay as you use model, is more efficient to handle such batch workloads. Given predictability in the usage patterns, one can optimise even further by having clusters of suitable sizes available at the right time for jobs to run. For the example cited above on activity data analysis, companies can schedule cloud-based clusters to be available only for the period of time during the day when the data needs to be crunched.
Not all Hadoop jobs are created equal. While some of them require more compute resources, some require more memory, and some others require a lot of I/O bandwidth. Usually, a physical Hadoop cluster is built of homogeneous machines, usually large enough to handle the largest job.
The default Hadoop scheduler has no solution for managing this diversity in Hadoop jobs, causing sub-optimal results. For example, a job whose tasks requires more memory than average could affect tasks of other jobs that run on the same slave node due to a drain on system resources. Advanced Hadoop schedulers like the Capacity Scheduler and Fairshare Scheduler have tried to address the case of managing heterogeneous workloads on homogeneous resources using sophisticated scheduling techniques. For example, the Capacity Scheduler supported the notion of ‘High RAM’ jobs – jobs that require more memory on average. Such support is becoming more prevalent with Apache YARN, where the notion of a resource is being more comprehensively defined and handled. However, these solutions are still not as widely adopted as the stable Hadoop 1.0 solutions that do not have this level of support.
Cloud solutions meanwhile already offer a choice to the end user to provision clusters with different types of machines for different types of workloads. Intuitively, this seems like a much easier solution for the problem of handling variable resource requirements.
For example, with Amazon Elastic MapReduce, you can launch a cluster for yourself with m2.large machines if your Hadoop jobs require more memory, and c1.xlarge machines if your Hadoop jobs are compute intensive.
As businesses move their services to the cloud, it follows that data starts living on the cloud. And as analytics thrives on data, and typically large volumes of it, it makes no sense for analytical platforms to exist outside of the cloud leading to inefficient, time consuming migration of this data from source to the analytics clusters.
Running Hadoop clusters in the same cloud environment is an obvious solution to this problem. This is, in a way, applying Hadoop’s principle of data locality at the macro level.
As cluster consolidation happens in the enterprise, one thing that gets lost is the isolation of resources for different sets of users. As all user jobs get bunched up in a shared cluster, administrators of the cluster start to deal with multi-tenancy issues like user jobs interfering with one another, varied security constraints etc.
The typical solution to this problem has been to enforce very restrictive cluster level policies or limits that prevent users from doing anything harmful to other users jobs. The problem with this approach is that valid use cases of users are also not solved. For instance, it is common for administrators to lockdown the amount of memory Hadoop tasks can run with. If a user genuinely requires more memory, he or she has no support from the system.
Using the cloud, one can provision different types of clusters with different characteristics and configurations, each suitable for a particular set of jobs. While this frees administrators from having to manage complicated policies for a multi-tenant environment, it enables users to use the right configuration for their jobs.
Running Hadoop in the clouds is not without potential problems. Stay tuned for part 2 when we will explore those challenges in greater detail.
Read "To Hadoop or Not To Hadoop" to learn when Hadoop might be a misfit for your organization.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.