Scaling your startup's data platform using Akka, Scalding, and Spark
- Session presented by Rajesh Muppalla, Indix
Rajesh is a co-founder and director of engineering at Indix, a product intelligence platform for retailers and brands. He leads the team responsible for collecting, organizing, and structuring all the product-related data collected from the web. His main areas of focus are big data and distributed systems.
The tech stack at Indix comprises Scala, Akka, Hadoop, Scalding, and Spark, among several other technologies. Rajesh first gave the audience an overview of the stages in Indix's data pipeline, and then elaborated on how each stage employs various technologies to great effect.
The data collection stage in the pipeline involves crawling. A crawler must be distributed, efficient, fault tolerant, and extensible. Akka actors provide a simple and performant concurrency model, are distributed by design, and offer supervision features that allow for highly fault-tolerant designs. This, together with Akka's recent clustering features, meant that it fit the bill perfectly. Rajesh went into more detail about the patterns they used and the lessons they learned through their use of Akka.
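The core of the supervision idea is that a failed unit of work is restarted rather than allowed to bring down the whole crawl. A minimal plain-Scala sketch of that restart-on-failure semantics (not Akka's actual API, and not Indix's code; `supervised` and `maxRetries` are illustrative names):

```scala
import scala.util.{Failure, Success, Try}

// Retry a failing unit of work a bounded number of times, the way an Akka
// supervisor restarts a crashed fetcher actor instead of aborting the crawl.
def supervised[T](maxRetries: Int)(work: () => T): Try[T] = {
  def attempt(retriesLeft: Int): Try[T] = Try(work()) match {
    case ok @ Success(_)                      => ok
    case failed @ Failure(_) if retriesLeft == 0 => failed // give up
    case Failure(_)                           => attempt(retriesLeft - 1) // "restart"
  }
  attempt(maxRetries)
}
```

In Akka proper, this policy would be declared once on a parent actor (e.g. via a `OneForOneStrategy`) and applied transparently to all of its child fetchers, which is what makes the design scale to a large crawl.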
The data processing stage deals with about 200 million products and 8 billion prices every day, and also needs to be fast and fault tolerant. They use Scalding, a Scala abstraction layer for Hadoop built atop Cascading. Rajesh illustrated with some examples how Scalding excels over the raw Hadoop API and other alternatives like Pig. They also use Apache Spark, an in-memory analytics engine compatible with Hadoop's storage APIs. In their experience, Spark runs up to 40x faster for memory-intensive operations.
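Much of Scalding's appeal is that a MapReduce job reads like an ordinary Scala collections pipeline. The classic word count, written here over plain Scala collections as a sketch of that style (a real Scalding job would read from and write to Hadoop sources such as `TextLine` and `Tsv`; `wordCount` is an illustrative name):

```scala
// Word count in the collection-pipeline style Scalding popularized.
def wordCount(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.toLowerCase.split("\\W+")) // "map" phase: tokenize each line
    .filter(_.nonEmpty)
    .groupBy(identity)                    // "shuffle": group identical words
    .map { case (word, occurrences) => (word, occurrences.size) } // "reduce"
```

The same `flatMap`/`groupBy`/`map` shape expressed in the raw Hadoop API would require separate Mapper and Reducer classes plus job-configuration boilerplate, which is the contrast the talk drew.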
Indix uses the Play framework for its metrics and dashboards.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.