Big Data Pipeline powered by Scala
Session presented by Rohit Rai, Tuplejump
Rohit is a founder and the CEO of Tuplejump Inc. He is a true polyglot with experience in a number of programming languages and a prolific open source contributor, and he has been working with Scala, Akka, Play, and the surrounding ecosystem for over four years. Tuplejump is a startup with a vision to simplify data engineering by making data, and the tools to work with it, accessible to the people who need them. They have built a big data platform powered by Scala everywhere.
Their big data pipeline comprises six stages: collect, transform, store, explore, predict, and visualize. The “collect” stage uses Hydra, a framework built atop Akka, to gather high-volume, high-velocity data from both push-based and pull-based sources. The collected data is streamed to the “transform” stage, which employs Spark to handle both structured and unstructured data. The “store” stage uses DStore, a Cassandra-based storage solution that offers scalability and high availability along with high-performance reads and writes. Cassandra's support for replication across multiple data centers is best-in-class, providing lower latency and high fault tolerance. The “explore” stage uses the Shark analytics engine, Calliope, and Ubercube, a distributed OLAP cube engine developed by Tuplejump. For “predict”, they are building their own EA and ANN/DL frameworks, working towards what they refer to as “Machine Assisted Insights”. The “visualize” stage uses Pizzaro, a modern data visualization front-end with highly interactive and reactive capabilities.
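To make the staged design concrete, here is a minimal sketch of how such stages could be chained as plain Scala functions. It is not Tuplejump's code and uses none of Hydra, Spark, DStore, or Pizzaro: the `Event` type, the `source,value` input format, and every stage body are assumptions made up for illustration, each stage is an in-memory stand-in, and the “predict” stage is left out.

```scala
object PipelineSketch {
  // Hypothetical record type; the talk does not describe the real data model.
  final case class Event(source: String, value: Double)

  // "collect": parse raw "source,value" lines, silently dropping malformed ones.
  val collect: Seq[String] => Seq[Event] = lines =>
    lines.flatMap { line =>
      line.split(",") match {
        case Array(src, v) =>
          scala.util.Try(Event(src.trim, v.trim.toDouble)).toOption
        case _ => None
      }
    }

  // "transform": an assumed cleaning rule that keeps only non-negative readings.
  val transform: Seq[Event] => Seq[Event] = _.filter(_.value >= 0)

  // "store": an in-memory map keyed by source, standing in for a real data store.
  val store: Seq[Event] => Map[String, Seq[Double]] = events =>
    events.groupBy(_.source).map { case (src, es) => src -> es.map(_.value) }

  // "explore": a simple aggregate, the average value per source.
  val explore: Map[String, Seq[Double]] => Map[String, Double] =
    _.map { case (src, vs) => src -> vs.sum / vs.size }

  // "visualize": render the aggregates as sorted text lines.
  val visualize: Map[String, Double] => String = averages =>
    averages.toSeq.sortBy(_._1).map { case (src, avg) => s"$src: $avg" }.mkString("\n")

  // The stages compose left to right into one function.
  val pipeline: Seq[String] => String =
    collect andThen transform andThen store andThen explore andThen visualize
}
```

Calling `PipelineSketch.pipeline(Seq("a,1", "a,3", "b,-5", "bad", "b,4"))` yields the two lines `a: 2.0` and `b: 4.0`. Because each stage is an ordinary function value, any one of them can be replaced or tested on its own without touching the others.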
Tuplejump found Scala attractive for a number of reasons. It unifies OOP and FP, is modern and evolving, and runs on the JVM, the only VM worth putting in production according to Rohit. :) Rohit went into detail on how Akka’s actor concurrency works out in practice, including its supervision and clustering features. He spoke about Spark, the secret sauce in their batch processing system. He also touched upon Play, SBT, and ScalaTest.
Tuplejump has open sourced a number of tools, which you can find on their GitHub page.