
Scala Symposium: Big Data Pipeline Powered by Scala

Big Data Pipeline powered by Scala

Session presented by Rohit Rai, Tuplejump

Rohit is a founder and the CEO of Tuplejump Inc. He is a true polyglot with experience in a number of programming languages, and a prolific open source contributor. He has been working with Scala, Akka, Play and the surrounding ecosystem for over 4 years. Tuplejump is a startup with a vision to simplify data engineering by making data, and the tools to work with it, accessible to the people who need it. They have built a big data platform powered by Scala throughout.


Their big data pipeline comprises six stages: collect, transform, store, explore, predict, and visualize. The “collect” stage uses Hydra, a framework built atop Akka, to gather high-volume, high-velocity data from both push-based and pull-based sources. The collected data is streamed to the “transform” stage, which employs Spark to handle both structured and unstructured data. The “store” stage uses DStore, a Cassandra-based storage solution that offers scalability and high availability with high-performance reads and writes. Cassandra's support for replication across multiple data centers is best-in-class, providing lower latency and high fault tolerance. The “explore” stage uses the Shark analytics engine, Calliope, and Ubercube, a distributed OLAP cube engine developed by Tuplejump. For “predict”, they are building their own EA and ANN/DL frameworks, working towards what they refer to as “Machine Assisted Insights”. The “visualize” stage uses Pizzaro, a modern data visualization front-end with highly interactive and reactive capabilities.
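To make the flow between stages concrete, here is a minimal sketch in plain Scala of how the first four stages might hand data to one another. It is only an illustration: every name in it (PipelineSketch, RawEvent, Record, and the stage functions) is hypothetical, and in-memory collections stand in for Hydra, Spark, DStore and the exploration tools. It does not reflect Tuplejump's actual code or APIs.

```scala
// Hypothetical sketch: none of these names come from Tuplejump's code.
object PipelineSketch {
  // A raw event as it might arrive at the "collect" stage.
  final case class RawEvent(source: String, payload: String)

  // A cleaned record produced by the "transform" stage.
  final case class Record(source: String, value: Int)

  // collect: gather events from push- and pull-based sources.
  // Assumption: a fixed list stands in for live sources.
  def collect(): List[RawEvent] = List(
    RawEvent("push", "3"),
    RawEvent("pull", "4"),
    RawEvent("push", "not-a-number")
  )

  // transform: parse payloads and drop the ones that are malformed.
  def transform(events: List[RawEvent]): List[Record] =
    events.flatMap { e =>
      scala.util.Try(e.payload.toInt).toOption.map(v => Record(e.source, v))
    }

  // store: an in-memory map keyed by source stands in for a real data store.
  def store(records: List[Record]): Map[String, List[Record]] =
    records.groupBy(_.source)

  // explore: a simple aggregate (sum of values) per source.
  def explore(stored: Map[String, List[Record]]): Map[String, Int] =
    stored.map { case (source, rs) => source -> rs.map(_.value).sum }

  // The stages compose in order: collect, transform, store, explore.
  def run(): Map[String, Int] =
    explore(store(transform(collect())))
}
```

Calling PipelineSketch.run() pushes the sample events through each stage in turn. The point of the sketch is only that each stage consumes the previous stage's output, which is what lets the real system swap a dedicated tool into each slot.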

Tuplejump found Scala attractive for a number of reasons. It unifies OOP and FP, is modern and evolving, and runs on the JVM, the only VM worth putting in production according to Rohit. :) Rohit went into detail on how Akka’s actor-based concurrency works out in practice, including its supervision and clustering features. He spoke about Spark, the secret sauce in their batch processing system. He also touched upon Play, SBT, and ScalaTest.

Tuplejump has open sourced a number of tools, which you can find on their GitHub page.



Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
