
Scala Symposium: Number Crunching in Scala

Number Crunching in Scala

- Session presented by Chris Stucchio, BayesianWitch

Chris is one of the founders of BayesianWitch, a web analytics company built almost entirely on Scala. He is currently focused on improving the scientific computing ecosystem in Scala.

 

The algorithms used at BayesianWitch have to solve coupled PDEs, minimize high-dimensional objective functions, and do some statistical sampling, all in under 400 ms. Scala and Akka excel at real-time streaming, concurrency, and fault tolerance. Python excels at solving PDEs and the like, because it has excellent libraries such as NumPy, SciPy, Matplotlib, and Bokeh. However, using the two together is not an attractive option, for multiple reasons. Chris talked about how Scala could replace Python in this domain, and what hurdles stand in its way.

Regular idiomatic Scala tends to be slow, and even though it's possible to hand-tune it, the result is almost always ugly. You end up trading away either performance or expressiveness. There are some advanced techniques, though, that let you stay expressive while retaining as much performance as possible. These include macros and carefully placed @specialized annotations, among others. Libraries in the number crunching domain have to make use of these techniques.
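To make the trade-off concrete, here is a minimal sketch (not code from the talk) that computes a sum of squares three ways: idiomatically, with a hand-tuned loop, and generically through a small type class annotated with @specialized. The names SumOfSquares and Ring are hypothetical and chosen only for illustration; real libraries define far richer abstractions. Note that @specialized only takes effect on Scala 2; Scala 3 accepts the annotation but ignores it.

```scala
object SumOfSquares {
  // Idiomatic: short and readable, but `map` allocates an intermediate array.
  def idiomatic(xs: Array[Double]): Double =
    xs.map(x => x * x).sum

  // Hand-tuned: a while loop over primitives with no intermediate collection.
  def handTuned(xs: Array[Double]): Double = {
    var acc = 0.0
    var i = 0
    while (i < xs.length) {
      acc += xs(i) * xs(i)
      i += 1
    }
    acc
  }

  // Hypothetical minimal type class. On Scala 2, @specialized asks the compiler
  // to generate primitive variants for Int and Double so calls can avoid boxing.
  trait Ring[@specialized(Int, Double) A] {
    def zero: A
    def plus(a: A, b: A): A
    def times(a: A, b: A): A
  }

  implicit object DoubleRing extends Ring[Double] {
    def zero: Double = 0.0
    def plus(a: Double, b: Double): Double = a + b
    def times(a: Double, b: Double): Double = a * b
  }

  implicit object IntRing extends Ring[Int] {
    def zero: Int = 0
    def plus(a: Int, b: Int): Int = a + b
    def times(a: Int, b: Int): Int = a * b
  }

  // Generic: one definition for every element type that has a Ring instance.
  def generic[@specialized(Int, Double) A](xs: Array[A])(implicit ev: Ring[A]): A = {
    var acc = ev.zero
    var i = 0
    while (i < xs.length) {
      acc = ev.plus(acc, ev.times(xs(i), xs(i)))
      i += 1
    }
    acc
  }
}
```

All three return 14.0 for Array(1.0, 2.0, 3.0). How much faster the hand-tuned and specialized versions actually run depends on the JVM and the input size, so any such claim should be checked with a benchmark.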

He then mentioned some key libraries in the domain. Spire provides numeric type classes and primitives. Breeze aims to be NumPy for Scala, and supports all the usual suspects: vectors, matrices, polynomials, statistics, and so on. It uses an interesting abstraction called UFunc to provide shape-polymorphic operations, à la NumPy. Saddle is like Breeze, but with some other interesting structures and nicer IO. For visualization, there are Breeze-Viz, Breeze-Bokeh, and JFreeChart. (Chris is a committer to Breeze and Breeze-Bokeh.) BayesianWitch also uses Scalding for dealing with big data sets. A big problem is that these libraries are mutually incompatible in a number of ways.
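The idea of a shape-polymorphic operation can be sketched with a plain type class and nothing beyond the standard library. This is not Breeze's UFunc API; the names ElementwiseSketch, Elementwise, mapAll, and square are hypothetical, and the sketch only shows how a single function definition can apply to a scalar, an array, or a nested array.

```scala
import scala.reflect.ClassTag

object ElementwiseSketch {
  // Hypothetical type class: S is any "shape" whose Double leaves can be mapped.
  trait Elementwise[S] {
    def mapAll(s: S)(f: Double => Double): S
  }

  object Elementwise {
    // Base case: a bare Double is a shape with exactly one element.
    implicit val scalar: Elementwise[Double] = new Elementwise[Double] {
      def mapAll(s: Double)(f: Double => Double): Double = f(s)
    }

    // Inductive case: an array of shapes is itself a shape.
    implicit def array[S](implicit inner: Elementwise[S], ct: ClassTag[S]): Elementwise[Array[S]] =
      new Elementwise[Array[S]] {
        def mapAll(s: Array[S])(f: Double => Double): Array[S] =
          s.map(x => inner.mapAll(x)(f))
      }
  }

  // Written once, usable on every shape that has an Elementwise instance.
  def square[S](s: S)(implicit ev: Elementwise[S]): S =
    ev.mapAll(s)(x => x * x)
}
```

With this in place, ElementwiseSketch.square(3.0) yields 9.0, and the same call on Array(1.0, 2.0, 3.0) yields an array containing 1.0, 4.0, and 9.0. The compiler picks the instance from the argument's static type, which is the same general mechanism a type-class-based numeric library can use to dispatch on shape.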

He concluded with the thought that, given some effort and if all these libraries could be made to play nicely together, the Scala ecosystem could eventually get to where NumPy/SciPy is.

 

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
