Show mobile menu

Alumni blogs

Lots of our people have lots of opinions. Here are just a few of them

ThoughtWorks embraces the individuality of the people in the organization and hence the opinions expressed in the blogs may contradict each other and also may not represent the opinions of ThoughtWorks.

Sinking Data to Neo4j from Hadoop with Cascading

Recently, I worked with a colleague (Paul Lam, aka @Quantisan on building a connector library to let Cascading interoperate with Neo4j: cascading.neo4j. Paul had been experimenting with Neo4j and Cypher to explore our data through graphs and we wanted an easy way to flow our existing data on Hadoop into Neo4j.

The data processing pipeline we’ve been growing at uSwitch.com is built around Cascalog, Hive, Hadoop and Kafka.

Once the data has been aggregated and stored a lot of our ETL is performed upon Cascalog and, by extension, Cascading. Querying/analysis is a mix of Cascalog and Hive.…

Blog post by Paul Ingles
20 May 2013

Original Link

Compressing CloudFront Assets and dfl8.co

Amazon’s web services have made rebuilding uSwitch.com so much easier. We’re gradually moving more and more static assets to CloudFront (although most visitors are in the UK responses have much lower latencies than direct from S3 or even our own nginx servers). CloudFront doesn't support serving gzip'ed content direct from S3 out of the box.

Because of this, up until last week we were serving uncompressed assets, at least anything that wasn’t already compressed (such as images). Last week we put together a simple static assets nginx server to help compress things.

Whilst doing the work for uSwitch.com…

Blog post by Paul Ingles
20 May 2013

Original Link

Evaluating classifier results with R part 2

In a previous article I showed how to visualise the results of a classifier using ggplot2 in R. In the same article I mentioned that Alex, a colleague at Forward, had suggested looking further at R’s caret package that would produce more detailed statistics about the overall performance of the classifer and within individual classes.

Confusion Matrix

Using ggplot2 we can produce a plot like the one below: a visual representation of a confusion matrix. It gives us a nice overview but doesn’t reveal much about the specific performance characteristics of our classifier.

Blog post by Paul Ingles
20 May 2013

Original Link

Visualising classifier results with R and ggplot2

Earlier in the year, myself and some colleagues started working on building better data processing tools for uSwitch.com. Part of the theory/reflection of this is captured in a presentation I was privileged to give at EuroClojure (titled Users as Data).

In the last few days, our data team (Thibaut, Paul and I) have been playing around with some of the data we collect and using it to build some classifiers. Precision and Recall provide quantitative measures but reading through Machine Learning for Hackers showed some nice ways to visualise results.

Binary Classifier

Our…

Blog post by Paul Ingles
20 May 2013

Original Link

Protocol Buffers with Clojure and Leiningen

This week I’ve been prototyping some data processing tools that will work across the platforms we use (Ruby, Clojure, .NET). Having not tried Protocol Buffers before I thought I’d spike it out and see how it might fit.

Protocol Buffers

The Google page obviously has a lot more detail but for anyone who’s not seen them: you define your messages in an intermediate language before compiling into your target language.

There’s a Ruby library that makes it trivially easy to generate Ruby code so you can create messages as follows:

Clojure and Leiningen

Blog post by Paul Ingles
20 May 2013

Original Link

Social Enterprise Development

When I read the transcript of Linus Torvald’s talk on Git at Google I was working at an investment bank in London and it was about 4 years ago. It was just as I’d started using GitHub for hosting my own side-projects and for doing some open-source work. Fast forward to today and I’ve just read an article about the fast rise of GitHub as the software repository of choice for open-source development and an interesting space for Enterprise hosting.

All the banks I worked in were extremely centrally controlled: you’d use approved libraries and tools only. However, the…

Blog post by Paul Ingles
20 May 2013

Original Link

Kafka for uSwitch's Event Pipeline

Kafka is a high-throughput, persistent, distributed messaging system that was originally developed at LinkedIn. It forms the backbone of uSwitch.com’s new data analytics pipeline and this post will cover a little about Kafka and how we’re using it.

Kafka is both performant and durable. To make it easier to achieve high throughput on a single node it also does away with lots of stuff message brokers ordinarily provide (making it a simpler distributed messaging system).

Messaging

Over the past 2 years we’ve migrated from a monolithic environment based around Microsoft .NET and SQL Server to a mix…

Blog post by Paul Ingles
20 May 2013

Original Link

Clojure - From Callbacks to Sequences

I was doing some work with a colleague earlier this week which involved connecting to an internal RabbitMQ broker and transforming some messages before forwarding them to our Kafka broker.

We’re using langohr to connect to RabbitMQ. Its consumer and queue documentation shows how to use the subscribe function to connect to a broker and print messages that arrive:

The example above is pretty close to what we started working with earlier today. It’s also quite similar to a lot of other code I’ve written in the past: connect to a broker or service and provide a…

Blog post by Paul Ingles
20 May 2013

Original Link

Multi-armed Bandit Optimisation in Clojure

The multi-armed (also often referred to as K-armed) bandit problem models the problem a gambler faces when attempting maximise the reward from playing multiple machines with varying rewards.

For example, let’s assume you are standing in front of 3 arms and that you don’t know the rate at which they will reward you. How do you set about the task of pulling the arms to maximise your cumulative reward?

It turns out there’s a fair bit of literature on this topic, and it’s also the subject of a recent O’Reilly book: “Bandit Algorithms for Website Optimization” by…

Blog post by Paul Ingles
20 May 2013

Original Link

Analysing and predicting performance of our Neo4j Cascading Connector with linear regression in R

As I mentioned in an earlier article, Paul and I have produced a library to help connect Cascading and Neo4j making it easy to sink data from Hadoop with Cascading flows into a Neo4j instance. Whilst we were waiting for our jobs to run we had a little fun with some regression analysis to optimise the performance. This post covers how we did it with R.

I’m posting because it wasn’t something I’d done before and it turned out to be pretty good fun. We played with it for a day and haven’t done much else with it…

Blog post by Paul Ingles
20 May 2013

Original Link