Lots of our people have lots of opinions. Here are just a few of them.
ThoughtWorks embraces the individuality of the people in the organization and hence the opinions expressed in the blogs may contradict each other and also may not represent the opinions of ThoughtWorks.
Blog post by Trisha Gee
23 May 2013
Netflix continues as a Java technology powerhouse, delivering one open source tool or framework after another. The latest posting from their excellent blog is Brian Moore on Garbage Collection Visualization, a tool for turning gc.log into usable graphs.
The JVM heap teaser shot:
Blog post by Brian Oxley
22 May 2013
Blog post by Trisha Gee
22 May 2013
This post is a continuation of my earlier ‘Clojure at a Bank’ posts. I’ve since left the bank and am working for a large newspaper company, fortunately for me still writing Clojure.
It’s an obvious point to make that different projects can have very different testing demands. At the bank we managed a throughput of financial products, so it was critical that we got no surprises. Prod deployments were often like moon-landings, staged well in advance with lots of people in mission control.
At the newspaper it’s a bit different. Whilst bugs are still not to be warmly…
Blog post by Jon Pither
21 May 2013
Recently, I worked with a colleague (Paul Lam, aka @Quantisan) on building a connector library to let Cascading interoperate with Neo4j: cascading.neo4j. Paul had been experimenting with Neo4j and Cypher to explore our data through graphs, and we wanted an easy way to flow our existing data on Hadoop into Neo4j.
The data processing pipeline we’ve been growing at uSwitch.com is built around Cascalog, Hive, Hadoop and Kafka.
Once the data has been aggregated and stored, a lot of our ETL is performed with Cascalog and, by extension, Cascading. Querying/analysis is a mix of Cascalog and Hive.…
Amazon’s web services have made rebuilding uSwitch.com so much easier. We’re gradually moving more and more static assets to CloudFront (although most visitors are in the UK, responses have much lower latencies than direct from S3 or even our own nginx servers). However, CloudFront doesn't support serving gzip'ed content direct from S3 out of the box.
Because of this, up until last week we were serving uncompressed assets, at least anything that wasn’t already compressed (such as images). Last week we put together a simple static assets nginx server to help compress things.
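To make the trade-off concrete, here is a minimal sketch of gzip-compressing a static asset and the response headers that would have to accompany it. This is not the nginx setup described above; it is a stand-in using Python's standard library, and the `precompress` helper, the sample CSS and the header dictionary are all assumptions made for illustration.

```python
import gzip


def precompress(data: bytes, level: int = 9) -> bytes:
    """Return the gzip-compressed form of a static asset (hypothetical helper)."""
    return gzip.compress(data, compresslevel=level)


# Illustrative asset: repetitive text such as CSS compresses very well.
css = b"body { margin: 0; padding: 0; }\n" * 200
compressed = precompress(css)

# A client can only decode the body if the response says it is gzip-encoded.
headers = {"Content-Type": "text/css", "Content-Encoding": "gzip"}

print(len(css), "bytes uncompressed,", len(compressed), "bytes compressed")
```

Images and other already-compressed formats gain little from this, which is why only text assets such as CSS and JavaScript are worth the effort.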
Whilst doing the work for uSwitch.com…
In a previous article I showed how to visualise the results of a classifier using ggplot2 in R. In the same article I mentioned that Alex, a colleague at Forward, had suggested looking further at R’s caret package, which would produce more detailed statistics about the overall performance of the classifier and within individual classes.
Using ggplot2 we can produce a plot like the one below: a visual representation of a confusion matrix. It gives us a nice overview but doesn’t reveal much about the specific performance characteristics of our classifier.
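As a rough illustration of the per-class statistics a confusion matrix can yield, here is a minimal sketch in plain Python rather than R or caret. The `per_class_metrics` function, the class labels and the counts are all made up for the example; rows are assumed to be actual classes and columns predicted classes.

```python
from typing import Dict, List


def per_class_metrics(labels: List[str], matrix: List[List[int]]) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall per class.

    matrix[i][j] is the number of items whose actual class is labels[i]
    and whose predicted class is labels[j] (an assumed convention).
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column total
        actually_k = sum(matrix[k])                     # row total
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics


# Hypothetical two-class result: 10 actual items in each class.
labels = ["a", "b"]
matrix = [[8, 2],
          [1, 9]]
metrics = per_class_metrics(labels, matrix)
for label, values in metrics.items():
    print(label, values)
```

Here class "a" has a recall of 8/10 and a precision of 8/9: the kind of detail the plot alone does not show.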
Earlier in the year, some colleagues and I started building better data processing tools for uSwitch.com. Part of the theory/reflection of this is captured in a presentation I was privileged to give at EuroClojure (titled Users as Data).
In the last few days, our data team (Thibaut, Paul and I) have been playing around with some of the data we collect and using it to build some classifiers. Precision and Recall provide quantitative measures, but reading through Machine Learning for Hackers showed some nice ways to visualise results.
This week I’ve been prototyping some data processing tools that will work across the platforms we use (Ruby, Clojure, .NET). Having not tried Protocol Buffers before, I thought I’d spike it out and see how it might fit.
The Google page obviously has a lot more detail but for anyone who’s not seen them: you define your messages in an intermediate language before compiling into your target language.
There’s a Ruby library that makes it trivially easy to generate Ruby code so you can create messages as follows:
When I read the transcript of Linus Torvalds’s talk on Git at Google, about 4 years ago, I was working at an investment bank in London. It was just as I’d started using GitHub for hosting my own side-projects and for doing some open-source work. Fast forward to today and I’ve just read an article about the fast rise of GitHub as the software repository of choice for open-source development and an interesting space for Enterprise hosting.
All the banks I worked in were extremely centrally controlled: you’d use approved libraries and tools only. However, the…