
Alumni blogs

Lots of our people have lots of opinions. Here are just a few of them.

ThoughtWorks embraces the individuality of the people in the organization, so the opinions expressed in these blogs may contradict each other and may not represent the views of ThoughtWorks.

Introducing Blueshift: Automated Amazon Redshift ingestion from Amazon S3

I’m very pleased to say I’ve just made public a repository for a tool we’ve built to make it easier to ingest data automatically into Amazon Redshift from Amazon S3: https://github.com/uswitch/blueshift.

Amazon Redshift is a wonderfully powerful product; if you’ve not tried it yet, you should definitely take a look. I’ve written before about the value of the analytical flow it enables.

However, as nice as it is to consume data from, ingesting data is a little less fun:

  1. Forget about writing raw INSERT statements: we saw individual inserts take on the order of 5 or 6 seconds…
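
Redshift’s bulk-load path is the COPY command, which pulls files directly from S3 in parallel; a tool like Blueshift presumably builds on that rather than on row-by-row INSERTs. As a rough sketch only (this is not Blueshift’s code; the cluster, table, bucket and credentials below are made up), such a load could be issued from Python with psycopg2:

import psycopg2

# Hypothetical cluster, table, bucket and credentials - illustration only,
# not Blueshift's implementation.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="..."
)

copy_sql = """
    COPY events
    FROM 's3://example-bucket/events/2014-10-23/'
    CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=...'
    DELIMITER '|' GZIP
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift loads the S3 files in parallel across the cluster

A single COPY of many files is dramatically faster than issuing individual INSERT statements.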

Blog post by Paul Ingles
23 October 2014

Original Link

Neo4j: Cypher – Avoiding the Eager

Although I love how easy Cypher’s LOAD CSV command makes it to get data into Neo4j, it currently breaks the rule of least surprise in the way it eagerly loads in all rows for some queries, even those using periodic commit.

Beware of the eager pipe (image from Neverwhere)

This is something that my colleague Michael noted in the second of his blog posts explaining how to use LOAD CSV successfully:

The biggest issue that people ran into, even when following the advice I gave earlier, was that for large imports of more than one million rows, Cypher ran into an…

Blog post by Mark Needham
23 October 2014

Original Link

Is good design to be equated with functional?

Musings on my past, good design, functionality, ergonomics, customer experience, taps, light switches and a juicer.

Is good design to be equated with functional?

That was the question. For the next 40 minutes I scribbled the answer to the ‘A’ Level History of Art question. Twenty-five years later, two things strike me. Firstly, that my answer must have satisfied the examiner, because I got a good grade. And secondly, that after all these years, the question still sticks in my mind. (My answer, not so much.) It sticks in my mind because I’ve spent most of my working life addressing…

Blog post by Marc McNeill
21 October 2014

Original Link

Neo4j: Modelling sub types

A question which sometimes comes up when discussing graph data modelling is how you go about modelling sub/super types.

In my experience there are two reasons why we might want to do this:

  • To ensure that certain properties exist on bits of data
  • To write drill down queries based on those types

At the moment the former isn’t built into Neo4j; you’d only be able to achieve it by wiring up some code in a pre-commit hook of a transaction event handler, so we’ll focus on the latter.

The typical example used for showing how to design…
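
For the drill-down case, one common modelling option (which may or may not be the one the post goes on to recommend) is to give a node both a general label and a more specific one, then match on the specific label when drilling down. A small sketch using the Python driver, with made-up names and connection details:

from neo4j import GraphDatabase

# Hypothetical connection details and data - a sketch of labels-as-subtypes.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # :Person is the super type; :Author and :Editor are sub types
    session.run("CREATE (:Person:Author {name: 'Alice'})")
    session.run("CREATE (:Person:Editor {name: 'Bob'})")

    # Drill-down query: only nodes carrying the sub type label are matched
    for record in session.run("MATCH (a:Person:Author) RETURN a.name AS name"):
        print(record["name"])

driver.close()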

Blog post by Mark Needham
20 October 2014

Original Link

Elasticsearch, Ruby and Unicorn

We have been using Ruby on Rails as well as Elasticsearch for a while. To avoid downtime during deployment, we have been using Unicorn, more or less configured as this blog post describes. While running a single instance of Elasticsearch was pretty trivial with Karel’s new Elasticsearch Ruby gem, moving to a clustered setup forced us to understand the configuration of the gem a bit better. I thought I’d sum up a few lessons learned here, in case they might be useful to someone:

There are quite a few options you can pass into the Elasticsearch client that allow…
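
The post is about the Ruby gem, but the flavour of those options carries across clients. As a rough Python analogue only (elasticsearch-py 7.x-style option names, which differ between clients and versions; the hosts are made up), pointing a client at several nodes of a cluster looks roughly like this:

from elasticsearch import Elasticsearch

# Hypothetical hosts; elasticsearch-py 7.x option names, shown only as an
# analogue to the Ruby gem's configuration.
es = Elasticsearch(
    ["es-node1:9200", "es-node2:9200", "es-node3:9200"],
    sniff_on_start=True,            # discover the rest of the cluster at startup
    sniff_on_connection_fail=True,  # re-sniff when a node drops out
    retry_on_timeout=True,
    max_retries=3,
)

print(es.cluster.health())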

Blog post by Erling Wegger Linde
20 October 2014

Original Link

Python: Converting a date string to timestamp

I’ve been playing around with Python over the last few days while cleaning up a data set, and one thing I wanted to do was translate date strings into a timestamp.

I started with a date in this format:

date_text = "13SEP2014"

So the first step is to translate that into a Python date – the strftime section of the documentation is useful for figuring out which format code is needed:

import datetime

date_text = "13SEP2014"
date = datetime.datetime.strptime(date_text, "%d%b%Y")

print(date)

$ python dates.py
2014-09-13 00:00:00
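
The excerpt stops at the parsed date; one common way to then turn it into a Unix timestamp (not necessarily the route the post takes from here) is via the standard library:

import calendar
import datetime
import time

date_text = "13SEP2014"
date = datetime.datetime.strptime(date_text, "%d%b%Y")

# Seconds since the epoch, interpreting the parsed date as local time...
print(time.mktime(date.timetuple()))
# ...or as UTC
print(calendar.timegm(date.timetuple()))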

Blog post by Mark Needham
20 October 2014

Original Link

Neo4j: LOAD CSV – The sneaky null character

I spent some time earlier in the week trying to import a CSV file extracted from Hadoop into Neo4j using Cypher’s LOAD CSV command and initially struggled due to some rogue characters.

The CSV file looked like this:

$ cat foo.csv
foo,bar,baz
1,2,3

I wrote the following LOAD CSV query to extract some of the fields and compare others:

LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2"
==> +--------------------------------------+
==> | line.foo | line.bar | line.bar = "2" |
==> +--------------------------------------+
==> |          | "2"      | false          |
==> +--------------------------------------+
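
The “sneaky null character” of the title suggests the empty-looking line.foo value is hiding NUL bytes. One quick way to confirm and strip them before re-running the import (the post may well settle on a different fix) is a small Python pass over the file:

# Strip NUL bytes from the exported CSV before handing it to LOAD CSV.
# File names are hypothetical.
with open("foo.csv", "rb") as source, open("foo-clean.csv", "wb") as target:
    for line in source:
        target.write(line.replace(b"\x00", b""))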

Blog post by Mark Needham
18 October 2014

Original Link

R: Linear models with the lm function, NA values and Collinearity

In my continued playing around with R I’ve sometimes noticed ‘NA’ values in the linear regression models I created but hadn’t really thought about what that meant.

On the advice of Peter Huber I recently started working my way through Coursera’s Regression Models, which has a whole slide explaining its meaning.


So in this case ‘z’ doesn’t help us in predicting Fertility, since it doesn’t give us any more information than we can already get from ‘Agriculture’ and ‘Education’.

Although in this case we know why ‘z’ doesn’t have a coefficient, sometimes it may not be clear which other variable…
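
For intuition, the same situation can be reproduced outside R: when one column is an exact linear combination of the others, the design matrix loses rank and that column cannot get a coefficient of its own. A quick numpy sketch (not the R workflow the post uses, and the data is made up):

import numpy as np

# 'z' is an exact linear combination of the other predictors, so the design
# matrix has 4 columns but only rank 3 - no separate coefficient for 'z'.
rng = np.random.default_rng(42)
agriculture = rng.random(50)
education = rng.random(50)
z = 2 * agriculture + 3 * education

X = np.column_stack([np.ones(50), agriculture, education, z])
print(X.shape[1], np.linalg.matrix_rank(X))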

Blog post by Mark Needham
18 October 2014

Original Link

Introducing Bifrost: Archive Kafka data to Amazon S3

We're happy to announce the public release of a tool we've been using in production for a while now: Bifrost.

We use Bifrost to incrementally archive all our Kafka data into Amazon S3; these transaction logs can then be ingested into our streaming data pipeline (we only need to use the archived files occasionally when we radically change our computation).

Rationale

There are a few other projects that scratch the same itch: notably Secor from Pinterest and Kafka's old hadoop-consumer. Although Secor doesn't rely on running Hadoop jobs, it still uses Hadoop's SequenceFile file format; sequence files allow…

Blog post by Paul Ingles
17 October 2014

Original Link

Monitoring Go programs with Riemann

uSwitch.com uses Riemann to monitor the operations of the various applications and services that compose our system. Riemann helps us aggregate system and application metrics and quickly assemble dashboards to understand how things are behaving now.

Most of these services and applications are written in Clojure, Java or Ruby, but recently we needed to monitor a service we implemented with Go.

Hopefully this is interesting for Go authors who haven’t come across Riemann, or for Clojure programmers who haven’t considered deploying Go before. I think Go is a very interesting option for building systems-type services.

Whence Go?

Recently at…

Blog post by Paul Ingles
17 October 2014

Original Link