Put Data Science Before Data Infrastructure

David Johnston

Published: October 13, 2015

“Big Data” and “Data Science” are today’s business buzzwords. Many companies are trying to modernize their data platforms and enable their employees to monetize valuable data, but most are not seeing the benefits. Advanced data science may be driving some of the hottest startups, but most mature companies are struggling to get into gear. The key reason is an emphasis on architecture over ideas and a tendency to ignore the agile practices that have proven so successful in other areas of software development.

While there are many tenets of Agile software development, it can be described succinctly: Don’t try to plan it all out up-front and then do it. Plan it lightly and adapt that plan while you do it. As my colleague Ken Collier argues in his book Agile Analytics, data infrastructure seems to have survived the big-upfront-investment extinction that happened in the rest of the software industry, and that has hamstrung Business Intelligence in the past and is now doing the same for Data Science.

Overcoming Data-Organization Paralysis

As a data science consultant who has worked with many large corporations, I see a recurring pattern in failed data science efforts, and it’s not surprising that it shows up mostly at large, mature companies. The starting problem is that their legacy data platform, put in place years ago, is not organized for effective data science. It is organized to enable efficient running of the business. While most companies recognize this problem and want to transform themselves to enable a more data-driven culture, they make the mistake of delaying data science until the data organization task is completed. This is the classic “waterfall” mentality, and it leads them into a kind of paralysis.

So often we hear: “We can’t do advanced data science yet. Our data is too disorganized.”

Big Data product vendors are lined up at the door to take advantage of this mentality. But in many cases they sell a solution that is then imposed top-down, from executives onto data scientists, in a way that actually hinders progress.

Don’t Delay Data Science

I’m here to tell you that you can, and in fact you must, start doing data science before putting Big Data infrastructure in place. The biggest reason is that it’s only through solving data science problems that you learn how your data should be structured, or even whether it should be structured. It’s only when you run into problems of data unavailability or scaling that you really discover what type of solution is needed. And different problems will call for different solutions. You need to have solved enough problems with your current infrastructure to see the pattern emerge for what the new one should look like. These insights need to come bottom-up from data science practitioners, not from executives or even data architects, and least of all from product vendors’ salespeople.

One myth about data scientists is that they need to have all the data available in order to get a global picture of the business, attain insights and suggest actions. In fact, data scientists, like everyone else, can only look at so much data before information overload sets in.

The key skill of a data scientist is in actually deciding what data NOT to look at.

They need to be able to see the emerging picture amidst incomplete information. Furthermore, successful data science applications are nearly always built from a relatively small fraction of the available data. More data fields or more data volume should be brought in iteratively to improve models that are already successful, which is standard agile practice.

Scaling is an Overrated Problem

Hadoop, NoSQL databases and other Big Data technologies have helped some prominent companies build data science applications at full scale. However, all of those successful companies have one thing in common: they were already successful doing data science at a smaller scale. Perhaps they computed things in batch, rather than in real time. Perhaps they ran algorithms on a subsample of data or utilized far fewer data fields. But they solved their problem in a simpler way before trying to solve it in a better or faster way.
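
To make the "subsample and fewer fields" idea concrete, here is a minimal Python sketch of one way it might be done on a single machine. It is an illustration under stated assumptions, not a method described by any of the companies mentioned: the function name `subsample`, the field names and the synthetic records are all hypothetical, and reservoir sampling is simply one common way to draw a fixed-size uniform sample from data too large to hold in memory.

```python
import random


def subsample(records, fields, k, seed=0):
    """Return up to k uniformly sampled records, keeping only `fields`.

    Uses reservoir sampling, so `records` can be any iterable (for
    example a generator reading a large file) and is scanned only once.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    reservoir = []
    for i, record in enumerate(records):
        # Drop every field the first model does not need.
        slim = {name: record[name] for name in fields}
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(slim)
        else:
            # Keep this record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = slim
    return reservoir


# Hypothetical usage: a stream of wide records (synthetic, for illustration).
def fake_stream(n):
    for i in range(n):
        yield {"customer_id": i, "spend": i % 97, "region": "north", "notes": "..."}


sample = subsample(fake_stream(100_000), fields=["customer_id", "spend"], k=1_000)
print(len(sample), sorted(sample[0]))
```

A first model can be explored against `sample` on a laptop; more rows or more fields can be added later if the model proves useful.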

Scaling is never what prevents a data science team from arriving at their first successful models. Investing in scalable platforms is not what it takes to get started.

Invest in Data Science First

Investing in data science talent before data infrastructure is the key to becoming a data-driven company. Kapow Software, in a research paper on Big Data, concludes:

“Big Data projects are taking far too long, costing too much and not delivering on anticipated ROI because it’s really difficult to pinpoint and surgically extract critical insights without hiring expensive consultants or data scientists in short demand.”

While data scientists may be difficult to hire, hiring them is an investment you must make. The best-structured data and the most advanced data science tools are simply not effective in the hands of people without the required background. Being ready for data scientists is not a question of data organization; they can help you achieve that organization. Rather, it’s a commitment to removing the barriers of business-as-usual and allowing your data and your data professionals to truly influence your business strategy.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
