ThoughtWorks
Coding practices for data scientists

Shraddha Surana

Published: Jun 11, 2019

“As a data scientist, I am not expected to write clean code as most of what I write is throwaway.” I am a data scientist and I do not identify with this sentiment. I believe that clean code practices apply to the entire team, no matter the role.

Let me tell you why: software development does not strictly need clean code, nor does it need agile principles. One can craft perfectly working software without either. But without them, maintaining, modifying and scaling the software becomes incredibly cumbersome.

In most, if not all, instances, data scientists work in the space of applied sciences, and their experiments have to be reproducible and verifiable. This means that you, as a data scientist, need to understand someone else's problem, and the respective team and business leads need to understand your solution.

Let’s quickly look at the other team members you will frequently collaborate with:
  • Data scientists or data analysts collaborate with you on algorithms, models, interpreting results, feature engineering and more
  • Data engineers collaborate with you so that you do not end up processing all the data by yourself. They will optimize queries to hand you consolidated results in record time, but they need to begin by understanding what data you need.
  • Business analysts and domain experts help you clarify results in business terms.
How can we make this collaboration more effective through code? Let’s discuss some low-hanging fruit.

Meaningful names

Team members shouldn’t have to ask what a variable means. Clear, intention-revealing names work best, e.g. instead of alpha = 0.3, try learning_rate = 0.3.

Long names are perfectly acceptable. To keep them manageable, use an editor with autocompletion so that nobody has to type the entire variable name repeatedly, rather than shortening the name and making it cryptic. Also, let’s avoid magic numbers.

# Instead of:
for i in range(3000):
    ...

# Use:
number_of_iterations = 3000
for i in range(number_of_iterations):
    ...

Avoid mental mapping

When code parameters need to be changed to obtain the desired results, save your team members from having to read your mind by creating a config file or object. It should contain all the parameters and hyperparameters that may have to be changed in your codebase. Keeping related parameters together also helps. For example, when building a neural network model that runs on datasets from different countries, your config file could resemble this:

# Neural Network hyperparameters
learning_rate = 0.1
epochs = 3000
lambd = 0

# Other constants needed for your codebase
country = "India"
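If you prefer a config object to a file, the same values could live in a small immutable class. This is only a sketch: the names TrainingConfig, default_config and brazil_config are hypothetical, the spelling epochs is a choice made here, and reading lambd as a regularization strength is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical config object: every tunable value lives in one place."""

    # Neural network hyperparameters
    learning_rate: float = 0.1
    epochs: int = 3000
    lambd: float = 0.0  # assumed to be a regularization strength

    # Other constants needed for the codebase
    country: str = "India"


default_config = TrainingConfig()

# Derive a variant for another dataset; all other values stay unchanged.
brazil_config = replace(default_config, country="Brazil")
```

Because the class is frozen, a value cannot be changed silently halfway through an experiment; a new config has to be created explicitly.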

Uniform naming conventions

Avoid multiple names for the same thing, whether in mail, code or anywhere else in the project. For instance, during internal discussions personA might refer to the ‘Earth’ while personB refers to the ‘Globe’, both meaning the same thing. Stick to one name in discussions and in code.

Use business domain names

Feature engineering, analysis and model building are great places to exercise your creativity. Inventing names is not the best use of your talents. If the business calls ‘it’ a sphere, then call ‘it’ a sphere in your code too, rather than ‘circle_in_3d’.

Comments

Comments are generally discouraged because they quickly become obsolete. However, comments that advise on using or not using certain models and features are worth adding. For instance, a comment that reads, “This ADF (Augmented Dickey-Fuller) test, which identifies stationary signals, will take more than 5 hours on a laptop because we cannot run combinations in parallel,” will let your team know not to run it at the start of their day.

# Neural Network hyperparameters
# learning_rate < 0.05 is going to take more than 5 hours to train
learning_rate = 0.1
epochs = 3000
lambd = 0

# Other constants needed for your codebase
country = "India"

Modularize your code

Big blobs of code are opaque. So, instead of requiring team members to go through every line of the code to understand its purpose, you could create functions that do what their name states. Functions are a good level of abstraction. The reader can choose to go inside a function or a class only if they want to understand the internal working.
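As a rough illustration, a preprocessing step might be split into small functions whose names say what they do. The function names and the price data are invented for this sketch and do not come from a real project.

```python
def remove_missing_values(values):
    """Drop entries that are None."""
    return [value for value in values if value is not None]


def scale_to_unit_range(values):
    """Rescale linearly so the smallest value maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


def prepare_prices(raw_prices):
    """Top-level step: reads like a summary of the preprocessing."""
    cleaned = remove_missing_values(raw_prices)
    return scale_to_unit_range(cleaned)


prepared = prepare_prices([10.0, None, 20.0, 30.0])
```

A reader who only wants the big picture can stop at prepare_prices; the two helpers are there for anyone who needs the details.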

Don’t repeat yourself

Extract code that you need in more than one place into a function or component and reuse it. This keeps the code consistent, and the reader only needs to understand that component once.
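For example, if the same error metric is needed for both training and test data, it could be written once and called twice. This is a minimal sketch; the function name and the numbers are made up for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up numbers: the same function serves both datasets.
train_error = mean_absolute_error([3.0, 5.0, 8.0], [2.0, 5.0, 10.0])
test_error = mean_absolute_error([4.0, 6.0], [5.0, 6.0])
```

If the definition of the metric ever changes, there is exactly one place to update.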

Unit tests

Unit tests are relevant to a data scientist's code because data science code is still code. This is especially true when we are testing functions (like algorithms that recommend product prices) on which business decisions are based.
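A test for such a pricing function might look like the sketch below. recommend_price and its rule (cost plus a margin, never below a floor) are hypothetical stand-ins, not a real pricing algorithm; the tests use Python's built-in unittest module.

```python
import unittest


def recommend_price(cost, margin_rate, price_floor):
    """Hypothetical pricing rule: cost plus a margin, never below a floor."""
    if cost < 0 or margin_rate < 0:
        raise ValueError("cost and margin_rate must be non-negative")
    return max(cost * (1 + margin_rate), price_floor)


class RecommendPriceTest(unittest.TestCase):
    def test_applies_margin_to_cost(self):
        self.assertAlmostEqual(recommend_price(100.0, 0.25, 0.0), 125.0)

    def test_never_goes_below_floor(self):
        self.assertEqual(recommend_price(10.0, 0.25, 50.0), 50.0)

    def test_rejects_negative_cost(self):
        with self.assertRaises(ValueError):
            recommend_price(-1.0, 0.25, 0.0)
```

Saved in a file, these tests can be run with python -m unittest. Each test name states the business rule it protects, so a failure points directly at the behaviour that broke.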

Formatting

Formatting is a team decision. This includes whether members use spaces or tabs for indentation, how the directory is structured, which file naming conventions apply, the format of output results and more. Use tools like YAPF, autopep8 or Prettier to format the code automatically when saving a file.

In conclusion, I vote for MVP approaches: trying out various approaches and models that might sometimes be quick and dirty, but give results. This is effective when figuring out how much time to invest and in what direction. Once you finalize your approach and incorporate it into your working code, do clean it up.

All of the practices I have discussed in this article are for the benefit of the whole team. Interestingly, I find them useful when working solo, too. They serve as quick reminders of why I chose to do things a certain way, and they help me explain results and quickly cross-examine unexpected ones.

Some of what I have talked about in this article can be found in the book Clean Code by Robert C. Martin, an American software engineer and instructor. He is best known for being one of the authors of the Agile Manifesto and for developing several software design principles. If you’d like a more in-depth view of the subject, I would urge you to read it. And please remember, while every data scientist might have their own set of practices and rules, when working in a team it’s always good to abide by the commonly accepted set of do’s and don’ts.

© 2021 ThoughtWorks, Inc.