In our experience, what often makes or breaks a machine learning project is the availability of high-quality labeled data.
Iteratively improving the data while holding the learning algorithm fixed allows teams to quickly improve model performance and reduce time to market. This is referred to as a data-centric approach, as opposed to a model-centric approach, where the model and learning algorithm are iteratively improved and the data is held fixed.
There are a few popular approaches to labeling data:
Weak labeling can reduce annotation time by 10x to 100x compared to unassisted hand-labeling, allowing teams to create large labeled datasets quickly. It is often more efficient because it’s faster to write many cheap, noisy supervision signals and combine them into probabilistic labels than to manually label each data point. For more details on weak labeling, we recommend reading “Data Programming: Creating Large Training Sets, Quickly”.
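To make the idea of combining cheap, noisy signals more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions rather than the method from the paper: the toy spam/ham task, the three labeling functions, and the `weak_label` helper are all hypothetical, and the probabilistic label is simply the share of non-abstaining votes, where data programming would instead learn how much to trust each signal.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical labeling functions for a toy "spam" vs "ham" task.
# Each one is a cheap, noisy heuristic that may be wrong or may abstain.
def lf_free_offer(text):
    return "spam" if "free" in text.lower() else ABSTAIN


def lf_has_link(text):
    return "spam" if "http://" in text or "https://" in text else ABSTAIN


def lf_greeting(text):
    return "ham" if text.lower().startswith(("hi ", "hello ")) else ABSTAIN


LABELING_FUNCTIONS = [lf_free_offer, lf_has_link, lf_greeting]


def weak_label(text, lfs=LABELING_FUNCTIONS):
    """Return {label: probability} from the share of non-abstaining votes.

    Simplifying assumption: every labeling function is weighted equally.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return {}  # no labeling function fired; leave the example unlabeled
    counts = Counter(votes)
    return {label: n / len(votes) for label, n in counts.items()}


if __name__ == "__main__":
    for message in [
        "Claim your FREE prize at http://example.com",
        "Hello team, the free lunch is at noon",
        "Meeting moved to 3pm",
    ]:
        print(message, "->", weak_label(message))
```

Writing a handful of functions like these takes minutes, and they can then be applied to every unlabeled example at once. Examples where the signals disagree, or where none fire, are natural candidates for a small amount of hand-labeling.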
If you’re interested in how we can help you with your machine learning project, please get in touch!
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.