What makes or breaks a machine learning project is often the availability of high-quality labeled data.
Iteratively improving the data while holding the learning algorithm fixed allows teams to quickly improve model performance and reduce time to market. This is referred to as a data-centric approach, as opposed to a model-centric approach, where the model and learning algorithm are iteratively improved and the data is held fixed.
There are a few popular approaches to labeling data:
- Hand labeling: manually labeling each data point.
- Assisted labeling: manually labeling data points while letting the model, as it learns throughout the annotation process, label the instances it’s confident about.
- Weak labeling: combining (often easy to obtain) noisy supervision signals, such as rule-based systems or other models, to obtain probabilistic labels.
Weak labeling can reduce annotation time by 10x to 100x compared to unassisted hand labeling, allowing teams to create large labeled datasets quickly. It is often more efficient because writing many cheap, noisy supervision signals and combining them into probabilistic labels is faster than manually labeling each data point. For more details on weak labeling, we recommend reading “Data Programming: Creating Large Training Sets, Quickly”.
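To make the idea of combining noisy supervision signals concrete, here is a minimal sketch in plain Python. It assumes a toy spam-detection task, and the three labeling functions (`lf_contains_free`, `lf_has_url`, `lf_short_greeting`) are hypothetical rules invented for illustration. The signals are combined with a simple vote fraction; this is a deliberate simplification, since data programming as described in the paper learns a model of the labeling functions' accuracies rather than weighting them equally.

```python
from collections import Counter

# A labeling function returns a class (1 = spam, 0 = not spam) or abstains.
ABSTAIN = None

# Hypothetical, deliberately noisy rules for a toy spam task.
def lf_contains_free(text):
    return 1 if "free" in text.lower() else ABSTAIN

def lf_has_url(text):
    return 1 if "http://" in text or "https://" in text else ABSTAIN

def lf_short_greeting(text):
    is_greeting = text.lower().startswith(("hi", "hello"))
    return 0 if is_greeting and len(text) < 40 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_has_url, lf_short_greeting]

def weak_label(text, classes=(0, 1)):
    """Return {class: probability} from the non-abstaining votes.

    Returns None when every labeling function abstains, so the caller
    can route that data point to hand labeling instead.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return None
    counts = Counter(votes)
    # Equal-weight vote fraction: an assumption, not a learned label model.
    return {c: counts[c] / len(votes) for c in classes}

if __name__ == "__main__":
    for message in [
        "Get FREE stuff at http://x.example",
        "Hello, free lunch today?",
        "Meeting moved to 3pm",
    ]:
        print(message, "->", weak_label(message))
```

Each rule takes seconds to write and is wrong some of the time, but the combined output is a probability per class rather than a hard label, which a downstream model can train on. Data points where all rules abstain are the natural candidates for hand labeling.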
If you’re interested in how we can help you with your machine learning project, please get in touch!
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.