Results are not perfect. In some cases, Mechanical Turk workers agree on the wrong label. Our estimates also likely capture only a lower bound on the true error rate, since we validated only a small fraction of each dataset for errors.
This site and the research behind it were created by Curtis Northcutt, Anish Athalye, and Jonas Mueller. Code to reproduce the label errors for each dataset, as well as the corrected test sets, is available on GitHub.
Some key takeaways about the label errors shown on this site:
- This website displays examples of label errors across 1 audio dataset (AudioSet), 3 text datasets (Amazon Reviews, IMDB, 20 Newsgroups), and 6 image datasets (ImageNet, CIFAR-10, CIFAR-100, Caltech-256, QuickDraw, MNIST).
- Label errors are prevalent across benchmark ML test sets (3.4% of test set labels on average).
- We identify these label errors automatically using confident learning via the open-source cleanlab package, and we validate them with human raters on Mechanical Turk (a minimal code sketch follows this list).
- Surprisingly, we find lower-capacity models may be practically more useful than higher-capacity models on real-world datasets with high proportions of erroneously labeled data; the second sketch after this list illustrates the comparison. For example, on the ImageNet validation set with corrected labels, ResNet-18 outperforms ResNet-50 if we randomly remove just 6% of accurately labeled test data. On the CIFAR-10 test set with corrected labels, VGG-11 outperforms VGG-19 if we randomly remove just 5% of accurately labeled test data.
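To make the confident-learning step concrete, here is a minimal sketch using cleanlab's `find_label_issues` function (the cleanlab 2.x API); the classifier, synthetic data, and variable names are illustrative assumptions, not the exact pipeline behind this site:

```python
# Minimal sketch: flag likely label errors with confident learning via cleanlab.
# The data and classifier here are placeholders, not the models used for this site.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

X = np.random.rand(1000, 20)                  # features (placeholder data)
labels = np.random.randint(0, 3, size=1000)   # the given (possibly noisy) labels

# Out-of-sample predicted probabilities via cross-validation, so no example is
# scored by a model that was trained on it.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

# Indices of examples whose given label likely differs from the true label.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)

# Our "guessed" label for each flagged example is the model's argmax prediction.
guessed_labels = pred_probs[issue_indices].argmax(axis=1)
print(f"Flagged {len(issue_indices)} potential label errors.")
```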
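The capacity comparison above amounts to the following thought experiment: keep the corrected labels fixed, randomly discard a fraction of the test examples whose original labels were already correct, and re-measure both models. Here is a hedged sketch (the function and array names are hypothetical, as is treating removal as uniform random subsampling):

```python
import numpy as np

def compare_after_removal(corrected_labels, original_labels,
                          preds_small, preds_large,
                          remove_frac, seed=0):
    """Accuracy of a smaller and a larger model (against corrected labels)
    after randomly removing a fraction of the correctly labeled test
    examples. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    # Test examples whose original label already matched the corrected label.
    correctly_labeled = np.flatnonzero(original_labels == corrected_labels)
    n_remove = int(remove_frac * len(correctly_labeled))
    removed = rng.choice(correctly_labeled, size=n_remove, replace=False)
    keep = np.setdiff1d(np.arange(len(corrected_labels)), removed)
    acc_small = (preds_small[keep] == corrected_labels[keep]).mean()
    acc_large = (preds_large[keep] == corrected_labels[keep]).mean()
    return acc_small, acc_large
```

With `remove_frac=0.06`, for example, this corresponds to the ResNet-18 vs. ResNet-50 comparison described above.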
Each label error depicted on this site includes three things:
- the original label given by the dataset
- our guess at what the correct label might be (the argmax prediction of the model)
- the consensus label among 5 Mechanical Turk human raters
The MTurk consensus label may be (1) the given label, (2) the label we guessed, (3) both, or (4) neither (with some exceptions for multi-label datasets like AudioSet). This is because, during validation, we showed each rater the original data example, the original label, and our guessed label, and each rater chose one of those four options. Here is what the MTurk validation of label errors looked like:

[Screenshot of the Mechanical Turk interface shown to raters during validation.]
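For illustration, one simple way to reduce the five rater responses to a consensus is a majority vote over these four options. This is a hypothetical sketch (the option names, the 3-of-5 threshold, and the tie handling are assumptions, not necessarily the exact rule used for this site):

```python
from collections import Counter

def mturk_consensus(rater_choices):
    """Derive a consensus from 5 rater choices among
    {"given", "guessed", "both", "neither"}.
    Hypothetical rule: require a simple majority (>= 3 of 5 raters)."""
    option, count = Counter(rater_choices).most_common(1)[0]
    return option if count >= 3 else None  # None: no consensus reached

# Example: 4 of 5 raters chose our guessed label.
print(mturk_consensus(["guessed", "guessed", "both", "guessed", "guessed"]))
```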
For more details, see our blog post or paper.