import React from 'react'

import Page from './Page'
import styles from './About.module.css'

const PAPER_URL = 'https://arxiv.org/abs/2103.14749'
const BLOG_URL = 'https://l7.curtisnorthcutt.com/label-errors'

const About = () => {
  return (
    <Page title="Label Errors: About">
      <div className={styles.textContent}>
        <p><i>
          Results are not perfect. In some cases, Mechanical Turk workers agree on the wrong label. Because we validated only a small fraction of the examples in these datasets, our counts are likely a lower bound on the true number of label errors.
        </i></p>
        <p>
          This site, and the research behind it, was created by <a href="https://www.curtisnorthcutt.com">Curtis Northcutt</a>, <a href="https://www.anish.io/">Anish Athalye</a>, and <a href="https://people.csail.mit.edu/jonasmueller/">Jonas Mueller</a>. For more details, see our <a href={BLOG_URL}>blog post</a> or <a href={PAPER_URL}>paper</a>. Code to reproduce the label errors for each dataset, along with corrected test sets, is available on <a href="https://github.com/cleanlab/label-errors">GitHub</a>.
        </p>

        <p>Some key takeaways about the label errors shown on this site:</p>

        <ul>
          <li>This website displays data examples from 1 audio dataset (AudioSet), 3 text datasets (Amazon Reviews, IMDB, 20 Newsgroups), and 6 image datasets (ImageNet, CIFAR-10, CIFAR-100, Caltech-256, QuickDraw, MNIST).</li>
          <li>Label errors are prevalent across benchmark ML test sets (3.4% on average).</li>
          <li>We identify these label errors automatically with <a href="https://arxiv.org/abs/1911.00068">confident learning</a>, implemented in the open-source <a href="https://github.com/cleanlab/cleanlab">cleanlab package</a>, and validate them on Mechanical Turk.</li>
          <li>Surprisingly, we find that <b>lower capacity</b> models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on the ImageNet validation set with corrected labels, ResNet-18 outperforms ResNet-50 if we randomly remove just 6% of accurately labeled test data. On the CIFAR-10 test set with corrected labels, VGG-11 outperforms VGG-19 if we randomly remove just 5% of accurately labeled test data.</li>
        </ul>

        <p>
          Each label error depicted on this site includes three things:
        </p>

        <ol>
          <li>the original label given by the dataset</li>
          <li>our guess at what the correct label might be (the argmax prediction of the model)</li>
          <li>the consensus label among 5 Mechanical Turk human raters</li>
        </ol>

        <p>
          The MTurk consensus label may be (1) both labels, (2) neither label, (3) the given label, or (4) the label we guess (with some exceptions for multi-label datasets like AudioSet).
          These are the four possibilities because, when we had the label errors validated, we showed reviewers the original data example, the original label, and our guess of the label, and each
          rater chose one of those four options. Here is what the MTurk validation of label errors looked like:
        </p>

        <p><a href={`${process.env.PUBLIC_URL}/img/mturk.png`}><img src={`${process.env.PUBLIC_URL}/img/mturk.png`} alt="MTurk validation experiment" /></a></p>

        <p>For more details, see our <a href={BLOG_URL}>blog post</a> or <a href={PAPER_URL}>paper</a>.</p>
      </div>
    </Page>
  )
}

export default About
