TuringBench

The Turing Test Benchmark Environment

What is TuringBench?

TuringBench is a benchmark environment that contains:

  1. Benchmark tasks: the Turing Test (i.e., human vs. machine) and Authorship Attribution (i.e., who is the author of this text?)
  2. Datasets (binary and multi-class settings)
  3. A website with a leaderboard
The dataset has 20 labels (19 AI text-generators and human). We built it by collecting 10K human-written news articles (mostly politics) from sources such as CNN, keeping only articles with 200-400 words. We then used the titles of these articles to prompt the AI text-generators (e.g., GPT-2, GROVER) to generate 10K articles each, giving a total of 200K articles across 20 labels. Since there are two benchmark tasks, the data comes in two settings: the Authorship Attribution setting is a single multi-class dataset with all 20 labels, while the Turing Test setting pairs human against one AI text-generator at a time, yielding 19 binary-class datasets. The sketch below illustrates how the two settings relate.
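
A minimal illustration of the task structure; the generator label strings below are placeholders for a handful of the 19 generators, not necessarily the exact names used in the released files:

```python
# Illustrative sketch of the TuringBench task structure. The label strings
# are assumptions; check the released data for the exact names.
HUMAN = "human"
GENERATORS = ["gpt2", "grover_mega", "ctrl", "xlnet_large", "pplm_distil"]  # 5 of the 19

# Authorship Attribution: one multi-class dataset over all 20 labels.
aa_labels = [HUMAN] + GENERATORS  # ...the full dataset has all 19 generators

# Turing Test: one binary dataset per generator (19 in total),
# each pairing human text against a single machine author.
tt_tasks = {f"TT_{g}": [HUMAN, g] for g in GENERATORS}

print(sorted(tt_tasks))  # ['TT_ctrl', 'TT_gpt2', ...]
```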

Getting Started

We've built a few resources to help you get started with the dataset. The datasets are hosted on Hugging Face's dataset hub: data repo. We ask contributors to submit their code and/or model weights to turingbench@gmail.com so we can run the model on the test set and preserve the integrity of the results. Because TuringBench is an ongoing effort, we expect the dataset to grow; check the website to keep up to date with major changes. A minimal loading sketch follows.
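
Here is a minimal sketch of loading the data with the Hugging Face `datasets` library. The repository id `turingbench/TuringBench` and the configuration names (`AA` for Authorship Attribution, `TT_<generator>` for a Turing Test split) are assumptions based on the naming above; check the data repo for the exact identifiers.

```python
# Sketch of loading TuringBench via Hugging Face `datasets`.
# Repo id and configuration names are assumptions; see the data repo.
from datasets import load_dataset

# Multi-class Authorship Attribution setting (20 labels).
aa = load_dataset("turingbench/TuringBench", name="AA")

# Binary Turing Test setting: human vs. a single generator, e.g. GPT-2.
tt_gpt2 = load_dataset("turingbench/TuringBench", name="TT_gpt2")

print(aa)                    # DatasetDict with the available splits
print(tt_gpt2["train"][0])   # one labeled article
```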

Have Questions?

Ask us questions at turingbench@gmail.com.

Leaderboard: Turing Test on PPLM distil

The TuringBench datasets will assist researchers in building robust machine learning and deep learning models that can effectively distinguish machine-generated texts from human-written ones. This leaderboard is for the Turing Test scenario.

DETECTOR                               F1 SCORE
GROVER detector (Zellers et al. '19)   0.5815
GPT-2 detector (OpenAI '19)            0.5602
GLTR (Gehrmann et al. '19)             0.6842
BERT (Devlin et al. '19)               0.8890
RoBERTa (Liu et al. '19)               0.9015
AVERAGE                                0.7233
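
For contributors wondering how scores like the ones above are computed, here is a hedged sketch of the F1 metric using scikit-learn. The function `my_detector_predict` and the 0/1 label encoding (0 = human, 1 = machine) are hypothetical stand-ins, not the official evaluation script.

```python
# Sketch of scoring a detector on a Turing Test split (human vs. machine).
# `my_detector_predict` is a hypothetical stand-in for a submitted model;
# the label encoding (0 = human, 1 = machine) is an assumption.
from sklearn.metrics import f1_score

def evaluate_detector(texts, gold_labels, my_detector_predict):
    """Return the F1 score of a binary human-vs-machine detector."""
    predictions = [my_detector_predict(t) for t in texts]  # 0 or 1 per text
    return f1_score(gold_labels, predictions)

# Example usage with toy data and a trivial all-machine baseline:
texts = ["an article ...", "another article ..."]
gold = [0, 1]
print(evaluate_detector(texts, gold, lambda t: 1))
```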