TuringBench

The Turing Test Benchmark Environment

What is TuringBench?

TuringBench is a benchmark environment that contains:

  1. Benchmark tasks: the Turing Test (i.e., human vs. machine) and Authorship Attribution (i.e., who is the author of this text?)
  2. Datasets (Binary and Multi-class settings)
  3. Website with leaderboard
The dataset has 20 labels (19 AI text-generators and human). We built it by collecting 10K news articles (mostly politics) from sources such as CNN, keeping only articles with 200-400 words. We then used the titles of these human-written articles to prompt each AI text-generator (e.g., GPT-2, GROVER) to generate 10K articles, for a total of 200K articles across the 20 labels. Because there are two benchmark tasks, the data comes in two settings: a single multi-class dataset with all 20 labels for Authorship Attribution, and 19 binary-class datasets for the Turing Test, each pairing human against one AI text-generator.
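
As a concrete illustration of the binary setting, here is a minimal sketch of filtering a multi-class split down to one human-vs.-generator pair. The file name, column names, and generator tag are placeholders, not the official schema:

```python
import pandas as pd

# Load a multi-class split: one text column and one of 20 labels
# ("human" plus 19 generator names). File/column names are assumptions.
df = pd.read_csv("turingbench_train.csv")  # columns: Generation, label

def binary_split(df: pd.DataFrame, generator: str) -> pd.DataFrame:
    """Keep only human-written rows and rows from one generator,
    yielding one of the 19 human-vs.-machine binary datasets."""
    subset = df[df["label"].isin(["human", generator])].copy()
    subset["label"] = (subset["label"] == generator).astype(int)  # 1 = machine
    return subset

gpt2_vs_human = binary_split(df, "gpt2_xl")  # generator tag is a placeholder
print(gpt2_vs_human["label"].value_counts())
```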

Getting Started

We've built a few resources to help you get started with the dataset. The datasets are hosted on Hugging Face's dataset hub: data repo. To preserve the integrity of the results, we ask contributors to submit their code and/or model weights to turingbench@gmail.com so that we can run the model on the test set ourselves. Because TuringBench is an ongoing effort, we expect the dataset to grow; check the website to keep up to date with major changes.
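
If you pull the data from the Hugging Face hub, loading should look roughly like the sketch below. The repository ID and configuration names (`AA` for the multi-class Authorship Attribution task, `TT_<generator>` for a binary Turing Test pairing) are our assumption about the hub layout and may differ from the actual repo:

```python
from datasets import load_dataset

# Multi-class Authorship Attribution setting (20 labels).
# Repo ID and config names are assumptions about the hub layout.
aa = load_dataset("turingbench/TuringBench", "AA")

# One of the 19 binary Turing Test settings: human vs. GPT-2 XL.
tt_gpt2 = load_dataset("turingbench/TuringBench", "TT_gpt2_xl")

print(aa)                    # expected: DatasetDict of train/validation/test
print(tt_gpt2["train"][0])   # a single text/label example
```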

Have Questions?

Ask us questions at turingbench@gmail.com.

Leaderboard: Human Evaluation of Turing Test

The TuringBench datasets will assist researchers in building robust machine learning and deep learning models that can effectively distinguish machine-generated text from human-written text. This leaderboard reports human evaluation results for the Turing Test scenario.
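
For reference, here is a minimal, unofficial detection baseline on one of the binary splits: TF-IDF features with logistic regression. The `gpt2_vs_human` frame and its column names follow the earlier sketch and are assumptions, not the official pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# gpt2_vs_human comes from the binary_split sketch above (assumed schema:
# "Generation" text column, "label" with 0 = human, 1 = machine).
X_train, X_test, y_train, y_test = train_test_split(
    gpt2_vs_human["Generation"], gpt2_vs_human["label"],
    test_size=0.2, random_state=42, stratify=gpt2_vs_human["label"],
)

# Word/bigram TF-IDF features feeding a linear classifier.
vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_test))
print(f"F1: {f1_score(y_test, preds):.4f}")
```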

| GENERATOR | Human Test (MACHINE) | Human Test (HUMAN vs. MACHINE) |
|---|---|---|
| GPT-1 (Radford et al. '18) | 0.4000 | 0.5600 |
| GPT-2 small (Radford et al. '19) | 0.6200 | 0.4400 |
| GPT-2 medium (Radford et al. '19) | 0.5800 | 0.4800 |
| GPT-2 large (Radford et al. '19) | 0.7400 | 0.4400 |
| GPT-2 XL (Radford et al. '19) | 0.6000 | 0.4800 |
| GPT-2 PyTorch (Graykode '19) | 0.5000 | 0.5600 |
| GPT-3 (Brown et al. '20) | 0.4400 | 0.5800 |
| GROVER base (Zellers et al. '19) | 0.3200 | 0.4200 |
| GROVER large (Zellers et al. '19) | 0.4800 | 0.5800 |
| GROVER mega (Zellers et al. '19) | 0.5400 | 0.4800 |
| CTRL (Keskar et al. '19) | 0.5000 | 0.6900 |
| XLM (Lample et al. '19) | 0.6600 | 0.7000 |
| XLNet base (Yang et al. '19) | 0.5200 | 0.5400 |
| XLNet large (Yang et al. '19) | 0.5200 | 0.5200 |
| FAIR wmt19 (Ng et al. '19) | 0.5600 | 0.5600 |
| FAIR wmt20 (Chen et al. '20) | 0.5800 | 0.2800 |
| Transformer-XL (Dai et al. '19) | 0.5000 | 0.5000 |
| PPLM distil (Dathathri et al. '20) | 0.5600 | 0.4400 |
| PPLM gpt2 (Dathathri et al. '20) | 0.5600 | 0.5000 |
| AVERAGE | 0.5358 | 0.5132 |