TuringBench: The Turing Test Benchmark Environment

What is TuringBench?

TuringBench is a benchmark environment that contains:

Benchmark tasks: Turing Test (i.e., human vs. machine) and Authorship Attribution (i.e., who is the author of this text?)
Datasets (Binary and Multi-class settings)
Website with leaderboard

The dataset has 20 labels (19 AI text-generators and human). We built this dataset by collecting 10K news articles (mostly politics) from sources like CNN and keeping only articles with 200-400 words. Next, we used the titles of these human-written articles to prompt the AI text-generators (e.g., GPT-2, GROVER) to generate 10K articles each. This gives a total of 200K articles across 20 labels. Because there are two benchmark tasks, Turing Test and Authorship Attribution, the data comes in two settings: one multi-class dataset containing all 20 labels, and 19 binary-class datasets that each pair human with one AI text-generator.
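The counts in the paragraph above can be checked with a short sketch. Only GPT-2 and GROVER are named in the text, so the remaining generator names below are placeholders, and the variable names are illustrative rather than part of any TuringBench API.

```python
# Sketch of the label layout described above. Only "GPT-2" and "GROVER" are
# named in the text; every "generator_N" entry is a hypothetical placeholder.
ARTICLES_PER_LABEL = 10_000
NUM_GENERATORS = 19

named = ["GPT-2", "GROVER"]
placeholders = [f"generator_{i}" for i in range(len(named) + 1, NUM_GENERATORS + 1)]
generators = named + placeholders

# Multi-class setting (Authorship Attribution): human plus all 19 generators.
multiclass_labels = ["human"] + generators

# Binary setting (Turing Test): one dataset per generator, human vs. that generator.
binary_datasets = {g: ("human", g) for g in generators}

# Each label contributes the same number of articles.
total_articles = ARTICLES_PER_LABEL * len(multiclass_labels)

print(len(multiclass_labels))  # 20
print(len(binary_datasets))    # 19
print(total_articles)          # 200000
```

Each binary dataset shares the same human-written articles and differs only in which generator supplies the machine-written class.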

Getting Started

We've built a few resources to help you get started with the dataset. The datasets will be hosted on the Hugging Face Hub: data repo. We ask contributors to submit their code and/or model weights to turingbench@gmail.com so that we can run the model on the test set ourselves, which preserves the integrity of the results. Because TuringBench is an ongoing effort, we expect the dataset to grow. To keep up to date with major changes to the dataset.

Have Questions?

Email your questions to turingbench@gmail.com.

Leaderboard: Turing Test on GPT-2 PyTorch

The TuringBench datasets will assist researchers in building robust machine learning and deep learning models that can effectively distinguish machine-generated text from human-written text. This leaderboard is for the Turing Test scenario.

DETECTOR	F1 score
GROVER detector (Zellers et al. '19)	0.5679
GPT-2 detector (OpenAI '19)	0.5096
GLTR (Gehrmann et al. '19)	0.7183
BERT (Devlin et al. '19)	0.9875
RoBERTa (Liu et al. '19)	0.8444
AVERAGE	0.7255
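As a quick sanity check on the table above, the following sketch recomputes the AVERAGE row. It assumes that row is the unweighted arithmetic mean of the five listed F1 scores; the page does not say so explicitly, but the numbers agree. The dictionary and variable names are illustrative.

```python
# F1 scores copied from the leaderboard table above.
f1_scores = {
    "GROVER detector": 0.5679,
    "GPT-2 detector": 0.5096,
    "GLTR": 0.7183,
    "BERT": 0.9875,
    "RoBERTa": 0.8444,
}

# Assumption: AVERAGE is the plain (unweighted) mean over the detectors.
average_f1 = sum(f1_scores.values()) / len(f1_scores)

# Detector with the highest F1 score on this task.
best_detector = max(f1_scores, key=f1_scores.get)

print(f"{average_f1:.4f}")  # 0.7255
print(best_detector)        # BERT
```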