The Turing Test Benchmark Environment

What is TuringBench?

TuringBench is a benchmark environment that contains:

  1. Benchmark tasks: Turing Test (i.e., human vs. machine) and Authorship Attribution (i.e., who is the author of this text?)
  2. Datasets (binary and multi-class settings)
  3. Website with leaderboard

The dataset has 20 labels (19 AI text generators and human). We built it by collecting 10K news articles (mostly politics) from sources such as CNN, keeping only articles of 200-400 words. We then used the titles of these human-written articles to prompt the AI text generators (e.g., GPT-2, GROVER) to generate 10K articles each, for a total of 200K articles across 20 labels. Because there are two benchmark tasks, the data is packaged in two ways: for Authorship Attribution, a single multi-class dataset with all 20 labels; for the Turing Test, 19 binary datasets, each pairing human with one AI text generator.
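The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the helper name `within_word_limits`, whitespace tokenization, and inclusive bounds are all assumptions, and the counts are the nominal figures stated in the text.

```python
def within_word_limits(text: str, min_words: int = 200, max_words: int = 400) -> bool:
    """Return True if the article length falls in the 200-400 word window.

    Assumption: words are split on whitespace and both bounds are inclusive;
    the source does not say how words were counted.
    """
    n_words = len(text.split())
    return min_words <= n_words <= max_words


# Nominal sizes implied by the description (hypothetical constant names).
NUM_GENERATORS = 19          # AI text generators
ARTICLES_PER_LABEL = 10_000  # 10K articles per label

num_labels = NUM_GENERATORS + 1                  # generators plus "human"
total_articles = num_labels * ARTICLES_PER_LABEL
num_binary_datasets = NUM_GENERATORS             # one human-vs-generator set each

print(num_labels, total_articles, num_binary_datasets)
```

The arithmetic matches the figures quoted in the paragraph: 20 labels, 200K articles, and 19 binary datasets.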

How to get the Dataset

from datasets import load_dataset
import pandas as pd

# Authorship Attribution (AA): multi-class setting, loaded one split at a time
train = load_dataset('turingbench/TuringBench', name='AA', split='train')
train = pd.DataFrame.from_dict(train)

test = load_dataset('turingbench/TuringBench', name='AA', split='test')
test = pd.DataFrame.from_dict(test)

valid = load_dataset('turingbench/TuringBench', name='AA', split='validation')
valid = pd.DataFrame.from_dict(valid)

# Turing Test (TT) task for GPT-1: binary setting, human vs. GPT-1
TT_gpt1 = load_dataset('turingbench/TuringBench', name='TT_gpt1', split='train')
TT_gpt1 = pd.DataFrame.from_dict(TT_gpt1)
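
The per-split calls above repeat the same pattern, so they can be wrapped in a small helper. The sketch below is one possible way to do it: the function name `load_config` and the `loader` parameter are hypothetical, it uses `Dataset.to_pandas()` instead of `pd.DataFrame.from_dict`, and it assumes every configuration exposes train, validation, and test splits the way AA does.

```python
from typing import Callable, Dict, Optional

SPLITS = ("train", "validation", "test")


def load_config(name: str, loader: Optional[Callable] = None) -> Dict[str, object]:
    """Load every split of one TuringBench configuration as pandas DataFrames.

    `loader` defaults to `datasets.load_dataset`; it is injectable (an
    assumption of this sketch) so the helper can be tried without network access.
    """
    if loader is None:
        # Imported lazily so that importing this module does not require `datasets`.
        from datasets import load_dataset
        loader = load_dataset
    return {
        split: loader("turingbench/TuringBench", name=name, split=split).to_pandas()
        for split in SPLITS
    }
```

With this in place, `frames = load_config("AA")` would give `frames["train"]`, `frames["validation"]`, and `frames["test"]`.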

Getting Started

We've built a few resources to help you get started with the dataset. The datasets are hosted on the Hugging Face Hub in the turingbench/TuringBench data repo. We ask contributors to submit their code and/or model weights to turingbench@gmail.com so that we can run the model on the test set ourselves, which preserves the integrity of the results. Because TuringBench is an ongoing effort, we expect the dataset to grow. To keep up to date with major changes to the dataset.

Related publications

  1. Adaku Uchendu, Zeyu Ma, Thai Le, Rui Zhang, and Dongwon Lee. "TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation." In Findings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, November 2021.
  2. Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee. "Authorship Attribution for Neural Text Generation." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Virtual Event, November 2020.

Have Questions?

Send questions to turingbench@gmail.com.

Leaderboard: Authorship Attribution

The TuringBench datasets are intended to help researchers build robust machine learning and deep learning models that can effectively distinguish machine-generated texts from human-written texts. This leaderboard covers the Authorship Attribution scenario.

Rank  Date         Model                                Precision  Recall  F1      Accuracy
1     May 5, 2021  (Liu et al., '19)                    0.8214     0.8126  0.8107  0.8173
2     May 5, 2021  (Devlin et al., '18)                 0.8031     0.8021  0.7996  0.8078
3     May 5, 2021  (Fabien et al., '20)                 0.7796     0.7750  0.7758  0.7812
4     May 5, 2021  OpenAI detector                      0.7810     0.7812  0.7741  0.7873
5     May 5, 2021  SVM (3-grams) (Sapkota et al., '15)  0.7124     0.7223  0.7149  0.7299
6     May 5, 2021  N-gram CNN (Shrestha et al., '17)    0.6909     0.6832  0.6665  0.6914
7     May 5, 2021  (Jafariakinabad, '19)                0.6694     0.6824  0.6646  0.6898
8     May 5, 2021  (Zhang et al., '18)                  0.6520     0.6544  0.6480  0.6613
9     May 5, 2021  Random Forest                        0.5893     0.6053  0.5847  0.6147
10    May 5, 2021  (Mahmood et al., '19)                0.4578     0.4851  0.4651  0.4943
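
To compare a model against the leaderboard columns, the four metrics can be computed from gold and predicted labels. The sketch below is an illustration only: the leaderboard does not state how precision, recall, and F1 are averaged over the 20 labels, so macro-averaging is an assumption here, and the function name `macro_scores` is hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def macro_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Macro-averaged precision, recall, and F1, plus overall accuracy.

    Assumption: macro averaging (unweighted mean over labels); the
    leaderboard's actual averaging scheme is not given in the source.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label gets a false positive
            fn[gold] += 1  # gold label gets a false negative

    labels = set(y_true) | set(y_pred)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "accuracy": sum(tp.values()) / len(y_true),
    }
```

For example, `macro_scores(["human", "human", "gpt2", "gpt2"], ["human", "gpt2", "gpt2", "gpt2"])` yields an accuracy of 0.75 and a macro recall of 0.75. The labels in this call are placeholders, not the dataset's actual label strings.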