TuringBench: The Turing Test Benchmark Environment

What is TuringBench?

TuringBench is a benchmark environment that contains:

Benchmark tasks: Turing Test (i.e., human vs. machine) and Authorship Attribution (i.e., who is the author of this text?)
Datasets (Binary and Multi-class settings)
Website with leaderboard

The dataset has 20 labels (19 AI text generators and human). We built it by collecting 10K news articles (mostly politics) from sources like CNN and keeping only articles of 200-400 words. Next, we used the titles of these human-written articles to prompt the AI text generators (e.g., GPT-2, GROVER) to generate 10K articles each. This gives a total of 200K articles across 20 labels. Because there are two benchmark tasks, the data comes in two settings: for Authorship Attribution, a single multi-class dataset with all 20 labels; for the Turing Test, 19 binary datasets, each pairing human with one AI text generator.
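The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the helper names (`within_length`, `build_binary_datasets`), the whitespace-based word count, the inclusive 200-400 bounds, and the label strings in the toy records are all hypothetical.

```python
# Counts taken from the description above.
NUM_GENERATORS = 19
ARTICLES_PER_LABEL = 10_000

num_labels = NUM_GENERATORS + 1                   # 19 generators + human
total_articles = num_labels * ARTICLES_PER_LABEL  # 20 * 10K = 200K


def within_length(text, min_words=200, max_words=400):
    """Keep an article only if its whitespace-delimited word count is in range.

    Assumption: bounds are inclusive and words are split on whitespace.
    """
    return min_words <= len(text.split()) <= max_words


def build_binary_datasets(records, human_label="human"):
    """Derive one human-vs-generator dataset per generator.

    `records` is a list of (text, label) pairs from the multi-class data.
    Returns a dict mapping each generator label to the human records plus
    that generator's records.
    """
    human = [(text, label) for text, label in records if label == human_label]
    generators = sorted({label for _, label in records if label != human_label})
    return {
        gen: human + [(text, label) for text, label in records if label == gen]
        for gen in generators
    }


# Toy usage with made-up records and labels.
toy = [("a", "human"), ("b", "gpt2"), ("c", "grover"), ("d", "human")]
binary = build_binary_datasets(toy)
```

With the full data, `build_binary_datasets` would return 19 entries, one per generator, which matches the 19 binary-class datasets of the Turing Test setting.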

How to get the Dataset

from datasets import load_dataset
import pandas as pd

# Authorship Attribution (AA): multi-class setting, loaded split by split
train = load_dataset('turingbench/TuringBench', name='AA', split='train')
train = pd.DataFrame.from_dict(train)

test = load_dataset('turingbench/TuringBench', name='AA', split='test')
test = pd.DataFrame.from_dict(test)

valid = load_dataset('turingbench/TuringBench', name='AA', split='validation')
valid = pd.DataFrame.from_dict(valid)

# Turing Test (TT): binary setting, here human vs. GPT-1 (training split)
TT_gpt1 = load_dataset('turingbench/TuringBench', name='TT_gpt1', split='train')
TT_gpt1 = pd.DataFrame.from_dict(TT_gpt1)

Getting Started

We've built a few resources to help you get started with the dataset. The datasets are hosted on the Hugging Face Hub: data repo. We ask contributors to submit their code and/or model weights to turingbench@gmail.com so that we can run the model on the test set ourselves, which preserves the integrity of the results. Because TuringBench is an ongoing effort, we expect the dataset to grow. To keep up to date with major changes to the dataset.

Related publications

Adaku Uchendu, Zeyu Ma, Thai Le, Rui Zhang, and Dongwon Lee. "TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation," in Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, November 2021.
Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee. "Authorship Attribution for Neural Text Generation," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Virtual Event, November 2020.

Have Questions?

Email your questions to turingbench@gmail.com.

Leaderboard: Authorship Attribution

The TuringBench datasets will assist researchers in building robust machine learning and deep learning models that can effectively distinguish machine-generated texts from human-written texts. This leaderboard is for the Authorship Attribution scenario.

Rank	Date	Model	Precision	Recall	F1	Accuracy
1	May 5, 2021	RoBERTa (Liu et al., '19)	0.8214	0.8126	0.8107	0.8173
2	May 5, 2021	BERT (Devlin et al., '18)	0.8031	0.8021	0.7996	0.8078
3	May 5, 2021	BertAA (Fabien et al., '20)	0.7796	0.7750	0.7758	0.7812
4	May 5, 2021	OpenAI detector	0.7810	0.7812	0.7741	0.7873
5	May 5, 2021	SVM (3-grams) (Sapkota et al., '15)	0.7124	0.7223	0.7149	0.7299
6	May 5, 2021	N-gram CNN (Shrestha et al., '17)	0.6909	0.6832	0.6665	0.6914
7	May 5, 2021	N-gram LSTM-LSTM (Jafariakinabad, '19)	0.6694	0.6824	0.6646	0.6898
8	May 5, 2021	Syntax-CNN (Zhang et al., '18)	0.6520	0.6544	0.6480	0.6613
9	May 5, 2021	Random Forest	0.5893	0.6053	0.5847	0.6147
10	May 5, 2021	WriteprintsRFC (Mahmood et al., '19)	0.4578	0.4851	0.4651	0.4943
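
The leaderboard does not state how Precision, Recall, and F1 are averaged over the 20 labels. Several rows report an F1 below both Precision and Recall, which cannot happen for a harmonic mean of the two reported numbers, so the values are at least consistent with per-class (macro) averaging. The sketch below computes the four metrics under that assumption; the function name `macro_scores` and the toy labels are hypothetical, and this is not the evaluation script used for the leaderboard.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1, plus accuracy.

    Assumption: each metric is computed per label and then averaged with
    equal weight per label; a label with an empty denominator scores 0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "accuracy": sum(tp.values()) / len(y_true),
    }


# Toy example with three made-up labels.
y_true = ["human", "human", "gpt2", "gpt2", "grover", "grover"]
y_pred = ["human", "gpt2", "gpt2", "gpt2", "grover", "human"]
scores = macro_scores(y_true, y_pred)
```

In this toy example the macro F1 (about 0.656) falls below both the macro precision (about 0.722) and the macro recall (about 0.667), the same pattern seen in several leaderboard rows.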

Leaderboard: Turing Test

The TuringBench datasets will assist researchers in building robust machine learning and deep learning models that can effectively distinguish machine-generated texts from human-written texts. This leaderboard is for the Turing Test scenario.

GENERATOR
Human Test
GPT-1 (Radford et al., '18)
GPT-2 small (Radford et al., '19)
GPT-2 medium (Radford et al., '19)
GPT-2 large (Radford et al., '19)
GPT-2 xl (Radford et al., '19)
GPT-2 PyTorch (Graykode, '19)
GPT-3 (Brown et al., '20)
GROVER base (Zellers et al., '19)
GROVER large (Zellers et al., '19)
GROVER mega (Zellers et al., '19)
CTRL (Keskar et al., '19)
XLM (Lample et al., '19)
XLNET base (Yang et al., '19)
XLNET large (Yang et al., '19)
FAIR wmt19 (Ng et al., '19)
FAIR wmt20 (Chen et al., '20)
TRANSFORMER-XL (Dai et al., '19)
PPLM distil (Dathathri et al., '20)
PPLM gpt2 (Dathathri et al., '20)