TuringBench is a benchmark environment that contains benchmark tasks (the Turing Test and Authorship Attribution scenarios), the corresponding datasets, and a website with leaderboards.
We've built a few resources to help you get started with the dataset. The datasets are hosted on Hugging Face's dataset hub: data repo. We ask contributors to submit their code and/or model weights to turingbench@gmail.com so that we can run the model on the test set ourselves and preserve the integrity of the results. Because TuringBench is an ongoing effort, we expect the dataset to grow; keep an eye on the data repo for major changes.
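For example, once the data is live on the hub, a split can be pulled down with the `datasets` library. This is a minimal sketch: the repository id (`turingbench/TuringBench`) and the configuration name (`TT_gpt2-xl`) are assumptions about how the data is published, so check the data repo for the exact identifiers.

```python
# Minimal sketch of loading a TuringBench split with the Hugging Face `datasets` library.
# The repo id and config name below are assumptions -- see the data repo for the
# identifiers that are actually published.
from datasets import load_dataset

# "TT_gpt2-xl" is a hypothetical Turing Test (human vs. GPT-2 XL) configuration name.
ds = load_dataset("turingbench/TuringBench", "TT_gpt2-xl")

print(ds)              # available splits (train / validation / test are assumed)
print(ds["train"][0])  # one example: a text plus a human/machine label (assumed schema)
```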
Ask us questions at our email: turingbench@gmail.com.
The TuringBench datasets will assist researchers in building robust machine learning and deep learning models that can effectively distinguish machine-generated text from human-written text. This leaderboard is for the Turing Test scenario; a minimal baseline sketch for this task follows the table.
| GENERATOR | Human Test (MACHINE) | Human Test (HUMAN vs. MACHINE) |
|---|---|---|
| GPT-1 (Radford et al. '18) | 0.4000 | 0.5600 |
| GPT-2 small (Radford et al. '19) | 0.6200 | 0.4400 |
| GPT-2 medium (Radford et al. '19) | 0.5800 | 0.4800 |
| GPT-2 large (Radford et al. '19) | 0.7400 | 0.4400 |
| GPT-2 XL (Radford et al. '19) | 0.6000 | 0.4800 |
| GPT-2 pytorch (Graykode '19) | 0.5000 | 0.5600 |
| GPT-3 (Brown et al. '20) | 0.4400 | 0.5800 |
| GROVER base (Zellers et al. '19) | 0.3200 | 0.4200 |
| GROVER large (Zellers et al. '19) | 0.4800 | 0.5800 |
| GROVER mega (Zellers et al. '19) | 0.5400 | 0.4800 |
| CTRL (Keskar et al. '19) | 0.5000 | 0.6900 |
| XLM (Lample et al. '19) | 0.6600 | 0.7000 |
| XLNET base (Zhilin et al. '19) | 0.5200 | 0.5400 |
| XLNET large (Zhilin et al. '19) | 0.5200 | 0.5200 |
| FAIR wmt19 (Ng et al. '19) | 0.5600 | 0.5600 |
| FAIR wmt20 (Chen et al. '20) | 0.5800 | 0.2800 |
| TRANSFORMER XL (Dai et al. '19) | 0.5000 | 0.5000 |
| PPLM distil (Dathathri et al. '20) | 0.5600 | 0.4400 |
| PPLM gpt2 (Dathathri et al. '20) | 0.5600 | 0.5000 |
| AVERAGE | 0.5358 | 0.5132 |
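If you are getting started on the Turing Test (human vs. machine) task, a simple bag-of-words baseline is a concrete first step. The sketch below is only an illustration, not the evaluation pipeline behind this leaderboard: the configuration name, the column names (`Generation`, `label`), and the availability of labeled train/validation splits are assumptions about the published data, and it scores on a validation split because test labels are held out for organizer-run scoring.

```python
# A hedged baseline sketch for the binary Turing Test task (human vs. machine text):
# TF-IDF features + logistic regression with scikit-learn. The config name and the
# column names ("Generation", "label") are assumptions about the dataset schema.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

ds = load_dataset("turingbench/TuringBench", "TT_gpt2-xl")  # hypothetical config name

train_texts, train_labels = ds["train"]["Generation"], ds["train"]["label"]
val_texts, val_labels = ds["validation"]["Generation"], ds["validation"]["label"]

# Word/bigram TF-IDF features: a deliberately simple representation for a first baseline.
vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

preds = clf.predict(X_val)
# Test-set labels are not evaluated here; submit code/weights so we can score the test set.
print("Macro F1 on the validation split:", f1_score(val_labels, preds, average="macro"))
```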