Paper ID: 2202.03371

Cedille: A large autoregressive French language model

Martin Müller, Florian Laurent

Scaling up the size and training of autoregressive language models has enabled novel ways of solving Natural Language Processing tasks using zero-shot and few-shot learning. While extreme-scale language models such as GPT-3 offer multilingual capabilities, zero-shot learning for languages other than English remains largely unexplored. Here, we introduce Cedille, a large open-source autoregressive language model trained specifically for the French language. Our results show that Cedille outperforms existing French language models and is competitive with GPT-3 on a range of French zero-shot benchmarks. Furthermore, we provide an in-depth comparison of the toxicity exhibited by these models, showing that Cedille marks an improvement in language model safety thanks to dataset filtering.

Submitted: Feb 7, 2022

Topics

Large Scale Language Model
Autoregressive Language Model
Autoregressive Large Language Model
Zero Shot Benchmark
French Language
