MOSEL: a multilingual dataset collection

The MOSEL corpus is a multilingual dataset collection comprising up to 950k hours of open-source speech recordings covering the 24 official languages of the European Union. We collect data by surveying labeled and unlabeled speech corpora released under open-source compliant licenses. In particular, MOSEL includes automatic transcripts for 441k hours of unlabeled speech from VoxPopuli and LibriLight, generated with Whisper large v3. Whisper is released under the open-source Apache 2.0 license, which allows the generated content to be released under any license.
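As a rough illustration of the transcription step described above, the following sketch shows one way unlabeled audio could be transcribed with Whisper large v3 through the Hugging Face `transformers` pipeline. It is not the exact procedure used to build MOSEL: the helper names `load_whisper` and `transcribe_files`, the 30-second chunking, and the example file names are assumptions made for illustration only.

```python
from typing import Callable, Dict, Iterable

# Public Whisper large v3 checkpoint on the Hugging Face Hub.
MODEL_ID = "openai/whisper-large-v3"


def load_whisper(model_id: str = MODEL_ID) -> Callable:
    """Build a speech recognition pipeline (hypothetical helper).

    The import is done lazily so that the rest of this module can be used
    without transformers installed.
    """
    from transformers import pipeline

    # chunk_length_s=30 is an assumption: it lets the pipeline process
    # recordings longer than Whisper's 30-second input window.
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,
    )


def transcribe_files(asr: Callable, paths: Iterable[str], language: str) -> Dict[str, str]:
    """Transcribe each audio file and return a mapping path -> transcript.

    `asr` is any callable with the pipeline's call convention, which makes
    the function easy to exercise with a stand-in during testing.
    """
    transcripts = {}
    for path in paths:
        result = asr(
            path,
            generate_kwargs={"language": language, "task": "transcribe"},
        )
        transcripts[path] = result["text"].strip()
    return transcripts
```

A possible usage, assuming local audio files exist at the given (hypothetical) paths, would be `transcribe_files(load_whisper(), ["speech_0001.wav"], language="italian")`. Passing the language explicitly avoids relying on Whisper's automatic language detection, which may be a reasonable choice when the language of each source corpus is already known.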

The dataset is available on Hugging Face: MOSEL

And on GitHub: https://github.com/hlt-mt/mosel
