Russian Open Speech To Text (STT/ASR) Dataset

4000 hours of STT data in Russian

Posted by snakers41 on May 1, 2019


If you do not pay the iron price, you know someone paid it for you. It works like this in every aspect of life

TLDR

This is an accompanying post for our release of Russian Open Speech To Text (STT/ASR) Dataset. This is meant to be a bit light-hearted and tongue in cheek. All opinions are my own, and probably opinions of my colleagues differ. This is a non-technical summary. Do not take this too seriously, probably 50% of this text is some kind of subtle joke (ping me if you find all the Easter eggs!).

Anyway, here is the dataset:

Data type Annotation type / quality / noise Utterances Hours GB
Books / stories alignment / 95% / crisp 1,149,404 1,511 166
Phone calls ASR / 70% / noisy 837,665 812 89
Manually generated (Russian addresses) TTS / 100% / 4 voices 1,741,838 754 81
YouTube videos subtitles / 95% / noisy 786,529 724 78
Books / stories ASR / 80% / crisp 124,328 116 13
Public datasets reading and alignment / 99% / crisp 17,527 43 5
4,657,291 3,961 431

TLDR:

  • We have collected and published a dataset with 4,000+ hours to train speech-to-text models in Russian;
  • The data is very diverse, cross domain, the quality of annotation ranges from good enough to almost perfect. Our intention was to collect a dataset that would somehow relate to real-life / business applications. Collecting only pure / clean data in academic fashion is of little interest. Ideally this dataset is a first step on a path to a repo with pre-trained STT models;
  • We intent to grow this amount to around 10,000 hours or even maybe to 20,000 hours, if the stars align properly (we know how to get to 6-7k, we will improvise something);
  • We have NOT invested any real money in creation of this dataset (except of course for our time and effort), so we are releasing it under cc-by-nc license. If you want to use the dataset for commercial purposes please go here;
  • You can see releases history here;

Speeding up the Imagenet moment

Ideally it goes like this:

  • Pull existing public models and datasets;
  • Collect some MVP dataset in your domain;
  • Build an MVP model;
  • Add more difficult cases;
  • Validate, test, rinse and repeat;

In domains like Computer Vision (CV) and Natural Language Processing (NLP) there is something to build on, these are only things that work in 95% of cases:

  • Widely shared Imagenet pre-trained models. Each framework has a repo like this - where you can get any sort of CV model with weights. Just add water;
  • In NLP, in my experience tools like FastText work best. Modern huge networks like BERT also work, but in real life we did not find them practical;

But in STT in Russian there is nothing really to build on:

  • Of course there are paid APIs / commercial last-gen products / products from government-related entities - but they have their drawbacks (besides being private or being built of less transparent technologies);
  • Public datasets are scarce at best (<100 hours) and non-diverse / too clean at worst;
  • Even for English public datasets … are academic and detached from real life usage;
  • STT has a long history, and it has a bias towards being developed by large tech companies;
  • Unlike Google / Facebook / Baidu which are known to publish stellar research (FAIR’s papers are awesome and accessible, Google less so), Yandex is not really known to add anything to the community;

In STT also there are several less discussed “gotcha” perks:

  • There is a speculation on how much data you need for proper generalization - estimates range from 5,000 hours to 20,000. For example Librispeech (LS one of the most popular datasets) is 1,000 hours and very “clean”. Google and Baidu report training on 10,000 - 100,000 hours of data in various papers for various settings;
  • If in CV your “domain” for which you learn features is for example “cats” (in certain positions that is, see explanations of Hinton’s CapsNets on this) in STT it is a bit more complicated. Your data can be clean / noisy. Also each voice is its own domain, models are amazing in adapting to voices. Also vocabulary is also an issue. So, basically when someone reports a 3 percentage point improvement on LS - you should take this with a grain of salt. My initial pipeline built in 1-2 days was able to work on LS quite well;
  • Cross domain transfer (i.e. when you train on books and validate on phone calls) works, but quite poorly. You can easily get +20pp WER. Also minor things like intonation / talking differently in noisy conditions matter;

So - you can think of our venture as a first step towards providing a set of public models for the community in our language (Russian, the majority of our methods scale to other languages) and bringing the Imagenet moment closer / making reasonably good STT models for the public.

You can say that now we are knee-deep in the data. To succeed we will need to go deeper! Please stop me from producing these horrible puns.

Seeking contributors

If you are willing to make a sensible contribution to our dataset / venture - you are welcome!
We will make sure to publish as much as possible as friendly as possible.
Please contact us here.

You can frame it like this. If 4 people (not working 100% of time on this project only) can make a difference, probably your addition will tip the balance of scales towards building a truly flexible model?

How not to share datasets

First of all, here is the list of things I dislike that people do all the time when sharing datasets:

  • Obviously paywalls with no way to inspect what you will get in advance;
  • Academic ivory tower attitude - "our dataset is cool and large, but useless in real life";
  • Registration walls. Yes, “our dataset is public, but we will share it only for the chosen”. This thing. And yeah, if we decide not to share it with you, you will not be notified. Or most probably - nobody pays the moderator;
  • Sharing via Google Drive (btw, you can use wget with Google Drive, you need to download a cookie and use their link structure) or something that relies on temporary / dynamic links. Ideally you should just be able to use wget or better aria2c to get the data;
  • Poor hosting, poor CDN, poor speeds. It is laughable when people share a dataset, that probably would cost upwards of US$100k - 1m to annotate from scratch manually, but spare US$100 per month on actually properly hosting the dataset;
  • Sharing via torrent … w/o having active seeders with at least 100 Mbit/s uplink speed. Academic torrents is great, but no seeders at any time;
  • Poor folder organization, no meta-data files, illogical structure;
  • Bloated, outdated tools and file formats used to load the dataset (hello xml and object detection datasets);
  • Convoluted and obscure code to read such data. You can pull off miracles with one liners in pandas. No, really;
  • No way to check your data;
  • Generally not caring about users;
  • Sharing data in proprietary formats / pushing some form of agenda;

We made a reasonable effort not to make any of these mistakes:

  • First of all, all links are public. We sourced the majority of our data from the Internet, so we share back whatever we can;
  • We did not do this yet (help us, why not), but you can just write a script that would download everything, unpack everything, check md5 sums and files;
  • Dataset is hosted on AWS compatible bucket with CDN - download speeds will be good;
  • Data is mostly checked and written in the same format;
  • Data is collected in a disk DB optimized (see the repo for details) to work even on hard drives (we did not test it). I believe that RAID array or NVME cache for your hard drive array will solve the IO problem entirely (we ourselves use SSD NVME drives);
  • Meta data file;
  • Some rudimentary hackable code snippets for easier start;

Finding a motivation, making a difference, ethics and motivation

Finally a set of points I would like to make:

  • After I made some baseline prototype and dataset, I was really disappointed in the result. If not for Yuri’s sound advice, experience and collaboration - I probably would have abandoned the project. Yuri, I am eternally grateful for your help. Yeah, I saw that STT in Russian was ripe for “ImageNet moment” plucking, but we would not stay motivated if not for you;
  • Obviously, even if we fail to provide usable / robust models for the public, our data / models will serve as a first step in this direction;
  • Oh, such dataset should cost a fortune. Yes, building something like this from scratch probably would cost millions. But we are not alone, and life is not a zero sum game;
  • Life is not a zero sum game - therefore we saw an opportunity to make the dataset public - and we did it;
  • No, it is not unethical to use the data the way we use it. First of all, for 100% of data we use - we did not put it there. Someone did. And we do not use the data for the same purpose it was uploaded;
  • And finally, the elephant in the room. By publishing such a dataset, we bring the impending doom described in Orwell’s 1984 closer. Yes and no. You see, I do not believe that it is feasible to achieve such accuracy that actually jailing people based on STT results would make economic sense. If you have a real life WER of 10% - for most business applications, you have built amazing product! But to put people in prison it is too much - you will just end up processing false positives. I understand that in more civilized countries a court decision is required to take such actions. I believe that some other technologies that are INHERENTLY more toxic (like face recognition) than STT. After all - if you want to say something in private - say something in private (use encryption or just talk in person). You cannot go outside in private. I understand that we are talking about mass applications of such technologies, but even if you are up to something sketchy - it is your choice to call someone to talk about it. Going to a mall and being apprehended because of looking like some ethnical minority (i) first of all is already an issue already in many countries (ii) is not your choice. Concealing your appearance may also pose questions;
  • “Your dataset sucks because X”! Yeah, please give feedback, but make it logical. Mostly we just collected all the data we could grab. But if you care so much, you are welcome to contribute!;
  • I also love ad hominem criticisms like this “you pretend to take a high moral stance by doing something (i.e. sharing the dataset), the why you did Y”. I mostly addressed this a two points above;


ML themed stickers, if you use telegram