Open STT / TTS 1.0 release

Finally we made it!

Posted by snakers41 on November 4, 2019


TLDR

This is a very brief accompanying post for the release of Open STT / TTS v1.0.

In a nutshell:

  • Open STT is published here, Open TTS is published here, noise dataset is published here;
  • We added two new datasets covering two new, large, and diverse domains, with around 15,000 hours of annotation;
  • The new datasets have real speaker labels (which will be released soon);
  • Overall annotation quality is improved, and the most common annotation edge cases have been fixed;
  • Dataset normalization is greatly improved;

Open STT recap

Some time ago we were disappointed by the state of STT in general (a more detailed article on this is coming soon), especially compared to Computer Vision, and especially in Russian.

In general, the field suffers from many problems: (i) small, not very useful, and not always public academic datasets; (ii) huge, impractical solutions; (iii) bulky, outdated, impractical toolkits; (iv) an industry with a lot of history that has already been bitten by the SOTA bug; (v) a lack of working solutions without too many strings attached.

So we decided to build a dataset for Russian from the ground up, and then build a set of pre-trained, deployable models on top of it. And then maybe cover a couple more languages.

Open STT is arguably the largest and best open STT / TTS dataset in existence today. Please follow the links above to learn more.

Major features in latest releases

Highlights of the latest release:

  • See the above TLDR bullets;
  • Speaker labels for the new datasets;
  • The dataset is now available as .wav files via torrent (I am seeding on a 1 Gbit/s channel) and as .mp3 files via a direct download link (also with fast download speeds);
  • Small manually annotated validation dataset (18 hours) covering 3 main domains;
  • Major edge cases fixed:
    • Overall upstream model quality improvement;
    • No more “dangling” letters;
    • Improved voice activity detection;
  • Vastly improved dataset normalization;
  • Obviously the annotation is not perfect, but from time to time, when we add new data, we also publish exclude lists to filter out the most egregious cases;
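As a sketch of how such an exclude list might be applied on the user's side, suppose the manifest is a list of (audio path, transcript) pairs and the exclude list is a plain-text file with one excluded path per line; the exact formats here are assumptions for illustration, not the dataset's documented layout:

```python
def load_exclude_ids(path):
    """Read one excluded audio path / utterance id per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def filter_manifest(rows, excluded):
    """Drop (audio_path, transcript) rows whose path is in the exclude set."""
    return [row for row in rows if row[0] not in excluded]
```

Keeping the filter as a separate step means newly published exclude lists can be applied to an already-downloaded copy of the data without re-downloading anything.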

Potential usage:

  • Speech-to-text (obviously);
  • Denoising (also consider our asr-noises dataset for this);
  • Large scale text-to-speech (new);
  • Speaker diarization (new);
  • Speaker identification (new);
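For the speaker-related tasks, the new speaker labels make it possible to build speaker-disjoint train / validation splits, so that no speaker appears in both sets. A minimal sketch, assuming each utterance carries a (hypothetical) speaker id alongside its audio path:

```python
import random
from collections import defaultdict

def speaker_disjoint_split(utterances, val_fraction=0.1, seed=0):
    """Split (speaker_id, audio_path) pairs so no speaker crosses splits."""
    by_speaker = defaultdict(list)
    for speaker, path in utterances:
        by_speaker[speaker].append(path)
    # Shuffle speakers (not utterances) with a fixed seed for reproducibility.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_val = max(1, int(len(speakers) * val_fraction))
    val = [(s, p) for s in speakers[:n_val] for p in by_speaker[s]]
    train = [(s, p) for s in speakers[n_val:] for p in by_speaker[s]]
    return train, val
```

Splitting by speaker rather than by utterance avoids the common leakage problem where a model "identifies" a speaker it has already heard during training.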

Licensing


The dataset is mostly published under a CC BY-NC license.

If you would like to use it commercially, submit a form and then contact us here.

If you need a fast (i.e. not requiring a GPU to run), dependable, offline STT / TTS system, please contact me.

What is next?

  • Improve / re-upload some of the existing datasets, refine the labels;
  • Attempt to annotate previous data with speaker labels;
  • Publish pre-trained models and post-processing;
  • Refine and publish the speaker labels;
  • Probably add new languages;
  • Refine the STT labels;