This is a very brief accompanying post for the release of Open STT / TTS v1.0.
In a nutshell:
- Open STT is published here, Open TTS is published here, noise dataset is published here;
- We added 2 new datasets in 2 new large and diverse domains with around 15,000 hours of annotation;
- New datasets have real speaker labels (which will soon be released);
- Overall annotation quality is improved, and the most common annotation edge cases were fixed;
- Dataset normalization is greatly improved;
Some time ago we became disappointed with the general state of STT (a more refined article is coming soon), especially for Russian and especially when compared to, for example, Computer Vision.
In general, STT suffers from many problems: (i) small / not very useful / not always public academic datasets; (ii) huge impractical solutions; (iii) bulky / outdated / impractical toolkits; (iv) an industry with a lot of history that has already been bitten by the SOTA bug; (v) a lack of working solutions without too many strings attached.
So we decided to build a dataset for Russian from the ground up and then build a set of pre-trained deployable models based on that dataset. After that, we may cover a couple more languages.
Open STT is arguably the largest / best open STT / TTS dataset that exists now. Please follow the above links to learn more.
Major latest release highlights:
- See the above TLDR bullets;
- Speaker labels for the new datasets;
- The dataset is now available as .wav files via torrent (I am seeding with a 1 gbit/s channel) and as .mp3 files via direct download link (also fast download speeds);
- Small manually annotated validation dataset (18 hours) covering 3 main domains;
- Major edge cases fixed:
- Overall upstream model quality improvement;
- No more “dangling” letters;
- Improved voice activity detection;
- Vastly improved dataset normalization;
- Obviously the annotation is not perfect, but whenever we add new data we also add exclude lists to filter out the most obnoxious cases;
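
As a rough illustration of how an exclude list can be applied, here is a minimal sketch in Python. It assumes a manifest with one row per utterance and an exclude list with one audio path per line. The column names (`wav_path`, `text`), the file paths, and both helper functions are hypothetical and are not the actual layout of the published files.

```python
import csv
import io


def load_exclude_set(lines):
    """Collect non-empty, stripped audio paths from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_manifest(rows, exclude):
    """Keep only the rows whose audio path is not in the exclude set."""
    return [row for row in rows if row["wav_path"] not in exclude]


# Hypothetical manifest: audio path plus transcript (assumed column names).
manifest_csv = io.StringIO(
    "wav_path,text\n"
    "a/0001.wav,hello world\n"
    "a/0002.wav,bad sample\n"
    "b/0003.wav,good morning\n"
)
rows = list(csv.DictReader(manifest_csv))

# Hypothetical exclude list: one path per line, blank lines are ignored.
exclude = load_exclude_set(["a/0002.wav\n", "\n"])

kept = filter_manifest(rows, exclude)
print([row["wav_path"] for row in kept])
```

Keeping the exclude list as a set makes each lookup constant-time, which matters when the manifest has millions of rows. In practice the lines would come from files on disk rather than in-memory strings.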
- Speech-to-text (obviously);
- Denoising (also consider our asr-noises dataset for this);
- Large scale text-to-speech (new);
- Speaker diarization (new);
- Speaker identification (new);
The dataset is mostly published under
If you need a fast (i.e. not requiring GPUs to run) / dependable / offline STT / TTS system, please contact me.
- Improve / re-upload some of the existing datasets, refine the labels;
- Attempt to annotate previous data with speaker labels;
- Publish pre-trained models and post-processing;
- Refine and publish the speaker labels;
- Probably add new languages;
- Refine the STT labels;