Navigating the Speech to Text Dark Forest

Do not fall prey to fallacies

Posted by snakers41 on August 12, 2019

    Boring introduction

    Recently I read this trilogy of books (which is awesome, down-to-earth, gritty and an instant classic, by the way). There is a very nice analogy that permeates the whole trilogy in one way or another (I just could not help sharing it):

    Like hunters in a "dark forest", a civilization can never be certain of an alien civilization's true intentions. The extreme distance between stars creates an insurmountable "chain of suspicion" where any two civilizations cannot communicate well enough to relieve mistrust, making conflict inevitable. Leaving a primitive civilization alone is not an option due to the exponential progress of technological change, and a civilization you have detected might easily surpass your own technological level in a few centuries and become a threat. If you have detected a civilization, then you have also confirmed that said civilization will eventually be able to detect you. Therefore, it is in every civilization's best interest to preemptively strike and destroy any developing civilization before it can become a threat, but without revealing their own location to the wider universe, thus explaining the Fermi paradox.

    To put it in plainer words - imagine that it is dark and you are alone in the forest. You can see the marks on the trees / snow and you have some idea that there are other creatures there. You want to study the forest and communicate with them, but you do not know yet that the forest is populated by solitary hunters for whom the optimal strategy is to attack / plunder and flee. Even though sticking together in large groups may be more optimal in the long run (which is what omnivorous humans did, by the way, unlike many solitary apex predators), the current state heavily punishes all newcomers for revealing themselves.

    If the proverbial Dark Forest existed, probably creatures inhabiting it would look something like this

    It really amazed me that this kind of analogy transfers really well to real life and kind of explains the status quo in many fields. You can say that ML in general is becoming a Dark Forest now. Long ago social networks became Dark Forests where everyone is virtue signalling (or just using them as a messenger, lol) and everybody has a very cool artificial version of themselves (I do not use social networks, btw). Unlike social networks, which heavily punish all people who dare to honestly express their real selves on them, ML / tech is less punitive (for now; but if you carefully read the NDA / indemnity / IP part of your English law agreement, you may think otherwise).

    If you came here for a hardcore article or post (you may find something recent on my channel), I am sorry to disappoint, this will be more of a rant. And yeah, if you have ever been taken out of polite Western society of fake smiles for a contrast - western society is also a Dark Forest. This news article just takes the cake - when the Chinese government just builds surveillance - but the Americans ask the public to build the cages themselves. So cool.

    I have not posted for a while in this blog. I was preoccupied with many things, building an easy to use / approachable speech-to-text model / dataset being one of them. I also had a first decent vacation since 2015, but that would not be the topic of this post. Also there was a series of down-to-earth "challenges" in my life connected with relocations / renting an office / solving some basic questions. It does not sound like much, but it takes time and attention.

    Initially I wanted to postpone any STT related blog posts until we have released pre-trained models / have released version 0.6 or 0.7 of our dataset (now it's 0.5) / have done a very cool thing we are planning to do in future (I will not reveal it yet, but you are welcome to guess). But I realize that this is probably taking too much time, so let's have a bit of fun! We are very close to finishing off our models and the dataset, and this cool thing, but one thing at a time.

    Anyway now I just would like to tell you about my overall experience of working on this thing - a working Russian STT model and huge STT dataset - and how this and ML in general resembles a Dark Forest:


    • Technology is a Dark Forest. Until it isn't. But then it is again. It is cyclic;

    • Academic world is a kind of a Dark Forest

      Everyone is virtue signalling. The signal in papers is very diluted, unless this is a once a year breakthrough paper. Finally I came to a conclusion that 95% of current ML academic papers are nothing but career building. Just read between the lines, hunt for ideas. People who read just code and / or just abstracts and / or look for hyper-params in papers - are cynical, but in the end they are right. I had some glimmer of hope previously, but now I am more cynical about it. Of course sometimes there are truly awesome ideas in papers - but 95% of crap on top is necessary evil?;

    • Machine Learning is trending towards being a Dark Forest

      More precisely - it is following a technology cycle and now it is entering this Dark Forest phase. Remember when everyone was amazed at how open everything was (open architectures, reproducible experiments, pre-trained networks)? Now the trend has shifted to industrialization, huge networks, AutoML, self-supervised learning (which btw is a good thing) and building walled-off gardens. It is a good thing that ML is actually useful and deployable in real life (unlike the crypto bubbles, the financial derivative bubble, etc), but it feels like people are becoming more separated into practitioners / 0.01% benchmark SOTA pushers / hardcore C++ engineers. The awesome melting pot of ideas seems to be less pronounced, at least for now;

    • And finally - STT is also kind of a mild Dark Forest

      Being that it is data intensive (there are some cool ideas like joint STT / TTS training, but I believe that something resembling weakly supervised training, i.e. automatic pseudo-labelling, or mix-match will work in an "in-the-wild" scenario. It already works btw in an offline fashion) it is no wonder that it is mostly done by large corporations. With all of their perks - fitting to idealized "unlimited" data, building walled-off ecosystems, using 10x more resources than necessary. There is no motivation in the system to build simple / dependable / reproducible systems, especially for languages other than English. This is reflected in papers on the topic. Of course you may say that wav2letter++ / kaldi / esp-net exist and they are open-source and I just suck at ML / coding. Yeah, right, please use them, go on. But can you really depend on them? Can you really tell what is wrong with your non-standard wav2letter++ model, unless you can internalize the whole C++ codebase? If you need this yesterday and do not have 6 months to dig into a project that took 10 people a year to write? Of course in case you are a "rockstar 10x ninja guru" developer that may not apply to you xD;

    So, people may say "if it works do not fix it". But if no-one voices such concerns, then the systems may evolve beyond recognition. And we may find yet another walled-off corporate bullshit ecosystem instead of a thriving community that benefits real people. "First they came" argument.

    A system should serve people, not itself. For now, such a system makes our life better asymptotically, i.e. a perfect wav2letter++ would not mean free ASR for all of us, but it may trickle down in the form of ideas / free annotation / datasets - you name it. But it is easy to project this walling-off trend forward and see that awesome sprawling communities of ideas may perish should it continue.

    Technology cycles

    Do not get me started on cargo cult! I know that this is just an aerodynamic model, please preserve your bile for now

    Technology is a Dark Forest. Until it is not. Let me explain what I mean. If you are from the CIS, have you ever tried posting anything original to (just read Russian comments here or here)? Anything where you can proudly say - "I created this and I think it is useful! I want to share this cool stuff!". If you are not from the CIS, probably same can be said about Reddit, I do not know, please PM me if you have experience.

    Best case scenario - everyone will ignore it. More down-to-earth scenario - everyone will be appalled because whatever you did does not stick to some idealistic pre-conceived notion (ofc everyone will disregard your time / resource limitations, goals etc etc). The best way to seek the help of online community is to half-ass something, post it, and then see all of these self-righteous angry trolls coming from under their bridges to show you the true and the only way. Because people care more about boosting their (virtual) ego at your expense instead of actually helping do something helpful for greater good (i.e. non zero sum game).

    If you collect such feedback, you can even find good bits in it! You may ask why anyone would care if something wrong is said on the Internet. Because they are idealistic knights in shining armor? No - because of virtue signalling.

    You built some complicated thing. If you ask people politely to review / help - they will ignore you. But if you deliberately make a couple of mistakes - everyone will seem to care at once

    Ok, but what does it have to do with technology and Dark Forests? Well, you see, if 95% of what you see is this, then how does actual progress happen? Of course if 1000 engineers work on a new generation of CPUs for 2 years something will happen regardless of how efficient they are. But my bet is that you do not have 1000 engineers and 2 years. You need your result yesterday.

    And here it gets interesting. Imagine a new technology being born or reaching some form of maturity (i.e. modern machine learning, web, mobile apps, etc). Each new technology first attracts people who want to see the results of their actions. Because a new technology needs people who get shit done without rambling for 2 hours about self-righteousness. Then the public and marketing catch up, and marketing BS starts flooding the information channels. Then someone builds a killer app and / or applies this technology to a real problem / creates a 10x faster neural network / solves some challenge - you name it. Then this gets purchased / copied / stolen / lawfully expropriated / replicated by a large company (the end result is the same, just the amount of time required differs) ... and the technology enters a Dark Forest state. No one has an incentive to take risks, no one wants to build something new, status quo is the new motto, everyone is happy and content.

    This is why it is important to stay hungry / foolish, keep eggs in separate baskets and in general try to keep everything as simple as fuck.

    Bitten by the SOTA Bug

    A medieval cat achieving state of the art speed record disregarding some practical issues like blowing up afterwards

    It is not news that the academic world, or to put it in plainer terms, people writing papers (scholars, corporate employees, ML competition participants, whatever), are not immune to Goodhart's law. I found this gem and this gem recently on this topic. Seriously, just read them.

    Wait, should this also be a part of the Dark Forest?

    ML papers feel more and more like publish or perish / chasing 0.01% on some vague outdated academic leaderboard. Ticking boxes instead of doing something useful.

    Probably you are already fed up with me ranting, let me be more concrete. Dear people writing papers! I have a bone to pick with you. To be more precise several bones:

    1. What is with the fact that no papers are publishing convergence curves?

      Why? Who cares about achieving +0.05% (given that ML is kind of random) on some outdated not-in-the-wild benchmark by using 10x more compute and / or 10x more GPU hours on your computing cluster? I understand that ML progresses in cycles, first someone achieves +0.05% at the cost of 10x more compute, then someone optimizes it, but why not be more open about it? Also sometimes it is worth running all of your experiments till the end, but usually you can see in the first epochs that something does not fly.

      All other things being equal, feels like baseline_scale2 is the best, right?

    Hell no! If only papers posted charts like this!

    2. Why do you copy-paste and drag the obscure mathematical formulas from paper to paper even if your contribution has nothing to do with these formulas?

      Do not give me the BS that "each paper should be self-sufficient and any newcomer should understand everything from just one paper". It does not work like this. Why not just write "we use this loss, invented by this guy here, implemented by this guy here, and yeah, we just use this implementation for everything, and here is our Dockerfile!"? This is productive. But then the majority of papers would probably sound like "ahem, I came to work, just did from utils import * on the monolithic behemoth of our internal wrappers, changed some settings, sent it to our 1000 GPUs - and voilà, new SOTA result!". Does not sound very sexy, right?

    3. Being vague on what you really built and where you started

      Being vague on which elements are crucial to your system and which can be trimmed. If you started with some black box framework and just slapped something on top of that - just say so. Or do proper ablation tests. Or just show your path. Just be frank. Do not hide behind big words.

      I understand that an illustration from a NAS paper is probably a bit misleading here, but it obviously shows which exploration steps have been taken

    4. Being vague about the generalization of your approach.

      Doing something on your laptop on MNIST is kind of cringe nowadays (yeah, you can fit any sufficiently large shitty network to anything given enough time and money), but at least it feels honest. But I would never believe that people who have 4-5 people working on a paper do not have OOD (out-of-dataset) / private / additional data to test their model on. The DeepSpeech paper is probably the best one to illustrate this. They just showed how much performance depends on adding more data, even though they cannot share this data. If what you do works only on an ideal small dataset and requires 10x more compute and time - it is useless. No, you are not pushing state-of-the-art unless there is a fundamentally cool new characteristic about what you did. Ah, maybe you care about reproducibility? But ahem, making your contribution reproducible and approachable by the public may be 10x more work. So it's better just to omit anything that would point your fellow practitioners in the right direction. Because otherwise you are vulnerable and the Dark Forest will strike. Also do not get me started on the "stack and blend 100 similar things" approach. If you genuinely mix several different things, i.e. an RNN + regular expressions + some linear regressions, each solving a small subtask, it's very cool. But just stacking 100 similar networks is not.

    5. Another flavour of the same thing - hyper-param stability

      Hyper-params usually deserve a small section in the middle of the paper. But sometimes it actually takes a lot of time to tune them just right (and it may even be more important than your brand-new SOTA improvement). So why not be more open about how you found them? How much compute does it take? Does it scale to other datasets? Are they stable?
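Reporting convergence and stability does not take much. Below is a minimal sketch, in plain Python, of the kind of summary that could sit next to a headline number: how many epochs each run needed to reach a target metric, and how much the final value varies between seeds. The function names (`epochs_to_target`, `summarize_runs`) and the three curves are made up for illustration; they do not come from any paper or from our experiments.

```python
from statistics import mean, stdev


def epochs_to_target(curve, target):
    """Return the 1-based epoch at which the metric first reaches `target`
    or lower, or None if the run never gets there."""
    for epoch, value in enumerate(curve, start=1):
        if value <= target:
            return epoch
    return None


def summarize_runs(curves, target):
    """Summarize several runs (e.g. different seeds) of one configuration."""
    finals = [curve[-1] for curve in curves]
    return {
        "final_mean": mean(finals),
        "final_std": stdev(finals) if len(finals) > 1 else 0.0,
        "epochs_to_target": [epochs_to_target(c, target) for c in curves],
    }


# Made-up validation CER curves (%), three seeds of one hypothetical config.
runs = [
    [40.0, 25.0, 18.0, 15.0, 14.0],
    [42.0, 27.0, 19.0, 15.5, 14.5],
    [39.0, 30.0, 24.0, 20.0, 17.0],
]
summary = summarize_runs(runs, target=15.0)
print(summary)
```

A run that never reaches the target shows up as `None` instead of being silently dropped, which is the whole point: an unstable seed is part of the result.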

    Please do something useful! Also in particular, the problems I had with STT papers were the following:

    • All the cool new unsupervised learning approaches (wav2vec, cyclic training, etc)

      Looks very cool, but why show "progress" on some random small datasets? Why not just scale with speakers in LibriSpeech? Run a test - does it scale to X% of speakers from LibriSpeech - and that's it! Why not extrapolate how compute efficient your method is, even by means of a back-of-the-envelope computation? Of course, unsupervised / self-supervised / semi-supervised methods may be an order of magnitude slower, but what about the annotation cost trade-off? These are not complicated matters, just common sense. But you see, saying that "we achieved almost unsupervised something, but in real life it is even less practical than supervised" does not sound sexy at all;

    • Some remarks on Wav2letter

      The Wav2letter paper is cool. So deceptively simple and nice ... like a breath of fresh air. It was one of the reasons I decided to try to build an STT system. But wait, on real data it just does not fit as promised! Maybe you need to fit it for 30 days on 8 V100 GPUs? I fully understand that actually you can just train longer to miniaturize the model, but papers like this (they use ~150M params instead of ~30M to achieve similar results) kind of show that everything is a bit more complicated;

    • Huge highly regularized networks

      All papers featuring large networks, GLU activations and 0.6 dropout. You can fit any network on anything. Really. They are this flexible. But you know, the time and compute taken may differ 10x. And will it work on real-life / dirty data?;

    • All papers boasting learning custom pre-processing instead of STFT / MFCC or something similar

      Maybe I am just an idiot, but we tried such things, and they worked, but even on toy data they converged much slower. Given that one "comma" in such a fragile thing as pre-processing may be a deal-breaker, the lack of published code from paper authors is strange (come on, if you spent a lot of time tuning your pre-processing, and it is probably just 50 lines of code, then why not just share it? The community will benefit and you will get the PR boost). There is a cool paper - SincNet - where the author had a cool idea and even published down-to-earth code you can test with your data - we are yet to try it. Also - as some papers have shown (1, 2, 3, 4) - sometimes you do not even need a complicated structure or training per se for the model to work;

    • Doing everything end-to-end mantra

      It is so cool when you can just put everything in one network on one framework (and it can even be agnostic to the framework) and just hit .forward()! Then you will enjoy all sorts of cool things - like porting to TensorRT, using ONNX, etc. But when you do something supposedly end-to-end, i.e. train a conv-LM instead of using an arbitrary one ... you nevertheless require a separate post-processing step using a different language / stack / etc. It is not useful. It is just a novelty item. Maybe adding an additional LM-loss will help instead?
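The back-of-the-envelope computation asked for in the first item of this list can be very short. The sketch below compares compute cost plus annotation cost for two recipes. Only the "8 GPUs for 30 days" figure and the "order of magnitude slower" assumption echo the text above; the prices, the dataset sizes and the function names are placeholders invented to make the arithmetic visible, not real quotes or measurements.

```python
def gpu_hours(num_gpus, days):
    """GPU-hours consumed by a run that keeps `num_gpus` busy for `days`."""
    return num_gpus * days * 24


def total_cost(num_gpus, days, annotated_hours,
               price_per_gpu_hour, price_per_annotated_hour):
    """Compute cost plus annotation cost, in whatever currency the prices use."""
    compute = gpu_hours(num_gpus, days) * price_per_gpu_hour
    annotation = annotated_hours * price_per_annotated_hour
    return compute + annotation


# PLACEHOLDER inputs, not real prices or dataset sizes.
GPU_PRICE = 1.0          # per GPU-hour (hypothetical)
ANNOTATION_PRICE = 10.0  # per hour of transcribed audio (hypothetical)

# Supervised recipe: 8 GPUs for 30 days, 1000 hours of annotation (assumed).
supervised = total_cost(8, 30, annotated_hours=1000,
                        price_per_gpu_hour=GPU_PRICE,
                        price_per_annotated_hour=ANNOTATION_PRICE)

# Self-supervised recipe: assume 10x the compute and 10x less annotation.
self_supervised = total_cost(8, 300, annotated_hours=100,
                             price_per_gpu_hour=GPU_PRICE,
                             price_per_annotated_hour=ANNOTATION_PRICE)

print(f"supervised: {supervised:.0f}, self-supervised: {self_supervised:.0f}")
```

With these placeholder prices the supervised recipe is cheaper; raise the annotation price to 100 per hour and the comparison flips. That sensitivity is exactly why the inputs are worth stating in a paper.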

    Dog Eat Dog

    Medieval cats are so cool

    Probably Dark Forest is a bit of an exaggeration, but some behaviors we observed were kind of similar in spirit.

    Usually I am not the one to brag about how many "job offers per second I have" or "how many jobs I have at the moment" or "how much money per second I make" (all of this is vain and undignified) or any obnoxious shit like this, but stick with me for a couple of anecdotes now.

    This may get a bit funny / ridiculous. All of this happened after we published one of the first more or less mature versions of our dataset.

    • When we published our dataset on, 33% of comments were constructive (i.e. data encoding, becoming a co-author, etc) and 66% were about what assholes we were:

      • We have committed a crime against book / video / recording right holders and we should drop dead and be punished immediately! Yeah right, Imagenet was compiled in a similar fashion, but no one has any problems using it. Because progress and fair use. I agree that this is a grey area, but for more details just read the comments;
      • We should have published data under a Public Domain licence because of reasons. Because a private company should reach an agreement with authors to use the data commercially to continue virtue signalling. You cannot just tell authors to go take a walk. cc-by-nc actually allows any people working for social good to use the data and we have helped researchers do so on several occasions already;
    • Ofc on numerous occasions people told me that what we do should not be done, because Google or because Yandex. Because Dark Forest. Yeah right, we will see when comparing benchmarks xD;

    • We were approached on 3 different occasions with some vague offers to join some teams, and in one case we even could collaborate on making our dataset better, which is very cool! The Dark Forest thing about this is how one of these guys was advertising their company. After some enquiries we kind of realized that the 5,000 hours of annotation he was bragging about existed only in theory and the sole purpose of our communication was to eliminate a potential competitor in its infancy. Yeah, like this guy. We may fail and we may suck at what we do, our dataset may be full of rubbish, but (i) we are not idiots (ii) we actually believe that life is not a zero sum game for a change;

    The picture says "Prohibit! Tighten! Do not allow!"

    Make your network 4-5x faster. But first make it 4-5x slower

    Actually the pix2pix Doomguy on the right does not look creepy at all!!

    So, you may ask, wtf is this article? If you are ranting, then offer something. The idea of this article is that a lot of resources are being spent on resume building and protecting against Dark Forest strikes - not on really useful things. True, some of my tests leave much to be desired in terms of rigor, but at least I am not full of bs.

    Let's return to this chart:

     - Applying MobileNet-style optimizations makes any CNN better - and any network is over-parametrized
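One reason MobileNet-style blocks help is plain parameter arithmetic. The sketch below counts the weights of a standard 1D convolution and of its depthwise separable replacement (a per-channel convolution followed by a 1x1 pointwise convolution). The channel count and kernel size are hypothetical and only illustrate the formula; they are not the sizes of our model, and biases are ignored.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1D convolution (bias omitted)."""
    return c_in * c_out * kernel_size


def separable_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a depthwise (per-channel) convolution followed by a
    1x1 pointwise convolution (biases omitted)."""
    depthwise = c_in * kernel_size
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer size, for illustration only.
c_in, c_out, k = 512, 512, 13
standard = conv1d_params(c_in, c_out, k)
separable = separable_conv1d_params(c_in, c_out, k)
print(standard, separable, round(standard / separable, 1))
```

The ratio works out to 1 / (1/c_out + 1/k), so it grows with both kernel size and channel count. A per-layer figure like this is only part of the picture: a whole model also contains layers that are not replaced, so the end-to-end saving is typically smaller.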

    We learned the following useful things about STT so far. Some of these may be fairly obvious in hindsight (in bold are the most important facts; you can now officially start spewing your venom about how all of this was obvious to you from the very beginning 7 years ago):

    • The Russian language:

      • Your CER will easily be 2x of the English model with similarly clean data;
      • Your WER may be 2-3x depending on your post processing and how clean your data is;
    • Generalization:

      • You should cross validate your network across different domains. Then you will see real performance of your model in the wild;
      • Your domain difficulty essentially is noise x voice diversity x vocabulary diversity;
      • This will be mitigated when you have more data, but as a rule of thumb you may think that if you have one noisy domain, on another noisy domain you will have +10pp CER;
      • Your model will get better if you have more annotation. Refer to DeepSpeech paper 2 for exact figures - but probably x5 - x10 annotation may give you minus 7-10pp of CER;
      • Annotation quality matters and domain noise matters. If on perfect clear data a non over-fitted network may have 3-4% CER, then probably you can extrapolate that 5-10% CER on more noisy in-the-wild data is achievable, and very noisy data will be even worse;
    • Your network:

      • CNNs are a bit worse than similarly sized GRUs / LSTMs, but are 2-3x faster (at least);
      • Separable convolutions + bottlenecks can make your model 4-5x lighter and 3-4x faster (an idea very similar to this paper; their particular model did not really work for us, but a model with similar ideas did). Such a network takes 3x longer to train and overfits less, though;
      • Using 4x or 8x sequence down-scaling (probably with BPE, we are yet to achieve similar performance with 10k BPE) does not really hurt performance, but speeds up your training 2x-3x;
      • Dilated convolutions make your network ~2x slower, but CNNs stop skipping words (unlike RNNs);
      • If you are using a CNN, using skip-connections / residual networks is an obvious design choice;
      • Your CNN capacity probably will have to be 2-4x larger than wav2letter;
      • Try using SRU instead of LSTM / GRU (we are yet to try it);
      • You may use any pre-processing and network. The only difference will be - how long you will have to train it given enough data;
    • Training your network:

      • Batch similarly sized audio chunks together. This can speed up your training up to 2x;
      • Make your pre-processing pipeline as simple as possible - why not just read binary int16 data from disk instead of converting data on-the-fly?;
      • Make your augmentation pipeline as efficient and cheap as possible;
      • Streamline your hardware. SSD storage is very cheap nowadays;
      • Having several sources of annotation (i.e. graphemes + phonemes) may help avoid plateaus. But this has to be investigated in more detail;
    • Augmentations

      • Are crucial for larger networks. Helped avoid plateaus for larger networks. Not so crucial for smaller networks;
      • Reduce the generalization gap;
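The "batch similarly sized audio chunks together" tip from the list above can be sketched in a few lines. The idea: every batch is padded to its longest item, so sorting items by length before cutting them into batches removes most of the padding, and shuffling whole batches keeps some randomness in training. Everything here is a toy under stated assumptions: the lengths are synthetic random numbers, and `make_batches`, `bucket_batches` and `padded_frames` are illustrative names, not part of our pipeline or of any library.

```python
import random


def make_batches(order, batch_size):
    """Cut a list of item indices into consecutive batches."""
    return [order[s:s + batch_size] for s in range(0, len(order), batch_size)]


def bucket_batches(lengths, batch_size, seed=0):
    """Batches of indices grouped by similar length; batch order is shuffled."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = make_batches(order, batch_size)
    random.Random(seed).shuffle(batches)
    return batches


def padded_frames(batches, lengths):
    """Total frames processed when each batch is padded to its longest item."""
    return sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)


# Synthetic utterance lengths in frames (not real data).
rng = random.Random(42)
lengths = [rng.randint(100, 1600) for _ in range(1024)]

naive_batches = make_batches(list(range(len(lengths))), 32)
bucketed_batches = bucket_batches(lengths, 32)

naive = padded_frames(naive_batches, lengths)
bucketed = padded_frames(bucketed_batches, lengths)
print(f"useful frames: {sum(lengths)}, naive: {naive}, bucketed: {bucketed}")
```

The gap between the naive and the bucketed totals is wasted computation on padding. How much of it turns into wall-clock speed-up depends on your length distribution and your model, so measure it on your own data.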

    Your path to a working STT model may look something like this