Russian Text Normalization for STT and TTS

Converting numbers, abbreviations, and Latin symbols into Russian spoken form

Posted by islanna on March 4, 2020

Many speech-related problems, including STT (Speech-To-Text) and TTS (Text-To-Speech), require transcripts to be converted into a real "spoken" form, i.e. the exact words the speaker said. This means that before a written expression becomes our transcript, it needs to be normalized. In other words, the text must be processed in several steps, including:

  • Converting numbers: 1984 год -> тысяча девятьсот восемьдесят четвёртый год;
  • Expanding abbreviations: 2 мин. ненависти -> две минуты ненависти;
  • Transcribing Latin symbols (mostly English): Orwell -> Оруэлл;
    and so on.

Normalization

In this article I’d like to summarize our approach to text normalization in the Russian speech dataset Open_STT, briefly covering the ideas and tools we used.

As icing on the cake, we decided to share our seq2seq normalizer with the community. It’s literally invoked via a one-liner; just check out our GitHub repository.

norm = Normalizer()
result = norm.norm_text('С 9 до 11 котики кушали whiskas')
print(result)
# 'С девяти до одиннадцати котики кушали уискас'

A Little More about the Task

So, what’s the matter? In theory, abbreviations would not be so commonly used if it were difficult to expand them to full form: everyone knows that “Dr.” stands for “Doctor”. In practice, it’s not that simple, and native-speaker intuition is a must-have to get through all the nuances.

To realize how deep the rabbit hole is, take a look at the examples below:

  • 2-е - второе(ые), but д. 2е - два е;
  • 2 части - две части, but нет 2 части - нет второй части;
  • длиной до 2 км - длиной до двух километров, but едем до 2 км - едем до второго километра;
  • = 2/5 - равно две пятых, but д. 2/5 - дом два дробь пять или даже - два пять.

Abbreviations and acronyms are also a problem: the same abbreviation can be read differently depending on the context (г - город or год) or the speaker (БЦ - б ц or бизнес центр?). Just imagine what a headache transliteration is: it’s almost comparable to learning another language. All the problems above are especially challenging for spoken forms.

Statistical Pipeline

It’s easy to get lost in this endless variety: you get drawn into an endless cycle of searching for and processing more and more new cases. At some point it is better to stop and recall the Pareto principle, or the 20/80 rule: instead of solving the general task, we can process ~20% of the most popular cases and cover ~80% of the whole language.

The approach of the first Open_STT releases was even more brutal: if you see a number, change it to the default cardinal numeral. As a hack for the STT application, this decision was even justified: by transforming 2020 год into две тысячи двадцать год you lose only 1-2 letters, while ignoring the number entirely leads to a three-word mistake.

Gradually, we added some form of contextual dependency. Now numerals before the word год become ordinal, and 2020 finally turns into две тысячи двадцатый. This is how our “manual” statistical pipeline appeared: we find the most popular combinations and add them to the ruleset, as in the sketch below.
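For illustration, here is a minimal sketch of what one such contextual rule might look like. The function name and the toy lookup tables are hypothetical, not our actual code; a real pipeline relies on a full number-to-words converter and a much larger ruleset:

import re

# Toy lookup tables; a real pipeline needs a full number-to-words converter.
CARDINAL = {'2020': 'две тысячи двадцать', '2': 'два'}
ORDINAL = {'2020': 'две тысячи двадцатый', '2': 'второй'}

def normalize_numbers(text: str) -> str:
    """Expand digits, choosing the ordinal form before the word 'год'."""
    def repl(match):
        number, following = match.group(1), match.group(2)
        words = ORDINAL if following == 'год' else CARDINAL
        return words.get(number, number) + ' ' + following
    # A number followed by the next word; 'год' triggers the ordinal rule.
    return re.sub(r'(\d+) (\S+)', repl, text)

print(normalize_numbers('2020 год'))     # две тысячи двадцатый год
print(normalize_numbers('2020 рублей'))  # две тысячи двадцать рублей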

Sequence to Sequence Networks

At some point, it became clear that the sequence-to-sequence (seq2seq) architecture is just perfect for our task. Indeed, seq2seq does everything the manual pipeline does, and even better:

  • Converts one sequence to another - check;
  • Models context dependencies - check;
  • Finds the most appropriate rule to transform the sequence - check.

Attention

Attention-score plot for the sequence “5 января”: to generate the ending of “пятОГО”, the model takes into account not only the word “5” but also the subsequent characters of the word “января”.

As a basis, we took the seq2seq implementation in PyTorch [from here](https://bastings.github.io/annotated_encoder_decoder/). To find out more about the architecture, read the original guide, it’s just great. In more detail:

  • Our model is char-level;
  • The source dictionary contains Russian letters + English letters + punctuation + special tokens;
  • The target dictionary contains only Russian letters + punctuation (see the sketch below).
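To make the setup concrete, here is a small sketch of how such char-level dictionaries could be built. The exact alphabets and token names are assumptions rather than the repository’s actual code (digits are added to the source side here, since the model consumes numerals):

import string

RUS = 'абвгдежзийклмнопрстуфхцчшщъыьэюяё'
PUNCT = ' .,!?:;-«»'
SPECIAL = ['<pad>', '<sos>', '<eos>', '<unk>']

# The source side also sees Latin letters and digits; the target side emits only Russian.
src_chars = SPECIAL + list(RUS + RUS.upper() + string.ascii_letters + string.digits + PUNCT)
tgt_chars = SPECIAL + list(RUS + RUS.upper() + PUNCT)

src_stoi = {ch: i for i, ch in enumerate(src_chars)}
tgt_stoi = {ch: i for i, ch in enumerate(tgt_chars)}

def encode(text, stoi):
    # Map a string to char ids, wrapping it in <sos>/<eos> tokens.
    unk = stoi['<unk>']
    return [stoi['<sos>']] + [stoi.get(ch, unk) for ch in text] + [stoi['<eos>']]

print(encode('С 9 до 11', src_stoi))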

The more diverse and high-quality data you have, the better your model will train - it’s common knowledge. If you have ever tried to find such data for English, then you know that it is quite simple. Moreover, there are even open-source solutions for number normalization. As for Russian, everything is more complicated.

So, training data is a mix of:

  • Open-source normalized data - Russian Text Normalization - mostly books, partly normalized with rulesets;
  • Filtered data of random websites processed using our manual pipeline;
  • A couple of augmentations to make the model more robust - for example, punctuation and spaces in random places, capitalization, long numbers, etc. (see the sketch after this list).
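As an illustration of the last point, an augmentation step could look roughly like this; the probabilities and the exact set of perturbations are made up for the sketch:

import random

def augment(src: str) -> str:
    # Randomly perturb a training source string to make the model robust.
    if random.random() < 0.1:
        src = src.upper()  # random capitalization
    if random.random() < 0.2:
        # stray punctuation or an extra space at a random position
        pos = random.randrange(len(src) + 1)
        src = src[:pos] + random.choice([',', '.', ' ']) + src[pos:]
    return src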

TorchScript

In addition to solving the problem itself, we wanted to test out some new and interesting tools, such as TorchScript.
TorchScript is a great tool provided by PyTorch that helps you export your model from Python and even run it independently, as a C++ program.

In a nutshell, there are two ways in PyTorch to use TorchScript:

  1. Hardcore, which requires full immersion into the TorchScript language, with all the consequences;
  2. Gentle, with the torch.jit.script (and torch.jit.trace) tools working out of the box.

As it turned out, for models more complex than those presented in the official guide, you have to rewrite a couple of lines of code. However, it’s not rocket science, and only some minor changes are required:

  • Pay more attention to typing;
  • Find a way to rewrite unsupported functions and methods (a minimal example follows).
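For example, a tiny module that scripts cleanly might look like this. The module itself is invented for illustration, but torch.jit.script, save, and torch.jit.load are the standard TorchScript entry points:

import torch
from typing import List

class TinyDecoderStep(torch.nn.Module):
    # Minimal module illustrating TorchScript-friendly code:
    # explicit type annotations and only supported ops.
    def __init__(self, hidden: int):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor, lengths: List[int]) -> torch.Tensor:
        # Non-tensor arguments must be annotated for the scripting compiler.
        out = torch.tanh(self.proj(x))
        return out[:, : max(lengths)]

scripted = torch.jit.script(TinyDecoderStep(32))
scripted.save('tiny_step.pt')  # export: can be loaded without the Python class
restored = torch.jit.load('tiny_step.pt')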

For a more detailed guide, check out the channel post.

Examples

To sum up, check out a few normalization results. All the sentences below were taken from the test dataset, i.e. the model has not seen any of these examples:

  • norm.norm_string("Вторая по численности группа — фарсиваны — от 27 до 38 %.")

‘Вторая по численности группа — фарсиваны — от двадцати семи до тридцати восьми процентов.’

  • norm.norm_string("Висенте Каньяс родился 22 октября 1939 года")

‘Висенте Каньяс родился двадцать второго октября тысяча девятьсот тридцать девятого года’

  • norm.norm_string("играет песня «The Crying Game»")

‘играет песня «зэ краинг гейм»’

  • norm.norm_string("к началу XVIII века")

‘к началу восемнадцатого века’

  • norm.norm_string("и в 2012 году составляла 6,6 шекеля")

‘и в две тысячи двенадцатом году составляла шесть целых и шесть десятых шекеля’