Winning a CFT 2018 spelling correction competition

Building a task-agnostic seq2seq pipeline on a challenging domain

Posted by snakers41 on November 2, 2018



This is the first time we managed to win (i.e. take 1st place in) an ML competition together

A small philosophic preamble.

We participated as a team: yours truly (with Dmitry Voronin helping a bit) and Sava Kalbachou (Lucidworks), my old-time pal from many competitions and also a specialist in NLP.
It was a long, difficult, nervous and surprisingly rewarding journey for our team.

To sum all the experience in one line - the journey is more important than the destination. This competition taught us a lot about being persistent, trying new things, falling down and standing up again.

I personally compiled / optimized 3 or 4 pipelines for this task, until I could build the best one. Also, as usual, we found a crucial error one day before the finish line xD


We managed to apply some of the current state-of-the-art sequence-to-sequence models to a real in the wild problem in a challenging domain in a mixed multi-task / auto-encoder setting.

The core of the task was text input classification (correct / needs to be corrected / random non-relevant trash) and input correction (essentially spell checking).

The domain was names of individuals from the countries of the CIS:

  • Mostly Russia, Moldova, Uzbekistan, Georgia, Tajikistan, Kazakhstan, Ukraine, etc.;
  • Names usually consist of 3 parts (name, surname, patronym) - Иванов Иван Иванович. Sometimes in certain republics patronyms have additional suffixes added as a fourth word (ОГЛЫ, same as вич suffix in Russian, but a separate word);
  • There were 2+ alphabets with around 70+ characters in total, plus a lot of noise in the data;

Models that worked best

  • Sequence-to-sequence:
    • Tuned sequence-to-sequence bidirectional GRU (see details below) - 92.5% without heuristics, 93% with heuristics, 93%+ in ensemble;
    • Proper sequence-to-sequence inference loop;
  • Also a simpler model in Keras without a bottleneck and with a Conv1D output layer worked well - with a top score of 88-89%;
  • Strongest naïve heuristic - if the output of the seq2seq inference loop is the same as the input, then the input is correct;
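For illustration, the simpler "no bottleneck" idea can be sketched in PyTorch (our actual model was in Keras; the class name and all layer sizes below are illustrative): keep the full character sequence instead of compressing it into a fixed vector, and emit per-position character logits from a Conv1D head.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the "no bottleneck + Conv1D output" idea,
# in PyTorch rather than Keras; sizes are illustrative.
class NoBottleneckCorrector(nn.Module):
    def __init__(self, vocab=70, emb=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hidden, bidirectional=True, batch_first=True)
        # Per-position character logits; no fixed-size bottleneck in between.
        self.head = nn.Conv1d(2 * hidden, vocab, kernel_size=3, padding=1)

    def forward(self, x):                    # x: (batch, seq) char ids
        h, _ = self.rnn(self.emb(x))         # (batch, seq, 2 * hidden)
        return self.head(h.transpose(1, 2))  # (batch, vocab, seq)

model = NoBottleneckCorrector()
logits = model(torch.randint(0, 70, (2, 15)))
print(logits.shape)  # torch.Size([2, 70, 15])
```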

Key takeaways:

  • The seq2seq transformer took much more time to train, and we failed to make it work as well as traditional models. Obviously we lack experience with it, but this is the fourth time I have noticed practitioners mention that the transformer is not for "real" use (i.e. non-practical);
  • Our main competitor (with ~200 submissions) did not use Deep Learning at all - he essentially reverse-engineered the hosts' augmentation pipeline, which is tough work, but does not scale to other domains;
  • Batch seq2seq predictions are key to fast inference;
  • Beam search, though a tough nut to implement, boosted the score only insignificantly (the obvious culprit being the small dictionary);

Domain / dataset description

One illustration is better than a thousand words.
Please also note: in train + test there were ~2,000 distinct country values, i.e. the country input was also very noisy.
We ended up not using countries at all, as adding country embeddings did not boost our metrics.

Dataset columns: id, fullname, country, target, fullname_true

The hosts of the competition also shared a strong and simple baseline:

  • A linear regression classifier on ~1m n-grams;
  • Plain usage of a 3rd-party library to correct misprints;
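A comparable baseline classifier can be sketched with scikit-learn (a hypothetical sketch, not the hosts' code; the toy data, features and labels below are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Character n-grams stand in for the "~1m n-grams"; 0 = correct, 1 = has typo.
names = ["IVANOV IVAN IVANOVICH", "ICVANOV IVAN IVANOVICH",
         "PETROV PETR PETROVICH", "PETROV PETR PETROVQICH"]
labels = [0, 1, 0, 1]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(),
)
clf.fit(names, labels)
print(clf.predict(["SIDOROV PETR IVANOVICH"]))
```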

The distribution of the number of errors per query

Code release

We will not bother replicating the whole training process (also due to our agreement with the competition hosts), since there were a lot of fine-tuning steps, but overall these bits will be reusable for you:

  • Preparing the competition dataset (see the first chapter of the notebook);
  • Dockerfile in case you run into problems in your environment (when the image was built, PyTorch 0.4.0 / 0.4.1 was the current major version - target it, or you will have to make the code compatible);
  • Keras model:
  • PyTorch BiLSTM seq2seq:
    • Best model training: utils;
    • Best model classes;
    • Best model train pipeline;
    • Typical launch command (flag names are self-explanatory):
    ```
    --resume weights/pred_eval_encdec_savva_l4_heavy_h512e512_1e4_aug2fxpseudo_lrf01_resume3_best.pth.tar \
    --heavy_decoder True --lr_factor 0.5 \
    --trn_df_path ../data/pseudo_aug2_proc_trn.csv \
    --tb_name encdec_savva_l4_heavy_h512e512_1e4_aug2fxpseudo_lrf01_resume4 \
    --lr 1e-6 --epochs 100 \
    --batch_size 128 \
    --hidden_size 512 --dropout 0.2 \
    --num_layers 4 --num_classes 3 --emb_size 512 \
    --cn_emb_size 0 --num_cn 0 \
    --tensorboard True \
    ```

Key milestones in achieving our result

| Model | Overall score | Plain heuristic boost | Comment |
|---|---|---|---|
| OpenAI transformer | <50% | NA | we abandoned this idea (*) |
| Baseline | <80% | NA | |
| Baseline naïve Keras seq2seq | 88-89% | NA | it took Sava ~3 hours to build this |
| BiGRU baseline seq2seq | 82-85% | NA | following the annotated encoder-decoder |
| Best transformer seq2seq | 85%+ | NA | too long to train (**) |
| BiGRU seq2seq + tuning | 89-90% | 1% | (***) |
| BiGRU seq2seq + tuning + countries | 89-90% | 1% | no improvement |
| BiGRU seq2seq + tuning + heavy decoder | ~91% | 1% | minor improvement |
| BiGRU seq2seq + tuning + heavy decoder + skip | ~91% | 1% | minor or no improvement |
| BiGRU seq2seq + tuning + heavy decoder + WRONG train / pseudo augs | ~91.5% | 0.5%+ | |
| BiGRU seq2seq + tuning + heavy decoder + CORRECT train / pseudo augs | >92.5% | 0.5%+ | (****) |
  • (*) Most likely we just did not invest enough time into this, but the target of the base pipeline was different, so we abandoned the idea;
  • (**) Whereas the biggest reasonable BiGRU model trained for 15-20 hours, this would train for at least a week on one GPU to achieve a similar result;
  • (***) LR decay mostly, MOAR layers!;
  • (****) Twice we made errors when generating augmented data. At first I mixed up 2 columns (source and target) in 50% of cases - and the model trained a bit better at first. Then I forgot to change the target labels when augmenting the data;

Ideas / solutions / models that worked

Please refer to these amazing articles for more nuts and bolts of the models we used:

What worked:

  • Sequence to sequence modelling with biLSTMs (91%+);
  • Validation matched LB precisely - we did not try K-Fold validation, but splitting the dataset in advance was accurate enough;
  • Using a simple pipeline to clean extra characters and obvious errors;
  • Pseudo-labelling and text augmentations (91% + 1.5% on a single model);
  • Using the "whole" train dataset, i.e. using all of the correct names as "correct" and using all of the incorrect names. Paired with additional augmentations (1x the size of the train dataset), this turned out to be our single best pre-processing pipeline (91% + 1.5% on a single model);
  • Post-processing heuristics (see below);
  • Running batch-based inference loops (in the annotated encoder-decoder and the annotated transformer, predictions are handled with a batch size of 1);
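The validation setup mentioned above is just a single stratified hold-out split made in advance; roughly (a sketch; the toy frame and column names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the competition data; columns are illustrative.
df = pd.DataFrame({
    "fullname": [f"NAME{i}" for i in range(30)],
    "target": [0, 1, 2] * 10,
})
# One stratified hold-out split; its score tracked the LB closely for us.
trn, val = train_test_split(df, test_size=0.1, random_state=42,
                            stratify=df["target"])
print(len(trn), len(val))  # 27 3
```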

Ideas / solutions / models that did not work or we did not try

  • BPE / Sentencepiece - we tried several vocabularies (500, 1500, 5000 tokens, etc.) obtained in unsupervised fashion via Sentencepiece. But it did not really help the model converge faster, and the results were worse than we expected;
  • Beam search - most likely because of model confidence and the small vocabulary, it boosted the score a bit but did not provide a lot of value. It was a tough nut to crack, and I was disappointed it did not help as much as we expected;
  • Transformer model - when I adapted and optimized the annotated transformer pipeline it worked, but failed to converge within a meaningful time;
  • An LM-assisted classifier from the OpenAI transformer experiments - it failed to converge at all, most likely because it lacked a proper seq2seq inference loop. It was a low-hanging fruit for me (I just had this pipeline ready, while Sava boasted his initial 88% pipeline);
  • Multi-GPU / cluster / distributed training - didn’t try it;
  • Running augmentations on the fly - I believe you can extend this line in torchtext to do so, but I postponed it a bit and then just had no time. That may speed up convergence 3-10x in my experience;

Text data augmentation

After some reverse engineering (based, by the way, on word n-grams), it turned out that the organizers used pretty simple typo generation methods. There are 5 main types:

  • Insert, the most dominant type (~50% of all cases). Takes a random letter in a word and inserts a new letter before or after it. The new letter is a random letter closest (on a keyboard) to the chosen one, for example W, S or A if Q is chosen. IVAN - ICVAN;
  • Change, ~30%: changes a random letter in a word into a random letter closest to it on the keyboard (as in the insert case). IVAN - IVQN;
  • Swap positions of two adjacent random letters, ~5%. IVAN - IAVN;
  • Delete the space between two words, ~7.5%;
  • Delete a letter (completely random), ~7.5%. IVAN - VAN;

Each query with typos contains a combination of one (93% of all cases), two (6%) or three (~1%) mistakes listed above.

So we managed to write our own function for generating an additional dataset and used it alongside the original one for training.
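Our generator looked roughly like the sketch below for a single mistake (the probabilities follow the list above; the keyboard-neighbour map and function name are illustrative stubs - the real one covered the full keyboard / alphabets):

```python
import random

# Sketch of the reverse-engineered typo generator. The probabilities follow
# the list above; NEIGHBOURS is a tiny illustrative stub.
NEIGHBOURS = {"Q": "WSA", "W": "QASE", "A": "QWSZ", "I": "UOK", "V": "CFGB"}

def add_typo(phrase: str, rng: random.Random) -> str:
    words = phrase.split()
    kind = rng.choices(
        ["insert", "change", "swap", "drop_space", "delete"],
        weights=[0.50, 0.30, 0.05, 0.075, 0.075],
    )[0]
    if kind == "drop_space" and len(words) > 1:
        i = rng.randrange(len(words) - 1)
        return " ".join(words[:i] + [words[i] + words[i + 1]] + words[i + 2:])
    i = rng.randrange(len(words))
    w = words[i]
    j = rng.randrange(len(w))
    if kind == "insert":
        w = w[:j] + rng.choice(NEIGHBOURS.get(w[j], "X")) + w[j:]
    elif kind == "change":
        w = w[:j] + rng.choice(NEIGHBOURS.get(w[j], "X")) + w[j + 1:]
    elif kind == "swap" and len(w) > 1:
        j = min(j, len(w) - 2)
        w = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    elif kind == "delete" and len(w) > 1:
        w = w[:j] + w[j + 1:]
    words[i] = w
    return " ".join(words)

print(add_typo("IVANOV IVAN IVANOVICH", random.Random(42)))
```

Mixing one, two or three such calls per query with weights 93% / 6% / 1% reproduces the error-count distribution described above.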

Post-processing heuristics

Heuristic that worked

After building the first seq2seq model we noticed that sometimes the model predicts class 0 (the phrase doesn't contain an error), but the predicted phrase is different from the original one. We used this as a heuristic to change the predicted class from 0 to 1 (the phrase contains an error), which gave us a 0.5-1.5% boost, depending on the quality of the model (less for stronger models).

Another heuristic was changing the predicted class / phrase to the predictions of a quite different model (like the Keras one) in cases where we were sure there was a mistake in the predictions (the model predicts class 1, but the predicted phrase is equal to the original phrase).
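Both heuristics boil down to a small post-processing function; a sketch (the function name and the (class, phrase) tuple convention are illustrative):

```python
# Sketch of the two post-processing heuristics described above.
def apply_heuristics(src, pred_class, pred_phrase, backup_pred=None):
    # Heuristic 1: model says "no error" (class 0) but decoded a different
    # phrase -> flip the class to 1 ("contains an error").
    if pred_class == 0 and pred_phrase != src:
        return 1, pred_phrase
    # Heuristic 2: model says "error" (class 1) yet decoded the input
    # verbatim -> fall back to a different model's prediction, if any.
    if pred_class == 1 and pred_phrase == src and backup_pred is not None:
        return backup_pred
    return pred_class, pred_phrase

print(apply_heuristics("IVANOV IVAN", 0, "IVANOV IVANN"))  # (1, 'IVANOV IVANN')
```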

Heuristics that did not work, or that we did not explore

At some point we noticed that the network is several times more likely to make errors on the words it "knows" than on "new" words (names / surnames / patronyms). This meant the network generalized well, but we nevertheless tried applying an obvious heuristic to test, similar in essence to pseudo-labelling.

Obviously we tried the dictionary heuristic, i.e. checking whether the predicted words are present in the vocabulary of names. It gave a significant boost on validation, but failed to provide a reasonable boost on test (most likely because rare names are predicted wrong and / or are not present in train). We also tried adding predicted "correct" test names to the dictionary, but that did not help.

Train vs. test word dictionary overlap

Ensembling models

As it often happens in competitions, at some point close to the end you just have to do some stacking / ensembling to keep competing and stay at the top of the LB. This contest wasn't an exception.

We used hold-out validation data to find the optimal way of ensembling and its parameters. For the first task, classification, we used a geometric mean of the predicted probabilities from a variety of diverse models. For the second task, name correction, we implemented a simple majority voting mechanism applied to each token of the predicted names. After each step, the heuristics described above were applied as well.

This quite simple yet efficient ensemble gave us a final ~0.003% boost and 1st place. We selected it and the best single model as our two final submissions.
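The ensembling described above can be sketched as follows (a sketch; the function names and array shapes are illustrative):

```python
from collections import Counter

import numpy as np

def ensemble_classes(prob_list):
    """Geometric mean of per-model class probabilities, then argmax."""
    logp = sum(np.log(np.clip(p, 1e-9, 1.0)) for p in prob_list)
    return int(np.argmax(logp / len(prob_list)))

def ensemble_phrase(phrases):
    """Token-wise majority vote over the models' corrected names."""
    tokens = [p.split() for p in phrases]
    out = []
    for i in range(max(len(t) for t in tokens)):
        votes = Counter(t[i] for t in tokens if len(t) > i)
        out.append(votes.most_common(1)[0][0])
    return " ".join(out)

print(ensemble_classes([np.array([0.6, 0.3, 0.1]),
                        np.array([0.2, 0.7, 0.1])]))        # 1
print(ensemble_phrase(["IVANOV IVAN", "IVANOV IVNA", "IVANOV IVAN"]))  # IVANOV IVAN
```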

Improving the annotated encoder-decoder and the annotated transformer

As you may know, biLSTMs / biGRUs and transformers are among the best models on the market for challenging tasks like the following:

  • Neural Machine Translation;
  • Text abstraction;
  • Sequence to sequence modelling;
  • Text inference tasks;

And we were standing on the shoulders of giants when re-using the stellar pipelines from the annotated encoder-decoder and the annotated transformer tutorials.

But like everything in life, they are not perfect.

Key shortcomings:

  • Predictions in both of them are handled with a batch size of one. This is not ideal when you run a lot of end-to-end experiments;
  • The annotated encoder-decoder boasts loss weighting, but it is not explained that when pairing this with secondary objectives (as in our case), you have to be extra careful weighting both components of the loss to avoid over- / under-fitting on one task;
  • While torchtext is awesome for the purpose it was built for (accelerating NMT experiments), for applied competition NLP pipelines it is a bit slow and bulky, and its seed parameters are non-intuitive. Probably I should have solved caching a bit better, but I opted for the simplest solution with CSV files, explained below;
  • I fully understand that distributed / data-parallel routines in PyTorch are a bit experimental and high-maintenance, but of course the annotated transformer's multi-GPU loss compute function broke in PyTorch 0.4+. I decided not to fix this, but probably I should have, especially for the transformer;

Batch predictions and beam search

Much attention in both tutorials was paid to running models efficiently during training and validation. But for some reason the greedy-search inference loops presented in the tutorials are written for batches of size 1.

As you may be aware, this slows down end-to-end validation ~10x. So I essentially wrote 3 functions:
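The core of a batched greedy loop can be sketched independently of any particular model (a sketch; `decode_step` is a hypothetical callable standing in for one decoder step over the whole batch, keeping its own hidden state between calls):

```python
import torch

def greedy_decode_batch(decode_step, batch_size, sos_id, eos_id, max_len):
    """Greedy decoding for a whole batch at once.

    decode_step(prev_tokens) -> logits of shape (batch, vocab); it is
    expected to keep its own hidden state between calls (a closure or a
    stateful object). Finished sequences keep emitting <eos>.
    """
    prev = torch.full((batch_size,), sos_id, dtype=torch.long)
    finished = torch.zeros(batch_size, dtype=torch.bool)
    out = []
    for _ in range(max_len):
        logits = decode_step(prev)
        prev = logits.argmax(dim=-1)
        prev = torch.where(finished, torch.full_like(prev, eos_id), prev)
        finished |= prev.eq(eos_id)
        out.append(prev)
        if finished.all():
            break
    return torch.stack(out, dim=1)  # (batch, seq_len)

# Tiny demo with a stateful dummy decoder: emits token 2, then <eos> (3).
calls = {"n": 0}
def step(prev):
    calls["n"] += 1
    logits = torch.zeros(2, 4)
    logits[:, 2 if calls["n"] == 1 else 3] = 1.0
    return logits

print(greedy_decode_batch(step, 2, sos_id=0, eos_id=3, max_len=10).tolist())
# [[2, 3], [2, 3]]
```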

All of these abstractions from torchtext … how to cook them well

Obviously these were built for NMT. Unlike the standard PyTorch Dataset / DataLoader classes, which are a bit more flexible, they also load all the data into RAM as tensors for faster iteration (so it is a good idea to use Python iterators with these to avoid extra RAM usage).

Advantages of these abstractions:

  • Code reuse and simplicity;
  • They are … high level and abstract;
  • The steps of creating, describing and pre-processing the dataset are easy and streamlined;
  • Tedious steps like adding start tokens, end tokens, unique tokens, pad tokens are just done for you;
  • Once you get hold of the splits and dictionary building - it becomes really easy to substitute your text preprocessing functions to try different tokenization schemes (for example check this one and this one);


Disadvantages of these abstractions:

  • You have to be very careful with leaks when building dictionaries to make re-runs of the same model compatible (this actually held me up a bit in the beginning);
  • Obviously this works out of the box only for numerical and text data. For any other forms of data you will have to improvise / extend the classes;
  • I did not like how splits were implemented there, so we ended up creating a hold-out dataset by splitting off a separate .csv file;
  • In case you have a lot of data (e.g. 10M input-target pairs), RAM consumption is very high, so you essentially have to monkey-patch / extend these classes if you want to use more data / augmentations;
  • Though in theory it is easy to extend these classes to read .feather instead of .csv, I did not do it due to lack of time and focus;
  • Pickling issues - since reading .csv files is slow, I tried to pickle some obvious objects, but failed: iterators / generators are not picklable in Python 3. I guess I should have done it with the dataset class (not the BucketIterator class);

Model flavours

At some point in time I noticed a couple of bottlenecks in our best model:

  • The flow between encoder and decoder is bottlenecked by the fact that the encoder is bidirectional and the decoder is not. Simply speaking, I solved this by making my decoder 2 times bigger (and it helped the model converge);
  • I also experimented with adding country embeddings and with adding a skip connection plus one more attention layer, in essence similar to the copy attention from abstraction models - these ideas did not work for me;
  • I also played with dropping embeddings entirely (a char-level model), with poor results. On the other hand, the Keras models worked reasonably better at char level than with embeddings.

Notably, the best models were quite heavy, with the heavy decoder also adding quite a lot of time per epoch.
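The "heavy decoder" trick boils down to matching dimensions (a minimal sketch; all sizes are illustrative): a bidirectional encoder with hidden size H emits 2*H features per step, so instead of projecting down to H we give the unidirectional decoder hidden size 2*H.

```python
import torch
import torch.nn as nn

# Illustrative sizes: hidden, embedding, batch, sequence length.
H, E, B, T = 64, 32, 4, 10

encoder = nn.GRU(E, H, bidirectional=True, batch_first=True)
# "Heavy" decoder: hidden size 2 * H matches the concatenated forward /
# backward encoder states, so no lossy projection down to H is needed.
decoder = nn.GRU(E, 2 * H, batch_first=True)

src = torch.randn(B, T, E)    # stand-in for embedded input characters
enc_out, enc_h = encoder(src)  # enc_out: (B, T, 2 * H); enc_h: (2, B, H)
# Last forward and backward hidden states -> initial decoder state (1, B, 2 * H).
dec_h0 = torch.cat([enc_h[0], enc_h[1]], dim=-1).unsqueeze(0)
dec_out, _ = decoder(torch.randn(B, T, E), dec_h0)
print(enc_out.shape, dec_out.shape)  # both (4, 10, 128)
```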

Loss weighting is crucial!

I did not run ablation tests for this (though I have a lot of experience failing in object detection with 2-3 losses) - but I assumed that since the loss is essentially applied n_token times for the language modelling target and then normalized by the number of tokens, to achieve the best results the classification loss should be similarly sized. I achieved this simply by using the most convenient loss layer in PyTorch (yeah, I forgot to divide this loss by batch size, but this turned out not to be so crucial, I guess).
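A sketch of what this weighting looks like in PyTorch (shapes, the pad id and the weight `w` are illustrative; the point is that averaging the token-level loss over tokens keeps it on a scale comparable to the per-sample classification loss):

```python
import torch
import torch.nn as nn

# Illustrative shapes: batch, seq length, vocab, number of classes, pad id.
B, T, V, C, PAD = 4, 12, 70, 3, 0

token_logits = torch.randn(B, T, V, requires_grad=True)
token_targets = torch.randint(1, V, (B, T))    # toy targets, no padding here
class_logits = torch.randn(B, C, requires_grad=True)
class_targets = torch.randint(0, C, (B,))

# The default reduction="mean" normalizes the seq2seq loss by the number of
# (non-pad) tokens, putting it on the same scale as the per-sample class loss.
seq_loss = nn.CrossEntropyLoss(ignore_index=PAD)(
    token_logits.reshape(B * T, V), token_targets.reshape(B * T))
cls_loss = nn.CrossEntropyLoss()(class_logits, class_targets)

w = 1.0  # relative weight of the classification head (tuned in practice)
loss = seq_loss + w * cls_loss
loss.backward()
print(float(seq_loss), float(cls_loss))
```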

KLD loss?

A neat idea from the annotated transformer that I did not explore - replace the classification loss with a KLD-like loss (usually used in variational auto-encoders) to add more variance to predictions. But I guess it is not important for a small vocabulary.
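For context, the annotated transformer applies this via label smoothing trained against KLDivLoss; a condensed sketch (vocabulary size, smoothing value and pad index are illustrative):

```python
import torch
import torch.nn as nn

def smoothed_kld_loss(logits, target, vocab_size, smoothing=0.1, pad_idx=0):
    """KLDivLoss against a smoothed target distribution, in the spirit of
    the annotated transformer's LabelSmoothing criterion."""
    log_probs = torch.log_softmax(logits, dim=-1)
    # Spread `smoothing` mass over non-target, non-pad tokens.
    true_dist = torch.full_like(log_probs, smoothing / (vocab_size - 2))
    true_dist.scatter_(1, target.unsqueeze(1), 1.0 - smoothing)
    true_dist[:, pad_idx] = 0.0
    true_dist[target.eq(pad_idx)] = 0.0  # zero out padded positions entirely
    return nn.KLDivLoss(reduction="sum")(log_probs, true_dist)

logits = torch.randn(5, 10)
target = torch.tensor([1, 2, 3, 4, 0])  # last position is padding
print(smoothed_kld_loss(logits, target, vocab_size=10).item())
```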

Kind words about organization and comparison to Kaggle

If you do Kaggle / monitor the ML scene, then you may be aware that Kaggle is all of these things:

  • Essentially now it is a joke for serious people;
  • Plagued by bozo explosion;
  • Each recent major competition features a severe idiotic leak and/or organization flaw - obviously no one cares there anymore;
  • In 2018 it is essentially great for getting free experience / free datasets / free code - but all the medals and achievements are largely overrated (kernels are mostly farmed by people who want to get a lot of attention, though sometimes there are super stellar pipelines);

Do not believe me - just read these posts 1 2 3;

Anyway to the point - the organization of this competition was NOTHING like Kaggle:

  • No leaks and / or opportunities for idiotic LB probing;
  • Good correlation of local validation / private / public LB;
  • Only a few minor errors by the hosts (occasional glitches with the platform, no team formation deadline, a too-late deadline for external data, no compute limitations and no dockerization / re-running requirements);
  • Interesting and challenging domain;



End to end metric curves

Original lm transformer

Original encoder decoder seq2seq

Tuned encoder decoder seq2seq

Adding WRONG augmentations

Playing with transformer based models

Adding CORRECT augmentations

Competition pulse

At first Sava dominated the leaderboard (LB) with his 88% score for a long time.
This is the moment when the competition caught up. My pipelines were mostly failures at that point.

Then I finally managed to make the annotated encoder decoder work and we were trolling the LB again for some time (probably a week or so).

We did not really bother to stack our models and / or show our best results, but at this point it was worrying, because our single best model + heuristics did not really have a lot of breathing room. At this point we could achieve ~92%, but we kept waiting and trying different models and heuristics.

This was the pre-final leaderboard sometime during the last day. Notice thousandvoices at the top. We had to ensemble our models quite a bit here. Also, I found a bug in my augmentation pipeline 1 day before the competition end, and our single best model started learning quite quickly, achieving ~92.5% without heuristics and ~93% with a plain heuristic.

But this did not transfer well to our ensemble: a 1% gain in the best model translated into 0.2-0.3%, for reasons explained above.

It was quite a hot battle during the last minutes (I wonder why people do not sleep at 11:59 PM on Sunday???)

And this is the final private leaderboard. Notice that we kept our ground, while thousandvoices likely:

  • Either somehow over-fitted (or maybe used some leaks, who knows);
  • Or just forgot to choose his best final submissions;

But history is written by the victors.