Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval

Really?

Posted by snakers41 on September 3, 2018


TLDR

Nowadays this is as easy as:

git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
wget -P ../data/ http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2
python3 WikiExtractor.py -o ../data/wiki/ --no-templates --processes 8 ../data/ruwiki-latest-pages-articles.xml.bz2


and then following it up with this plain script [5] (published as a gist; it is kept functional and concise so that it is easy to extend):

python3 process_wikipedia.py


This will produce a plain .csv file with your corpus.


Obviously:

  • http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2 can be tweaked to fetch the dump for your language (the English one, for instance, is enwiki-latest-pages-articles.xml.bz2); for details see [4];
  • For the wikiextractor parameters, simply check its built-in help, e.g. python3 WikiExtractor.py --help (I suspect that even its official docs are out of date);


The post-processing script turns Wikipedia files into a table like this:


| idx | article_uuid | sentence | cleaned sentence | cleaned sentence length |
| --- | --- | --- | --- | --- |
| 0 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Жан I де Шатильон (граф де Пентьевр)Жан I де Ш… | жан i де шатильон граф де пентьевр жан i де ша… | 38 |
| 1 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Находился под охраной Роберта де Вера, графа О… | находился охраной роберта де вера графа оксфор… | 18 |
| 2 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Однако этому воспротивился Генри де Громон, гр… | однако этому воспротивился генри де громон гра… | 14 |
| 3 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Король предложил ему в жёны другую важную особ… | король предложил жёны другую важную особу фили… | 48 |
| 4 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Жан был освобождён и вернулся во Францию в 138… | жан освобождён вернулся францию году свадьба м… | 52 |


article_uuid is pseudo-unique, and the sentence order within each article is supposed to be preserved.
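
If you need whole articles back rather than individual sentences, grouping by that uuid is enough. The quick pandas sketch below assumes the .csv path and the column names shown above; adjust them to your actual output.

import pandas as pd

# the path is an assumption - point it at the .csv produced by the post-processing script
wiki_df = pd.read_csv('../data/wiki_corpus.csv')

# sentences are stored in document order, so a plain join restores each article
articles = (wiki_df
            .groupby('article_uuid', sort=False)['sentence']
            .apply(' '.join))

print(articles.iloc[0][:200])  # first 200 characters of the first article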

Motivation


Arguably, the state of current ML tooling enables practitioners [8] to build and deliver scalable NLP pipelines within days. The problem arises only if you do not have a trustworthy public dataset / pre-trained embeddings / language model. This article aims to make that a bit easier by illustrating that preparing a Wikipedia corpus (the most common corpus for training word vectors in NLP) is feasible in just a couple of hours. So, if you spend 2 days building a model, why spend much more time engineering some crawler to get the data?

High level script walk-through

The WikiExtractor tool saves Wikipedia articles as plain text split into <doc> blocks. This can easily be leveraged with the following logic (a code sketch follows the list):

  • Obtain the list of all output files;
  • Split the files into articles;
  • Remove any remaining HTML tags and special characters;
  • Use nltk.sent_tokenize to split into sentences;
  • Keep the code simple and avoid bulky bookkeeping by assigning a uuid to each article;
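
The actual script [5] is the canonical version; the minimal sketch below just shows the shape of that loop. The ../data/wiki/ location follows the commands above, while the output path, the column names and the rough inline cleaning are assumptions (NLTK's punkt model also has to be downloaded once).

import glob
import re
import uuid

import pandas as pd
from nltk import sent_tokenize  # needs a one-off nltk.download('punkt')


def iter_articles(path_pattern='../data/wiki/**/wiki_*'):
    """Yield (article_uuid, article_text) pairs from WikiExtractor output files."""
    doc_re = re.compile(r'<doc[^>]*>(.*?)</doc>', re.DOTALL)
    for file_path in sorted(glob.glob(path_pattern, recursive=True)):
        with open(file_path, encoding='utf-8') as f:
            raw = f.read()
        for match in doc_re.finditer(raw):
            # one pseudo-unique id per article keeps the bookkeeping trivial
            yield str(uuid.uuid4()), match.group(1).strip()


rows = []
for article_uuid, article in iter_articles():
    for sentence in sent_tokenize(article):  # default punkt model; swap in your language if available
        # very rough cleaning; a fuller version with stop words is sketched below
        cleaned = ' '.join(re.sub(r'[^\w\s]', ' ', sentence.lower()).split())
        rows.append((article_uuid, sentence, cleaned, len(cleaned.split())))

df = pd.DataFrame(rows, columns=['article_uuid', 'sentence',
                                 'cleaned_sentence', 'cleaned_sentence_len'])
df.to_csv('../data/wiki_corpus.csv', index=False)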


As text pre-processing I simply used the following (change it to fit your needs; a minimal cleaning function is sketched after this list):

  • Removing non-alphanumeric characters;
  • Removing stop words;
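
A minimal version of such a cleaning function might look like this; it assumes NLTK's Russian stop word list, so swap the language or the rules to fit your corpus.

import re

from nltk.corpus import stopwords  # needs a one-off nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('russian'))


def clean_sentence(sentence):
    """Lower-case, drop non-alphanumeric characters and stop words."""
    lowered = re.sub(r'[^\w\s]', ' ', sentence.lower())
    tokens = [token for token in lowered.split() if token not in STOP_WORDS]
    return ' '.join(tokens)


print(clean_sentence('Находился под охраной Роберта де Вера, графа Оксфорда.'))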

I have the dataset, what to do next?

Basic use cases

Usually people use one of the following instruments for the most common applied NLP task, i.e. embedding a word:

  • Word vectors / embeddings [6];
  • Some internal states of CNNs / RNNs pre-trained on some tasks like fake sentence detection / language modeling / classification [7];
  • A combination of the above;

It has also been shown multiple times [9] that plain averaged word embeddings (with minor details that we will omit for now) are also a great baseline for the sentence / phrase embedding task.
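
For illustration, such an averaged baseline fits in a few lines. The sketch below assumes gensim and a pre-trained fast-text .vec file such as those from [1] or [2] (the file name is a placeholder), and omits the weighting tricks discussed in [9].

import numpy as np
from gensim.models import KeyedVectors

# the .vec file name is a placeholder - point it at your downloaded vectors [1, 2]
word_vectors = KeyedVectors.load_word2vec_format('wiki.ru.vec', binary=False)


def sentence_embedding(cleaned_sentence):
    """Plain average of word vectors; out-of-vocabulary tokens are skipped."""
    vectors = [word_vectors[token]
               for token in cleaned_sentence.split()
               if token in word_vectors]
    if not vectors:
        return np.zeros(word_vectors.vector_size)
    return np.mean(vectors, axis=0)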

Some other common use cases

  • Using random sentences from Wikipedia as negatives when mining with a triplet loss (see the sketch after this list);
  • Training sentence encoders via fake sentence detection [10];
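
For the first point, the mining itself can be as simple as drawing sentences from other articles (harder negatives usually require a first-pass model or a search index). The path and column names below follow the table above and are assumptions, not part of the gist [5].

import pandas as pd

wiki_df = pd.read_csv('../data/wiki_corpus.csv')  # assumed path to the corpus .csv


def sample_negatives(anchor_uuid, n=5, seed=42):
    """Draw n random sentences from articles other than the anchor's article,
    to be used as easy negatives in a triplet loss setup."""
    pool = wiki_df[wiki_df['article_uuid'] != anchor_uuid]
    return pool.sample(n=n, random_state=seed)['sentence'].tolist()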

Some charts for Russian Wikipedia

[Charts: sentence length distribution in Russian Wikipedia, shown both in natural terms (truncated) and on a log10 scale.]

References

  1. Fast-text word vectors pre-trained on Wikipedia;
  2. Fast-text and Word2Vec models for the Russian language;
  3. Awesome wiki extractor library for python;
  4. Official Wikipedia dumps page;
  5. Our post-processing script;
  6. Seminal word embeddings papers: Word2Vec, Fast-Text, further tuning;
  7. Some of the current SOTA CNN / RNN-based approaches: InferSent / Generative pre-training of CNNs / ULMFiT / Deep contextualized word representations (ELMo);
  8. Imagenet moment in NLP?
  9. Sentence embedding baselines 1, 2, 3, 4;
  10. Fake Sentence Detection as a Training Task for Sentence Encoding;