Parsing Wikipedia in 4 simple commands for a plain NLP corpus
TLDR
Nowadays this is as easy as:

```bash
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
wget -P ../data/ http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2
python3 WikiExtractor.py -o ../data/wiki/ --no-templates --processes 8 ../data/ruwiki-latest-pages-articles.xml.bz2
```
followed by this plain post-processing script [5] (the script is published as a gist; it is kept short and functional so that it is easy to extend):

```bash
python3 process_wikipedia.py
```
This will produce a plain .csv file with your corpus.
Obviously:

- the dump URL (http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2) can be tweaked to obtain a dump in your language; for details see [4];
- for the `wikiextractor` parameters, simply check its help output (I suspect that even its official docs are out of date);
The post-processing script turns the extracted Wikipedia files into a table like this:
idx | article_uuid | sentence | cleaned sentence | cleaned sentence length |
---|---|---|---|---|
0 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Жан I де Шатильон (граф де Пентьевр)Жан I де Ш… | жан i де шатильон граф де пентьевр жан i де ша… | 38 |
1 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Находился под охраной Роберта де Вера, графа О… | находился охраной роберта де вера графа оксфор… | 18 |
2 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Однако этому воспротивился Генри де Громон, гр… | однако этому воспротивился генри де громон гра… | 14 |
3 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Король предложил ему в жёны другую важную особ… | король предложил жёны другую важную особу фили… | 48 |
4 | 74fb822b-54bb-4bfb-95ef-4eac9465c7d7 | Жан был освобождён и вернулся во Францию в 138… | жан освобождён вернулся францию году свадьба м… | 52 |
article_uuid is pseudo-unique, and the sentence order within each article is expected to be preserved.
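If you want to sanity-check the result, something along these lines should work; this is a sketch, and the file name and column names are assumptions based on the table above, not guaranteed by the script in [5]:

```python
import pandas as pd

# Load the corpus produced by the post-processing script (path is a placeholder).
df = pd.read_csv('../data/wiki_corpus.csv')

# Quick sanity checks: row count, the sentences of one article, length distribution.
print(len(df))
print(df[df['article_uuid'] == df['article_uuid'].iloc[0]]['sentence'].head())
print(df['cleaned sentence length'].describe())
```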
Motivation
Arguably, the current state of ML tooling enables practitioners [8] to build and deliver scalable NLP pipelines within days. The only problem arises if you do not have a trustworthy public dataset / pre-trained embeddings / language model. This article aims to make that a bit easier by illustrating that preparing a Wikipedia corpus (the most common corpus for training word vectors in NLP) is feasible in just a couple of hours. After all, if a model takes 2 days to build, why spend much more time engineering a crawler just to get the data?
High-level script walk-through
The WikiExtractor tool saves Wikipedia articles as plain text split into `<doc>` blocks. This can easily be leveraged with the following logic (a minimal sketch of this logic follows the list):
- Obtain the list of all output files;
- Split the files into articles;
- Remove any remaining HTML tags and special characters;
- Use `nltk.sent_tokenize` to split the articles into sentences;
- To keep the code simple, assign a `uuid` to each article;
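The sketch below follows that logic, but it is not the exact script from [5]; the directory layout, the regexes, and the output file name are illustrative assumptions:

```python
import glob
import re
import uuid

import pandas as pd
from nltk import sent_tokenize  # requires nltk.download('punkt')


def iter_articles(wiki_dir='../data/wiki/'):
    """Yield the text of each <doc> block from the WikiExtractor output."""
    # WikiExtractor writes files like ../data/wiki/AA/wiki_00, ../data/wiki/AA/wiki_01, ...
    for path in sorted(glob.glob(wiki_dir + '*/wiki_*')):
        with open(path, encoding='utf-8') as f:
            text = f.read()
        # Each article sits inside a <doc ...> ... </doc> block.
        for article in re.findall(r'<doc.*?>(.*?)</doc>', text, flags=re.DOTALL):
            yield article


rows = []
for article in iter_articles():
    # Drop any leftover tags and collapse whitespace.
    article = re.sub(r'<[^>]+>', ' ', article)
    article = re.sub(r'\s+', ' ', article).strip()
    # One pseudo-unique id per article keeps the bookkeeping trivial.
    article_uuid = str(uuid.uuid4())
    for sentence in sent_tokenize(article):
        rows.append({'article_uuid': article_uuid, 'sentence': sentence})

df = pd.DataFrame(rows)
df.to_csv('../data/wiki_corpus.csv', index=False)
```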
For text pre-processing I just used the following (change it to fit your needs; a sketch follows the list):
- Removing non-alphanumeric characters;
- Removing stop words;
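A minimal version of such a cleaning step, assuming NLTK's Russian stop-word list; the lowercasing and the exact regex are my assumptions, not necessarily what the script in [5] does:

```python
import re

from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('russian'))


def clean_sentence(sentence):
    """Lowercase, keep only alphanumeric tokens, and drop stop words."""
    sentence = sentence.lower()
    # Keep Cyrillic / Latin letters and digits, replace everything else with spaces.
    sentence = re.sub(r'[^a-zа-яё0-9]+', ' ', sentence)
    tokens = [t for t in sentence.split() if t not in STOP_WORDS]
    return ' '.join(tokens)


print(clean_sentence('Находился под охраной Роберта де Вера, графа Оксфорда.'))
# expected: 'находился охраной роберта де вера графа оксфорда'
```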
I have the dataset; what should I do next?
Basic use cases
Usually people use one of the following instruments for the most common applied NLP task of embedding words:
- Word vectors / embeddings [6];
- Internal states of CNNs / RNNs pre-trained on tasks such as fake sentence detection / language modeling / classification [7];
- A combination of the above;
It has also been shown multiple times [9] that plain averaged word embeddings (with minor details that we will omit for now) are a strong baseline for the sentence / phrase embedding task.
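As an illustration of that baseline, here is a sketch assuming gensim and some pre-trained word2vec-format vectors (e.g. from [1] or [2]); the file name is a placeholder:

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format vectors for your language will do.
wv = KeyedVectors.load_word2vec_format('wiki.ru.vec', binary=False)


def sentence_embedding(cleaned_sentence):
    """Average the word vectors of the tokens we have vectors for."""
    vectors = [wv[token] for token in cleaned_sentence.split() if token in wv]
    if not vectors:
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)


print(sentence_embedding('жан освобождён вернулся францию').shape)  # e.g. (300,)
```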
Some other common use cases
- Using random sentences from Wikipedia for negative mining when training with a triplet loss (see the sketch below);
- Training sentence encoders via fake sentence detection [10];
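For the first use case, the sampling itself can be trivial. The sketch below assumes the corpus .csv from the TLDR section; drawing negatives only from other articles is my assumption about what counts as a safe negative:

```python
import pandas as pd

df = pd.read_csv('../data/wiki_corpus.csv')


def sample_negatives(anchor_uuid, n=5, seed=42):
    """Sample random sentences that do not come from the anchor's article."""
    pool = df[df['article_uuid'] != anchor_uuid]
    return pool['sentence'].sample(n=n, random_state=seed).tolist()


anchor = df.iloc[0]
print(sample_negatives(anchor['article_uuid']))
```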
Some charts for Russian Wikipedia
Figure: sentence length distribution in Russian Wikipedia, shown in natural units (truncated) and on a log10 scale.
References
1. Fast-text word vectors pre-trained on Wikipedia;
2. Fast-text and Word2Vec models for the Russian language;
3. Awesome wiki extractor library for Python;
4. Official Wikipedia dumps page;
5. Our post-processing script;
6. Seminal word embedding papers: Word2Vec, Fast-Text, further tuning;
7. Some of the current SOTA approaches: InferSent / Generative pre-training of CNNs / ULMFiT / Deep contextualized word representations (ELMo);
8. Imagenet moment in NLP?
9. Sentence embedding baselines: 1, 2, 3, 4;
10. Fake Sentence Detection as a Training Task for Sentence Encoding;