Parsing Common Crawl in 4 plain scripts in python

(alas not 2) =)

Posted by snakers41 on October 8, 2018




TLDR

After starting the CC mini-project in our last post, we ran into several challenges, all of which we more or less resolved (or avoided altogether).

In the end, the full pipeline looks like this (see detailed explanations below):

python3 parse_cc_index.py
python3 save_cc_indexes.py
python3 prepare_wet_indexes.py
python3 process_wet_files.py

New challenges:

  1. The amount of data turned out to be ~10-20x larger than expected;
  2. The structure of the indexes - the CC index appears to be alphabetically ordered, but the domains in the WET / WARC files are not (see some charts below), i.e. the data in question is more or less “uniformly” distributed across the WET files (there are no huge WET blobs with 100% of Russian sites);
  3. Concurrency / download speed;
  4. CPU-bound pre-processing - it turned out to be ~4x more resource-intensive than expected;

How we solved them:

  1. Well, since such a corpus is meant to be used in combination with other corpora, there is no need to download terabytes of text; I guess ~100-200 GB of text will suffice;
  2. This is just how CC works, but it can be mitigated by processing the most “important” WET files first;
  3. Just order the fastest download speed you can get from your ISP and load files in as many threads as you can. Funnily enough, in Moscow this is not a bottleneck at all. Just look at the tariffs available - 500 Mbit/s connections are available for essentially USD 10-15 per month (and you would pay USD 5-10 for your Internet anyway - so this is not a real sunk cost);
  4. This is a tougher one - you need a lot of CPU cores. But in my case the output of my pipeline was around ~30-50 GB of text per day on 50% of my home machine (~3 physical cores of my Intel® Core™ i7-6800K CPU @ 3.40GHz), which is sufficient, I guess. I underestimated the post-processing, but I underestimated the amount of data even more (texts compress well);


As usual, the new scripts are available here.

What the end result looks like

Yes, you can drop the url / domain / tld columns.
But disk space is essentially free nowadays )


| url | domain | tld | sentence |
|---|---|---|---|
| https://priem.s-vfu.ru/priemnaya-kampaniya-201 | s-vfu | ru | 43.03.01 Сервис (Сервис в индустрии моды и кра… |
| https://www.japancar.ru/gd/T1732772.html | japancar | ru | Прямую ссылку на страницы сайта, которые содер… |
| http://blackannis.org/rock-5232.html | blackannis | org | Он был найден в качестве алебастра, декоративн… |
| https://bykvu.com/bukvy/91424-minzdrav-predlag | bykvu | com | Самые важные и свежие новости - вы узнаете пер… |
| http://metkarkasnn.ru/tables/podstoliya/karkas | metkarkasnn | ru | Оборудование для тату салона |
| http://limpopo.com.ru/komplekt-v-krovatku-5-pr | com | ru | Развивающие и интерактивные |
| https://costagarant.com/news-1472642410/ | costagarant | com | Написать нам сообщение |
| https://roof-facade.com/catalog/product/d_foli | roof-facade | com | Панельные ограждения |
| https://habr.com/post/354774/ | habr | com | Мне вот интересно, а почему на некоторых играх… |
| https://kino.otzyv.ru/film/Пле% | otzyv | ru | Дней до премьеры: |
| http://sharepix.ru/dzhenerik-ledihep-ot-gepane | sharepix | ru | Здоровье |
| http://iknigi.net/avtor-anna-oduvalova/160271- | iknigi | net | Пожалуйста, подождите… |
| http://www.kaskad-electro.ru/magazin/product/s | kaskad-electro | ru | Главная \ Интернет-магазин \ Светодиодные комм… |
| https://libertycity.ru/files/gta-4/5152-nypd-p | libertycity | ru | Новый, совершенно новый кар-пак.Особенности: -… |
| http://www.debri-dv.com/m/post_comment/6027/all | debri-dv | com | В 2006 г. проект «Дебри-ДВ» был создан как эле… |
| http://omsk.vselennaya-shluh.com/prostitutka-e | vselennaya-shluh | com | Магнитогорск |
| http://topwiz.ru/companies/produkty-0?region=336 | topwiz | ru | Покровское |
| http://flexer.ru/stat/sect.php?r=71 | flexer | ru | www.cdstudio.ru 574 Аудио- и видеопродукция, н… |
| http://www.avtovzglyad.ru/author/olga-grekova/ | avtovzglyad | ru | Столица возвращается к радиальной системе план… |
| https://habr.com/post/354774/ | habr | com | Использованные инструменты |
| https://mir-auto.net/category/byd-f3/offset120 | mir-auto | net | Петрово |
| http://chel-oblsud.ru/index.php?html=inspectio | chel-oblsud | ru | Непроцессуальные обращения в суд |
| https://forum.aing.ru/viewtopic.php?f=5&t=1394 | aing | ru | На пару секунд задумался: «Не я же инициатор в… |
| http://www.skiff-impex.ru/index.php?categoryID | skiff-impex | ru | На главную |
| http://moselservis.ru/katalog/internet-magazin | moselservis | ru | Мы в социальных сетях |
| https://ovaciya-krasnodar.ru/the-news/315-goro | ovaciya-krasnodar | ru | • 27 - "Зиме поем мы песни. |
| http://tipslife.ru/56227-в-ла%D | tipslife | ru | К счастью, обошлось без серьезных последствий. |
| http://barnaul.msboy.ru/catalog/motornye_masla | msboy | ru | Витебск |
| http://upsape.ru/rerayting-onlayn/ | upsape | ru | Copyright 2018 , Как стать копирайтером? |
| https://priem.s-vfu.ru/priemnaya-kampaniya-201 | s-vfu | ru | 21.05.04 Горное дело (Открытые горные работы; … |


You can see that there are a lot of short texts from website navigation, but they are easily filtered by length.
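
For reference, here is a minimal pandas sketch of such a filter. The feather file name and the exact thresholds are just placeholders, and the sentence column is assumed to be called sentence, as in the sample above:

import pandas as pd

df = pd.read_feather('wet_sentences_0.feather')  # hypothetical name of one output file
# keep only sentences that look like real text, e.g. at least 40 characters and 5 words
mask = (df['sentence'].str.len() >= 40) & (df['sentence'].str.split().str.len() >= 5)
df = df[mask].reset_index(drop=True)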


Sentence length distribution on a small sample


I have no intention whatsoever of checking this scientifically, but my rough guess is that 1-5% of the websites are dedicated to pornography / prostitution (just a guess from looking at some random data samples).

Some more insights into the CC structure

This table shows, for each top-level domain zone, how many unique URLs with Russian texts ended up in our final table and how many unique WET files those URLs are spread across. You can see that all popular domain zones are distributed evenly across all of the 71,520 WET files.

| tld | url count | wet dcount |
|---|---|---|
| com | 12,899,615 | 71,520 |
| ru | 7,888,988 | 71,520 |
| net | 3,832,065 | 71,520 |
| info | 3,295,949 | 71,520 |
| by | 3,061,347 | 71,520 |
| org | 2,557,377 | 71,520 |
| kz | 1,415,778 | 71,520 |
| biz | 592,528 | 71,494 |
| pro | 469,202 | 71,374 |
| me | 451,564 | 71,162 |
| club | 245,570 | 68,796 |
| pl | 186,901 | 65,671 |
| online | 157,605 | 62,649 |
| lv | 147,173 | 61,112 |
| cc | 140,585 | 59,818 |
| name | 137,457 | 59,939 |
| az | 134,906 | 56,795 |
| md | 128,719 | 58,751 |
| eu | 127,226 | 56,783 |
| kg | 104,835 | 53,391 |
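
The aggregation itself is trivial; assuming the index dataframe has columns along the lines of url, tld and wet_path (illustrative names, not necessarily the ones used in the actual scripts), it boils down to something like:

import pandas as pd

df = pd.read_feather('cc_index_ru_0.feather')  # hypothetical name of one of the saved index files
stats = (df.groupby('tld')
           .agg(url_count=('url', 'nunique'),        # unique URLs per domain zone
                wet_dcount=('wet_path', 'nunique'))  # distinct WET files these URLs live in
           .sort_values('url_count', ascending=False))
print(stats.head(20))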


This is the distribution of the number of times a given WET file appears in the above URL list

We can see that there are several distinct peaks, but obviously you should start your CC processing with the most frequent files. My guess is that they will contain much more relevant data, i.e. there is some underlying logic to the way the WET/WARC files are ordered domain-zone-wise, but it is not apparent to me yet.
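
A rough sketch of this prioritization, again assuming an illustrative wet_path column that holds the WET file each URL belongs to:

import pandas as pd

df = pd.read_feather('cc_index_ru_0.feather')           # hypothetical index file, as above
wet_priority = df['wet_path'].value_counts()            # how many target URLs each WET file contains
top_wet_files = wet_priority.head(1000).index.tolist()  # e.g. the ~1000 "fattest" files to process first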

Overall pipeline walk-through

python3 parse_cc_index.py # (1)
python3 save_cc_indexes.py # (2)
python3 prepare_wet_indexes.py # (3)
python3 process_wet_files.py # (4)


Pipeline explanation:

  1. The first script downloads all of the 299 CC index files and filters the URLs by language;
  2. (2) just takes these 299 files and saves them into 10 feather files for convenience;
  3. (3) calculates 2 things - a set of unique URLs for later use and a list of WET file URLs;
  4. (4) just goes through the WET files one by one (in a multi-processed fashion, of course), downloads them, parses and cleans them, splits the texts into sentences and saves the results into feather files (see the sketch below);
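
For illustration only, here is a heavily simplified sketch of what step (4) boils down to. It uses the warcio package rather than the WARC library from the references, and the function below is a placeholder, not the code from process_wet_files.py:

import requests
from warcio.archiveiterator import ArchiveIterator

def process_wet_file(wet_url, target_urls):
    """Stream one WET file and return (url, text) pairs for the URLs we care about."""
    texts = []
    resp = requests.get(wet_url, stream=True)
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        # WET files store the plain extracted text in 'conversion' records
        if record.rec_type != 'conversion':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        if url in target_urls:
            text = record.content_stream().read().decode('utf-8', errors='ignore')
            texts.append((url, text))
    return texts

In the real script the returned texts are additionally cleaned, split into sentences and dumped to feather files, and a function like this is mapped over a pool of worker processes.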

Further optimization, processing speed

Obviously it will differ hugely depending on your Internet connection and CPU power, but in our case:

  • Downloading the whole index took approximately 1 week at an average speed of around 300-500 kb/s. It was done in the background at the office with no hurry;
  • Downloading and processing ~1,000 WET files (the ones with the largest Russian content ratio) took ~15 hours and produced ~25-30 GB of texts on 3 physical cores of my Intel® Core™ i7-6800K CPU @ 3.40GHz. So I guess it is safe to say that 1 day roughly equals a 40-50 GB corpus on half of my home PC. Also, for this task the bandwidth was almost a non-issue in my case;

How the script can be improved:

  • Download files in advance and / or add a queue to collect the data and a separate queue to post-process it. This would let you utilize your bandwidth and CPU power at 100% of their capacity at all times. I personally decided not to invest in this, as it is easier to just wait a bit; also, my CPU will not suddenly grow in size. A rough sketch of the idea is below;
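
A very rough sketch of that idea; the function bodies are placeholders, and in practice you would also want to bound how many downloaded files are kept in memory:

import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor

import requests

def download_wet(wet_url):
    # I/O-bound part: just fetch the compressed WET file
    return requests.get(wet_url).content

def parse_wet(raw_bytes):
    # CPU-bound part: parse / clean / sentence-split the payload (placeholder)
    return len(raw_bytes)

def run_pipeline(wet_urls, n_download_threads=8, n_workers=3):
    results = []
    # downloads run in background threads while the process pool keeps the CPU cores busy
    with ThreadPoolExecutor(n_download_threads) as downloader, mp.Pool(n_workers) as pool:
        downloaded = downloader.map(download_wet, wet_urls)
        for parsed in pool.imap(parse_wet, downloaded):
            results.append(parsed)
    return results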

References

  1. Previous articles in this mini-series:
    1. Parsing Wikipedia in 4 plain commands in python;
    2. Previous article about parsing the Common Crawl;
    3. A gist with the scripts;
  2. A list of useful Common Crawl starter links:
    1. http://commoncrawl.org/connect/blog/
    2. http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
    3. https://www.slideshare.net/RobertMeusel/mining-a-large-web-corpus
  3. Getting links to WET files;
  4. Reanimated python3 WARC file library;