- Go to the Common Crawl website;
- Download the index (~200 GB);
- Choose domains in your country / language (they now also have language detection =) );
- Download only the plain-text files you need (I would suspect that since websites are crawled alphabetically, single-language websites cluster into several hundred or thousand large files);
Parse the Common Crawl data in 2 plain Python commands with minimal external dependencies:
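The commands themselves are not inlined here, but the index-filtering half can be sketched roughly as follows. This is a sketch under assumptions: the CDX-style line format (`SURT timestamp {json}`) and the `languages` / `filename` field names are taken from the Common Crawl index format, not from this post.

```python
import json

def filter_index_lines(lines, lang="rus"):
    """Yield (url, warc_path) for index entries in the target language.

    Assumes CDX-style index lines: 'SURT timestamp {json-metadata}'.
    """
    for line in lines:
        try:
            _, _, payload = line.strip().split(" ", 2)
            meta = json.loads(payload)
        except ValueError:
            continue  # skip malformed lines
        if lang in meta.get("languages", ""):
            yield meta["url"], meta["filename"]
```

In practice you would stream each gzipped index file (e.g. via gzip.open) through a filter like this and persist only the matching WARC paths.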
Both the parsing part and the processing part take just a couple of minutes per index file / WET file - the bulk of the “compute” lies in actually downloading these files.
Essentially, if you have some time to spare and an unlimited Internet connection, all of this processing can be done on a single powerful machine. You can get fancy and rent some Amazon server(s) to minimize the download time, but that can be costly.
In my experience, parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is spent downloading (my speed averaged ~300-500 KB/s).
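As a quick sanity check on those numbers (assuming the ~200 GB index size and a ~400 KB/s average speed, the midpoint of the range above):

```python
# Back-of-the-envelope check: download time dominates everything else.
index_size_kb = 200 * 1024 * 1024   # ~200 GB expressed in KB
avg_speed_kb_s = 400                # midpoint of the 300-500 KB/s range
hours = index_size_kb / avg_speed_kb_s / 3600
print(round(hours))  # ~146 hours, consistent with the ~140 reported
```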
In case you have not seen the previous post in this series, about mining Wikipedia for an NLP corpus in 4 commands in Python, check it out.
In a nutshell, sometimes you just need to collect a web corpus without investing a lot of time into engineering. Or you need to develop an NLP application, but your language or domain does not readily provide any prepared corpora / data.
When I began thinking about implementing something like this, I thought it would be prohibitively difficult and / or expensive - after all, you do not see CC pop up very often in research / blogs / news nowadays (I guess this is partially due to copyright issues, or to the fact that the biggest NLP research is sponsored by Google / FAIR, who can afford to collect their own data).
But it turns out it is not. This can be attributed to the effort that has been made to make CC more accessible. The killer feature for me was their index, weighing in at only ~200 GB, which also features language detection, i.e. you do not need to analyze top-level domains or do any significant data mining.
It is thoroughly explained in the CC blog, but basically you have 3 main types of files:
- Index files (200 GB total, ~299 files);
- WARC files with HTML and everything else (~68 TB total, ~71k files);
- WET files with plain text only (~7 TB, ~71k files).
As you may have guessed, index files contain links to WARC files and some metadata (url, language, timestamp, etc. - better see for yourself). But wait - for NLP purposes we do not need the links / schema.org / HTML - we just need the text (for the most generic applications anyway)!
Turns out, even though it is not explicitly mentioned in their docs (or I just did not find it), you can amend the WARC file URL slightly to get a WET file.
It’s easy to derive the location of a WET (or WAT) file given a WARC file:
- /wet/ in the path;
- .wet before the suffix.
Literally, that is it!
But one more problem arises: CC also provides us with offsets for WARC files, but not for WET files. It turns out this is also not a problem - a WET file is essentially a WARC file without the tags, so you can use this library. But do not confuse it with a library with a similar name on PyPI - the one on PyPI (at least now) is an obsolete old one. The authors of the new library do not provide a ready package yet.
Anyway, in my case I just cloned it into a subfolder and it just worked (with very minor tweaks to the source code):
git clone https://github.com/erroneousboat/warc3.git
This library enables you to parse WARC / WET files very quickly, so essentially you can skim through all the headers in no time and save data only for the urls that you need. In my case these were Russian websites.
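If you would rather not depend on the cloned library, the WET format is simple enough to skim with the stdlib alone. The sketch below is a simplified reader under the assumption of well-formed records (a "WARC/1.0" line, header lines, a blank line, then Content-Length bytes of plain text); real files are gzip-concatenated, so the library is still the more robust choice.

```python
import gzip

def iter_wet_records(path):
    """Yield (headers, text) for each record in a gzipped WET file.

    Simplified sketch: assumes each record is a 'WARC/1.0' line,
    header lines, a blank line, then Content-Length bytes of text.
    """
    with gzip.open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:                      # end of file
                return
            if line.strip() != b"WARC/1.0":   # skip blank separators
                continue
            headers = {}
            while True:
                line = f.readline()
                if line.strip() == b"":       # blank line ends headers
                    break
                key, _, value = line.decode("utf-8", "replace").partition(":")
                headers[key.strip()] = value.strip()
            length = int(headers.get("Content-Length", 0))
            text = f.read(length).decode("utf-8", "replace")
            yield headers, text
```

To keep only the websites you care about, check headers.get("WARC-Target-URI") against your url list before saving the text.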
- Previous post link;
- A gist with the scripts;
- A list of useful Common Crawl starter links:
- Getting links to WET files;
- Reanimated python3 warc file library;