Wikipedia data dump to XML conversion

Intention

  1. Pure XML format of the documents in the Wikipedia collection for easier textual processing of its data.
  2. Simpler scheme serving the needs of information retrieval and friends.
  3. One file per document for parallel and incremental processing and easier debugging.
  4. Crystallize the relations that are important for textual information processing but hard to extract from the original dump format, for example the heading-to-cell relations in tables.
  5. Open a discussion and share efforts.

Example XML plain document

This example XML document illustrates the output generated by the conversion from the original dump.

XML tag summary

A summary of all tag paths appearing in the English Wikipedia collection, with an example and some statistics, can be found here. Unfortunately, a bug in the calculation of the df (document frequency) makes it always 1. The analysis still gives you an overview of the tag paths appearing in the converted content. A schema will be provided in the future.

Example calls

You have to get a Wikipedia dump from here. To extract only articles and redirects, call strusWikimediaToXml with the option -n 0, which restricts the extraction to namespace 0 documents (articles).

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
bunzip2 enwiki-latest-pages-articles.xml.bz2
mkdir xml
strusWikimediaToXml -I -B -n 0 -P 10000 -t 12 enwiki-latest-pages-articles.xml xml

If you want to resolve page links to redirect pages, you can run the program twice: first with option -R <redirectfile> and then with option -L <redirectfile>. In redirect collection mode (option -R specified), no converted XML documents are written and the program runs single-threaded.

strusWikimediaToXml -n 0 -P 10000 -R ./redirects.txt enwiki-latest-pages-articles.xml xml
strusWikimediaToXml -I -B -n 0 -P 10000 -t 12 -L ./redirects.txt enwiki-latest-pages-articles.xml xml
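
The two passes above can also be scripted. The following is a minimal Python sketch, not part of the project: the helper names build_two_pass_commands and run_two_pass and their parameters are hypothetical, and only options documented in the usage are used. It assembles the two argument lists and runs them in order.

```python
import subprocess

TOOL = "strusWikimediaToXml"


def build_two_pass_commands(dump, outdir, redirects, threads=12, progress=10000):
    """Return the argument lists for the redirect pass and the conversion pass.

    Hypothetical helper: it only assembles the options shown in the calls above.
    """
    # Pass 1: collect redirects only (-R); no XML documents are written.
    collect = [TOOL, "-n", "0", "-P", str(progress), "-R", redirects, dump, outdir]
    # Pass 2: convert with <threads> conversion threads and load the link file (-L).
    convert = [TOOL, "-I", "-B", "-n", "0", "-P", str(progress),
               "-t", str(threads), "-L", redirects, dump, outdir]
    return collect, convert


def run_two_pass(dump, outdir, redirects, threads=12):
    """Run both passes in order and stop at the first failing call."""
    for argv in build_two_pass_commands(dump, outdir, redirects, threads):
        subprocess.run(argv, check=True)
```

Calling run_two_pass("enwiki-latest-pages-articles.xml", "xml", "./redirects.txt") would issue the same two calls as shown above, assuming strusWikimediaToXml is on the PATH.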

The option -I for the conversion generates more than one attribute with the same name per tag. For example, a table cell may look like <cell id="C1" id="R2"> if called with -I. Unfortunately, this is not valid XML, so generic XML parsers may reject it. Without -I the same tag will be printed as <cell id="C1,R2">.
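
To illustrate the difference, here is a small Python sketch (an illustration only, not part of strus). It reads the comma-separated form written without -I and shows that the standard library parser rejects the repeated attribute written with -I. The sample cell content and the helper name cell_ids are made up for the example.

```python
import xml.etree.ElementTree as ET


def cell_ids(cell_xml):
    """Split the comma-separated 'id' attribute of a cell into single ids."""
    cell = ET.fromstring(cell_xml)
    return [ref for ref in cell.get("id", "").split(",") if ref]


# Output without -I: one attribute holding all cell references.
print(cell_ids('<cell id="C1,R2">some text</cell>'))

# Output with -I: the repeated attribute makes the parser raise an error.
try:
    ET.fromstring('<cell id="C1" id="R2">some text</cell>')
except ET.ParseError as err:
    print("rejected by the parser:", err)
```

If you process the converted documents with a generic XML library rather than with strus, running the conversion without -I and splitting the attribute as sketched here is probably the simpler route.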

Resources needed

You need less than 8 GB RAM. Conversion on an Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz with 12 threads, with strusWikimediaToXml called with option -t 12:

Command being timed: "strusWikimediaToXml -I -B -n 0 -t 12 -L ./redirects.txt enwiki-latest-pages-articles.xml doc"
        User time (seconds): 10381.37
        System time (seconds): 405.87
        Percent of CPU this job got: 764%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 23:30.65
        Maximum resident set size (kbytes): 2682604
        Exit status: 0
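
As a plausibility check of these figures: the reported CPU percentage is the total CPU time (user plus system) divided by the elapsed wall clock time. A short Python calculation with the numbers above:

```python
user_s = 10381.37             # user time in seconds
system_s = 405.87             # system time in seconds
elapsed_s = 23 * 60 + 30.65   # 23:30.65 (m:ss) converted to seconds

# Average number of cores kept busy over the whole run.
cpu_share = (user_s + system_s) / elapsed_s
print(f"average busy cores: {cpu_share:.2f}")
print(f"CPU percentage (truncated): {int(cpu_share * 100)}%")
```

The result agrees with the 764% reported above, so on average between seven and eight cores were busy during the run.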

Program

The conversion program is part of the project strusWikipediaSearch.

Usage: strusWikimediaToXml [options] <inputfile> [<outputdir>]
<inputfile>   :File to process or '-' for stdin
<outputdir>   :Directory where output files and directories are written to.
options:
    -h           :Print this usage
    -V           :Verbosity level 1 (output document title and errors to stderr)
    -VV          :Verbosity level 2 (output lexems found additional to level 1)
    -S <doc>     :Select processed documents containing <doc> as title sub string
    -B           :Beautified readable XML output
    -P <mod>     :Print progress counter modulo <mod> to stderr
    -D           :Write dump files always, not only in case of an error
    -K <filename>:Write dump file to file <filename> before processing it.
    -t <threads> :Number of conversion threads to use is <threads>
                  Total number of threads is <threads> +1
                  (conversion threads + main thread)
    -n <ns>      :Reduce output to namespace <ns> (0=article)
    -I           :Produce one 'id' attribute per table cell reference,
                  instead of one with the ids separated by commas (e.g. id='C1,R2').
                  One 'id' attribute per table cell reference is non valid XML,
                  but you should use this format if you process the XML with strus.
    -R <lnkfile> :Collect redirects only and write them to <lnkfile>
    -L <lnkfile> :Load link file <lnkfile> for verifying page links

Description

Besides the <docid>.xml files, the following files are written:

Output XML tags

The tag hierarchy is intended, on a best-effort basis, to be as flat as possible. The following list explains the tags in the output:

Structural XML tags embedding a structure

Structural XML tags describing links

Textual XML tags (tags marking a content)

Processing the data for information retrieval