Part-of-speech tagging of the Wikipedia collection for information retrieval

Intention

The part-of-speech tagging of the Wikipedia collection for the Strus search engine addresses the needs of a search engine only. It is neither a general-purpose approach nor the start of one; an attempt to use this POS tagging as a base for other purposes will most likely fail. The POS tagging implemented here aims to distinguish verbs, adverbs, adjectives, nouns, and entities, and it tries to resolve personal pronoun references. Nothing more.

Tags

The following list describes the tags assigned to words or sequences of words in a document. The base document structure is the one described here. The POS tags are assigned to the content of '<text>' tags. The assignment of the tags is flat: no structures are described with the tags. Every element or sequence of elements is uniquely tagged by POS tags. A POS tag is one of the following:

The tag names deliberately do not correspond to the Penn Treebank tag names, because that would be misleading. The reduction to a small set of tags is intentional here.
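To illustrate the kind of reduction described above, the following sketch collapses Penn Treebank tags into a small tag set. The reduced tag names ('V', 'N', 'A', 'E') and the exact mapping are assumptions for illustration only; the actual tag set used by the Strus conversion is defined in the list above.

```python
# Illustrative sketch only: the reduced tag names and the mapping below are
# assumptions, not the actual tags used by the Strus conversion.
PENN_TO_REDUCED = {
    # verbs
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    # common nouns
    "NN": "N", "NNS": "N",
    # proper nouns, treated here as entity candidates
    "NNP": "E", "NNPS": "E",
    # adjectives and adverbs
    "JJ": "A", "JJR": "A", "JJS": "A",
    "RB": "A", "RBR": "A", "RBS": "A",
}

def reduce_tag(penn_tag: str) -> str:
    """Map a Penn Treebank tag to the reduced set; anything else stays untagged."""
    return PENN_TO_REDUCED.get(penn_tag, "")
```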

Conversion

The enrichment of the English Wikipedia collection with POS tags is part of the conversion process from the Wikimedia format, but it is separated from it. The basic POS tagging was done with a script using the output of spaCy and some heuristics to resolve entities referenced by partial names or personal pronouns.
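The following is a hedged sketch of the kind of heuristic mentioned above for resolving personal pronouns: a pronoun is linked to the most recently seen entity. The real strusnlp.py script is more elaborate (partial-name matching among other things); the function name, the 'E' tag marker, and the pronoun list here are assumptions for illustration.

```python
# Sketch of a last-seen-entity heuristic for pronoun resolution.
# All names and tags here are illustrative assumptions, not the actual
# implementation in strusnlp.py.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def resolve_pronouns(tokens):
    """tokens: list of (word, tag) pairs, with tag 'E' marking an entity.
    Returns (word, tag, referenced_value) triples: a pronoun gets the most
    recently seen entity as its referenced value, everything else gets ''."""
    last_entity = ""
    out = []
    for word, tag in tokens:
        ref = ""
        if tag == "E":
            last_entity = word
        elif word.lower() in PRONOUNS and last_entity:
            ref = last_entity
        out.append((word, tag, ref))
    return out
```

For example, in the token sequence ("Einstein", "E"), ("published", "V"), ("He", ""), the pronoun "He" is resolved to the entity "Einstein".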

Conversion script

The following shell function (an excerpt from scripts/install_data.sh) shows the conversion for one directory of the XML converted from the original dump. One such directory contains up to 1000 Wikipedia articles.

processPosTagging() {
    # [0] Some variable initializations
    # DID = sub directory id
    DID=$1
    # NLPCONV = script doing the conversion of a file with multiple text dumps into a file with multiple structure dumps,
    #	with lines of 3 elements separated by tabs: (type, value, referenced value)
    NLPCONV=$SCRIPTPATH/strusnlp.py
    # Make output deterministic
    export PYTHONHASHSEED=123

    # [1] Call a strus program to scan the Strus Wikipedia XML generated in the previous step from the Wikimedia dump.
    #	the program creates a text dump in /srv/wikipedia/pos/$DID.txt with all the selected contents as input for the
    #	POS tagging script.
    strusPosTagger -I -x xml -C XML -D '; ' -X '//pagelink@id' -Y '##' -e '//pagelink()' -e '//weblink()' -e '//text()' -e '//attr()' -e '//char()' -e '//math()' -e '//code()' -e '//bibref()' -E '//mark' -E '//text' -E '//entity' -E '//attr' -E '//attr~' -E '//quot' -E '//quot~' -E '//pagelink' -E '//weblink' -E '//tablink' -E '//citlink' -E '//reflink' -E '//tabtitle' -E '//head' -E '//cell' -E '//bibref' -E '//time' -E '//char' -E '//code' -E '//math' -p '//heading' -p '//table' -p '//citation' -p '//ref' -p '//list' -p '//cell~' -p '//head~' -p '//heading~' -p '//list~' -p '//br' /srv/wikipedia/xml/$DID /srv/wikipedia/pos/$DID.txt
    EC="$?"
    if [ "$EC" != "0" ]; then
        echo "Error creating POS tagger input: $EC" > /srv/wikipedia/err/$DID.txt
    fi

    # [2] Call the POS tagging script with the text dumps in /srv/wikipedia/pos/$DID.txt and write the output to /srv/wikipedia/tag/$DID.txt
    cat /srv/wikipedia/pos/$DID.txt | $NLPCONV -S -C 100 > /srv/wikipedia/tag/$DID.txt
    EC="$?"
    if [ "$EC" != "0" ]; then
        echo "Error in POS tagger script: $EC" > /srv/wikipedia/err/$DID.txt
    fi

    # [3] Merge the output of the POS tagging script with the original XML in /srv/wikipedia/xml/$DID/
    #	and write a new XML file with the same name into /srv/wikipedia/nlpxml/$DID/
    strusPosTagger -x ".xml" -C XML -e '//pagelink()' -e '//weblink()' -e '//text()' -e '//attr()' -e '//char()' -e '//math()' -e '//code()' -e '//bibref()' -o /srv/wikipedia/nlpxml/$DID /srv/wikipedia/xml/$DID /srv/wikipedia/tag/$DID.txt
    EC="$?"
    if [ "$EC" != "0" ]; then
        echo "Error tagging XML with POS tagger output: $EC" > /srv/wikipedia/err/$DID.txt
    fi
}
The function is called with a sub directory name consisting of 4 digits. The following example shows such a call:
processPosTagging 1234
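The intermediate files exchanged between steps [1] and [3] hold, per line, the three tab-separated fields mentioned in the script comments (type, value, referenced value). A minimal sketch of reading one such line; field semantics are taken from the comments above, everything beyond that is an assumption:

```python
# Minimal reader for one line of the tab-separated dump format described in
# the script comments: (type, value, referenced value). Trailing fields may
# be absent; this padding behavior is an assumption for illustration.
def parse_dump_line(line):
    """Split a dump line into (type, value, referenced_value);
    missing trailing fields become empty strings."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (3 - len(fields))
    return tuple(fields[:3])
```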

Complexity

The conversion of the English Wikipedia collection on an Intel(R) Core(TM) i7-6800K CPU at 3.40GHz with 64 GB RAM and a GTX 1060 GPU is expected to take about 20 days (a number extrapolated from the conversion of 10% of the collection). Detailed numbers will follow.
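The 20-day figure follows from a linear extrapolation of the partial run. Assuming, for illustration, that the 10% sample took about 2 days (a figure consistent with the estimate above, not a measured value):

```python
# Linear extrapolation of total conversion time from a partial run.
# The 2-day sample duration is an assumption chosen to match the quoted
# 20-day estimate, not a measurement.
def extrapolate_days(sample_days: float, sample_fraction: float) -> float:
    """Estimate total duration assuming conversion time scales linearly."""
    return sample_days / sample_fraction
```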

Example Output: XML document with POS tagging

This example XML document illustrates the output generated by the POS tagging process from the plain XML. There are still bugs to fix, but the results are starting to be usable.