Part-of-speech tagging of the Wikipedia collection for information retrieval

Intention

The part-of-speech tagging of the Wikipedia collection for the Strus search engine addresses the needs of a search engine only. It is neither a general-purpose approach nor the start of one; an attempt to use this POS tagging as a base for other purposes will most likely fail. The POS tagging implemented here aims to distinguish verbs, adverbs, adjectives, nouns, and entities, and it tries to resolve personal pronoun references. Nothing more.

Tags

The following list describes the tags assigned to words or sequences of words in a document. The base document structure is the one described here. The POS tags are assigned to the content of '<text>' tags. The assignment of the tags is flat: no structures are described with the tags. Every element or sequence of elements is uniquely tagged by POS tags. A POS tag is one of the following:

The tag names deliberately do not correspond to the Penn Treebank tag names, because that would be misleading. The reduction to a small set of tags is intentional here.
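To illustrate the kind of reduction described above, the following sketch collapses Penn Treebank tags into a small tag set. The reduced tag names ('V', 'N', 'A', 'E') and the exact mapping are assumptions for illustration only; the actual tag set used by the Strus conversion is defined in the list above.

```python
# Illustrative sketch only: the reduced tag names and the mapping below are
# assumptions, not the actual tags used by the Strus conversion.
PENN_TO_REDUCED = {
    # verbs
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    # common nouns
    "NN": "N", "NNS": "N",
    # proper nouns, treated here as entity candidates
    "NNP": "E", "NNPS": "E",
    # adjectives and adverbs
    "JJ": "A", "JJR": "A", "JJS": "A",
    "RB": "A", "RBR": "A", "RBS": "A",
}

def reduce_tag(penn_tag: str) -> str:
    """Map a Penn Treebank tag to the reduced set; anything else stays untagged."""
    return PENN_TO_REDUCED.get(penn_tag, "")
```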

Conversion

The enrichment of the English Wikipedia collection with POS tags is part of the conversion process from the Wikimedia format, but it is separated from it. The basic POS tagging was done with a script using the output of spaCy and some heuristics to resolve entities referenced by partial names or personal pronouns.
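The following is a hedged sketch of the kind of heuristic mentioned above for resolving personal pronouns: a pronoun is linked to the most recently seen entity. The real strusnlp.py script is more elaborate (partial-name matching among other things); the function name, the 'E' tag marker, and the pronoun list here are assumptions for illustration.

```python
# Sketch of a last-seen-entity heuristic for pronoun resolution.
# All names and tags here are illustrative assumptions, not the actual
# implementation in strusnlp.py.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def resolve_pronouns(tokens):
    """tokens: list of (word, tag) pairs, with tag 'E' marking an entity.
    Returns (word, tag, referenced_value) triples: a pronoun gets the most
    recently seen entity as its referenced value, everything else gets ''."""
    last_entity = ""
    out = []
    for word, tag in tokens:
        ref = ""
        if tag == "E":
            last_entity = word
        elif word.lower() in PRONOUNS and last_entity:
            ref = last_entity
        out.append((word, tag, ref))
    return out
```

For example, in the token sequence ("Einstein", "E"), ("published", "V"), ("He", ""), the pronoun "He" is resolved to the entity "Einstein".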

Conversion script

The following shell function (an excerpt from scripts/install_data.sh) shows the conversion for one directory of the XML converted from the original dump. One such directory contains up to 1000 Wikipedia articles.

processPosTagging() {
    # [0] Some variable initializations
    # DID = sub directory id
    DID=$1
    # NLPCONV = script doing the conversion of a file with multiple text dumps into a file with multiple structure dumps,
    #	with lines of 3 elements separated by tabs: (type, value, referenced value)
    NLPCONV=$SCRIPTPATH/strusnlp.py
    # Make output deterministic
    export PYTHONHASHSEED=123

    # [1] Call a strus program to scan the Strus Wikipedia XML generated in the previous step from the Wikimedia dump.
    #	the program creates a text dump in /srv/wikipedia/pos/$DID.txt with all the selected contents as input for the
    #	POS tagging script.
    strusPosTagger -I -x xml -C XML -D '; ' -X '//pagelink@id' -Y '##' -e '//pagelink()' -e '//weblink()' -e '//text()' -e '//attr()' -e '//char()' -e '//math()' -e '//code()' -e '//bibref()' -E '//mark' -E '//text' -E '//entity' -E '//attr' -E '//attr~' -E '//quot' -E '//quot~' -E '//pagelink' -E '//weblink' -E '//tablink' -E '//citlink' -E '//reflink' -E '//tabtitle' -E '//head' -E '//cell' -E '//bibref' -E '//time' -E '//char' -E '//code' -E '//math' -p '//heading' -p '//table' -p '//citation' -p '//ref' -p '//list' -p '//cell~' -p '//head~' -p '//heading~' -p '//list~' -p '//br' /srv/wikipedia/xml/$DID /srv/wikipedia/pos/$DID.txt
    EC="$?"
    if [ "$EC" != "0" ]; then
        echo "Error creating POS tagger input: $EC" > /srv/wikipedia/err/$DID.txt
    fi

    # [2] Call the POS tagging script with the text dumps in /srv/wikipedia/pos/$DID.txt and write the output to /srv/wikipedia/tag/$DID.txt
    cat /srv/wikipedia/pos/$DID.txt | $NLPCONV -S -C 100 > /srv/wikipedia/tag/$DID.txt
    EC="$?"
    if [ "$EC" != "0" ]; then
        echo "Error in POS tagger script: $EC" > /srv/wikipedia/err/$DID.txt
    fi

    # [3] Merge the output of the POS tagging script with the original XML in /srv/wikipedia/xml/$DID/
    #	and write a new XML file with the same name into /srv/wikipedia/nlpxml/$DID/
    strusPosTagger -x ".xml" -C XML -e '//pagelink()' -e '//weblink()' -e '//text()' -e '//attr()' -e '//char()' -e '//math()' -e '//code()' -e '//bibref()' -o /srv/wikipedia/nlpxml/$DID /srv/wikipedia/xml/$DID /srv/wikipedia/tag/$DID.txt
    EC="$?"
    if [ "$EC" != "0" ]; then
        echo "Error tagging XML with POS tagger output: $EC" > /srv/wikipedia/err/$DID.txt
    fi
}
The function is called with a sub directory name consisting of 4 digits. The following example shows such a call:
processPosTagging 1234
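The intermediate files exchanged between steps [1] and [3] hold, per line, the three tab-separated fields mentioned in the script comments (type, value, referenced value). A minimal sketch of reading one such line; field semantics are taken from the comments above, everything beyond that is an assumption:

```python
# Minimal reader for one line of the tab-separated dump format described in
# the script comments: (type, value, referenced value). Trailing fields may
# be absent; this padding behavior is an assumption for illustration.
def parse_dump_line(line):
    """Split a dump line into (type, value, referenced_value);
    missing trailing fields become empty strings."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (3 - len(fields))
    return tuple(fields[:3])
```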

Complexity

The conversion of the English Wikipedia collection on an Intel(R) Core(TM) i7-6800K CPU at 3.40GHz with 64 GB RAM and a GTX 1060 GPU is expected to take about 20 days (a number extrapolated from the conversion of 10% of the collection). Detailed numbers will follow.
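The 20-day figure follows from a linear extrapolation of the partial run. Assuming, for illustration, that the 10% sample took about 2 days (a figure consistent with the estimate above, not a measured value):

```python
# Linear extrapolation of total conversion time from a partial run.
# The 2-day sample duration is an assumption chosen to match the quoted
# 20-day estimate, not a measurement.
def extrapolate_days(sample_days: float, sample_fraction: float) -> float:
    """Estimate total duration assuming conversion time scales linearly."""
    return sample_days / sample_fraction
```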

Example Output: XML document with POS tagging

This example XML document illustrates the output generated by the POS tagging process from the plain XML. There are still bugs to fix, but the results are starting to be usable.