Part of speech tagging of the Wikipedia collection for information retrieval

Intention

The part-of-speech (POS) tagging of the Wikipedia collection for the Strus search engine addresses the needs of a search engine only. It is not a general-purpose approach, nor the start of one. Any attempt to use this POS tagging as a base for other purposes will most likely fail. The POS tagging implemented here aims to distinguish between verbs, adverbs and adjectives, nouns and entities, and it tries to resolve personal pronoun references. Nothing more.

Tags

The following list describes the tags assigned to words or sequences of words in a document. The base document structure is the one described here. The POS tags are assigned to the content of '<text>' tags. The assignment of tags is flat: the tags do not describe any nested structures. Every element or sequence of elements is tagged with exactly one POS tag. A POS tag is one of the following:

The tag names do not correspond to the Penn Treebank tag names because this would be misleading. The reduction to a small set of tags is intentional here.

Conversion

Enriching the English Wikipedia collection with POS tags is part of the conversion process of the Wikimedia format, but it is kept separate from it. The basic POS tagging was done with a script that uses the output of spaCy and some heuristics to resolve entities referenced by partial names or personal pronouns.
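As a rough illustration of the reduction to a small, flat tag set, the sketch below maps Universal POS labels (the kind of coarse label spaCy exposes as token.pos_) to a handful of categories and merges adjacent proper nouns into one entity span. The category names ("verb", "modifier", "noun", "entity", "pronoun", "other"), the mapping table and the function names are assumptions made for this example; they are not the tags or the script used by the project, and the pronoun and partial-name heuristics are not shown.

```python
# Hypothetical reduced tag set: these names are placeholders for illustration,
# not the tags actually assigned by the conversion script.
REDUCED_TAGS = {
    "VERB": "verb",
    "AUX": "verb",
    "ADV": "modifier",
    "ADJ": "modifier",
    "NOUN": "noun",
    "PROPN": "entity",
    "PRON": "pronoun",
}


def reduce_tag(universal_pos):
    """Map a Universal POS label to the (assumed) reduced tag set."""
    return REDUCED_TAGS.get(universal_pos, "other")


def tag_flat(tokens):
    """Assign one flat tag per word or word sequence.

    `tokens` is a list of (text, universal_pos) pairs, for example collected
    from a spaCy document as [(t.text, t.pos_) for t in doc].
    Adjacent proper nouns are merged into a single "entity" span; every other
    token forms its own span. The result has no nesting.
    """
    spans = []
    for text, pos in tokens:
        tag = reduce_tag(pos)
        if tag == "entity" and spans and spans[-1][0] == "entity":
            # Extend the previous entity span instead of opening a new one.
            spans[-1] = ("entity", spans[-1][1] + " " + text)
        else:
            spans.append((tag, text))
    return spans


if __name__ == "__main__":
    sentence = [
        ("Albert", "PROPN"),
        ("Einstein", "PROPN"),
        ("developed", "VERB"),
        ("the", "DET"),
        ("theory", "NOUN"),
    ]
    for tag, text in tag_flat(sentence):
        print(tag, text)
```

Taking plain (text, label) pairs as input keeps the sketch independent of any installed spaCy model; with spaCy available, the pairs could be produced from a parsed document as noted in the docstring.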

Complexity

The conversion of the English Wikipedia collection on an Intel(R) Core(TM) i7-6800K CPU at 3.40GHz with 64 GB RAM and a GTX 1060 GPU is expected to take about 20 days (a number extrapolated from the conversion of 10% of the collection). Detailed numbers will follow.

Example Output: XML document with POS tagging

This example XML document illustrates the output generated by the POS tagging process from the plain XML. There are still bugs to fix, but the results are starting to be usable.