Strus document analyzer configuration

Language grammar

The following grammar (in EBNF) defines the configuration language that the strus utilities (strusUtilities) use to describe document analysis.

Comments

Comments start with # and extend to the end of the line. A # can be used as part of a symbol if it appears inside a single- or double-quoted string.
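The comment rule can be illustrated with a short Python sketch. This is not the Strus lexer: the function name strip_comment is hypothetical, and the sketch assumes that a backslash inside a quoted string escapes the following character, as the STRING rule of the grammar suggests.

```python
def strip_comment(line: str) -> str:
    """Return line without its trailing '#' comment; '#' inside quotes is kept."""
    quote = None  # the quote character of the string we are inside, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # assumption: backslash escapes the next character
                continue
            if ch == quote:
                quote = None  # closing quote
        elif ch in ("'", '"'):
            quote = ch  # opening quote
        elif ch == "#":
            return line[:i].rstrip()  # comment starts here
        i += 1
    return line.rstrip()


# The '#' inside the quoted argument survives, the trailing comment does not.
# "f" and "g" are placeholder function names, not Strus functions.
cleaned = strip_comment('x = f("#") g /doc/x(); # a remark')
```

A real lexer would apply the same rule while tokenizing rather than line by line; the sketch only shows how quoting protects the # character.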

Handling of spaces

Spaces, control characters, and line ends serve only to separate tokens and otherwise have no meaning in the language.

Case sensitivity/insensitivity

Keywords and identifiers that refer to elements in the storage are case insensitive, as are the function names of tokenizers and normalizers. The case sensitivity of selection expressions depends on the segmenter's target language. For example, a selector for XML has case-sensitive expressions because XML is case sensitive.

EBNF

IDENTIFIER     : [A-Za-z][A-Za-z0-9_]*
STRING         : <single or double quoted string with backslash escaping>
EXPRESSION     : <string format depending on segmenter>
MIMETYPE       : <MIME type definition string with semicolon separated definitions (content and encoding)>
PRGFILENAME    : <Name of a program to load (e.g. a program with patterns to match)>
MODULEID       : <Identifier or string identifying a module to use>
config         = configsection config
               |
               ;
configsection  = "[" "Attribute" "]" attrdeflist
               | "[" "MetaData" "]" attrdeflist
               | "[" "SearchIndex" "]" featdeflist
               | "[" "ForwardIndex" "]" featdeflist
               | "[" "PatternLexem" "]" lexemdeflist
               | "[" "Aggregator" "]" aggdeflist
               | "[" "Document" "]" docdeflist
               | "[" "Content" "]" contentdeflist
               | "[" "PatternMatch" MODULEID "]" prgdeflist
               ;
docdeflist     = docdef docdeflist
               |
               ;
docdef         = type "=" selector ";"
               ;
contentdeflist = contentdef contentdeflist
               |
               ;
contentdef     = MIMETYPE selector ";"
               ;
aggdeflist     = aggdef aggdeflist
               | 
               ;
aggdef         = metadataelem "=" functioncall ";"
               ;
metadataelem   = IDENTIFIER ;
featdeflist    = featdef featdeflist
               |
               ;
featdef        = type "=" normalizer tokenizer [ "{" posbindoptlist "}" ] selector ";"
               ;
attrdeflist    = attrdef attrdeflist
               |
               ;
attrdef        = type "=" normalizer tokenizer selector ";"
               ;
lexemdeflist   = lexemdef lexemdeflist
               |
               ;
lexemdef       = type "=" normalizer tokenizer selector ";"
               ;
posbindoptlist = posbindopt posbindoptlist
               |
               ;
posbindopt     = "position" "=" ( "succ" | "pred" )
               ;
type           = IDENTIFIER ;
prgdef         = type "=" PRGFILENAME ";"
               ;
prgdeflist     = prgdef prgdeflist
               |
               ;
normalizer     = functioncall ":" normalizer
               | functioncall
               ;
tokenizer      = functioncall
               ;
functioncall   = functionname "(" argumentlist ")"
               | functionname
               ;
functionname   = IDENTIFIER ;
argumentlist   = argument "," argumentlist
               | argument
               |
               ;
argument       = IDENTIFIER
               | STRING
               ;
selector       = EXPRESSION ;
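
As an illustration of the normalizer and functioncall rules, the following Python sketch parses a normalizer chain such as convdia(en):stem(en):lc into a list of function names with their arguments. It is a reading aid under stated assumptions, not the parser used by Strus: the names tokenize, peek, parse_functioncall and parse_normalizer are hypothetical, function names are folded to lower case because they are case insensitive, and nothing is implied about the order in which Strus applies the chained functions.

```python
import re

# Lexical tokens needed for a normalizer chain: IDENTIFIER, STRING, punctuation.
TOKEN = re.compile(r"""
    \s*(?:
        (?P<ident>[A-Za-z][A-Za-z0-9_]*)
      | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
      | (?P<punct>[():,])
    )""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, value) pairs; kind is 'ident', 'string' or 'punct'."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input near offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def peek(tokens, i):
    """Return token i, or (None, None) past the end of the input."""
    return tokens[i] if i < len(tokens) else (None, None)


def parse_functioncall(tokens, i):
    """functioncall = functionname [ "(" argumentlist ")" ]"""
    kind, name = peek(tokens, i)
    if kind != "ident":
        raise ValueError("function name expected")
    i += 1
    args = []
    if peek(tokens, i) == ("punct", "("):
        i += 1
        while peek(tokens, i) != ("punct", ")"):
            kind, value = peek(tokens, i)
            if kind == "string":
                # Drop the quotes and resolve backslash escapes.
                value = re.sub(r"\\(.)", r"\1", value[1:-1])
            elif kind != "ident":
                raise ValueError("argument expected")
            args.append(value)
            i += 1
            if peek(tokens, i) == ("punct", ","):
                i += 1
            elif peek(tokens, i) != ("punct", ")"):
                raise ValueError("',' or ')' expected")
        i += 1  # consume ")"
    # Function names are case insensitive, so fold them to lower case.
    return (name.lower(), args), i


def parse_normalizer(text):
    """normalizer = functioncall { ":" functioncall }"""
    tokens = tokenize(text)
    call, i = parse_functioncall(tokens, 0)
    chain = [call]
    while peek(tokens, i) == ("punct", ":"):
        call, i = parse_functioncall(tokens, i + 1)
        chain.append(call)
    if i != len(tokens):
        raise ValueError("unexpected trailing input")
    return chain


chain = parse_normalizer("convdia(en):stem(en):lc")
```

Here chain holds one (name, arguments) pair per function call in the order written. A tokenizer is a single functioncall and could be parsed with parse_functioncall alone.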

Meaning of the sections

Document

The declarations in this section define sub-documents (for multipart documents).

Content

The declarations in this section define sub-contents that are processed with a different segmenter than the enclosing document content (e.g. JSON embedded in XML).

Attribute

The declarations in this section are document attribute value definitions.

MetaData

The declarations in this section are document meta data element value definitions.

SearchIndex

The declarations in this section are feature definitions to put into the search (inverted) index.

ForwardIndex

The declarations in this section are feature definitions to put into the forward index.

PatternLexem

The declarations in this section are lexem definitions that are not inserted into the index. They are only used to feed post-processing pattern matchers with lexems.

PatternMatch

The declarations in this section define pattern matching programs. The pattern matcher module is selected with the MODULEID argument of the section header. The core contains no pattern matcher. The standard pattern matcher, named "std", is implemented in the module "analyzer_pattern" of the project strusPattern.

Aggregator

The declarations in this section are meta data definitions whose value is calculated by a function called after all other document analysis steps. The function is meant to aggregate statistical values of the document: it receives the resulting indexed document as its argument and returns the aggregated value, for example the count of elements of a specified type.
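
The idea of an aggregator can be sketched in Python. The representation of the analyzed document as a dictionary with a "searchindex" list of (type, value, position) tuples is an assumption made for this illustration, as is the helper name count_aggregator; the sketch does not reproduce the Strus implementation of any aggregator function.

```python
def count_aggregator(feature_type):
    """Build an aggregator that counts the features of one type in a document."""
    wanted = feature_type.lower()  # identifiers are case insensitive

    def aggregate(document):
        # Runs after analysis: sees the finished document, returns one value.
        return sum(1 for ftype, _value, _pos in document["searchindex"]
                   if ftype.lower() == wanted)

    return aggregate


# Hypothetical analysis result for a very short document.
document = {
    "searchindex": [
        ("stem", "hello", 1),
        ("stem", "world", 2),
        ("punctuation", ".", 3),
    ],
    "metadata": {},
}

# Corresponds in spirit to a declaration of the form: doclen = count(stem);
document["metadata"]["doclen"] = count_aggregator("stem")(document)
```

The point of the design is that the aggregator runs last, so it can see every feature the other sections have produced.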

Meaning of the grammar elements

type

Feature type or element name assigned to the result of this definition.

functionname

Name of the function that identifies this tokenizer or normalizer.

selector

The selector expression defines what document segments are used to produce the resulting feature or element.

posbindopt

Options that steer the assignment of ordinal positions. Two options are currently implemented:

position=succ: The feature does not get an ordinal position of its own but is assigned the following position, or it disappears if there is none.

position=pred: The feature does not get an ordinal position of its own but is assigned the previous position, or it disappears if there is none.
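
The effect of the two options can be sketched in Python. The sketch assumes that every feature is known by its offset in the source and that features without a position option receive consecutive ordinal positions in offset order; the tuple layout and the function name assign_positions are hypothetical, and the actual assignment inside Strus may differ in detail.

```python
import bisect


def assign_positions(features):
    """features: (offset, type, value, bind) tuples, bind is None, 'succ' or 'pred'.

    Returns (position, type, value) tuples; bound features without a
    neighbouring position are dropped.
    """
    features = sorted(features, key=lambda f: f[0])
    # Offsets of features that get an ordinal position of their own.
    # The offset at index n gets ordinal position n + 1.
    own = sorted({off for off, _t, _v, bind in features if bind is None})
    result = []
    for off, ftype, value, bind in features:
        if bind is None:
            idx = bisect.bisect_left(own, off)
        elif bind == "succ":
            idx = bisect.bisect_right(own, off)      # first position after off
        elif bind == "pred":
            idx = bisect.bisect_left(own, off) - 1   # last position before off
        else:
            raise ValueError(f"unknown position option: {bind}")
        if 0 <= idx < len(own):
            result.append((idx + 1, ftype, value))
        # otherwise there is no position to bind to and the feature disappears
    return result


positions = assign_positions([
    (0, "para", "", "succ"),           # takes the position of "hello"
    (5, "stem", "hello", None),
    (11, "stem", "world", None),
    (16, "punctuation", ".", "pred"),  # takes the position of "world"
    (20, "para", "", "succ"),          # nothing follows, so it is dropped
])
```

In a configuration, such a binding would be written between the tokenizer and the selector, following the featdef rule, for example {position=pred}.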

Example

The following example relies on the standard XML segmenter (based on the textwolf template library). The selection expressions are written in a language resembling the abbreviated syntax of XPath, with two differences: a tag selection selects the tag itself and not its subtree, and tag content selections are expressed with parentheses, as in /doc/title(), instead of "::text()".

[Attribute]
        title = orig content /doc/title();

[SearchIndex]
        para = empty orig /doc/para;
        stem = convdia(en):stem(en):lc word /doc/title();
        stem = convdia(en):stem(en):lc word /doc/para/text();
        punctuation = orig punctuation(en) /doc/para/text();

[ForwardIndex]
        orig = orig word /doc/para/text();
        orig = orig word /doc/title();

[Aggregator]
        doclen = count( stem);