Strus document analyzer configuration

Language grammar

The following grammar (in EBNF) formally defines the configuration language used by the strus utilities (strusUtilities) for describing document analysis.


Comments start with # and extend to the end of the line. A # can be used as part of a symbol if it occurs inside a single or double quoted string.
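For example, a definition with comments (the tag and type names are illustrative):

        # define the document title as attribute
        title = orig content /doc/title();   # a trailing comment is possible too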

Handling of spaces

Spaces, control characters and line breaks have no meaning in the language.

Case sensitivity/insensitivity

Keywords and identifiers referring to elements in the storage are case insensitive, as are the function names of tokenizers and normalizers. The case sensitivity of selection expressions depends on the segmenter's target language. For example, a selector for XML has case sensitive expressions because XML is case sensitive.
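For example, with an XML segmenter the following two definitions are equivalent, because keywords and function names are case insensitive, while the selection expression must match the XML tags exactly (the tag names are illustrative):

        [SearchIndex]
        stem = lc word /doc/para/text();
        STEM = LC WORD /doc/para/text();   # same as above; but /DOC/PARA/text() would not match <para>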


IDENTIFIER     : [A-Za-z][A-Za-z0-9_]*
STRING         : <single or double quoted string with backslash escaping>
EXPRESSION     : <string format depending on segmenter>
MIMETYPE       : <MIME type definition string with semicolon separated definitions (content and encoding)>
PRGFILENAME    : <Name of a program to load (e.g. a program with patterns to match)>
MODULEID       : <Identifier or string identifying a module to use>
config         = configsection config
               | configsection
configsection  = "[" "Attribute" "]" attrdeflist
               | "[" "MetaData" "]" attrdeflist
               | "[" "SearchIndex" "]" featdeflist
               | "[" "ForwardIndex" "]" featdeflist
               | "[" "PatternLexem" "]" lexemdeflist
               | "[" "Aggregator" "]" aggdeflist
               | "[" "Document" "]" docdeflist
               | "[" "Content" "]" contentdeflist
               | "[" "PatternMatch" MODULEID "]" prgdeflist
docdeflist     = docdef docdeflist
               | docdef
docdef         = type "=" selector ";"
contentdeflist = contentdef contentdeflist
               | contentdef
contentdef     = MIMETYPE selector ";"
aggdeflist     = aggdef aggdeflist
               | aggdef
aggdef         = metadataelem "=" functioncall ";"
metadataelem   = IDENTIFIER
featdeflist    = featdef featdeflist
               | featdef
featdef        = type "=" normalizer tokenizer [ "{" posbindoptlist "}" ] selector ";"
attrdeflist    = attrdef attrdeflist
               | attrdef
attrdef        = type "=" normalizer tokenizer selector ";"
lexemdeflist   = lexemdef lexemdeflist
               | lexemdef
lexemdef       = type "=" normalizer tokenizer selector ";"
posbindoptlist = posbindopt posbindoptlist
               | posbindopt
posbindopt     = "position" "=" ( "succ" | "pred" )
type           = IDENTIFIER
prgdef         = type "=" PRGFILENAME ";"
prgdeflist     = prgdef prgdeflist
               | prgdef
normalizer     = functioncall ":" normalizer
               | functioncall
tokenizer      = functioncall
functioncall   = functionname "(" argumentlist ")"
               | functionname
functionname   = IDENTIFIER
argumentlist   = argument "," argumentlist
               | argument
argument       = IDENTIFIER
               | STRING
selector       = EXPRESSION

Meaning of the sections


[Document]

The declarations in this section are sub document definitions (used for multipart documents).
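For example, a collection file containing multiple articles, each processed as its own sub document (the tag names are illustrative):

        [Document]
        article = /collection/article;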


[Content]

The declarations in this section are sub content definitions that use a different segmenter to process embedded document content (e.g. JSON embedded in XML).
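For example, to process a JSON annotation embedded in an XML document with a JSON segmenter (the tag name and character set encoding are illustrative):

        [Content]
        "application/json; charset=UTF-8" /doc/annotation();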


[Attribute]

The declarations in this section are document attribute value definitions.


[MetaData]

The declarations in this section are document meta data element value definitions.
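For example, a meta data element taken from a document tag (the tag name and element name are illustrative; the element has to be declared in the storage meta data table):

        [MetaData]
        date = orig content /doc/date();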


[SearchIndex]

The declarations in this section are feature definitions to put into the search (inverted) index.


[ForwardIndex]

The declarations in this section are feature definitions to put into the forward index.


[PatternLexem]

The declarations in this section are lexem definitions that are not inserted into the index. They are only used to feed post-processing pattern matchers with lexems.
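For example, lexems of type 'word' that are passed to a pattern matcher but not inserted into any index (the tag names are illustrative):

        [PatternLexem]
        word = lc word /doc/para/text();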


[PatternMatch <moduleid>]

The declarations in this section define pattern matching programs. The pattern matcher module is selected with the argument <moduleid> in the section header. The strus core does not provide a pattern matcher; the standard pattern matcher with the name "std" is implemented in the module "analyzer_pattern" of the project strusPattern.
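For example, loading a program for the standard pattern matcher "std" (the type name and program file name are hypothetical):

        [PatternMatch std]
        city = "cities.rul";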


[Aggregator]

The declarations in this section are meta data definitions whose value is calculated by a function called after all other document analysis steps. The function is meant to aggregate statistical values of the document: it gets the resulting indexed document as argument and returns the aggregated value, for example the count of elements of a specified type.
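For example, a meta data element 'doclen' assigned the number of 'stem' features counted in the indexed document:

        [Aggregator]
        doclen = count(stem);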

Meaning of the grammar elements


type

The type of the feature or the name of the element assigned to the result of this definition.


functionname

The name of the function identifying the tokenizer or normalizer.


selector

The selector expression defines which document segments are used to produce the resulting feature or element.


posbindopt

Options that steer ordinal position assignment. Two options are currently implemented:

position=succ => The feature does not get an ordinal position of its own; it is assigned the position of the next feature, or disappears if no such feature exists.
position=pred => The feature does not get an ordinal position of its own; it is assigned the position of the previous feature, or disappears if no such feature exists.
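For example, punctuation markers that should not occupy an ordinal position of their own, but be bound to the position of the following feature:

        [SearchIndex]
        punctuation = orig punctuation(en) {position=succ} /doc/para/text();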


The following example relies on the standard XML segmenter (based on the textwolf template library). The selection expressions are in a language resembling the abbreviated syntax of XPath, with the difference that a tag selection selects the tag itself rather than the whole subtree, and tag content selections are expressed with parentheses "()" instead of "::text()".

[Attribute]
        title = orig content /doc/title();

[SearchIndex]
        para = empty orig /doc/para;
        stem = convdia(en):stem(en):lc word /doc/title();
        stem = convdia(en):stem(en):lc word /doc/para/text();
        punctuation = orig punctuation(en) /doc/para/text();

[ForwardIndex]
        orig = orig word /doc/para/text();
        orig = orig word /doc/title();

[Aggregator]
        doclen = count(stem);