Strus components

Introducing the components of strus

This section introduces the components of strus.

Key/value store database

A key/value store database stores blocks of data for fast retrieval by their key. The database is a separate component so that experts in this field can provide competing implementations for architectures with different requirements. The key/value store database has to implement an upper bound seek on keys to support the fast merge operations needed by the logical storage. Currently there is one implementation, based on LevelDB. (In fact, LevelDB was the main stimulus for me to write a search engine. I looked at it and noticed: "Eureka! It has an upper bound seek. With this I can write a search engine!")
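To illustrate what an upper bound seek means here, the following is a minimal sketch on a sorted in-memory key list. It assumes the term refers to positioning at the first key that is greater than or equal to the sought key, which is how LevelDB's iterator seek behaves. The class `SortedKeyValueStore`, the method `seek_upper_bound` and the key layout are hypothetical and not part of strus or LevelDB.

```python
import bisect


class SortedKeyValueStore:
    """Toy in-memory stand-in for an ordered key/value store (hypothetical, not the strus API)."""

    def __init__(self, items):
        pairs = sorted(items.items())
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def seek_upper_bound(self, key):
        """Return the first (key, value) pair whose key is >= the given key, or None."""
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys):
            return None
        return self._keys[i], self._values[i]


# Hypothetical key layout: term name followed by a zero-padded document number.
store = SortedKeyValueStore({
    b"term:apple:0007": b"a",
    b"term:apple:0042": b"b",
    b"term:pear:0003": b"c",
})

# There is no entry for document 10, so the seek lands on the next one (42).
hit = store.seek_upper_bound(b"term:apple:0010")
print(hit)
```

Because the seek skips directly to the next candidate key, merging several sorted lists does not have to visit every entry of every list.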

Storage

The storage provides interfaces to define the units to store for retrieval and for presentation of the search result. It allows you to define documents as numbered lists of atomic terms, content, attributes and metadata. For every document you can define user rights that restrict access to the document to specific users. The storage groups these definitions into blocks and tables that are stored for fast access in the underlying key/value store database.

Query evaluation

The query evaluation combines the occurrences of search terms into higher-level expressions according to a given query, ranks a set of selected documents according to the defined weighting schemes, and returns a list of documents with named attributes as the result. Query evaluation is defined with the help of functions of three different types:

Expression evaluation

Query evaluation is based on features represented as sets of postings. These sets are built from basic terms and from expressions over basic terms. The operators that build expressions from terms are called posting join operators. Many posting join operators are defined in the core. A description of the built-in posting join operators can be found here (posting join operators).
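As an illustration of the idea, here is a minimal sketch of one possible join operator: the intersection of two posting sets at document level. It is not the strus implementation. The names `PostingIterator`, `skip_doc` and `join_intersect` are hypothetical, positions inside documents are ignored, and document numbers are assumed to be positive so that 0 can signal the end of a list.

```python
import bisect


class PostingIterator:
    """Iterator over a sorted list of positive document numbers (hypothetical helper)."""

    def __init__(self, docnos):
        self._docnos = sorted(docnos)

    def skip_doc(self, docno):
        """Return the smallest stored document number >= docno, or 0 if there is none."""
        i = bisect.bisect_left(self._docnos, docno)
        return self._docnos[i] if i < len(self._docnos) else 0


def join_intersect(a, b):
    """Return the document numbers that occur in both posting iterators."""
    result = []
    docno = 1
    while True:
        da = a.skip_doc(docno)
        if da == 0:
            break
        db = b.skip_doc(da)
        if db == 0:
            break
        if da == db:
            result.append(da)   # both iterators agree on this document
            docno = da + 1
        else:
            docno = db          # leap ahead to the next candidate
    return result


apple = PostingIterator([1, 4, 7, 9, 12])
pie = PostingIterator([2, 4, 9, 10, 12, 15])
both = join_intersect(apple, pie)
print(both)  # [4, 9, 12]
```

Each iterator only needs a seek to the next document number at or after a given one, which is the kind of operation an upper bound seek in the key/value store can serve.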

Weighting

Weighting accumulates a value as the weight of a document, based on a retrieval scheme (e.g. BM25, tf-idf, proximity weighting) and the occurrences of expressions in this document. A description of the built-in weighting functions can be found here (weighting functions).
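To make the notion of a weighting scheme concrete, the following sketch computes a BM25 weight for one document. It uses one common textbook variant of the formula (with a non-negative idf) and the usual default parameters k1 = 1.2 and b = 0.75. The function names and the statistics passed in are assumptions for illustration; the built-in strus weighting functions may differ in detail.

```python
import math


def bm25_term_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """BM25 contribution of a single term to the weight of one document.

    tf: occurrences of the term in the document
    df: number of documents containing the term
    """
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


def bm25_score(query_terms, doc_tf, doc_freq, doc_len, avg_doc_len, n_docs):
    """Accumulate the weights of all query terms for one document."""
    return sum(
        bm25_term_weight(doc_tf.get(t, 0), doc_freq[t], doc_len, avg_doc_len, n_docs)
        for t in query_terms
        if t in doc_freq
    )


# Made-up collection statistics, for illustration only.
doc_freq = {"search": 10, "engine": 50}
score = bm25_score(
    ["search", "engine"],
    doc_tf={"search": 2, "engine": 1},
    doc_freq=doc_freq,
    doc_len=100,
    avg_doc_len=100,
    n_docs=1000,
)
print(score)
```

A term that does not occur in the document contributes zero, and repeated occurrences raise the weight with diminishing returns.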

Summarization

Summarization extracts content elements, attributes or metadata from a matching document. As a result, summarizers return a set of weighted key/value pairs for the presentation of the result. Summarization can be used to show properties of the result to a user, as well as to extract data for feature selection in another iteration of query evaluation in the background (relevance feedback). A description of the built-in summarization functions can be found here (summarization functions).

Associated components of strus

Feeding a search engine requires some components that are not part of the core.

Analyzer

The analyzer (also called indexer in other information retrieval engines) exists as a separate project that is only loosely coupled to the strus core. The strusAnalyzer provides segmentation, tokenization and normalization to produce the atomic terms to insert into the storage, and to tokenize and normalize the phrases of a query in the same way. The analyzer uses the following components to do its job:

Segmenter

The segmenter splits a document of a certain format (XML, JSON, etc.) into content chunks defined by selection expressions. Currently there is only an implementation for XML, based on the textwolf library, that uses the abbreviated syntax of XPath as its selection language.

Tokenizer

A tokenizer splits a segment, or alternatively the join of all segments of a certain type, into tokens. The tokens are references to elements in the segments and do not modify them.
A description of the built-in tokenizer functions can be found here (tokenizer functions).
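The idea that tokens only reference their segment can be sketched as follows. This is an illustration under assumptions, not one of the built-in tokenizers: the function `tokenize_words` is hypothetical and represents each token as an (offset, length) pair into the unchanged segment.

```python
import re


def tokenize_words(segment):
    """Return (offset, length) pairs that reference the word tokens inside the segment."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"\w+", segment)]


segment = "Strus is a search engine"
tokens = tokenize_words(segment)
print(tokens)  # [(0, 5), (6, 2), (9, 1), (11, 6), (18, 6)]

# The token text is recovered by slicing the original, unmodified segment.
words = [segment[pos:pos + length] for pos, length in tokens]
print(words)   # ['Strus', 'is', 'a', 'search', 'engine']
```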

Normalizer

A normalizer maps a token to its normalized form, which is the term inserted into the storage.
A description of the built-in normalizer functions can be found here (normalizer functions).
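As a sketch of what such a mapping can look like, the following hypothetical normalizer lowercases a token and strips diacritical marks. It only illustrates the concept; the function name `normalize_token` and the chosen normalization steps are assumptions, not one of the built-in functions.

```python
import unicodedata


def normalize_token(token):
    """Map a token to a lowercase form without diacritical marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(normalize_token("Café"))    # cafe
print(normalize_token("ZÜRICH"))  # zurich
```

Applying the same normalizer to document tokens and to query phrases ensures that both sides produce matching terms.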

Aggregator

An aggregator aggregates some properties of a document into a single value. With aggregators you can, for example, define statistics as properties of the document and store them in the metadata table to use them for weighting in retrieval.
A description of the built-in aggregator functions can be found here (aggregator functions).
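A minimal sketch of the concept, assuming the analyzed document is available as a list of (term, position) pairs: two hypothetical aggregators compute the document length and the number of distinct terms, and the results are kept in a plain dictionary standing in for the metadata table. None of these names come from strus.

```python
def aggregate_term_count(terms):
    """Number of term occurrences in the analyzed document."""
    return len(terms)


def aggregate_distinct_terms(terms):
    """Number of different terms in the analyzed document."""
    return len({term for term, _position in terms})


# Hypothetical analyzer output: (normalized term, position in document).
analyzed = [("strus", 1), ("is", 2), ("a", 3), ("search", 4), ("engine", 5), ("search", 6)]

# Stand-in for the metadata stored with the document.
metadata = {
    "doclen": aggregate_term_count(analyzed),
    "distinct": aggregate_distinct_terms(analyzed),
}
print(metadata)  # {'doclen': 6, 'distinct': 5}
```

A value such as the document length is the kind of statistic that a weighting scheme like BM25 needs at query time.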

Expandability of strus

Strus offers various interfaces to hook in. The project strusModule provides a mechanism to load functions that extend the capabilities of the storage or the analyzer of strus.

strus core

You can extend the strus core with your own dynamically loadable modules containing functions written in C++:

Iterator join operators

You can define your own functions that create an iterator on postings representing the result of an n-ary join of iterators on postings.

Weighting functions

You can define your own document weighting functions used for ranking.

Summarizers

You can define your own summarization functions used for attributing the results.

strus analyzer

You can extend the strus analyzer with your own dynamically loadable modules containing functions written in C++:

Segmenters

You can define your own segmenters for the document formats you need to process.

Tokenizers

You can define your own tokenizers splitting the document segments into tokens.

Normalizers

You can define your own normalizer functions to produce the retrievable items from the document tokens for the storage and the query.

Aggregators

You can define your own aggregator functions to produce some statistical values from the document structure after analysis.

What is still missing in strus

Documentation

The documentation of strus and its associated components is still poor. I am currently working hard on it every day.

What is not part of strus

Several parts are outside the scope of strus. Here follows a list of parts you may miss and have to find elsewhere.

Crawler

A crawler (also called robot) that searches for documents on the internet or an intranet to feed the insert or update operations of the search index is not part of strus. There exist sophisticated solutions for different classes of document collections. The strusUtilities project contains a program that is able to insert all files of a directory in a filesystem, but nothing more.

Mapping of hierarchical ACL trees

In strus, user rights are attached to each document for each user allowed to see the document. ACLs are usually defined hierarchically, with exclusion and inclusion rules defined for a node and its descendants. For strus you have to calculate the transitive closure of all positively declared user rights and assign the resulting users to each document. In this model updates of user rights are awkward, but taking them into account for retrieval is fast.
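The flattening step could look like the following sketch. It assumes a simple rule model that is not prescribed by strus: every node may carry "allow" and "deny" lists, rules are applied from the root down to the document, and on the same node a deny wins over an allow. The function `effective_users` and the example tree are hypothetical.

```python
def effective_users(node, rules, parent_of):
    """Return the set of users allowed to see the node after applying inherited rules."""
    # Collect the path from the node up to the root.
    path = []
    while node is not None:
        path.append(node)
        node = parent_of.get(node)

    # Apply the rules from the root down, so that deeper rules override inherited ones.
    allowed = set()
    for n in reversed(path):
        rule = rules.get(n, {})
        allowed |= set(rule.get("allow", ()))
        allowed -= set(rule.get("deny", ()))
    return allowed


# Hypothetical hierarchy and rules.
parent_of = {
    "/": None,
    "/readme.txt": "/",
    "/hr": "/",
    "/hr/salaries.txt": "/hr",
}
rules = {
    "/": {"allow": ["alice", "bob"]},
    "/hr": {"allow": ["carol"], "deny": ["bob"]},
}

# Flat per-document user lists, the form in which rights would be assigned to documents.
documents = ["/readme.txt", "/hr/salaries.txt"]
acl = {doc: sorted(effective_users(doc, rules, parent_of)) for doc in documents}
print(acl)  # {'/readme.txt': ['alice', 'bob'], '/hr/salaries.txt': ['alice', 'carol']}
```

The sketch also shows why updates are awkward: changing a rule on an inner node means recomputing and reassigning the user lists of all documents below it.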