Wikipedia search on a NUC with Strus
We run a full-text search engine on the complete English Wikipedia collection (without citations, but with the contents of tables) as a demo project. The machine we use is an Intel NUC (NUC6i3SYK with 32GB RAM and a 256GB SSD).
Picture of the machine
Why use a NUC for a demo system?
With new-generation SSDs, non-volatile memory units are grouped closer to the CPU cores of modern servers. The hardware of a NUC is conceptually close to a single node of such a server. Because Strus is scalable, we can use the NUC to make some predictions about how Strus will perform on real servers.
The scripts buildword2vec.sh and buildstorages.sh in the scripts directory of the strusWikipediaSearch project are needed for building the Wikipedia storage for retrieval. They have to be adapted for your use. We suggest using a stronger machine than a NUC for building the data and the storages. On an Intel NUC the whole process of building the data and the storages takes roughly 10 days (4 days NLP + 4 days insert + 2 days word2vec and some other helpers). This is substantially longer than the 5 1/2 hours needed in a previous version.
To build the data for the Wikipedia search, the following steps have to be done:
- Collect all link relations of documents into a file. (script buildword2vec.sh)
- Run NLP (with help of the NLTK package for python) and create the input for word2vec. (script buildword2vec.sh)
- Run word2vec and insert the resulting vectors and the associated named entities into a storage and build all relations needed (grouping vectors into concept classes). (script buildword2vec.sh)
- Calculate the page weights. (script buildword2vec.sh) Page weights are used in the first-pass query to find the most relevant documents for title link extraction.
- Build rules for pattern matching to recognize multi-part named entities in the documents and the query. (script buildword2vec.sh)
- Analyze and insert the documents. (script buildstorages.sh)
- Assign weights to pages. (script buildstorages.sh)
- Patch some structures in the storage. (script buildstorages.sh) This is a hack for removing unwanted features that were inserted. It will be replaced by a proper solution in the future.
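To make the link-relation and page-weight steps above more concrete, here is a minimal Python sketch. It is not the method used by buildword2vec.sh: the actual formula is not described here, so the sketch assumes a simple weight based on the number of incoming links, log-damped and normalized to the range 0..1. The function name page_weights and the (source, target) pair format are hypothetical.

```python
import math
from collections import Counter


def page_weights(links):
    """Derive a weight per page from (source_title, target_title) link pairs.

    Assumption for illustration only: weight = log(1 + inlinks),
    normalized by the largest such value. Strus may compute page
    weights differently.
    """
    inlinks = Counter()
    pages = set()
    for src, dst in links:
        pages.add(src)
        pages.add(dst)
        if src != dst:  # a page linking to itself is not counted
            inlinks[dst] += 1

    max_count = max(inlinks.values(), default=0)
    if max_count == 0:
        # No usable links at all: every page gets weight 0.
        return {page: 0.0 for page in pages}

    norm = math.log1p(max_count)
    return {page: math.log1p(inlinks[page]) / norm for page in pages}


if __name__ == "__main__":
    # Tiny hypothetical link file content.
    demo_links = [("A", "B"), ("C", "B"), ("B", "A"), ("A", "A")]
    for title, weight in sorted(page_weights(demo_links).items()):
        print(title, round(weight, 3))
```

The log damping keeps a few heavily linked pages from dominating the weights, which is one plausible design choice when the weights are only used to pre-select candidate documents.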