Problem Statement: To simply put, You have 1 million text files in a directory and your application must cater text query search on all files within few seconds (say ~1-2 seconds). How will you develop such system !!.
Motivation: The idea came from my previous post “Performing OCR by running parallel instances of Tesseract 4.0 : Python“. Saying that following could be some use cases where you may have to build such search engine on top of other applications. e.g.
- You have built an OCR app and converted millions of images into text files. You may want to build a search engine over converted text files to search contents of images.
- You have built a speech to text system where you are converting thousands of recorded audios into text data. You may like to search contents of audio in real time.
Here is a video demonstration of an desktop app developed in QT. It is a whoosh python implementation working in back end.
Some of you might have heard about a popular java based library “Lucene” which is a search engine library written entirely in Java. You may find a python wrapper for Lucene. If you are looking for similar pythonic library, “Whoosh” is the one. Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites.
Whoosh pypi package can simply be installed with pip:
pip install Whoosh
For the example demonstrated in this blog-post, You can download a data-set of 70,000 text files which were taken from simple wiki articles from here.
1. Creating Indexed Data: Whoosh
It is easy to index all your text files with Whoosh. Initially, the schema of the index has to be defined. Schema defines list of fields to be indexed or stored for each text file. It’s similar to how we define it for database. A field is a piece of information for each document in the index, such as its title or text content. Indexing of a field means it can be searched and it is also returned with results if defined as argument (stored=True) in schema. You only need to create the schema once while creating the index.
Finally, all the text documents are added to index writer in loop. Documents are indexed as per schema and has to be added as per schema design. Below is the python implementation for indexing all the text documents of a directory.
import os from whoosh.index import create_in from whoosh.fields import Schema, TEXT, ID import sys def createSearchableData(root): ''' Schema definition: title(name of file), path(as ID), content(indexed but not stored),textdata (stored text content) ''' schema = Schema(title=TEXT(stored=True),path=ID(stored=True),\ content=TEXT,textdata=TEXT(stored=True)) if not os.path.exists("indexdir"): os.mkdir("indexdir") # Creating a index writer to add document as per schema ix = create_in("indexdir",schema) writer = ix.writer() filepaths = [os.path.join(root,i) for i in os.listdir(root)] for path in filepaths: fp = open(path,'r') print(path) text = fp.read() writer.add_document(title=path.split("\\"), path=path,\ content=text,textdata=text) fp.close() writer.commit() root = "corpus" createSearchableData(root)
2. Querying Indexed Data : Whoosh
Querying a indexed data has two important parts which you may like to look upon.
Query String : It is passed while searching the indexed data. Query string can be a single word, a single sentence to be matched exactly, multiple words with ‘AND’, multiple words with ‘OR’ etc. For examples –
Query : politics (returns if the word occurs)Query : sports OR games OR play (returns if any one of the strings occur)
Query : alpha beta gamma (return if a document contains all strings)
Query : “alpha beta gamma“ (returns if all strings occur together in a document).
Scoring : Each document is ranked according to a scoring function. There are quite a few types of scoring function supported by whoosh.
- Frequency : It simply returns the count of the terms occurred in the document. It does not perform any normalization or weighting.
- Tf-Idf scores : It returns tf * idf scores of each document. To know more read here or wiki page here.
- BM25F scoring : It is the by default ranking function used by whoosh. BM stands for best matching. It is based on tf-idf along with bunch of factors like length of document in words, average length of documents in the collection. It also has free parameters k = 1.2 and b = 0.75. To read more check here.
- Cosine scoring : It is useful for finding document similar to your search query.
There are few more scoring algorithms which has been implemented. Check here to know more.
Below is the python implementation for searching a query in the indexed database.
from whoosh.qparser import QueryParser from whoosh import scoring from whoosh.index import open_dir ix = open_dir("indexdir") # query_str is query string query_str = sys.argv # Top 'n' documents as result topN = int(sys.argv) with ix.searcher(weighting=scoring.Frequency) as searcher: query = QueryParser("content", ix.schema).parse(query_str) results = searcher.search(query,limit=topN) for i in range(topN): print(results[i]['title'], str(results[i].score), results[i]['textdata'])
QueryParser class of whoosh implements query language very similar to java Lucene’s.
Below are the basic terminologies you will always come across in discussions involving searching and indexing documents (taken from whoosh docs).
The set of documents you are indexing.
The individual pieces of content you want to make searchable. The word “documents” might imply files, but the data source could really be anything – articles in a content management system, blog posts in a blogging system, chunks of a very large file, rows returned from an SQL query, individual email messages from a mailbox file, or whatever. When you get search results from Whoosh, the results are a list of documents, whatever “documents” means in your search engine.
Each document contains a set of fields. Typical fields might be “title”, “content”, “url”, “keywords”, “status”, “date”, etc. Fields can be indexed (so they’re searchable) and/or stored with the document. Storing the field makes it available in search results. For example, you typically want to store the “title” field so your search results can display it.
A table listing every document and the words that appear in the document. Whoosh lets you store term vectors that are a kind of forward index.
The process of examining documents in the corpus and adding them to the reverse index.
The reverse index lists every word in the corpus, and for each word, a list of documents in which that word appears, along with some optional information (such as the number of times the word appears in that document). These items in the list, containing a document number and any extra information, are called postings. In Whoosh the information stored in postings is customizable for each field.
Basically a table listing every word in the corpus, and for each word, the list of documents in which it appears. It can be more complicated (the index can also list how many times the word appears in each document, the positions at which it appears, etc.) but that’s how it basically works.
Whoosh requires that you specify the fields of the index before you begin indexing. The Schema associates field names with metadata about the field, such as the format of the postings and whether the contents of the field are stored in the index.
A forward index for a certain field in a certain document. You can specify in the Schema that a given field should store term vectors.
At the End
Hope it was an easy read and good heads-up to start with. So, what can we do more from here:
- Searching for alike documents instead of exact term searches only.
- Exploring hierarchical search in file system.
- Correcting errors in the queries. Did you mean…?
- Search N-grams for getting fast, “search as you type” functionality
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.
Happy machine learning 🙂