Developing a fast Indexing and Full text Search Engine with Whoosh: A Pure-Python Library

Problem Statement: Simply put, you have 1 million text files in a directory, and your application must answer text queries across all of them within a few seconds (say ~1-2 seconds). How would you develop such a system?

Motivation: The idea came from my previous post “Performing OCR by running parallel instances of Tesseract 4.0 : Python“. That said, here are some use cases where you may have to build such a search engine on top of another application:

  • You have built an OCR app and converted millions of images into text files. You may want to build a search engine over the converted text files to search the contents of the images.
  • You have built a speech-to-text system that converts thousands of audio recordings into text data. You may want to search the contents of the audio in real time.

Here is a video demonstration of a desktop app developed in Qt, with a Whoosh Python implementation working in the back end.

Introduction: Whoosh

Some of you might have heard of “Lucene”, a popular search engine library written entirely in Java. You may find a Python wrapper for Lucene, but if you are looking for a similar pure-Python library, “Whoosh” is the one. Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites.

The Whoosh PyPI package can be installed with pip:
pip install Whoosh

For the example demonstrated in this blog post, you can download a dataset of 70,000 text files taken from simple wiki articles from here.

1. Creating Indexed Data: Whoosh

It is easy to index all your text files with Whoosh. First, the schema of the index has to be defined. The schema lists the fields to be indexed or stored for each text file, much like defining the columns of a database table. A field is a piece of information for each document in the index, such as its title or text content. Indexing a field means it can be searched; a field is returned with the search results only if it is declared with stored=True in the schema. You only need to define the schema once, when creating the index.

Finally, all the text documents are added to the index writer in a loop. Each document must be added with fields that match the schema. Below is the Python implementation for indexing all the text documents in a directory.

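import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
import sys

def createSearchableData(root):
    '''
    Schema definition: title(name of file), path(as ID), content(indexed
    but not stored),textdata (stored text content)
    '''
    schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
                    content=TEXT, textdata=TEXT(stored=True))
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")

    # Creating an index writer to add documents as per schema
    ix = create_in("indexdir", schema)
    writer = ix.writer()

    filepaths = [os.path.join(root, i) for i in os.listdir(root)]
    for path in filepaths:
        # Read the file; the with block closes it even if reading fails
        with open(path, 'r') as fp:
            text = fp.read()
        # os.path.basename works on any OS, unlike splitting on "\\"
        writer.add_document(title=os.path.basename(path), path=path,
                            content=text, textdata=text)
    # Nothing is searchable until the writer commits
    writer.commit()

root = "corpus"
createSearchableData(root)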

2. Querying Indexed Data : Whoosh

Querying indexed data has two important parts that you should look at.

Query String : It is passed while searching the indexed data. A query string can be a single word, a phrase to be matched exactly, multiple words joined with ‘AND’, multiple words joined with ‘OR’, etc. For example:

Query : politics (returns if the word occurs)
Query : sports OR games OR play (returns if any one of the strings occurs)
Query : alpha beta gamma (returns if a document contains all strings)
Query : "alpha beta gamma" (returns if all strings occur together, as a phrase, in a document).

Scoring : Each document is ranked according to a scoring function. Whoosh supports quite a few scoring functions:

  1. Frequency : It simply returns the number of times the query terms occur in the document. It does not perform any normalization or weighting.
  2. Tf-Idf scores : It returns the tf * idf score of each document. To know more read here or wiki page here.
  3. BM25F scoring : It is the default ranking function used by Whoosh. BM stands for best matching. It is based on tf-idf along with a bunch of factors such as the length of the document in words and the average length of documents in the collection. It also has free parameters K1 = 1.2 and B = 0.75. To read more check here.
  4. Cosine scoring : It is useful for finding documents similar to your search query.

There are a few more scoring algorithms which have been implemented. Check here to know more.
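To make the BM25 idea more concrete, here is a minimal pure-Python sketch of the textbook Okapi BM25 score for a single query term. It is not Whoosh's own code: Whoosh's BM25F also combines several fields and may compute the idf part differently. The function name and the sample numbers are made up for illustration; only the values 1.2 and 0.75 come from the text above.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook Okapi BM25 contribution of one query term to one document."""
    # idf: terms found in fewer documents weigh more.
    # The "+ 1" inside the log keeps the value non-negative.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Length normalization: b controls how strongly documents that are
    # longer than average get penalized.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)

# Same term frequency, but one document is shorter than average and one longer
short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=3, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
# Same short document, but the term occurs in 900 of the 1000 documents
common_term = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=900)

print(short_doc, long_doc, common_term)
```

With these made-up numbers the short document scores higher than the long one, and a term that appears in almost every document contributes very little. That is the behaviour the plain Frequency scoring lacks.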

Below is the python implementation for searching a query in the indexed database.

import sys
from whoosh.qparser import QueryParser
from whoosh import scoring
from whoosh.index import open_dir

ix = open_dir("indexdir")

# query_str is query string
query_str = sys.argv[1]
# Top 'n' documents as result
topN = int(sys.argv[2])

with ix.searcher(weighting=scoring.Frequency) as searcher:
    query = QueryParser("content", ix.schema).parse(query_str)
    results = searcher.search(query, limit=topN)
    # Read the hits inside the with block: the searcher is closed after it.
    # Loop over results directly, since there may be fewer than topN hits.
    for hit in results:
        print(hit['title'], str(hit.score), hit['textdata'])

The QueryParser class of Whoosh implements a query language very similar to Lucene’s.

3. Glossary

Below are the basic terms you will come across in discussions involving searching and indexing documents (taken from the Whoosh docs).

Corpus
The set of documents you are indexing.

Documents
The individual pieces of content you want to make searchable. The word “documents” might imply files, but the data source could really be anything – articles in a content management system, blog posts in a blogging system, chunks of a very large file, rows returned from an SQL query, individual email messages from a mailbox file, or whatever. When you get search results from Whoosh, the results are a list of documents, whatever “documents” means in your search engine.

Fields
Each document contains a set of fields. Typical fields might be “title”, “content”, “url”, “keywords”, “status”, “date”, etc. Fields can be indexed (so they’re searchable) and/or stored with the document. Storing the field makes it available in search results. For example, you typically want to store the “title” field so your search results can display it.

Forward index
A table listing every document and the words that appear in the document. Whoosh lets you store term vectors that are a kind of forward index.

Indexing
The process of examining documents in the corpus and adding them to the reverse index.

Postings
The reverse index lists every word in the corpus, and for each word, a list of documents in which that word appears, along with some optional information (such as the number of times the word appears in that document). These items in the list, containing a document number and any extra information, are called postings. In Whoosh the information stored in postings is customizable for each field.

Reverse Index
Basically a table listing every word in the corpus, and for each word, the list of documents in which it appears. It can be more complicated (the index can also list how many times the word appears in each document, the positions at which it appears, etc.) but that’s how it basically works.

Schema
Whoosh requires that you specify the fields of the index before you begin indexing. The Schema associates field names with metadata about the field, such as the format of the postings and whether the contents of the field are stored in the index.

Term vector
A forward index for a certain field in a certain document. You can specify in the Schema that a given field should store term vectors.
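The difference between a forward index, a reverse index and postings is easier to see in a toy example. The sketch below is plain Python and has nothing to do with how Whoosh stores its data internally; the tokenizer, the function names and the two-document corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()

def build_indexes(corpus):
    """Return (forward, reverse) indexes for a {doc_id: text} corpus."""
    forward = {}                 # doc_id -> {term: frequency}
    reverse = defaultdict(list)  # term -> postings [(doc_id, frequency), ...]
    for doc_id, text in corpus.items():
        counts = Counter(tokenize(text))
        forward[doc_id] = dict(counts)
        for term, freq in counts.items():
            # Each (doc_id, frequency) pair is one posting for this term
            reverse[term].append((doc_id, freq))
    return forward, dict(reverse)

corpus = {
    1: "whoosh is a search library",
    2: "search search and index",
}
forward, reverse = build_indexes(corpus)

print(reverse["search"])  # postings for the term "search"
print(forward[2])         # terms of document 2 with their counts
```

Answering a query for “search” is then a single dictionary lookup in the reverse index instead of a scan over every document, which is why searching an index stays fast as the corpus grows.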

At the End

Hope it was an easy read and a good heads-up to start with. So, what more can we do from here?

  1. Searching for similar documents instead of exact term searches only.
  2. Exploring hierarchical search in a file system.
  3. Correcting errors in queries (“Did you mean…?”).
  4. Searching N-grams for fast “search as you type” functionality.

Update: Readers can download a backup of the Qt application from the link below. It is old work and I don’t have a working app now. You may have to figure out the code in order to reproduce it. Good luck with that.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy machine learning 🙂

34 thoughts on “Developing a fast Indexing and Full text Search Engine with Whoosh: A Pure-Python Library”

  1. Nice weblog here! Additionally your site quite a bit up very fast!
    What web host are you the use of? Can I get your affiliate link in your host?
    I desire my web site loaded up as quickly as yours lol

  2. getting error: File “C:\Users\1patilha\AppData\Local\Continuum\anaconda3\lib\site-packages\whoosh\”, line 515, in __init__
    raise LockError

    • Can you check that you have administrator rights or permission to access files (try starting cmd or Spyder with admin rights) ?

      You can run simple whoosh example to ensure that whoosh is working fine.

      • I have data in pdf, doc, docx , ppt and excel in 4 folders , I want to use whoose for indexing how I can read it. could you help me.

      • I am trying to run the code using Jupyter or Spyder but got stuck on locked (running with admin privileges). The weird thing is that yesterday it worked even though the txt file was not read correctly. Now I don’t even get the index created.
        I run the first half to create the index folder and the schema without issues.
        Any idea?

        Thank you very much

  3. Hi, thanks a lot for this tutorial it really useful, could you please share with us a link for your app developed in QT?

  4. Hello, I get this error
    File “”, line 25, in createSearchableData
    writer.add_document(title=path.split(“\\”)[1], path=path,\
    IndexError: list index out of range
    is there any process needed to do before installing Whoosh so it wont produce this error?

    • Examples of both are there in video as well as blog-post.
      TopN is number of top results you want eg TopN = 5
      query_str is string you want to search for. eg query_str = “football fifa”

  5. File “”, line 13
    query = QueryParser(“content”, ix.schema).parse(query_str)
    IndentationError: expected an indented block

    What to do with it?

  6. Hello,

    Thank you for your wonderful blog post on Whoosh. I’m kinda new to this but just to make sure: indexdir = the directory where my text/pdf are stored? (i’m i getting this right?). My issue here is that when i run this:

    # query_str is query string
    query_str = sys.argv[1]
    #Top ‘n’ documents as result
    topN = int(sys.argv)[2]

    then i got this error:

    File “”, line 8, in
    query_str = sys.argv[1]

    IndexError: list index out of range

    is it because we are not using the same dataset?


    p.s an introduction for implementing python scripts to QT please? 🙂

  7. Indeed a good work. Really appreciate. just a curiosity, Is this based on natural language processing? Or is NLP involved here Or entirely based native Python capabilities?

    • Glad that you found it good.

      Whoosh-based search relies on indexing the documents: a reverse index maps every term to the documents that contain it. This indexing technique makes search very fast.
      Ranking of docs is based on a scoring function such as BM25F.

      So, there is no machine learning model or AI model trained/applied here but text processing/manipulation is definitely involved.

      • Hi Abhijeet,
        This implementation demonstrate how we can extract the specific text document.
        Is it possible that if we can extract specific paragraph in which specific search keywords are present?
        Is this implementation works for pdf files?
        Actually I am working on 50-60K pdf files and I have to create search engine to extract the specific paragraph in which the search keywords are present.
        Please guide me.

  8. Dost, i am getting this error

    ~\AppData\Local\conda\conda\envs\Tensorflow\lib\site-packages\whoosh\ in doc_count(self)
    637 def doc_count(self):
    638 if self.is_closed:
    –> 639 raise ReaderClosed
    640 return self._perdoc.doc_count()


    Can you please help me, am using your same code and changed the root to a local folder with some 3 notepad text documents in them


    • The front end of the app shown in the blog post was developed in Qt (C++). The Python code was simply run in Spyder.

      To start with, you can use Spyder. PyCharm is also a good option. Sometimes I just use a text editor like Atom for writing code and run it in the terminal. Sometimes Python notebooks are very useful.

  9. Hello,
    Why do you need content and textdata on schema?

    schema = Schema(title=TEXT(stored=True),path=ID(stored=True),\

    • The schema defines all the fields we want in the index, such as title, path, content text etc.
      Whoosh builds a fast, searchable reverse index from the indexed fields and saves it.

      That is basically why the search is fast: a query looks up terms in the index instead of scanning the raw text of every document.

  10. Hi,
    can you guide me where to store the txt files,
    is that i need to store in the ‘indexdir’ folder or any other folder?

