Problem Statement: Simply put, you have 1 million text files in a directory, and your application must serve text search queries over all the files within a few seconds (say ~1-2 seconds). How would you develop such a system?

Motivation: The idea came from my previous post “Performing OCR by running parallel instances of Tesseract 4.0 : Python“. That said, the following are some use cases where you may have to build such a search engine on top of other applications, e.g.

  • You have built an OCR app and converted millions of images into text files. You may want to build a search engine over the converted text files to search the contents of the images.
  • You have built a speech-to-text system that converts thousands of recorded audio files into text data. You may want to search the contents of the audio in real time.

Here is a video demonstration of a desktop app developed in Qt. It is a Whoosh Python implementation working in the back end.

Introduction: Whoosh

Some of you might have heard of “Lucene”, a popular search engine library written entirely in Java. Python wrappers for Lucene exist, but if you are looking for a similar Pythonic library, “Whoosh” is the one. Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites.

The Whoosh PyPI package can simply be installed with pip:
pip install Whoosh

For the example demonstrated in this blog post, you can download a dataset of 70,000 text files taken from Simple Wikipedia articles.

1. Creating Indexed Data: Whoosh

It is easy to index all your text files with Whoosh. First, the schema of the index has to be defined. The schema defines the list of fields to be indexed or stored for each text file, similar to how we define a schema for a database table. A field is a piece of information for each document in the index, such as its title or text content. Indexing a field means it can be searched; the field is also returned with the results if it is declared with stored=True in the schema. You only need to create the schema once, when creating the index.

Finally, all the text documents are added to the index writer in a loop. Documents are indexed according to the schema and must be added with fields matching the schema design. Below is the Python implementation for indexing all the text documents in a directory.

import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID

def createSearchableData(root):
    '''
    Schema definition: title (name of file), path (as ID), content (indexed
    but not stored), textdata (stored text content)
    '''
    schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
                    content=TEXT, textdata=TEXT(stored=True))
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")

    # Creating an index writer to add documents as per the schema
    ix = create_in("indexdir", schema)
    writer = ix.writer()

    filepaths = [os.path.join(root, i) for i in os.listdir(root)]
    for path in filepaths:
        print(path)
        with open(path, 'r') as fp:
            text = fp.read()
        # os.path.basename keeps this portable across Windows and Unix paths
        writer.add_document(title=os.path.basename(path), path=path,
                            content=text, textdata=text)
    writer.commit()

root = "corpus"
createSearchableData(root)
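A note on scale: with a million files, as in the problem statement, a single default writer can be slow. Whoosh's writer accepts batch-indexing options (worker processes, memory limits); a minimal sketch, assuming the indexdir index created above:

from whoosh.index import open_dir

ix = open_dir("indexdir")
# Four worker processes, 256 MB per in-memory segment; multisegment=True
# skips the final merge, which speeds up one-shot batch builds.
writer = ix.writer(procs=4, limitmb=256, multisegment=True)
# ... the same add_document() loop as above ...
writer.commit()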

2. Querying Indexed Data: Whoosh

Querying the indexed data has two important parts you may want to look at.

Query String: It is the string passed in while searching the indexed data. A query string can be a single word, a quoted phrase to be matched exactly, multiple words combined with ‘AND’, multiple words combined with ‘OR’, etc. For example:

  • Query: politics (matches documents in which the word occurs)
  • Query: sports OR games OR play (matches if any one of the terms occurs)
  • Query: alpha beta gamma (matches if a document contains all the terms; bare terms are AND'd by default)
  • Query: "alpha beta gamma" (matches if the terms occur together as an exact phrase)
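A quick way to check how Whoosh interprets such strings is to print the parsed query objects; a small sketch against the indexdir index built in section 1:

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")
parser = QueryParser("content", ix.schema)

print(parser.parse("politics"))                 # a single term query
print(parser.parse("sports OR games OR play"))  # an Or of three terms
print(parser.parse("alpha beta gamma"))         # an And -- bare terms are AND'd
print(parser.parse('"alpha beta gamma"'))       # an exact phrase query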

Scoring: Each document is ranked according to a scoring function. There are quite a few scoring functions supported by Whoosh:

  1. Frequency: It simply returns the count of occurrences of the query terms in the document. It does not perform any normalization or weighting.
  2. TF-IDF score: It returns the tf * idf score of each document. See the Wikipedia page on tf-idf for details.
  3. BM25F scoring: It is the default ranking function used by Whoosh. BM stands for “best matching”. It builds on tf-idf along with a bunch of factors such as the length of the document in words and the average length of documents in the collection. It also has free parameters K1 = 1.2 and B = 0.75.
  4. Cosine scoring: It is useful for finding documents similar to your search query.

A few more scoring algorithms have been implemented; see the whoosh.scoring module to know more.
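Switching between these functions is just a matter of the weighting argument passed to the searcher, as the full search implementation below also shows; a minimal sketch:

from whoosh import scoring

# Each of these can be passed as the searcher's `weighting` argument:
w_freq = scoring.Frequency()             # raw term counts
w_tfidf = scoring.TF_IDF()               # tf * idf weighting
w_bm25 = scoring.BM25F(B=0.75, K1=1.2)   # the default, parameters made explicit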

Below is the Python implementation for searching a query in the indexed data.

import sys

from whoosh.qparser import QueryParser
from whoosh import scoring
from whoosh.index import open_dir

ix = open_dir("indexdir")

# query_str is the query string passed on the command line
query_str = sys.argv[1]
# Top 'n' documents to return as results
topN = int(sys.argv[2])

with ix.searcher(weighting=scoring.Frequency()) as searcher:
    query = QueryParser("content", ix.schema).parse(query_str)
    # The search must run inside the `with` block, while the searcher is open
    results = searcher.search(query, limit=topN)
    for hit in results:
        print(hit['title'], str(hit.score), hit['textdata'])

The QueryParser class of Whoosh implements a query language very similar to Java Lucene's.
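For example, the parser's default AND grouping of bare terms can be switched to OR, much like setting the default operator in Lucene; a small sketch:

from whoosh.index import open_dir
from whoosh.qparser import QueryParser, OrGroup

ix = open_dir("indexdir")
# With OrGroup, "sports games play" parses like "sports OR games OR play"
parser = QueryParser("content", ix.schema, group=OrGroup)
print(parser.parse("sports games play"))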

3. Glossary

Below is the basic terminology you will always come across in discussions involving searching and indexing documents (taken from the Whoosh docs).

Corpus
The set of documents you are indexing.

Documents
The individual pieces of content you want to make searchable. The word “documents” might imply files, but the data source could really be anything – articles in a content management system, blog posts in a blogging system, chunks of a very large file, rows returned from an SQL query, individual email messages from a mailbox file, or whatever. When you get search results from Whoosh, the results are a list of documents, whatever “documents” means in your search engine.

Fields
Each document contains a set of fields. Typical fields might be “title”, “content”, “url”, “keywords”, “status”, “date”, etc. Fields can be indexed (so they’re searchable) and/or stored with the document. Storing the field makes it available in search results. For example, you typically want to store the “title” field so your search results can display it.

Forward index
A table listing every document and the words that appear in the document. Whoosh lets you store term vectors that are a kind of forward index.

Indexing
The process of examining documents in the corpus and adding them to the reverse index.

Postings
The reverse index lists every word in the corpus, and for each word, a list of documents in which that word appears, along with some optional information (such as the number of times the word appears in that document). These items in the list, containing a document number and any extra information, are called postings. In Whoosh the information stored in postings is customizable for each field.

Reverse Index
Basically a table listing every word in the corpus, and for each word, the list of documents in which it appears. It can be more complicated (the index can also list how many times the word appears in each document, the positions at which it appears, etc.) but that’s how it basically works.
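Whoosh exposes the reverse index through its IndexReader; a minimal sketch that inspects the postings for one term in the index built earlier (the term "politics" is just an example):

from whoosh.index import open_dir

ix = open_dir("indexdir")
reader = ix.reader()
if ("content", "politics") in reader:
    # In how many documents does the term appear?
    print(reader.doc_frequency("content", "politics"))
    # Total occurrences across the whole corpus
    print(reader.frequency("content", "politics"))
    # The postings themselves: the matching document numbers
    print(list(reader.postings("content", "politics").all_ids()))
reader.close()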

Schema
Whoosh requires that you specify the fields of the index before you begin indexing. The Schema associates field names with metadata about the field, such as the format of the postings and whether the contents of the field are stored in the index.

Term vector
A forward index for a certain field in a certain document. You can specify in the Schema that a given field should store term vectors.
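Storing term vectors is a flag in the schema definition; a hedged sketch (the field names mirror those used earlier):

from whoosh.fields import Schema, TEXT, ID

# vector=True asks Whoosh to keep a per-document term vector for "content"
schema = Schema(path=ID(stored=True),
                content=TEXT(vector=True))

# With an open searcher and a document number `docnum`, the vector can be
# read back, e.g. searcher.vector_as("frequency", docnum, "content")
# yields (term, frequency) pairs for that one document.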

At the End

Hope it was an easy read and a good heads-up to get started. So, what more can we do from here:

  1. Searching for similar documents instead of exact term searches only.
  2. Exploring hierarchical search in the file system.
  3. Correcting errors in the queries: “Did you mean…?” (see the sketch below).
  4. Searching n-grams for fast, “search as you type” functionality (also noted below).
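For point 3, Whoosh already ships query correction; a minimal sketch using searcher.correct_query on the index built earlier (the misspelled query is just an example):

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")
with ix.searcher() as searcher:
    qp = QueryParser("content", ix.schema)
    user_input = "politcs"  # deliberately misspelled
    query = qp.parse(user_input)
    corrected = searcher.correct_query(query, user_input)
    if corrected.query != query:
        print("Did you mean:", corrected.string)

# For "search as you type" (point 4), index word fragments with an
# n-gram field, e.g.:
#   from whoosh.fields import NGRAMWORDS
#   schema = Schema(title=NGRAMWORDS(minsize=2, maxsize=4, stored=True))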

Update: Readers can download a backup of the Qt application from the link below. It is old work and I no longer have a working app. You may have to figure out the code in order to reproduce it. Good luck with that.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it reaches the readers who can actually gain from it. Please feel free to discuss anything regarding the post. I would love to hear your feedback.

Happy machine learning 🙂