Problem Statement: Simply put, you have 1 million text files in a directory, and your application must answer text search queries across all of them within a few seconds (say, 1-2 seconds). How would you build such a system?
Motivation: The idea came from my previous post, “Performing OCR by running parallel instances of Tesseract 4.0 : Python”. Here are some use cases where you may have to build such a search engine on top of another application:
- You have built an OCR app and converted millions of images into text files. You may want to build a search engine over the converted text files to search the contents of the images.
- You have built a speech-to-text system that converts thousands of audio recordings into text. You may want to search the contents of the audio in real time.
Here is a video demonstration of a desktop app developed in QT. A Python implementation using Whoosh works in the back end.
Introduction: Whoosh
Some of you might have heard of “Lucene”, a popular search engine library written entirely in Java. You may find a Python wrapper for Lucene. If you are looking for a similar Pythonic library, “Whoosh” is the one. Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites.
The Whoosh PyPI package can be installed with pip:
pip install Whoosh
For the example demonstrated in this blog post, you can download a dataset of 70,000 text files taken from simple wiki articles from .
1. Creating Indexed Data: Whoosh
It is easy to index all your text files with Whoosh. First, the schema of the index has to be defined. The schema lists the fields to be indexed or stored for each text file, similar to how you define columns for a database table. A field is a piece of information for each document in the index, such as its title or text content. Indexing a field means it can be searched; a field is also returned with the results if it is declared with stored=True in the schema. You only need to create the schema once, when creating the index.
Finally, all the text documents are added to an index writer in a loop. Each document is indexed according to the schema, so it has to be added with the fields the schema defines. Below is the Python implementation for indexing all the text documents in a directory.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID


def createSearchableData(root):
    '''
    Schema definition: title (name of file), path (as ID),
    content (indexed but not stored), textdata (stored text content)
    '''
    schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
                    content=TEXT, textdata=TEXT(stored=True))
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")

    # Create an index writer to add documents as per the schema
    ix = create_in("indexdir", schema)
    writer = ix.writer()

    filepaths = [os.path.join(root, i) for i in os.listdir(root)]
    for path in filepaths:
        print(path)
        # Assumes UTF-8 text; errors='replace' keeps indexing going
        # when a file contains bytes that are not valid UTF-8
        with open(path, 'r', encoding='utf-8', errors='replace') as fp:
            text = fp.read()
        # os.path.basename gives the file name on both Windows and Linux
        writer.add_document(title=os.path.basename(path), path=path,
                            content=text, textdata=text)
    writer.commit()


root = "corpus"
createSearchableData(root)
2. Querying Indexed Data : Whoosh
Querying indexed data has two important parts that you may want to look at.
Query String : It is passed while searching the indexed data. A query string can be a single word, a phrase to be matched exactly, multiple words joined with ‘AND’, multiple words joined with ‘OR’, etc. For example:
Query : politics (returns a document if the word occurs in it)
Query : sports OR games OR play (returns a document if any one of the terms occurs)
Query : alpha beta gamma (returns a document if it contains all of the terms)
Query : “alpha beta gamma” (returns a document if the terms occur together, as a phrase)
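To make the three kinds of query above concrete, here is a minimal pure-Python sketch of how they could be evaluated against a toy positional index. It is not how Whoosh implements its query parser; the corpus and the names match_all, match_any and match_phrase are made up for illustration.

```python
from collections import defaultdict

# Toy corpus; document ids and texts are made up for illustration.
docs = {
    1: "alpha beta gamma",
    2: "gamma beta alpha",
    3: "sports and politics",
    4: "alpha delta",
}

# Positional index: term -> {doc_id: [positions of the term in that doc]}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.lower().split()):
        index[term].setdefault(doc_id, []).append(pos)


def match_all(terms):
    """AND: documents that contain every term."""
    sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*sets) if sets else set()


def match_any(terms):
    """OR: documents that contain at least one term."""
    return set().union(*(index.get(t, {}) for t in terms))


def match_phrase(terms):
    """Phrase: the terms appear next to each other, in the given order."""
    hits = set()
    for doc_id in match_all(terms):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


print(match_any(["sports", "games", "play"]))
print(match_all(["alpha", "beta", "gamma"]))
print(match_phrase(["alpha", "beta", "gamma"]))
```

The phrase query is the strictest of the three: document 2 contains all three words, so it matches the AND query, but not in the right order, so it does not match the phrase.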
Scoring : Each document is ranked according to a scoring function. Whoosh supports quite a few scoring functions.
- Frequency : It simply returns the number of times the terms occur in the document. It does not perform any normalization or weighting.
- Tf-Idf scores : It returns the tf * idf score of each document. To know more read here or wiki page here.
- BM25F scoring : It is the default ranking function used by Whoosh. BM stands for best matching. It is based on tf-idf along with a bunch of factors such as the length of the document in words and the average length of documents in the collection. It also has free parameters k = 1.2 and b = 0.75. To read more check here.
- Cosine scoring : It is useful for finding documents similar to your search query.
There are a few more scoring algorithms that have been implemented. Check here to know more.
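To see how these scoring functions differ, here is a rough pure-Python sketch of raw frequency, tf-idf and a single-field BM25 score for one query term. It is not Whoosh's own code: the toy corpus is made up, the smoothed idf formula is one common variant chosen as an assumption, and BM25F as used by Whoosh additionally handles multiple fields.

```python
import math

# Toy corpus of tokenised documents; contents are made up for illustration.
docs = {
    "short": "football fifa football".split(),
    "long": ("football " + "filler " * 20).split(),
    "other": "cricket bat ball".split(),
}

N = len(docs)                                    # number of documents
avgdl = sum(len(d) for d in docs.values()) / N   # average document length


def frequency_score(term, doc):
    """Raw term count, with no normalisation or weighting."""
    return doc.count(term)


def idf(term):
    """Smoothed inverse document frequency (one common variant)."""
    n = sum(1 for d in docs.values() if term in d)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)


def tf_idf_score(term, doc):
    """Term frequency weighted by how rare the term is in the corpus."""
    return doc.count(term) * idf(term)


def bm25_score(term, doc, k1=1.2, b=0.75):
    """BM25 for one term: tf saturates and is normalised by document length."""
    tf = doc.count(term)
    norm = k1 * (1 - b + b * len(doc) / avgdl)
    return idf(term) * tf * (k1 + 1) / (tf + norm)


for name, doc in docs.items():
    print(name, frequency_score("football", doc),
          round(tf_idf_score("football", doc), 3),
          round(bm25_score("football", doc), 3))
```

With BM25, the short document that mentions the term twice ranks above the long document that mentions it once among many other words, which is the effect of the length normalisation controlled by b.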
Below is the Python implementation for searching the indexed data with a query.
import sys
from whoosh.qparser import QueryParser
from whoosh import scoring
from whoosh.index import open_dir

ix = open_dir("indexdir")

# query_str is the query string
query_str = sys.argv[1]
# Top 'n' documents as result
topN = int(sys.argv[2])

with ix.searcher(weighting=scoring.Frequency) as searcher:
    query = QueryParser("content", ix.schema).parse(query_str)
    results = searcher.search(query, limit=topN)
    # Iterate over the hits inside the "with" block: there may be fewer
    # than topN results, and the searcher is closed once the block exits
    for hit in results:
        print(hit['title'], str(hit.score), hit['textdata'])
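The search script takes the query string and the number of results from the command line. Here is a small sketch of the same argument handling with the standard library's argparse, which gives a clear usage message instead of an IndexError when an argument is missing. The parse_args helper and the default of 5 results are my own assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the query string and result count (argv=None reads sys.argv)."""
    parser = argparse.ArgumentParser(description="Search the Whoosh index")
    parser.add_argument("query_str", help="text to search for")
    parser.add_argument("topN", type=int, nargs="?", default=5,
                        help="number of top results to show (default: 5)")
    return parser.parse_args(argv)


# Passing a list makes this usable from Jupyter or Spyder,
# where sys.argv does not hold your query.
args = parse_args(["football fifa", "5"])
print(args.query_str, args.topN)
```

From a terminal, the equivalent call would look like python search.py "football fifa" 5, where search.py is a hypothetical name for the file holding the search script.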
The QueryParser class of Whoosh implements a query language very similar to Lucene’s.
3. Glossary
Below are the basic terms you will come across in discussions about searching and indexing documents (taken from the Whoosh docs).
Corpus
The set of documents you are indexing.
Documents
The individual pieces of content you want to make searchable. The word “documents” might imply files, but the data source could really be anything – articles in a content management system, blog posts in a blogging system, chunks of a very large file, rows returned from an SQL query, individual email messages from a mailbox file, or whatever. When you get search results from Whoosh, the results are a list of documents, whatever “documents” means in your search engine.
Fields
Each document contains a set of fields. Typical fields might be “title”, “content”, “url”, “keywords”, “status”, “date”, etc. Fields can be indexed (so they’re searchable) and/or stored with the document. Storing the field makes it available in search results. For example, you typically want to store the “title” field so your search results can display it.
Forward index
A table listing every document and the words that appear in the document. Whoosh lets you store term vectors that are a kind of forward index.
Indexing
The process of examining documents in the corpus and adding them to the reverse index.
Postings
The reverse index lists every word in the corpus, and for each word, a list of documents in which that word appears, along with some optional information (such as the number of times the word appears in that document). These items in the list, containing a document number and any extra information, are called postings. In Whoosh the information stored in postings is customizable for each field.
Reverse Index
Basically a table listing every word in the corpus, and for each word, the list of documents in which it appears. It can be more complicated (the index can also list how many times the word appears in each document, the positions at which it appears, etc.) but that’s how it basically works.
Schema
Whoosh requires that you specify the fields of the index before you begin indexing. The Schema associates field names with metadata about the field, such as the format of the postings and whether the contents of the field are stored in the index.
Term vector
A forward index for a certain field in a certain document. You can specify in the Schema that a given field should store term vectors.
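The forward index, reverse index and postings described in this glossary can be sketched in a few lines of plain Python. This is only a toy illustration of the data structures, not Whoosh's on-disk format; the file names and texts are made up.

```python
from collections import Counter, defaultdict

# Corpus: the set of documents being indexed (made-up examples).
corpus = {
    "doc1.txt": "the cat sat on the mat",
    "doc2.txt": "the dog chased the cat",
}

# Forward index: document -> term counts (one term vector per document).
forward_index = {name: Counter(text.split()) for name, text in corpus.items()}

# Reverse index: term -> postings, each posting being (document, frequency).
reverse_index = defaultdict(list)
for name, term_vector in forward_index.items():
    for term, freq in term_vector.items():
        reverse_index[term].append((name, freq))

print(forward_index["doc1.txt"]["the"])   # how often "the" occurs in doc1.txt
print(reverse_index["cat"])               # postings for the term "cat"
```

Answering a query for “cat” is then a single dictionary lookup in the reverse index rather than a scan over every file, which is where the speed comes from.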
At the End
I hope it was an easy read and a good starting point. So, what more can we do from here?
- Searching for similar documents instead of exact term matches only.
- Exploring hierarchical search in a file system.
- Correcting errors in queries. Did you mean…?
- Searching N-grams to get fast “search as you type” functionality.
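As a taste of the “Did you mean…?” idea in the list above, here is a minimal sketch using only the standard library's difflib. It is not Whoosh's spelling support; the vocabulary list, the did_you_mean helper and the 0.7 cutoff are assumptions for illustration. In a real setup the vocabulary would come from the terms in your index.

```python
import difflib

# Vocabulary would normally come from the terms in the index;
# this small list is made up for illustration.
vocabulary = ["football", "politics", "cricket", "sports", "games"]


def did_you_mean(word, vocab=vocabulary, cutoff=0.7):
    """Return the closest known term, or None if the word is already
    known or nothing in the vocabulary is similar enough."""
    if word in vocab:
        return None
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(did_you_mean("fotball"))
print(did_you_mean("politcs"))
```

A suggestion returned this way could be shown to the user or used to rerun the query automatically.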
Update: Readers can download a backup of the QT application from the link below. It is old work, and I don’t have a working app now. You may have to work through the code in order to reproduce it. Good luck with that.
Happy machine learning 🙂
Hi would you be keen to share your app developed in QT? thanks a lot for the tutorial
Though very late, I have updated the backup download link in the “At the End” section.
getting error: File “C:\Users\1patilha\AppData\Local\Continuum\anaconda3\lib\site-packages\whoosh\writing.py”, line 515, in __init__
raise LockError
Can you check that you have administrator rights or permission to access the files (try starting cmd or Spyder with admin rights)?
You can run a simple Whoosh example to ensure that Whoosh is working fine.
I have data in pdf, doc, docx , ppt and excel in 4 folders , I want to use whoose for indexing how I can read it. could you help me.
You will have to figure out how to read each type of file by checking its extension.
Look for Python packages that read PDF, DOC and similar formats.
I am not aware of one for PPT files, though.
I have converted in txt format even I am getting same error.
If I have to fetch files from sqlserver2008 then how I can do it.
I am trying to run the code using Jupyter or Spyder but got stuck on locked (running with admin privileges). The weird thing is that yesterday it worked even though the txt file was not read correctly. Now I don’t even get the index created.
I run the first half to create the index folder and the schema without issues.
Any idea?
Thank you very much
Hi, thanks a lot for this tutorial it really useful, could you please share with us a link for your app developed in QT?
Though very late, I have updated the backup download link in the “At the End” section.
Hi Abhijeet, Can I have a meeting with you? I like the idea you used here in Whoosh and the OCR implementation using Tesseract.
Sure!
Feel free to reach out to me through the Contact page or by email.
Hello, I get this error
File “indexing.py”, line 25, in createSearchableData
writer.add_document(title=path.split(“\\”)[1], path=path,\
IndexError: list index out of range
is there any process needed to do before installing Whoosh so it wont produce this error?
nevermind, it is Linux’s directory, should’ve changed it to ‘/’
my bad
If possible, Can you please post the sample text for query_str and topN.
Examples of both are in the video as well as the blog post.
topN is the number of top results you want, e.g. topN = 5.
query_str is the string you want to search for, e.g. query_str = “football fifa”.
File “code2.py”, line 13
query = QueryParser(“content”, ix.schema).parse(query_str)
^
IndentationError: expected an indented block
What to do with it?
Updated the indentation. Kindly check the updated code in the blog, or simply resolve the error by fixing the indentation.
Hello,
Thank you for your wonderful blog post on Whoosh. I’m kinda new to this but just to make sure: indexdir = the directory where my text/pdf are stored? (i’m i getting this right?). My issue here is that when i run this:
# query_str is query string
query_str = sys.argv[1]
#Top ‘n’ documents as result
topN = int(sys.argv)[2]
then i got this error:
File “”, line 8, in
query_str = sys.argv[1]
IndexError: list index out of range
is it because we are not using the same dataset?
Thanks!
p.s an introduction for implementing python scripts to QT please? 🙂
Indeed a good work. Really appreciate. just a curiosity, Is this based on natural language processing? Or is NLP involved here Or entirely based native Python capabilities?
Glad that you found it good.
Whoosh search is based on indexing the documents: it builds a reverse index that lists, for each word, the documents in which it appears. This indexing technique makes search very fast.
Documents are ranked by a scoring function such as BM25F.
So, there is no machine learning or AI model trained or applied here, but text processing is definitely involved.
Hi Abhijeet,
This implementation demonstrate how we can extract the specific text document.
Is it possible that if we can extract specific paragraph in which specific search keywords are present?
Is this implementation works for pdf files?
Actually I am working on 50-60K pdf files and I have to create search engine to extract the specific paragraph in which the search keywords are present.
Please guide me.
Dost, i am getting this error
~\AppData\Local\conda\conda\envs\Tensorflow\lib\site-packages\whoosh\reading.py in doc_count(self)
637 def doc_count(self):
638 if self.is_closed:
–> 639 raise ReaderClosed
640 return self._perdoc.doc_count()
641
ReaderClosed:
Can you please help me, am using your same code and changed the root to a local folder with some 3 notepad text documents in them
Ravi
What Python IDE did you use for building this?
The front end of the app shown in this blog post was developed in QT (C++). The Python code was simply run in Spyder.
To start with, you can use Spyder. PyCharm is also a good option. Sometimes I just use a text editor like Atom to write code and run it in a terminal. Sometimes Python notebooks are more useful.
Hello,
Why do you need content and textdata on schema?
schema = Schema(title=TEXT(stored=True),path=ID(stored=True),\
content=TEXT,textdata=TEXT(stored=True))
The schema defines all the fields we want to keep for each document, such as title, path and text content.
“content” is indexed so that it can be searched but is not stored, while “textdata” is stored so that the text can be returned with the search results.
Whoosh builds a fast, searchable reverse index internally and saves it, which is basically why the search is fast.
Hi, will i be able to search alpha numeric strings?
Thanks,
Praveen
Yes.
Hi,
can you guide me where to store the txt files,
is that i need to store in the ‘indexdir’ folder or any other folder?
Thanks,
Praveen
I kept all my files in a folder named “corpus”.
Check the line root = “corpus” near the end of the first code snippet.
The indexed, searchable Whoosh data will be saved in “indexdir”.
I have an error message the points to an unicode issue:
UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xff in position 299: invalid start byte
I can’t figure out how to solve this issue (the file is Ok as seen by okular, evince, acrobat reader…)
Any hints?
Traceback (most recent call last):
File “./indexa.py”, line 35, in
createSearchableData(root)
File “./indexa.py”, line 27, in createSearchableData
text = fp.read()
File “/usr/lib/python3.8/codecs.py”, line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xff in position 299: invalid start byte
Hey, I am getting an exception while trying to execute your code,
I am getting the following exception
Exception has occurred: ReaderClosed
exception: no description
Can you please help resolve this issue?