Building the jobmatc search index
jobmatc.com is built primarily on C#/.NET, with Azure for cloud infrastructure. With so much diversity in tech and infrastructure, the choices are practically infinite, like the decisions you make in life :-), but we had to choose one, start, and then work on it. C# was what we knew best.
Most of what a user does on jobmatc.com revolves around search: you search for something, you get jobs listed, view employers, view requirements, and then apply. Search is best assisted with suggestions, so we chose typeahead for that. A good thing about typeahead is that it ships with pre-built features such as Prefetch, which caches suggestions locally; we might detail that in other posts. Beyond suggestions, we still needed full-text search for generic queries.
We needed to start somewhere, and we wanted good search with probable features such as similar-document search and more advanced text-search options. We opted for Lucene, created by Doug Cutting, who is also renowned for Hadoop. Lucene has a port for .NET and also a package for syncing the index with Azure blob storage. Since we serve a selected market, we believe we won't have to scale so rapidly that scaling Lucene becomes a problem for us; otherwise we would have opted for Solr.
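As a rough sketch of how that wiring looks, here is how an IndexWriter can be opened over Azure blob storage via the AzureDirectory package. This is a hedged sketch, not our exact setup: the container name "job-index", the connection string variable, and the exact namespaces vary by package version.

```csharp
// Hedged sketch: opening a Lucene.NET IndexWriter backed by Azure blob storage.
// "job-index" and storageConnectionString are placeholder values.
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store.Azure;
using Microsoft.WindowsAzure.Storage;

var account = CloudStorageAccount.Parse(storageConnectionString);
var azureDirectory = new AzureDirectory(account, "job-index"); // blob container backing the index
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var indexWriter = new IndexWriter(azureDirectory, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
```

AzureDirectory keeps a local cache of the index files and syncs segments to the blob container, so reads stay fast while the blob copy serves as the durable store.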
So how does Lucene work? There are mainly three parts: building the index, searching over that index, and administering Lucene. Today we will talk briefly about Lucene's way of building an index, along with a snippet of how we build ours.
Building Lucene Index
There is not much code involved in building a Lucene index; on the surface it's fair and simple. Under the hood, however, indexing is a fairly complex process. As usual, it's like a root: not so sexy and shiny, but a necessity.
The basic unit of Lucene's indexing and searching is the document, which you can visualize as a real document consisting of fields; fields might be the header or the body of that particular document. How do you generally search? You search on a particular keyword, and when it matches you list the result out, and while listing you don't just show the title but also the corresponding description, date, or other fields. We do the same in Lucene: we index the fields we need to search on, and we merely store the fields we will need after a match. A field has to be text to be indexed; otherwise we can only store it. An indexed field can also carry a small per-field inverted index (term vectors), which enables functionality such as finding similar documents.
var doc = new Document(); // Create a document

// Add fields to that document.
// Store the Id so we can retrieve it later, but don't analyze it.
doc.Add(new Field("Id", listing.JobSnippetId, Field.Store.YES,
    Field.Index.NOT_ANALYZED_NO_NORMS, Field.TermVector.NO));

// Index (analyze) this field for searching, and store it too.
doc.Add(new Field("Name",
    listing.Name,
    Field.Store.YES,
    Field.Index.ANALYZED_NO_NORMS,
    Field.TermVector.YES));

// Just store the category for the listing; don't index it.
doc.Add(new Field("Category", listing.JobCategory.ToString("D"), Field.Store.YES,
    Field.Index.NO,
    Field.TermVector.NO));

// And finally add the document to the indexWriter.
indexWriter.AddDocument(doc);
Look at the first field, Id: we store it with Field.Store.YES so we can retrieve it later, but index it with Field.Index.NOT_ANALYZED_NO_NORMS, which means we don't want this field analyzed; we want it back exactly as it is, because deleting a document later requires matching the exact document id. With NO_NORMS we are stating that we don't want to boost that field.
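That exact-match requirement is what makes deletion work: since Id was never analyzed, the indexed term is byte-for-byte the stored id. A minimal sketch, assuming the same indexWriter and listing as in the snippet above:

```csharp
// Delete the document whose "Id" term matches this listing's id exactly.
// This works because "Id" was indexed NOT_ANALYZED, so the term equals the raw value.
indexWriter.DeleteDocuments(new Term("Id", listing.JobSnippetId));
indexWriter.Commit(); // make the deletion visible to newly opened readers
```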
What is with boosting thing?
Imagine, for simplicity, that you have two types of plans: one free and the other premium, so you need to show premium a bit higher in the list than free. How do you do it? You boost the whole document with doc.SetBoost(2F). Or your search result consists of multiple fields, like title and description; the description obviously has less relevance than the title, so you apply a field boost there. It depends on what you are doing, but we at jobmatc don't boost at all. We do have premium and free models for posting jobs, but that doesn't affect search in any way; for now, premium only lists that company on the home page. If you want to set any kind of boost, you need to remove NO_NORMS. This is index-time boosting; you can also set a run-time/dynamic boost while searching. So how does boosting take place, semi-internally? All the boosts you have assigned (document-level and field-level), together with Lucene's own length normalization (which favors fields with fewer tokens), are combined into a single floating-point number and encoded into a single byte per field (remember that a float is normally 4 bytes). At search time each byte is decoded back to a floating-point value and loaded into memory.
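If we did want index-time boosting, a sketch would look like this. Note this is illustrative, not what we run at jobmatc, and listing.Description is a hypothetical property; the fields must be indexed with norms (i.e. without the *_NO_NORMS options) or the boost is discarded.

```csharp
// Hedged sketch: index-time boosting in Lucene.NET 3.x (we don't boost at jobmatc).
var doc = new Document();

var title = new Field("Title", listing.Name, Field.Store.YES, Field.Index.ANALYZED);
title.Boost = 2.0f; // field-level boost: a title match counts more than a description match
doc.Add(title);

// listing.Description is a hypothetical property used for illustration.
doc.Add(new Field("Description", listing.Description, Field.Store.YES, Field.Index.ANALYZED));

doc.Boost = 1.5f;   // document-level boost, e.g. for a premium listing
indexWriter.AddDocument(doc);
```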
Inverted index sounds heavy?
We know Lucene searches over an index, and it's fast, so how does Lucene achieve that? When building the index, it takes the tokens produced by analysis and stores each token along with information about where the containing documents actually are, which shifts searching from scanning the actual fields of documents to looking up tokens. It's like searching the index at the back of a book and then locating the page numbers, rather than reading every line looking for the keyword.
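To make the idea concrete, here is a toy inverted index. This is an illustration only, nothing like Lucene's actual implementation: each token maps to the set of document ids containing it, so a lookup is one dictionary access instead of a scan over every document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy inverted index: token -> set of document ids containing that token.
class TinyInvertedIndex
{
    private readonly Dictionary<string, HashSet<int>> postings =
        new Dictionary<string, HashSet<int>>();

    public void Add(int docId, string text)
    {
        // Crude "analysis": lowercase and split on spaces.
        foreach (var token in text.ToLowerInvariant().Split(' '))
        {
            if (!postings.TryGetValue(token, out var docs))
                postings[token] = docs = new HashSet<int>();
            docs.Add(docId);
        }
    }

    public IEnumerable<int> Search(string token)
    {
        return postings.TryGetValue(token.ToLowerInvariant(), out var docs)
            ? docs
            : Enumerable.Empty<int>();
    }
}

class Program
{
    static void Main()
    {
        var index = new TinyInvertedIndex();
        index.Add(1, "Senior Java Developer");
        index.Add(2, "Java Backend Engineer");
        index.Add(3, "Frontend Developer");
        // One dictionary lookup instead of scanning all three documents.
        Console.WriteLine(string.Join(",", index.Search("java").OrderBy(d => d))); // 1,2
    }
}
```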
When we call indexWriter.AddDocument(doc);, analysis happens first: the text is split into a stream of tokens and various optional operations run on it, such as lowercasing, removing stop words like the, in, on, or reducing tokens to a root form using the Porter stemmer algorithm (e.g. reading, revival, allowance become read, reviv, allow), which might not be the most accurate but is fast. This whole process is analysis. After analysis, documents are buffered in memory rather than written immediately, to minimize disk I/O, until the writer flushes them.
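You can see what analysis produces by running an analyzer by hand. A hedged sketch against the Lucene.NET 3.x API (the version constant and attribute names may differ in other versions):

```csharp
// Sketch: printing the tokens a StandardAnalyzer emits for a piece of text.
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
TokenStream stream = analyzer.TokenStream("Name", new StringReader("The Senior Developers"));
var term = stream.AddAttribute<ITermAttribute>();
while (stream.IncrementToken())
    Console.WriteLine(term.Term); // lowercased tokens; stop word "The" is dropped
```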
After adding all documents to the indexWriter you can explicitly commit using indexWriter.Commit();, which in general flushes any buffered adds or pending deletes, creates segments, and removes old commits, as governed by the IndexDeletionPolicy. There can also be a two-phase commit using PrepareCommit; for simplicity we used a straight Commit();. If you want to be able to roll back to previously committed states, you need to change the IndexDeletionPolicy; the default keeps only the most recent commit (i.e. KeepOnlyLastCommitDeletionPolicy).
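For completeness, the two-phase variant looks roughly like this (again, we use the straight Commit()):

```csharp
// Hedged sketch: two-phase commit with rollback in Lucene.NET 3.x.
try
{
    indexWriter.PrepareCommit(); // phase one: write the commit data, not yet visible
    // ... coordinate with any other transactional resource here ...
    indexWriter.Commit();        // phase two: make the new segments visible to readers
}
catch (Exception)
{
    indexWriter.Rollback();      // discard all changes since the last successful commit
    throw;
}
```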
So with each flush a segment is created, and when searching, Lucene combines the results from every segment efficiently. You can merge all these segments into one, which is an intensive operation, by calling indexWriter.Optimize(); it has different configurations if you want them, like whether to end up with a single file or multiple files while optimizing, etc. We at jobmatc haven't used it because we think it's overkill for us; even if we need to optimize, we will do it on a scheduler while traffic is low.
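If we ever do schedule it, the off-peak merge would be a one-liner; Optimize also takes a segment-count argument for a cheaper partial merge:

```csharp
// Hedged sketch: segment merging during low-traffic hours (we don't run this today).
indexWriter.Optimize();   // merge everything down to a single segment (expensive)
// or, cheaper:
indexWriter.Optimize(5);  // merge down to at most 5 segments
```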
Index Building Strategy
Since building the index is a resource-intensive process that takes a bit of time, we don't index a vacancy as soon as someone adds it. Rather, we build the index at certain intervals using Azure WebJobs, reading rows from the database by status, and delete from the index accordingly using a delete status. With this approach we can easily rebuild the index just by changing statuses in the database, and we also don't have to worry too much about locks. IndexWriter uses file-based locking: a write.lock file is created that stops another IndexWriter instance from opening the same index, to avoid index corruption. Scheduling the index build pretty much sidesteps the waiting problem that locking would otherwise cause. There are configs to switch to no locking or to different lock types, but we stick with the default. Next time we might discuss our searching strategy.
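Put together, the WebJob step is conceptually just a loop over status rows. The sketch below is illustrative only: db, its methods, and the IndexStatus values are hypothetical names standing in for our actual schema, and BuildDocument stands for the field setup shown earlier.

```csharp
// Hypothetical sketch of the scheduled index-build pass (names are made up).
foreach (var listing in db.GetListingsWithStatus(IndexStatus.PendingAdd))
{
    indexWriter.AddDocument(BuildDocument(listing)); // field setup as shown earlier
    db.MarkIndexed(listing.JobSnippetId);
}

foreach (var listing in db.GetListingsWithStatus(IndexStatus.PendingDelete))
{
    indexWriter.DeleteDocuments(new Term("Id", listing.JobSnippetId));
    db.MarkRemoved(listing.JobSnippetId);
}

indexWriter.Commit(); // one commit per pass; write.lock is held only while the job runs
```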
Writing hungry, writing foolish.


