Building an Inverted Index from scratch and optimizing it

Last week I came across an interesting real-world problem: enabling efficient search across a C codebase. I was challenged not to use any pre-existing solutions such as Lucene, Solr, or Glimpse. An additional requirement was to identify whether a match is a function name, part of a parameter list, or a variable name.

The only thing I understood about search engines was that they use indexes to retrieve data quickly. This is the story of how I implemented an in-memory inverted index.

Constructing the in-memory inverted index:

The schema of the inverted index is:
 Dictionary<key, List<Schema>>
 where Schema is {
 documentId, 
 lineNo,
 isFunctionORisVariableORisParameter
 }

Each Schema entry is a posting, and the list of entries stored under a key is called the postings list.
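As a rough illustration, the schema could be written in Python as follows. The names Posting, document_id, line_no, and kind are my own placeholders for this sketch, and collapsing isFunctionORisVariableORisParameter into a single string field is an assumption, not a requirement of the design.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    """One occurrence of a token (hypothetical field names)."""
    document_id: int  # unique number assigned to the file
    line_no: int      # line on which the token appears
    kind: str         # assumed values: "function", "variable", "parameter", "unknown"


# The inverted index maps each token to its postings list.
index = defaultdict(list)
index["main"].append(Posting(document_id=0, line_no=12, kind="function"))
```

A plain tuple or dict would work just as well; a NamedTuple only makes the fields easier to read.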

The tasks that need to be carried out for indexing are as follows:

Walk through the directory tree and collect all the compatible files. Assign every file a unique number and keep that mapping in memory. This unique number is the documentId.
To get the individual words (tokens), we just need to tokenize the file. But to identify a function name, variable name, or parameter list, we need to parse the file, which is an entirely different piece of work. I am still looking for a solution to that.
Once you have tokenized the file, append the schema details to the list stored under the token's key in the in-memory dictionary.
Once all the files are processed, pickle the dictionary to a file.
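The indexing steps above can be sketched roughly as below. This is a minimal sketch under several assumptions of mine: "compatible files" means files ending in .c or .h, a token is any C-identifier-like word matched by a simple regular expression, and the function/variable/parameter field is always "unknown" because the parsing problem is still open. The names build_index, save_index, TOKEN_RE, and C_EXTENSIONS are hypothetical.

```python
import os
import pickle
import re
from collections import defaultdict

# Assumption: a token is an identifier-like word; real C lexing is more involved.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Assumption: these extensions are what counts as a "compatible" file.
C_EXTENSIONS = (".c", ".h")


def build_index(root_dir):
    """Walk root_dir and return (doc_ids, index)."""
    doc_ids = {}               # documentId -> file path
    index = defaultdict(list)  # token -> postings list
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.endswith(C_EXTENSIONS):
                continue
            doc_id = len(doc_ids)  # next unused number
            path = os.path.join(dirpath, name)
            doc_ids[doc_id] = path
            with open(path, encoding="utf-8", errors="replace") as f:
                for line_no, line in enumerate(f, start=1):
                    for token in TOKEN_RE.findall(line):
                        # The kind is "unknown" because no parser is used here.
                        index[token].append((doc_id, line_no, "unknown"))
    return doc_ids, dict(index)


def save_index(doc_ids, index, out_path):
    """Pickle the documentId map and the index into a single file."""
    with open(out_path, "wb") as f:
        pickle.dump({"doc_ids": doc_ids, "index": index}, f)
```

Note that this records every occurrence, so a token that appears twice on one line gets two postings for that line.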

The tasks that need to be carried out to search the index are as follows:

Unpickle the file where the index is stored and load it into the in-memory index.
Look up the word in the in-memory index and return the schema details.
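A minimal sketch of the two search steps might look like this. It assumes the pickled file holds a dict with two keys, "doc_ids" (documentId to file path) and "index" (token to a list of (documentId, lineNo, kind) tuples); that layout and the names load_index and search are my own assumptions for illustration.

```python
import pickle


def load_index(path):
    """Unpickle the stored index. Only do this with files you created yourself,
    since unpickling untrusted data can execute arbitrary code."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["doc_ids"], data["index"]


def search(index, doc_ids, word):
    """Return (file path, line number, kind) for every posting of word."""
    return [
        (doc_ids[doc_id], line_no, kind)
        for doc_id, line_no, kind in index.get(word, [])
    ]
```

The lookup is an exact, case-sensitive match on the token, which is usually what you want for identifiers in C code.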

Further ideas to optimize the storage of the in-memory index:

We can store the postings lists in a file and keep in the in-memory index only the file offset at which each postings list is stored. This reduces the size of the in-memory index, so it can hold more keys.
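One possible way to try this idea is sketched below. It assumes each postings list is pickled separately into one file, and the in-memory dictionary maps each token to the byte offset where its list starts. The function names write_postings and read_postings are hypothetical, and I have not measured how much memory this actually saves.

```python
import pickle


def write_postings(index, postings_path):
    """Write each postings list to disk and return token -> byte offset."""
    offsets = {}
    with open(postings_path, "wb") as f:
        for token, postings in index.items():
            offsets[token] = f.tell()  # where this token's list begins
            pickle.dump(postings, f)
    return offsets


def read_postings(offsets, postings_path, token):
    """Seek to the stored offset and load only that token's postings list."""
    offset = offsets.get(token)
    if offset is None:
        return []
    with open(postings_path, "rb") as f:
        f.seek(offset)
        return pickle.load(f)  # reads exactly one pickled object
```

The trade-off is one disk seek per lookup in exchange for keeping only a single integer per key in memory.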