GSOC 2023 Final Report

Pratham Gupta
3 min readSep 14, 2023

--

This is the final report for my project. Here i will be explaining about the method we took to find anagrams.

GNOME Crosswords Editor

Although still under development, Editor is a important part of Crosswords application for GNOME. It allows us to create basic crosswords with grids and clues.

Project Information

My project is add anagram-search support for the Crosswords Editor. The data for the searching comes from a word list file. My task is to search for anagrams. It needs to fast enough, so that the user can set the input word, and the search results are displayed instantaneously (without any lag).

The word-list file

To understand my project, you need to know about the data file i.e. word-list file. The file is made up of 3 sections:

  1. WordList: Just a huge list of words.
  2. FilterFragments: A list of word-indices that follow a particular pattern. e.g. A?? is pattern and its list will have index for words that start with A and have length of 3. Similarly for fragment ?B??, the list will have offsets for words that have second letter as B and length of 4.
  3. Index Section: A JSON block which stores indexes for the above sections.

Finding Anagrams

Approach 1

We simulate a trie to find anagrams of a word.

Trie to search for anagrams

Here at each node we maintain a list of words that follow the particular pattern, like AT? will have words having a prefix of AT and length 3. For branches where the list of words becomes empty like AT?, we perform branch culling.

We were successful in finding anagrams for a word using this approach, but this method is a bit slow to be directly used in a user interface. For a 12 letter word like ABBREVIATION, it took 0.3 seconds to find the anagrams. Thus we need another faster method.

Approach 2

Basic Idea

I will try to explain this with an example:

  • Two words: HEART and EARTH
  • Sort these words, we get AEHRT and AEHRT
  • Create a hash for both of them
  • Notice that they will have the same hash as the sorted words are same
  • We use this unique property to find anagrams.

Implementation

We solve this problem in two stages:

Stage 1: At the compile time

Create a new section in word-list file for anagram fragments, it has the following two parts:

  1. anagram word list: This is a list of words that have same hash. We store a gushort (2 bytes) as index for a word.
  2. anagram hash index: Every entry in anagram word list has a corresponding entry in anagram hash index section, where we store the hash (guint — 4 bytes), offset of the the list entry (guint - 4bytes) and the length of the entry (gchar — 1 byte), with the total size of 9 bytes for each index.

Stage 2: At the run time

Here we search for anagrams from the data created in stage 1. The search follows like:

  • Suppose we have a word “BAT”, we sort its letters and hash it, lets call the hash generated as H1 and the length of word i.e. 3 as LE1.
  • Now, search for H1 in the anagram hash index section, this will be a binary search as the section is always sorted, thus will be very fast.
  • Once found, store the offset (the next 4 bytes), called O1 and the length (the next 1 byte), called L1. The offset points to the anagram-indices stored in the angram-word-list section and the length tells us the number of anagrams.
  • Go to O1 and read the next L1 * 2 bytes. Every 2 bytes here are the index of the words that we want.
  • We get the indices, lets say I1 and I2, set a Fragment list of length LE1, and read the words at index I1 and I2. These are the required anagrams.

Thus we have found the anagrams, we can now show them to the user.

What’s done

Code to write data into word list file has been written and pushed into the main repository. The code creates the above discussed anagram-hash-index and anagram-word-list sections.

What’s left to do

We need to read the word-list file in the run time, find the anagrams using the above mentioned approach and display them to the user.

--

--