I Want to Create a Thesaurus

Abhijit Mahabal
Published in Reading Log
Oct 14, 2016

It is my dream to create a thesaurus. I have had this desire for at least a decade now, and I have, time and again, dabbled in spelling out what it might look like and how precisely it would differ from, say, Roget’s.

Of course, I realize that this is not an easy task. Peter Mark Roget toiled for 47 years before the thesaurus was released to the public in 1852, and I am currently reading The Meaning of Everything: The Story of the Oxford English Dictionary, which depicts that great tome’s 71-year journey from conception to reality. I have no intention of solitary toil on arcana, though I have gratitude for those who have thus toiled. My only hope is to work out a scheme whereby even a partial product would be useful, and to harness computers and their ability to sift through millions of documents.

Before I describe how a different thesaurus may be shaped, some remarks on what a thesaurus is are in order.

Roget’s thesaurus consists of about a thousand entries (he started with a thousand, and in the latest versions of Roget’s International Thesaurus that number stands around 1080). Each entry is about a concept. Consider entry #457 in the 1911 version, which is about “attention”. It mentions nouns, such as mindfulness, intentness, and alertness. It has verbs such as attend, give a thought to, trouble one’s head about, and animadvert. It also has adjectives (preoccupied) and interjections (lo! behold!). My summary hardly does it justice: the words are arranged into subgroups, and these reveal the subtle shades of meaning of words and expressions related to attentiveness.

Unlike a dictionary, which is organized by headword and describes the many meanings of each term, a thesaurus is organized by meaning and lists the associated terms.

What set of meanings is needed to cover all that needs covering? That, of course, raises the question of what needs covering. It is instructive to look at the course taken by a work that was closely based on Roget’s. The Hindi thesaurus बृहत् समांतर कोश (Bruhat Samantar Kosh) by Arvind Kumar and Kusum Kumar (yet another epic journey to be cognizant of, clocking in at 43 years) started, at its inception, with the same thousand categories, with the intent to tweak lightly as needed. It soon became apparent, write the authors in their preface, that this was inadequate: the vast cultural gulf between the world of Roget and that of present-day India necessitated more than tweaks. For instance, Hindi has a far broader religious vocabulary than does English, thanks in part to the purported existence of 330 million gods, many with over a thousand names, but also due to the more central role religion plays in daily life in India. The religious categories, if memory serves, make up about 40% of the entries in the Samantar Kosh.

What would I do differently? Besides the two modes of grouping displayed by the dictionary and the thesaurus (by headword and by meaning), there are doubtless other forms of organization that have utility. What I am after is organization by purpose: given a particular writing task, what aspects can one focus on? What words can get that meaning across? What rhetorical moves, what choice of strong verbs, what sentence structures are useful?

I will provide one concrete example of what an entry may look like in the new thesaurus. Consider the task of writing a review of a book you just read and thought was good. What, indeed, makes a book good? There are many dimensions (or aspects) of being good, and these are what would be cataloged in the entry: it provides information (informative, enriching, enlightening, edifying, instructive, revealing, educative), is funny (fun, amusing, humorous, witty, hilarious, funny, comical), is somehow strange (quirky, surprising, surreal, whimsical, offbeat, bizarre, peculiar, eerie, eclectic, eccentric), is uplifting (inspiring, uplifting, inspirational, motivating, encouraging, exhilarating), is insightful (insightful, imaginative, eloquent, clever, thoughtful, erudite, nuanced, perceptive, well researched, incisive, cogent, inventive, lucid, evocative, articulate), is emotionally stirring (poignant, heartwarming, disturbing, unsettling, shocking, heartbreaking, touching, humbling, sobering, haunting; relatable, lovable, endearing), is hard to put down (interesting, fascinating, engaging, entertaining, intriguing, riveting, enjoyable, engrossing, captivating), is provocative (provocative, titillating, profane, irreverent, offensive), is believable (compelling, convincing, comprehensible), and so forth.
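An entry of this kind could be held in a very simple data structure. The following is only a sketch under my own assumptions: the class name `PurposeEntry`, its fields, and the two lookup helpers are hypothetical, and the word lists are abbreviated from the example above. It is not a description of any existing tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PurposeEntry:
    """A hypothetical entry: one writing purpose, split into aspects."""
    purpose: str
    # Maps an aspect (a dimension of the purpose) to candidate words.
    aspects: Dict[str, List[str]] = field(default_factory=dict)

    def words_for(self, aspect: str) -> List[str]:
        """Return the words cataloged under an aspect (empty if unknown)."""
        return self.aspects.get(aspect, [])

    def aspects_for(self, word: str) -> List[str]:
        """Return every aspect under which a word is cataloged."""
        return [name for name, words in self.aspects.items() if word in words]


# Abbreviated version of the book-review example; the lists are illustrative.
GOOD_BOOK = PurposeEntry(
    purpose="say why a book is good",
    aspects={
        "provides information": ["informative", "enlightening", "instructive"],
        "funny": ["fun", "amusing", "witty", "hilarious"],
        "strange": ["quirky", "surreal", "whimsical", "offbeat"],
        "uplifting": ["inspiring", "motivating", "exhilarating"],
        "hard to put down": ["riveting", "engrossing", "captivating"],
    },
)

if __name__ == "__main__":
    print(GOOD_BOOK.words_for("funny"))
    print(GOOD_BOOK.aspects_for("surreal"))
```

Keeping the aspect as the key, rather than the word, mirrors the point of the entry: the writer starts from what they want to say about the book and only then looks for words.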

Such lists are best used as tools to jog memory and imagination, not as crutches. Once we see a category here that seems to fit the work we wish to describe as “good” (perhaps we were motivated by it or impressed by the erudition), we may flesh out reasons for that belief about that book. The list above has thus provided no more than a seed, a mere catalyst.

All these words go together not because they mean the same thing (they don’t), but because they fill a particular need. There are many other needs, big and small, that arise while writing: how to introduce a new term, how to present an example, how to lay out an analogy, how to state that you understand that things are more complicated, and so forth. This level of organization, the organizing and cataloging of purpose, is much harder to get at, and furthermore strays into the territory of the highly subjective.

For my day job, I get computers to do my heavy lifting relating to language. I am hopeful that tools can be built to make some small portion of the vision above a reality. Unlike a dictionary, where stopping after getting from “A” to “Ant” leaves us with an utterly useless product, a purpose-based thesaurus can be of some use however little progress it has made (in any case, there is no sense in which it can be truly completed).

The editors of the OED were able to get hundreds of people excited about their project. These volunteers scoured thousands of books, looking for unusual ways in which words had been used, carefully noting them down on slips of paper (word at top right corner, the quote, author, year, and page number), and sending them in. There ended up being tons of such slips.

If I knew how to get thousands of people excited about tasks such as these (as the editors of the OED managed, though not always smoothly), I could have an army of volunteers scouring writing for techniques: how did this author achieve this particular task on this page? What dimensions did they choose along which to describe individuals, their appearance, their contributions, their theories? What words did they employ?

For completeness, it must be mentioned that books of the sort I envisage do exist for particular niches. I have not read the book How to Write Thank You Letters, but if it is true to its title, it must discuss aspects that one would do well to bear in mind when writing thank-you notes, and it would surely contain many phrases that can be employed.

I do plan, for my part, to keep chipping away at this, at least for my personal use and amusement, but also with the end goal of getting this out the door.

If you think of something I must acquaint myself with for this journey, be sure to post a comment.

Abhijit Mahabal

I do unsupervised concept discovery at Pinterest (and previously at Google). Twitter: @amahabal