Free and Open-Source Grammar Correction in Neovim Using LTeX and N-Grams

Erik Krieg
5 min read · Jan 14, 2024


Created by the author with DALL-E 3

After devoting so much time to learning and configuring Neovim, I take psychic damage every time I find myself writing more than a few sentences outside my terminal. When I have written outside Neovim, it’s generally been because I was writing docs, notes, or blog posts and had neglected to configure tooling that makes this pleasant in my editor. Maybe you can relate; if so, I hope that after this you too never need to :q.

More specifically, we’re going to make writing markup documents, like Markdown, LaTeX, and more, better by enabling grammar correction with the LTeX Language Server through Neovim’s LSP client. We’ll download a dataset of N-Grams to make LTeX more effective. What we’ll end up with is an open-source solution for grammar correction that works entirely offline.

I’ll also spend some time at the end demystifying LTeX and N-Grams a bit.

Configure LTeX Language Server

Besides Neovim, the requirements for this are:

- nvim-lspconfig
- LTeX Language Server (ltex-ls)
- N-Gram data

Getting started with nvim-lspconfig is out of scope, but if you need guidance, there are many excellent tutorials that cover the basics.

For ltex-ls, you’ll be installing a binary that nvim-lspconfig will eventually execute once we have it set up. Install ltex-ls with your preferred package-management tool. For me, that is Nix, but for editor tooling like language servers, mason.nvim is a popular option.
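If you go the mason.nvim route, a minimal sketch of the install could look like the following. This assumes you also use mason-lspconfig.nvim, which understands nvim-lspconfig server names:

-- Minimal sketch, assuming mason.nvim and mason-lspconfig.nvim are installed
require("mason").setup()
require("mason-lspconfig").setup({
  -- "ltex" is the nvim-lspconfig name for ltex-ls
  ensure_installed = { "ltex" },
})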

Once these are installed, a basic setup could be as simple as:

require("lspconfig").ltex.setup({})

But I want to include a few settings for ltex-ls that improve grammar error detection. I also want to make sure some of my own LSP customizations are applied, which is what the config() wrapper in the next snippet does.


require("lspconfig").ltex.setup(config({
settings = {
ltex = {
language = "en",
additionalRules = {
languageModel = "~/models/ngrams/",
},
},
},
}))

Here’s what’s going on:

- language = "en" tells ltex-ls which language to check text against.
- additionalRules.languageModel points ltex-ls at a directory of N-Gram language-model data, which it uses to catch errors the built-in rules would miss.

Even if the N-Grams model has not been downloaded yet, this configuration should still work. If we open up a new Markdown file and start writing, there shouldn’t be any LSP errors; we just won’t have the full-fledged experience yet.

Luckily, it’s not hard to change that, though it might take a few minutes. We’ll want to download the N-Gram data for the languages we expect to work with (LanguageTool publishes it at https://languagetool.org/download/ngram-data/) and unzip it to ~/models/ngrams. We could put this anywhere; this is just the path I ended up using. A heads-up: the English model is about 8 GB zipped and 14 GB unzipped.

The resulting folder structure looks like this:

models
└── ngrams
└── en
├── 1grams
├── 2grams
└── 3grams

Okay, it should be doing its thing now. Try writing something, or open an existing Markdown file you have on hand. Get a feel for it, then check out the other available settings for ltex-ls to find the right config for you; a fuller example follows below.
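For reference, here’s a slightly fuller sketch showing a few of those settings. The specific values are illustrative assumptions rather than recommendations, and I’ve dropped my config() wrapper to keep the example self-contained; the ltex-ls settings documentation lists everything available:

require("lspconfig").ltex.setup({
  settings = {
    ltex = {
      language = "en",
      additionalRules = {
        languageModel = "~/models/ngrams/",
        enablePickyRules = true, -- opt into stricter style rules
      },
      -- words LTeX should stop flagging as misspelled
      dictionary = { ["en"] = { "Neovim", "lspconfig" } },
      -- LanguageTool rule IDs to silence entirely
      disabledRules = { ["en"] = { "WHITESPACE_RULE" } },
    },
  },
})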

Here are a couple of examples of how it looks in my Neovim configuration.

Inlay code hint

Code action preview

This will of course vary depending on your theme, LSP configuration, and additional UI plugins. For the previews I’m using actions-preview.nvim.

How does this work?

When I first set up LTeX LS with N-Grams, I was glad it was so easy, but I also found it pretty mysterious. How does this work? What tools are involved?

Going through LTeX’s docs, I learned that LTeX LS is essentially LanguageTool wrapped in a language server: it implements the Language Server Protocol and hands markup documents (LaTeX, Markdown, and others) to an embedded LanguageTool for grammar and spell checking.

This satisfied most of my immediate curiosity about LTeX LS, but what about N-Grams? What are they, and how do they help discover grammar mistakes?

What are N-Grams?

An N-Gram is a contiguous sequence of n tokens, where tokens could be words or syllables (or other subdivisions of text). Here are a few examples derived from the phrase What are N-Grams:

  • 1-gram or unigram: What, are, N-Grams
  • 2-gram or bigram: What are, are N-Grams
  • 3-gram or trigram: What are N-Grams
  • etc…

There are models created from large collections of N-Grams that record the frequency at which distinct N-Grams occur within some corpus. With this, you might be able to determine whether What are N-Grams? or That are N-Grams? is more likely. If a phrase is very unlikely, perhaps it is not grammatically correct. That, at least, is my understanding of how N-Grams can be used.
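To make that concrete, here’s a toy sketch in Lua that extracts word-level N-Grams from a sentence and looks them up in a table of made-up corpus counts (real models store counts for millions of N-Grams on disk):

-- Split a sentence into whitespace-delimited words and return its n-grams
local function ngrams(sentence, n)
  local words = {}
  for word in sentence:gmatch("%S+") do
    table.insert(words, word)
  end
  local result = {}
  for i = 1, #words - n + 1 do
    table.insert(result, table.concat(words, " ", i, i + n - 1))
  end
  return result
end

-- Hypothetical corpus frequencies; the numbers are invented for illustration
local counts = { ["What are"] = 120000, ["That are"] = 300 }

for _, gram in ipairs(ngrams("That are N-Grams", 2)) do
  print(gram, counts[gram] or 0) -- a rare or unseen n-gram hints at an error
end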

Further digging revealed some LanguageTool docs that explain how LT uses N-Gram models to detect errors involving easily confused word pairs (their/there, to/too, and the like) that rule-based checks alone would miss.

How effective are LTeX & N-Grams?

This is what I’m left thinking about. This article is the first writing I’ve done using these tools. I can attest that they do find grammar mistakes, and I really like that the setup uses open-source parts and works completely offline (I actually wrote part of this article during a power outage, without the Internet).

But how good a job does this do compared to other services, like Grammarly, or even LanguageTool’s premium offering? How much more effective can I make LTeX LS with additional rules or models?

I don’t know yet, but I intend to continue iterating on my configuration as well as explore the competitors.

If you have experience with any of these solutions, or with something else, I’d genuinely like to hear about it; leave a comment!
