NeuML — 2021 Year in Review

Recapping 2021 and looking ahead to 2022

David Mezzetti
NeuML
Jan 4, 2022

NeuML develops groundbreaking data analytics and machine learning software to solve everyday problems. Our primary focus is developing a suite of open-source applications. We also provide consulting services around our open-source stack.

NeuML made great strides in 2021! We had over 500 total open-source commits towards our applications on GitHub. We also explored consulting efforts to strengthen our projects to handle production workloads. This article will review the state of our projects and efforts along with a look ahead to 2022.

txtai

txtai is an open-source platform for semantic search and workflows powered by language models. This is the foundational piece of software that all of our work stands on.

txtai had the following highlights for 2021:

  • 709 stars on GitHub to bring the total to ⭐1,382
  • 408 total commits on GitHub
  • 141 total issues resolved on GitHub
  • 9 releases. Entered the year at v1.5.0 and finished at v3.7.0 with a big 4.0 release coming soon!
  • Lines of code grew over 500%

There was a lot of development on a number of fronts. In 2021, a full-fledged workflow framework was added, along with a number of pipelines to process data. Workflows streamline the processing of data into semantic indexes but can also be used as a general processing framework.

Embeddings indexes added the ability to update and delete content; previously, a full rebuild was required. Indexes can now also be split into shards and aggregated over multiple nodes.
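To make the update/delete and sharding ideas concrete, here is a minimal, standard-library-only sketch. It is not txtai's implementation: `ToyIndex`, `embed`, `cosine` and `search_shards` are hypothetical names invented for this illustration, and a bag-of-words vector stands in for a real language model embedding.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a language model: a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Hypothetical in-memory index keyed by document id (assumption, not txtai)."""

    def __init__(self):
        self.vectors = {}

    def upsert(self, documents):
        # Insert new ids or overwrite existing ones; other entries are untouched
        for uid, text in documents:
            self.vectors[uid] = embed(text)

    def delete(self, ids):
        # Remove only the given ids, without rebuilding the remaining entries
        for uid in ids:
            self.vectors.pop(uid, None)

    def search(self, query, limit):
        # Score every stored vector against the query and keep the best matches
        q = embed(query)
        scores = [(uid, cosine(q, v)) for uid, v in self.vectors.items()]
        return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]


def search_shards(shards, query, limit):
    """Query each shard, then merge the partial results into one top-N list."""
    results = [r for shard in shards for r in shard.search(query, limit)]
    return sorted(results, key=lambda x: x[1], reverse=True)[:limit]


shard1, shard2 = ToyIndex(), ToyIndex()
shard1.upsert([(0, "ice hockey playoffs begin"), (1, "stock market rallies")])
shard2.upsert([(2, "hockey team wins final"), (3, "new vaccine study published")])

shard1.upsert([(1, "hockey season ticket prices rise")])  # update id 1 in place
shard2.delete([3])                                        # delete id 3

top = search_shards([shard1, shard2], "hockey", 2)
print(top)
```

Each shard only needs to return its own top results; the aggregation step merges and re-sorts them, which is what makes it possible to spread shards across multiple nodes.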

Path ahead for 2022

Looking ahead, txtai 4.0 will be released in early 2022. Version 4.0 adds content storage, querying with SQL, object storage, reindexing and index compression. This notebook walks through what’s new in 4.0 with a series of examples.
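As a rough illustration of what storing content alongside an index and querying it with SQL can look like, the sketch below uses Python's built-in `sqlite3` module. It does not show txtai 4.0's actual query syntax or storage layout: the `documents` table, its columns and the `similarity` function are all assumptions made for this example, and token overlap stands in for vector similarity.

```python
import sqlite3


def similarity(query, text):
    """Toy stand-in for vector similarity: Jaccard overlap of lowercase tokens."""
    a, b = set(query.lower().split()), set(text.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


# Content lives in a SQL table next to the (here, simulated) index
db = sqlite3.connect(":memory:")
db.create_function("similarity", 2, similarity)
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT, source TEXT)")
db.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (0, "wildfires spread across the region", "news"),
        (1, "team wins the championship game", "sports"),
        (2, "championship parade draws a large crowd", "news"),
    ],
)

# A single SQL statement combines a similarity score with a regular column filter
rows = db.execute(
    """
    SELECT id, text, similarity(?, text) AS score
    FROM documents
    WHERE source = 'news' AND similarity(?, text) > 0
    ORDER BY score DESC
    """,
    ("championship game", "championship game"),
).fetchall()

for row in rows:
    print(row)
```

The point of the design is that similarity search and ordinary filters (here, `source = 'news'`) are expressed in one query, and the matching text comes back with the result instead of requiring a second lookup by id.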

txtai will continue to evolve on multiple fronts. The main components (Embeddings, Workflows/Pipelines and the API) will become more intertwined to work better together. For example, workflows could build indexes, back up indexes or run scheduled, saved or automated queries. The API will further drive clustered indexes, and more work will be put into making it easier to run large indexes with container orchestration (e.g. Kubernetes).

paperai

paperai is a semantic search and workflow application for medical/scientific papers. It helps automate tedious literature reviews allowing researchers to focus on their core work.

paperai had the following highlights for 2021:

  • 177 stars on GitHub to bring the total to ⭐695
  • 52 total commits on GitHub
  • 24 total issues resolved on GitHub
  • 5 releases. Entered the year at v1.5.0 and finished at v1.10.0.

While the majority of our open-source efforts in 2021 were on txtai, paperai is the most popular project that uses txtai. It has roots in the COVID-19 Kaggle Challenge; the ideas there inspired the creation of paperai. The common NLP work here is where txtai got its start.

On top of that, the majority of our potential consulting efforts are centered around paperai. These initial efforts have helped this project evolve into a production-ready project and improve the txtai ecosystem. paperetl is a companion project and has the raw processing logic for extracting content from medical/scientific papers.

Looking ahead, paperai will have a 2.0 release in early 2022 that adds support for a number of different data sources. Version 2.0 also improves report generation, adds additional options and has a more streamlined code base.

codequestion

codequestion is a Python application that allows a user to ask coding questions directly from the terminal. Many developers will have a web browser window open while they develop and run web searches as questions arise. codequestion attempts to make that process faster so you can focus on development. This article gives a full overview.

codequestion had the following highlights for 2021:

  • 36 stars on GitHub to bring the total to ⭐261
  • 11 total commits on GitHub
  • 2 total issues resolved on GitHub
  • 3 releases. Entered the year at v1.1.0 and finished at v1.4.0.

codequestion clearly wasn’t a focus in 2021, although it has a longer history than many of our projects. The first release was in January 2020, before txtai and paperai. When COVID-19 hit in March 2020, the ideas in codequestion were the basis for our work in the COVID-19 Kaggle Challenge. Since then, much has been rolled up into txtai. Looking ahead, there will be work in 2022 to incorporate txtai 4.0, which will in turn reduce the codebase since many of the ideas have made it up into txtai.

tldrstory

tldrstory is a semantic search application for headlines and text content related to stories. tldrstory applies zero-shot labeling over text, which allows content to be categorized dynamically. This article gives a full overview.

tldrstory had the following highlights for 2021:

  • 51 stars on GitHub to bring the total to ⭐244
  • 11 total commits on GitHub
  • 2 total issues resolved on GitHub
  • 2 releases. Entered the year at v1.2.0 and finished at v1.4.0.

tldrstory didn’t have the focus that txtai and paperai had in 2021. Behind the scenes, it had a major impact on the 4.0 changes in txtai. tldrstory was the first project to have content storage and an embedding index combined as one. While this will continue to be more of a minor project, there will be work in 2022 to incorporate txtai 4.0, which will in turn reduce the codebase since many of the ideas have made it up into txtai.

neuspo

neuspo was started with a commitment to discovering objective, descriptive and real-time sports information. This article gives a full overview of neuspo.

Outside of routine maintenance and updates, it didn’t evolve much in 2021. As covered in last year’s recap, while neuspo was a part of the plan in early 2020, things changed over the course of last year. Like codequestion, many of the ideas predate our open-source work and have been incorporated into txtai.

Consulting Services

NeuML provides consulting services around our open-source stack as follows:

  • Advisory and Strategy Support: Build out your data and AI strategy, leveraging our deep expertise
  • Model Development: Create custom AI, Machine Learning and/or NLP models to excel in industry-specific domains
  • Training: Group training covering how to implement our open-source stack
  • AI-driven Literature Review: Automate reviews of large-scale unstructured literature datasets
  • Market Research: Gather statistics on specific market trends and competitive analysis
  • Social Media Analytics: Sentiment analysis, event discovery, trend detection and summarization

Our efforts in 2021 were centered around paperai and txtai. paperai attracted a lot of interest following our efforts in 2020, and we’ve found that it is applicable to a wide range of problems.

If we revisit the activity chart at the beginning of this article, lulls in project development can be seen over the course of the year. It can be challenging to simultaneously develop an open-source platform and support consulting efforts. But it is important to prove that these projects solve real-world problems. After all, our tagline is “Applying machine-learning to solve everyday problems”.

Looking ahead into 2022, we expect paperai will be a primary focus for our consulting efforts but that txtai will gain more traction, especially with the changes in 4.0.

Wrapping up

This article covered the state of NeuML’s projects and efforts in 2021 and plans for 2022. The overarching theme here was one of consolidation. As new open-source projects have been created, the best ideas have ultimately rolled up into txtai. Then those newer projects integrate logic with the updated versions of txtai.

Our consulting efforts are important to prove that our projects can solve real-world problems. Consulting can be tough for an open-source software company to balance, but it’s crucial in building a well-rounded company with a sustainable business model.

Thank you for reading and please follow us on LinkedIn, Twitter and Facebook to check in on how we’re doing over the course of 2022!

David Mezzetti
NeuML

Founder/CEO at NeuML. Building easy-to-use semantic search and workflow applications with txtai.