NeuML — 2021 Year in Review
NeuML develops groundbreaking data analytics and machine learning software to solve everyday problems. Our primary focus is developing a suite of open-source applications. We also provide consulting services around our open-source stack.
NeuML made great strides in 2021! We had over 500 total open-source commits towards our applications on GitHub. We also strengthened our projects to handle production workloads via a series of consulting efforts. This article will review the state of our projects and efforts along with a look ahead to 2022.
txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications. This is the foundational piece of software that all of our work stands on.
txtai had the following highlights for 2021:
- ⭐709 stars on GitHub to bring the total to ⭐1,382
- 408 total commits on GitHub
- 141 total issues resolved on GitHub
- 9 releases. Entered the year at v1.5.0 and finished at v3.7.0 with a big 4.0 release coming soon!
- Lines of code grew by over 500%
2021 saw a lot of development on a number of fronts. A full-fledged workflow framework was added, along with a number of pipelines to process data. Workflows streamline the processing of data into semantic indexes but can also be used as a general processing framework.
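To make the workflow idea concrete, here is a minimal sketch in plain Python of a batch-oriented workflow that chains processing steps. It is an illustration of the general pattern only, not txtai's implementation: `MiniWorkflow`, `strip_whitespace` and `lowercase` are hypothetical names invented for this example.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical names for illustration; this is not txtai's Workflow class.
Action = Callable[[List[str]], List[str]]


class MiniWorkflow:
    """Runs a list of actions over input data, one batch at a time."""

    def __init__(self, actions: List[Action], batch: int = 2):
        self.actions = actions
        self.batch = batch

    def __call__(self, elements: Iterable[str]) -> Iterator[str]:
        batch: List[str] = []
        for element in elements:
            batch.append(element)
            if len(batch) == self.batch:
                yield from self.process(batch)
                batch = []

        # Flush the final, possibly smaller, batch
        if batch:
            yield from self.process(batch)

    def process(self, batch: List[str]) -> List[str]:
        # Each action receives the output of the previous one
        for action in self.actions:
            batch = action(batch)
        return batch


def strip_whitespace(batch: List[str]) -> List[str]:
    return [text.strip() for text in batch]


def lowercase(batch: List[str]) -> List[str]:
    return [text.lower() for text in batch]


workflow = MiniWorkflow([strip_whitespace, lowercase])
results = list(workflow(["  Semantic Search ", "TXTAI Workflows", " Pipelines"]))
print(results)
```

Because the workflow is a generator, data streams through in batches rather than being loaded all at once, which is one reason this pattern also suits general-purpose processing outside of index building.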
Embeddings indexes added the ability to update and delete content; previously, a full rebuild was required. Indexes can also now be split into shards and aggregated over multiple nodes.
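The sketch below illustrates both ideas with a toy in-memory index: content is updated and deleted in place, and a query is run against several shards with the results merged. It is a simplified illustration and not how txtai works internally: `ToyIndex`, `vectorize` and `search_shards` are hypothetical names, and simple term counts stand in for a real embedding model.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def vectorize(text: str) -> Counter:
    # Stand-in for a real embedding model: simple term counts
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Hypothetical in-memory index supporting in-place updates and deletes."""

    def __init__(self) -> None:
        self.vectors: Dict[int, Counter] = {}

    def upsert(self, documents: Iterable[Tuple[int, str]]) -> None:
        # Insert new ids and overwrite existing ones, no full rebuild needed
        for uid, text in documents:
            self.vectors[uid] = vectorize(text)

    def delete(self, ids: Iterable[int]) -> None:
        for uid in ids:
            self.vectors.pop(uid, None)

    def search(self, query: str, limit: int = 3) -> List[Tuple[int, float]]:
        q = vectorize(query)
        scores = [(uid, cosine(q, vector)) for uid, vector in self.vectors.items()]
        return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]


def search_shards(shards: List[ToyIndex], query: str, limit: int = 3) -> List[Tuple[int, float]]:
    # Query every shard, then merge and re-rank the combined results
    results = [result for shard in shards for result in shard.search(query, limit)]
    return sorted(results, key=lambda x: x[1], reverse=True)[:limit]


shards = [ToyIndex(), ToyIndex()]
shards[0].upsert([(0, "semantic search with embeddings"), (1, "container orchestration")])
shards[1].upsert([(2, "medical literature review")])

shards[0].upsert([(1, "semantic search")])  # update id 1 in place
shards[1].delete([2])                       # remove id 2

top = search_shards(shards, "semantic search", limit=2)
print(top)
```

In a real deployment each shard would live on a separate node and the merge step would run in an aggregating service; here both shards sit in one process to keep the example self-contained.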
Path ahead for 2022
Looking ahead, txtai 4.0 will be released in early 2022. Version 4.0 adds content storage, querying with SQL, object storage, reindexing and index compression. This notebook walks through what’s new in 4.0 with a series of examples.
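As a rough illustration of why content storage and SQL pair well with a similarity index, the sketch below stores document text in SQLite and filters on both a similarity score and a regular column. Everything here is assumed for the example: the schema, the sample rows and the hard-coded scores (which stand in for the output of a vector search) are hypothetical and do not describe txtai's actual storage layout or query syntax.

```python
import sqlite3

# Hypothetical schema and data for illustration only
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT, source TEXT)")
connection.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (0, "Vaccine efficacy in adults", "paper"),
        (1, "Local team wins championship", "news"),
        (2, "Antibody response over time", "paper"),
    ],
)

# Assumed similarity scores (id -> score), standing in for a vector search result
scores = {0: 0.82, 1: 0.35, 2: 0.67}

connection.execute("CREATE TEMP TABLE scores (id INTEGER PRIMARY KEY, score REAL)")
connection.executemany("INSERT INTO scores VALUES (?, ?)", scores.items())

# Combine similarity scores with ordinary SQL filters over the stored content
rows = connection.execute(
    "SELECT d.id, d.text, s.score "
    "FROM documents d JOIN scores s ON d.id = s.id "
    "WHERE s.score >= ? AND d.source = ? "
    "ORDER BY s.score DESC",
    (0.5, "paper"),
).fetchall()

for row in rows:
    print(row)
```

With content stored next to the index, a search can return the matching text and metadata directly instead of only ids that the caller must look up elsewhere.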
txtai will continue to evolve on multiple fronts. The main components (Embeddings, Workflows/Pipelines and the API) will become more intertwined to work better together. Examples include workflows that build indexes, back up indexes or run scheduled, saved and automated queries. The API will further drive clustered indexes, and more work will be put into making it easier to run large indexes with container orchestration (e.g. Kubernetes).
paperai is an AI-powered literature discovery and review engine for medical/scientific papers. It helps automate tedious literature reviews, allowing researchers to focus on their core work.
paperai had the following highlights for 2021:
- ⭐177 stars on GitHub to bring the total to ⭐695
- 52 total commits on GitHub
- 24 total issues resolved on GitHub
- 5 releases. Entered the year at v1.5.0 and finished at v1.10.0.
While the majority of our open-source efforts in 2021 were on txtai, paperai is the most popular project that uses txtai. It has roots in the COVID-19 Kaggle Challenge, and the ideas there inspired the creation of paperai. The common NLP work here is where txtai got its start.
On top of that, the majority of our consulting efforts to date are centered around paperai. Those efforts helped paperai evolve over the course of 2021 into a production-ready project and also helped improve the txtai ecosystem. paperetl is a companion project that holds the raw processing logic for extracting content from medical/scientific papers.
Looking ahead, paperai will have a 2.0 release in early 2022 that adds support for a number of different data sources. Version 2.0 also improves report generation, adds additional options and has a more streamlined code base.
codequestion is a Python application that allows a user to ask coding questions directly from the terminal. Many developers will have a web browser window open while they develop and run web searches as questions arise. codequestion attempts to make that process faster so you can focus on development. This article gives a full overview.
codequestion had the following highlights for 2021:
- ⭐36 stars on GitHub to bring the total to ⭐261
- 11 total commits on GitHub
- 2 total issues resolved on GitHub
- 3 releases. Entered the year at v1.1.0 and finished at v1.4.0.
codequestion clearly wasn’t a focus in 2021, although it has a longer history than many of our projects. The first release was in January 2020, before txtai and paperai. When COVID-19 hit in March 2020, the ideas in codequestion were the basis for our work in the COVID-19 Kaggle Challenge. Since then, much has been rolled up into txtai. Looking ahead, there will be work in 2022 to incorporate txtai 4.0, which will in turn reduce the codebase since many of the ideas have made it up into txtai.
tldrstory is a framework for AI-powered understanding of headlines and text content related to stories. tldrstory applies zero-shot labeling over text, which allows dynamically categorizing content. This article gives a full overview.
tldrstory had the following highlights for 2021:
- ⭐51 stars on GitHub to bring the total to ⭐244
- 11 total commits on GitHub
- 2 total issues resolved on GitHub
- 2 releases. Entered the year at v1.2.0 and finished at v1.4.0.
tldrstory didn’t have the focus that txtai and paperai had in 2021. Behind the scenes, it had a major impact on the 4.0 changes in txtai. tldrstory was the first project to have content storage and an embedding index combined as one. While this will continue to be more of a minor project, there will be work in 2022 to incorporate txtai 4.0, which will in turn reduce the codebase since many of the ideas have made it up into txtai.
Outside of routine maintenance and updates, neuspo didn’t evolve much in 2021. As covered in last year’s recap, while neuspo was a part of the plan in early 2020, things changed over the course of that year. Like codequestion, many of its ideas predate our open-source work and have been incorporated into txtai.
NeuML provides consulting services around our open-source stack as follows:
- Advisory and Strategy Support: Build out your data and AI strategy, leveraging our deep expertise
- Model Development: Create custom AI, Machine Learning and/or NLP models to excel in industry-specific domains
- Training: Group training covering how to implement our open-source stack
- AI-driven Literature Review: Automate reviews of large-scale unstructured literature datasets
- Market Research: Gather statistics on specific market trends and competitive analysis
- Social Media Analytics: Sentiment analysis, event discovery, trend detection and summarization
Our efforts in 2021 were centered around paperai and txtai. paperai attracted a lot of interest following our efforts in 2020, and we’ve found that it applies to a wide range of problems.
If we revisit the activity chart at the beginning of this article, lulls in project development can be seen over the course of the year. It can be challenging to simultaneously develop an open-source platform and support consulting efforts. But it is important to prove that these projects solve real-world problems. After all, our tagline is “Applying machine-learning to solve everyday problems”.
Looking ahead into 2022, we expect paperai will be a primary focus for our consulting efforts but that txtai will gain more traction, especially with the changes in 4.0.
This article covered the state of NeuML’s projects and efforts in 2021 and plans for 2022. The overarching theme here was one of consolidation. As new open-source projects have been created, the best ideas have ultimately rolled up into txtai. Then those newer projects integrate logic with the updated versions of txtai.
Our consulting efforts are important to prove that our projects can solve real-world problems. Consulting can be tough for an open-source software company to balance, but it’s crucial in building a well-rounded company with a sustainable business model.