Add Natural Language Understanding to any application

Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation of the internet and an ever-growing challenge that is never solved or done.

The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability allowing us to add amazing functionality quickly with limited compute and people. Innovation continues with new models and advancements coming in on what seems like a weekly basis.

This article introduces txtai, an AI-powered search engine that enables Natural…


A look back at founding NeuML in a year we’ll never forget

Photo by Priscilla Du Preez on Unsplash

NeuML was founded in January 2020. Entering the year, our initial focus was neuspo, a fact-driven, real-time sports event and news site. neuspo was released in February with analysis focused on the upcoming NCAA basketball tournament. This work developed a following and algorithms were ready to predict the matchups.

In early March, it’s safe to say the world went to crap. Sports were cancelled and everything changed.

NeuML in 2020

Instead of waiting for normalcy to return, NeuML made the best of the situation. We refocused on “Applying machine learning to solve everyday problems”. …


Continuous Integration is a breeze with GitHub Actions

Photo by Wilhelm Gunkel on Unsplash

Do you use GitHub for version control? Do you like to test your code? The answer to question #1 for many is yes and everyone should answer yes to question #2. Great, now that we’re on the same page, let’s talk about GitHub Actions.

GitHub Actions is a framework for automating workflows. Workflows are created in YAML and automatically run right on GitHub repositories. The link below gives an excellent overview of GitHub Actions.

The documentation above does a great job introducing GitHub Actions along with workflow file syntax. …


Analyze headlines and story text with Streamlit, Transformers and FastAPI

Vast amounts of media, news and commentary are generated on a daily basis. Headlines and other attention-grabbing content are constantly put on our screens to try to get us to click through. Putting together a good headline is almost as important as the content within an article and there are teams of people dedicated to it.

Natural Language Processing (NLP) is a large and growing field focused on the application of machine learning to attain human-level understanding of textual data. Large-scale general language models are an exciting new capability allowing us to add amazing functionality quickly with limited compute and people…


Do more than compete

Photo by Luke Chesser on Unsplash

Kaggle is one of the most popular places to get started with data science and machine learning. Most in the data science world have used or at least heard of it. Kaggle is well-known as a site that hosts machine learning competitions and while that is a big part of the platform, it can do much more.

This year with the COVID-19 Open Research Dataset (CORD-19), I had the chance to use the platform more consistently. Honestly, Jupyter notebooks and GUI-based development haven’t been my preferred approach (Vim is often good enough for me). But over the last few months…


A perspective from an unexpected participant

Photo by Martin Sanchez on Unsplash

This story covers my experience using machine learning and data science to help researchers find answers in the COVID-19 Open Research Dataset (CORD-19). CORD-19 was released “to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease”. What started as an effort to chip in and help has led to the work I’ve done being covered in a Wall Street Journal article and cited on the COVID-19 Kaggle community contributions page.

This article gives a background on how I got involved, along with the evolution of my technical approach to…


Parsing, formatting, time zones, and more

Photo by chuttersnap on Unsplash

Python is a great language with a number of easy-to-use features baked in. For almost any problem, there’s either direct-language support or a library. That ease of use and rich ecosystem make it a go-to language for many.

For those of us with experience using Java and JavaScript, working with dates in Python isn’t as intuitive or easy. It just feels hard to implement what we’d consider straightforward functionality. But it’s not as bad as it appears. This piece will go through a couple of helpful tips for working with dates.
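As a rough illustration of the kind of date handling discussed here, the sketch below parses a string, attaches a time zone, converts to UTC and formats the result using only the standard library. The sample date string and the fixed UTC-5 offset are assumptions made up for this example; they are not taken from the original article.

```python
from datetime import datetime, timedelta, timezone

# Parse a date string with an explicit format (the sample value is made up)
parsed = datetime.strptime("2020-03-01 14:30:00", "%Y-%m-%d %H:%M:%S")

# Attach a fixed UTC-5 offset (an assumption for this sketch) to the naive datetime
eastern = parsed.replace(tzinfo=timezone(timedelta(hours=-5)))

# Convert the time zone aware datetime to UTC
utc = eastern.astimezone(timezone.utc)

# Format the results back into strings
print(eastern.isoformat())             # 2020-03-01T14:30:00-05:00
print(utc.strftime("%Y-%m-%d %H:%M"))  # 2020-03-01 19:30
```

A fixed offset keeps the sketch dependency-free. For real-world zones with daylight saving rules, a named zone from the standard `zoneinfo` module (Python 3.9+) is likely the better fit.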

1. Parsing

The first use case covers when there’s a date…


Using webelapse

Photo by Jake Blucker on Unsplash

Time-lapse video is a popular way to show an area or event over a long period of time. Examples of time-lapse video include showing a city over multiple sunrises and sunsets or the flow of traffic on a road over a long period of time.

The same concept can be applied to a dynamic real-time website with frequently updated data. This use case came up in the development of a recent project, neuspo. In order to demo how the site works, a method was needed to build time-lapse video of the site over a period of time.


Discover high quality sports content

Sports is a topic with many voices, opinions and commentary. Information is plentiful, but objective, descriptive, real-time information is hard to find in an opinionated world. With social media in particular, it’s hard to separate fact from opinion and follow what is actually happening. Finding objective, descriptive, real-time information is not as easy as it should be.

Objective sports data

neuspo is laser-focused on identifying fact-driven, real-time event data from sports social media and news data. The platform looks to set an example of media providing tailored experiences to distinct topics of interest, with clarity into how content is recommended.


Some key concepts

Photo by Anthony Cantin on Unsplash

When developing a text search system, the first order of business is how to index data. The most common type of text search is token-based search. Users enter a query, the query is tokenized, and a corpus of documents is searched for the best matches.
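The flow described above (tokenize the query, then search the corpus for the best matches) can be sketched in a few lines of Python. This is a minimal sketch under simple assumptions, not how a production search engine scores results: the `tokenize` and `search` helpers, the regex-based tokenizer and the sample documents are all hypothetical and made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def search(query, documents):
    """Rank documents by how many times they contain the query tokens."""
    query_tokens = set(tokenize(query))
    scores = []
    for index, document in enumerate(documents):
        # Count token occurrences in the document
        counts = Counter(tokenize(document))
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            scores.append((index, score))
    # Best matches first
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical corpus for illustration
documents = [
    "Time-lapse video of city traffic",
    "Real-time sports event data",
    "Sports news and commentary",
]

print(search("sports data", documents))  # [(1, 2), (2, 1)]
```

Each result is a `(document index, score)` pair. Note how the tokenizer choice matters: splitting on non-alphanumeric characters turns "Real-time" into two tokens, "real" and "time", which changes what queries will match.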

How text is tokenized is an important factor in how users find documents. This article covers some key concepts to consider when tokenizing text to index in search systems. Enterprise search systems such as Elasticsearch have all of the discussed functionality built-in but it’s best to understand the concepts. …

David Mezzetti

Founder/CEO at NeuML — applying machine learning to solve everyday problems. Previously co-founded and built Data Works into a successful IT services company.
