How I built and launched an AI product for under $100

Phil Gooch
6 min read · Jun 25, 2018


In this short post, I’m going to talk about how I built and deployed the initial version of Scholarcy, a service that uses a combination of machine learning and symbolic AI to read and summarise research papers and reports in real time.

Despite the democratisation of machine learning provided by frameworks such as PyTorch, TensorFlow, Keras, and excellent libraries such as scikit-learn, gensim, spaCy, textaCy and others, building an AI-first product is still seen as difficult and expensive.

Even if you are developing something around pre-trained, off-the-shelf models, going beyond a demo is definitely still hard — but does it need to be expensive? For image processing, you can certainly do real-time object detection on cheap, low-powered hardware (although the frame rate will be painfully low), but for text processing you are going to have a tough time deploying a fastText or word2vec model on a machine with less than 16GB RAM.

An AI startup might typically start with a landing page and a demo, build interest to demonstrate demand, and then seek seed funding — say $100,000 to $1 million — to build the team and infrastructure to create the product. But for those just starting out, without a budget or funding, how much might you need to spend to build and deploy a production-worthy AI that does something useful?

I didn’t initially set out to find an answer to this question. I just wanted to create something that solved a problem I had:

My Twitter feed was full of interesting research papers that I wanted to read, but I didn’t have time to read them.

I wanted something that would give me a summary of the whole document — more in depth than the abstract, but less than the full paper. I wanted something that would do more than provide the keywords, but give me complete phrases that contained facts, claims or important points — as if it were a magic, virtual marker pen that somehow knew what needed to be highlighted. I also had a bunch of papers and reports as PDF and Word files, so I needed something that would work with these formats.

Now, my background is in text processing and data science, so I had a pretty good idea about what was required to build some sort of command line tool that had this functionality. Previously, I had developed or contributed to various open-source tools that addressed some of these problems. But at the time, I had no background in building or deploying an actual product. My preferred languages are Python and Ruby, so this ruled out extending many of the open-source tools that can parse PDFs, such as Apache Tika, Grobid or science-parse, as these are written in Java. I realised I was going to need to pretty much start from scratch, and overcome the following challenges:

  • to be fast and work in real-time — able to process a document in a few seconds or less
  • to be trainable and runnable on my laptop
  • to be deployable to relatively cheap cloud-based services, so a $20 a month Linode VPS for example
  • to be extendable, so I could deploy additional models as I added new features

The difficulties in processing PDFs are well documented elsewhere. Tools such as Poppler will give you the raw text, but to get usable text from a PDF you need to:

  • identify and remove headers and footers
  • identify document elements, such as title, author, headings, columns, figures, tables
  • reconstruct and reflow the text around these elements
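The header/footer step in particular lends itself to a simple heuristic: lines that repeat near the top or bottom of most pages are almost certainly running headers, footers or page furniture. This is a minimal sketch of that idea — not Scholarcy's actual code, and the `edge` and `min_frac` thresholds are illustrative assumptions:

```python
from collections import Counter

def strip_headers_footers(pages, edge=3, min_frac=0.6):
    """Remove lines that repeat at the top or bottom of most pages.

    `pages` is a list of pages, each a list of text lines. A line that
    appears within the first or last `edge` lines of at least `min_frac`
    of the pages is treated as a running header/footer and dropped.
    """
    counts = Counter()
    for lines in pages:
        # Count each candidate line once per page, even if it appears twice
        for line in set(lines[:edge] + lines[-edge:]):
            counts[line.strip()] += 1
    threshold = min_frac * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [
        [line for line in lines if line.strip() not in repeated]
        for lines in pages
    ]
```

Page numbers usually survive this filter because they differ from page to page; catching those needs an extra pattern-based pass.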

Over the following weeks I built a tool in Python that would take a PDF or Word document, extract, clean and reflow the text, and then classify each line as one of the following types:

  • metadata: title, author, etc
  • section heading
  • body text
  • tabular text
  • caption
  • bibliographic reference
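A line classifier like this mostly comes down to surface features: word count, casing, punctuation, numbering patterns. The sketch below shows a hypothetical feature set of the kind you might feed into the logistic regression or CRF models mentioned later — the exact features Scholarcy uses are not public, so treat these as assumptions:

```python
import re

def line_features(line):
    """Surface features for classifying a line of extracted text.

    A hypothetical feature set; the real classifier would feed vectors
    like this into logistic regression or a CRF.
    """
    words = line.split()
    return {
        "n_words": len(words),
        # Fraction of words starting with a capital (headings, titles)
        "title_case": sum(w[:1].isupper() for w in words) / max(len(words), 1),
        "ends_period": line.rstrip().endswith("."),
        "has_digits": bool(re.search(r"\d", line)),
        # Matches numbered headings like "3.1 Methods"
        "numbered_heading": bool(re.match(r"^\d+(\.\d+)*\s+\S", line)),
        # Matches "[12] ..." or "Smith, J." style reference lines
        "looks_like_reference": bool(
            re.match(r"^\[?\d+\]?\s|^[A-Z][a-z]+,\s[A-Z]\.", line)
        ),
    }
```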

Using spaCy and gensim, I then built the following functionality tailored for research papers and reports:

  • text simplification and summarisation
  • keyword extraction
  • identification of facts, claims and ‘important points’
  • reference extraction and parsing

So I ended up with machine learning models for each of these tasks which I would need to deploy. As these had been trained on only a few hundred PDFs, they ended up being quite small, and because they were largely based on linear regression, logistic regression or conditional random fields, they didn’t need much RAM to run either. Not exactly state-of-the-art deep learning yet, but the goal was something I could build and deploy at low cost, and the results were looking good.

I decided to build an API in Flask to wrap this functionality into a microservice, and then worry about the front end later. The combination of a web app that provides user interaction and customisation, with a separate API that performs the heavy lifting, would hopefully give me a minimum viable product.
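The microservice shape is straightforward: a JSON-in, JSON-out endpoint per model. This minimal Flask sketch shows the pattern — the route name and payload shape here are my assumptions, not Scholarcy's actual API, and the one-line "summariser" is a placeholder for a real model call:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/summarise", methods=["POST"])
def summarise_endpoint():
    # Hypothetical endpoint: accept raw text, return a summary.
    text = request.get_json(force=True).get("text", "")
    first = text.split(". ")[0]  # placeholder for the real summariser
    return jsonify({"summary": first})

if __name__ == "__main__":
    app.run(port=5000)
```

Keeping each model behind its own route makes it easy to add endpoints later without touching the front end — which matters for the extendability requirement above.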

I figured the easiest way to build and deploy a user-facing app without having current web development skills was via a Chrome Extension. With that in mind, I created an HTML template with sections corresponding to API functions, and set about learning enough Javascript to populate the template with the output of the API.

Once I had this all working locally, I was ready to start putting it all together. In brief, this involved:

  • setting up a private code repo ($7/month)
  • registering a domain name ($10)
  • building a website in SquareSpace ($15/month)
  • setting up a cloud VPS ($20/month)
  • registering with the Chrome Web Store ($5)

The API was deployed to api.scholarcy.com, I submitted the Extension to the Chrome Store, and a few days later, the Scholarcy Chrome Extension was born. It had taken two months of my time but had cost me under $100 to get it out there.

It’s fair to say that since the initial release to alpha testers back in February 2018, quite a lot has been added and improved on, and more money has been spent — such as developing the Scholarcy Web Library. There’s still a lot to do, and Reid Hoffman’s famous maxim:

‘If you’re not embarrassed by the first version of your product, you’ve launched too late’

definitely applies here. But I hope that this demonstrates what you can build and deploy within the AI/ML space with a clear goal, a lot of work, but not much financial outlay.

Finally, if you’re interested in building an AI tool that does useful things with text, then I recommend these tutorials to get started:

One last thing. This post could also have been called ‘How I built a product from some unloved GitHub projects’, because my starting point was an implementation of textrank and a text scraper that identified term–abbreviation pairs, which I used to provide features for the text classifier, and it continued from there. So if you have some old code lying around, see if you can use it as a starting point to build something new!
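For the curious, a term–abbreviation scraper can be surprisingly compact: find a parenthesised capital-letter token and check whether its letters match the initials of the words just before it. This is a simplified sketch of that kind of scraper, not the actual code the project started from:

```python
import re

def abbreviation_pairs(text):
    """Find 'long form (ABBR)' pairs where the abbreviation's letters
    match the initials of the preceding words."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,6})\)", text):
        abbr = match.group(1)
        # Take as many preceding words as the abbreviation has letters
        preceding = text[: match.start()].split()[-len(abbr):]
        if len(preceding) == len(abbr) and all(
            w[0].upper() == c for w, c in zip(preceding, abbr)
        ):
            pairs.append((" ".join(preceding), abbr))
    return pairs
```

Real papers need fuzzier matching (skipped stopwords, inner letters, hyphenation), but this initial-letter check already catches a large share of definitions.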

This story is published in Noteworthy.
