Make your own ML paper TLDR

achang
Nov 28, 2022


There are so many Machine Learning (ML) papers out there, and many more come out every year. Together they form a huge collection of modules, optimizations, augmentations, equations, algorithms, diagrams, and so on. On top of that, there is an enormous number of blog posts and newsletters.

Paperswithcode/methods gives a good summary of what is out there, but I still easily get lost in this knowledge pool when trying to develop. When I see a stack of new papers from CVPR, NeurIPS, ICLR, etc. waiting to be read, I wonder:

is there a lazy way?

This post goes through how to fetch the latest papers and summarize them. The app can generate your own newsletter for an ML topic, or generate slides from a paper you choose.

Any field of science keeps its history and its knowledge pool in papers, but ML gets extra momentum from its open-source community. That said, this app could be used for other fields of science as well.

Setup

In this tutorial, you will see:

  • Paper collection
  • PDF data extraction
  • NLP summarization
  • App dev with streamlit
  • Extra: PPTX generation

Gather papers

The first step is to gather ML papers. The latest and greatest ML papers are usually submitted to arXiv, so we can use the arxiv Python API to search for papers and download their PDFs. The API is straightforward:

Example that searches arXiv for a query and prints each paper’s title and arXiv id.

Then you can download the PDF. We can also use the paperswithcode API to gather arXiv papers for a particular topic. With paperswithcode you can additionally get the code on GitHub via the GitHub API.

Example that searches paperswithcode and prints each paper’s title, GitHub link, and publication date.

Wow, so many pip packages for collecting data from the internet. If you don’t find a pip package for the website you are collecting from, try BeautifulSoup.
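For completeness, here is what scraping with BeautifulSoup looks like; the HTML snippet and CSS classes below are made up for illustration:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# In practice you would fetch this with requests.get(url).text.
html = """
<div class="paper"><a href="/abs/1234.5678">A Great Paper</a></div>
<div class="paper"><a href="/abs/8765.4321">Another Paper</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Select every link inside a div with class "paper".
titles = [a.get_text() for a in soup.select("div.paper a")]
print(titles)  # ['A Great Paper', 'Another Paper']
```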

PDF data extraction

Now you have a PDF, but in most cases a PDF has no logical structure such as sentences or paragraphs. Its contents are a bunch of instructions that say exactly where to place each element on a display or on paper. In the worst case, a PDF is a bunch of scanned images. Luckily, most arXiv papers are built from LaTeX or Word, not from a scanner, so we won’t need OCR for now.

Figure 1: Diagram of the PDF format, from link.

There are various pip packages for mining PDFs, and various posts about them. For this tutorial we will use PyMuPDF; its documentation is here. We collect the images and concatenate the paragraphs in order, ending up with one string of text per section of the paper.

There are other PDF extractors, such as PDFMiner or jina-ai.

NLP Summarizer

Now we have a dictionary with text and images organized by section of the paper. We can use Natural Language Processing (NLP) models to summarize each section. Hugging Face summarization models are a great option to try out.
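A sketch of per-section summarization with the Hugging Face transformers pipeline. The model choice (sshleifer/distilbart-cnn-12-6) is just one reasonable default from the hub, and the character-based chunking is my crude workaround for the model’s input length limit:

```python
from transformers import pipeline  # pip install transformers

def chunk_text(text, max_chars=3000):
    """Crude split so each piece fits the summarizer's input limit."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_section(text):
    """Summarize one paper section, chunk by chunk."""
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    parts = [
        summarizer(chunk, max_length=120, min_length=30, do_sample=False)[0]["summary_text"]
        for chunk in chunk_text(text)
    ]
    return " ".join(parts)

# Usage (downloads the model on first run; "3 Method" is a hypothetical key
# from the section dictionary built in the previous step):
# print(summarize_section(sections["3 Method"]))
```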

Make a web app

We have all the components of our app: paper search, PDF extraction, and NLP summarization. Now we want to create a web app, deploy it, and share it with the world. For that, there is streamlit. It comes with various features and examples, making it easy to create and deploy web apps.

PPTX generation

Here is a bonus level. Imagine the scenario: you want to give a presentation on the paper you just “read”. We can also generate PowerPoint slides from the ML paper. All we need are some bullet points from the NLP summarization model and some diagrams from the PDF. And boom: ready to present.

We can create slides using python-pptx. Below is a hello-world example for python-pptx.

Done

Congrats! I hope you enjoyed the tutorial, and maybe it helps with your paper research.
