There are so many Machine Learning (ML) papers out there, and many more come out every year. They form a huge collection of modules, optimizations, augmentations, equations, algorithms, diagrams, etc. On top of that, there is an enormous number of blog posts and newsletters.
Paperswithcode/methods gives a good summary of what is out there. But I still easily get lost in this knowledge pool when trying to develop. I see a stack of new papers from CVPR, NeurIPS, ICLR, etc. to read, and I wonder:
is there a lazy way?
This post goes through how to fetch the latest papers and summarize them. The app can generate your own newsletter from an ML topic, or generate slides from a paper you choose.
While every field of science has its history and a knowledge pool spread across papers, the ML field gets extra popularity through its open-source community. This app could also be used for other fields of science.
Setup
In this tutorial, you will see:
- Paper collection
- PDF data extraction
- NLP summarization
- App dev with Streamlit
- Extra: PPTX generation
Gather papers
The first step is being able to gather ML papers. Usually, the latest and greatest ML papers are submitted to arXiv, so we can use the arxiv Python API to search for and fetch paper PDFs. The API is straightforward:
Then you can download the PDF. We can also use the paperswithcode API to gather arXiv papers on a particular topic, which is what we do here. With paperswithcode you can also get the code on GitHub using the GitHub API.
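A sketch of querying the Papers with Code public REST API with plain `requests`; the endpoint path, query parameters, and result field names are assumptions based on the public docs, so verify them before relying on this:

```python
import requests


def search_papers(topic: str, per_page: int = 5):
    """Query the Papers with Code REST API for papers matching a topic."""
    url = "https://paperswithcode.com/api/v1/papers/"
    resp = requests.get(
        url, params={"q": topic, "items_per_page": per_page}, timeout=30
    )
    resp.raise_for_status()
    # Each result typically includes title, arxiv_id, and a PDF link.
    return resp.json()["results"]


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build the direct PDF link for an arXiv identifier."""
    return f"https://arxiv.org/pdf/{arxiv_id}"
```

From a result's `arxiv_id` you can build the PDF link directly, or follow the paper's repository link to fetch code metadata from GitHub.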
Wow, so many pip packages for collecting data from the internet. If you don't find a pip package for the website data you are collecting, try BeautifulSoup.
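A small BeautifulSoup sketch: the HTML snippet below is invented for illustration, but the pattern (parse, then select by tag and class) is how you would scrape any paper listing page:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a page you fetched with requests.
html = """
<html><body>
  <h2 class="paper-title"><a href="/paper/1">Attention Is All You Need</a></h2>
  <h2 class="paper-title"><a href="/paper/2">Deep Residual Learning</a></h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selectors pull out every link inside an <h2 class="paper-title">.
titles = [a.get_text() for a in soup.select("h2.paper-title a")]
links = [a["href"] for a in soup.select("h2.paper-title a")]
```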
PDF data extraction
Now you have a PDF, but in most cases PDFs have no logical structure such as sentences or paragraphs. A PDF's contents are a bunch of instructions that tell a renderer exactly where to place each element on a display or page. In the worst case, a PDF is a bunch of scanned images. Luckily, most arXiv papers are built from LaTeX or Word, not from a scanner, so we won't need an OCR method for now.
There are various pip packages for mining PDFs, and various posts about them. For this tutorial, we will use PyMuPDF; its documentation is here. We collect the images and concatenate paragraphs in order, ending up with a string of all the text for each section of the paper.
There are other PDF extractors, such as PDFMiner or jina-ai.
NLP Summarizer
Now we have a dictionary with text and images organized by section of the paper. We use Natural Language Processing (NLP) models to summarize each section. Hugging Face summarization models are a great option to try out.
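A sketch using the Hugging Face `pipeline` API. The checkpoint name below is one common summarization model, not necessarily the one the app uses, and the character-based chunking is a simple assumption for handling sections longer than the model's context:

```python
from transformers import pipeline


def chunk_text(text: str, max_chars: int = 3000):
    """Split a long section into chunks the model's context can handle."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]


def summarize_section(text: str) -> str:
    """Summarize one paper section chunk by chunk, then join the summaries."""
    # Any summarization checkpoint from the hub can be swapped in here.
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    parts = [
        summarizer(chunk, truncation=True)[0]["summary_text"]
        for chunk in chunk_text(text)
    ]
    return " ".join(parts)
```

The first call downloads the model weights, so expect a delay on a cold start.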
Make a web app
We have all the components of our app: paper search, PDF extraction, and NLP summarization. Now we want to create a web app, deploy it, and share it with the world. For that, there is Streamlit. It comes with various features and examples, making it easy to create and deploy web apps.
PPTX generation
Here is a bonus level. Imagine the scenario: you want to make a presentation on the paper you just “read”. We can also generate PowerPoint slides from the ML paper. All we need are some bullet points from the NLP summarization model and some diagrams from the PDF. And boom: ready to present.
We can create slides using python-pptx. Below is a hello-world example for python-pptx.
Done
Congrats! I hope you enjoyed the tutorial, and that it helps with your paper research.