How to Read Conference Papers

Jason Corso
11 min read · Jan 25, 2024


Discover the PACES Method: A Strategic Approach to Efficiently Absorb, Analyze, and Retain Key Insights from the Expanding Universe of AI, ML, and CV Research Papers

How many papers can you read in a day? In a week? Whether you read on paper or digitally, scribble in the margins or trust your memory, summarize in Notion or use the Zettelkasten method, reading is an integral part of scholarship and cutting-edge work. One needs to stay apprised of the latest advances, opinions, methods… Being able to read effectively and analytically indubitably lays the groundwork for your future output.

What was your answer? One? Two? Ten? Whatever your answer was, I'm sorry. Seriously, whatever it was, it's not only not enough, but do you even remember what you managed to read? Not enough, you ask? Now, I'm a questioner by nature (see Gretchen Rubin's work on personality tendencies), so I'm also curious: how can this be? I'm with you on two levels.

First, no way. There just can't be that many papers out there I need to catch up on. In the figure below, you'll find the statistics on paper submissions and acceptances at CVPR over the years. Yes, 2,359 papers at CVPR in 2023 alone. If you read about six and a half papers per day, you could get through all of that in a year. As a many-time area chair for CVPR, I have firsthand knowledge that seriously reading a paper from a critical stance takes quite a bit longer than that pace allows.

Figure 1: Statistics of papers submitted and accepted to CVPR in the last 18 years. Sourced from the CVF.

Fine, you say, CVPR is just a really big conference. According to Google Scholar, CVPR happens to be the highest-impact conference in any field; it sports an h5-index of 422, fourth amongst all publications. But in terms of sheer scale, it is not unique; on the same list we find NeurIPS (9th), ICLR (10th), ICML (17th), ECCV (21st), and ICCV (26th). Interestingly, among these conferences in 2023 alone there were a whopping 11,141 published papers (from a hard-to-grok 41,253 total submissions; see the end of this article for details on these numbers). A back-of-the-envelope calculation for the published papers: a very conservative 500 words per page and 8 pages per paper yields about 44.6 million words. The average reader (238 words per minute) would need to spend roughly 3,120 hours reading (the average full-time work-year is estimated at 2,080 hours) before the process starts over again for the following year.
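If you want to check that arithmetic, it takes only a few lines. Here is the back-of-the-envelope calculation as a small Python snippet, using the same rough assumptions as the paragraph above.

```python
# Back-of-the-envelope reading-load estimate, using the figures cited above.
papers = 11_141          # published papers at the listed conferences in 2023
words_per_page = 500     # very conservative
pages_per_paper = 8
reading_speed_wpm = 238  # average adult reading speed, words per minute

total_words = papers * words_per_page * pages_per_paper  # 44,564,000 (~44.6M) words
hours = total_words / reading_speed_wpm / 60             # ~3,120 hours
print(total_words, int(hours))                           # 44564000 3120
```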

I think that settles it; it’s not possible to read everything in full.

Second, you might say that while there may be a lot of papers, there is no way there are that many great papers I need to read. I recall a "blog" from the great Prof. Don Geman circulating when I was a grad student that proposed to abolish the very existence of conference papers, for a variety of reasons including their volume (and this was 20+ years ago). One suspects part of the motivation behind his proposal was that the noise in the system made it increasingly difficult to track progress and select what to read.

Indeed, the challenge in this second angle is how to select what to read. It's certainly hard to do so without at least skimming a paper first, which, in and of itself, takes time.

So, what is the modern scholar to do? Too many papers, so little time. Over the years, I have developed an approach for a first read of an academic conference paper. This approach, which I call the PACES method, emphasizes a quick distillation of the key elements of a contemporary conference paper with minimal time investment and maximal retention. I teach this approach to my students, in my courses and in my research group. I use this approach on a regular basis myself when I read papers. I even taught ChatGPT how to use this approach. It does not do everything; when you really need to understand the nuances of a method or an experiment, you need to do more work, but that's expected. Still, it's a handy tool for keeping up.

The PACES Method

Contemporary conference papers are quite formulaic. They generally make one technical contribution (even when more than one is listed in the last paragraph of the introduction) and they do so in the context of the literature. This reading method provides a mechanism to get at this technical contribution as swiftly as possible. Basically, PACES is an acronym for the five questions you need to answer when working with a paper: Problem, Approach, Claim, Evaluation, and Substantiation. Answering them is how you put the paper through its "paces."

PROBLEM

What is the main problem being studied in the paper? This seeks to contextualize the paper within the domain and literature as crisply as possible. Is this an object detection paper? Is this a translation paper? Sometimes papers will propose something at a higher level, like a new representation, and apply it to multiple technical problems. In such a case, the problem description should be commensurately abstract: e.g., space-time video representations.

APPROACH

What is the main approach being studied in the paper? The two-to-three-sentence answer must provide a concise and clear description of the technique proposed to address the problem. It is a summary description with insufficient detail to reproduce the work; that's fine. Any more than two or three sentences and it is too much for PACES. Any less and you have not grokked it sufficiently to address the remaining three questions.

CLAIM

What is the main claim the paper makes about its unique contribution to the field? This is the most important question. What do the authors think is the main contribution they are making? One sentence. Clearly, this must be in the context of the field as they state it and as you know it: assimilate their claim into your understanding of the space. It is best to use your own words here and not quote from the paper (actually, this is the case for your answers to all of the questions).

EVALUATION

How is the approach to the problem evaluated? What datasets are used? What baselines are used? What evaluation setup is used? Are there any obvious successes or obvious red flags in the evaluation approach, including whether or not the evaluation is on point for the direction of the paper? A few-sentence summary is sufficient for the evaluation part. No copy-pasting of figures or tables. If the paper is strictly theoretical, then there may not be any evaluation; note that. Importantly, if you had to do the evaluation, would you have done the same or something different? Yes, this question can be answered without a deep understanding of the approach.

SUBSTANTIATION

Does the evaluation substantiate the claim the paper makes? When viewed through the lens of PACES, it becomes reasonable to assess a paper concretely on whether its key claim was backed up by the evaluation. Here, I elect to write a summary sentence about the key contribution of the paper and how the argument (theoretical) and/or evaluation (empirical) backs up that contribution. It's great when the answer is "yes" (and somehow I find it easier to recall and integrate the paper); when the answer is no, it basically means I'll not look at the paper again.

The PACES Method Means Active, Analytical Reading

So, the PACES method means actively reading a paper with the goal of distilling it down to answers to these five key questions. Usually, I will read the abstract and introduction first. Then, I will read the evaluation (often called "Experiments") along with any discussion and conclusion. By this time, I can usually answer most of the PACES questions. But I still go back and skim the Methods and Related Work sections.

Where and how you catalog the answers is a different topic; the process of being active about these questions is the key. For most papers, I can do the PACES process in less than an hour, often significantly less. Win.
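If you do want a lightweight starting point for cataloging, here is a minimal sketch of a PACES note as a Python dataclass. The field names and the Markdown rendering are my own illustration, not part of the method itself.

```python
# A minimal, illustrative way to catalog PACES answers; adapt to your own tooling.
from dataclasses import dataclass, fields

@dataclass
class PacesNote:
    title: str
    problem: str         # What is the main problem being studied?
    approach: str        # Two-to-three sentence summary of the technique
    claim: str           # One sentence: the stated unique contribution
    evaluation: str      # Datasets, baselines, metrics, any red flags
    substantiation: str  # Does the evaluation back up the claim?

    def to_markdown(self) -> str:
        # Render the note for a tool like Notion or a plain-text Zettelkasten.
        parts = [f"# {self.title}"]
        for f in fields(self)[1:]:
            parts.append(f"**{f.name.upper()}**: {getattr(self, f.name)}")
        return "\n\n".join(parts)
```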

Importantly, the granularity of answers in the PACES method depends on your relative familiarity with the domain under study. One may, for example, need to revisit papers multiple times to deepen the answers when digging deeper into a topic.

Example PACES

Let’s make this concrete. I selected one of my papers pretty much at random: Grounded Video Description from CVPR 2019. Let’s work through its PACES.

PROBLEM: Video captioning with an emphasis on grounding caption text in video pixels.

APPROACH: In a fully supervised manner, connects groundable language terms like a set of object classes to embedding-based region features, which are used by an LSTM-based language generation module. The loss function drives the attention module to learn how to emphasize groundable regions. To enable the full grounding supervision, a new dataset with sparse object grounding is introduced.

CLAIM: The work makes video captioning more robust and reliable by explicitly incorporating an object-grounding mechanism into the learned model.

EVALUATION: The introduced ActivityNet-Entities dataset is used as the primary evaluation medium. The work tests against SOTA video captioning methods and also uses an ablation study to assess the impact of the grounding-and-attention mechanisms; both use standard language generation metrics, a new evaluation of localization accuracy for the grounded terms, and a human assessment of the text. For robustness, the work also tests on an existing image-based grounding dataset, Flickr30k Entities.

SUBSTANTIATION: Yes, the paper’s evaluation sufficiently demonstrates the potential value for adding grounding mechanisms in language generation from video.

Automating PACES with LLMs

Now, even once you master the PACES method, there is still the sheer volume of papers out there. Clearly this is a challenge. So, I decided to see if we can automate this process, at least to help you decide whether or not to read a given paper in detail.

Figure 2: Icon for the PACES GPT, generated by DALL-E. The PACES GPT will help you speed up paper reading.

To experiment with automating PACES using recent developments in AI, I created a custom GPT called "PACES GPT" on the OpenAI ecosystem. I trained it to generate PACES summaries for PDFs that are uploaded to it. Try it out: https://chat.openai.com/g/g-N4PHpmUeO-paces-gpt (You may need a subscription to try it out. I'll consider alternative, open-source methods in the future.)
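For readers who would rather script this than use the ChatGPT interface, here is a minimal sketch of the same idea with the OpenAI Python client. To be clear, this is not how the PACES GPT itself is built; the model name, prompt wording, PDF-extraction step, and file name are all illustrative assumptions.

```python
# Illustrative sketch: ask an LLM for a PACES summary of a paper (not the PACES GPT itself).
# Assumes the `openai` and `pypdf` packages are installed and OPENAI_API_KEY is set.
from openai import OpenAI
from pypdf import PdfReader

PACES_PROMPT = (
    "Summarize the paper by answering five questions: PROBLEM (what is studied), "
    "APPROACH (the technique, in two to three sentences), CLAIM (the stated unique "
    "contribution, one sentence), EVALUATION (datasets, baselines, metrics), and "
    "SUBSTANTIATION (does the evaluation back up the claim?)."
)

def paces_summary(pdf_path: str, model: str = "gpt-4o") -> str:
    # Pull plain text out of the PDF; real papers may need cleaner parsing.
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PACES_PROMPT},
            {"role": "user", "content": text[:100_000]},  # crude length guard
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(paces_summary("grounded_video_description.pdf"))  # hypothetical file name
```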

I had the PACES GPT generate the output for the same paper I manually used as an example above. It is below (unedited except for formatting).

PROBLEM: The main problem tackled in the paper is the challenge of generating video descriptions that are accurately grounded in the video content, overcoming the limitations of existing models that often produce plausible but not necessarily accurate descriptions.

APPROACH: The authors introduce a novel video description model that utilizes a dataset with annotated bounding boxes for each noun phrase in video descriptions (ActivityNet-Entities). This model comprises three modules: grounding, region attention, and language generation. It aims to generate words and refine the grounding of objects in the descriptions, utilizing explicit supervision from bounding box annotations. This approach is innovative in its use of object-level supervision, such as region classification, object grounding (localization), and supervised attention, to enhance the grounding accuracy of video descriptions.

CLAIM: The key claim of the paper is that their novel video description model, leveraging bounding box annotations for grounding, outperforms existing state-of-the-art methods in generating more accurately grounded video and image descriptions.

EVALUATION: The evaluation involves comparing their method against state-of-the-art video description models on the ActivityNet Captions dataset and image captioning models on the Flickr30k Entities dataset. Metrics such as localization accuracy (both grounding and attention accuracy), top-1 classification accuracy for positive regions, and standard language evaluation metrics (Bleu@1, Bleu@4, METEOR, CIDEr, SPICE) were used. Human evaluation was also conducted to assess sentence quality.

SUBSTANTIATION: The evaluation substantiates the paper’s claim by demonstrating that the proposed model achieves state-of-the-art performance in both video and image description tasks. This is evidenced by significant improvements in grounding performance, accuracy, and explainability of the generated descriptions, as shown by automatic metrics and human evaluations. The model’s ability to not only generate sentences but also point to the corresponding regions in video frames or images further confirms its effectiveness.

As expected, it's quite a good overview in the PACES mindset. Obviously, the GPT is not going to be able to understand your context, or even the context of recent papers, but it is a start. Clearly, automated PACES reading is not going to replace your need to read, understand, and assimilate a paper. But it certainly can help you speed up the process, especially in selecting which papers to read, and provide good context when you start reading a paper. Furthermore, the PACES GPT can then answer further questions about the paper, such as "What does the paper say are the limitations of the existing literature?"

Closing

Reading is integral to technical and nontechnical work. You don't have to take my word for it; the value of reading is widely appreciated. For example, Warren Buffett spends as much as 80% of the day reading and thinking [https://fs.blog/the-buffett-formula/]. I've described PACES, a functional approach to reading that I've been developing for the last two decades. I teach the approach. I use this approach. Happy reading.

Paper statistics information

The beginning of this article claims there were 11,141 published papers from 41,253 submissions in 2023 amongst the top machine learning and computer vision conferences, which happen to also be the top six conferences on Google Scholar's h5-index list. These numbers break down as follows, with a quick arithmetic check after the list:

  • CVPR: 9,155 submitted and 2,359 accepted.
  • NeurIPS: 12,345 submitted and 3,218 accepted.
  • ICML: 6,538 submitted and 1,828 accepted.
  • ICLR: 4,955 submitted and 1,575 accepted.
  • ICCV: 8,260 submitted and 2,161 accepted.
  • ECCV: Off-year in 2023, alternates with ICCV.
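Summing those columns reproduces the totals claimed above; here is the quick check mentioned before the list.

```python
# Quick check of the totals: accepted and submitted counts for CVPR, NeurIPS, ICML, ICLR, ICCV.
accepted  = [2_359, 3_218, 1_828, 1_575, 2_161]
submitted = [9_155, 12_345, 6_538, 4_955, 8_260]
print(sum(accepted), sum(submitted))  # 11141 41253
```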

Any mistakes in these numbers are my own; they are the best figures I could find on the internet.

Acknowledgements

Thank you to my friends and colleagues who read early versions of this article, especially Jacob Marks, Dan Gural, and Michelle Brinich. Thank you to the hundreds, maybe thousands, of students who have submitted PACES write-ups over the years in my courses, and helped refine the approach through excellent feedback.

Biography

Jason Corso is Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan and Co-Founder / Chief Science Officer of the AI startup Voxel51. He received his PhD and MSE degrees at Johns Hopkins University in 2005 and 2002, respectively, and a BS Degree with honors from Loyola University Maryland in 2000, all in Computer Science. He is the recipient of the University of Michigan EECS Outstanding Achievement Award 2018, Google Faculty Research Award 2015, Army Research Office Young Investigator Award 2010, National Science Foundation CAREER award 2009, SUNY Buffalo Young Investigator Award 2011, a member of the 2009 DARPA Computer Science Study Group, and a recipient of the Link Foundation Fellowship in Advanced Simulation and Training 2003. Corso has authored more than 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on topics of his interest including computer vision, robotics, data science, and general computing. He is a member of the AAAI, ACM, MAA and a senior member of the IEEE.

Copyright 2024 by Jason J. Corso. All Rights Reserved.

No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, write to the publisher via direct message on X/Twitter at _JasonCorso_.
