[Conference Projector] A website to see the big picture of CVPR 2023 and find papers, powered by OpenAI.

yuukicammy
8 min read · Jul 9, 2023


Conference Projector
https://yuukicammy--conference-projector-wrapper.modal.run/en


Overview

  • Published a web application that visualizes all the papers of CVPR 2023 and supports paper search.
  • You can search for papers that are closely related in terms of category and application.
  • Instead of text-based search, you can explore papers from a broad perspective.
  • It also helps you grasp trending or niche areas within the conference.
  • Recommended for those who are searching for a research theme or want to learn about computer vision trends from a bird's-eye view.

What Conference Projector can do

Conference papers are projected onto a scatter plot. By clicking on a node, you can view a summary of that paper and similar papers.

Here’s how it works.

Hovering over a node will display information about the paper. You can quickly grasp multiple paper titles and research categories by moving the cursor.

Clicking on a node allows you to view information about that paper. You can also view papers that are similar to the selected paper.

Screen with papers selected
Page to view the selected paper and similar papers.

System Overview

Technology Stack

Pipeline

The system is built according to the following steps:

  • (1) Scraping
  • (2) Text Generation for Categories, Applications, etc.
  • (3) Embedding
  • (4) Image Extraction from PDFs
  • (5) Dimensionality Reduction
  • (6) K-D Tree Construction
  • (7) Web Application Deployment

Steps (1) to (6) are preprocessing. I prepared all the data before deploying the web application; the web application uses only the preprocessed data and makes no new requests to the OpenAI APIs at runtime. This reduces latency and minimizes running costs (API request fees).

Pipeline for data preprocessing.

The process for projecting the papers onto a scatter plot and measuring their similarity is shown below.

Processing steps from projecting papers onto a scatter plot to performing nearest neighbor search.

Implementation details are described in the next section.

Implementation Details

(1) Scraping

In this step, I extracted the information for all papers in the conference. The paper information was scraped from the CVPR 2023 Open Access title list page, and the extracted data is stored in Azure Cosmos DB.

This step extracted the following information for each paper:

  • Title
  • Abstract
  • arXiv ID
  • PDF URL (CVPR Open Access)
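
As a rough illustration, the scraping step might look like the minimal sketch below, assuming requests and BeautifulSoup. The selector (dt.ptitle) reflects the Open Access page structure at the time of writing and may differ from the actual implementation; the Cosmos DB write is omitted.

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://openaccess.thecvf.com"

def scrape_titles():
    """Collect (title, detail-page URL) pairs from the CVPR 2023 title list."""
    html = requests.get(f"{BASE}/CVPR2023?day=all", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    papers = []
    for dt in soup.select("dt.ptitle"):  # each paper title is a <dt class="ptitle">
        link = dt.find("a")
        papers.append({"title": link.text.strip(), "url": BASE + link["href"]})
    return papers

if __name__ == "__main__":
    for paper in scrape_titles()[:5]:
        print(paper["title"], "->", paper["url"])
```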

(2) Text Generation for Categories, Applications, etc.

I used the OpenAI Chat API to generate information (e.g., categories) that cannot be extracted from the conference website. Function calling is used to specify the output format. The model was gpt-3.5-turbo-0613.

Text is generated from the title and abstract of each paper, covering each of the following six items:

  • Brief description of the paper
  • Advantages over previous studies
  • The key essence of the proposed approach
  • Experimental results
  • Category
  • Application

I generated text for each item in both Japanese and English, resulting in a total of 12 items per paper. The prompts and the function calling schema can be found in the GitHub repository linked at the end of this post.

The prompts were not batched; a single request generates the text for one paper with all 12 items. According to the OpenAI Tokenizer, the input was approximately 500 tokens long and the output around 2,000 tokens.
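
For illustration, a single request might look like the sketch below. It assumes the pre-1.0 openai Python package; the function name record_paper_info and the trimmed two-field schema are hypothetical stand-ins for the real 12-item schema.

```python
import json
import openai

title = "An Example CVPR 2023 Paper"
abstract = "We propose ..."

# Hypothetical, trimmed schema: the real one defines all 12 items (6 x ja/en).
functions = [{
    "name": "record_paper_info",
    "description": "Record generated information about a paper.",
    "parameters": {
        "type": "object",
        "properties": {
            "category_en": {"type": "string", "description": "Research category (English)."},
            "application_en": {"type": "string", "description": "Main application (English)."},
        },
        "required": ["category_en", "application_en"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{
        "role": "user",
        "content": f"Title: {title}\nAbstract: {abstract}\n"
                   "Fill in the requested fields about this paper.",
    }],
    functions=functions,
    function_call={"name": "record_paper_info"},  # force the structured output
)
# function_call.arguments is a JSON string; it can still omit "required"
# fields, hence the retries mentioned below.
fields = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
```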

Generating text with GPT-3.5 Turbo was the most time-consuming and expensive process in this project. As mentioned in this post, GPT-3.5 Turbo responses can be slow, and frequent 503 errors due to overload can occur. It took around 20 seconds to receive a response, and a minimum of 16 hours to generate text for all 2,359 papers. There were cases where not all 12 items were generated even though they were specified as required in the function calling schema, so several retries were needed. Combined with trial and error, this wasted both time and money.

(3) Embedding

I used the OpenAI Embeddings API with the text-embedding-ada-002 model to embed four texts for each paper: category, application, title, and abstract. To reduce processing time, I followed the GPT best practices (Improving latencies) and batched the requests; up to 20 texts could be batched in a single request. Although overload errors sometimes persisted for a while, the processing completed relatively quickly compared to text generation with the Chat API.
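
A batched embedding call, as a minimal sketch with the pre-1.0 openai package (the batch size of 20 matches the limit mentioned above; retry/backoff handling is omitted):

```python
import openai

def embed_batched(texts, batch_size=20):
    """Embed texts with text-embedding-ada-002, 20 at a time."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = openai.Embedding.create(
            model="text-embedding-ada-002",
            input=texts[i:i + batch_size],  # a list input is embedded as one batch
        )
        # The response preserves input order (one 1536-d vector per text).
        vectors.extend(item["embedding"] for item in resp["data"])
    return vectors
```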

(4) Image Extraction from PDFs

I wanted to extract images from the PDFs to obtain a representative image that reflects the content of each paper; however, this process was not straightforward.

Although I used PyMuPDF to extract images from the PDFs available on CVPR Open Access, I could not extract useful images. This was likely because many papers embed their figures in PDF (vector) format, which PyMuPDF cannot extract as images.

After trial and error, I settled on the following process to extract representative images (a sketch of step 2 follows below):

  1. Check the arXiv information available in CVPR Open Access, or search for the title on arXiv. If the paper is registered on arXiv, obtain the largest image from its source files.
  2. If the previous step fails, parse the PDF on CVPR Open Access with PyMuPDF and obtain the image with the largest display area.
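
Step 2 might look like the following sketch with PyMuPDF. Here pixel area is used as a proxy for display area; the actual implementation may differ.

```python
import fitz  # PyMuPDF

def largest_image(pdf_path):
    """Return the bytes of the embedded image with the largest pixel area."""
    doc = fitz.open(pdf_path)
    best, best_area = None, 0
    for page in doc:
        for img in page.get_images(full=True):
            xref = img[0]  # cross-reference number of the image object
            info = doc.extract_image(xref)
            area = info["width"] * info["height"]
            if area > best_area:
                best, best_area = info["image"], area
    return best  # None if no raster images could be extracted
```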

Although I made efforts to improve the image extraction process, the success rate of obtaining representative images is approximately 20% based on subjective assessment. This is an area that needs further improvement.

(5) Dimensionality Reduction

To project the information of each paper onto a scatter plot, I performed dimensionality reduction on the embeddings. I adopted three methods: UMAP, t-SNE, and PCA. These were applied to the embeddings of the four texts (category, application, title, abstract), in both 2D and 3D. The results of dimensionality reduction play a crucial role in measuring the similarities between papers (as explained in section (6), K-D Tree Construction). For now, I have not implemented a feature that lets users change the hyperparameters of dimensionality reduction on the website, due to implementation complexity and latency concerns.
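
The reduction itself can be done with standard libraries; here is a minimal sketch (hyperparameters are left at library defaults, which are assumptions rather than the actual settings):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # umap-learn

def reduce(embeddings: np.ndarray, method: str, n_components: int) -> np.ndarray:
    """Project 1536-d ada-002 embeddings down to 2 or 3 dimensions."""
    if method == "umap":
        return umap.UMAP(n_components=n_components).fit_transform(embeddings)
    if method == "tsne":
        return TSNE(n_components=n_components).fit_transform(embeddings)
    if method == "pca":
        return PCA(n_components=n_components).fit_transform(embeddings)
    raise ValueError(f"unknown method: {method}")
```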

(6) K-D Tree Construction

To enable fast search for similar papers, I constructed K-D trees, one for each combination of the following settings:

  • Embeddings (x4): category, application, title, abstract
  • Dimensionality reduction methods (x3): UMAP, t-SNE, PCA
  • Reduced dimensions (x2): 2D, 3D

A total of 24 trees (4 x 3 x 2) were constructed.

It would have been simpler to use the 1536-dimensional embeddings obtained from the OpenAI API directly. However, this would have caused discrepancies between the neighbors users see on the scatter plot and the neighbors the system finds in the high-dimensional space, hurting the user experience. Therefore, I constructed the K-D trees from the dimensionality-reduced 2D and 3D embeddings.

Nearest neighbors searched in high-dimensional space often do not appear as neighbors after 2D projection.
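
Building and querying one of the 24 trees might look like this: a sketch using SciPy's cKDTree with random stand-in coordinates; the actual implementation may use a different K-D tree library.

```python
import numpy as np
from scipy.spatial import cKDTree

# Stand-in for one setting: 2D UMAP coordinates of all 2,359 papers.
coords = np.random.rand(2359, 2)

tree = cKDTree(coords)

# Nearest neighbors of a clicked paper (index 0). The closest hit is the
# query point itself, so ask for k+1 and drop the self-match.
dists, idxs = tree.query(coords[0], k=6)
similar_papers = idxs[1:]
print(similar_papers)
```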

(7) Web Application Deployment

I used Dash, a Python framework developed by Plotly, to build the web application. Dash is based on Flask and allows for easy construction of web applications with interactive Plotly graphs, making it suitable for this project. Although this was my first time using Dash, I was able to build the website relatively easily.

Currently, I deploy the Dash application on Modal, a serverless platform, though I am still weighing this choice. A monolithic website might be more advantageous than a serverless one in terms of latency and operational simplicity, and building the web application without Dash would further reduce latency. I chose Modal because I personally like it: its ease of use is addictive, even if it comes with trade-offs. I chose Dash because I did not want to spend much effort on HTML/CSS frontend coding.
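
For reference, serving a Dash app on Modal can be as small as the sketch below, written against Modal's 2023 API (modal.Stub plus the WSGI wrapper); the layout here is a placeholder, not the actual app.

```python
import modal

stub = modal.Stub("conference-projector-sketch")
image = modal.Image.debian_slim().pip_install("dash")

@stub.function(image=image)
@modal.wsgi_app()
def wrapper():
    from dash import Dash, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([dcc.Graph(id="scatter")])  # placeholder layout
    return app.server  # Dash exposes its underlying Flask (WSGI) app
```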

Did I end up creating a useful tool?

Point 1: Browsing Papers from a Broader Perspective

  • I think I have achieved a broad-perspective paper search, in the sense that you can look for papers across the entire conference.
  • However, the visualization is not very crisp, and it takes some effort to find papers of interest.

Toward Improvement:

  • It would be helpful to have additional clues to navigate to papers, such as combining keyword searches or clearly indicating important/rated papers.

Point 2: Searching for Similar Papers Based on Categories and Applications

  • Specific categories and applications are often clustered together, making it easier to find similar papers.
  • However, there were cases where the categories and applications were not accurately estimated. For example, some papers were categorized simply as “Computer Vision,” and the generated applications were often long and unspecific.

Toward Improvement:

  • It is necessary to provide more specific examples of categories and applications in the prompts.
  • Incorporating information from the full text of the papers could improve the accuracy of application estimation.
  • Additional techniques are needed to better represent the proximity between different categories.

Point 3: Understanding the Overall Conference Trends

Node Clusters and Categories

  • The plot reveals that zero-shot/few-shot learning forms a relatively large and dense cluster, representing a trend in the conference.
  • I included specific examples of categories in the prompts, such as “3D human pose estimation, 3D object tracking, object detection, exposure correction.” These categories were well clustered, but other categories tended to have low density.
  • It is difficult to determine whether low-density areas truly represent minor categories.

Toward Improvement:

  • Improving the prompts with more specific examples of categories and applications, and tuning the dimensionality reduction techniques, are necessary to better represent conference trends.

Conclusion

Although there are still many challenges to overcome, the tool shows potential. I started it as a one-person hackathon, pulling nearly an all-nighter to complete the first version on day one. Three weeks of continuous improvements have passed since then.

One realization I had during development is that prompt engineering is essential when you cannot update the model itself. I used to underestimate the value of prompt engineering, and this time, too, I did not put much effort into it at first. However, I now recognize that well-crafted prompts are crucial for generating accurate categories and applications for the papers, which strongly influences the user experience. In particular, including concrete names of the categories and applications to be generated in the prompts seems effective. I now consider prompt engineering an essential part of tuning such a system.

Conference Projector
https://yuukicammy--conference-projector-wrapper.modal.run/en

GitHub
https://github.com/yuukicammy/conference-projector

