SciA11y — improving the accessibility of scientific documents (2 of 2)

lucy lu wang
Jun 10 · 9 min read
A drawing of different components of a scientific paper identified as blocks. The content of these blocks is extracted and reorganized. Paper snippets are represented as yellow blocks on a dark blue background with the text SciA11y and Semantic Scholar in the lower corners.

In the first post of this series, I described the current state of scientific PDF accessibility (or rather, inaccessibility). Now let’s dive deeper into some of the challenges faced by blind and low vision (BLV) readers in this domain, and introduce a system we created that attempts to address some of these challenges. We go into great detail about both of these points in our recently released preprint: “Improving the accessibility of scientific documents.”

Understanding user challenges

We interviewed several BLV researchers to understand their needs and challenges when reading papers. My high-level takeaways from these conversations were 1) any paper semantics that aren’t explicitly described in the PDF can and will break for some screen reader/PDF reader combination, and 2) paper reading can be quite frustrating for BLV researchers.

Our participants described myriad issues that their screen readers encounter on inaccessible PDFs. These range from high-level problems, such as missing headings for navigation or mishandled multi-column layouts, to problems with specific components of a paper, including figures, tables, equations, and code blocks.

To compensate for these issues, participants employ various coping strategies:

  • Try other PDF readers — one participant may try another browser even though it usually doesn’t help, but he feels “hopeful”
  • Message authors for source document — sometimes the author manuscript is accessible but the camera-ready version is not
  • Ask for remediation — one participant estimates that a 10-day turnaround is on the quick side
  • Ask co-worker or family member for help
  • Give up and abandon the paper — one participant estimates abandoning papers “60–70% of the time”, another says “sometimes the only option is to sit down and cry” (jokingly, though the sentiment is true)

Some of these compensation strategies can be fruitless or take a significant amount of time. For example, requesting remediation is only an option for researchers affiliated with large, well-resourced institutions, and even when available, each request may take up to two weeks, which is too slow for many research needs.

Most dishearteningly, several participants in our study described frequently giving up and abandoning a paper. In some cases, whether a paper is accessible may also alter the course of the participant’s research; for example, when choosing between two methods, one described in an accessible PDF and one in an inaccessible PDF, the decision to choose the accessible paper may come from necessity rather than any comparative evaluation between the two methods.

The SciA11y system

To address some of these challenges, we designed and prototyped the SciA11y system (named after the numeronym for accessibility: a11y). The system extracts the components of a scientific paper PDF and renders them as an accessible HTML document, combining the output of several machine learning modules with heuristic logic. Figure 1 shows how the primary components of our pipeline identify and extract textual elements and figure/table elements, then stitch these together in the HTML render. Our main focus for this prototype was to implement navigational features that improve skimming and scanning. As such, we focused on tagging paper objects, creating section headings for navigation, and introducing other navigation-assisting features such as a table of contents and links between inline citations and references.

On the left are the first page and half of a PDF for the paper “Construction of the literature graph in Semantic Scholar.” Textual elements are highlighted as blocks of blue. Figure elements are highlighted as blocks of pink. In the middle is the SciA11y HTML render of the same document, with arrows pointing to how blocks from the PDF are rearranged in the HTML. A table of contents is added near the beginning, highlighted in green. Links are added between inline citations and the bibliography.
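As a rough illustration of the table of contents feature, the sketch below assigns each extracted section heading a unique anchor and emits a linked list of sections. The function names (`slugify`, `build_toc`) and the anchor scheme are hypothetical choices made for this example, not the SciA11y implementation.

```python
import re
from html import escape


def slugify(title: str) -> str:
    """Turn a heading into a fragment-safe id, e.g. 'Related Work' -> 'related-work'."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "section"


def build_toc(headings: list[str]) -> tuple[list[tuple[str, str]], str]:
    """Return (anchor, title) pairs and the HTML for a linked table of contents."""
    used: set[str] = set()
    entries: list[tuple[str, str]] = []
    for title in headings:
        base = slugify(title)
        anchor, n = base, 1
        # Repeated headings get a numeric suffix so every link target is unique.
        while anchor in used:
            n += 1
            anchor = f"{base}-{n}"
        used.add(anchor)
        entries.append((anchor, title))

    items = "\n".join(
        f'    <li><a href="#{anchor}">{escape(title)}</a></li>'
        for anchor, title in entries
    )
    toc_html = (
        '<nav aria-label="Table of contents">\n'
        f"  <ol>\n{items}\n  </ol>\n"
        "</nav>"
    )
    return entries, toc_html


entries, toc_html = build_toc(["Introduction", "Methods", "Results", "Methods"])
print(toc_html)
```

The same anchors would need to be set as `id` attributes on the corresponding headings in the document body for the links to resolve.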

SciA11y primarily leverages S2ORC and DeepFigures (two other projects from Semantic Scholar) to analyze and extract components of scientific PDFs. S2ORC combines metadata selection logic and a suite of utilities to convert scientific documents of various types (PDF, XML, LaTeX) to JSON. For PDFs, S2ORC uses the open source tool Grobid as a foundation. DeepFigures is a computer vision model that identifies and extracts figure and table objects from papers, along with their captions and titles. We combine the outputs of S2ORC and DeepFigures to form the main document body, by inferring a linear reading order that integrates text and figures. We also add links between parts of the document for better within-document navigation. To mitigate some of the dissonance around incorrect extractions, we also indicate to the user any known failures to extract particular components of the paper such as figures and equations, which echoes the first two guidelines for Human-AI interaction: indicating to the user what the system can and cannot do (Amershi et al. 2019).

An example output of SciA11y is shown in Figure 1. The original PDF is shown on the left, with a two-column layout where various paragraphs, figures, references, etc. are interspersed, and with minimal markup to indicate headings and object type. Our system takes this PDF as input and produces the linear document structure shown in the middle of Figure 1. From the top, the paper consists of metadata fields like title and authors, followed by a table of contents with links to all sections, then the various sections and subsections. Figures and tables are placed as close as possible to their first mentions at paragraph breaks, rather than in the middle of paragraphs as in the PDF. At the end of the document is the references section; bidirectional links are populated between inline citations and the reference entries to which they resolve.
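The placement rule described above (each figure or table goes at the paragraph break following its first mention) can be sketched roughly as follows. The `Figure` class, the `linearize` function, and the fallback of appending unmentioned figures at the end are assumptions made for illustration; the post does not specify how SciA11y infers reading order.

```python
import re
from dataclasses import dataclass


@dataclass
class Figure:
    label: str    # how the paper refers to it, e.g. "Figure 1" or "Table 2"
    caption: str


def linearize(paragraphs: list[str], figures: list[Figure]) -> list[tuple[str, str]]:
    """Interleave paragraphs and figures into one linear reading order."""
    # Find the first paragraph that mentions each figure (None if never mentioned).
    first_mention: dict[str, int | None] = {}
    for fig in figures:
        pattern = re.compile(rf"\b{re.escape(fig.label)}\b")
        first_mention[fig.label] = next(
            (i for i, text in enumerate(paragraphs) if pattern.search(text)), None
        )

    blocks: list[tuple[str, str]] = []
    for i, text in enumerate(paragraphs):
        blocks.append(("text", text))
        # Insert figures only at the paragraph break, never mid-paragraph.
        for fig in figures:
            if first_mention[fig.label] == i:
                blocks.append(("figure", fig.label))

    # Assumed fallback: figures that are never mentioned go at the end.
    for fig in figures:
        if first_mention[fig.label] is None:
            blocks.append(("figure", fig.label))
    return blocks


paragraphs = [
    "We build a literature graph (Figure 1).",
    "Details of the pipeline follow.",
    "Table 1 lists the results; compare with Figure 1.",
]
figures = [
    Figure("Figure 1", "Overview of the graph."),
    Figure("Table 1", "Main results."),
    Figure("Figure 2", "An unreferenced figure."),
]
for kind, value in linearize(paragraphs, figures):
    print(kind, "|", value)
```

The word-boundary match keeps "Figure 1" from matching inside "Figure 10", which matters for papers with many figures.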

Where a PDF may lack labeled headings for navigation, the resulting SciA11y HTML render contains section headings in HTML heading tags (e.g. <h1>, <h2>), making it easier for screen reader users to navigate to specific headings. By associating figure and table captions with their respective figure and table images, and tagging these in HTML as <figure> objects, we prevent caption text from interrupting reading flow, and also allow users to take advantage of built-in screen reader shortcuts for navigating between figures. We also incorporate bidirectional links between inline citations and reference entries, so that users can move between the text and the references without losing their reading context.
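To make the markup concrete, here is a minimal sketch of how headings, figures, and bidirectional citation links might be emitted. The helper names and the `cite-N-K` / `ref-N` id scheme are hypothetical and chosen only for this example; the actual SciA11y output may be structured differently.

```python
from html import escape


def render_heading(title: str, anchor: str, level: int = 2) -> str:
    """A real heading element, so screen reader heading shortcuts can find it."""
    return f'<h{level} id="{escape(anchor)}">{escape(title)}</h{level}>'


def render_figure(anchor: str, src: str, alt: str, caption: str) -> str:
    """Keep the image and its caption together inside one <figure> element."""
    return (
        f'<figure id="{escape(anchor)}">\n'
        f'  <img src="{escape(src)}" alt="{escape(alt)}">\n'
        f"  <figcaption>{escape(caption)}</figcaption>\n"
        "</figure>"
    )


def render_citation(ref_num: int, occurrence: int) -> str:
    """Inline citation: links forward to the reference and is itself a link target."""
    return f'<a id="cite-{ref_num}-{occurrence}" href="#ref-{ref_num}">[{ref_num}]</a>'


def render_reference(ref_num: int, text: str, occurrences: int) -> str:
    """Reference entry with one back-link per place it was cited in the text."""
    backlinks = " ".join(
        f'<a href="#cite-{ref_num}-{k}">Back to citation {k}</a>'
        for k in range(1, occurrences + 1)
    )
    return f'<li id="ref-{ref_num}">{escape(text)} {backlinks}</li>'


print(render_heading("Related Work", "related-work"))
print(render_figure("fig-1", "fig1.png", "Pipeline diagram", "Figure 1: System overview."))
print("S2ORC " + render_citation(2, 1) + " converts papers to JSON.")
print(render_reference(2, "Lo et al. (2020). S2ORC.", occurrences=1))
```

Giving each citation occurrence its own id is what lets a reader jump to a reference and then return to the exact sentence they left.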

We performed an intrinsic evaluation of our system, by sampling and annotating the HTML renders of 385 papers for extraction errors and overall parse quality and readability. Two expert annotators examined the PDF and HTML versions of each paper, and identified and quantified 12 different kinds of errors made by our models. The most common errors occurred in header, footer, and footnote extraction, and section headings, where a large proportion of papers have 1–5 extraction errors for each. These errors consist of failed extractions (e.g. a footnote was not extracted and mixed in with the body text) or extraneous extractions (e.g. text that is not a section heading is extracted as a section heading). Assessment of overall parse quality and readability is positive; annotators labeled 86% of papers in our sample as having an HTML render with good or okay overall readability. With additional modeling improvements on the horizon, we are optimistic about improving the rate of high-quality HTML renders.
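For readers who want a feel for the precision of a proportion estimated from 385 papers, the sketch below computes a margin of error. The 95% confidence level and the normal approximation are assumptions made here for illustration; the post itself does not report an interval or say how the sample size was chosen.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


n = 385
# Worst case (p = 0.5) gives the widest possible interval for this sample size.
worst_case = margin_of_error(0.5, n)
# Interval half-width around the reported 86% "good or okay" readability rate.
observed = margin_of_error(0.86, n)

print(f"worst-case margin: {worst_case:.3f}")
print(f"margin at p=0.86:  {observed:.3f}")
```

Under these assumptions, a sample of 385 keeps the worst-case margin just under five percentage points, and the margin around the 86% figure is roughly three and a half points.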

SciA11y user study

Six thumbnails of silhouetted heads and shoulders are shown in a line, representing the six participants recruited for our user study: two in the pilot session and four in the main session. A breakdown of each 75-minute semi-structured interview is shown, consisting of 15 minutes of introduction, 20 minutes demonstrating the participant’s current workflow, 20 minutes of prototype interaction, and 15 minutes of discussion.

We performed a qualitative evaluation of our system with six BLV researchers. During the evaluation, we asked each participant to read a paper of their choice first using their current tools and pipeline, and then using the SciA11y system. We then asked each participant to discuss the positives and negatives of their experience with SciA11y.

Participants responded positively to the navigation features we introduced. Among the most liked features were the bidirectional links between inline citations and references, the headings we introduced for screen reader navigation, and the table of contents at the top of the document. One participant referred to the citation and reference links as a “crucial piece of the puzzle.” Many benefits come simply from converting a paper PDF into HTML, since most screen readers, web browsers, and OS features have been thoroughly tested on HTML documents and work well with them; for example, find, copy, and paste behave as expected, as do screen reader shortcuts.

On the negative side, participants pointed out that some paper elements were not extracted correctly from the PDF. For example, some section headings are extracted incorrectly or missed during extraction. Other components that we currently do not handle in our prototype such as tables (currently extracted as images) and equations were also noted as missing. Negative feedback was largely concentrated on errors of specific paper elements, many of which we noted during our intrinsic evaluation.

The overall response was quite motivating. When asked whether they would use the SciA11y system if it were available for papers in our corpus, all six participants said that they would. When asked how the system might be integrated into their workflow, one participant responded:

“I think it would become the workflow.”

Another participant said that the system would be “life-changing” for currently inaccessible paper PDFs. We summarize learnings from these user studies into a set of five design recommendations for accessible paper reading systems, which are discussed in detail in our paper.

Accessibility and the Semantic Scholar reading experience

We envision incorporating the learnings of the SciA11y project into the broader Semantic Scholar scientific paper reading experience. The recently introduced Semantic Reader (Head et al. 2021) defines several elements of this experience, such as providing background knowledge and term and symbol definitions inline, when they are needed, and decluttering the page to minimize distractions. The focus of SciA11y is to separate paper layout from semantics, and to derive a linear representation of the paper that supports easier navigation and reading. The output generated by the SciA11y system can be extended with other reading assistance features like inline term or symbol definitions.

Our demo provides access to a static snapshot of 1.5 million open access papers from our corpus, but our ultimate goal is to provide accessible HTML renders for papers directly from Semantic Scholar whenever we have the ability and permissions to do so. As we begin thinking about productionizing this system, we want to improve it in several areas, such as replacing our extractive models with more performant layout-aware language models, and improving our handling of specific paper elements like tables and equations.

As mentioned in the last post, if this work resonated with you and you want to learn more or get involved, please reach out and let us know!

To find out more, read our preprint: “Improving the accessibility of scientific documents: current state, user needs, and a system solution to enhance scientific PDF accessibility for blind and low vision users.” arXiv: 2105.00076. Accessible PDF: here. Demo:

Questions or feedback? Contact:

This article was part 2 of a 2-part series: find the first part here.


  1. Wang, L.L., Cachola, I., Bragg, J., Cheng, E.Y., Haupt, C.H., Latzke, M., Kuehl, B., van Zuylen, M., Wagner, L.M., & Weld, D.S. (2021). Improving the Accessibility of Scientific Documents: Current State, User Needs, and a System Solution to Enhance Scientific PDF Accessibility for Blind and Low Vision Users. ArXiv, abs/2105.00076.
  2. Lo, K., Wang, L.L., Neumann, M., Kinney, R.M., & Weld, D.S. (2020). S2ORC: The Semantic Scholar Open Research Corpus. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL).
  3. Siegel, N., Lourie, N., Power, R., & Ammar, W. (2018). Extracting Scientific Figures with Distantly Supervised Neural Networks. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries.
  4. Amershi, S., Weld, D.S., Vorvoreanu, M., Fourney, A., Nushi, B., Collisson, P., Suh, J., Iqbal, S.T., Bennett, P.N., Quinn, K., Teevan, J., Kikin-Gil, R., & Horvitz, E. (2019). Guidelines for Human-AI Interaction. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems.
  5. Head, A., Lo, K., Kang, D., Fok, R., Skjonsberg, S., Weld, D.S., & Hearst, M.A. (2021). Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems.

