SciA11y — improving the accessibility of scientific documents (1 of 2)

Lucy Lu Wang
May 28 · 8 min read
An illustration of a scientific paper on a dark blue background covered in other papers. A purple universal access logo, a stick figure with arms outstretched in a circle, is displayed on top of the paper. Text in the lower left corner says “SciA11y”; and text in the lower right corner shows the Semantic Scholar logo and name.

One early motivation for the creation of the internet was to improve scientists’ abilities to communicate and share findings. Findings come in many forms, but one of the most common formats for dissemination is the scientific paper. Much has changed for the internet and for scientific communication since the 60s, but the web still holds enormous potential for democratizing access to information. Relatively recent developments in guidelines for web accessibility (WCAG 2.0; WCAG 3.0, in progress) and principles for inclusive design have laid the groundwork for equalizing access to web content across the ability and resource spectrum. With regard to the original intent of sharing scientific documents, however, these promises have fallen short.

A couple of weeks ago, we, a group of researchers at Semantic Scholar, released a preprint titled “Improving the Accessibility of Scientific Documents.” In this work, we outline issues and a potential solution for increasing scientific document accessibility for researchers and readers who are blind and low vision. We start by describing the somewhat dire state of scientific PDF accessibility through a meta-scientific analysis of scientific PDFs published in the last decade (2010–2019). Spoiler: the vast majority of scientific PDFs we analyzed are not accessible. We then introduce a new system called SciA11y that attempts to mitigate current accessibility challenges by rendering some 1.5 million open access scientific papers in accessible HTML format. Finally, we conduct a user study to better understand the challenges faced by blind and low vision researchers when reading papers, to better serve their needs going forward, and to determine whether SciA11y is a step in the right direction. In this post, we share some motivations and results from our analysis.

Quantifying the accessibility of scientific PDFs

The scientific community predominantly uses PDF as an exchange format for scientific papers. Though PDF is a reasonable file format for conveying a faithful visual representation of document structure and content, it is not suited for the web, and is not accessible by many of today’s web accessibility standards. PDFs are challenging to read for researchers who are blind and low vision, especially for those who interact with papers using screen readers, and also for those who have limited technological resources, such as users on mobile devices or with low internet bandwidth. Bigham et al. 2016 provide a high-level overview of the peculiarities of this communal decision for scientists to disseminate in PDF, and why improving accessibility by building on the current state is so challenging.

Understanding the scope of the accessibility problem for scientific PDFs can help us prioritize solutions. So one thing we wanted to investigate in our study is: where are things now? How big is the accessibility problem for current publications and historical scientific PDFs?

Prior work that aims to quantify the accessibility of scientific PDFs (Brady et al. 2015, Nganji 2015, Lazar et al. 2017, Ribera et al. 2019) has been limited in scope, and also limited to domains where accessibility has been a bigger part of the conversation, e.g., accessibility and accessible computing, human-computer interaction, disability studies, etc. These previous studies have shown that overall PDF accessibility is low for blind and low vision users, but that things are improving in the fields mentioned above due to policy changes. For example, the ACM now encourages and sometimes requires authors to submit accessible PDFs for some of its conferences (e.g., CHI and ASSETS).

We were interested in whether these previous results generalize across the whole of scientific literature, or at least the portion of literature that Semantic Scholar indexes (190M and counting). We conducted a study to assess the rates of accessibility compliance for PDFs in our corpus, by analyzing a representative sample of over 11K papers published in different fields of study throughout the last decade. Similar to previous studies, we assessed five accessibility criteria:

  1. The presence of alt-text on images,
  2. The presence of table headers,
  3. Whether different components of the PDF are tagged, e.g., headings, figures, equations, footnotes etc.,
  4. Whether the PDF has a specified language, and
  5. Whether the reading order, or tab order, is specified.

Of these, 3 and 5 are crucial to navigating a document using screen readers, and 1 and 2 are necessary for screen readers to interact with the content of figures and tables, respectively, both of which are critical parts of scientific papers.
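To make the scoring concrete, here is a minimal Python sketch of how the five checks for a single PDF could be recorded and summarized. The class and field names (`PdfChecks`, `alt_text`, and so on) are hypothetical and chosen for illustration. This is not the code used in the analysis, and it assumes the five yes/no results have already been obtained from an accessibility checker.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class PdfChecks:
    """Hypothetical record of the five yes/no checks for one PDF."""

    alt_text: bool          # criterion 1: images have alt-text
    table_headers: bool     # criterion 2: tables have headers
    tagged: bool            # criterion 3: components are tagged
    default_language: bool  # criterion 4: a language is specified
    tab_order: bool         # criterion 5: reading/tab order is specified

    def criteria_met(self) -> int:
        """Number of criteria satisfied, from 0 to 5."""
        return sum(astuple(self))

    def normalized_score(self) -> float:
        """Fraction of the five criteria satisfied."""
        return self.criteria_met() / 5

    def meets_all_five(self) -> bool:
        """True only when every criterion is satisfied."""
        return self.criteria_met() == 5

    def navigable(self) -> bool:
        """Criteria 3 and 5, the two needed for screen reader navigation."""
        return self.tagged and self.tab_order


# Illustrative values only, not taken from a real paper.
example = PdfChecks(
    alt_text=False,
    table_headers=False,
    tagged=True,
    default_language=True,
    tab_order=True,
)
print(example.criteria_met(), example.normalized_score(),
      example.meets_all_five(), example.navigable())
# prints: 3 0.6 False True
```

Keeping the count of criteria met separate from the all-five check matters: a PDF can satisfy several criteria, and even be navigable, while still failing the stricter all-five test.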

A line plot shows accessibility compliance of paper PDFs over time between 2010–2019. Of the five criteria, Default Language has increased the most, going from around 0.10 in 2010 to 0.27 in 2019. Table headers, Tagged PDF, and Tab order also show modest increases, from 0.05–0.08 in 2010 to 0.14–0.20 in 2019. Alt-text has not improved much in these years, going up and down between 0.05 and 0.10. The Adobe-5 Compliance rate (satisfying all criteria) has stayed consistent over time, at ~0.02–0.03.

Let me start with the positives. We found an overall increase in accessibility compliance over the last decade, from 2010 to 2019 (Figure 1). The average total compliance of papers has increased from around 7.5% in 2010 to 17.5% in 2019, and the trend looks to be on the rise. More PDFs published in the last few years have tagged components, properly defined reading order, and specified languages compared to a decade ago. Unfortunately, when we consider the big picture, these numbers are still shockingly low. When we look at PDFs that satisfy all five criteria at once, the compliance rate now is about the same as it was in 2010, a paltry 2.4% across the whole of the decade. Looking at individual criteria, the one with the largest increase is the presence of a default language, which is arguably the least useful for improving accessibility; the criterion that has remained most difficult to meet is alt-text on figures, which is also unfortunately the only criterion that necessitates author or publisher intervention. This means that any accessibility gains we detected are likely not due to increasing awareness of the importance of accessibility, but may instead be an artefact of changes in scientific PDF production processes.

Now what do I mean by that? Scientific PDFs are created through a variety of means. Some are produced by authors directly using typesetting tools like LaTeX or Microsoft Word, and some are created by publishers using software that is perhaps less familiar to researchers, such as Adobe InDesign, Arbortext APP, and others. Conventions can be quite different depending on a researcher’s field of study. For example, in Computer Science and AI, where most of AI2’s publications occur, authors are often responsible for the final forms of papers: they create the PDF versions of record that are distributed on preprint servers like arXiv or bundled into conference proceedings. In these fields, anecdotally and as seen in our data, many authors elect to use LaTeX to create their paper PDFs. In other fields like Biology or Medicine, by contrast, most papers are published by major academic publishing groups, in journals like Nature, Science, PLOS, JAMA, The Lancet, and others. These publishers each have their own workflows for creating PDFs.

The outsize role of typesetting software

A somewhat surprising finding from our study is the strong association between the typesetting software used to generate a PDF and the PDF’s accessibility. We grouped the PDFs in our analysis by the typesetting software used to create them (this information is available in the PDF metadata), and we found that some software, like Microsoft Word, produced PDFs with significantly higher compliance than other software. Figure 2 provides a breakdown of the accessibility compliance scores (higher compliance is better) associated with each of the top 5 typesetting software packages represented in our sample.
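The grouping step can be sketched roughly as follows. This is an illustrative sketch, not the pipeline used in the study: it assumes each PDF has already been reduced to a metadata string (typically the Creator or Producer entry of the PDF's document information) plus a count of criteria met, and the substring rules in `FAMILY_RULES` are hypothetical. Real metadata strings vary widely and would need more careful matching.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical substring rules, checked in order. "arbortext" must come
# before "tex" because the string "arbortext" itself contains "tex".
FAMILY_RULES = [
    ("word", "Microsoft Word"),
    ("indesign", "Adobe InDesign"),
    ("arbortext", "Arbortext APP"),
    ("tex", "LaTeX"),
]


def software_family(metadata: str) -> str:
    """Map a raw creator/producer string to a coarse software family."""
    lowered = metadata.lower()
    for needle, family in FAMILY_RULES:
        if needle in lowered:
            return family
    return "Other"


def mean_score_by_family(records):
    """Average the number of criteria met within each software family.

    `records` is an iterable of (metadata_string, criteria_met) pairs.
    """
    grouped = defaultdict(list)
    for metadata, criteria_met in records:
        grouped[software_family(metadata)].append(criteria_met)
    return {family: mean(scores) for family, scores in grouped.items()}


# Made-up records for illustration only.
sample = [
    ("Microsoft® Word 2016", 4),
    ("Microsoft® Word 2016", 2),
    ("pdfTeX-1.40.21", 0),
    ("Arbortext Advanced Print Publisher", 1),
]
print(mean_score_by_family(sample))
```

Ordered substring rules are the simplest thing that works for a sketch; the ordering caveat in the comment is exactly the kind of detail that makes metadata-based grouping error-prone in practice.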

Five histograms show the distribution of accessibility criteria satisfied by paper PDFs created by the top five PDF typesetting software packages. Microsoft Word creates PDFs with the highest accessibility score, with many PDFs satisfying 3 or more criteria. The remaining four, in decreasing order of average accessibility (Adobe InDesign, Arbortext APP, Printer, and LaTeX), primarily generate PDFs that satisfy none of our defined criteria.

And to drive home the strength of this association: we also split our sample by different fields of study, and assessed the relationship between the proportion of PDFs created using Microsoft Word and the rate of accessibility compliance among the papers of that field. Figure 3 shows a strong correlation (r = 0.89, p < 0.001): fields where more papers are typeset using Microsoft Word have higher rates of accessibility compliance.
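For readers who want to run this kind of check on their own data, the correlation coefficient r is the standard Pearson correlation between two per-field quantities. The sketch below is a plain-Python implementation under stated assumptions: the function name `pearson_r` and the input lists are hypothetical, the numbers are made up rather than taken from our results, and no p-value is computed.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up per-field values for illustration only:
# proportion of PDFs typeset with Word, and mean normalized compliance.
word_share = [0.03, 0.08, 0.15, 0.22]
compliance = [0.05, 0.09, 0.12, 0.18]
print(round(pearson_r(word_share, compliance), 3))
```

A high r across fields describes an association, not a cause; it says that fields with more Word-typeset papers tend to have higher compliance, not that switching tools alone would fix a given paper.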

A scatter plot shows the relation between proportion of PDFs typeset using Microsoft Word and mean normalized total accessibility compliance rate, split by fields. Fields with higher proportions of PDFs typeset using Word have higher rates of compliance; the correlation is strong (r=0.89, p <0.001). Fields that typeset least with Word are Mathematics, Physics, and Medicine (<0.05), and fields that typeset most with Word are Business, Philosophy, and Sociology (between 0.2 and 0.25).

Accessibility and Semantic Scholar

So what does this mean for Semantic Scholar? As an intermediary between the consumers and publishers of scientific papers, we relay user requests for accessing certain papers, which can involve directing the user to a PDF file containing the paper. In this process, we try to offer other valuable features that help to enhance reader understanding, or the reader’s ability to perform scholarly tasks like finding new papers or organizing a library of papers. But what about accessibility? Is there something that we can do with the large number of PDFs that we index to make them more accessible, easier to read?

Our eventual goal is to mitigate some of the accessibility challenges described above by providing accessible HTML renders of papers in our corpus. In our paper, we present a proof-of-concept solution, a system called SciA11y (demo here) which extracts and converts the semantic content of scientific PDFs into accessible HTML. We also present some enlightening findings from a user study conducted with blind and low vision researchers, in which we try to better understand the challenges they face and strategies they employ when reading papers.

Beyond our corpus, there are ongoing debates in the digital libraries and scientific publishing communities around how publishing practices should change in response to modern web infrastructure. For reference, I point readers to the ACM Digital Library and eLife as successful examples of PDF and HTML dual publishing, and PubMed Central for generating HTML renders of many of its full text papers, which are possibilities we encourage other publishers and archives to explore.

In sum, making scientific papers more accessible is everyone’s challenge, and it is especially important for Semantic Scholar as a major indexer of academic publications. The large scale of the problem, as we found, provides extra motivation for acting to address the challenge as soon as possible and as best as possible with our existing resources. If this resonated with you and you want to get involved in our mission to make papers more accessible, please reach out and let us know!

To find out more, read our preprint: “Improving the accessibility of scientific documents: current state, user needs, and a system solution to enhance scientific PDF accessibility for blind and low vision users.” arXiv: 2105.00076. Accessible PDF: here. Demo: https://scia11y.org/

Questions or feedback? Contact: accessibility@semanticscholar.org

References

  1. Wang, L.L., Cachola, I., Bragg, J., Cheng, E.Y., Haupt, C.H., Latzke, M., Kuehl, B., Zuylen, M.V., Wagner, L.M., & Weld, D.S. (2021). Improving the Accessibility of Scientific Documents: Current State, User Needs, and a System Solution to Enhance Scientific PDF Accessibility for Blind and Low Vision Users. ArXiv, abs/2105.00076.
  2. Bigham, J.P., Brady, E.L., Gleason, C., Guo, A., & Shamma, D. (2016). An Uninteresting Tour Through Why Our Research Papers Aren’t Accessible. Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems.
  3. Brady, E.L., Zhong, Y., & Bigham, J.P. (2015). Creating accessible PDFs for conference proceedings. Proceedings of the 12th International Web for All Conference.
  4. Nganji, J. (2015). The Portable Document Format (PDF) accessibility practice of four journal publishers. Library & Information Science Research, 37, 254–262.
  5. Lazar, J., Churchill, E., Grossman, T., Veer, G.C., Palanque, P.A., Morris, J., & Mankoff, J. (2017). Making the field of computing more inclusive. Communications of the ACM, 60, 50–59.
  6. Ribera, M., Pozzobon, R., & Sayago, S. (2019). Publishing accessible proceedings: the DSAI 2016 case study. Universal Access in the Information Society, 1–13.

AI2 Blog

AI for the Common Good.
