Introducing a new R package to process primary care data

How to get to grips with CPRD AURUM

Jay Hughes
The Health Foundation Data Analytics
9 min read · Jun 30, 2021



Primary care physicians play a central role within the NHS in the UK. They are usually a patient’s first point of contact for non-emergency care and also act as gatekeepers to secondary care. Insights from patient-level data are needed to help analysts and researchers understand primary care activity, patient needs and outcomes, and these insights offer huge opportunities to help improve care for patients.

However, getting started with complex primary care datasets isn’t straightforward. Juggling a large number of tables and variables, the complex record structure, clinical coding systems and the size of the files makes it challenging, both from an analytical and a computational perspective. Solving these analytical problems from scratch every time you start a project slows down progress and also prevents this kind of data from being used to its full potential.

At the Health Foundation we have used patient-level primary care data to help decision makers better understand emerging priorities for health and care. For example, we’ve shone a light on the number of people living with multiple long-term conditions. As well as creating innovations in data analytics to help tackle the big health challenges in the UK, we aim to develop and spread good practice in health data science. Our ambition is to work openly, to use open source tools and to share our code and methods, so others can benefit from our tools and assets.

In this blog, we’ll showcase the R tools we developed to prepare primary care data from the Clinical Practice Research Datalink (CPRD) for analysis, and more specifically a database called CPRD AURUM. The code can be found on GitHub, alongside guidance on how to use it. Our solutions can also be modified to tackle common challenges of other health and care datasets.

We will cover:

  • Clinical Practice Research Datalink
  • New tools for new data
  • The challenges of working with CPRD data
  • What does the code do?
  • What we learned
  • How our pipeline is working for us so far
  • Next steps

Clinical Practice Research Datalink

CPRD is a not-for-profit research service that collects electronic patient records from participating general practices and provides pseudonymised patient-level data for public health research. It is not possible to identify individual patients in any dataset that CPRD holds, because it doesn’t receive patient-identifiable information such as name, address, NHS number or date of birth from any data source. To access CPRD data, research applications are subject to independent review and require ethics approval. CPRD is widely used in the UK, with over 2,700 peer-reviewed publications to date.

Figure 1 (Source: CPRD)

New tools for new data

General practices in the NHS use several different clinical computer systems. Due to differences in structure and clinical coding, CPRD offers two distinct primary care databases: GOLD, which includes data from practices using Vision, and AURUM, which contains data from practices using EMIS Web. While GOLD has been more commonly used so far, AURUM is becoming more attractive for research due to the growing market share of EMIS among general practices in England. AURUM now captures almost 20% of current UK patients, compared to only around 4% for GOLD.

Despite this, there are few existing code and analysis resources for AURUM. We therefore set out to develop a pipeline that would efficiently and reproducibly clean and prepare CPRD AURUM data, from the initial extracts to analysis-ready data. Our end goal was to get our analyses up and running more quickly. We also saw it as an opportunity to build on our experience with other major health care datasets, for example our processing pipeline for Hospital Episode Statistics (HES) data.

The challenges of working with CPRD data

Some challenges our R tools aim to solve will be familiar to everyone who has worked with large, linked health care datasets before, while others are unique to CPRD:

  • Data extracts are often large. Individual files can be larger than what R comfortably holds in memory and so processing needs to be performed iteratively, or while keeping the data on disk. Fortunately, various R packages offer options to deal with this.
  • AURUM has a complex relational data structure. There are several tables (see figure 2) with both one-to-one and many-to-one relationships. For example, a record in the consultation table can be linked to many clinical observations, each of which can be further connected to other observations and/or to the problem, referral, and drug tables. For analysis, defining patient information based on these links is often critical.
Figure 2: AURUM dataset structure
  • Analysis often involves defining conditions and treatments based on clinical code lists. In AURUM, observations are coded using SNOMED CT, a clinical coding system that is increasingly used internationally and has also recently become a requirement for NHS providers. Before conditions can be defined from these code lists, codes in patients’ records often need to be de-duplicated and conflicting codes resolved (see the sketch after this list).
  • Study design and analysis requirements vary between projects. For example, one study might aim to explore the differences between remote and face-to-face consultations, whereas another may need to define a set of medical conditions for patients based on observations and consultations. Our code therefore takes care only of the basic cleaning and processing tasks that are commonly required.
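
To make the code-list point above concrete, here is a minimal, generic sketch of flagging an example condition from a SNOMED CT code list using dplyr. The table and column names (observations, codelist, patid, medcodeid, obsdate) follow typical AURUM conventions but are assumptions for illustration, and this is not code from our package.

```r
library(dplyr)

# Sketch only: 'observations' is an AURUM observation table and 'codelist' a
# user-defined SNOMED CT code list with one row per relevant medcodeid
# (both assumed to be data frames already loaded into R).
condition_flags <- observations %>%
  inner_join(codelist, by = "medcodeid") %>%            # keep records matching the code list
  group_by(patid) %>%
  summarise(first_record = min(obsdate, na.rm = TRUE),  # de-duplicate: keep the earliest matching date
            .groups = "drop") %>%
  mutate(has_condition = TRUE)
```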

What does the code do?

The code to create analysis-ready AURUM data can be found in the first version of the R package called `aurumpipeline`. The package can be installed directly from GitHub.
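
Since the package is installed from GitHub, a typical route is via the remotes package. This is a minimal sketch; the organisation/repository path below is an assumption and should be checked against the project’s GitHub page.

```r
# install.packages("remotes")   # if not already installed
remotes::install_github("HFAnalyticsLab/aurumpipeline")  # repository path assumed

library(aurumpipeline)
```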

The core functionality is wrapped up in its main processing function aurum_pipeline(), which by default:

  1. reads in the files supplied by CPRD,
  2. performs some basic checks,
  3. assigns the correct column data types,
  4. writes the processed data as a series of parquet files in a specified location (more on why we use parquet below).
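
As a rough illustration, a default run might look like the sketch below. The argument names are placeholders for the location of the raw CPRD text files and the output folder rather than the function’s exact signature, so check the package documentation (?aurum_pipeline) for the real interface.

```r
library(aurumpipeline)

# Hypothetical call: argument names are illustrative only.
aurum_pipeline(
  path    = "data/raw/aurum_extract",   # folder of raw CPRD AURUM text files (assumed argument)
  saveloc = "data/processed/parquet"    # where the parquet files and log go (assumed argument)
)
```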

In addition, a log file is created to keep track of all operations and outputs. The parquet files can subsequently be used by other functions in the package for further quality checks or to derive other variables. These include:

  • producing summaries of the data quality checks performed
  • checking the links between tables and reporting if they are as expected
  • creating demographic variables for each patient
  • reading and applying user-defined code lists to the data
Figure 3: Pipeline workflow
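
As an illustration of how the parquet output can feed this kind of check, the sketch below opens two of the processed tables with the arrow package and counts how many observations link back to each consultation. The file paths are placeholders for wherever the pipeline wrote its output, and the code is a generic sketch rather than one of the package’s own functions; consid is the field AURUM uses to link observations to consultations.

```r
library(arrow)
library(dplyr)

# Placeholder paths for the parquet output written by the pipeline
consultations <- open_dataset("data/processed/parquet/consultation")
observations  <- open_dataset("data/processed/parquet/observation")

# Count linked observations per consultation before pulling anything into memory
obs_per_cons <- observations %>%
  group_by(consid) %>%
  summarise(n_observations = n()) %>%
  collect()

link_summary <- consultations %>%
  select(consid) %>%
  collect() %>%
  left_join(obs_per_cons, by = "consid")

# Consultations with no linked observations show up as NA
summary(link_summary$n_observations)
```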

What we learned

1. Always start with an R package

For ease of sharing, documenting, and testing we wrapped these tools up into an R package. Developing a package is slightly different from how you might write a standalone processing script, so it pays to get used to the package development workflow as early as possible. Luckily, great resources are available if you’d like to get started with package development, and our team has some previous experience.

The workflow provided by the devtools package helped and, fairly quickly, the process of editing functions, testing and rebuilding the package became second nature. We’ve often heard people say that writing a package improves your wider R skills and, having worked on this project, we strongly agree.
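
For anyone new to that workflow, the core loop is short. The commands below are a generic devtools/usethis sketch of the edit, document, test and check cycle rather than our exact project history.

```r
# One-time setup (generic sketch)
usethis::create_package("aurumpipeline")   # scaffold the package structure
usethis::use_testthat()                    # add a unit-testing framework

# The day-to-day development loop
devtools::load_all()   # load the package code into the current session
devtools::document()   # rebuild roxygen2 documentation and the NAMESPACE
devtools::test()       # run the unit tests
devtools::check()      # full R CMD check before sharing
```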

2. Build small, versatile functions

Much of our effort went into the design of the pipeline and into balancing the skill level required to use it against its versatility:

  • More automated (‘push a button’ type) pipelines with fewer, more complex functions are easier for inexperienced users, but they are much harder to troubleshoot if and when they fail, and harder to adapt to new or changing requirements.
  • A modular workflow with a set of smaller, more focused functions might take more time to learn, but well-written guidance can help. In addition, this type of pipeline is easier to troubleshoot and to add to when new requirements are identified.

Getting the balance right between these two approaches is an ongoing learning process for us.

3. Try our favourite R packages to read, write and store large data

Despite the challenges of large datasets, R was our tool of choice due to its flexibility, our team’s existing skillset, and the ability to draw on our existing codebase. There are several packages available that make it easier to work with large data in R. Good options are:

  • The Arrow package enables the use of space-saving parquet files. Storing data in a columnar format allows for fast reading and writing, as well as the ability to filter the data before loading it into the R environment.
  • The Feather file format, which can be read and written via the Arrow package and offers even faster read and write times, at the expense of disk space.
  • RSQLite, which is a convenient way to use a SQLite database from within R. We previously used this for a pipeline to process hospital data.
  • The disk.frame package, which splits large datasets into manageable chunks and utilises parallel processing where possible.

In the end, we chose the Arrow package and parquet files to make use of these benefits:

  • fast read and write times,
  • well compressed so relatively small on disk,
  • if stored in a partitioned structure, the data can be queried before being loaded into working memory,
  • straightforward sharing across platforms as parquet is compatible with many systems, including Spark and AWS.
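
The pattern we rely on looks roughly like the sketch below: write the data once as a partitioned parquet dataset, then filter it on disk before pulling anything into R. The partitioning column, paths and data frame name are illustrative assumptions.

```r
library(arrow)
library(dplyr)

# Write an in-memory data frame as a partitioned parquet dataset
# (partitioning by a year column is just an example).
write_dataset(observation_df, "data/processed/observation",
              partitioning = "obs_year")

# Later: open the dataset lazily and filter before loading into working memory.
obs_2020 <- open_dataset("data/processed/observation") %>%
  filter(obs_year == 2020) %>%
  select(patid, medcodeid, obsdate) %>%
  collect()   # only the matching rows and columns are read into R
```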

4. Use robust R tools for data manipulation

Several other R packages ended up being key for our pipeline functions:

  • data.table, for very fast data manipulation and several other useful functions.
  • Various aspects of the tidyverse. Although it is generally considered bad practice to import the whole tidyverse into a package, many of the individual packages are useful. In particular, vroom was useful for fast text reading and for assigning column data types.
  • here, to simplify working with file structures and locations.
  • bit64, to deal with the very long numeric ID fields in AURUM. To avoid losing precision and to speed up certain operations, it is important to store these either as character or, with the help of bit64, in integer64 format.
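
For the ID fields in particular, the fix is small but important. The sketch below shows one way to read an AURUM patient file with vroom, keeping the IDs as character and then converting them to integer64; the file name and delimiter are assumptions for illustration.

```r
library(vroom)
library(bit64)

# AURUM IDs can exceed the integer precision of R's default numeric (double) type,
# so read them as character rather than letting them be parsed as numbers.
patients <- vroom(
  "patient_001.txt",                 # file name assumed
  delim = "\t",
  col_types = cols(patid = col_character(), .default = col_guess())
)

# integer64 keeps full precision and is faster to sort and join on than character.
patients$patid <- as.integer64(patients$patid)
```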

How our pipeline is working for us so far

For our team, the benefits of building this package are already being realised. Two new research projects, investigating the relationship between ethnicity and outcomes for patients with multiple long-term conditions, have made use of the pipeline to process the text files and create analysis datasets much more quickly than was previously possible. Extracting clinical codes to define variables such as ethnicity and, more recently, long-term conditions has become more straightforward.

Additionally, there is now a central place for our team to coordinate AURUM-related analysis work, creating an opportunity to share what works well and document additions and ideas for future development. This helps us to develop and document agreed definitions and a common language around the data.

Next steps

While the pipeline is already in active use, there is still plenty of scope for further improvement. For example, in the future we’d like to:

  • Expand its functionality to incorporate the processing of other linked datasets available from CPRD, such as HES, small-area deprivation scores (the Index of Multiple Deprivation) and other location-based reference data, and ONS mortality data.
  • Add to the functions that apply clinical code lists quickly to an AURUM sample. Developing and updating code lists is a large body of ongoing work that is useful to all researchers using AURUM (as well as GOLD). As these lists are refined, we would like the required variables to be created as part of the pipeline.

The biggest breakthrough for us is that every time we start a new research project using AURUM, we don’t need to start from scratch.

If you’d like to find out more, or if you’ve used the code and have any suggestions, please get in touch via GitHub.

Special thanks to Fiona Grimm, Hannah Knight, Mai Stafford, Emma Vestesson and Sarah Deeny for their contributions. We are all part of the Data Analytics team at the Health Foundation.

Jay is a data engineer at the Health Foundation, building analysis tools, pipelines and data products to support innovative analyses.