Primary Care Data – Raw codes to meaningful outputs

Primary Care data is arguably one of the richest data sources within the NHS, containing a wealth of demographic and diagnosis information, clinical activity and onward referral details. The cliché of Primary Care being the “front door to the NHS” really shines through in the variety of data available.

However, the format and structure of Primary Care data is different to other data sets that CCGs have access to. Most commissioning data sets have a defined specification and extract frequency, and are cleansed before we receive them. This makes them easy to analyse with little preparatory work.

What is the format of raw Primary Care data?

The extracts of Primary Care data that Manchester CCG can access (either through local arrangements or through GDPPR) provide access to all the coded information within the clinical system in the rawest form. An example of the extract format (using sample data) is shown below:

ID | NHS Number   | Code | Term                     | Date       | Value
1  | 123 456 7890 | 22K  | BMI Reading              | 2021-03-01 | 25.4
2  | 123 456 7890 | 66YJ | Asthma Annual Review     | 2021-03-01 |
3  | 000 000 0000 | 22K  | BMI Reading              | 2021-03-02 | 30.2
4  | 000 000 0000 | 8B3j | Asthma medication review | 2020-06-01 |
5  | 000 000 0000 | 22K  | BMI Reading              | 202-04-01  | 29.9
Example extract of Primary Care data

This makes analysis of the data straight from source difficult due to a number of factors:

– Volume of data
In Manchester we have 85 GP Practices with over 680,000 registered patients. This typically generates millions of records a month, so querying the raw data directly can be time-consuming.

– Coding terminologies
Primary Care data previously differed from other data sets, with Read Codes predominantly used rather than the ICD or OPCS codes used in Secondary Care and other settings. In April 2020 Primary Care migrated to SNOMED CT as part of a wider push to use the same coding terminology across all NHS settings. SNOMED CT is constantly evolving and new codes are introduced frequently, so any analysis must take account of newly available codes.

– Variety of codes
As the coding terminologies used within GP Practices have evolved there are a variety of ways to code similar clinical information. This has meant that there are now many codes available that mean the same thing. The table above shows two different codes that both signify an Asthma Review. Any analysis of Asthma Reviews must therefore include all possible codes to ensure accurate and fair analysis.

– Range of clinical information
Depending on the clinical code used there may be additional information available which can be extracted. The table above shows that data for BMI readings also contains the value which is critical to understand context. This applies to all measured values and this additional information must also be extracted for complete analysis of results.

– Frequency of data
Most commissioning data sets are updated monthly, but the system that enables our Primary Care data access is “live”, subject to a 2-3 day lag as data leaves GP Practices and flows through the necessary infrastructure. In theory this gives us the opportunity to have almost live reporting, but it requires us to standardise how frequently we process the volume of data coming through.

How can we turn this into meaningful outputs?

To overcome these challenges we have developed a system to process chosen clinical activity quickly:

1. Defining “clinical attributes”
Given the variety of codes available that define the same clinical activity, we have developed a system that assigns codes to “clinical attributes”. This ensures that repeat analysis is standardised and any new codes introduced can easily be added. We have hundreds of defined attributes covering diagnosis of conditions, reviews, measurements and medications prescribed. Many of these attributes are updated following Quality and Outcomes Framework (QoF) specifications or are locally agreed depending on the reporting requirements.
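As a rough illustration of the idea, a clinical attribute can be thought of as a named set of codes. The sketch below is not our production system: the attribute names and the function names are invented for illustration, the codes are the ones from the sample extract, and it assumes each code belongs to only one attribute.

```python
# Hypothetical attribute definitions: attribute name -> set of codes.
# The codes are taken from the sample extract; the names are invented.
CLINICAL_ATTRIBUTES = {
    "asthma_review": {"66YJ", "8B3j"},
    "bmi_reading": {"22K"},
}


def build_code_lookup(attributes):
    """Invert the definitions so that each code points at its attribute.

    Assumption: a code belongs to at most one attribute.
    """
    lookup = {}
    for attribute, codes in attributes.items():
        for code in codes:
            if code in lookup:
                raise ValueError(f"code {code!r} is assigned to two attributes")
            lookup[code] = attribute
    return lookup


CODE_TO_ATTRIBUTE = build_code_lookup(CLINICAL_ATTRIBUTES)


def attribute_for(code):
    """Return the attribute a code belongs to, or None if it is not tracked."""
    return CODE_TO_ATTRIBUTE.get(code)
```

Adding a newly introduced code then only means adding it to the relevant set; any analysis built on the attribute picks it up without further changes.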

2. Pre-processing of data
Procedures run across the entire dataset and extract only the data relating to the defined clinical attributes, copying it into a separate environment which is used for analysis. Whilst there is a storage implication, this is much quicker than looking across millions of records every time we need to run an analysis. The extraction procedure typically runs overnight so that the latest data is available when analysis is required.
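The filtering step can be sketched as follows. This is a simplified, in-memory stand-in for what are in practice database procedures. The first four records mirror the sample extract, the fifth (code "ZZZ9") is an invented untracked code, and the field and function names are assumptions made for the example.

```python
# Hypothetical lookup from code to clinical attribute.
CODE_TO_ATTRIBUTE = {
    "66YJ": "asthma_review",
    "8B3j": "asthma_review",
    "22K": "bmi_reading",
}

# Rows 1-4 mirror the sample extract; row 5 is invented to show an
# untracked code being dropped.
RAW_EXTRACT = [
    {"id": 1, "nhs_number": "123 456 7890", "code": "22K", "date": "2021-03-01", "value": 25.4},
    {"id": 2, "nhs_number": "123 456 7890", "code": "66YJ", "date": "2021-03-01", "value": None},
    {"id": 3, "nhs_number": "000 000 0000", "code": "22K", "date": "2021-03-02", "value": 30.2},
    {"id": 4, "nhs_number": "000 000 0000", "code": "8B3j", "date": "2020-06-01", "value": None},
    {"id": 5, "nhs_number": "000 000 0000", "code": "ZZZ9", "date": "2021-03-02", "value": None},
]


def preprocess(raw_records, code_to_attribute):
    """Keep only records whose code maps to a clinical attribute.

    Each kept record is copied and tagged with its attribute, so later
    analysis never has to scan the full raw extract.
    """
    processed = []
    for record in raw_records:
        attribute = code_to_attribute.get(record["code"])
        if attribute is None:
            continue  # code is not part of any defined attribute
        processed.append({**record, "attribute": attribute})
    return processed


PROCESSED = preprocess(RAW_EXTRACT, CODE_TO_ATTRIBUTE)
```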

3. Variable refresh frequency
Each clinical attribute has a defined refresh frequency depending on the use case. For example, COVID vaccination information is processed daily as this is critical information, whereas BMI readings are extracted monthly as these are less time-sensitive and are used for more ad-hoc analysis. Varying the frequency at which attributes are processed ensures that we have the right information available at the right time.
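One simple way to picture this is a per-attribute interval that is checked each time the overnight job runs. The sketch below is illustrative only: the attribute names, the function name and the use of 30 days as a stand-in for "monthly" are all assumptions.

```python
from datetime import date

# Hypothetical refresh intervals in days; 30 is an approximation of "monthly".
REFRESH_INTERVAL_DAYS = {
    "covid_vaccination": 1,
    "bmi_reading": 30,
}


def attributes_due(last_refreshed, today):
    """Return the attributes whose refresh interval has elapsed.

    last_refreshed maps attribute name -> date of the last refresh.
    Attributes that have never been refreshed are always due.
    """
    due = []
    for attribute, interval in REFRESH_INTERVAL_DAYS.items():
        last = last_refreshed.get(attribute)
        if last is None or (today - last).days >= interval:
            due.append(attribute)
    return due


last_run = {
    "covid_vaccination": date(2021, 3, 1),
    "bmi_reading": date(2021, 3, 1),
}
next_day = attributes_due(last_run, date(2021, 3, 2))
month_later = attributes_due(last_run, date(2021, 3, 31))
```

Here `next_day` contains only the vaccination attribute, while `month_later` contains both.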

4. Time-splicing
Once all of the attributes are pre-processed, queries are set up to search for codes across date ranges, which we call time-splicing. This is useful for understanding how performance of clinical activity changes across schemes, and how patients' health changes over time. For example, patient reviews usually occur annually between 1st April and 31st March in line with QoF, so analysis of this activity would look for the latest code in the relevant financial year. Understanding the range of BMI readings for patients, however, would simply look for the latest reading, as this is the most clinically relevant.
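The two cases in this step (latest code within a financial year, and latest reading overall) can be sketched with one helper that takes an optional date range. The records mirror the first four rows of the sample extract, and the function and field names are assumptions for the example rather than our actual queries.

```python
from datetime import date


def financial_year(d):
    """Return (start, end) of the 1 April to 31 March year containing d."""
    start_year = d.year if d.month >= 4 else d.year - 1
    return date(start_year, 4, 1), date(start_year + 1, 3, 31)


def latest_per_patient(records, attribute, start=None, end=None):
    """Return the latest record per patient for one attribute.

    If start or end is given, only records within that inclusive date
    range are considered.
    """
    latest = {}
    for record in records:
        if record["attribute"] != attribute:
            continue
        if start is not None and record["date"] < start:
            continue
        if end is not None and record["date"] > end:
            continue
        current = latest.get(record["nhs_number"])
        if current is None or record["date"] > current["date"]:
            latest[record["nhs_number"]] = record
    return latest


# Pre-processed records based on the sample extract.
RECORDS = [
    {"nhs_number": "123 456 7890", "attribute": "bmi_reading", "code": "22K", "date": date(2021, 3, 1), "value": 25.4},
    {"nhs_number": "123 456 7890", "attribute": "asthma_review", "code": "66YJ", "date": date(2021, 3, 1), "value": None},
    {"nhs_number": "000 000 0000", "attribute": "bmi_reading", "code": "22K", "date": date(2021, 3, 2), "value": 30.2},
    {"nhs_number": "000 000 0000", "attribute": "asthma_review", "code": "8B3j", "date": date(2020, 6, 1), "value": None},
]

# Reviews: latest code within the 2020/21 financial year.
fy_start, fy_end = financial_year(date(2021, 3, 1))
reviews = latest_per_patient(RECORDS, "asthma_review", fy_start, fy_end)

# BMI: latest reading with no date restriction.
bmi = latest_per_patient(RECORDS, "bmi_reading")
```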

5. Standardised outputs for analysis
With the above steps complete we can then produce standardised outputs specific to a question. To analyse the proportion of patients with Asthma who have received an Asthma Review between 1st April 2020 and 1st July 2020 we would:
– Start with a patient level list of all patients registered with a GP Practice in Manchester
– Link this to the list of patients who are currently on the Asthma Register as of 1st July 2020
– Further link a list of patients who have received an Asthma Review between 1st April 2020 and 1st July 2020
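The linking steps above can be sketched with plain sets of patient identifiers. This is a minimal illustration rather than our actual reporting code: the patient identifiers "A" to "D" are invented, and the function name is an assumption.

```python
def review_coverage(registered, asthma_register, reviewed):
    """Proportion of registered asthma patients who received a review.

    All three arguments are sets of patient identifiers:
    - registered: patients registered with a GP Practice
    - asthma_register: patients on the Asthma Register at the end date
    - reviewed: patients with an Asthma Review in the date range
    """
    denominator = registered & asthma_register
    numerator = denominator & reviewed
    proportion = len(numerator) / len(denominator) if denominator else None
    return {
        "on_register": len(denominator),
        "reviewed": len(numerator),
        "proportion": proportion,
    }


# Invented patient identifiers for illustration.
registered = {"A", "B", "C", "D"}
asthma_register = {"A", "B", "C"}
reviewed = {"A", "C", "D"}  # D was reviewed but is not on the register

result = review_coverage(registered, asthma_register, reviewed)
```

Starting from the registered list and linking outwards keeps patients without a review in the denominator, which is what makes the proportion meaningful.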

As we have become more familiar with Primary Care data the above steps have become automated, to the point where many of our outputs now automatically refresh without any input.

A tangible example of this is the process we have set up for reporting COVID vaccination coverage and variations by demographic factors, which will be the focus of a future blog.

Thanks for taking the time to read this. Any feedback is always appreciated!

