Chaos in the standardisation — how to handle HL7.
A simple guide for simple parsing
Imagine someone gives you a single message that looks like the one below and asks you to extract the details of every test. You’re not a medical standards guru, but you give it a quick look:
MSH|^~\&|SendingApp|SendingFac|ReceivingApp|ReceivingFac|20120226102502||ORU^R01|Q161522306T164850327|P|2.3
PID|1||000168674|000168674|GUNN^BEBE||19821201|F||||||||M|||890-12-3456|||N||||||||N
PV1|1|I||EL|||00976^PHYSICIAN^DAVID^G|976^PHYSICIAN^DAVID^G|01055^PHYSICIAN^RUTH^K~02807^PHYSICIAN^ERIC^LEE~07019^GI^ASSOCIATES~01255^PHYSICIAN^ADAM^I~02084^PHYSICIAN^SAYED~01116^PHYSICIAN^NURUDEEN^A~01434^PHYSICIAN^DONNA^K~02991^PHYSICIAN^NICOLE|MED||||7|||00976^PHYSICIAN^DAVID^G||^^^Chart ID^Vis|||||||||||||||||||||||||20120127204900
ORC|RE|||||||||||00976^PHYSICIAN^DAVID^G
OBR|1|88855701^STDOM|88855701|4083023^PT|||20120226095400|||||||20120226101300|Blood|01255||||000002012057000145||20120226102500||LA|F||1^^^20120226040000^^R~^^^^^R|||||||||20120226040000
OBX|1|NM|PT Patient^PT||22.5|second(s)|11.7-14.9|H|||F|||20120226102500||1^SYSTEM^SYSTEM
OBX|2|NM|PT (INR)^INR||1.94||||||F|||20120226102500||1^SYSTEM^SYSTEM
NTE|1||The optimal INR therapeutic range for stable patients on oral anticoagulants is 2.0 - 3.0. With mechanical heart valves,
NTE|2||the range is 2.5 - 3.5.
NTE|3
NTE|4||Studies published in NEJM show that patients treated long-term with low intensity warfarin therapy for prevention of recurrent
NTE|5||venous thromboembolism (with a target INR of 1.5 - 2.0) had a superior outcome. These results were seen in patients after a median
NTE|6||6 months of full dose anti-coagulation.
It doesn’t look that bad, does it? You can see some kind of structure, repeating fragments and even the names of the tests. “Though this be madness, yet there’s method in’t”. As with every other problem, you go with the “let me google it” method, but you soon discover that the usual bunch of already asked (and answered!) questions on Stack Overflow is nowhere to be found.
You’ll find yourself surrounded by serious, lengthy documentation of the HL7 standard and its use, a few interpreters and even fewer libraries. However, you don’t have much time to dig through all that narrow-field knowledge. You want to keep the task as simple as it should be. Then you find this guide.
Understanding HL7 structure
This is your first step in unwrapping HL7: how a message is constructed and how you can unpack the “boxed” data you want.
Types of messages
First, you should know what kind of messages you will be handling, as there are a few types of them:
a) ORM: an Order Message, the first message in the exchange of results. It can be “triggered” by placing a new order (test), cancelling a previous one or otherwise changing the test “request”.
b) ORU: an Observation Result Unsolicited message, which can be called an “answer message” and contains, in our case, the test result!
c) ACK: an acknowledgement message, a more generic type.
What does a typical exchange look like?
First you send an ORM message to your healthcare system provider with information about a specific order (e.g. a blood test), and then you can retrieve from their server the ORU message that contains the results of that test.
So you will mostly deal with ORU messages, as you want the results extracted into some nice structure.
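To tell which kind of message has landed in your hands, you can read the message type from the ninth field of the MSH segment (ORU^R01 in the example). The snippet below is a minimal, library-free sketch; the function name message_type is made up for this illustration, and it assumes the message starts with a well-formed MSH segment.

```python
def message_type(raw_message):
    """Return (message code, trigger event) read from MSH-9."""
    header = raw_message.splitlines()[0]
    if not header.startswith("MSH"):
        raise ValueError("an HL7 v2 message must start with an MSH segment")
    field_sep = header[3]        # MSH-1 is the field separator itself, usually "|"
    component_sep = header[4]    # first encoding character of MSH-2, usually "^"
    fields = header.split(field_sep)
    # Because MSH-1 is the separator, MSH-9 sits at index 8 after splitting.
    code, _, trigger = fields[8].partition(component_sep)
    return code, trigger


header = r"MSH|^~\&|SendingApp|SendingFac|ReceivingApp|ReceivingFac|20120226102502||ORU^R01|Q161522306T164850327|P|2.3"
print(message_type(header))  # ('ORU', 'R01')
```

With a check like this you can route ORU messages to the result parser and skip or log everything else.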
Message segments
Each HL7 message, whatever its type, has its own specific segments (as you can see on the chart above). Let’s focus on the ORU type and its parts:
- MSH: the message header, which lists the encoding characters, the date and time of the message, the HL7 version used and other basic information about the message itself.
- ORC: the Common Order Segment, with information about who placed the order and who “filled” it.
- PID: the patient identification part.
- OBR: the observation request segment, with information about the collection date and time, the collector and any relevant clinical notes.
- OBX: the observation results, our star. One message can have more than one OBX segment (e.g. for a full blood test), and each of them contains a single test value (let’s say a red blood cell count).
Now, when you look once again at the example message on fig. 2, you can clearly see different parts.
MSH segment:
MSH|^~\&|SendingApp|SendingFac|ReceivingApp|ReceivingFac|20120226102502||ORU^R01|Q161522306T164850327|P|2.3
PID segment:
PID|1||000168674|000168674|GUNN^BEBE||19821201|F||||||||M|||890-12-3456|||N||||||||N
PV1 segment:
PV1|1|I||EL|||00976^PHYSICIAN^DAVID^G|976^PHYSICIAN^DAVID^G|01055^PHYSICIAN^RUTH^K~02807^PHYSICIAN^ERIC^LEE~07019^GI^ASSOCIATES~01255^PHYSICIAN^ADAM^I~02084^PHYSICIAN^SAYED~01116^PHYSICIAN^NURUDEEN^A~01434^PHYSICIAN^DONNA^K~02991^PHYSICIAN^NICOLE|MED||||7|||00976^PHYSICIAN^DAVID^G||^^^Chart ID^Vis|||||||||||||||||||||||||20120127204900
ORC segment:
ORC|RE|||||||||||00976^PHYSICIAN^DAVID^G
OBR segment:
OBR|1|88855701^STDOM|88855701|4083023^PT|||20120226095400|||||||20120226101300|Blood|01255||||000002012057000145||20120226102500||LA|F||1^^^20120226040000^^R~^^^^^R|||||||||20120226040000
OBX segment (one of two):
OBX|1|NM|PT Patient^PT||22.5|second(s)|11.7-14.9|H|||F|||20120226102500||1^SYSTEM^SYSTEM
NTE segments:
NTE|1||The optimal INR therapeutic range for stable patients on oral anticoagulants is 2.0 - 3.0. With mechanical heart valves,
NTE|2||the range is 2.5 - 3.5.
NTE|3
NTE|4||Studies published in NEJM show that patients treated long-term with low intensity warfarin therapy for prevention of recurrent
NTE|5||venous thromboembolism (with a target INR of 1.5 - 2.0) had a superior outcome. These results were seen in patients after a median
NTE|6||6 months of full dose anti-coagulation.
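The same split into segments can be done in a few lines of plain Python. This is only a sketch to show the idea: group_segments is a hypothetical helper, and MESSAGE is a shortened copy of the example (the PV1, OBR and most NTE segments are left out to keep the listing readable).

```python
from collections import defaultdict

# Shortened copy of the example message; segments are joined with the
# carriage return that HL7 v2 uses as the segment terminator.
MESSAGE = "\r".join([
    r"MSH|^~\&|SendingApp|SendingFac|ReceivingApp|ReceivingFac|20120226102502||ORU^R01|Q161522306T164850327|P|2.3",
    "PID|1||000168674|000168674|GUNN^BEBE||19821201|F||||||||M|||890-12-3456|||N||||||||N",
    "ORC|RE|||||||||||00976^PHYSICIAN^DAVID^G",
    "OBX|1|NM|PT Patient^PT||22.5|second(s)|11.7-14.9|H|||F|||20120226102500||1^SYSTEM^SYSTEM",
    "OBX|2|NM|PT (INR)^INR||1.94||||||F|||20120226102500||1^SYSTEM^SYSTEM",
    "NTE|2||the range is 2.5 - 3.5.",
])


def group_segments(raw_message):
    """Group segment strings by their three-letter ID, keeping message order."""
    groups = defaultdict(list)
    # splitlines() accepts the carriage return as well as \n or \r\n.
    for segment in raw_message.splitlines():
        if segment.strip():
            groups[segment[:3]].append(segment)
    return dict(groups)


groups = group_segments(MESSAGE)
print(list(groups))         # segment IDs in order of first appearance
print(len(groups["OBX"]))   # 2
```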
Where data sits
Now let’s go deeper into the OBX segment itself.
When handling raw data, you’ll be looking at something like this:
OBX|1|NM|PT Patient^PT||22.5|second(s)|11.7-14.9|H|||F|||20120226102500||1^SYSTEM^SYSTEM
You can see that many fields are empty (there’s no information between two pipe characters: ||). Why? Not all of the fields are obligatory (I’ve marked the obligatory ones with a bold font below and with a grey field on the schema).
OBX|segment-number|segment-type|observation-id|observation-sub-id|observation-value|units|reference-range|abnormal-flags|probability|nature-of-abnormal-test|observation-result-status|last-normal-observation-date|user-access-checks|observation-datetime|producer-id|responsible-observer|observation-method
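To see how the raw line lines up with this schema, you can pair every field with its label. The sketch below reuses the field labels from the schema above; label_obx_fields is a made-up helper name, and it assumes the default | separator with no escaped delimiters inside the values.

```python
OBX_FIELD_NAMES = [
    "segment-number", "segment-type", "observation-id", "observation-sub-id",
    "observation-value", "units", "reference-range", "abnormal-flags",
    "probability", "nature-of-abnormal-test", "observation-result-status",
    "last-normal-observation-date", "user-access-checks",
    "observation-datetime", "producer-id", "responsible-observer",
    "observation-method",
]


def label_obx_fields(obx_line):
    """Pair each OBX field with its label; missing trailing fields become ''."""
    segment_id, *values = obx_line.split("|")
    if segment_id != "OBX":
        raise ValueError(f"expected an OBX segment, got {segment_id!r}")
    # Senders may drop empty fields at the end of a segment, so pad the list.
    values += [""] * (len(OBX_FIELD_NAMES) - len(values))
    return dict(zip(OBX_FIELD_NAMES, values))


obx = "OBX|1|NM|PT Patient^PT||22.5|second(s)|11.7-14.9|H|||F|||20120226102500||1^SYSTEM^SYSTEM"
fields = label_obx_fields(obx)
filled = {name: value for name, value in fields.items() if value}
print(filled)
```

Printing only the filled fields makes it obvious how sparse a real segment usually is.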
Before jumping into parsing, let’s have one last quick look at the value types you can encounter:
- AD: address
- DT: date
- ED: encapsulated data
- FT: formatted text
- ST: string data
- TM: time
- TN: telephone number
- TS: date and time (timestamp)
- TX: text data
- NM: numeric data
- SN: structured numeric, which can be an interval, a ratio or an inequality
Some of these types can have the letter X at the beginning, which means “extended” (for example XAD for an extended address).
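The value type tells you how to convert the raw text of the observation value. Here is a small sketch of such a conversion for a few of the types listed above. convert_value is a hypothetical helper; it assumes timestamps come as 8, 12 or 14 digits and ignores time zones and fractional seconds, so treat it as a starting point only.

```python
from datetime import datetime
from decimal import Decimal

# Timestamp layouts keyed by string length (an assumption of this sketch).
_TS_FORMATS = {8: "%Y%m%d", 12: "%Y%m%d%H%M", 14: "%Y%m%d%H%M%S"}


def convert_value(value_type, raw_value):
    """Convert a raw OBX value according to its declared value type."""
    if raw_value == "":
        return None
    if value_type == "NM":
        # Decimal keeps "1.94" exactly as the laboratory reported it.
        return Decimal(raw_value)
    if value_type == "DT":
        return datetime.strptime(raw_value, "%Y%m%d").date()
    if value_type == "TS":
        return datetime.strptime(raw_value, _TS_FORMATS[len(raw_value)])
    # ST, TX, FT and anything not handled above stay as plain text.
    return raw_value


print(convert_value("NM", "22.5"))            # 22.5
print(convert_value("TS", "20120226102500"))  # 2012-02-26 10:25:00
```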
As you can see, it’s now easy to spot the type of an observation (result), its value, unit, normal range and even its label (low/high/normal).
Python 3 hands-on: just start parsing!
After discovering and understanding the structure of HL7, the task of extracting the specific information looks bearable.
Libraries
To avoid being thrown into a black hole of text mining, you can choose between two Python libraries that will help you navigate the message easily.
1. The all-in-one lightweight solution
hl7apy is an open-source “lightweight library, fully compliant with HL7 consortium specifications”. It allows you not only to parse, but also to create and validate messages. At the moment it’s compatible with HL7 standard versions from 2.2 to 2.6. If you would like a full range of options and a stable tool to do whatever comes to mind, this is the right library for you.
2. Simple, parsing!
The second library is more straightforward when it comes to just parsing. python-hl7 doesn’t have a high entry threshold. It works with every 2.x version of the HL7 standard. It should be your pick if you only want to parse messages.
As the workflow focuses on parsing messages to extract numerical values, the python-hl7 seems to be a good choice.
Ready, set … go!
After installing the library, let’s grab our example message once again:
MSH|^~\&|SendingApp|SendingFac|ReceivingApp|ReceivingFac|20120226102502||ORU^R01|Q161522306T164850327|P|2.3
PID|1||000168674|000168674|GUNN^BEBE||19821201|F||||||||M|||890-12-3456|||N||||||||N
PV1|1|I||EL|||00976^PHYSICIAN^DAVID^G|976^PHYSICIAN^DAVID^G|01055^PHYSICIAN^RUTH^K~02807^PHYSICIAN^ERIC^LEE~07019^GI^ASSOCIATES~01255^PHYSICIAN^ADAM^I~02084^PHYSICIAN^SAYED~01116^PHYSICIAN^NURUDEEN^A~01434^PHYSICIAN^DONNA^K~02991^PHYSICIAN^NICOLE|MED||||7|||00976^PHYSICIAN^DAVID^G||^^^Chart ID^Vis|||||||||||||||||||||||||20120127204900
ORC|RE|||||||||||00976^PHYSICIAN^DAVID^G
OBR|1|88855701^STDOM|88855701|4083023^PT|||20120226095400|||||||20120226101300|Blood|01255||||000002012057000145||20120226102500||LA|F||1^^^20120226040000^^R~^^^^^R|||||||||20120226040000
OBX|1|NM|PT Patient^PT||22.5|second(s)|11.7-14.9|H|||F|||20120226102500||1^SYSTEM^SYSTEM
OBX|2|NM|PT (INR)^INR||1.94||||||F|||20120226102500||1^SYSTEM^SYSTEM
NTE|1||The optimal INR therapeutic range for stable patients on oral anticoagulants is 2.0 - 3.0. With mechanical heart valves,
NTE|2||the range is 2.5 - 3.5.
NTE|3
NTE|4||Studies published in NEJM show that patients treated long-term with low intensity warfarin therapy for prevention of recurrent
NTE|5||venous thromboembolism (with a target INR of 1.5 - 2.0) had a superior outcome. These results were seen in patients after a median
NTE|6||6 months of full dose anti-coagulation.
Now parse it and check what is, for example, the first OBX segment.
A little tip: sometimes you can encounter a strange error that the library cannot parse your message because of an unexpected character — check your encoding first (maybe something like “utf-8-sig” will work).
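The tip above can be turned into a small preparation step that runs before the text reaches the parser. This is a sketch under two assumptions: the file is UTF-8 (possibly with a byte order mark), and it was saved with \n or \r\n line endings instead of the carriage return that HL7 v2 uses between segments. decode_hl7 is a made-up helper name and the sample is a shortened two-segment fragment.

```python
def decode_hl7(raw_bytes):
    """Decode bytes read from a file and normalise the segment terminators."""
    # "utf-8-sig" reads UTF-8 and silently drops a leading byte order mark.
    text = raw_bytes.decode("utf-8-sig")
    # Keep non-empty lines only and join them with the HL7 segment terminator.
    segments = [line for line in text.splitlines() if line.strip()]
    return "\r".join(segments)


sample = b"\xef\xbb\xbfMSH|^~\\&|SendingApp|SendingFac\r\nPID|1||000168674\r\n"
print(repr(decode_hl7(sample)))
```

The returned string is what you would then hand over to the parsing library.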
The parsed message, parsed_msg, looks like this:
[[['MSH'], ['|'], ['^~\\&'], ['SendingApp'], ['SendingFac'], ['ReceivingApp'], ['ReceivingFac'], ['20120226102502'], [''], [[['ORU'], ['R01']]], ['Q161522306T164850327'], ['P'], ['2.3\n']], [['PID'], ['1'], [''], ['000168674'], ['000168674'], [[['GUNN'], ['BEBE']]], [''], ['19821201'], ['F'], [''], [''], [''], [''], [''], [''], [''], ['M'], [''], [''], ['890-12-3456'], [''], [''], ['N'], [''], [''], [''], [''], [''], [''], [''], ['N\n']], [['PV1'], ['1'], ['I'], [''], ['EL'], [''], [''], [[['00976'], ['PHYSICIAN'], ['DAVID'], ['G']]], [[['976'], ['PHYSICIAN'], ['DAVID'], ['G']]], [[['01055'], ['PHYSICIAN'], ['RUTH'], ['K']], [['02807'], ['PHYSICIAN'], ['ERIC'], ['LEE']], [['07019'], ['GI'], ['ASSOCIATES']], [['01255'], ['PHYSICIAN'], ['ADAM'], ['I']], [['02084'], ['PHYSICIAN'], ['SAYED']], [['01116'], ['PHYSICIAN'], ['NURUDEEN'], ['A']], [['01434'], ['PHYSICIAN'], ['DONNA'], ['K']], [['02991'], ['PHYSICIAN'], ['NICOLE']]], ['MED'], [''], [''], [''], ['7'], [''], [''], [[['00976'], ['PHYSICIAN'], ['DAVID'], ['G']]], [''], [[[''], [''], [''], ['Chart ID'], ['Vis']]], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], ['20120127204900\n']], [['ORC'], ['RE'], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [[['00976'], ['PHYSICIAN'], ['DAVID'], ['G\n']]]], [['OBR'], ['1'], [[['88855701'], ['STDOM']]], ['88855701'], [[['4083023'], ['PT']]], [''], [''], ['20120226095400'], [''], [''], [''], [''], [''], [''], ['20120226101300'], ['Blood'], ['01255'], [''], [''], [''], ['000002012057000145'], [''], ['20120226102500'], [''], ['LA'], ['F'], [''], [[['1'], [''], [''], ['20120226040000'], [''], ['R']], [[''], [''], [''], [''], [''], ['R']]], [''], [''], [''], [''], [''], [''], [''], [''], ['20120226040000\n']], [['OBX'], ['1'], ['NM'], [[['PT Patient'], ['PT']]], [''], ['22.5'], ['second(s)'], ['11.7-14.9'], ['H'], [''], [''], ['F'], [''], [''], ['20120226102500'], [''], 
[[['1'], ['SYSTEM'], ['SYSTEM\n']]]], [['OBX'], ['2'], ['NM'], [[['PT (INR)'], ['INR']]], [''], ['1.94'], [''], [''], [''], [''], [''], ['F'], [''], [''], ['20120226102500'], [''], [[['1'], ['SYSTEM'], ['SYSTEM\n']]]], [['NTE'], ['1'], [''], ['The optimal INR therapeutic range for stable patients on oral anticoagulants is 2.0 - 3.0. With mechanical heart valves,\n']], [['NTE'], ['2'], [''], ['the range is 2.5 - 3.5.\n']], [['NTE'], ['3\n']], [['NTE'], ['4'], [''], ['Studies published in NEJM show that patients treated long-term with low intensity warfarin therapy for prevention of recurrent\n']], [['NTE'], ['5'], [''], ['venous thromboembolism (with a target INR of 1.5 - 2.0) had a superior outcome. These results were seen in patients after a median\n']], [['NTE'], ['6'], [''], ['6 months of full dose anti-coagulation.']]]
Nice! As it’s now a special hl7.containers.Message type, we can easily check whether there are any OBX segments. Or, maybe, you would like to know how many of them you have and print the first one:
###OUTPUT###
In this message you have 2 OBX segments
The first segment is: OBX|1|NM|PT Patient^PT||22.5|second(s)|11.7-14.9|H|||F|||20120226102500||1^SYSTEM^SYSTEM
As you can see, the structure of the message (or at least its “levels”) has changed and no longer matches a straightforward field-pipe-field layout. Because of that, you now have to check how deep you need to go into these nested lists.
Let’s assume that you need the name of an observation first. If you type only parsed_msg["OBX"][0][3], you will end up with a still nested structure:
[[['PT Patient'], ['PT']]]
“We need to go deeper!”: the expression parsed_msg["OBX"][0][3][0][1][0] will, finally, return a string with the name of the test in the first OBX segment. In a similar way you can extract all the information about the test!
'PT' - name
'22.5' - value
'second(s)' - unit
'11.7-14.9' - normal range
'H' - abnormal flag (high)
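Why so many indices? Each level of nesting corresponds to one of the delimiters declared in the MSH segment: ~ separates repetitions, ^ separates components and & separates subcomponents. The sketch below rebuilds that nesting for a single field with plain Python. It is not how python-hl7 is implemented, only an illustration that assumes the default delimiters and no escape sequences; split_field is a made-up name.

```python
def split_field(raw_field):
    """Split one field into repetitions (~), components (^) and subcomponents (&)."""
    return [
        [component.split("&") for component in repetition.split("^")]
        for repetition in raw_field.split("~")
    ]


nested = split_field("PT Patient^PT")
print(nested)           # [[['PT Patient'], ['PT']]]
print(nested[0][1][0])  # PT
```

Reading nested[0][1][0] from left to right: first repetition, second component, first subcomponent.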
Last, but not least: remember to check the value type of each OBX segment you want to catch (NM, ST or something else) and filter on it.
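Putting the pieces together, the whole extraction with a filter on the value type could look roughly like this. It is a library-free sketch working on the raw text, so the indices are plain positions after splitting on | rather than python-hl7 accessors; numeric_results is a hypothetical function name, and taking the second component of the observation ID as the test name simply mirrors the example above.

```python
def numeric_results(raw_message):
    """Collect name, value, unit, range and flag from every NM-typed OBX segment."""
    results = []
    for segment in raw_message.splitlines():
        fields = segment.split("|")
        if fields[0] != "OBX":
            continue
        fields += [""] * (9 - len(fields))  # pad so short segments do not raise
        if fields[2] != "NM":               # keep numeric observations only
            continue
        components = fields[3].split("^")
        results.append({
            "name": components[1] if len(components) > 1 else components[0],
            "value": float(fields[5]) if fields[5] else None,
            "unit": fields[6],
            "range": fields[7],
            "flag": fields[8],
        })
    return results


message = "\r".join([
    "OBX|1|NM|PT Patient^PT||22.5|second(s)|11.7-14.9|H|||F|||20120226102500||1^SYSTEM^SYSTEM",
    "OBX|2|NM|PT (INR)^INR||1.94||||||F|||20120226102500||1^SYSTEM^SYSTEM",
    "NTE|2||the range is 2.5 - 3.5.",
])
for result in numeric_results(message):
    print(result)
```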
Where is chaos?
You can say to yourself at this point: “Well, I don’t see much chaos here”. That’s right: the structure and the standard themselves are as systematic as they can be. However, add human error to the mix and you have a recipe for thousands of exceptions and your own black hole of data processing.
The problem shows up as you get more and more data: the names of the very same test will differ, because results come from different laboratories or are filled in by different specialists. This is especially true for a basic test such as White Blood Cell Count, or White Cell Count, or White Cells, or WBC, or Leucocytes, or WCC, you name it.
It doesn’t matter when only basic extraction is expected, but at Maxwell Plus we do our best to create not only our vision of post-scarcity, but also a consistent system that makes results easier to understand and more available for both patients and clinicians.
In that case, at the end of our data processing we would like to have not only complete, but also uniform data for each result.
Closing Pandora’s box of exceptions
LOINC
If you are lucky, your companies and laboratories use a specific code for each result, based on the test name, method or units. In the OBX segment it will look like this:
OBX|0|NM|234^RBC^HSP_A^26453-1^Erythrocytes [#/volume] in Blood^LN||4.88|MIL/MM3|||||F|
If you are more than lucky, these are not internal codes but codes based on an international, open standard: LOINC.
LOINC can describe a single lab test, a collection, a set of questions(!) or just a simple measurement (patient height or weight).
The Regenstrief Institute, which manages the standard, provides two products that can help you get it into your own product: LOINC (to simplify: the basis for a database with all tests coded) and RELMA (“the desktop mapping tool for your local terms”).
What you, as a developer tasked with building a robust parser for HL7 files, need to know about LOINC:
- for each coded test it contains information about the component (name), property, time, system (specimen), scale and method. That’s more than enough.
- it brings unification and order to the nightmare of chaotic names, for free
- the LOINC codes are placed in the OBX segment, in the observation-id field. You can extract them in almost the same way as the test name
- on the LOINC website you can find a lot of learning resources: articles, presentations and videos
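For the OBX example above, pulling out the LOINC code could be sketched as follows. The observation ID holds two triplets (identifier, text, coding system), and the coding system LN marks a LOINC code. loinc_code is a hypothetical helper and, as before, it assumes the default ^ component separator.

```python
def loinc_code(observation_id):
    """Return the LOINC code from an observation ID, or None if it carries none."""
    components = observation_id.split("^")
    components += [""] * (6 - len(components))
    # Two triplets: (identifier, text, coding system), primary then alternate.
    for start in (0, 3):
        identifier, _text, system = components[start:start + 3]
        if system == "LN" and identifier:
            return identifier
    return None


obx = "OBX|0|NM|234^RBC^HSP_A^26453-1^Erythrocytes [#/volume] in Blood^LN||4.88|MIL/MM3|||||F|"
print(loinc_code(obx.split("|")[3]))  # 26453-1
```

Checking both triplets matters because, as in this example, the local code often comes first and the LOINC code second.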
No standard, no gain
What if you have just a name, without any coded ID that might be useful? In that situation your task of building a robust data-processing script is no longer simple.
The first problem you need to address is the variety of names for the same test. You can try to create your own mapping:
"White Blood Cell Count": ["count/L", ["White Cell Count",
"White Cells", "WBC", "Leucocytes", "WCC"]],
As you have surely noticed, a unit is mapped there as well. This raises the next obstacle: what if the same test comes with different units (remember that the task is now more than extraction!)? You can then choose a preferred unit and try to convert the others.
But the conversion is for another story!
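As for the names themselves, a mapping like the one above is easier to use once it is inverted into a lookup table. The snippet below is one possible sketch: the names TEST_NAME_MAPPING, build_alias_index and canonical_name are invented for this example, and matching is done on lower-cased names with collapsed whitespace, which is an assumption you may want to tighten or loosen.

```python
# Canonical name -> [preferred unit, known aliases], as in the mapping above.
TEST_NAME_MAPPING = {
    "White Blood Cell Count": ["count/L", ["White Cell Count", "White Cells",
                                           "WBC", "Leucocytes", "WCC"]],
}


def normalise(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def build_alias_index(mapping):
    """Invert the mapping into {normalised alias: canonical name}."""
    index = {}
    for canonical, (_unit, aliases) in mapping.items():
        for name in [canonical, *aliases]:
            index[normalise(name)] = canonical
    return index


ALIAS_INDEX = build_alias_index(TEST_NAME_MAPPING)


def canonical_name(reported_name):
    """Return the canonical test name, or None when the name is unknown."""
    return ALIAS_INDEX.get(normalise(reported_name))


print(canonical_name("  wbc "))       # White Blood Cell Count
print(canonical_name("Haemoglobin"))  # None
```

Returning None for unknown names lets you collect them in a log and extend the mapping as new laboratories appear.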
Now the HL7 structure should be clearer and all its parts more visible at first sight! Data processing is a little bit easier and you are closer to building a whole healthcare system. Cool!
Good luck with parsing!