Hitachi Solutions Named Strategic Consulting Partner to John Snow Labs — How to Leverage Spark NLP on Databricks

Angelina Maria Leigh
Hitachi Solutions Braintrust
16 min read · May 5, 2021

Implementation & Industry Use Cases

I. Introduction

Hitachi Solutions is a business application consulting firm and trusted provider of vertical industry solutions built on the Microsoft Cloud. Hitachi’s mission is to help its clients compete with the largest global enterprises by using powerful, easy-to-use, and affordable industry solutions.

Hitachi’s culture is defined by its values and its deep commitment to helping its clients succeed. Hitachi Solutions is a division of the 38th largest company in the world and brings to bear the strength of a very large network of interconnected Hitachi companies. At the same time, Hitachi remains committed to the agility that helped it grow Hitachi Solutions from three founding partners to nearly 2,000 consultants, developers, and support personnel around the globe.

Hitachi Solutions is a recognized leader in delivering success with business applications based on the Microsoft Cloud. The company is the recipient of the following awards:

30x Microsoft Partner of the Year Winner

19x Microsoft Customer Excellence Winner

18x Microsoft Inner Circle Achievement

20x Microsoft President’s Club Achievement

Hitachi Solutions America provides global capabilities with regional offices in the United States, the United Kingdom, Canada, India, Japan, China, and Asia Pacific.

Databricks is the data and AI company. More than 5,000 organizations worldwide — including Comcast, Condé Nast, H&M, and over 40 percent of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems.

John Snow Labs Inc. is a healthcare AI company accelerating progress in data science by taking on the headaches of managing data, models, and platforms. The team’s expertise spans data science, medicine, data engineering, pharma, data research, security, and compliance. As a USA corporation run as a global virtual team located in 20 countries, John Snow Labs believes in being a great partner, in making its customers wildly successful, and in using data philanthropy to make the world a better place.

John Snow Labs is the developer of Spark NLP, a state-of-the-art natural language processing library which has grown to be used by 22 percent of enterprises within 18 months of its first release (O’Reilly, AI Adoption in the Enterprise, 2019) and 33 percent of practitioners today (Gradient Flow, The NLP Industry Survey, 2020). The company has been recognized with multiple industry awards over the years.

John Snow Labs is a thought leader in AI and Natural Language Processing and regularly presents at top-tier technology conferences such as Strata Data, Spark+AI Summit, Global AI Conference, Open Data Science, and O’Reilly AI.

Introduction to Natural Language Processing

Natural Language Processing (NLP) is a key component in many data science systems that must understand or reason about text. Common use cases include knowledge extraction, question answering, entity recognition, spell correction, sentiment analysis, and document classification.

Simply put, natural language processing uses AI and machine learning to extract meaning from text. However, understanding human language, with all of its intricacies, dialects, and inflections, is sometimes difficult even for people, let alone computers. NLP is revolutionary because, as TechTarget defines it:

“Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken. The development of NLP applications is challenging because computers traditionally require humans to “speak” to them in a programming language that is precise, unambiguous and highly structured, or through a limited number of clearly enunciated voice commands. Human speech, however, is not always precise — it is often ambiguous and the linguistic structure can depend on many complex variables, including slang, regional dialects and social context.”

AI has advanced to the level today where natural language processing can analyze, extract meaning from, and determine actionable insights from both syntax and semantics in text.

Spark NLP

John Snow Labs specializes in helping companies accelerate their adoption of AI technology, in particular in the field of Natural Language Processing. The company is the developer of the open-source NLP library for Apache Spark and has a full team working on the library, which delivered 30 releases during 2019 while supporting the open-source community and commercial customers. Spark NLP is consistently recognized as the most widely used NLP library by practitioners.

Spark NLP provides Python, Java and Scala libraries with the full functionality of traditional NLP libraries (like spaCy, nltk, Stanford CoreNLP and Open NLP) and adds additional functionality such as spell checking, sentiment analysis, and document classification. It improves on previous efforts by providing state-of-the-art accuracy, speed, and scalability.

Spark NLP is by far the fastest open source NLP library, with recent public benchmarks showing it to be 38x and 80x faster than spaCy with comparable accuracy for training custom models. Spark NLP is also the only open source library that can leverage a distributed Spark cluster, since it is a native extension of Spark ML and operates directly on data frames. Therefore, speedups on a cluster result in another order of magnitude of performance gain. Since every Spark NLP pipeline is a Spark ML pipeline, it is particularly well suited to building unified NLP and ML pipelines such as document classification, risk prediction, and recommenders.

State of the Art

In addition to performance, Spark NLP also delivers state-of-the-art accuracy for a growing number of NLP tasks. The team regularly reads the latest academic papers in this area and productizes the most accurate models. In the past 2–3 years, the best performing models use deep learning, and as such the library comes with prebuilt deep learning models for Named Entity Recognition, Document Classification, Sentiment and Emotion Detection, and Sentence Detection. The library also includes dozens of pre-trained language models, including support for word, chunk, sentence, and document embeddings.

The library has optimized builds for CPUs, GPUs, and the latest Intel Xeon chips. Both training and inference can scale to leverage Spark clusters and run in production on all popular analytics platforms.

John Snow Labs is an established thought leader in the NLP space and is often invited to speak at top technology conferences to explain the state of the art.

II. Implementation

Customer Model

The partnership between Databricks, Hitachi Solutions and John Snow Labs offers a flexible integration framework that helps customers design, build, iterate, and integrate rapidly. Hitachi Solutions serves as the trusted systems integrator (SI), working with customers to solve their hardest industry problems.

Databricks brings the power of Spark for data and AI to the forefront of distributed computing. John Snow Labs’ Spark NLP and Spark NLP for Healthcare run within the Databricks runtime like any other library package, solving natural language tasks such as classification, entity extraction, OCR, de-identification, and more.

Hitachi Solutions, Databricks and John Snow Labs focus on cost savings and shorter time to market for business processes that require understanding medical documents, from clinical design support through real-world data analysis to clinical trial submissions. Often, the level of accuracy now achievable opens new revenue and optimization opportunities for customers. Furthermore, the ability to use existing medical data assets for research, predictive models, real-world evidence, and sharing with pharma (sometimes as another revenue stream) enables customers to iterate faster when building AI.

John Snow Labs in Databricks

Spark NLP 3.0 combines a set of major under-the-hood optimizations and upgrades that give the open-source community the most scalable and most tightly optimized NLP library ever.

Spark NLP 3.0 went through intensive testing and profiling across all supported platforms, including their latest versions. Spark NLP 3 is officially supported on:

  • Spark 3.1, 3.0, 2.4, and 2.3
  • Databricks 6.x, 7.x, 8.x — both CPU and ML GPU
  • Linux, MacOS, and Windows — for local development
  • Docker — with and without Kubernetes
  • Hadoop 2.7.x and 3.x
  • AWS EMR 5.x and 6.x
  • Cloudera and Hortonworks
  • AWS, Azure, and GCP

Spark NLP is most widely used in Python, but as always there is a complete and supported API in Scala and Java.

Beyond newly supported platforms, the big news for this release is a leap in the library’s speed, with a focus on the most common NLP tasks. As an example, here is an apples-to-apples comparison of Spark NLP 3.0 against the previous version (2.7) on 120,000 documents from AG’s corpus of news articles, which together contain more than 4 million tokens. The benchmark was run on Databricks 7.3 LTS ML using GPUs with 10 AWS workers (g4dn.2xlarge), and the new version is:

  • 7.9 times faster in calculating BERT-Large
  • 6.5 times faster in calculating BERT-base
  • 3.0 times faster in calculating named entity recognition

Spark NLP 3 delivers much faster results on both CPUs and GPUs. John Snow Labs spent several months working deep in the internals of neural network optimization, multi-threading, in-memory vs. on-chip computation, distributed execution planning, and compiler optimization for modern deep learning libraries and compute platforms. John Snow Labs would like to thank the teams at Databricks (Spark and MLflow), Google (TensorFlow), Intel (MKL), and NVIDIA (Spark and RAPIDS) for their support throughout this work.

Methodologically, Spark NLP follows the pipeline approach common in PySpark development:

[Figure: Holistic pipeline]

[Figure: Code pipeline]

Developing transformer models or other NLP tasks follows the same idea as building a pipeline with MLlib in PySpark. First, the pipeline is defined as a sequence of stages; second, the .fit function runs the defined pipeline on a DataFrame.
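The define-then-fit pattern can be illustrated without a cluster. What follows is a minimal plain-Python sketch, not the Spark NLP or Spark ML API: the classes ToyTokenizer, ToyVocabEstimator, ToyVocabModel, and ToyPipeline are hypothetical names invented here, and rows are plain dictionaries standing in for a DataFrame.

```python
from dataclasses import dataclass


class ToyTokenizer:
    """Stateless stage: adds a 'tokens' column with lowercase whitespace tokens."""

    def transform(self, rows):
        return [{**row, "tokens": row["text"].lower().split()} for row in rows]


@dataclass
class ToyVocabModel:
    """Fitted stage: maps each token to the integer id learned during fit."""
    vocab: dict

    def transform(self, rows):
        return [
            {**row, "ids": [self.vocab.get(tok, -1) for tok in row["tokens"]]}
            for row in rows
        ]


class ToyVocabEstimator:
    """Trainable stage: fit() learns a vocabulary and returns a fitted model."""

    def fit(self, rows):
        vocab = {}
        for row in rows:
            for tok in row["tokens"]:
                vocab.setdefault(tok, len(vocab))
        return ToyVocabModel(vocab)


@dataclass
class ToyPipeline:
    """Step 1: the pipeline is defined as an ordered list of stages."""
    stages: list

    def fit(self, rows):
        """Step 2: run the stages in order, fitting the trainable ones."""
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipeline(fitted)

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"text": "Patient denies nausea"}, {"text": "Patient is diabetic"}]
pipeline = ToyPipeline([ToyTokenizer(), ToyVocabEstimator()])  # define
model = pipeline.fit(train)                                    # fit
result = model.transform([{"text": "patient is nauseous"}])    # apply to new data
```

Here the unseen token "nauseous" receives the id -1 because it was not in the training rows. In Spark ML the same two steps apply, with the difference that fitting a Pipeline returns a PipelineModel that is then applied to new DataFrames.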

III. Industry Use Cases

Privacy

Spark NLP for Healthcare is a commercial extension provided as licensed software that can be installed and run on any Spark cluster. The software can be configured to work without network access and doesn’t “call home”, so PHI data can be processed in a locked-down environment within your control, without ever sending or sharing data with John Snow Labs.

State-of-the-art deep learning models for clinical and biomedical NLP in Spark NLP for Healthcare include:

  • Named Entity Recognition
  • Entity Resolution / Normalization
  • Assertion Status Detection (Negation Detection)
  • Relation Extraction (Temporal, Drug Elements, Disease/Treatment)
  • Spell Checking and Correction
  • Sensitive Data detection
  • De-identification via masking and obfuscation
  • Sentence Boundary Detection
  • Healthcare-Specific Embeddings

The library offers access to several clinical and biomedical transformers: John Snow Labs-BERT-Clinical, BioBERT, ClinicalBERT, GloVe-Med, GloVe-ICD-O, and others. It also includes over 50 pre-trained healthcare models that can recognize the following entities:

  • Clinical — Signs, Symptoms, Treatments, Procedures, Tests, Labs, Sections
  • Drugs — Name, Dosage, Strength, Route, Duration, Frequency
  • Anatomy — Organ, Subdivision, Cell, Structure, Organism, Tissue, Gene, Chemical
  • Demographics — Age, Gender, Height, Weight, Race, Ethnicity, Marital Status, Vital Signs
  • Sensitive Data — Patient Name, Address, Phone, Email, Dates, Providers, Identifiers

Spark NLP for Healthcare is by far the most widely used natural language processing library among practitioners in the healthcare space (Gradient Flow, The NLP Industry Survey 2020).

Healthcare Tech Outlook awarded John Snow Labs the Healthcare Analytics Provider of the Year Award in July 2020, summarizing its findings from interviewing customers (which include Kaiser Permanente, Roche, SelectData, McKesson, CancerLinQ, DocuSign, and others):

“By all accounts, John Snow Labs has created the most accurate software in history to extract facts from unstructured text.”

Business Use Case Identification

1. Automated Clinical Coding Audit and Chart Review

Objective: Accurately answer clinical and billing questions by reading patient records, which can be a hundred or more pages long. These answers are used to code medical records. Medical coding determines the correct code for billing a claim as well as articulating known health conditions for treatment.

Solution

This is a clinical document classification and information extraction use case. It leverages Spark NLP for Healthcare and Spark OCR to extract fuzzy, implied, and complex facts from home health patient records, and deploys the solution at scale in a PHI-compliant setting.

This work was presented as a joint case study at Strata Data New York on Sep 13, 2018.

The solution’s outline is as follows:

  • Documents to Text
      • Enhanced OCR
      • Medical spell correction
  • Text to Features
      • Entity Recognition
      • Entity Normalization
      • Assertion Status
  • Features to Models
      • Document Classification
      • Automated Coding
2. Improving Patient Flow Forecasting

Objective: Optimize the patient flow models and provide insights for real-time decision-making and for strategic planning, by predicting bed demand, ‘safe’ staffing levels, and hospital gridlock.

Solution

John Snow Labs assisted Kaiser’s enterprise architecture team in integrating data from medical facilities and developing a model that forecasts patient flow in a hospital. Accurate forecasting is critical to ensuring that enough beds and nurses are available to take care of incoming patients.

Further, in order to perform accurate risk prediction on who will be admitted from the emergency department, John Snow Labs built and deployed custom deep learning NLP models that mined patient records (specifically ER triage notes).

Predictive features were developed from both structured and unstructured data.

This work was presented as a joint case study at the Strata Data Conference on Mar 7, 2018.

3. Automating Knowledge Extraction from Pathology Reports

Objective: Many critical facts required by healthcare AI applications like patient risk prediction, cohort selection, and clinical decision support are locked in unstructured free-text reports. There’s a strong need to unlock unstructured data to build a comprehensive longitudinal view of the patient, to enable both clinical decision support and population analytics.

Solution

This is an information extraction, clinical NER and OCR case study which applies deep learning using Spark NLP for Healthcare to extract clinical facts from unstructured free-text pathology and radiology reports, in order to enable downstream clinical decision support for oncologists.

This work was presented as a joint case study at the Strata Data Conference on Mar 27, 2019.

The following NLP pipeline was presented:

  • OCR of de-identified hospital reports
  • Sentence Boundary Detection: Extracting complete sentences from messy, multi-page documents
  • Tokenization: point of analysis for every other annotator algorithm
  • Clinical Part of Speech (POS) Tagging: assign a reference to the role each token has in a sentence
  • Named Entity Recognition (NER): trained a model to extract 45-plus labels from TCGA reports
  • Entity Resolution (ER): pre-trained models for resolving healthcare entities to standard SNOMED-CT and ICD-10 codes
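To make the flow of these stages concrete, here is a deliberately simplified plain-Python sketch. It is not the presented solution and not the Spark NLP API: the case study trained deep learning models, whereas this toy swaps in a dictionary lookup for both NER and entity resolution. The lexicon, the labels, and the codes (TOY-0001 and so on) are invented placeholders, not real SNOMED-CT or ICD-10 codes.

```python
import re

# Hypothetical lexicon: surface form -> (entity label, placeholder code).
# The codes are invented for illustration and are NOT real terminology codes.
TOY_LEXICON = {
    "carcinoma": ("DIAGNOSIS", "TOY-0001"),
    "breast": ("ANATOMY", "TOY-0002"),
    "biopsy": ("PROCEDURE", "TOY-0003"),
}


def split_sentences(text):
    """Sentence boundary detection: naive split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Tokenization: lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def extract_entities(text):
    """NER and entity resolution by dictionary lookup, one record per match."""
    records = []
    for idx, sentence in enumerate(split_sentences(text)):
        for token in tokenize(sentence):
            if token in TOY_LEXICON:
                label, code = TOY_LEXICON[token]
                records.append(
                    {"sentence": idx, "text": token, "label": label, "code": code}
                )
    return records


report = "Core biopsy of the left breast was performed. Findings show invasive carcinoma."
entities = extract_entities(report)
```

Each record keeps the index of the sentence it came from, which is the kind of provenance a downstream decision-support step would need in order to show the clinician where a fact was found.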

4. Clinical Trial Recruitment

Objective: Recruiting patients for clinical trials is a major challenge in drug development. Finding patients requires an in-depth understanding of their medical histories and current health statuses, yet the majority of patient data is unstructured and spread across physician notes, pathology, imaging, genomic, and other reports. This case study applies machine learning, deep learning, and Spark NLP to accelerate this slow and manual process.

Solution

Deep 6 uses the Spark NLP library to apply state-of-the-art deep learning to accurately extract relevant clinical facts from unstructured text. These facts are then used in subsequent data science pipelines to construct patients’ medical histories. Being able to match trials’ inclusion and exclusion criteria with structured and normalized patient histories is the key to automatically matching patients accurately and at scale. The solution’s components include:

  • Clinical named entity recognition, and mapping of entities to standard clinical terminology codes
  • Assertion status (negation) detection — to distinguish between present, absent and possible findings
  • Custom models to understand each patient’s stage (symptomatic, in treatment, in remission)

This work was presented as a joint case study at Strata Data New York on Sep 25, 2019.
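The matching step described above, comparing inclusion and exclusion criteria against normalized patient histories, can be sketched with set operations. This is a minimal illustration under stated assumptions, not Deep 6’s implementation: the Finding class, the criteria, and the patient records are all hypothetical, and real criteria are far richer than flat concept sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One normalized clinical fact; 'status' mirrors the assertion labels."""
    concept: str
    status: str  # "present", "absent" or "possible"


def present_concepts(history):
    """Concepts asserted as present; absent and possible findings do not count."""
    return {f.concept for f in history if f.status == "present"}


def is_eligible(history, inclusion, exclusion):
    """Eligible when every inclusion concept is present and no exclusion concept is."""
    present = present_concepts(history)
    return inclusion <= present and not (exclusion & present)


# Hypothetical trial criteria and patient histories.
inclusion = {"type 2 diabetes"}
exclusion = {"renal failure"}

patients = {
    "patient_a": [Finding("type 2 diabetes", "present"), Finding("renal failure", "absent")],
    "patient_b": [Finding("type 2 diabetes", "present"), Finding("renal failure", "present")],
    "patient_c": [Finding("type 2 diabetes", "possible")],
}

eligible = sorted(
    pid for pid, history in patients.items()
    if is_eligible(history, inclusion, exclusion)
)
```

Only patient_a qualifies: patient_b meets an exclusion criterion, and patient_c’s diabetes is only a possible finding. This shows why assertion status matters, since a matcher that ignored it would treat a negated or uncertain mention as a confirmed condition.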

Sample John Snow Labs Solutions

1. De-identification

John Snow Labs de-identification methods include masking, obfuscation, generalization, shifting, and hashing. They are configurable per field and are easy to enable with Spark NLP for Healthcare.
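Three of these methods (masking, shifting, and hashing) can be illustrated with the Python standard library. The sketch below is a toy under stated assumptions, not Spark NLP for Healthcare: the regular expressions only cover ISO dates and one US-style phone format, the function names are hypothetical, and a production system would detect sensitive entities with trained models rather than patterns.

```python
import hashlib
import re
from datetime import date, timedelta

# Toy patterns: ISO dates (YYYY-MM-DD) and phone numbers like 555-123-4567.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def mask_phones(text):
    """Masking: replace each phone number with a fixed placeholder tag."""
    return PHONE_RE.sub("<PHONE>", text)


def shift_dates(text, days):
    """Shifting: move every date by the same offset so intervals are preserved.

    Assumes each matched string is a valid calendar date.
    """
    def _shift(match):
        year, month, day = (int(part) for part in match.groups())
        return (date(year, month, day) + timedelta(days=days)).isoformat()

    return DATE_RE.sub(_shift, text)


def hash_identifier(value, salt):
    """Hashing: derive a stable pseudonym from an identifier with salted SHA-256."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


note = "Seen on 2021-03-01, follow-up 2021-03-15. Call 555-123-4567."
deidentified = shift_dates(mask_phones(note), days=10)
pseudonym = hash_identifier("MRN-0042", salt="demo-salt")
```

Shifting both dates by the same ten days keeps the 14-day gap between visit and follow-up intact, which is what makes shifted data still usable for analysis. The hashed pseudonym is deterministic for a given salt, so records for the same patient can still be linked without exposing the original identifier.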

2. OCR

Spark OCR is another commercial extension of Spark NLP for optical character recognition from images, scanned PDF documents, and DICOM files. It is a software library built on top of Apache Spark and offers the following capabilities:

  • Image pre-processing algorithms to improve text recognition results
      • Adaptive thresholding and denoising
      • Skew detection and correction
      • Adaptive scaling
      • Layout analysis and region detection
      • Image cropping
      • Removing background objects
  • Text recognition, by combining NLP and OCR pipelines
      • Extracting text from images (optical character recognition)
      • Extracting data from tables (table extraction)
      • Recognizing and highlighting named entities in PDF documents
      • Masking sensitive text in order to de-identify images
  • Output generation in different formats
      • PDF, images, or DICOM files with annotated or masked entities
      • Digital text for downstream processing in Spark NLP or other libraries
      • Structured data formats (JSON and CSV), as files or Spark data frames
  • Scale out: distribute the OCR jobs across multiple nodes in a Spark cluster
  • Frictionless unification of OCR, NLP, ML and DL pipelines

3. NER

Named Entity Recognition extracts structured data from free text to automate record keeping and to feed downstream tasks that need to understand free text.

John Snow Labs pretrained NER models have been leading industry benchmarks and show overall better accuracies than competitor solutions in the cloud market as well as other Natural Language Libraries.

[Figure: Cloud services benchmarking]

[Figure: NLP library benchmarking]

4. Assertion Status

Assertion status detection takes the entities found by the NER model and uses deep learning to classify each one with a label expressing the timing or status of a condition.

The step that follows the NER model in the clinical NLP pipeline is to assign an assertion status to each named entity given its context. The status of an assertion explains how a named entity (e.g., clinical finding, procedure, lab result) pertains to the patient by assigning a label such as present (“patient is diabetic”), absent (“patient denies nausea”), conditional (“dyspnea while climbing stairs”), or associated with someone else (“family history of depression”). (Improving Clinical Document Understanding on COVID-19 Research with Spark NLP)
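The four labels can be made tangible with a toy classifier. Note the substitution: the approach described above uses deep learning, while the sketch below swaps in a handful of hand-written cue-phrase rules purely for illustration. The rule list and label names are assumptions made here, and such rules would miss many real-world phrasings.

```python
import re

# Hypothetical cue phrases, checked in order; the first rule that matches wins.
CUE_RULES = [
    ("associated_with_someone_else", r"\b(family history|mother|father|sibling)\b"),
    ("absent", r"\b(denies|no|not|without|negative for)\b"),
    ("conditional", r"\b(while|when|if|on exertion)\b"),
]


def assertion_status(sentence):
    """Return a coarse assertion label for the entity mentioned in the sentence."""
    lowered = sentence.lower()
    for label, pattern in CUE_RULES:
        if re.search(pattern, lowered):
            return label
    # No cue phrase found: treat the finding as present.
    return "present"


examples = [
    "patient is diabetic",
    "patient denies nausea",
    "dyspnea while climbing stairs",
    "family history of depression",
]
labels = [assertion_status(s) for s in examples]
```

On the four example phrases from the text, the rules yield present, absent, conditional, and associated_with_someone_else in that order. A learned model earns its keep on the cases rules handle poorly, such as a negation cue that belongs to a different entity in the same sentence.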

IV. Conclusion

Hitachi Solutions

When it comes to providing your patients with exceptional and, in some cases, life-saving care, you can’t afford to let anything stand in your way — especially unstructured data.

Hitachi Solutions is committed to helping organizations within the healthcare and health insurance industries do more with their data using innovative solutions and services, including natural language processing. All of Hitachi’s offerings come backed by decades of proven data expertise, and it has the resources to help your organization go further, faster, and at scale.

Specifically, Hitachi Solutions data experts will work closely with your team to identify data and, within four weeks, will deliver a robust modeling environment designed to accelerate your cloud journey. Services within the four weeks include: Delta Lake environment set up, notebook templates with pre-defined engineering patterns for mounting storage to compute, syncing to data sources, key vault security enablement, consumption cost estimates, visualization and comprehensive data analysis.

Are you ready to take patient care to the next level using NLP? There’s no time like the present to get started — contact Hitachi today to learn more.

Meet Your Solution Integrators:

John Young — VP of Data Science and Machine Learning

Angelina Leigh — NLP Data Scientist

Fred Heller — Sr. Data Engineer

John Snow Labs

Spark NLP for Healthcare is used by 54 percent of healthcare AI teams. It delivers proven, peer-reviewed, state-of-the-art accuracy for common clinical and biomedical text mining tasks, from information extraction to de-identification. It runs inside Databricks clusters, scales natively, supports both training and inference, and comes with 1,000-plus pre-trained models.

In addition to the software libraries, John Snow Labs also provides Spark NLP training & certification, enterprise support, and implementation services. The software is licensed as an annual subscription which includes all new functionality that is released during the subscription period (new releases have consistently happened twice a month for the past three years); regular model updates (so that models’ accuracy remains current with new academic advances, terminologies, and embeddings); and enterprise support by practicing NLP data scientists.

Visit John Snow Labs here.

Meet Your AI Innovators:

David Talby — CTO

Moritz Steller — AI Evangelist

Veysel Kocaman — Principal Data Scientist

Databricks

Databricks enables Health and Life Sciences companies to ingest all data types — including structured data, text, image, and genomics — into Delta Lake. Unifying data and analytics on a single platform enables a range of use cases, including genetic association studies for drug discovery, sepsis risk prediction, fraud detection, and next-best-action models.

Databricks has over 350 customers in the Healthcare and Life Sciences industry, including 9 of the 10 largest pharmaceutical companies and 8 of the 10 largest healthcare companies.

Visit Databricks here.

Meet Your Industry Experts:

Michael Sanky — Global Industry Lead, Health & Life Sciences

Amir Kermany — Technical Director, Health & Life Sciences

Marc Lobree — Director, Consulting Partners
