spark-nlp - Medium

Spark NLP 6.3.3: ModernBERT Embeddings, Vector DB Integration, and Layout-Aware Document Processing

Muhammad Abdullah — Fri, 15 May 2026 19:44:56 GMT

Building production NLP pipelines means juggling tradeoffs: embeddings that can’t handle long documents, custom glue code to push vectors into a database, document readers that lose spatial context, and inference engines that can’t carry metadata. Spark NLP 6.3.3 addresses all of these in one release.

This version ships five new capabilities: ModernBertEmbeddings for faster, longer-context text embeddings; VectorDBConnector for seamless vector database ingestion; LayoutAlignerForVision and LayoutAlignerForText for layout-aware multimodal document understanding; MultiColumnAssembler for merging annotation columns; and enhanced LightPipeline with metadata support.

Here’s a quick overview before we go deep on each feature:

TL;DR

ModernBertEmbeddings: 8x faster BERT with 8,192-token context and 5x less memory
VectorDBConnector: push embeddings straight into Pinecone from your pipeline, no glue code
LayoutAlignerForVision/Text: keeps images and text spatially linked when processing PDFs and PPTX files
MultiColumnAssembler: merge separate annotation columns into one with source tracking
LightPipeline Metadata: pass context like source or category through your inference pipeline

ModernBertEmbeddings: 8x Faster, 8192-Token Embeddings

Three ways ModernBERT outperforms classic BERT. Token context window scaled to reflect the full 16× difference.

If you’ve been working with BERT-based embeddings and hitting walls around sequence length or throughput, ModernBertEmbeddings is the upgrade you’ve been waiting for. Based on the Smarter, Better, Faster, Longer paper, ModernBERT was trained on 2 trillion tokens and brings three major improvements over classic BERT:

8x faster inference through architectural optimizations including Flash Attention, Unpadding, and GeGLU activation
5x lower memory usage, enabling larger batches and more cost-effective deployments
8,192-token native sequence length, sixteen times the 512-token limit of classic BERT, eliminating the need to truncate long documents, legal texts, or code files

These gains are reported compared to BERT-base under equivalent benchmark conditions.
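To make the context-window arithmetic concrete, here is a minimal sketch in plain Python (no Spark NLP required). The 20,000-token document length is an assumed example, and windows_needed is a hypothetical helper rather than part of any library:

```python
import math

BERT_MAX_TOKENS = 512          # classic BERT limit cited above
MODERNBERT_MAX_TOKENS = 8192   # ModernBERT native sequence length


def windows_needed(num_tokens: int, max_tokens: int) -> int:
    """Number of non-overlapping windows needed to cover num_tokens."""
    return math.ceil(num_tokens / max_tokens)


doc_tokens = 20_000  # assumed length of a long contract or code file

context_ratio = MODERNBERT_MAX_TOKENS // BERT_MAX_TOKENS
bert_windows = windows_needed(doc_tokens, BERT_MAX_TOKENS)
modernbert_windows = windows_needed(doc_tokens, MODERNBERT_MAX_TOKENS)

print(context_ratio)        # 16
print(bert_windows)         # 40
print(modernbert_windows)   # 3
```

Under these assumptions, a document that would have to be split into 40 pieces for classic BERT fits in 3 ModernBERT windows.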

For proof, see the authors’ benchmark reporting in the ModernBERT paper and the official model card evaluation tables: arXiv paper, PDF, and ModernBERT-base model card.

ModernBERT produces 768-dimensional token-level WORD_EMBEDDINGS, making it a drop-in replacement for existing BERT-based embedding stages in your pipelines.

Getting Started

The default pretrained model is modernbert-base. Here’s how to use it:

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = ModernBertEmbeddings.pretrained("modernbert-base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("modernbert_embeddings") \
    .setMaxSentenceLength(8192)

embeddings_finisher = EmbeddingsFinisher() \
    .setInputCols(["modernbert_embeddings"]) \
    .setOutputCols(["finished_embeddings"]) \
    .setOutputAsVector(True)

pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    embeddings,
    embeddings_finisher
])

data = spark.createDataFrame([["Spark NLP is an open-source library."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show()

+--------------------+
|              result|
+--------------------+
|[0.27688124775886...|
|[-0.4956612586975...|
|[0.57157814502716...|
|[0.40465933084487...|
|[0.00266735162585...|
|[0.48091220855712...|
|[0.20593170821666...|
+--------------------+

Engine Support

ModernBertEmbeddings supports three inference backends: TensorFlow, ONNX, and OpenVINO. You can import custom HuggingFace ModernBERT models via ONNX using loadSavedModel():

embeddings = ModernBertEmbeddings.loadSavedModel(
    "/path/to/onnx/model/folder",
    spark
)

For more examples, including how to import custom HuggingFace models, see the ModernBertEmbeddings notebook.

VectorDBConnector: From Embeddings to Vector Search in One Pipeline

For teams building semantic search, retrieval-augmented generation (RAG), or similarity-based recommendation systems, the gap between generating embeddings and storing them in a vector database has always required custom integration code: extracting embeddings from DataFrames, formatting payloads, managing batch upserts, and handling API authentication.

VectorDBConnector eliminates all of that. It plugs directly into your Spark NLP pipeline and automatically stores embeddings from any Spark NLP embedding annotator into a vector database.

How It Works

VectorDBConnector takes two input columns, a DOCUMENT column and a SENTENCE_EMBEDDINGS column, and upserts the embedding vectors to your configured vector database index. It handles batching, ID management, and metadata serialization automatically.

from sparknlp.annotator.vector_db import VectorDBConnector

vectorDB = VectorDBConnector() \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("vectordb_result") \
    .setProvider("pinecone") \
    .setIndexName("my-semantic-index") \
    .setNamespace("production") \
    .setIdColumn("doc_id") \
    .setMetadataColumns(["text", "category"]) \
    .setBatchSize(100)

Configuring Your API Key

The Pinecone API key can be set via Spark configuration or an environment variable:

import sparknlp

spark = sparknlp.start(params={
    "spark.jsl.settings.vectordb.api.key": "your-pinecone-api-key"
})

Output

Each processed row produces an output annotation containing:

result: The vector ID (either from your idColumn or a generated UUID)
metadata: Includes vectordb_status: “upserted” and provider: “pinecone”
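To illustrate the kind of bookkeeping described above (ID selection with a UUID fallback, metadata selection, and batching), here is a minimal sketch in plain Python. It is not the connector's implementation: build_records, batched, the record layout, and the sample rows are all hypothetical and chosen only for illustration.

```python
import uuid
from typing import Dict, Iterator, List, Optional


def build_records(rows: List[Dict], id_column: Optional[str],
                  metadata_columns: List[str]) -> List[Dict]:
    """Turn rows into upsert records: an ID, the vector, and selected metadata."""
    records = []
    for row in rows:
        # Use the caller's ID when present, otherwise fall back to a UUID
        row_id = row.get(id_column) if id_column else None
        records.append({
            "id": str(row_id) if row_id is not None else str(uuid.uuid4()),
            "values": row["embedding"],
            "metadata": {c: row[c] for c in metadata_columns if c in row},
        })
    return records


def batched(records: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Hypothetical rows; the third one has no doc_id and gets a generated UUID
rows = [
    {"doc_id": "a1", "text": "first", "category": "news", "embedding": [0.1, 0.2]},
    {"doc_id": "a2", "text": "second", "category": "legal", "embedding": [0.3, 0.4]},
    {"text": "third", "category": "news", "embedding": [0.5, 0.6]},
]

records = build_records(rows, id_column="doc_id", metadata_columns=["text", "category"])
batches = list(batched(records, batch_size=2))

print([len(b) for b in batches])   # [2, 1]
print(records[0]["id"])            # a1
```

Batching keeps each request to the vector database bounded in size, which is the role the batchSize parameter plays in the connector configuration shown earlier.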

A Complete RAG Ingestion Pipeline

Here’s a full pipeline that reads documents, generates embeddings, and stores them in Pinecone:

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

embeddings = BertSentenceEmbeddings.pretrained() \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentence_embeddings")

vectorDB = VectorDBConnector() \
    .setInputCols(["sentence", "sentence_embeddings"]) \
    .setOutputCol("vectordb_result") \
    .setProvider("pinecone") \
    .setIndexName("my-rag-index") \
    .setMetadataColumns(["text", "source"]) \
    .setBatchSize(100)

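pipeline = Pipeline().setStages([document, sentence, embeddings, vectorDB])

# `data` is assumed to be a DataFrame with "text" and "source" columns.
# Spark transformations are lazy, so call an action to actually run the pipeline.
result = pipeline.fit(data).transform(data)
result.select("vectordb_result").show(truncate=False)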

For a full walkthrough, check out the VectorDBConnector Pinecone Demo notebook.

LayoutAlignerForVision and LayoutAlignerForText: Multimodal Document Understanding

When processing rich documents like PDFs or PowerPoint presentations, text and images are spatially interleaved. A revenue chart sits next to the paragraph that discusses it. A product diagram is surrounded by specifications. Without layout awareness, downstream models operating on extracted content lose this spatial context entirely: the chart gets separated from its explanation, and the diagram floats without context.

LayoutAlignerForVision and LayoutAlignerForText solve this problem by forming a two-stage pipeline that preserves the spatial relationship between text and images throughout document processing.

Two-Stage Flow (at a glance)

Selective Multimodal Enrichment

Stage 1: LayoutAlignerForVision: Aligning Images with Text

LayoutAlignerForVision takes document text chunks and images extracted by ReaderAssembler and aligns each image with its spatially nearby text paragraphs based on actual page coordinates. It uses a distance-based confidence scoring system:

Distance ≤ 10px: Confidence 0.95 (very close alignment)
Distance ≤ paragraph spacing: Confidence 0.75 (moderate alignment)
Greater distance: Confidence 0.40 (loose alignment)
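The three tiers above can be read as a simple threshold function. The sketch below is a plain Python illustration of that mapping, not the annotator's source code; alignment_confidence and the 24-pixel paragraph spacing are assumptions made for the example.

```python
def alignment_confidence(distance_px: float, paragraph_spacing_px: float) -> float:
    """Map an image-to-text distance onto the three confidence tiers listed above."""
    if distance_px <= 10:
        return 0.95  # very close alignment
    if distance_px <= paragraph_spacing_px:
        return 0.75  # moderate alignment
    return 0.40      # loose alignment


spacing = 24.0  # assumed paragraph spacing in pixels
print(alignment_confidence(4.0, spacing))    # 0.95
print(alignment_confidence(18.0, spacing))   # 0.75
print(alignment_confidence(60.0, spacing))   # 0.4
```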

The annotator is format-aware: it understands slide boundaries in PowerPoint files and page boundaries in PDFs, scoping its alignment search to the correct visual context.

It produces three output columns for each alignment:

_doc: The aligned text chunk
_image: The aligned image
_prompt: A captioning prompt ready for a Vision-Language Model (VLM)

from sparknlp.reader import LayoutAlignerForVision

aligner_vision = LayoutAlignerForVision() \
    .setInputCols(["data_text", "data_image"]) \
    .setOutputCol("aligned") \
    .setMaxDistance(40) \
    .setIncludeContextWindow(True) \
    .setAddNeighborText(True) \
    .setImageCaptionBasePrompt(
        "Describe the image with concise business details."
    ) \
    .setNeighborTextCharsWindow(500) \
    .setExplodeDocs(True)

Stage 2: LayoutAlignerForText: Rebuilding Coherent Documents

After LayoutAlignerForVision pairs images with text and a VLM generates captions, LayoutAlignerForText weaves those captions back into the document’s text flow. It replaces raw image placeholders with meaningful captions and re-computes character offsets so the resulting document is coherent for downstream NLP tasks like chunking, NER, or embedding.

The annotator is smart about image placement: it classifies each image as “before text” or “after text” based on the image’s position type (inline vs. floating) and its x-coordinate, deduplicates captions that may have been matched to multiple paragraphs, and produces a clean rebuilt document.

from sparknlp.reader import LayoutAlignerForText

aligner_text = LayoutAlignerForText() \
    .setInputCols(["aligned_doc", "image_caption"]) \
    .setOutputCol("aligned_text") \
    .setJoinDelimiter("\n") \
    .setExplodeElements(False)
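To show what weaving captions back into the text and recomputing offsets can look like, here is a simplified plain Python sketch. It is an illustration under stated assumptions, not the annotator's algorithm: rebuild_text, the element dictionaries, the precomputed "placement" field, and the inclusive end offsets are all hypothetical choices.

```python
from typing import Dict, List


def rebuild_text(elements: List[Dict], delimiter: str = "\n") -> List[Dict]:
    """Weave captions into paragraphs and recompute begin/end offsets."""
    rebuilt, seen_captions, cursor = [], set(), 0
    for element in elements:
        parts = [element["text"]]
        caption = element.get("caption")
        # Skip a caption that was already attached to an earlier paragraph
        if caption and caption not in seen_captions:
            seen_captions.add(caption)
            if element.get("placement") == "before":
                parts.insert(0, caption)
            else:
                parts.append(caption)
        text = delimiter.join(parts)
        # Assumed convention: inclusive end offset; next element starts after the delimiter
        rebuilt.append({"result": text, "begin": cursor, "end": cursor + len(text) - 1})
        cursor += len(text) + len(delimiter)
    return rebuilt


# The same caption was matched to both paragraphs; only the first keeps it
elements = [
    {"text": "Revenue grew in Q2.", "caption": "Bar chart of quarterly revenue.", "placement": "after"},
    {"text": "Details follow.", "caption": "Bar chart of quarterly revenue.", "placement": "before"},
]

rebuilt = rebuild_text(elements)
for annotation in rebuilt:
    print(annotation)
```

Because offsets are recomputed after the captions are inserted, downstream stages such as chunking or NER see positions that match the rebuilt text rather than the original extraction.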

The Complete End-to-End Pipeline

Here’s how all the pieces fit together for a full multimodal document understanding pipeline:

from sparknlp.base import *
from sparknlp.reader import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Step 1: Extract text and images from documents
reader = ReaderAssembler() \
    .setContentType("application/pdf") \
    .setContentPath("./pdf-files") \
    .setOutputAsDocument(False) \
    .setOutputCol("data")

# Step 2: Align images with their nearby text
aligner_vision = LayoutAlignerForVision() \
    .setInputCols(["data_text", "data_image"]) \
    .setOutputCol("aligned") \
    .setAddNeighborText(True) \
    .setNeighborTextCharsWindow(500) \
    .setImageCaptionBasePrompt(
        "Describe the image with concise business details."
    )

# Step 3: Caption images using a VLM
vlm = AutoGGUFVisionModel.pretrained() \
    .setInputCols(["aligned_prompt", "aligned_image"]) \
    .setOutputCol("image_caption") \
    .setBatchSize(1) \
    .setNGpuLayers(99)

# Step 4: Rebuild coherent text with captions woven in
aligner_text = LayoutAlignerForText() \
    .setInputCols(["aligned_doc", "image_caption"]) \
    .setOutputCol("aligned_text") \
    .setExplodeElements(False)

# Step 5: Chunk for downstream retrieval
text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["aligned_text"]) \
    .setOutputCol("chunks") \
    .setChunkSize(1200) \
    .setChunkOverlap(120)

# Step 6: Embed for vector search
embedder = BertSentenceEmbeddings.pretrained() \
    .setInputCols(["chunks"]) \
    .setOutputCol("chunk_embeddings")

This pipeline takes raw PDF documents and produces layout-aware, image-captioned text chunks with embeddings ready for semantic search or RAG. For an in-depth walkthrough of building this pipeline with your own documents, read our blog post Efficient Document Ingestion with Layout Aware Annotators: A Case Study on Mixed-Type Documents and see the LayoutAligners Document Understanding Demo notebook.

MultiColumnAssembler: Merge Annotation Columns with One Line

When using ReaderAssembler to process documents such as PDFs or PPTX files, content is extracted into separate typed columns: document_text, document_table, and image-related outputs. But many downstream annotators (like AutoGGUFVisionModel) expect a single input column. Previously, bridging this split required custom Spark transformations.

MultiColumnAssembler solves this directly within Spark NLP. It merges any number of DOCUMENT type annotation columns into a single output column, preserving all annotation metadata and automatically adding a source_column key to each annotation so you can trace which column it originated from.

from sparknlp.base import MultiColumnAssembler

merger = MultiColumnAssembler() \
    .setInputCols(["document_text", "document_table"]) \
    .setOutputCol("merged_document") \
    .setSortByBegin(True)

How Sorting Works

sortByBegin=False (default): Annotations appear in input column order, with all annotations from the first column, then the second, and so on.
sortByBegin=True: Annotations from all columns are interleaved by their begin offset, reconstructing the original document order regardless of which column they came from.
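The two sorting modes can be sketched in plain Python as follows. This is only an illustration of the behavior described above, not the assembler's code: merge_columns and the dictionary-based annotations (with illustrative offsets) are hypothetical.

```python
from typing import Dict, List


def merge_columns(columns: Dict[str, List[Dict]], sort_by_begin: bool = False) -> List[Dict]:
    """Merge annotation lists, tagging each annotation with its source column."""
    merged = []
    for name, annotations in columns.items():  # dicts preserve insertion order
        for ann in annotations:
            metadata = {**ann.get("metadata", {}), "source_column": name}
            merged.append({**ann, "metadata": metadata})
    if sort_by_begin:
        # sorted() is stable, so ties keep input-column order
        merged = sorted(merged, key=lambda ann: ann["begin"])
    return merged


# Illustrative offsets: the table sits between the two paragraphs
columns = {
    "document_text": [
        {"begin": 0, "end": 19, "result": "Intro paragraph."},
        {"begin": 60, "end": 79, "result": "Closing paragraph."},
    ],
    "document_table": [
        {"begin": 20, "end": 59, "result": "Q1 | Q2 | Q3"},
    ],
}

print([a["result"] for a in merge_columns(columns)])
# ['Intro paragraph.', 'Closing paragraph.', 'Q1 | Q2 | Q3']
print([a["result"] for a in merge_columns(columns, sort_by_begin=True)])
# ['Intro paragraph.', 'Q1 | Q2 | Q3', 'Closing paragraph.']
```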

Source Tracking

Every merged annotation gets a source_column metadata key, making it easy to inspect provenance:

result.selectExpr("explode(merged_document) as ann") \
    .selectExpr("ann.result", "ann.metadata.source_column") \
    .show(truncate=False)

Integration with ReaderAssembler

This is particularly useful when merging text and table outputs from document readers:

from sparknlp.reader import ReaderAssembler
from sparknlp.base import MultiColumnAssembler
from pyspark.ml import Pipeline

reader = ReaderAssembler() \
    .setContentType("application/pdf") \
    .setContentPath("./documents") \
    .setOutputAsDocument(False) \
    .setOutputCol("data")

merger = MultiColumnAssembler() \
    .setInputCols(["data_text", "data_table"]) \
    .setOutputCol("merged_document") \
    .setSortByBegin(True)

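pipeline = Pipeline().setStages([reader, merger])

# ReaderAssembler reads from contentPath, so an empty DataFrame is enough to drive the pipeline
emptyDf = spark.createDataFrame([[""]]).toDF("text")
result = pipeline.fit(emptyDf).transform(emptyDf)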

Note: Columns using the AnnotationImage schema (IMAGE-typed columns from ReaderAssembler) are not supported by MultiColumnAssembler.

For a full walkthrough, see the Merging Annotation Columns notebook.

LightPipeline Metadata Support: Context-Aware Inference

LightPipeline is the fast, single-machine inference mode in Spark NLP, ideal for real-time applications and batch processing without the overhead of full Spark DataFrames. Starting in 6.3.3, LightPipeline supports passing metadata columns alongside text inputs in both annotate() and fullAnnotate().

This is especially useful when routing or post-processing should behave differently by content type, such as handling news articles and legal documents with different downstream logic.

This means you can now attach contextual information (document source, language, category, user ID) to your inputs and have it flow through the entire annotation pipeline. Annotators that implement the HasLightPipelineAnnotate trait can access this metadata in their beforeAnnotateLight and afterAnnotateLight hooks, enabling context-aware processing, routing, and filtering.

Supported Call Signatures

Metadata can be passed as a keyword argument or as a positional trailing argument. Both annotate() and fullAnnotate() support the same patterns:

Single text with metadata:

result = light_pipeline.fullAnnotate(
    "U.N. official Ekeus heads for Baghdad.",
    metadata={"source": ["news_article"]}
)
# result[0]["document"][0].metadata
# → {"sentence": "0", "source": "news_article"}

Multiple texts with row-format metadata (list of dicts):

results = light_pipeline.annotate(
    ["Breaking: Market rally continues", "New study on climate change"],
    metadata=[
        {"source": ["reuters"], "category": ["finance"]},
        {"source": ["nature"], "category": ["science"]}
    ]
)

Multiple texts with columnar-format metadata (dict of lists):

results = light_pipeline.annotate(
    ["Breaking: Market rally continues", "New study on climate change"],
    metadata={
        "source": ["reuters", "nature"],
        "category": ["finance", "science"]
    }
)

PretrainedPipeline Support

This feature is also surfaced through PretrainedPipeline:

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")

result = pipeline.fullAnnotate(
    "Google announced a new product.",
    metadata={"source": ["tech_news"]}
)

Validation

The implementation includes validation to ensure metadata is well-formed:

In columnar format, each metadata value must be a list with the same length as the text inputs
Metadata is only supported for text inputs (not audio, image, or question-answering modes)
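To see how the row and columnar formats relate, and what a length check like the one above amounts to, here is a small plain Python sketch. It does not reproduce LightPipeline's internal validation; to_columnar and its error messages are hypothetical.

```python
from typing import Dict, List, Union

RowMetadata = List[Dict[str, List[str]]]
ColumnarMetadata = Dict[str, List[str]]


def to_columnar(texts: List[str],
                metadata: Union[RowMetadata, ColumnarMetadata]) -> ColumnarMetadata:
    """Normalize row-format metadata to columnar format and validate lengths."""
    if isinstance(metadata, list):
        if len(metadata) != len(texts):
            raise ValueError("row-format metadata needs one dict per text")
        keys = sorted({key for row in metadata for key in row})
        if any(key not in row for row in metadata for key in keys):
            raise ValueError("every row must define the same metadata keys")
        # Each row value is a one-element list, as in the examples above
        columnar = {key: [row[key][0] for row in metadata] for key in keys}
    else:
        columnar = dict(metadata)
    for key, values in columnar.items():
        if not isinstance(values, list) or len(values) != len(texts):
            raise ValueError(f"metadata '{key}' must be a list of length {len(texts)}")
    return columnar


texts = ["Breaking: Market rally continues", "New study on climate change"]
rows = [
    {"source": ["reuters"], "category": ["finance"]},
    {"source": ["nature"], "category": ["science"]},
]

columnar = to_columnar(texts, rows)
print(columnar)
# {'category': ['finance', 'science'], 'source': ['reuters', 'nature']}
```

Passing a columnar dict whose lists are shorter than the list of texts raises a ValueError in this sketch, mirroring the first validation rule.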

What else is new

The Apache POI dependency used by Spark NLP’s document readers (ReaderAssembler) has been upgraded from 4.1.2 to 5.4.1 (poi-ooxml-full). This upgrade removes deprecated sub-dependencies and ensures compatibility with the latest Office document formats. If you’re processing Word, Excel, or PowerPoint files with ReaderAssembler, you benefit automatically from this upgrade.

References

Research Papers

Smarter, Better, Faster, Longer: A Technical Report on ModernBERT. The paper behind ModernBertEmbeddings; covers the architecture improvements (Flash Attention, Unpadding, GeGLU, RoPE), training data (2T tokens), and benchmarks vs. classic BERT.
ModernBERT Paper (PDF). Direct PDF source for the benchmark methodology and performance discussion.

Benchmark Evidence (for the speed/memory claims)

ModernBERT-base model card: Evaluation. Official benchmark tables and task-level results.
Hugging Face ModernBERT release post. Additional context on performance and efficiency claims.

Example Notebooks

ModernBertEmbeddings: Getting Started. Full walkthrough including loading from HuggingFace via ONNX.
VectorDBConnector: Pinecone Demo. End-to-end RAG ingestion pipeline into Pinecone.
LayoutAligners: Document Understanding Demo. LayoutAlignerForVision + VLM captioning + LayoutAlignerForText on mixed-type documents.
MultiColumnAssembler: Merging Annotation Columns. Merging document_text and document_table from ReaderAssembler into a single column.

Blog Posts

Efficient Document Ingestion with Layout Aware Annotators: A Case Study on Mixed-Type Documents. In-depth walkthrough of the full multimodal document pipeline on real PDF and PPTX files.
Spark NLP on Medium. All Spark NLP articles and tutorials.

Pre-trained Models

ModernBERT-base on HuggingFace (answerdotai/ModernBERT-base). The upstream model behind modernbert-base in Spark NLP.
Spark NLP Models Hub. Browse all available pretrained models for Spark NLP.

External Services

Pinecone. Vector database supported by VectorDBConnector in this release.

Community & Resources

Slack: real-time discussion with the Spark NLP community and team
GitHub: issue tracking, feature requests, and contributions
Discussions: community ideas and showcases
Medium: latest Spark NLP articles and tutorials
YouTube: educational videos and demos

Efficient Document Ingestion with Layout Aware Annotators: A Case Study on Mixed-Type Documents

Danilo Burbano — Tue, 10 Mar 2026 14:22:28 GMT

The RAG Ingestion Problem

In real-world RAG systems, the quality of the final answer is constrained by the quality of the indexed representation. If the ingestion layer fails to capture the meaning encoded in charts, diagrams, screenshots, tables, legends, and other layout-dependent visual artifacts, the retriever is not operating over the true document semantics. It is operating over an incomplete surrogate of the source page.

That failure mode is especially pronounced in multimodal business and technical documents. Consider a quarterly revenue report that contains a paragraph introducing a chart, but where the actual trend, inflection point, or category comparison only appears in the figure. A text-only pipeline will embed the surrounding prose and perhaps some OCR fragments, yet it will often miss the precise visual claim that a human reader immediately extracts from the chart. In a RAG setting, that means relevant chunks may never be retrieved for questions such as "Which product line declined in Q3?" or "What trend is shown in the revenue breakdown?" even though the answer is visually obvious on the page.

Clinical and scientific documents introduce another variant of the same issue. Endpoint plots, cohort diagrams, treatment-arm schemas, and adverse-event tables often encode the most decision-relevant information in highly structured visual form. If those artifacts are not semantically reconstructed during ingestion, a RAG system may retrieve a general summary paragraph while overlooking the image that actually contains the efficacy pattern, patient-group distinction, or safety signal needed to answer the user question precisely.

In other words, multimodal RAG does not fail only at generation time. It often fails much earlier, during ingestion, when visually grounded meaning is discarded or flattened into weak OCR text. Once that information is absent from the index, prompt engineering and reranking can only compensate so much.

Layout-Aware Multimodal Ingestion

At enterprise scale, document ingestion rarely happens over a clean corpus of plain text files. Real-world knowledge bases are usually composed of heterogeneous, mixed-type assets such as PDFs, PPTX decks, DOCX reports, technical summaries, clinical dossiers, financial statements, and slide-based architecture reviews. These assets are multimodal by construction: they combine narrative text with charts, diagrams, screenshots, tables, icons, and other layout-dependent visual artifacts whose semantic contribution is often critical to downstream retrieval quality.

Multimodal Document

When teams deploy ingestion pipelines for these corpora, they typically fall into one of two sub-optimal patterns:

Push everything through text-centric parsing and embedding pipelines, effectively treating the document as if its machine-readable text were the whole signal.
Overcorrect by sending entire documents through vision-language models (VLMs), even when most pages are predominantly textual and only a small subset of regions actually require visual interpretation.

Both strategies create avoidable failure modes.

Text-only ingestion pipelines are computationally efficient, but they systematically underrepresent visually encoded meaning, especially in charts, topology diagrams, annotated screenshots, and figure-heavy reports.

Full-document VLM ingestion captures more multimodal context, but it is operationally expensive, introduces unnecessary latency, and allocates vision inference to document regions that are already well handled by OCR and standard NLP components.

A more robust design pattern is layout-aware selective multimodal ingestion. Instead of captioning the entire document, the pipeline first identifies the non-text visual regions that actually require multimodal interpretation, aligns those regions with their nearest textual context, prompts the VLM with localized semantic grounding, and then reconstructs the document so that the generated image understanding is reinserted into the final reading flow. This produces a retrieval-ready representation that is both semantically richer and substantially more efficient than a brute-force multimodal pass.

Layout-Aware Selective Multimodal Ingestion

This is precisely the problem space addressed by Spark NLP’s new LayoutAlignerForVision and LayoutAlignerForText annotators. Together, they provide an end-to-end mechanism for aligning extracted text and image annotations, generating context-aware captions only where needed, and rebuilding coherent multimodal document text for downstream chunking, embedding, indexing, and retrieval workflows.

Methodology

The core methodology follows a selective multimodal enrichment architecture designed for retrieval and indexing pipelines rather than generic document captioning. Conceptually, the workflow separates document understanding into two stages:

Identify and align the regions that require vision reasoning.
Propagate the resulting visual semantics back into a text-centric representation that can be consumed by conventional embedding and search infrastructure.

Selective Multimodal Enrichment

The processing path can be summarized as follows:

1. Ingest each document and extract both text annotations and image annotations.

2. Apply layout-aware alignment so that each relevant image is paired with the nearest textual region using positional heuristics.

3. Optionally enrich the image caption prompt with neighboring textual context by enabling:

addNeighborText=True
neighborTextCharsWindow=<N> (the maximum number of neighboring characters to include; the demo notebook uses 500)

4. Run VLM captioning only on those aligned image regions instead of the full document.

5. Reconstruct the document by reinserting captioned visual meaning into the surrounding text flow with LayoutAlignerForText.

6. Split the reconstructed multimodal text into retrieval-sized chunks.

7. Generate dense sentence or document embeddings for each chunk.

8. Hand off the chunk text, vectors, and metadata to Elasticsearch for downstream vector indexing.

From a systems perspective, the important design choice is that multimodal inference is applied surgically, not globally. This reduces VLM utilization to the subset of content where it adds actual value, while preserving the strong throughput and distributed execution characteristics of Spark-based text processing. The result is a retrieval-oriented representation that retains chart and figure semantics without paying the cost of end-to-end VLM processing across all pages.

Why LayoutAligners Matter

The key innovation in this approach is not merely that images are captioned, but that they are captioned in layout context. In rich documents, the meaning of a visual artifact is rarely self-contained. A chart may depend on the title above it, the explanatory paragraph below it, or the KPI definitions introduced in the previous section. A network diagram may only become interpretable when paired with adjacent architectural prose. Captioning such images in isolation often produces generic or weak descriptions that are insufficient for high-quality semantic retrieval.

LayoutAlignerForVision addresses this by aligning DOCUMENT and IMAGE annotations through layout-aware heuristics and emitting three derived outputs from the configured output column.

LayoutAlignerForText completes the second half of the workflow by rebuilding coherent text from document chunks and generated captions.

Inputs/Outputs for LayoutAlignerForVision, VLM, and LayoutAlignerForText

Taken together, these annotators convert what is usually a disconnected multimodal intermediate state into a deterministic and retrieval-friendly document representation. That is the architectural reason they matter: they bridge the gap between layout parsing and downstream semantic indexing.

Implementation

This walkthrough follows the pipeline structure demonstrated in the notebook and frames it as a production-oriented ingestion pattern rather than a one-off captioning demo. The implementation can be thought of as four major phases: extraction, alignment, caption generation, and reconstruction for embeddings.

1) Ingestion with ReaderAssembler

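from sparknlp.reader import ReaderAssembler

# pdf_directory is the path to the folder that contains the source PDFs
reader = ReaderAssembler() \
    .setContentType("application/pdf") \
    .setContentPath(pdf_directory) \
    .setOutputAsDocument(False) \
    .setOutputCol("data") \
    .setUseEncodedImageBytes(True)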

The ingestion layer starts with ReaderAssembler, which parses the source files and emits structured text and image annotations instead of collapsing the entire document into a monolithic string. In this configuration, the assembler is reading PDFs and generating separate annotation streams such as data_text and data_image. That distinction is fundamental, because the downstream alignment stage relies on having explicit document chunks and image objects available as first-class annotations rather than implicit artifacts buried inside a raw payload. The notebook also enables encoded image bytes so that the extracted visual regions can be passed directly into downstream vision inference without an additional serialization or image reconstruction step.

2) Layout-aware image-text pairing

Default mode (no neighbor text):

from sparknlp.reader import LayoutAlignerForVision

aligner_vision = LayoutAlignerForVision() \
    .setInputCols(["data_text", "data_image"]) \
    .setOutputCol("aligned")

Neighbor-aware mode:

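# Same aligner, with up to 500 characters of neighboring text added to the prompt
aligner_vision = LayoutAlignerForVision() \
    .setInputCols(["data_text", "data_image"]) \
    .setOutputCol("aligned") \
    .setAddNeighborText(True) \
    .setNeighborTextCharsWindow(500)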

This stage is where layout intelligence enters the pipeline. LayoutAlignerForVision consumes DOCUMENT and IMAGE annotations and applies proximity-based heuristics to determine which text region should serve as the semantic anchor for each image. According to the implementation, the alignment logic considers distance, paragraph geometry, slide or page scope, optional contextual windows for floating images, and a confidence model derived from relative vertical distance. The annotator can also fall back to same-slide or same-page strategies when a strict local match is not found, which makes it more resilient across variable layouts such as presentations, PDF reports, and mixed visual documents.
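As a rough illustration of proximity-based anchoring with a same-page fallback, consider the plain Python sketch below. It is a deliberate simplification (vertical distance only) and not the annotator's logic; nearest_text, the "page" and "y" fields, and the 40-pixel threshold are assumptions made for the example.

```python
from typing import Dict, List, Optional


def nearest_text(image: Dict, paragraphs: List[Dict], max_distance: float) -> Optional[Dict]:
    """Pick the closest paragraph on the image's page, by vertical distance."""
    same_page = [p for p in paragraphs if p["page"] == image["page"]]
    if not same_page:
        return None
    closest = min(same_page, key=lambda p: abs(p["y"] - image["y"]))
    distance = abs(closest["y"] - image["y"])
    # Beyond max_distance the match is kept, but flagged as a same-page fallback
    strategy = "local" if distance <= max_distance else "same_page_fallback"
    return {"text": closest["text"], "distance": distance, "strategy": strategy}


paragraphs = [
    {"page": 1, "y": 100, "text": "Revenue overview"},
    {"page": 1, "y": 420, "text": "Regional notes"},
    {"page": 2, "y": 110, "text": "Appendix"},
]

chart = {"page": 1, "y": 130}
photo = {"page": 2, "y": 400}

print(nearest_text(chart, paragraphs, max_distance=40))
# {'text': 'Revenue overview', 'distance': 30, 'strategy': 'local'}
print(nearest_text(photo, paragraphs, max_distance=40))
# {'text': 'Appendix', 'distance': 290, 'strategy': 'same_page_fallback'}
```

Scoping candidates to the same page first is what keeps an image from being anchored to text on a neighboring page that merely happens to share similar coordinates.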

The aligned outputs are: aligned_doc, aligned_image, aligned_prompt

These outputs are important because they enforce a structured data transition between layout alignment and VLM inference, replacing manual prompt assembly with a formalized process. The aligned_prompt column is not just a static instruction string; when neighbor text is enabled, it becomes a localized captioning prompt that blends a base instruction with surrounding textual context. In practice, this means a chart can be captioned with awareness of nearby business, scientific, or technical narrative, which substantially improves grounding quality for visuals whose meaning depends on local prose rather than pixel content alone. The notebook explicitly demonstrates this difference by comparing a default prompt path against a neighbor-aware prompt path that injects up to 500 characters of surrounding context.
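The difference between the default prompt path and the neighbor-aware path can be sketched in plain Python. The exact prompt template used by the annotator is not shown in this article, so build_caption_prompt and its "Context:" layout are hypothetical; only the base instruction and the 500-character window come from the examples above.

```python
def build_caption_prompt(base_prompt: str, neighbor_text: str = "",
                         add_neighbor_text: bool = False,
                         chars_window: int = 500) -> str:
    """Blend a base instruction with up to chars_window characters of context."""
    if not add_neighbor_text or not neighbor_text.strip():
        return base_prompt
    context = neighbor_text.strip()[:chars_window]
    return f"{base_prompt}\nContext: {context}"


base = "Describe the image with concise business details."
nearby = "Q3 revenue by product line. " * 40  # assumed neighboring text, longer than the window

default_prompt = build_caption_prompt(base, nearby)
neighbor_prompt = build_caption_prompt(base, nearby, add_neighbor_text=True, chars_window=500)

print(default_prompt == base)                  # True
print(len(neighbor_prompt) - len(base))        # 510
```

The second number is the 500 context characters plus the 10-character "\nContext: " separator assumed by this sketch.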

3) VLM captioning only where needed

vlm = AutoGGUFVisionModel.pretrained() \
    .setInputCols(["aligned_prompt", "aligned_image"]) \
    .setOutputCol("image_caption")

At this point the pipeline hands only the aligned image regions, plus their layout-informed prompts, to the VLM. This is the operational efficiency win of the overall design. The model is not asked to reinterpret entire pages or complete documents; it is asked to caption the specific image regions that survived alignment and confidence filtering. The demo notebook wraps this step with a helper builder for AutoGGUFVisionModel so that inference parameters such as batchSize, nCtx, nPredict, temperature, topK, and topP remain consistent across experiments, making it easier to isolate the effect of layout-aware prompt construction.

From a machine learning systems perspective, this selective captioning pattern is much closer to how one would design a production ingestion service. It conserves GPU budget, reduces unnecessary multimodal tokens, improves throughput, and keeps vision inference bounded to the document subregions that are most likely to alter retrieval semantics.

4) Reassemble text and image meaning

Once captions are generated, LayoutAlignerForText reconstructs a text centric representation of the document in which visual meaning is inserted back into the reading flow. In the demo notebook this stage is fed with data_text and image_caption, while the annotator implementation itself is explicitly designed for aligned document caption reconstruction. The important outcome is that the final aligned_text column is no longer plain extracted text; it is a semantically enriched document representation that incorporates the informational payload of previously non-textual regions.

Internally, the annotator rebuilds text at the element or file scope by pairing document chunks with captions, normalizing layout metadata, deduplicating repeated image assignments, deciding whether captions should be inserted before or after paragraph text, and optionally merging all rebuilt elements into a single file level annotation. This reconstruction step is what makes the downstream embedding stage materially better: instead of embedding isolated paragraphs and separately storing opaque image captions, the system embeds a unified multimodal text stream whose semantics better reflect how a human reader would interpret the source document.
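
The reconstruction steps described above (pairing, deduplication, caption placement, optional merge) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the LayoutAlignerForText code: the Element class, the image_id field, and the "[Image: ...]" tag format are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Element:
    text: str                        # paragraph text extracted from the document
    image_id: Optional[str] = None   # hypothetical id of the image aligned to this paragraph


def rebuild_text(
    elements: List[Element],
    captions: Dict[str, str],
    caption_first: bool = False,
    merge: bool = True,
) -> Union[str, List[str]]:
    """Insert each caption once, before or after its paragraph, then optionally merge."""
    seen = set()
    rebuilt = []
    for el in elements:
        parts = [el.text]
        caption = captions.get(el.image_id)
        # Deduplicate: an image assigned to several paragraphs is captioned only once.
        if caption and el.image_id not in seen:
            seen.add(el.image_id)
            tagged = f"[Image: {caption}]"
            parts = [tagged, el.text] if caption_first else [el.text, tagged]
        rebuilt.append("\n".join(parts))
    # merge=True mimics a single file-level annotation; False keeps element scope.
    return "\n\n".join(rebuilt) if merge else rebuilt


elements = [Element("Revenue grew.", "img1"), Element("Margins held.", "img1"), Element("Outlook.")]
captions = {"img1": "Bar chart of quarterly revenue"}
print(rebuild_text(elements, captions))
```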

5) Split into chunks and generate sentence embeddings

DocumentCharacterTextSplitter()
   .setInputCols(["aligned_text"])
   .setOutputCol("chunked_docs")
   .setChunkSize(1200)
   .setChunkOverlap(120)

BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128", "en")
   .setInputCols(["chunked_docs"])
   .setOutputCol("chunk_embeddings")

EmbeddingsFinisher()
   .setInputCols(["chunk_embeddings"])
   .setOutputCols(["finished_embeddings"])
   .setOutputAsVector(True)

After multimodal reconstruction, the document is returned to a familiar retrieval pipeline. DocumentCharacterTextSplitter partitions the enriched document into index sized chunks, and BertSentenceEmbeddings converts each chunk into a dense vector representation suitable for semantic search. EmbeddingsFinisher then materializes the embedding annotations into vector form so they can be persisted in tabular or search-index-ready structures. In the demo notebook, the resulting dataset is flattened to one row per chunk so that each chunk can map cleanly to an indexable unit containing chunk_text, metadata, and its embedding vector. The sequencing here is critical. The chunking happens after the layout aware multimodal reconstruction, not before it. That ordering ensures the embedding model sees a coherent semantic unit in which visual meaning has already been grounded and merged, rather than forcing retrieval to correlate independent text chunks and disconnected caption artifacts later in the pipeline.
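
To make the chunk size and overlap settings concrete, here is a minimal sliding-window sketch. It is a simplification: it cuts purely by character position and does not try to reproduce how DocumentCharacterTextSplitter chooses split points. The function name is hypothetical; the 1200 and 120 values match the configuration above.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1200, chunk_overlap: int = 120) -> List[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # each new chunk starts this many characters later
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(3000))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

With these settings each chunk starts 1,080 characters after the previous one, so a caption merged into the reading flow near a chunk boundary has a chance of appearing in both neighboring chunks.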

6) Indexing hand-off (next step)

This post intentionally stops before the Elasticsearch write path, but the operational hand-off is straightforward. The next stage is to send chunk_text, embedding_vector, and the chunk and source metadata into an Elasticsearch vector index, where the chunks can participate in k-NN search, hybrid lexical-semantic retrieval, filtered retrieval, or downstream re-ranking pipelines. The important point is that the LayoutAligners do not replace the retrieval stack; they improve the semantic quality of the content being indexed into it.
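
As a rough idea of what that hand-off could look like, the sketch below builds an index mapping and a newline-delimited payload in the shape the Elasticsearch Bulk API expects. It only constructs the payload and does not call a client. The index name, the id scheme, and the row layout are assumptions for illustration; the 128 dimensions are inferred from the sent_small_bert_L2_128 model name used above.

```python
import json
from typing import Dict, List

# Hypothetical mapping: field names follow the post, everything else is an assumption.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "chunk_text": {"type": "text"},
            "embedding_vector": {"type": "dense_vector", "dims": 128},
            "source": {"type": "keyword"},
        }
    }
}


def to_bulk_ndjson(rows: List[Dict], index_name: str = "layout_aligned_chunks") -> str:
    """Serialize one-row-per-chunk records as action/source line pairs."""
    lines = []
    for i, row in enumerate(rows):
        doc_id = f"{row['source']}#{i}"  # illustrative id: source file plus chunk position
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({
            "chunk_text": row["chunk_text"],
            "embedding_vector": row["embedding"],
            "source": row["source"],
        }))
    # A bulk body must end with a newline; an empty input yields an empty body.
    return "\n".join(lines) + "\n" if lines else ""


rows = [{"chunk_text": "Revenue grew.", "embedding": [0.1, 0.2], "source": "report.pdf"}]
print(to_bulk_ndjson(rows))
```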

Representative Use Cases

Although the notebook demonstrates the workflow on a small sample of three PDFs, the pattern generalizes well across several high value enterprise scenarios. The included demo examples highlight financial reports, clinical trial summaries, and cloud architecture documents, each of which contains high density visual content whose meaning depends heavily on adjacent text.

In financial reporting, charts, KPI summaries, and annotated performance graphics often encode the most retrieval worthy facts in the document. Without layout aware captioning, a vector index may capture the narrative commentary but miss the visual explanation of revenue trend inflections, category breakdowns, or quarter-over-quarter variance. Aligning charts with nearby explanatory paragraphs improves the probability that downstream search will retrieve chunks containing both the business narrative and the visual evidence.

In clinical and life sciences content, endpoint plots, cohort flow diagrams, and adverse event tables are especially sensitive to contextual grounding. A caption generated without surrounding text may correctly identify that an image is a graph or table, yet still miss the medically salient variables, treatment groups, or outcome semantics that make the visual useful during evidence retrieval. Local neighbor text helps constrain the caption toward domain relevant interpretation.

In cloud and software architecture documentation, topology diagrams, service dependency visuals, and deployment schematics are often more informative than the prose alone. By aligning those diagrams to nearby technical paragraphs before caption generation, the reconstructed text can better preserve system boundaries, component relationships, and infrastructure intent, making architecture search and RAG-based troubleshooting more precise.

Benefits

This layout aware ingestion pattern produces several concrete benefits for mixed-type corpora.

Higher multimodal efficiency: Vision inference is targeted at image regions that actually need semantic interpretation, rather than being wasted on full pages whose dominant signal is already present in extracted text. This makes the approach more cost-efficient and more production friendly for large scale corpora.
Better semantic grounding: Neighbor aware prompts enable captions to incorporate local narrative context, which is particularly valuable for charts, diagrams, and screenshots whose meaning is not fully recoverable from pixels alone. The result is caption output with higher contextual fidelity and stronger downstream retrieval utility.
Stronger reconstructed text for embeddings: Because LayoutAlignerForText reinserts captioned visual meaning into the document reading order, the chunks that reach the embedding model are semantically more complete. This improves the representation quality of retrieval units without forcing the rest of the search stack to become natively multimodal.
Spark-native pipelines: The pattern is still a standard Spark pipeline and therefore inherits Spark’s distributed execution model for large document batches. In other words, the design scales not because the VLM sees bigger documents, but because the overall ingestion DAG remains partitionable, pipeline based, and aligned with Spark execution semantics.
Cleaner downstream indexing: By the time the data reaches Elasticsearch or any other vector backend, each row can represent a retrieval-ready chunk with aligned text, grounded caption semantics, and associated metadata. This leads to a cleaner interface between document understanding and search infrastructure, which is particularly valuable in production RAG and enterprise search systems where observability and deterministic pre-processing matter. For large ingestion workloads, this becomes a high-leverage design pattern: apply multimodal reasoning only where it is information bearing, then collapse the result back into a text first representation that standard embedding and search systems can consume efficiently.

Conclusion

For mixed-type document ingestion, the architectural goal should not be to maximize vision usage, but to maximize useful multimodal signal per unit of compute. That is the broader lesson behind LayoutAligners. They offer a middle path between two extremes: the semantic blind spots of text-only ingestion and the computational overkill of sending whole documents to VLMs.

In practice, the workflow is straightforward but powerful: detect non-text content, align it to nearby text, caption it with localized context, merge the resulting visual semantics back into the reading flow, and only then split and embed the final representation for search. The end result is a corpus that is not merely parsed, but semantically reconstructed for retrieval.

For data scientists, machine learning engineers, and AI platform teams, this is the practical value proposition. LayoutAlignerForVision and LayoutAlignerForText do not just add two new annotators to Spark NLP; they introduce a more disciplined ingestion pattern for multimodal enterprise content. That pattern improves semantic completeness, constrains multimodal inference cost, and creates a stronger foundation for vector indexing, RAG, document understanding, and large-scale search over rich business and technical content.

Do you want to know more?

Read a related story of document ingestion with Spark NLP here
Check the example notebooks in the Spark NLP repository, available here
Visit John Snow Labs and Spark NLP Technical Documentation websites
Follow us on Medium: Spark NLP and Veysel Kocaman
Write to support@johnsnowlabs.com for any additional requests you may have

Efficient Document Ingestion with Layout Aware Annotators: A Case Study on Mixed-Type Documents was originally published in spark-nlp on Medium, where people are continuing the conversation by highlighting and responding to this story.

Evaluating Document AI Frameworks: Spark NLP vs Unstructured for Large-Scale Text Processing

Danilo Burbano — Mon, 12 Jan 2026 13:07:07 GMT

Problem 1: Extracting Complete Text Coverage from Complex Documents

In many enterprise pipelines, from compliance auditing to enterprise search and knowledge-graph building, teams must extract every piece of visible text from large collections of mixed documents (PDFs, HTML pages, and DOCX files).
This includes not just paragraphs and headings, but also text found in:

Navigation menus and footers
Captions and embedded annotations
Tables, figure titles, disclaimers, and metadata fields

Capturing all visible text is essential when building traceable, auditable corpora where any omission (even from navigation or footer content) could lead to information loss or compliance gaps.

How Spark NLP Solves It

Spark NLP provides a unified data-processing and NLP pipeline that can read, parse, and clean text from diverse formats at scale using its Reader2X components.

To clean text extracted from HTML using Spark NLP, we leveraged the following annotators:

reader2doc = Reader2Doc() \
    .setContentType('text/html') \
    .setContentPath(directory) \
    .setOutputCol('document')

normalizer = DocumentNormalizer() \
    .setInputCols(['document']) \
    .setOutputCol('normalized') \
    .setAutoMode("HTML_CLEAN") \
    .setPatterns([":"])

sentence_detector = SentenceDetectorDLModel() \
    .pretrained() \
    .setInputCols(['normalized']) \
    .setOutputCol('sentences') \
    .setExplodeSentences(True)

When processing large volumes of documents, we simply leverage Spark’s native distributed engine to scale efficiently:

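from pyspark.ml import Pipeline
from pyspark.sql.functions import col, explode

# Assumes an active SparkSession named `spark`. The reader loads files from
# setContentPath, so an empty DataFrame is enough to drive the pipeline.
empty_df = spark.createDataFrame([], 'string').toDF('text')

pipeline = Pipeline(stages=[reader2doc, normalizer, sentence_detector])
model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

# Flatten to one row per sentence, keeping the source file name
flat_df = (
    result_df
    .withColumn("sentence", explode("sentences"))
    .select(
        col("filename"),
        col("sentence.result").alias("result")
    )
)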

Complete coverage: Reader2Doc extracts the full visible text layer, including navigation menus, headers, and footers. It does not force semantic filtering or heuristics that skip content.
Scalable processing: Built on Apache Spark, it can handle millions of files distributed across clusters, ensuring fast ingestion and consistent structure.
Unified pipeline: The extracted text can flow directly into tokenizers, sentence detectors, embeddings, or downstream NLP models without reformatting.
Traceability: Every document keeps metadata such as source path, page number, and character offsets, supporting audit and compliance needs.

This makes Spark NLP particularly strong for enterprise-scale ingestion, full-text indexing, and document alignment tasks where completeness and consistency outweigh minimalism.

How Unstructured Handles It

Unstructured’s partition functions are designed with a different philosophy: to extract semantic content. They segment a document into structured “elements” (e.g., Title, NarrativeText, Table) while discarding what appears to be boilerplate, such as menus or repetitive links.

To clean text extracted from HTML using Unstructured, we relied on its built-in cleaning utilities and added a custom function to remove colon characters:

import re

from unstructured.cleaners.core import (
    clean_bullets,
    clean_extra_whitespace,
    clean_non_ascii_chars,
    replace_unicode_quotes,
)

def remove_colons(text: str) -> str:
    return re.sub(r":", "", text)

def clean_element_text(text: str) -> str:
    text = clean_extra_whitespace(text)
    text = replace_unicode_quotes(text)
    text = clean_non_ascii_chars(text)
    text = clean_bullets(text)
    text = remove_colons(text)
    return text.strip()

When processing large volumes of documents, a specialized function is required to iterate over the entire directory and apply this logic to each HTML file:

import os

from unstructured.partition.html import partition_html

def ingest_and_clean_unstructured_html(html_path: str):
    elements = partition_html(filename=html_path)
    cleaned_output = []

    for el in elements:
        if hasattr(el, 'text') and el.text:
            cleaned_text = clean_element_text(el.text)
            cleaned_output.append({
                "filename": os.path.basename(html_path),
                "type": el.category if hasattr(el, 'category') else el.__class__.__name__,
                "text": cleaned_text
            })
    return cleaned_output

def process_html_directory(directory_path: str):
    all_results = []

    # Loop through all files in the directory
    for filename in os.listdir(directory_path):
        if filename.lower().endswith(".html"):
            file_path = os.path.join(directory_path, filename)
            print(f"🔍 Processing: {file_path}")

            try:
                output_html = ingest_and_clean_unstructured_html(file_path)
                all_results.extend(output_html)
            except Exception as e:
                print(f"⚠️ Error processing {filename}: {e}")

    # Return the accumulated results so callers can use them
    return all_results

While this produces cleaner and more human-readable content, it also means that:

Navigation or meta text is intentionally dropped.
Structural cues like captions may be separated from their context.
When completeness is required, downstream users have no direct way to recover filtered text, because Unstructured doesn’t retain the full raw text document stream.
Processing is file-by-file on CPU, without distributed scaling or Spark integration.

Thus, while Unstructured is ideal for content-centric summarization or LLM preprocessing, it is not appropriate for pipelines that require full document coverage or raw-text fidelity.

https://medium.com/media/803da64b5df304de32a794542ce163b1/href

To illustrate this scenario, we developed a notebook that processes a set of medical records and evaluates the quality of text extraction using a simple Jaccard similarity metric. The results show both frameworks performing closely:

However, as the saying goes, the devil is in the details. A deeper token-level analysis revealed that both frameworks missed the token “dashboard”, but Unstructured also omitted several contextually important words such as:

These missing tokens can be critical in clinical and biomedical contexts, where small lexical gaps may significantly affect the outcomes of downstream NLP tasks. Therefore, despite the seemingly similar similarity scores, these subtle omissions could lead to substantial performance differences in real-world NLP pipelines.
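
The kind of comparison described here can be sketched with a set-based Jaccard score plus a set difference that lists the missing tokens. This is a minimal sketch, not the notebook's code: the tokenizer is a simple regex and the two strings are toy examples, not data from the experiment.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and keep alphanumeric runs as a set of tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy strings for illustration only
reference = tokenize("Patient dashboard: PSA 0.32 ng/mL, status excellent")
extracted = tokenize("PSA 0.32 ng/mL, status excellent")

score = jaccard(reference, extracted)
missing = reference - extracted  # tokens present in the reference but lost in extraction
print(round(score, 3), sorted(missing))
```

A high score can hide exactly the omissions that matter, which is why the set difference is worth inspecting alongside the similarity value.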

You can review the full notebook for this experiment and the result metrics👉 here

Problem 2: Maintaining Structural Context for Data-Rich Documents

In many enterprise domains, such as healthcare, finance, insurance, scientific publishing, and legal discovery, critical insights are embedded in structured elements like tables and figures. These are not just blobs of text; their meaning depends heavily on their position within the document, including headers, captions, nearby narrative text, and visual layout.

Without preservation of this structural context, downstream NLP systems struggle to interpret, relate, and reason over the extracted information. For example:

A clinical lab table needs to be associated with its section heading (“Most Recent Laboratory Results”) so decision support systems know which test belongs to which patient visit.
A financial table summarizing quarterly results must be tied to the correct caption and date range to feed into a BI dashboard.
Scientific documents often contain dozens of tables and figures where semantic relationships between text and tables are essential for accurate knowledge extraction and reasoning.

This structural understanding matters not just for content extraction but for semantic NLP tasks such as information extraction, table-aware question answering, contextual reasoning, and knowledge graph construction. Research shows that incorporating structural and layout information significantly improves document understanding and extraction quality because it helps NLP systems interpret data in context, not just as isolated text or table cells [1].

How Spark NLP Solves It

Spark NLP’s Reader2Table or Reader2Image addresses this challenge by preserving structural and positional metadata during extraction.
Each table or image is enriched with information such as its DOM path, nearest header, and section hierarchy, ensuring every piece of extracted data remains tied to its original context.

empty_df = spark.createDataFrame([], 'string').toDF('text')

reader2doc = Reader2Table() \
    .setContentType('text/html') \
    .setContentPath('html_docs/EHR-2025-12-000002.html') \
    .setOutputCol('table') \
    .setExplodeDocs(True)

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

JSON output

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"caption":"","header":["Test","Result","Units","Reference Range","Status"],"rows":[["PSA","0.32","ng/mL","0-4.0","Excellent"],["Testosterone","125","ng/dL","300-1000","Recovering"],["Hemoglobin","14.3","g/dL","13.5-17.5","Normal"],["WBC","7.2","K/uL","4.5-11.0","Normal"],["Creatinine","0.9","mg/dL","0.7-1.3","Normal"],["ALT","22","U/L","7-56","Normal"]]}]                                                                        |
|[{"caption":"","header":["Test","Result","Units","Reference Range","Status"],"rows":[["Testosterone","105","ng/dL","300-1000","Recovering"],["Hemoglobin","12.3","g/dL","13.5-17.5","Normal"],["Creatinine","0.7","mg/dL","0.7-1.3","Normal"]]}]                                                                                                                                                                                               |
|[{"caption":"","header":["Medication","Dose","Frequency","Indication","Status"],"rows":[["Atorvastatin (Lipitor)","10 mg PO","Daily","Hyperlipidemia","Active"],["Aspirin","81 mg PO","Daily","Cardiovascular prophylaxis","Active"],["Vitamin D3","2000 IU PO","Daily","Bone health","Active"],["Calcium carbonate","500 mg PO","BID","Bone health (post-ADT)","Active"],["Multivitamin","1 tab PO","Daily","Nutritional support","Active"]]}]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

HTML output

reader2doc = Reader2Table() \
    .setContentType('text/html') \
    .setContentPath('html_docs/EHR-2025-12-000002.html') \
    .setOutputCol('table') \
    .setOutputFormat('html-table') \
    .setExplodeDocs(True)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Test Result Units Reference Range Status
PSA 0.32 ng/mL 0-4.0 Excellent
Testosterone 125 ng/dL 300-1000 Recovering
Hemoglobin 14.3 g/dL 13.5-17.5 Normal
WBC 7.2 K/uL 4.5-11.0 Normal
Creatinine 0.9 mg/dL 0.7-1.3 Normal
ALT 22 U/L 7-56 Normal
]|
|[Test Result Units Reference Range Status
Testosterone 105 ng/dL 300-1000 Recovering
Hemoglobin 12.3 g/dL 13.5-17.5 Normal
Creatinine 0.7 mg/dL 0.7-1.3 Normal
]                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|[Medication Dose Frequency Indication Status
Atorvastatin (Lipitor) 10 mg PO Daily Hyperlipidemia Active
Aspirin 81 mg PO Daily Cardiovascular prophylaxis Active
Vitamin D3 2000 IU PO Daily Bone health Active
Calcium carbonate 500 mg PO BID Bone health (post-ADT) Active
Multivitamin 1 tab PO Daily Nutritional support Active
]                                                                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Metadata output:

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[orderTableIndex -> 1, nearestHeader -> 🔬 Most Recent Laboratory Results (10/22/2016), pageNumber -> 1, domPath -> /html[1]/body[1]/div[1]/div[3]/div[4]/table[1], elementType -> Table, sentence -> 8}]|
|[orderTableIndex -> 2, nearestHeader -> History Laboratory Results (10/22/2016), pageNumber -> 1, domPath -> /html[1]/body[1]/div[1]/div[3]/div[4]/table[2], elementType -> Table, sentence -> 10}]      |
|[orderTableIndex -> 1, nearestHeader -> 💊 Current Medications, pageNumber -> 1, domPath -> /html[1]/body[1]/div[1]/div[3]/div[5]/table[1], elementType -> Table, sentence -> 12}]                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

This output captures rich structural information alongside extracted content:

DOM paths (e.g., /html[1]/body[1]/div[1]/div[3]/div[5]/table[1]) that identify exactly where in the HTML document a table or image came from.
Nearest section header context so that a table is semantically linked to its surrounding narrative (“Laboratory Results”, “Current Medications”, etc.).
Order and hierarchy metadata such as orderTableIndex, allowing precise reconstruction of document structure.
A structured JSON representation of tables (with headers, rows, captions, and field metadata).
An HTML representation for visualization, rendering, or further processing.

These enriched representations help downstream NLP tasks such as:

Table-aware question answering: Models like TAPAS leverage structured table data to answer natural language questions over tables with high accuracy, something that plain text extraction alone cannot support [2].
Contextual table interpretation: Structural metadata enables models to understand why a table occurs where it does, improving joint inference between narrative text and tabular data, which is known to boost extraction quality when the context is considered [1].
Semantic integration with knowledge graphs and IE systems: By preserving layout and section cues, extracted table data can be merged into structured knowledge representations with clear provenance.

In practice, this means that Spark NLP pipelines don’t just flatten structured content; they provide traceable, semantically rich extractions that downstream models can consume with minimal ambiguity.
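
As an illustration of how a downstream consumer might use these outputs, the sketch below turns one table's JSON string into row records and attaches the section header and DOM path from its metadata. The helper name and the "_section" and "_domPath" keys are hypothetical; the sample values are copied from the second table in the output above, and the metadata is assumed to be available as a plain dict.

```python
import json
from typing import Dict, List


def table_to_records(table_json: str, metadata: Dict[str, str]) -> List[Dict[str, str]]:
    """Zip the header with each row and attach provenance fields from the metadata."""
    table = json.loads(table_json)
    records = []
    for row in table["rows"]:
        record = dict(zip(table["header"], row))
        record["_section"] = metadata.get("nearestHeader")  # which section the table sits under
        record["_domPath"] = metadata.get("domPath")        # where the table sits in the HTML
        records.append(record)
    return records


table_json = (
    '{"caption":"","header":["Test","Result","Units","Reference Range","Status"],'
    '"rows":[["Testosterone","105","ng/dL","300-1000","Recovering"],'
    '["Hemoglobin","12.3","g/dL","13.5-17.5","Normal"],'
    '["Creatinine","0.7","mg/dL","0.7-1.3","Normal"]]}'
)
metadata = {
    "orderTableIndex": "2",
    "nearestHeader": "History Laboratory Results (10/22/2016)",
    "domPath": "/html[1]/body[1]/div[1]/div[3]/div[4]/table[2]",
    "elementType": "Table",
}

for record in table_to_records(table_json, metadata):
    print(record["_section"], "|", record["Test"], record["Result"], record["Units"])
```

Each record carries its own provenance, so a lab value can be traced back to the section and table it came from without re-parsing the document.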

How Unstructured Handles It

Unstructured’s partition_html module focuses on extracting semantic content (titles, paragraphs, tables, and images) but does not preserve the structural layout or positional hierarchy of those elements.

def ingest_and_clean_unstructured_html_tables(html_path: str):
    """
    Extract and clean only HTML table data using Unstructured.
    Returns a list of dicts with text and HTML (if available).
    """
    elements = partition_html(filename=html_path)
    cleaned_output = []

    for el in elements:
        if hasattr(el, "category") and el.category == "Table":
            table_text = getattr(el, "text", None)
            cleaned_entry = {
                "filename": os.path.basename(html_path),
                "type": "Table",
            }

            # clean and add plain text
            if table_text:
                cleaned_entry["text"] = clean_element_text(table_text)

            # look in metadata for HTML, if it exists
            if hasattr(el, "metadata") and isinstance(el.metadata, dict):
                html_content = el.metadata.get("text_as_html")
                if html_content:
                    cleaned_entry["text_as_html"] = html_content

            cleaned_output.append(cleaned_entry)

    return cleaned_output

Output example:

[{'filename': 'EHR-2025-12-000002.html',
  'type': 'Table',
  'text': 'Test Result Units Reference Range Status PSA 0.32 ng/mL 0-4.0 Excellent Testosterone 125 ng/dL 300-1000 Recovering Hemoglobin 14.3 g/dL 13.5-17.5 Normal WBC 7.2 K/uL 4.5-11.0 Normal Creatinine 0.9 mg/dL 0.7-1.3 Normal ALT 22 U/L 7-56 Normal'},
 {'filename': 'EHR-2025-12-000002.html',
  'type': 'Table',
  'text': 'Test Result Units Reference Range Status Testosterone 105 ng/dL 300-1000 Recovering Hemoglobin 12.3 g/dL 13.5-17.5 Normal Creatinine 0.7 mg/dL 0.7-1.3 Normal'},
 {'filename': 'EHR-2025-12-000002.html',
  'type': 'Table',
  'text': 'Medication Dose Frequency Indication Status Atorvastatin (Lipitor) 10 mg PO Daily Hyperlipidemia Active Aspirin 81 mg PO Daily Cardiovascular prophylaxis Active Vitamin D3 2000 IU PO Daily Bone health Active Calcium carbonate 500 mg PO BID Bone health (post-ADT) Active Multivitamin 1 tab PO Daily Nutritional support Active'}]

In this run, the Unstructured output for HTML files did not include a text_as_html field.

While the output for other file types may include a text_as_html field containing the visual representation of a table, it lacks:

The document’s DOM ancestry
Section or caption linkage (no nearest header tracking)
Element order within the page layout

As a result, Unstructured’s table output is essentially context-agnostic. While the extracted content may be correct in isolation, the surrounding structural relationships (vital for holistic NLP tasks) are lost. This limitation inhibits the use of Unstructured outputs in workflows that depend on understanding how the table fits into the document narrative or layout.

This notebook showcases the extraction of tabular data across both frameworks and emphasizes Spark NLP’s capability to generate rich DOM-structured output for precise contextual alignment.

https://medium.com/media/fdff6e8e0076e602ea7b46d9bd7e12da/href

In real-world enterprise pipelines, an accurate understanding of where structured data appears, not just what it contains, enables advanced NLP use cases such as table question answering, context-aware information extraction, and integration into knowledge graphs. Spark NLP’s DOM-aware, dual JSON/HTML representations provide the structural foundation these tasks require, whereas simpler extraction tools lack the necessary positional fidelity.

Problem 3: Processing Millions of Documents Efficiently and Reliably

Modern organizations often have to process massive volumes of unstructured documents (PDFs, HTML pages, contracts, medical reports, or regulatory filings), frequently numbering in the millions. These files arrive daily via ingestion pipelines, enterprise content systems, or compliance workflows.

While sequential or single-machine processing might suffice for small datasets, scaling becomes a serious challenge as data grows. Common bottlenecks include:

Excessive processing time when files must be handled one by one
Inconsistent outputs when pipelines fail mid-run and require manual restarts
Escalating infrastructure costs due to lack of distributed workload handling
Difficulty scaling NLP pipelines as tokenization, entity recognition, and classification steps are added

This becomes a critical bottleneck for data engineering teams tasked with maintaining real-time compliance, analytics, or document understanding workflows.

How Spark NLP Solves It

Spark NLP is built natively on Apache Spark, bringing distributed data processing to text analytics and NLP workloads.
This means text extraction, normalization, and NLP tasks can be performed in parallel across clusters, allowing millions of documents to be processed efficiently, reproducibly, and at scale.

Key advantages include:

Scalable architecture: Workloads are automatically partitioned across Spark executors, so throughput can grow close to linearly as cluster resources are added.
Fault tolerance: Spark retries failed tasks and can recompute lost partitions from the lineage of its resilient distributed datasets (RDDs), so jobs can recover from executor or node failures.
Unified pipeline integration: Document ingestion, extraction (Reader2Doc, Reader2Table, Reader2Image, ReaderAssembler), tokenization, and NLP inference can all run as a single Spark job, with no need to move data between tools.
Operational efficiency: Ideal for enterprise pipelines that process terabytes of data daily.

Here’s a minimal example that ingests all files in a directory using Spark NLP’s ReaderAssembler and saves the results as a Parquet dataset:

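from pyspark.ml import Pipeline
from sparknlp.reader.reader_assembler import ReaderAssembler  # assumed import path; may differ between versions

# `spark`, `directory` (input folder) and `output` (Parquet destination) are assumed to be defined.
# The reader takes its input from setContentPath, so an empty DataFrame is enough to drive the pipeline.
empty_df = spark.createDataFrame([], "string").toDF("text")

reader_assembler = ReaderAssembler() \
    .setContentPath(directory) \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader_assembler])
model = pipeline.fit(empty_df)

df = model.transform(empty_df)
# Select the extracted text and persist it as Parquet
df.select("document_text.result").write.mode("overwrite").parquet(output)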

How Unstructured Handles It

Unstructured, by design, is a single-node Python library optimized for lightweight document parsing and LLM preprocessing, not distributed workloads.
While it provides easy-to-use functions like partition_html and partition_pdf, each document must be processed individually on a single CPU core.

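import time
from pathlib import Path

from unstructured.partition.auto import partition


def extract_text_from_file(filepath: Path) -> str:
    # Parse one file; return an empty string if Unstructured cannot read it
    try:
        elements = partition(filename=str(filepath))
    except Exception as e:
        print(f"⚠️ Failed to read {filepath.name}: {e}")
        return ""

    text_content = []
    for element in elements:
        try:
            txt = getattr(element, "text", None)
            if txt:
                text_content.append(txt)
        except Exception:
            continue

    return "\n".join(text_content)

# `files` is assumed to be a list of Path objects collected beforehand
for idx, file_path in enumerate(files, start=1):
    file_t0 = time.perf_counter()

    text = extract_text_from_file(file_path)

    file_t1 = time.perf_counter()  # stop the timer once extraction finishes
    print(f"✔ [{idx}/{len(files)}] {file_path.name} processed in {file_t1 - file_t0:.2f}s")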

This approach works well for small or ad-hoc datasets but faces clear limitations at enterprise scale:

No built-in parallel or distributed processing across clusters
Requires external orchestration tools (like Dask or Ray) to scale horizontally
Limited integration with Spark-based ETL or NLP workflows
Higher latency and I/O overhead when processing millions of files sequentially

Thus, for large-scale ingestion pipelines, Unstructured’s simplicity becomes a constraint, increasing operational complexity and total runtime.
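
To give a sense of the glue code this implies, here is a minimal sketch of fanning a per-file function out over a worker pool on one machine with Python's standard library. The extract_text stand-in and the process_in_parallel helper are hypothetical names, and this is not part of Unstructured or Spark NLP.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_text(path):
    """Hypothetical stand-in for a per-file parser such as extract_text_from_file."""
    return f"text of {path}"


def process_in_parallel(paths, worker=extract_text, max_workers=4):
    """Run worker over paths concurrently; collect results and failures separately."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                failures[path] = exc
    return results, failures


results, failures = process_in_parallel(["a.pdf", "b.html", "c.docx"])
print(len(results), "succeeded,", len(failures), "failed")
```

Threads are used here to keep the sketch short. For CPU-bound parsing, a process pool is usually the better fit, and spreading work over several machines still needs a scheduler on top, which is the part Spark provides out of the box.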

Experiment Results: Spark NLP vs Unstructured

To evaluate ingestion performance, we processed 60 mixed-format documents using both frameworks under the same conditions.

All experiments were executed on a single machine (no Spark cluster, no distributed environment) to ensure a fair comparison.

Average processing time for 60 documents. Spark NLP achieves ~2× faster throughput than Unstructured.

Even in this single-node setup, Spark NLP achieved nearly a 2x speedup, completing the full pipeline in roughly half the time of Unstructured.

This improvement comes primarily from Spark NLP’s ability to automatically parallelize work across all available CPU cores, distributing file reads and transformations efficiently under the hood.
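
For readers who want to take a similar measurement themselves, the sketch below shows one simple way to average wall-clock time over repeated runs and turn two averages into a speedup factor. The average_runtime and speedup helpers are hypothetical and are not the benchmark code behind the numbers above; the sleep calls are placeholders for running each full pipeline.

```python
import time
from statistics import mean


def average_runtime(fn, repeats=3):
    """Call fn() `repeats` times and return the mean wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Placeholder workloads; a real comparison would run each ingestion pipeline here
slow = average_runtime(lambda: time.sleep(0.02))
fast = average_runtime(lambda: time.sleep(0.01))
print(f"speedup: {speedup(slow, fast):.1f}x")
```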

Reproduce the benchmark and explore the full pipeline setup here.

Scaling Beyond a Single Machine

While this benchmark ran on one machine, Spark NLP’s real advantage emerges at larger scales:

The same pipeline can run unchanged across a Spark cluster, leveraging multiple nodes for linear scalability.
As document volume grows to thousands or millions, Spark’s distributed scheduler automatically partitions the workload, with each executor handling its own batch of documents in parallel.
This architecture ensures both speed and fault tolerance, something single-threaded Python tools can’t easily replicate.

https://medium.com/media/6be5ba8cff390276cad09f47ffad6f9b/href

When processing scales from hundreds to millions of documents, architecture becomes the differentiator.
Spark NLP’s distributed design allows organizations to scale text extraction and NLP workloads horizontally, maintaining both speed and reliability, something single-node solutions like Unstructured simply aren’t built to achieve.

Whether for compliance auditing, enterprise search, or large-scale document intelligence, Spark NLP ensures that scale doesn’t compromise consistency.

Evaluating Document AI Frameworks: Spark NLP vs Unstructured for Large-Scale Text Processing was originally published in spark-nlp on Medium, where people are continuing the conversation by highlighting and responding to this story.

Semantic Search Across Text and Images? Meet E5-V in Spark NLP

Muhammad Abdullah — Fri, 27 Jun 2025 10:12:36 GMT

“What if you could compare a sentence and an image with just one model?”

Multimodal AI — models that can understand both text and images — has been making huge waves lately. Tools like CLIP and GPT-4 Vision have demonstrated what’s possible when language and vision are combined. Now, with Spark NLP 6.0.3, you can tap into that power at scale using a new feature called E5-V embeddings.

Whether you’re building search engines, recommendation systems, or just playing with embeddings, this new addition might be your next favorite trick.

So… What Is E5-V?

E5-V is a universal multimodal embedding model. The basic idea? It can take in text, images, or both, and map them all into the same vector space.

So if you feed it:

A sentence like “A dog playing fetch”
And a photo of that happening…

Their resulting embeddings will land close together in vector space. That’s pretty wild, especially when you consider that E5-V doesn’t even need to be fine-tuned on images. It just uses an innovative prompting method with a large language model that already “gets” a lot about the world.

How? It works by adding prompts like:

What is a short description of this?
Describe this image in one phrase:

These prompts help the model produce similar meanings for similar content, even across different formats.

How E5-V Embeds Images and Text Together (A Visual Breakdown)

To truly understand what makes E5-V special, it is helpful to examine how it learns to integrate both text and images into a shared space.

Here’s what’s going on:

On the left, a traditional large language model (LLM) is trained on a large amount of text using contrastive learning — essentially teaching it to pull semantically similar examples (such as two captions about dogs) closer together in vector space.
Then, in the Multimodal LLM (MLLM) on the right, they bring in images. Each image is passed through a vision encoder and then projected into the same embedding space as the text. The same LLM now handles both modalities.
The cool part is: it’s prompt-based. During training, the model is given tasks like:
“Summarize the sentence in one word.”
“Summarize the image in one word.”
But at inference time (far right), it can generalize to new prompts it’s never seen, like:
“Modify this image with ‘change cat to dog’, and describe the modified image in one word.”

So, not only does E5-V embed different kinds of content into the same vector space, but it also does so with flexible, prompt-driven control, which is crucial for making it useful across tasks without requiring additional fine-tuning.

This means that your sentence, your image, or their combination gets mapped into the same semantic space, ready for search, retrieval, or downstream tasks.

Why Should You Care?

Because this opens up a bunch of valuable things:

Semantic Search: Search images using text queries (or vice versa).
Content Recommendation: Show relevant articles, images, or products based on a user’s query, no matter the format.
Multimodal Retrieval: Find content that “means the same thing,” not just textually or visually, but across both.

And the best part? You can do all of this without training a model yourself. Spark NLP handles the heavy lifting, and E5-V takes care of the rest.

🔧 Using E5-V in Spark NLP

With the Spark NLP 6.0.3 release, using E5-V is straightforward. There’s a new annotator called E5VEmbeddings, and it plugs into your existing Spark NLP pipelines.

Here’s a simple example comparing an image and a sentence with E5-V:

import sparknlp
from sparknlp.base import DocumentAssembler, ImageAssembler
from sparknlp.annotator import E5VEmbeddings
from pyspark.ml import Pipeline
from pyspark.sql.functions import lit
from sparknlp.util import EmbeddingsDataFrameUtils

import os
from pathlib import Path

# Start Spark NLP session
spark = sparknlp.start()

# Download an image to test
img_url = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
Path("images").mkdir(exist_ok=True)

# Save the image manually or use the line below if your environment supports it
os.system(f"wget -q -O images/image1.jpg {img_url}")

# Load the image
images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").option("dropInvalid", True).load(images_path)

# Add a prompt to guide the embedding; <image> marks where the image is injected
image_prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\n"
    "Summary above image in one word: <|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Attach text prompt to image DataFrame
image_df = image_df.withColumn("text", lit(image_prompt))

# Create a text-only row for comparison; <sent> is a placeholder for the sentence
text_prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n<sent>\n"
    "Summary above sentence in one word: <|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
text_desc = "A cat sitting in a box."
empty_image_df = spark.createDataFrame(
    [EmbeddingsDataFrameUtils.emptyImageRow],
    schema=EmbeddingsDataFrameUtils.imageSchema
)
# Substitute the sentence for the <sent> placeholder
text_df = empty_image_df.withColumn("text", lit(text_prompt.replace("<sent>", text_desc)))

# Combine image and text into a single DataFrame
test_df = image_df.union(text_df)

# Set up the pipeline
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

# Load pretrained E5-V model (adjust the path if needed)
e5v = E5VEmbeddings.pretrained("E5-V_LLM_8B", "xx") \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("e5v")

pipeline = Pipeline(stages=[image_assembler, e5v])
results = pipeline.fit(test_df).transform(test_df)

# View the resulting embeddings
results.select("text", "e5v.embeddings").show(truncate=False)

This gives you a clean, comparable vector for each input, whether it’s an image, a sentence, or both. From here, you can measure similarity, group content by meaning, or plug the embeddings into whatever downstream task you’re working on. Super flexible, surprisingly easy.
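
As a rough illustration of the "measure similarity" step, here is a minimal sketch of cosine similarity and a simple ranking in plain Python. The three-dimensional vectors are toy values standing in for real E5-V embeddings, and the cosine_similarity and rank_by_similarity helpers are hypothetical, not part of Spark NLP.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (label, score) pairs sorted from most to least similar to query."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors; in practice these would come from the "e5v.embeddings" column
text_vec = [0.9, 0.1, 0.0]
candidates = {
    "cat_image": [0.8, 0.2, 0.1],
    "car_image": [0.0, 0.1, 0.9],
}

for label, score in rank_by_similarity(text_vec, candidates):
    print(f"{label}: {score:.3f}")
```

The same ranking idea underlies text-to-image search: embed the query once, score it against stored image embeddings, and return the top matches.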

💡 Final Thoughts

E5-V might not be the flashiest name, but it’s one of the most practical tools I’ve seen for working with text and images together. It simplifies a challenging problem — reasoning across different modalities — into something usable with just a few lines of code.

If you’re already in the Spark NLP ecosystem (or thinking about it), this release is worth checking out.

Want to dive deeper? The following resources might be of interest:

GitHub Repository: Spark NLP on GitHub: Source code, issue tracking, and community contributions.
E5V Embeddings Class Source: Python | Scala
Hands-On Notebooks on how to import E5V Embeddings models from Huggingface: HuggingFace_to_Spark_NLP_E5VEmbeddings
Spark NLP Release Notes: Version 6.0.3

❤ Join the Community

Slack: Join the Spark NLP community and team for live discussions.
Discussions: Engage with community members, share ideas, and showcase how you use Spark NLP!
Medium: Read Spark NLP articles on its official Medium page.
YouTube: Watch Spark NLP video tutorials for in-depth guidance.

Semantic Search Across Text and Images? Meet E5-V in Spark NLP was originally published in spark-nlp on Medium, where people are continuing the conversation by highlighting and responding to this story.

Making Multi-Format Ingestion Easier with the New Partition and PartitionTransformer in Spark NLP 6.0.2

Muhammad Abdullah — Wed, 28 May 2025 15:44:17 GMT

Making Multi-Format Ingestion Easier with the New Partition and PartitionTransformer in Spark NLP 6.0.2

If you’ve ever wrestled with reading different document formats in your NLP pipelines (PDFs here, Word docs there, the occasional HTML file thrown in), you know how frustrating it can get. The new Partition and PartitionTransformer annotators in Spark NLP 6.0.2 address this head-on.

What is the Partition Annotator?

In a nutshell, Partition is a high-level abstraction for document ingestion. Think of it as an adapter for all file types. You don’t need to specify a reader (like PDF or DOCX); Partition figures it out for you. All you have to do is point it at your files, and it handles the rest.

This is especially helpful when you’re working with pipelines that need to process a variety of document types, like a legal or compliance workflow, without writing custom logic for each one.

🔧 Example Usage

from sparknlp.base import Partition

# Automatically detect and parse the file
df = Partition().partition("./pdf-files/text_3_pages.pdf")

df.show()

For a more hands-on tutorial using the Partition annotator in Spark NLP 6.0.2, check out this Colab notebook: 👉 Open Tutorial in Google Colab

Partition easily ingests files from:

Databricks (dbfs://)
HDFS (hdfs://)
Microsoft Fabric OneLake (abfss://)

No extra configuration is needed to access these distributed environments.

Customization with Parameters

You can fine-tune ingestion using keyword arguments. One key parameter is content_type, which lets you override automatic detection:

df = Partition(content_type="application/msword").partition("./word-files")
df.show()
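
To make the idea of automatic detection with an explicit override concrete, here is a small sketch that uses only Python's standard library. It is not how Partition is implemented: the READERS table and the pick_reader function are hypothetical and only show the general pattern of guessing a MIME type from the file name unless the caller supplies one.

```python
import mimetypes

# Hypothetical mapping from MIME type to a reader name (not Spark NLP's internal table)
READERS = {
    "application/pdf": "pdf",
    "text/html": "html",
    "application/msword": "doc",
    "text/plain": "txt",
}


def pick_reader(path, content_type=None):
    """Choose a reader from an explicit content type, or guess it from the file name."""
    mime = content_type or mimetypes.guess_type(str(path))[0]
    if mime not in READERS:
        raise ValueError(f"No reader registered for {path!r} (detected type: {mime})")
    return READERS[mime]


print(pick_reader("report.pdf"))                                   # guessed from the extension
print(pick_reader("export.bin", content_type="application/msword"))  # explicit override
```

An explicit override is handy when file extensions are missing or misleading, which is common with exported or archived documents.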

Other configurable options include:

https://medium.com/media/b15f361df3e3de08f01a4c8c7b5f8c24/href

Smoother Pipelines with PartitionTransformer

When you’re working inside a Spark NLP pipeline, you often want everything from ingestion to annotation to flow seamlessly. That’s where PartitionTransformer comes in. It brings the power of Partition directly into your pipeline, letting you handle file and URL ingestion as just another stage.

It supports reading from:

Local files
URLs
In-memory strings or byte arrays

…and it automatically detects formats like PDF, DOCX, HTML, emails, and more.

🔧 Example Usage

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import PartitionTransformer
from pyspark.ml import Pipeline

# Sample input: a single URL (the trailing comma makes the row a one-element tuple)
data = spark.createDataFrame([("https://www.blizzard.com",)], ["text"])

# Convert raw text into a Spark NLP document
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Ingest the content from the URL using PartitionTransformer
partition = PartitionTransformer() \
    .setInputCols(["document"]) \
    .setOutputCol("partition") \
    .setContentType("url")

# Build and run the pipeline
pipeline = Pipeline(stages=[documentAssembler, partition])
result = pipeline.fit(data).transform(data)

# Show the ingested document content
result.select("partition").show(truncate=False)

Use PartitionTransformer when you want ingestion as part of a pipeline. It’s a clean way to reuse and scale workflows without extra steps.

When to Use Partition vs. SparkNLPReader

Use Partition when you want simplicity and automation, especially when dealing with multiple formats or distributed environments. It offers automatic format detection, cleaner syntax, mixed-format handling, and less boilerplate code.
Stick with SparkNLPReader when you need fine-grained control or want to customize how a specific format is processed.

Real-World Use Case: Legal Automation

Let’s say you’re automating contract review for a legal team. They’ve got NDAs, employment agreements, and compliance documents in all kinds of formats. Instead of building a different ingestion flow for each one, you use Partition to read everything in bulk.

Downstream, you apply NLP components to extract metadata like parties involved, governing law, and obligations, then feed it into a legal CRM. What used to take hours of manual work now happens automatically.

Conclusion

The new Partition class brings a much-needed layer of abstraction to Spark NLP’s ingestion process. I love it because it’s simple and powerful, and it makes my life easier, especially when working at scale.

Want to dive deeper? The following resources might be of interest:

GitHub Repository: Spark NLP on GitHub: Source code, issue tracking, and community contributions.
Partition Class Source
• Python | Scala
PartitionTransformer Source
• Python | Scala
Hands-On Notebooks (Google Colab)
• Partition Tutorial
Spark NLP Release Notes
• Version 6.0.0 | Version 6.0.1 | Version 6.0.2

❤️ Join the Community

Slack: Join the Spark NLP community and team for live discussions.
Discussions: Engage with community members, share ideas, and showcase how you use Spark NLP!
Medium: Read Spark NLP articles on its official Medium page.
YouTube: Watch Spark NLP video tutorials for in-depth guidance.

Making Multi-Format Ingestion Easier with the New Partition and PartitionTransformer in Spark NLP 6.0.2 was originally published in spark-nlp on Medium, where people are continuing the conversation by highlighting and responding to this story.

Transforming Document Ingestion at Scale with Spark NLP

Muhammad Abdullah — Wed, 28 May 2025 15:30:54 GMT

Spark NLP 6.0.0: A New Era of Enterprise-Scale Data Ingestion

Introduction

The release of Spark NLP 6.0.0 marks a significant shift in the platform’s capabilities. Originally architected as a high-performance natural language processing library atop Apache Spark, Spark NLP has now evolved into a full-scale, enterprise-grade ingestion engine for unstructured and semi-structured data. With native support for a broad range of enterprise file formats, Spark NLP 6.0.0 empowers organizations to ingest, process, and analyze documents at massive scale, directly within Spark ecosystems.

Key Enhancements in the new release

Redefining Spark NLP: From NLP Library to Ingestion Engine

While Spark NLP remains renowned for advanced NLP components such as tokenization, named entity recognition, and deep contextual embeddings, the latest iteration enhances its utility by embedding native ingestion features. This update allows users to interface directly with primary data sources, eliminating the fragmentation between ingestion (ETL) and downstream analysis. It facilitates an uninterrupted transition from raw documents to actionable insights.

Built for Scale: Distributed and Parallel Ingestion

Spark NLP inherits Apache Spark’s native support for distributed data processing. That means ingestion tasks, such as reading thousands of PDFs or parsing extensive Excel files, are processed in parallel across the cluster. The platform handles massive datasets efficiently, thanks to its built-in fault tolerance, lazy evaluation, and optimized memory management.

Direct Ingestion of Enterprise Files

Version 6.0.0 introduces SparkNLPReader, a set of readers designed to natively parse:

PDFs: Including encrypted files, font-specific rendering, and bounding-box-based extraction.
Excel Spreadsheets: Supports .xls and .xlsx formats, multi-sheet reading, and intelligent schema detection.
PowerPoint Presentations: Extracts content from slides and speaker notes.
Raw Text Logs and CSVs: Handling different encodings, delimiters, and irregular formats.

By supporting these formats out of the box, Spark NLP eliminates the need for intermediate file converters or manual preprocessing scripts.

With the new ingestion readers, developers can create ingestion pipelines declaratively. No custom file parsing or glue code is needed:

import sparknlp
from sparknlp.reader import SparkNLPReader

# Start Spark NLP session
spark = sparknlp.start()
reader = SparkNLPReader(spark)

# Read HTML
html_df = reader.html("https://www.wikipedia.org")

# Read PDFs from local directory
pdf_df = reader.pdf("/home/user/pdfs-directory")

# Read Excel files from local directory
excel_df = reader.xls("/home/user/excel-directory")

# Read Emails from local directory
email_df = reader.email("/home/user/emails-directory")

# Example: shorthand syntax for HTML
html_df_short = sparknlp.read().html("https://www.wikipedia.org")

# Use shorthand for simplicity when custom configuration isn't needed

These high-level APIs make ingestion easy to integrate with Spark DataFrames, enabling rapid prototyping and deployment.

https://medium.com/media/92724d7eb41e34c335389fa1455a7bc1/href

Real-World Use Case: Legal Document Automation

Consider a law firm or in-house legal team that processes tens of thousands of documents monthly, including contracts, NDAs, compliance filings, and more. Traditionally, reviewing these manually is slow, expensive, and error-prone.

With Spark NLP 6.0.0, documents are ingested in bulk from PDFs, emails, and Word files. Structural elements like clauses and tables are parsed automatically, and NLP components extract key details such as parties, dates, governing law clauses, and obligations. The processed data is then routed into legal CRMs or contract intelligence platforms.

The result? Legal teams can dramatically reduce document review time, enabling attorneys to focus on high-value analysis instead of manual data entry.

Conclusion

Spark NLP keeps getting better and better, constantly expanding its capabilities. Now, it’s not just about NLP — it’s a powerful platform for handling all your document processing needs at scale with easy integration. Spark NLP makes it simple to build smart AI pipelines. Whether you’re in legal, finance, or healthcare, it’s designed to keep up with your growing data needs and help you work faster and more efficiently.

For further reading and resources, consider exploring the following:

GitHub Repository — Spark NLP on GitHub: Source code, issue tracking, and community contributions.
SparkNLPReader Source — Python | Scala
Spark NLP Release Notes — 6.0.0
Hands-On Notebooks (Google Colab)
• SparkNLPReader Tutorial
• Data Preprocessing Tutorials

❤️ Community Support

Slack: Join the Spark NLP community and team for live discussions.
Discussions: Engage with community members, share ideas, and showcase how you use Spark NLP!
Medium: Read Spark NLP articles on its official Medium page.
YouTube: Watch Spark NLP video tutorials for in-depth guidance.

Transforming Document Ingestion at Scale with Spark NLP was originally published in spark-nlp on Medium, where people are continuing the conversation by highlighting and responding to this story.

Vision-Language Model (VLM) Inference at Scale with Spark NLP 6.0 + llama.cpp

Devin Ha — Wed, 14 May 2025 16:16:28 GMT

Large scale distributed inference of vision-language llama.cpp models in Spark NLP 6.0

Spark NLP 6.0 features Vision LLM inference based on llama.cpp’s GGUF models.

The world isn’t just text. Increasingly, the data we need to process and understand is multimodal. From analyzing social media feeds and product catalogs to understanding complex documents with diagrams and figures, the ability to seamlessly process both vision and language together is becoming essential for many data science and AI tasks.

The release of Spark NLP 6.0 marks another milestone for scalable, distributed AI pipelines. With the introduction of the AutoGGUFVisionModel to Spark NLP, we can leverage llama.cpp to directly load quantized vision-language models (VLMs) and run inference with them at scale inside Apache Spark. VLMs from llama.cpp are now first-class citizens inside Spark DataFrames, delivering multimodal inference at scale with native performance.

In this post, we will dive deeper into our new llama.cpp VLM feature and walk through an example together. This should get you started on using this feature for your own projects.

Why is this important?

Integrating vision-language models into large-scale data processing pipelines has traditionally posed challenges due to infrastructure complexity, scalability constraints and integration overhead. Custom solutions using cloud-based APIs can be difficult to scale efficiently and pose concerns for data privacy.

Spark NLP allows you to pass your large image-text datasets directly to efficient quantized GGUF models from llama.cpp that can describe, summarize, or reason over visual inputs entirely on-premise. You can rest assured that sensitive data stays entirely with you.

In addition, Spark NLP can scale the model to the resources of your computing cluster effortlessly, especially if they are equipped with GPUs. This unlocks massively-parallel multimodal batch processing across large image datasets with Spark’s native scalability, enabling powerful new use cases:

Image Captioning: Generate descriptive text for vast image collections.
Visual Question Answering (VQA): Answer text-based questions about the content of images.
Enhanced Document Understanding: Process scanned documents, PDFs, and images by analyzing both text and visual layout/figures.
Multimodal Content Analysis: Analyze social media feeds or web content that integrates images and text for deeper insights.
Automated Product Descriptions: Generate e-commerce descriptions directly from product images.

How to use it — Example Walk-through

Let’s take a look at the AutoGGUFVisionModel in action. We will use it to caption a folder of images. You can follow along with this example in this Google Colab notebook, which is also available at this link. If you want to run it locally, you can set up Spark NLP using the following instructions: Spark NLP — Installation.

Spark NLP Setup

Let’s set up Spark NLP to load our VLM as an AutoGGUFVisionModel. If you are running it in Google Colab, now is the time to install Spark NLP. Skip this step if you have already set it up.

# Only execute this if you are on Google Colab
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

We start Spark NLP via our simple start() function. Depending on the platform, we can pass gpu=True, apple_silicon=True, or aarch64=True to load the right dependencies.

import sparknlp

# Let's start Spark with Spark NLP with GPU enabled. Skip it if running on CPU.
# You can also pass apple_silicon=True or aarch64=True
spark = sparknlp.start(gpu=True)

We can now load the default pretrained model llava_v1.5_7b_Q4_0_gguf with:

from sparknlp.annotator import *

autoGGUFModel = (
    AutoGGUFVisionModel.pretrained()
    .setInputCols(["caption_document", "image_assembler"])
    .setOutputCol("completions")
)

If LLaVA 1.5 is good enough for your use-case, then you are all set and can skip ahead to the image captioning section. You can also explore more models on our Models Hub.

If you want to bring your own GGUF model, then keep reading, where we will cover the import of a custom model.

Import and save your custom GGUF VLM in Spark NLP

As an example, we choose Mozilla/llava-v1.5-7b as our VLM, the default pretrained model. It is a 7B parameter model in the GGUF format used by llama.cpp and is also available in 4-bit quantization. This VLM consists of two parts: a multimodal projection model (mmproj in llama.cpp) based on CLIP and a large language model (LLM) based on Vicuna (fine-tuned Llama 2). The projection model encodes the image into embedding vectors, which the LLM is then conditioned on to generate relevant text.

To download the models, you can run the following commands:

EXPORT_PATH_MODEL = "llava-v1.5-7b-Q4_K.gguf"
EXPORT_PATH_MMPROJ = "llava-v1.5-7b-mmproj-Q4_0.gguf"
! wget "https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/{EXPORT_PATH_MODEL}?download=true" -O  {EXPORT_PATH_MODEL}
! wget "https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/{EXPORT_PATH_MMPROJ}?download=true" -O  {EXPORT_PATH_MMPROJ}
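
The ! wget lines above rely on notebook shell syntax. Outside a notebook, the same download can be sketched with Python's standard library as shown below. The model_url and download helpers are hypothetical names introduced for illustration; the repository URL and file names are the ones used in the commands above.

```python
from urllib.request import urlretrieve

REPO_URL = "https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main"


def model_url(filename):
    """Build the direct download URL for a file in the repository."""
    return f"{REPO_URL}/{filename}?download=true"


def download(filename):
    """Download filename into the current working directory and return its local path."""
    path, _headers = urlretrieve(model_url(filename), filename)
    return path


# The files are several gigabytes in size, so the calls are left commented out:
# download("llava-v1.5-7b-Q4_K.gguf")
# download("llava-v1.5-7b-mmproj-Q4_0.gguf")
print(model_url("llava-v1.5-7b-Q4_K.gguf"))
```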

Now we can use the loadSavedModel function in AutoGGUFVisionModel to load the model into Spark NLP:

from sparknlp.annotator import *

autoGGUFModel = (
    AutoGGUFVisionModel.loadSavedModel(EXPORT_PATH_MODEL, EXPORT_PATH_MMPROJ, spark)
    .setInputCols(["caption_document", "image_assembler"])
    .setOutputCol("completions")
    .setChatTemplate("vicuna")
    .setBatchSize(4)
    .setNGpuLayers(99)
    .setNCtx(4096)
    .setMinP(0.05)
    .setNPredict(40)
    .setTemperature(0.05)
    .setTopK(40)
    .setTopP(0.95)
)

The function loadSavedModel accepts three parameters:

the path to the exported GGUF model
the path to the exported mmproj GGUF model
the SparkSession, that is, the spark variable we previously started via sparknlp.start()
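
The setters in the call above also configure sampling (setTopK, setTopP, setMinP, setTemperature). To give an intuition for what the three filtering settings do, here is a toy sketch in plain Python. It is not llama.cpp's implementation; the function names, the order in which the filters are chained, and the example probabilities are assumptions made for illustration.

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept


def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs.values())
    return {t: pr for t, pr in probs.items() if pr >= threshold}


def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1."""
    total = sum(probs.values())
    return {t: pr / total for t, pr in probs.items()}


# Toy next-token distribution (made-up numbers)
toy = {"dog": 0.5, "cat": 0.3, "fox": 0.12, "car": 0.06, "sky": 0.02}

filtered = min_p_filter(top_p_filter(top_k_filter(toy, k=4), p=0.95), min_p=0.05)
print(renormalize(filtered))
```

A low temperature such as the 0.05 used above then makes the choice among the surviving tokens close to greedy, which tends to suit short, factual captions.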

At this point, the model is loaded and ready to go. We can save it to disk so it is easier to move around and to load with the Spark native .load function.

autoGGUFModel.write().overwrite().save("llava_v1.5_7b_Q4_0_gguf_spark_nlp")

This is your GGUF model loaded and saved by Spark NLP! You can now use it on other machines, clusters, or any place you wish.

Captioning Images

Now let’s see how we can use the model to caption some images, for which we need some examples. The following command will download some from our repository. For brevity, I skipped the plotting code, but you can find it in the linked notebook above.

!wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/images/images.zip
plot_images()

Example images for captioning consisting of various animals, a palace and a tractor

The next step is to load the images into an Apache Spark DataFrame. Note that we cannot read the images with Spark’s native image data source here: it loads them into an OpenCV-compatible format, whereas llama.cpp and AutoGGUFVisionModel expect raw image bytes.

For this, we can use the helper function loadImagesAsBytes from the ImageAssembler. It will load the images in the right format in a Spark DataFrame. Additionally, we will add a column for the caption:

from sparknlp.base import *
from pyspark.sql.functions import lit

# Load images as raw bytes to Spark DataFrame
data = ImageAssembler.loadImagesAsBytes(spark, images_path)
# Add a caption to each image.
data = data.withColumn("caption", lit("Caption this image."))
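To make "raw image bytes" concrete: it means the encoded file contents exactly as stored on disk, not a decoded pixel array. The following is a minimal sketch in plain Python, independent of Spark; read_raw_image is a hypothetical helper, and the file it reads is a stand-in that only carries the JPEG start-of-image marker.

```python
import tempfile
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # "start of image" marker that opens a JPEG file


def read_raw_image(path):
    """Return the undecoded file contents, exactly as stored on disk."""
    return Path(path).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    fake_jpeg = Path(tmp) / "example.jpg"
    # Not a viewable picture: just a JPEG start marker plus filler bytes.
    fake_jpeg.write_bytes(JPEG_SOI + b"\x00" * 16)
    raw = read_raw_image(fake_jpeg)

is_jpeg = raw[:2] == JPEG_SOI
print(type(raw).__name__, len(raw), is_jpeg)
```

The encoded bytes still carry the file format header, which is what lets the downstream decoder work out how to interpret them.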

Having the data loaded in the right format, we can construct our pipeline. We require an ImageAssembler and DocumentAssembler to turn the images and captions into the right format for Spark NLP. We also use the model we just loaded above. Then we can assemble a pipeline and run it!

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = (
    DocumentAssembler().setInputCol("caption").setOutputCol("caption_document")
)
imageAssembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

pipeline = Pipeline().setStages([documentAssembler, imageAssembler, autoGGUFModel])

pipeline.fit(data).transform(data).selectExpr(
    "reverse(split(image.origin, '/'))[0] as image_name", "completions.result"
).show(truncate=False)

And this is the result:

+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|image_name       |result                                                                                                                                                                                          |
+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|bluetick.jpg     |[ A dog with a red collar is sitting on the floor.]                                                                                                                                             |
|chihuahua.jpg    |[ A small brown dog wearing a sweater and collar is sitting on the floor.]                                                                                                                      |
|egyptian_cat.jpeg|[ The image features two cats lying on a pink surface, possibly a bed or sofa. One cat is positioned towards the left side of the frame and appears to be sleeping while holding]               |
|hen.JPEG         |[ The image features a large white chicken standing next to several baby chicks. There are at least five visible chickens in the scene, with one adult and four young ones surrounding it. They]|
|hippopotamus.JPEG|[ A large brown hippo is swimming in a pond, with its head above the water. The hippo appears to be enjoying itself as it floats on top of the water.]                                          |
|junco.JPEG       |[ A small bird with a black head and white chest is standing on the snow.]                                                                                                                      |
|ostrich.JPEG     |[ A large ostrich stands in a grassy field, surrounded by trees and bushes. The bird is the main focus of the image with its long neck stretched out as it looks around at]                     |
|ox.JPEG          |[ A large bull with long horns is standing in a grassy field.]                                                                                                                                  |
|palace.JPEG      |[ The image depicts a large, ornate room with high ceilings and yellow walls. It features an elegant sitting area with several chairs arranged around the space. There are also multiple c]     |
|tractor.JPEG     |[ A man is sitting in the driver's seat of a green tractor, which has yellow wheels. The tractor appears to be parked on top of an agricultural field with rows of]                             |
+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

As we are running everything within Spark, the whole pipeline will be distributed automatically to your entire cluster. With batch inference enabled, you could potentially be processing hundreds of rows of your text and images in parallel!

Conclusion

Spark NLP 6.0 represents a major leap in large-scale AI by introducing the AutoGGUFVisionModel. In this post, we explored the implications of this model and showed how it can be used to caption images at scale (link to notebook).

By natively integrating efficient, quantized GGUF vision-language models via Llama.cpp into Spark pipelines, we overcome previous complexities and scalability barriers. This enables users to perform distributed, privacy-preserving multimodal inference on vast datasets, unlocking powerful new capabilities for tasks like image captioning, VQA, and document analysis, all within the familiar, scalable Spark NLP framework.

We constantly update Spark NLP with the latest state-of-the-art models! To keep up to date and read more, consider exploring the following:

Medium Publications: https://medium.com/spark-nlp
Spark NLP Blog: https://www.johnsnowlabs.com/spark-nlp-blog/
Home repository: https://github.com/JohnSnowLabs/spark-nlp
Models Hub: https://nlp.johnsnowlabs.com/models

Vision-Language Model (VLM) Inference at Scale with Spark NLP 6.0 + llama.cpp was originally published in spark-nlp on Medium, where people are continuing the conversation by highlighting and responding to this story.

RAG with Spark NLP

Paulami Bhattacharya — Mon, 12 May 2025 17:02:29 GMT

Retrieval Augmented Generation (RAG) with Spark NLP

RAG is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside its training data before generating a response.

It has 3 steps:

Ingestion — source data is ingested, cleaned, converted into embeddings, and stored in a vector database.
Retrieval — the system finds the most relevant documents based on the user’s query.
Generation — the retrieved information is fed into a language model to create a more accurate and context-aware response.
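The three steps can be sketched end to end in a few lines of plain Python. This is only a toy illustration of the data flow: it replaces real embeddings with word-overlap scoring, keeps the "vector database" in a list, and stops at the prompt that a language model would receive. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def ingest(sentences):
    """Ingestion: store each sentence next to its (toy) representation."""
    return [(s, tokenize(s)) for s in sentences]


def retrieve(store, query, top_k=2):
    """Retrieval: rank stored sentences by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(store, key=lambda item: len(q & item[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:top_k]]


def build_prompt(query, context):
    """Generation input: the text a language model would be asked to complete."""
    return f"Query: {query} Context: {' '.join(context)}"


store = ingest([
    "Harry is sorted into Gryffindor House.",
    "Major themes in the series include love and death.",
    "The novels were written by a British author.",
])
query = "Which House is Harry sorted into?"
context = retrieve(store, query, top_k=1)
prompt = build_prompt(query, context)
print(prompt)
```

The rest of this tutorial follows the same shape, with Spark NLP embeddings in place of word sets and Pinecone in place of the list.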

In this tutorial, we show how to build a simple Retrieval-Augmented Generation (RAG) pipeline using Spark NLP. We load a dataset into Spark and use the AutoGGUFModel provided by Spark NLP to generate answers based on retrieved information.

Before we get started, download any text file to use as your dataset and set up a Pinecone instance. For this tutorial, we have stored the text from the Harry Potter Wikipedia page in a text file.

Part 1 — Ingestion

Read your source text into a DataFrame.
Clean the text by tokenizing and normalizing.
Generate embeddings for your text.
Connect to the Pinecone server.
Create a collection to store your embeddings in Pinecone.

1.1 Read your source text into a DataFrame.

from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import SentenceDetector, AutoGGUFModel, Tokenizer, Normalizer
from sparknlp.annotator.embeddings import AutoGGUFEmbeddings
from pyspark.ml import Pipeline

text_data = (
  spark
  .read
  .text("harry-potter.txt")
  .withColumnRenamed("value", "text") #Add your own text file
)

text_data.show()

# OUTPUT
+--------------------+
|                text|
+--------------------+
|Harry Potter is a...|
+--------------------+

1.2 Clean the text by tokenizing and normalizing and generate embeddings for your text.

We build a Spark NLP pipeline with the following stages:

DocumentAssembler: Entry annotator for our pipelines; it creates the data structure for the Annotation Framework.
SentenceDetector: Annotator to pragmatically separate complete sentences inside each document.
Tokenizer: Annotator used to convert text into tokens.
Normalizer: Annotator that cleans out tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.
Finisher: Converts the cleaned text back into a string sentence.
AutoGGUFEmbeddings: Used to generate embeddings for each sentence. Any embeddings of your choice can be used in this step.
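To get a feel for what the Tokenizer, Normalizer and Finisher stages do to a sentence, here is a rough plain-Python approximation. It assumes a cleanup pattern that keeps ASCII letters only and splits tokens on whitespace; the actual annotators are configurable and tokenize more carefully, so treat this as a sketch rather than a reimplementation.

```python
import re

# Assumption: keep ASCII letters only. The real Normalizer's patterns are configurable.
CLEANUP = re.compile(r"[^A-Za-z]")


def normalize_sentence(sentence):
    """Tokenize on whitespace, strip non-letters, and rejoin with spaces."""
    tokens = sentence.split()
    cleaned = (CLEANUP.sub("", token) for token in tokens)
    return " ".join(token for token in cleaned if token)


normalized = normalize_sentence("Harry Potter and the Philosopher's Stone (1997).")
print(normalized)
```

This kind of cleanup explains why apostrophes, hyphens and digits are missing from the sentences that appear later in the tutorial's output.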


from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import StringType

document_assembler = (
    DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
)

sentence_detector = (
    SentenceDetector()
        .setInputCols(["document"])
        .setOutputCol("sentence")
        .setExplodeSentences(True)
)

tokenizer = (
    Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")
)

normalizer = (
    Normalizer()
    .setInputCols("token")
    .setOutputCol("normalized")
)

finisher = (
    Finisher()
    .setInputCols("normalized")
    .setOutputCols("normalized_sentence")
    .setOutputAsArray(False)
    .setAnnotationSplitSymbol(" ")
)

normalized_document = (
    DocumentAssembler()
        .setInputCol("normalized_sentence")
        .setOutputCol("sentence")
)

embeddings = (
    AutoGGUFEmbeddings
    .pretrained("Nomic_Embed_Text_v1.5.Q8_0.gguf")
    .setInputCols(["sentence"])
    .setOutputCol("embeddings")
    .setBatchSize(4)
    .setNGpuLayers(99)
    .setNCtx(8191)
)

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        normalizer,
        finisher,
        normalized_document,
        embeddings,
    ]
)
result = pipeline.fit(text_data).transform(text_data)

sentence_and_embeddings_df = result.selectExpr(
    "explode(sentence.result) as sentence",
    "explode(embeddings.embeddings) as vector"
)

df_pinecone = (
    sentence_and_embeddings_df
    .withColumn("id", monotonically_increasing_id().cast(StringType()))
    .withColumnRenamed("vector", "values")
    .withColumn("metadata", sentence_and_embeddings_df.sentence)
    .drop("sentence")  # Store sentence as metadata, drop original column
)

df_pinecone.show()

#OUTPUT
+--------------------+---+--------------------+
|              values| id|            metadata|
+--------------------+---+--------------------+
|[0.009264344, 0.0...|  0|Harry Potter is a...|
|[-0.05384197, 0.0...|  1|The novels chroni...|
|[-0.056440134, -0...|  2|The main story ar...|
|[0.056693483, 0.0...|  3|The series was or...|
|[-0.041599147, 0....|  4|A series of many ...|
|[-0.030768316, 0....|  5|Major themes in t...|
|[6.1618997E-4, 0....|  6|Since the release...|
|[-0.047986094, -0...|  7|They have attract...|
|[0.027252585, 0.0...|  8|As of February th...|
|[0.04030519, 0.04...|  9|The last four boo...|
|[0.026217783, 0.0...| 10|It holds the Guin...|
|[0.007119997, -0....| 11|         Warner Bros|
|[0.0041893553, 0....| 12|Pictures adapted ...|
|[0.034328587, 0.0...| 13|In the total valu...|
|[0.017639026, 0.0...| 14|Harry Potter and ...|
|[-0.026738804, 0....| 15|A television seri...|
|[-0.02948736, 0.0...| 16|Themed attraction...|
|[-0.0024689648, 0...| 17|In the first book...|
|[-0.041317504, 0....| 18|At the age of Har...|
|[-0.059096724, 0....| 19|He meets a halfgi...|
+--------------------+---+--------------------+
only showing top 20 rows
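The IDs above happen to be consecutive, but monotonically_increasing_id only guarantees unique, increasing values. Spark's documentation describes the layout as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The snippet below imitates that layout in plain Python to show what IDs look like once data spans more than one partition; spark_style_id is a hypothetical helper, not Spark's implementation.

```python
def spark_style_id(partition_index, row_in_partition):
    """Combine a partition index (upper bits) with a per-partition counter (lower 33 bits)."""
    return (partition_index << 33) + row_in_partition


# Two partitions with three rows each.
ids = [spark_style_id(p, r) for p in range(2) for r in range(3)]
print(ids)
```

The IDs jump by 2**33 at the partition boundary, which is harmless here because they are cast to strings and only need to be unique.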

1.3 Connect to Pinecone Server and create a collection to store your embeddings in Pinecone.

from pyspark.sql.functions import struct, array, lit
from pinecone import Pinecone
from pinecone import ServerlessSpec

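# Set these values for your own Pinecone project (placeholders, not real values)
URL = "<your-pinecone-url>"
API_KEY = "<your-api-key>"
INDEX_NAME = "<your-index-name>"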
EMBEDDING_DIM = 768

pc = Pinecone(api_key=API_KEY)

pc.create_index(
    name=INDEX_NAME,
    dimension=EMBEDDING_DIM,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

1.4 Insert the embeddings into Pinecone.

# Function to insert a batch of vectors into Pinecone
def insert_batch(rows):
    pc2 = Pinecone(api_key=API_KEY)
    index = pc2.Index(INDEX_NAME)
    vectors = []
    for row in rows:
        vector = {
            "id": row.id,
            "values": row.values,
            "metadata": {"text": row.metadata}
        }
        if hasattr(row, "namespace") and row.namespace is not None:
            vector["namespace"] = row.namespace
        vectors.append(vector)
    
    if vectors:
        index.upsert(vectors=vectors)

# Convert DataFrame to RDD and process partitions in parallel
df_pinecone.rdd.foreachPartition(insert_batch)
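The function above sends every vector of a partition in a single upsert call. For large partitions it may be safer to split the vectors into smaller batches first. The following sketch shows one way to do that; the batch size of 100 is an assumption rather than a documented limit, and FakeIndex is a hypothetical stand-in so the example runs without a Pinecone account.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


class FakeIndex:
    """Hypothetical stand-in for a Pinecone index that only records calls."""

    def __init__(self):
        self.calls = []

    def upsert(self, vectors):
        self.calls.append(len(vectors))


index = FakeIndex()
vectors = [{"id": str(i), "values": [0.0], "metadata": {"text": ""}} for i in range(250)]
for batch in chunked(vectors, 100):  # 100 is an assumed batch size
    index.upsert(vectors=batch)

print(index.calls)
```

Inside insert_batch, the same loop could wrap the real index.upsert call.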

Part 2 — Retrieval

Write your queries into a DataFrame.
Generate the embedding for the queries.
Query the Pinecone vector database using the embeddings to find the relevant context.

2.1 Write your queries into a DataFrame.

We create a DataFrame that contains our queries and run it through the same pipeline that generated the embeddings in the ingestion step. The result is a new DataFrame with each query sentence and its embedding. We then collect the embeddings into a list.

from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql import Row

# Create your own queries
queries = [
    Row(text="Who are Harry Potter's parents?"),
    Row(text="Which House did Harry belong to?"),
    Row(text="Who were Harry's friends?"),
    Row(text="What are the major themes in the Harry Potter series?")
]

schema = StructType([StructField("text", StringType(), True)])
query_df = spark.createDataFrame(queries, schema)
transformed_query = pipeline.fit(query_df).transform(query_df)

transformed_query = transformed_query.selectExpr(
    "explode(sentence.result) as sentence",  # Extract sentences
    "explode(embeddings.embeddings) as vector"  # Extract embeddings
)

vector_list = (
    transformed_query
    .select("vector")
    .rdd
    .flatMap(lambda x: x)
    .collect()
)

print(f"Total queries: {len(vector_list)}")

#OUTPUT
Total queries: 4

2.2 Query the Pinecone vector database using the embeddings to find the relevant context.

For each query embedding in the list, we call Pinecone's query function, which fetches the ingested text most relevant to that query. You can change the top_k parameter to set how many vectors are fetched from the vector database.

from collections import defaultdict

query_and_context = defaultdict(list)
vector_database = pc.Index(INDEX_NAME)

for index, rag_query in enumerate(vector_list):
    response = vector_database.query(
        vector=rag_query,
        top_k=3,
        include_metadata=True
    )

    sentences = []
    matches = response["matches"]
    for match in matches:
        context = match["metadata"]["text"]
        sentences.append(context)

    for idx, sentence in enumerate(sentences, start=1):
        query_and_context[queries[index].text].append(sentence)

print(query_and_context)

#OUTPUT
defaultdict(<class 'list'>, {"Who are Harry Potter's parents?": ['Harry learns that his parents Lily and James Potter also had magical powers and were murdered by the dark wizard Lord Voldemort when Harry was a baby', 'wizards of Muggle parentage are the primary targets', 'He gains the friendship of Ron Weasley a member of a large but poor wizarding family and Hermione Granger a witch of nonmagical or Muggle parentage'], 'Which House did Harry belong to?': ['The event made Harry famous among the community of wizards and witchesHarry becomes a student at Hogwarts and is sorted into Gryffindor House', 'Harry learns that his parents Lily and James Potter also had magical powers and were murdered by the dark wizard Lord Voldemort when Harry was a baby', 'In the first book Harry Potter and the Philosophers Stone Harry Potter and the Sorcerers Stone in the US Harry lives in a cupboard under the stairs in the house of the Dursleys his aunt uncle and cousin who all treat him poorly'], "Who were Harry's friends?": ['The novels chronicle the lives of a young wizard Harry Potter and his friends Hermione Granger and Ron Weasley all of whom are students at Hogwarts School of Witchcraft and Wizardry', 'He gains the friendship of Ron Weasley a member of a large but poor wizarding family and Hermione Granger a witch of nonmagical or Muggle parentage', 'Lupin enters the shack and explains that Sirius was James Potters best friend'], 'What are the major themes in the Harry Potter series?': ['A series of many genres including fantasy drama comingofage fiction and the British school story which includes elements of mystery thriller adventure horror and romance the world of Harry Potter explores numerous themes and includes many cultural meanings and references', 'Major themes in the series include prejudice corruption madness love and death', 'Harry Potter is a series of seven fantasy novels written by British author J K Rowling']})
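Conceptually, a top_k query against an index created with the cosine metric behaves like the brute-force search below: score every stored vector against the query vector and keep the best few. This is an in-memory sketch for intuition only; a real vector database uses approximate indexes rather than scanning every record, and top_k_matches is a hypothetical name.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_matches(query_vector, records, top_k=3):
    """Return the top_k records most similar to query_vector, best first."""
    scored = ((cosine(query_vector, r["values"]), r) for r in records)
    best = heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
    return [{"score": s, "metadata": r["metadata"]} for s, r in best]


records = [
    {"id": "0", "values": [1.0, 0.0], "metadata": {"text": "points east"}},
    {"id": "1", "values": [0.0, 1.0], "metadata": {"text": "points north"}},
    {"id": "2", "values": [1.0, 1.0], "metadata": {"text": "points north-east"}},
]
matches = top_k_matches([1.0, 0.1], records, top_k=2)
texts = [m["metadata"]["text"] for m in matches]
print(texts)
```

The records use the same id, values and metadata layout as the rows we inserted in Part 1, so the lookup of match["metadata"]["text"] mirrors the retrieval loop above.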

Part 3 — Generation

Set up the prompt assembler using the template to query the model.
Fill the prompt template with the query and the relevant context fetched from the retrieval step.
Pass the prompt to the LLM to receive an answer to your query.

3.1 Set up the prompt assembler using the template to query the model.

Here, we define the default prompt template and use the PromptAssembler to set this template as the chat template.

from sparknlp.base import *


template = (
    "{{- bos_token }} {%- if custom_tools is defined %} {%- set tools = custom_tools %} {%- "
    "endif %} {%- if not tools_in_user_message is defined %} {%- set tools_in_user_message = true %} {%- "
    'endif %} {%- if not date_string is defined %} {%- set date_string = "26 Jul 2024" %} {%- endif %} '
    "{%- if not tools is defined %} {%- set tools = none %} {%- endif %} {#- This block extracts the "
    "system message, so we can slot it into the right place. #} {%- if messages[0]['role'] == 'system' %}"
    " {%- set system_message = messages[0]['content']|trim %} {%- set messages = messages[1:] %} {%- else"
    ' %} {%- set system_message = "" %} {%- endif %} {#- System message + builtin tools #} {{- '
    '"<|start_header_id|>system<|end_header_id|>\\n\\n" }} {%- if builtin_tools is defined or tools is '
    'not none %} {{- "Environment: ipython\\n" }} {%- endif %} {%- if builtin_tools is defined %} {{- '
    '"Tools: " + builtin_tools | reject(\'equalto\', \'code_interpreter\') | join(", ") + "\\n\\n"}} '
    '{%- endif %} {{- "Cutting Knowledge Date: December 2023\\n" }} {{- "Today Date: " + date_string '
    '+ "\\n\\n" }} {%- if tools is not none and not tools_in_user_message %} {{- "You have access to '
    'the following functions. To call a function, please respond with JSON for a function call." }} {{- '
    '\'Respond in the format {"name": function name, "parameters": dictionary of argument name and its'
    ' value}.\' }} {{- "Do not use variables.\\n\\n" }} {%- for t in tools %} {{- t | tojson(indent=4) '
    '}} {{- "\\n\\n" }} {%- endfor %} {%- endif %} {{- system_message }} {{- "<|eot_id|>" }} {#- '
    "Custom tools are passed in a user message with some extra guidance #} {%- if tools_in_user_message "
    "and not tools is none %} {#- Extract the first user message so we can plug it in here #} {%- if "
    "messages | length != 0 %} {%- set first_user_message = messages[0]['content']|trim %} {%- set "
    'messages = messages[1:] %} {%- else %} {{- raise_exception("Cannot put tools in the first user '
    "message when there's no first user message!\") }} {%- endif %} {{- "
    "'<|start_header_id|>user<|end_header_id|>\\n\\n' -}} {{- \"Given the following functions, please "
    'respond with a JSON for a function call " }} {{- "with its proper arguments that best answers the '
    'given prompt.\\n\\n" }} {{- \'Respond in the format {"name": function name, "parameters": '
    'dictionary of argument name and its value}.\' }} {{- "Do not use variables.\\n\\n" }} {%- for t in '
    'tools %} {{- t | tojson(indent=4) }} {{- "\\n\\n" }} {%- endfor %} {{- first_user_message + '
    "\"<|eot_id|>\"}} {%- endif %} {%- for message in messages %} {%- if not (message.role == 'ipython' "
    "or message.role == 'tool' or 'tool_calls' in message) %} {{- '<|start_header_id|>' + message['role']"
    " + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }} {%- elif 'tool_calls' in "
    'message %} {%- if not message.tool_calls|length == 1 %} {{- raise_exception("This model only '
    'supports single tool-calls at once!") }} {%- endif %} {%- set tool_call = message.tool_calls[0]'
    ".function %} {%- if builtin_tools is defined and tool_call.name in builtin_tools %} {{- "
    "'<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- \"<|python_tag|>\" + tool_call.name + "
    '".call(" }} {%- for arg_name, arg_val in tool_call.arguments | items %} {{- arg_name + \'="\' + '
    'arg_val + \'"\' }} {%- if not loop.last %} {{- ", " }} {%- endif %} {%- endfor %} {{- ")" }} {%- '
    "else %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- '{\"name\": \"' + "
    'tool_call.name + \'", \' }} {{- \'"parameters": \' }} {{- tool_call.arguments | tojson }} {{- "}" '
    "}} {%- endif %} {%- if builtin_tools is defined %} {#- This means we're in ipython mode #} {{- "
    '"<|eom_id|>" }} {%- else %} {{- "<|eot_id|>" }} {%- endif %} {%- elif message.role == "tool" '
    'or message.role == "ipython" %} {{- "<|start_header_id|>ipython<|end_header_id|>\\n\\n" }} {%- '
    "if message.content is mapping or message.content is iterable %} {{- message.content | tojson }} {%- "
    'else %} {{- message.content }} {%- endif %} {{- "<|eot_id|>" }} {%- endif %} {%- endfor %} {%- if '
    "add_generation_prompt %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }} {%- endif %} "
)

promptAssembler = (
    PromptAssembler()
    .setInputCol("messages")
    .setOutputCol("prompt")
    .setChatTemplate(template)
)
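The Jinja template above is dense, so it may help to see roughly what it produces for a plain conversation without tools. The function below is a simplified approximation in plain Python: it reuses the special tokens that appear in the template but leaves out the bos token, the date lines and every tool-related branch, and its whitespace may differ from the real rendering. The name render_llama3_chat is hypothetical.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Render (role, content) pairs in a simplified Llama-3-style layout."""
    parts = []
    for role, content in messages:
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{content.strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


rendered = render_llama3_chat([
    ("system", "You are a question answering system."),
    ("user", "Query: Which House did Harry belong to?"),
])
print(rendered)
```

Each turn is wrapped in a role header and closed with an end-of-turn token, and the prompt ends with an open assistant header so that the model continues from there.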

3.2 Fill the prompt template with the query and the relevant context fetched from the retrieval step.

Now we populate our prompt template with the query and the context as shown below. Then we convert our prompts into a DataFrame to pass it to the prompt assembler. Below we can see the output of the prompt template with the query and the context that is fetched from the vector database.

prompts = []
for query, context in query_and_context.items():
    messages = [
        ("system", "You are a question answering system. You will be given a query and some context, you need to answer the query based on the context provided. Use your own knowledge if relevant context is not provided. Give your answer as a full sentence with minimum text."),
        ("assistant", "Hello there! What is your query today?"),
        ("user", f"Query: {query} Context: {''.join(context)}"),
    ]
    prompts.append([messages])

promptDF = spark.createDataFrame(prompts, ["messages"])
promptDF.show()

#OUTPUT
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|messages                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{system, You are a question answering system. You will be given a query and some context, you need to answer the query based on the context provided. Use your own knowledge if relevant context is not provided. Give your answer as a full sentence with minimum text.}, {assistant, Hello there! What is your query today?}, {user, Query: Who are Harry Potter's parents? Context: Harry learns that his parents Lily and James Potter also had magical powers and were murdered by the dark wizard Lord Voldemort when Harry was a babywizards of Muggle parentage are the primary targetsHe gains the friendship of Ron Weasley a member of a large but poor wizarding family and Hermione Granger a witch of nonmagical or Muggle parentage}]                                                                                                                                                                           |
|[{system, You are a question answering system. You will be given a query and some context, you need to answer the query based on the context provided. Use your own knowledge if relevant context is not provided. Give your answer as a full sentence with minimum text.}, {assistant, Hello there! What is your query today?}, {user, Query: Which House did Harry belong to? Context: The event made Harry famous among the community of wizards and witchesHarry becomes a student at Hogwarts and is sorted into Gryffindor HouseHarry learns that his parents Lily and James Potter also had magical powers and were murdered by the dark wizard Lord Voldemort when Harry was a babyIn the first book Harry Potter and the Philosophers Stone Harry Potter and the Sorcerers Stone in the US Harry lives in a cupboard under the stairs in the house of the Dursleys his aunt uncle and cousin who all treat him poorly}]|
|[{system, You are a question answering system. You will be given a query and some context, you need to answer the query based on the context provided. Use your own knowledge if relevant context is not provided. Give your answer as a full sentence with minimum text.}, {assistant, Hello there! What is your query today?}, {user, Query: Who were Harry's friends? Context: The novels chronicle the lives of a young wizard Harry Potter and his friends Hermione Granger and Ron Weasley all of whom are students at Hogwarts School of Witchcraft and WizardryHe gains the friendship of Ron Weasley a member of a large but poor wizarding family and Hermione Granger a witch of nonmagical or Muggle parentageLupin enters the shack and explains that Sirius was James Potters best friend}]                                                                                                                       |
|[{system, You are a question answering system. You will be given a query and some context, you need to answer the query based on the context provided. Use your own knowledge if relevant context is not provided. Give your answer as a full sentence with minimum text.}, {assistant, Hello there! What is your query today?}, {user, Query: What are the major themes in the Harry Potter series? Context: A series of many genres including fantasy drama comingofage fiction and the British school story which includes elements of mystery thriller adventure horror and romance the world of Harry Potter explores numerous themes and includes many cultural meanings and referencesMajor themes in the series include prejudice corruption madness love and deathHarry Potter is a series of seven fantasy novels written by British author J K Rowling}]                                                             |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
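In the output above, the retrieved sentences run into each other (for example "babywizards") because they are joined with an empty string. If you would rather keep sentence boundaries visible to the model, one option is to join them with a separator. The helper below is a hypothetical sketch of that idea; changing the prompt in this way may also change the answers the model gives.

```python
def format_context(sentences, separator=". "):
    """Join retrieved sentences so that sentence boundaries stay visible."""
    cleaned = [s.strip().rstrip(".") for s in sentences if s.strip()]
    return separator.join(cleaned) + "." if cleaned else ""


context = [
    "Major themes in the series include prejudice corruption madness love and death",
    "Harry Potter is a series of seven fantasy novels written by British author J K Rowling",
]
user_message = f"Query: What are the major themes? Context: {format_context(context)}"
print(user_message)
```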

Load the AutoGGUFModel, which uses the phi3.5_mini_4k_instruct_q4_gguf model by default. Any LLM of your choice can be used in this step.

from sparknlp.annotator import AutoGGUFModel

autoGGUFModel = (
    AutoGGUFModel
    .pretrained()
    .setInputCols("prompt")
    .setOutputCol("completions")
    .setBatchSize(4)
    .setNGpuLayers(99)
    .setUseChatTemplate(True)  
)

3.3 Pass the prompt to the LLM to receive an answer to your query.

Now we build a Spark NLP pipeline with the following stages:

PromptAssembler: Annotator to fill prompt templates with relevant text.

AutoGGUFModel: LLM which completes our prompts.

generationPipeline = Pipeline(stages=[promptAssembler, autoGGUFModel])
output = generationPipeline.fit(promptDF).transform(promptDF)

Let’s check our final results from the AutoGGUFModel.

final_output = output.selectExpr(
    "explode(completions.result) as output"
)

final_output.show()

#OUTPUT
+--------------------------------------------------------------------------------------------------------+
|output                                                                                                  |
+--------------------------------------------------------------------------------------------------------+
|Harry Potter's parents are Lily and James Potter.                                                       |
|Harry belonged to Gryffindor House at Hogwarts School of Witchcraft and Wizardry.                       |
|Harry's friends were Ron Weasley and Hermione Granger.\n\n                                              |
|The major themes in the Harry Potter series include prejudice, corruption, madness, love, and death.\n\n|
+--------------------------------------------------------------------------------------------------------+

With this RAG pipeline, you’re now equipped to unlock richer insights and build smarter applications — time to put your data into action! ✨

RAG with Spark NLP was originally published in spark-nlp on Medium, where people are continuing the conversation by highlighting and responding to this story.

Spark NLP 6.0.0: A New Era for Universal Ingestion and Multimodal LLM Processing at Scale

Maziyar Panahi — Mon, 28 Apr 2025 19:06:27 GMT

From Raw Documents to Multimodal Insights at Enterprise Scale

Spark NLP 6.0.0 marks a monumental shift — from being the leading NLP library to becoming the de facto platform for distributed LLM ingestion and multimodal processing.

In this release, we expand Spark NLP beyond text:

Ingest PDFs, Excel files, PowerPoint presentations, and text logs natively into Spark pipelines.
Automatically extract structure, semantics, metadata — at scale, without writing a single line of parsing code.
Perform batch multimodal inference by running quantized Vision-Language Models (VLMs) like LLAVA, Phi-3 Vision, DeepSeek Janus, and Llama 3.2 Vision natively in Spark — no servers, no API bottlenecks.

With Spark NLP 6.0.0, building enterprise-grade RAG, document understanding, compliance audits, and multimodal analytics pipelines becomes frictionless.

One unified framework. Text, vision, documents — at Spark scale.

✨ Spotlight Feature: AutoGGUFVisionModel

Native Multimodal Inference with Llama.cpp

Spark NLP 6.0.0 introduces the groundbreaking AutoGGUFVisionModel, enabling vision-language model inference inside DataFrames — directly through Llama.cpp.

You can now:

Pass raw image bytes and captions into a multimodal LLM.
Generate descriptions, summaries, or visual Q&A results.
Run fully on-premises at Spark-native scale.

Why it Matters

Unlock pure multimodal workflows with zero infrastructure setup.
Perform massive batch inference across product catalogs, document archives, compliance audits, and more.
Enjoy full control over inference parameters like topK, topP, temperature, and nPredict.

How it Works

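from pyspark.ml import Pipeline
from pyspark.sql.functions import lit
from sparknlp.base import DocumentAssembler, ImageAssembler
from sparknlp.annotator import AutoGGUFVisionModel

# Assumes an active Spark NLP session named `spark`, e.g. spark = sparknlp.start()

# Wrap the text prompt and the raw image bytes as Spark NLP annotations
documentAssembler = DocumentAssembler().setInputCol("caption").setOutputCol("caption_document")

imageAssembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

# Load the images as bytes and attach the same prompt to every row
data = ImageAssembler.loadImagesAsBytes(spark, "src/test/resources/image/")\
    .withColumn("caption", lit("Caption this image."))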

model = AutoGGUFVisionModel.pretrained()\
    .setInputCols(["caption_document", "image_assembler"])\
    .setOutputCol("completions")\
    .setBatchSize(4)\
    .setNPredict(40)\
    .setTopK(40)\
    .setTopP(0.95)\
    .setTemperature(0.05)

pipeline = Pipeline().setStages([documentAssembler, imageAssembler, model])
results = pipeline.fit(data).transform(data)

results.selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "completions.result").show(truncate=False)
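To make the sampling settings in this pipeline a bit more concrete, here is a toy sketch of nucleus (top-p) filtering, the idea behind setTopP(0.95). It is an illustration in plain Python only, not the llama.cpp implementation; the function name and the toy probabilities are made up for this example.

```python
def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalize what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (hypothetical values)
probs = {"cat": 0.6, "dog": 0.3, "car": 0.07, "sky": 0.03}
nucleus = top_p_filter(probs, top_p=0.95)  # the unlikely tail token "sky" is dropped
```

A lower topP keeps fewer candidates and makes the output more conservative; combined with a low temperature such as 0.05, generation becomes close to deterministic.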

📚 Full notebook walkthrough available here

🔥 New Features and Enhancements

Universal Document Ingestion

PDF Reader: Font-aware ingestion, page segmentation, encrypted file support.
Excel Reader: Native .xls and .xlsx ingestion with multi-sheet support.
PowerPoint Reader: Capture slides, speaker notes, themes, alt text.
Text Reader: Load .txt, .csv, .log files at scale with encoding detection.

Multimodal & Vision-Language Modeling

AutoGGUFVisionModel: Native multimodal batch inference with Llama.cpp.
DeepSeek Janus Integration: Advanced instruction-following across text and images.
Qwen-2 Vision Language Models: Multilingual multimodal understanding.
Phi-3.5 Vision: Lightweight visual reasoning models under 1 GB.
LLAVA 1.5: Screenshot Q&A, chart reading, UI testing in Spark.

Massive LLM Catalog Expansion

Cohere Command-R Support: Up to 35B multilingual models natively.
OLMo Family: Open-weight, reproducible LLMs tuned for academic and benchmark tasks.
Multiple-Choice Heads: Lightweight heads for ALBERT, RoBERTa, XLM-RoBERTa.

Infrastructure Upgrades

VisionEncoderDecoder Improvements: Full Scala/Python API parity.
Better GGUF Error Reporting: Actionable fixes for model compatibility.

🐛 Key Bug Fixes

Clearer error messages for GGUF models.
Fixed MXBAI typo across notebooks.
Full alignment of VisionEncoderDecoder APIs between Scala and Python.
Smoothed variable naming across the entire codebase.

📝 Models, Models, and More Models

Over 110,000 new models and pipelines have been added — spanning 230+ languages.
Explore them all on the Models Hub.

❤️ Community Support

Slack — Chat with us live
GitHub — Report issues, suggest features
Medium — Articles and tutorials
YouTube — Video walkthroughs

📦 Installation

PyPI

pip install spark-nlp==6.0.0

Spark Packages

# CPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.0

# GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.0

# Apple Silicon (M1, M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.0

# AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.0
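
If you launch Spark from scripts for more than one hardware target, it can be handy to build the package coordinate programmatically. The sketch below only reuses the artifact names listed above; spark_nlp_package is a hypothetical helper, not something shipped with Spark NLP.

```python
# Artifact names taken from the spark-shell commands above
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def spark_nlp_package(target="cpu", version="6.0.0", scala="2.12"):
    """Return the Maven coordinate for the given hardware target."""
    if target not in ARTIFACTS:
        raise ValueError(f"unknown target {target!r}; expected one of {sorted(ARTIFACTS)}")
    return f"com.johnsnowlabs.nlp:{ARTIFACTS[target]}_{scala}:{version}"


gpu_package = spark_nlp_package("gpu")
print(f"spark-shell --packages {gpu_package}")
```

The same coordinate string can also be set as the spark.jars.packages option when you build a SparkSession yourself.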

FAT JARs

🚀 Full Changelog

Want even more detail? Dive into the full changelog.

🌟 Final Words

Spark NLP 6.0.0 is not just a version update — it is a complete evolution of the platform.

From text, to vision, to multimodal reasoning — at Spark scale, zero servers, maximum performance.

The future of distributed AI pipelines has arrived.

🚀 Spark NLP 6.0.0: A New Era for Universal Ingestion and Multimodal LLM Processing at Scale was originally published in spark-nlp on Medium, where people are continuing the conversation by highlighting and responding to this story.

Introducing NLLB: Breaking Language Barriers with Multilingual Translation in Spark NLP

Muhammad Abdullah — Thu, 14 Nov 2024 09:29:47 GMT

As Natural Language Processing (NLP) continues to break new ground, Spark NLP’s integration of the No Language Left Behind (NLLB) model marks a major step forward in multilingual machine translation. NLLB, a state-of-the-art encoder-decoder model, is designed to tackle the complexities of translating over 200 languages with remarkable accuracy, even for low-resource languages. With this latest addition, Spark NLP provides a powerful tool for developers and researchers to create highly efficient and versatile language translation pipelines.

Key Features of NLLB

NLLB stands out as an essential tool for multilingual translation, providing a robust and scalable solution for both common and underrepresented languages. Here’s a breakdown of its major capabilities:

1. Unmatched Language Coverage

NLLB directly supports translation between over 200 languages, from widely spoken ones like English, Spanish, and Chinese to low-resource languages such as Kashmiri, Tswana, and Quechua. This makes it a powerful model for organizations and developers looking to break through language barriers on a global scale.

2. Efficient Multilingual Translation

Unlike traditional models, NLLB excels at translating many-to-many language pairs without requiring an intermediary language like English. This direct approach reduces translation latency and improves the quality of translations, even for languages that lack extensive training data.

3. Optimized for Low-Resource Languages

One of NLLB’s core strengths is its ability to perform well with low-resource languages. By incorporating novel data mining techniques and a Sparsely Gated Mixture of Experts model architecture, NLLB narrows the performance gap between high and low-resource languages, ensuring consistent quality across the board.

4. Customizable Translation Pipelines

Spark NLP makes it easy to integrate and customize NLLB for various translation tasks. With parameters for setting source and target languages, output length, and decoding strategies, users have full control over the translation process, from casual content generation to high-stakes multilingual communication.

Simplicity of Integration

Getting started with NLLB in Spark NLP is as simple as a few lines of code. Here’s a quick example:

from sparknlp.annotator import NLLBTransformer

nllb = NLLBTransformer.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("generation")

Tailoring Translations to Your Needs

The NLLB Transformer in Spark NLP offers several parameters that allow users to fine-tune translations according to their specific requirements. Here are a few key parameters that provide control over the output:

maxOutputLength(int): Caps the length of the generated translation, preventing unnecessarily long outputs.
temperature(float): Controls the randomness of sampling. Lower values make translations more deterministic; higher values make them more varied.
topK(int): Enables top-K sampling, where only the K most probable tokens are considered at each step, filtering out unlikely candidates.
repetitionPenalty(float): Penalizes tokens that have already been generated, reducing repeated words and phrases. A value of 1.0 means no penalty.
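
To build some intuition for how temperature and topK interact, here is a toy sketch in plain Python. It is not the code Spark NLP runs internally; the function name and the example scores are invented for illustration. It scales the raw scores by the temperature, keeps the K best tokens, and turns what is left into probabilities:

```python
import math


def top_k_distribution(logits, temperature=1.0, top_k=2):
    """Temperature-scale the scores, keep the top_k tokens, apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    scaled = {token: score / temperature for token, score in logits.items()}
    # Keep only the top_k highest-scoring tokens
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability)
    max_score = kept[0][1]
    exps = {token: math.exp(score - max_score) for token, score in kept}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Hypothetical scores for the next token of a translation
logits = {"chocolates": 2.0, "candy": 1.0, "rocks": -1.0}
sharp = top_k_distribution(logits, temperature=0.5, top_k=2)
flat = top_k_distribution(logits, temperature=2.0, top_k=2)
```

With the lower temperature, the best candidate takes a larger share of the probability mass than with the higher one, and the third token is never sampled because top_k is 2.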

NLLB in Action: A Sample Multilingual Translation Pipeline

Building a multilingual translation pipeline with NLLB in Spark NLP is as simple as the following example, which translates a sentence from Chinese to English:

!pip install spark-nlp pyspark

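import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import NLLBTransformer
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP loaded; the steps below use `spark`
spark = sparknlp.start()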

# Step 1: Initialize Document Assembler
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Step 2: Load the pretrained NLLB model
nllb = NLLBTransformer.pretrained("nllb_418M") \
    .setInputCols(["documents"]) \
    .setOutputCol("generation") \
    .setSrcLang("zho_Hans") \
    .setTgtLang("eng_Latn")

# Step 3: Build the Spark NLP pipeline
pipeline = Pipeline().setStages([documentAssembler, nllb])

# Step 4: Input DataFrame with Chinese text
data = spark.createDataFrame([["生活就像一盒巧克力。"]]).toDF("text")

# Step 5: Run the pipeline
result = pipeline.fit(data).transform(data)

# Step 6: Display the generated translation
result.select("generation").show(truncate=False)

In this example, NLLB translates the Chinese input into English, generating the sentence “Life is like a box of chocolates.” The same pipeline handles any other supported language pair: only the source and target language codes need to change.
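
The language codes used with setSrcLang and setTgtLang, zho_Hans and eng_Latn, follow the FLORES-200 convention: a three-letter language code, an underscore, and a four-letter script code. A typo here is easy to make, so a quick shape check before building the pipeline can save a failed run. The helper below is a hypothetical sketch; it only validates the format and does not confirm that the model supports a given code.

```python
import re

# Three lowercase letters, underscore, four-letter script code (e.g. eng_Latn)
LANG_CODE_PATTERN = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def check_lang_code(code):
    """Return the code unchanged if it has the expected shape, else raise."""
    if not LANG_CODE_PATTERN.fullmatch(code):
        raise ValueError(f"{code!r} does not look like a code such as 'eng_Latn'")
    return code


src_lang = check_lang_code("zho_Hans")
tgt_lang = check_lang_code("eng_Latn")
```

The validated values can then be passed straight to setSrcLang and setTgtLang.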

Performance Considerations

NLLB models are optimized for efficient inference, but they remain sizable encoder-decoder networks. For high-throughput or long-sequence translation, it’s recommended to deploy the model in GPU-accelerated environments, which shortens inference time when translating at scale.

Conclusion

Adding NLLB to Spark NLP opens up new possibilities for multilingual translation, particularly for low-resource languages. With its expansive language coverage, customizable parameters, and ease of integration, NLLB enables developers and researchers to break down language barriers with unprecedented accuracy. Whether working on cross-lingual chatbots, document translation, or international content creation, NLLB offers the flexibility and power you need to succeed in a globalized world.

📄 Full Release Notes: Spark NLP 5.5.0 Release Notes

For further reading and resources, consider exploring the following:

NLLB Research Paper: Read Meta AI’s research on NLLB models.
GitHub — Open NLLB: A community effort to provide open-source versions of NLLB checkpoints and related tools.
GitHub — Spark NLP: https://github.com/JohnSnowLabs/spark-nlp
Models Hub: https://nlp.johnsnowlabs.com/models
More examples: https://github.com/JohnSnowLabs/spark-nlp-workshop

❤️ Community Support

Slack: Join the Spark NLP community and team for live discussions.
Discussions: Engage with community members, share ideas, and showcase how you use Spark NLP!
Medium: Read Spark NLP articles on its official Medium page.
YouTube: Watch Spark NLP video tutorials for in-depth guidance.

Join the Spark NLP community for support, discussions, and insights into the latest advancements in NLP!

Introducing NLLB: Breaking Language Barriers with Multilingual Translation in Spark NLP was originally published in spark-nlp on Medium, where people are continuing the conversation by highlighting and responding to this story.