How to Use the Unstructured Module in Python with LangChain

Gary Svenson
8 min read · 6 days ago



Let’s talk about something we all face during development: API testing with Postman for your development team.

Yeah, I’ve heard it too: Postman is getting worse year by year. But you work as a team and you need collaboration tools for your development process, right? So you paid for Postman Enterprise at…. $49/month.

Now I am telling you: you don’t have to.

That’s right, APIDog gives you all the features that come with Postman’s paid version, at a fraction of the cost. Migration is so easy that you only need to click a few buttons, and APIDog does everything for you.

APIDog has a comprehensive, easy-to-use GUI that lets you start working right away (especially if you have migrated from Postman). It’s elegant, collaborative, easy to use, and has Dark Mode too!

Want a good alternative to Postman? APIDog is definitely worth a shot. And if you are the tech lead of a dev team that really wants to dump Postman for something better and cheaper, check out APIDog!

How to Use the Unstructured Module in Python with LangChain

Understanding the Unstructured Module

The unstructured module in Python is a library that provides tools for parsing, converting, and processing data without predefined structures. This module is particularly useful in the context of Natural Language Processing (NLP) and Machine Learning (ML) where input data often comes in formats that are not easily parsed, such as emails, PDFs, web pages, and other formats.

LangChain, which integrates naturally with the unstructured module, is a framework for developing applications powered by language models. By leveraging both the unstructured module and LangChain, developers can efficiently handle raw data inputs, transforming them into structured outputs for further processing.
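As a quick illustration of that integration, LangChain ships document loaders backed by unstructured. The sketch below assumes the langchain-community package (older releases exposed the same loader under langchain.document_loaders); the file name is a placeholder.

```python
def load_with_unstructured(path):
    """Load any supported file into LangChain Document objects.

    The import happens inside the function so this sketch can be read
    (and the helper defined) even without the libraries installed.
    """
    from langchain_community.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(path)
    return loader.load()  # a list of Document objects with .page_content

# Example usage (requires langchain-community, unstructured, and a real file):
# docs = load_with_unstructured("example.pdf")
# print(docs[0].page_content[:200])
```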

Installation of Required Libraries

Before diving into coding, you must first install the necessary libraries: the unstructured module and langchain. You can install these Python packages with the following commands:

pip install unstructured
pip install langchain

Depending on the file types you plan to parse, unstructured also offers format-specific extras, for example pip install "unstructured[pdf]".

Make sure you have Python already installed on your system. You can check your Python version with the command:

python --version

It is recommended to use Python 3.7 or higher for compatibility purposes.
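If you want your own scripts to enforce that minimum version, a small standard-library check works:

```python
import sys

# Fail fast with a clear message on interpreters older than 3.7
if sys.version_info < (3, 7):
    raise RuntimeError("This project requires Python 3.7 or higher")

print(f"Running Python {sys.version_info.major}.{sys.version_info.minor}")
```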

Loading and Parsing Raw Data with Unstructured

Once you’ve installed the required libraries, you can begin working with unstructured data. The following examples illustrate how to load and parse data from an unstructured format.

Reading Text from a PDF File

One of the most common tasks is extracting text from PDF files. With the unstructured module, this can be easily accomplished:

from unstructured.partition.pdf import partition_pdf

# Parse a PDF file into a list of document elements
elements = partition_pdf(filename="example.pdf")

# Join the element text into a single string
text = "\n\n".join(str(el) for el in elements)
print(text)

In this example, partition_pdf parses "example.pdf" into a list of elements (titles, paragraphs, list items, and so on), and joining their text gives you the full document contents. This text can now be used as input for a LangChain model.

Extracting Text from HTML Content

Another common source of unstructured data is HTML content from web pages. The unstructured module also provides functionalities for parsing HTML files.

from unstructured.partition.html import partition_html

# Parse an HTML file into elements
elements = partition_html(filename="example.html")

# Extract the text
text = "\n\n".join(str(el) for el in elements)
print(text)

In this example, partition_html parses the HTML file "example.html", and we again join the element text into a single string.

Preprocessing Text Data

Once you have extracted text, the next step is preprocessing it. Preprocessing includes cleaning the data, removing stop words, lowercasing, etc. The following example illustrates some basic text preprocessing.

import re
from nltk.corpus import stopwords

# Sample text
text = """This is an Example text! It will be Preprocessed."""

# Lowercase
text = text.lower()

# Remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# Tokenization
tokens = text.split()

# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

print(filtered_tokens)

In this code snippet, we lowercase the text, remove punctuation using a regular expression, and filter out stop words using the NLTK library. Note that the NLTK stop-words corpus must be downloaded first:

import nltk
nltk.download('stopwords')

Integrating Unstructured Data with LangChain

LangChain provides an intuitive framework for generating or processing text using various language models. After preprocessing your unstructured data, you can integrate it into LangChain workflows for various applications such as question answering, summarization, or information retrieval.
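One practical detail before wiring extracted text into a model: documents are often longer than a model's context window, so it is common to split them into overlapping chunks first. Here is a minimal, dependency-free sketch of that idea (LangChain also ships text splitters that do this for you):

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters,
    repeating `overlap` characters between consecutive chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# A 2500-character string splits into three overlapping chunks
chunks = chunk_text("a" * 2500, chunk_size=1000, overlap=100)
print(len(chunks))
```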

Creating a Simple LangChain Application

To create a LangChain application that utilizes extracted text, you must first define a LangChain model. Here’s a simple way to achieve this:

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Initialize the language model (requires the OPENAI_API_KEY environment variable)
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

# Define a prompt template
prompt_template = PromptTemplate(
    input_variables=["input_text"],
    template="Summarize the following text: {input_text}"
)

# Create an LLM chain
chain = LLMChain(llm=llm, prompt=prompt_template)

# Run the chain with your filtered tokens as input
input_text = ' '.join(filtered_tokens)
summary = chain.run(input_text)

print(summary)

In this example, we create an LLMChain that uses an OpenAI chat model to summarize the preprocessed text. The PromptTemplate class dynamically constructs the prompt sent to the model from the provided input. The generated summary can then be used elsewhere in your application.

Building a Question Answering System

By utilizing the unstructured module and LangChain, you can build a simple question-answering system. This application will extract relevant information from the text based on a user-provided query.

from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Index the extracted text in a vector store so it can be searched
vectorstore = FAISS.from_texts([input_text], OpenAIEmbeddings())

# Build a question-answering chain on top of the retriever
qa_system = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

question = "What is the main idea of the text?"
answer = qa_system.run(question)

print(answer)

In this block of code, we embed the extracted text into a FAISS vector store (which requires the faiss-cpu package and an embeddings backend), then build a RetrievalQA chain that retrieves the passages most relevant to the user's question and asks the language model to answer from them. This demonstrates how easily unstructured data can feed more complex tasks.

Handling Different File Formats

The unstructured module supports a variety of file formats beyond HTML and PDF, including DOCX, Markdown, and more. The methodology remains similar, focusing on extracting and processing text. Here’s a brief overview of how to handle these formats.

Extracting Text from DOCX Files

You may often encounter Word documents (.docx) in your work. Here’s how to extract text from them:

from unstructured.partition.docx import partition_docx

# Parse a DOCX file into elements
elements = partition_docx(filename="example.docx")

# Extract the text
text = "\n\n".join(str(el) for el in elements)
print(text)

This code allows you to easily work with contents from Word documents, just like with PDFs and HTML files.

Working with Markdown

Markdown files are popular for documentation purposes. To extract text from a Markdown file:

from unstructured.partition.md import partition_md

# Parse a Markdown file into elements
elements = partition_md(filename="example.md")

# Extract the text
text = "\n\n".join(str(el) for el in elements)
print(text)

In each case the process of extracting text remains consistent: import the partition function that matches the file format and join the text of the resulting elements.
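If you would rather not pick the partition function yourself, unstructured also provides an auto-dispatching entry point, unstructured.partition.auto.partition, which infers the file type for you. A small sketch (the import is deferred so the helper can be defined without the library installed):

```python
def extract_text(path):
    """Extract text from any supported file via unstructured's auto-dispatch.

    partition() inspects the file and routes it to the appropriate
    format-specific parser (PDF, HTML, DOCX, Markdown, and more).
    """
    from unstructured.partition.auto import partition

    elements = partition(filename=path)
    return "\n\n".join(str(el) for el in elements)

# Example usage (requires unstructured and a real file):
# print(extract_text("example.md"))
```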

Advanced Text Processing Techniques

Once you have mastered the basics of extracting and preprocessing data using the unstructured module with LangChain, you can delve deeper into more advanced text-processing techniques.

Implementing Custom Text Normalization

You might want to customize the preprocessing steps to fit specific application requirements. Here is an example of how to implement a custom normalization function:

def custom_normalization(text):
    # Lowercase
    text = text.lower()
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse repeated whitespace and strip the ends
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

normalized_text = custom_normalization("Example 1234 text! ")
print(normalized_text)  # Output: 'example text'

This function normalizes the text to facilitate more accurate processing in subsequent steps.

Leveraging Transformers for Enhanced Performance

For users looking to exploit the latest techniques in NLP, integrating transformer models can significantly boost performance. Here’s an outline of how you might incorporate Hugging Face’s transformers library along with LangChain:

from transformers import pipeline

# Initialize a summarization pipeline (downloads a default model on first use)
summarizer = pipeline("summarization")

# Summarize text, bounding the length of the generated summary
summary_text = summarizer(input_text, max_length=130, min_length=30, do_sample=False)
print(summary_text[0]['summary_text'])

This example uses the Transformers summarization pipeline as a locally runnable alternative to calling a hosted LLM. By combining these approaches, you can craft sophisticated text-processing applications tailored to specific needs.

Conclusion

By integrating the capabilities of the unstructured module with LangChain, you now have a powerful toolkit for handling unstructured data. Throughout this article, you learned to extract text from various formats, preprocess it, integrate it into LangChain applications, and leverage advanced NLP capabilities for effective data analysis and machine learning tasks. This knowledge sets the foundation for developing sophisticated language-based applications.

