Apache Tika: Code with example walkthroughs

Simon Li
Jun 14, 2019

In my previous article, I gave an overview of what Apache Tika is, how it works, and how you can use it. In this article, I will walk through detailed usage examples with code in Java.

Quick Overview

The first thing I will do is give a short run-through of a basic Tika implementation.

Below is the code for a method you can use to find the type of the document:

public static String detectDocTypeUsingDetector(InputStream stream) 
throws IOException {
Detector detector = new DefaultDetector();
Metadata metadata = new Metadata();

MediaType mediaType = detector.detect(stream, metadata);
return mediaType.toString();
}

This Java code returns the document's media type. Even if the file's extension has been changed, Tika can still identify the correct type from the magic bytes it finds at the beginning of the file.
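To make the idea of magic bytes concrete, here is a minimal sketch that uses only the JDK. It is not Tika's implementation: the class MagicByteSketch and its sniff method are hypothetical names, and the sketch knows just three well-known signatures, whereas Tika ships a much larger registry.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a hypothetical helper, not part of Tika.
class MagicByteSketch {

    // Returns true when data begins with every byte of the signature.
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    // Guesses a media type from the leading bytes alone; the file name is never consulted.
    static String sniff(byte[] data) {
        if (startsWith(data, 0x25, 0x50, 0x44, 0x46, 0x2D)) { // "%PDF-"
            return "application/pdf";
        }
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "image/png";
        }
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "image/jpeg";
        }
        return "application/octet-stream"; // unknown content
    }

    public static void main(String[] args) {
        // Bytes of a PDF header, as they would appear even in a file renamed to .txt.
        byte[] renamed = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(renamed)); // prints application/pdf
    }
}
```

Because only the bytes are inspected, renaming a file has no effect on the result.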

Another, easier way to do this is through the Tika facade class:

public static String detectDocTypeUsingFacade(InputStream stream) 
throws IOException {

Tika tika = new Tika();
String mediaType = tika.detect(stream);
return mediaType;
}

Next, let us extract the file’s contents using a parser.

public static String extractContentUsingParser(InputStream stream) 
throws IOException, TikaException, SAXException {

Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();

parser.parse(stream, handler, metadata, context);
return handler.toString();
}

This is a lower-level approach that shows what is going on, but again, we can use the Tika class for an easier approach:

public static String extractContentUsingFacade(InputStream stream) 
throws IOException, TikaException {

Tika tika = new Tika();
String content = tika.parseToString(stream);
return content;
}

Next, let’s extract the metadata using the parser as well:

public static Metadata extractMetadataUsingParser(InputStream stream) 
throws IOException, SAXException, TikaException {

Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();

parser.parse(stream, handler, metadata, context);
return metadata;
}

And again, we can make it easier:

public static Metadata extractMetadataUsingFacade(InputStream stream) 
throws IOException, TikaException {
Tika tika = new Tika();
Metadata metadata = new Metadata();

tika.parse(stream, metadata);
return metadata;
}

Above are some short code snippets covering Tika's main uses. Next, I will show specific examples of extracting content from different document and data types. These are not difficult applications, but each one has slight differences. Each is wrapped in a main method.

Importing packages

Here are most of the things you need to import for these methods:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

PDF

import org.apache.tika.parser.pdf.PDFParser;

public class PdfParse {

public static void main(final String[] args) throws IOException, SAXException, TikaException {

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.pdf"));
ParseContext pcontext = new ParseContext();

//parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata,pcontext);

//getting the content of the document
System.out.println("Contents of the PDF :" + handler.toString());

//getting metadata of the document
System.out.println("Metadata of the PDF:");
String[] metadataNames = metadata.names();

for(String name : metadataNames) {
System.out.println(name+ " : " + metadata.get(name));
}
}
}

When you compile and run this code, you should get the actual content of the PDF printed along with its metadata. You can comment out the print statements you do not need to save space or time.
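The main method above opens a FileInputStream and never closes it, which is harmless in a short-lived program but can leak file handles in a long-running one. Below is a minimal sketch, using only the JDK, of scoping a stream with try-with-resources. The class ClosedStreamSketch and its countBytes method are hypothetical stand-ins; in the PDF example, the parse call would sit where the read loop is.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, JDK only: stands in for any code that reads a document stream.
class ClosedStreamSketch {

    // Counts the bytes in a file; the stream is closed even if reading throws.
    static long countBytes(Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long total = 0;
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            return total;
        } // in.close() runs here automatically
    }

    public static void main(String[] args) throws IOException {
        // A temporary file keeps the demo independent of example.pdf.
        Path temp = Files.createTempFile("example", ".bin");
        try {
            Files.write(temp, new byte[] {1, 2, 3});
            System.out.println(countBytes(temp)); // prints 3
        } finally {
            Files.delete(temp);
        }
    }
}
```

The same pattern applies to every file-based example in this article.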

ODF

Given below is the code to extract content and metadata from an OpenDocument Format (ODF) file:

import org.apache.tika.parser.odf.OpenDocumentParser;
public class OpenDocumentParse {

public static void main(final String[] args) throws IOException,SAXException, TikaException {

//setting up the handler, metadata, input stream, and parse context
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.odp"));
ParseContext pcontext = new ParseContext();

//Open Document Parser
OpenDocumentParser openofficeparser = new OpenDocumentParser ();
openofficeparser.parse(inputstream, handler, metadata,pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();

for(String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
}
}

The output has the same shape as in the PDF example: the contents are printed first, followed by the metadata.

MS-Office

Below is the code to parse a Microsoft Office Open XML (Excel) file:

import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
public class MSExcelParse {

public static void main(final String[] args) throws IOException, SAXException, TikaException {

//setting up the handler, metadata, input stream, and parse context
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example_msExcel.xlsx"));
ParseContext pcontext = new ParseContext();

//OOXml parser
OOXMLParser msofficeparser = new OOXMLParser ();
msofficeparser.parse(inputstream, handler, metadata,pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();

for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}

As you can see, there is a trend: all of these print the content, then the metadata.

Text Document

import org.apache.tika.parser.txt.TXTParser;
public class TextParser {
public static void main(final String[] args) throws IOException, SAXException, TikaException {

//setting up the handler, metadata, input stream, and parse context
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.txt"));
ParseContext pcontext = new ParseContext();

//Text document parser
TXTParser textParser = new TXTParser();
textParser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();

for(String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
}
}

HTML

import org.apache.tika.parser.html.HtmlParser;
public class HtmlParse {

public static void main(final String[] args) throws IOException,SAXException, TikaException {

//setting up the handler, metadata, input stream, and parse context
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.html"));
ParseContext pcontext = new ParseContext();

//Html parser
HtmlParser htmlparser = new HtmlParser();
htmlparser.parse(inputstream, handler, metadata,pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();

for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}

XML

import org.apache.tika.parser.xml.XMLParser;
public class XmlParse {

public static void main(final String[] args) throws IOException,SAXException, TikaException {

//setting up the handler, metadata, input stream, and parse context
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.xml"));
ParseContext pcontext = new ParseContext();

//Xml parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();

for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}

Image

When used on an image, Tika is very helpful for reading the image's metadata.

import org.apache.tika.parser.jpeg.JpegParser;
public class JpegParse {

public static void main(final String[] args) throws IOException,SAXException, TikaException {

//setting up the handler, metadata, input stream, and parse context
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.jpg"));
ParseContext pcontext = new ParseContext();

//Jpeg parser
JpegParser jpegParser = new JpegParser();
jpegParser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();

for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}

Although the image will not be printed, all of the metadata will be printed.

MP4 and MP3

MP4:

import org.apache.tika.parser.mp4.MP4Parser;
public class Mp4Parse {

public static void main(final String[] args) throws IOException,SAXException, TikaException {

//setting up the handler, metadata, input stream, and parse context
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.mp4"));
ParseContext pcontext = new ParseContext();

//MP4 parser
MP4Parser mp4Parser = new MP4Parser();
mp4Parser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();

for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}

This outputs only metadata, unlike the MP3 example shown below.

MP3:

import org.apache.tika.parser.mp3.LyricsHandler;
import org.apache.tika.parser.mp3.Mp3Parser;
public class Mp3Parse {

public static void main(final String[] args) throws IOException, SAXException, TikaException {

//setting up the handler, metadata, input stream, and parse context
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.mp3"));
ParseContext pcontext = new ParseContext();

//Mp3 parser
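Mp3Parser mp3Parser = new Mp3Parser();
mp3Parser.parse(inputstream, handler, metadata, pcontext);

//the parse call above has consumed inputstream, so the lyrics handler gets a fresh one
LyricsHandler lyrics = new LyricsHandler(new FileInputStream(new File("example.mp3")), handler);

//hasLyrics() gives the same answer on every call, so check it once rather than looping
if(lyrics.hasLyrics()) {
System.out.println(lyrics.toString());
}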

System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();

for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}

This will print the contents and metadata, and if the given file has any lyrics, it will also capture and display them along with the output.
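One detail to watch for whenever two consumers read the same file, such as a parser followed by a second handler: an InputStream can only be read once. The sketch below uses only the JDK to show the effect. The class StreamReuseSketch and its drain method are hypothetical names, and an in-memory stream stands in for the MP3 file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, JDK only: drain() stands in for any consumer of a stream.
class StreamReuseSketch {

    // Reads everything left in the stream and returns how many bytes that was.
    static int drain(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "audio bytes".getBytes(StandardCharsets.US_ASCII);

        InputStream shared = new ByteArrayInputStream(data);
        int first = drain(shared);  // the first consumer reads all 11 bytes
        int second = drain(shared); // the second consumer finds nothing left

        // Opening a new stream over the same data starts from the beginning again.
        int fresh = drain(new ByteArrayInputStream(data));

        System.out.println(first + " " + second + " " + fresh); // prints 11 0 11
    }
}
```

With files, the equivalent fix is to open a second FileInputStream for the second consumer.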

Parsing to output types

With Tika, you can get the textual content of your files returned in a number of different formats, such as plain text, HTML, XHTML, or the XHTML of just one part of the file. The output type is controlled by the ContentHandler you supply to the Parser. For example, with the BodyContentHandler, Tika returns the content of the document body as a plain-text string. With the ToXMLContentHandler, you get the XHTML content of the whole document as a string.
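To see why the handler alone decides the output format, here is a minimal sketch built on the JDK's own SAX classes rather than on Tika. The two handler classes are hypothetical and much simpler than BodyContentHandler or ToXMLContentHandler, but the principle is the same: the parser emits one sequence of events, and each handler keeps a different part of it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: hypothetical handlers, not Tika classes.
class HandlerOutputSketch {

    // Keeps only character data, similar in spirit to asking for plain text.
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    // Writes the element tags back out too (no attributes or escaping; sketch only).
    static class MarkupHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    // Sends the same parse events to whichever handler the caller supplies.
    static void parse(String xhtml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><body><p>Hello <b>Tika</b></p></body></html>";

        TextOnlyHandler text = new TextOnlyHandler();
        parse(doc, text);
        System.out.println(text.out); // prints Hello Tika

        MarkupHandler markup = new MarkupHandler();
        parse(doc, markup);
        System.out.println(markup.out); // prints the document with its tags
    }
}
```

Swapping the handler changes the result without touching the parsing code, which is the same design Tika relies on.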

Check out this link for more examples: https://tika.apache.org/1.21/examples.html

Translation using the Microsoft Translation API

Tika provides a pluggable translation system that allows you to send the results of parsing off to an external system or program to have the text translated into another language. Here is an example using Microsoft’s translation tool:

public String microsoftTranslateToFrench(String text) {
MicrosoftTranslator translator = new MicrosoftTranslator();
// Change the id and secret! See http://msdn.microsoft.com/en-us/library/hh454950.aspx.
translator.setId("dummy-id");
translator.setSecret("dummy-secret");
try {
return translator.translate(text, "fr");
} catch (Exception e) {
return "Error while translating.";
}
}

However, to actually use the Microsoft Translation API shown here, you will need to sign up for a Microsoft account, get an API key, and then pass that key to Tika.
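Once you have real credentials, it is usually better not to hard-code them in source. The sketch below shows one way to look them up from a map of settings, such as the one returned by System.getenv(). The class, the require method, and the variable names TRANSLATOR_ID and TRANSLATOR_SECRET are all assumptions made for illustration; nothing in Tika expects these names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, JDK only: the setting names below are made up for illustration.
class TranslatorCredentialsSketch {

    // Looks up a required setting and fails loudly when it is missing or blank.
    static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing setting: " + name);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        // A sample map keeps the demo runnable; a real program would pass System.getenv().
        Map<String, String> env = new HashMap<>();
        env.put("TRANSLATOR_ID", "my-id");
        env.put("TRANSLATOR_SECRET", "my-secret");

        String id = require(env, "TRANSLATOR_ID");         // assumed name
        String secret = require(env, "TRANSLATOR_SECRET"); // assumed name

        // These values would replace "dummy-id" and "dummy-secret" in the
        // setId and setSecret calls of the translation example.
        System.out.println("Loaded id of length " + id.length()
                + " and secret of length " + secret.length());
    }
}
```

Failing early on a missing value gives a clearer error than a rejected translation request later.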

Different Methods

There are several Tika constructors you can call, depending on which components you want to supply yourself.

  • Tika() — Uses default configuration and constructs the Tika class.
  • Tika(Detector detector) — Creates a Tika, accepting a detector instance as a parameter
  • Tika(Detector detector, Parser parser) — Creates a Tika, accepting a detector instance and a parser instance as parameters
  • Tika(Detector detector, Parser parser, Translator translator) — Creates a Tika, accepting detector, parser, and translator instances as parameters
  • Tika(TikaConfig config) — Creates a Tika, accepting the object of the TikaConfig class as the parameter

There are also other helpful methods within Tika that were not displayed above.

  • int getMaxStringLength () — Returns the maximum length of strings returned by the parseToString methods
  • String detect (InputStream stream, Metadata metadata) — Accepts an InputStream object and a Metadata object as parameters, detects the type of the given document, and returns it as a String
  • String translate (InputStream text, String targetLanguage) — Accepts an InputStream object and a String representing the language that the text will be translated to, translates the text to the desired language after attempting to auto-detect the source language, and returns that string

Metadata Class

The constructor for this class is simple: Metadata(). However, there are many different types of getters and setters you can use with the class as well. For example:

  • add (String name, String value) — Adds a value under the given metadata name, keeping any values already stored for that name
  • String get (Property property) — Returns the value of the given metadata property, or null if it has not been set

Apache Tika in Python

So far, all of Tika’s uses have been shown in Java. For people who already know Java, this should be no issue. However, many people, including myself, are more familiar with Python as a programming language, so now I will show how you can use Tika in Python.

A Python port of Apache Tika does exist, and it extracts text quickly, accurately, and simply, much like Tika in Java. It works on .pdf files, the most recent OOXML Microsoft Office file types, and older binary file formats such as .doc, .ppt, and .xls.

First, you need to install the tika-python library written by Chris Mattmann. This can be done via pip on the command line:

pip install tika

(To allow the library to launch the Tika REST server in the background, Java 7 or higher also needs to be installed.)

Check out the README for Chris Mattmann’s library for more details: https://github.com/chrismattmann/tika-python/blob/master/README.md

Book

There is actually an official Apache Tika book, a hands-on guide to content mining with the tool. Its many examples and case studies cover real-world areas ranging from search engines to digital asset management and scientific data processing. The book can be found here: https://www.manning.com/books/tika-in-action. It is a little pricey and I do not own it, but feel free to purchase it if you think it will be helpful to you.

Wrap-up

I hope these examples can help you apply Tika to your own projects and documents. Of course, these are not all the document types or methods you can use. There are a few resources online you can turn to if you have issues. The official Apache Tika site (https://tika.apache.org/) has news on updates, documentation, and a Getting Started guide.
