Apache Tika: What is it and why should I use it?

Simon Li
9 min read · Jun 14, 2019

Introduction

As technology takes up more of our lives, it is changing things that have been with us for ages. One of these is how we present text and language. For almost all of human history, we have used some system to communicate with each other, and over time these systems evolved into the languages we know today. Language became essential not only in speech but also visually, through text and writing. From paintings on cave walls to hieroglyphs to today’s written languages, humans have always had a visual way to relay information to each other.

Nowadays, with the exponential growth of information, many people are developing ways to tackle and understand the data we encounter every day. Yet much of this information does not come as numbers you can analyze directly; much of it comes as text. The need to process large amounts of text is growing and is now a large part of data science. In this article, I will explore a tool for processing text called Apache Tika.

According to their site, “The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.” Apache Tika is a content detection and analysis framework written in Java and stewarded by the Apache Software Foundation. It provides a Java library, and it also has server and command-line tools that make it usable from other programming languages. Tika relies on various existing document parsers and document type detection techniques to detect and extract data. With it, you can build a universal type detector and content extractor that pulls both structured text and metadata from different types of documents, including spreadsheets, text documents, images, PDFs, and even some multimedia formats.

Tika has many applications, but two are especially prominent. The first is search engines: after the initial crawl, a search engine can use Tika to extract text and metadata from the pages it found. The second is document analysis: Tika can help classify documents based on their most prominent terms.

Tika is used by many organizations to analyze massive amounts of content, including financial institutions such as Goldman Sachs, as well as NASA, academic researchers, and major content management systems. It was also one of the key technological tools used by more than 400 journalists to process and analyze the 11.5 million leaked documents that eventually exposed an international scandal involving many world leaders storing their money in offshore shell corporations. The leaked papers, and the effort to investigate them, are better known as the Panama Papers.

Objectives

After this article, I hope you will have a better idea of what Apache Tika is, its many uses, and how it is used.

Architecture of Tika

Tika has four main modules. The first is the language detection mechanism: when you give Tika a text document, it can detect the document’s language using a class called LanguageIdentifier. The second is the MIME detection mechanism, which detects the format of the document and reports it as a Multipurpose Internet Mail Extensions (MIME) type. The third is the Parser interface, which extracts the text and metadata and delivers them to the user through whichever parser plugins the user specifies. Finally, the Tika facade class follows the facade design pattern and is the simplest way to call Tika directly from Java code.

Below are some of the features when using Apache Tika.

  • Unified parser interface − Tika wraps different third-party parser libraries behind a single parser interface. With this feature, the user no longer needs to select the correct parser library for each file type.
  • Low memory usage − Tika consumes few memory resources, so it is easily embeddable in Java applications.
  • Fast processing − Applications can expect quick content detection and extraction.
  • Flexible metadata − Tika can comprehend all metadata models that are used to describe files.
  • Parser integration − Tika can use the various parser libraries available for each document type in a single application.
  • MIME type detection − Tika can detect and extract content from all the media types included in the MIME standards.
  • Language detection − Tika includes a language identification feature, so documents can be sorted or processed according to their language.

Users can embed Tika in their applications using the Tika facade class. It has methods to explore all the functionalities of Tika. Since it is a facade class, Tika abstracts the complexity behind its functions. In addition to this, users can also use the various classes of Tika in their applications.

The Tika class has a few constructors that can be used (for parameters and descriptions, see https://www.tutorialspoint.com/tika/tika_referenced_api.htm).

Tika can process a large number of file types, such as XML, HTML, Microsoft Office files, OpenDocument Format, PDF, digital books, Rich Text, and even Java class files and JAR files.

The Parser

The Parser API is one of the most important parts of Tika. It hides the complexity of parsing operations and makes parsing much easier for the user. Tika relies on a single method for this, parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context), which takes four parameters:

  • stream — the InputStream created from the document that will be parsed
  • handler — the ContentHandler that receives a sequence of XHTML SAX events parsed from the input document (this handler will process events and export the result)
  • metadata — the Metadata that describes metadata properties in and out of the parser
  • context — the ParseContext instance that carries context-specific information (can be used to customize the parsing process)

parse throws an IOException if the input stream cannot be read, a TikaException if the document stream cannot be parsed, and a SAXException if the handler cannot process an event. While parsing, Tika tries to reuse existing parser libraries; as a result, most of the implementation classes are just adapters to such libraries.
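To make the role of the handler parameter more concrete, here is a minimal sketch that uses only the JDK, not Tika itself. The class TextCollectingHandler is a hypothetical name of my own; it collects the character data from a stream of SAX events, which is roughly the job that handlers such as Tika’s BodyContentHandler perform. The JDK’s own SAX parser stands in for a Tika parser here, so treat this as an illustration of the event flow rather than of Tika’s implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): gathers the text of all SAX character events.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one text node in several chunks, so always append.
        text.append(ch, start, length);
    }

    String text() {
        return text.toString().trim();
    }

    // Feeds an XHTML string through the JDK's SAX parser, which plays the role
    // that a Tika parser would play when it emits XHTML SAX events.
    static String extract(String xhtml) {
        TextCollectingHandler handler = new TextCollectingHandler();
        try (InputStream stream =
                 new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8))) {
            SAXParserFactory.newInstance().newSAXParser().parse(stream, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><p>Hello, Tika</p></body></html>"));
    }
}
```

The handler never sees the original file format. It only sees events, which is why the same handler can be reused for any document type a parser supports.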

Auto-Detection

Multipurpose Internet Mail Extensions (MIME) standards are the best available standards for identifying document types. Browsers rely on these standards to decide how to handle the content they receive.

Whenever a browser encounters a media file, it chooses compatible software available to it to display the contents. If it has no suitable application for a particular media file, it prompts the user to get a suitable plugin.

Type Detection in Tika

Tika supports all the types provided in MIME. Whenever a file is passed through Tika, it detects the type of a document as well as the language based on the document itself rather than on additional information.

Document types are detected through a single method, detect, which is defined by the Detector interface and provided by every class that implements it. Its signature is MediaType detect(InputStream input, Metadata metadata) throws IOException.

One of the easiest ways to determine a file’s type is to look at its file extension. However, when you retrieve a file from a database or attach it to another document, you might lose the file’s name and/or extension. In these cases, the metadata supplied with the document can be used to detect the file type.

The detect method takes a document along with its metadata as input and returns a MediaType object that describes the type of the document, or gives a best guess. However, metadata is not the only thing the detector relies on. It can also use magic bytes, a special pattern near the beginning of a file that designates the file’s format. If that does not work, Tika also looks at character encodings or XML root elements (if the file is XML). When using Tika, you do not need to worry much about this, as it is automated.

Tika can also delegate the detection to a more suitable detector, since the algorithm used by the detector is implementation dependent. For example, the default detector will look at the magic bytes first, then metadata information and if the content type still has not been figured out, it will use the service loader to try all available detectors.
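The detection order described above (magic bytes first, then hints such as the file name) can be sketched with a few lines of plain Java. This is a simplified illustration and not Tika’s detector: the class name SimpleTypeDetector and its three-entry signature table are assumptions made for the example, while the byte signatures themselves are the well-known ones for PDF, PNG, and ZIP files.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical detector (not Tika's): magic bytes first, file name as a fallback.
class SimpleTypeDetector {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    static String detect(byte[] head, String fileName) {
        // Step 1: compare the first bytes of the content with known signatures.
        if (startsWith(head, PDF_MAGIC)) return "application/pdf";
        if (startsWith(head, PNG_MAGIC)) return "image/png";
        if (startsWith(head, ZIP_MAGIC)) return "application/zip";

        // Step 2: fall back to the name hint, if one survived.
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".pdf")) return "application/pdf";
            if (lower.endsWith(".png")) return "image/png";
            if (lower.endsWith(".txt")) return "text/plain";
        }

        // Step 3: nothing matched, so report the generic binary type.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        // The misleading ".bin" name is ignored because the magic bytes match first.
        System.out.println(detect(head, "report.bin"));
    }
}
```

Checking the content before the name is what lets a detector recover from a missing or wrong extension, which is exactly the situation described above for files pulled from a database.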

Language Detection

Tika can also identify the language of a document, even without help from metadata. In older releases of Tika, the language of a document was discovered using a LanguageIdentifier instance, but LanguageIdentifier has since been deprecated in favor of web services, something that is not completely clear in the Getting Started docs on the official site. Language detection now uses subtypes of the abstract class LanguageDetector. You can also incorporate web services, such as Google Translate or Microsoft Translator, for larger translation tasks.

Of the 184 languages standardized by ISO 639-1, Tika is able to detect 18. With the older API, this is done using the getLanguage method of the LanguageIdentifier class, which returns the language code as a String. Below is the list of the 18 language-code pairs detected by Tika:

  • da — Danish
  • de — German
  • et — Estonian
  • el — Greek
  • en — English
  • es — Spanish
  • fi — Finnish
  • fr — French
  • hu — Hungarian
  • is — Icelandic
  • it — Italian
  • nl — Dutch
  • no — Norwegian
  • pl — Polish
  • pt — Portuguese
  • ru — Russian
  • sv — Swedish
  • th — Thai
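Since the detector hands back only a two-letter code, you will often want to turn it into a readable name. The small sketch below does that with the JDK’s Locale class, so it needs no Tika dependency. The class name LanguageNames is my own, and the snippet assumes the code it receives is one of the ISO 639-1 codes listed above.

```java
import java.util.Locale;

// Hypothetical helper (not part of Tika): maps an ISO 639-1 code to an English name.
class LanguageNames {
    static String displayName(String isoCode) {
        // Locale understands two-letter language codes; ask for the English name.
        return Locale.forLanguageTag(isoCode).getDisplayLanguage(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        for (String code : new String[] {"da", "de", "el", "th"}) {
            System.out.println(code + " -> " + displayName(code));
        }
    }
}
```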

Content Extraction

Tika uses different kinds of parser libraries for content extraction and it chooses the right parser after deciding the type of the given document.

When parsing documents, the parseToString method of the facade class is generally used. Below is a quick abstract description of the parsing process:

  • First, when we pass a document to Tika, it uses a suitable type detection mechanism, as described earlier, to detect the document type.
  • Next, once the document type is known, Tika chooses a suitable parser from its parser repository. The parser repository holds classes that make use of external libraries.
  • Then, the document is passed to the parser, which parses the content and extracts the text and metadata, throwing an exception if the format is unreadable.
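The three steps above can be sketched as a tiny pipeline in plain Java. Everything here is a stand-in made up for illustration: MiniExtractor is a hypothetical class, its detection rule is deliberately naive, and its two toy parsers replace the external libraries that Tika’s real parsers wrap. Only the overall shape (detect, look up a parser, parse or fail) mirrors the description.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical pipeline (not Tika's code): detect type, pick parser, extract text.
class MiniExtractor {
    // Parser repository: media type -> parser. Both parsers are toy implementations.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    MiniExtractor() {
        parsers.put("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        parsers.put("text/html", bytes ->
            new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", "").trim());
    }

    // Step 1: detect the type from the first bytes (a deliberately naive rule).
    static String detectType(byte[] content) {
        String head = new String(content, 0, Math.min(content.length, 64),
                StandardCharsets.UTF_8).trim().toLowerCase(Locale.ROOT);
        if (head.startsWith("<html") || head.startsWith("<!doctype html")) return "text/html";
        if (head.startsWith("%pdf-")) return "application/pdf";
        return "text/plain";
    }

    String parseToString(byte[] content) {
        String type = detectType(content);
        // Step 2: choose the parser registered for the detected type.
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            // Step 3 (failure case): no parser can read this format.
            throw new UnsupportedOperationException("No parser registered for " + type);
        }
        // Step 3 (success case): let the parser extract the text.
        return parser.apply(content);
    }

    public static void main(String[] args) {
        byte[] page = "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(new MiniExtractor().parseToString(page));
    }
}
```

The caller only ever touches parseToString, which is the point of the facade: detection and parser selection stay hidden behind one method.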

Metadata Extraction

Besides content, Tika can also extract the metadata from a file. I have mentioned metadata a few times, and you may be wondering: what even is metadata? Metadata is simply the additional information about a file that comes with the file itself. For example, take a song or an audio file. Its metadata would consist of things such as the artist name, album name, and title, and Tika can extract this type of information from documents.

The Extensible Metadata Platform (XMP) is a standard for processing and storing information related to the content of a file. XMP comprises standards for defining, creating, and processing metadata for different kinds of documents.

When using Tika, you can simply call metadata.names() to get the names of all the metadata fields in the file. However, you need a Metadata object to call the names method on. You get this object through the parse method described above: one of its parameters is a Metadata object that holds the metadata once parse completes. You can also use Tika to add metadata or set its values. The add method of the Metadata class adds a new value, and the set method replaces an existing value.
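The difference between adding and setting is easier to see in code. Tika’s Metadata class can hold several values under one name, so the sketch below models that with a map of lists. SimpleMetadata is a hypothetical stand-in written with the JDK only. Its method names follow the ones mentioned in this article, but the implementation is my own simplification, not Tika’s.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata container (not Tika's class).
class SimpleMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // add keeps whatever is already stored and appends one more value.
    void add(String name, String value) {
        values.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }

    // set throws away any existing values and stores exactly one.
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    // Returns the first value for the name, or null if the name is unknown.
    String get(String name) {
        List<String> stored = values.get(name);
        return stored == null ? null : stored.get(0);
    }

    String[] getValues(String name) {
        return values.getOrDefault(name, Collections.emptyList()).toArray(new String[0]);
    }

    String[] names() {
        return values.keySet().toArray(new String[0]);
    }

    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata();
        metadata.add("Author", "Ada");
        metadata.add("Author", "Grace");   // two authors are now stored
        metadata.set("Author", "Linus");   // both are replaced by one value
        System.out.println(String.join(", ", metadata.getValues("Author")));
    }
}
```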

Tika GUI

Tika also comes with a graphical user interface (GUI) that the user can use. It can be found in the “gui” folder after Tika is installed and looks something like this:

How can we make use of the Tika GUI? On the GUI, click open, browse and select the file that is to be extracted (or drag it onto the whitespace of the window). Tika extracts the content of the files and displays it in five different formats: visualized metadata, formatted text, plain text, main content, and structured text. You can choose whichever format you want.

The following illustration shows what Tika can do. When we drop the image on the GUI, Tika extracts and displays its metadata:

Conclusion

Apache Tika is a very useful tool for processing documents. Much of the complicated underlying machinery is wrapped in easy-to-use methods, which is what makes Tika such a valuable tool. Hopefully this article gave you a better understanding of what Apache Tika is, its prominent uses, and how it could be used for your own documents. I have also written another article giving more Java code examples on how to use Tika. To see it, click this link:
