Extracting the Metadata from an IBM FileNet Document’s Content

Published in IBM Data Science in Practice

7 min read · Oct 30, 2017

Background

The metadata of a file is additional information that describes the document. For example, the metadata for an audio file might include the author/artist, release date, album, genre, and so on. When a user checks in a file to the IBM FileNet repository, he or she must explicitly add this information to the document’s properties; otherwise, the metadata stays hidden inside the document’s content element. If the available information is extracted automatically and added as properties, others can use it to search for the document or to run additional analytics.

Introduction

The IBM FileNet Content Engine lets users search for documents by the values of the documents’ properties. In addition, the Content Search capability lets users search for documents based on their contents. This article describes how to extract the metadata from a document’s content elements and make it available as property values on the document, which makes metadata-based searches possible. To extract the metadata, we use the open source Apache Tika™ toolkit.

Assumptions

This article assumes the reader is familiar with IBM FileNet Content Engine’s concepts of document class, properties, code module, event handler and subscriptions.

Procedure

The following steps automatically extract the metadata of a checked-in document’s content and populate the document’s properties using an Asynchronous Event Handler.

Step 1: Document Properties

Create the properties for the document class to hold the values of the extracted metadata content. You’ll create a property template and then assign it to the document class:

a) Create the property templates for the metadata to be captured. Consider distinguishing the metadata properties from the existing ones by giving the property templates a standard prefix, say “Meta_”.

For example: Create the property templates “Meta_author” and “Meta_title” to hold the author and title values from the metadata fields of the document.

b) Create a custom class and add these properties to the class. (You can also add them to an existing class definition, but using a custom class differentiates it from other classes.)

For example: Create a custom class “AutoExtractor” with the two property definitions described above.
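
The naming convention from Step 1 can be kept in one place so that the handler code and the property templates cannot drift apart. The following is a minimal sketch, not part of the original code module: the class name MetaPropertyNames and the method toPropertyName are hypothetical, and only the “Meta_” prefix comes from the article.

```java
import java.util.Locale;

/** Hypothetical helper: maps a metadata field name to a prefixed property name. */
final class MetaPropertyNames {
    /** Prefix suggested in Step 1 to tell metadata properties apart from others. */
    static final String PREFIX = "Meta_";

    private MetaPropertyNames() {}

    /** Trims the name, lower-cases it independently of the locale, and adds the prefix. */
    static String toPropertyName(String metadataName) {
        if (metadataName == null || metadataName.trim().isEmpty()) {
            throw new IllegalArgumentException("metadata name must not be blank");
        }
        return PREFIX + metadataName.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toPropertyName("Author")); // prints Meta_author
        System.out.println(toPropertyName("TITLE"));  // prints Meta_title
    }
}
```

Using Locale.ROOT is a deliberate choice in this sketch: a plain toLowerCase() depends on the server’s default locale, and in a Turkish locale “TITLE” would not lower-case to “title”.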

Step 2: Event Handler

Create an Event Handler using a Java class code module for extracting the required metadata of the document’s content element.

Code snippet from the code module for extracting the metadata:

/**
 * Import the following Apache Tika packages.
 * To extract the metadata from an mp3 file, these packages should be
 * sufficient. To extract the metadata from different types of files,
 * you'll need additional packages.
 **/
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
…

/**
 * In order to output log statements to the CE server trace, import
 * the HandlerCallContext and use the function "traceSummary()" as
 * required. Example below:
 * import com.filenet.api.engine.HandlerCallContext;
 * HandlerCallContext hcc = HandlerCallContext.getInstance();
 * hcc.traceSummary(" Custom log statement from Event Handler ");
 **/
public void onEvent(ObjectChangeEvent event, Id subId)
{
  // Retrieve the document object
  ObjectStore os = event.getObjectStore();
  Id id = event.get_SourceObjectId();

  PropertyFilter pf = new PropertyFilter();
  pf.addIncludeProperty(new FilterElement(null, null, null,
                            PropertyNames.CONTENT_ELEMENTS, null));
  Document doc = Factory.Document.fetchInstance(os, id, pf);

  // Retrieve the content elements and work on the first content
  // element. Can be extended to other content elements as desired.
  ContentElementList contElemList = doc.get_ContentElements();
  int nContentElements = contElemList.size();
  if (nContentElements > 0)
  {
    Iterator iter = contElemList.iterator();
    if (iter.hasNext())  // consider only the first content element
    {
      ContentTransfer ct = (ContentTransfer) iter.next();

      // Extraction of metadata using Apache Tika Parser.
      Parser myParser = new AutoDetectParser();
      BodyContentHandler myHandler = new BodyContentHandler();
      Metadata myMetadata = new Metadata();
      ParseContext myContext = new ParseContext();

      // TODO: Throws IOException, SAXException, TikaException.
      // Handle them as required.
      myParser.parse(ct.accessContentStream(), myHandler,
                     myMetadata, myContext);

      // Retrieve the metadata of the document content.
      String[] metadataNames = myMetadata.names();

      // Return document properties.
      com.filenet.api.property.Properties props = doc.getProperties();

      // Iterate through the metadata names and
      // set the value for the predefined document properties.
      for (String name : metadataNames)
      {
        if (name.equalsIgnoreCase(Metadata.TITLE) ||
            name.equalsIgnoreCase(Metadata.AUTHOR))
        {
          String value = myMetadata.get(name);
          String strPropTemplateName = "Meta_" + name.toLowerCase();
          props.putValue(strPropTemplateName, value);
        }
      }
      doc.save(RefreshMode.NO_REFRESH);
    }
  }
}

When creating the code module, check in the compiled .class file for the handler above together with all the library (JAR) files it depends on.
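
The selection step inside the handler (keep only the wanted fields, match names case-insensitively, and map them to the prefixed property names) can be exercised without a Content Engine or Tika on the classpath. The sketch below is an illustration under stated assumptions, not the article’s code: MetadataSelector is a hypothetical class, a plain Map stands in for Tika’s Metadata object, and the sample values are made up.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, library-free sketch of the handler's filtering step. */
final class MetadataSelector {
    // Case-insensitive set, so "Author", "author" and "AUTHOR" all match.
    private final Set<String> wanted = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);

    MetadataSelector(String... wantedNames) {
        Collections.addAll(wanted, wantedNames);
    }

    /**
     * Keeps only the wanted fields that have a non-empty value and returns
     * them keyed by "Meta_" plus the lower-cased field name.
     */
    Map<String, String> select(Map<String, String> extracted) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String value = entry.getValue();
            if (wanted.contains(entry.getKey()) && value != null && !value.isEmpty()) {
                result.put("Meta_" + entry.getKey().toLowerCase(Locale.ROOT), value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for what a parser might return.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("Author", "Some Artist");
        extracted.put("title", "Some Song");
        extracted.put("genre", "Rock");
        System.out.println(new MetadataSelector("title", "author").select(extracted));
        // prints {Meta_author=Some Artist, Meta_title=Some Song}
    }
}
```

Unlike the handler above, this sketch skips empty values rather than writing them to the property; whether that is desirable depends on how the properties are used in searches.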

The code module and the Event Action details are below:

Fig 5: Event Action created using the Code Module

Step 3: Subscription

Create a new subscription for the “checkin” event on the custom document class created in Step 1, specifying the Event Handler created in Step 2.

Details of the Subscription are as below:

Fig 7: Subscribed Event — “Checkin Event”

Step 4: Check in the document

Check in a document with a content element against the custom class created in Step 1. (The example handler currently covers audio files, specifically mp3.)

Details of the checkin:

Fig 8: Document checked in against custom class

Fig 9: Content Element of the checked in Document Class

Fig 10: Custom properties not filled in while checking in the document

Fig 12: Document Checked-in status:successful

Step 5: Verifying Document properties

Once the document is checked in, the (asynchronous) Event Handler automatically extracts the metadata fields “Author” and “Title” of the audio file and writes them to the document properties “Meta_author” and “Meta_title” respectively.

In our example, after waiting for a few seconds, you can see that the values of the two properties are filled in from the metadata.
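
Because the handler runs asynchronously, an automated check cannot read the properties immediately after check-in; it has to wait and retry. The following is a minimal sketch of such a wait loop. PropertyPoller, ValueReader and waitForValue are hypothetical names, and the code that actually re-fetches the property from the object store is assumed to be supplied by the caller and is not shown.

```java
/** Hypothetical helper for waiting on a value that an asynchronous job fills in later. */
final class PropertyPoller {
    /** Caller-supplied lookup, for example a re-fetch of "Meta_author" (not shown). */
    interface ValueReader {
        String read();
    }

    private PropertyPoller() {}

    /**
     * Calls the reader until it returns a non-empty string or the timeout
     * elapses. Returns the value, or null on timeout or interruption.
     */
    static String waitForValue(ValueReader reader, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            String value = reader.read();
            if (value != null && !value.isEmpty()) {
                return value;
            }
            if (System.nanoTime() - deadline >= 0) {
                return null; // gave up: the property was never filled in
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return null;
            }
        }
    }
}
```

A test client might call waitForValue with a timeout of several seconds and an interval of a few hundred milliseconds, and treat a null result as a sign that the subscription or handler is not working.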

Fig 13: Document properties values filled in automatically

FAQ:

Q. Can we have the metadata properties automatically created instead of pre-defining them?
You can create new metadata properties from a code module, but the Content Engine doesn’t support creating and editing them in the same transaction in which the code module runs. For that reason, the approach here relies on a predefined set of properties. Also, a single file can carry many metadata fields, and creating a property for each of them would be impractical, so defining in advance only the ones you need makes more sense. A business analyst can determine which types of documents will be checked in and which metadata matters for them, and the administrator can then define the corresponding custom properties on the class. You can use import/export to copy the definitions to another object store.

Q. How can we handle multiple content elements of the document?
The example above handles only two properties of a single content element. You can extend it by giving the predefined metadata properties a cardinality of “list”. Then, for each content element, the metadata values can be retrieved and appended to the corresponding list properties.
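
The aggregation described in this answer can be sketched independently of FileNet and Tika. In the sketch below, which is an illustration rather than the article’s implementation, each content element’s metadata is represented by a plain Map, and the class MultiElementMetadata and its collect method are hypothetical. Writing the resulting lists into list-valued properties is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: gathers per-element metadata into one list per property. */
final class MultiElementMetadata {
    private MultiElementMetadata() {}

    /**
     * perElement holds one name-to-value map per content element, in element
     * order. For each wanted name the result has one entry per element; a
     * missing value becomes "" so that index i always refers to element i.
     */
    static Map<String, List<String>> collect(List<Map<String, String>> perElement,
                                             List<String> wantedNames) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String name : wantedNames) {
            List<String> values = new ArrayList<>();
            for (Map<String, String> metadata : perElement) {
                String value = metadata.get(name);
                values.add(value == null ? "" : value);
            }
            result.put("Meta_" + name.toLowerCase(Locale.ROOT), values);
        }
        return result;
    }
}
```

Padding missing values with an empty string is a design choice of this sketch that keeps the lists aligned with the content elements; whether the target list properties accept empty strings should be verified before relying on it.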

Q. Can we get the metadata for different file types?
To extract the metadata of an audio (mp3) file, the tika-core (tika-core-1.14.jar) and tika-parsers (tika-parsers-1.14.jar) libraries are enough. For other types of documents, you’ll need other jars.

Tip: You can develop the custom code module as a Maven project by adding the dependencies below. Maven then pulls in all the dependent jars automatically, so you can identify them and add them to the code module.

<dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-core</artifactId>
   <version>1.14</version>
</dependency>
<dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.14</version>
</dependency>

Q. What version of Java is needed?
At the time of writing, the latest Apache Tika libraries are built using Java 1.7. The application server that hosts the Content Engine must therefore run a compatible Java version, that is, 1.7.
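
One way to confirm which Java version a JVM is running is to read the standard java.specification.version system property, which is “1.7” or “1.8” on older runtimes and “9”, “11”, and so on for later ones. The sketch below is only an illustration: JavaVersionCheck and its methods are hypothetical names, and where such a check would run (for example, logged from the handler) is left to the reader.

```java
/** Hypothetical helper: checks the running JVM against a minimum Java version. */
final class JavaVersionCheck {
    private JavaVersionCheck() {}

    /** Parses "1.7", "1.8", "9" or "17" style specification versions into 7, 8, 9 or 17. */
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.trim().split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Versions up to Java 8 are reported as "1.x"; later ones as "x".
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    /** True if the current JVM reports at least the required major version. */
    static boolean isAtLeast(int requiredMajor) {
        return majorVersion(System.getProperty("java.specification.version")) >= requiredMajor;
    }

    public static void main(String[] args) {
        System.out.println("Java 7 or later: " + isAtLeast(7));
    }
}
```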

Bibliography:
https://tika.apache.org/
Apache Tika Toolkit: The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.