Writing Kettle Plugins: Splunk

Matt Casters
Neo4j Developer Blog
4 min read · Oct 30, 2019

Dear Neo4j and Kettle friends,

Once in a while you come across a data source for which your favorite data orchestration platform doesn’t have a connector. This happens with all tools and platforms because technology keeps advancing. That is why, in the early days of Kettle, I decided to implement various ways for users to create plugins. These Kettle plugin systems make it possible to create steps, job entries, encryption and compression methods, database connection types, and even data types. Right now there are 25 different plugin types, and you can explore them all in the plugin browser, which you can open via the “Show plugin information” entry in the Tools menu in Spoon.

The way that Kettle plugins are implemented is always the same:

  • Write a Java class that implements a plugin interface.
  • Package the class with all other classes, images and resources you need in a jar file.
  • Put the plugin jar file in a folder under plugins/ in your Kettle data-integration/ installation.
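For example, assuming the plugin jar is named kettle-splunk.jar (the actual file name depends on your build), the resulting layout looks something like this:

data-integration/
  plugins/
    kettle-splunk/
      kettle-splunk.jar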

So a few weeks ago, one of our Neo4j customers asked us if it was possible to read some data from their Splunk server so that they could load it into Neo4j using Kettle.

Splunk Input icon and dialog screen shots

The plugin

The first thing to do when writing a plugin for Kettle is to figure out how we access the data. Sometimes the issue is dealing with specific file formats. In other instances, we need to access particular web services. In most cases though, major technologies provide Java libraries. For Splunk, we can use a standard Java library conveniently available through a Maven dependency which we can add to our project:

<dependency>
  <groupId>com.splunk</groupId>
  <artifactId>splunk</artifactId>
  <version>1.6.5.0</version>
</dependency>
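One caveat: at the time of writing, the Splunk SDK artifacts were published to Splunk’s own repository rather than Maven Central, so you may also need a repository entry along these lines (check the splunk-sdk-java documentation for the current URL):

<repositories>
  <repository>
    <id>splunk-artifactory</id>
    <name>Splunk Releases</name>
    <url>https://splunk.jfrog.io/splunk/ext-releases-local</url>
  </repository>
</repositories>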

The interface to implement to create a Kettle step plugin is org.pentaho.di.trans.step.StepMetaInterface and we annotate the class with a Step annotation:

@Step(
  id = "KettleSplunkInput",
  name = "Splunk Input",
  description = "Read data from Splunk",
  image = "splunk.svg",
  categoryDescription = "Input"
)
@InjectionSupported( localizationPrefix = "Splunk.Injection.", groups = { "PARAMETERS", "RETURNS" } )
public class SplunkInputMeta extends BaseStepMeta implements StepMetaInterface {

You can see the whole class in the kettle-splunk project on GitHub.

This class, recognized as a plugin through the @Step annotation, is the starting point for all the plugin components. The Meta class takes care of serializing metadata (XML and repository) and of telling the outside world which classes are responsible for handling the workload and for the dialog used to edit the metadata. We’re using the convenient BaseStepMeta base class to make sure we’re not wasting time implementing generic functionality.
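To give an idea, here is a minimal sketch of the kind of methods SplunkInputMeta provides, assuming a single hypothetical query field (the real class handles many more options):

// Serialize the step metadata to XML...
public String getXML() {
  return XMLHandler.addTagValue( "query", query );
}

// ...and read it back from XML.
public void loadXML( Node stepnode, List<DatabaseMeta> databases, IMetaStore metaStore ) {
  query = XMLHandler.getTagValue( stepnode, "query" );
}

// Point Kettle to the class that does the actual work...
public StepInterface getStep( StepMeta stepMeta, StepDataInterface stepDataInterface,
    int copyNr, TransMeta transMeta, Trans trans ) {
  return new SplunkInput( stepMeta, stepDataInterface, copyNr, transMeta, trans );
}

// ...and to the class that holds the runtime state.
public StepDataInterface getStepData() {
  return new SplunkInputData();
}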

In the worker class, in our case SplunkInput.java, we take care of the actual work: connecting to Splunk, executing a query and reading the results. There are 3 main methods to look at: init(), processRow() and dispose(). The code is really simple and straightforward, even though it took us a while to figure out how to make Splunk give us all the rows of a query and not just the first couple of hundred (pro tip: use JobResultsArgs.setCount(0)).
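The essence of that logic looks something like this simplified sketch using the Splunk SDK (the connection parameters and query are made up, not taken from the plugin):

import com.splunk.*;

import java.io.InputStream;

public class SplunkReadSketch {
  public static void main( String[] args ) throws Exception {
    // Connect to the Splunk REST API (hypothetical credentials)
    ServiceArgs loginArgs = new ServiceArgs();
    loginArgs.setHost( "localhost" );
    loginArgs.setPort( 8089 );
    loginArgs.setUsername( "admin" );
    loginArgs.setPassword( "changeme" );
    Service service = Service.connect( loginArgs );

    // Create a search job and wait for it to finish
    Job job = service.getJobs().create( "search index=_internal | head 1000" );
    while ( !job.isDone() ) {
      Thread.sleep( 500 );
    }

    // Count 0 removes the default row limit: we get ALL the rows
    JobResultsArgs resultsArgs = new JobResultsArgs();
    resultsArgs.setCount( 0 );

    // Read the results event by event
    InputStream resultsStream = job.getResults( resultsArgs );
    ResultsReaderXml reader = new ResultsReaderXml( resultsStream );
    Event event;
    while ( ( event = reader.getNextEvent() ) != null ) {
      // In the plugin this is where the key/value pairs
      // become a Kettle output row
      System.out.println( event );
    }
    reader.close();
  }
}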

The way we got over this hurdle was by working on-site with a test Splunk instance. We did 5 iterations to fix small user interface problems and the query results limitation. All in all, that took only a few hours, which shows how easy it is to get quick results with an agile approach.

MetaStore

To handle the Splunk connection information we want a reusable object. This prevents us from having to retype the same information in every Splunk Input (and maybe later Output?) step. So we need a dialog like this:

A parameterised Splunk connection

The way MetaStore objects are defined is by creating a simple Java bean (POJO) called SplunkConnection:

@MetaStoreElementType(
  name = "Splunk Connection",
  description = "This element describes how you can connect to Splunk"
)
public class SplunkConnection extends Variables {

  private String name;

  @MetaStoreAttribute
  private String hostname;

  @MetaStoreAttribute
  private String port;

  @MetaStoreAttribute
  private String username;

  @MetaStoreAttribute( password = true )
  private String password;

  ...

The annotations define the top level MetaStore element type (SplunkConnection) and the attributes in it. The password = true option takes care of automatically encrypting or obfuscating the password when it’s serialized or expressed in any form (XML, JSON, database, …).

Once we have this class we can easily load or save instances using a MetaStoreFactory:

MetaStoreFactory<SplunkConnection> factory =
  new MetaStoreFactory<SplunkConnection>( SplunkConnection.class, metaStore, PentahoDefaults.NAMESPACE );

// Load a connection
//
SplunkConnection connection = factory.loadElement( "Splunk" );

// Save a connection
//
factory.saveElement( connection );
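The same factory can also list the names of the stored connections, for example to populate the connection combo box in the step dialog (a small sketch, assuming the factory created above):

// List the names of all Splunk connections in the MetaStore
List<String> connectionNames = factory.getElementNames();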

As you can see, this is simple and convenient code that takes the pain out of working with centrally defined metadata.

Project

The complete kettle-splunk project is available on my GitHub account, and I very much welcome feedback and contributions in any form. Perhaps we can add features to convert the String data we get from Splunk into other Kettle data types?
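For example, such a conversion could build on Kettle’s ValueMeta classes; a rough sketch of the idea (the field name and value are made up):

// Describe the String value as it arrives from Splunk...
ValueMetaInterface stringMeta = new ValueMetaString( "bytes" );

// ...and the Integer value we want in the Kettle row
ValueMetaInterface integerMeta = new ValueMetaInteger( "bytes" );

// Convert the String "12345" to a Kettle Integer (Long) value
Long bytes = (Long) integerMeta.convertData( stringMeta, "12345" );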

Give the current release a try and let us know how it works for you!

Cheers,
Matt


Neo4j Chief Solutions Architect, Kettle Project Founder