Integrating Splunk Cloud Reports with Matillion ETL and Snowflake

Keegan Hudson
Hashmap, an NTT DATA Company
Feb 16, 2022 · 9 min read

If you search through or monitor large amounts of log files, you might have heard of Splunk. Splunk stores data in its own indexes rather than relying on a separate database, which makes it a popular choice for analytics. Commonly used to generate graphs, reports, and other visualizations, Splunk logs can give valuable insight into your data. Splunk Enterprise offers an on-prem solution, but this post focuses on the Splunk Cloud Platform.

A client challenged me to integrate a dataset from Splunk Cloud reports using Matillion ETL and the Snowflake Data Cloud. The use case was to land user app login data for further analysis down the road. The dataset was very large, and the client also wanted new data added daily. Being able to automatically extract Splunk reports and leverage Matillion’s transform and load capabilities with Snowflake is a no-brainer. Splunk and Matillion each offer a database connector with built-in Snowflake integration; however, there are few examples available online of how to integrate these technologies.

In this blog post, I will walk you through a step-by-step guide to integrating a dataset from Splunk Cloud reports using Matillion ETL and Snowflake, working with OKTA user data. This OKTA dataset captures user login information and generates millions of events a week. Going back through years’ worth of historical data can be a daunting task; in this case, there is well over a terabyte of log data.

With larger datasets, Splunk can take a long time to generate a report or run into timeout errors, which is the biggest challenge this project had to tackle. In these cases, loading both historical and current data in the same report is not a viable strategy, so an incremental load is required: the historical data is carried over from Splunk first, then any newly generated reports are incrementally loaded onto the existing data.

So, let’s get started!

Steps needed:

  1. Contact the Splunk vendor to whitelist the IP of your Matillion server
  2. Export historical data from Splunk to a file
  3. Load the historical data file to Snowflake
  4. Create and schedule a daily report in Splunk
  5. Create a Matillion job to load the report to Snowflake
  6. Schedule the Matillion job

1. Contact the Splunk vendor to whitelist the IP of your Matillion server

For Splunk Cloud, the vendor will need to whitelist your Matillion server(s) in their firewall. I suggest contacting Splunk and opening a ticket first, as this may take several days.

  • Splunk ticket service
  • This ticket will need to be submitted by someone with authorization to the Splunk ticket service, most likely an admin account
  • Splunk will eventually ask you to provide the IP address(es) of your Matillion server(s) to whitelist
  • Splunk will then add the provided IP addresses to their network firewall
  • You should now have access to Splunk from Matillion

2. Export historical data from Splunk to a file

First, you need to create a search statement in Splunk and download the results. The Splunk Cloud browser-based interface allows you to enter a search statement to sort through indexed data or previously saved reports. You can find more information on Splunk search commands and syntax in Splunk’s documentation.

If you are working with considerably large datasets, you may need to break the searches down into smaller chunks. Running a single search across a large dataset can either take a considerable amount of time or bog down the rest of your Splunk system. If this is the case, you will need to bulk load the historical data in pieces, as sketched below.
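One simple way to chunk the work is to bound each search with Splunk’s earliest and latest time modifiers and run one search per month (or week), exporting each result before moving on. This is a rough sketch rather than the project’s exact searches; the host and time window are placeholders, written in Splunk’s default MM/DD/YYYY:HH:MM:SS time format:

host="example.okta.com" earliest="01/01/2021:00:00:00" latest="02/01/2021:00:00:00" | spath displayMessage | search displayMessage="User single sign on to app" | table _time, displayMessage

Move the earliest/latest window forward one month at a time, export each run, and you end up with a set of files that can be bulk loaded together in the next step.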

  • Enter your search into Splunk:
host="example.okta.com" | spath "target{}.alternateId" | search "target{}.alternateId"="Example Company name" | spath displayMessage | search displayMessage="User single sign on to app"
  • I suggest appending the “table” command to your search to create a human-readable table of the fields you want Splunk to parse out. The syntax is “table <wc-field-list>”, where <wc-field-list> is a list of field names from the source. This formats the data better when exporting to a CSV file; otherwise, you can choose the data format when exporting.
host="example.okta.com" | spath "target{}.alternateId" | search "target{}.alternateId"="Example Company name" | spath displayMessage | search displayMessage="User single sign on to app" | table _time, action, actor.displayName, displayMessage, eventType, user, user_id
  • Before starting the search, we need to change the default date range from the last 30 minutes to the range we actually want. In this example, I set the range to a 24-hour period. Change the date range by clicking on the time frame next to the search box.
  • Click on one of the options from the drop-down menu, configure the date and time needed, and click Apply. In this example I am using one day, so I chose the “Date Range” tab.
  • Depending on the size of the data you’re searching through, this search may take several hours to complete.
  • After the search completes, click on the Export button (download symbol).
  • This will open a dialog box where you can choose the format (CSV, JSON, or XML), name the file, and optionally limit the results. Clicking on “Export” will start the download in your browser.

3. Load the historical data file to Snowflake

Now that you have your search results downloaded from Splunk, you can load the historical data into Snowflake. You could simply upload the data directly to Snowflake, but due to client constraints and the need to bulk load multiple files, I used an S3 bucket and Matillion.
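If you want to see what the bulk load boils down to, or prefer to run it by hand, a minimal Snowflake SQL sketch is shown below. The stage, file format, credentials, and table names are placeholders rather than the client’s actual objects, and the target table is assumed to exist already:

-- File format matching the CSV exported from Splunk (placeholder name)
CREATE FILE FORMAT IF NOT EXISTS splunk_csv TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"';
-- External stage pointing at the S3 bucket holding the exported files
CREATE STAGE IF NOT EXISTS splunk_stage URL = 's3://my-bucket/splunk-exports/' CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...') FILE_FORMAT = splunk_csv;
-- Bulk load every exported CSV into the historical table
COPY INTO okta_login_history FROM @splunk_stage PATTERN = '.*[.]csv' FILE_FORMAT = (FORMAT_NAME = splunk_csv);

Matillion’s S3 Load component performs an equivalent COPY for you; you point it at the bucket, target table, and file format in its properties.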

  • Example of using Matillion’s S3 Load component.

4. Create and schedule a daily report in Splunk

Now that you have the historical data loaded, you will need to set up a scheduled report in Splunk for the current data. In this example, the report is generated each morning from the previous day’s data. Once you have a search that you want to use with Matillion, save it as a report in Splunk Cloud. I suggest specifying a time range, as Matillion could otherwise run into a timeout error.

  • Run a search for a given timeframe, in this case, “Yesterday”.
  • Make sure the search finishes before saving. Click on Save As, and select Report from the drop-down menu.
  • Enter the desired name for the report in the pop-up menu and click save.
  • Schedule the report to run, in this case on a daily interval. With the report open, click on “Edit”, then from the drop-down menu click on “Edit Schedule”.
  • From the “Edit Schedule” menu, check the “Schedule Report” box. This will bring up another dialog where you can change or confirm your scheduling settings. In this case, the report is set to run every day at 1:00 a.m. against “Yesterday” data. Click Save when you are satisfied with the settings.
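For reference, the “Yesterday” preset corresponds to Splunk’s relative time modifiers, and the 1:00 a.m. daily run can equivalently be expressed as a cron schedule in the same Edit Schedule dialog. A minimal sketch of both (adjust to your own report):

earliest=-1d@d latest=@d
0 1 * * *

The first line bounds the search from midnight yesterday to midnight today; the second is the cron expression for 1:00 a.m. every day.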

5. Create a Matillion job to load the report to Snowflake

Now that we have our historical data loaded to a table in Snowflake and a report scheduled in Splunk, we need to create and schedule a job to load the report’s data to Snowflake. Matillion has a Splunk query component that is perfect for this use.

  • The Splunk Query component can be used with a Table Iterator component for incremental loads in Advanced mode. A SQL-like query can be written to query a report from Splunk: SELECT [column_names] FROM [table_name].
  • Example:
SELECT _time, action, "actor.displayName", displayMessage, eventType, user_id FROM "okta_login_history";
  • A date can be specified with: SELECT [column_names] FROM [table_name] WHERE [_time] < 'DD/MM/YYYY 00:00:00'. Keep in mind that with large datasets (over 1 million events/rows), the component will run into timeout issues.
  • Fill in the necessary configuration settings. If the Splunk vendor has whitelisted your Matillion server’s IP address, you should be able to make a connection. The component’s Properties section is where these connection and query settings are entered. To find out more about the Splunk Query component, check out Matillion’s documentation.
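Putting the two patterns together, the statement entered in the Splunk Query component for one incremental slice might look like the following; the literal date is only an illustration of the DD/MM/YYYY format above:

SELECT _time, action, "actor.displayName", displayMessage, eventType, user_id FROM "okta_login_history" WHERE _time < '17/02/2022 00:00:00';

With a Table Iterator supplying the date value on each run, the component pulls only a bounded slice of the report, which helps avoid the timeout issues mentioned above.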

6. Schedule the Matillion job

Now that the report has been scheduled in Splunk and a Matillion job has been created, all that is left is to schedule the Matillion job. This can be done with Matillion’s built-in scheduler or your preferred third-party tool. Keep in mind how long you expect the Splunk report to take. In my example, the daily report in Splunk starts at 1:00 a.m. and takes about 30 minutes to run, so to be safe I scheduled the Matillion job several hours later, at 9:00 a.m.

Final Thoughts

The size of the data is what made this project a challenge. Splunk is a very powerful tool for log collection and data analysis; however, if you want large datasets available at a moment’s notice, I suggest getting that data into Snowflake. Using a tool like Matillion ETL to get it there streamlines the process and lets you take advantage of Matillion’s and Snowflake’s data transformation capabilities.

This project has shown me that there is still a lot of stagnant data out there that is not being used to its full potential. Switching or upgrading your tools can be daunting and not necessarily cost-effective, so solutions like this for curating large, niche datasets are something I can see being needed in many future projects. If your company is currently using Splunk, or a similar platform, and has large log datasets just sitting around, this may be just the solution you are looking for.

Let’s Do Data and Cloud Together!

At Hashmap, an NTT DATA Company, we work with our clients to build better, together. We are partnering with companies across a diverse range of industries to solve the toughest data challenges — we can help you shorten time to value!

We offer a range of enablement workshops and assessment services, data modernization and migration services, and consulting service packages for designing and building new data products as part of our service offerings. We would be glad to work through your specific requirements. Connect with us here.

Keegan Hudson is a Cloud Data Engineer at Hashmap, an NTT DATA Company, and provides Data & Cloud solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
