Making your Data Lake accessible with Search

Elena Fullman
Published in Foxtrot Code
12 min read · Apr 19, 2017

Does your organization have a large amount of data stored in a Hadoop environment? Are you trying to make your Data Lake more accessible to the enterprise users in your organization? A Data Lake must be accessed through a toolset to be hospitable to an enterprise audience. You are here because you’ve already realized this, and you need an interface that is designed for analysis: a Search interface.

We will compare the two leaders in Big Data Search. Elastic is built on the Apache Lucene project, while Splunk uses its own proprietary indexing engine. Both deliver similar functionality: a distributed, multitenant-capable, full-text, schema-less search engine with a RESTful web interface.

The leaders we will compare are the Elastic Stack (Elastic plus Beats, Logstash, and Kibana) and Splunk + Foxtrot Code. While these products address similar needs, they are two very different products.

Elastic (formerly Elasticsearch) claims over 500,000 free downloads per month. Splunk has over 13,000 licensed installs, making it the leader among commercial applications. As a proxy for engagement, we’ve used Google Trends below to graph interest in Elastic (blue) versus Splunk (red).

Figure 1 Google Trends (Elasticsearch versus Splunk)

What is accessibility?

Our goal is to make your Data Lake accessible to your enterprise audiences. This means that a domain expert or analyst without programming skills should be able to use the solution to create and execute a search, perform analysis, visualize the results, and then output data to end-user applications.

The solution must be accessible via a browser. It needs to provide the capability to manage the security of the platform for a large distributed audience.

Do you need to own the Search engine?

Is the Search application strategic to your business, and part of your brand? Do you need to own the Search application you will deploy? If so, the Elastic Stack may be the right decision for you.

Examples of this kind of Search application are the Search bars on Facebook or LinkedIn. They are unique to the brand, and largely determine the utility of the product for users. Elastic is well suited as a starting point for this type of problem, where the search is constrained to a specific set of features and data.

If you don’t need to own the Search engine, you are in the 99% of use cases.

Your next step is to perform a “Build versus Buy” analysis. We will focus on Total Cost of Ownership (TCO) to make your Data Lake accessible. Once you understand the TCO, you’ll also want to consider the risk of completing the project successfully and maintaining a high service level.

This solution will not be a toy or demonstration installation of Splunk or Elastic. You are setting out to create a serious solution worth millions in ROI to your organization.

We took a detailed approach to comparing these products and found that the first-year TCO of Elastic is roughly 3 times that of Splunk + Foxtrot Code ($2,081,886 versus $650,855). We’ve provided our findings below, including the spreadsheets, to help you build your own analysis.

The Elastic Stack

The Elastic Stack is a set of Open Source projects (code libraries) designed to provide developers with a starting point for creating a Search engine. It is usually referred to as ELK and, lately, as BELK, which stands for Beats + Elastic + Logstash + Kibana. Beats and Logstash get data into Elastic, the Search engine that stores the data. Kibana is an application for graphing data stored in Elastic.

As we discussed above, using Elastic provides you with the opportunity to create an entirely unique and special purpose Search engine. Elastic is a programming library and a starting point to build your custom application.

Using Elastic means that you will be building your own Search product using the Elastic Stack as a starting point. If it isn’t obvious, this will be a full-scale product development effort. The resources you will need are some of the most expensive and scarce in the industry. As you are probably very aware, these technologists command salaries above $150,000. Maintenance will also be expensive, because you will need to engineer improvements on your own, as well as be careful to isolate your improvements to Elastic so that you can take any updates into your custom engine.

Therefore, the TCO of Elastic is directly related to the product management and engineering activity you will drive. You will need to own the maintenance and support for your Elastic Search engine as well.

Most important to the ROI of your Data Lake and making it accessible is the ability for your enterprise audience to build their own searches and leverage the data themselves. Elastic is designed for programmers, and executed in code.
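To give a sense of what “executed in code” means in practice, here is a minimal sketch of a full-text query written in Elasticsearch’s JSON query DSL and wrapped in an HTTP request to the `_search` endpoint. The host `localhost:9200`, the index name `datalake-logs`, and the field name `message` are assumptions made for illustration only, and the request is constructed but never sent.

```python
import json
import urllib.request

# Hypothetical values for illustration; they do not describe a real deployment.
ES_HOST = "http://localhost:9200"
INDEX = "datalake-logs"


def build_search_request(term: str, size: int = 10) -> urllib.request.Request:
    """Build (but do not send) a _search request that uses a simple match query."""
    body = {
        "size": size,
        "query": {"match": {"message": term}},  # "message" is an assumed field name
    }
    return urllib.request.Request(
        url=f"{ES_HOST}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("timeout")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Even this simplest case expects the author to know the index layout, the field names, and the query syntax, which is the gap an enterprise user without programming skills would face.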

At least in the beginning, in order to make your Data Lake accessible, you will need to create a Search Algorithm development team to build all the searches for your enterprise audience. To fulfill the goal, you will need to develop a second product that integrates with Elastic and allows your enterprise users to create their own searches. If the enterprise audience can’t develop their own Big Data Searches (self-service), you will slow down ROI and increase TCO dramatically.

To understand some of your considerations when developing a solution with Elastic, we suggest this post from Treasure Data. You should also check out this post from OReilly.com on the metrics you need to watch in your Elastic deployment.

Learn more about Elastic.

Splunk + Foxtrot Code

Splunk with Foxtrot Code provides all the capabilities required to meet the goals for Data Lake accessibility.

Splunk and Foxtrot Code are commercial products. Unlike Elastic, they both carry fees associated with use. However, these are complete products that work without any programming. They can be configured very quickly, and at a lower cost. As a result, Splunk + Foxtrot Code has a low TCO even in comparison to a free, open source project like the Elastic Stack.

Splunk is well known for its ability to ingest data from a wide variety of inputs. It is also designed to work in real time, and it indexes data at a high compression rate. Splunk clusters are relatively simple to set up, because every role in a cluster is the same software configured in different ways. Splunk can be scaled both vertically and horizontally. It also comes with powerful ODBC capabilities to make refined data accessible to end-user applications such as Tableau, Excel, PowerBI, MicroStrategy, and other ODBC-compliant solutions. Splunk can also export refined data directly into structured or unstructured databases, or make it available through RESTful endpoints.

Foxtrot Code is directly integrated with Splunk and provides a cloud-based, codeless development environment for building Splunk Searches. Foxtrot Code provides a complete environment, including Splunk, in which to build and test Splunk Search Algorithms. It is designed for enterprise users without programming skills so that your domain experts can focus on data and analysis, and bridge the data from your Data Lake to end-user applications. Foxtrot Code also comes with a Marketplace to support collaboration and the reuse of Search fragments called Foxpatches. Users within an organization can capture domain expertise within these Foxpatches, and then share them in a private or public marketplace.

Learn more about Splunk and Foxtrot Code.

Exploring TCO

TCO for making your Data Lake accessible to enterprise users breaks down into three components.

  • Cost of Deployment
  • Cost of Operations, Maintenance
  • Cost of Developing Searches (Accessibility)

We won’t be including infrastructure deployment costs in our analysis. Generally, infrastructure tends to be slightly more expensive for Elastic, but it won’t be a relevant differentiator to your decision.

Cost of Deployment

Both products are free to download and deploy. As a result, the Cost of Deployment is limited to salaries (OPEX) of the technical resources required to get the applications in place and operational.

Elastic Stack

As described above, the Elastic Stack is the combination of at least 4 products or open source projects marketed by Elastic. You can download them and perform a build per the instructions.

To make Elastic functional, you will have to program the integration of the REST calls between the main BELK products and other applications to create a data pipeline. You will need to integrate libraries to import the data from Hadoop. Once the components are integrated and tested, synchronizing the ongoing loading of data will also require customization.

The resources you will need for this effort require some experience. Most of the projects we have seen that transfer large data stores into Elastic Stack deployments require a team of three senior developers for a four-week project.

Budget $173,820 in OPEX to deploy the Elastic Stack.

Splunk + Foxtrot Code

Splunk, in contrast to Elastic, is a commercial product in use at over 13,000 corporations. There is nothing to integrate with Splunk, and code deployment takes about 15 minutes.

Unless you have a Splunk architect already on staff, a deployment of hundreds of terabytes of data will require a little architectural support. A certified consultant will make short work of the actual deployment. We also included the cost of getting one of your admins Splunk certified.

Foxtrot Code is a platform service: there is nothing to set up, and there is no initial fee. If you want to check out a live version of Splunk as well as Foxtrot Code’s Codeless Development environment and the Marketplace, go here and sign up for a free account to get started.

Deploying Splunk for a project like this should take less than 5 days. To be conservative, we planned for delays in other parts of the total solution, such as the Data Lake, and budgeted 10 days for the duration.

Budget $31,430 in OPEX to deploy Splunk + Foxtrot Code.

Conclusion — Cost of Deployment

Foxtrot Code Analysis — Cost of Deployment

Cost of Operations, Maintenance

Operations and maintenance are where the TCO of a customized Elastic Stack solution and a Splunk + Foxtrot Code solution diverge.

Elastic Stack

The good news is that the Elastic Stack has no license fee. However, you will be creating a custom Elastic Stack application, and in exchange for avoiding a license fee you will get a product that is less complete and harder to operate and maintain.

Operations and Maintenance of your Elastic Search Engine solution will revolve around the development of a custom application. The complexity of the application will determine the size and cost of your development effort.

It is unlikely that you will build your Elastic Search Solution and just stop. It would be more realistic to assume that this will be an on-going effort. At a minimum, we expect that you will need to invest the time of at least a team of three for a year, along with support from outside consultants. This will be a significant cost as well as a risk to the project given the scarcity of the programming resources, and the salaries they command.

Budget $1,125,066 for the Cost of Operations and Maintenance with a custom Elastic Solution.

Splunk + Foxtrot Code

Splunk + Foxtrot Code does come with a license fee. Before you panic, the cost is relatively small in comparison to a custom Elastic solution. Splunk + Foxtrot Code deliver all the capabilities required to make your Data Lake accessible to enterprise users. There is no software for your team to write or maintain. Splunk pumps out periodic upgrades designed to be seamlessly incorporated into your deployment.

The license fee for Splunk is related to the amount of raw data you index per day. There is no cost for how many users, distribution of servers, or output. We’ve included a 100GB daily license, and the annual renewal fee is $60,000. You can also purchase a perpetual license for a higher one-time fee.

Splunk provides many ways to manage the cost of indexing. For example, what if you need only 10GB of daily indexing but must first load a 500TB (terabyte) archive of old data to get started? Splunk assumes this is going to happen, and lets you exceed your license 3 times a month without any extra cost.
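As a rough illustration of that allowance, the sketch below counts how many days in a month exceed the daily license and checks the count against the limit of 3 quoted above. The function names and the sample month are hypothetical, the 100GB quota is the license size used in this analysis, and the allowance is taken from this article rather than from Splunk’s license terms, so confirm the actual terms with Splunk.

```python
from typing import Sequence

LICENSE_GB_PER_DAY = 100  # daily license size used in this analysis
ALLOWED_OVERAGES = 3      # monthly allowance as stated in this article (assumption)


def overage_days(daily_gb: Sequence[float], license_gb: float = LICENSE_GB_PER_DAY) -> int:
    """Count the days on which the indexed volume exceeded the daily license."""
    return sum(1 for gb in daily_gb if gb > license_gb)


def within_allowance(
    daily_gb: Sequence[float],
    license_gb: float = LICENSE_GB_PER_DAY,
    allowed: int = ALLOWED_OVERAGES,
) -> bool:
    """Return True if the month stays within the permitted number of overage days."""
    return overage_days(daily_gb, license_gb) <= allowed


# Hypothetical month: 28 ordinary days plus two bulk-load days.
month = [80.0] * 28 + [400.0, 650.0]
print(overage_days(month), within_allowance(month))  # prints: 2 True
```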

The subscription fee for Foxtrot Code for a mix of 50 users with varying levels of capabilities is approximately $60,000 per year. Keep in mind that Foxtrot Code offloads the infrastructure costs and the cost of Splunk for your enterprise users to build and test their Searches. Where your application requires strict privacy, you can obtain dedicated hosted solutions and on-premise installations from Foxtrot Code.

Budget $492,972 for the Cost of Operations and Maintenance for Splunk + Foxtrot Code.

Conclusion — Cost of Operations and Maintenance

Foxtrot Code Analysis — Cost of Operations, Maintenance

Cost of Developing Searches (Accessibility)

Elastic Stack

Achieving our goal of making our Data Lake accessible to enterprise users is difficult with a custom Elastic Stack solution. The Elastic Stack doesn’t come with an interface designed for enterprise users. Kibana is a graphing engine for Elastic. Even if that satisfies the requirements for visualization, it doesn’t begin to support the development of searches and analysis by an enterprise user without programming skills.

Assuming that you aren’t going to build a custom search engine and a product like Foxtrot Code to provide codeless development for enterprise users, you are going to need to directly support your enterprise users with a team of programmers.

We included a team of four to support 50 enterprise users (e.g. domain experts and analysts). This is a high price to pay, but a Codeless Development environment for Elastic would be an expensive additional product development effort.

The downside is that having programmers develop requirements with enterprise users is an old-fashioned and inefficient model. It assumes that the majority of analyses developed won’t be ad-hoc, but will serve a more general purpose. This will severely limit ROI from your Data Lake.

Budget $783,000 for the Cost of Developing Searches (Accessibility) with a custom Elastic Solution.

Splunk + Foxtrot Code

Splunk + Foxtrot Code is uniquely positioned to support the goal of making your Data Lake accessible to your enterprise users.

Splunk is significantly simpler to operate than a custom Elastic Stack search solution. When you need to configure data forwarding from your Hadoop environment into Splunk indexes, it is done without coding. Clusters and deployments are automated and centrally managed. Splunk also provides ODBC drivers to get data easily out of your Splunk platform and into end-user applications such as Tableau, Excel, PowerBI, and MicroStrategy.

Foxtrot Code extends the Splunk ecosystem by providing enterprise users with a combination of a Codeless Development platform and a Marketplace designed to make big data accessible to a potential group of users many times larger than your IT department.

Foxtrot Code makes the development, testing, deployment, and hosting of algorithms easy for enterprise audiences. It features a codeless development environment where you build Search Algorithms using a drag-n-drop, visual programming interface. It also features an Algorithm Marketplace that accelerates and lowers the cost of developing Algorithms. Foxtrot Code’s marketplace promotes reuse of search algorithms within your organization enabling your users to leverage whole solutions or fragments of solutions that your team or others have already perfected.

Splunk Search Algorithms built on Foxtrot Code capture both domain knowledge and the analysis experience. They are naturally modular, self-documenting, and can be broken into pieces that perform logical parts of an algorithm to increase readability as well as reuse. A fragment such as one that retrieves, reduces, and obfuscates data, can be connected to multiple analyses from multiple users performing a deeper ad-hoc analysis.

Leveraging the Marketplace on Foxtrot Code, enterprise users can share their algorithms and algorithm fragments. Foxtrot Code provides the ability for users to list their algorithms. Additionally, enterprise users on Foxtrot Code can combine algorithms built by teammates into their work for FREE, creating an ad-hoc partnership. Organizations creating algorithms (e.g. “brand-gorithms”) can keep their work private and confidential, sharing it between teams with different levels of access to create a powerful knowledge management environment.

Foxtrot Code offers the entire solution as an on-premise or private cloud deployment, or organizations can purchase a dedicated Foxtrot Code hosted solution. The base Foxtrot Code subscription is FREE and provides access to the platform service so that you can learn about Foxtrot Code, with upgrades available for additional features and services.

Splunk + Foxtrot Code already meets most of the need for making your Data Lake Search platform self-service for enterprise users. We added the cost of one Certified Power User, plus the cost of certification, as dedicated consulting support for your enterprise users.

Budget $126,453 for the Cost of Developing Searches (Accessibility) with Splunk + Foxtrot Code.

Conclusion — Cost of Developing Searches (Accessibility)

Foxtrot Code Analysis — Cost of Developing Searches (Accessibility)

Final Tally (First Year — TCO)

The objective of this comparison was to illustrate the difference in TCO between the two solutions as well as potential value.

Total Cost of Ownership for making Data Lake accessible with:

  • The Elastic Stack — $2,081,886
  • Splunk + Foxtrot Code — $650,855

To support the goals described above, an Elastic Stack solution will have a TCO roughly 3 times that of a Splunk + Foxtrot Code solution. As described above, a custom Elastic Stack solution with a TCO of over $2,000,000 also delivers substantially less than Splunk + Foxtrot Code.
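The tally can be reproduced with a few lines of arithmetic. The sketch below sums the three budget figures quoted in the sections above for each option and computes the ratio between the totals; the dictionary keys are our own labels, not names from the spreadsheets.

```python
# First-year budget figures quoted in the sections above (US dollars).
COSTS = {
    "Elastic Stack": {
        "deployment": 173_820,
        "operations_maintenance": 1_125_066,
        "developing_searches": 783_000,
    },
    "Splunk + Foxtrot Code": {
        "deployment": 31_430,
        "operations_maintenance": 492_972,
        "developing_searches": 126_453,
    },
}

# Total cost of ownership per option, then the ratio between the two totals.
totals = {name: sum(parts.values()) for name, parts in COSTS.items()}
ratio = totals["Elastic Stack"] / totals["Splunk + Foxtrot Code"]

for name, total in totals.items():
    print(f"{name}: ${total:,}")
print(f"Elastic / Splunk ratio: {ratio:.1f}x")
```

Run as written, this gives $2,081,886 and $650,855, a ratio of about 3.2. Replacing the figures with your own estimates is a quick way to see how sensitive the conclusion is to any single line item.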

The Splunk + Foxtrot Code solution will enable 50 enterprise users to build as many Search Algorithms as they want. In contrast, only four programmers will be supporting the same organization with the Elastic Stack Solution. You can purchase more seat subscriptions from Foxtrot Code to support more enterprise users.
