Curse of the security data lake monster

Sherif Mansour
Just Eat Takeaway-tech
8 min read · Apr 25, 2023

In my time I have seen security data lake projects turn into “a few good ideas and a smoking crater” more than once. There are several pitfalls that we can learn from, and a great success story we don’t often celebrate as a security data-lake use case.

There comes a time in a security team's maturity when it decides that security has a data problem, that the team needs to become more data-driven (like the rest of the org), and that therefore a security data lake is the answer (to a problem that's not fully fleshed out).

Yet… in many security organisations there is a bona fide success story that we do not call a security data lake, but which for all intents and purposes is one. I am talking about the SIEM (Security Information and Event Management). Despite there being many, many examples of failed SIEM rollouts, there are more successes than failures, and for the most part it's an established part of information security.

What are the major pitfalls of security data lakes?

1) The wrong organisational structure with the wrong data store/platform solving the wrong problem.

In general, the failure of many data lakes is an organisational-structure and process problem. There is a great article by Zhamak Dehghani on the phenomenon (see link). The key to this failure is having a centralised team servicing multiple stakeholders with a centralised data lake (often nicknamed a data swamp).

Equally, some teams might go into using a data lake platform thinking it has the same capabilities as other data stores they are used to (like a relational database, e.g. MySQL) and discover its trade-offs too late in the process.

This is especially true if you require a near real-time answer to a (large) query and have not factored in that you may need to perform the data processing on the data lake (daily/hourly/other, depending on the data refresh) and serve that data from a "lake shore-mart" database to meet your SLAs for customers/use cases; a sketch of that pattern follows below. It's the classic "when you are a hammer, every problem is a nail" mistake.
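To make that pattern concrete, here is a minimal sketch in Python, with an in-memory list standing in for the (slow) lake scan and SQLite standing in for the serving database. All table and field names here are hypothetical:

```python
import sqlite3
from collections import Counter

# Hypothetical nightly batch job: pre-aggregate findings on the lake's
# refresh schedule and publish the result to a small serving database
# (the "lake shore-mart"), so interactive queries hit the mart, not the lake.

# Stand-in for the result of an expensive scan over the data lake.
lake_rows = [
    {"team": "payments", "severity": "high"},
    {"team": "payments", "severity": "low"},
    {"team": "search", "severity": "high"},
]

# Aggregate once per refresh (daily/hourly/other).
counts = Counter((r["team"], r["severity"]) for r in lake_rows)

# Serve the pre-computed answer from a fast relational store.
mart = sqlite3.connect("lakeshore_mart.db")
mart.execute(
    "CREATE TABLE IF NOT EXISTS findings_by_team "
    "(team TEXT, severity TEXT, total INTEGER)"
)
mart.execute("DELETE FROM findings_by_team")  # full refresh each run
mart.executemany(
    "INSERT INTO findings_by_team VALUES (?, ?, ?)",
    [(team, sev, n) for (team, sev), n in counts.items()],
)
mart.commit()
```

Customer-facing queries then only ever touch the mart, which is how you meet a near real-time SLA even though the lake itself is slow.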

2) Engineers wanting to use ‘shiny’ new tech without proper technical due diligence or training

Another pitfall is a senior engineer discovering a shiny new capability (e.g. graph databases, data streaming, Apache Beam) and building a project/initiative around a piece of technology without the engineers having gone through technical due diligence or training before starting.

Seriously… Say “Graph the planet” one more time without knowing what betweenness centrality means… I dare you!

I remember passing by an engineer's desk, asking them if they needed help to get a deliverable done on time, and the answer was "no, just need the time to figure it out", with an O'Reilly book open on the desk. Cut to a few months later: the project was complete and the solution was not fit for purpose (returning results 24 hours outside our SLA, i.e. Service Level Agreement). When another engineer with experience in the technology took over, they had to rewrite the whole thing, and their long-term recommendation was to scrap the approach altogether as the tech was not fit for purpose to begin with.

3) Too much work and resources for little to no reward

A security data lake can be an endless time and money sink if you are not careful. Be mindful of the work needed for data ingestion, data processing, data quality, governance, serving the data, and maintaining the data platform. Without the right support and structure, the right use case, data that is already in a good state, and the right platform, there is a high risk that this will hurt.

4) A single point of failure for many security teams

Let's say you have delivered your first set of use cases, and now the security data lake is entrenched in the processes of several security teams. What happens when a team wishes to move to a new security solution and needs the data ingestion/processing/serving to change in order to factor in this new solution? The answer is that the data engineering team becomes a bottleneck for security initiatives, and the more entrenched the lake, the bigger the problem.

5) The data & data model is just not there — Use cases for immature processes

I cannot overstate this pitfall. One of the first things Google teaches in its "How Google Does Machine Learning" course is that you need to ensure the maturity of the process you are trying to provide a big data solution for.

Is the data available in a usable and accessible form?

To solve the data problem, do you have the right data model to link the different datasets for your use case?

Imagine you want to map a set of security findings to the team responsible for resolving each issue. My favourite example of this is a bug bounty finding that reported a vulnerability in a specific parameter on a specific web page. To solve this mapping you need the ability to map (somewhere) the FQDN/URL/API/page to the relevant team, and the data from the bug bounty program needs to be consumable in a usable fashion so that you can do that mapping. Good luck with that.
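To illustrate why the data model is the hard part: the mapping logic itself is almost trivial once an ownership table exists. Everything below is hypothetical; in real life the difficulty is keeping that table accurate and complete.

```python
from urllib.parse import urlparse

# Hypothetical ownership table: (FQDN, path prefix) -> owning team.
# Maintaining this mapping is the real problem, not querying it.
OWNERSHIP = {
    ("api.example.com", "/v1/orders"): "orders-team",
    ("www.example.com", "/checkout"): "payments-team",
}

def owning_team(finding_url: str) -> str | None:
    """Map a bug bounty finding's URL to the team that owns the page."""
    parsed = urlparse(finding_url)
    for (fqdn, prefix), team in OWNERSHIP.items():
        if parsed.hostname == fqdn and parsed.path.startswith(prefix):
            return team
    return None  # no mapping: a human has to triage it

print(owning_team("https://www.example.com/checkout?promo=x"))
# -> payments-team
```

Every finding that returns None lands back on a person's desk, which is exactly the gap an immature process leaves you with.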

What made the SIEM a security data lake success story?

It is not all doom and gloom, though. There are success stories like the SIEM, and it's important to understand what made them work and what we can do to limit the likelihood of failure.

1) Achievable & impactful use case.

Use Case: Collect the following monitoring logs from our systems and if the following events are triggered (i.e. indicators of compromise) raise an alert for the security team to triage.

Simple. Achievable. Impactful use cases.

Of course there is more to it than the example above (tuning the rules to avoid false positives, alert fatigue, false negatives, etc.); that said, the SIEM has found its place in the security industry for good reason.
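Stripped of all those hard parts (parsing, enrichment, rule tuning), the core of the use case is a matching loop. A toy sketch in Python, with hypothetical field names and documentation-range IPs standing in for real indicators of compromise:

```python
# Toy sketch of the core SIEM loop: stream events in, match them
# against indicators of compromise (IOCs), raise alerts for triage.
# All field names and IOC values below are invented for illustration.

KNOWN_BAD_IPS = {"203.0.113.7", "198.51.100.23"}  # example IOC feed

events = [
    {"src_ip": "10.0.0.5", "action": "login", "user": "alice"},
    {"src_ip": "203.0.113.7", "action": "login", "user": "bob"},
]

def triage_alert(event: dict) -> None:
    """Hand the matched event to the security team for triage."""
    print(f"ALERT: IOC match from {event['src_ip']} ({event['user']})")

for event in events:
    if event["src_ip"] in KNOWN_BAD_IPS:
        triage_alert(event)
```

Collect logs, match, alert: a use case narrow enough to ship and valuable enough to justify the plumbing, which is precisely what most security data lake projects lack.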

2) Support and maintenance

While not insignificant, the effort to maintain such a platform is worth it on a risk/reward basis. Teams have enough support, documentation, training and community to be able to gain value from these solutions.

Not to mention a small cottage industry of professional services to support and make the most of that capability.

3) The data & data models are in a usable state

As mentioned above, I can't begin to tell you how crucial it is for a use case that the data exists and that it exists in a usable state (a little effort here and there notwithstanding). This is a key success factor for the SIEM: for the most part, the data is there to make the use case achievable.

One often-missed success factor for a use case is a good data model. An example is the Common Information Model in Splunk, which takes care of design work that is otherwise easily overlooked.
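To illustrate what a common information model buys you (this is a toy Python illustration, not Splunk's implementation; all field names are invented): once every source is normalised to the same fields, a single detection query works across all of them.

```python
# Two vendors log the same thing with different field names; a common
# information model maps both onto one schema, so one query shape
# covers every source. Field names here are hypothetical.

def normalize_vendor_a(raw: dict) -> dict:
    return {"src": raw["SourceAddress"], "dest": raw["DestAddress"],
            "action": raw["Disposition"].lower()}

def normalize_vendor_b(raw: dict) -> dict:
    return {"src": raw["s_ip"], "dest": raw["d_ip"],
            "action": "allowed" if raw["ok"] else "blocked"}

a = normalize_vendor_a({"SourceAddress": "10.0.0.1",
                        "DestAddress": "10.0.0.9",
                        "Disposition": "Blocked"})
b = normalize_vendor_b({"s_ip": "10.0.0.2", "d_ip": "10.0.0.9", "ok": False})

# One query over both sources, thanks to the shared model:
blocked = [e for e in (a, b) if e["action"] == "blocked"]
print(blocked)
```

Without that shared model, every new data source means rewriting every detection, which is the design work the CIM quietly removes.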

4) Extensibility

Another key success factor is that you can build on top of what you have rather than having to rebuild existing capabilities. Teams built multiple use cases on the existing datasets, and the community built additional capabilities (SOAR, UEBA) to improve those use cases further.

5) The right data-store for the right job

Last but not least, the current successful SIEMs are built on top of data stores that meet the user requirements. I remember that over a decade ago this was not the case: there were SIEM products that, while good on paper, would hit technical limits under the amount of data and data processing requirements thrown at them.

What can we learn from this to avoid the common pitfalls?

1) Make sure you have a data product mindset going in

Focus on the value you are trying to get out of your data initiatives and start with simple and valuable use cases. Take a product approach (a continuous stream of work focused on the intended value) as opposed to a project approach (a scope- and time-bound effort focused on delivering work, not value).

2) The right team for the job

Have data engineering team members available and committed to the security data products your team wishes to build. Do not go into initiatives without the right technical due diligence and training ahead of time.

Ensure your internal users (e.g. other security teams) have been trained on the analytics tools well enough to use the capabilities and build on top of them. Examples: writing their own queries, knowing how to discover what data is available, and, where it makes sense, ingesting their own data.

3) Leverage common tooling, platforms and a self-service model

If the company has a data team with a reference architecture for platforms and support functions, it is strongly recommended that you leverage it: if you share a common set of tools and design paradigms, you can reach out to those teams for help.

Ensure you have the right self-service model for the data platform so that your security teams are not bottlenecked on one or two engineers.

Make the data discoverable in your company's data catalogue so that other internal teams can take advantage of it (with the right governance), for example Internal Audit, Compliance, the Data Privacy Office, and others.

4) Use federated querying where possible

While this is a tactical detail, remember that you do not have to copy the data into your own data store; one option is to query the datasets you need at source and work with a virtual table/view of the data in your data store.

There are a few tools that can do this, such as Apache Drill and Trino.
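As a sketch of what federated querying looks like with Trino's Python client (the host, catalogs, and table names below are all hypothetical):

```python
import trino  # pip install trino

# Query two systems at source through one engine: no copy of either
# dataset lands in our own store. Connection details are hypothetical.
conn = trino.dbapi.connect(
    host="trino.internal.example.com",
    port=8080,
    user="security-analytics",
    catalog="hive",
    schema="security",
)

cur = conn.cursor()
# Federated join: bug bounty findings live in a Postgres catalog,
# service ownership lives in a Hive catalog.
cur.execute("""
    SELECT f.finding_id, f.fqdn, o.team
    FROM postgresql.bugbounty.findings AS f
    JOIN hive.inventory.service_owners AS o
      ON f.fqdn = o.fqdn
""")
for finding_id, fqdn, team in cur.fetchall():
    print(finding_id, fqdn, team)
```

The join runs across both source systems at query time, so you avoid the ingestion, storage and freshness costs of maintaining your own copy.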

5) Define a definition of ready for use cases — Do your due diligence

As the saying goes: "Before you attempt to beat the odds, make sure you can survive the odds beating you." In the case of a security data lake, set some clear guidelines on what needs to be in place for a security data use case to be viable. A definition of ready might work for your teams and limit the number of times you go down the wrong path.

Example:

  • Is the data for this use case generated by automated processes?
  • Is the data available, and is the data quality sufficient for the use case?
  • Have the maintainers of the systems of record committed to the data quality requirements?
  • Have you proven the data model for the use case, i.e. that the necessary data points exist to solve it?

6) Self Service & Training — Build a community around it

Finally, make sure you have enough support for your security teams to make the most of these capabilities, and keep them and the customers front and centre. If the data can be queried over SQL, don't assume everyone is proficient; some could also use a refresher. The same goes for BI tools like Tableau: do not assume that if you build it, they will come. Every step of the way counts.

Just Eat Takeaway.com is hiring! Want to come work with us? Apply today.
