Security Data Lakes, Normalization and OCSF

As Snowflake’s Cybersecurity Field CTO, I’m asked fairly frequently about my thoughts on the Open Cybersecurity Schema Framework (OCSF) and about normalization in general. The following are things I’ve learned from conversations with the field as well as with the OCSF community.

OCSF has a great community.

OCSF has commitments and active support from some major players in the industry, including Splunk, AWS and, of course, Snowflake. The contributors themselves know how to get stuff done, are welcoming, and include industry vets, experts, startups and established companies all working together. Contributors I’ve talked to don’t complain about getting their changes in, and everyone seems relatively eager to help and support each other. I highly recommend joining their Slack.

OCSF is still relatively new

Having only recently released its first major version, OCSF is still in development, and you may find a lot of info stored under the “additional info” field. Many vendors who created integrations for the launch of AWS Security Lake have neglected to keep them up to date with newer schema versions. Until the schema stabilizes, more vendors adopt it and existing vendors catch up, don’t expect to find everything normalized and ready to go out of the box.

It’s a standard by vendors, for vendors

This is a good thing for everyone. An open, widely accepted standard promotes interoperability and keeps the ecosystem open, which means less work both for the enterprises that consume the data and for the vendors that produce it.

As a vendor, supporting an open and accepted standard means less work building connectors while giving your customers more value. More importantly, it means that other non-competing services won’t need to build custom connectors for your product, increasing your company’s value in the ecosystem.

Enterprise security teams should opt for local normalization

Coming from traditional SIEMs, a lot of companies simply accept the idea that everything must be normalized at all times. If you’re using something like Snowflake, features like native searching of JSON data and support for unlimited concurrent queries mean that a universal normalization effort may be unnecessary and not worth the cost.
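As a sketch of what this query-time (“schema-on-read”) approach looks like, the hypothetical snippet below keeps events as raw JSON and maps vendor-specific keys to common names only when a query runs. The field names are illustrative, not taken from OCSF or any real vendor format:

```python
import json

# Raw events kept exactly as ingested -- no upfront normalization.
raw_events = [
    '{"eventName": "ConsoleLogin", "sourceIPAddress": "198.51.100.7"}',
    '{"event_type": "login", "src_ip": "203.0.113.9"}',
]

def normalized_view(raw: str) -> dict:
    """Normalize on read: map vendor-specific keys to common names."""
    event = json.loads(raw)
    return {
        "activity": event.get("eventName") or event.get("event_type"),
        "src_ip": event.get("sourceIPAddress") or event.get("src_ip"),
    }

# The raw data is untouched; only this query's result is normalized.
views = [normalized_view(r) for r in raw_events]
```

The point of the sketch is that normalization becomes a property of the query, not of the stored data, so each consumer can apply (or skip) it as needed.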

Since a security data lake can support a variety of use cases, different teams may want the same data in different formats. A data scientist working on an ML model may want data as close to raw as possible, while a threat detection engineer may prefer the data normalized to work with existing detections.

Consider as well that some log sources are not appropriate for normalization to begin with. VPC flow logs, for instance, can take up a lot of storage space even in their raw form, and normalizing to OCSF adds a considerable amount of overhead: compared to raw, uncompressed OCSF-formatted records are about 10x larger.
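To make the overhead concrete, the toy comparison below expands a single raw, flow-log-style line into a verbose, self-describing JSON record. The record is only loosely inspired by OCSF’s network-activity shape (it is not the real schema), and the exact ratio depends on the mapping, but the self-describing form is clearly several times larger:

```python
import json

# A raw, space-delimited flow-log-style line (fields are illustrative).
raw_line = ("2 123456789010 eni-0a1b2c3d 10.0.0.1 10.0.0.2 443 49152 "
            "6 20 4249 1418530010 1418530070 ACCEPT OK")

# The same record as a verbose JSON document. Every value now carries
# its field name, and flat fields become nested objects.
normalized = {
    "class_name": "Network Activity",
    "src_endpoint": {"ip": "10.0.0.1", "port": 443},
    "dst_endpoint": {"ip": "10.0.0.2", "port": 49152},
    "connection_info": {"protocol_num": 6},
    "traffic": {"packets": 20, "bytes": 4249},
    "start_time": 1418530010,
    "end_time": 1418530070,
    "disposition": "ACCEPT",
    "status": "OK",
    "cloud": {"account_uid": "123456789010"},
    "interface_uid": "eni-0a1b2c3d",
}

# Several times larger even in this toy case; a full OCSF mapping with
# metadata and enrichments widens the gap further.
ratio = len(json.dumps(normalized)) / len(raw_line)
```

Multiply that per-record overhead by billions of flow records per day and the storage cost of universal normalization becomes easy to see.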

Embrace the fork

Rolling out a schema is hard enough; keeping it up to date with the latest versions is probably not going to happen. Most companies I talk to that have a schema (or multiple schemas) have forked, not just to avoid keeping up with changes but to fit it to their individual needs. This isn’t necessarily a bad thing and doesn’t mean you “failed”. Even if you start with something like OCSF and fork it, you’ll have a great starting point, and future OCSF sources will be much easier to integrate.
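In practice, a “fork” often amounts to copying the standard’s field definitions and layering local additions on top. A minimal, hypothetical sketch, with made-up field names:

```python
# A (simplified, illustrative) slice of a standard schema's fields.
BASE_SCHEMA_FIELDS = {
    "src_endpoint.ip": "string",
    "dst_endpoint.ip": "string",
    "time": "timestamp",
}

# The fork: keep the base as a starting point, add company-specific
# fields the upstream standard will never carry.
FORKED_SCHEMA_FIELDS = {
    **BASE_SCHEMA_FIELDS,
    "internal.business_unit": "string",   # local addition
    "internal.asset_criticality": "int",  # local addition
}
```

Because the fork is a superset of the base, any source already mapped to the standard drops in unchanged; only the local fields need custom handling.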

Consider Elastic Common Schema as well

Most enterprise security teams I talk to are either using their own in-house schema or have opted to use or fork the Elastic Common Schema (ECS). As a standard it’s been around longer than OCSF and is more mature and stable.

Recently, ECS was donated to OpenTelemetry (OTel). In the long run this makes the standard more open and less under the influence of a single vendor. It also means that ECS and the existing OpenTelemetry Semantic Conventions will be merged. Given OTel’s massive adoption and industry acceptance as a dominant standard for observability, adopting a future version of ECS could provide a gateway to a much larger ecosystem of tooling and observability data. Still, these plans are all on the roadmap, and in the short term expect a period of “under construction”.
