Building an effective EDR

Atul Kabra
5 min read · Jan 16, 2018

One of the key metrics for an effective EDR (or DFIR) solution is its ability to bring down the time required to detect threats in your organization. When it comes to detecting endpoint threats, this is traditionally achieved by collecting as much endpoint activity data as possible, to gain visibility. That data is then gathered into a central repository, where threat intelligence can be applied to it. This can be time consuming as the data sets to be compared grow in size; before long it starts to look like a "big-data" problem, with the cost of data storage becoming an inhibiting factor.

This leads to the next metric for an effective EDR: reducing the cost of security operations. This cost has multiple components, including the cost of different agents from different vendors across endpoint platforms (Windows/macOS/Linux), the cost of data storage, and the flexibility of integration for automating security alerts and correlating threat intel with endpoint data. This overall cost (and how it can be reduced without compromising efficacy) can weigh heavily in the minds of CISOs choosing an EDR solution.

To reduce the cost of cloud storage, one alternative is for the EDR solution to store the endpoint data on the endpoint itself and distribute the searches to make them more efficient. This reduces the burden on cloud storage, but it can have the side effect of impeding endpoint performance, since data collection and matching now happen on the endpoint. The impact and its cost depend on the engineering choices made in building the agent and its underlying data stores.

I remember in my past life when we migrated our EDR implementation from a memory-mapped "ring-buffer" style flat-file data store to a SQLite DB. The move was driven by angry customers who complained about how we jammed their endpoints whenever threat hunting/incident response jobs were scheduled. Naturally, a ring buffer is not the optimal data store when the queries need to be very relational, like "which files were created by this process?" or "give a list of all processes between time t1 and t2". It seemed like SQLite would solve all our problems.

Except that it didn't, with one customer getting angrier than before, because now we apparently broke their systems even when no threat hunting was going on. A deeper investigation revealed that the customer environment was a large number of VDIs provisioned from a shared NAS volume. The SQLite-based agent generated far more disk IO for the same workflow than the earlier agent. This makes sense: to make its searches fast, SQLite does a great deal of normalization and indexing as the data is pumped into it, causing additional load on storage. Even a seemingly idle Windows system is constantly generating temporary files, each of which was being tracked, hashed, and recorded in SQLite. On a single endpoint this may be insignificant, but multiply it across hundreds of VDIs, all piped through a single storage controller.
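The write amplification described above is easy to see in miniature. The sketch below (illustrative only, not the actual agent code; table and file names are invented) writes the same stream of synthetic endpoint events into a plain SQLite table and into an indexed one, and compares the resulting file sizes. The indexed database is larger because every insert also updates the index B-trees on disk, which is exactly the extra IO that hammered the shared storage controller.

```python
import os
import sqlite3
import tempfile

def db_size_after_inserts(indexed: bool, rows: int = 50000) -> int:
    """Insert synthetic 'file event' rows and return the resulting DB file size."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        con = sqlite3.connect(path)
        con.execute("CREATE TABLE events (pid INTEGER, path TEXT, ts INTEGER)")
        if indexed:
            # Indexes make relational queries fast, but every insert now
            # also writes to the index B-trees, amplifying disk IO.
            con.execute("CREATE INDEX idx_path ON events(path)")
            con.execute("CREATE INDEX idx_ts ON events(ts)")
        con.executemany(
            "INSERT INTO events VALUES (?, ?, ?)",
            ((i, f"C:/Temp/tmp_{i}.dat", i) for i in range(rows)),
        )
        con.commit()
        con.close()
        return os.path.getsize(path)
    finally:
        os.remove(path)

plain_size = db_size_after_inserts(indexed=False)
indexed_size = db_size_after_inserts(indexed=True)
print(indexed_size > plain_size)  # → True: indexing costs extra pages on disk
```

On one endpoint the difference is a few megabytes; funneled through a single storage controller for hundreds of VDIs, it adds up.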

Having spent a big part of my past life in the storage industry, I could see how the queue depths on the storage controllers must have overrun, causing packets to drop and the VDIs to malfunction.

While this may be a one-off case that I personally came across, I wouldn't be surprised if other EDR vendors have run into something similar, because it is not uncommon for enterprises to carve shared volumes from a single storage box across multiple endpoints. The flexibility of SQLite's search syntax comes at the price of additional IO and load on the storage infrastructure. Perhaps that is why most big-data architectures shy away from the relational database model, with its scaling limitations, and prefer one of the many key-value databases instead.

A more efficient EDR (or DFIR) agent would strive to be built on a model that provides the best of both: the flexibility of SQL syntax with the leanness of a key-value data store.

Enter osquery.

Developed at Facebook ("with love", as they call it), osquery is a flexible, scalable, and highly performant tool that can be used to query your endpoints to detect, investigate, and proactively hunt for various types of threats. By providing its interface as easy-to-use SQL, osquery eliminates the need to navigate vendor-specific APIs to ask questions of your endpoints. The SQL interface is built on virtual tables, providing the benefits of SQLite without the disk IO overhead. Its cross-platform availability reduces the complexity of dealing with multiple vendors for different endpoints.

On macOS and Linux, osquery ships with a kernel extension that allows for real-time event capture and for data collection that requires higher privileges. For storing events, along with the relevant operating system information at event time, osquery uses a highly performant backing store built upon RocksDB. A query-time abstraction of tables is created over the data in the backing store, allowing SQL syntax to navigate it efficiently. This gives osquery the flexibility of SQL syntax with the leanness of a key-value data store, and enables very useful security features such as File Integrity Monitoring (FIM) and process auditing.
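To make this concrete, here is a minimal osquery configuration fragment of the kind described above (the watched paths and query names are illustrative; the `file_events` and `processes` tables and the `%%` recursive-wildcard syntax come from osquery's documented schema). It schedules two SQL queries and enables FIM over a couple of directories:

```json
{
  "schedule": {
    "new_files": {
      "query": "SELECT target_path, action, time FROM file_events WHERE action = 'CREATED';",
      "interval": 300
    },
    "running_processes": {
      "query": "SELECT pid, name, path, start_time FROM processes;",
      "interval": 600
    }
  },
  "file_paths": {
    "ssh_keys": ["/home/%/.ssh/%%"],
    "etc": ["/etc/%%"]
  }
}
```

Note how the relational questions from earlier, such as "which files were created by this process?", become plain SQL against virtual tables rather than calls into a vendor-specific API.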

Being an open-source and extensible tool backed by a vibrant community of developers, osquery carries the collective wisdom of that community, which has also created an ecosystem of fleet managers around it. These can further be used to build high-performing automation and orchestration for incident management in a SOC.

Some features around the collection and storage of real-time events, which require support from the OS kernel, are not available in the base osquery release for Windows. These gaps are now filled by the PolyLogyx Extension for osquery on Windows, thereby laying the essential foundation for building an effective EDR solution not only in terms of cost but also in terms of flexibility, performance, scalability, feature set, and resource load.
