Auditing with osquery: Part Two — Configuration and Implementation

In part one of this series, we covered the basics of the Linux Audit Framework. In this post, we will be focusing on the osquery auditing implementation details. In addition to configuring auditing, we will provide readers with strategies for reducing the performance impact and logging volume that accompanies most auditing configurations.

Enabling auditing with osquery

Command-line flags are the primary method for changing core behaviors in osquery. The audit-related flags for a basic configuration are listed below:

  • --audit_allow_sockets=true
    Tells osquery to record network connections (bind() and connect() syscalls)
  • --audit_allow_process_events=true
    Tells osquery to record process executions (execve() syscall)
  • --audit_allow_config=true
    Tells osquery that it is allowed to change audit config options. You may want to disable this if you’re using custom audit rules.
  • --audit_persist=true
    Tells osquery to regain access to the netlink socket if it loses the connection
  • --disable_audit=false
    Tells osquery to enable audit
  • --events_expiry=1
    Tells osquery to expire events after a single SELECT
  • --events_max=500000
    Tells osquery to buffer 500,000 events between SELECTs defined by the query interval
  • --logger_plugin=filesystem
    Tells osquery to log to the filesystem
  • --watchdog_memory_limit=350
    Tells osquery to raise the watchdog memory limit to 350 MB

These command-line flags are commonly saved to /etc/osquery/osquery.flags.
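As a minimal sketch, the flags listed above can be written to a flagfile with a heredoc. FLAGFILE is a hypothetical variable that defaults to a scratch path in the current directory, so the result can be reviewed before it is copied into place.

```shell
# Sketch: write the audit-related flags from this post to a flagfile.
# FLAGFILE is a hypothetical variable; it defaults to a scratch path so the
# file can be reviewed before it is copied to /etc/osquery/osquery.flags.
FLAGFILE="${FLAGFILE:-./osquery.flags}"

cat > "$FLAGFILE" <<'EOF'
--audit_allow_sockets=true
--audit_allow_process_events=true
--audit_allow_config=true
--audit_persist=true
--disable_audit=false
--events_expiry=1
--events_max=500000
--logger_plugin=filesystem
--watchdog_memory_limit=350
EOF

echo "Wrote $(grep -c '^--' "$FLAGFILE") flags to $FLAGFILE"
```

Once reviewed, the file can be copied into place with something like sudo cp ./osquery.flags /etc/osquery/osquery.flags.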

Aside from the command-line flags, you'll want to ensure you are querying the process_events and socket_events tables within your osquery.conf configuration file. 
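A minimal sketch of such a configuration file is shown below. The CONF_FILE variable, the query names, the chosen columns, and the 10-second interval are illustrative assumptions on our part, not values taken from the Palantir configuration.

```shell
# Sketch: write a minimal osquery.conf that schedules one query per event
# table. CONF_FILE, the query names, the column lists, and the interval are
# illustrative assumptions; adjust them to your own requirements.
CONF_FILE="${CONF_FILE:-./osquery.conf}"

cat > "$CONF_FILE" <<'EOF'
{
  "schedule": {
    "process_events": {
      "query": "SELECT pid, parent, path, cmdline, uid, time FROM process_events;",
      "interval": 10
    },
    "socket_events": {
      "query": "SELECT action, pid, path, remote_address, remote_port, local_port, time FROM socket_events;",
      "interval": 10
    }
  }
}
EOF

echo "Wrote $CONF_FILE"
```

The file is written to a scratch path first so it can be reviewed and then copied to /etc/osquery/osquery.conf.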

Palantir has open-sourced our osquery configuration here:

Logging considerations

Choose specific columns to log from each table

Although it may be tempting to simply use a wildcard in your queries (e.g. SELECT * FROM process_events and SELECT * FROM socket_events), consider how much space this data will consume as it scales to 10, 1,000, or even 10,000 hosts. If selecting only the columns you need reduces the average process and socket event size by 25%, your overall osquery audit logging volume shrinks by the same 25%. Be selective in the data that you collect to minimize the log volume generated by osquery.
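To make the scaling concrete, here is a back-of-the-envelope estimate. Every number in it (events per host per day, bytes per event) is a hypothetical placeholder rather than a measurement, and estimate_gb_per_day is a helper we made up for illustration.

```shell
# Rough daily log volume estimate. All inputs are hypothetical placeholders;
# substitute numbers measured on your own fleet.
estimate_gb_per_day() {  # usage: estimate_gb_per_day HOSTS EVENTS BYTES REDUCTION
  awk -v h="$1" -v e="$2" -v b="$3" -v r="$4" 'BEGIN {
    full = h * e * b / 1e9   # GB per day when logging every column
    printf "%d hosts: %.1f GB/day with all columns, %.1f GB/day trimmed\n", h, full, full * (1 - r)
  }'
}

# Assumed: 100,000 events per host per day, 800 bytes per event, 25% savings.
for HOSTS in 10 1000 10000; do
  estimate_gb_per_day "$HOSTS" 100000 800 0.25
done
```

The percentage saved is the same at every fleet size, but the absolute savings grow linearly with the number of hosts.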

Filter out extraneous events using SQL

Consider whether you need to record process executions from each and every binary on the system. Many tools are run very frequently but provide little security or forensic value, for instance awk, sed, tr, and cut.
 
Simply creating a query like the following has a negligible performance impact on osquery and can save massive amounts of log volume: 
SELECT * FROM process_events WHERE path NOT IN ('/usr/bin/awk', '/usr/bin/sed', '/usr/bin/tr');
 
Note: As mentioned in part one of this blog post, excluding events in this way does not lower the performance overhead. osquery still has to process the events that are filtered out via SQL; it simply does not log them.
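If the exclusion list grows, it can be easier to generate the query than to edit it by hand. The following is a minimal sketch; build_exclusion_query is a hypothetical helper, and it assumes the paths contain no single quotes.

```shell
# Sketch: build the NOT IN query from a list of binary paths.
# build_exclusion_query is a hypothetical helper; it does not escape
# single quotes, so it assumes ordinary file paths.
build_exclusion_query() {
  local list="" path
  for path in "$@"; do
    # Append each path as a quoted SQL string, comma-separated.
    list="${list:+$list, }'$path'"
  done
  printf "SELECT * FROM process_events WHERE path NOT IN (%s);\n" "$list"
}

build_exclusion_query /usr/bin/awk /usr/bin/sed /usr/bin/tr /usr/bin/cut
```

The resulting string can be pasted into the query field of the relevant scheduled query in osquery.conf.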

Only record successful socket events

The socket_events table contains a column titled “success” which indicates whether a connection was successful or not. Depending on your log volume requirements, it may be worth determining what percentage of your socket events are successful vs. unsuccessful and potentially filtering out all unsuccessful connections. Obviously, this is a tradeoff between visibility and logging volume.

Monitor top 10 processes by count

At Palantir, we maintain an auditing dashboard to keep tabs on the processes that generate the most logs. If we find that specific processes are generating orders of magnitude more events than others on our systems, we are able to take action and filter them with a one-line change.
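Without a dashboard, a rough equivalent can be computed on a single host from the results log. This is a sketch only: top_paths is a hypothetical helper, and it assumes the filesystem logger's compact JSON-lines output in which each logged event carries a "path" column.

```shell
# Sketch: count logged events per executable path and show the top 10.
# Assumes one JSON object per line with a "path":"..." column, as written
# by the filesystem logger; top_paths is a hypothetical helper.
top_paths() {
  grep -o '"path":"[^"]*"' "$1" | sort | uniq -c | sort -rn | head -n 10
}

# Example invocation (the usual default log location; adjust if yours differs):
#   top_paths /var/log/osquery/osqueryd.results.log
```

Paths that dominate this list are the natural candidates for the SQL filter described earlier.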

Troubleshooting auditing

If you’re having trouble getting osquery to record process and socket events, review each item on the checklist:

  • osquery is installed and is being run as root
  • The proper flags (documented in this post) are present in osquery.flags
  • Queries against the process and/or socket events tables are present in osquery.conf
  • osquery.flags and osquery.conf are in the correct location (/etc/osquery on Linux)
  • osqueryd is running
  • The pid listed in the output of auditctl -s is mapped to an osqueryd process and does not change
  • No other audit consumer (auditd/go-audit/auditbeat) is running concurrently
  • At least one syscall rule is listed in the output of auditctl -l
  • If using syslog, it is configured to allow large messages and is not rate limiting
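Part of this checklist can be scripted. The sketch below covers only a few of the items; the check, is_root, and run_checklist helpers are hypothetical names, and the paths assume the default Linux locations mentioned in this post.

```shell
# Sketch: automate a few checklist items. Helper names are hypothetical and
# the paths assume the default /etc/osquery location on Linux.
check() {  # usage: check "description" command [args...]
  local desc="$1"; shift
  if "$@" >/dev/null 2>&1; then echo "OK   $desc"; else echo "FAIL $desc"; fi
}

is_root() { [ "$(id -u)" -eq 0 ]; }

run_checklist() {
  check "running as root" is_root
  check "osquery.flags present" test -f /etc/osquery/osquery.flags
  check "osquery.conf present" test -f /etc/osquery/osquery.conf
  check "osqueryd is running" pgrep -x osqueryd
  check "auditd is not running" sh -c '! pgrep -x auditd'
  check "a syscall rule is loaded" sh -c 'auditctl -l | grep -q -- "-S"'
}

run_checklist
```

Run it as root, since auditctl needs root privileges; any FAIL line points at the checklist item to investigate first.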

If you’ve verified all of those items, there are two primary troubleshooting strategies that we use:

  1. Stop any running osquery process and run osqueryd as a foreground process with the --verbose flag and check the output for errors:
    (a) $ osqueryctl stop; osqueryd --flagfile=/etc/osquery/osquery.flags --verbose
  2. Stop any running osquery process and run osqueryi using the hidden --audit_debug=true flag:
    (a) $ osqueryctl stop; osqueryi --flagfile=/etc/osquery/osquery.flags --audit_debug=true --verbose
    (b) osqueryi will drop you into an interactive shell, similar to the one presented by sqlite
    (c) The --audit_debug flag will output the raw audit events that osquery receives from the netlink socket before it has processed them into an internal table. If no audit events appear, it's likely something is misconfigured.

Understanding performance implications

Auditing with osquery introduces performance overhead in two main ways:

  1. When auditing is enabled in the kernel and audit rules exist, the kernel has to perform additional operations as it compares each syscall to the list of audit rules and generates the actual audit events.
  2. The audit consumer (osquery in this case) pulls events from the kernel netlink socket, parses the event data into its internal table format, and stores that event in RocksDB. Once the event has been SELECT-ed by a query, that event is written to a log file or sent via a logging plugin.

As we mentioned earlier, osquery is only monitoring 1–3 specific syscalls depending on which auditing options are enabled. As such, each time one of these syscalls occurs in the kernel, an audit event will be generated. The more audit events there are, the more work the kernel has to do generating them and the more work osquery has to do parsing them and storing them in a database. This is a gross oversimplification of what is happening under the hood, but it is fundamental to understand.

osquery stress test

The official git repo for osquery contains a rudimentary stress test written in Python. This stress test simply creates a user-defined number of shells and sends data to a UDP socket. By using the stress test in conjunction with the perf tool, you can observe the computational overhead generated by running audit and osquery.

Note: If you are using the perf tool on a VM, be sure to enable performance counters / code profiling on the VM first.
 
Audit & osquery disabled:

# perf stat python system_stress.py -i lo -n 10
Expecting 10240 (default shell) processes
Executed 10240 (default shell) processes
Elapsed: 41.857626915
 Performance counter stats for 'python system_stress.py -i lo -n 10':
      83219.491922      task-clock (msec)         #    1.986 CPUs utilized
<snip>
41.897018727 seconds time elapsed

Audit & osquery enabled:

# perf stat python system_stress.py -i lo -n 10
Expecting 10240 (default shell) processes
Executed 10240 (default shell) processes
Elapsed: 47.5021560192
 Performance counter stats for 'python system_stress.py -i lo -n 10':
      84672.567230      task-clock (msec)         #    1.781 CPUs utilized
<snip>
47.535512135 seconds time elapsed

As you can see, the stress test took ~6 seconds longer to complete when osquery auditing was enabled. Although this is a crude and imprecise way of measuring computational overhead, it gets the point across: the stress test completes its operations faster when audit and osquery are not running.
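Expressed as a percentage, the difference between the two "Elapsed" values above works out as follows. slowdown_pct is a hypothetical helper, and the result describes this single run only.

```shell
# Relative slowdown between two elapsed times, in percent.
# slowdown_pct is a hypothetical helper: slowdown_pct BASELINE MEASURED
slowdown_pct() {
  awk -v base="$1" -v measured="$2" 'BEGIN {
    printf "%.1f\n", (measured - base) / base * 100
  }'
}

# Elapsed times taken from the two perf runs shown above.
echo "Wall-clock increase: $(slowdown_pct 41.857626915 47.5021560192)%"
```

For these two runs the increase is roughly 13.5%, which matches the ~6 second difference on a ~42 second baseline.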

Measuring syscalls generated by process

To get a basic idea of which processes are generating which syscalls, you can use a combination of auditd, ausearch, and aureport. Auditd records the audit events to a logfile, and ausearch combined with aureport does some handy parsing of the auditd logfile to display how many syscalls of a particular type were generated by each process.

# Stop osquery so it doesn’t conflict with auditd
sudo osqueryctl stop
# Clear out any existing audit rules
sudo auditctl -D
# Insert audit rules used by osquery
sudo auditctl -a always,exit -S execve
sudo auditctl -a always,exit -S bind
sudo auditctl -a always,exit -S connect
# List audit rules
sudo auditctl -l
# Remove old audit logs 
sudo sh -c 'rm -f /var/log/audit/*.log'
# Start auditd and let it run for 30 seconds
sudo service auditd start; sleep 30; sudo service auditd stop
# Delete audit rules
sudo auditctl -D
# Create a report using aureport for each syscall that we have a
# rule for
for SYSCALL in connect bind execve
do
  printf '\nReport for %s syscall\n' "$SYSCALL"
  sudo ausearch --syscall "$SYSCALL" --line-buffered --raw | aureport --executable --summary | head -15
done

A snippet of the resulting report generated after running system_stress.py is shown below:

Report for connect syscall
Executable Summary Report
=================================
total file
=================================
10240 /usr/bin/python2.7
532 /usr/sbin/aureport
14 /usr/sbin/ausearch

The primary goal of running this report is to understand how many of each syscall are generated on a particular system and to determine whether any outlier processes generate substantially more syscalls than others. These outlier processes are the most likely to cause performance issues. If logs from those processes are not important to collect, consider filtering them via audit rules.

Excluding processes using audit rules

In some cases, we suspect users have attempted to enable auditing on a system with osquery, only to notice osquery consuming an unacceptable amount of CPU. Without any obvious resolution in sight, they determine it is not the right tool for the job and abandon the idea of using it for auditing purposes.

While acknowledging that the audit framework is suboptimal in some ways, we would like to provide some options for reducing the performance impact of enabling it on systems.

One option is to create a list of custom audit rules for processes that you are not interested in monitoring. Consider the report generated by the stress test shown above. /usr/bin/python2.7 was responsible for generating 10240 connect() syscalls. We can use this binary as an example when creating an audit rule for exclusions.

The format of these “filter” audit rules is as follows:
 -a never,exit -F exe=/path/to/bin -S [all|syscall_name]

We are telling audit to never generate an audit event for the binary at /path/to/bin for either all syscalls, or a specific syscall.
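When several binaries need to be excluded, the rules can be generated rather than typed. This is a minimal sketch; exclusion_rules is a hypothetical helper that emits one rule per path in the format shown above, excluding all syscalls for each binary.

```shell
# Sketch: emit one "never" rule per binary path, in the format shown above.
# exclusion_rules is a hypothetical helper; it excludes all syscalls.
exclusion_rules() {
  local exe
  for exe in "$@"; do
    printf '%s\n' "-a never,exit -F exe=$exe -S all"
  done
}

exclusion_rules /usr/bin/python2.7
```

The output is meant to be pasted into the audit rules file ahead of the "always" rules, because audit evaluates rules in order and acts on the first match.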

Filtering noisy processes using audit rules

  1. Stop audit and osquery
  2. Edit the audit rules file (sometimes found in /etc/audit/rules.d/audit.rules)
  3. Add “filter” audit rules in the format mentioned above
  4. Add the audit rules required by osquery
  5. The resulting audit.rules file should look something like this:
# Delete any pre-existing rules
-D
# Don't generate audit events for python2.7
-a never,exit -F exe=/usr/bin/python2.7 -S all
# Increase the buffers to survive stress events.
# Make this bigger for busy systems
-b 1024
# Add rules for osquery auditing
-a always,exit -S execve
-a always,exit -S bind
-a always,exit -S connect
# Enable audit
-e 1
  6. Set the --audit_allow_config flag to false in your osquery.flags file. We no longer want osquery to manage our audit configuration because we are now using custom rules.
  7. Start osquery
  8. Run augenrules --load to consolidate rules and load them into the kernel. You may see messages about specifying an arch - they can be ignored.
  9. Run auditctl -l to list active audit rules and confirm the audit rules you added are present

Unfortunately, audit rules do not support wildcards in the exe field, so you will have to exclude binaries explicitly, one at a time. In addition to excluding binaries by path, audit also supports filtering by PID, UID, and SELinux context.

We recognize this process is a bit tricky, so we have opened the following osquery feature request to simplify this workflow: https://github.com/facebook/osquery/issues/5308

Conclusion

This two-part blog post series presents a basic understanding of the audit framework, provides direction for configuring osquery in audit mode, and presents strategies for shrinking the performance footprint that audit and audit consumers have on a system. As we look towards the future, it becomes clear that audit is quickly becoming an outdated solution and newer technologies such as eBPF are likely to become a much more performant replacement. Until that becomes a reality, we’re focused on providing the best guidance on the tools that are currently available.


Authors

Chris L.