Why you can’t trust your NULLs in Osquery
The pain of empty strings in SQLite
A basic assumption of nearly any database is that the absence of data in a table is recorded as NULL
. This fundamental principle guides many of the approaches an individual would take to querying the data within. The behavior of common operators like COUNT
, ISNULL
, IFNULL
, MAX
, are predicated on the interpretation of NULL
rows.
…but, if you are using the popular open-source endpoint visibility tool osquery, proceed with caution, I am here to show you the quicksand —
Table of Contents:
- Background: What is osquery?
- Introduction: What is a
NULL
and what is an empty string? - Problems using
NULL
inWHERE
clause - Problems using
COUNT
- Problems using
AVG
and other arithmetic operators - Real
NULL
s do exist in some osquery tables - True
NULL
s break||
concatenation output - Literal
'null'
strings exist in osquery - Conclusion: Query writing strategies for avoiding empty string problems
A brief background on osquery
Osquery is an endpoint visibility tool that translates native API calls into SQLite virtual tables, allowing a Mac, Linux, or Windows computer to be queried in real-time as if it were a relational database:
osquery> SELECT name, pid, parent FROM processes LIMIT 5;+----------------+-----+--------+
| name | pid | parent |
+----------------+-----+--------+
| kernel_task | 0 | 0 |
| launchd | 1 | 0 |
| syslogd | 67 | 1 |
| UserEventAgent | 68 | 1 |
| uninstalld | 71 | 1 |
+----------------+-----+--------+
Created at and later open-sourced by Facebook, osquery is now under the ownership and governance of The Osquery Foundation, which is under The Linux Foundation. Osquery is the underlying agent for dozens of security solutions and runs on millions of devices across the globe.
If you aren’t familiar with osquery, you can get started exploring it in less than 5 minutes by installing the agent on your device:
Osquery / Downloads / Official
For those of you who are already familiar with osquery, or using another osquery endpoint manager like Kolide, feel free to dive in and get started exploring why what you thought was NULL
may not be what you expect.
Empty String != NULL
A painful lesson I learned early on in my days of writing SQL in osquery was realizing that its output is littered with empty strings.
What is an empty string, you might ask? Let’s take a look:
WITH
demonstration_table (example_name,example_data,example_datatype) AS (
VALUES
('first','foo','not empty string'),
('second','','empty string'),
('third',NULL,'NULL')
)SELECT * FROM demonstration_table;+--------------+--------------+------------------+
| example_name | example_data | example_datatype |
+--------------+--------------+------------------+
| first | foo | not empty string |
| second | | empty string |
| third | | NULL |
+--------------+--------------+------------------+
The second value ''
is what is commonly referred to as an empty string. Despite containing no content, ''
represents data.
The third value, on the other hand, is a NULL
, which means the absence of data.
While these two values (the second and third rows in the column: example_data
) look the same, they are very different, and the way the SQLite engine interprets them could not be more dissimilar.
There are many ways in which databases, including osquery, handle empty strings differently from NULL
s.
WHERE NOT NULL & WHERE IS NULL
The first example of difficulties you may encounter is in situations in which you attempt to query for data where a specific column is or is not NULL
. Let’s take the block_devices
table, for example. This table outputs a list of any connected block devices. In this case, my hard drive (with its various volumes) and some external media (thumb drives and external hard drives).
It also returns the basic mapping of these volumes to their respective parent volumes (e.g. a hard drive may have many partitions).
osquery> SELECT name, parent, label FROM block_devices;+----------------+------------+----------------------------+
| name | parent | label |
+----------------+------------+----------------------------+
| /dev/disk0 | | APPLE SSD AP1024M Media |
| /dev/disk0s1 | /dev/disk0 | EFI System Partition |
| /dev/disk0s2 | /dev/disk0 | Untitled 2 |
| /dev/disk1 | | AppleAPFSMedia |
| /dev/disk1s1 | /dev/disk1 | Macintosh HD - Data |
| /dev/disk1s2 | /dev/disk1 | Preboot |
| /dev/disk1s3 | /dev/disk1 | Recovery |
| /dev/disk1s4 | /dev/disk1 | VM |
| /dev/disk1s5 | /dev/disk1 | Macintosh HD |
| /dev/disk1s6 | /dev/disk1 | Update |
| /dev/disk1s5s1 | /dev/disk1 | com.apple.os.update-5DC2B8 |
| /dev/disk2 | | WD My Passport 25E2 Media |
| /dev/disk2s1 | /dev/disk2 | EFI System Partition |
| /dev/disk2s2 | /dev/disk2 | Untitled 2 |
| /dev/disk2s3 | /dev/disk2 | Booter |
| /dev/disk3 | | Patriot Memory Media |
+----------------+------------+----------------------------+
The parent volumes are those with no data in the parent
column (because they have no parent of their own). Let’s say we wanted to return just these parent volumes; we might attempt to do so by running the following:
osquery> SELECT name, parent, label FROM block_devices WHERE parent ISNULL;
osquery>
No results are returned! This is because, as we demonstrated in the previous section, these empty rows are not actually NULL
s, they are empty strings. Let’s amend our SQL to return the desired output:
osquery> SELECT name, parent, label FROM block_devices WHERE parent = '';+------------+--------+---------------------------+
| name | parent | label |
+------------+--------+---------------------------+
| /dev/disk0 | | APPLE SSD AP1024M Media |
| /dev/disk1 | | AppleAPFSMedia |
| /dev/disk2 | | WD My Passport 25E2 Media |
| /dev/disk3 | | Patriot Memory Media |
+------------+--------+---------------------------+
Changing our WHERE
clause to: WHERE parent = ''
results in the output we expect. This basic example lays the foundation for all of the examples to follow.
NULL != NULL
Though we didn’t return NULL
values in our previous example it is important to point out the significance of the ISNULL
operator.
A nuance of true NULL
values is that NULL
does not equal itself and NULL
is not LIKE
itself either. You may wonder what that means, let’s see it in practice below:
SELECT
1=1,
NULL LIKE NULL,
NULL LIKE '%',
NULL = NULL,
NULL ISNULL;+-----+----------------+---------------+-------------+-------------+
| 1=1 | NULL LIKE NULL | NULL LIKE '%' | NULL = NULL | NULL ISNULL |
+-----+----------------+---------------+-------------+-------------+
| 1 | | | | 1 |
+-----+----------------+---------------+-------------+-------------+
You must use the operator ISNULL
to find real NULL
values.
Inconsistencies of behavior when aggregating
Many of the issues with empty strings stem from the way they are treated when performing aggregation operations.
COUNT treats empty strings as NOT NULL
When the COUNT
operator is used on a column, it records a count of any rows which are not NULL
. We can see this behavior using the query below:
WITH
demonstration_table (example_data,example_datatype) AS (
VALUES
('foo','not empty string'),
('','empty string'),
(NULL,'NULL')
)SELECT COUNT(example_data) FROM demonstration_table;+---------------------+
| COUNT(example_data) |
+---------------------+
| 2 |
+---------------------+
While we would expect to see a count of 1
, counting only the row containing the example data: 'foo'
, our empty string is counted alongside 'foo'
as the presence of row data.
We have to be mindful of that when looking for something based on the presence of data in a WHERE
statement. For example, WHERE path NOT NULL
may not return the output we expect.
AVG and other aggregation
The confusion of empty strings vs NULL
extends into any SQLite arithmetic due to the different ways each is handled in aggregation functions.
Let’s take a look at the AVG
function which averages the contents of the data it operates on.
We will do a basic example first to verify the behavior:
WITH
example_data (data_type,sample_data) AS (
VALUES
('integer',50),
('integer',100))SELECT AVG(sample_data) FROM example_data;+------------------+
| AVG(sample_data) |
+------------------+
| 75.0 |
+------------------+
As we expect: (50+100)/2 = 75
But what happens when we throw a NULL
row in the mix?
WITH
example_data (data_type,sample_data) AS (
VALUES
('true-null',NULL),
('integer',50),
('integer',100))SELECT AVG(sample_data) FROM example_data;+------------------+
| AVG(sample_data) |
+------------------+
| 75.0 |
+------------------+
The absence of data is not data, so the NULL
row is ignored when calculating the average of the values. Let’s see what happens when we substitute our empty string ''
for NULL
:
WITH
example_data (data_type,sample_data) AS (
VALUES
('empty string',''),
('integer',50),
('integer',100))SELECT AVG(sample_data) FROM example_data;+------------------+
| AVG(sample_data) |
+------------------+
| 50.0 |
+------------------+
The empty string is interpreted as 0
by SQLite leading to it being calculated as part of our average and subsequently producing a value, we may have otherwise not expected.
Empty strings being interpreted as 0
may produce other inconsistencies you do not expect:
value * NULL = NULL
value * empty string = 0
Let’s see that in osqueryi:
WITH
example_data (value_1,value_2,data_type) AS (
VALUES
(2,'','empty string'),
(2,NULL,'NULL'),
(2,50,'integer'),
(2,100,'integer')
)
SELECT
value_1,
value_2,
data_type,
(value_1 * value_2) AS multiplied_output
FROM example_data;+---------+---------+--------------+-------------------+
| value_1 | value_2 | data_type | multiplied_output |
+---------+---------+--------------+-------------------+
| 2 | | empty string | 0 |
| 2 | | NULL | |
| 2 | 50 | integer | 100 |
| 2 | 100 | integer | 200 |
+---------+---------+--------------+-------------------+
Boolean interpretation
Boolean values of TRUE
and FALSE
are interpreted by SQLite as 1
and 0
respectively. This means that an average of boolean data is susceptible to the same problems mentioned above if empty strings are present:
WITH
example_data (data_type,sample_data) AS (
VALUES
('true-null',NULL),
('boolean-true',TRUE),
('boolean-false',FALSE),
('empty string','')
)
SELECT AVG(sample_data) FROM example_data;+-------------------+
| AVG(sample_data) |
+-------------------+
| 0.333333333333333 |
+-------------------+
Complicating things further: Boolean fields may actually be Tri-State
Many tables in osquery have columns that appear to encode boolean (TRUE
,FALSE
) data. However, the prevalence of empty strings complicates the utility of these columns.
Let’s take a look at the chrome_extensions
table to understand why this might cause us some confusion:
SELECT name, state
FROM users CROSS JOIN chrome_extensions USING(uid);+----------------------------------------------+-------+
| name | state |
+----------------------------------------------+-------+
| Gmail | 1 |
| Application Launcher For Drive (by Google) | 1 |
| Advanced REST client | 1 |
| The Camelizer | 0 |
| EditThisCookie | 1 |
| Privacy Badger | 1 |
| React Developer Tools | 1 |
| 1Password extension (desktop app required) | 1 |
| Take Webpage Screenshots Entirely - FireShot | 1 |
| Endpoint Verification | 1 |
| Chrome Web Store Payments | 1 |
| Google Drive | 1 |
| Chrome Media Router | |
| Chrome Media Router | 1 |
| Application Launcher For Drive (by Google) | 0 |
| Google Docs Offline | 1 |
| Chrome Web Store Payments | 1 |
| Sheets | 1 |
| Gmail | 1 |
| Adobe Acrobat | 0 |
If we had naively assumed the state
column was a guaranteed boolean, we would have been mistaken and missed our Chrome Media Router
row.
And again, that empty row? You guessed it, empty string:
SELECT name,state
FROM users CROSS JOIN chrome_extensions USING(uid)
WHERE state = '';+---------------------------------+-------+
| name | state |
+---------------------------------+-------+
| Chrome Media Router | |
+---------------------------------+-------+
If we weren’t paying attention, it would be easy to get caught by this inconsistent data.
WITH
chrome_extensions_sum AS (
SELECT
COUNT(*) AS total,
SUM(CASE WHEN state = TRUE THEN 1 END) AS enabled,
SUM(CASE WHEN state = FALSE THEN 1 END) AS disabled
FROM users CROSS JOIN chrome_extensions USING(uid)
)
SELECT *,
enabled + disabled AS enabled_plus_disabled
FROM chrome_extensions_sum;+-------+---------+----------+-----------------------+
| total | enabled | disabled | enabled_plus_disabled |
+-------+---------+----------+-----------------------+
| 53 | 44 | 8 | 52 |
+-------+---------+----------+-----------------------+
Our reliance on the TRUE
, FALSE
operators to evaluate this column do not account for the possibility of an empty string.
Sometimes osquery uses NULL? 🤷
Like so many things in life, there are few hard and fast rules. While there are a litany of examples of osquery tables with columns that encode empty strings instead of NULL
s; there are also outliers.
For example, the socket
column in the process_open_sockets
table correctly encodes NULL
values:
SELECT
pid, socket, remote_address, local_port, path, state
FROM process_open_sockets
WHERE remote_address = '54.230.75.76';+------+--------+----------------+------------+------+-------------+
| pid | socket | remote_address | local_port | path | state |
+------+--------+----------------+------------+------+-------------+
| 1464 | | 54.230.75.76 | 49501 | | ESTABLISHED |
+------+--------+----------------+------------+------+-------------+SELECT
pid, socket, remote_address, local_port, path, state
FROM process_open_sockets
WHERE remote_address = '54.230.75.76'
AND socket ISNULL;+------+--------+----------------+------------+------+-------------+
| pid | socket | remote_address | local_port | path | state |
+------+--------+----------------+------------+------+-------------+
| 1464 | | 54.230.75.76 | 49501 | | ESTABLISHED |
+------+--------+----------------+------------+------+-------------+
Finally! A bit of expected behavior!
🤨 Wait…
The path
column in that result above also has nothing.
🤔 🤔 🤔
Certainly, it will also be a true NULL
, let’s verify…
SELECT
COUNT(*)
FROM process_open_sockets
WHERE remote_address = '54.230.75.76'
AND path ISNULL;+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
😑 Welp, it was a good shot. So much for that moment of hope.
But! It does reveal a characteristic that can help us avoid ambiguity when writing our queries.
Patterns which you can expect to observe:
- Columns that are encoded as a
TEXT
field will typically use empty strings in place ofNULL
; - Columns that are numeric (eg.
INT
,BIGINT
,DOUBLE
) will typically encode the absence of data asNULL
.
Be careful concatenating results with true NULLs
One peculiarity which you should be aware of when using osquery is that the common concatenate operator ||
will not produce output when you concatenate a NULL
with any other string.
You can work around this behavior by instead relying on the CONCAT
operator.
Let’s see the difference in output in osqueryi below:
WITH
example_data (data_type,sample_data) AS (
VALUES
('true-null',NULL),
('boolean-true',TRUE),
('boolean-false',FALSE),
('empty string','')
)
SELECT
CONCAT(data_type,',',sample_data) AS concat_example,
(data_type || ',' || sample_data) AS double_pipe_example
FROM example_data;
+-----------------+---------------------+
| concat_example | double_pipe_example |
+-----------------+---------------------+
| true-null, | |
| boolean-true,1 | boolean-true,1 |
| boolean-false,0 | boolean-false,0 |
| empty string, | empty string, |
+-----------------+---------------------+
As we can see above, the true NULL
row produces no concatenated output in the ||
generated column. If we weren’t rigorously testing our output, we might have been unaware that we were dropping data on the floor!
When ‘null’
is not NULL
Sadly, even with the rules above, there are stranger examples of things happening in query outputs.
What about when 'null'
is not actually NULL
…
Let’s take a look at the firefox_addons
table.
SELECT name, creator
FROM users CROSS JOIN firefox_addons USING(uid);
+---------------------------------+-------------------+
| name | creator |
+---------------------------------+-------------------+
| Adobe Acrobat | Adobe Systems Inc.|
| Form Autofill | null |
| Firefox Screenshots | Mozilla |
| Web Compatibility Interventions | null |
| WebCompat Reporter | Thomas Wisniewski |
| DoH Roll-Out | null |
| Picture-In-Picture | null |
| Google | null |
| Amazon.com | null |
| Bing | null |
| DuckDuckGo | null |
| eBay | null |
| Wikipedia (en) | null |
| System theme | Mozilla |
| Dark | Mozilla |
| Light | Mozilla |
| Firefox Alpenglow | null |
| Reset Search Defaults | null |
| Add-ons Search Detection | null |
| LastPass: Free Password Manager | LastPass |
| 1Password – Password Manager | AgileBits |
+---------------------------------+-------------------+
Oh no! This is not at all what we expect to see. There are rows with the literal text string 'null'
.
Let’s take a look at the data source osquery uses to populate the firefox_addons
table, a JSON file:
~/Library/Application Support/Firefox/Profiles/%.default/extensions.json
While the value is correctly encoded as NULL
in the JSON file, somehow, this data is being parsed as a literal string 'null'
instead.
This is a fantastic example of another general rule that you should keep in mind when working with osquery:
Data is populated from disparate sources and broadly speaking, osquery relays these sources with minimal consideration for whether they are misformatted, misencoded, or inconsistent.
Never forget:
osquery is merely a messenger; you must check your results!
How to avoid the pitfalls of empty strings
The following three rules will help ensure you do not fall victim to any pitfalls described above.
- Consider each table’s schema (whether a column is
NUMERIC
vs.TEXT
) - Sanitize the output of your queries to ensure they cannot silently fail in ways you do not predict.
- Test your queries rigorously before blindly relying on their output.
1. Consider each table’s schema
As we discussed above in the process_open_sockets
example, there is a typical pattern in osquery that TEXT
columns will typically encode empty-strings and NUMERIC
columns will typically encode NULL
. Knowing the type of columns you are working with will help you predict how they will behave.
When writing a new query, you can spot-check the schema by referencing the osquery Schema Docs or by typing the .schema table_name
command in osqueryi:
osquery> .schema process_open_socketsCREATE TABLE process_open_sockets(
`pid` INTEGER,
`fd` BIGINT,
`socket` BIGINT,
`family` INTEGER,
`protocol` INTEGER,
`local_address` TEXT,
`remote_address` TEXT,
`local_port` INTEGER,
`remote_port` INTEGER,
`path` TEXT,
`state` TEXT,
`net_namespace` TEXT HIDDEN,
PRIMARY KEY (`pid`)) WITHOUT ROWID;
Understandably, many queries you write will call columns from more than one table. You can use another unique trick to determine the data types you are dealing with for a given query.
The .types
command will give you the datatypes from all the columns present in your query’s output:
osquery> .types SELECT * FROM users, logged_in_users WHERE username = user;+-------------+---------+
| name | type |
+-------------+---------+
| uid | BIGINT |
| gid | BIGINT |
| uid_signed | BIGINT |
| gid_signed | BIGINT |
| username | TEXT |
| description | TEXT |
| directory | TEXT |
| shell | TEXT |
| uuid | TEXT |
| is_hidden | INTEGER |
| type | TEXT |
| user | TEXT |
| tty | TEXT |
| host | TEXT |
| time | BIGINT |
| pid | INTEGER |
+-------------+---------+
2. Sanitize the output of your queries
Due to the sometimes unpredictable nature of NULL
vs empty string, it is often best to take a multi-pronged approach and try to sanitize any strings you wish to perform NULL
dependent operations (COUNT
etc.) on.
Thankfully, SQLite and, by extension, osquery support the NULLIF
operator. This is more or less a shortcut to a CASE
statement, but it is still handy to know. You supply the column name you wish to nullify empty strings for and then declare ''
as the rule for nullification. We can take any of our earlier examples and nullify the problematic rows using this approach:
SELECT name, NULLIF(creator,'null') AS creator
FROM users CROSS JOIN firefox_addons USING(uid);+---------------------------------+-------------------+
| name | creator |
+---------------------------------+-------------------+
| Adobe Acrobat | Adobe Systems Inc.|
| Form Autofill | |
| Firefox Screenshots | Mozilla |
| Web Compatibility Interventions | |
| WebCompat Reporter | Thomas Wisniewski |
| DoH Roll-Out | |
| Picture-In-Picture | |
| Google | |
| Amazon.com | |
| Bing | |
| DuckDuckGo | |
| eBay | |
| Wikipedia (en) | |
| System theme | Mozilla |
| Dark | Mozilla |
| Light | Mozilla |
| Firefox Alpenglow | |
| Reset Search Defaults | |
| Add-ons Search Detection | |
| LastPass: Free Password Manager | LastPass |
| 1Password – Password Manager | AgileBits |
+---------------------------------+-------------------+
Let’s verify that our COUNT
function is now working the way we anticipate:
SELECT
COUNT(NULLIF(creator,'null')) AS creator_present_count
FROM users CROSS JOIN firefox_addons USING(uid);+-----------------------+
| creator_present_count |
+-----------------------+
| 8 |
+-----------------------+
So there we have it! If there is a table where you suspect empty strings are being recorded in place of NULL
all you have to do is make sure to nullify that column using the NULLIF
operator or to wrap your aggregated column inside a NULLIF
.
3. Test your queries rigorously
Check twice, then check again.
The last point cannot be emphasized enough; you should never assume osquery is just going to ‘do the right thing.’
Independently verify the output of your queries in osqueryi on representative devices and supply sample data (including empty strings) using CTEs if you want to play it extra safe.
Never assume that just because output looks consistent at a glance that it will always remain so across a diverse fleet of endpoints, the chrome_extensions
example earlier demonstrated how even a single row might be missing a given column.
—
Further Reading
Interested in more how-to guides on osquery? I recommend reading some of my other posts:
- Pivoting registry and plist data from EAV rows into columns using SQLite
- Running osquery as sudo(root) vs. user
- Spotlight search across every Mac in your fleet using osquery
Kolide is hiring!
Want to write cool queries like this and get paid? Kolide just raised 17MM in Series B funding, and we’re hiring! HMU in the comments or visit our recruiting link below:
About the Author:
Fritz Ifert-Miller is the UX designer for Kolide, an endpoint visibility platform that prioritizes the tenets of Honest Security. Fritz is responsible for Kolide’s design, user experience, and researching innovative ways to collect better ground truth from devices using osquery.