On the value of “Wax On, Wax Off”
“You don’t need IT fundamentals or programming skills to learn cybersecurity.”
“You don’t need to have a degree to be a software engineer.”
While it’s noble to promote lowering the barriers to entry within desperately understaffed STEM fields (such as cybersecurity), it seems that beneath these well-intended words often lies an undercurrent of anti-intellectualism. The sentiment is that we don’t really need all that theory the eggheads busy themselves with: learn the practical skills and knowledge you’ll need on the job, and you’ll be able to fill one of those hot entry-level tech jobs.
While there are some environments like Security Operations Centers (SOCs) that can nurture talent from an entry level (I myself am going through this route), there will come moments where the problem or task at hand cannot simply be solved by warm bodies and enthusiasm — especially when dealing with large datasets. You can try to manually carve away at the work little by little, but this approach will fail when a situation is time-sensitive or you already have a lot of critical work occurring on the day to day.
Solving these problems at scale necessitates knowledge and expertise that you often cannot brute force or Google on the spot. These cases highlight that formal education and theory are not useless; rather, we have to keep our eyes open for opportunities to apply them.
“If do right, no can defense”
Suppose that we work in a consultancy that monitors network traffic for a large number of clients. Once in a blue moon, one of our intelligence sources sends an enormous dataset which we must check against a database of partner-owned public IP ranges.
The task is challenging not just because of the number of records to verify, but also because of the size of the database that it must be checked against. Since we have records of ranges, as opposed to individual addresses, associating an IP with the client it belongs to is not a simple exact-match lookup.
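To make the range problem concrete, here is a minimal sketch of checking one address against one range. The sample ranges and the helper names (ip_to_int, in_range) are my own illustrative assumptions, not the actual schema; the addresses come from blocks reserved for documentation.

```python
import ipaddress

# Hypothetical sample data: documentation-reserved ranges, not real client records.
PARTNER_RANGES = [
    ("client-a", "192.0.2.0", "192.0.2.127"),
    ("client-b", "198.51.100.0", "198.51.100.255"),
]


def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its integer form."""
    return int(ipaddress.IPv4Address(ip))


def in_range(ip: str, start: str, end: str) -> bool:
    """Return True when ip lies between start and end, inclusive."""
    return ip_to_int(start) <= ip_to_int(ip) <= ip_to_int(end)


if __name__ == "__main__":
    for client, start, end in PARTNER_RANGES:
        print(client, in_range("192.0.2.5", start, end))
```

Converting addresses to integers turns "does this IP belong to this client?" into two numeric comparisons, which is what every approach below builds on.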
Manual Divide & Conquer
Our first “naive” approach is to divide the Excel sheet up into chunks for a few analysts to work on one at a time. This would be quite inefficient due to the speed of manual checking, as well as the “opportunity cost” of taking our analysts away from other tasks.
Simple Python Automation
Using Python libraries for performing SQL queries, we could create a function for automating the address checking steps. This could then be scripted to run as a loop over the records in the dataset, working in the background while we did our usual work.
Despite being a good step up from performing the work manually, this approach is still weak in terms of both time and resources — it would take a long time to complete because a separate query has to be run for each record, and may put strain on database resources.
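A rough sketch of this per-record approach is below. It uses an in-memory SQLite database as a stand-in for the real one; the table name ip_ranges, its columns, and the sample rows are all assumptions made for illustration.

```python
import ipaddress
import sqlite3


def to_int(ip: str) -> int:
    return int(ipaddress.IPv4Address(ip))


def build_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory table of hypothetical partner ranges."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ip_ranges (client TEXT, start_int INTEGER, end_int INTEGER)"
    )
    rows = [
        ("client-a", to_int("192.0.2.0"), to_int("192.0.2.127")),
        ("client-b", to_int("198.51.100.0"), to_int("198.51.100.255")),
    ]
    conn.executemany("INSERT INTO ip_ranges VALUES (?, ?, ?)", rows)
    return conn


def lookup_client(conn: sqlite3.Connection, ip: str):
    """Run one query for one address; return the owning client or None."""
    row = conn.execute(
        "SELECT client FROM ip_ranges WHERE ? BETWEEN start_int AND end_int LIMIT 1",
        (to_int(ip),),
    ).fetchone()
    return row[0] if row else None


def check_all(conn: sqlite3.Connection, ips):
    """Loop over the dataset, issuing a separate query per record."""
    return {ip: lookup_client(conn, ip) for ip in ips}


if __name__ == "__main__":
    conn = build_demo_db()
    print(check_all(conn, ["192.0.2.10", "203.0.113.9"]))
```

The loop in check_all is where the cost hides: a dataset of a million addresses means a million round trips to the database.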
Offline Copy & Linear Search
To lessen the load on the database, we could create a variant which saves a sorted copy of the table in memory, then performs an “offline” search for the IP ranges. Despite this making our IT team a little happier, there are still considerable performance problems: in the worst-case scenario, where an IP address doesn’t belong to any recorded range, the function still has to iterate all the way to the end and check every record (linear search).
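As a sketch of the offline variant, assume the table has already been pulled into a Python list of (start, end, client) tuples sorted by start address. The sample rows and the name linear_lookup are hypothetical.

```python
import ipaddress


def to_int(ip: str) -> int:
    return int(ipaddress.IPv4Address(ip))


# Hypothetical in-memory copy of the table, sorted by range start.
RANGES = sorted(
    [
        (to_int("198.51.100.0"), to_int("198.51.100.255"), "client-b"),
        (to_int("192.0.2.0"), to_int("192.0.2.127"), "client-a"),
    ]
)


def linear_lookup(ip: str, ranges=RANGES):
    """Check every range in turn; a miss costs len(ranges) checks."""
    ip_int = to_int(ip)
    for start, end, client in ranges:
        if start <= ip_int <= end:
            return client
    return None


if __name__ == "__main__":
    print(linear_lookup("192.0.2.10"))
    print(linear_lookup("203.0.113.9"))
```

No queries hit the database anymore, but an address that matches nothing still walks the whole list.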
“Vectorizing” With Pandas
By using the Pandas library, which is built on top of NumPy (a library for fast array and matrix operations), we can replace element-by-element iteration with “vector operations” that process an entire column in a single call. While this method does not cut down on the amount of work (we still need to perform operations on all elements), it greatly reduces the time required to find the appropriate record, because the comparisons run in optimized compiled code rather than in a Python loop.
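Here is a minimal sketch of what the vectorized check might look like, assuming the ranges sit in a DataFrame with integer start and end columns. The column names and sample rows are my own assumptions.

```python
import ipaddress

import pandas as pd


def to_int(ip: str) -> int:
    return int(ipaddress.IPv4Address(ip))


# Hypothetical table of partner ranges stored as integers.
ranges = pd.DataFrame(
    {
        "client": ["client-a", "client-b"],
        "start": [to_int("192.0.2.0"), to_int("198.51.100.0")],
        "end": [to_int("192.0.2.127"), to_int("198.51.100.255")],
    }
)


def vectorized_lookup(ip: str, table: pd.DataFrame = ranges):
    """Compare one address against every range with two column-wide operations."""
    ip_int = to_int(ip)
    mask = (table["start"] <= ip_int) & (table["end"] >= ip_int)
    matches = table.loc[mask, "client"]
    return matches.iloc[0] if not matches.empty else None


if __name__ == "__main__":
    print(vectorized_lookup("198.51.100.42"))
```

Every row is still compared, so the work remains proportional to the table size; the gain comes from doing those comparisons in bulk instead of one Python iteration at a time.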
We can go a step further in our optimization by using binary search to cut down on the number of operations performed. On sorted data, this approach compares the value being searched for with the middle element, discards the half that cannot contain it, and repeats on the remaining half.
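One way to sketch this in Python is with the standard library’s bisect module. This assumes the ranges are sorted by start address and do not overlap, which may not hold for a real table; the sample data and the name binary_lookup are hypothetical.

```python
import bisect
import ipaddress


def to_int(ip: str) -> int:
    return int(ipaddress.IPv4Address(ip))


# Hypothetical ranges, assumed sorted by start and non-overlapping.
RANGES = sorted(
    [
        (to_int("192.0.2.0"), to_int("192.0.2.127"), "client-a"),
        (to_int("198.51.100.0"), to_int("198.51.100.255"), "client-b"),
    ]
)
STARTS = [start for start, _end, _client in RANGES]


def binary_lookup(ip: str):
    """Find the last range starting at or before ip, then check its end."""
    ip_int = to_int(ip)
    idx = bisect.bisect_right(STARTS, ip_int) - 1
    if idx >= 0:
        _start, end, client = RANGES[idx]
        if ip_int <= end:
            return client
    return None


if __name__ == "__main__":
    print(binary_lookup("192.0.2.10"))
    print(binary_lookup("192.0.2.200"))
```

Because the ranges do not overlap, only one candidate range can contain the address, so a single binary search over the start values is enough.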
While linear search had to check N records in the worst case, binary search needs only about log base 2 of N. More simply put, if we had 2²⁰ records (1,048,576), we would need roughly 20 checks with binary search (21 at the very most), as opposed to over a million with linear search.
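To see the difference in numbers, here is a small hand-rolled binary search that counts its own comparisons. It is only a counting aid (binary_search_steps is a hypothetical helper), run against a plain sorted sequence of integers rather than real IP data.

```python
import math


def binary_search_steps(sorted_values, target):
    """Return (index or -1, number of middle-element checks performed)."""
    lo, hi = 0, len(sorted_values) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:
            lo = mid + 1  # target can only be in the right half
        else:
            hi = mid - 1  # target can only be in the left half
    return -1, steps


if __name__ == "__main__":
    n = 2**20
    values = range(n)  # already sorted: 0, 1, ..., n - 1
    _, steps = binary_search_steps(values, n)  # n is not present: a miss
    print("records:", n)
    print("binary search checks for a miss:", steps)
    print("upper bound floor(log2(n)) + 1:", math.floor(math.log2(n)) + 1)
```

A linear search for the same missing value would have to touch all 1,048,576 records before giving up.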
The concepts and considerations laid out from linear search onward were informed by various computer science courses I’ve taken over the past few years. I believe that without this perspective, the average analyst with some programming ability would probably get stuck somewhere around the offline linear search approach.
As I’ve argued before, faster laptops and cloud resources are not a cure-all. Cloud resources, though scalable, are not free. More importantly, inefficiently designed algorithms and programs cause us to do more work or spend more time than is needed; you cannot simply throw more cash at the program to make it go faster.
You may not need a formal education to get started in tech, but it certainly does not hurt to pick it up later. Why reinvent the wheel, or worse, chronically do things sub-optimally, when you can absorb knowledge that others have already painstakingly developed? Although college-style courses may feel stifling compared to gamified platforms like TryHackMe, I personally advocate for taking formal classes like Data Structures and Algorithms, as I have seen the value of such topics in my own career.