Are data practitioners the new (security) weakest link?

Ori Abramovsky · Published in CodeX · 5 min read · Mar 2, 2022

Secrets in code

The secrets-in-code phenomenon is one of the major threats to the software world today. From Uber's 2016 data leak to the more recent SolarWinds breach, hackers leverage secrets unintentionally left in code to easily break into organizations. This is why DevSecOps is shifting left, paying more attention to the early phases of development and making sure secrets are always kept safe and never left in code. But while the focus is on the organization's main repositories, employees' personal accounts are left aside, literally becoming the new security soft spots. Why is watching employees' repositories so important?

Source code hosting services: explicitly social networks

Many tend to forget that services like GitHub and GitLab are not just places for source code hosting and collaboration but, first of all, social networks. Users can update their personal profiles and follow repositories and accounts, but they can also upload code to their personal accounts, and many forget that this code is publicly shared. Since developers are known to forget secrets in code (Sinha et al.), it is fair to assume that at least some of the code published to employees' personal (publicly visible) GitHub accounts will include secrets. How do we prevent hackers from stealing those secrets? How do we catch and mitigate a leak in time? According to Meli et al., it can take as little as 20 seconds for someone to find a secret once it has leaked to GitHub.

Mitigation becomes even more complicated for companies that encourage their employees to contribute to open source. Consider Meta, with its thousands of employees: even if the company monitored every single one of them, and even if it found a secret the minute it leaked, who could tell whether it is company-related or personal? Moreover, do purely private secrets even exist any more? Stealing an employee's private Google credentials may be enough to eventually reach their work-related secrets as well, especially during the pandemic, when everyone works from home and logs into enterprise services remotely. Not to mention contractors and personal accounts (not the company-provisioned ones, but accounts employees opened on their own) that the employer may not even be aware of. The problem gets even more complicated when we consider the scopes of different developer roles, so let's take a deeper look at data practitioners.
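To make that 20-second figure concrete, here is a minimal sketch of the kind of lookup anyone can run against GitHub's public code search API, querying for the telltale `AKIA` prefix of AWS access key IDs. The query string and the `GITHUB_TOKEN` variable are illustrative assumptions; the point is how fast discovery works, not an attack recipe.

```python
# A minimal sketch: query GitHub's public code search API for the
# telltale prefix of an AWS access key ID. Code search requires an
# authenticated call; GITHUB_TOKEN here is an assumed placeholder.
import os

import requests

def search_leaked_keys(query: str = "AKIA in:file language:python") -> None:
    """Print repositories and file paths matching a suspicious pattern."""
    response = requests.get(
        "https://api.github.com/search/code",
        params={"q": query, "per_page": 10},
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
        timeout=10,
    )
    response.raise_for_status()
    for item in response.json()["items"]:
        print(item["repository"]["full_name"], item["path"])

if __name__ == "__main__":
    search_leaked_keys()
```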

Data practitioners' code smells

One of the main differences between data employees and the rest of the R&D organization is that most data employees are probably not recruited and evaluated based on their code development skills (and even when they are, it is never the main KPI), so it is fair to assume their coding practices are of a lower standard. Since developers in general tend to forget secrets in code, it stands to reason the phenomenon will appear in this segment as well. This point is critical for two reasons: data employees commonly interact with the organization's data, which raises the odds that a leak will be severe (consider the Uber case, for example), and their ecosystem commonly includes many external services (whether training models on Google Colab, keeping datasets on Amazon S3, or reviewing analyses on Tableau dashboards), which raises the number of potential secrets to be forgotten in code.

To demonstrate this, we ran a small test: we searched GitHub for Python code (commonly used for data applications, especially among data practitioners such as data scientists, machine learning engineers, and data engineers) that connects to a popular cloud data lake service. Given the importance of such assets (they are supposed to hold almost every piece of data an organization has), we assumed the relevant snippets would handle credentials with extra care, using best practices like environment variables. The reality was very different: we found many files with plain secrets inside them, some clearly related to organizations' assets, just waiting for malicious hackers to come and leverage them.
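We are not naming the data lake service from our test, so the snippet below uses Amazon S3 via boto3 as an illustrative stand-in. It contrasts the leaky pattern we kept finding with the environment-variable practice we expected to see; the key values are obviously fake placeholders.

```python
# Illustrative stand-in for the unnamed data lake service in our test:
# the leaky pattern found in the wild, next to the environment-variable fix.
import os

import boto3

# Anti-pattern: plain credentials committed to a public repository.
# Anyone who finds this file owns the bucket.
leaky_client = boto3.client(
    "s3",
    aws_access_key_id="AKIAXXXXXXXXXXXXXXXX",  # fake placeholder
    aws_secret_access_key="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
)

# Best practice: read credentials from the environment at runtime,
# so the source code itself carries nothing sensitive.
safe_client = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
```

In practice boto3 also reads `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from the environment on its own, so `boto3.client("s3")` with no credential arguments is the simplest safe form.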

Problem scope

The issue we describe is not unique to data employees. To be fair, in many cases their code lives in the darkest corners of the organization's back office, making it less visible to a potential attacker. But once such a leak happens, its scope becomes dramatically more severe: while client-side developers (for example) commonly have some exposure to internal company services, data employees commonly interact with the most critical parts of the organization's data on a daily basis. This doesn't mean only data practitioners' work should receive extra scrutiny; a severe leak can originate with any employee in the organization. The main difference is the likelihood of such a leak involving a super-critical resource such as the organization's data. The problem is made worse by the many places where it can happen: not only in the organization's or employees' repositories. A simple search reveals plain secrets on almost every source-code-related social service out there, whether in StackOverflow questions and answers, Docker Hub images, Google Play apps, or even private blogs and forums that can easily be found using simple Google dorks.
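Monitoring all of these surfaces ultimately comes down to pattern matching. The sketch below shows the core of it, with a handful of well-known token formats as assumed examples; production scanners ship hundreds of rules plus entropy heuristics to cut false positives.

```python
# A minimal sketch of the pattern matching that secret scanners rely on.
# These few regexes are illustrative; real scanners use far larger rule
# sets plus entropy checks to reduce false positives.
import re

SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "slack_token": re.compile(r"\bxox[baprs]-[A-Za-z0-9-]{10,}"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (rule_name, match) pairs for every suspicious string found."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, match) for match in pattern.findall(text))
    return hits

if __name__ == "__main__":
    sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"  # copied from a forum post'
    print(scan_text(sample))  # [('aws_access_key_id', 'AKIAIOSFODNN7EXAMPLE')]
```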

How to avoid it?

As always, the first step is better education: make sure your employees are aware of up-to-date security best practices (such as never using plain secrets in code, and certainly never in public accounts). This should be verified especially among those who are less code-oriented (likely including the many data groups out there). At the same time, it is worth acknowledging that mistakes will happen. On company accounts we can apply existing solutions to verify that repositories don't include secrets. But since company secrets may appear elsewhere on the web, a more holistic solution would be to monitor potential leak areas, search for secrets within them, and, once found, evaluate their risk. The issue is that this is not a simple requirement but a constant effort to cover all possible leak areas. The more feasible intermediate solution is to make sure that everyone who touches code, regardless of their role, is verified to have a basic understanding of security best practices. Without it, it's just a matter of time until your secrets are hacked.
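One feasible guardrail in this spirit is a client-side check that runs before code ever leaves the laptop. Below is a minimal sketch of such a check as a git pre-commit hook (assuming it is saved as `.git/hooks/pre-commit` and marked executable); a real setup would rely on a maintained scanner rather than these two illustrative patterns.

```python
#!/usr/bin/env python3
# Minimal pre-commit hook sketch: block a commit when the staged diff
# contains a telltale secret pattern. Save as .git/hooks/pre-commit and
# mark executable; a maintained scanner is the better real-world choice.
import re
import subprocess
import sys

SUSPICIOUS = re.compile(
    r"AKIA[0-9A-Z]{16}|-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"
)

def main() -> int:
    # Inspect only the lines being added in this commit.
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    added = [
        line for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    hits = [line for line in added if SUSPICIOUS.search(line)]
    if hits:
        print("Possible secret detected; commit blocked:", file=sys.stderr)
        for line in hits:
            print("  " + line, file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```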
