Why Anaconda in Snowpark

In the “good old times,” when big data was still like the wild west, I ran a project for a client in the manufacturing industry. We ran a legacy Spark cluster on 100 nodes and wrote transformations using PySpark. Our goal was to implement a solution to a non-standard problem on a large scale (it was not a standard replacement for a legacy BI system). In that area, the Time to Market KPI was crucial. To deliver results faster, my team, like any development team, used pre-existing Python libraries. It was relatively easy to install them from PyPI on the cluster’s shared drives and add them as a dependency to requirements.txt. Piece of cake!

Photo by Will Echols on Unsplash

The solution provided sufficient value for the client to move out of the PoC phase. However, reality hit when an internal audit revealed that the project did not manage the licenses of the libraries. At the time, we didn’t care about it, nor did the client, as Time to Market was the priority. We didn’t know what was licensed or how, or if we were even allowed to use some of the libraries in our project. The Security team had a few concerns regarding the libraries, too. Therefore, after a period of rapid development, we had to go through a phase of refactoring and auditing the libraries. This caused a freeze on new development for six weeks.

While working with Snowflake, I have heard comments from clients regarding restrictions when using Snowpark, UD(T)Fs or Stored Procedures written in Python. In short, in Snowflake, one can either use libraries already existing in Snowflake’s Conda repository or libraries written in pure Python (more details here). You can’t install any library from PyPI. This may be seen as a restriction that impacts the client’s freedom to build whatever they want on the platform. But, you know what? It’s actually a feature that saves our clients a lot of trouble. Why?

Typosquatting and Obfuscation

When working with PyPI, one of the major risks is typosquatting: an attacker publishes a malicious package under a name that is a near-miss of a popular one, often with obfuscated code, hoping that someone mistypes the name during installation. Such packages can compromise system security and lead to data exfiltration. Snowflake addresses the exfiltration risk by running Python code in a secure sandbox. However, it may still be possible to cause damage to the data, which no one would want to experience.

Using Snowflake’s Anaconda repository helps mitigate these risks. Anaconda curates and vets packages before they are added to Snowflake’s repository, reducing the risk of malicious packages being installed due to typosquatting or obfuscation techniques. This provides businesses with added security and peace of mind knowing that the packages they are using have undergone thorough vetting and are less likely to cause harm to their system or data.
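To make the typosquatting idea concrete, here is a minimal sketch of the kind of check a curated list makes possible. It is not how Anaconda or Snowflake vet packages. The `APPROVED` set, the `check_package` helper and the 0.8 similarity threshold are all assumptions made up for this illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical allowlist for illustration only. In practice this would be
# the package list of whatever curated repository your team has approved.
APPROVED = {"numpy", "pandas", "requests", "scikit-learn"}


def normalize(name):
    """Lowercase the name and unify separators (PEP 503 style normalization)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def check_package(name, approved=APPROVED, threshold=0.8):
    """Classify a requested package name against the allowlist.

    Returns ("approved", name), ("suspicious", closest_approved_name)
    or ("unknown", None). The threshold is an arbitrary assumption.
    """
    candidate = normalize(name)
    if candidate in approved:
        return ("approved", candidate)

    # Score the candidate against every approved name; sorting keeps the
    # result deterministic if two names happen to score the same.
    scored = [
        (SequenceMatcher(None, candidate, known).ratio(), known)
        for known in sorted(approved)
    ]
    best_ratio, best = max(scored, default=(0.0, None))

    # Not on the list, but very close to something that is:
    # likely a typo, or a typosquatted package.
    if best_ratio >= threshold:
        return ("suspicious", best)
    return ("unknown", None)
```

With this sketch, `check_package("reqeusts")` is flagged as suspiciously close to `requests`, while a name that resembles nothing on the list, such as `left-pad`, simply comes back as unknown. A curated repository moves this whole question upstream: a name that is not in the repository cannot be installed at all.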

License

License management can also be a challenge when relying on public repositories, as a package’s license can change unexpectedly from one version to the next. Using packages only from a curated repository allows businesses to manage licenses in a simple, centralized way, without every project generating its own overhead, and to ensure that they are using packages with compatible licenses.
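For a sense of the per-project overhead, this is roughly what a do-it-yourself license inventory looks like. It is a minimal sketch using only the standard library's `importlib.metadata`. The helper names `declared_license` and `license_inventory` are made up for this example, and the result only reflects what each package declares about itself, which is an assumption you would still have to verify in a real audit.

```python
from importlib.metadata import distributions


def declared_license(meta):
    """Best-effort license string taken from a distribution's metadata."""
    # Newer packages may declare an SPDX expression (PEP 639).
    expression = meta.get("License-Expression")
    if expression:
        return str(expression).strip()

    # Otherwise look at trove classifiers such as
    # "License :: OSI Approved :: MIT License".
    classifiers = meta.get_all("Classifier") or []
    from_classifiers = [
        str(c).split("::")[-1].strip()
        for c in classifiers
        if str(c).startswith("License ::")
    ]
    if from_classifiers:
        return "; ".join(from_classifiers)

    # Fall back to the free-text License field; some packages paste the
    # whole license text there, so keep only the first line.
    free_text = str(meta.get("License") or "").strip()
    if free_text:
        return free_text.splitlines()[0]
    return "UNKNOWN"


def license_inventory():
    """Map every installed distribution name to its declared license."""
    inventory = {}
    for dist in distributions():
        try:
            meta = dist.metadata
        except Exception:
            # Broken installs may have no readable metadata; skip them.
            continue
        if meta is None:
            continue
        name = meta.get("Name")
        if name:
            inventory[str(name)] = declared_license(meta)
    return inventory
```

Calling `license_inventory()` returns a plain dictionary that you could dump into a report. Every project that installs freely from a public repository has to repeat and maintain something like this, for every environment and every version bump.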

Dependencies

One additional risk of relying on public repositories is the potential for disappearing dependencies, as demonstrated by the left-pad incident in npm. When a dependency is removed from a public repository, it can cause significant problems for projects that depend on it. Using Snowflake’s curated Anaconda repository reduces reliance on public repositories and minimizes the possibility of dependencies disappearing. This approach can help ensure that Snowflake workloads remain stable and secure, even if there are changes in the broader software landscape.

Conclusion

In conclusion, the integration of the Anaconda repository with Snowflake provides numerous benefits. Typosquatting, obfuscation techniques, license management issues, and disappearing dependencies are all potential risks that can be mitigated by using Snowflake’s Anaconda repository. The use of curated and vetted packages allows businesses to reduce the risk of security vulnerabilities or malware, better manage license risks, and ensure compatibility. Furthermore, using Snowflake’s repository can help ensure the stability and security of your project.

PS

After the audit the project was not cancelled, and we ran it for 1.5 years. But save yourself and your team some stress, and use a curated repository like Snowflake’s.

Note: These are my personal opinions and not those of my current employer (Snowflake). I am not promoting any portal and all links in the text are for the reader’s convenience only.

Bart Wrobel
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Architect, Data Engineer, Evangelist and Data Cloud Partner Enabler in EMEA at Snowflake - The Data Cloud. Presenting personal opinions, nothing else.