Science Shorts #1: Functional Programming, Nomad & Pylance
Useful Python functions to prepare for “Big Data” processing, the advantages of Nomad over Kubernetes, and Visual Studio Code’s next-generation code completion for Python.
Scope
Links to three interesting and wide-ranging Data Science topics: understanding key features of Python as preparation for processing “Big Data” with libraries such as Dask and PySpark; Nomad as an alternative to Kubernetes for container orchestration, covering the full range of Data Science activities from Data Ingestion to Model Deployment to Data Visualisation; and finally an introduction to Microsoft’s next-generation code completion engine, which expands on the previous Language Server Protocol (LSP) implementation.
Introduction
The following three articles were randomly selected from my Pocket list, which I’ve curated over the past 5 years in the field of Data Science; the motivations and background are discussed in a previous post: Data Science Shorts: An Introduction to my Pocket List.
Functional Programming with Python
Summary
RealPython (an excellent resource for all things Python) recently published an article on Functional Programming in Python. I’m not advocating a functional approach in Data Science, but the article discusses in detail features of Python that are useful to know, such as lambda functions, map, filter and reduce.
Context
These functions are useful in their own right; however, they’re also a great introduction to distributed computing with Spark, and in particular PySpark. As an example, map applies a function to each item in an iterable (or set of iterables):
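A minimal plain-Python sketch (reconstructed from the description later in this section, where a function adds the number 3 to an example array; the function and variable names are my own assumptions):

```python
# map applies a function to each item of an iterable.
# Names and sample values here are illustrative assumptions.
def add_three(x):
    return x + 3

numbers = [1, 2, 3, 4]

# In Python 3, map returns a lazy iterator, so realise it with list()
result = list(map(add_three, numbers))
print(result)  # [4, 5, 6, 7]
```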
Imagine each iterable (a set, list or other iterable) is hosted on a different node and that you need to apply a function to each item. This is a natural candidate for a parallel operation, since the function can be applied on each node independently and, if required, the results collected at the end. Similarly, the filter function could be applied on a per-node basis, and combining the per-node results is analogous to the reduce function.
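To make these concrete, here is a short plain-Python sketch of filter and reduce (the sample values and lambda functions are illustrative, not taken from the article):

```python
from functools import reduce  # reduce must be imported in Python 3

numbers = [1, 2, 3, 4, 5, 6]

# filter keeps only the items for which the predicate returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens)  # [2, 4, 6]

# reduce folds the iterable down to a single value, here a sum
total = reduce(lambda acc, x: acc + x, numbers)
print(total)  # 21
```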
The act of distributing data with PySpark is done with a SparkContext (sc) and its parallelize function, as the following example illustrates:
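A minimal sketch, reconstructed from the description that follows; the local SparkContext setup and the names are assumptions:

```python
from pyspark import SparkContext

# Create a local SparkContext (details to be expanded in a later post)
sc = SparkContext("local", "map-example")

# As before, a function that adds the number 3 to each item
def add_three(x):
    return x + 3

numbers = [1, 2, 3, 4]

# parallelize returns a Resilient Distributed Dataset (RDD),
# distributing the data across the available nodes
rdd = sc.parallelize(numbers)

# rdd.map applies the function to the distributed data;
# collect gathers the results back before the for loop prints them
for value in rdd.map(add_three).collect():
    print(value)
```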
There are a number of key changes between the two examples. First of all, a local SparkContext is created, the details of which will be expanded upon in a later post. As in the previous example, a function is defined to add the number 3 to each item of an example array. The advantage of Spark is in distributing the calculation across many nodes, and in this example that’s done with the sc.parallelize function, which returns a Resilient Distributed Dataset (RDD). This in effect distributes the data across the available nodes.
The rdd.map() function behaves exactly the same as the map() function in Python. The key difference is that the iterable remains distributed, which is why it must be collected using .collect() before printing in the for loop. What this demonstrates is that a stepping stone to understanding and utilising distributed computing is understanding Python’s built-in functions map, filter and reduce (technically, reduce must be imported from functools).
Container Orchestration with Nomad
Summary
Nomad is a workload and container orchestrator similar to Kubernetes (K8s). There is dedicated documentation explaining the differences. In essence, Kubernetes provides workload orchestration, service discovery, secrets management and other related services to enable an end-to-end capability, primarily using Docker-based containers. Nomad, in contrast, supports a wide range of container and virtual machine architectures but focuses purely on workload orchestration. It works with related technologies, such as Consul for service meshes and Vault for secrets management, but these are separate products and not bundled together.
Context
The current “Industry Standard” platform is based on Kubernetes, created by Google and currently under the stewardship of the Cloud Native Computing Foundation (CNCF). It’s important to note that other container orchestrators exist; Nomad in particular respects the Unix philosophy of single-purpose programs. In practice this means a single binary can be used in production, in development, and even on the “edge”.
In contrast, there are a multitude of Kubernetes distributions from a number of providers, some of which use HashiCorp elements such as Consul and Vault, for example Charmed Kubernetes from Canonical (the developers of Ubuntu). For edge devices (i.e. Internet of Things (IoT) devices) there is k3s from Rancher (recently purchased by SUSE) or microk8s from Canonical. For development, the same tools can be used on local machines, or Docker Desktop (for Windows and macOS only), which provides a single K8s node. For Nomad, there is only a single binary for all use cases.
Although the platform itself may not be a primary concern for Data Science, it’s good practice to understand, at a high level, the systems a Data Scientist must integrate with. In this case, Nomad offers the ability to scale to 10,000s of nodes, whereas Google themselves, with Kubernetes, were only capable of achieving cluster sizes a fifth of that.
Pylance: Next-Generation Python Support in Visual Studio Code
Summary
In June 2020, the Visual Studio Code team announced the release of their Pylance Language Server (PLS) as an eventual replacement for the long-standing Python support provided via the Language Server Protocol (LSP). Its advantages have been discussed extensively on Medium, at Towards Data Science, and elsewhere. It’s worth noting that the original LSP is used extensively by other editors through plugins.
The key difference between Pylance and the LSP implementation is the use of Python type information in stub files (PEP 561), which have the extension .pyi and can contain rich type information about a function or class. This enables Pylance to support features such as strict type checking and, potentially, auto-imports of modules.
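For illustration, a hypothetical stub file (the module and names are assumed, not taken from the article) might look like this:

```python
# mylib.pyi — a hypothetical PEP 561 stub file (illustrative only)
# Stub files declare type information; implementations are elided with "..."
from typing import List

def normalise(values: List[float]) -> List[float]: ...

class Model:
    learning_rate: float
    def __init__(self, learning_rate: float = ...) -> None: ...
    def predict(self, features: List[float]) -> float: ...
```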
Context
The current implementation of LSP is effective but doesn’t support some common Data Science approaches to Python. Support for stub files means that the community can provide type information in a standard way, which the editor can then utilise. It’s also good to see the VS Code team respond to the challenge from Artificial Intelligence (AI) based code completion engines such as Kite.
Conclusion
Three articles from my Pocket list have been shared, summarised and placed in context: the use of map, filter and reduce in Python as good preparation for distributed computing; alternatives to the currently ubiquitous Kubernetes platform for workload orchestration, using a potentially lighter, simpler and more scalable system in Nomad; and finally, an interesting article on the advantages of Pylance over LSP.