Science Shorts #1: Functional Programming, Nomad & Pylance
Python functions to prepare for “Big Data” processing, the advantages of Nomad over Kubernetes, and Visual Studio Code’s next-generation code completion for Python.
Links to three interesting and wide-ranging topics in Data Science. First, understanding key features of Python as a means of preparing to process “Big Data” with libraries such as Dask and PySpark. Second, Nomad as an alternative to Kubernetes as a container orchestrator, for a platform that covers the full range of Data Science activities from data ingestion to model deployment to data visualisation. Finally, an introduction to Microsoft’s next-generation code completion engine, which expands on the previous Language Server Protocol (LSP) based support.
The following three articles were randomly selected from my Pocket list, which I’ve curated over the past 5 years in the field of Data Science; the motivations and background are discussed in a previous post: Data Science Shorts: An Introduction to my Pocket List.
Functional Programming with Python
RealPython (an excellent resource for all things Python) recently published an article on Functional Programming in Python. I’m not advocating a functional approach in Data Science, but the article discusses in detail features of Python that are useful to know, such as the use of map, filter and reduce.
These functions are useful in their own right; however, they’re also a great introduction to distributed computing with Spark, and in particular PySpark. As an example,
map applies a function to each item in the iterable or set of iterables:
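A minimal sketch of map in plain Python follows. The function name add_three and the sample values are assumptions chosen for illustration (the text later mentions a function that adds the number 3 to an example array); this is not the original article's code.

```python
def add_three(x):
    """Add 3 to a single value."""
    return x + 3


numbers = [1, 2, 3, 4, 5]

# map returns a lazy iterator, so wrap it in list() to see the values.
result = list(map(add_three, numbers))
print(result)  # [4, 5, 6, 7, 8]

# With several iterables, map passes one item from each to the function.
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))
print(sums)  # [11, 22, 33]
```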
Imagine each iterable (a list or other iterable) is hosted on a different node and that you need to apply a function to each item. This is naturally a parallel operation, since the function can be applied on each node independently and the results collected at the end if required. Similarly, the filter function could be applied on a per-node basis, and if the results need to be combined, that step is analogous to the reduce function.
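To make the per-node idea concrete, here is a small sketch in plain Python in which each inner list stands in for the data held on one node. The data, the is_even predicate and the variable names are all hypothetical; no real cluster is involved.

```python
from functools import reduce

# Hypothetical data: each inner list represents the items held on one node.
partitions = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]


def is_even(x):
    return x % 2 == 0


# Each "node" filters its own items without needing the others.
filtered = [list(filter(is_even, part)) for part in partitions]

# Combining the per-node results is a reduction over the partitions.
combined = reduce(lambda left, right: left + right, filtered, [])

# A second reduction collapses the combined items to a single value.
total = reduce(lambda acc, x: acc + x, combined, 0)

print(filtered)  # [[2, 4], [6], [8, 10]]
print(combined)  # [2, 4, 6, 8, 10]
print(total)     # 30
```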
Distributing data with PySpark is done through a SparkContext (sc) and its parallelize function, as the following example illustrates:
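The sketch below is one way such an example might look, not the original article's code. It assumes PySpark is installed and that a local master ("local[*]") is acceptable; the function names and the application name are hypothetical. The PySpark import sits inside the function so that add_three remains usable on its own.

```python
def add_three(x):
    """Add 3 to a single value."""
    return x + 3


def distributed_add_three(values):
    """Apply add_three to values through a local SparkContext (needs PySpark)."""
    # Imported here so that add_three can be used without PySpark installed.
    from pyspark import SparkConf, SparkContext

    # Assumption: a local master using all available cores.
    conf = SparkConf().setMaster("local[*]").setAppName("add-three-sketch")
    sc = SparkContext.getOrCreate(conf)
    try:
        rdd = sc.parallelize(values)  # distribute the data as an RDD
        mapped = rdd.map(add_three)   # applied lazily, partition by partition
        return mapped.collect()       # bring the results back to the driver
    finally:
        sc.stop()
```

With PySpark available, the collected values can be printed in an ordinary loop, for example `for value in distributed_add_three([1, 2, 3, 4, 5]): print(value)`.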
There are a number of key changes between the two examples. First, a local SparkContext is created, the details of which will be expanded on in a later post. As in the previous example, a function is defined to add the number 3 to an example array. The advantage of Spark is that it distributes the calculation across many nodes, and in this example that is done with the sc.parallelize function, which returns a Resilient Distributed Dataset (RDD). This in effect distributes the data across the available nodes.
The rdd.map() function behaves much like the map() function in Python. The key difference is that the iterable remains distributed, which is why it must be collected using .collect() before printing in the for loop. What this demonstrates is that a stepping stone to understanding and utilising distributed computing is to understand Python's built-in functions map, filter and reduce (technically, reduce must be imported from functools).
Container Orchestration with Nomad
Nomad is a workload and container orchestrator similar to Kubernetes (K8s). There is dedicated documentation explaining the differences. In essence, Kubernetes provides workload orchestration, service discovery, secrets management and other related services to enable an end-to-end capability, primarily using Docker-based containers. Nomad, in contrast, supports a wide range of container and virtual machine architectures but focuses purely on workload orchestration. It works with related technologies, such as Consul for service meshes and Vault for secrets management, but these are separate entities and are not bundled together.
The current “Industry Standard” platform is based on Kubernetes, created by Google and currently under the stewardship of the Cloud Native Computing Foundation (CNCF). It’s important to note that other container orchestrators exist, and Nomad in particular respects the Unix philosophy of single-purpose programs. In practice this means a single binary can be used in production, in development and even on the “edge”.
In contrast, there is a multitude of Kubernetes distributions from a number of providers, some of which use HashiCorp elements such as Consul and Vault. One example is Charmed Kubernetes from Canonical (the developers of Ubuntu). For edge devices (e.g. the Internet of Things (IoT)) there is k3s from Rancher (purchased recently by SUSE) or microk8s from Canonical. For development, the same tools can be used on local machines, as can Docker Desktop (for Windows and macOS only), which provides a single K8s node. For Nomad, there is only a single binary for all use cases.
Although the platform may not be a primary concern for Data Science itself, it’s good practice to understand, at a high level, the systems a Data Scientist must integrate with. In this case, Nomad offers the ability to scale to tens of thousands of nodes, whereas Google itself, with Kubernetes, was only capable of achieving cluster sizes a fifth of that.
Pylance: Next-Generation Python Support in Visual Studio Code
In June 2020, the Visual Studio Code team announced the release of their Pylance Language Server (PLS) as an eventual replacement for the long-standing Python support built on the Language Server Protocol (LSP). Its advantages have been discussed extensively on Medium, at Towards Data Science, and elsewhere. It’s worth noting that the original LSP is used extensively by other editors through plugins.
The key difference between Pylance and LSP is the use of Python type information in stub files (PEP 561), which have the extension .pyi and can contain rich information about a function or class. This enables Pylance to support features such as strict type checking and, potentially, auto-imports of modules.
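As a rough illustration of what a stub file looks like, the sketch below shows the contents of a hypothetical stats_utils.pyi. The module name, function and class are made up for illustration; only the .pyi convention comes from the text above. A stub repeats the public signatures with type annotations and replaces every body, and every default value, with `...`.

```python
# Contents of a hypothetical stub file, stats_utils.pyi.
# All names here are illustrative assumptions, not a real library.
from typing import List, Optional


def scale(values: List[float], factor: float = ...) -> List[float]: ...


class RunningMean:
    count: int

    def __init__(self, initial: Optional[float] = ...) -> None: ...
    def update(self, value: float) -> float: ...
```

With a stub like this alongside the implementation, a type-aware language server should be able to flag a call such as `scale("abc", 2)` and show the annotated signature while you type.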
The current implementation of LSP is effective but doesn’t support some common Data Science approaches to Python. The support for stub files means that the community can provide some of these features in a standard way that the editor can utilise. It’s good to see the VS Code team respond to the challenge from Artificial Intelligence (AI) based code completion engines such as Kite.
Three articles from my Pocket list have been shared and summarised, with context provided. The first covered the use of Python as good preparation for distributed computing. The second looked at an alternative to the currently ubiquitous Kubernetes platform for workload orchestration: Nomad, a potentially lighter, simpler and more scalable system. Finally, there was an interesting article on the advantages of Pylance over LSP.