Published in MLearning.ai

Science Shorts #1: Functional Programming, Nomad & Pylance

Useful Python functions to prepare for “Big Data” processing, the advantages of Nomad over Kubernetes, and Visual Studio Code’s next-generation code completion for Python.

Scope

Links to three interesting and wide-ranging Data Science topics: key features of Python that prepare you to process “Big Data” with libraries such as Dask and PySpark; Nomad as an alternative to Kubernetes for container orchestration, which touches the full range of Data Science activities from Data Ingestion to Model Deployment to Data Visualisation; and finally an introduction to Microsoft’s next-generation code completion engine, which expands on the previous Language Server Protocol (LSP).

Introduction

The following three articles were randomly selected from my Pocket list, which I’ve curated over the past 5 years in the field of Data Science; the motivations and background are discussed in a previous post: Data Science Shorts: An Introduction to my Pocket List.

Functional Programming with Python

Real Python article on Functional Programming | Screenshot by Author | Article and Artwork by Real Python

Summary

Real Python (an excellent resource for all things Python) recently published an article on Functional Programming in Python. I’m not advocating a functional approach to Data Science, but the article discusses in detail features of Python that are useful to know, such as lambda functions and the built-in map, filter and reduce.

Context

These functions are useful in their own right; however, they’re also a great introduction to distributed computing with Spark, and in particular PySpark. As an example, map applies a function to each item in an iterable (or set of iterables).
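A minimal sketch of the pattern (the data and function names are purely illustrative):

```python
# Plain Python: map() applies a function to every item of an iterable
numbers = [1, 2, 3, 4, 5]

def add_three(x):
    return x + 3

result = map(add_three, numbers)  # returns a lazy map object
print(list(result))               # [4, 5, 6, 7, 8]

# The same idea with a lambda, plus filter() and reduce()
from functools import reduce

shifted = list(map(lambda x: x + 3, numbers))        # [4, 5, 6, 7, 8]
evens = list(filter(lambda x: x % 2 == 0, shifted))  # [4, 6, 8]
total = reduce(lambda acc, x: acc + x, evens)        # 18
```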

Imagine each iterable (a set, list or other iterable) is hosted on a different node and that you need to apply a function to each item. This is a naturally parallel operation, since the function can be applied on each node independently and, if required, the results collected at the end. Similarly, the filter function could be applied on a per-node basis, and combining the per-node results is analogous to the reduce function.

The act of distributing data with PySpark is done with a SparkContext (sc) and its parallelize function, as the following example illustrates.
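A minimal sketch of such an example, assuming a local PySpark installation (the application name, data and function are assumptions for illustration):

```python
from pyspark import SparkContext

# Create a local SparkContext; "local[*]" uses all available cores
sc = SparkContext("local[*]", "map-example")

numbers = [1, 2, 3, 4, 5]

def add_three(x):
    return x + 3

# Distribute the data across the available nodes as an RDD
rdd = sc.parallelize(numbers)

# rdd.map() mirrors Python's built-in map(), but stays distributed
shifted = rdd.map(add_three)

# Bring the distributed results back to the driver before printing
for value in shifted.collect():
    print(value)
```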

There are a number of key changes between the two examples. First of all, a local SparkContext is created, the details of which will be expanded on in a later post. As in the previous example, a function is defined to add the number 3 to each item of an example array. The advantage of Spark is distributing the calculation across many nodes, and in this example that’s done with the sc.parallelize function, which returns a Resilient Distributed Dataset (RDD). This in effect distributes the data across the available nodes.

The rdd.map() function behaves exactly the same as the map() function in Python. The key difference is that the iterable remains distributed, which is why it must be collected using .collect() before printing in the for loop. What this demonstrates is that a stepping stone to understanding and utilising distributed computing is to understand Python’s built-in functions map, filter and reduce (the last of which must be imported from functools).
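The same analogy extends to the other two functions; a sketch continuing the PySpark example above (assuming the same sc and rdd are still available):

```python
# filter() runs independently on each partition of the RDD
evens = rdd.filter(lambda x: x % 2 == 0)  # still distributed
print(evens.collect())                    # [2, 4]

# reduce() combines results across partitions and returns the
# final value to the driver
total = rdd.reduce(lambda a, b: a + b)
print(total)                              # 15 for [1, 2, 3, 4, 5]
```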

Container Orchestration with Nomad

Container Orchestration with Nomad | Screenshot by Author | Article and Artwork by HashiCorp

Summary

Nomad is a workload and container orchestrator similar to Kubernetes (K8s). There is dedicated documentation explaining the differences. In essence, Kubernetes provides workload orchestration, service discovery, secrets management and other related services to enable an end-to-end capability, primarily using Docker-based containers. Nomad, in contrast, supports a wide range of container and virtual machine architectures but focuses purely on workload orchestration. It integrates with related technologies, such as Consul for service meshes and Vault for secrets management, but these are separate tools and not bundled together.

Context

The current “Industry Standard” platform is based on Kubernetes, created by Google and currently under the stewardship of the Cloud Native Computing Foundation (CNCF). It’s important to note that other container orchestrators exist, and Nomad in particular respects the Unix philosophy of single-purpose programs. In practice this means a single binary can be used in production, in development and even on the “edge”.

In contrast, there are a multitude of Kubernetes distributions, some of which use HashiCorp elements such as Consul and Vault, from a number of providers, for example Charmed Kubernetes from Canonical (the developers of Ubuntu). For edge devices (i.e. Internet of Things (IoT) hardware) there is k3s from Rancher (recently acquired by SUSE) or MicroK8s from Canonical. For development, the same tools can be used on local machines, or Docker Desktop (for Windows and macOS only), which provides a single K8s node. For Nomad, there is only a single binary for all use cases.

Although the platform itself may not be a primary concern for Data Science, it’s good practice to understand, at a high level, the systems a Data Scientist must integrate with. In this case, Nomad offers the ability to scale to clusters of 10,000s of nodes, whereas Google themselves, with Kubernetes, were only capable of achieving cluster sizes around a fifth of that.

Pylance: Next-Generation Python Support in Visual Studio Code

Pylance GitHub Repo README.md | Screenshot by Author

Summary

In June 2020, the Visual Studio Code team announced the release of their Pylance language server as an eventual replacement for the long-standing Python support built on the Language Server Protocol (LSP). Its advantages have been discussed extensively on Medium, at Towards Data Science and elsewhere. It’s worth noting that the original LSP-based server is used extensively by other editors through plugins.

The key difference between Pylance and the previous LSP implementation is the use of Python type information in stub files (PEP 561), which have the extension .pyi and can contain rich type information about a function or class. This enables Pylance to support features such as strict type checking and auto-imports of modules.
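As a rough illustration, a hypothetical stub for an imagined module named example might look like the following (the module, function and class are invented for illustration; stub files are never executed, they only describe types):

```python
# example.pyi -- a hypothetical stub for a module named "example";
# the implementation itself lives in example.py
from typing import List

def scale(values: List[float], factor: float) -> List[float]: ...

class Model:
    name: str
    def predict(self, features: List[float]) -> float: ...
```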

Context

The current implementation of LSP is effective but doesn’t support some common Data Science approaches to Python. The support for stub files means that the community can provide some of these features in a standard way, which the editor can then utilise. It’s good to see the VS Code team respond to the challenge from Artificial Intelligence (AI) based code completion engines such as Kite.

Conclusion

Three articles from my Pocket list have been shared, summarised and given context: the use of map, filter and reduce in Python as good preparation for distributed computing; Nomad as a potentially lighter, simpler and more scalable alternative to the currently ubiquitous Kubernetes platform for workload orchestration; and finally an interesting article on the advantages of Pylance over LSP.

Ashraf Miah

Data Scientist and Chartered Aeronautical Engineer (MEng CEng EUR ING MRAeS) with over 15 years’ experience in the Aerospace, Defence and Rail industries.