Published in MLearning.ai

Science Shorts #1: Functional Programming, Nomad & Pylance

Useful functions to prepare for “Big Data” processing, the advantages of Nomad over Kubernetes and Visual Studio Code’s next-generation code completion for Python.

Scope

Links to three interesting and wide-ranging topics in Data Science. The first covers key features of Python as a means of preparing to process “Big Data” with libraries such as Dask and PySpark. The second presents Nomad as an alternative to Kubernetes as a container orchestrator, which covers the full range of Data Science activities from Data Ingestion to Model deployment to Data Visualisation. Finally, there is an introduction to Microsoft’s next-generation code completion engine, which expands on the previous Language Server Protocol (LSP) support.

Introduction

The following three articles were randomly selected from my Pocket list, which I’ve curated over the past 5 years in the field of Data Science; the motivations and background are discussed in a previous post: Data Science Shorts: An Introduction to my Pocket List.

Functional Programming with Python

Real Python article on Functional Programming | Screenshot by Author | Article and Artwork by Real Python

Summary

Real Python (an excellent resource for all things Python) recently published an article on Functional Programming in Python. I’m not advocating a functional approach in Data Science, but the article discusses in detail features of Python that are useful to know, such as the use of lambda functions, map(), filter() and reduce().

Context

These functions are useful in their own right; however, they’re also a great introduction to distributed computing with Spark and, in particular, PySpark. As an example, map() applies a function to each item in an iterable or set of iterables.
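A minimal sketch of map() in plain Python follows. The add_three function and the list of numbers are hypothetical names chosen for illustration, not taken from the Real Python article:

```python
def add_three(x):
    """Return x plus 3."""
    return x + 3


numbers = [1, 2, 3, 4]

# In Python 3, map() is lazy: it returns an iterator rather than a list,
# so the function is only applied as the items are consumed.
result = map(add_three, numbers)

for item in result:
    print(item)  # prints 4, 5, 6 and 7 on separate lines
```

Because the result is an iterator, it can only be consumed once; wrap it in list() if the values are needed again.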

Imagine each iterable (a list or another iterable) is hosted on a different node and that you need to apply a function to each item. This is a natural candidate for a parallel operation, since the function can be applied on each node independently and, if required, the results collected at the end. Similarly, the filter() function could be applied on a per-node basis and, if the results need to be combined, this is analogous to the reduce() function.

The act of distributing data with PySpark is done with a SparkContext (sc) and its parallelize() function.
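The same calculation can be sketched with PySpark as below. This is an illustration under assumptions rather than a definitive implementation: it needs the pyspark package and a Java runtime, and the function names, the "local[*]" master and the application name are all choices made for this sketch.

```python
def add_three(x):
    """Return x plus 3."""
    return x + 3


def distributed_add_three(values):
    """Distribute values over a local Spark instance and add 3 to each."""
    # Imported inside the function so this module still loads on a
    # machine where PySpark is not installed.
    from pyspark import SparkContext

    # "local[*]" runs Spark on this machine using all available cores.
    sc = SparkContext(master="local[*]", appName="add-three-sketch")
    try:
        rdd = sc.parallelize(values)  # distribute the data as an RDD
        mapped = rdd.map(add_three)   # lazy transformation, still distributed
        return mapped.collect()       # gather the results into a local list
    finally:
        sc.stop()                     # always release the local Spark instance
```

With PySpark available, looping over distributed_add_three([1, 2, 3, 4]) and printing each item should give the same values as the plain map() version.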

There are a number of key changes between the two examples. First of all, a local SparkContext is created, the details of which will be expanded in a later post. Like the previous example, a function is defined to add the number 3 to an example array. The advantage of Spark is distributing the calculation across many nodes, and in this example it’s done with the parallelize() function, which returns a Resilient Distributed Dataset (RDD). This in effect distributes the data across the available nodes.

The RDD’s map() function behaves in the same way as the map() function in Python. The key difference is that the iterable remains distributed, which is why it must be collected using collect() before printing in the loop. What this demonstrates is that a stepping stone to understanding and utilising distributed computing is to understand Python’s built-in functions map(), filter() and reduce() (technically, reduce() must be imported from functools).
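To round out the trio, here is a short plain-Python sketch of filter() and reduce(). The split of the data into two chunks is a hypothetical stand-in for two nodes and only hints at how partial results can be combined:

```python
from functools import reduce  # reduce() is not a built-in in Python 3

values = [1, 2, 3, 4, 5, 6]

# filter() keeps only the items for which the function returns True.
evens = list(filter(lambda x: x % 2 == 0, values))

# reduce() folds the items into a single value, here a running sum.
total = reduce(lambda acc, x: acc + x, evens)

# Hypothetical split of the same data across two "nodes": each chunk is
# reduced on its own and the partial results are then reduced again.
chunks = [values[:3], values[3:]]
partial_sums = [reduce(lambda acc, x: acc + x, chunk) for chunk in chunks]
combined = reduce(lambda acc, x: acc + x, partial_sums)

print(evens)     # [2, 4, 6]
print(total)     # 12
print(combined)  # 21
```

Combining partial results this way only works because addition is associative, which is the same property a distributed reduce relies on.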

Container Orchestration with Nomad

Container Orchestration with Nomad | Screenshot by Author | Article and Artwork by HashiCorp

Summary

Nomad is a workload and container orchestrator similar to Kubernetes (K8s). There is dedicated documentation to explain the differences. In essence, Kubernetes provides workload orchestration, service discovery, secrets management and other related services to enable an end-to-end capability, primarily using Docker-based containers. Nomad, in contrast, supports a wide range of container and virtual machine architectures but focuses purely on workload orchestration. It integrates with related technologies, such as Consul for service meshes and Vault for secrets management, but these are separate products and are not bundled together.

Context

The current “Industry Standard” platform is based on Kubernetes, created by Google and currently under the stewardship of the Cloud Native Computing Foundation (CNCF). It’s important to note that other container orchestrators exist; Nomad, for one, respects the Unix philosophy of single-purpose programs. In practice this means a single binary can be used in production, development and even on the “edge”.

In contrast, there is a multitude of Kubernetes distributions from a number of providers, some of which use HashiCorp elements such as Consul and Vault; one example is Charmed Kubernetes from Canonical (the developers of Ubuntu). For edge devices (i.e. the Internet of Things (IoT)) there is K3s from Rancher (purchased recently by SUSE) or MicroK8s from Canonical. For development, the same tools can be used on local machines, or Docker Desktop (for Windows and macOS only), which provides a single K8s node. For Nomad, there is only a single binary for all use cases.

Although the platform itself may not be a primary concern for Data Science, it’s good practice to understand, at a high level, the systems a Data Scientist must integrate with. In this case, Nomad offers the ability to scale to 10,000s of nodes, whereas Google themselves, with Kubernetes, were only capable of achieving cluster sizes a fifth of that.

Pylance: Next-Generation Python Support in Visual Studio Code

Pylance GitHub Repo README.md | Screenshot by Author

Summary

In June 2020, the Visual Studio Code team announced the release of their Pylance Language Server (PLS) as an eventual replacement for the long-standing Python support provided by the Microsoft Python Language Server, which implements the Language Server Protocol (LSP). Its advantages have been discussed extensively on Medium at Towards Data Science and elsewhere. It’s worth noting that LSP itself is used extensively by other editors through plugins.

The key difference between Pylance and the previous language server is the use of Python type information in stub files (PEP 561), which have the .pyi extension and can contain rich information about a function or class. This enables Pylance to support features such as strict type checking and, potentially, auto-imports of modules.
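As a rough illustration of what a stub file holds, the sketch below shows the contents of a hypothetical add_three.pyi file for a hypothetical add_three module. A stub contains only signatures and type hints, with each body replaced by an ellipsis:

```python
# add_three.pyi (hypothetical stub for a hypothetical add_three module)
from typing import Iterable, List


def add_three(x: int) -> int: ...


def add_three_to_all(values: Iterable[int]) -> List[int]: ...
```

The implementation lives in the matching .py file; a type checker such as Pylance reads the stub to learn the expected argument and return types.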

Context

The current LSP-based language server is effective but doesn’t support some common Data Science approaches to Python. The support for stub files means that the community can provide some of these features in a standard way, which the editor can utilise. It’s good to see the VS Code team respond to the challenges from Artificial Intelligence (AI) based code completion engines such as Kite.

Conclusion

Three articles from my Pocket list have been shared and summarised, with context provided for each: the use of map(), filter() and reduce() in Python as good preparation for distributed computing; Nomad as a potentially lighter, simpler and more scalable alternative to the currently ubiquitous Kubernetes platform for workload orchestration; and finally, an interesting article on the advantages of Pylance over the previous LSP-based language server.

Ashraf Miah

Data Scientist and Chartered Aeronautical Engineer (MEng CEng EUR ING MRAeS) with over 15 years experience in the Aerospace, Defence and Rail Industry.