Evolution of calling Python from Node

Node -> Python

A bit of a disclaimer here: I started learning Node.js with this project, so I still consider myself a very bad Node.js developer. There might be errors or things I’m not aware of.

A bit of background

At Geoblink, we have 3 different tech teams:

  • Infra, which I won’t talk about here
  • Core, the web team working on the front-end and back-end of the app. They control every aspect of the app itself: the looks, the interactions, the business logic, etc. They use Node.js everywhere and master it completely.
  • Data, doing data things like data-science, data-engineering, data-mining or data-cleaning… This is the team I’m in. A majority of us primarily use Python, even if some prefer Scala or R or Java…

How does the data go from the inside of the brain of a data scientist to the screen of a Geoblink user? Well, usually the data scientist thinks of a solution and writes some Python code to do it. This Python code modifies a database, which the Core team then uses to render the data beautifully. So, the usual flow is data scientist -> Python -> Postgres Database -> Node.js -> end user.

But when I joined, we started to need a Node.js service and a Python process to talk to each other directly. A database in between would have been superfluous: it would add a lot of complexity and drag down the speed of the whole service. That is because we needed some tools written by a Python guy to be called within a service written by a JS guy.

In this post I will cover how the communication between Python and Node.js evolved over time, as we had to implement this communication in different projects.

The ‘brute’ solution: bash script

The first Python tool we connected to Node was a heavy statistical model. It applies a lot of pre-processing and data-cleaning before giving the actual output, and rewriting it in Node.js would have cost a lot of time and effort.

The solution we came up with was to run it via the `spawn` function of `child_process` of Node.js.

We rewrote the input/output of the Python tool so that it would take JSON as input and write JSON to `stdout`. Capture `stdout` from Node.js and you have a working tool! But if you do that, you might run into problems with `PYTHONPATH` or with encoding (working with Spanish and French addresses means caring about accented letters; enforcing UTF-8 encoding can help here). So instead of calling the Python file directly, we called a Bash file in which we export a valid `PYTHONPATH` and do other operations.

Good thing:
 * You control everything

Bad thing:
 * You have to control everything

From the Node side:

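const { spawn } = require('child_process')

// Run the Bash wrapper and resolve with the JSON it prints on stdout.
// `config` is the service's own configuration object.
function runTool (params) {
  return new Promise((resolve, reject) => {
    const toolParams = JSON.stringify(params)
    const pythonProcess = spawn('bash', [config.tool.path, toolParams])
    const stdout = []
    const stderr = []
    // Decode as UTF-8 so multi-byte characters split across chunks survive
    pythonProcess.stdout.setEncoding('utf8')
    pythonProcess.stderr.setEncoding('utf8')
    pythonProcess.stdout.on('data', data => stdout.push(data))
    pythonProcess.stderr.on('data', data => stderr.push(data))
    pythonProcess.on('error', reject) // the process could not be started
    pythonProcess.on('close', (code) => {
      if (code !== 0) {
        return reject(new Error(stderr.join('')))
      }
      try {
        const pythonResult = JSON.parse(stdout.join(''))
        resolve(pythonResult) // Exploit pythonResult
      } catch (err) {
        reject(err)
      }
    })
  })
}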

Intermediate Bash script:

#!/usr/bin/env bash
export PYTHONPATH=/some/location/:$PYTHONPATH
# Quote "$@" so the JSON argument reaches Python as a single, unsplit string
cd "$DEDUPER_WORKSPACE" && \
python3 run_tool.py "$@" || exit 1

From the Python side:

import sys
import json

tool_params = json.loads(sys.argv[1])
result = do_something(tool_params)
json_output = json.dumps(result)
sys.stdout.write(json_output)
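
Since accented letters were one of the pain points mentioned above, here is a minimal sketch of a slightly more defensive Python entry point. It is not the code we actually ran: `do_something` is a hypothetical stand-in for the real tool, `force_utf8` and `main` are names made up for this example, and `reconfigure` assumes Python 3.7 or later.

```python
import json
import sys


def do_something(tool_params):
    # Hypothetical stand-in for the real tool: it just echoes its input.
    return {"echo": tool_params}


def force_utf8(*streams):
    """Switch text streams to UTF-8 whatever the locale says (Python 3.7+)."""
    for stream in streams:
        if hasattr(stream, "reconfigure"):
            stream.reconfigure(encoding="utf-8")


def main(argv, stdout=None, stderr=None):
    """Read JSON from argv[1], write JSON to stdout, return an exit code."""
    stdout = sys.stdout if stdout is None else stdout
    stderr = sys.stderr if stderr is None else stderr
    try:
        tool_params = json.loads(argv[1])
    except (IndexError, ValueError) as exc:
        # Errors go to stderr so Node never mistakes them for a result.
        stderr.write("invalid input: {}\n".format(exc))
        return 1
    result = do_something(tool_params)
    # ensure_ascii=False keeps accented letters readable instead of \u escapes.
    stdout.write(json.dumps(result, ensure_ascii=False))
    return 0
```

At the bottom of the script, one would call `force_utf8(sys.stdout, sys.stderr)` and then `sys.exit(main(sys.argv))`, so that a non-zero exit code reaches the `close` handler on the Node side.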

The ‘let-the-experts-do’ solution: python-shell

Some time later, another challenge arrived: Our service had to execute a lot of small Python scripts, each one very different from the others, each with their dependencies, some of them at the same time, but all returning the same type of data.

This time, I remember deciding to fetch a package to do the job for us. In the Python universe, most of your problems can be solved by an `import antigravity`. So I went looking on `npm` and found `python-shell` (https://github.com/extrabacon/python-shell), which seemed to do the job.

It works a bit like our previous solution: it sends the parameters to the Python script via `stdin` and gets the output via `stdout`. But it can also handle binary data, show Python tracebacks and run the Python process in a child process. Most importantly, it is well tested and proven to work. With that part taken care of, the only thing left was to agree with the Python developers on a common way of writing the input/output of their scripts.

Good thing:
 * Your job is reduced to understanding and using a library
 * The library is done by people who probably spent more time than you thinking about the problem

Bad thing:
 * You are dependent on a library, its bugs and security failures included
 * You are limited by the possibilities of the library (unless you expand it)

From the Node.js side:

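// python-shell 1.0+ exports the class by name; older releases export it directly
const { PythonShell } = require('python-shell')

const pythonOptions = {
  mode: 'json',
  scriptPath: '/some/location', // directory where the scripts live
  pythonPath: '/usr/bin/python3'
}
const pyshell = new PythonShell('small_script.py', pythonOptions)
let pythonData = null // reassigned below, so it cannot be a const

// In 'json' mode, each line printed by the script arrives already parsed
pyshell.on('message', function (message) {
  pythonData = message
})

pyshell.end(function (err) {
  if (err) {
    throw err
  }
  // Exploit pythonData
})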

From the Python side:

import sys 
import json
data = do_something() 
print(json.dumps(data, ensure_ascii=False))
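
The "common way of writing the input/output" can be captured in a tiny shared helper. The sketch below is an assumption about what such a contract could look like, not the helper we shipped: it relies on `python-shell`'s `json` mode treating each line as one JSON document, and the names `read_messages` and `emit` are made up for illustration.

```python
import json
import sys


def read_messages(stream=None):
    """Yield one decoded JSON document per non-empty input line."""
    stream = sys.stdin if stream is None else stream
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def emit(message, stream=None):
    """Write one JSON document on a single line and flush immediately."""
    stream = sys.stdout if stream is None else stream
    # json.dumps escapes newlines inside strings, so one message == one line.
    stream.write(json.dumps(message, ensure_ascii=False) + "\n")
    stream.flush()
```

A script would then boil down to a loop such as `for params in read_messages(): emit(do_something(params))`, and every script would look the same from the Node.js side.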

Our most recent solution: micro-services

Fast-forward a few months and a new project now required us to connect another tool to a Node.js service. This time, the Python tool we had to connect was a heavy application, another statistical model.
 It has these characteristics:

  • When launched, it occupies a lot of memory
  • The launching is not instantaneous
  • It is (kind of) slow (takes seconds to produce a result)
  • It requires a lot of dependencies

It seemed like the perfect opportunity to build a Python service!

To avoid complexity, we decided to run the service in our local network. We separated the Python code into the loading of resources and the actual use of the tool. Then we prepared it to consume and produce JSON, before agreeing on an endpoint and a JSON format.
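
The split between loading resources and using the tool could look roughly like this. It is a sketch under assumptions: the "model" is a hypothetical dictionary standing in for the real one, and the `{"value": ...}` payload is an invented JSON format, not the one we agreed on.

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def load_resources():
    """Expensive start-up step: the cache makes it run once per process."""
    # Hypothetical stand-in for loading a large model from disk.
    return {"coefficient": 2.0}


def predict(payload):
    """Cheap per-request step: reuses the already loaded resources."""
    model = load_resources()
    return {"prediction": model["coefficient"] * payload["value"]}


def handle(raw_body):
    """JSON in, JSON out: the contract shared with the Node.js service."""
    return json.dumps(predict(json.loads(raw_body)))
```

The point of the design is that the slow, memory-hungry part is paid once when the service starts, while each request only pays for `predict`.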

There are tons of resources explaining how to deploy a statistical model in a service. Most of them explain very well how to do it with Flask (http://flask.pocoo.org/). But oftentimes, they miss one important point about Flask:

Flask’s built-in server is not suitable for production (from Flask documentation)

You need to connect your Flask service to a WSGI HTTP server to scale correctly. If you go for a standalone solution, you can choose between Gunicorn (the one we chose), uWSGI, Twisted, etc.

Good thing:
 * Scalable, fully reusable component
 * Separated concerns

Bad thing:
 * Writing a fully functional service can take some time (don’t skip security, logging or deploying steps)
 * Demands DevOps work
 * HTTP requests can be slow

From the Node.js side (classic HTTP request):

const request = require('request')

const options = {
  url, // the endpoint agreed on with the Python service
  json: true,
  body: { variable: value }
}
request.post(options, function (err, response, body) {
  if (err) {
    throw err
  }
  if (response.statusCode === 200) {
    // Exploit body
  }
})

From the Python side (no WSGI setup shown):

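from flask import Flask, jsonify, request
from myproject.models import my_model

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def do_prediction():
    input_sent = request.get_json()
    result = my_model.predict(input_sent)
    # jsonify builds a JSON response; assumes result is JSON-serializable
    return jsonify(result), 200


if __name__ == '__main__':
    # Built-in development server only; use a WSGI server in production
    app.run()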
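
For the WSGI part that the snippet above leaves out, a Gunicorn configuration file is itself plain Python. The values below are illustrative assumptions, not our production settings, and `service:app` in the command that follows is a hypothetical module path for the Flask app.

```python
# gunicorn.conf.py (illustrative values only)
bind = "127.0.0.1:8000"  # listen on localhost; adjust to your own network
workers = 2              # every worker holds the model in memory, so keep it low
timeout = 60             # seconds; Gunicorn's default of 30 may be short for a slow model
preload_app = True       # load the application code once, before forking the workers
```

The service would then be started with something like `gunicorn -c gunicorn.conf.py service:app`. Preloading fits a tool that is slow to launch, since the loading happens once in the master process rather than in every worker.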

Key takeaways

Do you want to control exactly the context in which Python is called?
 Write a Bash script.

Do you want an out-of-the-box solution?
 Search for an already implemented library or package.

Do you want a scalable solution, something you can easily reuse in your app?
 Take the time to create a micro-service.

I like the similarities between biological evolution and choosing a solution in software engineering. No solution is really better than the others; rather, the best solution is the one that best fits your needs.
 Building a micro-service is cool, but it may not be suited to the problem you want to solve (and will be more complex and take longer to build).

Thanks for reading.

Denis Vivies — Data Science