Tealium Functions

Tealium
Published in Tealium Blog
17 min read · Sep 13, 2021

Introduction

We had an interesting experience with GraalVM, Isolates, Quarkus, and related technologies, and we want to share the issues and pitfalls we found. We managed to implement a service that executes client functions written in JS under memory and timeout restrictions. Our application runs in native image mode, which adds to the overall complexity. This is the story of the Tealium team’s adventures: the most interesting tasks and the issues we solved.

Requirements

Brief description: The goal was to implement a serverless solution for executing potentially unsafe JavaScript functions. We also had to take into account that support for several programming languages might be needed. Inside the functions, clients should be able to make HTTP requests and possibly database requests, and to use the internal Tealium API.

Isolation: The functions should be isolated in terms of security and resource controls. The invocation of one function should not influence the other functions.

Timeout limit: Enforcing a timeout was one of the most important tasks for our application, because a long-running client function can consume a lot of application resources.

Memory limit: Memory is another critical resource, because a single function could consume all the memory of the service. A mechanism to limit the memory of each function is needed to prevent this.

Technical Solution Research

The most promising options, in our view, were:

  • AWS Lambda
  • External JS process (Node.js application)
  • Embedded JS engine (V8 engine, GraalVM with Polyglot API).

AWS Lambda

We started by reviewing the possibility of using AWS Lambda. All of Tealium’s services run in AWS, so it looked logical to reuse the serverless solution that AWS already provides. After initial research, we found several potential issues:

  • Limit on the number of parallel invocations
  • High cost
  • Potential issues with configuring and provisioning new client functions in AWS

This is why we started to research other ways to implement a serverless solution for function execution, and we had to decide on a technology.

External JS Process (Node.js application)

We reviewed the possibility of using a Node.js application for function execution. Here are the pros and cons of this solution.

Pros:

  • Support of native npm modules

Cons:

  • Inside the function, the client has unsafe access to application resources
  • Support of only one language
  • It’s difficult to limit the function by memory

Based on the cons, we decided to move forward with embedded engines, which make it possible to configure memory and time limits for client code.

GraalVM Isolates

Because of these issues with the external JS process, we started to review technologies that can limit the memory of a unit of work inside the application, and we looked at GraalVM Isolates. Isolates are a feature that provides both isolation and a memory limit. More info about Isolates can be found in the Medium article.

In short, an Isolate is a separate Java virtual machine instance running in the same process. Each Isolate has its own heap, which does not depend on the heap of the main application or on the heaps of other Isolates. Creating and correctly closing an Isolate takes a few milliseconds. Isolates can only be used when the application runs in native image mode. Native image is a technology for ahead-of-time compilation of Java code to an executable file. This executable includes the application classes, the classes from their dependencies, runtime library classes, and native JDK code.

Embedded Engine. V8 vs GraalVM Polyglot API

After choosing the technology for memory limits, we started to research embedded engines. We picked two main candidates: V8 and GraalVM with the Polyglot API.

In the case of V8, we found a few issues:

  • The engine is written in C++, so using Java code for the internal API is potentially problematic
  • Supports only JS
  • No support for native images

And some benefits in the case of GraalVM:

  • Convenient use of embedded Java objects
  • The possibility of using other languages via Polyglot API
  • Support of native images

Because of that, we decided to move forward with GraalVM.

GraalVM as a Potential Solution

At first, we planned our GraalVM-based service for executing client functions as follows:

The client code is executed inside a GraalVM JS context. A separate JS context is created for each client function, to prevent functions from conflicting within the same context. The GraalVM JS engine wraps the JS context and is created under the hood when the context is created. Finally, all of this runs inside a GraalVM Isolate.

We then needed to choose a framework that supports native image.

Framework

After settling on GraalVM and Isolates, we had a new constraint for choosing the framework. For our application, we had the following framework requirements:

  • Native image support
  • Dependency injection
  • Service discovery
  • Integration with Kafka
  • Persistence
  • Performance
  • Community support

Usually, choosing a framework is not a problem: developers use Spring Framework, which covers all of these needs. Since we needed native image support, Spring Framework was not an option for us. A full list of the limitations of native image mode can be found here: https://www.graalvm.org/reference-manual/native-image/Limitations/.

Spring is moving toward support for GraalVM native image mode, but it is still in development. As of April 2021, the https://github.com/spring-projects-experimental/spring-native repository was still under spring-projects-experimental.

So we had a choice among 3 frameworks: Helidon.io, Quarkus, and Micronaut. After examining the technical requirements and some basic application examples, we were left with 2 options: Quarkus and Micronaut. Without much hands-on experience, it was difficult to choose. We decided on Quarkus, mostly because of its large community and number of releases.

After several months of using Quarkus in our application, we can say that the framework succeeded at its task. Along the way we found several issues that are still not fixed. Since the technology is new, we were prepared for that. Isolates are no exception: Isolate-specific issues will not be fixed by the Quarkus team. See the comment in the issue https://github.com/quarkusio/quarkus/issues/14089.

So for Isolate-specific problems, we needed to prepare workarounds.

Support for HTTP Requests for graal-js Context

The Graal JS context does not have an embedded HTTP client that functions can use. One potential solution that does not modify the JS context is to share access to Java classes, so that clients can invoke Java code and execute HTTP requests. This option did not work for us because we wanted the final JS context to be completely safe, without any access to Java functionality. So we decided to implement an HTTP client for the JS context.

We looked for a browser specification for HTTP requests and chose the fetch API, as it is popular in the browser world. As a result, fetch in our implementation looks like this:

const fetch = (...args) => {
  if (args.length === 0) {
    throw new TypeError('Failed to execute \'fetch\': 1 argument required, but only 0 present.');
  }

  const request = args[0] instanceof Request
    ? new Request(args[0].url, Object.assign({}, args[0], args[1]))
    : new Request(args[0], args[1]);

  return new Promise((resolve, reject) => {
    httpClient.send({
      uri: request.url,
      body: request.body,
      method: request.method,
      headers: request.headers.map,
    }, (response) => {
      if (response.statusCode) {
        return resolve(new Response(response.body, {
          status: response.statusCode,
          headers: response.headers,
          body: response.body,
          url: request.url,
        }));
      } else {
        return reject(new TypeError(response.errorMessage));
      }
    });
  });
};

Based on the HTTP request result, we return a Promise that calls either resolve or reject. httpClient is a Java backend client that can be used only inside the JS library; clients do not have access to Java operations. Under the hood, the Java HTTP client creates a separate Callable for each HTTP request. After the request completes, the thread running that Callable executes the callback defined in the client function. The backend also contains logic for stopping HTTP requests if the function reaches its timeout limit. Our implementation sends ‘Connection: close’, which makes the Isolate closeable; more on this in the “Die-hard Isolate” section.
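The threading model described above can be sketched in plain Java. This is only an illustration under assumptions, not the Tealium implementation: the class `CallbackHttpExecutor` and its methods `send` and `awaitOrCancel` are hypothetical names, a `Callable<String>` stands in for the real HTTP call, and the response is reduced to a string.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical sketch: each "HTTP request" is a Callable; the worker thread
// that ran it also invokes the callback, and a timeout cancels the request.
class CallbackHttpExecutor {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // 'request' stands in for the real HTTP call. The callback receives the
    // response body, or a message prefixed with "error: " on failure.
    Future<Void> send(Callable<String> request, Consumer<String> callback) {
        return pool.submit(() -> {
            String result;
            try {
                result = request.call();
            } catch (InterruptedException e) {
                // Cancelled because of a timeout: skip the callback entirely.
                Thread.currentThread().interrupt();
                return null;
            } catch (Exception e) {
                result = "error: " + e.getMessage();
            }
            callback.accept(result);
            return null;
        });
    }

    // Waits up to timeoutMs. On timeout the worker is interrupted.
    // Returns true only if the request and its callback completed in time.
    boolean awaitOrCancel(Future<Void> future, long timeoutMs) {
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        CallbackHttpExecutor executor = new CallbackHttpExecutor();
        Future<Void> pending = executor.send(
            () -> "response body",
            body -> System.out.println("callback got: " + body));
        executor.awaitOrCancel(pending, 1_000);
        executor.shutdown();
    }
}
```

Cancelling with `cancel(true)` only helps if the underlying call reacts to interruption, which is one reason a real implementation would also need per-request connect and read timeouts.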

Isolates Are Such Isolates

We needed GraalVM Isolates for the memory limits on client functions and for isolating function execution. There are plans to implement resource limits for contexts in future GraalVM versions. With such a feature, it might be possible to limit function memory without using Isolates.

Because the technology is new but very important for us, we needed to understand all the pitfalls and develop our own best practices. Here is a real example of an Isolate with a memory limit:

@CEntryPoint
private static ObjectHandle isolateWrapper(
    @CEntryPoint.IsolateThreadContext IsolateThread jsContext,
    IsolateThread nettyContext, ObjectHandle requestHandle
) {
    // Limit the heap of this Isolate to 30 MB.
    RuntimeOptions.set("MaxHeapSize", 30L * 1024L * 1024L);
    //...
}

The method annotated with @CEntryPoint is executed inside the Isolate. We set the MaxHeapSize value for the Isolate heap (30 MB here). If the Isolate reaches this value, we get an OutOfMemoryError, which has special catch logic, but more about that later.

Logs Inside Isolate Don’t Work with Quarkus

This is neither a GraalVM nor an Isolate problem. But it has to be resolved in order to support native image inside a large application, where some framework has to be used. It may also be useful for Quarkus, as the most popular framework, to know about potential problems.

The problem is simple. Let’s take a Quarkus sample application with a simple class and a logger:

private static final Logger LOGGER =
    LoggerFactory.getLogger(SomeClass.class);

Call the logger from a simple method:

void myMethod() {
    LOGGER.info("abc");
}

The log is printed to the console. Now let’s create an IsolateThread inside the method:

var isolateThread = Isolates.createIsolate(params);

Call LOGGER.info() from a method annotated with @CEntryPoint:

// Call the entry point with the Isolate created above.
myMethod2(isolateThread);

@CEntryPoint
private static void myMethod2(
    @CEntryPoint.IsolateThreadContext IsolateThread isolateThread
) {
    LOGGER.info("not shown");
}

The result: no logs. We created the issue https://github.com/quarkusio/quarkus/issues/10157 and tried to prepare a workaround to get logs from inside the Isolate. To do this, we decided to pass the data needed for each log entry from the Isolate to the main thread, switching between the Isolate and the main thread for every log call. The implementation:

/* This method is performed in the jsContext Isolate thread */
public static void logMessageFromIsolate(
    Logger logger,
    Level level,
    String logMessage,
    IsolateThread nettyContext
) {
    try {
        String msg = OBJECT_MAPPER.writeValueAsString(
            new LogObject(logger.getName(), logMessage, level, ""));
        ObjectHandle requestHandle =
            IsolateCopyUtil.copyString(nettyContext, msg);
        performLoggingInContext(nettyContext, requestHandle);
    } catch (JsonProcessingException e) {
        e.printStackTrace();
        throw new RuntimeException(e);
    }
}

/* This method is performed in the main Isolate thread and then returns
   to the jsContext Isolate thread */
@CEntryPoint
private static void performLoggingInContext(
    @IsolateThreadContext IsolateThread nettyContext,
    ObjectHandle requestHandle
) throws JsonProcessingException {
    /* Resolve and delete the requestHandle,
       now that execution is in nettyContext. */
    try {
        String data = ObjectHandles.getGlobal().get(requestHandle);
        var logObject = OBJECT_MAPPER.readValue(data, LogObject.class);
        LOG_ACTIONS.get(logObject.getLevel()).accept(
            LoggerFactory.getLogger(logObject.getLogger()),
            logObject.getMessage()
        );
    } finally {
        ObjectHandles.getGlobal().destroy(requestHandle);
    }
}

With this solution, we got a lot of interesting exceptions while the application was running. One of them:

StackOverflowError: Enabling the yellow zone of the stack did not make any stack space available. Possible reasons for that: 1) A call from native code to Java code provided the wrong JNI environment or the wrong IsolateThread; 2) Frames of native code filled the stack, and now there is not even enough stack space left to throw a regular StackOverflowError; 3) An internal VM error occurred.

We therefore changed the configuration of the Isolate and the main thread. Because we still received such exceptions from time to time, we implemented a simple library for logging inside the Isolate.
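The article does not show that logging library, so the following is only a guess at one plausible shape, not the actual Tealium code. It is a minimal sketch of a framework-free logger that writes straight to a `PrintStream`, so a log call never has to cross from the Isolate into the main thread. The class `IsolateLog`, its `Level` enum, and the line format are all hypothetical.

```java
import java.io.PrintStream;
import java.time.Instant;

// Hypothetical minimal logger with no dependency on a logging framework:
// every call formats one line and writes it to the given stream.
class IsolateLog {
    enum Level { DEBUG, INFO, WARN, ERROR }

    private final String name;
    private final PrintStream out;
    private final Level threshold;

    IsolateLog(String name, PrintStream out, Level threshold) {
        this.name = name;
        this.out = out;
        this.threshold = threshold;
    }

    // Produces lines such as "1970-01-01T00:00:00Z INFO [name] message".
    static String format(Instant time, Level level, String name, String message) {
        return time + " " + level + " [" + name + "] " + message;
    }

    // Writes the message only if its level is at or above the threshold.
    void log(Level level, String message) {
        if (level.compareTo(threshold) >= 0) {
            out.println(format(Instant.now(), level, name, message));
        }
    }

    public static void main(String[] args) {
        IsolateLog log = new IsolateLog("function-runner", System.out, Level.INFO);
        log.log(Level.DEBUG, "suppressed by the threshold");
        log.log(Level.INFO, "printed to the console");
    }
}
```

The trade-off of such a design is that appenders, formatting configuration, and log routing from the framework are lost; whether that is acceptable depends on how the logs are collected.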

Die-hard Isolate

The first thing we learned about Isolates is that creating them is very easy, but closing one is not. Our team ran into problems when making HTTP requests from an Isolate. Here is an example of code that creates an Isolate and executes an HTTP request inside it:

public static void httpConnectionInsideIsolate() throws InterruptedException {
    System.out.println("------ HTTP Request Inside Isolate ------");
    var context = Isolates.createIsolate(
        CreateIsolateParameters.getDefault()
    );
    nativeWrapper(context);
    Isolates.tearDownIsolate(context);
}

@CEntryPoint
public static void nativeWrapper(
    @CEntryPoint.IsolateThreadContext IsolateThread context
) {
    try {
        httpRequestsInsideIsolate();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static void httpRequestsInsideIsolate() {
    try {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("uri"))
            .build();

        HttpResponse<String> response =
            client.send(request, BodyHandlers.ofString());

        System.out.println(response.body());
    } catch (Exception e) {
        e.printStackTrace();
    }
}

The result: an Isolate cannot be closed until all of its resources are released. In this case, the time needed to close it equals the timeout of the HTTP connection in the connection pool. After additional investigation, we found that the problem was in the Java connection pool: all connections are marked as keep-alive and added to the pool. We did not find a working way to disable keep-alive in Java 11.

Having learned this, we switched to a plain HTTP connection and disabled keep-alive per request with the header Connection: close. And it works:

HttpParams request = getHttpParams();
var url = new URL(request.getUri());
//open connection
var connection = (HttpURLConnection) url.openConnection();
//set method
connection.setRequestMethod(request.getMethod());
//set timeout
connection.setConnectTimeout(Math.min(HTTP_TIMEOUT_MS, timeLeft));
//set headers
for (var header : request.getHeaders().entrySet()) {
    connection.setRequestProperty(header.getKey(), header.getValue());
}
//ensure connection will be closed
connection.setRequestProperty("Connection", "close");
//write body if required
if (bodyRequired(request.getMethod())) {
    connection.setDoOutput(true);
    try (var out = connection.getOutputStream()) {
        out.write(replaceAuthPlaceholder(
            request.getBody(),
            host
        ).getBytes(StandardCharsets.UTF_8));
        out.flush();
    }
}
//get response code
int responseCode = connection.getResponseCode();
var statusText = Status.fromStatusCode(responseCode).getReasonPhrase();
//process response body
String response = null;
try (var inputStream = is2xx(responseCode)
    ? connection.getInputStream()
    : connection.getErrorStream()
) {
    if (inputStream != null) {
        try (var in = new BufferedReader(
            new InputStreamReader(inputStream)
        )) {
            response = in.lines().collect(Collectors.joining("\n"));
        }
    }
}

After these changes, the Isolate can be closed quickly because all resources are released. Disabling keep-alive and the connection pool is not a problem inside an Isolate, because a client function runs for only 10 seconds: it is not a long-running application where a connection pool would affect performance under a large number of HTTP requests.
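The snippet above caps only the connect timeout by the time the function has left. A minimal sketch of how the remaining budget could be applied to both the connect and the read timeout follows. The class `TimeBudget`, the value of `HTTP_TIMEOUT_MS`, and the nanosecond deadline are assumptions made for illustration; they are not taken from the Tealium code.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: bounds connect and read timeouts by the time the
// client function still has left before its deadline.
class TimeBudget {
    static final int HTTP_TIMEOUT_MS = 5_000; // assumed default, not from the article

    // Milliseconds between nowNanos and deadlineNanos, clamped to at least 1:
    // HttpURLConnection treats a timeout of 0 as "infinite" and rejects
    // negative values, so an expired deadline must not produce 0 or less.
    static int remainingMs(long deadlineNanos, long nowNanos) {
        long leftMs = (deadlineNanos - nowNanos) / 1_000_000L;
        return (int) Math.max(1L, Math.min(leftMs, (long) Integer.MAX_VALUE));
    }

    // openConnection() does not touch the network, so the timeouts are in
    // place before any I/O happens.
    static HttpURLConnection open(String url, long deadlineNanos) throws IOException {
        int budget = Math.min(HTTP_TIMEOUT_MS, remainingMs(deadlineNanos, System.nanoTime()));
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setConnectTimeout(budget);
        connection.setReadTimeout(budget); // also bound slow responses
        connection.setRequestProperty("Connection", "close");
        return connection;
    }

    public static void main(String[] args) throws IOException {
        long deadline = System.nanoTime() + 300_000_000L; // 300 ms from now
        HttpURLConnection connection = open("http://localhost:8081/", deadline);
        System.out.println("connect timeout ms: " + connection.getConnectTimeout());
    }
}
```

Using the same budget for both timeouts means a request can take up to twice the budget in the worst case, so a stricter implementation would recompute the remaining time between the connect and read phases.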

Out of Memory Error

An Isolate can have a memory limit, as described earlier. Now it is time to look at the exceptional situation where a client function tries to use more memory than allowed. Here is an example of a function that triggers an out-of-memory error:

//
// Allocate a certain size to test if it can be done.
//
'use strict';

function alloc(size) {
  const numbers = size / 8;
  const arr = [];
  arr.length = numbers; // Simulate allocation of 'size' bytes.
  for (let i = 0; i < numbers; i++) {
    arr[i] = i;
  }
  return arr;
}

//
// Keep allocations referenced so they aren't garbage collected.
//
const allocations = [];

//
// Allocate fixed-size chunks in a loop until we hit the limit.
//
function allocToMax() {

  console.log("Start");

  let allocationStep = 1024 * 1024 * 5;

  while (true) {
    // Allocate memory.
    const allocation = alloc(allocationStep);

    // Allocate and keep a reference so the allocated memory
    // isn't garbage collected.
    allocations.push(allocation);

  }
}

fetch('http://localhost:8081')
  .then(response => {
    fetch('http://localhost:8081')
      .then(response => {
        fetch('http://localhost:8081')
          .then(response => {
            console.log("generate timeout error")
            while(true);
          }).catch(err=>{console.error(err.message)});
        console.log("generate reference error")
        qqq.uty;
      }).catch(err=>{console.error(err.message)});
    console.log("generate oom")
    allocToMax();
    console.log("never get here")
    return response.json();
  }).catch(err=>{console.error(err.message)});

We reproduced the situation with this function. After getting the out-of-memory error, we were not able to close the Isolate. The reason was a second out-of-memory error raised while executing the logic for closing the Isolate. To figure out why the memory was not released after processing the exception, we tested on different Java versions. We found that in GraalVM, an out-of-memory error keeps a stack trace with an array of all JS objects created in the client function. A full description of the problem is in the issue we created: https://github.com/oracle/graaljs/issues/384. To release the Isolate memory correctly, we borrowed an approach that Java itself already uses: the JVM creates OutOfMemoryError objects at application start and later uses these objects when the application actually runs out of memory. In our implementation it looks like this:

@CEntryPoint
private static ObjectHandle isolateWrapper(
    @CEntryPoint.IsolateThreadContext IsolateThread jsContext,
    IsolateThread nettyContext,
    ObjectHandle requestHandle
) {
    //...
    // prepared object that should be returned
    // if heap size exceeded in isolate
    var preparedOOMResult = IsolateSupport.writeHandle(
        nettyContext, InvocationResponseBuildHelper.OOM_RESPONSE
    );

    try {
        //action
    } catch (Throwable th) {
        if (th.getCause() instanceof OutOfMemoryError) {
            return preparedOOMResult;
        }
        //...
    }
    //...
}

We create the preparedOOMResult object in memory before the function executes. This object is used to pass the result out of the Isolate when memory runs out. Because the object was created earlier, we do not need to allocate memory inside the Isolate at the moment when its memory is completely exhausted.
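The same pattern can be shown outside of Isolates in plain Java. This is a minimal sketch under assumptions: the class `PreparedOomResult`, the JSON strings, and the `Supplier<String>` standing in for the client function are hypothetical. Unlike the snippet above, which checks only the direct cause, this sketch walks the whole cause chain; that is a variation, not the authors’ logic.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the "prepare the failure result up front" pattern.
class PreparedOomResult {
    // Built once, before any client code runs, so nothing has to be allocated
    // for the response at the moment the heap is exhausted.
    static final String OOM_RESPONSE = "{\"error\":\"memory limit exceeded\"}";
    static final String FAILURE_RESPONSE = "{\"error\":\"function failed\"}";

    // True if th or any of its causes is an OutOfMemoryError. The depth cap
    // guards against cyclic cause chains.
    static boolean causedByOom(Throwable th) {
        for (int depth = 0; th != null && depth < 32; depth++) {
            if (th instanceof OutOfMemoryError) {
                return true;
            }
            th = th.getCause();
        }
        return false;
    }

    // Runs the action and maps failures to responses that already exist.
    static String invoke(Supplier<String> action) {
        String prepared = OOM_RESPONSE; // obtained before the action starts
        try {
            return action.get();
        } catch (Throwable th) {
            return causedByOom(th) ? prepared : FAILURE_RESPONSE;
        }
    }

    public static void main(String[] args) {
        System.out.println(invoke(() -> "ok"));
        System.out.println(invoke(() -> {
            throw new RuntimeException(new OutOfMemoryError("simulated"));
        }));
    }
}
```

The error here is simulated by throwing an `OutOfMemoryError` directly; a real heap exhaustion can fail in less predictable places, so the catch block should do as little work as possible.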

Too Many Open Files

It seemed that the application was ready and all critical tasks were solved, but load testing uncovered an obscure problem. During load testing, the same function started to fail with the error “Too many open files…”. This error means that the application cannot create a new HTTP connection because the operating system’s limit on open files has been reached. The test function contained 2 HTTP requests without any additional logic, and we used 2 threads to run it under load.

Our first suspicion was that the sockets opened for HTTP requests were not being closed. We started by investigating the configuration of the sockets created by the Isolate. The idea was to disable keep-alive for sockets at the operating system level. We made the changes in the operating system and load tested with the new configuration. It did not help: the number of open files was unaffected. After additional runs and research, we learned more:

  • The open files were Unix sockets
  • With several HTTP requests, only one Unix socket is opened, and this file is created during the first HTTP request.

Given this knowledge, we concluded that the problem was in GraalVM, not in our code. We therefore asked for help in the GraalVM Slack chat to understand how it could be fixed. The GraalVM developers advised us to create an issue on GitHub. We prepared a small project where the issue can be easily reproduced and shared it with a description: https://github.com/oracle/graal/issues/2967. The example:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import org.graalvm.nativeimage.IsolateThread;
import org.graalvm.nativeimage.Isolates;
import org.graalvm.nativeimage.Isolates.CreateIsolateParameters;
import org.graalvm.nativeimage.c.function.CEntryPoint;

public class OpenedFileDescriptors {
    private static final String URL = "http://localhost:8085";

    public static void main(String[] args)
        throws IOException, InterruptedException {
        //no opened file descriptors
        httpConnectionOutsideIsolate();
        printOpenedFileDescriptors();
        //a lot of file descriptors
        httpConnectionInsideIsolate();
        printOpenedFileDescriptors();
    }

    private static void printOpenedFileDescriptors()
        throws IOException, InterruptedException {
        String[] cmd = {
            "/bin/sh",
            "-c",
            "lsof -n | cut -f1 -d' ' | uniq -c | sort | tail"
        };
        Process pr = Runtime.getRuntime().exec(cmd);
        BufferedReader input = new BufferedReader(
            new InputStreamReader(pr.getInputStream())
        );
        String line;
        try {
            while ((line = input.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        int exitVal = pr.waitFor();
    }

    public static void httpConnectionOutsideIsolate() {
        System.out.println("---- HTTP Requests Without Isolate ----");
        var iterations = 5000;
        while (iterations > 0) {
            iterations--;
            try {
                httpRequestsInsideLoop();
            } catch (Throwable t) {
                t.printStackTrace();
            }
        }
    }

    public static void httpConnectionInsideIsolate() {
        System.out.println("---- HTTP Requests Inside Isolate ----");
        var iterations = 5000;
        while (iterations > 0) {
            iterations--;
            var context = Isolates.createIsolate(
                CreateIsolateParameters.getDefault()
            );
            nativeWrapper(context);
            Isolates.tearDownIsolate(context);
        }
    }

    @CEntryPoint
    public static void nativeWrapper(
        @CEntryPoint.IsolateThreadContext IsolateThread context
    ) {
        try {
            httpRequestsInsideLoop();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void httpRequestsInsideLoop() {
        try {
            //open connection
            var connection = (HttpURLConnection)
                new URL(URL).openConnection();
            //set method
            connection.setRequestMethod("POST");
            //ensure connection will be closed
            connection.setRequestProperty("Connection", "close");
            connection.setDoOutput(true);
            try (var out = connection.getOutputStream()) {
                out.write(
                    "{'param':'value'}"
                        .getBytes(StandardCharsets.UTF_8)
                );
                out.flush();
            }
            //get response code
            int responseCode = connection.getResponseCode();
            var responseMessage = connection.getResponseMessage();
            //process response body
            String response = null;
            try (var inputStream = is2xx(responseCode)
                ? connection.getInputStream()
                : connection.getErrorStream()
            ) {
                if (inputStream != null) {
                    try (var in = new BufferedReader(
                        new InputStreamReader(inputStream)
                    )) {
                        response = in.lines()
                            .collect(Collectors.joining("\n"));
                    }
                }
            }
            var headers = connection.getHeaderFields();
            // System.out.println("HTTP call is completed, code = "
            //     + responseCode
            //     + ", response message = " + responseMessage
            //     + ", response = " + response);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static boolean is2xx(int code) {
        return code >= 200 && code <= 299;
    }
}
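The reproducer above shells out to lsof to observe the descriptors. On Linux, a lighter-weight check is to list /proc/self/fd from inside the process. The following is a minimal sketch under assumptions: the class `OpenFdCounter` is hypothetical, the approach is Linux-only, and it has not been verified inside a native image.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;
import java.util.stream.Stream;

// Hypothetical helper: counts this process's open file descriptors by listing
// /proc/self/fd. Returns empty where that directory is not available.
class OpenFdCounter {
    static OptionalLong count() {
        Path fdDir = Path.of("/proc/self/fd");
        if (!Files.isDirectory(fdDir)) {
            return OptionalLong.empty();
        }
        // The directory stream itself holds one descriptor while it is open,
        // so the count includes it. For spotting a leak, only the trend
        // across repeated calls matters.
        try (Stream<Path> entries = Files.list(fdDir)) {
            return OptionalLong.of(entries.count());
        } catch (IOException e) {
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("open descriptors: " + count());
    }
}
```

Calling `count()` before and after a batch of Isolate invocations would show whether the number keeps growing, without depending on lsof being installed.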

Initially, the issue we created had no information about the root cause. As we continued researching, we had the idea to debug the application in jar mode. In jar mode, it is not possible to investigate the logic inside Isolates, but it is possible to find the places where file descriptors are created during HTTP requests. After a long debugging session, we found the root cause. It turns out that before the first HTTP request in a Java application, a marker file descriptor is created that stays open for the full application life cycle. This file descriptor is removed only when the application stops. With Isolates, this marker is created on the first HTTP request inside each new Isolate and is never deleted. Accordingly, the application opens a lot of markers that cannot be cleaned up. Let’s look at the details of why it works this way. Here is PlainSocketImpl from the JDK:

class PlainSocketImpl extends AbstractPlainSocketImpl
{
    static {
        initProto();
    }
    // ...
}

The static block with the initProto call is executed on the first use of PlainSocketImpl, which happens during the first HTTP request. Now we can check the initProto method:

static native void initProto();

From the Java sources, we can see that initProto is a native method. So we checked the JDK sources, inside PlainSocketImpl.c:

JNIEXPORT void JNICALL
Java_java_net_PlainSocketImpl_initProto(JNIEnv *env, jclass cls) {
    ...
    /* Create the marker fd used for dup2 */
    marker_fd = getMarkerFD();
}

It is this getMarkerFD method that creates the marker file descriptor. So we found the root cause: file descriptors that are never cleaned up. But it was not clear how to solve it. The number of the marker file descriptor is stored inside PlainSocketImpl.c:

/*
* file descriptor used for dup2
*/
static int marker_fd = -1;

It is impossible to read this variable from Java code because it is a static variable private to the C source file. Thus, we had two possible solutions:

  • Implement our own version of PlainSocketImpl.c, with the needed dependencies, so that the value of this variable can be read from Java code;
  • Create a scheduled job that cleans up the markers. The idea is that if the same Unix sockets are still present after several executions of the job, they can be closed and removed.

The first option is very difficult to implement, because the source code has a lot of dependencies that would also have to be overridden and updated for each Java version. We added more info to the GitHub issue and implemented the second option as a workaround. After that, the problem no longer reproduced under load testing.
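The detection half of such a scheduled job can be sketched in plain Java. This is an illustration under assumptions, not the Tealium workaround: the class `StaleDescriptorTracker` is hypothetical, the way open Unix sockets are listed is left out, and so is the native call that would actually close a descriptor.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a descriptor that shows up in several consecutive
// scans is treated as a leaked marker and reported as a cleanup candidate.
class StaleDescriptorTracker {
    private final int threshold;
    private final Map<String, Integer> seenCount = new HashMap<>();

    StaleDescriptorTracker(int threshold) {
        this.threshold = threshold;
    }

    // 'current' holds identifiers of the Unix sockets open right now.
    // Returns those seen in at least 'threshold' consecutive scans,
    // counting this one.
    Set<String> scan(Set<String> current) {
        // Forget descriptors that disappeared since the previous scan.
        seenCount.keySet().retainAll(current);
        Set<String> stale = new HashSet<>();
        for (String id : current) {
            int count = seenCount.merge(id, 1, Integer::sum);
            if (count >= threshold) {
                stale.add(id);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        StaleDescriptorTracker tracker = new StaleDescriptorTracker(2);
        tracker.scan(Set.of("socket:[101]", "socket:[102]"));
        System.out.println(tracker.scan(Set.of("socket:[101]", "socket:[103]")));
    }
}
```

A job built on this could run `scan` from a `ScheduledExecutorService` at a fixed rate. Identifying sockets by something stable such as the inode shown in /proc/self/fd, rather than by descriptor number, would reduce the risk of closing a descriptor whose number was reused by a live connection.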

As of April 2021, the GraalVM team had still not fixed the issue. It is a very serious problem when using Isolates, and one that should be known before adopting the feature.

Conclusion

We have now gained some knowledge of how to implement quite demanding systems in production with a complete serverless architecture that is cheap and effective. In this article, we wanted to share our experience of using GraalVM Isolates and Quarkus.

Overall, it has been an inspiring adventure that gave us experience and knowledge of various technologies and their specifics. Hopefully, it was worth the read and you learned something from it. We look forward to your feedback, questions, and comments.
