“Coder’s Office” from the HMCS Sackville

Highlights from Community Over Code, Halifax, Canada, October 2023

Paul Brebner
Open Source Journal
6 min read · Nov 1, 2023

Arriving in Halifax, Canada, after a rather long trip from Canberra, Australia (via Vancouver, Canada), I was still pretty jet-lagged, so I decided to benefit from the sea breeze down at the scenic Halifax harbour. There are a couple of ships docked at the harbour, and I had the chance to look around HMCS Sackville, the last Flower-class corvette from WW2. I was impressed to see that coders were important enough to have their own office on ships in those days! (Actually, they were signals cryptographic coders/decoders.)

HMCS Sackville

This is actually the second Community Over Code conference this year that I’ve had the privilege to attend, chair a track at (the 2nd Performance Engineering track), speak at, and meet lots of interesting people at. It’s the renamed “ApacheCon” conference series, and the first was in Beijing earlier in the year, with significantly less jet lag for me: just a 1-hour time difference from Canberra, whereas Halifax was a 40+ hour trip plus a 10-hour time difference (or 14 hours, depending on which way you do the maths).

Instaclustr/NetApp had a booth, and were Gold sponsors, too.

Go for Gold! Rich Bowen thanking the sponsors.

The conference was pretty intense, squeezing a lot into 3 days. Between talking to people, attending the social events, co-chairing the Performance Engineering track, and giving a couple of talks, I also squeezed in attending some talks (unfortunately, half of the Streaming track was on at the same time as my track, so I didn’t get to see all of those).

Here are a couple of interesting talks I attended.

Declarative Reasoning with Timelines: The Next Step in Event Processing — Ryan Michael

I was particularly interested in this talk, as I was giving a talk later in the conference in the Geospatial track on streaming Machine Learning. Given my experiences with real-time ML, I wasn’t surprised when Ryan Michael explained why real-time AI is difficult, but some of the reasons were new to me (or at least the terminology was; I suspect I had come across them already). These included sub-minute data/time granularity, meta-temporal reasoning (critical/complex), and temporal leakage. Whoops, I think I had that in my data, as I computed the average delivery time over an hour and included it as a feature: obviously the average over an hour is only something you can know at the end of the hour, not before. He also mentioned Timelines, which are a simple way to reason about events. These look like a great tool for visual temporal reasoning, and help by integrating temporal aggregations and different window types. You also need efficient execution, as real-time learning needs to be continuously up to date, which is computationally expensive. The project is called Kaskada and is licensed under Apache 2.0.
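To make the temporal leakage problem concrete, here is a minimal sketch in plain Python (not Kaskada, and not the code from my talk). The event data and both function names are made up for illustration. The first function computes the average over the whole clock hour and attaches it to every event in that hour, which is the leaky version; the second only uses deliveries that had already completed before each event.

```python
from datetime import datetime, timedelta

# Hypothetical delivery events: (completion time, delivery duration in minutes).
events = [
    (datetime(2023, 10, 9, 10, 5), 20.0),
    (datetime(2023, 10, 9, 10, 20), 30.0),
    (datetime(2023, 10, 9, 10, 50), 40.0),
]


def hour_of(ts):
    """Truncate a timestamp to the start of its clock hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def leaky_hourly_average(events):
    """Average over the whole clock hour, attached to every event in that hour.

    This leaks the future: the 10:05 event "sees" the 10:20 and 10:50 deliveries.
    """
    by_hour = {}
    for ts, minutes in events:
        by_hour.setdefault(hour_of(ts), []).append(minutes)
    return [sum(by_hour[hour_of(ts)]) / len(by_hour[hour_of(ts)]) for ts, _ in events]


def trailing_average(events, window=timedelta(hours=1)):
    """Average of deliveries completed strictly before each event, within `window`.

    Returns None for an event with no earlier delivery in the window.
    """
    features = []
    for ts, _ in events:
        past = [m for t, m in events if ts - window <= t < ts]
        features.append(sum(past) / len(past) if past else None)
    return features


print(leaky_hourly_average(events))  # [30.0, 30.0, 30.0]
print(trailing_average(events))      # [None, 20.0, 25.0]
```

The leaky feature is identical for all three events because it already includes the 10:50 delivery, while the trailing version changes as each new delivery completes.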

Building a Commercial Service with Community Engagement — John Jackson

I was interested in this talk as Instaclustr recently went on a similar journey in providing Uber’s open source Cadence Workflow engine as a managed service (alongside Cassandra, Kafka, OpenSearch, Redis and PostgreSQL). John talked about Amazon’s Apache Airflow service. This talk was in the community track co-chaired by my colleague Sharan Foga.

Sharan was co-chair of the community track

Adding vector search to Apache Cassandra — Jonathan Ellis

Everyone is talking about vector search these days, and Jonathan Ellis did a great job explaining the ideas behind embeddings and vector search, relevant technologies (e.g. JVector), some tips (don’t over-parameterize your embeddings), and how it will work in Cassandra (using SAI, Storage-Attached Indexes).
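As a rough illustration of the basic idea (and not how JVector or Cassandra implement it), here is a small brute-force sketch in plain Python. The three-dimensional "embeddings" and the function names are invented for the example; real embeddings have hundreds of dimensions, and real vector indexes use approximate nearest-neighbour search rather than comparing the query against every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Exact (brute-force) search: score every stored vector, keep the best k."""
    scored = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Made-up toy embeddings, purely for illustration.
embeddings = {
    "cassandra": [0.9, 0.1, 0.0],
    "kafka": [0.1, 0.9, 0.0],
    "opensearch": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

print(top_k(query, embeddings))  # ['cassandra', 'opensearch']
```

Brute force like this scales linearly with the number of stored vectors, which is why a database needs an index to make vector search practical.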

Performance Engineering Track (3rd of this event)

I’ve already summarized the Performance Engineering track here, but here are a few photos from the track.

Here are the slides for my talk (“Developing Fast Applications With Open Source Software — Without The Fury”).

Ritesh Shukla and Duong Nguyen presenting
With some simulation (first time I’ve seen simulation used in Open Source performance engineering!)
Co-chair Roger Abelenda introduces the next speaker, German Eichberger, to talk about Cassandra transaction benchmarking challenges.
Otavio Piske assuring the audience that no actual monsters were harmed in the talk “Hunting Performance Monsters on the Back of a Camel”
Roger Abelenda gave a talk on “Quick load testing from Selenium scripts” (using Apache JMeter DSL)
Stefan Vodita introducing his talk on Apache Lucene benchmarking at Amazon.
Roger and I after it’s all over — see you next year!

Leveraging Large Language Models for SQL generation in Hue [Cloudera Sponsor Session] — Sreenath Somarajapuram

This was an interesting talk on how to use LLMs for SQL query generation using Hue, an open source SQL editor. I keep thinking I should try LLMs for query generation for Cassandra, Kafka Streams and others (e.g. PostgreSQL), as this is at least the 3rd time I’ve seen an example of this in the last few months.

IoT Overkill: Running a Cassandra and Kafka cluster on Open-Source Hardware — Kassian Wren

My new technology evangelist colleague (whom I met for the 1st time in person in Halifax), Kassian Rosner Wren, gave a great talk in the IoT track (which I’ve spoken in before, too) on their demonstration open source Cassandra/Kafka cluster. The conclusions? Running an open source cluster is lots of work, and they learnt a lot, but it’s easier to spin up some clusters on our managed service!

Talk renamed to IoT Overdrive!
The first time there’s been an “intermission” in a talk to check out a working cluster — as far as I know!

Optimizing Apache Spark Data Pipelines on Kubernetes: Leveraging Spot Instances for Production Efficiency — Hichem Kenniche

Hichem Kenniche introducing this talk.

Another DevRel team colleague, Hichem Kenniche (whom I also met for the 1st time in Halifax), gave a great talk covering:

  • The motivation behind running Spark on Kubernetes
  • Understanding the benefits of Kubernetes for Spark applications
  • Optimizing resource utilization and reducing costs with Spark on Kubernetes
  • Harnessing the potential of spot instances for substantial cost savings
  • Strategies for gracefully handling spot instance interruptions
  • Best practices and future considerations for running Spark on Kubernetes
Hichem looking at your cluster.

Machine Learning over Streaming Spatiotemporal Data with Drifts, using Apache Kafka, Cadence and TensorFlow — Paul Brebner

I had the somewhat dubious privilege of giving one of the last talks at the conference, in the Geospatial track. This is the second time I’ve spoken in this track; thanks to the track chair for the invitation again. This was a tailored version of my “Spinning your Drones” talk, including streaming machine learning over spatiotemporal data. The slides are here.

The Instaclustr booth got a boost this year from Kassian’s demo cluster (a great attractor and discussion starter!). Here are a couple of photos with Kassian Rosner Wren, Ritam Das and Stefan Sakamoto.

Kassian knitting their demo cluster a jumper? (downtime at the booth)
A close up of part of the cluster (Ritam, Stefan and Kassian)

I really enjoyed my first trip to Canada — it’s a pity it’s so far away from Australia! Thanks to everyone who helped make Community Over Code Halifax a great event (even from the other side of the world — approximately 14,000 km and 14 hours time difference away).

Candadia! (misspelling inspired by Australia)

Original LinkedIn article.

Paul Brebner
Open Source Journal

Open Source Technology Evangelist at Instaclustr (by Spot by NetApp). Previously, computer scientist working in R&D in distributed systems, performance, etc.