Strata Data Conference — SF 2019

Magnus Palmér
Published in Täckblog
8 min read · Apr 6, 2019
Data is fun! Especially if you go the day before it starts to collect the passes.

I will start with a really lame disclaimer: I booked this conference last year, before I started actively working on reducing my CO2 footprint.

I am working in an awesome team building an in-house (or rather in-cloud?!) recommendation engine at INGKA, part of IKEA group. Four of us went to the Strata Data Conference in San Francisco. Here are my highlights from this trip. Highly subjective as always.

The measure and mismeasure of fairness in machine learning

The talk covered risk assessment algorithms for criminal offenders. The speaker used the example of an algorithm used today by the justice system in some states. It is biased against Black defendants. Why is that?

Math is not equal to equity!

The talk went on to show an example where gender was excluded from the model, and the result discriminated against women. This can happen in a lot of cases: when we try to protect a minority by excluding attributes, we can easily end up harming them more instead!
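To make the point concrete for myself, here is a minimal sketch with completely made-up numbers (not the data from the talk). It assumes a toy risk score based only on the number of prior offences, and a hypothetical cut-off for "high risk". Because the synthetic women reoffend less often than the synthetic men with the same priors, the gender-blind score overstates their risk.

```python
from collections import defaultdict

# Synthetic counts, invented purely for illustration:
# (gender, number of priors) -> (people, people who reoffended)
COUNTS = {
    ("man", 2): (100, 60),
    ("woman", 2): (100, 30),
}
THRESHOLD = 0.4  # hypothetical "high risk" cut-off


def blind_risk(counts):
    """Estimate risk per number of priors while ignoring gender."""
    people = defaultdict(int)
    reoffended = defaultdict(int)
    for (_gender, priors), (n, k) in counts.items():
        people[priors] += n
        reoffended[priors] += k
    return {priors: reoffended[priors] / people[priors] for priors in people}


def aware_risk(counts):
    """Observed reoffence rate per (gender, priors) group."""
    return {group: k / n for group, (n, k) in counts.items()}


blind = blind_risk(COUNTS)
aware = aware_risk(COUNTS)

for gender, priors in COUNTS:
    flagged = blind[priors] > THRESHOLD
    print(gender, "blind score:", blind[priors],
          "observed rate:", aware[(gender, priors)],
          "flagged as high risk:", flagged)
```

In this toy setup everyone with two priors gets the same blind score and is flagged, although the observed rate for the women is below the cut-off. Real risk tools are far more complex, so treat this only as an illustration of the mechanism.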

During the Q&A at the end, several very intelligent people asked about applying methods to address this that are beyond my skillset to understand. However, none of these would help in this case; they would hurt more instead.

Hoping this becomes available to watch. I would say it should be mandatory for everyone working with algorithms involving real people's lives.

My favourite talk of the conference.

Link to more info.

Customer stories about events using Apache Pulsar

Unified API for queueing and streaming with tiered storage…

Yes, finally, I have seen the light. I am now officially an Apache Pulsar fanboy.

I have been skeptical about Kafka for years and never found a reason to use it.

Now I will hopefully not have to.

The most important difference is that in Pulsar the message broker is separated from the storage. In this way, older messages can automatically be offloaded into S3 or GCS.

The multi-tenancy and namespace features are also a great improvement over Kafka in a, wait for it…, multi-tenant setup.

Regarding community and tooling, however, Kafka has the advantage as a more mature and widely adopted project.

In the first customer case, they evaluated the following messaging systems: Apache Pulsar, Apache Kafka, RabbitMQ, Apache RocketMQ, Apache ActiveMQ and Pravega. For this customer, running the infrastructure on Apache Pulsar with Pulsar Functions for stream processing cost half as much as running it on AWS Kinesis plus Apache Spark.

In the second customer case, they had MSMQ (Microsoft), RabbitMQ and Kafka. First they moved MSMQ to RabbitMQ. Then they tried to move to using only Kafka. But on Kafka you can’t scale to more consumers in a consumer group than you have partitions on that topic. (You can create more consumers, but they will be idle.)
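The partition limit is easy to see in a toy model. The sketch below is not Kafka's actual assignor code, just a simplified round-robin assignment under the assumption that each partition is owned by exactly one consumer in the group. The function and consumer names are made up for illustration.

```python
def assign_partitions(num_partitions, consumers):
    """Simplified round-robin: every partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Hypothetical topic with 3 partitions and a group of 5 consumers.
assignment = assign_partitions(3, ["c1", "c2", "c3", "c4", "c5"])
idle = [consumer for consumer, parts in assignment.items() if not parts]

print(assignment)
print("idle consumers:", idle)
```

With three partitions and five consumers, two consumers end up with nothing to read, which is exactly the scaling wall this customer ran into.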

I had a discussion during the “Meet the experts session” with Dean at Lightbend, who held the Tuesday morning tutorial that I went to. We discussed Kafka vs Apache Pulsar. The cofounder of Streaml.io also joined us. So we had a nice discussion about Spark streaming, Kafka Streams, Akka Streams, Pulsar and messaging in general. I think it is definitely worth checking out Apache Pulsar as an alternative to Kafka and GCP PubSub.

Keynotes

On Wednesday the keynotes were mostly sales. The one big exception was Cyberconflict: A new era of war, sabotage, and fear by David Sanger, NYT. That was a good and interesting talk, although I have been binge-listening to the Darknet Diaries podcast for the last couple of weeks, so nothing was really upsetting. The talks on Thursday were much better. Hacking the vote: The neuropolitical universe by Elizabeth Svoboda is quite disturbing. A really great talk, even if it leaves you with a feeling of unease. As did the talk by Peter Singer, Likewar: How social media is changing the world…and how the world is changing social media.

The talk by Electronic Arts had the funniest joke (or was it a pun?) of the whole conference. I am not sure I remember it exactly, but it went something like this:

I am fortunate enough to work at a company where people get tattoos of what we do. (And some pictures of people with tattoos from their games). Then a question: “How many here have a Cloudera tattoo?”

This was so funny, sorry Cloudera. I really do appreciate your work with this conference, but your talks were all pure sales…

The talk by Theresa Johnson from Airbnb was also a good one. She made a case against black-box models and showed why. Makes perfect sense to me. Thanks!

Exhibitors

I wonder how many of these will still be around in one or two years? I predict half of them will be gone, either acquired or out of venture capital.

Anyway, there was a lot of DataOps, Data as a Service, Data Platforms, Data Integrations, Data Pipelines etc.

Let's see how many of these are left in a year or two.

Workshops

Hands-on Machine Learning with Kafka-based Streaming Pipelines — a Tutorial

Good, but could have been great. I had really high expectations for this tutorial.

Lots of jokes that only the two presenters thought were funny. Really tough crowd. Dead silent and they couldn’t even see us due to the light settings.

“Microsoft is doing great things, Google is doing amazing things, but in the end, it all comes down to AWS.” (Boris @Lightbend, probably misquoted from memory)

Kafka: not possible to scale down, it always expects growth. The default max message size is about 1 MB, but you can pass references instead.

Pulsar: solves some of these problems in a better way.
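The "pass references instead" tip for large Kafka messages can be sketched roughly like this. It is a minimal illustration, not code from the tutorial: `BlobStore` is a hypothetical in-memory stand-in for external storage such as an object store, the 1 MB limit is an assumed constant, and no real Kafka client is involved.

```python
import uuid

MAX_MESSAGE_BYTES = 1_000_000  # assumed broker limit, roughly 1 MB


class BlobStore:
    """Hypothetical in-memory stand-in for external storage."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = str(uuid.uuid4())
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def to_message(payload: bytes, store: BlobStore) -> dict:
    """Send small payloads inline; park large ones and send only a reference."""
    if len(payload) <= MAX_MESSAGE_BYTES:
        return {"kind": "inline", "body": payload}
    return {"kind": "reference", "key": store.put(payload)}


def from_message(message: dict, store: BlobStore) -> bytes:
    """Resolve a message back to its payload on the consumer side."""
    if message["kind"] == "inline":
        return message["body"]
    return store.get(message["key"])


store = BlobStore()
small = to_message(b"x" * 10, store)
large = to_message(b"x" * 2_000_000, store)
print(small["kind"], large["kind"])
```

The consumer needs access to the same storage as the producer, which is the main trade-off of this approach.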

Anyway, I think the major problem with this workshop was that they wanted to cover too much in their presentation.

So there was hardly any time for hands-on exercises. But there is a git repo with everything to try on my own if I want.

https://github.com/lightbend/model-serving-tutorial

Some of the most interesting topics were left out or covered briefly since we ran out of time. The danger of saving the best stuff for the end. There were gold nuggets of hard-learned production problems and tips every now and then, both in the talk and in the presentations.

Note to self: I will have to be a good citizen and do some pull requests with a proper docker-compose setup that would make the experience much nicer for the attendees instead of running raw docker commands.

It also bothers me that the MapR slides came from a book where the author clearly didn’t know about EIP (Enterprise Integration Patterns) for some reason. This was the Aggregator pattern, not a made-up “rendezvous”.

Models as data instead of code.

Anyway, GCP PubSub + DataFlow (Apache Beam) actually feels even more compelling after this session.

Revisiting Scala 10 years later was not as bad as anticipated. Could see myself learning it.

The hitchhiker’s guide to deep learning-based recommenders in production

Dang, that was even more ambitious than the previous one. I really liked it!

Started off with a Recommendations 101 and then the fun stuff began.

  • Set up Kubeflow on GCP
  • Use JupyterHub on Kubeflow
  • Build a docker container to train on Kubeflow
  • Use TensorFlow Serving to serve the model
  • Deploy a Python Flask app that uses the API
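The last two steps boil down to a web app calling the TensorFlow Serving REST API. Here is a minimal sketch of that call, not the workshop code: it uses only the standard library, and the host, the model name `recommender` and the `user_id` input are hypothetical. It assumes TensorFlow Serving's REST endpoint on its default port 8501.

```python
import json
import urllib.request


def predict_url(host, model_name, port=8501):
    """URL of the TensorFlow Serving REST predict endpoint."""
    return f"http://{host}:{port}/v1/models/{model_name}:predict"


def build_predict_request(host, model_name, instances):
    """Build (but do not send) a predict request with a JSON body."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        predict_url(host, model_name),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predictions(response_body: bytes):
    """Extract the predictions list from a predict response."""
    return json.loads(response_body)["predictions"]


# Hypothetical model name and input, for illustration only.
request = build_predict_request("localhost", "recommender", [{"user_id": 42}])
print(request.full_url)
# Against a running server you would send it with:
# parse_predictions(urllib.request.urlopen(request).read())
```

In a Flask app, a route handler would build the request from the incoming parameters and return the parsed predictions as JSON.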

It was great fun, although I ran into problems due to a small bug in the KSonnet config. The presenter ran into the same problems later, so it wasn’t just me. I had to go up and tell him what was wrong with the config and why. A small thing, but easy to miss.

I really don’t like KSonnet, and the project was recently discontinued.

There is a current discussion within the Kubeflow project about what to replace it with (Kustomize, Helm or Jsonnet).

Guessing and hoping that Kustomize will win now that it is shipped as part of Kubernetes 1.14.

I am using Kustomize myself instead of Helm 2; I really don’t like the templating in Helm 2. (It is changing in Helm 3 since the project also realised it is not very great.) There is a lot of support for Helm out there though, for lack of better alternatives.

Speaker slides and videos

There were lots of talks; I didn’t cover them all here.

Recommendations

Netflix lunch & learn

We visited Netflix for a lunch and learn.

Magnus Palmér, Dominicke Kim, Ratna Desai and Shilpi Sinha

“There is no such thing as failure, only feedback”.

It was a great visit; however, I underestimated the interest from Netflix.

I should have asked for an estimate of who would attend the meeting and what they wanted to get out of it.

We have a really cool story as a team and as a company, but it should have been more condensed and better prepared. 2–3 slides at the most.

We had to rush through much of the good, non-standard stuff at the end. The India lightbulb is a really nice story to tell.

We have some great stories around diversity and sustainability as well, where we could just add one super short story about each. Also the Sonos one, since we were in the US.

We should of course also have shown the video of our team story.

The whitepaper they sent us after the meeting I had already read briefly some time ago. It was on my list to go through thoroughly, but given that we were going to Netflix, I should have read it carefully beforehand. I did read up on the Netflix Prize competition.

Still, I think we did OK and they seemed interested, and I really enjoyed it. Well worth the trip.

“Stealing” at Amazon Go

This is how shopping should be. Wow! Worth keeping in mind when building customer experiences. This really was an experience.

“Just walk out shopping.”

Going home — Epic fail

“Remember to always fall back to JSON”.
