Is Kubeflow Dead?

Demetrios Brinkmann
Published in MLOps.community
Oct 2, 2020 (15 min read)

This is a thread taken from our MLOps community Slack. Kubeflow has been a hot topic of debate recently, and we have since recorded a few more chats on it that you can check out on our YouTube channel.

Let us know what you think in the comments, or jump on Slack and voice your opinion there.

Toni Perämäki

Is it just me, or does it seem that Kubeflow is not going anywhere? Why I ask:

  • Contributor activity basically zero
  • David left for Microsoft (already earlier) — Is the vision lost?
  • We have seen a large number of companies that say they tried Kubeflow but concluded it makes no sense for them.

Thoughts?

Joey Zwicker

David is still involved at MSFT, but less so over time. We too have seen a ton of users try Kubeflow (including KF & Pachyderm together) and quickly give up on the KF half because it is so impossibly hard to get running effectively. In my experience it’s a bit of a mess from a project governance standpoint: KF, TFX, IBM, and others are all trying to pull it in different directions.

Joe Peskett

Is that for the entire Kubeflow organisation or just the kubeflow repo? I wonder if it’s the same story for the pipelines and manifests repos?

Joey Zwicker

Note also that the attempt at a KFP and TFX merge means dev work might be moving to other repos, even outside of the KF org. I don’t know all the details on this. @Luke Marsden has been very involved with KFP. Care to share your experience?

Gonçalo Martins Ribeiro

They made a huge effort to release version 1.0 and then integrate it into Google’s AI Platform… Now they have a version 1.1 still in beta and it seems to be even more attached to GCP but the activity is much lower than before.

Demetrios

I think we may need to do a coffee session on this one

Diego Oppenheimer

@Joey Zwicker we are hearing the same from folks who really struggle, not with getting from 0 to started, but with actually maintaining and running it effectively over time.

David Aponte

Interesting question! As a big Kubernetes user, I like it, but I also understand why some people struggle with it. It would be cool to hear from some of the contributors about the low contributor activity, though. I can reach out to someone I know who’s contributing to it and using it in their production system.

Joey Zwicker

We’re coordinating with the contributors we know too and can inquire further. Pachyderm team is also directly contributing and leading the newly-formed KFData working group so we (@Luke Marsden mainly) are experiencing a ton of the KF internal chaos and complexity right now.

Gonçalo Martins Ribeiro

Sorry, but I can’t keep myself from asking this: @Joey Zwicker can we expect an integration of Pachyderm with Kubeflow?

Joey Zwicker

@Gonçalo Martins Ribeiro Please never hold back questions!

We’ve been working with JLewi and the KF team for nearly a year, mostly on small steering and directional things, examples, and best practices. We do already have examples of KF & Pachyderm together at https://github.com/pachyderm/pachyderm/tree/master/examples/kubeflow/mnist (although KF 1.1 breaks things in slightly obnoxious ways that we’re trying to fix now).

Pachyderm has more recently become much more closely involved in KF with the creation of the KFData working group. We’re both contributing to KFP and making proposals/RFCs, and helping to try to build a better KF story for managing data and data-driven pipelines, because to date KF has mostly just decided data is someone else’s problem, which seems ridiculous to us for an ML platform. So we’re working to make that much, much better.
https://docs.google.com/presentation/d/1LlVk-5Ua-GdNClekHbj-4b6Nt5-whU4LdU0VAVWLFoE/edit
https://github.com/pachyderm/kfdata

Toni Perämäki

@Joey Zwicker & @Diego Oppenheimer: hear, hear. So it seems that others have the same experience. But that is good for us: feel the pain of KF and witness a better way of doing MLOps.

@Joe Peskett There’s some activity with the pipelines repo but not much. Interesting to see if the situation changes in the future. @Luke Marsden — any thoughts on this?

Gonçalo Martins Ribeiro

@Joey Zwicker loved it! @Fabiana Clemente you really need to take a look at KFData

Diego Oppenheimer

Based on what I’m learning, nearly everyone who uses (or has used) Kubeflow isn’t happy with it and is imagining derivatives that might be better. At least that’s the takeaway that nearly every data science tooling team across Netflix, Uber, GitHub, and AWS has given me. That said, I think it’s more nuanced, like any newer project. I don’t see a strong Kubeflow as bad for the industry (my 2 cents).

Toni Perämäki

@Diego Oppenheimer — Agreed.

And I’ve been waiting for Kubeflow to evolve from what it is today. I’m hoping it will, and the low level of activity made me wonder what the situation is.

Joe Peskett

Do we have anyone from Spotify in this group? I guess they’re on quite a different scale compared to a lot of teams but last I heard they were quite heavily invested in KF pipelines through TFX? Could be wrong on that one though…

Gonçalo Martins Ribeiro

Nubank uses Kubeflow too. They already presented at a meetup.

Diego Oppenheimer

I would think that success is really dependent on the maturity of kubernetes inside an organization

A lot of enterprises are just beginning their Kubernetes journey and porting projects a small amount at a time. The (successful) ports I have seen start with web apps, move to service meshes, and finally do platforms. The likelihood of a company successfully going from 0 k8s to running an ML platform on it feels problematic (especially without the internal know-how on k8s)… BUT if the organization has a lot of k8s muscle, I don’t see why they wouldn’t be successful.

and microcosms of that internal to an organization (teams with k8s experience vs not)

Diego Oppenheimer

+1 for centralized and standardized ML platform

Joey Zwicker

I think everyone here is +100 for centralized and standardized ML platform.

Samuel Than

Our company has been using Kubeflow in production for about a year. I would agree that, as a solo MLOps engineer, maintaining even 2–3 projects was a huge learning curve, and it took real effort to get it off the ground for a team. It certainly is not easy when one is new to Kubernetes AND trying to implement Kubeflow at the same time.

After 1 year of using Kubeflow, I agree that the progress seems slow in terms of new versions and updates. We still love Kubeflow, especially the Kubeflow pipeline concept, and we hope that the vision of what Kubeflow can become as it matures will be something to look forward to…

Luke Marsden

Late to the party here but I think that graph shows the wrong thing, lots of activity has moved to different repos. That said, I’m having a tough time with Kubeflow 1.1 and IMO it’s really lacking a focus on end user experience, which is way harder than it needs to be

Demetrios

@Luke Marsden in which other repos have you seen a bit more activity?

Luke Marsden

I’m happy to be involved in Kubeflow through the KFData effort though and want to see it be successful!

Luke Marsden

I’d have to check D, I’m on my phone at the moment :)

It’s certainly not dead though, there is lots of activity

Demetrios

@Clive Cox do you have anything to weigh in on this convo?

Clive Cox

Kubeflow is an ecosystem and some projects are more used than others. I think they are finding it challenging to bring everything into a cohesive whole. For it as a brand for MLOPs in Kubernetes… it probably needs more of a boost from the big companies involved. I’m more closely involved with the KFServing project and that has been a very active sub community and like Seldon is self-standing and loosely coupled to Kubeflow.

es

I guess it’s not a party unless you are WAY too late

@Toni Perämäki @Joey Zwicker, as far as we see it, although KF has a TON of downsides, mainly ease of onboarding, I don’t see anything else that comes close in popularity (at least from ML platform integration and blog posts standpoint :D) to KF.
Are you guys seeing anyone moving in another direction to KF?

Joey Zwicker

In my experience, KF has by far the most mindshare of any tool, but not necessarily a lot of true adoption yet. Google’s marketing machine, their huge presence in the k8s and KubeCon ecosystem, and their large engineering efforts behind KF have definitely made it the front-runner right now, but time will tell if the product ever actually gets there, or if it gets bogged down in the TFX+KF bureaucracy and complexity and never really develops into a cohesive product.

People were saying the same about the Hadoop ecosystem of projects, and it looked like it had “won” for a while (we were regularly getting told Hadoop had already won even 3–4 years ago), but now look at it.

Unlike k8s, which had some of the same challenges early on but was also just simply the best at what it did, I don’t honestly think of KF as being the best at what it does. In fact, other than Argo, I don’t personally find any one component of KF to actually be the best at its one specific task. Kubeflow is “winning” based on the sheer force of mindshare it owns as the starting place for “ML on K8s”. Again, my experience is very biased, but Pachyderm has done more than a dozen PoCs of various types with KF + Pachyderm with various companies, and by the end of the PoC they’re cutting away more and more of the Kubeflow components, generally just keeping TFJob, maybe KFP, and either using Pach + Seldon or including tools like Allegro in there with some home-grown components. It’s still just really, really messy.

Mariya Davydova

I’m still not sure about my feelings towards Kubeflow. We sometimes migrate our customers from it to other tools and see how much effort they’ve spent to make it work as they needed. For example, one of our current clients has 4–5 files for each pipeline component: Dockerfile, the component itself, a script to unpack parameters, a script to enable Hydra, a script to emulate caching, etc. This feels weird. Other products make it much easier.

I believe that Kubeflow suffers from the curse of the first product of its kind: it is easy to make a ton of mistakes and suboptimal choices when you are first in a new field; those who come next learn from your mistakes.

David Aponte

@dsun20 @Yuzhui any thoughts?

es

@Joey Zwicker You’re raising a really interesting point. Do you think there’s a chance KF might be broken into different components that might have a better chance to flourish independently and not as a large entity?
Also, I think Google offers managed KF services IIRC, and since Google Cloud isn’t growing as much as Azure or AWS, it might hinder their efforts, but I’m just guessing here.

Joey Zwicker

“Do you think there’s a chance KF might be broken into different components that might have a better chance to flourish independently and not as a large entity?”

KF is already a bunch of discrete components with some amount of shared governance. I think this is a weakness, and it reminds me too much of the Apache/Hadoop-related projects mentioned above. In our experience working with them, half the issue is that TFX does things one way, KFP another, Metadata logger differently again, etc. And every component is either being forced to make design decisions to accommodate all the others, which makes the product worse and more complicated, or it simply only works with a subset of other KF-related projects, which is equally bad.

For better or worse, I find that the OSS ecosystems that are most successful are those with one major problem that they try to solve (k8s = container orchestration, ELK stack = log aggregation & management, etc). It feels to me (although as a consumer of the ecosystem I want to be proven wrong) that a full MLOps and Data Science platform is too big and diverse for one project.

In my decade or so of OSS work, when one problem is too big, what it ends up taking is a number of organizations (usually companies, which have more unilateral control, not projects) to divide up the territory and build good API boundaries between them. This is what usually forms into more of a Canonical Stack: a few discrete tools that play nicely together and become a standard for a class of problems. You can think of the MEAN/MERN stack, the LAMP stack before that, and the ELK stack more recently. I think we’re going to see the same for MLOps in the coming 1–2 years.

“Also, I think Google offers managed KF services IIRC, and since Google Cloud isn’t growing as much as Azure or AWS, it might hinder their efforts, but I’m just guessing here.”

Yes, they do. In my understanding, KF as a managed service is how Google Cloud is trying to get back in the race: being ahead of the game compared to the other clouds on k8s and the MLOps platform play. That said, this has caused some internal chaos because TFX and KF overlap but do things fairly differently, and now they’re trying to merge. It’s going to be a painful transition, and I’m not yet sure whether it’ll come out the other side for the better.

es

This is really some good stuff, and it sheds some light on inner workings of KF that I wasn’t aware of!
And I really wonder what the near future holds, as it seems the MLOps field is made up of about 20–30 startups with very little differentiation between them, and a few major players (cloud providers mainly) trying to give their own spin on the problem.
I think you really touched the heart of the problem, the problem space is just too wide and trying to solve it with a single tool just isn’t going to work.
I really wonder what will be the factor that starts organizing the solutions into competing stacks and how would that stack battle pan out.
Interesting 2 years ahead of us all

Joey Zwicker

“I think you really touched the heart of the problem, the problem space is just too wide and trying to solve it with a single tool just isn’t going to work.
I really wonder what will be the factor that starts organizing the solutions into competing stacks and how would that stack battle pan out.
Interesting 2 years ahead of us all”

I’m trying to be careful to abide by @Demetrios’ rules to not pitch anything as a vendor, but Pachyderm and a number of major players in the MLOps space (Seldon, Determined, Allegro, and more) have an announcement in the works for Oct that we hope will accelerate this process greatly and “be the factor that starts organizing the solutions.” Sneak peek here: https://ai-infrastructure.org/

David Aponte

This looks awesome btw ^^ @Joey Zwicker

Toni Perämäki

Hey everyone, and thanks for joining in on the conversation my question started on Kubeflow! First of all, I already really love this community even though I’ve been in it for a short time. So many thoughtful comments and insights in this thread (and in others too).

Starting from the beginning, there were some great points on the fact that the development has moved from the main repo to different smaller ones. And as someone said, KF is an ecosystem, not a single product, so that makes all the sense.

It also seems there are different external companies pushing forward on the different components that they feel are important to them. That’s a great thing for the project.

It was also super to hear different angles on the discussion and learn how some of the people are leveraging KF or different components of it. Good points were also made on how broad the whole MLOps space is and what kind of effect that has on the selection of components for one’s own set of tooling. So many great points that I won’t even try to list them all here.

So I believe we can easily say that Kubeflow is going forward.

Jim Dowling

IMO, one of the major problems affecting both TFX and KF is Apache Beam. Beam was supposed to be the scale-out feature engineering engine for TFX. KF never had one, but I guess there was a plan for Beam down the road. Now, TFX have pretty much abandoned Beam to promote Kubeflow pipelines. However, the elephant in the room is Spark/PySpark. Devs like PySpark for feature engineering, but it’s not been part of the KF ecosystem (up to now). @Joey Zwicker mentioned earlier that KF abandoned data (I agree, and it is crazy), but they also abandoned scale-out feature engineering. The latest is an attempt by TensorFlow to provide support for data parallel map jobs (https://www.tensorflow.org/api_docs/python/tf/data/experimental/service/distribute). You could also place part of the blame on Python programmers who just don’t want to learn data parallel programming.

TensorFlow

A transformation that moves dataset processing to the tf.data service.
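
For readers unfamiliar with the term, a “data parallel map job” applies the same per-record function to independent shards of a dataset and then stitches the results back together. The following is only a rough sketch of that idea in plain Python; it is not the tf.data service, Beam, or PySpark API. The `shard`, `add_features`, and `parallel_map` names are hypothetical, and threads merely stand in for the separate workers a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def shard(records: Sequence[dict], num_shards: int) -> List[List[dict]]:
    """Split records into contiguous shards of near-equal size."""
    size, extra = divmod(len(records), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + size + (1 if i < extra else 0)
        shards.append(list(records[start:end]))
        start = end
    return shards


def add_features(record: dict) -> dict:
    """Hypothetical feature function: it only looks at one record at a time."""
    return {**record, "amount_squared": record["amount"] ** 2}


def parallel_map(fn: Callable[[dict], dict],
                 records: Sequence[dict],
                 num_workers: int = 4) -> List[dict]:
    """Apply fn to every record, one shard per worker, preserving input order."""
    def run_shard(part: List[dict]) -> List[dict]:
        return [fn(r) for r in part]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields shard results in the order the shards were submitted.
        results = list(pool.map(run_shard, shard(records, num_workers)))
    return [row for part in results for row in part]


records = [{"id": i, "amount": float(i)} for i in range(10)]
features = parallel_map(add_features, records)
```

Because no record depends on any other, the shards can be processed anywhere and in any order, which is what makes this kind of job easy to scale out.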

Ivan

I tend to agree with @Joey Zwicker that an “AI end-to-end platform” is too large of a problem to be solved by a single project. I’m curious, though: has anyone seen/tried https://polyaxon.com/? It’s kinda a competitor to KF the way I understand it, but it also comes with a lot of components built-in. Does it suffer from the same pitfalls (solves a lot of stuff in a suboptimal way, complex to set up)? Or does it have a good niche?

David Aponte

Polyaxon is “good enough” for training and hyperparameter tuning

using it right now actually lol

Gonçalo Martins Ribeiro

@David Aponte, would you be able to make a comparison between Polyaxon and KF? I’ve been following Polyaxon’s progress, but I’m not convinced about switching from Kubeflow to it.

David Aponte

Just some quick thoughts, but at a high level Polyaxon is good for training and HP tuning, while Kubeflow includes pipelines, serving and more.

Kubeflow
Pros:

  • k8s native
  • Highly customizable
  • Open source
  • Loosely coupled (mostly)

Cons:

  • requires k8s and DevOps expertise
  • No managed support

Polyaxon
Pros:

  • k8s native
  • quick experimentation for HP tuning and training

Cons:

  • single maintainer
  • Custom YAML
  • serving is separate
  • no pipelines

es

@David Aponte Very interesting! I heard very good things about Polyaxon in the past. I’m coming from a competitor of theirs, so take everything I say with a grain of salt, but what made you choose them and not someone else (not specifically Allegro’s solution)?
Maybe we can have a broader discussion about what makes people choose one solution over another

Demetrios

I think that’s a great topic for a thread @es! I’ll start a new one right now.

David Aponte

@es I can’t speak too much about this (not sure how much I can disclose ATM, sorry), but in a nutshell the ML infra team did a thorough set of research to understand user stories, scope out requirements, and document our current workflows. Afterwards we decided that building our own infrastructure using Kubeflow and other tools worked best for our needs at the time, and I still think this is the case. I know we looked at a bunch of vendors (not sure about Allegro), but after trying some out we felt building our own on top of Kubeflow would be best. But yeah, for BAI, I think a big reason is that we’re a k8s company and we have the resources to build and maintain the infrastructure on top of Kubernetes. It’s hard, but it’s been working for us. Since moving away from Polyaxon we’ve already found improvements in the reliability of our workflows.

Update about this: I’ll be scheduling a coffee session with my coworker soon who is leading the ML infrastructure team, and we’ll probably talk a little more about Kubeflow there for those interested.

Joey Zwicker

“Update about this: I’ll be scheduling a coffee session with my coworker soon who is leading the ML infrastructure team, and we’ll probably talk a little more about Kubeflow there for those interested.”

Would love to listen in on this or participate if possible.

es

I echo Joey, I’m eagerly expecting to hear what you have to say about that, and pitch in my 2 cents if it’s desired

I must say that this is one of the few buy-vs-build cases where you originally bought a solution but decided to move away from it and build your own. Very interesting, and the details are equally interesting!

eterna2

Sorry that I missed this thread. @Jim Dowling I just want to add to the part on Spark. There is a Spark operator on k8s that is not officially part of KF, but some folks are using it: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator

But frankly, I don’t find using Spark on k8s to be terribly difficult as long as you get your base image correct. You can run your Spark jobs in your notebook or as part of KF pipelines. I don’t see any need for any further abstraction?

GoogleCloudPlatform/spark-on-k8s-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Jim Dowling

@eterna2, the big missing piece for Spark on k8s is shuffle. If you write Spark jobs that don’t shuffle, you’ll be fine. But as soon as you need to shuffle at any scale there are both performance and correctness problems. The main problem is that, because there is no shared host-level external shuffle service, things either go very slowly or don’t go at all. Uber introduced their solution, an external shuffle service, but hasn’t open-sourced it yet. See more here: https://databricks.com/session_na20/zeus-ubers-highly-scalable-and-distributed-shuffle-as-a-service
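
To make the shuffle point concrete, here is a minimal plain-Python sketch (not the Spark API) of why a map-only job can stay within its partitions while a group-by-key job has to move records between them. The partition layout, the toy byte-sum partitioner, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # (key, value)


def narrow_map(partitions: List[List[Record]]) -> List[List[Record]]:
    """Map-only step: each partition is transformed where it sits, no data moves."""
    return [[(k, v * 2) for k, v in part] for part in partitions]


def shuffle_by_key(partitions: List[List[Record]],
                   num_partitions: int) -> Tuple[List[List[Record]], int]:
    """Shuffle step: route every record to the partition that owns its key.

    Returns the new partitions and how many records had to change partition.
    """
    out: List[List[Record]] = [[] for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            # Toy deterministic partitioner (illustrative assumption only).
            dst = sum(key.encode()) % num_partitions
            if dst != src:
                moved += 1
            out[dst].append((key, value))
    return out, moved


def sum_by_key(partitions: List[List[Record]]) -> Dict[str, int]:
    """Aggregate per partition; correct only after a shuffle has co-located keys."""
    totals: Dict[str, int] = {}
    for part in partitions:
        local: Dict[str, int] = defaultdict(int)
        for key, value in part:
            local[key] += value
        totals.update(local)
    return totals


partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
doubled = narrow_map(partitions)              # no records leave their partition
shuffled, moved = shuffle_by_key(doubled, 2)  # key "a" must be brought together
totals = sum_by_key(shuffled)
```

In a real cluster, every record counted in `moved` is written out by one executor and fetched over the network by another, which is the traffic an external shuffle service is meant to absorb.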

Big thanks to all the community members who participated in the conversation. To check out the more recent conversation around this topic, you can watch our talk with the co-creator of Kubeflow, David Aronchick.

The MLOps community is an open and equal space where all are welcome to teach and learn from each other. We share best practices, tips, pains, and questions in Slack and have live meetups to talk with some of the leading innovators in this field. If you would like to get involved, please join Slack and reach out. To hear all of our recorded past meetups, check out our YouTube channel or listen to our podcast.

Demetrios Brinkmann
MLOps.community

Father, Artist, Happy. Creator of MLOps community and Lover of AI Ethics