This article felt like it was addressed directly to me. At our company, we built our bioinformatics platform with Luigi and we’ve encountered all of the problems you describe. Schemas are implicit, and each task has to hope that its dependencies produce the inputs it is expecting. Minor changes in the file output can cause major downstream issues.
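To make that failure mode concrete, here's a minimal stand-alone sketch (plain Python rather than actual Luigi tasks, with made-up task names) of how an implicit file schema couples an upstream and a downstream step: the column order of a TSV *is* the contract, but nothing declares it, so the consumer can only check it at runtime.

```python
import csv
import io

# Upstream "task": writes variant calls as TSV. The column order IS the
# schema, but nothing declares it -- downstream code just has to know.
def call_variants() -> str:
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t")
    writer.writerow(["chrom", "pos", "ref", "alt"])  # implicit schema
    writer.writerow(["chr1", "12345", "A", "T"])
    return buf.getvalue()

# Downstream "task": parses by position, silently assuming the layout
# above. The assert is the only guard, and it only fires at runtime --
# a "minor" upstream change (e.g. a new first column) breaks it.
def annotate(tsv: str) -> list[dict]:
    rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
    header, data = rows[0], rows[1:]
    assert header[:4] == ["chrom", "pos", "ref", "alt"], "schema drifted"
    return [dict(zip(header, row)) for row in data]

records = annotate(call_variants())
```

In a type-safe pipeline this contract would be a shared record type checked at compile time; here it lives only in the two functions' heads.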
So I was eagerly hoping you had a viable solution at the end of the post, but unfortunately our field isn't quite ready to put an entire pipeline in a single, type-safe application. Most of our critical genomics algorithms are externally developed academic command-line tools that have to run on a single, powerful node and that produce custom file formats whose schema is defined only in the tool docs. Combine that with finicky environments and dependencies and huge RAM/storage requirements, and we're stuck with major constraints on scaling our workflows; next to those, schema problems actually become secondary.
If the only way out of this mess is to rewrite all of our algorithms in Scala, define our schemas in Avro, and run everything on Spark, I’m all for it, and some very smart people are already working on it. But this is a massive undertaking, and it takes time.