The Problem with Dagster
Dagster is a popular data orchestration framework that lets users push their data through arbitrary directed acyclic graphs (DAGs) representing data pipelines. If you work in the data engineering space, you have certainly heard of it.
As part of my work, I have been using Dagster extensively over the last six months. I really appreciate the tool’s simplicity, and the frontend tools the Dagster team has put together make managing my workflows and addressing issues with them much easier. Plus, Dagster’s scalability means I won’t have to worry about running out of resources.
However, this blog post is not intended to sing the praises of Dagster as the next revolution in data engineering. Rather, I want to discuss a particular problem I’ve been having, one I suspect many of you have had or will have in the near future, and one that hasn’t been addressed by the internet at large outside of an obscure Reddit post.
Dagster Pricing
To get into the specifics of this issue, I first need to discuss Dagster’s pricing model. Essentially, there are two metrics by which any Dagster usage is charged: serverless compute and Dagster credits.
Serverless compute is charged for every minute your code is running, but it only accrues if you’re running serverless. If you’re running a hybrid setup like mine, or have Dagster deployed on-prem, then you pay nothing for compute because you “own” the compute.
Dagster credits are charged for each asset materialization or op execution. To anyone who understands Dagster, this is an obvious statement. But to anyone unfamiliar with these concepts, as I was six months ago, this pricing model is opaque. I wouldn’t go so far as to say that’s intentional, but their sales team certainly hasn’t gone out of their way to make it clear.
The Problem
This brings me to the main issue. And, to illustrate my point, I will walk through the workflow that actually resulted in this post.
What you see here is the linearization of a particular job, called market_information. This job is composed of eight ops which do more or less what it says on the tin. A JSON file is read from S3, converted to a dictionary, and then to three Pandas data frames. These are each processed concurrently, converted to Parquet files, and written back to S3. As an aside, we’d normally want to split the JSON and Parquet conversions out into their own ops, but I’m glad we didn’t, for the reasons I’m getting into now.
So, what exactly is the problem here? This looks like a well-architected data job, and it is. We can see where each separate step runs, whether any of them fail, and the dependencies between steps. I can even click on each of these ops to get graphs showing how long they took. All of this is highly actionable, hugely valuable data… but it’s a flaming money pit.
Let’s dig into the math to show you what I mean. This job has eight ops and runs every five minutes, which means it runs 12 times per hour. That’s 288 runs per day, for a total of 2,304 op runs, and therefore 2,304 Dagster credits, every day. Over a 30-day month, that comes to 69,120 credits.
- At the Solo tier, that equates to $2,464.80 per month because the first 7,500 credits are free and each additional credit is charged at $0.04.
- At the Starter tier (what used to be called Teams tier), that equates to $1,173.60 because you get a larger pool of 30,000 free credits and each additional credit is charged at $0.03.
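The arithmetic is easy to re-run for other cadences or tiers. Here’s a small helper using the figures quoted above (one credit per op run, billed past the tier’s free pool):

```python
def monthly_credit_cost(ops_per_run: int, runs_per_day: int,
                        free_credits: int, price_per_credit: float,
                        days: int = 30) -> float:
    """Monthly orchestration bill: one credit per op run, charged past the free pool."""
    credits = ops_per_run * runs_per_day * days
    return max(credits - free_credits, 0) * price_per_credit


# market_information: 8 ops, every 5 minutes -> 288 runs/day
solo = monthly_credit_cost(8, 288, free_credits=7_500, price_per_credit=0.04)
starter = monthly_credit_cost(8, 288, free_credits=30_000, price_per_credit=0.03)
print(f"Solo: ${solo:,.2f}")        # Solo: $2,464.80
print(f"Starter: ${starter:,.2f}")  # Starter: $1,173.60
```

Doubling the cadence to every ten minutes, or halving the op count, roughly halves the billable overage, which is worth keeping in mind for the aggregation discussion below.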
Now, note that this does not include compute, only orchestration. This isn’t a particularly complex job, nor would I call a five-minute cadence particularly frequent. Combining the two, however, results in an outsized bill. Also understand that this is the cost of a single deployment. If you are on the Enterprise tier and have multiple deployments (e.g. dev, stage, prod), your costs get multiplied again!
This may not seem like a big problem. After all, what’s $1,000/month to an enterprise-level budget? Except it really is. For example, in my business of energy trading, our trading operations department uses this information to perform the analysis behind their day-ahead and intraday bids, which means this job has to run every time and can’t be turned off. And, since these jobs tie directly to the department’s P&L, there is a large swath of business intelligence we won’t be able to take advantage of, because the potential savings can’t justify the ongoing Dagster expenditure.
Now, I understand that Dagster is providing a tool and they have a right to charge for it. However, as we’re running on a hybrid environment, the only things our money is getting us at this point are the Dagster UI (which to be fair is top-notch) and the orchestration service itself.
Finally, this doesn’t consider the per-seat costs for additional accounts, which is simply ludicrous when you get down to it.
Possible Solutions
There are some solutions here, of course, and I’ll go through the ones we’ve discussed at length as well as their benefits and drawbacks.
Enterprise Tier
If you have deep enough pockets, you might be able to swing an upfront contract with Dagster to provision a large number of credits every year for a set price. The benefit here is that the cost of each additional job decreases as you move into more heavily discounted price bands. However, as enterprise pricing is negotiated, you could be missing out on additional savings. Plus, the per-seat cost goes up a bit over the Starter tier, which is definitely annoying.
AWS Managed Services
We could decide that frequent jobs need to be offloaded to AWS and orchestrated through Airflow, or even EventBridge, since this job essentially runs on a cron schedule. We could also roll our own orchestrator for this work using SQS/SNS or NATS if we preferred. The disadvantage here is that we would be widening our stack, which puts an increased maintenance burden on the team and requires additional training or specialization for us to maintain competence.
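For the EventBridge route, recreating the five-minute cadence is mostly configuration. As a sketch, here is a helper that builds the rule and target parameters; the rule name, target ID, and ARN are all made up for illustration, and the actual boto3 calls are left commented rather than executed:

```python
def five_minute_schedule(target_arn: str) -> dict:
    """Build parameters for an EventBridge rule firing every 5 minutes."""
    return {
        "rule": {
            "Name": "market-information-every-5-minutes",   # hypothetical name
            "ScheduleExpression": "rate(5 minutes)",        # EventBridge rate syntax
            "State": "ENABLED",
        },
        "targets": [
            # The target would be whatever runs the job's code: a Lambda,
            # an ECS task, etc. The ARN below is a placeholder.
            {"Id": "market-information-target", "Arn": target_arn},
        ],
    }


schedule = five_minute_schedule(
    "arn:aws:lambda:eu-west-1:123456789012:function:market-information"
)

# Wiring it up with boto3 would look roughly like (not executed here):
#   events = boto3.client("events")
#   events.put_rule(**schedule["rule"])
#   events.put_targets(Rule=schedule["rule"]["Name"], Targets=schedule["targets"])
```

The orchestration cost then becomes EventBridge’s per-invocation pricing plus your own compute, with no per-step credits at all.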
Op Aggregation
This has been mentioned elsewhere, but since we’re paying per op and not by the size of the op (those 69k op runs cost $13 in ECS compute, for example), we could convert the job to use a few chunkier ops. This would also provide a small performance boost, because the orchestrator has fewer calls to make. The downside is that we lose the linearization view, which is a bit of a self-defeating prospect where Dagster is concerned, considering that linearization is one of their better value-adds.
On-Premises Deployment
This solution is a bit uncertain, but it’s possible that one could deploy the Dagster daemon to a locally controlled VM, thereby cutting Dagster out of the loop entirely. I wouldn’t personally recommend this approach, because you’d be acting as a free-rider, a habitual problem in the open-source world. The typical response to such leeches is to close-source the project, and I’d rather not see that happen either.
Divest from Dagster
This last, and most drastic, measure is similar to the second one. However, instead of moving some of your workflows, you port all of them. The downside is that you no longer get to benefit from Dagster’s other features. But, that might be the best option for you.
Conclusion
Is Dagster worth it? Almost certainly, for many teams! But, given the caveats above and their pricing model, I can’t say it is for certain at this point. Maybe they’ll take another look at their model and rejig it to better fit high-frequency workflows. A per-job rate certainly wouldn’t be unwelcome, assuming it isn’t outrageously priced, of course. But, as things stand, I can’t say whether or not we’ll be continuing with this service.