Handling Large Technology Changes Successfully: A Transition to .Net Core

Dave Musicant
DraftKings Engineering
11 min read · Sep 29, 2020
[Figure: Modernization Paths. Two parallel improvement paths, technology and developer experience, e.g. .Net Framework to .Net Core on Linux and Kubernetes]

A little over a year ago DraftKings engineering identified an opportunity to target some of our key business objectives: lower costs, increase scalability and application flexibility, and improve developer efficiency. This scope expanded when we merged with SB Tech and started to think about integrating our software, platforms and people. We knew it would require more than simple code changes to turn this into an enterprise solution. If we tried to tackle every issue at once, we’d fail and stunt the engineering organization’s growth. After a deeper dive, we learned we would have to focus significantly on improving not only the technology, but also our developer experience, to achieve our objectives. It was important that these paths not have an end goal, but rather a direction and philosophy. This approach allowed us to pivot as needed along either path in response to lessons learned along the way.

Changes In The Real World

We quickly identified our starting point as the core technologies underlying all of our micro-services: .Net Framework and NServiceKit. With our target identified, we started doing some research and got excited: almost every article on migrating to .Net Core indicated it was relatively simple and straightforward. There were a few deprecated packages, but otherwise transitioning a service would take an hour or so.

In a large engineering organization, however, you inevitably end up with hundreds of services with custom code, varying patterns and libraries. Adding to that complexity, our teams have packed roadmaps. Our products evolve very quickly and our engineers do not have much time to take on large technology shifts. We also operate in a highly regulated environment. We cannot simply update code and deploy it. We have to go through a rigorous testing process and in some cases get approvals for releases from regulators. These challenges require a more intentional plan and a stepped-approach to a technology upgrade.

Basic Principles of Large Scale Changes

When doing any large-scale change, we found the following principles lead to better success:

  1. Get buy-in from stakeholders & leadership
  2. Update with a purpose
  3. Do it incrementally
  4. Trust but verify
  5. Communicate often and deliberately
  6. Expect unforeseen problems
  7. Plan for support

Get Buy-in

An initiative with a large impact needs the support of others in the organization to succeed. Even an initiative done entirely by one team, if it has any measurable impact on the business or technology, cannot be done in a vacuum. The project should seek the support of leadership by showing a strong business case. Other teams across the engineering organization will want to know the benefits, whether technological or business related, and how they can contribute. For DraftKings, .Net Core had technological benefits such as better efficiency through “async-all-the-way-down” request handling. For the business, it was an important and necessary step towards other initiatives that would save us a significant amount of hardware and compute in the cloud. It enabled a move to Linux, containers and later Kubernetes. We then used metrics to build those cases, showing gains in developer efficiency, cloud cost savings, flexibility and scalability.
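To make the “async-all-the-way-down” benefit concrete, here is a minimal, hypothetical sketch (not code from our SDK) contrasting a blocking sync-over-async call with an end-to-end asynchronous one; the IOrdersClient interface and routes are made up for illustration:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

public class Order { public int Id { get; set; } }

// Hypothetical downstream client, used only to illustrate the call chain.
public interface IOrdersClient
{
    Task<Order> GetOrderAsync(int id);
}

[ApiController]
public class OrdersController : ControllerBase
{
    private readonly IOrdersClient _client;
    public OrdersController(IOrdersClient client) => _client = client;

    // Sync-over-async: blocks a thread-pool thread while waiting on I/O.
    [HttpGet("orders/{id}/blocking")]
    public Order GetBlocking(int id) => _client.GetOrderAsync(id).Result;

    // Async all the way down: the thread is returned to the pool during the await,
    // so the same hardware can serve many more concurrent requests.
    [HttpGet("orders/{id}")]
    public async Task<Order> GetAsync(int id) => await _client.GetOrderAsync(id);
}
```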

Update With A Purpose

As engineers we always want everything to be better. We’re critical of our own work and often have an easier time seeing its flaws than its strengths. This is part of why refactors can frequently become massive projects that grow in scope with every piece of code touched. As we started down the path towards .Net Core, we saw numerous flaws we wanted to fix in our internal SDK and framework code. For example, the way we handled dependency injection did not use the newer .Net standard IServiceProvider/IServiceCollection. The temptation to fix this and many other areas of technical debt was strong.
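For context, the standard abstractions we were not yet using look roughly like this (a minimal sketch; the service names are hypothetical, not from our SDK):

```csharp
using Microsoft.Extensions.DependencyInjection;

public interface IGreetingService { string Greet(string name); }

public class GreetingService : IGreetingService
{
    public string Greet(string name) => $"Hello, {name}";
}

public static class Program
{
    public static void Main()
    {
        // Register services against the standard .NET abstractions...
        var services = new ServiceCollection();
        services.AddSingleton<IGreetingService, GreetingService>();

        // ...and resolve them through IServiceProvider.
        using var provider = services.BuildServiceProvider();
        var greeter = provider.GetRequiredService<IGreetingService>();
        System.Console.WriteLine(greeter.Greet("DraftKings"));
    }
}
```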

As engineers, our first thought is, “as long as I’m in here, I might as well fix this to do it the right way.” However, all that does is make a three-month project take a year and add moving parts that make regression testing more difficult. The end result is that it becomes impossible to verify the application has the same expected behavior as the existing implementation. We instead took an agile approach, using Epics and our backlog to keep the migration scope under control.

Legacy Patterns vs Modern Approach

There were some cases where it was not just our outdated external dependencies that gave us pause, but the capabilities we’d built on top of them in our SDK and micro-services. We had to change the underlying implementation in many of these cases, which actually led to some nicer interfaces and patterns. However, if we’d exposed these nicer patterns in our SDK, the impact on our micro-services would have been immense. In many cases we made the compromise to wrap the new implementations in our legacy interfaces to ease the transition to .Net Core for our engineers. For example, we switched from our legacy Http library built on top of HttpWebRequest to one built on top of the newer HttpClient, but kept our old interfaces as well.
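As a rough sketch of that wrapping approach (the ILegacyHttpClient interface below is a hypothetical stand-in for our SDK’s actual interfaces):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical legacy-shaped interface that services already code against.
public interface ILegacyHttpClient
{
    Task<string> Get(string url);
}

// New implementation backed by HttpClient; callers keep the old interface.
public class HttpClientAdapter : ILegacyHttpClient
{
    private readonly HttpClient _httpClient;

    public HttpClientAdapter(HttpClient httpClient) => _httpClient = httpClient;

    public async Task<string> Get(string url)
    {
        using var response = await _httpClient.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```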

Do It Incrementally

Our end goal for the team was to create automation that could do 90% of the work for converting a micro-service to .Net Core. However, if we’d started there, we’d have created a script that simply frustrated engineers and didn’t handle the majority of the code. This was especially true for us, as our teams are very autonomous and have some level of variation in how they implement their micro-services. It was important to first prove out and fine-tune the conversion mechanisms manually, multiple times, before building automation.

[Infographic: many developers, 90 stories, 3 epics; two weeks for the first service, 6 hours by runbook, under 1 hour by script]

It took us two weeks to migrate our first service as we worked through everything from .Net Core API changes to third-party library upgrades to required code changes such as moving to Kestrel and Polly. From there we created an initial Runbook for all the necessary, known steps. Our next few services took us about six hours each to convert. We then created an automated script that could do 90% of all the conversion needed for a service. The remaining 10% accounted for the variation between our services. Our next service conversion took under an hour, not including load testing. The majority of the time was no longer spent in the service’s code, but in the tests, which were much harder to fully convert with an automated script.
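As one example of what the Polly side of a conversion can look like, here is a minimal sketch assuming the Microsoft.Extensions.Http.Polly and Polly.Extensions.Http packages; the client name and policy values are placeholders, not what our script generates:

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly.Extensions.Http;

public static class HttpClientRegistration
{
    public static IServiceCollection AddResilientClient(this IServiceCollection services)
    {
        // Retry transient HTTP failures with exponential backoff...
        var retry = HttpPolicyExtensions
            .HandleTransientHttpError()
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt)));

        // ...and open a circuit after repeated failures to protect the downstream service.
        var circuitBreaker = HttpPolicyExtensions
            .HandleTransientHttpError()
            .CircuitBreakerAsync(handledEventsAllowedBeforeBreaking: 5,
                                 durationOfBreak: TimeSpan.FromSeconds(30));

        services.AddHttpClient("downstream")
            .AddPolicyHandler(retry)
            .AddPolicyHandler(circuitBreaker);

        return services;
    }
}
```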

Automation

We ended up with a useful and powerful script that enabled developers to convert the majority of their service’s code, configuration and build scripts. The script not only converted their code, but had the flexibility to also create a new git branch and separate all changes into different commits, if desired. This helped engineers understand exactly how the service was changing and review the Pull Request more safely.

[Figure: README instructions for our automated migration script]
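The commit-splitting idea is roughly this (a simplified, hypothetical sketch; the real script did far more, and the branch name and commit grouping here are made up):

```csharp
using System;
using System.Diagnostics;

public static class MigrationCommits
{
    // Run a git command in the service's repository and fail loudly on error.
    private static void Git(string repoPath, string arguments)
    {
        var process = Process.Start(new ProcessStartInfo("git", arguments)
        {
            WorkingDirectory = repoPath,
            UseShellExecute = false
        });
        process!.WaitForExit();
        if (process.ExitCode != 0)
            throw new InvalidOperationException($"git {arguments} failed");
    }

    public static void CommitInStages(string repoPath)
    {
        Git(repoPath, "checkout -b dotnet-core-migration");

        // Stage and commit each category of change separately so reviewers
        // can see exactly how the service changed.
        Git(repoPath, "add *.csproj");
        Git(repoPath, "commit -m \"Convert project files to SDK style\"");

        Git(repoPath, "add .");
        Git(repoPath, "commit -m \"Apply automated code and configuration changes\"");
    }
}
```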

Trust But Verify

Although .Net Framework and .Net Core are both from Microsoft and share a fair number of interfaces, there are a number of required changes that result in a different approach, API or library being used. It’s critical with any major technology shift to have good testing in place to ensure the behavior is the same. Good functional and integration tests were an important part of our criteria for choosing a pilot micro-service.

Some of the changes were due to default behavior differences, previously unknown to us, between .Net Framework and .Net Core or between versions of third-party libraries. For example, some of our converted services started rejecting requests with Request Header Fields Too Large errors. We discovered that the default number of allowed headers was much smaller in Kestrel than in the embedded web server we had used previously for our micro-services. This was a simple fix of updating two settings:

KestrelServerOptions.Limits.MaxRequestHeaderCount
KestrelServerOptions.Limits.MaxRequestHeadersTotalSize
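As a sketch of where those settings live in an ASP.NET Core host (the limit values below are placeholders, not the numbers we settled on):

```csharp
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

public class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.ConfigureKestrel(options =>
                {
                    // Kestrel's defaults are stricter than our previous embedded
                    // web server's; raise them to preserve existing behavior.
                    options.Limits.MaxRequestHeaderCount = 200;             // placeholder
                    options.Limits.MaxRequestHeadersTotalSize = 64 * 1024;  // placeholder
                });
                webBuilder.Configure(app => { /* request pipeline omitted for brevity */ });
            })
            .Build()
            .Run();
}
```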

We were able to uncover a number of issues like this as we leaned on strong functional testing.

Load Testing

Once we had confidence that a micro-service was functionally sound and exhibiting the same behavior as its .Net Framework version, we moved on to load testing. This is incredibly important, particularly for the first few critical services you convert. We uncovered an issue with our circuit-breaking code during load testing. All other tests had shown no difference in behavior, but at high load we saw a growing amount of latency. While we’d tried to keep our changes minimal, some interfaces had forced us to change previously synchronous methods to asynchronous ones. As a result, we’d missed that we were calling an externally facing ExecuteAsync(...) method on our circuit-breaking HTTP library rather than the proper ExecuteAsyncInternal(...).

Proper load testing uncovered a number of issues and enabled us to have much greater confidence in deploying our micro-services.

[Figure: Load test comparison dashboards showing successful and failed requests, latency and requests per second were identical between the Framework and Core versions of the service]

Not every micro-service will need to be load tested. Use your judgement based on the access patterns of that service in production.

Canary

Software is incredibly complex. Software in a micro-services world is orders of magnitude more complex. It’s impossible to predict every outcome or account for every variable. When you’re in a business where customers can lose money from buggy software, you have to be extremely careful in rolling out changes. For our software, this typically takes the form of feature rollouts and experiments. However, when an entire micro-service has changed its platform, you’re not rolling out a piece of functionality but an entire instance. The safe way to handle this is to use a canary-based approach. Using a load balancer, we can send a subset of the traffic to the new version of the service.

[Diagram: Infrastructure showing a load balancer sending a portion of traffic to a canary instance of the converted service]

Feedback & Monitoring

Before you release a canary, make sure you have all your monitoring dashboards and logging set up so you can compare the health of the canary versus the other instances. You’ll want to monitor a number of areas such as:

  • Instance & deployment logs
  • Application logs (e.g. in ELK)
  • Request metrics (latency, errors, database latency, etc) for the canary instance vs all others
[Figure: Metrics dashboard comparing the canary instance’s results with the older instances]

Communicate Often

When you’re dealing with a large organization and many autonomous teams, no one team can do the project alone. You need the hard work and collaboration of different teams, engineers, product owners, managers and countless others. It’s critical that you communicate leading up to and throughout the entire project. You’ll also need to get some time on their roadmaps, e.g. within their sprints for agile teams.

Additionally, communication is important for discovering good pilot partners and for uncovering differences between micro-services. This offers an opportunity to bring in a diversity of perspectives and experience, leading to greater success and learning for the team. We started by holding weekly status meetings and recruiting a lot of amazing volunteer help from engineers around the organization. This not only helped with our progress, but also with choosing some great initial applications to convert and uncovering issues that made our eventual automated script more powerful.

Motivation is important when collaborating with other teams, and keeping key stakeholders apprised is essential. We made sure to communicate regular updates, hold weekly meetings and keep people aware of everyone’s status on the “scoreboard.”

Education

Education is an important part of communication as well. We spent a lot of time putting together documentation that helped engineers convert their services without the need for training. We had instructions, guides and numerous examples. We worked to make the process painless, and feedback from engineers across the organization confirmed that the automation and documentation achieved exactly that.

Documentation

You want to make sure you develop a Runbook for teams that covers some key aspects:

  1. Conversion of the micro-service
  2. Troubleshooting
  3. Building the micro-service
  4. Functional and load test guides (how to load test, how to compare the results with those of the non-converted service and covering different tuning configurations)
  5. Deployment
[Figure: Front page of our migration Runbook]

Expect Unforeseen Problems

No amount of planning, static analysis or research can find every issue that will arise during a large project, especially during a major technology shift. Expect that for all your diligence, something will go wrong, be missed or forgotten. Just have a good process for dealing with it. We had a triage process that started in our dedicated Slack channel #dotnet-core-transition and ended with solving the issue and recording it in one of two places:

  1. If it was expected to be common across all services and could be turned into a pattern, it went in the automated script
  2. All other issues made it into our “Common Migration Issues” knowledge base (which we further built out with amazing contributions from a wide range of generous engineers converting their services)

Many of these issues dealt with third-party API changes, for example in AutoMapper. We would create documentation that was easy to follow and read like a git pull request.

[Figure: Example knowledge-base entry showing instructions and a code-change explanation for a common migration issue]
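To give a flavor of the kind of change such an entry documented, here is an illustrative before/after for AutoMapper’s move from the static Mapper API to instance-based configuration (the types are hypothetical, and this is not a copy of our actual knowledge-base entry):

```csharp
using AutoMapper;

public class UserDto { public string Name { get; set; } }
public class UserViewModel { public string Name { get; set; } }

public static class MappingExample
{
    // Before (older AutoMapper, static API removed in later versions):
    //   Mapper.Initialize(cfg => cfg.CreateMap<UserDto, UserViewModel>());
    //   var vm = Mapper.Map<UserViewModel>(dto);

    // After (instance-based API). In a real service the configuration and
    // mapper would be created once and registered in the DI container.
    public static UserViewModel Map(UserDto dto)
    {
        var config = new MapperConfiguration(cfg => cfg.CreateMap<UserDto, UserViewModel>());
        IMapper mapper = config.CreateMapper();
        return mapper.Map<UserViewModel>(dto);
    }
}
```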

Plan For Support

We knew it was going to take a lot of time and effort beyond just the initial conversion, documentation and automation development. Our team planned for a significant amount of time to be spent supporting other teams converting their services. We found it to be a lot of fun, and it helped us learn more about how different teams tackled a wide variety of functionality. Mitigating the impact on the team was important, so our product owner worked with other teams to coordinate and plan for their transitions and ensure there was an engineer available to help.

Managing A Mixed Environment

We also recognized that there would be a long period of time when we were living in a mixed environment of converted and non-converted services. Our services made good use of strongly typed contracts for compile-time validation. This meant we had to develop some software for the non-converted services to make use of any contracts developed in .Net Core. We also built some core libraries to help the communication in the other direction as well. Most of the .Net Framework services did not have .Net Standard contracts and we did not want to force a chain of required updates just to convert a service.

We handled this by building compatibility into both “sides” of our SDK: an updated Framework version and our new .Net Core version. We added support in the Framework SDK to handle the existing NServiceKit contract types while adding overloaded support for the newer ServiceStack types. On the .Net Core side, we were able to continue using the non-migrated micro-services’ contracts, since the .Net Core runtime can generally load and execute the IL in .Net Framework assemblies; POCO contract types defined in Framework assemblies work fine on .Net Core.
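The overloading idea looks roughly like this (a hypothetical sketch; the marker interfaces and client class are stand-ins, not our SDK’s actual types):

```csharp
using System.Threading.Tasks;

// Hypothetical stand-ins for the two generations of contract marker interfaces.
namespace Legacy.Contracts { public interface IReturn<TResponse> { } }
namespace Modern.Contracts { public interface IReturn<TResponse> { } }

// A single client in the updated Framework SDK accepts either contract shape
// by overloading on the marker interface, so teams can migrate contract
// assemblies independently of the services that call them.
public class ServiceClient
{
    public Task<TResponse> SendAsync<TResponse>(Legacy.Contracts.IReturn<TResponse> request)
        => SendInternalAsync<TResponse>(request);

    public Task<TResponse> SendAsync<TResponse>(Modern.Contracts.IReturn<TResponse> request)
        => SendInternalAsync<TResponse>(request);

    private Task<TResponse> SendInternalAsync<TResponse>(object request)
    {
        // Serialization and transport details omitted for brevity.
        return Task.FromResult(default(TResponse));
    }
}
```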

Next Up

In the next article in this series, we’ll dive into some of the technical issues we tackled during our .Net Core transition, take a closer look at the code, and explore some developer efficiency improvements.
