There were a lot of great talks about development practices at QCon New York 2017. These are some of the highlights for me…
Modern Software Development Practices
In his talk “Managing Data in Microservices”, Randy Shoup touched on how modern software development forms a cycle: technology lives within an organisation, the organisation requires practices, the practices affect the team culture, and the culture in turn drives the technology …
Randy’s ideal realisation of these facets is:
- Small service teams aligned to business domains that are cross-functional (they have all the skillsets needed to do their job)
- Tests help you go faster — they give you confidence to change things, they help you catch bugs earlier
- “We don’t have time to do it right!” — every project manager ever. “Do you have time to do it twice?” (if the software matters). The more constrained you are for time and resources, the more important it is to build it well once. Build one great thing instead of 2 half-finished things. Right != Perfect.
- A consequence of “doing it right” with test-driven development is that there is no need for a bug tracking system. At StitchFix, the backlog does not contain a list of bugs to fix or half-implemented features. It just contains new features and tech debt. Bugs are fixed as they come up.
“You build it, you run it.” — Werner Vogels
- Have “Autonomy, Mastery, Purpose” (Daniel Pink) — give people the freedom to do what they need to do. Google’s 20% time. Atlassian does something similar. Teams that have autonomy, mastery and purpose are more satisfied and happier.
- Are responsible for all aspects of the service that they write & maintain
- Can overcome/remove the organisational friction (see DX section below)
James Wen from Spotify talked about “feature teams” that could focus on writing features, and “ops-in-squads” which were dev-ops specialists embedded within the feature development team. Similar idea.
Meaning & Impact
If having ownership of the code is important for successful software development, what other factors are necessary to form an effective team?
For Google, effective teams looked like this:
This rings true to me. While you can write code in a chaotic environment where your boss is a sociopath and your colleagues don’t really care about the work they do, I know from experience that you will produce better code if you:
- have your own desk, equipment and work with like-minded people
- feel like your voice is heard, that you are included and that you can trust people
- work with people who are dependable
- have a clear purpose
There have only been a few times in my career when I felt my work had meaning and impact. It is very motivating.
DX — great developer experience
A great talk by Adrian Trenaman called “REMOVING FRICTION IN THE DEVELOPER EXPERIENCE” looked at what can be done to minimise the distance between a good idea and getting it into production safely. Or, phrased another way:
Deploy change frequently, swiftly and safely to production, and own the impact of that change.
Adrian also had a pyramid which was remarkably similar to Google’s one (but with less fancy language):
If your company recognises that it is an engineering organisation, then it follows that code is the primary artefact. So companies should organise themselves into teams that are optimised for writing good code.
However, even if you create an optimal team size with optimal people that have complete ownership of their service & are empowered, there is still a barrier to success: organisational friction.
Friction, and how to deal with it
(1) Staging & Testing environments
- Consider Spaghetti Diagrams (Motion Study) from Six Sigma, showing the movement of code across different environments before it hits production
Solution 1: Prefer to test in production (wherever you can). Use “dark canaries”: deploy there first and check, then add one canary to one instance of your service and check, then roll out further. AWS makes this pattern possible. There’s a Python library called nova to set up this pattern.
Solution 2: Treat your teams as startups providing services to other dev teams. There must be a really clear API (a contract) that shows how it works. Create a sandbox instance that teams can point at to test their code, rather than giving them code that they have to run themselves (such as a mock server).
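The staged rollout in Solution 1 can be sketched roughly as follows. This is only an illustration of the idea, not nova's API or anything AWS provides: the stage names, the instance counts, and the `deploy` and `healthy` callables are all hypothetical and would be supplied by your own tooling.

```python
from typing import Callable, List, Tuple

# Hypothetical stages: (name, number of instances that receive the new version).
# A "dark canary" gets the new build but serves no customer traffic.
STAGES: List[Tuple[str, int]] = [
    ("dark canary", 1),
    ("single canary", 1),
    ("half the fleet", 5),
    ("full fleet", 10),
]


def staged_rollout(
    version: str,
    deploy: Callable[[str, str, int], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy stage by stage, stopping at the first failed health check.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, instances in STAGES:
        deploy(version, name, instances)
        if not healthy(name):
            # Stop here; rolling back is left to the caller.
            break
        completed.append(name)
    return completed
```

The point of the shape is that every stage has a check before the next one starts, so a bad build is caught while it affects as few instances as possible.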
The rest of Adrian’s talk went into how to deal with other kinds of friction. To summarise:
- Seek out and remove friction in your engineering process
- Give freedom-of-choice & freedom-of-movement to your engineers
- Code is the primary artefact
- Minimise the distance between good ideas and production
Designing to make things testable in production
How do you know if something is really working in production?
You may say, “Alerting”. But how do you know if your alerting is working?
Michael Bryzek, formerly from Gilt but now CTO of his own company (Flow), talked about designing things to make them testable in production. Ensuring software quality is hard. “Verification in Production” is a powerful technique to help us build quality software.
Flow uses the following techniques to ensure software quality:
True continuous delivery
This is about psychology. Make it super simple and safe. “Is it ok to deploy this software now? If not, why not?”
“One way to do something”. Avoid having multiple half-baked ways of doing things.
“Assume continuous delivery in design process”. He talked about designing the system so you can tell it, “Transition the state of the software to X” (where X may be a version number), and the instrumentation generates a diff to take the software into state X.
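One way to picture “generate a diff to reach state X” is a function that compares what is running with what should be running and emits the steps in between. The sketch below is my own illustration, not Flow's tooling; the `plan_transition` name, the service names and the step wording are all made up.

```python
from typing import Dict, List


def plan_transition(current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Return the steps needed to move from `current` to `desired`.

    Both arguments map a service name to the version that is (or should be) running.
    """
    steps: List[str] = []
    for service in sorted(desired):
        target = desired[service]
        running = current.get(service)
        if running is None:
            steps.append(f"deploy {service} {target}")
        elif running != target:
            steps.append(f"change {service} {running} -> {target}")
    for service in sorted(current):
        if service not in desired:
            steps.append(f"remove {service}")
    return steps


# Hypothetical example data.
current = {"checkout": "1.4.0", "catalog": "2.0.1", "legacy-cart": "0.9.0"}
desired = {"checkout": "1.5.0", "catalog": "2.0.1", "search": "0.1.0"}
for step in plan_transition(current, desired):
    print(step)
```

Because the caller only states the target, deploying and rolling back become the same operation with a different X.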
No staging environments!
- Difficult to understand failure
- Expensive (30–40% of budget is common)
- Create the wrong incentives — the incentive to deploy things to staging for manual inspection. Can you guarantee that this will work in production?
- If you still get bugs in production, what was the point of all those environments?
Don’t run code locally
- If you are unsure if your code will work, write the test! Write the integration test! Run the tests locally, but resist the temptation to run the code locally and test it manually.
- Learn to trust your tests. Over time that becomes a cultural learning and increases incentives to write good tests.
Quality through architecture
The idea of extreme isolation: somebody else cannot break your software. When your software does break, you can be fairly sure it’s due to a bug in your software, not theirs.
- Event streaming is a key feature. All APIs can only talk to other APIs through emitting events and subscribing to other event streams and keeping a copy of the data for your API (AWS Kinesis). No shared database. No private-access to other APIs. Failure mode changes from “outage” to a “delay”. This stops cascading failures.
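To make the “outage becomes a delay” point concrete, here is a small in-memory sketch. It does not use the Kinesis API; `EventStream` is a hypothetical stand-in for a durable stream, and `OrderService` is an invented consumer that keeps its own copy of user data instead of calling the user service directly.

```python
from typing import Dict, List, Tuple


class EventStream:
    """In-memory stand-in for a durable event stream (hypothetical, not Kinesis)."""

    def __init__(self) -> None:
        self._events: List[Tuple[str, dict]] = []

    def publish(self, key: str, payload: dict) -> None:
        self._events.append((key, payload))

    def read_from(self, offset: int) -> List[Tuple[str, dict]]:
        return self._events[offset:]


class OrderService:
    """Keeps its own copy of user data instead of calling the user service."""

    def __init__(self, users: EventStream) -> None:
        self._users = users
        self._offset = 0
        self._local_users: Dict[str, dict] = {}

    def catch_up(self) -> None:
        # Apply any events published since the last read.
        for user_id, data in self._users.read_from(self._offset):
            self._local_users[user_id] = data
            self._offset += 1

    def email_for(self, user_id: str) -> str:
        # Served from the local copy, so it works even if the producer is down.
        return self._local_users[user_id]["email"]


stream = EventStream()
orders = OrderService(stream)
stream.publish("u1", {"email": "old@example.com"})
orders.catch_up()
stream.publish("u1", {"email": "new@example.com"})
print(orders.email_for("u1"))  # still the old address until the next catch_up()
orders.catch_up()
print(orders.email_for("u1"))
```

If the user service goes away, `email_for` keeps answering from the local copy; the worst case is that the answer is slightly out of date.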
We can actually produce higher quality software by testing in production. The failure rate (when done correctly) goes down.
“Know that the checkout process works” — Gilt checkout example:
- A bot places an order every few minutes. This is a pretty hi-fi signal of whether checkout is working. Detect the bot user and cancel the order (this cancel-order code is also in production)
- Identify test orders and immediately cancel them
- Gilt does production load testing in the morning, before the midday peak. You feel pretty confident that it will work under real load.
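The bot-order check could look something like the sketch below. It is not Gilt's code: `Shop` is a toy stand-in for a real checkout service, and `BOT_USER_ID`, `checkout_probe` and the order fields are names I have invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

BOT_USER_ID = "synthetic-checkout-bot"  # hypothetical marker for the test user


@dataclass
class Shop:
    """Toy stand-in for the real checkout service."""

    orders: Dict[int, dict] = field(default_factory=dict)
    _next_id: int = 1

    def place_order(self, user_id: str, sku: str) -> int:
        order_id = self._next_id
        self._next_id += 1
        self.orders[order_id] = {"user": user_id, "sku": sku, "status": "placed"}
        return order_id

    def cancel_test_orders(self) -> int:
        """Cancel every open order placed by the bot; return how many were cancelled."""
        cancelled = 0
        for order in self.orders.values():
            if order["user"] == BOT_USER_ID and order["status"] == "placed":
                order["status"] = "cancelled"
                cancelled += 1
        return cancelled


def checkout_probe(shop: Shop) -> bool:
    """Place one bot order, then cancel it. True means checkout worked end to end."""
    try:
        order_id = shop.place_order(BOT_USER_ID, sku="probe-item")
    except Exception:
        return False
    shop.cancel_test_orders()
    return shop.orders[order_id]["status"] == "cancelled"
```

Run on a schedule, a `False` result from the probe is a much stronger signal than a quiet dashboard, because it exercises the same path a customer would.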
“Support Sandbox accounts” - Facebook, PayPal, all payment companies do this:
- SaaS even for internal accounts
- Mark individual accounts as sandbox
- One API Key for all sandbox accounts (allow clients to create as many sandbox users as they need. Create sandbox org, run tests, delete org. Repeat.)
Treat every service (even internal ones) as a third party.
“Verify Proxy Server Works” - testing a high risk piece of software (a custom proxy server):
- Used a bunch of curl commands to ensure that the requests were routed to the correct place
- Once the tests pass, you can run this same test against production to ensure that the proxy keeps on working.
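The talk used curl for this; the same idea in Python might look like the sketch below. The routing table, the `misrouted` function, and especially the `X-Backend` response header are assumptions made for the example; a real proxy may expose the chosen backend differently, or not at all.

```python
from typing import Callable, Dict, List

# Hypothetical expectations: request path -> backend that should serve it.
EXPECTED_ROUTES: Dict[str, str] = {
    "/users/123": "user-service",
    "/orders/456": "order-service",
    "/catalog/items": "catalog-service",
}

# A fetch function takes a full URL and returns the response headers.
Fetch = Callable[[str], Dict[str, str]]


def misrouted(base_url: str, fetch: Fetch) -> List[str]:
    """Return the paths that were not served by the expected backend.

    Assumes the proxy reports the chosen backend in an `X-Backend` response
    header; that header name is an assumption made for this sketch.
    """
    failures: List[str] = []
    for path, backend in EXPECTED_ROUTES.items():
        headers = fetch(base_url.rstrip("/") + path)
        if headers.get("X-Backend") != backend:
            failures.append(path)
    return failures
```

Because the base URL and the `fetch` function are parameters, the identical checks can be pointed at a test instance first and at production afterwards. A real `fetch` could be a thin wrapper around `urllib.request.urlopen`.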
When things go wrong
- Make production access explicit (not the default access level)
- Use defined paths (don’t bypass APIs to check database. Use software like our clients use it. You may need another API if this is hard).
- Design for side effects
- Perfect documentation. Documentation is generated from integration tests that run in production. Request/response from API calls is stored after each test, sent to S3. When docs are regenerated, request/response is pulled from S3 and used in documentation.
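As a rough illustration of generating documentation from recorded test traffic, the sketch below renders stored request/response pairs as text. The record shape and the `render_docs` function are my own invention, and the S3 upload and download steps are left out entirely; the list simply stands in for whatever was fetched from storage.

```python
import json
from typing import Dict, List

# Hypothetical shape of what an integration test stores after each call.
Exchange = Dict[str, object]


def render_docs(exchanges: List[Exchange]) -> str:
    """Turn recorded request/response pairs into plain-text API documentation."""
    sections: List[str] = []
    for ex in exchanges:
        request = json.dumps(ex["request"], indent=2, sort_keys=True)
        response = json.dumps(ex["response"], indent=2, sort_keys=True)
        sections.append(
            f"{ex['method']} {ex['path']}\n"
            f"Request:\n{request}\n"
            f"Response ({ex['status']}):\n{response}"
        )
    return "\n\n".join(sections)


# Made-up example record.
recorded = [
    {
        "method": "POST",
        "path": "/orders",
        "request": {"sku": "probe-item", "quantity": 1},
        "status": 201,
        "response": {"id": 42, "status": "placed"},
    }
]
print(render_docs(recorded))
```

Since the examples come from calls that actually succeeded in production, the documentation cannot drift from the API's real behaviour without a test failing first.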
Tooling to make this work safely
- API Builder: version control for APIs, backwards compatibility, high quality mocks. Mocks are generated from the API contract.
- Vivid Cortex: real time DB Monitoring
- SumoLogic / Splunk: super simple alerts from a log
- Have to implement this approach at the beginning of a project
- Test your tests, do the work so you can trust your tests, run a subset of tests in production
- Invest in continuous delivery — upload a PR, run tests, tests pass, merge code, deploy to prod. But also have the ability to merge without tests in emergencies.
- Sandbox accounts are powerful
- High quality, trustworthy mocks
- Real-time feedback from production
- Separate deployment (to production) from release (being used in production). Roll out incrementally. Use feature flags. Can test whether new software can handle the traffic by using Splitter.
- Ownership of code through design, dev, deployment and retirement is important for developer experience & for producing software faster, better & cheaper.
- Deploy change frequently, swiftly and safely to production, and own the impact of that change.
- Minimise the distance between good ideas and production.
- Remove the sources of friction that occur when creating software.
- Testing in production is a valid approach which leads to faster, cheaper and better software.
- Trust your tests. If you don’t, ask yourself why, then address that.
- Code as the primary artefact of an engineering organisation.