So many people have told me about a book called ‘Release It!’.
I recently finished reading it. Honestly, I enjoyed it a lot.
In this post I would like to share my main takeaways.
- An SLA is a quantitative characteristic which can help you evaluate how many resources you need.
- You should reorient your SLAs around specific functions and not the service as a whole.
- When the average response time violates your SLA, you can detect it and return an error instead of a slow response. That behaviour should be documented so that your clients expect it.
- You should take SLAs of your application personally.
- Limit database query results to avoid situations where millions of rows are returned.
- The caller has to indicate how much data it is willing to accept.
- In the test environment, try to keep the same ratio of front-end to back-end servers as in production.
- To test your performance, choose the most expensive transaction and send twice as many requests as you expect in the worst case. Your system might slow down, but it should be able to recover afterwards.
- If your marketing department is planning to launch some special offer campaign, you might need to allocate additional capacity to be able to handle the load.
- Additionally you can segregate the part of your infrastructure which will handle offer campaigns.
- Always fail fast.
- Your load balancer should fail fast.
- Make use of the Circuit Breaker and Timeout patterns to protect your system.
- Involve business stakeholders while deciding on the circuit breaker rules.
- Failed servers increase the load on the remaining ones, which in turn makes them fail faster.
- When you have potential bottlenecks in your architecture either scale them out or implement a fallback.
- synchronized is dangerous. If a system you depend on hangs, your backend can crash because too many threads are waiting.
- Divide available threads into two parts: for admin tasks and for regular requests.
- Firewalls can drop connections while the sender still believes they are valid.
- Backups should go over one interface and production data over the other.
- Be ready to handle the database failover.
- Using connection pools saves time.
- Choose the connection pool size wisely.
- Monitor resource pools for contention.
- Don’t accept traffic until your app is ready to serve requests.
- You need to have a way to throttle incoming requests.
- Java apps cache DNS entries.
- Keep logs on a separate drive to avoid I/O contention.
- Your monitoring can help to recognise different patterns.
- The app should expose as many metrics as possible.
- Distinguish between instantaneous behavior and the system status.
- The monitoring system should be aware of the business features, not only of the system features.
- You should see the global state of the whole system, for every single server.
- Prefer scripts to GUIs for administrative tasks because scripts can be automated.
- Make your administrative functions scriptable.
- Log level ERROR or SEVERE should only be assigned to the problems which require some action.
- Run longevity tests. It’s the only way to catch longevity bugs. Also simulate the slowdown period during the night.
- Hunt for memory leaks.
- In case of an OutOfMemoryError there might not be enough memory to create a log object, so nothing gets logged. Use external monitoring tools to avoid blindness.
- Use sessions carefully since they take memory. In Java, holding session data through a SoftReference lets the garbage collector reclaim it when memory runs low.
- Garbage collection should take around 2 % of the execution time.
- Avoid handcrafted SQL. Use hinting, views and indexing.
- Use datasets with realistic sizes.
- Create a dashboard with response times for the database. You can use that data to decide which indexes to create.
- Run your apps as a non-root user.
- Limit the maximum amount of memory a cache can use.
- Implement a data purging mechanism from the beginning.
- Copy logs out of production to staging for analysis, or use log rotation.
- Have tools/scripts which immediately capture the state of the broken system for further investigation.
- Capture the configuration files as they were at the moment the incident happened.
- Get log files and thread dumps for investigation.
- Set up a way to take thread dumps in advance, before you need one.
- CPU, RAM and bandwidth might be cheap, but always consider multiplier effects.
- Early decisions affect the most.
- Requirements always grow significantly.
- It’s better to spend money once during the development than to have recurring operational expenses later.
- Tests don’t cover everything. You get real feedback only after you go live.
- Use test harnesses to emulate low-level network errors.
- Use different ports for test harnesses to emulate different behaviour.
- If you have a firewall in production, you should have an alternative in staging since you’ll have configuration-related issues anyway.
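
To make the "fail fast" and Circuit Breaker items above a bit more concrete, here is a minimal Python sketch of the idea. It is not code from the book. The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold` and `reset_after` are my own hypothetical choices, and a real implementation would also need thread safety and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures (hypothetical API)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means "closed"

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of tying up a thread on a sick dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, let a single trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The wrapped function is expected to enforce its own timeout (for example, via the `timeout` argument of `urllib.request.urlopen`), so that a hanging dependency turns into an exception the breaker can count. The threshold and cool-down values are exactly the kind of rules worth agreeing on with business stakeholders.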
I hope this list is helpful and that you’ll decide to read ‘Release It!’ as well!
If you think something should be added to this list, feel free to post your ideas in the comments.
Originally published at antonfenske.com on August 31, 2016.