Troubleshooting a Running System

vincent
May 5, 2018

Erlang In Anger is a little guide about how to be the Erlang medic in a time of war. It is first and foremost a collection of tips and tricks to help understand where failures come from.
observer_cli is inspired by the book and by recon. Its main purpose is to surface abnormalities in a production system in a simple and clear way.
Most of these notes come from the Erlang In Anger book, which is published under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Common Overload Observation

1. CPU (Scheduler Utilization)
CPU is very hard to profile. Because Erlang CPU usage as reported by top isn’t the most reliable value (schedulers do idle spinning to avoid going to sleep and impacting latency), a metric exists that is based on scheduler wall time.
For any time interval, scheduler wall time can be used as a measure of how ‘busy’ a scheduler is. A scheduler is busy when:
* executing process code
* executing driver code
* executing NIF code
* executing BIFs
* garbage collecting
* doing memory management
The display resembles what htop reports for each core.
The value here represents scheduler utilization rather than CPU utilization. The higher the ratio, the higher the workload. The basic usage is explained in the Erlang/OTP reference manual.
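The ratio described above can be computed by hand from two scheduler wall time snapshots. The following is a minimal sketch that uses only erlang:system_flag/2 and erlang:statistics/1; the module name sched_util and its function names are made up for illustration and are not part of observer_cli.

```erlang
-module(sched_util).
-export([sample/1, utilization/2]).

%% Take two scheduler_wall_time snapshots IntervalMs apart and return
%% [{SchedulerId, Utilization}], where Utilization is a fraction (0.0 to 1.0).
sample(IntervalMs) ->
    %% Wall time accounting is off by default, so switch it on first.
    Old = erlang:system_flag(scheduler_wall_time, true),
    T0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(IntervalMs),
    T1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    %% Restore whatever setting was in place before the call.
    erlang:system_flag(scheduler_wall_time, Old),
    utilization(T0, T1).

%% Each snapshot entry is {SchedulerId, ActiveTime, TotalTime}.
%% Utilization is the active time delta divided by the total time delta.
utilization(T0, T1) ->
    [{Id, ratio(A1 - A0, W1 - W0)}
     || {{Id, A0, W0}, {Id, A1, W1}} <- lists:zip(T0, T1)].

ratio(_Active, 0) -> 0.0;
ratio(Active, Total) -> Active / Total.
```

Calling sched_util:sample(1000) in a shell gives one entry per scheduler for the last second. A value close to 1.0 means that scheduler spent almost the whole interval doing real work.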

scheduler utilization
Global View

2. Process, Port, and Atom Count/Limit
Trying to get a global view of processes is helpful when trying to assess how much work is being done in the VM in terms of tasks. A general good practice in Erlang is to use processes for truly concurrent activities. On web servers, you will usually get one process per request or connection, and on stateful systems, you may add one process per user. The number of processes on a node can therefore be used as a metric for load.
Don’t use dynamic atoms! Atoms go in a global table and are cached forever, so keep an eye on atom memory, atom count, and atom limit.
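As a rough sketch of how these counts and limits can be read from inside the VM, the module below queries erlang:system_info/1 for each pair. The module name vm_limits is hypothetical, and the atom_count and atom_limit keys assume OTP 20 or later.

```erlang
-module(vm_limits).
-export([usage/0]).

%% Return [{Kind, Count, Limit, PercentUsed}] for processes, ports and atoms.
usage() ->
    [row(process, process_count, process_limit),
     row(port, port_count, port_limit),
     row(atom, atom_count, atom_limit)].

row(Kind, CountKey, LimitKey) ->
    Count = erlang:system_info(CountKey),
    Limit = erlang:system_info(LimitKey),
    {Kind, Count, Limit, 100 * Count / Limit}.
```

A percentage that keeps climbing for atoms is a strong hint that something is creating atoms dynamically, since the atom table never shrinks.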

3. Memory
The total field contains the sum of the memory used for processes and system (which is incomplete, unless the VM is instrumented!). processes is the memory used by Erlang processes, their stacks and heaps. system is the rest: memory used by ETS tables, atoms in the VM, refc binaries, and part of the hidden data that the total does not fully account for.
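The fields mentioned above come from erlang:memory/0, which returns a property list of byte counts. Here is a small sketch that picks out the most useful ones; the module name mem_view and the choice of fields are assumptions made for this example.

```erlang
-module(mem_view).
-export([summary/0]).

%% Return selected erlang:memory/0 fields, in bytes, as [{Field, Bytes}].
summary() ->
    Mem = erlang:memory(),
    [{Key, proplists:get_value(Key, Mem)}
     || Key <- [total, processes, system, atom, binary, code, ets]].
```

Comparing atom, binary, code, and ets against system shows which area of the VM dominates before digging any deeper.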
If you want the total amount of memory owned by the virtual machine, as in the amount that will trip system limits (ulimit), this value is more difficult to get from within the VM.
If you want the data without calling top or htop, you have to dig down into the VM’s memory allocators to find things out. The subsections below dig into common sources of memory leaks.

3.1 Code
The code on an Erlang node is loaded in memory in its own area, and sits there until it is garbage collected. Only two copies of a module can coexist at one time, so looking for very large modules should be easy-ish.
3.2 ETS
ETS tables are never garbage collected, and will maintain their memory usage as long as records are left undeleted in a table. Only removing records manually (or deleting the table) will reclaim memory.
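Since ETS memory is only reclaimed by explicit deletes, it helps to know which tables are the largest. The sketch below ranks tables by size using ets:all/0 and ets:info/2; the module name ets_top is made up for illustration. Note that ets:info(Table, memory) reports words, so the result is multiplied by the word size to get bytes.

```erlang
-module(ets_top).
-export([by_memory/1]).

%% Return the N largest ETS tables as [{Table, Name, Bytes}], biggest first.
by_memory(N) ->
    WordSize = erlang:system_info(wordsize),
    Rows = [{Tab, ets:info(Tab, name), Words * WordSize}
            || Tab <- ets:all(),
               Words <- [ets:info(Tab, memory)],
               %% info/2 returns undefined if the table was deleted meanwhile.
               is_integer(Words)],
    Sorted = lists:sort(fun({_, _, A}, {_, _, B}) -> A >= B end, Rows),
    lists:sublist(Sorted, N).
```

For example, ets_top:by_memory(10) lists the ten biggest tables on the node.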

ETS Tables

3.3 Processes
There are a lot of different ways in which process memory can grow. Most interesting cases will be related to a few common causes: process leaks (as in, you’re leaking processes), specific processes leaking their memory, and so on. It’s possible there’s more than one cause, so multiple metrics are worth investigating. The process count itself is skipped here because it was covered above.
You can view all information about the top N processes with one click.
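A top N listing of this kind can be approximated with erlang:processes/0 and erlang:process_info/2. This is only a sketch of the idea, not how observer_cli or recon implement it, and the module name proc_top is hypothetical.

```erlang
-module(proc_top).
-export([by/2]).

%% Attr is any integer-valued process_info/2 item, for example
%% memory, reductions, message_queue_len or total_heap_size.
%% Returns the N biggest consumers as [{Pid, Value}], biggest first.
by(Attr, N) ->
    Rows = [{Pid, Val}
            || Pid <- erlang:processes(),
               %% process_info/2 returns undefined for a process that
               %% has already died, which this pattern filters out.
               {_, Val} <- [erlang:process_info(Pid, Attr)]],
    lists:sublist(lists:reverse(lists:keysort(2, Rows)), N).
```

Usage is along the lines of proc_top:by(message_queue_len, 10). Walking every process has a cost on nodes with a very large process count, so it is worth running sparingly in production.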

The biggest Num consumers by reductions.
The biggest Num consumers by memory.
The biggest Num consumers by message queue length.
The biggest Num consumers by total heap size.
  • Memory Used/Msg Queue Len/Garbage Collection
  • Link/Monitor/Reductions/Registered Name
  • Message In Mailbox
  • Process Dictionary
  • Current Stack
  • Process State(gen_xyz behaviour process)
Specific Process Attribute

3.4 Binary
Erlang’s binaries are of two main types: ProcBins and Refc binaries. Binaries up to 64 bytes are allocated directly on the process’s heap, and their entire life cycle is spent in there. Binaries bigger than that get allocated in a global heap for binaries only, and each process that uses one holds a local reference to it in its local heap. These binaries are reference-counted, and the deallocation will occur only once all references are garbage-collected from all processes that pointed to a specific binary.
In 99% of the cases, this mechanism works entirely fine. In some cases, however, the process will either:
1. do too little work to warrant allocations and garbage collection;
2. eventually grow a large stack or heap with various data structures, collect them, then get to work with a lot of refc binaries. Filling the heap again with binaries (even though a virtual heap is used to account for the refc binaries’ real size) may take a lot of time, giving long delays between garbage collections.
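One way to spot processes caught in either situation is to count the refc binary references each process holds, force a garbage collection, and count again. Processes that release the most references were holding on to binaries they no longer needed. This is the same idea as recon:bin_leak/1; the sketch below uses only OTP calls, and the module name bin_refs is made up for illustration.

```erlang
-module(bin_refs).
-export([delta/1, top/1]).

%% Number of refc binary references currently held by Pid.
count(Pid) ->
    case erlang:process_info(Pid, binary) of
        {binary, Refs} -> length(Refs);
        undefined -> 0   % the process has already exited
    end.

%% References released by forcing a garbage collection of Pid.
%% The result can be negative if the process allocated in between.
delta(Pid) ->
    Before = count(Pid),
    erlang:garbage_collect(Pid),
    Before - count(Pid).

%% The N processes that released the most references, biggest first.
top(N) ->
    Rows = [{Pid, delta(Pid)} || Pid <- erlang:processes()],
    lists:sublist(lists:reverse(lists:keysort(2, Rows)), N).
```

Be aware that bin_refs:top/1 garbage collects every process on the node, so it is a diagnostic to run occasionally rather than on a timer.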

The biggest Num consumers by binary memory.

4. Memory Fragmentation
For a more detailed explanation, see Erlang In Anger, Section 7.3.

memory fragmentation

5. Network
Similarly to processes, Erlang ports allow a lot of introspection. The info can be accessed by calling erlang:port_info(Port, Key), and more info is available through the inet module.
It is possible to fetch a given attribute from all inet ports (TCP, UDP, SCTP) and return the biggest Num consumers.
The values to be used can be the number of octets (bytes) sent, received, or both (send_oct, recv_oct, oct, respectively), or the number of packets sent, received, or both (send_cnt, recv_cnt, cnt, respectively).
For an explanation of every option, see inet:getstat/2.
You can view all information about the top N ports with one click.
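The ranking described here can be sketched with erlang:ports/0, erlang:port_info/2, and inet:getstat/2. The module name inet_top is hypothetical. The sketch accepts a single inet:getstat/2 option such as send_oct or recv_cnt; the combined oct and cnt values are not getstat options and are left out to keep the example short.

```erlang
-module(inet_top).
-export([by/2]).

%% Stat is one inet:getstat/2 option: send_oct, recv_oct, send_cnt, recv_cnt.
%% Returns the N biggest consumers as [{Port, Value}], biggest first.
by(Stat, N) ->
    Rows = [{Port, Val}
            || Port <- erlang:ports(),
               is_inet(Port),
               %% getstat/2 returns {error, _} if the socket closed meanwhile.
               {ok, [{_, Val}]} <- [inet:getstat(Port, [Stat])]],
    lists:sublist(lists:reverse(lists:keysort(2, Rows)), N).

%% inet ports are recognised by the name of their driver.
is_inet(Port) ->
    case erlang:port_info(Port, name) of
        {name, Name} ->
            lists:member(Name, ["tcp_inet", "udp_inet", "sctp_inet"]);
        undefined ->
            false
    end.
```

For example, inet_top:by(recv_oct, 5) returns the five sockets that have received the most bytes.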

The biggest Num consumers by cnt.
Specific Port Attribute

6. Conclusion From The End of Erlang In Anger
Maintaining and debugging software never ends. New bugs and confusing behaviours will keep popping up around the place all the time. There would probably be enough stuff out there to fill out dozens of manuals like this one, even when dealing with the cleanest of all systems. I hope that after reading this text, the next time stuff goes bad, it won’t go too bad. Still, there are probably going to be plenty of opportunities to debug production systems. Even the most solid bridges need to be repainted all the time in order to avoid corrosion to the point of their collapse.
Best of luck to you.
