Lightning Network routing FUD (and zombies)

Robert Olsson
Jul 28, 2018


The Lightning Network has been growing rapidly lately. It is, however, still very much in its infancy, and there are lots of misunderstandings and worries. The only thing I consider a real issue right now is routing. The problems the network currently suffers from are huge, but luckily the fixes are small and easy to implement. I’d say the Lightning Network is still in beta, but the routing part is barely alpha quality.

For instance, numbers have been circulating about LN having 3000 public nodes. About 1000 of those nodes don’t have any channels at all, so they are not an issue for routing; they just take up bandwidth and inflate the numbers. About 500 have only one channel, so they will not be used as routing hops. Unconnected nodes are easy to prune from the graph, but since routing algorithms wouldn’t use them anyway, it doesn’t matter much. They just look annoying.

However, on closer analysis, not many of those 3000 nodes have actually said anything to the network lately. In the last 30 days only 1400 of them have sent a node or channel message, even though there are usually a lot of channel updates in the network. In the last 7 days only 1100 have, and in the last 24 hours only 900 nodes have updated anything. The rest are zombies. Source: https://www.robtex.com/lightning/node/
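The counting above can be sketched in a few lines of Python. This is only an illustration: the node names and timestamps are made up, and `count_active` is a hypothetical helper, not part of any Lightning implementation or of the robtex.com tooling.

```python
from datetime import datetime, timedelta

def count_active(last_seen, now, window):
    """Count nodes whose latest node/channel message falls inside the window."""
    cutoff = now - window
    return sum(1 for ts in last_seen.values() if ts >= cutoff)

# Hypothetical sample data: node id -> time of its most recent gossip message.
now = datetime(2018, 7, 28)
last_seen = {
    "node_a": now - timedelta(hours=3),
    "node_b": now - timedelta(days=5),
    "node_c": now - timedelta(days=20),
    "node_d": now - timedelta(days=90),  # silent for months: a zombie
}

for label, window in [("24 hours", timedelta(days=1)),
                      ("7 days", timedelta(days=7)),
                      ("30 days", timedelta(days=30))]:
    print(label, count_active(last_seen, now, window))
```

Anything that never shows up in even the widest window is a zombie candidate.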

I have only identified zombie nodes so far, but since each channel has two endpoints, what are the odds that both sides are up and running, so that the channel is actually useful? And all those channels are considered when making payments.

Routing

Routing in the Lightning Network is source-based: the originator of a payment is responsible for finding a working route to the recipient. Exactly how this is done is up to the sender’s implementation. The network just follows orders hop by hop, and the intermediate nodes do not actually know where a packet is ultimately going, where it originated, or what it contains. Apart from making the routing flexible, this also protects privacy and gives you a great deal of anonymity.

Also, the current balances of the channels are hidden from the general network. Only the two nodes connected via a channel know its balance. This also enhances privacy, but it makes it a bit harder to find a working route from point A to point B. The network does know the total capacity of each channel, that is, the maximum that can flow in one direction before funds have to flow back in the other direction.

Since we don’t know everything about the network’s status, routing is solved by simply trying different paths in a specific order, normally cheapest first. An example would be:

  1. A->C->B
  2. A->D->E->F->B
  3. A->D->E->G->B
  4. A->D->H->B

The sender keeps trying down the list until the payment gets through. If there is a problem forwarding the packet, the sender gets an error message from the intermediate node.

For instance, C can tell A that its channel with B is disabled at the moment, and D can tell A that D->E isn’t viable in path #2, so the sender can skip path #3, which also includes D->E, and jump directly to path #4. It can also cache that knowledge for future similar payments.
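The trial-and-error loop with pruning can be sketched as follows. This is a minimal illustration, not how any particular implementation does it: `try_paths` and `fake_send` are hypothetical names, and the set of failing channels is hard-coded to match the example above.

```python
def try_paths(paths, send):
    """Try candidate paths in order, skipping paths that reuse a failed channel.

    send(path) returns None on success, or the failing (from, to) hop.
    """
    bad_channels = set()
    for path in paths:
        hops = list(zip(path, path[1:]))
        if any(hop in bad_channels for hop in hops):
            continue  # already known to fail, do not waste an attempt
        failed = send(path)
        if failed is None:
            return path, bad_channels
        bad_channels.add(failed)
    return None, bad_channels

# The four candidate paths from the example, cheapest first.
paths = [
    ["A", "C", "B"],
    ["A", "D", "E", "F", "B"],
    ["A", "D", "E", "G", "B"],
    ["A", "D", "H", "B"],
]

# Assumption for this sketch: C->B and D->E are the channels that fail.
down = {("C", "B"), ("D", "E")}
attempts = []

def fake_send(path):
    """Stand-in for a real payment attempt; reports the first failing hop."""
    attempts.append(path)
    for hop in zip(path, path[1:]):
        if hop in down:
            return hop
    return None

route, learned = try_paths(paths, fake_send)
print(route)          # the path that finally worked
print(len(attempts))  # number of attempts actually made
```

Path #3 is never attempted because D->E was already reported as failing in path #2. The returned `bad_channels` set is what a sender could cache for later payments.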

Problem #1, Communication

This trial-and-error reporting sounds good in theory. However, I’ve discovered that the different implementations interpret the BOLT specification differently and thus cannot understand each other’s error messages about which channel actually has issues. This problem is now known, and the dev teams are working together to be able to parse both formats and to agree on the proper format of the error messages in the future. I discovered this while developing a balancing script that was supposed to learn which channels are down, but sometimes it couldn’t because the error returned was garbled. Full interoperability is still far away.

Problem #2, Zombies

There are many nodes on mainnet that haven’t sent a single message for months. Still, those nodes and their channels are in the network graph and are considered for payment routing. Most of those channels should probably be force-closed, unless you have a channel with a node you *know* is just down for a 3-month maintenance and will be back shortly.

Problem #3, Down and flapping channels

Even after force-closing 200 channels to zombie nodes that haven’t been active for the last 7 days, I now have 329 channels that are active but still 148 channels that are inactive. Most of those peers do not even accept a TCP/IP connection, so they are either temporarily down or permanently gone, in which case I will have to force-close them too.

Down or flapping channels are not an issue per se. The BOLT specification has a bit you can set to announce that a channel is currently down. That bit is forwarded through the network so the channel will not be used for path finding. Excluding channels that are known to be offline improves the success rate tremendously. The only problem is that no implementation actually sets that bit, so probably nobody reads it either. The only time I’ve seen the disable flag used for a channel is when it is permanently closed. That only happens once per channel, of course, and you can already easily detect closures on-chain, so it doesn’t help.

Currently, nodes that discover that their channel with another node is down will still announce the channel as being up, without the disable flag.

Thus the sender will think the channel is up and will include it in its attempts, and there are many channels like this!
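To make the disable bit concrete, here is a minimal sketch of how a sender could honor it if implementations actually set it. As far as I understand BOLT 7, the flags field of a `channel_update` uses bit 0 for the direction and bit 1 for "disable"; the dictionaries and helper names below are hypothetical and only mimic a parsed update.

```python
DIRECTION_BIT = 1 << 0  # which endpoint the update comes from
DISABLE_BIT = 1 << 1    # set when the channel is currently unusable

def is_disabled(flags):
    """Return True if the disable bit is set in a channel_update flags field."""
    return bool(flags & DISABLE_BIT)

def set_disabled(flags, disabled):
    """Return flags with the disable bit set or cleared, direction untouched."""
    return flags | DISABLE_BIT if disabled else flags & ~DISABLE_BIT

def usable_channels(updates):
    """Drop every channel direction that is announced as disabled."""
    return [u for u in updates if not is_disabled(u["flags"])]

# Hypothetical parsed updates; only the fields needed for the sketch.
updates = [
    {"short_channel_id": "527909:169:0", "flags": 0b00},
    {"short_channel_id": "527318:2256:1", "flags": 0b11},  # disabled
    {"short_channel_id": "520048:1969:0", "flags": 0b01},
]

print([u["short_channel_id"] for u in usable_channels(updates)])
```

A node that notices its peer is offline would broadcast an update with `set_disabled(flags, True)`, and clear the bit again once the peer reconnects.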

I added a script to the rompert.com node that detects which channels are temporarily down. However, I can’t set the disable flag via the API, so instead I set an extremely high fee on the dead channels, hoping it will deter other nodes from considering them in their paths. It seems to help, but I can only signal in the outgoing direction, so implementations will still try to route in the inbound direction, which of course will fail.

What we can do in the meantime

While we wait for implementations to get better at routing, signalling, and avoiding channels that are known to be down, check your own channels and close the ones you discover are down and will never come back. I’ve developed a tool to help kill those zombie channels. As always, be careful: verify and decide for yourself which ones to close. Closing useless channels and reclaiming the funds is of course a good idea regardless. Use those funds to open new channels instead and improve the network with live channels.

Failed proof of what I’m talking about

I was going to post an example from the wild of routing attempts that use closed channels, so I checked the awesome article by Andreas Brekken, because I remembered he had issues paying. However, as usual in these early days, when you dig for something in the Lightning Network you find something else. It’s like archeology. But newer.

So, he failed to pay 67.49 USD to Blockstream. The question is why. His wallet made 10 different attempts via different paths.

The channel identifier is just a number, commonly understood among the implementations, that signals which transaction funds the channel. Unfortunately they have not yet agreed on how to present that number, so they all use different formats. I’ve converted these numbers to their C-lightning and lnd equivalents:

80e250000a90000
527909:169:0
580442083918675968

80bd60008d00001
527318:2256:1
579792272683433985

7ef700007b10000
520048:1969:0
571798823130693632
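The three notations are the same 64-bit number written differently. A small sketch of the conversion, assuming the BOLT 7 short channel ID layout (3 bytes block height, 3 bytes transaction index, 2 bytes output index), which is consistent with the triples listed above; the function names are my own.

```python
def decode_scid(scid):
    """Split a 64-bit short channel ID into (block, tx index, output index)."""
    block = scid >> 40
    tx_index = (scid >> 16) & 0xFFFFFF
    output_index = scid & 0xFFFF
    return block, tx_index, output_index

def encode_scid(block, tx_index, output_index):
    """Pack (block, tx index, output index) into the 64-bit integer form."""
    return (block << 40) | (tx_index << 16) | output_index

def to_colon(scid):
    """Format as block:tx:output, the colon-separated notation shown above."""
    return "%d:%d:%d" % decode_scid(scid)

scid = int("80e250000a90000", 16)  # hex notation
print(to_colon(scid))              # 527909:169:0
print(scid)                        # 580442083918675968
print(format(encode_scid(527909, 169, 0), "x"))  # 80e250000a90000
```

Note that the hex form drops the leading zero, which is why it has 15 digits instead of 16.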

Let’s investigate one of them that might be interesting. Let’s pick the first one, 80e250000a90000.

This is a 100000 sat channel, worth about 8 USD. The capacity is public information, and you can see the channel info here:

https://www.robtex.com/lightning/channel/580442083918675968

Finding out that it was one of my node’s channels makes it even easier to investigate. Actually, all three of the channels shown include my node, so it doesn’t look like the sender uses very different paths. But we will stick to this channel. Since it is on my node, I can actually see the balance. Andreas did his tests 18 days ago, so it is hard to know how the balance has changed since then. The channel is active today, though.

{
    "active": true,
    "remote_pubkey": "03a2b9adc3086b0ba7844bcda0159f11967e5558e430f168ee7ee797cb9830d742",
    "channel_point": "acd382d13bbde40476f793f3285a43a32b5aee13bc0b3ba504ff085c9c168533:0",
    "chan_id": "580442083918675968",
    "capacity": "100000",
    "local_balance": "97494",
    "remote_balance": "1600",
    "commit_fee": "906",
    "commit_weight": "724",
    "fee_per_kw": "1250",
    "unsettled_balance": "0",
    "total_satoshis_sent": "58614",
    "total_satoshis_received": "156108",
    "num_updates": "1219",
    "pending_htlcs": [],
    "csv_delay": 144,
    "private": false
}

So, Andreas’s payment would be about 820000 satoshis.

The current remote balance is 1600, so clearly I wouldn’t be able to route the payment for him today. But could I have at the time?

No. The channel can fit at most 100000 satoshis (minus some fees and reserve). It’s far too tiny to ever fit a payment of 820000 satoshis. And that fact is well known to the network, since the funding transaction is on the blockchain. So why did his wallet even try? Only ACINQ knows in this case, I guess. Time to post an issue on their GitHub.
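The check the wallet could have made before trying is trivial. A minimal sketch, using the capacity and payment size from this example; the second channel and the helper name `can_carry` are made up for illustration, and fees and channel reserve are ignored, so this is only an upper bound.

```python
def can_carry(capacity_sat, amount_sat):
    """A channel can never forward more than its total public capacity."""
    return amount_sat <= capacity_sat

payment_sat = 820000  # roughly the 67.49 USD payment

# First entry is the channel discussed above; the second is hypothetical.
channels = [
    {"chan_id": "580442083918675968", "capacity": 100000},
    {"chan_id": "hypothetical-large-channel", "capacity": 2000000},
]

candidates = [c for c in channels if can_carry(c["capacity"], payment_sat)]
print([c["chan_id"] for c in candidates])
```

Passing this filter does not mean the payment will succeed, since the hidden balance may still be on the wrong side, but failing it means the attempt is pointless.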

An example of what I was actually going to demonstrate

I’ll pick another fresher example, from another guy:

In this case I investigated #2, where I am the other side of the channel. The error code is a bit misleading: the other node knows me very well. It also knows for sure that the channel fails to get established because of some interoperability issue, but it still doesn’t signal to the network that the channel is disabled, so the payer tries to include the channel, which is in a zombie state.

Conclusion

My conclusion is that despite these annoying things, most of the time the Lightning Network actually works pretty well and is successfully used in production. Seeing how well it performs, while knowing all the bugs I’ve discovered, is actually amazing. Now we just need to get rid of these bugs and Lightning will work 100 times better.

And in the meantime, why not make the network better by killing some zombies from your own node at https://www.moneni.com/nodematch, where you can also find new possible peers?

Disclosure: I do most of the technology work on moneni.com, robtex.com and rompert.com, but I’m not affiliated with any other companies or persons mentioned, and I am [unfortunately] not paid by anyone to write this [yet].
