This article is a call for researchers to think more systematically about path finding strategies on the Lightning Network. It is written in a rather educational style because I believe that at this point input from users could be very valuable. I therefore decided against heavily formalizing the problem with graph and probability theory. The text is also not structured the way I would structure a research paper: I use the first two of my 7 proposed evaluation criteria to introduce the problem and explain its difficulty.
Whenever a payment is conducted on the Lightning Network, the transport layer that forwards the payment requires the payer to provide a path of channels to the payee along which the payment is supposed to be forwarded. This turns out to be tricky for several reasons, among them:
- Nodes can only forward a payment along a channel if they have enough balance of the channel’s capacity on their side. This information is not available to the payer who chooses the path, as the gossip protocol of the Lightning Network only shares the capacity of a channel with the network, not how the funds are actually split between the two peers. (See below why gossiping balances should never be implemented!)
- Sending out a payment already locks up funds of the payer (and the nodes along the path).
- After being initialized, payments currently can’t be canceled by the payer.
Because of the first reason there is actually a pretty high likelihood that a payment along a single path will fail. Because of the second and third reasons, the payer's Lightning node can only try one path after another; otherwise the sender risks overpayment.
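This forced sequential trial-and-error can be sketched as a tiny simulation. Everything here is hypothetical and only illustrates the loop: the balance model (a dict keyed by directed channel) and the function names are my own, not any node implementation's API.

```python
def try_path(balances, path, amount):
    """Simulated single-path attempt: succeeds only if every channel
    on the path has at least `amount` on the sending side.
    `balances` maps a directed channel (a, b) to a's local balance."""
    return all(balances[(a, b)] >= amount for a, b in zip(path, path[1:]))

def pay_sequentially(balances, candidate_paths, amount):
    """Try candidate paths one after another, as current senders must:
    in-flight HTLCs lock funds and cannot be canceled, so trying paths
    in parallel would risk overpayment. Returns the successful path
    (or None) and the number of attempts made."""
    attempts = 0
    for path in candidate_paths:
        attempts += 1
        if try_path(balances, path, amount):
            return path, attempts
    return None, attempts
```

In a real node each failed attempt costs a network round trip along the whole path, which is where the reported multi-minute delays come from.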
While testing a single path usually takes less than a second, the time for finding a path with enough liquidity to settle the payment can add up. Christian Decker reported in 2019 that 5% of all payments initialized from his node would have taken more than 3 minutes to be fulfilled. This is a terrifying user experience! Imagine that for every 20th customer at a grocery store, everyone in line had to wait 3 extra minutes at the cashier because that particular customer's payment was not arriving at the store.
The main goal of my research here at NTNU in Norway is to improve the payment process of Lightning Network nodes by increasing a node's ability to find paths to the recipient. There are many interesting ideas out in the wild, but it was in particular the criticism by Peter Rizun on the What Bitcoin Did Podcast, the Hush Relay paper, and also the past discussions about Ant Routing (thanks to fulmo for digging out the tweets) that made me think very actively about the following question:
How can we evaluate scientifically whether a path finding algorithm is useful for being deployed on the Lightning Network or not?
To me it seems that this rather simply put question requires a ton of research to be answered satisfyingly. First we need to figure out what useful means and which goals should be fulfilled by a path finding algorithm or solution. Then we need to discuss how each of these goals can be quantified and measured. In this article I want to share the list of 7 criteria that I came up with over the past 6 months (actually I have been thinking about this for longer, but I only started my research position at NTNU 6 months ago). Before I go through the list of goals and metrics, I just want to demonstrate why the answer to this question is not obvious.
The first and obvious criterion that comes to mind is of course whether the algorithm is able to find a path with enough liquidity to route the payment successfully. I call this metric the success rate. As far as I understand, both the Hush Relay approach and the Ant Routing approach can guarantee with 100% certainty that they find a path if one exists. (In the case of Hush Relay it might be a multi path payment, but that will be supported on Lightning.)
So is this the end of the article? I mean we have at least two algorithms that work?
Unfortunately the answer seems to be "No". Having algorithms that guarantee such a success rate is certainly very desirable, but it is by no means sufficient. Before explaining the two strategies just mentioned, let me introduce a third strategy that would be even simpler to implement and show that it also comes with such a guaranteed success rate.
The third algorithm: sharing balance information over the gossip protocol and applying a standard graph search technique would, in theory, give us a 100% guarantee that we find a path (or multi path) with enough liquidity if one exists, since we would have full information about the balances of the channels and graph search techniques are well known.
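As a toy illustration of this third algorithm, here is a breadth-first search over a channel graph with hypothetically gossiped balances. The dict-of-directed-channels representation and the function name are my own invention for the example:

```python
from collections import deque

def find_path_with_balances(balances, source, target, amount):
    """BFS over the channel graph, traversing only channels whose
    (hypothetically gossiped) balance on the sending side covers
    `amount`. With full balance information this finds a single-path
    route whenever one exists."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # reconstruct the path by walking the parent pointers back
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for (a, b), balance in balances.items():
            if a == node and b not in parents and balance >= amount:
                parents[b] = node
                queue.append(b)
    return None  # no single path has enough liquidity
```

The point of the article is precisely that this simple algorithm is off the table, because the balance gossip it relies on does not (and should not) exist.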
But we don’t do that for a very simple reason:
If we started to gossip balance information we would have to do so with every payment (attempt), and this would, besides the privacy implications, mean that every participant of the network has to be informed about every payment. Does that sound familiar to you? To me it does: it is exactly how the Bitcoin protocol works, with the blockchain as the central ledger that stores all the information forever in an immutable way. The fact that broadcast networks have poor scaling properties and a huge risk of network congestion was the reason we started building 2nd layer solutions in the first place. If you want to learn more about broadcasting in networking, the Wikipedia article is a good starting point.
That being said, it should be clear that the amount of network communication an algorithm requires while searching for a path should, besides the success rate, be another crucial factor in evaluating the quality of a path finding algorithm.
Are the success rate and the amount of network communication sufficient?
While I believe they are probably the two most important metrics, my search for evaluation criteria led me to discussions with countless people (to whom I am very grateful) who suggested additional criteria. Here is my current list:
- Success rate (Obvious)
- Network Communication (Common Sense in Computer Science)
- Ability to preserve privacy (Wish by the users)
- Reliability (suggested by my co-author Olaoluwa Osuntokun)
- Time To Complete (suggested in the Boomerang paper)
- Locked Liquidity x Time (suggested in the Boomerang paper)
- Decentralization (suggested by Ricardo Pérez-Marco)
Having already discussed the first two points to motivate my search for evaluation criteria, I now want to go briefly into each of the others.
The ability to preserve privacy seems to be a strong desire of both users and developers, as many privacy-preserving techniques, such as onion routing, are built into the code base. However, after research confirmed that the balance of payment channels can easily be probed via the onion routing mechanism, I realized that our current trial-and-error approach to searching for paths already leaks quite a lot of information about balances on the Lightning Network. Maybe even too much. In contrast, I have ongoing and still unpublished results which show that even though JIT routing might require nodes to explicitly share balance information with their peers, it reveals on average less information overall than the currently implemented probing based approach. As suggested in another probing paper, JIT routing is actually even a mitigation of channel balance probing attacks. It therefore made sense to me to include this metric. The paper that points out the ability of JIT routing to prevent probing attacks suggests using the information coefficient 1 - (max_balance - min_balance)/capacity to measure how much information leaks, but I am not sure whether there could be a better metric to quantify this property.
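For reference, the information coefficient mentioned above is trivial to compute. A minimal sketch (the parameter names are mine; min_balance and max_balance denote the bounds an observer has established on the true balance):

```python
def information_coefficient(min_balance, max_balance, capacity):
    """Information coefficient 1 - (max_balance - min_balance) / capacity.
    0 means the observer knows nothing beyond the public capacity
    (the balance could be anywhere in [0, capacity]); 1 means the
    balance has been fully revealed (min_balance == max_balance)."""
    return 1 - (max_balance - min_balance) / capacity
```

For example, if an observer has narrowed a 10-unit channel's balance down to the interval [3, 7], the coefficient is 1 - 4/10 = 0.6.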
One of my favourite path finding papers suggests measuring Time To Complete (TTC) and Locked Liquidity x Time. The TTC is quite obvious and easily quantifiable: one just has to measure the time window from starting the path finding process for a payment until the preimage has arrived. (Technically only until all HTLCs are set up, but that is hard to measure from a single node.) Ideally one would look at a histogram of simulated or real payments and hope for a small time to complete. Before I talk about Locked Liquidity x Time, I want to look at another metric.
Olaoluwa Osuntokun, one of my co-authors of Mastering the Lightning Network, suggested that we look at the reliability of a path finding algorithm. The idea behind this metric is that HTLCs that are set up can get stuck on the Lightning Network and delay the algorithm heavily. Also, in the case of JIT routing, the just-in-time rebalancing can introduce delays and random behaviour. On the other hand, if there were a process to make sure that a found path is liquid and online, the reliability of the algorithm would be rather high. As you can see, I still have a hard time describing what exactly reliability means without falling back to time to complete. Maybe reliability could be expressed as the variance of the time to complete, in the sense that we not only want a short time to complete on average but also a very small window around it.
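If reliability is indeed the variance of the time to complete, it could be estimated from a set of measured TTC values roughly like this (a sketch, assuming the samples have already been collected from simulated or real payments):

```python
from statistics import mean, pvariance

def ttc_statistics(ttc_samples):
    """Summarize time-to-complete samples (in seconds). A reliable
    algorithm would show both a low mean TTC and a low variance,
    i.e. a narrow window around the typical completion time."""
    return {"mean": mean(ttc_samples), "variance": pvariance(ttc_samples)}
```

Two algorithms with the same mean TTC could then still be distinguished by their variance: the one with the smaller variance would be the more reliable one under this definition.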
Going back to the Boomerang paper, I want to explain the Locked Liquidity x Time measure. While forwarding HTLCs (but also if we switched to adaptor signatures and PTLCs), funds are locked for setting up the path until the payment fails or settles. These locked funds yield an opportunity cost which is compensated by a routing fee, but in the meantime they limit the overall ability of the network to forward payments, as they cannot be used for other payments. Thus we want to keep low not only the amount of funds being locked but also the time they are locked for. For example, my primitive suggestion from the beginning, sharing balance information over gossip, would allow the sender of a payment to select a path that with very high likelihood should settle, and this should happen rather quickly. In contrast, with atomic multipath payments the preimage for all paths is only released once all paths for the full amount are set up. If we had a straggling partial payment that needs minutes to finally find the last path of a multipath payment, a lot of dead liquidity would be locked up for a long time.
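A minimal sketch of how Locked Liquidity x Time could be computed from a hypothetical log of HTLC lock and release times (the event-tuple format is mine, just for illustration):

```python
def locked_liquidity_time(htlc_events):
    """Sum amount * lock duration over all HTLCs set up during a
    payment attempt. Each event is a (amount, t_locked, t_released)
    tuple. Multi-path payments that hold partial HTLCs while the
    last path is still being searched accumulate a large value here."""
    return sum(amount * (t_released - t_locked)
               for amount, t_locked, t_released in htlc_events)
```

For instance, an HTLC of 100 sats locked for 2 seconds plus a straggling one of 50 sats locked for 10 seconds already contribute 700 sat-seconds.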
Last but not least, Decentralization was suggested by Ricardo Pérez-Marco. I understood this more as a required property of the algorithm. However, my common sense tells me that if we want this property, we should be able to express it as a metric. I would argue that many of the beacon node or landmark node approaches are not as decentralized as the Ant Routing approach. To the best of my understanding, the trampoline routing approach tries to be a compromise here.
While writing this text, it occurs to me that this metric goes hand in hand with the privacy metric, only with the roles reversed. When I wrote about privacy, I was mainly concerned with what a paying node would learn about the network. Decentralization, in contrast, is probably about making sure that certain routing nodes do not learn too much and do not have too much control over what is going on.
While this article is already quite long, I have to add a few more technical bits which I left out for stylistic reasons. Going back to the success rate as the most important and obvious metric, it is not clear how the success rate should be measured. In my first publication I tried to express it as the fraction of payment pairs that can pay a single-satoshi payment. We also had a second suggestion: look at the maximum possible payment amount for each pair of nodes and take the median of this value. Both measures obviously go in the right direction but are also chosen rather arbitrarily. As it is clear that the max flow is the ultimate goal, I come more and more to measuring success rate as the fraction of the max flow that could be found by the algorithm. For example, HushRelay and fully probed MPP would always find the max flow eventually, assuming the network does not change meanwhile, which unfortunately it does. However, it should be easy to construct situations where JIT routing will not be able to find the max flow between a pair of nodes. Similarly, I have not discussed how network communication should be measured: is it the number of messages that are being exchanged or the number of bytes that are being transferred?
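To make the max-flow based success rate concrete, here is a sketch using a textbook Edmonds-Karp max flow on a toy capacity graph. The dict-of-dicts graph representation and the function names are mine, not taken from any Lightning implementation:

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow on a dict-of-dicts capacity graph.
    The max flow between payer and payee is the theoretical upper
    bound an ideal (multi-path) routing algorithm could deliver."""
    flow = 0
    # build a residual graph, adding reverse edges with zero capacity
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parents = {source: None}
        queue = deque([source])
        while queue and sink not in parents:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        if sink not in parents:
            return flow
        # find the bottleneck capacity along the augmenting path
        bottleneck, v = float("inf"), sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        # push the bottleneck flow, updating residual capacities
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

def success_rate(delivered_amount, capacity, source, sink):
    """Fraction of the theoretical max flow that an algorithm delivered."""
    return delivered_amount / max_flow(capacity, source, sink)
```

So an algorithm that delivered 3 units between a pair whose max flow is 4 would score 0.75 on this metric. Of course this ignores that the real network changes while the algorithm runs, as noted above.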
That being said, I conclude by stating that I currently have a list of 7 things that I believe should be considered when evaluating a path finding algorithm on the Lightning Network. For some of them (TTC, Locked Liquidity x Time, Privacy) it seems straightforward to come up with a good measurement. Others, like the success rate or network communication, have several plausible choices. And it still seems unclear how to express metrics like Reliability and Decentralization. In this article I have also ignored the fact that the network is highly dynamic, as it changes its topology with every payment that is attempted or settled. And I have neglected the fact that for rigorous research we need more data sets about:
- How the balance of the network is actually being allocated
- How payments flow through the network over time.
Without this information we will always have to make strong assumptions which we feed into our algorithms. While we might have good reasons for our assumptions, and the results are scientific as long as we consistently use the same assumptions, the results might be of little use in the real world. Of course we could show that results are independent of the assumptions, but as mentioned above, I could construct graphs in which JIT routing would always find the max flow and graphs in which it would perform much more poorly.
I really hope that this starts a useful discussion, and I am happy to see more studies and research being conducted. Even if you just have ideas about how you or I could improve research in this direction, I kindly ask you to reach out to me or the research community.