Uniswap Data Processing at Zelos Research

Zelos · Published in zelos-research · Jan 2, 2024

Goal

Previously, we calculated users' net worth and return rate on Uniswap from the perspective of user addresses. This time, our goal remains the same, but we also include the cash holdings of these addresses in the calculation to obtain a total net worth and return rate.

The pools included in this analysis are:

Workflow

  1. Get Uniswap event data.
  2. Get user cash data: the ETH and USDC held by each address.
  3. Calculate the price series, specifically the price of ETH.
  4. Get the amount of fees collected per minute on each tick.
  5. Get the list of all positions during the statistical period.
  6. Establish the mapping between addresses and positions.
  7. Calculate the return rate for each position.
  8. Calculate the return rate for each address as an LP, based on the correspondence between positions and addresses.
  9. Merge the user's cash and LP accounts, and calculate the overall return rate.

We will discuss the above steps one by one.

Event Data of Uniswap

Previously, we developed demeter-fetch to provide data sources for Demeter. This tool can retrieve logs from Uniswap pools through different approaches and parse them into uniform formats. The supported data sources include:

  1. Ethereum RPC: the standard RPC interface of an Ethereum client. It has lower data retrieval efficiency and requires multiple threads.
  2. Google BigQuery: downloading data from BigQuery datasets. Although it is updated only once a day, it is convenient to use and cost-effective.
  3. TrueBlocks Chifra: the Chifra service can scrape on-chain transactions and reorganize them, making it easy to export transaction and balance information. However, it requires setting up your own node and service.

The output formats include:

  1. Minute: Resampling the transaction information of Uniswap swaps into data for each minute. Used for backtesting.
  2. Tick: Recording each transaction in the pool, including swaps and liquidity operations.

In this workflow, we mainly obtain tick data to gather information about positions, including capital, hourly returns, lifespan, and holders.

These data are obtained from the pool’s event logs, such as mint, burn, collect, and swap. However, the pool’s logs do not include token IDs, which makes it impossible to locate the operations on specific positions in the pool.

In reality, the ownership of Uniswap LPs is managed through NFTs, and the managers of these NFT tokens are proxy contracts. The token IDs only exist in the event logs of the proxy contracts. Therefore, to obtain complete LP position information, we need to retrieve the event logs of the proxy contracts and combine them with the pool’s event logs.

Taking this transaction as an example, we need to pay attention to the two logs with log indexes 227 and 229: the Mint event emitted by the pool contract and the IncreaseLiquidity event emitted by the proxy contract. The amount (liquidity), amount0, and amount1 values in the two logs are identical, which can be used as the basis for association. By linking these two logs, we obtain the tick range, liquidity, token ID, and the amounts of the two tokens corresponding to this LP action.

Professional users, especially some funds, may choose to bypass the proxy and interact directly with the pool contract. In this case, the position will not have a token ID. Instead, we create an ID for the LP position in the format of `address-LowerTick-UpperTick`.

The amount0 and amount1 of matching events may show a small difference; this is rare but does happen. So when matching burn and collect events, we leave some tolerance in our parsing code.
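As a minimal sketch (with hypothetical field names and a tolerance value that is an assumption, not the exact one used in demeter-fetch), the tolerant matching could look like this:

```python
TOLERANCE = 10  # raw token units; the value is an assumption

def logs_match(pool_log: dict, proxy_log: dict) -> bool:
    """Match a pool event with its proxy counterpart in the same transaction."""
    return (
        pool_log["liquidity"] == proxy_log["liquidity"]
        and abs(pool_log["amount0"] - proxy_log["amount0"]) <= TOLERANCE
        and abs(pool_log["amount1"] - proxy_log["amount1"]) <= TOLERANCE
    )
```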

The next issue to address is who initiated the transaction. For withdrawals, we consider the recipient in the collect event to be the holder of the position. As for mint, we can only obtain the sender from the pool's Mint event (see the diagram with the mint event).

If the user interacts with the pool contract directly, the sender will be the LP provider. However, if a regular user interacts through a proxy, the sender will be the address of the proxy, because the funds are indeed transferred from the proxy to the pool. Fortunately, the proxy will generate an NFT token, which is transferred to the LP provider. Therefore, by detecting the Transfer event of the proxy contract (which is also the NFT token contract), we can find the LP provider corresponding to the mint.
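As a sketch (with an assumed, simplified log schema), recovering the LP provider from the NFT Transfer in the same transaction could look like this:

```python
ZERO_ADDRESS = "0x0000000000000000000000000000000000000000"

def lp_provider_for_mint(tx_logs: list[dict], token_id: int) -> str | None:
    """Find the address that received the newly minted position NFT."""
    for log in tx_logs:
        if (
            log["event"] == "Transfer"
            and log["token_id"] == token_id
            and log["from"] == ZERO_ADDRESS  # minted NFTs are transferred from 0x0
        ):
            return log["to"]
    return None
```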

Additionally, if the NFT token is transferred later, the holder of the position changes. Since such transfers are relatively uncommon, we simplified the analysis by not considering NFT transfers after minting.

Cash

The goal of this stage is to obtain the number of tokens held by an address at each moment during the statistical period. To achieve this goal, two types of data need to be obtained:

  1. The balance of the address at the starting moment.
  2. The transfer records of the address during the statistical period.

For the balance at the starting moment, it can be queried through an RPC interface. When using an archive node, the balance at any specific time can be obtained by setting the height parameter in the query. This method can be used for both native tokens and ERC20 tokens.
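As a sketch with web3.py (the node URL and the queried address are placeholders), both kinds of balance queries look like this:

```python
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://your-archive-node"))  # placeholder URL
address = Web3.to_checksum_address("0x0000000000000000000000000000000000000001")
block = 16_308_190  # hypothetical block height near the period start

# Native ETH balance at a historical height (requires an archive node)
eth_balance = w3.eth.get_balance(address, block_identifier=block)

# ERC20 balance at the same height, via a minimal balanceOf ABI
ERC20_ABI = [{
    "name": "balanceOf", "type": "function", "stateMutability": "view",
    "inputs": [{"name": "owner", "type": "address"}],
    "outputs": [{"name": "", "type": "uint256"}],
}]
usdc = w3.eth.contract(
    address=Web3.to_checksum_address("0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48"),
    abi=ERC20_ABI,
)
usdc_balance = usdc.functions.balanceOf(address).call(block_identifier=block)
```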

Obtaining the transfer records of ERC20 tokens is relatively straightforward and can be obtained through various channels such as BigQuery, RPC, or Chifra.

However, ETH transfer records need to be obtained through transactions and traces. Transactions are relatively easy to handle, but querying and processing traces requires a significant amount of computation. Fortunately, Chifra provides the functionality to export ETH balances: it outputs a record whenever the balance changes, although it only records the quantity change and not the counterparty of the transfer. Nevertheless, it meets our requirements and is the most cost-effective method.

Price

Swap events are emitted by the Uniswap pool whenever a token swap occurs. We can obtain the token price from the sqrtPriceX96 field and the pool's total liquidity at that moment from the liquidity field.
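As a sketch of the conversion (assuming token0 = USDC and token1 = WETH, as in the mainnet USDC-WETH pools):

```python
from decimal import Decimal

Q96 = Decimal(2) ** 96

def eth_price_from_sqrt_price_x96(
    sqrt_price_x96: int, decimals_usdc: int = 6, decimals_weth: int = 18
) -> Decimal:
    """Convert a swap event's sqrtPriceX96 into an ETH price in USDC."""
    # sqrtPriceX96 encodes sqrt(token1/token0) in raw units, scaled by 2^96
    raw_price = (Decimal(sqrt_price_x96) / Q96) ** 2
    # adjust for decimals to get WETH per USDC in human units
    weth_per_usdc = raw_price * Decimal(10) ** (decimals_usdc - decimals_weth)
    # invert to get USDC per WETH, i.e. the ETH price
    return 1 / weth_per_usdc
```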

By resampling the token price, we can obtain a list of prices per minute. Additionally, since the liquidity field of the event also includes the total liquidity of the pool at that time, we can include the total liquidity in the table as well. The final table will look like this:

Fee

Fees are the main source of income for positions. Each time a user performs a swap operation on the pool, the corresponding position receives a certain amount of fees, and the amount of income depends on the proportion of liquidity, the pool’s fee rate, and the tick range.

To calculate users' fee income, we record, minute by minute, the swap volume that occurs on each tick of the pool. From this, we can calculate the fee income on each tick for the current minute:
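In essence, the fee earned on a tick during a minute is the swap volume on that tick multiplied by the pool's fee rate. A minimal pandas sketch (with an assumed swap schema, and simply assigning each swap to the tick reported in its event rather than splitting it across crossed ticks):

```python
import pandas as pd

def fees_per_tick_per_minute(swaps: pd.DataFrame, fee_rate: float) -> pd.DataFrame:
    """swaps: assumed columns block_timestamp, tick, amount0, amount1,
    with amounts from the pool's perspective (positive = paid into the pool)."""
    df = swaps.copy()
    df["minute"] = df["block_timestamp"].dt.floor("min")
    # v3 charges the fee on the input side, i.e. the positive amount
    df["fee0"] = df["amount0"].clip(lower=0) * fee_rate
    df["fee1"] = df["amount1"].clip(lower=0) * fee_rate
    return df.groupby(["minute", "tick"])[["fee0", "fee1"]].sum()
```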

This gives us the following table:

Some engineering optimizations were also applied here; you can find the practices and details in this post.

Position List

To obtain a list of positions, we first need to specify the identifier for each position.
- For LPs who invest through a Proxy, each position will have an NFT (Non-Fungible Token), which serves as the position’s ID.
- For LPs who directly interact with the pool, we will create an ID for them in the format of `address-LowerTick-UpperTick`. This way, every position has its unique identifier.

Using this identifier, we can consolidate all the operations of an LP and create a list that describes the full lifecycle of the position. For example:

However, it is important to note that the scope of this analysis is limited to the year 2023, rather than the entire lifespan of the pool. Therefore, for some positions, we may not have data on their operations before January 1, 2023. To address this, we make an estimation of the liquidity for these positions at the start of the analysis period.

Here is the approach we take:

  1. We net the liquidity across the position's mint and burn operations, counting burns as positive and mints as negative, to obtain a value L.
  2. If L > 0, more liquidity was burned than minted, indicating that there was liquidity before the start of the analysis period (2023-01-01 00:00:00). We compensate for this by adding a mint operation of size L at the start of the analysis period.
  3. If L < 0, more liquidity was minted than burned, indicating that there was still liquidity at the end of the analysis period; we assume that the position simply continued to hold that liquidity.
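A minimal pandas sketch of this compensation (with hypothetical column names, burns counted as positive):

```python
import pandas as pd

PERIOD_START = pd.Timestamp("2023-01-01 00:00:00")

def compensate_position(ops: pd.DataFrame) -> pd.DataFrame:
    """ops: one position's rows with columns block_timestamp, tx_type, liquidity."""
    signed = ops["liquidity"].where(ops["tx_type"] == "BURN", -ops["liquidity"])
    L = signed.sum()
    if L > 0:  # more burned than minted: liquidity existed before the period
        patch = pd.DataFrame(
            [{"block_timestamp": PERIOD_START, "tx_type": "MINT", "liquidity": L}]
        )
        ops = pd.concat([patch, ops], ignore_index=True)
    # if L < 0, the position simply still holds liquidity at the period end
    return ops
```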

This approach allows us to avoid downloading data before 2023, thus saving costs. However, it introduces the issue of sunken positions: if an LP performed no operations at all during the year, we cannot identify that LP. This issue is not significant, though. Given the one-year analysis period, we assume that LPs would typically adjust their positions during this time due to significant price fluctuations and other factors. LPs who make no adjustments are considered inactive and are not included in the analysis.

Another more complex scenario is when a position minted liquidity before 2023, performed mint/burn operations during the analysis period, and still held some liquidity at the end of the period. In this case, we can only capture a portion of the liquidity in our analysis. This sunken liquidity can affect the estimation of fees and result in abnormal yield calculations. We will discuss this issue in more detail later.

In the final analysis, there are a total of 73,278 positions on Polygon and 21,210 positions on Ethereum. The number of positions with abnormal yields does not exceed 10 on each chain, which confirms the reliability of our assumptions.

Address-position mapping

Since our goal is to calculate the earnings of addresses, we also need to establish the relationship between addresses and positions. Through this association, we can determine the specific investment behavior of users.

In step 1, we did some work to associate funding operations (mint/collect) with users. Therefore, by finding the sender of the mint operation and the recipient of the collect operation, we can establish the correspondence between positions and addresses.

Position return

In this step, we need to calculate the net value of each position and then calculate the yield based on the net value.

Net Value

The net value of a position consists of two parts. One is the LP’s liquidity, which represents the principal of market-making. After the LP invests funds into the position, the quantity of liquidity does not change, but the net value fluctuates with the price. The other part is the fee income, which is independent of liquidity and is stored separately in the fee0 and fee1 fields. The net value of fees increases over time.

Therefore, at any given minute, combining liquidity with the price of that minute will give us the net value of the principal. To calculate the fees, we need to use the fee table calculated in the fourth step.

First, we divide the liquidity of the position by the total liquidity of the pool at that time to determine the share ratio. Then, we sum up all the fees for the ticks included in the tick range of this position to obtain the fee income for that minute.

Finally, by adding the fees of fee0 and fee1, we obtain the net value of the fees. Then, we add it to the net value of the liquidity to get the total net value.
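A sketch of this per-minute fee attribution (the frame schemas are assumptions: tick_fees is the per-minute, per-tick fee table from the Fee step, and pool_liquidity the total pool liquidity per minute):

```python
from dataclasses import dataclass
import pandas as pd

@dataclass
class Position:
    liquidity: int
    lower_tick: int
    upper_tick: int

def fee_income_for_minute(
    pos: Position,
    minute: pd.Timestamp,
    tick_fees: pd.DataFrame,    # MultiIndex (minute, tick) -> fee0, fee1
    pool_liquidity: pd.Series,  # minute -> total pool liquidity
) -> tuple:
    # the position's share of the pool's liquidity in this minute
    share = pos.liquidity / pool_liquidity.loc[minute]
    ticks = tick_fees.loc[minute]
    # sum the fees over the ticks covered by the position's range
    in_range = ticks[(ticks.index >= pos.lower_tick) & (ticks.index < pos.upper_tick)]
    return share * in_range["fee0"].sum(), share * in_range["fee1"].sum()
```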

When calculating the net value, we divide the lifecycle of a position based on the mint/burn/collect transactions.

1. When a mint transaction occurs, the liquidity increases.
2. When a burn transaction occurs, the liquidity decreases. The value of the liquidity is converted to the fee fields (this is also how it is done in the pool contract code).
3. When a collect transaction occurs, it triggers the calculation. The calculation range is from the last collect transaction to the current time. We calculate the net value and fee income for each minute and obtain a list.

Finally, we consolidate the net value lists obtained from each “collect” transaction. We then apply resampling to obtain the final results.

To improve accuracy, we have made two optimizations.

Firstly, for the hours with transactions (mint/burn/collect), we perform minute-level statistics, while for the hours without transactions, we perform hour-level statistics. Finally, we resample the results to the hour level.

Secondly, in the collect event, we can obtain the sum of liquidity and fees. Therefore, we can compare the actual collected value with our theoretical calculated value to obtain the difference between theoretical and actual fees (this difference also includes the difference in LP principal, but the error in LP principal difference is very small and can be considered as 0). We will compensate for the fee difference on each line to improve the accuracy of fee estimation (as shown in the table above with the fields fee_modify0 and fee_modify1).

Return rate of V3 LPs

Usually, the return rate is calculated by dividing the final net value by the initial net value.

However, this approach is not suitable here for the following reasons:

1. The return rate needs to be calculated at a minute-level granularity.
2. Due to the transfer of funds in and out of positions, simply dividing the initial and final net values does not reflect the actual returns.

For problem 1, we can calculate a return rate for each minute by dividing that minute's net value by the previous minute's, and then multiply the per-minute return rates together to obtain the overall return rate.
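In pandas terms, this composition is a cumulative product (a tiny sketch with made-up rates):

```python
import pandas as pd

# hypothetical per-minute return rates (net value ratio to the previous minute)
rate = pd.Series([1.0001, 0.9998, 1.0002])

growth_curve = rate.cumprod()   # running value of 1 unit invested
total_return = rate.prod() - 1  # overall return over the period
```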

But this algorithm has a serious problem: a calculation error in any single minute's return rate causes a large deviation in the total return rate. This makes the statistical process like walking a tightrope, with no room for error. The upside is that any statistical error will be exposed.

For problem 2, if funds are transferred in or out within a minute, dividing the net values directly still produces absurd return rates. Therefore, the minute return rate algorithm needs to be refined.

Our first attempt was to break down the changes in net value in detail and exclude the changes in funds. We divided the changes in net value into several parts: 1) the principal change due to price, 2) the accumulated fees for this minute, and 3) the inflow and outflow of funds. The third part needs to be excluded from the statistics. For this, we have developed the following calculation method:

  1. Specify the current minute as n and the previous minute as n-1.
  2. Assume that all transfer operations in the current minute occur at n:0.000 seconds. Therefore, for the remaining time, the net value of LP remains unchanged, meaning that the net value at n:0.001 seconds is equal to the net value at n:59.999 seconds.
  3. The accumulation of fees occurs at the end of this minute, which is at n:59.999 seconds.
  4. The price and fees at the end of the previous minute (n-1:59.999) are considered as the starting price and fees for this minute (n:0.000).

Based on the assumptions above, the return rate for each minute is calculated by dividing the end liquidity/price/fees by the starting liquidity/price/fees, using the formula shown below, where f represents the algorithm for converting liquidity to net value.
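A sketch of that formula, following the assumptions above (with $L_n$ the liquidity after the minute's transfers, $p_n$ the price at the end of minute $n$, and $\text{fee}_n$ the fees accrued during minute $n$):

$$r_n = \frac{f(L_n,\ p_n) + \text{fee}_n}{f(L_n,\ p_{n-1})}$$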

This approach looks great. It perfectly avoids the impact of liquidity changes and reflects the influence of price and fees on net value, which is exactly what we expect. However, in practice, there can be very high returns in certain rows. After investigation, we found that the issue occurs when withdrawing liquidity.

Let’s recall our rules: each row represents the end of a minute/hour. This provides a consistent scale for data analysis. However, it’s important to note that the meaning of each column is different:

  • For the net value column, it represents the instantaneous value, i.e., the value at the end of the current minute/hour.
  • The fees column represents the cumulative value, i.e., the fees accumulated during the current minute/hour.

Therefore, for the hour when liquidity is burned:

  • When the LP is burned and tokens are transferred out, the net value at the end of this hour will be 0.
  • However, for the fees, since they are cumulative, they will be greater than 0 at the end of this hour.

This makes the above formula simplify to:
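In sketch form, with all liquidity burned at the start of the minute ($L_n = 0$):

$$r_n = \frac{f(0,\ p_n) + \text{fee}_n}{f(0,\ p_{n-1})} = \frac{\text{fee}_n}{0}$$

so the denominator vanishes while the accrued fees do not, and the computed return blows up.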

This situation not only occurs at the end of the position’s lifecycle but also when burning a portion of liquidity, which affects the ratio of fee increase to LP net value.

To simplify, whenever the position's liquidity changes, we set the return rate of that row to 1. This introduces some error into the return rate calculation. However, for a position that is continuously invested, the hours with transactions are few compared to the entire lifecycle, so the impact is not significant.

Address LP return

With the return rates for each position and the mapping between positions and addresses, we can calculate the return rates for the addresses in each position.

The algorithm here is relatively simple. We concatenate the positions for the address at different periods. During periods when there is no investment, the net value is set to 0 and the return rate is set to 1 (since there is no change in net value, the return rate is 1).

If there are multiple positions during the same period, we add up the net values in the overlapping parts to obtain the total net value. When merging the return rates, we weight them by the net value of each position.
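A sketch of this merge (assumed schema: each position's frame has columns net_value and rate, indexed by hour):

```python
import pandas as pd

def merge_positions(frames: list) -> pd.DataFrame:
    df = pd.concat(frames, keys=range(len(frames)))
    total_value = df.groupby(level=1)["net_value"].sum()
    weighted = (df["net_value"] * df["rate"]).groupby(level=1).sum()
    # net-value-weighted average rate; hours with zero net value get rate 1
    rate = (weighted / total_value).fillna(1.0)
    return pd.DataFrame({"net_value": total_value, "rate": rate})
```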

The last step: the merge

Finally, by combining the cash and LP investments held by the user’s address, we can obtain the final result.

The merging of net values is simpler compared to the previous step of merging positions. We only need to find the time range on the LP net value side, then find the corresponding cash held during that time range, and finally find the price of ETH to obtain the total net value.

For the return rate, we again use the algorithm of calculating a per-minute return rate and multiplying the rates together. Initially, we used the flawed return rate algorithm mentioned in step 7. That algorithm requires separating, for each minute, the fixed part (the cash amounts and the LP liquidity) from the variable part (price changes, accumulated fees, and fund inflows and outflows).

Compared to the position statistics, this is much more complex. For the fund flows of Uniswap itself, we only need to watch the mint and collect events; tracing the cash transactions, however, is very difficult. We need to distinguish whether funds are transferred into an LP position or to an external address. If they go into an LP, the principal can remain unchanged, but if they go to an external address, we must adjust the quantity of the principal. This requires tracking the target addresses of both ERC20 and ETH transfers, which is very cumbersome. First, during mint/collect events, the transfer counterparty can be either the pool or the proxy. Even more complex are the ETH transfers: since ETH is a native token, some transfer records can only be found in trace data, and the amount of trace data is too large and exceeds our processing capacity.

Finally, the straw that broke the camel's back: we realized that the net value in each row is the instantaneous value for that hour, while the fees are the accumulated value for that hour. Physically, they cannot simply be added together. We discovered this problem very late.

Therefore, we abandoned that algorithm and instead adopted the method of dividing each minute's net value by the previous minute's. This method is much simpler, but it has the same weakness: when there are fund inflows and outflows, the return rate can still be unreasonable. From the discussion above, we know that separating the flow of funds is very difficult, so here we sacrifice some accuracy and set the return rate to 1 whenever there is a fund transfer.

The remaining problem is how to identify fund inflows and outflows in the current hour. The initial algorithm we thought of was simple: using the token balances from the previous hour and the current price, we calculate what this hour's net value would be if we simply held those tokens, and subtract this estimate from the actual value. When the difference is non-zero, there are fund inflows or outflows. Expressed as a formula:
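Writing $V_h$ for the actual net value at hour $h$, $c^{0}_{h-1}$ and $c^{1}_{h-1}$ for the USDC and ETH balances at the previous hour, and $p_h$ for the ETH price, the check is roughly:

$$\Delta_h = V_h - \left(c^{0}_{h-1} + c^{1}_{h-1}\, p_h\right), \qquad \Delta_h \neq 0 \Rightarrow \text{funds flowed in or out}$$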

However, this algorithm overlooks the complexity of Uniswap LP positions. In an LP position, the token quantities change with the price, and so does the net value. Additionally, this approach does not consider the variation in fees, resulting in an error of around 0.1% between the estimated and actual values.

To improve accuracy, we can further refine the composition of funds by separately calculating the value change of LP and considering the fees.
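In the same notation, with $f$ again converting liquidity to net value and $\text{fee}_h$ the fees accrued in hour $h$, the refined estimate is approximately:

$$\Delta_h = V_h - \left(c^{0}_{h-1} + c^{1}_{h-1}\, p_h + f(L_{h-1},\ p_h) + \text{fee}_h\right)$$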

Using this approach, the error in the estimated value can be controlled within 0.001%.

We also limit the number of decimal places in the data to avoid dividing by extremely small numbers (usually below 10^-10). These tiny values are accumulated errors from various calculations and resampling steps; if not handled properly, dividing by them directly would amplify the errors and severely distort the returns.

Other interesting issues

Native token

In this analysis, we included the USDC-ETH pool on the Ethereum network, where ETH is the native token and requires special handling.

ETH cannot be used directly in DeFi and must be wrapped as WETH, so this pool is actually a USDC-WETH pool. Users who interact directly with the pool can deposit and withdraw WETH as usual.

However, users adding liquidity through the proxy need to include ETH in the transaction's value and transfer it to the proxy contract. The contract then wraps the ETH into WETH and adds it to the pool.
During the collect operation, USDC can be transferred directly to the user, but ETH cannot. It must be withdrawn from the pool to the proxy, unwrapped back into ETH by the proxy contract, and then sent to the user through an internal transfer.
An example of this process can be seen in [this transaction](https://etherscan.io/tx/0x6178aeef0406f184385f5600b1d49a387a7a987ce461274addbe75ae95a4f40d#eventlog).

Therefore, the USDC-ETH pool differs from regular pools only in how funds are deposited and withdrawn, which affects only the matching of positions to addresses. To address this, we retrieved all NFT transfer data from the creation of the pool and identified the holders of the corresponding positions by token ID.

Missing positions

Some positions did not make it into the final list. These positions have certain special characteristics.

A large portion of them are MEV (Miner Extractable Value) trades, which are purely arbitrage trades and not made by regular investors, so they are not included in our statistics. In practice, MEV trades are also difficult to track and analyze, as doing so requires trace-level data. To filter them out, we use a simple strategy: positions that exist for less than one minute are excluded. Since our data has a precision of one minute, a position that exists for less than one minute cannot be included in the statistics anyway.

Another possibility is that the position has no collect transaction. As seen in step 7, we calculate returns based on collect transactions; without a collect operation, the preceding net values and return rates are never computed. Normally, users harvest the returns or principal from their LP in a timely manner, but some special users choose to keep their assets in the Uniswap pool (as fee0 and fee1). We treat these users as special cases as well and exclude them from the statistics.

Conclusion

The above is Zelos's data cleaning process; I hope it is helpful to you. The relevant code is currently available at https://github.com/zelos-alpha/uniswap-annual-review-2023. In the future, we will gradually merge the scripted code into the engineered demeter-fetch repo.
