Analyzing Ethereum blockchain using Tinybird

alejandro del amo
8 min readMar 10, 2021

--

Tinybird is a platform that allows users to be able to ingest and analyze huge amounts of data in real-time with incredible performance. Their product enables users the possibility to run queries over millions of rows in seconds.

To achieve this, Tinybird is power by Clickhouse. An incredibly powerful OLAP database created by Yandex for Yandex Metrica.

Tinybird shines when dealing with large datasets, so I thought it would be interesting to use it to analyze the 36 million transactions that happened during January 2021 in the Ethereum blockchain.

How to extract the data

To extract these data, the simplest way is by querying over the Ethereum BigQuery public dataset, which you can find here. This dataset is populated using the ethereum-etl repository. Remember that even if the dataset is public, you will be charged according to how many GB your query needs to read. BigQuery can be very expensive, so be careful.

In my case, I used Kaggle to run the queries against BigQuery and to store the results in CSV. Kaggle has a partnership with BigQuery, users can scan up to 5 TB/month for free. Here is the Kaggle’s notebook I used.
So, let’s upload the CSV to Tinybird, and let’s see what we find.

Interesting insights

Something that everyone complains about the Ethereum network is the congestion the network has and the huge transaction fees. So let’s check it out.

Transactions per day

SELECT 
toDate(block_timestamp) AS day,
count() AS number_transactions,
joinGet('test__ETH_USD_join_by_day', 'price', day) AS eth_usd,
round(avg(gas_used * (gas_price / exp10(18))) * eth_usd, 2) AS `avg_tx_fee_$`,
round(max(gas_used * (gas_price / exp10(18))) * eth_usd, 2) AS `max_tx_fee_$`,
round(min(gas_used * (gas_price / exp10(18))) * eth_usd, 2) AS `min_tx_fee_$`
FROM test__transactions
GROUP BY
day,
eth_usd
ORDER BY day ASC
700.37MB processed, 35.02m x 6 (87.53ms)
Transaction fees during January
Transaction fees during January group by day

It only takes 88ms to return the results. Incredible.
But more incredible, how is it possible to have transactions paying $10k in fees and, at the same, to have transactions with zero fees?

First, we need to understand that for each action we do on the Ethereum network we need to pay a fee in Wei to the miners. 1 Ethereum = 10¹⁸ Wei
So, the fee is calculated by 2 parameters: Gas used and the price of the gas.

gas_used * (gas_price / 10^18) * eth_usd = transaction fee

The gas used is related to how complex a transaction is and the gas price is related to how much are you willing to pay. The more you pay, the faster it will be processed.

So, let’s get the top 10 most expensive transactions in January.

SELECT 
hash,
toDate(block_timestamp) as day,
joinGet('test__ETH_USD_join_by_day', 'price', day) AS eth_usd,
round(gas_used * (gas_price / exp10(18)) * eth_usd, 2) AS `fee_in_$`
FROM test__transactions
ORDER BY `fee_in_$` DESC
LIMIT 10
3.33GB processed, 35.02m x 4 (256.53ms)
Top 10 most expensive transactions
Top 10 most expensive transactions

In this case, it took 256ms. Still impressive, but this is a good example of the importance of the column type we read and especially its size. Here you see the difference between querying the hash column or not.
Adding the hash column has an impact of -184.44% in speed and +375% in hard disk needed to read.

Performance impact of reading the hash column

If we search the transaction in Etherscan, we can see that is quite a complex transaction, and the gas price used by the sender is really high.
322,488 gas * 0.00014842 gas/ETH * $1,111.49/ETH = $53,199 🤯

Information about the transaction 0x0bc0ee8537b0abe1553fb57a0f58ff2e056de734b1be930c3504548905e30ca3

Querying the 30 most used addresses we can see the struggles of working with the hash column.

Using standard SQL, we could do something like:

WITH(
SELECT COUNT(*) FROM test__transactions
) as total
SELECT
address,
COUNT(*) as number_transactions,
round(number_transactions/total*100,2) as percentatge
FROM (
SELECT from_address as address FROM test__transactions
UNION ALL
SELECT to_address as address FROM test__transactions
)
GROUP BY address
ORDER BY number_transactions DESC
LIMIT 30
3.70GB processed, 72.53m x 3 (3.99s)
Top 30 most used addresses using SQL standard

But as we see, it is not very efficient, so how can we improve this?
The hash column is a string column that contains hex addresses, so we could use a hash function like CityHash64 to start working with integers instead.
This will increase our performance when aggregating and ordering.

SELECT cityHash64('0xdac17f958d2ee523a2206206994597c13d831ec7') 
// 2930432376656490000

To improve our performance, even more, we can use TopK.
Another powerful function from Clickhouse that returns an array of the approximately most frequent values in the specified column.

Combining both features, we achieve to go below 1 sec. 🥳

WITH(
SELECT
arrayMap(x ->
cityHash64(x),
arrayConcat(
topK(20)(from_address),
topK(20)(to_address))),
count()
FROM test__transactions
) AS tr
SELECT
address,
count() as number_transactions,
round(
number_transactions/tupleElement(tr,2)*100, 2) as percentatge
FROM (
SELECT from_address as address FROM test__transactions
UNION ALL
SELECT to_address as address FROM test__transactions
)
WHERE has(tupleElement(tr,1), cityHash64(address))
GROUP BY address
ORDER BY number_transactions DESC
LIMIT 30
3.70GB processed, 72.53m x 3 (882.49ms)
Top 30 most frequent addresses using Clickhouse functions

Okay, but what about the transactions with zero fees we saw before?
These transactions are added by the miners or miner pools (groups of miners). Miners can decide what transactions go inside the block and in which order.

Number of transactions added by the miners with zero cost

SELECT 
joinGet(test__blocks_join_by_number, 'miner', block_number) AS miner,
countIf(miner = from_address or miner = to_address) AS txs_to_miners_inside_pool,
countIf(miner != from_address and to_address != miner ) AS external_accounts,
txs_to_miners_inside_pool + external_accounts as total
FROM test__transactions
WHERE gas_price = 0
GROUP BY miner
ORDER BY external_accounts DESC
870.77MB processed, 36.26m x 4 (103.37ms)
Transactions done with 0 fee
Transactions made with zero fees per miner pool

So let’s go one step further and check which are the accounts taking more advantage of this.

Accounts doing transactions with zero cost thanks to the miners

SELECT 
from_address,
joinGet(test__blocks_join_by_number, 'miner', block_number) AS miner,
count() as tx
FROM test__transactions
WHERE
from_address != miner AND
to_address != miner AND
gas_price <= 1000000000 -- 1 Gwei
GROUP BY from_address, miner
ORDER BY tx DESC
4.21GB processed, 36.26m x 3 (244.56ms)
Accounts taking advantage of the free transactions

Checking the second address on Etherscan, you can see that most of the transactions have zero cost, and they follow a weird pattern.
Multiple transactions in the same block and the last transaction is always a transfer to UUPool (a miner pool).

Spotting a MEV bot

Checking closer, we can see that this account belongs to a trading bot using a MEV (Miner Extractable Value) strategy. More info here.
Basically, a MEV bot looks for inefficiencies in the blockchain or tries to front-run/back-run other transactions to make a profit.
Here is more information about what front-running/back-running means.

Here you can see the bot exploiting the spread in price between two exchanges.

  1. The bot uses UniSwap, a decentralized exchange.
  2. Then, it uses SushiSwap, another decentralized exchange.
  3. Finally, pays 0,2 ETH to UUPool for the services.

But what happened? As we can see in the picture below, the bot swap 17.41 WETH for 11.98 SFI in UniSwap.

Then, we went to SushiSwap and exchange back the 11.87 SFI for 17.83 WETH.
Making a profit of 0.40 WETH = 0.40 ETH = $730. 🤯
Once the profit is made, the bot will share half with the miners.

Bot front-running transactions

And here you can find an example of the bot is front-running a transaction.
Below we can see how the bot is doing a “sandwich” to the transaction of the user (0x788…)
Let’s find out what is happening.

Transactions from the block 12004008 https://etherscan.io/txs?block=12004008&p=4

First, the bot listens to the mempool, when all the pending transactions are placed. Then, when it detects a big pending transaction like the swap the user is trying to do, the bot will run the same swap operation, but before him. This is possible because of the “partnership” the bot has with the miners and the ability of the miners to order the transactions at their will.

In this case, the bot swapped 6.48 ETH for 25,018 KEEP (0.000259 ETH/KEEP).

https://etherscan.io/tx/0xb816a5bcdcf13f3b82bcb4ceb3ee5cf880652b7eac0fe14655cfddb830d27576

The user, unaware is being tricked, swapped 23.95 ETH for 91,369 KEEP.
But instead of paying 0.000259 ETH/KEEP, as he would have paid if the bot did not push the price up, he paid 0.000262 ETH/KEEP ~ 1.22% slippage.

https://etherscan.io/tx/0x8b00f8be2c37aafce760cbfa0d9d2e1fa2a197052da736ea7d75f904a2bce6dc

Now the bot just needs to sell back to make a profit. In this case, a profit of 0.008 ETH (1.22%)~$150.

https://etherscan.io/tx/0x7adba85352e73679dd654a7655dddf47ec28408e22dfb09a4aac804b148393dd

Finally, the profit is shared with UUPool, the miner pool.

https://etherscan.io/tx/0x312fdd30e03f790988bb59786ffd4c9819b93e89dee7210da1dc628d943feac2

Conclusions

MEV bots are a serious threat to everyone, anyone can be a target of it.
Since January 2021, $300M has been extracted according to this explorer. The ability of the miners to order the transactions in the block is such a powerful tool. Solutions are being created to fight this, like the Taichi Network from SparkPool.

I would like to thank Tinybird for the free tier they offer. Their platform is truly amazing, it such an easy and powerful tool. If you deal with huge amounts of data, you should definitely check it out.

If you want to look into the queries and the results, here you have a public link in Tinybird. Also, here you have my repository in case you want to recreate the same environment in Tinybird.

Also, I would like to praise D5 for all the work they did on the ethereum-etl repository. Thanks to them, we can analyze the blockchain more easily.

For anyone interested in learning more about bots and hidden malicious actors inside the Ethereum blockchain, please consider reading this article and also this one.

--

--