Consider integrating contract addresses found on BigQuery #9
```sql
SELECT address
FROM `bigquery-public-data.crypto_ethereum.contracts`
WHERE DATE(block_timestamp) = "2022-11-23"
```

This can be run once a day using a DATE interval in the WHERE clause instead of a specific day, which cuts IO down to only the most recent (yesterday's) partition.
I want to help with this. Where should I start?
We need to understand from @volsa how new contract addresses / URLs can be added to the scraper queue.
It's also possible to reduce the amount of scraping needed by looking at the contract creation bytes, because identical bytes will always produce identical contracts regardless of the address.

```sql
SELECT address
FROM `bigquery-public-data.crypto_ethereum.contracts`
WHERE DATE(block_timestamp) = "2022-11-29"
AND bytecode NOT IN
(
  SELECT DISTINCT bytecode
  FROM `bigquery-public-data.crypto_ethereum.contracts`
  WHERE DATE(block_timestamp) < "2022-11-29"
)
```

This query isn't optimal though; what's needed is a hash of the bytecode in blockchain-etl, see issue blockchain-etl/ethereum-etl-airflow#480.
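The dedupe-by-bytecode idea can be sketched in Rust. This is only an illustration: `DefaultHasher` stands in for `FARM_FINGERPRINT` (or a real bytecode hash like keccak256), and the function names are hypothetical, not part of Etherface.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Hypothetical helper: fingerprint contract creation bytecode so identical
// contracts deployed at different addresses are scraped only once.
// DefaultHasher is a stand-in for FARM_FINGERPRINT / keccak256.
fn bytecode_fingerprint(bytecode: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytecode.hash(&mut hasher);
    hasher.finish()
}

// Keep only the first address seen for each distinct bytecode fingerprint.
fn dedupe_by_bytecode<'a>(contracts: &[(&'a str, &'a str)]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    contracts
        .iter()
        .copied()
        .filter(|&(_, code)| seen.insert(bytecode_fingerprint(code)))
        .map(|(addr, _)| addr)
        .collect()
}
```

In production the fingerprint would be computed once per contract (ideally upstream in blockchain-etl, as the linked issue suggests) and stored, rather than recomputed on every pass.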
Interesting. I'm doing something similar where I use:
instead of farm_fingerprint.
If you look at the Anyways, in theory this issue can be fixed by modifying the
Feel free to work on this if you want, but just as a heads-up I was planning on re-writing Etherface once I find some time (probably in a few months), where I'll probably also address this issue.
Also consider changing

```rust
let fetchers: Vec<Box<dyn Fetcher + Sync + Send>> = vec![Box::new(EtherscanFetcher)];
let scrapers: Vec<Box<dyn Scraper + Sync + Send>> = vec![Box::new(EtherscanScraper)];
```

to only start the Etherscan fetcher and scraper; it makes development much easier because you'll get less log clutter.
Regarding the NULL fields referred to in (1), would it be acceptable in the short term to define a NULL token?
Regarding (3), here's where we need to detect the 404 as an alternate block:
Regarding (2), and also in relation to:
It looks like the Etherscan API only supports a single version of the source [1]. But GitHub does, and I'm not sure if we would need to track the file version.
Yes, that shouldn't be a problem; in fact, if you execute the following query you'll get a few records where the metadata is missing, because these addresses were taken from https://github.com/aphd/smart-corpus-api (https://aphd.github.io/smart-corpus/):

```sql
SELECT * FROM etherscan_contract WHERE compiler LIKE 'n/a';
```

I totally forgot about that, but I'd prefer if these fields are nullable in the rewrite.
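As a minimal sketch of what nullable metadata could look like in a Rust model, `Option<String>` maps naturally to SQL NULL and avoids sentinel values like `'n/a'`. The struct and field names here are hypothetical, not the actual Etherface schema.

```rust
// Sketch of a record where Etherscan metadata may be absent; Option<String>
// plays the role of a nullable column, so no 'n/a' sentinel is needed.
#[derive(Debug)]
struct EtherscanContract {
    address: String,
    compiler: Option<String>,         // None when metadata is unavailable
    compiler_version: Option<String>, // None when metadata is unavailable
}
```

With this shape, the `'n/a'` records mentioned above would simply carry `None` in the metadata fields.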
Would this table have any mappings to other tables?
You'll also need to change the following function https://github.com/volsa/etherface/blob/master/etherface-lib/src/api/mod.rs#L196 to detect 404 errors, because for whatever reason the Etherscan devs decided to always return a 200 HTTP response regardless of whether the request was actually successful. Instead they wrap the actual errors in the JSON response body. @kalanyuz feel free to ping me with any questions if you decide to work on this :)
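Since the errors live in the JSON body rather than the HTTP status, the check has to inspect the response text. A hedged sketch, assuming the documented Etherscan convention of `"status":"0"` for failed requests; a real implementation would parse the body with serde_json instead of this minimal string check:

```rust
// Etherscan error bodies look roughly like:
//   {"status":"0","message":"NOTOK","result":"Max rate limit reached"}
// even though the HTTP status is 200. This minimal check only looks for
// the "status":"0" marker; serde_json parsing would be more robust.
fn is_etherscan_error(body: &str) -> bool {
    let compact: String = body.chars().filter(|c| !c.is_whitespace()).collect();
    compact.contains("\"status\":\"0\"")
}
```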
In the spirit of etherface as an operational DB, I'd make the source code linked to
Otherwise, could introduce a new mapping table of
Definitely the hashed bytecode needs to be done at some point because of the long-tail nature of contracts, and a design decision needs to be made: how much of this work to do in etherface and how much to push to BigQuery?
I'd be down to also store contract code; for Etherscan this should be quite simple because of its immutable nature, for GitHub this will be much trickier though.
Since we would store the source code in the database anyway, there is nothing stopping us from also storing its hashed bytecode; out of curiosity, how would we off-load this to BigQuery though? My understanding was that we restore database dumps into BigQuery; would we then just calculate these hashes in the restoration process?

Anyway, I'll create a tracking issue for the Etherface rewrite in the coming days, where I'll summarize all the features mentioned in the last few days, among others.
You're right, there's nothing preventing etherface from calculating and storing hashed bytecode as well as sources, whether linked directly to addresses or via the hashed bytecode as an M2M key. I guess where our thinking may differ is that I'm mainly aimed at making more analytical data available in BQ to supplement the
Having the data in BQ is for the purposes of building dashboards and calculating metrics (e.g. plotting the frequency of signatures in traces vs. the frequency of signatures in contracts, or identifying which signatures are most common but have no known text / source), etc.
Some of what would be useful to import to BQ from Etherface you've already written; this is why I started to engage with the project, thank you! I imagine that there are other things you envision for Etherface that aren't necessarily useful for the
Now, regarding source code, my guess is that we'll want to integrate, either using BQ as the DW or Pg, an Airflow or Dataflow job that does post-processing of the sources to extract more useful info. In order to do this, we'd need to:
One specific case I have in mind is auto-generating
It's probably easier to use BQ as the DW for doing this, as there is already strong support for integrating with other GCP services. One point in favor of Pg would be its lower latency as an operational DW, but the rate of contract creation is relatively low, and the latency to get into BQ can be high; a daily job is good enough. So it's not really hitting major requirements, at least from this
The current implementation to find source code on Etherscan consists of two steps:
While this works, the biggest drawback is that we don't have a list of all contract addresses with their associated source code currently hosted on Etherscan; instead, we only have the latest contract addresses, starting from the initial deployment of Etherface.
Google BigQuery hosts the following dataset, which offers (all?) contract addresses on the Ethereum network. In theory we can populate these addresses into the Etherface database (more specifically into the `etherscan_contract` table), which will then simply execute step 2 if and only if the address has source code hosted on Etherscan. This feature could (and probably will) dramatically increase the number of signatures found on Etherscan as well as the number of unique signatures.
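The populate step can be sketched as a simple set difference: BigQuery-sourced addresses already present in the database are skipped, and only new ones are queued for scraping. This is an illustration only; `new_addresses` is a hypothetical name, and the `HashSet` stands in for a lookup against the `etherscan_contract` table.

```rust
use std::collections::HashSet;

// Hypothetical merge step: given addresses exported from BigQuery and the
// set of addresses already in etherscan_contract, return only the new
// addresses that should enter the scraper queue.
fn new_addresses(bigquery: &[&str], known: &HashSet<&str>) -> Vec<String> {
    bigquery
        .iter()
        .filter(|addr| !known.contains(*addr))
        .map(|addr| addr.to_string())
        .collect()
}
```

In the real integration the "known" side would be a database query rather than an in-memory set, and each new address would only trigger a source-code fetch if Etherscan actually hosts source for it.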