I have recently reorganized and re-tested the code; it should reproduce the results reported in the paper. (2023/08/27)

This is the repository for the code and datasets used in the paper "BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection", accepted by the ACM Web Conference (WWW) 2023.
Here you can find our slides.
- Python >= 3.6
- TensorFlow >= 1.4.0
I use Python 3.9, TensorFlow 2.9.2 with CUDA 11.2, and NumPy 1.19.5.
Transaction Dataset:
The master branch hosts the basic BERT4ETH model. If you wish to run the basic BERT4ETH model, there is no need to download the ERC-20 log dataset. Advanced features such as in/out separation and the ERC-20 log can be found in the old branch.
```bash
cd BERT4ETH/Data  # labels are already included
unzip ...
```
```bash
cd Model
python gen_seq.py --bizdate=bert4eth_exp
```
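For intuition, the job of `gen_seq.py` is to turn raw transaction records into a chronologically ordered transaction sequence per address. A minimal sketch of that idea (the field names and data layout here are illustrative, not the repo's actual schema):

```python
from collections import defaultdict

# Illustrative records: (from_address, to_address, timestamp, amount).
transactions = [
    ("0xaaa", "0xbbb", 1648000000, 1.5),
    ("0xbbb", "0xccc", 1648000100, 0.3),
    ("0xaaa", "0xccc", 1648000200, 2.0),
]

# Each transaction appears in the sequences of both endpoint accounts.
sequences = defaultdict(list)
for frm, to, ts, amount in transactions:
    sequences[frm].append((to, ts, amount))
    sequences[to].append((frm, ts, amount))

# Order each account's sequence by timestamp.
for addr, seq in sequences.items():
    seq.sort(key=lambda tx: tx[1])
    print(addr, seq)
```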
```bash
python gen_pretrain_data.py --bizdate=bert4eth_exp \
    --max_seq_length=100 \
    --dupe_factor=10 \
    --masked_lm_prob=0.8
```
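For intuition: masked-LM data generation replaces each address in a sequence with a mask token with probability `masked_lm_prob`, and repeats the process `dupe_factor` times so that each sequence yields several differently masked training instances. A simplified sketch of this idea, not the repo's exact logic:

```python
import random

MASK = "[MASK]"

def create_masked_instances(seq, masked_lm_prob=0.8, dupe_factor=10, seed=42):
    """Yield (masked_seq, labels) pairs; each duplicate is masked differently."""
    rng = random.Random(seed)
    instances = []
    for _ in range(dupe_factor):
        masked, labels = [], []
        for addr in seq:
            if rng.random() < masked_lm_prob:
                masked.append(MASK)
                labels.append(addr)   # the model must recover this address
            else:
                masked.append(addr)
                labels.append(None)   # not a prediction target
        instances.append((masked, labels))
    return instances

for masked, labels in create_masked_instances(["0xaaa", "0xbbb", "0xccc"], dupe_factor=2):
    print(masked, labels)
```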
```bash
python run_pretrain.py --bizdate=bert4eth_exp \
    --max_seq_length=100 \
    --epoch=5 \
    --batch_size=256 \
    --learning_rate=1e-4 \
    --num_train_steps=1000000 \
    --save_checkpoints_steps=8000 \
    --neg_strategy=zip \
    --neg_sample_num=5000 \
    --neg_share=True \
    --checkpointDir=bert4eth_exp
```
Parameter | Description |
---|---|
bizdate | The signature identifying this experiment run. |
max_seq_length | The maximum sequence length for BERT4ETH. |
masked_lm_prob | The probability of masking an address. |
epoch | Number of training epochs, default = 5. |
batch_size | Batch size, default = 256. |
learning_rate | Learning rate for the Adam optimizer, default = 1e-4. |
num_train_steps | The maximum number of training steps, default = 1000000. |
save_checkpoints_steps | How often (in steps) a checkpoint is saved, default = 8000. |
neg_strategy | Strategy for negative sampling, default = zip; options: uniform, zip, freq (see the sketch below). |
neg_share | Whether to enable the in-batch sharing strategy, default = True. |
neg_sample_num | The number of negative samples per batch, default = 5000. |
checkpointDir | Directory in which to save checkpoints. |
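For intuition, here is a simplified sketch of shared negative sampling, assuming zip means a Zipf-like (1/rank) distribution over the address vocabulary; with neg_share=True, a single pool of neg_sample_num negatives is shared by every sequence in the batch. This is an illustration, not the repo's exact implementation:

```python
import numpy as np

def sample_shared_negatives(vocab_size, neg_sample_num=5000, strategy="zip", seed=0):
    """Draw one pool of negative address ids shared across the whole batch."""
    rng = np.random.default_rng(seed)
    if strategy == "uniform":
        probs = np.full(vocab_size, 1.0 / vocab_size)
    elif strategy == "zip":
        # Zipf-like: probability of rank r proportional to 1 / r,
        # so frequent (low-rank) addresses are sampled more often.
        # (A "freq" strategy would instead use empirical address counts.)
        ranks = np.arange(1, vocab_size + 1)
        probs = 1.0 / ranks
        probs /= probs.sum()
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    return rng.choice(vocab_size, size=neg_sample_num, replace=True, p=probs)

negatives = sample_shared_negatives(vocab_size=100000)
print(negatives[:10])
```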
```bash
python output_embed.py --bizdate=bert4eth_exp \
    --init_checkpoint=bert4eth_exp/model_104000 \
    --max_seq_length=100 \
    --neg_sample_num=5000 \
    --neg_strategy=zip \
    --neg_share=True
```
I have generated a version of the embedding file; you can unzip it under "Model/inter_data/" and test the results.
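To sanity-check the exported embeddings, you can load them with NumPy. The file names below are illustrative; check output_embed.py for the actual files it writes under "Model/inter_data/":

```python
import numpy as np

# Illustrative paths; the exact output format depends on output_embed.py.
embeddings = np.load("Model/inter_data/embedding.npy")              # (num_addresses, hidden_size)
addresses = np.load("Model/inter_data/address.npy", allow_pickle=True)

print(embeddings.shape, addresses[:5])
```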
```bash
python run_phishing_detection.py --init_checkpoint=bert4eth_exp/model_104000      # Random Forest (RF)
python run_phishing_detection_dnn.py --init_checkpoint=bert4eth_exp/model_104000  # DNN, better than RF
```
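Both detectors are standard supervised classifiers trained on the frozen address embeddings. For reference, a minimal Random Forest baseline on top of such embeddings can be built with scikit-learn as follows (the embeddings and labels below are random stand-ins for the real data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-ins: in practice, use the embeddings from output_embed.py and
# labels marking known phishing addresses (1) vs. others (0).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))
labels = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0, stratify=labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```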
```bash
python run_dean_ENS.py --metric=euclidean \
    --init_checkpoint=bert4eth_exp/model_104000
```
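Conceptually, the de-anonymization task links addresses belonging to the same entity by nearest-neighbor search over the embeddings, with --metric selecting the distance function. A simplified sketch (random embeddings stand in for the real ones):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Stand-ins for the two address sets to be linked; in the repo, both
# sides come from the embeddings exported by output_embed.py.
rng = np.random.default_rng(0)
query_embeds = rng.normal(size=(10, 64))
candidate_embeds = rng.normal(size=(500, 64))

dist = cdist(query_embeds, candidate_embeds, metric="euclidean")
ranks = np.argsort(dist, axis=1)   # candidates sorted by distance per query
top1 = ranks[:, 0]                 # best match for each query address
print(top1)
```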
```bash
python gen_finetune_phisher_data.py --bizdate=bert4eth_exp \
    --max_seq_length=100
```

```bash
python run_finetune_phisher.py --init_checkpoint=bert4eth_exp/model_104000 \
    --bizdate=bert4eth_exp \
    --max_seq_length=100 \
    --checkpointDir=tmp
```
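Fine-tuning attaches a binary classification head to the pre-trained encoder restored from --init_checkpoint and trains end-to-end on the phishing labels. Below is a schematic of such a head in TensorFlow 2 Keras; the actual model definition lives in run_finetune_phisher.py and may differ:

```python
import tensorflow as tf

# Schematic only: a binary phishing head on top of a pooled sequence
# representation. The real encoder is the pre-trained BERT4ETH model;
# hidden_size and the pooling choice here are illustrative.
hidden_size = 64

sequence_output = tf.keras.Input(shape=(100, hidden_size))  # encoder output
pooled = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
logit = tf.keras.layers.Dense(1)(pooled)                    # phishing score

head = tf.keras.Model(sequence_output, logit)
head.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
head.summary()
```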
If you find this repository useful, please give us a star and cite our paper : ) Thank you!
```bibtex
@inproceedings{hu2023bert4eth,
  title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
  author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
  booktitle={Proceedings of the ACM Web Conference 2023},
  pages={2189--2197},
  year={2023}
}
```
If you have any questions, you can either open an issue or contact me (sihaohu@gatech.edu), and I will reply as soon as I see the issue or email.