Fake Text Detection Toy Project in ADS5035 (Data-driven Security and Privacy)
This is experiment code on WebText and the gpt2-output-dataset.
Functions in `dataset.py` and `util.py` are forked from gpt2-output-detector.
Model | Train | Top-k 40 | Nucleus | Random |
---|---|---|---|---|
BERT | Top-k | 89.79% | 82.68% | 47.3% |
BERT | Nucleus | 72.22% | 78.84% | 53.9% |
BERT | Random | 43.79% | 64.23% | 80.45% |
RoBERTa | Top-k | 98.35% | 90.84% | 51.17% |
RoBERTa | Nucleus | 69.47% | 88.36% | 58.75% |
RoBERTa | Random | 49.22% | 75.43% | 91.34% |
Before running this code, construct the datasets `data/webtext.{train,dev,test}.jsonl` and `data/xl-1542M-{k40,nucleus}.{train,dev,test}.jsonl` in the gpt2-output-dataset format, as sketched below.
You can run this code as follows:

```bash
python baseline.py \
    --max-epochs=2 \
    --batch-size=32 \
    --max-sequence-length=128 \
    --data-dir='data' \
    --real-dataset='webtext' \
    --fake-dataset='xl-1542M-nucleus' \
    --save-dir='logs' \
    --learning-rate=2e-5 \
    --weight-decay=0 \
    --model-name='bert-base-cased' \
    --wandb=True
```
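For orientation, here is a minimal sketch of the kind of fine-tuning step `baseline.py` performs, written against the HuggingFace `transformers` API with the hyperparameters from the command above. The label convention (1 = real) and the training-loop details are assumptions; the actual logic lives in `baseline.py`.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-cased"   # mirrors --model-name
MAX_LEN = 128                    # mirrors --max-sequence-length

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)

# Toy batch: one real (webtext) and one fake (xl-1542M-nucleus) text.
texts = ["A human-written paragraph ...", "A GPT-2-generated paragraph ..."]
labels = torch.tensor([1, 0])  # assumed convention: 1 = real, 0 = fake

batch = tokenizer(texts, truncation=True, max_length=MAX_LEN,
                  padding=True, return_tensors="pt")
out = model(**batch, labels=labels)  # cross-entropy loss on the [CLS] head
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```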
Extract the probability and rank of each token using 16 worker threads. `--num-train-pairs 50000` means the data is balanced as Real:Fake = 50,000:50,000:
```bash
python prob_extract.py \
    --batch-size=32 \
    --max-sequence-length=128 \
    --seed 10 \
    --num-workers 16 \
    --num-train-pairs 50000
```
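As a rough illustration of what this extraction computes, the sketch below scores one text with GPT-2 and records, for every token, the model's probability of that token and its rank in the predicted distribution (a GLTR-style feature). The model size and exact feature definitions are assumptions; see `prob_extract.py` for the real implementation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits          # (1, seq_len, vocab_size)

# The prediction at position i scores the token at position i + 1.
probs = torch.softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
tok_probs = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
# Rank 1 means the token was the model's top prediction.
ranks = (probs > tok_probs.unsqueeze(1)).sum(dim=1) + 1

for tok, p, r in zip(tokenizer.convert_ids_to_tokens(targets.tolist()),
                     tok_probs.tolist(), ranks.tolist()):
    print(f"{tok!r}: prob={p:.4f}, rank={r}")
```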