Official implementation of YouroQNet, a toyish quantum text classifier implemented with pyVQNet and pyQPanda
This repo contains code for the final problem of OriginQ's 2nd CCF "Pilot Cup" contest (Professional Group - Quantum Machine Learning Track).
Oh yes yes child, we've had a hard time struggling through it.
The final total score is 79.2, ranking unknown; but why the fuck do you owe me the remaining 0.8 points?? 🐱
And the code repo for the qualifying stage is here: 第二届“司南杯”初赛 (the 2nd "Pilot Cup" qualifying round)
⚪ install
- `conda create -n q python==3.8` (pyvqnet requires Python 3.8)
- `conda activate q`
- `pip install -r requirements.txt`
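If the install went through, a quick smoke test like the sketch below should run without errors (a minimal check of my own, not a script from this repo; it only assumes the usual `pyqpanda` simulator API and `pyvqnet.tensor.QTensor`):

```python
# smoke_test.py - quick sanity check of the toolchain (not a script from this repo)
import pyqpanda as pq
from pyvqnet.tensor import QTensor

# build and simulate a 2-qubit Bell state with pyQPanda
qvm = pq.CPUQVM()
qvm.init_qvm()
q = qvm.qAlloc_many(2)
prog = pq.QProg()
prog << pq.H(q[0]) << pq.CNOT(q[0], q[1])
print(qvm.prob_run_dict(prog, q, -1))   # expect ~0.5 probability on '00' and '11'

# and make sure pyVQNet tensors construct fine
print(QTensor([1.0, 2.0, 3.0]))
```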
⚪ for contest problem (👈 Follow this to reproduce our contest results!!)
- `python answer.py` for preprocess & train (⚠ VERY VERY SLOW!!)
- `python check.py` for evaluate
⚪ for a quick peek at YouroQNet components
- `python vis_tokenizer.py` for the adaptive k-gram tokenizer interactive demo
- `python vis_youroqnet.py` for the YouroQNet interactive demo
- `run_quantum_toy.cmd` (👈 run the toy version out of the box before all)
⚪ for full development
- download the full dataset simplifyweibo_4_moods, and unzip `simplifyweibo_4_moods.csv` into the `data` folder (a quick loading sketch follows this list)
- `pip install -r requirements_dev.txt` for extra dependencies
- `pushd repo & init_repos.cmd & popd` for extra git repos
  - fasttext==0.9.2 requires numpy<1.24 (things might have changed)
- `start_shell.cmd` to enter the develop command env
- `start_shell.cmd py` to get an ipy console for quickly referring to `pyvqnet`'s fucking undocumented documentation with `help()`
- `mk_preprocess.cmd` for making clean datasets, stats, plots & vocabs etc. (~7 minutes)
- `python vis_project.py` to see the 3d data projection (you will understand what the fuck this dataset is 👿)
- `run_baseline.cmd` to run the classical models
- `run_quantum.cmd` to run the quantum models
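For a first feel of the raw data before running `mk_preprocess.cmd`, a throwaway peek like this works (a sketch only, assuming the upstream CSV keeps its usual `label`/`review` columns; the repo's real preprocessing lives in the `mk_*.py` scripts):

```python
# peek_raw.py - quick look at the raw dataset before preprocessing (illustrative only)
import pandas as pd

# assumed layout of the upstream CSV: a 'label' column (0~3) and a 'review' text column
df = pd.read_csv('data/simplifyweibo_4_moods.csv')
print(df.shape)
print(df['label'].value_counts())          # class balance across the 4 moods
print(df['review'].str.len().describe())   # raw text length distribution
```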
⚠ Training might sometimes fail due to ill random parameter initialization; when the trainset loss does not tend to decay, or the model quickly overfits, just kill it & retry 😅
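If you prefer not to babysit the run, the manual kill-and-retry above can be automated along these lines (purely illustrative; `train_once` is a hypothetical stand-in, not a function from this repo, and here it just fakes a loss curve so the snippet runs on its own):

```python
# retry_train.py - sketch of automating the "kill & retry on bad init" advice above
import random
from typing import List

def train_once(seed: int) -> List[float]:
    # Hypothetical stand-in: in practice this would reseed, rebuild the model,
    # train it, and return the trainset loss curve. Here we just fake one.
    good_init = random.Random(seed).random() > 0.4
    return [1.4 * (0.85 ** i) if good_init else 1.4 for i in range(20)]

for seed in range(10):
    losses = train_once(seed)
    if losses[-1] < 0.9 * losses[0]:   # trainset loss actually decayed -> keep this run
        print(f'seed {seed}: looks healthy, keeping it')
        break
    print(f'seed {seed}: loss not decaying (likely bad init), retrying')
```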
⚪ core idea & contributions
- adaptive k-gram tokenizer (see mk_vocab.py, interactive demo vis_tokenizer.py; a toy sketch follows below)
- YouroQNet for text clf (see run_quantum.py, interactive demo vis_youroqnet.py)
- theoretical analysis of why & how QNN works (see vis_qc_apriori.py)
ℹ See our PPT YouroQNet.pdf for more conceptual understanding 🎉
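For intuition only, here is a tiny greedy longest-match k-gram tokenizer over a toy vocabulary (an assumption-laden sketch of the general idea; the real adaptive vocabulary construction and tokenization live in `mk_vocab.py` and may well differ):

```python
# toy_kgram.py - greedy longest-match k-gram tokenization (illustrative only)
from typing import List, Set

def tokenize(text: str, vocab: Set[str], max_k: int = 3) -> List[str]:
    """Scan left to right, always taking the longest piece (k <= max_k) found in the vocab."""
    tokens, i = [], 0
    while i < len(text):
        for k in range(min(max_k, len(text) - i), 0, -1):
            piece = text[i:i + k]
            if k == 1 or piece in vocab:   # fall back to a single char for unknown pieces
                tokens.append(piece)
                i += k
                break
    return tokens

vocab = {'不开心', '开心', '今天'}      # toy k-gram vocab (k = 2..3)
print(tokenize('今天不开心', vocab))    # -> ['今天', '不开心']
```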
A subset of simplifyweibo_4_moods is used: 1600 samples for train, 400 samples for test. Class label names: 0 - joy, 1 - angry, 2 - hate, 3 - sad; however, the label names do not correspond very well to the actual semantics of the texts :(
⚠ File naming rule: `train.csv` is the train set, `test.csv` is the valid set, and the generated `valid.csv` might be the real test set for this contest. We use the csv filename to refer to each split in the code.
- data exploration
  - guess the target test set (`valid.txt`)
  - vocab & freq stats
  - pca & cluster
  - data relabel (?)
- data filtering
  - punctuation sanitize
  - stop words removal
  - too short / long sentence
- feature extraction
  - tf-idf (syntactical)
  - fasttext embedding (semantical)
  - adaptive tokenizer
- baseline models
  - sklearn
  - vqnet-classical
- quantum models
  - quantum embedding (a minimal encoding sketch follows this list)
  - model route on different length
  - multi to binary clf
  - contrastive learning
    - learn the difference
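As a rough illustration of what "quantum embedding" means here, the sketch below angle-encodes a few token features into qubit rotations with pyQPanda (a generic angle-encoding example under my own assumptions, not the actual YouroQNet encoding; see run_quantum.py for the real circuit):

```python
# toy_embedding.py - generic angle encoding of token features with pyQPanda (illustrative only)
import math
import pyqpanda as pq

features = [0.1, 0.5, 0.9, 0.3]           # e.g. normalized token frequencies in [0, 1]

qvm = pq.CPUQVM()
qvm.init_qvm()
q = qvm.qAlloc_many(len(features))

prog = pq.QProg()
for qubit, x in zip(q, features):
    prog << pq.RY(qubit, x * math.pi)     # map each feature to a single-qubit rotation angle
for i in range(len(q) - 1):
    prog << pq.CNOT(q[i], q[i + 1])       # light entanglement between neighbouring qubits

print(qvm.prob_run_dict(prog, q, -1))     # measurement distribution of the encoded state
```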
    # materials
    ref/                          # theses for dev
      Question-ML.png             # problem sheet
      YouroQNet.pdf               # solution PPT (YouroQNet)
      init_thesis.cmd             # thesis downloader
    repo/                         # git repos for research
      init_repos.cmd              # git repo cloner
      update_repos.cmd
    data/                         # dataset
      simplifyweibo_4_moods.csv   # raw dataset (manually download)
      train|test.csv              # contest dataset
      *_cleaned.csv
      *_tokenized.txt
      cc.zh.300.bin               # FastText pretrained word embedding (auto downloaded)
    log/                          # outputs
      <analyzer>/                 # aka. vocab
        <feature>/                # sklearn models
        <model>/                  # vqnet/torch models
    tmp/                          # generated intermediate results for debug
    # contest related
    answer.py                     # run script for preprocessing & training
    check.py                      # run script for evaluation
    # preprocessors
    mk_*.py
    mk_preprocess.cmd             # run script for mk_*.py
    # models
    run_baseline_*.py             # classical experiments
    run_baseline.cmd              # run script for run_baseline_*.py
    run_quantum.py                # quantum experiments
    run_quantum.cmd               # run script for run_quantum.py
    run_quantum_toy.cmd           # toy QNN for debug and verify
    # misc
    vis_*.py                      # interactive demos or debug scaffolds
    utils.py                      # common utils
    start_shell.cmd               # develop env entry
    # doc & lic
    README.md
    TECH.md                       # technical & theoretical stuff
    requirements_*.txt
    LICENSE
ℹ For the contest, only these files were submitted: `answer.py`, `mk_vocab.py`, `run_quantum.py`, `utils.py`, `README.md`; they should be enough to run all the quantum parts 😀
- FastText:
- Enriching Word Vectors with Subword Information: https://arxiv.org/abs/1607.04606
- Bag of Tricks for Efficient Text Classification: https://arxiv.org/abs/1607.01759
- repo: https://github.com/facebookresearch/fastText
- QNN for text-clf:
- QNLP-DisCoCat: https://arxiv.org/abs/2102.12846
- QSANN: https://arxiv.org/abs/2205.05625
- OriginQ: https://originqc.com.cn/index.html
- QCNN related:
- tensorflow-quantum impl: https://www.tensorflow.org/quantum/tutorials/qcnn
- pytorch + qiskit impl: https://github.com/YPadawan/qiskit-hackathon
- pytorch + pennylane impl: https://github.com/christorange/QC-CNN
- Tiny-Q: https://github.com/Kahsolt/Tiny-Q
=> find the theses of related work in ref/init_thesis.cmd
=> find the implementations of related work in repo/init_repos.cmd
If you find this work useful, please give a star ⭐ and cite~ 😃
    @misc{kahsolt2023,
      author       = {Kahsolt},
      title        = {YouroQNet: Quantum Text Classification with Context Memory},
      howpublished = {\url{https://github.com/Kahsolt/YouroQNet}},
      month        = {May},
      year         = {2023}
    }
by Armit 2023/05/03