This paper is currently under peer review, and the code may change frequently. We are currently experiencing a severe staff shortage. If you encounter any issues during replication, please feel free to contact us by opening an issue or by email: oceanytech@gmail.com.
This repository contains the official implementation for the paper "From Words to Worth: Newborn Article Impact Prediction with LLM". The tool applies parameter-efficient fine-tuning (PEFT) to LLMs so that they can predict the future impact of newly published articles.
First, clone the repo and run the following commands in the console:
cd ScImpactPredict
pip install -r requirements.txt
To start with the default settings, request access to and download the pretrained LLaMA-3 weights from the official Hugging Face site. Then, download the provided LLaMA-3 LoRA weights (runs_dir) here.
Finally, modify the path to the model's weights in the single_pred.py file, and run python single_pred.py in the console.
For fine-tuning, you may manually modify the 'xxxForSequenceClassification' class in the transformers package, or follow the instructions to trust remote code.
class LlamaForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        ...
        self.post_init()
        # Add codes here!
        self.loss_func = 'mse'
        self.sigmoid = nn.Sigmoid()
    ...
    def forward(...):
        ...
        logits = self.score(hidden_states)
        # Add codes here!
        if not self.loss_func == 'bce':
            logits = self.sigmoid(logits)
        if input_ids is not None:
            batch_size = input_ids.shape[0]
        ...
        # Add codes here!
        if self.config.problem_type == "regression":
            if self.loss_func == 'bce':
                loss_fct = BCEWithLogitsLoss()
            elif self.loss_func == 'mse':
                loss_fct = MSELoss()
            elif self.loss_func == 'l1':
                loss_fct = L1Loss()
            elif self.loss_func == 'smoothl1':
                loss_fct = nn.SmoothL1Loss()
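To make the branches above easier to reason about, here is a minimal, dependency-free sketch of what each loss_func option computes on a small batch. It is an illustration only, not the repository's code: the helper names sigmoid and regression_loss are hypothetical, plain Python lists stand in for tensors, and it assumes the PyTorch defaults of mean reduction and beta=1.0 for the smooth L1 loss.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a raw logit into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def regression_loss(logits, labels, loss_func="mse", beta=1.0):
    """Mean loss over a batch, following the branch structure shown above.

    Hypothetical helper for illustration; `beta` is the smooth L1 threshold.
    """
    if loss_func == "bce":
        # BCE-with-logits takes raw logits, so no sigmoid is applied first.
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        per_item = [
            max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            for z, y in zip(logits, labels)
        ]
    else:
        # All other losses compare sigmoid outputs with the targets.
        preds = [sigmoid(z) for z in logits]
        diffs = [abs(p - y) for p, y in zip(preds, labels)]
        if loss_func == "mse":
            per_item = [d * d for d in diffs]
        elif loss_func == "l1":
            per_item = diffs
        elif loss_func == "smoothl1":
            per_item = [
                0.5 * d * d / beta if d < beta else d - 0.5 * beta
                for d in diffs
            ]
        else:
            raise ValueError(f"unknown loss_func: {loss_func}")
    return sum(per_item) / len(per_item)


# A logit of 0.0 maps to a prediction of 0.5 after the sigmoid.
print(regression_loss([0.0], [1.0], "mse"))
print(regression_loss([0.0], [1.0], "l1"))
print(regression_loss([0.0], [1.0], "bce"))
```

The point of the sketch is that the sigmoid is skipped only for 'bce', because that loss already applies it internally; applying it twice would distort the predictions.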
Then, prepare a train.sh bash script like the one below:
DATA_PATH="ScImpactPredict/NAID/NAID_train_extrainfo.csv"
TEST_DATA_PATH="ScImpactPredict/NAID/NAID_test_extrainfo.csv"
OMP_NUM_THREADS=1 accelerate launch offcial_train.py \
--total_epochs 5 \
--learning_rate 1e-4 \
--data_path $DATA_PATH \
--test_data_path $TEST_DATA_PATH \
--runs_dir ScImpactPredict/official_runs/LLAMA3 \
--checkpoint path_to_huggingface_LLaMA3
Finally, run sh train.sh in the console and wait for training to finish.
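For readers who want to see how the flags in train.sh could be consumed, the following is a rough sketch of an argparse parser that accepts the same flag names. It is not taken from the training script: the function build_parser, the default values, and the required settings are all assumptions, and the actual script may define additional options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in train.sh."""
    parser = argparse.ArgumentParser(description="Fine-tuning CLI sketch")
    # Defaults below simply echo the example values; they are assumptions.
    parser.add_argument("--total_epochs", type=int, default=5)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--data_path", type=str, required=True)
    parser.add_argument("--test_data_path", type=str, required=True)
    parser.add_argument("--runs_dir", type=str, required=True)
    parser.add_argument("--checkpoint", type=str, required=True)
    return parser


# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args([
    "--total_epochs", "5",
    "--learning_rate", "1e-4",
    "--data_path", "ScImpactPredict/NAID/NAID_train_extrainfo.csv",
    "--test_data_path", "ScImpactPredict/NAID/NAID_test_extrainfo.csv",
    "--runs_dir", "ScImpactPredict/official_runs/LLAMA3",
    "--checkpoint", "path_to_huggingface_LLaMA3",
])
print(args.total_epochs, args.learning_rate, args.runs_dir)
```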
For testing, prepare a test.sh script in the same way:
python inference.py \
--data_path ScImpactPredict/NAID/NAID_test_extrainfo.csv \
--weight_dir path_to_runs_dir
Then, run sh test.sh in the console.
We also offer the weights of other models for download.
LLMs | Size | MAE | NDCG | Mem | Download Link |
---|---|---|---|---|---|
Phi-3 | 3.8B | 0.226 | 0.742 | 6.2GB | Download |
Falcon | 7B | 0.231 | 0.740 | 8.9GB | Download |
Qwen-2 | 7B | 0.223 | 0.774 | 12.6GB | Download |
Mistral | 7B | 0.220 | 0.850 | 15.4GB | Download |
Llama-3 | 8B | 0.216 | 0.901 | 9.4GB | Download |
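As a reading aid for the MAE and NDCG columns, here is a small sketch of how these two metrics can be computed from predicted and ground-truth impact scores. The exact NDCG variant behind the table (cutoff, gain function) is not stated here, so this version assumes linear gain, a log2 position discount, and no cutoff; the function names are hypothetical.

```python
import math


def mae(preds, targets):
    """Mean absolute error between predictions and targets (lower is better)."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))


def ndcg(preds, targets):
    """DCG of targets ranked by prediction, normalized by the ideal DCG."""
    # Order items by predicted score, highest first.
    order = sorted(range(len(preds)), key=lambda i: preds[i], reverse=True)
    ranked = [targets[i] for i in order]
    ideal_dcg = dcg(sorted(targets, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


targets = [1.0, 0.0, 0.5]
good_preds = [0.9, 0.2, 0.5]   # same ordering as the targets
bad_preds = [0.1, 0.9, 0.5]    # reversed ordering

print(mae(good_preds, targets))
print(ndcg(good_preds, targets))
print(ndcg(bad_preds, targets))
```

MAE measures how close each predicted score is to its target, while NDCG only rewards ranking the higher-impact articles first, which is why a model can differ noticeably on one metric and not the other.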
With a few adjustments based on your specific needs, it should work fine. Since these models train very quickly (less than a few minutes on a single RTX 3080), we won’t be providing the trained weights.
Folders like furnace, database, and tools are used for building the NAID and TKPD datasets. They have no direct connection to training or inference.