BinSum - Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models

📚 Dataset | 📝 Citation

Important

We are in the process of releasing the dataset, adding more data and implementation details, and improving the documentation. Please stay tuned.

What's New?

  • [Dec. 17, 2023] Our paper is now publicly available on arXiv. We are in the process of releasing the dataset.

Introduction

BinSum is a comprehensive benchmark and dataset of over 557K binary functions, together with a novel method for prompt synthesis and optimization. To gauge LLM performance more accurately, we also propose a new semantic similarity metric that surpasses traditional exact-match approaches. Our extensive evaluation of prominent LLMs, including ChatGPT, GPT-4, Llama 2, and Code Llama, reveals 10 pivotal insights. This evaluation generated 4 billion inference tokens and incurred a total expense of 11,418 US dollars and 873 NVIDIA A100 GPU hours. Our findings highlight both the transformative potential of LLMs in this field and the challenges yet to be overcome.
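The metric itself is defined in the paper; as a rough illustration of the general idea of semantic (rather than exact-match) scoring, the sketch below compares a model-generated summary against a ground-truth summary via sentence-embedding cosine similarity. The embedding model and the example strings are illustrative assumptions, not the paper's configuration.

Python:

# Illustrative sketch only: embedding-based semantic similarity between a
# generated summary and a reference summary. This is not a reimplementation
# of BinSum's metric; see the paper for its exact definition.
from sentence_transformers import SentenceTransformer, util

# Assumption: any general-purpose sentence-embedding model works for the demo.
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(generated: str, reference: str) -> float:
    """Cosine similarity of the two summaries' embeddings (in [-1, 1])."""
    emb = model.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Semantically equivalent summaries score high even with no exact token overlap.
reference = "Computes the CRC32 checksum of the input buffer."
generated = "Calculates a 32-bit cyclic redundancy check over the given data."
print(f"semantic similarity: {semantic_similarity(generated, reference):.3f}")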

Dataset

Coming soon.

Citation

If you find BinSum useful, please consider citing our paper:

BibTeX:

@article{jin2023binary,
  title={Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models},
  author={Xin Jin and Jonathan Larson and Weiwei Yang and Zhiqiang Lin},
  journal={arXiv preprint arXiv:2312.09601},
  year={2023},
}
