Evalutation of Large Language Model

Data

Measuring Massive Multitask Language Understanding

57가지 영역에서 전반적인 LLM 성능을 측정할 수 있는 데이터셋

wikitext

ppl 측정을 위한 데이터셋

Prompt

Few-shot (5)

The following are multiple choice questions (with answers) about  abstract `subject`.

Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
A. 0
B. 1
C. 2
D. 3
Answer: B

Statement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups of G then HK is a subgroup of G.
A. True, True
B. False, False
C. True, False
D. False, True
Answer: B

Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.
A. True, True
B. False, False
C. True, False
D. False, True
Answer: C

Statement 1| Every function from a finite set onto itself must be one to one. Statement 2 | Every subgroup of an abelian group is abelian.
A. True, True
B. False, False
C. True, False
D. False, True
Answer: A

Find the characteristic of the ring 2Z.
A. 0
B. 3
C. 12
D. 30
Answer: A

Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer:

Zero-shot

The following are multiple choice questions (with answers) about  abstract `subject`.

Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer:

Config

필요에 따라 config.json파일을 수정해서 사용.

dev_dir : few-shot 디렉토리
test_dir : 성능 측정 디렉토리
model_name : LLM 모델명
sub_categories : 57가지 subject를 17개의 서브 카테고리로 분류
categories : 크게 STEM, humanities, social sciences, other로 분류
few_shot : few-shot 프롬프트를 사용할지 여부(True/False)
ppl : perplexity 측정할지 여부(True/False)
stride : ppl 측정시 사용되는 stride

Usage & Output

>> python main.py

output

================= Start evaluation =================
Subject                                      Acc  
--------------------------------------------------
abstract_algebra              Text match     0.260
                              Label match    0.290
                              Last logits    0.000
...
==================================================
Sub-group                               Acc  
--------------------------------------------------
math                                    0.291
health                                  0.422
physics                                 0.350
business                                0.513
biology                                 0.452
...
==================================================
Group                                   Acc  
--------------------------------------------------
STEM                                    0.363
humanities                              0.391
social sciences                         0.453
other (business, health, misc.)         0.438
...
==================================================
--------------------------------------------------
Perplexity                              Score
                                       52.011
==================================================

TODO

Last logits 관련 문제 해결
ppl의 경우 llm 마다 max_length가져오는 법이 다름.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
README.md		README.md
config.json		config.json
config.py		config.py
evaluate.py		evaluate.py
main.py		main.py
requirements.txt		requirements.txt
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Evalutation of Large Language Model

Data

Prompt

Few-shot (5)

Zero-shot

Config

Usage & Output

TODO

About

Releases

Packages

Languages

khs0415p/lm-evaluate

Folders and files

Latest commit

History

Repository files navigation

Evalutation of Large Language Model

Data

Prompt

Few-shot (5)

Zero-shot

Config

Usage & Output

TODO

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages