This repository has been archived by the owner on Oct 31, 2022. It is now read-only.

Fix endoftext token in concatenation of token_chunks #62

Open
wants to merge 64 commits into base: finetuning
Commits (64)
5b64684
update README
WuTheFWasThat Feb 18, 2019
6dab221
reorganize and add temp 0.7
WuTheFWasThat Feb 19, 2019
aae26ab
add license
WuTheFWasThat Feb 20, 2019
fc0ee6d
add conditional samples
WuTheFWasThat Feb 20, 2019
825aa3d
separate out tensorflow install
WuTheFWasThat Feb 20, 2019
92ce9f2
shuffle headings
WuTheFWasThat Feb 20, 2019
bf43e73
more warning
WuTheFWasThat Feb 20, 2019
23ed990
instructions mention git clone
WuTheFWasThat Feb 20, 2019
99af6d7
Add a Dockerfile and document usage in README
madisonmay Feb 14, 2019
2cf46d9
fixed unconditional sampling reproducibility issue
Feb 20, 2019
946facf
fixed seed arg to ensure reproducibility in conditional-samples model
Feb 20, 2019
b6f943d
update readme
WuTheFWasThat Feb 20, 2019
a3aa7de
add conditional samples with default settings
WuTheFWasThat Feb 21, 2019
68bf7a0
add .gitattributes file to ensure files copied to docker container ha…
Feb 21, 2019
c5b9c89
Minor: update readme
natemurthy Feb 21, 2019
c314dda
Minor: update readme
natemurthy Feb 27, 2019
ed49f03
Add documentation for help flags (#81)
ArmaanBhullar Feb 27, 2019
9d1e704
slight fix to batch size description
WuTheFWasThat Feb 27, 2019
0465394
updates
WuTheFWasThat Feb 28, 2019
d1fc873
Add finetuning code.
Mar 3, 2019
1fba31f
chmod +x
Mar 3, 2019
dfca3cf
Add finetuning instructions
Mar 3, 2019
9423776
Fix sample generation with batch_size greater than 1.
Mar 3, 2019
8eb6793
Python download script (#89)
webproduktion01 Mar 4, 2019
ed0dedc
update download stuff
WuTheFWasThat Mar 4, 2019
953530f
update readme with usage caveats and calls for research
WuTheFWasThat Mar 6, 2019
79a246a
add contributors md and move dev docs out
WuTheFWasThat Mar 6, 2019
8637828
fix for windows (thanks to chrothenbach)
WuTheFWasThat Mar 7, 2019
3e18729
Add training script with Horovod support
tlkh Mar 18, 2019
ec16bad
Fix typo in train command in README
tlkh Mar 18, 2019
0bad9e4
Added instructions for training using Horovod
tlkh Mar 18, 2019
d14501a
Update CONTRIBUTORS.md
WuTheFWasThat Mar 18, 2019
ef62678
Merge pull request #2 from tlkh/finetuning
nshepperd Mar 19, 2019
c465071
autoformat
Mar 4, 2019
1e32b10
Combine input text files with <|endoftext|> delimiter to ensure there…
Mar 19, 2019
3a3ce65
Write losses to summary file for tensorboard.
Mar 20, 2019
d5b387b
Add learning rate as command line flag.
Mar 20, 2019
b106d0a
Use argparse instead of fire in train.py.
Mar 20, 2019
2044d13
Fix encode.py
Mar 21, 2019
a359a34
Add gradient accumulation with default of 5 minibatches
Mar 21, 2019
8738950
Merge remote-tracking branch 'origin/master' into finetuning
Mar 25, 2019
eda8777
Turn off gradient accumulation by default, it shouldn't be needed.
May 2, 2019
0503b1b
updates for 345M model
WuTheFWasThat May 3, 2019
b5ef71a
reference dataset
WuTheFWasThat May 3, 2019
dd75299
remove samples
WuTheFWasThat May 3, 2019
47df6da
Add gradient checkpointing and another optimization necessary to allo…
May 4, 2019
c46ed99
Add "validation" loss calculation.
May 4, 2019
941a762
Add toposort to requirements
Tenoke May 5, 2019
13c5412
Merge pull request #3 from Tenoke/finetuning
May 6, 2019
3985cc7
Add option to use SGD for optimizer
May 14, 2019
7fc2a44
Record learning rate in tensorboard logs
May 14, 2019
a464925
Add text in README for --optimizer flag
May 14, 2019
ae535b6
Reduce default learning rate of train.py.
May 14, 2019
2d4fd0c
Merge remote-tracking branch 'origin/master' into finetuning
May 14, 2019
6a77a7b
New feature: add noise to network inputs to regularize against overre…
May 15, 2019
87fe3d7
Add top-p sampling
May 15, 2019
e99ee37
Add top_p to interactive_conditional_samples.py and generate_uncondit…
May 15, 2019
2b24145
fix typo in top_p
May 15, 2019
6c1f21d
Fix top_p sampling for batch_size>1
May 15, 2019
cca7144
Updated README.md
biranchi2018 Aug 15, 2019
a070f38
Merge pull request #22 from biranchi2018/biranchi2018-patch-1
Aug 27, 2019
50fa3b6
Add note to install cudnn, re https://github.com/nshepperd/gpt-2/issu…
Jun 16, 2019
b7cda3f
Add flag to set encoding for text reading and writing, defaulting to …
Jul 20, 2019
26e6d2b
fix endoftext token in concatenation of token_chunks
Feb 19, 2020
6 changes: 6 additions & 0 deletions .gitattributes
@@ -0,0 +1,6 @@
# convert to OS line endings on checkout, back to LF on commit
* text=auto

# ensure anything copied to the container has unix style line endings
*.sh text eol=lf
requirements.txt text eol=lf
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,2 +1,5 @@
__pycache__
.mypy_cache/
models/
checkpoint
samples
17 changes: 17 additions & 0 deletions CONTRIBUTORS.md
@@ -0,0 +1,17 @@
# Contributors (alphabetically)

* **[madisonmay](https://github.com/madisonmay)**

Added Dockerfiles

* **[Margaret Mitchell et al](https://arxiv.org/abs/1810.03993)**

Our [usage](./README.md#usage) writeup was loosely inspired by the paper
[Model Cards for Model Reporting](https://arxiv.org/abs/1810.03993)
and related conversations with some of the authors.

* **[webproduktion01](https://github.com/webproduktion01)**

Ported download script to python.

**[Full code contributors list](https://github.com/openai/gpt-2/contributors).**
86 changes: 86 additions & 0 deletions DEVELOPERS.md
@@ -0,0 +1,86 @@
# Installation

Git clone this repository, and `cd` into the directory for the remaining commands:
```
git clone https://github.com/openai/gpt-2.git && cd gpt-2
```

Then, follow instructions for either native or Docker installation.

## Native Installation

All steps can optionally be done in a virtual environment using tools such as `virtualenv` or `conda`.

Install tensorflow 1.12 (with GPU support, if you have a GPU and want everything to run faster)
```
pip3 install tensorflow==1.12.0
```
or
```
pip3 install tensorflow-gpu==1.12.0
```

Install other python packages:
```
pip3 install -r requirements.txt
```

Download the model data
```
python3 download_model.py 117M
python3 download_model.py 345M
```

## Docker Installation

Build the Dockerfile and tag the created image as `gpt-2`:
```
docker build --tag gpt-2 -f Dockerfile.gpu . # or Dockerfile.cpu
```

Start an interactive bash session from the `gpt-2` docker image.

You can opt to use the `--runtime=nvidia` flag if you have access to an NVIDIA GPU
and a valid install of [nvidia-docker 2.0](https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)).
```
docker run --runtime=nvidia -it gpt-2 bash
```

# Running

| WARNING: Samples are unfiltered and may contain offensive content. |
| --- |

Some of the examples below may include Unicode text characters. Set the environment variable:
```
export PYTHONIOENCODING=UTF-8
```
to override the standard stream settings in UTF-8 mode.

## Unconditional sample generation

To generate unconditional samples from the small model:
```
python3 src/generate_unconditional_samples.py | tee /tmp/samples
```
There are various flags for controlling the samples:
```
python3 src/generate_unconditional_samples.py --top_k 40 --temperature 0.7 | tee /tmp/samples
```
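
For intuition about what `--temperature` and `--top_k` do, here is a rough, self-contained NumPy sketch of the logit filtering; it is only an illustration, not the repository's implementation (which lives in `src/sample.py`).
```
# Rough NumPy illustration of temperature and top-k filtering on next-token
# logits; not the repository's implementation (that lives in src/sample.py).
import numpy as np

def adjust_logits(logits, temperature=0.7, top_k=40):
    logits = logits / temperature                          # temperature < 1.0 sharpens the distribution
    if top_k > 0:
        cutoff = np.sort(logits)[-top_k]                   # k-th largest logit
        logits = np.where(logits < cutoff, -1e10, logits)  # mask everything below it
    probs = np.exp(logits - logits.max())                  # softmax
    return probs / probs.sum()

rng = np.random.default_rng(0)
fake_logits = rng.normal(size=50257)                       # 50257 = GPT-2 vocabulary size
probs = adjust_logits(fake_logits)
next_token = rng.choice(len(probs), p=probs)
```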

To check flag descriptions, use:
```
python3 src/generate_unconditional_samples.py -- --help
```

## Conditional sample generation

To give the model custom prompts, you can use:
```
python3 src/interactive_conditional_samples.py --top_k 40
```

To check flag descriptions, use:
```
python3 src/interactive_conditional_samples.py -- --help
```
9 changes: 9 additions & 0 deletions Dockerfile.cpu
@@ -0,0 +1,9 @@
FROM tensorflow/tensorflow:1.12.0-py3

ENV LANG=C.UTF-8
RUN mkdir /gpt-2
WORKDIR /gpt-2
ADD . /gpt-2
RUN pip3 install -r requirements.txt
RUN python3 download_model.py 117M
RUN python3 download_model.py 345M
18 changes: 18 additions & 0 deletions Dockerfile.gpu
@@ -0,0 +1,18 @@
FROM tensorflow/tensorflow:1.12.0-gpu-py3

# nvidia-docker 1.0
LABEL com.nvidia.volumes.needed="nvidia_driver"
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES=all \
NVIDIA_DRIVER_CAPABILITIES=compute,utility \
NVIDIA_REQUIRE_CUDA="cuda>=8.0" \
LANG=C.UTF-8

RUN mkdir /gpt-2
WORKDIR /gpt-2
ADD . /gpt-2
RUN pip3 install -r requirements.txt
RUN python3 download_model.py 117M
RUN python3 download_model.py 345M
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2019 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
99 changes: 79 additions & 20 deletions README.md
@@ -1,48 +1,107 @@

Reference: ["Beginner’s Guide to Retrain GPT-2 (117M) to Generate Custom Text Content"](https://medium.com/@ngwaifoong92/beginners-guide-to-retrain-gpt-2-117m-to-generate-custom-text-content-8bb5363d8b7f)

# gpt-2

Code and samples from the paper ["Language Models are Unsupervised Multitask Learners"](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf).
Code from the paper ["Language Models are Unsupervised Multitask Learners"](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf).

For now, we have only released a smaller (117M parameter) version of GPT-2.
We have currently released small (117M parameter) and medium (345M parameter) versions of GPT-2. While we have not released the larger models, we have [released a dataset](https://github.com/openai/gpt-2-output-dataset) for researchers to study their behaviors.

See more details in our [blog post](https://blog.openai.com/better-language-models/).

## Installation
## Usage

This repository is meant to be a starting point for researchers and engineers to experiment with GPT-2.

### Some caveats

- GPT-2 models' robustness and worst case behaviors are not well-understood. As with any machine-learned model, carefully evaluate GPT-2 for your use case, especially if used without fine-tuning or in safety-critical applications where reliability is important.
- The dataset our GPT-2 models were trained on contains many texts with [biases](https://twitter.com/TomerUllman/status/1101485289720242177) and factual inaccuracies, and thus GPT-2 models are likely to be biased and inaccurate as well.
- To avoid having samples mistaken as human-written, we recommend clearly labeling samples as synthetic before wide dissemination. Our models are often incoherent or inaccurate in subtle ways, which takes more than a quick read for a human to notice.

### Work with us

Please [let us know](mailto:languagequestions@openai.com) if you’re doing interesting research with or working on applications of GPT-2! We’re especially interested in hearing from and potentially working with those who are studying
- Potential malicious use cases and defenses against them (e.g. the detectability of synthetic text)
- The extent of problematic content (e.g. bias) being baked into the models and effective mitigations

## Development

See [DEVELOPERS.md](./DEVELOPERS.md)

## Contributors

See [CONTRIBUTORS.md](./CONTRIBUTORS.md)

## Fine tuning on custom datasets

To retrain the GPT-2 117M model on a custom text dataset:

Download the model data
```
sh download_model.sh 117M
PYTHONPATH=src ./train.py --dataset <file|directory|glob>
```

Install python packages:
If you want to precompute the dataset's encoding for multiple runs, you can instead use:

```
pip3 install -r requirements.txt
PYTHONPATH=src ./encode.py <file|directory|glob> /path/to/encoded.npz
PYTHONPATH=src ./train.py --dataset /path/to/encoded.npz
```
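
For reference, `encode.py` (shown later in this diff) writes one array of token ids per chunk with `np.savez_compressed`, so the encoded dataset can be loaded back as a plain list of arrays. A minimal sketch, using an illustrative path:
```
# Hedged sketch with an illustrative path: each entry of the .npz is one chunk
# of token ids, exactly as written by np.savez_compressed in encode.py.
import numpy as np

npz = np.load('/path/to/encoded.npz')
token_chunks = [npz[name] for name in npz.files]
print(len(token_chunks), 'chunks,', sum(len(c) for c in token_chunks), 'tokens total')
```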

## Unconditional sample generation
Make sure `cudnn` is installed. [Some have reported](https://github.com/nshepperd/gpt-2/issues/8) that `train.py` runs without it but has worse memory usage and might OOM.

| WARNING: Samples are unfiltered and may contain offensive content. |
| --- |
### Gradient Checkpointing

https://github.com/openai/gradient-checkpointing is included to reduce the memory requirements of the model, and can be enabled by `--memory_saving_gradients`. The checkpoints are currently chosen manually (poorly) by just adding layer 10 to the 'checkpoints' collection in model.py. `--memory_saving_gradients` is enabled by default for training the 345M model.
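
A minimal TF 1.x sketch of the pattern described above, with a toy graph standing in for the transformer (an assumption for illustration): the `tf.add_to_collection('checkpoints', ...)` call mirrors what model.py does, and `memory_saving_gradients.gradients(..., checkpoints='collection')` is the library's drop-in replacement for `tf.gradients`.
```
# Minimal TF 1.x sketch with a toy graph standing in for the transformer.
# memory_saving_gradients comes from openai/gradient-checkpointing.
import tensorflow as tf
import memory_saving_gradients

x = tf.placeholder(tf.float32, [None, 1024])
h = tf.layers.dense(x, 1024, activation=tf.nn.relu)
tf.add_to_collection('checkpoints', h)   # mark this activation as a checkpoint,
                                         # the way model.py marks layer 10's output
loss = tf.reduce_mean(tf.square(tf.layers.dense(h, 1024) - x))
train_vars = tf.trainable_variables()

# Drop-in replacement for tf.gradients: activations between checkpoints are
# recomputed during the backward pass instead of being kept in memory.
grads = memory_saving_gradients.gradients(loss, train_vars, checkpoints='collection')
```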

### Validation loss

Set `--val_every` to a number of steps `N > 0`, and a "validation" loss against a fixed sample of the dataset will be calculated every N steps to give a better sense of training progress. A value of N around 200 is suggested. You can set `--val_dataset` to choose a separate validation dataset; otherwise it defaults to a sample from the training dataset (so not a real cross-validation loss!).
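
As a hedged sketch with illustrative names (the real loop lives in train.py), the validation pass amounts to averaging the loss over a fixed set of batches every `N` steps:
```
# Hedged sketch with illustrative names; the real loop is in train.py.
def maybe_validate(sess, step, val_every, val_loss, val_context, val_batches):
    """Every `val_every` steps, average the loss over a fixed set of batches."""
    if val_every <= 0 or step % val_every != 0:
        return None
    losses = [sess.run(val_loss, feed_dict={val_context: batch}) for batch in val_batches]
    avg = sum(losses) / len(losses)
    print('[step {}] validation loss = {:.4f}'.format(step, avg))
    return avg
```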

### Optimizer

You can use SGD instead of Adam with `--optimizer sgd`. This also helps conserve memory when training the 345M model. Note: the learning rate needs to be adjusted for SGD, due to not having Adam's gradient normalization (0.0006 seems to be a good number from some experiments).
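
A minimal sketch of what such a flag typically switches between in TF 1.x (illustrative names; see train.py for the actual flag handling):
```
# Illustrative TF 1.x sketch of an --optimizer switch; see train.py for the real handling.
import tensorflow as tf

optimizer_name = 'sgd'     # or 'adam'
learning_rate = 0.0006     # SGD wants a larger rate than Adam, per the note above

if optimizer_name == 'sgd':
    opt = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
else:
    opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
```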

### Multi-GPU (out of date)

To do distributed training on multiple GPUs or machines using Horovod:

To generate unconditional samples from the small model:
```
python3 src/generate_unconditional_samples.py | tee samples
```
There are various flags for controlling the samples:
```
python3 src/generate_unconditional_samples.py --top_k 40 --temperature 0.7 | tee samples
mpirun -np 4 \
-H localhost:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-x PYTHONPATH=src \
-mca pml ob1 -mca btl ^openib \
/home/jovyan/gpt-2/train-horovod.py --dataset encoded.npz
```
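
train-horovod.py itself is not shown in this diff; as a hedged sketch, the standard Horovod TF 1.x pattern it presumably follows looks like this, with each of the four `mpirun` processes pinning one GPU:
```
# Hedged sketch of the usual Horovod TF 1.x pattern (assumed, not the script's exact code).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # pin one GPU per process

opt = tf.train.AdamOptimizer(1e-4 * hvd.size())    # scale learning rate with worker count
opt = hvd.DistributedOptimizer(opt)                # all-reduce gradients across workers
hooks = [hvd.BroadcastGlobalVariablesHook(0)]      # sync initial weights from rank 0
# ...then train with tf.train.MonitoredTrainingSession(config=config, hooks=hooks).
```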

While we have not yet released GPT-2 itself, you can see some unconditional samples from it (with default settings of temperature 1 and no truncation) in `gpt2-samples.txt`.
## GPT-2 samples

| WARNING: Samples are unfiltered and may contain offensive content. |
| --- |

## Conditional sample generation
While we have not yet released GPT-2 itself, you can see some samples from it in the `gpt-2-samples` folder.
We show unconditional samples with default settings (temperature 1 and no truncation), with temperature 0.7, and with truncation with top_k 40.
We show conditional samples, with contexts drawn from `WebText`'s test set, with default settings (temperature 1 and no truncation), with temperature 0.7, and with truncation with top_k 40.

To give the model custom prompts, you can use:
## Citation

Please use the following bibtex entry:
```
python3 src/interactive_conditional_samples.py --top_k 40
@article{radford2019language,
title={Language Models are Unsupervised Multitask Learners},
author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
year={2019}
}
```

## Future work

We may release code for evaluating the models on various benchmarks.

We are still considering release of the larger models.

## License

[MIT](./LICENSE)
28 changes: 28 additions & 0 deletions download_model.py
@@ -0,0 +1,28 @@
import os
import sys
import requests
from tqdm import tqdm

if len(sys.argv) != 2:
    print('You must enter the model name as a parameter, e.g.: download_model.py 117M')
    sys.exit(1)

model = sys.argv[1]

subdir = os.path.join('models', model)
if not os.path.exists(subdir):
    os.makedirs(subdir)
subdir = subdir.replace('\\','/') # needed for Windows

for filename in ['checkpoint','encoder.json','hparams.json','model.ckpt.data-00000-of-00001', 'model.ckpt.index', 'model.ckpt.meta', 'vocab.bpe']:

    r = requests.get("https://storage.googleapis.com/gpt-2/" + subdir + "/" + filename, stream=True)

    with open(os.path.join(subdir, filename), 'wb') as f:
        file_size = int(r.headers["content-length"])
        chunk_size = 1000
        with tqdm(ncols=100, desc="Fetching " + filename, total=file_size, unit_scale=True) as pbar:
            # 1k for chunk_size, since Ethernet packet size is around 1500 bytes
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                pbar.update(chunk_size)
17 changes: 0 additions & 17 deletions download_model.sh

This file was deleted.

31 changes: 31 additions & 0 deletions encode.py
@@ -0,0 +1,31 @@
#!/usr/bin/env python3
# Usage:
# PYTHONPATH=src ./encode.py <file|directory|glob> /path/to/output.npz
# PYTHONPATH=src ./train --dataset /path/to/output.npz

import argparse
import numpy as np

import encoder
from load_dataset import load_dataset

parser = argparse.ArgumentParser(
    description='Pre-encode text files into tokenized training set.',
    formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--model_name', metavar='MODEL', type=str, default='117M', help='Pretrained model name')
parser.add_argument('--combine', metavar='CHARS', type=int, default=50000, help='Concatenate files with <|endoftext|> separator into chunks of this minimum size')
parser.add_argument('--encoding', type=str, default='utf-8', help='Set the encoding for reading and writing files.')
parser.add_argument('in_text', metavar='PATH', type=str, help='Input file, directory, or glob pattern (utf-8 text).')
parser.add_argument('out_npz', metavar='OUT.npz', type=str, help='Output file path')

def main():
    args = parser.parse_args()
    enc = encoder.get_encoder(args.model_name)
    print('Reading files')
    chunks = load_dataset(enc, args.in_text, args.combine, encoding=args.encoding)
    print('Writing', args.out_npz)
    np.savez_compressed(args.out_npz, *chunks)


if __name__ == '__main__':
    main()
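
The fix this PR advertises concerns how `load_dataset` joins multiple input files into token chunks. As a hedged illustration of the underlying issue (not the actual patch): BPE-encoding the literal string `'<|endoftext|>'` splits it into several ordinary tokens, whereas the delimiter should be the single dedicated token id from the encoder's vocabulary.
```
# Hedged illustration (not the actual patch) of why the delimiter must be the
# single <|endoftext|> token id rather than BPE-encoded text. Run with PYTHONPATH=src.
import encoder

enc = encoder.get_encoder('117M')

as_text = enc.encode('<|endoftext|>')      # BPE splits the literal string into several tokens
as_token = enc.encoder['<|endoftext|>']    # the dedicated end-of-text id (50256 in the released vocab)

doc_a = enc.encode('first document')
doc_b = enc.encode('second document')
combined = doc_a + [as_token] + doc_b      # join documents with the single token id
```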