# llm-dataset-converter

For converting large language model (LLM) datasets from one format into another. Filters can be supplied as well, e.g., for cleaning up the data.

## Installation

Via PyPI:

```bash
pip install llm_dataset_converter
```

The latest code straight from the repository:

```bash
pip install git+https://github.com/waikato-llm/llm-dataset-converter.git
```

## Docker

Docker images are available from:

## Datasets

The following repository contains a curated list of datasets for LLMs:

https://github.com/Zjh-819/LLMDataHub

The Hugging Face Hub has an abundance of datasets as well:

https://huggingface.co/datasets

## Dataset formats

The following dataset formats are supported:

| Domain         | Format      | Read               | Write            | Compression |
|----------------|-------------|--------------------|------------------|-------------|
| classification | CSV         | from-csv-cl        | to-csv-cl        | Y           |
| classification | Jsonlines   | from-jsonlines-cl  | to-jsonlines-cl  | Y           |
| classification | Parquet     | from-parquet-cl    | to-parquet-cl    | N           |
| classification | TSV         | from-tsv-cl        | to-tsv-cl        | Y           |
| pairs          | Alpaca      | from-alpaca        | to-alpaca        | Y           |
| pairs          | CSV         | from-csv-pr        | to-csv-pr        | Y           |
| pairs          | Jsonlines   | from-jsonlines-pr  | to-jsonlines-pr  | Y           |
| pairs          | Parquet     | from-parquet-pr    | to-parquet-pr    | N           |
| pairs          | TSV         | from-tsv-pr        | to-tsv-pr        | Y           |
| pairs          | XTuner      | from-xtuner        | to-xtuner        | Y           |
| pretrain       | CSV         | from-csv-pt        | to-csv-pt        | Y           |
| pretrain       | Jsonlines   | from-jsonlines-pt  | to-jsonlines-pt  | Y           |
| pretrain       | Parquet     | from-parquet-pt    | to-parquet-pt    | N           |
| pretrain       | TSV         | from-tsv-pt        | to-tsv-pt        | Y           |
| pretrain       | TXT         | from-txt-pt        | to-txt-pt        | Y ¹         |
| translation    | CSV         | from-csv-t9n       | to-csv-t9n       | Y           |
| translation    | Jsonlines ² | from-jsonlines-t9n | to-jsonlines-t9n | Y           |
| translation    | Parquet ³   | from-parquet-t9n   | to-parquet-t9n   | N           |
| translation    | TSV         | from-tsv-t9n       | to-tsv-t9n       | Y           |
| translation    | TXT         | from-txt-t9n       | to-txt-t9n       | Y ¹         |

¹ Compression not available when concatenating content in a single file.

² Format defined here.

³ The translation data itself is stored as a JSON dictionary.

## Compression formats

If a format supports compression, the following compression formats (matching the `-c` option of `llm-convert`) are automatically supported when loading/saving files:

- bz2
- gz
- xz
- zstd

## File encodings

Most readers offer the `--encoding` option to override the automatically determined file encoding, as the detection can be wrong because only a fixed number of bytes gets inspected. The number of bytes inspected can be influenced via the following environment variable:

```
LDC_ENCODING_MAX_CHECK_LENGTH
```

A value of -1 means the complete file is inspected. However, that can be very slow, so a smaller value (less than 1 MB) is recommended.
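The effect of the variable can be sketched in Python. The helper name and the 1 MB fallback below are assumptions for illustration, not the library's actual implementation:

```python
import os
from typing import Optional

# Hypothetical helper (not part of the library's API) showing how a reader
# might cap the number of bytes inspected for encoding detection.
def resolve_check_length(default: int = 1024 * 1024) -> Optional[int]:
    value = int(os.environ.get("LDC_ENCODING_MAX_CHECK_LENGTH", default))
    # -1 means "inspect the complete file"; modeled here as None (no cap)
    return None if value == -1 else value
```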

## Tools

### Dataset conversion

```
usage: llm-convert [-h|--help|--help-all|-help-plugin NAME] [-u INTERVAL]
                   [-c {None,bz2,gz,xz,zstd}]
                   [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                   reader
                   [filter [filter [...]]]
                   [writer]

Tool for converting between large language model (LLM) dataset formats.

readers (20):
   from-alpaca, from-csv-cl, from-csv-pr, from-csv-pt, from-csv-t9n,
   from-jsonlines-cl, from-jsonlines-pr, from-jsonlines-pt,
   from-jsonlines-t9n, from-parquet-cl, from-parquet-pr,
   from-parquet-pt, from-parquet-t9n, from-tsv-cl, from-tsv-pr,
   from-tsv-pt, from-tsv-t9n, from-txt-pt, from-txt-t9n, from-xtuner
filters (38):
   assemble-sentences, change-case, classification-label-map,
   file-filter, find-substr, inspect, keyword, language,
   llama2-to-pairs, max-length-pt, max-records, metadata,
   metadata-from-name, pairs-to-llama2, pairs-to-pretrain,
   pretrain-sentences-to-classification, pretrain-sentences-to-pairs,
   randomize-records, record-files, record-window, remove-blocks,
   remove-empty, remove-patterns, replace-patterns, require-languages,
   reset-ids, sentences-pt, skip-duplicate-ids, skip-duplicate-text,
   split-pt, split-records, tee, text-length, text-stats,
   to-llama2-format, translation-to-pairs, translation-to-pretrain,
   update-pair-data
writers (20):
   to-alpaca, to-csv-cl, to-csv-pr, to-csv-pt, to-csv-t9n,
   to-jsonlines-cl, to-jsonlines-pr, to-jsonlines-pt, to-jsonlines-t9n,
   to-parquet-cl, to-parquet-pr, to-parquet-pt, to-parquet-t9n,
   to-tsv-cl, to-tsv-pr, to-tsv-pt, to-tsv-t9n, to-txt-pt, to-txt-t9n,
   to-xtuner

optional arguments:
  -h, --help              show basic help message and exit
  --help-all              show basic help message plus help on all plugins and exit
  --help-plugin NAME      show help message for plugin NAME and exit
  -u INTERVAL, --update_interval INTERVAL
                          outputs the progress every INTERVAL records (default: 1000)
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                          the logging level to use (default: WARN)
  -c {None,bz2,gz,xz,zstd}, --compression {None,bz2,gz,xz,zstd}
                          the type of compression to use when only providing an output
                          directory to the writer (default: None)
  -b, --force_batch       processes the data in batches
  -U, --unescape_unicode  unescape unicode characters in the command-line
```
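A typical invocation chains a reader, zero or more filters, and a writer. The sketch below is illustrative only: the file names are hypothetical and the `-i`/`-o` option names of the individual plugins are assumptions (use `--help-plugin NAME` to see the actual options):

```bash
# Sketch: read an Alpaca dataset, renumber the records, write pairs as CSV.
# File names and plugin option names are assumptions.
llm-convert \
  from-alpaca -i ./alpaca_data.json \
  reset-ids \
  to-csv-pr -o ./pairs.csv
```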

### Download

```
usage: llm-download [-h|--help|--help-all|-help-plugin NAME]
                    downloader

Tool for downloading data for large language models (LLMs).

downloaders:
   huggingface

optional arguments:
  -h, --help            show basic help message and exit
  --help-all            show basic help message plus help on all plugins and exit
  --help-plugin NAME    show help message for plugin NAME and exit
```

### Combining multiple files (one-after-the-other)

```
usage: llm-append [-h] [-i [INPUT [INPUT ...]]]
                  [-I [INPUT_LIST [INPUT_LIST ...]]]
                  [-t {csv,json,jsonlines,plain-text,tsv}] [-o FILE] [-p]
                  [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Tool for combining multiple text files by appending them.

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Path to the text file(s) to append; glob syntax is
                        supported (default: None)
  -I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
                        Path to the text file(s) listing the data files to
                        append (default: None)
  -t {csv,json,jsonlines,plain-text,tsv}, --file_type {csv,json,jsonlines,plain-text,tsv}
                        The type of files that are being processed. (default:
                        plain-text)
  -o FILE, --output FILE
                        The path of the file to store the combined data in;
                        outputs it to stdout if omitted or a directory
                        (default: None)
  -p, --pretty_print    Whether to output the JSON in more human-readable
                        format. (default: False)
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        The logging level to use. (default: WARN)
```
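As a sketch (the paths below are hypothetical), appending several jsonlines shards into a single file could look like this:

```bash
# Sketch: append all jsonlines shards into one file via glob syntax.
llm-append \
  -i ./shards/*.jsonl \
  -t jsonlines \
  -o ./combined.jsonl
```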

### Combining multiple files (side-by-side)

```
usage: llm-paste [-h] [-i [INPUT [INPUT ...]]]
                 [-I [INPUT_LIST [INPUT_LIST ...]]] [-o FILE]
                 [-s [SEP [SEP ...]]] [-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]

Tool for combining multiple text files by placing them side-by-side.

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Path to the text file(s) to combine; glob syntax is
                        supported (default: None)
  -I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
                        Path to the text file(s) listing the data files to
                        combine (default: None)
  -o FILE, --output FILE
                        The path of the file to store the combined data in;
                        outputs it to stdout if omitted or a directory
                        (default: None)
  -s [SEP [SEP ...]], --separator [SEP [SEP ...]]
                        The separators to use between the files; uses TAB if
                        not supplied; use '{T}' as placeholder for tab
                        (default: None)
  -l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
                        The logging level to use (default: WARN)
```
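For instance, two parallel text files could be merged into a tab-separated file like this (the file names are hypothetical; `{T}` is the documented tab placeholder):

```bash
# Sketch: place two line-aligned files side-by-side, separated by a tab.
llm-paste \
  -i ./english.txt ./german.txt \
  -s "{T}" \
  -o ./parallel.tsv
```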

### File encodings

The following tool allows you to determine the encoding of text files.

```
usage: llm-file-encoding [-h] [-i [INPUT [INPUT ...]]]
                         [-I [INPUT_LIST [INPUT_LIST ...]]]
                         [-m MAX_CHECK_LENGTH] [-o FILE]
                         [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Tool for determining the file encoding of text files.

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Path to the text file(s) to check; glob syntax is
                        supported (default: None)
  -I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
                        Path to the text file(s) listing the actual files to
                        check (default: None)
  -m MAX_CHECK_LENGTH, --max_check_length MAX_CHECK_LENGTH
                        The maximum number of bytes to use for checking
                        (default: None)
  -o FILE, --output FILE
                        The path of the file to store the determined encodings
                        in; outputs it to stdout if omitted or a directory
                        (default: None)
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        The logging level to use. (default: WARN)
```
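For example, to report the encodings of a set of text files while inspecting at most 64 KB per file (the glob below is hypothetical):

```bash
# Sketch: print the detected encoding of each matching file to stdout,
# checking at most 65536 bytes per file.
llm-file-encoding -i ./docs/*.txt -m 65536
```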

### Locating files

Readers tend to support input via file lists. The llm-find tool can generate these.

```
usage: llm-find [-h] -i DIR [DIR ...] [-r] -o FILE [-m [REGEXP [REGEXP ...]]]
                [-n [REGEXP [REGEXP ...]]]
                [--split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]]
                [--split_names [SPLIT_NAMES [SPLIT_NAMES ...]]]
                [--split_name_separator SPLIT_NAME_SEPARATOR]
                [-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]

Tool for locating files in directories that match certain patterns and store
them in files.

optional arguments:
  -h, --help            show this help message and exit
  -i DIR [DIR ...], --input DIR [DIR ...]
                        The dir(s) to scan for files. (default: None)
  -r, --recursive       Whether to search the directories recursively
                        (default: False)
  -o FILE, --output FILE
                        The file to store the located file names in (default:
                        None)
  -m [REGEXP [REGEXP ...]], --match [REGEXP [REGEXP ...]]
                        The regular expression that the (full) file names must
                        match to be included (default: None)
  -n [REGEXP [REGEXP ...]], --not-match [REGEXP [REGEXP ...]]
                        The regular expression that the (full) file names must
                        match to be excluded (default: None)
  --split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]
                        The split ratios to use for generating the splits
                        (int; must sum up to 100) (default: None)
  --split_names [SPLIT_NAMES [SPLIT_NAMES ...]]
                        The split names to use as filename suffixes for the
                        generated splits (before .ext) (default: None)
  --split_name_separator SPLIT_NAME_SEPARATOR
                        The separator to use between file name and split name
                        (default: -)
  -l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
                        The logging level to use (default: WARN)
```
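Putting the options together (the directory and file names below are hypothetical), an 80/20 train/test split of all `.txt` files might be generated like this:

```bash
# Sketch: recursively locate .txt files under ./corpus and split the
# resulting file list 80/20 into train/test lists.
llm-find \
  -i ./corpus -r \
  -m '.*\.txt$' \
  --split_ratios 80 20 \
  --split_names train test \
  -o ./files.txt
```

With the default split name separator, this should produce `files-train.txt` and `files-test.txt`.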

### Generating help screens for plugins

```
usage: llm-help [-h] [-c [PACKAGE [PACKAGE ...]]] [-e EXCLUDED_CLASS_LISTERS]
                [-p NAME] [-f FORMAT] [-L INT] [-o PATH] [-i FILE] [-t TITLE]
                [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Tool for outputting help for plugins in various formats.

optional arguments:
  -h, --help            show this help message and exit
  -c [PACKAGE [PACKAGE ...]], --custom_class_listers [PACKAGE [PACKAGE ...]]
                        The names of the custom class listers, uses the
                        default ones if not provided. (default: None)
  -e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
                        (default: None)
  -p NAME, --plugin_name NAME
                        The name of the plugin to generate the help for,
                        generates it for all if not specified (default: None)
  -f FORMAT, --help_format FORMAT
                        The output format to generate (default: text)
  -L INT, --heading_level INT
                        The level to use for the heading (default: 1)
  -o PATH, --output PATH
                        The directory or file to store the help in; outputs it
                        to stdout if not supplied; if pointing to a directory,
                        automatically generates file name from plugin name and
                        help format (default: None)
  -i FILE, --index_file FILE
                        The file in the output directory to generate with an
                        overview of all plugins, grouped by type (in markdown
                        format, links them to the other generated files)
                        (default: None)
  -t TITLE, --index_title TITLE
                        The title to use in the index file (default: llm-
                        dataset-converter plugins)
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        The logging level to use. (default: WARN)
```
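A sketch of generating per-plugin help files plus an index (the output directory is hypothetical, and `markdown` as a value for `-f` is an assumption based on the index file description above):

```bash
# Sketch: write one help file per plugin into ./docs and an index.md
# linking to all of them.
llm-help -f markdown -o ./docs -i index.md
```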

### Plugin registry

```
usage: llm-registry [-h] [-c CUSTOM_CLASS_LISTERS] [-e EXCLUDED_CLASS_LISTERS]
                    [-l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}]

For inspecting/querying the registry.

optional arguments:
  -h, --help            show this help message and exit
  -c CUSTOM_CLASS_LISTERS, --custom_class_listers CUSTOM_CLASS_LISTERS
                        The comma-separated list of custom class listers to
                        use. (default: None)
  -e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
                        (default: None)
  -l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}, --list {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}
                        For outputting various lists on stdout. (default:
                        None)
```

## Plugins

See here for an overview of all plugins.

## Examples

You can find examples for using the library (command-line and code) here:

https://waikato-llm.github.io/llm-dataset-converter-examples/

## Additional libraries

### Class listers

The llm-dataset-converter uses the class lister registry provided by the seppl library.

Each module defines a function, typically called `list_classes`, that returns a dictionary mapping superclass names to the lists of modules that should be scanned for derived classes. Here is an example:

```python
from typing import List, Dict


def list_classes() -> Dict[str, List[str]]:
    return {
        "ldc.api.Downloader": [
            "mod.ule1",
        ],
        "ldc.api.Reader": [
            "mod.ule2",
            "mod.ule3",
        ],
        "ldc.api.Filter": [
            "mod.ule4",
        ],
        "seppl.io.Writer": [
            "mod.ule5",
        ],
    }
```

Such a class lister gets referenced in the `entry_points` section of the `setup.py` file:

```python
    entry_points={
        "class_lister": [
            "unique_string=module_name:function_name",
        ],
    },
```

The `:function_name` suffix can be omitted if the function is called `list_classes`.

The following environment variables can be used to influence the class listers:

- `LDC_CLASS_LISTERS`
- `LDC_CLASS_LISTERS_EXCL`

Each variable takes a comma-separated list of `module_name:function_name` entries, defining the class listers.
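For example (the module names below are hypothetical), an additional class lister can be registered and another one excluded via the environment:

```bash
# Register a hypothetical extra class lister and exclude another one.
export LDC_CLASS_LISTERS="my_plugins.lister:list_classes"
export LDC_CLASS_LISTERS_EXCL="unwanted_pkg.lister:list_classes"
```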