RecDP LLM is a set of Python components that lets you quickly and easily build your own LLM data preparation pipeline. It provides:
- 10 general LLM data processing components for foundation model and finetune model training.
- 4 LLM data quality enhancement modules for finetune model training.
- 2 use cases, covering foundation model data preparation and finetune model data preparation.
Type | Description | Supports | Verified dataset & size
---|---|---|---|
DocumentExtract | Extract text from unstructured formats | jpg, png, pdf, docx | RefinedWeb - 1.7 TB |
Reader | Read data from a directory | jsonl, parquet | RefinedWeb - 1.7 TB |
Converter | Read and convert unstructured data to a unified format | html, document, image, pdf, ... | RefinedWeb - 1.7 TB |
Filter | Filter out documents based on a condition | profanity-based, black-list, url-based, length-based | RedPajama - 2 TB |
Text Bytesize | Get text byte size | | RedPajama - 2 TB |
Text Fixer | Clean repeated formatting in html, latex, code | html, latex, codes | RefinedWeb - 1.7 TB |
Language Identify | Identify the major language of a document | en, zh, fr, de, ... 25 languages in total | RedPajama - 2 TB |
Fuzzy Deduplicator | Detect and reduce duplication based on document context | minHashLSH | PILE - 200 GB |
Global Deduplicator | Detect and reduce duplication based on exactly identical content | sha256-hash | RefinedWeb - 1.7 TB, RedPajama - 2 TB |
Rouge Score Deduplicator | Remove similar data by computing the rouge score | | alpaca |
Repetition Removal | Detect and reduce repeated context within the same document | | RefinedWeb - 1.7 TB, RedPajama - 2 TB |
Document Splitter | Split a document into multiple sub-documents | chapter_based, length_based | RefinedWeb - 1.7 TB |
PII Removal | Detect and replace personal information in documents | email, phone, ip, username, password | RefinedWeb - 1.7 TB |
User Defined Transform | Easy way to plug in a user-defined map function | parallel with ray or spark | RefinedWeb - 1.7 TB |
User Defined Filter | Easy way to plug in a user-defined filter function | parallel with ray or spark | RefinedWeb - 1.7 TB |
Writer | Write data to a directory | jsonl, parquet | RefinedWeb - 1.7 TB |
ClassifyWriter | Classify and write data into sub-buckets | meta fields, language | RefinedWeb - 1.7 TB |
Prompt Enhancement | Create high-complexity instructions from existing instruct-tuned LLM models | PromptSource, self-instruct, evol-instruct (WizardLM) | alpaca |
Tokenization | Tokenize with the LLAMA2 tokenizer and save in Megatron format | LLAMA2 tokenizer | RefinedWeb - 1.7 TB |
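As the table notes, the Global Deduplicator keys each document on a sha256 hash of its content and keeps only the first occurrence. The idea can be sketched in a few lines of plain Python (an illustration of the technique only, not RecDP's implementation; the `strip()` normalization is an assumption for the example):

```python
import hashlib

def sha256_fingerprint(text: str) -> str:
    # Light normalization so trivially identical documents collide.
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

def global_dedup(docs):
    """Keep the first occurrence of each exact-content document."""
    seen = set()
    kept = []
    for doc in docs:
        h = sha256_fingerprint(doc)
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

docs = ["hello world", "hello world", "another doc"]
print(global_dedup(docs))  # duplicates removed, order preserved
```

The Fuzzy Deduplicator replaces the exact hash with minHashLSH signatures so that near-duplicates, not just byte-identical documents, collide.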
Diversity | GPT-3 Scoring | Toxicity | Perplexity
---|---|---|---|
Visualize the diversity distribution of data | Leverage GPT-3 to score data quality | Visualize the toxicity probability of data | Visualize the perplexity distribution of data
learn more | learn more | learn more | learn more
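The perplexity metric underlying the last module is the standard formula PPL = exp(-mean(log p)) over per-token probabilities. A self-contained sketch of the computation (illustrative only; RecDP obtains the token probabilities from a language model):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-mean(log p))."""
    assert token_logprobs, "need at least one token"
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A document whose tokens all have probability 0.25 has perplexity 4.
print(perplexity([math.log(0.25)] * 10))
```

Lower perplexity indicates text the reference model finds more predictable, which is why its distribution is a useful proxy for data quality.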
```bash
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp[LLM] --pre
```
```python
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import TextPipeline

# Root path of locally cached embedding models (adjust to your environment)
model_root_path = "/path/to/models"

pipeline = TextPipeline()
ops = [
    DirectoryLoader("document", glob="**/*.html"),
    DocumentSplit(),
    DocumentIngestion(
        vector_store='FAISS',
        vector_store_args={
            "output_dir": "ResumableTextPipeline_output",
            "index": "test_index"
        },
        embeddings='HuggingFaceEmbeddings',
        embeddings_args={
            'model_name': f"{model_root_path}/sentence-transformers/all-mpnet-base-v2"
        }
    ),
]
pipeline.add_operations(ops)
pipeline.execute()
```
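Conceptually, the ingestion step above splits documents into chunks, embeds each chunk, and stores the vectors in a FAISS index for similarity search. A toy sketch of that flow, with a bag-of-words count standing in for the all-mpnet-base-v2 embedding and a plain list standing in for FAISS (none of this is the pyrecdp implementation):

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a sentence-transformer."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    """Minimal stand-in for a FAISS index: (vector, text) pairs plus search."""
    def __init__(self):
        self.entries = []
    def add(self, text):
        self.entries.append((embed(text), text))
    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add("RecDP builds LLM data pipelines")
store.add("FAISS is a vector similarity index")
print(store.search("vector index", k=1))
```

A real setup swaps in dense model embeddings and an approximate-nearest-neighbor index, but the ingest-then-search contract is the same.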
Low-code configuration with automated operator parameter tuning lets users transform their own raw data into a high-quality dataset with little effort. Data processing is coupled with quality analysis as the evaluation metric, which estimates the data's quality before actual model finetuning/inference.
```python
from pyrecdp.primitives.llmutils.pipeline_hpo import text_pipeline_optimize

# The input data path is configured in input_pipeline_file
input_pipeline_file = "config/pipeline_hpo.yaml.template"
input_hpo_file = "config/hpo.yaml"
output_pipeline_file = "config/pipeline.yaml"

text_pipeline_optimize(input_pipeline_file, output_pipeline_file, input_hpo_file)
```
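Under the hood, hyper-parameter optimization of this kind amounts to sampling operator parameters, scoring the processed output with a quality metric, and keeping the best configuration. A minimal random-search sketch of that loop (illustrative only; the metric and parameter space here are invented for the example and are not pyrecdp's):

```python
import random

def quality_score(params, docs):
    """Stand-in quality metric: fraction of documents surviving a length
    filter. In RecDP this would be a real quality analysis of the data."""
    kept = [d for d in docs if len(d) >= params["min_length"]]
    return len(kept) / len(docs) if docs else 0.0

def tune(docs, trials=20, seed=0):
    """Random search over the filter threshold; keep the best-scoring config."""
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(trials):
        params = {"min_length": rng.randint(1, 100)}
        score = quality_score(params, docs)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

docs = ["short", "a considerably longer document", "mid length doc"]
print(tune(docs))
```

`text_pipeline_optimize` plays the same role at pipeline scale: it reads the search space from `hpo.yaml`, evaluates candidate pipelines, and writes the winning configuration to the output pipeline file.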
- cmdline mode
```bash
python pyrecdp/primitives/llmutils/language_identify.py \
    --data_dir tests/llm_data \
    --language_identify_output_dir output \
    --fasttext_model_dir ./cache/RecDP/models/lib.bin
```
- operation-based API - ray mode
```python
from pyrecdp.primitives.operations import LengthFilter

dataset = …  # Ray Dataset
op = LengthFilter()
op.process_rayds(dataset)
```
- operation-based API - spark mode
```python
from pyrecdp.primitives.operations import LengthFilter

sparkdf = …  # Spark DataFrame
op = LengthFilter()
op.process_spark(sparkdf)
```