# RecDP LLM - LLM data preparation utility

RecDP LLM is a set of Python components that lets you quickly and easily build your own LLM data preparation pipeline.

- 10 general LLM data processing components for foundation model and finetune model training.
- 4 LLM data quality enhancement modules for finetune model training.
- 2 use cases: foundation model data preparation and finetune model data preparation.

## General - Foundation & FineTune

| Type | Notebook | Description | Supports | Verified dataset & size |
| --- | --- | --- | --- | --- |
| DocumentExtract | Open In Colab | Extract text from unstructured formats | jpg, png, pdf, docx | RefinedWeb - 1.7 TB |
| Reader | Open In Colab | Read data from directory | jsonl, parquet | RefinedWeb - 1.7 TB |
| Converter | Open In Colab | Read and convert unstructured data to a unified format | html, document, image, pdf, ... | RefinedWeb - 1.7 TB |
| Filter | Open In Colab | Filter out documents based on conditions | profanity-based, black-list, url_based, length_based | RedPajama - 2 TB |
| Text Bytesize | Open In Colab | Get text byte size | | RedPajama - 2 TB |
| Text Fixer | Open In Colab | Clean repeated formatting in html, latex, code | html, latex, codes | RefinedWeb - 1.7 TB |
| Language Identify | Open In Colab | Identify the major language of a document | en, zh, fr, de, ... (25 languages total) | RedPajama - 2 TB |
| Fuzzy Deduplicator | Open In Colab | Detect and reduce duplication based on document context | minHashLSH | PILE - 200 GB |
| Global Deduplicator | Open In Colab | Detect and reduce duplication based on exactly identical content | sha256-hash | RefinedWeb - 1.7 TB, RedPajama - 2 TB |
| Rouge Score Deduplicator | Open In Colab | Remove similar data by calculating the ROUGE score | | alpaca |
| Repetition Removal | Open In Colab | Detect and reduce repeated content within the same document | | RefinedWeb - 1.7 TB, RedPajama - 2 TB |
| Document splitter | Open In Colab | Split a document into multiple sub-documents | chapter_based, length_based | RefinedWeb - 1.7 TB |
| PII Removal | Open In Colab | Detect and replace personal information in documents | email, phone, ip, username, password | RefinedWeb - 1.7 TB |
| User Defined Transform | Open In Colab | Easy way to plug in a user-defined map function | parallel with ray or spark | RefinedWeb - 1.7 TB |
| User Defined Filter | Open In Colab | Easy way to plug in a user-defined filter function | parallel with ray or spark | RefinedWeb - 1.7 TB |
| Writer | Open In Colab | Write data to directory | jsonl, parquet | RefinedWeb - 1.7 TB |
| ClassifyWriter | Open In Colab | Classify and write data into sub-buckets | meta fields, language | RefinedWeb - 1.7 TB |
| Prompt Enhancement | Open In Colab | Create high-complexity instructions from existing instruct-tuned LLM models | PromptSource, self-instruct, evol-instruct (WizardLM) | alpaca |
| Tokenization | Open In Colab | Tokenize with the LLAMA2 tokenizer and save in Megatron format | LLAMA2 tokenizer | RefinedWeb - 1.7 TB |

## LLM Data Quality Analysis

| Diversity | GPT-3 Scoring | Toxicity | Perplexity |
| --- | --- | --- | --- |
| Visualize the diversity distribution of data | Leverage GPT-3 to score data quality | Visualize toxicity probability | Visualize perplexity distribution |
| learn more | learn more | learn more | learn more |

## Getting Started

### Deploy

```bash
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp[LLM] --pre
```

### Data pipeline

#### 1. RAG Data Pipeline - Build from public HTML

```python
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import TextPipeline

# Point this at the local directory holding your downloaded embedding models
model_root_path = "/path/to/models"

pipeline = TextPipeline()
ops = [
    # Load every HTML file under the "document" directory
    DirectoryLoader("document", glob="**/*.html"),
    # Split each document into smaller chunks
    DocumentSplit(),
    # Embed the chunks and ingest them into a FAISS vector store
    DocumentIngestion(
        vector_store='FAISS',
        vector_store_args={
            "output_dir": "ResumableTextPipeline_output",
            "index": "test_index"
        },
        embeddings='HuggingFaceEmbeddings',
        embeddings_args={
            'model_name': f"{model_root_path}/sentence-transformers/all-mpnet-base-v2"
        }
    ),
]
pipeline.add_operations(ops)
pipeline.execute()
```
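After the pipeline finishes, the persisted index can be queried for retrieval. A minimal sketch, assuming DocumentIngestion writes a LangChain-compatible FAISS store to `output_dir` under the configured index name (verify the layout against the notebook):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Assumption: the index layout matches LangChain's FAISS.save_local format
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
db = FAISS.load_local(
    "ResumableTextPipeline_output", embeddings, index_name="test_index"
)
for doc in db.similarity_search("your query here", k=3):
    print(doc.page_content[:200])
```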

#### 2. Finetune Data Pipeline - Downsize public finetune dataset
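The notebook covers the full flow; as a rough sketch (operation names such as `JsonlReader`, `RougeScoreDedup`, and `ParquetWriter` are assumptions inferred from the component table above, not verified API):

```python
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import TextPipeline

# Hypothetical sketch: read a public finetune dataset, drop near-duplicate
# samples via ROUGE similarity, then write the downsized result.
pipeline = TextPipeline()
ops = [
    JsonlReader("data/alpaca/"),        # assumed reader op; path illustrative
    RougeScoreDedup(),                  # assumed op behind "Rouge Score Deduplicator"
    ParquetWriter("output/downsized"),  # assumed writer op
]
pipeline.add_operations(ops)
pipeline.execute()
```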

#### 3. Finetune Data Pipeline - Build finetune dataset from Plain Text
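A minimal sketch, assuming the QA-synthesis step is expressed as a user-defined transform (`TextCustomerMap` and the `generate_qa` helper are hypothetical names; see the notebook for the verified operators):

```python
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import TextPipeline

def generate_qa(text):
    # Hypothetical helper: call an instruct-tuned LLM to turn a text chunk
    # into instruction/response pairs.
    ...

pipeline = TextPipeline()
ops = [
    DirectoryLoader("plain_text_docs", glob="**/*.txt"),  # read plain text files
    DocumentSplit(),                                      # chunk each document
    TextCustomerMap(generate_qa),                         # assumed user-defined transform op
    ParquetWriter("output/finetune_dataset"),             # assumed writer op
]
pipeline.add_operations(ops)
pipeline.execute()
```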

#### 4. Finetune Data Pipeline - Build finetune dataset from Existing QA
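A minimal sketch, assuming the existing QA pairs are JSONL records with `question` and `answer` fields (the reader, writer, and `TextCustomerMap` names are assumptions):

```python
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import TextPipeline

def to_training_text(sample):
    # Hypothetical helper: fold an existing QA pair into one training record
    sample["text"] = f"### Question:\n{sample['question']}\n\n### Answer:\n{sample['answer']}"
    return sample

pipeline = TextPipeline()
ops = [
    JsonlReader("data/qa/"),            # assumed reader op; path illustrative
    TextCustomerMap(to_training_text),  # assumed user-defined transform op
    JsonlWriter("output/finetune_qa"),  # assumed writer op
]
pipeline.add_operations(ops)
pipeline.execute()
```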

#### 5. AutoHPO

Low-code configuration with automated operator parameter tuning, allowing users to transform their own raw data into a high-quality dataset with low effort. We couple data processing with quality analysis as the evaluation metric, which estimates the data's quality before actual model finetuning/inference.

```python
from pyrecdp.primitives.llmutils.pipeline_hpo import text_pipeline_optimize

# input data path is configured in input_pipeline_file
input_pipeline_file = "config/pipeline_hpo.yaml.template"
input_hpo_file = "config/hpo.yaml"
output_pipeline_file = "config/pipeline.yaml"

text_pipeline_optimize(input_pipeline_file, output_pipeline_file, input_hpo_file)
```

#### Run with individual components

- cmdline mode

```bash
python pyrecdp/primitives/llmutils/language_identify.py \
    --data_dir tests/llm_data \
    --language_identify_output_dir output \
    --fasttext_model_dir ./cache/RecDP/models/lib.bin
```

- operation-based API - ray mode

```python
from pyrecdp.primitives.operations import LengthFilter

dataset = ...  # Ray Dataset
op = LengthFilter()
op.process_rayds(dataset)
```
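To fill in the Ray Dataset placeholder above, a minimal sketch using Ray's built-in JSON reader (the path and JSONL layout are assumptions):

```python
import ray

# Read local JSONL files into a Ray Dataset; the path is illustrative
dataset = ray.data.read_json("tests/llm_data")
```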
- operation-based API - spark mode

```python
from pyrecdp.primitives.operations import LengthFilter

sparkdf = ...  # Spark DataFrame
op = LengthFilter()
op.process_spark(sparkdf)
```
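Likewise, the Spark DataFrame placeholder can be filled from a local session (the path and schema are assumptions):

```python
from pyspark.sql import SparkSession

# Start a local Spark session and read JSONL files into a DataFrame;
# the path is illustrative
spark = SparkSession.builder.master("local[*]").getOrCreate()
sparkdf = spark.read.json("tests/llm_data")
```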