diff --git a/CHANGELOG.md b/CHANGELOG.md index b2ebef0..5c0226e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,10 @@ All notable changes to this project will be documented in this file.
+## [2.4.0] - 2024-12-13
+### ✨ Added
+- Added `sentenceit` function.
+
 ## [2.3.7] - 2024-11-25
 ### πŸ“¦ Updated
 - Update `string-segmenter` patch version
diff --git a/README.md b/README.md index 31d9830..ba5e940 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ NPM Package for Semantically creating chunks from large texts. Useful for workfl
 ### Maintained by
- eQuill Labs
+ eQuill Labs
 ## Features
@@ -260,6 +260,33 @@ The Semantic Chunking Web UI allows you to experiment with the chunking paramete
 There is an additional function you can import to just "cram" sentences together till they meet your target token size for when you just need quick, high density chunks.
+
+## Parameters
+
+`cramit` accepts an array of document objects and an optional configuration object. Here are the details for each parameter:
+
+- `documents`: array of documents. each document is an object containing `document_name` and `document_text`.
+    ```
+    documents = [
+        { document_name: "document1", document_text: "..." },
+        { document_name: "document2", document_text: "..." },
+        ...
+    ]
+    ```
+
+- **Cramit Options Object:**
+
+    - `logging`: Boolean (optional, default `false`) - Enables logging of detailed processing steps.
+    - `maxTokenSize`: Integer (optional, default `500`) - Maximum token size for each chunk.
+    - `onnxEmbeddingModel`: String (optional, default `Xenova/all-MiniLM-L6-v2`) - ONNX model used for creating embeddings.
+    - `dtype`: String (optional, default `fp32`) - Precision of the embedding model (options: `fp32`, `fp16`, `q8`, `q4`).
+    - `localModelPath`: String (optional, default `null`) - Local path to save and load models (example: `./models`).
+    - `modelCacheDir`: String (optional, default `null`) - Directory to cache downloaded models (example: `./models`).
+    - `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in `onnxEmbeddingModel`.
+    - `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`.
+    - `chunkPrefix`: String (optional, default `null`) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
+    - `excludeChunkPrefixInResults`: Boolean (optional, default `false`) - If set to `true`, the chunk prefix will be removed from the results. This is useful when you want to remove the prefix from the results while still maintaining the prefix for embedding calculations.
+
 Basic usage:
 ```javascript
@@ -285,37 +312,62 @@ main();
 Look at the `example\example-cramit.js` file in the root of this project for a more complex example of using all the optional parameters.
-### Tuning
+---
-The behavior of the `chunkit` function can be finely tuned using several optional parameters in the options object. 
Understanding how each parameter affects the function can help you optimize the chunking process for your specific requirements. +## `sentenceit` - βœ‚οΈ When you just need a Clean Split -#### `logging` +There is an additional function you can import to just split sentences. -- **Type**: Boolean -- **Default**: `false` -- **Description**: Enables detailed debug output during the chunking process. Turning this on can help in diagnosing how chunks are formed or why certain chunks are combined. -#### `maxTokenSize` +## Parameters -- **Type**: Integer -- **Default**: `500` -- **Description**: Sets the maximum number of tokens allowed in a single chunk. Smaller values result in smaller, more numerous chunks, while larger values can create fewer, larger chunks. It’s crucial for maintaining manageable chunk sizes when processing large texts. +`sentenceit` accepts an array of document objects and an optional configuration object. Here are the details for each parameter: + +- `documents`: array of documents. each document is an object containing `document_name` and `document_text`. + ``` + documents = [ + { document_name: "document1", document_text: "..." }, + { document_name: "document2", document_text: "..." }, + ... + ] + ``` -#### `onnxEmbeddingModel` +- **Sentenceit Options Object:** + + - `logging`: Boolean (optional, default `false`) - Enables logging of detailed processing steps. + - `onnxEmbeddingModel`: String (optional, default `Xenova/all-MiniLM-L6-v2`) - ONNX model used for creating embeddings. + - `dtype`: String (optional, default `fp32`) - Precision of the embedding model (options: `fp32`, `fp16`, `q8`, `q4`). + - `localModelPath`: String (optional, default `null`) - Local path to save and load models (example: `./models`). + - `modelCacheDir`: String (optional, default `null`) - Directory to cache downloaded models (example: `./models`). + - `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in `onnxEmbeddingModel`. + - `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`. + - `chunkPrefix`: String (optional, default `null`) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths. + - `excludeChunkPrefixInResults`: Boolean (optional, default `false`) - If set to `true`, the chunk prefix will be removed from the results. This is useful when you want to remove the prefix from the results while still maintaining the prefix for embedding calculations. -- **Type**: String -- **Default**: `Xenova/paraphrase-multilingual-MiniLM-L12-v2` -- **Description**: Specifies the model used to generate sentence embeddings. Different models may yield different qualities of embeddings, affecting the chunking quality, especially in multilingual contexts. 
-- **Resource Link**: [ONNX Embedding Models](https://huggingface.co/models?pipeline_tag=feature-extraction&library=onnx&sort=trending) - Link to a filtered list of embedding models converted to ONNX library format by Xenova. - Refer to the Model table below for a list of suggested models and their sizes (choose a multilingual model if you need to chunk text other than English). +Basic usage: -#### `dtype` +```javascript +import { sentenceit } from 'semantic-chunking'; -- **Type**: String -- **Default**: `fp32` -- **Description**: Indicates the precision of the embedding model. Options are `fp32`, `fp16`, `q8`, `q4`. -`fp32` is the highest precision but also the largest size and slowest to load. `q8` is a good compromise between size and speed if the model supports it. All models support `fp32`, but only some support `fp16`, `q8`, and `q4`. +let duckText = "A duck waddles into a bakery and quacks to the baker, \"I'll have a loaf of bread, please.\" The baker, amused, quickly wraps the loaf and hands it over. The duck takes a nibble, looks around, and then asks, \"Do you have any seeds to go with this?\" The baker, chuckling, replies, \"Sorry, we're all out of seeds today.\" The duck nods and continues nibbling on its bread, clearly unfazed by the lack of seed toppings. Just another day in the life of a bread-loving waterfowl! πŸ¦†πŸž"; + +// initialize documents array and add the duck text to it +let documents = []; +documents.push({ + document_name: "duck document", + document_text: duckText +}); + +// call the sentenceit function passing in the documents array and the options object +async function main() { + let myDuckChunks = await sentenceit(documents, { returnEmbedding: true }); + console.log("myDuckChunks", myDuckChunks); +} +main(); + +``` + +Look at the `example\example-sentenceit.js` file in the root of this project for a more complex example of using all the optional parameters. 
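+
+When `returnEmbedding` is `true` (as in the example above), each entry in the returned array also carries the model details and the embedding vector. One entry looks roughly like the sketch below — the values are illustrative, and the exact sentence count depends on how the splitter segments the text:
+
+```javascript
+// illustrative shape of a single entry returned by sentenceit (example values only)
+const exampleSentence = {
+    document_id: 1734120000000,            // Date.now() timestamp shared by all sentences from the same document
+    document_name: "duck document",
+    number_of_sentences: 6,                // total sentences detected in the document
+    sentence_number: 1,
+    text: "A duck waddles into a bakery and quacks to the baker, \"I'll have a loaf of bread, please.\"",
+    model_name: "Xenova/all-MiniLM-L6-v2", // included because returnEmbedding is true
+    dtype: "fp32",
+    embedding: [0.0123, -0.0456 /* ...one value per embedding dimension */]
+};
+// token_length is added as well when returnTokenLength is also set to true
+```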
--- diff --git a/chunkit.js b/chunkit.js index bdf82a6..081a95b 100644 --- a/chunkit.js +++ b/chunkit.js @@ -39,7 +39,6 @@ export async function chunkit( combineChunks = DEFAULT_CONFIG.COMBINE_CHUNKS, combineChunksSimilarityThreshold = DEFAULT_CONFIG.COMBINE_CHUNKS_SIMILARITY_THRESHOLD, onnxEmbeddingModel = DEFAULT_CONFIG.ONNX_EMBEDDING_MODEL, - onnxEmbeddingModelQuantized, // legacy boolean (remove in next major version) dtype = DEFAULT_CONFIG.DTYPE, localModelPath = DEFAULT_CONFIG.LOCAL_MODEL_PATH, modelCacheDir = DEFAULT_CONFIG.MODEL_CACHE_DIR, @@ -56,9 +55,6 @@ export async function chunkit( throw new Error('Input must be an array of document objects'); } - // if legacy boolean is used (onnxEmbeddingModelQuantized), set dtype (model precision) to 'q8' - if (onnxEmbeddingModelQuantized === true) { dtype = 'q8'; } - // Initialize embedding utilities and set optional paths const { modelName, dtype: usedDtype } = await initializeEmbeddingUtils( onnxEmbeddingModel, @@ -193,7 +189,6 @@ export async function cramit( logging = DEFAULT_CONFIG.LOGGING, maxTokenSize = DEFAULT_CONFIG.MAX_TOKEN_SIZE, onnxEmbeddingModel = DEFAULT_CONFIG.ONNX_EMBEDDING_MODEL, - onnxEmbeddingModelQuantized, // legacy boolean (remove in next major version) dtype = DEFAULT_CONFIG.DTYPE, localModelPath = DEFAULT_CONFIG.LOCAL_MODEL_PATH, modelCacheDir = DEFAULT_CONFIG.MODEL_CACHE_DIR, @@ -210,11 +205,8 @@ export async function cramit( throw new Error('Input must be an array of document objects'); } - // if legacy boolean is used (onnxEmbeddingModelQuantized), set dtype (model precision) to 'q8' - if (onnxEmbeddingModelQuantized === true) { dtype = 'q8'; } - // Initialize embedding utilities with paths - const { modelName, isQuantized } = await initializeEmbeddingUtils( + await initializeEmbeddingUtils( onnxEmbeddingModel, dtype, localModelPath, @@ -259,8 +251,8 @@ export async function cramit( document_name: documentName, number_of_chunks: numberOfChunks, chunk_number: index + 1, - model_name: modelName, - is_model_quantized: isQuantized, + model_name: onnxEmbeddingModel, + dtype: dtype, text: prefixedChunk }; @@ -296,3 +288,111 @@ export async function cramit( // Flatten the results array since we're processing multiple documents return allResults.flat(); } + + +// ------------------------------ +// -- Main sentenceit function -- +// ------------------------------ +export async function sentenceit( + documents, + { + logging = DEFAULT_CONFIG.LOGGING, + onnxEmbeddingModel = DEFAULT_CONFIG.ONNX_EMBEDDING_MODEL, + dtype = DEFAULT_CONFIG.DTYPE, + localModelPath = DEFAULT_CONFIG.LOCAL_MODEL_PATH, + modelCacheDir = DEFAULT_CONFIG.MODEL_CACHE_DIR, + returnEmbedding = DEFAULT_CONFIG.RETURN_EMBEDDING, + returnTokenLength = DEFAULT_CONFIG.RETURN_TOKEN_LENGTH, + chunkPrefix = DEFAULT_CONFIG.CHUNK_PREFIX, + excludeChunkPrefixInResults = false, + } = {}) { + + if(logging) { printVersion(); } + + // Input validation + if (!Array.isArray(documents)) { + throw new Error('Input must be an array of document objects'); + } + + if (returnEmbedding) { + // Initialize embedding utilities with paths + await initializeEmbeddingUtils( + onnxEmbeddingModel, + dtype, + localModelPath, + modelCacheDir + ); + } + + // Process each document + const allResults = await Promise.all(documents.map(async (doc) => { + if (!doc.document_text) { + throw new Error('Each document must have a document_text property'); + } + + // Split the text into sentences + const chunks = []; + for (const { segment } of splitBySentence(doc.document_text)) { + 
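+            // each item yielded by splitBySentence exposes the sentence text on its `segment` property; trim it and collect it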
chunks.push(segment.trim());
+        }
+
+        if (logging) {
+            console.log('\nSENTENCEIT');
+            console.log('=============\nSentences\n=============');
+            chunks.forEach((chunk, index) => {
+                console.log("\n");
+                console.log(`--------------`);
+                console.log(`-- Sentence ${(index + 1)} --`);
+                console.log(`--------------`);
+                console.log(chunk.substring(0, 50) + '...');
+            });
+        }
+
+        const documentName = doc.document_name || ""; // Normalize document_name
+        const documentId = Date.now();
+        const numberOfChunks = chunks.length;
+
+        return Promise.all(chunks.map(async (chunk, index) => {
+            const prefixedChunk = chunkPrefix ? applyPrefixToChunk(chunkPrefix, chunk) : chunk;
+            const result = {
+                document_id: documentId,
+                document_name: documentName,
+                number_of_sentences: numberOfChunks,
+                sentence_number: index + 1,
+                text: prefixedChunk
+            };
+
+            if (returnEmbedding) {
+                result.model_name = onnxEmbeddingModel;
+                result.dtype = dtype;
+                result.embedding = await createEmbedding(prefixedChunk);
+
+                if (returnTokenLength) {
+                    try {
+                        const encoded = await tokenizer(prefixedChunk, { padding: true });
+                        if (encoded && encoded.input_ids) {
+                            result.token_length = encoded.input_ids.size;
+                        } else {
+                            console.error('Tokenizer returned unexpected format:', encoded);
+                            result.token_length = 0;
+                        }
+                    } catch (error) {
+                        console.error('Error during tokenization:', error);
+                        result.token_length = 0;
+                    }
+                }
+
+                // Remove prefix if requested (after embedding calculation)
+                if (excludeChunkPrefixInResults && chunkPrefix && chunkPrefix.trim()) {
+                    const prefixPattern = new RegExp(`^${chunkPrefix}:\\s*`);
+                    result.text = result.text.replace(prefixPattern, '');
+                }
+            }
+
+            return result;
+        }));
+    }));
+
+    // Flatten the results array since we're processing multiple documents
+    return allResults.flat();
+}
diff --git a/example/example-chunkit.js b/example/example-chunkit.js index 6d595d0..b4df8ec 100644 --- a/example/example-chunkit.js +++ b/example/example-chunkit.js @@ -54,7 +54,7 @@ let trackedTimeSeconds = (endTime - startTime) / 1000;
 trackedTimeSeconds = parseFloat(trackedTimeSeconds.toFixed(2));
 console.log("\n\n");
-// console.log("myTestChunks:");
-// console.log(myTestChunks);
+console.log("myTestChunks:");
+console.log(myTestChunks);
 console.log("length: " + myTestChunks.length);
 console.log("trackedTimeSeconds: " + trackedTimeSeconds);
\ No newline at end of file
diff --git a/example/example-sentenceit.js b/example/example-sentenceit.js new file mode 100644 index 0000000..b74b19a --- /dev/null +++ b/example/example-sentenceit.js @@ -0,0 +1,55 @@
+// -----------------------
+// -- example-sentenceit.js --
+// --------------------------------------------------------------------------------
+// this is an example of how to use the sentenceit function
+// first we import the sentenceit function
+// then we set up the documents array with a text
+// then we call the sentenceit function with the text and an options object
+// the options object is optional
+//
+// the sentenceit function simply splits the text into sentences and returns one result per sentence
+// useful when you just need a clean split, not semantic chunking
+// --------------------------------------------------------------------------------
+
+import { sentenceit } from '../chunkit.js'; // this is typically just "import { sentenceit } from 'semantic-chunking';", but this is a local test
+import fs from 'fs';
+
+// initialize documents array
+let documents = [];
+let textFiles = ['./example3.txt'];
+
+// read each text file and add it to the documents array
+for 
(const textFile of textFiles) {
+    documents.push({
+        document_name: textFile,
+        document_text: await fs.promises.readFile(textFile, 'utf8')
+    });
+}
+
+// start timing
+const startTime = performance.now();
+
+let myTestSentences = await sentenceit(
+    documents,
+    {
+        logging: false,
+        onnxEmbeddingModel: "Xenova/all-MiniLM-L6-v2",
+        dtype: 'fp32',
+        localModelPath: "../models",
+        modelCacheDir: "../models",
+        returnEmbedding: true,
+    }
+);
+
+// end timing
+const endTime = performance.now();
+
+// calculate tracked time in seconds
+let trackedTimeSeconds = (endTime - startTime) / 1000;
+trackedTimeSeconds = parseFloat(trackedTimeSeconds.toFixed(2));
+
+console.log("\n\n\n");
+console.log("myTestSentences:");
+console.log(myTestSentences);
+console.log("length: " + myTestSentences.length);
+console.log("trackedTimeSeconds: " + trackedTimeSeconds);
\ No newline at end of file
diff --git a/package-lock.json b/package-lock.json index c985f5c..22a477a 100644 --- a/package-lock.json +++ b/package-lock.json @@ -1,12 +1,12 @@
 {
   "name": "semantic-chunking",
-  "version": "2.3.9",
+  "version": "2.4.0",
   "lockfileVersion": 3,
   "requires": true,
   "packages": {
     "": {
       "name": "semantic-chunking",
-      "version": "2.3.9",
+      "version": "2.4.0",
       "license": "ISC",
       "dependencies": {
         "@huggingface/transformers": "^3.1.2",
diff --git a/package.json b/package.json index 93af9e5..81890c3 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@
 {
   "name": "semantic-chunking",
-  "version": "2.3.9",
+  "version": "2.4.0",
   "description": "Semantically create chunks from large texts. Useful for workflows involving large language models (LLMs).",
   "homepage": "https://www.equilllabs.com/projects/semantic-chunking",
   "repository": {