v2.4.0

jparkerweb committed Dec 13, 2024
1 parent c03b3e2 commit bc1a4a1
Showing 7 changed files with 250 additions and 39 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,10 @@

All notable changes to this project will be documented in this file.

## [2.4.0] - 2024-12-13
### ✨ Added
- Added `sentenceit` function.

## [2.3.7] - 2024-11-25
### 📦 Updated
- Update `string-segmenter` patch version
98 changes: 75 additions & 23 deletions README.md
@@ -4,7 +4,7 @@ NPM Package for Semantically creating chunks from large texts. Useful for workfl

### Maintained by
<a href="https://www.equilllabs.com">
<img src="https://raw.githubusercontent.com/jparkerweb/eQuill-Labs/refs/heads/main/src/static/images/logo-text-outline.png" alt="eQuill Labs" height="40">
<img src="https://raw.githubusercontent.com/jparkerweb/eQuill-Labs/refs/heads/main/src/static/images/logo-text-outline.png" alt="eQuill Labs" height="32">
</a>

## Features
@@ -260,6 +260,33 @@ The Semantic Chunking Web UI allows you to experiment with the chunking paramete

There is an additional function you can import to just "cram" sentences together until they meet your target token size, for when you just need quick, high-density chunks.


## Parameters

`cramit` accepts an array of document objects and an optional configuration object. Here are the details for each parameter:

- `documents`: array of documents. Each document is an object containing `document_name` and `document_text`.
```
documents = [
{ document_name: "document1", document_text: "..." },
{ document_name: "document2", document_text: "..." },
...
]
```

- **Cramit Options Object:**

- `logging`: Boolean (optional, default `false`) - Enables logging of detailed processing steps.
- `maxTokenSize`: Integer (optional, default `500`) - Maximum token size for each chunk.
- `onnxEmbeddingModel`: String (optional, default `Xenova/all-MiniLM-L6-v2`) - ONNX model used for creating embeddings.
- `dtype`: String (optional, default `fp32`) - Precision of the embedding model (options: `fp32`, `fp16`, `q8`, `q4`).
- `localModelPath`: String (optional, default `null`) - Local path to save and load models (example: `./models`).
- `modelCacheDir`: String (optional, default `null`) - Directory to cache downloaded models (example: `./models`).
- `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in `onnxEmbeddingModel`.
- `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`.
- `chunkPrefix`: String (optional, default `null`) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
- `excludeChunkPrefixInResults`: Boolean (optional, default `false`) - If set to `true`, the chunk prefix will be removed from the results. This is useful when you want to remove the prefix from the results while still maintaining the prefix for embedding calculations.

Basic usage:

```javascript
@@ -285,37 +312,62 @@
main();
```

Look at the `example/example-cramit.js` file in the root of this project for a more complex example of using all the optional parameters.
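For reference, here is a minimal sketch of what a call using most of the optional parameters might look like (all option values below are illustrative assumptions, not recommendations):

```javascript
import { cramit } from 'semantic-chunking';

let documents = [
    { document_name: "doc1", document_text: "Lots of text to pack into dense chunks..." }
];

// every option shown is optional; values here are examples only
let myChunks = await cramit(documents, {
    logging: false,                                 // log detailed processing steps
    maxTokenSize: 300,                              // pack sentences until ~300 tokens per chunk
    onnxEmbeddingModel: "Xenova/all-MiniLM-L6-v2",  // embedding model to use
    dtype: "q8",                                    // quantized precision, if the model supports it
    returnEmbedding: true,                          // include an embedding vector per chunk
    returnTokenLength: true,                        // include token counts per chunk
    chunkPrefix: "search_document",                 // illustrative prefix for prefix-trained models
    excludeChunkPrefixInResults: true,              // strip the prefix from the returned text
});

console.log(myChunks);
```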

### Tuning

The behavior of the `chunkit` function can be finely tuned using several optional parameters in the options object. Understanding how each parameter affects the function can help you optimize the chunking process for your specific requirements.

#### `logging`

- **Type**: Boolean
- **Default**: `false`
- **Description**: Enables detailed debug output during the chunking process. Turning this on can help in diagnosing how chunks are formed or why certain chunks are combined.

#### `maxTokenSize`

- **Type**: Integer
- **Default**: `500`
- **Description**: Sets the maximum number of tokens allowed in a single chunk. Smaller values result in smaller, more numerous chunks, while larger values can create fewer, larger chunks. It’s crucial for maintaining manageable chunk sizes when processing large texts.

#### `onnxEmbeddingModel`

- **Type**: String
- **Default**: `Xenova/paraphrase-multilingual-MiniLM-L12-v2`
- **Description**: Specifies the model used to generate sentence embeddings. Different models may yield different qualities of embeddings, affecting the chunking quality, especially in multilingual contexts.
- **Resource Link**: [ONNX Embedding Models](https://huggingface.co/models?pipeline_tag=feature-extraction&library=onnx&sort=trending)
  Link to a filtered list of embedding models converted to ONNX library format by Xenova.
  Refer to the Model table below for a list of suggested models and their sizes (choose a multilingual model if you need to chunk text other than English).

#### `dtype`

- **Type**: String
- **Default**: `fp32`
- **Description**: Indicates the precision of the embedding model. Options are `fp32`, `fp16`, `q8`, and `q4`. `fp32` is the highest precision, but also the largest size and slowest to load. `q8` is a good compromise between size and speed if the model supports it. All models support `fp32`, but only some support `fp16`, `q8`, and `q4`.

---

## `sentenceit` - ✂️ When you just need a Clean Split

There is an additional function you can import when you just need to split text into individual sentences.

## Parameters

`sentenceit` accepts an array of document objects and an optional configuration object. Here are the details for each parameter:

- `documents`: array of documents. Each document is an object containing `document_name` and `document_text`.
```
documents = [
    { document_name: "document1", document_text: "..." },
    { document_name: "document2", document_text: "..." },
    ...
]
```

- **Sentenceit Options Object:**

- `logging`: Boolean (optional, default `false`) - Enables logging of detailed processing steps.
- `onnxEmbeddingModel`: String (optional, default `Xenova/all-MiniLM-L6-v2`) - ONNX model used for creating embeddings.
- `dtype`: String (optional, default `fp32`) - Precision of the embedding model (options: `fp32`, `fp16`, `q8`, `q4`).
- `localModelPath`: String (optional, default `null`) - Local path to save and load models (example: `./models`).
- `modelCacheDir`: String (optional, default `null`) - Directory to cache downloaded models (example: `./models`).
- `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each sentence will include an embedding vector. This is useful for applications that require semantic understanding of the sentences. The embedding model will be the same as the one specified in `onnxEmbeddingModel`.
- `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each sentence will include its token length. This can be useful for understanding the size of each sentence in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`.
- `chunkPrefix`: String (optional, default `null`) - A prefix to add to each sentence (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
- `excludeChunkPrefixInResults`: Boolean (optional, default `false`) - If set to `true`, the prefix will be removed from the results. This is useful when you want clean sentence text in the results while still using the prefix for embedding calculations.

Basic usage:

```javascript
import { sentenceit } from 'semantic-chunking';

let duckText = "A duck waddles into a bakery and quacks to the baker, \"I'll have a loaf of bread, please.\" The baker, amused, quickly wraps the loaf and hands it over. The duck takes a nibble, looks around, and then asks, \"Do you have any seeds to go with this?\" The baker, chuckling, replies, \"Sorry, we're all out of seeds today.\" The duck nods and continues nibbling on its bread, clearly unfazed by the lack of seed toppings. Just another day in the life of a bread-loving waterfowl! 🦆🍞";

// initialize documents array and add the duck text to it
let documents = [];
documents.push({
document_name: "duck document",
document_text: duckText
});

// call the sentenceit function passing in the documents array and the options object
async function main() {
let myDuckChunks = await sentenceit(documents, { returnEmbedding: true });
console.log("myDuckChunks", myDuckChunks);
}
main();

```

Look at the `example/example-sentenceit.js` file in the root of this project for a more complex example of using all the optional parameters.
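Below is a short sketch of combining the prefix options (the `search_document` prefix value is an illustrative assumption; use whatever task prefix your embedding model expects):

```javascript
import { sentenceit } from 'semantic-chunking';

let documents = [
    { document_name: "notes", document_text: "First sentence. Second one follows. A third wraps up." }
];

// embeddings are calculated on the prefixed text,
// but the prefix is stripped from the returned `text`
let sentences = await sentenceit(documents, {
    returnEmbedding: true,
    returnTokenLength: true,
    chunkPrefix: "search_document",     // illustrative task prefix
    excludeChunkPrefixInResults: true,  // return clean sentence text
});

// each result object carries fields such as document_id, document_name,
// number_of_sentences, sentence_number, model_name, dtype, embedding,
// token_length, and text
console.log(sentences[0]);
```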

---

122 changes: 111 additions & 11 deletions chunkit.js
@@ -39,7 +39,6 @@ export async function chunkit(
combineChunks = DEFAULT_CONFIG.COMBINE_CHUNKS,
combineChunksSimilarityThreshold = DEFAULT_CONFIG.COMBINE_CHUNKS_SIMILARITY_THRESHOLD,
onnxEmbeddingModel = DEFAULT_CONFIG.ONNX_EMBEDDING_MODEL,
onnxEmbeddingModelQuantized, // legacy boolean (remove in next major version)
dtype = DEFAULT_CONFIG.DTYPE,
localModelPath = DEFAULT_CONFIG.LOCAL_MODEL_PATH,
modelCacheDir = DEFAULT_CONFIG.MODEL_CACHE_DIR,
@@ -56,9 +55,6 @@
throw new Error('Input must be an array of document objects');
}

// if legacy boolean is used (onnxEmbeddingModelQuantized), set dtype (model precision) to 'q8'
if (onnxEmbeddingModelQuantized === true) { dtype = 'q8'; }

// Initialize embedding utilities and set optional paths
const { modelName, dtype: usedDtype } = await initializeEmbeddingUtils(
onnxEmbeddingModel,
@@ -193,7 +189,6 @@ export async function cramit(
logging = DEFAULT_CONFIG.LOGGING,
maxTokenSize = DEFAULT_CONFIG.MAX_TOKEN_SIZE,
onnxEmbeddingModel = DEFAULT_CONFIG.ONNX_EMBEDDING_MODEL,
onnxEmbeddingModelQuantized, // legacy boolean (remove in next major version)
dtype = DEFAULT_CONFIG.DTYPE,
localModelPath = DEFAULT_CONFIG.LOCAL_MODEL_PATH,
modelCacheDir = DEFAULT_CONFIG.MODEL_CACHE_DIR,
@@ -210,11 +205,8 @@
throw new Error('Input must be an array of document objects');
}

// if legacy boolean is used (onnxEmbeddingModelQuantized), set dtype (model precision) to 'q8'
if (onnxEmbeddingModelQuantized === true) { dtype = 'q8'; }

// Initialize embedding utilities with paths
const { modelName, isQuantized } = await initializeEmbeddingUtils(
await initializeEmbeddingUtils(
onnxEmbeddingModel,
dtype,
localModelPath,
@@ -259,8 +251,8 @@
document_name: documentName,
number_of_chunks: numberOfChunks,
chunk_number: index + 1,
model_name: modelName,
is_model_quantized: isQuantized,
model_name: onnxEmbeddingModel,
dtype: dtype,
text: prefixedChunk
};

@@ -296,3 +288,111 @@
// Flatten the results array since we're processing multiple documents
return allResults.flat();
}


// ------------------------------
// -- Main sentenceit function --
// ------------------------------
export async function sentenceit(
documents,
{
logging = DEFAULT_CONFIG.LOGGING,
onnxEmbeddingModel = DEFAULT_CONFIG.ONNX_EMBEDDING_MODEL,
dtype = DEFAULT_CONFIG.DTYPE,
localModelPath = DEFAULT_CONFIG.LOCAL_MODEL_PATH,
modelCacheDir = DEFAULT_CONFIG.MODEL_CACHE_DIR,
returnEmbedding = DEFAULT_CONFIG.RETURN_EMBEDDING,
returnTokenLength = DEFAULT_CONFIG.RETURN_TOKEN_LENGTH,
chunkPrefix = DEFAULT_CONFIG.CHUNK_PREFIX,
excludeChunkPrefixInResults = false,
} = {}) {

if(logging) { printVersion(); }

// Input validation
if (!Array.isArray(documents)) {
throw new Error('Input must be an array of document objects');
}

if (returnEmbedding) {
// Initialize embedding utilities with paths
await initializeEmbeddingUtils(
onnxEmbeddingModel,
dtype,
localModelPath,
modelCacheDir
);
}

// Process each document
const allResults = await Promise.all(documents.map(async (doc) => {
if (!doc.document_text) {
throw new Error('Each document must have a document_text property');
}

// Split the text into sentences
const chunks = [];
for (const { segment } of splitBySentence(doc.document_text)) {
chunks.push(segment.trim());
}

if (logging) {
console.log('\nSENTENCEIT');
console.log('=============\nSentences\n=============');
chunks.forEach((chunk, index) => {
console.log("\n");
console.log(`--------------`);
console.log(`-- Sentence ${(index + 1)} --`);
console.log(`--------------`);
console.log(chunk.substring(0, 50) + '...');
});
}

const documentName = doc.document_name || ""; // Normalize document_name
const documentId = Date.now();
const numberOfChunks = chunks.length;

return Promise.all(chunks.map(async (chunk, index) => {
const prefixedChunk = chunkPrefix ? applyPrefixToChunk(chunkPrefix, chunk) : chunk;
const result = {
document_id: documentId,
document_name: documentName,
number_of_sentences: numberOfChunks,
sentence_number: index + 1,
text: prefixedChunk
};

if (returnEmbedding) {
result.model_name = onnxEmbeddingModel;
result.dtype = dtype;
result.embedding = await createEmbedding(prefixedChunk);

if (returnTokenLength) {
try {
const encoded = await tokenizer(prefixedChunk, { padding: true });
if (encoded && encoded.input_ids) {
result.token_length = encoded.input_ids.size;
} else {
console.error('Tokenizer returned unexpected format:', encoded);
result.token_length = 0;
}
} catch (error) {
console.error('Error during tokenization:', error);
result.token_length = 0;
}
}

// Remove prefix if requested (after embedding calculation)
if (excludeChunkPrefixInResults && chunkPrefix && chunkPrefix.trim()) {
const prefixPattern = new RegExp(`^${chunkPrefix}:\\s*`);
result.text = result.text.replace(prefixPattern, '');
}
}

return result;
}));
}));

// Flatten the results array since we're processing multiple documents
return allResults.flat();
}
4 changes: 2 additions & 2 deletions example/example-chunkit.js
@@ -54,7 +54,7 @@ let trackedTimeSeconds = (endTime - startTime) / 1000;
trackedTimeSeconds = parseFloat(trackedTimeSeconds.toFixed(2));

console.log("\n\n");
// console.log("myTestChunks:");
// console.log(myTestChunks);
console.log("myTestChunks:");
console.log(myTestChunks);
console.log("length: " + myTestChunks.length);
console.log("trackedTimeSeconds: " + trackedTimeSeconds);
55 changes: 55 additions & 0 deletions example/example-sentenceit.js
@@ -0,0 +1,55 @@
// ---------------------------
// -- example-sentenceit.js --
// --------------------------------------------------------------------------------
// this is an example of how to use the sentenceit function
// first we import the sentenceit function
// then we set up the documents array with a text file
// then we call the sentenceit function with the documents array and an options object
// the options object is optional
//
// the sentenceit function simply splits text into individual sentences
// useful when you need clean sentence splits, but not full semantic chunking
// --------------------------------------------------------------------------------

import { sentenceit } from '../chunkit.js'; // this is typically just "import { sentenceit } from 'semantic-chunking';", but this is a local test
import fs from 'fs';

// initialize documents array
let documents = [];
let textFiles = ['./example3.txt'];

// read each text file and add it to the documents array
for (const textFile of textFiles) {
documents.push({
document_name: textFile,
document_text: await fs.promises.readFile(textFile, 'utf8')
});
}

// start timing
const startTime = performance.now();

let myTestSentences = await sentenceit(
documents,
{
logging: false,
onnxEmbeddingModel: "Xenova/all-MiniLM-L6-v2",
dtype: 'fp32',
localModelPath: "../models",
modelCacheDir: "../models",
returnEmbedding: true,
}
);

// end timing
const endTime = performance.now();

// calculate tracked time in seconds
let trackedTimeSeconds = (endTime - startTime) / 1000;
trackedTimeSeconds = parseFloat(trackedTimeSeconds.toFixed(2));

console.log("\n\n\n");
console.log("myTestSentences:");
console.log(myTestSentences);
console.log("length: " + myTestSentences.length);
console.log("trackedTimeSeconds: " + trackedTimeSeconds);
4 changes: 2 additions & 2 deletions package-lock.json


2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "semantic-chunking",
"version": "2.3.9",
"version": "2.4.0",
"description": "Semantically create chunks from large texts. Useful for workflows involving large language models (LLMs).",
"homepage": "https://www.equilllabs.com/projects/semantic-chunking",
"repository": {
