JSON Chunking and Tokenization Tool

This Python script processes JSON files by splitting the text content into chunks, ensuring each chunk meets specified token length requirements. The tool is designed to handle nested JSON structures and can process multiple files within a directory.

Features

Splits text content in JSON files into manageable chunks.
Ensures chunks meet specified token length requirements.
Handles nested JSON structures.
Processes multiple JSON files in a directory.
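
As an illustration of the nested-structure handling listed above, here is a minimal sketch of how "content" entries could be located at any depth of a JSON document. The helper name iter_content_nodes is hypothetical and not taken from the script. The sketch assumes only that the text lives under a "content" key, as described in Usage.

```python
from typing import Any, Iterator


def iter_content_nodes(node: Any, key: str = "content") -> Iterator[dict]:
    """Yield every dict in a nested JSON structure that holds a string under `key`.

    Hypothetical helper: the real script may traverse the data differently.
    """
    if isinstance(node, dict):
        # A dict qualifies only if the value under `key` is actual text.
        if isinstance(node.get(key), str):
            yield node
        # Keep descending: nested objects may carry their own "content".
        for value in node.values():
            yield from iter_content_nodes(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_content_nodes(item, key)
    # Scalars (str, int, None, ...) cannot contain further nodes.
```

Yielding the containing dict, rather than the string itself, lets the caller attach the chunk list next to the text it came from.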

Installation

Clone the repository:

git clone https://github.com/MelikaMirdamadi/data-chunking.git
cd data-chunking

Set up a Python virtual environment (optional but recommended):

python -m venv venv
source venv/bin/activate   # On Windows, use `venv\Scripts\activate`

Install the required packages:

pip install sentence-transformers langchain

Usage

Prepare your input directory:

Ensure you have an input directory named input-file-path containing JSON files with the key "content" holding the text you want to process.

Run the script:

Execute the script using Python:
```
python data-chunking.py
```
Output:

The processed JSON files with added chunks will be saved in the specified output directory (output-file-path).

Code Overview

Main Functions

fix_token_wise_length: Adjusts the size of text chunks based on token length.
get_tokenizer: Retrieves the tokenizer for the specified model.
chunk_text: Splits text into chunks using the specified tokenizer and splitter.
process_json_file: Processes a single JSON file, adding text chunks to the JSON data.
process_json_files_in_directory: Processes all JSON files in a directory.
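
The file-level functions above can be sketched roughly as follows. This is an assumption-laden outline rather than the script's actual code. The signatures, the chunker callback, and the returned file count are all hypothetical. The only details taken from this README are the "content" input key and the "CHUNK__" output key.

```python
import json
from pathlib import Path
from typing import Callable, List

# Hypothetical type: any function that turns text into a list of chunks.
Chunker = Callable[[str], List[str]]


def process_json_file(src: Path, dst: Path, chunker: Chunker) -> None:
    """Read one JSON file, add chunks under "CHUNK__", and write the result."""
    data = json.loads(src.read_text(encoding="utf-8"))
    # Only top-level "content" is handled here; nested handling is omitted.
    if isinstance(data, dict) and isinstance(data.get("content"), str):
        data["CHUNK__"] = chunker(data["content"])
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(json.dumps(data, ensure_ascii=False, indent=4), encoding="utf-8")


def process_json_files_in_directory(
    input_directory: str, output_directory: str, chunker: Chunker
) -> int:
    """Process every *.json file in input_directory; return how many were handled."""
    count = 0
    for src in sorted(Path(input_directory).glob("*.json")):
        process_json_file(src, Path(output_directory) / src.name, chunker)
        count += 1
    return count
```

Passing the chunker in as a callback keeps the file handling independent of the tokenizer, so a simple stand-in such as `lambda text: text.split()` is enough to try the pipeline without downloading a model.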

Example

Given an input JSON file (input-file-path/example.json):

{
    "content": "This is some sample text content that will be split into chunks."
}

After running the script, the output JSON file (output-file-path/example.json) will include the processed chunks:

{
    "content": "This is some sample text content that will be split into chunks.",
    "CHUNK__": [
        "This is some sample text content that",
        "will be split into chunks."
    ]
}
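
To check an output file against the shape shown above, a small sanity check along these lines may help. The function name validate_chunked_record is hypothetical. It also assumes each chunk appears verbatim in the original text, which may not hold if the splitter normalizes whitespace.

```python
def validate_chunked_record(record: dict) -> bool:
    """Return True if `record` has text under "content" and a list of
    non-empty string chunks under "CHUNK__" that each occur in that text."""
    content = record.get("content")
    chunks = record.get("CHUNK__")
    if not isinstance(content, str) or not isinstance(chunks, list):
        return False
    # Assumption: chunks are verbatim substrings of the original content.
    return all(isinstance(c, str) and c and c in content for c in chunks)


example = {
    "content": "This is some sample text content that will be split into chunks.",
    "CHUNK__": [
        "This is some sample text content that",
        "will be split into chunks.",
    ],
}
print(validate_chunked_record(example))
```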

Customization

Adjusting Chunk Sizes: Modify the max_chunk_size, chunk_overlap, and min_chunk_size parameters in the chunk_text function to customize chunking behavior.
Changing Input/Output Paths: Update the input_directory and output_directory variables in the __main__ block to specify different directories.
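
To show how the three size parameters interact, here is a simplified sketch that uses whitespace-separated words as a stand-in for model tokens. It is not the script's implementation: the function name chunk_tokens, the default values, and the choice to merge an undersized final chunk into the previous one are all assumptions made for illustration.

```python
from typing import List


def chunk_tokens(
    text: str,
    max_chunk_size: int = 8,
    chunk_overlap: int = 2,
    min_chunk_size: int = 3,
) -> List[str]:
    """Split text into overlapping windows of whitespace tokens."""
    if not 0 <= chunk_overlap < max_chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < max_chunk_size")
    tokens = text.split()
    step = max_chunk_size - chunk_overlap  # how far each window advances
    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    # Assumed policy: fold a too-short final window into the previous one,
    # skipping the tokens the previous window already covers via overlap.
    if len(windows) > 1 and len(windows[-1]) < min_chunk_size:
        tail = windows.pop()
        windows[-1].extend(tail[chunk_overlap:])
    return [" ".join(window) for window in windows]
```

Note that with this merging policy the last chunk can exceed max_chunk_size by up to min_chunk_size - 1 - chunk_overlap tokens. Dropping the short chunk instead would keep the upper bound strict at the cost of losing text.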

Contributing

Contributions are welcome! If you have suggestions for improvements or new features, feel free to create an issue or submit a pull request.

License

This project is licensed under the MIT License.

