This Python script processes JSON files by splitting the text content into chunks, ensuring each chunk meets specified token length requirements. The tool is designed to handle nested JSON structures and can process multiple files within a directory.
- Splits text content in JSON files into manageable chunks.
- Ensures chunks meet specified token length requirements.
- Handles nested JSON structures.
- Processes multiple JSON files in a directory.
- Clone the repository:
    git clone https://github.com/MelikaMirdamadi/data-chunking.git
    cd data-chunking
- Set up a Python virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install the required packages:
    pip install sentence-transformers langchain
- Prepare your input directory: Make sure an input directory named input-file-path exists and contains JSON files whose "content" key holds the text you want to process.
- Run the script with Python:
    python main.py
- Output: The processed JSON files, with the chunks added, are saved in the specified output directory (output-file-path).
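The directory step above can be pictured with a short sketch. It is not the script's own code: `process_directory` and its `transform` argument are hypothetical names, and the sketch only assumes that every `*.json` file in the input directory is read, transformed, and written under the same file name in the output directory.

```python
import json
from pathlib import Path
from typing import Any, Callable


def process_directory(input_dir: str, output_dir: str,
                      transform: Callable[[Any], Any]) -> list[str]:
    """Apply `transform` to every *.json file in input_dir.

    Hypothetical helper: results are written to output_dir under the
    same file name, and the names of the written files are returned.
    """
    src, dst = Path(input_dir), Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the output directory if missing

    written = []
    for path in sorted(src.glob("*.json")):  # non-JSON files are skipped
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        out_path = dst / path.name
        with out_path.open("w", encoding="utf-8") as fh:
            json.dump(transform(data), fh, ensure_ascii=False, indent=2)
        written.append(out_path.name)
    return written
```

A call such as `process_directory("input-file-path", "output-file-path", my_transform)` would mirror the layout described above, with `my_transform` standing in for whatever chunking function is used.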
- fix_token_wise_length: Adjusts the size of text chunks based on token length.
- get_tokenizer: Retrieves the tokenizer for the specified model.
- chunk_text: Splits text into chunks using the specified tokenizer and splitter.
- process_json_file: Processes a single JSON file, adding text chunks to the JSON data.
- process_json_files_in_directory: Processes all JSON files in a directory.
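To make the idea of size-limited, overlapping chunks concrete, here is a minimal sketch. It is an assumption-laden stand-in, not the script's `chunk_text`: the hypothetical `chunk_words` function counts whitespace-separated words instead of model tokens, so it runs without `sentence-transformers` or `langchain`.

```python
def chunk_words(text: str, max_chunk_size: int = 8, chunk_overlap: int = 2) -> list[str]:
    """Split text into windows of at most max_chunk_size words.

    Consecutive windows share chunk_overlap words. Words are a simplified
    stand-in for the tokens a real tokenizer would produce.
    """
    if not 0 <= chunk_overlap < max_chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than max_chunk_size")

    words = text.split()
    step = max_chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_chunk_size]))
        if start + max_chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g h i j", max_chunk_size=4, chunk_overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

With a real tokenizer the boundaries would fall on token counts rather than word counts, so the exact splits would differ.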
Given an input JSON file (input-file-path/example.json):
{
  "content": "This is some sample text content that will be split into chunks."
}
After running the script, the output JSON file (output-file-path/example.json) will include the processed chunks:
{
  "content": "This is some sample text content that will be split into chunks.",
  "CHUNK__": [
    "This is some sample text content that",
    "will be split into chunks."
  ]
}
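The example shows a flat file, but the tool is also described as handling nested JSON. One plausible way to do that is sketched below. How the script actually traverses nested data is not documented here, so `add_chunks` is a hypothetical helper that assumes a "CHUNK__" list is placed next to every string-valued "content" key at any depth.

```python
from typing import Any, Callable


def add_chunks(node: Any, chunker: Callable[[str], list[str]]) -> Any:
    """Return a copy of `node` with "CHUNK__" added beside each "content" string.

    Dictionaries and lists are walked recursively; other values are
    returned unchanged. The input structure is not modified.
    """
    if isinstance(node, dict):
        out = {key: add_chunks(value, chunker) for key, value in node.items()}
        if isinstance(node.get("content"), str):
            out["CHUNK__"] = chunker(node["content"])
        return out
    if isinstance(node, list):
        return [add_chunks(item, chunker) for item in node]
    return node


data = {"content": "first part", "sections": [{"content": "second part here"}]}
result = add_chunks(data, lambda text: text.split())  # toy chunker: one word per chunk
print(result["sections"][0]["CHUNK__"])
# ['second', 'part', 'here']
```

Returning a new structure rather than mutating the input keeps the original data available for comparison, at the cost of an extra copy in memory.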
- Adjusting Chunk Sizes: Modify the max_chunk_size, chunk_overlap, and min_chunk_size parameters in the chunk_text function to customize chunking behavior.
- Changing Input/Output Paths: Update the input_directory and output_directory variables in the __main__ block to specify different directories.
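As a rough illustration of what a minimum chunk size can mean in practice, the sketch below merges undersized chunks into their predecessor. This is only one possible interpretation: `merge_short_chunks` is a hypothetical function, it counts words rather than tokens, and the script's `fix_token_wise_length` may behave differently.

```python
def merge_short_chunks(chunks: list[str], min_chunk_size: int) -> list[str]:
    """Append any chunk shorter than min_chunk_size words to the previous chunk.

    A short first chunk has no predecessor and is kept as is. Merging can
    push a chunk past the maximum size, which this sketch does not check.
    """
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_chunk_size:
            merged[-1] = f"{merged[-1]} {chunk}"
        else:
            merged.append(chunk)
    return merged


print(merge_short_chunks(["one two three", "four", "five six seven", "eight"], min_chunk_size=2))
# ['one two three four', 'five six seven eight']
```

Raising the minimum yields fewer, longer chunks, while a larger overlap repeats more context between neighbouring chunks.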
Contributions are welcome! If you have suggestions for improvements or new features, feel free to create an issue or submit a pull request.
This project is licensed under the MIT License.