Merge branch 'main' into develop
PabloSanchi authored Aug 18, 2024
2 parents 7efc2bb + e845835 commit 68fbce0
Showing 14 changed files with 674 additions and 85 deletions.
93 changes: 65 additions & 28 deletions README.md
@@ -1,4 +1,5 @@
# JChunk
## A Spring Boot Library for Text Chunking

JChunk is a simple library that provides several text splitting strategies.
This project began thanks to Greg Kamradt's post [text splitting ideas](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb).
@@ -10,14 +11,76 @@ For now there is only [Pablo Sanchidrian](https://github.com/PabloSanchi) develo
Feel free to contribute!!

## ROAD MAP
- [ ] Character Chunker (NOT STARTED)
- [x] Fixed Character Chunker (PRE-RELEASE)
- [ ] Recursive Character Text Chunker (NOT STARTED)
- [ ] Document Specific Chunker (NOT STARTED)
- [x] Semantic Chunker (PRE-RELEASE)
- [ ] Agentic Chunker (NOT STARTED)

## Building

To build and run the unit tests

```sh
./mvnw clean package
```

To reformat using the java-format plugin

```sh
./mvnw spring-javaformat:apply
```

To update the year on license headers using the license-maven-plugin

```sh
./mvnw license:update-file-header -Plicense
```

To check the Javadoc using the `javadoc:javadoc` goal

```sh
./mvnw javadoc:javadoc -Pjavadoc
```

## Fixed Character Chunker
Character splitting is a basic text processing technique that divides text into fixed-size chunks of characters. Because it ignores the structure of the text, it is too simple and rigid for most advanced text processing tasks, but it is a good starting point for understanding the fundamentals of text splitting. The sections below cover its key concepts (chunk size, chunk overlap, and separators), followed by its advantages and disadvantages.

### 1. Chunk Size
The chunk size is the number of characters each chunk will contain. For example, if you set a chunk size of 50, each chunk will consist of 50 characters.

**Example:**
- Input Text: "This is an example of character splitting."
- Chunk Size: 10
- Output Chunks: `["This is an", " example o", "f characte", "r splittin", "g."]`
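
The idea of a chunk size can be sketched in a few lines of plain Java. This is only an illustration of cutting text at fixed character offsets, not the code of `FixedChunker`; the class name `FixedSizeSplitSketch` and its `split` method are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of JChunk.
final class FixedSizeSplitSketch {

    // Cuts the text into consecutive pieces of at most chunkSize characters.
    static List<String> split(String text, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be greater than 0");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Every chunk except possibly the last has exactly 10 characters.
        split("This is an example of character splitting.", 10)
            .forEach(chunk -> System.out.println("\"" + chunk + "\""));
    }
}
```

Note that words are cut wherever the offset falls, which is the rigidity discussed in the pros and cons.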

### 2. Chunk Overlap
Chunk overlap refers to the number of characters that will overlap between consecutive chunks. This helps in maintaining context across chunks by ensuring that a portion of the text at the end of one chunk is repeated at the beginning of the next chunk.

**Example:**
- Input Text: "This is an example of character splitting."
- Chunk Size: 10
- Chunk Overlap: 4
- Output Chunks: `["This is an", "s an examp", "xample of ", " of charac", "aracter sp", "r splittin", "tting."]`
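
One way to picture overlap is as a sliding window that advances by `chunkSize - chunkOverlap` characters per step. The sketch below assumes exactly that interpretation; `OverlapSplitSketch` is a hypothetical name, and the actual `FixedChunker` may merge delimiter-separated pieces rather than slice raw offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of JChunk.
final class OverlapSplitSketch {

    // Slides a window of chunkSize characters over the text,
    // advancing by (chunkSize - chunkOverlap) characters each step.
    static List<String> split(String text, int chunkSize, int chunkOverlap) {
        if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
            throw new IllegalArgumentException("require chunkSize > chunkOverlap >= 0");
        }
        int step = chunkSize - chunkOverlap;
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // With size 10 and overlap 4, each chunk starts 6 characters after the previous one.
        split("This is an example of character splitting.", 10, 4)
            .forEach(chunk -> System.out.println("\"" + chunk + "\""));
    }
}
```

The precondition mirrors the rule that the chunk size must be greater than the chunk overlap; otherwise the window would never advance.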

### 3. Separators
Separators are specific character sequences used to split the text. For instance, you might want to split your text at every comma or period.

**Example:**
- Input Text: "This is an example. Let's split on periods. Okay?"
- Separator: ". "
- Output Chunks: `["This is an example", "Let's split on periods", "Okay?"]`
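
Splitting on a separator can be sketched with the standard `String.split` method. The class name `SeparatorSplitSketch` is hypothetical; the separator is passed through `Pattern.quote` so that characters such as `.` are treated literally rather than as regular expression syntax.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of JChunk.
final class SeparatorSplitSketch {

    // Splits the text at every occurrence of the literal separator and drops the separator.
    static List<String> split(String text, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        return Arrays.asList(text.split(Pattern.quote(separator)));
    }

    public static void main(String[] args) {
        split("This is an example. Let's split on periods. Okay?", ". ")
            .forEach(chunk -> System.out.println("\"" + chunk + "\""));
    }
}
```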


### Pros and Cons

**Pros**
- Easy & Simple: Character splitting is straightforward to implement and understand.
- Basic Segmentation: It provides a basic way to segment text into smaller pieces.

**Cons**
- Rigid: Does not consider the structure or context of the text.
- Duplicate Data: Chunk overlap repeats text across chunks, which increases storage and processing cost.

## Recursive Character Text Chunker

@@ -117,32 +180,6 @@ Split the text into chunks at the identified breakpoints.

## Agentic Chunker

## Contributing

Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.
42 changes: 42 additions & 0 deletions jchunk-fixed/pom.xml
@@ -0,0 +1,42 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
		xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
		xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>com.github.PabloSanchi</groupId>
		<artifactId>jchunk</artifactId>
		<version>0.0.1-SNAPSHOT</version>
	</parent>

	<artifactId>jchunk-fixed</artifactId>
	<packaging>jar</packaging>
	<name>JChunk - Fixed Chunker</name>
	<description>Fixed Chunker for Java</description>
	<url>https://github.com/PabloSanchi/jchunk</url>

	<scm>
		<url>https://github.com/PabloSanchi/jchunk</url>
		<connection>git://github.com/PabloSanchi/jchunk.git</connection>
		<developerConnection>git@github.com:PabloSanchi/jchunk.git</developerConnection>
	</scm>

	<dependencies>
		<dependency>
			<groupId>com.github.PabloSanchi</groupId>
			<artifactId>jchunk-core</artifactId>
			<version>${project.parent.version}</version>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot</artifactId>
		</dependency>

		<!-- test: scoped so it does not leak onto the runtime classpath of consumers -->
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
		</dependency>
	</dependencies>

</project>
118 changes: 118 additions & 0 deletions jchunk-fixed/src/main/java/jchunk/chunker/fixed/Config.java
@@ -0,0 +1,118 @@
package jchunk.chunker.fixed;

import org.springframework.util.Assert;

/**
 * Configuration for the fixed chunker.
 *
 * @author Pablo Sanchidrian Herrera
 */
public class Config {

	private final Integer chunkSize;

	private final Integer chunkOverlap;

	private final String delimiter;

	private final Boolean trimWhitespace;

	private final Delimiter keepDelimiter;

	public Integer getChunkSize() {
		return chunkSize;
	}

	public Integer getChunkOverlap() {
		return chunkOverlap;
	}

	public String getDelimiter() {
		return delimiter;
	}

	public Boolean getTrimWhitespace() {
		return trimWhitespace;
	}

	public Delimiter getKeepDelimiter() {
		return keepDelimiter;
	}

	public Config(Integer chunkSize, Integer chunkOverlap, String delimiter, Boolean trimWhitespace,
			Delimiter keepDelimiter) {
		this.chunkSize = chunkSize;
		this.chunkOverlap = chunkOverlap;
		this.delimiter = delimiter;
		this.trimWhitespace = trimWhitespace;
		this.keepDelimiter = keepDelimiter;
	}

	/**
	 * {@return the default config}
	 */
	public static Config defaultConfig() {
		return builder().build();
	}

	public static Builder builder() {
		return new Builder();
	}

	public static class Builder {

		// Defaults used by defaultConfig() when a value is not set explicitly.
		private Integer chunkSize = 1000;

		private Integer chunkOverlap = 100;

		private String delimiter = " ";

		private Boolean trimWhitespace = true;

		private Delimiter keepDelimiter = Delimiter.NONE;

		public Builder chunkSize(Integer chunkSize) {
			// The null check avoids a NullPointerException when unboxing.
			Assert.isTrue(chunkSize != null && chunkSize > 0, "Chunk size must be greater than 0");
			this.chunkSize = chunkSize;
			return this;
		}

		public Builder chunkOverlap(Integer chunkOverlap) {
			Assert.isTrue(chunkOverlap != null && chunkOverlap >= 0,
					"Chunk overlap must be greater than or equal to 0");
			this.chunkOverlap = chunkOverlap;
			return this;
		}

		public Builder delimiter(String delimiter) {
			this.delimiter = delimiter;
			return this;
		}

		public Builder trimWhitespace(Boolean trimWhitespace) {
			this.trimWhitespace = trimWhitespace;
			return this;
		}

		public Builder keepDelimiter(Delimiter keepDelimiter) {
			this.keepDelimiter = keepDelimiter;
			return this;
		}

		public Config build() {
			// Checked here because size and overlap can be set in any order.
			Assert.isTrue(chunkSize > chunkOverlap, "Chunk size must be greater than chunk overlap");
			return new Config(chunkSize, chunkOverlap, delimiter, trimWhitespace, keepDelimiter);
		}

	}

	/**
	 * Enum to represent the delimiter configuration.
	 * <ul>
	 * <li>NONE: No delimiter</li>
	 * <li>START: Delimiter at the start of the chunk</li>
	 * <li>END: Delimiter at the end of the chunk</li>
	 * </ul>
	 */
	public enum Delimiter {

		NONE, START, END

	}

}
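
The three `Delimiter` values can be read as three ways of splitting on the same delimiter. The sketch below is one possible interpretation, assuming NONE drops the delimiter, START attaches it to the following piece, and END attaches it to the preceding piece. The `Utils` class that does the real work is not shown in this diff, so `KeepDelimiterSketch` and its `Mode` enum are hypothetical stand-ins, not the library's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it does not use or reproduce JChunk's Utils.
final class KeepDelimiterSketch {

    // Stand-in for Config.Delimiter so the sketch compiles on its own.
    enum Mode { NONE, START, END }

    static List<String> split(String text, String delimiter, Mode mode) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        String quoted = Pattern.quote(delimiter);
        String regex;
        switch (mode) {
            case START:
                regex = "(?=" + quoted + ")"; // split before the delimiter, keeping it on the next piece
                break;
            case END:
                regex = "(?<=" + quoted + ")"; // split after the delimiter, keeping it on the previous piece
                break;
            default:
                regex = quoted; // split on the delimiter and drop it
        }
        return Arrays.asList(text.split(regex));
    }

    public static void main(String[] args) {
        String text = "This is an example. Let's split on periods. Okay?";
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + split(text, ". ", mode));
        }
    }
}
```

Lookahead and lookbehind keep the split zero-width, so no characters are lost in the START and END modes; whether whitespace is trimmed afterwards would depend on `trimWhitespace`.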
31 changes: 31 additions & 0 deletions jchunk-fixed/src/main/java/jchunk/chunker/fixed/FixedChunker.java
@@ -0,0 +1,31 @@
package jchunk.chunker.fixed;

import jchunk.chunker.core.chunk.Chunk;
import jchunk.chunker.core.chunk.IChunker;

import java.util.List;

/**
 * {@link FixedChunker} is a chunker that splits the content into fixed size chunks.
 *
 * @author Pablo Sanchidrian Herrera
 */
public class FixedChunker implements IChunker {

	private final Config config;

	public FixedChunker() {
		this(Config.defaultConfig());
	}

	public FixedChunker(Config config) {
		this.config = config;
	}

	@Override
	public List<Chunk> split(String content) {
		List<String> sentences = Utils.splitIntoSentences(content, config);
		return Utils.mergeSentences(sentences, config);
	}

}