trivor-nlp leverages NLP (Natural Language Processing) to detect sentences, tokens and the meaning of each token in a given sentence. After processing all sentences, several generators produce valuable information that can be easily consumed.
- Java 8+
<dependency>
  <groupId>org.kalnee</groupId>
  <artifactId>trivor-nlp</artifactId>
  <version>0.0.1-alpha.2</version>
</dependency>
`trivor-nlp` provides two processors:

- `TranscriptProcessor`: a general-purpose processor; the content can be provided either via a URI or a String.
- `SubtitleProcessor`: a subtitle-only processor; the content must be provided via a URI.

Accepted URI schemes: `file://`, `jar://` and `s3://` (for `s3://`, make sure the AWS credentials are in place).
// from URI
TranscriptProcessor tp = new TranscriptProcessor.Builder(uri).build();
// from String
TranscriptProcessor tp = new TranscriptProcessor.Builder("This is a sentence.").build();
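The `uri` in the example above is assumed to be a regular `java.net.URI`. A minimal sketch using the accepted schemes; the file path and the S3 bucket/key are hypothetical placeholders:

```java
import java.net.URI;

// Hypothetical locations, shown only to illustrate the accepted schemes.
URI localFile = URI.create("file:///home/user/transcripts/some-transcript.txt");
URI s3Object = URI.create("s3://my-bucket/transcripts/some-transcript.txt"); // requires AWS credentials

TranscriptProcessor fromFile = new TranscriptProcessor.Builder(localFile).build();
TranscriptProcessor fromS3 = new TranscriptProcessor.Builder(s3Object).build();
```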
For each line in the provided content, custom filters and mappers can be used to clean up the text before running the NLP models. Both fields are optional.
TranscriptProcessor tp = new TranscriptProcessor.Builder(uri)
        .withFilters(singletonList(line -> !line.contains("Name")))                   // keep only lines that do not contain "Name"
        .withMappers(singletonList(line -> line.replaceAll(TRANSCRIPT_REGEX, EMPTY))) // drop anything matching TRANSCRIPT_REGEX
        .build();
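Several filters and mappers can be combined in the same way. A sketch, assuming the builder accepts any list of line predicates and line mappers, as the `singletonList` example above suggests; the clean-up rules themselves are made up for illustration:

```java
import static java.util.Arrays.asList;

// Hypothetical clean-up rules, for illustration only.
TranscriptProcessor tp = new TranscriptProcessor.Builder(uri)
        .withFilters(asList(
                line -> !line.startsWith("NARRATOR:"),    // drop speaker labels
                line -> !line.trim().isEmpty()))          // drop blank lines
        .withMappers(asList(
                line -> line.replaceAll("\\[.*?\\]", ""), // strip bracketed stage directions
                line -> line.trim()))
        .build();
```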
The following values can be overridden by adding a `Config` when building a `Processor`:

- Vocabulary probability: `Double` (default: 0.9), e.g. only verbs with a probability >= 90% will be accepted
- Chunk probability: `Double` (default: 0.5)
- Run sentiment analysis: `Boolean` (default: true)
TranscriptProcessor tp = new TranscriptProcessor.Builder(uri)
        .withConfig(new Config.Builder()
                .vocabularyProb(.98)
                .chunkProb(.98)
                .sentimentAnalysis(false)
                .build())
        .build();
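Any subset of the values can be overridden. A minimal sketch that only disables sentiment analysis, assuming unset `Config.Builder` values fall back to the defaults listed above:

```java
// Keep the default probabilities and only turn off sentiment analysis
// (assumes unset builder values keep their defaults).
TranscriptProcessor tp = new TranscriptProcessor.Builder(uri)
        .withConfig(new Config.Builder().sentimentAnalysis(false).build())
        .build();
```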
For subtitles, all the necessary filters and mappers are already provided:

final SubtitleProcessor sp = new SubtitleProcessor.Builder(uri).withDuration(43).build();
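The same URI schemes apply here. A sketch loading a subtitle from S3; the bucket and key are hypothetical, and AWS credentials must be in place for `s3://`:

```java
import java.net.URI;

// Hypothetical S3 location.
URI subtitleUri = URI.create("s3://my-bucket/subtitles/episode-01.srt");

final SubtitleProcessor sp = new SubtitleProcessor.Builder(subtitleUri)
        .withDuration(43) // same duration value as in the example above
        .build();
```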
After successfully building a processor, the NLP results can be accessed as follows:
This method returns the list of sentences. Each sentence is composed of the identified tokens, tags and chunks:
{
"sentence" : "My name's Forrest.",
"tokens" : [
{
"token" : "My",
"tag" : "PRP$",
"lemma" : "my",
"prob" : 0.976362822572366
},
{
"token" : "name",
"tag" : "NN",
"lemma" : "name",
"prob" : 0.98267246788283
},
{
"token" : "'s",
"tag" : "POS",
"lemma" : "'s",
"prob" : 0.933313435543914
},
{
"token" : "Forrest",
"tag" : "NNP",
"lemma" : "forrest",
"prob" : 0.908174572293974
},
{
"token" : ".",
"tag" : ".",
"lemma" : ".",
"prob" : 0.982098322024085
}
],
"chunks" : [
{
"tokens" : [
"My",
"name"
],
"tags" : [
"PRP$",
"NN"
]
},
{
"tokens" : [
"'s",
"Forrest"
],
"tags" : [
"POS",
"NNP"
]
}
]
}
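A sketch of consuming this structure from Java. The accessor names below (`getSentences()`, `getTokens()`, and so on) are hypothetical; they mirror the JSON fields above rather than a confirmed trivor-nlp API:

```java
// Hypothetical accessors, named after the JSON fields above.
tp.getSentences().forEach(sentence -> {
    System.out.println(sentence.getSentence());
    sentence.getTokens().forEach(token ->
            System.out.printf("  %s/%s (lemma=%s, prob=%.2f)%n",
                    token.getToken(), token.getTag(), token.getLemma(), token.getProb()));
});
```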
This method returns a `Result` object with many different insights, such as:

- Rate of Speech (only for `Subtitles`)
- Frequency Rate
- Frequent Sentences
- Frequent Chunks
- Vocabulary
- Phrasal Verbs
- Sentiment Analysis
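A sketch of reading a few of these insights; the method and getter names are again hypothetical, inferred from the list above rather than taken from the actual `Result` class:

```java
// Hypothetical getters, one per insight listed above.
Result result = tp.getResult();
System.out.println("Vocabulary size: " + result.getVocabulary().size());
System.out.println("Phrasal verbs:   " + result.getPhrasalVerbs());
System.out.println("Frequent chunks: " + result.getFrequentChunks());
System.out.println("Sentiment:       " + result.getSentimentAnalysis());
```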
The full documentation can be accessed here.
MIT (c) Kalnee. See LICENSE for details.