trivor-nlp

trivor-nlp uses NLP (Natural Language Processing) to detect sentences and tokens, as well as the meaning of each token in a given sentence. After all sentences have been processed, several generators produce insights that can be easily consumed.

Prerequisites

  • Java 8+

Usage

1. Add dependency

<dependency>
  <groupId>org.kalnee</groupId>
  <artifactId>trivor-nlp</artifactId>
  <version>0.0.1-alpha.2</version>
</dependency>

2. Create a Processor

trivor-nlp provides two processors:

  • TranscriptProcessor: general-purpose processor; the content can be supplied as either a URI or a String.
  • SubtitleProcessor: subtitle-only processor; the content must be supplied as a URI.

Accepted URI schemes: file://, jar:// and s3:// (for s3://, make sure your AWS credentials are in place).
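
To catch an unsupported URI before it reaches a processor, a caller could check the scheme up front. The sketch below is an illustration only: `UriSchemeCheck` and `isAccepted` are hypothetical names, not part of trivor-nlp, and the library may validate URIs differently.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of trivor-nlp.
final class UriSchemeCheck {

    // The three schemes listed in this README.
    private static final Set<String> ACCEPTED =
            new HashSet<>(Arrays.asList("file", "jar", "s3"));

    // Returns true when the URI has one of the accepted schemes.
    static boolean isAccepted(URI uri) {
        String scheme = uri.getScheme(); // null for relative URIs
        return scheme != null && ACCEPTED.contains(scheme.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAccepted(URI.create("file:///tmp/transcript.txt"))); // true
        System.out.println(isAccepted(URI.create("https://example.com/a.txt")));  // false
    }
}
```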

Create a TranscriptProcessor from URI or String

// from URI
TranscriptProcessor tpFromUri = new TranscriptProcessor.Builder(uri).build();
// from String
TranscriptProcessor tpFromString = new TranscriptProcessor.Builder("This is a sentence.").build();

Customize

Filters and mappers

For each line in the provided content, custom filters and mappers can be used to clean up the text before running the NLP models. Both fields are optional.

TranscriptProcessor tp = new TranscriptProcessor.Builder(uri)
        .withFilters(singletonList(line -> !line.contains("Name")))
        .withMappers(singletonList(line -> line.replaceAll(TRANSCRIPT_REGEX, EMPTY)))
        .build();
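
As a rough mental model of the clean-up step described above, the sketch below applies a list of filters and then a list of mappers to each line. It is not the library's implementation: `LineCleaner` is a hypothetical name, and the `Predicate<String>` / `Function<String, String>` types and the filter-then-map order are assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper, not part of trivor-nlp.
final class LineCleaner {

    static List<String> clean(List<String> lines,
                              List<Predicate<String>> filters,
                              List<Function<String, String>> mappers) {
        // A line is kept only if every filter accepts it.
        Predicate<String> keep = filters.stream().reduce(line -> true, Predicate::and);
        // Mappers are chained in the order they were supplied.
        Function<String, String> transform =
                mappers.stream().reduce(Function.identity(), Function::andThen);
        // Assumption: filtering happens before mapping.
        return lines.stream().filter(keep).map(transform).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Predicate<String> noNames = line -> !line.contains("Name");
        Function<String, String> trim = String::trim;
        List<String> cleaned = clean(
                Arrays.asList("Name: Forrest", "  Hello there.  ", "Run!"),
                Collections.singletonList(noNames),
                Collections.singletonList(trim));
        System.out.println(cleaned); // [Hello there., Run!]
    }
}
```

Passing empty lists leaves the lines untouched, which matches the README's note that both fields are optional.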

Settings

The following values can be overridden by passing a Config when building a processor:

  • Vocabulary probability: Double (default: 0.9), e.g. only verbs with a probability >= 90% are accepted
  • Chunk probability: Double (default: 0.5)
  • Run Sentiment Analysis: Boolean (default: true)

TranscriptProcessor tp = new TranscriptProcessor.Builder(uri)
        .withConfig(new Config.Builder().vocabularyProb(.98).chunkProb(.98).sentimentAnalysis(false).build())
        .build();
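
To show what a probability threshold does in practice, the sketch below keeps only the tokens whose probability is at least the configured minimum. It is an illustration under assumptions, not trivor-nlp code: `ProbabilityThreshold` and `ScoredToken` are hypothetical names, and the probabilities are borrowed from the sample `getSentences()` output in this README.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of trivor-nlp.
final class ProbabilityThreshold {

    // Minimal stand-in for a token and the probability the model assigned to it.
    static final class ScoredToken {
        final String token;
        final double prob;

        ScoredToken(String token, double prob) {
            this.token = token;
            this.prob = prob;
        }
    }

    // Keeps the tokens whose probability is >= minProb.
    static List<String> accepted(List<ScoredToken> tokens, double minProb) {
        List<String> kept = new ArrayList<>();
        for (ScoredToken t : tokens) {
            if (t.prob >= minProb) {
                kept.add(t.token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ScoredToken> tokens = Arrays.asList(
                new ScoredToken("My", 0.976362822572366),
                new ScoredToken("name", 0.98267246788283),
                new ScoredToken("Forrest", 0.908174572293974));
        System.out.println(accepted(tokens, 0.9));  // [My, name, Forrest]
        System.out.println(accepted(tokens, 0.98)); // [name]
    }
}
```

Raising the threshold from 0.9 to 0.98 trades coverage for confidence: fewer tokens survive, but each one was tagged with higher certainty.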

Create a SubtitleProcessor from URI

final SubtitleProcessor sp = new SubtitleProcessor.Builder(uri).withDuration(43).build();

SubtitleProcessor already applies all the filters and mappers a subtitle needs.

3. Result

After successfully building a processor, the NLP results can be accessed as follows:

processor.getSentences()

This method returns the list of sentences. Each sentence is composed of the identified tokens, tags and chunks:

{
  "sentence" : "My name's Forrest.",
  "tokens" : [
    {
      "token" : "My",
      "tag" : "PRP$",
      "lemma" : "my",
      "prob" : 0.976362822572366
    },
    {
      "token" : "name",
      "tag" : "NN",
      "lemma" : "name",
      "prob" : 0.98267246788283
    },
    {
      "token" : "'s",
      "tag" : "POS",
      "lemma" : "'s",
      "prob" : 0.933313435543914
    },
    {
      "token" : "Forrest",
      "tag" : "NNP",
      "lemma" : "forrest",
      "prob" : 0.908174572293974
    },
    {
      "token" : ".",
      "tag" : ".",
      "lemma" : ".",
      "prob" : 0.982098322024085
    }
  ],
  "chunks" : [
    {
      "tokens" : [ "My", "name" ],
      "tags" : [ "PRP$", "NN" ]
    },
    {
      "tokens" : [ "'s", "Forrest" ],
      "tags" : [ "POS", "NNP" ]
    }
  ]
}

processor.getResult()

This method returns a Result object with a range of insights, such as:

  • Rate of Speech (only for Subtitles)
  • Frequency Rate
  • Frequent Sentences
  • Frequent Chunks
  • Vocabulary
  • Phrasal Verbs
  • Sentiment Analysis
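
As a rough idea of the kind of aggregation behind an insight such as Frequent Sentences, the sketch below counts how often each sentence occurs. This is an assumption about the general technique, not the library's algorithm or its Result API; `SentenceFrequency` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of trivor-nlp.
final class SentenceFrequency {

    // Counts occurrences of each sentence, preserving first-seen order.
    static Map<String, Integer> count(List<String> sentences) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            counts.merge(sentence, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("Run!", "My name's Forrest.", "Run!");
        System.out.println(count(sentences)); // {Run!=2, My name's Forrest.=1}
    }
}
```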

The full documentation can be accessed here.

License

MIT (c) Kalnee. See LICENSE for details.