This is a Java implementation of the popular Word2Vec algorithm for converting words into multidimensional vectors in an embedding space. It supports both the CBOW (Continuous Bag of Words) and Skip-gram models for predicting a word from its context words or vice versa, as well as direct vector operations such as cosine similarity and embedding arithmetic for manipulating word vectors. The entire program is wrapped in a CLI (command line interface) for interacting with and dynamically experimenting with Word2Vec without touching a single line of code. The included Word2Vec class also provides several useful helper functions, such as finding similar words, cleaning and reading a text corpus, and training the neural network with customizable parameters. It depends on my Java NeuralNetwork project, which acts as the base for the actual prediction and vectorization process, and my Java ConsoleTool project, which provides the CLI wrapper for the application.
This project implements Word2Vec, a technique for learning word embeddings using a simple neural network architecture. Word2Vec learns high-dimensional vector representations of words from large text corpora, which capture semantic and syntactic similarities between words. This implementation includes two primary training protocols:
- CBOW (Continuous Bag of Words): Predicts the target word given the surrounding context words.
- Skip-gram: Predicts the context words given a single target word.
These embeddings can be used for various natural language processing tasks, such as finding similar words, word analogies, and more.
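To make the difference between the two concrete, here is a minimal, self-contained sketch (separate from the Word2Vec class itself) of how training pairs can be generated from a tokenized sentence; the window size of 2 and the printed format are purely illustrative:
import java.util.ArrayList;
import java.util.List;

public class PairDemo {
    public static void main(String[] args) {
        String[] tokens = {"the", "quick", "brown", "fox", "jumps"};
        int window = 2; // illustrative context window size

        for (int i = 0; i < tokens.length; i++) {
            // collect the context words within the window, excluding the target itself
            List<String> context = new ArrayList<>();
            for (int j = Math.max(0, i - window); j <= Math.min(tokens.length - 1, i + window); j++) {
                if (j != i) context.add(tokens[j]);
            }
            // CBOW: the context words together predict the target word
            System.out.println("CBOW:      " + context + " -> " + tokens[i]);
            // Skip-gram: the target word predicts each context word individually
            for (String c : context) {
                System.out.println("Skip-gram: " + tokens[i] + " -> " + c);
            }
        }
    }
}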
- CBOW and Skip-gram Support: Implements both Continuous Bag of Words and Skip-gram models to allow flexible training and testing.
- User-friendly Console Interface: Everything is wrapped in an easy-to-use console interface using commands to perform tasks, similar to a shell terminal. This allows for no-code experimentation with the Word2Vec embedding model.
- Customizable Parameters: Allows users to adjust parameters such as learning rate, embedding size, context window size, minimum word frequency, and number of training epochs.
- Vector Operations: Supports vector arithmetic and cosine similarity calculations to find relationships between words.
- Corpus Processing: Includes functionality to read, clean, and preprocess text corpora, handling tokenization and normalization.
- Save and Load Models: Ability to save trained models to a file and load pre-trained models for reuse or evaluation.
For use in your own Java projects, simply import the Word2Vec.java class file and it will immediately be usable. The following section covers the proper syntax for the main operations:
- Initializing the Word2Vec Model:
Word2Vec.ModelType modelType = Word2Vec.ModelType.CBOW; // or Word2Vec.ModelType.SKIPGRAM
int minFrequency = 5; // minimum times a word needs to occur for it to be added to the model's vocabulary
int windowSize = 5; // how far around the word to look for context
int dimensions = 100; // how many dimensions the embedding vector should have
String corpusString = "The quick brown fox jumps over the lazy dog."; // corpus text is automatically cleaned up for tokenization and parsing
Word2Vec model = new Word2Vec(modelType, corpusString, minFrequency, windowSize, dimensions);
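If the corpus lives in a text file rather than a string literal, it can be loaded with the standard library before constructing the model; this sketch uses only core Java, and the file name is just an example:
import java.nio.file.Files;
import java.nio.file.Paths;

// read the whole corpus file into a single string (throws IOException, so handle or declare it)
String corpusString = new String(Files.readAllBytes(Paths.get("corpus.txt")));
Word2Vec model = new Word2Vec(modelType, corpusString, minFrequency, windowSize, dimensions);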
- Training the Model: Use the train method to train the Word2Vec model on the loaded corpus.
model.train(numberOfEpochs, learningRate);
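For example, a call with typical starting values might look like the following; the exact numbers are illustrative, not recommendations baked into this project:
int numberOfEpochs = 100;   // more epochs generally help on small corpora
double learningRate = 0.05; // step size for the underlying neural network updates
model.train(numberOfEpochs, learningRate);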
- Finding Similar Words: Use the findSimilarWords method to find words similar to a given input word or embedding vector.
String[] similarWords = model.findSimilarWords("word", topN);
double[] vector = {...};
String[] similarWords2 = model.findSimilarWords(vector, topN);
String closestWord = model.getClosestWord("word");
String closestWord2 = model.getClosestWord(vector);
String closestWord3 = model.getClosestWord(vector, "excludedWord1", "excludedWord2", ...);
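As a small usage sketch (the query word and topN are arbitrary), the results can be printed directly:
int topN = 10;
String[] results = model.findSimilarWords("fox", topN);
for (String word : results) {
    System.out.println(word); // assumed to be ordered from most to least similar
}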
- Vector Arithmetic: Perform operations like word analogies using vector arithmetic.
double[] kingVector = model.vector("king");
double[] manVector = model.vector("man");
double[] womanVector = model.vector("woman");
double[] queenVector = model.add(model.subtract(kingVector, manVector), womanVector); // king - man + woman = queen
String queenWord = model.getClosestWord(queenVector); // gets the closest word that matches this new embedding vector
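The same pattern can be wrapped in a small helper for arbitrary analogies of the form a - b + c. This helper is hypothetical (not part of the Word2Vec class) and only combines the methods shown above; excluding the input words usually gives cleaner results:
// hypothetical convenience wrapper built on the documented methods
static String analogy(Word2Vec model, String a, String b, String c) {
    double[] result = model.add(model.subtract(model.vector(a), model.vector(b)), model.vector(c));
    return model.getClosestWord(result, a, b, c); // exclude the input words from the search
}

String answer = analogy(model, "king", "man", "woman"); // ideally "queen"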
- Comparing Words: Compare the similarity of two words using cosine similarity via the similarity() method:
double sim1 = model.similarity("king", "queen"); // high similarity (strong correlation)
double sim2 = model.similarity("king", "phone"); // near 0 similarity (no correlation)
double sim3 = model.similarity("king", "peasant"); // negative similarity (opposite correlation)
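For reference, cosine similarity is the dot product of the two vectors divided by the product of their magnitudes, ranging from -1 (opposite) to 1 (identical). A standalone sketch of the computation, independent of this project's similarity() method:
// cosine similarity = (a . b) / (|a| * |b|)
static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}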
- Compile the Code: First, make sure you are working in the project directory. If you are running the full project with the console interface, run the following commands to compile and run the program:
Unix (Mac/Linux) users:
Compile:
javac -cp ".:./libraries/jfreechart-1.5.3.jar" Main.java
Run:
java -cp ".:./libraries/jfreechart-1.5.3.jar" Main
Windows users:
Compile:
javac -cp ".;./libraries/jfreechart-1.5.3.jar" Main.java
Run:
java -cp ".;./libraries/jfreechart-1.5.3.jar" Main
Or, if you are just using the Word2Vec class, the jfreechart library can be excluded, simplifying the commands to:
Compile:
javac Main.java
Run:
java Main
- To exit the program, simply type exit, and the program will terminate.
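If you are only using the Word2Vec class without the console interface, a minimal Main.java could look like the sketch below. It only strings together the constructor and methods documented above, with illustrative parameter values, so treat it as a starting point rather than this project's actual Main class:
public class Main {
    public static void main(String[] args) {
        String corpus = "The quick brown fox jumps over the lazy dog. The dog sleeps."; // toy corpus
        Word2Vec model = new Word2Vec(Word2Vec.ModelType.CBOW, corpus, 1, 2, 50);
        model.train(200, 0.05); // epochs and learning rate are illustrative
        System.out.println(model.similarity("fox", "dog"));
        for (String word : model.findSimilarWords("fox", 5)) {
            System.out.println(word);
        }
    }
}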