Class Definition
Class BeatriceVec: This class defines a custom word embedding model.
Initialization
__init__ method: The constructor initializes an instance of BeatriceVec with the following attributes:
self.corpus: the input corpus, a list of sentences.
self.word_vocab: a list holding the vocabulary of unique words.
self.word2id: a dictionary mapping words to unique IDs.
self.id2word: a dictionary mapping unique IDs back to words.
self.word_vectors: initially None; later holds the word vectors.
self.dimension: the dimensionality of the word vectors, set to 600.
self.context_size: the context window size, set to 2.
self.learning_rate: the learning rate for gradient descent, set to 0.01.
self.num_epochs: the number of training epochs, set to 10.
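The original source is not shown here, but based on the attributes listed above, the constructor might look roughly like the following sketch (the attribute names and default values are taken from the description; everything else is an assumption):

```python
class BeatriceVec:
    def __init__(self, corpus):
        self.corpus = corpus        # input corpus: a list of sentences (plain strings)
        self.word_vocab = []        # vocabulary of unique words
        self.word2id = {}           # word -> unique ID
        self.id2word = {}           # unique ID -> word
        self.word_vectors = None    # filled in later by initialize_word_vectors
        self.dimension = 600        # dimensionality of each word vector
        self.context_size = 2       # context window size (words on each side)
        self.learning_rate = 0.01   # learning rate for the gradient-descent updates
        self.num_epochs = 10        # number of training epochs
```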
Vocabulary Building
build_vocab method: Builds the vocabulary from the corpus. It iterates over each sentence, splits it into words, and adds each word to self.word_vocab if it is not already present. It then sorts self.word_vocab and populates the self.word2id and self.id2word dictionaries with the index corresponding to each word.
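A minimal sketch of this method, reconstructed from the description above (shown as it would appear inside the class):

```python
def build_vocab(self):
    # Collect every unique word from the corpus.
    for sentence in self.corpus:
        for word in sentence.split():
            if word not in self.word_vocab:
                self.word_vocab.append(word)
    # Sort so that word IDs are assigned deterministically.
    self.word_vocab.sort()
    # Map each word to its index and back.
    for idx, word in enumerate(self.word_vocab):
        self.word2id[word] = idx
        self.id2word[idx] = word
```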
Word Vector Initialization
initialize_word_vectors method: Initializes the word vectors with random values. It computes vocab_size, the size of the vocabulary, and builds a list of lists in which each inner list is a word vector whose components are random values between -1 and 1.
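A sketch matching that description, assuming the uniform random values come from Python's standard random module:

```python
import random

def initialize_word_vectors(self):
    vocab_size = len(self.word_vocab)
    # One vector per word; each component is drawn uniformly from [-1, 1].
    self.word_vectors = [
        [random.uniform(-1, 1) for _ in range(self.dimension)]
        for _ in range(vocab_size)
    ]
```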
Training
train method: Trains the embedding model using a simplified version of the Word2Vec algorithm. For each of num_epochs epochs, it iterates over the corpus, splits each sentence into tokens (words), and for each target word identifies the context words within context_size positions on either side. For each context word, it retrieves the corresponding word vectors and updates them using the update_vector method.
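The description leaves open exactly which of the two vectors is updated; the sketch below updates the target word's vector using each context word's vector, which is one plausible reading:

```python
def train(self):
    for epoch in range(self.num_epochs):
        for sentence in self.corpus:
            tokens = sentence.split()
            for i, target in enumerate(tokens):
                # Context window: up to context_size words on each side of the target.
                start = max(0, i - self.context_size)
                end = min(len(tokens), i + self.context_size + 1)
                for j in range(start, end):
                    if j == i:
                        continue  # skip the target word itself
                    target_vec = self.word_vectors[self.word2id[target]]
                    context_vec = self.word_vectors[self.word2id[tokens[j]]]
                    self.update_vector(target_vec, context_vec)
```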
Vector Update
update_vector method: Updates a word vector using gradient descent. It iterates over each dimension of the vector and adjusts each element by subtracting the product of the learning rate and the corresponding element of the context vector.
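Taken literally, that update rule looks like this (a sketch; the sign and scaling follow the description above exactly):

```python
def update_vector(self, vector, context_vector):
    # Shift each component by a step proportional to the corresponding
    # component of the context vector, scaled by the learning rate.
    for d in range(self.dimension):
        vector[d] -= self.learning_rate * context_vector[d]
```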
Retrieving Embeddings
get_embeddings method: Retrieves the embeddings for all words in the vocabulary. It iterates over the vocabulary, uses the get_embedding method to fetch each word's embedding, appends it to a list, and returns that list.
get_embedding method: Retrieves the embedding vector for a given word. If the word exists in the vocabulary, it returns the corresponding vector from self.word_vectors.
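A sketch of both retrieval methods; the return value for a word outside the vocabulary is not described, so returning None here is an assumption:

```python
def get_embeddings(self):
    # Collect the embedding of every word in vocabulary order.
    embeddings = []
    for word in self.word_vocab:
        embeddings.append(self.get_embedding(word))
    return embeddings

def get_embedding(self, word):
    # Look up the vector for a single word, if it is in the vocabulary.
    if word in self.word2id:
        return self.word_vectors[self.word2id[word]]
    return None  # assumption: behaviour for unknown words is unspecified
```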
Summary of Methods and Attributes
__init__: initializes the class attributes.
build_vocab: builds the vocabulary and the word-to-ID mappings.
initialize_word_vectors: initializes word vectors with random values.
train: trains the word vectors using context words.
update_vector: updates a word vector using gradient descent.
get_embeddings: retrieves all word embeddings.
get_embedding: retrieves the embedding for a specific word.
This code essentially defines a basic framework for creating word embeddings from a corpus of text using a simplified version of the Word2Vec algorithm. The training method lacks some of the more complex aspects of Word2Vec, such as negative sampling or hierarchical softmax, but it provides a conceptual foundation for understanding how word embeddings can be learned from context.
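Assuming the methods sketched above are assembled into the BeatriceVec class, end-to-end usage might look like this (the two corpus sentences are made up purely for illustration):

```python
corpus = [
    "beatrice reads old letters",
    "old letters carry quiet meaning",
]

model = BeatriceVec(corpus)
model.build_vocab()
model.initialize_word_vectors()
model.train()

vector = model.get_embedding("letters")
print(len(vector))  # 600, i.e. self.dimension
```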