
Releases: Mascerade/supervised-product-matching

Cleaner Repository and a Package!

22 Oct 00:01

Release Notes

  • With this release, there were no changes to the model itself.
    • All the changes were made to make it easier for people to actually use the model.
  • There is now a new directory called supervised_product_matching which is described in the README.
    • This directory is a package that can be installed using the command provided.
    • It gives much easier access to the model architectures used for training, as well as the preprocessing applied to titles before they are sent to the model.
  • The repository also now makes use of my CharacterBERT repository, which updates the original CharacterBERT code to work with the latest version of HuggingFace Transformers and exposes the architecture as a package for better portability.
  • torch_train_model.py now accepts command-line arguments.
  • You can find the NLP Dashboard repository here and the NLP Dashboard Server here if you want to make use of them for training.

CharacterBERT and A LOT of Change!

16 Apr 18:25

Release Notes

  • Completely revamped the data. The architecture of the project can be found in the README.
    • The gist is that we now have more realistic laptop data and use the electronics data from the WDC Product Corpus.
  • Using CharacterBERT as opposed to regular BERT
    • CharacterBERT is much more robust towards number data, which helps with discerning between numerical attributes of data.
    • ScaleTransformerEncoder can be added on top of CharacterBERT (check the README for more info)
  • New method of batching the training data (to consume less memory).
  • There is a test script to validate/manually test data.
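
The notes don't show how the memory-saving batching works; as a rough sketch of the general idea (a generator that yields fixed-size batches instead of materializing the whole dataset at once — the repository's actual code may differ):

```python
def batch_iter(examples, batch_size):
    """Yield successive batches of `batch_size` examples so the full dataset
    never needs to sit in memory at once (illustrative sketch only)."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # yield the final partial batch, if any
        yield batch
```

Each batch can then be tokenized and moved to the GPU just before the training step, rather than preprocessing everything up front.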

Implementation Notes

  • Need to download pre-trained CharacterBERT and BERT models
    • Instructions in README
  • Extract train.zip into data/train
  • Extract test.zip into data/test
  • Extract CharacterBERT-Models.zip into models

Results

  • The trained models perform much better overall, and especially on laptop data.
  • The models are also better regularized, so they don't overfit the data.
  • The models, in my evaluation, should be good enough to be used in production.

Using BERT!

31 Dec 23:15
Pre-release

Release Notes

  • This model was a complete revamp of previous models
  • We now use a pre-trained BERT model with an added classification head and fine-tune it
  • The laptop data used now is much simpler (doesn't have the added fluff-words to replicate actual title data)
    • The idea behind this choice is that the model should first learn how to properly recognize the different attributes of a title without having to worry about these added tokens
  • The rest of the data remains the same

Results

  • The results in this model are both better on paper and also when manually testing
    • It is able to better understand laptop data's attributes and when tested on real data that we procured, it did quite well (about 70% accuracy)
  • The problem with BERT, though, is that it overfits very easily
    • At just 4 epochs, the model had overfitted on certain data
  • BERT is promising, though, because it is more flexible with the structure and semantic meaning of the data

Issues/Future Improvements

  • There are major problems with the laptop data being used
    • Specifically, as noted in one of the commits, the model learns very easily the frequency at which some of the product names appear in positive and negative pairs
    • For example "apple macbook", when in both titles, is always negative
      • The model discovered this pattern and now anytime there is a laptop with "apple macbook" in both titles, it is always a negative pairing no matter how similar the titles actually are
  • If this issue can be solved, we believe it will open up the model to much more learning because it will not be able to take these shortcuts
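
One way to catch this kind of shortcut is to measure how one-sided each shared token's label distribution is across the training pairs. A minimal sketch (the pair format and the idea of flagging skewed tokens are assumptions for illustration, not the repository's code):

```python
from collections import defaultdict

def label_skew(pairs):
    """For each token appearing in BOTH titles of a pair, track how often that
    pair is labeled positive vs. negative. Tokens whose label distribution is
    heavily one-sided (e.g. "macbook" always negative) are potential shortcuts.
    `pairs` is an iterable of (title_a, title_b, label) with label 0 or 1."""
    counts = defaultdict(lambda: [0, 0])  # token -> [negative count, positive count]
    for title_a, title_b, label in pairs:
        shared = set(title_a.lower().split()) & set(title_b.lower().split())
        for token in shared:
            counts[token][label] += 1
    # Fraction of positive labels per shared token
    return {t: pos / (neg + pos) for t, (neg, pos) in counts.items()}
```

Tokens with a fraction near 0.0 or 1.0 are candidates for rebalancing the generated pairs.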

Notes

  • The csv files go into data/train and 0.2.0_BERT_epoch_3 goes into models

Expanded Amount of Models

29 Sep 01:31
Pre-release

Release Notes

We have now created more model architectures to explore different approaches to this problem. We have:

  • The distance between the final layers in the siamese network sent to a sigmoid classifier (DistanceSigmoid)
  • The exponential difference (i.e., e^(-|difference|)) between the final layers in the siamese network fed to a sigmoid classifier (ExpDistanceSigmoid)
  • The exponential difference between the final layers in the siamese network fed to a softmax classifier (ExpDistanceSoftmax)
  • The Manhattan distance between the final layers in the siamese network
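
As a rough illustration of the distance computations named above (a plain-Python sketch, not the repository's implementation — the real models apply these to the LSTM output vectors):

```python
import math

def abs_difference(u, v):
    """Element-wise |u - v|, which a sigmoid/softmax classifier can consume."""
    return [abs(a - b) for a, b in zip(u, v)]

def exp_difference(u, v):
    """Element-wise e^(-|u - v|), as in ExpDistanceSigmoid / ExpDistanceSoftmax.
    Identical elements map to 1.0; very different elements map toward 0.0."""
    return [math.exp(-abs(a - b)) for a, b in zip(u, v)]

def manhattan_distance(u, v):
    """Scalar Manhattan (L1) distance between the two final-layer vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

The first two produce a vector (one feature per dimension for the classifier head), while the Manhattan variant collapses the comparison to a single similarity score.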

Results

The results are not very good across the board, but what gives hope is the observation that numbers do not seem to work well with the fastText embeddings I am using. Most numbers are treated as the same, so tokens like 128 and 256 are largely viewed as identical. This is most likely because both numbers are found in almost exactly the same contexts. The same goes for tokens like SSD and HDD. Because of this, I would like to explore different ways of creating embeddings.
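
The "numbers look the same" problem shows up directly in cosine similarity: tokens that occur in near-identical contexts end up with near-parallel vectors. A toy illustration (the vectors below are made up for demonstration, not real fastText embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: "128" and "256" occur in nearly the same contexts
# ("... 128gb ssd ...", "... 256gb ssd ..."), so their vectors come out almost
# parallel and a downstream classifier struggles to tell them apart.
emb_128 = [0.90, 0.41, 0.12]
emb_256 = [0.91, 0.40, 0.11]
```

With realistic context-based embeddings the similarity between such tokens is close to 1.0, which is exactly why attribute numbers are hard to discriminate.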

Future Improvements/Research

There are many research papers about NLP to explore. I need to do research into using different embeddings (like with ConceptNet) and using other types of layers, like perhaps Transformers. I would also like to look into different regularization methods in order to get better validation and test results. In addition, it would help to analyze our data further to understand why adjectives play such a heavy role in how the models perform.

How to Use

Trying to Improve Accuracy on Laptops

25 Aug 19:53
Pre-release

Release Notes

We now generate laptop data by drawing from a spec list of laptop parts to assemble each laptop. We shuffle all the tokens in the laptop data so that the LSTM network does not overfit to the positions of certain tokens (like the CPU always being first, then the RAM, then the storage, etc.). In addition, there is added data for hard drives, CPUs, and RAM. The network also now uses a dropout rate of 0.6 in the last two layers to help with overfitting.
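
A minimal sketch of the generate-then-shuffle idea described above (the spec pools and title format here are invented for illustration; the repository's real lists are larger):

```python
import random

# Hypothetical spec pools — stand-ins for the repository's actual spec lists.
CPUS = ["intel i5-8250u", "intel i7-9750h", "amd ryzen 5 3500u"]
RAM = ["8gb ram", "16gb ram"]
STORAGE = ["256gb ssd", "512gb ssd", "1tb hdd"]

def make_laptop_title(rng):
    """Assemble one synthetic laptop title from the spec pools, then shuffle
    its tokens so the network cannot overfit to attribute positions
    (CPU always first, then RAM, then storage, ...)."""
    tokens = (rng.choice(CPUS).split()
              + rng.choice(RAM).split()
              + rng.choice(STORAGE).split())
    rng.shuffle(tokens)
    return " ".join(tokens)
```

Passing a seeded `random.Random` makes the generated dataset reproducible across training runs.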

Results

  • Achieved 87% accuracy on the test set with 128 batch size and 80 epochs
  • When manually testing, the models do not do well on laptops at all. The only attributes that govern whether two laptops are judged the same are the CPU and brand

Future Improvements

  • Need more laptop data
  • The model must not overfit to the brand name and only one or two specs

Topics to Test/Explore

  • Explore different distance layers
  • Explore pre-trained LSTMs
  • Perhaps train the fastText embedding on our data
  • Maybe a separate model for laptops would be better

How To Use

  • Unzip the train.zip into the train folder
  • If you want to use the model itself, put it into the model directory

Proof of Concept

06 Jul 19:16
Pre-release

Release Notes

This is the first release of this algorithm. I wanted to see if it was possible to train a model that can classify two titles as the same product or not. The LSTM network was trained for 50 epochs with a batch size of 64 using the included training data. I also attached the model itself, which achieved 87% accuracy on the test set and 91% accuracy on the training set. To train a model yourself, all you have to do is:

  • Read the readme and download the fastText embedding model
  • Put computers_train_bal_shuffle.csv and computers_train_xlarge_norm_simple.csv into the computers_train directory in the data folder
  • Go in the train_model.py code and change the output model name to what you want it to be
  • Run train_model.py
  • If you just want to test the model, put the .h5 file into the models folder and run test_model.py
    • Change the titles in the code if you want

Future Improvements

  • Use the cameras dataset from WDC Product Data Corpus
  • Make the model differentiate between titles that have different attributes, like a laptop with 500gb of HDD vs a laptop with 750gb
    • This includes manually getting data for this and switching out attributes
  • Test with the contrastive loss function
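
The attribute-switching idea above can be sketched as a simple substitution that turns one title into a hard negative pair that differs in exactly one attribute (the regex and the list of storage sizes are assumptions for illustration):

```python
import random
import re

# Hypothetical pool of storage capacities to swap between.
STORAGE_SIZES = ["250gb", "500gb", "750gb", "1tb"]

def swap_storage(title, rng):
    """Replace the storage capacity in a title with a different one, producing
    a hard negative: same product text, one changed attribute."""
    match = re.search(r"\b(\d+(?:gb|tb))\b", title)
    if match is None:
        return title  # no storage attribute found; leave the title unchanged
    alternatives = [s for s in STORAGE_SIZES if s != match.group(1)]
    return title.replace(match.group(1), rng.choice(alternatives), 1)
```

A pair like `("dell laptop 500gb hdd", swap_storage("dell laptop 500gb hdd", rng))` can then be labeled as a non-match, forcing the model to attend to the attribute itself.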