An app that uses machine learning to help you with your Christmas shopping!
Main App Repo · Demo · eBay API Repo
This API forms part of a group project completed on the Northcoders software development bootcamp. Santa's Little Helper is an app created in React Native which uses a Word2Vec machine learning model to help users find an ideal present for a loved one.
As a brief overview of the project flow, a user swipes to like or dislike gifts for an intended recipient, fetched via the eBay API. We extract keywords describing each item and record whether each keyword came from an item the user liked or disliked. From this, we create a list of "positive" (liked) keywords and a list of "negative" (disliked) keywords. These lists are passed to the API in this repo, which holds our Word2Vec neural network model.
Based on these lists, this API returns a list of related keywords that the intended recipient may like. These keywords are then used in the next eBay API call to suggest items that the intended recipient is more likely to be interested in, essentially tailoring the items to the user's likes and dislikes. Please see the main repo for further details, and our project page, which contains an app demo.
The main purpose of this repo is to hold our Flask API, which creates a small Python-based server which outputs semantically similar words when given one or more words as input. We achieved this using Word2Vec - a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. We used our model to suggest similar words to given keywords relating to gifts in the front end of our project.
While pre-trained models exist, we chose to train our own custom set of word vectors specific to the eCommerce context of our project. We've included our code from this process, so there are two uses for this repo:
- Creating a Flask REST API which uses machine learning and NLP to recommend related keywords
- Training custom word vectors with Word2Vec using an eCommerce dataset (see the `/model_training` folder and this section of the repo)
At the time of writing, this API is hosted here.
We have used Python to both train our word vectors and create this API. Key technologies we've used in development are:
Details on development of the React Native app can be found in the main app repo.
As mentioned, there are two uses for this repo - running an API and training a custom set of word vectors. If you only wish to set up the API, follow the installation instructions below. For details on our word vector training process, see this later section. However, it is not necessary to train the word vectors to run the API, as our pre-trained vectors are included in this repo (`/model/ecommerce_vecs.txt`).
You can get started using a local version of our API by following these steps:
You can clone this repo locally using the command:
git clone https://github.com/teyahbd/ecommerce-keyword-api.git
Before installing the required packages, it is conventional in Python to create a local virtual environment in which to install them. To do this, navigate into the main directory of this project and run:
python3 -m venv venv
To enter the virtual environment, both now and at later points, you can use the command:
source venv/bin/activate
The name of the virtual environment (venv) should appear on your command line to indicate you are currently working within the virtual environment. In order to have access to the packages we will install in the next step, it's important to check you are within the environment when working with this repo.
After ensuring you are within the virtual environment, use the `requirements.txt` file to install the requirements for this project via `pip` with the command:
pip install -r requirements.txt
To run the Flask app on your local server, you can use the command:
flask run
Note: Flask will default to using port 5000.
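If another process is already occupying port 5000 (on recent versions of macOS, for example, AirPlay Receiver listens there), you can pass a different port to Flask. The port number below is just an example:

```shell
# Run the app on an alternative port if 5000 is taken
flask run --port 5001
```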
This API was created for our React Native app to interact with our Word2Vec model, and so it only contains two endpoints. Here is a brief overview of the intended flow of this API:
- A user submits a list of zero or more positive words and a list of zero or more negative words to the API in a POST request.
- These two lists are passed to our Word2Vec keyword recommendation model which uses our custom word vectors to find the "most similar" words to the positive list (and/or the "least similar" to the negative list).
- The API responds to the POST request with a list of the two recommended words generated by the model.
Essentially, the API recommends similar words based on the submitted words. The list of positive words contributes positively to the similarity, and the list of negative words contributes negatively. In use for our app, the API takes a list of positive keywords generated from items a user has liked and a list of negative keywords generated from items the user has disliked. More details on the function used for these word vector calculations can be found here.
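The idea behind combining positive and negative word lists can be sketched in plain Python. This is a simplified illustration only (our API delegates the real calculation to the Word2Vec model's own similarity function), and the toy two-dimensional vectors below are invented for the example:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors, in the range -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def most_similar(vectors, positive, negative, n=2):
    """Rank candidate words by cosine similarity to the mean of the
    positive vectors minus the mean of the negative vectors."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + v / len(positive) for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v / len(negative) for q, v in zip(query, vectors[word])]
    # Never recommend a word the user already submitted
    candidates = [w for w in vectors if w not in positive and w not in negative]
    scored = [(w, cosine(query, vectors[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy 2-dimensional vectors, purely for illustration
toy = {
    "heart": [1.0, 0.1],
    "chair": [0.9, 0.3],
    "star": [-0.8, 0.9],
    "desk": [0.95, 0.2],
    "camping": [0.2, -0.5],
}
print(most_similar(toy, positive=["heart", "chair"], negative=["star"]))
```

With these toy vectors, "desk" (which points in a similar direction to "heart" and "chair") outranks "camping", mirroring the kind of output shown in the response example below.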
Responds with a list of all the current API endpoints.
{
"Endpoints": {
"/model": "accepts POST request containing keywords which returns related keywords"
}
}
Responds with an object containing a list of the two "most similar" words to the words submitted by the user.
The request object should have exactly two keys: `positive` and `negative`. Each key should have a value of a list of strings containing words accepted by the Word2Vec model. Empty arrays are accepted; however, both keys must always be present on the request object.
{
"positive": ["heart", "chair"],
"negative": ["star"]
}
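Assuming the API is running locally on Flask's default port, the example request above could be sent with curl (the URL here reflects that local setup):

```shell
curl -X POST http://localhost:5000/model \
  -H "Content-Type: application/json" \
  -d '{"positive": ["heart", "chair"], "negative": ["star"]}'
```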
The response object will have a key of `keywords`. The list returned will contain exactly two values: the two "most similar" words to the submitted words. Each word is returned with a corresponding value which represents its similarity to the positive words from the request object (and/or its dissimilarity to the negative words). This value ranges from -1 (not very similar) to 1 (very similar).
{
"keywords": [
[
"desk",
0.5458441972732544
],
[
"camping",
0.46846500039100647
]
]
}
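A client can unpack this response shape with the standard library alone; the JSON below is copied from the example response above:

```python
import json

# Example response body, as returned by the /model endpoint
body = """
{
  "keywords": [
    ["desk", 0.5458441972732544],
    ["camping", 0.46846500039100647]
  ]
}
"""

data = json.loads(body)

# Each entry pairs a recommended word with its similarity score
for word, score in data["keywords"]:
    print(f"{word}: {score:.3f}")

# The list is ordered, so the first entry is the strongest recommendation
best_word, best_score = data["keywords"][0]
```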
- A list of all accepted words for our Word2Vec API can be found in `model/word_list.txt`.
- ⚠️ Warning! This API focuses on the value of training custom, context-relevant word vectors as an exploration of the technology involved. Our dataset is therefore relatively small, and many words do not exist on our current word list.
- It is recommended to check which words are accepted by our API before use. If any word on either list is not accepted, our API will currently return a 400 Bad Request error.
- If you wish to see an example using a much larger (but less relevant) word list, see our previous repo, which contains an identical API but instead uses pretrained Wikipedia word vectors from GloVe.
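Because unknown words produce a 400 Bad Request, a client can filter its keywords against the word list before calling the API. A minimal sketch; the small in-memory word list here is invented for illustration, and a real client would load `model/word_list.txt` instead:

```python
def filter_accepted(keywords, accepted_words):
    """Drop any keyword the model does not know about,
    so the API call cannot fail with a 400 Bad Request."""
    accepted = set(accepted_words)
    return [word for word in keywords if word in accepted]

# In a real client, load the accepted words from the repo instead:
#   accepted_words = open("model/word_list.txt").read().split()
# A small invented list stands in for it here.
accepted_words = ["heart", "chair", "star", "desk", "camping"]

print(filter_accepted(["heart", "unicorn", "chair"], accepted_words))
# → ['heart', 'chair']
```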
Our Word2Vec function, API endpoints, and their respective error handling are tested using `pytest`, and our test files can be found within the `__tests__` folder. To run these tests for yourself, navigate into this folder (after following the API installation procedure) and run:
python3 -m pytest
This command can be followed by either file name (`test_app.py` or `test_model.py`) to run the tests from each file separately.
We've included all of our training files for training word vectors based on an eCommerce-specific dataset in the folder `/model_training`. The two Jupyter Notebooks in particular provide a useful explanation of our process. The resulting word vectors can be found in our API in the `/model` folder and are used in `model.py`.
If you wish to try this out for yourself, there are two stages to our process:
- Preparing and cleaning the dataset
- Training the word vectors using Word2Vec
- Follow steps 1-3 in the Getting Started section
- Note: You may wish to set up a separate directory and virtual environment for training the dataset. In this case, copy the `/model_training` folder into the new directory alongside the `requirements.txt` file and follow the Getting Started steps from there.
- Ensure you have downloaded the UCI Online Retail dataset found here and placed the `.xlsx` file within the `/model_training` folder.
To get started, follow the walk-through found in the first Jupyter Notebook, `/model_training/clean_data.ipynb`. We have also included a basic Python script, so you can complete the steps found in the notebook in one go using the command
python3 clean_data.py
within the `/model_training` directory.
- Expect to wait up to a few minutes when running scripts or commands in this section due to the size of the dataset.
- Check that the downloaded dataset has the same file name as the one used in the script (`Online_Retail.txt`).
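The notebook walks through the cleaning in detail. As a rough illustration of the kind of normalisation involved (not our exact script, which works on the `.xlsx` dataset), here is a toy version that lowercases item descriptions, strips punctuation and digits, and collapses whitespace; the sample rows are invented:

```python
import re

def clean_description(text):
    """Lowercase, replace non-letter characters with spaces,
    and collapse runs of whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())

# Invented rows standing in for the Description column of the dataset
rows = ["WHITE HANGING HEART T-LIGHT HOLDER", "SET OF 3 CAKE TINS PANTRY DESIGN"]
cleaned = [clean_description(row) for row in rows]
for line in cleaned:
    print(line)
```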
You should have generated a file called `cleaned_dataset.txt` inside the `/model_training` folder which contains the data in an appropriate format for training. Follow the walk-through in the second Jupyter Notebook, `/model_training/train_model.ipynb`, to train the word vectors using this data. Again, we've included a basic Python script for this so you can complete these steps in one go using the command:
python3 train_model.py
within the `/model_training` directory.
You should now have generated your own version of our word vector file, `ecommerce_vecs.txt`, inside `/model_training`.
- While the first step of cleaning the data is specific to our dataset, the second file for training word vectors should work for any corpus formatted similarly to `cleaned_dataset.txt` (i.e. a text file containing a list of sentences to train on).
- If you wish to use this new file for your API, replace our default file in the `/model` folder.
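Word vector files like this typically follow the plain-text word2vec format: an optional header line giving the vocabulary size and dimensionality, then one word per line followed by its vector components. Assuming that format (the header convention and the two sample lines below are illustrative, not taken from our actual file), a minimal stdlib-only loader might look like:

```python
def load_vectors(lines):
    """Parse word2vec text format: each line is a word followed by
    its vector components, separated by spaces."""
    vectors = {}
    for line in lines:
        parts = line.strip().split()
        if len(parts) < 3:
            continue  # skip a "count dim" header line and blanks
        word, values = parts[0], parts[1:]
        vectors[word] = [float(v) for v in values]
    return vectors

# Two invented 3-dimensional entries in the assumed format
sample = ["2 3", "heart 0.12 -0.40 0.88", "star -0.75 0.31 0.05"]
vectors = load_vectors(sample)
print(sorted(vectors))  # → ['heart', 'star']
```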
This API forms part of a short project completed during the Northcoders software development bootcamp in 2022 by My Favourite Team. Check out our project page with an app demo here.
- Teyah Brennen-Davies (LinkedIn|Github)
- Hannah Barber (LinkedIn|Github)
- Byron Esson (LinkedIn|Github)
- David Cobb (LinkedIn|Github)
- Niall Sexton (LinkedIn|Github)
- Rob Carter (LinkedIn|Github)
To train a set of custom eCommerce word vectors, we have used an online retail dataset from the UCI machine learning repository which can be downloaded here.
The following resources were particularly helpful in creating this project: