The EOSC Marketplace Recommender System uses deep reinforcement learning to suggest relevant scientific services to researchers on the EOSC Marketplace portal.
The recommender system works as a microservice and exposes an API to the Marketplace.
The inner structure consists of two elements:
- a web service part based on Celery and Flask, with an API created, documented and validated with Flask-RESTX and Swagger (a minimal sketch of this pattern is shown below)
- a deep reinforcement learning part based on PyTorch and other ML libraries
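For illustration, a minimal Flask-RESTX service following this pattern might look like the sketch below (illustrative only, not the project's actual code; the endpoint is hypothetical):

```python
# Minimal Flask-RESTX sketch (illustrative, not the project's actual code).
from flask import Flask
from flask_restx import Api, Resource

app = Flask(__name__)
api = Api(app, doc="/swagger/")  # Swagger UI is generated automatically

@api.route("/ping")  # hypothetical endpoint, for illustration only
class Ping(Resource):
    def get(self):
        """Return a trivial payload documented by Flask-RESTX."""
        return {"status": "ok"}

if __name__ == "__main__":
    app.run()
```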
All required project packages are listed in the `Pipfile`. For their installation, see the setup instructions below.
If you want to use a GPU with PyTorch, you need a CUDA-capable device.
- Install `git`, `python` and `pipenv`
- Clone this repository and go to its root directory:
  ```bash
  git clone https://github.com/cyfronet-fid/recommender-system.git
  ```
- Install all required project packages by executing:
  ```bash
  pipenv install --dev
  ```
- To open the project virtual environment shell, type:
  ```bash
  pipenv shell
  ```
Launch the EOSC Marketplace Recommender server by executing the following in the project root directory:
```bash
export FLASK_ENV=development
export FLASK_APP=app.py
pipenv run flask run
```
NOTE: You can customize the Flask host and port by using the `FLASK_RUN_HOST` and `FLASK_RUN_PORT` env variables accordingly.
To run background tasks you also need a Celery worker running alongside your server. To run the worker:
```bash
export FLASK_ENV=development
pipenv run celery -A worker:app worker
```
NOTE: Celery needs a running Redis broker server in the background.
NOTE: It is recommended that developers use docker-compose to run all the background servers (see the Docker section below).
The recommender system runs Celery to execute background tasks in a queue. As a backend, we are using Redis. By default, Redis runs on `redis://localhost:6379`.
NOTE: You can customize your Redis host URL using the `REDIS_HOST` env variable.
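For orientation, wiring a Celery app to this Redis broker looks roughly like the sketch below (a minimal sketch; the project's actual worker is defined in `worker.py`, as the `-A worker:app` flag above implies, and its real task set is not reproduced here):

```python
# Minimal Celery + Redis sketch (illustrative; the real worker lives in worker.py).
from celery import Celery

# Assumes the default local Redis broker mentioned above.
app = Celery("example", broker="redis://localhost:6379")

@app.task
def ping():
    # A trivial placeholder task; the real background tasks do the actual work.
    return "pong"
```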
Install and start the MongoDB server following the Mongo installation instructions. It should be running on the default URL `mongodb://localhost:27017`.
NOTE: You can customize your MongoDB host path using the `MONGODB_HOST` env variable.
You can interact with the recommender system microservice using the API available (by default) at: `http://localhost:5000/`
To run all background servers needed for development (Redis, MongoDB), it is recommended that you use Docker:
```bash
docker-compose up
```
Mongo will be exposed and available on your host on `127.0.0.1:27017`, and Redis on `127.0.0.1:6379`, although you can change them using the `MONGODB_HOST` and `REDIS_HOST` env variables accordingly.
NOTE: You still need to set up the Flask server and Celery worker as shown above. This is advantageous over the next option because you can run Pytest directly from your IDE, debug the application easily, restart the Flask server quickly, and avoid rebuilding your Docker image when dependencies change.
For a full-stack local development deployment, use:
```bash
docker-compose -f docker-compose.yml -f development.yml up
```
This will build application images and run the base Flask development server on `127.0.0.1:5000` (you can customize the Flask port and host using env variables).
This command will also run Celery worker, Mongo and Redis.
You can immediately change the server code without restarting the containers.
To run the Jupyter notebook server along with the application stack, run:
```bash
docker-compose -f docker-compose.yml -f jupyter.yml up
```
NOTE: The URL of the Jupyter server will be displayed in the docker-compose output (default: `http://127.0.0.1:8888/?token=SOME_JUPYTER_TOKEN`). You can customize the Jupyter port and host using env variables.
To run all the tests in our app, run:
```bash
export FLASK_ENV=testing
pipenv run pytest ./tests
```
...or you can run them using Docker:
```bash
docker-compose -f docker-compose.testing.yml up && docker-compose -f docker-compose.testing.yml down
```
You can curl the `/health` server endpoint to check the application's health. It checks for the following (see the example probe after this list):
- Database connection
- User and Service tables row count (has to be at least 10 each)
- Celery workers connection
- JMS connection
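For example, a quick probe from Python could look like this (a minimal sketch using the `requests` library; the exact shape of the response body is not specified here):

```python
import requests

# Probe the health endpoint of a locally running recommender instance.
response = requests.get("http://localhost:5000/health")

# A 2xx status means the checks above passed; the response body format
# is not documented here, so just print it for inspection.
print(response.status_code, response.text)
```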
Recommendation engines ensure that both logged-in and non-logged-in users receive relevant recommendations.
These engines use artificial intelligence algorithms to make recommendations based on a user's interests, the context of the search, and all of that user's and other users' behaviours on the portal.
For an anonymous user, the following engine is available:
- Random

This engine employs statistical methods to provide an anonymous user with the best possible recommendations. It can determine, for example, which services are most commonly visited or purchased by other users in a specific search context. That way, even if a user is anonymous, our recommender system can still provide valuable recommendations.
The recommender system can use one of two recommendation engines implemented for logged-in users:
- `NCF` - based on the Neural Collaborative Filtering paper.
- `RL` - based on the Deep Deterministic Policy Gradient paper.
There are two ways in which the recommender system can be trained.
- The first method is to send a database dump to the RS `/update` endpoint. This may be done, for example, by triggering the `./bin/rails recommender:update` task on the Marketplace side, which sends a database dump to the `/update` endpoint of the Recommender System. The endpoint receives the most current training data, preprocesses it, and uses it to train the required models (see the sketch after this list).
- The second method is to use Flask commands:
  - `flask train all` - the equivalent of training via the `/update` endpoint - triggers the training of each pipeline,
  - `flask train ae` - triggers the training of the autoencoders pipeline,
  - `flask train embedding` - triggers the training of embeddings,
  - `flask train ncf` - triggers the training of the NCF pipeline (provides 3 recommendations),
  - `flask train rl` - triggers the training of the RL pipeline (provides 3 recommendations).
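For reference, the first method boils down to an HTTP POST of the dump (a sketch only: the internal structure of the dump is defined by the Marketplace task and is not reproduced here, so `dump.json` is a placeholder):

```python
import json
import requests

# "dump.json" is a placeholder for a database dump produced on the
# Marketplace side; its internal structure is not documented here.
with open("dump.json") as f:
    dump = json.load(f)

# POST the dump to the RS /update endpoint to trigger preprocessing
# and training of the required models.
response = requests.post("http://localhost:5000/update", json=dump)
response.raise_for_status()
```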
GPU support can be enabled using the environment variable `TRAINING_DEVICE` (look into the ENV variables section).
After training is finished, the system is immediately ready for serving recommendations (no manual reloading is needed).
To specify the engine from which recommendations are requested, provide an `engine_version` parameter inside the body of the `/recommendations` request. `NCF` denotes the NCF engine, while `RL` indicates the RL engine. There is also `NCFRanking` (which uses the `NCF` engine under the hood), but when requested it returns the whole ranking of services rather than the top K recommended services.
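As an illustration, requesting recommendations from a specific engine could look like the sketch below (only `engine_version` is documented above; the remaining body fields are placeholders, so consult the Swagger docs for the full request schema):

```python
import requests

# Hypothetical request body: only engine_version is documented above;
# the remaining context fields are placeholders (see the Swagger docs).
payload = {
    "engine_version": "RL",  # or "NCF" / "NCFRanking"
    # ... other context fields required by the /recommendations schema ...
}

response = requests.post("http://localhost:5000/recommendations", json=payload)
print(response.json())
```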
It is possible to define which algorithm should be used by default, in the absence of the `engine_version` parameter, by modifying the `DEFAULT_RECOMMENDATION_ALG` parameter in the `.env` file (look into the ENV variables section).
List of available engines (after training):
- `NCF` - returns `K` recommendations from the given context.
- `RL` - returns `K` recommendations from the given context.
- `Random` - returns `K` random recommendations from the given context.
- `NCFRanking` - used for sorting by relevance.
- `RandomRanking` - sorts resources randomly (note: only available in the development and testing environments).
Our recommender, like other systems, requires data to perform properly. Several prepared commands can be used to generate such data:
- `flask seed seed` - seeds the database with any number of synthetic users and services; the exact number can be adjusted in the seed source,
- `flask seed seed_faker` - analyses the users and services from the current database and produces documents that later make it possible to generate more realistic synthetic users and services.
We also provide commands to manipulate the database:
- `flask db drop_mp` - drops the documents from the RS database that were sent by the MP database dump,
- `flask db drop_models` - drops machine learning models from the `m_l_component` collection,
- `flask db regenerate_sarses` - based on new user actions, adds new SARSes and regenerates existing ones that are deprecated.
We are using MongoDB as our database, which is a NoSQL, schema-less, document-based DB. However, we are also using `mongoengine` - an ODM (Object Document Mapping) which defines a "schema" for each document (like specifying field names or required values).
This means that we need a minimalistic migration system to apply the defined "schema" changes, like changing a field name or dropping a collection, if we want to maintain Application <=> DB integrity.
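To make the "schema" idea concrete, a mongoengine document definition looks roughly like this (the field names and DB name are hypothetical, not the project's actual models):

```python
# Illustrative mongoengine "schema"; field names are hypothetical.
from mongoengine import Document, IntField, StringField, connect

connect(host="mongodb://localhost:27017/recommender")  # DB name assumed

class Service(Document):
    # mongoengine enforces these constraints on an otherwise schema-less DB.
    name = StringField(required=True)
    visits = IntField(default=0)
```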
Migration Flask CLI commands (first set the `FLASK_ENV` variable to either `development` or `production`):
- `flask migrate apply` - applies migrations that have not been applied yet
- `flask migrate rollback` - reverts the previously applied migration
- `flask migrate list` - lists all migrations along with their application status
- `flask migrate check` - checks the integrity of the migrations - whether the migration files match the DB migration cache
- `flask migrate repopulate` - deletes the migration cache and repopulates it with all the migrations defined in the `/recommender/migrate` dir
To create a new migration:
- In the `/recommender/migrations` dir:
  - Create a Python module with a name `YYYYMMDDHHMMSS_migration_name` (e.g. `20211126112815_remove_unused_collections`)
  - In this module, create a migration class (with an arbitrary name) which inherits from `BaseMigration`
  - Implement the `up` (application) and `down` (teardown) methods, using `self.pymongo_db` (pymongo, a low-level adapter for MongoDB, connected to the proper recommender DB instance, dependent on the `FLASK_ENV` variable)

(See existing files in the `/recommender/migrate` dir for a more detailed example, and the skeleton sketched below.)
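A skeleton following these steps might look like the sketch below (the import path for `BaseMigration` is an assumption, so copy it from an existing migration module; the collection name is illustrative):

```python
# 20211126112815_remove_unused_collections.py
# NOTE: the import path below is an assumption; mirror an existing module.
from recommender.migrate.base_migration import BaseMigration


class RemoveUnusedCollections(BaseMigration):
    def up(self):
        # Apply the change using the low-level pymongo handle.
        self.pymongo_db.drop_collection("unused_collection")

    def down(self):
        # Teardown: dropped data cannot be restored, so recreate the
        # collection empty to keep the migration structurally reversible.
        self.pymongo_db.create_collection("unused_collection")
```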
DO NOT DELETE EXISTING MIGRATION FILES. DO NOT CHANGE EXISTING MIGRATION FILE NAMES. DO NOT MODIFY THE CODE OF EXISTING MIGRATION FILES
(If you performed any of those actions, run `flask migrate check` to determine what went wrong.)
The essential components of the recommendation system are also documented in our repository.
We are using .env to store instance-specific constants or secrets. This file is not tracked by git and it needs to be present in the project root directory. Details:
- `MONGODB_HOST` - URL and port of your running MongoDB server (example: `127.0.0.1:27018`), or the desired URL and port of your MongoDB server when it is run using docker-compose (recommended)
- `REDIS_HOST` - URL and port of your running Redis server (example: `127.0.0.1:6380`), or the desired URL and port of your Redis server when it is run using docker-compose (recommended)
- `FLASK_RUN_HOST` - desired URL of your application server (example: `127.0.0.1`)
- `FLASK_RUN_PORT` - desired port of your application server (example: `5001`)
- `JUPYTER_RUN_PORT` - desired port of your Jupyter server when run using Docker (example: `8889`)
- `JUPYTER_RUN_HOST` - desired host of your Jupyter server when run using Docker (example: `127.0.0.1`)
- `CELERY_LOG_LEVEL` - log level of your Celery worker when run using Docker (one of: `CRITICAL`, `ERROR`, `WARN`, `INFO` or `DEBUG`)
- `SENTRY_DSN` - the DSN tells Sentry where to send the events (example: `https://16f35998712a415f9354a9d6c7d096e6@o556478.ingest.sentry.io/7284791`). If this variable does not exist, Sentry will just not send any events.
- `SENTRY_ENVIRONMENT` - environment name - optional, can be a free-form string. If not specified and using Docker, it is set to `development`/`testing`/`production` according to the Docker environment.
- `SENTRY_RELEASE` - human-readable release name - optional, can be a free-form string. If not specified, Sentry automatically sets it based on the commit revision number.
- `TRAINING_DEVICE` - the device used for training the neural networks: `cuda` for GPU support or `cpu` (note: `cuda` support is experimental and works only in the `neural_cf` Jupyter notebook - not in the recommender dev/prod/test environment)
- `DEFAULT_RECOMMENDATION_ALG` - the version of the recommender engine (one of `NCF`, `RL`, `random`). Whenever request handling or a Celery task needs this variable, it is dynamically loaded from the `.env` file, so you can change it during Flask server runtime.
- `RS_DATABUS_HOST` - the address of your JMS provider (default: `127.0.0.1`)
- `RS_DATABUS_USERNAME` - your login to the JMS provider (default: `admin`, a placeholder for the real username)
- `RS_DATABUS_PASSWORD` - your password to the JMS provider (default: `admin`, a placeholder for the real password)
- `RS_DATABUS_PORT` - the port of your JMS provider (default: `61613`)
- `RS_DATABUS_SUBSCRIPTION_TOPIC` - the topic on which the subscriber listens to JMS (default: `/topic/user_actions`)
- `RS_DATABUS_PUBLISH_TOPIC` - the name of the topic on which to publish the recommendations on the JMS host (default: `/topic/recommendations`)
- `RS_DATABUS_SUBSCRIPTION_ID` - subscription id of the JMS subscriber (optional)
- `RS_DATABUS_SSL` - whether to use SSL when connecting to JMS (default: `1`; accepted values: `0` or `1`, `yes` or `no`)
- `TEST_RS_DATABUS_HOST` - same as `RS_DATABUS_HOST` but used when testing via `pytest` (default: `127.0.0.1`)
- `TEST_RS_DATABUS_PORT` - same as `RS_DATABUS_PORT` but used when testing via `pytest` (default: `61613`)
- `TEST_RS_DATABUS_USERNAME` - same as `RS_DATABUS_USERNAME` but used when testing via `pytest` (default: `admin`, the default username for the stomp docker image)
- `TEST_RS_DATABUS_PASSWORD` - same as `RS_DATABUS_PASSWORD` but used when testing via `pytest` (default: `admin`, the default password for the stomp docker image)
- `TEST_RS_DATABUS_SUBSCRIPTION_TOPIC` - same as `RS_DATABUS_SUBSCRIPTION_TOPIC` but used when testing via `pytest` (default: `topic/user_actions_test`)
- `TEST_RS_DATABUS_PUBLISH_TOPIC` - same as `RS_DATABUS_PUBLISH_TOPIC` but used when testing via `pytest` (default: `topic/recommendations_test`)
NOTE: All the above variables have reasonable defaults, so if you want, you can just leave your .env file empty.
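For instance, a minimal `.env` overriding a few defaults could look like this (values taken from the examples above):

```
MONGODB_HOST=127.0.0.1:27018
REDIS_HOST=127.0.0.1:6380
FLASK_RUN_HOST=127.0.0.1
FLASK_RUN_PORT=5001
DEFAULT_RECOMMENDATION_ALG=NCF
```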
There is a Flask CLI command to run a JMS subscription, which connects to the databus and consumes user actions. It can be run with the following command:
```bash
flask subscribe --host 127.0.0.1 --port 61613 --username guest --password guest
```
For all available options, run:
```bash
flask subscribe --help
```
All arguments to `subscribe` can be read from environment variables (see the section about env variables above).
To activate pre-commit, run:
```bash
pipenv run pre-commit install
```
Install the EnvFile plugin. Go to the run configuration of your choice, switch to the `EnvFile` tab, check `Enable EnvFile`, click the `+` button below, select the `.env` file and click `Apply` (details on the plugin's page).
In PyCharm, go to `Settings` -> `Tools` -> `Python Integrated Tools` -> `Testing` and choose `pytest`.
Remember to put the `FLASK_ENV=testing` env variable in the configuration.
While committing using the PyCharm Git GUI, pre-commit doesn't use the project environment and can't find the modules used in hooks.
To fix this, go to the `.git/hooks/pre-commit` file generated by the above command in the project directory and replace:
```python
# start templated
INSTALL_PYTHON = 'PATH/TO/YOUR/ENV/EXECUTABLE'
```
with:
```python
# start templated
INSTALL_PYTHON = 'PATH/TO/YOUR/ENV/EXECUTABLE'
os.environ['PATH'] = f'{os.path.dirname(INSTALL_PYTHON)}{os.pathsep}{os.environ["PATH"]}'
```
Sentry is integrated with the Flask server and the Celery task queue manager, so all unhandled exceptions from these entities will be tracked and sent to Sentry.
Customization of the Sentry integration can be done via environment variables (look into the ENV variables section) - you can read more about them here.