EOSC Marketplace Recommender System

EOSC Marketplace Recommender System uses Deep Reinforcement Learning to suggest relevant scientific services to appropriate researchers on the EOSC Marketplace portal.

Architecture

The recommender system works as a microservice and exposes an API to the Marketplace.

The inner structure can be described as two elements:

  • a web service part based on Celery and Flask, with the API created, documented and validated with Flask-RESTX and Swagger
  • a deep reinforcement learning part based on PyTorch and other ML libraries

Development environment

Requirements

All required project packages are listed in the Pipfile. For installation instructions, see the Setup section. To use a GPU with PyTorch, you need a CUDA-capable device.

Setup

  1. Install git, python and pipenv
  2. Clone this repository and go to its root directory
git clone https://github.com/cyfronet-fid/recommender-system.git
  3. Install all required project packages by executing
pipenv install --dev
  4. To open the project virtual environment shell, type:
pipenv shell

Server

Launch EOSC Marketplace Recommender server by executing in the project root directory:

export FLASK_ENV=development
export FLASK_APP=app.py
pipenv run flask run

NOTE: You can customize the Flask host and port using the FLASK_RUN_HOST and FLASK_RUN_PORT env variables respectively.

Celery

To run background tasks you also need a celery worker running alongside your server. To run the worker:

export FLASK_ENV=development
pipenv run celery -A worker:app worker

NOTE: Celery needs a Redis broker server running in the background.

Redis

NOTE: It is recommended for the developers to use docker-compose to run all the background servers (see docker section below).

The recommender system uses Celery to execute background tasks in a queue, with Redis as the backend. By default, Redis runs on redis://localhost:6379.

NOTE: You can customize your Redis host URL using REDIS_HOST env variable.

Mongo

NOTE: It is recommended for the developers to use docker-compose to run all the background servers (see docker section below).

Install and start the MongoDB server following the Mongo installation instructions. It should be running on the default URL mongodb://localhost:27017.

NOTE: You can customize your MongoDB host path in the MONGODB_HOST env variable.

API

You can interact with the recommender system microservice using its API, available by default at: http://localhost:5000/

Docker

To run all background servers needed for development (Redis, MongoDB) it is recommended that you use Docker:

docker-compose up

Mongo will be exposed on your host at 127.0.0.1:27017 and Redis at 127.0.0.1:6379. You can change these addresses using the MONGODB_HOST and REDIS_HOST env variables respectively.

NOTE: You still need to set up Flask server and Celery worker as shown above. This is advantageous over the next option because you can run Pytest directly from your IDE, debug the application simply, restart Flask server easily, and you also avoid having to rebuild your docker image if your dependencies change.

For full-stack local development deployment use:

docker-compose -f docker-compose.yml -f development.yml up

This will build application images and run the base Flask development server on 127.0.0.1:5000 (you can customize flask port and host using env variables). This command will also run Celery worker, Mongo and Redis. You can immediately change the server code without restarting the containers.

To run the Jupyter notebook server along with the application stack run:

docker-compose -f docker-compose.yml -f jupyter.yml up

NOTE: The URL of the Jupyter server will be displayed in the docker-compose output (default: http://127.0.0.1:8888/?token=SOME_JUPYTER_TOKEN). You can customize the Jupyter port and host using env variables.

Tests

To run all the tests in our app run:

export FLASK_ENV=testing
pipenv run pytest ./tests

...or you can run them using docker:

docker-compose -f docker-compose.testing.yml up && docker-compose -f docker-compose.testing.yml down

Deep health check

You can curl the /health server endpoint to check the application's health. It checks:

  • Database connection
  • User and Service tables row count (has to be at least 10 each)
  • Celery workers connection
  • JMS connection
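The checks listed above can be pictured as a set of independent conditions that must all hold. The following is a minimal sketch under stated assumptions: the `HealthInputs` and `health_report` names and the shape of the returned dictionary are hypothetical and do not describe the actual /health response. Only the four checks and the minimum row count of 10 come from this README.

```python
from dataclasses import dataclass

# Minimum number of User and Service rows, as documented above.
MIN_ROWS = 10


@dataclass
class HealthInputs:
    """Raw facts gathered by the (hypothetical) individual probes."""

    database_connected: bool
    user_count: int
    service_count: int
    celery_workers_connected: bool
    jms_connected: bool


def health_report(inputs: HealthInputs) -> dict:
    """Combine the individual checks into one overall status."""
    checks = {
        "database": inputs.database_connected,
        "user_rows": inputs.user_count >= MIN_ROWS,
        "service_rows": inputs.service_count >= MIN_ROWS,
        "celery": inputs.celery_workers_connected,
        "jms": inputs.jms_connected,
    }
    # The application is considered healthy only if every check passes.
    return {"healthy": all(checks.values()), "checks": checks}


if __name__ == "__main__":
    print(health_report(HealthInputs(True, 25, 9, True, True)))
```

In this sketch a single failing check, such as fewer than 10 services, marks the whole report as unhealthy while still showing which check failed.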

Recommendation Engines

Recommendation engines ensure that both logged in and non-logged in users receive relevant recommendations.

Engines for logged-in users (training required):

  • NCF
  • RL

These engines use artificial intelligence algorithms to make recommendations based on a user's interests, the context of the search, and all of that user's and other users' behaviours on the portal.

Engines for anonymous users (no training required):

  • Random

These engines employ statistical methods to provide an anonymous user with the best possible recommendations. For example, an anonymous engine can indicate which services are most commonly visited or purchased by other users in a specific search context. This way, even an anonymous user receives valuable recommendations.

Training

The recommender system can use one of two recommendation engines implemented for logged-in users: NCF or RL.

There are two ways in which the recommender system can be trained.

  1. The first method is to send a database dump to the RS /update endpoint, for example by triggering the ./bin/rails recommender:update task on the Marketplace side. The task sends the most current training data to the Recommender System, which preprocesses it and uses it to train the required models.

  2. The second method is to use Flask commands:

  • flask train all - the equivalent of training via endpoint /update - triggers the training of each pipeline,
  • flask train ae - triggers the training of autoencoders pipeline,
  • flask train embedding - triggers the training of embeddings,
  • flask train ncf - triggers the training of NCF pipeline (provides 3 recommendations),
  • flask train rl - triggers the training of RL pipeline (provides 3 recommendations).

GPU support can be enabled using the TRAINING_DEVICE environment variable (see the ENV variables section).
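As an illustration of how a device setting like TRAINING_DEVICE might be interpreted, here is a small sketch. The `training_device` helper, the `cpu` default and the fallback to `cpu` when CUDA is unavailable are assumptions made for this example, not the project's actual behaviour. In real PyTorch code the availability flag would typically come from `torch.cuda.is_available()`.

```python
import os
from typing import Mapping, Optional

# The two values documented for TRAINING_DEVICE.
ALLOWED_DEVICES = ("cpu", "cuda")


def training_device(
    env: Optional[Mapping[str, str]] = None,
    cuda_available: bool = False,
) -> str:
    """Pick the training device name from an environment mapping.

    Hypothetical helper: defaults to "cpu" (an assumption) and falls back
    to "cpu" when "cuda" is requested but no CUDA device is available.
    """
    env = os.environ if env is None else env
    requested = env.get("TRAINING_DEVICE", "cpu").strip().lower()
    if requested not in ALLOWED_DEVICES:
        raise ValueError(f"Unsupported TRAINING_DEVICE: {requested!r}")
    if requested == "cuda" and not cuda_available:
        return "cpu"
    return requested


if __name__ == "__main__":
    # With PyTorch installed, cuda_available could be torch.cuda.is_available().
    print(training_device({"TRAINING_DEVICE": "cuda"}, cuda_available=False))
```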

After training is finished, the system is immediately ready for serving recommendations (no manual reloading is needed).

To specify which engine should serve the recommendations, provide an engine_version parameter in the body of the /recommendations request. NCF denotes the NCF engine, while RL indicates the RL engine. There is also NCFRanking, which uses the NCF engine under the hood but, when requested, returns the whole ranking of services rather than the top K recommended services. In the absence of the engine_version parameter, the default algorithm is taken from the DEFAULT_RECOMMENDATION_ALG parameter in the .env file (see the ENV variables section).
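The precedence described here (request parameter first, then the configured default) can be sketched as follows. The `resolve_engine` function and the last-resort value `"RL"` are hypothetical and only illustrate the lookup order. The request body is shown as a plain dictionary, and fields other than engine_version are omitted because this README does not describe them.

```python
from typing import Mapping


def resolve_engine(
    body: Mapping[str, object],
    env: Mapping[str, str],
    last_resort: str = "RL",  # assumption: not documented in this README
) -> str:
    """Return the engine name for one /recommendations request."""
    requested = body.get("engine_version")
    if requested:
        # An explicit engine_version in the request body wins.
        return str(requested)
    # Otherwise use the configured default, if any.
    return env.get("DEFAULT_RECOMMENDATION_ALG", last_resort)


if __name__ == "__main__":
    env = {"DEFAULT_RECOMMENDATION_ALG": "NCF"}
    print(resolve_engine({"engine_version": "NCFRanking"}, env))
    print(resolve_engine({}, env))
```

Passing a freshly read environment mapping on every call mirrors the idea that the default can be changed while the server is running.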

Engines

List of available engines (after training):

  • NCF - returns K recommendations from the given context.
  • RL - returns K recommendations from the given context.
  • Random - returns K random recommendations from the given context.
  • NCFRanking - used for sort by relevance.
  • RandomRanking - sort resources randomly.
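The difference between the top-K engines and the ranking engines can be illustrated with a toy example. The score dictionary below is made up, and the `rank_services` and `top_k` functions are hypothetical. In the real system the scores come from the trained models, and this sketch does not reproduce how they are computed.

```python
from typing import Dict, List


def rank_services(scores: Dict[int, float]) -> List[int]:
    """Return every service id, best score first (ranking-style output).

    Ties are broken by service id to keep the order deterministic.
    """
    return sorted(scores, key=lambda service_id: (-scores[service_id], service_id))


def top_k(scores: Dict[int, float], k: int) -> List[int]:
    """Return only the K best service ids (top-K-style output)."""
    return rank_services(scores)[:k]


if __name__ == "__main__":
    # Hypothetical relevance scores for four services in one search context.
    scores = {101: 0.2, 102: 0.9, 103: 0.5, 104: 0.7}
    print(rank_services(scores))  # [102, 104, 103, 101]
    print(top_k(scores, 3))       # [102, 104, 103]
```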

Seeding database

Note: Only available in development and testing environment.

Our recommender, like other systems, requires data to perform properly. Several prepared commands can be used to generate such data:

  • flask seed seed - seeds the database with any number of synthetic users and services. The exact number can be adjusted in seed,
  • flask seed seed_faker - analyzes the users and services in the current database and produces documents that later make it possible to generate more realistic synthetic users and services.

Managing database

We also provide commands to manipulate the database:

  • flask db drop_mp - drops the documents from the RS database which were sent by the MP database dump,
  • flask db drop_models - drops machine learning models from m_l_component collection,
  • flask db regenerate_sarses - adds new SARSes based on new user actions and regenerates existing ones that are deprecated.

Migrations

We are using MongoDB as our database, which is a NoSQL, schema-less, document-based DB. However, we are also using mongoengine - an ODM (Object Document Mapping), which defines a "schema" for each document (like specifying field names or required values). This means that we need a minimalistic migration system to apply the defined "schema" changes, like changing a field name or dropping a collection, if we want to maintain the Application <=> DB integrity.

Migration flask CLI commands (first set the FLASK_ENV variable to either development or production):

  • flask migrate apply - applies migrations that have not been applied yet
  • flask migrate rollback - reverts the previously applied migration
  • flask migrate list - lists all migrations along with their application status
  • flask migrate check - checks the integrity of the migrations - if the migration files match the DB migration cache
  • flask migrate repopulate - deletes migration cache and repopulates it with all the migrations defined in /recommender/migrate dir.

To create a new migration:

  1. In the /recommender/migrations dir, create a Python module named YYYYMMDDHHMMSS_migration_name (e.g. 20211126112815_remove_unused_collections)
  2. In this module, create a migration class (with an arbitrary name) which inherits from BaseMigration
  3. Implement the up (application) and down (teardown) methods using self.pymongo_db (pymongo, a low-level adapter for MongoDB, connected to the proper recommender DB instance, which depends on the FLASK_ENV variable)

(See existing files in the /recommender/migrate dir for a more detailed example.)
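To make the steps above more concrete, here is a minimal sketch of what such a module could look like. The `BaseMigration` class below is a local stand-in so that the example is self-contained; the project's real base class, its import path and its constructor may differ. The `user` collection, the field names and the `migration_module_name` helper are hypothetical. The only database call used is pymongo's `Collection.update_many` with the MongoDB `$rename` operator.

```python
from datetime import datetime


def migration_module_name(description: str, now: datetime) -> str:
    """Build a timestamp-prefixed module name such as
    20211126112815_remove_unused_collections."""
    return f"{now.strftime('%Y%m%d%H%M%S')}_{description}"


class BaseMigration:
    """Stand-in for the project's BaseMigration (assumption, not the real class)."""

    def __init__(self, pymongo_db):
        self.pymongo_db = pymongo_db

    def up(self):
        raise NotImplementedError

    def down(self):
        raise NotImplementedError


class RenameUserField(BaseMigration):
    """Hypothetical migration: rename a field in a hypothetical collection."""

    def up(self):
        # Apply the change: rename old_name to new_name in every document.
        self.pymongo_db["user"].update_many(
            {}, {"$rename": {"old_name": "new_name"}}
        )

    def down(self):
        # Revert the change so that a rollback restores the previous schema.
        self.pymongo_db["user"].update_many(
            {}, {"$rename": {"new_name": "old_name"}}
        )


if __name__ == "__main__":
    print(migration_module_name("rename_user_field", datetime.now()))
```

Keeping `down` as the exact inverse of `up` is what makes a rollback of the most recent migration safe.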

DO NOT DELETE EXISTING MIGRATION FILES. DO NOT CHANGE EXISTING MIGRATION FILE NAMES. DO NOT MODIFY THE CODE OF EXISTING MIGRATION FILES

(If you performed any of those actions, run flask migrate check to determine what went wrong.)

Documentation

The essential components of the recommendation system are also documented in our repository:

ENV variables

We are using .env to store instance-specific constants or secrets. This file is not tracked by git and it needs to be present in the project root directory. Details:

  • MONGODB_HOST - URL and port of your running MongoDB server (example: 127.0.0.1:27018) or desired URL and port of your MongoDB server when it is run using docker-compose (recommended)
  • REDIS_HOST - URL and port of your running Redis server (example: 127.0.0.1:6380) or desired URL and port of your Redis server when it is run using docker-compose (recommended)
  • FLASK_RUN_HOST - desired URL of your application server (example: 127.0.0.1)
  • FLASK_RUN_PORT - desired port of your application server (example: 5001)
  • JUPYTER_RUN_PORT - desired port of your Jupyter server when run using Docker (example: 8889)
  • JUPYTER_RUN_HOST - desired host of your Jupyter server when run using Docker (example: 127.0.0.1)
  • CELERY_LOG_LEVEL - log level of your Celery worker when run using Docker (one of: CRITICAL, ERROR, WARN, INFO or DEBUG)
  • SENTRY_DSN - The DSN tells the Sentry where to send the events (example: https://16f35998712a415f9354a9d6c7d096e6@o556478.ingest.sentry.io/7284791). If that variable does not exist, Sentry will just not send any events.
  • SENTRY_ENVIRONMENT - environment name - it's optional and it can be a free-form string. If not specified and using Docker, it is set to development/testing/production respectively to the docker environment.
  • SENTRY_RELEASE - human-readable release name - it's optional and it can be a free-form string. If not specified, Sentry automatically sets it based on the commit revision number.
  • TRAINING_DEVICE - the device used for training of neural networks: cuda for GPU support or cpu (note: cuda support is experimental and works only in Jupyter notebook neural_cf - not in the recommender dev/prod/test environment)
  • DEFAULT_RECOMMENDATION_ALG - the version of the recommender engine (one of NCF, RL, random). Whenever request handling or a Celery task needs this variable, it is dynamically loaded from the .env file, so you can change it while the Flask server is running.
  • RS_DATABUS_HOST - the address of your JMS provider (default: 127.0.0.1)
  • RS_DATABUS_USERNAME - your login to the JMS provider (default: admin, placeholder for the real login)
  • RS_DATABUS_PASSWORD - your password to the JMS provider (default: admin, placeholder for the real password)
  • RS_DATABUS_PORT - the port of your JMS provider (default: 61613)
  • RS_DATABUS_SUBSCRIPTION_TOPIC - topic on which the subscriber listens to JMS (default: /topic/user_actions)
  • RS_DATABUS_PUBLISH_TOPIC - name of the topic to publish the recommendations on the JMS host (default: /topic/recommendations)
  • RS_DATABUS_SUBSCRIPTION_ID - subscription id of the JMS subscriber (optional)
  • RS_DATABUS_SSL - whether to use SSL when connecting to JMS (default: 1; accepted values: 0 or 1, yes or no)
  • TEST_RS_DATABUS_HOST - same as RS_DATABUS_HOST but used when testing via pytest (default: 127.0.0.1)
  • TEST_RS_DATABUS_PORT - same as RS_DATABUS_PORT but used when testing via pytest (default: 61613)
  • TEST_RS_DATABUS_USERNAME - same as RS_DATABUS_USERNAME but used when testing via pytest (default: admin, default username for the stomp docker image)
  • TEST_RS_DATABUS_PASSWORD - same as RS_DATABUS_PASSWORD but used when testing via pytest (default: admin, default password for the stomp docker image)
  • TEST_RS_DATABUS_SUBSCRIPTION_TOPIC - same as RS_DATABUS_SUBSCRIPTION_TOPIC but used when testing via pytest (default: topic/user_actions_test)
  • TEST_RS_DATABUS_PUBLISH_TOPIC - same as RS_DATABUS_PUBLISH_TOPIC but used when testing via pytest (default: topic/recommendations_test)

NOTE: All the above variables have reasonable defaults, so if you want you can just have your .env file empty.
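The idea that an empty .env file still yields a working configuration can be sketched as follows. This is only an illustration: the parser below handles simple KEY=VALUE lines and is not how the application actually loads its configuration. The `parse_env_text`, `databus_settings` and `as_flag` names are hypothetical, while the default values and the accepted SSL flag values are the ones listed above.

```python
from typing import Dict

# Defaults documented in the list above.
DATABUS_DEFAULTS = {
    "RS_DATABUS_HOST": "127.0.0.1",
    "RS_DATABUS_PORT": "61613",
    "RS_DATABUS_SUBSCRIPTION_TOPIC": "/topic/user_actions",
    "RS_DATABUS_PUBLISH_TOPIC": "/topic/recommendations",
    "RS_DATABUS_SSL": "1",
}


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def databus_settings(env_text: str) -> Dict[str, str]:
    """Overlay values from the .env text on top of the documented defaults."""
    parsed = parse_env_text(env_text)
    return {key: parsed.get(key, default) for key, default in DATABUS_DEFAULTS.items()}


def as_flag(value: str) -> bool:
    """Interpret the documented flag values: 0 or 1, yes or no."""
    normalized = value.strip().lower()
    if normalized in ("1", "yes"):
        return True
    if normalized in ("0", "no"):
        return False
    raise ValueError(f"Not a valid flag value: {value!r}")


if __name__ == "__main__":
    settings = databus_settings("RS_DATABUS_PORT=61614\nRS_DATABUS_SSL=no\n")
    print(settings["RS_DATABUS_PORT"], as_flag(settings["RS_DATABUS_SSL"]))
```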

JMS Subscriber

There is a Flask CLI command that runs the JMS subscription, which connects to the databus and consumes user actions. Run it with the following command:

flask subscribe --host 127.0.0.1 --port 61613 --username guest --password guest 

For all available options run

flask subscribe --help

All arguments to subscribe can also be read from environment variables (see the ENV variables section above).

Pre-commit

To activate pre-commit run:

pipenv run pre-commit install

PyCharm Integrations

.env

Install EnvFile plugin. Go to the run configuration of your choice, switch to EnvFile tab, check Enable EnvFile, click + button below, select .env file and click Apply (Details on the plugin's page)

PyTest

In PyCharm, go to Settings -> Tools -> Python Integrated Tools -> Testing and choose pytest. Remember to put the FLASK_ENV=testing env variable in the configuration.

Pre-commit

While committing using PyCharm Git GUI, pre-commit doesn't use project environment and can't find modules used in hooks. To fix this, go to .git/hooks/pre-commit generated by the above command in the project directory and replace:

# start templated
INSTALL_PYTHON = 'PATH/TO/YOUR/ENV/EXECUTABLE'

with:

# start templated
INSTALL_PYTHON = 'PATH/TO/YOUR/ENV/EXECUTABLE'
os.environ['PATH'] = f'{os.path.dirname(INSTALL_PYTHON)}{os.pathsep}{os.environ["PATH"]}'

External tools integration

Sentry

Sentry is integrated with the Flask server and the Celery task queue manager, so all unhandled exceptions from these components are tracked and sent to Sentry. The Sentry integration can be customized via environment variables (see the ENV variables section) - you can read more about them here