Machine Learning Algorithm Data Exchanges

Because of the large amount of (non-intrusive) data we expect to collect through this project, we rely on Machine Learning (ML) algorithms to recommend to end users the training that will best help them acquire the skills required for their dream job.

This document describes the information we expect to provide to the ML algorithm and what we expect in return.

We expect insightful results from the ML algorithm on two topics, one major and one minor: training recommendations and job opportunity matching. At this time, job opportunity matching has not been developed.

Training recommendations can be requested based on a skill ID (we ask for the training sessions that best match the skill we are looking for) or on an occupation ID (an occupation is roughly a "career", defined by the set of skills necessary to enter it).

Initially, we envisioned calling the Machine Learning algorithm very frequently to retrain it (see the deprecated "Feeding training data to ML" section below), but this approach has been discarded. Instead, the model training process connects to the database, queries it, and trains on the new data it finds there. As such, the complexity of data collection has been delegated to the ML algo (search https://github.com/silkc/silkc-machine-learning for "data_aggregator.py" files).
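The pull-based approach can be illustrated with a minimal sketch. It does not reproduce the actual queries in "data_aggregator.py": the `user_training` table, its columns, and the `fetch_new_rows` helper are hypothetical, and an in-memory SQLite database stands in for the real application database purely so the example runs on its own.

```python
import sqlite3


def fetch_new_rows(conn, since):
    """Return rows added after `since` (ISO 8601 string), oldest first.

    Hypothetical query: the real schema lives in the application database.
    """
    cur = conn.execute(
        "SELECT user_id, training_id, score "
        "FROM user_training WHERE created_at > ? ORDER BY created_at",
        (since,),
    )
    return cur.fetchall()


# Stand-in database with two rows, one older and one newer than the last run.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_training "
    "(user_id INTEGER, training_id INTEGER, score INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO user_training VALUES (?, ?, ?, ?)",
    [
        (1, 10, 4, "2021-01-05T10:00:00"),
        (2, 11, 5, "2021-02-01T09:30:00"),
    ],
)
conn.commit()

# Only rows created after the previous training run are picked up.
new_rows = fetch_new_rows(conn, "2021-01-31T00:00:00")
```

The point of the design is that the application never pushes training data: the ML side decides when to query and what counts as new.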

Training recommendations

Obtaining insightful data back from ML

The ML algo is well suited to certain types of operations, but others should be managed by this application. For example, preferences set as filters in the search form should only be applied to the results returned by the ML algo. The ML should not try to understand those filters, as they cannot be applied to generic data (and learning from generic data is the objective of training a model).
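As an illustration of this split of responsibilities, the sketch below applies search-form filters on the application side, after the ML results come back. The `Recommendation` shape and the filter names (`max_cost`, `online_only`, `max_duration_hours`) are assumptions made for the example, not the actual response format of the ML algo.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recommendation:
    """One training returned by the ML algo (hypothetical shape)."""
    training_id: int
    score: float          # relevance assigned by the ML algo
    online: bool
    cost: float
    duration_hours: int


def apply_filters(
    results: List[Recommendation],
    max_cost: Optional[float] = None,
    online_only: bool = False,
    max_duration_hours: Optional[int] = None,
) -> List[Recommendation]:
    """Drop results that violate the user's filters, keeping the ML ordering."""
    kept = []
    for rec in results:
        if max_cost is not None and rec.cost > max_cost:
            continue
        if online_only and not rec.online:
            continue
        if max_duration_hours is not None and rec.duration_hours > max_duration_hours:
            continue
        kept.append(rec)
    return kept
```

Because the filters never reach the model, changing or adding a filter in the search form requires no retraining.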

The application should also check whether the user has already acquired the skill or occupation being searched for. If so, there is probably little point in searching for training to acquire it.
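A minimal sketch of that check, assuming the application already holds the IDs of the skills (or occupations) the user has acquired. The function name and parameters are hypothetical.

```python
from typing import Iterable


def should_query_ml(target_id: int, acquired_ids: Iterable[int]) -> bool:
    """Return True when the user does not yet hold the searched skill or occupation."""
    return target_id not in set(acquired_ids)
```

When this returns `False`, the application can skip the call to the ML algo and tell the user the skill or occupation is already on their profile.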

Finally, the application should send the skill or occupation we are looking for (a numerical ID) together with some details about the user running the search. These details allow the ML algo to look for similarities with other users.

The following user data should be sent:

skill or occupation id
date_of_birth (of the user launching the search)
address (this is only a city name, country name)
up_to_distance (in km; although the ML will not locate cities, knowing how far a person is willing to travel may reveal affinities with other users who set the same value)
professional_experience (indicates how far along the user is in their professional career, to find affinities with other users)
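The fields listed above could be assembled into a request payload along these lines. This is only a sketch: the key names `skill_id` and `occupation_id`, the ISO date format, the "city, country" string layout, and the `build_recommendation_request` helper are assumptions, and `professional_experience` is taken to be a number of years, as described in the deprecated user data table further down.

```python
import json
from datetime import date


def build_recommendation_request(
    target_kind: str,
    target_id: int,
    date_of_birth: date,
    city: str,
    country: str,
    up_to_distance: int,
    professional_experience: int,
) -> dict:
    """Build the data sent to the ML algo for a training recommendation."""
    if target_kind not in ("skill", "occupation"):
        raise ValueError("target_kind must be 'skill' or 'occupation'")
    return {
        f"{target_kind}_id": target_id,           # assumed key name
        "date_of_birth": date_of_birth.isoformat(),
        "address": f"{city}, {country}",          # city and country only
        "up_to_distance": up_to_distance,         # in km
        "professional_experience": professional_experience,
    }


# Example with made-up values.
payload = build_recommendation_request(
    "skill", 123, date(1990, 5, 17), "Brussels", "Belgium", 50, 8
)
print(json.dumps(payload, indent=2))
```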

Deprecated: Feeding training data to ML

Info	Description (optional)
Training ID	Unique ID from the SILKC system
Training name	Name of the training (not sure this is necessary)
Skills required	Array of skills (English names or URIs) required in order to enter the training (the user MUST have those to qualify)
Skills acquired	Array of skills (English names or URIs) to be acquired by following the training.
Location	Latitude, Longitude of the training (affected by online training circumstances)
Online	Boolean value indicating if the training is exclusively online (true). False by default.
Cost	Float value. At this point, a higher price lowers the preference for this training, but this might change in the future to become a more precise indicator based on the price of other trainings with similar skills acquired or required.
Duration in hours	Integer value. At this point, we consider a higher duration as lowering the preference for this training. Might change in the future.
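For reference only, since this exchange is deprecated, the table above maps naturally onto a record like the one sketched here. The class and attribute names are hypothetical. Only the field list, the types, and the `False` default for online training come from the table.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class TrainingRecord:
    """One training as it would have been sent to the ML algo (hypothetical names)."""
    training_id: int                  # unique ID from the SILKC system
    name: str
    skills_required: List[str]        # English names or URIs
    skills_acquired: List[str]        # English names or URIs
    location: Tuple[float, float]     # (latitude, longitude)
    cost: float
    duration_hours: int
    online: bool = False              # True only for exclusively online training


# Example with made-up values.
record = TrainingRecord(
    training_id=1,
    name="Introduction to welding",
    skills_required=[],
    skills_acquired=["welding"],
    location=(50.85, 4.35),
    cost=100.0,
    duration_hours=12,
)
record_as_dict = asdict(record)
```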

Deprecated: Feeding user data to ML

Data sent as a request to the inference API.

Based on the information we collect from the final users, we can provide the ML algorithm with the following information.

Info	Description (optional)
User ID	An integer
Year of birth	An integer year
Coordinates of residency	As latitude, longitude coordinates
Acceptable commute distance	(`up_to_distance` field) The number of kilometers from his/her residence that the user is willing to travel for work or training
Skills acquired through jobs	Array of skills references (could be provided either as English name or URI) the user has acquired through work. This list is considered as very reliable, as previous work experience can usually be verified relatively easily.
Skills acquired through training	Array of skills references (could be provided either as English name or URI) the user has acquired through training. This list is considered as reliable, as previous training experience can usually be proven, although not as easily verified as work experience.
Skills personally reported as acquired	Array of skills references (could be provided either as English name or URI) the user has acquired through other media and the user is, himself/herself, reporting as acquired. Due to the non-correlation with previous reported training or work experience, we consider this list as being of lesser reliability than the rest.
Previous occupations	Array of occupations (could be provided either as English name or URI) the user has had.
Current occupation	Array of occupations (could be provided either as English name or URI) the user has at the moment. For now, there is only one item in the array (we consider only one current occupation)
Training(s) followed	Array of internal training IDs from the SILKC application.
Score given to followed training	Array of followed training (by ID) with a score (1 to 5) expressing a preference of the user towards one training or the other. We don't consider this preference to be a very strong differentiator, but want to include it as a stronger future differentiator (when there is a huge amount of data)
Dream occupation	The English name or URI of the dream occupation (the final goal) of the user
Dream occupation skills	Array of skills (either English names or URIs) of the dream occupation. This element could be skipped if we otherwise store and maintain, in the ML, a match between occupation and skills.
Job openings	Details on job openings in this dream occupation. Array of job openings that match the dream occupation or the dream occupation's skills list. This array should also contain the location for the job opening, so that a match can be calculated in terms of distance from the residency.
Professional experience	Number of years since this person started his/her professional life. We assume this will act as a differentiator, over time, for recommended training, but not really at the beginning.
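The three skill lists in the table above carry different levels of reliability (jobs, then training, then self-reported). One way to preserve that distinction when merging them is sketched below, for reference only since this exchange is deprecated. The source labels and the `merge_skills` helper are hypothetical, and no numeric weights are implied: only the ordering comes from the table.

```python
from typing import Dict, Iterable

# Ordered from most to least reliable, as described in the table.
RELIABILITY_ORDER = ("job", "training", "self_reported")


def merge_skills(
    from_jobs: Iterable[str],
    from_training: Iterable[str],
    self_reported: Iterable[str],
) -> Dict[str, str]:
    """Map each skill to the most reliable source that reports it."""
    merged: Dict[str, str] = {}
    sources = (from_jobs, from_training, self_reported)
    for label, skills in zip(RELIABILITY_ORDER, sources):
        for skill in skills:
            # setdefault keeps the first (most reliable) source seen.
            merged.setdefault(skill, label)
    return merged
```

For example, a skill listed under both work experience and self-reporting is attributed to work experience, the more verifiable of the two.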

Job vacancies

Feeding data to ML

Info	Description (optional)
User ID
Year of birth
City and country of residency
Acceptable commute distance	(`up_to_distance` field) The number of kilometers from his/her residence that the user is willing to travel for work or training
Skills acquired through jobs
Skills acquired through training
Skills personally reported as acquired
Previous occupations
Current occupation
Training(s) followed	Training details provided separately?
Score given to followed training (expressing preference)
Dream occupation	{including skills required for that occupation}
Professional experience	Number of years since this person started his/her professional life
