Machine Learning Algorithm Data Exchanges

Yannick Warnier edited this page Mar 11, 2022 · 4 revisions

Because of the large amount of (non-intrusive) data we expect to collect through this project, we rely on Machine Learning (ML) algorithms to recommend to the final users the training best suited to acquiring the skills required for their dream job.

This document describes the information we expect to provide to the ML algorithm and what we expect in return.

We expect insightful results from the ML algorithm on two topics, one major and one minor: training recommendations and job opportunity matching. At this time, job opportunity matching has not been developed.

Training recommendations can be requested based on a skill ID (we ask for the training sessions that best match the skill we are looking for) or on an occupation ID (an occupation is roughly a "career", defined by the set of skills needed to enter it).

Initially, we envisioned calling the Machine Learning algorithm very frequently to retrain it (see the deprecated "Feeding training data to ML" section below), but this approach has been discarded. Instead, the model training connects to the database, issues queries against it, and trains on the new data it finds there. As such, the complexity of data collection has been delegated to the ML algo (search https://github.com/silkc/silkc-machine-learning for "data_aggregator.py" files).

Training recommendations

Obtaining insightful data back from ML

The ML algo is good for certain types of operations, but others should be managed by this application. For example, preferences set as filters in the search form should only be applied to the results returned by the ML algo. The ML should not try to understand those filters, as they cannot be applied to generic data (and working on generic data is the objective of training a model).
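As a minimal sketch of this split of responsibilities, the application could apply the search-form preferences to the list returned by the ML algo. The `Preferences` class and the field names (`online`, `cost`, `duration_hours`) are hypothetical and chosen for illustration only; they are not the actual SILKC schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Preferences:
    """Hypothetical search-form filters; None means 'no preference'."""
    online_only: bool = False
    max_cost: Optional[float] = None
    max_duration_hours: Optional[int] = None


def apply_preferences(recommendations, prefs):
    """Filter ML results in the application, preserving the ML ranking order."""
    kept = []
    for training in recommendations:
        if prefs.online_only and not training.get("online", False):
            continue
        if prefs.max_cost is not None and training.get("cost", 0.0) > prefs.max_cost:
            continue
        if (prefs.max_duration_hours is not None
                and training.get("duration_hours", 0) > prefs.max_duration_hours):
            continue
        kept.append(training)
    return kept


# Made-up results, shaped as they might come back from the ML algo (assumption).
ml_results = [
    {"id": 12, "online": True, "cost": 150.0, "duration_hours": 20},
    {"id": 7, "online": False, "cost": 90.0, "duration_hours": 40},
    {"id": 31, "online": True, "cost": 400.0, "duration_hours": 10},
]
filtered = apply_preferences(ml_results, Preferences(online_only=True, max_cost=200.0))
print([t["id"] for t in filtered])
```

Filtering after the fact keeps the model unaware of individual preferences, so the same trained model can serve every search.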

The application should also check whether the skill or occupation searched for has already been acquired by the user. If so, it might be useless to search for a training to acquire that skill or occupation.
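This check could be sketched as follows, assuming the application can load the IDs of the skills and occupations already associated with the user. The function name, the `search_type` values, and the way the user's IDs are passed in are all hypothetical.

```python
def already_acquired(search_type, search_id, user_skill_ids, user_occupation_ids):
    """Return True if the searched skill or occupation is already held by the user.

    search_type is "skill" or "occupation" (hypothetical values).
    """
    if search_type == "skill":
        return search_id in set(user_skill_ids)
    if search_type == "occupation":
        return search_id in set(user_occupation_ids)
    raise ValueError(f"unknown search type: {search_type!r}")


# Made-up IDs for illustration.
user_skills = [101, 102, 205]
user_occupations = [9]

if already_acquired("skill", 102, user_skills, user_occupations):
    print("Skill already acquired; the ML call can be skipped.")
```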

Finally, the application should send the skill or occupation we are looking for (a numerical ID) together with some details about the user running the search. These user details will allow the ML algo to look for similarities with other users.

The following user data should be sent:

  • skill or occupation id
  • date_of_birth (of the user launching the search)
  • address (this is only a city name, country name)
  • up_to_distance (in km; although the ML will not locate cities, knowing how far a person is willing to travel might hint at similarities with other users who set the same number)
  • professional_experience (indicates how much of a veteran the user is in his/her professional career, which helps find affinities with other users)
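The list above could translate into a request payload along these lines. This is only a sketch: the key names `skill_id` and `occupation_id`, the ISO date format, the "city, country" string for the address, and expressing `professional_experience` in years are assumptions, not the confirmed format expected by the ML algo.

```python
import json
from datetime import date


def build_request(search_type, search_id, date_of_birth, address,
                  up_to_distance, professional_experience):
    """Assemble the user data sent along with a skill or occupation search."""
    if search_type not in ("skill", "occupation"):
        raise ValueError(f"unknown search type: {search_type!r}")
    return {
        f"{search_type}_id": search_id,              # numerical ID being searched
        "date_of_birth": date_of_birth.isoformat(),  # assumed ISO 8601 format
        "address": address,                          # city name, country name only
        "up_to_distance": up_to_distance,            # in km
        "professional_experience": professional_experience,  # assumed: years
    }


# Made-up user details for illustration.
payload = build_request("occupation", 42, date(1990, 5, 17),
                        "Brussels, Belgium", 50, 8)
print(json.dumps(payload, indent=2))
```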

Deprecated: Feeding training data to ML

| Info | Description (optional) |
|---|---|
| Training ID | Unique ID from the SILKC system |
| Training name | Name of the training (not sure this is necessary) |
| Skills required | Array of skills (English names or URIs) required in order to enter the training (the user MUST have those to qualify) |
| Skills acquired | Array of skills (English names or URIs) to be acquired by following the training. |
| Location | Latitude, Longitude of the training (affected by online training circumstances) |
| Online | Boolean value indicating if the training is exclusively online (true). False by default. |
| Cost | Float value. At this point, a higher price lowers the preference for this training, but this might change in the future to become a more precise indicator based on the price of other trainings with similar skills acquired or required. |
| Duration in hours | Integer value. At this point, we consider a higher duration as lowering the preference for this training. Might change in the future. |
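For illustration, the fields above could be modelled as a small record type, together with the qualification rule on required skills. The class, attribute names, and example values are hypothetical; only the field meanings come from the table.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TrainingRecord:
    """Hypothetical container mirroring the (deprecated) training data fields."""
    training_id: int
    name: str
    skills_required: List[str] = field(default_factory=list)
    skills_acquired: List[str] = field(default_factory=list)
    location: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    online: bool = False                             # False by default
    cost: float = 0.0
    duration_hours: int = 0


def user_qualifies(training, user_skills):
    """The user MUST hold every required skill to enter the training."""
    return set(training.skills_required).issubset(user_skills)


# Made-up example data.
course = TrainingRecord(
    training_id=1,
    name="Advanced welding",
    skills_required=["welding"],
    skills_acquired=["TIG welding"],
    cost=300.0,
    duration_hours=24,
)
print(user_qualifies(course, ["welding", "first aid"]))
print(user_qualifies(course, ["first aid"]))
```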

Deprecated: Feeding user data to ML

Data sent as a request to the inference API.

Based on the information we collect from the final users, we can provide the ML algorithm with the following information.

| Info | Description (optional) |
|---|---|
| User ID | An integer |
| Year of birth | An integer year |
| Coordinates of residency | As latitude, longitude coordinates |
| Acceptable commute distance (up_to_distance field) | The number of kilometers from his/her residency that the user is willing to travel for work or training |
| Skills acquired through jobs | Array of skills references (could be provided either as English name or URI) the user has acquired through work. This list is considered as very reliable, as previous work experience can usually be verified relatively easily. |
| Skills acquired through training | Array of skills references (could be provided either as English name or URI) the user has acquired through training. This list is considered as reliable, as previous training experience can usually be proven, although not as easily verified as work experience. |
| Skills personally reported as acquired | Array of skills references (could be provided either as English name or URI) the user has acquired through other means and is reporting, himself/herself, as acquired. Because these skills are not correlated with previously reported training or work experience, we consider this list as being of lesser reliability than the rest. |
| Previous occupations | Array of occupations (could be provided either as English name or URI) the user has had. |
| Current occupation | Array of occupations (could be provided either as English name or URI) the user has at the moment. For now, there is only one item in the array (we consider only one current occupation) |
| Training(s) followed | Array of internal training IDs from the SILKC application. |
| Score given to followed training | Array of followed trainings (by ID), each with a score (1 to 5) expressing the user's preference for one training over another. We don't consider this preference to be a very strong differentiator for now, but want to include it so that it can become a stronger differentiator in the future (when there is a huge amount of data) |
| Dream occupation | The English name or URI of the dream occupation (the final goal) of the user |
| Dream occupation skills | Array of skills (either English names or URIs) of the dream occupation. This element could be skipped if we otherwise store and maintain, in the ML, a match between occupation and skills. |
| Job openings | Details on job openings in this dream occupation: an array of job openings that match the dream occupation or the dream occupation's skills list. This array should also contain the location of each job opening, so that a match can be calculated in terms of distance from the residency. |
| Professional experience | Number of years since this person started his/her professional life. We assume this will act as a differentiator, over time, for recommended training, but not really at the beginning. |
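The three skill lists above carry different levels of reliability. One possible way to combine them, shown here purely as a sketch, is to keep for each skill the most reliable source that reports it. The ordinal ranks (3, 2, 1) and the source labels are hypothetical; the document only states the relative order (jobs, then training, then self-reported).

```python
# Hypothetical ordinal ranks: a higher value means a more reliable source.
RELIABILITY = {"job": 3, "training": 2, "self_reported": 1}


def merge_skills(job_skills, training_skills, self_reported_skills):
    """Map each skill to the most reliable source that reports it."""
    merged = {}
    sources = (
        ("job", job_skills),
        ("training", training_skills),
        ("self_reported", self_reported_skills),
    )
    for source, skills in sources:
        for skill in skills:
            current = merged.get(skill)
            if current is None or RELIABILITY[source] > RELIABILITY[current]:
                merged[skill] = source
    return merged


# Made-up skill names for illustration.
merged = merge_skills(
    job_skills=["welding"],
    training_skills=["welding", "first aid"],
    self_reported_skills=["first aid", "spanish"],
)
print(merged)
```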

Job vacancies

Feeding data to ML

| Info | Description (optional) |
|---|---|
| User ID | |
| Year of birth | |
| City and country of residency | |
| Acceptable commute distance (up_to_distance field) | The number of kilometers from his/her residency that the user is willing to travel for work or training |
| Skills acquired through jobs | |
| Skills acquired through training | |
| Skills personally reported as acquired | |
| Previous occupations | |
| Current occupation | |
| Training(s) followed | Training details provided separately? |
| Score given to followed training | (expressing preference) |
| Dream occupation | {including skills required for that occupation} |
| Professional experience | Number of years since this person started his/her professional life |

Obtaining insightful data back from ML