G4.82 Revisit the use of scope by the Validator API #82

carolyncaron · 2024-06-07T20:32:44Z

Branch

g4.82-[shortDescriptor]

Groups

Group 2 - Data Importing, Group 4 - API | Services | Plugins

Describe

@laceysanderson and I have had discussions regarding the use of scope by the validators. Prior to PR #77 and #79, each validator was assigned its own scope, but breaking up the previous TraitImportValues validator led to 3 validators under the same scope, which was renamed from 'TRAIT IMPORT VALUES' to 'FILE ROW'. While this change works for the purpose of those PRs, we'd like to evaluate our use of scope in general, and if we need a new way to categorize validators to streamline their use by the Traits and Phenotype importers.

Here are some things to consider:

Changes to scope would impact the ValidatorBase class, and multiple tests
The validators not touched by PR# 77 & 79 use the class method validate() which handles the file for each validator. The new validators instead use validateRow() which does not interact with the file directly. It would make the most sense to not do any file iteration within any of the validators, and instead do it one time only in the formValidate() method of the importer. But we don't want to implement validateRow() for every validator that doesn't apply to every row of the input file.
We still want to categorize the validators based on the input they receive. The goal of this is the insulate each validator to access only the data it needs. This should negate the need for the loadAssets() method in the Validator Base class to provide some things that are unnecessary to every validator, such as Genus and Project.
Finally, if loadAssets() is removed, we need a way to replace the purpose of the $skip parameter. This is what allows the Trait Importer to skip the validators that follow that are in a different scope. (Example: if there is an issue with the type of file uploaded, we don't want validation to continue on the file contents.)

Design

Current scopes:

GENUS
FILE
HEADERS
FILE ROW

The text was updated successfully, but these errors were encountered:

laceysanderson · 2024-06-24T22:28:36Z

Regarding my original vision for scope:

When I first brought this up I was picturing scoping as a grouping mechanisms where importers could give the same context to all validators in the same scope and new ones could be easily added without changes being made in the importer per se. For example, I pictured a row level scope where all validators of this type would just be given a row and know what to do with it. In this way, if we later developed a new row level validator we could just add its identifier to an array and the importer would happily add it to the list of things to check.

My thoughts on the design I would like to see:

In hindsight, this concept might better be described as two variables. First an inputType which indicates what the importer should pass to the validator and perhaps a general category to group them further 🤔

inputType: the data to be passed into the validate method for the validator.

metadata: This inputType is meant to validate information entered by the user in the form. The importer could call a validateMetadata($form_values) function which takes the values of the form as an arguement. It should not open any files saved in the values. An example would be validating the genus or project chosen is configured to work with phenotypes.
file: This inputType is meant to ensure the file object (not contents) is valid. The Importer would call a validateFileObject($filename, $fid = NULL) which takes the name of the file and optionally the fid if it is a managed file. Again it would not open the file but instead would validate that it exists, is readable, is of the right type.
header-row: This input type would validate the first row of the file and would take that as input. The importer would call validateRow($row) only once for the file and the row it would pass in would be the first one. This would include all validation that the correct column headers are present. If we decide we could also include the file format validation here for cases where we check for tsv.
data-row: This input type would validate all non-header rows. The importer would call this for each row of the file not including the first one. It would also call validateRow($row).

Validators could indicate more then one inputType they support. For example, the emptyValues validator could support both 'header-row' and 'data-row' although you would want to configure the plugin differently for each case.

General Notes

Me + Carolyn thought perhaps we needed a category to distinguish between header and data rows and both would have the 'row' inputType but now while writing this out I wonder if we just need to let validators support more then one inputType.

We may want to discuss giving the first line of the file as an arguement to the file type of validator to make it easier to check format. 🤔 Alternatively we could have different validators for (1) checking the file suffix and mime type and (2) the file format of the contents (i.e. is it tsv).

Configuring

Right now there are two approaches born out of this concept of 'configure the validator but pass it the same thing within the same scope'.

First was @reynoldtan's take on this where he used loadAssets() to configure the module by giving it all the context it could ever need. This meets the goal of configuring once and allowed us to loop through the validators and call them all the same way. However, it also creates a lot of dependencies that are not needed and encouraged a single validator to move across multiple input types. This makes these validators a lot harder to share across importers.

Second was @carolyncaron's approach to pass in the context to the validateRow($row, $context) function. This meets the need of restricting access of the validator to unneeded information for less dependencies and easier re-use. However, it resulted in needing to call each validator within the same scope independently in order to set a different context for each. Hense we could no longer loop through them indiscriminately.

Hindsight makes me think maybe we would be best served by configuring the validator once at the top with a number of setters. Each instance/object of a validator could be configured differently in this way allowing the same validator to be used multiple times in a single importer. It would also mean we may not need to pass in additional context to the validate*() methods.

laceysanderson · 2024-06-24T22:57:36Z

Example of using this design

Lets consider the trait importer as an example usage since we are all familiar with it now.

I picture adding a new method to the importer class which handles configuring the validators. Here is some psudo-code on how that might look.

/**
 * Configure all the validators this importer uses.
 *
 * @param array $form_values
 *   An array of the importer form values provided to formValidate.
 * @return array
 *   A listing of configured validator objects first keyed by 
 *   their inputType. More specifically,
 *   - [inputType]: and array of validator instances. Not an 
 *     associative array although the keys do indicate what 
 *     order they should be run in.
 */
public function configureValidators($form_values) {
  $validators = [];

  // Metadata
  // Genus Exists
  $instance = $manager->createInstance('genusExists');
  $validators['metadata']['genus_exists'] = $instance;
  // Genus matches the configured project
  $instance = use plugin manager to create instance of genusProjectMatch validator
  $validators['metadata']['genus_project'] = $instance;

  // ... 

  // File level
  $instance = $manager->createInstance('fileExists');
  $instance->setFileSuffix(['tsv', 'txt']);
  $instance->setMimeType('text/tab-separated-values');
  $validators['file']['file_exists'] = $instance;

  // do the same sort of thing for any other file level validators.
  // each validator would have its own set of setters although some 
  // could share setters across different validators that are 
  // defined in a parent class.

  // Header Level
  // All header row cells are not empty.
  $instance = $manager->createInstance('emptyCell');
  $instance->setIndicies([0,1,2,3,4,5]);
  $validators['header-row']['header_not_empty'] = $instance;

  // ... 

  // Data Row Level
  // All data row cells in columns 0,2,4 are not empty
  $instance = $manager->createInstance('emptyCell');
  $instance->setIndicies([0,2,4]);
  $validators['data-row']['data_not_empty'] = $instance;

  // ...

  $this->validatorObjects = $validators;

  return $validators; 
}

Then when validating the file in the formValidate I see us being able to happily loop through these objects and pass them all the same things.

public function formValidate($form, &$form_state) {
  $form_values = $form_state->getValues();

  // Configure the validators
  $validators = $this->configureValidators($form_values);

  // Use a variable to keep track of if one input type had recieved errors
  // and only continue to the next if no errors.
  $validators_failed = FALSE;

  // Keep track of failed items.
  // We expect the first key to be a unique id of the validator instance
  // which is not just the validator id as there can be multiple instances
  // of one validator. Then within that we expect line number
  $failures = [];

  foreach ($this->validators['metadata'] as $key => $validator) {
    $result = $validator->validateMetadata($form_value);
    // Check for old style...
    if (!array_key_exists('valid', $result) && ($return['success'] == 'fail')) {
      $validators_failed = TRUE;
      $failures['metadata'][$key] = $result['details'];
    }
    // Then new style.
    elseif (array_key_exists('valid', $result) && $result['valid'] === FALSE) {
      $failures['metadata'][$key] = $result['failedItems'];
    }
  }

  if ($validators_failed !== FALSE) {
    foreach ($this->validators['file'] as $key => $validator) {
      $result = $validator->validateFile($form_value['filename'], $form_values['fid']);
      // Do same as above but use key file instead of metadata.
    }
  }

  // open the file for reading.
  // foreach line of the file... 
  $line_no = 0;
  while ($row = fgets($handle)) {
    $line_no++;

    // Validate the header.
    if ($line_no == 1) {
      foreach ($this->validators['header-row'] as $key => $validator) {
        $result = $validator->validateRow($row);
        // Check for old style...
        if (!array_key_exists('valid', $result) && ($return['success'] == 'fail')) {
          $validators_failed = TRUE;
          // Note: Added line number here.
          $failures['header-row'][$key][$line_no] = $result['details'];
        }
        // Then new style.
        elseif (array_key_exists('valid', $result) && $result['valid'] === FALSE) {
          $failures['header-row'][$key][$line_no] = $result['failedItems'];
        }
      }
    }
    // Otherwise validate the data.
    else {
      foreach ($this->validators['data-row'] as $key => $validator) {
        $result = $validator->validateRow($row);
        // Do same as header row but use key data-row.
      }
    }
  }

  // Now importer uses $failures array to generate user friendly messages.
}

This way if we wanted to add another data row level validator we would only need to add and configure it in the configureValidators() method with no changes needed to formValidate()!

NOTES

We are switching the return value array to use case, valid, failedIems instead of title, status, details.
The above code sample shows an inbetween state where it handles both the old and new style of return value.
The importer is going to determine the user feadback message. The validator return value should provide info for that but not become that.
The skip functionality previously done by the validators is now done by the importer via the $validators_failed variable.

Still to be dealt with

We still need to formulate exactly how the importer is going to generate it's user friendly messages 🤪
do we still have one report line to the user for each validator? If not, how are we grouping the validators together?

laceysanderson · 2024-06-26T16:58:10Z

The above code sample was updated based on our conversation yesterday.
Specifically,

Added our discussion about keeping track of errors
Added new return value checks
Added a key for the validators in the configure that will be unique to each instance
Updated the notes and still to be dealt with

laceysanderson · 2024-06-26T17:47:46Z

Additional documentation on return value

Currently the validators return a very human readable array meant to be passed directly to the user. More specifically,

return [
  'title' => [Title describing the check meant to have a status symbol in front of it. always shown]
  'status' => [either 'pass' or 'fail' depending if the validation check passes or fails]
  'details' => [sentence describing the failure and providing helpful details on how to fix the issue. Has the failed items embedded in the string]
];

We've decided with the validators becoming more generic this is really hard to make a good message in the generic validator... Instead we are changing the return value to the following and the importer will make the friendly messages.

return [
  'case' => [a developer focused string describing the case checked],
  'valid' => [either TRUE or FALSE depending on if the data is valid or not]
  'failedItems' => [an array of "items" to be used in the message to the user. This is empty if the data was valid]
];

laceysanderson · 2024-06-26T18:29:56Z

Tasks to get us to this design in the trait importer

See the milestone for progress: Upgrade validator API to new design

Update annotation and interface for the validator plugin to define inputType and the new return value style. See Issue G4.92 Update annotation, interface and validator base with new design #92.
Create traits with various common setters for the validators to use to be configured. See issue G4.93 Develop a number of setter traits for use with configuring validators. #93.
Break down the TraitValuesValidator into multiple smaller validators. Started in PR G4.76 STEP 1 --Divides trait import value plugin instance into 3 separate instances #77, will be completed in PR associated with issue G4.85 Update Trait Data Validators #85.
Update the importer formValidate to support looping through the file and calling the row level validators for each row (currently PR G4.76 Step 2 --Remove validator TraitImportValues and utilize the 3 newly added validators by the Traits Importer #79). Also update the importer to have a configureValidators method with will create the validator instances and ensure the validator loops use these new inputTypes (see comment G4.76 Step 2 --Remove validator TraitImportValues and utilize the 3 newly added validators by the Traits Importer #79 (comment)).
Update the formValidate to use the new return values style with issue G4.86 Update Trait Importer formValidate to use new return value style. #86. This will not remove the old style keys.
Create new validators adhering to the new style completely for the GENUS, FILE and HEADER validators. This will be done in Issue G4.87 Split GENUS validator into new style validators. #87, G4.88 Split FILE validator into new style validators. #88, G4.89 Updates header validator into new style validator #89.
Update trait importer to use the new genus, file and header validators by adding them to the configure and removing the old versions. This is done in issue G4.90 Update trait importer to use only new style validators #90.
Go back and remove all mention of scope and the old return value keys. This is done in issue G4.91 Final cleanup of Validator upgrade #91.

laceysanderson · 2024-06-26T20:47:16Z

You can see the flow of code (arrows) and dependencies (each block is dependant upon all other blocks before it in the flow of code) in the following diagram. Each banner represents an issue and is colour coded by who is working on it.

carolyncaron added Group 2 - Data Importing Any issue relating to importing of biological data into either Chado or any other database. Group 4 - API | Services | Plugins Any issue related to developing an API (i.e. services + plugins) labels Jun 7, 2024

carolyncaron changed the title ~~G4.[issueNo] Revisit the use of scope by the Validator API~~ G4.82 Revisit the use of scope by the Validator API Jun 7, 2024

This was referenced Jun 26, 2024

G4.85 Update Trait Data Validators #85

Closed

G4.76 Step 2 --Remove validator TraitImportValues and utilize the 3 newly added validators by the Traits Importer #79

Merged

This was referenced Jun 26, 2024

G4.87 Split GENUS validator into new style validators. #87

Closed

G4.88 Split FILE validator into new style validators. #88

Closed

G4.89 Updates header validator into new style validator #89

Closed

laceysanderson modified the milestone: Upgrade validator API to new design Jun 26, 2024

laceysanderson mentioned this issue Jun 26, 2024

G4.92 Update annotation, interface and validator base with new design #92

Closed

carolyncaron mentioned this issue Jun 26, 2024

G4.76 STEP 1 --Divides trait import value plugin instance into 3 separate instances #77

Merged

18 tasks

This was referenced Jun 27, 2024

G4.92 Update validator base, annotation and interface #94

Merged

G4.93 Validator Setters #95

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

G4.82 Revisit the use of scope by the Validator API #82

G4.82 Revisit the use of scope by the Validator API #82

carolyncaron commented Jun 7, 2024 •

edited

Loading

laceysanderson commented Jun 24, 2024 •

edited

Loading

laceysanderson commented Jun 24, 2024 •

edited

Loading

laceysanderson commented Jun 26, 2024

laceysanderson commented Jun 26, 2024 •

edited

Loading

laceysanderson commented Jun 26, 2024 •

edited

Loading

laceysanderson commented Jun 26, 2024

G4.82 Revisit the use of scope by the Validator API #82

G4.82 Revisit the use of scope by the Validator API #82

Comments

carolyncaron commented Jun 7, 2024 • edited Loading

Branch

Groups

Describe

Design

laceysanderson commented Jun 24, 2024 • edited Loading

Regarding my original vision for scope:

My thoughts on the design I would like to see:

General Notes

Configuring

laceysanderson commented Jun 24, 2024 • edited Loading

Example of using this design

NOTES

Still to be dealt with

laceysanderson commented Jun 26, 2024

laceysanderson commented Jun 26, 2024 • edited Loading

Additional documentation on return value

laceysanderson commented Jun 26, 2024 • edited Loading

Tasks to get us to this design in the trait importer

laceysanderson commented Jun 26, 2024

carolyncaron commented Jun 7, 2024 •

edited

Loading

laceysanderson commented Jun 24, 2024 •

edited

Loading

laceysanderson commented Jun 24, 2024 •

edited

Loading

laceysanderson commented Jun 26, 2024 •

edited

Loading

laceysanderson commented Jun 26, 2024 •

edited

Loading