Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

G4.82 Revisit the use of scope by the Validator API #82

Open
carolyncaron opened this issue Jun 7, 2024 · 6 comments
Open

G4.82 Revisit the use of scope by the Validator API #82

carolyncaron opened this issue Jun 7, 2024 · 6 comments
Labels
Group 2 - Data Importing Any issue relating to importing of biological data into either Chado or any other database. Group 4 - API | Services | Plugins Any issue related to developing an API (i.e. services + plugins)

Comments

@carolyncaron
Copy link
Contributor

carolyncaron commented Jun 7, 2024

Branch

g4.82-[shortDescriptor]

Groups

Group 2 - Data Importing, Group 4 - API | Services | Plugins

Describe

@laceysanderson and I have had discussions regarding the use of scope by the validators. Prior to PR #77 and #79, each validator was assigned its own scope, but breaking up the previous TraitImportValues validator led to 3 validators under the same scope, which was renamed from 'TRAIT IMPORT VALUES' to 'FILE ROW'. While this change works for the purpose of those PRs, we'd like to evaluate our use of scope in general, and if we need a new way to categorize validators to streamline their use by the Traits and Phenotype importers.

Here are some things to consider:

  • Changes to scope would impact the ValidatorBase class, and multiple tests
  • The validators not touched by PR# 77 & 79 use the class method validate() which handles the file for each validator. The new validators instead use validateRow() which does not interact with the file directly. It would make the most sense to not do any file iteration within any of the validators, and instead do it one time only in the formValidate() method of the importer. But we don't want to implement validateRow() for every validator that doesn't apply to every row of the input file.
  • We still want to categorize the validators based on the input they receive. The goal of this is the insulate each validator to access only the data it needs. This should negate the need for the loadAssets() method in the Validator Base class to provide some things that are unnecessary to every validator, such as Genus and Project.
  • Finally, if loadAssets() is removed, we need a way to replace the purpose of the $skip parameter. This is what allows the Trait Importer to skip the validators that follow that are in a different scope. (Example: if there is an issue with the type of file uploaded, we don't want validation to continue on the file contents.)

Design

Current scopes:

  1. GENUS
  2. FILE
  3. HEADERS
  4. FILE ROW
@carolyncaron carolyncaron added Group 2 - Data Importing Any issue relating to importing of biological data into either Chado or any other database. Group 4 - API | Services | Plugins Any issue related to developing an API (i.e. services + plugins) labels Jun 7, 2024
@carolyncaron carolyncaron changed the title G4.[issueNo] Revisit the use of scope by the Validator API G4.82 Revisit the use of scope by the Validator API Jun 7, 2024
@laceysanderson
Copy link
Member

laceysanderson commented Jun 24, 2024

Regarding my original vision for scope:

When I first brought this up I was picturing scoping as a grouping mechanisms where importers could give the same context to all validators in the same scope and new ones could be easily added without changes being made in the importer per se. For example, I pictured a row level scope where all validators of this type would just be given a row and know what to do with it. In this way, if we later developed a new row level validator we could just add its identifier to an array and the importer would happily add it to the list of things to check.

My thoughts on the design I would like to see:

In hindsight, this concept might better be described as two variables. First an inputType which indicates what the importer should pass to the validator and perhaps a general category to group them further 🤔

inputType: the data to be passed into the validate method for the validator.

  • metadata: This inputType is meant to validate information entered by the user in the form. The importer could call a validateMetadata($form_values) function which takes the values of the form as an arguement. It should not open any files saved in the values. An example would be validating the genus or project chosen is configured to work with phenotypes.
  • file: This inputType is meant to ensure the file object (not contents) is valid. The Importer would call a validateFileObject($filename, $fid = NULL) which takes the name of the file and optionally the fid if it is a managed file. Again it would not open the file but instead would validate that it exists, is readable, is of the right type.
  • header-row: This input type would validate the first row of the file and would take that as input. The importer would call validateRow($row) only once for the file and the row it would pass in would be the first one. This would include all validation that the correct column headers are present. If we decide we could also include the file format validation here for cases where we check for tsv.
  • data-row: This input type would validate all non-header rows. The importer would call this for each row of the file not including the first one. It would also call validateRow($row).

Validators could indicate more then one inputType they support. For example, the emptyValues validator could support both 'header-row' and 'data-row' although you would want to configure the plugin differently for each case.

General Notes

Me + Carolyn thought perhaps we needed a category to distinguish between header and data rows and both would have the 'row' inputType but now while writing this out I wonder if we just need to let validators support more then one inputType.

We may want to discuss giving the first line of the file as an arguement to the file type of validator to make it easier to check format. 🤔 Alternatively we could have different validators for (1) checking the file suffix and mime type and (2) the file format of the contents (i.e. is it tsv).

Configuring

Right now there are two approaches born out of this concept of 'configure the validator but pass it the same thing within the same scope'.

First was @reynoldtan's take on this where he used loadAssets() to configure the module by giving it all the context it could ever need. This meets the goal of configuring once and allowed us to loop through the validators and call them all the same way. However, it also creates a lot of dependencies that are not needed and encouraged a single validator to move across multiple input types. This makes these validators a lot harder to share across importers.

Second was @carolyncaron's approach to pass in the context to the validateRow($row, $context) function. This meets the need of restricting access of the validator to unneeded information for less dependencies and easier re-use. However, it resulted in needing to call each validator within the same scope independently in order to set a different context for each. Hense we could no longer loop through them indiscriminately.

Hindsight makes me think maybe we would be best served by configuring the validator once at the top with a number of setters. Each instance/object of a validator could be configured differently in this way allowing the same validator to be used multiple times in a single importer. It would also mean we may not need to pass in additional context to the validate*() methods.

@laceysanderson
Copy link
Member

laceysanderson commented Jun 24, 2024

Example of using this design

Lets consider the trait importer as an example usage since we are all familiar with it now.

I picture adding a new method to the importer class which handles configuring the validators. Here is some psudo-code on how that might look.

/**
 * Configure all the validators this importer uses.
 *
 * @param array $form_values
 *   An array of the importer form values provided to formValidate.
 * @return array
 *   A listing of configured validator objects first keyed by 
 *   their inputType. More specifically,
 *   - [inputType]: and array of validator instances. Not an 
 *     associative array although the keys do indicate what 
 *     order they should be run in.
 */
public function configureValidators($form_values) {
  $validators = [];

  // Metadata
  // Genus Exists
  $instance = $manager->createInstance('genusExists');
  $validators['metadata']['genus_exists'] = $instance;
  // Genus matches the configured project
  $instance = use plugin manager to create instance of genusProjectMatch validator
  $validators['metadata']['genus_project'] = $instance;

  // ... 

  // File level
  $instance = $manager->createInstance('fileExists');
  $instance->setFileSuffix(['tsv', 'txt']);
  $instance->setMimeType('text/tab-separated-values');
  $validators['file']['file_exists'] = $instance;

  // do the same sort of thing for any other file level validators.
  // each validator would have its own set of setters although some 
  // could share setters across different validators that are 
  // defined in a parent class.

  // Header Level
  // All header row cells are not empty.
  $instance = $manager->createInstance('emptyCell');
  $instance->setIndicies([0,1,2,3,4,5]);
  $validators['header-row']['header_not_empty'] = $instance;

  // ... 

  // Data Row Level
  // All data row cells in columns 0,2,4 are not empty
  $instance = $manager->createInstance('emptyCell');
  $instance->setIndicies([0,2,4]);
  $validators['data-row']['data_not_empty'] = $instance;

  // ...

  $this->validatorObjects = $validators;

  return $validators; 
}

Then when validating the file in the formValidate I see us being able to happily loop through these objects and pass them all the same things.

public function formValidate($form, &$form_state) {
  $form_values = $form_state->getValues();

  // Configure the validators
  $validators = $this->configureValidators($form_values);

  // Use a variable to keep track of if one input type had recieved errors
  // and only continue to the next if no errors.
  $validators_failed = FALSE;

  // Keep track of failed items.
  // We expect the first key to be a unique id of the validator instance
  // which is not just the validator id as there can be multiple instances
  // of one validator. Then within that we expect line number
  $failures = [];

  foreach ($this->validators['metadata'] as $key => $validator) {
    $result = $validator->validateMetadata($form_value);
    // Check for old style...
    if (!array_key_exists('valid', $result) && ($return['success'] == 'fail')) {
      $validators_failed = TRUE;
      $failures['metadata'][$key] = $result['details'];
    }
    // Then new style.
    elseif (array_key_exists('valid', $result) && $result['valid'] === FALSE) {
      $failures['metadata'][$key] = $result['failedItems'];
    }
  }

  if ($validators_failed !== FALSE) {
    foreach ($this->validators['file'] as $key => $validator) {
      $result = $validator->validateFile($form_value['filename'], $form_values['fid']);
      // Do same as above but use key file instead of metadata.
    }
  }

  // open the file for reading.
  // foreach line of the file... 
  $line_no = 0;
  while ($row = fgets($handle)) {
    $line_no++;

    // Validate the header.
    if ($line_no == 1) {
      foreach ($this->validators['header-row'] as $key => $validator) {
        $result = $validator->validateRow($row);
        // Check for old style...
        if (!array_key_exists('valid', $result) && ($return['success'] == 'fail')) {
          $validators_failed = TRUE;
          // Note: Added line number here.
          $failures['header-row'][$key][$line_no] = $result['details'];
        }
        // Then new style.
        elseif (array_key_exists('valid', $result) && $result['valid'] === FALSE) {
          $failures['header-row'][$key][$line_no] = $result['failedItems'];
        }
      }
    }
    // Otherwise validate the data.
    else {
      foreach ($this->validators['data-row'] as $key => $validator) {
        $result = $validator->validateRow($row);
        // Do same as header row but use key data-row.
      }
    }
  }

  // Now importer uses $failures array to generate user friendly messages.
}

This way if we wanted to add another data row level validator we would only need to add and configure it in the configureValidators() method with no changes needed to formValidate()!

NOTES

  • We are switching the return value array to use case, valid, failedIems instead of title, status, details.
  • The above code sample shows an inbetween state where it handles both the old and new style of return value.
  • The importer is going to determine the user feadback message. The validator return value should provide info for that but not become that.
  • The skip functionality previously done by the validators is now done by the importer via the $validators_failed variable.

Still to be dealt with

  • We still need to formulate exactly how the importer is going to generate it's user friendly messages 🤪
  • do we still have one report line to the user for each validator? If not, how are we grouping the validators together?

@laceysanderson
Copy link
Member

The above code sample was updated based on our conversation yesterday.
Specifically,

  • Added our discussion about keeping track of errors
  • Added new return value checks
  • Added a key for the validators in the configure that will be unique to each instance
  • Updated the notes and still to be dealt with

@laceysanderson
Copy link
Member

laceysanderson commented Jun 26, 2024

Additional documentation on return value

Currently the validators return a very human readable array meant to be passed directly to the user. More specifically,

return [
  'title' => [Title describing the check meant to have a status symbol in front of it. always shown]
  'status' => [either 'pass' or 'fail' depending if the validation check passes or fails]
  'details' => [sentence describing the failure and providing helpful details on how to fix the issue. Has the failed items embedded in the string]
];

We've decided with the validators becoming more generic this is really hard to make a good message in the generic validator... Instead we are changing the return value to the following and the importer will make the friendly messages.

return [
  'case' => [a developer focused string describing the case checked],
  'valid' => [either TRUE or FALSE depending on if the data is valid or not]
  'failedItems' => [an array of "items" to be used in the message to the user. This is empty if the data was valid]
];

@laceysanderson
Copy link
Member

laceysanderson commented Jun 26, 2024

Tasks to get us to this design in the trait importer

See the milestone for progress: Upgrade validator API to new design

@laceysanderson
Copy link
Member

You can see the flow of code (arrows) and dependencies (each block is dependant upon all other blocks before it in the flow of code) in the following diagram. Each banner represents an issue and is colour coded by who is working on it.

MergeFlowDiagram-Dependencies

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Group 2 - Data Importing Any issue relating to importing of biological data into either Chado or any other database. Group 4 - API | Services | Plugins Any issue related to developing an API (i.e. services + plugins)
Projects
None yet
Development

No branches or pull requests

2 participants