Add schema hinting to new sheets functionality (C4-1088) #280

netsettler · 2023-08-24T04:22:45Z

This is getting close to being usable. The features available include some carried over from the previous PRs. I've tried to highlight the areas of current development here:

New module sheet_utils for loading workbooks.
- Important things of interest:
  - Class ItemManager for loading Item-style data from any .xlsx, .csv or .tsv files. (Planned support, only partly implemented, for JSON via .json files and JSON Lines via .jsonl files.)
  - Function load_items that does the same as ItemManager.load. Features:
    - type hinting (see class TypeHint and its subclasses) to normalize/repair various kinds of inputs. Work ongoing in this area. Doug has requested this be done differently in various ways.
    - autoloading schemas based on tab names
    - object cross-references. Work on this "instaguid" feature is in flux. Experimentation with, and consideration of, alternative implementations is still planned before converging. Will and Doug have asked questions. Kent plans a document soon itemizing alternatives and issues.
- Various lower-level implementation classes such as:
  - Classes XlsxManager, CsvManager and TsvManager for loading raw data from .xlsx, .csv, and .tsv files, respectively.
  - Classes XlsxItemManager, CsvItemManager, and TsvItemManager for loading Item-style data from .xlsx, .csv, and .tsv files, respectively.
New utility functionality in misc_utils:
- New function is_uuid (migrated from Fourfront)
- New function pad_to
- New class JsonLinesReader

Still to do:

- Better doc strings
- Additional unit testing for coverage
- Redesign of instaguids after planned design discussion offline.
- Better framework for defining type hints. It would help to have input about kinds of transformations desired.

drio18

Dropping a few comments, including one important one regarding the need to use the schema information more thoroughly. 4DN does much of this work already in Submit4DN (mostly here).

dcicutils/misc_utils.py

drio18 · 2023-08-25T17:14:19Z

dcicutils/sheet_utils.py

+        def finder(subheader, subschema):
+            if not parsed_header:
+                return None
+            else:
+                [key1, *other_headers] = subheader
+                if isinstance(key1, str) and isinstance(subschema, dict):
+                    if subschema.get('type') == 'object':
+                        def1 = subschema.get('properties', {}).get(key1)
+                        if not other_headers:
+                            if def1 is not None:
+                                t = def1.get('type')
+                                if t == 'string':
+                                    enum = def1.get('enum')
+                                    if enum:
+                                        mapping = {e.lower(): e for e in enum}
+                                        return EnumHint(mapping)
+                                elif t == 'boolean':
+                                    return BoolHint()
+                                else:
+                                    pass  # fall through to asking super()
+                            else:
+                                pass  # fall through to asking super()
+                        else:
+                            return finder(subheader=other_headers, subschema=def1)


Consider breaking some of the logic into more discrete parts here to clarify the process and facilitate simpler testing.

Yep, I'll do that.
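One possible shape for that split, as a rough sketch only: the schema walk, the per-key lookup, and the leaf decision each get their own function so they can be tested separately. The names `hint_for_leaf`, `property_definition`, and `find_hint` are hypothetical, and `EnumHint` and `BoolHint` here are minimal stand-ins rather than the classes in this PR.

```python
from typing import Any, Dict, List, Optional


class EnumHint:
    """Stand-in for the PR's EnumHint; holds a lowercase-to-canonical mapping."""
    def __init__(self, value_map: Dict[str, str]):
        self.value_map = value_map


class BoolHint:
    """Stand-in for the PR's BoolHint."""


def hint_for_leaf(definition: Optional[dict]):
    """Pick a hint for a single property definition, or None if no hint applies."""
    if not isinstance(definition, dict):
        return None
    t = definition.get('type')
    if t == 'string':
        enum = definition.get('enum')
        if enum:
            return EnumHint({e.lower(): e for e in enum})
    elif t == 'boolean':
        return BoolHint()
    return None  # caller falls back to its default behavior


def property_definition(subschema: Any, key: Any) -> Optional[dict]:
    """Look up one key in an object-typed schema; None when that is not possible."""
    if (isinstance(key, str) and isinstance(subschema, dict)
            and subschema.get('type') == 'object'):
        return subschema.get('properties', {}).get(key)
    return None


def find_hint(header_path: List[Any], schema: Any):
    """Walk header_path through nested object schemas and hint the final property."""
    if not header_path:
        return None
    definition = schema
    for key in header_path:
        definition = property_definition(definition, key)
        if definition is None:
            return None
    return hint_for_leaf(definition)
```

With this layout, `hint_for_leaf` can be unit-tested against bare property definitions without constructing a parsed header at all.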

drio18 · 2023-08-25T17:38:22Z

dcicutils/sheet_utils.py

+            cls.set_path_value(datum[key], more_path, value)
+
+    @classmethod
+    def find_type_hint(cls, parsed_header: Optional[ParsedHeader], schema: Any):


I believe there needs to be additional logic for using the schema here; I could be mistaken, but I don't believe there are tests for such cases at least.

For example, prefer_number above is automatically converting things that look like numbers to ints or floats, but the schema may require the field to be a string. Hence, there's the possibility of a bit of casting back and forth happening here that needs to be handled. As currently written, some information could be lost by this casting back and forth as well, if for example "+1" is cast to an int of 1 and then a string of "1".
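To make the information loss concrete, here is a small illustration. `prefer_number_sketch` is a hypothetical stand-in that only approximates the kind of conversion described; it is not the `prefer_number` in this PR.

```python
def prefer_number_sketch(value: str):
    """Illustrative stand-in: turn number-like strings into int or float."""
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value  # not number-like, keep the original string


# Cast each raw cell to a number, then back to a string, and compare.
for raw in ["+1", "007", "1.50"]:
    number = prefer_number_sketch(raw)
    round_trip = str(number)
    print(repr(raw), "->", repr(number), "->", repr(round_trip))
```

In each of these three cases the string that comes back differs from the one submitted, so a schema that wants a string would receive something other than what the user typed.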

Similarly, if the schema defines the field to be an array of strings and the spreadsheet value is one string without the "|" character (e.g. "dog"), the string needs to be cast to an array of strings (["dog"]). On the flip side, if the schema dictates the field is a string, and someone submits a string with the "|" character, it will be converted to an array of strings via parse_item_value, and the validation will fail even though the string was valid.

My personal preference for handling this would be to cast all values to appropriate types and do the "type hinting" all in one location; currently, these operations are occurring in multiple places/stages of the processing, and the logic becomes more difficult to follow as to where an error arose. As the cases above demonstrate, it may not make sense to do anything with a cell value until the type from the schema is known.
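A minimal sketch of what casting in one location might look like, assuming the property definition from the schema is already in hand when the cell is processed. The function name `cast_cell` is hypothetical, and the handling is deliberately narrow (no nested objects, no enum repair); values that cannot be cast are passed through unchanged so that validation can report them.

```python
from typing import Any


def cast_cell(raw: str, definition: dict) -> Any:
    """Cast one raw cell string according to a JSON-schema-like property definition.

    Values that cannot be cast are returned unchanged for a later validator.
    """
    t = definition.get('type')
    if t == 'array':
        # Only split on '|' when the schema actually asks for an array.
        item_definition = definition.get('items', {})
        parts = [part.strip() for part in raw.split('|')]
        return [cast_cell(part, item_definition) for part in parts]
    if t == 'integer':
        try:
            return int(raw)
        except ValueError:
            return raw
    if t == 'number':
        try:
            return float(raw)
        except ValueError:
            return raw
    if t == 'boolean':
        lowered = raw.strip().lower()
        if lowered in ('true', 'false'):
            return lowered == 'true'
        return raw
    # 'string' and anything unrecognized: keep the text exactly as submitted.
    return raw


array_of_strings = {'type': 'array', 'items': {'type': 'string'}}
print(cast_cell("dog", array_of_strings))    # ['dog']
print(cast_cell("a|b", {'type': 'string'}))  # a|b
print(cast_cell("+1", {'type': 'string'}))   # +1
```

Because the schema type is consulted before anything is done to the cell, the single-string array and the string containing "|" both come out in the form the schema expects.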

Yeah, this is trying to emulate what Excel does (in the raw layer). I really highly doubt too many people ever manage their cell type to match the schemas, so we have to assume that the data coming in will be either string or number, and it therefore doesn't hurt to do this as a hack to make csv files feel like they do similar things. The semantic (schema-hinting) layer will need to assume both types of data come in to it.

Honestly, I don't expect us to ever use the raw layer, but as a matter of abstraction it seemed important to build this in steps.

Oh, I see. I actually wrote this capability for the lower level and only stubbed the other part in here before I had schemas, but yeah, I guess I can make that be optional or just remove it at that level. Even the conversion to boolean might not be helpful at this level, since it can be schema-driven now. Somewhat the thing to see is that this is evolutionary and I've been changing the nature of some of these things organically. But I generally agree with you about expanding the capabilities and making it more schema-driven.

And, yes, I left the testing at this level light figuring it would change a lot. The testing, such as it is, is in the spreadsheets, which allows some of this stuff to be refactored to do similar things a different way without major changes to the tests while we're in transition. This isn't to say there are no tests, but they are more end-to-end and not down to the individual function layer at this point in development. I'll fill the testing in better when it's more stable. But if you ask me to refactor everything, you should know you're asking to have all the affected tests rewritten, so you should be glad there's little such friction. Means you're not asking for a heavy lift. :)

This also gets involved in the error-handling behavior you want, though my present theory is that while it's hard to know what to do about bad column headers, if we can at least locate the data correctly, we can just pass through a string and let the validator complain about it.

dmichaels-harvard · 2023-09-12T12:57:16Z

dcicutils/sheet_utils.py

+    with TemporaryDirectory() as temp_dir:
+        temp_base = os.path.join(temp_dir, target_base_part)
+        temp_filename = temp_base + target_ext
+        _do_shell_command(['cp', filename, temp_filename])


Wouldn't we want to do as much as possible using Python libraries (e.g. file copy), rather than execing out to the shell?
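For instance, the copy step could use `shutil.copy` from the standard library. This is only a sketch: the wrapper function `with_temp_copy` and its `action` callback are hypothetical, added so the snippet runs on its own, while the variable names follow the diff above.

```python
import os
import shutil
from tempfile import TemporaryDirectory


def with_temp_copy(filename, target_base_part, target_ext, action):
    """Copy filename into a temporary directory under a new name, then call action on the copy."""
    with TemporaryDirectory() as temp_dir:
        temp_base = os.path.join(temp_dir, target_base_part)
        temp_filename = temp_base + target_ext
        shutil.copy(filename, temp_filename)  # portable replacement for shelling out to 'cp'
        return action(temp_filename)
```

This avoids depending on a `cp` executable being available, which matters on platforms without a POSIX shell.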

netsettler and others added 30 commits August 14, 2023 07:21

First cut at tools for parsing workbooks.

4c84f0b

Refactor to separate some functionality into a separate sevice class.

7b73a67

Add a csv file for testing.

3d4573f

Add some negative testing.

f4e5cfa

Update lock file.

e9d2465

Document new sheets_utils module.

6e9060f

Issue a beta for this functionality.

df12c91

Fix documentation for sheet_utils.

6a39c8a

Add some declarations. Small refactors to improve modularity.

eedb5c6

Rearrange some methods for presentational reasons.

a6b68fe

First cut at useful functionality.

3ff63a9

Some name changes to make things more abstract. workbook becomes read…

39bd2e0

…er_agent, for example

Rename sheetname to tabname throughout, to be more clear that this is…

77b72f6

… not the workbook level artifact. Better handling of init args.

Add some doc strings. Rename load_table_set to just load. Arrange for…

ba8c55c

… ItemManager.load to take a tab_name argument so that CSV files can perhaps infer a type name.

Add load_items function. Fix some test names. Update changelog.

50488cb

Experimental bug fix from Will to hopefully make get_schema_names work.

807e525

update changelog

2a8e81a

Update dcicutils/sheet_utils.py

718054a

Co-authored-by: drio18 <58236592+drio18@users.noreply.github.com>

Merge branch 'master' into kmp_sheet_utils

682c95a

Merge branch 'kmp_sheet_utils' into kmp_sheet_utils_refactor_for_csv

582f002

Add some comments in response to Doug's code review.

56d1459

Support TSV files.

2facf9e

Add changelog info about tsv files.

bcc4e63

Add a missing data file.

9de282e

First stable cut at schema hinting. Doesn't find schemas automaticall…

8d6495f

…y yet, though.

Merge branch 'master' into kmp_sheet_utils

3a103ee

Mark chardet as an acceptable license for use.

56f702a

Merge branch 'kmp_sheet_utils' into kmp_sheet_utils_refactor_for_csv

08d428e

Merge branch 'kmp_sheet_utils_refactor_for_csv' into kmp_sheet_utils_…

42ad579

…schema_hinting

Backport some small fixes and cosmetics from the schemas branch.

60ada3f

drio18 reviewed Aug 25, 2023

netsettler and others added 2 commits August 25, 2023 16:28

Fix typo in comment (dcicutils/misc_utils.py)

1c34ad0

Co-authored-by: drio18 <58236592+drio18@users.noreply.github.com>

Add some doc strings and comments.

41fad79

netsettler changed the title ~~Add schema hinting to new sheets functionality~~ Add schema hinting to new sheets functionality (C4-1088) Aug 28, 2023

netsettler and others added 23 commits August 30, 2023 13:38

Rename tabname to tab_name throughout the sheet_utils interfaces.

6e8ce2c

Add support for reading inserts dirs, .json, .jsonl (two formats), an…

04eb58c

…d .tabs.json

Bump beta version.

ce9f9bc

Add yaml formats.

0ea5b62

Add class AbstractItemManager. Rename InsertsItemManager to InsertsDi…

bcc1128

…rectoryItemManager. Rename _JsonInsertsDataItemManager to InsertsItemManager.

Rename ._parser() to ._parse_json_data(). Factor type checks out of .…

7de093a

…_load_json_data() into ._check_json_data().

Rename _parse_json_data, _load_json_data, and _check_json_data, respe…

b01e34b

…ctively, to _parse_inserts_data, _load_inserts_data, and _check_inserts_data.

WIP. Testing good.

0ae48ee

WIP. Tests passing.

1e2c5a9

Rearrange the way escaping= works so both csv an tsv files can using …

b8a4c39

…that argument.

Separate registration of regular table set managers from registration…

a2fe079

… of item managers.

Stub in checking of required headers.

91ddce0

Bump beta version.

142a20b

PEP8

e09af07

Merge branch 'kmp_schemas_from_vapp' into kmp_sheet_utils_with_vapp

70762c6

Fix a bug in newly proposed ff_utils.get_schemas with vapp.

7d2ecaa

Extend VirtualApp to amke it easier to test by adding an AbstractVirt…

5e46273

…ualApp.

Implement portal_vapp= in sheet_utils.

53de60a

Simplifications per Will's code review.

630720f

Merge utils 7.10.0 from master.

5a07b69

Merge pull request #282 from 4dn-dcic/kmp_sheet_utils_with_vapp

295adfe

Add portal_vapp= functionality to sheet_utils

Merge branch 'master' into kmp_sheet_utils_schema_hinting

486adce

Add support for zipped files.

54c51aa

dmichaels-harvard reviewed Sep 12, 2023

dmichaels-harvard approved these changes Sep 22, 2023

netsettler merged commit f820233 into master Oct 31, 2023
5 checks passed
