-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add schema hinting to new sheets functionality (C4-1088) #280
Conversation
…er_agent, for example
… not the workbook level artifact. Better handling of init args.
… ItemManager.load to take a tab_name argument so that CSV files can perhaps infer a type name.
Co-authored-by: drio18 <58236592+drio18@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping a few comments, including one important one regarding the need to use the schema information more thoroughly. 4DN does much of this work already in Submit4DN (mostly here).
def finder(subheader, subschema): | ||
if not parsed_header: | ||
return None | ||
else: | ||
[key1, *other_headers] = subheader | ||
if isinstance(key1, str) and isinstance(subschema, dict): | ||
if subschema.get('type') == 'object': | ||
def1 = subschema.get('properties', {}).get(key1) | ||
if not other_headers: | ||
if def1 is not None: | ||
t = def1.get('type') | ||
if t == 'string': | ||
enum = def1.get('enum') | ||
if enum: | ||
mapping = {e.lower(): e for e in enum} | ||
return EnumHint(mapping) | ||
elif t == 'boolean': | ||
return BoolHint() | ||
else: | ||
pass # fall through to asking super() | ||
else: | ||
pass # fall through to asking super() | ||
else: | ||
return finder(subheader=other_headers, subschema=def1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider breaking some of logic into more discrete parts here to clarify the process and facilitate simpler testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'll do that.
cls.set_path_value(datum[key], more_path, value) | ||
|
||
@classmethod | ||
def find_type_hint(cls, parsed_header: Optional[ParsedHeader], schema: Any): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there needs to be additional logic for using the schema here; I could be mistaken, but I don't believe there are tests for such cases at least.
For example, prefer_number
above is automatically converting things that look like numbers to ints or floats, but the schema may require the field to be a string. Hence, there's the possibility of a bit of casting back and forth happening here that needs to be handled. As currently written, some information could be lost by this casting back and forth as well, if for example "+1" is cast to an int of 1 and then a string of "1".
Similarly, if the schema defines the field to be an array of strings and the spreadsheet value is one string without the "|" character (e.g. "dog"), the string needs to be cast to an array of strings (["dog"]). On the flip side, if the schema dictates the field is a string, and someone submits a string with the "|" character, it will be converted to an array of strings via parse_item_value
, and the validation will fail even though the string was valid.
My personal preference for handling this would be to cast all values to appropriate types and do the "type hinting" all in one location; currently, these operations are occurring in multiple places/stages of the processing, and the logic becomes more difficult to follow as to where an error arose. As the cases above demonstrate, it may not make sense to do anything with a cell value until the type from the schema is known.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is trying to emulate what Excel does (in the raw layer0. I really highly doubt too many people ever manage their cell type to match the schemas, so we have to assume that the data coming in will be either string or number, and it therefore doesn't hurt to do this as a hack to make csv files feel like they do similar things. The semantic (schema-hinting) layer will need to assume both types of data come in to it.
Honestly, I don't expect us to ever use the raw layer, but as a matter of abstraction it seemed important to build this in steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. I actually wrote this capability for the lower level and only stubbed the other part in here before I had schemas, but yeah, I guess I can make that be optional or just remove it at that level. Even the conversion to boolean might not be helpful at this level, since it can be schema-driven now. Somewhat the thing to see is that this is evolutionary and I've been changing the nature of some of these things organically. But I generally agree with you about expanding the capabilities and making it more schema-driven.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, yes, I left the testing at this level light figuring it would change a lot. The testing, such as it is, is in the spreadsheets, which allows some of this stuff to be refactored to do similar things a different way without major changes to the tests while we're in transition. This isn't to say there are no tests, but they are more end-to-end and not down to the individual function layer at this point in development. I'll fill the testing in better when it's more stable. But if you ask me to refactor everything, you should know you're asking to have all the affected tests rewritten, so you should be glad there's little such friction. Means you're not asking for a heavy lift. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also gets involved in the error-handling behavior you want, though my present theory is that while it's hard to know what to do about bad column headers, if we can at least locate the data correctly, we can just pass through a string and let the validator complain about it.
Co-authored-by: drio18 <58236592+drio18@users.noreply.github.com>
…rectoryItemManager. Rename _JsonInsertsDataItemManager to InsertsItemManager.
…_load_json_data() into ._check_json_data().
…ctively, to _parse_inserts_data, _load_inserts_data, and _check_inserts_data.
… of item managers.
Add portal_vapp= functionality to sheet_utils
with TemporaryDirectory() as temp_dir: | ||
temp_base = os.path.join(temp_dir, target_base_part) | ||
temp_filename = temp_base + target_ext | ||
_do_shell_command(['cp', filename, temp_filename]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we want to do as much as possible using Python libraries (e.g. file copy) as possible, rather than execing out to shell?
This is getting close to where it might be usable. Features available, including those which are available in some of the prevoius PRs. I've tried to highlight the areas of current development here:
New module
sheet_utils
for loading workbooks.Important things of interest:
Class
ItemManager
for loading Item-style data from any.xlsx
,.csv
or.tsv
files. (Planned support, only partly implemented, for JSON via .json files and JSON LInes via .jsonl files.)Function
load_items
that does the same asItemManager.load
. Features:type hinting (see class TypeHint and its subclasses) to normalize/repair various kinds of inputs. Work ongoing in this area. Doug has requested this be done differently in various ways.
autoloading schemas based on tab names
object cross-references. Work on this "instaguid" feature is in flux and an area of ongoing work. Experimentation and considering of variations of implementations still planned before converging. WIll and Doug have asked questions. Kent plans a document soon itemizing alternatives and issues.
Various lower-level implementation classes such as:
Classes
XlsxManager
,CsvManager
andTsvManager
for loading raw data from.xlsx
,.csv
, and.tsv
files, respectively.Classes
XlsxItemManager
,CsvItemManager
, andTsvItemManager
for loading Item-style data from.xlsx
,.csv
, and.tsv
files, respectively.New utility functionality in
misc_utils
:is_uuid
(migrated from Fourfront)pad_to
JsonLinesReader
Still to do: