Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attribute with repeated data type. #991

Merged
merged 21 commits into from
Sep 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
32a2656
Attribute isRepeated flag. ValidateDataTypes indexing job.
marikomedlock Sep 4, 2024
eb79192
CreateEntityMainn indexing job.
marikomedlock Sep 4, 2024
8fe09c8
Merge branch 'main' into mm-repeated-data-type
marikomedlock Sep 5, 2024
5d54178
WriteEntityLevelDisplayHints indexing job.
marikomedlock Sep 5, 2024
d4b6921
WriteTextSearchField indexing job.
marikomedlock Sep 5, 2024
519f3cd
Select repeated attr field. Result parsing test list query.
marikomedlock Sep 6, 2024
c209b0f
Fetch string-only hints. Result parsing test hint query.
marikomedlock Sep 9, 2024
0620795
Merge branch 'main' into mm-repeated-data-type
marikomedlock Sep 9, 2024
0714da8
Group by repeated attribute field. Sql building and results parsing t…
marikomedlock Sep 9, 2024
dd3f527
Filter on repeated attribute field. Sql building test list query.
marikomedlock Sep 9, 2024
4e8c32c
Only use FLATTENED alias for repeated attribute field.
marikomedlock Sep 9, 2024
3e4bce0
OpenAPI repeated type in ValueDisplay. Bug in returning hint query re…
marikomedlock Sep 9, 2024
1ae4e10
Remove variant config files.
marikomedlock Sep 9, 2024
8eb84dc
Regenerate docs.
marikomedlock Sep 9, 2024
66b6633
Check actual data rows in result parsing hint & count query tests.
marikomedlock Sep 10, 2024
ea8dff6
Merge branch 'main' into mm-repeated-data-type
marikomedlock Sep 10, 2024
3a5b7e9
Fix overlapping conditional in eldh indexing job.
marikomedlock Sep 10, 2024
7f8f84d
Merge branch 'main' into mm-repeated-data-type
dexamundsen Sep 10, 2024
2f25ffe
Merge branch 'main' into mm-repeated-data-type
marikomedlock Sep 11, 2024
89bfadb
Always populate repeatedValue property of ValueDisplay in OpenAPI.
marikomedlock Sep 12, 2024
906b06d
Merge branch 'main' into mm-repeated-data-type
marikomedlock Sep 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions docs/generated/UNDERLAY_CONFIG.md
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,13 @@ When set to true, an indexing job will try to compute a display hint for this at

*Default value:* `false`

### SZAttribute.isDataTypeRepeated
**optional** boolean

True if the data type is repeated (e.g. an array of ints).

*Default value:* `false`

### SZAttribute.isSuppressedForExport
**optional** boolean

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,10 @@ public void run(boolean isDryRun) {
Field.newBuilder(
columnSchema.getColumnName(),
BigQueryBeamUtils.fromDataType(columnSchema.getDataType()))
.setMode(
columnSchema.isDataTypeRepeated()
? Field.Mode.REPEATED
: Field.Mode.NULLABLE)
.build())
.collect(Collectors.toList());

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,18 @@ public void run(boolean isDryRun) {
.collect(Collectors.joining(",", "[", "]")),
sourceQueryField.getType());
}

// Check that the schema repeated flags match those of the index table columns.
boolean sourceQueryFieldIsRepeated = sourceQueryField.getMode().equals(Field.Mode.REPEATED);
boolean isRepeatedFlagsMatch = attribute.isDataTypeRepeated() == sourceQueryFieldIsRepeated;
if (!isRepeatedFlagsMatch) {
foundError = true;
LOGGER.info(
"Data type repeated mismatch found for attribute {}: entity declared {}, SQL schema returns {}",
attribute.getName(),
attribute.isDataTypeRepeated(),
sourceQueryField.getMode());
}
}
if (foundError) {
throw new InvalidConfigException("Data type mismatch found for entity: " + entity.getName());
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -113,23 +113,53 @@ public void run(boolean isDryRun) {
attribute.getName(),
minMax.getKey(),
minMax.getValue());
} else if (isEnumHint(attribute)) {
List<Pair<ValueDisplay, Long>> enumCounts = computeEnumHint(attribute, isDryRun);
} else if (isEnumHintForValueDisplay(attribute)) {
List<Pair<ValueDisplay, Long>> enumCounts =
computeEnumHintForValueDisplay(attribute, isDryRun);
enumCounts.forEach(
enumCount -> {
Literal int64Field =
attribute.isValueDisplay()
? enumCount.getKey().getValue()
: Literal.forInt64(null);
Literal stringField =
attribute.isValueDisplay()
? Literal.forString(enumCount.getKey().getDisplay())
: enumCount.getKey().getValue();

List<Literal> rowOfLiterals = new ArrayList<>();
rowOfLiterals.add(Literal.forString(attribute.getName()));
rowOfLiterals.add(Literal.forDouble(null));
rowOfLiterals.add(Literal.forDouble(null));
rowOfLiterals.add(int64Field);
rowOfLiterals.add(stringField);
rowOfLiterals.add(Literal.forInt64(enumCount.getValue()));
insertRows.add(rowOfLiterals);
LOGGER.info(
"Enum value-display or simple-string hint: {}, {}, {}, {}",
attribute.getName(),
int64Field,
stringField,
enumCount.getValue());
});
} else if (isEnumHintForRepeatedStringValue(attribute)) {
List<Pair<Literal, Long>> enumCounts =
computeEnumHintForRepeatedStringValue(attribute, isDryRun);
enumCounts.forEach(
enumCount -> {
List<Literal> rowOfLiterals = new ArrayList<>();
rowOfLiterals.add(Literal.forString(attribute.getName()));
rowOfLiterals.add(Literal.forDouble(null));
rowOfLiterals.add(Literal.forDouble(null));
rowOfLiterals.add(enumCount.getKey().getValue());
rowOfLiterals.add(Literal.forString(enumCount.getKey().getDisplay()));
rowOfLiterals.add(Literal.forInt64(null));
rowOfLiterals.add(enumCount.getKey());
rowOfLiterals.add(Literal.forInt64(enumCount.getValue()));
insertRows.add(rowOfLiterals);
LOGGER.info(
"Enum hint: {}, {}, {}, {}",
"Enum repeated-string hint: {}, {}, {}, {}",
attribute.getName(),
enumCount.getKey().getValue(),
enumCount.getKey().getDisplay(),
null,
enumCount.getKey(),
enumCount.getValue());
});
} else {
Expand Down Expand Up @@ -205,16 +235,26 @@ public void run(boolean isDryRun) {
}
}

private static boolean isEnumHint(Attribute attribute) {
return attribute.isValueDisplay() && DataType.INT64.equals(attribute.getRuntimeDataType());
private static boolean isEnumHintForValueDisplay(Attribute attribute) {
return (attribute.isValueDisplay() && DataType.INT64.equals(attribute.getRuntimeDataType()))
|| (attribute.isSimple()
&& !attribute.isDataTypeRepeated()
&& DataType.STRING.equals(attribute.getRuntimeDataType()));
}

private static boolean isEnumHintForRepeatedStringValue(Attribute attribute) {
// TODO: Support not-repeated string simple attributes & repeated integer attributes.
return attribute.isSimple()
&& attribute.isDataTypeRepeated()
&& (DataType.STRING.equals(attribute.getDataType()));
}

private static boolean isRangeHint(Attribute attribute) {
if (attribute.isValueDisplay()) {
return false;
}
return switch (attribute.getRuntimeDataType()) {
case DOUBLE, INT64 -> true;
case DOUBLE, INT64 -> !attribute.isDataTypeRepeated();
default -> false;
};
}
Expand Down Expand Up @@ -275,29 +315,36 @@ private Pair<Literal, Literal> computeRangeHint(Attribute attribute, boolean isD
}
}

private List<Pair<ValueDisplay, Long>> computeEnumHint(Attribute attribute, boolean isDryRun) {
private List<Pair<ValueDisplay, Long>> computeEnumHintForValueDisplay(
Attribute attribute, boolean isDryRun) {
// Build the query.
// SELECT attrVal AS enumVal, attrDisp AS enumDisp, COUNT(*) AS enumCount FROM indextable GROUP
// BY enumVal, enumDisp
// SELECT attrVal AS enumVal[, attrDisp AS enumDisp], COUNT(*) AS enumCount FROM indextable
// GROUP BY enumVal[, enumDisp]
SqlField attrValField = indexAttributesTable.getAttributeValueField(attribute.getName());
SqlField attrDispField = indexAttributesTable.getAttributeDisplayField(attribute.getName());
SqlField attrDispField =
attribute.isValueDisplay()
? indexAttributesTable.getAttributeDisplayField(attribute.getName())
: null;
final String enumValAlias = "enumVal";
final String enumDispAlias = "enumDisp";
final String enumCountAlias = "enumCount";

String selectEnumCountSql =
"SELECT "
+ SqlQueryField.of(attrValField, enumValAlias).renderForSelect()
+ ", "
+ SqlQueryField.of(attrDispField, enumDispAlias).renderForSelect()
+ ", COUNT(*) AS "
"SELECT " + SqlQueryField.of(attrValField, enumValAlias).renderForSelect();
if (attribute.isValueDisplay()) {
selectEnumCountSql += ", " + SqlQueryField.of(attrDispField, enumDispAlias).renderForSelect();
}
selectEnumCountSql +=
", COUNT(*) AS "
+ enumCountAlias
+ " FROM "
+ indexAttributesTable.getTablePointer().render()
+ " GROUP BY "
+ SqlQueryField.of(attrValField, enumValAlias).renderForGroupBy(null, true)
+ ", "
+ SqlQueryField.of(attrDispField, enumDispAlias).renderForGroupBy(null, true);
+ SqlQueryField.of(attrValField, enumValAlias).renderForGroupBy(null, true);
if (attribute.isValueDisplay()) {
selectEnumCountSql +=
", " + SqlQueryField.of(attrDispField, enumDispAlias).renderForGroupBy(null, true);
}
LOGGER.info("SQL enum count: {}", selectEnumCountSql);

// Execute the query.
Expand All @@ -315,16 +362,26 @@ private List<Pair<ValueDisplay, Long>> computeEnumHint(Attribute attribute, bool
// Parse the result rows.
for (FieldValueList rowResult : tableResult.getValues()) {
FieldValue enumValFieldValue = rowResult.get(enumValAlias);
Literal enumVal =
Literal.forInt64(enumValFieldValue.isNull() ? null : enumValFieldValue.getLongValue());
FieldValue enumDispFieldValue = rowResult.get(enumDispAlias);
String enumDisp = enumDispFieldValue.isNull() ? null : enumDispFieldValue.getStringValue();
Literal enumVal;
String enumDisp;
if (attribute.isValueDisplay()) {
enumVal =
Literal.forInt64(
enumValFieldValue.isNull() ? null : enumValFieldValue.getLongValue());
FieldValue enumDispFieldValue = rowResult.get(enumDispAlias);
enumDisp = enumDispFieldValue.isNull() ? null : enumDispFieldValue.getStringValue();
} else {
enumVal =
Literal.forString(
enumValFieldValue.isNull() ? null : enumValFieldValue.getStringValue());
enumDisp = null;
}
FieldValue enumCountFieldValue = rowResult.get(enumCountAlias);
long enumCount = enumCountFieldValue.getLongValue();
enumCounts.add(Pair.of(new ValueDisplay(enumVal, enumDisp), enumCount));

if (enumCounts.size() > MAX_ENUM_VALS_FOR_DISPLAY_HINT) {
// if there are more than the max number of values, then skip the display hint
// If there are more than the max number of values, then skip the display hint.
LOGGER.info(
"Skipping enum values display hint because there are >{} possible values: {}",
MAX_ENUM_VALS_FOR_DISPLAY_HINT,
Expand All @@ -334,17 +391,82 @@ private List<Pair<ValueDisplay, Long>> computeEnumHint(Attribute attribute, bool
}
}

// Check that there is exactly one display per value.
Map<Literal, String> valDisplay = new HashMap<>();
enumCounts.forEach(
enumCount -> {
if (valDisplay.containsKey(enumCount.getKey().getValue())) {
throw new InvalidConfigException(
"Found >1 possible display for the enum value " + enumCount.getKey().getValue());
} else {
valDisplay.put(enumCount.getKey().getValue(), enumCount.getKey().getDisplay());
}
});
if (attribute.isValueDisplay()) {
// Check that there is exactly one display per value.
Map<Literal, String> valDisplay = new HashMap<>();
enumCounts.forEach(
enumCount -> {
if (valDisplay.containsKey(enumCount.getKey().getValue())) {
throw new InvalidConfigException(
"Found >1 possible display for the enum value " + enumCount.getKey().getValue());
} else {
valDisplay.put(enumCount.getKey().getValue(), enumCount.getKey().getDisplay());
}
});
}
return enumCounts;
}

private List<Pair<Literal, Long>> computeEnumHintForRepeatedStringValue(
Attribute attribute, boolean isDryRun) {
// TODO: Consolidate the logic here with the ValueDisplay enum hint method.
// Build the query.
// SELECT flattenedAttrVal AS enumVal, COUNT(*) AS enumCount FROM indextable
// CROSS JOIN UNNEST(indextable.attrVal) AS flattenedAttrVal
// GROUP BY enumVal
SqlField attrValField = indexAttributesTable.getAttributeValueField(attribute.getName());
final String enumValAlias = "enumVal";
final String enumCountAlias = "enumCount";
final String flattenedAttrValAlias = "flattenedAttrVal";
SqlField flattenedAttrValField = SqlField.of(flattenedAttrValAlias);

String selectEnumCountSql =
"SELECT "
+ SqlQueryField.of(flattenedAttrValField, enumValAlias).renderForSelect()
+ ", COUNT(*) AS "
+ enumCountAlias
+ " FROM "
+ indexAttributesTable.getTablePointer().render()
+ " CROSS JOIN UNNEST("
+ SqlQueryField.of(attrValField).renderForSelect()
+ ") AS "
+ flattenedAttrValAlias
+ " GROUP BY "
+ SqlQueryField.of(attrValField, enumValAlias).renderForGroupBy(null, true);
LOGGER.info("SQL enum count: {}", selectEnumCountSql);

// Execute the query.
List<Pair<Literal, Long>> enumCounts = new ArrayList<>();
if (isDryRun) {
if (getOutputTable().isEmpty()) {
LOGGER.info("Skipping query dry run because output table does not exist yet.");
} else {
googleBigQuery.dryRunQuery(selectEnumCountSql);
}
enumCounts.add(Pair.of(Literal.forString(""), 0L));
} else {
TableResult tableResult = googleBigQuery.runQueryLongTimeout(selectEnumCountSql);

// Parse the result rows.
for (FieldValueList rowResult : tableResult.getValues()) {
FieldValue enumValFieldValue = rowResult.get(enumValAlias);
Literal enumVal =
Literal.forString(
enumValFieldValue.isNull() ? null : enumValFieldValue.getStringValue());
FieldValue enumCountFieldValue = rowResult.get(enumCountAlias);
long enumCount = enumCountFieldValue.getLongValue();
enumCounts.add(Pair.of(enumVal, enumCount));

if (enumCounts.size() > MAX_ENUM_VALS_FOR_DISPLAY_HINT) {
// if there are more than the max number of values, then skip the display hint
LOGGER.info(
"Skipping enum values display hint because there are >{} possible values: {}",
MAX_ENUM_VALS_FOR_DISPLAY_HINT,
attribute.getName());
return List.of();
}
}
}
return enumCounts;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -69,13 +69,22 @@ public void run(boolean isDryRun) {
SqlField attributeTextField;
if (attribute.isValueDisplay()) {
attributeTextField = indexTable.getAttributeDisplayField(attribute.getName());
} else if (!attribute.getDataType().equals(DataType.STRING)) {
attributeTextField =
indexTable
.getAttributeValueField(attribute.getName())
.cloneWithFunctionWrapper("CAST(${fieldSql} AS STRING)");
} else {
String functionWrapper = null;
if (!attribute.getDataType().equals(DataType.STRING)) {
functionWrapper = "CAST(${fieldSql} AS STRING)";
}
if (attribute.isDataTypeRepeated()) {
functionWrapper =
"ARRAY_TO_STRING("
+ (functionWrapper == null ? "${fieldSql}" : functionWrapper)
+ ", \" \")";
}

attributeTextField = indexTable.getAttributeValueField(attribute.getName());
if (functionWrapper != null) {
attributeTextField = attributeTextField.cloneWithFunctionWrapper(functionWrapper);
}
}

String idTextSql =
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -42,10 +42,21 @@ public static ApiAttribute toApiObject(Attribute attribute) {

public static ApiValueDisplay toApiObject(ValueDisplay valueDisplay) {
ApiValueDisplay apiObject = new ApiValueDisplay();
if (valueDisplay != null) {
apiObject.value(toApiObject(valueDisplay.getValue())).display(valueDisplay.getDisplay());
if (valueDisplay == null) {
return apiObject;
} else if (valueDisplay.isRepeatedValue()) {
return apiObject
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the existence of a display field not independent of whether the value is an array?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it's impossible to have both a display string and a repeated value. In the future, we could support that though, or maybe it would be better to support an array of value-display pairs, not sure. For the variant entity, we only have arrays of strings.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that's what I'm wondering and trying to see if we're baking in the way variant search uses arrays the right amount. The array of values as used here seems like it would map to a single display string. Presumably this works differently when a DisplayValue object is returned as part of an enum hint and there wouldn't be repeated values there, but maybe that's not always true? Should some places have a list of ValueDisplays instead of a single one? Should the array be inside Literal instead? Maybe there are two versions, this isn't a general array but specifically support for a list of repeated, individual items, and there could exist another where the array is treated as a single unit?

I don't have a particular proposal and I'm just trying to understand the model and how it will work going forward.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the backend (indexing & query engine) does not support repeated values that have both a value & display component. I think that would be a good enhancement, but it's not needed for variant search and we don't have any other immediate use cases, so in the interest of time I decided to tackle this smaller change first.

I don't think array should be inside literal, because then we'd need a new data type for each array (e.g. string-array, int64-array).

For hints, we don't return repeated values. The indexing calculates counts considering each array element separately. e.g. If a condition could be included in >1 vocabulary (condition.vocabulary is a repeated attribute), then we count the condition once per vocabulary. The returned hints are still vocabulary-count pairs, not array of vocabularies-count pairs.

Currently, the only time the API could return a list of values is when fetching an attribute in a list query. I think we may want to expand this in the future, though probably not all at once here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per offline discussion, we decided that in the future it might be useful to support:

  • A single display string for an array type attribute. The backend doesn't support that now, but the existing definition of the ValueDisplay OpenAPI object would support that.
  • A list of related entity instances. This would support the case where there is more than just a list of "tags" (e.g. a list of locations, each of which has an id, a display name, and maybe also a description). Supporting this larger use case obviates the need to support a list of value-display pairs here, since that would be possible with the related entity instances, with each instance returning two attributes. As a general rule, there are 2 cases we'd like to support: "tagging" where it's just a list of values (could be strings or ints or any other data type), and "related instances" where it's more than a list values (could be one or more additional pieces of information per element).

.isRepeatedValue(true)
.repeatedValue(
valueDisplay.getRepeatedValue().stream().map(ToApiUtils::toApiObject).toList());
} else {
ApiLiteral apiValue = toApiObject(valueDisplay.getValue());
return apiObject
.value(apiValue)
.display(valueDisplay.getDisplay())
.isRepeatedValue(false)
.repeatedValue(List.of(apiValue));
}
return apiObject;
}

public static ApiLiteral toApiObject(Literal literal) {
Expand Down
6 changes: 6 additions & 0 deletions service/src/main/resources/api/service_openapi.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1432,6 +1432,12 @@ components:
type: string
description: Optional display string
nullable: true
repeatedValue:
type: array
items:
$ref: "#/components/schemas/Literal"
isRepeatedValue:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? Doesn't the existence of the repeatedValue (or it being non-empty) already indicate this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to distinguish an empty array from a non-repeated/scalar value. Some of the rows in the variant table have empty arrays.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there potentially be other types as well and this should be an enum? Should value be deprecated and replaced with values instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have plans for any other types, though if they come up then an enum certainly sounds like a good choice to consider.

I'm happy to remove value here and just always populate a single element of values instead -- would that be easier for you? I didn't do it initially because it wouldn't be backwards compatible, so I figured you'd need to make a bunch of changes on the frontend to accommodate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per offline discussion, we decided to:

  • Always populate the values array, so the UI can move over to always using that, without breaking backwards compatibility immediately.
  • Continue populating the is repeated flag, just in case the UI hits a case where it needs to distinguish between a single element array and a non-repeated value. If it turns out the UI doesn't need to distinguish between those two cases, we can remove this flag from OpenAPI in the future.

type: boolean

Literal:
type: object
Expand Down
1 change: 1 addition & 0 deletions ui/src/tanagra-underlay/underlayConfig.ts
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ export type SZAttribute = {
displayHintRangeMax?: number;
displayHintRangeMin?: number;
isComputeDisplayHint?: boolean;
isDataTypeRepeated?: boolean;
isSuppressedForExport?: boolean;
name: string;
runtimeDataType?: SZDataType;
Expand Down
Loading