Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support excluding time values in SimpleSegmentNameGenerator #11650

Merged

Conversation

dang-stripe
Copy link
Contributor

@dang-stripe dang-stripe commented Sep 21, 2023

Related to #11649

We use the normalizedDate segment name generator for an append table w/ a time column. The generators add a min/max time value to the segment name (ex. example_table_2023-09-22_2023-10-02_46). We had a user backfill their batch jobs overwriting existing data. Some of the upstream data had changed as part of this backfill which caused the min/max time value in the segment name to also change. This caused inconsistent data since some new segments didn't replace thier old ones.

This PR adds param omit.timestamps.in.segment.name to the SimpleSegmentNameGenerator to omit time values from the segment name. I've only added this to the simple generator since it doesn't seem intuitive to have the normalizedDate generator omit timestamps. We plan to create segments w/ this generator by including the execution date for the batch run (ex. example_table_20230928_46) since it will give us a consistent set of segment file names as long as the # of input files is the same.

I've updated the unit tests. It seems like the best place to write an end to end test might be in https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-standalone/src/test/java/org/apache/pinot/plugin/ingestion/batch/standalone/SegmentGenerationJobRunnerTest.java Let me know if that's desired.

Long-term, it seems ideal that we'd use https://docs.pinot.apache.org/operators/operating-pinot/consistent-push-and-rollback for re-running segment creation on a given day so users aren't contrained on how the input data is structured.

cc @Jackie-Jiang


public SimpleSegmentNameGenerator(String segmentNamePrefix, @Nullable String segmentNamePostfix) {
this(segmentNamePrefix, segmentNamePostfix, false);
this(segmentNamePrefix, segmentNamePostfix, false, false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i noticed the use of this constructor for minion here: https://github.com/apache/pinot/blob/master/pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/name/SegmentNameGeneratorFactory.java#L54-L57 but it doesn't seem like appendUUIDToSegmentName is threaded through there so figured it wasn't necessary to support this one either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes defaults to false, so should be good.

@codecov-commenter
Copy link

codecov-commenter commented Sep 21, 2023

Codecov Report

Merging #11650 (901b2d3) into master (4226ea0) will increase coverage by 0.11%.
Report is 54 commits behind head on master.
The diff coverage is 81.81%.

@@             Coverage Diff              @@
##             master   #11650      +/-   ##
============================================
+ Coverage     62.97%   63.09%   +0.11%     
  Complexity     1118     1118              
============================================
  Files          2335     2342       +7     
  Lines        125216   125892     +676     
  Branches      19209    19360     +151     
============================================
+ Hits          78854    79428     +574     
- Misses        40724    40811      +87     
- Partials       5638     5653      +15     
Flag Coverage Δ
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 63.06% <81.81%> (+48.59%) ⬆️
java-17 62.95% <81.81%> (+48.50%) ⬆️
java-20 62.94% <81.81%> (-0.02%) ⬇️
temurin 63.09% <81.81%> (+0.11%) ⬆️
unittests 63.08% <81.81%> (+0.11%) ⬆️
unittests1 67.26% <80.00%> (+0.13%) ⬆️
unittests2 14.42% <9.09%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...tion/batch/common/SegmentGenerationTaskRunner.java 70.76% <100.00%> (+0.92%) ⬆️
...not/spi/ingestion/batch/BatchConfigProperties.java 0.00% <ø> (ø)
...ache/pinot/segment/local/utils/IngestionUtils.java 28.32% <0.00%> (ø)
.../apache/pinot/spi/ingestion/batch/BatchConfig.java 70.00% <80.00%> (-1.02%) ⬇️
...t/spi/creator/name/SimpleSegmentNameGenerator.java 89.28% <85.71%> (-1.63%) ⬇️

... and 157 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Jackie-Jiang
Copy link
Contributor

cc @snleee @swaminathanmanish

@Jackie-Jiang Jackie-Jiang added the Configuration Config changes (addition/deletion/change in behavior) label Sep 21, 2023
Copy link
Contributor

@swaminathanmanish swaminathanmanish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious to know the motivation for this change as well? Could you add to the description. Thanks !

@@ -65,6 +65,7 @@ public class SegmentGenerationTaskRunner implements Serializable {
public static final String DEPRECATED_USE_LOCAL_DIRECTORY_SEQUENCE_ID = "local.directory.sequence.id";
public static final String USE_GLOBAL_DIRECTORY_SEQUENCE_ID = "use.global.directory.sequence.id";
public static final String APPEND_UUID_TO_SEGMENT_NAME = "append.uuid.to.segment.name";
public static final String OMIT_TIMESTAMPS_IN_SEGMENT_NAME = "omit.timestamps.in.segment.name";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good otherwise.
Can we rename to "append.timestamps.to.segment.name", to be consistent with how its done for uuid (append.uuid.to.segment.name). Also we creating the name here, so might be intuitive to include/not include?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with append.timestamps.to.segment.name, we'd have to make the value true by default to match existing behavior. it seems a little unintuitive to default a parameter to true rather than false here.

so might be intuitive to include/not include?

sorry i'm not following this comment. what do you mean to include?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im fine with either approach. I just thought having the same naming convention would make it easier to use this class, but I guess this would matter only to users who want to override defaults.

Can we just rename this to exclude.timestamps.in.segment.name ? Looks good other wise. Thanks !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Updated the name!

@dang-stripe
Copy link
Contributor Author

@swaminathanmanish thanks for reviewing! updated the description to include motivation and responded to your comment

@swaminathanmanish
Copy link
Contributor

Related to #11649

We use the normalizedDate segment name generator for an append table w/ a time column. The generators add a min/max time value to the segment name (ex. example_table_2023-09-22_2023-10-02_46). We had a user backfill their batch jobs overwriting existing data. Some of the upstream data had changed as part of this backfill which caused the min/max time value in the segment name to also change. This caused inconsistent data since some new segments didn't replace thier old ones.

This PR adds param omit.timestamps.in.segment.name to the SimpleSegmentNameGenerator to omit time values from the segment name. I've only added this to the simple generator since it doesn't seem intuitive to have the normalizedDate generator omit timestamps. We plan to create segments w/ this generator by including the execution date for the batch run (ex. example_table_20230928_46) since it will give us a consistent set of segment file names as long as the # of input files is the same.

I've updated the unit tests. It seems like the best place to write an end to end test might be in https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-standalone/src/test/java/org/apache/pinot/plugin/ingestion/batch/standalone/SegmentGenerationJobRunnerTest.java Let me know if that's desired.

Long-term, it seems ideal that we'd use https://docs.pinot.apache.org/operators/operating-pinot/consistent-push-and-rollback for re-running segment creation on a given day so users aren't contrained on how the input data is structured.

cc @Jackie-Jiang

Thanks for adding detailed notes. Did you consider using FixedSegmentNameGenerator, where you can exactly specify the segment name that you want ?

BatchConfigProperties.SegmentNameGeneratorType.FIXED:
        return new FixedSegmentNameGenerator(_segmentName);

@dang-stripe
Copy link
Contributor Author

Yes, we still want to use the global sequence ID which isn't supported by the FixedSegmentNameGenerator. Another option here is to update FixedSegmentNameGenerator to support appending the sequence ID. Let me know if that approach is preferred.

@swaminathanmanish
Copy link
Contributor

Yes, we still want to use the global sequence ID which isn't supported by the FixedSegmentNameGenerator. Another option here is to update FixedSegmentNameGenerator to support appending the sequence ID. Let me know if that approach is preferred.

Thanks ! Lets go with SimpleSegmentNameGenerator itself. FixedSegmentNameGenerator can just be for a specific names set by the user, without the other attributes.

Copy link
Contributor

@swaminathanmanish swaminathanmanish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, other than the one naming comment. Thanks !

@dang-stripe dang-stripe changed the title Support omitting time values in SimpleSegmentNameGenerator Support excluding time values in SimpleSegmentNameGenerator Oct 2, 2023
@dang-stripe
Copy link
Contributor Author

@swaminathanmanish updated the name. let me know if there's other feedback that needs addressing.

@@ -65,6 +65,7 @@ public class SegmentGenerationTaskRunner implements Serializable {
public static final String DEPRECATED_USE_LOCAL_DIRECTORY_SEQUENCE_ID = "local.directory.sequence.id";
public static final String USE_GLOBAL_DIRECTORY_SEQUENCE_ID = "use.global.directory.sequence.id";
public static final String APPEND_UUID_TO_SEGMENT_NAME = "append.uuid.to.segment.name";
public static final String EXCLUDE_TIMESTAMPS_IN_SEGMENT_NAME = "exclude.timestamps.in.segment.name";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest calling it time instead of timestamp because it might not follow timestamp format

Suggested change
public static final String EXCLUDE_TIMESTAMPS_IN_SEGMENT_NAME = "exclude.timestamps.in.segment.name";
public static final String EXCLUDE_TIME_IN_SEGMENT_NAME = "exclude.time.in.segment.name";

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we should directly use the constant within BatchConfigProperties

@@ -190,6 +193,10 @@ public boolean isAppendUUIDToSegmentName() {
return _appendUUIDToSegmentName;
}

public boolean isExcludeTimestampsFromSegmentName() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this consistent with config name

Suggested change
public boolean isExcludeTimestampsFromSegmentName() {
public boolean isExcludeTimeInSegmentName() {

@@ -62,6 +62,7 @@ private BatchConfigProperties() {
public static final String FAIL_ON_EMPTY_SEGMENT = "fail.on.empty.segment";
public static final String AUTH_TOKEN = "authToken";
public static final String APPEND_UUID_TO_SEGMENT_NAME = "append.uuid.to.segment.name";
public static final String EXCLUDE_TIMESTAMPS_IN_SEGMENT_NAME = "exclude.timestamps.in.segment.name";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static final String EXCLUDE_TIMESTAMPS_IN_SEGMENT_NAME = "exclude.timestamps.in.segment.name";
public static final String EXCLUDE_TIME_IN_SEGMENT_NAME = "exclude.time.in.segment.name";

@@ -66,8 +68,9 @@ public String generateSegmentName(int sequenceId, @Nullable Object minTimeValue,
SegmentNameUtils.validatePartialOrFullSegmentName(maxTimeValue.toString());
}

return JOINER.join(_segmentNamePrefix, minTimeValue, maxTimeValue, _segmentNamePostfix,
sequenceId >= 0 ? sequenceId : null, _appendUUIDToSegmentName ? UUID.randomUUID() : null);
return JOINER.join(_segmentNamePrefix, _excludeTimestampsInSegmentName ? null : minTimeValue,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should put the check before performing the time value check

Suggested change
return JOINER.join(_segmentNamePrefix, _excludeTimestampsInSegmentName ? null : minTimeValue,
if (_excludeTimeInSegmentName) {
JOINER.join(_segmentNamePrefix, ...);
} else {
if (minTimeValue != null) {
...
}

@dang-stripe
Copy link
Contributor Author

@Jackie-Jiang thanks for the review. addressed your comments.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor comments

@@ -83,4 +86,19 @@ public void testWithMalFormedTableNameSegmentNamePostfixTimeValue() {
assertEquals(e.getMessage(), "Invalid partial or full segment name: 12|34");
}
}

@Test
public void testWithexcludeTimeInSegmentName() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void testWithexcludeTimeInSegmentName() {
public void testWithExcludeTimeInSegmentName() {

@@ -51,6 +51,7 @@ public class BatchConfig {
private final String _segmentNamePostfix;
private final boolean _excludeSequenceId;
private final boolean _appendUUIDToSegmentName;
private final boolean _excludeTimeFromSegmentName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private final boolean _excludeTimeFromSegmentName;
private final boolean _excludeTimeInSegmentName;

Comment on lines 196 to 198
public boolean isExcludeTimeFromSegmentName() {
return _excludeTimeFromSegmentName;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public boolean isExcludeTimeFromSegmentName() {
return _excludeTimeFromSegmentName;
}
public boolean isExcludeTimeInSegmentName() {
return _excludeTimeInSegmentName;
}

@dang-stripe
Copy link
Contributor Author

ah thanks for catching those @Jackie-Jiang. i fixed them up.

@dang-stripe
Copy link
Contributor Author

dang-stripe commented Oct 3, 2023

pushing an empty commit to trigger CI again. the test failure seemed unrelated. i couldn't repro locally.

Screenshot 2023-10-02 at 7 43 03 PM

@dang-stripe
Copy link
Contributor Author

@Jackie-Jiang can you take another pass if this is good to merge now? CI is passing

@Jackie-Jiang Jackie-Jiang merged commit f26faa4 into apache:master Oct 3, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Configuration Config changes (addition/deletion/change in behavior) enhancement ingestion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants