-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support excluding time values in SimpleSegmentNameGenerator #11650
Support excluding time values in SimpleSegmentNameGenerator #11650
Conversation
|
||
public SimpleSegmentNameGenerator(String segmentNamePrefix, @Nullable String segmentNamePostfix) { | ||
this(segmentNamePrefix, segmentNamePostfix, false); | ||
this(segmentNamePrefix, segmentNamePostfix, false, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i noticed the use of this constructor for minion here: https://github.com/apache/pinot/blob/master/pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/name/SegmentNameGeneratorFactory.java#L54-L57 but it doesn't seem like appendUUIDToSegmentName
is threaded through there so figured it wasn't necessary to support this one either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes defaults to false, so should be good.
Codecov Report
@@ Coverage Diff @@
## master #11650 +/- ##
============================================
+ Coverage 62.97% 63.09% +0.11%
Complexity 1118 1118
============================================
Files 2335 2342 +7
Lines 125216 125892 +676
Branches 19209 19360 +151
============================================
+ Hits 78854 79428 +574
- Misses 40724 40811 +87
- Partials 5638 5653 +15
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 157 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious to know the motivation for this change as well? Could you add to the description. Thanks !
@@ -65,6 +65,7 @@ public class SegmentGenerationTaskRunner implements Serializable { | |||
public static final String DEPRECATED_USE_LOCAL_DIRECTORY_SEQUENCE_ID = "local.directory.sequence.id"; | |||
public static final String USE_GLOBAL_DIRECTORY_SEQUENCE_ID = "use.global.directory.sequence.id"; | |||
public static final String APPEND_UUID_TO_SEGMENT_NAME = "append.uuid.to.segment.name"; | |||
public static final String OMIT_TIMESTAMPS_IN_SEGMENT_NAME = "omit.timestamps.in.segment.name"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good otherwise.
Can we rename to "append.timestamps.to.segment.name", to be consistent with how its done for uuid (append.uuid.to.segment.name). Also we creating the name here, so might be intuitive to include/not include?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with append.timestamps.to.segment.name
, we'd have to make the value true
by default to match existing behavior. it seems a little unintuitive to default a parameter to true rather than false here.
so might be intuitive to include/not include?
sorry i'm not following this comment. what do you mean to include?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im fine with either approach. I just thought having the same naming convention would make it easier to use this class, but I guess this would matter only to users who want to override defaults.
Can we just rename this to exclude.timestamps.in.segment.name ? Looks good other wise. Thanks !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Updated the name!
@swaminathanmanish thanks for reviewing! updated the description to include motivation and responded to your comment |
Thanks for adding detailed notes. Did you consider using FixedSegmentNameGenerator, where you can exactly specify the segment name that you want ?
|
Yes, we still want to use the global sequence ID which isn't supported by the |
Thanks ! Lets go with SimpleSegmentNameGenerator itself. FixedSegmentNameGenerator can just be for a specific names set by the user, without the other attributes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, other than the one naming comment. Thanks !
@swaminathanmanish updated the name. let me know if there's other feedback that needs addressing. |
@@ -65,6 +65,7 @@ public class SegmentGenerationTaskRunner implements Serializable { | |||
public static final String DEPRECATED_USE_LOCAL_DIRECTORY_SEQUENCE_ID = "local.directory.sequence.id"; | |||
public static final String USE_GLOBAL_DIRECTORY_SEQUENCE_ID = "use.global.directory.sequence.id"; | |||
public static final String APPEND_UUID_TO_SEGMENT_NAME = "append.uuid.to.segment.name"; | |||
public static final String EXCLUDE_TIMESTAMPS_IN_SEGMENT_NAME = "exclude.timestamps.in.segment.name"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest calling it time instead of timestamp because it might not follow timestamp format
public static final String EXCLUDE_TIMESTAMPS_IN_SEGMENT_NAME = "exclude.timestamps.in.segment.name"; | |
public static final String EXCLUDE_TIME_IN_SEGMENT_NAME = "exclude.time.in.segment.name"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we should directly use the constant within BatchConfigProperties
@@ -190,6 +193,10 @@ public boolean isAppendUUIDToSegmentName() { | |||
return _appendUUIDToSegmentName; | |||
} | |||
|
|||
public boolean isExcludeTimestampsFromSegmentName() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this consistent with config name
public boolean isExcludeTimestampsFromSegmentName() { | |
public boolean isExcludeTimeInSegmentName() { |
@@ -62,6 +62,7 @@ private BatchConfigProperties() { | |||
public static final String FAIL_ON_EMPTY_SEGMENT = "fail.on.empty.segment"; | |||
public static final String AUTH_TOKEN = "authToken"; | |||
public static final String APPEND_UUID_TO_SEGMENT_NAME = "append.uuid.to.segment.name"; | |||
public static final String EXCLUDE_TIMESTAMPS_IN_SEGMENT_NAME = "exclude.timestamps.in.segment.name"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static final String EXCLUDE_TIMESTAMPS_IN_SEGMENT_NAME = "exclude.timestamps.in.segment.name"; | |
public static final String EXCLUDE_TIME_IN_SEGMENT_NAME = "exclude.time.in.segment.name"; |
...rc/main/java/org/apache/pinot/plugin/ingestion/batch/common/SegmentGenerationTaskRunner.java
Outdated
Show resolved
Hide resolved
@@ -66,8 +68,9 @@ public String generateSegmentName(int sequenceId, @Nullable Object minTimeValue, | |||
SegmentNameUtils.validatePartialOrFullSegmentName(maxTimeValue.toString()); | |||
} | |||
|
|||
return JOINER.join(_segmentNamePrefix, minTimeValue, maxTimeValue, _segmentNamePostfix, | |||
sequenceId >= 0 ? sequenceId : null, _appendUUIDToSegmentName ? UUID.randomUUID() : null); | |||
return JOINER.join(_segmentNamePrefix, _excludeTimestampsInSegmentName ? null : minTimeValue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should put the check before performing the time value check
return JOINER.join(_segmentNamePrefix, _excludeTimestampsInSegmentName ? null : minTimeValue, | |
if (_excludeTimeInSegmentName) { | |
JOINER.join(_segmentNamePrefix, ...); | |
} else { | |
if (minTimeValue != null) { | |
... | |
} |
@Jackie-Jiang thanks for the review. addressed your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with minor comments
@@ -83,4 +86,19 @@ public void testWithMalFormedTableNameSegmentNamePostfixTimeValue() { | |||
assertEquals(e.getMessage(), "Invalid partial or full segment name: 12|34"); | |||
} | |||
} | |||
|
|||
@Test | |||
public void testWithexcludeTimeInSegmentName() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public void testWithexcludeTimeInSegmentName() { | |
public void testWithExcludeTimeInSegmentName() { |
@@ -51,6 +51,7 @@ public class BatchConfig { | |||
private final String _segmentNamePostfix; | |||
private final boolean _excludeSequenceId; | |||
private final boolean _appendUUIDToSegmentName; | |||
private final boolean _excludeTimeFromSegmentName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private final boolean _excludeTimeFromSegmentName; | |
private final boolean _excludeTimeInSegmentName; |
public boolean isExcludeTimeFromSegmentName() { | ||
return _excludeTimeFromSegmentName; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public boolean isExcludeTimeFromSegmentName() { | |
return _excludeTimeFromSegmentName; | |
} | |
public boolean isExcludeTimeInSegmentName() { | |
return _excludeTimeInSegmentName; | |
} |
ah thanks for catching those @Jackie-Jiang. i fixed them up. |
@Jackie-Jiang can you take another pass if this is good to merge now? CI is passing |
Related to #11649
We use the
normalizedDate
segment name generator for an append table w/ a time column. The generators add a min/max time value to the segment name (ex.example_table_2023-09-22_2023-10-02_46
). We had a user backfill their batch jobs overwriting existing data. Some of the upstream data had changed as part of this backfill which caused the min/max time value in the segment name to also change. This caused inconsistent data since some new segments didn't replace thier old ones.This PR adds param
omit.timestamps.in.segment.name
to the SimpleSegmentNameGenerator to omit time values from the segment name. I've only added this to the simple generator since it doesn't seem intuitive to have thenormalizedDate
generator omit timestamps. We plan to create segments w/ this generator by including the execution date for the batch run (ex.example_table_20230928_46
) since it will give us a consistent set of segment file names as long as the # of input files is the same.I've updated the unit tests. It seems like the best place to write an end to end test might be in https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-standalone/src/test/java/org/apache/pinot/plugin/ingestion/batch/standalone/SegmentGenerationJobRunnerTest.java Let me know if that's desired.
Long-term, it seems ideal that we'd use https://docs.pinot.apache.org/operators/operating-pinot/consistent-push-and-rollback for re-running segment creation on a given day so users aren't contrained on how the input data is structured.
cc @Jackie-Jiang