Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reverse transform constraints later in the DataProcessor #1511

Merged
merged 5 commits into from
Jul 24, 2023
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 7 additions & 3 deletions sdv/data_processing/data_processor.py
Original file line number Diff line number Diff line change
Expand Up @@ -699,9 +699,6 @@ def reverse_transform(self, data, reset_keys=False):
except rdt.errors.NotFittedError:
LOGGER.info(f'HyperTransformer has not been fitted for table {self.table_name}')

for constraint in reversed(self._constraints_to_reverse):
reversed_data = constraint.reverse_transform(reversed_data)

num_rows = len(reversed_data)
sampled_columns = list(reversed_data.columns)
missing_columns = [
Expand All @@ -720,6 +717,13 @@ def reverse_transform(self, data, reset_keys=False):
generated_keys = self.generate_keys(num_rows, reset_keys)
sampled_columns.extend(self._keys)

for constraint in reversed(self._constraints_to_reverse):
reversed_data = constraint.reverse_transform(reversed_data)
Comment on lines +720 to +721
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it work if you just did sampled_columns = constraint.reverse_transform(sampled_columns)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't get what you mean, you can't apply reverse_transform to sampled_columns, since that's a list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right, I got confused. Should we just set sampled_columns to be reversed_data.columns on line 725 instead of extending. Otherwise that list might contain columns that were created by constraints that we don't want

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was actually the first thing I tried. It doesn't work because sampled_columns has the keys, which reversed_data.columns doesn't have.

Your concern is actually not a problem because of line 731, it removes any columns not seen in the metadata.I originally had an extra two lines to remove these extra columns directly, for clarity, but I decided to remove them since they were unnecessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Can you edit the comments before line 731 to add that extra columns will also be dropped?


# Add new columns generated by the constraint
new_columns = list(set(reversed_data.columns) - set(sampled_columns))
sampled_columns.extend(new_columns)

# Sort the sampled columns in the order of the metadata
# In multitable there may be missing columns in the sample such as foreign keys
# And alternate keys. Thats the reason of ensuring that the metadata column is within
Expand Down
Loading