Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add detect_from_csvs and detect_from_dataframes methods to MultiTableMetadata #1533

Merged
merged 6 commits into from
Aug 14, 2023

Conversation

R-Palazzo
Copy link
Contributor

Resolve #1520

@R-Palazzo R-Palazzo requested a review from a team as a code owner August 7, 2023 14:22
@R-Palazzo R-Palazzo removed the request for review from a team August 7, 2023 14:22
@codecov-commenter
Copy link

codecov-commenter commented Aug 7, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01% 🎉

Comparison is base (0433374) 96.40% compared to head (ab8352e) 96.42%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1533      +/-   ##
==========================================
+ Coverage   96.40%   96.42%   +0.01%     
==========================================
  Files          49       49              
  Lines        3982     3999      +17     
==========================================
+ Hits         3839     3856      +17     
  Misses        143      143              
Files Changed Coverage Δ
sdv/metadata/multi_table.py 99.43% <100.00%> (+0.02%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@fealho fealho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

the values are the dataframes.
"""
if not data or not all(isinstance(df, pd.DataFrame) for df in data.values()):
raise ValueError('The provided dictionary must contain only pandas DataFrame objects')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End with a period.


Args:
data (dict):
Dictionary of ``pandas.DataFrame`` objects where the keys are the table names and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify to `Dictionary of table names to dataframes.

folder_name (str):
Name of the folder to detect the metadata from.

Raises:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never write raises in docstrings. I guess for consistency it would make sense not to add it?

Raises:
ValueError: If no CSV files are detected in the folder.
"""
csv_files = [filename for filename in os.listdir(folder_name) if filename.endswith('.csv')]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could validate the folder name exists if you want, up to you.

raise ValueError(f"No CSV files detected in the folder '{folder_name}'")

for filename in csv_files:
table_name = filename[:-4] # Removing the .csv extension
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do this directly in line 388.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine here because I need the filename and the table_name after

csv_files = [filename for filename in os.listdir(folder_name) if filename.endswith('.csv')]

if not csv_files:
raise ValueError(f"No CSV files detected in the folder '{folder_name}'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End with period, pretty much all our error messages do, so it makes sense to be consistent.


# Run and Assert
expected_message = re.escape("No CSV files detected in the folder '{}'".format(tmp_path))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New line not necessary.

}
}
},
'relationships': [],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will add the logic to detect the relationships in another issue, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@R-Palazzo
Copy link
Contributor Author

Thanks for your review @fealho. I addressed the comments in b85bf4f.

Comment on lines 385 to 388
if os.path.exists(folder_name) and os.path.isdir(folder_name):
csv_files = [
filename for filename in os.listdir(folder_name) if filename.endswith('.csv')
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using os.path we can use Pathlib which is more powerful and will improve the code down the line.
For example:

Path().rglob(f'{folder_name}/*.csv')

will return a list with all the paths of csv files-

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good advice, thanks! Done in 4c37e19.

raise ValueError(f"No CSV files detected in the folder '{folder_name}'.")

for filename in csv_files:
table_name = filename[:-4] # Removing the .csv extension
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if changed to pathlib you can directly access the name of the file.


for filename in csv_files:
table_name = filename[:-4] # Removing the .csv extension
csv_file = os.path.join(folder_name, filename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't need this since pathlib will have the full path already.

@fealho fealho self-requested a review August 10, 2023 15:10
Copy link
Member

@fealho fealho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@pvk-developer pvk-developer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻 LGTM!


metadata = MultiTableMetadata()

with tempfile.TemporaryDirectory() as temp_dir:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the tmp_path fixture instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in 8cf7f02

@@ -354,11 +370,32 @@ def detect_table_from_csv(self, table_name, filepath):
"""
self._validate_table_not_detected(table_name)
table = SingleTableMetadata()
data = table._load_data_from_csv(filepath)
table._detect_columns(data)
table.detect_from_csv(filepath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was a reason this was the way it was. Something to do with avoiding undesired printing maybe? Idk if we want to change it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need this change because _load_data_from_csv no longer exists in SingleTableMetadata. I added an integration test in 922aad0 for the detect_table_from_csv which fails in the master branch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It got moved here

def load_data_from_csv(filepath, pandas_kwargs=None):

I think we don't want to call detect_from_csv because if the error gets raised it says
raise InvalidMetadataError(
'Metadata already exists. Create a new ``SingleTableMetadata`` '
'object to detect from other data sources.'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation, done in ef79cf8

@@ -1443,7 +1443,7 @@ def test_detect_table_from_csv(self, single_table_mock, log_mock):
# Setup
metadata = MultiTableMetadata()
fake_data = Mock()
single_table_mock.return_value._load_data_from_csv.return_value = fake_data
single_table_mock.return_value.detect_from_csv.return_value = fake_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should change this

Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments! 🚢 📦

@R-Palazzo R-Palazzo force-pushed the issue-1520-detect-from-multi-tables branch from ef79cf8 to ab8352e Compare August 14, 2023 08:34
@R-Palazzo R-Palazzo merged commit 72f7e1f into master Aug 14, 2023
37 checks passed
@R-Palazzo R-Palazzo deleted the issue-1520-detect-from-multi-tables branch August 14, 2023 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add detect_from_csvs and detect_from_dataframes methods to MultiTableMetadata
5 participants