Add detect_from_csvs and detect_from_dataframes methods to MultiTableMetadata #1533

R-Palazzo · 2023-08-07T14:22:36Z

Resolve #1520

codecov-commenter · 2023-08-07T14:29:26Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01% 🎉

Comparison is base (0433374) 96.40% compared to head (ab8352e) 96.42%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1533      +/-   ##
==========================================
+ Coverage   96.40%   96.42%   +0.01%     
==========================================
  Files          49       49              
  Lines        3982     3999      +17     
==========================================
+ Hits         3839     3856      +17     
  Misses        143      143

Files Changed	Coverage Δ
sdv/metadata/multi_table.py	`99.43% <100.00%> (+0.02%)`	⬆️

fealho

Looks great!

fealho · 2023-08-08T17:09:01Z

sdv/metadata/multi_table.py

+                the values are the dataframes.
+        """
+        if not data or not all(isinstance(df, pd.DataFrame) for df in data.values()):
+            raise ValueError('The provided dictionary must contain only pandas DataFrame objects')


End with a period.

fealho · 2023-08-08T17:12:52Z

sdv/metadata/multi_table.py

+
+        Args:
+            data (dict):
+                Dictionary of ``pandas.DataFrame`` objects where the keys are the table names and


You can simplify to `Dictionary of table names to dataframes.`

fealho · 2023-08-08T17:21:25Z

sdv/metadata/multi_table.py

+            folder_name (str):
+                Name of the folder to detect the metadata from.
+
+        Raises:


We never write raises in docstrings. I guess for consistency it would make sense not to add it?

fealho · 2023-08-08T17:23:05Z

sdv/metadata/multi_table.py

+        Raises:
+            ValueError: If no CSV files are detected in the folder.
+        """
+        csv_files = [filename for filename in os.listdir(folder_name) if filename.endswith('.csv')]


You could validate the folder name exists if you want, up to you.

fealho · 2023-08-08T17:24:21Z

sdv/metadata/multi_table.py

+            raise ValueError(f"No CSV files detected in the folder '{folder_name}'")
+
+        for filename in csv_files:
+            table_name = filename[:-4]  # Removing the .csv extension


Could do this directly in line 388.

I think this is fine here because I need both the filename and the table_name afterwards.

fealho · 2023-08-08T17:24:31Z

sdv/metadata/multi_table.py

+        csv_files = [filename for filename in os.listdir(folder_name) if filename.endswith('.csv')]
+
+        if not csv_files:
+            raise ValueError(f"No CSV files detected in the folder '{folder_name}'")


End with a period. Pretty much all our error messages do, so it makes sense to be consistent.

fealho · 2023-08-08T17:30:13Z

tests/unit/metadata/test_multi_table.py

+
+        # Run and Assert
+        expected_message = re.escape("No CSV files detected in the folder '{}'".format(tmp_path))
+


New line not necessary.
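
For context on why the test above wraps the expected message in `re.escape`: `pytest.raises(..., match=...)` treats the message as a regular expression and checks it with `re.search`, so any regex metacharacters in the folder path would change what is matched. The following is a minimal sketch using only the `re` module; the folder path is a hypothetical value chosen to contain parentheses.

```python
import re

folder = '/tmp/data (v1)'  # hypothetical folder name containing regex metacharacters
message = f"No CSV files detected in the folder '{folder}'"

# Unescaped, "(v1)" is parsed as a capture group, so the literal parentheses
# in the real message are never matched and the search fails.
unescaped_match = re.search(message, message)

# Escaped, every character is matched literally and the search succeeds.
escaped_match = re.search(re.escape(message), message)

print(unescaped_match is None)    # True
print(escaped_match is not None)  # True
```

Since `tmp_path` can contain such characters (or backslashes on Windows), escaping the whole message keeps the assertion independent of the path's contents.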

fealho · 2023-08-08T17:33:08Z

tests/integration/metadata/test_multi_table.py

+                }
+            }
+        },
+        'relationships': [],


We will add the logic to detect the relationships in another issue, correct?

R-Palazzo · 2023-08-10T10:35:09Z

Thanks for your review @fealho. I addressed the comments in b85bf4f.

pvk-developer · 2023-08-10T12:58:10Z

sdv/metadata/multi_table.py

+        if os.path.exists(folder_name) and os.path.isdir(folder_name):
+            csv_files = [
+                filename for filename in os.listdir(folder_name) if filename.endswith('.csv')
+            ]


Instead of using os.path, we can use pathlib, which is more powerful and will improve the code down the line.
For example:

Path().rglob(f'{folder_name}/*.csv')

will return an iterator over the paths of all the CSV files.

This is really good advice, thanks! Done in 4c37e19.
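
A minimal sketch of the pathlib approach discussed in this thread, not the code merged in this PR. The helper name `find_csv_files` and the "does not exist" message are assumptions made for illustration. It calls `glob('*.csv')` on the folder itself, so only the top level is searched; `rglob` would also descend into subfolders. Both return generators rather than lists, hence the `sorted(...)` call.

```python
from pathlib import Path


def find_csv_files(folder_name):
    """Return the CSV paths found directly inside ``folder_name`` (hypothetical helper)."""
    folder_path = Path(folder_name)
    if not folder_path.is_dir():
        # Assumed message, for illustration only.
        raise ValueError(f"The folder '{folder_name}' does not exist.")

    # glob() yields Path objects lazily; sorted() materializes them in a stable order.
    csv_files = sorted(folder_path.glob('*.csv'))
    if not csv_files:
        raise ValueError(f"No CSV files detected in the folder '{folder_name}'.")

    return csv_files
```

Each returned item is a full `Path`, so it can be passed straight to a CSV loader without joining it to the folder name again.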

pvk-developer · 2023-08-10T12:58:44Z

sdv/metadata/multi_table.py

+            raise ValueError(f"No CSV files detected in the folder '{folder_name}'.")
+
+        for filename in csv_files:
+            table_name = filename[:-4]  # Removing the .csv extension


If changed to pathlib, you can directly access the name of the file.

pvk-developer · 2023-08-10T12:59:14Z

sdv/metadata/multi_table.py

+
+        for filename in csv_files:
+            table_name = filename[:-4]  # Removing the .csv extension
+            csv_file = os.path.join(folder_name, filename)


Wouldn't need this since pathlib will have the full path already.
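
To illustrate the two pathlib comments above, here is a small sketch with a hypothetical path. It shows that `Path.stem` can stand in for the `filename[:-4]` slice and that the path object already carries its folder, so no `os.path.join` is needed.

```python
from pathlib import Path

# Hypothetical path, shaped like the items yielded by Path(folder_name).glob('*.csv').
csv_file = Path('data') / 'users.csv'

table_name = csv_file.stem  # 'users': file name without the final suffix
file_name = csv_file.name   # 'users.csv': file name including the suffix
folder = csv_file.parent    # Path('data'): the folder is part of the path already

print(table_name, file_name)
```

Unlike the fixed-width slice, `stem` does not depend on the extension being exactly four characters long.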

fealho

👍

pvk-developer

👍🏻 LGTM!

amontanez24 · 2023-08-10T23:56:47Z

tests/integration/metadata/test_multi_table.py

+
+    metadata = MultiTableMetadata()
+
+    with tempfile.TemporaryDirectory() as temp_dir:


Can we use the tmp_path fixture instead?

Yes, done in 8cf7f02

amontanez24 · 2023-08-10T23:59:02Z

sdv/metadata/multi_table.py

@@ -354,11 +370,32 @@ def detect_table_from_csv(self, table_name, filepath):
        """
        self._validate_table_not_detected(table_name)
        table = SingleTableMetadata()
-        data = table._load_data_from_csv(filepath)
-        table._detect_columns(data)
+        table.detect_from_csv(filepath)


I think there was a reason this was the way it was. Something to do with avoiding undesired printing maybe? Idk if we want to change it

I think we need this change because _load_data_from_csv no longer exists in SingleTableMetadata. I added an integration test in 922aad0 for detect_table_from_csv, which fails on the master branch.

It got moved here:

SDV/sdv/utils.py, line 168 in 387e9dd

def load_data_from_csv(filepath, pandas_kwargs=None):

I think we don't want to call detect_from_csv because if the error gets raised, it says:

SDV/sdv/metadata/single_table.py, lines 287 to 289 in 387e9dd

raise InvalidMetadataError(
    'Metadata already exists. Create a new ``SingleTableMetadata`` '
    'object to detect from other data sources.'

I see, thanks for the explanation, done in ef79cf8

amontanez24 · 2023-08-10T23:59:23Z

tests/unit/metadata/test_multi_table.py

@@ -1443,7 +1443,7 @@ def test_detect_table_from_csv(self, single_table_mock, log_mock):
        # Setup
        metadata = MultiTableMetadata()
        fake_data = Mock()
-        single_table_mock.return_value._load_data_from_csv.return_value = fake_data
+        single_table_mock.return_value.detect_from_csv.return_value = fake_data


I don't think we should change this

amontanez24

Thanks for addressing the comments! 🚢 📦

R-Palazzo requested review from fealho and pvk-developer August 7, 2023 14:22

R-Palazzo requested a review from a team as a code owner August 7, 2023 14:22

R-Palazzo removed the request for review from a team August 7, 2023 14:22

fealho reviewed Aug 8, 2023

R-Palazzo requested a review from amontanez24 August 9, 2023 17:17

pvk-developer requested changes Aug 10, 2023

fealho self-requested a review August 10, 2023 15:10

fealho approved these changes Aug 10, 2023

pvk-developer approved these changes Aug 10, 2023

amontanez24 reviewed Aug 11, 2023

amontanez24 approved these changes Aug 11, 2023

R-Palazzo added 6 commits August 14, 2023 09:28

define detection methods + tests

dcc4260

address comments

c18bc4a

use Pathlib

f5bec5e

modify test to use tmp_path

794926b

test detect_table_from_csv

b1cacea

use load_data_from_csv

ab8352e

R-Palazzo force-pushed the issue-1520-detect-from-multi-tables branch from ef79cf8 to ab8352e Compare August 14, 2023 08:34

R-Palazzo merged commit 72f7e1f into master Aug 14, 2023
37 checks passed

R-Palazzo deleted the issue-1520-detect-from-multi-tables branch August 14, 2023 09:34

amontanez24 mentioned this pull request Aug 14, 2023

AttributeError when detecting multi table metadata from a CSV #1507

Closed
