Data/split task #15
Conversation
Walkthrough: The pull request introduces a new module, `step1_prepare/step1_3_split_data.py`, which splits the prepared CSV files into training and testing datasets, and wires the new step into `run_all_steps.py`.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Script as run_all_steps.py
    participant Splitter as step1_3_split_data.py
    participant Files as CSV Files
    Script->>Splitter: Call split_multiple_files('.')
    Splitter->>Files: Scan input directory
    Splitter->>Splitter: Optimize memory usage
    Splitter->>Splitter: Split files into train/test
    Splitter-->>Script: Save split datasets
```
Assessment against linked issues
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (2)
step1_prepare/step1_3_split_data.py (2)
`25`: Use the `logging` module instead of `print` statements

Using the `logging` module provides better flexibility and control over log levels and formatting, which is more suitable for production code than using `print`. Here's how you can modify the code:

```diff
+import logging
+
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+
 def split_multiple_files(input_directory, train_ratio=0.8, chunk_size=50000):
     # Existing code...
     if os.stat(file_path).st_size == 0:
-        print(f"Skipping empty file: {file}")
+        logging.info(f"Skipping empty file: {file}")
         pbar.update(1)
         continue
     try:
         # Existing code...
-        print(f"Processed {file} successfully.")
+        logging.info(f"Processed {file} successfully.")
     except Exception as e:
-        print(f"Error processing {file}: {e}")
+        logging.error(f"Error processing {file}: {e}")

 if __name__ == "__main__":
     try:
         # Existing code...
     except Exception as e:
-        print(f"Error: {e}")
+        logging.error(f"Error: {e}")
```

Also applies to: 41-41, 44-44, 60-60
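A related practical point (my addition, not part of the review): the script drives a tqdm progress bar (`pbar`), and log lines written straight to stderr can interleave with the bar. Assuming a reasonably recent tqdm that ships `tqdm.contrib.logging`, a minimal sketch of keeping the two tidy:

```python
import logging

from tqdm import tqdm
from tqdm.contrib.logging import logging_redirect_tqdm

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

files = ["step1_1_storybooks.csv", "step1_1_storybook_learning_events.csv"]  # example file names
with logging_redirect_tqdm():  # route log records through tqdm so the bar is not broken
    for file in tqdm(files, desc="Splitting files"):
        logging.info(f"Processed {file} successfully.")
```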
`43-45`: Catch specific exceptions to improve error handling

Catching all exceptions with `except Exception` can mask unexpected errors and make debugging difficult. Consider catching specific exceptions that you anticipate during file processing, such as `pd.errors.EmptyDataError`, `FileNotFoundError`, or `IOError`.

```diff
 try:
     # Existing code...
 except Exception as e:
-    print(f"Error processing {file}: {e}")
+    logging.error(f"Error processing {file}: {e}")
+    # Optionally, re-raise the exception if you want the program to halt on critical errors
+    # raise
```
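For illustration only, here is a minimal sketch of what handling those specific exceptions could look like around the per-file read loop; `process_file` and the chunk size are made up, while `file_path`/`file` follow the naming used in this review:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

def process_file(file_path, file):
    """Handle the anticipated failure modes separately instead of a bare `except Exception`."""
    try:
        for chunk in pd.read_csv(file_path, chunksize=50000):
            ...  # split and save the chunk (see the suggestions below)
    except pd.errors.EmptyDataError:
        logging.warning(f"{file} has no parseable rows, skipping.")
    except FileNotFoundError:
        logging.error(f"{file} disappeared before it could be read.")
    except OSError as e:  # IOError is an alias of OSError in Python 3
        logging.error(f"I/O error while processing {file}: {e}")
```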
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
- `step1_prepare/split/test_step1_1_storybook_learning_events.csv` is excluded by `!**/*.csv`
- `step1_prepare/split/test_step1_1_storybooks.csv` is excluded by `!**/*.csv`
- `step1_prepare/split/train_step1_1_storybook_learning_events.csv` is excluded by `!**/*.csv`
- `step1_prepare/split/train_step1_1_storybooks.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (3)
- `requirements.txt` (1 hunks)
- `run_all_steps.py` (1 hunks)
- `step1_prepare/step1_3_split_data.py` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- requirements.txt
🔇 Additional comments (1)
step1_prepare/step1_3_split_data.py (1)
`48-54`: Ensure data integrity when downcasting data types in `optimize_memory`

Downcasting numerical data types can lead to precision loss or overflow if the data contains large numbers or high-precision floats. Consider using `pd.to_numeric` with the `downcast` parameter to safely downcast numeric columns.
Here's how you can modify the function:
```diff
 def optimize_memory(data):
     """Convert columns to more memory-efficient types."""
-    for col in data.select_dtypes(include=['float64']).columns:
-        data[col] = data[col].astype('float32')
-    for col in data.select_dtypes(include=['int64']).columns:
-        data[col] = data[col].astype('int32')
+    for col in data.select_dtypes(include=['float', 'integer']).columns:
+        data[col] = pd.to_numeric(data[col], downcast='float' if data[col].dtype.kind == 'f' else 'integer')
     return data
```
This method is safer and allows pandas to choose the appropriate smaller dtype without risking data loss.
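As a quick illustration of the effect (my addition, with made-up column names), the snippet below compares memory usage before and after the suggested `pd.to_numeric` downcasting:

```python
import numpy as np
import pandas as pd

def optimize_memory(data):
    """Downcast numeric columns, letting pandas pick the smallest safe dtype."""
    for col in data.select_dtypes(include=['float', 'integer']).columns:
        data[col] = pd.to_numeric(
            data[col], downcast='float' if data[col].dtype.kind == 'f' else 'integer'
        )
    return data

df = pd.DataFrame({
    'reading_time': np.random.rand(100_000) * 60,         # float64 -> float32
    'storybook_id': np.random.randint(0, 500, 100_000),    # int64 -> int16
})
before = df.memory_usage(deep=True).sum()
after = optimize_memory(df).memory_usage(deep=True).sum()
print(f"{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")  # noticeably smaller numeric footprint
```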
```python
        split_multiple_files(input_directory='/Users/nitikabahl/story recemonder/ml-storybook-recommender/step1_prepare')
    except Exception as e:
        print(f"Error: {e}")
```
🛠️ Refactor suggestion
Avoid hardcoding absolute paths; use relative paths or command-line arguments
The `input_directory` is hardcoded with an absolute path specific to your local environment, which can cause issues when others run the script. Consider accepting the `input_directory` as a command-line argument.
Here's how you can modify the code:
```diff
+import argparse
 if __name__ == "__main__":
     try:
+        parser = argparse.ArgumentParser(description="Split multiple CSV files into training and testing datasets.")
+        parser.add_argument(
+            "--input_directory",
+            type=str,
+            default=".",
+            help="Path to the input directory containing CSV files.",
+        )
+        args = parser.parse_args()
+
-        split_multiple_files(input_directory='/Users/nitikabahl/story recemonder/ml-storybook-recommender/step1_prepare')
+        split_multiple_files(input_directory=args.input_directory)
     except Exception as e:
         logging.error(f"Error: {e}")
```
This change allows users to specify the input directory when running the script:
```bash
python step1_3_split_data.py --input_directory /path/to/input_directory
```
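Optionally, and going a step beyond the suggestion above, the parsed directory could be validated before the split runs; a minimal sketch (the error message wording is mine):

```python
import argparse
import logging
import os

parser = argparse.ArgumentParser(description="Split multiple CSV files into training and testing datasets.")
parser.add_argument("--input_directory", type=str, default=".",
                    help="Path to the input directory containing CSV files.")
args = parser.parse_args()

# Fail fast with a clear message instead of an opaque error further down the pipeline.
if not os.path.isdir(args.input_directory):
    logging.error(f"Input directory does not exist: {args.input_directory}")
    raise SystemExit(1)
```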
```python
            train_data, test_data = [], []
            for chunk in pd.read_csv(file_path, chunksize=chunk_size):
                chunk = optimize_memory(chunk)
                train, test = train_test_split(chunk, test_size=1-train_ratio, random_state=42)
                train_data.append(train)
                test_data.append(test)

            # Save splits
            pd.concat(train_data).to_csv(os.path.join(output_dir, f"train_{file}"), index=False)
            pd.concat(test_data).to_csv(os.path.join(output_dir, f"test_{file}"), index=False)
```
🛠️ Refactor suggestion
Optimize memory usage by writing chunks directly to CSV files
Accumulating all chunks in the `train_data` and `test_data` lists may lead to high memory usage for large files. Writing each chunk directly to the CSV files in append mode can reduce memory consumption.
Here's how you can modify the code:
```diff
     try:
         # Initialize data files
-        train_data, test_data = [], []
+        train_file_path = os.path.join(output_dir, f"train_{file}")
+        test_file_path = os.path.join(output_dir, f"test_{file}")
+        first_chunk = True
         for chunk in pd.read_csv(file_path, chunksize=chunk_size):
             chunk = optimize_memory(chunk)
             train, test = train_test_split(
                 chunk, test_size=1 - train_ratio, random_state=42
             )
-            train_data.append(train)
-            test_data.append(test)
+            train.to_csv(train_file_path, mode='a', index=False, header=first_chunk)
+            test.to_csv(test_file_path, mode='a', index=False, header=first_chunk)
+            first_chunk = False
-        # Save splits
-        pd.concat(train_data).to_csv(os.path.join(output_dir, f"train_{file}"), index=False)
-        pd.concat(test_data).to_csv(os.path.join(output_dir, f"test_{file}"), index=False)
         logging.info(f"Processed {file} successfully.")
```
This approach writes each chunk directly to the CSV files, minimizing the amount of data held in memory.
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
            train_file_path = os.path.join(output_dir, f"train_{file}")
            test_file_path = os.path.join(output_dir, f"test_{file}")
            first_chunk = True
            for chunk in pd.read_csv(file_path, chunksize=chunk_size):
                chunk = optimize_memory(chunk)
                train, test = train_test_split(chunk, test_size=1-train_ratio, random_state=42)
                train.to_csv(train_file_path, mode='a', index=False, header=first_chunk)
                test.to_csv(test_file_path, mode='a', index=False, header=first_chunk)
                first_chunk = False
            logging.info(f"Processed {file} successfully.")
```
run_all_steps.py (Outdated)
```python
# Step 1.3 Split Data
import step1_prepare.step1_3_split_data
step1_prepare.step1_3_split_data.split_multiple_files(input_directory='.')
```
🛠️ Refactor suggestion
Avoid changing the working directory; adjust import statements instead
Changing the working directory with `os.chdir('step1_prepare')` can lead to confusion and issues with relative paths and module imports. It's better to keep the working directory consistent and adjust your imports or file paths accordingly.
Here's how you can modify the code:
```diff
 import os
 print('\n*** Step 1. Prepare Data 🌏 ***')
-os.chdir('step1_prepare')
-print(os.path.basename(__file__), f'os.getcwd(): {os.getcwd()}')
-import step1_prepare.step1_1_download_data
+# Adjust the import statements without changing directories
+import step1_prepare.step1_1_download_data
 #import step1_prepare.step1_2_preprocess_data

 # Step 1.3 Split Data
 import step1_prepare.step1_3_split_data
-step1_prepare.step1_3_split_data.split_multiple_files(input_directory='.')
+step1_prepare.step1_3_split_data.split_multiple_files(input_directory='step1_prepare')
```
By specifying the correct `input_directory`, you avoid changing directories and maintain a clearer code structure.
Committable suggestion skipped: line range outside the PR's diff.
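For illustration only (this goes a step beyond the suggestion above), resolving the data directory relative to the script file itself keeps the call independent of wherever the script is launched from; `step1_prepare` is the existing package directory, the rest is assumed:

```python
from pathlib import Path

import step1_prepare.step1_3_split_data as split_data

# Resolve the data directory relative to this file, not the current working directory.
DATA_DIR = Path(__file__).resolve().parent / "step1_prepare"

split_data.split_multiple_files(input_directory=str(DATA_DIR))
```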
Issue Number
Purpose
Technical Details
Testing Instructions
Screenshots
Summary by CodeRabbit

New Features
- Added a data splitting step that divides the prepared CSV files into training and testing datasets.
- Added `scikit-learn` and `tqdm` as dependencies.

Bug Fixes

Documentation
- Updated `requirements.txt` to reflect new dependencies.