
Data/split task #15

Closed
wants to merge 4 commits into from

Conversation


@NitikaBahl commented Dec 14, 2024

Issue Number

Purpose

•	This PR modifies the data splitting functionality to handle multiple CSV files within a specified directory and ensures that empty files are skipped. It also improves logging and error handling.

Technical Details

•	The function split_multiple_files was introduced to handle multiple CSV files in the input directory (a minimal sketch follows this list).
•	The code now checks if a file is empty before attempting to split it.
•	The split datasets (train/test) are saved to a new split directory.
•	Added more detailed print statements for clarity and error handling.
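A minimal sketch of the flow described above, for orientation only: it assumes the function scans the directory for .csv files, skips empty ones, splits each file with scikit-learn's train_test_split, and writes the results to a split subdirectory. The actual module also reads files in chunks and downcasts dtypes via an optimize_memory helper (see the review comments below); those details are omitted here and the exact signature may differ.

import os
import pandas as pd
from sklearn.model_selection import train_test_split

def split_multiple_files(input_directory, train_ratio=0.8):
    """Split every CSV in input_directory into train/test files under a 'split' subdirectory."""
    output_dir = os.path.join(input_directory, "split")
    os.makedirs(output_dir, exist_ok=True)

    for file in sorted(os.listdir(input_directory)):
        if not file.endswith(".csv"):
            continue
        file_path = os.path.join(input_directory, file)

        # Empty files are skipped with a message instead of raising an error
        if os.stat(file_path).st_size == 0:
            print(f"Skipping empty file: {file}")
            continue

        try:
            data = pd.read_csv(file_path)
            train, test = train_test_split(data, test_size=1 - train_ratio, random_state=42)
            train.to_csv(os.path.join(output_dir, f"train_{file}"), index=False)
            test.to_csv(os.path.join(output_dir, f"test_{file}"), index=False)
            print(f"Processed {file} successfully.")
        except Exception as e:
            print(f"Error processing {file}: {e}")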

Testing Instructions

1.	Ensure the directory contains multiple .csv files (e.g., step1_1_storybook_learning_events.csv, step1_1_storybooks.csv).
2.	Run the script and verify that the data is split into train and test files for each CSV file.
3.	Check that empty files are skipped and a message is printed.
4.	Ensure that the split files are stored in the split directory within the input folder (an optional check script is sketched below).
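The following optional check scripts step 4; the file names come from the examples in step 1 and the split directory location from the files added in this PR (adjust paths if your layout differs):

import os

split_dir = os.path.join("step1_prepare", "split")  # location of the split output in this PR
expected = [
    "train_step1_1_storybooks.csv",
    "test_step1_1_storybooks.csv",
    "train_step1_1_storybook_learning_events.csv",
    "test_step1_1_storybook_learning_events.csv",
]
missing = [name for name in expected if not os.path.isfile(os.path.join(split_dir, name))]
print("All split files present." if not missing else f"Missing split files: {missing}")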

Screenshots

•	No UI changes have been made.

Summary by CodeRabbit

  • New Features

    • Added new dependencies: scikit-learn and tqdm.
    • Introduced functionality to split multiple CSV files into training and testing datasets, enhancing the data preparation steps.
  • Bug Fixes

    • Resolved an issue by activating the previously commented-out data splitting functionality.
  • Documentation

    • Updated requirements.txt to reflect new dependencies.

@NitikaBahl requested a review from a team as a code owner on December 14, 2024 at 13:35
@NitikaBahl closed this on Dec 14, 2024

coderabbitai bot commented Dec 14, 2024

Walkthrough

The pull request introduces a new module step1_3_split_data.py to handle data splitting functionality. The script provides a comprehensive solution for splitting multiple CSV files in a directory into training and testing datasets. It includes memory optimization techniques, error handling, and uses scikit-learn's train_test_split for dividing data. The run_all_steps.py script has been updated to activate this new data splitting step, and the requirements.txt file has been modified to include necessary dependencies like scikit-learn and tqdm.

Changes

File | Change Summary
requirements.txt | Added dependencies: scikit-learn and tqdm
run_all_steps.py | Uncommented and activated the data splitting step by calling split_multiple_files() (call shown below)
step1_prepare/step1_3_split_data.py | New file with functions split_multiple_files() and optimize_memory() for splitting CSV files into train/test sets
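For reference, a minimal sketch of the activated call in run_all_steps.py (the surrounding code is quoted in the review thread below):

# Step 1.3 Split Data
import step1_prepare.step1_3_split_data
step1_prepare.step1_3_split_data.split_multiple_files(input_directory='.')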

Sequence Diagram

sequenceDiagram
    participant Script as run_all_steps.py
    participant Splitter as step1_3_split_data.py
    participant Files as CSV Files
    
    Script->>Splitter: Call split_multiple_files('.')
    Splitter->>Files: Scan input directory
    Splitter->>Splitter: Optimize memory usage
    Splitter->>Splitter: Split files into train/test
    Splitter-->>Script: Save split datasets

Assessment against linked issues

Objective | Addressed | Explanation
Split preprocessed data into training and test sets [#12]



@coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
step1_prepare/step1_3_split_data.py (2)

25-25: Use the logging module instead of print statements for better logging control

Using the logging module provides better flexibility and control over log levels and formatting, which is more suitable for production code than using print statements.

Here's how you can modify the code:

+import logging
+
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

 def split_multiple_files(input_directory, train_ratio=0.8, chunk_size=50000):
     # Existing code...

             if os.stat(file_path).st_size == 0:
-                print(f"Skipping empty file: {file}")
+                logging.info(f"Skipping empty file: {file}")
                 pbar.update(1)
                 continue

             try:
                 # Existing code...
-                print(f"Processed {file} successfully.")
+                logging.info(f"Processed {file} successfully.")

             except Exception as e:
-                print(f"Error processing {file}: {e}")
+                logging.error(f"Error processing {file}: {e}")

 if __name__ == "__main__":
     try:
         # Existing code...
     except Exception as e:
-        print(f"Error: {e}")
+        logging.error(f"Error: {e}")

Also applies to: 41-41, 44-44, 60-60


43-45: Catch specific exceptions to improve error handling

Catching all exceptions with except Exception can mask unexpected errors and make debugging difficult. Consider catching specific exceptions that you anticipate during file processing, such as pd.errors.EmptyDataError, FileNotFoundError, or IOError.

             try:
                 # Existing code...
             except Exception as e:
-                print(f"Error processing {file}: {e}")
+                logging.error(f"Error processing {file}: {e}")
+                # Optionally, re-raise the exception if you want the program to halt on critical errors
+                # raise
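A hedged sketch of what catching those specific exceptions could look like; the wrapper name read_and_split_safely and the handle_chunk callback are hypothetical and stand in for the existing split-and-save logic:

import logging
import pandas as pd

def read_and_split_safely(file, file_path, chunk_size, handle_chunk):
    """Illustrative error handling around chunked CSV reading; handle_chunk does the split/save."""
    try:
        for chunk in pd.read_csv(file_path, chunksize=chunk_size):
            handle_chunk(chunk)
    except pd.errors.EmptyDataError:
        logging.warning(f"No columns to parse in {file}; skipping.")
    except FileNotFoundError:
        logging.error(f"File not found: {file_path}")
    except OSError as e:  # covers IOError and permission problems
        logging.error(f"Could not read {file}: {e}")
    except Exception as e:
        # Unexpected errors are logged and re-raised rather than silently swallowed
        logging.error(f"Unexpected error processing {file}: {e}")
        raise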
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9511e3 and 5401689.

⛔ Files ignored due to path filters (4)
  • step1_prepare/split/test_step1_1_storybook_learning_events.csv is excluded by !**/*.csv
  • step1_prepare/split/test_step1_1_storybooks.csv is excluded by !**/*.csv
  • step1_prepare/split/train_step1_1_storybook_learning_events.csv is excluded by !**/*.csv
  • step1_prepare/split/train_step1_1_storybooks.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
  • requirements.txt (1 hunks)
  • run_all_steps.py (1 hunks)
  • step1_prepare/step1_3_split_data.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • requirements.txt
🔇 Additional comments (1)
step1_prepare/step1_3_split_data.py (1)

48-54: Ensure data integrity when downcasting data types in optimize_memory

Downcasting numerical data types can lead to precision loss or overflow if the data contains large numbers or high-precision floats. Consider using pd.to_numeric with the downcast parameter to safely downcast numeric columns.

Here's how you can modify the function:

 def optimize_memory(data):
     """Convert columns to more memory-efficient types."""
-    for col in data.select_dtypes(include=['float64']).columns:
-        data[col] = data[col].astype('float32')
-    for col in data.select_dtypes(include=['int64']).columns:
-        data[col] = data[col].astype('int32')
+    for col in data.select_dtypes(include=['float', 'integer']).columns:
+        data[col] = pd.to_numeric(data[col], downcast='float' if data[col].dtype.kind == 'f' else 'integer')
     return data

This method is safer and allows pandas to choose the appropriate smaller dtype without risking data loss.
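As a quick, illustrative sanity check of the downcast, DataFrame.memory_usage can be compared before and after (assuming optimize_memory is importable from the new module; adjust the import path as needed):

import pandas as pd
from step1_prepare.step1_3_split_data import optimize_memory  # assumed import path

df = pd.DataFrame({"score": [0.1, 0.2, 0.3], "count": [1, 2, 3]})  # toy data for illustration
before = df.memory_usage(deep=True).sum()
after = optimize_memory(df.copy()).memory_usage(deep=True).sum()
print(f"Memory before: {before} bytes, after: {after} bytes")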

Comment on lines +58 to +60
split_multiple_files(input_directory='/Users/nitikabahl/story recemonder/ml-storybook-recommender/step1_prepare')
except Exception as e:
print(f"Error: {e}")


🛠️ Refactor suggestion

Avoid hardcoding absolute paths; use relative paths or command-line arguments

The input_directory is hardcoded with an absolute path specific to your local environment, which can cause issues when others run the script. Consider accepting the input_directory as a command-line argument.

Here's how you can modify the code:

+import argparse

 if __name__ == "__main__":
     try:
+        parser = argparse.ArgumentParser(description="Split multiple CSV files into training and testing datasets.")
+        parser.add_argument(
+            "--input_directory",
+            type=str,
+            default=".",
+            help="Path to the input directory containing CSV files.",
+        )
+        args = parser.parse_args()
+
-        split_multiple_files(input_directory='/Users/nitikabahl/story recemonder/ml-storybook-recommender/step1_prepare')
+        split_multiple_files(input_directory=args.input_directory)
     except Exception as e:
         logging.error(f"Error: {e}")

This change allows users to specify the input directory when running the script:

python step1_3_split_data.py --input_directory /path/to/input_directory

Comment on lines +31 to +40
train_data, test_data = [], []
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
chunk = optimize_memory(chunk)
train, test = train_test_split(chunk, test_size=1-train_ratio, random_state=42)
train_data.append(train)
test_data.append(test)

# Save splits
pd.concat(train_data).to_csv(os.path.join(output_dir, f"train_{file}"), index=False)
pd.concat(test_data).to_csv(os.path.join(output_dir, f"test_{file}"), index=False)


🛠️ Refactor suggestion

Optimize memory usage by writing chunks directly to CSV files

Accumulating all chunks in train_data and test_data lists may lead to high memory usage for large files. Writing each chunk directly to the CSV files in append mode can reduce memory consumption.

Here's how you can modify the code:

             try:
                 # Initialize data files
-                train_data, test_data = [], []
+                train_file_path = os.path.join(output_dir, f"train_{file}")
+                test_file_path = os.path.join(output_dir, f"test_{file}")
+                first_chunk = True

                 for chunk in pd.read_csv(file_path, chunksize=chunk_size):
                     chunk = optimize_memory(chunk)
                     train, test = train_test_split(
                         chunk, test_size=1 - train_ratio, random_state=42
                     )
-                    train_data.append(train)
-                    test_data.append(test)
+                    train.to_csv(train_file_path, mode='a', index=False, header=first_chunk)
+                    test.to_csv(test_file_path, mode='a', index=False, header=first_chunk)
+                    first_chunk = False

-                # Save splits
-                pd.concat(train_data).to_csv(os.path.join(output_dir, f"train_{file}"), index=False)
-                pd.concat(test_data).to_csv(os.path.join(output_dir, f"test_{file}"), index=False)
                 logging.info(f"Processed {file} successfully.")

This approach writes each chunk directly to the CSV files, minimizing the amount of data held in memory.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-                train_data, test_data = [], []
-                for chunk in pd.read_csv(file_path, chunksize=chunk_size):
-                    chunk = optimize_memory(chunk)
-                    train, test = train_test_split(chunk, test_size=1-train_ratio, random_state=42)
-                    train_data.append(train)
-                    test_data.append(test)
-
-                # Save splits
-                pd.concat(train_data).to_csv(os.path.join(output_dir, f"train_{file}"), index=False)
-                pd.concat(test_data).to_csv(os.path.join(output_dir, f"test_{file}"), index=False)
+                train_file_path = os.path.join(output_dir, f"train_{file}")
+                test_file_path = os.path.join(output_dir, f"test_{file}")
+                first_chunk = True
+                for chunk in pd.read_csv(file_path, chunksize=chunk_size):
+                    chunk = optimize_memory(chunk)
+                    train, test = train_test_split(chunk, test_size=1-train_ratio, random_state=42)
+                    train.to_csv(train_file_path, mode='a', index=False, header=first_chunk)
+                    test.to_csv(test_file_path, mode='a', index=False, header=first_chunk)
+                    first_chunk = False
+                logging.info(f"Processed {file} successfully.")

run_all_steps.py Outdated
Comment on lines +9 to +11
# Step 1.3 Split Data
import step1_prepare.step1_3_split_data
step1_prepare.step1_3_split_data.split_multiple_files(input_directory='.')


🛠️ Refactor suggestion

Avoid changing the working directory; adjust import statements instead

Changing the working directory with os.chdir('step1_prepare') can lead to confusion and issues with relative paths and module imports. It's better to keep the working directory consistent and adjust your imports or file paths accordingly.

Here's how you can modify the code:

 import os

 print('\n*** Step 1. Prepare Data 🌏 ***')
-os.chdir('step1_prepare')
-print(os.path.basename(__file__), f'os.getcwd(): {os.getcwd()}')
-import step1_prepare.step1_1_download_data
+# Adjust the import statements without changing directories
+import step1_prepare.step1_1_download_data
 #import step1_prepare.step1_2_preprocess_data

 # Step 1.3 Split Data
 import step1_prepare.step1_3_split_data
-step1_prepare.step1_3_split_data.split_multiple_files(input_directory='.')
+step1_prepare.step1_3_split_data.split_multiple_files(input_directory='step1_prepare')

By specifying the correct input_directory, you avoid changing directories and maintain clearer code structure.

Committable suggestion skipped: line range outside the PR's diff.

Development

Successfully merging this pull request may close these issues: Split data