closes #12 #14

Closed
wants to merge 3 commits into from
Changes from 2 commits
1 change: 1 addition & 0 deletions requirements.txt
@@ -1 +1,2 @@
pandas==2.2.3
scikit-learn
6 changes: 5 additions & 1 deletion run_all_steps.py
@@ -5,7 +5,11 @@
print(os.path.basename(__file__), f'os.getcwd(): {os.getcwd()}')
import step1_prepare.step1_1_download_data
#import step1_prepare.step1_2_preprocess_data
#import step1_prepare.step1_3_split_data

# Step 1.3 Split Data
import step1_prepare.step1_3_split_data
step1_prepare.step1_3_split_data.split_multiple_files(input_directory='/Users/nitikabahl/story recemonder/ml-storybook-recommender/step1_prepare/')

⚠️ Potential issue

Replace hardcoded absolute path with relative path

The hardcoded absolute path is a problem because it:

  1. Contains spaces, which break unquoted shell usage and some tooling
  2. Exists only on the author's local machine
  3. Will fail on CI and in every other contributor's checkout

Consider using a relative path instead:

-step1_prepare.step1_3_split_data.split_multiple_files(input_directory='/Users/nitikabahl/story recemonder/ml-storybook-recommender/step1_prepare/')
+step1_prepare.step1_3_split_data.split_multiple_files(input_directory='.')
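One caveat with '.': a relative path is resolved against the current working directory, not against the script's location, so the result depends on where run_all_steps.py is launched from. A minimal sketch of anchoring the path to the file instead, assuming the CSV files live in a step1_prepare folder next to the script. The helper name resolve_data_dir is hypothetical and not part of this PR.

```python
from pathlib import Path


def resolve_data_dir(script_file: str, subdir: str = "step1_prepare") -> Path:
    """Return the absolute path of `subdir` located next to `script_file`.

    Hypothetical helper: the PR does not define it. The default subdir
    name mirrors the package name imported in run_all_steps.py.
    """
    return Path(script_file).resolve().parent / subdir


# In run_all_steps.py this would be called as resolve_data_dir(__file__).
# A made-up script path is used here so the sketch runs on its own.
example = resolve_data_dir("/tmp/project/run_all_steps.py")
print(example)
```

The returned path no longer depends on the directory the interpreter was started from, and it contains nothing specific to one machine.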



print('\n*** Step 2. Train Model 🌏🚀 ***')
#os.chdir('../step2_train')
15 changes: 15 additions & 0 deletions step1_prepare/split/test_step1_1_storybook_learning_events.csv
@@ -0,0 +1,15 @@
id,timestamp,android_id,package_name,storybook_id,storybook_title,learning_event_type
233,1624130540000,f94ac8506e31b8d2,ai.elimu.vitabu,29.0,"आज, मैं हूँ...",STORYBOOK_OPENED
203,1603749678000,f94ac8506e31b8d2,ai.elimu.vitabu,61.0,कुत्ते के अंडे,STORYBOOK_OPENED
191,1593337895000,f94ac8506e31b8d2,ai.elimu.vitabu,15.0,उड़ने वाला ऑटो,STORYBOOK_OPENED
196,1593919393000,f94ac8506e31b8d2,ai.elimu.vitabu,15.0,उड़ने वाला ऑटो,STORYBOOK_OPENED
215,1607940607000,f94ac8506e31b8d2,ai.elimu.vitabu,51.0,घूम-घूम घड़ियाल का अनोखा सफ़र,STORYBOOK_OPENED
228,1607962595000,f94ac8506e31b8d2,ai.elimu.vitabu,11.0,गप्पू नाच नहीं सकती,STORYBOOK_OPENED
245,1629721527000,f94ac8506e31b8d2,ai.elimu.vitabu,30.0,मलार का बड़ा सा घर,STORYBOOK_OPENED
192,1593337850000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
248,1630075716000,f94ac8506e31b8d2,ai.elimu.vitabu,37.0,अद्भुत कीड़े,STORYBOOK_OPENED
199,1599582905000,f94ac8506e31b8d2,ai.elimu.vitabu,5.0,बनबिलाव! बनबिलाव!,STORYBOOK_OPENED
212,1606683792000,f94ac8506e31b8d2,ai.elimu.vitabu,37.0,अद्भुत कीड़े,STORYBOOK_OPENED
252,1632391559000,467ab5528a9f4f82,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
234,1624130386000,f94ac8506e31b8d2,ai.elimu.vitabu,29.0,"आज, मैं हूँ...",STORYBOOK_OPENED
187,1593338196000,f94ac8506e31b8d2,ai.elimu.vitabu,38.0,हमारे मित्र कौन है?,STORYBOOK_OPENED
13 changes: 13 additions & 0 deletions step1_prepare/split/test_step1_1_storybooks.csv

Large diffs are not rendered by default.

55 changes: 55 additions & 0 deletions step1_prepare/split/train_step1_1_storybook_learning_events.csv
@@ -0,0 +1,55 @@
id,timestamp,android_id,package_name,storybook_id,storybook_title,learning_event_type
241,1624776061000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
242,1624775819000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
236,1624130258000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
194,1593526257000,f94ac8506e31b8d2,ai.elimu.vitabu,40.0,सूरज का दोस्त कौन ?,STORYBOOK_OPENED
229,1608766889000,f94ac8506e31b8d2,ai.elimu.vitabu,63.0,आनंद,STORYBOOK_OPENED
218,1607940497000,f94ac8506e31b8d2,ai.elimu.vitabu,30.0,मलार का बड़ा सा घर,STORYBOOK_OPENED
223,1607940370000,f94ac8506e31b8d2,ai.elimu.vitabu,53.0,एक सौ सैंतीसवाँ पैर,STORYBOOK_OPENED
206,1603899537000,f94ac8506e31b8d2,ai.elimu.vitabu,,रमाइलो दिन,STORYBOOK_OPENED
232,1624130566000,f94ac8506e31b8d2,ai.elimu.vitabu,23.0,स्वतंत्रता की ओर,STORYBOOK_OPENED
220,1607940430000,f94ac8506e31b8d2,ai.elimu.vitabu,23.0,स्वतंत्रता की ओर,STORYBOOK_OPENED
235,1624130356000,f94ac8506e31b8d2,ai.elimu.vitabu,29.0,"आज, मैं हूँ...",STORYBOOK_OPENED
217,1607940532000,f94ac8506e31b8d2,ai.elimu.vitabu,40.0,सूरज का दोस्त कौन ?,STORYBOOK_OPENED
200,1599582466000,f94ac8506e31b8d2,ai.elimu.vitabu,2.0,आलू-मालू-कालू,STORYBOOK_OPENED
250,1630074219000,f94ac8506e31b8d2,ai.elimu.vitabu,66.0,"एक सफ़र, एक खेल",STORYBOOK_OPENED
227,1607962611000,f94ac8506e31b8d2,ai.elimu.vitabu,41.0,राजू की पहली हवाई-यात्रा,STORYBOOK_OPENED
190,1593337900000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
204,1603900046000,f94ac8506e31b8d2,ai.elimu.vitabu,11.0,गप्पू नाच नहीं सकती,STORYBOOK_OPENED
221,1607940404000,f94ac8506e31b8d2,ai.elimu.vitabu,41.0,राजू की पहली हवाई-यात्रा,STORYBOOK_OPENED
195,1593526240000,f94ac8506e31b8d2,ai.elimu.vitabu,41.0,राजू की पहली हवाई-यात्रा,STORYBOOK_OPENED
231,1624130631000,f94ac8506e31b8d2,ai.elimu.vitabu,23.0,स्वतंत्रता की ओर,STORYBOOK_OPENED
193,1593526278000,f94ac8506e31b8d2,ai.elimu.vitabu,39.0,बंटी और उसके गाते हुए पक्षी,STORYBOOK_OPENED
243,1629290144000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
253,1642773664000,f94ac8506e31b8d2,ai.elimu.vitabu,5.0,बनबिलाव! बनबिलाव!,STORYBOOK_OPENED
202,1602350226000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
214,1607940651000,f94ac8506e31b8d2,ai.elimu.vitabu,27.0,तारा की गगनचुंबी यात्रा,STORYBOOK_OPENED
213,1607940723000,f94ac8506e31b8d2,ai.elimu.vitabu,27.0,तारा की गगनचुंबी यात्रा,STORYBOOK_OPENED
211,1606683798000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
254,1642773408000,f94ac8506e31b8d2,ai.elimu.vitabu,30.0,मलार का बड़ा सा घर,STORYBOOK_OPENED
198,1593940575000,e142205d609d6032,ai.elimu.vitabu,49.0,लाल बरसाती,STORYBOOK_OPENED
219,1607940459000,f94ac8506e31b8d2,ai.elimu.vitabu,5.0,बनबिलाव! बनबिलाव!,STORYBOOK_OPENED
251,1629722157000,467ab5528a9f4f82,ai.elimu.vitabu,30.0,मलार का बड़ा सा घर,STORYBOOK_OPENED
237,1624130213000,f94ac8506e31b8d2,ai.elimu.vitabu,37.0,अद्भुत कीड़े,STORYBOOK_OPENED
224,1607940339000,f94ac8506e31b8d2,ai.elimu.vitabu,11.0,गप्पू नाच नहीं सकती,STORYBOOK_OPENED
216,1607940577000,f94ac8506e31b8d2,ai.elimu.vitabu,55.0,जादुर्इ गुटका,STORYBOOK_OPENED
230,1617200389000,f94ac8506e31b8d2,ai.elimu.vitabu,29.0,"आज, मैं हूँ...",STORYBOOK_OPENED
240,1624776442000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
188,1593338164000,f94ac8506e31b8d2,ai.elimu.vitabu,41.0,राजू की पहली हवाई-यात्रा,STORYBOOK_OPENED
208,1603896889000,f94ac8506e31b8d2,ai.elimu.vitabu,,रमाइलो दिन,STORYBOOK_OPENED
189,1593337905000,f94ac8506e31b8d2,ai.elimu.vitabu,10.0,रिमझिम बरसे बादल,STORYBOOK_OPENED
249,1630074282000,f94ac8506e31b8d2,ai.elimu.vitabu,66.0,"एक सफ़र, एक खेल",STORYBOOK_OPENED
226,1607962630000,f94ac8506e31b8d2,ai.elimu.vitabu,22.0,मुत्तज्जी की उम्र क्या है?,STORYBOOK_OPENED
222,1607940383000,f94ac8506e31b8d2,ai.elimu.vitabu,15.0,उड़ने वाला ऑटो,STORYBOOK_OPENED
239,1624776446000,f94ac8506e31b8d2,ai.elimu.vitabu,37.0,अद्भुत कीड़े,STORYBOOK_OPENED
210,1606683808000,f94ac8506e31b8d2,ai.elimu.vitabu,2.0,आलू-मालू-कालू,STORYBOOK_OPENED
246,1629721478000,f94ac8506e31b8d2,ai.elimu.vitabu,30.0,मलार का बड़ा सा घर,STORYBOOK_OPENED
197,1593940582000,e142205d609d6032,ai.elimu.vitabu,48.0,ग़ोलू एक ग़ोल कि कहानी,STORYBOOK_OPENED
209,1606061122000,f94ac8506e31b8d2,ai.elimu.vitabu,29.0,"आज, मैं हूँ...",STORYBOOK_OPENED
205,1603899555000,f94ac8506e31b8d2,ai.elimu.vitabu,,रमाइलो दिन,STORYBOOK_OPENED
244,1629290127000,f94ac8506e31b8d2,ai.elimu.vitabu,37.0,अद्भुत कीड़े,STORYBOOK_OPENED
225,1607940270000,f94ac8506e31b8d2,ai.elimu.vitabu,29.0,"आज, मैं हूँ...",STORYBOOK_OPENED
207,1603897578000,f94ac8506e31b8d2,ai.elimu.vitabu,,रमाइलो दिन,STORYBOOK_OPENED
247,1630075738000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
201,1599581599000,f94ac8506e31b8d2,ai.elimu.vitabu,1.0,"अभी नहीं, अभी नहीं!",STORYBOOK_OPENED
238,1624776499000,f94ac8506e31b8d2,ai.elimu.vitabu,37.0,अद्भुत कीड़े,STORYBOOK_OPENED
47 changes: 47 additions & 0 deletions step1_prepare/split/train_step1_1_storybooks.csv

Large diffs are not rendered by default.

46 changes: 46 additions & 0 deletions step1_prepare/step1_3_split_data.py
@@ -0,0 +1,46 @@
import os
import pandas as pd
from sklearn.model_selection import train_test_split

def split_multiple_files(input_directory, train_ratio=0.8):
    # Get list of all CSV files in the directory
    csv_files = [f for f in os.listdir(input_directory) if f.endswith('.csv')]

    # Check if any CSV files were found
    if not csv_files:
        print("No CSV files found in the directory.")
        return

⚠️ Potential issue

Enhance input validation and error handling

split_multiple_files passes input_directory straight to os.listdir without validating it. Suggested additions:

  1. Directory existence check
  2. Permission validation
  3. Case-insensitive CSV extension matching

 def split_multiple_files(input_directory: str, train_ratio: float = 0.8) -> None:
+    # Validate directory
+    if not os.path.isdir(input_directory):
+        raise ValueError(f"Directory not found: {input_directory}")
+
     # Get list of all CSV files in the directory
-    csv_files = [f for f in os.listdir(input_directory) if f.endswith('.csv')]
+    try:
+        csv_files = [f for f in os.listdir(input_directory) 
+                    if f.lower().endswith('.csv')]
+    except PermissionError:
+        raise PermissionError(f"Permission denied accessing: {input_directory}")
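The three checks can also be pulled into a small helper that is easy to test in isolation. This is a sketch only: the name list_csv_files is hypothetical, and the extra os.path.isfile filter (which skips a directory whose name happens to end in .csv) is an addition not present in the suggestion above.

```python
import os
import tempfile


def list_csv_files(input_directory: str) -> list[str]:
    """Return the sorted names of CSV files directly inside input_directory."""
    if not os.path.isdir(input_directory):
        raise ValueError(f"Directory not found: {input_directory}")
    # os.listdir raises PermissionError by itself if the directory is unreadable.
    return sorted(
        name
        for name in os.listdir(input_directory)
        if name.lower().endswith(".csv")
        and os.path.isfile(os.path.join(input_directory, name))
    )


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "B.CSV", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "archive.csv"))  # a directory, must be ignored
    found = list_csv_files(tmp)
    print(found)
```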


    # Iterate through all files and split them
    for file in csv_files:
        input_file = os.path.join(input_directory, file)

        # Load dataset
        print(f"Loading data from {input_file}...")
        data = pd.read_csv(input_file)

        # Check if the dataset is empty
        if data.empty:
            print(f"Warning: {file} is empty. Skipping...")
            continue

        # Split data (round() avoids float truncation: int((1 - 0.8) * 100) is 19, not 20)
        print(f"Splitting data into {round(train_ratio*100)}% train and {round((1-train_ratio)*100)}% test sets.")
        train_data, test_data = train_test_split(data, test_size=(1 - train_ratio), random_state=42)

        # Save splits
        output_dir = os.path.join(input_directory, 'split')
        os.makedirs(output_dir, exist_ok=True)

        train_output = os.path.join(output_dir, f"train_{file}")
        test_output = os.path.join(output_dir, f"test_{file}")

        # Save the split datasets to CSV
        train_data.to_csv(train_output, index=False)
        test_data.to_csv(test_output, index=False)

        print(f"Data from {file} split and saved successfully.")

🛠️ Refactor suggestion

Improve error handling and memory efficiency

The main processing loop needs several improvements:

  1. Error handling for file operations
  2. Memory optimization for large files
  3. Progress tracking for multiple files
+    total_files = len(csv_files)
+    for idx, file in enumerate(csv_files, 1):
         input_file = os.path.join(input_directory, file)
         
         # Load dataset
-        print(f"Loading data from {input_file}...")
+        print(f"Processing file {idx}/{total_files}: {file}")
+        try:
             data = pd.read_csv(input_file)
+        except Exception as e:
+            print(f"Error reading {file}: {str(e)}")
+            continue
         
         # Check if the dataset is empty
         if data.empty:
             print(f"Warning: {file} is empty. Skipping...")
             continue
         
+        # Validate data structure
+        if len(data.columns) == 0:
+            print(f"Warning: {file} has no columns. Skipping...")
+            continue
+
         # Split data
         print(f"Splitting data into {round(train_ratio*100)}% train and {round((1-train_ratio)*100)}% test sets.")
-        train_data, test_data = train_test_split(data, test_size=(1 - train_ratio), random_state=42)
+        try:
+            # Process in chunks for large files
+            chunk_size = 100000  # Adjust based on available memory
+            if len(data) > chunk_size:
+                train_chunks = []
+                test_chunks = []
+                for chunk in pd.read_csv(input_file, chunksize=chunk_size):
+                    train_chunk, test_chunk = train_test_split(
+                        chunk, test_size=(1 - train_ratio), random_state=42
+                    )
+                    train_chunks.append(train_chunk)
+                    test_chunks.append(test_chunk)
+                train_data = pd.concat(train_chunks)
+                test_data = pd.concat(test_chunks)
+            else:
+                train_data, test_data = train_test_split(
+                    data, test_size=(1 - train_ratio), random_state=42
+                )
+        except Exception as e:
+            print(f"Error splitting {file}: {str(e)}")
+            continue

Committable suggestion skipped: line range outside the PR's diff.
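Note that in the suggestion above, pd.read_csv(input_file) has already loaded the whole file before the len(data) > chunk_size check runs, so re-reading it in chunks and concatenating the pieces does not lower peak memory. A sketch of a variant that streams each chunk straight to the output files follows. It swaps in a different technique: every row is assigned to the train set with probability train_ratio, so the final ratio is approximate rather than exact. The function name stream_split_csv and its parameters are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def stream_split_csv(input_file, train_output, test_output,
                     train_ratio=0.8, chunk_size=100_000, seed=42):
    """Split a CSV into train/test files without loading it whole.

    Rows are assigned independently at random, so the train share is
    only approximately train_ratio. Returns (train_rows, test_rows).
    """
    rng = np.random.default_rng(seed)
    counts = [0, 0]
    first = True
    for chunk in pd.read_csv(input_file, chunksize=chunk_size):
        to_train = rng.random(len(chunk)) < train_ratio
        for part, path, i in ((chunk[to_train], train_output, 0),
                              (chunk[~to_train], test_output, 1)):
            # Write the header with the first chunk, then append.
            part.to_csv(path, mode="w" if first else "a",
                        header=first, index=False)
            counts[i] += len(part)
        first = False
    return tuple(counts)


# Demonstration on a generated 1000-row file, read 100 rows at a time.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "events.csv")
    pd.DataFrame({"id": range(1000)}).to_csv(src, index=False)
    train_path = os.path.join(tmp, "train.csv")
    test_path = os.path.join(tmp, "test.csv")
    n_train, n_test = stream_split_csv(src, train_path, test_path, chunk_size=100)
    train_back = pd.read_csv(train_path)
    test_back = pd.read_csv(test_path)
    print(n_train, n_test)
```

Only one chunk is held in memory at a time. If an exact 80/20 split is required, this approach is not a drop-in replacement for train_test_split.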


if __name__ == "__main__":
    # Example usage for splitting multiple files in a directory
    split_multiple_files(input_directory='/Users/nitikabahl/story recemonder/ml-storybook-recommender/step1_prepare/')

⚠️ Potential issue

Remove hardcoded path from example usage

The example usage contains the same hardcoded path issue as in run_all_steps.py.

 if __name__ == "__main__":
     # Example usage for splitting multiple files in a directory
-    split_multiple_files(input_directory='/Users/nitikabahl/story recemonder/ml-storybook-recommender/step1_prepare/')
+    split_multiple_files(input_directory='.')
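Another option is to take the directory from the command line so that no path is baked into the module at all. A sketch using argparse, where the positional input_directory argument and the --train-ratio flag are hypothetical names chosen for illustration:

```python
import argparse


def parse_args(argv=None):
    """Parse CLI arguments for the split step (argument names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Split every CSV file in a directory into train/test sets."
    )
    parser.add_argument(
        "input_directory",
        nargs="?",
        default=".",
        help="directory containing the CSV files (default: current directory)",
    )
    parser.add_argument(
        "--train-ratio",
        type=float,
        default=0.8,
        help="fraction of rows written to the train set (default: 0.8)",
    )
    return parser.parse_args(argv)


# In step1_3_split_data.py the __main__ block would then read:
#     args = parse_args()
#     split_multiple_files(args.input_directory, args.train_ratio)
args = parse_args(["step1_prepare", "--train-ratio", "0.7"])
print(args.input_directory, args.train_ratio)
```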
