This repository gives an overview of the methods used by Team PowerPuff Girls in the Amazon Machine Learning Challenge 2023.
Note: This is not a comprehensive solution, but an assortment of the key model code that our team implemented.
Team Name : PowerPuff Girls
Team members :
- Akarshan Kapoor
- Samvaidan Salgotra
- Taraksh Sambhar
- Ayush Tiwari
Leaderboard Position : Rank 50.
Find the leaderboard here.
This approach constructs a Keras Sequential model, reads in the training data, and processes two features named "TITLE_DES" and "TITLE_BUL" using a TF-IDF vectoriser. It then pads the resulting vectors to a uniform length and trains the model on them. At inference time, the script iterates through the groups in the test DataFrame, processes the same two features, builds the corresponding model input, and uses the trained model to make predictions. Finally, it collects all the predictions, along with their corresponding "PRODUCT_ID", in a DataFrame for the final output.
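The padding step can be illustrated with a small, dependency-free sketch. The helper name `pad_vectors` and its parameters are hypothetical and are not taken from the repository, which may instead rely on a library utility such as Keras `pad_sequences`. The sketch assumes padding and truncation at the end of each vector, which may differ from what the team's code does.

```python
def pad_vectors(vectors, length=None, pad_value=0.0):
    """Pad (or truncate) each vector so that all rows share one length.

    Hypothetical helper for illustration only. Padding values are
    appended at the end, and longer vectors are cut at the end.
    """
    if length is None:
        # Default to the longest vector in the batch.
        length = max((len(v) for v in vectors), default=0)

    padded = []
    for v in vectors:
        row = list(v[:length])                           # truncate if too long
        row.extend([pad_value] * (length - len(row)))    # pad if too short
        padded.append(row)
    return padded


if __name__ == "__main__":
    batch = [[0.5, 0.2], [0.1]]
    print(pad_vectors(batch))
```

Fixing `length` explicitly, rather than inferring it per batch, keeps the input shape seen at training time and at prediction time the same.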
This approach preprocesses the data by cleaning the text columns: removing HTML tags, converting to lowercase, removing punctuation, and eliminating stopwords. The cleaned training data is saved to a new CSV file. The AutoKeras library is used to create a text regression model, which is trained on the preprocessed training data; the trained model is then loaded using TensorFlow. The preprocessed test text columns are combined into a single column, and the model is used to predict "PRODUCT_LENGTH" for both the combined text and the title text in the test data.
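The cleaning steps described above can be sketched roughly as follows, using only the standard library. The function name `clean_text` and the tiny `STOPWORDS` set are assumptions made for illustration; the repository may use a fuller stopword list (for example from NLTK) and a different order of operations.

```python
import re
import string

# Illustrative stopword list only; a real pipeline would likely use a larger one.
STOPWORDS = frozenset(
    {"a", "an", "the", "and", "or", "of", "for", "in", "on", "to", "with", "is"}
)

_TAG_RE = re.compile(r"<[^>]+>")
# Map every ASCII punctuation character to a space so words stay separated.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Strip HTML tags, lowercase, drop punctuation, and remove stopwords."""
    text = _TAG_RE.sub(" ", text)          # remove HTML tags
    text = text.lower()                    # convert to lowercase
    text = text.translate(_PUNCT_TABLE)    # remove punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    print(clean_text("<b>The Best</b> Cotton T-Shirt, for Men!"))
```

Replacing punctuation with spaces, rather than deleting it, avoids gluing together words such as "T-Shirt" into a single unseen token.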
This approach creates a BERT-based classifier. It begins by setting a fixed seed for reproducibility, then performs preprocessing tasks such as handling duplicate entries and missing values. The titles are encoded into numerical values using a pre-trained multilingual version of BERT, and the encoded data is split into training and validation sets. The script then defines a PyTorch Dataset class and uses it to construct DataLoader instances for efficient iteration over the data during training and validation. It builds the classification model by adding a dropout layer and a linear layer on top of the pre-trained BERT model. After the learning rate scheduler and loss function are set up, training runs in a loop: in each epoch, the model is trained on the entire training set and then evaluated on the validation set. The model that performs best on the validation set is saved. Lastly, the model is evaluated using root mean square error (RMSE) as the metric.
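As a small illustration of the final evaluation step, the sketch below computes RMSE and picks the epoch with the best validation score. Both helper names (`rmse`, `best_epoch`) are hypothetical, and the sketch assumes that the "best" model is the one with the lowest validation RMSE; the repository may select its checkpoint by a different criterion.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute RMSE of empty sequences")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_epoch(val_scores):
    """Index of the epoch with the lowest validation score (assumed: lower is better)."""
    return min(range(len(val_scores)), key=val_scores.__getitem__)


if __name__ == "__main__":
    # Made-up numbers, purely to show the call pattern.
    print(rmse([10.0, 20.0], [13.0, 24.0]))
    print(best_epoch([3.0, 1.5, 2.0]))
```

In a training loop, the index returned by `best_epoch` would correspond to the checkpoint worth keeping.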