From ecf3e29ae36bef6724f53cde8d52d238ac05df19 Mon Sep 17 00:00:00 2001
From: simon-mo
Date: Fri, 8 Mar 2024 13:41:55 -0800
Subject: [PATCH] add concrete tasks

---
 gradproject.md | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/gradproject.md b/gradproject.md
index a9f325a..211ff90 100644
--- a/gradproject.md
+++ b/gradproject.md
@@ -18,7 +18,7 @@ markdown: kramdown
 
 The graduate project is **offered only to students enrolled in Data C200, CS C200A, or Data 200S**. Other students are welcome to explore the questions and datasets in the project for personal learning, but their work will not be graded or counted towards their final grades.
 
-The purpose of the project is to give students experience in both open-ended data science analysis and research in general. 
+The purpose of the project is to give students experience in both open-ended data science analysis and research in general.
 
@@ -27,7 +27,7 @@ The purpose of the project is to give students experience in both open-ended dat
 
 ## Deliverables
 
-The graduate project element will require the following deliverables: 
+The graduate project element will require the following deliverables:
 
 ### Upcoming Deliverables:
 
@@ -97,7 +97,7 @@ You can access all of the data within the `./satellite-image-data` directory. Th
 2. `socal-fire`
 3. `hurricane-matthew`
 
-Within each folder is a zip file `train_images.npz` containing the satellite images as numpy arrays and a `train_labels.npy` file with corresponding damage level labels. 
+Within each folder is a zip file `train_images.npz` containing the satellite images as numpy arrays and a `train_labels.npy` file with the corresponding damage-level labels.
 
 **Testing**: In the main directory there are also the `test_images_hurricane-matthew.npz` and `test_images_flooding-fire.npz` zip files. The first contains test images from the `hurricane-matthew` disaster, and the second contains a combination of test images from `midwest-flooding` and `socal-fire`.
 
@@ -124,7 +124,7 @@ ##### Task A: Disaster Type Classification
 The agency consists of different subdivisions for assisting with different disaster types, e.g., fires, floods, etc. In the event of a disaster, the agency mounts its response effort by first assessing the type of disaster and then requesting the appropriate subdivision to assist with it.
 
-Your task is to assist the agency with making this initial call quickly by automatically classifying images based on the disaster scenario. Specifically, your role will be to build a classifier that can distinguish images from the `midwest-flooding` disaster and the `socal-fire` disaster. 
+Your task is to help the agency make this initial call quickly by automatically classifying images based on the disaster scenario. Specifically, your role will be to build a classifier that can distinguish images of the `midwest-flooding` disaster from images of the `socal-fire` disaster.
 
 To assess your performance, please submit predictions for the `test_images_flooding-fire.npz` images. These should be in a CSV file `test_images_flooding-fire_predictions.csv` consisting of a single column with no header, with a 0 indicating a `midwest-flooding` prediction and a 1 indicating a `socal-fire` prediction. The prediction in row i should correspond to the i-th image.
 
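+
+Below is a minimal sketch of one possible baseline, assuming numpy, pandas, and scikit-learn are available. It also assumes each `.npz` archive stores one array per image and that the test archive lives in the main `./satellite-image-data` directory; adjust the paths and loading code to the actual layout. The per-channel color statistics are an illustrative feature choice, not a requirement.
+
+```python
+# Hedged baseline sketch for Task A: flood vs. fire classification.
+import numpy as np
+import pandas as pd
+from sklearn.linear_model import LogisticRegression
+
+def load_images(path):
+    # Assumption: the .npz archive stores one array per image.
+    with np.load(path) as data:
+        return [data[key] for key in data.files]
+
+def featurize(images):
+    # Images may vary in size, so reduce each one to a fixed-length
+    # vector of per-channel means and standard deviations.
+    return np.array([
+        np.concatenate([img.mean(axis=(0, 1)), img.std(axis=(0, 1))])
+        for img in images
+    ])
+
+root = "satellite-image-data"
+X_flood = featurize(load_images(f"{root}/midwest-flooding/train_images.npz"))
+X_fire = featurize(load_images(f"{root}/socal-fire/train_images.npz"))
+X = np.vstack([X_flood, X_fire])
+# Label convention from the task: 0 = midwest-flooding, 1 = socal-fire.
+y = np.concatenate([np.zeros(len(X_flood)), np.ones(len(X_fire))])
+
+clf = LogisticRegression(max_iter=1000).fit(X, y)
+
+X_test = featurize(load_images(f"{root}/test_images_flooding-fire.npz"))
+preds = clf.predict(X_test).astype(int)
+
+# Single column, no header; row i holds the prediction for image i.
+pd.DataFrame(preds).to_csv(
+    "test_images_flooding-fire_predictions.csv", index=False, header=False)
+```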
@@ -202,17 +202,36 @@ We used [this prompt](https://gist.github.com/simon-mo/25c5d532bccc7f28b404cffdf
 
 Your tasks are open-ended, and you should feel free to explore the data as you see fit. Overall, you should aim to perform all of the following tasks. We included example questions to consider, but you are expected to come up with your own questions to answer.
 
-#### EDA Tasks
+At a high level, we expect the following tasks to be discussed in your project proposal.
+
+##### EDA Tasks
 For the EDA tasks, we expect plots and storytelling. Tell us more about the data. What do you see in it? Come up with questions and answer them. For example, what is the win rate of GPT-4? What are the most common topics? Do different judges have different preferences? What are the most common reasons for a question being hard?
 
-#### Modeling Tasks
+##### Modeling Tasks
 Perform some modeling tasks given our ground-truth labels. Can you train a logistic regression model to predict the winner given the embeddings? How about a K-means clustering model to cluster the questions? Can you use linear regression to predict the hardness score? For the modeling tasks, we expect you to demonstrate how well the model works and how you evaluate it. You should justify the choice of model and the evaluation metrics. You should also discuss the limitations of the model and how it could be improved.
 
-#### Analysis Tasks
+##### Analysis Tasks
 By leveraging the question embeddings, can we find similar questions? How "repeated" are the questions in the dataset? Can you reproduce the Elo ratings for the chatbots and come up with a better ranking? How can we make sense of the data overall?
 
+#### Concrete End-to-End Tasks
+
+Concretely, you should perform the following baseline set of tasks, spanning EDA, modeling, and analysis.
+
+##### Task A: Predicting the Winning Model
+Given a prompt, can we predict which model's response will win the user's vote? You can start by analyzing the length, textual features, and embeddings of the prompts. You should also explore how the outputs of the different models differ. For modeling, you can use logistic regression to perform binary classification (does the OpenAI model win or lose?) or multi-class classification (which exact model wins?). You should also evaluate the model using appropriate metrics.
+
+One hint would be to utilize the topic modeling data: first cluster the prompts given their embeddings, then train a model per cluster to predict the winner. Also feel free to use the hardness score to help with the prediction. A brief sketch of this approach appears after Task B below.
+
+##### Task B: Hardness Prediction
+While we provide the hardness score generated by GPT-3.5, can you explore whether such scoring is useful and valid? We want the hardness score to be an integer value from 1 to 10. For example, if a prompt's score is 1, we expect even a weak model to be able to answer it; if the score is 10, we expect the question to be hard, such that perhaps only GPT-4 can answer it.
+
+You can start by analyzing the embeddings and the topic modeling data. You can then use linear regression to predict the hardness score, using existing or new features.
+
+You should also evaluate the model using appropriate metrics. One challenging aspect here is that the output score must be an integer, while linear regression produces predictions in a continuous space.
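+
+For Task A, here is a minimal sketch of the binary-classification variant and of the cluster-then-predict hint above, assuming scikit-learn is available. The arrays `embeddings` and `wins` are hypothetical placeholders; substitute however you load the course dataset.
+
+```python
+# Hedged sketch for Task A: predict whether the OpenAI model wins.
+# `embeddings` and `wins` are random placeholders standing in for
+# the real prompt embeddings and vote outcomes.
+import numpy as np
+from sklearn.cluster import KMeans
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import accuracy_score, roc_auc_score
+from sklearn.model_selection import train_test_split
+
+rng = np.random.default_rng(0)
+embeddings = rng.normal(size=(1000, 256))  # placeholder embeddings
+wins = rng.integers(0, 2, size=1000)       # placeholder 0/1 outcomes
+
+X_tr, X_te, y_tr, y_te = train_test_split(
+    embeddings, wins, test_size=0.2, random_state=0)
+
+# Baseline: one global logistic regression on the embeddings.
+clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
+print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
+print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
+
+# Hint from above: cluster prompts by embedding, then fit one
+# classifier per cluster (assumes every training cluster contains
+# both outcomes).
+k = 8
+kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_tr)
+per_cluster = {
+    c: LogisticRegression(max_iter=1000).fit(
+        X_tr[kmeans.labels_ == c], y_tr[kmeans.labels_ == c])
+    for c in range(k)
+}
+test_clusters = kmeans.predict(X_te)
+cluster_preds = np.empty(len(X_te), dtype=int)
+for c in range(k):
+    mask = test_clusters == c
+    if mask.any():
+        cluster_preds[mask] = per_cluster[c].predict(X_te[mask])
+print("per-cluster accuracy:", accuracy_score(y_te, cluster_preds))
+```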
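+
+For Task B, here is a minimal sketch of one way to handle the continuous-to-integer mismatch: fit a linear regression, then round and clip its predictions onto the 1 to 10 scale before evaluating. Again, `embeddings` and `hardness` are hypothetical placeholders for the course data.
+
+```python
+# Hedged sketch for Task B: hardness prediction with integer outputs.
+import numpy as np
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_absolute_error
+from sklearn.model_selection import train_test_split
+
+rng = np.random.default_rng(0)
+embeddings = rng.normal(size=(1000, 256))  # placeholder features
+hardness = rng.integers(1, 11, size=1000)  # placeholder 1-10 scores
+
+X_tr, X_te, y_tr, y_te = train_test_split(
+    embeddings, hardness, test_size=0.2, random_state=0)
+
+reg = LinearRegression().fit(X_tr, y_tr)
+raw = reg.predict(X_te)  # continuous predictions
+
+# Snap onto the required integer scale: round, then clip to [1, 10].
+snapped = np.clip(np.rint(raw), 1, 10).astype(int)
+
+print("MAE, continuous:", mean_absolute_error(y_te, raw))
+print("MAE, integer:", mean_absolute_error(y_te, snapped))
+print("exact-match rate:", (snapped == y_te).mean())
+```
+
+Rounding is only one option; treating the score as ordinal (e.g., ordinal regression, or classification over the ten levels) is a reasonable alternative worth discussing.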
+
 #### Getting Started
 
 ***!!!
 TODO: put the notebook in public sp-24-public repo before shipping !!!***
 
@@ -299,7 +318,7 @@ Specifically, you should ensure you address the following in the narrative:
 * Clearly state the research questions and why they are interesting and important.
 * Introduction: ensure you include a brief survey of related work on the topic(s) of your analysis. Be sure to reference current approaches/research in the context of your project, as well as how your project differs from or complements existing research. You must cite all the references you discuss in this section.
 * Description of data: ensure you outline the summary of the data and how the data was prepared for the modeling phase (summarizing your EDA work). If applicable, descriptions of additional datasets that you gathered to support your analysis may also be included.
-* Methodology: carefully describe the methods/models you use and why they are appropriate for answering your research questions. You must include a detailed description of how modeling is done in your project, including inference or prediction methods used, feature engineering and regularization if applicable, and cross-validation or test data as appropriate for model selection and evaluation. You may also include interesting findings involving your datasets. 
+* Methodology: carefully describe the methods/models you use and why they are appropriate for answering your research questions. You must include a detailed description of how modeling is done in your project, including inference or prediction methods used, feature engineering and regularization if applicable, and cross-validation or test data as appropriate for model selection and evaluation. You may also include interesting findings involving your datasets.
 * Summary of results: analyze your findings in relation to your research question(s). Include/reference visualizations and specific results. Discuss any interesting findings from your analysis. You are encouraged to compare the results using different inference or prediction methods (e.g., linear regression, logistic regression, or classification and regression trees). Can you explain why some methods performed better than others?
 * Discussion: evaluate your approach and discuss any limitations of the methods you used. Also, briefly describe any surprising discoveries and whether there are any interesting extensions to your analysis.