Update gradproject.md

mihranmiroyan committed Mar 11, 2024
1 parent b44a15a commit e77f5dc
1 changed file with 18 additions and 4 deletions (gradproject.md)
All of the provided datasets can be found in the Datahub directory `shared/sp24_
In disaster situations, it is important for emergency response efforts to have access to quick and accurate information about an area in order to respond effectively. This project will explore how data science techniques can be useful for such efforts.

#### Project Goals
{:.no_toc}
- Learn to work with image data using common feature extraction techniques such as Sobel edge filtering.
- Learn to work with real-world data and its common complexities, such as class imbalance, low signal-to-noise ratio, and high-dimensional data.
- Learn how to design effective preprocessing and featurization pipelines for solving difficult machine learning tasks.
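Sobel edge filtering, named in the goals above, estimates image gradients with a pair of 3×3 kernels; `skimage.filters.sobel` does this for you (with smoothing and normalization differences), but a minimal NumPy sketch shows what the resulting features capture. The step image below is a stand-in, not project data.

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude via 3x3 Sobel kernels (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                          # vertical gradient
    h, w = gray.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(3):                 # correlate each kernel cell with the image
        for j in range(3):
            patch = gray[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)

# A sharp vertical edge produces a strong response along the boundary:
step = np.zeros((10, 10))
step[:, 5:] = 1.0
edges = sobel_magnitude(step)
print(edges.shape)
print(edges.max())
```

Summary statistics of such edge maps (mean, variance, histograms) are one simple way to turn variable-size images into fixed-length feature vectors.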

#### Mission
{:.no_toc}
You have been hired by a crisis response agency to help assist them with your impressive data science skills! The agency has found that using satellite imagery is highly useful for supplying information for their response efforts. Unfortunately, however, annotating these high-resolution images can be a slow process for analysts. Your mission is to help address this challenge by developing an automatic computer vision approach!

#### Dataset Description
{:.no_toc}
The agency would like you to develop your approach on their internal dataset, derived from the [xView2 Challenge Dataset](https://xview2.org/){:target="_blank"}. This dataset contains satellite images of buildings after various natural disasters. The buildings are labeled based on the level of damage sustained on a scale ranging from 0 (no damage) to 3 (destroyed).

You can access all of the data within the `./satellite-image-data` directory. The dataset consists of the following folders for different natural disasters:
Within each folder is a zip file `train_images.npz` containing the satellite images.
> Testing: In the main directory, there are also the `test_images_hurricane-matthew.npz` and `test_images_flooding-fire.npz` zip files. The first contains test images from the `hurricane-matthew` disaster and the latter consists of a combination of test images from `midwest-flooding` and `socal-fire`.

#### Getting Started
{:.no_toc}
To help you with onboarding, the agency has provided a starter notebook [`starter.ipynb`](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Fsp24-student&urlpath=lab%2Ftree%2Fsp24-student%2Fgrad-proj%2Fsatellite-images%2Fstarter.ipynb&branch=main){:target="_blank"} which will introduce you to the dataset and some useful internal tools. After completing the onboarding assignment, you will be comfortable with the following:
1. Loading and visualizing data using tools from `data_utils.py`.
2. Processing different color channels in the dataset images.
3. Extracting feature information from images using tools from `feature_utils.py`.
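If you want to peek at the raw `.npz` archives outside the provided `data_utils` helpers, plain NumPy suffices. This sketch builds a stand-in archive so it runs anywhere; with the real data you would open e.g. `satellite-image-data/midwest-flooding/train_images.npz` instead, and the array names inside the provided archives are not specified here, so list them first.

```python
import numpy as np

# Stand-in archive; the key name "img_0" is an assumption for illustration.
rng = np.random.default_rng(0)
np.savez("demo_images.npz",
         img_0=rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8))

with np.load("demo_images.npz") as archive:
    keys = archive.files              # array names stored in the archive
    image = archive[keys[0]]          # one image as an H x W x C array
    print(keys, image.shape, image.dtype)
```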

#### Exploratory Data Analysis
{:.no_toc}
Now that you have been successfully onboarded, the agency would like you to perform some exploratory data analysis to build an initial understanding of the data. Specifically, they are looking for:

- Basic statistics about the dataset, such as the number of images per disaster type and the distribution of image sizes and damage labels.
Please prepare an EDA report to present to the agency leadership with the above in mind.

#### Project Tasks
{:.no_toc}
Now that leadership is pleased with your initial EDA report and confident in your data science ability, they would like you to assist the agency with various tasks. *Please complete Task A first and then Task B.*

#### *Task A: Disaster Type Classification*
{:.no_toc}
The agency consists of different subdivisions for assisting with different disaster types, e.g., fires, floods, etc. In the event of a disaster, the agency mounts its response effort by first assessing the type of disaster and then requesting the appropriate subdivision to assist with the disaster.

Your task is to assist the agency with making this initial call quickly by automatically classifying images based on the disaster scenario. Specifically, your role will be to build a classifier that can distinguish images from the `midwest-flooding` disaster and the `socal-fire` disaster.

To assess your performance, please submit predictions for the `test_images_flooding-fire.npz` images. This should be in a CSV file `test_images_flooding-fire_predictions.csv` consisting of a single column with no header, with a 0 to indicate a `midwest-flooding` prediction and a 1 to indicate a `socal-fire` prediction. The prediction in row *i* should correspond to the *i*th image.
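The submission format above can be produced in a couple of lines with pandas. The predictions below are placeholders; substitute your classifier's output on the test images, in order.

```python
import pandas as pd

# Placeholder predictions: 0 = midwest-flooding, 1 = socal-fire.
predictions = [0, 1, 1, 0]

# One unnamed column, no header, row i = prediction for image i.
pd.Series(predictions).to_csv(
    "test_images_flooding-fire_predictions.csv", header=False, index=False
)

print(open("test_images_flooding-fire_predictions.csv").read())
```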

#### *Task B: Damage Level Classification*
{:.no_toc}
The agency needs to know how severe a disaster is in order to allocate resources for a response effectively. The agency is especially concerned with human lives and uses building damage as an important metric for disaster severity.

Your task is to assist the agency by automatically detecting the building damage level after a disaster. Specifically, create a damage level classifier for the `hurricane-matthew` disaster.

To assess your performance, please submit predictions for the `test_images_hurricane-matthew.npz` images. This should be in a CSV file `test_images_hurricane-matthew_predictions.csv` consisting of a single column with no header, with a 0-3 prediction of the damage level. The prediction in row *i* should correspond to the *i*th image.
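Your EDA will likely show that the damage levels are imbalanced, which matters for this four-class task. One common remedy is to weight each class inversely to its frequency when training. A minimal sketch of "balanced" weights computed by hand (the labels below are stand-ins):

```python
import numpy as np

labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 3])   # stand-in damage levels
classes, counts = np.unique(labels, return_counts=True)

# weight_c = n_samples / (n_classes * count_c), the same formula
# scikit-learn uses for class_weight="balanced"; rare classes get
# larger weights so the classifier cannot ignore them.
weights = len(labels) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))
```

These weights can be passed to most classifiers as per-class or per-sample weights during training.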

#### Resources
{:.no_toc}
To assist you in your efforts, the agency has compiled the following list of resources:
- For more background about the dataset you can look at the [paper](https://arxiv.org/pdf/1911.09296.pdf){:target="_blank"} associated with the dataset.
- For image processing, [scikit-image](https://scikit-image.org/){:target="_blank"} is a very useful library. This [tutorial](https://www.kaggle.com/code/bextuychiev/full-tutorial-on-image-processing-in-skimage){:target="_blank"} may be helpful for learning how to use the library.
A common task in real-life data analysis involves working with text data.
In this project, we will work with a dataset consisting of natural language questions asked by humans and answers provided by chatbots.

#### Project Goals
{:.no_toc}
- Prepare you to work with text data by learning common techniques like embedding generation, tokenization, and topic modeling.
- Work with real-world data in its targeted domain. The data is non-trivial in both size and complexity.
- Ask open-ended questions and answer them using data at hand.

#### Dataset Description
{:.no_toc}
The source dataset link is [here](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations){:target="_blank"}. The author describes the dataset as follows:

> This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp.
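Since each conversation is stored as an OpenAI-API-style message list, pulling out the prompt and reply is mostly dictionary work. The record below is a hand-made stand-in mimicking the described schema; field names in the real dataset may differ slightly, so inspect a real row first.

```python
# Hypothetical record patterned on the dataset description above.
record = {
    "model_a": "vicuna-13b",
    "model_b": "gpt-4",
    "winner": "model_b",
    "conversation_a": [
        {"role": "user", "content": "What is overfitting?"},
        {"role": "assistant", "content": "Overfitting is when a model memorizes noise."},
    ],
}

# First user message is the prompt; first assistant message is the reply.
prompt = next(m["content"] for m in record["conversation_a"] if m["role"] == "user")
reply = next(m["content"] for m in record["conversation_a"] if m["role"] == "assistant")
print(prompt)
print(record["winner"])
```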
There are two auxiliary datasets that you can use to help with your analysis:
We used [this prompt](https://gist.github.com/simon-mo/25c5d532bccc7f28b404cffdfe719e6e#file-prompt-md){:target="_blank"} to generate the responses. You are welcome to generate your own ground truth data. You can generate your own embeddings following [this](https://gist.github.com/simon-mo/25c5d532bccc7f28b404cffdfe719e6e#file-using-your-own-embeddings-md){:target="_blank"} guide.

#### Exploratory Data Analysis
{:.no_toc}
For the EDA tasks, tell us more about the data. What do you see in it? Come up with questions and answer them. For example: What is the win rate of GPT-4? What are the most common topics? Do different judges have different preferences? What are the most common reasons for a question being hard?
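A question like "what is the win rate of GPT-4?" reduces to boolean masks over the vote records. The tiny table below is hand-made, and its column names are assumptions patterned on the dataset description; adapt the masks to the real schema.

```python
import pandas as pd

votes = pd.DataFrame({
    "model_a": ["gpt-4", "vicuna-13b", "gpt-4", "alpaca-13b"],
    "model_b": ["vicuna-13b", "gpt-4", "alpaca-13b", "gpt-4"],
    "winner":  ["model_a", "model_b", "tie", "model_b"],
})

# Games gpt-4 played, and games it won (ties count as played, not won).
played = (votes["model_a"] == "gpt-4") | (votes["model_b"] == "gpt-4")
won = ((votes["winner"] == "model_a") & (votes["model_a"] == "gpt-4")) | (
    (votes["winner"] == "model_b") & (votes["model_b"] == "gpt-4")
)
win_rate = won.sum() / played.sum()
print(f"gpt-4 win rate: {win_rate:.2f}")
```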

#### Project Tasks
{:.no_toc}
Now, we aim to better understand the different chatbot models! Please complete both Task A and Task B. We have included example questions to consider, but you are expected to come up with your own questions to answer.

#### *Task A: Modeling the Winning Model*
{:.no_toc}
Given a prompt, can we predict which model's response will win the user vote? You can start by analyzing the length, textual features, and embeddings of the prompt. You should also explore the differences in output across the different models. For modeling, you can use logistic regression to perform binary classification (does the OpenAI model win or lose?) or multi-class classification (which exact model wins). You should also evaluate the model using appropriate metrics.

One hint is to utilize the topic modeling data by first clustering prompts based on their embeddings, then training a model for each cluster to predict the winner. Also, feel free to use the hardness score to help with the prediction.
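The logistic-regression baseline suggested above can be sketched end to end. Everything here is a stand-in: the two features (character and word counts) and the synthetic win/lose labels are placeholders for real prompt statistics and votes, not the dataset itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and labels for illustration only.
rng = np.random.default_rng(0)
n = 200
prompt_len = rng.integers(5, 400, n)             # characters in each prompt
word_count = rng.integers(1, 80, n)              # words in each prompt
X = np.column_stack([prompt_len, word_count]).astype(float)
y = (prompt_len > 150).astype(int)               # stand-in "OpenAI model wins" label

# Hold out a test split and report accuracy; use richer metrics on real data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"held-out accuracy: {acc:.2f}")
```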

#### *Task B: Hardness Prediction*
{:.no_toc}
While we provide the hardness score generated by GPT-3.5, can you explore whether such scoring is useful and valid? The hardness score should be an integer value from 1 to 10. For example, if a prompt's score is 1, we expect even a weak model to be able to answer it. If the score is 10, we expect the question to be hard; perhaps only GPT-4 can answer it.

You can start by analyzing the embeddings and the topic modeling data. You can then use linear regression to predict the hardness score, using existing or new features.

You should also evaluate the model using appropriate metrics. One challenging aspect here is that the output score should be an integer value, while linear regression produces continuous predictions.
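One simple way to reconcile continuous regression output with the integer 1-10 scale is to round and then clip into range (you may well find better approaches, such as ordinal regression). The raw values below are stand-ins for linear-regression output.

```python
import numpy as np

raw = np.array([-0.3, 2.6, 7.49, 11.2])        # stand-in regression output
# Round to the nearest integer, then clip into the valid 1-10 range.
hardness = np.clip(np.rint(raw), 1, 10).astype(int)
print(hardness)  # [ 1  3  7 10]
```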

#### Getting Started
{:.no_toc}
To get started, we provide a notebook [`nlp-chatbot-starter.ipynb`](https://github.com/DS-100/sp24-dev/blob/main/proj_final/nlp-chatbot-analysis/nlp-chatbot-starter.ipynb){:target="_blank"} that demonstrates how to load and inspect the data.

Additionally, here are some example questions about the project that you are welcome to explore.
> Analysis: By leveraging the question embeddings, can we find similar questions? How repeated are the questions in the dataset? Can you reproduce the Elo score rating for the chatbots and come up with a better ranking? How can we make sense of the data overall?

#### Resources
{:.no_toc}
- [Joey's EDA and Elo rating modeling](https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH){:target="_blank"} is a great resource to get started with the EDA. Note that (1) the plot is made with Plotly; we recommend reproducing it with Matplotlib or Seaborn, and (2) the Elo rating is a good modeling task to reproduce, but we expect you to do more than just that (for example, demonstrate how Elo rating works and how to calculate it in your report).

- [An intuitive introduction to text embeddings](https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/){:target="_blank"} is a good resource for understanding what text embeddings are and how to use them.
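Elo rating, referenced in the notebook above, updates two ratings after each head-to-head vote. A minimal sketch of one update step; the K-factor and starting ratings are conventional choices for illustration, not values taken from the notebook.

```python
def elo_update(r_winner, r_loser, k=32):
    """Return updated (winner, loser) ratings after one head-to-head vote."""
    # Expected score of the eventual winner under the logistic Elo model.
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)          # surprise wins move ratings more
    return r_winner + delta, r_loser - delta

# Two equally rated models: the winner gains exactly k/2 points.
a, b = elo_update(1000.0, 1000.0)
print(a, b)
```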

## Group Formation + Research Proposal

The first deliverable of your group project is to form your group, choose a dataset, and submit your research proposal to [this Google form](https://forms.gle/DcBp3ZbM8TpTfSRD6){:target="_blank"} by 11:59 pm on 3/15. Along with your research proposal, you are required to briefly explore your chosen dataset and describe it in one paragraph. You may form groups of 2 or 3 people with any Data 200/200A/200S student.

