Amazon ML Challenge Problem Statement

Feature Extraction from Images

In this hackathon, the goal is to create a machine learning model that extracts entity values from product images. This capability is crucial in fields like healthcare, e-commerce, and content moderation, where precise product information is vital. As digital marketplaces expand, many products lack detailed textual descriptions, making it essential to obtain key details directly from images. These images provide important information such as weight, volume, voltage, wattage, and dimensions, which is critical for digital stores.

Data Description:

The dataset consists of the following columns:

  1. index: A unique identifier (ID) for the data sample.
  2. image_link: Public URL from which the product image can be downloaded, e.g. https://m.media-amazon.com/images/I/71XfHPR36-L.jpg. To download images, use the download_images function from src/utils.py (see the sample code in src/test.ipynb and the sketch after this list).
  3. group_id: Category code of the product
  4. entity_name: Product entity name, e.g. “item_weight”.
  5. entity_value: Product entity value, e.g. “34 gram”. Note: test.csv does not contain the entity_value column, as it is the target variable.
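
For reference, here is a minimal sketch of the download step. It assumes download_images accepts a list of image URLs and a destination folder; confirm the exact signature against the sample code in src/test.ipynb.

```python
# Minimal download sketch. Assumption: download_images(list_of_urls, folder);
# check src/test.ipynb for the signature actually used in this repository.
import pandas as pd

from src.utils import download_images

train = pd.read_csv("dataset/train.csv")

# Download every image referenced in the training split into images/train/.
download_images(train["image_link"].tolist(), "images/train")
```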

Output:

A test_out.csv file uploaded to the Portal, with exactly the same formatting as sample_test_out.csv.
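
A minimal sketch of producing the output file is shown below. It assumes that, like sample_test_out.csv, the output is a two-column CSV (index, prediction) where each prediction is either a string such as “34 gram” or empty; verify the column layout against sample_test_out.csv before submitting.

```python
# Sketch of writing test_out.csv. Assumption: the expected output mirrors
# sample_test_out.csv with two columns, index and prediction.
import pandas as pd

test = pd.read_csv("dataset/test.csv")

def predictor(image_link, group_id, entity_name):
    """Placeholder for your model; return "" when no entity value is detected."""
    return ""

test["prediction"] = test.apply(
    lambda row: predictor(row["image_link"], row["group_id"], row["entity_name"]),
    axis=1,
)
test[["index", "prediction"]].to_csv("test_out.csv", index=False)
```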

File Descriptions:

Source files

  1. src/sanity.py: Sanity checker that ensures the final output file passes all formatting checks. Note: the script does not check whether the number of predictions matches the number of rows in the test file. See the sample code in src/test.ipynb.
  2. src/utils.py: Contains helper functions for downloading images from the image_link.
  3. src/constants.py: Contains the allowed units for each entity type (see the sketch after this list).
  4. sample_code.py: A sample dummy script that generates an output file in the required format. Using this file is optional.
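
As a rough pre-check before running src/sanity.py, the sketch below validates a single prediction string against the allowed units. It assumes constants.py exposes a dict mapping each entity name to its set of allowed units; the name entity_unit_map is an assumption, so check the file for the actual variable.

```python
# Rough pre-check for a prediction string of the form "<number> <unit>".
# Assumption: src/constants.py defines entity_unit_map, a dict mapping
# entity names to sets of allowed unit strings.
import re

from src.constants import entity_unit_map

def is_valid_prediction(entity_name, prediction):
    """Return True if prediction is empty or looks like '<number> <allowed unit>'."""
    if prediction == "":
        return True  # an empty string is allowed when no value is found
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)\s+(.+)", prediction)
    if match is None:
        return False
    return match.group(2) in entity_unit_map.get(entity_name, set())

print(is_valid_prediction("item_weight", "34 gram"))  # expected: True
```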

Dataset files

  1. dataset/train.csv: Training file with labels (entity_value).
  2. dataset/test.csv: Test file without output labels (entity_value). Generate predictions with your model/solution on this file and format the output to match sample_test_out.csv (see the "Output" section above).
  3. dataset/sample_test.csv: Sample test input file.
  4. dataset/sample_test_out.csv: Sample output for sample_test.csv. The output for test.csv must be formatted in exactly the same way. Note: the predictions in this file may not be correct.
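
To sanity-check the files locally, the sketch below loads the dataset files and confirms that the sample output lines up with the sample input. It assumes both sample files carry the index column described in the data description above.

```python
# Sketch of inspecting the dataset files. Assumption: sample_test.csv and
# sample_test_out.csv both contain the index column described above.
import pandas as pd

train = pd.read_csv("dataset/train.csv")
print(train.columns.tolist())  # expected: index, image_link, group_id, entity_name, entity_value

sample_test = pd.read_csv("dataset/sample_test.csv")
sample_out = pd.read_csv("dataset/sample_test_out.csv")

# Every row of the sample output should correspond to a row of the sample input.
assert set(sample_out["index"]) == set(sample_test["index"])
```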