product-matching

The code of Team Rhinobird for Mining the Web of HTML-embedded Product Data Task One at ISWC2020.

Task one: Product Matching

The product matching task aims to identify whether a pair of product offers from different websites refers to the same product or not.

Datasets

In the product matching task of the ISWC2020 challenge, the Task One dataset is sampled from the WDC product data corpus. Products in the corpus are described by the following properties: id, cluster id, category, title, description, brand, price, and specification table. Our models are mainly trained on two different matching datasets (a loading sketch follows the list):

  • The Computers dataset is provided by the organizers of the challenge and contains only products from the Computers & Accessories category.

  • The All dataset contains products from all four categories (Computers & Accessories, Camera & Photo, Watches, and Shoes).
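A minimal sketch of loading one of these pair datasets, assuming the common WDC format (one gzip-compressed JSON object per line, with *_left / *_right attributes and a binary label); the file name is illustrative, not taken from this repository:

    import gzip
    import json

    def load_pairs(path):
        """Read (title_left, title_right, label) triples from a WDC-style pair file."""
        pairs = []
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                row = json.loads(line)
                pairs.append((row.get("title_left"), row.get("title_right"), row["label"]))
        return pairs

    pairs = load_pairs("computers_train_xlarge.json.gz")  # illustrative file name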

Input

Although products are described by many attributes, most of the fields contain NULL values. Considering the fill rate and the input length, we focus on the title and description attributes and ignore the others.
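A hedged sketch of how a pair input could be assembled from title and description, treating missing (NULL) fields as empty strings; the helper name and truncation length are illustrative, not taken from the repository:

    def build_text(offer, max_desc_chars=256):
        """Concatenate title and a truncated description, tolerating NULL fields."""
        title = offer.get("title") or ""
        desc = (offer.get("description") or "")[:max_desc_chars]  # keep the input short
        return (title + " " + desc).strip()

For BERT, the two offers of a pair can then be encoded as a sentence pair, e.g. tokenizer(build_text(left), build_text(right), truncation=True, max_length=128) with a Hugging Face tokenizer.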

Model

We use BERT-base as the main module of our matching model. Focal loss is adopted to alleviate the class imbalance problem.
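A minimal sketch of binary focal loss (as in Lin et al., 2017), assuming PyTorch and two-class logits; train.py may implement it differently:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """Focal loss for [N, 2] logits and int64 targets in {0, 1}.

        (1 - p_t)^gamma down-weights easy examples so training focuses on
        hard, misclassified pairs; alpha balances the two classes.
        """
        log_probs = F.log_softmax(logits, dim=-1)                       # [N, 2]
        log_p_t = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of the true class
        p_t = log_p_t.exp()
        alpha_t = alpha * targets.float() + (1 - alpha) * (1 - targets.float())
        return (-alpha_t * (1 - p_t) ** gamma * log_p_t).mean()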

Please download the dataset and BERT weights first.

Run train.py to train all the models we used in the challenge:

python train.py

After obtaining the model parameters, run predict.py to combine the predictions of the different models and produce the final results:

python predict.py
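A hedged sketch of the combination step, assuming each model's match probabilities have been saved to disk (file names are illustrative); soft voting with a 0.5 threshold is one plausible scheme, though predict.py may combine the models differently:

    import numpy as np

    probs = [np.load(name) for name in ("bert_title_all.npy",
                                        "bert_title_desc_all.npy",
                                        "bert_title_desc_computers.npy")]
    avg = np.mean(probs, axis=0)      # average the three models' probabilities
    preds = (avg >= 0.5).astype(int)  # 1 = same product, 0 = different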

Post-processing

For test pairs predicted as 1 (match) whose two products belong to different categories, we directly correct the prediction to 0 in the post-processing phase.
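A sketch of this correction, assuming each offer in a pair carries the category attribute from the corpus:

    def post_process(preds, pairs):
        """Flip a predicted match to 0 when the two offers' categories differ."""
        for i, (left, right) in enumerate(pairs):
            if preds[i] == 1 and left["category"] != right["category"]:
                preds[i] = 0
        return preds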

Results

Validation

Single model:

Model       Input              Dataset    F1      Post F1
Bert_focal  title              All        0.9481  0.9496
Bert_focal  title+description  All        0.9384  0.9411
Bert_focal  title+description  Computers  0.9700  0.9700

Test

In the final evaluation, we ensemble these three models:

Model      Precision  Recall  F1
Our model  0.8063     0.9200  0.8594
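The reported F1 is the harmonic mean of precision and recall:

    p, r = 0.8063, 0.9200
    f1 = 2 * p * r / (p + r)  # ~0.8594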
