Code and data from the NAACL 2018 paper "Robust Cross-lingual Hypernymy Detection using Dependency Context"

CogComp/Cross-lingual-Hypernymy-Detection

Cross-lingual-Hypernymy-Detection

This repository contains data and code from the NAACL 2018 paper "Robust Cross-lingual Hypernymy Detection using Dependency Context".

Description

Cross-lingual Hypernymy Detection involves determining if a word in one language (“fruit”) is a hypernym of a word in another language (“pomme” i.e. apple in French). The ability to detect hypernymy cross-lingually can aid in solving cross-lingual versions of tasks such as textual entailment and event coreference. We propose BISPARSE-DEP, a family of unsupervised approaches for cross-lingual hypernymy detection, which learns sparse, bilingual word embeddings based on dependency contexts. We show that BISPARSE-DEP can significantly improve performance on this task, compared to approaches based only on lexical context. Our approach is also robust, showing promise for low-resource settings: our dependency-based embeddings can be learned using a parser trained on related languages, with negligible loss in performance. We also crowd-source a challenging dataset for this task on four languages – Russian, French, Arabic, and Chinese. Our embeddings and datasets are publicly available.

Data

The data directory contains two sub-directories:

  • hyper-hypo : This dataset contains crowd-sourced hypernyms as positive examples, and crowd-sourced hyponyms as negative examples. It was used to generate the results in Table 3a of the paper.
  • hyper-cohypo : This dataset contains crowd-sourced hypernyms as positive examples, and automatically extracted co-hyponyms as negative examples. It was used to generate the results in Table 3b of the paper.

Both directories contain the exact tune/test split used in the paper, for each of four languages - Arabic (ar), French (fr), Russian (ru), and Chinese (zh). Additionally, hyper-hypo contains all examples that were crowdsourced - this is a superset of the tune/test data, and contains additional negative examples.

Pre-trained vectors

The pre-trained-vecs directory contains the sparse, bilingual word vectors used to generate the results in the paper. There are 2 vector files (one per language) for each language pair (ar-en, fr-en, ru-en, zh-en), each model (window, dependency, joint, delex, unlab), and each dataset (hyper-hypo, hyper-cohypo), for a total of 80 files. They are organized by dataset, with each dataset folder containing 40 vector files.

Additionally, each dataset folder contains hyperparams.txt, which lists the hyperparameters used to generate the vectors and obtain the results.
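The 80-file total follows from the layout described above: 2 datasets × 4 language pairs × 5 models × 2 languages per pair. The short sketch below only enumerates those combinations to check the arithmetic. The tuple layout is illustrative and is not the repository's actual file naming scheme.

```python
from itertools import product

# Values taken from the description above; the tuples are illustrative
# and do not reproduce the real file names in pre-trained-vecs.
datasets = ["hyper-hypo", "hyper-cohypo"]
pairs = ["ar-en", "fr-en", "ru-en", "zh-en"]
models = ["window", "dependency", "joint", "delex", "unlab"]

combos = []
for dataset, pair, model in product(datasets, pairs, models):
    # One vector file per language in the pair (e.g. "ar" and "en").
    for lang in pair.split("-"):
        combos.append((dataset, pair, model, lang))

per_dataset = {d: sum(1 for c in combos if c[0] == d) for d in datasets}
print(len(combos), per_dataset)
```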

Scripts

  • balAPinc_multi_test.py - Given a list of cross-lingual word pairs, and two cross-lingual word vector files (one per language), generate balAPinc scores for the word pairs
    • Syntax : python scripts/balAPinc_multi_test.py <en-word-vectors> <non-en-word-vectors> <word-pairs-file> 0 <balAPinc-parameter> --prefix <optional prefix for output file> , where
      • <en-word-vectors> ::= File containing word vectors for English
      • <non-en-word-vectors> ::= File containing word vectors for the other language
      • <word-pairs-file> ::= List of (non-English, English) word pairs, with gold label (1 = English word is a hypernym of the non-English word, 0 = otherwise)
      • <balAPinc-parameter> ::= How many features to include while calculating balAPinc? (Integer between 0 and 100, inclusive)
    • Output : Input file, with a balAPinc score appended at the end of each line
    • Example usage : python scripts/balAPinc_multi_test.py pre-trained-vecs/hyper-hypo/ar-en.en.dep_1000.txt.gz pre-trained-vecs/hyper-hypo/ar-en.ar.dep_1000.txt.gz data/hyper-hypo/ar_tune.txt 0 100
  • balAPinc_classification.py - Given tune and test files, generate classification scores
    • Syntax : python scripts/balAPinc_classification.py --training <tune-word-file-with-scores> --test <test-word-file-with-scores>, where
      • <tune-word-file-with-scores> ::= Output of balAPinc_multi_test.py when run on tuning data
      • <test-word-file-with-scores> ::= Output of balAPinc_multi_test.py when run on test data
  • generate_results.sh - Run this to generate the results reported in the paper (currently generates all BiSparse-Dep (Full, Joint, Delex, Unlabeled) results in Tables 3a, 3b, and 4 )
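For readers unfamiliar with the score these scripts produce, here is a minimal sketch of balAPinc as defined by Kotlerman et al. (2010): the geometric mean of the symmetric LIN similarity and the asymmetric APinc inclusion score. It is not the code in scripts/balAPinc_multi_test.py. The function names, the dict-based sparse vector format, and the reading of the balAPinc parameter as "keep the top N percent of features" (`top_percent`) are all assumptions made for illustration.

```python
import math

def top_features(vec, top_percent=100):
    """Non-zero features of a sparse vector {feature: weight}, heaviest
    first, truncated to the top `top_percent` percent. The percentage
    cut-off is an assumed reading of <balAPinc-parameter>."""
    ranked = sorted((f for f, w in vec.items() if w > 0),
                    key=lambda f: (-vec[f], f))
    keep = math.ceil(len(ranked) * top_percent / 100)
    return ranked[:keep]

def bal_apinc(narrow, broad, top_percent=100):
    """balAPinc(narrow -> broad): high when the features of `narrow`
    are included among the top-ranked features of `broad`."""
    f_u = top_features(narrow, top_percent)
    f_v = top_features(broad, top_percent)
    if not f_u or not f_v:
        return 0.0
    rank_v = {f: r for r, f in enumerate(f_v, start=1)}

    # APinc: average precision of narrow's ranked features, each hit
    # weighted by how highly the feature ranks in the broad vector.
    included, total = 0, 0.0
    for r, f in enumerate(f_u, start=1):
        if f in rank_v:
            included += 1
            precision = included / r
            relevance = 1.0 - rank_v[f] / (len(f_v) + 1)
            total += precision * relevance
    apinc = total / len(f_u)

    # LIN: weight mass of shared features over total weight mass.
    shared = set(f_u) & set(f_v)
    num = sum(narrow[f] + broad[f] for f in shared)
    den = sum(narrow[f] for f in f_u) + sum(broad[f] for f in f_v)
    lin = num / den

    return math.sqrt(lin * apinc)

# Toy vectors (made-up weights): "pomme" has fewer contexts than "fruit".
pomme = {"eat": 0.9, "tree": 0.5}
fruit = {"eat": 0.8, "tree": 0.6, "juice": 0.4, "ripe": 0.2}
print(bal_apinc(pomme, fruit), bal_apinc(fruit, pomme))
```

On the toy vectors, the score for pomme → fruit is higher than for fruit → pomme. This asymmetry is what makes the measure usable for hypernymy detection once a decision threshold has been chosen on tuning data.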

Scripts to train vectors will be available soon. For now, you can use the scripts from previous work if needed.

Citation

Please cite the following if you use code, data, or other resources from this paper:

@InProceedings{UpadhyayVyasCarpuatRoth2018,
	author = 	"Upadhyay, Shyam
	and	Vyas, Yogarshi
	and Carpuat, Marine
	and Roth, Dan",
	title = 	"Robust Cross-lingual Hypernymy Detection using Dependency Context",
	booktitle = 	"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
	year = 	"2018",
	publisher = 	"Association for Computational Linguistics",
	location = 	"New Orleans, Louisiana",
	url = 	"https://arxiv.org/pdf/1803.11291.pdf"
}

@InProceedings{VyasCarpuat2016,
	author = 	"Vyas, Yogarshi
	and Carpuat, Marine",
	title = 	"Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment",
	booktitle = 	"Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
	year = 	"2016",
	publisher = 	"Association for Computational Linguistics",
	pages = 	"1187--1197",
	location = 	"San Diego, California",
	doi = 	"10.18653/v1/N16-1142",
	url = 	"http://www.aclweb.org/anthology/N16-1142"
}

Contacts

For inquiries: yogarshi@cs.umd.edu, shyamupa@seas.upenn.edu
