Preprocess for KBA - filter target attributes/properties rapidly.
Linguistic knowledge --> world knowledge
- World knowledge varies with time
- Acquiring knowledge from heterogeneous resources so that it reflects changes in the real world is very important (KBA)
Given a target entity to be tracked, find its (new) related information from heterogeneous resources effectively, because large volumes of data are being created.
- Target entity
- Different patterns related to the entity type
- e.g. Obama - person type
- extract all patterns related to person
- e.g. MS - org type
- extract pattern related to org.
- Pattern classification
- Entities (entity types)
- Dynamic vs. static
- Related information
- related to some pattern
- Exact
- disambiguation
- Evaluation metrics
- Speed
- wiki
- dbpedia
- accuracy, precision, recall
- TREC KBA -> read up on KBA for more information.
- Testing Dataset
- Speed
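The accuracy, precision, and recall metrics listed above can be sketched as follows. This is a minimal illustration, not the TREC KBA scoring procedure: the function name `evaluate` and the document identifiers are hypothetical, and it assumes that both the retrieved set and the gold set are subsets of a known universe of candidate documents.

```python
def evaluate(predicted, gold, universe):
    """Compute accuracy, precision, and recall for a retrieved set.

    Assumes `predicted` and `gold` are both subsets of `universe`.
    """
    predicted, gold, universe = set(predicted), set(gold), set(universe)
    tp = len(predicted & gold)             # retrieved and relevant
    fp = len(predicted - gold)             # retrieved but not relevant
    fn = len(gold - predicted)             # relevant but missed
    tn = len(universe - predicted - gold)  # correctly ignored
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    accuracy = (tp + tn) / len(universe) if universe else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Hypothetical document ids, for illustration only.
    scores = evaluate(
        predicted={"d1", "d2", "d3"},
        gold={"d2", "d3", "d4"},
        universe={"d1", "d2", "d3", "d4", "d5"},
    )
    print(scores)
```

Speed would be measured separately, for example as documents processed per core per minute.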
Target entities have different types, and each type has its own patterns; conversely, each pattern is tied to particular types.
Note that
- Type
- Pattern
- Feature
- Information
- Information related to humans
- How many features are related to humans?
- What kinds of features are related to humans?
- What kinds of patterns can be used to find the features?
- Distinguish dynamic from static information
- What kinds of features are dynamic, and what kinds of features are static?
- In other words, what knowledge is unchanged?
- Identify the position of the information
- Determine whether the information is related to the targets of interest
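One possible way to organize the type / pattern / feature relationship, with a dynamic vs. static flag on each feature, is sketched below. Everything in it is an assumption made for illustration: the `Pattern` class, the `PATTERNS_BY_TYPE` table, the regular expressions, and the choice of which features count as dynamic are hypothetical and not taken from these notes or from PATTY.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    feature: str   # feature the pattern points to, e.g. "birth"
    regex: str     # illustrative surface pattern
    dynamic: bool  # True if the feature can change over time


# Hypothetical lookup table: entity type -> patterns for that type.
PATTERNS_BY_TYPE = {
    "person": [
        Pattern("birth", r"\bwas born (?:in|on)\b", dynamic=False),
        Pattern("occupation", r"\b(?:was appointed|became|stepped down as)\b", dynamic=True),
    ],
    "org": [
        Pattern("founding", r"\bwas founded (?:in|by)\b", dynamic=False),
        Pattern("leadership", r"\bnamed .+ as (?:its )?(?:new )?CEO\b", dynamic=True),
    ],
}


def match_features(entity_type, sentence):
    """Return (feature, dynamic) pairs whose pattern fires on the sentence."""
    hits = []
    for p in PATTERNS_BY_TYPE.get(entity_type, []):
        if re.search(p.regex, sentence, flags=re.IGNORECASE):
            hits.append((p.feature, p.dynamic))
    return hits


if __name__ == "__main__":
    print(match_features("person", "Obama was born in Honolulu."))
    print(match_features("org", "Acme named Jane Doe as its new CEO."))
```

Keying the table by entity type means only the patterns relevant to the tracked entity's type are tried, which is the point of typing the target entity first.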
- Data structure and algorithms
- Filtering: (1) and (2)
- Extract: (3) and (4)
- Efficient filtering, e.g.
- birth (n patterns related to birth)
- occupation change (m patterns related to this move)
- If a topic model is suitable:
- birth is a topic, occupation change is a topic,...
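A cheap pre-filter along these lines could group the n birth patterns and the m occupation-change patterns into one compiled regular expression with a named group per topic. Note that this is a plain regex sketch, not a topic model, and the pattern strings, the `TOPIC_PATTERNS` table, and the function names are hypothetical.

```python
import re

# Hypothetical patterns per topic; real lists would be much longer.
TOPIC_PATTERNS = {
    "birth": [r"was born", r"date of birth"],
    "occupation_change": [r"was appointed", r"resigned as", r"stepped down"],
}


def compile_topic_filter(topic_patterns):
    """Combine all patterns into one regex with a named group per topic."""
    groups = []
    for topic, patterns in topic_patterns.items():
        alternatives = "|".join(f"(?:{p})" for p in patterns)
        groups.append(f"(?P<{topic}>{alternatives})")
    return re.compile("|".join(groups), flags=re.IGNORECASE)


TOPIC_FILTER = compile_topic_filter(TOPIC_PATTERNS)


def filter_document(text, entity_name):
    """Return the topics hit in `text`, or an empty set if the entity is absent."""
    # Cheapest test first: skip documents that never mention the entity.
    if entity_name.lower() not in text.lower():
        return set()
    # Each match reports the topic-level named group that produced it.
    return {m.lastgroup for m in TOPIC_FILTER.finditer(text)}


if __name__ == "__main__":
    doc = "Jane Doe was born in 1970 and later resigned as chair."
    print(filter_document(doc, "Jane Doe"))
    print(filter_document(doc, "John Roe"))
```

Documents that return an empty set can be dropped before any heavier extraction step runs.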
- 23.21 min, 24 cores, 330,000 docs
- Took 1393.446886 seconds
- 1 core takes 24 min to process 13,750 docs
- 1 core: 592 docs/min
- 16 cores
- 1432.828743 seconds
- About 3 min
- 177.767400 s, 8 cores, 66,000 docs
- 1 core takes 3 min to process 8,250 docs
- 1 core: 2,750 docs/min
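The per-core throughput figures above can be rechecked with a few lines of arithmetic. The sketch assumes that "33w" and "6.6w" stand for 330,000 and 66,000 documents (consistent with 24 x 13,750 and 8 x 8,250); the function name is hypothetical.

```python
def docs_per_core_minute(total_docs, cores, seconds):
    """Average number of documents one core processes per minute."""
    return total_docs / cores / (seconds / 60.0)


# Assumed document counts: 330,000 for the 24-core run, 66,000 for the 8-core run.
run_24 = docs_per_core_minute(330_000, 24, 1393.446886)
run_8 = docs_per_core_minute(66_000, 8, 177.767400)

if __name__ == "__main__":
    print(f"24 cores: {run_24:.1f} docs/min per core")
    print(f" 8 cores: {run_8:.1f} docs/min per core")
```

Using the exact 177.7674 s instead of the rounded 3 minutes gives a per-core rate somewhat above the 2,750 docs/min noted for the 8-core run.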
- Patterns
- entity types
- dynamic (new) vs. static
- related information, related to some patterns
- exact information, target entities, mention disambiguation
- Pattern coverage
- Pattern use
- Statistics of samples
- Distribution of features
- Number of features/wiki
- How to sample data from wiki for testing?
- Testing effectiveness and efficiency
- Efficiency: enough documents, number of documents created by humans per second, ...
- Identifying constant and unique relations by using time-series text
- PATTY: a taxonomy of relational patterns with semantic types
- TREC KBA
- MongoDB
- Stanford parser
- Python
- RDFLib
- TextBlob
- Aptagger
- NLTK
- SimpleJson
- DBpedia
- Download
- DBpedia Ontology
- Raw Infobox Property
- Raw Infobox Property Definitions
- Persondata
- Download
- Wikipedia dump
- 20130304
- enwiki-20130304-pages-articles
- 20130304
- Wiki API,1961414
- 01-05,58115
- 06-10,126706
- 11-15,195114
- 15-18,262784
- 19-20,192237
- 21-22,206117
- 23,129647
- 24,142679
- 25,132130
- 26,125104
- 27,390781
- PATTY
- YAGO
- YAGO Facts