The Knowledge Graph Toolkit (KGTK) is a comprehensive framework for the creation and exploitation of large hyper-relational knowledge graphs (KGs), designed for ease of use, scalability, and speed. KGTK represents KGs in tab-separated (TSV) files with four columns: edge-identifier, head, edge-label, and tail. All KGTK commands consume and produce KGs represented in this simple format, so they can be composed into pipelines to perform complex transformations on KGs. KGTK provides:
- a suite of import commands to import Wikidata, RDF and popular graph representations into KGTK format;
- a rich collection of transformation commands that make it easy to clean, union, filter, and sort KGs;
- graph combination commands support efficient intersection, subtraction, and joining of large KGs;
- a query language using a variant of Cypher, optimized for querying KGs stored on disk, that supports efficient ad hoc queries;
- graph analytics commands support scalable computation of centrality metrics such as PageRank and degrees, as well as connected components and shortest paths;
- advanced commands support lexicalization of graph nodes, and computation of multiple variants of text and graph embeddings over the whole graph;
- a suite of export commands supports the transformation of KGTK KGs into commonly used formats, including the Wikidata JSON format, RDF triples, JSON documents for ElasticSearch indexing and graph-tool;
- a development environment using Jupyter notebooks provides seamless integration with Pandas.
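The four-column format and the idea of composing commands into pipelines can be sketched in plain Python. This is only an illustration of the concept, not KGTK's implementation or API: the helper functions are hypothetical, the column names `id`, `node1`, `label`, and `node2` are assumed here to stand for edge-identifier, head, edge-label, and tail, and the sample edges are made up.

```python
import csv
import io

# Illustrative edges only: the identifiers are sample values, and the column
# names are an assumption standing in for edge-identifier, head, edge-label, tail.
COLUMNS = ["id", "node1", "label", "node2"]
SAMPLE_TSV = (
    "id\tnode1\tlabel\tnode2\n"
    "e3\tQ42\tP106\tQ36180\n"
    "e1\tQ42\tP31\tQ5\n"
    "e2\tQ1\tP31\tQ2\n"
)


def read_edges(text):
    """Parse a four-column TSV string into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def filter_by_label(edges, label):
    """Keep only the edges with the given edge label (hypothetical helper)."""
    return [edge for edge in edges if edge["label"] == label]


def sort_by(edges, column):
    """Return the edges sorted by one column (hypothetical helper)."""
    return sorted(edges, key=lambda edge: edge[column])


def write_edges(edges):
    """Serialize edges back to the same four-column TSV layout."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=COLUMNS, delimiter="\t",
        quoting=csv.QUOTE_NONE, lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(edges)
    return out.getvalue()


# Each step reads and returns the same representation, so steps chain freely.
result = write_edges(sort_by(filter_by_label(read_edges(SAMPLE_TSV), "P31"), "id"))
print(result)
```

Because every step takes and returns the same edge representation, steps can be reordered or extended without format conversions, which is the property that makes pipelines of commands possible.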
KGTK can process Wikidata-sized KGs with billions of edges on a laptop. We have used KGTK in multiple use cases, focusing primarily on construction of subgraphs of Wikidata, analysis of over 300 Wikidata dumps since the inception of the Wikidata project, linking tables to Wikidata, construction of a commonsense KG combining multiple existing sources, and creation of Wikidata extensions for food security and the pharmaceutical industry.
KGTK is open source software, well documented, actively used and developed, and released under the MIT license. We invite the community to try KGTK. It is easy to get started with our tutorial notebooks, which are available and executable online.
The following instructions install KGTK and the KGTK Jupyter Notebooks on Linux and macOS systems.
If you want to install KGTK on a Microsoft Windows system, please contact the KGTK team.
Our KGTK installations use a Conda virtual environment. If you don't have the Conda tools installed, follow this guide to install them. We recommend installing Miniconda rather than the full Anaconda installation.
Next, execute the following steps to install the latest stable release of KGTK:
conda create -n kgtk-env python=3.9
conda activate kgtk-env
conda install -c conda-forge graph-tool
conda install -c conda-forge jupyterlab
pip --no-cache install -U kgtk
Please see our installation document for more details. If you encounter problems with your installation, or are interested in a detailed explanation of these commands, read more about the installation procedure here.
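After running the steps above, a quick way to confirm that the package landed in the active environment is to ask Python for its installed version and look for the command-line entry point. The sketch below uses only the standard library; the executable name `kgtk` is an assumption, and the function name is hypothetical.

```python
import shutil
from importlib import metadata


def kgtk_install_status(package="kgtk", executable="kgtk"):
    """Report the installed package version and the CLI path, or None for each if absent."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    # shutil.which returns None when the executable is not on PATH.
    return {"version": version, "cli_path": shutil.which(executable)}


status = kgtk_install_status()
print(status)
```

If either value is `None`, the most likely cause is that the `kgtk-env` environment is not the one currently activated.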
Running `pip install -e .` (development mode) throws an error about 3 libraries:
- thinc
- blis
- tokenizers
Fixed the `thinc` issue by:
a. commenting out [this line in requirements.txt](https://github.com/usc-isi-i2/kgtk/blob/dev/requirements.txt#L11)
b. running `pip install thinc-apple-ops`
Fixed the `tokenizers` issue by running the following commands in the conda environment:
# download and install Rust. Follow the on screen instructions
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python/
pip install setuptools_rust
python setup.py install
Then continue installing KGTK with `pip install -e .`
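To see which of the three libraries named above are still missing from the active environment, a minimal check with the standard library might look like the following. The function name is hypothetical, and the check only tests whether each module can be located, not whether it imports without errors.

```python
from importlib import util


def missing_modules(names=("thinc", "blis", "tokenizers")):
    """Return the module names that cannot be located in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if util.find_spec(name) is None]


print(missing_modules())
```

An empty list means all three modules can be found by the interpreter that ran the check.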
Please refer to this document for installing KGTK with Docker.
You can read our latest documentation online at:
https://kgtk.readthedocs.io/en/latest/
For examples of using KGTK, please see our Tutorial Notebooks.
- See all source code releases
The documentation for the KGTK Text Search API is here.
The documentation for the KGTK Semantic Similarity API is here.
@inproceedings{ilievski2020kgtk,
title={{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis},
author={Ilievski, Filip and Garijo, Daniel and Chalupsky, Hans and Divvala, Naren Teja and Yao, Yixiang and Rogers, Craig and Li, Ronpeng and Liu, Jun and Singh, Amandeep and Schwabe, Daniel and Szekely, Pedro},
booktitle={International Semantic Web Conference},
pages={278--293},
year={2020},
organization={Springer},
url={https://arxiv.org/pdf/2006.00088.pdf}
}