The initial outline of this repository was based primarily on the comprehensive introduction to knowledge graphs by Hogan, A. et al. (02 July 2021, ACM Computing Surveys, 54(4): 1–37), as well as that introduction's companion examples on GitHub.
Given the term's history, the exact definition of a "knowledge graph" was probably always bound to be contentious. The phrase has been used in the technical literature since at least 1972, and most likely long before, to evoke a loosely sketched idea linking "knowledge" and "graphs". It builds on visionary notions of the "information society" and on ideas that have been floating around practically forever, such as the hyperlinks of Project Xanadu in 1965, which harked back to Vannevar Bush's popular 1945 essay, As We May Think. Regardless of its origins, the modern popularized incarnation of the phrase in industry parlance seems to stem from the 2012 announcement of the Google Knowledge Graph, echoed by follow-on announcements of "yeah, we did that Google stuff here, too" knowledge graphs from Airbnb, Amazon, eBay, Facebook, IBM, LinkedIn, Microsoft, Uber, and more besides. Even though its use might have been spurred by hypesters and evangelizers, the growing industrial uptake of the concept eventually proved difficult for even academia to ignore.
More and more scientific literature is being published on knowledge graphs, including books as well as papers outlining definitions, novel techniques, and surveys of specific aspects. There is also a REST API of academic graphs based on the Semantic Scholar Open Data Platform, the largest open scientific graph to date, and, built from that data platform, a ConnectedPapers visual overview of the literature on knowledge graphs.
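As a rough illustration of what querying that REST API can look like, the sketch below builds a keyword-search request against the Semantic Scholar Academic Graph API using only the Python standard library. The endpoint path and the `query`, `fields`, and `limit` parameters are assumptions taken from the public API documentation and should be checked against the current version; the helper names `build_search_url` and `search_papers` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed paper-search endpoint of the Semantic Scholar Academic Graph API.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query, fields=("title", "year", "citationCount"), limit=5):
    """Return the GET URL for a keyword search over papers."""
    params = {"query": query, "fields": ",".join(fields), "limit": limit}
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def search_papers(query, **kwargs):
    """Fetch one page of search results (requires network access)."""
    url = build_search_url(query, **kwargs)
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    # Matching papers are expected under the "data" key of the response.
    return payload.get("data", [])


if __name__ == "__main__":
    # Only builds and prints the URL; call search_papers() to hit the network.
    print(build_search_url("knowledge graph"))
```

Keeping URL construction separate from the network call makes the request easy to inspect and test offline; unauthenticated requests may be rate limited.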
The history of the Semantic Scholar Open Data Platform helps us understand the de facto standard for a sophisticated data processing pipeline, one that continually ingests documents and metadata from numerous very large input sources (arXiv, for example, is one of more than fifty). The pipeline extracts full text and metadata from PDFs; normalizes and disambiguates authors, institutions, and venues; classifies each paper's field of study; generates a textual summary of its key results; and more. Sources may provide metadata in the Journal Article Tag Suite (JATS) format or in a variety of proprietary formats. One additional important source is human-created data from the manual corrections that are part of the curation process.
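To make the idea of parsing source metadata into a normalized format concrete, here is a minimal sketch that reads a small JATS-style XML record with the standard library and flattens it into a plain dictionary. The sample record, the output field names, and the `normalize_jats` helper are all hypothetical; real JATS files carry far more structure, and this is not a description of how Semantic Scholar implements the step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed JATS-style record used only for illustration.
SAMPLE_JATS = """
<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Examples</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group><article-title>An Example Paper</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Example</surname><given-names>Ada</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Sample</surname><given-names>Bo</given-names></name>
        </contrib>
        <contrib contrib-type="editor">
          <name><surname>Editor</surname><given-names>Cy</given-names></name>
        </contrib>
      </contrib-group>
      <pub-date><year>2021</year></pub-date>
    </article-meta>
  </front>
</article>
"""


def normalize_jats(xml_text):
    """Flatten a JATS-style record into a small normalized dictionary."""
    root = ET.fromstring(xml_text)
    authors = []
    for contrib in root.iter("contrib"):
        # Keep only contributors explicitly marked as authors.
        if contrib.get("contrib-type") != "author":
            continue
        given = contrib.findtext("name/given-names", default="").strip()
        surname = contrib.findtext("name/surname", default="").strip()
        authors.append(" ".join(part for part in (given, surname) if part))
    year = root.findtext(".//article-meta/pub-date/year")
    return {
        "title": (root.findtext(".//article-title") or "").strip(),
        "authors": authors,
        "venue": (root.findtext(".//journal-title") or "").strip(),
        "year": int(year) if year else None,
    }


if __name__ == "__main__":
    print(normalize_jats(SAMPLE_JATS))
```

Each proprietary source format would need its own small parser that emits the same dictionary shape, so that the later pipeline stages only ever see one record layout.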
The first task of the pipeline is to fetch the latest data from each source and parse it into a normalized format. Sources typically provide only limited information about a paper in structured form: usually the title, author names, venue, and date, often linked to a PDF file. A critical output of the Semantic Scholar PDF content extraction is a structured bibliography, from which the citation graph is constructed. To accomplish this, Semantic Scholar uses several open source toolkits, including:
-
PDFalto is a command-line executable that parses PDF files and produces structured XML representations of their content in the Analyzed Layout and Text Object (ALTO) format, an extension schema used within the Metadata Encoding and Transmission Standard (METS) maintained by the Library of Congress.
-
PDF-Plumber plumbs a PDF for detailed information about each text character, rectangle, line, et cetera, and makes it easy to extract text and tables as well as to debug visually. It works best on machine-generated, rather than scanned, PDFs and is built on pdfminer.six, the actively maintained version of the original Python-based PDFMiner extractor and its pdf2txt.py tool.
-
PDFMiner.Six is a community-maintained fork of the original PDFMiner. The original pdfminer has been dormant since 2020, and pdfminer.six is now the recommended, actively maintained version of the original Python tool. PDFMiner.six focuses on getting and analyzing text data: it extracts the text of a page directly from the source code of the PDF and parses all objects in a PDF document into Python objects in a human-readable way. It can also be used to get the exact location, font, or color of the text. It supports Chinese, Japanese, and Korean (CJK) characters and derivatives, vertical writing, various font types (Type1, TrueType, Type3, and CID), RC4 and AES encryption, and AcroForm interactive form extraction similar to that in PyMuPDF. PDFMiner.six is built in Python in a modular way, so that each component can be replaced easily. Developers can implement their own interpreter or rendering device, or build tools such as PDF-Plumber, that use or extend the power of PDFMiner for specific detailed purposes or that use its Python objects for purposes beyond simple text analysis.
-
Older libraries such as PDFrw, or fPDF in PHP, and hosts of other alternatives preceded PDF-Plumber and PDFMiner.Six. All of them were made necessary by the publishing industry's idiotic flocking to the originally proprietary PDF format, which has been a source of frustration for developers for decades.
-
PyPDF is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It is closely related to PDFly, a CLI tool for extracting (meta)data from PDFs and manipulating PDF files.
-
Pike PDF is a Python library for reading and writing PDFs, powered by QPDF, a content-preserving PDF document transformer that has been around and continuously developed since at least 2005.
-
PyMuPDF is perhaps the most interesting entrant into this realm of PDF parsing, with a comparatively active dev community, eg 367 forks and three active discussion boards on GitHub as of PyMuPDF's v1.22.5 release in June 2023.
https://github.com/kermitt2/grobid
https://github.com/allenai/science-parse
https://github.com/allenai/spv2
https://poppler.freedesktop.org/
https://github.com/Layout-Parser/layout-parser
https://github.com/allenai/VILA
https://github.com/allenai/papermage
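The citation graph that the pipeline builds from each paper's extracted bibliography can be sketched in a few lines. The example below is a toy under stated assumptions: the `PAPERS` input, the paper ids, and the exact-match-on-normalized-title rule are all hypothetical, and a production pipeline would need far more robust reference matching than this.

```python
from collections import defaultdict

# Hypothetical extraction output: paper id -> title plus the titles that
# the PDF parser found in that paper's bibliography.
PAPERS = {
    "p1": {"title": "Graph Basics", "bibliography": []},
    "p2": {"title": "Parsing PDFs at Scale", "bibliography": ["Graph basics."]},
    "p3": {
        "title": "A Survey",
        "bibliography": ["GRAPH BASICS", "Parsing PDFs at scale", "An Unindexed Report"],
    },
}


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def build_citation_graph(papers):
    """Return {citing paper id: sorted list of cited paper ids}."""
    by_title = {normalize_title(p["title"]): pid for pid, p in papers.items()}
    cites = defaultdict(set)
    for pid, paper in papers.items():
        for reference in paper["bibliography"]:
            target = by_title.get(normalize_title(reference))
            # Skip references to unknown papers and accidental self-citations.
            if target is not None and target != pid:
                cites[pid].add(target)
    return {pid: sorted(targets) for pid, targets in cites.items()}


if __name__ == "__main__":
    print(build_citation_graph(PAPERS))
    # {'p2': ['p1'], 'p3': ['p1', 'p2']}
```

References that cannot be resolved to a known paper, like "An Unindexed Report" here, are simply dropped, which is one reason the quality of the extracted bibliography matters so much for the resulting graph.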
A number of surveys, books, tutorials, and so on have been published as tertiary literature relating to knowledge graphs. Some of the related literature provides more detail on particular topics that are not specifically focused on knowledge graphs, and familiarity with those topics is useful as further background reading. The primary focus of this curated list of publications, however, is to provide a broad and accessible introduction to the specific topic of knowledge graphs.