etymology-db

Downloads: (Last generated 2023-12-05)
Gzipped CSV
Parquet

A structured, comprehensive, and multilingual etymology dataset created by parsing Wiktionary's etymology sections. Key features:

4.2+ million etymological relationships between 2.0+ million terms in 3300+ languages/dialects
31 different types of etymological relations, distinguishing between inheritance, borrowing, etc.
Hierarchical data that preserves relationship structures, such as the evolution of a term across languages

Caveat for people interested in using this for research: all information is pulled directly from Wiktionary via semi-structured text parsing, and I've made no effort yet to validate any particular result. That said, I would love for this to be useful, so please raise an issue if you have questions.

Here is a description of the table schema:

Column Name	Description
term_id	A hash of the term and its language.
lang	The language/dialect of the term.
term	The term itself. Usually a word, but can also be a prefix or a multi-word expression, hence "term" instead of word.
reltype	The kind of etymological relation being specified (see below for details on each possible value).
related_term_id	A hash of the related term and its language (useful for assembling relationships across multiple terms).
related_lang	The language/dialect of the related term. NULL for parent root nodes.
related_term	The term that is etymologically related to the original entry. NULL for parent root nodes.
position	Zero-indexed position of the term when the relation is made up of multiple terms (e.g. a compound).
group_tag	Randomly generated ID. populated only for the root nodes of nested relationships.
parent_tag	If this relation is inside of a nested structure, this will be populated with the `group_tag` of its immediate parent. NULL otherwise.
parent_position	Zero-indexed position of the relation inside of its nested structure. NULL if not nested.

And here is a description of each relation type. Note that these are all derived directly from Wiktionary's etymology templates -- all rows are classified according to the name of the template from which the info was extracted, and no further inferences are made. The only exception to this is the group relations, which are based on formulaic and reoccuring patterns in natural language sections.

Relation Type	Description
inherited_from	Indicates that `term` has an unbroken chain of inheritance from `related_term`.
borrowed_from	Indicates that `term` is a loanword borrowed during the time the borrowing language was spoken.
derived_from	A catch-all for a derivation relationship that is not specifically inherited/borrowed.
learned_borrowing_from	Borrowed from the original language via atypical ("inorganic") means of language contact.
semi_learned_borrowing_from	Borrowed words that have been partly reshaped by later sound change or analogy with inherited terms. /
orthographic_borrowing_from	Borrows the spelling of `related_term` but not the pronunciation.
unadapted_borrowing_from	Borrowed words that have not been conformed to the morpho-syntactic, phonological and/or phonotactical rules of the target language.
root	Constructed root(s) of `term` in a theoretical ancestor language, e.g. Proto-Indo-European.
has_prefix	Indicates that `term` is partially based on the prefix `related_term`.
has_prefix_with_root	`related_term` is the term attached to a prefix (not necessarily a suffix, e.g. "normal" in "abnormal")
has_suffix	Same as above, but for suffixes.
has_suffix_with_root	Same as above, but for suffixes.
has_confix	A confix is a term whose first element is a prefix and whose last is a suffix. Position 0 of a confix is the prefix, and the last position is the suffix.
has_affix	Affix is the general form of prefix/suffix/confix and indicates some kind of compound structure without further detail.
compound_of	Indicates that `related_term` is the `position`-indexed term of a compound that makes up `term`. Used interchangeably with affix above.
back-formation_from	Indicates that `term` was formed from `related_term` by removing a prefix/suffix.
doublet_with	Indicates that `term` and `related_term` have the same etymological origin, especially when the relationship is unintuitive.
is_onomatopoeic	Indicates that `term` is an onomatopoeia (a word form from a sound associated with its meaning).
calque_of	Indicates that `term` is borrowed from `related_term` via a direct word-for-word or root-for-root translation.
semantic_loan_of	Special case of calques in which the word already existed but a new meaning was added.
named_after	Indicates that `term` is based on the name of the person `related_term` (an eponym).
phono-semantic_matching_of	Indicates that `term` and `related_term` have very similar sounds and meanings in both languages.
etymologically_related_to	A catch-all indicating `term` and `related_term` are etymologically related without any further context provided.
blend_of	Indicates that `term` is made up of a blend of `related_term` and the related terms in other `position`s. This differs from a compound in that the beginning of one word is combined with the ending of another.
clipping_of	Indicates that `term` is a spoken shortened version of `related_term` without any semantic difference.
abbreviation_of	Indicates that `term` is a written shortened version of `related_term` without any semantic difference.
initialism_of	Indicates that `term` is based on the initials of `related_term`.
cognate_of	Indicates that `term` and `related_term` sound/mean similar things, but no direct ancestral relationship exists.
group_affix_root	A node that groups together rows that, when combined, form an affix.
group_related_root	A node that groups together rows in which `related_terms` are not just related to the `term`, but to each other as well.
group_derived_root	A node that groups together rows that, when combined, form an unbroken chain of inheritance (in reverse chronological order).

The wiktionary_codes.csv file is manually combined from these two pages:
https://en.wiktionary.org/wiki/Wiktionary:List_of_languages
https://en.wiktionary.org/wiki/Module:etymology_languages/data

All data is licensed under the Creative Commons ShareAlike 3.0 License. All code is licensed under the Apache 2.0 license.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
elements.py		elements.py
main.py		main.py
requirements.txt		requirements.txt
templates.py		templates.py
wiktionary_codes.csv		wiktionary_codes.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

etymology-db

About

Releases 1

Packages

Contributors 2

Languages

License

droher/etymology-db

Folders and files

Latest commit

History

Repository files navigation

etymology-db

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Packages