support for CLDF v1.3
xrotwang committed Jan 22, 2024
1 parent 12470fe commit fdd8527
Showing 24 changed files with 1,057 additions and 145 deletions.
114 changes: 72 additions & 42 deletions docs/dataset.rst
@@ -1,43 +1,50 @@
`pycldf.dataset`
================

.. py:currentmodule:: pycldf.dataset
The core object of the API, bundling most access to CLDF data, is
the :class:`pycldf.Dataset` . In the following we'll describe its
the :class:`.Dataset`. In the following we'll describe its
attributes and methods, bundled into thematic groups.


Dataset initialization
~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pycldf.dataset.Dataset
.. autoclass:: Dataset
:members: __init__, in_dir, from_metadata, from_data
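
A minimal sketch of typical initialization; the file and directory names below are made up for
illustration:

.. code-block:: python

    from pycldf import Dataset, StructureDataset

    # Instantiate a dataset from an existing metadata file:
    dataset = Dataset.from_metadata('mydata/cldf-metadata.json')

    # Or from a single, "metadata-free" CLDF data file:
    dataset = Dataset.from_data('values.csv')

    # Or create a fresh dataset skeleton with default metadata in a directory:
    new_dataset = StructureDataset.in_dir('my_new_dataset')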


Accessing dataset metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pycldf.Dataset
:noindex:
:members: directory, module, version, metadata_dict, properties, bibpath, bibname
.. autoproperty:: Dataset.directory
.. autoproperty:: Dataset.module
.. autoproperty:: Dataset.version
.. autoproperty:: Dataset.metadata_dict
.. autoproperty:: Dataset.properties
.. autoproperty:: Dataset.bibpath
.. autoproperty:: Dataset.bibname
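
A short example of reading these properties, assuming ``dataset`` is an initialized
:class:`.Dataset` whose metadata declares a ``dc:title``:

.. code-block:: python

    print(dataset.module)                        # e.g. 'StructureDataset'
    print(dataset.version)                       # CLDF spec version, e.g. 'v1.3'
    print(dataset.properties.get('dc:title'))    # common properties from the metadata file
    print(dataset.bibname, dataset.bibpath)      # name and path of the sources file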


Accessing schema objects: components, tables, columns, etc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similar to *capability checks* in programming languages that use
`duck typing <https://en.wikipedia.org/wiki/Duck_typing>`_, it is often necessary
to access a datasets schema, i.e. its tables and columns to figure out whether
the dataset fits a certain purpose. This is supported via a `dict`-like interface provided
by :class:`pycldf.Dataset`, where the keys are table specifiers or pairs (table specifier, column specifier).
to access a dataset's schema, i.e. its tables and columns, to figure out whether
the dataset fits a certain purpose. This is supported via a
`mapping <https://docs.python.org/3/glossary.html#term-mapping>`_-like interface provided
by :class:`.Dataset`, where the keys are table specifiers or pairs (table specifier, column specifier).
A *table specifier* can be a table's component name or its `url`; a *column specifier* can be a column
name or its `propertyUrl`.

* check existence with `in`:
* check existence with ``in``:

.. code-block:: python
if 'ValueTable' in dataset: pass
if ('ValueTable', 'Language_ID') in dataset: pass
if 'ValueTable' in dataset: ...
if ('ValueTable', 'Language_ID') in dataset: ...
* retrieve a schema object with item access:

@@ -46,89 +53,101 @@ name or its `propertyUrl`.
table = dataset['ValueTable']
column = dataset['ValueTable', 'Language_ID']
* retrieve a schema object or a default with `.get`:
* retrieve a schema object or a default with :meth:`.Dataset.get`:

.. code-block:: python
table_or_none = dataset.get('ValueTableX')
column_or_none = dataset.get(('ValueTable', 'Language_ID'))
* remove a schema object with `del`:
* remove a schema object with ``del``:

.. code-block:: python
del dataset['ValueTable', 'Language_ID']
del dataset['ValueTable']
Note: Adding schema objects is **not** supported via key assignment, but with a set of specialized
methods described in :ref:`Editing metadata and schema`.
.. note::

Adding schema objects is **not** supported via key assignment, but with a set of specialized
methods described in :ref:`Editing metadata and schema`.

.. autoclass:: pycldf.Dataset
:noindex:
:members: tables, components, __getitem__, __contains__, get, get_foreign_key_reference, column_names, readonly_column_names
.. autoproperty:: Dataset.tables
.. autoproperty:: Dataset.components
.. automethod:: Dataset.__getitem__
.. automethod:: Dataset.__delitem__
.. automethod:: Dataset.__contains__
.. automethod:: Dataset.get
.. automethod:: Dataset.get_foreign_key_reference
.. autoproperty:: Dataset.column_names
.. autoproperty:: Dataset.readonly_column_names
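
The following sketch illustrates the less obvious accessors. It assumes a dataset whose
``ValueTable`` has a ``Language_ID`` column acting as foreign key into a ``LanguageTable``, and
that :meth:`.Dataset.get_foreign_key_reference` returns the referenced (table, column) pair:

.. code-block:: python

    # Column specifiers may be given as propertyUrl, too:
    col = dataset['ValueTable', 'http://cldf.clld.org/v1.0/terms.rdf#languageReference']

    # Figure out which (table, column) a foreign key points to:
    table, column = dataset.get_foreign_key_reference('ValueTable', 'Language_ID')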


Editing metadata and schema
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In many cases, editing the metadata of a dataset is as simple as editing
:meth:`~pycldf.dataset.Dataset.properties`, but for the somewhat complex
:meth:`.Dataset.properties`, but for the somewhat complex
formatting of provenance data, we provide the shortcut
:meth:`~pycldf.dataset.Dataset.add_provenance`.
:meth:`.Dataset.add_provenance`.

Likewise, `csvw.Table` and `csvw.Column` objects in the dataset's schema can
Likewise, ``csvw.Table`` and ``csvw.Column`` objects in the dataset's schema can
be edited "in place", by setting their attributes or adding to/editing their
`common_props` dictionary.
``common_props`` dictionary.
Thus, the methods listed below are concerned with adding and removing tables
and columns.

.. autoclass:: pycldf.Dataset
:noindex:
:members: add_table, remove_table, add_component, add_columns, remove_columns, rename_column, add_foreign_key, add_provenance,
.. automethod:: Dataset.add_table
.. automethod:: Dataset.remove_table
.. automethod:: Dataset.add_component
.. automethod:: Dataset.add_columns
.. automethod:: Dataset.remove_columns
.. automethod:: Dataset.rename_column
.. automethod:: Dataset.add_foreign_key
.. automethod:: Dataset.add_provenance
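
A sketch of a typical schema-editing session; the custom ``Variety_ID`` column and the provenance
payload are made up for illustration:

.. code-block:: python

    # Add a component, i.e. a table with a role defined by the CLDF ontology:
    dataset.add_component('LanguageTable')

    # Add a custom column to an existing table ...
    dataset.add_columns('ValueTable', {'name': 'Variety_ID', 'datatype': 'string'})

    # ... and link it to the LanguageTable via a foreign key:
    dataset.add_foreign_key('ValueTable', 'Variety_ID', 'LanguageTable', 'ID')

    # Record provenance information (the payload here is purely illustrative):
    dataset.add_provenance(wasDerivedFrom={
        'rdf:about': 'https://example.org/source-data',
        'rdf:type': 'prov:Entity',
    })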


Adding data
~~~~~~~~~~~

The main method to persist data as CLDF dataset is :meth:`~pycldf.Dataset.write`,
The main method to persist data as a CLDF dataset is :meth:`.Dataset.write`,
which accepts data for all CLDF data files as input. This does not include
sources, though. These must be added using :meth:`~pycldf.Dataset.add_sources`.
sources, though. These must be added using :meth:`.Dataset.add_sources`.

.. automethod:: Dataset.add_sources

.. autoclass:: pycldf.Dataset
:noindex:
:members: add_sources
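
For example (the BibTeX key and fields are made up):

.. code-block:: python

    from pycldf import Source

    dataset.add_sources(
        Source('book', 'Meier2005', author='Hans Meier', year='2005', title='The Book'))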


Reading data
~~~~~~~~~~~~

Reading rows from CLDF data files, honoring the datatypes specified in the schema,
is already implemented by `csvw`. Thus, the simplest way to read data is iterating
over the `csvw.Table` objects. However, this will ignore the semantic layer provided
over the ``csvw.Table`` objects. However, this will ignore the semantic layer provided
by CLDF. E.g. a CLDF languageReference linking a value to a language will appear
in the `dict` returned for a row under the local column name. Thus, we provide several
in the ``dict`` returned for a row under the local column name. Thus, we provide several
more convenient methods to read data.

.. autoclass:: pycldf.Dataset
:noindex:
:members: iter_rows, get_row, get_row_url, objects, get_object
.. automethod:: Dataset.iter_rows
.. automethod:: Dataset.get_row
.. automethod:: Dataset.get_row_url
.. automethod:: Dataset.objects
.. automethod:: Dataset.get_object
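
A sketch of the two most common access patterns, assuming the dataset has a ``ValueTable`` and a
``LanguageTable`` with a CLDF name column:

.. code-block:: python

    # Iterate over rows, adding keys for the requested CLDF terms:
    for row in dataset.iter_rows('ValueTable', 'languageReference', 'parameterReference', 'value'):
        print(row['languageReference'], row['value'])

    # Or access rows as ORM objects, which resolve references to related objects:
    for language in dataset.objects('LanguageTable'):
        print(language.cldf.name)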


Writing (meta)data
~~~~~~~~~~~~~~~~~~

.. autoclass:: pycldf.Dataset
:noindex:
:members: write, write_metadata, write_sources
.. automethod:: Dataset.write
.. automethod:: Dataset.write_metadata
.. automethod:: Dataset.write_sources
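
A minimal write sketch; the keyword arguments name the tables to write, IDs and values are made
up, and both components are assumed to be part of the schema:

.. code-block:: python

    dataset.write(
        ValueTable=[
            {'ID': '1', 'Language_ID': 'lang1', 'Parameter_ID': 'param1', 'Value': 'yes'},
        ],
        LanguageTable=[
            {'ID': 'lang1', 'Name': 'Some Language'},
        ],
    )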


Reporting
~~~~~~~~~

.. autoclass:: pycldf.Dataset
:noindex:
:members: validate, stats
.. automethod:: Dataset.validate
.. automethod:: Dataset.stats
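
For example, assuming :meth:`.Dataset.validate` accepts a ``log`` keyword to report problems via a
logger rather than raising:

.. code-block:: python

    import logging

    ok = dataset.validate(log=logging.getLogger(__name__))
    print(dataset.stats())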


Dataset discovery
@@ -147,7 +166,7 @@ Sources
~~~~~~~

When constructing sources for a CLDF dataset in Python code, you may pass
:class:`pycldf.Source` instances into :meth:`pycldf.Dataset.add_sources`,
:class:`pycldf.Source` instances into :meth:`Dataset.add_sources`,
or use :meth:`pycldf.Reference.__str__` to format a row's `source` value
properly.
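
A small sketch, assuming :class:`pycldf.Reference` is constructed from a source and an optional
page specification:

.. code-block:: python

    from pycldf import Source, Reference

    src = Source('book', 'Meier2005', author='Hans Meier', year='2005', title='The Book')
    dataset.add_sources(src)
    print(str(Reference(src, '12-15')))   # e.g. 'Meier2005[12-15]'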

@@ -169,8 +188,19 @@ in its `sources` attribute.
Subclasses supporting specific CLDF modules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

Most functionality provided through properties and methods described below is implemented via
the :mod:`pycldf.orm` module and is thus subject to the limitations listed at `<./orm.html>`_.

.. autoclass:: pycldf.Generic
:members:

.. autoclass:: pycldf.Wordlist
:members:

.. autoclass:: pycldf.StructureDataset
:members:

.. autoclass:: pycldf.TextCorpus
:members:
2 changes: 2 additions & 0 deletions setup.cfg
@@ -40,6 +40,8 @@ install_requires =
clldutils>=3.9
uritemplate>=3.0
python-dateutil
# pybtex requires setuptools, but doesn't seem to declare this.
setuptools
pybtex
requests
newick
8 changes: 8 additions & 0 deletions src/pycldf/components/ExampleTable-metadata.json
@@ -61,6 +61,14 @@
"dc:description": "References the language of the translated text",
"datatype": "string"
},
{
"name": "LGR_Conformance",
"required": false,
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#lgrConformance",
"dc:extent": "singlevalued",
"dc:description": "The level of conformance of the example with the Leipzig Glossing Rules",
"datatype": {"base": "string", "format": "WORD_ALIGNED|MORPHEME_ALIGNED"}
},
{
"name": "Comment",
"required": false,
45 changes: 45 additions & 0 deletions src/pycldf/components/ParameterNetwork-metadata.json
@@ -0,0 +1,45 @@
{
"url": "parameter_network.csv",
"dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ParameterNetwork",
"dc:description": "Rows in this table describe edges in a network of parameters.",
"tableSchema": {
"columns": [
{
"name": "ID",
"required": true,
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id",
"datatype": {
"base": "string",
"format": "[a-zA-Z0-9_\\-]+"
}
},
{
"name": "Description",
"required": false,
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#description",
"datatype": "string"
},
{
"name": "Target_Parameter_ID",
"required": true,
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#targetParameterReference",
"dc:description": "References the target node of the edge.",
"datatype": "string"
},
{
"name": "Source_Parameter_ID",
"required": true,
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#sourceParameterReference",
"dc:description": "References the source node of the edge.",
"datatype": "string"
},
{
"name": "Edge_Is_Directed",
"required": false,
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#edgeIsDirected",
"dc:description": "Flag signaling whether the edge is directed or undirected.",
"datatype": {"base": "boolean", "format": "Yes|No"}
}
]
}
}
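
Once this component specification is available, adding and filling a parameter network might look
roughly like the following sketch; the parameter and edge IDs are made up, and the column names
follow the table definition above:

.. code-block:: python

    # Both components are assumed to be addable by their component names:
    dataset.add_component('ParameterTable')
    dataset.add_component('ParameterNetwork')

    dataset.write(
        ParameterTable=[
            {'ID': 'p1', 'Name': 'Parameter 1'},
            {'ID': 'p2', 'Name': 'Parameter 2'},
        ],
        ParameterNetwork=[
            {'ID': 'e1', 'Source_Parameter_ID': 'p1', 'Target_Parameter_ID': 'p2',
             'Edge_Is_Directed': True},
        ],
    )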
