Life of a dataset

Note: this document assumes familiarity with how statistics are represented in Data Commons and with the MCF format.

This tutorial walks through the process of structuring and inserting data into the Data Commons graph.

As a prerequisite, you should understand the dataset and have an idea of how to map its location entities to Data Commons entities and its measures to Data Commons statistical variables.

Define Statistical Variables

If you are adding new types of data to the knowledge graph, you might need to define new statistical variables. You can browse all existing variables in the Statistical Variable Explorer.

Each statistical variable DCID should be human-readable and should encapsulate the meaning of the variable's triples. The naming rules are summarized in this doc.
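
To make the idea of a human-readable DCID concrete, the following is a minimal sketch that renders one statistical variable definition as MCF text. The helper `stat_var_mcf` is hypothetical, and the example variable (a count of female persons) and its property values are assumptions chosen for illustration; check the naming rules and the existing schema before defining anything new.

```python
def stat_var_mcf(dcid, population_type, measured_property,
                 stat_type="measuredValue", constraints=None):
    """Render one StatisticalVariable node as MCF text (hypothetical helper)."""
    lines = [
        f"Node: dcid:{dcid}",
        "typeOf: dcs:StatisticalVariable",
        f"populationType: dcs:{population_type}",
        f"measuredProperty: dcs:{measured_property}",
        f"statType: dcs:{stat_type}",
    ]
    # Constraint properties narrow the population, e.g. gender = Female.
    for prop, value in sorted((constraints or {}).items()):
        lines.append(f"{prop}: dcs:{value}")
    return "\n".join(lines)


# Illustrative values only: the DCID reads as "count of persons who are female".
node = stat_var_mcf("Count_Person_Female", "Person", "count",
                    constraints={"gender": "Female"})
print(node)
```

Note how the DCID mirrors the triples: measured property, then population type, then the constraint value.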

When the variables are finalized, they get checked into the schema repo.

Template MCF with tabular data (CSV)

Template MCF is essentially a mapping file that instructs how to convert the data in a CSV into graph nodes for ingestion into Data Commons. For additional information, read Template MCF.

The raw CSV often needs pre-processing before it can be imported. A simple example cleaning script is here.

There are no restrictions on your approach for this step; the only requirement is that each property value in the TMCF maps to a single CSV column (as illustrated in the examples in MCF format).

The general guidelines are:

  1. A property in a Template MCF node should have a constant value (like typeOf), refer to another node (like E:Dataset->E1), or refer to a CSV column for its value (like C:Dataset->col_name).
  2. Dates must be in ISO 8601 format: "YYYY-MM-DD", "YYYY-MM", etc.
  3. References to existing nodes in the graph must be dcids.
  4. The cleaning script should be reproducible and easy to run. Python or Golang is recommended.
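
The guidelines above can be sketched as a small cleaning script. This is an illustration under stated assumptions, not the linked example script: the input column names (`State`, `Date`, `Count`), the `MM/DD/YYYY` source date format, the output column names, and the `PLACE_TO_DCID` lookup table are all hypothetical.

```python
import csv
import io
from datetime import datetime

# Hypothetical lookup from source place names to Data Commons DCIDs.
PLACE_TO_DCID = {"California": "geoId/06", "Texas": "geoId/48"}

# Hypothetical raw input; a real script would read the source file instead.
RAW = (
    "State,Date,Count\n"
    'California,01/15/2020,"1,234"\n'
    "Texas,02/01/2020,567\n"
    "Atlantis,02/01/2020,1\n"
)


def clean(raw_csv_text):
    """Return cleaned CSV text with DCID places and ISO 8601 dates."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["place", "date", "value"],
                            lineterminator="\n")
    writer.writeheader()
    for row in csv.DictReader(io.StringIO(raw_csv_text)):
        dcid = PLACE_TO_DCID.get(row["State"].strip())
        if dcid is None:
            continue  # Skip places that cannot be resolved to a DCID.
        # Convert the assumed MM/DD/YYYY source format to YYYY-MM-DD.
        iso_date = datetime.strptime(row["Date"].strip(), "%m/%d/%Y").date().isoformat()
        # Strip thousands separators so the value is a plain number.
        writer.writerow({"place": dcid, "date": iso_date,
                         "value": row["Count"].replace(",", "")})
    return out.getvalue()


cleaned = clean(RAW)
print(cleaned)
```

Because the script depends only on the standard library and takes its input as text, it is straightforward to rerun whenever the source data is refreshed.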

There are two ways to map statistical variables in TMCF:

  1. Each StatisticalVariable has its own column for its observed value. So, there are as many TMCF StatVarObservation nodes as variables. For an example, see this TMCF and the corresponding CSV.
  2. The StatisticalVariable DCIDs are included in CSV values, such that there is a single TMCF StatVarObservation node that points to the variable column. For an example, see this TMCF and the corresponding CSV.
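
The difference between the two layouts can be sketched by generating the TMCF text for each. This is a hedged illustration: the helper functions, the table name `Dataset`, the CSV column names (`place`, `date`, `value`, `variable`, `male_count`, `female_count`), and the two variable DCIDs are assumptions made for the example, not the contents of the linked files.

```python
def tmcf_per_variable(table, column_to_statvar):
    """Layout 1: one StatVarObservation node per value column (hypothetical helper)."""
    nodes = []
    for index, (column, stat_var) in enumerate(column_to_statvar.items()):
        nodes.append("\n".join([
            f"Node: E:{table}->E{index}",
            "typeOf: dcs:StatVarObservation",
            f"variableMeasured: dcs:{stat_var}",   # constant per node
            f"observationAbout: C:{table}->place",
            f"observationDate: C:{table}->date",
            f"value: C:{table}->{column}",         # each node reads its own column
        ]))
    return "\n\n".join(nodes)


def tmcf_variable_column(table):
    """Layout 2: a single node whose variable comes from a CSV column."""
    return "\n".join([
        f"Node: E:{table}->E0",
        "typeOf: dcs:StatVarObservation",
        f"variableMeasured: C:{table}->variable",  # DCID is read from the CSV
        f"observationAbout: C:{table}->place",
        f"observationDate: C:{table}->date",
        f"value: C:{table}->value",
    ])


wide = tmcf_per_variable("Dataset", {
    "male_count": "Count_Person_Male",
    "female_count": "Count_Person_Female",
})
long = tmcf_variable_column("Dataset")
print(wide)
print()
print(long)
```

The first layout keeps the CSV wide (one row per place and date) at the cost of a longer TMCF; the second keeps the TMCF fixed in size while the CSV grows one row per place, date, and variable.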

TIP: To represent DC strings and repeated values in a CSV field, refer to these CSV Formatting Tips.

Validate the artifacts

Use the dc-import tool to validate the artifacts. Running it generates report.json and summary_report.html, which contain counters for warnings and errors along with summary statistics.
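
Once the report exists, a quick scan of its counters can be scripted. The sketch below deliberately assumes nothing about the schema of report.json, since that is not described here: it walks any parsed JSON and lists every numeric leaf with its path. The helper `numeric_leaves` is hypothetical, and the sample JSON is placeholder data, not real dc-import output.

```python
import json


def numeric_leaves(node, path=""):
    """Yield (path, number) for every numeric leaf in parsed JSON."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from numeric_leaves(child, f"{path}/{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from numeric_leaves(child, f"{path}/{index}")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        # Booleans are excluded because bool is a subclass of int in Python.
        yield path, node


# Placeholder JSON for illustration only; NOT the real report.json layout.
placeholder = json.loads(
    '{"group": {"counter_a": 3, "counter_b": 0}, "items": [1, "text", true]}'
)
leaves = dict(numeric_leaves(placeholder))
for leaf_path, value in leaves.items():
    print(leaf_path, value)
```

To apply this to a real run, replace the placeholder with the result of `json.load` on the report.json file that the tool wrote.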

Send Pull Requests

Create a Pull Request (PR) against https://github.com/datacommonsorg/data that contains the Template MCF file, the cleaned CSV, its preprocessing script, and the README (template), all placed under the appropriate scripts/<provenance>/<dataset> subdirectory. If you wrote a script to automate the generation of the TMCF, please include that as well.

In the PR, please also include the validation results (report.json and summary_report.html).

If you introduced new statistical variables, please create a Pull Request for them in the schema repo.

Alternate approach: Generate Instance MCF

In some cases, a dataset is so unstructured that it makes sense to skip the Template MCF / CSV approach and generate the instance MCF directly. For example, data from biological sources frequently needs to be formatted directly as MCF.

In this case, the cleaning script should do more of the heavy lifting to generate instance MCFs. An example of such a script is here.
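
As a rough illustration of this approach (not the linked script), the sketch below turns already-parsed records into instance MCF text. The helper `to_instance_mcf`, the record layout, and the `example/...` DCIDs and `ExampleType` type are hypothetical placeholders; the sketch also assumes that string values contain no double quotes.

```python
def to_instance_mcf(records):
    """Render parsed records as instance MCF text (hypothetical helper)."""
    blocks = []
    for record in records:
        lines = [
            f"Node: dcid:{record['dcid']}",
            f"typeOf: dcs:{record['type']}",
        ]
        for prop, value in record.get("properties", {}).items():
            if isinstance(value, str) and value.startswith("dcid:"):
                lines.append(f"{prop}: {value}")      # reference to another node
            elif isinstance(value, str):
                lines.append(f'{prop}: "{value}"')    # plain text value, quoted
            else:
                lines.append(f"{prop}: {value}")      # numeric value
        blocks.append("\n".join(lines))
    # Nodes are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


# Hypothetical records, standing in for whatever the source parser produces.
records = [
    {"dcid": "example/item1", "type": "ExampleType",
     "properties": {"name": "First item", "exampleScore": 0.5}},
    {"dcid": "example/item2", "type": "ExampleType",
     "properties": {"name": "Second item", "exampleRelatedTo": "dcid:example/item1"}},
]
mcf_text = to_instance_mcf(records)
print(mcf_text)
```

Keeping the parsing of the source format separate from the MCF rendering, as here, makes it easier to rerun the script and to inspect the generated nodes before validating them.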