Skip to content

Latest commit

 

History

History
248 lines (165 loc) · 12.3 KB

usage.md

File metadata and controls

248 lines (165 loc) · 12.3 KB

About

The Data Commons import tool is used to analyze and debug files that are developed in the process of importing new datasets to the Data Commons Knowledge Graph.

The tool:

  • Operates on two types of files: instance MCF (.mcf); Template MCF (.tmcf) and corresponding CSV files
  • Performs resolution, syntax and statistics validations
  • Generates instance MCF from template MCF and corresponding CSV files
  • Generates reports on error/warning counters, stats validation and sample time-series charts

The tool is actively used for all data imports that are included in the Data Commons Knowledge Graph. It is under active development, including feature additions and bug fixes.

The tool is a command line application built with Java. See below for usage instructions.

Installation

Make sure you've downloaded the .jar file under Assets here. Note the path to .jar.

Usage

Use the import tool from the command line, like so:

java -jar <path-to-jar> <mode> <list of mcf/tmcf/csv files>

Hint: it can be useful to create an alias for the jar file, such as:

alias dc-import='java -jar <path-to-jar>'

This is the form that will be used in the rest of the documentation.

Hint: to access a concise explanation of usage modes and flags, run dc-import --help

Usage Modes

In lint mode, the import tool validates the artifacts produced for addition to Data Commons. These artifacts include instance MCF files and pairs of template MCF (TMCF) and corresponding CSV files.

In genmcf mode, the import tool produces instance MCF files from a pair of TMCF file, and its associated CSV files. This mode performs all validations that the lint mode would have performed.

Output

Both modes generate two output files:

  • report.json is a detailed log of error/warning counters and associated messages to help locate the source of the counters.
  • summary_report.html includes a summary of the counters from report.json, followed by statistical summaries for sample places. It is meant to be viewed in a web browser.

If input includes statistics (CSV and TMCF files, or MCF files with StatVarObservation nodes are provided), the reports will also include information on statistics from sample places and time-series charts. In genmcf node, generated instance MCF files are written to table_mcf_nodes_{CSV_FILE_NAME}.mcf (if there were no fatal errors).

The output files are placed under a new folder in the current working directory named dc_generated by default.The --output-dir flag (documented below) can be specified to modify the name of this output folder.

Lint Mode (lint)

To run the tool in lint mode, use:

dc-import lint <list of mcf files>

Note that if you are importing a dataset where non-numerical StatVar Observations are expected (for example, statType is measurementResult and, therefore, the SVObs values are references), set --allow-non-numeric-obs-values=true in the command line invocation.

For example, we can use lint to perform syntax validation on a test MCF included in this repository at path tool/src/test/resources/org/datacommons/tool/lint/mcfonly/input/McfOnly.mcf relative to the base of this repo like so:

dc-import lint tool/src/test/resources/org/datacommons/tool/lint/mcfonly/input/McfOnly.mcf

This will output issues found in the input file to dc_generated/report.json and dc_generated/summary_report.html in the current working directory.

Instance MCF Generation Mode (genmcf)

To run the tool in genmcf mode, use:

dc-import genmcf <list of csv/tmcf files>

Optionally, schema file(s) may also be passed. This is required to resolve references to newly introduced schema nodes.

Similar to when using lint mode above, if you are importing a dataset where non-numerical StatVar Observations are expected (for example, statType is measurementResult and, therefore, the SVObs values are references), set --allow-non-numeric-obs-values=true in the command line invocation.

For example, we can use genmcf to perform validations, and generate instance MCF from test files about COVID-19 cases in India.

These test files are:

  • covid.csv at path tool/src/test/resources/org/datacommons/tool/genmcf/statchecks/input/covid.csv relative to the base of this repo.
  • covid.tmcf at path tool/src/test/resources/org/datacommons/tool/genmcf/statchecks/input/covid.tmcf relative to the base of this repo.

From the base of the repo, we issue the following command:

dc-import lint tool/src/test/resources/org/datacommons/tool/genmcf/statchecks/input/covid.csv tool/src/test/resources/org/datacommons/tool/genmcf/statchecks/input/covid.tmcf

This will output issues found in the input to dc_generated/report.json and dc_generated/summary_report.html under the current working directory.

This will also output the instance MCFs generated from the template to dc_generated/table_mcf_nodes_covid.mcf. Note that instance MCF will not be generated if there are any fatal errors in the input files. These fatal errors will instead be logged to report.json and summary_report.html.

Command Line Flags

Flags available to modify the behavior of the tool are listed below. All flags apply to both usage modes (lint and genmcf).

You can also run dc-import --help to see a list of flags in your terminal.

-e, --existence-checks

Checks DCID references to schema nodes against the KG and locally. If this flag is set, then calls will be made to the Staging API server, and instance MCFs get fully loaded into memory.

Suppose the CSV file has a cell value like dcid:Count_Person indicating a reference to a DC entity. This check will ensure that such an entity is defined either in Data Commons KG (in this case it does), or in another instance MCF given as an input.

Defaults to true.

-h, --help

Shows a help message and exit.

-n, --num-threads=<numThreads>

Specifies the number of concurrent threads used for processing CSVs.

You need multiple CSVs to take advantage of concurrent processing.

TIP: In case your generated CSV is very large, you can use the split_csv tool to shard it into multiple files.

Defaults to 1.

-o, --output-dir=<outputDir>

Specifies the directory to write output files.

Default is dc_generated/ within current working directory.

-ep, --existence-checks-place

Specifies whether to perform existence checks for places found in the observationAbout property in StatVarObservation nodes.

Defaults to false.

-s, --stat-checks

Checks integrity of time series by checking for holes, variance in values, etc.

A set of counters detailing the results of the checks will be logged in report.json. For every such counter, the tool will provide a few exemplar cases to help the user understand and resolve the issue(s).

For example, in this test input covid.mcf file, the value of the CumulativeCount_MedicalTest_ConditionCOVID_19_Positive StatVar for place geoId/07 is 3.0 one day, (2020-03-02;line 49), and 7.0 on the next day (2020-03-03; line 65). Because the fluctuation in the value is greater than 100%, the tool flags this as a potential statistical issue (counter: StatsCheck_MaxPercentFluctuationGreaterThan100). This is logged in the resulting report.json as follows:

"statsCheckSummary": [{
    "placeDcid": "geoId/07",
    "statVarDcid": "CumulativeCount_MedicalTest_ConditionCOVID_19_Positive",
    "measurementMethod": "",
    "observationPeriod": "",
    "scalingFactor": "",
    "unit": "",
    "validationCounters": [{
      "counterKey": "StatsCheck_MaxPercentFluctuationGreaterThan100",
      "problemPoints": [{
        "date": "2020-03-02",
        "values": [{
          "value": 3.0,
          "locations": [{
            "file": "covid.mcf",
            "lineNumber": "49"
          }]
        }]
      }, {
        "date": "2020-03-03",
        "values": [{
          "value": 7.0,
          "locations": [{
            "file": "covid.mcf",
            "lineNumber": "65"
          }]
        }]
      }],
      "percentDifference": 133.33
    }]
  }]

Note that information relevant to this check (sample place, file and location of the issue, the values involved, and the exact percent fluctuation) are conveniently provided to assist the user in debugging issues.

Defaults to true.

--allow-non-numeric-obs-values

Allows non-numeric (text or reference) values for StatVarObservation value field.

  • When false, non-numeric values will log an error counter (Sanity_SVObs_Value_NotANumber)
  • When true, these values will be allowed and relevant StatChecks might be performed (depending on the value of --stat-checks).

Defaults to false.

--check-measurement-result

Checks DCID references from StatVarObservation nodes if the StatisticalVariable they are measuring has statType: measurementResult.

If the StatVar definition exists in the local MCF files provided, that will be used. Otherwise, API requests to the Data Commons KG will be made synchronously per unknown StatVar.

Only nodes in sample places are subject to this check.

Defaults to false.

-p, --sample-places=<samplePlaces>

Specifies a list of place dcids to run stats check on.

This flag should only be set if --stat-checks is true. If --stat-checks is true and this flag is not set, 5 sample places are picked for roughly each distinct place type.

-r, --resolution=<resolutionMode>

Specifies the mode of resolution to use: NONE, LOCAL, or FULL.

Resolution refers to the process of assigning DCIDs to every graph node in the input. For StatVarObservation nodes, new DCIDs are generated. For nodes of other types, either the DCIDs must be provided, or the tool will use the Data Commons KG to find the DCID based on an external ID.

As an example of the latter, see the MCF node below where California is referenced using the isoCode property. This will resolve to the dcid of California in Data Commmons (geoId/06) when this flag is set to FULL.

Node: CANode
typeOf: dcs:Place
isoCode: "US-CA"
  • LOCAL: Only resolves local references and generates DCIDs. Notably, this mode does not resolve the external IDs against the DC KG.
  • FULL: Resolves external IDs (such as ISO) in DC, local references, and generated DCIDs. Note that FULL mode may be slower since it makes (batched) DC Recon API calls and performs two passes over the provided CSV files. You should only use this if you have to resolve location entities via external IDs.
  • NONE: Does not resolve references. Use this only if all inputs have DCIDs defined. You rarely want to use this mode.

Defaults to LOCAL.

-sr, --summary-report

Generates an HTML summary report named summary_report.html in the output folder. See the output section above for more details on what is included in the summary report.

Defaults to true.

-V, --version

Prints version information and exit.

--verbose

Prints verbose log.

Defaults to false.