Skip to content

Latest commit

 

History

History
156 lines (130 loc) · 5.74 KB

json-to-csv.md

File metadata and controls

156 lines (130 loc) · 5.74 KB

Converting JSON to CSVs

This app uses an algorithm based on analysis of a JSON iterable's schema (generated on the fly) to attempt to convert it into a table.

How JSON nodes are represented in CSV

  • As of v6.0, the CSV files generated by JsonTools use an algorithm similar to RFC 4180, in that all values containing the quote character, \r, or \n must be wrapped in quotes, and quote characters are escaped by doubling them up.
    • Thus the JSON string "foo,\n\"bar\"" would become "foo,\n""bar"""" (with an actual newline) in a CSV file with , as delimiter and " as quote character, because the newline, quote character, and delimiter in the string all necessitate enclosing quotes, and the internal quotes must be doubled up.
  • Floating point numbers are always exported with . as the decimal separator, as in JSON. This is not currently configurable, but may become optional in the future.
  • null is represented as an empty string.
  • Dates are represented in YYYY-MM-DD format (e.g., January 7, 2015 is represented as 2015-01-07)
  • Datetimes are represented in yyyy-MM-dd hh:mm:ss format (e.g., 4:15:35PM April 29 2021 is represented as 2021-04-29 16:15:35)

Pre-v6.0 docs are here.

At present, four strategies for making a table are supported.

Options

  1. The Key Separator (keysep) is a character that joins together the path to a value. In Full Recursive and Default modes, the algorithm recursively searches for child keys and indices, and uses the path as a column name instead of a key.

For example, we might have

{"b": [{"a": [1, 2, 3]}]}

which can be tabularized, but requires recursive search from key b to the first array to key a in the sub-object, yielding the path ["b", "a"].

So with keysep = '.', we get the output

b.a
1
2
3

and with keysep = '_', we get

b_a
1
2
3
  1. The Delimiter in output file (one of '\t', ',', or '|') is the character used to separate the fields in the output file. By default this is ',', so you output a CSV. '\t' would result in a TSV (tab-separated variables).

  2. The Line terminator in output file (one of CRLF, CR, or LF) setting (added in v5.8) is the line terminator. By default this is Unix LF, which was the only line terminator available prior to this release.

  3. Convert booleans to integers takes all instances of true and false and replaces them with 1 and 0, respectively. Can be useful sometimes.

Strategies

Default

Drills down into nested JSON until it finds one of the following:

  • an array of arrays of scalars
  • an array of same-format objects with scalar values
  • an object with at least one key that maps to arrays of scalars

The emitted table will then have all the scalar values along the path to the table, as well as the table itself.

  • For example, something that's already tabular, like
{"a": 1, "b": [1, 2, 3], "c": ["a", "b", "c"]}

will be tabularized to

a,b,c
1,1,a
1,2,b
1,3,c
  • But it can drill down to find the tabular part of this JSON:
[{"a": 1, "b": [1, 2, 3], "c": {"d": "y"}}, {"a": 2, "b": [4, 5, 6], "c": {"d": "z"}}]

will be tabularized to

a,b,c.d
1,1,y
1,2,y
1,3,y
2,4,z
2,5,z
2,6,z

Full recursive

If the JSON is an array of objects or an array of arrays, recursively search each sub-object or sub-array, adding new columns to the emitted table for each path to a terminal node.

If the JSON is not an array, throw an error.

This can make tables with a lot of columns, as shown in this example:

[{"a": 1, "b": [1, 2, 3], "c": {"d": "y"}}, {"a": 2, "b": [4, 5, 6], "c": {"d": "z"}}]

will be tabularized to

a,b.col1,b.col2,b.col3,c.d
1,1,2,3,y
2,4,5,6,z

No recursion

If the JSON is one of the following:

  • an array of arrays of scalars
  • an array of same-format objects with scalar values
  • an object with at least one key that maps to arrays of scalars

make a table containing all the rows. No recursive search will be done - the entire JSON must match this specification.

  • For example,
{"a": 1, "b": [1, 2, 3], "c": ["a", "b", null]}

will be tabularized to

a,b,c
1,1,a
1,2,b
1,3,
  • But
[{"a": 1, "b": [1, 2, 3]}, {"a": 2, "b": [4, 5, 6]}]

will NOT be tabularized because it is an array of objects that don't contain all scalars.

Stringify iterables

This approach is similar to NO_RECURSION, but instead of refusing to run if it encounters non-scalars in the table, it stringifies them.

WARNING: this option does not work well with ',' as the separator! You should use '\t' as the separator instead.

This strategy might be useful if you want to partially deconstruct your JSON so that some fields are their own columns and more complex deeply nested fields are left as JSON, which an RDBMS like PostgreSQL can convert into a JSON column type.

For example, consider

[{"a": 1, "b": [1, 2, 3], "c": {"d": "y"}}, {"a": 2, "b": [4, 5, 6], "c": {"d": "z"}}]

This will be tabularized to

a,b,c
1,"[1, 2, 3]","{""d"": ""y""}"
2,"[4, 5, 6]","{""d"": ""z""}"

Meanwhile,

{"a": 1, "b": [1, 2, 3], "c": ["a", "b", "c"]}

will be tabularized to

a,b,c
1,1,a
1,2,b
1,3,c

which is the same behavior as the No recursion and Default strategies.