-
Notifications
You must be signed in to change notification settings - Fork 159
Hive JSON SerDe
Randall Whitman edited this page Jul 14, 2017
·
6 revisions
Hive uses SerDes (Serializers/Deserializers) to map both structured and unstructured data into tables that can then be queried in a similar manner to a traditional relational database. JSON is not tabular, but it is structured and the JSON Formats exported from ArcGIS can easily be mapped to rows and columns using the Esri-JSON SerDe provided with these tools.
Here is an example of a table mapped with the SerDe using the Unenclosed Esri JSON format from the JSON Formats overview.
JSON source data
{
"attributes" : {
"OBJECTID" : 1,
"NAME" : "Redlands"
...
},
"geometry" : {
"rings" : [
...
]
}
}
{
"attributes" : {
"OBJECTID" : 2,
"NAME" : "Riverside"
...
},
"geometry" : {
"rings" : [
...
]
}
}
Create a table that maps the NAME
attribute and the geometry from JSON to columns Name
and BoundaryShape
.
CREATE EXTERNAL TABLE IF NOT EXISTS counties (Name string, BoundaryShape binary)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.UnenclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
There are a few things to note in the create statement:
- Column names are not case sensitive. So while the attribute in the JSON is
NAME
, you can reference it asName
. - The geometry is mapped to whichever column is defined with the type
binary
. As a consequence, we only allow one column to be defined asbinary
in order to remove any potential ambiguity. - Attributes that are not defined in the column list will simply be discarded.
- The Hadoop input format
com.esri.json.hadoop.UnenclosedEsriJsonInputFormat
from the Spatial Framework for Hadoop is necessary for the SerDe to work. See InputFormats and RecordReaders.
Once the table is created, you can interact with the JSON as if it were a table.
hive> DESCRIBE counties;
OK
name string from deserializer
boundaryshape binary from deserializer
hive> SELECT Name, ST_GeometryType(BoundaryShape) FROM counties;
OK
Redlands ST_MULTIPOLYGON
Riverside ST_MULTIPOLYGON