Scripts for generating OGM Aardvark metadata for UMass' GeoBlacklight instance

umass-gis/metadata-scripts

OGM Aardvark Metadata Scripts

Scripts for creating geospatial metadata in the OpenGeoMetadata Aardvark metadata schema for UMass Amherst's GeoBlacklight repository. The workflow is designed to generate metadata for a collection of georeferenced aerial photos in the UMass Amherst MacConnell Aerial Photo Collection.

Required inputs

  • Individual XML files for each resource generated by the Export Metadata Multiple tool in ArcMap, containing bounding coordinates and a unique ID (or information that can be transformed into a unique ID)
  • One or more CSV files with additional metadata, each row containing a unique ID

Workflow

Step | Script | Summary
1 | extractXMLToCSVGetGeoNames.py | Iterates through multiple XML files containing bounding coordinates, extracts relevant data based on tags, uses API queries to retrieve coverage information from the GeoNames database, and aggregates the information into a single CSV.
2 | mergeCSVs.py | Merges the output from Step 1 with another CSV containing additional metadata, based on a shared unique ID.
3 | formatCSVtoAardvark.py | Formats the output from Step 2 into the OGM Aardvark metadata schema.
4 | parseCSVToMultipleJSONs.py | Parses the output from Step 3 into multiple JSON files (one per item) that can be ingested into GeoBlacklight applications.
5 | validateJSONs.py | Validates the output JSONs from Step 4 against an OGM Aardvark JSON file.

Step 1: Extract XML to CSV and Get GeoNames

This Python script iterates through multiple XML files generated by the Export Metadata Multiple tool in ArcMap. The files should be copied into the same folder as the Python script (note that the script will search through subdirectories within this folder). The script retrieves each item's title and bounding coordinates, uses API queries to collect placename information from the GeoNames database, then generates a single CSV file with the aggregated information.

The script uses two APIs:

  • Find Nearby Populated Place returns the GeoName ID of the populated place nearest to the photo's center point. This API requires a coordinate pair, so extractXMLToCSVGetGeoNames.py calculates the pair based on the bounding coordinates in the input XML file.
  • Hierarchy collects all the GeoName IDs in the geographical levels "higher" than the populated place.
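
As a rough illustration of the two lookups described above, the sketch below computes a center point from the bounding coordinates and builds the request URLs. It assumes the JSON variants of the GeoNames web services (findNearbyPlaceNameJSON and hierarchyJSON); the helper names are hypothetical, and the actual script may build its queries differently. No network request is made here.

```python
from urllib.parse import urlencode

GEONAMES_BASE = "http://api.geonames.org"


def center_point(west, east, north, south):
    """Return (lat, lng) at the middle of the bounding box.

    A plain midpoint; it does not handle boxes that cross the antimeridian.
    """
    return ((north + south) / 2, (west + east) / 2)


def nearby_place_url(lat, lng, username):
    """Build a Find Nearby Populated Place request URL (JSON variant assumed)."""
    params = {"lat": lat, "lng": lng, "username": username}
    return f"{GEONAMES_BASE}/findNearbyPlaceNameJSON?{urlencode(params)}"


def hierarchy_url(geoname_id, username):
    """Build a Hierarchy request URL for a GeoName ID (JSON variant assumed)."""
    params = {"geonameId": geoname_id, "username": username}
    return f"{GEONAMES_BASE}/hierarchyJSON?{urlencode(params)}"


lat, lng = center_point(west=-72.5, east=-72.25, north=42.5, south=42.25)
print(nearby_place_url(lat, lng, username="your_username"))
```

The GeoName ID returned by the first query would then be passed to hierarchy_url to collect the higher-level places.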

A note on credit limits: GeoNames gives free-tier users 20,000 credits per day and 1,000 credits per hour. The Find Nearby Populated Place API uses 4 credits per query, and the Hierarchy API uses 1 credit per query. Each XML file therefore costs 5 credits (one query to each API), so at most 200 XML files can be processed per hour (1,000 / 5) and at most 4,000 per day (20,000 / 5).
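
The credit arithmetic can be checked with a few lines of Python. This is only a planning sketch: the function names are hypothetical, and it assumes one query to each API per XML file, as the figures above imply.

```python
HOURLY_CREDITS = 1_000
DAILY_CREDITS = 20_000
CREDITS_PER_FILE = 4 + 1  # Find Nearby Populated Place + Hierarchy


def max_files(credit_limit, credits_per_file=CREDITS_PER_FILE):
    """Number of XML files that fit within a credit limit."""
    return credit_limit // credits_per_file


def hours_needed(n_files):
    """Hourly windows needed for n_files (ignores the daily cap)."""
    per_hour = max_files(HOURLY_CREDITS)
    return -(-n_files // per_hour)  # ceiling division


print(max_files(HOURLY_CREDITS))  # files per hour
print(max_files(DAILY_CREDITS))   # files per day
print(hours_needed(450))          # hourly windows for a 450-file batch
```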

Output Fields

These are the fields that are extracted and/or calculated:

Field | Description | Matching field in OGM Aardvark
mods_ID | Unique identifier | dct_identifier_sm
geometry | Extent of the resource, formatted as "ENVELOPE(W, E, N, S)" | locn_geometry
bbox | Bounding box, also formatted as "ENVELOPE(W, E, N, S)" | dcat_bbox
geoname_ID | GeoName ID (if found, otherwise 'none') | umass_geonames_s
place | Populated Place from the GeoNames database |
town_long | ADM3 from the GeoNames database |
town_short | ADM3, edited to remove "Town of" or "City of ... Town" |
county | ADM2 from the GeoNames database |
state | ADM1 from the GeoNames database |

Customizations

cols - update this list with the column headings that should appear in the output CSV. These fields will need to be edited for the variable this_df as well.

citeinfo and bounding - these parts of the script use the xml.etree.ElementTree library to search for information within the XML file based on specific tags.

  • For XML files with multiple children, see the Python docs help on Parsing XML.
  • For locating buried elements in a tree, see the Python docs help on XPath support.
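
A minimal sketch of tag-based extraction with xml.etree.ElementTree follows. It assumes FGDC-style element names (title under citeinfo, and westbc, eastbc, northbc, southbc under bounding); the sample XML and the function name are made up for illustration, so check the tags in your own exported files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real exports contain many more elements.
SAMPLE_XML = """<metadata>
  <idinfo>
    <citation><citeinfo><title>sample_photo_001</title></citeinfo></citation>
    <spdom>
      <bounding>
        <westbc>-72.5</westbc>
        <eastbc>-72.25</eastbc>
        <northbc>42.5</northbc>
        <southbc>42.25</southbc>
      </bounding>
    </spdom>
  </idinfo>
</metadata>"""


def extract_title_and_bounds(xml_text):
    """Return the title and a dict of bounding coordinates as floats."""
    root = ET.fromstring(xml_text)
    # ".//" searches the whole tree, so the elements can be nested at any depth.
    title = root.findtext(".//citeinfo/title")
    bounding = root.find(".//bounding")
    tags = ("westbc", "eastbc", "northbc", "southbc")
    bounds = {tag: float(bounding.findtext(tag)) for tag in tags}
    return title, bounds


title, bounds = extract_title_and_bounds(SAMPLE_XML)
print(title, bounds)
```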

geometry - formats the output as a bounding box in the format ENVELOPE(W,E,N,S). However, starting in GeoBlacklight 4.0 this field can be formatted as a complex geometry, which becomes the default polygon shown in the map interface. See the OGM Aardvark metadata schema for more information about the locn_geometry field.
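
For reference, the ENVELOPE string can be assembled as in the sketch below. The function name and the sanity checks are assumptions for illustration, not the script's own code. Note the order: west, east, north, south.

```python
def to_envelope(west, east, north, south):
    """Format bounding coordinates as ENVELOPE(W,E,N,S)."""
    if south > north:
        raise ValueError("south must not be greater than north")
    if west > east:
        # Boxes crossing the antimeridian are out of scope for this sketch.
        raise ValueError("west must not be greater than east")
    return f"ENVELOPE({west},{east},{north},{south})"


print(to_envelope(-72.5, -72.25, 42.5, 42.25))
# ENVELOPE(-72.5,-72.25,42.5,42.25)
```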

this_df - there are several calculated fields that format the extracted title and coordinate information according to the OGM Aardvark metadata schema.

username - to use the GeoNames API, you must create a GeoNames username. Running the script with the default username ("demo") will likely not return any results.

Step 2: Merge CSVs

This Python script merges three CSVs based on the unique ID, mods_ID. The script is designed to merge the output from 1_extractXMLToCSVGetGeoNames.py with additional metadata from the UMass Amherst MacConnell Aerial Photo collection.
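
The idea of merging on a shared ID can be sketched with the standard library alone. This is a dependency-free illustration, not the script itself: the function names, sample rows, and column values are hypothetical, and every row from the first table is kept even when the other tables have no match.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column heading."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def merge_on_id(base_rows, *other_tables, key="mods_ID"):
    """Left-join other tables onto base_rows using the shared key."""
    lookups = [{row[key]: row for row in table} for table in other_tables]
    merged = []
    for row in base_rows:
        combined = dict(row)
        for lookup in lookups:
            # Rows without a match contribute no extra columns.
            combined.update(lookup.get(row[key], {}))
        merged.append(combined)
    return merged


step1 = read_rows("mods_ID,place\nphoto_001,Amherst\nphoto_002,Hadley\n")
extra_a = read_rows("mods_ID,year\nphoto_001,1952\n")
extra_b = read_rows("mods_ID,annotation\nphoto_002,yes\n")
for row in merge_on_id(step1, extra_a, extra_b):
    print(row)
```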

Output Fields

These are the fields that are extracted and/or calculated (in addition to those above):

Field | Description | Matching field in OGM Aardvark
titleInfo_partNumber | Photo ID from original metadata record |
place_placeTerm | Spatial coverage from original metadata record |
dateCreated | Creation date from original metadata record | dct_temporal_sm and dct_issued_s
year | Creation year from original metadata record | gbl_indexYear_im
annotation | Note about whether or not the photo has markings | umass_annotated_s

Step 3: Format CSV to Aardvark

This Python script reads a CSV containing basic geospatial metadata and reformats it according to the OGM Aardvark metadata schema. The script is designed to format the output from 2_mergeCSVs.py. For an example of how to format an input CSV for this script, check out the file testdata_2_merged.csv in the test data pack.

Customizations

cols - this list contains all the OGM Aardvark fields, as well as custom UMass fields. Fields you don't want can be commented out.

Reading the CSV - this list contains the column headings from the CSV. Make sure to add any columns that contain Aardvark-ready metadata.

spatial - this code creates a list rearranging the GeoNames parts into a custom format: ["Amherst, MA", "Town of Amherst, MA", "Hampshire County, MA"]. These elements can be arranged to create whatever format you prefer. See the OGM Aardvark metadata schema for more information about the dct_spatial_sm field.
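
One way to assemble such a list is sketched below. The function name and the state_abbr argument are assumptions for illustration; how the script itself derives the state abbreviation is not shown here.

```python
def build_spatial(place, town_long, county, state_abbr):
    """Build a dct_spatial_sm-style list, skipping empty parts and duplicates."""
    spatial = []
    for part in (place, town_long, county):
        if not part:
            continue
        label = f"{part}, {state_abbr}"
        if label not in spatial:
            spatial.append(label)
    return spatial


print(build_spatial("Amherst", "Town of Amherst", "Hampshire County", "MA"))
# ['Amherst, MA', 'Town of Amherst, MA', 'Hampshire County, MA']
```

Reordering the tuple in the loop changes the order of the output list.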

Appending to cols - this list contains all the same fields as in cols above. Fields you don't want can be commented out. The rest will need to be updated based on your own desired outputs. Note that for the UMass Amherst workflow, we are working with a set of historical aerial photographs that are for the most part similar; therefore we are using this part of the script to populate many fields with identical information. Alternatively, you might create a CSV with all the OGM Aardvark information, then customize this part of the script to simply read the CSV and write its contents to the fields.

Step 4: Parse CSV to Multiple JSONs

This Python script reads a CSV containing OGM Aardvark metadata and parses it into individual JSON files, one for each record. The script is designed to parse the output from 3_formatCSVToAardvark.py. For an example of how to format an input CSV for this script, check out the file testdata_3_aardvark.csv in the test data pack.

Customizations

mods_ID - the unique ID that we use in naming the output file. You can substitute any other field, or change this part of the code to name output files as you wish.
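
The naming step can be illustrated with a short sketch that writes one JSON file per CSV row, named after the ID field. The function name, the sample rows, and the choice to drop empty cells are assumptions; the sketch also writes every value as a plain string, whereas multivalued Aardvark fields would need to be converted to lists.

```python
import csv
import io
import json
from pathlib import Path


def write_records(csv_text, out_dir, id_field="mods_ID"):
    """Write one JSON file per CSV row, named <id_field value>.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Drop empty cells so the JSON only contains populated fields.
        record = {field: value for field, value in row.items() if value != ""}
        path = out_dir / f"{row[id_field]}.json"
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
        written.append(path)
    return written


sample = "mods_ID,dct_title_s\nphoto_001,Sample one\nphoto_002,Sample two\n"
for path in write_records(sample, "json_output"):
    print(path)
```

Passing a different id_field names the output files after another column.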

Step 5: Validate JSONs

This Python script compares the output JSONs from Step 4 against an OGM Aardvark validation file. See the GeoBlacklight schema directory for the most up-to-date version of this file and substitute accordingly.
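
As a much weaker stand-in for full schema validation, the sketch below checks only that a record contains the fields listed under a schema's "required" key. The schema and records shown are tiny hypothetical examples, not the real validation file, and the function name is made up. A complete check would need a full JSON Schema validator, such as the third-party jsonschema package.

```python
import json


def missing_required(record, schema):
    """Return the schema's required fields that are absent from the record."""
    return [field for field in schema.get("required", []) if field not in record]


# Hypothetical stand-in for the validation file.
schema = json.loads('{"required": ["id", "dct_title_s"]}')

good = {"id": "photo_001", "dct_title_s": "Sample one"}
bad = {"id": "photo_002"}

print(missing_required(good, schema))  # []
print(missing_required(bad, schema))   # ['dct_title_s']
```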

Try it yourself!

To try out the scripts, the test_data.zip package contains 6 sample XML files of georeferenced aerial photos, the files scua.csv and annotation.csv with additional metadata about the full aerial photo collection, and the validation file geoblacklight-schema-aardvark.json.

You'll need to download the individual Python scripts and save them in the same folder, then move the folder "helper_docs" to the main folder. Use your favorite Python application to run the scripts (like PyCharm or Google Colab).
