A Stata command for cleaning up common inconsistencies in street address data that prevent the same physical address from being identified across years. The program was originally written to help identify hospital locations across several datasets over time.
The six addresses below all refer to the same physical address, but each was recorded differently, which makes matching difficult:
- "N 2nd St" Consider this the base case
- "North 2nd St" N is now North
- "N second Street" 2nd is now second and St is now Street
- "N 2nd STRT" St is now STRT
- "N 2nd  St" A double space exists between 2nd and St
- " N 2nd ST " A space was placed at the front of the string and St is now ST

The command applies the following cleaning steps:

- Converts all strings to lowercase.
- Adds spaces around common separators (", / - ( )") and at the start and end of each address so that words can be identified properly.
- Uses the United States Postal Service list of address name abbreviations to convert abbreviations to the common written form. For example, ave -> avenue.
- Converts all abbreviated cardinal directions to the full form. For example, n. -> north or n -> north.
- Converts the shortened ordinals 1st-9th to first-ninth.
- Standardizes the different versions of "po box" so that it is always written as "po box".
- Removes excess spaces in the street address.
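
The cleaning steps above can be sketched outside of Stata as well. The Python snippet below is only an illustration of the general idea, not the command's actual implementation: the function name `clean_address`, the three small lookup tables, and the "po box" pattern are all assumptions made for this example, and the real USPS abbreviation list is far longer.

```python
import re

# Hypothetical, heavily abbreviated lookup tables (assumptions for this sketch).
SUFFIXES = {"st": "street", "strt": "street", "ave": "avenue"}
DIRECTIONS = {"n": "north", "s": "south", "e": "east", "w": "west"}
ORDINALS = {
    "1st": "first", "2nd": "second", "3rd": "third",
    "4th": "fourth", "5th": "fifth", "6th": "sixth",
    "7th": "seventh", "8th": "eighth", "9th": "ninth",
}
REPLACEMENTS = {**SUFFIXES, **DIRECTIONS, **ORDINALS}


def clean_address(raw):
    """Return a normalized version of a street address string."""
    # Step 1: lowercase everything.
    text = raw.lower()
    # Step 2: pad common separators with spaces so they become separate tokens.
    text = re.sub(r"([,/\-()])", r" \1 ", text)
    # Step 3: collapse a few assumed "po box" spellings (p.o. box, p o box, ...).
    text = re.sub(r"\bp\.?\s*o\.?\s*box\b", "po box", text)
    # Step 4: expand abbreviations token by token, ignoring a trailing period.
    words = []
    for token in text.split():
        key = token.rstrip(".")
        words.append(REPLACEMENTS.get(key, token))
    # Step 5: joining with single spaces removes leading, trailing,
    # and repeated spaces.
    return " ".join(words)


variants = [
    "N 2nd St",
    "North 2nd St",
    "N second Street",
    "N 2nd STRT",
    "N 2nd  St",
    " N 2nd ST ",
]
print({clean_address(v) for v in variants})  # {'north second street'}
```

In this sketch all six example addresses collapse to the same string, which is what makes them matchable across years. Note that padded separators stay in the output as their own tokens; whether the Stata command keeps or drops them is not stated here.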

Typical use cases:

- Cleaning up address information before submitting addresses for geolocation requests to Census TIGER, Google Maps, Mapbox, or Bing Maps.
- Checking addresses across time for indications of openings or closures of organizations or businesses.
- Preventing gaps or lost observations in datasets across time when a constant, permanent address is needed for analysis.
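
As a rough illustration of the second use case, once addresses have been cleaned, two years of data can be compared with simple set operations. The helper name `compare_years` and the sample addresses are hypothetical, and an address that disappears is only an indication of a closure, not proof of one.

```python
def compare_years(earlier, later):
    """Compare two sets of already-cleaned addresses from different years."""
    return {
        "closed": earlier - later,      # present before, missing afterwards
        "opened": later - earlier,      # missing before, present afterwards
        "continuing": earlier & later,  # present in both years
    }


# Made-up sample data for the sketch.
year_a = {"north second street", "10 main avenue"}
year_b = {"north second street", "5 oak street"}

result = compare_years(year_a, year_b)
print(result["closed"])      # {'10 main avenue'}
print(result["opened"])      # {'5 oak street'}
print(result["continuing"])  # {'north second street'}
```

Without the cleaning step, an address recorded as "N 2nd St" in one year and "North 2nd Street" in the next would wrongly show up as one closure plus one opening.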