Rigonz/GeographicCentroids

GeographicCentroids

Geographic centroids with population and areal weighting.

Presentation

As far as I can judge, some errors are common in the calculation of the centroids of geographical areas. They are not large, but they are easily preventable: this project presents the issues and one of the existing solutions.

Also, I am not aware of any repository of geographical centroids, let alone one weighted by population. This project provides one such repository.

Finally, it includes two Jupyter Notebooks that can be easily adapted. They use a very large public dataset of populated places to calculate different centroids along two lines:

  1. On one hand, purely geographic centroids and centroids weighted by population: the dataset includes detailed population estimates for the years 2000, 2005, 2010, 2015 and 2020, so it is possible to compute the corresponding centroids.
  2. On the other hand, the dataset allows different degrees of administrative aggregation: nation, state, province, etc. Two sets of results are provided: at administrative level 1 (248 nations) and at level 2 for certain large countries (CHN, RUS, USA, AUS, CAN and BRA).

Basics: Physics and Cartography

The centroid of a set of points (or a surface or a volume) is defined in physics textbooks as the point which minimizes the sum of the squared distances from the set of points to it. In a rectangular coordinate system and for the usual Euclidean distance, the centroid is found by averaging the (x, y, z) coordinates of the points. If a weighting has to be applied (for instance the population assigned to the points), each term in the average is multiplied by the corresponding weight and the sum is divided by the total weight. (See, e.g., Centroid in Wikipedia).

These averaging and weighting expressions are quite convenient for calculation. However, they are valid only for a rectangular system of coordinates and cannot be applied directly to geographical reference systems, which use longitude-latitude (spherical) coordinates.
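As an illustration of the averaging formula, here is a minimal Python sketch of a weighted centroid in rectangular coordinates. The function name weighted_centroid and the sample points and weights are made up for this example; they are not taken from the project.

```python
def weighted_centroid(points, weights=None):
    """Weighted mean of (x, y, z) points in a rectangular coordinate system."""
    if weights is None:
        # No weights given: every point counts the same.
        weights = [1.0] * len(points)
    total = sum(weights)
    return tuple(
        sum(w * p[i] for p, w in zip(points, weights)) / total
        for i in range(3)
    )


# Hypothetical sample data: three points, the last one with twice the population.
points = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
population = [1.0, 1.0, 2.0]

print(weighted_centroid(points))              # plain average
print(weighted_centroid(points, population))  # population-weighted average
```

The weighted result is pulled towards the third point, which carries half of the total weight.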

A simple example can show this for a pair of points whose geographical coordinates are:

  • P1 = (0, 0)
  • P2 = (20, 20)

The centroid calculated by the average formula is located at P0 = (10, 10). However, the geodetic distances from P1 and P2 to this "centroid" are (assuming the coordinate pairs are in lon-lat, not lat-lon):

  • P0-P1 = 1565.1 km
  • P0-P2 = 1541.9 km

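As calculated with the following Python script:

from geopy.distance import geodesic

# geopy expects (latitude, longitude); every point here has equal values
# in both positions, so the order makes no difference in this example.
P1 = [0, 0]
P2 = [20, 20]
P0 = [10, 10]  # arithmetic mean of P1 and P2
# Geodesic (ellipsoidal) distances, in km, from the averaged point to each input
D01 = geodesic(P0, P1).km
D02 = geodesic(P0, P2).km
print(D01, D02)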

A great-circle calculator, which assumes a spherical Earth, gives slightly different values. For example, NOAA yields 1567 km and 1544 km, and LATLONG gives 1569 km and 1545 km. In any case, what matters here is that P0 is not equidistant from P1 and P2, so it is not the centroid.
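The great-circle figures can be approximated without any external library. The sketch below uses the haversine formula on a sphere; the radius of 6371 km is an assumption (online calculators may use a different value), and great_circle_km is a name chosen for this example. It takes the points as (lon, lat) pairs in degrees.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; calculators may differ


def great_circle_km(p, q, radius_km=EARTH_RADIUS_KM):
    """Haversine distance between two (lon, lat) pairs given in degrees."""
    lon1, lat1 = map(radians, p)
    lon2, lat2 = map(radians, q)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * asin(sqrt(h))


P1, P2, P0 = (0, 0), (20, 20), (10, 10)
d01 = great_circle_km(P0, P1)
d02 = great_circle_km(P0, P2)
print(round(d01, 1), round(d02, 1))
```

With this radius the two distances should land within a few kilometres of the calculator values quoted above, and they still differ from each other by more than 20 km.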

The mistake of averaging spherical coordinates to obtain the centroid seems not to be uncommon, as presented in the following two examples:

  1. Baylor

     select countrynm,
         (sum(lat_cen * p00a) / sum(p00a)) as latitude,
         (sum(long_cen * p00a) / sum(p00a)) as longitude, 
         sum(p00a) as population
     from centroids 
     group by countrynm;
    
  2. Sumit

     points = MultiPoint(geoList)
     result_df = result_df.append({'State':df['State'].iloc[index-1],'District':df['District'].iloc[index-1],'Latitude':points.centroid.x,'Longitude':points.centroid.y}, ignore_index=True)
    

The library used in this second case, shapely, calculates the centroid of the points as the mean of the given coordinates, without regard to the coordinate units.

Solutions

As far as I can see there are two alternatives for the correct calculation of these centroids:

  1. Move into a rectangular system of coordinates.
  2. Adjust the units of the longitude.

Alternative 2 is used by the US Census Bureau as described here: Census.
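A sketch of alternative 2, under the assumption that the adjustment consists of weighting each longitude by the cosine of its latitude while the latitude is a plain weighted mean (this is my reading of the Census description, not code taken from it). The function name census_style_centroid is hypothetical; points are (lon, lat) pairs in degrees.

```python
from math import cos, radians


def census_style_centroid(points, weights=None):
    """Centroid of (lon, lat) points in degrees with cos(lat)-adjusted longitudes."""
    if weights is None:
        weights = [1.0] * len(points)
    # Latitude: plain weighted mean.
    lat = sum(w * la for (_, la), w in zip(points, weights)) / sum(weights)
    # Longitude: a degree of longitude shrinks with cos(lat), so each point's
    # weight is scaled by that factor before averaging.
    scaled = [w * cos(radians(la)) for (_, la), w in zip(points, weights)]
    lon = sum(s * lo for (lo, _), s in zip(points, scaled)) / sum(scaled)
    return lon, lat


lon_c, lat_c = census_style_centroid([(0, 0), (20, 20)])
print(round(lon_c, 3), round(lat_c, 3))
```

Because longitudes are still averaged as plain numbers, this sketch would not work for a set of points that straddles the 180° meridian.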

For alternative 1, which avoids mixing angular and linear units, the conversion is:

x = a*cos(LAT)*cos(LON)
y = a*cos(LAT)*sin(LON)
z = b*sin(LAT)

Where a and b are the semi-axes of the selected reference ellipsoid. The centroid will not lie on the surface but inside the volume of the ellipsoid. Its reprojection is straightforward:

LAT = arcsin(Z/b)
LON = atan2(Y, X)

Where X, Y and Z are the averages (weighted if necessary) of the x's, y's and z's from the dataset. (As a side note: this makes the selection of the ellipsoid irrelevant, as both a and b cancel out in the calculation of the geographical coordinates).
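The conversion and reprojection formulas can be sketched in a few lines of Python. This is an illustration written for this README, not the code from the Notebooks: the function name spherical_centroid is hypothetical, the points are (lon, lat) pairs in degrees, and a and b are taken as 1 since they cancel out.

```python
from math import asin, atan2, cos, degrees, radians, sin


def spherical_centroid(points, weights=None):
    """Centroid of (lon, lat) points in degrees, via rectangular coordinates."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    X = Y = Z = 0.0
    for (lon, lat), w in zip(points, weights):
        lon_r, lat_r = radians(lon), radians(lat)
        # a = b = 1: the semi-axes cancel out in the reprojection.
        X += w * cos(lat_r) * cos(lon_r)
        Y += w * cos(lat_r) * sin(lon_r)
        Z += w * sin(lat_r)
    X, Y, Z = X / total, Y / total, Z / total
    # Reprojection as written above: LAT = arcsin(Z/b), LON = atan2(Y, X).
    return degrees(atan2(Y, X)), degrees(asin(Z))


lon0, lat0 = spherical_centroid([(0, 0), (20, 20)])
print(round(lon0, 3), round(lat0, 3))
```

For the pair P1, P2 of the earlier example, both coordinates come out slightly below 10 degrees rather than exactly at (10, 10). Using atan2 for the longitude also handles point sets that straddle the 180° meridian. Another reprojection that is often used normalizes the averaged vector first, that is LAT = atan2(Z, sqrt(X² + Y²)); it gives a somewhat different latitude when the points are widely spread.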

Dataset

The dataset used in this project is the "Administrative Unit Center Points with Population Estimates" v4.11 from GPW (Gridded Population of the World) at SEDAC_GPW.

The dataset (which is very large, close to 2 GB) has been prepared with the following steps:

  • clearing parsing errors (I cannot say whether they are in the original dataset or appeared during the file downloads, but there were errors in all the files),
  • removing unused fields, as indicated in the Notebooks,
  • merging the four datasets from USA.

Jupyter Notebooks

See the two files included in this repository, adequately commented (I hope).

Results

The results are included in the files:

  • "country_centroids R0.csv"
  • "region_centroids R0.csv"

An example of the differences between geographic and demographic centroids, and the displacement of the demographic centroid in the last 20 years, can be seen in the repository: pics.

countries

regions