This script performs an analysis of metadata from the Global Biodiversity Information Facility API, visualizes the data, and generates useful outputs such as networks, choropleth maps, word clouds, and markdown summaries. It is intended to be use with the output of linkLitToGbifId (https://github.com/AgentschapPlantentuinMeise/linkLitToGbifId).
- Fetch Metadata from GBIF API: Retrieves metadata for a list of DOIs from the GBIF API.
- Data Processing: Cleans and organizes data from CSV files, calculates statistics, and enriches metadata with additional information such as taxon names and country names.
- Network Creation and Visualization: Builds networks of keywords and topics and visualizes them using NetworkX and Matplotlib.
- Markdown and PDF Generation: Creates detailed Markdown summaries for each publication and converts them into PDF format.
- Geospatial Visualization: Generates a global choropleth map highlighting the number of publications by researchers' countries.
- Word Cloud Generation: Creates word clouds based on publication abstracts and keywords.
The script requires the following Python packages:
pandas
requests
networkx
matplotlib
tqdm
pypandoc
pycountry
re
basemap
wordcloud
-
Clone the repository:
git clone <repository-url> cd <repository-directory>
-
Ensure that you have the required CSV files:
KYMS.csv
: Contains the GBIF data with DOIs from scientific literature and associated GBIF IDs of cited specimens creates from the script linkLitToGbifId.download_summary.csv
: Contains download keys and total records.
-
Create a directory named
publication_markdowns
to store the Markdown and PDF files. -
Update the paths in the script as needed:
file_path
for the main data CSV file.download_summary_path
for the download summary CSV file.output_excel_path
for saving the processed data.output_dir
for storing Markdown and PDF files.
The script reads the CSV files, calculates the number of GBIF IDs associated with each DOI, and fetches metadata from the GBIF API. The relevant data is stored in a DataFrame and saved to an Excel file.
The script creates networks based on keywords and topics extracted from the metadata. These networks are saved as GraphML files for further analysis.
- Keyword Network and Topic Network: Visualizes the relationships between keywords and topics using NetworkX.
- Choropleth Map: Generates a map showing the distribution of publications by researchers' countries.
- Word Clouds: Creates word clouds from publication abstracts and keywords.
Creates a detailed Markdown file for each publication with all relevant metadata, and converts it to a PDF file using a LaTeX template.
Run the script in a Python environment with all dependencies installed. For large datasets, make sure to handle API rate limits and possible errors appropriately.
python gbif_metadata_analysis.py
- Excel File:
gbif_literature_data.xlsx
containing processed metadata. - GraphML Files:
keyword_network.graphml
andtopic_network.graphml
. - Markdown Files: Individual Markdown files for each publication stored in the
publication_markdowns
directory. - PDF Files: Converted PDF files for each publication in the
publication_markdowns
directory. - Images:
publications_by_country.png
: Choropleth map of publications by country.wordcloud_abstracts.png
: Word cloud based on publication abstracts.wordcloud_keywords.png
: Word cloud based on keywords.
- Font Issues: If you encounter warnings about missing glyphs, ensure that you have a font installed that supports the characters being used. You can modify the font settings in the script. Having said that, there is not much to be gained from fixing all of these warnings.
- API Rate Limits: If the GBIF API rate limits your requests, you may need to increase the delay between requests or handle the
429
status code more robustly.