We have set the linked official demo page to 'private' by default to control costs. If you wish to try it, please send us an email to book a time slot; during that slot, the demo page will be set to 'public'.
Historical maps provide valuable information and knowledge about the past. However, as they often feature non-standard projections, hand-drawn styles, and artistic elements, it is challenging for non-experts to identify and interpret them. While existing image captioning methods have achieved remarkable success on natural images, their performance on maps is suboptimal, as maps are underrepresented in their pre-training data. Despite recent advances of GPT-4 in text recognition and map captioning, it still has a limited understanding of maps, as its performance wanes when texts (e.g., titles and legends) in maps are missing or inaccurate. Moreover, it is inefficient or even impractical to fine-tune the model on users' own datasets.
To address these problems, we propose a novel and lightweight map-captioning counterpart. Specifically, we fine-tune the state-of-the-art vision-language model CLIP (Contrastive Language-Image Pre-Training) to generate captions relevant to historical maps and enrich the captions with GPT-3.5 to tell a brief story covering the where, what, when, and why of a given map. We propose a novel decision-tree architecture to generate only the captions relevant to the specified map type. Our system is invariant to text alterations in maps, and it can be easily adapted and extended to other map types and scaled to a larger map-captioning system.
We first automatically process maps and their metadata from the online map repository David Rumsey Map Collection to generate a training dataset with keyword captions regarding where, what, and when, and use this dataset to fine-tune different CLIP models. In the inference phase, we use a decision-tree architecture to structure the keyword captions with respect to the map type and use GPT to extend the context (the why) and summarize the story. Furthermore, a web interface is developed for interactive storytelling, with the decision-tree architecture and the fine-tuned models loaded at the backend.
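For orientation only, the decision-tree inference could be sketched roughly as follows; the category keys, candidate captions, and model handles are placeholder assumptions, not the actual logic of our scripts (see Inference.py and CaptionInferenceGUI.py below).

```python
# Illustrative sketch only; the real logic lives in Inference.py and
# CaptionInferenceGUI.py. Assumes the openai/CLIP package; the category keys,
# candidate captions, and model handles are placeholder assumptions.
# models:     {key: (clip_model, preprocess)}
# candidates: {key: [candidate keyword captions]}
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

def best_caption(model, preprocess, map_path, candidates):
    """Return the candidate keyword caption the fine-tuned CLIP model scores highest."""
    image = preprocess(Image.open(map_path).convert("RGB")).unsqueeze(0).to(device)
    text = clip.tokenize(candidates).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
    return candidates[int(logits_per_image.argmax())]

def keyword_captions(map_path, models, candidates):
    """Walk the decision tree: first the map type, then only the caption
    categories (where / what / when) that apply to that type."""
    map_type = best_caption(*models["type"], map_path, candidates["type"])
    branch = "pictorial" if "pictorial" in map_type else "topographic"
    captions = {"type": map_type}
    for category in ("where", "what", "when"):
        key = f"{branch}_{category}"
        captions[category] = best_caption(*models[key], map_path, candidates[key])
    return captions  # afterwards enriched with a "why" part and summarized by GPT-3.5
```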
Step-by-step instructions to reproduce our results with our proposed approach.
git clone https://github.com/claudaff/automatic-map-storytelling && cd automatic-map-storytelling
conda env create -f environment.yml
conda activate map_storytelling
Download and unzip the following fifteen .zip files containing our collected maps with associated metadata (1.6 GB overall).
M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, M11, M12, M13, M14, M15
Run the two scripts CaptionGenerationClassical.py (for topographic maps) and CaptionGenerationPictorial.py (for pictorial maps). The output will be two NumPy arrays for each of the six caption categories: one containing the map image paths and one containing the corresponding ground-truth captions.
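For example, one caption category's output could be sanity-checked along these lines; the .npy file names below are assumptions, the two scripts define the actual output paths.

```python
# Quick sanity check of one caption category's output arrays.
# The file names are assumptions; see the two generation scripts for the real paths.
import numpy as np

image_paths = np.load("topographic_where_imagePaths.npy", allow_pickle=True)
captions = np.load("topographic_where_captions.npy", allow_pickle=True)

assert len(image_paths) == len(captions)
for path, caption in zip(image_paths[:3], captions[:3]):
    print(path, "->", caption)
```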
Run the six fine-tuning scripts fineTuneCLIP{Caption Category}. The output will be six fine-tuned CLIP models, one for each caption category.
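For orientation, fine-tuning one caption category amounts to the standard CLIP contrastive objective on (map image, keyword caption) pairs. The sketch below assumes the openai/CLIP package and uses illustrative hyperparameters, not the exact settings of the fineTuneCLIP scripts.

```python
# Minimal fine-tuning sketch for one caption category (illustrative only).
import clip
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model = model.float()  # train in fp32 for numerical stability

class MapCaptionDataset(Dataset):
    """Pairs of (preprocessed map image, tokenized keyword caption)."""
    def __init__(self, image_paths, captions):
        self.image_paths, self.captions = image_paths, captions
    def __len__(self):
        return len(self.image_paths)
    def __getitem__(self, idx):
        image = preprocess(Image.open(self.image_paths[idx]).convert("RGB"))
        text = clip.tokenize(self.captions[idx], truncate=True).squeeze(0)
        return image, text

def fine_tune(image_paths, captions, epochs=5, lr=1e-6, out_path="FT_example.pt"):
    loader = DataLoader(MapCaptionDataset(image_paths, captions),
                        batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, texts in loader:
            images, texts = images.to(device), texts.to(device)
            logits_per_image, logits_per_text = model(images, texts)
            labels = torch.arange(len(images), device=device)
            # Symmetric contrastive loss over the image-text similarity matrix.
            loss = (ce(logits_per_image, labels) + ce(logits_per_text, labels)) / 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), out_path)
```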
Alternatively, download the six fine-tuned models here (3.4 GB overall).
Download our test maps here (less than 50 MB) and unzip: Pictorial Test Maps, Topographic Test Maps
Run the script Inference.py after reading the instructions in the comments. This script allows testing the six fine-tuned models separately on our test maps.
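As a minimal illustration of such a test, one fine-tuned model could be loaded and applied to a single test map as follows; the file names and candidate captions are placeholders.

```python
# Score a handful of candidate captions for one test map with one fine-tuned model.
# File names and candidate captions are placeholders; assumes the checkpoint is a state dict.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model.load_state_dict(torch.load("FT1.pt", map_location=device))  # one of FT1-FT6
model.eval()

image = preprocess(Image.open("test_maps/example.jpg").convert("RGB")).unsqueeze(0).to(device)
candidates = ["a map of Europe", "a map of North America", "a map of Asia"]
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

for caption, p in zip(candidates, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```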
To run our map storytelling web app, open the script CaptionInferenceGUI.py, add your own OpenAI API key, and run it. Make sure that the six fine-tuned models (FT1 to FT6) have been downloaded.
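Conceptually, the GPT step is a single chat completion over the keyword captions. Below is a hedged sketch with the current openai Python client; the prompt wording and model name are illustrative, not the exact prompt used in CaptionInferenceGUI.py.

```python
# Turn the keyword captions into a short story with GPT-3.5 (illustrative prompt).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def summarize(captions: dict) -> str:
    prompt = (
        "Write a brief story about a historical map using these keywords, "
        "covering where, what, and when, and adding a plausible why:\n"
        + "\n".join(f"{k}: {v}" for k, v in captions.items())
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(summarize({"where": "Switzerland", "what": "a topographic map",
                 "when": "19th century"}))
```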
Alternatively, if no API key is available, a 'light' version of our approach can be tested without GPT. To do so, open CaptionInferenceLight.py and assign input_map the path to the desired historical map. Running this script will generate the corresponding keyword captions without the 'why' part.
@misc{liu2024efficientautomaticmapstorytelling,
title={An Efficient System for Automatic Map Storytelling -- A Case Study on Historical Maps},
author={Ziyi Liu and Claudio Affolter and Sidi Wu and Yizi Chen and Lorenz Hurni},
year={2024},
eprint={2410.15780},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.15780},
}