The parkpic.us project was created as part of the Insight Data Science Fellowship program.
The majestic landscapes and diverse ecosystems of the U.S. National Parks inspire a sense of wonder and evoke the pioneering spirit of our nation. From the stunning geological features of the Grand Canyon to the lush wetlands of the Everglades, these pristine places offer visitors a sanctuary from the connections and complications of modern civilization, a chance to wander briefly through a timeless panorama.
I bet you are going to want to take a picture.
The parkpic.us project transports you to your favorite U.S. National Park and serves as your guide to the best photo scenes within each park. The application uses machine learning to sift through photos taken in the parks and characterizes each scene from the content of those photos. By using parkpic.us to plan your next expedition, you will spend less time digging for your map and more time enjoying the scenery around you.
The data behind the parkpic.us project consists of photographs of U.S. National Parks shared on Flickr. First, the location coordinates and area of each of the 58 U.S. National Parks were scraped from Wikipedia using the Python module BeautifulSoup. Photographs were then retrieved through the Flickr API by searching at the latitude and longitude of each park within a radius scaled to the park's area. To ensure complete coverage, the radius search was repeated at randomly selected points within each park, drawn using GIS park-boundary data from the National Park Service. The geographic coordinates and user tags for each photograph, along with the Flickr user information and photo links, were stored in a MySQL database.
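The area-to-radius scaling can be sketched in a few lines. The conversion below (a circle with the same area as the park, capped at 32 km, which is Flickr's maximum search radius) is an illustrative assumption, not necessarily the project's exact formula; the actual query would then go through a Flickr API client with the resulting radius.

```python
import math

def search_radius_km(park_area_km2, max_radius_km=32.0):
    """Radius (km) of a circle with the same area as the park,
    capped at Flickr's maximum search radius of 32 km."""
    return min(math.sqrt(park_area_km2 / math.pi), max_radius_km)

# A park covering the area of a 5 km circle yields a 5 km search radius:
print(search_radius_km(math.pi * 25))  # ~5.0
# Very large parks are capped so the search stays within Flickr's limit,
# which is why the search is repeated at random points inside the boundary:
print(search_radius_km(13000))  # 32.0
```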
Flickr also generates auto-tags for each photograph using image recognition, to assist in the discovery of new photographs. These auto-tags identify the features within a photograph more consistently than the tags entered manually by users. However, the auto-tags are not exposed through the Flickr API, and they are well hidden behind layers of JavaScript in the page source. For this project, the auto-tags were scraped by opening each photo page in the headless browser PhantomJS and forcing the JavaScript to execute with the browser-automation module Selenium.
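A sketch of the extraction step, assuming the rendered page marks auto-tags with an `autotag` CSS class; the real class names on Flickr's photo pages may differ. In the project, the HTML would come from Selenium driving PhantomJS (via `driver.page_source` after the JavaScript has run); here a static snippet stands in, parsed with the standard library's `HTMLParser`.

```python
from html.parser import HTMLParser

class AutoTagParser(HTMLParser):
    """Collects the text of <a> elements whose class includes 'autotag'.
    The 'autotag' class name is an assumption about the rendered markup."""
    def __init__(self):
        super().__init__()
        self.autotags = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and "autotag" in dict(attrs).get("class", ""):
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.autotags.append(data.strip())

# With Selenium + PhantomJS the HTML would come from driver.page_source;
# a hand-written snippet stands in for the rendered photo page here:
html = ('<a class="autotag" href="#">canyon</a>'
        '<a class="autotag" href="#">rock</a>'
        '<a href="#">user link</a>')
parser = AutoTagParser()
parser.feed(html)
print(parser.autotags)  # ['canyon', 'rock']
```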
The best photo scenes within each national park were identified with the clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) from the Python scikit-learn package. DBSCAN assumes that the points in a cluster are densely grouped, treats points lying too far from any cluster as noise, and does not require the number of clusters to be specified in advance, which suits this problem because the number of popular photo spots in a park is not known ahead of time. The algorithm predicts the best photo scenes by identifying high-volume photo locations that are popular with Flickr users. The geo-locations of the photographs in each cluster were then averaged to find the center of each photo scene.
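The clustering and averaging steps above can be sketched with scikit-learn. The coordinates below are made up, and the `eps` value (in degrees, roughly half a kilometer of latitude) and `min_samples` are illustrative choices, not the project's actual parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical photo geo-locations: two dense groups plus one stray photo.
coords = np.array([
    [36.0570, -112.1430], [36.0572, -112.1428], [36.0571, -112.1432],
    [36.1000, -112.1000], [36.1002, -112.0998], [36.1001, -112.1001],
    [36.5000, -112.5000],   # isolated photo -> labeled as noise (-1)
])

# eps is in coordinate degrees here (~0.005 deg of latitude is ~500 m).
labels = DBSCAN(eps=0.005, min_samples=3).fit_predict(coords)

# Average the coordinates in each cluster to get the scene centers;
# label -1 is DBSCAN's noise bucket and is excluded.
centers = {lab: coords[labels == lab].mean(axis=0)
           for lab in set(labels) if lab != -1}
print(len(centers))  # 2 photo scenes; the stray photo is discarded as noise
```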
To help visitors decide which of the predicted photo scenes best match their photographic style, the Flickr auto-tags are aggregated and displayed for each photo scene. Each scene is also compared to the other scenes in the chosen park using the Jaccard similarity index, a metric for determining which scenes contain photos that resemble one another. The index is calculated as the number of auto-tags shared by two scenes divided by the total number of distinct auto-tags across both, so scenes with identical tag sets score 1 and scenes with no tags in common score 0.
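The similarity computation reduces to a few lines on Python sets; the example tag sets are made up for illustration.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity: shared tags divided by all distinct tags."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

scene_1 = {"canyon", "rock", "sky"}
scene_2 = {"canyon", "sky", "river", "tree"}
print(jaccard(scene_1, scene_2))  # 2 shared / 5 distinct = 0.4
```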
The backend of the application was built with the Flask microframework, and the data was handled using the Python modules SQLAlchemy and pandas. The application is hosted on an Amazon Web Services Ubuntu micro instance and served with the Apache HTTP Server. The user interface was constructed using jQuery and the Google Maps API.