- Open Source Application: uReport (https://github.com/City-of-Bloomington/uReport)
- Author: Quan Zhang (quanzhang@acm.org)
- Mentor: Cliff Inghamn (inghamn@bloomington.in.gov)
- Sponsor: Google Inc. (Google Summer of Code)
- Focus Area: GIS, Clustering, large data
- When: June - September, 2013
- Where: Bloomington City Hall, IT services Dept.
- Address: 401 N Morton St, Suite 150, Bloomington IN 47402
- Tools: Solr, Apache, Tomcat, MySQL, Google Maps API v3, YUI
- Languages: PHP, JavaScript, SQL
- Main Achievement: Displayed clusters of up to 1M tickets on the map within one second.
The City of Bloomington has a web application that receives issues reported by citizens. Each received issue is stored in the database as a ticket. A ticket has many properties, such as category, status, date, township, latitude, and longitude, which are set when the citizen reports the issue. The city's database now holds more than 70,000 tickets. On the main page of the web application, users need to see the tickets on a map, and they can filter tickets by choosing properties as search parameters. This raises a problem. If we show all the tickets as markers at once, the map fills up with markers, and drawing them takes a long time. Worse, the client receives its tickets from the Solr server, so the time to transmit the ticket data from server to client also becomes unacceptably long. We therefore need to cluster markers on the server side and transmit clusters, rather than individual tickets, to the client's map.
Solr [1] can run faceted queries and return any number of results, transmitted as JSON or XML. The more tickets a query returns, the larger the response. By setting the "start" and "rows" parameters in the Solr query string, we can fetch a small number of the matching tickets and display them on the map immediately. JavaScript sends the Solr query whenever the map stops panning or zooming. Since this design cannot show all the query results at once, I also added an HTML option to choose how many tickets to display, and used OverlappingMarkerSpiderfier [5] to display overlapping markers separately. Later I tried clustering tickets on the client side. Google provides some samples of client-side clustering [3], Gary Little [4] implemented client-side clustering and released it as an API, and the officially posted article "Too Many Markers!" [2] summarizes approaches to marker clustering on the client side. I borrowed some ideas from this article in the later server-side clustering design. However, no matter how efficient the clustering algorithm is, transmitting large amounts of data from server to client is not acceptable, so we still need to implement clustering on the server side.
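The paging approach described above can be sketched as a small URL builder. This is only an illustration in Python (the project itself sends the query from JavaScript): the function name `build_ticket_query` is hypothetical, the core name `quan` and the `coordinates` field are taken from the query shown later in this report, and `wt=json` is assumed as the response format.

```python
from urllib.parse import urlencode

def build_ticket_query(base_url, south, west, north, east, start=0, rows=100):
    """Build a Solr select URL that returns one page of tickets in a bounding box."""
    params = [
        ("q", "*:*"),
        # Range filter on the lat,lng field; the field name is an assumption
        # borrowed from the stats query later in this report.
        ("fq", f"coordinates:[{south},{west} TO {north},{east}]"),
        ("start", start),  # offset of the first ticket to return
        ("rows", rows),    # maximum number of tickets in this page
        ("wt", "json"),    # ask Solr for a JSON response
    ]
    return f"{base_url}?{urlencode(params)}"

# First page of 50 tickets inside a small box in Bloomington.
url = build_ticket_query(
    "http://localhost/solr/quan/select",
    39.169, -86.55, 39.17, -86.549,
    start=0, rows=50,
)
print(url)
```

Requesting the next page only means raising `start` by `rows`; the response size stays bounded by `rows` regardless of how many tickets match.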
As far as my mentor and I could find, there is no released API or built-in Solr component that clusters on the server side. However, I got the idea from a thesis by Josef Dabernig [9], even though it was implemented on Drupal. The basic idea is to pre-cluster the tickets into grids using base-32 geohash [8]. Austin White provided an open source API that converts a lat-lng pair to a base-32 geohash code [7]. The earth is first divided into 32 grids, and the first digit of the geohash code identifies the grid the lat-lng belongs to. That grid is then divided into another 32 sub-grids, the second digit identifies the sub-grid the lat-lng belongs to, and so forth. The more digits we use in the geohash code, the smaller the grid [10]. In this way, we can pre-cluster the tickets into grids based on the current zoom level. I also designed a mapping from zoom level to geohash length so that an appropriate number of clusters is displayed when zooming in and out. Since what we need is the number of tickets in each cluster and the cluster centroids, rather than the lat-lngs of individual tickets, we need to calculate them on the Solr server. Fortunately, we don't need to add any plug-in to Solr. Solr's Stats Component can count the tickets in each cluster and compute their average latitude and longitude very quickly. For example, the following Solr query computes statistics within a bounding box, grouped by geohash code.
Localhost/solr/quan/select?q=*:*&stats=true&stats.field=latitude&stats.field=longitude&stats.facet=geohash_lv8&fq=coordinates:[39.169000,-86.550000 TO 39.170000,-86.549000]&rows=0
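To make the grid idea concrete, here is a minimal Python sketch of the standard base-32 geohash encoding, followed by an in-memory grouping that computes what the stats query asks Solr for: a count and an average latitude and longitude per geohash cell. It is neither Austin White's API nor Solr's implementation, and the names `geohash_encode` and `cluster_by_geohash` are hypothetical.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lng, length=8):
    """Encode a lat-lng pair as a base-32 geohash of the given length."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lng = True  # bits alternate: longitude first, then latitude
    while len(chars) < length:
        rng, value = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:  # every 5 bits select one of 32 sub-grids
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

def cluster_by_geohash(tickets, length):
    """Group (lat, lng) tickets by geohash cell; return count and centroid per cell."""
    sums = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum of lat, sum of lng
    for lat, lng in tickets:
        cell = sums[geohash_encode(lat, lng, length)]
        cell[0] += 1
        cell[1] += lat
        cell[2] += lng
    return {
        code: (n, lat_sum / n, lng_sum / n)
        for code, (n, lat_sum, lng_sum) in sums.items()
    }

# Made-up sample tickets inside the bounding box of the query above.
tickets = [(39.1693, -86.5497), (39.1694, -86.5496), (39.1699, -86.5491)]
for code, (count, lat, lng) in cluster_by_geohash(tickets, 8).items():
    print(code, count, lat, lng)
```

Because a shorter geohash is a prefix of a longer one for the same point, storing one geohash field per length (such as `geohash_lv8`) lets the query pick the grid size that matches the current zoom level.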
In this way, we can display each cluster, with its count, at its centroid very quickly. However, the result looks bad: the clusters line up as if standing in a queue when the real ticket distribution is even. Also, in some parts of the map two clusters are very close to each other, but because they belong to different grids they are shown as separate clusters no matter how close they really are.
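The boundary problem can be reproduced with any fixed grid. The Python sketch below uses a plain latitude/longitude grid instead of geohash to keep it short; the cell size and the function name `grid_cell` are hypothetical and chosen only for illustration.

```python
import math

CELL_SIZE_DEG = 0.001  # hypothetical cell size in degrees, for illustration only

def grid_cell(lat, lng, size=CELL_SIZE_DEG):
    """Return the (row, column) index of the fixed grid cell containing a point."""
    return (math.floor(lat / size), math.floor(lng / size))

# Two tickets about two meters apart, on opposite sides of a cell boundary.
a = (39.16999, -86.5495)
b = (39.17001, -86.5495)
print(grid_cell(*a), grid_cell(*b))
```

The two points fall in different rows of the grid, so a grid-based approach reports them as two clusters even though they are nearly on top of each other.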
Distance-based clustering [12] on the server side can solve the problems of the grid-based approach. The clusters are never laid out in a regular pattern, and the distance between clusters is guaranteed. The distance-based algorithm is performed in SQL. In the tickets table, we added new fields “cluster_id_lv0”, “cluster_id_lv1”, …, “cluster_id_lv6”. As with geohash length, clustering levels correspond to zoom levels in Google Maps. We only need to consider zoom levels 10 to 21, since we are only handling data in the Bloomington area. For each clustering level, we assign a cluster id to each ticket. To calculate distances, we used the Haversine formula [11].
    for t ∈ tickets do
        select the tickets tics within r distance to t;
        get all the clusters containing a ticket in tics;
        calculate the centroids of the clusters;
        if the centroid of the closest cluster c is within r distance to t
            assign c’s cluster id to t;
        else
            assign a new cluster id to t.
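The loop above can be sketched in Python as follows. This is an in-memory illustration, not the SQL used in the project: the function names, the kilometer unit for r, and the use of a simple arithmetic mean of latitude and longitude as the centroid are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two lat-lng points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def cluster_tickets(tickets, r_km):
    """Assign a cluster id to each (lat, lng) ticket for one clustering level."""
    cluster_ids = [None] * len(tickets)
    members = {}  # cluster id -> indices of the tickets assigned to it so far
    next_id = 0
    for i, (lat, lng) in enumerate(tickets):
        # Clusters containing an already-assigned ticket within r of t.
        nearby = {
            cluster_ids[j]
            for j, (lat2, lng2) in enumerate(tickets)
            if cluster_ids[j] is not None
            and haversine_km(lat, lng, lat2, lng2) <= r_km
        }
        # Find the nearby cluster whose centroid is closest to t.
        best_id, best_dist = None, None
        for cid in nearby:
            pts = [tickets[j] for j in members[cid]]
            c_lat = sum(p[0] for p in pts) / len(pts)  # simple mean as centroid
            c_lng = sum(p[1] for p in pts) / len(pts)
            d = haversine_km(lat, lng, c_lat, c_lng)
            if best_dist is None or d < best_dist:
                best_id, best_dist = cid, d
        if best_id is not None and best_dist <= r_km:
            cid = best_id
        else:
            cid = next_id  # no cluster close enough: start a new one
            next_id += 1
        cluster_ids[i] = cid
        members.setdefault(cid, []).append(i)
    return cluster_ids

# Made-up sample: two tickets a few meters apart and one far away.
tickets = [(39.1650, -86.5260), (39.1651, -86.5261), (39.2500, -86.4000)]
print(cluster_tickets(tickets, r_km=0.5))
```

Running the function once per clustering level, with a larger r for lower zoom levels, would yield one cluster id per level for each ticket, matching the “cluster_id_lv0” through “cluster_id_lv6” fields described above.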