Skip to content

Aggregations and Visualisations

Gabriel Martin edited this page Jul 5, 2018 · 7 revisions

Aggregations

General terminology

Rows

A row is just a dataset row.

Bucket

A bucket is a collection of rows, grouped on a particular column (let's call this the bucket-column). For each bucket, all rows within that bucket will have the same value for the bucket-column.

We currently only support "exact match" buckets, where all the rows in a bucket have exactly the same value in the bucket-column. We could in the future support different ways of dividing up the buckets. For example, if the bucket-column was a number, we could define each bucket based on an interval (e.g. 0-5, 5-10, 10-16). In that case, each row in a bucket would have a bucket-column value within that bucket's interval. If the bucket-column were a date type, we could support buckets based on months or years.

Metric column

A bucket can have one or more "metric columns". A metric-column is a column with a value we are interested in visualising (or otherwise analysing). This can be almost anything, but a common example might be "milligrams of fluoride per liter of water" or "temperature".

The value of the metric-column can be different for each row in a bucket. In order to display a value for the metric-column for the entire bucket, we need some way of combining the different metric-column values of the rows into a single value.

Metric aggregation

A metric aggregatoin is a mathematical operation to convert the metric-column of a bucket into a single number.

We currently support two main types of metric aggregation.

In the first type of metric aggregation, we take the collection of metric-column values from the bucket rows and do a mathematical operation on that collection.

We currently support mean, max, min, median, sum, upper quartile and lower-quartile. We implement these operations in a generic mathematical way with no special logic.

In the second type of metric aggregation, we use the metric-column to count the number of rows in the bucket.

With "count", we count every row in the bucket that has any value in the metric-aggregation column. With "count unique", we count every unique value that occurs within the metric-column for a given bucket.

Missing values

In all kinds of metric-aggregations, we ignore missing values, because in the problem domain we work in, a missing value is inherently different from a zero value (e.g. "We don't have any data about the amount of fluoride in the water" is a very different statement than "There is zero fluoride in the wayer").

Chart types

Pie

Purpose

Pie and donut charts are identical aside from donut charts having a hole in the middle, se we can refer to both as "pie".

We use pie charts to show proportional relationships between buckets -- to answer the question "What proportion/percentage of the overall total does each bucket represent?".

Because pie charts are about the relationship between the overall total and each individual bucket, they only make sense with aggregations which can be added up to a total. This means it wouldn't make sense to show a pie chart where the metric aggregation was mean, max, min, median, upper-quartile or lower-quartile, because for each of these aggregation types, taking the value for each bucket and then adding up those values to get a total results in a nonsensical figure. For example, if you take a mean for each bucket, and then add up each bucket's mean to get a total, you don't get the mean for the entire dataset. You get a "sum of means" that, in the context of our visualisations, is meaningless.

That leaves count, count unique and sum as three metric-aggregation types where you can add up the value for each bucket to get a total which is valid. The total of "count" would be the count for the metric-column for the entire dataset, the total of "sum" would be the sum for the entire metric-column for the entire dataset, and the total of "count unique" would be the numer of unique values of the metric column for the entire dataset.

We currently only support "count" as the metric aggregation for pie-charts (and in fact we don't even show a metric-aggregation menu to the user).

We could support "count unique". We could also support "sum", but only in cases where the metric-column is a number.

Because we currently only support "count", the aggregation we perform on the backend is simple: group the rows by the bucket-column, perform a COUNT for each group, and then return the collection of those counts to the client.

Color

Color is used to help differentiate each wedge, and to link each wedge to its entry in the legend, when the legend is showing

Corner cases

Too many wedges

When there are a large number of buckets (i.e. a large number of wedges in the pie), the chart becomes impractical to read.

We currently ameliorate this problem by i) not showing labels for very small wedges, and ii) not showing the chart at all when there are more wedges than some maximum (currently set at 50).

We could go further in the following way: Define some maximum number of wedges (n). When there are more than n wedges in the pie, show the (n-1) largest wedges, and then show a final wedge labelled "other" that shows the sum of the values of all non-displayed wedges in the pie. In this way, we could support very large numbers of buckets.

Dates and numbers

The current system of exact-match buckets works reasonably well when the bucket-column is a text type. When the bucket-column is a number or (especailly) a date type, there will usually be too many unique values in the bucket-column, resulting in too many buckets, resulting in too many wedges in the pie. (In the worst case, every row in the datset would be its own bucket).

I think the best way to tackle this would be to support interval-based buckets in a generic way (i.e. common to all visualsiation types).

Clone this wiki locally