Aggregations and Visualisations
A row is just a dataset row.
A bucket is a collection of rows, grouped on a particular column (let's call this the bucket-column). For each bucket, all rows within that bucket will have the same value for the bucket-column.
We currently only support "exact match" buckets, where all the rows in a bucket have exactly the same value in the bucket-column. We could in the future support different ways of dividing up the buckets. For example, if the bucket-column was a number, we could define each bucket based on an interval (e.g. 0-5, 5-10, 10-16). In that case, each row in a bucket would have a bucket-column value within that bucket's interval. If the bucket-column were a date type, we could support buckets based on months or years.
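As a rough illustration of how interval-based buckets could work for a numeric bucket-column (a hypothetical sketch of the proposal, not current behaviour; the `bucket_interval` helper is an invented name, and uniform interval widths are assumed):

```python
import math

def bucket_interval(value, width):
    """Assign a numeric value to an interval bucket of the given width.

    Returns the (inclusive) lower bound of the interval, so with
    width 5 the value 7 falls in the 5-10 bucket and is labelled 5.
    """
    return math.floor(value / width) * width

# Group rows into interval buckets on a numeric bucket-column.
rows = [2, 4, 7, 9, 12]
buckets = {}
for v in rows:
    buckets.setdefault(bucket_interval(v, 5), []).append(v)
# buckets → {0: [2, 4], 5: [7, 9], 10: [12]}
```

Date-based buckets (months, years) would work the same way, with the interval function truncating a timestamp instead of dividing a number.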
A bucket can have one or more "metric columns". A metric-column is a column with a value we are interested in visualising (or otherwise analysing). This can be almost anything, but a common example might be "milligrams of fluoride per liter of water" or "temperature".
The value of the metric-column can be different for each row in a bucket. In order to display a value for the metric-column for the entire bucket, we need some way of combining the different metric-column values of the rows into a single value.
A metric aggregation is a mathematical operation to convert the metric-column of a bucket into a single number.
We currently support two main types of metric aggregation.
In the first type of metric aggregation, we take the collection of metric-column values from the bucket rows and do a mathematical operation on that collection.
We currently support mean, max, min, median, sum, upper quartile and lower-quartile. We implement these operations in a generic mathematical way with no special logic.
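The first type of aggregation can be expressed directly in terms of standard mathematical operations; a minimal Python sketch under that assumption (the dispatch-table shape and function names here are illustrative, not the actual implementation, and the quartiles use the median-exclusive method):

```python
import statistics

def quartile(values, q):
    """Return the q-th quartile (q=1 lower, q=3 upper),
    using the median-exclusive method."""
    return statistics.quantiles(sorted(values), n=4)[q - 1]

# Generic operations with no special logic, keyed by aggregation name.
AGGREGATIONS = {
    "mean": statistics.mean,
    "max": max,
    "min": min,
    "median": statistics.median,
    "sum": sum,
    "upper quartile": lambda vs: quartile(vs, 3),
    "lower quartile": lambda vs: quartile(vs, 1),
}

def aggregate(metric_values, kind):
    """Reduce a bucket's metric-column values to a single number."""
    return AGGREGATIONS[kind](metric_values)

aggregate([1, 2, 3, 4], "mean")  # → 2.5
```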
In the second type of metric aggregation, we use the metric-column to count the number of rows in the bucket.
With "count", we count every row in the bucket that has any value in the metric-aggregation column. With "count unique", we count every unique value that occurs within the metric-column for a given bucket.
In all kinds of metric-aggregations, we ignore missing values, because in the problem domain we work in, a missing value is inherently different from a zero value (e.g. "We don't have any data about the amount of fluoride in the water" is a very different statement than "There is zero fluoride in the water").
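The counting aggregations, with missing values excluded, can be sketched as follows (a simplified in-memory illustration, with `None` standing in for a missing value):

```python
def count(metric_values):
    """Count rows that have any (non-missing) value in the metric column."""
    return sum(1 for v in metric_values if v is not None)

def count_unique(metric_values):
    """Count distinct non-missing values in the metric column."""
    return len({v for v in metric_values if v is not None})

values = [1.2, None, 1.2, 0.8, None]
count(values)         # → 3
count_unique(values)  # → 2
```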
Pie and donut charts are identical aside from donut charts having a hole in the middle, so we can refer to both as "pie".
We use pie charts to show proportional relationships between buckets -- to answer the question "What proportion/percentage of the overall total does each bucket represent?".
Because pie charts are about the relationship between the overall total and each individual bucket, they only make sense with aggregations which can be added up to a total. This means it wouldn't make sense to show a pie chart where the metric aggregation was mean, max, min, median, upper-quartile or lower-quartile, because for each of these aggregation types, taking the value for each bucket and then adding up those values to get a total results in a nonsensical figure. For example, if you take a mean for each bucket, and then add up each bucket's mean to get a total, you don't get the mean for the entire dataset. You get a "sum of means" that, in the context of our visualisations, is meaningless.
That leaves count, count unique and sum as three metric-aggregation types where you can add up the value for each bucket to get a total which is valid. The total of "count" would be the count for the metric-column for the entire dataset, the total of "sum" would be the sum for the entire metric-column for the entire dataset, and the total of "count unique" would be the number of unique values of the metric column for the entire dataset.
We currently only support "count" as the metric aggregation for pie-charts (and in fact we don't even show a metric-aggregation menu to the user).
We could support "count unique". We could also support "sum", but only in cases where the metric-column is a number.
Because we currently only support "count", the aggregation we perform on the backend is simple: group the rows by the bucket-column, perform a COUNT for each group, and then return the collection of those counts to the client.
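In SQL terms this is a plain GROUP BY with COUNT; an equivalent in-memory sketch (assuming rows are dicts and that rows with a missing bucket-column value are skipped — the real backend runs this as a database query, not application code):

```python
from collections import Counter

def pie_counts(rows, bucket_column):
    """Group rows by the bucket-column and count the rows in each bucket,
    skipping rows where the bucket-column value is missing."""
    return Counter(
        row[bucket_column]
        for row in rows
        if row.get(bucket_column) is not None
    )

rows = [
    {"region": "North"},
    {"region": "South"},
    {"region": "North"},
    {"region": None},
]
pie_counts(rows, "region")  # → Counter({'North': 2, 'South': 1})
```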
Color is used to help differentiate each wedge, and to link each wedge to its entry in the legend, when the legend is showing.
When there are a large number of buckets (i.e. a large number of wedges in the pie), the chart becomes impractical to read.
We currently ameliorate this problem by i) not showing labels for very small wedges, and ii) not showing the chart at all when there are more wedges than some maximum (currently set at 50).
We could go further in the following way: Define some maximum number of wedges (n). When there are more than n wedges in the pie, show the (n-1) largest wedges, and then show a final wedge labelled "other" that shows the sum of the values of all non-displayed wedges in the pie. In this way, we could support very large numbers of buckets.
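This proposed "other" wedge could be computed like so (a sketch of the proposal, not existing behaviour; `collapse_wedges` is an invented name):

```python
def collapse_wedges(wedges, n):
    """Keep the (n-1) largest wedges and fold the rest into an 'other'
    wedge holding the sum of their values.

    `wedges` maps wedge label -> value. If there are n or fewer wedges,
    they are returned unchanged.
    """
    if len(wedges) <= n:
        return dict(wedges)
    ranked = sorted(wedges.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[: n - 1])
    kept["other"] = sum(value for _, value in ranked[n - 1:])
    return kept

collapse_wedges({"a": 10, "b": 5, "c": 2, "d": 1}, 3)
# → {'a': 10, 'b': 5, 'other': 3}
```

Note that this only works for aggregations where summing bucket values is valid (count, count unique, sum), which is consistent with the constraint on pie charts described above.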
The current system of exact-match buckets works reasonably well when the bucket-column is a text type. When the bucket-column is a number or (especially) a date type, there will usually be too many unique values in the bucket-column, resulting in too many buckets, resulting in too many wedges in the pie. (In the worst case, every row in the dataset would be its own bucket).
I think the best way to tackle this would be to support interval-based buckets in a generic way (i.e. common to all visualisation types).
Line and area charts are identical apart from minor display options, so we can refer to both as "line".
We use line charts to show the change in some variable (let's call it the primary variable) in relation to the change in some other continuous variable (let's call it the secondary variable). The secondary variable is most often time, but it can also be a number.
The secondary variable is continuous in the sense that it spans some inherently ordered, subdividable range. This will usually be time (subdividable in the sense that you can divide a day, hour, minute etc. into smaller units) but could also be something like temperature or fluoride level. It would not make sense to use a discrete value, like "region number", as the secondary variable. ("Discrete" in the sense that if you have "Region 1" and "Region 2" it doesn't make sense to talk about "Region 1.02052").
We currently support showing the primary variable on the y-axis and the secondary variable on the x-axis. We could in theory support flipping the chart by 90 degrees, such that the line runs from top-to-bottom rather than from left-to-right, but in practice I am not aware of any requests for this feature and I don't recall seeing a chart like this in-the-wild.
Unlike most other charts, line charts do not need to be bucketed to be displayed (so long as the user selects a number-type as the primary variable). On the backend we select all rows in the dataset, order them by the secondary variable, and return ordered arrays of both the primary and secondary variables to the client.
On the client, we combine those arrays into an ordered array of datapoints, and then for each datapoint, draw a point on the chart at the corresponding x-axis and y-axis locations. We then draw a straight line between each ordered point.
A metric-aggregation is only needed in the case where two or more datapoints have the same secondary-variable value - that is, the same position on the x-axis. By default, these points will be connected by a vertical line, creating an odd appearance.
We resolve this by allowing the user to select a metric aggregation. We do not let them choose a bucket-column, because the system already knows to use the secondary-variable (the x-axis) as the bucket column. Using any other column as the bucket-column would not make sense. If the user needs to use a different column as the bucket-column, it is likely that a different visualisation type would be more appropriate for them.
We do let the user choose a metric aggregation. All the metric aggregation types make a degree of sense. "Count" and "Count Unique" are less likely to be useful, but could conceivably be useful in the case where each row represents a reading, and the user wishes to visualise "readings per day" or similar.
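Collapsing datapoints that share an x-axis (secondary-variable) value with the chosen metric aggregation can be sketched as follows (an illustrative in-memory version, assuming numeric points; the function name is invented):

```python
from collections import defaultdict
from statistics import mean

def collapse_duplicate_x(points, aggregate=mean):
    """Bucket (x, y) points by their x value (the secondary variable),
    reduce each bucket's y values with the chosen metric aggregation,
    and return the resulting points ordered by x."""
    buckets = defaultdict(list)
    for x, y in points:
        buckets[x].append(y)
    return [(x, aggregate(ys)) for x, ys in sorted(buckets.items())]

collapse_duplicate_x([(1, 2.0), (1, 4.0), (2, 5.0)])
# → [(1, 3.0), (2, 5.0)]
```

With this in place there is at most one point per x value, so the vertical-line artefact described above cannot occur.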
In the case where there is a large number of rows, we support random sampling for line charts. We currently limit the number of points returned to the client to 2500. This number can be adjusted trivially.
When the number of rows is higher than the limit, the backend runs the aggregation query over the entire dataset, then orders the results randomly, then selects the first 2500 results from the randomly-ordered aggregated results, and then orders them by the secondary-variable (x-axis) once more.
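These sampling steps can be sketched as below (the limit of 2500 matches the current setting; the real implementation runs inside the database query, not in application code like this):

```python
import random

def sample_points(points, limit=2500, rng=random):
    """Randomly sample up to `limit` (x, y) points, then restore the
    x-axis ordering so the line still draws left to right."""
    if len(points) <= limit:
        return sorted(points)
    shuffled = list(points)
    rng.shuffle(shuffled)          # order randomly
    return sorted(shuffled[:limit])  # take the first `limit`, re-order by x

# Whatever the random draw, the returned sample is ordered by x.
sampled = sample_points([(x, x * 2) for x in range(10_000)], limit=100)
len(sampled)  # → 100
```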
The need for sampling comes from the fact that, without it, i) the aggregated response from the backend can be very large, and ii) visualising a large number of points on the client can be very slow.
The rationale behind random sampling is that, given the purpose of a line chart is to observe the general pattern of change between two variables, when there is a very large number of points, a random sample should display an extremely similar pattern to the overall pattern.
An alternative would be to support interval-based bucketing on the x-axis. For example, when the x-axis is a date type (most common), allowing the user to bucket by day, week, month or year would reduce the number of points without sampling. As mentioned previously, I think this would best be achieved by supporting interval-based bucketing in a generic way.
We currently only support one primary (y-axis) variable. In the future, we could consider supporting multiple y-axis variables for the same x-axis.
One way would be to support more than one y-axis column. For example, the user might select the y-axis column "Number of functional hand-dug wells" and then another y-axis column "Number of functional machine-dug wells", and then the x-axis column "Survey date", resulting in a chart with one line for hand-dug-wells and another for machine-dug-wells over the same time period.
Although we allow the user to pick a custom color, we don't currently use color to convey information. If we supported multiple series on a single chart, we would use color to identify the different series.
Originally, we would display a chart as soon as the user selected a y-axis column (i.e. before there was an x-axis column). We did this by using the implicit dataset row number as the x-axis value.
We continue to support displaying these charts for the sake of backwards compatibility, but it is extremely hard to imagine a scenario where the resulting chart is actually useful, because the implicit dataset row number is extremely hard to interpret meaningfully when looking at a chart. We no longer display such charts in the visualisation editor to discourage users creating them, but we do display them in shared dashboards and visualisations.
When the user selects a text column as the y-axis metric column, we need to convert that text column to a numeric value to be able to show it on a chart. We do this by always using the aggregation "count" on the backend when the y-axis metric column is a text type, regardless of what the user chooses as the aggregation metric. This is dangerous: if a user selects a different aggregation type, it will be ignored and the resulting chart will be misleading. We should either change the UI of the visualisation editor to disallow the selection of a different metric aggregation when the y-axis metric-column is a text type, or leave the UI as is but show an error message if any metric aggregation other than "count" is chosen.
The purpose of a scatter chart is to compare datapoints across two measurement axes at the same time.
When there are a small number of points, the user may be interested in both i) the overall pattern of the points, and ii) the individual data for each point.
When there are a large number of points, it is hard to see anything other than the overall pattern, although the user may still be interested in inspecting data for individual points (e.g. if a point is an outlier).
Both scatter charts and line charts involve plotting the position of a series of datapoints on two numeric axes. One difference is that points in a line chart are inherently ordered (for Lumen, they are ordered by their x-axis values) whereas points in a scatter chart have no order.
At a higher level, line charts often answer the question "How does Y change as X changes?", whereas scatter charts often answer the question "How do Y and X relate to each other?".
Like line charts, a scatter chart will usually make the most sense when both the X and Y axis represent continuous axes. If the x-axis is discrete rather than continuous, a bar graph may be a more appropriate choice.
For legacy reasons, we allow both the x-axis column and the y-axis column to be date types.
It is hard to think of any valid use-case for the y-axis column to be a date type. We should consider removing this functionality.
It is also hard to think of use cases where it makes sense for the x-axis column to be a date type. If the user is interested in the change of the y-axis variable over time, it is likely that a line chart would be more appropriate for that purpose than a scatter chart.
For scatter charts, the user can select a "label" column. The value in this column will be displayed for each datapoint when the chart is in interactive mode.
Like line charts, we do not require scatter charts to be bucketed. When the user has selected a metric column for both the y-axis and the x-axis, on the backend we select those column values for all rows in the dataset and return the results to the client.
Optionally, the user can select a bucket-column. Unlike a line chart, this can be any column.
The user can also select any metric-aggregation.
When a bucket-column is chosen, on the backend we select all rows grouped by the bucket column, then calculate an x-value and a y-value for each bucket using the metric aggregation specified.
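The bucketed scatter aggregation can be sketched as follows (an illustrative in-memory version assuming rows are dicts; the same metric aggregation is applied to both axes, matching the current behaviour, and missing metric values are ignored as described earlier):

```python
from collections import defaultdict
from statistics import mean

def scatter_buckets(rows, bucket_col, x_col, y_col, aggregate=mean):
    """Group rows by the bucket-column and reduce each bucket's x and y
    metric-column values to a single (bucket, x, y) point. The bucket
    name serves as the point's label."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[bucket_col]].append(row)
    points = []
    for bucket, members in grouped.items():
        xs = [r[x_col] for r in members if r[x_col] is not None]
        ys = [r[y_col] for r in members if r[y_col] is not None]
        points.append((bucket, aggregate(xs), aggregate(ys)))
    return points

rows = [
    {"region": "N", "fluoride": 1.0, "copper": 2.0},
    {"region": "N", "fluoride": 3.0, "copper": 4.0},
    {"region": "S", "fluoride": 5.0, "copper": 6.0},
]
scatter_buckets(rows, "region", "fluoride", "copper")
# → [('N', 2.0, 3.0), ('S', 5.0, 6.0)]
```

Supporting different aggregations per axis, as discussed above, would only require passing two `aggregate` arguments instead of one.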
In theory, it would be possible to use two different metric aggregations for the x-axis and the y-axis. For example, the user might want to group by "region", and then draw a scatter graph of "mean fluoride per ml" vs "median copper per ml". We do not support this currently. In my opinion, this is the kind of feature we should not consider adding unless we have specific user requests for it.
When buckets are used, the label-column is ignored and the name of each bucket is used as the label instead.
Like line charts, we use sampling for scatter charts with more than 2500 points for performance reasons.
Currently, the color of the points is user-selectable, but it is not used to convey any information.
In the future, color could be used to show categories of points. This would be similar to the way we use colors for points in maps - the user selects a column to be used for color, and then we assign a color to each unique value in that column. We show a legend and let the user change the color mapping if they want to.
We couldn't use that system when the user is using buckets in their scatter chart, but we could assign a unique color to each bucket, which would look similar from a user perspective.