line aggregation not working correctly #165
Comments
After looking through things here I'm wondering if this is related to #144? (cc @alexbfree @chrislintott) These marks (in this project) are overlapping by request, but I also don't understand why that should be a problem, as they are very different from each other. If I plot raw marks for any of these by, e.g., [origin, r, theta] or [slope, intercept, length], they separate pretty well into 2 different areas of parameter space. So for reasonable choices of distance cutoff, a hierarchical clustering algorithm ought to do the trick.

I've tested this by applying my own hierarchical clustering, selecting the cutoff distance for defining a cluster as the maximum distance permitted while still requiring that the number of markings registered does not exceed the total number of classifications. That's slightly different from requiring that a cluster not contain 2 marks from the same user, but I did that because we know we have a (small but) non-zero duplication rate, and I haven't removed duplicates yet.

Taking the last example shown above, here's how that comes out:

In my spot-checking, there's still the occasional case where either the bar length or the bar width has more than 1 clustered result with p_true_positive > 0.3 (where I'm defining that as the number of marks in the cluster divided by the number of classifications where the user made any marks), but it's ~0.5% of the total number of subjects with at least 2 detected clusters above that threshold, so I think that's quite good. This has made me think that:
No problems with any of the others, either, and I checked across the range of p_bar values within the set of ~500 subjects where the aggregations reported a single line with "NA" for the median/mean probability. So I'm left thinking that this is really a bug rather than a choice of method. If the clustering depends on having no duplicate marks from the same user in the same cluster, and given that some of the early stages of live Panoptes had much higher duplicate rates, could that be causing a problem here?
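
For anyone wanting to reproduce the kind of test described in the comments above, here is a minimal sketch, not the code actually used. It assumes each mark is a point in a feature space such as [slope, intercept, length], uses plain Euclidean distance with single-linkage clustering (the thread does not say which linkage or metric was used), and reads the cutoff rule as "no cluster may hold more marks than there are classifications". All function names are hypothetical.

```python
from itertools import combinations
from math import dist


def single_linkage_labels(points, cutoff):
    """Cut a single-linkage clustering at `cutoff`: marks connected by a
    chain of pairwise distances <= cutoff end up with the same label."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= cutoff:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


def largest_allowed_cutoff(points, n_classifications):
    """Largest pairwise distance that can serve as the cutoff while every
    cluster still holds at most `n_classifications` marks (assumed rule)."""
    candidates = sorted({dist(p, q) for p, q in combinations(points, 2)})
    best = 0.0
    for cutoff in candidates:
        labels = single_linkage_labels(points, cutoff)
        largest = max(labels.count(label) for label in set(labels))
        if largest > n_classifications:
            break  # cluster sizes only grow as the cutoff grows
        best = cutoff
    return best


def cluster_fractions(labels, n_classifications_with_marks):
    """p_true_positive per cluster: marks in the cluster divided by the
    number of classifications in which the user made any marks."""
    sizes = {}
    for label in labels:
        sizes[label] = sizes.get(label, 0) + 1
    return {label: size / n_classifications_with_marks
            for label, size in sizes.items()}


# Made-up marks as (slope, intercept, length): two well-separated groups.
marks = [(0.0, 0.0, 10.0), (0.1, 0.0, 10.0), (0.0, 0.1, 10.0),
         (5.0, 5.0, 2.0), (5.1, 5.0, 2.0), (5.0, 5.1, 2.0)]
cutoff = largest_allowed_cutoff(marks, n_classifications=3)
labels = single_linkage_labels(marks, cutoff)
fractions = cluster_fractions(labels, n_classifications_with_marks=3)
```

Because the size limit is applied per cluster rather than per user, duplicate marks from one user do not make the rule unsatisfiable, which is the reason given above for choosing it before duplicates are removed. The features would normally need rescaling first so that no single axis dominates the distance.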
Just started going through the aggregations for GZ Bar Lengths (project_id == 3). In early spot-checking I came across this subject (493223):
which has 49 line drawings in the relevant workflow (workflow_id = 3, workflow_version = 56.13). Here they are in raw form:
I can't show the aggregated lines because the aggregations report no lines for this subject (despite overall reporting 21,554 aggregated line markings for 7,716 subjects).
Something's gone wrong and I'm not sure what because I haven't checked many other subjects, but I wanted to report this right away.
Really hope this is minor and just because I'm doing something wrong.