Potential mistake in Active Learning (Acquisition.py) #20
Comments
@danny911kr @ddelange @yuchenlin @mshen2 @frankxu2004 it's been seven months and this seems like an application-breaking bug; do you still care about this project?
Hi @rodrigomeireles 👋 I think this project is in the hands of the community. My contributions in this repo, for instance, were just enough to get our project rolling. I think PRs are welcome and will be reviewed by the repo maintainers 👍
Ah, I have to amend: my link above is for doing active learning with Doccano itself, as this fork had diverged too much, if I remember correctly.
Hey @ddelange, I think I fixed the bug in this issue (and some others related to your Dockerfile), but the application always breaks when I try to turn on suggestions. The terminal spams:

Now, I might be aiming a little too high here, but would it be possible for you to help me debug this? You seem very knowledgeable about the stack and... I'm not. There is a fork in my repo which, although it fixes this issue, I didn't consider stable for the reason above and haven't submitted yet. Sorry if this sounds a little cheeky, but I'm new to contributing to public repos and don't quite know the best way to ask for help in these cases.
FWIW, you'll probably end up saving man-hours by going with a more actively maintained, or even managed/proprietary, alternative. We spent some money with the spaCy team some time ago, for instance; their suite seemed nice :)
I believe there is a bug in this line in acquisition.py (which is used to rank and fetch samples based on the confidence score of your model).

Let me explain:

1. `sort_info = data['sort_info']` returns a tuple describing the reshuffling that happened in the step above. For example, a tuple of the form `(3, 0, 2, 1)` tells us that the first element in this batch was in fact the 4th one in the original dataset, the second one was the first, and so on.
2. `probscores.extend(list(norm_scores[np.array(sort_info)]))` is meant to reshuffle the probability/confidence scores back so that they respect the original ordering, not the new, length-based ordering used within each batch.

The issue is that (unless I am missing something obvious) `norm_scores[np.array(sort_info)]` is not what we want. Let me explain with an example:

- `sentences = [["Hello", "World"], ["This", "is", "a", "big", "sentence"], ["Hello", "World", "."]]`
- Sorting by length gives `ordered_sentences = [["This", "is", "a", "big", "sentence"], ["Hello", "World", "."], ["Hello", "World"]]`, and hence `sort_info = (1, 2, 0)`.
- Say the model outputs `norm_scores = [0.1, 0.2, 0.3]`, i.e. a score of 0.1 for `["This", "is", "a", "big", "sentence"]`, 0.2 for `["Hello", "World", "."]`, and 0.3 for `["Hello", "World"]`.
- The current code, `list(norm_scores[np.array(sort_info)])`, reshuffles this to `[0.2, 0.3, 0.1]`. Mapped back onto the original dataset, that assigns 0.2 to `["Hello", "World"]`, 0.3 to `["This", "is", "a", "big", "sentence"]`, and 0.1 to `["Hello", "World", "."]`, which is not what the model actually predicted.

The root of the problem is that `sort_info` holds the indices (via `argsort`) that produce the sorted array; it does not hold the indices required to unshuffle it. In essence, what we need is the inverse permutation. One proposed fix is to instead compute `inverse_sorting = [sort_list.index(i) for i in range(len(sort_list))]` and then use `list(norm_scores[np.array(inverse_sorting)])`. In the example above, `inverse_sorting = [2, 0, 1]`, which yields `[0.3, 0.1, 0.2]`: exactly what we want in the original ordering (0.3 for `["Hello", "World"]`, 0.1 for `["This", "is", "a", "big", "sentence"]`, and 0.2 for `["Hello", "World", "."]`).

I stumbled on this error by noticing that sentences that were exactly the same would be given different confidence scores by the model (because of the mistake in undoing the reshuffle). Nevertheless, the example above should suffice.
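The argument above can be checked with a small runnable sketch. This is not the actual acquisition.py code, just a minimal reproduction assuming `sort_info` comes from `np.argsort` over the (negated) sentence lengths; note that `np.argsort(sort_info)` is an equivalent, vectorized way to get the inverse permutation:

```python
import numpy as np

sentences = [["Hello", "World"],
             ["This", "is", "a", "big", "sentence"],
             ["Hello", "World", "."]]

# Sort by length, longest first, as the batching step does.
sort_info = np.argsort([-len(s) for s in sentences])  # array([1, 2, 0])

# Scores the model produced in the *sorted* order.
norm_scores = np.array([0.1, 0.2, 0.3])

# Current code: indexes with the sorting permutation itself.
buggy = norm_scores[sort_info]

# Fix: index with the inverse permutation instead.
inverse_sorting = np.argsort(sort_info)               # array([2, 0, 1])
fixed = norm_scores[inverse_sorting]

print(list(buggy))  # [0.2, 0.3, 0.1] -- wrong in the original order
print(list(fixed))  # [0.3, 0.1, 0.2] -- matches the original sentences
```

The list-comprehension fix from the issue (`[sort_list.index(i) for i in range(len(sort_list))]`) computes the same inverse permutation, just in O(n²) instead of O(n log n).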