Update the legends in Stave to show Disease, Medical, etc. #53

Open
Leolty opened this issue Jul 19, 2022 · 13 comments · May be fixed by #54

Comments

@Leolty
Collaborator

Leolty commented Jul 19, 2022

As mentioned in the meeting.

@Leolty
Collaborator Author

Leolty commented Jul 19, 2022

Besides Disease and Medical, what other annotations can we add?

BioBERT: https://github.com/dmis-lab/biobert

@hunterhector
Member

It should be based on the ner_type values we can predict; we need to study the model outputs.

@Leolty
Collaborator Author

Leolty commented Jul 19, 2022

Yeah, here is the problem. In the config of this example, ner_type is set to Disease, so all the model outputs are labelled Disease. If I remove this setting and run the pipeline, all the outputs are labelled "BioEntity"; see the default configuration here (Line 235).

I could not find any instructions on how to make the pipeline emit different entity types instead of labelling every entity "BioEntity".

@Leolty
Collaborator Author

Leolty commented Jul 19, 2022

@hunterhector I think I found the problem. It is in the following file:

https://github.com/asyml/forte-wrappers/blob/main/src/huggingface/fortex/huggingface/bio_ner_predictor.py

I checked the source code for BioBERTProcessor and noticed that the relationship between Line 235 and Line 228 does not seem to make sense. It simply labels every entity as "BioEntity", and if I change the configuration to "DISEASE", every entity is labelled "DISEASE" instead. In fact, I can set it to whatever I want.

For example, I changed the configuration to "APPLE", like this: ner_type: "APPLE". All the entities were then labelled "APPLE".
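
The behaviour described in this comment can be illustrated with a minimal sketch. This is not the actual BioBERTProcessor code: `label_entities` and the example spans are hypothetical, and the sketch only assumes that the configured `ner_type` string is copied onto every detected span, whatever the model predicted.

```python
from typing import Dict, List


def label_entities(spans: List[str], config: Dict[str, str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in for the processor: every span gets the configured label."""
    # Assumed default, mirroring the "BioEntity" default mentioned in the thread.
    ner_type = config.get("ner_type", "BioEntity")
    return [{"text": span, "ner_type": ner_type} for span in spans]


# Made-up example spans, for illustration only.
spans = ["diabetes", "aspirin"]
default_labels = label_entities(spans, {})
apple_labels = label_entities(spans, {"ner_type": "APPLE"})
print(default_labels)
print(apple_labels)
```

Under that assumption the label carries no information from the model, which is why any string, including "APPLE", is accepted.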

@hunterhector
Member

  1. We have more NER detectors; I think @Piyush13y knows where they are.
  2. The adjustable label type is just a simple way to change the label based on the model. It is not the best solution, but it kind of works for now.

@Leolty
Collaborator Author

Leolty commented Jul 19, 2022

Got it. I remember I previously worked on an issue to support bio NER using stanza; I will try that.

@Leolty
Collaborator Author

Leolty commented Jul 19, 2022

I tried stanza, and the ner_type values of the outputs are as follows:

  • TEST: oxygen saturation/ MRI of the head
  • PROBLEM: an underlying restrictive ventilatory defect/ hydrocephalus/ shift of the normal midline strictures
  • TREATMENT: Lexapro /sublingual nitroglycerin

We could change the legends from Disease and Medical to Test, Problem and Treatment.
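
One way to turn these model labels into reader-friendly legend entries is a small lookup table. The sketch below is only an illustration: `LEGEND_NAMES` and `legend_for` are hypothetical names, and the display strings are assumptions, not something Stave or stanza defines.

```python
from typing import Dict

# Assumed mapping from the stanza ner_type values listed above to legend text.
LEGEND_NAMES: Dict[str, str] = {
    "TEST": "Test",
    "PROBLEM": "Problem",
    "TREATMENT": "Treatment",
}


def legend_for(ner_type: str) -> str:
    """Return the legend text for a model label, falling back to the raw label."""
    return LEGEND_NAMES.get(ner_type.upper(), ner_type)


examples = [
    ("MRI of the head", "TEST"),
    ("hydrocephalus", "PROBLEM"),
    ("Lexapro", "TREATMENT"),
]
legends = [(text, legend_for(label)) for text, label in examples]
for text, legend in legends:
    print(f"{legend}: {text}")
```

Unknown labels fall through unchanged, so a new model output type would still appear in the legend rather than being dropped.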

@hunterhector
Member

Yeah, double check with @Piyush13y, since I am sure we also have more spaCy models.

@Piyush13y
Collaborator

Yes, we have more scispacy models that we can use and they give out different kinds of NER labels.

image

Ref: https://allenai.github.io/scispacy/

@Leolty I feel we can't just change the label type, for the reason I mentioned to you on the call. We want users to see terms they understand in the legend, not NLP jargon; they wouldn't know what EntityMentions/MedicalEntityMentions mean. Also, adding more attributes (ner_type) to the same annotation would still require changes to the ontology file. We might as well create new annotations for each of the NER types for a smoother demo. At least, that's what I think, especially since it might not take much more time than the adjustable label type approach.

@Leolty
Collaborator Author

Leolty commented Jul 21, 2022

@hunterhector @Piyush13y
I found a bug related to Stave. It is quite small, but it kept me stuck for hours, so I will elaborate here.

We have a JSON file like this: https://github.com/asyml/ForteHealth/blob/50_streamlit_to_stave/examples/search_engine_to_stave/default_onto_project.json

In the code, we usually create a new project with: session.create_project(project_json)

The project is created successfully, but I cannot open the documents in it; the page keeps loading. So I went through .stave/db.sqlite3 and compared the ontology and config in the table stave_backend_project with the JSON file:

  1. First, I found that the JSON file uses double quotation marks, but in the database they have become single quotation marks. (I changed them with an SQL statement, but that did not help.)
  2. Then, comparing carefully, I found that the config in the JSON file uses true and false, but in the database they are stored as True and False. JSON requires true and false. (I changed them with an SQL statement, and it works perfectly fine!) I think that is the root cause, so the source code of create_project() should be modified.
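
Both symptoms match what happens when a Python dict is converted to text with `str()` instead of being serialized as JSON. The snippet below is a sketch with a made-up config (the key `is_enabled` is hypothetical, not taken from the project file); it does not show where in Stave the conversion happens.

```python
import json

# Made-up stand-in for part of a project config.
config = {"name": "demo", "is_enabled": True}

as_repr = str(config)         # Python repr: single quotes, True
as_json = json.dumps(config)  # JSON text: double quotes, true

print(as_repr)
print(as_json)

# The repr form is not valid JSON, so a JSON parser rejects it.
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False
print(repr_is_valid_json)
```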

@Leolty Leolty linked a pull request Jul 21, 2022 that will close this issue
@hunterhector
Member

Hi, @Leolty. Thanks for exploring this. It seems like you found an interesting bug, and I believe it is related to this function. Would you mind creating an issue on Stave to discuss the bug?

The fix could be simple (correcting the quotation marks and case before storing the value in the database), but I am still wondering about the cause and the best solution:

  1. Double vs. single quotation marks: you mentioned that changing this does not fix the problem. I think that is because it is only part of the problem, but it should also be fixed, right?
  2. "True" vs. "true": similar to the above, the JSON spec requires "true". But where does the conversion go wrong in both cases? The JSON file we provide seems to be correct, and create_project simply sends the data via POST. IMO, the best approach is to find out which conversion step causes this, and then find a principled solution from there. Post-fixing the data inside the create_project function should be our last resort.

@Leolty
Collaborator Author

Leolty commented Jul 25, 2022

Hi, @hunterhector. After checking the function you sent me, I think I know where the bug is. As you mentioned, create_project is correct and the JSON file is correct. The bug occurs when loading the JSON file.

In Python, we usually load a JSON file like this:

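import json

# json.load parses the file into a Python dict
with open(file_path) as file_obj:
    project_json = json.load(file_obj)

# Passing the dict directly is what triggers the bug
create_project(project_json)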

I passed project_json directly to create_project. Because project_json is a dict, it ends up stored with single quotation marks and "True".

So I just need to serialize it with json.dumps to solve this bug, for example:

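import json

# json.load parses the file into a Python dict
with open(file_path) as file_obj:
    project_json = json.load(file_obj)

# json.dumps turns the dict back into JSON text before sending it
create_project(json.dumps(project_json))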

So I think there is no need to modify the source code. We just need to make sure the argument passed to create_project is a JSON-formatted string (with double quotation marks and "true"/"false") instead of a dict.
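
A small guard can make that requirement explicit on the caller's side. The helper below is a sketch, not part of Stave: `to_json_text` is a hypothetical name, the example keys are made up, and it only assumes that create_project expects JSON text.

```python
import json
from typing import Any, Dict, Union


def to_json_text(project: Union[str, Dict[str, Any]]) -> str:
    """Return JSON text whether the caller passes a dict or a JSON string."""
    if isinstance(project, str):
        # Raises json.JSONDecodeError if the string is not valid JSON,
        # e.g. a Python repr with single quotes and True/False.
        json.loads(project)
        return project
    return json.dumps(project)


# Made-up project fragment, for illustration only.
project_dict = {"name": "demo", "multi_ontology": False}
payload = to_json_text(project_dict)
print(payload)
```

With such a helper, a call like session.create_project(to_json_text(project_json)) would behave the same whether project_json is a dict or already a JSON string.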

@hunterhector
Member

Sounds good, thanks!

3 participants