Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MongoDatabase client returns KeyError or fields with dictionary values. #7

Open
ipcamit opened this issue Jan 19, 2022 · 6 comments
Open
Labels
bug Something isn't working

Comments

@ipcamit
Copy link
Collaborator

ipcamit commented Jan 19, 2022

If mongo entry for a database is a dictionary the MongoClient.get_data tries to find source-val key, and raises KeyValue error on its absence. Therefore it fails for any dictionary without a KeyValue field.

See example below:

In [77]: client.get_data("properties",fields={"relationships"},query={"_id": "-7545167799866548821"})
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [77], in <module>
----> 1 client.get_data("properties",fields={"relationships"},query={"_id": "-7545167799866548821"})

File ~/Projects/COLABFIT/colabfit-tools/colabfit/tools/database.py:851, in MongoDatabase.get_data(self, collection_name, fields, query, ids, keep_ids, concatenate, ravel, verbose)
    849         if not missing:
    850             if isinstance(v, dict):
--> 851                 v = v['source-value']
    853             data[k].append(np.atleast_1d(v))
    855 if concatenate or ravel:

KeyError: 'source-value'
@ipcamit ipcamit added the bug Something isn't working label Jan 19, 2022
@jvita
Copy link
Member

jvita commented Jan 19, 2022

Just to make sure, if you do something like client.get_data("properties", fields=["relationships.configurations", "relationships.property_settings"], query={"_id": "-7545167799866548821"}) it returns {"relationships.configurations": ..., "relationships.property_settings": ...}, right?

I added the automatic un-packing of property fields using v = v['source-value'] because it enabled calls like get_data('properties', ['energy-forces-stress.energy']), which is better than get_data('properties', ['energy-forces-stress.energy.source-value']) and works well with dataset.aggregated_info['property_fields']. I agree though that get_data() should be able to return lists of dictionaries. Maybe I could change if isinstance(v, dict): to if isinstance(v, dict) and ('source-value' in v):? Or alternatively, adding something like an return_dicts argument and using if (not return_dicts) and isinstance(v, dict):? Personally, I like the argument option better.

@ipcamit
Copy link
Collaborator Author

ipcamit commented Jan 20, 2022

Just to make sure, if you do something like client.get_data("properties", fields=["relationships.configurations", "relationships.property_settings"], query={"_id": "-7545167799866548821"}) it returns {"relationships.configurations": ..., "relationships.property_settings": ...}, right?

Yes, that behaviour is working as expected. I was just concerned that if there is certain property which is just a dictionary, it might create unexpected errors. I think easiest would be to check if there is a 'source-value' field inside, and if not, return the whole dictionary? What do you think?

@jvita
Copy link
Member

jvita commented Jan 20, 2022

I like the idea of having 'source-value' being un-packed by default, but I want to make sure that the behavior is documented properly so that users will be aware that it happens. I think the best way to do this would be to add an argument like unpack_properties which will be True by default and will un-pack the 'source-value' field if it exists. Then users who want to avoid this behavior will know they can set unpack_properties=False.

jvita added a commit that referenced this issue Jan 20, 2022
@ipcamit
Copy link
Collaborator Author

ipcamit commented Jan 20, 2022

Yes, I think that would be most sensible.

We can also try something like process_results=True or raw_query=False by default. To indicate that incoming data is being processed by colabfit-tools somehow. And changing its value would mean now user can expect the raw results for them to sort out.

@jvita
Copy link
Member

jvita commented Jan 20, 2022

We can also try something like process_results=True or raw_query=False by default. To indicate that incoming data is being processed by colabfit-tools somehow. And changing its value would mean now user can expect the raw results for them to sort out.

That's another question that I've been struggling with: how much should we try to hide the Mongo API behind the MongoDatabase functions? As I've pointed out, the current implementation exposes a lot of the Mongo API to the user (which I think is a good thing). Because of this, a user can do pretty much anything they want by using the Mongo API directly instead of using MongoDatabase's functions which are basically just user-friendly wrappers. For something as simple as what raw_query would enable, I think it would make more sense to call database.<collection>.find(<query>) directly. Though I see how raw_query could be convenient for some people, I worry about duplicating too much functionality that already exists through the Mongo API.

@ipcamit
Copy link
Collaborator Author

ipcamit commented Jan 20, 2022

I think currently it is pretty well balanced overall. With get_data being the lowest level api, and get_dataset etc for actual use. We just need to emphasize use of get_configuration get_dataset etc in the tutorials.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants