sssom-py is much too slow #202

Closed
1 task
matentzn opened this issue Feb 7, 2022 · 6 comments
@matentzn
Collaborator

matentzn commented Feb 7, 2022

First we need to figure out why that is, i.e. which functions are inefficient, and then work on improving their efficiency. First goal:

  • A small document of 100 mappings should be processable in 1 second.
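
One way to make this goal checkable is a small timing harness that runs next to the tests. The following is a minimal sketch under stated assumptions: `process_mappings` is a hypothetical stand-in for whichever sssom-py operation is being measured (it is not part of the sssom-py API), and the 100 mappings are synthetic.

```python
import time


def process_mappings(mappings):
    """Hypothetical stand-in for the sssom-py operation under test."""
    return [dict(m, checked=True) for m in mappings]


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# 100 synthetic mappings, matching the document size named in the goal.
mappings = [
    {"subject_id": f"A:{i}", "predicate_id": "skos:exactMatch", "object_id": f"B:{i}"}
    for i in range(100)
]

processed, elapsed = time_call(process_mappings, mappings)
within_budget = elapsed < 1.0  # the 1-second goal stated above
print(f"processed {len(processed)} mappings in {elapsed:.4f}s")
```

Swapping the stand-in for a real parse or convert call would turn `within_budget` into a regression check for the 1-second target.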
@hrshdhgd
Contributor

hrshdhgd commented Feb 14, 2022

Here are my observations:

  1. I ran test_conversion.py because it took the most time of all the tests. The following is the CPU profiling graph (courtesy: Austin).
    [Screenshot: Screen Shot 2022-02-14 at 10 45 45 AM]

    As noted, as_rdf_graph takes a noticeable amount of time (~1.2 seconds) and is called at least 4 times within the test.

  2. Running `sssom convert A.json -o A.tsv` gave me the following:
    [Screenshot: Screen Shot 2022-02-14 at 10 43 50 AM]
  • Immediate improvements needed:
    - `from_sssom_dataframe` could be recoded to be more efficient.
      [Screenshot: Screen Shot 2022-02-14 at 10 52 10 AM]
    - The same goes for `to_mapping_set_document`.
      [Screenshot: Screen Shot 2022-02-14 at 10 53 49 AM]
    - But `init` is common to both of the above:
      [Image: init]

Command, for documentation purposes: `sudo austin -i 100 -o /path/to/austin/output.austin python -m pytest`
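
For a quick function-level view that needs no extra tooling, Python's standard-library `cProfile` can complement the Austin run above. This is only a sketch: `slow_step` and `run_workload` are hypothetical stand-ins for the real conversion calls, not sssom-py functions.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Hypothetical stand-in for an expensive call such as a conversion."""
    return sum(i * i for i in range(n))


def run_workload():
    """Hypothetical workload that calls the expensive step several times."""
    return [slow_step(20_000) for _ in range(4)]


profiler = cProfile.Profile()
profiler.enable()
results = run_workload()
profiler.disable()

# Sort by cumulative time so the most expensive call chains appear first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The report lists call counts and cumulative time per function, which makes it easy to spot a function that is cheap per call but invoked many times.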

@matentzn
Collaborator Author

Thank you @hrshdhgd, this is a great analysis! You can put it on the back burner for now and we'll get back to it later!

@joeflack4
Collaborator

Agree it is too slow, and also a good analysis!

@matentzn
Collaborator Author

Calling `write_owl(msdf1, f)` on a large msdf takes a huge amount of time (a 100MB TSV takes more than 2 hours).

@matentzn
Collaborator Author

(sssom) ➜  ontology git:(master) ✗ wc -l ../mappings/oba-all-phenotype.sssom.tsv
 1712464 ../mappings/oba-all-phenotype.sssom.tsv
(sssom) ➜  ontology git:(master) ✗ sssom filter ../mappings/oba-all-phenotype.sssom.tsv -o ../mappings/oba-all-hp-phenotype.sssom.tsv --subject_id HP:% --object_id OBA:%

This should be basically instant but takes 30 minutes. Maybe bypass linkml for certain operations?
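
To illustrate what bypassing the object model could look like for this filter, here is a rough sketch that streams the TSV line by line with plain string operations. It rests on several assumptions: the `%` in the arguments is read as a prefix wildcard, the metadata block is a run of leading `#` lines, no field contains a quoted tab, and `filter_sssom_tsv` is a hypothetical helper, not an existing sssom-py function.

```python
import io


def filter_sssom_tsv(src, dst, subject_prefix, object_prefix):
    """Copy src to dst, keeping only rows whose IDs start with the given prefixes.

    Returns the number of mapping rows written.
    """
    kept = 0
    header = None
    for line in src:
        if not line.strip():
            continue  # skip blank lines
        if header is None and line.startswith("#"):
            dst.write(line)  # pass the commented metadata block through untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if header is None:
            # First non-comment line is the column header.
            header = fields
            s_idx = header.index("subject_id")
            o_idx = header.index("object_id")
            dst.write(line)
            continue
        if fields[s_idx].startswith(subject_prefix) and fields[o_idx].startswith(object_prefix):
            dst.write(line)
            kept += 1
    return kept


# Tiny synthetic input standing in for the 1.7M-line file.
rows = [
    ["subject_id", "predicate_id", "object_id"],
    ["HP:0000001", "skos:exactMatch", "OBA:0000001"],
    ["HP:0000002", "skos:exactMatch", "MP:0000002"],
    ["MP:0000003", "skos:exactMatch", "OBA:0000003"],
]
text = "#mapping_set_id: https://example.org/demo\n"
text += "".join("\t".join(r) + "\n" for r in rows)

out = io.StringIO()
kept = filter_sssom_tsv(io.StringIO(text), out, "HP:", "OBA:")
output = out.getvalue()
print(output)
```

Because each line is handled once and nothing is converted into mapping objects, the cost grows linearly with file size and memory use stays flat. The trade-off is that no schema validation happens along the way.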

hrshdhgd added a commit that referenced this issue Nov 20, 2023
Addresses #202 
- [x] Ran `poetry update`
- [x] Call `_get_sssom_schema_object()` once in the function `get_dict_from_mapping()` rather than multiple times in a for loop, which is inefficient.
- [x] Use `pandas.apply()` instead of `pandas.iterrows()` in `_get_mapping_set_from_df()`
- [x] Use dict/list comprehensions instead of for loops
- [x] Use sets instead of lists where lookups are done and the order of elements doesn't matter.
- [x] Improve `SchemaView` object instantiation and persistence
  - [x] Use `@cached_property`, thank you @cthoyt

---------

Co-authored-by: Charles Tapley Hoyt <cthoyt@gmail.com>
Co-authored-by: Nico Matentzoglu <nicolas.matentzoglu@gmail.com>
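
Several of the items in this commit message follow one pattern: build an expensive object once, keep it, and use a set for membership checks. The sketch below shows that pattern in isolation. `SchemaHolder`, `slot_names`, and `split_row` are hypothetical names chosen for illustration and are not the sssom-py implementation.

```python
from functools import cached_property


class SchemaHolder:
    """Hypothetical holder for an expensive-to-build schema object."""

    build_count = 0  # counts how often the expensive build actually runs

    @cached_property
    def slot_names(self):
        # Stand-in for costly work such as loading and indexing a schema.
        SchemaHolder.build_count += 1
        return frozenset({"subject_id", "predicate_id", "object_id"})


def split_row(row, holder):
    """Separate known slots from extra keys, looking the schema up once."""
    slots = holder.slot_names  # fetched once per row; a set gives O(1) membership
    known = {k: v for k, v in row.items() if k in slots}
    extra = {k: v for k, v in row.items() if k not in slots}
    return known, extra


holder = SchemaHolder()
rows = [{"subject_id": f"A:{i}", "object_id": f"B:{i}", "note": "x"} for i in range(1000)]
results = [split_row(row, holder) for row in rows]
print(SchemaHolder.build_count)
```

Although `slot_names` is read 1000 times, the build runs only on the first access, because `cached_property` stores the result on the instance afterwards.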
@hrshdhgd
Contributor

Closing this for now in favor of #462. Feel free to open a new issue with the exact location where a latency improvement is needed.
