sssom-py is much too slow #202

matentzn · 2022-02-07T09:00:27Z

First we need to figure out why that is, i.e. which functions are so inefficient, and then work on improving efficiency. First goal:

A small document of 100 mappings should be processable in 1 second.

hrshdhgd · 2022-02-14T16:57:58Z

Here are my observations:

I ran `test_conversion.py` since it took the most time of all the tests. The following is the CPU profiling graph (courtesy: Austin).

As noted, `as_rdf_graph` takes a noticeable amount of time (~1.2 seconds) and is called at least 4 times within the test.

Running `sssom convert A.json -o A.tsv` gave me the following:

Immediate improvements needed:
- `from_sssom_dataframe` could be recoded to be more efficient.
- Same with `to_mapping_set_document`.
- But `init` is common to both of the above:

Command, for documentation purposes: `sudo austin -i 100 -o /path/to/austin/output.austin python -m pytest`
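For anyone who wants a similar per-function breakdown without installing Austin, the standard library's `cProfile` can give a rough equivalent. This is only a sketch of that alternative, not the profiling setup used above; `slow_sum` is a hypothetical stand-in workload, and in practice you would pass whichever sssom-py function you suspect.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "cumulative" surfaces the callers that dominate wall time,
    # which is the view the profiling graph above gives.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    # Hypothetical stand-in for a slow conversion function.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(slow_sum, 100_000)
    print(report)
```

Unlike Austin, which samples a running process, `cProfile` instruments every call, so absolute timings will be inflated; the relative ranking of functions is the useful part.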

matentzn · 2022-02-14T17:47:23Z

Thank you @hrshdhgd, this is a great analysis! You can put it on the back burner for now and we will get back to it later!

joeflack4 · 2022-03-03T21:34:38Z

Agree it is too slow, and also a good analysis!

matentzn · 2022-05-16T12:45:18Z

Running `write_owl(msdf1, f)` for large msdfs takes a huge amount of time (a 100 MB TSV takes more than 2 hours).

matentzn · 2023-01-22T17:08:42Z

(sssom) ➜  ontology git:(master) ✗ wc -l ../mappings/oba-all-phenotype.sssom.tsv
 1712464 ../mappings/oba-all-phenotype.sssom.tsv
(sssom) ➜  ontology git:(master) ✗ sssom filter ../mappings/oba-all-phenotype.sssom.tsv -o ../mappings/oba-all-hp-phenotype.sssom.tsv --subject_id HP:% --object_id OBA:%

This should be basically instant but takes 30 minutes. Maybe bypass linkml for certain operations?

@cthoyt
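To make the "bypass linkml" idea concrete, here is a rough sketch of what a streaming prefix filter could look like. It is not how `sssom filter` is implemented. It assumes that `%` acts as a trailing wildcard (as in SQL `LIKE`), that metadata lines start with `#`, and that no field contains an embedded tab. `filter_sssom_tsv` is a hypothetical helper name.

```python
def filter_sssom_tsv(lines, subject_prefix, object_prefix):
    """Yield metadata lines, the header, and rows whose IDs match both prefixes."""
    subject_col = object_col = None
    for line in lines:
        if subject_col is None:
            if line.startswith("#"):
                yield line  # embedded metadata block, passed through unchanged
                continue
            # The first non-comment line is the column header.
            columns = line.rstrip("\r\n").split("\t")
            subject_col = columns.index("subject_id")
            object_col = columns.index("object_id")
            yield line
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= max(subject_col, object_col):
            continue  # skip blank or truncated rows
        if (fields[subject_col].startswith(subject_prefix)
                and fields[object_col].startswith(object_prefix)):
            yield line


if __name__ == "__main__":
    import sys

    # Usage: python filter_sketch.py input.sssom.tsv output.sssom.tsv
    with open(sys.argv[1], encoding="utf-8") as fin, \
            open(sys.argv[2], "w", encoding="utf-8") as fout:
        fout.writelines(filter_sssom_tsv(fin, "HP:", "OBA:"))
```

This makes a single pass and never builds mapping objects, so it should scale linearly with file size. The trade-off is that it skips all validation and leaves the metadata block untouched, so unused prefixes stay in it.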

Addresses #202

- [x] Ran `poetry update`
- [x] Call `_get_sssom_schema_object()` once in the function `get_dict_from_mapping()` rather than multiple times in an inefficient for loop.
- [x] Use `pandas.apply()` instead of `pandas.iterrows()` in `_get_mapping_set_from_df()`
- [x] Use dict/list comprehensions instead of for loops
- [x] Use sets instead of lists where lookups are done and the order of elements doesn't matter.
- [x] Improve `SchemaView` object instantiation and persistence
- [x] Use `@cached_property`

thank you @cthoyt

---------

Co-authored-by: Charles Tapley Hoyt <cthoyt@gmail.com>
Co-authored-by: Nico Matentzoglu <nicolas.matentzoglu@gmail.com>
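Several items in the checklist above follow one pattern: build the expensive object once, keep it in a set, and look it up outside the loop. The sketch below illustrates that pattern with the standard library only. `SchemaHolder`, its `slots` property, and `keep_known_fields` are hypothetical stand-ins, not the actual sssom-py code.

```python
from functools import cached_property


class SchemaHolder:
    """Hypothetical stand-in for an object that is expensive to build."""

    builds = 0  # counts how often the expensive step actually runs

    @cached_property
    def slots(self):
        # Imagine a costly schema load here; cached_property runs it
        # once per instance and stores the result on the instance.
        type(self).builds += 1
        # A frozenset gives O(1) membership tests; order is irrelevant.
        return frozenset({"subject_id", "predicate_id", "object_id"})


def keep_known_fields(rows, holder):
    """Drop keys that are not schema slots from each row."""
    slots = holder.slots  # fetched once, outside the loop
    # Comprehensions replace the explicit nested for loops.
    return [{k: v for k, v in row.items() if k in slots} for row in rows]


if __name__ == "__main__":
    holder = SchemaHolder()
    rows = [{"subject_id": "HP:1", "object_id": "OBA:1", "note": "x"}]
    print(keep_known_fields(rows, holder))
    print(SchemaHolder.builds)
```

The same reasoning applies to the `iterrows()` change: per-row Python overhead is what dominates, so anything invariant across rows is worth hoisting out.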

hrshdhgd · 2023-12-15T22:04:34Z

Closing this for now in favor of #462. Feel free to open a new issue with the exact location where a latency improvement is needed.

matentzn added the priority label Feb 7, 2022

matentzn assigned hrshdhgd Feb 7, 2022

hrshdhgd mentioned this issue Nov 14, 2023

Optimization of some functions #462

Merged

7 tasks

hrshdhgd closed this as completed Dec 15, 2023
