Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add/st dataframe pagenation #77

Merged
merged 6 commits into from
Aug 30, 2024

Conversation

singjc
Copy link
Contributor

@singjc singjc commented Aug 20, 2024

User description

This add a method for rendering a streamlit dataframe with pagenation to allow viewing small chunks of the dataframe at a time, which also solves #76

image


PR Type

enhancement


Description

  • Added a new method display_large_dataframe in src/common.py to render large DataFrames with pagination and row selection.
  • Implemented a helper function to calculate DataFrame memory usage in megabytes.
  • Replaced the existing st.dataframe call in src/view.py with the new display_large_dataframe method to enhance the view_spectrum functionality.

Changes walkthrough 📝

Relevant files
Enhancement
common.py
Add paginated DataFrame display and memory usage calculation

src/common.py

  • Added a function display_large_dataframe for paginated DataFrame
    display.
  • Implemented memory usage calculation for DataFrames.
  • Introduced pagination controls and row selection for DataFrames.
  • +78/-0   
    view.py
    Integrate paginated DataFrame display in view_spectrum     

    src/view.py

  • Replaced st.dataframe call with display_large_dataframe.
  • Integrated new paginated DataFrame display in view_spectrum.
  • +2/-16   

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Performance Concern
    The display_large_dataframe function recalculates the current chunk on every rerun, which may be inefficient for large DataFrames.

    Code Smell
    The get_dataframe_mem_useage function name has a typo. It should be get_dataframe_mem_usage.

    Potential Bug
    The rows variable is used without checking if event is None, which could lead to an error if no row is selected.

    @singjc singjc requested a review from t0mdavid-m August 20, 2024 17:17
    Copy link

    codiumai-pr-agent-pro bot commented Aug 20, 2024

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Best practice
    Avoid using mutable default arguments in function definitions

    The chunk_sizes parameter in the function signature uses a mutable default argument
    (an empty list), which can lead to unexpected behavior. It's better to use None as
    the default and then set the list inside the function.

    src/common.py [254-255]

    -def display_large_dataframe(df, 
    -                            chunk_sizes: list[int] = [100, 1_000, 10_000]):
    +def display_large_dataframe(df, chunk_sizes: list[int] | None = None):
    +    if chunk_sizes is None:
    +        chunk_sizes = [100, 1_000, 10_000]
     
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: The suggestion correctly identifies a best practice issue by recommending the use of None instead of a mutable default argument, which can prevent unexpected behavior.

    9
    Performance
    Implement more efficient pagination for large datasets to reduce memory usage

    Consider using a more efficient method for pagination. Instead of loading the entire
    DataFrame into memory and then slicing it, you could use SQL-like operations to
    fetch only the required chunk from the data source. This would significantly reduce
    memory usage for very large datasets.

    src/common.py [277-280]

    -def get_current_chunk(df, chunk_size, chunk_index):
    +def get_current_chunk(data_source, chunk_size, chunk_index):
         start = chunk_index * chunk_size
    -    end = min(start + chunk_size, len(df))  # Ensure end does not exceed dataframe length
    -    return df.iloc[start:end], start, end
    +    end = start + chunk_size
    +    return data_source.fetch_rows(start, end), start, min(end, data_source.total_rows)
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion addresses a performance issue by proposing a more efficient method for pagination, which is crucial for handling large datasets and reducing memory usage.

    8
    Possible issue
    Add a check for None before accessing properties of a returned object

    The event variable is used before it's defined. This could lead to a NameError if
    display_large_dataframe doesn't return an event object. Consider adding a check to
    ensure event is not None before accessing its properties.

    src/view.py [225-226]

     event = display_large_dataframe(df)
    -rows = event.selection.rows
    +rows = event.selection.rows if event is not None else []
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion addresses a potential bug by ensuring that the event variable is checked for None before accessing its properties, preventing possible runtime errors.

    8
    Maintainability
    Correct function name spelling and use a more descriptive name

    The function get_dataframe_mem_useage has a typo in its name. It should be
    get_dataframe_mem_usage. Also, consider using a more descriptive name like
    calculate_dataframe_memory_usage_mb.

    src/common.py [418]

    -def get_dataframe_mem_useage(df):
    +def calculate_dataframe_memory_usage_mb(df):
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion improves code maintainability by correcting a typo and providing a more descriptive function name, enhancing code readability.

    7

    @singjc singjc linked an issue Aug 20, 2024 that may be closed by this pull request
    @t0mdavid-m
    Copy link
    Member

    Thanks a lot for the addition. I will review this PR early next week as I am OOO for the rest of this week.

    Copy link
    Member

    @t0mdavid-m t0mdavid-m left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Thanks for the addition! I just have some minor comments. After those are addressed the PR would be good to merge from my side.

    src/common.py Outdated Show resolved Hide resolved
    src/common.py Outdated

    event = st.dataframe(
    current_chunk_df,
    column_order=[
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    This should be more flexible as the data displayed may vary. I think adding column_order as a function parameter would work best. Is there a reason why selection_mode, on_select, use_container_width, and hide_index have to be set to these specific values? If not I would suggest adding those as parameters as well.

    Anything specific to a particular workflow/dataset should not be located in common.py as it is intended to contain functions of general use. As a sidenote, common modules should be moved in a separate directory to make this more clear to developers.

    Copy link
    Contributor Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    That's a good point. These are specifically set because I came across this issue when using the pyopenms workflow for selecting spectra to plot when selecting a row, but this should be more flexible for other use cases.

    We could pass kwargs to the display_large_dataframe method for st.dataframe because that's probably the only main method that would need to be changed based on particular needs. I opted for this option

    @t0mdavid-m
    Copy link
    Member

    Merging with main should fix the circular import issue with the linter.

    @t0mdavid-m
    Copy link
    Member

    Thanks for making the changes. I like the solution with kwargs. From my side this PR would be ready to merge.

    @t0mdavid-m t0mdavid-m merged commit 261502a into OpenMS:main Aug 30, 2024
    4 of 5 checks passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    Large Data Exceeds View Size Limit
    2 participants