To Consider: Comprehensive Final Enhancements for Project Efficiency and Maintainability #32

Open
6 of 8 tasks
tmushayahama opened this issue Apr 22, 2024 · 0 comments

tmushayahama commented Apr 22, 2024

Some tasks to consider for the remaining time:

  • Implement Elasticsearch Scrolling for Pagination
      • Add pagination for large datasets using Elasticsearch scrolling, for @quemeb.
      • Consider creating endpoints like scrollAnnotations, ScrollSNPsByChromosome, and ScrollSnpsById.
      • Research and implement API security, possibly using an API Guard annotation.
      • Make the scrollId an optional parameter and extend the Snp class to return a scrollId.
      • More detail on this is given below.
  • Automate Purge of Downloads Folder
      • Develop a cron job or equivalent to regularly clear the downloads folder.
  • Enhance Test Coverage
      • Ensure test coverage includes fields like VEP_refseq_PANTHER_GO_SLIM_cellular_component_list_id.
      • Add these values to your sample data to ensure comprehensive testing.
  • Dynamic Column Handling
      • Implement functionality to test variable column loading, allowing columns to be added or removed dynamically.
      • This will start from your schema generation code.
  • API Documentation
      • Research and implement a tool equivalent to Swagger for documenting APIs, including descriptions, required parameters, and optional parameters.
  • Code Documentation
      • If time allows, enhance code documentation using docstrings.
      • Reference: https://testdriven.io/blog/documenting-python/
  • Standardize Coding Conventions (something to consider)
      • Ensure consistent naming conventions across the codebase.
      • Choose and enforce a standard naming convention (preferably snake_case for Python). The code currently
        mixes styles: sometimes it is GetSNPsByChromosome and sometimes it is search_by_chromosomes.
  • Good Error Messages
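For the downloads purge task, a minimal sketch of a function that a cron job (or any scheduler) could call. The function name `purge_old_files`, the 24-hour cutoff, and the choice to delete only top-level regular files are assumptions for illustration, not taken from the project.

```python
import time
from pathlib import Path


def purge_old_files(folder, max_age_seconds=24 * 3600, now=None):
    """Delete regular files in `folder` older than `max_age_seconds`.

    Returns the sorted names of the deleted files. Subdirectories are
    left untouched. `now` can be injected to make the function testable.
    """
    now = time.time() if now is None else now
    deleted = []
    for path in Path(folder).iterdir():
        # Compare the file's last-modified time against the cutoff.
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)
```

A script wrapping this function could then be scheduled nightly from crontab; the exact schedule and script location depend on the deployment.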

Implementation flow idea: scrolling in Elasticsearch

Scrolling in Elasticsearch allows you to retrieve large numbers of results from a query in multiple batches without the cost of deep pagination. It's suitable for processing large datasets that exceed the normal from/size pagination limit (10,000 hits by default, controlled by index.max_result_window).

When a scroll query is initiated, Elasticsearch provides a scroll_id that you use to fetch the next batch of results. This scroll_id acts like a cursor pointing to a specific place in the dataset.
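The cursor behaviour described above can be sketched as a generator. This assumes the official Python client (`elasticsearch`) with 8.x-style keyword arguments for `search`, `scroll`, and `clear_scroll`; the function name `iter_scroll_hits`, the batch size, and the `"1m"` keep-alive are illustrative choices, and the project's own client wrapper may differ.

```python
def iter_scroll_hits(client, index, query, batch_size=1000, keep_alive="1m"):
    """Yield every hit for `query`, fetching one batch per round trip."""
    # The first request opens the scroll session and returns batch one.
    response = client.search(
        index=index, query=query, size=batch_size, scroll=keep_alive
    )
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            # Each follow-up request passes the cursor and renews the timer.
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            # The id can change between requests, so keep the latest one.
            scroll_id = response["_scroll_id"]
    finally:
        # Release the server-side search context once iteration stops.
        client.clear_scroll(scroll_id=scroll_id)
```

An empty batch signals that the result set is exhausted, which is what ends the loop.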

Making scrollId an Optional Parameter:

  • Modify the endpoint that triggers the scrolling query to accept a scrollId as an optional query parameter.
  • If a scrollId is provided, the API should continue fetching results from where the last batch ended.
  • If no scrollId is provided, the API should start a new scroll session and return the initial batch of results along with a new scrollId.
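The three rules above could look roughly like this at the data-access layer. It is a sketch under assumptions: `fetch_snp_page`, the `results`/`scroll_id` response shape, and the page size are hypothetical names, and the client calls follow the official Python client's 8.x keyword style.

```python
def fetch_snp_page(client, index, query, scroll_id=None,
                   page_size=500, keep_alive="1m"):
    """Return one batch of results plus the scroll id for the next call."""
    if scroll_id is None:
        # No id supplied: start a new scroll session.
        response = client.search(
            index=index, query=query, size=page_size, scroll=keep_alive
        )
    else:
        # Id supplied: continue from where the last batch ended.
        response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)

    hits = response["hits"]["hits"]
    if hits:
        next_id = response["_scroll_id"]
    else:
        # Empty batch: the session is finished, so free it on the server
        # and tell the caller there is nothing more to fetch.
        client.clear_scroll(scroll_id=response["_scroll_id"])
        next_id = None
    return {"results": [hit["_source"] for hit in hits], "scroll_id": next_id}
```

A caller keeps passing back the returned `scroll_id` until it comes back as `None`.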

Extending the Snp Class:
Subclass the Snp class to include a property that can return a scrollId associated with a query session.
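A small sketch of what that subclass might look like. The `Snp` class here is a hypothetical stand-in with made-up fields (`id`, `chromosome`, `position`), since the real class is not shown in this issue; `ScrollableSnp` is likewise an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snp:
    """Hypothetical stand-in for the project's Snp class."""
    id: str
    chromosome: str
    position: int


@dataclass
class ScrollableSnp(Snp):
    """Snp plus the scroll id of the query session that produced it."""
    # Optional so that non-scrolling queries can leave it unset.
    scroll_id: Optional[str] = None
```

Existing code that expects an `Snp` keeps working because `ScrollableSnp` is still an `Snp`. If repeating the same scrollId on every record turns out to be wasteful, a wrapper type holding the list of SNPs and a single scrollId may be worth considering instead.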

API and Code Adjustments:
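Adjust the API's logic to manage the lifecycle of a scroll session, including the expiration of scrollIds. Elasticsearch keeps a scroll context alive only for the duration passed in the scroll parameter of each request (for example 1m for one minute), and every scroll request renews that timer.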
Implement error handling for cases when an expired or invalid scrollId is received.
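One way to handle a stale scrollId is to translate the client's failure into a single, clearly worded error. This is a sketch under assumptions: `ScrollExpiredError` and `continue_scroll` are hypothetical names, and it assumes the client's API exceptions expose an HTTP `status_code` attribute, with an expired scroll context surfacing as 404 and a malformed id as 400. That mapping should be checked against the client version the project uses.

```python
class ScrollExpiredError(Exception):
    """Raised when a scroll id is expired, unknown, or malformed."""


def continue_scroll(client, scroll_id, keep_alive="1m"):
    """Fetch the next batch, turning scroll-id failures into one error."""
    try:
        return client.scroll(scroll_id=scroll_id, scroll=keep_alive)
    except Exception as exc:
        # Assumption: API errors from the client carry `status_code`.
        status = getattr(exc, "status_code", None)
        if status in (400, 404):
            raise ScrollExpiredError(
                "Scroll session expired or invalid; "
                "start a new query without a scrollId."
            ) from exc
        # Anything else (network failure, server error) is not ours to mask.
        raise
```

The endpoint can then catch `ScrollExpiredError` and return a message telling the caller to restart the query, which also fits the "Good Error Messages" task.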

Tagging @akshala @huaiyumi
