Stats Endpoint #396

Open
MichaelLukowski opened this issue Sep 25, 2023 · 8 comments

Comments

@MichaelLukowski
Collaborator

MichaelLukowski commented Sep 25, 2023

Per the Connect discussions, I believe it would be a good idea to set up a stats endpoint for DRS servers.

This will allow us to build something like a dashboard for existing DRS servers and show the research community just how much data is being hosted on existing DRS servers.

I think the endpoint can simply be a GET to /stats, and the minimal information required from it would be total_files and total_file_size.
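A minimal sketch of the record such an endpoint might return, assuming a hypothetical in-memory list of file sizes. The function name `build_stats` and its input shape are illustrative only; just the field names total_files and total_file_size come from the proposal above.

```python
from typing import Iterable


def build_stats(object_sizes: Iterable[int]) -> dict:
    """Return the minimal stats record proposed for GET /stats.

    `object_sizes` is a hypothetical iterable of file sizes in bytes;
    a real DRS server would read these from its own index.
    """
    total_files = 0
    total_file_size = 0
    for size in object_sizes:
        if size < 0:
            raise ValueError("file sizes must be non-negative")
        total_files += 1
        total_file_size += size
    return {"total_files": total_files, "total_file_size": total_file_size}


if __name__ == "__main__":
    # Made-up sizes, purely for illustration.
    print(build_stats([1024, 2048, 4096]))
```

A server holding many objects would more likely keep running totals than rescan its index on every request; the loop here only shows what the two numbers mean.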

I would love to get some discussion related to this.

@mattions

Thanks @MichaelLukowski for getting this rolling.
Yes, I totally agree.

Maybe it is also worth adding a field for the "name" of the DRS server?
Should we just add a self_url field that can be used as a proxy for the name?

@briandoconnor
Contributor

@MichaelLukowski OK to assign you as the champion? I changed you to the owner of the ticket

@MichaelLukowski
Collaborator Author

@briandoconnor yes that's fine. I'll take it

@briandoconnor
Contributor

briandoconnor commented Mar 25, 2024

Reviving this since we'll discuss it at GA4GH Connect.

Can we do something really simple like the information that powers https://stats.gen3.org?

What about expanding service_info? That seems like a more natural place than a new stats endpoint.

In this mockup I also included geo/cloud location information (as an array since a given DRS server may manage data across multiple locations). It would be useful to be able to have a map of DRS servers and to know general high-level stats on them.

There might be interest in breaking down stats by geo location, but maybe that's overkill. Still, I like including a location since it will help us draw a map of the DRS servers out there.

In the mockup, statistics could give some basic stats on the DRS server. locations is a place for declaring the location of data in this DRS server; it could be cloud-based, using a predetermined list of providers + regions, or a geo location or country code. It is modeled as an array since a single DRS server may manage data in multiple locations.

{
  "id": "org.ga4gh.myservice",
  "name": "My project",
  "type": {
    "group": "org.ga4gh",
    "artifact": "drs",
    "version": "1.0.0"
  },
  "description": "This service provides...",
  "organization": {
    "name": "My organization",
    "url": "https://example.com"
  },
  "statistics": {
    "subjects_count": "integer",
    "files_count": "integer",
    "total_file_size": "integer, in bytes"
  },
  "locations": [
    {
      "geo_location_coordinates": "lat long coordinates",
      "geo_location_country_code": "country code",
      "cloud_provider": "cloud provider name",
      "cloud_region": "region code that makes sense"
    }
  ],
  "contactUrl": "mailto:support@example.com",
  "documentationUrl": "https://docs.myservice.example.com",
  "createdAt": "2019-06-04T12:58:19Z",
  "updatedAt": "2019-06-04T12:58:19Z",
  "environment": "test",
  "version": "1.0.0",
  "maxBulkRequestLength": 0
}

@mattions

Do we know if other standards have used service_info to add specific information?
I think that could tell us whether we should add DRS-specific fields there or not.

I'm trying to understand the long/lat idea. The location of a DRS server does not really give an idea of what data is there, so is it more about having them populated on a map?

If so, wouldn't it make more sense to build the map using the location of the institutions/partners that are running the server?

@MichaelLukowski
Collaborator Author

An example of the endpoint that powers stats.gen3.org is https://gen3.biodatacatalyst.nhlbi.nih.gov/index/_stats

It is simply a single JSON record that contains "fileCount" and "totalFileSize" fields.

I personally don't know of a standard that has modified the service_info endpoint, and I am unsure whether it is a good idea to open that issue, as service_info was created as a standard way to give information about the service.
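As a rough illustration of how a dashboard might consume such a record, here is a small sketch. Only the field names "fileCount" and "totalFileSize" come from the description above; the function name `summarize`, the decimal (base-1000) units, and the sample values in the usage block are assumptions.

```python
import json


def summarize(record_json: str) -> str:
    """Turn a {"fileCount", "totalFileSize"} record into a short label."""
    record = json.loads(record_json)
    count = int(record["fileCount"])
    value = float(record["totalFileSize"])  # size in bytes

    # Step through decimal units until the value drops below 1000.
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    unit = units[0]
    for unit in units:
        if value < 1000 or unit == units[-1]:
            break
        value /= 1000
    return f"{count} files, {value:.1f} {unit}"


if __name__ == "__main__":
    # Made-up record; a real client would fetch it over HTTP.
    print(summarize('{"fileCount": 3, "totalFileSize": 7168}'))
```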

I will have a PR for this later this week. For right now I will add a new stats endpoint, however if we decide that service_info is a better place then I can move it.

@briandoconnor
Contributor

In the Cloud WS meeting on Aug 12th, 2024 we decided to simplify the feature described in Issue #396 for DRS release 1.5. I updated this feature branch to just include the following two variables in the service-info endpoint:

  • objectCount
  • totalObjectSize

The goal is to keep the metadata as simple as possible while still providing some useful understanding of the contents of a given DRS server.
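To illustrate the dashboard use case from earlier in the thread, the sketch below sums these two values across several service-info documents. It assumes, purely for illustration, that objectCount and totalObjectSize sit at the top level of each document and that a server may omit them; the actual placement and optionality in the spec may differ, and the function name `aggregate` is hypothetical.

```python
from typing import Iterable, Mapping


def aggregate(service_infos: Iterable[Mapping]) -> dict:
    """Sum objectCount and totalObjectSize over parsed service-info documents."""
    totals = {"servers": 0, "objectCount": 0, "totalObjectSize": 0}
    for info in service_infos:
        count = info.get("objectCount")
        size = info.get("totalObjectSize")
        if count is None or size is None:
            # Assumption: servers that do not report both fields are skipped.
            continue
        totals["servers"] += 1
        totals["objectCount"] += int(count)
        totals["totalObjectSize"] += int(size)
    return totals


if __name__ == "__main__":
    # Made-up documents standing in for responses from three servers.
    docs = [
        {"id": "org.example.a", "objectCount": 10, "totalObjectSize": 5000},
        {"id": "org.example.b", "objectCount": 5, "totalObjectSize": 2500},
        {"id": "org.example.c"},  # reports neither field
    ]
    print(aggregate(docs))
```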

@susheel
Member

susheel commented Sep 18, 2024

total_file_size should not be an integer in bytes, as a (signed 32-bit) integer would only allow us to represent about 2 billion bytes (~2 GB).

I suggest using string or number as the type for this field.
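The arithmetic behind this concern can be checked with a short sketch. Whether a given schema or client actually treats the field as a 32-bit integer is an assumption here; the helper names `encode_size` and `decode_size` are hypothetical and show only the string-typed option.

```python
import json

INT32_MAX = 2**31 - 1          # 2147483647, roughly 2 GB when counted in bytes
JSON_SAFE_INT_MAX = 2**53 - 1  # largest integer an IEEE 754 double holds exactly


def encode_size(total_bytes: int) -> str:
    """Serialize a byte count as a JSON string so no integer width applies."""
    return json.dumps({"total_file_size": str(total_bytes)})


def decode_size(payload: str) -> int:
    """Parse the string-typed size back into an arbitrary-precision int."""
    return int(json.loads(payload)["total_file_size"])


if __name__ == "__main__":
    five_petabytes = 5 * 10**15  # made-up example value
    print(five_petabytes > INT32_MAX)
    print(decode_size(encode_size(five_petabytes)) == five_petabytes)
```

A 64-bit integer type would also cover realistic totals; the string form mainly sidesteps clients that parse every JSON number as a double.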


4 participants