NCZarr S3 performance when file contains many groups #2588

fsvenson · 2023-01-16T09:54:21Z

fsvenson
Jan 16, 2023

I am currently investigating the possibility of using NCZarr for efficient reading of cloud-stored data. Our NetCDF files are structured as quad trees (think map tiles), where each tile's data and attributes are contained in a NetCDF group.

As it turns out, when requesting data via ncdump (or Python via xarray and netcdf4-python), the time it takes to fetch the data for a single group seems to increase with the number of groups in the whole file. I used strace to check how many network requests are made depending on the number of groups in the file, with these results:

Strace command:
$ strace -ff -e trace=network -s 2000 ncdump -g"/0_0_0" "/s3/path/to/file.nc#mode=nczarr,s3" 2>&1  | grep -c sin_addr

Output of the above command:
groups | count
-------|-------
     1 |   124
     5 |   397
   341 | 23372

The group layout of the (largest) file is currently like this:

{
  /0_0_0
}
{
  /1_0_0
}
...
{
  /4_255_255
}

I have also tried the following group layout and that did not seem to make a difference in performance:

My question is whether these results are expected. Is there anything I can change in my file structure to avoid all those network requests when accessing a single group of data?

I should mention that I am using NetCDF v4.9.1-rc2 and AWS SDK v1.10.51.

DennisHeimbigner · 2023-01-16T19:57:33Z

DennisHeimbigner
Jan 16, 2023
Collaborator

Thanks for this. Improving performance is always important to us.
Is that S3 file publicly accessible? If so, can you send the full URL?
If not, can you send me the output of ncdump (or at least 'ncdump -h' )
for that file?

DennisHeimbigner · 2023-01-16T20:15:42Z

DennisHeimbigner
Jan 16, 2023
Collaborator

Thinking about this, I am not sure the ncdump output will give useful information.
A better test would be to write a C program that directly opens
only the group of interest. Then run strace on that program.

fsvenson Jan 16, 2023
Author

Here is a minimal C program, reading the data variable of a single group:

/* Copyright 2019 University Corporation for Atmospheric
   Research/Unidata.  See COPYRIGHT file for conditions of use. */
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <netcdf.h>

/* This is the name of the data file we will read. */
#define FILE_NAME "https://s3-url-to/file.nc#mode=nczarr,s3"

#define NX 1028
#define NY 1028

/* Handle errors by printing an error message and exiting with a
 * non-zero status. */
#define ERRCODE 2
#define ERR(e) {printf("Error: %s\n", nc_strerror(e)); exit(ERRCODE);}

int
main()
{
   /* There will be netCDF IDs for the file, the group, and the
    * variable. */
   int ncid, varid1, grp1id;

   /* Buffer for the variable's data. It is static so that the roughly
    * 3 MB array does not live on the stack. */
   static uint8_t data_in[3 * NX * NY];

   /* Error handling. */
   int retval;

   /* Open the file. NC_NOWRITE tells netCDF we want read-only access
    * to the file.*/
   if ((retval = nc_open(FILE_NAME, NC_NOWRITE, &ncid)))
      ERR(retval);

   /* Get the group id of the single group we want to read. */
   if ((retval = nc_inq_ncid(ncid, "0_0_0", &grp1id)))
      ERR(retval);

   /* Get the varid of the uint8 data variable, based on its name, in
    * grp1. */
   if ((retval = nc_inq_varid(grp1id, "data", &varid1)))
      ERR(retval);

   /* Read the data. */
   if ((retval = nc_get_var_ubyte(grp1id, varid1, &data_in[0])))
      ERR(retval);

   /* Close the file, freeing all resources. */
   if ((retval = nc_close(ncid)))
      ERR(retval);

   printf("*** SUCCESS reading example file %s!\n", FILE_NAME);
   return 0;
}

Results are similar to those with ncdump, but with fewer requests:

groups | count
-------|-------
     1 |    74
     5 |   219
   341 | 12381

What kind of output are you interested in seeing to understand the issue further?

DennisHeimbigner · 2023-01-16T21:47:14Z

DennisHeimbigner
Jan 16, 2023
Collaborator

I will have to run gprof next to see if I can figure out why this behavior occurs.

fsvenson Jan 26, 2023
Author

I noticed there was an option to run with tracing enabled so below are the output logs with tracing for the case with 1 group and 5 groups respectively:

tracing_one_grp.txt
tracing_five_grps.txt

DennisHeimbigner · 2023-01-26T20:42:56Z

DennisHeimbigner
Jan 26, 2023
Collaborator

I need to make sure I understand the file.

It has 4 * 255 * 255 = 260100 non-nested groups
Each group contains a single byte variable with 2 dimensions: 1028 x 1028

Is this correct?

fsvenson Jan 26, 2023
Author

Not sure if I follow what you mean in your first point, but your second point is correct.

In these example files no groups are nested.

DennisHeimbigner · 2023-01-26T21:09:23Z

DennisHeimbigner
Jan 26, 2023
Collaborator

I guess I am asking if you rebuild the test file with a differing number of groups and then access one group.
Just want to make sure my tests are similar.

fsvenson Jan 26, 2023
Author

Yes, that is exactly what I have done. In this case, one file has a single group while the other one has five groups. And the only difference in the test program is which file I read.

fsvenson Jan 30, 2023
Author

Here are two test files you can try out: test_files.zip
