`list_objects_v2()` is much slower than it should be #619

mgirlich · 2023-05-30T08:16:01Z

Most of the time is spend parsing the xml output. It would be great to speed this up.

DyfanJones · 2023-05-30T08:29:40Z

Will have a look to see if there is any quick wins with thjs

DyfanJones · 2023-05-30T08:30:32Z

If you have some idea in how to improve the speed, please feel free to raise a PR ☺️

mgirlich · 2023-05-30T08:58:29Z

A bit tedious but much faster parsing is possible if you know the structure of the xml, e.g. for listing objects you can do something like

contents <- xml2::xml_find_all(resp_content, "d1:Contents")

xml2::xml_find_all(contents, "d1:Key") |> xml2::xml_text()
xml2::xml_find_all(contents, "d1:LastModified") |> xml2::xml_text()
xml2::xml_find_all(contents, "d1:ETag") |> xml2::xml_text()
xml2::xml_find_all(contents, "d1:ChecksumAlgorithm") |> xml2::xml_text()
xml2::xml_find_all(contents, "d1:Size") |> xml2::xml_integer()
owner <- xml2::xml_find_all(contents, "d1:Owner")
owner |> 
  xml2::xml_find_all("d1:ID") |> 
  xml2::xml_text()
owner |> 
  xml2::xml_find_all("d1:DisplayName") |> 
  xml2::xml_text()
xml2::xml_find_all(contents, "d1:StorageClass") |> xml2::xml_text()

DyfanJones · 2023-05-30T10:05:36Z

I wonder if there is away to make this more generic 🤔 As this function parses all responses.

mgirlich · 2023-05-31T06:58:19Z

Yes, this can be done more generic. I hacked something together:

decode_xml <- function(xml, interface) {
  if (!is.list(interface)) {
    out <- parse_xml_elt(xml, interface)
    return(out)
  }
  
  nms <- names(interface)
  out <- list()
  
  for (i in seq_along(interface)) {
    key <- paste0("d1:", names(interface)[[i]])
    interface_i <- interface[[i]]
    type <- attr(interface_i, "tags")$type
    flattened <- attr(interface_i, "tags")$flattened

    elts <- xml2::xml_find_all(xml, key, flatten = FALSE)
    if (inherits(xml, "xml_nodeset")) {
      idx <- lengths(elts) == 0
      parsed_elts <- parse_xml_elt(elts, interface_i, idx)
    } else {
      parsed_elts <- parse_xml_elt(elts, interface_i)
    }
    
    if (isTRUE(flattened)) {
      n_i <- length(parsed_elts)
      parsed_elts <- setNames(parsed_elts, rep(nms[[i]], n_i))
    } else {
      parsed_elts <- setNames(list(parsed_elts), nms[[i]])
    }
    
    out <- c(out, parsed_elts)
  }
  
  out
}

parse_xml_elt <- function(xml, interface, idx = NULL) {
  n <- length(xml)
  
  if (!is.null(idx) && n > 0) {
    xml <- structure(unlist(recursive = FALSE, xml), class = "xml_nodeset")
  }
  
  tags <- attr(interface, "tags")
  type <- tags$type
  flattened <- tags$flattened
  
  if (type == "structure") {
    val <- decode_xml(xml, interface[[1]])
    default <- xml2::as_xml_document(list())
    default <- decode_xml(default, interface[[1]])
  } else if (type == "list") {
    val <- decode_xml(xml, interface[[1]])
    default <- xml2::as_xml_document(list())
    default <- decode_xml(default, interface[[1]])
  } else if (type == "boolean") {
    val <- xml2::xml_text(xml) == "true"
    default <- NA
  } else if (type == "string") {
    val <- xml2::xml_text(xml)
    default <- NA_character_
  } else if (type == "timestamp") {
    val <- xml2::xml_text(xml)
    val <- strptime(val, format = "%Y-%m-%dT%H:%M:%S")
    val <- as.POSIXct(val)
    default <- strptime(NA, format = "%Y-%m-%dT%H:%M:%S")
    default <- as.POSIXct(default)
  } else if (type == "integer") {
    val <- xml2::xml_integer(xml)
    default <- NA_integer_
  } else {
    browser()
  }
  
  if (isTRUE(flattened)) {
    out <- purrr::transpose(val)
  } else {
    # `unlist()` might drop elements which we need to fill with the missing value
    # this would be a bit easier with `vctrs::vec_init()`
    out <- rep(default, n)
    if (is.null(idx)) {
      idx <- FALSE
    }
    out[!idx] <- val
  }
  out
}

The structure is similar but not quite the same as before (e.g. it skips some nesting that's not in the interface). It would also be possible (and probably be nicer) to return a data frame instead of flattening a list, e.g. the Contents fields.

DyfanJones · 2023-05-31T08:13:15Z

@mgirlich thanks for this, will need to benchmark it. Is it possible to do this without adding any extra dependencies? I would like to keep the dependencies low if possible

mgirlich · 2023-05-31T08:48:02Z

It can be done without purrr. But before doing that we should:

compare performance
consider how the output should look like (data frame vs list). If we change to a data frame structure we don't need the purrr::transpose() step anyway.

DyfanJones · 2023-05-31T09:44:02Z

We should keep it to a list as we want to keep the api consistent with prior versions and mimic the other aws sdks 😄

DyfanJones · 2023-08-08T17:10:35Z

I did a check of paws performance and interesting I am getting fairly decent results:

library(paws)
svc <- s3()

profvis::profvis({
  resp <- svc$list_objects_v2(Bucket = "my-dummy-bucket")
})

Note: this returned the maximum number of items for the AWS api call (1000).

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] paws_0.4.0

loaded via a namespace (and not attached):
 [1] crayon_1.5.2       vctrs_0.6.3        knitr_1.43         httr_1.4.6         cli_3.6.1          xfun_0.39         
 [7] rlang_1.1.1        stringi_1.7.12     purrr_1.0.1        jsonlite_1.8.7     glue_1.6.2         htmltools_0.5.5   
[13] rmarkdown_2.23     evaluate_0.21      ellipsis_0.3.2     fastmap_1.1.1      yaml_2.3.7         lifecycle_1.0.3   
[19] stringr_1.5.0      compiler_4.3.1     htmlwidgets_1.6.2  Rcpp_1.0.11        rstudioapi_0.15.0  digest_0.6.33     
[25] R6_2.5.1           curl_5.0.1         paws.common_0.5.9  paws.storage_0.4.0 magrittr_2.0.3     tools_4.3.1       
[31] profvis_0.3.8      xml2_1.3.5

@mgirlich can you share the R session environment? Would be interesting to see what is causing this bottle neck.

mgirlich · 2023-08-09T05:16:31Z

The difference in timing is because I'm listing many more than 1.000 objects. But still the vast majority of time is spent in unmarshal(). It would be nice if this could be improved.

DyfanJones · 2023-08-22T14:43:29Z

Keeping open until paws.common released to cran

DyfanJones · 2023-09-07T12:20:36Z

Closing a paws.common 0.6.0 has been release onto the cran

mgirlich mentioned this issue Jun 1, 2023

Use known interface when parsing xml #621

Merged

DyfanJones added the enhancement 💡 New feature or request label Jun 5, 2023

DyfanJones linked a pull request Aug 19, 2023 that will close this issue

Use known interface when parsing xml #621

Merged

DyfanJones added the performance 🚀 Performance 🚀 label Aug 20, 2023

DyfanJones closed this as completed in #621 Aug 22, 2023

DyfanJones reopened this Aug 22, 2023

DyfanJones closed this as completed Sep 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`list_objects_v2()` is much slower than it should be #619

`list_objects_v2()` is much slower than it should be #619

mgirlich commented May 30, 2023

DyfanJones commented May 30, 2023

DyfanJones commented May 30, 2023

mgirlich commented May 30, 2023

DyfanJones commented May 30, 2023

mgirlich commented May 31, 2023

DyfanJones commented May 31, 2023

mgirlich commented May 31, 2023

DyfanJones commented May 31, 2023 •

edited

Loading

DyfanJones commented Aug 8, 2023 •

edited

Loading

mgirlich commented Aug 9, 2023

DyfanJones commented Aug 22, 2023

DyfanJones commented Sep 7, 2023

list_objects_v2() is much slower than it should be #619

list_objects_v2() is much slower than it should be #619

Comments

mgirlich commented May 30, 2023

DyfanJones commented May 30, 2023

DyfanJones commented May 30, 2023

mgirlich commented May 30, 2023

DyfanJones commented May 30, 2023

mgirlich commented May 31, 2023

DyfanJones commented May 31, 2023

mgirlich commented May 31, 2023

DyfanJones commented May 31, 2023 • edited Loading

DyfanJones commented Aug 8, 2023 • edited Loading

mgirlich commented Aug 9, 2023

DyfanJones commented Aug 22, 2023

DyfanJones commented Sep 7, 2023

`list_objects_v2()` is much slower than it should be #619

`list_objects_v2()` is much slower than it should be #619

DyfanJones commented May 31, 2023 •

edited

Loading

DyfanJones commented Aug 8, 2023 •

edited

Loading