Use known interface when parsing xml #621

mgirlich · 2023-06-01T06:47:36Z

This improves the performance noted in #619. Still the most time is spend in parsing but I got a 3 times performance improvement.
I'm not sure there is an easy way to improve the performance further. It might be doable by using smart xpaths but I don't have much experience with that. Maybe there is a chance to improve the performance of xml_find_all() itself.

I didn't manage to load the paws package properly and run the tests. So, I only tested this for list_objects_v2()

EDIT: we might do a bit better by using xml_find_first() instead of xml_find_all() in some situations.

DyfanJones · 2023-06-01T10:32:17Z

Here are the benchmarks for reference:

DyfanJones · 2023-06-01T14:23:15Z

This is currently failing unit test: test_handlers_restxml.R

==> Testing R file using 'testthat'

ℹ Loading paws.common
[ FAIL 0 | WARN 1 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 1 | WARN 2 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 8 | WARN 6 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 9 | WARN 7 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 10 | WARN 8 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 22 | WARN 15 | SKIP 1 | PASS 22 ]

── Warning (test_handlers_restxml.R:564:3): unmarshal scalar members ───────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:564:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:564:3): unmarshal scalar members ─────────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:564:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Warning (test_handlers_restxml.R:586:3): unmarshal blob ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:586:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:586:3): unmarshal blob ───────────────────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:586:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Warning (test_handlers_restxml.R:601:3): unmarshal list ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:601:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:603:3): unmarshal list ─────────────────────
out$ListMember[1] not equal to "abc".
1/1 mismatches
x[1]: NA
y[1]: "abc"

── Failure (test_handlers_restxml.R:604:3): unmarshal list ─────────────────────
out$ListMember[2] not equal to "123".
1/1 mismatches
x[1]: NA
y[1]: "123"

── Warning (test_handlers_restxml.R:617:3): unmarshal list ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:617:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:619:3): unmarshal list ─────────────────────
out$ListMember[1] not equal to "abc".
1/1 mismatches
x[1]: NA
y[1]: "abc"

── Failure (test_handlers_restxml.R:620:3): unmarshal list ─────────────────────
out$ListMember[2] not equal to "123".
1/1 mismatches
x[1]: NA
y[1]: "123"

── Warning (test_handlers_restxml.R:633:3): unmarshal flattened list ───────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:633:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:635:3): unmarshal flattened list ───────────
out$ListMember[1] not equal to "abc".
1/1 mismatches
x[1]: NA
y[1]: "abc"

── Failure (test_handlers_restxml.R:636:3): unmarshal flattened list ───────────
out$ListMember[2] not equal to "123".
1/1 mismatches
x[1]: NA
y[1]: "123"

── Warning (test_handlers_restxml.R:649:3): unmarshal map ──────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:649:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:649:3): unmarshal map ────────────────────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:649:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Warning (test_handlers_restxml.R:665:3): unmarshal flattened map ────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:665:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:665:3): unmarshal flattened map ──────────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:665:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Warning (test_handlers_restxml.R:681:3): unmarshal flattened named map ──────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:681:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:681:3): unmarshal flattened named map ────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:681:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Skip (test_handlers_restxml.R:692:3): unmarshal empty string ────────────────
Reason: skip

── Warning (test_handlers_restxml.R:714:3): unmarshal enum ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:714:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Warning (test_handlers_restxml.R:714:3): unmarshal enum ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:714:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:716:3): unmarshal enum ─────────────────────
out$FooEnum not equal to "foo".
Lengths differ: 0 is not 1

── Failure (test_handlers_restxml.R:717:3): unmarshal enum ─────────────────────
out$ListEnums[1] not equal to "foo".
1/1 mismatches
x[1]: NA
y[1]: "foo"

── Failure (test_handlers_restxml.R:718:3): unmarshal enum ─────────────────────
out$ListEnums[2] not equal to "bar".
1/1 mismatches
x[1]: NA
y[1]: "bar"

── Warning (test_handlers_restxml.R:730:3): unmarshal timestamp ────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:730:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:732:3): unmarshal timestamp ────────────────
out$Timestamp not equal to unix_time(0).
Lengths differ: 0 is not 1

── Warning (test_handlers_restxml.R:746:3): unmarshal timestamp in header ──────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:746:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:749:3): unmarshal timestamp in header ──────
as.integer(out$Timestamp) not equal to as.integer(expected).
Lengths differ: 0 is not 1

── Warning (test_handlers_restxml.R:765:3): unmarshal elements in header and body ──
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:765:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Warning (test_handlers_restxml.R:765:3): unmarshal elements in header and body ──
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:765:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:767:3): unmarshal elements in header and body ──
out$Body not equivalent to "foo".
Lengths differ: 0 is not 1

── Failure (test_handlers_restxml.R:768:3): unmarshal elements in header and body ──
out$Header not equivalent to "bar".
Lengths differ: 0 is not 1

── Warning (test_handlers_restxml.R:785:3): unmarshal result elements at root of xml ──
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:785:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:785:3): unmarshal result elements at root of xml ──
Error in `if (type == "structure") {
    val <- decode_xml(xml, interface)
    default <- xml2::as_xml_document(list())
    default <- decode_xml(default, interface)
} else if (type == "list") {
    val <- decode_xml(xml, interface[[1]])
    default <- list()
} else if (type == "boolean") {
    val <- xml2::xml_text(xml) == "true"
    default <- logical()
} else if (type == "string") {
    val <- xml2::xml_text(xml)
    default <- character()
} else if (type == "timestamp") {
    val <- xml2::xml_text(xml)
    val <- strptime(val, format = "%Y-%m-%dT%H:%M:%S", tz = "GMT")
    val <- as.POSIXct(val)
    default <- as.POSIXct(NULL)
} else if (type == "integer") {
    val <- as.numeric(xml2::xml_integer(xml))
    default <- numeric()
} else {
    browser()
}`: argument is of length zero
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:785:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Failure (test_handlers_restxml.R:803:3): unmarshal error ────────────────────
err$message not equal to "Bar".
1/1 mismatches
x[1]: "failed to read from query HTTP response body"
y[1]: "Bar"

── Failure (test_handlers_restxml.R:804:3): unmarshal error ────────────────────
err$code not equal to "Foo".
1/1 mismatches
x[1]: "SerializationError"
y[1]: "Foo"

── Failure (test_handlers_restxml.R:806:3): unmarshal error ────────────────────
err$error_response$RequestID not equal to "Baz".
target is NULL, current is character

[ FAIL 22 | WARN 15 | SKIP 1 | PASS 22 ]

Test complete

DyfanJones · 2023-06-01T14:25:52Z

We should also hook into xml_unmarshal as this parse xml class types. I will have a look to see if there is an easy way to do this

mgirlich · 2023-06-01T16:46:59Z

I fixed a couple of issues:

Undefined namespace prefix [1219]: The xml response uses a namespace but the tests don't. For now I simply removed the namespace to pass the tests.
The tests cover many types I didn't support before: character, long, double, float, blob.
Map is still missing
The list type is not yet correct

DyfanJones · 2023-08-09T16:22:31Z

Currently this PR seems to be breaking paws with S3 redirects and failing to return response from AWS.

svc = paws.storage::s3()
resp = svc$list_objects_v2(Bucket = "voltrondata-labs-datasets")
#> Error in rawToChar(value): argument 'x' must be a raw vector

^{Created on 2023-08-09 with reprex v2.0.2}

svc = paws.storage::s3(config = list(credentials = list(anonymous = T), region = "us-east-1"))
resp = svc$list_objects_v2(Bucket = "voltrondata-labs-datasets")
resp
#> $IsTruncated
#> logical(0)
#> 
#> $Contents
#> list()
#> 
#> $Name
#> character(0)
#> 
#> $Prefix
#> character(0)
#> 
#> $Delimiter
#> character(0)
#> 
#> $MaxKeys
#> numeric(0)
#> 
#> $CommonPrefixes
#> list()
#> 
#> $EncodingType
#> character(0)
#> 
#> $KeyCount
#> numeric(0)
#> 
#> $ContinuationToken
#> character(0)
#> 
#> $NextContinuationToken
#> character(0)
#> 
#> $StartAfter
#> character(0)

^{Created on 2023-08-09 with reprex v2.0.2}

DyfanJones · 2023-08-14T17:34:47Z

I believe I have a working solution developed from initially proposal. This method I use the xml2:xml_contents similar to xml2::as_list and then map the interface onto it.

restxml_unmarshal_2 <- function(request) {
  t <- paws.common:::rest_payload_type(request$data)
  if (t == "structure" || t == "") {
    data <- request$http_response$body
    interface <- request$data
    request$data <- xml_unmarshal_2(data, interface)
  } else {
    request <- paws.common:::rest_unmarshal(request)
  }
  return(request)
}

# Decode raw bytes XML into an R list object.
xml_unmarshal_2 <- function(raw_data, interface = NULL) {
  if (paws.common::is_empty(raw_data)) {
    return(interface)
  }

  data <- xml2::read_xml(raw_data, encoding = "utf8")
  out <- xml_parse_2(data, interface)

  return(out)
}

xml_parse_2 <- function(data, interface) {
  nms <- names(interface)
  contents <- xml2::xml_contents(data)
  contents_nms <- xml2::xml_name(contents)

  result <- list()
  for (nm in nms) {
    interface_i <- interface[[nm]]
    tags_i <- attr(interface_i, "tags")
    if (!is.null(tags_i$locationName)) {
      key <- tags_i$locationName
    } else {
      key <- nm
    }
    found <- (key == contents_nms)
    xml_elts <- contents[found]
    result[[nm]] <- (
      if (any(found)) {
        parse_xml_elt_2(xml_elts, interface_i, tags_i)
      } else {
        default_parse_xml(interface_i, tags_i)
      }
    )
  }
  return(result)
}

parse_xml_elt_2 <- function(xml_elts, interface_i, tags_i) {
  n <- length(xml_elts)

  type <- tags_i[["type"]]
  flattened <- tags_i[["flattened"]]

  t <- paws.common:::type(interface_i)
  parse_fn <- switch(t,
    structure = xml_parse_structure_2,
    map = xml_parse_map_2,
    list = xml_parse_list_2,
    xml_parse_scalar_2
  )

  result <- parse_fn(xml_elts, interface_i, type)

  # the `is.list()` check is necessary because e.g. `CheckSumAlgorithm` has
  # a list interface though it isn't a list?!
  if (isTRUE(flattened) && is.list(result)) {
    result <- transpose(result)
  }

  return(result)
}

xml_parse_structure_2 <- function(xml_elts, interface, tags_i, type = NULL) {
  result <- xml_parse_2(xml_elts, interface)
  attr(result, "tags") <- tags_i
  return(result)
}

xml_parse_map_2 <- function(xml_elts, interface,  tags_i, type = NULL) {
  result <- xml_parse_2(xml_elts, interface)
  attr(result, "tags") <- tags_i
  return(result)
}

xml_parse_list_2 <- function(xml_elts, interface, type = NULL) {
  xml_parse_2(xml_elts, interface[[1]])
}

xml_parse_scalar_2 <- function(xml_elts, interface, type = NULL) {
  results <- vapply(xml_elts, xml2::xml_text, FUN.VALUE = character(1))

  convert <- switch(type,
    blob = paws.common:::base64_to_raw,
    boolean = as.logical,
    double = as.numeric,
    float = as.numeric,
    integer = as.numeric,
    long = as.numeric,
    timestamp = function(x) paws.common:::as_timestamp(x, format = "iso8601"),
    as.character
  )
  result <- convert(results)
  names(result) <- names(interface)
  return(result)
}

default_parse_xml <- function(interface_i, tags_i) {
  type <- tags_i[["type"]]

  t <- paws.common:::type(interface_i)
  parse_fn <- switch(t,
    structure = default_parse_structure,
    map = default_parse_map,
    list = default_parse_list,
    default_parse_scalar
  )
  return(parse_fn(interface_i, type))
}

default_parse_structure <- function(interface_i, type = NULL) {
  nms <- names(interface_i)
  result <- list()
  for (nm in nms) {
    tags_i <- attr(interface_i[[nm]], "tags")
    result[[nm]] <- default_parse_xml(interface_i[[nm]], tags_i)
  }
  return(list(result))
}

default_parse_map <- function(interface_i, type = NULL) {
  nms <- names(interface_i)
  result <- list()
  for (nm in nms) {
    tags_i <- attr(interface_i[[nm]], "tags")
    result[[nm]] <- default_parse_xml(interface_i[[nm]], tags_i)
  }
  return(list(result))
}

default_parse_list <- function(interface_i, type = NULL) {
  interface_i <- interface_i[[1]]
  nms <- names(interface_i)
  result <- list()
  for (nm in nms) {
    tags_i <- attr(interface_i[[nm]], "tags")
    result[[nm]] <- default_parse_xml(interface_i[[nm]], tags_i)
  }
  return(list(result))
}

default_parse_scalar <- function(interface_i, type = NULL) {
  result <- switch(type,
    integer = numeric(),
    double = numeric(),
    long = numeric(),
    float = numeric(),
    timestamp = as.POSIXct(NULL),
    boolean = logical(),
    character()
  )
  return(result)
}

transpose  <- function(x) {
  if (!is.list(x)) {
    return(x)
  }
  n_col <- length(x)
  if (n_col == 0) {
    return(list())
  }
  n_row <- length(x[[1]])
  if (n_row == 0) {
    return(list())
  }
  out <- vector("list", length = n_row)
  col_seq <- seq.int(n_col, 1)

  vals <- vector("list", length = n_col)
  names(vals) <- names(x)

  for (row in seq.int(1, n_row)) {
    for (col in col_seq) {
      vals[col] <- list(rep_len(x[[col]], n_row)[[row]])
    }
    out[[row]] <- vals
  }
  names(out) <- names(x[[1]])
  return(out)
}

request <- readRDS("request.RDS")

raw_data <- request$http_response$body
interface <- request$data
data <- xml2::read_xml(raw_data, encoding = "utf8")

bm <- bench::mark(
  resp1 = paws.common:::restxml_unmarshal(request)$data,
  resp2 = restxml_unmarshal_2(request)$data,
  check = F,
  iterations = 1000
)

Note: request.RDS is a list_objects_v2 request from my personal AWS s3 bucket. I will try and get example from a public aws s3 bucket for further benchmarking.

TODO:

Add attributes to default values
Check logic from map methods
add/check existing units test
tests against other calls
Ensure existing functionality doesn't break i.e. S3 redirects

… xml

DyfanJones · 2023-08-15T16:08:54Z

Performance comparisons:

profvis::profvis({
  library(paws.storage)
  svc <- s3()
  resp <- svc$list_objects_v2(Bucket = "voltrondata-labs-datasets")
})

Old unmarshall:

New unmarshal:

DyfanJones · 2023-08-15T16:13:24Z

@mgirlich do you mind doing a profvis on your s3 bucket from your #619 issue? I would like to see if we have managed to get any significant performance improvements over larger xml responses :)

remotes::install_github("mgirlich/paws/paws.common", ref = "speed-up-decode-xml")

mgirlich · 2023-08-16T06:16:44Z

It was way faster until I got an error 😉 The response from my bucket seems to be different to the response from the bucket voltrondata-labs-datasets...

Error in `result[[i]]`:
! subscript out of bounds
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error in `result[[i]]`:
! subscript out of bounds
---
Backtrace:
    ▆
 1. └─s3$list_objects_v2(...)
 2.   └─paws.common::send_request(request)
 3.     └─paws.common:::get_request_output(request)
 4.       └─paws.common::tag_del(request$data)
 5.         └─paws.common::tag_del(result[[i]], tags)
 6.           └─paws.common::tag_del(result[[i]], tags)
 7.             └─paws.common::tag_del(result[[i]], tags)

DyfanJones · 2023-08-17T18:29:29Z

Just a quick update. I have nearly got all unit tests working. I am struggling to get the code working for the final unit tests. I believe it should be working for most cases. @mgirlich do you mind doing an another tempt as list_objects_v2?

profvis::profvis({
  library(paws.storage)
  svc <- s3()
  resp <- svc$list_objects_v2(Bucket = "voltrondata-labs-datasets")
})

We are so close in getting this working as intended.

So close but so far at the same time 😆

mgirlich · 2023-08-18T06:15:02Z

So much better! 🎉

CRAN
total: 15420 ms
unmarshal: 12550 ms

PR
total: 2550 ms
unmarshal: 2040 ms

I noticed two (minor) things:

In the CRAN version the field ChecksumAlgorithm was sometimes an empty list but not in this PR. But this only happend in 1 page.
CommonPrefixes() is an empty list in the CRAN version and now a list(Prefix = character()).

This is basically irrelevant to me but it might indicate there are still some bugs in parsing (or maybe even fixes?)

DyfanJones · 2023-08-18T06:24:47Z

I noticed two (minor) things:

In the CRAN version the field ChecksumAlgorithm was sometimes an empty list but not in this PR. But this only happend in 1 page.
CommonPrefixes() is an empty list in the CRAN version and now a list(Prefix = character()).

This is basically irrelevant to me but it might indicate there are still some bugs in parsing (or maybe even fixes?)

You are quite right. I am still squashing the final bugs. I just need to make sure the all the data types are returned correctly. I believe it is working fine for all units test bar from 4.

Nearly there 😆

mgirlich · 2023-08-21T06:26:07Z

Unfortunately, the small fixes cost quite a lot of performance overall (though still much better than before this PR!):

remotes::install_github("mgirlich/paws/paws.common@f0bad2c")
# 3040 (1830 for unmarshal)

remotes::install_github("mgirlich/paws/paws.common@5302502")
# 5040 (3010 for unmarshal)
# `transpose()` added -> 1000 ms

remotes::install_github("mgirlich/paws/paws.common@c4e5d84")
# 5950 (4610 for unmarshal)
# `transpose()` -> 2300 ms

DyfanJones · 2023-08-21T08:11:26Z

Ah nuts, lets see if we can claw any of the performance back :)

DyfanJones · 2023-08-21T12:34:40Z

@mgirlich is transpose the bottle neck here? 🤔 If you replace the transpose with the purrr::transpose do we get an significant benefits? Do you mind doing that benchmark? Unfortunately I don't have a suitable request to do proper benchmarking.

DyfanJones · 2023-08-21T15:23:40Z

Ok I have made some improvements. It is looking promising however I don't know how it scales up.

@mgirlich do you mind doing another benchmark for me? (sorry for keep pestering you for the benchmarks)

mgirlich · 2023-08-21T16:52:57Z

This looks much better now

CRAN: 19710 ms / 17250 ms for unmarshal()
09b7c22: 4530 ms / 2380 ms for unmarshal()

DyfanJones · 2023-08-21T16:57:30Z

That is good news, can you put the profvis break down? I just want to see what is bottle necking us now. And see if we can get any more improvements out of it.

mgirlich · 2023-08-22T11:03:54Z

I attached a Rprof file, so you can also easily analyse this 😄
prof.out.zip

DyfanJones · 2023-08-22T12:35:51Z

@mgirlich I noticed xml_name was taking alot of time roughly 880ms.

I believe it was down to xml2::xml_name being called in a lapply.

    result <- lapply(contents, function(x) {
      xml_elt <- xml2::xml_contents(x)
      xml_parse(xml_elt, interface_i[[1]], xml2::xml_name(xml_elt))
    })

I have now opted to get xml names from the xml_child.

    xml_nms <- xml2::xml_name(xml2::xml_child(contents[[1]]))
    result <- lapply(contents, function(x) {
      xml_elt <- xml2::xml_contents(x)
      xml_parse(xml_elt, interface_i[[1]], xml_nms)
    })

@mgirlich do you mind doing a benchmark to see if we have got any performance improvements at all :P

mgirlich · 2023-08-22T13:32:39Z

Don't try to hard for now. I am now also looking into improving the performance of {xml2}. There are a couple of easy performance improvements that should give nice boost here as well.

DyfanJones · 2023-08-22T13:38:20Z

If that is the case then I will merge these improvements and start preparing paws.common to be released to the cran.

Use known interface when parsing xml

0844290

mgirlich mentioned this pull request Jun 1, 2023

Improve performance r-lib/xml2#394

Open

mgirlich added 3 commits June 1, 2023 16:41

Support types character, long, double, float, blob

d6e9a48

Support tag loacationName

052ddfc

Use key without namespace for tests

bdbb588

DyfanJones added 2 commits August 15, 2023 17:02

initial refactor of xml_unmarshal, using known interface when parsing…

0869bd4

… xml

pass raw data body and interface to xml_unmarshal

64543ed

step down a level if result_name is in known interface: paws-r#501

2c88517

DyfanJones added 2 commits August 17, 2023 19:19

pass xml object to xml_parse

e2bba16

return interface if not empty for default value

f0bad2c

DyfanJones added 2 commits August 18, 2023 05:51

add missing flatten option for xml_parse_list

5302502

drop a level

1e7fe21

drop xml level correctly, ensure key is grabbed correclty

f2996e5

DyfanJones and others added 5 commits August 18, 2023 11:21

add default list value

b1784db

update default values

67d5d90

fix method of dropping level if result_name is known in data

7e455ff

Merge branch 'main' into speed-up-decode-xml

ea0b4d9

update news

c4e5d84

DyfanJones linked an issue Aug 19, 2023 that may be closed by this pull request

list_objects_v2() is much slower than it should be #619

Closed

DyfanJones added the performance 🚀 Performance 🚀 label Aug 20, 2023

DyfanJones added 3 commits August 21, 2023 10:38

remove vapply

aeb9212

correctly name list

64ff769

pass xml names to xml_parse attempt to reduce repeat calculations

92b3331

only repeat data if needed

09b7c22

DyfanJones mentioned this pull request Aug 21, 2023

Inconsistent result of s3fs::s3_dir_create with fs::dir_create DyfanJones/s3fs#28

Closed

DyfanJones added 2 commits August 22, 2023 10:11

switch to vapply to improve performance

5a2512e

formatting

1e1cf4b

prevetn xml names being repeated

b1875b4

get xml children instead of xml child

4cbc329

revert xml_name implementation for now

93ce592

DyfanJones merged commit 7a841d9 into paws-r:main Aug 22, 2023
3 checks passed

mgirlich deleted the speed-up-decode-xml branch August 23, 2023 12:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use known interface when parsing xml #621

Use known interface when parsing xml #621

mgirlich commented Jun 1, 2023 •

edited

Loading

DyfanJones commented Jun 1, 2023

DyfanJones commented Jun 1, 2023

DyfanJones commented Jun 1, 2023

mgirlich commented Jun 1, 2023

DyfanJones commented Aug 9, 2023

DyfanJones commented Aug 14, 2023 •

edited

Loading

DyfanJones commented Aug 15, 2023

DyfanJones commented Aug 15, 2023

mgirlich commented Aug 16, 2023

DyfanJones commented Aug 17, 2023

mgirlich commented Aug 18, 2023

DyfanJones commented Aug 18, 2023

mgirlich commented Aug 21, 2023

DyfanJones commented Aug 21, 2023

DyfanJones commented Aug 21, 2023

DyfanJones commented Aug 21, 2023

mgirlich commented Aug 21, 2023

DyfanJones commented Aug 21, 2023

mgirlich commented Aug 22, 2023

DyfanJones commented Aug 22, 2023

mgirlich commented Aug 22, 2023

DyfanJones commented Aug 22, 2023

Use known interface when parsing xml #621

Use known interface when parsing xml #621

Conversation

mgirlich commented Jun 1, 2023 • edited Loading

DyfanJones commented Jun 1, 2023

DyfanJones commented Jun 1, 2023

DyfanJones commented Jun 1, 2023

mgirlich commented Jun 1, 2023

DyfanJones commented Aug 9, 2023

DyfanJones commented Aug 14, 2023 • edited Loading

DyfanJones commented Aug 15, 2023

DyfanJones commented Aug 15, 2023

mgirlich commented Aug 16, 2023

DyfanJones commented Aug 17, 2023

mgirlich commented Aug 18, 2023

DyfanJones commented Aug 18, 2023

mgirlich commented Aug 21, 2023

DyfanJones commented Aug 21, 2023

DyfanJones commented Aug 21, 2023

DyfanJones commented Aug 21, 2023

mgirlich commented Aug 21, 2023

DyfanJones commented Aug 21, 2023

mgirlich commented Aug 22, 2023

DyfanJones commented Aug 22, 2023

mgirlich commented Aug 22, 2023

DyfanJones commented Aug 22, 2023

mgirlich commented Jun 1, 2023 •

edited

Loading

DyfanJones commented Aug 14, 2023 •

edited

Loading