Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use known interface when parsing xml #621

Merged
merged 26 commits into from
Aug 22, 2023

Conversation

mgirlich
Copy link
Contributor

@mgirlich mgirlich commented Jun 1, 2023

This improves the performance noted in #619. Still the most time is spend in parsing but I got a 3 times performance improvement.
I'm not sure there is an easy way to improve the performance further. It might be doable by using smart xpaths but I don't have much experience with that. Maybe there is a chance to improve the performance of xml_find_all() itself.

I didn't manage to load the paws package properly and run the tests. So, I only tested this for list_objects_v2()

EDIT: we might do a bit better by using xml_find_first() instead of xml_find_all() in some situations.

@DyfanJones
Copy link
Member

Here are the benchmarks for reference:

image

@DyfanJones
Copy link
Member

This is currently failing unit test: test_handlers_restxml.R

==> Testing R file using 'testthat'

ℹ Loading paws.common
[ FAIL 0 | WARN 1 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 1 | WARN 2 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 8 | WARN 6 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 9 | WARN 7 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 10 | WARN 8 | SKIP 0 | PASS 21 ]Called from: parse_xml_elt(elts, interface_i)
debug at /Users/dyfanjones/Documents/temp/paws/paws.common/R/xmlutil.R#111: if (n == 0 && length(val) > 0) {
    val <- val[0]
}
[ FAIL 22 | WARN 15 | SKIP 1 | PASS 22 ]

── Warning (test_handlers_restxml.R:564:3): unmarshal scalar members ───────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:564:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:564:3): unmarshal scalar members ─────────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:564:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Warning (test_handlers_restxml.R:586:3): unmarshal blob ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:586:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:586:3): unmarshal blob ───────────────────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:586:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Warning (test_handlers_restxml.R:601:3): unmarshal list ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:601:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:603:3): unmarshal list ─────────────────────
out$ListMember[1] not equal to "abc".
1/1 mismatches
x[1]: NA
y[1]: "abc"

── Failure (test_handlers_restxml.R:604:3): unmarshal list ─────────────────────
out$ListMember[2] not equal to "123".
1/1 mismatches
x[1]: NA
y[1]: "123"

── Warning (test_handlers_restxml.R:617:3): unmarshal list ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:617:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:619:3): unmarshal list ─────────────────────
out$ListMember[1] not equal to "abc".
1/1 mismatches
x[1]: NA
y[1]: "abc"

── Failure (test_handlers_restxml.R:620:3): unmarshal list ─────────────────────
out$ListMember[2] not equal to "123".
1/1 mismatches
x[1]: NA
y[1]: "123"

── Warning (test_handlers_restxml.R:633:3): unmarshal flattened list ───────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:633:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:635:3): unmarshal flattened list ───────────
out$ListMember[1] not equal to "abc".
1/1 mismatches
x[1]: NA
y[1]: "abc"

── Failure (test_handlers_restxml.R:636:3): unmarshal flattened list ───────────
out$ListMember[2] not equal to "123".
1/1 mismatches
x[1]: NA
y[1]: "123"

── Warning (test_handlers_restxml.R:649:3): unmarshal map ──────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:649:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:649:3): unmarshal map ────────────────────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:649:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Warning (test_handlers_restxml.R:665:3): unmarshal flattened map ────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:665:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:665:3): unmarshal flattened map ──────────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:665:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Warning (test_handlers_restxml.R:681:3): unmarshal flattened named map ──────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:681:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:681:3): unmarshal flattened named map ────────
Error in `parse_xml_elt(elts, interface_i)`: object 'val' not found
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:681:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Skip (test_handlers_restxml.R:692:3): unmarshal empty string ────────────────
Reason: skip

── Warning (test_handlers_restxml.R:714:3): unmarshal enum ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:714:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Warning (test_handlers_restxml.R:714:3): unmarshal enum ─────────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:714:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:716:3): unmarshal enum ─────────────────────
out$FooEnum not equal to "foo".
Lengths differ: 0 is not 1

── Failure (test_handlers_restxml.R:717:3): unmarshal enum ─────────────────────
out$ListEnums[1] not equal to "foo".
1/1 mismatches
x[1]: NA
y[1]: "foo"

── Failure (test_handlers_restxml.R:718:3): unmarshal enum ─────────────────────
out$ListEnums[2] not equal to "bar".
1/1 mismatches
x[1]: NA
y[1]: "bar"

── Warning (test_handlers_restxml.R:730:3): unmarshal timestamp ────────────────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:730:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:732:3): unmarshal timestamp ────────────────
out$Timestamp not equal to unix_time(0).
Lengths differ: 0 is not 1

── Warning (test_handlers_restxml.R:746:3): unmarshal timestamp in header ──────
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:746:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:749:3): unmarshal timestamp in header ──────
as.integer(out$Timestamp) not equal to as.integer(expected).
Lengths differ: 0 is not 1

── Warning (test_handlers_restxml.R:765:3): unmarshal elements in header and body ──
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:765:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Warning (test_handlers_restxml.R:765:3): unmarshal elements in header and body ──
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:765:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Failure (test_handlers_restxml.R:767:3): unmarshal elements in header and body ──
out$Body not equivalent to "foo".
Lengths differ: 0 is not 1

── Failure (test_handlers_restxml.R:768:3): unmarshal elements in header and body ──
out$Header not equivalent to "bar".
Lengths differ: 0 is not 1

── Warning (test_handlers_restxml.R:785:3): unmarshal result elements at root of xml ──
Undefined namespace prefix [1219]
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:785:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 6. xml2:::xml_find_all.xml_node(xml, key, flatten = FALSE)

── Error (test_handlers_restxml.R:785:3): unmarshal result elements at root of xml ──
Error in `if (type == "structure") {
    val <- decode_xml(xml, interface)
    default <- xml2::as_xml_document(list())
    default <- decode_xml(default, interface)
} else if (type == "list") {
    val <- decode_xml(xml, interface[[1]])
    default <- list()
} else if (type == "boolean") {
    val <- xml2::xml_text(xml) == "true"
    default <- logical()
} else if (type == "string") {
    val <- xml2::xml_text(xml)
    default <- character()
} else if (type == "timestamp") {
    val <- xml2::xml_text(xml)
    val <- strptime(val, format = "%Y-%m-%dT%H:%M:%S", tz = "GMT")
    val <- as.POSIXct(val)
    default <- as.POSIXct(NULL)
} else if (type == "integer") {
    val <- as.numeric(xml2::xml_integer(xml))
    default <- numeric()
} else {
    browser()
}`: argument is of length zero
Backtrace:
 1. paws.common:::unmarshal(req)
      at test_handlers_restxml.R:785:2
 2. paws.common:::run(request, unmarshal)
      at paws.common/R/handlers.R:158:2
 3. handler$fn(request)
      at paws.common/R/handlers.R:113:4
 4. paws.common:::decode_xml(xml, interface)
      at paws.common/R/handlers_restxml.R:23:4
 5. paws.common:::parse_xml_elt(elts, interface_i)
      at paws.common/R/xmlutil.R:26:6

── Failure (test_handlers_restxml.R:803:3): unmarshal error ────────────────────
err$message not equal to "Bar".
1/1 mismatches
x[1]: "failed to read from query HTTP response body"
y[1]: "Bar"

── Failure (test_handlers_restxml.R:804:3): unmarshal error ────────────────────
err$code not equal to "Foo".
1/1 mismatches
x[1]: "SerializationError"
y[1]: "Foo"

── Failure (test_handlers_restxml.R:806:3): unmarshal error ────────────────────
err$error_response$RequestID not equal to "Baz".
target is NULL, current is character

[ FAIL 22 | WARN 15 | SKIP 1 | PASS 22 ]

Test complete

@DyfanJones
Copy link
Member

We should also hook into xml_unmarshal as this parse xml class types. I will have a look to see if there is an easy way to do this

@mgirlich
Copy link
Contributor Author

mgirlich commented Jun 1, 2023

I fixed a couple of issues:

  • Undefined namespace prefix [1219]: The xml response uses a namespace but the tests don't. For now I simply removed the namespace to pass the tests.
  • The tests cover many types I didn't support before: character, long, double, float, blob.
  • Map is still missing
  • The list type is not yet correct

@DyfanJones
Copy link
Member

Currently this PR seems to be breaking paws with S3 redirects and failing to return response from AWS.

svc = paws.storage::s3()
resp = svc$list_objects_v2(Bucket = "voltrondata-labs-datasets")
#> Error in rawToChar(value): argument 'x' must be a raw vector

Created on 2023-08-09 with reprex v2.0.2

svc = paws.storage::s3(config = list(credentials = list(anonymous = T), region = "us-east-1"))
resp = svc$list_objects_v2(Bucket = "voltrondata-labs-datasets")
resp
#> $IsTruncated
#> logical(0)
#> 
#> $Contents
#> list()
#> 
#> $Name
#> character(0)
#> 
#> $Prefix
#> character(0)
#> 
#> $Delimiter
#> character(0)
#> 
#> $MaxKeys
#> numeric(0)
#> 
#> $CommonPrefixes
#> list()
#> 
#> $EncodingType
#> character(0)
#> 
#> $KeyCount
#> numeric(0)
#> 
#> $ContinuationToken
#> character(0)
#> 
#> $NextContinuationToken
#> character(0)
#> 
#> $StartAfter
#> character(0)

Created on 2023-08-09 with reprex v2.0.2

@DyfanJones
Copy link
Member

DyfanJones commented Aug 14, 2023

I believe I have a working solution developed from initially proposal. This method I use the xml2:xml_contents similar to xml2::as_list and then map the interface onto it.

restxml_unmarshal_2 <- function(request) {
  t <- paws.common:::rest_payload_type(request$data)
  if (t == "structure" || t == "") {
    data <- request$http_response$body
    interface <- request$data
    request$data <- xml_unmarshal_2(data, interface)
  } else {
    request <- paws.common:::rest_unmarshal(request)
  }
  return(request)
}

# Decode raw bytes XML into an R list object.
xml_unmarshal_2 <- function(raw_data, interface = NULL) {
  if (paws.common::is_empty(raw_data)) {
    return(interface)
  }

  data <- xml2::read_xml(raw_data, encoding = "utf8")
  out <- xml_parse_2(data, interface)

  return(out)
}

xml_parse_2 <- function(data, interface) {
  nms <- names(interface)
  contents <- xml2::xml_contents(data)
  contents_nms <- xml2::xml_name(contents)

  result <- list()
  for (nm in nms) {
    interface_i <- interface[[nm]]
    tags_i <- attr(interface_i, "tags")
    if (!is.null(tags_i$locationName)) {
      key <- tags_i$locationName
    } else {
      key <- nm
    }
    found <- (key == contents_nms)
    xml_elts <- contents[found]
    result[[nm]] <- (
      if (any(found)) {
        parse_xml_elt_2(xml_elts, interface_i, tags_i)
      } else {
        default_parse_xml(interface_i, tags_i)
      }
    )
  }
  return(result)
}

parse_xml_elt_2 <- function(xml_elts, interface_i, tags_i) {
  n <- length(xml_elts)

  type <- tags_i[["type"]]
  flattened <- tags_i[["flattened"]]

  t <- paws.common:::type(interface_i)
  parse_fn <- switch(t,
    structure = xml_parse_structure_2,
    map = xml_parse_map_2,
    list = xml_parse_list_2,
    xml_parse_scalar_2
  )

  result <- parse_fn(xml_elts, interface_i, type)

  # the `is.list()` check is necessary because e.g. `CheckSumAlgorithm` has
  # a list interface though it isn't a list?!
  if (isTRUE(flattened) && is.list(result)) {
    result <- transpose(result)
  }

  return(result)
}

xml_parse_structure_2 <- function(xml_elts, interface, tags_i, type = NULL) {
  result <- xml_parse_2(xml_elts, interface)
  attr(result, "tags") <- tags_i
  return(result)
}

xml_parse_map_2 <- function(xml_elts, interface,  tags_i, type = NULL) {
  result <- xml_parse_2(xml_elts, interface)
  attr(result, "tags") <- tags_i
  return(result)
}

xml_parse_list_2 <- function(xml_elts, interface, type = NULL) {
  xml_parse_2(xml_elts, interface[[1]])
}

xml_parse_scalar_2 <- function(xml_elts, interface, type = NULL) {
  results <- vapply(xml_elts, xml2::xml_text, FUN.VALUE = character(1))

  convert <- switch(type,
    blob = paws.common:::base64_to_raw,
    boolean = as.logical,
    double = as.numeric,
    float = as.numeric,
    integer = as.numeric,
    long = as.numeric,
    timestamp = function(x) paws.common:::as_timestamp(x, format = "iso8601"),
    as.character
  )
  result <- convert(results)
  names(result) <- names(interface)
  return(result)
}

default_parse_xml <- function(interface_i, tags_i) {
  type <- tags_i[["type"]]

  t <- paws.common:::type(interface_i)
  parse_fn <- switch(t,
    structure = default_parse_structure,
    map = default_parse_map,
    list = default_parse_list,
    default_parse_scalar
  )
  return(parse_fn(interface_i, type))
}

default_parse_structure <- function(interface_i, type = NULL) {
  nms <- names(interface_i)
  result <- list()
  for (nm in nms) {
    tags_i <- attr(interface_i[[nm]], "tags")
    result[[nm]] <- default_parse_xml(interface_i[[nm]], tags_i)
  }
  return(list(result))
}

default_parse_map <- function(interface_i, type = NULL) {
  nms <- names(interface_i)
  result <- list()
  for (nm in nms) {
    tags_i <- attr(interface_i[[nm]], "tags")
    result[[nm]] <- default_parse_xml(interface_i[[nm]], tags_i)
  }
  return(list(result))
}

default_parse_list <- function(interface_i, type = NULL) {
  interface_i <- interface_i[[1]]
  nms <- names(interface_i)
  result <- list()
  for (nm in nms) {
    tags_i <- attr(interface_i[[nm]], "tags")
    result[[nm]] <- default_parse_xml(interface_i[[nm]], tags_i)
  }
  return(list(result))
}

default_parse_scalar <- function(interface_i, type = NULL) {
  result <- switch(type,
    integer = numeric(),
    double = numeric(),
    long = numeric(),
    float = numeric(),
    timestamp = as.POSIXct(NULL),
    boolean = logical(),
    character()
  )
  return(result)
}

transpose  <- function(x) {
  if (!is.list(x)) {
    return(x)
  }
  n_col <- length(x)
  if (n_col == 0) {
    return(list())
  }
  n_row <- length(x[[1]])
  if (n_row == 0) {
    return(list())
  }
  out <- vector("list", length = n_row)
  col_seq <- seq.int(n_col, 1)

  vals <- vector("list", length = n_col)
  names(vals) <- names(x)

  for (row in seq.int(1, n_row)) {
    for (col in col_seq) {
      vals[col] <- list(rep_len(x[[col]], n_row)[[row]])
    }
    out[[row]] <- vals
  }
  names(out) <- names(x[[1]])
  return(out)
}
request <- readRDS("request.RDS")

raw_data <- request$http_response$body
interface <- request$data
data <- xml2::read_xml(raw_data, encoding = "utf8")

bm <- bench::mark(
  resp1 = paws.common:::restxml_unmarshal(request)$data,
  resp2 = restxml_unmarshal_2(request)$data,
  check = F,
  iterations = 1000
)

image

Note: request.RDS is a list_objects_v2 request from my personal AWS s3 bucket. I will try and get example from a public aws s3 bucket for further benchmarking.

TODO:

  1. Add attributes to default values
  2. Check logic from map methods
  3. add/check existing units test
  4. tests against other calls
  5. Ensure existing functionality doesn't break i.e. S3 redirects

@DyfanJones
Copy link
Member

Performance comparisons:

profvis::profvis({
  library(paws.storage)
  svc <- s3()
  resp <- svc$list_objects_v2(Bucket = "voltrondata-labs-datasets")
})

Old unmarshall:
image

New unmarshal:
image

@DyfanJones
Copy link
Member

@mgirlich do you mind doing a profvis on your s3 bucket from your #619 issue? I would like to see if we have managed to get any significant performance improvements over larger xml responses :)

remotes::install_github("mgirlich/paws/paws.common", ref = "speed-up-decode-xml")

@mgirlich
Copy link
Contributor Author

It was way faster until I got an error 😉 The response from my bucket seems to be different to the response from the bucket voltrondata-labs-datasets...

Error in `result[[i]]`:
! subscript out of bounds
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error in `result[[i]]`:
! subscript out of bounds
---
Backtrace:
    ▆
 1. └─s3$list_objects_v2(...)
 2.   └─paws.common::send_request(request)
 3.     └─paws.common:::get_request_output(request)
 4.       └─paws.common::tag_del(request$data)
 5.         └─paws.common::tag_del(result[[i]], tags)
 6.           └─paws.common::tag_del(result[[i]], tags)
 7.             └─paws.common::tag_del(result[[i]], tags)

@DyfanJones
Copy link
Member

Just a quick update. I have nearly got all unit tests working. I am struggling to get the code working for the final unit tests. I believe it should be working for most cases. @mgirlich do you mind doing an another tempt as list_objects_v2?

profvis::profvis({
  library(paws.storage)
  svc <- s3()
  resp <- svc$list_objects_v2(Bucket = "voltrondata-labs-datasets")
})

We are so close in getting this working as intended.

So close but so far at the same time 😆

@mgirlich
Copy link
Contributor Author

So much better! 🎉

CRAN
total: 15420 ms
unmarshal: 12550 ms

PR
total: 2550 ms
unmarshal: 2040 ms

I noticed two (minor) things:

  1. In the CRAN version the field ChecksumAlgorithm was sometimes an empty list but not in this PR. But this only happend in 1 page.
  2. CommonPrefixes() is an empty list in the CRAN version and now a list(Prefix = character()).

This is basically irrelevant to me but it might indicate there are still some bugs in parsing (or maybe even fixes?)

@DyfanJones
Copy link
Member

I noticed two (minor) things:

In the CRAN version the field ChecksumAlgorithm was sometimes an empty list but not in this PR. But this only happend in 1 page.
CommonPrefixes() is an empty list in the CRAN version and now a list(Prefix = character()).

This is basically irrelevant to me but it might indicate there are still some bugs in parsing (or maybe even fixes?)

You are quite right. I am still squashing the final bugs. I just need to make sure the all the data types are returned correctly. I believe it is working fine for all units test bar from 4.

Nearly there 😆

@DyfanJones DyfanJones linked an issue Aug 19, 2023 that may be closed by this pull request
@DyfanJones DyfanJones added the performance 🚀 Performance 🚀 label Aug 20, 2023
@mgirlich
Copy link
Contributor Author

Unfortunately, the small fixes cost quite a lot of performance overall (though still much better than before this PR!):

remotes::install_github("mgirlich/paws/paws.common@f0bad2c")
# 3040 (1830 for unmarshal)

remotes::install_github("mgirlich/paws/paws.common@5302502")
# 5040 (3010 for unmarshal)
# `transpose()` added -> 1000 ms

remotes::install_github("mgirlich/paws/paws.common@c4e5d84")
# 5950 (4610 for unmarshal)
# `transpose()` -> 2300 ms

@DyfanJones
Copy link
Member

Ah nuts, lets see if we can claw any of the performance back :)

@DyfanJones
Copy link
Member

@mgirlich is transpose the bottle neck here? 🤔 If you replace the transpose with the purrr::transpose do we get an significant benefits? Do you mind doing that benchmark? Unfortunately I don't have a suitable request to do proper benchmarking.

@DyfanJones
Copy link
Member

Ok I have made some improvements. It is looking promising however I don't know how it scales up.

image

@mgirlich do you mind doing another benchmark for me? (sorry for keep pestering you for the benchmarks)

@mgirlich
Copy link
Contributor Author

This looks much better now

CRAN: 19710 ms / 17250 ms for unmarshal()
09b7c22: 4530 ms / 2380 ms for unmarshal()

@DyfanJones
Copy link
Member

That is good news, can you put the profvis break down? I just want to see what is bottle necking us now. And see if we can get any more improvements out of it.

@mgirlich
Copy link
Contributor Author

I attached a Rprof file, so you can also easily analyse this 😄
prof.out.zip

@DyfanJones
Copy link
Member

@mgirlich I noticed xml_name was taking alot of time roughly 880ms.

image

I believe it was down to xml2::xml_name being called in a lapply.

    result <- lapply(contents, function(x) {
      xml_elt <- xml2::xml_contents(x)
      xml_parse(xml_elt, interface_i[[1]], xml2::xml_name(xml_elt))
    })

I have now opted to get xml names from the xml_child.

    xml_nms <- xml2::xml_name(xml2::xml_child(contents[[1]]))
    result <- lapply(contents, function(x) {
      xml_elt <- xml2::xml_contents(x)
      xml_parse(xml_elt, interface_i[[1]], xml_nms)
    })

@mgirlich do you mind doing a benchmark to see if we have got any performance improvements at all :P

@mgirlich
Copy link
Contributor Author

Don't try to hard for now. I am now also looking into improving the performance of {xml2}. There are a couple of easy performance improvements that should give nice boost here as well.

@DyfanJones
Copy link
Member

If that is the case then I will merge these improvements and start preparing paws.common to be released to the cran.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance 🚀 Performance 🚀
Projects
None yet
Development

Successfully merging this pull request may close these issues.

list_objects_v2() is much slower than it should be
2 participants