Implement subset() and transform() methods for network objects #3

krivit · 2018-05-23T01:01:23Z

subset() would drop vertices or edges based on a specification.
transform() would rewrite or construct vertex or edge attributes based on a specification.

The text was updated successfully, but these errors were encountered:

skyebend · 2018-05-23T01:48:51Z

would subset() be different from network::get.inducedSubraph() ? or are you thinking of it as a clearer alias?

krivit · 2018-05-23T05:26:10Z

A clearer alias that lets the user specify the subset with vertex attributes as variable names.

martinamorris · 2018-05-23T16:54:43Z

* subset() would drop vertices or edges based on a specification. * transform() would rewrite or construct vertex or edge attributes based on a specification.

nice.

knapply · 2018-09-28T19:10:18Z

I may have implemented equivalents of these functions elsewhere.

Is the intention to use non-standard evaluation?
If so, will the next CRAN release of {network} keep the GitHub version's {tibble} import (thus making {rlang} available)?

krivit · 2018-09-28T23:44:05Z

That's the plan.

knapply · 2018-09-29T23:06:28Z

Cool. Here's an implementation for subset.network():

library(rlang) # tibble dependency
suppressPackageStartupMessages(library(network))

This extracts vertex attributes (adapted from my own package, but may be handy if it were native to {network}):

vert_tibble <- function(x) {
  .vert_attr_names <- unique(unlist(lapply(x[["val"]], names)))
  .vert_attrs <- set_names(lapply(.vert_attr_names, function(attr_name) {
      unlist(lapply(lapply(x[["val"]], `[[`, attr_name), `%||%`, NA))
    }), .vert_attr_names)
  tibble::as_tibble(.vert_attrs)
}

The actual subset function:

subset.network <- function(x, .predicate, what = c("vertices", "edges")) {
  what <- arg_match(what, c("vertices", "edges"))
  if (what == "edges") {
    .e_df <- network:::as_tibble.network(x, attrnames = TRUE)
    .eids <- eval_tidy(quo(.e_df[!!enquo(.predicate), , drop = FALSE]), 
                       data = .e_df)[[".eid"]]
    return(get.inducedSubgraph(x, eid = .eids))
  }
  .v_df <- vert_tibble(x)
  .v_df[[".vid"]] <- seq_len(nrow(.v_df))
  
  .vids <- eval_tidy(quo(.v_df[!!enquo(.predicate), , drop = FALSE]), 
                     data = .v_df)[[".vid"]]
  get.inducedSubgraph(x, v = .vids)
}

Some quick helpers for visual inspection:

edge_tibble <- function(x) {
  network:::as_tibble.network(x, attrnames = TRUE)
}
`%>%` <- magrittr::`%>%`

Example data:

data("sampson", package = "ergm")

subset.network() in action:

subset by multiple vertex predicates

subset(samplike, group == "Loyal" & !cloisterville)
#>  Network attributes:
#>   vertices = 5 
#>   directed = TRUE 
#>   hyper = FALSE 
#>   loops = FALSE 
#>   multiple = FALSE 
#>   total edges= 8 
#>     missing edges= 0 
#>     non-missing edges= 8 
#> 
#>  Vertex attribute names: 
#>     cloisterville group vertex.names 
#> 
#>  Edge attribute names: 
#>     nominations

samplike %>% 
  subset(group == "Loyal" & !cloisterville) %>% 
  vert_tibble()
#> # A tibble: 5 x 4
#>   na    group vertex.names cloisterville
#>   <lgl> <chr> <chr>        <lgl>        
#> 1 TRUE  Loyal Berthold     FALSE        
#> 2 TRUE  Loyal Victor       FALSE        
#> 3 TRUE  Loyal Ambrose      FALSE        
#> 4 TRUE  Loyal Romauld      FALSE        
#> 5 TRUE  Loyal Louis        FALSE

subset by edge predicates

samplike %>% 
  subset(nominations == 1, what = "edges")
#>  Network attributes:
#>   vertices = 18 
#>   directed = TRUE 
#>   hyper = FALSE 
#>   loops = FALSE 
#>   multiple = FALSE 
#>   total edges= 38 
#>     missing edges= 0 
#>     non-missing edges= 38 
#> 
#>  Vertex attribute names: 
#>     cloisterville group vertex.names 
#> 
#>  Edge attribute names: 
#>     nominations

samplike %>% 
  subset(nominations == 1, what = "edges") %>% 
  edge_tibble()
#> # A tibble: 38 x 5
#>    .tail .head  .eid na    nominations
#>    <int> <int> <int> <lgl>       <dbl>
#>  1     5     1     1 FALSE           1
#>  2     7     1     2 FALSE           1
#>  3     1     2     3 FALSE           1
#>  4     3     2     4 FALSE           1
#>  5     6     5    20 FALSE           1
#>  6    10     8    32 FALSE           1
#>  7    10     9    41 FALSE           1
#>  8     8    10    49 FALSE           1
#>  9    18    13    72 FALSE           1
#> 10     2    15    77 FALSE           1
#> # ... with 28 more rows

chain subsets together

samplike %>% 
  subset(group == "Outcasts") %>% 
  subset(nominations == 3, what = "edges")
#>  Network attributes:
#>   vertices = 4 
#>   directed = TRUE 
#>   hyper = FALSE 
#>   loops = FALSE 
#>   multiple = FALSE 
#>   total edges= 4 
#>     missing edges= 0 
#>     non-missing edges= 4 
#> 
#>  Vertex attribute names: 
#>     cloisterville group vertex.names 
#> 
#>  Edge attribute names: 
#>     nominations

Thoughts?

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] network_1.14-355 rlang_0.2.2.9001
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.18      fansi_0.3.0       assertthat_0.2.0 
#>  [4] utf8_1.1.4        crayon_1.3.4      digest_0.6.17    
#>  [7] magrittr_1.5      evaluate_0.11     pillar_1.3.0.9000
#> [10] cli_1.0.0         stringi_1.2.4     rmarkdown_1.10.13
#> [13] htmldeps_0.1.1    tools_3.5.1       stringr_1.3.1    
#> [16] xfun_0.3          yaml_2.2.0        compiler_3.5.1   
#> [19] htmltools_0.3.6   knitr_1.20.15     tibble_1.4.2

skyebend · 2018-10-01T06:43:26Z

is there a reason to not use the network packages standard accessors in vert_tibble? list.vertex.attributes() get.vertex.attribute() I find there are often a lot of edge cases with network objects, so it is nice to have them handled the same way ... unless it is too slow?

knapply · 2018-10-07T12:44:29Z

I’ll change that.

After putting it through more tests, any performance gained is negligible, even for large, heavily-decorated graphs.

CarterButts · 2018-10-09T04:09:03Z

I can't say I'm thrilled with the additional dependencies that have been added - we should not be contributing to the ever-growing problem of R toolchain cruft. The idea was for network to be a minimal package, with minimal dependencies. Adding extra stuff just because it is convenient is a departure from the path of wisdom.

krivit · 2018-10-10T23:56:40Z

I can't say I'm thrilled with the additional dependencies that have been added - we should not be contributing to the ever-growing problem of R toolchain cruft. The idea was for network to be a minimal package, with minimal dependencies. Adding extra stuff just because it is convenient is a departure from the path of wisdom.

network is a high-level library package. I don't think there is anything wrong with it depending on low-level library packages, especially for facilities that others can implement better than us.

krivit · 2018-10-17T02:54:33Z

@knapply , thank you for your hard work.

vert_tibble

I wonder if it might be worth folding it into as_tibble.network and giving it an argument what.

Speaking of that, in the egor package, we ended up using the argument name unit= to specify whether a function refers to vertices or edges (as in, "units of analysis"). I wonder if it might be worth using instead of what=.

martinamorris · 2018-10-17T03:43:25Z

Speaking of that, in the egor package, we ended up using the argument name unit= to specify whether a function refers to vertices or edges (as in, "units of analysis"). I wonder if it might be worth using instead of what=.

that sounds good to me.

knapply · 2018-10-17T21:54:02Z

The statnet team has produced invaluable tools for my work and the work of my colleagues. I'm very happy to help.

I wasn't familiar with egor, but I agree that being consistent by using unit is better.

I believe the following addresses everyone's concerns.

Here's an as.data.frame.network() method that removes the need for dependencies, handles both edges and vertices, and takes care of base R's behavior toward nested data.frames. If immediate tibbles are still desired, the adjustment is easy.

The comments are intentionally verbose to clarify the thought process and steps as much as possible.

suppressPackageStartupMessages(library(network))

as.data.frame.network <- function(x, unit = c("edges", "vertices")) {
  unit <- match.arg(unit, c("edges", "vertices"))
  if (unit == "edges") {
    # get valid, non-missing edge indices
    edge_ids <- which(!vapply(x[["mel"]], is.null, logical(1L))) 
    # use those to get valid, non-missing tails and heads
    tails <- lapply(x[["mel"]], `[[`, "outl")[edge_ids]
    heads <- lapply(x[["mel"]], `[[`, "inl")[edge_ids]
    # create "metadata" data.frame to use in returned object
    if (is.hyper(x)) {
      # keep hypergraph edges as list columns
      edge_df <- data.frame(.tail = tails,
                            .head = heads,
                            .eid = edge_ids)
    } else {
      # cast non-hypergraph edges to atomic vectors, retaining the integer casting from 
      # network:::as_tibble.network()
      edge_df <- data.frame(.tail = as.integer(unlist(tails)),
                            .head = as.integer(unlist(heads)),
                            .eid = edge_ids)
    }
    attr_names <- list.edge.attributes(x)
    # extract edge attributes, subsetting on valid indices
    out <- lapply(attr_names, function(e_attr) {
      get.edge.attribute(x, e_attr, unlist = FALSE)[edge_ids]
    })
  } else if (unit == "vertices") {
    # get valid, non-missing vertex indices (as a 1 column data.frame)
    vert_df <- data.frame(.vid = which(!vapply(x[["val"]], is.null, logical(1L))))
    # get vertex attr names
    attr_names <- list.vertex.attributes(x)
    # extract vertex attributes, subsetting on valid indices
    out <- lapply(attr_names, function(v_attr) {
      get.vertex.attribute(x, v_attr, unlist = FALSE)[vert_df[[".vid"]]]
    })
  }
  # attach attr names to resulting list
  names(out) <- attr_names
  # probe attributes to determine which will be list columns (i.e. they contain non-scalar elements)
  # in returned data.frame and create a logical vector to use in subsetting
  list_col_index <- vapply(out, function(attr_name) {
    any(vapply(attr_name, function(attr) length(attr) != 1L, logical(1L)))
    }, logical(1))
  # `I()`nhibit `base::data.frame`s default behavior of converting list columns
  out[list_col_index] <- lapply(out[list_col_index], I)
  # flatten attributes that only contain scalar elements to atomic vectors
  atomic_col_names <- names(list_col_index[!list_col_index])
  out[atomic_col_names] <- lapply(out[atomic_col_names], unlist)
  # convert the prepped list to a data.frame
  out <- as.data.frame(out, stringsAsFactors = FALSE)
  # `I()` works by prepending "AsIs" to the provided argument's class, which is no longer 
  # necessary and will only cause confusion, so it's dropped.
  out[list_col_index] <- lapply(out[list_col_index], `class<-`, "list")
  # combine edge/vertex metadata with attribute data
  if (unit == "edges") {
    out <- cbind.data.frame(edge_df, out, stringsAsFactors = FALSE)
  } else if (unit == "vertices") {
    out <- cbind.data.frame(vert_df, out, stringsAsFactors = FALSE)
  }
  # attach the attributes used in the current network:::as_tibble.network()
  attr(out, "n") <- network.size(x)
  attr(out, "vnames") <- network.vertex.names(x)
  if (is.bipartite(x)) {
    attr(out, "bipartite") <- get.network.attribute(x, "bipartite")
  }
  # rownames are unnecessary for these purposes and most people will convert this to a 
  # `tibble`, where they're not permitted, anyways
  rownames(out) <- NULL
  out
}

A "simple" network, that doesn't have nested attributes.

`%>%` <- magrittr::`%>%`

data("sampson", package = "ergm")
as.data.frame.network(samplike, unit = "vertices") %>% head(3)
#>   .vid cloisterville    group   na vertex.names
#> 1    1          TRUE    Turks TRUE   John Bosco
#> 2    2          TRUE    Turks TRUE      Gregory
#> 3    3          TRUE Outcasts TRUE        Basil

An object that does have nested attributes ( active attributes are matrices)...

data("windsurfers", package = "networkDynamic")
as.data.frame.network(windsurfers, unit = "vertices") %>% head(3)
#>   .vid                                                   active group1
#> 1    1                            0, 17, 26, 28, 16, 24, 27, 31      1
#> 2    2                0, 5, 9, 18, 23, 29, 2, 8, 11, 21, 24, 31      1
#> 3    3 0, 6, 8, 10, 18, 23, 25, 29, 4, 7, 9, 14, 21, 24, 28, 31      0
#>   group2    na regular vertex.names
#> 1      0 FALSE       1            1
#> 2      0 FALSE       1            2
#> 3      0 FALSE       1            3

... and the matrix dimensions are safely preserved.

as.data.frame.network(windsurfers)$active[[1]]
#>       [,1] [,2]
#>  [1,]    0    1
#>  [2,]    2    3
#>  [3,]    7    8
#>  [4,]    9   14
#>  [5,]   17   18
#>  [6,]   19   24
#>  [7,]   26   27
#>  [8,]   28   29
#>  [9,]   30   31

Returned objects are still compatible with modern data frames like tibbles or data.tables and only differ from objects returned by the current network:::as_tibble.network() in class. The attrnames parameter can be added as well.

identical(
  network:::as_tibble.network(windsurfers, attrnames = TRUE),
  tibble::as_tibble(as.data.frame.network(windsurfers))
)
#> [1] TRUE

tibble::as_tibble(as.data.frame.network(windsurfers))
#> # A tibble: 556 x 5
#>    .tail .head  .eid active        na   
#>    <int> <int> <int> <list>        <lgl>
#>  1     4     1     1 <dbl [9 x 2]> FALSE
#>  2     5     1     2 <dbl [2 x 2]> FALSE
#>  3     6     1     3 <dbl [5 x 2]> FALSE
#>  4     8     1     4 <dbl [1 x 2]> FALSE
#>  5     9     1     5 <dbl [1 x 2]> FALSE
#>  6    11     1     6 <dbl [2 x 2]> FALSE
#>  7     4     2     7 <dbl [6 x 2]> FALSE
#>  8     5     2     8 <dbl [3 x 2]> FALSE
#>  9     7     2     9 <dbl [1 x 2]> FALSE
#> 10     5     4    10 <dbl [3 x 2]> FALSE
#> # ... with 546 more rows

data.table::data.table(as.data.frame.network(windsurfers))
#>      .tail .head .eid                                  active    na
#>   1:     4     1    1  0, 2, 7, 9,17,19, 1, 3, 8,14,18,24,... FALSE
#>   2:     5     1    2                              0,10, 2,11 FALSE
#>   3:     6     1    3        0, 6,10,23,30, 1, 8,11,24,31,... FALSE
#>   4:     8     1    4                                     0,1 FALSE
#>   5:     9     1    5                                     0,1 FALSE
#>  ---                                                               
#> 552:    25    14  552                                   30,31 FALSE
#> 553:    25    68  553                                   30,31 FALSE
#> 554:    38    25  554                                   30,31 FALSE
#> 555:    16    15  555                                   30,31 FALSE
#> 556:    95    16  556                                   30,31 FALSE

If there is ultimately a decision to drop the tibble dependency, a native as_tibble.network() method can remain (preventing interruption to anything that has come to rely on it) by moving it to the Suggests field of the DESCRIPTION and writing the function like this:

as_tibble.network <- function(x, unit = "edges") {
  if (!requireNamespace("tibble", quietly = TRUE)) {
    stop("The tibble package is required for this functionality.", call. = FALSE)
  }
  tibble::as_tibble(as.data.frame.network(x, unit = unit))
}

as_tibble.network(samplike, unit = "vertices")
#> # A tibble: 18 x 5
#>     .vid cloisterville group    na    vertex.names
#>    <int> <lgl>         <chr>    <lgl> <chr>       
#>  1     1 TRUE          Turks    TRUE  John Bosco  
#>  2     2 TRUE          Turks    TRUE  Gregory     
#>  3     3 TRUE          Outcasts TRUE  Basil       
#>  4     4 TRUE          Loyal    TRUE  Peter       
#>  5     5 TRUE          Loyal    TRUE  Bonaventure 
#>  6     6 FALSE         Loyal    TRUE  Berthold    
#>  7     7 TRUE          Turks    TRUE  Mark        
#>  8     8 FALSE         Loyal    TRUE  Victor      
#>  9     9 FALSE         Loyal    TRUE  Ambrose     
#> 10    10 FALSE         Loyal    TRUE  Romauld     
#> 11    11 FALSE         Loyal    TRUE  Louis       
#> 12    12 FALSE         Turks    TRUE  Winfrid     
#> 13    13 FALSE         Outcasts TRUE  Amand       
#> 14    14 FALSE         Turks    TRUE  Hugh        
#> 15    15 FALSE         Turks    TRUE  Boniface    
#> 16    16 FALSE         Turks    TRUE  Albert      
#> 17    17 FALSE         Outcasts TRUE  Elias       
#> 18    18 FALSE         Outcasts TRUE  Simplicius

I'd also recommend that a parameter be given to list.vertex.attributes() and list.edge.attributes() to specify whether the returned names are sorted. That way, you can extract everything exactly as contained in the original network object. If you want to be able to build network objects directly from data.frames, preventing attribute sorting will allow you to better test features and would be more consistent with user expectations. By setting the default to TRUE, there are no changes to current behavior.

list.vertex.attributes <- function(x, .sort = TRUE) {
  if (!is.network(x))  {
    stop("list.vertex.attributes requires an argument of class network.\n")
  }
  if (network.size(x) == 0) {
        return(NULL)
  }
  allnam <- unlist(sapply(x$val, names))
  out <- unique(as.vector(allnam))
  if (!.sort) {
    return(out)
  }
  sort(out)
}

Apologies for the amount of text here. I can split this up into different issues if that's more palatable.

Cheers.

…butes instead of edge attributes. References #3.

krivit · 2019-01-05T10:15:28Z

Apologies for the force-updates. Note that samplike dataset appears to have its na vertex attribute set incorrectly, so don't panic if it doesn't work.

* add UTF-8 encoding (roxygen throws warning without it), update zzz.R (documentation doesn't build with old .onLoad() and network-package.R now has @useDynLib) git push * finish up documentation, verify all is still good * rebuild * fixed botched parentheses local test doesn't catch

krivit added enhancement help wanted labels May 23, 2018

krivit mentioned this issue Jan 4, 2019

Network package's dependencies #9

Closed

krivit added a commit that referenced this issue Jan 5, 2019

as_tibble.network() method can now returns a tibble with vertex attri…

486f9ee

…butes instead of edge attributes. References #3.

knapply mentioned this issue May 15, 2020

network_from_data_frame #20

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement subset() and transform() methods for network objects #3

Implement subset() and transform() methods for network objects #3

krivit commented May 23, 2018

skyebend commented May 23, 2018

krivit commented May 23, 2018

martinamorris commented May 23, 2018 via email

knapply commented Sep 28, 2018

krivit commented Sep 28, 2018

knapply commented Sep 29, 2018 •

edited

Loading

skyebend commented Oct 1, 2018

knapply commented Oct 7, 2018

CarterButts commented Oct 9, 2018

krivit commented Oct 10, 2018

krivit commented Oct 17, 2018

martinamorris commented Oct 17, 2018 via email

knapply commented Oct 17, 2018 •

edited

Loading

krivit commented Jan 5, 2019

Implement subset() and transform() methods for network objects #3

Implement subset() and transform() methods for network objects #3

Comments

krivit commented May 23, 2018

skyebend commented May 23, 2018

krivit commented May 23, 2018

martinamorris commented May 23, 2018 via email

knapply commented Sep 28, 2018

krivit commented Sep 28, 2018

knapply commented Sep 29, 2018 • edited Loading

skyebend commented Oct 1, 2018

knapply commented Oct 7, 2018

CarterButts commented Oct 9, 2018

krivit commented Oct 10, 2018

krivit commented Oct 17, 2018

martinamorris commented Oct 17, 2018 via email

knapply commented Oct 17, 2018 • edited Loading

krivit commented Jan 5, 2019

knapply commented Sep 29, 2018 •

edited

Loading

knapply commented Oct 17, 2018 •

edited

Loading