Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement subset() and transform() methods for network objects #3

Open
krivit opened this issue May 23, 2018 · 14 comments
Open

Implement subset() and transform() methods for network objects #3

krivit opened this issue May 23, 2018 · 14 comments

Comments

@krivit
Copy link
Member

krivit commented May 23, 2018

  • subset() would drop vertices or edges based on a specification.
  • transform() would rewrite or construct vertex or edge attributes based on a specification.
@skyebend
Copy link
Contributor

would subset() be different from network::get.inducedSubraph() ? or are you thinking of it as a clearer alias?

@krivit
Copy link
Member Author

krivit commented May 23, 2018

A clearer alias that lets the user specify the subset with vertex attributes as variable names.

@martinamorris
Copy link
Member

martinamorris commented May 23, 2018 via email

@knapply
Copy link
Contributor

knapply commented Sep 28, 2018

I may have implemented equivalents of these functions elsewhere.

  • Is the intention to use non-standard evaluation?
  • If so, will the next CRAN release of {network} keep the GitHub version's {tibble} import (thus making {rlang} available)?

@krivit
Copy link
Member Author

krivit commented Sep 28, 2018

That's the plan.

@knapply
Copy link
Contributor

knapply commented Sep 29, 2018

Cool. Here's an implementation for subset.network():

library(rlang) # tibble dependency
suppressPackageStartupMessages(library(network))

This extracts vertex attributes (adapted from my own package, but may be handy if it were native to {network}):

vert_tibble <- function(x) {
  .vert_attr_names <- unique(unlist(lapply(x[["val"]], names)))
  .vert_attrs <- set_names(lapply(.vert_attr_names, function(attr_name) {
      unlist(lapply(lapply(x[["val"]], `[[`, attr_name), `%||%`, NA))
    }), .vert_attr_names)
  tibble::as_tibble(.vert_attrs)
}

The actual subset function:

subset.network <- function(x, .predicate, what = c("vertices", "edges")) {
  what <- arg_match(what, c("vertices", "edges"))
  if (what == "edges") {
    .e_df <- network:::as_tibble.network(x, attrnames = TRUE)
    .eids <- eval_tidy(quo(.e_df[!!enquo(.predicate), , drop = FALSE]), 
                       data = .e_df)[[".eid"]]
    return(get.inducedSubgraph(x, eid = .eids))
  }
  .v_df <- vert_tibble(x)
  .v_df[[".vid"]] <- seq_len(nrow(.v_df))
  
  .vids <- eval_tidy(quo(.v_df[!!enquo(.predicate), , drop = FALSE]), 
                     data = .v_df)[[".vid"]]
  get.inducedSubgraph(x, v = .vids)
}

Some quick helpers for visual inspection:

edge_tibble <- function(x) {
  network:::as_tibble.network(x, attrnames = TRUE)
}
`%>%` <- magrittr::`%>%`

Example data:

data("sampson", package = "ergm")

subset.network() in action:

  • subset by multiple vertex predicates
subset(samplike, group == "Loyal" & !cloisterville)
#>  Network attributes:
#>   vertices = 5 
#>   directed = TRUE 
#>   hyper = FALSE 
#>   loops = FALSE 
#>   multiple = FALSE 
#>   total edges= 8 
#>     missing edges= 0 
#>     non-missing edges= 8 
#> 
#>  Vertex attribute names: 
#>     cloisterville group vertex.names 
#> 
#>  Edge attribute names: 
#>     nominations

samplike %>% 
  subset(group == "Loyal" & !cloisterville) %>% 
  vert_tibble()
#> # A tibble: 5 x 4
#>   na    group vertex.names cloisterville
#>   <lgl> <chr> <chr>        <lgl>        
#> 1 TRUE  Loyal Berthold     FALSE        
#> 2 TRUE  Loyal Victor       FALSE        
#> 3 TRUE  Loyal Ambrose      FALSE        
#> 4 TRUE  Loyal Romauld      FALSE        
#> 5 TRUE  Loyal Louis        FALSE
  • subset by edge predicates
samplike %>% 
  subset(nominations == 1, what = "edges")
#>  Network attributes:
#>   vertices = 18 
#>   directed = TRUE 
#>   hyper = FALSE 
#>   loops = FALSE 
#>   multiple = FALSE 
#>   total edges= 38 
#>     missing edges= 0 
#>     non-missing edges= 38 
#> 
#>  Vertex attribute names: 
#>     cloisterville group vertex.names 
#> 
#>  Edge attribute names: 
#>     nominations

samplike %>% 
  subset(nominations == 1, what = "edges") %>% 
  edge_tibble()
#> # A tibble: 38 x 5
#>    .tail .head  .eid na    nominations
#>    <int> <int> <int> <lgl>       <dbl>
#>  1     5     1     1 FALSE           1
#>  2     7     1     2 FALSE           1
#>  3     1     2     3 FALSE           1
#>  4     3     2     4 FALSE           1
#>  5     6     5    20 FALSE           1
#>  6    10     8    32 FALSE           1
#>  7    10     9    41 FALSE           1
#>  8     8    10    49 FALSE           1
#>  9    18    13    72 FALSE           1
#> 10     2    15    77 FALSE           1
#> # ... with 28 more rows
  • chain subsets together
samplike %>% 
  subset(group == "Outcasts") %>% 
  subset(nominations == 3, what = "edges")
#>  Network attributes:
#>   vertices = 4 
#>   directed = TRUE 
#>   hyper = FALSE 
#>   loops = FALSE 
#>   multiple = FALSE 
#>   total edges= 4 
#>     missing edges= 0 
#>     non-missing edges= 4 
#> 
#>  Vertex attribute names: 
#>     cloisterville group vertex.names 
#> 
#>  Edge attribute names: 
#>     nominations

Thoughts?

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] network_1.14-355 rlang_0.2.2.9001
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.18      fansi_0.3.0       assertthat_0.2.0 
#>  [4] utf8_1.1.4        crayon_1.3.4      digest_0.6.17    
#>  [7] magrittr_1.5      evaluate_0.11     pillar_1.3.0.9000
#> [10] cli_1.0.0         stringi_1.2.4     rmarkdown_1.10.13
#> [13] htmldeps_0.1.1    tools_3.5.1       stringr_1.3.1    
#> [16] xfun_0.3          yaml_2.2.0        compiler_3.5.1   
#> [19] htmltools_0.3.6   knitr_1.20.15     tibble_1.4.2

@skyebend
Copy link
Contributor

skyebend commented Oct 1, 2018

is there a reason to not use the network packages standard accessors in vert_tibble? list.vertex.attributes() get.vertex.attribute() I find there are often a lot of edge cases with network objects, so it is nice to have them handled the same way ... unless it is too slow?

@knapply
Copy link
Contributor

knapply commented Oct 7, 2018

I’ll change that.

After putting it through more tests, any performance gained is negligible, even for large, heavily-decorated graphs.

@CarterButts
Copy link
Contributor

I can't say I'm thrilled with the additional dependencies that have been added - we should not be contributing to the ever-growing problem of R toolchain cruft. The idea was for network to be a minimal package, with minimal dependencies. Adding extra stuff just because it is convenient is a departure from the path of wisdom.

@krivit
Copy link
Member Author

krivit commented Oct 10, 2018

I can't say I'm thrilled with the additional dependencies that have been added - we should not be contributing to the ever-growing problem of R toolchain cruft. The idea was for network to be a minimal package, with minimal dependencies. Adding extra stuff just because it is convenient is a departure from the path of wisdom.

network is a high-level library package. I don't think there is anything wrong with it depending on low-level library packages, especially for facilities that others can implement better than us.

@krivit
Copy link
Member Author

krivit commented Oct 17, 2018

@knapply , thank you for your hard work.

vert_tibble

I wonder if it might be worth folding it into as_tibble.network and giving it an argument what.

Speaking of that, in the egor package, we ended up using the argument name unit= to specify whether a function refers to vertices or edges (as in, "units of analysis"). I wonder if it might be worth using instead of what=.

@martinamorris
Copy link
Member

martinamorris commented Oct 17, 2018 via email

@knapply
Copy link
Contributor

knapply commented Oct 17, 2018

The statnet team has produced invaluable tools for my work and the work of my colleagues. I'm very happy to help.

I wasn't familiar with egor, but I agree that being consistent by using unit is better.

I believe the following addresses everyone's concerns.

Here's an as.data.frame.network() method that removes the need for dependencies, handles both edges and vertices, and takes care of base R's behavior toward nested data.frames. If immediate tibbles are still desired, the adjustment is easy.

The comments are intentionally verbose to clarify the thought process and steps as much as possible.

suppressPackageStartupMessages(library(network))

as.data.frame.network <- function(x, unit = c("edges", "vertices")) {
  unit <- match.arg(unit, c("edges", "vertices"))
  if (unit == "edges") {
    # get valid, non-missing edge indices
    edge_ids <- which(!vapply(x[["mel"]], is.null, logical(1L))) 
    # use those to get valid, non-missing tails and heads
    tails <- lapply(x[["mel"]], `[[`, "outl")[edge_ids]
    heads <- lapply(x[["mel"]], `[[`, "inl")[edge_ids]
    # create "metadata" data.frame to use in returned object
    if (is.hyper(x)) {
      # keep hypergraph edges as list columns
      edge_df <- data.frame(.tail = tails,
                            .head = heads,
                            .eid = edge_ids)
    } else {
      # cast non-hypergraph edges to atomic vectors, retaining the integer casting from 
      # network:::as_tibble.network()
      edge_df <- data.frame(.tail = as.integer(unlist(tails)),
                            .head = as.integer(unlist(heads)),
                            .eid = edge_ids)
    }
    attr_names <- list.edge.attributes(x)
    # extract edge attributes, subsetting on valid indices
    out <- lapply(attr_names, function(e_attr) {
      get.edge.attribute(x, e_attr, unlist = FALSE)[edge_ids]
    })
  } else if (unit == "vertices") {
    # get valid, non-missing vertex indices (as a 1 column data.frame)
    vert_df <- data.frame(.vid = which(!vapply(x[["val"]], is.null, logical(1L))))
    # get vertex attr names
    attr_names <- list.vertex.attributes(x)
    # extract vertex attributes, subsetting on valid indices
    out <- lapply(attr_names, function(v_attr) {
      get.vertex.attribute(x, v_attr, unlist = FALSE)[vert_df[[".vid"]]]
    })
  }
  # attach attr names to resulting list
  names(out) <- attr_names
  # probe attributes to determine which will be list columns (i.e. they contain non-scalar elements)
  # in returned data.frame and create a logical vector to use in subsetting
  list_col_index <- vapply(out, function(attr_name) {
    any(vapply(attr_name, function(attr) length(attr) != 1L, logical(1L)))
    }, logical(1))
  # `I()`nhibit `base::data.frame`s default behavior of converting list columns
  out[list_col_index] <- lapply(out[list_col_index], I)
  # flatten attributes that only contain scalar elements to atomic vectors
  atomic_col_names <- names(list_col_index[!list_col_index])
  out[atomic_col_names] <- lapply(out[atomic_col_names], unlist)
  # convert the prepped list to a data.frame
  out <- as.data.frame(out, stringsAsFactors = FALSE)
  # `I()` works by prepending "AsIs" to the provided argument's class, which is no longer 
  # necessary and will only cause confusion, so it's dropped.
  out[list_col_index] <- lapply(out[list_col_index], `class<-`, "list")
  # combine edge/vertex metadata with attribute data
  if (unit == "edges") {
    out <- cbind.data.frame(edge_df, out, stringsAsFactors = FALSE)
  } else if (unit == "vertices") {
    out <- cbind.data.frame(vert_df, out, stringsAsFactors = FALSE)
  }
  # attach the attributes used in the current network:::as_tibble.network()
  attr(out, "n") <- network.size(x)
  attr(out, "vnames") <- network.vertex.names(x)
  if (is.bipartite(x)) {
    attr(out, "bipartite") <- get.network.attribute(x, "bipartite")
  }
  # rownames are unnecessary for these purposes and most people will convert this to a 
  # `tibble`, where they're not permitted, anyways
  rownames(out) <- NULL
  out
}

A "simple" network, that doesn't have nested attributes.

`%>%` <- magrittr::`%>%`

data("sampson", package = "ergm")
as.data.frame.network(samplike, unit = "vertices") %>% head(3)
#>   .vid cloisterville    group   na vertex.names
#> 1    1          TRUE    Turks TRUE   John Bosco
#> 2    2          TRUE    Turks TRUE      Gregory
#> 3    3          TRUE Outcasts TRUE        Basil

An object that does have nested attributes ( active attributes are matrices)...

data("windsurfers", package = "networkDynamic")
as.data.frame.network(windsurfers, unit = "vertices") %>% head(3)
#>   .vid                                                   active group1
#> 1    1                            0, 17, 26, 28, 16, 24, 27, 31      1
#> 2    2                0, 5, 9, 18, 23, 29, 2, 8, 11, 21, 24, 31      1
#> 3    3 0, 6, 8, 10, 18, 23, 25, 29, 4, 7, 9, 14, 21, 24, 28, 31      0
#>   group2    na regular vertex.names
#> 1      0 FALSE       1            1
#> 2      0 FALSE       1            2
#> 3      0 FALSE       1            3

... and the matrix dimensions are safely preserved.

as.data.frame.network(windsurfers)$active[[1]]
#>       [,1] [,2]
#>  [1,]    0    1
#>  [2,]    2    3
#>  [3,]    7    8
#>  [4,]    9   14
#>  [5,]   17   18
#>  [6,]   19   24
#>  [7,]   26   27
#>  [8,]   28   29
#>  [9,]   30   31

Returned objects are still compatible with modern data frames like tibbles or data.tables and only differ from objects returned by the current network:::as_tibble.network() in class. The attrnames parameter can be added as well.

identical(
  network:::as_tibble.network(windsurfers, attrnames = TRUE),
  tibble::as_tibble(as.data.frame.network(windsurfers))
)
#> [1] TRUE

tibble::as_tibble(as.data.frame.network(windsurfers))
#> # A tibble: 556 x 5
#>    .tail .head  .eid active        na   
#>    <int> <int> <int> <list>        <lgl>
#>  1     4     1     1 <dbl [9 x 2]> FALSE
#>  2     5     1     2 <dbl [2 x 2]> FALSE
#>  3     6     1     3 <dbl [5 x 2]> FALSE
#>  4     8     1     4 <dbl [1 x 2]> FALSE
#>  5     9     1     5 <dbl [1 x 2]> FALSE
#>  6    11     1     6 <dbl [2 x 2]> FALSE
#>  7     4     2     7 <dbl [6 x 2]> FALSE
#>  8     5     2     8 <dbl [3 x 2]> FALSE
#>  9     7     2     9 <dbl [1 x 2]> FALSE
#> 10     5     4    10 <dbl [3 x 2]> FALSE
#> # ... with 546 more rows

data.table::data.table(as.data.frame.network(windsurfers))
#>      .tail .head .eid                                  active    na
#>   1:     4     1    1  0, 2, 7, 9,17,19, 1, 3, 8,14,18,24,... FALSE
#>   2:     5     1    2                              0,10, 2,11 FALSE
#>   3:     6     1    3        0, 6,10,23,30, 1, 8,11,24,31,... FALSE
#>   4:     8     1    4                                     0,1 FALSE
#>   5:     9     1    5                                     0,1 FALSE
#>  ---                                                               
#> 552:    25    14  552                                   30,31 FALSE
#> 553:    25    68  553                                   30,31 FALSE
#> 554:    38    25  554                                   30,31 FALSE
#> 555:    16    15  555                                   30,31 FALSE
#> 556:    95    16  556                                   30,31 FALSE

If there is ultimately a decision to drop the tibble dependency, a native as_tibble.network() method can remain (preventing interruption to anything that has come to rely on it) by moving it to the Suggests field of the DESCRIPTION and writing the function like this:

as_tibble.network <- function(x, unit = "edges") {
  if (!requireNamespace("tibble", quietly = TRUE)) {
    stop("The tibble package is required for this functionality.", call. = FALSE)
  }
  tibble::as_tibble(as.data.frame.network(x, unit = unit))
}

as_tibble.network(samplike, unit = "vertices")
#> # A tibble: 18 x 5
#>     .vid cloisterville group    na    vertex.names
#>    <int> <lgl>         <chr>    <lgl> <chr>       
#>  1     1 TRUE          Turks    TRUE  John Bosco  
#>  2     2 TRUE          Turks    TRUE  Gregory     
#>  3     3 TRUE          Outcasts TRUE  Basil       
#>  4     4 TRUE          Loyal    TRUE  Peter       
#>  5     5 TRUE          Loyal    TRUE  Bonaventure 
#>  6     6 FALSE         Loyal    TRUE  Berthold    
#>  7     7 TRUE          Turks    TRUE  Mark        
#>  8     8 FALSE         Loyal    TRUE  Victor      
#>  9     9 FALSE         Loyal    TRUE  Ambrose     
#> 10    10 FALSE         Loyal    TRUE  Romauld     
#> 11    11 FALSE         Loyal    TRUE  Louis       
#> 12    12 FALSE         Turks    TRUE  Winfrid     
#> 13    13 FALSE         Outcasts TRUE  Amand       
#> 14    14 FALSE         Turks    TRUE  Hugh        
#> 15    15 FALSE         Turks    TRUE  Boniface    
#> 16    16 FALSE         Turks    TRUE  Albert      
#> 17    17 FALSE         Outcasts TRUE  Elias       
#> 18    18 FALSE         Outcasts TRUE  Simplicius

I'd also recommend that a parameter be given to list.vertex.attributes() and list.edge.attributes() to specify whether the returned names are sorted. That way, you can extract everything exactly as contained in the original network object. If you want to be able to build network objects directly from data.frames, preventing attribute sorting will allow you to better test features and would be more consistent with user expectations. By setting the default to TRUE, there are no changes to current behavior.

list.vertex.attributes <- function(x, .sort = TRUE) {
  if (!is.network(x))  {
    stop("list.vertex.attributes requires an argument of class network.\n")
  }
  if (network.size(x) == 0) {
        return(NULL)
  }
  allnam <- unlist(sapply(x$val, names))
  out <- unique(as.vector(allnam))
  if (!.sort) {
    return(out)
  }
  sort(out)
}

Apologies for the amount of text here. I can split this up into different issues if that's more palatable.

Cheers.

krivit added a commit that referenced this issue Jan 5, 2019
…butes instead of edge attributes. References #3.
@krivit
Copy link
Member Author

krivit commented Jan 5, 2019

Apologies for the force-updates. Note that samplike dataset appears to have its na vertex attribute set incorrectly, so don't panic if it doesn't work.

krivit pushed a commit that referenced this issue Jul 11, 2020
* add UTF-8 encoding (roxygen throws warning without it), update zzz.R (documentation doesn't build with old .onLoad() and network-package.R now has @useDynLib)
git push

* finish up documentation, verify all is still good

* rebuild

* fixed botched parentheses local test doesn't catch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants