Nodal regression is a regression-based analysis often used in social
network analysis to determine factors that may be associated with
node-level network properties, for example if age or sex predicts
network centrality. In this notebook we will use edge weight posteriors
from a previously-run edge weight model to conduct a nodal regression of
eigenvector centrality against node type. In our toy example, node type
will be either lifeform
or droid
.
BISoN adopts a fully Bayesian philosophy, so not only does it use edge weight models to estimate uncertainty over edge weights, but it also propagates this uncertainty through downstream analyses, such as nodal regression in this case. There are several ways uncertainty over edge weights can be included in regression analyses, but in this example we will use a Metropolis-Hastings sampler to sample over both posterior centrality estimates and the parameter space to model the joint distribution of posterior centralities.
We’ll load edge weight posteriors in from a previous run model using the
readRDS()
function. These data were generated by the ewm_binary.Rmd
example in the Github repository.
library(rstan)
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## x purrr::compose() masks igraph::compose()
## x tidyr::crossing() masks igraph::crossing()
## x tidyr::extract() masks rstan::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::groups() masks igraph::groups()
## x dplyr::lag() masks stats::lag()
## x purrr::simplify() masks igraph::simplify()
source("../scripts/sampler.R")
data <- readRDS("../example_data/binary.RData")
df <- data$df
df_agg <- data$df_agg
logit_edge_samples <- data$logit_edge_samples
Before we can use centralities in the regression, we must calculate them over the posterior edge weights. We’ll use eigenvector centrality as the centrality measure in this example, but any centrality metric can be used. Centrality posteriors can be calculated using the following code:
# Build adjacency tensor (4000 x 8 x 8 array of edge weights)
edge_samples <- plogis(logit_edge_samples)
adj_tensor <- array(0, c(4000, 8, 8))
for (dyad_id in 1:nrow(df_agg)) {
dyad_row <- df_agg[df_agg$dyad_id == dyad_id, ]
adj_tensor[, dyad_row$node_1_id, dyad_row$node_2_id] <- edge_samples[, dyad_id]
}
# Calculate centrality and store posterior samples in a matrix
centrality_samples <- matrix(0, 4000, 8)
centrality_samples_std <- matrix(0, 4000, 8)
for (i in 1:4000) {
g <- graph_from_adjacency_matrix(adj_tensor[i, , ], mode="undirected", weighted=TRUE)
centrality_samples[i, ] <- eigen_centrality(g)$vector
centrality_samples_std[i, ] <- (centrality_samples[i, ] - mean(centrality_samples[i, ]))/sd(centrality_samples[i, ])
}
head(centrality_samples) # Unstandardised eigenvector centrality
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0.8903686 0.9637529 1 0.9389145 0.5049571 0.8243466 0.5659694 0.2652065
## [2,] 0.9101940 0.9197402 1 0.9726799 0.6248272 0.8278425 0.5137988 0.2169508
## [3,] 0.8996659 0.8913441 1 0.9592490 0.5069681 0.8224934 0.4965188 0.2225705
## [4,] 0.9071611 0.9162472 1 0.9993787 0.5737260 0.8586543 0.5556993 0.2945746
## [5,] 0.8485677 0.8883311 1 0.9986871 0.5167451 0.8085251 0.5396104 0.3396272
## [6,] 0.9157121 0.9328924 1 0.9667868 0.5444870 0.8443323 0.5682344 0.2477288
head(centrality_samples_std) # Standardised eigenvector centrality
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.5481719 0.8233630 0.9592897 0.7302191 -0.8971214 0.3005893 -0.6683253
## [2,] 0.5895289 0.6242812 0.9164605 0.8170040 -0.4493261 0.2897349 -0.8535160
## [3,] 0.6234538 0.5937753 0.9812823 0.8359493 -0.7770519 0.3482279 -0.8143178
## [4,] 0.5594282 0.5947317 0.9201475 0.9177334 -0.7361113 0.3709582 -0.8061527
## [5,] 0.4313061 0.5930150 1.0471473 1.0418081 -0.9181424 0.2684615 -0.8251544
## [6,] 0.6064892 0.6703390 0.9197412 0.7963060 -0.7731512 0.3412095 -0.6848949
## [,8]
## [1,] -1.796186
## [2,] -1.934167
## [3,] -1.791319
## [4,] -1.820735
## [5,] -1.638441
## [6,] -1.876039
Now that the edge weights are loaded, we need to prepare the data for fitting the regression model. This involves 1) preparing a response matrix of centralities describing the distribution of centralities for edge node, and 2) building a predictor matrix indicating the node type of each node (“lifeform” or “droid”).
predictor_matrix <- matrix(0, 8, 1)
colnames(predictor_matrix) <- c("node_type")
predictor_matrix[, 1] <- c(1, 1, 1, 1, 2, 2, 2, 2)
predictor_matrix
## node_type
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
## [5,] 2
## [6,] 2
## [7,] 2
## [8,] 2
Now everything else is in place, it’s time to define the model. As in Stan, the model will be defined by its log-likelihood function, but since we’re using a custom Metropolis sampler directly in R, the log-likelihood function will need to be written directly in R. Writing likelihood functions is beyond the scope of this tutorial, but a good resource on the topic can be found here: https://www.ime.unicamp.br/~cnaber/optim_1.pdf.
The dyadic regression model we’ll be using will predict the standardised centrality (Z-score) using a Gaussian family model where dyad type is the main effect, and multi-membership terms are included as random effects to account for non-independence between edges due to nodes. Priors are set relatively narrow to improve model fit for the purposes of this example, but in any real analysis they should be determined by domain knowledge and predictive checks.
loglik_nodalreg <- function(params, Y, X, index) {
### Define parameters ###
beta_nodetype <- params[1:2]
sigma <- exp(params[3]) # Exponential keeps underlying value unconstrained, which is much easier for the sampler.
### Sample data according to index ###
y <- Y[index %% dim(Y)[1] + 1, ]
### Define model ###
target <- 0
## Linear predictor ##
# y ~ normal(b_intercept + b_dyadtype + mm[i] + mm[j], sigma)
target <- target + sum(dnorm(y, mean=beta_nodetype[X[, 1]], sd=sigma, log=TRUE)) # Main model
# b_dyadtype ~ normal(0, 0.5)
target <- target + sum(dnorm(beta_nodetype, mean=0, sd=1, log=TRUE))
# sigma ~ exponential(1)
target <- target + dexp(sigma, rate=1, log=TRUE)
return(target)
}
# Create the `target` function that evaluates the log-likelihood on the dataset.
target <- function(params, index) loglik_nodalreg(params, centrality_samples_std, predictor_matrix, index)
We now have a function target(params, index)
that gives the
log-likelihood of a set of parameters params
given the data, for a
particular sample of the posterior edge weights, index
. Let’s make
sure it works on an initial set of parameters all set to zero, for an
arbitrary index of the data. Note that the standard deviation parameters
are transformed by an exponential in the log-likelihood function, so the
sampler will treat them on the log-scale, meaning log(sigma) = 0 is
equivalent to sigma = 1.
target(rep(0, 3), 1)
## [1] -13.68939
The function has evaluated to a real number, so everything appears to be
working okay so far. Now we can use the function metropolis
from
sampler.R
to fit the model using the provided target function, an
initial set of parameters (again, all zeros), and some additional MCMC
options. Once the sampler has run, we will print out the top few rows of
the chains.
chain <- metropolis(target, rep(0, 3), iterations=200000, thin=100, refresh=10000)
## Chain: 1 | Iteration: 10000/202000 (Sampling)
## Chain: 1 | Iteration: 20000/202000 (Sampling)
## Chain: 1 | Iteration: 30000/202000 (Sampling)
## Chain: 1 | Iteration: 40000/202000 (Sampling)
## Chain: 1 | Iteration: 50000/202000 (Sampling)
## Chain: 1 | Iteration: 60000/202000 (Sampling)
## Chain: 1 | Iteration: 70000/202000 (Sampling)
## Chain: 1 | Iteration: 80000/202000 (Sampling)
## Chain: 1 | Iteration: 90000/202000 (Sampling)
## Chain: 1 | Iteration: 100000/202000 (Sampling)
## Chain: 1 | Iteration: 110000/202000 (Sampling)
## Chain: 1 | Iteration: 120000/202000 (Sampling)
## Chain: 1 | Iteration: 130000/202000 (Sampling)
## Chain: 1 | Iteration: 140000/202000 (Sampling)
## Chain: 1 | Iteration: 150000/202000 (Sampling)
## Chain: 1 | Iteration: 160000/202000 (Sampling)
## Chain: 1 | Iteration: 170000/202000 (Sampling)
## Chain: 1 | Iteration: 180000/202000 (Sampling)
## Chain: 1 | Iteration: 190000/202000 (Sampling)
## Chain: 1 | Iteration: 200000/202000 (Sampling)
## Acceptance Rate: 0.22469801980198
chain[, 3] <- exp(chain[, 3])
colnames(chain) <- c("b_lifeform", "b_droid", "sigma")
head(chain)
## b_lifeform b_droid sigma
## [1,] 0.6157228 -1.3152008 0.6553135
## [2,] 0.5997483 -0.5883987 0.5906998
## [3,] 1.0177988 -0.8086998 0.5121928
## [4,] 0.9692682 -0.9995611 0.4210394
## [5,] 0.7069929 -0.3893741 0.7478319
## [6,] 1.1924890 -0.8182122 0.6476125
The acceptance rate is around 0.23, which is the target acceptance rate for this sampler. Deviances from 0.23 could indicate sampling issues, but the converse is not true, so acceptance rate isn’t an ideal diagnostic tool. Instead it’s best to inspect the traceplots, which can be done using the following code:
par(mfrow=c(1, 3), mar=c(1,1,1,1))
for (i in 1:3) {
plot(chain[, i], type="l")
}
Note: In our runs of the code, these traces usually look good. But the stochastic nature of MCMC and the experimental sampler mean that sometimes the chains may not behave well. If the chains in this notebook have not converged, this is likely an artefact of this stochasticity.
At this point it’s a good idea to do some diagnostic checks, such as predictive checks or residual plots. These are covered separately in the Github repository, so in-depth diagnostic checks aren’t shown here, but should always be carried out.
We will run a brief diagnostic check by comparing the density of expected edge weights (draws of which are shown in black) against the density of predicted edge weights from the regression model (draws of which are shown in blue).
plot(density(centrality_samples_std[1, ]), col=rgb(0, 0, 0, 0.25), ylim=c(0, 0.8), main="Posterior predictive density", xlab="Logit edge weight")
for (i in 1:100) {
j <- sample(1:2000, 1)
lines(density(centrality_samples_std[j, ]), col=rgb(0, 0, 0, 0.25))
lines(density(
rnorm(8, mean=chain[j, 1:2][predictor_matrix[, 1]], sd=chain[j, 3])
),
col=rgb(0, 0, 1, 0.25))
}
Almost all of the expected (black) densities fall within the range expected from the predicted (blue) densities. The distribution of edge weights appears to have been captured well by the regression model. There are many other types of diagnostic check that could be carried out, but we won’t go into detail here. See the github repository page for more information.
Assuming we’ve now carried out any diagnostics we think are appropriate, it’s finally time to answer the scientific questions that led us to conduct the analysis. We can start off by calculating the 95% credible intervals of the model parameters. This can be done using the following code:
summary_matrix <- t(apply(chain, 2, function(x) quantile(x, probs=c(0.5, 0.025, 0.975))))
summary_matrix <- round(summary_matrix, 2)
summary_matrix
## 50% 2.5% 97.5%
## b_lifeform 0.71 0.08 1.35
## b_droid -0.72 -1.32 -0.10
## sigma 0.60 0.37 1.08
At this point it becomes clear that the regression we’ve conducted is
not exactly the same as what might be expected from standard frequentist
regressions, where categories are interpreted relative to a reference
category. Instead, a parameter is estimated for each category, and we
can use contrasts to calculate the magnitude of differences between
categories of interest. Contrasts are easily calculated as the statistic
of interest from the posteriors of the model. Namely, if we’re interest
in whether lifeform-lifeform edges are stronger than lifeform-droid
edges, we simply compute the difference in posteriors between b_ll
and
b_ld
. This can be done using the following code:
b_diff <- chain[, "b_lifeform"] - chain[, "b_droid"]
plot(density(b_diff), main="Posterior difference between dyad types")
abline(v=0, lty=2)
b_diff_summary <- round(quantile(b_diff, probs=c(0.5, 0.025, 0.975)), 2)
b_diff_summary
## 50% 2.5% 97.5%
## 1.44 0.51 2.30
This gives us an estimate that lifeforms have an eigenvector centrality of around ~1.4 standard deviations higher than droids, with a 95% credible interval of approximately (0.5, 2.3).
This notebook has given a brief overview of how to fit and interpret a dyadic regression model using the Metropolis-Hastings sampler in BISoN. We also have a guide on how to conduct the same type of analysis in Stan (examples/dyadic_regression.Rmd) using a Gaussian approximation of edge weights, which should yield similar results and may be less prone to model fitting issues, at the cost of being less flexible for some types of model.