-
Notifications
You must be signed in to change notification settings - Fork 0
/
03-behavior-and-coevolution.qmd
370 lines (264 loc) · 15.8 KB
/
03-behavior-and-coevolution.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
---
number-sections: true
title: Behavior and coevolution
nocite: |
@lusherExponentialRandomGraph2013
---
<!-- \newcommand{\bmi}[1]{\bm{{#1}}} -->
\newcommand{\tpose}[1]{#1^{\mathbf{t}}}
\newcommand{\Prergm}[0]{P_{\mathcal{Y}, \bm{{\theta}}}}
\newcommand{\Y}[0]{\bm{{Y}}}
\newcommand{\y}[0]{\bm{{y}}}
\newcommand{\X}[0]{\bm{{X}}}
\newcommand{\x}[0]{\bm{{x}}}
\newcommand{\thetas}[0]{\bm{{\theta}}}
\newcommand{\isone}[1]{\mathbb{1}\left(#1\right)}
## Introduction
This section focuses on inference involving network and a secondary outcome. While there are many ways of studying the coevolution or dependence between network and behavior, this section focuses on two classes of analysis: When the network is fixed and when both network and behavior influence each other.
Whether we treat the network as given or endogenous sets the complexity of conducting statistical inference. Data analysis becomes much more straightforward if our research focuses on individual-level outcomes embedded in a network and not on the network itself. Here, we will deal with three particular cases: (1) when network effects are lagged, (2) egocentric networks, and (3) when network effects are contemporaneous.
## Lagged exposure
If we assume that network influence in the form of exposure is lagged, we have one of the most straightforward cases for network inference [@hayeSmokingDiffusionNetworks2019; @valenteDiffusionContagionProcesses2020; @valenteNetworkInfluencesPolicy2019]. Here, instead of dealing with convoluted statistical models, the problem reduces to estimating a simple linear regression model. Generally, lagged exposure effects look like this:
$$
y_{it} = \rho \left(\sum_{j\neq i}X_{ij}\right)^{-1}\left(\sum_{j\neq i}y_{jt-1} X_{ij}\right) + {\bm{{\theta}}}^{\mathbf{t}}\bm{{w_i}} + \varepsilon,\quad \varepsilon \sim \text{N}(0, \sigma^2)
$$
where $y_{it}$ is the outcome of individual $i$ at time $t$, $X_{ij}$ is the $ij$-th entry of the adjacency matrix, $\bm{{\theta}}$ is a vector of coefficients, $\bm{{w_i}}$ is a vector of features/covariates of individual $i$, and $\varepsilon_i$ is a normally distributed error. Here, the key component is $\rho$: the coefficient associated with the network exposure effect.
The exposure statistic, $\left(\sum_{j\neq i}X_{ij}\right)^{-1}\left(\sum_{j\neq i}y_{jt-1} X_{ij}\right)$, is the weighted average of $i$'s neighbors' outcomes at time $t-1$.
## Code example: Lagged exposure
The following code example shows how to estimate a lagged exposure effect using the `glm` function in R. The model we will simulate and estimate features a Bernoulli graph with 1,000 nodes and a density of 0.01.
$$
y_{it} = \theta_1 + \rho \text{Exposure}_{it} + \theta_2 w_i + \varepsilon
$$
where $\text{Exposure}_{it}$ is the exposure statistic defined above, and $w_i$ is a vector of covariates.
```{r}
#| label: lagged-exposure
## Simulating data
n <- 1000
time <- 2
theta <- c(-1, 3)
## Sampling a bernoilli network
set.seed(3132)
p <- 0.01
X <- matrix(rbinom(n^2, 1, p), nrow = n)
diag(X) <- 0
## Covariate
W <- matrix(rnorm(n), nrow = n)
## Simulating the outcome
rho <- 0.5
Y0 <- cbind(rnorm(n))
## The lagged exposure
expo <- (X %*% Y0)/rowSums(X)
Y1 <- theta[1] + rho * expo + W * theta[2] + rnorm(n)
```
Now we fit the model using GLM, in this case, linear Regression
```{r}
#| label: lagged-exposure-fit
fit <- glm(Y1 ~ expo + W, family = "gaussian")
summary(fit)
```
## Egocentric networks
Generally, when we use egocentric networks and egos' outcomes, we are thinking in a model where one observation is the pair $(y_i, X_i)$, this is, the outcome of individual $i$ and the egocentric network of individual $i$. When such is the case, since (a) networks are independent across egos and (b) the networks are fixed, like the previous case, a simple linear regression model is enough to conduct the analyses. A typical model looks like this:
$$
\bm{{y}} = \tpose{\bm{{\theta}}_{x}}s(\bm{{X}}) + \tpose{\thetas}\bm{{w}} + \varepsilon,\quad \varepsilon \sim \text{N}(0, \sigma^2)
$$
Where $\bm{{y}}$ is a vector of outcomes, $\bm{{X}}$ is a matrix of egocentric networks, $\bm{{w}}$ is a vector of covariates, $\bm{{\theta}}$ is a vector of coefficients, and $\varepsilon$ is a vector of errors. The key component here is $s(\bm{{X}})$, which is a vector of sufficient statistics of the egocentric networks. For example, if we are interested in the number of ties, $s(\bm{{X}})$ is a vector of the number of ties of each ego.
## Code example: Egocentric networks
For this example, we will simulate a stream of 1,000 Bernoulli graphs looking into the probability of school dropout. Each network will have between 4 and 10 nodes and have a density of 0.4. The data-generating process is as follows:
$$
{\Pr{}}_{\bm{{\theta}}}\left(Y_i=1\right) = \text{logit}^{-1}\left(\thetas_x s(\bm{{X}}_i) \right)
$$
Where $s(X) \equiv \left(\text{density}, \text{n mutual ties}\right)$, and $\bm{{\theta}}_x = (0.5, -1)$. This model only features sufficient statistics. We start by simulating the networks
```{r}
#| label: egocentric-networks
set.seed(331)
n <- 1000
sizes <- sample(4:10, n, replace = TRUE)
## Simulating the networks
X <- lapply(sizes, function(x) matrix(rbinom(x^2, 1, 0.4), nrow = x))
X <- lapply(X, \(x) {diag(x) <- 0; x})
## Inspecting the first 5
head(X, 5)
```
Using the `ergm` R package [@Handcock2023; @Hunter2008], we can extract the associated sufficient statistics of the egocentric networks:
```{r}
#| label: egocentric-networks-stats
#| message: false
#| warning: false
#| cache: true
library(ergm)
stats <- lapply(X, \(x) summary_formula(x ~ density + mutual))
## Turning the list into a matrix
stats <- do.call(rbind, stats)
## Inspecting the first 5
head(stats, 5)
```
We now simulate the outcomes
```{r}
#| label: egocentric-networks-outcomes
y <- rbinom(n, 1, plogis(stats %*% c(0.5, -1)))
glm(y ~ stats, family = binomial(link = "logit")) |>
summary()
```
## Network effects are endogenous
Here we have two different approaches: Spatial Autocorrelation [SAR], and
Autologistic actor attribute model [ALAAM] [@robinsNetworkModelsSocial2001b]. The first one is a generalization of the linear regression model that accounts for spatial dependence. The second is a close relative of ERGMs that treats covariates as endogenous and network as exogenous. Overall, ALAAMs are more flexible than SARs, but SARs are easier to estimate.
**SAR** Formally, SAR models [see @lesageIntroductionSpatialEconometrics2008] can be used to estimate network exposure effects. The general form is:
$$
\bm{{y}} = \rho \bm{{W}} \bm{{y}} + \bm{{\theta}}^{\mathbf{t}} \bm{{X}} + \epsilon,\quad \epsilon \sim \text{MVN}(0, \Sigma)
$$
where $\bm{{y}}\equiv \{y_i\}$ is a vector of outcomes the outcome, $\rho$ is an autocorrelation coefficient, $\bm{{W}} \in \{w_{ij}\}$ is a row-stochastic square matrix of size $n$, $\bm{{\theta}}$ is a vector of model parameters, $\bm{{X}}$ is the corresponding matrix with exogenous variables, and $\epsilon$ is a vector of errors that distributes multivariate normal with mean 0 and covariance make$\Sigma$.[^notation] The SAR model is a generalization of the linear regression model that accounts for spatial dependence. The SAR model can be estimated using the `spatialreg` package in R [@Bivand2022].
::: {.callout-tip}
What is the appropriate network to use in the SAR model? According to @lesageBiggestMythSpatial2014, it is not very important. Since $(I_n - \rho \mathbf{W})^{-1} = \rho \mathbf{W} + \rho^2 \mathbf{W}^2 + \dots$.
:::
Although the SAR model was developed for spatial data, it is easy to apply it to network data. Furthermore, each entry of the vector $\bm{{Wy}}$ has the same definition as network exposure, namely
$$
\bm{{Wy}} \equiv \left\{\sum_{j}y_j w_{ij}\right\}_i
$$
Since $\bm{{W}}$ is row-stochastic, $\bm{{Wy}}$ is a weighted average of the outcome of the neighbors of $i$, *i.e.*, a vector of network exposures.
**ALAAM** The simplest way we can think of this class of models is as if a given covariate switched places with the network in an ERGM, so the network is now fixed and the covariate is the random variable. While ALAAMs can also estimate network exposure effects, we can use them to build more complex models beyond exposure. The general form is:
$$
\Pr\left(\bm{{Y}} = \bm{{y}}|\bm{{W}},\bm{{X}}\right) = \exp{\left(\bm{{\theta}}^{\mathbf{t}}s(\bm{{y}},\bm{{W}}, \bm{{X}})\right)}\times\eta(\bm{{\theta}})^{-1}
$$
$$
\eta(\bm{{\theta}}) = \sum_{\bm{{y}}}\exp{\left(\bm{{\theta}}^{\mathbf{t}}s(\bm{{y}},\bm{{W}}, \bm{{X}})\right)}
$$
Where $\bm{{Y}}\equiv \{y_i \in (0, 1)\}$ is a vector of binary individual outcomes, $\bm{{W}}$ denotes the social network, $\bm{{X}}$ is a matrix of exogenous variables, $\bm{{\theta}}$ is a vector of model parameters, $s(\bm{{y}},\bm{{W}}, \bm{{X}})$ is a vector of sufficient statistics, and $\eta(\bm{{\theta}})$ is a normalizing constant.
## Code example: SAR
Simulation of SAR models can be done using the following observation: Although the outcome shows on both sides of the equation, we can isolate it in one side and solve for it; formally:
$$
\bm{{y}} = \rho \bm{{X}} \bm{{y}} + \tpose{\bm{{\theta}}}\bm{{W}} + \varepsilon \implies \bm{{y}} = \left(\bm{{I}} - \rho \bm{{X}}\right)^{-1}\tpose{\bm{{\theta}}}\bm{{W}} + \left(\bm{{I}} - \rho \bm{{X}}\right)^{-1}\varepsilon
$$
The following code chunk simulates a SAR model with a Bernoulli graph with 1,000 nodes and a density of 0.01. The data-generating process is as follows:
```{r}
#| label: sar
#| cache: true
set.seed(4114)
n <- 1000
## Simulating the network
p <- 0.01
X <- matrix(rbinom(n^2, 1, p), nrow = n)
## Covariate
W <- matrix(rnorm(n), nrow = n)
## Simulating the outcome
rho <- 0.5
library(MASS) # For the mvrnorm function
## Identity minus rho * X
X_rowstoch <- X / rowSums(X)
I <- diag(n) - rho * X_rowstoch
## The outcome
Y <- solve(I) %*% (2 * W) + solve(I) %*% mvrnorm(1, rep(0, n), diag(n))
```
Using the `spatialreg` R package, we can fit the model using the `lagsarlm` function:
```{r}
#| label: sar-fit
#| cache: true
#| warning: false
#| message: false
library(spdep) # for the function mat2listw
library(spatialreg)
fit <- lagsarlm(
Y ~ W,
data = as.data.frame(X),
listw = mat2listw(X_rowstoch)
)
```
```{r}
#| label: sar-fit-summary
#| cache: true
## Using texreg to get a pretty print
texreg::screenreg(fit, single.row = TRUE)
```
The interpretation of this model is almost the same as a linear regression, with the difference that we have the autocorrelation effect (`rho`). As expected, the model got an estimate close enough to the population parameter: $\rho = 0.5$.
## Code example: ALAAM
To date, there is no R package implementing the ALAAM framework. Nevertheless, you can fit ALAAMs using the PNet software developed by the Melnet group at the University of Melbourne (click [here](https://www.melnet.org.au/pnet/){target="_blank"}).
Because of the similarities, ALAAMs can be implemented using ERGMs. Because of the novelty of it, the coding example will be left as a potential class project. We will post a fully-featured example after the workshop.
## Coevolution
Finally, we discuss coevolution when both network and behavior are embedded in a feedback loop. Coevolution should be the default assumption when dealing with social networks. Nevertheless, models capable of capturing coevolution are hard to estimate. Here, we will discuss two of such models: Stochastic Actor-Oriented Models (or Siena Models) (first introduced in @snijdersStochasticActorOriented1996; see also @snijdersStochasticActorOrientedModels2017) and Exponential-family Random Exponential-family Random Network Models [ERNMs,] a generalization of ERGMs [@wangUnderstandingNetworksExponentialfamily2023; @ernmsFellows2023].
**Siena** Stochastic Actor-Oriented Models [SOAMs] or Siena Models are dynamic models of network and behavior that describe the transition of a network system within two or more time points.
<!-- The general form of a Siena model is: -->
**ERNMs** This model is closely related to ERGMs, with the difference that they incorporate a vertex-level output. Conceptually, it is moving from a having a random network, to a model where a given vertex feature and network are random:
$$
\Prergm(\Y=\y|\X=\x) \to \Prergm(\Y=\y, \X=\x)
$$
## Code example: Siena
This example was adapted from the `RSiena` R package (see `?sienaGOF-auxiliary` page).We start by loading the package and taking a look at the data we will use:
```{r}
#| label: siena-data-struct
library(RSiena)
## Visualizing the adjacency matrix & behavior
op <- par(mfrow=c(2, 2))
image(s501, main = "Net: s501")
image(s502, main = "Net: s502")
hist(s50a[,1], main = "Beh1")
hist(s50a[,2], main = "Beh2")
par(op)
```
The next step is the data preparation process. `RSiena` does not receive raw data as is. We need to explicitly declare the networks and outcome variable. Siena models can also model network changes
```{r}
#| label: siena-data-prep
## Initializing the dependent variable (network)
mynet1 <- sienaDependent(array(c(s501, s502), dim=c(50, 50, 2)))
mynet1
mybeh <- sienaDependent(s50a[,1:2], type="behavior")
mybeh
## Node-level covariates (artificial)
mycov <- c(rep(1:3,16),1,2)
## Edge-level covariates (also artificial)
mydycov <- matrix(rep(1:5, 500), 50, 50)
```
```{r}
#| label: siena-model-build
#| collapse: true
## Creating the data object
mydata <- sienaDataCreate(mynet1, mybeh)
## Adding the effects (first get them!)
myeff <- getEffects(mydata)
## Notice that Siena adds some default effects
myeff
## Adding a few extra effects (automatically prints them out)
myeff <- includeEffects(myeff, transTies, cycle3)
```
To add more effects, first, call the function `effectsDocumentation(myeff)`. It will show you explicitly how to add a particular effect. For instance, if we wanted to add network exposure (`avExposure`,) under the documentation of `effectsDocumentation(myeff)` we need to pass the following arguments:
```{r}
#| label: siena-exposure
## And now, exposure effect
myeff <- includeEffects(
myeff,
avExposure,
# These last three are specified by effectsDocum...
name = "mybeh",
interaction1 = "mynet1",
type = "rate"
)
```
The next step involves creating the model with (`sienaAlgorithmCreate`,) where we specify all the parameters for fitting the model (e.g., MCMC steps.) Here, we modified the values of `n3` and `nsub` to half of the default values to reduce the time it would take to fit the model; yet this degrades the quality of the fit.
```{r}
#| label: siena-model-fit
#| collapse: true
#| cache: true
## Shorter phases 2 and 3, just for example:
myalgorithm <- sienaAlgorithmCreate(
nsub = 2, n3 = 500, seed = 122, projname = NULL
)
## Fitting and printing the model
ans <- siena07(
myalgorithm,
data = mydata, effects = myeff,
returnDeps = TRUE, batch = TRUE
)
ans
```
As a rule of thumb, absolute t-values below 0.1 show good convergence, below 0.2 we say "reasonably well," and above is no convergence. Let's highlight two of the effects we have in our model
1. Transitive ties (number five) are positive 0.78 with a t-value of smaller than 0.01. Thus, we say that the network has a tendency towards transitivity (balance) that is significant.
2. Exposure effect (number seven) is also positive, but small, 0.03, but still significant (t-value of -0.01)
:Like with ERGMs, we also do goodness-of-fit:
```{r}
#| label: siena-gof
#| cache: true
sienaGOF(ans, OutdegreeDistribution, varName="mynet1") |>
plot()
sienaGOF(ans, BehaviorDistribution, varName = "mybeh") |>
plot()
```
## Code example: ERNM
To date, there is no CRAN release for the ERNM model. The only implementation I am aware of is one of the leading authors, which is available on GitHub: [https://github.com/fellstat/ernm](https://github.com/fellstat/ernm){target="_blank"}. Unfortunately, the current version of the package seems to be broken.
Just like the ALAAM case, as ERNMs are closely related to ERGMS, building an example using the ERGM package could be a great opportunity for a class project!