Generalized 2SLS implementation in Stata

Nicolás Suárez Chavarría

Last updated on Jul 21, 2023

Update:

I presented an slightly updated version of this Stata package at the 2023 Stata Conference. You can find my presentation here and the updated code here.

Original post:

For one of my research projects I was working on peer effects for college students, and I wanted to use the Generalized 2SLS procedure to estimate peer effects described in Bramoullé, Y., Djebbari, H., & Fortin, B. (2009). Identification of peer effects through social networks. Journal of econometrics, 150(1), 41-55.. The problem is that this project required a lot of data work and cleaning, so I worked on it on Stata mostly, but there is not a lot of support for peer effects regressions on Stata, so I had to adapt the code of Bramoullé et al. (2009) on my own. I based my code on their article, and on the code shared on Habiba Djebbari’s website.

Here I share my code, some examples about how to use it, and a comparison of my code and the original R code used by the authors.

To wrote this, I took advantage of Stata’s new integration with Python and Jupyter Notebooks (more information on how to use this feature can be found here). I also used the rpy2 package to run R commands inside a Jupyter notebook.

The original Jupyter Notebook I wrote for this can be found in my Github repository.

Setting up Stata in Python

#setting up Stata in Python
import stata_setup
stata_setup.config("C:/Program Files/Stata17", "mp")

  ___  ____  ____  ____  ____ ®
 /__    /   ____/   /   ____/      17.0
___/   /   /___/   /   /___/       MP—Parallel Edition

 Statistics and Data Science       Copyright 1985-2021 StataCorp LLC
                                   StataCorp
                                   4905 Lakeway Drive
                                   College Station, Texas 77845 USA
                                   800-STATA-PC        https://www.stata.com
                                   979-696-4600        stata@stata.com

Stata license: Unlimited-user 4-core network, expiring 21 Jul 2022
Serial number: xxxxxxxxxxxx
  Licensed to: Nicolas Suarez
               Stanford University

Notes:
      1. Unicode is supported; see help unicode_advice.
      2. More than 2 billion observations are allowed; see help obs_advice.
      3. Maximum number of variables is set to 5,000; see help set_maxvar.

Running C:\Program Files\Stata17/profile.do ...

Peer IV function

To implement linear in means regressions, I wrote the peer_iv function. This function takes as inputs, as any other regression function, a dependent variable and independent variables, plus an adjacency matrix G to perform the calculations. This matrix has to be a Mata matrix object (Stata matrices have size limitations), and it is the only parameter that the user is forced to set. All the other parameters are optiontal: row allows us to row-normalize the adjacency matrix (so the sum of each row is 1, and we can interpret the product of a variable and the matrix as weighted means), the fixed option adds group level fixed effects, the ols option runs a regular OLS regression but with peer effects, so by using this we avoid creating new variables with average outcomes and regressors for the peers of each observation, and the vardir option allow us to pass variables directly to the regression with the ols option, so they won’t have a peer effect (this can be used for instance to add dummies for fixed effects).

The matrix generates an standard Stata regression output, containing coefficients, standard errors, p-values and all the relevant information, and it stores eclass results. This means that the output could be stored with custom commands like outreg2, estout or esttab that allow the user to build customizable output tables.

%%stata

capture program drop peer_iv
program define peer_iv, eclass
    version 17
    syntax varlist, ADJacency(name) [ROW FIXED OLS VARdir(varlist)] 
    /* implements the generalized 2SLS model of Bramoulle et al (2009), without fixed effects.
    The model includes a constant, then the endogenous effect, effects of the independent variables, and then the exogenous effects.
    For more details see https://rpubs.com/Nicolas_Gallo/549370

    INPUTS:
    varlist = exogenous variables
    dep= dependent variable
    adjacency=name of the adjacency matrix in Mata
    row=optional, row normalizes the adjacency matrix
    fixed=optional, estimates model with cohort level fixed effects
    ols=optional, OLS results. Don't use together with FIXED
    vardir=optional. Variables to use in the regression that won't have peer effects, only to be used with OLS.

    OUTPUT:
    displays the coefficients and standard errors in a table. Stores eclass results.
    */
    *separating dependent from independent variables
    gettoken dep varlist: varlist
    preserve
    quietly{

    *returning error if OLS and FIXED are used together
    if "`ols'"!="" & "`fixed'"!="" { 
    noisily display as error "options OLS and FIXED may not be combined"
    exit 184
    }

    *returning error if VARDIR and OLS are not used together
    if "`ols'"==""  & "`vardir'"!="" { 
    noisily display as error "option VARDIR has to be used with OLS option"
    exit 184
    }

    *checking if there are missing values in our data
    reg `dep' `varlist' `vardir'
    *recovering the indexes of non-missing observations
    gen muestra=e(sample)
    ereturn clear

    *moving data as matrices
    mata X=st_data(.,"`varlist'")
    mata y=st_data(.,"`dep'")
    mata muestra=st_data(.,"muestra")
    if "`vardir'"!="" mata vardir=st_data(.,"`vardir'")
    
    *dropping missing values from data matrices
    mata X=select(X,muestra)
    mata y=select(y,muestra)
    if "`vardir'"!="" mata vardir=select(vardir,muestra)

    *dropping missing values from G matrix (eliminating the rows and columns with missing values, so the matrix are comformable)
    mata G1=select(`adjacency',muestra)
    mata G1=select(G1,muestra')

    *row normalizing G if needed
    if "`row'"!="" mata G1=G1:/editvalue(rowsum(G1),0,1)

    *generating identity matrix
    mata Id=I(rows(G1))

    *OLS results
    if "`ols'"!="" {
    if "`vardir'"!="" mata X_1 =  J(rows(X),1,1), G1*y, X, G1*X, vardir
    else mata X_1 =  J(rows(X),1,1), G1*y, X, G1*X
    mata theta= invsym(quadcross(X_1, X_1))*quadcross(X_1, y)
    mata e= y - X_1*theta
    mata V = (quadsum(e:^2)/(rows(X_1)-cols(X_1)))*invsym(quadcross(X_1, X_1))
    }
    else {
    *putting matrices together
    *with fixed effects
    if "`fixed'"!="" {
    mata S=( (Id-G1)*X, (Id-G1)*G1*X, (Id-G1)*G1*G1*X )
    mata X_1= ( (Id-G1)*G1*y, (Id-G1)*X, (Id-G1)*G1*X )             
    }
    else{
    mata S=( J(rows(X),1,1), X, G1*X, G1*G1*X )
    mata X_1= ( J(rows(X),1,1), G1*y, X, G1*X )
    }
    mata P= S*invsym(quadcross(S,S))*S'

    *first 2sls
    if "`fixed'"!="" mata theta_1= invsym(X_1'*P*X_1)*X_1'*P*(Id-G1)*y
    else mata theta_1= invsym(X_1'*P*X_1)*X_1'*P*y

    *building instrument
    if "`fixed'"!="" {
    mata Z = G1*luinv(Id-theta_1[1]*G1)*(Id-G1)*(X*theta_1[2::(1+cols(X))] +  G1*X*theta_1[(2+cols(X))::(1+2*cols(X))] ), (Id-G1)*X, (Id-G1)*G1*X   
    }
    else{
    mata Z = J(rows(X),1,1), G1*luinv(Id-theta_1[2]*G1)*( theta_1[1]*J(rows(X),1,1) + X*theta_1[3::(2+cols(X))] +  G1*X*theta_1[(3+cols(X))::(2+2*cols(X))] ), X, G1*X
    }
    *

    *final 2sls
    if "`fixed'"!="" mata theta = luinv(quadcross(Z,X_1))*quadcross(Z,(Id-G1)*y)
    else mata theta = luinv(quadcross(Z,X_1))*quadcross(Z,y)

    *resids
    if "`fixed'"!="" {
    mata e= (Id-G1)*y - luinv(Id-theta[1]*G1)*((Id-G1)*X*theta[2::(1+cols(X))] + (Id-G1)*G1*X*theta[(2+cols(X))::(1+2*cols(X))] )
    }
    else{
    mata e= y - luinv(Id-theta[2]*G1)*( theta[1]*J(rows(X),1,1) + X*theta[3::(2+cols(X))] +  G1*X*theta[(3+cols(X))::(2+2*cols(X))] )
    }


    *variance
    mata V = luinv(quadcross(Z,X_1))*(Z')*diag(e:^2)*Z*luinv(quadcross(X_1,Z))
    }

    *sending results to Stata
    mata st_matrix("b",theta')
    mata st_matrix("V",V)

    *row and col names for matrices
    local exog_peer //list for names of exogenous effects
    foreach var in `varlist'{
    local exog_peer `exog_peer' `var'_p
    }
    if "`fixed'"!="" {
    local varnames `dep'_p `varlist' `exog_peer' `vardir'
    }
    else{
    local varnames _cons `dep'_p `varlist' `exog_peer' `vardir'
    }


    *adding col and rownames
    matrix colnames b= `varnames'
    matrix colnames V = `varnames'
    matrix rownames V = `varnames'
    }
    *storing eclass results
    *tab muestra
    ereturn post b V, depname(`dep') esample(muestra)
    mata st_numscalar("e(N)", rows(G1))
    mata st_numscalar("e(df_r)", rows(X_1)-cols(X_1))
    eret local cmd peer_iv
    ereturn display

    restore     
end

Example

Here we will run a little example to see how the command works, and how you can generate an adjacency matrix in Stata. We are going to use the auto dataset with 1978 automobile data, and we are going to create a random adjacency matrix G with elements that are drawn from a uniform between 0 an 1, but we will force the elements in the main diagonal to be 0. We are also going to row normalize the adjacency matrix.

%%stata

sysuse auto, clear

*generating the matrix
mata G= runiform(`c(N)', `c(N)') 
forval i=1/`c(N)'{
    mata G[strtoreal(st_local("i")),strtoreal(st_local("i"))]=0
}

*running the regression
peer_iv price trunk turn, row adj(G)

eret list

. 
. sysuse auto, clear
(1978 automobile data)

. 
. *generating the matrix
. mata G= runiform(`c(N)', `c(N)') 

. forval i=1/`c(N)'{
  2.         mata G[strtoreal(st_local("i")),strtoreal(st_local("i"))]=0
  3. }

. 
. *running the regression
. peer_iv price trunk turn, row adj(G)
------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   41706.23   48765.18     0.86   0.395    -55603.18    139015.6
     price_p |  -1.527064   11.64602    -0.13   0.896    -24.76633     21.7122
       trunk |   141.5504   83.60055     1.69   0.095    -25.27191    308.3727
        turn |    98.8758   103.6205     0.95   0.343    -107.8956    305.6472
     trunk_p |     439.69   968.8704     0.45   0.651    -1493.661    2373.041
      turn_p |  -959.4334   2872.551    -0.33   0.739    -6691.519    4772.652
------------------------------------------------------------------------------

. 
. eret list

scalars:
                  e(N) =  74
               e(df_r) =  68

macros:
                e(cmd) : "peer_iv"
         e(properties) : "b V"
             e(depvar) : "price"

matrices:
                  e(b) :  1 x 6
                  e(V) :  6 x 6

.

We can see the output of the regression, including a constant, the coefficient price_p is the coefficient of the endogenous effect, while trunk_p and turn_p are the exogenous effects.

After the regression command we ran the eret list command, and we can see all the elements that are stored after running the command.

Storing Mata matrices

At least for my particular application, computing the adjacency matrix was very slow, so it is not something that I would do each time before I want to run peer effects regressions. To avoid this, we can use the Mata functions matsave and matuse to store a matrix as a .mmat object, and then load it into Stata.

Checking if the code works

Here, to see if my code works properly, I will run the R code provided by Bramoullé, Djebbari and Fortin (it can be found here) to generate data, then estimate peer effects with and without fixed effects, export their data to Stata, and then see if my function obtains the same coefficients.

The code provided by the authors is meant to be used to run Monte Carlo simulations, so I made some modifications to keep only the relevant parts. Also, for both cases, the authors defined different data generating processes, so we will have 2 vectors of dependent variables, but all of them are generated with the vector of white noise.

#Python package to use magic R commands
%load_ext rpy2.ipython

Generating network data in R

%%R

library(knitr)
library(igraph)
library(truncnorm)
set.seed(1)

alpha=0.7683
beta=0.4666
gamma=0.0834
delta=0.1507
e_var=0.1
#Generating a graph with 100 vertices and a probability of link of 0.04 with the "random.renyi.game()" function
g<-erdos.renyi.game(100,0.04)
#Generating the associated weighted adjacency matrix
G<-get.adjacency(g)
G<-as.matrix(G)

#Drawing a vector x of characteristics
x_sim<-matrix(rbinom(n = nrow(G),size = 1,prob = 0.9458 ),nrow(G),1)
for(i in 1:nrow(x_sim)){
  if(x_sim[i,] != 0){
    x_sim[i,]<-rtruncnorm(n = 1,a = 0,b = 1000,mean = 1,sd = 3) 
  }
}
#a vector filled with 1,  size m x 1 (used when there is an intercept in the model, i.e. when fixed_effects = FALSE)
l<-matrix(1,nrow(G),1)

GX<-G %*% x_sim
G2X<-(G %*% G) %*% x_sim
#an identity matrix of appropriate size
I<-(diag(nrow(G))) 
# Inv corresponds to (I - Beta*G))^(-1)   in the reduced form(check equation (5))
# Solve function gives the inverse of a matrix
Inv<-solve(I - beta * G)

Case without fixed effects

%%R

#the instrument vector of size m x 4
S<-matrix(c(l,x_sim,GX,G2X),nrow(x_sim),)

#P is the weighting matrix of size m x n
P<-S %*% solve(t(S) %*% S) %*% t(S)


eps<-matrix(rnorm(n = nrow(G),mean = 0,sd = e_var),nrow(G),1)
y<-alpha * Inv %*% l  + Inv %*% (gamma * I + delta * G) %*% x_sim  + Inv %*% eps
Gy<-G %*% y

#X tilde, size n x 4
X_t<-matrix(c(l,Gy,x_sim,GX),nrow(x_sim),)

#theta 2sls and extracting its parameters
th_2sls <-solve(t(X_t) %*% P %*% X_t) %*% t(X_t) %*% P %*% y
alpha_2sls<-th_2sls[1]
beta_2sls<-th_2sls[2]
gamma_2sls<-th_2sls[3]
delta_2sls<-th_2sls[4]

#Recalculate I with Beta_2sls
I_2sls<-solve((diag(nrow(G)) - beta_2sls * G ))  

#Gy estimated in theta 2sls
gy_2sls<- G %*% I_2sls %*% (alpha_2sls * l   + gamma_2sls * x_sim + GX * delta_2sls)

Z_th<-matrix(c(rep(1,nrow(G)),gy_2sls,x_sim,GX),nrow(x_sim),)

#THETAS
#theta lee
th_lee_1<-solve(t(Z_th) %*% X_t) %*% t(Z_th) %*% y

Case with fixed effects

%%R
IG <-(I - G)
#the instrument vector of size m x 3
S<-matrix(c(IG%*%x_sim,IG%*%GX,IG%*%G2X),nrow(G),)

#P is the weighting matrix of size m x m
P<-S %*% solve(t(S) %*% S) %*% t(S)

y2<- solve(IG) %*% Inv %*% (gamma*I + delta * G) %*% IG %*% x_sim + solve(IG) %*% Inv %*% IG %*% eps

#X tilde, size m x 3
X_t<-matrix(c(G%*%IG%*%y2 ,IG%*%x_sim,IG%*%GX),nrow(x_sim),)


#theta 2sls and extracting its parameters
th_2sls <-solve(t(X_t) %*% P %*% X_t) %*% t(X_t) %*% P %*% IG%*%y2
beta_2sls<-th_2sls[1]
gamma_2sls<-th_2sls[2]
delta_2sls<-th_2sls[3]

#Recalculate I with Beta_2sls
Inv_2sls<-solve(I - beta_2sls * G )
#IGy estimated in theta 2sls
IGy_2sls<-Inv_2sls %*% (gamma_2sls * I + delta_2sls * G) %*% IG %*% x_sim + Inv_2sls %*% IG %*% eps 

IG_Gy_2sls<- G %*% Inv_2sls %*% (IG %*% (x_sim * gamma_2sls + GX* delta_2sls))

Z_th<-matrix(c(IG_Gy_2sls,IG%*%x_sim,IG%*%GX),nrow(G),)


#THETAS
#theta lee
th_lee_2<-solve(t(Z_th) %*% X_t) %*% t(Z_th) %*% IG%*%y2

Storing data in R as CSV, and reading it in Stata

%%R
#storing data and adjacency matrix as CSV, to be readed in Stata later
write.csv(data.frame(x_sim,y,y2),'data.csv',row.names = FALSE)
write.csv(G,'adjacency.csv',row.names = FALSE)

%%stata
*reading adjacency matrix, and passing it to Mata
import delimited "adjacency.csv", clear
putmata  G_r=(v*), replace

*reading data
import delimited "data.csv", varnames(1) clear

. *reading adjacency matrix, and passing it to Mata
. import delimited "adjacency.csv", clear
(encoding automatically selected: ISO-8859-2)
(100 vars, 100 obs)

. putmata  G_r=(v*), replace
(1 matrix posted)

. 
. *reading data
. import delimited "data.csv", varnames(1) clear
(encoding automatically selected: ISO-8859-1)
(3 vars, 100 obs)

.

Comparing results without fixed effects

%%stata
peer_iv y x_sim, adj(G_r)

-----------------------------------------------------------------------------
> -
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   .7693815   .0861937     8.93   0.000     .5982885    .9404746
         y_p |   .4668116   .0019521   239.13   0.000     .4629367    .4706865
       x_sim |   .0832526   .0174479     4.77   0.000     .0486188    .1178864
     x_sim_p |   .1501907   .0057371    26.18   0.000     .1388026    .1615789
------------------------------------------------------------------------------

%%R
th_lee_1

          [,1]
[1,] 0.7693815
[2,] 0.4668116
[3,] 0.0832526
[4,] 0.1501907

Comparing results with fixed effects

%%stata
peer_iv y2 x_sim, fixed adj(G_r)

------------------------------------------------------------------------------
          y2 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        y2_p |   .4663327   .0025075   185.97   0.000      .461356    .4713095
       x_sim |   .0841561   .0081916    10.27   0.000      .067898    .1004143
     x_sim_p |   .1500943   .0018714    80.20   0.000       .14638    .1538086
------------------------------------------------------------------------------

%%R
th_lee_2

           [,1]
[1,] 0.46633273
[2,] 0.08415615
[3,] 0.15009431

As we can see, in both cases, the peer_iv command produces the same estimates as the original package. Furthermore, both packages produce estimates that are very similar to the ones used to generate our data (for instance, we have that $\beta=0.4666$, and in both cases we get coefficients very close to that).

Nicolás Suárez Chavarría

Ph.D. Candidate

My research interests include Development and Public Economics, Data Science and Machine Learning applications for Causal Inference.