Structural equation modelling with lavaan: a tutorial and an intuitive introduction

The goal of this paper is to present a tutorial on structural equation modelling (“SEM”). SEM is a combination of multivariate linear regression and path analysis models. We will discuss path analysis, measurement models, measurement invariance and when or how to use them, twin studies, and longitudinal data analysis. In this tutorial, we shall use the free and open source package “lavaan” in R. Structural equation modelling in health sciences and epidemiology Structural equation modelling (“SEM”) is a combination of multivariate linear regression and path analysis models. In this brief hands-on tutorial, we will discuss path analysis, measurement models, measurement invariance and when or how to use them, twin studies, and longitudinal data analysis using SEM. We shall use the free and open source package “lavaan” in R the free and open source statistical programming language Steps of SEM As structural equation modelling is graphical, we recommend that you follow the sequence: 1. Draw your model as a system of paths 2. Input data in the form of covariance or correlation matrix 3. Identify the model 4. Specify the model 5. Assess parameter estimates 6. Assess fit measures (chi-square, df, residual matrix, GFI, RMSEA) 7. Check the modification indices 8. Rerun the model till you get the best fit of the data to the model and theory Absolute first step before you proceed: Load the packages Assuming that you use R for your analyses, you need to load the following packages in R after installing R: library(lavaan) library(dagitty) library(ggdag) Qeios, CC-BY 4.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 1/33 library(tidyverse) library(DiagrammeR) library(DiagrammeRsvg) library(rsvg) Step 1: Notes on the system of paths and directed acyclic graphs (DAGs) Everything graphical in SEM starts with path analysis. Richard Sewall Wright (1921), a hundred years ago, described a system of finding correlation between two variables, X and Y using a system of paths (Denis 2021). In this approach, he described that if a system of paths exist between two variables X and Y, the multiplication produce of the path coefficents of the sequences of the paths that traverse between the two variables should be added to the path coefficients of the direct paths that exist between X and Y to derive their correlations (Figure 1). Figure 1. Basic path diagram Figure 1. A path diagram The above figure presents a system of paths that can be connected to derive covariances between pairs fo variables. These paths can be traced from one variable to another according to set rules. Sewall Wright described such a system of path tracing rules as follows: 1. A path can start from one variable to be connected to another variable and can start in either a forward or a reverse direction in the direction of the arrowhead 2. Once started in one direction, the path must continue in the same direction unless it meets with another path in a Qeios, CC-BY 4.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 2/33 reverse direction and at that point can proceed no further 3. A path can only contain ONE curved double headed arrow. A curved double headed arrow signifies either a covariance between two variables or a variance of a single variable 4. A path cannot go through the same variable twice, that is a path can only go through one variable at a time Then, once all the valid paths are identified, their path coefficients are multiplied and added to the direct path coefficient if one exists between the two variables to derive the correlation between these two paths. With these information, we can trace the following valid paths in Figure 1: x-a-b-c-y x-c-y x-d-e-y These are the only three valid paths. No direct path exists between x and y, and all other paths are either invalid or they are blocked one way or another. In order to derive the correlation between x and y, we will need to add the multiple products of the individual path coefficients as follows: cor(x,y) = (x-a)*(a-b)*(b-c)*(c-y) + (x-c)*(c-y) + (x-d)*(d-e)*(e-y) Figure 2. Another system of paths we use in measurement model or a confirmatory factor analysis model. Qeios, CC-BY 4.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 3/33

library(tidyverse) library(DiagrammeR) library(DiagrammeRsvg) library(rsvg) Step 1: Notes on the system of paths and directed acyclic graphs (DAGs) Everything graphical in SEM starts with path analysis. Richard Sewall Wright (1921), a hundred years ago, described a system of finding correlation between two variables, X and Y using a system of paths (Denis 2021). In this approach, he described that if a system of paths exist between two variables X and Y, the multiplication produce of the path coefficents of the sequences of the paths that traverse between the two variables should be added to the path coefficients of the direct paths that exist between X and Y to derive their correlations (Figure 1). The above figure presents a system of paths that can be connected to derive covariances between pairs fo variables.
These paths can be traced from one variable to another according to set rules. Sewall Wright described such a system of path tracing rules as follows: 1. A path can start from one variable to be connected to another variable and can start in either a forward or a reverse direction in the direction of the arrowhead 2. Once started in one direction, the path must continue in the same direction unless it meets with another path in a Qeios,.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 2/33 reverse direction and at that point can proceed no further 3. A path can only contain ONE curved double headed arrow. A curved double headed arrow signifies either a covariance between two variables or a variance of a single variable 4. A path cannot go through the same variable twice, that is a path can only go through one variable at a time Then, once all the valid paths are identified, their path coefficients are multiplied and added to the direct path coefficient if one exists between the two variables to derive the correlation between these two paths. With these information, we can trace the following valid paths in Figure 1: x-a-b-c-y x-c-y x-d-e-y These are the only three valid paths. No direct path exists between x and y, and all other paths are either invalid or they are blocked one way or another. In order to derive the correlation between x and y, we will need to add the multiple products of the individual path coefficients as follows: cor (  The path model in Figure 2 is a simple measurement model or confirmatory factor analysis model with one latent variable ("lv1"), and three manifest variables (mv1 … mv3). You can see that a set of six paths connect the manifest variables. In confirmatory factor analysis, we estimate or constrain such path coefficients and the path coefficients are used to derive the variances and covariances of these variables. The path coefficients also tell us the effect of one variable over another. For example, the effect of lv on mv1 in Figure 2 will be determined by the path coefficient of the path connecting lv with mv1.
Note another feature: all the arrows in these diagrams point in one direction, and the variables are all connected by arrows that move in one direction. Such kind of graphs are referred to as directed acyclic graphs (DAG) as no variable has arrows that eventually return to itself closing any loop. DAGs are visual tools to directly observe the causal relationships between exposures and outcomes, including mediators, confounders, and effect modifiers; this is perhaps as accurate a definition of the role of DAGs as you can get Rohrer (2018).
Path diagrams for structural equation models ("SEM") have symbols that help the readers to understand what these models are doing. However, as our aim is to eventually discuss structural causal models in the light of structural equation models, we will briefly mention them here and continue with the models as we present here for uniformity. Table 2 provides a brief description of the symbols used for structural equation models and the differences we will adopt for our purposes in this tutorial. We will assume that all exogenous latent and manifest variables have variances so we do not show them separately and all endogenous latent and manifest variables have exogenous error terms. Hence we suggest the above scheme, besides, using dagitty and ggdag packages in R helps us to draw these graphs in a uniform way. Alternative, publication quality causal and structural equation graphs can be drawn using graphviz and DiagrammeR packages but they are also time consuming. We recommend that for rapid visualisation of the models, use dagitty.

Path analysis in a measurement model: partition of variances
Refer to Figure 2 where we see a measurement model with one latent variable and three manifest variables. Here, we Qeios,.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 5/33 would like to use path tracing rules to derive the variance of manifest variable mv1. Here is the procedure: We will trace ALL paths that BEGIN with mv1 and end on mv1.
We will multiply and add the path coefficients of all such paths What paths exist?
A path starts from mv1, goes to lv and returns to mv1 (we call this mv1-lv-mv1) Another path is mv1-e1-mv1 No other path starts from mv1 and ends in mv1 Hence, -variance of mv1 = (mv1-lv) * var(lv) * (lv-mv1) + (mv1-e1) * var(e1) * (e1-mv1) Now, -mv1-lv and lv-mv1 are the same paths, so the path coefficient get squared -mv1-e1 and e1-mv1 are also the same paths, and we set the coefficients of these paths at 1.0 by convention -If we standardise lv, then var(lv) = 1.0 So, from here, we can say that the var(mv1) = square of the path coefficient from latent variable + variance of error term The term "square of the path coefficient" is referred to as "communality," because this part of the total variance (or variability, so to say) of mv1 is EXPLAINED by the latent variable that is common to all other manifest variables in the model that receive arrows from the latent variable. The path coefficient is also referred to as factor loading.
Using the path analysis approach, you will see that as the error terms are uncorrelated, therefore the correlation (or covariance) between mv1 and mv2 is given by mv1-lv * var(lv) * lv-mv2 These concepts are fundamental to understanding what goes in SEMs. We have seen one model, that of a measurement model or confirmatory factor analysis model, where we have one or more latent variables load on manifest variables. where lv2 is regressed on lv1, so the path coefficient of lv1-lv2 can be viewed as a beta coefficient, with something like: This is the path diagram of a full structural equation model. Note that this model incorporates BOTH a set of measurement models (two measurement models here) and a structural model (lv1-lv2 path). You can extend and make these models as complex or as simple as you want. You could probably have only one manifest variable for the structural part where you would regress the manifest variable on the latent variable (simple), or you could have many more latent variable models that you would link up to form complex patterns that you would like to analyse.

Path diagrams of models with meanstructures and group comparisons
So far, we have confined our discussions to models that have only one group of people and models that only have explained covariances and variances. For example, a measurement model would be well suited to test the validity of the construct of a questionnaire you have set up to investigate some health construct. Say you have developed a questionnaire that aims to tap an individual's concept of "health" and decide to distribute this questionnaire to 200 individuals, and obtain data from them. Each individual is asked five questions, and you could have a measurement model Qeios,.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 7/33 out of these five questions and a latent or unobserved construct of "health" from your research participants. Such a procedure would provide you with an estimate of whether you were able to tap the construct of "health" based on the items you asked your participants. Now imagine that among your participants there are those with chronic diseases (such as diabetes, high blood pressure, chronic heart disease and so on) and those who are otherwise healthy adults with no evidence of any disease. You may claim that the questions were perceived in the same way among members of both the groups, and that the factor loadings of the latent variable (in this case "health") would be equal in both groups; So even if the unexplained and explained variances of the manifest variables may be similar in both groups, it might still mean that they would have different average values on those scores. In other words, you may want to find out group differences or invariances of your measurements across groups to make sure that your model is a robust model. This is where a mean structure is important (Meredith 1993).
In SEM, the means of the manifest variables are referred to as "intercepts" and the means of the latent variables as "factor mean." The factor mean is derived from a constant term, represented by a triangle (we will present here in the form of a circle with 1.0 as value). In a standardised solution, the path coefficient from the constant to the latent variable is set at 0 (mean = 0 for a standardised variable). Also note that as this is a constant, it's variance is 0, and therefore it contributes only to the mean of the latent and manifest variables. When the mean of the latent variable is set to 0, the intercept of the model is same as the mean of the individual manifest variables. Such a structure is used to compare and evaluate that the measurements and the measurement properties (such as factor loadings, and variance of the latent variable) REMAIN THE SAME for each group. The groups themselves can be constituted on the basis of categorical variables such as sex, race, or case-control status, for example.  Both latent variables are "lv," except for group 1 it is 1lv, and for group 2, it is 2lv (similarly for manifest variables mv1, …, mv3 and e1 … e3, they are prefaced with 1 for group 1 and 2 for group 2) Three arrows on each side go from 1 to the manifest variables; these are the intercepts Then we have the measurement models of the two groups, group 1 and group 2 Now, a few things to consider here: 1. For each manifest variable in each group, we have their intercept, their variance, and their covariance on the basis of which the measurement model is formed 2. For each factor in each group, there are three manifest variables (in this case) 3. Each factor in each group (i.e., latent variable), we have measured their latent mean (the arrows that go from 1 to 1lv and 2lv), we have the factor loadings, and the variances We believe that the groups will differ in some ways, but we must make sure that they were measured in EXACTLY the same way. So, while we study both (or more groups), we can restrict or constrain the group parameters in a number of ways: At the least, we say that we only constrain that the configuration of the latent and manifest variables remain invariant in between the groups, but everything else can vary. If this is the case, this will be one of configural measurement Qeios, CC-BY 4.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 9/33 invariance Then we can say that our factor loadings MUST be constrained in both the groups so as to ensure that both the groups had similar measurements. This is a weak measurement invariance because we still leave the possibiity that the variance of the latent variables themselves could differ between the groups Then we add the constraints of both factor loadings as well as latent variable variances to remain identical across the groups, but the latent means could still vary. This is strong invariance and is most practical assumption Finally, we can add the constraint that everything will be identical between the groups and see what factor structure will satisfy this condition and how would the model hold. This is referred to as strict measurement invariance.
Using strong invariance to examine the group differences or groups is important to ascertain as to what differences exist even as the measurement of the constructs were conducted identically. Such groups can be two periods of time for the same population, or same sample, or two conditions (cases and controls or exposed and non-exposed, or treatment and control conditions), or gender, or indeed any characteristic of the individuals. It might be useful to investigate situations over two periods of time (as in baseline and final time point).

Group comparison in the context of twin studies
Group comparison is particularly useful in the context of twin studies (Neale and Cardon 2013). The problem and the solution are as follows. Say you want to test the hypothesis that BMI is determined by genes. BMI is a quantitative trait, so rather than one gene, we hypothesise that several genes add up to exert their effects on BMI. Yet we may also argue that in addition to genes, there are environmental variables that also determine individual BMI scores. To study such effects, you have collected data from 400 twins (200 monozygotic twins and 200 dizygotic twins), these twins were all raised in the same household and shared the same environment; however, even then, they also had unique environmental factors other than their shared households and rearing (such as different friends, different universities where they studied, different occupations, etc). So, if you take PAIRS of monzygotic and dizygotic twins and examine these TWO groups using a path analysis model, you will have the following: For each of the 200 pairs of Monozygotic (and it is true for dizygotic twins), you will have data on their BMI Twin1 and Twin2 (T1 and T2) will have on BMI a variance-covariance matrix whose patterns can be explained by three For MONOZYGOTIC twins, as they have EXACTLY the same complement of genes, their As will perfectly vary, i.e., Qeios,.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 11/33 cor(A1, A2) = 1.0 For DIZYGOTIC twins, as they share 50% of their genes, their As will have cor(A1, A2) = 0.5 For DIZYGOTIC twins, as they share 25% of their dominant genes (heterozygosity), this will be cor(D1, D2) = 0.25 For BOTH monzygotic and dizygotic twins, their shared environments (Cs) will be perfectly correlated, so cor(C1, C2) = 1.0 for BOTH mono-and dizygotic twins, their unique environments are uncorrelated, so we will see cor(E1, E2) = 0 a$ ^2$ + e = 1.00 (assuming standardised coefficients) a is also referred to as "narrow heritability" Figure 5 shows the twin studies path diagram While analysing these paths, we have to be careful that while Cs are common to both MZ and DZ twins, the path coefficient that explains the part of the variation in the phenotype (say BMI in the example), will be very low if not negligible. This needs to be taken into account as you assess these models, so setting or constraining the path coefficients C1A1 and C2A2 to close to 0 (something like 0.001) is a useful strategy.

Path analysis of longitudinal data or latent growth curve models
So far, we have discussed path models that are all assessed over a single period of time, and even when we discussed measurement invariance and discussed two groups in two periods of time, the scope was limited. We now turn to the problem of what happens when you have correlated measures over repeated measurements taken over three or more periods. These models are referred to as latent growth curve models.
There are two major differences between the models we have studied so far and latent growth curve models. First, so far we have assumed that the data for all our models came from studies that were conducted over a specified period of time, 3. Change must also have some kind of rate of change. This could be linear, this could be quadratic. Equally, as we factor in time, we factor in what intervals over which such change occur? Are the data collected linearly, as in every n number of years? Or are data collected every two years for the first two waves and then every five years for the next few waves and so on? Figure 6 shows a simple latent growth curve model Figure 6. A simple latent growth curve model Latent growth curve model As you can see, x1 … x4 shows the measured X variables that we want to model over 4 periods of time. These are our manifest variables. The arrows from i, the latent intercept is fixed at 1.0 to indicate that the latent intercept has the same influence over the time bound values. The arrows from latent slope, s, to x1 … x4 is fixed as follows: if we want to model them as linear, we start with 0, so in this case this will be 0,1,2,3. A 0th time point is the beginning or the first measurement.
Depending on the frequency over which data were collected, we adjust the value of the path coefficients of the paths that run from s to individual xs. Finally the error terms for each time point observation. Here we have shown them to be uncorrelated, but these can be correlated errors.

Code of all the graphs and diagrams we have presented so far:
Qeios, CC-BY 4.0 · Article, April 6, 2021 Qeios Step 2: Input data in the form of a covariance matrix In order to analyse data in SEM, you will need the following: A correlation or a covariance matrix

A set of standard deviation (if you use correlation matrix) a set of means for the individual manifest variables
Sample size (the more the better) In our examples of SEM here, we will use lavaan package in R. You can download and install lavaan from lavaan's website. There are several ways of getting covariance or correlation matrices:

Direct input of numbers
You can input a raw set of numbers as variance-covariance or correlation matrix (lower or upper) and using the function lav_matrix_lower2full() you can obtain a full matrix in lavaan. ## Convert between correlation and covariance matrices Besides, if you have a correlation matrix and a set of standard deviation, you can convert a correlation matrix to a covariance matrix using the function cor2cov(matrix, sd) ## Directly compute the matrix from the variables in your data set Using the function cov(c(v1, ..., vn), na.rm = T), remember to set na.rm = T in R
Now that we have seen that evidence from 400 countries in Europe, Australia, and other parts of the world suggest that several aspects of personal well-being and better living are related to the social well-being, we can ask another related question: what are the relative genetic and environmental contributions to sense of psychological well-being. We will use data from Archontaki et.al. (2012) on the twin study of psychological well-being; for the purpose of demonstration, we will only use a small set of twin correlations, but you can read the entire study here (Archontaki, Lewis, and Bates 2013) We will replicate a small subscale from the data presented in the article, We will analyse here the scale "personal growth" for this paper. We will set up the data for analysis as follows: # for mz twins, we do: mz = lav_matrix_lower2full(c(1.00, 0.38, 1.00)) rownames(mz) = colnames(mz) = c("T1", "T2") Now that we have set up the data for this study, we will run the models and evaluate them as follows. Here we will evaluate an ACE model, where we will test the path coefficients and heritability estimates for a model where we will test additive genetic effecs (As), common environmental effects (Cs), and unique environmental effects (Es). Note that as correlation of Cs on both MZ and DZ twins is 1.0 (that is they share the SAME environment), the impact of a shared or Qeios,.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 25/33 common environment on their phenotype is likely to be very low, so we will set it at more than 0.001 Our final analysis will be based on data where Gallup World Poll measured the proportion of people in various countries where they said they were happy over time, see for more information here. We obtained the data and you can download a copy of the data from here. The data were collected between 1984 through 2014. We will study the longitudinal pattern using SEM (growth model) and study the baseline percentage from where it began and slope of the growth. The data were gathered every five years.

#modificationIndices(happiness_fit)
As this analysis suggests: About 72% people over the world population reported that they were happy or satisfied in 1993 Over time, more people tended to report they were happy over subsequent surveys Qeios,.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 31/33 Countries that started with lower percentage of their people reporting in surveys they were happy, had a faster rate of growth over subsequent waves While our analysis borne this, you can also view the results reported graphically on the ourworldindata website as shown in the following figure: Figure 8. Trend in reporting of happiness Shared they were happy across time span, Ourworld in data as the source As you can see in this figure, countries such as Russia and Zimbabwe, that started at the lower end of the graphs had steeper upward slopes reporting happiness over time, while countries that started with higher proportion of people as happy tended to have slower trajectory of growth reporting happy people in subsequent waves. Overall, people do seem to get happier over time, at least till 2014 starting from 1993.

Summary
In this tutorial, we provided a brief roundup of using structural equation modelling as an analytical strategy. Structural equation modelling is a mix of regression and factor analysis, and as you may have seen, this can be used for validation of surveys and questionnaires, reducing data, linear regression modelling, analysis of group differences, twin data analyses, and growth curve modelling. We used worked out examples to show where you can start with simple measurement models, then transform or modify them to test how well they fit the data, and in the same modelling strategy, you can test regression as in testing structural models. Longitudinal studies and twin studies are different from measurement models Qeios, CC-BY 4.0 · Article, April 6, 2021 Qeios ID: DNI7ET · https://doi.org/10.32388/DNI7ET 32/33 and structural models in the sense that they need theoretical understandings to fix and constrain parameter estimates, and also for longitudinal studies, our goal is to find the parameters for latent intercepts and slopes, as we constrain the parameters that explain variance of the time bound values of variables. While SEMs are powerful strategies, one needs to be careful to deal with missing values, extreme outliers, non-normal distribution of the variables, and categorical variables.
Lately, SEM strategies are being used to model non-parametric equations as in (Structural Causal Models) [https://www.causalflows.com/structural-causal-models/]. We have also not touched the issues of model specification, interpretation of modification indices, and sample size estimation. In subsequent extensions to this inflammation, we will discuss these issues.