Directed acyclic graphs to explore causality in Epidemiological study designs, part I: an introduction to DAGs

Qeios, CC-BY 4.0 · Article, May 13, 2020 · Qeios ID: FFH3GU · https://doi.org/10.32388/FFH3GU

Directed acyclic graphs are graphs in which one-directional arrows connect the nodes, and in which information flows from "past" to "future" along the direction of the arrows. These graphs are acyclic in the sense that no path turns back on to its parent node, as each arrow is directed from a causal variable to an effect variable. In this paper, we discuss these graphs with respect to causal inference in Epidemiology and discuss ways of drawing our assumptions prior to our conclusions. Specifically, graphs will help us to identify biases and also help us to characterise counterfactual theories of causation.

Introduction
Epidemiology is defined as the study of the distribution and determinants of diseases in populations and the use of this information to improve population health [1]. While the concept of distribution of disease is intuitive, identification of the causal determinants of diseases is complex. A conceptual difficulty is differentiating between correlation and causation. Here we will use graph theory to explore cause and effect assessment in epidemiological studies.
We can examine the causal association between an exposure A and an outcome Y with a two-step procedure: we first establish that any observed association between A and Y is internally valid, and then, given internal validity, we test whether the association is one of cause and effect [2]. An association between A and Y is internally valid when it satisfies three conditions: (1) the association would not occur by chance alone; (2) the observed association is free from bias; and (3) the association cannot be accounted for by a third variable, referred to as a confounding variable L, that is associated with both the exposure and the outcome [3].
If the association between an exposure A and an outcome Y is internally valid, is this association one of cause and effect? This can be answered using two approaches: one uses the considerations proposed by Sir Austin Bradford Hill in 1965 ("Hill's criteria"), and the other is the counterfactual theory of causation [4]. According to Hill's criteria, we can examine nine considerations to assess the association between an exposure and an outcome (Table 1).
Figure 1. Hill's criteria (image source: https://www.researchgate.net/profile/Caroline_Watson4/publication/5516686/figure/tbl1/AS:669341636911123@1536594890255/Summary-of-Hills-criteria-1965.png)
Of these criteria, the strength of association, temporality, and the biological gradient are useful for thinking about whether an association is one of cause and effect. The counterfactual theory of causation offers an alternative view of causation. This theory argues, for example, that if an exposure A is deemed to cause an outcome Y, then under the counterfactual of A (that is, had A assumed a different value), Y would not occur, or would not assume the same value as it does with the given value of A. If that were not the case, that is, if the value of Y were to remain unchanged, then the association between A and Y would not have causal implications [5]. For example, consider a study where researchers study cigarette smoking (ever smoker / never smoker) as a causal risk factor for lung cancer, the hypothesis being that those who smoke are more likely than non-smokers to develop lung cancer. The status of "non-smoker" is counterfactual to "smoker" status. What is distinctive about this approach is that we compare the "what if" scenarios of the same individual as a smoker and as a non-smoker with respect to the risk of lung cancer [6].
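The counterfactual contrast described above can be written compactly in potential-outcome notation. The superscript notation is an assumption introduced here for illustration and is not used elsewhere in this paper: Y^{a=1} denotes the outcome a person would have as a smoker, and Y^{a=0} the outcome the same person would have as a non-smoker (Y = 1 meaning lung cancer develops). The block assumes the amsmath package.

```latex
\begin{align*}
\text{Individual causal effect:} \quad & Y^{a=1} \neq Y^{a=0} \\
\text{Average causal effect in a population:} \quad & \Pr[Y^{a=1} = 1] \neq \Pr[Y^{a=0} = 1]
\end{align*}
```

Only one of the two potential outcomes is ever observed for a given person, which is why the comparison is called counterfactual.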
In this paper, we examine the role of directed acyclic graphs in mapping out the association between exposures and outcomes in the context of epidemiological studies. In the subsequent parts of the series, we will explore counterfactual theories of causation and how they can be represented in graph structures such as SWIGs (single world intervention graphs) proposed by Richardson and Robins. In the following sections, we begin with an exploration of causal graphs.

What are causal graphs?
Causal graphs are a subset of graphs in which causes and effects are represented using directed acyclic graphs. In 1923, Sewall Wright proposed three rules of path tracing to indicate valid paths, as follows:
1. In a graph that contains a path or a set of paths between two nodes A and Y, a path can leave A in any direction but must continue in the same direction until it reaches Y. It cannot begin in one direction and then reverse its direction.
2. The path can contain only one correlation or covariance term ("curved arrow").
3. The covariance between A and Y (expressed as cov(A, Y)) is the sum, over all valid paths, of the products of the path coefficients along each path.
In 1986, Judea Pearl expanded these concepts to causal graphs and added the concepts of d-separation and the backdoor and front-door criteria. According to Pearl, a path going forward from A to Y constitutes a front-door path and can contain mediator variables (Figure 1).
Figure 1. DAG model that shows the front-door criterion (A is the cause, M = mediator, and Y = outcome)
Hernan and Robins extended these concepts in Epidemiology.
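Wright's third rule can be illustrated with a small sketch. The graph, the node names and the coefficient values below are invented for illustration. The sketch assumes a linear model with standardised variables and considers only the directed (front-door) paths from A to Y, so it is not a general path-tracing implementation.

```python
# Hypothetical linear DAG with standardised variables (unit variances).
# Each key is a (parent, child) arrow; each value is an invented path coefficient.
EDGES = {
    ("A", "M"): 0.5,  # A -> M
    ("M", "Y"): 0.4,  # M -> Y
    ("A", "Y"): 0.2,  # A -> Y (direct arrow)
}


def directed_paths(edges, start, end, prefix=None):
    """Yield every directed path from start to end as a list of nodes."""
    prefix = (prefix or []) + [start]
    if start == end:
        yield prefix
        return
    for parent, child in edges:
        if parent == start:
            # The graph is acyclic, so this recursion always terminates.
            yield from directed_paths(edges, child, end, prefix)


def path_product(edges, path):
    """Multiply the coefficients along one path."""
    product = 1.0
    for parent, child in zip(path, path[1:]):
        product *= edges[(parent, child)]
    return product


def implied_covariance(edges, start, end):
    """Sum the path products over all directed paths (rule 3)."""
    return sum(path_product(edges, p) for p in directed_paths(edges, start, end))


cov_ay = implied_covariance(EDGES, "A", "Y")
print(list(directed_paths(EDGES, "A", "Y")))
print(cov_ay)  # direct path (0.2) plus mediated path (0.5 * 0.4)
```

In this toy graph there are two valid paths, A-Y and A-M-Y, and the implied covariance is the sum of their two path products.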
In Epidemiology, front-door paths are rarely found by themselves with no relevant backdoor paths. A backdoor path leaves A against the direction of an arrow (that is, it begins with an arrow pointing into A) and then reaches Y by traversing other nodes (Figure 2).
Figure 2. The backdoor paths that connect A and Y: A-L-Y is one backdoor path, and A-L-U-Y is another.
In Figure 2, A and Y are connected to each other by two backdoor paths: (1) A-L-Y and (2) A-L-U-Y. Pearl proposed that two nodes A and Y are deemed to be d-separated if there exists no open path between them. Further, a path can be blocked or open depending on the nodes in the path and the direction of the arrows that connect them. In the following diagrams we provide examples of open and closed paths with annotations (Figure 3).
Figure 3. Four path models. Model 1: A-M-Y, an open path; Model 2: A2-M2-Y2, a closed path in which M2 is a collider; Model 3: A3-M3-Y3, a closed path as M3 is conditioned on; Model 4: A4-M4-Y4, an open path as the collider M4 is conditioned on.
In Figure 3, four paths are shown as follows:
In Model 1, the path A-M-Y connects the cause node A, through the mediator M, to the node Y (the effect). This path is open, and A and Y are said to be d-connected.
In Model 2, two arrows with opposing heads connect A2 to M2 and Y2 to M2. These two arrows, one from A2 to M2 and the other from Y2 to M2, collide at M2, so M2 is termed a collider. A path connecting A2 to Y2 that contains a collider is said to be "closed". The presence of a collider that is not conditioned on closes a path connecting two nodes (in epidemiology and biostatistics, "conditioned on" is the same as "adjusted for" or "controlled for").
In Model 3, the path A3 -> [M3] -> Y3 is closed because M3 is controlled for or conditioned on (note the square brackets around M3, or the white circle representing M3 in the figure). In an open path between two nodes A and Y that traverses an intermediate non-collider node N, if the intermediate node is conditioned on, then the path closes. In this case, A3-M3-Y3 is a closed path connecting A3 with Y3.
Finally, in Model 4, A4-M4-Y4 is an open path because M4, which is a collider, is controlled for. A path between two nodes A and Y that would otherwise remain closed because of a collider opens when the collider is controlled for.
Accordingly, for the four path models illustrated above, we state that:
1. In Model 1, A and Y are d-connected, as the path is open and passes through M.
2. In Model 2, A and Y are d-separated, as there are no backdoor paths between them and the only path that exists between them is closed by the collider it contains.
3. In Model 3, A and Y are d-separated, as no backdoor path exists between them and the only path from A to Y, through M, is closed because M is now controlled for or conditioned on.
4. In Model 4, A and Y are d-connected, as the collider that would otherwise block the path is now controlled for or conditioned on.
In Epidemiology, we draw all possible and relevant paths that connect two nodes A and Y and designate the nodes in those paths. If A and Y are not causally connected, we do not connect A and Y with arrows. With DAGs, we graph possible sources of bias in the relationship between a cause and an effect. It is possible to extend the scope of DAGs and unite them with counterfactual theories of causation using Single World Intervention Graphs (SWIGs) and Single World Intervention Templates.
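The open/closed rules for the four path models above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a full d-separation algorithm: it checks a single path at a time, the function names are hypothetical, and it ignores the further rule that conditioning on a descendant of a collider can also open a path.

```python
def is_collider(edges, prev_node, node, next_node):
    """A node is a collider on a path when both neighbouring arrows point into it."""
    return (prev_node, node) in edges and (next_node, node) in edges


def path_is_open(edges, path, conditioned=frozenset()):
    """Return True if the path is open given the set of conditioned-on nodes.

    edges: set of (parent, child) arrows; path: list of nodes from A to Y.
    """
    for prev_node, node, next_node in zip(path, path[1:], path[2:]):
        if is_collider(edges, prev_node, node, next_node):
            if node not in conditioned:
                return False  # an unconditioned collider closes the path
        elif node in conditioned:
            return False  # a conditioned non-collider closes the path
    return True


chain = {("A", "M"), ("M", "Y")}     # A -> M -> Y
collider = {("A", "M"), ("Y", "M")}  # A -> M <- Y
path = ["A", "M", "Y"]

model_1 = path_is_open(chain, path)            # mediator, not conditioned on
model_2 = path_is_open(collider, path)         # collider, not conditioned on
model_3 = path_is_open(chain, path, {"M"})     # mediator, conditioned on
model_4 = path_is_open(collider, path, {"M"})  # collider, conditioned on
print(model_1, model_2, model_3, model_4)
```

The four results correspond to Models 1 to 4: open, closed, closed, open.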
In the next section, we use the properties of graphs, the causal structure, the backdoor criterion, and the role of colliders to illustrate the process of drawing our assumptions about different types of epidemiological study designs and identifying different types of biases.

Causal graphs for epidemiology
Causal graphs in Epidemiology are drawn with the following rules:
- The exposure or the intervention variable is referred to as A
- The outcome variable is referred to as Y
- Measured confounding is given the label L
- Unmeasured confounding is labelled as U
- The mediators are not explicitly drawn in the graph connecting A and Y. The mediators may be important, but the idea here is not to represent the whole world in this graph
- Similarly, the antecedents of the several potential confounding variables are not explicitly represented in the graph
- If a node is included in the graph, every node that is associated with both this node and any other node already present in the graph must also be included, along with the connecting paths
We represent several simple epidemiological models as causal DAGs (Figure 4).
Figure 4. Three representative models of epidemiological studies. The top model (Graph 1) represents a randomised controlled trial. Graph 2 (the middle of the figure) represents an observational study with an exposure variable, an outcome variable, and a measured confounding variable. Graph 3 (the bottom model) represents an epidemiological study with a collider in it (the letter C), which is also a source of selection bias if controlled for.
As can be seen in Figure 4, the topmost graph is a representation of a randomised controlled trial. Here, A is an intervention and Y is the outcome as a result of the intervention.
As this is a randomised controlled trial, by design all measured and unmeasured confounding are controlled for; hence a measured confounding variable is not necessary in this graph, and a single arrow connects the intervention (A) to the outcome (Y). In the middle graph (Graph 2), an observational study with an exposure (A) and an outcome (Y) is shown. This observational study includes a measured confounding variable (L) that is associated with both A and Y. This notion is compatible with the conventional description of a confounding variable as a variable with the following characteristics (cite an epidemiological study design article):
- A confounding variable is associated with the exposure
- A confounding variable is associated with the outcome
- A confounding variable does not lie on a putative causal pathway connecting the exposure and the outcome variable
However, note that in this representative graph a backdoor path exists between A and Y, even though there is no direct path that connects A and Y. This backdoor path is A-L-Y. If L is not controlled for or conditioned on, this backdoor path will lead to an observed association between A and Y even though no causal association exists. Therefore, to ensure that A and Y are d-separated, this backdoor path should be closed by conditioning on L. L in this case is a confounding variable in the association between A and Y. According to the principles of causal graphs, any association that persists between two nodes after all backdoor paths between them are closed represents the true causal connection between them (cite Judea Pearl causal inference in statistics). Hence this backdoor path must be closed for assessing the true association between A and Y.
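To make the role of the backdoor path A-L-Y concrete, the following sketch simulates Graph 2 with invented probabilities, in which L affects both A and Y and A has no effect on Y at all. The numbers, the seed and the function names are assumptions chosen for illustration only. The crude comparison shows an association, while the comparison within levels of L (conditioning on L) shows essentially none.

```python
import random


def simulate(n=200_000, seed=1):
    """Simulate (L, A, Y) where L causes A and Y, and A does not cause Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = rng.random() < 0.5
        a = rng.random() < (0.8 if l else 0.2)  # exposure depends on L
        y = rng.random() < (0.6 if l else 0.1)  # outcome depends on L only
        rows.append((l, a, y))
    return rows


def risk(rows, exposed):
    """Proportion with the outcome among the exposed or the unexposed."""
    outcomes = [y for (_, a, y) in rows if a == exposed]
    return sum(outcomes) / len(outcomes)


def risk_difference(rows):
    return risk(rows, True) - risk(rows, False)


rows = simulate()
crude_rd = risk_difference(rows)
# Conditioning on L: compute the risk difference separately within each level of L.
adjusted_rd = {
    level: risk_difference([r for r in rows if r[0] == level])
    for level in (False, True)
}
print("crude risk difference:", round(crude_rd, 3))
print("risk difference within levels of L:", adjusted_rd)
```

Stratification is used here because it is the simplest way to condition on a binary L; regression adjustment or matching would serve the same purpose.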
Illustration of confounding variable
Consider a case-control study of coffee consumption and the risk of pancreatic cancer that found that coffee consumption increased the risk of pancreatic cancer; however, this effect was absent when the estimate was adjusted for alcohol intake, as alcohol intake is known to be associated with excessive coffee consumption and is also an independent risk factor for pancreatic cancer. In such a situation, A = coffee consumption, Y = pancreatic cancer, and L = alcohol intake. Even though there was no real association between coffee consumption and pancreatic cancer, the unadjusted odds ratio suggested that coffee consumption was a risk factor for pancreatic cancer because of the backdoor path through alcohol intake. Once alcohol intake was adjusted for and the backdoor path was closed, the association disappeared.
The third graph in Figure 4 shows collider bias. Here, the backdoor path A3-L3-Y3 is open, but the path A3-C-Y3 is closed as C is a collider. Conditioning only on L3 closes the backdoor path A3-L3-Y3, so that A3 and Y3 are d-separated. However, if there is also conditioning on C, then the additional path A3-C-Y3 becomes open and results in a spurious association between A3 and Y3. This is referred to as collider bias, and in the context of epidemiological studies it results in selection bias.

Illustration of collider bias as selection bias
Consider the following: a case-control study on the association between cigarette smoking and lung cancer. The investigators recruited cases from oncology in-patients who were admitted with confirmed lung cancer; they recruited controls from the same hospital, from in-patients who were admitted with heart failure. They controlled for age, gender, socioeconomic status, and occupation. Such a study would still be open to selection bias, as this would leave open the path that contains a collider.
To see why, consider the following graph (Figure 5):
Figure 5. Hospitalisation (from heart disease and lung cancer) is a collider of two paths converging from Lung Cancer and Cigarette Smoking and represents selection bias; L represents all confounding variables taken together. When we condition on or adjust for L, we close the backdoor path between cigarette smoking and lung cancer; when we adjust for or condition on hospitalisation, we open the path through the collider, which leads to selection bias.
As can be seen in Figure 5, the participants for this study were obtained from those who were hospitalised either with lung cancer or with heart disease. Cigarette smoking is a known risk factor for heart disease, and therefore for hospitalisation due to heart disease. Equally, those who were suffering from lung cancer in this study were also hospitalised, and because they were hospitalised they were likely to be included in this study. Therefore, by study design, the researchers conditioned on hospitalisation (as they included all patients who were hospitalised due to a cause attributed to cigarette smoking) and in this way opened up an otherwise closed path between smoking and lung cancer. On the other hand, by conditioning on L, the common confounding variable, the researchers closed another backdoor path connecting smoking with lung cancer. This is also illustrated in the third graph in Figure 4. In epidemiological studies, collider biases lead to different forms of selection bias. This form of selection bias, which arises from conditioning on a common effect of the exposure and the outcome, is also referred to as Berkson's bias [7].
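The effect of conditioning on a collider such as hospitalisation can be sketched with a small simulation. Everything here is an assumption made for illustration: a generic exposure A and outcome Y are generated independently of each other (so any association is spurious by construction), and the probability of hospitalisation H rises with either A or Y. The probabilities are invented and are not estimates for smoking or lung cancer.

```python
import random


def simulate(n=200_000, seed=7):
    """Simulate (A, Y, H): A and Y are independent; both raise the chance of H."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.random() < 0.3
        y = rng.random() < 0.2  # generated without reference to A
        p_hospital = 0.05 + 0.4 * a + 0.4 * y  # H is a common effect (collider)
        h = rng.random() < p_hospital
        rows.append((a, y, h))
    return rows


def risk_difference(rows):
    """Risk of Y among those with A minus risk of Y among those without A."""
    exposed = [y for (a, y, _) in rows if a]
    unexposed = [y for (a, y, _) in rows if not a]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)


rows = simulate()
overall_rd = risk_difference(rows)
# Restricting the sample to hospitalised people is conditioning on the collider H.
hospital_rd = risk_difference([r for r in rows if r[2]])
print("whole population:", round(overall_rd, 3))
print("hospitalised only:", round(hospital_rd, 3))
```

In the whole simulated population the risk difference is close to zero, while among the hospitalised a clear association appears even though none was built in. This is the pattern described above as Berkson's bias.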
Illustration of measurement bias
Not all epidemiological studies are possible using unbiased or objective measurements. Consider the following study. An investigator studied the association between smoking and lung cancer in a case-control study design. Cases were clinically confirmed cases of lung cancer, and controls were drawn from community members who were otherwise healthy volunteers; the information on smoking was obtained from self-reported smoking habits, as the number of packs of cigarettes smoked. Known confounding was controlled for. We can examine a possible source of bias in the following figure.
Figure 6. Independent differential measurement bias in a study of cigarette smoking and lung cancer, where cigarette smoking is measured with self-reported smoking and lung cancer is measured with recorded and confirmed cancer from hospital data. U1 and U2 are unmeasured error terms for the measurement of self-reported smoking and cancer diagnoses; L refers to measured confounding.


As can be seen in Figure 6, such a study is subject to misclassification bias. Here, cigarette smoking was measured using self-reported levels of smoking (ever smoker / never smoker, and also packs smoked). Lung cancer data were obtained from hospital records. Thus the two measurements were independent of each other, and therefore errors, if any, in them would be uncorrelated. However, it is also true that those who have cancer are likely to recall their smoking habits better than those who do not have cancer (controls). As cases would have a higher rate of smoking recall than controls, the error in the self-reported measurement of smoking habit would differ by case-control status. Hence this measurement error is BOTH independent across the two types of measurement and differential with respect to who is a case and who is a control. It is therefore referred to as independent differential measurement error. Note here that even though one backdoor path between smoking and lung cancer is closed, as the confounding variable L is conditioned on, another path, that of "Lung Cancer - U1 - Self-reported [...]
[...] causal graphs that can accommodate the counterfactual information in a causal structure.
In this paper, we discussed only a few basic models to introduce the concept of directed acyclic causal graphs in epidemiology. We will extend these concepts for longitudinal and cohort studies. We will also discuss the cases for counterfactual theories of causation and how SWIGs can be used to explore causal relationships between variables.