Abstract

Ecotoxicological hazard assessment relies on species effect data to estimate quantities such as the predicted no-effect concentration. Whilst there is a concerted effort to quantify uncertainty in risk assessments, the uncertainty due to inter-test variability in species effect measurements is an overlooked component. The EU REACH guidance document suggests that multiple toxicity records for a given chemical-species combination should be aggregated by the geometric mean. Ignoring this issue, or applying unjustified so-called harmonisation methods, weakens the defensibility of uncertainty quantification and of interpretations about properties of ecological models, for example the predicted no-effect concentration. In the present study we propose a simple and broadly theoretically justifiable model to quantify inter-test variability and analyse it using Bayesian methods. The value of data in ecotoxicity databases is maximised by utilising (interval-)censored data. An exploratory analysis is provided to support the model. We conclude, based on a large ecotoxicity database of acute effects to aquatic species, that the standard deviation of inter-test variability is about a factor (or fold-difference) of 3. The consequences for decision makers of (not) adjusting for inter-test variability are demonstrated.


Introduction
... regulation concerning the 'Registration, Evaluation, Authorisation & restriction of CHemicals', better known as REACH ...

... for further details. The database is freely available as Supporting Information.
In addition to scientific review of original data sources, De Zwart [24], pp. 136-138, describes an ad hoc collection of data filtering queries applied to a larger database which yielded the research database described here. In particular, censored data points were removed unless they were either the smallest, greatest, or only reported concentration for the corresponding chemical-species-A/C combination. Therefore, it is likely that some records, whether 'outliers' or censored values, have been removed which would have been informative for making inferences about inter-test variability.

We predominantly focus on a subset of the database selected according to the following rules: (i) all ...

We will use the following notation. Let y_ijk be the k-th log (base 10) transformed toxicity value for species j tested on chemical i. Also, let K_ij be the number of toxicity records for species j tested on chemical i. The three possible cases are: (1) K_ij = 0, which means zero records are available for chemical-species combination (i, j); (2) K_ij = 1, which means precisely one record is available; and (3) K_ij > 1, which means multiple records are available. It is case (3) which allows for inferences to be made about inter-test variability; when it is part of a larger statistical model, case (2) data can also influence model parameter estimates.

Exploratory analysis

The most straightforward exploratory analysis is to calculate the sample standard deviation of log toxicity values for each chemical-species combination (i, j) for cases when K_ij > 1, namely

$$ s_{ij} = \sqrt{\frac{1}{K_{ij} - 1} \sum_{k=1}^{K_{ij}} \left( y_{ijk} - \bar{y}_{ij} \right)^2 }, $$

where ȳ_ij is the sample mean for chemical-species combination (i, j). Whilst this statistic is non-parametric, it is only calculable with pointwise toxicity data. Therefore, we only analyse the 6279 pointwise records extracted previously.
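As a concrete illustration, a minimal sketch of this exploratory calculation is given below; the long-format layout and column names are assumptions for illustration, not those of the actual research database.

```python
# Sketch of s_ij and the pooled standard deviation from a long-format table.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "chemical": ["A", "A", "A", "A", "B", "B", "B"],
    "species":  ["sp1", "sp1", "sp2", "sp2", "sp1", "sp1", "sp1"],
    "log10_conc": [2.10, 2.65, 1.40, 1.95, 0.30, 0.85, 0.62],
})

grouped = df.groupby(["chemical", "species"])["log10_conc"]
summary = grouped.agg(K="size", ybar="mean", s="std")   # s_ij = sample SD (ddof = 1)
repeated = summary[summary["K"] > 1]                    # only K_ij > 1 contributes

# Pooled SD: square root of the (K_ij - 1)-weighted average of the s_ij^2
num = ((repeated["K"] - 1) * repeated["s"] ** 2).sum()
den = (repeated["K"] - 1).sum()
print(repeated)
print("pooled SD:", round(float(np.sqrt(num / den)), 3))
```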

Classical modelling
Inferences about a simple exploratory analysis can be made if a model is proposed. Pragmatic modelling is advocated here in light of the limited number of records for chemical-species combinations generally available in ecotoxicity databases and the lack of quality-controlled metadata. We therefore propose a simple model, namely

$$ y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, $$

where µ_ij is the 'true' log-transformed toxicity value for chemical-species combination (i, j) and ε_ijk is the inter-test variability. In addition, we make two initial modelling assumptions: (i) inter-test variability in the database subset is random and not systematic, thus not requiring bias correction; and (ii) ε_ijk | σ ∼ N(0, σ²), where the tilde is read as 'distributed as'. These two assumptions combine to state that each residual about the 'true' log-transformed toxicity value for chemical-species combination (i, j) is a random sample from a normal distribution centred about zero with homogeneous variance σ², independent of experimental conditions, chemical and species.

Since there is no a priori reason why this source of variation, which is extraneous to the interspecies variation, should be a unique property of the specific risk assessment (i.e. the chemical or species tested) rather than one which is globally defined, this modelling assumption appears reasonable. This model is, however, incompatible with the joint modelling of multiple arbitrary endpoints, e.g. acute and chronic together. Although the normal distribution assumption allows for confidence statements to be made, testing the assumption is difficult since, for each chemical-species combination (i, j), K_ij, the number of available L(E)C50 measurements for that combination, is generally too small to (confidently) make inferences from goodness-of-fit tests. This exact problem is faced by risk assessors trying to fit SSDs.

Based on these assumptions, a statistically unbiased estimate of σ² is the pooled variance, namely

$$ s^2_{\text{pooled}} = \frac{\sum_{(i,j):\,K_{ij} > 1} (K_{ij} - 1)\, s^2_{ij}}{\sum_{(i,j):\,K_{ij} > 1} (K_{ij} - 1)}. $$

Bayesian modelling

We earlier proposed the data model as y_ijk | µ_ij, σ ∼ N(µ_ij, σ²) for k = 1, ..., K_ij and species j tested with chemical i; this defines the likelihood function for a given ecotoxicity database. In order to analyse the model from a Bayesian perspective we need to specify prior distributions for (i) σ² and (ii) µ_ij.
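To illustrate the pooled estimator in this setting, a small simulation sketch is given below; the settings (σ = 0.5, K_ij = 3, and so on) are illustrative assumptions, not properties of the database.

```python
# Simulate y_ijk = mu_ij + eps_ijk with eps_ijk ~ N(0, sigma^2) and check that the
# pooled variance recovers sigma^2. All values here are toy settings for illustration.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                                  # inter-test SD on the log10 scale
n_combos, K = 2000, 3                        # chemical-species combinations, records each

mu = rng.normal(1.0, 2.0, size=n_combos)     # 'true' log10 toxicity values mu_ij
y = mu[:, None] + rng.normal(0.0, sigma, size=(n_combos, K))

s2 = y.var(axis=1, ddof=1)                   # per-combination sample variances s_ij^2
s2_pooled = ((K - 1) * s2).sum() / ((K - 1) * n_combos)
print("pooled SD estimate:", np.sqrt(s2_pooled))   # close to 0.5
```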

For (i), a default prior distribution, which is referred to as non-informative, is the Jeffreys prior: π(σ²) ∝ σ⁻² for σ² > 0. This is equivalent to the assumption that all values of log(σ) are equally likely a priori.
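The stated equivalence is a one-line change of variables: a uniform (improper) prior on log σ transforms to

$$ \pi(\sigma^2) \;=\; \pi(\log\sigma)\,\left|\frac{\mathrm{d}\log\sigma}{\mathrm{d}\sigma^2}\right| \;\propto\; \frac{1}{2\sigma^2} \;\propto\; \sigma^{-2}, \qquad \sigma^2 > 0. $$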

Although not universally accepted, it has been applied to the interspecies standard deviation parameter in SSD modelling. For (ii), we propose the hierarchical prior

$$ \mu_{ij} \mid \alpha_i, \psi_i \sim N(\alpha_i, \psi_i^2), $$

where α_i and ψ_i are the per-chemical SSD mean and standard deviation parameters on the log-concentration scale for chemical i. Note that, in a situation where ecotoxicity data are measured for a single chemical only without inter-test variability, i.e. σ = 0, the model reduces to the one described in Aldenberg and Jaworska [13], subject to notational differences; their µ and σ are then our α_i and ψ_i respectively, and our µ_ij then correspond to their data. The inclusion of a common inter-test variability parameter means that the SSDs are only conditionally independent (probabilistically) between chemicals. This hierarchical structure requires us to either estimate each pair of hyper-parameters (α_i, ψ_i) or model them; we do the latter using standard independent non-informative priors: π(α_i, ψ_i) ∝ 1. This is equivalent to the assumption that all values of ψ_i > 0 are equally likely a priori. A fuller discussion of prior choices, however, is beyond the scope of the present study, which is primarily interested in estimating σ and in the consequences within the remit of the current simple SSD modelling framework.
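Collecting the assumptions above, the full hierarchy can be written compactly as

$$
\begin{aligned}
y_{ijk} \mid \mu_{ij}, \sigma &\sim N(\mu_{ij}, \sigma^2), & k &= 1, \ldots, K_{ij},\\
\mu_{ij} \mid \alpha_i, \psi_i &\sim N(\alpha_i, \psi_i^2), & j &\in J_i,\\
\pi(\alpha_i, \psi_i) &\propto 1, \qquad \pi(\sigma^2) \propto \sigma^{-2}. &&
\end{aligned}
$$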

The requirement in the database subset extraction routine that each chemical must have a minimum number of distinct species tested ensures that the resulting posterior distribution is proper. The posterior distribution of the parameters of interest is calculated using Bayes' rule and is proportional to

$$ \pi(\sigma^2)\, \prod_{i} \left[ \pi(\alpha_i)\, \pi(\psi_i) \prod_{j \in J_i} \pi(\mu_{ij} \mid \alpha_i, \psi_i) \prod_{k=1}^{K_{ij}} \pi(y_{ijk} \mid \mu_{ij}, \sigma^2) \right], $$

where J_i is the set of all species tested with chemical i. A mathematical derivation of the posterior distribution, a description of the sampling methodology used to analyse it, and a computer code script for running it are all given in the Supporting Information.

The normal distribution assumption is not a prerequisite of this analysis; alternative distribution proposals ...

Inter-test variability is not taken into account in standard procedures for determining hazardous concentrations, although it should be, due to the simple fact that it is an additional component of uncertainty. We briefly examine the consequences of adjusting for its presence. The usual method for fitting an SSD assumes the 'true' log-transformed toxicity value µ_ij for chemical-species combination (i, j) to be equal to the aggregated measurement ȳ_ij; no account of uncertainty is made. Therefore, we define the 'inter-test variability adjusted' (ITVA) ...

In Figure 1 we display boxplots, for the pointwise-only subset of the acute data, of s_ij for every chemical-species combination (i, j) for which K_ij > 1 (s_ij is not defined otherwise). Note that s_ij is reported in units of log10 µg/L. Since sampling uncertainty will undoubtedly be greater for small K_ij, we stratify the boxplots according to K_ij bins. The red dashed line shows the estimated pooled standard deviation, s_pooled = 0.507.

Since s_ij is measured on the log (base 10) scale, s_pooled corresponds to a factor (or fold-difference) of about 3.2. There is no evidence of s_ij being explained by major taxonomic grouping; however, this conclusion is not surprising since 79% of the 4615 distinct (i, j) combinations are for Crustacea and Osteichthyes only, a reflection of the imbalance between ecotoxicity databases and ecological representativeness.
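The conversion between the two scales is simply the antilog of the pooled standard deviation:

$$ 10^{\,s_{\text{pooled}}} = 10^{0.507} \approx 3.2. $$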

Conditional on the model assumption that the log-toxicity values are realisations from a normal distribution with homogeneous variance (with the mean equal to the 'true' log-toxicity value for that chemical-species combination), an approximate 'region of high probability' for future s_ij is highlighted in blue in Figure 1. For 2 ≤ K_ij ≤ 6 the medians of the boxplots tend towards s_pooled from below which, assuming the population inter-test variance is homogeneous, is expected since the sampling distribution of s_ij is skewed to the right.

As shown by the grey line graph overlay, the number of chemical-species combinations for which K_ij is large (K_ij ≥ 8) is generally less than 5. Although the standard error about s_ij reduces as K_ij increases, with only a handful of (i, j) combinations there will be less information to gauge whether homogeneity is a reasonable assumption. Qualitatively, we conclude that homogeneity is a reasonable hypothesis. The null hypothesis ... homogeneity. This test, however, is limited in value since it is not generally appropriate to consider meaningful populations (in the statistical sense) defined by the K_ij bins.
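One way such a region could be constructed (a sketch under the stated normality and homogeneity assumptions, not necessarily the exact procedure used for Figure 1) uses the sampling distribution (K_ij − 1)s_ij²/σ² ∼ χ²_{K_ij−1}:

```python
# Approximate central 95% interval for a future s_ij at each K_ij, assuming
# (K_ij - 1) * s_ij^2 / sigma^2 ~ chi-squared with K_ij - 1 degrees of freedom,
# with sigma fixed at the pooled estimate. This is a sketch, not the authors' code.
import numpy as np
from scipy.stats import chi2

sigma = 0.507
for K in range(2, 11):
    lo, hi = chi2.ppf([0.025, 0.975], df=K - 1)
    s_lo, s_hi = sigma * np.sqrt(lo / (K - 1)), sigma * np.sqrt(hi / (K - 1))
    print(f"K = {K:2d}: s_ij approximately in [{s_lo:.2f}, {s_hi:.2f}]")
```

The widening of the interval at small K_ij mirrors the greater sampling uncertainty visible in the boxplots.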

As a side analysis to the focal acute dataset in the present study, an estimate of the NOEC-based inter-test pooled standard deviation is calculated as 0.580 (a fold-difference of about 3.8). There are far fewer data available (174 distinct (i, j) combinations with repeated measurements, such that 145 combinations fall into the K_ij = 2 bin; 17 into the K_ij = 3 bin; 8 into the K_ij = 4 bin; 3 into the K_ij = 5 bin; and 1 into the K_ij = 6 bin) to test the homogeneity hypothesis, although qualitative analysis suggested the hypothesis was reasonable. Although test methods for assessing chronic (NOEC) toxicity are inherently more complex than those for assessing acute (EC50/LC50) toxicity, conditional on the homogeneity model, the average increase in inter-test variability is only 18%. However, the uncertainty about this estimate is also larger.

... By fitting some additional models to the data, the slight difference between s_pooled and the posterior median estimate, which is of negligible practical significance, was found to be due largely to the hierarchical (SSD) modelling of the chemical-species mean toxicities µ_ij, rather than to any difference between Bayesian and frequentist procedures or to the incorporation of censored data in the Bayesian analysis. The posterior distribution of σ was also found to be insensitive to the choice of prior distribution for σ.

In Figure 3, for 339 chemicals we plot the posterior ITVA median log10(HC5) estimate (based on all data ...

According to [11], the HCx is the concentration hazardous to x% of species; equivalently, the probability that a randomly selected species from the assemblage has its endpoint exceeded is x%. It can be inferred from common practice and risk assessment guidance that the SSD, of which the HCx is a summary statistic, ...

... The Bayesian paradigm also offers a rich framework to include subjective prior knowledge, which will undoubtedly allow experts with specialities in specific chemical groups and species to come together to reduce uncertainty quantitatively, whilst providing a transparent mechanism with which to examine expert judgements a posteriori. ...

... to whether this constitutes a value of concern; however, it is a source of uncertainty nonetheless and should be discussed in risk assessments since it will only serve to compound with other sources of uncertainty. In many assessments, accounting for inter-test variability will lead to larger (or equivalently, relatively less ...

A.1 Model

The simplest hierarchical model for SSDs which incorporates measurement error is

$$ y_{ijk} \mid \mu_{ij}, \sigma \sim N(\mu_{ij}, \sigma^2), \qquad \mu_{ij} \mid \alpha_i, \psi_i \sim N(\alpha_i, \psi_i^2), \tag{1} $$

where y_ijk is the k-th (k = 1, ..., K_ij) log (base 10) toxicity value for chemical i (= 1, ..., N) tested on species j. For convenience, define J_i to be the set of species tested with chemical i and Y to be the entire database of measured log toxicity values. The (hyper-)parameters are assigned prior distributions as follows: π(σ²) ∝ σ⁻² for σ² > 0; π(α_i) ∝ 1 independently for each α_i ∈ ℝ, i = 1, ..., N; and π(ψ_i) ∝ 1 independently for each ψ_i > 0, i = 1, ..., N.
Conditional on observing log toxicity values, the full-data likelihood function, which is the probability of observing the data but as a function of the model parameters, is given as

$$ L(\mathbf{Y} \mid \boldsymbol{\mu}, \sigma^2) = \prod_{i=1}^{N} \prod_{j \in J_i} \prod_{k=1}^{K_{ij}} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{(y_{ijk} - \mu_{ij})^2}{2\sigma^2} \right\}, $$

where µ is the vector of 'true' toxicity values µ_ij for all relevant chemical-species combinations (i, j). This can be simplified for analytical tractability as

$$ L(\mathbf{Y} \mid \boldsymbol{\mu}, \sigma^2) \propto (\sigma^2)^{-\frac{1}{2}\sum_{i,j} K_{ij}} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \sum_{j \in J_i} \left[ (K_{ij} - 1)\, s^2_{ij} + K_{ij} (\bar{y}_{ij} - \mu_{ij})^2 \right] \right\}, $$

where ȳ_ij and s²_ij are the sample mean and variance of the log toxicity values for chemical-species combination (i, j).
The µ_ij values are 'nuisance' parameters in this analysis and would ordinarily be integrated out of the density function. However, the later complication of censored data requires us to work with the full posterior. The posterior distribution of µ, σ² and the hyper-parameters, α_1, ..., α_N, ψ_1, ..., ψ_N, can then be determined using Bayes' rule:

$$ \pi(\boldsymbol{\mu}, \alpha_1, \ldots, \alpha_N, \psi_1, \ldots, \psi_N, \sigma^2 \mid \mathbf{Y}) \;\propto\; L(\mathbf{Y} \mid \boldsymbol{\mu}, \sigma^2) \times \prod_{i=1}^{N} \left[ \pi(\alpha_i)\, \pi(\psi_i) \prod_{j \in J_i} \pi(\mu_{ij} \mid \alpha_i, \psi_i) \right] \times \pi(\sigma^2), \tag{2} $$

where each µ_ij conditional on α_i and ψ_i independently follows the distribution given by Eqn. 1.
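As a concrete restatement of the factorised form above, a minimal sketch of the unnormalised log posterior in Eqn. 2 is given below; the data layout (nested dictionaries of sufficient statistics) and names are illustrative assumptions, and s²_ij is taken as zero when K_ij = 1.

```python
# Unnormalised log posterior of Eqn. 2, parameterised directly in terms of sigma2 = sigma^2
# and psi2[i] = psi_i^2 (with the corresponding change-of-variable factors for the priors).
import numpy as np

def log_posterior(mu, alpha, psi2, sigma2, suff):
    """suff[i][j] = (K_ij, ybar_ij, s2_ij); mu[i][j], alpha[i], psi2[i] are floats."""
    if sigma2 <= 0 or any(v <= 0 for v in psi2.values()):
        return -np.inf
    lp = -np.log(sigma2)                          # Jeffreys prior pi(sigma^2) ~ 1/sigma^2
    for i, species in suff.items():
        lp += -0.5 * np.log(psi2[i])              # flat prior on psi_i, written in terms of psi_i^2
        for j, (K, ybar, s2) in species.items():
            # factorised Gaussian likelihood contribution of combination (i, j)
            lp += -0.5 * K * np.log(sigma2) \
                  - ((K - 1) * s2 + K * (ybar - mu[i][j]) ** 2) / (2.0 * sigma2)
            # hierarchical prior mu_ij | alpha_i, psi_i ~ N(alpha_i, psi_i^2)
            lp += -0.5 * np.log(psi2[i]) - (mu[i][j] - alpha[i]) ** 2 / (2.0 * psi2[i])
    return lp
```

Such a function could, for example, be handed to a generic MCMC routine as an alternative to the bespoke Gibbs sampler described next.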

A.2 Sampling
In order to sample from this posterior distribution, a block Gibbs Markov chain Monte Carlo (MCMC) sampler was written. The Gibbs sampler requires the posterior distribution of each parameter conditional on all the others in the model. With these distributions, starting from a "best" guess for the parameters, we cycle through them sampling each block of random variables one at a time based on the most recent version of the other parameters. All the conditional distributions described here belong to standard families (e.g. Gaussian) and therefore sampling from them is trivial once the location, scale and shape parameters are analytically determined. Further details of MCMC techniques can be found in Gelman et al. [1].
Derivation of the conditional distributions follows straightforwardly from the decomposition in Eqn. 2; they are listed below, where n_i denotes the number of unique species tested with chemical i, i.e. the cardinality of the set J_i.
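Under the stated priors the full conditionals take standard conjugate forms; one consistent set, parameterising the variance components as ψ_i² and σ², is

$$
\begin{aligned}
\mu_{ij} \mid \cdot &\sim N\!\left( \frac{K_{ij}\bar{y}_{ij}/\sigma^2 + \alpha_i/\psi_i^2}{K_{ij}/\sigma^2 + 1/\psi_i^2},\; \left(K_{ij}/\sigma^2 + 1/\psi_i^2\right)^{-1} \right),\\[4pt]
\alpha_i \mid \cdot &\sim N\!\left( \frac{1}{n_i}\sum_{j \in J_i} \mu_{ij},\; \frac{\psi_i^2}{n_i} \right),\\[4pt]
\psi_i^2 \mid \cdot &\sim \text{Inv-Gamma}\!\left( \frac{n_i - 1}{2},\; \frac{1}{2}\sum_{j \in J_i} (\mu_{ij} - \alpha_i)^2 \right),\\[4pt]
\sigma^2 \mid \cdot &\sim \text{Inv-Gamma}\!\left( \frac{1}{2}\sum_{i,j} K_{ij},\; \frac{1}{2}\sum_{i,j,k} (y_{ijk} - \mu_{ij})^2 \right).
\end{aligned}
$$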

A.3 Censored Data
The posterior distribution calculations above rely on the database of measurements, Y, all being observed. Frequently, laboratory measurements will yield censored measurements such that y_ijk ∈ (L_ijk, U_ijk), where L_ijk and U_ijk define the lower and upper bounds of the measurement value respectively. The type of censoring depends on the values of L_ijk and U_ijk: left-censoring ('less than' records) corresponds to L_ijk = −∞ with finite U_ijk; right-censoring ('greater than' records) to finite L_ijk with U_ijk = +∞; and interval-censoring to both bounds being finite.
Write Y = (Y_obs, Y_cens), where Y_obs is the collection of observed measurements and Y_cens is the collection of (unknown) censored measurements with corresponding (known) intervals (L, U). Then the posterior distribution of all the unknown parameters and Y_cens conditional on Y_obs, π(Y_cens, µ, α_1, ..., α_N, ψ_1, ..., ψ_N, σ² | Y_obs), has the same form as the right-hand side of Eqn. 2. Hence, the Gibbs sampler can be augmented with the additional conditional distributions for all data y_ijk ∈ Y_cens:

$$ y_{ijk} \mid \mu_{ij}, \sigma^2, \; y_{ijk} \in (L_{ijk}, U_{ijk}) \;\sim\; N(\mu_{ij}, \sigma^2) \ \text{truncated to} \ (L_{ijk}, U_{ijk}). $$

It is useful to exploit the probability integral transform to generate this sample:

Step 1. Set

$$ P_{L_{ijk}} = \Phi\!\left( \frac{L_{ijk} - \mu_{ij}}{\sigma} \right), \qquad P_{U_{ijk}} = \Phi\!\left( \frac{U_{ijk} - \mu_{ij}}{\sigma} \right), $$

where Φ(·) is the standard normal cumulative distribution function.

Step 2. Randomly generate U_ijk ∼ U(P_{L_ijk}, P_{U_ijk}), where U(a, b) is the uniform distribution with support on [a, b].

Step 3. Set y_ijk = µ_ij + Φ⁻¹(U_ijk)σ.

As per the model (hyper-)parameters, it is necessary to initialise the Markov chain at some possible value which satisfies the constraints of the censoring.
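A minimal sketch of Steps 1-3 (with illustrative values, not the authors' script) is:

```python
# Sample a censored y_ijk from its truncated normal full conditional via the
# probability integral transform described in Steps 1-3.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sample_censored(mu_ij, sigma, L_ijk, U_ijk):
    # Step 1: map the censoring bounds to probabilities under N(mu_ij, sigma^2)
    p_lo = norm.cdf((L_ijk - mu_ij) / sigma)
    p_hi = norm.cdf((U_ijk - mu_ij) / sigma)
    # Step 2: draw uniformly between the two probabilities
    u = rng.uniform(p_lo, p_hi)
    # Step 3: invert the standard normal CDF and rescale
    return mu_ij + norm.ppf(u) * sigma

# e.g. a 'less than 10 ug/L' record: y_ijk < 1 on the log10 scale
print(sample_censored(mu_ij=1.4, sigma=0.5, L_ijk=-np.inf, U_ijk=1.0))
```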

A.4 Implementation
A technical issue arises in the resulting posterior distribution regarding whether its normalisation constant is finite, stemming from the variance component parameters ψ_1, ..., ψ_N. However, by restricting the minimum number of distinct species tested with each chemical to n_i ≥ 3, the issue is resolved [2]. It is conceivable that the heavy right tail deriving from the typically small sample sizes found in ecotoxicological risk assessment will influence posterior inferences. This is of particular importance since estimates of the hazardous concentration to a fixed proportion of species (i.e. the HC_p) are a function of ψ_i. Based on heuristic suggestions in Gelman [2], we therefore restrict sample sizes to n_i ≥ 5.
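To make the scheme concrete end to end, a self-contained sketch of the sampler is given below. The toy data, layout and names are illustrative assumptions rather than the authors' implementation; it uses the conjugate full conditionals listed in Section A.2, respects the n_i ≥ 5 restriction, and finishes by converting posterior draws of (α_i, ψ_i) into a log10(HC5) estimate for one chemical via the log-normal SSD quantile α_i + Φ⁻¹(0.05)ψ_i.

```python
# Block Gibbs sampler sketch for the hierarchical SSD model with inter-test variability.
# data[i][j] holds the log10 toxicity records y_ijk for chemical i and species j.
import numpy as np
from scipy.stats import invgamma, norm

rng = np.random.default_rng(1)

data = {
    0: {0: np.array([2.1, 2.6]), 1: np.array([1.4]), 2: np.array([3.0, 2.2, 2.7]),
        3: np.array([2.9]), 4: np.array([1.8, 2.4])},           # n_0 = 5 species
    1: {0: np.array([0.3, 0.9]), 1: np.array([-0.2]), 2: np.array([0.5]),
        3: np.array([1.1, 0.6]), 4: np.array([0.0])},           # n_1 = 5 species
}

def gibbs(data, n_iter=4000):
    chems = sorted(data)
    # initialise at crude estimates
    mu = {i: {j: y.mean() for j, y in data[i].items()} for i in chems}
    alpha = {i: float(np.mean(list(mu[i].values()))) for i in chems}
    psi2 = {i: float(np.var(list(mu[i].values()), ddof=1)) + 0.01 for i in chems}
    sigma2 = 0.25
    n_tot = sum(len(y) for i in chems for y in data[i].values())
    out = {"sigma": np.empty(n_iter),
           "alpha": {i: np.empty(n_iter) for i in chems},
           "psi": {i: np.empty(n_iter) for i in chems}}

    for t in range(n_iter):
        rss = 0.0  # residual sum of squares about the current mu_ij
        for i in chems:
            species = list(data[i])
            n_i = len(species)
            for j in species:
                y = data[i][j]
                prec = len(y) / sigma2 + 1.0 / psi2[i]
                mean = (y.sum() / sigma2 + alpha[i] / psi2[i]) / prec
                mu[i][j] = rng.normal(mean, np.sqrt(1.0 / prec))        # mu_ij | rest
                rss += np.sum((y - mu[i][j]) ** 2)
            mus = np.array([mu[i][j] for j in species])
            alpha[i] = rng.normal(mus.mean(), np.sqrt(psi2[i] / n_i))   # alpha_i | rest
            ss = np.sum((mus - alpha[i]) ** 2)
            psi2[i] = invgamma.rvs(0.5 * (n_i - 1), scale=0.5 * ss,
                                   random_state=rng)                     # psi_i^2 | rest
            out["alpha"][i][t], out["psi"][i][t] = alpha[i], np.sqrt(psi2[i])
        sigma2 = invgamma.rvs(0.5 * n_tot, scale=0.5 * rss,
                              random_state=rng)                          # sigma^2 | rest
        out["sigma"][t] = np.sqrt(sigma2)
    return out

draws = gibbs(data)
burn = 1000
print("posterior median sigma:", np.median(draws["sigma"][burn:]))
# Posterior median log10(HC5) for chemical 0 under a log-normal SSD
log_hc5 = draws["alpha"][0][burn:] + norm.ppf(0.05) * draws["psi"][0][burn:]
print("posterior median log10(HC5), chemical 0:", np.median(log_hc5))
```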