Parametric analysis of variance, developed by Fisher in the 1920s, is the most widely used statistical method in data analysis. However, its use requires compliance with certain theoretical assumptions. Among them, errors must be normally and independently distributed, their variances must be homogeneous, and the attachment capacity of the model must be considered. When any of these assumptions fails, the use of other analysis methods is suggested, such as mixed linear (MIXED) and mixed generalized linear (GLIMMIX) models.
Mixed models, according to Dicovskiy and Pedroza (2017), are a proposal for advanced statistical modeling that improves the quality of the analysis of fixed and random factors by modeling random variability and error correlation. They are very useful for the analysis of unbalanced data or of data with some type of hierarchical structure. Therefore, they make it possible to estimate the variability among groups and that of effects nested within groups.
Nelder and Wedderburn (1972) grouped different statistical models under the name of generalized linear models (MLGnz), which constitute an extension of the classical general linear models (MLG). These models can be applied to normal, binomial, Poisson and gamma distributions, among others (Mandujano et al. 2016, Díaz et al. 2017 and Monterubbianesi 2017).
Wang et al. (2015) state that data measured in agricultural research does not satisfy the premises of general linear models, so mixed generalized linear models provide an analysis that does not require normally distributed variables, since these can be fitted to a distribution of the exponential family.
These models have been widely disseminated in social sciences, psychology and medical sciences. In agriculture, however, they have had little application, even though situations often arise in which it is difficult to use the MLG in analysis of variance and regression because the analyzed variables do not meet the assumptions of normality, variance homogeneity and independence of errors. In such cases, these models can be proposed as an alternative analysis.
Therefore, the objective of this study was to propose the mixed generalized linear model for the analysis of an experiment in rumen microbiology.
Materials and Methods
For the research, data from an experiment developed in the Department of Biophysiological Sciences of the Institute of Animal Science was used. This study aimed to evaluate the effect of different varieties of Moringa oleifera and Cynodon nlemfuensis (star grass) on the ruminal microbial population, for which the variables total bacteria and isovaleric acid were measured. The experiment followed a completely randomized design with a 6 x 3 factorial arrangement. The factors were the six grass varieties and the three hours, with six repetitions each. Measurements were not performed on the same experimental unit. The statistical models used were the following:
Mixed generalized linear model:
$g(E(y)) = X\beta$
Where:
$E(y)$: expected value of the response variable (total bacteria count and isovaleric acid)
$X\beta$: linear predictor (linear combination of the unknown parameters $\beta$)
$g$: link function that relates the expected value to the linear predictor; the response is assumed to follow a member of the exponential family of probability distributions
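The role of the link function can be illustrated with a minimal Python sketch. It is not the SAS procedure used in this study: the coefficient values are hypothetical, and the pairing of each link with a distribution follows common convention rather than anything stated here. It only shows how g maps the mean to the linear predictor and how its inverse maps back.

```python
import math

# Common link functions g and their inverses (g, g^-1).
# The distribution noted beside each one is the conventional pairing (an assumption).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),            # normal
    "log": (math.log, math.exp),                             # Poisson, gamma
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),          # binomial
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}


def linear_predictor(x_row, beta):
    """Return the linear predictor x'beta for one observation."""
    return sum(x * b for x, b in zip(x_row, beta))


def expected_value(x_row, beta, link="log"):
    """Map the linear predictor back to the mean scale: E(y) = g^-1(x'beta)."""
    _, inverse = LINKS[link]
    return inverse(linear_predictor(x_row, beta))


# Hypothetical coefficients: an intercept plus one treatment indicator.
beta = [2.0, 0.5]
mean_reference = expected_value([1, 0], beta)  # exp(2.0)
mean_treated = expected_value([1, 1], beta)    # exp(2.5)
```

With a log link, effects that are additive on the linear predictor become multiplicative on the mean scale, which is one reason this link is often paired with positive, right-skewed responses.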
Mixed linear model:
$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}$
Where:
$y_{ijk}$: response variable
$\mu$: general mean for all observations
$\alpha_i$: fixed effect of the i-th grass (i = 1, ..., 6)
$\beta_j$: fixed effect of the j-th hour (j = 1, ..., 3)
$(\alpha\beta)_{ij}$: fixed effect of the interaction between the i-th grass and the j-th hour (18 combinations)
$e_{ijk}$: random error associated with each observation
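As an illustration of how the fixed terms of this model relate to cell means, the following Python sketch decomposes a balanced two-way table into a general mean, main effects and interaction under sum-to-zero constraints. The cell means are hypothetical, and the function `decompose` is an illustrative helper, not part of the analysis performed in this study.

```python
from statistics import mean


def decompose(cell_means):
    """Split a balanced two-way table of cell means into mu, alpha, beta and (alpha beta)."""
    n_rows, n_cols = len(cell_means), len(cell_means[0])
    mu = mean(mean(row) for row in cell_means)
    # Main effect of each row factor level (e.g. grass): row mean minus general mean.
    alpha = [mean(row) - mu for row in cell_means]
    # Main effect of each column factor level (e.g. hour): column mean minus general mean.
    beta = [mean(cell_means[i][j] for i in range(n_rows)) - mu for j in range(n_cols)]
    # Interaction: what remains of each cell mean after removing mu and both main effects.
    inter = [[cell_means[i][j] - mu - alpha[i] - beta[j] for j in range(n_cols)]
             for i in range(n_rows)]
    return mu, alpha, beta, inter


# Hypothetical cell means for 2 grasses x 3 hours (not data from the experiment).
cells = [[2.0, 3.0, 4.0],
         [1.0, 1.0, 4.0]]
mu, alpha, beta, inter = decompose(cells)
```

Each cell mean is recovered as mu + alpha[i] + beta[j] + inter[i][j]; a non-zero interaction term indicates that the effect of the hour differs between grasses.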
The theoretical assumptions of the analysis of variance were verified for the original variables. For variance homogeneity of treatments, the Levene (1960) test was used. Normality of errors was evaluated using the Shapiro-Wilk (1965) test. In this analysis, the variable total bacteria did not comply with either assumption, and its fulfillment did not improve after transformation. The original isovaleric acid variable did meet these assumptions, so it was not necessary to transform the data.
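To make the variance homogeneity check concrete, the sketch below computes the Levene statistic W in plain Python from absolute deviations around each group mean. The data are hypothetical and the function name `levene_w` is illustrative; the study itself relied on SAS. The p-value is not computed here, since it requires the F distribution with k - 1 and N - k degrees of freedom.

```python
from statistics import mean


def levene_w(groups):
    """Levene's W statistic based on absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z_ij = |y_ij - mean of group i|
    z = [[abs(y - mean(g)) for y in g] for g in groups]
    z_bar_i = [mean(zi) for zi in z]
    z_bar = sum(sum(zi) for zi in z) / n_total
    # Between-group and within-group sums of squares of the deviations Z.
    between = sum(len(zi) * (zb - z_bar) ** 2 for zi, zb in zip(z, z_bar_i))
    within = sum((v - zb) ** 2 for zi, zb in zip(z, z_bar_i) for v in zi)
    # Note: 'within' is zero if every Z is identical inside each group.
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups: same spread (shifted copies) versus clearly different spread.
w_equal_spread = levene_w([[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]])
w_unequal_spread = levene_w([[1.0, 2.0, 3.0], [0.0, 10.0, 20.0]])
```

A large W relative to the F reference distribution points to heterogeneous variances, which is the situation reported here for total bacteria.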
For the variable that did not meet the theoretical assumptions of the analysis of variance, the mixed generalized linear model was applied as an alternative analysis, using the GLIMMIX procedure. When the theoretical assumptions of the analysis of variance were fulfilled, the mixed linear model was used, with the help of PROC MIXED, both from SAS. In the statistical analyses, treatments, hours and the treatment x hour interaction were considered as fixed effects. Repetition nested within hours was considered as a random effect. For the total bacteria variable, normal, Poisson, lognormal and gamma distributions were tested, the latter being the best fit, with a log link function.
The variance-covariance structures Toeplitz (Toep), variance components (VC), compound symmetry (CS), autoregressive of order 1 (AR(1)) and unstructured (UN) were tested. To select the one with the best fit to the data, the information criteria [Akaike (AIC), corrected Akaike (AICC) and Bayesian (BIC)] were used, and the structure with the smallest values was chosen. For mean comparison, the fixed range test was used (Kramer 1956). Data was analyzed with the SAS (2013) statistical package, version 9.3.
Results and Discussion
Table 1 shows the analysis of the theoretical assumptions of normality of errors and variance homogeneity for the analyzed variables. For total bacteria, probability values in both tests were lower than 0.05, so these assumptions are not fulfilled. For isovaleric acid, however, both values were higher than 0.05, which shows the fulfillment of the base hypotheses that support the analysis of variance.
Variables | ANAVA theoretical assumptions | Statistical tests | P value |
---|---|---|---|
Total bacteria, 10¹¹ CFU/mL | Variance homogeneity | Levene | 0.0266 |
Total bacteria, 10¹¹ CFU/mL | Normality of errors | Shapiro-Wilk | 0.0303 |
Isovaleric acid, mmol/L | Variance homogeneity | Levene | 0.3513 |
Isovaleric acid, mmol/L | Normality of errors | Shapiro-Wilk | 0.2033 |
CFU: colony forming units
Steel and Torrie (1996) and Peña (1994) point out that the normal distribution of errors has little influence on ANAVA for comparing means, since this technique is robust to deviations from normality. However, they argue that the lack of normality can affect other assumptions, such as variance homogeneity, especially when the numbers of observations per group are very different. Moreover, when variance components are analyzed, lack of normality can affect the result of the analysis.
According to Gutiérrez and de la Vara (2012), variance homogeneity is an assumption that relates the residuals of treatments and offers an overview of the possible equality between them. For its analysis, Levene, Bartlett, Hartley and other tests can be used. However, the Levene test is the most robust in the absence of normality.
When analyzing the variables under study, it was observed that total bacteria did not meet the variance homogeneity of residuals. Peña et al. (2015) state that, given the nature of this type of variable, the use of classical statistical methods is not recommended because, in some cases, the homogeneity assumption is not met.
Before starting the statistical analysis in this type of research, it is necessary to verify the fulfillment of the theoretical assumptions of classical statistical methods, since the appropriate statistical method is selected according to the results. The use of mixed generalized linear models avoids the inconveniences that may affect the expected results, because this type of model does not require the fulfillment of these assumptions, which are then no longer a problem for data analysis.
Table 2 shows the analysis of the variance-covariance structures carried out to select the best fit model. For this, the information criteria were considered. For the total bacteria variable, the lowest values were obtained with variance components (VC), and for isovaleric acid, with the autoregressive structure of order one (AR(1)). The compound symmetry (CS) and unstructured (UN) structures did not achieve convergence for either variable, nor did the Toeplitz structure for isovaleric acid, so no results are reported for them. This selection agrees with Gómez (2019), who states that the structure with the best fit to the data is the one with the lowest values in the information criteria.
Variables | Covariance structures | AIC | AICC | BIC |
---|---|---|---|---|
Total bacteria, 10¹¹ CFU/mL | Toep | 775.93 | 815.11 | 807.98 |
Total bacteria, 10¹¹ CFU/mL | VC | 742.77 | 752.77 | 760.58 |
Total bacteria, 10¹¹ CFU/mL | CS | - | - | - |
Total bacteria, 10¹¹ CFU/mL | AR(1) | 744.77 | 755.90 | 763.47 |
Total bacteria, 10¹¹ CFU/mL | UN | - | - | - |
Isovaleric acid, mmol/L | Toep | - | - | - |
Isovaleric acid, mmol/L | VC | 250.50 | 260.20 | 268.30 |
Isovaleric acid, mmol/L | CS | - | - | - |
Isovaleric acid, mmol/L | AR(1) | 249.10 | 259.80 | 267.80 |
Isovaleric acid, mmol/L | UN | - | - | - |
CFU: colony forming units
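The selection rule applied to Table 2 (smallest information criterion among the structures that converged) can be sketched in Python as follows. The values are copied from the table; the dictionary layout and the function `best_structure` are illustrative assumptions, not output or code from SAS.

```python
# (AIC, AICC, BIC) for each covariance structure that converged, copied from Table 2.
CRITERIA = {
    "Total bacteria": {
        "Toep": (775.93, 815.11, 807.98),
        "VC": (742.77, 752.77, 760.58),
        "AR(1)": (744.77, 755.90, 763.47),
    },
    "Isovaleric acid": {
        "VC": (250.50, 260.20, 268.30),
        "AR(1)": (249.10, 259.80, 267.80),
    },
}


def best_structure(fits, index=0):
    """Return the structure with the smallest criterion (index 0=AIC, 1=AICC, 2=BIC)."""
    return min(fits, key=lambda name: fits[name][index])


# Best structure per variable according to AIC.
best_by_aic = {variable: best_structure(fits) for variable, fits in CRITERIA.items()}

# Check whether AIC, AICC and BIC all point to the same structure.
criteria_agree = {
    variable: len({best_structure(fits, i) for i in range(3)}) == 1
    for variable, fits in CRITERIA.items()
}
```

For both variables the three criteria select the same structure, VC for total bacteria and AR(1) for isovaleric acid, although the differences between VC and AR(1) are small in each case.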
Valdivieso (2013) states that covariance structures are modeled from the available data, in which the sample variances and covariances of the observed variables are used to estimate the model parameters and their errors. Liscano and Ortiz (2017) report that, if a structure that fits the data is suspected, its use leads to more efficient inference and estimation.
The results of the analysis of variance (table 3) show that the mean square of the error was lower when mixed procedures were used. This could be because, when nested effects are included in the analysis, treatment variability decreases and better estimates are obtained. Hernández et al. (2003) point out that, in a nested structure, data is grouped into experimental units of different order, each with specific properties according to the grouping level considered, and that it is necessary to remove this effect so that it does not affect the estimation of results.
Variables | Statistical analysis | Mean square of the error | Probability value (Type I) |
---|---|---|---|
Total bacteria, 10¹¹ CFU/mL | ANAVA | 0.3712 | <0.0001 |
Total bacteria, 10¹¹ CFU/mL | GLIMMIX | 0.2719 | <0.0001 |
Isovaleric acid, mmol/L | ANAVA | 0.4951 | 0.4046 |
Isovaleric acid, mmol/L | MIXED | 0.3824 | 0.2122 |
CFU: colony forming units
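The size of the reduction in Table 3 can be made explicit with a short calculation. The Python sketch below takes the reported mean squares at face value and computes the relative decrease obtained with the mixed procedures; the function name `relative_reduction` is illustrative.

```python
# (classical ANAVA, mixed procedure) mean squares of the error, copied from Table 3.
MSE = {
    "Total bacteria": (0.3712, 0.2719),    # ANAVA vs GLIMMIX
    "Isovaleric acid": (0.4951, 0.3824),   # ANAVA vs MIXED
}


def relative_reduction(classical, mixed):
    """Percentage decrease of the mixed-procedure value relative to the classical one."""
    return 100.0 * (classical - mixed) / classical


reduction = {variable: relative_reduction(*values) for variable, values in MSE.items()}
```

Under this reading, the mean square of the error decreases by roughly 27 % for total bacteria and 23 % for isovaleric acid.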
Mixed generalized linear models and generalized additive mixed models are used for modeling nested data and spatial and temporal correlation structures in count or binomial data. Additive mixed-effect models and mixed-effect models are useful for nested data (also called panel data or hierarchical data), repeated measurements, and temporally and spatially correlated data (Zuur et al. 2009).
Table 4 shows the interaction results for the classical analysis of variance and the mixed generalized linear model. In both cases, the interaction was significant. However, the standard error was lower when the latter was used. The analysis showed that the mixed generalized linear model was, in some cases, more conservative in finding similar groups.
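Variable | Statistical analysis | Treatment | Hour 1 | Hour 2 | Hour 3 | SE and Signif. |
---|---|---|---|---|---|---|
Total viable bacteria, 10¹¹ CFU/mL | ANAVA | Star grass | 2.80 abcde (18.71) | 2.29 abcdef (11.71) | 1.18 f (4.71) | ±0.31 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | ANAVA | Superganius | 1.96 bcdef (8.04) | 1.70 cdef (5.54) | 2.49 abcdef (16.54) | ±0.31 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | ANAVA | Tunera | 3.04 abcd (26.21) | 2.57 abcdef (16.71) | 2.22 abcdef (10.04) | ±0.31 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | ANAVA | Camerún | 3.64 a (43.21) | 3.17 abc (24.71) | 1.46 ef (7.04) | ±0.31 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | ANAVA | Paraguaya | 2.51 abcdef (13.04) | 3.41 ab (31.71) | 1.59 ef (7.21) | ±0.31 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | ANAVA | Planin | 2.59 abcdef (17.21) | 3.09 abcd (23.21) | 2.84 abcde (19.71) | ±0.31 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | GLIMMIX | Star grass | 2.93 abcde (18.71) | 2.43 bcdef (11.71) | 1.55 f (4.71) | ±0.24 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | GLIMMIX | Superganius | 2.08 cdef (8.04) | 1.71 ef (5.54) | 2.81 abcde (16.55) | ±0.24 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | GLIMMIX | Tunera | 3.27 abc (26.20) | 2.82 abcde (16.33) | 2.31 bcdef (10.04) | ±0.24 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | GLIMMIX | Camerún | 3.77 a (43.23) | 3.21 abc (24.71) | 1.95 def (7.03) | ±0.24 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | GLIMMIX | Paraguaya | 2.57 abcdef (13.04) | 3.46 ab (31.72) | 1.98 def (7.21) | ±0.24 P<0.0001 |
Total viable bacteria, 10¹¹ CFU/mL | GLIMMIX | Planin | 2.85 abcde (17.21) | 3.14 abcd (23.20) | 2.98 abcd (19.72) | ±0.24 P<0.0001 |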
CFU: colony forming units
When comparing both models, some of the treatment means estimated with the mixed generalized linear model were slightly higher. This could be related to the link function, selected according to the distribution followed by the variable, since means are estimated through this link function.
When analyzing the isovaleric acid variable, it was observed that the interaction between the main effects was not significant. Therefore, the main effects were reported (tables 5 and 6). For the effect of varieties, the standard error of the mixed procedure was slightly lower than that of the classical analysis of variance, although neither method found significant differences among treatments (table 5).
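One possible reading of this difference can be checked numerically. The Python sketch below assumes, without confirmation from the analysis itself, that the GLIMMIX estimates in Table 4 are on the log link scale and that the ANAVA values come from log-transformed data. Under that assumption, the GLIMMIX value should be close to the natural logarithm of the mean in parentheses, while the mean of log-transformed observations can never exceed the log of their mean (Jensen's inequality). The `counts` list is hypothetical.

```python
import math

# (mean in parentheses, GLIMMIX estimate) pairs copied from Table 4.
PAIRS = {
    "Star grass, hour 1": (18.71, 2.93),
    "Superganius, hour 2": (5.54, 1.71),
    "Camerún, hour 1": (43.23, 3.77),
}

# Log link: the estimate on the link scale is the log of the mean.
log_of_mean = {cell: math.log(cell_mean) for cell, (cell_mean, _) in PAIRS.items()}
max_gap = max(abs(log_of_mean[cell] - estimate) for cell, (_, estimate) in PAIRS.items())

# Jensen's inequality on hypothetical counts: mean(log y) <= log(mean y).
counts = [5.0, 10.0, 40.0]
mean_of_logs = sum(math.log(c) for c in counts) / len(counts)
log_of_the_mean = math.log(sum(counts) / len(counts))
```

For these three cells, the logarithm of the reported mean agrees with the GLIMMIX estimate to within rounding, which is consistent with the slightly higher values being an effect of the link function rather than of the data.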
Statistical analysis | Variable | Star grass | Superganius | Tunera | Camerún | Paraguaya | Planin | SE and Signif. |
---|---|---|---|---|---|---|---|---|
ANAVA | Isovaleric acid, mmol/L | 2.01 | 1.89 | 1.45 | 1.89 | 1.60 | 1.83 | ±0.17 P=0.0693 |
MIXED | Isovaleric acid, mmol/L | 2.01 | 1.89 | 1.45 | 1.89 | 1.60 | 1.83 | ±0.15 P=0.0825 |
Table 6 reports the effect of hours. Standard errors were similar for both methods, and no significant differences were found among hours. Therefore, this type of analysis can be proposed for research related to rumen microbiology experiments, as long as an adequate statistical analysis is carried out that justifies the use of these methods.
Statistical analysis | Variable | Hour 1 | Hour 2 | Hour 3 | SE and Signif. |
---|---|---|---|---|---|
ANAVA | Isovaleric acid, mmol/L | 1.73 | 1.87 | 1.73 | ±0.12 P=0.6046 |
MIXED | Isovaleric acid, mmol/L | 1.73 | 1.87 | 1.73 | ±0.12 P=0.5469 |
According to Gómez et al. (2012) and Dicovskiy and Pedroza (2017), mixed models are a proposal for advanced statistical modeling that improves the quality of the analysis of fixed and random factors by modeling random variability and error correlation. These models are very useful in the analysis of unbalanced data, or of data with some type of hierarchical or grouping structure.
From the results of this research, it is concluded that mixed models improve the accuracy and precision of analysis results. The smallest mean square of the error is obtained when using mixed procedures, and standard errors decrease with respect to the classical analysis of variance. From this perspective, these models are proposed for the analysis of count variables in experiments on the rumen microbial population.