
Cuban Journal of Agricultural Science

On-line version ISSN 2079-3480

Cuban J. Agric. Sci. vol.53 no.1 Mayabeque Jan.-Mar. 2019  Epub Jan 18, 2019

 

Biomathematics

Categorical regression model for the analysis and interpretation of statistical power

Walkiria Guerra1*, Magaly Herrera2, Lucía Fernández1, Noslen Rodríguez Álvarez1

1Universidad Agraria de La Habana, Autopista Nacional y Carretera de Tapaste, San José de las Lajas, Mayabeque, Cuba

2Instituto de Ciencia Animal, Apartado Postal 24, San José de las Lajas, Mayabeque, Cuba

Abstract

Criteria of theoretical and practical value are established for fixed-effects analysis of variance models (parametric and non-parametric), based on an integral analysis of variables related to statistical indicators and experimental design, with statistical power as the dependent variable. The information analyzed was selected from independent studies processed by the Biomathematics department of the Instituto de Ciencia Animal and conducted in the areas of birds, pigs, grasses and ruminants. The experiments correspond to balanced completely randomized designs (CRD) and randomized block designs (RBD). The results were processed with the parametric Fisher's F test and compared with the equivalent non-parametric tests, Kruskal-Wallis and Friedman. A total of 21 experiments were selected, 16 with CRD and five with RBD. For the analysis, a data matrix was created with the nine selected variables. The most outstanding result is the strong negative relation between power and the probability of type I error in the analysis of variance models (parametric and non-parametric): low values of the probability of type I error are associated with high values of power. Future studies should examine in more depth sample size, the distribution of the variable under study and the power-efficiency criterion (Asymptotic Relative Efficiency, ARE), in relation to the probability of type I error and power.

Key words: statistical and experimental design indicators; parametric and non-parametric analysis of variance

Introduction

Bono and Arnau (1995), in reviewing the development of the concept of test power, point out that in the theory developed by Neyman and Pearson between 1928 and 1933, the power of a statistical test is the probability of obtaining significant results. According to these authors, its estimation is determined by three basic components: sample size, level of significance (α) and the size of the effect to be detected.
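The dependence of power on these three components can be illustrated with a small Monte Carlo sketch. It is not part of the original study: it assumes normal data with unit variance in a balanced one-way layout, and it replaces the tabulated F critical value with an empirical quantile simulated under the null hypothesis. The function names, group means and number of repetitions are hypothetical.

```python
import random
import statistics


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [statistics.fmean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def simulate_f(means, n_per_group, reps, rng):
    """F statistics for simulated normal data (unit variance, given means)."""
    return [
        f_statistic([[rng.gauss(m, 1.0) for _ in range(n_per_group)]
                     for m in means])
        for _ in range(reps)
    ]


def estimated_power(means, n_per_group, alpha, reps=2000, seed=1):
    """Share of simulated F values that exceed an empirical critical value."""
    rng = random.Random(seed)
    # Assumption: the critical value is the empirical (1 - alpha) quantile
    # of F simulated under equal means, not a tabulated F quantile.
    null_f = sorted(simulate_f([0.0] * len(means), n_per_group, reps, rng))
    critical = null_f[int((1 - alpha) * reps) - 1]
    alt_f = simulate_f(means, n_per_group, reps, rng)
    return sum(f > critical for f in alt_f) / reps
```

Calling `estimated_power([0.0, 0.0, 1.0], 5, 0.05)` and then repeating the call with 20 observations per group, or with α = 0.01, shows how the estimate moves with sample size and significance level; changing the group means plays the role of the effect size.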

There are two ways to estimate power: a priori (prefixed) and a posteriori. The first informs the researcher of the sample size needed for adequate power, and power tables have been constructed for this purpose. A posteriori power is important for interpreting the results of completed studies, as in this study, whose purpose is to guide the conduct of future research.

Scheffé (1959) discusses the power of Fisher's F test in fixed-effects analysis of variance models (ANAVA). He refers to the power tables calculated for α = 0.01 and 0.05 and reproduces power graphs for the F test.

Siegel and Castellan (1995), in the area of non-parametric statistics, introduce the concept of power-efficiency, also known as Asymptotic Relative Efficiency (ARE) or Pitman efficiency. Several authors (De Calzadilla 1999, Guerra et al. 2000, Cristo 2001, De Calzadilla et al. 2002, Vásquez 2011 and Cabrera et al. 2012) performed empirical evaluations to assess the appropriateness of applying parametric and non-parametric analysis of variance models, with univariate and bivariate approaches.

Menchaca (1974, 1975), Venereo (1976), Caballero (1979) and Menchaca and Torres (1985) contributed tables of sample sizes and numbers of replications for analysis of variance models associated with completely randomized, randomized block, Latin square and changeover designs. The tables take into account the maximum standardized difference between two means (Δ), the number of treatments (t), the level of significance (α) and the power of the test (1-β). They are valuable working tools for researchers from different branches. Currently, with the advance of computer science, statistical packages such as InfoStat, G*Power and SPSS, among others, include the calculation of power.
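How Δ, α and 1-β combine into a number of replications can be sketched with the usual normal approximation for comparing two means. This is a simplification and not the method behind the cited tables: it ignores the number of treatments (t) and uses the normal distribution instead of the noncentral F. The function name and default values are hypothetical.

```python
import math
from statistics import NormalDist


def replications_two_means(delta, alpha=0.05, power=0.80):
    """Approximate replications per treatment to detect a standardized
    difference `delta` between two means with a two-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # quantile for the significance level
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / delta ** 2)
```

Under these assumptions, `replications_two_means(1.0)` gives 16 replications per treatment, and halving Δ roughly quadruples the requirement.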

Given this background, it was decided to analyze and interpret statistical power in a different way, through an integral analysis of the numerical and categorical variables that may affect it, associated with statistical indicators and experimental design. For this purpose, categorical regression analysis (CATREG) was considered appropriate.

The objective is to establish criteria of theoretical and practical value, from an integral analysis of variables associated with statistical power, in fixed-effects analysis of variance models (parametric and non-parametric).

Materials and Methods

The information analyzed was selected from databases with numerical variables (counts and proportions) processed between 2003 and 2011 by the Biomathematics department of the Instituto de Ciencia Animal, located in San José de las Lajas, Mayabeque province. The data correspond to independent studies in the areas of birds, pigs, grasses and ruminants, with balanced completely randomized (CRD) and randomized block (RBD) designs.

The experimental results were analyzed with the corresponding parametric analysis of variance models (one-way and two-way) and the equivalent non-parametric tests, Kruskal-Wallis and Friedman, respectively. In each case, the following were recorded: the probability of type I error of the parametric and non-parametric tests, the distribution of the response variable (in general Normal, Binomial or Poisson), the power of Fisher's F test, the sample size in the experimental design and the fulfilment of the basic assumptions of the ANAVA. The rest of the variables correspond to the experimental design.
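For readers unfamiliar with the non-parametric counterpart of the one-way analysis, the Kruskal-Wallis statistic for a CRD can be sketched as follows. The sketch is illustrative only: it omits the correction for ties, and the p-value helper is valid only for three treatments (two degrees of freedom), where the chi-square upper tail has a closed form. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for treatment groups."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def chi2_upper_tail_df2(x):
    """Upper-tail chi-square probability for exactly 2 degrees of freedom."""
    return math.exp(-x / 2)
```

With three treatments, `chi2_upper_tail_df2(kruskal_wallis_h(groups))` gives the asymptotic probability of type I error that would be compared with the one from the F test.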

The number of experiments and variables by experimental design are:

  • CRD: 16 experiments and 66 variables

  • RBD: 5 experiments and 34 variables

A data matrix was formed with the following variables:

  1. Type of experiment

  2. Type of experimental design

  3. Number of treatments

  4. Fulfilment of the theoretical assumptions

  5. Probability of type I error of Fisher's F test

  6. Probability of type I error of the non-parametric equivalent tests

  7. Power of Fisher's F test

  8. Sample size in the experimental design

  9. Variable distribution

To establish the relation of the power of the Fisher test with the rest of the variables, the categorical regression analysis was applied, due to the presence of numerical variables (3 and 5 to 8) and categorical variables (1, 2, 4 and 9).
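One row of such a data matrix might be represented as in the sketch below. The field names, the category labels and their integer codes are hypothetical; in particular, treating the type of experiment as the research area is an assumption. The only constraint taken from the method is that categorical variables are coded as positive integers.

```python
from dataclasses import dataclass

# Hypothetical integer codes; the analysis only needs positive integers.
EXPERIMENT_CODES = {"birds": 1, "pigs": 2, "grasses": 3, "ruminants": 4}
DESIGN_CODES = {"CRD": 1, "RBD": 2}
ASSUMPTION_CODES = {"not fulfilled": 1, "fulfilled": 2}
DISTRIBUTION_CODES = {"Normal": 1, "Binomial": 2, "Poisson": 3}


@dataclass
class MatrixRow:
    experiment_type: int          # variable 1, categorical
    design: int                   # variable 2, categorical
    n_treatments: int             # variable 3, numerical
    assumptions: int              # variable 4, categorical
    p_value_f: float              # variable 5, numerical
    p_value_nonparametric: float  # variable 6, numerical
    power_f: float                # variable 7, numerical (response)
    sample_size: int              # variable 8, numerical
    distribution: int             # variable 9, categorical


def make_row(experiment, design, n_treatments, assumptions,
             p_f, p_np, power, sample_size, distribution):
    """Build a row, translating category labels into integer codes."""
    return MatrixRow(
        EXPERIMENT_CODES[experiment], DESIGN_CODES[design], n_treatments,
        ASSUMPTION_CODES[assumptions], p_f, p_np, power, sample_size,
        DISTRIBUTION_CODES[distribution],
    )
```

For example, `make_row("pigs", "RBD", 4, "fulfilled", 0.012, 0.020, 0.85, 20, "Binomial")` builds one record with made-up values.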

In accordance with the criteria of Meulman and Heiser (2010) and the applications made by Navarro et al. (2010), Vázquez (2012) and Guerra et al. (2014), in the agricultural sciences and other fields, the general characteristics of CATREG are summarized as follows:

  • It is based on the Optimal Scaling methodology proposed by the Dutch school of data scaling, with numerous contributions from the Data Theory Scaling System Group at Leiden University, the Netherlands, and implemented, with credit to this group, in different versions of IBM SPSS Statistics.

  • It is suitable for data that are difficult or impossible to analyze using the classical statistics methods.

  • The optimal quantifications of each variable are obtained by the alternating least squares method (minimizes the loss of information function).

  • The quantifications of each variable are improved by iterative procedures.

  • The quantified variables have metric properties.

  • It extends the classical method of regression analysis, through the optimal scaling of nominal, ordinal and numerical variables, simultaneously.

  • A linear regression equation optimal for the transformed variables is obtained.

  • The estimated regression coefficients reflect the changes produced by the predictor variables in the response variable.

  • The optimal quantifications reflect the characteristics of the original variables.

  • The indicators of the categorical variables must be positive integers.

  • It allows only one response variable and a maximum of 200 predictor variables.
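The idea of alternating least squares with optimal scaling can be sketched for the simplest case: a numerical response and nominal predictors. This is a simplified backfitting illustration under stated assumptions, not the algorithm implemented in SPSS: it has no ordinal or spline restrictions, no convergence check and no transformation of the response. All names are hypothetical.

```python
import math


def standardize(values):
    """Center and scale to mean 0 and variance 1 (zeros if constant)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        return [0.0] * n
    return [(v - mean) / sd for v in values]


def catreg_sketch(y, predictors, iterations=50):
    """Alternate between quantifying categories and updating coefficients.

    y: numerical response; predictors: columns of positive-integer codes.
    Returns (betas, quantified predictors, approximate R squared).
    """
    n = len(y)
    z = standardize(y)
    quantified = [standardize([float(c) for c in col]) for col in predictors]
    betas = [0.0] * len(predictors)
    for _ in range(iterations):
        for j, col in enumerate(predictors):
            # Part of the response not explained by the other predictors.
            target = [
                z[i] - sum(betas[k] * quantified[k][i]
                           for k in range(len(predictors)) if k != j)
                for i in range(n)
            ]
            # Nominal scaling: each category receives the mean of its targets.
            sums, counts = {}, {}
            for code, t in zip(col, target):
                sums[code] = sums.get(code, 0.0) + t
                counts[code] = counts.get(code, 0) + 1
            quantified[j] = standardize([sums[c] / counts[c] for c in col])
            betas[j] = sum(q * t for q, t in zip(quantified[j], target)) / n
    fitted = [sum(b * q[i] for b, q in zip(betas, quantified))
              for i in range(n)]
    r_squared = 1 - sum((a - f) ** 2 for a, f in zip(z, fitted)) / n
    return betas, quantified, r_squared
```

With the response `[1, 1, 5, 5, 3, 3]` and a single predictor coded `[1, 1, 2, 2, 3, 3]`, the quantified categories reproduce the response exactly, whereas a linear regression on the raw codes would not, because the relation is not monotone in the codes.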

The CATREG analysis includes characteristic aspects of classical regression analysis: the coefficient of determination (R²), the analysis of variance of the regression and the significance of the model parameters. As complementary elements, the following indicators are included:

  • Importance: measure of the relative importance of the predictor variables, given by Pratt (1987).

  • Tolerance: represents the proportion of the variation of each predictor variable, which is not explained by the others. It represents a protection against multicollinearity (Hair et al. 1999).
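Both indicators can be written compactly. The sketch below assumes standardized regression coefficients, for which R² equals the sum of the products of each coefficient and its zero-order correlation with the response; Pratt's importance is each product divided by R². The tolerance helper covers only the two-predictor case, where it reduces to one minus the squared correlation between the predictors. Function names and numbers are hypothetical.

```python
def pratt_importance(betas, zero_order_correlations):
    """Relative importance: beta_j * r_j divided by R squared."""
    products = [b * r for b, r in zip(betas, zero_order_correlations)]
    r_squared = sum(products)  # holds for standardized coefficients
    return [p / r_squared for p in products]


def tolerance_two_predictors(r12):
    """Share of a predictor's variance not explained by the other one."""
    return 1 - r12 ** 2
```

For instance, coefficients of 0.6 and 0.2 with zero-order correlations of 0.7 and 0.4 give importances of 0.84 and 0.16, which add up to one.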

The information was processed with the optimal scaling option (CATREG) of the regression analysis in the statistical software SPSS, version 22.0 (2013).

Results and Discussion

In the application of the categorical regression analysis, power was considered as the dependent variable, in order to analyze its relation with the rest of the variables. Table 1 shows the high value of the coefficient of determination (R²), accompanied by the significance of the model, which indicates that power is well explained by the analyzed variables.

Table 1 includes the standardized partial regression coefficients with their statistical significance for the parametric ANAVA. Except for the probability of type I error (p-value), the indicators do not show statistically significant contributions to the explanation of power (p > 0.05). Sample size should nevertheless be taken into account in future research, because of its known incidence on power and the significance shown (p = 0.065).

Table 1 Fit of the CATREG model and significance of the regression coefficients with the parametric ANAVA 

The significant and negative incidence of the probability of type I error in the power, when the rest of the variables remain constant, is corroborated by the high negative correlations of zero and partial order, which are observed in table 2, between the probability of type I error and power.

Table 2 Correlations and other indicators of the CATREG with the parametric ANAVA. 

These results agree with those reported in the specialized literature, specifically by Mood and Graybill (1972) and Rodríguez (2008), who denote the power function as:

From this analysis it is concluded that a powerful test must show a low probability of type I error (rejecting a true null hypothesis) and high power (rejecting a false null hypothesis), a situation that must correspond to the characteristics of the test, in this case Fisher's F test.
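For reference, the power function of a test is usually written as below. This is the standard textbook form, given here as an assumption; the symbols (π for the power function, C for the critical region, Θ₀ and Θ₁ for the parameter sets under the null and alternative hypotheses) are not taken from the cited sources.

```latex
\pi(\theta) = P_{\theta}\left(X \in C\right), \qquad
\sup_{\theta \in \Theta_0} \pi(\theta) = \alpha, \qquad
\pi(\theta) = 1 - \beta(\theta) \quad \text{for } \theta \in \Theta_1 .
```

Under this notation, the probability of type I error and the power are values of the same function evaluated under the null and the alternative hypothesis, respectively.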

These high negative correlations are corroborated by the results of a simulation study conducted by Vásquez (2011), which includes the analysis of power and the probability of type I error under the assumption of a binomial distribution, and by the study of Herrera (2014) in animal science research.

Table 2 highlights the probability of type I error for its importance. Its tolerance indicates that 76.6 % of its variability is not explained by the other predictors. Tolerance is low for the number of treatments and the type of design, and very high for the fulfilment of the assumptions and the distribution of the variable.

Although the distribution of the analyzed variables, which largely correspond to the normal, binomial and Poisson distributions, does not show an appreciable incidence on power, it should be examined in future research, as should sample size.

Because the decisions based on the probability of type I error of Fisher's F test in the ANAVA coincided to a high degree with those of its non-parametric equivalents (Kruskal-Wallis and Friedman tests), a comparative analysis of the probabilities of type I error was carried out using Student's t test for paired samples (table 3).

As can be seen in the table, there are no statistically significant differences (P > 0.05) between the probabilities of type I error in each design.
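The paired comparison can be sketched as follows, pairing the probability of type I error of the F test with that of the non-parametric test for each analyzed variable. The function name is hypothetical, and the sketch returns only the t statistic and its degrees of freedom; the p-value would come from Student's t distribution, which is not implemented here.

```python
import math
import statistics


def paired_t(first, second):
    """Student's t statistic and degrees of freedom for paired samples."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

A value of t close to zero for the available degrees of freedom is what a result of no significant difference between the two sets of probabilities corresponds to.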

Table 3 Comparison of probabilities of type I error for both designs. 

Given this result, it was decided to evaluate the performance of the CATREG model when the values of the probability of type I error of Fisher's F test are replaced by those of the Kruskal-Wallis and Friedman tests (according to the design). This variable is identified in this analysis as probability of type I error (NP). The results are shown in tables 4 and 5. A very significant and negative incidence of the probability of type I error (NP) on power is maintained when the rest of the variables or indicators are held constant, very similar to that reported in table 1.

The results of table 4 are corroborated by the zero-order and partial correlations in table 5. Regarding importance and tolerance (table 5), the probability of type I error (NP) again stands out for its importance with respect to the rest of the indicators, although its tolerance is lower (69.7 %). Likewise, the fulfilment of the assumptions and the distribution of the variable show high tolerance.

From the analyses carried out, the indicator with a strong negative relation with power is the probability of type I error in the ANAVA models (parametric and non-parametric). That is, low values of the probability of type I error are associated with high values of power.
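One way to see why a posteriori power and the probability of type I error move in opposite directions is that, when power is computed from the observed effect, it becomes a decreasing function of the observed p-value. The sketch below shows this for a two-sided z test, used here as a stand-in for the F test; that substitution and the function name are assumptions made for illustration.

```python
from statistics import NormalDist

_Z = NormalDist()


def observed_power_from_p(p_value, alpha=0.05):
    """A posteriori power of a two-sided z test, taking the observed
    standardized effect as if it were the true effect."""
    z_obs = _Z.inv_cdf(1 - p_value / 2)   # observed |z| implied by the p-value
    z_crit = _Z.inv_cdf(1 - alpha / 2)    # critical value at level alpha
    return (1 - _Z.cdf(z_crit - z_obs)) + _Z.cdf(-z_crit - z_obs)
```

Under these assumptions, a p-value equal to α gives a power of about 0.5, smaller p-values give higher power and a p-value of 1 gives a power equal to α.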

Table 4 Fit of the CATREG model and significance of the regression coefficients with the non-parametric ANAVA. 

Table 5 Correlations and other indicators of the CATREG with the non-parametric ANAVA 

It is concluded that the categorical regression model is an alternative analysis tool to assess the incidence of each of the selected statistical and experimental design variables on statistical power in fixed-effects analysis of variance models, through Fisher's F test and its non-parametric equivalents (Kruskal-Wallis and Friedman tests). The probability of type I error is identified as the most important indicator contributing to the explanation of power, and a strong negative relation between power and the probability of type I error is shown in the analysis of variance models (parametric and non-parametric).

It is recommended to examine in more depth sample size, the distribution of the analyzed variable and the power-efficiency criterion, in relation to the probability of type I error and power, and to incorporate the results of the generalized linear model into the analysis as another alternative to be evaluated.

References

Bono, R. & Arnau, J. 1995. Consideraciones generales en torno a los estudios de potencia. Revista Anales de Psicología. 11(1): 193-202.

Caballero, A. 1979. Tamaños de muestras en diseños completamente aleatorizados y bloques al azar donde la unidad experimental esté formada por grupos de animales. Revista Cubana de Ciencia Agrícola. 13(3): 225-235.

Cabrera, A., Guerra, C. W., Herrera, M. & Suris, M. 2012. Non-parametric statistical methods and data transformations in agricultural pest population studies. Chilean Journal of Agricultural Research. 72(3): 440-443.

Cristo, M. 2001. Comportamiento de las dócimas no paramétricas respecto a las paramétricas en distribuciones no normales. Master Thesis. Universidad Central de Las Villas. Cuba.

De Calzadilla, J. 1999. Procedimientos de la Estadística no paramétrica. Aplicaciones en las Ciencias Agropecuarias. Master Thesis. Cuba.

De Calzadilla, J., Guerra, W. & Torres, V. 2002. El uso y abuso de transformaciones matemáticas. Aplicaciones en modelos de análisis de varianza. Revista Cubana de Ciencia Agrícola. 36(1): 103-106.

Guerra, C. W., De Calzadilla, J. & Torres, V. 2000. Índice de eficiencia en relación con procedimientos de la estadística no paramétrica. Revista Cubana de Ciencia Agrícola. 34(1): 1-4.

Guerra, C. W., Herrera, M., Vázquez, Y. & Quintero, A. B. 2014. Contribución de la Estadística al análisis de variables categóricas: Aplicación del Análisis de Regresión Categórica en las Ciencias Agropecuarias. Revista Ciencias Técnicas Agropecuarias. 23(1): 68-73.

Hair, J. F., Anderson, R. E., Tatham, R. L. & Black, W. C. 1999. Análisis Multivariante. Prentice Hall Iberia. Madrid. España. 799 p.

Herrera, M. 2014. Métodos Estadísticos alternativos de análisis con variables discretas y categóricas en investigaciones agropecuarias. PhD Thesis. ICA. Mayabeque. Cuba. 100 p.

Menchaca, M. A. 1974. Tablas útiles para determinar tamaños de muestras en diseño de Clasificación Simple y de Bloques al Azar. Revista Cubana de Ciencia Agrícola. 8(1): 111-116.

Menchaca, M. A. 1975. Determinación de tamaños de muestra en diseños Cuadrados Latinos. Revista Cubana de Ciencia Agrícola. 9(1): 1-3.

Menchaca, M. A. & Torres, V. 1985. Tablas de uso frecuente en la Bioestadística. Instituto de Ciencia Animal. Cuba.

Meulman, J. J. & Heiser, W. J. 2010. SPSS version 19.0 for Windows. Categories. Statistical Package for the Social Sciences.

Mood, A. M. & Graybill, F. A. 1972. Introducción a la teoría de la Estadística. Ediciones Aguilar S. A. Madrid. España. 536 p.

Navarro, J. M., Casa, G. & González, E. 2010. Análisis de Componentes Principales y Análisis de Regresión para datos categóricos. Aplicación en la Hipertensión Arterial. Revista de Matemática. Teorías y Aplicaciones. 17(2): 199-230.

Pratt, J. W. 1987. Dividing the indivisible: Using simple symmetry to partition variance explained. In: Proceedings of the Second International Conference in Statistics, T. Pukkila and S. Puntanen, eds. Tampere, Finland: University of Tampere.

Rodríguez, F. 2008. Estudio de métodos no paramétricos. Informe de pasantías presentado como requisito para optar al título de Licenciado en Matemática Mención Probabilidad y Estadística. Universidad Nacional Abierta, Centro Local Metropolitano. Caracas, Venezuela.

Scheffé, H. 1959. The Analysis of Variance. John Wiley & Sons, Inc, New York. 477 p.

Siegel, S. & Castellan, N. J. 1995. Estadística no paramétrica aplicada a las Ciencias de la Conducta. Cuarta edición. Editorial Trillas, México. p 57.

SPSS 2013. Version 22.0 for Windows. Statistical Package for the Social Sciences. IBM.

Vásquez, R. E. 2011. Contribución al tratamiento estadístico de datos con distribución Binomial en el Modelo de Análisis de Varianza. PhD Thesis. Instituto Nacional de Ciencias Agrícolas. Cuba.

Vázquez, Y. 2012. Modelación Estadístico-Matemática con variables mixtas para el estudio de la sostenibilidad social en una Empresa ganadera bovina. PhD Thesis. ICA. Mayabeque. Cuba. 100 p.

Venereo, A. 1976. Número de réplicas en diseños cuadrados latinos balanceados para la estimación de efectos residuales. Revista Cubana de Ciencia Agrícola. 10(3): 237-246.

Received: June 14, 2018; Accepted: January 18, 2019
