A study on the effects of unbalanced data when fitting logistic regression models in ecology

First Author
Salas-Eljatib, Christian
Co-authors
Fuentes-Ramirez, Andres#Gregoire, Timothy G.#Altamirano, Adison#Yaitul, Valeska
Title
A study on the effects of unbalanced data when fitting logistic regression models in ecology
Publisher
ELSEVIER SCIENCE BV
Journal
ECOLOGICAL INDICATORS
Language
en
Abstract
Binary variables have two possible outcomes: occurrence or non-occurrence of an event (usually coded as 1 and 0, respectively). Binary data are common in ecology, including studies of presence/absence, alive/dead, and change/no-change. Logistic regression analysis has been widely used to model binary response variables. Unbalanced data (i.e., a much larger proportion of zeros than ones) are often found across a variety of ecological datasets. Sometimes the data are balanced (i.e., made to contain the same number of zeros and ones) before fitting the model; however, the statistical implications of balancing (or not balancing) the data remain unclear. We assessed the statistical effects of balancing data when fitting a logistic regression model by studying both the statistical properties of the estimated parameters and the model's predictive capabilities. We used a base forest-mortality model as a reference and, using stochastic simulations representing different configurations of 0/1 data in a sample (unbalanced-data scenarios), fitted the logistic regression model by maximum likelihood. For each scenario we computed the bias and variance of the estimated parameters and several prediction indexes. We found that the variability of the estimated parameters is affected, with the balanced-data scenario having the lowest variability, which in turn affects the statistical inference. Furthermore, the predictive capabilities of the model are altered by balancing the data, with the balanced-data scenario having the best sensitivity/specificity ratio. Balancing (or not balancing) the data used to fit a logistic regression model may affect the conclusions that can be drawn from the fitted model and its subsequent applications.
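The simulation design summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' forest-mortality model: the single covariate, the true parameters (`TRUE_B0 = -3.0`, `TRUE_B1 = 1.0`), the population size, the sample size of 200, and the 100 replicates are all assumptions chosen for illustration. The sketch draws samples with a fixed proportion of ones from a simulated population, fits the model by maximum likelihood (Newton-Raphson), and reports the bias and variance of the estimated slope for an unbalanced (10% ones) and a balanced (50% ones) scenario.

```python
import math
import random
import statistics

TRUE_B0, TRUE_B1 = -3.0, 1.0  # assumed "true" parameters (illustrative only)


def sigmoid(eta):
    # Clamp the linear predictor to avoid overflow in math.exp.
    eta = max(-30.0, min(30.0, eta))
    return 1.0 / (1.0 + math.exp(-eta))


def make_population(size, rng):
    """Simulate (x, y) pairs from a logistic model with the assumed parameters."""
    pop = []
    for _ in range(size):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(TRUE_B0 + TRUE_B1 * x) else 0
        pop.append((x, y))
    return pop


def fit_logistic(data, max_iter=50, tol=1e-8):
    """Maximum-likelihood fit of logit(p) = b0 + b1*x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # entries of the information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:       # singular information matrix: stop
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def simulate(prop_ones, n=200, reps=100, seed=1):
    """Bias and variance of the slope estimate for a given share of ones."""
    rng = random.Random(seed)
    pop = make_population(20000, rng)
    ones = [d for d in pop if d[1] == 1]
    zeros = [d for d in pop if d[1] == 0]
    k = round(prop_ones * n)       # number of ones forced into each sample
    slopes = []
    for _ in range(reps):
        sample = rng.sample(ones, k) + rng.sample(zeros, n - k)
        slopes.append(fit_logistic(sample)[1])
    mean = statistics.mean(slopes)
    return {"mean": mean, "bias": mean - TRUE_B1,
            "variance": statistics.variance(slopes)}


results = {p: simulate(p) for p in (0.1, 0.5)}
for p, r in results.items():
    print(f"proportion of ones = {p}: bias = {r['bias']:.3f}, "
          f"variance = {r['variance']:.4f}")
```

Only the slope is summarized here because sampling a fixed proportion of ones shifts the intercept by construction. Prediction indexes such as sensitivity and specificity could be added by classifying held-out observations at a chosen probability threshold.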
Resource Type
Original article
Description
This study was supported by the Chilean research grant Fondecyt No. 1151495. AFR is supported by a Postdoctoral Scholarship from Vicerrectoria de Investigacion y Postgrado, Universidad de La Frontera, Temuco, Chile.
doi
10.1016/j.ecolind.2017.10.030
Resource Format
pdf
Keywords
Statistical inference# Model prediction# Logit model# Binary variable# Bias# Precision
File Location
http://dx.doi.org/10.1016/j.ecolind.2017.10.030
OECD Category
Biodiversity Conservation# Environmental Sciences
Subjects
Statistical inference# Model prediction# Logistic model# Binary variable# Bias# Precision
Web of Science ID
WOS:000430634500051
Citation Title (Recommended, unique)
A study on the effects of unbalanced data when fitting logistic regression models in ecology
Resource Identifier (Mandatory, unique)
Original article
Resource Version (Recommended, unique)
published version
Journal/Book
ECOLOGICAL INDICATORS
WOS Category
Biodiversity Conservation# Environmental Sciences
ISSN
1470-160X
Language
en
Funder Reference (Mandatory if applicable, repeatable)
ANID FONDECYT 1151495
Format
pdf
Route Type
subscription#green
Access Rights
metadata
Start Page (Recommended, unique)
1495
End Page (Recommended, unique)
1501