Choosing a stochastic probability distribution to fit your data type

Kelsey Martinez, PhD, Research Manager
July 21, 2021

Red dice. Photo by Naser Tamimi on Unsplash

Regression analysis using R, Python, or other statistical modeling software is commonplace in social research. After you’ve decided to run a regression analysis, the first thing you usually tackle is figuring out the stochastic probability distribution that fits your data type. This is a critical step since the stochastic distribution you choose has a huge impact on the final modeled effect parameters.

In this blog post, I’ll go over a few common data types and the stochastic distributions that usually work for modeling them. We will specifically be discussing dependent variables, or “y” variables. This post will not cover independent variable types, data cleaning, or assessment of model fit. Instead, it is meant to be a primer on appropriate distributions for common data types.

Given my extensive training in ecological modeling, the best reference I can recommend for more info on this topic is Ecological Models and Data in R by Ben Bolker. Ben Bolker is a respected stats and R authority on many online stats help forums and is an author of the R package lme4. Chapter 4 of his book covers applied probability and stochastic distributions in great detail.

Unbounded continuous data

Say you’re looking at annual wages in dollar value for a few sets of people with different demographic characteristics. You want to know if there is a significant difference in wages between these sets of people. Continuous dependent variables generally work just fine with a standard linear regression model. The stochastic probability distribution for a plain old linear regression is a normal distribution. In R, you fit this model with the lm() function. After running your model, you will need to check that the errors (residuals) are approximately normally distributed.
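
As a minimal sketch of that workflow, the snippet below fits a linear model and inspects the residuals. The wage data are simulated, and the data frame, group labels, and column names are made up for illustration; they are not from a real survey.

```r
# Simulated (hypothetical) wage data for two groups of people.
set.seed(42)
wages <- data.frame(
  group = rep(c("A", "B"), each = 100),
  wage  = c(rnorm(100, mean = 50000, sd = 8000),
            rnorm(100, mean = 55000, sd = 8000))
)

# lm() fits a linear regression, which assumes normally distributed errors.
fit <- lm(wage ~ group, data = wages)
summary(fit)

# Check the residuals for approximate normality.
res <- residuals(fit)
qqnorm(res)          # points should fall close to the reference line
qqline(res)
shapiro.test(res)    # formal test; interpret with care in large samples
```

The Q-Q plot is usually the more informative of the two checks, since formal normality tests tend to flag trivial departures once the sample gets large.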

Count data (discrete)

Count data generally require a bit more thought and a deeper understanding of your data than a continuous response does. Count data are discrete. You will generally consider two stochastic probability distributions if your dependent variable is discrete count data: the Poisson distribution and the negative binomial distribution. The Poisson distribution has a single parameter, so its mean and variance are the same by definition. The negative binomial distribution has two parameters, a mean and a dispersion parameter, which lets the variance be larger than the mean. If your count data are overdispersed (the variance is greater than the mean), you’ll need to use a negative binomial model.
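
To make the mean and variance relationship concrete, here is a small simulation using base R only. The parameter values (a mean of 4 and a size of 1) are arbitrary choices for illustration.

```r
set.seed(1)
n <- 10000

# Poisson: one parameter (lambda), which is both the mean and the variance.
y_pois <- rpois(n, lambda = 4)

# Negative binomial: a mean (mu) plus a dispersion parameter (size).
# Its variance is mu + mu^2 / size, so it exceeds the mean.
y_nb <- rnbinom(n, mu = 4, size = 1)

c(mean = mean(y_pois), variance = var(y_pois))  # roughly equal
c(mean = mean(y_nb),   variance = var(y_nb))    # variance well above the mean
```

Running the same comparison on your own dependent variable is a quick first look at overdispersion, although the check that matters is made on the fitted model rather than on the raw counts.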

There also exists a binomial distribution for count data with a fixed upper limit (the number of successes out of a known number of trials), but I’ve never encountered it used for counts in a real-world situation. In the binomial distribution, the variance is always less than the mean.

To fit a Poisson model, use the glm() function in R and specify the distribution and link through the family argument, for example family = poisson(link = "log").

To fit a negative binomial model, use the glm.nb() function from the MASS package in R.

Zero-inflated Poisson and negative binomial models are available if your count data contain more zeros than those distributions would predict. However, the justification for using zero-inflated models is fuzzy and open to lots of interpretation. See this site for more details on zero-inflated models.
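
The sketch below puts the Poisson and negative binomial fits side by side. The counts are simulated to be overdispersed, and the predictor x and the data frame name are hypothetical. The dispersion ratio is one common rule of thumb, not the only way to check for overdispersion.

```r
library(MASS)  # provides glm.nb(); MASS ships with standard R installations

# Simulated (hypothetical) overdispersed counts with one made-up predictor.
set.seed(7)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, mu = exp(0.5 + 0.8 * x), size = 1.5)
counts <- data.frame(x = x, y = y)

# Poisson regression: distribution and link go in the family argument.
fit_pois <- glm(y ~ x, family = poisson(link = "log"), data = counts)

# Rough overdispersion check: Pearson chi-square divided by residual df.
# Values well above 1 suggest the Poisson variance assumption is too strict.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)
dispersion

# Negative binomial regression estimates an extra dispersion parameter (theta).
fit_nb <- glm.nb(y ~ x, data = counts)
summary(fit_nb)

# Lower AIC indicates the better-supported model for these data.
AIC(fit_pois, fit_nb)
```

Because these counts were generated from a negative binomial distribution, the dispersion ratio comes out above 1 and the negative binomial model has the lower AIC.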

Proportion data (continuous)

Even though proportion data are continuous, it is generally not advisable to use a linear model with a normal distribution to analyze them, because proportions are bounded at 0 and 1 by definition while a normal distribution is unbounded. The beta distribution is frequently used to model proportion data. It is continuous and bounded at 0 and 1, although the standard beta regression requires observed values strictly between 0 and 1. Use the betareg() function from the betareg package in R to perform beta regressions.
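
Here is a minimal beta regression sketch. It assumes the betareg package is installed (it is not part of a default R installation), and the proportions, predictor, and parameter values are simulated for illustration only.

```r
# Assumes betareg is installed: install.packages("betareg")
library(betareg)

# Simulated (hypothetical) proportions that depend on one made-up predictor.
set.seed(3)
n   <- 300
x   <- runif(n, min = -1, max = 1)
mu  <- plogis(-0.5 + 1.0 * x)   # mean proportion, strictly inside (0, 1)
phi <- 20                       # precision parameter of the beta distribution
prop  <- rbeta(n, shape1 = mu * phi, shape2 = (1 - mu) * phi)
props <- data.frame(x = x, prop = prop)

# betareg() models the mean with a logit link by default.
fit_beta <- betareg(prop ~ x, data = props)
summary(fit_beta)

# Fitted values stay inside (0, 1), unlike predictions from lm().
range(fitted(fit_beta))
```

If your proportions include exact 0s or 1s, this model will not fit as written, and you would need to decide how to handle those values first.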

Binary data (discrete)

Discrete binary data are usually a fairly easy data type for folks in the social research realm. Binary response data call for a logistic regression model. Logistic regression uses a binomial stochastic probability distribution since each “trial”, or data point, is a Bernoulli trial with two possible outcomes. In R, you’ll need to use the glm() function and specify family = binomial(link = "logit") to run a logistic regression.
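
A short sketch of that call is below. The binary outcome (named voted here) and the predictor are simulated and hypothetical.

```r
# Simulated (hypothetical) binary outcome with one made-up predictor.
set.seed(11)
n <- 400
x <- rnorm(n)
p <- plogis(-0.3 + 1.2 * x)               # true probability of a "1"
voted  <- rbinom(n, size = 1, prob = p)   # each row is one Bernoulli trial
survey <- data.frame(x = x, voted = voted)

# Logistic regression: binomial family with a logit link.
fit_logit <- glm(voted ~ x, family = binomial(link = "logit"), data = survey)
summary(fit_logit)

exp(coef(fit_logit))                          # coefficients as odds ratios
head(predict(fit_logit, type = "response"))   # fitted probabilities
```

The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to report.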