What has recent experience taught me?

I’ve learned some hard lessons in the past year: some statistical, some programmatic, some managerial, some psychological. The statistical ones in particular have better equipped me to solve the puzzle that is our world. Sources: Statistical Rethinking by Richard McElreath, Probabilistic Programming & Bayesian Methods for Hackers by Cameron Davidson-Pilon, PyMC3 talks, and datacamp.com.

1: The difference between analytics and data science.

In the summer of 2020 I went into a project expecting to do a quick analytical consultation for a small business trying to understand its membership retention. A Managing Director of a consultancy that helps organizations integrate statistical modeling into their decision-making would guide me. At the time, my view of modelling in data science was fitting a scikit-learn object on training data and comparing predicted values with actual values to see how well we could predict the future.

I soon began learning about generative models and probabilistic programming, and now have a better understanding about what statistical models are really doing: revealing all the ways that the data you observed could have been produced. A statistical model is like any other kind of model in that you’re recreating the system of interest by mapping one set of variables onto another set of variables through a probability distribution. Fundamentally, they define the ways values of some variables can arise, given values of other variables. There are two modes of thinking that help build this understanding:

  1. Thinking generatively, or thinking in data-generating stories, which means asking yourself “how did this data happen?” As Cameron Davidson-Pilon writes in Probabilistic Programming & Bayesian Methods for Hackers, “position yourself in an omniscient position, and try to imagine how you would recreate the dataset.” Simulating a story of how the data came about helps identify which type of distribution applies to a given situation (a distribution is just a mathematical description of outcomes). Building a model that explains the data gives you a much better chance of really understanding the problem. And sometimes, a qualitative understanding of the phenomenon is more important than getting quantitatively correct numbers out of it. Seeing the underlying story allows you to get to the root of the problem.

  2. Thinking probabilistically: outcomes of measurements follow probability distributions defined by the story of how the data came to be. To concisely and unambiguously describe our data and draw conclusions from it, we need to estimate the parameters of the probability distribution that best describes the story of the data. The distribution contains all the information we have about a random variable: it assigns a probability (or density) to each possible value that variable can take. I now have a handful of distributions at my disposal to compose into a model; these are the guidelines I use to match them to different types of situations (a precise definition is best described by parameters, which I’m intentionally not going into here):

The most fundamental natural distributions used in statistical modeling are members of the exponential family, and all of its members are important because they show up throughout the world. The most common and important is, of course, the normal distribution. Though not universally useful, it provides an excellent approximation to many practical situations and has a strong claim to being foundational: once you have learned to build and interpret it, you can more easily move on to distributions that are less normal. Classic things to model with it: people’s heights or SAT scores, for example.

If you want to restrict the values you’re modelling to positive values only, consider the log-normal distribution instead. It is intimately related to the normal distribution: it describes a random variable whose logarithm is normally distributed. Incomes are classically log-normally distributed, as are stock prices and the lengths of chess games.

If a variable can be classified as either a success or a failure, it’s Bernoulli distributed. A coin flip is the prototypical example of a Bernoulli trial. A student randomly guessing on a four-choice multiple-choice test, or a voter survey, approximately follows a sequence of Bernoulli trials, as the responses can be assumed to be independent.

More formally, the number of successes in a fixed number of identical, independent Bernoulli trials (that is, each trial has the same probability of success, and the result of one trial does not affect the others) is described by the binomial distribution. The number of sixes in 100 rolls of a die is such an example, as each roll has a defined success (rolling a six) and failure (rolling something other than a six).

A distribution that represents the number of trials needed to get the first success in a sequence of Bernoulli trials (or, in an alternative convention, the number of failures before it) is the geometric distribution. For instance, the number of times a coin needs to be flipped until the first head appears can be described by a geometric distribution.

The Poisson distribution is one of the most versatile and widely used distributions. It represents the number of events occurring over a specific time period, given that several conditions hold: an event has the same probability of occurring at any point in the time period; each event is independent; the average number of events per time period is known; and any number of events could occur during the time period being considered. Example: the number of cars traveling along a road in a given hour. But the number of cars traveling along the road in a given day is not Poisson-distributed, as certain times (such as rush hours) are likely to be busier than others.

If the number of events per unit time follows a Poisson distribution, then the amount of time between events follows the exponential distribution. The Poisson distribution is discrete and the exponential distribution is continuous, yet the two are closely related. For example, flashes of lightning are well described by a Poisson process, so the time between two flashes of lightning is exponentially distributed.

The gamma distribution is used to model the total waiting time until an event occurs; it generalizes the exponential distribution, which models the time until (or between) single events. The gamma distribution is therefore often used to model waiting times; an insurance company may use it to model a lifespan, where the event is death.

And if you think all values of a variable are equally likely, one of the simplest probability distributions to use is the uniform distribution; it’s also a common noninformative flat prior. A spinner that’s equally likely to land on any of its color sections is modeled by a uniform distribution.
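To make these data-generating stories concrete, here is a small simulation sketch (mine, not from any of the sources above) that acts out a few of them with NumPy; all of the rates and sizes are invented for illustration:

```python
# A small simulation sketch of a few data-generating stories from above.
# All scenarios and numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(42)

# Normal: heights of 1,000 adults, in cm (illustrative mean and sd).
heights = rng.normal(loc=170, scale=8, size=1000)

# Log-normal: incomes -- the log of income is normally distributed.
incomes = rng.lognormal(mean=10, sigma=0.6, size=1000)

# Binomial: number of sixes in 100 rolls of a fair die (100 Bernoulli trials).
sixes = rng.binomial(n=100, p=1 / 6)

# Geometric: how many flips until the first head appears.
flips_until_head = rng.geometric(p=0.5)

# Poisson: cars passing in one hour, given an average rate of 30 per hour.
cars_per_hour = rng.poisson(lam=30)

# Exponential: minutes between those cars (mean gap = 60 / 30 = 2 minutes).
gap_minutes = rng.exponential(scale=2.0)

# Uniform: a spinner equally likely to land anywhere on [0, 1).
spin = rng.uniform(0, 1)
```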

Bayesian data analysis combines the two modes of thinking described above. It takes a question in the form of a model and uses logic to produce an answer (inference) in the form of probability distributions.

In my experience, data science classes might mention Bayes’ theorem out of obligation but don’t really instill a motivation or a habit for students to use it. DataCamp, an online platform with numerous insightful courses, still doesn’t have a Bayesian course in Python as of February 2021 (a Trello card says a PyMC3 course by Chris Fonnesbeck is overdue, but there are courses such as Statistical Simulation and Statistical Thinking with numpy that are great). One major factor in the lack of adoption of Bayesian methods for data analysis is the computational difficulty of approximating the posterior. Implementing my retention project alongside the Statistical Rethinking book has finally tied together a lot of foundational, but until now separate, concepts of a Bayesian workflow.

The basic strategy, in one sentence: define a generative model and then use that model to design strategies for causal inference and statistical estimation. It can be broken into three main steps:

  1. Data story: Motivate the model by narrating how the data came to be, then translate the data story into a formal probability model: unknown parameters, data, covariates, missing data, and predictions must all be assigned some probability distribution.
  2. Update: Educate your model by feeding it the data. A Bayesian model begins with the prior plausibilities, then updates them in light of the data to produce posterior plausibilities. Once the posterior distribution is approximated (mostly via MCMC), you get point estimates, credible intervals, quantiles, predictions - which can be used to summarize and interpret the posterior distribution depending on the context of the problem and your purpose. One of the biggest difficulties lies in the subject matter: which variables matter, and how does theory tell us to connect them?
  3. Evaluate and check your model. All statistical models require supervision, possibly leading to model revision. The model and its outputs must be assessed before using the outputs for inference. The objective is to check the model’s adequacy for some purpose, which usually means asking and answering additional questions beyond those that originally motivated the model. Both depend on the context. The least we can do is examine the output in detail, relative to the specified model and the data that were used to fit it. Does the model fit the data? Are the conclusions reasonable? Are the outputs sensitive to changes in model structure?
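To make the three steps concrete, here is a minimal sketch of the workflow in PyMC3. The renewal data, the Beta prior and the variable names are all invented for illustration; this is not the actual model from the retention project:

```python
# A minimal sketch of the three-step workflow in PyMC3.
# Data, prior and names are invented for illustration only.
import numpy as np
import pymc3 as pm
import arviz as az

# Pretend data: 1 = member renewed this period, 0 = member churned.
renewals = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])

with pm.Model() as retention_model:
    # 1. Data story: each member renews with some unknown probability p,
    #    with a prior plausibility for p stated before seeing the data.
    p = pm.Beta("p", alpha=2, beta=2)
    obs = pm.Bernoulli("obs", p=p, observed=renewals)

    # 2. Update: approximate the posterior via MCMC.
    trace = pm.sample(2000, tune=1000, return_inferencedata=True)

# 3. Evaluate: summarize the posterior (point estimate, credible interval).
print(az.summary(trace, var_names=["p"]))
```

A posterior predictive check (for example via pm.sample_posterior_predictive) and sensitivity checks on the prior would round out step 3.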

2: The difference between prediction and inference.

Before the retention project, I didn’t really differentiate between prediction and inference and used the same tools for both. When assessing a model’s performance, I mostly looked at prediction accuracy and tried to minimize prediction errors. I compared variable coefficients when trying to understand their effect on the phenomenon of interest. I thought that gave me the causal relationships, when in reality any two predictors could each provide independent value, could be redundant, or one could eliminate the value of the other. Parameter estimates alone are rather opaque answers.

Attempting an explanatory model, I’ve kept some hazards in mind.

The main thing to be aware of is that models that are causally incorrect can make better predictions than those that are causally correct. Including an additional variable might fit the sample better and improve out-of-sample predictions, but that doesn’t mean we can interpret it or do causal inference with it. If we want to intervene and influence things, focusing on prediction can mislead us. Recent work showed me that inference is distinct from prediction and requires a more sophisticated and thoughtful set of tools. If we don’t have a causal model, our causal interpretation of an association might be wrong.

To infer true causal relationships, we have to think causally, a goal that requires a more rigorous statistical procedure. One tool that helps us design and interpret regression models is the graphical causal model. A heuristic causal graph isn’t as detailed as a full model description, but it contains information that a purely statistical model does not: it imposes directionality, which lets us interpret what’s going on. Unlike a statistical model, a DAG, if it is correct, will tell you the consequences of intervening to change a variable.
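A toy simulation can show why the directionality matters. Suppose the true DAG is Z → X and Z → Y, with no arrow from X to Y: a regression of Y on X alone will still report a strong “effect” of X, while adjusting for Z makes it vanish. The sketch below is my own illustration, not an example from the book or the project:

```python
# Toy confounding simulation: the DAG is Z -> X and Z -> Y, no arrow X -> Y.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                  # common cause (the confounder)
x = z + rng.normal(scale=0.5, size=n)   # X is driven by Z
y = z + rng.normal(scale=0.5, size=n)   # Y is driven by Z, not by X

# Regression of Y on X alone: a large, misleading "effect" of X.
X_naive = np.column_stack([np.ones(n), x])
beta_naive, *_ = np.linalg.lstsq(X_naive, y, rcond=None)

# Regression of Y on X and Z: the coefficient on X collapses toward zero.
X_adjusted = np.column_stack([np.ones(n), x, z])
beta_adjusted, *_ = np.linalg.lstsq(X_adjusted, y, rcond=None)

print("slope on X, Z ignored: ", round(beta_naive[1], 2))     # ~0.8
print("slope on X, Z included:", round(beta_adjusted[1], 2))  # ~0.0
```

The naive model may even predict Y perfectly well; it just answers the wrong question if you want to know what happens when you intervene on X.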

The models we create can have powerful implications. They help us make discoveries, but those discoveries can be silly and even dangerous. Models can perform impressive calculations or find patterns where none are obvious, but they don’t know when the context is inappropriate: they just know their own procedures. No statistical procedure can substitute for domain knowledge or infer causes from evidence: models do not understand cause and effect! Regression itself cannot be wrong - it doesn’t provide evidence to justify a causal model - but our causal interpretation of an association can be. The pre-made tools often used in machine learning cannot substitute for a unified theory or set of strategies. You have to regress and make inferences responsibly, and you need to understand how the model processes information to interpret its output. Personally, I find using inference to understand causal paths a more interesting task than brute-forcing perfect predictions.

3: Programmatically implementing an academic paper.

I had figured out that there are several ways to calculate a seemingly straightforward statistic such as a retention rate, ways that look the same but are inherently different, and I had settled on the most obvious one, which wasn’t implemented in Python anywhere on the internet. The next step was to fit a constant churn rate, so we could start saying things like “if you reduced your churn rate by 5%, you’d have XYZ more members today, and would have accrued $ABC more in membership dollars up to today”. But a constant churn rate doesn’t fit very well (the most marginal customers cancel first, so the churn rate declines over time), so I was pointed to a paper that demonstrates a probability model (known as a “shifted-beta-geometric” model) with a well-grounded churn story as an alternative to a “curve-fitting” regression model.

The authors implemented it in Excel using 10 data points, but since my data was much more multi-dimensional, and I wanted to be flexible about certain parameters such as time periods and data input format, I needed to implement the paper’s model programmatically in my Python environment. This proved to be a rather frustrating exercise, but with the help of a data-engineer friend I was able to adapt the study’s framework to my environment and, in the process, understand the phenomenon better. Replicating a study, adapting it to my context and using my data to do inference with it gave me a taste of what dynamic scientific discovery can be like: peer review, repetition and synthesis that build upon published findings. I wish there were a more streamlined way to apply papers to one’s work.
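For the curious, the core of the shifted-beta-geometric model is small enough to sketch. What follows is my own illustrative reconstruction with made-up cohort counts, not the paper’s spreadsheet or the project’s actual code:

```python
# An illustrative sketch of fitting a shifted-beta-geometric (sBG) churn model.
# My own reconstruction with made-up cohort counts, not the project's code.
import numpy as np
from scipy.optimize import minimize

def sbg_churn_probs(alpha, beta, n_periods):
    """P(customer churns in period t) for t = 1..n_periods, via the sBG recursion."""
    p = np.zeros(n_periods)
    p[0] = alpha / (alpha + beta)
    for t in range(1, n_periods):
        p[t] = p[t - 1] * (beta + t - 1) / (alpha + beta + t)
    return p

def neg_log_likelihood(params, churned_per_period, still_active):
    alpha, beta = params
    if alpha <= 0 or beta <= 0:
        return np.inf
    p = sbg_churn_probs(alpha, beta, len(churned_per_period))
    survival = 1 - p.cumsum()  # P(still active after each period)
    return -(np.sum(churned_per_period * np.log(p))
             + still_active * np.log(survival[-1]))

# Made-up cohort: 1,000 members, how many cancelled in each of 5 periods.
churned = np.array([131, 126, 90, 63, 48])
active = 1000 - churned.sum()

result = minimize(neg_log_likelihood, x0=[1.0, 1.0],
                  args=(churned, active), method="Nelder-Mead")
alpha_hat, beta_hat = result.x
print(alpha_hat, beta_hat)
```

Once alpha and beta are estimated, the same recursion projects the retention curve forward, which is what makes “what if churn were 5% lower” questions answerable.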

4: Getting closer to understanding the subtleties of preprocessing variables.

I’ve never come across a unified, systematic rationale for choosing between the different preprocessing techniques; it isn’t talked about much and seems more of an art. The choice seems to depend on how the data was generated (and how much you know about that), what insights you want to generate from it, how important interpretability is, and how much the data’s distribution deviates from your desired distribution. The overall goal of rescaling variables is to create focal points that you might have prior information about, before seeing the actual data values. Throughout the Statistical Rethinking book there are many examples that help build this intuition; a small sketch of the idea follows below.
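As one small example of the kind of rescaling I mean (with invented data, in the spirit of the book’s height-and-weight models): standardizing a predictor turns its slope into “change in outcome per standard deviation of the predictor”, which is a scale you can usually place a prior on before seeing the data.

```python
# A small sketch of two common rescaling choices, with invented data.
import numpy as np

rng = np.random.default_rng(1)
weight = rng.normal(65, 10, size=200)                        # body weight, kg
height = 100 + 0.9 * weight + rng.normal(0, 5, size=200)     # height, cm

# Centering and standardizing: the intercept becomes "expected height at
# average weight", and the slope becomes "change in height per standard
# deviation of weight" -- both easy to state priors about in advance.
weight_std = (weight - weight.mean()) / weight.std()

# A log transform is another option when only positive values make sense,
# e.g. modelling height as a function of log(weight).
log_weight = np.log(weight)
```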

5: I (too) often underestimate how long things take.

When I decided to change careers from PR manager in the arts to data science, I thought I’d be able to go into a data scientist role right after, if not before, finishing a bootcamp course. It not happening exactly like that put a lot of pressure on my self-confidence and caused frustration. I had to relax the arbitrary (and uninformed) timeframe I had imposed on myself and allow the process to take time. Little by little I have been finding my path. My lesson here is that all good things take time; they can’t be rushed, and there are no hacks.

This summer I was invited to Google’s foobar challenge. Having passed it, I was approached by their recruiting team. I scheduled the interview to take place in a month, giving me some solid time to prepare after I’d finished my current project (they shared some specific suggestions on how to prepare). A month later I wasn’t ready and postponed by another month. By that time, they had met their hiring goals for 2020 and “weren’t onboarding online anymore”. Sometimes it’s hard to prioritize, but now I try not to miss windows like that and to strike while the iron is hot.

Another timing lesson is related to estimating how long a project will take. As a beginner, the difficulty comes from not knowing what you don’t know. While I try to learn the theory I need when I need it (as inspired by the youngest Kaggle grandmaster Mikel Bober-Irizar) - as opposed to learning “all the theory” - learning from real-world feedback and iterating quickly is even better.

A big part of the value of building projects in data science is to impose constraints on the scope of what you’re building and therefore the range of skills you need to focus on. The space is so vast and so open-ended that it’s easy to get lost running down rabbit holes. Knowing when to walk away is important, so that you don’t end up sacrificing precious years of exposure to real-world problems for the sake of a research project you might not be all that excited about anyway (Russell Pollari). Instead of asking “How long will this take?” ask “How much time are we willing to spend on this?” The latter is a much easier question to answer and will lead to more effective prioritization (also Russell).

My lesson: document approximately where you are on the roadmap towards your goal, what has been achieved, and what are the struggles and the obstacles.

6: Psychological.

I would like to be realistic about my limitations. Sometimes, after struggling with something for some time, I start doubting myself and wonder: how do you know the optimal time to give up? On the one hand, it’s foolish to quit right before overcoming the normal learning curve, but you also have to be mindful of the opportunity cost of not directing your efforts toward things you could be better at. How do you avoid succumbing to the sunk-cost fallacy and continuing down the wrong path for too long?

I now try to question my decisions more often and look at my work with the big picture in mind. I need to remember to stop and check if what I’m doing makes sense in the broader perspective, bringing me closer to my goals. I’ll be doing more reality checks, searching for global inconsistencies, and trying to avoid rigidity, dogmatism and egocentricity. Lastly, more brainstorming with others.

At some point I came across a concept of Seth Godin’s called the Dip: “The Dip is the long slog between starting and mastery. The Dip is the long stretch between beginner’s luck and real accomplishment. The Dip is the set of artificial screens set up to keep people like you out.”

Other than hindsight, how does one know when it’s time to quit? “It’s time to quit when you secretly realize you’ve been settling for mediocrity all along. It’s time to quit when the things you’re measuring aren’t improving, and you can’t find anything better to measure.”

What’s the worst time to quit? “When the pain is the greatest. Decisions made during great pain are rarely good decisions. … When the Dip shows up, you know you’re close to a breakthrough, to getting to the other side, to mastery, and to being the best in the world. … Dips don’t last quite as long when you whittle at them. … Quit the dead ends and invest in the Dips.

“The point is that in a world of infinite choice, in a world where the best in the world is worth more every single day, the only chance you’ve got is to find a Dip and embrace it. Realize that it’s actually your best ally. The harder it is to get through, the better your chance of being the only one to get through it.”

Conclusion: Coming into data science, the knowledge pool seemed endless. I think now I’m starting to see its perimeter.
