
Introduction

Sometimes here at STATWORX, we have impromptu discussions about statistical methods. In one such discussion, one of my colleagues decided to declare (albeit jokingly) Bayesian statistics unnecessary. That made me ask myself: Why would I ever use Bayesian models in the context of a standard regression problem? Existing approaches such as Ridge Regression are just as good, if not better. However, the Bayesian approach has the advantage that it lets you regularize your model to prevent overfitting and meaningfully interpret the regularization parameters. Contrary to the usual way of looking at ridge regression, the regularization parameters are no longer abstract numbers but can be interpreted through the Bayesian paradigm as derived from prior beliefs. In this post, I’ll show you the formal similarity between a generalized ridge estimator and the Bayesian equivalent.

A (very brief) Primer on Bayesian Stats

To understand the Bayesian regression estimator, a minimal amount of knowledge about Bayesian statistics is necessary, so here’s what you need to know (if you don’t already): In Bayesian statistics, we think about model parameters (i.e., regression coefficients) probabilistically. In other words, the data given to us is fixed, and the parameters are considered random. That runs counter to the standard frequentist perspective in which the underlying model parameters are treated as fixed. At the same time, the data are considered random realizations of the stochastic process driven by those fixed model parameters. The end goal of Bayesian analysis is to find the posterior distribution, which you may remember from Bayes Rule:

    \[p(\theta|y) = \frac{p(y|\theta)\, p(\theta)}{p(y)}\]

While p(y|theta) is our likelihood and p(y) is a normalizing constant, p(theta) is our prior, which does not depend on the data, y. In classical statistics, p(theta) is set to 1 (an improper reference prior) so that when the posterior ‘probability’ is maximized, really just the likelihood is maximized, because it’s the only part that still depends on theta. However, in Bayesian statistics, we use an actual probability distribution in place of p(theta), a Normal distribution, for example. So let’s consider the case of a regression problem, and we’ll assume that our target, y, and our prior follow normal distributions. That leads us to conjugate Bayesian analysis, in which we can neatly write down an equation for the posterior distribution. In many cases, this is not possible, and for this reason, Markov Chain Monte Carlo methods were invented to sample from the posterior – taking a frequentist approach, ironically. We’ll make the usual assumption about the data: y_i is i.i.d. N(x_i beta, sigma^2) for all observations i. This gives us our standard likelihood for the Normal distribution. Now we can specify the prior for the parameters we’re trying to estimate, (beta, sigma^2). If we choose a Normal prior (conditional on the variance, sigma^2) for the vector of weights beta, i.e. N(b_0, sigma^2 B_0), and an inverse-Gamma prior over the variance parameter, it can be shown that the posterior distribution for beta is Normally distributed with mean

    \[\hat\beta_{Bayesian} = (B_0^{-1} + X'X)^{-1}(B_0^{-1} b_0 + X'X \hat\beta)\]

If you’re interested in a proof of this result check out Jackman (2009, p.526). Let’s look at it piece by piece:
  • hatbeta is our standard OLS estimator, (X'X)^{-1}X'y
  • b_0 is the mean vector of the (multivariate normal) prior distribution, so it lets us specify what we think the average values of each of our model parameters are
  • B_0 is the covariance matrix and contains our respective uncertainties about the model parameters. The inverse of the variance is called the precision
What we can see from the equation is that the mean of our posterior is a precision-weighted average of our prior mean (information not based on data) and the OLS estimator (based solely on the data). The second term in parentheses shows that we are taking the precision-weighted prior mean, B_0^{-1} b_0, and adding it to the weighted OLS estimator, X'X hatbeta. Imagine for a moment that B_0^{-1} = 0. Then

    \[\hat\beta_{Bayesian} = (X'X)^{-1}(X'X \hat\beta) = \hat\beta\]

That would mean we are infinitely uncertain about our prior beliefs, so the mean vector of our prior distribution vanishes from the posterior, contributing nothing! Likewise, if our uncertainty decreases (and the precision thus increases), the prior mean, b_0, contributes more to the posterior mean. After this short primer on Bayesian statistics, we can now formally compare the Ridge estimator with the above Bayesian estimator. But first, we need to take a look at a more general version of the Ridge estimator.
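To see this precision weighting in action, here is a minimal numerical sketch (not from the original post); the toy design matrix, outcome and prior values are made up purely for illustration:

set.seed(1)
X_toy <- cbind(1, rnorm(100))               # toy design: intercept and one predictor
y_toy <- X_toy %*% c(2, 3) + rnorm(100)     # toy outcome with true coefficients (2, 3)

# Posterior mean from the formula above: (B0^-1 + X'X)^-1 (B0^-1 b0 + X'X beta_hat)
posterior_mean <- function(B0, b0, X, y) {
  beta_ols <- solve(t(X) %*% X, t(X) %*% y)
  solve(solve(B0) + t(X) %*% X, solve(B0) %*% b0 + t(X) %*% X %*% beta_ols)
}

b0 <- c(0, 0)                               # prior mean
posterior_mean(B0 = diag(1e6, 2), b0, X_toy, y_toy)   # huge prior variance: essentially the OLS estimate
posterior_mean(B0 = diag(1e-6, 2), b0, X_toy, y_toy)  # tiny prior variance: pulled towards b0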

Generalizing the Ridge estimator

A standard tool in many regression problems, the Ridge estimator is derived by minimizing the following penalized least-squares loss function:

    \[L(\beta,\lambda) = \frac{1}{2}\sum(y - X\beta)^2 + \frac{1}{2}\lambda||\beta||^2\]

While minimizing this gives us the standard Ridge estimator you have probably seen in textbooks on the subject, there’s a slightly more general version of this loss function:

    \[L(\beta,\lambda,\mu) = \frac{1}{2}\sum(y - X\beta)^2 + \frac{1}{2}\lambda||\beta - \mu||^2\]

Let’s derive the estimator by first re-writing the loss function in terms of matrices:

    \[\begin{aligned} L(\beta,\lambda,\mu) &= \frac{1}{2}(y - X\beta)^{T}(y - X\beta) + \frac{1}{2}\lambda||\beta - \mu||^2 \\ &= \frac{1}{2} y^T y - \beta^T X^T y + \frac{1}{2}\beta^T X^T X \beta + \frac{1}{2}\lambda||\beta - \mu||^2 \end{aligned}\]

Differentiating with respect to the parameter vector, we end up with this expression for the gradient:

    \[\nabla_{\beta} L(\beta, \lambda, \mu) = -X^T y + X^T X \beta + \lambda(\beta - \mu)\]

Setting the gradient to zero and solving for beta, we get this expression for the generalized ridge estimator:

    \[\hat\beta_{Ridge} = (X'X + \lambda I)^{-1}(\lambda\mu + X'y)\]

The standard Ridge estimator can be recovered by setting mu = 0. Usually, we regard lambda as an abstract parameter that regulates the penalty size, and mu as a vector of values (one for each predictor) towards which the coefficients are pulled: the further the coefficients deviate from these values, the larger the loss. When mu = 0, the coefficients are pulled towards zero. Let’s take a look at how the estimator behaves when the parameters mu and lambda change. We’ll define a meaningful ‘prior’ for our example and then vary the penalty parameter. As an example, we’ll use the diamonds dataset from the ggplot2 package and model the price as a linear function of the carat, depth, table, x, y and z attributes of each diamond.
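The code behind the original plots is not shown in this post, but a minimal sketch of the generalized ridge estimator on the diamonds data could look as follows; the standardization of the predictors, the lambda grid and the ‘prior’ vector mu are illustrative assumptions, not the original specification:

library(ggplot2)   # provides the diamonds dataset

# Predictors (x, y, z are the physical dimensions of a diamond) and target
X <- scale(as.matrix(diamonds[, c("carat", "depth", "table", "x", "y", "z")]))
y <- diamonds$price

# Generalized ridge estimator: (X'X + lambda I)^-1 (lambda mu + X'y)
ridge_general <- function(X, y, lambda, mu = rep(0, ncol(X))) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), lambda * mu + t(X) %*% y)
}

# Coefficient paths for increasing penalties, without and with a 'prior' mu
lambdas <- c(0, 10, 100, 1000, 10000)
mu      <- c(5000, 0, 0, 0, 0, 0)   # hypothetical prior: carat matters, the others shrink to zero

sapply(lambdas, function(l) ridge_general(X, y, l))       # coefficients pulled towards zero
sapply(lambdas, function(l) ridge_general(X, y, l, mu))   # coefficients pulled towards mu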
As we can see from the plot, both with and without a prior, the coefficient estimates change rapidly for the first few increases in the penalty size. We also see the familiar ‘shrinkage’ effect in the upper plot: as the penalty increases, the coefficients tend towards zero, some faster than others. The plot on the right shows how the coefficients change when we set a sensible ‘prior’. The coefficients still change, but they now tend towards the ‘prior’ we specified. That’s because lambda penalizes deviations from mu, which means that larger values for the penalty pull the coefficients towards mu. You might be asking yourself how this compares to the Bayesian estimator. Let’s find out!

Comparing the Ridge and Bayesian Estimator

Now that we’ve seen both the Ridge and the Bayesian estimators, it’s time to compare them. We saw that the Bayesian estimator contains the OLS estimator. Since we know its form, let’s substitute it and see what happens:

    \[\begin{aligned} \hat\beta_{Bayesian} &= (X'X + B_0^{-1})^{-1}(B_0^{-1} b_0 + X'X \hat\beta) \\ &= (X'X + B_0^{-1})^{-1}(B_0^{-1} b_0 + X'X (X'X)^{-1}X'y) \\ &= (X'X + B_0^{-1})^{-1}(B_0^{-1} b_0 + X'y) \end{aligned}\]

This form makes the analogy much clearer:
  • lambda I corresponds to B_0^{-1}, the matrix of precisions. In other words, since I is the identity matrix, the ridge estimator assumes no covariances between the regression coefficients and a constant precision across all coefficients (recall that lambda is a scalar)
  • lambda mu corresponds to B_0^{-1} b_0, which makes sense, since the vector b_0 is the mean of our prior distribution, which essentially pulls the estimator towards it, just like mu ‘shrinks’ the coefficients towards its values. This ‘pull’ depends on the uncertainty captured by B_0^{-1} or lambda I in the ridge estimator.
That’s all well and good, but let’s see how changing the uncertainty in the Bayesian case compares to the behavior of the ridge estimator. Using the same data and the same model specification as above, we’ll set the covariance matrix B_0 equal to lambda I and then change lambda. Remember, smaller values of lambda now imply a larger contribution of the prior (less uncertainty), and increasing lambda therefore makes the prior less important.
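A sketch of how this comparison could be computed, reusing the diamonds objects X, y and the vector mu from the ridge sketch above and setting B_0 = lambda I (the lambda grid is again only illustrative):

# Conjugate posterior mean with B_0 = lambda * I, so B_0^-1 = (1 / lambda) * I
bayes_mean <- function(X, y, lambda, b0 = rep(0, ncol(X))) {
  B0_inv <- diag(ncol(X)) / lambda
  solve(t(X) %*% X + B0_inv, B0_inv %*% b0 + t(X) %*% y)
}

lambdas <- c(1e-4, 1e-2, 1, 100)    # small lambda = high precision = strong prior
sapply(lambdas, function(l) bayes_mean(X, y, l))            # prior mean of zeros
sapply(lambdas, function(l) bayes_mean(X, y, l, b0 = mu))   # prior mean set to mu from above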
The above plots match our understanding so far: with a prior mean of zeros, the coefficients are shrunken towards zero, as in the ridge regression case, when the prior dominates, i.e., when the precision is high. And when a prior mean is set, the coefficients tend towards it as the precision increases. So much for the coefficients, but what about the performance? Let’s have a look!

Performance comparison

Lastly, we’ll compare the predictive performance of the two models. Although we could treat the parameters in the model as hyperparameters to be tuned, this would defeat the purpose of using prior knowledge. Instead, let’s choose a prior specification for both models and then compare their performance on a holdout set (30% of the data). While we can use the simple X hatbeta as our predictor for the Ridge model, the Bayesian model provides us with a full posterior predictive distribution, which we can sample from to get model predictions. To estimate the Bayesian model, I used the brms package.
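A minimal sketch of how such a brms model could be set up and evaluated; the 70/30 split, the prior choice and the use of the posterior predictive mean as a point prediction are assumptions for illustration, not the original code:

library(brms)
library(ggplot2)   # for the diamonds dataset

set.seed(123)
idx   <- sample(seq_len(nrow(diamonds)), size = floor(0.7 * nrow(diamonds)))
train <- diamonds[idx, ]
test  <- diamonds[-idx, ]

# Bayesian linear model with (assumed) normal priors on the regression weights
bayes_fit <- brm(price ~ carat + depth + table + x + y + z,
                 data   = train,
                 family = gaussian(),
                 prior  = set_prior("normal(0, 1000)", class = "b"))

# Point predictions: mean of the posterior predictive distribution per test observation
pred <- colMeans(posterior_predict(bayes_fit, newdata = test))

sqrt(mean((test$price - pred)^2))   # RMSE on the 30% holdout set
mean(abs(test$price - pred))        # MAE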
                        RMSE      MAE       MAPE
Bayesian Linear Model   1625.38   1091.36   44.15
Ridge Estimator         1756.01   1173.50   43.44
Overall, both models perform similarly, although some error metrics slightly favor one model over the other. Judging by these errors, we could certainly improve our models by specifying a more appropriate probability distribution for our target variable. After all, prices cannot be negative, yet our models can and do produce negative predictions.

Recap

In this post, I’ve shown you how the ridge estimator compares to the Bayesian conjugate linear model. Now you understand the connection between the two models and how a Bayesian approach can provide a more readily interpretable way of regularizing your model. Normally, lambda would be considered an abstract penalty parameter, but it can now be interpreted as a measure of prior uncertainty. Similarly, the parameter vector mu in the generalized ridge model can be seen as a vector of prior means for the model parameters. Beyond that, the Bayesian approach lets you use prior distributions to bring expert knowledge into the estimation process, which regularizes your model while incorporating external information. If you are interested in the code, check it out on our GitHub page!

References

  • Jackman S. 2009. Bayesian Analysis for the Social Sciences. West Sussex: Wiley.
At STATWORX we are excited that a new promising field of Machine Learning has evolved in recent years: Causal Machine Learning. In short, Causal Machine Learning is the scientific study of Machine Learning algorithms which allow estimating causal effects. Over the last few years, different Causal Machine Learning algorithms have been developed, combining the advances from Machine Learning with the theory of causal inference to estimate different types of causal effects. My colleague Markus has already introduced some of these algorithms in an earlier blog post. As Causal Machine Learning is a rather complex topic, I will write a series of blog posts to slowly dive into this new fascinating world of data science. In my first blog post, I gave an introduction into the topic, focusing on what Causal Machine Learning is and why it is important in practice and for the future of data science. In this second blog post, I will introduce the so-called Causal Forest, one of the most popular Causal Machine Learning algorithms to estimate heterogeneous treatment effects.

Why Heterogeneous Treatment Effects?

In Causal Forests, the goal is to estimate heterogeneity in treatment effects. As explained in my previous blog post, a treatment effect refers to the causal effect of a treatment or intervention on an outcome variable of scientific or political interest, for example, the causal effect of a subsidised training programme on earnings. As individual treatment effects are unobservable, practice focuses on estimating unbiased and consistent averages of the individual treatment effects. The most common such parameter is the average treatment effect, which is the mean of all individual treatment effects in the entire population of interest. However, treatment effects may sometimes vary widely between different subgroups in the population, be it larger or smaller than the average treatment effect. In some cases, it might therefore be more interesting to estimate these different, i.e. heterogeneous, treatment effects.
In most applications it is also interesting to look beyond the average effects in order to understand how the causal effects vary with observable characteristics. (Knaus, Lechner & Strittmatter, 2018)
The estimation of heterogeneous treatment effects can assist in answering questions like: For whom are there big or small treatment effects? For which subgroup does a treatment generate beneficial or adverse effects? In the field of marketing, for example, the estimation of heterogeneous treatment effects can help to optimise resource allocation by answering the question of which customers respond the most to a certain marketing campaign or for which customers is the causal effect of intervention strategies on their churn behaviour the highest. Or when it comes to pricing, it might be interesting to quantify how a change in price has varying impact on sales among different age or income groups.

Where Old Estimation Methods Fail

Estimating heterogeneous treatment effects is nothing new. Econometrics and other social sciences have long studied which variables predict a smaller or larger than average treatment effect, which in statistical terms is also known as Moderation. One of the most traditional ways to find heterogeneous treatment effects is to use a Multiple Linear Regression with interaction terms between the variables of interest (i.e. the ones which might lead to treatment heterogeneity) and the treatment indicator. (In this blog post, I will always assume that the data comes from a randomised experiment, such that the assumptions needed to identify treatment effects hold without further complications.) If an interaction term is statistically significant, we conclude that the treatment effect depends on the corresponding variable. For example, if we have only one variable, the regression model would look as follows:

    \[Y = \beta_0 + \beta_1 w + \beta_2 x_1 + \beta_3 (w \cdot x_1),\]

where w is the treatment indicator and x_1 is the variable of interest. In that case, if beta_3 is significant, we know that the treatment effect depends on variable x_1. The treatment effect for each observation can then be calculated as

    \[\beta_1 + \beta_3 \cdot x_1,\]

which is dependent on the value of x_1 and therefore heterogeneous among the different observations. So why is there a need for more advanced methods to estimate heterogeneous treatment effects? The example above was very simple, it only included one variable. However, usually, we have more than one variable which might influence the treatment effect. To see which variables predict heterogeneous treatment effects, we have to include many interaction terms, not only between each variable and the treatment indicator but also for all possible combinations of variables with and without the treatment indicator. If we have p variables and one treatment, this gives a total number of parameters of:

    \[\displaystyle\sum_{k = 0}^{p + 1} \binom{p + 1}{k}.\]

So, for example, if we had 5 variables, we would have to include a total of 64 parameters in our Linear Regression Model. This approach suffers from a lack of statistical power and can also cause computational issues. The use of a Multiple Linear Regression also imposes linear relationships, unless further interactions with polynomials are included. Because Machine Learning algorithms can handle enormous numbers of variables and combine them in nonlinear and highly interactive ways, researchers have found ways to better estimate heterogeneous treatment effects by combining the field of Machine Learning with the study of Causal Inference.
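As a quick sanity check of this count (just an illustrative one-liner in R):

p <- 5
sum(choose(p + 1, 0:(p + 1)))   # 64 parameters for 5 variables and one treatment
2^(10 + 1)                      # 2048 parameters for the 10-variable case discussed later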

Generalised Random Forests

Over recent years, different Machine Learning algorithms have been developed to estimate heterogeneous treatment effects. Most of them are based on the idea of Decision Trees or Random Forests, just like the one I focus on in this blog post: Generalised Random Forests by Athey, Tibshirani and Wager (2019). Generalised Random Forests follow the idea of Random Forests, and apart from heterogeneous treatment effect estimation, the algorithm can also be used for non-parametric quantile regression and instrumental variable regression. It keeps the main structure of Random Forests, such as the recursive partitioning, subsampling and random split selection. However, instead of averaging over the trees, Generalised Random Forests estimate a weighting function and use the resulting weights to solve a local GMM model. To estimate heterogeneous treatment effects, the algorithm has two important additional features that distinguish it from standard Random Forests.

1. Splitting Criterion

The first important difference to Random Forests is the splitting criterion. In a Random Forest, where we want to predict an outcome variable Y, the split at each tree node is performed by minimising the mean squared error of the outcome variable Y. In other words, the variable and value to split at each tree node are chosen such that the greatest reduction in the mean squared error with regard to the outcomes Y is achieved. After a tree has been grown, its prediction for a new observation x is obtained by letting the observation trickle down from the tree’s root into a terminal node and then taking the average of the outcomes Y of all training observations that fell into the same node. The Random Forest prediction is then calculated as the average of the predicted tree values. In Causal Forests, we want to estimate treatment effects. As stated by the Fundamental Problem of Causal Inference, however, we can never observe a treatment effect on an individual level. Therefore, the prediction of a treatment effect is given by the difference in the average outcomes Y between the treated and the untreated observations in a terminal node. Without going into too much detail, to find the most heterogeneous but still accurate treatment effects, the splitting criterion is adapted such that it searches for a partitioning where the treatment effects differ the most, including a correction that accounts for how the splits affect the variance of the parameter estimates.

2. Honesty

Random Forests are usually evaluated by applying them to a test set and measuring the accuracy of the predictions of Y using an error measure such as the mean squared error. Because we can never observe treatment effects, this form of performance measurement is not possible in Causal Forests. When estimating causal effects, one therefore evaluates their accuracy by examining the bias, the standard error and the related confidence interval of the estimates. To ensure that an estimate is as accurate as possible, the bias should vanish asymptotically, and the standard error, and thus the confidence interval, should be as small as possible. To enable this kind of statistical inference in their Generalised Random Forest, Athey, Tibshirani and Wager introduce so-called honest trees. To make a tree honest, the training data is split into two subsamples: a splitting subsample and an estimating subsample. The splitting subsample is used to perform the splits and thus grow the tree. The estimating subsample is then used to make the predictions: all observations in the estimating subsample are dropped down the previously grown tree until they fall into a terminal node. The prediction of the treatment effects is then given by the difference in the average outcomes between the treated and the untreated observations of the estimating subsample in the terminal nodes. With such honest trees, the estimates of a Causal Forest are consistent (i.e. the bias vanishes asymptotically) and asymptotically Gaussian, which, together with the estimator for the asymptotic variance, allows the construction of valid confidence intervals.

Causal Forest in Action

To show the advantages of Causal Forests compared to old estimation methods, in the following I will compare the Generalised Random Forest to a Regression with interaction terms in a small simulation study. I use simulated data to be able to compare the estimated treatment effects with the actual treatment effects, which, as we know, would not be observable in real data. To compare the two algorithms with respect to the estimation of heterogeneous treatment effects, I test them on two different data sets, one with and one without heterogeneity in the treatment effect:
Data Set   Heterogeneity      Heterogeneity Variables   Variables      Observations
1          No Heterogeneity   –                         x_1 – x_{10}   20,000
2          Heterogeneity      x_1 and x_2               x_1 – x_{10}   20,000
This means that in the first data set, all observations have the same treatment effect. In this case, the average treatment effect and the heterogeneous treatment effects are the same. In the second data set, the treatment effect varies with the variables x_1 and x_2. Without going into too much detail here (I will probably write a separate blog post only about causal data generating processes), the relationship between those heterogeneity variables (x_1 and x_2) and the treatment effect is not linear. Both simulated data sets have 20,000 observations containing an outcome variable Y and 10 covariates with values between zero and one. To evaluate the two algorithms, the data sets are split into a training (75%) and a test set (25%). For the Causal Forest, I use the causal_forest() function from the grf package with tune.parameters = "all". I compare this to an lm() model, which includes all variables, the treatment indicator and the necessary interaction terms of the heterogeneity variables and the treatment indicator. Linear Regression Model for the data set with heterogeneity:

    \[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{10} x_{10} + \beta_{11} w +\]

    \[\beta_{12} (w \cdot x_1) + \beta_{13} (w \cdot x_2) + \beta_{14} (x_1 \cdot x_2) + \beta_{15} (w \cdot x_1 \cdot x_2)\]

Linear Regression Model for data set with no heterogeneity:

    \[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{10} x_{10} + \beta_{11} w\]

where x_1 to x_{10} are the covariates and w is the treatment indicator (i.e. w = 1 if treated and w = 0 if not treated). As already explained above, we usually do not know which variables affect the treatment effect and therefore have to include all possible interaction terms in the Linear Regression Model to see which variables lead to treatment effect heterogeneity. With 10 variables, as we have here, this means we would have to include a total of 2,048 parameters in our Linear Regression Model. However, since the heterogeneity variables are known in the simulated data, I only include the interaction terms for those variables here.
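The following sketch shows how the two models could be fitted and evaluated; the data-generating code is omitted, and the object names (train, test, and the columns x_1 to x_10, y, w, and the true effect tau) are placeholders for the simulated data rather than the original script:

library(grf)

# Causal Forest: X = covariate matrix, Y = outcome, W = treatment indicator
cf <- causal_forest(X = as.matrix(train[, paste0("x_", 1:10)]),
                    Y = train$y,
                    W = train$w,
                    tune.parameters = "all")
tau_hat_cf <- predict(cf, as.matrix(test[, paste0("x_", 1:10)]))$predictions

# Linear regression with the interaction terms of the known heterogeneity variables
lm_fit <- lm(y ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10 +
               w + w:x_1 + w:x_2 + x_1:x_2 + w:x_1:x_2, data = train)

# Treatment effect per observation: difference between predictions with w = 1 and w = 0
tau_hat_lm <- predict(lm_fit, transform(test, w = 1)) - predict(lm_fit, transform(test, w = 0))

# RMSE against the (simulated, hence known) true treatment effects tau
sqrt(mean((test$tau - tau_hat_cf)^2))
sqrt(mean((test$tau - tau_hat_lm)^2))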
Data Set           Metric   grf    lm
No Heterogeneity   RMSE     0.01   0.00
Heterogeneity      RMSE     0.08   0.45
Looking at the results, we can see that without heterogeneity, the treatment effect is predicted equally well by the Causal Forest (RMSE of 0.01) and the Linear Regression (RMSE of 0.00). In the data set with heterogeneity, however, the Causal Forest is far more accurate (RMSE of 0.08) than the Linear Regression (RMSE of 0.45). As expected, the Causal Forest is better at detecting the underlying non-linear relationship between the heterogeneity variables and the treatment effect than the Linear Regression Model, which can also be seen in the plots below. Thus, even if we already know which variables influence the treatment effect and only need to include the necessary interaction terms, the Linear Regression Model is still less accurate than the Causal Forest due to its lack of modelling flexibility.
Figure: treatment effect hexplots

Outlook

I hope that this blog post has helped you to understand what Causal Forests are and what advantages they bring in estimating heterogeneous treatment effects compared to old estimation methods. In my upcoming blog posts on Causal Machine Learning, I will explore this new field of data science further. I will, for example, take a look at the problems of using classical Machine Learning algorithms to estimate causal effects in more detail or introduce different data generating processes to evaluate Causal Machine Learning methods in simulation studies.

References

  • Athey, S., Tibshirani, J., & Wager, S. (2019). Generalised random forests. The Annals of Statistics, 47(2), 1148-1178.
  • Knaus, M. C., Lechner, M., & Strittmatter, A. (2018). Machine learning estimation of heterogeneous causal effects: Empirical Monte Carlo evidence. arXiv:1810.13237v2.
At STATWORX, we are excited that a new promising field of Machine Learning has evolved in recent years: Causal Machine Learning. In short, Causal Machine Learning is the scientific study of Machine Learning algorithms that allow estimating causal effects. Over the last few years, different Causal Machine Learning algorithms have been developed, combining the advances from Machine Learning with the theory of causal inference to estimate different types of causal effects. My colleague Markus has already introduced some of these algorithms in an earlier blog post. As Causal Machine Learning is a rather complex topic, I will write a series of blog posts to slowly dive into this new fascinating world of data science. This first blog post is an introduction into the topic, focusing on what Causal Machine Learning is and why it is important in practice and for the future of data science.

The Origins of Causal Machine Learning

As Markus has already explained in his earlier blog post, analysis in economics and other social sciences revolves primarily around the estimation of causal effects, that is, the isolated effect of a feature X on the outcome variable Y. An example, which has been investigated by my colleague Matthias, is the causal effect of oil prices on gas prices. Actually, in most cases, the interest lies in so-called treatment effects. A treatment effect refers to a causal effect of a treatment or intervention on an outcome variable of scientific or political interest. In economics, one of the most analyzed treatment effects is the causal effect of a subsidized training program on earnings. Following the potential outcome framework introduced by Rubin (1974), the treatment effect of an individual is defined as follows:

    \[\gamma_i = Y_i(1) - Y_i(0)\]

where Y_i(1) indicates the potential outcome of individual i with treatment and, conversely, Y_i(0) denotes the potential outcome of individual i without treatment. However, as an individual can either receive the treatment or not, we can only ever observe one of the two potential outcomes for an individual at one point in time; the individual treatment effect is therefore unobservable. This problem is also known as the Fundamental Problem of Causal Inference. Nevertheless, under certain assumptions, averages of the individual treatment effects may be identified. In randomized experiments, where the treatment is randomly assigned, these assumptions are naturally valid, and the identification of any aggregation level of individual treatment effects is possible without further complications. In many situations, however, randomized experiments are not possible, and the researcher has to work with observational data, where these assumptions are usually not valid. Thus, an extensive literature in economics and other fields has focused on techniques for identifying causal effects in cases where these assumptions are not given.
Prediction and causal inference are distinct (though closely related) problems. Athey, 2017, p. 484
In contrast, (Supervised) Machine Learning literature has traditionally focused on prediction, that is, producing predictions of the outcome variable from the features. Machine Learning models are designed to discover complex structures in given data and generalize them so that they can be used to make accurate predictions on new data. These algorithms can handle enormous numbers of predictors and combine them in nonlinear and highly interactive ways. They have proven to be hugely successful in practice and are used in applications ranging from medicine to resource allocation in cities.

Bringing Together the Best of Both Worlds

Although economists and other social scientists prioritize precise estimates of causal effects above predictive power, they were intrigued by the advantages of Machine Learning methods, such as the precise out-of-sample prediction power or the ability to deal with large numbers of features. But as we have seen, classical Machine Learning models are not designed to estimate causal effects, and using off-the-shelf prediction methods from Machine Learning leads to biased estimates. The existing Machine Learning techniques had to be modified to exploit their advantages while estimating causal effects consistently and efficiently: the birth of Causal Machine Learning!
Currently, Causal Machine Learning can be broadly divided into two lines of research, defined by the type of causal effect to be estimated. One line of Causal Machine Learning research focuses on modifying Machine Learning methods to estimate unbiased and consistent average treatment effects. The average treatment effect is the mean of all individual treatment effects in an entire population of interest, and probably the most common parameter analyzed in econometric causal studies. Models from this line of research try to answer questions like: How will customers react on average to a marketing campaign? What is the average effect of a price change on sales? The other line of Causal Machine Learning research focuses on modifying Machine Learning methods to uncover treatment effect heterogeneity. That is, identifying subpopulations (based on features) of individuals who have a larger or smaller than average treatment effect. These models are designed to answer questions such as: Which customers respond the most to a marketing campaign? How does the effect of a price change on sales change with the age of customers?

Decision-Making Questions Need Causal Answers

Although the study of Causal Machine Learning has been mainly driven by economic researchers, its importance for other areas such as business should not be neglected. Companies often reach for classical Machine Learning tools to solve decision-making problems, such as where to set the price or which customers to target with a marketing campaign. However, there is a significant gap between making a prediction and making a decision. To make a data-driven decision, the understanding of causal relationships is key. Let me illustrate this problem with two examples from our daily business.

Example 1: Price Elasticities

At the core of every company’s pricing management is the understanding of how customers will respond to a change in price. To set an optimal price, the company needs to know how much it will sell at different (hypothetical) price levels. The most practicable and meaningful metric answering this question is the price elasticity of demand. Although it might seem straightforward to estimate the price elasticity of demand using classical Machine Learning methods to predict sales as the outcome with the price level as a feature, in practice, this approach does not simply give us the causal effect of price on sales.
There are a number of gaps between making a prediction and making a decision, and underlying assumptions need to be understood in order to optimise data-driven decision making. Athey, 2017, p. 483
Following a similar example introduced by Athey (2017), assume we have historical data on airline prices and the respective occupancy rates. Typically, prices and occupancy rates are positively correlated, as airlines' usual pricing policies prescribe raising seat prices when the occupancy rate increases. In this case, a classical Machine Learning model will answer the following question: If, on a particular day, airline ticket prices are high, what is the best prediction for the occupancy rate on that day? The model will correctly predict that the occupancy rate is likely to be high. However, it would be wrong to infer from this that an increase in prices leads to a higher occupancy rate. From common experience, we know that the true causal effect is quite the opposite: if an airline systematically raised its ticket prices by 10% everywhere, it would be unlikely to sell more tickets.

Example 2: Customer Churn

Another common problem, which companies like to solve with the help of Machine Learning, is the prediction of customer churn (i.e., customers abandoning the firm or service thereof). The companies are interested in identifying the customers with the highest risk of churn so that they can respond by allocating interventions in the hope of preventing these customers from leaving. Classical Machine Learning algorithms have proven to be very good at predicting customer churn. Unfortunately, these results cannot sufficiently address the company's resource allocation problem of which customers to best target with intervention strategies. The question of the optimal allocation of resources to customers is of a causal nature: For which customers is the causal effect of intervention strategies on their churn behavior the highest? A study has shown that in many cases, the overlap between customers with the highest risk of churning and customers who would respond most to interventions was much lower than 100%. Thus, treating the problem of customer churn as a prediction problem and therefore using classical Machine Learning models is not optimal, yielding lower payoffs to companies.

The Wish of Every Data Scientist

Looking beyond these practical examples, we can observe that there is a more profound reason why Causal Machine Learning should be of interest to any data scientist: model generalisability. A Machine Learning model that can capture causal relationships of data will be generalizable to new settings, which is still one of the biggest challenges in Machine Learning.
To illustrate this, I’ll use the example of the rooster and the sun, from “The Book of Why” by Pearl and Mackenzie (2018). A Machine Learning algorithm that is shown data about a rooster and the sun would associate the rising of the sun with the crow of the rooster and may be able to predict when the sun will rise accurately: If the rooster has just crowed, the sun will rise shortly after that. Such a model that is only capable of predicting correlations will not generalize to a situation where there is no rooster. In that case, a Machine Learning model will never predict that the sun will rise because it has never observed such a data point (i.e., without a rooster). If, however, the model captured the true causal relationship, that is, the sun being about to rise causes the rooster to crow, it would be perfectly able to predict that the sun will rise even if there is no rooster.

No True Artificial Intelligence Without Causal Reasoning

Pearl and Mackenzie (2018) go even further, arguing that we can never reach true human-level Artificial Intelligence without teaching machines causal reasoning since cause and effect are the key mechanisms through which we humans process and understand the complex world around us. The ability to predict correlations does not make machines intelligent; it merely allows them to model a reality based on data the algorithm is provided.
The algorithmisation of counterfactuals invites thinking machines to benefit from the ability to reflect on one’s past actions and to participate in this (until now) uniquely human way of thinking about the world. Pearl & Mackenzie, 2018, p. 10
Furthermore, Machine Learning models need the capacity to detect causal effects to ask counterfactual questions, that is, to inquire how some relationship would change given some kind of intervention. As counterfactuals are the building blocks of moral behavior and scientific thought, machines will only be able to communicate more effectively with us humans and reach the status of moral beings with free will if they learn causal and thus counterfactual reasoning.

Outlook

Although this last part has become very philosophical in the end, I hope that this blog post has helped you to understand what Causal Machine Learning is and why it is necessary not only in practice but also for the future of data science in general. In my upcoming blog posts, I will discuss various aspects of this topic in more detail. I will, for example, take a look at the problems of using classical Machine Learning algorithms to estimate causal effects in more detail or compare different Causal Machine Learning algorithms in a simulation study.

References

  • Athey, S. (2017). Beyond prediction: Using big data for policy problems. Science, 355(6324), 483-485.
  • Pearl, J., & Mackenzie, D. (2018). The book of why. New York, NY: Basic Books.
  • Rubin, D. B. (1974). Estimating causal effects of treatments in randomised and non-randomised studies. Journal of Educational Psychology, 66(5), 688-701.
Cross-validation is a widely used technique to assess the generalization performance of a machine learning model. Here at STATWORX, we often discuss performance metrics and how to incorporate them efficiently in our data science workflow. In this blog post, I will introduce the basics of cross-validation, provide guidelines to tweak its parameters, and illustrate how to build it from scratch in an efficient way.


Model evaluation and cross-validation basics

Cross-validation is a model evaluation technique. The central intuition behind model evaluation is to figure out if the trained model is generalizable, that is, whether the predictive power we observe while training is also to be expected on unseen data. We could feed the model directly with the data it was developed for, i.e., the data it is meant to predict, but then there is no way for us to know, or validate, whether the predictions are accurate. Naturally, we would want some kind of benchmark of our model's generalization performance before launching it into production. Therefore, the idea is to split the existing training data into an actual training set and a hold-out test partition which is not used for training and serves as the "unseen" data. Since this test partition is, in fact, part of the original training data, we have a full range of "correct" outcomes to validate against. We can then use an appropriate error metric, such as the Root Mean Squared Error (RMSE) or the Mean Absolute Percentage Error (MAPE), to evaluate model performance. However, the applicable evaluation metric has to be chosen with caution, as there are pitfalls (as described in this blog post by my colleague Jan). Many machine learning algorithms allow the user to specify hyperparameters, such as the number of neighbors in k-Nearest Neighbors or the number of trees in a Random Forest. Cross-validation can also be leveraged for "tuning" the hyperparameters of a model by comparing the generalization error of different model specifications.

Common approaches to model evaluation

There are dozens of model evaluation techniques that are always trading off between variance, bias, and computation time. It is essential to know these trade-offs when evaluating a model, since choosing the appropriate technique highly depends on the problem and the data we observe. I will cover this topic once I have introduced two of the most common model evaluation techniques: the train-test-split and k-fold cross-validation. In the former, the training data is randomly split into a train and test partition (Figure 1), commonly with a significant part of the data being retained as the training set. Proportions of 70/30 or 80/20 are the most frequently used in the literature, though the exact ratio depends on the size of your data. The drawback of this approach is that this one-time random split can end up partitioning the data into two very imbalanced parts, thus yielding biased generalization error estimates. That is especially critical if you only have limited data, as some features or patterns could end up entirely in the test part. In such a case, the model has no chance to learn them, and you will potentially underestimate its performance.
Figure 1: train-test split
A more robust alternative is the so-called k-fold cross-validation (Figure 2). Here, the data is shuffled and then randomly partitioned into k folds. The main advantage over the train-test-split approach is that each of the k partitions is iteratively used as a test (i.e., validation) set, with the remaining k-1 parts serving as the training set in this iteration. This process is repeated k times, such that every observation is included in both training and test sets. The appropriate error metric is then simply calculated as the mean over all k folds, giving the cross-validation error. This is more of an extension of the train-test split rather than a completely new method: the train-test procedure is repeated k times. However, note that even if k is chosen to be as low as k=2, i.e., you end up with only two parts, this approach is still superior to the train-test-split in that both parts are iteratively chosen for training, so the model has a chance to learn all the data rather than just a random subset of it. Therefore, this approach usually results in more robust performance estimates.
Figure 2: k-fold cross-validation
Comparing the two figures above, you can see that a train-test split with a ratio of 80/20 is equivalent to one iteration of a 5-fold (that is, k=5) cross-validation where 4/5 of the data are retained for training, and 1/5 is held out for validation. The crucial difference is that in k-fold the validation set is shifted in each of the k iterations. Note that a k-fold cross-validation is more robust than merely repeating the train-test split k times: In k-fold CV, the partitioning is done once, and then you iterate through the folds, whereas in the repeated train-test split, you re-partition the data k times, potentially omitting some data from training.

Repeated CV and LOOCV

There are many flavors of k-fold cross-validation. For instance, you can do "repeated cross-validation" as well. In plain k-fold CV, once the data is divided into k folds, this partitioning is fixed for the whole procedure, so the result still hinges on this one random split. In repeated CV, you repeat the process of shuffling and randomly partitioning the data into k folds a certain number of times and then average over the resulting cross-validation errors of each run to get a global performance estimate. This way, we are not risking that one unlucky partitioning distorts the estimate. Another special case of k-fold cross-validation is "Leave One Out Cross-Validation" (LOOCV), where you set k = n. That is, in each iteration, you use a single observation from your data as the validation portion and the remaining n-1 observations as the training set. While this might sound like a hyper-robust version of cross-validation, its usage is generally discouraged for two reasons:
  • First, it’s usually very computationally expensive. For most datasets used in applied machine learning, training your model n-1 times is neither desirable nor feasible (although it may be useful for very small datasets).
  • Second, even if you had the computational power (and time on your hands) to endure this process, another argument advanced by critics of LOOCV from a statistical point of view is that the resulting cross-validation error can exhibit high variance. The reason is that your "validation set" consists of only one observation, and depending on the distribution of your data (and potential outliers), this estimate can vary substantially.
In general, note that the performance of LOOCV is a somewhat controversial topic, both in the scientific literature and the broader machine learning community. Therefore, I encourage you to read up on this debate if you consider using LOOCV for estimating the generalization performance of your model (for example, check out this and related posts on StackExchange). As is often the case, the answer might end up being “it depends”. In any case, keep in mind the computational overhead of LOOCV, which is hard to deny (unless you have a tiny dataset).

The value of k and the bias-variance trade-off

If k=n is not (necessarily) the best choice, then how to find an appropriate value for k? It turns out that the answer to this question boils down to the notorious bias-variance trade-off. Why is that? The value for k governs how many folds your data is partitioned into and therefore the size of (i.e., number of observations contained in) each fold. We want to choose k in a way that a sufficiently large portion of our data remains in the training set – after all, we don’t want to give too many observations away that could be used to train our model. The higher the value of k, the more observations are included in our training set in each iteration. For instance, suppose we have 1,200 observations in our dataset, then with k=3 our training set would consist of (k-1)/k * N = 800 observations, but with k=8 it would include 1,050 observations. Naturally, with more observations used for training, you approximate your model’s actual performance (as if it were trained on the whole dataset), hence reducing the bias of your error estimate compared to a smaller fraction of the data. But with increasing k, the size of your validation partition decreases, and your error estimate in each iteration is more sensitive to these few data points, potentially increasing its overall variance. Basically, it’s choosing between the “extremes” of the train-test-split on the one hand and LOOCV on the other. The figure below schematically (!) illustrates the bias-variance performance and computational overhead of different cross-validation methods.
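A quick check of these numbers (illustrative only):

n <- 1200
k <- c(3, 8)
(k - 1) / k * n   # training set sizes of 800 and 1050 observations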
Figure: bias-variance trade-off and computational overhead of cross-validation methods
As a rule of thumb, with higher values for k, bias decreases and variance increases. By convention, values like k=5 or k=10 have been deemed to be a good compromise and have thus become the quasi-standard in most applied machine learning settings.
“These values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.” James et al. 2013: 184
If you are not particularly concerned with the process of cross-validation itself but rather want to seamlessly integrate it into your data science workflow (which I highly recommend!), you should be fine choosing either of these values for k and leave it at that.

Implementing cross-validation in caret

Speaking of integrating cross-validation into your daily workflow—which possibilities are there? Luckily, cross-validation is a standard tool in popular machine learning libraries such as the caret package in R. Here you can specify the method with the trainControl function. Below is a script where we fit a random forest with 10-fold cross-validation to the iris dataset.
library(caret)

set.seed(12345)
inTrain <- createDataPartition(y = iris$Species, p = .7, list = FALSE)

iris.train <- iris[inTrain, ]
iris.test <- iris[-inTrain, ]

fit.control <- caret::trainControl(method = "cv", number = 10)

rf.fit <- caret::train(Species ~ .,
                       data = iris.train,
                       method = "rf",
                       trControl = fit.control)

We define our desired cross-validation method in the trainControl function, store the output in the object fit.control, and then pass this object to the trControl argument of the train function. You can specify the other methods introduced in this post in a similar fashion:

# Leave-One-Out Cross-validation:
fit.control <- caret::trainControl(method = "LOOCV", number = 10)

# Repeated CV (remember to specify the number of repeats!)
fit.control <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 5)

The old-fashioned way: Implementing k-fold cross-validation by hand

However, data science projects can quickly become so complex that the ready-made functions in machine learning packages are not suitable anymore. In such cases, you will have to implement the algorithm—including cross-validation techniques—by hand, tailored to the specific project needs. Let me walk you through a make-shift script for implementing simple k-fold cross-validation in R by hand (we will tackle the script step by step here; you can find the whole code on our GitHub: https://github.com/STATWORX/blog/tree/master/cross-validation).

Simulating data, defining the error metric, and setting k

# devtools::install_github("andrebleier/Xy")
library(tidyverse)
library(Xy)

sim <- Xy(n = 1000,
          numvars = c(2, 2),
          catvars = 0,
          cor = c(-0.5, 0.9),
          noisevars = 0)

sim_data <- sim$data

RMSE <- function(f, o){
  sqrt(mean((f - o)^2))
}

k <- 5
We start by loading the required packages and simulating a dataset with 1,000 observations using the Xy() package developed by my colleague André (check out his blog post on simulating regression data with Xy). Because we need some kind of error metric to evaluate model performance, we define our RMSE function, which is pretty straightforward: The RMSE is the root of the mean of the squared error, where the error is the difference between our fitted (f) and observed (o) values—you can pretty much read the function from left to right. Lastly, we specify our k, which is set to the value of 5 in the example and is stored as a simple integer.

Partitioning the data

set.seed(12345)
sim_data <- mutate(sim_data,
                   my.folds = sample(1:k,
                                     size = nrow(sim_data),
                                     replace = TRUE))
Next up, we partition our data into k folds. For this purpose, we add a new column, my.folds, to the data: We sample (with replacement) from 1 to the value of k, so 1 to 5 in our case, and randomly add one of these five numbers to each row (observation) in the data. With 1,000 observations, each number should be assigned about 200 times.
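As a quick sanity check (not part of the original script), you can inspect how the observations are distributed across the folds:

table(sim_data$my.folds)   # roughly 200 observations per fold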

Training and validating the model

cv.fun <- function(this.fold, data){

  train <- filter(data, my.folds != this.fold)
  validate <- filter(data, my.folds == this.fold)

  model <- lm(y ~ NLIN_1 + NLIN_2 + LIN_1 + LIN_2,
              data = train)

  pred <- predict(model, newdata = validate) %>% as.vector()

  this.rmse <- RMSE(f = pred, o = validate$y)

  return(this.rmse)
}

Next, we define cv.fun, which is the heart of our cross-validation procedure. This function takes two arguments: this.fold and data. I will come back to the meaning of this.fold in a minute, let's just set it to 1 for now. Inside the function, we divide the data into a training and validation partition by subsetting according to the values of my.folds and this.fold: Every observation with a randomly assigned my.folds value other than 1 (so approximately 4/5 of the data) goes into training. Every observation with a my.folds value equal to 1 (the remaining 1/5) forms the validation set. For illustration purposes, we then fit a simple linear model with the simulated outcome and four predictors. Note that we only fit this model on the train data! We then use this model to predict() our validation data, and since we have true observed outcomes for this subset of the original overall training data (this is the whole point!), we can compute our RMSE and return it.

Iterating through the folds and computing the CV error

cv.error <- sapply(seq_len(k),
                   FUN = cv.fun,
                   data = sim_data) %>%
  mean()

cv.error

Lastly, we wrap the function call to cv.fun into a sapply() loop—this is where all the magic happens: Here we iterate over the range of k, so seq_len(k) leaves us with the vector [1] 1 2 3 4 5 in this case. We apply each element of this vector to cv.fun. In apply() statements, the iteration vector is always passed as the first argument of the function which is called, so in our case, each element of this vector at a time is passed to this.fold. We also pass our simulated sim_data as the data argument.

Let us quickly recap what this means: In the first iteration, this.fold equals 1. This means that our train set consists of all the observations where my.folds is not 1, and observations with a value of 1 form the validation set (just as in the example above). In the next iteration of the loop, this.fold equals 2. Consequently, observations with 1, 3, 4, and 5 form the training set, and observations with a value of 2 go to validation, and so on. Iterating over all values of k, this schematically provides us with the diagonal pattern seen in Figure 2 above, where each data partition at a time is used as a validation set.

To wrap it all up, we calculate the mean: This is the mean of our k individual RMSE values and leaves us with our cross-validation error. And there you go: We just defined our custom cross-validation function!

This is merely a template: You can insert any model and any error metric. If you've been following along so far, feel free to try implementing repeated CV yourself or play around with different values for k.




Conclusion

As you can see, implementing cross-validation yourself isn't all that hard. It gives you great flexibility to account for project-specific needs, such as custom error metrics. If you don't need that much flexibility, enabling cross-validation in popular machine learning packages is a breeze. I hope that I could provide you with a sufficient overview of cross-validation and how to implement it both in pre-defined functions as well as by hand. If you have questions, comments, or ideas, feel free to drop me an e-mail.

References

  • James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. New York: Springer.
A major problem arises when comparing forecasting methods and models across different time series. This is a challenge we regularly face at STATWORX. Unit-dependent measures like the MAE (Mean Absolute Error) and the RMSE (Root Mean Squared Error) turn out to be unsuitable and hardly helpful if the time series are measured in different units. However, if this is not the case, both measures provide valuable information. The MAE is perfectly interpretable, as it embodies the average absolute deviation from the actual values. The RMSE, on the other hand, is not that easy to interpret and more vulnerable to extreme values, but it is still often used in practice.

    \[MAE = \frac{1}{n} \sum_{i=1}^{n}{|{\rm Actual}_i - {\rm Forecast}_i|}\]

    \[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n}{({\rm Actual}_i - {\rm Forecast}_i)^2}}\]

One of the most commonly used measures that avoids this problem is called MAPE (Mean Absolute Percentage Error). It solves the problem of the mentioned approaches as it does not depend on the unit of the time series. Furthermore, decision-makers without a statistical background can easily interpret and understand this measure. Despite its popularity, the MAPE was and still is criticized.

    \[MAPE = \frac{1}{n} \sum_{i=1}^{n}{\left|\frac{{\rm Actual}_i - {\rm Forecast}_i}{{\rm Actual}_i}\right|} \cdot 100\]

In this article, I evaluate these critical arguments and prove that at least some of them are highly questionable. The second part of my article concentrates on the true weaknesses of the MAPE, some of them well-known but others hiding in the shadows. In the third section, I discuss various alternatives and summarize under which circumstances the use of the MAPE seems to be appropriate (and when it’s not).

Table of Contents

What the MAPE is FALSELY blamed for!

It Puts Heavier Penalties on Negative Errors Than on Positive Errors

Most sources dealing with the MAPE point out this "major" issue of the measure. The claim is primarily based on two different arguments. First, it is argued that interchanging the actual value with the forecasted value proves the point (Makridakis 1993).

Case 1: Actual_1 = 150 and Forecast_1 = 100 (positive error)

    \[\mathrm{APE}_1 = \left|\frac{\mathrm{Actual}_1 - \mathrm{Forecast}_1}{\mathrm{Actual}_1}\right| \cdot 100 = \left|\frac{150 - 100}{150}\right| \cdot 100 = 33.33\ \mathrm{percent}\]

Case 2: Actual_2 = 100 and Forecast_2 = 150 (negative error)

    \[\mathrm{APE}_2 = \left|\frac{\mathrm{Actual}_2 - \mathrm{Forecast}_2}{\mathrm{Actual}_2}\right| \cdot 100 = \left|\frac{100 - 150}{100}\right| \cdot 100 = 50\ \mathrm{percent}\]

It is true that Case 1 (positive error of 50) is associated with a lower APE (Absolute Percentage Error) than Case 2 (negative error of 50). However, the reason is not that the error is positive or negative, but simply that the actual value changes. If the actual value stays constant, the APE is equal for both types of errors (Goodwin & Lawton 1999). This is clarified by the following example.

Case 3: Actual_3 = 100 and Forecast_3 = 50 (positive error)

    \[\mathrm{APE}_3 = \left|\frac{\mathrm{Actual}_3 - \mathrm{Forecast}_3}{\mathrm{Actual}_3}\right| \cdot 100 = \left|\frac{100 - 50}{100}\right| \cdot 100 = 50\ \mathrm{percent}\]

Case 4: Actual_4 = 100 and Forecast_4 = 150 (negative error)

    \[\mathrm{APE}_4 = \left|\frac{\mathrm{Actual}_4 - \mathrm{Forecast}_4}{\mathrm{Actual}_4}\right| \cdot 100 = \left|\frac{100 - 150}{100}\right| \cdot 100 = 50\ \mathrm{percent}\]

The second, equally invalid argument supporting the supposed asymmetry of the MAPE arises from the assumption about the predicted data. As the MAPE is mainly suited for evaluating predictions on a ratio scale, it is bounded on the lower side by an error of 100% (Armstrong & Collopy 1992). However, this does not imply that the MAPE overweights or underweights some types of errors; rather, such errors are simply not possible.

Its TRUE weaknesses!

It Fails if Some of the Actual Values Are Equal to Zero

This is a well-known problem of the measure. However, this issue and the following one were the reason for the development of a modified form of the MAPE, the SMAPE ("Symmetric" Mean Absolute Percentage Error). Ironically, in contrast to the original MAPE, this modified form suffers from true asymmetry (Goodwin & Lawton 1999). I will clarify this argument in the last section of the article.

Particularly Small Actual Values Bias the MAPE

If any true values are very close to zero, the corresponding absolute percentage errors will be extremely high and will therefore bias the informativeness of the MAPE (Hyndman & Koehler 2006). The following graph clarifies this point. Although all three forecasts have the same absolute errors, the MAPE of the time series with only one extremely small value is approximately twice as high as the MAPE of the other forecasts. This implies that the MAPE should be used carefully if there are extremely small observations, and it directly motivates the last and often ignored weakness of the MAPE.
[Figure: Same absolute errors, but one time series contains an extremely small actual value]
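The effect is easy to reproduce in a few lines of R; the series below is purely hypothetical, and the size of the distortion depends on how close the small value is to zero:

actual   <- c(100, 100, 100, 1)   # one extremely small actual value
forecast <- actual + 10           # the same absolute error of 10 at every point

mean(abs((actual - forecast) / actual)) * 100  # MAPE dominated by the last observation
mean(rep(10, 4) / rep(100, 4)) * 100           # the same absolute errors without the small value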

The MAPE Implies Only Which Forecast Is Proportionally Better

As mentioned at the beginning of this article, one advantage of using the MAPE for comparisons between forecasts of different time series is its unit independence. However, it is essential to keep in mind that the MAPE only implies which forecast is proportionally better. The following graph shows three different time series and their corresponding forecasts. The only difference between them is their general level. The same absolute errors therefore lead to profoundly different MAPEs. This raises the question of whether it is reasonable to use such a percentage-based measure for comparing forecasts across different time series. If the time series do not move on roughly comparable levels (as shown in the following graphic), using the MAPE to infer whether a forecast is generally better for one time series than for another relies on the assumption that the same absolute errors are less problematic for time series on higher levels than for time series on lower levels:
"If a time series fluctuates around 100, then predicting 101 is way better than predicting 2 for a time series fluctuating around 1."
That might be true in some cases. However, in general, this is a questionable assumption, or at least one people should always be aware of when using the MAPE to compare forecasts between different time series.
[Figure: Same absolute errors on time series with different general levels]

Summary

In summary, the discussed findings show that the MAPE should be used with caution as an instrument for comparing forecasts across different time series. First, a necessary condition is that the time series contains only strictly positive values. Second, even a few extremely small actual values have the potential to bias the MAPE heavily. Last, the MAPE depends systematically on the level of the time series, as it is a percentage-based error. This article critically questions whether it is meaningful to generalize from being a proportionally better forecast to being a generally better forecast.

BETTER alternatives!

The discussion above implies that the MAPE alone is often not very useful when the objective is to compare accuracy between forecasts for different time series. Although relying on a single, easily understandable measure appears convenient, it comes with a high risk of drawing misleading conclusions. In general, it is recommended to combine different measures. In addition to numerical measures, a visualization of the time series, including the actual and the forecasted values, always provides valuable information. However, if one single numeric measure is the only option, there are some excellent alternatives.

Scaled Measures

Scaled measures compare a measure of the focal forecast, for example the MAE, to the same measure of a benchmark method. Similar measures can be defined using the RMSE, the MAPE, or other measures. Common benchmark methods are the "random walk", the "naïve" method, and the "mean" method. These measures are easy to interpret, as they show how the focal model compares to the benchmark. However, it is important to keep in mind that relative measures rely on the selection of the benchmark method and on how well the time series can be forecasted by that method.

    \[\mathrm{Relative\ MAE} = \frac{\mathrm{MAE}_{\mathrm{focal\ model}}}{\mathrm{MAE}_{\mathrm{benchmark\ model}}}\]
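A minimal sketch in R, assuming actual holds the hold-out observations and forecast_focal and forecast_naive the forecasts of the focal model and of a naïve benchmark over the same period (all object names are hypothetical):

mae <- function(a, f) mean(abs(a - f))

relative_mae <- mae(actual, forecast_focal) / mae(actual, forecast_naive)
relative_mae  # < 1: the focal model beats the benchmark; > 1: the benchmark wins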

Scaled Errors

Scaled error approaches also try to remove the scale of the data by comparing the forecasted values to those obtained by some benchmark forecast method, like the naïve method. The MASE (Mean Absolute Scaled Error), proposed by Hyndman & Koehler (2006), is defined slightly differently depending on the seasonality of the time series. In the simple case of a non-seasonal time series, the error of the focal forecast is scaled by the in-sample MAE of the naïve forecast method. One major advantage is that it can handle actual values of zero and that it is not biased by very extreme values. Once again, it is important to keep in mind that scaled errors rely on the selection of the benchmark method and on how well the time series can be forecasted by that method.

    \[\mathrm{Non\text{-}Seasonal\ MASE} = \frac{1}{n} \sum_{i=1}^{n} \left|\frac{\mathrm{Actual}_i - \mathrm{Forecast}_i}{\frac{1}{T-1}\sum_{t=2}^{T} |\mathrm{Actual}_t - \mathrm{Actual}_{t-1}|}\right|\]

    \[\mathrm{Seasonal\ MASE} = \frac{1}{n} \sum_{i=1}^{n} \left|\frac{\mathrm{Actual}_i - \mathrm{Forecast}_i}{\frac{1}{T-m}\sum_{t=m+1}^{T} |\mathrm{Actual}_t - \mathrm{Actual}_{t-m}|}\right|\]
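In R, the non-seasonal MASE boils down to a few lines; actual_insample is the training series, while actual_test and forecast_test refer to the hold-out period (all object names are hypothetical):

# in-sample MAE of the one-step naive forecast (the denominator of the MASE)
naive_insample_mae <- mean(abs(diff(actual_insample)))

mase <- mean(abs(actual_test - forecast_test) / naive_insample_mae)
mase  # values below 1 indicate the forecast beats the in-sample naive method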

SDMAE

In my understanding, the basic idea of using the MAPE to compare forecasts across different time series is that the same absolute error is assumed to be less problematic for time series on higher levels than for time series on lower levels. Based on the examples shown earlier, I think that this idea is at least questionable. I argue that how good or bad a specific absolute error is judged should not depend on the general level of the time series but on its variation. Accordingly, the following measure, the SDMAE (Standard Deviation adjusted Mean Absolute Error), is a product of the discussed issues and my imagination. It can be used for evaluating forecasts for time series containing negative values and does not suffer from actual values that are equal to zero or particularly small. Note that this measure is not defined for time series that do not fluctuate at all, and there might be other limitations of this measure that I am currently not aware of.

    \[\mathrm{SDMAE} = \frac{\mathrm{MAE}_{\mathrm{focal\ model}}}{\mathrm{SD}_{\mathrm{actual\ values}}}\]
[Figure: SDMAE for forecasts of time series on different levels]
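Computationally, the SDMAE is as simple as its formula suggests; a sketch with hypothetical actual and forecast vectors:

sdmae <- mean(abs(actual - forecast)) / sd(actual)
sdmae  # absolute errors are judged relative to the variation of the series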

Summary

I suggest using a combination of different measures to get a comprehensive understanding of the performance of the different forecasts. I also suggest complementing the MAPE with a visualization of the time series, including the actual and the forecasted values, with the MAE, and with a scaled measure or scaled error approach. The SDMAE should be seen as an alternative approach that has not yet been discussed by a broader audience. I am thankful for your critical thoughts and comments on this idea.

Worse alternatives!

SMAPE

The SMAPE was created to solve and respond to the problems of the MAPE. However, it solves neither the problem of extremely small actual values nor the level dependency of the MAPE. The reason is that extremely small actual values are typically related to extremely small predictions (Hyndman & Koehler 2006). Additionally, and in contrast to the unmodified MAPE, the SMAPE raises the problem of asymmetry (Goodwin & Lawton 1999). This is clarified by the following graphic, where the "APE" relates to the MAPE and the "SAPE" relates to the SMAPE. It shows that the SAPE is higher for positive errors than for negative errors and is therefore asymmetric. Several researchers recommend against using the SMAPE (Hyndman & Koehler 2006).

    \[\mathrm{SMAPE} = \frac{1}{n} \sum_{i=1}^{n} \left|\frac{\mathrm{Actual}_i - \mathrm{Forecast}_i}{(\mathrm{Actual}_i + \mathrm{Forecast}_i)/2}\right| \cdot 100\]
[Figure: APE vs. SAPE – on the asymmetry of the symmetric MAPE (Goodwin & Lawton 1999)]
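The asymmetry is easy to verify numerically: for a fixed actual value of 100, the same absolute error of 10 yields different SAPE values depending on the sign of the error.

sape <- function(actual, forecast) {
  abs(actual - forecast) / ((actual + forecast) / 2) * 100
}

sape(100, 90)   # positive error (forecast too low):  approx. 10.5 percent
sape(100, 110)  # negative error (forecast too high): approx.  9.5 percent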

References

  • Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. International Journal of Forecasting, 15(4), 405-408.
  • Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688.
  • Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9(4), 527-529.
  • Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting, 8(1), 69-80.
 

Introduction

At STATWORX, we are not only helping customers find, develop, and implement a suitable data strategy, but we also spend some time doing research to improve our own tool stack. This way, we can give back to the open-source community. One special focus of my research lies in tree-based ensemble methods. These algorithms are bread-and-butter tools in predictive learning, and you can find them as a standalone model or as an ingredient of an ensemble in almost every supervised learning Kaggle challenge. Renowned models are Random Forest by Leo Breiman, Extremely Randomized Trees (Extra-Trees) by Pierre Geurts, Damien Ernst & Louis Wehenkel, and Multiple Additive Regression Trees (MART; also known as Gradient Boosted Trees) by Jerome Friedman. One thing I was particularly interested in was how much randomization techniques have helped improve prediction performance in all of the algorithms named above. In Random Forest and Extra-Trees, it is quite obvious: randomization is the reason why the ensembles offer an improvement over bagging; through the de-correlation of the base learners, the variance of the ensemble and therefore its prediction error decreases. In the end, you achieve de-correlation by "shaking up" the base trees, as it's done in the two ensembles. However, MART also profits from randomization. In 2002, Friedman published another paper on boosting, showing that you can improve the prediction performance of boosted trees by training each tree on only a random subsample of your data. As a side effect, your training time also decreases. Furthermore, in 2015, Rashmi and Gilad-Bachrach suggested adding dropout to the boosting ensemble: a method originally found and used in neural nets.

The Idea behind Random Boost

Inspired by theoretical readings on randomization techniques in boosting, I developed a new algorithm that I call Random Boost (RB). In its essence, Random Boost sequentially grows regression trees with random depth. More precisely, the algorithm is almost identical to and has the exact same input arguments as MART. The only difference is the parameter d_{max}. In MART, d_{max} determines the maximum depth of all trees in the ensemble. In Random Boost, the argument constitutes the upper bound of possible tree sizes. In each boosting iteration i, a random number d_i between 1 and d_{max} is drawn, which then defines the maximum depth of that tree T_i(d_i). In comparison to MART, this has two advantages. First, RB is faster than MART on average when both are equipped with the same value for the tree size parameter. When RB and MART are trained with the same maximum tree depth d_{max}, Random Boost will in many cases grow trees of depth d < d_{max} by construction. If you assume that for MART all trees are grown to their full size d_{max} (i.e. there is enough data left in each internal node so that tree growing doesn't stop before the maximum size is reached), you can derive a formula showing the relative computation gain of RB over MART:

    \[t_{rel}(d_{max}) = \frac{t_{\mathrm{RB}}(d_{max})}{t_{\mathrm{MART}}(d_{max})} \approx \frac{2}{d_{max}}\left(1 - \left(\frac{1}{2}\right)^{d_{max}} \right).\]

Here, t_{RB}(d_{max}) is the training time of an RB ensemble with the tree size parameter equal to d_{max}, and analogously for MART. To make it a bit more practical, the formula predicts that for d_{max} = 2, 3, and 4, RB takes 75%, 58%, and 47% of the computation time of MART, respectively. These predictions, however, should be seen as RB's best-case scenario, as MART does not necessarily grow full trees either. Still, the calculations suggest that efficiency gains can be expected (more on that later).

Second, there are also reasons to assume that randomizing over tree depths can have a beneficial effect on prediction performance. As already mentioned, from a variance perspective, boosting suffers from overcapacity for various reasons. One of them is choosing too rich a base learner in terms of depth. If, for example, one assumes that the dominant interaction in the data generating process is of order three, one would pick a tree of equivalent depth in MART in order to capture this interaction depth. However, this may be overkill: fully grown trees of depth 3 have eight leaves and therefore learn noise in the data if there are only a few such high-order interactions. Perhaps, in this case, a tree with depth 3 but fewer than eight leaves would be optimal. This is not accounted for in MART if one doesn't want to add a pruning step to each boosting iteration at the expense of computational overhead. Random Boost may offer a more efficient remedy to this issue. With probability 1/d_{max}, a tree is grown that is able to capture the high-order effect at the cost of also learning noise. In all the other cases, Random Boost constructs smaller trees that do not show the overcapacity behavior and that can focus on interactions of smaller order. If overcapacity is an issue in MART because the interactions in the data are governed by only a small number of high-order terms, Random Boost may perform better than MART. Furthermore, Random Boost also decorrelates trees through the extra source of randomness, which has a variance-reducing effect on the ensemble.

The concept of Random Boost constitutes only a slight change to MART. I used the sklearn package as a basis for my code. As a result, the algorithm is developed based on sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier and is used in exactly the same way (i.e. argument names match exactly and CV can be carried out with sklearn.model_selection.GridSearchCV). The only difference is that the RandomBoosting*-object uses max_depth to randomly draw tree depths for each iteration. As an example, you can use it like this:
rb = RandomBoostingRegressor(learning_rate=0.1, max_depth=4, n_estimators=100)
rb = rb.fit(X_train, y_train)
rb.predict(X_test)
For the full code, check out my GitHub account.

Random Boost versus MART – A Simulation Study

In order to compare the two algorithms, I ran a simulation on 25 Datasets generated by a Random Target Function Generator that was introduced by Jerome Friedman in his famous Boosting paper he wrote in 2001 (you can find the details in his paper. Python code can be found here). Each dataset (containing 20,000 observations) was randomly split into a 25% test set and a 75% training set. RB and MART were tuned via 5-fold CV on the same tuning grid.
  • learning_rate = 0.1
  • max_depth = (2, 3, ..., 8)
  • n_estimators = (100, 105, 110, ..., 195)
For each dataset, I tuned both models to obtain the best parameter constellation. Then, I trained each model on every point of the tuning grid again and saved the test MAE as well as the total training time in seconds. Why did I train every model again and didn’t simply store the prediction accuracy of the tuned models along with the overall tuning time? Well, I wanted to be able to see how training time varies with the tree size parameter.

A Comparison of Prediction Accuracies

You can see the distribution of MAEs of the best models on all 25 datasets below.
absolute MAE
Evidently, both algorithms perform similarly. For a better comparison, I compute the relative difference between the predictive performance of RB and MART for each dataset j, i.e. MAE_{rel,j}=frac{MAE_{RB,j}-MAE_{MART,j}}{MAE_{MART,j}}. If MAE_{rel,j}>0, then RB had a larger mean absolute error than MART on dataset j, and vice versa.
MAE by dataset with boxplot
In the majority of cases, RB did worse than MART in terms of prediction accuracy (MAE_{rel}>0). In the worst case, RB had a 1% higher MAE than MART. In the median, RB has a 0.19% higher MAE. I'll leave it up to you to decide whether that difference is practically significant.

A comparison of Training times

When we look at training time, we get a quite clear picture. In absolute terms, it took 433 seconds to train all parameter combinations of RB on average, as opposed to 803 seconds for MART.
average training time
The small black lines on top of each bar are the error bars (two times the standard deviation of the mean; rather small in this case). To give you a better feeling of how each model performed on each dataset, I also plotted the training times for each round.
total training time
If you now compute the training time ratio between MART and RB (frac{t_{MART}}{t_{RB}}), you see that RB is roughly 1.8 times faster than MART on average. Another perspective is to compute the relative training time t_{rel,j}=frac{t_{RB,j}}{t_{MART,j}}, which is just 1 over the speedup. Note that this measure has to be interpreted a bit differently from the relative MAE measure above. If t_{rel,j}=1, then RB is as fast as MART; if t_{rel,j}>1, then it takes longer to train RB than MART; and if t_{rel,j}<1, then RB is faster than MART. In the median, RB needs only roughly 54% of MART's tuning time, and it is noticeably faster in all cases. I was also wondering how the relative training time varies with d_{max} and how well the theoretically derived lower bound from above fits the actually measured relative training time. That's why I computed the relative training time across all 25 datasets by tree size.
Tree size (max_depth) Actual Training time (RB / MART) Theoretical lower bound
2 0.751 0.750
3 0.652 0.583
4 0.596 0.469
5 0.566 0.388
6 0.532 0.328
7 0.505 0.283
8 0.479 0.249
The theoretical figures are optimistic, but the relative performance gain of RB increases with tree size.

Results in a Nutshell and Next Steps

As part of my research on tree-based ensemble methods, I developed a new algorithm called Random Boost. Random Boost is based on Jerome Friedman’s MART, with the slight difference that it fits trees of random size. In total, this little change can reduce the problem of overfitting and noticeably speed up computation. Using a Random Target Function Generator suggested by Friedman, I found that, on average, RB is roughly twice as fast as MART with a comparable prediction accuracy in expectation. Since running the whole simulation takes quite some time (finding the optimal parameters and retraining every model takes roughly one hour for each data set on my Mac), I couldn’t run hundreds or more simulations for this blog post. That’s the objective for future research on Random Boost. Furthermore, I want to benchmark the algorithm on real-world datasets. In the meantime, feel free to look at my code and run the simulations yourself. Everything is on GitHub. Moreover, if you find something interesting and you want to share it with me, please feel free to shoot me an email.

References

  • Breiman, Leo (2001). Random Forests. Machine Learning, 45, 5–32
  • Chen, Tianqi, and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 785-794
  • Chapelle, Olivier, and Yi Chang. 2011. “Yahoo! learning to rank challenge overview”. In Proceedings of the Learning to Rank Challenge, 1–24.
  • Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.
  • Friedman, J. H. (2002). “Stochastic gradient boosting”. Computational Statistics & Data Analysis 38 (4): 367–378.
  • Geurts, Pierre, Damien Ernst, and Louis Wehenkel (2006). “Extremely randomized trees”. Machine learning 63 (1): 3–42.
  • Rashmi, K. V., and Ran Gilad-Bachrach (2015). DART: Dropouts meet Multiple Additive Regression Trees. Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38.
   

In this blog post, we will explore the Bagging algorithm and a computationally more efficient variant thereof, Subagging. With minor modifications, these algorithms are also known as Random Forest and are widely applied here at STATWORX, in industry, and in academia.

Almost all statistical prediction and learning problems encounter a bias-variance tradeoff. This is particularly pronounced for so-called unstable predictors. While they yield low-bias estimates due to their flexible adaptation to the data, these kinds of predictors react very sensitively to small changes in the underlying dataset and hence have high variance. A common example is the Regression Tree predictor.

Bagging bypasses this tradeoff by reducing the variance of the unstable predictor, while leaving its bias mostly unaffected.

Method

In particular, Bagging uses repeated bootstrap sampling to construct multiple versions of the same prediction model (e.g. Regression Trees) and averages over the resulting predictions.

Let’s see how Bagging works in detail:

  1. Construct a bootstrap sample (y_{i}^{*}, mathbf{x_{i}^{*}}) :(i = 1, dots , n) (with replacement) of the original i.i.d. data at hand (y_{i}, mathbf{x_{i}}) : (i = 1, dots , n).
  2. Fit a Regression Tree to the bootstrap sample – we will denote the tree predictor by hat{theta}_{n}^{*}(mathbf{x}).
  3. Repeat Steps one and two B many times and calculate frac{1}{B}sum_{b=1}^{B}hat{theta}_{n, b}^{*}(mathbf{x}) .

OK – so let us take a glimpse into the construction phase: We draw in total B different bootstrap samples simultaneously from the original data. Then to each of those samples a tree is fitted and the (in-sample) fitted values are averaged in Step 3 yielding the Bagged predictor.

The variance-reduction happens in Step 3. To see this, consider the following toy example.

Let X_1, dots, X_n be i.i.d. random variables with mu = E[X_1] and sigma^2 = Var[X_1] and let bar{X}= frac{1}{n}sum_{i=1}^{n}X_{i}. Easy re-formulations show that

  • Var[bar{X}]=frac{sigma^2}{n} leq sigma^2
  • E[bar{X}]=mu

We observe that indeed the variance of the mean is weakly smaller than for the individual random variables while the sample mean is unbiased.

It is widely discussed in the literature why Bagging works, and it remains an open research question. Bühlmann and Yu (2002) propose a subsampling variant of Bagging, called Subagging, which is more tractable from a theoretical point of view.

In particular, Bühlmann and Yu (2002) replace the bootstrap procedure of Bagging by subsampling without replacement. Essentially, we only change Step 1 of our Bagging algorithm by randomly drawing m observations without replacement from our original data, with m < n, and hence obtain a subset of size m. With this variant at hand, it is possible to state upper bounds for the variance and mean squared error of the predictor, given an appropriate choice of the subsample size m.

Simulation Set-Up

As the theory is a little bit cumbersome and involves knowledge of real analysis, we simulate the main findings of Bühlmann and Yu (2002).

Let’s compare the mean-squared prediction error (MSPE) of the Regression Tree, Bagged and Subagged predictor and illustrate the theory part a little bit more.

In order to do this, we consider the following model

    \[y_{i} = f(\mathbf{x}_{i}) + \epsilon_{i}\]

where f(mathbf{x}_{i}) is the regression function, mathbf{x}_{i} sim U^{10}[0,1] is the design matrix generated from a uniform distribution and epsilon_{i}sim N(0,1) is the error term (forall i = 1, dots, n).

For the true data-generating process (DGP), we consider the following model which is quite frequently used in the machine learning literature and termed “Friedman #1”-model:

    \[f(\mathbf{x}) = 10 \sin(\pi x^{(1)} x^{(2)}) + 20\left(x^{(3)} - \frac{1}{2}\right)^{2} + 10 x^{(4)} + 5 x^{(5)}\]

where mathbf{x}^{(j)} is the j-th column of the design matrix mathbf{x} (for 1 leq j leq 10).

As you can see, this model is highly non-linear – Regression Tree models shall therefore be appropriate to approximate our DGP.

To evaluate the prediction performance of Bagging and Subagging predictors we conduct a Monte Carlo simulation in Python.

We first import the relevant packages.

import numpy as np
import sklearn.model_selection
import sklearn.ensemble
import simulation_class
import math
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

The module simulation_class is a user-specified class that we will not discuss in this blog post but in a subsequent one.

Further, we specify the simulation set-up:

# Number of regressors
n_reg = 10

# Observations
n_obs = 500

# Simulation runs
n_sim = 50

# Number of trees, i.e. number of bootstrap samples (Step 1)
n_tree = 50

# Error Variance
sigma = 1

# Grid for subsample size
start_grid = 0.1
end_grid = 1
n_grid = 100

grid_range = np.linspace(start_grid, end_grid, num = n_grid)

Below, we will explain in more detail what we need the grid specification for.

To store our simulation results we set up containers.

# Container Set-up
mse_temp_bagging = np.empty(shape = (n_obs, n_sim))
mse_temp_subagging = np.empty(shape = (n_obs, n_sim))

y_predict_bagging = np.empty(shape = (n_obs, n_sim))
y_predict_subagging = np.empty(shape = (n_obs, n_sim))

mse_decomp = np.empty(shape = (len(grid_range), 2))

With this initialization at hand, we generate the test and train data by the simulation_class module.

# Creation of Simulation-Data
train_setup = simulation_class.simulation(n_reg = n_reg,
                                          n_obs = n_obs,
                                          n_sim = n_sim,
                                          sigma = sigma,
                                          random_seed_design = 0,
                                          random_seed_noise = 1)

test_setup = simulation_class.simulation(n_reg = n_reg,
                                         n_obs = n_obs,
                                         n_sim = n_sim,
                                         sigma = sigma,
                                         random_seed_design = 2,
                                         random_seed_noise = 3)

f_train = train_setup.friedman_model()
X_train, y_train = train_setup.error_term(f_train)

f_test = test_setup.friedman_model()
X_test, y_test = test_setup.error_term(f_test)

As we have generated the data for our "Friedman #1" model, we are now able to simulate the mean squared error of the Bagged and the Subagged predictor. In Python, both algorithms are implemented via the BaggingRegressor method of the sklearn.ensemble package. Observe that for the Subagged predictor we need to specify the parameter max_samples in the BaggingRegressor. This ensures that we can draw a subsample of size m = a * n with subsample fraction a from the original data. Indeed, for the subsample fraction a we have already specified the grid above via the variable grid_range.

# Subagging-Simulation
for index, a in enumerate(grid_range):
    for i in range(0, n_sim):
        # bagged estimator
        bagging = sklearn.ensemble.BaggingRegressor(
            bootstrap = True,
            n_estimators = 50)

        y_predict_bagging[:,i] = bagging.fit(
            X_train,
            y_train[:,i]).predict(X_test)

        mse_temp_bagging[:,i] = mean_squared_error(
            y_test[:,i],
            y_predict_bagging[:,i])

        # subagged estimator
        subagging = sklearn.ensemble.BaggingRegressor(
            max_samples = math.ceil(a*n_obs),
            bootstrap = False,
            n_estimators = 50)

        y_predict_subagging[:,i] = subagging.fit(
            X_train,
            y_train[:,i]).predict(X_test)

        mse_temp_subagging[:,i] = mean_squared_error(
            y_test[:,i],
            y_predict_subagging[:,i])

    # store the average MSE of the Bagged (column 0) and Subagged (column 1)
    # predictor for the current subsample fraction
    mse_decomp[index, 0] = np.mean(mse_temp_bagging)
    mse_decomp[index, 1] = np.mean(mse_temp_subagging)

On my GitHub-Account you can find additional code which also calculates the simulated bias and variance for the fully grown tree and the Bagged tree.

Results

The results of our above simulation can be found in Figure 1.

Let us first compare the performance in terms of MSPE of the Regression Tree and the Bagged predictor. Table 1 shows us that Bagging drastically reduces the MSPE by decreasing the variance while almost not affecting the bias. (Recall – the mean squared prediction error is just the sum of the squared bias of the estimate, variance of the estimate and the variance of the error term (not reported).)

Table 1: Performance of fully grown tree and Bagged Predictor

Predictor Tree (fully grown) Bagged Tree
Bias^2 3.47 2.94
Variance 6.13 0.35
MSPE 10.61 4.26

Figure 1 displays the MSPE as a function of the subsample fraction a for the Bagged and Subagged predictor (our above code). Together with Figure 1 and Table 1, we make several observations:

  • We see that both the Bagged and Subagged predictor outperform a single tree (in terms of MSPE).
  • For a subsampling fraction of approximately 0.5, Subagging achieves nearly the same prediction performance as Bagging while coming at a lower computational cost.

MSPE-Comparison of Bagging and Subagging

References

  1. Breiman, L.: Bagging predictors. Machine Learning, 24, 123–140 (1996).
  2. Bühlmann, P., Yu, B.: Analyzing bagging. Annals of Statistics 30, 927–961 (2002).

Last time we dove deep into the world of a little salad bar just a few steps away from the STATWORX office. This time we are going to dig even deeper … well, we are going to dig a little deeper. Today’s Specials are cross-elasticities and the effect of promotions.

We talked so much about salads because the situation of our little island of semi-healthy lunch choices makes for an illustrative case of how we can calculate price elasticities – the measure generally agreed upon by economists to evaluate how a change in price affects demand. Since the intricacies of deriving these price elasticities of demand with regressions are the subject of this short blog series, the salad bar was a convenient example.

What we did so far

Specifically, in my last post, we wanted to know how a linear regression function relates to elasticity. It turns out that this depends on how the price and demand variables have been transformed. We explored four different transformations, and in the end, we came to the conclusion that the log-log model fits our data best.

This was by no means an accident. Empirical explorations may occasionally guide us in choosing a different direction, but microeconomic arguments are heavily in favor of log-log models. The underlying demand curve describes demand most like economists assume it to behave. This model ensures that demand cannot fall below zero as the price increases, while demand grows exponentially as the price decreases. Yet, the fact that a constant elasticity value can be derived from it is its most desirable feature. It makes the utilization of elasticity that much simpler.

Granted, if I'm being honest, the real reason the log-log model worked may have to do with the fact that I created the data. This salad vendor does exist, but obviously, they certainly did not fork over their sales data just because I thought their store would make an illustrative example. All the data we worked with was the result of how I imagined this little salad vendor's market situation to be. Daily sales prices over the past two years were simulated for our little salad bar by randomly selecting prices within a certain price range; then a multiplicative demand function was used to derive sales with some added randomness. And with that, we were done simulating the data.

Obviously, this is far from the often messy historical data that one will encounter at retailers or any business that sells anything. There was no consideration of competition, no in-store alternatives, no promotional activities, no seasonal effects, or any of the other business-specific factors that obscure price effects. The relationship between price and demand is usually obfuscated by the many other factors that influence a consumer's buying decision. Thus, it is the intricacy of isolating these interfering factors that determines the success of empirical work.

A closer look at the data at hand

To illustrate this, I went back to the sales data drawing board and made up some more data. For details, check out the code on our Github page. The results can be seen in the graph below. The graph actually consists of two parts: a scatter plot that illustrates daily sales quantities over time and a line graph that describes the price development over time. If one looks more closely at the graph, the development of sales cannot always be explained by just looking at the price. The second half of the year 2014 illustrates this most glaringly. In some cases, sales spikes occur that seem to be unrelated to the product price.

price and sales over time

Additional information is thus needed. Obviously, there is a multitude of possible factors that might explain the discrepancies in the relationship between price and demand. And of course, when offered, we look at anything provided to us in order to evaluate whether we can extract some pricing-relevant insight from it. The two most desired requests concern information about promotional activities and intel on competitor prices. Luckily, we do not have to ask, as I simulated the data and integrated it into the previously seen graph (see below). Promotional activities by the salad bar are indicated by thin blue lines across the two graphs, and the pricing history of the closest competitor is illustrated in the lower graph.

price and sales over time with promotions

Almost surprisingly (though not really), the graph illustrates that both promotions and competitor prices impact the number of salads sold by our small salad bar. If we were now running the same log-log regression, the resulting elasticity score would be skewed.

But how does one include both the promotional data and the competitor price information? There are several fancy ways of doing it, but for now we’ll stick with simple (also often efficient) ways. For the promotional data, we will use a dummy variable – coded one for every day our salad bar turned on its promotions engine and zero otherwise. In reality though, the salad bar’s key promotional tools are an incentive-lacking stamp card and the occasional flyer with some time-limited coupons. The data describes the latter.

The theory of cross price elasticity

To utilize the prices of competitors, we quickly need to delve into some economic theory. Next to price elasticity of demand, there is a second concept called cross price elasticity of demand. The two concepts are as similar as their names suggest. Just look at the two formulas.

The formula for price elasticity:

    \[\epsilon = \frac{\Delta qty/qty}{\Delta price/price}\]

And the formula for cross price elasticity:

    \[\epsilon_{C} = \frac{\Delta qty/qty}{\Delta price_{Comp}/price_{Comp}}\]

The only thing that differs is the price with which one calculates: instead of using the product's own price, one uses the competitor's price. Conceptually, however, price elasticity and cross price elasticity differ slightly. With cross price elasticities, it is plausible to get positive scores. With own-price elasticity, this is highly implausible and almost always indicates omitted variable bias (granted, some luxury goods, like iPhones, may be the exception).

Cross elasticities, in contrast, are actually commonly positive. The reason is captured by the terms substitutional goods and complementary goods. An obvious substitutional good would be a salad from the restaurant next to the salad bar. One eats either at the salad bar or at the restaurant. Thus, a higher price for the salad at the restaurant could make it more likely that more customers choose to buy a salad at our salad bar. A higher price at the restaurant would therefore positively impact sales at the salad bar. The corresponding cross elasticity score would be positive.

The influence of competitors

But we can also imagine negative cross elasticity scores. In this case, we are confronted with a complementary product. For example, next to our salad bar is a cute little coffee place. Coffee is perfect at overcoming the after-lunch food coma, so many people want one. Unfortunately, the greedy owner of this cute coffee shop just increased her prices – again! Annoying, but not a big problem for us: the restaurant with which our salad bar competes also sells cheap coffee. Now this is a big problem for the salad bar. Its salad and the coffee from the cute but greedy shop are complementary products. As the coffee price increases, salad sales at our salad bar slow.

In Frankfurt, there are obviously only coffee shops with very reasonably priced coffee, so we stick to just the information about the main competitor. But how do we operationalize the competitor price data? The conceptual and mathematical closeness between elasticity and cross elasticity suggests that one could treat them similarly. And indeed, it is a good starting point to include competitor prices the same way as the actual product prices – spoiler alert: logged.
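In R, the "advanced" specification shown in the table below then amounts to a single lm() call; the data set and its column names (qty, price, promo, price_comp) are of course only placeholders for the simulated data:

advanced <- lm(log(qty) ~ log(price) + promo + log(price_comp),
               data = salad_sales)
summary(advanced)  # the coefficient on log(price) is the own-price elasticity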

Advanced Simple
Estimate Std. Error Estimate Std. Error
Intercept 5.36 0.09 8.43 0.11
Log Price -1.45 0.03 -1.76 0.05
Promo 0.45 0.02
Log Price_{Comp} 1.23 0.03
Adjusted R^2 0.80 0.41

Let's look at the output in the table above. As I was in full control of the data generation process – and since we know what the underlying elasticity actually was – I set it at -1.5. Obviously, because of the minuscule level of randomness added, both models do reasonably well. Even without consulting the R^2 and the standard errors, there is still a clear winner. By ignoring the impact of promotional activities and competitor prices, the price elasticity estimate of the simpler model is biased. Here the difference is relatively small, but with real data, the difference can be substantial or, since essential information is omitted, even systematic.

That is all! In the next blog post in this series, we introduce more complex relationships between price, promotions, competitor pricing, and concepts to utilize the insights for business purposes.

Machine Learning (ML) is still an underdog in the field of economics. However, it has been gaining more and more recognition in recent years. One reason for being an underdog is that in economics and other social sciences, one is not only interested in prediction but also in making causal inferences. Thus, many "off-the-shelf" ML algorithms solve a fundamentally different problem. We at STATWORX also face a variety of such problems, e.g. dynamic pricing optimization.

“Prediction by itself is only occasionally sufficient. The post office is happy with any method that predicts correct addresses from hand-written scrawls…[But] most statistical surveys have the identification of causal factors as their ultimate goal.” – Bradley Efron

Introduction

However, the literature on combining ML and causal inference is growing by the day. One common problem of causal inference is the estimation of heterogeneous treatment effects. So, we will take a look at three interesting and different approaches to it and focus on a very recent paper by Athey et al., which is forthcoming in "The Annals of Statistics"1.

Model-based Recursive Partitioning

One of the earlier papers about causal trees is by Zeileis et al. (2008)2. They describe an algorithm for Model-based Recursive Partitioning (MOB), which applies recursive partitioning to more complex models. They first fit a parametric model to the data set using maximum likelihood, then test for parameter instability over a set of predefined variables, and lastly split the model with respect to the variable showing the highest parameter instability. Those steps are repeated in each of the daughter nodes until a stopping criterion is reached. However, they do not provide statistical properties for MOB, and the partitions are still quite large.

Bayesian Additive Regression Tree

Another paper uses Bayesian Additive Regression Trees (BART) for the estimation of heterogeneous treatment effects3. One advantage of this approach is that BART can detect and handle interactions and non-linearities in the response surface. It uses a sum-of-trees model: first, a weak-learning tree is grown, then the residuals are calculated, and the next tree is fitted to these residuals. Similar to boosting algorithms, BART aims to avoid overfitting. This is achieved by a regularization prior, which restricts the contribution of each tree to the final result.

Generalized Random Forest

However, this and the next blog post will mainly focus on the Generalized Random Forest (GRF) by Athey et al., who have already been exploring the possibilities of ML in economics before. It is a method for non-parametric statistical estimation that builds on the basic ideas of the Random Forest. It therefore keeps the recursive partitioning, subsampling, and random split selection. Nevertheless, the final outcome is not estimated via simple averaging over the trees. Instead, the forest is used to estimate an adaptive weighting function: we grow a set of trees, and each observation is weighted according to how often it falls into the same leaf as the target observation. These weights are then used to solve a "local GMM" model.

Another important piece of the GRF is the split selection algorithm, which emphasizes maximizing heterogeneity. Within this framework, a wide variety of applications is possible, such as quantile regressions but also the estimation of heterogeneous treatment effects. Therefore, the split selection must be suitable for a lot of different purposes. As in Breiman's Random Forest, splits are selected greedily. However, in the case of general moment estimation, we don't have a direct loss criterion to minimize. Instead, we want to maximize a criterion ∆, which favors splits that increase the heterogeneity of our in-sample estimates. Maximizing ∆ directly would be computationally costly, so Athey et al. use a gradient-based approximation of it. This results in a computational performance similar to standard CART approaches.

Comparing the regression forest of GRF to standard random forest

Athey et al. claim in their paper that in the special case of a regression forest, the GRF yields the same results as the standard random forest by Breiman (2001). One estimation method already implemented in the grf package4 is therefore a regression forest. I will compare its results with the random forest implementations of the randomForest package as well as the ranger package. For tuning purposes, I use a random search with 50 iterations for the randomForest and ranger packages and, for the grf, the implemented tune_regression_forest() function. The algorithms are benchmarked on 3 data sets, which have already been used in another blog post, using the RMSE to compare the results. For easy handling, I implemented regression_forest() into the caret framework, which can be found on my GitHub.
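Stripped of the tuning, the comparison essentially boils down to fitting all three implementations on the same split and scoring them with the RMSE. The sketch below uses hypothetical train/test objects and omits the random search and the tune_regression_forest() step:

library(grf)
library(ranger)
library(randomForest)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# X_train / X_test are numeric feature matrices, y_train / y_test numeric targets,
# train_df / test_df the corresponding data frames with an outcome column 'y'
grf_fit    <- regression_forest(X_train, y_train)
ranger_fit <- ranger(y ~ ., data = train_df)
rf_fit     <- randomForest(y ~ ., data = train_df)

rmse(predict(grf_fit, X_test)$predictions, y_test)
rmse(predict(ranger_fit, data = test_df)$predictions, y_test)
rmse(predict(rf_fit, newdata = test_df), y_test)

The results of the full, tuned comparison are summarized in the table below.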

Data Set Metric grf ranger randomForest
air RMSE 0.25 0.24 0.24
bike RMSE 2.90 2.41 2.67
gas RMSE 36.0 32.6 34.4

The GRF performs a little bit worse in comparison with the other implementations. However, this could also be due to the tuning of the parameters, because there are more parameters to tune. According to their GitHub, the authors are planning to improve the tune_regression_forest() function.
One advantage of the GRF is that it produces valid confidence intervals for each estimation point. In order to do so, it performs honest tree splitting, which was first described in the authors' paper about causal trees5. With honest tree splitting, one sample is used to make the splits and another, distinct sample is used to estimate the coefficients.

However, standard regression is not the exciting part of the Generalized Random Forest. Therefore, I will take a look at how the GRF performs in estimating heterogeneous treatment effects with simulated data and compare it to the estimation results of the MOB and the BART in my next blog post.

References

  1. Athey, Tibshirani, Wager. Forthcoming. "Generalized Random Forests"
  2. Zeileis, Hothorn, Hornik. 2008. "Model-based Recursive Partitioning"
  3. Hill. 2011. "Bayesian Nonparametric Modeling for Causal Inference"
  4. https://github.com/swager/grf
  5. Athey and Imbens. 2016. "Recursive partitioning for heterogeneous causal effects."

A few hundred meters from our office, there is a little lunch place. It is part of a small chain that specializes in assemble-yourself, ready-to-eat salads. When we moved into our new office a few years ago, this salad vendor quickly became a daily fixture. However, over time, this changed. We still eat there regularly, but I am certain that if one were to look at their STATWORX-related turnover, the trend would not delight management. The question is: why?
The answer has a lot to do with the arrival of new competitors, improved cooking skills, elaborate promotions and certainly also pricing. It is the latter – pricing – that will be at the center of this series.

When analyzing pricing related issues, it is often of essential interest to have a measure of how some change in price affects demand. The measure generally agreed upon by economists to describe this relationship is that of price elasticity of demand, epsilon. As a relative measure, it is unit independent, which turns it into a winner. Elasticity is defined as the percent change in quantity divided by percentage change in price:

    \[\epsilon = \frac{\Delta qty/qty}{\Delta price/price}\]

Conceptually, three conditions are commonly distinguished: elasticity scores of less than -1 indicate an 'elastic demand.' This means that if one increases the price by one percent, the quantity demanded decreases by more than one percent. The other two conditions are elasticities between -1 and 0, in which case we speak of 'inelastic demand,' and the case when elasticity equals exactly -1, which is called 'unit elastic demand.'

Being able to deduce the actual price elasticity of demand for our salad bar would be of great help. With a reliable elasticity estimate at hand, we can answer questions like: How many salads can we expect to sell at a given price? How does a price change of 10% affect demand? With this knowledge, it would be possible to utilize one's pricing as a tool in order to target different salad-business KPIs. Eventually, the salad bar can adjust its price in order to maximize profit or to increase sales – depending on its strategic objectives.

It is the intricacies of deriving these price elasticities of demand with regressions that will be the subject of this short blog series. The situation of our salad vendor makes for an illustrative case of how we can calculate price elasticities and how they can be used to adjust one's pricing strategy.

Setup

To be upfront – although this salad vendor exists, and it is in fact an integral part of the STATWORX food chain – all the data we work with is made up. It describes how I imagine this little salad vendor's market situation to be. With each new post, as we examine more complex issues, we will delve deeper into the intricacies of the salad vendor's world.

The question of this blog post is simple: How can we use linear regression to derive price elasticities? To explore this, we need historic prices and sales information. To begin, there will be no consideration of competition, no in-store alternatives, no promotional activities, no seasonal effects, or anything else.

Daily sales prices for the past two years were simulated for our little salad bar by randomly selecting prices between 5.59€ and 9.99€ – clearly not a great pricing strategy, but it suffices for this post's purposes. A multiplicative demand function was used to derive sales with some randomness added. And with that, we are done simulating the data. For more details, check out the code on our Github page.

Calculating Elasticity of Demand

We want to know how a linear regression function relates to elasticity. It turns out that this depends on how the variables have been transformed. It is possible to deduce elasticity – a factor of relative change – in almost any situation. Below you find the four most common transformations.

Transformation Function Elasticity
Level-Level Y = a+bX epsilon=b*frac{X}{Y}
Log-Level log(Y) = a+bX epsilon=b*X
Level-Log Y = a+b*log(X) epsilon=frac{b}{Y}
Log-Log log(Y) = a+b*log(X) epsilon=b

Depending on the pre-regression variable transformation, different post-regression transformations are necessary in order to derive the elasticity scores. The table above shows that in the case of a log-log model, the elasticity is a constant value across the entire demand curve, while in all other cases, it depends on the specific current price and/or demand. This means that the choice of the model is indicative of the assumed demand curve. Choosing wrongly results in a misspecified model.
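To make this concrete, here is what the log-log variant looks like in R; sales and price are hypothetical column names of the simulated data set:

loglog <- lm(log(sales) ~ log(price), data = salad_sales)
coef(loglog)["log(price)"]  # in the log-log model, this coefficient is the elasticity

For the other transformations, the estimated coefficient has to be rescaled with the current price and/or demand, as shown in the table above.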

This is great to know, but which model should one use? To evaluate this, I simply ran each of these four models. The results can be found in the table below, but they are nothing like what you will ever find in the real world, in that all effects are highly significant and the R^2 is ridiculously high for any social or economic analysis. This is by design, as hardly any randomness was added. In addition, the data was set up in a way that predestined the log-log model to be the winning model.

Model Intercept Price Variable varnothing Elasticity R^2
Level-Level 439.58 (3.2) -38.57 (0.42) -2.50 0.84
Log-Level 6.59 (0.01) -0.23 (0.01) -1.63 0.95
Level-Log 671.22 (3.53) -265.66 (1.83) -2.11 0.93
Log-Log 7.86 (0.01) -1.52 (0.01) -1.51 0.97

The argument is not that a log-log model is always the best model to derive elasticities, although there are strong microeconomic arguments to be made for why the log-log model is the most reasonable model to describe demand elasticity. The underlying demand curve describes demand most like economists assume it to behave. It ensures that demand cannot fall below zero as the price increases, while demand grows exponentially as the price decreases. Yet, as mentioned before, the fact that a constant elasticity value can be derived from it is its most desirable feature. This makes it much easier to apply elasticity to optimize pricing.

Still, empirical analysis might guide us to assume other price-demand relationships. The graphic below shows this in an illustrative way. The legend of the graph orders the models in increasing order of fit. Looking at each graph, it becomes clear why the level-log model fares better than the level-level model, why the log-level model outperforms the level-log model, and so on. The non-linear relationship between price and demand that we introduced by relying on a multiplicative demand curve is best described by the log-log model. Had I used an additive demand curve, the ranking would have been the other way around. Thus, the argument is that under certain circumstances, the model choice can have a significant impact.

elasticity regression comp

For the application in practice, we have to be very aware of the functional form that is implied by the regression we choose. The effects can have severe consequences. The elasticity with which the data was generated was -1.5. In order to illustrate the effect that model choice can have on the estimated elasticity, I calculated average elasticities for the level-level, log-level, and level-log models and compared them with the price coefficient of the log-log model. This is a bit of an oversimplification, but the point still stands: The results are substantially different, which has consequences when one tries to utilize the derived elasticities. Yet, based on the graphical analysis and the model information, we would come to the conclusion that the log-log model fares best, so we can proceed as theory would want us to.

Price Optimization

Before we finish, let’s quickly look at how we can use elasticity to improve the little salad vendor’s erratic pricing strategy (my random daily price changes). For this, we need to know the salad bar’s cost function. Luckily, we do: it has fixed costs of about 300€ for every day it is open, and preparing a single salad costs about 2.50€. The cost function is thus:

    \[TotalCost = 300€ + SaladsSold * 2.50€\]

Microeconomic theory teaches us that fixed costs do not matter when calculating elasticity-based, margin-optimized prices. I’ll spare you the details, but the resulting formula for the optimal price is:

    \[OptimalPrice = \frac{Elasticity * CostPerSalad}{1 + Elasticity}\]
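
For readers who miss the spared details, here is a compact sketch of where this formula comes from, assuming a constant-elasticity demand curve Q(P) and a constant marginal cost c per salad (the fixed costs vanish when differentiating):

    \[\frac{d}{dP}\Big[Q(P)\,(P - c) - FixedCost\Big] = Q'(P)\,(P - c) + Q(P) = 0\]

    \[\text{With } \frac{Q'(P)\,P}{Q(P)} = \epsilon: \quad \frac{P - c}{P} = -\frac{1}{\epsilon} \;\Rightarrow\; P^{*} = \frac{\epsilon\,c}{1 + \epsilon}\]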

Applying the elasticity derived from the log-log model results in a proposed optimal price that lies somewhere between 7.21€ and 7.47€, with a point estimate of 7.34€. Instead of changing its price randomly every day, the salad bar would do best to stick to prices in this range. The vendor can then expect to sell around 125 salads each day, ensuring a daily profit of between 287€ and 320€.

KPI Lower-Bound Elasticity Expected Elasticity Upper-Bound Elasticity
Elasticity -1.51 -1.52 -1.54
Opt. Price 7.43€ 7.30€ 7.17€
Quantity 126 125 125
Profit 319€ 302€ 285€
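
Putting the pieces together, the following minimal sketch computes the optimal price, expected demand, and daily profit from the (rounded) log-log estimates above. The helper names are mine, and because the KPI table was produced with full-precision estimates, the printed figures match it only approximately.

```python
import numpy as np

# Point estimates from the log-log model above (rounded).
intercept, elasticity = 7.86, -1.52
fixed_cost, cost_per_salad = 300.0, 2.50

def optimal_price(eps, c=cost_per_salad):
    """Margin-optimal price under constant elasticity: P* = eps*c / (1 + eps)."""
    return eps * c / (1 + eps)

def expected_demand(price):
    """Demand curve implied by the log-log model: Q = exp(intercept) * P^elasticity."""
    return np.exp(intercept + elasticity * np.log(price))

def daily_profit(price):
    """Expected daily profit at a given price."""
    return expected_demand(price) * (price - cost_per_salad) - fixed_cost

p_star = optimal_price(elasticity)
print(f"optimal price:   {p_star:.2f}€")                    # about 7.31€ with these rounded inputs
print(f"expected salads: {expected_demand(p_star):.0f}")    # about 126
print(f"expected profit: {daily_profit(p_star):.0f}€")      # about 306€
```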

This is a daring statement. With the example data, the conclusion would be fine, but in practice such perfect regression results, with so little uncertainty, are unrealistic. And this is where it tends to get tricky. To illustrate the point, I increased the standard error of the elasticity estimate from almost nonexistent to 0.15. The coefficient would still be highly significant, but the table below shows how surprising the consequences of such a seemingly small change can be.

KPI Lower-Bound Elasticity Expected Elasticity Upper-Bound Elasticity
Elasticity -1.23 -1.52 -1.82
Opt. Price 13.51€ 7.30€ 5.57€
Quantity 106 125 114
Profit 864€ 302€ 51€

The certainty with which we proposed the optimal price was very much unfounded. In this example, the range for the elasticity is still relatively small despite the increased uncertainty. Yet the resulting range for the ideal price stretches from 5.58€ to 13.73€, which is not a very precise proposal; its upper end even exceeds the highest price the little salad vendor ever dared to set. The consequences are severe: the resulting profit varies more than sixteen-fold between the highest and lowest prices.
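
The mechanics behind this blow-up are easy to sketch. Assuming the bounds above correspond to a normal 95% confidence interval of roughly ±1.96 standard errors around the elasticity estimate, the propagation into the price looks like this:

```python
# Rough sketch of how the estimation uncertainty propagates into the price,
# assuming a +/- 1.96 standard error interval around the estimated elasticity.
elasticity_hat, se, cost_per_salad = -1.52, 0.15, 2.50

def optimal_price(eps, c=cost_per_salad):
    return eps * c / (1 + eps)

lower = elasticity_hat - 1.96 * se   # about -1.81
upper = elasticity_hat + 1.96 * se   # about -1.23

# The non-linear pricing formula turns the modest elasticity interval into a
# very wide price range (roughly 5.6€ to 13.6€ with these rounded inputs).
print(optimal_price(lower), optimal_price(elasticity_hat), optimal_price(upper))
```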

To state the obvious, the illustrated approach to elasticity calculations is just the tip of the iceberg. Meaning that we need to invest time into improving the current approach. The next posts will focus on intervening factors like promotional activities and similar products.
