The Initial Problem
Management View
Let’s assume for a moment that you are a data scientist here at STATWORX. On Monday morning at 10 o’clock, the telephone rings, and a manager of an international bank is on the line. After a bit of back and forth, the bank manager explains that they have a problem with defaulting loans and need a program that predicts which loans are going to default in the future. Unfortunately, he must end the call now, but he’ll catch up with you later. In the meantime, you start to make sense of the problem.
Data Scientist View
While the bank manager is convinced that he gave you all the necessary information, you grab another cup of coffee, lean back in your chair, and recap the problem:
- The bank lends money to customers today
- The customer promises the bank to pay back the loan bit by bit over the next couple of months/years
- Unfortunately, some of the customers are not able to do so and are going to default on the loan
Data Science Explanation
From a data science perspective, we differentiate between two sorts of problems: classification and regression tasks. The way we prepare the data and the models we apply differ fundamentally between the two. Classification problems, as the name suggests, assign data points to a specific category. For bank loans, one approach could be to construct two categories:
- The loan defaulted
- The loan is still performing
Regression problems, in contrast, predict a continuous numeric value. For bank loans, the target could be:
- The percentage of loans which are going to default in a given month
- The total amount of money the bank will lose in a given month
Scenario Classification Problem
Management View
The next day, you set up a phone conference with the manager and the decision-makers of the bank to discuss the overall direction of the project. The management board has decided that it is more important to focus on predicting the default of single loans than on the overall default trend. Now you know that you have to solve a classification problem. Further, you ask the board what exactly they expect from the model.
Manager A: I want to have the best performing model possible!
Manager B: As long as it predicts reality as accurately as possible, I’m happy 🙂
Manager C: As long as it catches every defaulted loan for sure…
Manager A: … but of course, it should not predict too many loans wrong!
Data Scientist View
You try to match every requirement from the bank. Understandably, the bank wants the perfect model, one that makes little to no mistakes. Unfortunately, every model makes some errors, and you are still unsure which kind of error is worse for the bank. To continue your work properly, it is important to define with the client exactly which problem to solve and, therefore, which error to minimize. Some options could be:
- Catch every loan that will default
- Make sure the model does not classify a performing loan as a defaulted loan
- Some kind of weighted average between both of them
Data Science Explanation
To generate predictions, you have to train a model on the given data. To tell the model how well it performed and to punish it for mistakes, it is necessary to define an error metric. The choice of the error metric always depends on the business case. From a technical point of view, it is possible to model nearly every business case. However, there are four metrics that are used in most classification problems.
Accuracy measures, as the name suggests, how accurately the model predicts the loan status, that is, the share of loans it classifies correctly. While this is the most basic metric one can think of, it’s also a dangerous one. Let’s say the bank tells us that roughly 5% of the loans on the balance sheet default. Suppose that, for some reason, our model never predicts a default; in other words, it classifies every loan as non-defaulting. Its accuracy is immediately 95/100 = 95%, although it does not catch a single default. For datasets where the classes are highly imbalanced, it is usually a good idea to discard accuracy.
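The accuracy trap can be reproduced in a few lines. The following is a minimal sketch with made-up labels (95 performing loans, 5 defaults) and a hypothetical `accuracy` helper; neither the data nor the function names come from an actual bank project.

```python
# Hypothetical toy data: 1 = loan defaulted, 0 = loan is still performing.
y_true = [1] * 5 + [0] * 95

# A "model" that never predicts a default.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Share of loans whose status was predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


acc = accuracy(y_true, y_pred)
# Number of actual defaults that the model also flagged as defaults.
caught_defaults = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"Accuracy: {acc:.0%}")  # Accuracy: 95%
print(f"Defaults caught: {caught_defaults} of {sum(y_true)}")  # 0 of 5
```

The model looks excellent on accuracy while being useless for the actual business question, which is why the other metrics are needed.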
Recall is the share of actually defaulted loans that the model flags as a default. Optimizing the machine learning algorithm for recall ensures that it catches as many defaulted loans as possible. On the flip side, an algorithm that catches every defaulted loan often does so by flagging too many loans overall: many loans that are not going to default are flagged as defaults as well.
Precision is the share of flagged loans that actually default. High precision ensures that most of the loans the algorithm flags as a default really are defaults. This comes at the expense of the overall number of loans that are flagged. Therefore, it might not be possible to flag every loan that is going to default, but the loans that are flagged are very likely to default.
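To make the tradeoff between recall and precision concrete, here is a small sketch that computes both from raw predictions. The ten loans and the two "models" (one that flags generously, one that flags sparingly) are invented for illustration, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives); 1 = default."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, fn


def recall(y_true, y_pred):
    """Share of actual defaults that were flagged."""
    tp, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Share of flagged loans that actually defaulted."""
    tp, fp, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up example: 10 loans, the first 4 default.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_generous = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # flags 7 loans
y_pred_strict = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]    # flags only 2 loans

for name, y_pred in [("generous", y_pred_generous), ("strict", y_pred_strict)]:
    print(f"{name}: recall={recall(y_true, y_pred):.2f}, "
          f"precision={precision(y_true, y_pred):.2f}")
```

The generous model catches every default (recall 1.00) but three of its seven flags are wrong; the strict model never raises a false alarm (precision 1.00) but misses half of the defaults.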
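Empirically speaking, an increase in recall is almost always associated with a decrease in precision and vice versa. Often, it is desirable to balance the two. This can be done with the F-beta score, a weighted harmonic mean of precision and recall in which the parameter beta controls how much more weight recall receives than precision.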
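As a rough illustration, the F-beta score can be written as (1 + beta²) · precision · recall / (beta² · precision + recall). The sketch below uses a hypothetical `f_beta` helper and illustrative values (precision 0.5, recall 1.0) rather than results from a real model.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 puts more weight on recall, beta < 1 on precision,
    and beta = 1 gives the familiar F1 score.
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


# Illustrative values: a model that catches every default (recall = 1.0)
# but is right about only half of the loans it flags (precision = 0.5).
p, r = 0.5, 1.0

print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}")  # F0.5 = 0.556
print(f"F1   = {f_beta(p, r, beta=1.0):.3f}")  # F1   = 0.667
print(f"F2   = {f_beta(p, r, beta=2.0):.3f}")  # F2   = 0.833
```

If the bank cares most about catching defaults, a beta above 1 rewards this model; if false alarms are the bigger concern, a beta below 1 penalizes it.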
Scenario Regression Problem
Management View
During the phone conference (the same one as in the classification scenario), the decision-makers from the bank announce that they want to predict the overall default trend. While that’s already important information, you work out with the client what exactly their business need is. In the end, you have a list of requirements:
Manager A: It’s important to match the overall trend as closely as possible.
Manager B: During normal times, I won’t pay too much attention to the model. However, it is absolutely necessary that the model performs well in extreme market situations.
Manager C: To make it as easy and convenient to use as possible, and to be able to explain it to the regulating agency, it has to be as explainable as possible.
Data Scientist View
Similar to the last scenario, there is again a tradeoff, and it is a business decision which error is worse. Is every deviation from the ground truth equally bad? Is a certain stability of the prediction error important? Does the client care about the volatility of the forecast? Does a baseline exist? Have a look at the left chart above to see what this could look like.
Data Science Explanation
Once again, there are several metrics one can choose from. The best metric always depends on the business need. Here are the most common ones:
The Mean Absolute Error (MAE) calculates, as the name suggests, how far the predictions are off in absolute terms. While the number is easy to interpret, it treats every deviation in the same way. Over a 100-day interval, being off by 1 unit every day yields the same MAE as predicting 99 days perfectly and being off by 100 units on a single day.
The Mean Squared Error (MSE) also calculates the difference between the actual and the predicted output. This time, each deviation is squared before averaging, so large deviations are weighted more heavily: one extreme error is penalized more than many small errors that add up to the same total.
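The 100-day example can be checked numerically. The sketch below uses made-up series (a constant true value of 50) and hypothetical `mae` and `mse` helpers to show that the two forecasts are indistinguishable under the MAE but very different under the MSE.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of (actual - predicted) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data: 100 days with a constant true value of 50.
y_true = [50.0] * 100

# Forecast A: off by 1 unit on every single day.
pred_small_errors = [51.0] * 100

# Forecast B: perfect on 99 days, off by 100 units on the last day.
pred_one_spike = [50.0] * 99 + [150.0]

print(f"MAE A: {mae(y_true, pred_small_errors):.1f}")  # MAE A: 1.0
print(f"MAE B: {mae(y_true, pred_one_spike):.1f}")     # MAE B: 1.0
print(f"MSE A: {mse(y_true, pred_small_errors):.1f}")  # MSE A: 1.0
print(f"MSE B: {mse(y_true, pred_one_spike):.1f}")     # MSE B: 100.0
```

A client like Manager B, who cares about extreme market situations, would probably prefer a metric that reacts to forecast B the way the MSE does.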
The R² (coefficient of determination) compares the model under evaluation against a simple baseline model. The advantage is that the output is easy to interpret. A value of 1 describes the perfect model, while a value close to 0 (or even negative) describes a model with room for improvement. This metric is commonly used among economists and econometricians and is, therefore, a metric to consider in some industries. However, it is also relatively easy to get a high R², which makes it hard to compare models on this metric alone.
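Assuming the metric meant here is the coefficient of determination R², with a baseline that always predicts the mean of the actual values, it can be sketched as 1 minus the ratio of the model's squared error to the baseline's squared error. The data and the `r_squared` helper below are illustrative only.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, with the mean of y_true as the baseline."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # baseline error
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


# Made-up actual values.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = [1.0, 2.0, 3.0, 4.0, 5.0]        # matches reality exactly
baseline = [3.0, 3.0, 3.0, 3.0, 3.0]       # always predicts the mean
worse_than_baseline = [5.0, 4.0, 3.0, 2.0, 1.0]

print(r_squared(y_true, perfect))              # 1.0
print(r_squared(y_true, baseline))             # 0.0
print(r_squared(y_true, worse_than_baseline))  # -3.0
```

The three cases match the interpretation in the text: 1 for a perfect model, 0 for a model that is no better than the baseline, and a negative value for a model that is worse than the baseline.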
The Mean Absolute Percentage Error (MAPE) measures the absolute deviation of the predicted values from the actual values. In contrast to the MAE, the MAPE expresses the deviations in relative terms, which makes it very easy to interpret and to compare. The MAPE has its own set of drawbacks and caveats. Fortunately, my colleague Jan has already written an article about them; check it out if you want to learn more.
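For completeness, here is a minimal sketch of the MAPE with invented numbers and a hypothetical `mape` helper. It also shows one well-known caveat: the metric is undefined as soon as an actual value is zero, which this sketch handles by raising an error.

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, returned in percent."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    relative_errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(relative_errors) / len(relative_errors)


# Made-up actual and predicted values.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

# Relative errors are 10%, 10% and 0%, so the MAPE is roughly 6.67%.
print(f"MAPE: {mape(y_true, y_pred):.2f}%")

# The same absolute error of 10 units weighs very differently
# depending on the size of the actual value.
print(f"{mape([100.0], [110.0]):.1f}%")    # 10.0%
print(f"{mape([1000.0], [1010.0]):.1f}%")  # 1.0%
```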