## Management Summary

In modern companies, information in text form can be found in many places in day-to-day business. Depending on the business context, this can involve invoices, emails, customer input (such as reviews or inquiries), product descriptions, explanations, FAQs, and applications. Until recently, these information sources were reserved mainly for human beings, as the understanding of a text is a technologically challenging problem for machines.

Due to recent achievements in deep learning, several different NLP (“Natural Language Processing”) tasks can now be solved with outstanding quality.

In this article, you will learn through five practical examples how NLP applications solve various business problems and drive efficiency and innovation in their fields of application.

## Introduction

Natural Language Processing (NLP) is undoubtedly an area that has received special attention in the Big Data environment in the recent past. The interest in the topic, as measured by Google, has more than doubled in the last three years. This shows that innovative NLP technologies have long since ceased to be an issue only for big players such as Apple, Google, or Amazon. Instead, a general democratization of the technology can be observed. One of the reasons for this is that, according to an IBM estimate, about 80% of “global information” is not available in structured databases, but in unstructured, natural language. NLP will play a key role in the future when it comes to making this information usable. Thus, the successful use of NLP technologies will become one of the success factors for digitization in companies.

To give you an idea of the possibilities NLP opens up in the business context today, I will present five practical use cases and explain the solutions behind them in the following.

### What is NLP? – A Short Overview

As a research topic that had already occupied linguists and computer scientists in the 1950s, NLP had a barely visible existence on the application side in the 20th century.

The main reason for this was the availability of the necessary training data. Although the availability of unstructured data, in the form of texts, has generally increased exponentially, especially with the rise of the Internet, there was still a lack of suitable data for model training. This can be explained by the fact that the early NLP models mostly had to be trained under supervision (so-called supervised learning). However, supervised learning requires that training data must be provided with a dedicated target variable. This means that, for example, in the case of text classification, the text corpus must be manually annotated by humans before the model training.

This changed at the end of the 2010s, when a new generation of artificial neural networks led to a paradigm shift. These so-called “language models” are (pre-)trained on huge text corpora from companies such as Facebook and Google by randomly masking individual words in the texts and predicting them in the course of training. This is so-called self-supervised learning, which no longer requires a separate target variable. In the course of the training, these models learn a contextual understanding of texts.
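The masking idea can be illustrated without any real language model: self-supervised (input, target) training pairs arise from the raw text alone. The following minimal sketch in plain Python (the sentence and mask token are just examples) shows how such pairs are generated:

```python
import random

def make_masked_examples(sentence, mask_token="[MASK]", mask_prob=0.15, seed=42):
    """Turn a raw sentence into (masked input, target word) training pairs."""
    random.seed(seed)
    tokens = sentence.split()
    examples = []
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            masked = tokens.copy()
            masked[i] = mask_token  # hide the word ...
            examples.append((" ".join(masked), token))  # ... and predict it as target
    return examples

pairs = make_masked_examples("the model learns a contextual understanding of texts")
for masked_input, target in pairs:
    print(masked_input, "->", target)
```

A real language model generates millions of such pairs from web-scale corpora; no human annotation is needed, which is exactly what removed the labeled-data bottleneck described above.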

The advantage of this approach is that the same model can be readjusted for various downstream tasks (e.g., text classification, sentiment analysis, named entity recognition) with the help of the learned contextual understanding. This process is called transfer learning. In practice, these pre-trained models can be downloaded so that only the fine-tuning for the specific application must be done by additional data. Consequently, high-performance NLP applications can now be developed with little development effort.

To learn more about language models (especially the so-called Transformer models such as BERT or RoBERTa) as well as trends and obstacles in the field of NLP, please read the article on NLP trends by our colleague Dominique Lade [https://www.statworx.com/de/blog/neue-trends-im-natural-language-processing-wie-nlp-massentauglich-wird/].

## The 5 Use Cases

### Text Classification in the Recruitment Process

A medical research institute wants to make its recruitment process of study participants more efficient.

For testing a new drug, different, interdependent requirements are placed on the persons in question (e.g., age, general health status, presence/absence of previous illnesses, medications, genetic dispositions, etc.). Checking all these requirements is very time-consuming. Usually, it takes about one hour per potential study participant to view and assess relevant information. The main reason for this is that the clinical notes contain patient information that exceeds structured data such as laboratory values and medication: Unstructured information in text form can also be found in the medical reports, physician’s letters, and discharge reports. Especially the evaluation of the latter data requires a lot of reading time and is therefore very time-consuming. To speed up the process, the research institute is developing a machine learning model that pre-selects promising candidates. The experts then only have to validate the proposed group of people.

#### The NLP Solution

From a methodological point of view, this problem is a so-called text classification. Based on a text, a prognosis is created for a previously defined target variable. To train the model, it is necessary – as usual in supervised learning – to annotate the data, in this case the medical documents, with the target variable. Since a classification problem has to be solved here (suitable or unsuitable study participants), the experts manually assess the suitability for the study for some persons in the pool. If a person is suitable, they are marked with a one (=positive case), otherwise with a zero (=negative case). Based on these training examples, the model can now learn the relationships between the persons’ medical documents and their suitability.

To cope with the complexity of the problem, a correspondingly complex model called ClinicalBERT is used. This is a language model based on BERT (Bidirectional Encoder Representations from Transformers), which was additionally trained on a data set of clinical texts. Thus, ClinicalBERT can generate so-called representations of all medical documentation for each person. In the last step, the neural network of ClinicalBERT is completed by a task-specific component. In this case, it is a binary classification: For each person, a probability of suitability should be output. Through a corresponding linear layer, the high-dimensional text documentation is finally transformed into a single number, the suitability probability. In a gradient procedure, the model now learns the suitability probabilities based on the training examples.
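The final step, the task-specific classification head, can be sketched independently of the language model. In this numpy sketch, a random vector stands in for ClinicalBERT's document representation, and the weights of the hypothetical linear layer are invented; in the real model both would be learned during fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for a ClinicalBERT document representation (768 dims, as in BERT-base)
doc_representation = rng.normal(size=768)

# hypothetical task-specific head: one weight vector and a bias
W = rng.normal(scale=0.01, size=768)
b = 0.0

def suitability_probability(x, W, b):
    """A linear layer followed by a sigmoid squashes the representation to [0, 1]."""
    logit = x @ W + b
    return 1.0 / (1.0 + np.exp(-logit))

p = suitability_probability(doc_representation, W, b)
print(f"suitability probability: {p:.3f}")
```

During training, the gradient procedure mentioned above adjusts `W` and `b` (and, in full fine-tuning, the encoder weights as well) so that suitable candidates receive probabilities close to one.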

#### Further Application Scenarios of Text Classification:

Text classification often takes place in the form of sentiment analysis. This involves classifying texts into predefined sentiment categories (e.g., negative/positive). This information is particularly important in the financial world or for social media monitoring. Text classification can also be used in various contexts where it is vital to sort documents according to their type (e.g., invoices, letters, reminders).

### Named Entity Recognition for Usability Improvement of a News Page

A publishing house offers its readers on a news page a large number of articles on various topics. In the course of optimization measures, one would like to implement a better recommender system so that for each article, further suitable (complementary or similar) articles are suggested. Also, the search function on the landing page is to be improved so that the customer can quickly find the article he or she is looking for.

To create a good data basis for these purposes, the publisher decided to use Named Entity Recognition (NER) to assign automated tags to the texts, improving both the recommender system and the search function. After successful implementation, significantly more suggested articles are clicked on, and the search function has become much more convenient. As a result, the readers spend substantially more time on the page.

#### The NLP Solution

To solve the problem, one must first understand how NER works:

NER is about assigning words or entire phrases to content categories. For example, “Peter” can be identified as a person, “Frankfurt am Main” is a place, and “24.12.2020” is a time specification. There are also much more complicated cases. For this purpose, compare the following pairs of sentences:

- In the past, Adam didn’t know how to parallel park. (park = from the verb “to park”)
- Yesterday I took my dog for a walk in the park. (park = open green area)

It is perfectly evident to humans that the word “park” has a different meaning in each of the two sentences. However, this seemingly simple distinction is anything but trivial for the computer. An entity recognition model could characterize the two sentences as follows:

- “**[In the past]** *(time reference)*, **[Adam]** *(person)* didn’t know how to parallel **[park]** *(verb)*.”
- “**[Yesterday]** *(time reference)*, **[I]** *(person)* took my dog for a walk in the **[park]** *(location)*.”

In the past, rule-based algorithms would have been used to solve the above NER problem, but here too, the machine learning approach is gaining ground:

The present multiclass classification problem of entity determination is again addressed using the BERT model. Additionally, the model is trained on an annotated data set in which the entities are manually identified. The most comprehensive publicly accessible database in the English language is the Groningen Meaning Bank (GMB). After successful training, the model can correctly determine previously unknown words from the sentence context. Thus, the model recognizes that prepositions like “in, at, after …” are often followed by a location, but more complex contexts are also used to determine the entity.
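Before such training, annotated sentences like those in the GMB must be converted into token-level labels. A common encoding is the BIO scheme (Begin/Inside/Outside); this small plain-Python sketch (the label names and spans are illustrative, not GMB's exact tag set) converts span annotations into per-token tags:

```python
def to_bio_tags(tokens, spans):
    """spans: list of (start_idx, end_idx_exclusive, entity_type) over token positions."""
    tags = ["O"] * len(tokens)  # O = outside any entity
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"       # begin of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"       # inside the entity
    return tags

tokens = "Yesterday I took my dog for a walk in the park".split()
spans = [(0, 1, "TIME"), (1, 2, "PER"), (10, 11, "LOC")]
print(list(zip(tokens, to_bio_tags(tokens, spans))))
```

The NER model then learns to predict exactly these per-token tags, one softmax over the tag set for every token.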

#### Further Application Scenarios of NER:

NER is a classic information retrieval task and is central to many other NLP tasks, such as chatbots and question-answer systems. Also, NER is often used for text cataloging, where the type of text is determined based on validly recognized entities.

### A Chatbot for a Long-Distance Bus Company

A long-distance bus company would like to increase its accessibility and expand the communication channels with the customer. In addition to its homepage and app, the company wants to offer a third channel, namely a WhatsApp chatbot. The goal is to perform specific actions in the conversation with the chatbot, such as searching, booking, and canceling trips. In addition, the chatbot is intended to create a reliable way of informing passengers about delays.

With the introduction of the chatbot, not only can existing passengers be reached more quickly, but contact can also be established with new customers who have not yet installed the app.

#### The NLP solution

Depending on the requirements that are placed on the chatbot, you can choose between different chatbot architectures.

Over the years, four main chatbot paradigms have been tested: In a first generation, the inquiry was examined for well-known patterns, and prefabricated answers adapted to them were returned (“pattern matching”). More sophisticated is so-called “grounding”, in which information extracted from knowledge libraries (e.g., Wikipedia) is organized in a network by Named Entity Recognition (see above). Such a network has the advantage that not only registered knowledge can be retrieved, but unregistered knowledge can also be inferred from the network structure. In “searching”, question-answer pairs from the conversation history (or from previously registered logs) are used directly to find a suitable answer. The use of machine learning models to dynamically generate suitable answers (“generative models”) is the most promising approach.

The best way to implement this modern chatbot with clearly definable competencies is to use an existing framework such as Google Dialogflow. This is a platform for configuring chatbots that combines elements of all the previously mentioned chatbot paradigms. For this purpose, parameters such as *intents*, *entities*, and *actions* are defined.

An *intent* (“user intention”) is, for example, a timetable request. By giving different example phrases (“How do I get from … to …”, “When is the next bus from … to …”) to a language model, the chatbot can assign even unseen input to the correct intent (see text classification).

Furthermore, different travel locations and times are defined as *entities*. If the chatbot now captures an intent with matching entities (see NER), an *action*, in this case a database query, can be triggered. Finally, an intent response with the relevant information is given, adapted to all information in the chat history specified by the user (“stateful”).
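The intent-matching step can be approximated without a full language model. The following scikit-learn sketch uses TF-IDF vectors and cosine similarity instead of contextual embeddings, and the example phrases per intent are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical example phrases per intent, as they might be configured in a framework
intents = {
    "timetable": ["when is the next bus from berlin to hamburg",
                  "how do i get from munich to cologne tomorrow"],
    "booking": ["book a trip from berlin to hamburg",
                "i want to buy a ticket to munich"],
    "cancellation": ["cancel my booking",
                     "i want to cancel my trip"],
}

phrases, labels = [], []
for intent, examples in intents.items():
    phrases.extend(examples)
    labels.extend([intent] * len(examples))

vectorizer = TfidfVectorizer()
phrase_vectors = vectorizer.fit_transform(phrases)

def detect_intent(user_input):
    """Return the intent whose example phrase is most similar to the input."""
    sims = cosine_similarity(vectorizer.transform([user_input]), phrase_vectors)
    return labels[sims.argmax()]

print(detect_intent("next bus from berlin to hamburg please"))  # -> timetable
```

A production system like Dialogflow replaces the bag-of-words vectors with a trained language model, which is what makes unseen paraphrases work reliably.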

#### Further Application Scenarios of Chatbots:

There are many possible applications in customer service, depending on the complexity of the scenario, from the automatic preparation (e.g., sorting) of a customer order to the complete processing of a customer request.

### A Question-Answering System as a Voice Assistant for Technical Questions About the Automobile

An automobile manufacturer discovers that many of its customers do not get along well with the manuals that come with the cars. Often, finding the relevant information takes too long, or it is not found at all. Therefore, it was decided to offer a Voice Assistant to provide precise answers to technical questions in addition to the static manual. In the future, drivers will be able to speak comfortably with their center console when they want to service their vehicle or request technical information.

#### The NLP solution

Question-answer systems have been around for decades, as they are at the forefront of artificial intelligence. A question-answer system that would always find a correct answer, taking into account all available data, could also be called “General AI”. A significant difficulty on the way to General AI is that the area the system needs to know about is unlimited. In contrast, question-answer systems provide good results when the area is delimited, as is the case with the automotive assistant. In general, the more specific the area, the better results can be expected.

For the implementation of the question-answer system, two data types from the manual are used: structured data, such as technical specifications of the components and key figures of the model, and unstructured data, such as instructions for action. All data is transformed into question-answer form in a preparatory step using other NLP techniques (classification, NER). This data is transferred to a version of BERT that has already been pre-trained on a large question-answer data set (“SQuAD”). The model is thus able to answer questions that have already been fed into the system and provide educated guesses for unseen questions.
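The retrieval part of such a system can be sketched simply: assuming the manual has already been converted into question-answer pairs (the pairs below are invented), an incoming question is matched against the stored questions, and the closest answer is returned. A real system would use BERT embeddings rather than this TF-IDF stand-in:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical QA pairs derived from the manual in the preparation step
qa_pairs = [
    ("how often should the oil be changed", "Change the oil every 15,000 km."),
    ("what is the correct tire pressure", "The recommended tire pressure is 2.5 bar."),
    ("how do i activate the parking assistant",
     "Press the parking assistant button below the center console."),
]

questions = [q for q, _ in qa_pairs]
vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(questions)

def answer(user_question):
    """Return the answer of the most similar stored question."""
    sims = cosine_similarity(vectorizer.transform([user_question]), question_vectors)
    return qa_pairs[sims.argmax()][1]

print(answer("what tire pressure do i need"))
```

The SQuAD-pre-trained BERT mentioned above goes one step further: instead of returning a stored answer, it extracts the answer span directly from the manual text, which is what enables educated guesses for unseen questions.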

#### Further Application Scenarios of Question-Answer Systems:

With the help of question-answer systems, company-internal search engines can be extended by functionalities. In e-commerce, answers to factual questions can be given automatically based on article descriptions and reviews.

### Automatic Text Summaries (Text Generation) of Damage Descriptions for a Property Insurance

An insurance company wants to increase the efficiency of its claim settlement department. It has been noticed that some claims complaints from the customer lead to internal conflicts of responsibility. The reason for this is simple: customers usually describe the claims over several pages, and an increased training period is needed to be able to judge whether or not to process the case. Thus, it often happens that a damage description must be read thoroughly to understand that the damage itself does not need to be processed. Now, a system that generates automated summaries is to remedy this situation. As a result of the implementation, the claim handlers can now make responsibility decisions much faster.

#### The NLP solution

One can differentiate between two approaches to the text summary problem: In *extraction*, the most important sentences are identified in the input text and, in the simplest case, used verbatim as the summary. In *abstraction*, the text is transformed by a model into a newly generated summary text. The second approach is much more complex since paraphrasing, generalization, or the inclusion of further knowledge is possible here. Therefore, it has a higher potential to generate meaningful summaries but is also more error-prone. Modern text summary algorithms use the second approach or a combination of both methods.

A so-called sequence-to-sequence model, which maps one word sequence (the damage description) to another word sequence (the summary), is used to solve the insurance use case. This is usually a recurrent neural network (RNN), trained on text-summary pairs. The training process is designed to model the probability of the next word depending on the previous words (and, additionally, an “inner state” of the model). In this way, the model effectively writes the summary “from left to right” by successively predicting the next word. An alternative approach is to have the input encoded numerically by the language model BERT and to have a GPT decoder autoregressively summarize the text based on this numerical representation. In both cases, model parameters can be used to adjust how long the summary should be.

#### Further Application Scenarios of Text Generation:

Such a scenario is conceivable in many places: Automated report writing, text generation based on retail sales data analysis, electronic medical record summaries, or textual weather forecasts from weather data are possible applications. Text generation is also used in other NLP use cases such as chatbots and Q&A systems.

## Outlook

These five application examples of text classification, chatbots, question-answer systems, NER, and text summaries show that there are many processes in all kinds of companies that can be optimized with NLP solutions.

NLP is not only an exciting field of research but also a technology whose applicability in the business environment is continually growing.

In the future, NLP will not only become a foundation of a data-driven corporate culture but also already holds a considerable innovation potential through direct application, in which it is worth investing.

At STATWORX, we already have years of experience in the development of customized NLP solutions. Here are two of our case studies on NLP: Social Media Recruiting with NLP & Supplier Recommendation Tool. We are happy to provide you with individual advice on this and many other topics.

Did you ever want to make your machine learning model available to other people, but didn’t know how? Or maybe you just heard about the term API and want to know what’s behind it? Then this post is for you! Here at STATWORX, we use and write APIs daily. For this article, I wrote down how you can build your own API for a machine learning model and explain some of the most important concepts, like REST. After reading this short article, you will know how to make requests to your API within a Python program. So have fun reading and learning!

## What is an API?

API is short for Application Programming Interface. It allows users to interact with the underlying functionality of some written code by accessing the interface. There is a multitude of APIs, and chances are good that you have already heard about the type of API we are going to talk about in this blog post: the web API. This specific type of API allows users to interact with functionality over the internet. In this example, we are building an API that will provide predictions through our trained machine learning model. In a real-world setting, this kind of API could be embedded in some type of application, where a user enters new data and receives a prediction in return. APIs are very flexible and easy to maintain, making them a handy tool in the daily work of a Data Scientist or Data Engineer. An example of a publicly available machine learning API is Time Door. It provides Time Series tools that you can integrate into your applications. APIs can also be used to make data available, not only machine learning models.

## And what is REST?

Representational State Transfer (or REST) is an approach that entails a specific style of communication through web services. When using some of the REST best practices to implement an API, we call that API a “REST API”. There are other approaches to web communication, too (such as the Simple Object Access Protocol: SOAP), but REST generally runs on less bandwidth, making it preferable for serving your machine learning models. In a REST API, the four most important types of requests are:

- GET
- PUT
- POST
- DELETE

## The Model

For this How-To article, I decided to serve a machine learning model trained on the famous iris dataset. If you don’t know the dataset, you can check it out here. When making predictions, we will have four input parameters: sepal length, sepal width, petal length, and finally, petal width. Those will help to decide which type of iris flower the input is. For this example, I used the scikit-learn implementation of a simple KNN (K-nearest neighbor) algorithm to predict the type of iris:

```
# model.py
import joblib  # replaces the deprecated sklearn.externals.joblib
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


def train(X, y):
    # train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    knn = KNeighborsClassifier(n_neighbors=1)
    # fit the model
    knn.fit(X_train, y_train)
    preds = knn.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(f'Successfully trained model with an accuracy of {acc:.2f}')
    return knn


if __name__ == '__main__':
    iris_data = datasets.load_iris()
    X = iris_data['data']
    y = iris_data['target']
    labels = {0: 'iris-setosa',
              1: 'iris-versicolor',
              2: 'iris-virginica'}
    # rename integer labels to actual flower names
    y = np.vectorize(labels.__getitem__)(y)
    mdl = train(X, y)
    # serialize model
    joblib.dump(mdl, 'iris.mdl')
```

As you can see, I trained the model with 70% of the data and then validated it with the remaining 30% out-of-sample test data. After the model training has taken place, I serialize the model with the *joblib* library. Joblib is basically an alternative to pickle that efficiently persists *scikit-learn* estimators, which can include large *numpy* arrays (such as the KNN model, which contains all the training data). After the file is saved as a *joblib* file (the file ending is not important, by the way, so don’t be confused that some people call it .model or .joblib), it can be loaded again later in our application.

## The API with Python and Flask

To build an API from our trained model, we will be using the popular web development packages *Flask* and *Flask-RESTful*. Further, we import joblib to load our model and numpy to handle the input and output data. In a new script, namely app.py, we can now set up an instance of a *Flask* app and an API and load the trained model (this requires saving the model in the same directory as the script):

```
from flask import Flask
from flask_restful import Api, Resource, reqparse
import joblib
import numpy as np

APP = Flask(__name__)
API = Api(APP)

IRIS_MODEL = joblib.load('iris.mdl')
```

The second step now is to create a class that is responsible for our prediction. This class will be a child class of the *Flask-RESTful* class `Resource`. This lets our class inherit the respective class methods and allows *Flask* to do the work behind your API without needing to implement everything. In this class, we can also define the methods (REST requests) that we talked about before. So now we implement a `Predict` class with the `.post()` method we talked about earlier.

The post method allows the user to send a body along with the default API parameters. Usually, we want the body to be in JSON format. Since this body is not delivered directly in the URL, but as text, we have to parse this text and fetch the arguments. The *flask_restful* package offers the `RequestParser` class for that. We simply add all the arguments we expect to find in the JSON input with the `.add_argument()` method and parse them into a dictionary. We then convert it into an array and return the prediction of our model as JSON.
```
class Predict(Resource):

    @staticmethod
    def post():
        parser = reqparse.RequestParser()
        parser.add_argument('petal_length')
        parser.add_argument('petal_width')
        parser.add_argument('sepal_length')
        parser.add_argument('sepal_width')

        args = parser.parse_args()  # creates dict

        X_new = np.fromiter(args.values(), dtype=float)  # convert input to array

        out = {'Prediction': IRIS_MODEL.predict([X_new])[0]}

        return out, 200
```

You might be wondering what the `200` is that we are returning at the end: For APIs, HTTP status codes are returned with every response. You all might be familiar with the famous `404 - page not found` code. `200` just means that the request has been received and processed successfully. You basically let the user know that everything went according to plan.

In the end, you just have to add the `Predict` class as a resource to the API and write the main function:
```
API.add_resource(Predict, '/predict')

if __name__ == '__main__':
    APP.run(debug=True, port='1080')
```

The `'/predict'` you see in the `.add_resource()` call is the so-called API endpoint. Through this endpoint, users of your API will be able to access it and send (in this case) POST requests. If you don’t define a port, port 5000 will be the default.
You can see the whole code for the app again here:
```
# app.py
from flask import Flask
from flask_restful import Api, Resource, reqparse
import joblib
import numpy as np

APP = Flask(__name__)
API = Api(APP)

IRIS_MODEL = joblib.load('iris.mdl')


class Predict(Resource):

    @staticmethod
    def post():
        parser = reqparse.RequestParser()
        parser.add_argument('petal_length')
        parser.add_argument('petal_width')
        parser.add_argument('sepal_length')
        parser.add_argument('sepal_width')

        args = parser.parse_args()  # creates dict

        X_new = np.fromiter(args.values(), dtype=float)  # convert input to array

        out = {'Prediction': IRIS_MODEL.predict([X_new])[0]}

        return out, 200


API.add_resource(Predict, '/predict')

if __name__ == '__main__':
    APP.run(debug=True, port='1080')
```

## Run the API

Now it’s time to run and test our API! To run the app, simply open a terminal in the same directory as your `app.py` script and run this command:

`python app.py`

You should now get a notification that the API runs on your localhost on the port you defined. There are several ways of accessing the API once it is deployed. For debugging and testing purposes, I usually use tools like *Postman*. We can also access the API from within a Python application, just like another user might want to do to use your model in their code. We use the *requests* module, first defining the URL to access and the body to send along with our HTTP request:

```
import requests

url = 'http://127.0.0.1:1080/predict'  # localhost and the defined port + endpoint
body = {
    "petal_length": 2,
    "sepal_length": 2,
    "petal_width": 0.5,
    "sepal_width": 3
}
response = requests.post(url, json=body)  # send the body as JSON
response.json()
```

The output should look something like this:
`Out[1]: {'Prediction': 'iris-versicolor'}`

That’s how easy it is to include an API call in your Python code! Please note that this API is just running on your localhost. You would have to deploy the API to a live server (e.g., on AWS) for others to access it.
## Conclusion

In this blog article, you got a brief overview of how to build a REST API to serve your machine learning model with a web interface. Further, you now understand how to integrate simple API requests into your Python code. For the next step, maybe try securing your APIs? If you are interested in learning how to build an API with R, you should check out this post. I hope that this gave you a solid introduction to the concept and that you will be building your own APIs immediately. Happy coding!

## Because You Are Interested In Data Science, You Are Interested In This Blog Post

If you love streaming movies and TV series online as much as we do here at STATWORX, you’ve probably stumbled upon recommendations like “Customers who viewed this item also viewed …” or “Because you have seen …, you like …”. Amazon, Netflix, HBO, Disney+, etc. all recommend their products and movies based on your previous user behavior. But how do these companies know what their customers like? The answer is collaborative filtering. In this blog post, I will first explain how collaborative filtering works. Secondly, I’m going to show you how to develop your own small movie recommender with the R package `recommenderlab` and provide it in a shiny application.
## Different Approaches

There are several approaches to giving a recommendation. In **user-based collaborative filtering** (UBCF), the users are the focus of the recommendation system. For a new proposal, the similarities between new and existing users are first calculated. Afterward, either the n most similar users or all users with a similarity above a specified threshold are consulted. The average ratings of the products are formed across these users and, if necessary, weighted according to their similarity. Then, the x highest-rated products are displayed to the new user as a suggestion.

In **item-based collaborative filtering** (IBCF), however, the focus is on the products. For every two products, the similarity between them is calculated in terms of their ratings. For each product, the k most similar products are identified, and for each user, the products that best match their previous purchases are suggested. These and other collaborative filtering methods are implemented in the `recommenderlab` package:

- **ALS_realRatingMatrix**: Recommender for explicit ratings based on latent factors, calculated by the alternating least squares algorithm.
- **ALS_implicit_realRatingMatrix**: Recommender for implicit data based on latent factors, calculated by the alternating least squares algorithm.
- **IBCF_realRatingMatrix**: Recommender based on item-based collaborative filtering.
- **LIBMF_realRatingMatrix**: Matrix factorization with LIBMF via package recosystem.
- **POPULAR_realRatingMatrix**: Recommender based on item popularity.
- **RANDOM_realRatingMatrix**: Produces random recommendations (real ratings).
- **RERECOMMEND_realRatingMatrix**: Re-recommends highly rated items (real ratings).
- **SVD_realRatingMatrix**: Recommender based on SVD approximation with column-mean imputation.
- **SVDF_realRatingMatrix**: Recommender based on Funk SVD with gradient descent.
- **UBCF_realRatingMatrix**: Recommender based on user-based collaborative filtering.
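The core of IBCF can be sketched in a few lines outside of R as well. Given a small user-item rating matrix (the values below are invented toy data, with 0 meaning “not rated”), cosine similarity between the item columns identifies the most similar products:

```python
import numpy as np

# rows = users, columns = items; 0 means "not rated" (invented toy data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_item_similarity(R):
    """Cosine similarity between item (column) rating vectors."""
    norms = np.linalg.norm(R, axis=0)
    return (R.T @ R) / np.outer(norms, norms)

sim = cosine_item_similarity(ratings)
np.fill_diagonal(sim, -1)  # exclude self-similarity
most_similar_to_item0 = sim[0].argmax()
print(f"item most similar to item 0: item {most_similar_to_item0}")  # -> item 1
```

`recommenderlab`'s IBCF recommender works on the same principle, additionally keeping only the k most similar items per product and weighting a user's known ratings by these similarities to score unseen items.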

## Developing your own Movie Recommender

### Dataset

To create our recommender, we use the data from movielens. These are film ratings from 0.5 (= bad) to 5 (= good) for over 9000 films from more than 600 users. The movieId is a unique mapping variable to merge the different datasets.

`head(movie_data)`

```
movieId title genres
1 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 2 Jumanji (1995) Adventure|Children|Fantasy
3 3 Grumpier Old Men (1995) Comedy|Romance
4 4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 5 Father of the Bride Part II (1995) Comedy
6 6 Heat (1995) Action|Crime|Thriller
```

`head(ratings_data)`

```
userId movieId rating timestamp
1 1 1 4 964982703
2 1 3 4 964981247
3 1 6 4 964982224
4 1 47 5 964983815
5 1 50 5 964982931
6 1 70 3 964982400
```

To better understand the film ratings, we display the number of each rating value and the average rating per film. We see that in most cases, there is no rating by a user. Furthermore, the average ratings contain many "smooth" values: these belong to movies with only a few individual ratings, whose average score is therefore determined by individual users.
```
# rating_vector
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
5830804 1370 2811 1791 7551 5550 20047 13136 26818 8551 13211
```

In order not to let individual users influence the movie ratings too much, the movies are reduced to those that have at least 50 ratings.
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.208 3.444 3.748 3.665 3.944 4.429
```

Under the assumption that the ratings of users who regularly give their opinion are more precise, we also only consider users who have given at least 50 ratings. For the films filtered above, we receive the following average ratings per user:
You can see that the distribution of the average ratings is left-skewed, which means that many users tend to give rather good ratings. To compensate for this skewness, we normalize the data.
`ratings_movies_norm <- normalize(ratings_movies)`
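By default, `normalize()` centers each user's ratings by subtracting the user's mean rating; a Z-score normalization that additionally divides by the user's standard deviation is also available. A short sketch of both options:

```r
# center (default): subtract each user's mean rating
ratings_movies_norm <- normalize(ratings_movies, method = "center")

# Z-score: additionally divide by each user's standard deviation
ratings_movies_zscore <- normalize(ratings_movies, method = "Z-score")
```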

## Model Training and Evaluation

To train our recommender and subsequently evaluate it, we carry out a 10-fold cross-validation. Also, we train both an IBCF and a UBCF recommender, each of which calculates the similarity measure via cosine similarity or Pearson correlation. A random recommendation is used as a benchmark. To evaluate how many recommendations should be given, different numbers are tested via the vector `n_recommendations`.
```
eval_sets <- evaluationScheme(data = ratings_movies_norm,
                              method = "cross-validation",
                              k = 10,
                              given = 5,
                              goodRating = 0)
models_to_evaluate <- list(
  `IBCF Cosine` = list(name = "IBCF",
                       param = list(method = "cosine")),
  `IBCF Pearson` = list(name = "IBCF",
                        param = list(method = "pearson")),
  `UBCF Cosine` = list(name = "UBCF",
                       param = list(method = "cosine")),
  `UBCF Pearson` = list(name = "UBCF",
                        param = list(method = "pearson")),
  `Random Suggestion` = list(name = "RANDOM", param = NULL)
)
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets,
                         method = models_to_evaluate,
                         n = n_recommendations)
```

We then have the results displayed graphically for analysis.
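`recommenderlab` provides a `plot()` method for evaluation results that can produce, for example, ROC and precision-recall curves. A sketch using the `list_results` object from above:

```r
# ROC curves (true positive rate vs. false positive rate) for all models
plot(list_results, annotate = TRUE, legend = "topleft")

# precision-recall curves
plot(list_results, "prec/rec", annotate = TRUE)
```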
We see that the best performing model is built by using UBCF and the Pearson correlation as a similarity measure. The model consistently achieves the highest true positive rate for the various false positive rates and thus delivers the most relevant recommendations. Furthermore, we want to maximize the recall, which is also guaranteed at every level by the **UBCF Pearson** model. Since the n most similar users (parameter `nn`) are used to calculate the recommendations, we will examine the results of the model for different numbers of users.
```
vector_nn <- c(5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_nn, function(nn){
  list(name = "UBCF",
       param = list(method = "pearson", nn = nn))
})
names(models_to_evaluate) <- paste0("UBCF with ", vector_nn, " users")
list_results <- evaluate(x = eval_sets,
                         method = models_to_evaluate,
                         n = n_recommendations)
```
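Once the best configuration is found, a final recommender can be trained on the full data and used for top-N predictions. A sketch following the objects defined above and the best configuration (UBCF, Pearson, `nn = 40`):

```r
final_model <- Recommender(ratings_movies_norm,
                           method = "UBCF",
                           param = list(method = "pearson", nn = 40))

# top 10 movie suggestions for the first user
top_10 <- predict(final_model, ratings_movies_norm[1, ], n = 10)
as(top_10, "list")
```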

## Conclusion

Our user-based collaborative filtering model with the Pearson correlation as a similarity measure and a neighborhood of 40 users delivers the best results. To test the model yourself and get movie suggestions for your own taste, I created a small Shiny app. However, there is no guarantee that the suggested movies really meet your individual taste. Not only is the underlying dataset relatively small and possibly distorted by user ratings, but the tech giants also use additional data such as age, gender, user behavior, etc. for their models. But what I can say is: Data Scientists who read this blog post also read the other blog posts by STATWORX.

## Shiny-App

Here you can find the Shiny app. To get your own movie recommendation, select up to 10 movies from the dropdown list, rate them on a scale from 0 (= bad) to 5 (= good), and press the run button. Please note that the app is hosted on a free account of shinyapps.io, which makes it available for 25 hours per month. If the 25 hours are used up and the app is therefore no longer available this month, you will find the code here to run it in your local RStudio. At STATWORX we are excited that a promising new field of Machine Learning has evolved in recent years: Causal Machine Learning. In short, Causal Machine Learning is the scientific study of Machine Learning algorithms that allow estimating causal effects. Over the last few years, different Causal Machine Learning algorithms have been developed, combining the advances from Machine Learning with the theory of causal inference to estimate different types of causal effects. My colleague Markus has already introduced some of these algorithms in an earlier blog post. As Causal Machine Learning is a rather complex topic, I will write a series of blog posts to slowly dive into this new fascinating world of data science. In my first blog post, I gave an introduction to the topic, focusing on what Causal Machine Learning is and why it is important in practice and for the future of data science. In this second blog post, I will introduce the so-called Causal Forest, one of the most popular Causal Machine Learning algorithms for estimating heterogeneous treatment effects.

## Why Heterogeneous Treatment Effects?

In Causal Forests, the goal is to estimate heterogeneity in treatment effects. As explained in my previous blog post, a treatment effect refers to a causal effect of a treatment or intervention on an outcome variable of scientific or political interest, for example the causal effect of a subsidised training programme on earnings. As individual treatment effects are unobservable, practice focuses on estimating unbiased and consistent averages of the individual treatment effect. The most common such parameter is the average treatment effect, which is the mean of all individual treatment effects in the entire population of interest. However, treatment effects may sometimes vary widely between different subgroups in the population, be it larger or smaller than the average treatment effect. In some cases, it might therefore be more interesting to estimate these different, i.e. heterogeneous, treatment effects.

> In most applications it is also interesting to look beyond the average effects in order to understand how the causal effects vary with observable characteristics. (Knaus, Lechner & Strittmatter, 2018)

The estimation of heterogeneous treatment effects can assist in answering questions like: For whom are there big or small treatment effects? For which subgroup does a treatment generate beneficial or adverse effects? In the field of marketing, for example, the estimation of heterogeneous treatment effects can help to optimise resource allocation by answering the question of which customers respond the most to a certain marketing campaign, or for which customers the causal effect of intervention strategies on their churn behaviour is the highest. Or when it comes to pricing, it might be interesting to quantify how a change in price has a varying impact on sales among different age or income groups.

## Where Old Estimation Methods Fail

Estimating heterogeneous treatment effects is nothing new. Econometrics and other social sciences have long been studying which variables predict a smaller or larger than average treatment effect, which in statistical terms is also known as moderation. One of the most traditional ways to find heterogeneous treatment effects is to use a Multiple Linear Regression with interaction terms between the variables of interest (i.e. the ones which might lead to treatment heterogeneity) and the treatment indicator. In this blog post, I will always assume that the data is from a randomised experiment, such that the assumptions to identify treatment effects are valid without further complications. We then conclude that the treatment effect depends on the variables whose interaction term is statistically significant. For example, if we have only one variable $X$, the regression model would look as follows:

$$Y = \beta_0 + \beta_1 W + \beta_2 X + \beta_3 W \cdot X + \varepsilon,$$

where $W$ is the treatment indicator and $X$ is the variable of interest. In that case, if $\beta_3$ is significant, we know that the treatment effect depends on variable $X$. The treatment effect for each observation can then be calculated as

$$\tau(X) = \beta_1 + \beta_3 X,$$

which is dependent on the value of $X$ and therefore heterogeneous among the different observations. So why is there a need for more advanced methods to estimate heterogeneous treatment effects? The example above was very simple; it only included one variable. However, usually we have more than one variable which might influence the treatment effect. To see which variables predict heterogeneous treatment effects, we have to include many interaction terms, not only between each variable and the treatment indicator but also for all possible combinations of variables with and without the treatment indicator. If we have $p$ variables and one treatment, this gives a total number of parameters of:

$$2^{p+1}.$$

So, for example, if we had 5 variables, we would have to include a total number of $2^6 = 64$ parameters in our Linear Regression Model. This approach suffers from a lack of statistical power and could also cause computational issues. The use of a Multiple Linear Regression also imposes linear relationships, unless more interactions with polynomials are included. Because Machine Learning algorithms can handle enormous numbers of variables and combine them in nonlinear and highly interactive ways, researchers have found ways to better estimate heterogeneous treatment effects by combining the field of Machine Learning with the study of causal inference.
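This parameter count can be checked directly in R: a formula with the treatment indicator and the full interactions of 5 covariates expands to 64 model matrix columns (including the intercept). A small sketch with simulated data (names are illustrative):

```r
set.seed(1)
d <- data.frame(W  = rbinom(100, 1, 0.5),
                X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100),
                X4 = rnorm(100), X5 = rnorm(100))

# full interactions of the treatment indicator and all 5 covariates
mm <- model.matrix(~ W * X1 * X2 * X3 * X4 * X5, data = d)
ncol(mm)  # 64 = 2^6 parameters
```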

## Generalised Random Forests

Over recent years, different Machine Learning algorithms have been developed to estimate heterogeneous treatment effects. Most of them are based on the idea of Decision Trees or Random Forests, just like the one I focus on in this blog post: Generalised Random Forests by Athey, Tibshirani and Wager (2019). Generalised Random Forests follow the idea of Random Forests and, apart from heterogeneous treatment effect estimation, the algorithm can also be used for non-parametric quantile regression and instrumental variable regression. It keeps the main structure of Random Forests, such as recursive partitioning, subsampling, and random split selection. However, instead of averaging over the trees, Generalised Random Forests estimate a weighting function and use the resulting weights to solve a local GMM model. To estimate heterogeneous treatment effects, the algorithm has two important additional features which distinguish it from standard Random Forests.

### 1. Splitting Criterion

The first important difference to Random Forests is the splitting criterion. In Random Forests, where we want to predict an outcome variable $Y$, the split at each tree node is performed by minimising the mean squared error of the outcome variable $Y$. In other words, the variable and value to split at each tree node are chosen such that the greatest reduction in the mean squared error with regard to the outcomes is achieved. After each tree partition has been completed, the tree's prediction for a new observation is obtained by letting it trickle down all the way from the tree's root into a terminal node, and then taking the average outcome of all the observations that fell into the same node during training. The Random Forest prediction is then calculated as the average of the predicted tree values. In Causal Forests, we want to estimate treatment effects. As stated by the Fundamental Problem of Causal Inference, however, we can never observe a treatment effect on an individual level. Therefore, the prediction of a treatment effect is given by the difference in the average outcomes between the treated and the untreated observations in a terminal node. Without going into too much detail: to find the most heterogeneous but also accurate treatment effects, the splitting criterion is adapted such that it searches for a partitioning where the treatment effects differ the most, including a correction that accounts for how the splits affect the variance of the parameter estimates.

### 2. Honesty

Random Forests are usually evaluated by applying them to a test set and measuring the accuracy of the predictions of $Y$ using an error measure such as the mean squared error. Because we can never observe treatment effects, this form of performance measurement is not possible in Causal Forests. When estimating causal effects, one therefore evaluates their accuracy by examining the bias, the standard error, and the related confidence interval of the estimates. To ensure that an estimate is as accurate as possible, the bias should asymptotically disappear, and the standard error and thus the confidence interval should be as small as possible. To enable this statistical inference in their Generalised Random Forests, Athey, Tibshirani and Wager introduce so-called *honest trees*. In order to make a tree honest, the training data is split into two subsamples: a splitting subsample and an estimating subsample. The splitting subsample is used to perform the splits and thus grow the tree. The estimating subsample is then used to make the predictions: all observations in the estimating subsample are dropped down the previously grown tree until they fall into a terminal node. The prediction of the treatment effects is then given by the difference in the average outcomes between the treated and the untreated observations of the estimating subsample in the terminal nodes. With such honest trees, the estimates of a Causal Forest are consistent (i.e. the bias vanishes asymptotically) and asymptotically Gaussian, which, together with the estimator for the asymptotic variance, allows valid confidence intervals.

## Causal Forest in Action

To show the advantages of Causal Forests compared to old estimation methods, in the following I will compare the Generalised Random Forest to a Regression with interaction terms in a small simulation study. I use simulated data to be able to compare the estimated treatment effects with the actual treatment effects, which, as we know, would not be observable in real data. To compare the two algorithms with respect to the estimation of heterogeneous treatment effects, I test them on two different data sets, one with and one without heterogeneity in the treatment effect:

Data Set | Heterogeneity | Heterogeneity Variables | Variables | Observations
---|---|---|---|---
1 | No Heterogeneity | – | 10 | 20000
2 | Heterogeneity | two of the covariates | 10 | 20000

For each data set, I train a Causal Forest using `causal_forest()` from the grf-package with `tune.parameters = "all"`. I compare this to an `lm()` model, which includes all variables, the treatment indicator and the necessary interaction terms of the heterogeneity variables and the treatment indicator:
**Linear Regression Model for data set with heterogeneity:**

**Linear Regression Model for data set with no heterogeneity:**

where the heterogeneity variables are as defined above and $W$ is the treatment indicator (i.e. $W = 1$ if treated and $W = 0$ if not treated). As already explained above, we usually do not know which variables affect the treatment effect and therefore have to include all possible interaction terms in the Linear Regression Model to see which variables lead to treatment effect heterogeneity. In the case of 10 variables, as we have here, this means we would have to include a total of 2048 parameters in our Linear Regression Model. However, since the heterogeneity variables are known in the simulated data, I only include the interaction terms for those variables here.
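A minimal sketch of the Causal Forest side of this comparison, assuming a covariate matrix `X`, outcome vector `Y`, and treatment indicator `W` from the simulated data (the object names are illustrative):

```r
library(grf)

# train the Causal Forest with full parameter tuning
cf <- causal_forest(X, Y, W, tune.parameters = "all")

# individual (heterogeneous) treatment effect estimates
tau_hat <- predict(cf)$predictions

# average treatment effect with standard error
average_treatment_effect(cf)
```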

The root mean squared error (RMSE) between the estimated and the true treatment effects is shown below:

Data Set | Metric | grf | lm
---|---|---|---
No Heterogeneity | RMSE | 0.01 | 0.00
Heterogeneity | RMSE | 0.08 | 0.45

## Outlook

I hope that this blog post has helped you understand what Causal Forests are and what advantages they bring to estimating heterogeneous treatment effects compared to old estimation methods. In my upcoming blog posts on Causal Machine Learning, I will explore this new field of data science further. I will, for example, take a more detailed look at the problems of using classical Machine Learning algorithms to estimate causal effects, or introduce different data generating processes to evaluate Causal Machine Learning methods in simulation studies.

## References

- Athey, S., Tibshirani, J., & Wager, S. (2019). Generalised random forests. *The Annals of Statistics*, *47*(2), 1148-1178.
- Knaus, M. C., Lechner, M., & Strittmatter, A. (2018). Machine learning estimation of heterogeneous causal effects: Empirical Monte Carlo evidence. *arXiv:1810.13237v2*.

## Table of Contents

- What the MAPE Is Falsely Blamed For
- Its True Weaknesses
- Better Alternatives
- Worse Alternatives
- References

## What the MAPE is FALSELY blamed for!

### It Puts Heavier Penalties on Negative Errors Than on Positive Errors

Most sources dealing with the MAPE point out this "major" issue of the measure. The statement is primarily based on two different arguments. First, they claim that interchanging the actual value with the forecasted value proves their point *(Makridakis 1993)*:

Case 1: actual = 150 & forecast = 100 (positive error)
Case 2: actual = 100 & forecast = 150 (negative error)

It is true that Case 1 (positive error of 50) is related to a lower APE (Absolute Percentage Error) than Case 2 (negative error of 50). However, the reason here is not that the error is positive or negative but simply that the actual value changes. If the actual value stays constant, the APE is equal for both types of errors *(Goodwin & Lawton 1999)*. That is clarified by the following example:

Case 3: actual = 100 & forecast = 50
Case 4: actual = 100 & forecast = 150

The second, equally invalid argument supporting the asymmetry of the MAPE arises from the assumption about the predicted data. As the MAPE is mainly suited to evaluate predictions on a ratio scale, the MAPE is bounded on the lower side by an error of 100% *(Armstrong & Collopy 1992)*. However, this does not imply that the MAPE overweights or underweights some types of errors, but that these errors are not possible.
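The four cases can be checked with a few lines of R, assuming the first number is the actual and the second the forecasted value:

```r
ape <- function(actual, forecast) abs(actual - forecast) / abs(actual)

ape(150, 100)  # Case 1: 1/3 -- lower, because the actual value is larger
ape(100, 150)  # Case 2: 1/2
ape(100, 50)   # Case 3: 1/2 -- same APE for both error signs
ape(100, 150)  # Case 4: 1/2
```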

## Its TRUE weaknesses!

### It Fails if Some of the Actual Values Are Equal to Zero

This statement describes a well-known problem of the measure. That problem and the previous argument were the reason for the development of a modified form of the MAPE, the SMAPE ("Symmetric" Mean Absolute Percentage Error). Ironically, in contrast to the original MAPE, this modified form suffers from true asymmetry *(Goodwin & Lawton 1999)*. I will clarify this argument in the last section of the article.

### Particularly Small Actual Values Bias the MAPE

If any true values are very close to zero, the corresponding absolute percentage errors will be extremely high and therefore bias the informative value of the MAPE *(Hyndman & Koehler 2006)*. The following graph clarifies this point: although all three forecasts have the same absolute errors, the MAPE of the time series with only one extremely small value is approximately twice as high as the MAPE of the other forecasts. This issue implies that the MAPE should be used carefully if there are extremely small observations, and it directly motivates the last and often ignored weakness of the MAPE.
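This effect is easy to reproduce: two series with identical absolute errors, where one contains a single value close to zero (an illustrative sketch):

```r
mape <- function(actual, forecast) mean(abs(actual - forecast) / abs(actual)) * 100

actual_a <- c(100, 100, 100, 100)
actual_b <- c(100, 100,   1, 100)  # one extremely small observation
error    <- c(  5,   5,   5,   5)  # identical absolute errors

mape(actual_a, actual_a - error)  # 5%
mape(actual_b, actual_b - error)  # 128.75% -- dominated by the small value
```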

### The MAPE Implies Only Which Forecast Is Proportionally Better

As mentioned at the beginning of this article, one advantage of using the MAPE for comparisons between forecasts of different time series is its unit independence. However, it is essential to keep in mind that the MAPE only implies which forecast is *proportionally better*. The following graph shows three different time series and their corresponding forecasts. The only difference between them is their general level. The same absolute errors lead, therefore, to profoundly different MAPEs. This raises the question of whether it is reasonable to use such a percentage-based measure for the comparison between forecasts for different time series. If the different time series do not behave on a somehow comparable level (as shown in the following graphic), using the MAPE to infer whether a forecast is *generally better* for one time series than for another relies on the assumption that the same absolute errors are less problematic for time series on higher levels than for time series on lower levels:

> If a time series fluctuates around 100, then predicting 101 is way better than predicting 2 for a time series fluctuating around 1.

That might be true in some cases. However, in general, this is a questionable, or at least an implicit, assumption people should always be aware of when using the MAPE to compare forecasts between different time series.

### Summary

In summary, the discussed findings show that the MAPE should be used with caution as an instrument for comparing forecasts across different time series. First, a necessary condition is that the time series contains only strictly positive values. Second, even a few extremely small values have the potential to bias the MAPE heavily. Last, the MAPE depends systematically on the level of the time series, as it is a percentage-based error. This article critically questions whether it is meaningful to generalize from being a *proportionally better* forecast to being a *generally better* forecast.

## BETTER alternatives!

The discussion above implies that the MAPE alone is often not very useful when the objective is to compare accuracy between different forecasts for different time series. Although relying on only one easily understandable measure appears comfortable, it comes with a high risk of drawing misleading conclusions. In general, it is always recommended to use different measures in combination. In addition to numerical measures, a visualization of the time series, including the actual and the forecasted values, always provides valuable information. However, if one single numeric measure is the only option, there are some excellent alternatives.

### Scaled Measures

Scaled measures compare the measure of a forecast, for example the MAE, relative to the MAE of a benchmark method. Similar measures can be defined using the RMSE, the MAPE, or other measures. Common benchmark methods are the "random walk", the "naïve" method, and the "mean" method. These measures are easy to interpret, as they show how the focal model compares to the benchmark methods. However, it is important to keep in mind that relative measures rely on the selection of the benchmark method and on how well the time series can be forecasted by the selected method.

### Scaled Errors

Scaled error approaches also try to remove the scale of the data by comparing the forecasted values to those obtained by some benchmark forecast method, like the naïve method. The MASE (Mean Absolute Scaled Error), proposed by *Hyndman & Koehler 2006*, is defined slightly differently depending on the seasonality of the time series. In the simple case of a non-seasonal time series, the error of the focal forecast is scaled based on the in-sample MAE of the naïve forecast method. One major advantage is that it can handle actual values of zero and that it is not biased by very extreme values. Once again, it is important to keep in mind that relative measures rely on the selection of the benchmark method and on how well the time series can be forecasted by the selected method.

*Non-Seasonal*

$$\text{MASE} = \frac{\frac{1}{J}\sum_{j}\left|e_j\right|}{\frac{1}{T-1}\sum_{t=2}^{T}\left|y_t - y_{t-1}\right|}$$

*Seasonal*

$$\text{MASE} = \frac{\frac{1}{J}\sum_{j}\left|e_j\right|}{\frac{1}{T-m}\sum_{t=m+1}^{T}\left|y_t - y_{t-m}\right|}$$
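A sketch of the non-seasonal MASE: the forecast MAE is scaled by the in-sample MAE of the one-step naïve method on the training series (the data values are made up for illustration):

```r
# non-seasonal MASE: forecast errors scaled by the in-sample MAE
# of the one-step naive forecast on the training series
mase <- function(actual, forecast, train) {
  scale <- mean(abs(diff(train)))   # in-sample MAE of the naive method
  mean(abs(actual - forecast)) / scale
}

train    <- c(10, 12, 11, 13, 12, 14)
actual   <- c(13, 15)
forecast <- c(14, 14)
mase(actual, forecast, train)  # 0.625 -- forecast MAE of 1 vs. naive MAE of 1.6
```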

### SDMAE

In my understanding, the basic idea of using the MAPE to compare forecasts between different time series is that the same absolute error is assumed to be less problematic for time series on higher levels than for time series on lower levels. Based on the examples shown earlier, I think that this idea is at least questionable. I argue that how good or bad a specific absolute error is evaluated should not depend on the general level of the time series but on its variation. Accordingly, the following measure, the SDMAE (Standard Deviation adjusted Mean Absolute Error), is a product of the discussed issues and my imagination. It can be used for evaluating forecasts for time series containing negative values and suffers neither from actual values being equal to zero nor from particularly small ones. Note that this measure is not defined for time series that do not fluctuate at all. Furthermore, there might be other limitations of this measure that I am currently not aware of.

### Summary

I suggest using a combination of different measures to get a comprehensive understanding of the performance of the different forecasts. I also suggest complementing the MAPE with a visualization of the time series, including the actual and the forecasted values, the MAE, and a scaled measure or scaled error approach. The SDMAE should be seen as an alternative approach that has not yet been discussed by a broader audience. I am thankful for your critical thoughts and comments on this idea.

## Worse alternatives!

### SMAPE

The SMAPE was created to solve and respond to the problems of the MAPE. However, it solved neither the problem of extremely small actual values nor the level dependency of the MAPE. The reason is that extremely small actual values are typically related to extremely small predictions *(Hyndman & Koehler 2006)*. Additionally, and in contrast to the unmodified MAPE, the SMAPE raises the problem of asymmetry *(Goodwin & Lawton 1999)*. This is clarified by the following graphic, where the "APE" relates to the MAPE and the "SAPE" relates to the SMAPE. It shows that the SAPE is higher for positive errors than for negative errors and is therefore asymmetric. Several scientists recommend against using the SMAPE (Hyndman & Koehler 2006).

*On the asymmetry of the symmetric MAPE (Goodwin & Lawton 1999)*

## References

- Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. *International Journal of Forecasting*, *15*(4), 405-408.
- Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. *International Journal of Forecasting*, *22*(4), 679-688.
- Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. *International Journal of Forecasting*, *9*(4), 527-529.
- Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: Empirical comparisons. *International Journal of Forecasting*, *8*(1), 69-80.

`R`. These kinds of questions arise here at STATWORX when developing, for example, new machine learning algorithms or testing established ones which should generalize well to new unseen data.
## Decomposing the mean squared error

We consider the basic regression setup where we observe a real-valued sample $(y_i, x_i)_{i=1}^n$ and our aim is to predict an outcome based on some predictor variables ("covariates") via a regression fitting method. Usually, we can measure our target only with noise $\epsilon$,

$$y = f(x) + \epsilon.$$

To measure our prediction accuracy, we will use the mean squared error (MSE), which can be decomposed as follows:

$$\text{MSE}(x) = E\left[(f(x) - \hat{f}(x))^2\right] = \underbrace{\left(f(x) - E[\hat{f}(x)]\right)^2}_{\text{squared bias}} + \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{\text{variance}},$$

where $\hat{f}(x)$ is our prediction obtained by the regression method at hand. From the above expression, we observe that the MSE consists of two parts:

- squared bias: measures the (squared) difference between the true underlying process and the mean of our predictions, i.e. $\left(f(x) - E[\hat{f}(x)]\right)^2$
- variance: measures the variation of our predictions around their mean, i.e. $E\left[(\hat{f}(x) - E[\hat{f}(x)])^2\right]$

## Monte Carlo Setup & Simulation Code

To illustrate this, we consider a simple toy model

$$y = 0.8x + \sin(6x) + \epsilon,$$

where $x \sim U(-0.5, 0.5)$ and $\epsilon \sim \mathcal{N}(0, 2)$. As a fitting procedure, we will use a cubic smoothing spline (`smooth.spline`). For the purpose of this blog post, we only need to know that a smoothing spline divides our univariate predictor space into intervals and fits a cubic polynomial in each one to approximate our target. The complexity of our smoothing spline can be controlled via the degrees of freedom (function argument `df`). If you are interested in the (impressive) mathematical details of smoothing splines, check out this script by Ryan Tibshirani.
Figure 1 shows the bias-variance tradeoff from above. For relatively low degrees of freedom, we obtain a model (red line) which is too simple and does not approximate the true data generating process well (black dashed line). (Note: for the lowest degrees of freedom, we obtain the least-squares fit.) Here, the bias will be relatively large while the variance remains low. On the other hand, choosing relatively high degrees of freedom leads to a model which overfits the data (green line). In this case, we clearly observe that the model fits characteristic features of the data which are not relevant for approximating the true data generating process.
To make this tradeoff more rigorous, we explicitly plot the bias and variance. For this, we conduct a Monte Carlo simulation. As a side note: to run the code snippets below, you only need the `stats` package, which is contained in the standard `R` distribution.

For validation purposes, we use a training and a test dataset (`*_train` and `*_test`, respectively). On the training set, we construct the algorithm (obtain coefficient estimates, etc.), while on the test set, we make our predictions.
```
# set a random seed to make the results reproducible
set.seed(123)
n_sim <- 200
n_df <- 40
n_sample <- 100
# setup containers to store the results
prediction_matrix <- matrix(NA, nrow = n_sim, ncol = n_sample)
mse_temp <- matrix(NA, nrow = n_sim, ncol = n_df)
results <- matrix(NA, nrow = 3, ncol = n_df)
# Train data -----
x_train <- runif(n_sample, -0.5, 0.5)
f_train <- 0.8*x_train+sin(6*x_train)
epsilon_train <- replicate(n_sim, rnorm(n_sample, 0, sqrt(2)))
y_train <- replicate(n_sim,f_train) + epsilon_train
# Test data -----
x_test <- runif(n_sample, -0.5, 0.5)
f_test <- 0.8*x_test+sin(6*x_test)
```

The bias-variance tradeoff can be modelled in `R` using two `for`-loops. The outer one controls the complexity of the smoothing splines (counter: `df_iter`). The Monte Carlo simulation with 200 iterations (`n_sim`) to obtain the prediction matrix for the variance and bias is run in the inner loop.
```
# outer for-loop
for (df_iter in seq(n_df)){
  # inner for-loop
  for (mc_iter in seq(n_sim)){
    cspline <- smooth.spline(x_train, y_train[, mc_iter], df = df_iter + 1)
    cspline_predict <- predict(cspline, x_test)
    prediction_matrix[mc_iter, 1:n_sample] <- cspline_predict$y
    mse_temp[mc_iter, df_iter] <- mean((cspline_predict$y - f_test)^2)
  }
  var_matrix <- apply(prediction_matrix, 2, FUN = var)
  bias_matrix <- apply(prediction_matrix, 2, FUN = mean)
  squared_bias <- (bias_matrix - f_test)^2
  results[1, df_iter] <- mean(var_matrix)
  results[2, df_iter] <- mean(squared_bias)
}
results[3, 1:n_df] <- apply(mse_temp, 2, FUN = mean)
```

To model the squared bias and the variance from the above MSE equation, we have to approximate these theoretical (population) terms by means of a Monte Carlo simulation (the inner `for`-loop). We run `n_sim` (= 200) Monte Carlo iterations and save the predictions obtained by the `smooth.spline` object in a prediction matrix. To approximate the expected prediction $E[\hat{f}(x_j)]$ at each test sample point $x_j$, we take the average of each column of the prediction matrix, where row $i$ holds the predictions of Monte Carlo iteration $i$. Similar considerations can be made to obtain an approximation for the variance.
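The three rows of `results` (variance, squared bias, and MSE) can then be plotted against the degrees of freedom, e.g. with base R graphics (a sketch; colors are arbitrary):

```r
# variance, squared bias and MSE as a function of the degrees of freedom
matplot(t(results), type = "l", lty = 1, lwd = 2,
        xlab = "Degrees of freedom", ylab = "Error",
        col = c("steelblue", "firebrick", "black"))
legend("top", legend = c("Variance", "Squared bias", "MSE"),
       col = c("steelblue", "firebrick", "black"), lty = 1, lwd = 2)
```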
## Bias-variance tradeoff as a function of the degrees of freedom

Figure 2 shows the simulated bias-variance tradeoff (as a function of the degrees of freedom). We clearly observe the complexity considerations of Figure 1: the bias quickly decreases to zero, while the variance increases linearly with the degrees of freedom. Hence, for higher degrees of freedom, the MSE is driven mostly by the variance. Comparing this figure with Figure 1, we note that for low degrees of freedom the bias contributes substantially more than the variance to the MSE. If we increase the degrees of freedom, the bias tends to zero, characteristic features of the data are fitted, and the MSE consists mostly of the variance. With small modifications, you can use this code to explore the bias-variance tradeoff of other regression fitting and Machine Learning methods, such as Boosting or Random Forests. I leave it to you to find out which hyperparameters induce the bias-variance tradeoff in these algorithms. You can find the full code on my GitHub account. If you spot any mistakes or if you are interested in discussing applied statistics and econometrics topics, feel free to contact me.

# References

The simulation setup in this blog post closely follows the one from Bühlmann and Yu (2003).

- Bühlmann, P., and Yu, B. (2003). "Boosting with the L2 Loss: Regression and Classification." J. Amer. Statist. Assoc. 98, 324–339.

## Motivation

There are dozens of machine learning algorithms out there. It is impossible to learn all their mechanics; however, many algorithms build on the most established ones, e.g. ordinary least squares, gradient boosting, support vector machines, tree-based algorithms, and neural networks. At STATWORX we discuss algorithms daily to evaluate their usefulness for a specific project. In any case, understanding these core algorithms is key to most machine learning algorithms in the literature.

While I like reading machine learning research papers, the maths is sometimes hard to follow. That is why I like implementing the algorithms in R myself. Of course, this means digging through the maths and the algorithms as well. However, it lets you challenge your understanding of the algorithm directly.

In my two subsequent blog posts, I will introduce two machine learning algorithms in 150 lines of R code. This blog post is about regression trees, which are the foundation of most tree-based algorithms. You can find the other blog post, about coding gradient boosted machines from scratch, on our blog. The algorithms will cover all core mechanics while remaining very generic. You can find all code on my GitHub.

## Gathering all puzzle pieces

Surely, there are tons of great articles out there which explain regression trees theoretically, accompanied by a hands-on example. That is not the objective of this blog post. If you are interested in a hands-on tutorial with all the necessary theory, I strongly recommend this tutorial. The objective of this blog post is to establish the theory of the algorithm by writing simple R code. You do not need any prior knowledge of the algorithm to follow. The only thing you need to know is our objective: we want to estimate our real-valued target (`y`) with a set of real-valued features (`X`).

Most probably, you are already familiar with decision trees, a machine learning algorithm for classification tasks. As the name states, regression trees solve regressions, i.e. estimation with continuously scaled targets. These kinds of trees are the key part of every tree-based method, since the way you grow a tree is much the same everywhere. The parts that differ among implementations are mostly the splitting rules. In this tutorial we will program a very simple, but generic implementation of a regression tree.

Fortunately, we do not have to cover much maths in this tutorial, because the algorithm itself is rather a technical than a mathematical challenge. With that said, the technical path I have chosen might not be the most efficient one, but I tried to trade off efficiency against simplicity.

Anyway, as most of you might know, decision or regression trees are rule-based approaches. That means we try to split the data into partitions conditional on our feature space. The partitioning is done with the help of a splitting criterion. There is no common ground on how to do these splits; rather, there are multiple different splitting criteria with different pros and cons. We will focus on a rather simple one in this tutorial. Bear with me, here comes some maths:

$$SSE = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2$$

Ok, so what does this state? This is the sum of squared errors (SSE) determined in two different subsets ($S_1$ and $S_2$). As the name suggests, that is something we want to minimize. In fact, it is the squared distance between the mean and the target within each data subset. In every node of our regression tree, we calculate the SSE for every potential split we could make in our data, for every feature, to figure out the best split we can achieve.

Let us have a look at the R Code:

```
# This is the splitting criterion we minimize (SSE [Sum of Squared Errors]):
# SSE = sum{i in S_1} (y_i - bar(y)_1)^2 + sum{i in S_2} (y_i - bar(y)_2)^2
sse_var <- function(x, y) {
  splits <- sort(unique(x))
  sse <- c()
  for (i in seq_along(splits)) {
    sp <- splits[i]
    # SSE of the two subsets induced by splitting at sp
    sse[i] <- sum((y[x < sp] - mean(y[x < sp]))^2) +
              sum((y[x >= sp] - mean(y[x >= sp]))^2)
  }
  # the best split minimizes the SSE
  split_at <- splits[which.min(sse)]
  return(c(sse = min(sse), split = split_at))
}
```

The function takes two inputs: our numeric feature `x` and our real-valued target `y`. We then calculate the SSE for every unique value of `x`. This means we calculate the SSE for every possible data subset we could obtain conditional on the feature. Often we want to cover more than one feature in our problem, which means we have to run this function for every feature. The best splitting rule is then the one with the lowest SSE among all possible splits of all features. Once we have determined the best splitting rule, we can split our data into the two subsets according to our criterion, which is nothing else than `x < split_at` and `x >= split_at`. We call these two subsets children, and they in turn can be split into subsets again.
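To see the splitting criterion in action, here is a quick toy example (the data is made up for illustration): the target jumps at `x = 4`, so splitting there separates the two groups perfectly and yields an SSE of zero.

```r
# sse_var as defined above (repeated so the snippet is self-contained)
sse_var <- function(x, y) {
  splits <- sort(unique(x))
  sse <- c()
  for (i in seq_along(splits)) {
    sp <- splits[i]
    sse[i] <- sum((y[x < sp] - mean(y[x < sp]))^2) +
              sum((y[x >= sp] - mean(y[x >= sp]))^2)
  }
  split_at <- splits[which.min(sse)]
  return(c(sse = min(sse), split = split_at))
}

x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 1, 1, 9, 9, 9)
res <- sse_var(x, y)
res  # sse = 0, split = 4
```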

Let us say a few more words about the SSE, though, because it reveals our estimator. In this implementation, our estimator in the leaf is simply the average value of the target within the data subset. This is the simplest version of a regression tree. However, with some additional work you can apply more sophisticated models, e.g. an ordinary least squares fit.
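As a sketch of such an extension (my own illustration; `ols_leaf_estimator` is a hypothetical helper, not part of the tutorial), one could fit a linear model on the observations within a leaf instead of taking the mean. Note that an intercept-only model reproduces the mean estimator as a special case:

```r
# Hypothetical leaf estimator: fit an OLS model on the observations in a leaf
ols_leaf_estimator <- function(leaf_data, leaf_formula) {
  fit <- lm(leaf_formula, data = leaf_data)  # OLS instead of mean(y)
  predict(fit, newdata = leaf_data)          # fitted values for this leaf
}

leaf <- data.frame(x = 1:4, y = c(2, 4, 6, 8))
# intercept-only model: identical to the simple mean estimator
mean_fit <- ols_leaf_estimator(leaf, y ~ 1)  # all equal to mean(leaf$y) = 5
# full OLS fit: captures the within-leaf trend
ols_fit <- ols_leaf_estimator(leaf, y ~ x)
```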

## The Algorithm

Enough with the talking, let’s get to the juice. In the following, you will see the algorithm in all of its beauty. Afterwards, we will break down the algorithm into easy-to-digest code chunks.

```
#' reg_tree
#' Fits a simple regression tree with SSE splitting criterion. The estimator function
#' is the mean.
#'
#' @param formula an object of class formula
#' @param data a data.frame or matrix
#' @param minsize a numeric value indicating the minimum size of observations
#'                in a leaf
#'
#' @return \itemize{
#' \item tree - the tree object containing all splitting rules and observations
#' \item fit - our fitted values, i.e. the mean of y within each leaf
#' \item formula - the underlying formula
#' \item data - the underlying data
#' }
#' @export
#'
#' @examples # Complete runthrough see: www.github.com/andrebleier/cheapml
reg_tree <- function(formula, data, minsize) {
  # coerce to data.frame
  data <- as.data.frame(data)
  # handle formula
  formula <- terms.formula(formula)
  # get the design matrix
  X <- model.matrix(formula, data)
  # extract target
  y <- data[, as.character(formula)[2]]
  # initialize while loop
  do_splits <- TRUE
  # create output data.frame with splitting rules and observations
  tree_info <- data.frame(NODE = 1, NOBS = nrow(data), FILTER = NA,
                          TERMINAL = "SPLIT",
                          stringsAsFactors = FALSE)
  # keep splitting until there are only leafs left
  while (do_splits) {
    # which parents have to be splitted
    to_calculate <- which(tree_info$TERMINAL == "SPLIT")
    for (j in to_calculate) {
      # handle root node
      if (!is.na(tree_info[j, "FILTER"])) {
        # subset data according to the filter
        this_data <- subset(data, eval(parse(text = tree_info[j, "FILTER"])))
        # get the design matrix
        X <- model.matrix(formula, this_data)
      } else {
        this_data <- data
      }
      # estimate splitting criteria using the target of the current subset
      splitting <- apply(X, MARGIN = 2, FUN = sse_var,
                         y = this_data[, as.character(formula)[2]])
      # get the min SSE
      tmp_splitter <- which.min(splitting[1, ])
      # define maxnode
      mn <- max(tree_info$NODE)
      # paste filter rules
      tmp_filter <- c(paste(names(tmp_splitter), ">=",
                            splitting[2, tmp_splitter]),
                      paste(names(tmp_splitter), "<",
                            splitting[2, tmp_splitter]))
      # Error handling! check if the splitting rule has already been invoked
      split_here <- !sapply(tmp_filter,
                            FUN = function(x, y) any(grepl(x, x = y)),
                            y = tree_info$FILTER)
      # append the splitting rules
      if (!is.na(tree_info[j, "FILTER"])) {
        tmp_filter <- paste(tree_info[j, "FILTER"],
                            tmp_filter, sep = " & ")
      }
      # get the number of observations in current node
      tmp_nobs <- sapply(tmp_filter,
                         FUN = function(i, x) {
                           nrow(subset(x = x, subset = eval(parse(text = i))))
                         },
                         x = this_data)
      # insufficient minsize for split
      if (any(tmp_nobs <= minsize)) {
        split_here <- rep(FALSE, 2)
      }
      # create children data frame
      children <- data.frame(NODE = c(mn + 1, mn + 2),
                             NOBS = tmp_nobs,
                             FILTER = tmp_filter,
                             TERMINAL = rep("SPLIT", 2),
                             row.names = NULL)[split_here, ]
      # overwrite state of current node
      tree_info[j, "TERMINAL"] <- ifelse(all(!split_here), "LEAF", "PARENT")
      # bind everything
      tree_info <- rbind(tree_info, children)
      # check if there are any open splits left
      do_splits <- !all(tree_info$TERMINAL != "SPLIT")
    } # end for
  } # end while
  # calculate fitted values
  leafs <- tree_info[tree_info$TERMINAL == "LEAF", ]
  fitted <- c()
  for (i in seq_len(nrow(leafs))) {
    # extract index
    ind <- as.numeric(rownames(subset(data, eval(parse(
      text = leafs[i, "FILTER"])))))
    # estimator is the mean y value of the leaf
    fitted[ind] <- mean(y[ind])
  }
  # return the tree, fitted values, formula and data
  return(list(tree = tree_info, fit = fitted, formula = formula, data = data))
}
```

At the end of our calculation, we have a filter rule for every leaf in our tree. With the help of these filters, we can easily calculate the fitted values by applying the filter to our data and computing our fit, i.e. the mean of our target within this leaf. I am sure by now you can think of a way to implement more sophisticated estimators, which I will leave up to you.
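With the filter rules in hand, prediction for new data works the same way. Below is a small sketch (the helper `predict_leafs` and the toy tree are my own illustration, not part of the post): evaluate each leaf's filter on the new data and assign that leaf's mean.

```r
# Hypothetical prediction helper: evaluate each leaf filter on new data
# and assign the corresponding leaf mean
predict_leafs <- function(leafs, leaf_means, newdata) {
  pred <- rep(NA_real_, nrow(newdata))
  for (i in seq_len(nrow(leafs))) {
    idx <- with(newdata, eval(parse(text = leafs[i, "FILTER"])))
    pred[idx] <- leaf_means[i]
  }
  pred
}

# toy tree with a single split at x = 0
leafs      <- data.frame(FILTER = c("x < 0", "x >= 0"), stringsAsFactors = FALSE)
leaf_means <- c(-1, 1)
pred <- predict_leafs(leafs, leaf_means, data.frame(x = c(-2, 3)))
pred  # -1  1
```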

Well, that’s a regression tree with a minimum size restriction. I have created a little runthrough with data from my simulation package on my GitHub, which you can check out to try everything on your own. Make sure to check out my other blog post about coding gradient boosted machines from scratch.

## Introduction

One of the highlights of this year’s H2O World was the Kaggle Grandmaster Panel. The panelists, Gilberto Titericz (Airbnb), Mathias Müller (H2O.ai), Dmitry Larko (H2O.ai), Marios Michailidis (H2O.ai), and Mark Landry (H2O.ai), answered various questions about Kaggle and data science in general.

One of the questions from the audience was which tools and algorithms the Grandmasters frequently use. As expected, every single one of them named the gradient boosting implementation XGBoost (Chen and Guestrin 2016). This is not surprising, since XGBoost is probably the most widely used algorithm in data science at the moment.

The popularity of XGBoost manifests itself in various blog posts, including tutorials for R and Python, hyperparameter tuning for XGBoost, and even using XGBoost with Nvidia’s CUDA GPU support.

At STATWORX, we also frequently leverage XGBoost’s power for external and internal projects (see our Sales Forecasting Automotive use case). One question that we lately asked ourselves was how big the difference between the two **base learners** (also called *boosters*) offered by XGBoost really is. This post tries to answer this question in a more systematic way.

## Weak Learner

Gradient boosting can be interpreted as a combination of single models (so-called *base learners* or *weak learners*) into an ensemble model (Natekin and Knoll 2013).

In theory, any base learner can be used in the boosting framework, but some base learners have proven particularly useful: linear and penalized models (Hastie et al. 2008), (B/P-)splines (Huang and Yang 2004), and especially decision trees (James et al. 2013). Among the less frequently used base learners are random effects (Tutz and Groll 2009), radial basis functions (Gómez-Verdejo et al. 2002), Markov random fields (Dietterich et al. 2004), and wavelets (Dubossarsky et al. 2016).

Chen and Guestrin (2016) describe XGBoost as an additive function which, given the data $D = \{(x_i, y_i)\}$, takes the following form:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

In their original paper, each $f_k$ is defined as a *classification and regression tree* (CART). Apart from that, the alert reader of the technical documentation knows that one can alter the functional form of $f_k$ by using the `booster` argument in R/Python:

```
# Example of XGBoost for regression in R with trees (CART)
xgboost(data = train_DMatrix,
        objective = "reg:linear",
        eval_metric = "rmse",
        booster = "gbtree")
```

One can choose between decision trees (`gbtree` and `dart`) and linear models (`gblinear`). Unfortunately, there is only limited literature on the comparison of different base learners for boosting (see for example Joshi et al. 2002). To our knowledge, for the special case of XGBoost no systematic comparison is available.
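For comparison, switching to the linear base learner only requires changing the `booster` argument (a parameter sketch mirroring the tree example above; `train_DMatrix` is assumed to already exist):

```r
# Example of XGBoost for regression in R with a linear base learner
xgboost(data = train_DMatrix,
        objective = "reg:linear",
        eval_metric = "rmse",
        booster = "gblinear")
```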

## Simulation and Setup

In order to compare linear and tree base learners, we propose the following Monte Carlo simulation:

1) Draw a random number *n* from a uniform distribution.

2) Simulate four datasets, two for classification and two for regression, each having *n* observations.

3) On each dataset, train a boosting model with tree and linear base learners, respectively.

4) Calculate an appropriate error metric for each model on each dataset.

Repeat the outlined procedure several times.

For the simulation, we use the functions `twoClassSim()`, `LPH07_1()`, `LPH07_2()`, and `SLC14_1()` from the `caret` package. In addition to the relevant features, a varying number of (correlated) random features was added. Note that, in order to match real-life data, all data-generating processes involve non-linear components. For further details, we advise the reader to take a look at the `caret` package documentation.

For each dataset, we apply the same (random) splitting strategy: 70% of the data goes to training, 15% is used for validation, and the last 15% is used for testing. Regarding hyperparameter tuning, we use a grid-search strategy in combination with 10-fold cross-validation on the training data. Regardless of the base learner type, L1 (`alpha`) and L2 (`lambda`) regularization were tuned using a shared parameter space.

For tree boosting, the learning rate (`eta`) was held constant at 0.3 while tuning the optimal tree size (`max_depth`). Finally, we used a fixed number of 1000 boosting iterations (`nrounds`) in combination with ten early stopping rounds (`early_stopping_rounds`) on the validation frame. The final performance was evaluated by applying the model with the best cross-validated parameters to the test dataset.
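The tuning grid itself can be built with base R’s `expand.grid()`; the parameter values below are illustrative only, not the ones used in the study:

```r
# Hypothetical tuning grid; values chosen for illustration
tune_grid <- expand.grid(
  alpha     = c(0, 0.1, 1),   # L1 regularization
  lambda    = c(0, 0.1, 1),   # L2 regularization
  max_depth = c(2, 4, 6)      # tree size (tree booster only)
)
nrow(tune_grid)  # 27 candidate parameter combinations
```

Each row of the grid is then evaluated via cross-validation, and the row with the best validation score wins.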

## Results

Figure 1 and Figure 2 show the distributions of the out-of-sample classification performance (AUC) and regression error (RMSE) across the datasets. Associated summary statistics can be found in Table 1.

##### Table 1: Error summary statistics by datasets and base learners

| Base learner | Dataset | Type | Error metric | Average error | Error std. |
|---|---|---|---|---|---|
| Linear | 1 | Classification | AUC | 0.904 | 0.031 |
| Tree | 1 | Classification | AUC | 0.934 | 0.090 |
| Linear | 2 | Classification | AUC | 0.733 | 0.087 |
| Tree | 2 | Classification | AUC | 0.730 | 0.062 |
| Linear | 3 | Regression | RMSE | 45.182 | 2.915 |
| Tree | 3 | Regression | RMSE | 17.207 | 9.067 |
| Linear | 4 | Regression | RMSE | 17.383 | 1.454 |
| Tree | 4 | Regression | RMSE | 6.595 | 3.104 |

For the first dataset, the models using tree learners are on average better than the models with linear learners. However, the tree models exhibit a greater variance. The relationships are reversed for the second dataset. On average, the linear models are slightly better and the tree models exhibit a lower variance.

In contrast to the classification case, there is a substantial difference in performance in favor of the tree models for both regression datasets. For the third dataset, the tree models are on average better than their linear counterparts, but the variance of their results is substantially higher. The results are similar for the fourth dataset: the tree models are again better on average, but exhibit a higher variation.

## Summary

The results from a Monte Carlo simulation with 100 artificial datasets indicate that XGBoost with tree and linear base learners yields comparable results for classification problems, while tree learners are superior for regression problems. Based on this result, there is no single recommendation for which model specification to use when trying to minimize model bias. In addition, tree-based XGBoost models suffer from higher estimation variance compared to their linear counterparts. This finding is probably related to the more sophisticated parameter space of tree models. The complete code can be found on GitHub.

## References

- Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost – A Scalable Tree Boosting System.” https://arxiv.org/pdf/1603.02754.pdf.
- Dietterich, Thomas G., Pedro Domingos, Lise Getoor, Stephen Muggleton, and Prasad Tadepalli. 2008. “Structured Machine Learning: The Next Ten Years.” http://www.doc.ic.ac.uk/~shm/Papers/strucml.pdf
- Dubossarsky, E., J.H. Friedman, J.T. Ormerod, and M.P. Wand. 2016. “Wavelet-Based Gradient Boosting.” http://www.matt-wand.utsacademics.info/publicns/Dubossarsky16.pdf
- Gómez-Verdejo, Vanessa, Jerónimo Arenas-García, Manuel Ortega-Moral and Aníbal R. Figueiras-Vidal. 2002. “Designing RBF Classifiers for Weighted Boosting.” http://www.tsc.uc3m.es/~jarenas/papers/international_conferences/2005_IJCNN_RBF_boosting.pdf
- Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2008. “The Elements of Statistical Learning.” Springer
- Huang, Jianhua Z., and Lijian Yang. 2004. “Identification of Nonlinear Additive Autoregressive Models.” http://lijianyang.com/mypapers/splinelag.pdf
- James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. “An Introduction to Statistical Learning.” Springer
- Natekin, Alexey, and Alois Knoll. 2013. “Gradient Boosting Machines, a Tutorial.” https://www.frontiersin.org/articles/10.3389/fnbot.2013.00021/full.
- Tutz, Gerhard, and Andreas Groll. 2009. “Generalized Linear Mixed Models Based on Boosting.” http://www.fm.mathematik.uni-muenchen.de/download/publications/glmm_boost.pdf