Data Science, Machine Learning & AI
Kontakt
Did you ever want to make your machine learning model available to other people, but didn’t know how? Or maybe you just heard about the term API, and want to know what’s behind it? Then this post is for you! Here at STATWORX, we use and write APIs daily. For this article, I wrote down how you can build your own API for a machine learning model that you create and the meaning of some of the most important concepts like REST. After reading this short article, you will know how to make requests to your API within a Python program. So have fun reading and learning!

Table of Contents

What is an API?

API is short for Application Programming Interface. It allows users to interact with the underlying functionality of some written code by accessing the interface. There is a multitude of APIs, and chances are good that you already heard about the type of API, we are going to talk about in this blog post: The web API. This specific type of API allows users to interact with functionality over the internet. In this example, we are building an API that will provide predictions through our trained machine learning model. In a real-world setting, this kind of API could be embedded in some type of application, where a user enters new data and receives a prediction in return. APIs are very flexible and easy to maintain, making them a handy tool in the daily work of a Data Scientist or Data Engineer. An example of a publicly available machine learning API is Time Door. It provides Time Series tools that you can integrate into your applications. APIs can also be used to make data available, not only machine learning models.
API Illustration

And what is REST?

Representational State Transfer (or REST) is an approach that entails a specific style of communication through web services. When using some of the REST best practices to implement an API, we call that API a “REST API”. There are other approaches to web communication, too (such as the Simple Object Access Protocol: SOAP), but REST generally runs on less bandwidth, making it preferable to serve your machine learning models. In a REST API, the four most important types of requests are:
  • GET
  • PUT
  • POST
  • DELETE
For our little machine learning application, we will mostly focus on the POST method, since it is very versatile, and lots of clients can’t send GET methods. It’s important to mention that APIs are stateless. This means that they don’t save the inputs you give during an API call, so they don’t preserve the state. That’s significant because it allows multiple users and applications to use the API at the same time, without one user request interfering with another.

The Model

For this How-To-article, I decided to serve a machine learning model trained on the famous iris dataset. If you don’t know the dataset, you can check it out here. When making predictions, we will have four input parameters: sepal length, sepal width, petal length, and finally, petal width. Those will help to decide which type of iris flower the input is. For this example I used the scikit-learn implementation of a simple KNN (K-nearest neighbor) algorithm to predict the type of iris:
# model.py
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.externals import joblib
import numpy as np


def train(X,y):

    # train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    knn = KNeighborsClassifier(n_neighbors=1)

    # fit the model
    knn.fit(X_train, y_train)
    preds = knn.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(f'Successfully trained model with an accuracy of {acc:.2f}')

    return knn

if __name__ == '__main__':

    iris_data = datasets.load_iris()
    X = iris_data['data']
    y = iris_data['target']

    labels = {0 : 'iris-setosa',
              1 : 'iris-versicolor',
              2 : 'iris-virginica'}

    # rename integer labels to actual flower names
    y = np.vectorize(labels.__getitem__)(y)

    mdl = train(X,y)

    # serialize model
    joblib.dump(mdl, 'iris.mdl')
As you can see, I trained the model with 70% of the data and then validated with 30% out of sample test data. After the model training has taken place, I serialize the model with the joblib library. Joblib is basically an alternative to pickle, which preserves the persistence of scikit estimators, which include a large number of numpy arrays (such as the KNN model, which contains all the training data). After the file is saved as a joblib file (the file ending thereby is not important by the way, so don’t be confused that some people call it .model or .joblib), it can be loaded again later in our application.

The API with Python and Flask

To build an API from our trained model, we will be using the popular web development package Flask and Flask-RESTful. Further, we import joblib to load our model and numpy to handle the input and output data. In a new script, namely app.py, we can now set up an instance of a Flask app and an API and load the trained model (this requires saving the model in the same directory as the script):
from flask import Flask
from flask_restful import Api, Resource, reqparse
from sklearn.externals import joblib
import numpy as np

APP = Flask(__name__)
API = Api(APP)

IRIS_MODEL = joblib.load('iris.mdl')
The second step now is to create a class, which is responsible for our prediction. This class will be a child class of the Flask-RESTful class Resource. This lets our class inherit the respective class methods and allows Flask to do the work behind your API without needing to implement everything. In this class, we can also define the methods (REST requests) that we talked about before. So now we implement a Predict class with a .post() method we talked about earlier. The post method allows the user to send a body along with the default API parameters. Usually, we want the body to be in JSON format. Since this body is not delivered directly in the URL, but as a text, we have to parse this text and fetch the arguments. The flask _restful package offers the RequestParser class for that. We simply add all the arguments we expect to find in the JSON input with the .add_argument() method and parse them into a dictionary. We then convert it into an array and return the prediction of our model as JSON.
class Predict(Resource):

    @staticmethod
    def post():
        parser = reqparse.RequestParser()
        parser.add_argument('petal_length')
        parser.add_argument('petal_width')
        parser.add_argument('sepal_length')
        parser.add_argument('sepal_width')

        args = parser.parse_args()  # creates dict

        X_new = np.fromiter(args.values(), dtype=float)  # convert input to array

        out = {'Prediction': IRIS_MODEL.predict([X_new])[0]}

        return out, 200
You might be wondering what the 200 is that we are returning at the end: For APIs, some HTTP status codes are displayed when sending requests. You all might be familiar with the famous 404 - page not found code. 200 just means that the request has been received successfully. You basically let the user know that everything went according to plan. In the end, you just have to add the Predict class as a resource to the API, and write the main function:
API.add_resource(Predict, '/predict')

if __name__ == '__main__':
    APP.run(debug=True, port='1080')
The '/predict' you see in the .add_resource() call, is the so-called API endpoint. Through this endpoint, users of your API will be able to access and send (in this case) POST requests. If you don’t define a port, port 5000 will be the default. You can see the whole code for the app again here:
# app.py
from flask import Flask
from flask_restful import Api, Resource, reqparse
from sklearn.externals import joblib
import numpy as np

APP = Flask(__name__)
API = Api(APP)

IRIS_MODEL = joblib.load('iris.mdl')


class Predict(Resource):

    @staticmethod
    def post():
        parser = reqparse.RequestParser()
        parser.add_argument('petal_length')
        parser.add_argument('petal_width')
        parser.add_argument('sepal_length')
        parser.add_argument('sepal_width')

        args = parser.parse_args()  # creates dict

        X_new = np.fromiter(args.values(), dtype=float)  # convert input to array

        out = {'Prediction': IRIS_MODEL.predict([X_new])[0]}

        return out, 200


API.add_resource(Predict, '/predict')

if __name__ == '__main__':
    APP.run(debug=True, port='1080')

Run the API

Now it’s time to run and test our API! To run the app, simply open a terminal in the same directory as your app.py script and run this command.
python run app.py
You should now get a notification, that the API runs on your localhost in the port you defined. There are several ways of accessing the API once it is deployed. For debugging and testing purposes, I usually use tools like Postman. We can also access the API from within a Python application, just like another user might want to do to use your model in their code. We use the requests module, by first defining the URL to access and the body to send along with our HTTP request:
import requests

url = 'http://127.0.0.1:1080/predict'  # localhost and the defined port + endpoint
body = {
    "petal_length": 2,
    "sepal_length": 2,
    "petal_width": 0.5,
    "sepal_width": 3
}
response = requests.post(url, data=body)
response.json()
The output should look something like this:
Out[1]: {'Prediction': 'iris-versicolor'}
That’s how easy it is to include an API call in your Python code! Please note that this API is just running on your localhost. You would have to deploy the API to a live server (e.g., on AWS) for others to access it.

Conclusion

In this blog article, you got a brief overview of how to build a REST API to serve your machine learning model with a web interface. Further, you now understand how to integrate simple API requests into your Python code. For the next step, maybe try securing your APIs? If you are interested in learning how to build an API with R, you should check out this post. I hope that this gave you a solid introduction to the concept and that you will be building your own APIs immediately. Happy coding!   My blog post aims at aspiring data scientists who have to decide which programming language they want to learn first. At STATWORX, we primarily use the two most popular languages, R and Python. Both languages have their strengths and weaknesses, which is why you should ideally master both. To get started, however, we recommend to learn one language and then tackle the other one. Don’t forget that both languages are just a tool; what matters is what you do with your tool at hand. Once you have understood and mastered the concepts of working with data, it should generally be easier to learn the second language. This blog post introduces both languages to beginners. I try to stay as unbiased as possible. For certain tasks, I prefer R – for others Python. Hence, the recommendations are subjective to my preferences. Nevertheless, I hope my experiences help you as a newcomer.

Overview R and Python

Both Python and R are open source programming languages, which means that the source code is publicly available and can be used for free. While Python is a general-purpose programming language, R was developed for statistical analysis. Therefore, users of these languages often have different backgrounds. In general terms, software developers use Python and statisticians R.
R Python
Release 1993 1991
Developer R Core Team Python Software Foundation
Package Management CRAN PyPI

A collection of extensions

Both languages have a basic set of functions that can be extended with packages. The Comprehensive R Archive Network (CRAN) is a platform for R packages. A set of requirements must be satisfied to include a package on CRAN. CRAN ensures that all packages available for download work. More than 10,000 packages are available on CRAN. Since R is the standard language for statisticians, CRAN has a suitable solution for almost any statistical problem. So it’s just the right place for the latest statistical methods and analyses. Some packages depend on other packages, which can cause problems in specific scenarios. The objective of the packrat package is to ensure that all dependencies are met and everything runs smoothly. Python has two package management platforms: conda and PyPI (Python Package Index). There are also over 10,000 packages for Python, which, compared to R, cover a vast range of applications. However, only a small share of packages are relevant for data science projects. Because complications can occur when you install Python packages globally, you can use virtual environments. They ensure smooth processes for the various packages and dependencies from package to package, similar to the packrat package in R. It can be quite hard for beginners to get a good grasp of the idea behind the different environments. Although R and Python packages are used the same way, there are some fundamental differences. Usually, an R package is developed by a single person or a small group of researchers. The authors write the package based on a scientific paper and refer to it in the documentation. Whereas, often, large groups of developers are working on Python packages (numpy, pandas, scikit-learn). That has advantages and disadvantages:
  • The namespace for functions is clear, and functions have the same structure. E.g., when setting up different models and comparing their performance, you’ll use the scikit-learn package in Python. In R you’ll use different packages depending on what model you want to implement. The function and argument names differ from package to package – which can be cumbersome. It is noteworthy that the packages caret and parsnip are trying to correct those discrepancies in R in hindsight.
  • In some cases, the functions of scikit-learn have to be checked thoroughly. The developers are usually optimization oriented and are neglecting some statistical aspects. E.g., the scikit-learn function performing a logistic regression (sklearn.linear_model.LogisticRegression) uses per default L2 regularization. The only way to get a linear logistic model without regularization is to set the regularization parameter to a high number. That’s surprising, given the functions name. Furthermore, the developers didn’t understand why this poses an issue to the users. Even if the regularized model generalizes better and, thus, might have a better predictive performance, there are cases where I would like to obtain the non-regularized coefficients for inference.

Find your IDE

Programmers often use an integrated development environment (IDE) that facilitates their work with small but subtle tools. For R users, RStudio has become the standard IDE. The IDE is distributed by the company RStudio Inc., which stands commercially behind R. RStudio provides not only a pleasant working environment but also develops packages and extensions for the R language. For example, the RStudio team contributed important packages like tidyverse, packrat, and devtools, as well as popular extensions like shiny (for dashboards) and RMarkdown (for reports). Python users can choose between numerous IDEs (PyCharm, Visual Studio Code, Spyder, and many more). However, no company stands behind Python and is comparable to RStudio Inc. Thanks to the efforts of the large community and the Python Software Foundation, new extensions for Python are continually being published.

The Art of Data Visualization

The most commonly used packages for data visualization with Python are matplotlib and seaborn. Dashboards can be created in Python with dash. However, R has an ace up its sleeve in data visualization: the ggplot2 package, which is based on Leland Wilkinson’s book The Grammar of Graphics. With this package, you can create attractive and customized graphics, which you can share via shiny dashboards with others. Both programming languages offer the possibility to create beautiful graphics easily. Nevertheless, the R package ggplot2 convinces with its flexibility, visual possibilities, and thought-out philosophy behind it. Sharing the graphics with a shiny dashboard is extremely easy. I want to mention that there are efforts to implement ggplot2 into Python.

Props for readability

Python was designed following the motto readability counts. So even people, who are not familiar with the programming language can interpret what was done in the code. R, as a programming language, has changed a lot in recent years. Mainly because of the packages developed by the RStudio team. The readability of the code has improved substantially with the dplyr package and the use of the pipe operator, where code can be read from left to right.

The speed with different observation sizes

Next, I compare how long it takes to create a simulated dataset in R and Python. For a fair comparison, the conditions should be approximately the same. The data is simulated with the packages Xy and XyPy in R and Python, respectively. I used microbenchmark in R and timeit in Python to measure the time. Also, I parallelized the process using eight cores (R: parallel, Python: multiprocessing) to generate the simulation as fast as possible. For the experiment, a dataset with 100 observations and 50 variables is simulated 100 times. The time it takes the computer to perform the simulation is measured individually for each simulation. That is then repeated for 1,000, 10,000, 100,000 and 1,000,000 observations. The R and Python code snippets are shown below.
# R
# devtools::install_github("andrebleier/Xy")
# install.packages("parallel")
# install.packages("microbenchmark")

# Load packages
library(Xy)
library(microbenchmark)
library(parallel)

# Extract function definition from for loop
sim_this <- function(n_sim) {
  sim <- microbenchmark(Xy(n = n_sim,
                           numvars = c(50,0),
                           catvars = 0), 
                        times = 100, unit = "s")
  data.frame(n = n_sim, 
             mean = summary(sim)[, 4])
}

# Time measurement for different number of simulations
n_sim <- c(1e2, 1e3, 1e4, 1e5, 1e6)
sim_in_r <- data.frame(n = rep(0, length(n_sim)),
                       t = rep(0, length(n_sim)))
for(i in 1:length(n_sim)){
  out <- mclapply(n_sim[i],
                  FUN = sim_this,
                  mc.cores = 8)

  sim_in_r[i, 1] <- out[[1]][1]
  sim_in_r[i, 2] <- out[[1]][2]
}
# Python
import multiprocessing as mp
import numpy as np
import timeit
from XyPy import Xy

# Predefine function of interest
def sim_this(n_sim):
  return(timeit.timeit( lambda: Xy(n = int(n_sim),
      numvars = [50, 0],
      catvars = [0, 0],
      weights = [5, 10],
      stn = 4.0,
      cor = [0, 0.1],
      interactions = 1,
      noisevars = 5), number = 100))

# Paralleled computation 
pool = mp.Pool(processes = 8)
n_sim = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
results = [pool.map(sim_this, n_sim)]
The average duration, sorted by the number of observations, is shown in the plot below for R and Python. The x-axis is shown on a logarithmic scale with base 10, to make the plot clearer. While R is a little faster for dataset sizes of 100 and 1.000 observations, Python is significantly faster for 100.000 and 1.000.000 observations.
r-python-speed-comparison
For other speed comparisons, I recommend the following STATWORX blog posts: pandas vs. data.table and pandas vs. data.table part 2. In these posts, the focus was laid on data manipulation.

The Standard in Deep Learning

If you are particularly interested in deep learning, I recommend Python to you. Most popular deep learning libraries were written or are designed to be used with Python. Deep learning is also possible with R, but the R deep learning community is much smaller. Implementations like Keras and TensorFlow can be called in R but are run in Python in the background. Furthermore, the packages do not provide full flexibility for the users, e.g., not all TensorFlow functions are available.

A Survey in the Community

As aspiring data scientists, Kaggle is an essential platform for you. There you can participate in exciting machine learning competitions, experiment for yourself, and learn from the experiences of the community. In 2018, Kaggle conducted a Machine Learning & Data Science Survey. The poll was online for two weeks and received a total of 23,859 replies. From the results of this survey, I have created different plots to get some insights regarding my blog topic. The code for the individual plots is publicly available on Github.

Excursion: Python & R compared to other languages

Before we jump to R and Python, let’s see how they compare to other programming languages. Each respondent indicated which language she uses primarily. The plot below aggregates by language, and the result is: Most of the participants use Python! Followed by R in second place. In this survey, we do not distinguish between fields of work, which is why Python, the general-purpose programming language, is probably so prominent.
r-python-default-lang

The results between R & Python

In a direct comparison between R and Python, you can see that many R users also use Python. Whereas Python users often work exclusively with Python.
r-python-venndiagramm-usage
Comparing the use of languages by field reveals a clear dominance of Python. In all fields, except for statisticians, the majority uses Python.
r-python-relative-shares
Participants were also asked: What language do you recommend for beginners to learn first? The answers to the question are shown in the table below.
Sprache Empfehlung Nutzer Differenz
Python 14.181 8.180 6.001
R 2.342 2.046 296
SQL 914 1.211 -297
C++ 339 739 -400
Matlab 256 355 -99
Java 184 903 -719
Scala 74 106 -32
Javascript 72 408 -336
SAS 69 228 -159
VBA 38 135 -97
Go 26 46 -20
Other 161 117 44
When the number of recommendations and the number of users are compared, you can see that R and Python are the only languages that have a positive difference. Again, Python (14.181) is well ahead of R (2.342).
r-python-lang-recommendation

Conclusion

Beforehand: both languages are very powerful. Therefore you can not make a wrong choice! The choice of language depends on what kind of project you want to tackle. As a universal programming language, Python is suitable for a variety of applications. Which is why I generally recommend starting with Python. But if statistical analysis or data visualization is paramount in your projects, R has an advantage over Python. As already mentioned, both languages have their advantages and disadvantages. As an advanced data scientist, you should ideally master both languages. I hope this post gives you an idea of what the differences are between R and Python and helps you make the right choice for yourself. Since I could not go into much depth with arguments for my preferences in this blog post – you are very welcome to shoot me an e-mail, if you have any further questions regarding the topic. Happy Coding! If you’re interested in training, feel free to check out our course Catalogs for R and Python at STATWORX Academy.

References

In this blog we will explore the Bagging algorithm and a computational more efficient variant thereof, Subagging. With minor modifications these algorithms are also known as Random Forest and are widely applied here at STATWORX, in industry and academia.

Almost all statistical prediction and learning problems encounter a bias-variance tradeoff. This is particularly pronounce for so-called unstable predictors. While yielding low biased estimates due to flexible adaption to the data, those kind of predictors react very sensitive to small changes in the underlying dataset and have hence high variance. A common example are Regression Tree predictors.

Bagging bypasses this tradeoff by reducing the variance of the unstable predictor, while leaving its bias mostly unaffected.

Method

In particular, Bagging uses repeated bootstrap sampling to construct multiple versions of the same prediction model (e.g. Regression Trees) and averages over the resulting predictions.

Let’s see how Bagging works in detail:

  1. Construct a bootstrap sample (y_{i}^{*}, mathbf{x_{i}^{*}}) :(i = 1, dots , n) (with replacement) of the original i.i.d. data at hand (y_{i}, mathbf{x_{i}}) : (i = 1, dots , n).
  2. Fit a Regression Tree to the bootstrap sample – we will denote the tree predictor by hat{theta}_{n}^{*}(mathbf{x}).
  3. Repeat Steps one and two B many times and calculate frac{1}{B}sum_{b=1}^{B}hat{theta}_{n, b}^{*}(mathbf{x}) .

OK – so let us take a glimpse into the construction phase: We draw in total B different bootstrap samples simultaneously from the original data. Then to each of those samples a tree is fitted and the (in-sample) fitted values are averaged in Step 3 yielding the Bagged predictor.

The variance-reduction happens in Step 3. To see this, consider the following toy example.

Let X_1, dots, X_n be i.i.d. random variables with mu = E[X_1] and sigma^2 = Var[X_1] and let bar{X}= frac{1}{n}sum_{i=1}^{n}X_{i}. Easy re-formulations show that

  • Var[bar{X}]=frac{sigma^2}{n} leq sigma^2
  • E[bar{X}]=mu

We observe that indeed the variance of the mean is weakly smaller than for the individual random variables while the sample mean is unbiased.

It is widely discussed in the literature why Bagging works and it remains an open research question. Bühlmann and Yu (2002) propose a subsampling variant of Bagging, called Subagging, which is more traceable from a theoretical point of view.

In particular, Bühlmann and Yu (2002) replace the bootstrap procedure of Bagging by subsampling without replacement. Essentially, we are only changing Step 1 of our Bagging algorithm by randomly drawing m times without replacement from our original data with m < n and get hence a subset of size m. With this variant at hand, it is possible to state upper bounds for the variance and mean squared error of the predictor given an appropriate choice of the subsample size m.

Simulation Set-Up

As the theory is a little bit cumbersome and involves knowledge in real analysis, we simulate the main findings of Bühlmann and Yu (2002).

Let’s compare the mean-squared prediction error (MSPE) of the Regression Tree, Bagged and Subagged predictor and illustrate the theory part a little bit more.

In order to do this, we consider the following model

    \[y_{i} = f(mathbf{x}_{i}) + epsilon_{i}\]

where f(mathbf{x}_{i}) is the regression function, mathbf{x}_{i} sim U^{10}[0,1] is the design matrix generated from a uniform distribution and epsilon_{i}sim N(0,1) is the error term (forall i = 1, dots, n).

For the true data-generating process (DGP), we consider the following model which is quite frequently used in the machine learning literature and termed “Friedman #1”-model:

    \[f(mathbf{x}) = 10 sin(pi x^{(1)} x^{(2)}) + 20(x^{(3)} - frac{1}{2})^{2} + 10 x^{(4)} + 5 x^{(5)}\]

where mathbf{x}^{(j)} is the j-th column of the design matrix mathbf{x} (for 1 leq j leq 10).

As you can see, this model is highly non-linear – Regression Tree models shall therefore be appropriate to approximate our DGP.

To evaluate the prediction performance of Bagging and Subagging predictors we conduct a Monte Carlo simulation in Python.

We first import the relevant packages.

<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> numpy <span class="hljs-keyword"><span class="hljs-keyword">as</span></span> np
<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> sklearn.model_selection
<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> sklearn.ensemble
<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> simulation_class
<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> math
<span class="hljs-keyword"><span class="hljs-keyword">from</span></span> sklearn.metrics <span class="hljs-keyword"><span class="hljs-keyword">import</span></span> mean_squared_error
<span class="hljs-keyword"><span class="hljs-keyword">from</span></span> sklearn.tree <span class="hljs-keyword"><span class="hljs-keyword">import</span></span> DecisionTreeRegressor

The module simulation_class is a user-specified class that we will not discuss in this blog post but in a subsequent one.

Further, we specify the simulation set-up:

<span class="hljs-comment"><span class="hljs-comment"># Number of regressors</span></span>
n_reg = <span class="hljs-number"><span class="hljs-number">10</span></span>

<span class="hljs-comment"><span class="hljs-comment"># Observations</span></span>
n_obs = <span class="hljs-number"><span class="hljs-number">500</span></span>

<span class="hljs-comment"><span class="hljs-comment"># Simulation runs</span></span>
n_sim = <span class="hljs-number"><span class="hljs-number">50</span></span>

<span class="hljs-comment"><span class="hljs-comment"># Number of trees, i.e. number of bootstrap samples (Step 1)</span></span>
n_tree = <span class="hljs-number"><span class="hljs-number">50</span></span>

<span class="hljs-comment"><span class="hljs-comment"># Error Variance</span></span>
sigma = <span class="hljs-number"><span class="hljs-number">1</span></span>

<span class="hljs-comment"><span class="hljs-comment"># Grid for subsample size</span></span>
start_grid = <span class="hljs-number"><span class="hljs-number">0.1</span></span>
end_grid = <span class="hljs-number"><span class="hljs-number">1</span></span>
n_grid = <span class="hljs-number"><span class="hljs-number">100</span></span>

grid_range = np.linspace(start_grid, end_grid, num = n_grid)

Below we will explain in more detail for what we need the grid specification.

To store our simulation results we set up containers.

<span class="hljs-comment"><span class="hljs-comment"># Container Set-up</span></span>
mse_temp_bagging = np.empty(shape = (n_obs, n_sim))
mse_temp_subagging = np.empty(shape = (n_obs, n_sim))

y_predict_bagging = np.empty(shape = (n_obs, n_sim))
y_predict_subagging = np.empty(shape = (n_obs, n_sim))

mse_decomp = np.empty(shape = (len(grid_range),<span class="hljs-number"><span class="hljs-number">2</span></span>))

With this initialization at hand, we generate the test and train data by the simulation_class module.

<span class="hljs-comment"><span class="hljs-comment">#Creation of Simulation-Data</span></span>
train_setup = simulation_class.simulation(n_reg = n_reg,
                                          n_obs = n_obs,
                                          n_sim = n_sim,
                                          sigma = sigma,
                                          random_seed_design = <span class="hljs-number"><span class="hljs-number">0</span></span>,
                                          random_seed_noise =  <span class="hljs-number"><span class="hljs-number">1</span></span>)

test_setup = simulation_class.simulation(n_reg = n_reg,
                                         n_obs = n_obs,
                                         n_sim = n_sim,
                                         sigma = sigma,
                                         random_seed_design = <span class="hljs-number"><span class="hljs-number">2</span></span>,
                                         random_seed_noise = <span class="hljs-number"><span class="hljs-number">3</span></span>)

f_train = train_setup.friedman_model()
X_train, y_train = train_setup.error_term(f_train)

f_test = test_setup.friedman_model()
X_test, y_test = test_setup.error_term(f_test)

As we have generated the data for our “Friedman #1”-model we are now able to simulate the mean squared error of the Bagged predictor and Subagged predictor. In Python, both algorithms are implemented via the BaggingRegressor method of the sklearn.ensemble package. Observe that for the Subagged predictor we need to specify the parameter max_samples in the BaggingRegressor. This ensures that we can draw a subsample size m = a cdot{} n with subsample fraction a from the original data. Indeed, for the subsample fraction a we have already specified the grid above by the variable grid_range .

<span class="hljs-comment"><span class="hljs-comment">#Subagging-Simulation</span></span>
<span class="hljs-keyword"><span class="hljs-keyword">for</span></span> index, a <span class="hljs-keyword"><span class="hljs-keyword">in</span></span> enumerate(grid_range):
    <span class="hljs-keyword"><span class="hljs-keyword">for</span></span> i <span class="hljs-keyword"><span class="hljs-keyword">in</span></span> range(<span class="hljs-number"><span class="hljs-number">0</span></span>, n_sim):
        <span class="hljs-comment"><span class="hljs-comment"># bagged estimator</span></span>
        bagging = sklearn.ensemble.BaggingRegressor(
            bootstrap = <span class="hljs-keyword"><span class="hljs-keyword">True</span></span>,
            n_estimators = <span class="hljs-number"><span class="hljs-number">50</span></span>)

        y_predict_bagging[:,i] = bagging.fit(
            X_train,
            y_train[:,i]).predict(X_test)
        
        mse_temp_bagging[:,i] = mean_squared_error(
            y_test[:,i], 
            y_predict_bagging[:,i])
        
        <span class="hljs-comment"><span class="hljs-comment"># subagged estimator</span></span>
        subagging = sklearn.ensemble.BaggingRegressor(
            max_samples = math.ceil(a*n_obs),
            bootstrap = <span class="hljs-keyword"><span class="hljs-keyword">False</span></span>,
            n_estimators = <span class="hljs-number"><span class="hljs-number">50</span></span>)
        
        y_predict_subagging[:,i] = subagging.fit(
            X_train,
            y_train[:,i]).predict(X_test)
        
        mse_temp_subagging[:,i] = mean_squared_error(
            y_test[:,i],
            y_predict_subagging[:,i])
       
    mse_decomp[index, <span class="hljs-number"><span class="hljs-number">1</span></span>] = np.mean(mse_temp_bagging)
    mse_decomp[index, <span class="hljs-number"><span class="hljs-number">2</span></span>] = np.mean(mse_temp_subagging)

On my GitHub-Account you can find additional code which also calculates the simulated bias and variance for the fully grown tree and the Bagged tree.

Results

The results of our above simulation can be found in Figure 1.

Let us first compare the performance in terms of MSPE of the Regression Tree and the Bagged predictor. Table 1 shows us that Bagging drastically reduces the MSPE by decreasing the variance while almost not affecting the bias. (Recall – the mean squared prediction error is just the sum of the squared bias of the estimate, variance of the estimate and the variance of the error term (not reported).)

Table 1: Performance of fully grown tree and Bagged Predictor

Predictor Tree (fully grown) Bagged Tree
Bias^2 3.47 2.94
Variance 6.13 0.35
MSPE 10.61 4.26

Figure 1 displays the MSPE as a function of the subsample fraction a for the Bagged and Subagged predictor (our above code). Together with Figure 1 and Table 1, we make several observations:

  • We see that both the Bagged and Subagged predictor outperform a single tree (in terms of MSPE).
  • For a subsampling fraction of approximately 0.5, Subagging achieves nearly the same prediction performance as Bagging while coming at a lower computational cost.

MSPE-Comparison of Bagging and Subagging

References

  1. Breiman, L.: Bagging predictors. Machine Learning, 24, 123–140 (1996).
  2. Bühlmann, P., Yu, B.: Analyzing bagging. Annals of Statistics 30, 927–961 (2002).

Last Christmas is one of the most popular Christmas tunes that were, are and will be out there. The song is written by the brilliant musician George Michael and was released in 1984, when at that time, Epic Records quickly wanted to release a Christmas tune. According to Wikipedia, there are rumours going around that George Michael just changed the lyrics of an already composed tune named “last easter” in a more “winterly” manner. However, this has officially never been confirmed by the record company. Nonetheless, “Last Christmas” remains at the top of all christmas pop songs – also at STATWORX, where “Last Christmas” is on heavy rotation during the holiday season!

george michael meme

In a recent meme on the web I saw, that the Google Trends search volume for “last cristmas” beginning to kick in (first week of october), indicating that Christmas (and the voice of George Michael, backed by christmas bells) is knocking at the door. In order to get ready for the “most wonderful time of the year”, I decided to build a small neural network in Keras that is able to perform a multistep forecast of the expected “last christmas” search volume this year (that maybe correlates with the number of plays on TV, radio etc.).

google trends last christmas

The screenshot above shows the normalized search volume (range between 0 and 100) for last christmas for Germany from 2004 to October 2018. In winter 2017 there was the alltime high in search traffic for “Last Christmas”, maybe due to the tragic death of Michael on December 25th in 2016.

Data Preparation

I’ve downloaded the search volume data from Google Trends as a CSV file and manually formatted the file header (it included a description of the data as well as some blank lines) as well as string values of “<1”, which I’ve replace with numeric zeros. After preparing the file, I imported it into Python using pandas.read_csv().

Since neural networks work best with scaled data, i.e. data that ranges between a specific lower and upper bound, e.g. 0 and 1 or -1 and 1, I divided the normalized search volume by 100, y_{norm}=y/100.

There are many neural network architectures that can be used in order to perform time series forecasting. Since the data is modeled in a simple input-output-style, of course, MLPs can be used. Furthermore, 1-dimensional convolutional networks can be employed. Last but not least, LSTMs are also applicable.

Multistep forecasting can be done in several ways: (1) building a separate forecasting model for each forecast timestep, (2) building a recursive AR(p) model, that predicts the next value based on previous predictions and (3) building a model that is able to predict multiple values into the future at the same time. Since neural networks can easily handle multiple outputs, I decided to go with neural networks.

Preparing data for multistep forecasting can be a bit cumbersome, especially, when the input data consists on multiple timesteps and variables. In order to prepare my X and y data, I used the following snippet:

def prepare_data(target, window_X, window_y):
    """ Data preprocessing for multistep forecast """
    # Placeholders
    X, y = [], []
    n = len(target)
    # Iterators
    start_X = 0
    end_X = start_X + window_X
    start_y = end_X
    end_y = start_y + window_y
    # Build tensors
    for _ in range(n):
        if end_y < n:
            X.append(target[start_X:end_X])
            y.append(target[start_y:end_y])
        start_X += 1
        end_X = start_X + window_X
        start_y += 1
        end_y = start_y + window_y
    # Convert to array
    X = np.array(X)
    y = np.array(y)
    # Return
    return np.array(X), np.array(y)

The function transforms a single vector of values, target, into two 2-dimensional tensors X and y. The function arguments window_Xand window_y define the number of input lags per observation and the number of output values (timesteps to be predicted), respectively. Let’s take a look at the tensors. X[:3] yields:

array([[ 2,  1,  0,  0,  1,  1,  1,  1,  2,  4, 21, 69],
       [ 1,  0,  0,  1,  1,  1,  1,  2,  4, 21, 69,  1],
       [ 0,  0,  1,  1,  1,  1,  2,  4, 21, 69,  1,  1]])

whereas y[:3] yields:

array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 1],
       [1, 1, 0, 0, 1, 1]])

One can see, that the tensors contain iterating data over the vector target of a fixed length. Specifically, I used window_x = 12 and window_y = 6 which means, that each row in our input tensor X consists of window_X = 12 elements that are used to forecast the next window_y = 6 timesteps. Note, that the original time series contains T=178 months of data, whereas X is of shape (160, 12). The reason for this is that values only get appended to the X and y tensors, if the current end_y iterator is smaller than the number of total obervations in the dataset. Otherwise, y would contain NaN values at the end of the series.

# Training and test
train = 100
X_train = X[:train]
y_train = y[:train]
X_valid = X[train:]
y_valid = y[train:]

For model training, I used the first 100 observations for training purposes and the remaining 60 observations for validation (early stopping). Furthermore, I built a tensor for prediction, X_test that contains the last 12 observations from the time series (November, 1st. 2017 – October, 1st. 2018) in order to compute a prediction for the following 6 months.

Model Building

The following Python snippet shows the function for fitting three neural network models: a simple MLP, a 1-dimensional CNN and a LSTM (I will not go into detail about how the models exactly work – there are tons of great tutorials on the web). I wrote a single function for all three architectures. Thereby, I exploited a litte trick I recently discovered: if you’re experimenting with different network archirectures you always have to make sure, that the input tensors are of the appropriate shape. Since MLPs, CNNs and LSTMs require different dimensions of input tensors, it can be helpful to provide a single input layer that takes the original input tensor and then add a Reshape() layer in order to reshape the input into the required dimensionality of the respective Dense(), Conv() or LSTM() layer. I know, that from a computational point of view, this is not very efficient, however, I do not care in this case 😉

def fit_model(type, X_train, y_train, X_test, y_test, batch_size, epochs):
    """ Training function for network """
    
    # Model input
    model = Sequential()
    model.add(InputLayer(input_shape=(X_train.shape[1], )))

    if type == 'mlp':
        model.add(Reshape(target_shape=(X_train.shape[1], )))
        model.add(Dense(units=64, activation='relu'))

    if type == 'cnn':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(Conv1D(filters=64, kernel_size=4, activation='relu'))
        model.add(MaxPool1D())
        model.add(Flatten())

    if type == 'lstm':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(LSTM(units=64, return_sequences=False))

    # Output layer
    model.add(Dense(units=64, activation='relu'))
    model.add(Dense(units=y_train.shape[1], activation='sigmoid'))

    # Compile
    model.compile(optimizer='adam', loss='mse')

    # Callbacks
    early_stopping = EarlyStopping(monitor='val_loss', patience=10)
    model_checkpoint = ModelCheckpoint(filepath='model.h5', save_best_only=True)
    callbacks = [early_stopping, model_checkpoint]

    # Fit model
    model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
              batch_size=batch_size, epochs=epochs,
              callbacks=callbacks, verbose=2)

    # Load best model
    model.load_weights('model.h5')

    # Return
    return model

The InputLayer() consumes the prepared input tensor X_train. Depending on the type of model, distinct reshape operations are carried out in order to provide proper tensor dimensions for the model. Thereby, the MLP requires the input tensor to be 2-dimensional whereas the CNN and LSTM models require 3-dimensional input tensors. After the respective model architecture layers, another Dense() layer and the output layer are added. Note, that I’ve used a sigmoidal activation for the output. This makes sure, that the predicted values of the network range between 0 and 1, which corresponds to the normalized search volume data (another cool trick for predicting normalized outcomes or percentages). Dense activations are set to ReLU, optimization is performed using the ADAM optimizer. The model uses early stopping to stop model training when the validation error does not improve for 10 consecutive epochs. Then, thanks to model checkpointing, the best model is reloaded and returned.

Prediction

After model training, the predictions for the following 6 months are generated by feeding X_test into the respective model architecture. The following table shows the predicted values:

month MLP CNN LSTM
2018-11 0.165803 0.193593 0.214891
2018-12 0.857727 0.881791 0.817105
2019-01 0.034604 0.034484 0.034604
2019-02 0.007150 0.002432 0.007150
2019-03 0.012865 0.000508 0.012865
2019-04 0.013644 0.000502 0.013644

Overall, the predictions look quite similar between the models. However, there are certain systematic differences:

  • the CNN model predicts the highest search volume in december
  • the MLP seems to overestimate the search volume after the holiday season
  • the LSTM model shows the highest prediction in November and January but the lowest prediction in December

The following plot shows the search volume as well as the predictions for all three models from April 2016 to April 2019:

predictions last christmas

Conclusion and Outlook

Based on the predictions, there is an expected minor decline in search interest. This might be a predictor for lower “Last Christmas” song penetration. However, I did not find any data on radio plays etc. Of course, the model can be improved by incorporating further information into the network. Anyways, I had great fun building the model and working with the search volume data. In the next step, I will check in November, which model performed best when new actual data comes in (beginning of November). Besides this dataset, Google Trends is a great ressource to include external information into your models. For example, in a recent use case at STATWORX, we used Google Trends data to incorporate search interest of specific products into our sales forecasting models. You should give it a try!

If you want to play around with the data or the model, you can find everything on our GitHub repository.

If you have any comments or questions on my blog post, contact me, I will try to answer them. Also, feel free to use my code or share this story with your peers on social platforms of your choice. Follow me on LinkedIn or Twitter, if you want to stay in touch.

Make sure, you frequently check the awesome STATWORX Blog for more interesting data science, ML and AI content straight from the our office in Frankfurt, Germany!

If you’re interested in more quality content like this, join my mailing list, constantly bringing you new data science, machine learning and AI reads and treats from me and my team right into your inbox!

Introduction

Teaching machines to handle image data is probably one of the most exciting tasks in our daily routine at STATWORX. Computer vision in general is a path to many possibilities some would consider intruiging. Besides learning images, computer vision algorithms also enable machines to learn any kind of video sequenced data. With autonomous driving on the line, learning images and videos is probably one of hottest topics right.

learning images - so hot right now

In this post, I will show you how to get started with learning image data. I will make use of Keras, a high level API for Tensorflow, CTNK, and Theano. Keras is implemented in Python and in R. For this post I will work through the Python implementation.

Setup

I am using Python 3.6.5. and Keras is running with a Tensorflow backend. The dataset I will be using is the Airbus Ship Detection dataset from their recent Kaggle Competition. To get started, we will be building a very simple network, a so called autoencoder network. Autoencoders are simple networks in the sense that they are not aiming to predict any target. They rather aim to learn and reconstruct images or data in general. In his blog post Venelin Valkov shows the following figure, I think is quite cool:

mushroom encoder

The figure perfectly describes the intension of an autoencoder. The algorithm takes an input, compresses, and then tries to reconstruct it. Why would we do this? Well, autoencoders have numerous interesting applications. First, they are reasonably good in detecting outliers. The idea is, you teach a machine to reconstruct non-outliers. Thus, when confronted with an outlier, the algorithm will probably have a hard time reconstructing that very observation. Second, autoencoders are fairly interesting to look at, when you are looking to reduce the dimensionality of your data. Speaking about images, you can think of it as a complexity reduction for the images. An algorithm is unlikely to reconstruct nuances of the image that are rather irrelevant to the content. Image recognition or classification algorithms are prone to overreact to certain nuances of images, so denoising them, might ease the learning procedure. Thus, autoencoders can serve as a powerful preprocessing tool to denoising your data.

Data Preparation

Prepararing your data is one of the most important tasks when training algorithms. Even more so, when you are handling image data. Images typically require large amounts of storage, especially since computer vision algorithms usually need to be fed with a considerable amount of data. To encompass this issue my colleauges and I typically make use of either large on-premise servers or cloud instances.

For this blog post however, I am choosing to run everything on my local machine. Why? Well, if you are reading this and you are interested in taking your first steps in developing your own code to handle image data, I would probably bother you with details of setting up cloud instances. If you are reading this and you are already experienced in working with this kind of problems, you will most likely work with cloud instances and you will be bothered by my description as well. So, for this little experiment I am running everything on my local machine and I organized the data as follows:


00_data
    |
    | train
	| train_image_01
	| train_image_02
	| ...
    | test
	| test_image_01
	| ...

To read in the data, I am simply looping over the images. I am using the OpenCV implementation cv2 and the Keras preprocessing tools. I know, I know Keras has this genious ImageDataGenerator modul, however I think it is kind of important to understand the required input, so for this post I will make use of the OpenCV tools. The preprocessing is a little different than with other data. While we see something similar to this:

training images

A machine however, does not see images, but rather data. Each image is representated by a matrix of pixel values. Thus each picture is a data matrix. Unlike with other problems where all data is compressed in one matrix, we need to consider this complex setup. To deal with this issue, we can use the ndarray data type. Implemented in the numpy ecosystem, ndarrays provided a handy data type for multidimensional data. Thus, we convert all our images to numpy arrays and pack them together in an ndarraydata format.

# import libs
import os
import pandas as pd
import numpy as np
import cv2 
import random
from keras.preprocessing.image import ImageDataGenerator, img_to_array

# set the path to the images
train_path = "00_data/train"
test_path = "00_data/test"

# load train images
train_images = os.listdir(f'{train_path}')

# load test images
test_images = os.listdir(f'{test_path}')

# load a couple of random pics
train_images_first_run = train_images[0:10000]
test_images_first_run = test_images[0:1000]

# open up container for training_data
train_data = []
test_data = []

# loop over training images
for imgs in train_images_first_run:
    
    # load the image and resize it
    img = cv2.imread(f'{train_path}/{imgs}')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img = cv2.resize(img, (128, 128))
    img = img_to_array(img)
    train_data.append(img)

# loop over testing images
for imgs in test_images_first_run:
    
    # load the images and resize it
    img = cv2.imread(f'{test_path}/{imgs}')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img = cv2.resize(img, (128, 128))
    img = img_to_array(img)
    test_data.append(img)

# convert data to np array
train_data = np.array(train_data, dtype = "float")
test_data = np.array(test_data, dtype = "float")
 
# reshape the data
x_train = train_data.astype('float32') / train_data.max()
x_test = test_data.astype('float32') / test_data.max()
x_train = np.reshape(x_train, (len(x_train), 128, 128, 1)) 
x_test = np.reshape(x_test, (len(x_test), 128, 128, 1)) 

We use the cv2 function cvtColor to change the color palette to a rather easy to interpret gray-scale. Next, we resize the input to 128 x 128. In addition, we are converting the image to an numpy array. Afterwards we stack all the arrays together. At last, we rescale the input data between 0 and 1. So let’s check out what the data looks like right now.

preprep images

Algorithm Design

The architecture of my autoencoder is somehwat arbitrary I have to confess. To equip my network with some computer vision features, I am adding convolutional layers. Convolutional layers are the essence of Convolutional Neural Networks (CNN). I won’t be going into detail, cause I could probably bore you with 20 pages about CNNs and still, I would barely cover the basics. Thus, I am just assuming you kind of know what’s going on.

As I said, we are setting up a convolutional autoencoder. It sounds quite fancy, though Keras is making it ridiculously simple. A little disclaimer, I am quite aware that there are many other ways to setup the code and so the code above might offend you. Though, I checked the Keras documentation and tried to align my code with the documentation. So if you are offended by my coding, don’t judge me… or at least not too much.

# import libraries
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
import matplotlib.pyplot as plt
from keras.models import load_model

# define input shape
input_img = Input(shape=(128, 128, 1))

# encoding dimension
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# decoding dimension
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((4, 4))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

# build model
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

As I said before, the design is somewhat arbitrary, however those of you who are working with these kind of networks probably know, the preliminary architecture quite often is somewhat arbitrary. Let us go through the code above and again, I know there are a million ways to setup my model. At first of course I am importing all the modules I need. Then I am defining the input shape. We reshaped the images to 128 x 128 and we gray-scaled all the images, thus the third dimension is of value 1. Second, I am defining the encoding layers, so the first part of the autoencoder network. I am using three convolutional layers to compress the input. The decoding dimension is build using three convolutional layers as well. I am using relufor an activation function and sigmoidfor the last layer. Once I set up the layers, I am just stacking them all together with the Keras Model function. I am using adadelta as an optimizer and the binary crossentropy as the loss function. So let’s have a look at our model’s architecture the keras way:

>>>autoencoder.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 128, 128, 1)       0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 128, 128, 16)      160
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 64, 64, 16)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 64, 64, 8)         1160
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 32, 32, 8)         0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 32, 32, 8)         584
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 16, 16, 8)         0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 16, 16, 8)         584
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 32, 32, 8)         0
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 32, 32, 8)         584
_________________________________________________________________
up_sampling2d_2 (UpSampling2 (None, 128, 128, 8)       0
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 128, 128, 1)       73
=================================================================
Total params: 3,145
Trainable params: 3,145
Non-trainable params: 0
_________________________________________________________________

Results

To run the model we make use of the fit() method for keras.engine.training.Model objects. To fit the model we just need to specify the batch size and the number of epochs. Since I am running this on my machine I am choosing way to large a batch size and way to small a epoch number.

autoencoder.fit(x_train, x_train,
                epochs=100,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

Our autoencoder is now trained and evaluated on the testing data. As a default, Keras provides extremely nice progress bars for each epoch. To evaluate the results I am not going to bother you with a lot of metrics, instead let’s check the input images and the reconstructed ones. To do so, we can quickly loop over some test images and some reconstructed images. First, we need to predict the reconstructed ones, once again Keras is incredibly handy.

decoded_imgs = autoencoder.predict(x_test)

The prediction is stored in a numpy ndarray and has the exact same structure as our prepped data. Now, let’s take a look at our reconstructed images:

n = 10
plt.figure(figsize=(20, 4))
for i in range(n):
    # display original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i + 100].reshape(128, 128))
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i + 100].reshape(128, 128))
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()

result images

Well, well, well.. isn’t that impressive. Of course this is not quite the result, we are looking for. Thus, we need to keep on improving the model. The first steps to take are quite obvious: smaller batch size, more epochs, more images, and of course a lot of iterations to adjust the architecture of the model.

This is quite a nice finding though. You see, the code above is incredibly simply. Thanks to implementations such as Keras it is becoming increasingly simple to build the most complex algorithms. However, the design is still very complicated and requires a lot of time and experience.

The story continues

As Christian and I have already mentioned in part 1 of this simulation study series, pandas and data.table have become the most widely used packages for data manipulation in Python and R, respectively (in R, of course, one may not miss mentioning the dplyr package). Furthermore, at STATWORX we have experts in both domains, and besides having some lovely sweets at our company (see David’s and Jessica’s blog posts on this topic), we also have discussions about one particular topic: R or Python? Which one is the better language? Well, “better” is quite a broad term, and eventually, it is certainly a matter of taste and depends on the context of your problem. However, when crunching large amounts of data, one criterion needs to be met by any “data language”: speed.
Speed is the central subject of our series. We present performance comparisons between pandas and data.table when being confronted with common tasks in data engineering, and this time, we focus on three of such operations especially:

  • Data sorting
  • Data column generation
  • Data aggregation

The set up (skip if you have read part 1)

The simulation study is rather complex, so some time should be spent on explaining the set up. Each of the above mentioned operation categories is a simulation in its own right. Independently of the operations, we have created 12 datasets of different sizes — the largest one has 1200 columns and 100.000 rows — whose columns consist of four possible data types: integer, double, string, and boolean. Each column in turn follows a different distribution from which it was generated, depending on the data type it was assigned (for example, an integer column may stem from a zero-inflated poisson distribution, a poisson distribution, or a uniform distribution). For each of the three above data operation categories — sorting, generating, and aggregating — we have defined a set of concrete actions, which we call scenario. One scenario for the data sorting case may, for instance, be “take a pre-specified randomly selected integer column and sort the whole dataset according to that column”. Each scenario is run 100 times and the median computation time in milliseconds computed. In the main plots you see below, we put the median computation times of pandas and data.table in contrast by dividing the former by the latter.
The computations were performed on a machine with an Intel i7 2.2GHz with 4 physical cores, 16GB RAM and a SSD harddrive. Software Versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.

Sort / Arrange

Here we check the speed of sorting (also referred to as arranging) a data frame on one or two randomly drawn columns. In total, there are 5 scenarios: sorting (1) an integer, (2) a double, (3) a character string, (4) a boolean, and (5) two randomly drawn columns of any data type.
comparison arrange
As indicated by the relative median sorting times, it becomes obvious that data.table performs sorting tasks considerably faster than its python counterpart. This difference becomes most prominent when looking at the extreme case where the data set has 1200 columns.

Mutate

comparison mutate
In this setting, we create a new “result” column in our dataset based on some pre-specified operations that depend on the data type we are are considering. In total, we measure the computation time of four scenarios, one scenario per data type (We used the term “mutate” here as a reference to the dplyr package, which has a mutate()-function):

  1. Take a pre-specified randomly chosen integer column int and increment each element in int by one: result = int + 1
  2. Take a pre-specified randomly chosen double column dbl and double each element: result = dbl * 2
  3. Take a string column str and append each element with the character ‘a’: result = str oplus a
  4. Take a bool column lgl and negate it: result = neg lgl

In this operation category, pandas performs very efficiently accross als scenarios. In absolute terms, when having to create a column of type integer, double, or bool, pandas never needed more than a millisecond to complete the task, while it took data.table up to 40 milliseconds to do the same thing. This is the reason why — when rounding to one digit — most of the fractions in the graph above mainly comprise panels filled with a zero. Interestingly, the relative performance gain of pandas over data.table is constant with respect to the number of rows across all numbers of columns. In the string scenario, the relative mutate time is increasing with an increase in the number of rows, yet pandas beats data.table with a significant margin.

Aggregate

We have tested 16 different scenarios that are a combination of a randomly selected column of a specific data type and a randomly selected grouping variable (provided we are in a grouping senario) which is in all cases an integer. Pictures are always better than words, so here you go.
aggregation scenarios
The results are illustrated below. There is a lot of information in this graph, so take your time construing the results yourself. I will pinpoint some general patterns that are observable from the graph.
comparison summarise
First, let us focus in the aggregation scenarios in which no grouping was applied. In the double and string case, data.table appears to outperform pandas on all data sets. Strangely, in the string summarise case, pandas does way worse than data.table, with no clear pattern explaining why. In the int scenario, their computation time is more akin, with slight wins and losses on both sides. In case of handling booleans, pandas seems to do a little better, taking half the computation time of data.table at the larges data set with 100.000 rows and 1200 columns. Now, things become more interesting when we look at the grouping scenarios.
When looking at the relative times, you can see that — when only comparing grouping scenarios — data.table performes most efficient when the number of groups is largest and groups are evenly distributed (“unif”), being faster than pandas across all data types. Then, the relative efficiency declines when grouping by a variable that is less evenly distributed (“pois”), and is the lowest (i.e. data.table is the slowest) when being confronted with a heavily scewed grouping variable that follows a zero-inflated poisson distribution (“zip”). In the latter grouping scenario, pandas does way better than the R counterpart.
In general, in the bool, int and double case, pandas seems to get closer to or even overtake data.table in terms of computation time when the number of rows in the data increases, i.e. more data needs to be aggregated. In the string case, it is the opposite.

Conclusion and Outlook

The results from the simulation study are mixed, with no clear winner across all operation categories that needed to be carried out. If sorting happened a lot in your codes, data.table should be the package to go. However, if you wanted to use a package that is fast at creating new data columns, pandas would be your preferred choice. With aggregation, my opinion based on the results is that pandas is more often extremely faster than slower when compared with data.table, but in general, which one is quicker depends on the data type and the nature of the grouping variable you encounter. Having said that, I would not want to dare declaring a clear winner.
We are aware that there are numerous points in the study that could be altered in order to improve the validity of our findings. For instance, it would be necessary to investigate the aggregation scenarios in more depth in order to understand what really drives performance of the packages. If only having looked at the grouping scenarios, I would have concluded that data.table becomes slower than pandas when the number of groups by which it has to carry out aggregations decreases, and vice versa. This conclusion, however, is contradicted by the finding that data.table beats pandas in aggregating a double and string column when no grouping is required.
That being said, we consider this simulation experiment a valuable starting point from which we can start more thorough endeavors into the realm of performance comparisons between data.table and pandas. Please feel free to contribute to this projects by sending us an email. We also have a public repo containing all necessary codes on GitHub.
So far, the only thing we can say for sure is that, in the end, it indeed seems that choosing between pandas and data.table is up to your taste (speaking of which: I am going to treat myself with some sweets…).

The story continues

As Christian and I have already mentioned in part 1 of this simulation study series, pandas and data.table have become the most widely used packages for data manipulation in Python and R, respectively (in R, of course, one may not miss mentioning the dplyr package). Furthermore, at STATWORX we have experts in both domains, and besides having some lovely sweets at our company (see David’s and Jessica’s blog posts on this topic), we also have discussions about one particular topic: R or Python? Which one is the better language? Well, “better” is quite a broad term, and eventually, it is certainly a matter of taste and depends on the context of your problem. However, when crunching large amounts of data, one criterion needs to be met by any “data language”: speed.
Speed is the central subject of our series. We present performance comparisons between pandas and data.table when being confronted with common tasks in data engineering, and this time, we focus on three of such operations especially:

The set up (skip if you have read part 1)

The simulation study is rather complex, so some time should be spent on explaining the set up. Each of the above mentioned operation categories is a simulation in its own right. Independently of the operations, we have created 12 datasets of different sizes — the largest one has 1200 columns and 100.000 rows — whose columns consist of four possible data types: integer, double, string, and boolean. Each column in turn follows a different distribution from which it was generated, depending on the data type it was assigned (for example, an integer column may stem from a zero-inflated poisson distribution, a poisson distribution, or a uniform distribution). For each of the three above data operation categories — sorting, generating, and aggregating — we have defined a set of concrete actions, which we call scenario. One scenario for the data sorting case may, for instance, be “take a pre-specified randomly selected integer column and sort the whole dataset according to that column”. Each scenario is run 100 times and the median computation time in milliseconds computed. In the main plots you see below, we put the median computation times of pandas and data.table in contrast by dividing the former by the latter.
The computations were performed on a machine with an Intel i7 2.2GHz with 4 physical cores, 16GB RAM and a SSD harddrive. Software Versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.

Sort / Arrange

Here we check the speed of sorting (also referred to as arranging) a data frame on one or two randomly drawn columns. In total, there are 5 scenarios: sorting (1) an integer, (2) a double, (3) a character string, (4) a boolean, and (5) two randomly drawn columns of any data type.
comparison arrange
As indicated by the relative median sorting times, it becomes obvious that data.table performs sorting tasks considerably faster than its python counterpart. This difference becomes most prominent when looking at the extreme case where the data set has 1200 columns.

Mutate

comparison mutate
In this setting, we create a new “result” column in our dataset based on some pre-specified operations that depend on the data type we are are considering. In total, we measure the computation time of four scenarios, one scenario per data type (We used the term “mutate” here as a reference to the dplyr package, which has a mutate()-function):

  1. Take a pre-specified randomly chosen integer column int and increment each element in int by one: result = int + 1
  2. Take a pre-specified randomly chosen double column dbl and double each element: result = dbl * 2
  3. Take a string column str and append each element with the character ‘a’: result = str oplus a
  4. Take a bool column lgl and negate it: result = neg lgl

In this operation category, pandas performs very efficiently accross als scenarios. In absolute terms, when having to create a column of type integer, double, or bool, pandas never needed more than a millisecond to complete the task, while it took data.table up to 40 milliseconds to do the same thing. This is the reason why — when rounding to one digit — most of the fractions in the graph above mainly comprise panels filled with a zero. Interestingly, the relative performance gain of pandas over data.table is constant with respect to the number of rows across all numbers of columns. In the string scenario, the relative mutate time is increasing with an increase in the number of rows, yet pandas beats data.table with a significant margin.

Aggregate

We have tested 16 different scenarios that are a combination of a randomly selected column of a specific data type and a randomly selected grouping variable (provided we are in a grouping senario) which is in all cases an integer. Pictures are always better than words, so here you go.
aggregation scenarios
The results are illustrated below. There is a lot of information in this graph, so take your time construing the results yourself. I will pinpoint some general patterns that are observable from the graph.
comparison summarise
First, let us focus in the aggregation scenarios in which no grouping was applied. In the double and string case, data.table appears to outperform pandas on all data sets. Strangely, in the string summarise case, pandas does way worse than data.table, with no clear pattern explaining why. In the int scenario, their computation time is more akin, with slight wins and losses on both sides. In case of handling booleans, pandas seems to do a little better, taking half the computation time of data.table at the larges data set with 100.000 rows and 1200 columns. Now, things become more interesting when we look at the grouping scenarios.
When looking at the relative times, you can see that — when only comparing grouping scenarios — data.table performes most efficient when the number of groups is largest and groups are evenly distributed (“unif”), being faster than pandas across all data types. Then, the relative efficiency declines when grouping by a variable that is less evenly distributed (“pois”), and is the lowest (i.e. data.table is the slowest) when being confronted with a heavily scewed grouping variable that follows a zero-inflated poisson distribution (“zip”). In the latter grouping scenario, pandas does way better than the R counterpart.
In general, in the bool, int and double case, pandas seems to get closer to or even overtake data.table in terms of computation time when the number of rows in the data increases, i.e. more data needs to be aggregated. In the string case, it is the opposite.

Conclusion and Outlook

The results from the simulation study are mixed, with no clear winner across all operation categories that needed to be carried out. If sorting happened a lot in your codes, data.table should be the package to go. However, if you wanted to use a package that is fast at creating new data columns, pandas would be your preferred choice. With aggregation, my opinion based on the results is that pandas is more often extremely faster than slower when compared with data.table, but in general, which one is quicker depends on the data type and the nature of the grouping variable you encounter. Having said that, I would not want to dare declaring a clear winner.
We are aware that there are numerous points in the study that could be altered in order to improve the validity of our findings. For instance, it would be necessary to investigate the aggregation scenarios in more depth in order to understand what really drives performance of the packages. If only having looked at the grouping scenarios, I would have concluded that data.table becomes slower than pandas when the number of groups by which it has to carry out aggregations decreases, and vice versa. This conclusion, however, is contradicted by the finding that data.table beats pandas in aggregating a double and string column when no grouping is required.
That being said, we consider this simulation experiment a valuable starting point from which we can start more thorough endeavors into the realm of performance comparisons between data.table and pandas. Please feel free to contribute to this projects by sending us an email. We also have a public repo containing all necessary codes on GitHub.
So far, the only thing we can say for sure is that, in the end, it indeed seems that choosing between pandas and data.table is up to your taste (speaking of which: I am going to treat myself with some sweets…).