
Getting the data in the quantity, quality and format you need is often the most challenging part of a data science project. But it's also one of the most important parts, if not the most important. That's why my colleagues and I at STATWORX tend to spend a lot of time setting up good ETL processes. Thanks to frameworks like Airflow, this is no longer the sole prerogative of Data Engineers. If you know a bit of SQL and Python, you can orchestrate your own ETL process like a pro. Read on to find out how!

ETL does not stand for Extraterrestrial Life

At least not in Data Science and Engineering.

ETL stands for Extract, Transform, Load and describes a set of database operations. Extracting means we read the data from one or more data sources. Transforming means we clean, aggregate or combine the data to get it into the shape we want. Finally, we load it to a destination database.

Does your ETL process consist of Karen sending you an Excel sheet so long you can spend your workday just scrolling through it? Or do you have to query a database manually every day, tweaking your SQL queries to the occasion? If so, venture a peek at Airflow.

How Airflow can help with ETL processes

Airflow is a Python-based framework that allows you to programmatically create, schedule and monitor workflows. These workflows consist of tasks and dependencies and can be automated to run on a schedule. If anything fails, there are logs and error-handling facilities to help you fix it.

Using Airflow can make your workflows more manageable, transparent and efficient. And yes, I'm talking to you, fellow Data Scientists! Getting access to up-to-date, high-quality data is far too important to leave to the Data Engineers alone 😉 (we still love you).

The point is, if you’re working with data, you’ll profit from knowing how to wield this powerful tool.

How Airflow works

To learn more about Airflow, check out this blog post from my colleague Marvin. It will get you up to speed quickly. Marvin explains in more detail how Airflow works and what advantages/disadvantages it has as a workflow manager. Also, he has an excellent quick-start guide with Docker.

What matters to us is knowing that Airflow is based on DAGs, or Directed Acyclic Graphs, which describe what tasks our workflow consists of and how they are connected.

Note that in this tutorial we're not actually going to deploy the pipeline. Otherwise, this post would be even longer, and you probably have friends and family who like to see you. So today is all about creating a DAG step by step.

If you care more about the deployment side of things, stay tuned! I plan to do a step-by-step guide on that in the next post. As a small solace, know that you can test every task of your workflow with:

airflow test [your dag id] [your task id] [execution date]

There are more options, but that’s all we need for now.
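For example, once the DAG below is in place (its dag_id is 'blog'), we could dry-run its first task for the start date like this:

airflow test blog check_bq_data_exists 2020-02-20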

What you need to follow this tutorial

This tutorial shows you how you can use Airflow in combination with BigQuery and Google Cloud Storage to run a daily ETL process. The main thing you need is a Google Cloud account.

If you already have a Google Cloud account, you can hit the ground running! If not, consider opening one. You do need to provide a credit card, but don't worry, this tutorial won't cost you anything. If you sign up for the first time, you get a free one-year trial period. And even if that's expired, we're staying well within the bounds of Google's Always Free tier.

Finally, it helps if you know some SQL. I know it's something most Data Scientists don't find too sexy, but the more you use it, the more you like it. I guess it's the orthopedic shoe of Data Science: a bit ugly, sure, but unbeatable at what it's designed for. If you're not familiar with SQL or dread it like the plague, don't sweat it: each query's name says what it does.

What BigQuery and Google Cloud Storage are

BigQuery and Cloud Storage are two of the most popular products of the Google Cloud Platform (GCP). BigQuery is a serverless cloud data warehouse that allows you to analyze up to petabytes of data at high speed. Cloud Storage, on the other hand, is just that: cloud-based object storage. Grossly simplified, we use BigQuery as a database to query and Cloud Storage as a place to save the results.

In more detail, our ETL process:

  • checks for the existence of data in BigQuery and Google Cloud Storage
  • queries a BigQuery source table and writes the result to a table
  • ingests the Cloud Storage data into another BigQuery table
  • merges the two tables and writes the result back to Cloud Storage as a CSV

Connecting Airflow to these services

If you set up a Google Cloud account, you should have a JSON authentication file. I suggest putting it in your home directory. We use this file to connect Airflow to BigQuery and Cloud Storage. To do so, copy and paste these lines into your terminal, substituting your project ID and JSON path. You can read more about connections here.

# for BigQuery: delete the default connection, then add our own
airflow connections -d --conn_id bigquery_default

airflow connections -a --conn_id bigquery_default --conn_uri 'google-cloud-platform://:@:?extra__google_cloud_platform__project=[YOUR PROJECT ID]&extra__google_cloud_platform__key_path=[PATH TO YOUR JSON]'


# for Google Cloud Storage: delete the default connection, then add our own
airflow connections -d --conn_id google_cloud_default

airflow connections -a --conn_id google_cloud_default --conn_uri 'google-cloud-platform://:@:?extra__google_cloud_platform__project=[YOUR PROJECT ID]&extra__google_cloud_platform__key_path=[PATH TO YOUR JSON]'

Writing our DAG

Task 0: Setting the start date and the schedule interval

We are ready to define our DAG! This DAG consists of multiple tasks or things our ETL process should do. Each task is instantiated by a so-called operator. Since we’re working with BigQuery and Cloud Storage, we take the appropriate Google Cloud Platform (GCP) operators.

Before defining any tasks, we specify the start date and schedule interval. The start date is the date when the DAG first runs. In this case, I picked February 20th, 2020. The schedule interval is how often the DAG runs, i.e., on what schedule. You can use cron notation here, a timedelta object or one of Airflow's cron presets (e.g., '@daily').

Tip: you normally want to keep the start date static to avoid unpredictable behavior (so no datetime.now() shenanigans even though it may seem tempting).

from datetime import datetime, timedelta

# set start date and schedule interval
start_date = datetime(2020, 2, 20)
schedule_interval = timedelta(days=1)

You can find the complete DAG file on our STATWORX GitHub. There you also see all the config parameters I set, e.g., what our project, dataset, buckets, and tables are called, as well as all the queries. We gloss over them here, as they're not central to understanding the DAG.
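One exception, since the DAG definition below references it: the default_args dictionary. Here is a minimal sketch with illustrative values; the exact settings live in the GitHub file:

# minimal sketch of default_args (illustrative values only)
default_args = {
    'owner': 'airflow',
    'start_date': start_date,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}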

Task 1: Check that there is data in BigQuery

We set up our DAG taking advantage of Python's context manager. You don't have to do this, but it saves some typing. One detail: I'm setting catchup=False because I don't want Airflow to do a backfill on my data.

# write dag
with DAG(dag_id='blog', default_args=default_args, schedule_interval=schedule_interval, catchup=False) as dag:

    t1 = BigQueryCheckOperator(task_id='check_bq_data_exists',
                               sql=queries.check_bq_data_exists,
                               use_legacy_sql=False)

We start by checking if the data for the date we’re interested in is available in the source table. Our source table, in this case, is the Google public dataset bigquery-public-data.austin_waste.waste_and_diversion. To perform this check we use the aptly named BigQueryCheckOperator and pass it an SQL query.

If the check_bq_data_exists query returns even one non-null row of data, we consider it successful, and the next task can run. Notice that we're making use of macros and Jinja templating to dynamically insert dates into our queries; these templates are rendered at runtime.

check_bq_data_exists = """
SELECT load_id
FROM `bigquery-public-data.austin_waste.waste_and_diversion`
WHERE report_date BETWEEN DATE('{{ macros.ds_add(ds, -365) }}') AND DATE('{{ ds }}')
"""

Task 2: Check that there is data in Cloud Storage

Next, let’s check that the CSV file is in Cloud Storage. If you’re following along, just download the file from the STATWORX Github and upload it to your bucket. We pretend that this CSV gets uploaded to Cloud Storage by another process and contains data that we need (in reality I just extracted it from the same source table).

So we don't care how the CSV got into the bucket; we just want to know: is it there? This is easily verified in Airflow with the GoogleCloudStorageObjectSensor, which checks for the existence of a file in Cloud Storage. Notice the indentation: it's still part of our DAG context.

Defining the task itself is simple: just tell Airflow which object to look for in your bucket.

    t2 = GoogleCloudStorageObjectSensor(task_id='check_gcs_file_exists',
                                        bucket=cfg.BUCKET,
                                        object=cfg.SOURCE_OBJECT)

Task 3: Extract data and save it to BigQuery

If the first two tasks succeed, then all the data we need is available! Now let’s extract some data from the source table and save it to a new table of our own. For this purpose, there’s none better than the BigQueryOperator.

    t3 = BigQueryOperator(task_id='write_weight_data_to_bq',
                          sql=queries.write_weight_data_to_bq,
                          destination_dataset_table=cfg.BQ_TABLE_WEIGHT,
                          create_disposition='CREATE_IF_NEEDED',
                          write_disposition='WRITE_TRUNCATE',
                          use_legacy_sql=False)

This operator sends a query called write_weight_data_to_bq to BigQuery and saves the result in a table specified by the config parameter cfg.BQ_TABLE_WEIGHT. We can also set a create and write disposition if we so choose.

The query itself pulls the total weight of dead animals collected every day by Austin waste management services for a year. If the thought of possum pancakes makes you queasy, just substitute 'RECYCLING – PAPER' for the TYPE variable in the config file.

Task 4: Ingest Cloud Storage data into BigQuery

Once we’re done extracting the data above, we need to get the data that’s currently in our Cloud Storage bucket into BigQuery as well. To do this, just tell Airflow what (source) object from your Cloud Storage bucket should go to which (destination) table in your BigQuery dataset.

Tip: You can also specify a schema at this step, but I didn’t bother since the autodetect option worked well.

    t4 = GoogleCloudStorageToBigQueryOperator(task_id='write_route_data_to_bq',
                                              bucket=cfg.BUCKET,
                                              source_objects=[cfg.SOURCE_OBJECT],
                                              field_delimiter=';',
                                              destination_project_dataset_table=cfg.BQ_TABLE_ROUTE,
                                              create_disposition='CREATE_IF_NEEDED',
                                              write_disposition='WRITE_TRUNCATE',
                                              skip_leading_rows=1)

Task 5: Merge BigQuery and Cloud Storage data

Now we have both the BigQuery source table extract and the CSV data from Cloud Storage in two separate BigQuery tables. Time to merge them! How do we do this?

The BigQueryOperator is our friend here. We just pass it a SQL query that specifies how we want the tables merged. By specifying the destination_dataset_table argument, we tell it which table to put the result in.

    t5 = BigQueryOperator(task_id='prepare_and_merge_data',
                          sql=queries.prepare_and_merge_data,
                          use_legacy_sql=False,
                          destination_dataset_table=cfg.BQ_TABLE_MERGE,
                          create_disposition='CREATE_IF_NEEDED',
                          write_disposition='WRITE_TRUNCATE')

The query follows below. I know it looks long and excruciating, but trust me, there are Tinder dates worse than this ('So, uh, do you like SQL?' – 'Which one?'). If he/she follows it up with Lord of the Rings or 'Yes', propose!


What is it that this query is doing? Let's recap: we have two tables at the moment. One is basically a time series of how much dead animal waste Austin public services collected over the course of a year. The second table contains information on what types of routes were driven on those days.

As if this pipeline weren't weird enough, we now also want to know what the most common route type was on a given day. So we start by counting which route types were recorded on each day. Next, we use a window function (... OVER (PARTITION BY ... ORDER BY ...)) to find the route type with the highest count per day. In the end, we pull it out and, using the date as a key, merge it with the table containing the waste info.

prepare_and_merge_data = """
WITH
simple_route_counts AS (
SELECT report_date,
       route_type,
       count(route_type) AS count
FROM `my-first-project-238015.waste.route` 
GROUP BY report_date, route_type
),
max_route_counts AS (
SELECT report_date,
       FIRST_VALUE(route_type) OVER (PARTITION BY report_date ORDER BY count DESC) AS top_route,
       ROW_NUMBER() OVER (PARTITION BY report_date ORDER BY count DESC) AS row_number
FROM simple_route_counts
),
top_routes AS (
SELECT report_date AS date,
       top_route
FROM max_route_counts
WHERE row_number = 1
)
SELECT a.date,
       a.type,
       a.weight,
       b.top_route
FROM `my-first-project-238015.waste.weight` a
LEFT JOIN top_routes b
ON a.date = b.date
ORDER BY a.date DESC
"""

Task 6: Export result to Google Cloud Storage

Let's finish off this process by exporting our result back to Cloud Storage. By now, you're probably guessing what the right operator is called. If you guessed BigQueryToCloudStorageOperator, you're spot on. How do we use it, though? Just specify the source table and the destination path (URI) in your Cloud Storage bucket.

    t6 = BigQueryToCloudStorageOperator(task_id='export_results_to_gcs',
                                        source_project_dataset_table=cfg.BQ_TABLE_MERGE,
                                        destination_cloud_storage_uris=cfg.DESTINATION_URI,
                                        export_format='CSV')

The only thing left to do is to determine how the tasks relate to each other, i.e., to set the dependencies. We can do this using the >> notation, which I find more readable than set_upstream() or set_downstream(). But take your pick.

    t1 >> t2 >> [t3, t4] >> t5 >> t6

The notation above says: if data is available in the BigQuery source table, check next if data is also available in Cloud Storage. If so, go ahead, extract the data from the source table and save it to a new BigQuery table. In addition, transfer the CSV file data from Cloud Storage into a separate BigQuery table. Once those two tasks are done, merge the two newly created tables. At last, export the merged table to Cloud Storage as a CSV.
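For reference, the bitshift notation is just syntactic sugar for the setter methods mentioned above; these lines all express the same first dependency:

# equivalent ways of saying "t1 runs before t2"
t1 >> t2
t1.set_downstream(t2)
t2.set_upstream(t1)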

Conclusion

That's it! Thank you very much for sticking with me to the end! Let's wrap up what we did: we wrote a DAG file defining an automated ETL process that extracts, transforms and loads data with the help of two Google Cloud Platform services: BigQuery and Cloud Storage. All we needed was some Python and SQL code.

What's next? There's much more to explore: using the web UI, monitoring your workflows, dynamically creating tasks, orchestrating machine learning models, and on and on. We've barely scratched the surface here.

So check out Airflow’s official website to learn more. For now, I hope you got a better sense of the possibilities you have with Airflow and how you can harness its power to manage and automate your workflows.


"Taxes and random forest again?" Thomas, my colleague here at STATWORX, raised his eyebrows when I told him about this post. "Yes!" I replied, "but not because I love taxes so much (who does?). This time it's about tuning!"

Let’s rewind for a moment: in the previous post, we looked at how we can combine econometric techniques like differencing with machine learning (ML) algorithms like random forests to predict a time series with a high degree of accuracy. If you missed it, I encourage you to check it out here. The data is now also available as a CSV file on our STATWORX GitHub.

Since we covered quite a lot of ground in the last post, there wasn't much room for other topics. That's where this post comes in. Today, we take a look at how to tune the hyperparameters of a random forest when dealing with time series data.

Any takers? Alright, then let’s do this! If you read the last post, feel free to skip over section 1 and move right on to 2.

The setup

# load packages
suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(tsibble))
suppressPackageStartupMessages(require(randomForest))
suppressPackageStartupMessages(require(forecast))

# specify the path to the csv file (your path here)
file <- "/Users/manueltilgner/Library/Mobile Documents/com~apple~CloudDocs/Artificial Intelligence/10_Blog/RF/tax.csv"

# read in the csv file
tax_tbl <- readr::read_delim(
  file = file,
  delim = ";",
  col_names = c("Year", "Type", month.abb),
  skip = 1,
  col_types = "iciiiiiiiiiiii",
  na = c("...")
) %>% 
  select(-Type) %>% 
  gather(Date, Value, -Year) %>% 
  unite("Date", c(Date, Year), sep = " ") %>% 
  mutate(
    Date = Date %>% 
      lubridate::parse_date_time("m y") %>% 
      yearmonth()
  ) %>% 
  drop_na() %>% 
  as_tsibble(index = "Date") %>% 
  filter(Date <= "2018-12-01")

# convert to ts format
tax_ts <- as.ts(tax_tbl)

Again, all of this is from last time; I've included it here so you can get up and running quickly (just copy and paste).

# pretend we're in December 2017 and have to forecast the next twelve months
tax_ts_org <- window(tax_ts, end = c(2017, 12))

# estimate the required order of differencing
n_diffs <- nsdiffs(tax_ts_org)

# log transform and difference the data
tax_ts_trf <- tax_ts_org %>% 
  log() %>% 
  diff(n_diffs)

# embed the matrix
lag_order <- 6 # the desired number of lags (six months)
horizon <- 12 # the forecast horizon (twelve months)

tax_ts_mbd <- embed(tax_ts_trf, lag_order + 1) # embedding magic!

# do a train test split
y_train <- tax_ts_mbd[, 1] # the target
X_train <- tax_ts_mbd[, -1] # everything but the target

y_test <- window(tax_ts, start = c(2018, 1), end = c(2018, 12)) # the year 2018
X_test <- tax_ts_mbd[nrow(tax_ts_mbd), c(1:lag_order)] # the test set consisting
# of the six most recent values (we have six lags) of the training set. It's the
# same for all models.

Cross-validation and its (mis-)contents

How do you tune the hyperparameters of an ML model when you’re dealing with time series data? Viewer discretion advised: the following material might be disturbing to some audiences.

Before we answer the question above, let’s first answer the following: how do we tune the hyperparameters of an ML model on data that has no time dimension?

Generally, we proceed as follows:

  1. pick a handful of hyperparameters,
  2. select a sensible range of values for each, and finally,
  3. determine which configuration is 'best', e.g., in the sense of minimizing an error metric.

Steps 1) and 2) are a matter of trial and error. In most cases, there’s no analytic way of deriving which hyperparameters in which configuration work best. While some practical insights have emerged over the years, generally, you just have to experiment and see what works with your data.

What about step 3)? Finding the 'best' configuration can be achieved with the help of different resampling schemes, a popular one being k-fold cross-validation. Popular, because k-fold cross-validation tends to provide fairly reliable model performance estimates and makes efficient use of the data. How so?

Let's refresh: k-fold cross-validation works by splitting the data into k folds of roughly equal size. Each fold constitutes a subset of the data with N/k observations. The model is then fit on k-1 folds, and its loss (e.g., RMSE) is computed on the kth, left-out fold, which serves as a validation set. This process repeats k times until each fold has served once as a validation set.
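As a minimal sketch of this logic in base R (toy data and a linear model standing in for any learner):

# toy k-fold CV: estimate out-of-sample RMSE fold by fold
set.seed(1)
n <- 100; k <- 5
df <- data.frame(x = rnorm(n))
df$y <- 2 * df$x + rnorm(n)
folds <- sample(rep(1:k, length.out = n))  # random fold assignment

rmse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = df[folds != i, ])         # fit on k-1 folds
  pred <- predict(fit, newdata = df[folds == i, ])  # predict the left-out fold
  sqrt(mean((df$y[folds == i] - pred)^2))
})
mean(rmse)  # cross-validated RMSE estimate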

From this description, it's clear why k-fold CV sounds like a bad idea for time series data: we invariably end up fitting a model on future data to predict the past! I don't know about you, but somehow this doesn't feel right. Needless to say, if we shuffle the data before splitting it into folds, we destroy the time structure completely.

Hold out, not so fast

What to do instead? In time series econometrics, a traditional approach is this: reserve the last part of your time series as a holdout set. That is, fit your model on the first part of the series and evaluate it on the last part. In essence, this is nothing more than a validation set strategy.

This approach is simple and intuitive. Moreover, it preserves the order of the time series. But it also has a significant downside: it does not provide robust performance estimates since we test our model only once and on a ‚test‘ set that is arbitrary.

Is it a problem?

It definitely can be: time series, above all, tend to represent phenomena that are ever-changing and evolving. As such, the test set (i.e., the 'future') may be systematically different from the train set (i.e., the 'past'). This change in the underlying data-generating process over time is known as 'concept drift' and poses a serious challenge in time series forecasting.

But we're getting off track! What are we trying to achieve here? We want reliable performance estimates for our ML model on time series data. Only then can we decide with some degree of certitude which hyperparameters to choose for our model. Yet the traditional approach of a single train/test split doesn't really cut it for us.

Much better would be multiple train/test splits. In the context of time series, this means sliding a fixed or steadily expanding window over our series, training on one part of the data, predicting the next, and computing the loss. Then we move the window ahead by one observation (or enlarge it) and repeat. Rob Hyndman has a great post on this.
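If you want to try this, the forecast package implements the idea in tsCV(); here is a minimal sketch with a naive forecaster standing in for the actual model:

# sliding-origin CV errors for a naive forecaster, 12 steps ahead
library(forecast)
e <- tsCV(AirPassengers, naive, h = 12)
sqrt(mean(e^2, na.rm = TRUE))  # RMSE across all forecast origins and horizons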

This approach, called time series cross-validation, is effective but computationally expensive. Imagine: if you have 10 hyperparameter configurations and test each of them with 20 train/test splits, you end up fitting two hundred models. Depending on the model and the amount of data you have, this can take its sweet time.

Since we're busy people living in a busy world, let's stick with the holdout strategy for now. It sounds simple, but how do you do it? Personally, I like to use the createTimeSlices function from the caret package. This makes sense if you have a caret workflow or work with many different series. If not, simply slice your (training) series so that the last part is reserved as a validation set.

To see how the createTimeSlices function works, run it on its own first.

# we hold out the last 12 observations from our train set because this is also how far we want to predict into the future later
caret::createTimeSlices(
  1:nrow(X_train),
  initialWindow = nrow(X_train) - horizon,
  horizon = horizon,
  fixedWindow = TRUE
)

You see that it splits our train set into a train and validation set, based on the forecast horizon we specified (12 months). Ok, now let’s set it up in such a way that we can use it in a train workflow:

tr_control <- caret::trainControl(
  method = 'timeslice',
  initialWindow = nrow(X_train) - horizon,
  fixedWindow = TRUE
)

With trainControl in place, let's set up a tuning grid next. While we could get super fancy here, for random forests it often boils down to two hyperparameters that matter: the number of trees (ntree) and the number of predictors (mtry) that get sampled at each split in the tree.

Good values for ntree are a few hundred to a thousand. More trees can give you a bump in accuracy, but usually not by much. Since random forests do not run a high risk of overfitting, the question of how many trees to use really comes down to how much computing power (or time) you have. And since tuning on time series tends to be computationally expensive, let's pick 500 trees.

mtry, on the other hand, tends to be the parameter where the party's at. It represents the number of predictors considered as splitting candidates at each node in the tree. James et al. (2013) recommend p/3 or sqrt(p) (where p is the number of predictors) as a guideline. I'm also going to throw in p itself, which amounts to bagging (side note: if you want to explore these concepts further, check out my colleagues' posts on cross-validation and bagging).

# note: for random forests, caret lets us tune only one hyperparameter: mtry
tune_grid <- expand.grid(
  mtry = c(
    ncol(X_train), # p
    ncol(X_train) / 3, # p / 3
    ceiling(sqrt(ncol(X_train))) # square root of p
  )
)

# let's see which hyperparameter the holdout method recommends
holdout_result <- caret::train(
  data.frame(X_train),
  y_train,
  method = 'rf',
  trControl = tr_control,
  tuneGrid = tune_grid
)

This method recommends mtry = 6. So, should we stop and fit our model already? Not just yet!

Cross-validation again?

Remember all the bad things we said about k-fold cross-validation (CV) with time series data? All of it is still technically true. But now comes the weird part: Bergmeir et al. (2018) found that k-fold CV can actually be applied to time series models, but only if they are purely autoregressive. In other words, we can use k-fold CV on time series data, but only if the predictors in our model are lagged versions of the response.

How come? The authors note that for k-fold CV to be valid, the errors of the model must be uncorrelated. This condition is met when the model we train fits the data well. In other words, k-fold CV is valid if our model nests a good model, i.e., a model whose weights are a subset of the weights of the model we train. When this is the case, k-fold cross-validation is actually a better choice than the holdout strategy!

I don’t know about you, but for me, this was a surprise. Since it’s a mighty useful insight, it’s worth repeating one more time: if the residuals of our (purely autoregressive) time series model are serially uncorrelated, then k-fold cross-validation can and should be used over the holdout strategy.
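How can you check this condition in practice? One option is a Ljung-Box test on the in-sample residuals. A quick sketch, assuming the X_train and y_train objects from above are in memory:

# fit once, then test the residuals for serial correlation
set.seed(2019)
fit_check <- randomForest(X_train, y_train)
res <- y_train - predict(fit_check, X_train)
Box.test(res, lag = 12, type = "Ljung-Box")  # a large p-value suggests uncorrelated errors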

Are you as curious as I am now? Then let's put it to work for us, and not just once, but repeatedly!

tr_control <- trainControl(
  method = 'repeatedcv',
  number = 10, 
  repeats = 3
)

kfold_result <- caret::train(
  data.frame(X_train),
  y_train,
  method = 'rf',
  trControl = tr_control,
  tuneGrid = tune_grid
)

Using k-fold CV on this time series suggests a value of mtry = 2.

Interesting! Let's train our random forest twice now, once with mtry = 2 and once with mtry = 6. Then we compare which one gives the better prediction!

mtrying it out!

# set up our empty forecast tibble
forecasts_rf <- tibble(
  mtry_holdout = rep(NA, horizon),
  mtry_kfold = rep(NA, horizon)
)

# collect the two mtry values from the tuning step
mtrys <- c(
  holdout_result$bestTune[[1]],
  kfold_result$bestTune[[1]]
)

# train the models in a double loop
for (i in seq_len(length(mtrys))) {
  # start fresh for each mtry run
  y_train <- tax_ts_mbd[, 1]
  X_train <- tax_ts_mbd[, -1]

  # train one model per forecast horizon
  for (j in seq_len(horizon)) {
    # set seed
    set.seed(2019)

    # fit the model
    fit_rf <- randomForest(X_train, y_train, mtry = mtrys[i])

    # predict using the test set
    forecasts_rf[j, i] <- predict(fit_rf, X_test)

    # here is where we repeatedly reshape the training data to reflect the
    # time distance corresponding to the current forecast horizon
    y_train <- y_train[-1]
    X_train <- X_train[-nrow(X_train), ]
  }
}

Again, we need to back-transform our forecasts to calculate the accuracy on our test set. For this, I’m going to use purrr.

last_observation <- as.vector(tail(tax_ts_org, 1))

forecasts <- forecasts_rf %>% 
  purrr::map_df(function(x) exp(cumsum(x)) * last_observation)

accuracies <- forecasts %>% 
  purrr::map(function(x) accuracy(x, as.vector(y_test))) %>%
  do.call(rbind, .)

And what do you know! k-fold CV indeed proved better than our holdout approach: it reduced both the RMSE and the MAPE. We even replicated our result from last time, where we also got a MAPE of 2.6 percent. So at least here, using the random forest out of the box was totally fine.

               ME      RMSE       MAE      MPE     MAPE
holdout  429258.4  484154.7  429258.4  4.709705 4.709705
k-fold   198307.5  352789.9  238652.6  2.273785 2.607773

Let’s visualize it as well:

plot <- tax_tbl %>% 
  filter(Date >= "2018-01-01") %>% 
  mutate(
    Ground_Truth = Value / 10000,
    Forecast_Holdout = forecasts$mtry_holdout / 10000,
    Forecast_Kfold = forecasts$mtry_kfold / 10000
  ) %>% 
  ggplot(aes(x = Date)) +
  geom_line(aes(y = Value / 10000, linetype = "Truth")) +
  geom_line(aes(y = Forecast_Holdout, color = "Holdout")) +
  geom_line(aes(y = Forecast_Kfold, color = "k-fold CV")) +
  theme_minimal()+
  labs(
    title = "Forecast of the German Wage and Income Tax for the Year 2018",
    x = "Months",
    y = "Euros"
  ) +
  scale_color_manual(
    name = "Forecasts",
    values = c("Truth" = "black", "Holdout" = "darkblue", "k-fold CV" = "orange")
  ) +
  scale_linetype_manual(name = "Original", values = c("Truth" = "dashed")) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months")

We see that the orange line, which represents the forecasts from the k-fold CV model, tends to hug the true values more snugly at several points.


Conclusion (TL;DR)

Tuning ML models on time series data can be expensive, but it needn't be. If the model you're fitting uses only endogenous predictors, i.e., lags of the response, you're in luck! You can go ahead and use the known and beloved k-fold cross-validation strategy to tune your hyperparameters. If you want to go deeper, check out the original paper in the references. Otherwise, go pick a time series of your choice and see if you can improve your model with a bit of tuning. No matter the result, it'll always beat doing taxes!

References

James, Gareth, et al. An introduction to statistical learning. Vol. 112. New York: Springer, 2013.

Bergmeir, Christoph, Rob J. Hyndman, and Bonsoo Koo. "A note on the validity of cross-validation for evaluating autoregressive time series prediction." Computational Statistics & Data Analysis 120 (2018): 70-83.

Benjamin Franklin said that only two things are certain in life: death and taxes. That explains why my colleagues at STATWORX were less than excited when they told me about their plans for the weekend a few weeks back: doing their income tax declaration. Man, I thought, that sucks, I’d rather spend this time outdoors. And then an idea was born.

What could taxes and the outdoors possibly have in common? Well, I asked myself: can we predict tax revenue using random forest? (wildly creative, I know). When dealing with tax revenue, we enter the realm of time series, ruled by fantastic beasts like ARIMA, VAR, STLM, and others. These are tried and proven methods, so why use random forests?

Well, you and I may both agree that random forest is one of the most awesome algorithms around: it's simple, flexible, and powerful. So much so that Wyner et al. (2017) call it the 'off-the-shelf' tool for most data science applications. Long story short, it's one of those algorithms that just works (if you want to know exactly how, check out this excellent post by my colleague Andre).

Random forest is a hammer, but is time series data a nail?

You've probably used random forests for regression and classification before, but time series forecasting? Hold up, you're going to say, time series data is special! And you're right. When it comes to data with a time dimension, applying machine learning (ML) methods becomes a little tricky.

How come? Well, random forests, like most ML methods, have no awareness of time. On the contrary, they treat observations as independent and identically distributed. This assumption is obviously violated in time series data, which is characterized by serial dependence.

What's more, random forests and other decision-tree-based methods are unable to predict a trend, i.e., they do not extrapolate. To understand why, recall that trees operate by if-then rules that recursively split the input space. As a result, they cannot predict values that fall outside the range of the target values seen in the training set.

So, should we go back to ARIMA? Not just yet! With a few tricks, we can do time series forecasting with random forests. All it takes is a little pre- and (post-)processing. This blog post will show you how you can harness random forests for forecasting!

Let it be said that there are different ways to go about this. Here's how we're going to pull it off: we'll raid the time series econometrics toolbox for some old-but-gold techniques, namely differencing and statistical transformations. These are cornerstones of ARIMA modeling, but who says we can't use them for random forests as well?

To stick with the topic, we’ll use a time series from the German Statistical Office on the German wage and income tax revenue from 1999 – 2018 (after tax redistribution). You can download the data here. Let’s do it!

Getting ready for machine learning or what’s in a time series anyway?

Essentially, a (univariate) time series is a vector of values indexed by time. In order to make it 'learnable', we need to do some pre-processing. This can include some or all of the following:

  • Statistical transformations (Box-Cox transform, log transform, etc.)
  • Detrending (differencing, STL, SEATS, etc.)
  • Time Delay Embedding (more on this below)
  • Feature engineering (lags, rolling statistics, Fourier terms, time dummies, etc.)

For brevity and clarity, we’ll focus on steps one to three in this post.

Ok, let’s structure this a bit: in order to use random forest for time series data we do TDE: transform, difference and embed. Let’s fire up R and load the required packages plus our data.

# load the packages
suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(tsibble))
suppressPackageStartupMessages(require(randomForest))
suppressPackageStartupMessages(require(forecast))

# specify the csv file (your path here)
file <- ".../tax.csv"

# read in the csv file
tax_tbl <- readr::read_delim(
  file = file,
  delim = ";",
  col_names = c("Year", "Type", month.abb),
  skip = 1,
  col_types = "iciiiiiiiiiiii",
  na = c("...")
) %>% 
  select(-Type) %>% 
  gather(Date, Value, -Year) %>% 
  unite("Date", c(Date, Year), sep = " ") %>% 
  mutate(
    Date = Date %>% 
      lubridate::parse_date_time("m y") %>% 
      yearmonth()
  ) %>% 
  drop_na() %>% 
  as_tsibble(index = "Date") %>% 
  filter(Date <= "2018-12-01")

# convert to ts format
tax_ts <- as.ts(tax_tbl)

Before we dive into the analysis, let’s quickly check for implicit and explicit missings in the data. The tsibble package has some handy functions to do just that:

# implicit missings
has_gaps(tax_tbl)

# explicit missings
colSums(is.na(tax_tbl[, "Value"]))

Nope, looks good! So what kind of time series are we dealing with?

# visualize
plot_org <- tax_tbl %>% 
  ggplot(aes(Date, Value / 1000)) + # to get the axis on a more manageable scale
  geom_line() +
  theme_minimal() +
  labs(title = "German Wage and Income Taxes 1999 - 2018", x = "Year", y = "Euros")

Differencing can make all the difference

If you’ve worked with classical time series models before, you likely stumbled across the concept of differencing. The reason for this is that classical time series models require the data to be stationary.

Stationarity means that the mean and variance of the series are finite and do not change over time. It thus implies some stability in the statistical properties of the time series. As we can see in the plot, our time series is far from stationary! There is an upward trend as well as a distinct seasonal pattern in the series.

How do these two concepts – differencing and stationarity – relate? You probably already know or guessed it: differencing is one way to make non-stationary time series data stationary. That’s nice to know, but right now we care more about the fact that differencing removes changes in the level of a series and, with it, the trend. Just what we need for our random forest!

How is it done? Here, we simply take the first differences of the data, i.e., the difference between consecutive observations: y_t' = y_t - y_{t-1}. This also works with a seasonal lag: y_t' = y_t - y_{t-s}, which amounts to taking the difference between an observation and a previous observation from the same season, e.g., November this year and November last year.
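In R, both variants are one-liners with the base function diff(); a quick sketch on a monthly example series:

# first differences: y_t - y_{t-1}
d1 <- diff(AirPassengers)

# seasonal differences: y_t - y_{t-12}, assuming monthly data
d12 <- diff(AirPassengers, lag = 12)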

Whereas differencing can stabilize the mean of a time series, a Box-Cox or log transformation can stabilize the variance. The family of Box-Cox transformations revolves around the parameter lambda:

    \[\tilde{y}_t = \begin{cases} \log(y_t) & \lambda = 0 \\ (y_t^\lambda - 1)/\lambda & \lambda \neq 0 \end{cases}\]

When lambda is zero, the Box-Cox transformation amounts to taking logs. We choose this value to keep the back-transformation of our forecasts straightforward. But don't hesitate to experiment with different values of lambda, or to estimate the 'best' value with the help of the forecast package.
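For the latter, a minimal sketch with the forecast package (tax_ts_org is defined in the next chunk):

# estimate lambda from the data, transform, and invert again
lambda <- BoxCox.lambda(tax_ts_org)
tax_ts_bc <- BoxCox(tax_ts_org, lambda)
tax_ts_back <- InvBoxCox(tax_ts_bc, lambda)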

# pretend we're in December 2017 and have to forecast the next twelve months
tax_ts_org <- window(tax_ts, end = c(2017, 12))

# estimate the required order of differencing
n_diffs <- nsdiffs(tax_ts_org)

# log transform and difference the data
tax_ts_trf <- tax_ts_org %>% 
  log() %>% 
  diff(n_diffs)

# check out the difference! (pun)
plot_trf <- tax_ts_trf %>% 
  autoplot() +
  xlab("Year") +
  ylab("Euros") +
  ggtitle("German Wage and Income Taxes 1999 - 2018") +
  theme_minimal()

gridExtra::grid.arrange(plot_org, plot_trf)

Let’s sum up what we’ve done so far: we first took logs of the data to stabilize the variance. Then, we differenced the data once to make it stationary in the mean. Together, these rather simple transformations took us from non-stationary to stationary.

What’s next? Well, now we use the data thus transformed to train our random forest and to make forecasts. Once we obtain the forecasts, we reverse the transformations to get them on the original scale of the data.

Just one more step before we get to the modeling part: we still only have a vector. How do we cast this data into a shape that an ML algorithm can handle?

Enter the matrix: Time Delay Embedding

To feed our random forest the transformed data, we need to turn what is essentially a vector into a matrix, i.e., a structure that an ML algorithm can work with. For this, we make use of a concept called time delay embedding.

Time delay embedding represents a time series in a Euclidean space with embedding dimension K. To do this in R, we use the base function embed(). All you have to do is plug in the time series object and set the embedding dimension to one greater than the desired number of lags.

lag_order <- 6 # the desired number of lags (six months)
horizon <- 12 # the forecast horizon (twelve months)

tax_ts_mbd <- embed(tax_ts_trf, lag_order + 1) # embedding magic!

When you check out the tax_ts_mbd object, you’ll see that you get a matrix where the dependent variable in the first column is regressed on its lags in the remaining columns:

    \[Y^K = \begin{bmatrix} y_K & y_{K-1} & \dots & y_2 & y_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ y_i & y_{i-1} & \dots & y_{i-K+2} & y_{i-K+1} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ y_N & y_{N-1} & \dots & y_{N-K+2} & y_{N-K+1} \end{bmatrix}\]

Time delay embedding allows us to use any linear or non-linear regression method on time series data, be it random forest, gradient boosting, support vector machines, etc. I decided to go with a lag of six months, but you can play around with other lags. Moreover, the forecast horizon is twelve as we’re forecasting the tax revenue for the year 2018.
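A toy example makes the structure of embed() output obvious; each row holds the current value followed by its lags:

embed(1:6, 3)
#      [,1] [,2] [,3]
# [1,]    3    2    1
# [2,]    4    3    2
# [3,]    5    4    3
# [4,]    6    5    4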

When it comes to forecasting, I’m pretty direct

In this post, we make use of the direct forecasting strategy. That means we estimate H separate models f_h, one for each forecast horizon. In other words, we train a separate model for each time distance in the data. For an awesome tutorial on how this works, check out this post.

The direct forecasting strategy is less efficient than the recursive forecasting strategy, which estimates only one model f and, as the name suggests, re-uses it H times. Recursive, in this case, means that each forecast is fed back into the model as input to produce the next forecast.

Despite this drawback, the direct strategy has two key advantages: first, it does not suffer from an accumulation of forecast errors, and second, it makes it straightforward to include exogenous predictors.
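To make the contrast concrete, here is a rough sketch of what the recursive strategy would look like, assuming the X_train, y_train, X_test and horizon objects defined below (note that this is not what we do in this post):

# recursive strategy: one model, forecasts fed back as inputs
fit <- randomForest(X_train, y_train)
x <- X_test  # the most recent lagged values
preds <- numeric(horizon)
for (h in seq_len(horizon)) {
  preds[h] <- predict(fit, x)
  x <- c(preds[h], x[-length(x)])  # newest value in front, oldest dropped
}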

How to implement the direct forecasting strategy is nicely demonstrated in the aforementioned post, so I won't rehash it here. If you're short on time, the tl;dr is this: we use the direct forecasting strategy to generate multi-step-ahead forecasts. This entails training one model per forecast horizon and progressively reshaping the training data to reflect the time distance between observations.

y_train <- tax_ts_mbd[, 1] # the target
X_train <- tax_ts_mbd[, -1] # everything but the target

y_test <- window(tax_ts, start = c(2018, 1), end = c(2018, 12)) # the year 2018
X_test <- tax_ts_mbd[nrow(tax_ts_mbd), c(1:lag_order)] # the test set consisting
# of the six most recent values (we have six lags) of the training set. It's the
# same for all models.

If you've followed me here, kudos, we're almost done! Now we get to the fun part: letting our random forest loose on this data. We train the model in a loop, where each iteration fits one model, one for each forecast horizon h = 1, ..., H.

The random forest forecast: things are looking good

Below I’m using the random forest straight out of the box, not even bothering tuning it (a topic to which I’d like to dedicate a post in the future). It may seem lazy (and probably is), but I stripped the process down to its bare bones in the hope of showing most clearly what is going on here.

forecasts_rf <- numeric(horizon)

for (i in 1:horizon){
  # set seed
  set.seed(2019)

  # fit the model
  fit_rf <- randomForest(X_train, y_train)

  # predict using the test set
  forecasts_rf[i] <- predict(fit_rf, X_test)

  # here is where we repeatedly reshape the training data to reflect the time distance
  # corresponding to the current forecast horizon.
  y_train <- y_train[-1] 

  X_train <- X_train[-nrow(X_train), ] 
}

Alright, the loop’s done. We just trained twelve models and got twelve forecasts. Since we transformed our time series before training, we need to transform the forecasts back.

Back to the former or how we get forecasts on the original scale

As we took the log transform earlier, the back-transformation is rather straightforward. We roll back the process from the inside out, i.e., we first reverse the differencing and then the log transform. We do this by exponentiating the cumulative sum of our transformed forecasts and multiplying the result by the last observation of our time series. In other words, we calculate:

    \[y_{t+H} = y_{t}\exp\left(\sum_{h=1}^{H} \tilde{y}_{t+h}\right)\]

# calculate the exp term
exp_term <- exp(cumsum(forecasts_rf))

# extract the last observation from the time series (y_t)
last_observation <- as.vector(tail(tax_ts_org, 1))

# calculate the final predictions
backtransformed_forecasts <- last_observation * exp_term

# convert to ts format
y_pred <- ts(
  backtransformed_forecasts,
  start = c(2018, 1),
  frequency = 12
)

# add the forecasts to the original tibble
tax_tbl <- tax_tbl %>% 
  mutate(Forecast = c(rep(NA, length(tax_ts_org)), y_pred))

# visualize the forecasts
plot_fc <- tax_tbl %>% 
  ggplot(aes(x = Date)) +
  geom_line(aes(y = Value / 1000)) +
  geom_line(aes(y = Forecast / 1000), color = "blue") +
  theme_minimal() +
  labs(
    title = "Forecast of the German Wage and Income Tax for the Year 2018",
    x = "Year",
    y = "Euros"
  )

accuracy(y_pred, y_test)
             ME     RMSE      MAE      MPE     MAPE     ACF1 Theil's U
Test set 198307.5 352789.9 238652.6 2.273785 2.607773 0.257256 0.0695694

It looks like our forecast is pretty good! We achieved a MAPE of 2.6 percent. But since one should never rely on accuracy metrics alone, let's quickly compute a simple benchmark, the seasonal naive model. That's a low hurdle to pass, so if our model doesn't beat it, into the bin it goes.

benchmark <- forecast(snaive(tax_ts_org), h = horizon)

tax_ts %>% 
  autoplot() +
  autolayer(benchmark, PI = FALSE)

accuracy(benchmark, y_test)
                   ME     RMSE      MAE      MPE     MAPE     MASE      ACF1 Theil's U
Training set 216938.7 470874.8 388618.5 2.852533 6.648711 1.000000 0.6303982        NA
Test set     485006.8 495096.8 485006.8 5.507943 5.507943 1.248028 0.1382461 0.0926723

The error metrics are much higher here, so it’s safe to say our random forest did a good job.

Where do we go from here? Well, we haven't tried it yet, but we might further improve our forecasts with some hyperparameter tuning. Also, just between the two of us: maybe random forest is pretty good, but not the best model for the job. We could try other models, or an ensemble of models.

If you take away one thing from this post today, let it be this: We can do effective time series forecasting with machine learning without whipping out big guns like recurrent neural networks. All it takes is a little pre- and post-processing. So why not include random forest in your arsenal the next time you do forecasting (or procrastinate on doing your taxes)? 🙂

UPDATE: Hi, some people have been asking me for the data, which I realize is a bit hard to come by (destatis isn't exactly what you'd call user-friendly). So here's a link to the STATWORX GitHub where you can download the CSV file.

References

  • Wyner, Abraham J., et al. "Explaining the success of AdaBoost and random forests as interpolating classifiers." The Journal of Machine Learning Research 18.1 (2017): 1558-1590.

It’s March 15th and that means it’s World Sleep Day (WSD). Don’t snooze off just yet! We’re about to check out a package that can make using R and Python a dream. It’s called reticulate and we’ll use it to train a Support Vector Machine for a simple classification task.

What's WSD got to do with data science?

I discovered WSD on the train to STATWORX HQ in Frankfurt. If you’ve travelled by train before, you know that the only thing worse than disgruntled staff is unconscious staff. It hasn’t happened to me yet, but hey let’s face it: the personnel shortage of Deutsche Bahn isn’t getting better any time soon.

This brings me to the topic of today’s blog post: the sleeping habits of railroad workers. For this, I pulled a dataset from the US Federal Railroad Administration (FRA). The FRA conducted a survey on the work and sleep schedules of its train dispatchers. You can find it here.

As the title suggests, we’re going to use both R and Python to predict whether a dispatcher was diagnosed with a sleeping disorder. Before we get started, a warning to all R and Python purists out there: the example below is a bit contrived. Everything we’re about to do can be done entirely in either one of the languages.

So why bother using R and Python?

As data scientists, we ideally have a firm grasp on both R and Python. After all, as my colleague Fran will discuss in an upcoming post, both have unique strengths that we can use to our advantage. Personally, I love R for the tidyverse and ggplot2, Python for its unified machine learning (ML) API scikit-learn.

Alright, point taken you might say, but it’s a hassle to switch back and forth. It’s not worth it, and until a few years ago you would have been right! However, nowadays there are some cool ways to get the best of both worlds. If you’re coming from the R community look no further than reticulate!

The reticulate package gives you a set of tools to use both R and Python interactively within an R session. Say you’re working in Python and need a specialized statistical model from an R package – or you’re working in R and want to access Python’s ML capabilities. Well, you’ve come to the right place. This package is a godsend for tasks where it’s required, or simply more convenient, to use both languages as part of your workflow.
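To give you a taste, here's a minimal sketch of calling Python from an R session with reticulate (this assumes NumPy is installed in the linked Python environment):

library(reticulate)

np <- import("numpy")        # import a Python module into R
np$linspace(0, 1, num = 5L)  # call its functions with R syntax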

Now, there are different ways to use R and Python interactively and I encourage you to check reticulate’s github site to see which one suits you best. In this post, we’re going through a simple example of how to use Python modules within an R Notebook (i.e. Markdown document).

How to get started?

If you're working on a Mac, Python comes preinstalled on your system. Irrespective of the machine you're on, though, I recommend downloading the Anaconda distribution so that you have everything you need. One more note: you need RStudio's newest preview version 1.2 for this to work. That's it!

Let’s open an R Notebook, insert an R chunk and (install and) load the reticulate library. Immediately after loading reticulate, use the use_python() command with the appropriate path. Placing it later in the script causes problems for some people. I specified my path as follows (note: yours may differ):

library(tidyverse)
library(recipes)
library(reticulate)
use_python("/anaconda3/bin/python")

We’re good to go. Let’s read in the data and perform some transformations with dplyr. This is mostly recoding work. As you can see in the select command, we pick a handful of variables like sex, age, caffeine consumption, health and stress to predict whether a railroad dispatcher was diagnosed with a sleeping disorder.

1) Read in the data

data <- readxl::read_xls("sleep.xls")

sleep <- data %>%
  select(
      Diagnosed_Sleep_disorder, Age_Group, Sex, Total_years_dispatcher,
      Total_years_present_job, Marital_Status, Childrendependents,
      Children_under_2_yrs, Caff_Beverages, Sick_Days_in_last_year,
      Health_status, Avg_Work_Hrs_Week, FRA_report, Phys_Drained,
      Mentally_Drained, Alert_at_Work, Job_Security
  ) %>%
  rename_all(tolower) %>%
  mutate_if(is.character, as.numeric) %>%
  mutate_at(vars(diagnosed_sleep_disorder, sex, caff_beverages, fra_report),
            ~ -(. - 2)) %>%
  mutate_at(vars(marital_status), ~ (. - 1)) %>%
  drop_na()

Now that we have the variables we want, it’s time to get the data into the right shape. Here’s one more reason to love R: the recipes package. If you’re not familiar with it, check it out. You may find its workflow a bit peculiar at first, but once you get used to it, it makes data preparation a breeze.

What we’re doing here is straightforward. First, we split the data into a training and test set. Next, we specify a data preparation recipe, which consists of three steps: one hot encoding factor predictors, standardising numeric predictors and down-sampling the data. One hot encoding and standardising ensure that the Support Vector Machine algorithm works properly. Down-sampling is a counter-measure against the class imbalance in our dataset.

2) Prepare the data

numeric_variables <- c(
  "total_years_dispatcher", "total_years_present_job",
  "childrendependents", "children_under_2_yrs", 
  "sick_days_in_last_year", "avg_work_hrs_week"
)

factor_variables <- setdiff(colnames(sleep), numeric_variables)

sleep <- mutate_at(sleep, vars(factor_variables), as.factor)

set.seed(2019)
index <- sample(1:nrow(sleep), floor(nrow(sleep) * .75))

sleep_train <- sleep[index, ]
sleep_test <- sleep[-index, ]

recipe_formula <- recipes::recipe(diagnosed_sleep_disorder ~ ., sleep_train)

recipe_steps <- recipe_formula %>%
  recipes::step_dummy(factor_variables, -all_outcomes(), one_hot = TRUE) %>%
  recipes::step_downsample(diagnosed_sleep_disorder) %>%
  recipes::step_center(numeric_variables) %>%
  recipes::step_scale(numeric_variables)

recipe_prep <- recipes::prep(recipe_steps, sleep_train, retain = TRUE)

training_data <- juice(recipe_prep)
testing_data <- bake(recipe_prep, sleep_test)

Now comes the part where Python shines: its unified ML library scikit-learn. Let’s go ahead and import the Support Vector Machine (SVM) classifier as well as some other modules to tune and evaluate our model.

SVM is a supervised learning algorithm. It works by finding a hyperplane in an N-dimensional space that separates two (or more) classes as cleanly as possible. More technically, SVM maximizes the margin, i.e., the distance between the separating hyperplane and the closest data points. This is why SVM is also called a maximum margin estimator.

SVM is mostly used for classification, but it can do regression too. The upside is that it works with high-dimensional data and different kernel functions, meaning it can flexibly adapt to different types of data. Its downside is that computation becomes costly with large datasets and that it reacts sensitively to hyperparameters. Still, for some applications SVM performs quite competitively.

Combining SVM with kernels allows us to project our data into a higher-dimensional space. The point of this is to make the classes better separable. In our example here, we’ll use a simple linear and a radial basis function kernel. The latter can map the predictor space into infinite dimensions.

3) Train the model

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_train = r.training_data['diagnosed_sleep_disorder']
X_train = r.training_data.drop('diagnosed_sleep_disorder', axis = 1)

y_test = r.testing_data['diagnosed_sleep_disorder']
X_test = r.testing_data.drop('diagnosed_sleep_disorder', axis = 1)

clf = svm.SVC(kernel = 'linear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

clf = svm.SVC(kernel = 'rbf')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

We can tune an SVM with the help of two parameters: C and gamma. C is a regularisation parameter that represents how much error we're willing to tolerate. The higher C, the stricter we are, and the more exactly the SVM will try to fit the data by picking a hyperplane with smaller margins. As a consequence, a higher C carries a higher risk of overfitting. For the radial basis function kernel, we can also tune gamma, which determines the size of the kernel and, with it, the decision boundary. The higher gamma, the closer the decision boundary is to single training examples. A higher gamma can thus lead to a better fit, yet likewise increases the risk of overfitting.

4) Tune the model

param_grid = [{'C': [0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10],
               'kernel': ['rbf', 'linear']}]

grid = GridSearchCV(svm.SVC(), param_grid, cv = 5, scoring = 'balanced_accuracy')

grid.fit(X_train, y_train)

print(grid.best_params_)

From the cross-validation procedure, it appears that a gamma of 0.1 and a C of 0.01 are optimal in combination with a radial basis function kernel. So, let’s train our model with this kernel and the hyperparameter values.

5) Evaluate the accuracy of the model

clf = grid.best_estimator_
y_pred = clf.predict(X_test)

print('Confusion Matrix:\n\n', confusion_matrix(y_test, y_pred))
print('\nClassification Report:\n\n', classification_report(y_test, y_pred))
print('\nTraining Set Accuracy: {:.2%}'.format(clf.score(X_train, y_train)))
print('\nTest Set Accuracy: {:.2%}'.format(clf.score(X_test, y_test)))

conf_mat = confusion_matrix(y_test, y_pred)

sns.heatmap(conf_mat, square = True, annot = True, fmt = 'g',
            cbar = False, cmap = 'viridis')
plt.xlabel('predicted')
plt.ylabel('observed')
plt.show()

In this case, we achieve a training set accuracy of 93 per cent and a test set accuracy of 74 per cent, which suggests that some overfitting has occurred. To achieve higher accuracy (or better: sensitivity/recall), we could experiment with different kernels and/or hyperparameter values. But that I'll leave up to you. With reticulate, you now have a tool to get the best of both R and Python.

[author class=“mtl“ title=“About the author“]

It’s March 15th and that means it’s World Sleep Day (WSD). Don’t snooze off just yet! We’re about to check out a package that can make using R and Python a dream. It’s called reticulate and we’ll use it to train a Support Vector Machine for a simple classification task.

What’s WSD got to do with data science?

I discovered WSD on the train to STATWORX HQ in Frankfurt. If you’ve travelled by train before, you know that the only thing worse than disgruntled staff is unconscious staff. It hasn’t happened to me yet, but hey, let’s face it: Deutsche Bahn’s staff shortage isn’t getting better any time soon.

This brings me to the topic of today’s blog post: the sleeping habits of railroad workers. For this, I pulled a dataset from the US Federal Railroad Administration (FRA). The FRA conducted a survey on the work and sleep schedules of its train dispatchers. You can find it here.

As the title suggests, we’re going to use both R and Python to predict whether a dispatcher was diagnosed with a sleeping disorder. Before we get started, a warning to all R and Python purists out there: the example below is a bit contrived. Everything we’re about to do can be done entirely in either one of the languages.

So why bother using R and Python?

As data scientists, we ideally have a firm grasp on both R and Python. After all, as my colleague Fran will discuss in an upcoming post, both have unique strengths that we can use to our advantage. Personally, I love R for the tidyverse and ggplot2, and Python for scikit-learn, its unified machine learning (ML) API.

Alright, point taken, you might say, but it’s a hassle to switch back and forth; it’s just not worth it. Until a few years ago, you would have been right! Nowadays, however, there are some cool ways to get the best of both worlds. If you’re coming from the R community, look no further than reticulate!

The reticulate package gives you a set of tools to use both R and Python interactively within an R session. Say you’re working in Python and need a specialized statistical model from an R package – or you’re working in R and want to access Python’s ML capabilities. Well, you’ve come to the right place. This package is a godsend for tasks where it’s required, or simply more convenient, to use both languages as part of your workflow.

Now, there are different ways to use R and Python interactively, and I encourage you to check reticulate’s GitHub site to see which one suits you best. In this post, we’re going through a simple example of how to use Python modules within an R Notebook (i.e. an R Markdown document).
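To give you a feel for the interplay before we dive in, here’s a minimal sketch (the names my_df and n_rows are placeholders, not part of the analysis below): whatever you create in an R chunk is available to Python chunks through reticulate’s r object, and whatever you create in a Python chunk is available to R through py$.

```{r}
my_df <- data.frame(x = 1:3, y = c("a", "b", "c"))  # a plain R data frame
```

```{python}
n_rows = r.my_df.shape[0]  # the data frame arrives as a pandas DataFrame
```

```{r}
py$n_rows  # Python objects travel back to R via py$
```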

How to get started?

If you’re working on a Mac, Python comes preinstalled on your system. Irrespective of the machine you’re on, though, I recommend downloading the Anaconda distribution so you have everything you need. One more note: you need RStudio’s newest preview version 1.2 for this to work. That’s it!

Let’s open an R Notebook, insert an R chunk and (install and) load the reticulate library. Immediately after loading reticulate, use the use_python() command with the appropriate path. Placing it later in the script causes problems for some people. I specified my path as follows (note: yours may differ):

library(tidyverse)
library(recipes)
library(reticulate)
use_python("/anaconda3/bin/python")

We’re good to go. Let’s read in the data and perform some transformations with dplyr. This is mostly recoding work. As you can see in the select command, we pick a handful of variables like sex, age, caffeine consumption, health and stress to predict whether a railroad dispatcher was diagnosed with a sleeping disorder.

1) Read in the data

data <- readxl::read_xls("sleep.xls")

sleep <- data %>%
  select(
      Diagnosed_Sleep_disorder, Age_Group, Sex, Total_years_dispatcher,
      Total_years_present_job, Marital_Status, Childrendependents,
      Children_under_2_yrs, Caff_Beverages, Sick_Days_in_last_year,
      Health_status, Avg_Work_Hrs_Week, FRA_report, Phys_Drained,
      Mentally_Drained, Alert_at_Work, Job_Security
  ) %>%
  rename_all(tolower) %>%
  mutate_if(is.character, as.numeric) %>%
  mutate_at(vars(diagnosed_sleep_disorder, sex, caff_beverages, fra_report),
            ~ -(. - 2)) %>%
  mutate_at(vars(marital_status), ~ (. - 1)) %>%
  drop_na()

Now that we have the variables we want, it’s time to get the data into the right shape. Here’s one more reason to love R: the recipes package. If you’re not familiar with it, check it out. You may find its workflow a bit peculiar at first, but once you get used to it, it makes data preparation a breeze.

What we’re doing here is straightforward. First, we split the data into a training and test set. Next, we specify a data preparation recipe, which consists of three steps: one hot encoding factor predictors, standardising numeric predictors and down-sampling the data. One hot encoding and standardising ensure that the Support Vector Machine algorithm works properly. Down-sampling is a counter-measure against the class imbalance in our dataset.

2) Prepare the data

numeric_variables <- c(
  "total_years_dispatcher", "total_years_present_job",
  "childrendependents", "children_under_2_yrs", 
  "sick_days_in_last_year", "avg_work_hrs_week"
)

factor_variables <- setdiff(colnames(sleep), numeric_variables)

sleep <- mutate_at(sleep, vars(factor_variables), as.factor)

set.seed(2019)
index <- sample(1:nrow(sleep), floor(nrow(sleep) * .75))

sleep_train <- sleep[index, ]
sleep_test <- sleep[-index, ]

recipe_formula <- recipes::recipe(diagnosed_sleep_disorder ~ ., sleep_train)

recipe_steps <- recipe_formula %>%
  recipes::step_dummy(factor_variables, -all_outcomes(), one_hot = TRUE) %>%
  recipes::step_downsample(diagnosed_sleep_disorder) %>%
  recipes::step_center(numeric_variables) %>%
  recipes::step_scale(numeric_variables)

recipe_prep <- recipes::prep(recipe_steps, sleep_train, retain = TRUE)

training_data <- juice(recipe_prep)
testing_data <- bake(recipe_prep, sleep_test)

Now comes the part where Python shines: its unified ML library scikit-learn. Let’s go ahead and import the Support Vector Machine (SVM) classifier as well as some other modules to tune and evaluate our model.

SVM is a supervised learning algorithm. It works by finding a hyperplane in an N-dimensional space, which separates two (or more) classes as cleanly as possible. More technically, SVM maximizes the margin or the distance between the separating hyperplane and the closest data points. This is why SVM is also called a maximum margin estimator.
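If you like to see this in formulas (standard textbook notation, nothing specific to this post): for training points $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, the hard-margin SVM solves

$$\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 \;\; \text{for all } i.$$

The resulting margin is $2 / \lVert w \rVert$, so minimising $\lVert w \rVert$ is exactly what maximises the margin.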

SVM is mostly used for classification, but it can do regression too. The upside is that it works with high-dimensional data and different kernel functions, meaning it can flexibly adapt to different types of data. Its downside is that computation becomes costly with large datasets and that it reacts sensitively to hyperparameters. Still, for some applications SVM performs quite competitively.

Combining SVM with kernels allows us to project our data into a higher-dimensional space. The point of this is to make the classes better separable. In our example here, we’ll use a simple linear and a radial basis function kernel. The latter can map the predictor space into infinite dimensions.
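For reference, the radial basis function kernel in scikit-learn’s parametrisation is

$$K(x, x') = \exp\!\left(-\gamma \,\lVert x - x' \rVert^2\right),$$

where the kernel value stands in for the inner product of an implicit, infinite-dimensional feature mapping, so the projection itself never has to be computed.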

3) Train the model

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# objects created in R chunks are available to Python via reticulate's `r` object
y_train = r.training_data['diagnosed_sleep_disorder']
X_train = r.training_data.drop('diagnosed_sleep_disorder', axis = 1)

y_test = r.testing_data['diagnosed_sleep_disorder']
X_test = r.testing_data.drop('diagnosed_sleep_disorder', axis = 1)

clf = svm.SVC(kernel = 'linear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

clf = svm.SVC(kernel = 'rbf')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

We can tune an SVM with the help of two parameters: C and gamma. C is a regularisation parameter that represents how much error we’re willing to tolerate. The higher C, the stricter we are, the more exactly SVM will try to fit the data by picking a hyperplane with smaller margins. As a consequence, higher C’s carry a higher risk of overfitting. For the radial basis function kernel, we can also tune gamma which determines the size of the kernel and with it the decision boundary. The higher gamma, the closer the decision boundary is to single training examples. A higher gamma can lead to a better model, yet likewise increases the risk of overfitting.
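To tie this back to the formulas above: C is the weight on the slack variables $\xi_i$ in the soft-margin objective (again standard notation),

$$\min_{w,\,b,\,\xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,$$

so a large C makes margin violations expensive and forces a tighter fit. And gamma is the $\gamma$ in the RBF kernel formula shown earlier: the larger it is, the smaller the kernel’s radius of influence around each training point.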

4) Tune the model

param_grid = [{'C': [0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10],
               'kernel': ['rbf', 'linear']}]

grid = GridSearchCV(svm.SVC(), param_grid, cv = 5, scoring = 'balanced_accuracy')

grid.fit(X_train, y_train)

print(grid.best_params_)

From the cross-validation procedure, it appears that a gamma of 0.1 and a C of 0.01 are optimal in combination with a radial basis function kernel. So, let’s train our model with this kernel and these hyperparameter values.

5) Evaluate the accuracy of the model

clf = grid.best_estimator_
y_pred = clf.predict(X_test)

print('Confusion Matrix:\n\n', confusion_matrix(y_test, y_pred))
print('\nClassification Report:\n\n', classification_report(y_test, y_pred))

# score() returns a fraction, so multiply by 100 to report percentages
print('\nTraining Set Accuracy: {:.2f}%'.format(clf.score(X_train, y_train) * 100))
print('\nTest Set Accuracy: {:.2f}%'.format(clf.score(X_test, y_test) * 100))

conf_mat = confusion_matrix(y_test, y_pred)

sns.heatmap(conf_mat, square = True, annot = True, fmt = 'g',
            cbar = False, cmap = 'viridis')
plt.xlabel('predicted')
plt.ylabel('observed')
plt.show()

In this case, we achieve a training set accuracy of 93 per cent and a test set accuracy of 74 per cent. This suggests that some overfitting has occurred. To achieve a higher accuracy (or better: sensitivity/recall), we could experiment with different kernels and/or hyperparameter values. But I’ll leave that up to you.
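If you do take that up, one possible starting point (a sketch, not part of the original analysis; it assumes the positive class is encoded as 1) would be to let the grid search optimise for recall instead of balanced accuracy:

# optimise the grid search for recall (sketch; assumes positive class is 1)
grid = GridSearchCV(svm.SVC(), param_grid, cv = 5, scoring = 'recall')
grid.fit(X_train, y_train)
print(grid.best_params_)

With reticulate, you now have a tool to get the best of both R and Python.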
