
Introduction

When working on data science projects in R, exporting internal R objects as files on your hard drive is often necessary to facilitate collaboration. Here at STATWORX, we regularly export R objects (such as outputs of a machine learning model) as .RDS files and put them on our internal file server. Our co-workers can then pick them up for further usage down the line of the data science workflow (such as visualizing them in a dashboard together with inputs from other colleagues).

Over the last couple of months, I have been working a lot with RDS files and noticed a crucial shortcoming: The base R saveRDS function offers no way to archive an existing file of the same name before it is overwritten. In this blog post, I will explain why such archiving can be very useful by first introducing the basics of serialization and then showcasing my proposed solution: a wrapper function around the existing base R serialization framework.

Be wary of silent file replacements!

In base R, you can easily export any object from the environment to an RDS file with:

saveRDS(object = my_object, file = "path/to/dir/my_object.RDS")

However, including such a line somewhere in your script can carry unintended consequences: When calling saveRDS multiple times with identical file names, R silently overwrites existing, identically named .RDS files in the specified directory. If the object you are exporting is not what you expect it to be — for example due to some bug in newly edited code — your working copy of the RDS file is simply overwritten in-place. Needless to say, this can prove undesirable.

If you are familiar with this pitfall, you have probably forestalled such troublesome side effects by commenting out the respective lines, carefully checking each time whether the R object looks fine, and then executing the line manually. But even when there is nothing wrong with the R object you seek to export, it can make sense to retain an archived copy of previous RDS files: Think of a dataset you run through a data prep script, and then the raw data gets updated, or you decide to change something in the data prep (like removing a variable). In such cases, you may wish to archive the existing copy, especially for complex data prep pipelines with long execution times.

Don’t get tangled up in manual renaming

You could manually move or rename the existing file each time you plan to create a new one, but that’s tedious, error-prone, and does not allow for unattended execution and scalability. For this reason, I set out to write a carefully designed wrapper function around the existing saveRDS call, which is pretty straightforward: As a first step, it checks if the file you attempt to save already exists in the specified location. If it does, the existing file is renamed/archived (with customizable options), and the „updated“ file will be saved under the originally specified name.

This approach has the crucial advantage that the existing code that depends on the file name remaining identical (such as readRDS calls in other scripts) will continue to work with the latest version without any need for adjustment! No more saving your objects as „models_2020-07-12.RDS", then combing through the other scripts to replace the file name, only to repeat this process the next day. At the same time, an archived copy of the — otherwise overwritten — file will be kept.

What are RDS files anyways?

Before I walk you through my proposed solution, let’s first examine the basics of serialization, the underlying process behind high-level functions like saveRDS.

Simply speaking, serialization is the „process of converting an object into a stream of bytes so that it can be transferred over a network or stored in a persistent storage.“

Stack Overflow: What is serialization?

There is also a low-level R interface, serialize, which you can use to explore (un-)serialization first-hand: Simply fire up R and run something like serialize(object = c(1, 2, 3), connection = NULL). This call serializes the specified vector and prints the output right to the console. The result is an odd-looking raw vector, with each byte separately represented as a pair of hex digits. Now let’s see what happens if we revert this process:

s <- serialize(object = c(1, 2, 3), connection = NULL)
print(s)
# >  [1] 58 0a 00 00 00 03 00 03 06 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 0e 00
# > [29] 00 00 03 3f f0 00 00 00 00 00 00 40 00 00 00 00 00 00 00 40 08 00 00 00 00 00 00

unserialize(s)
# > [1] 1 2 3

The length of this raw vector increases rapidly with the complexity of the stored information: For instance, serializing the famous, although not too large, iris dataset results in a raw vector consisting of 5959 pairs of hex digits!
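To see this for yourself, you can check the length of the serialized raw vector directly (the exact byte count will vary with your R version and platform):

# number of bytes in the serialized representation of iris
length(serialize(object = iris, connection = NULL))
#> [1] 5959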

Besides the already mentioned saveRDS function, there is also the more generic save function. The former saves a single R object to a file. It allows us to restore the object from that file (with the counterpart readRDS), possibly under a different variable name: That is, you can assign the contents of a call to readRDS to another variable. By contrast, save allows for saving multiple R objects, but when reading back in (with load), they are simply restored in the environment under the object names they were saved with. (That’s also what happens automatically when you answer „Yes“ to the notorious question of whether to „save the workspace image to ~/.RData“ when quitting RStudio.)
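A minimal illustration of the difference (with throwaway objects and file names chosen purely for demonstration):

x <- data.frame(a = 1:3)
z <- letters[1:5]

# saveRDS stores a single object; readRDS returns it,
# so you can assign the result to any name you like
saveRDS(x, "x.RDS")
y <- readRDS("x.RDS")

# save can store several objects at once; load restores them
# into the environment under their original names
save(x, z, file = "objects.RData")
rm(x, z)
load("objects.RData")  # x and z reappear under their old names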

Creating the archives

Obviously, it’s great to have the possibility to save internal R objects to a file and then be able to re-import them in a clean session or on a different machine. This is especially true for the results of long and computationally heavy operations such as fitting machine learning models. But as we learned earlier, one wrong keystroke can potentially erase that one precious 3-hour-fit fine-tuned XGBoost model you ran and carefully saved to an RDS file yesterday.

Digging into the wrapper

So, how did I go about fixing this? Let’s take a look at the code. First, I define the arguments and their defaults: The object and file arguments are taken directly from the wrapped function, the remaining arguments allow the user to customize the archiving process: Append the archive file name with either the date the original file was archived or last modified, add an additional timestamp (not just the calendar date), or save the file to a dedicated archive directory. For more details, please check the documentation here. I also include the ellipsis ... for additional arguments to be passed down to saveRDS. Additionally, I do some basic input handling (not included here).

save_rds_archive <- function(object,
                             file = "",
                             archive = TRUE,
                             last_modified = FALSE,
                             with_time = FALSE,
                             archive_dir_path = NULL,
                             ...) {

The main body of the function is basically a series of if/else statements. I first check if the archive argument (which controls whether the file should be archived in the first place) is set to TRUE, and then if the file we are trying to save already exists (note that „file“ here actually refers to the whole file path). If it does, I call the internal helper function create_archived_file, which eliminates redundancy and allows for concise code.

if (archive) {

    # check if file exists
    if (file.exists(file)) {

      archived_file <- create_archived_file(file = file,
                                            last_modified = last_modified,
                                            with_time = with_time)

Composing the new file name

In this function, I create the new name for the file which is to be archived, depending on user input: If last_modified is set, then the mtime of the file is accessed. Otherwise, the current system date/time (= the date of archiving) is taken instead. Then the spaces and special characters are replaced with underscores, and, depending on the value of the with_time argument, the actual time information (not just the calendar date) is kept or not.

To make it easier to identify directly from the file name what exactly (date of archiving vs. date of modification) the indicated date/time refers to, I also add appropriate information to the file name. Then I save the file extension for easier replacement (note that „.RDS“, „.Rds“, and „.rds“ are all valid file extensions for RDS files). Lastly, I replace the current file extension with a concatenated string containing the type info, the new date/time suffix, and the original file extension. Note here that I add a „$“ sign to the regex which is to be matched by gsub to only match the end of the string: If I did not do that and the file name would be something like „my_RDS.RDS“, then both matches would be replaced.

# create_archived_file.R

create_archived_file <- function(file, last_modified, with_time) {

  # create main suffix depending on type
  suffix_main <- ifelse(last_modified,
                        as.character(file.info(file)$mtime),
                        as.character(Sys.time()))

  if (with_time) {

    # create clean date-time suffix
    suffix <- gsub(pattern = " ", replacement = "_", x = suffix_main)
    suffix <- gsub(pattern = ":", replacement = "-", x = suffix)

    # add "at" between date and time
    suffix <- paste0(substr(suffix, 1, 10), "_at_", substr(suffix, 12, 19))

  } else {

    # create date suffix
    suffix <- substr(suffix_main, 1, 10)

  }

  # create info to paste depending on type
  type_info <- ifelse(last_modified,
                      "_MODIFIED_on_",
                      "_ARCHIVED_on_")

  # get file extension (could be any of "RDS", "Rds", "rds", etc.)
  ext <- paste0(".", tools::file_ext(file))

  # replace extension with suffix
  archived_file <- gsub(pattern = paste0(ext, "$"),
                        replacement = paste0(type_info,
                                             suffix,
                                             ext),
                        x = file)

  return(archived_file)

}
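To illustrate why the „$" anchor matters, consider this simplified example with a shortened replacement string:

# without the anchor, the pattern ".RDS" also matches "_RDS"
# (the dot is a regex wildcard), so both occurrences are replaced
gsub(pattern = ".RDS", replacement = "_ARCHIVED.RDS", x = "my_RDS.RDS")
#> [1] "my_ARCHIVED.RDS_ARCHIVED.RDS"

# anchored to the end of the string, only the file extension is replaced
gsub(pattern = ".RDS$", replacement = "_ARCHIVED.RDS", x = "my_RDS.RDS")
#> [1] "my_RDS_ARCHIVED.RDS"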

Archiving the archives?

By way of example, with last_modified = FALSE and with_time = TRUE, this function would turn the character file name „models.RDS“ into „models_ARCHIVED_on_2020-07-12_at_11-31-43.RDS“. However, this is just a character vector for now — the file itself is not renamed yet. For this, we need to call the base R file.rename function, which provides a direct interface to your machine’s file system. I first check, however, whether a file with the same name as the newly created archived file string already exists: This could well be the case if one appends only the date (with_time = FALSE) and calls this function several times per day (or potentially on the same file if last_modified = TRUE).

Somehow, we are back to the old problem in this case. However, I decided that it was not a good idea to archive files that are themselves archived versions of another file since this would lead to too much confusion (and potentially too much disk space being occupied). Therefore, only the most recent archived version will be kept. (Note that if you still want to keep multiple archived versions of a single file, you can set with_time = TRUE. This will append a timestamp to the archived file name up to the second, virtually eliminating the possibility of duplicated file names.) A warning is issued, and then the already existing archived file will be overwritten with the current archived version.

The last puzzle piece: Renaming the original file

To do this, I call the file.rename function, renaming the „file" originally passed in the user call to the string returned by the helper function. file.rename always returns a boolean indicating whether the operation succeeded, which I save to a variable temp to inspect later. Under some circumstances, the renaming process may fail, for instance due to missing permissions or OS-specific restrictions. We set up a CI pipeline with GitHub Actions that continuously tests our code on Windows, Linux, and macOS machines with different versions of R, and so far we haven't run into any problems. Still, it's better to provide built-in checks.

It’s an error! Or is it?

The problem here is that, when renaming the file on disk fails, file.rename merely raises a warning, not an error. Since the causes of such warnings most likely originate in the local file system, there is no point in continuing the function if the renaming failed. That's why I wrapped the call into a tryCatch that captures the warning message and passes it to the stop call, which then terminates the function with the appropriate message.

Just to be on the safe side, I check the value of the temp variable, which should be TRUE if the renaming succeeded, and also check if the archived version of the file (that is, the result of our renaming operation) exists. If both of these conditions hold, I simply call saveRDS with the original specifications (now that our existing copy has been renamed, nothing will be overwritten if we save the new file with the original name), passing along further arguments with ....

        if (file.exists(archived_file)) {
          warning("Archived copy already exists - will overwrite!")
        }

        # rename existing file with the new name
        # save return value of the file.rename function
        # (returns TRUE if successful) and wrap in tryCatch
        temp <- tryCatch({file.rename(from = file,
                                      to = archived_file)
        },
        warning = function(e) {
          stop(e)
        })

      }

      # check return value and if archived file exists
      if (temp & file.exists(archived_file)) {
        # then save new file under specified name
        saveRDS(object = object, file = file, ...)
      }

    }

These code snippets represent the cornerstones of my function. I also skipped some portions of the source code for reasons of brevity, chiefly the creation of the „archive directory“ (if one is specified) and the process of copying the archived file into it. Please refer to our GitHub for the complete source code of the main and the helper function.
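For reference, here is a minimal sketch of what that omitted logic could look like in base R (the actual implementation in helfRlein may differ in its details):

if (!is.null(archive_dir_path)) {

  # create the archive directory (relative to the original file) if needed
  archive_dir <- file.path(dirname(file), archive_dir_path)
  if (!dir.exists(archive_dir)) {
    dir.create(archive_dir, recursive = TRUE)
  }

  # copy the archived file into the archive directory
  # and remove it from its original location
  file.copy(from = archived_file,
            to = file.path(archive_dir, basename(archived_file)))
  file.remove(archived_file)

}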

Finally, to illustrate, let’s see what this looks like in action:

x <- 5
y <- 10
z <- 20

## save to RDS
saveRDS(x, "temp.RDS")
saveRDS(y, "temp.RDS")

## "temp.RDS" is silently overwritten with y
## previous version is lost
readRDS("temp.RDS")
#> [1] 10

save_rds_archive(z, "temp.RDS")
## current version is updated
readRDS("temp.RDS")
#> [1] 20

## previous version is archived
readRDS("temp_ARCHIVED_on_2020-07-12.RDS")
#> [1] 10

Great, how can I get this?

The function save_rds_archive is now included in the newly refactored helfRlein package (now available in version 1.0.0!) which you can install directly from GitHub:

# install.packages("devtools")
devtools::install_github("STATWORX/helfRlein")

Feel free to check out additional documentation and the source code there. If you have any inputs or feedback on how the function could be improved, please do not hesitate to contact me or raise an issue on our GitHub.

Conclusion

That’s it! No more manually renaming your precious RDS files — with this function in place, you can automate this tedious task and easily keep a comprehensive archive of previous versions. You will be able to take another look at that one model you ran last week (and then discarded again) in the blink of an eye. I hope you enjoyed reading my post — maybe the function will come in handy for you someday!

Cross-validation is a widely used technique to assess the generalization performance of a machine learning model. Here at STATWORX, we often discuss performance metrics and how to incorporate them efficiently in our data science workflow. In this blog post, I will introduce the basics of cross-validation, provide guidelines to tweak its parameters, and illustrate how to build it from scratch in an efficient way.


Model evaluation and cross-validation basics

Cross-validation is a model evaluation technique. The central intuition behind model evaluation is to figure out whether the trained model is generalizable, that is, whether the predictive power we observe during training can also be expected on unseen data. We could simply feed the model the data it was ultimately developed for, i.e., the data it is meant to predict. But then there is no way for us to know, or validate, whether the predictions are accurate.

Naturally, we would want some kind of benchmark of our model’s generalization performance before launching it into production. Therefore, the idea is to split the existing training data into an actual training set and a hold-out test partition which is not used for training and serves as the „unseen“ data. Since this test partition is, in fact, part of the original training data, we have a full range of „correct“ outcomes to validate against. We can then use an appropriate error metric, such as the Root Mean Squared Error (RMSE) or the Mean Absolute Percentage Error (MAPE) to evaluate model performance. However, the applicable evaluation metric has to be chosen with caution as there are pitfalls (as described in this blog post by my colleague Jan).

Many machine learning algorithms allow the user to specify hyperparameters, such as the number of neighbors in k-Nearest Neighbors or the number of trees in a Random Forest. Cross-validation can also be leveraged for „tuning“ the hyperparameters of a model by comparing the generalization error of different model specifications.
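For instance, with the caret package (which we will use again below), you could compare candidate values of a random forest's mtry hyperparameter by their cross-validated performance. A rough sketch with illustrative settings (requires the randomForest package):

library(caret)

set.seed(123)
fit.control <- caret::trainControl(method = "cv", number = 5)

rf.tuned <- caret::train(Species ~ .,
                         data = iris,
                         method = "rf",
                         trControl = fit.control,
                         tuneGrid = expand.grid(mtry = c(1, 2, 3)))

rf.tuned$bestTune  # the mtry value with the best cross-validated performance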

Common approaches to model evaluation

There are dozens of model evaluation techniques that are always trading off between variance, bias, and computation time. It is essential to know these trade-offs when evaluating a model, since choosing the appropriate technique highly depends on the problem and the data we observe. I will cover this topic once I have introduced two of the most common model evaluation techniques: the train-test-split and k-fold cross-validation. In the former, the training data is randomly split into a train and test partition (Figure 1), commonly with a significant part of the data being retained as the training set. Proportions of 70/30 or 80/20 are the most frequently used in the literature, though the exact ratio depends on the size of your data.
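As a minimal illustration with the built-in mtcars data and a simple linear model (chosen purely for demonstration), such a split and evaluation could look like this:

set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))  # 80/20 split

train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# hold-out error estimates
rmse <- sqrt(mean((test$mpg - pred)^2))
mape <- mean(abs((test$mpg - pred) / test$mpg)) * 100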

The drawback of this approach is that this one-time random split can end up partitioning the data into two very imbalanced parts, thus yielding biased generalization error estimates. That is especially critical if you only have limited data, as some features or patterns could end up entirely in the test part. In such a case, the model has no chance to learn them, and you will potentially underestimate its performance.

[Figure 1: Train-test split]

A more robust alternative is the so-called k-fold cross-validation (Figure 2). Here, the data is shuffled and then randomly partitioned into k folds. The main advantage over the train-test-split approach is that each of the k partitions is iteratively used as a test (i.e., validation) set, with the remaining k-1 parts serving as the training sets in this iteration. This process is repeated k times, such that every observation is included in both training and test sets. The appropriate error metric is then simply calculated as a mean of all of the k folds, giving the cross-validation error.

This is more of an extension of the train-test split than a completely new method: the train-test procedure is essentially repeated k times. However, even if k is chosen to be as low as k=2, i.e., you end up with only two parts, this approach is still superior to the train-test-split in that both parts are iteratively used for training, so the model has a chance to learn from all the data rather than just a random subset of it. Therefore, this approach usually results in more robust performance estimates.

[Figure 2: k-fold cross-validation]

Comparing the two figures above, you can see that a train-test split with a ratio of 80/20 is equivalent to one iteration of a 5-fold (that is, k=5) cross-validation where 4/5 of the data are retained for training, and 1/5 is held out for validation. The crucial difference is that in k-fold the validation set is shifted in each of the k iterations. Note that a k-fold cross-validation is more robust than merely repeating the train-test split k times: In k-fold CV, the partitioning is done once, and then you iterate through the folds, whereas in the repeated train-test split, you re-partition the data k times, potentially omitting some data from training.

Repeated CV and LOOCV

There are many flavors of k-fold cross-validation. For instance, you can do „repeated cross-validation" as well. In plain k-fold CV, once the data is divided into k folds, this partitioning stays fixed for the whole procedure, so the results still depend on that one random split. In repeated CV, you repeat the process of shuffling and randomly partitioning the data into k folds a certain number of times, so you are not risking that one unlucky partitioning distorts the estimate. You can then average over the resulting cross-validation errors of each run to get a global performance estimate.

Another special case of k-fold cross-validation is „Leave One Out Cross-Validation“ (LOOCV), where you set k = n. That is, in each iteration, you use a single observation from your data as the validation portion and the remaining n-1 observations as the training set. While this might sound like a hyper robust version of cross-validation, its usage is generally discouraged for two reasons:

  • First, it’s usually very computationally expensive. For most datasets used in applied machine learning, training your model n-1 times is neither desirable nor feasible (although it may be useful for very small datasets).
  • Second, even if you had the computational power (and time on your hands) to endure this process, critics of LOOCV argue from a statistical point of view that the resulting cross-validation error can exhibit high variance: Since your „validation set" consists of only one observation, the estimate depends heavily on the distribution of your data (and potential outliers) and can vary substantially.

In general, note that the performance of LOOCV is a somewhat controversial topic, both in the scientific literature and the broader machine learning community. Therefore, I encourage you to read up on this debate if you consider using LOOCV for estimating the generalization performance of your model (for example, check out this and related posts on StackExchange). As is often the case, the answer might end up being „it depends“. In any case, keep in mind the computational overhead of LOOCV, which is hard to deny (unless you have a tiny dataset).

The value of k and the bias-variance trade-off

If k=n is not (necessarily) the best choice, then how to find an appropriate value for k? It turns out that the answer to this question boils down to the notorious bias-variance trade-off. Why is that?

The value for k governs how many folds your data is partitioned into and therefore the size of (i.e., number of observations contained in) each fold. We want to choose k in a way that a sufficiently large portion of our data remains in the training set – after all, we don’t want to give too many observations away that could be used to train our model. The higher the value of k, the more observations are included in our training set in each iteration.

For instance, suppose we have 1,200 observations in our dataset, then with k=3 our training set would consist of (k-1)/k * N = 800 observations, but with k=8 it would include 1,050 observations. Naturally, with more observations used for training, you approximate your model's actual performance (as if it were trained on the whole dataset), hence reducing the bias of your error estimate compared to a smaller fraction of the data. But with increasing k, the size of your validation partition decreases, and your error estimate in each iteration is more sensitive to these few data points, potentially increasing its overall variance. Basically, it's choosing between the „extremes" of the train-test-split on the one hand and LOOCV on the other. The figure below schematically (!) illustrates the bias-variance performance and computational overhead of different cross-validation methods.
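You can verify these numbers quickly:

N <- 1200
k <- c(2, 3, 5, 8, 10)
data.frame(k = k, train_size = (k - 1) / k * N)
#>    k train_size
#> 1  2        600
#> 2  3        800
#> 3  5        960
#> 4  8       1050
#> 5 10       1080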

[Figure: Bias-variance performance and computational overhead of different cross-validation methods]

As a rule of thumb, with higher values for k, bias decreases and variance increases. By convention, values like k=5 or k=10 have been deemed to be a good compromise and have thus become the quasi-standard in most applied machine learning settings.

„These values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.“

James et al. 2013: 184

If you are not particularly concerned with the process of cross-validation itself but rather want to seamlessly integrate it into your data science workflow (which I highly recommend!), you should be fine choosing either of these values for k and leave it at that.

Implementing cross-validation in caret

Speaking of integrating cross-validation into your daily workflow—which possibilities are there? Luckily, cross-validation is a standard tool in popular machine learning libraries such as the caret package in R. Here you can specify the method with the trainControl function. Below is a script where we fit a random forest with 10-fold cross-validation to the iris dataset.

library(caret)

set.seed(12345)
inTrain <- createDataPartition(y = iris$Species, p = .7, list = FALSE)

iris.train <- iris[inTrain, ]
iris.test <- iris[-inTrain, ]

fit.control <- caret::trainControl(method = "cv", number = 10)

rf.fit <- caret::train(Species ~ .,
                       data = iris.train,
                       method = "rf",
                       trControl = fit.control)

We define our desired cross-validation method in the trainControl function, store the output in the object fit.control, and then pass this object to the trControl argument of the train function. You can specify the other methods introduced in this post in a similar fashion:

# Leave-One-Out Cross-Validation:
fit.control <- caret::trainControl(method = "LOOCV", number = 10)

# Repeated CV (remember to specify the number of repeats!)
fit.control <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 5)

The old-fashioned way: Implementing k-fold cross-validation by hand

However, data science projects can quickly become so complex that the ready-made functions in machine learning packages are not suitable anymore. In such cases, you will have to implement the algorithm, including cross-validation techniques, by hand, tailored to the specific project needs. Let me walk you through a make-shift script for implementing simple k-fold cross-validation in R by hand (we will tackle the script step by step here; you can find the whole code on our GitHub).

Simulating data, defining the error metric, and setting k

# devtools::install_github("andrebleier/Xy")
library(tidyverse)
library(Xy)

sim <- Xy(n = 1000,
          numvars = c(2, 2),
          catvars = 0,
          cor = c(-0.5, 0.9),
          noisevars = 0)

sim_data <- sim$data

RMSE <- function(f, o){
  sqrt(mean((f - o)^2))
}

k <- 5

We start by loading the required packages and simulating some data with 1,000 observations using the Xy() package developed by my colleague André (check out his blog post on simulating regression data with Xy). Because we need some kind of error metric to evaluate model performance, we define our RMSE function, which is pretty straightforward: The RMSE is the root of the mean of the squared error, where the error is the difference between our fitted (f) and observed (o) values; you can pretty much read the function from left to right. Lastly, we specify our k, which is set to 5 in the example and is stored as a simple integer.

Partitioning the data

set.seed(12345)
sim_data <- mutate(sim_data,
                   my.folds = sample(1:k,
                                     size = nrow(sim_data),
                                     replace = TRUE))

Next up, we partition our data into k folds. For this purpose, we add a new column, my.folds, to the data: We sample (with replacement) from 1 to the value of k, so 1 to 5 in our case, and randomly assign one of these five numbers to each row (observation) in the data. With 1,000 observations, each number should be assigned roughly 200 times.
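A quick sanity check confirms that the folds are roughly balanced:

# each of the five folds should contain roughly 200 observations
table(sim_data$my.folds)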

Training and validating the model

cv.fun <- function(this.fold, data){

  train <- filter(data, my.folds != this.fold)
  validate <- filter(data, my.folds == this.fold)

  model <- lm(y ~ NLIN_1 + NLIN_2 + LIN_1 + LIN_2,
              data = train)

  pred <- predict(model, newdata = validate) %>% as.vector()

  this.rmse <- RMSE(f = pred, o = validate$y)

  return(this.rmse)

}

Next, we define cv.fun, which is the heart of our cross-validation procedure. This function takes two arguments: this.fold and data. I will come back to the meaning of this.fold in a minute; let's just set it to 1 for now. Inside the function, we divide the data into a training and a validation partition by subsetting according to the values of my.folds and this.fold: Every observation with a randomly assigned my.folds value other than 1 (so approximately 4/5 of the data) goes into training. Every observation with a my.folds value equal to 1 (the remaining 1/5) forms the validation set. For illustration purposes, we then fit a simple linear model with the simulated outcome and four predictors. Note that we only fit this model on the train data! We then use this model to predict() our validation data, and since we have true observed outcomes for this subset of the original overall training data (this is the whole point!), we can compute our RMSE and return it.

Iterating through the folds and computing the CV error

cv.error <- sapply(seq_len(k),
                   FUN = cv.fun,
                   data = sim_data) %>%
  mean()

cv.error

Lastly, we wrap the call to cv.fun into a sapply() loop; this is where all the magic happens: Here we iterate over the range of k, so seq_len(k) leaves us with the vector [1] 1 2 3 4 5 in this case. We apply each element of this vector to cv.fun. In apply() statements, the iteration vector is always passed as the first argument of the function being called, so each element of this vector is passed to this.fold in turn. We also pass our simulated sim_data as the data argument.

Let us quickly recap what this means: In the first iteration, this.fold equals 1. This means that our training set consists of all observations where my.folds is not 1, and observations with a value of 1 form the validation set (just as in the example above). In the next iteration of the loop, this.fold equals 2. Consequently, observations with values 1, 3, 4, and 5 form the training set, and observations with a value of 2 go to validation, and so on. Iterating over all values of k, this schematically gives us the diagonal pattern seen in Figure 2 above, where each data partition is used as the validation set exactly once.

To wrap it all up, we calculate the mean of our k individual RMSE values, which leaves us with our cross-validation error. And there you go: We just defined our custom cross-validation function!

This is merely a template: You can insert any model and any error metric. If you've been following along so far, feel free to try implementing repeated CV yourself or play around with different values for k.
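As one possible starting point, a repeated version could be sketched like this (purely illustrative, building on the objects defined above; five repetitions chosen arbitrarily):

set.seed(54321)

repeated.cv.error <- sapply(1:5, function(repetition) {

  # re-shuffle the fold assignment for each repetition
  data.rep <- mutate(sim_data,
                     my.folds = sample(1:k,
                                       size = nrow(sim_data),
                                       replace = TRUE))

  # compute the k-fold CV error for this repetition
  mean(sapply(seq_len(k), FUN = cv.fun, data = data.rep))

}) %>%
  mean()

repeated.cv.error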

Conclusion

As you can see, implementing cross-validation yourself isn't all that hard. It gives you great flexibility to account for project-specific needs, such as custom error metrics. If you don't need that much flexibility, enabling cross-validation in popular machine learning packages is a breeze.

I hope that I could provide you with a sufficient overview of cross-validation and how to implement it both in pre-defined functions as well as by hand. If you have questions, comments, or ideas, feel free to drop me an e-mail.

References

  • James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. New York: Springer.
