
Introduction

At STATWORX, we not only help customers find, develop, and implement a suitable data strategy, but we also spend some time doing research to improve our own tool stack. This way, we can give back to the open-source community.

One special focus of my research lies in tree-based ensemble methods. These algorithms are bread and butter tools in predictive learning and you can find them as a standalone model or as an ingredient to an ensemble in almost every supervised learning Kaggle challenge. Renowned models are Random Forest by Leo Breiman, Extremely Randomized Trees (Extra-Trees) by Pierre Geurts, Damien Ernst & Louis Wehenkel, and Multiple Additive Regression Trees (MART; also known as Gradient Boosted Trees) by Jerome Friedman.

One thing I was particularly interested in was how much randomization techniques have helped improve prediction performance in all of the algorithms named above. In Random Forest and Extra-Trees, it is quite obvious. Here, randomization is the reason why the ensembles offer an improvement over bagging: through the de-correlation of the base learners, the variance of the ensemble and therefore its prediction error decreases. In the end, you achieve de-correlation by "shaking up" the base trees, as it's done in the two ensembles. However, MART also profits from randomization. In 2002, Friedman published another paper on boosting, showing that you can improve the prediction performance of boosted trees by training each tree on only a random subsample of your data. As a side effect, your training time also decreases. Furthermore, in 2015, Rashmi and Gilad-Bachrach suggested adding dropout to the boosting ensemble: a technique originally developed for neural nets.

The Idea behind Random Boost

Inspired by theoretical readings on randomization techniques in boosting, I developed a new algorithm that I call Random Boost (RB). In essence, Random Boost sequentially grows regression trees with random depth. More precisely, the algorithm is almost identical to MART and has the exact same input arguments. The only difference is the parameter d_{max}. In MART, d_{max} determines the maximum depth of all trees in the ensemble. In Random Boost, the argument constitutes the upper bound of possible tree sizes. In each boosting iteration i, a random number d_i between 1 and d_{max} is drawn, which then defines the maximum depth of that tree T_i(d_i).

In comparison to MART, this has two advantages:

First, RB is on average faster than MART when both are equipped with the same value for the tree size. When RB and MART are trained with a maximum tree depth equal to d_{max}, Random Boost will by construction grow many trees of depth d < d_{max}. If you assume that for MART all trees are grown to their full size d_{max} (i.e. there is enough data left in each internal node so that tree growing doesn't stop before the maximum size is reached), you can derive a formula for the relative computation gain of RB over MART:

    \[t_{rel}(d_{max}) = \frac{t_{\mathrm{RB}}(d_{max})}{t_{\mathrm{MART}}(d_{max})} \approx \frac{2}{d_{max}}\left(1 - \left(\frac{1}{2}\right)^{d_{max}} \right).\]

Here, t_{\mathrm{RB}}(d_{max}) is the training time of an RB ensemble whose tree size parameter equals d_{max}, and t_{\mathrm{MART}}(d_{max}) is defined analogously.
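
The post states this formula without derivation; one way to arrive at it is to assume that the cost of fitting a tree of depth d is roughly proportional to its number of leaves, 2^d. Averaging over the uniformly drawn depths then gives:

    \[t_{\mathrm{RB}}(d_{max}) \propto \frac{1}{d_{max}}\sum_{d=1}^{d_{max}} 2^{d} = \frac{2^{d_{max}+1}-2}{d_{max}}, \qquad t_{\mathrm{MART}}(d_{max}) \propto 2^{d_{max}},\]

    \[\Rightarrow \quad t_{rel}(d_{max}) \approx \frac{2^{d_{max}+1}-2}{d_{max}\,2^{d_{max}}} = \frac{2}{d_{max}}\left(1 - \left(\frac{1}{2}\right)^{d_{max}}\right).\]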

To make it a bit more practical, the formula predicts that for d_{max}=2, 3, and 4, RB takes 75%, 58%, and 47% of the computation time of MART, respectively. These predictions, however, should be seen as RB’s best-case scenario, as MART is also not necessarily growing full trees. Still, the calculations suggest that efficiency gains can be expected (more on that later).

Second, there are also reasons to assume that randomizing over tree depths can have a beneficial effect on prediction performance. From a variance perspective, boosting can suffer from overcapacity, and one cause is choosing too rich a base learner in terms of depth. If, for example, one assumes that the dominant interaction in the data generating process is of order three, one would pick a tree of depth three in MART in order to capture this interaction depth. However, this may be overkill: a fully grown tree of depth 3 has eight leaves and will therefore also learn noise in the data if there are only a few such high-order interactions. Perhaps a tree of depth 3 with fewer than eight leaves would be optimal in this case. MART cannot account for that, unless one adds a pruning step to each boosting iteration at the expense of computational overhead. Random Boost may offer a more efficient remedy to this issue. With probability 1 / d_{max}, a tree is grown that is able to capture the high-order effect at the cost of also learning noise. In all other cases, however, Random Boost constructs smaller trees that do not show the overcapacity behavior and can focus on interactions of lower order. If overcapacity is an issue in MART because the data is governed by interactions of mixed order with only a small number of high-order ones, Random Boost may therefore perform better than MART. Furthermore, Random Boost also de-correlates trees through the extra source of randomness, which has a variance-reducing effect on the ensemble.

The concept of Random Boost constitutes only a slight change to MART. I used the sklearn package as a basis for my code: the algorithm builds on sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier and is used in exactly the same way (i.e. argument names match exactly, and CV can be carried out with sklearn.model_selection.GridSearchCV). The only difference is that the RandomBoosting*-object uses max_depth to randomly draw tree depths for each iteration. As an example, you can use it like this:

rb = RandomBoostingRegressor(learning_rate=0.1, max_depth=4, n_estimators=100)
rb = rb.fit(X_train, y_train)
rb.predict(X_test)

For the full code, check out my GitHub account.

Random Boost versus MART – A Simulation Study

In order to compare the two algorithms, I ran a simulation on 25 datasets generated by a Random Target Function Generator that Jerome Friedman introduced in his famous boosting paper from 2001 (you can find the details in his paper; Python code can be found here). Each dataset (containing 20,000 observations) was randomly split into a 25% test set and a 75% training set. RB and MART were tuned via 5-fold CV on the same tuning grid:

  • learning_rate = 0.1
  • max_depth = (2, 3, ..., 8)
  • n_estimators = (100, 105, 110, ..., 195)

For each dataset, I tuned both models to obtain the best parameter constellation. Then, I trained each model on every point of the tuning grid again and saved the test MAE as well as the total training time in seconds. Why did I train every model again instead of simply storing the prediction accuracy of the tuned models along with the overall tuning time? Because I wanted to see how training time varies with the tree size parameter.

A Comparison of Prediction Accuracies

You can see the distribution of MAEs of the best models on all 25 datasets below.

absolute MAE

Evidently, both algorithms perform similarly.

For a better comparison, I compute the relative difference between the predictive performance of RB and MART for each dataset j, i.e. MAE_{rel,j} = \frac{MAE_{RB,j} - MAE_{MART,j}}{MAE_{MART,j}}. If MAE_{rel,j} > 0, then RB had a larger mean absolute error than MART on dataset j, and vice versa.

MAE by dataset with boxplot

In the majority of cases, RB did worse than MART in terms of prediction accuracy (MAE_{rel} > 0). In the worst case, RB had a 1% higher MAE than MART; in the median, RB has a 0.19% higher MAE. I'll leave it up to you to decide whether that difference is practically significant.

A Comparison of Training Times

When we look at training time, we get a quite clear picture. In absolute terms, it took 433 seconds to train all parameter combinations of RB on average, as opposed to 803 seconds for MART.

average training time

The small black lines on top of each bar are the error bars (two times the standard deviation of the mean; rather small in this case).

To give you a better feeling of how each model performed on each dataset, I also plotted the training times for each round.

total training time

If you now compute the training time ratio between MART and RB (\frac{t_{MART}}{t_{RB}}), you see that RB is roughly 1.8 times faster than MART on average.

Another perspective on the case is to compute the relative training time t_{rel,j} = \frac{t_{RB,j}}{t_{MART,j}}, which is just 1 over the speedup. Note that this measure has to be interpreted a bit differently from the relative MAE measure above: if t_{rel,j} = 1, RB is as fast as MART; if t_{rel,j} > 1, it takes longer to train RB than MART; and if t_{rel,j} < 1, RB is faster than MART.

In the median, RB needs only roughly 54% of MART's tuning time, and it is noticeably faster in all cases. I was also wondering how the relative training time varies with d_{max} and how well the theoretically derived lower bound from above fits the actually measured relative training time. That's why I computed the relative training time across all 25 datasets by tree size.

Tree size (max_depth)   Actual training time (RB / MART)   Theoretical lower bound
2                       0.751                              0.750
3                       0.652                              0.583
4                       0.596                              0.469
5                       0.566                              0.388
6                       0.532                              0.328
7                       0.505                              0.283
8                       0.479                              0.249

The theoretical figures are optimistic, but the relative performance gain of RB increases with tree size.

Results in a Nutshell and Next Steps

As part of my research on tree-based ensemble methods, I developed a new algorithm called Random Boost. Random Boost is based on Jerome Friedman’s MART, with the slight difference that it fits trees of random size. In total, this little change can reduce the problem of overfitting and noticeably speed up computation. Using a Random Target Function Generator suggested by Friedman, I found that, on average, RB is roughly twice as fast as MART with a comparable prediction accuracy in expectation.

Since running the whole simulation takes quite some time (finding the optimal parameters and retraining every model takes roughly one hour for each data set on my Mac), I couldn’t run hundreds or more simulations for this blog post. That’s the objective for future research on Random Boost. Furthermore, I want to benchmark the algorithm on real-world datasets.

In the meantime, feel free to look at my code and run the simulations yourself. Everything is on GitHub. Moreover, if you find something interesting and you want to share it with me, please feel free to shoot me an email.

References

  • Breiman, Leo (2001). Random Forests. Machine Learning, 45, 5–32
  • Chen, Tianqi, and Carlos Guestrin (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
  • Chapelle, Olivier, and Yi Chang. 2011. “Yahoo! learning to rank challenge overview”. In Proceedings of the Learning to Rank Challenge, 1–24.
  • Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.
  • Friedman, J. H. (2002). “Stochastic gradient boosting”. Computational Statistics & Data Analysis 38 (4): 367–378.
  • Geurts, Pierre, Damien Ernst, and Louis Wehenkel (2006). “Extremely randomized trees”. Machine learning 63 (1): 3–42.
  • Rashmi, K. V., and Ran Gilad-Bachrach (2015). DART: Dropouts meet Multiple Additive Regression Trees. Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38.

 

 

Introduction

Writing appealing interactive web applications – one of STATWORX's many competences – is easy with R Shiny. Just a few lines of code in a single R script create all the logic you need to let the magic of Shiny happen. It is so simple that you can make a hello world app in a heartbeat, like this:

library(shiny)
ui <- fluidPage(
  "Hello, World!"
)
server <- function(input, output, session) { }
shinyApp(ui, server)

Today I am going to show you one way you can use native shiny syntax to modularize pieces of your code in a way that makes your code basis easily maintainable and extendable. Since I assume you are already familiar with shiny, I’ll skip the intro wading pool example and go right to the high-dive.

What are event chains?

An event chain describes the relationship between events and tasks and how the events affect each other. In some cases, you may want to have an app that takes user input and performs actions based on the nature of the input, potentially asking for more information along the way. In such a case, chances are you want to implement an event chain. You could immediately start hacking some crude solution to your problem, but you may risk creating hardly comprehensible code. Furthermore, imagine that requirements on your event chain suddenly change. In this case, it is important to modularize your event chain so that it remains maintainable and adaptable.

Example: the friend logger

So, let me illustrate how to build a modularized event chain. Imagine you are pedantic about time and take appointments seriously. Quite to the detriment of your so-called "friends", you make no exceptions. Every time a friend is late, you suffer so badly that you have decided to use a shiny app to keep score of your friends' visits in order to determine how reliable they are (you pathetic you!). Requirements on the app's usage are simple, as shown in the graph below.

friends

You want to compare the expected arrival time of your friend with his actual arrival time. If his delay is above a certain threshold (e.g. 5 minutes), you want to record his excuse for being late. If you deem his excuse acceptable, you forgive his sin (but still keep a record!). If he is punctual, he receives a bonus point. If he arrives too late and his excuse is not acceptable, he receives a minus point. In any case, you log his visit (how low can you get?). To keep things more visual, here is a sketch of the app's UI, including the event sequence when a friend is late.

friends-app-view

Now, it is time to implement the app.

Event chain architecture in R Shiny

It takes two ingredients to implement event chains:

  1. triggers that are stored in reactiveValues()
  2. observers (observeEvent()) that are triggered and carry out the actual checks and other computations (a minimal sketch of this pattern follows right after this list)
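
In its most generic form (inside the server function), the pattern looks roughly like this; the names are placeholders, not code from the app:

rv <- reactiveValues(do_step_2 = TRUE)      # the trigger

observeEvent(input$submit, {
  # ... step 1: checks and computations ...
  rv$do_step_2 <- isolate(!rv$do_step_2)    # flip the trigger to fire step 2
})

observeEvent(rv$do_step_2, ignoreInit = TRUE, {
  # ... step 2 of the event chain ...
})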

The actual trick is to find the appropriate number of observeEvent()s so that each step in the event chain is covered by one observeEvent and therefore no code redundancies are created. Using the example above, we have three possible sequences of events:

  1. Friend is too late and has a good excuse
  2. Friend is too late and doesn’t have a good excuse
  3. Friend is not too late

In all three cases, we need to log a friend's visit, so it definitely makes sense to put the visit logging part in one observeEvent and to call that observer at the end of each of the sequences above. Drawing an event chain diagram comes in especially handy here, as it helps you find a suitable architecture. I used draw.io for the task.

For the app, I used one reactiveValues-object in which I put all triggers (you can find the whole app code on GitHub).

shinyServer(function(input, output, session) {
  
  # Data
  rv <- reactiveValues(
    ...
    # Triggers
    ask_for_reason = TRUE,
    change_friend_score = TRUE,
    save_visit = TRUE,
    error = FALSE
  )
  ...
})

I use boolean values for the triggers so that I only have to negate them if I want to change their value (a <- !a). Using integers would also work, but I find the flip-trick nicer. Let's look at the part of the chain where a friend's punctuality is checked in more detail. The module that checks punctuality also reads in the data. Depending on the input, it either calls the "Ask-for-a-reason" module or directly calls the visit logger.

# Submit friend data ----
observeEvent(input$submit, {
  # Collect data
  ...
    
  is_delayed <- difftime(actual_time, expected_time, units = "mins") > input$acceptance
  if (is_delayed) {
    # Friend is delayed --> trigger Ask-for-reason-module
    rv$ask_for_reason <- isolate(!rv$ask_for_reason)
    return()
  }
  # Friend seems punctual --> Add a point to score list :)
  friend_data <- set_data(friend_data, score = 1) 
  # Trigger visit logger
  rv$change_friend_score <- isolate(!rv$change_friend_score)
})

As you can see, once you have drawn the event chain it is quite intuitive to translate it into shiny code. If the friend is punctual, we set his score to one (score will be added in the visit logger module) and call the visit logger module, which looks like this:

# Change friend score ----
observeEvent(rv$change_friend_score, ignoreInit = TRUE, {
  rv$friend_score[rv$friend_score$name == friend_data$name, "score"] <-
    isolate(rv$friend_score[rv$friend_score$name == friend_data$name, "score"]) + 
    friend_data$score
  # Make change permanent
  saveRDS(rv$friend_score, "data/friend_score.RDS")
  rv$save_visit <- isolate(!rv$save_visit)
 })

Note that the rv$save_visit trigger simply calls an observer that adds another row to the friend visit table and does some cleaning.
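
That observer isn't shown in the post; a minimal sketch of what it might look like (rv$visits and the exact fields are assumptions about the app's internals, not the original code) is:

# Log the visit (hedged sketch; rv$visits and the fields are assumptions)
observeEvent(rv$save_visit, ignoreInit = TRUE, {
  rv$visits <- rbind(
    rv$visits,
    data.frame(name          = friend_data$name,
               expected_time = friend_data$expected_time,
               actual_time   = friend_data$actual_time,
               reason        = friend_data$reason,
               stringsAsFactors = FALSE)
  )
  saveRDS(rv$visits, "data/visits.RDS")  # make the log permanent
  # ... clean up the inputs for the next entry
})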

So now let’s make a little test run with the ready product. For your app to work, you of course have to first create an initial dataset with your friends and their initial scores in order to know who you are keeping record of. In the example below, we have four friends: John, Emily, Olivia, and Ethan. You can see their scores in the lower left corner. Past visits are logged and displayed in the upper right corner.

app-ui

John wants to hang out with us to play some brutal video games, and for no obvious reason we made an appointment at 9 am. However, John shows up 7 (!!!) minutes late. Enough is enough. We enter his misdeed.

john-entered

It exceeds the threshold, so we are, as expected, prompted to enter the reason.

enter-reason

When we asked John to justify himself, he just shrugged his shoulders. How dare he?! That’s a minus point…

Extend our event chain

Even though you hurt all over because of John's unreliability, you are quite happy with your app. Yet, things could be better! For example, every time you misspell a friend's name in the name field when logging a visit, the app crashes. Your app could use some (additional) sanity checks. A perfect use case for showing the flexibility of your architecture. After a few months of deep reflection, you came up with a new event flow graph that takes care of wrong inputs.

friends-with-error

You identified two spots where the app ought to be stabilized. First, you want to show the user an error if a friend doesn't exist (without stopping the app). Second, you require yourself to enter a reason (we all know how sloppy our future self can be from time to time).

With the already chosen modularized structure, it is easy to incorporate these checks. You simply need to add one more trigger (rv$error) and one global container that stores the error information.

# Error handler
error <- reactiveValues(
  title = "",
  message = ""
)

If you, for example, want to check whether an entered name exists in your database, all you have to do is add a few lines of code at the beginning of the observer where a friend's punctuality is checked.

# Submit friend data ----
observeEvent(input$submit, {
  # Friend exists?
  if (!input$name %in% rv$friend_score$name) {
    error$title <- "%s not found" %>% sprintf(., input$name)
    error$message <- h1("404")
    rv$error <- isolate(!rv$error)
    return()
  }
  ...
})

If the name doesn't match any of your friends' names, you trigger an error handler module whose only purpose is to show an error message:

# Error handling ----
observeEvent(rv$error, ignoreInit = TRUE, {
  showModal(modalDialog(    
    title = error$title,
    error$message,
    footer = actionButton("exit", "Ok", class = "btn-primary")
  ))
})

The nice thing is that you can use this module to handle any errors, no matter which sanity checks have caused them.

So if we go back to the app now and enter a name that doesn’t exist (like Tobias), we get the following error message:

friend-not-found

Furthermore, if we forget to enter a reason when being asked for one, we get a passive-aggressive reminder:

no-reason-give

You are welcome! So would you excuse me now? I have some visits to protocol…

Everybody talks about them, many people know how to use them, few people understand them: Long Short-Term Memory Neural Networks (LSTM). At STATWORX, with the beginning of the hype around AI and projects with large amounts of data, we also started using this powerful tool to solve business problems.

In short, an LSTM is a special type of recurrent neural network – i.e. a network able to access its internal state to process sequences of inputs – which is really handy if you want to exploit some time-like structure in your data. Use cases for recurrent networks range from guessing the next frame in a video to stock prediction, but you can also use them to learn and produce original text. And this shall already be enough information about LSTMs from my side. I won’t bother you with yet another introduction into the theory of LSTMs, as there are more than enough great blog posts about their architecture (Kudos to Andrej Karpathy for this very informative piece of work which you should definitely read if you are not already bored by neural networks :)).

Especially inspired by the blog post mentioned above, I thought about playing with a use case for LSTMs that actually has no intended use at all. LSTMs are good at learning text, so I thought it might be fun to let a character-level LSTM learn to write R code. It was not so important that the code be semantically correct or even solve a particular problem. Having an NN that is able to produce (more or less) syntactically correct code is already enough.

So, on my journey to CodeR, an NN that renders my workforce totally obsolete, I will let you participate in the three major steps of getting an RNN to write R code:

  1. Get enough text training data
  2. Build and train CodeR with that data
  3. Let CodeR write majestic R code

If you'd like to try it yourself or follow along with the subsequent steps, you can get the code from my GitHub repository.

Step 1: Data Acquisition

Where to get enough Data?

Ultimately, CodeR needs data; a lot of data. Plus, the data should be of good quality and not be too heterogeneous so that CodeR is able to learn the structure from the given text. Since R is open source, the first address to search for good R code is GitHub. GitHub offers you an API to access information about its repositories, but for the flexibility and data I needed, I found the API too restrictive. That’s why I decided to scrape the webpage myself using Hadley Wickham’s rvest package.

Scrape GitHub

The goal is simple: clone all R repositories of famous R users. Of course, you could manually define R contributors who seem to be good programmers, but chances are you would miss someone who has good and influential packages to offer. Remember that we need a lot of code and that it isn't much of a problem to reduce the data afterwards (which I in fact did).

Get trending R user names

So, let’s start by getting the names of the trending users. If you visit https://github.com/trending/developers/r?since=monthly, you see a list of all trending users. On June 14, 2018, it looked like this:

trending-r-users

If you inspect the HTML code, you quickly see that the actual user names are the href attribute of a link surrounded by <h2> tags, so we use rvest to dig through that structure.

# Packages used for scraping (loaded here so the snippet runs on its own)
library(glue)
library(rvest)
library(magrittr)

git_url <- "https://github.com"

trending_user <- glue("{git_url}/trending/developers/r?since=monthly") %>%
  read_html() %>%
  html_nodes(., "h2") %>%
  html_nodes(., "a") %>%
  html_attr(., "href") %>% 
  gsub("/", "", .)
trending_user
[1] "hadley"        "rstudio"       "yihui"         ...

It’s good to see that the names match the expected result from the webpage :).

Get R repository names

In the next step, taking user i, we need to get all repositories of i that are her own (i.e. not forked) R repositories. When checking the URL that lets you inspect all repos of a user (e.g. https://github.com/hadley?page=1&tab=repositories), you realize that you need to go through all pages of a user's repository tab. I wrote a function that does exactly that (a simplified sketch follows the list below) and additionally makes sure that:

  • The repo’s main language is R
  • If the repo is forked, the repo will be assigned to the original author
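
The original helper isn't shown in full here, but a simplified sketch, which ignores the language filter and the fork-reassignment logic and treats the CSS selector as an assumption, could look roughly like this:

# Hedged sketch of a get_r_repos()-style helper (selector and details are assumptions)
get_r_repos <- function(user) {
  repos <- character(0)
  page  <- 1
  repeat {
    url   <- glue("https://github.com/{user}?page={page}&tab=repositories")
    nodes <- read_html(url) %>% html_nodes("a[itemprop='name codeRepository']")
    if (length(nodes) == 0) break              # no more repository pages
    repos <- c(repos, html_attr(nodes, "href"))
    page  <- page + 1
  }
  gsub("^/", "", repos)                        # return "user/repo" strings
}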

With that function, it is easy to extract all R repo names from our trending users

repos <- list()
for (user in trending_user) {
  cat("User: ", user, "n")
  repos[[user]] <- get_r_repos(user)  # The actual magic
}
repos %<>% unlist() %>% unique()

Clone R repositories

Now that we have a bunch of repository names, the last step is to clone all those repos and to clean them so that they only contain R files. I have decided to clean a repo directly after I have cloned it since I am going to download a lot of data and don’t want to use too much space on my hard drive. The example code below clones the repo where you can find all of the code above (you are welcome ;)).

repo <- "tkrabel/rcoder"
system(glue("git clone https://github.com{repo}.git"),
       wait = TRUE)

After having cloned all repos, I simply smash their content together in one big text file (r_scripts_text.txt).
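
The post doesn't show that step explicitly, but it boils down to something like this (the directory name is an assumption):

# Hedged sketch: collect all R files from the cloned repos into one text file
r_files <- list.files("repos", pattern = "\\.[rR]$",
                      recursive = TRUE, full.names = TRUE)
all_code <- vapply(r_files, function(f) {
  paste(readLines(f, warn = FALSE), collapse = "\n")
}, character(1))
writeLines(paste(all_code, collapse = "\n\n"), "r_scripts_text.txt")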

Step 2: Teach the Baby to Walk

So, we have a big text file now that is ready to be inspected by CodeR so that it can learn to produce good pieces of code of its own. But how does the training actually work? There are a few steps that need to be taken care of here:

  1. Prepare the data in a way it can actually be learned by an LSTM
  2. Construct the network’s architecture
  3. The actual training step

The general idea behind step 1 is to slice the text data into overlapping sequences of characters with a pre-specified size s corresponding to the "time horizon". For example, imagine a text file containing the string "STATWORX ROCKS!" and let s = 3, meaning that you want the LSTM to use the last three characters to predict the fourth one. From this text file, you generate data which looks like this:

x1    x2    x3    y
'S'   'T'   'A'   'T'
'T'   'A'   'T'   'W'
'A'   'T'   'W'   'O'
...   ...   ...   ...
'C'   'K'   'S'   '!'

In a next step, you have to represent each character as a numeric object so that your model can actually work with it. The most popular way is to represent characters as unit vectors. Making it more tangible, remember that the sentence above contains 11 distinct characters (including the blank space and the exclamation mark). The so-called vocabulary {'S', 'T', 'A', 'W', 'O', 'R', 'X', ' ', 'C', 'K', '!'} is utilized to represent each character by an 11-dimensional unit vector with the 1 at its respective character position, e.g. S = (1, 0, \dots, 0)^\top (because 'S' is the first character of the vocabulary), T = (0, 1, 0, \dots, 0)^\top, and so on. With these transformations, we finally have data our model can learn from.
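
Purely to illustrate the idea (the actual preparation was later handled by textgenrnn), a toy sketch of this slicing and one-hot encoding in R could look like this:

# Toy illustration of the data preparation (not the code used for CodeR)
text  <- "STATWORX ROCKS!"
s     <- 3                                   # the "time horizon"
chars <- unique(strsplit(text, "")[[1]])     # the vocabulary (11 characters)

# overlapping input sequences and their target character
starts <- seq_len(nchar(text) - s)
x <- t(sapply(starts, function(i) strsplit(substr(text, i, i + s - 1), "")[[1]]))
y <- sapply(starts, function(i) substr(text, i + s, i + s))

# one-hot encode a single character as a unit vector over the vocabulary
one_hot <- function(ch) as.integer(chars == ch)
one_hot("S")   # 1 0 0 0 0 0 0 0 0 0 0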

Step 2 (building the model) is easy with the R keras package, and it in fact took only 9 lines of code to build an LSTM with one input layer, 2 hidden LSTM layers with 128 units each and a softmax output layer, making it four layers in total. I kept the model that "simple" because I knew training was going to take a long time. However, the learning results were not satisfying even after longer training times, so I decided to look for ways of training networks on better (free) hardware in order to configure much more complex models. My search brought me to Google Colaboratory, an environment that runs in the cloud and offers GPU support. Especially the GPU support gave training a huge time boost. However, for all R enthusiasts out there, Google's Colab has a drawback: it is a Jupyter Notebook environment and therefore requires you to write Python code, which makes my use case somewhat ironic, since I now use Python to train a network that writes R code. Well, in the end, I suppose, we all have to make some sacrifices :)!
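
For illustration only, a model of the kind described above could be sketched with the R keras package roughly like this (the vocabulary size and sequence length below are assumptions, not the values used for CodeR):

library(keras)

# Sketch of the described architecture (sizes are assumptions)
vocab_size <- 80   # number of distinct characters in the vocabulary (assumed)
seq_length <- 40   # number of characters used to predict the next one

model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(seq_length, vocab_size),
             return_sequences = TRUE) %>%                   # first hidden LSTM layer
  layer_lstm(units = 128) %>%                               # second hidden LSTM layer
  layer_dense(units = vocab_size, activation = "softmax")   # output layer

model %>% compile(loss = "categorical_crossentropy", optimizer = "adam")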

As I started translating my code into Python, I found that there is a very useful package textgenrnn that lets you very easily build and train a model. The advantage of the package is that its functions handle the whole data preparation step for you. The only thing you need to do is to specify the raw input text file from which the model learns and to configure the model, the rest is done for you (Credits go to Max Woolf for this great piece of work).

If you want to build your own version of CodeR, just copy this notebook to your Google Drive and follow the instructions.

Step 3: Let CodeR Talk to Us

After we have a trained version of CodeR, it is time to let it write some code. Starting with a blank sheet, CodeR is asked to sample the first character, which is a random one. In the next step, we feed that created character back to the model in order to write the next character. After that, we always use up to the last 40 characters as an input for the prediction of the next element in the text sequence.
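
To make this loop a bit more concrete, here is a hedged R sketch of a single sampling step (this is not the textgenrnn code; the temperature parameter is explained in the next paragraph):

# Hedged sketch of one sampling step (not the original textgenrnn implementation)
sample_next_char <- function(model, x_last, chars, temperature = 0.5) {
  # x_last: one-hot encoded array of the last (up to) 40 characters, shape (1, 40, vocab)
  probs  <- as.numeric(predict(model, x_last))  # softmax output over the vocabulary
  logits <- log(probs) / temperature            # temperature rescales the distribution
  probs  <- exp(logits) / sum(exp(logits))
  sample(chars, size = 1, prob = probs)         # draw the next character
}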

There is a parameter in the corresponding textgenrnn function that can determine CodeR’s creativity while writing R code (so-called temperature). The higher the value, the more creative, i.e. diverse, the text. However, the results are not checked for syntactical correctness, so choosing too high of a temperature leads to more syntax errors. On the other hand, lower values in temperature (e.g. 0.5) make CodeR more conservative in its predictions, being closer to what it has learned. For a value of temperature = 0.5, CodeR knows how to pass any code review:

partition_by = NULL,
                                                                                                                                                                                                                                                                                                                                                                                                  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 

BATMAAAAAAAN!!! Looks like CodeR is lost in an infinite NA loop. Note the comma after the first statement. It is an artefact of the fact that we have some shiny code in the code base. This is, of course, an issue, as it leads to a very heterogeneous text base, destabilizing the RNN's learning process. Making the training data more homogeneous in terms of syntax is definitely a topic for future refinements.

But let's look at some code now that is funny and, quite frankly, impressive. It will also teach us something about the learning set. Staying with a temperature value of 0.5, CodeR mainly produces, well, NA loops, but it also writes a lot of roxygen comments.

#' @param x An object to transform to a string.
#'
#' @param x An object of column names
#' @param categorical_column A string with a model format.
#' @param ... Export to make setting to a selection of the length of the scale.
#' @param ordered If not supported values are stored in the data state on a post container.
#' @param conf.level character vector of structure (a, b, separate the second library in the first directory for the global environment for each vector of the same as a single argument.
#'
#' @param ... Additional arguments to exponentiate for the new name of the document, not a context
#'   the code{searchPackages} returns a function that dependencies to create a character vector of the specified values. The name of the command packages in the top level normal for a single static when the default to the selection

Quite entertaining, isn't it? I like the way CodeR is using totally confusing argument names, such as categorical_column for a "string with a model format".

With temperature values around 1, CodeR starts writing its first syntactically correct functions (although the semantics might be a problem). Let's look at some snippets I found in the output.

format.rdnore_configNames <- function(x) {
  class(x) <- y_string[[mull_bt]]
  x
}
addToN <- function(packagePath, ..., recursive = FALSE) {
  assert_that(is_string(id))

  base <- as.data.frame(names(purrr::packagesjitter))
  stop("parents has columns")
  layer_class = c(list(values, command), top = "Tar")
}

counts.latex <- function(x, ...) {
  if(is.null(x)) {
    stop("ROC Git classes:")
  }
}

spark_guess_value <- function(pd, stringsAsFactors = FALSE) {
  if (!is.null(varify))
    cat("Installing ", cols, "/", xdate)


  return(selfregistry)
}

It is impressive that CodeR correctly sets blank spaces and braces most of the time. I only needed to mildly correct it when it set a backtick instead of a quotation mark. It can also use dplyr functions and the magrittr pipe (which is great since I am a big fan of the pipe, as you can read here).

rf_car <- function(operator, input_col, config, default.uniques) %>%
  group_by(minorm) %>%
  summarise(week = 100)

Of course, this is just the tip of the iceberg, and there was a lot of unusable code CodeR produced. So to be fair, let's take a look at the lines that didn't make it into the hall of fame.

#' @import knit_print.j installed file
#' @param each HTML pages (in numbers matching metadata),
      # in the seed location
      if(tfitem %in% cl) {
        unknown <<- 100 : mutate(contsList)
      v = integer(1)
    }

    unused(
      {retries_cred_, revdeps, message_format = ","))
      

      if (flattenToBinour && !renamed) {

    # define validations for way top aggregated instead, constants that does not want
#' arguments  code{list}.
#'
#' @inheritParams dontrun{
#' # Default environment is supplied
#'
#' @keywords internal
str_replace_all <- function(pkg_lines, list(token = path),
                           list(using := force_init(), compare, installInfoname, ") %>%
  ` %>% #'   modifyList(list(2, coord = FALSE)) #'

If you set the temperature to 2, it becomes wildly creative.

x <- xp>$scenqNy89L'<JW]
#' Clear tuning
#r
# verifican
ignore <- wwMap.com:(.p/qafffs.tboods,4max LNh,	rmAR',5R}/6)  Y/AS_M(SB423eyt
mf(,9] **.4L2;3) # v1.3mDE); *}

g3 <%yype_3X-C(r63,JAE)Zsd <- 1

Summary and outlook

LSTMs offer many interesting and amusing use cases. As a free-time side project, I try to leverage the recurrent structure of an LSTM in order to train a model I call CodeR to write, well …, R code. The results are truly entertaining and informative, as they reveal some of the training data set’s structure. For example, we see that R code contains roxygen comments to a large extent, which makes sense as we included many R packages in the training set.

One point for further improvements of CodeR definitely is to remove all the shiny code from the training set in order to make the syntax more homogeneous and therefore to improve CodeR’s output text quality. Furthermore, it may be worthwhile to remove all roxygen comments.

If you have any ideas what to do next with CodeR, if you have any suggestions on how to improve my code or if you just want to leave a comment, please feel free to shoot me a message. Especially if you trained a version of CodeR yourself, don’t hesitate to share your favorite lines of code with me. I would also be very curious if you could improve CodeR’s output quality by altering the training set (e.g. in the way described above). If you want to learn more about keras, check out our open workshop!

Introduction

RStudio is a powerful IDE that helped me so many times with conveniently debugging large and complex R programs as well as shiny apps, but there is one thing that bugs me about it: there is no easy option or interface in RStudio that lets you customize your theme, just as you can do in more developed text editors such as Atom or Sublime. This is sad, especially if you like working with RStudio but are not satisfied with its appearance. That being said, I did some research on how to customize its themes by hand with a little hack! Here you go!

Customizing your theme

Well, before we get started, I've got some sad news for you: first, I use a Mac, so all the instructions pertain to that platform. However, things should work similarly on Windows. Second, you have to sacrifice one of RStudio's built-in editor themes, so choose wisely. In the end, what we will do is overwrite the settings of one theme with our own preferences. For the sake of demonstration, I will do that with the dawn editor theme in RStudio version 1.1.419. To understand what is going on, be aware that the RStudio IDE in fact works like a browser, and the styles and themes you have at hand are essentially css files lying somewhere in your file system. We will eventually access those files and change them.

First step: Change a theme until you like it

Let's go now and look at the dawn theme.

dawn theme

Now, if you want to experiment with changing it, recall that RStudio actually is a browser, so right-clicking somewhere in the text editor and selecting "Inspect Element" should open the HTML the editor is based on.

href to css
Scroll down until you find the <link> tag referencing a css file. There is a path in its href attribute. You should remember the filename of the corresponding css file, because we will change that file on our file system later. Simply click on the path to the css file to view its content.

old theme overview

Perfect! Now, you can mess around with the css selectors' attributes and view the result on the theme in real time (unfortunately, you cannot rearrange the selectors)! As an example, I will change the selector .ace_comment which defines the physical appearance of comments in the code. Let's say I don't like it being italic and I want to change the color to be a bit more … noticeable. That's why I decide to make the font bold and change the color to red. Furthermore, I add another attribute font-size to make the text larger.

This is what it looked like before …

comment old

… and this is what it looks like now ….

comment new

Second step: Overwrite the theme on your file system

So far, we have temporarily changed the theme. If you reopen RStudio, everything is gone. This can be handy if you just want to play around. However, we want to make things permanent, which is our next step.

As I have already mentioned, an editor theme in RStudio essentially is a css file lying somewhere on your computer, so all RStudio themes can be accessed through your file system. Simply navigate to the program folder in Finder, right-click on the RStudio icon and click on "Show Package Contents" (or something similar to that; sorry, my system language is German ;)).

RStudio show package content

You should now find yourself in the directory Contents. Next, navigate to the css files as shown below.

path to css

If you change the file corresponding to the dawn theme (97D3C…9C49C5.cache.css), you will have permanently changed that theme.

Conclusion

Customizing RStudio themes requires some little tricks, but it is definitely doable. Please keep in mind that any changes you make when inspecting the editor theme will only be temporary. To make changes permanent, take a look at what theme you want to overwrite, search its corresponding css file name, and enter your changes there.

If you like the Atom theme "One Dark" and would like to have it as an RStudio theme, you can find it on my GitHub account. Simply rename it to the css file you want to replace and put it in RStudio's theme folder. As a little teaser, this is what it looks like:

atom theme rstudio


Anyone who works with large amounts of data in their day-to-day job knows how useful databases can be. As electronic management systems, databases are designed to handle large amounts of data efficiently and consistently. Moreover, a database ensures that everyone in a company works with the same, up-to-date data: changes to the data are immediately available to everyone involved. This is particularly advantageous when data is processed automatically by computers.

Master working with databases with these packages

R offers various packages for working with databases, making it comfortable to connect to a database from within R and to integrate it into the data science process. For example, with the packages DBI and RMySQL, we can conveniently connect to our test database test_db and look at the table flights stored in it, which contains information on departures from New York City airports in 2013.

# Load packages
library(DBI)    # functions for working with databases
library(RMySQL) # MySQL driver
library(dplyr)  # for %>%

# Create a connector object
# (connection details as in the config.yml shown further below)
con <- dbConnect(drv      = RMySQL::MySQL(),
                 user     = "root",
                 password = "root",
                 host     = "127.0.0.1",
                 port     = 3306,
                 dbname   = "test_db")

# Query the flights table via the connector
con %>%
  dbGetQuery("SELECT month, day, carrier, origin, dest, air_time
              FROM flights LIMIT 3")

#   month day carrier origin dest air_time 
# 1     1   1      UA    EWR  IAH      227 
# 2     1   1      UA    LGA  IAH      227 
# 3     1   1      AA    JFK  MIA      160 

# Close the connection
dbDisconnect(con) 

# [1] TRUE 

As we can see, a few lines of code are enough to view the contents of the database table. Working with databases, even via the R API, has one small catch, though: you have to know SQL. That is not a big problem in itself, since SQL as a declarative language is quite intuitive, and the query in the example above is easy to understand. Things change, however, when the data in the database is too large to be pulled with a simple SELECT * FROM. In that case, the first aggregations need to be carried out on the database itself. If these are complex, SQL can become a real hurdle for a data scientist.

At STATWORX, we constantly work with databases in order to integrate our predictive systems seamlessly into our customers' data processes. However, you don't have to be an SQL expert to do so. A handful of packages help you work with databases safely and confidently.

In the following, I present three packages that make working with databases safer, more stable, and easier.

Managing connections cleverly with pool

When working with databases, more technical topics often become relevant as well.

For instance, managing connections can be quite tedious when they are created dynamically, for example in a Shiny app. This can even crash an app, since some databases allow only 16 simultaneous connections by default. Connectors therefore always have to be closed once they are no longer needed, which is what the last line of the example code above does.
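
As a small generic sketch (not from the original article) of how to guarantee that a manually created connection is closed again, even if a query fails, one option is on.exit():

# Generic sketch: always close the connection when the function exits
query_flights <- function(sql) {
  con <- dbConnect(drv = RMySQL::MySQL(), user = "root", password = "root",
                   host = "127.0.0.1", port = 3306, dbname = "test_db")
  on.exit(dbDisconnect(con), add = TRUE)  # runs on normal exit and on error
  dbGetQuery(con, sql)
}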

To manage connectors in a more stable way, you can use the pool package.

The pool package creates a kind of intelligent connector, a so-called object pool. The convenient part is that the pool takes care of creating connectors once at the beginning and then manages them as efficiently as possible, i.e. it utilizes the connections to the database in an optimal way. Another advantage of working with the pool package is that its functionality hardly differs from the DBI package. Let's take a closer look at this with an example.

# Load package
library(pool)

# Create a pool object
pool <- dbPool(drv      = RMySQL::MySQL(),
               user     = "root",
               password = "root",
               host     = "127.0.0.1",
               port     = 3306,
               dbname   = "test_db")

# Query the flights table via the pool
pool %>%
  dbGetQuery("SELECT month, day, carrier, origin, dest, air_time
             FROM flights LIMIT 3")
 
#   month day carrier origin dest air_time 
# 1     1   1      UA    EWR  IAH      227 
# 2     1   1      UA    LGA  IAH      227 
# 3     1   1      AA    JFK  MIA      160 

# Close the pool
poolClose(pool)  

As we can see, the syntax has hardly changed. The only difference is that we manage the pool with the dedicated functions dbPool() and poolClose(). The pool itself takes care of the connectors needed for the database queries, as the figure below shows schematically. Put very simply, the user sends the query to the pool, whereupon the pool decides through which connector it sends the query to the database and returns the result.

Schematic of the pool package

Hiding credentials with config

When creating a connection, you have to provide credentials, which are better kept in a protected place. This was not an issue in the example above, since test_db runs locally on our machine and the credentials are therefore not sensitive. However, if the code that creates the connector is to be shared with colleagues, it is advisable to read the credentials from an R object instead.

With config, YAML configuration files can be read from within R, in which, for example, database credentials can be stored (YAML is a human-readable format for storing data). In a first step, we only need to create a configuration file, which we name config.yml.

# Create the configuration file (config.yml)
default: 
  database_settings: 
    host: 127.0.0.1 
    dbname: test_db 
    user: root 
    pwd: root 
    port: 3306 
  other_setting: 
    filepath: /path/to/file 
    username: gauss

Note that the first line of the file is mandatory. The YAML format has the advantage that settings can be separated thematically by introducing sub-lists. In the example, we created one sub-list with the database credentials (database_settings) and another one with other exemplary settings (other_setting). In a second step, the database settings can now be read in specifically with the get() function.

# Load package
library(config)

# Load the database settings
config <- get(value = "database_settings", 
              file  = "~/Desktop/r-spotlight/config.yml") 

str(config) 

# List of 5 
# $ host  : chr "127.0.0.1" 
# $ dbname: chr "test_db" 
# $ user  : chr "root" 
# $ pwd   : chr "root" 
# $ port  : int 3306 

When creating a pool, we now no longer have to reveal our sensitive data.

# Create a pool object
pool <- dbPool(drv      = RMySQL::MySQL(), 
               user     = config$user, 
               password = config$pwd, 
               host     = config$host, 
               port     = config$port, 
               dbname   = config$dbname) 

SQL without SQL thanks to dbplyr

dbplyr is the database back end of dplyr and simply makes sure that dplyr's elegant syntax also works with connector objects. dbplyr is integrated into dplyr and therefore does not need to be loaded separately.

# Load package
library(dplyr)

# A small query with dplyr
pool %>%  
  tbl("flights") %>%  
  select(month, day, carrier, origin, dest, air_time) %>% 
  head(n = 3) 

# Source:   lazy query [?? x 6] 
# Database: mysql 5.6.35 [root@127.0.0.1:/test_db] 
#   month   day carrier origin  dest air_time 
#               
# 1     1     1      UA    EWR   IAH      227 
# 2     1     1      UA    LGA   IAH      227 
# 3     1     1      AA    JFK   MIA      160

What stands out is that the result is not an R data frame (as you can see from "Source: lazy query …"). Instead, the R syntax you enter is translated into SQL and sent to the database as a query, so everything is computed on the database. The underlying SQL statement can be displayed with show_query().

# Display the generated SQL

pool %>%  
  tbl("flights") %>%  
  select(month, day, carrier, origin, dest, air_time) %>% 
  head(n = 3) %>% 
  show_query() 

# <SQL>
# SELECT `month` AS `month`, `day` AS `day`, `carrier` AS `carrier`,  
#        `origin` AS `origin`, `dest` AS `dest`, `air_time` AS `air_time` 
# FROM `flights` 
# LIMIT 3 

Admittedly, that was not exactly a killer query yet, but the principle should be clear. With this tool, you can quickly write considerably more complex queries as well. For example, we could now have the database compute the average flown distance per carrier.

# A somewhat more complex query
qry <- pool %>%
  tbl("flights") %>%  
  group_by(carrier) %>% 
  summarise(avg_dist = mean(distance)) %>% 
  arrange(desc(avg_dist)) %>% 
  head(n = 3) 

qry 

# Source:     lazy query [?? x 2] 
# Database:   mysql 5.6.35 [root@127.0.0.1:/test_db] 
# Ordered by: desc(avg_dist) 
#   carrier avg_dist 
#          
# 1      HA 4983.000 
# 2      VX 2499.482 
# 3      AS 2402.000 

With collect(), we can store the result of the SQL statement as an R object.

# Store the SQL result in R
qry %>% 
  collect() 

# A tibble: 3 x 2 
#   carrier avg_dist 
#          
# 1      HA 4983.000 
# 2      VX 2499.482 
# 3      AS 2402.000 

Conclusion

Working with databases can be made much easier and safer by using the right packages. Thanks to dplyr, we no longer have to cram online SQL tutorials to write complex SQL queries and can instead spend the freed-up time on more important things, for example devouring the delicious sweets at STATWORX, as Jessica and David noted in their blog posts.

References

  1. Borges, Barbara (2017). pool: Object Pooling. R Package Version 0.1.3. URL: https://CRAN.R-project.org/package=pool
  2. Datenbanken verstehen. Was ist eine Datenbank? URL: http://www.datenbanken-verstehen.de/datenbank-grundlagen/datenbank/
  3. Wickham, Hadley (2017a). Flights that Departed NYC in 2013. R Package. URL: https://CRAN.R-project.org/package=nycflights13
  4. Wickham, Hadley (2017b). dbplyr: A ‚dplyr‘ Back End for Databases. R Package Version 1.1.0. URL: https://CRAN.R-project.org/package=dbplyr

For some time now, Shiny has been available as a package for the statistical software R, which allows you to build appealing, interactive web applications while having access to the full functional range of R.

With the R package Shiny, HTML/JavaScript-based, interactive web applications can be created quickly. The possible use cases are manifold: reporting, deployment of statistical analyses, interactive visualizations of data sets. In principle, no HTML or JavaScript knowledge is required to build a Shiny application, since the entire programming of the app takes place directly in R. Only for customizing the application, i.e. adapting colors, logos, fonts and layouts, are basic skills in HTML/CSS/JavaScript needed.

A particular advantage of Shiny is that the interactive web app can access the full functional range of R. R currently offers more than 10,000 packages for statistical analyses, data mining and predictive analytics and is thus one of the most important statistical software packages on the market. Moreover, R is open-source software and therefore, just like the Shiny package, available free of charge.

Shiny allows the user to select or vary different input parameters in a visually appealing and clear web interface, on the basis of which a calculation and analysis is started in R. To do this, two Shiny programs are written: one defining the user interface (ui.r) and a second one in which the analysis with R takes place (server.r). On the input side, the user can choose from sliders, drop-downs, text and number inputs as well as radio buttons. Importing local Excel files and processing the data accordingly in the Shiny application is also possible. On the output side, tables, summaries, plots, maps and text are available, which are adapted reactively, on the fly, depending on the selected input parameters. In principle, any R package can be integrated into a Shiny app, which opens up almost unlimited possibilities.
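
As a minimal, generic illustration of this ui/server split (a sketch, not taken from the article), such an app could look like this:

library(shiny)

# ui.r: defines the user interface with one input and one output
ui <- fluidPage(
  sliderInput("n", "Number of observations", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# server.r: runs the R analysis reactively whenever the input changes
server <- function(input, output, session) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Reactive histogram")
  })
}

shinyApp(ui, server)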

The finished application can be shared, for example, via GitHub gist, as an R package or as a zip folder, and can be run on any machine that has R and Shiny installed. In addition, the web app can be made available online on a Linux-based Shiny Server. There is a free open-source edition as well as a paid enterprise edition that contains additional features important for companies.

Conclusion: Thanks to Shiny, it is possible to create appealing, interactive applications quickly and easily that can draw on the full statistical functionality of R.
