Data Science, Machine Learning und KI
Kontakt

Did you know that you can transform plain old static ggplot graphs into animated ones? Well, you can, with the help of the package gganimate by RStudio’s Thomas Lin Pedersen and David Robinson, and the results are amazing! My STATWORX colleagues and I are very impressed by how effortlessly all kinds of geoms are transformed into suuuper smooth animations. That’s why in this post I’ll provide a short overview of some of the wonderful functionalities of gganimate. I hope you’ll enjoy them as much as we do!

Since Valentine’s Day is just around the corner, we’re going to explore the Speed Dating Experiment dataset compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar. Hopefully, we’ll learn about gganimate as well as how to find our Valentine. If you like, you can download the data from Kaggle.

Defining the basic animation: transition_*

How are static plots put into motion? Essentially, gganimate creates subsets of the data, which are plotted individually and constitute the substantive frames that, played consecutively, make up the basic animation. The results look so seamless because gganimate takes care of the so-called tweening for us: it calculates the data points for transition frames that are displayed in between the frames showing actual input data.

The transition_* functions define how the data subsets are derived and thus determine the general character of any animation. In this blog post we’re going to explore three types of transitions: transition_states(), transition_reveal() and transition_filter(). But let’s start at the beginning.

We’ll start with transition_states(). Here the data is split into subsets according to the categories of the variable provided to the states argument. If several rows of a dataset pertain to the same unit of observation and should be identifiable as such, a grouping variable defining the observation units needs to be supplied. Alternatively, an identifier can be mapped to any other aesthetic.

Please note: to ensure the readability of this post, all text concerning the interpretation of the speed dating data is written in italics. If you’re not interested in that part, you can simply skip those paragraphs. For the data prep, I’d like to refer you to my GitHub.

First, we’re going to explore what the participants of the Speed Dating Experiment look for in a partner. Participants were asked to rate the importance of attributes in a potential date by allocating a budget of 100 points to several characteristics, with higher values denoting a higher importance. The participants were asked to rate the attributes according to their own views. Further, the participants were asked to rate the same attributes according to the presumed wishes of their same-sex peers, meaning they allocated the points in the way they supposed their average same-sex peer would do.

We’re going to plot all of these ratings (x-axis) for all attributes (y-axis). Since we want to compare the individual wishes to the individually presumed wishes of peers, we’re going to transition between both sets of ratings. Color always indicates the personal wishes of a participant. A given bubble indicates the rating of one specific participant for a given attribute, switching between one’s own wishes and the wishes assumed for peers.

## Static Plot
# ...characteristic vs. (presumed) rating...
# ...color&size mapped to own rating, grouped by ID
plot1 <- ggplot(df_what_look_for, 
       aes(x = value,
           y = variable,
           color = own_rating, # bubbles are always colored according to own wishes
           size = own_rating,
           group = iid)) + # identifier of observations across states
  geom_jitter(alpha = 0.5, # to reduce overplotting: jittering & alpha
              width = 5) + 
  scale_color_viridis(option = "plasma", # use viridis' plasma scale
                      begin = 0.2, # limit range of used hues
                      name = "Own Rating") +
  scale_size(guide = FALSE) + # no legend for size
  labs(y = "", # no axis label
       x = "Allocation of 100 Points",  # x-axis label
       title = "Importance of Characteristics for Potential Partner") +
  theme_minimal() +  # apply minimal theme
  theme(panel.grid = element_blank(),  # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot
plot1 + 
  transition_states(states = rating) # animate contrast subsets acc. to variable rating  
tran-states

First off, if you’re a little confused about which state is which, please be patient; we’ll explore dynamic labels in the section about ‘frame variables’.

It’s apparent that different people look for different things in a partner, yet attractiveness is often prioritized over other qualities. At the same time, the importance of attractiveness varies more strongly between individuals than that of any other attribute. Interestingly, people are quite aware that their peers’ ratings might differ from their own views. Further, the collective presumptions about others (i.e., the mean values) are not completely off, but they show a higher variance than the actual ratings.

So there is hope for all of us that somewhere out there somebody is looking for someone just as ambitious or just as intelligent as ourselves. However, it’s not always the inner values that count.

gganimate allows us to tailor the details of the animation to our wishes. With the argument transition_length we can define the relative duration of the transition from one subset of real data to the next, and with state_length how long, relatively speaking, each subset of original data is displayed. Only if the wrap argument is set to TRUE will the last frame be morphed back into the first frame of the animation, creating an endless, seamless loop. Of course, the arguments of different transition functions may vary.

## Animated Plot
# ...replace default arguments
plot1 + 
  transition_states(states = rating,
                    transition_length = 3, # 3/4 of total time for transitions
                    state_length = 1, # 1/4 of time to display actual data
                    wrap = FALSE) # no endless loop
tran-states-arguments

Styling transitions: ease_aes

As mentioned before, gganimate takes care of tweening and calculates additional data points to create smooth transitions between successively displayed points of actual input data. With ease_aes we can control which so-called easing function is used to ‘morph’ original data points into each other. The default argument declares the easing function for all aesthetics in a plot. Alternatively, easing functions can be assigned to individual aesthetics by name. Amongst others, quadratic, cubic, sine and exponential easing functions are available, with the linear easing function being the default. These functions can be customized further by adding a modifier suffix: with -in the function is applied as-is, with -out the function is applied in reverse, and with -in-out the function is applied as-is in the first half of the transition and reversed in the second half.

Here I played around with an easing function that models the bouncing of a ball.

## Animated Plot
# ...add special easing function
plot1 + 
  transition_states(states = rating) + 
  ease_aes("bounce-in") # bouncy easing function, as-is
tran-states-ease
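As a small sketch of the per-aesthetic option mentioned above, easing functions can also be set for individual aesthetics by name while a default covers the rest (the specific combination below is my own choice, not from the original post):

```r
## Animated Plot
# ...assign different easing functions to individual aesthetics
plot1 + 
  transition_states(states = rating) + 
  ease_aes(default = "cubic-in-out", # smooth easing for all aesthetics...
           size = "bounce-out")      # ...except size, which gets a bouncy one
```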

Dynamic labelling: {frame variables}

To ensure that we, mesmerized by our animations, do not lose the overview, gganimate provides so-called frame variables that contain metadata about the animation as a whole or about the previous/current/next frame. The frame variables – when wrapped in curly brackets – are available for string literal interpretation within all plot labels. For example, we can label each frame with the value of the states variable that defines the currently (or soon to be) displayed subset of actual data:

## Animated Plot
# ...add dynamic label: subtitle with current/next value of states variable
plot1 +
  labs(subtitle = "{closest_state}") + # add frame variable as subtitle
  transition_states(states = rating) 
tran-states-label

The set of available variables depends on the transition function. To get a list of the frame variables available for any animation (by default the last one rendered), the frame_vars() function can be called; it returns both the names and the values of the available variables.
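A minimal sketch of this, assuming an animation has just been rendered (the exact columns returned depend on the transition used):

```r
# inspect the frame variables of the most recently rendered animation
meta <- frame_vars() # returns a data frame with one row per frame
head(meta) # e.g. frame index, progress, and transition-specific variables
```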

Indicating previous data: shadow_*

To accentuate the interconnection of different frames, we can apply one of gganimate’s ‘shadows’. By default, shadow_null() is used, i.e. no shadow is added to the animation. In general, shadows display the data points of past frames in different ways: shadow_trail() creates a trail of evenly spaced data points, while shadow_mark() displays all raw data points.

We’ll use shadow_wake() to create a little ‘wake’ of past data points, which gradually shrink and fade away. The argument wake_length allows us to set the length of the wake relative to the total number of frames. Since the wakes overlap, the transparency of the geoms might need adjustment. Obviously, for plots with lots of data points, shadows can impede intelligibility.

plot1B + # same as plot1, but with alpha = 0.1 in geom_jitter
  labs(subtitle = "{closest_state}") +  
  transition_states(states = rating) +
  shadow_wake(wake_length = 0.5) # adding shadow
tran-states-shadow
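The other shadow types mentioned above work analogously. As a sketch, the trail variant could look like this (the distance value is my own choice for illustration):

```r
## Animated Plot
# ...alternative shadow: evenly spaced trail of past data points
plot1B + 
  labs(subtitle = "{closest_state}") + 
  transition_states(states = rating) + 
  shadow_trail(distance = 0.1) # keep a data point every 10% of the animation
```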

The benefits of transition_*

While I simply love the visuals of animated plots, I think they also offer actual improvements. Compared to faceting, transition_states has the advantage of making it easier to track individual observations through transitions. Further, no matter how many subplots we want to explore, we neither need lots of space and clutter our document with dozens of plots, nor do we have to put up with tiny plots.

Similarly, transition_reveal holds additional value for time series by mapping a time variable not only to one of the axes but also to actual time: the transition length between the individual frames of actual input data corresponds to the relative time differences between the mapped events. To illustrate this, let’s take a quick look at the ‘success’ of all the speed dates across the different speed dating events:

## Static Plot
# ... date of event vs. interest in second date for women, men or couples
plot2 <- ggplot(data = df_match,
                aes(x = date, # date of speed dating event
                    y = count, # interest in 2nd date
                    color = info, # which group: women/men/reciprocal
                    group = info)) +
  geom_point(aes(group = seq_along(date)), # needed, otherwise transition doesn't work
             size = 4, # size of points
             alpha = 0.7) + # slightly transparent
  geom_line(aes(lty = info), # line type according to group
            alpha = 0.6) + # slightly transparent
  labs(y = "Interest After Speed Date",
       x = "Date of Event",
       title = "Overall Interest in Second Date") +
  scale_linetype_manual(values = c("Men" = "solid", # assign line types to groups
                                   "Women" = "solid",
                                   "Reciprocal" = "dashed"),
                        guide = FALSE) + # no legend for linetypes
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # y-axis in %
  scale_color_manual(values = c("Men" = "#2A00B6", # assign colors to groups
                                "Women" = "#9B0E84",
                                "Reciprocal" = "#E94657"),
                     name = "") +
  theme_minimal() + # apply minimal theme
  theme(panel.grid = element_blank(), # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot
plot2 +
  transition_reveal(along = date) 
trans-reveal

Displayed are the percentages of women and men who were interested in a second date after each of their speed dates as well as the percentage of couples in which both partners wanted to see each other again.

Most of the time, women were more interested in second dates than men. Further, the attraction between dating partners often didn’t go both ways: the instances in which both partners of a couple wanted a second date were always far less frequent than the general interest of either men or women. While it’s hard to identify the most romantic time of the year, according to the data there seemed to be a slack in romance in early autumn. Maybe everybody was still heartbroken over their summer fling? Fortunately, Valentine’s Day is in February.

Another very handy option is transition_filter(); it’s a great way to present selected key insights from your data exploration. Here the animation browses through data subsets defined by a series of filter conditions, and it’s up to you which subsets you want to stage. The data is filtered according to logical statements defined in transition_filter(): all rows for which a statement holds true are included in the respective subset. We can assign names to the logical expressions, which can then be accessed as frame variables. If the keep argument is set to TRUE, the data of previous frames is permanently displayed in later frames.

I want to explore, whether one’s own characteristics relate to the attributes one looks for in a partner. Do opposites attract? Or do birds of a feather (want to) flock together?

Displayed below are the importance ratings the speed dating participants assigned to different attributes of a potential partner. Contrasted are subsets of participants who were rated especially funny, attractive, sincere, intelligent or ambitious by their speed dating partners. The rating scale went from 1 = low to 10 = high; thus, I consider values above 7 to be rather outstanding.

## Static Plot (without geom)
# ...importance ratings for different attributes
plot3 <- ggplot(data = df_ratings, 
                 aes(x = variable, # different attributes
                     y = own_rating, # importance regarding potential partner
                     size = own_rating, 
                     color = variable, # different attributes
                     fill = variable)) +
  labs(x = "Attributes of Potential Partner", # x-axis label
       y = "Allocation of 100 Points (Importance)",  # y-axis label
       title = "Importance of Characteristics of Potential Partner", # title
       subtitle = "Subset of {closest_filter} Participants") + # dynamic subtitle 
  scale_color_viridis_d(option = "plasma", # use viridis scale for color 
                        begin = 0.05, # limit range of used hues
                        end = 0.97,
                        guide = FALSE) + # don't show legend
  scale_fill_viridis_d(option = "plasma", # use viridis scale for filling
                       begin = 0.05, # limit range of used hues
                       end = 0.97, 
                       guide = FALSE) + # don't show legend
  scale_size_continuous(guide = FALSE) + # don't show legend
  theme_minimal() + # apply minimal theme
  theme(panel.grid = element_blank(),  # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot 
# ...show ratings for different subsets of participants
plot3 +
  geom_jitter(alpha = 0.3) +
  transition_filter("More Attractive" = Attractive > 7, # adding named filter expressions
                    "Less Attractive" = Attractive <= 7,
                    "More Intelligent" = Intelligent > 7,
                    "Less Intelligent" = Intelligent <= 7,
                    "More Fun" = Fun > 7,
                    "Less Fun" = Fun <= 7) 
trans-filter

Of course, the number of extraordinarily attractive, intelligent or funny participants is relatively low. Surprisingly, there seems to be little difference between what the average low- vs. high-scoring participants look for in a partner. Rather, the lower-scoring group includes more people with outlying expectations regarding certain characteristics. Individual tastes seem to vary more or less independently of individual characteristics.

Styling the (dis)appearance of data: enter_* / exit_*

Especially if the displayed subsets of data do not overlap, or overlap only partially, it can be favorable to underscore this visually. A good way to do this is with the enter_*() and exit_*() functions, which enable us to style the entry and exit of data points that do not persist between frames.

There are many combinable options: data points can simply (dis)appear (the default), fade (enter_fade()/exit_fade()), grow or shrink (enter_grow()/exit_shrink()), gradually change their color (enter_recolor()/exit_recolor()), fly (enter_fly()/exit_fly()) or drift (enter_drift()/exit_drift()) in and out.

We can use these stylistic devices to emphasize changes in the underlying data of different frames. I used exit_fade() to let data points that are no longer included fade away gradually while flying them out of the plot area on a vertical route (y_loc = 100); data points re-entering the sample fly in vertically from the bottom of the plot (y_loc = 0):

## Animated Plot 
# ...show ratings for different subsets of participants
plot3 +
  geom_jitter(alpha = 0.3) +
  transition_filter("More Attractive" = Attractive > 7, # adding named filter expressions
                    "Less Attractive" = Attractive <= 7,
                    "More Intelligent" = Intelligent > 7,
                    "Less Intelligent" = Intelligent <= 7,
                    "More Fun" = Fun > 7,
                    "Less Fun" = Fun <= 7) +
  enter_fly(y_loc = 0) + # entering data: fly in vertically from bottom
  exit_fly(y_loc = 100) + # exiting data: fly out vertically to top...
  exit_fade() # ...while color is fading
trans-filter-exit-enter

Finetuning and saving: animate() & anim_save()

Gladly, gganimate makes it very easy to finalize and save our animations. We can pass our finished gganimate object to animate() to, amongst other things, define the number of frames to be rendered (nframes), the rate of frames per second (fps) and/or the number of seconds the animation should last (duration). We also have the option to define the device in which the individual frames are rendered (the default is device = “png”, but all popular devices are available). Further, we can define arguments that are passed on to the device, e.g. width or height. Note that simply printing a gganimate object is equivalent to passing it to animate() with default arguments. If we plan to save our animation, the renderer argument is of importance: the function anim_save() lets us effortlessly save any gganimate object, but only if it was rendered using either magick_renderer() or the default gifski_renderer().

The function anim_save() is quite straightforward to use. We can define a filename and path (defaults to the current working directory) as well as the animation object (defaults to the most recently created animation).

# create a gganimate object
gg_animation <- plot3 +
  transition_filter("More Attractive" = Attractive > 7,
                    "Less Attractive" = Attractive <= 7) 

# adjust the animation settings 
animate(gg_animation, 
        width = 900, # 900px wide
        height = 600, # 600px high
        nframes = 200, # 200 frames
        fps = 10) # 10 frames per second

# save the last created animation to the current directory 
anim_save("my_animated_plot.gif")
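A sketch of the renderer argument mentioned above, with all anim_save() arguments spelled out explicitly (the file and directory names here are made up for illustration):

```r
# explicitly render with the gifski renderer (the default), then save
rendered <- animate(gg_animation,
                    renderer = gifski_renderer()) # required for anim_save()

anim_save(filename = "my_animated_plot.gif",
          animation = rendered, # pass the animation object explicitly
          path = "animations") # instead of the current working directory
```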

Conclusion (and a Happy Valentine’s Day)

I hope this blog post gave you an idea of how to use gganimate to upgrade your own ggplots to beautiful and informative animations. I’ve only scratched the surface of gganimate’s functionality, so please do not mistake this post for an exhaustive description of the presented functions or the package. There is much out there for you to explore, so don’t wait any longer and get started with gganimate!

But even more important: don’t wait on love. The speed dating data shows that most likely there’s someone out there looking for someone just like you. So from everyone here at STATWORX: Happy Valentine’s Day!

heart gif
## 8 bit heart animation
animation2 <- ggplot(data = df_eight_bit_heart %>% # includes color and x/y position of pixels 
         dplyr::mutate(id = row_number()), # create row number as ID  
                aes(x = x, 
                    y = y,
                    color = color,
                    group = id)) +
  geom_point(size = 18, # depends on height & width of animation
             shape = 15) + # square
  scale_color_manual(values = c("black" = "black", # map values of color to actual colors
                                "red" = "firebrick2",
                                "dark red" = "firebrick",
                                "white" = "white"),
                     guide = FALSE) + # do not include legend
  theme_void() + # remove everything but geom from plot
  transition_states(-y, # reveal from high to low y values 
                    state_length = 0) +
  shadow_mark() + # keep all past data points
  enter_grow() + # new data grows 
  enter_fade() # new data starts without color

animate(animation2, 
        width = 250, # depends on size defined in geom_point 
        height = 250, # depends on size defined in geom_point 
        end_pause = 15) # pause at end of animation

 

 

Summer in Frankfurt means blue skies, warm nights and happy people all around. That’s why STATWORX made it a tradition to have a company barbecue and enjoy some good food and drink while mingling with colleagues and friends. Moreover, we had a great reason to celebrate: as of late we’re the STATWORX GmbH (engl.: limited company). So, last Friday we had a party!

pictures GmbH party

Work hard, party hard

But not so fast: before the first beers were opened, our CEO and CFO gave us an overview of the company’s current situation and the general plans for the future. (Side note: you know you’re in a room full of consultants if half of the audience is so intrigued by the spot-on design of the PowerPoint slides that the presented content almost becomes secondary!)

After several insightful hours of presentation, motivated by past accomplishments and eager to put the presented game plan into practice, we finally began to prepare the party. I must say it was really cool to see how smoothly everyone was working together. In almost no time we had set up the tables, grills and decorations in the backyard. When the buffet table was loaded up with home-made goodies, it became evident that some of us are not only talented statisticians, marketing geniuses or brilliant data scientists but also seasoned chefs.

Soon, the first guests started to arrive, and after the first vegetarian steaks and traditional sausages were sizzling on the grill, the party was instantly in full swing. Even a short shower of rain couldn’t dampen the mood. Maybe, the exotic and simply delicious cocktails mixed by our very own barkeeper might have helped with that. Let me tell you, the non-alcoholic Mojito was a revelation.

We spent the evening connecting with new colleagues, old colleagues, and their friends and families. It was also very cool to chat with our Swiss colleagues in person again, who were kind enough to bring an exceptional Swiss specialty – a whole box of Luxemburgerli! – all the way with them. The evening was topped off by several former colleagues who stopped by to catch up.

To not overstrain the patience and ears of our lovely neighbours, we moved the party inside. On the makeshift dancefloor, laser show included, we danced our hearts out to the finest tunes played by our current CEO and former DJ. And since it wouldn’t be a STATWORX production without some facts and figures, here are some stats about the barbecue:

Would you like to be part of the next STATWORX summer barbecue? You can become part of the team. We’re looking forward to meeting you!

 

 

It’s Valentine’s Day, making this the most romantic time of the year. But actually, 2018 was already a year full of love here at STATWORX: many of my STATWORX colleagues got engaged. And so we began to wonder – some fearful, some hopeful – who will be next? Today we’re going to tackle this question in the only true way: with data science!

Gathering the Data

To get my data, I surveyed my colleagues. I asked my (to be) married colleagues to answer my questions based on the very day they got engaged; my single colleagues answered with respect to their current situation. I asked them about some factors that I’ve always suspected to influence someone’s likelihood of getting married. For example, I’m sure that compared to Python users, R users are much more romantic. The indiscreet questions I badgered my coworkers with were:

  • Are you married or engaged?
  • How long have you been in your relationship?
  • Is your employment permanent?
  • How long have you been working at STATWORX?
  • What’s your age?
  • Are you living together with your partner?
  • Are you co-owning a pet with your partner?
  • What’s your preferred programming language: R, Python or neither?

I’m going to treat the relationship status as a dichotomous variable: married or engaged vs. single or “only” dating. To maintain some of my colleagues’ privacy, I gave them all randomly (!!) chosen pet names. (Side note: there really is a subreddit for everything.)

Descriptive Exploration

Since the first step in generating a data-driven answer should always be a descriptive exploration of the data at hand, I made some plots.

First, I took a look at the absolute frequencies of preferred programming languages among single vs. married or engaged STATWORX employees. I fear the romantic nature of R users is not the explanation we’re looking for:

# reformatting the target variable
df1 <- df %>%
  dplyr::mutate(engaged = ifelse(engaged == "yes", 
                                 "Engaged or Married", 
                                 "Single")) %>%
  dplyr::group_by(`primary programming language`, engaged) %>%
  dplyr::summarise(freq = n(),
                   image = "~/Desktop/heart_red.png") 

# since in geom_image size cannot be mapped to variable
# multiple layers of data subsets  
ggplot() +
  geom_image(data = filter(df1, freq == 1), 
             aes(y = `primary programming language`,
                 x = engaged, 
                 image = image), 
             size = 0.1) + 
  geom_image(data = filter(df1, freq > 1 & freq <= 5), 
             aes(y = `primary programming language`, 
                 x = engaged, 
                 image = image),
             size = 0.2) +
  geom_image(data = filter(df1, freq >= 13), 
             aes(y= `primary programming language`, 
                 x = engaged, 
                 image = image),
             size = 0.3) +
  geom_text(data = df1, 
            aes(y =`primary programming language`, 
                x = engaged, 
                label = freq), 
            color = "white", size = 4) +
  ylab("Preferred programming language") +
  xlab("\nAbsolute frequencies") +
  theme_minimal()
programming languages frequencies

I also explored the association of relationship status with the more conventional factors of age and relationship duration. And indeed, those of my colleagues who are in their late twenties or older and have been partnered for a while are mostly married or engaged.

# plotting age and relationship duration vs. relationship status

ggplot() +
# doing this only to get a legend:
  geom_point(data = df,
             aes(x = age, y = `relationship duration`,
                 color = engaged), shape = 15) + 
  geom_image(data = filter(df, engaged == "yes"), 
             aes(x = age, y = `relationship duration`,
                 image = "~/Desktop/heart_red.png")) +
  geom_image(data = filter(df, engaged == "no"), 
             aes(x = age, y = `relationship duration`,
                 image = "~/Desktop/heart_black.png")) +
  ylab("Relationship duration\n") +
  xlab("\nAge") +
  scale_color_manual(name = "Married or engaged?",
                     values = c("#000000", "#D00B0B")) +
  scale_x_continuous(breaks = pretty_breaks()) +
  theme_minimal() +
  theme(legend.position = "bottom")
age relationship duration

Statistical Models

I’ll employ some statistical models, but the data basis is rather small. Therefore, our model options are somewhat limited (and of course only suitable for entertainment). But it’s still possible to fit a decision tree, which might help pinpoint the circumstances due to which some of us are still waiting for that special someone to put a ring on (it).

# recoding target to get more understandable labels
df <- df %>%
  dplyr::mutate(engaged = ifelse(engaged == "yes", 
                                 "(to be) married", 
                                 "single"))

# growing a decision tree with a ridiculously low minsplit
fit <- rpart(engaged ~ `relationship duration` + age + 
             `shared pet` + `permanent employment` +
             cohabitating + `years at STATWORX`,
             control = rpart.control(minsplit = 2), # overfitting ftw
             method = "class", data = df)

# plotting the tree
rpart.plot(fit, type = 3, extra = 2, 
           box.palette = c("#D00B0B", "#fae6e6"))

relationship decision tree

Our decision tree implies that the unintentionally unmarried among us should maybe consider moving in with their partner, since cohabitating seems to be the most important factor.

But that still doesn’t exactly answer the question, who of us will be next. To predict our chances to get engaged, I estimated a logistic regression.

We see that cohabitating, one’s age and the time spent working at STATWORX are accompanied by a higher probability of being (or soon being) married. However, simply having been together for a long time or owning a pet together with one’s partner does not help. (Although I assume that this rather unintuitive interrelation is caused by a certain outlier in the data – “Honey”, I’m looking at you!)

Finally, I got the logistic regression’s predicted probabilities for all of us to be married or engaged. As you can see down below, the single days of “Teddy Bear”, “Honey”, “Sweet Pea” and “Babe” seem to be numbered.

# reformatting the target variable
df <- df %>%
  dplyr::mutate(engaged = ifelse(engaged == "(to be) married", 1, 0))

# in-sample fitting: estimating the model 
log_reg <- glm(engaged ~ `relationship duration` + age +
               `shared pet` + `permanent employment` + 
               cohabitating + `years at STATWORX`,
              family = binomial, data = df)

df$probability <- predict(log_reg, df, type = "response")

# plotting the predicted probabilities
ggplot() +
  # again, doing this only to get a legend:
  geom_point(data = df,
             aes(x = probability, y = nickname,
                 color = as.factor(engaged)), shape = 15) + 
  geom_image(data = filter(df, engaged == 1), 
             aes(x = probability, y = nickname,
                 image = "~/Desktop/heart_red.png")) +
  geom_image(data = filter(df, engaged == 0), 
             aes(x = probability, y = nickname,
                 image = "~/Desktop/heart_black.png")) +
  ylab(" ") +
  xlab("Predicted Probability") +
  scale_color_manual(name = "Married or engaged?",
                     values = c("#000000", "#D00B0B"),
                     labels = c("no", "yes")) +
  scale_x_continuous(breaks = pretty_breaks()) +
  theme_minimal() +
  theme(legend.position = "bottom")
predicted probabilities for marriage

I hope this was as insightful for you as it was for me. And to all of us, whose hopes have been shattered by cold, hard facts, let’s remember: there are tons of discounted chocolates waiting for us on February 15th.

In the last post of this series of the STATWORX Blog, we explored ggplot2’s themes, which control the display of all non-data components of a plot.

ggplot2 comes with several built-in themes that can easily be applied to any plot. However, one can tweak these out-of-the-box styles using the theme() function; we did this last time. Furthermore, one can also create a completely customized theme, which is what we’re going to do in this post.

How the masters do it

When we create a theme from scratch, we have to define all arguments of the theme function and set the complete argument to TRUE, signaling that the generated object indeed is a complete theme. Setting complete = TRUE also causes all elements to inherit from blank elements. This means that every element we want to show up in our plots has to be defined in full detail.

If we are a little lazy, instead of defining each and every argument, we can also start with an existing theme and alter only some of its arguments. Actually, this is exactly what the creators of ggplot2 did: while theme_gray() is "the mother of all themes" and fully defined, theme_bw(), for example, builds upon theme_gray(), while theme_minimal() in turn builds on theme_bw().

We can see how tedious it is to define a complete theme if we sneak a peek at the code of theme_grey() on ggplot2's GitHub repository. Further, it is obvious from the code of theme_bw() or theme_minimal() how much more convenient it is to create a new theme by building on an existing one.
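To make the pattern concrete, here is a condensed sketch of this approach. It is not the actual ggplot2 source code, just an illustration of how a derived theme can start from theme_grey() and replace only a handful of elements via %+replace%:

```r
library(ggplot2)

# Illustrative sketch (not the real theme_bw() definition): start from
# theme_grey() and swap out only the elements that should differ
theme_bw_sketch <- function(base_size = 11, base_family = "") {
  theme_grey(base_size = base_size, base_family = base_family) %+replace%
    theme(
      # white panel with a grey border instead of the grey default
      panel.background = element_rect(fill = "white", colour = NA),
      panel.border     = element_rect(fill = NA, colour = "grey20"),
      panel.grid       = element_line(colour = "grey92"),
      legend.key       = element_rect(fill = "white", colour = NA),
      complete = TRUE
    )
}
```

Everything not listed in the theme() call is simply inherited from theme_grey().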

Creating our very own theme

What's good enough for Hadley and friends is good enough for me. Therefore, I'm going to create my own theme based on my favourite theme, theme_minimal(). As can be seen on the GitHub repo, we can create a new theme as a function that calls an existing theme, which is then altered via the %+replace% operator, with all alterations defined in a theme() call.

Several arguments are passed along to the function constituting the new theme and on to the existing theme called within it: the default sizes for text (base_size), lines in general (base_line_size) and lines pertaining to rect objects (base_rect_size), as well as the font family. To ensure a consistent look, all other sizes aren't defined in absolute terms but relative to the base sizes, using the rel() function. Therefore, for especially big or small plots the base sizes can be increased or decreased, with all other elements being adjusted automatically.

library(ggplot2)
library(gridExtra)
library(dplyr)

# generating new theme
theme_new <- function(base_size = 11,
                      base_family = "",
                      base_line_size = base_size / 170,
                      base_rect_size = base_size / 170){
  theme_minimal(base_size = base_size, 
                base_family = base_family,
                base_line_size = base_line_size,
                base_rect_size = base_rect_size) %+replace%
    theme(
      plot.title = element_text(
        color = rgb(25, 43, 65, maxColorValue = 255), 
        face = "bold",
        hjust = 0),
      axis.title = element_text(
        color = rgb(105, 105, 105, maxColorValue = 255),
        size = rel(0.75)),
      axis.text = element_text(
        color = rgb(105, 105, 105, maxColorValue = 255),
        size = rel(0.5)),
      panel.grid.major = element_line(
        color = rgb(105, 105, 105, maxColorValue = 255),
        linetype = "dotted"),   
      panel.grid.minor = element_line(
        color = rgb(105, 105, 105, maxColorValue = 255),
        linetype = "dotted", 
        size = rel(4)),   
      
      complete = TRUE
    )
}

Unlike in theme_minimal(), I'm decreasing the base size to 11 and setting the base line size and base rect size to the base size divided by 170. I don't change the font family. The plot title is changed to a bold, dark blue font in the set base size and is left-aligned. Axis titles and axis text are set to 75% and 50% of the base size, while their colour is changed to a light grey. Finally, the grid lines are defined to be dotted and light grey, with the major grid lines having the base line size and the minor grid lines four times that size.

The result looks like this:

# base plot
base_plot <- data.frame(x = rnorm(n = 100, 1.5, 2),
                        y = rnorm(n = 100, 1, 2),
                        z = c(rnorm(n = 60, 0.5, 2), rnorm(n = 40, 5, 3))) %>%
  ggplot(.) +
  geom_jitter(aes(x = x, y = y, color = z, size = z), 
              alpha = 0.5) +
  geom_jitter(aes(x = x, y = y, size = z), 
              alpha = 0.8,
              shape = 21, 
              color = "white",  
              stroke = 0.4) +
  scale_size_continuous(range = c(1, 18), breaks = c(1,  4, 5, 13, 18)) +
  guides(size = FALSE, color = FALSE) +
  labs(y = "Flight Height", x = "Flight Distance")

# plot with customized theme
p1 <- base_plot +
  ggtitle("Bubbles - theme_new()") +
  theme_new()

# plot with theme minimal
p2 <- base_plot +
    ggtitle("Bubbles - theme_minimal()") +
    theme_minimal()

grid.arrange(p1, p2, nrow = 2)

For later use, we can save our theme-generating script and source our customized theme function whenever we want to make use of our created theme.
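For instance, assuming the function above lives in a file called theme_new.R (the file name is just an example):

```r
# theme_new.R contains the theme_new() function defined above;
# sourcing it makes the theme available in any analysis script
source("theme_new.R")

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_new()
```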

So go ahead, be creative and build your signature theme!

If you want more details, our Data Visualization with R workshop may interest you!

As noted elsewhere, sometimes beauty matters. A plot that's pleasing to the eye will be considered more readily, and thus might be understood more thoroughly. Also, since we at STATWORX often need to summarize and communicate our results, we have come to appreciate how a nice plot can upgrade any presentation.

So how do we make a plot look good? How do we make it conform to given style guidelines? In ggplot2, the display of all non-data components is controlled by the theme system. Unlike in some other packages, the appearance of plots is edited after all the data-related elements of the plot have been determined. The theme system of ggplot2 allows the manipulation of titles, labels, legends, grid lines and backgrounds. There are various built-in themes available that already have an all-around consistent style, pertaining to every detail of a plot.

Pre-defined themes

There are two ways to apply built-in (or otherwise predefined) themes (e.g. theme_grey, theme_bw, theme_linedraw, theme_light, theme_dark, theme_minimal or theme_classic).
For one, they can be added as an additional layer to individual plots:

rm(list = ls())
library(gridExtra)
library(ggplot2)

# generating a fictional data set containing hours of sunshine and temperature
sun_hours <- sample(seq(from = 1, to = 8, by = 0.1), size = 40, replace = TRUE)
noise <- sample(seq(from = 17, to = 24, by = 0.1), size = 40, replace = TRUE)
temperature <-  sun_hours + noise
df_sun <- data.frame(sun_hours, temperature)

# generate the plot base
base_plot <- ggplot(df_sun) +
  geom_point(aes(x = sun_hours, y = temperature, color = temperature), 
             shape = 6, size = 5, stroke = 2) +
  geom_point(aes(x = sun_hours, y = temperature, color = temperature), 
             shape = 21, size = 3.3, fill = "white", stroke = 2) +
  labs(x = "Hours of Sun", y = "Temperature") +
  scale_color_gradient(high = "firebrick", low = "#ffce00", name = " ") +
  ggtitle("Base Plot")

base-plot

# adding predefined themes
p1 <- base_plot +
  theme_classic() +
  ggtitle("Plot with theme_classic()")

p2 <- base_plot +
  theme_bw() +
  ggtitle("Plot with theme_bw()")

p3 <- base_plot +
  theme_dark() +
  ggtitle("Plot with theme_dark()")

p4 <- base_plot +
  theme_light() +
  ggtitle("Plot with theme_light()")

gridExtra::grid.arrange(p1, p2, p3, p4)

different-themes

Alternatively, the default theme that's automatically applied to every plot can be set or retrieved with the functions theme_set() and theme_get(), respectively.

# making the classic theme the default
theme_set(theme_classic())

base_plot +
  ggtitle("Plot with theme_set(theme_classic())")

theme-set

While predefined themes are very convenient, there's always the option to (additionally) tweak the appearance of any non-data detail of a plot via the various arguments of theme(). This can be done for a specific plot or for the currently active default theme. The default theme can be updated or partly replaced via theme_update() and theme_replace(), respectively.

# changing the default theme
theme_update(legend.position = "none")

base_plot +
  ggtitle("Plot with theme_set(theme_classic())\n& theme_update(legend.position = \"none\")")

theme-update
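theme_replace() is not demonstrated in this post; the difference to theme_update() is that updated specifications are merged into the affected theme elements, while replaced elements are swapped out entirely, dropping all their previous properties. A short sketch:

```r
library(ggplot2)

# theme_update() merges into the current default theme and
# returns the previous default, which we can keep for later
old <- theme_update(axis.title = element_text(face = "bold", size = 6))

# theme_replace() swaps the element out completely:
# the bold face set above is dropped, not kept
theme_replace(axis.title = element_text(size = 10))

# restore the previous default theme
theme_set(old)
```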

# changing the theme directly applied to the plot
base_plot +
  theme(legend.position = "bottom") +
  ggtitle('Plot with theme(legend.position = "bottom")')

plus-theme

Element functions

There's a wide range of arguments for theme(), in fact such a wide range that not all of them can be discussed here. Therefore, this blog post is far from exhaustive; it only deals with the general principles of the theme system and provides illustrative examples for a few of the available arguments. The appearance of many elements needs to be specified via one of the four element functions: element_blank, element_text, element_line or element_rect.

  • How labels and titles are displayed is controlled by the element_text function. For example, we can make the title of the y axis bold and increase its size.
  • Borders and backgrounds can be manipulated using element_rect. For example, we can choose the color of the plot's background.
  • Lines can be defined via the element_line function. For example, we can change the line types of the major and minor grid.
  • Further, with element_blank() it is possible to remove an object completely, without any space being dedicated to the plot element.
# using element_text, element_rect, element_line, element_blank
base_plot +
  theme(axis.title.y = element_text(face = "bold", size = 16),
        plot.background = element_rect(fill = "#FED633"),
        panel.grid.major = element_line(linetype = "dashed"),
        panel.grid.minor = element_line(linetype = "dotted"),
        axis.text.y = element_blank(),
        axis.text.x = element_blank()) +
  ggtitle("Plot altered using element functions")

element-function

If we don't want to change the display of some specific plot elements, but of all text, lines, titles or rectangular elements, we can do so by specifying the arguments text, line, rect and title. Specifications passed to these arguments are inherited by all elements of the respective type. This inheritance principle also holds true for other 'parent' arguments. 'Parent' arguments are often easily identifiable, as their names are used as prefixes for all subordinate arguments.

# using overreaching arguments #1
base_plot +
  theme(line = element_line(linetype = "dashed")) +
  ggtitle("Plot with all lines altered by using line")

using-line

# using overreaching arguments #2
base_plot +
  theme(axis.title = element_text(size = 6)) + # here axis.title is the parent
  ggtitle("Plot with both axis titles altered by using axis.title")

using-axistitle

Outlook

Margins, spaces, sizes and orientations of elements are not specified with element functions but have their own sets of possible parameters. For example, the display of legends is controlled by such arguments and specific parameters.

# using parameters instead of element functions
base_plot +
  theme(legend.position = "top") 

using-otherparameters

Since ggplot2 enables manipulating the appearance of non-data elements of plots in great detail, there is a multitude of arguments. This blog post only tries to give a first impression of the many, many possibilities for designing a plot. Some further engagement with the topic might be advisable, but any time invested in understanding how to style plots surely is well spent. If you want to read more on making pretty plots in ggplot2, check out my other posts on coordinate systems or customizing date and time scales. If you want more details, our Data Visualization with R workshop may interest you!

 

References

  • Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer.

 

All plots have coordinate systems. Perhaps because they are such an integral element of plots, they are easily overlooked. However, in ggplot2, there are several very useful options to customize the coordinate systems of plots, which we will not overlook but explore in this blog post.

Since it is spring, we will use a random subset of the famous iris data set. When we plot the petal length against the petal width, map species onto color, and play around a little with the shape, color and size of the aesthetics, we obtain this vernal plot:
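The derivation of df_iris is not shown in the post; a minimal sketch (the subset size and the seed are assumptions) could look like this:

```r
library(dplyr)

set.seed(2019)              # arbitrary seed for reproducibility
df_iris <- iris %>%
  dplyr::sample_n(100)      # random subset of the iris data
```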

plot base iris data

# Base plot
plot_base <- ggplot(data = df_iris) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species), 
             size = 3, alpha = 0.9, shape = 8) +
  geom_point(aes(x = Petal.Length, y = Petal.Width), 
             color = "yellow", size = 0.4) +
  scale_color_manual(values = c("#693FE9", "#A089F8", "#0000FF")) +
  theme_minimal()

Cartesian coordinate system

Zooming in and out

The coordinate system can be manipulated by adding one of ggplot2's different coordinate systems. When you imagine a coordinate system, you are most likely thinking of a Cartesian one. The Cartesian coordinate system combines the x and y dimensions orthogonally and is ggplot2's default (coord_cartesian).

There are also several variations of the familiar Cartesian coordinate system in ggplot2, namely coord_fixed, coord_flip and coord_trans. For all of them, the displayed section of the data can be specified by defining the range of values depicted on the x (xlim =) and y (ylim =) axes. This allows us to "zoom in" or "zoom out" of a plot. It is a great advantage that all manipulations of the coordinate system only alter the depiction of the data, not the data itself.

# Zooming in with xlim/ylim
plot_base +
  coord_cartesian(xlim = c(0, 5), ylim = c(0, 2)) +
  ggtitle("coord_cartesian with xlim = c(0, 5) and ylim = c(0, 2)")

zoomed scatter plot

Specifying the “aspect ratio” of the axes

Via coord_fixed one can specify the exact ratio of the length of a y unit relative to the length of an x unit within the final visualization.

# Setting the "aspect ratio" of y vs. x units
plot_base + 
  coord_fixed(ratio = 1/2) +
  ggtitle("coord_fixed with ratio = 1/2")

plot aspect ratio

Transforming the scales of the axes

This helps to emphasize the exact insight one wants to communicate. Another way to do so is coord_trans, which allows several transformations of the x and y variables (see table below, taken from Wickham 2016, p. 97). Let me stress this again: very conveniently, such transformations only pertain to the depicted, not the actual, scale of the data. This is also the reason why, regardless of the conducted transformation, the original values are used as axis labels.

Name      Function f(x)       Inverse f^{-1}(y)
asn       tanh^{-1}(x)        tanh(y)
exp       e^x                 log(y)
identity  x                   y
log       log(x)              e^y
log10     log_{10}(x)         10^y
log2      log_{2}(x)          2^y
logit     log(x / (1 - x))    1 / (1 + e^{-y})
pow10     10^x                log_{10}(y)
probit    Phi(x)              Phi^{-1}(y)
recip     x^{-1}              y^{-1}
reverse   -x                  -y
sqrt      x^{1/2}             y^2
# Transforming the axes 
plot_base + 
  coord_trans(x = "log", y = "log2") +
  ggtitle('coord_trans with x = "log" and y = "log2"')

transformed data

Swapping the axes

The last of the Cartesian options, coord_flip, swaps the x and y axes. For example, this option can be useful when we intend to change the orientation of univariate plots, such as histograms, or of plot types, like box plots, that visualize the distribution of a continuous variable over the categories of another variable. Nonetheless, coord_flip also works with all other plots. This multiplies the overall possibilities for designing plots, especially since all Cartesian coordinate systems can be combined.

# Swapping axes 
# base plot #2
p1 <- ggplot(data = df_iris) +
  geom_bar(aes(x = Species, fill = Species), alpha = 0.6) +
  scale_fill_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) +
  theme_minimal() 

# base plot & coord_flip()
p2 <- ggplot(data = df_iris) +
  geom_bar(aes(x = Species, fill = Species), alpha = 0.6) +
  scale_fill_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) +
  theme_minimal() +
  coord_flip() 

gridExtra::grid.arrange(p1, p2, top = "Bar plot without and with coord_flip")

fliped coordinate system

Polar coordinate system

The customization of Cartesian coordinate systems allows for the fine-tuning of plots. However, coord_polar, the final coordinate system discussed here, changes the whole character of a plot. By using coord_polar, bar geoms are transformed to pie charts or "bullseye" plots, while line geoms are transformed to radar charts. This is done by mapping x and y to the angle and radius of the resulting plot. By default, the x variable is mapped to the angle, but this can be changed by setting the theta argument of coord_polar to "y".

pie charts
polygon plot

While such plots might shine with respect to novelty and looks, their perceptual properties are intricate, and their correct interpretation may be quite hard and rather unintuitive.

# Base plot 2 (long format, x = 1 is summed up to generate count)
plot_base_2 <- df_iris %>%
  dplyr::mutate(x = 1) %>%
  ggplot(.) +
  geom_bar(aes(x = x, fill = Species), alpha = 0.6) + 
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  scale_fill_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) +
  theme_minimal() +
  ggtitle("base plot")

# Bullseye plot 
# geom_bar & coord_polar(theta = "x")
p2 <- plot_base_2 +
  coord_polar(theta = "x") +
  ggtitle('theta = "x"')

# Pie chart 
# geom_bar & coord_polar(theta = "y")
p3 <- plot_base_2 +
  coord_polar(theta = "y") +
  ggtitle('theta = "y"')

gridExtra::grid.arrange(p2, p3, plot_base_2, top = "geom_bar & coord_polar", ncol = 2)
# Base plot 3 (long format, mean width/length of sepals/petals calculated)
plot_base_3 <- iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(Petal.Length = mean(Petal.Length),
                   Sepal.Length = mean(Sepal.Length),
                   Sepal.Width = mean(Sepal.Width),
                   Petal.Width = mean(Petal.Width)) %>%
  reshape2::melt() %>%
  ggplot() +
  geom_polygon(aes(group = Species, color = Species, y = value, x = variable), 
               fill = NA) +
  scale_color_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) +
  theme_minimal() +
  ggtitle("base plot")

# Radar plot
# geom_polygon & coord_polar
p2 <- plot_base_3 +
  theme_minimal() +
  coord_polar() +
  ggtitle("coord_polar")

gridExtra::grid.arrange(plot_base_3, p2, top = "geom_polygon & coord_polar", ncol = 2)

References

  • Wickham, H. (2016). ggplot2: elegant graphics for data analysis. Springer.

In the last post of this series, we took a first look at strategies for the effective visualization and exploration of data patterns within large data sets. Namely, we examined ways to overcome overplotting, with a focus on a two-dimensional feature space defined by two continuous features. However, oftentimes we want to visualize the distribution of data across several subgroups. For example, subgroups defined by the categories of one feature or even multiple features. As noted in the discussion of overplotting, the mapping of subgroups onto aesthetics, e.g. color, can get quite confusing. This is especially the case for larger numbers of subgroups with overlaying distributions. Therefore, in this blog post we are going to explore a different strategy: faceting.

To do so we are going to use ggplot2's diamonds data set, which comprises information on the price and other features of almost 54,000 diamonds.

In this post we are taking a look at facet_grid and facet_wrap. The basic functionality of both methods is the generation of small multiples of a basic plot for data subsets, defined by the categories of one feature or the unique combinations of categories of several features. The resulting visualizations make it easy to identify similarities or differences in the patterns of the subsets.

rm( list = ls())

library(ggplot2)
library(dplyr)

# Pooling and relabeling some categories for sake of clarity
df_diamonds <- diamonds %>% 
  mutate(color = ifelse(color == "D" | color == "E" | color == "F", 
                        "Colorless", 
                        "Yellowish"),
         clarity = ifelse(clarity == "I1" | clarity == "SI2" | clarity == "SI1", 
                          "Included", 
                          "Nearly Flawless"))

# Generating the base plot
plot_base <- ggplot(data = df_diamonds) +
  geom_histogram(aes(x = price),
                 color = "#A7256A",  
                 fill = "#A7256A",  
                 alpha = 0.7) +
  theme_minimal() +
  labs(x = "Price in $") 

Content and layout of panels

Of course, the most defining parameters of facetted plots are the considered subsets, which for both facet_grid and facet_wrap can be defined via formula notation.

Basically, facet_grid creates the plot version of a contingency table: a two-dimensional grid of plots. The features mapped onto the rows (left of the ~) and columns (right of the ~) are specified in the facets argument. Rows as well as columns of the grid can also be defined by the combinations of multiple features, which is indicated by joining all features to be crossed with a +. If either rows or columns are not specified, . is used as a placeholder.

# facet_grid: cut in rows
# aligned horizontal scales facilitate comparisons of feature on x-axis
plot_base +
  facet_grid(cut ~ .) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-grid-rows.png", width = 11, height = 5)

facet-grid-rows

# facet_grid: cut in columns
# aligned vertical scales facilitate comparisons of feature on y-axis
plot_base +
  facet_grid(facets = . ~ cut) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-grid-colums.png", width = 11, height = 5)

facet-grid-colums

# facet_grid: cut in columns, color in rows
plot_base +
  facet_grid(facets = color ~ cut) +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-grid-colums-and-rows.png", width = 11, height = 5)

facet-grid-colums-and-rows

# facet_grid: all combinations of cut and clarity in columns, color in rows
plot_base +
  facet_grid(facets = color ~ cut + clarity)+
  ggtitle("Price of Diamonds by Cut, Color and Clarity") +
  scale_x_continuous(breaks = c(5000, 15000)) # less x axis breaks
ggsave("facet-grid-colums-and-xrows.png", width = 11, height = 5)

facet-grid-colums-and-xrows

Just as in contingency tables, marginal total plots combining all data within a given row or column can be added via the margins argument. If margins is set to TRUE, all marginal total plots are enabled; margins can also be set to a character vector to enable margin plots only for the specified variables.

# facet_grid: cut in columns, color in rows, margins for cut
plot_base +
  facet_grid(facets = color ~ cut,
             margins = "cut") +
  ggtitle("Price of Diamonds by Cut and Color") +
  scale_x_continuous(breaks = c(5000, 15000)) # less axis breaks
ggsave("facet-grid-one-margin.png", width = 11, height = 5)

facet-grid-one-margins

# facet_grid: cut in columns, color in rows, all margins
plot_base +
  facet_grid(facets = color ~ cut,
             margins = TRUE) +
  ggtitle("Price of Diamonds by Cut and Color") +
  scale_x_continuous(breaks = c(5000, 15000)) # less axis breaks
ggsave("facet-grid-all-margins.png", width = 11, height = 5)

facet-grid-all-margins

Unlike facet_grid, facet_wrap generates a one-dimensional sequence of multiples, which are merely arranged two-dimensionally. To accentuate the intrinsic one-dimensionality of the plot sequence, conventionally within the formula specification of the facets argument, the features which are to define the subsets are combined by + and placed after the ~.

# facet_wrap: one variable 
plot_base +
  facet_wrap(facets =  ~ cut) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-one-var.png", width = 11, height = 5)

facet-wrap-one-var

# facet_wrap: multiple variables 
plot_base +
  facet_wrap(facets = ~ cut + color) +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-multiple-vars.png", width = 11, height = 5)

facet-wrap-multiple-vars

To arrange the panels most efficiently, the layout of panels is kept as square as possible. However, the nrow and ncol arguments allow specifying the number of rows and columns of the panel layout.

# facet_wrap: multiple variables, nrow defined 
plot_base +
  facet_wrap(facets = ~ cut + color,
             nrow = 2) +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-nrow.png", width = 11, height = 5)

facet-wrap-nrow

# facet_wrap: multiple variables, ncol defined 
plot_base +
  facet_wrap(facets = ~ cut + color,
             ncol = 3) +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-ncol.png", width = 11, height = 5)

facet-wrap-ncol

Further, the overall order of the panels can be defined via the dir argument. If the argument is set to "v", panels are arranged down the columns, starting from the top of the leftmost column. If the argument is set to "h", panels are arranged across the rows, starting on the left-hand side of the first row.

# facet_wrap: dir v 
plot_base +
  facet_wrap(facets = ~ cut ,
             dir = "v") +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-v.png", width = 11, height = 5)

facet-wrap-v

# facet_wrap: dir h 
plot_base +
  facet_wrap(facets = ~ cut ,
             dir = "h") +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-h.png", width = 11, height = 5)

facet-wrap-h

It is way beyond the scope of this post to exhaustively discuss all arguments of facet_grid and facet_wrap, but we briefly take a look at some more “cosmetic” parameters that concern the position of panel labels and the layout of panels themselves.

Within facet_grid, the position of panel labels can be controlled via the switch argument. By default, column labels are displayed on top and row labels on the right-hand side. When switch is set to "x", "y" or "both", column labels are displayed on the bottom, row labels on the left, or both, respectively.

# facet_grid: switch x
plot_base +
  facet_grid(facets = color ~ cut,
             switch = "x") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-grid-switch-x.png", width = 11, height = 5)

facet-grid-switch-x

# facet_grid: switch y
plot_base +
  facet_grid(facets = color ~ cut,
             switch = "y") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-grid-switch-y.png", width = 11, height = 5)

facet-grid-switch-y

For facet_wrap this can be achieved by setting the argument strip.position to top, bottom, left or right.

# facet_wrap: strip.position left
plot_base +
  facet_wrap(facets = ~ cut + color,
             strip.position = "left") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-strpos-left.png", width = 11, height = 5)

facet-wrap-strpos-left

# facet_wrap: strip.position bottom
plot_base +
  facet_wrap(facets = ~ cut + color,
             strip.position = "bottom") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-strpos-bottom.png", width = 11, height = 5)

facet-wrap-strpos-bottom

Finally, the as.table argument defines the layout of the multiples. For as.table = TRUE, the panels pertaining to the highest values or highest-ranked categories of the categorizing features are positioned at the bottom right (as in a table); for as.table = FALSE, these facets are positioned at the top right (as in a plot).

# facet_grid: as.table
plot_base +
  facet_grid(facets = cut ~ .,
             as.table = TRUE) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-grid-astable.png", width = 11, height = 5)

facet-grid-astable

# facet_grid: not as.table
plot_base +
  facet_grid(facets = cut ~ .,
             as.table = FALSE) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-grid-nottable.png", width = 11, height = 5)

facet-grid-nottable

# facet_wrap: as.table
plot_base +
  facet_wrap(facets = ~ cut,
             as.table = TRUE) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-astable.png", width = 11, height = 5)

facet-wrap-astable

# facet_wrap: not as.table
plot_base +
  facet_wrap(facets = ~ cut,
             as.table = FALSE) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-nottable.png", width = 11, height = 5)

facet-wrap-nottable

Manipulating the scales of panels

Apart from the definition of the contrasted subsets, the probably most important characteristic of facetted plots are the scales of the multiples. By default, the scales of all panels are identical. But depending on the data at hand and the comparison to be made, it might be more insightful to allow some or all scales to vary for the panels, thereby accentuating (smaller) particularities of the considered subsets.

The scales argument of facet_grid and facet_wrap, set to "free", "free_x" or "free_y", allows all, the x, or the y scales, respectively, to vary between panels. However, within facet_grid all plots within a row must share the same y scale and all plots within a column the same x scale, since they share the corresponding axes.

# facet_wrap: free x scale
plot_base +
  facet_wrap(facets = ~ cut + color,
             scales =  "free_x") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-free-x.png", width = 11, height = 5)

facet-wrap-free-x

# facet_wrap: free y scale
plot_base +
  facet_wrap(facets = ~ cut + color,
             scales =  "free_y") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-free-y.png", width = 11, height = 5)

facet-wrap-free-y

# facet_wrap: free x and y scale
plot_base +
  facet_wrap(facets = ~ cut + color,
             scales =  "free") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-free.png", width = 11, height = 5)


# facet_grid: free scales
plot_base +
  facet_grid(facets = color ~ cut,
             margins = TRUE,
             scales = "free") +
  ggtitle("Price of Diamonds by Cut and Color") +
  scale_x_continuous(breaks = c(5000, 15000)) # less x axis breaks
ggsave("facet-grid-scale-free.png", width = 11, height = 5)


While the scales argument is thus more constrained in facet_grid, facet_grid offers an additional argument: space. When set to free, the width of each column and the height of each row vary in proportion to the range of the scale of the panels in the respective position.

# facet_grid: free space and free scale
plot_base +
  facet_grid(facets = color ~ cut,
             margins = TRUE,
             scales = "free",
             space = "free") +
  ggtitle("Price of Diamonds by Cut and Color") +
  scale_x_continuous(breaks = c(5000, 15000)) # less x axis breaks
ggsave("facet-grid-scale-space-free.png", width = 11, height = 5)


Faceting can be a powerful tool to facilitate the comparison of patterns within subsets of one's data. Especially since ggplot2 makes faceting so convenient, one should always keep this option in mind.


Overplotting can be a serious problem that complicates data visualization and thus also data exploration. Overplotting describes situations in which multiple data points overlay each other within a plot, so that individual observations cannot be distinguished. In such cases, plots only indicate the general extent of the data, while existing relationships might be heavily obscured. Overplotting occurs especially when dealing with large datasets.

# Generating a sample from a bivariate normal distribution to plot
library(ggplot2)
library(hexbin)
library(dplyr)
library(grid)
library(gridExtra)

# Correlation of the two variables:
r <- 0.8    
sim_data <- MASS::mvrnorm(
                          # Number of observations:
                          n = 20000, 
                          # Means of the variables:
                          mu = c(20, 0), 
                          # Covariance matrix of the variables:
                          Sigma = matrix(c(1, r, 
                                           r, 1),
                                         nrow = 2), 
                          # mu and Sigma describe the population, not the sample:
                          empirical = FALSE)
# Keep the columns paired; resampling them independently
# would destroy the simulated correlation
x <- sim_data[, 1]
y <- sim_data[, 2]

df <- data.frame(x, y)

# Generating and storing scatter plots for differently sized subsamples
plot1_1 <- df %>%
  sample_n(size = 200) %>%
  ggplot() +
  geom_point(aes(x, y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "N = 200")

plot1_2 <- df %>%
  sample_n(size = 2000) %>%
  ggplot() +
  geom_point(aes(x, y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "N = 2000")

plot1_3 <- df %>%
  sample_n(size = 20000) %>%
  ggplot() +
  geom_point(aes(x, y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "N = 20000")

# Arranging plot objects in overview plot
plot_1 <- grid.arrange(
  plot1_1, plot1_2, plot1_3,
  nrow = 1,
  top = "Overplotting in differently sized samples") 


Adjusting glyphs

When considering discrete (or heavily-rounded continuous) variables with a small range, overplotting is practically inevitable and graphical appraisal might be impractical. However, when exploring bivariate relationships involving at least one continuous variable, graphical analysis often grants very valuable insight. Therefore, this article demonstrates several options to circumvent overplotting using ggplot2.

The most used tool to graphically assess relationships between two continuous variables is the scatter plot. As shown above, overplotting can render scatter plots quite useless. When the degree of overplotting is moderate, modification of the glyphs might offer a solution.
Overplotting may be overcome by using small, hollow and/or transparent glyphs (with the latter option being referred to as "alpha blending"). Within a given plot layer one can specify the glyph size in millimeters via the size argument (size = ...). Alternatively, one can set the shape to be a dot the size of a pixel (shape = ".") or one of the hollow shapes (shape = 0 / 1 / 2 / 5 / 6). The transparency can be adjusted via the alpha argument (alpha = ...). The alpha level ranges between 0 and 1 and represents the fraction 1/n, where n is the number of data points that need to be overlaid to obtain a fully opaque color:

# Generating and storing scatter plots with different adjustments of glyphs
plot2_1 <- ggplot() + 
  geom_point(data = df, aes(x, y), size = 0.1) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "Small glyphs")

plot2_2 <- ggplot() +
  geom_point(data = df, aes(x, y), shape = 1) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "Hollow glyphs")

plot2_3 <- 
  ggplot() +
  geom_point(data = df, aes(x, y), alpha = 0.1) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "Transparent glyphs")

# Arranging plot objects in overview plot
plot_2 <- grid.arrange(
  plot2_1, plot2_2, plot2_3,
  nrow = 1,
  top = textGrob("Counteracting overplotting by adjusting the glyphs", 
                 gp = gpar(fontsize = 18)))  


Jittering glyphs

Further, one can jitter data points by adding a little random noise to a variable. To keep the distortion of the data as small as possible, jittering is optimally applied only within the less informative dimension, i.e. the one pertaining to a discrete variable, if such a variable is indeed considered:

# Generating and storing scatter plots with jittered glyphs
# Shortcut for geom_point(position = "jitter"): geom_jitter
plot3_1 <- 
  df %>%
  mutate(x = round(x, 0)) %>%           # Rounding to simulate discrete variable
  sample_n(size = 2000) %>%
  ggplot() +
  geom_point(aes(x,y), color = "black") +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "Without jittering")

plot3_2 <- 
  df %>%
  mutate(x = round(x, 0)) %>%           # Rounding to simulate discrete variable
  sample_n(size = 2000) %>%
  ggplot() +
  geom_jitter(aes(x, y), width = 0.3) + # Arguments width/height -> max. noise
                                        # (default: 40% of the resolution)
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With jittering")

# Arranging plot objects in overview plot
plot_3 <- grid.arrange(plot3_1, plot3_2, 
  nrow = 1, 
  top = textGrob("Counteracting overplotting by jittering", 
                 gp = gpar(fontsize = 18))) 


Plotting the joint density function

An alternative to overcome overplotting, apart from adjusting the representation of the "raw" data points, is to consider the 2d joint density of the two variables. There are two feasible approaches.

Firstly, data points can be binned, the number of observations falling into each bin counted, and the resulting counts visualized. Mapping the count to color or alpha level are straightforward options. With geom_hex and geom_bin2d, ggplot2 offers two implementations to do so. The geoms employ hexagonal and rectangular bins, respectively, but are otherwise quite alike. However, Carr et al. (1987) suggest using hexagonal bins, since utilizing too-small square bins may produce visual artefacts.

#  Generating and storing plots with geom_hex
plot4_1 <- 
  ggplot() +
  geom_hex(data = df, 
           aes(x,y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With default bins")

plot4_2 <- 
  ggplot() +
  geom_hex(data = df,
           aes(x,y), 
           binwidth = c(0.1, 0.1)) + # Vector of width and height of bins
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With 0.1 x 0.1 bins") 

# Arranging plot objects in overview plot
plot_4 <- grid.arrange(plot4_1, plot4_2, 
  nrow = 1, 
  top = textGrob("Counteracting overplotting with geom_hex", 
                 gp = gpar(fontsize = 18)))  


Alternatively with geom_bin2d:

#  Generating and storing plots with geom_bin2d
plot5_1 <- 
  ggplot() +
  geom_bin2d(data = df, 
             aes(x,y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With default number of bins")

plot5_2 <- 
  ggplot() +
  geom_bin2d(data = df,
             aes(x,y), 
             bins = 100) +  # Changing number of bins (default 30)
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With 100 bins") 

# Arranging plot objects in overview plot
plot_5 <- grid.arrange(plot5_1, plot5_2, 
  nrow = 1,
  top = textGrob("Counteracting overplotting with geom_bin2d", 
  gp = gpar(fontsize = 18)))  


Secondly, the 2d density can be estimated. The density can be visualized by plotting its contours or by mapping it onto the color or alpha level of tiles or onto the size of points. Such visualizations can stand alone or supplement basic scatter plots. Within ggplot2 this statistical transformation is implemented in stat_density_2d. Several geoms are especially suitable for visualizing the transformed data and can be specified via the geom argument (geom = ...).

#  Generating and storing plots with stat_density_2d
plot6_1 <-
ggplot() +
  stat_density_2d(data = df, 
    aes(x, y, color = ..level..)) + # Mapping density level to color
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density contours") 

plot6_2 <-
  ggplot() +
  stat_density_2d(data = df, 
    aes(x, y, fill = ..level..),    # Mapping density level to color
    geom = "polygon") +             # Plotting area
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density polygons") 

plot_6 <- grid.arrange(plot6_1, plot6_2,
  nrow = 1, 
  top = textGrob("Counteracting overplotting with stat_density_2d #1")) 


Alternatively with geom = "tile":

#  Generating and storing heatmaps with stat_density_2d 
plot7_1 <-
  ggplot() +
  stat_density_2d(data = df,
    aes(x, y, fill = ..density..),  # Mapping density to fill color
    contour = FALSE,                # Not drawing density contours ...
    geom = "tile") +                # ... but filling square tiles instead
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density heatmap (tile)") 

plot7_2 <-
ggplot() +
  stat_density_2d(data = df,
    aes(x, y, fill = ..density..),  # Mapping density to fill color
    contour = FALSE,                # Not drawing density contours ...
    geom = "tile",                  # ... but filling square tiles instead
    h = c(0.1, 2)) +                # Bandwidths in x and y direction
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density heatmap (0.1 x 2 tiles)") 

# Arranging plot objects in overview plot
plot_7 <- grid.arrange(plot7_1, plot7_2,
  nrow = 1, 
  top = textGrob("Counteracting overplotting with stat_density_2d #2"))  


Or using geom = "point" and entirely without color:

#  Generating and storing point heatmap with stat_density_2d 
plot8_1 <-
  ggplot() +
  stat_density_2d(data = df,
                  aes(x, y, 
                      size = ..density..,      # Mapping density level to size
                      alpha = ..density..),    # ... and alpha level
                  contour = FALSE,             # Not drawing density contours ...
                  geom = "point",              # ... but points to map density to
                  n = 20) +                    # Grid points along each axis
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density heatmap (point)") 

plot8_2 <-
  ggplot() +
  stat_density_2d(data = df,
                  aes(x, y, 
                      size = ..density..,      # Mapping density level to size
                      alpha = ..density..),    # ... and alpha level
                  contour = FALSE,             # Not drawing density contours ...
                  geom = "point",              # ... but points to map density to
                  n = 30) +                    # Grid points along each axis
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density heatmap (point, finer grid)") 

# Arranging plot objects in overview plot
plot_8 <- grid.arrange(plot8_1, plot8_2,
  nrow = 1, 
  top = textGrob("Counteracting overplotting with stat_density_2d #3"))  


Whichever approach one chooses, overplotting should always be addressed to ensure that the data visualization is truly informative.

References

  • D. B. Carr, R. J. Littlefield, W. L. Nicholson, and J. S. Littlefield. Scatterplot matrix techniques for large n. Journal of the American Statistical Association, 82(398):424–436, 1987.

The previous post on logistic regression showed that the absolute coefficients of logistic regression models are hard to interpret because of their unit of reference. A further difficulty in interpreting logistic regression weights has not yet been addressed explicitly: in logistic regression, the effect of a one-unit increase of an independent variable on the dependent variable, the so-called marginal effect, always also depends on the exact values of the independent variable under consideration as well as of all other independent variables.

Estimation via a detour

This is due to the fact that in logistic regression the probability of occurrence of a considered category of a categorical dependent variable is estimated in a non-linear way, via an intermediate step, so to speak: in principle, a first step uses an ordinary linear model to model the value of an unobservable, so-called latent variable. This latent variable reflects the "propensity" for the occurrence of the considered category of the dependent variable. (The occurrence of the category of interest is commonly denoted as y = 1.)

y_{l} = \beta_{0} + \beta_{1}x_{1} + \ldots + \beta_{n}x_{n}

The logistic regression model now assumes that the observable categorical variable takes on the category of interest whenever the latent variable exceeds the arbitrarily chosen threshold of 0. The modelled values of the latent variable and the associated regression weights of the independent variables from the linear model have to be transformed in order to determine the probability of occurrence of the considered category of the dependent variable. For this transformation, the distribution of the estimation errors must be known. The logistic regression model assumes that the errors follow a logistic distribution. Since not only the functional form but the exact distribution of the errors must be known, the non-estimable variance of the error distribution and its conditional expected value are fixed at \sigma^{2} = \pi^{2}/3 and E(\epsilon \mid x) = 0. This yields the base equation of the logistic model:

 P(y = 1 \mid x) = \frac{e^{\beta_{0} + \beta_{1}x_{1} + \ldots + \beta_{n}x_{n}}}{1 + e^{\beta_{0} + \beta_{1}x_{1} + \ldots + \beta_{n}x_{n}}} = \frac{e^{x'\beta}}{1 + e^{x'\beta}} = \frac{e^{\text{Logit}}}{1 + e^{\text{Logit}}}

From this estimation method and transformation it follows that logistic regression coefficients reflect the linear relationship between the independent variables and the latent variable, or equivalently the logits, i.e. the logarithmized odds ratios, for the considered category of the dependent variable. The relationship of logits, odds ratios and regression coefficients to the probabilities of occurrence of the categories of the dependent variable, however, is not linear. This non-linearity is always evident in the equations of the logistic regression model. Especially when looking at the exponentiated logistic regression coefficients, the odds ratios, it is immediately apparent that the regression weights combine multiplicatively rather than additively: odds ratios indicate a multiplicative change, whose absolute extent naturally depends on the "baseline probability".

Base equation of the logistic model:

 P(y = 1 \mid x) = \frac{e^{x'\beta}}{1 + e^{x'\beta}} = \frac{e^{\text{Logit}}}{1 + e^{\text{Logit}}}

…solved for the logit:

 \text{Logit} = \ln\frac{P}{1 - P} = \beta_{0} + \beta_{1}x_{1} + \ldots + \beta_{n}x_{n}

…and additionally exponentiated:

 \text{OR} := e^{\text{Logit}} = e^{\ln\frac{P}{1 - P}} = e^{\beta_{0} + \beta_{1}x_{1} + \ldots + \beta_{n}x_{n}} = e^{\beta_{0}} \times e^{\beta_{1}x_{1}} \times \ldots \times e^{\beta_{n}x_{n}}
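To make the back-and-forth between these quantities concrete, here is a small R sketch (not part of the original derivation; the helper names are illustrative, not from any package):

```r
# Converting between probability, logit and odds
prob_to_logit <- function(p) log(p / (1 - p))
logit_to_prob <- function(logit) exp(logit) / (1 + exp(logit))

prob_to_logit(0.5)                 # a probability of 0.5 corresponds to a logit of 0
logit_to_prob(prob_to_logit(0.7))  # the round trip recovers 0.7
exp(prob_to_logit(0.7))            # the exponentiated logit: the odds 0.7 / 0.3
```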

An intuitive illustration

Why the marginal effects of independent variables each depend on the exact values of all independent variables can be understood intuitively as follows: if the (latent) propensity for the occurrence of the considered category of the dependent variable is already very high or very low, i.e. far above or below the threshold, increasing it by a certain amount has a negligible effect on the predicted (probability of the) actually observed category of the dependent variable. If, however, the (latent) propensity is close to the threshold, increasing it by the same amount is far more likely to be decisive for the predicted (probability of the) category of the dependent variable. The already existing propensity for the occurrence of the considered category in turn depends on the exact values of all independent variables.

The non-linearity of the marginal effects of independent variables becomes particularly clear in graphical representations, when the probability of occurrence is plotted over the values of an independent variable: the slope of the probability curve is not constant.

Plot of the conditional probability
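A curve of this kind can be reproduced with a few lines of R; the coefficients below are arbitrary illustrative values, not taken from the post:

```r
library(ggplot2)

# Arbitrary coefficients of a single-predictor logistic model
beta_0 <- -2
beta_1 <- 1

x <- seq(-4, 8, by = 0.1)
p <- exp(beta_0 + beta_1 * x) / (1 + exp(beta_0 + beta_1 * x))

ggplot(data.frame(x, p), aes(x, p)) +
  geom_line() +
  labs(y = "P(y = 1 | x)",
       title = "Conditional probability over x") +
  theme_minimal()
```

The slope is steepest where the logit crosses 0 (here at x = 2, where P = 0.5) and flattens out towards both tails — exactly the non-constant marginal effect described above.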

A solution: AME and MEM?

There are several ways to deal with this non-linearity when interpreting logistic models. (Changes of) probabilities of occurrence can be computed or plotted, and thereby contrasted, for different combinations of values of the independent variable of interest as well as of the remaining independent variables. If marginal effects are to be expressed in a single compact summary figure, however, AMEs or MEMs can be computed. The average marginal effect (AME) is the effect of a one-unit increase of the independent variable, averaged over all observations in the data. The marginal effect at the mean (MEM) is the effect of a one-unit increase of the independent variable evaluated at the mean of all independent variables. It should be kept in mind, however, that AMEs do not convey the substantively meaningful non-linearity of the effects at all: important information regarding the effects of the independent variables is simply discarded.
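As a sketch of what such summary figures look like in practice (simulated data and hand-rolled formulas rather than a dedicated package such as margins; all values are illustrative):

```r
# AME and MEM for a logistic model, computed by hand on simulated data
set.seed(1)
dat <- data.frame(x = rnorm(1000))
dat$y <- rbinom(1000, 1, plogis(0.5 + 1.2 * dat$x))

model <- glm(y ~ x, data = dat, family = binomial)
b <- unname(coef(model))

# AME: derivative of P(y = 1 | x) w.r.t. x, averaged over all observations
p_hat <- plogis(b[1] + b[2] * dat$x)
ame <- mean(b[2] * p_hat * (1 - p_hat))

# MEM: the same derivative, evaluated at the mean of x
p_mean <- plogis(b[1] + b[2] * mean(dat$x))
mem <- b[2] * p_mean * (1 - p_mean)

c(AME = ame, MEM = mem)  # two single-number summaries of a non-linear effect
```

Note that AME and MEM generally differ, which is itself a symptom of the non-linearity the single number hides.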

With common statistical software, logistic regression models are usually just as easy to estimate as linear regressions. Interpreting such models, however, i.e. the part of the statistical analysis that the software does not take care of, holds a pitfall: the reference quantity of the regression coefficients.

Based on the independent characteristics of the observations, logistic regression models model the probability with which a certain category of a categorical dependent variable occurs. Estimating these probabilities requires a transformation of the regression weights of the independent variables, so that logistic regression coefficients reflect the relationship between the values of the independent variables and the logits for the considered category of the dependent variable. In parallel to linear regression, one can conclude that a one-unit increase of a given independent variable is associated with a change of the logit for the occurrence of the considered category of the dependent variable by β units. While formally correct, this interpretation is obviously not very illuminating.

Logits, odds ratios and probabilities

This begs the question what exactly logits are. The answer seems simple enough: logits are logarithmized odds ratios. We note: Logit = ln(Odds Ratio).

But of course this raises the question what, in turn, odds ratios are. Odds ratios are nothing more than simple ratios of odds (in German: Chancen- or Quotenverhältnisse). In the given context, an odds ratio denotes the ratio of the odds for the occurrence of the considered category of the dependent variable between two groups that differ in the value of an independent characteristic. We note: Odds Ratio = odds of the outcome in group 1 : odds of the outcome in group 2.

This naturally raises the question what exactly odds are. Odds are the ratio of the probability that a category occurs to the probability that it does not occur, within a group defined, for example, by an independent characteristic. We note: odds of the outcome = probability of the outcome : counter-probability of the outcome.

The probability of a category corresponds to the share of observation units in a group that exhibit that category. We note: probability of the outcome = share of group members exhibiting the outcome.

An example: nerds, normalos and Star Wars

For illustration, the logit and odds ratio of being a Star Wars fan are calculated below for a group of 10 "statistics nerds" relative to a group of 10 "normalos".

Calculation by hand

7 of the 10 nerds are Star Wars fans and 4 of the 10 normalos are Star Wars fans. It follows:

P_{\text{Nerds}}(\text{Fan}) = \frac{7}{10} = 0.7
Odds_{\text{Nerds}}(\text{Fan}) = \frac{P_{\text{Nerds}}(\text{Fan})}{P_{\text{Nerds}}(\text{not Fan})} = \frac{0.7}{1 - 0.7} = 2.\bar{3}
P_{\text{Normalos}}(\text{Fan}) = \frac{4}{10} = 0.4
Odds_{\text{Normalos}}(\text{Fan}) = \frac{P_{\text{Normalos}}(\text{Fan})}{P_{\text{Normalos}}(\text{not Fan})} = \frac{0.4}{1 - 0.4} = 0.\bar{6}
OddsRatio_{\text{Nerds:Normalos}}(\text{Fan}) = \frac{Odds_{\text{Nerds}}(\text{Fan})}{Odds_{\text{Normalos}}(\text{Fan})} = \frac{2.\bar{3}}{0.\bar{6}} = 3.5
\beta_{\text{Nerds:Normalos}} = \ln(OddsRatio_{\text{Nerds:Normalos}}) \approx 1.25
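The same numbers can be checked with a few lines of plain R arithmetic:

```r
# Reproducing the hand calculation
p_nerds    <- 7 / 10
p_normalos <- 4 / 10

odds_nerds    <- p_nerds / (1 - p_nerds)        # 2.333...
odds_normalos <- p_normalos / (1 - p_normalos)  # 0.666...

odds_ratio <- odds_nerds / odds_normalos
odds_ratio       # 3.5
log(odds_ratio)  # 1.2528, the logit coefficient
```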

Calculation via logistic regression in R

The same result is obtained by estimating a logistic regression for the given data in R and exponentiating the logit coefficient that is reported by default.

R code and output of glm

Group membership is captured by a dummy variable taking the value 1 for all nerds and 0 for all normalos, so that a one-unit increase of the independent variable here corresponds to switching the group.
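Since the original code screenshot is not reproduced here, the following is a sketch of what the estimation could look like (the data-frame and variable names are assumptions):

```r
# 10 nerds (7 fans) and 10 normalos (4 fans), nerd as a 0/1 dummy
star_wars <- data.frame(
  nerd = rep(c(1, 0), each = 10),
  fan  = c(rep(1, 7), rep(0, 3),
           rep(1, 4), rep(0, 6))
)

model <- glm(fan ~ nerd, data = star_wars, family = binomial)

coef(model)[["nerd"]]       # logit coefficient, approx. 1.2528
exp(coef(model)[["nerd"]])  # exponentiated: the odds ratio of 3.5
```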

(Logarithmized) ratios of ratios

While odds ratios are easy to compute, they are only at first glance better suited for interpreting logistic models than the logistic regression coefficients. Odds ratios are ratios of probability ratios. Just like their logarithmized form, the logits, odds ratios therefore elude the intuition of most people.

It is formally correct to state that a one-unit increase of a given independent variable is associated with a change of the odds for the occurrence of the considered category of the dependent variable by the factor e^β. However, neither odds ratios nor logistic regression coefficients allow direct conclusions about the probabilities within groups or the probability ratios between contrasted groups.

Therefore, when interpreting logistic regression models, statements such as "…a one-unit increase of the independent variable is associated with a probability changed by e^β / β…" are not admissible. How misguided such claims are becomes clear when one realizes that quite different baseline probabilities can result in identical odds ratios and logits. The odds ratio from the example above, for instance, can also arise from entirely different probabilities in the two contrasted groups:

P1     P2     P1 / P2            Odds 1                      Odds 2                           Odds Ratio
0.7    0.4    0.7/0.4 = 1.75     0.7/(1 - 0.7) = 2.\bar{3}   0.4/(1 - 0.4) = 0.\bar{6}        2.\bar{3} / 0.\bar{6} = 3.5
0.6    0.3    0.6/0.3 = 2        0.6/(1 - 0.6) = 1.5         0.3/(1 - 0.3) = 0.\overline{428571}   1.5 / 0.\overline{428571} = 3.5

Solving the odds-ratio formula for the probability of occurrence in one of the groups yields the equation of the curve on which all probability combinations with the same odds ratio lie: P_{1} = \frac{\text{OR} \times Odds_{2}}{1 + \text{OR} \times Odds_{2}}. This curve is shown below for an odds ratio of 3.5.

Combinations of probabilities with an odds ratio of 3.5

Conclusion

Since even formally correct interpretations of the absolute values of logits (β) as well as of odds ratios (e^β) are uninformative and potentially misleading, the recommendation here is to interpret only the direction of association implied by logits and odds ratios: a (one-unit) increase of an independent variable is associated with an increased probability of occurrence of the considered category of the dependent variable for odds ratios > 1 (i.e. β > 0) and with a decreased probability for odds ratios < 1 (i.e. β < 0).
