“There is no way you know Thomas! What a coincidence! He’s my best friend’s saxophone teacher! This cannot be true. Here we are, at the other end of the world, and we meet? What are the odds?” Surely, not only we here at STATWORX have experienced similar situations, be it in a hotel lobby, on a far-away hiking trail or in a pub in a city you are completely new to. However, the very fact that this story is so suspiciously relatable might indicate that the chances of being socially connected to a stranger by a short chain of friends of friends aren’t too low after all. Lots of research has been done in this field, one particularly popular result being the 6-Handshake-Rule. It states that most people living on this planet are connected by a chain of six handshakes or less. In the general setting of graphs, in which edges connect nodes, this is often referred to as the small-world effect. That is to say, the typical number of edges needed to get from node A to node B grows logarithmically in population size (i.e., the number of nodes). Note that, up until now, no geographic distance has been included in our considerations, which seems inadequate as it plays a significant role in social networks. When analyzing data from social networks such as Facebook or Instagram, three observations are especially striking:
  • Individuals who are geographically farther away from each other are less likely to connect, i.e., people from the same city are more likely to connect.
  • Few individuals have extremely many connections. Their number of connections follows a heavy-tailed Pareto distribution. Such individuals interact as hubs in the network. That could be a celebrity or just a really popular kid from school.
  • Connected individuals tend to share a set of other individuals they are both connected to (e.g., “friend cliques”). This is called the clustering property.

A model that explains these observations

Clearly, due to the characteristics of social networks mentioned above, only a model that includes the geographic distances between individuals makes sense. Also, to account for the occurrence of hubs, research has shown that reasonable models attach a random weight to each node (which can be regarded as the social attractiveness of the respective individual). A model that accounts for all three properties is the following: First, randomly place nodes in space with a certain intensity nu, which can be done with a Poisson process. Then, with an independent uniformly distributed weight U_x attached to each node x, connect every two nodes x and y by an edge with probability

    \[p_{xy} = \mathbb{P}(x \text{ is connected to } y) := \varphi\left(\frac{1}{\beta} U_x^{\gamma} U_y^{\gamma} \vert x - y \vert^{d}\right)\]

where d is the dimension of the model (here: d = 2, as we’ll simulate the model on the plane), the model parameter gamma in [0,1] controls the impact of the weights, and the model parameter beta > 0 squishes the overall input to the profile function varphi, a monotonically decreasing, normalized function that returns a value between 0 and 1. That is, of course, what we want, because its output shall be a probability. Take a moment to go through the effects of different beta and gamma on p_{xy}. A higher beta yields a smaller input value for varphi and thereby a higher connection probability. Similarly, a higher gamma entails a lower U^gamma (as U in [0,1]) and thus a higher connection probability. All this comprises a scale-free random connection model, which can be seen as a generalization of the model by Deprez and Wüthrich. So much for the theory. Now that we have a model, we can use it to generate synthetic data that should look similar to real-world data. So let’s simulate!
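To get a feeling for how beta and gamma act on p_{xy}, here is a minimal sketch (my own illustration; the profile function is the one used in the simulation below, while the weights and distances are made-up values):
phi <- function(z) pmin(z^(-1.8), 1) # profile function, defined again below

p_xy <- function(u_x, u_y, dist, beta = 500, gamma = .4, d = 2) {
  phi(1 / beta * u_x^gamma * u_y^gamma * dist^d)
}

p_xy(.9, .9, dist = 10)  # attractive pair, close by: probability 1
p_xy(.1, .1, dist = 200) # unattractive pair, far apart: probability ~0.01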

Obtain data through simulation

From here on, the simulation is pretty straightforward. Don’t worry about the specific numbers at this point.
library(tidyverse)
library(fields)
library(ggraph)
library(tidygraph)
library(igraph)
library(Matrix)

# Create a vector with plane dimensions. The random nodes will be placed on the plane.
plane <- c(1000, 1000)

poisson_para <- .5 * 10^(-3) # Poisson intensity parameter
beta <- .5 * 10^3
gamma <- .4

# Number of nodes is Poisson(poisson_para * area)-distributed
n_nodes <- rpois(1, poisson_para * plane[1] * plane[2])
weights <- runif(n_nodes) # Uniformly distributed weights

# The Poisson process locally yields node positions that are completely random.
x <- plane[1] * runif(n_nodes)
y <- plane[2] * runif(n_nodes)

phi <- function(z) { # Connection function
  pmin(z^(-1.8), 1)
} 
What we need next is some information on which nodes are connected. That means we first need to get the connection probability by evaluating varphi for each pair of nodes and then flip a biased coin accordingly. This yields a 0-1 encoding, where 1 means that the two respective nodes are connected and 0 that they’re not. We can gather this information for all pairs in a matrix that is commonly known as the adjacency matrix.
# Distance matrix needed as input
dist_matrix <- rdist(tibble(x, y))

weight_matrix <- outer(weights, weights, FUN = "*") # Weight matrix

con_matrix_prob <- phi(1 / beta * weight_matrix^gamma * dist_matrix^2) # Evaluate connection probabilities

# Sampling: one independent biased coin flip per pair of nodes
con_matrix <- Matrix(runif(n_nodes^2) < con_matrix_prob, sparse = TRUE)
con_matrix <- con_matrix * upper.tri(con_matrix) # Keep every pair only once
adjacency_matrix <- con_matrix + t(con_matrix) # Symmetrize
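Before moving on, a couple of quick sanity checks can’t hurt (a sketch, using the names from above): the adjacency matrix should be symmetric, and the number of realized edges should be close to its expected value.
isSymmetric(adjacency_matrix) # should be TRUE
sum(adjacency_matrix) / 2 # realized number of edges
sum(con_matrix_prob[upper.tri(con_matrix_prob)]) # expected number of edges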

Visualization with ggraph

In an earlier post we praised visNetwork as our go-to package for beautiful interactive graph visualization in R. While this remains true, we also have lots of love for tidyverse, and ggraph (spoken “g-giraffe”) as an extension of ggplot2 proves to be a comfortable alternative for non-interactive graph plots, especially when you’re already familiar with the grammar of graphics. In combination with tidygraph, which lets us describe a graph as two tidy data frames (one for the nodes and one for the edges), we obtain a full-fledged tidyverse experience. Note that tidygraph is based on a graph manipulation library called igraph from which it inherits all functionality and “exposes it in a tidy manner”. So before we get cracking with the visualization in ggraph, let’s first tidy up our data with tidygraph!

Make graph data tidy again!

Let’s attach some new columns to the node dataframe, which will be useful for visualization. After we have created the tidygraph object, this can be done in the usual dplyr fashion, using activate(nodes) and activate(edges) to access the respective dataframes.
# Create Igraph object
graph <- graph_from_adjacency_matrix(adjacency_matrix, mode="undirected")

# Make a tidygraph object from it. Igraph methods can still be called on it.
tbl_graph <- as_tbl_graph(graph)

hub_id <- which.max(degree(graph))

# Add spatial positions, hub distance and degree information to the nodes.
tbl_graph <- tbl_graph %>%
  activate(nodes) %>%
  mutate(
    x = x,
    y = y,
    hub_dist = replace_na(bfs_dist(root = hub_id), Inf),
    degree = degree(graph),
    friends_of_friends = replace_na(local_ave_degree(), 0),
    cluster = as.factor(group_infomap())
  )
Tidygraph supports most of igraph’s methods, either directly or in the form of wrappers. This also applies to most of the functions used above. For example, breadth-first search is implemented as the bfs_* family, wrapping igraph::bfs(), the group_* family wraps igraph’s clustering functions, and local_ave_degree() wraps igraph::knn().
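As a small illustration of this wrapper idea, the degree column from above could just as well be computed with tidygraph’s centrality_degree() wrapper instead of calling igraph directly; both ways should agree:
tbl_graph %>%
  activate(nodes) %>%
  mutate(degree_tidy = centrality_degree()) %>% # tidygraph's wrapper
  as_tibble() %>%
  summarise(same = all(degree_tidy == degree)) # compare with the igraph::degree() column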

Let’s visualize!

ggraph is essentially built around three components: nodes, edges and layouts. Nodes that are connected by edges compose a graph, which can be created as an igraph object. Visualizing the igraph object can be done in numerous ways: remember that nodes usually are not endowed with any coordinates. Therefore, arranging them in space can be done pretty much arbitrarily. In fact, there’s a specific research branch called graph drawing that deals with finding a good layout for a graph for a given purpose. Usually, the main criteria of a good layout are aesthetics (which is often interchangeable with clearness) and capturing specific graph properties. For example, a layout may force the nodes to form a circle, a star, two parallel lines, or a tree (if the graph’s data allows for it). Other times you might want to have a layout with a minimal number of intersecting edges. Fortunately, in ggraph all the layouts from igraph can be used. We start with a basic plot by passing the data and the layout to ggraph(), similar to what you would do with ggplot() in ggplot2. We can then add layers to the plot. Nodes can be created by using geom_node_point() and edges by using geom_edge_link(). From then on, it’s full-on ggplot2-style.
# Add coord_fixed() for fixed axis ratio!
basic <- tbl_graph %>%
  ggraph(layout = tibble(x = V(.)$x, y = V(.)$y)) +
  geom_edge_link(width = .1) +
  geom_node_point(aes(size = degree, color = degree)) +
  scale_color_gradient(low = "dodgerblue2", high = "firebrick4") +
  coord_fixed() +
  guides(size = FALSE)
To see more clearly which nodes are essential to the network, the degree, i.e., the number of edges a node is connected with, was highlighted for each node. Another way of getting a good overview of the graph is to show a visual decomposition of its components. Nothing easier than that!
cluster <- tbl_graph %>%
  ggraph(layout = tibble(x = V(.)$x, y = V(.)$y)) +
  geom_edge_link(width = .1) +
  geom_node_point(aes(size = degree, color = cluster)) +
  coord_fixed() +
  theme(legend.position = "none")
Wouldn’t it be interesting to visualize the reach of a hub node? Let’s do it with a facet plot:
# Copy of tbl_graph with columns that indicate whether a node is within n steps of the hub.
reach_graph <- function(n) {
  tbl_graph %>%
    activate(nodes) %>%
    mutate(
      reach = n,
      reachable = ifelse(hub_dist <= n, "reachable", "non_reachable"),
      reachable = ifelse(hub_dist == 0, "Hub", reachable)
    )
}
# Tidygraph allows to bind graphs. This means binding rows of the node and edge dataframes.
evolving_graph <- bind_graphs(reach_graph(0), reach_graph(1), reach_graph(2), reach_graph(3))

evol <- evolving_graph %>%
  ggraph(layout = tibble(x = V(.)$x, y = V(.)$y)) +
  geom_edge_link(width = .1, alpha = .2) +
  geom_node_point(aes(size = degree, color = reachable)) +
  scale_size(range = c(.5, 2)) +
  scale_color_manual(values = c("Hub" = "firebrick4",
                                "non_reachable" = "#00BFC4",
                                "reachable" = "#F8766D","")) +
  coord_fixed() +
  facet_nodes(~reach, ncol = 4, nrow = 1, labeller = label_both) +
  theme(legend.position = "none")

A curious observation

At this point, there are many graph properties (including the three above but also cluster sizes and graph distances) that are worth taking a closer look at, but this is beyond the scope of this blogpost. However, let’s look at one last thing. Somebody just recently told me about a very curious fact about social networks that seems paradoxical at first: Your average friend on Facebook (or Instagram) has way more friends than the average user of that platform. It sounds odd, but if you think about it for a second, it is not too surprising. Sampling from the pool of your friends is very different from sampling from all users on the platform (entirely at random). It’s exactly those very prominent people who have a much higher probability of being among your friends. Hence, when calculating the two averages, we receive very different results.
As can be seen, the model also reflects this property: in the small excerpt of the graph that we simulated, the average node has a degree of around 5 (blue intercept), while the degree of connected nodes is over 10 on average (red intercept).
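For the simulated graph, both averages can be computed directly from the degree sequence; sampling a “friend” means sampling a node with probability proportional to its degree. A short sketch (names as in the code above):
deg <- degree(graph)
mean(deg) # average degree of a random node (blue intercept)
sum(deg^2) / sum(deg) # average degree of a random friend (red intercept)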

Conclusion

In the first part, I introduced a model that describes the features of real-life social network data well. In the second part, we obtained artificial data from that model and used it to create an igraph object (by means of the adjacency matrix). The latter can then be transformed into a tidygraph object, allowing us to easily make manipulations on the node and edge tibbles to calculate any graph statistic (e.g., the degree) we like. Further, the tidygraph object is then used for conveniently visualizing the network through ggraph. I hope that this post has sparked your interest in network modeling and has given you an idea of how seamlessly graph manipulation and visualization with tidygraph and ggraph merge into the usual tidyverse workflow. Have a wonderful day!

In my last blog post, I wrote about networks and their visualization in R. For the coloring of the objects, I clustered the Les Misérables characters into several groups with Louvain, a community detection algorithm. Community detection algorithms are not only useful for grouping characters in French literature. At STATWORX, we use these methods to give our clients insights into their product portfolio, customer, or market structure. In this blog post, I want to show you the magic behind community detection and give you a theoretical introduction into the Louvain and Infomap algorithms.

Find groups with a high density of connections within and a low density between groups

Networks are useful constructs to schematize the organization of interactions in social and biological systems. They are just as well suited to represent interactions in the economy, especially in marketing. Such an interaction can be a joint purchase of two or more products or a joint comparison of products in online shops or on price comparison portals. Networks consist of objects and connections between the objects. The connections, also called edges, can be weighted according to certain criteria. In marketing, for example, the number or frequency of joint purchases of two products is a significant weighting of the connection between these two products. Mostly, such real networks are so big and complex that we have to simplify their structure to get useful information from them. The methods of community detection help to find groups in the network with a high density of connections within and a low density of links between groups. We will have a look at the two methods Louvain community detection and Infomap, because they gave the best results in the benchmark study of Lancichinetti and Fortunato (2009) on community detection methods.

Louvain: Build clusters with high modularity in large networks

The Louvain Community Detection method, developed by Blondel et al. (2008), is a simple algorithm that can quickly find clusters with high modularity in large networks.

Modularity

The so-called modularity measures the density of connections within clusters compared to the density of connections between clusters (Blondel 2008). It is used as an objective function to be maximized by some community detection techniques and takes on values between -1 and 1. In the case of weighted connections between the objects, we can define the modularity with the following formula:

    \[Q = \frac{1}{2m}\sum\nolimits_{p,q} \left[A_{pq} - \frac{k_{p}k_{q}}{2m}\right] \delta(C_{p}, C_{q})\]

with:

  • A_{pq}: weight of the connection between objects p and q
  • k_{p} = \sum\nolimits_{q} A_{pq}: sum of the weights of all connections originating from object p
  • C_{p}: the cluster to which object p has been assigned
  • \delta(C_{p}, C_{q}): dummy variable that takes the value 1 if both objects p and q are assigned to the same cluster
  • m = \frac{1}{2}\sum\nolimits_{p,q} A_{pq}: sum of the weights of all connections between all existing objects, divided by 2
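This formula is exactly what igraph::modularity() computes. As a check, here is a small sketch on a made-up weighted graph (the names and weights are hypothetical):
library(igraph)

g_test <- make_graph(~ A - B, A - C, B - C, C - D) # toy graph
E(g_test)$weight <- c(3, 1, 2, 4) # made-up weights
member <- c(1, 1, 1, 2) # A, B, C form one cluster, D its own

A_mat <- as_adjacency_matrix(g_test, attr = "weight", sparse = FALSE)
k <- rowSums(A_mat) # k_p: weighted degrees
m <- sum(A_mat) / 2 # m: total edge weight
delta <- outer(member, member, "==")

Q_manual <- sum((A_mat - outer(k, k) / (2 * m)) * delta) / (2 * m)
Q_igraph <- modularity(g_test, member, weights = E(g_test)$weight)
all.equal(Q_manual, Q_igraph) # TRUE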

Phases

The algorithm is divided into two phases, which are repeated until the modularity cannot be maximized further. In the 1st phase, each object is considered as a separate cluster. For each object p (p = 1, …, N), its neighbors q (q = 1, …, N) are checked for whether the modularity increases if p is removed from its own cluster and assigned to the cluster of object q. The object p is then assigned to the cluster that maximizes the increase in modularity. However, this only applies in the case of a positive increase. If no positive increase in modularity can be realized by shifting, the object p remains in its previous cluster. The process described above is repeated and sequentially performed for all objects until no improvement in modularity can be achieved. An object is often considered and reassigned several times. The 1st phase thus stops when a local maximum has been found, i.e., when no individual move of an object can improve the modularity.

Building on the clusters formed in the 1st phase, a new network is created in the 2nd phase whose objects are now the clusters themselves, as formed in the 1st phase. To obtain weights for the connections between the clusters, the sum of the weights of the connections between the objects of two corresponding clusters is used. Once such a new network of “meta-clusters” has been formed, the steps of the 1st phase are applied to the new network and the modularity is optimized further. A complete run of both phases is called a pass. Such passes are carried out repeatedly until there is no more change in the clusters and a maximum of modularity is achieved.

Infomap: Minimize the description length of a random walk

The Infomap method was first introduced by Rosvall and Bergstrom (2008). The procedure of the algorithm is at its core identical to Blondel et al.’s procedure. The algorithm repeats the two described phases until an objective function is optimized. However, as the objective function to be optimized, Infomap does not use modularity but the so-called map equation.

Map Equation

The map equation exploits the duality between finding cluster structures in networks and minimizing the description length of the motion of a so-called random walker (Bohlin 2014). This random walker randomly moves from object to object in the network. The higher the weight of a connection, the more likely the random walker will use that connection to reach the next object (a small code sketch of such a weighted walk follows below). The goal is to form clusters in which the random walker stays as long as possible, i.e., the weights of the connections within a cluster should take on greater values than the weights of the connections between objects of different clusters. The map equation’s code structure is designed to compress the description length of the random walk when the random walker stays in certain regions of the network for extended periods of time. Therefore, the goal is to minimize the map equation, which is defined as follows for weighted but undirected networks (Rosvall 2009):

    \[L(M) = w^{\curvearrowright}\log(w^{\curvearrowright}) - 2\sum_{k=1}^{K} w_{k}^{\curvearrowright}\log(w_{k}^{\curvearrowright}) - \sum_{i=1}^{N} w_{i}\log(w_{i}) + \sum_{k=1}^{K}\left(w_{k}^{\curvearrowright} + w_{k}\right)\log\left(w_{k}^{\curvearrowright} + w_{k}\right)\]

with:

  • M: network with N objects (i = 1, …, N) and K clusters (k = 1, …, K)
  • w_{i}: relative weight of all connections of object i, i.e., the sum of the weights of all its connections divided by the sum of the weights of all connections in the network
  • w_{k} = \sum\nolimits_{i \in k} w_{i}: sum of the relative weights of all connections of the objects in cluster k
  • w_{k}^{\curvearrowright}: sum of the relative weights of all connections of the objects of cluster k that leave the cluster (connections to objects from other clusters)
  • w^{\curvearrowright} = \sum_{k=1}^{K} w_{k}^{\curvearrowright}: sum of the weights of all connections between objects from different clusters

This definition of the map equation is based on the so-called entropy, the average information content or information density of a message. This term goes back to Shannon’s Source Coding Theorem from the field of information theory (Rosvall 2009). The procedure described above is hereafter referred to as the main algorithm. Objects that were assigned to the same cluster in the first phase of the main algorithm, when the new network was created, can only be moved together in the second phase. A previously optimal shift into a specific cluster no longer necessarily has to be optimal in a later pass (Rosvall 2009).
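To make the random-walk intuition concrete, here is a small sketch (my own illustration, not part of the Infomap implementation) of a single step of such a walker on a weighted, undirected igraph graph g:
# Pick an incident edge with probability proportional to its weight,
# then move to the other endpoint.
walk_step <- function(g, v) {
  es <- incident(g, v) # edges attached to v
  e <- es[sample(seq_along(es), 1, prob = es$weight)]
  setdiff(ends(g, e, names = FALSE)[1, ], v) # the other endpoint
}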

Extensions

Thus, theoretically, there may be even better cluster divisions than the main algorithm’s solution. In order to improve the solution of the main algorithm, there are two extensions compared to Louvain community detection:

  • Subcluster shift: The subcluster shift treats each cluster as its own network and applies the main algorithm to it. This yields one or more subclusters within each cluster that optimally partition that subnetwork. All subclusters are reassigned to their cluster and can now be moved freely between the clusters. By applying the main algorithm, it can be tested whether the shift of one subcluster into another cluster leads to a minimization of the map equation compared to the previously optimal cluster distribution (Rosvall 2009).
  • Single object displacement: Each object is again considered as a separate cluster, so that the displacement of individual objects between the optimal clusters determined in the previous step is possible. By applying the main algorithm, it can be determined whether the displacement of individual objects between the clusters leads to further optimization of the map equation (Rosvall 2009).

The two extensions are repeated sequentially until the map equation cannot be further minimized and an optimum has been achieved.

How does the Louvain algorithm work in an easy example?

As we can see, the core of both methods is to build clusters and reallocate objects in two phases to optimize an objective function. To get a better understanding of how these two phases work, let me illustrate the Louvain Community Detection method with an easy example, a network with six nodes:

1st Phase

In the beginning, each object is separated into its own cluster, and we have to check whether the modularity is increased if we assign it to another cluster. Only a positive change in modularity leads to a cluster shift. For object A, for example, the calculations look like the following:

  • A → B: Q_{AB} = 5 - (10 · 7)/30 = 2.667
  • A → C: Q_{AC} = 4 - (10 · 13)/30 = -0.333
  • A → E: Q_{AE} = 1 - (10 · 9)/30 = -2

Similarly, we can check for all other objects whether a shift to another cluster maximizes the modularity:

  • B → C: Q_{BC} = 2 - (7 · 13)/30 = -1.033
  • C → D: Q_{CD} = 7 - (13 · 10)/30 = 2.667
  • D → F: Q_{DF} = 3 - (10 · 11)/30 = -0.667
  • E → F: Q_{EF} = 8 - (9 · 11)/30 = 4.7

2nd Phase

Now we try to combine the clusters built in the 1st phase:

  • Orange → Green: Q_{Or,Gr} = 6 - (7 · 9)/10 = -0.3
  • Orange → Yellow: Q_{Or,Ye} = 1 - (7 · 4)/10 = -1.8
  • Green → Yellow: Q_{Gr,Ye} = 3 - (9 · 4)/10 = -0.6

We can see that none of the assignments of a cluster to another cluster can improve the modularity. Hence we can finish pass 1.
Because there is no change in the second phase of pass 1, no further passes are required; a maximum of modularity has already been achieved. In larger networks, of course, more passes are required, since there the clusters can consist of significantly more objects.
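We can replicate this toy example in R; the edge weights below are inferred from the calculations above, and cluster_louvain() should recover the three clusters found by hand:
library(igraph)

toy <- graph_from_data_frame(
  data.frame(from   = c("A", "A", "A", "B", "C", "D", "E"),
             to     = c("B", "C", "E", "C", "D", "F", "F"),
             weight = c(5, 4, 1, 2, 7, 3, 8)),
  directed = FALSE
)

lc_toy <- cluster_louvain(toy) # uses the weight attribute automatically
membership(lc_toy) # expected: {A, B}, {C, D}, {E, F}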

In R, only the igraph package is needed to apply both methods

All we need to use these two community detection algorithms is the package igraph, which is a collection of network analysis tools, plus a list or a matrix with the connections between the objects in our network. In our example, we use the Les Misérables characters network to cluster the characters into several groups. Therefore, we load the dataset lesmis, which you can find in the package geomnet. We need to extract the edges out of lesmis and convert them into a data.frame. Afterwards, we convert this into an igraph graph. To use the weights of every connection, we need to rename the weight column so that the algorithm can identify the weights. The resulting graph can be used as the input for the two algorithms.
# Libraries --------------------------------------------------------------

library(igraph)
library(geomnet)


# Data Preparation -------------------------------------------------------

#Load dataset
data(lesmis)

#Edges
edges <- as.data.frame(lesmis[1])
colnames(edges) <- c("from", "to", "weight")

#Create graph for the algorithms
g <- graph_from_data_frame(edges, directed = FALSE)
Now we are ready to find the communities with the functions cluster_louvain() and cluster_infomap(), respectively. Furthermore, we can have a look at which community each character is associated with (membership()) or get a list with all communities and their members (communities()).
# Community Detection ----------------------------------------------------

# Louvain
lc <- cluster_louvain(g)
membership(lc)
communities(lc)
plot(lc, g)

# Infomap
imc <- cluster_infomap(g)
membership(imc)
communities(imc)
plot(imc, g)
If you want to visualize these results afterward, have a look at my last blog post or use the above-shown plot() function for a fast visualization. As you can see in the following, the fast plotting option is not as beautiful as with the package visNetwork. In addition, it is also not interactive.

Conclusion

Both algorithms outperform other community detection algorithms (Lancichinetti, 2009). They have excellent performance, and Infomap delivers slightly better results in this study than Louvain. Moreover, we should consider two additional facts when choosing between these two algorithms. First, both algorithms do their job very fast; applied to large networks, however, Louvain is dramatically faster. Second, Louvain cannot separate outliers. This could explain why the algorithms divide people into almost identical clusters, but Infomap cuts out a few people from some clusters, and they form their own cluster. We should keep these points in mind when we have to decide between both algorithms. Another approach could be to use them both and compare their solutions. Did I spark your interest in clustering your own networks? Feel free to use my code or contact me if you have any questions.

References

  • Blondel, Vincent D. / Guillaume, Jean-Loup / Lambiotte, Renaud / Lefebvre, Etienne (2008), “Fast unfolding of communities in large networks”, Journal of Statistical Mechanics: Theory and Experiment, Vol. 2008, No. 10, P10008
  • Bohlin, Ludvig / Edler, Daniel / Lancichinetti, Andrea / Rosvall, Martin (2014), “Community Detection and Visualization of Networks with the Map Equation Framework”, in: Measuring Scholarly Impact: Methods and Practice (Ding, Ying / Rousseau, Ronald / Wolfram, Dietmar, Eds.), pp. 3-34, Springer-Verlag, Berlin
  • Lancichinetti, Andrea / Fortunato, Santo (2009), “Community detection algorithms: a comparative analysis”, Physical Review E, Vol. 80, No. 5, 056117
  • Rosvall, Martin / Bergstrom, Carl T. (2008), “Maps of random walks on complex networks reveal community structure”, Proceedings of the National Academy of Sciences USA, Vol. 105, No. 4, pp. 1118-1123
  • Rosvall, Martin / Axelsson, Daniel / Bergstrom, Carl T. (2009), “The map equation”, The European Physical Journal Special Topics, Vol. 178, No. 1, pp. 13-23
Did you know that you can transform plain old static ggplot graphs into animated ones? Well, you can with the help of the package gganimate by RStudio’s Thomas Lin Pedersen and David Robinson, and the results are amazing! My STATWORX colleagues and I are very impressed by how effortlessly all kinds of geoms are transformed into suuuper smooth animations. That’s why in this post I will provide a short overview of some of the wonderful functionalities of gganimate. I hope you’ll enjoy them as much as we do! Since Valentine’s Day is just around the corner, we’re going to explore the Speed Dating Experiment dataset compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar. Hopefully, we’ll learn about gganimate as well as how to find our Valentine. If you like, you can download the data from Kaggle.

Defining the basic animation: transition_*

How are static plots put into motion? Essentially, gganimate creates data subsets, which are plotted individually and constitute the individual frames, which, when played consecutively, create the basic animation. The results of gganimate are so seamless because gganimate takes care of the so-called tweening for us by calculating data points for transition frames displayed in between frames with actual input data. The transition_* functions define how the data subsets are derived and thus define the general character of any animation. In this blog post we’re going to explore three types of transitions: transition_states(), transition_reveal() and transition_filter(). But let’s start at the beginning, with transition_states(). Here the data is split into subsets according to the categories of the variable provided to the states argument. If several rows of a dataset pertain to the same unit of observation and should be identifiable as such, a grouping variable defining the observation units needs to be supplied. Alternatively, an identifier can be mapped to any other aesthetic.

Please note: to ensure the readability of this post, all text concerning the interpretation of the speed dating data is written in italics. If you’re not interested in that part, you can simply skip those paragraphs. For the data prep, I’d like to refer you to my GitHub.

First, we’re going to explore what the participants of the Speed Dating Experiment look for in a partner. Participants were asked to rate the importance of attributes in a potential date by allocating a budget of 100 points to several characteristics, with higher values denoting a higher importance. The participants were asked to rate the attributes according to their own views. Further, the participants were asked to rate the same attributes according to the presumed wishes of their same-sex peers, meaning they allocated the points in the way they supposed their average same-sex peer would do. We’re going to plot all of these ratings (x-axis) for all attributes (y-axis). Since we want to compare the individual wishes to the individually presumed wishes of peers, we’re going to transition between both sets of ratings. Color always indicates the personal wishes of a participant. A given bubble indicates the rating of one specific participant for a given attribute, switching between one’s own wishes and the wishes assumed for peers.
## Static Plot
# ...characteristic vs. (presumed) rating...
# ...color&size mapped to own rating, grouped by ID
library(ggplot2)  # assumed: packages used by the plots below
library(gganimate)
library(viridis)  # provides scale_color_viridis()

plot1 <- ggplot(df_what_look_for, 
       aes(x = value,
           y = variable,
           color = own_rating, # bubbles are always colored according to own wishes
           size = own_rating,
           group = iid)) + # identifier of observations across states
  geom_jitter(alpha = 0.5, # to reduce overplotting: jittering & alpha
              width = 5) + 
  scale_color_viridis(option = "plasma", # use viridis' plasma scale
                      begin = 0.2, # limit range of used hues
                      name = "Own Rating") +
  scale_size(guide = FALSE) + # no legend for size
  labs(y = "", # no axis label
       x = "Allocation of 100 Points",  # x-axis label
       title = "Importance of Characteristics for Potential Partner") +
  theme_minimal() +  # apply minimal theme
  theme(panel.grid = element_blank(),  # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot
plot1 + 
  transition_states(states = rating) # animate contrast subsets acc. to variable rating  
[Animation: tran-states]
First off, if you’re a little confused as to which state is which, please be patient; we’ll explore dynamic labels in the section about ‘frame variables’.

It’s apparent that different people look for different things in a partner. Yet attractiveness is often prioritized over other qualities. But the importance of attractiveness varies most strongly of all attributes between individuals. Interestingly, people are quite aware that their peers’ ratings might differ from their own views. Further, especially the collective presumptions (= the mean values) about others are not completely off, but of higher variance than the actual ratings. So there is hope for all of us that somewhere out there somebody is looking for someone just as ambitious or just as intelligent as ourselves. However, it’s not always the inner values that count.

gganimate allows us to tailor the details of the animation according to our wishes. With the argument transition_length we can define the relative length of the transition from one data subset to the next, and with state_length how long, relatively speaking, each subset of original data is displayed. Only if the wrap argument is set to TRUE will the last frame get morphed back into the first frame of the animation, creating an endless and seamless loop. Of course, the arguments of different transition functions may vary.
## Animated Plot
# ...replace default arguments
plot1 + 
  transition_states(states = rating,
                    transition_length = 3, # 3/4 of total time for transitions
                    state_length = 1, # 1/4 of time to display actual data
                    wrap = FALSE) # no endless loop
[Animation: tran-states-arguments]

Styling transitions: ease_aes

As mentioned before, gganimate takes care of tweening and calculates additional data points to create smooth transitions between successively displayed points of actual input data. With ease_aes we can control which so-called easing function is used to ‘morph’ original data points into each other. The default argument is used to declare the easing function for all aesthetics in a plot. Alternatively, easing functions can be assigned to individual aesthetics by name. Amongst others, quadratic, cubic, sine and exponential easing functions are available, with the linear easing function being the default. These functions can be customized further by adding a modifier suffix: with -in the function is applied as-is, with -out the function is reversely applied, and with -in-out the function is applied as-is in the first half of the transition and reversed in the second half. Here I played around with an easing function that models the bouncing of a ball.
## Animated Plot
# ...add special easing function
plot1 + 
  transition_states(states = rating) + 
  ease_aes("bounce-in") # bouncy easing function, as-is
[Animation: tran-states-ease]

Dynamic labelling: {frame variables}

To ensure that we, mesmerized by our animations, do not lose the overview, gganimate provides so-called frame variables that supply metadata about the animation as a whole or the previous/current/next frame. The frame variables – when wrapped in curly brackets – are available for string literal interpretation within all plot labels. For example, we can label each frame with the value of the states variable that defines the currently (or soon to be) displayed subset of actual data:
## Animated Plot
# ...add dynamic label: subtitle with current/next value of states variable
plot1 +
  labs(subtitle = "{closest_state}") + # add frame variable as subtitle
  transition_states(states = rating) 
[Animation: tran-states-label]
The set of available variables depends on the transition function. To get both the names and values of the frame variables available for any animation (per default the most recent one), the frame_vars() function can be called.
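For example, a quick sketch (building on plot1 from above):
anim <- animate(plot1 + transition_states(states = rating))
head(frame_vars(anim)) # one row per frame, incl. previous/closest/next state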

Indicating previous data: shadow_*

To accentuate the interconnection of different frames, we can apply one of gganimate’s ‘shadows’. Per default, shadow_null(), i.e. no shadow, is added to animations. In general, shadows display data points of past frames in different ways: shadow_trail() creates a trail of evenly spaced data points, while shadow_mark() displays all raw data points. We’ll use shadow_wake() to create a little ‘wake’ of past data points which are gradually shrinking and fading away. The argument wake_length allows us to set the length of the wake, relative to the total number of frames. Since the wakes overlap, the transparency of geoms might need adjustment. Obviously, for plots with lots of data points, shadows can impede the intelligibility.
plot1B + # same as plot1, but with alpha = 0.1 in geom_jitter
  labs(subtitle = "{closest_state}") +  
  transition_states(states = rating) +
  shadow_wake(wake_length = 0.5) # adding shadow
[Animation: tran-states-shadow]

The benefits of transition_*

While I simply love the visuals of animated plots, I think they also offer actual improvement. Compared to faceting, transition_states has the advantage of making it easier to track individual observations through transitions. Further, no matter how many subplots we want to explore, we neither need lots of space and clutter our document with thousands of plots, nor do we have to put up with tiny plots. Similarly, transition_reveal holds additional value for time series by not only mapping a time variable on one of the axes but also to actual time: the transition length between frames of actual input data corresponds to the relative time differences of the mapped events. To illustrate this, let’s take a quick look at the ‘success’ of all the speed dates across the different speed dating events:
## Static Plot
# ... date of event vs. interest in second date for women, men or couples
plot2 <- ggplot(data = df_match,
                aes(x = date, # date of speed dating event
                    y = count, # interest in 2nd date
                    color = info, # which group: women/men/reciprocal
                    group = info)) +
  geom_point(aes(group = seq_along(date)), # needed, otherwise transition doesn't work
             size = 4, # size of points
             alpha = 0.7) + # slightly transparent
  geom_line(aes(lty = info), # line type according to group
            alpha = 0.6) + # slightly transparent
  labs(y = "Interest After Speed Date",
       x = "Date of Event",
       title = "Overall Interest in Second Date") +
  scale_linetype_manual(values = c("Men" = "solid", # assign line types to groups
                                   "Women" = "solid",
                                   "Reciprocal" = "dashed"),
                        guide = FALSE) + # no legend for linetypes
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # y-axis in %
  scale_color_manual(values = c("Men" = "#2A00B6", # assign colors to groups
                                "Women" = "#9B0E84",
                                "Reciprocal" = "#E94657"),
                     name = "") +
  theme_minimal() + # apply minimal theme
  theme(panel.grid = element_blank(), # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot
plot2 +
  transition_reveal(along = date) 
[Animation: trans-reveal]
Displayed are the percentages of women and men who were interested in a second date after each of their speed dates, as well as the percentage of couples in which both partners wanted to see each other again.

Most of the time, women were more interested in second dates than men. Further, the attraction between dating partners often didn’t go both ways: the instances in which both partners of a couple wanted a second date were always far less frequent than the general interest of either men or women. While it’s hard to identify the most romantic time of the year, according to the data there seemed to be a slack in romance in early autumn. Maybe everybody was still heartbroken over their summer fling? Fortunately, Valentine’s Day is in February.

Another very handy option is transition_filter(); it’s a great way to present selected key insights of your data exploration. Here the animation browses through data subsets defined by a series of filter conditions. It’s up to you which data subsets you want to stage. The data is filtered according to logical statements defined in transition_filter(). All rows for which a statement holds true are included in the respective subset. We can assign names to the logical expressions, which can be accessed as frame variables. If the keep argument is set to TRUE, the data of previous frames is permanently displayed in later frames.

I want to explore whether one’s own characteristics relate to the attributes one looks for in a partner. Do opposites attract? Or do birds of a feather (want to) flock together? Displayed below are the importances the speed dating participants assigned to different attributes of a potential partner. Contrasted are subsets of participants who were rated especially funny, attractive, sincere, intelligent or ambitious by their speed dating partners. The rating scale went from 1 = low to 10 = high, thus I assume values of >7 to be rather outstanding.
## Static Plot (without geom)
# ...importance ratings for different attributes
plot3 <- ggplot(data = df_ratings, 
                 aes(x = variable, # different attributes
                     y = own_rating, # importance regarding potential partner
                     size = own_rating, 
                     color = variable, # different attributes
                     fill = variable)) +
  geom_jitter(alpha = 0.3) +
  labs(x = "Attributes of Potential Partner", # x-axis label
       y = "Allocation of 100 Points (Importance)",  # y-axis label
       title = "Importance of Characteristics of Potential Partner", # title
       subtitle = "Subset of {closest_filter} Participants") + # dynamic subtitle 
  scale_color_viridis_d(option = "plasma", # use viridis scale for color 
                        begin = 0.05, # limit range of used hues
                        end = 0.97,
                        guide = FALSE) + # don't show legend
  scale_fill_viridis_d(option = "plasma", # use viridis scale for filling
                       begin = 0.05, # limit range of used hues
                       end = 0.97, 
                       guide = FALSE) + # don't show legend
  scale_size_continuous(guide = FALSE) + # don't show legend
  theme_minimal() + # apply minimal theme
  theme(panel.grid = element_blank(),  # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot 
# ...show ratings for different subsets of participants
plot3 +
  geom_jitter(alpha = 0.3) +
  transition_filter("More Attractive" = Attractive > 7, # adding named filter expressions
                    "Less Attractive" = Attractive <= 7,
                    "More Intelligent" = Intelligent > 7,
                    "Less Intelligent" = Intelligent <= 7,
                    "More Fun" = Fun > 7,
                    "Less Fun" = Fun <= 5) 
[Animation: trans-filter]
Of course, the number of extraordinarily attractive, intelligent or funny participants is relatively low. Surprisingly, there seem to be little differences between what the average low vs. high scoring participants look for in a partner. Rather the lower scoring group includes more people with outlying expectations regarding certain characteristics. Individual tastes seem to vary more or less independently from individual characteristics.

Styling the (dis)appearance of data: enter_* / exit_*

Especially if displayed subsets of data do not or only partially overlap, it can be favorable to underscore this visually. A good way to do this are the enter_*() and exit_*() functions, which enable us to style the entry and exit of data points, which do not persist between frames. There are many combinable options: data points can simply (dis)appear (the default), fade (enter_fade()/exit_fade()), grow or shrink (enter_grow()/exit_shrink()), gradually change their color (enter_recolor()/exit_recolor()), fly (enter_fly()/exit_fly()) or drift (enter_drift()/exit_drift()) in and out. We can use these stylistic devices to emphasize changes in the databases of different frames. I used exit_fade() to let further not included data points gradually fade away while flying them out of the plot area on a vertical route (y_loc = 100), data points re-entering the sample fly in vertically from the bottom of the plot (y_loc = 0):
## Animated Plot 
# ...show ratings for different subsets of participants
plot3 +
  geom_jitter(alpha = 0.3) +
  transition_filter("More Attractive" = Attractive > 7, # adding named filter expressions
                    "Less Attractive" = Attractive <= 7,
                    "More Intelligent" = Intelligent > 7,
                    "Less Intelligent" = Intelligent <= 7,
                    "More Fun" = Fun > 7,
                    "Less Fun" = Fun <= 5) +
  enter_fly(y_loc = 0) + # entering data: fly in vertically from bottom
  exit_fly(y_loc = 100) + # exiting data: fly out vertically to top...
  exit_fade() # ...while color is fading
[Animation: trans-filter-exit-enter]

Finetuning and saving: animate() & anim_save()

Gladly, gganimate makes it very easy to finalize and save our animations. We can pass our finished gganimate object to animate() to, amongst other things, define the number of frames to be rendered (nframes) and/or the rate of frames per second (fps) and/or the number of seconds the animation should last (duration). We also have the option to define the device in which the individual frames are rendered (the default is device = “png”, but all popular devices are available). Further, we can define arguments that are passed on to the device, like e.g. width or height. Note that simply printing a gganimate object is equivalent to passing it to animate() with default arguments. If we plan to save our animation, the renderer argument is of importance: the function anim_save() lets us effortlessly save any gganimate object, but only if it was rendered using one of the functions magick_renderer() or the default gifski_renderer(). The function anim_save() works quite straightforwardly. We can define filename and path (defaults to the current working directory) as well as the animation object (defaults to the most recently created animation).
# create a gganimate object
gg_animation <- plot3 +
  transition_filter("More Attractive" = Attractive > 7,
                    "Less Attractive" = Attractive <= 7) 

# adjust the animation settings 
animate(gg_animation, 
        width = 900, # 900px wide
        height = 600, # 600px high
        nframes = 200, # 200 frames
        fps = 10) # 10 frames per second

# save the last created animation to the current directory 
anim_save("my_animated_plot.gif")

Conclusion (and a Happy Valentine’s Day)

I hope this blog post gave you an idea of how to use gganimate to upgrade your own ggplots to beautiful and informative animations. I only scratched the surface of gganimate’s functionalities, so please do not mistake this post for an exhaustive description of the presented functions or the package. There is much out there for you to explore, so don’t wait any longer and get started with gganimate! But even more important: don’t wait on love. The speed dating data shows that most likely there’s someone out there looking for someone just like you. So from everyone here at STATWORX: Happy Valentine’s Day!
[Animation: heart gif]
## 8 bit heart animation
animation2 <- ggplot(data = df_eight_bit_heart %>% # includes color and x/y position of pixels 
         dplyr::mutate(id = row_number()), # create row number as ID  
                aes(x = x, 
                    y = y,
                    color = color,
                    group = id)) +
  geom_point(size = 18, # depends on height & width of animation
             shape = 15) + # square
  scale_color_manual(values = c("black" = "black", # map values of color to actual colors
                                "red" = "firebrick2",
                                "dark red" = "firebrick",
                                "white" = "white"),
                     guide = FALSE) + # do not include legend
  theme_void() + # remove everything but geom from plot
  transition_states(-y, # reveal from high to low y values 
                    state_length = 0) +
  shadow_mark() + # keep all past data points
  enter_grow() + # new data grows 
  enter_fade() # new data starts without color

animate(animation2, 
        width = 250, # depends on size defined in geom_point 
        height = 250, # depends on size defined in geom_point 
        end_pause = 15) # pause at end of animation
Nearly one year ago, I analyzed how we use emojis in our Slack messages. Since then, STATWORX has grown, and we are a lot more people now! So, I just wanted to check if something has changed. Last time, I did not show our custom emojis, since they are, of course, not available in the fonts I used. This time, I will incorporate them with geom_image(). It is part of the ggimage package from Guangchuang Yu, which you can find here on his GitHub. With geom_image() you can include images like .png files in your ggplot.

What changed since last year?

Let’s first have a look at the number of emojis we are using. In the plot below, you can see that since my last analysis in October 2018 (red line) the number of emojis used has been rising. Not as much as I thought it would, but compared to the previous period, we now have more days with over 100 emojis used per day!
Like last time, our top emoji is 👍, followed by 😂 and 😄. But sneaking in at number ten is one of our custom emojis: party_hat_parrot!
[Plot: top-10-used-emojis]

How to include custom images?

In my previous blog post, I hid all our custom emojis behind ❓ since they were not part of the font. It did not occur to me to use their images, even though the package is from the same creator! So, to make up for my ignorance, I grabbed the top 30 custom emojis and downloaded their images from our Slack servers, saved them as .png and made sure they are all roughly the same size. To use geom_image() I just added the path of the images to my data (the … is just an abbreviation for the complete path).
                NAME COUNT REACTION IMAGE
1:          alnatura    25       63 .../custom/alnatura.png
2:              blog    19       20 .../custom/blog.png
3:           dataiku    15       22 .../custom/dataiku.png
4: dealwithit_parrot     3      100 .../custom/dealwithit_parrot.png
5:      deananddavid    31       18 .../custom/deananddavid.png
This would have been enough to just add the images now, but since I wanted the NAME attribute as a label, I included geom_text_repel from the ggrepel library. This makes handling of non-overlapping labels much simpler!
ggplot(custom_dt, aes( x = REACTION, y = COUNT, label = NAME)) +
  geom_image(aes(image = IMAGE), size = 0.04) +
  geom_text_repel(point.padding = 0.9, segment.alpha = 0) +
  xlab("as reaction") +
  ylab("within message") +
  theme_minimal()
Usually, if a label is “too far” away from the marker, geom_text_repel includes a line to indicate where the labels belong. Since these lines would overlap the images, I used segment.alpha = 0 to make them invisible. With point.padding = 0.9 I gave the labels a bit more space, so it looks nicer. Depending on the size of the plot, this needs to be adjusted. In the plot, one can see our usage of emojis within a message (y-axis) and as a reaction (x-axis).
To combine the emoji font and custom emojis, I used the following data and code — really… why did I not do this last time? 🤔 Since the UNICODE is NA when I want to use the IMAGE, there is no “double plotting”.
                     EMOJI REACTION COUNT  SUM PLACE    UNICODE   IMAGE
 1:                    :+1:     1090     0 1090     1 U0001f44d
 2:                   :joy:      609   152  761     2 U0001f602
 3:                 :smile:       91   496  587     3 U0001f604
 4:                    :-1:      434     9  443     4 U0001f44e
 5:                  :tada:      346    38  384     5 U0001f389
 6:                  :fire:      274    17  291     6 U0001f525
 7: :slightly_smiling_face:        1   250  251     7 U0001f642
 8:                  :wink:       27   191  218     8 U0001f609
 9:                  :clap:      201    13  214     9 U0001f44f
10:      :party_hat_parrot:      192     9  201    10       <NA>  .../custom/party_hat_parrot.png
quartz()
ggplot(plotdata2, aes(x = PLACE, y = SUM, label = UNICODE)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(family="EmojiOne") +
  xlab("Most popular emojis") +
  ylab("Number of usage") +
  scale_fill_brewer(palette = "Paired") +
  geom_image(aes(image = IMAGE), size = 0.04) +
  theme_minimal()
ps <- grid.export(paste0(main_path, "plots/top-10-used-emojis.svg"), addClass = T) # grid.export() is from the gridSVG package
dev.off()

The meaning behind emojis

Now we know what our top emojis are. But what is the rest of the world doing? Thanks to Emojimore for providing me with this overview! On their site, you can find meanings for a lot more emojis.
Behind each of our custom emojis is a story as well. For example, all the food emojis help us every day to decide where to eat and provide information on what everyone is planning for lunch! And if you do not agree with the decision, just react with sadphan to let the others know about your feelings. If you want to know the whole stories behind all custom emojis or even help create new ones, then maybe you should join our team — check out our available job offers here!

Networks are everywhere. We have social networks like Facebook, competitive product networks or various networks in an organisation. Also, for STATWORX it is a common task to unveil hidden structures and clusters in a network and visualize them for our customers. In the past, we used the tool Gephi to visualize our results of network analysis. Impressed by this outstandingly pretty and interactive visualization, our idea was to find a way to create visualizations of the same quality directly in R and present them to our customers in an R Shiny app.

Our first intention was to visualize networks with igraph, a package that contains a collection of network analysis tools with an emphasis on efficiency, portability and ease of use. We used it in the past in our helfRlein package for the function getnetwork, described in this blog post. Unfortunately, while igraph can create beautiful network visualizations, they’re solely static. To build interactive network visualizations, you can use particular packages in R that all rely on javascript libraries. Our favorite package for this visualization task is visNetwork, which uses the vis.js javascript library and is based on htmlwidgets. It’s compatible with Shiny, R Markdown documents and the RStudio viewer. visNetwork has many adjustments to personalize your network, pretty output and good performance, which is very important when using the output in Shiny. Furthermore, you can find excellent documentation here.

So let us go through the steps that have to be done from your data basis up to the perfect visualization in R Shiny. To do so, we use the Les Misérables characters network as an example in the following. This undirected network contains co-occurrences of characters in Victor Hugo’s novel ‘Les Misérables’. A node represents a character, and an edge between two nodes shows that these two characters appeared in the same chapter of the book. The weight of each link indicates how often such a co-appearance occurred.

Data Preparation

First of all, we have to install the package with install.packages("visNetwork") and load the dataset lesmis. You can find the dataset in the package geomnet. To visualize the network between the Les Miserables characters, the package visNetwork needs two data frames. One for the nodes and one for the edges of the network. Fortunately, our loaded data provides both, and we only have to bring them in the right format.
rm(list = ls())

# Libraries ---------------------------------------------------------------
library(visNetwork)
library(geomnet)
library(igraph)


# Data Preparation --------------------------------------------------------

#Load dataset
data(lesmis)

#Nodes
nodes <- as.data.frame(lesmis[2])
colnames(nodes) <- c("id", "label")

#id has to be the same like from and to columns in edges
nodes$id <- nodes$label

#Edges
edges <- as.data.frame(lesmis[1])
colnames(edges) <- c("from", "to", "width")
The visNetwork() function needs specific column names to detect the right information. For this purpose, edges must be a dataframe with at least one column that indicates in which node an edge starts (from) and where it ends (to). For the nodes, we require at a minimum a unique ID (id) which has to coincide with the from and to entries. Beyond that, the following columns are recognized (a short code sketch using some of them follows after the two lists). Nodes:
  • label: A column that defines how a node is labelled
  • value: Defines the size of a node inside the network
  • group: Assigns a node to a group; this can be a result of a cluster analysis or a community detection
  • shape: Defines how a node is presented. For example as a circle, square or triangle
  • color: Defines the color of a node
  • title: Sets the tooltip, which occurs when you hover over a node (this can be HTML or character)
  • shadow: Defines if a node has a shadow or not (TRUE/FALSE)
Edges:
  • label, title, shadow
  • length, width: Defines the length/width of an edge inside the network
  • arrows: Defines where to set a possible arrow on the edge
  • dashes: Defines if the edges should be dashed or not (TRUE/FALSE)
  • smooth: Smooth lines (TRUE/FALSE)
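For illustration, here is a minimal sketch of adding a few of these optional columns to our prepared data frames (the chosen values are arbitrary examples and not part of the original dataset):
# optional appearance columns for the nodes
nodes$shape  <- "dot"           # render every character as a dot
nodes$title  <- nodes$label     # tooltip shown when hovering over a node
nodes$shadow <- TRUE            # draw a shadow behind each node

# optional appearance columns for the edges
edges$dashes <- FALSE           # solid instead of dashed lines
edges$smooth <- FALSE           # straight edges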
These are the most important settings. They are made for every single node or edge individually. To apply settings to all nodes or edges at once – like a common shape or arrows – you can do so later when you specify the output with visNodes and visEdges. We’ll show you this possibility later on. Additionally, we want to have a more interesting network with groups inside. We’ll highlight the groups later by coloring the nodes in the network. Therefore, we cluster the data with the Louvain community detection method and get a group column:
#Create graph for Louvain
graph <- graph_from_data_frame(edges, directed = FALSE)

# Louvain Community Detection
cluster <- cluster_louvain(graph)

cluster_df <- data.frame(as.list(membership(cluster)))
cluster_df <- as.data.frame(t(cluster_df))
cluster_df$label <- rownames(cluster_df)

#Create group column
nodes <- left_join(nodes, cluster_df, by = "label")
colnames(nodes)[3] <- "group"

Output Options

To give you an impression of the possibilities we have regarding design and functional options when creating our output, we will have a more in-depth look at two presentations of the Les Misérables network. We start with the simplest possibility and only pass the nodes and edges data frames to the function:
visNetwork(nodes, edges)
Using the pipe operator we can customize our network with some other functions like visNodes, visEdges, visOptions, visLayout or visIgraphLayout:
visNetwork(nodes, edges, width = "100%") %>%
  visIgraphLayout() %>%
  visNodes(
    shape = "dot",
    color = list(
      background = "#0085AF",
      border = "#013848",
      highlight = "#FF8000"
    ),
    shadow = list(enabled = TRUE, size = 10)
  ) %>%
  visEdges(
    shadow = FALSE,
    color = list(color = "#0085AF", highlight = "#C62F4B")
  ) %>%
  visOptions(highlightNearest = list(enabled = TRUE, degree = 1, hover = TRUE),
             selectedBy = "group") %>% 
  visLayout(randomSeed = 11)
visNodes and visEdges describe the overall appearance of the nodes and edges in the network. For example, we can set the shape of all nodes or define the colors of the edges. When it comes to publication, rendering the network can take a long time. To deal with this issue, we use the visIgraphLayout function. It decreases plotting time by computing the coordinates in advance and provides all available igraph layouts. With visOptions we can adjust how the network reacts when we interact with it, for example, what happens if we click on a node. visLayout allows us to define the look of the network: should it be a hierarchical one, or do we want to improve the layout with a special algorithm? Furthermore, we can provide a seed (randomSeed), so that the network always looks the same when you load it. These are only some example functions of how we can customize our network; the package provides many more options for customization. For more details, have a look at the documentation.
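As a small add-on, the clusters can also be colored and labelled explicitly with visGroups and visLegend. A minimal sketch building on the objects from above (the hex colors are arbitrary choices, and the group names assume the Louvain output, whose groups are numbered 1, 2, …):
visNetwork(nodes, edges, width = "100%") %>%
  visIgraphLayout() %>%
  visGroups(groupname = "1", color = "#0085AF") %>%
  visGroups(groupname = "2", color = "#00A378") %>%
  visLegend(useGroups = TRUE, position = "right")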

Shiny Integration

To present the interactive results to our customers, we want to integrate them into a Shiny app. Therefore, we prepare the data “offline”, save the nodes and edges files and create the output inside the Shiny app “online”.
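Saving the prepared data is straightforward (a minimal sketch, assuming the nodes and edges data frames from the data preparation above):
save(nodes, file = "nodes.RData")
save(edges, file = "edges.RData")
Here is a minimal example code of the scheme you can use for Shiny: global.R: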
library(shiny)
library(visNetwork)
server.R:
shinyServer(function(input, output) {
  output$network <- renderVisNetwork({
    load("nodes.RData")
    load("edges.RData")

    visNetwork(nodes, edges) %>%
      visIgraphLayout()
  })
})
ui.R:
shinyUI(
  fluidPage(
    visNetworkOutput("network")
  )
)
A screenshot of our Shiny app illustrates a possible result:

Conclusion

Besides other available packages to visualize networks interactively in R, visNetwork is our absolute favorite. It is a powerful package to create interactive networks directly in R and publish them in Shiny. We can integrate our networks directly into our Shiny application and run it with stable performance when using the visIgraphLayout function. We no longer need external software like Gephi. Did I spark your interest to visualize your own networks? Feel free to use my code, or contact me and visit the GitHub page of the used package here.

References

Knuth, D. E. (1993). “Les Misérables: coappearance network of characters in the novel Les Misérables”. The Stanford GraphBase: A Platform for Combinatorial Computing. Addison-Wesley, Reading, MA.

Data exploration is a critical step in every Data Science project here at STATWORX. In order to discover every insight, we prefer interactive charts over static charts, because they allow us to dig deeper into the data. In today’s blog post, we show you how to improve the interactivity of Plotly histograms in Python, as you can see in the graphics below. You can find the code in our GitHub repo.

TL;DR: Quick Summary

  1. We show you the default Plotly histogram and its unexpected behaviour
  2. We improve the interactive histogram to match our human intuition and show you the code
  3. We explain the code line by line and give you some more context on the implementation

The problem of the default Plotly histogram

default histogram in plotly
This graphic shows you the behavior of a Plotly histogram when you zoom into a specific region: the bars just get bigger/wider. This is not what we expect! If we zoom into a plot, we want to dig deeper and see more fine-grained information for the selected area. In this specific case, this means that we expect the histogram to show bins for the selected region. So, the histogram needs to be rebinned.

Improving the interactive histogram

rebinned histogram in plotly
In this graphic, you can see our expected end result. If the user selects a new x-region, we want to rebin the histogram based on the new x-range. In order to implement this behavior, we need to change the graphic when the selected x-range changes. This is a new feature of plotly.py 3.0.0 which was brought to the community by the great Jon Mease. You can read more about the new plotly.py 3.0.0 features in this announcement. Disclaimer: The implementation only works inside a Jupyter Notebook or JupyterLab because it needs an active Python kernel for the callback. It does not work in a standalone Python script and it does not work in a standalone HTML file. The implementation idea is the following: we create an interactive figure, and every time the x-axis changes, we update the underlying data of the histogram. Whoops, why didn’t we change the binning? Plotly histograms automatically handle the binning for the underlying data. Therefore, we can let the histogram do the work and just change the underlying data. This is a little bit counterintuitive but saves a lot of work.

Glimpse of the full code

So, finally here comes the relevant code without unnecessary imports etc. If you want to see the full code, please check this GitHub file.
x_values = np.random.randn(5000)
figure = go.FigureWidget(data=[go.Histogram(x=x_values,
                                            nbinsx=10)],
                         layout=go.Layout(xaxis={'range': [-4, 4]},
                                          bargap=0.05))
histogram = figure.data[0]

def adjust_histogram_data(xaxis, xrange):
    x_values_subset = x_values[np.logical_and(xrange[0] <= x_values,
                                              x_values <= xrange[1])]
    histogram.x = x_values_subset
figure.layout.xaxis.on_change(adjust_histogram_data, 'range')

Detailed explanations for each line of code

In the following, we will provide some detailed insights and explanations for each line of code.

1) Initializing the x_values

x_values = np.random.randn(5000)
We get 5000 new random x_values which are distributed according to a standard normal distribution. The values are created by the great numpy library, which is abbreviated as np.

2) Creating the figure

figure = go.FigureWidget(data=[go.Histogram(x=x_values,
                                            nbinsx=10)],
                         layout=go.Layout(xaxis={'range': [-4, 4]},
                                          bargap=0.05))
We generate a new FigureWidget instance. The FigureWidget object is the new “magic object” which was introduced by Jon Mease. You can display it within Jupyter Notebook or JupyterLab like a normal Plotly figure, but there are some advantages. You can manipulate the FigureWidget in various ways from Python, and you can also listen for some events and execute some more Python code, which gives you a lot of options. This flexibility is the great benefit which Jon Mease envisioned. The FigureWidget receives the attributes data and layout. As data, we specify a list of all the traces (read: visualizations) that we want to show. In our case, we only want to show a single histogram. The x values for the histogram are our x_values. Also, we set the maximum number of bins with nbinsx to 10. Plotly will use this as a guideline but will not force the plot to contain exactly nbinsx bins. As layout, we specify a new layout object and set the range of the xaxis to [-4, 4]. With the bargap argument, we make the layout show a gap between individual bars. This helps us to see where a bar stops and the next one begins. In our case, this value is set to 0.05.

3) Saving a reference to the histogram

histogram = figure.data[0]
We get the reference to the histogram because we want to manipulate the histogram in the last step. We don’t actually get the data but a reference to the Plotly trace object. This Plotly syntax might be a little bit misleading but it is consistent with the definition of the figure where we also specified the “traces” as “data”.

4) Overview of the callback

def adjust_histogram_data(xaxis, xrange):
    x_values_subset = x_values[np.logical_and(xrange[0] <= x_values,
                                              x_values <= xrange[1])]
    histogram.x = x_values_subset
figure.layout.xaxis.on_change(adjust_histogram_data, 'range')
In this chunk, we first define what should happen when the xaxis changes. Afterwards, we register the callback function adjust_histogram_data. We will break this down further, but we will start with the last line, because at runtime it is executed before the callback body ever runs. Therefore, it makes more sense to read the code in this reverse execution order. A little bit more background on the callback: the code within the callback method adjust_histogram_data will be called when the xaxis.on_change event actually happens because the user interacted with the chart. But first, Python needs to register the adjust_histogram_data callback method. Much later, when the callback event xaxis.on_change occurs, Python will execute the callback method adjust_histogram_data and its contents. Go back and read this section 3-4 times until you fully understand it.

4a) Registering the callback

figure.layout.xaxis.on_change(adjust_histogram_data, 'range')
In this line, we tell the figure object to always call the callback function adjust_histogram_data whenever the xaxis changes. Please note that we only specify the name of the function adjust_histogram_data without the round brackets (). This is because we only need to pass the reference to the function and do not want to call the function. This is a common error and source of confusion. Also, we specify that we are only interested in the range attribute. Therefore, the figure object will only send this information to the callback function. But what does the callback function look like, and what is its task? Those questions are answered in the next steps:

4b) Defining the callback signature

def adjust_histogram_data(xaxis, xrange):
In this line, we start to define our callback function. The first argument which is passed to the function is the xaxis object which initiated the callback. This is a Plotly convention, and we just need to put this placeholder here although we don’t use it. The second argument is the xrange, which contains the lower and upper limit of the new xrange configuration. You might wonder: “Where do the arguments xaxis and xrange come from?” Those arguments are automatically provided by the figure when the callback gets called. When you use callbacks for the first time, this might seem like opaque magic. But you will get used to it…

4c) Updating the x_values

x_values_subset = x_values[np.logical_and(xrange[0] <= x_values,
                                          x_values <= xrange[1])]
In this line, we define our new x_values, which in most cases are a subset of the original x_values. However, if the lower and upper limit are very far away from each other, we might end up selecting all the original x_values. So the subset is not always a strict subset. The lower limit of the xrange is defined by xrange[0] and the upper limit by xrange[1]. In order to select the subset of the x_values which lies within the lower and upper limit of the xrange, we use the logical_and function from numpy. There are multiple ways to select data subsets in Python. For example, you could also do this via pandas selectors if you use pandas dataframes/series.

4d) Updating the histogram data

histogram.x = x_values_subset
In this line, we update the underlying data of the histogram and set it to the new x_values_subset. This will trigger the update of the histogram and the automatic rebinning. histogram is the reference which we created in step 3 of the code; this is where we need it.

Wrapping up

In this blog post, we showed you how to improve the default Plotly histogram via interactive binning. We gave you the code and explained every line of the implementation. We hope that you were able to follow along and gained some good understanding of the new possibilities thanks to plotly.py 3.0.0. Maybe you sometimes have the feeling that you cannot understand the code that you find on the internet or Stackoverflow. If you want intuitive explanations for everything you need to know in the world of Data Science, consider some of our open courses which are taught by our Data Science experts here at STATWORX.

Introduction

At STATWORX we love beautiful plots. One of my favorite plotting libraries is Plotly. It has been developed by the company of the same name since 2012. Plotly.js is a high-level javascript library for interactive graphics and offers wrappers for a diverse range of languages, like Python, R or Matlab. Furthermore, it is open source and licensed under the MIT license; therefore, it can be used in a commercial context. Plotly offers more than 30 different chart types. Another reason we at STATWORX use Plotly extensively is that it can be easily integrated into web-based frameworks, like Dash or R Shiny.

How does it work?

A Plotly plot is based on the following three main elements: Data, Layout and Figure.

Data

The Data object can contain several traces. For example, in a line chart with several lines, each line is represented by a different trace. Accordingly, the Data object contains not only the data which should be plotted but also the specification of how the data should be plotted.

Layout

The Layout object defines everything that is not related to the data. It contains elements like the title, axis titles or the background color. However, you can also add annotations or shapes with the layout object.

Figure

The Figure object includes both data and layout. It creates our final figure for plotting, which is just a simple dictionary-like object. All figures are built with plotly.js, so in the end, the Python API only interacts with the plotly.js library.

Application

Let’s visualize some data. For that purpose we will use the LA Metro Bike Share dataset, which is hosted by the city of Los Angeles and contains anonymized Metro Bike Share trip data. In the following section we will use Plotly for Python and compare it later on with the R implementation.

Creating our first line plot with Plotly

First, we will generate a line plot which shows the number of rented bikes over different dates, differentiated by passholder type. Thus, we first have to aggregate our data before we can plot it. As shown below, we define our different traces. Each trace contains the number of rented bikes for a specific passholder type. For line plots, we use the Scatter() function from plotly.graph_objs. It is used for scatter and line plots alike; we define how it is displayed by setting the mode parameter accordingly. Those traces are unified as a list in our data object. Our layout object consists of a dictionary, where we define the main title and the axis titles. At last, we put our data and layout objects together as a figure object.
import pandas as pd
import plotly.graph_objs as go
import plotly.plotly as py

df = pd.read_pickle(path="LA_bike_share.pkl")

rental_count = df.groupby(["Start_Date", "Passholder_Type"]).size().reset_index(name ="Total_Count")

trace0 = go.Scatter(
    x=rental_count.query("Passholder_Type=='Flex Pass'").Start_Date,
    y=rental_count.query("Passholder_Type=='Flex Pass'").Total_Count,
    name="Flex Pass",
    mode="lines",
    line=dict(color="#013848")
)
trace1 = go.Scatter(
    x=rental_count.query("Passholder_Type=='Monthly Pass'").Start_Date,
    y=rental_count.query("Passholder_Type=='Monthly Pass'").Total_Count,
    name="Monthly Pass",
    mode="lines",
    line=dict(color="#0085AF")
)
trace2 = go.Scatter(
    x=rental_count.query("Passholder_Type=='Walk-up'").Start_Date,
    y=rental_count.query("Passholder_Type=='Walk-up'").Total_Count,
    name="Walk-up",
    mode="lines",
    line=dict(color="#00A378")
)
data = [trace0,trace1,trace2]

layout = go.Layout(title="Number of rented bikes over time",
                   yaxis=dict(title="Number of rented bikes", 
                              zeroline=False),
                   xaxis=dict(title="Date",
                              zeroline = False)
                  )

fig = go.Figure(data=data, layout=layout)

Understanding the structure behind graph_objs

If we output the figure-object, we will get the following dictionary-like object.
Figure({
    'data': [{'line': {'color': '#013848'},
              'mode': 'lines',
              'name': 'Flex Pass',
              'type': 'scatter',
              'uid': '5d8c0781-4592-4d19-acd9-a13a22431ccd',
              'x': array([datetime.date(2016, 7, 7), datetime.date(2016, 7, 8),
                          datetime.date(2016, 7, 9), ..., datetime.date(2017, 3, 29),
                          datetime.date(2017, 3, 30), datetime.date(2017, 3, 31)], dtype=object),
              'y': array([ 61,  93, 113, ...,  52,  36,  40])},
             {'line': {'color': '#0085AF'},
              'mode': 'lines',
              'name': 'Monthly Pass',
              'type': 'scatter',
              'uid': '4c4c76b9-c909-44b7-8e8b-1b0705fa2491',
              'x': array([datetime.date(2016, 7, 7), datetime.date(2016, 7, 8),
                          datetime.date(2016, 7, 9), ..., datetime.date(2017, 3, 29),
                          datetime.date(2017, 3, 30), datetime.date(2017, 3, 31)], dtype=object),
              'y': array([128, 251, 308, ..., 332, 312, 301])},
             {'line': {'color': '#00A378'},
              'mode': 'lines',
              'name': 'Walk-up',
              'type': 'scatter',
              'uid': '8303cfe0-0de8-4646-a256-5f3913698bd9',
              'x': array([datetime.date(2016, 7, 7), datetime.date(2016, 7, 8),
                          datetime.date(2016, 7, 12), ..., datetime.date(2017, 3, 29),
                          datetime.date(2017, 3, 30), datetime.date(2017, 3, 31)], dtype=object),
              'y': array([  1,   1,   1, ..., 122, 133, 176])}],
    'layout': {'title': {'text': 'Number of rented bikes over time'},
               'xaxis': {'title': {'text': 'Date'}, 'zeroline': False},
               'yaxis': {'title': {'text': 'Number of rented bikes'}, 'zeroline': False}}
})
In theory, we could build those dictionaries or change the entries by hand, without using plotly.graph_objs. However, it is much more convenient to use graph_objs than to write dictionaries. In addition, we can call help on those functions, see which parameters are available for which chart type, and get errors with more details if something goes wrong. There is also the possibility to export the figure object as JSON and import it, for example, in R.

Displaying our plot

Nonetheless, we don’t want a JSON file but rather an interactive graph. We now have two options: either we publish it online, as Plotly provides a web service for hosting graphs including a free plan, or we create the graphs offline. This way, we can display them in a Jupyter Notebook or save them as a standalone HTML. In order to display our plot in a Jupyter Notebook, we need to execute the following code
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)
at the beginning of each Notebook. Finally, we can display our plot with iplot(fig). Before publishing it online, we first need to set our credentials with
plotly.tools.set_credentials_file(username='user.name', api_key='api.key')
and use py.plot(fig, filename = 'basic-plot', auto_open=True) instead of iplot(fig). The following graph is published online on Plotly’s platform and embedded as an inline frame.
The chart above is fully interactive, which has multiple advantages:
  • Select and deselect different lines
  • Automatic rescaling of the y-axis when lines are deselected
  • Hover information with the exact numbers and dates
  • Zoom in and out with self-adjusting date ticks
  • Different chart modes and the ability to toggle additional options, like spike lines
  • The possibility to include a range slider or buttons
The graph shows a fairly clear weekly pattern, with Monthly Passholders having their high during the workweek, while Walk-ups are more active on the weekend. Apart from some unusual spikes, the number of rented bikes is higher for Monthly Passholders than for Walk-ups.

Visualizing the data as a pie chart

The next question is: how does the total duration look for the different passholder types? First, we need to aggregate our data accordingly. This time we will build a pie chart in order to get the share of the total duration for each passholder type. As we previously did with the line chart, we must first generate a trace object, this time using Pie() from graph_objs. The arguments are different now: we have labels and values instead of x and y. We’re also able to determine which hover information we want to display, add custom information with hovertext, or customize the tooltip completely with hovertemplate. Afterward, the trace object goes into go.Figure() in the form of a list.
share_duration = df.groupby("Passholder_Type").sum().reset_index()
colors = ["#013848", "#0085AF", "#00A378"]
trace = go.Pie(labels=share_duration.Passholder_Type,
               values=share_duration.Duration,
               marker=dict(colors=colors,
                           line=dict(color='white', width=1)),
               hoverinfo="label+percent"
              )
fig = go.Figure(data=[trace])
The pie chart shows us that 59% of the total duration is caused by Walk-ups. Thus, we could assume that the average duration for Walk-ups is higher than for Monthly Passholders.

There is one more thing: figure factory

Now, let’s plot the distribution of the average daily duration. For that we use the create_distplot() function from the figure_factory. The figure factory module contains wrapper functions that create unique chart types which are not implemented in the native plotly.js library, like bullet charts, dendrograms or quiver plots. Thus, they are not available for other languages, like R or Matlab. However, those functions also deviate from the structure for building a Plotly graph we discussed above and are also not consistent within figure_factory. By default, create_distplot() creates a plot with a KDE curve, histogram and rug; each of these can be removed by setting show_curve, show_hist or show_rug to False. First, we create a list with our data as hist_data, in which every entry is displayed as a distribution plot on its own. Optionally, we can define group labels, colors or a rug text, which is displayed as hover information on every rug entry.
import plotly.figure_factory as ff

mean_duration=df.groupby(["Start_Date", "Passholder_Type"]).mean().reset_index()

hist_data = [mean_duration.query("Passholder_Type=='Flex Pass'").Duration,
             mean_duration.query("Passholder_Type=='Monthly Pass'").Duration,
             mean_duration.query("Passholder_Type=='Walk-up'").Duration]

group_labels = ["Flex Pass", "Monthly Pass", "Walk-up"]

rug_text = [mean_duration.query("Passholder_Type=='Flex Pass'").Start_Date,
            mean_duration.query("Passholder_Type=='Monthly Pass'").Start_Date,
            mean_duration.query("Passholder_Type=='Walk-up'").Start_Date]

colors = ["#013848", "#0085AF", "#00A378"]


fig = ff.create_distplot(hist_data, group_labels, show_hist=False, 
                         rug_text=rug_text, colors=colors)
As we assumed, Walk-ups have a higher average duration than Monthly or Flex Passholders. The average daily duration for Walk-ups peaks at around 0.6 hours, while Monthly and Flex Passholders peak at 0.18 and 0.2 hours, respectively. Also, the distribution for Walk-ups is much flatter, with a fat right tail. Thanks to the rug, we can see that for Flex Pass there are some days with a very high average duration, and due to the hover information, we can immediately detect which days have an unusually high average renting duration. The average duration on February 2, 2017, was 1.57 hours. Next, we could dig deeper and have a look at the possible reasons for such an unusual activity, for example a special event or the weather.

Plotly with R

As mentioned in the beginning, Plotly is available for many languages. At STATWORX, we’re using Plotly mainly in R, especially if we’re creating a dashboard with R Shiny. However, the syntax is slightly different, as the R implementation utilizes R’s pipe-operator. Below, we create the same barplot in Python and in R. In Python, we aggregate our data with pandas, create different traces for every unique characteristic of Trip Route Category, specify that we want to create a stacked bar chart with our different traces and assemble our data and layout object with go.Figure().
total_count = df.groupby(["Passholder_Type", "Trip_Route_Category"]).size().reset_index(name="Total_count")

trace0 = go.Bar(
    x=total_count.query("Trip_Route_Category=='Round Trip'").Passholder_Type,
    y=total_count.query("Trip_Route_Category=='Round Trip'").Total_count,
    name="Round Trip",
    marker=dict(color="#09557F"))
trace1 = go.Bar(
    x=total_count.query("Trip_Route_Category=='One Way'").Passholder_Type,
    y=total_count.query("Trip_Route_Category=='One Way'").Total_count,
    name="One Way",
    marker=dict(color="#FF8000"))
data = [trace0, trace1]

layout = dict(barmode="stack")

fig = go.Figure(data=data, layout=layout)
With R, we can aggregate the data with dplyr and already start our pipe there. Afterward, we pipe the plot_ly() function onto it, having thereby already specified which data frame we want to use. Within plot_ly(), we can directly address the column names. We don’t have to create several traces and add them with add_trace(), but can define the separation between the different Trip Route Categories with the color argument. In the end, we pipe the layout() function and define the chart as stacked. Thus, by using the pipe operator, the code looks slightly tidier. However, in comparison to the Python implementation, we lose the neat functions of the figure factory.
basic_bar_chart <- df %>% 
  group_by(Passholder_Type, Trip_Route_Category) %>% 
  summarise( Total_count = n()) %>%
  plot_ly(x = ~Passholder_Type, 
          y = ~Total_count,
          color = ~Trip_Route_Category , 
          type = 'bar', 
          marker=list(color=c(rep("#FF8000",3),rep("#09557F",3)))) %>%
  layout( barmode = 'stack')
The bar plot shows that Walk-ups use their rented bikes more often for Round Trips in comparison to Monthly Passholders, which could be a reason for their higher average duration.

Conclusion

I hope I could motivate you to have a look at interactive graphs with Plotly instead of using static seaborn or ggplot plots, especially for hands-on sessions or dashboards. There is also the possibility to create an interactive Plotly chart from a ggplot or Matplotlib object with one additional line of code. With version 3.0 of plotly.py there have been many interesting new features like Jupyter Widgets, the implementation of imperative methods for creating a plot and the possibility to use datashader. Soon you’ll find a blog post here, written by a colleague of mine, on how to implement zoomable histograms with Plotly and Jupyter Widgets and why automatic rebinning makes sense.
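For R users, that one additional line is ggplotly() from the plotly R package, which converts an existing ggplot object into an interactive Plotly chart. A minimal sketch (the data and aesthetics are just placeholders):

library(plotly)
library(ggplot2)

# build an ordinary ggplot first ...
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# ... then convert it into an interactive Plotly chart with one line
ggplotly(p)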


Once upon a time, we at STATWORX used Slack just as a messenger, but then everything changed when the emojis came… Since then, we use them for all kinds of purposes. For example, we take polls with them to see where we will eat lunch, or we capture unforgettable moments by creating new emojis. The possibilities are limitless! But since we use them so much, I was wondering: how often do we use them? And when? And which is the top emoji?! Is it just the thumbsup?

To answer all these questions, I went on a little journey through my emotions.

Getting the data

The first part was to gather data. Well, since nearly every tool has a log history, this was quite simple. I just had to get the history (the last year) of our Slack channels, which was provided as JSON files. These I could easily load into R with the jsonlite package. To get a list of all possible emojis, I found this list from Cal Henderson, who works at Slack. I added our own custom emojis to complete the list.

All that followed was a little loop going through each message and its reactions, counting the occurrences of each emoji. Combined with the timestamp, given in seconds since January 1, 1970, I had my emoji time series data, which looked like this (a condensed sketch of the counting step follows the table below):

                     EMOJI COUNT     TIME     TYPE       DATE
1: :slightly_smiling_face:     1 08:56:05  message 2018-08-10
2: :slightly_smiling_face:     1 17:08:19  message 2018-08-10
3:                  :gift:     2 08:36:04 reaction 2018-08-18
4:                   :joy:     1 13:47:10 reaction 2018-09-03
5:                    :+1:     1 13:56:12 reaction 2018-09-04
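
A condensed sketch of that loading and counting step could look roughly like this (the file name, the emoji list and the column layout are hypothetical, as every Slack export looks a little different):

library(jsonlite)
library(stringr)

# load one exported channel history (Slack writes one JSON file per day)
messages <- fromJSON("channel/2018-08-10.json")

# count how often each emoji occurs in the message texts
emoji_list <- c(":joy:", ":+1:", ":gift:")
counts <- sapply(emoji_list,
                 function(e) sum(str_count(messages$text, fixed(e)), na.rm = TRUE))

# convert the Slack timestamp (seconds since 1970-01-01) into a date
dates <- as.POSIXct(as.numeric(messages$ts), origin = "1970-01-01")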

After evaluating each single text, I found that more than every second chat (57%) either has an emoji in the message or as a reaction. So we really use them a lot!

The right timing

Since the majority was used as a reaction, I summed them up and did not distinguish between messages and reactions. To get a first idea of how often and when we use emojis, I looked at a frequency plot over time (a sketch of such a plot follows below the figure). There are two things to notice: First, we see an increase over time. Well, our company grew, so more people equals more emojis. Second, we tend not to use them during the weekend – who would have thought!?

emoji-over-year
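
Such a frequency plot can be produced directly from the data shown above. A rough sketch (assuming an emoji_data data frame shaped like the table, with one row per emoji occurrence):

library(dplyr)
library(ggplot2)

emoji_data %>%
  group_by(DATE) %>%
  summarise(daily_count = sum(COUNT)) %>%   # total emojis per day
  ggplot(aes(x = DATE, y = daily_count)) +
  geom_col()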

But what about our daily usage? Well, there seem to be some peaks. One peak appears right after wake-up time, another one when we arrive at the office. Here we can distinguish between colleagues with the mobile app and the ones that just use Slack with the desktop app. There is another peak around our lunch break, and it all comes to an end before we go to bed at 22:00.

emoji-over-day

Cause and effect

Since I started this little project, more and more questions popped into my head. For example: Is there a link between an emoji’s usage within a message and as a reaction? Are there words that trigger us to use emojis? To answer the first question, I used the networkD3 package to plot the interactions as a Sankey diagram.

emoji-sankey-category

Here we can see which categories of emojis used within a message lead to which reactions. The most commonly used category is Smileys & People followed by custom. And around 40% stay within the same category.

To answer the second question I made some wordclouds to see which words we use. The ones in orange are those where a reaction followed.

wordcloud

We can see that we use more words with no emoji reaction than the other way around. If we only look at the ones with emoji reactions, we get the following picture.

wordcloud-emoji

It seems that if we ask a question like “heute wer mal bitte …” (“today someone please …”), we get a reaction – mostly positive.

The most common emojis

emoji-category

As we can see in the plot above, we use the emoji categories differently. First of all, Smileys and People are used the most. But if we look at the emoji density – which represents the percentage of unique emojis used within a category (see the small sketch below) – we only use a third of them. On the other hand, nearly 80% of our custom emojis were used.
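
As a quick illustration, the density measure boils down to a single line (a sketch with hypothetical vectors):

# share of a category's emojis that were actually used at least once
emoji_density <- length(unique(used_emojis)) / length(category_emojis)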

To find the most commonly used emojis, I looked at the top 50 emojis within messages and reactions. Also, I stumbled upon two R packages (emoGG and emojifont) which let you add emojis to your ggplots. This sounded wonderful, but there was a catch! Since we work with RStudio on a Mac, I could not use the RStudio plotting device, since it would not show the emojis. After a little bit of research, I found this post here, which suggested using quartz() and grid.export() with .svg for the plotting. The loaded font obviously did not have our own emojis, so I just added them as ❓.
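
Put together, the workaround could look roughly like this (a hedged sketch; the emoji_top data frame and its columns are placeholders for the actual plotting data):

library(ggplot2)
library(emojifont)
library(gridSVG)

load.emojifont("EmojiOne.ttf")  # load the emoji font shipped with emojifont

quartz()  # open a quartz graphics device on macOS
p <- ggplot(emoji_top, aes(x = message, y = reaction)) +
  geom_text(aes(label = label), family = "EmojiOne", size = 6)
print(p)

grid.export("emoji-plot.svg")   # export the current grid scene as an SVG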

emoji-plot

So, as I thought in the beginning, our top emoji is 👍, followed by 😂 and 😄.

Emoji   message   as reaction   total
👍            0           482     482
😂           98           281     379
😄          260            49     309

But since there is a lot going on in the lower left corner, let’s have a closer look! We can see a variety of emojis being used, with a lot of custom ones among them.

emoji-plotdetails

With this package, we can also update the plot showing our daily usage, now with the most commonly used emoji at the respective time.

emoji-over-day-with-icon

Conclusion

With this little project, I just scratched the surface of possibilities to analyse the usage of emojis. If you have more ideas, I would like to see your approach and solutions. Just send me an email at blog@statworx.com. To get started, you can use my code at our GitHub – not containing our whole data of course, but with an example folder with two JSONs. There you can add your own Slack history.

In the last post of this series of the STATWORX Blog, we explored ggplot2’s themes, which control the display of all non-data components of a plot.

ggplot2 comes with several built-in themes that can be easily applied to any plot. However, one can tweak these out-of-the-box styles using the theme() function. We did this last time. Furthermore, one can also create a completely customized theme; that’s what we’re going to do in this post.

How the masters do it

When we create a theme from scratch, we have to define all arguments of the theme function and set the complete argument to TRUE, signaling that the generated object indeed is a complete theme. Setting complete = TRUE also causes all elements to inherit from blank elements. This means that every object that we want to show up in our plots has to be defined in full detail.

If we are a little lazy, instead of defining each and every argument, we can also start with an existing theme and alter only some of its arguments. Actually, this is exactly what the creators of ggplot2 did. While theme_gray() is “the mother of all themes” and fully defined, theme_bw(), for example, builds upon theme_gray(), while theme_minimal() in turn builds on theme_bw().

We can see how tedious it is to define a complete theme if we sneak a peek at the code of theme_grey on ggplot2’s GitHub repository. Further, it is obvious from the code of theme_bw or theme_minimal how much more convenient it is to create a new theme by building on an existing one.

Creating our very own theme

What’s good enough for Hadley and friends is good enough for me. Therefore, I’m going to create my own theme based on my favourite theme, theme_minimal(). As can be seen on the GitHub repo, we can create a new theme as a function that calls an existing theme and alters it via %+replace% theme(), with all alterations defined in theme().

Several arguments are passed along to the function constituting the new theme and to the existing theme called within it: the default sizes for text (base_size), lines in general (base_line_size) and lines pertaining to rect objects (base_rect_size), as well as the font family. To ensure a consistent look, all sizes aren’t defined in absolute terms but relative to the base sizes, using the rel() function. Therefore, for especially big or small plots, the base sizes can be increased or decreased, with all other elements being adjusted automatically.

library(ggplot2)
library(gridExtra)
library(dplyr)

# generating new theme
theme_new <- function(base_size = 11,
                      base_family = "",
                      base_line_size = base_size / 170,
                      base_rect_size = base_size / 170){
  # hand all four base sizes on to theme_minimal
  theme_minimal(base_size = base_size, 
                base_family = base_family,
                base_line_size = base_line_size,
                base_rect_size = base_rect_size) %+replace%
    theme(
      plot.title = element_text(
        color = rgb(25, 43, 65, maxColorValue = 255), 
        face = "bold",
        hjust = 0),
      axis.title = element_text(
        color = rgb(105, 105, 105, maxColorValue = 255),
        size = rel(0.75)),
      axis.text = element_text(
        color = rgb(105, 105, 105, maxColorValue = 255),
        size = rel(0.5)),
      panel.grid.major = element_line(
        rgb(105, 105, 105, maxColorValue = 255),
        linetype = "dotted"),   
      panel.grid.minor = element_line(
        rgb(105, 105, 105, maxColorValue = 255),
        linetype = "dotted", 
        size = rel(4)),   
      
      complete = TRUE
    )
}

Other than in theme_minimal(), I’m decreasing the base size to 11 and set the base line size and base rect size to the base size divided by 170. I don’t change the font family. The plot title is changed to a bold, dark blue font in the set base size and is left-aligned. Axis titles and axis text are set to 75% and 50% of the base size, while their colour is changed to a light grey. Finally, the lines of the grid are defined to be dotted and light grey, with the major grid lines having the base line size and the minor grid lines having four times this size.

The result looks like this:

# base plot
base_plot <- data.frame(x = rnorm(n = 100, 1.5, 2),
                        y = rnorm(n = 100, 1, 2),
                        z = c(rnorm(n = 60, 0.5, 2), rnorm(n = 40, 5, 3))) %>%
  ggplot(.) +
  geom_jitter(aes(x = x, y = y, color = z, size = z), 
              alpha = 0.5) +
  geom_jitter(aes(x = x, y = y, size = z), 
              alpha = 0.8,
              shape = 21, 
              color = "white",  
              stroke = 0.4) +
  scale_size_continuous(range = c(1, 18), breaks = c(1,  4, 5, 13, 18)) +
  guides(size = FALSE, color = FALSE) +
  labs(y = "Flight Hight", x = "Flight Distance")

# plot with customized theme
p1 <- base_plot +
  ggtitle("Bubbels - theme_new()") +
  theme_new()

# plot with theme minimal
p2 <- base_plot +
    ggtitle("Bubbels - theme_minimal()") +
    theme_minimal()

grid.arrange(p1, p2, nrow = 2)

For later use, we can save our theme-generating script and source our customized theme function whenever we want to make use of our created theme.
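
A minimal sketch of that workflow (assuming the theme function above was saved as theme_new.R):

# in any later analysis script
library(ggplot2)
source("theme_new.R")

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  ggtitle("Some plot in our signature style") +
  theme_new()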

So go ahead, be creative and build your signature theme!

If you want more details, our Data Visualization with R workshop may interest you!

As noted elsewhere, sometimes beauty matters. A plot that’s pleasing to the eye will be considered more gladly, and thus might be understood more thoroughly. Also, since we at STATWORX oftentimes need to subsume and communicate our results, we have come to appreciate how a nice plot can upgrade any presentation.

So how do we make a plot look good? How do we make it accord with given style guidelines? In ggplot2 the display of all non-data components is controlled by the theme system. Unlike in some other packages, the appearance of plots is edited after all the data-related elements of the plot have been determined. The theme system of ggplot2 allows the manipulation of titles, labels, legends, grid lines and backgrounds. There are various built-in themes available that already have an all-around consistent style, pertaining to any detail of a plot.

Pre-defined themes

There are two ways to apply built-in (or otherwise predefined) themes (e.g. theme_grey, theme_bw, theme_linedraw, theme_light, theme_dark, theme_minimal or theme_classic). For one, they can be added as an additional layer to individual plots:

rm(list = ls())
library(gridExtra)
library(ggplot2)

# generating a fictional data set containing hours of sunshine and temperature
sun_hours <- sample(seq(from = 1, to = 8, by = 0.1), size = 40, replace = TRUE)
noise <- sample(seq(from = 17, to = 24, by = 0.1), size = 40, replace = TRUE)
temperature <-  sun_hours + noise
df_sun <- data.frame(sun_hours, temperature)

# generate the plot base
base_plot <- ggplot(df_sun) +
  geom_point(aes(x = sun_hours, y = temperature, color = temperature), 
             shape = 6, size = 5, stroke = 2) +
  geom_point(aes(x = sun_hours, y = temperature, color = temperature), 
             shape = 21, size = 3.3, fill = "white", stroke = 2) +
  labs(x = "Hours of Sun", y = "Temperature") +
  scale_color_gradient(high = "firebrick", low = "#ffce00", name = " ") +
  ggtitle("Base Plot")

base-plot

# adding predefined themes
p1 <- base_plot +
  theme_classic() +
  ggtitle("Plot with theme_classic()")

p2 <- base_plot +
  theme_bw() +
  ggtitle("Plot with theme_bw()")

p3 <- base_plot +
  theme_dark() +
  ggtitle("Plot with theme_dark()")

p4 <- base_plot +
  theme_light() +
  ggtitle("Plot with theme_light()")

gridExtra::grid.arrange(p1, p2, p3, p4)

different-themes

Alternatively, the default theme that’s automatically added to any plot, can be set or get with the functions theme_set() or theme_get().

# making the classic theme the default
theme_set(theme_classic())

base_plot +
  ggtitle("Plot with theme_set(theme_classic())")

theme-set

While predefined themes are very convenient, there’s always the option to (additionally) tweak the appearance of any non-data detail of a plot via the various arguments of theme(). This can be done for a specific plot, or for the currently active default theme. The default theme can be updated or partly replaced via theme_update and theme_replace, respectively (a short sketch of theme_replace follows below).

# changing the default theme
theme_update(legend.position = "none")

base_plot +
  ggtitle("Plot with theme_set(theme_classic()) n& theme_update(legend.position = "none")")

theme-update
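
theme_replace works analogously, but it replaces the addressed theme elements completely instead of just updating the specified properties. A minimal sketch (the chosen element is an arbitrary example):

# replace the complete y axis title element of the active default theme;
# unlike theme_update, properties not specified here are not preserved
theme_replace(axis.title.y = element_text(size = 10))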

# changing the theme directly applied to the plot
base_plot +
  theme(legend.position = "bottom") +
  ggtitle("Plot with theme(legend.position = "bottom")")

plus-theme

Element functions

There’s a wide range of arguments for theme(), in fact such a wide range that not all of them can be discussed here. Therefore, this blog post is far from exhaustive; it deals with the general principles of the theme system and provides illustrative examples for a few of the available arguments. The appearance of many elements needs to be specified via one of the four element functions: element_blank, element_text, element_line or element_rect.

  • How labels and titles are displayed, is controlled by the element_text function. For example, we can make the title of the y axis bold and increase its size.
  • Borders and backgrounds can be manipulated using element_rect. For example, we can choose the color of the plot’s background.
  • Lines can be defined via the element_line function. For example, we can change the line types of the major and minor grid.
  • Further, with element_blank() it is possible to remove an object completely, without having any space dedicated to the plot element.
# using element_text, element_rect, element_line, element_blank
base_plot +
  theme(axis.title.y = element_text(face = "bold", size = 16),
        plot.background = element_rect(fill = "#FED633"),
        panel.grid.major = element_line(linetype = "dashed"),
        panel.grid.minor = element_line(linetype = "dotted"),
        axis.text.y = element_blank(),
        axis.text.x = element_blank()) +
  ggtitle("Plot altered using element functions")

element-function

If we don’t want to change the display of some specific plot elements, but of all text, lines, titles or rectangular elements we can do so by specifying the arguments text, line, rect and title. Specifications passed to these arguments are inherited by all elements of the respective type. This inheritance principle also holds true for other ‘parent’ arguments. ‘Parent’ arguments oftentimes are easily identifiable, as their names are used as prefixes for all subordinate arguments.

# using overarching arguments #1
base_plot +
  theme(line = element_line(linetype = "dashed")) +
  ggtitle("Plot with all lines altered by using line")

using-line

# using overarching arguments #2
base_plot +
  theme(axis.title = element_text(size = 6)) + # here axis.title is the parent
  ggtitle("Plot with both axis titles altered by using axis.title")

using-axistitle

Outlook

Margins, spaces, sizes and orientations of elements are not specified with element functions but have their own sets of possible parameters. For example, the display of legends is controlled by such arguments and specific parameters.

# using parameters instead of element functions
base_plot +
  theme(legend.position = "top") 

using-otherparameters

Since ggplot2 enables us to manipulate the appearance of non-data elements of plots in great detail, there is a multitude of arguments. This blog post only tries to give a first impression of the many, many possibilities to design a plot. Some additional occupation with the topic might be advisable, but any time invested in understanding how to style plots surely is well spent. If you want to read more on making pretty plots in ggplot2, check out my other posts on coordinate systems or customizing date and time scales. If you want more details, our Data Visualization with R workshop may interest you!

 

References

  • Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer.

 
