
Management Summary

OCR (Optical Character Recognition) is a major challenge for many companies. The OCR market is crowded with various open-source and commercial providers. A well-known open-source tool for OCR is Tesseract, which is now maintained by Google. Tesseract is currently available in version 4, which performs OCR extraction using recurrent neural networks. However, Tesseract's OCR performance is still volatile and depends on various factors. A particular challenge is applying Tesseract to documents composed of different structures, e.g., text, tables, and images. One such document type is invoices, which still pose particular challenges to OCR tools from all providers.

This article demonstrates how fine-tuning the Tesseract OCR (Optical Character Recognition) engine on a small sample of data can already yield a considerable improvement in OCR performance on invoice documents. The process presented here is not limited to invoices but can be applied to arbitrary document types. A use case is defined that aims at the correct extraction of the entire text (words and numbers) from a fictitious but realistic German invoice document. It is assumed that the extracted information is intended for downstream accounting purposes, so a correct extraction of the numbers and of the euro sign is considered critical. The OCR performance of two Tesseract models for the German language is compared: the default model (not fine-tuned) and a fine-tuned variant. The default model is obtained from the Tesseract OCR GitHub repository; the fine-tuned model is developed with the steps described in this article. A second German invoice similar to the first is used for fine-tuning. Both the default and the fine-tuned model are evaluated on the same out-of-sample invoice to ensure a fair comparison.

The OCR performance of the default Tesseract model on numbers is comparatively poor. This is especially true for characters that resemble the digits 1 and 7. The euro symbol is recognized incorrectly in 50% of all cases, making the result unsuitable for any downstream accounting application. The fine-tuned model shows similar OCR performance for German words, but its performance on numbers improves considerably: all numbers and every euro symbol are extracted correctly. It turns out that fine-tuning with minimal effort and a small amount of training data can achieve a large improvement in recognition performance. This makes Tesseract OCR, with its open-source licensing, an attractive solution compared to proprietary OCR software. Finally, recommendations are given for fine-tuning Tesseract LSTM models in case more training data is available.

Downloading the Tesseract Docker Container

The entire fine-tuning process of Tesseract's LSTM model is discussed in detail below. Since installing and using Tesseract can get complicated, we have prepared a Docker container that already includes all the necessary installations.

Introduction

Tesseract 4 with its LSTM engine works quite well out of the box for simple texts. However, there are scenarios in which the default model performs poorly. Examples include exotic fonts, images with backgrounds, or text in tables. Fortunately, Tesseract offers a way to fine-tune the LSTM engine to improve OCR performance for more specific use cases.

Why OCR on Invoices Is Challenging

Even though OCR is considered a solved problem in some areas, the error-free extraction of a large text corpus remains a challenge. This is especially true for OCR on documents with high structural variance, such as invoices. These often consist of very different elements that pose challenges to Tesseract's OCR engine: 1. Colored backgrounds and table structures complicate page segmentation. 2. Invoices typically contain rare characters such as the EUR or USD sign. 3. Numbers cannot be checked against a language dictionary. In addition, the margin for error is small: an exact extraction of the numeric data is often of utmost importance for downstream process steps. Problem (1) can usually be solved by selecting one of the 14 page segmentation modes provided by Tesseract. The latter two problems can often be solved by fine-tuning the LSTM engine on examples of similar documents.

Use Case: Objective and Data

Two similar example invoices are examined in this article. The invoice shown in Figure 1 is used to evaluate the OCR performance of both the default and the fine-tuned Tesseract model. Particular attention is paid to the correct extraction of numbers. The second invoice, shown in Figure 2, is used to fine-tune the LSTM model. Most invoice documents are written in a very legible font such as "Arial". To illustrate the benefits of fine-tuning, the initial OCR problem is made harder by considering invoices written in the "Impact" font. "Impact" differs considerably from ordinary sans-serif fonts and leads to a higher error rate for Tesseract. It is shown below that, after fine-tuning on a very small amount of data, Tesseract delivers very satisfactory results despite this difficult font.
Figure 1: Invoice 1, used to evaluate the OCR performance of both models
Figure 2: Invoice 2, used to fine-tune the LSTM engine

Using the Tesseract 4.0 Docker Container

Setting up the fine-tuning of the Tesseract LSTM engine currently only works on Linux and can be a bit tricky. Therefore, a Docker container with Tesseract 4.0 pre-installed, together with the compiled training tools and scripts, is provided along with this article. Load the Docker image from the provided archive file or pull the container image via the provided link:
docker load -i docker/tesseract_image.tar
Once the image is loaded, start the container in detached mode:
docker run -d --rm --name tesseract_container tesseract:latest
Access the shell of the running container to replicate the following commands from this article:
docker exec -it tesseract_container /bin/bash

General Improvements to OCR Performance

There are three ways to improve Tesseract's OCR performance even before the LSTM engine is fine-tuned.

1. Image Preprocessing

Scanned documents can have a skewed orientation if they were not placed correctly on the scanner. Rotated images should be deskewed to optimize Tesseract's line segmentation performance. In addition, scanning can introduce image noise, which should be removed by a denoising algorithm. Note that, by default, Tesseract performs thresholding using Otsu's algorithm to binarize grayscale images into black and white pixels. A detailed treatment of image preprocessing is beyond the scope of this article and is not necessary to achieve satisfactory results for the given use case. The Tesseract documentation provides a practical overview.
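Since the article leaves preprocessing out of scope, the following is only a minimal sketch of the denoising and binarization steps mentioned above, assuming OpenCV is available (it is not part of the article's Docker container); the file names are placeholders:
# minimal preprocessing sketch with OpenCV: denoise a scanned page and binarize it
# with Otsu's method (the same thresholding Tesseract applies internally by default)
import cv2

img = cv2.imread("invoice_scan.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name
denoised = cv2.fastNlMeansDenoising(img, h=10)              # remove scanning noise
_, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("invoice_preprocessed.png", binary)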

2. Page Segmentation

During page segmentation, Tesseract tries to identify rectangular text regions. Only these regions are selected for OCR in the next step. It is therefore important to capture all regions containing text so that no information is lost. Tesseract allows choosing from 14 different page segmentation modes, which can be displayed with the following command:
tesseract --help-psm
The default segmentation mode expects an image similar to a book page. However, due to the additional tabular structures in invoice documents, this mode cannot identify all text regions correctly. A better segmentation method is given by option 4: "Assume a single column of text of variable sizes". To illustrate the importance of a suitable page segmentation mode, consider the result of using the default mode "Fully automatic page segmentation, but no OSD" in Figure 3:
Figure 3: The default segmentation mode fails to detect all text regions
Note that the texts "Rechnungsinformationen:", "Pos." and "Produkt" were not segmented. In Figure 4, a more suitable mode leads to a perfect segmentation of the page.
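As an aside, the same page segmentation mode can also be set when calling Tesseract from Python through the pytesseract wrapper; this is only an illustrative sketch, the article itself works with the tesseract command line:
# sketch: run Tesseract with page segmentation mode 4 via the pytesseract wrapper
import pytesseract
from PIL import Image

img = Image.open("eval_invoice.tiff")  # placeholder file name
text = pytesseract.image_to_string(img, lang="deu", config="--psm 4")
print(text)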

3. Using Dictionaries, Word Lists, and Patterns

The LSTM models used by Tesseract were trained on large amounts of text in a specific language. The following command displays the languages currently available to Tesseract:
tesseract --list-langs 
Further language models are available by downloading the corresponding .traineddata files and placing them in the tessdata folder of the local Tesseract installation. The Tesseract repository on GitHub provides three variants of language models: normal, fast, and best. Only the fast and the best variants can be used for fine-tuning. As the names suggest, these are the fastest and the most accurate model variants, respectively. Further models have also been trained for special use cases, such as recognizing only digits and punctuation, and are listed in the references. Since the language of the invoices in this use case is German, the Docker image accompanying this article ships with the deu.traineddata model. For a given language, Tesseract's word list can be further extended or restricted to certain words or even characters. This topic is beyond the scope of this article, as it is not necessary to achieve satisfactory results for the use case at hand.
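As a rough illustration of such a restriction, Tesseract exposes the tessedit_char_whitelist variable; the snippet below is only a sketch via the pytesseract wrapper, and note that early Tesseract 4.0 releases honored the whitelist only in the legacy engine, while later 4.x versions support it for the LSTM engine as well:
# sketch: restrict recognition to digits and a few symbols via a character whitelist
import pytesseract
from PIL import Image

config = "--psm 4 -c tessedit_char_whitelist=0123456789.,€"  # example whitelist
numbers_only = pytesseract.image_to_string(Image.open("eval_invoice.tiff"),
                                           lang="deu", config=config)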

Setting Up the Fine-Tuning Process

Three types of files must be created for fine-tuning:

1. TIFF Files

Tagged Image File Format, or TIFF, is an uncompressed image file format (in contrast to JPG or PNG, which are compressed formats). TIFF files can be obtained from PNG or JPG formats with a conversion tool. Although Tesseract can work with PNG and JPG images, the TIFF format is recommended.
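For example, such a conversion could be done with Pillow; this is merely one possible tool, not necessarily the one used to prepare the article's data:
# sketch: convert a PNG scan into the TIFF format expected for training
from PIL import Image

Image.open("train_invoice.png").save("train_invoice.tiff", format="TIFF")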

2. Box Files

To train the LSTM model, Tesseract uses so-called box files with the extension ".box". A box file contains the recognized text together with the coordinates of the bounding box in which the text is located. Box files contain six columns corresponding to symbol, left, bottom, right, top, and page:
P 157 2566 1465 2609 0
r 157 2566 1465 2609 0
o 157 2566 1465 2609 0
d 157 2566 1465 2609 0
u 157 2566 1465 2609 0
k 157 2566 1465 2609 0
t 157 2566 1465 2609 0
  157 2566 1465 2609 0
P 157 2566 1465 2609 0
r 157 2566 1465 2609 0
e 157 2566 1465 2609 0
i 157 2566 1465 2609 0
s 157 2566 1465 2609 0
  157 2566 1465 2609 0
( 157 2566 1465 2609 0
N 157 2566 1465 2609 0
e 157 2566 1465 2609 0
t 157 2566 1465 2609 0
t 157 2566 1465 2609 0
o 157 2566 1465 2609 0
) 157 2566 1465 2609 0
  157 2566 1465 2609 0
Each character is on a separate line in the box file. The LSTM model accepts either the coordinates of individual characters or of a whole text line. In the example box file above, the text "Produkt Preis (Netto)" is located on the same visual line in the document. All characters have the same coordinates, namely the coordinates of the bounding box around this line of text. Using line-level coordinates is much easier and is provided by default when the box file is generated with the following command:
cd /home/fine_tune/train
tesseract train_invoice.tiff train_invoice --psm 4 -l best/deu lstmbox
The first argument is the image file to be extracted, the second argument is the file name of the box file. The language parameter -l instructs Tesseract to use the German model for OCR. The --psm parameter instructs Tesseract to use the fourth page segmentation mode. It is almost inevitable that the generated OCR box files contain errors in the symbol column. Every symbol in the training box file must therefore be checked by hand. This is a tedious process, since the box file of the demo invoice contains almost a thousand lines (one for each character in the invoice). To simplify the correction, the Docker container provides a Python script that draws the bounding boxes together with the OCR text onto the original invoice image, making it easier to compare the box file with the document. The result is shown in Figure 4. The Docker container already includes the corrected box files, marked by the suffix "_correct".
Figure 4: Extracted text when applying the "deu" Tesseract model
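The container's draw_box_file_data.py is more complete, but the core idea can be sketched as follows (this is an assumption about its logic, not the actual script):
# sketch: draw every bounding box from a box file onto the invoice image
# so that the OCR symbols can be checked against the document by eye
from PIL import Image, ImageDraw

img = Image.open("train_invoice.tiff").convert("RGB")
draw = ImageDraw.Draw(img)

with open("train_invoice.box", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split(" ")
        left, bottom, right, top = map(int, parts[1:5])
        # box files use a bottom-left origin, PIL uses a top-left origin
        draw.rectangle([left, img.height - top, right, img.height - bottom],
                       outline="red")

img.save("train_invoice_boxes.tiff")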

3. lstmf Files

During fine-tuning, Tesseract extracts the text from the TIFF file and checks its prediction against the coordinates and the symbol in the box file. Tesseract does not use the TIFF and box file directly but instead expects a so-called lstmf file created from the two previous files. Note that the TIFF and box file must have the same name to create the lstmf file, e.g., train_invoice.tiff and train_invoice.box. The following command creates an lstmf file for the training invoice:
cd /home/fine_tune/train
tesseract train_invoice.tiff train_invoice lstm.train 
All lstmf files relevant for training must be listed by their relative path in a text file called deu.training_files.txt. In this use case, only one lstmf file is used for training, so deu.training_files.txt contains only one line, namely: train/train_invoice_correct.lstmf. It is recommended to also create an lstmf file for the evaluation invoice. This way, the model's performance can be assessed during training:
cd /home/fine_tune/eval
tesseract eval_invoice_correct.tiff eval_invoice_correct lstm.train

Evaluating the Default LSTM Model

OCR predictions from the default German model "deu" are used as the benchmark. A precise overview of the default German model's OCR performance is obtained by generating a box file for the evaluation invoice and visualizing the OCR text with the Python script mentioned above. This script, which generates the file "eval_invoice_ocr_deu.tiff", is located in the provided container at "/home/fine_tune/src/draw_box_file_data.py". It expects as arguments the path to a TIFF file, the corresponding box file, and a name for the output TIFF file. The OCR text extracted by the default German model is saved as eval/eval_invoice_ocr_deu.tiff and shown in Figure 1. At first glance, the text extracted by OCR looks good. The model correctly extracts German characters such as ä, ö, ü, and ß. In fact, there are only three cases in which words contain errors:
OCR                  Truth
Jessel GmbH 8 Co     Jessel GmbH & Co
11 Glasbehälter      1l Glasbehälter
Zeki64@hloch.com     Zeki64@bloch.com
The model already performs well on common German words but struggles with isolated symbols such as "&" and "l", as well as with words like "bloch" that are not in the model's word list. Prices and numbers are a much bigger challenge for the model. Here, extraction errors occur far more frequently:
OCR          Truth
159,16       159,1€
1%           7%
1305.816     1305.81€
227.66       227.6€
341.51       347.57€
1115.16      1115.7€
242.86       242.8€
1456.86      1456.8€
51.46        54.1€
1954.719€    1954.79€
The default German model fails to extract the euro symbol € correctly in 9 out of 18 cases, an error rate of 50%.
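The errors above were counted by hand. For larger evaluations, a character error rate could be computed programmatically; the following is a small sketch using only the Python standard library and is not part of the article's workflow:
# sketch: approximate character error rate between an OCR string and the ground truth
import difflib

def char_error_rate(ocr: str, truth: str) -> float:
    matcher = difflib.SequenceMatcher(None, ocr, truth)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / max(len(truth), 1)

print(char_error_rate("1305.816", "1305.81€"))  # 0.125: one of eight characters differs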

Fine-Tuning the Default LSTM Model

The default LSTM model is now fine-tuned on the invoice shown in Figure 2. Afterwards, the OCR performance is evaluated on the evaluation invoice shown in Figure 1, which was also used to benchmark the default German model. To fine-tune the LSTM model, it must first be extracted from the file deu.traineddata. The following command extracts the LSTM model from the default German model into the directory lstm_model:
cd /home/fine_tune
combine_tessdata -e tesseract/tessdata/best/deu.traineddata lstm_model/deu.lstm
Next, all files required for fine-tuning are collected. They are also included in the Docker container:
  1. The training files train_invoice_correct.lstmf and deu.training_files.txt in the train directory.
  2. The evaluation files eval_invoice_correct.lstmf and deu.training_files.txt in the eval directory.
  3. The extracted LSTM model deu.lstm in the lstm_model directory.
The Docker container includes the script src/fine_tune.sh, which starts the fine-tuning process. Its content is:
/usr/bin/lstmtraining \
 --model_output output/fine_tuned \
 --continue_from lstm_model/deu.lstm \
 --traineddata tesseract/tessdata/best/deu.traineddata \
 --train_listfile train/deu.training_files.txt \
 --eval_listfile eval/deu.training_files.txt \
 --max_iterations 400
This command fine-tunes the extracted model deu.lstm on the file train_invoice_correct.lstmf specified in train/deu.training_files.txt. Fine-tuning the LSTM model requires language-specific information, which is contained in deu.traineddata. The file eval_invoice_correct.lstmf, specified in eval/deu.training_files.txt, is used to compute the OCR performance during training. Fine-tuning stops after 400 iterations; the entire training run takes less than two minutes. The following command executes the script and logs the output to a file:
cd /home/fine_tune
sh src/fine_tune.sh > output/fine_tune.log 2>&1
The content of the log file after training is shown below:
src/fine_tune.log
Loaded file lstm_model/deu.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from lstm_model/deu.lstm
Loaded 20/20 lines (1-20) of document train/train_invoice_correct.lstmf
Loaded 24/24 lines (1-24) of document eval/eval_invoice_correct.lstmf
2 Percent improvement time=69, best error was 100 @ 0
At iteration 69/100/100, Mean rms=1.249%, delta=2.886%, char train=8.17%, word train=22.249%, skip ratio=0%, New best char error = 8.17 Transitioned to stage 1 wrote best model:output/deu_fine_tuned8.17_69.checkpoint wrote checkpoint.
-----
2 Percent improvement time=62, best error was 8.17 @ 69
At iteration 131/200/200, Mean rms=1.008%, delta=2.033%, char train=5.887%, word train=20.832%, skip ratio=0%, New best char error = 5.887 wrote best model:output/deu_fine_tuned5.887_131.checkpoint wrote checkpoint.
-----
2 Percent improvement time=112, best error was 8.17 @ 69
At iteration 181/300/300, Mean rms=0.88%, delta=1.599%, char train=4.647%, word train=17.388%, skip ratio=0%, New best char error = 4.647 wrote best model:output/deu_fine_tuned4.647_181.checkpoint wrote checkpoint.
-----
2 Percent improvement time=159, best error was 8.17 @ 69
At iteration 228/400/400, Mean rms=0.822%, delta=1.416%, char train=4.144%, word train=16.126%, skip ratio=0%, New best char error = 4.144 wrote best model:output/deu_fine_tuned4.144_228.checkpoint wrote checkpoint.
-----
Finished! Error rate = 4.144
During training, Tesseract saves a so-called model checkpoint after each iteration. The model's performance at this checkpoint is tested on the evaluation data and compared with the current best result. If the result improves, i.e., the OCR error decreases, a labeled copy of the checkpoint is saved. The first number in the checkpoint's file name represents the character error, the second number the training iteration. The last step is to recombine the fine-tuned LSTM model into a "traineddata" model. Assuming the checkpoint from the 181st iteration is selected, the following command converts the checkpoint "deu_fine_tuned4.647_181.checkpoint" into a fully functional Tesseract model "deu_fine_tuned.traineddata":
cd /home/fine_tune
/usr/bin/lstmtraining \
 --stop_training \
 --continue_from output/deu_fine_tuned4.647_181.checkpoint \
 --traineddata tesseract/tessdata/best/deu.traineddata \
 --model_output output/deu_fine_tuned.traineddata
This model must be copied into the tessdata directory of the local Tesseract installation to make it available to Tesseract. This has already been done in the Docker container. Verify that the fine-tuned model is available in Tesseract:
tesseract --list-langs

Evaluating the Fine-Tuned LSTM Model

The fine-tuned model is evaluated analogously to the default model: a box file of the evaluation invoice is created, and the OCR text is displayed on the image of the evaluation invoice using the Python script. The command to generate the box file has to be modified so that the fine-tuned model "deu_fine_tuned" is used instead of the default model "deu":
cd /home/fine_tune/eval
tesseract eval_invoice.tiff eval_invoice --psm 4 -l deu_fine_tuned lstmbox
The OCR text extracted by the fine-tuned model is shown in Figure 5 below.
Figure 5: OCR results of the fine-tuned LSTM model
As with the default German model, performance on German words remains good, but not perfect. To improve the performance on rare words, the model's word list could be extended with further words.
OCR                  Truth
Jessel GmbH 8 Co     Jessel GmbH & Co
1! Glasbehälte       1l Glasbehälter
Zeki64@hloch.com     Zeki64@bloch.com
More importantly, the OCR performance on numbers has improved considerably: the fine-tuned model extracted all numbers and every occurrence of the € sign correctly.
OCR          Truth
159,1€       159,1€
7%           7%
1305.81€     1305.81€
227.6€       227.6€
347.57€      347.57€
1115.7€      1115.7€
242.8€       242.8€
1456.8€      1456.8€
54.1€        54.1€
1954.79€     1954.79€

Conclusion and Outlook

This article showed that Tesseract's OCR performance can be improved considerably through fine-tuning. This is especially true for non-standard use cases, such as extracting text from invoice documents. Besides the open-source licensing, the ability to fine-tune Tesseract's LSTM engine to specific use cases makes the framework an attractive tool, even for more demanding OCR scenarios. To further improve the results, it may be useful to fine-tune the model for more iterations. In this use case, the number of iterations was deliberately limited because only one document was used for fine-tuning. More iterations potentially increase the risk of overfitting the LSTM model on certain symbols, which in turn increases the error rate on other symbols. In practice, it is desirable to increase the number of iterations, provided that enough training data is available. The final OCR performance should always be checked on an additional, representative set of documents.

References

  • Tesseract training: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html
  • Image processing overview: https://tesseract-ocr.github.io/tessdoc/ImproveQuality#image-processing
  • Otsu thresholding: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html
  • Tesseract digits comma model: https://github.com/Shreeshrii/tessdata_shreetest
 

Introduction

When working on data science projects in R, exporting internal R objects as files on your hard drive is often necessary to facilitate collaboration. Here at STATWORX, we regularly export R objects (such as outputs of a machine learning model) as .RDS files and put them on our internal file server. Our co-workers can then pick them up for further usage down the line of the data science workflow (such as visualizing them in a dashboard together with inputs from other colleagues). Over the last couple of months, I came to work a lot with RDS files and noticed a crucial shortcoming: The base R saveRDS function does not allow for any kind of archiving of existing same-named files on your hard drive. In this blog post, I will explain why this might be very useful by introducing the basics of serialization first and then showcasing my proposed solution: A wrapper function around the existing base R serialization framework.

Be wary of silent file replacements!

In base R, you can easily export any object from the environment to an RDS file with:
saveRDS(object = my_object, file = "path/to/dir/my_object.RDS")
However, including such a line somewhere in your script can carry unintended consequences: When calling saveRDS multiple times with identical file names, R silently overwrites existing, identically named .RDS files in the specified directory. If the object you are exporting is not what you expect it to be — for example due to some bug in newly edited code — your working copy of the RDS file is simply overwritten in-place. Needless to say, this can prove undesirable. If you are familiar with this pitfall, you probably used to forestall such potentially troublesome side effects by commenting out the respective lines, then carefully checking each time whether the R object looked fine, then executing the line manually. But even when there is nothing wrong with the R object you seek to export, it can make sense to retain an archived copy of previous RDS files: Think of a dataset you run through a data prep script, and then you get an update of the raw data, or you decide to change something in the data prep (like removing a variable). You may wish to archive an existing copy in such cases, especially with complex data prep pipelines with long execution time.

Don’t get tangled up in manual renaming

You could manually move or rename the existing file each time you plan to create a new one, but that’s tedious, error-prone, and does not allow for unattended execution and scalability. For this reason, I set out to write a carefully designed wrapper function around the existing saveRDS call, which is pretty straightforward: As a first step, it checks if the file you attempt to save already exists in the specified location. If it does, the existing file is renamed/archived (with customizable options), and the “updated” file will be saved under the originally specified name. This approach has the crucial advantage that the existing code that depends on the file name remaining identical (such as readRDS calls in other scripts) will continue to work with the latest version without any needs for adjustment! No more saving your objects as “models_2020-07-12.RDS”, then combing through the other scripts to replace the file name, only to repeat this process the next day. At the same time, an archived copy of the — otherwise overwritten — file will be kept.

What are RDS files anyways?

Before I walk you through my proposed solution, let’s first examine the basics of serialization, the underlying process behind high-level functions like saveRDS.
Simply speaking, serialization is the "process of converting an object into a stream of bytes so that it can be transferred over a network or stored in a persistent storage" (Stack Overflow: What is serialization?).
There is also a low-level R interface, serialize, which you can use to explore (un-)serialization first-hand: Simply fire up R and run something like serialize(object = c(1, 2, 3), connection = NULL). This call serializes the specified vector and prints the output right to the console. The result is an odd-looking raw vector, with each byte separately represented as a pair of hex digits. Now let’s see what happens if we revert this process:
s <- serialize(object = c(1, 2, 3), connection = NULL)
print(s)
# >  [1] 58 0a 00 00 00 03 00 03 06 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 0e 00
# > [29] 00 00 03 3f f0 00 00 00 00 00 00 40 00 00 00 00 00 00 00 40 08 00 00 00 00 00 00

unserialize(s)
# > 1 2 3
The length of this raw vector increases rapidly with the complexity of the stored information: For instance, serializing the famous, although not too large, iris dataset results in a raw vector consisting of 5959 pairs of hex digits! Besides the already mentioned saveRDS function, there is also the more generic save function. The former saves a single R object to a file. It allows us to restore the object from that file (with the counterpart readRDS), possibly under a different variable name: That is, you can assign the contents of a call to readRDS to another variable. By contrast, save allows for saving multiple R objects, but when reading back in (with load), they are simply restored in the environment under the object names they were saved with. (That’s also what happens automatically when you answer “Yes” to the notorious question of whether to “save the workspace image to ~/.RData” when quitting RStudio.)

Creating the archives

Obviously, it’s great to have the possibility to save internal R objects to a file and then be able to re-import them in a clean session or on a different machine. This is especially true for the results of long and computationally heavy operations such as fitting machine learning models. But as we learned earlier, one wrong keystroke can potentially erase that one precious 3-hour-fit fine-tuned XGBoost model you ran and carefully saved to an RDS file yesterday.

Digging into the wrapper

So, how did I go about fixing this? Let’s take a look at the code. First, I define the arguments and their defaults: The object and file arguments are taken directly from the wrapped function, the remaining arguments allow the user to customize the archiving process: Append the archive file name with either the date the original file was archived or last modified, add an additional timestamp (not just the calendar date), or save the file to a dedicated archive directory. For more details, please check the documentation here. I also include the ellipsis ... for additional arguments to be passed down to saveRDS. Additionally, I do some basic input handling (not included here).
save_rds_archive <- function(object,
                             file = "",
                             archive = TRUE,
                             last_modified = FALSE,
                             with_time = FALSE,
                             archive_dir_path = NULL,
                             ...) {
The main body of the function is basically a series of if/else statements. I first check if the archive argument (which controls whether the file should be archived in the first place) is set to TRUE, and then if the file we are trying to save already exists (note that “file” here actually refers to the whole file path). If it does, I call the internal helper function create_archived_file, which eliminates redundancy and allows for concise code.
if (archive) {

    # check if file exists
    if (file.exists(file)) {

      archived_file <- create_archived_file(file = file,
                                            last_modified = last_modified,
                                            with_time = with_time)

Composing the new file name

In this function, I create the new name for the file which is to be archived, depending on user input: If last_modified is set, then the mtime of the file is accessed. Otherwise, the current system date/time (= the date of archiving) is taken instead. Then the spaces and special characters are replaced with underscores, and, depending on the value of the with_time argument, the actual time information (not just the calendar date) is kept or not. To make it easier to identify directly from the file name what exactly (date of archiving vs. date of modification) the indicated date/time refers to, I also add appropriate information to the file name. Then I save the file extension for easier replacement (note that “.RDS”, “.Rds”, and “.rds” are all valid file extensions for RDS files). Lastly, I replace the current file extension with a concatenated string containing the type info, the new date/time suffix, and the original file extension. Note here that I add a “$” sign to the regex which is to be matched by gsub to only match the end of the string: If I did not do that and the file name would be something like “my_RDS.RDS”, then both matches would be replaced.
# create_archived_file.R

create_archived_file <- function(file, last_modified, with_time) {

  # create main suffix depending on type
  suffix_main <- ifelse(last_modified,
                        as.character(file.info(file)$mtime),
                        as.character(Sys.time()))

  if (with_time) {

    # create clean date-time suffix
    suffix <- gsub(pattern = " ", replacement = "_", x = suffix_main)
    suffix <- gsub(pattern = ":", replacement = "-", x = suffix)

    # add "at" between date and time
    suffix <- paste0(substr(suffix, 1, 10), "_at_", substr(suffix, 12, 19))

  } else {

    # create date suffix
    suffix <- substr(suffix_main, 1, 10)

  }

  # create info to paste depending on type
  type_info <- ifelse(last_modified,
                      "_MODIFIED_on_",
                      "_ARCHIVED_on_")

  # get file extension (could be any of "RDS", "Rds", "rds", etc.)
  ext <- paste0(".", tools::file_ext(file))

  # replace extension with suffix
  archived_file <- gsub(pattern = paste0(ext, "$"),
                        replacement = paste0(type_info,
                                             suffix,
                                             ext),
                        x = file)

  return(archived_file)

}

Archiving the archives?

By way of example, with last_modified = FALSE and with_time = TRUE, this function would turn the character file name “models.RDS” into “models_ARCHIVED_on_2020-07-12_at_11-31-43.RDS”. However, this is just a character vector for now — the file itself is not renamed yet. For this, we need to call the base R file.rename function, which provides a direct interface to your machine’s file system. I first check, however, whether a file with the same name as the newly created archived file string already exists: This could well be the case if one appends only the date (with_time = FALSE) and calls this function several times per day (or potentially on the same file if last_modified = TRUE). Somehow, we are back to the old problem in this case. However, I decided that it was not a good idea to archive files that are themselves archived versions of another file since this would lead to too much confusion (and potentially too much disk space being occupied). Therefore, only the most recent archived version will be kept. (Note that if you still want to keep multiple archived versions of a single file, you can set with_time = TRUE. This will append a timestamp to the archived file name up to the second, virtually eliminating the possibility of duplicated file names.) A warning is issued, and then the already existing archived file will be overwritten with the current archived version.

The last puzzle piece: Renaming the original file

To do this, I call the file.rename function, renaming the “file” originally passed by the user call to the string returned by the helper function. The file.rename function always returns a boolean indicating if the operation succeeded, which I save to a variable temp to inspect later. Under some circumstances, the renaming process may fail, for instance due to missing permissions or OS-specific restrictions. We did set up a CI pipeline with GitHub Actions and continuously test our code on Windows, Linux, and MacOS machines with different versions of R. So far, we didn’t run into any problems. Still, it’s better to provide in-built checks.

It’s an error! Or is it?

The problem here is that, when renaming the file on disk failed, file.rename raises merely a warning, not an error. Since any causes of these warnings most likely originate from the local file system, there is no sense in continuing the function if the renaming failed. That’s why I wrapped it into a tryCatch call that captures the warning message and passes it to the stop call, which then terminates the function with the appropriate message. Just to be on the safe side, I check the value of the temp variable, which should be TRUE if the renaming succeeded, and also check if the archived version of the file (that is, the result of our renaming operation) exists. If both of these conditions hold, I simply call saveRDS with the original specifications (now that our existing copy has been renamed, nothing will be overwritten if we save the new file with the original name), passing along further arguments with ....
        if (file.exists(archived_file)) {
          warning("Archived copy already exists - will overwrite!")
        }

        # rename existing file with the new name
        # save return value of the file.rename function
        # (returns TRUE if successful) and wrap in tryCatch
        temp <- tryCatch({file.rename(from = file,
                                      to = archived_file)
        },
        warning = function(e) {
          stop(e)
        })

      }

      # check return value and if archived file exists
      if (temp & file.exists(archived_file)) {
        # then save new file under specified name
        saveRDS(object = object, file = file, ...)
      }

    }
These code snippets represent the cornerstones of my function. I also skipped some portions of the source code for reasons of brevity, chiefly the creation of the “archive directory” (if one is specified) and the process of copying the archived file into it. Please refer to our GitHub for the complete source code of the main and the helper function. Finally, to illustrate, let’s see what this looks like in action:
x <- 5
y <- 10
z <- 20

## save to RDS
saveRDS(x, "temp.RDS")
saveRDS(y, "temp.RDS")

## "temp.RDS" is silently overwritten with y
## previous version is lost
readRDS("temp.RDS")
#> [1] 10

save_rds_archive(z, "temp.RDS")
## current version is updated
readRDS("temp.RDS")
#> [1] 20

## previous version is archived
readRDS("temp_ARCHIVED_on_2020-07-12.RDS")
#> [1] 10

Great, how can I get this?

The function save_rds_archive is now included in the newly refactored helfRlein package (now available in version 1.0.0!) which you can install directly from GitHub:
# install.packages("devtools")
devtools::install_github("STATWORX/helfRlein")
Feel free to check out additional documentation and the source code there. If you have any inputs or feedback on how the function could be improved, please do not hesitate to contact me or raise an issue on our GitHub.

Conclusion

That’s it! No more manually renaming your precious RDS files — with this function in place, you can automate this tedious task and easily keep a comprehensive archive of previous versions. You will be able to take another look at that one model you ran last week (and then discarded again) in the blink of an eye. I hope you enjoyed reading my post — maybe the function will come in handy for you someday!

Because You Are Interested In Data Science, You Are Interested In This Blog Post

If you love streaming movies and tv series online as much as we do here at STATWORX, you’ve probably stumbled upon recommendations like “Customers who viewed this item also viewed…” or “Because you have seen …, you like …”. Amazon, Netflix, HBO, Disney+, etc. all recommend their products and movies based on your previous user behavior – But how do these companies know what their customers like? The answer is collaborative filtering. In this blog post, I will first explain how collaborative filtering works. Secondly, I’m going to show you how to develop your own small movie recommender with the R package recommenderlab and provide it in a shiny application.

Different Approaches

There are several approaches to giving a recommendation. In user-based collaborative filtering (UBCF), the users are the focus of the recommendation system. For a new proposal, the similarities between new and existing users are calculated first. Afterward, either the n most similar users or all users with a similarity above a specified threshold are consulted. The average ratings of the products are formed across these users and, if necessary, weighted according to their similarity. Then, the x highest-rated products are displayed to the new user as a suggestion. In item-based collaborative filtering (IBCF), by contrast, the focus is on the products. For every pair of products, the similarity between them is calculated in terms of their ratings. For each product, the k most similar products are identified, and for each user, the products that best match their previous purchases are suggested. These and other collaborative filtering methods are implemented in the recommenderlab package:
  • ALS_realRatingMatrix: Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm.
  • ALS_implicit_realRatingMatrix: Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm.
  • IBCF_realRatingMatrix: Recommender based on item-based collaborative filtering.
  • LIBMF_realRatingMatrix: Matrix factorization with LIBMF via package recosystem.
  • POPULAR_realRatingMatrix: Recommender based on item popularity.
  • RANDOM_realRatingMatrix: Produce random recommendations (real ratings).
  • RERECOMMEND_realRatingMatrix: Re-recommends highly-rated items (real ratings).
  • SVD_realRatingMatrix: Recommender based on SVD approximation with column-mean imputation.
  • SVDF_realRatingMatrix: Recommender based on Funk SVD with gradient descend.
  • UBCF_realRatingMatrix: Recommender based on user-based collaborative filtering.

Developing your own Movie Recommender

Dataset

To create our recommender, we use the data from movielens. These are film ratings from 0.5 (= bad) to 5 (= good) for over 9000 films from more than 600 users. The movieId is a unique mapping variable to merge the different datasets.
head(movie_data)
  movieId                              title                                      genres
1       1                   Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2       2                     Jumanji (1995)                  Adventure|Children|Fantasy
3       3            Grumpier Old Men (1995)                              Comedy|Romance
4       4           Waiting to Exhale (1995)                        Comedy|Drama|Romance
5       5 Father of the Bride Part II (1995)                                      Comedy
6       6                        Heat (1995)                       Action|Crime|Thriller
head(ratings_data)
  userId movieId rating timestamp
1      1       1      4 964982703
2      1       3      4 964981247
3      1       6      4 964982224
4      1      47      5 964983815
5      1      50      5 964982931
6      1      70      3 964982400
To better understand the film ratings, we display the number of ratings per rating value and the average rating per film. We see that in most cases, there is no rating by a user at all. Furthermore, the average ratings contain a lot of "round" values. These are movies that only have a few individual ratings, and therefore, the average score is determined by individual users.
# rating_vector
0         0.5    1      1.5    2      2.5   3      3.5    4       4.5   5
5830804   1370   2811   1791   7551   5550  20047  13136  26818   8551  13211
Average Movie Ratings
In order not to let individual users influence the movie ratings too much, the movies are reduced to those that have at least 50 ratings.
Average Movie Ratings - filtered
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.208   3.444   3.748   3.665   3.944   4.429
Under the assumption that the ratings of users who regularly give their opinion are more precise, we also only consider users who have given at least 50 ratings. For the films filtered above, we receive the following average ratings per user:
Average Movie Ratings - relevant
You can see that the distribution of the average ratings is left-skewed, which means that many users tend to give rather good ratings. To compensate for this skewness, we normalize the data.
ratings_movies_norm <- normalize(ratings_movies)

Model Training and Evaluation

To train our recommender and subsequently evaluate it, we carry out a 10-fold cross-validation. Also, we train both an IBCF and a UBCF recommender, which in turn calculate the similarity measure via cosine similarity and Pearson correlation. A random recommendation is used as a benchmark. To evaluate how many recommendations can be given, different numbers are tested via the vector n_recommendations.
eval_sets <- evaluationScheme(data = ratings_movies_norm,
                              method = "cross-validation",
                              k = 10,
                              given = 5,
                              goodRating = 0)

models_to_evaluate <- list(
  `IBCF Cosinus` = list(name = "IBCF", 
                        param = list(method = "cosine")),
  `IBCF Pearson` = list(name = "IBCF", 
                        param = list(method = "pearson")),
  `UBCF Cosinus` = list(name = "UBCF",
                        param = list(method = "cosine")),
  `UBCF Pearson` = list(name = "UBCF",
                        param = list(method = "pearson")),
  `Zufälliger Vorschlag` = list(name = "RANDOM", param=NULL)
)

n_recommendations <- c(1, 5, seq(10, 100, 10))

list_results <- evaluate(x = eval_sets, 
                         method = models_to_evaluate, 
                         n = n_recommendations)
We then have the results displayed graphically for analysis.
Different models
We see that the best performing model is built by using UBCF and the Pearson correlation as a similarity measure. The model consistently achieves the highest true positive rate for the various false-positive rates and thus delivers the most relevant recommendations. Furthermore, we want to maximize the recall, which is also guaranteed at every level by the UBCF Pearson model. Since the n most similar users (parameter nn) are used to calculate the recommendations, we will examine the results of the model for different numbers of users.
vector_nn <- c(5, 10, 20, 30, 40)

models_to_evaluate <- lapply(vector_nn, function(nn){
  list(name = "UBCF",
       param = list(method = "pearson", nn = nn))
})
names(models_to_evaluate) <- paste0("UBCF mit ", vector_nn, " Nutzern")
list_results <- evaluate(x = eval_sets, 
                         method = models_to_evaluate, 
                         n = n_recommendations)
Different users

Conclusion

Our user-based collaborative filtering model with the Pearson correlation as the similarity measure and a neighborhood of 40 users delivers the best results. To test the model yourself and get movie suggestions for your own taste, I created a small Shiny app. However, there is no guarantee that the suggested movies will really match your individual taste. Not only is the underlying data set relatively small and potentially distorted by individual user ratings, but the tech giants also use additional data such as age, gender, and user behavior for their models. But what I can say is: Data Scientists who read this blog post also read the other blog posts by STATWORX.

Shiny-App

Here you can find the Shiny App. To get your own movie recommendation, select up to 10 movies from the dropdown list, rate them on a scale from 0 (= bad) to 5 (= good), and press the run button. Please note that the app is hosted on a free account of shinyapps.io, which makes it available for 25 hours per month. If the 25 hours are used up and the app is therefore no longer available this month, you will find the code here to run it in your local RStudio.

Getting the data in the quantity, quality, and format you need is often the most challenging part of data science projects. But it's also one of, if not the most important part. That's why my colleagues and I at STATWORX tend to spend a lot of time setting up good ETL processes. Thanks to frameworks like Airflow, this isn't just a Data Engineer's prerogative anymore. If you know a bit of SQL and Python, you can orchestrate your own ETL process like a pro. Read on to find out how!

ETL does not stand for Extraterrestrial Life

At least not in Data Science and Engineering. ETL stands for Extract, Transform, Load and describes a set of database operations. Extracting means we read the data from one or more data sources. Transforming means we clean, aggregate or combine the data to get it into the shape we want. Finally, we load it to a destination database. Does your ETL process consist of Karen sending you an Excel sheet that you can spend your workday just scrolling down? Or do you have to manually query a database every day, tweaking your SQL queries to the occasion? If yes, venture a peek at Airflow.

How Airflow can help with ETL processes

Airflow is a python based framework that allows you to programmatically create, schedule and monitor workflows. These workflows consist of tasks and dependencies that can be automated to run on a schedule. If anything fails, there are logs and error handling facilities to help you fix it. Using Airflow can make your workflows more manageable, transparent and efficient. And yes, I’m talking to you fellow Data Scientists! Getting access to up-to-date, high-quality data is far too important to leave it only to the Data Engineers 😉 (we still love you). The point is, if you’re working with data, you’ll profit from knowing how to wield this powerful tool.

How Airflow works

To learn more about Airflow, check out this blog post from my colleague Marvin. It will get you up to speed quickly. Marvin explains in more detail how Airflow works and what advantages/disadvantages it has as a workflow manager. He also has an excellent quick-start guide with Docker. What matters to us here is that Airflow is based on DAGs, or Directed Acyclic Graphs, which describe what tasks our workflow consists of and how these are connected.
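To make that idea concrete before building the real DAG, here is a minimal, generic sketch of how tasks and their dependencies are declared in Airflow 1.x; the task names are made up and have nothing to do with the pipeline below:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# three placeholder tasks wired into a simple extract -> transform -> load chain
with DAG(dag_id="minimal_example",
         start_date=datetime(2020, 2, 20),
         schedule_interval="@daily") as dag:
    extract = DummyOperator(task_id="extract")
    transform = DummyOperator(task_id="transform")
    load = DummyOperator(task_id="load")

    extract >> transform >> load  # the bitshift operator defines the dependencies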
Note that in this tutorial we’re not actually going to deploy the pipeline. Otherwise, this post would be even longer. And you probably have friends and family that like to see you. So today is all about creating a DAG step by step. If you care more about the deployment side of things, stay tuned though! I plan to do a step by step guide of how to do that in the next post. As a small solace, know that you can test every task of your workflow with:
airflow test [your dag id] [your task id] [execution date]
There are more options, but that’s all we need for now.

What you need to follow this tutorial

This tutorial shows you how you can use Airflow in combination with BigQuery and Google Cloud Storage to run a daily ETL process. So what do you need? If you already have a Google Cloud account, you can hit the ground running! If not, consider opening one. You do need to provide a credit card, but don't worry, this tutorial won't cost you anything. If you sign up new, you get a free yearly trial period. But even if that one's expired, we're staying well within the bounds of Google's Always Free Tier. Finally, it helps if you know some SQL. I know it's something most Data Scientists don't find too sexy, but the more you use it, the more you like it. I guess it's like the orthopedic shoes of Data Science: a bit ugly, sure, but unbeatable at what it's designed for. If you're not familiar with SQL or dread it like the plague, don't sweat it; each query's name says what it does.

What BigQuery and Google Cloud Storage are

BigQuery and Cloud Storage are some of the most popular products of the Google Cloud Platform (GCP). BigQuery is a serverless cloud data warehouse that allows you to analyze up to petabytes of data at high speeds. Cloud Storage, on the other hand, is just that: a cloud-based object storage. Grossly simplified, we use BigQuery as a database to query and Cloud Storage as a place to save the results. In more detail, our ETL process:
  • checks for the existence of data in BigQuery and Google Cloud Storage
  • queries a BigQuery source table and writes the result to a table
  • ingests the Cloud Storage data into another BigQuery table
  • merges the two tables and writes the result back to Cloud Storage as a CSV

Connecting Airflow to these services

If you set up a Google Cloud account you should have a JSON authentication file. I suggest putting this in your home directory. We use this file to connect Airflow to BigQuery and Cloud Storage. To do this, just copy and paste these lines in your terminal, substituting your project ID and JSON path. You can read more about connections here.
# for bigquery
airflow connections -d --conn_id bigquery_default

airflow connections -a --conn_id bigquery_default --conn_uri 'google-cloud-platform://:@:?extra__google_cloud_platform__project=[YOUR PROJECT ID]&extra__google_cloud_platform__key_path=[PATH TO YOUR JSON]'


# for google cloud storage
airflow connections -d --conn_id google_cloud_default

airflow connections -a --conn_id google_cloud_default --conn_uri 'google-cloud-platform://:@:?extra__google_cloud_platform__project=[YOUR PROJECT ID]&extra__google_cloud_platform__key_path=[PATH TO YOUR JSON]'

Writing our DAG

Task 0: Setting the start date and the schedule interval

We are ready to define our DAG! This DAG consists of multiple tasks or things our ETL process should do. Each task is instantiated by a so-called operator. Since we’re working with BigQuery and Cloud Storage, we take the appropriate Google Cloud Platform (GCP) operators. Before defining any tasks, we specify the start date and schedule interval. The start date is the date when the DAG first runs. In this case, I picked February 20th, 2020. The schedule interval is how often the DAG runs, i.e. on what schedule. You can use cron notation here, a timedelta object or one of airflow’s cron presets (e.g. ‘@daily’). Tip: you normally want to keep the start date static to avoid unpredictable behavior (so no datetime.now() shenanigans even though it may seem tempting).
# set start date and schedule interval
start_date = datetime(2020, 2, 20)
schedule_interval = timedelta(days=1)
You find the complete DAG file on our STATWORX Github. There you also see all the config parameters I set, e.g., what our project, dataset, buckets, and tables are called, as well as all the queries. We gloss over it here, as it’s not central to understanding the DAG.

Task 1: Check that there is data in BigQuery

We set up our DAG taking advantage of Python's context manager. You don't have to do this, but it saves some typing. A little detail: I'm setting catchup=False because I don't want Airflow to do a backfill on my data.
# write dag
with DAG(dag_id='blog', default_args=default_args, schedule_interval=schedule_interval, catchup=False) as dag:

    t1 = BigQueryCheckOperator(task_id='check_bq_data_exists',
                               sql=queries.check_bq_data_exists,
                               use_legacy_sql=False)
We start by checking if the data for the date we’re interested in is available in the source table. Our source table, in this case, is the Google public dataset bigquery-public-data.austin_waste.waste_and_diversion. To perform this check we use the aptly named BigQueryCheckOperator and pass it an SQL query. If the check_bq_data_exists query returns even one non-null row of data, we consider it successful and the next task can run. Notice that we’re making use of macros and Jinja templating to dynamically insert dates into our queries that are rendered at runtime.
check_bq_data_exists = """
SELECT load_id
FROM `bigquery-public-data.austin_waste.waste_and_diversion`
WHERE report_date BETWEEN DATE('{{ macros.ds_add(ds, -365) }}') AND DATE('{{ ds }}')
"""

Task 2: Check that there is data in Cloud Storage

Next, let’s check that the CSV file is in Cloud Storage. If you’re following along, just download the file from the STATWORX Github and upload it to your bucket. We pretend that this CSV gets uploaded to Cloud Storage by another process and contains data that we need (in reality I just extracted it from the same source table). So we don’t care how the CSV got into the bucket, we just want to know: is it there? This can easily be verified in Airflow with the GoogleCloudStorageObjectSensor which checks for the existence of a file in Cloud Storage. Notice the indent because it’s still part of our DAG context. Defining the task itself is simple: just tell Airflow which object to look for in your bucket.
    t2 = GoogleCloudStorageObjectSensor(task_id='check_gcs_file_exists',
                                        bucket=cfg.BUCKET,
                                        object=cfg.SOURCE_OBJECT)

Task 3: Extract data and save it to BigQuery

If the first two tasks succeed, then all the data we need is available! Now let’s extract some data from the source table and save it to a new table of our own. For this purpose, there’s none better than the BigQueryOperator.
    t3 = BigQueryOperator(task_id='write_weight_data_to_bq',
                          sql=queries.write_weight_data_to_bq,
                          destination_dataset_table=cfg.BQ_TABLE_WEIGHT,
                          create_disposition='CREATE_IF_NEEDED',
                          write_disposition='WRITE_TRUNCATE',
                          use_legacy_sql=False)
This operator sends a query called write_weight_data_to_bq to BigQuery and saves the result in a table specified by the config parameter cfg.BQ_TABLE_WEIGHT. We can also set a create and write disposition if we so choose. The query itself pulls the total weight of dead animals collected every day by Austin waste management services for a year. If the thought of possum pancakes makes you queasy, just substitute ‘RECYCLING – PAPER’ for the TYPE variable in the config file.

Task 4: Ingest Cloud Storage data into BigQuery

Once we’re done extracting the data above, we need to get the data that’s currently in our Cloud Storage bucket into BigQuery as well. To do this, just tell Airflow what (source) object from your Cloud Storage bucket should go to which (destination) table in your BigQuery dataset. Tip: You can also specify a schema at this step, but I didn’t bother since the autodetect option worked well.
    t4 = GoogleCloudStorageToBigQueryOperator(task_id='write_route_data_to_bq',
                                              bucket=cfg.BUCKET,
                                              source_objects=[cfg.SOURCE_OBJECT],
                                              field_delimiter=';',
                                              destination_project_dataset_table=cfg.BQ_TABLE_ROUTE,
                                              create_disposition='CREATE_IF_NEEDED',
                                              write_disposition='WRITE_TRUNCATE',
                                              skip_leading_rows=1)

Task 5: Merge BigQuery and Cloud Storage data

Now we have both the BigQuery source table extract and the CSV data from Cloud Storage in two separate BigQuery tables. Time to merge them! How do we do this? The BigQueryOperator is our friend here. We just pass it a SQL query that specifies how we want the tables merged. By specifying the destination_dataset_table argument, it’ll put the result into a table that we choose.
    t5 = BigQueryOperator(task_id='prepare_and_merge_data',
                          sql=queries.prepare_and_merge_data,
                          use_legacy_sql=False,
                          destination_dataset_table=cfg.BQ_TABLE_MERGE,
                          create_disposition='CREATE_IF_NEEDED',
                          write_disposition='WRITE_TRUNCATE')
The full query is shown below. I know it looks long and excruciating, but trust me, there are Tinder dates worse than this (‘So, uh, do you like SQL?’ – ‘Which one?’). If he/she follows it up with Lord of the Rings or ‘Yes’, propose!
What is it that this query is doing? Let’s recap: We have two tables at the moment. One is basically a time series on how much dead animal waste Austin public services collected over the course of a year. The second table contains information on what type of routes were driven on those days. As if this pipeline wasn’t weird enough, we now also want to know what the most common route type was on a given day. So we start by counting which route types were recorded on a given day. Next, we use a window function (... OVER (PARTITION BY ... ORDER BY ...)) to find the route type with the highest count for each day. In the end, we pull it out and, using the date as a key, merge it with the table containing the waste info.
prepare_and_merge_data = """
WITH
simple_route_counts AS (
SELECT report_date,
       route_type,
       count(route_type) AS count
FROM `my-first-project-238015.waste.route` 
GROUP BY report_date, route_type
),
max_route_counts AS (
SELECT report_date,
       FIRST_VALUE(route_type) OVER (PARTITION BY report_date ORDER BY count DESC) AS top_route,
       ROW_NUMBER() OVER (PARTITION BY report_date ORDER BY count desc) AS row_number
FROM simple_route_counts
),
top_routes AS (
SELECT report_date AS date,
       top_route
FROM max_route_counts
WHERE row_number = 1
)
SELECT a.date,
       a.type,
       a.weight,
       b.top_route
FROM `my-first-project-238015.waste.weight` a
LEFT JOIN top_routes b
ON a.date = b.date
ORDER BY a.date DESC
"""

Task 6: Export result to Google Cloud Storage

Let’s finish off this process by exporting our result back to Cloud Storage. By now, you’re probably guessing what the right operator is called. If you guessed BigQueryToCloudStorageOperator, you’re spot on. How to use it though? Just specify what the source table and the path (uri) to the Cloud Storage bucket are called.
    t6 = BigQueryToCloudStorageOperator(task_id='export_results_to_gcs',
                                        source_project_dataset_table=cfg.BQ_TABLE_MERGE,
                                        destination_cloud_storage_uris=cfg.DESTINATION_URI,
                                        export_format='CSV')
The only thing left to do now is to determine how the tasks relate to each other, i.e. set the dependencies. We can do this using the >> notation which I find more readable than set_upstream() or set_downstream(). But take your pick.
    t1 >> t2 >> [t3, t4] >> t5 >> t6
The notation above says: if data is available in the BigQuery source table, check next if data is also available in Cloud Storage. If so, go ahead, extract the data from the source table and save it to a new BigQuery table. In addition, transfer the CSV file data from Cloud Storage into a separate BigQuery table. Once those two tasks are done, merge the two newly created tables. At last, export the merged table to Cloud Storage as a CSV.

Conclusion

That’s it! Thank you very much for sticking with me to the end! Let’s wrap up what we did: we wrote a DAG file to define an automated ETL process that extracts, transforms and loads data with the help of two Google Cloud Platform services: BigQuery and Cloud Storage. Everything we needed was some Python and SQL code. What’s next? There’s much more to explore, from using the Web UI, monitoring your workflows, dynamically creating tasks, orchestrating machine learning models, and, and, and. We barely scratched the surface here. So check out Airflow’s official website to learn more. For now, I hope you got a better sense of the possibilities you have with Airflow and how you can harness its power to manage and automate your workflows.

Recently, some colleagues and I attended the 2-day COVID-19 hackathon #wirvsvirus, organized by the German government. There, we developed a great application for simulating COVID-19 curves based on estimations of governmental measure effectiveness (FlatCurver). As there are many COVID-related dashboards and visualizations out there, I thought that gathering the underlying data from a single point of truth would be a minor issue. However, I soon realized that there are plenty of different data sources, mostly relying on the Johns Hopkins University COVID-19 case data. At first, I thought that’s great, but at a second glance, I revised my initial thought. The JHU datasets have some quirky issues that make them a bit cumbersome to prepare and analyze:
  • weird column names including special characters
  • countries and states “in the mix”
  • wide format, quite unhandy for data analysis
  • import problems due to line break issues
  • etc.
For all of you who have been or are working with COVID-19 time series data and want to step up your data-pipeline game, let me tell you: we have an API for that! The API uses official data from the European Centre for Disease Prevention and Control and delivers a clear and concise data structure for further processing, analysis, etc.

Overview of our COVID-19 API

Our brand new COVID-19-API brings you the latest case number time series right into your application or analysis, regardless of your development environment. For example, you can easily import the data into Python using the requests package:
import requests
import json
import pandas as pd

# POST to API
payload = {'country': 'Germany'} # or {'code': 'DE'}
URL = 'https://api.statworx.com/covid'
response = requests.post(url=URL, data=json.dumps(payload))

# Convert to data frame
df = pd.DataFrame.from_dict(json.loads(response.text))
Or if you’re an R aficionado, use httr and jsonlite to grab the latest data and turn it into a cool plot.
library(httr)
library(dplyr)
library(jsonlite)
library(ggplot2)

# Post to API
payload <- list(code = "ALL")
response <- httr::POST(url = "https://api.statworx.com/covid",
                       body = toJSON(payload, auto_unbox = TRUE), encode = "json")

# Convert to data frame
content <- rawToChar(response$content)
df <- data.frame(fromJSON(content))

# Make a cool plot
df %>%
  mutate(date = as.Date(date)) %>%
  filter(cases_cum > 100) %>%
  filter(code %in% c("US", "DE", "IT", "FR", "ES")) %>%
  group_by(code) %>%
  mutate(time = 1:n()) %>%
  ggplot(., aes(x = time, y = cases_cum, color = code)) +
  xlab("Days since 100 cases") + ylab("Cumulative cases") +
  geom_line() + theme_minimal()
[Image: covid-race – cumulative COVID-19 cases since the 100th case, by country]

Developing the API using Flask

Developing a simple web app in Python is straightforward with Flask, a web framework that lets you create websites, web applications, and APIs right from Python. Flask is widely used to develop web services. A simple Flask app looks something like this.
from flask import Flask
app = Flask(__name__)

@app.route('/')
def handle_request():
  """ This code gets executed """
  return 'Your first Flask app!'
In the example above, the app.route decorator defines the URL at which our function is triggered. You can specify multiple decorators to trigger different functions for each URL. You might want to check out our code in the Github repository to see how we built the API using Flask.

Deployment using Google Cloud Run

Developing the API using Flask is straightforward. However, building the infrastructure and auxiliary services around it can be challenging, depending on your specific needs. A couple of things you have to consider when deploying an API:
  • Authentication
  • Security
  • Scalability
  • Latency
  • Logging
  • Connectivity
We’ve decided to use Google Cloud Run (GCR), a container-based serverless computing framework on Google Cloud. Basically, GCR is a fully managed Kubernetes-based service that allows you to deploy scalable web services or other serverless functions based on your container. This is what our Dockerfile looks like.
# Use the official image as a parent image
FROM python:3.7

# Copy the file from your host to your current location
COPY ./main.py /app/main.py
COPY ./requirements.txt /app/requirements.txt

# Set the working directory
WORKDIR /app

# Run the command inside your image filesystem
RUN pip install -r requirements.txt

# Inform Docker that the container is listening on the specified port at runtime.
EXPOSE 80

# Run the specified command within the container.
CMD ["python", "main.py"]
You can develop your container locally and then push it into the container registry of your GCP project. To do so, you have to tag your local image using docker tag according to the following scheme: [HOSTNAME]/[PROJECT-ID]/[IMAGE]. The hostname is one of the following: gcr.io, us.gcr.io, eu.gcr.io, asia.gcr.io. Afterward, you can push it using docker push, followed by your image tag. From there, you can easily connect the container to the Google Cloud Run service:
[Image: connecting the container image to Google Cloud Run]
When deploying the service, you can define parameters for scaling, etc. However, this is not in scope for this post. Furthermore, GCR allows custom domain mapping to functions. That’s why we have the neat API endpoint https://api.statworx.com/covid.

Conclusion

Building and deploying a web service is easier than ever. We hope that you find our new API useful for your projects and analyses regarding COVID-19. If you have any questions or remarks, feel free to contact us or to open an issue on Github. Lastly, if you make use of our free API, please add a link to our website, https://statworx-1727.demosrv.dev, to your project. Thanks in advance and stay healthy!

When I started with R, I soon discovered that, more often than not, a package name has a particular meaning. For example, the first package I ever installed was foreign. The name corresponds to its ability to read and write data from foreign sources to R. While this and many other names are rather straightforward, others are much less intuitive. The name of a package often conveys a story, which is inspired by a general property of its functions. And sometimes I just don’t get the deeper meaning, because English is not my native language. In this blog post, I will shed light on the wonderful world of package names. After this journey, you will not only admire the creativity of R package creators; you’ll also be king at your next class reunion! Or at least at the next R-Meetup. Before we start, and I know that you are eager to continue, I have two remarks about this article. First: Sometimes, I refer to official explanations from the authors or other sources; other times, it’s just my personal explanation of why a package is called that way. So if you know better or otherwise, do not hesitate to contact me. Second: There are currently 15,341 packages on CRAN, and I am sure there are a lot more naming mysteries and ingenuities to discover than any curious blog reader would like to digest in one sitting. Therefore, I focussed on the most famous packages and added some of my other preferences. But enough of the talking now, let’s start!

dplyr (diːˈplaɪə)

You might have noticed that many packages contain the string plyr, e.g. dbplyr, implyr, dtplyr, and so on. This homophone of pliers corresponds to its refining of base R apply-functions as part of the “split-apply-combine” strategy. Instead of doing all steps for data analysis and manipulation at once, you split the problem into manageable pieces, apply your function to each piece, and combine everything together afterward. We see this approach in perfection when we use the pipe operator. The first part of each package just refers to the object it is applied upon. So the d stands for data frames, db for databases, im for Apache Impala, dt for data tables, and so on… Sources: Hadley Wickham
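To make the “split-apply-combine” idea concrete, here is a minimal sketch using dplyr and the built-in mtcars data (the summary column names are my own invention):
library(dplyr)

# split mtcars by number of cylinders, apply mean() to each piece,
# and combine the results into a single summary table
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),
            mean_hp = mean(hp))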

lubridate (ˈluːbrɪdeɪt)

This wonderful package makes it so easy and smooth to work with dates and times in R. You could say it runs like clockwork. In German, there is a proverb with the same meaning (“Das läuft wie geschmiert”) that can literally be translated to: “It works as lubricated”.
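As a small sketch of that smoothness (the dates are made up):
library(lubridate)

# parse dates from differently formatted strings
start <- ymd("2020-02-20")
end <- dmy("14/02/2021")

# date arithmetic reads almost like prose
end - start               # time difference in days
start + months(3)         # add three months
wday(start, label = TRUE) # day of the week as a labelled factor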

ggplot2 (ʤiːʤiːplɒt tuː)

Leland Wilkinson wrote a book in which he defined multiple components that a comprehensive plot is made of. You have to define the data you want to show, what kind of plot it should be, e.g., points or lines, the scales of the axes, the legend, axis titles, etc. These parts, he called them layers, should be built on top of each other. The title of this influential work is The Grammar of Graphics. Once you get it, it enables you to build complex yet meaningful plots with concise styling across packages. That’s because its logic has also been used by many other packages like plotly, rBokeh, visNetwork, or apexcharter. Sources: ggplot2
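A minimal sketch of this layered logic with the built-in mtcars data; each + adds one layer on top of the previous ones:
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +   # data and aesthetic mappings
  geom_point(aes(color = factor(cyl))) + # geom layer: points, colored by cylinders
  geom_smooth(method = "lm") +           # another geom layer: a linear trend line
  labs(x = "Weight", y = "Miles per gallon",
       color = "Cylinders") +            # labels for axes and legend
  theme_minimal()                        # theming layer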

data.table (ˈdeɪtə ˈteɪbl) – logo

Okay, full disclosure, I am a tidyverse guy, and one of my sons shall be named Hadley. At least one. However, this does not mean that I don’t appreciate the very powerful package data.table. Occasionally, I take the liberty and exploit its functions to improve the performance of my code (hello fread() and rbindlist()). Anyway, the name itself is pretty straightforward – but did you notice how cool the logo is?! Well, there is obviously the name “data.table” and the square brackets that are fundamental in data.table syntax. Likewise, there is the assignment by reference operator, a.k.a. the walrus operator. “Wait, stop,” your inner marine mammal researcher says, “isn’t this a sea lion on top there?!” Yes indeed! The sea lion is used to highlight that it is an R package since, of course, it shouts R! R!. Source: Rdatatable
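For illustration, a minimal sketch of the square-bracket syntax and the assignment-by-reference operator := from the logo (the kpl column is my own invention):
library(data.table)

dt <- as.data.table(mtcars)

# := adds a column by reference - no copy of the table is made
dt[, kpl := mpg * 0.4251]

# the general form dt[i, j, by]: filter rows, compute, group
dt[cyl > 4, .(mean_kpl = mean(kpl)), by = gear]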

tibble (tɪbl)

Regular base R data frames are nice, but did you ever print a data frame in the console, unaware that it is 10 million rows long? Good luck with interrupting R without quitting the whole session. That might be one of the reasons why the tidyverse uses another type of data frames: tibbles. The name tibble could just stem from its similar sound to table, but I suspect there is more to it than meets the eye. Did you ever hear the story about Tibbles and Stephen Island’s Wren? NO? Then let me take you to New Zealand, AD 1894. Between the northern and southern main islands of NZ, there is a small and uninhabited island: Stephen Island. Its rocks have been the downfall of many poor souls that tried to pass the Cook Strait. Therefore, it was decided to build a lighthouse so that ships would henceforth pass safely and undamaged. Due to its isolation, Stephen Island was the only habitat for many rare species. One of these was Lyall’s wren, a small flightless passerine. It did not know any predators and lived its life in joy and harmony, until… the arrival of the first lighthouse keeper. His name was David Lyall and he was a man interested in natural history and, facing a long and lonely time on his own at Stephen Island, the owner of a cat. This cat was not satisfied by just comforting Mr. Lyall and enjoying beach walks. Shortly after his arrival, Mr. Lyall noticed the carcasses of little birds, seemingly slaughtered and dishonored by a fierce predator. Interested in biology as he was, he found out that these small birds were a distinct species. He preserved some carcasses in alcohol and sent them to a friend. This was in October 1894. A scientific article about the wren was published in an ornithology journal, soon making the specimen a sought-after collector’s item. The summer in New Zealand went on, and in February 1895, four bird-watchers arrived at Stephen Island. They were looking for this cute little wren and found… none. Within a few months, Mr. Lyall’s hungry cat had made the whole species go extinct. On March 16, 1895, the Christchurch newspaper The Press wrote: “there is very good reason to believe that the bird is no longer to be found on the island, and, as it is not known to exist anywhere else, it has apparently become quite extinct. This is probably a record performance in the way of extermination.” The name of the cat? Tibbles. Sources: Wikipedia; All About Birds; Oddity Central. Indicator: the fridge of Hadley Wickham’s parents
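The printing behaviour mentioned at the beginning is easy to verify yourself; a quick sketch with simulated data:
library(tibble)

# a million rows, but printing stays safe and concise:
# a tibble shows only the first 10 rows plus a note on the rest
big <- tibble(x = rnorm(1e6), y = runif(1e6))
big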

purrr (pɜːɜː)

This extension of the base R apply-functions has been one of my favorites lately. The concise usage of purrr enables powerful functional programming that, in turn, makes your code faster, more readable, and more stable. Or, as Mr. Wickham states, it makes “your pure R functions purr”. Also, note its parallelized sibling furrr. Sources: Hadley Wickham
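A short sketch of what this looks like in practice; map_dbl() applies a function to every element and guarantees a double vector as the result:
library(purrr)

# mean of every column of mtcars, returned as a named double vector
map_dbl(mtcars, mean)

# split-map-combine: fit one model per cylinder group,
# then extract the wt coefficient of each model
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map_dbl(~ coef(.x)["wt"])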

Amelia (əˈmiːlɪə)

During my Master’s degree, I had a course about missing data and multiple imputation. One of the packages we used, or rather analyzed, was Amelia. It turned out that this package is named after an impressive woman: Amelia Earhart. Living in the early 20th century, she was an aviation pioneer and feminist. She was the first woman to fly solo across the Atlantic, a remarkable achievement and an inspiration for women to start a technical career. Unfortunately, she disappeared during a flight over the central Pacific at age 39 and is thus… missing. ba dum-tss Source: Gary King – Co-Author

magrittr (maɡʁitə)

The conciseness of coding with dplyr or its siblings is not imaginable without the pipe operator %>%. This allows you to write and read code from top to bottom and from left to right, just like regular text. Pipes are no special feature of R, yet I am sure René Magritte had nothing else in mind when he painted The Treachery of Images in 1929 with its slogan: “Ceci n’est pas une pipe”. The logo designers just made a slight adjustment to his painting. Or should I say: unearthed the meaning that has always been behind it?! Sources: Vignette, revolutionanalytics.com
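For readers who have not met the pipe yet, a tiny sketch of how it turns nested calls into a left-to-right pipeline:
library(magrittr)

# nested: has to be read from the inside out
round(exp(sqrt(16)), 1)

# piped: reads from left to right, like regular text
16 %>%
  sqrt() %>%
  exp() %>%
  round(1)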

batman (ˈbætmən)

Data science can be quite fun if it weren’t for the data. Especially when working with textual data, typos and inconsistent coding can be very cumbersome. For example, you’ve got questionnaire data consisting of yes/no questions. For R, this corresponds to TRUE/FALSE, but who would write this in a questionnaire? In fact, when we try to convert such data to logical values by calling as.logical(), almost every string becomes NA. Lost and doomed? NO! Because who is better suited to determine actual NA‘s than nananananana… batman!
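A hedged sketch of the idea, assuming the package’s to_logical() function (check the documentation for the exact set of recognized spellings):
library(batman)

messy <- c("Yes", "no", "Y", "N", "true")

as.logical(messy) # base R: NA NA NA NA TRUE
to_logical(messy) # batman maps common yes/no variants to TRUE/FALSE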

Homeric (həʊˈmɛrɪk)

Hey, you made it all the way down here?! You deserve a little treat! What about a soft, sweet, and special-sprinkled donut? And who would be better suited to present it to you than the best-known lover of donuts himself: Homer Simpson! Just help yourself: Homeric::PlotDoughnut(1, col = "magenta") Source: Homeric Documentation

fcuk (fʌk)

Error in view(my_data): could not find function "view". Are you sick and tired of this or similar error messages? Do you regularly employ your ample stock of swear words to describe the stupidity of inconsistent usage of camel or snake case function names across packages? Or do you just type faster than your shadow, causing minor typos in your otherwise excellent code? There is help! Just go and install the amazing fcuk package, and useless error messages are a thing of the past.

hellno (hɛl nəʊ)

Slip into the role of a dedicated R user: how much trouble must one have had with a specific default argument value of a base R function to write an entire package that just handles this case? I am talking about the tormentor of many beginRs when working with as.data.frame(): stringsAsFactors = TRUE. But this package does not only change it to FALSE! It also creates its own FALSE value and names it HELLNO.

Honorable mentions

  • gremlin: package for mixed-effects model REML incorporating Generalized Inverses.
  • harrietr: named after Charles Darwin’s pet giant tortoise. A package for phylogenetic and evolutionary biology data manipulations.
  • beginr: it helps where we’ve all been, searching for ages until setting pch = 16.
  • charlatan: worse than creating dubious medicine, this one makes fake data.
  • fauxpas: explains what specific HTTP errors mean.
  • fishualize: give your plots a fishy look.
  • greybox: why just thinking black or white? This is a package for time series analysis.
  • vroom: it reads data so fast to R, you almost can hear it making vroom vroom.
  • helfRlein: some little helper functions, inspired by the German word Helferlein = little helper.
At STATWORX, coding is our bread and butter. Because our projects involve many different people in several organizations across multiple generations of programmers, writing clean code is essential. The main requirements for well structured and readable code are comments and sections. In RStudio, these sections are defined by comments that end with at least four dashes ---- (you can also use trailing equal signs ==== or hashes ####). In my opinion, the code is even clearer if the dashes cover the whole range of 80 characters (why you should not exceed the 80 characters limit). This is what my code usually looks like:
# loading packages -------------------------------------------------------------
library(dplyr)

# load data --------------------------------------------------------------------
my_iris <- as_tibble(iris)

# prepare data -----------------------------------------------------------------
my_iris_preped <- my_iris %>% 
  filter(Species == "virginica") %>% 
  mutate_if(is.numeric, list(squared = sqrt))

# ...
Clean, huh? Well, yes, but none of the three options available to achieve this is as neat as I would like:
  • Press - for some time.
  • Copy a certain amount of dashes and insert them sequentially. Both options often result in too many dashes, so I have to remove the redundant ones.
  • Use the shortcut to insert a new section (CMD/CTRL + SHIFT + R). However, you cannot neatly include it after you wrote your comments.
Wouldn’t it be nice to have a keyboard shortcut that inserts the right number of dashes from the cursor position onward? “Easy as can be,” I thought before trying to define a custom shortcut in RStudio. Unfortunately, it turned out not to be that easy. There is a manual from RStudio that actually covers how you can create your shortcut, but it requires you to put it in a package first. Since I have not been an expert in R package development myself, I decided to go the full distance in this blogpost. By following it step by step, you should be able to define your shortcuts within a few minutes. Note: This article is not about creating a CRAN-worthy package, but covers what is necessary to define your own shortcuts. If you have already created packages before, you can skip the parts about package development and jump directly to what is new to you.

Setting up an R package

First of all, open RStudio and create an R package directory. For this, please do the following steps:
  1. Go to “New Project…”
  2. “New Directory”
  3. “R Package”
  4. Select an awesome package name of your choice. In this example, I named my package shoRtcut
  5. In “Create project as subdirectory of:” select a directory of your choice. A new folder with your package name will be created in this directory.
Tada, everything necessary for a powerful R package has been set up. RStudio also automatically provides a dummy function hello(). Since we do not like to have this function in our own package, move to the “R” folder in your project and delete the hello.R file. Do the same in the “man” folder and delete hello.Rd.

Creating an Addin Function

Now we can start and define our function. For this, we need the wonderful packages usethis and devtools. These provide all the functionality we need for the next steps.

Defining the Addin Function

Via the use_r() function, we define a new R script file with the given name. That should correspond to the name of the function we are about to create. In my case, I call it set_new_chapter.
# use this function to automatically create a new r script for your function
usethis::use_r("set_new_chapter")
You are directly forwarded to the created file. Now the tricky part begins, defining a function that does what you want. When defining shortcuts that interact with an R script in RStudio, you will soon discover the package rstudioapi. With its functions, you can grab all information from RStudio and make it available within R. Let me guide you through it step by step.
  1. As per usual, I set up a regular R function and define its name as set_new_chapter. Next, I define the limit up to which the dashes should be included. You will note that I set nchars to 81 rather than 80. This is because the number corresponds to the cursor position after including the dashes. You will notice that when you write text, the cursor automatically jumps to the position right after the newly typed character. After you have written your 80th character, the cursor will be at position 81.
  2. Now we have to find out where the cursor is currently located. This information can be unearthed by the getActiveDocumentContext() function. The returned object contains quite a bit of information, but we are only interested in the cursor position regarding the column. Why the column? You can think of the script like a matrix. Hitting return brings you to a new row, typing a character to a new column. Having a font with equal-space characters, which is the default setting in RStudio, makes this concept easy to see.
  3. By sneaking into the nested list, we find the information we are looking for and store it in context_col. Now we check whether the cursor is already at “column” 81. If not, there is space in which we insert the dashes. For this final step, we can use another function: insertText.
  4. As its name implies, it inserts text in an R script or console. You can either specify a specific position in the document or, by leaving it empty, insert text at the current cursor position, which is exactly what I want right now. As the final step, I need to find out the number of dashes that should be inserted. That’s the difference between the current cursor location and its target position. For example, if the cursor blinks at column 51, meaning I already have typed 50 characters, I want to insert 30 dashes.
  5. To document the function, I use the “Code” > “Insert Roxygen Skeleton” feature and fill it out appropriately.
This is what my final function looks like.
#' Insert dashes from cursor position up to 80 characters
#'
#' @return dashes inside RStudio
set_new_chapter <- function(){
  # set limit to which position dashes should be included
  nchars <- 81

  # grab current document information
  context <- rstudioapi::getActiveDocumentContext()
  # extract horizontal cursor position in the document
  context_col <- context$selection[[1]]$range$end["column"]

  # if a line has less than 81 characters, insert hyphens at the current line
  # up to 80 characters
  if (nchars > context_col) {
    rstudioapi::insertText(strrep("-", nchars - context_col))
  }
}

Defining the Function as an Addin

Now we must somehow tell RStudio that this particular function should be used as an addin rather than a regular function. For this, go to “File” > “New File” > “Text File” and include the following text:
Name: Insert Dashes (---)
Description: Inserts `---` at the cursor position up to 80 characters.
Binding: set_new_chapter
Interactive: false
  • Name is a short description of what the addin does. This will be displayed when you want to set the shortcut later.
  • Description is a longer description of its functionality.
  • Binding sets the name of the function that should be called by the shortcut.
  • Interactive defines whether this addin is interactive (e.g., runs a Shiny application) or not.
You now must save this file as “addins.dcf” in your project under the following path: “inst” > “rstudio”.

Finalize the Package

To wrap everything up and make the shortcut available to you and your colleagues, we only have to call a few more functions. Not all these steps are necessary, yet it is good practice to create a proper package.
# OPTIONAL: define the license of your package
usethis::use_mit_license(name = "Matthias Nistler")

# define dependencies you use in your package
usethis::use_package("rstudioapi")

# OPTIONAL: include your function description to the manual
roxygen2::roxygenise()

# check for errors
devtools::check()

# update/create your package
devtools::build()

> ✓  checking for file ‘/Users/matthiasnistler/Projekte/2020/blog_shoRtcut/DESCRIPTION’ ...
> ─  preparing ‘shoRtcut’:
> ✓  checking DESCRIPTION meta-information ...
> ─  checking for LF line-endings in source and make files and shell scripts
> ─  checking for empty or unneeded directories
> ─  building ‘shoRtcut_0.0.0.9000.tar.gz’
> [1] "/Users/matthiasnistler/Projekte/2020/shoRtcut_0.0.0.9000.tar.gz"
There you go! You just created an awesome package that you can distribute to your friends and colleagues.

Make the shortcut available

For the last step, you have to install your package and set a keyboard combination for your shortcut. For this, use the following specification of install.packages:
install.packages(
    # same path as above
  "/Users/matthiasnistler/Projekte/2020/shoRtcut_0.0.0.9000.tar.gz", 
  # indicate it is a local file
  repos = NULL)

# check if everything works
shoRtcut:::set_new_chapter()
Now go to “Tools” > “Modify Keyboard Shortcuts…” and search for “dashes”. Here you can define the keyboard combination by clicking inside the empty “Shortcut” field and pressing the desired key-combination on your keyboard. Click “Apply”, and that’s it!
In case you are just here to use my shortcut, you can install it via remotes::install_github("mnist91/shoRtcut").

Congratulations!

You made it! Now you can use your own RStudio shortcut. Exciting, isn’t it? But that’s not all there is – next week, I will give you an introduction to the wonderful world of R package naming. So stay tuned and happy coding!

REST APIs have become a quasi-standard, be it to provide an interface to your application processes or to set up a flexible microservice architecture. Sooner or later, you might ask yourself what a proper testing schema would look like and which tools can support you. Some time ago, we at STATWORX asked ourselves this question. A toolset that helps us with this task is the combination of Postman and Newman, which I will present to you in this blog post. Many of you who are regularly using and developing REST APIs might already be quite familiar with Postman. It’s a handy and comfortable desktop tool that comes with some excellent features (see below). Newman, in contrast, is a command-line agent that runs the previously defined requests. Because of its lean interface, it can be used in several situations; for instance, it can easily be integrated into the testing stages of pipelines.
In the following, I will explain how these siblings can be used to implement a neat testing environment. We start with Postman’s feature sets, then move on to the ability to interact with Newman. We will further have a look at a testing schema, touching some test cases, and lastly, integrate it into a Jenkins pipeline.

About Postman

Postman is a convenient desktop tool for handling REST requests. Furthermore, Postman gives you the possibility to define test cases (in JavaScript), has a feature to switch environments, and provides you with pre-request steps to set up the setting before your calls. In the following, I will give you examples of some interesting features.

Collection and Requests

Requests are the basic unit in Postman, and everything else spins around them. As I said previously, Postman’s GUI provides you with a comfortable way to define these: the request method can be picked from a drop-down list, header information is presented clearly, there is a helper for authorization, and much more.
You should have at least one collection per REST interface defined to bundle your requests. At the very end of the definition process, collections can be exported into JSON format. This export will later be consumed by Newman.

Environments

Postman also implements the concept of environment variables. This means: Depending on where your requests are fired from, the variables adapt. The API’s hostname is a good example that should be kept variable: In the development stage, it may be just your localhost but could be different in a dockerized environment.
The syntax of environment variables is double curly brackets. If you want to use the variable hostname, put it like this: {{ hostname }}
Like for collections, environments can be exported into JSON files. We should keep this in mind when we move to Newman.

Tests

Each API request in Postman should come along with at least one test. I propose the following list as an orientation on what to test:
  • the status code: Check the status code according to your expectation: regular GET requests are supposed to return 200 OK, POST requests 201 Created if successful. On the other hand, authorization should be tested, as well as invalid client requests, which are supposed to return 40x. See below a POST request test:
pm.test("Successful POST request", function () {
     pm.expect(pm.response.code).to.be.oneOf([201,202]);
 });
  • whether data is returned: Test if the response has any data, as a first approximation.
  • the schema of returned data: Test if the structure of the response data fits the expectations: non-nullable fields, data types, names of properties. Find below an example of a schema validation:
pm.test("Response has correct schema", function () {
    var schema = {"type":"object",
                  "properties":{
                      "access_token":{"type":"string"},
                      "created_on":{"type":"string"},
                      "expires_seconds":{"type":"number"}
                  }};
    var jsonData = pm.response.json();
    pm.expect(tv4.validate(jsonData,schema)).to.be.true;
});
  • values of returned data: Check if the values of the response data are sound; for non-negative values:
pm.test("Expires non negative", function() {
    pm.expect(pm.response.json().expires_seconds).to.be.above(0);
})
  • header values: Check the header of the response if useful information is stored there.
All tests have to be written in JavaScript. Postman ships with its own test library, including tv4 for schema validation.

Introduction to Newman

As mentioned before, Newman acts as an executor of what was defined in Postman. To generate results, Newman uses reporters. Reporters can be the command-line interface itself, but well-known standards such as JUnit are also available. The simplest way to install Newman is via NPM (Node Package Manager); there are ready-to-use NodeJS Docker images on DockerHub. Install the package via npm install -g newman. There are two ways to call Newman: via the command-line interface and from within JS code. We will focus only on the first.

Calling the CLI

To run a predefined test collection, use the command newman run. Please see the example below:
newman run
            --reporters cli,junit
            --reporter-junit-export /test/out/report.xml
            -e /test/env/auth_jwt-docker.pmenv.json
            /test/src/auth_jwt-test.pmc.json
Let us take a closer look: Recall that we have previously exported the collection and the environment from Postman. The environment can be attached with the -e option. Moreover, two reporters were specified: cli, which prints to the terminal, and junit, which additionally exports a report to the file report.xml. The CLI reporter prints the following (note that the first three test cases are those from the test schema proposal):
→ jwt-new-token
  POST http://tp_auth_jwt:5000/new-token/bot123 [201 CREATED, 523B, 42ms]
  ✓  Successful POST request
  ✓  Response has correct schema
  ✓  Expires non negative

→ jwt-auth
  POST http://tp_auth_jwt:5000/new-token/test [201 CREATED, 521B, 11ms]
  GET http://tp_auth_jwt:5000/auth [200 OK, 176B, 9ms]
  ✓  Status code is 200
  ✓  Login name is correct

→ jwt-auth-no-token
  GET http://tp_auth_jwt:5000/auth [401 UNAUTHORIZED, 201B, 9ms]
  ✓  Status is 401 or 403

→ jwt-auth-bad-token
  GET http://tp_auth_jwt:5000/auth [403 FORBIDDEN, 166B, 6ms]
  ✓  Status is 401 or 403

Integration into Jenkins

Newman functionality can now be integrated into (almost?) any pipeline tool. For Jenkins, we create a Docker image based on NodeJS with Newman installed. Next, we either pack or mount both the environment and the collection file into the Docker container. When running the container, we use Newman as a command-line tool, just as we did before. To use this in a test stage of a pipeline, we have to make sure that the REST API is actually running when Newman is executed. In the following example, the functionalities were defined as targets of a Makefile:
  • run to run the REST API with all dependencies
  • test to run Newman container which itself runs the testing collections
  • rm to stop and remove the REST API
After the API has been tested, the JUnit report is digested by Jenkins with the command junit <report>. See below a pipeline snippet of a test run:
node{
       stage('Test'){
            try{
                sh "cd docker && make run"
                sh "sleep 5"
                sh "cd docker && make test"
                junit "source/test/out/report.xml"

            } catch (Exception e){
                    echo e
            } finally {
                    sh "cd docker && make rm"
            }
        }
}

Summary

Now it’s time to code tests for your REST API. Please also try to integrate them into your build-test cycle and into your automation pipeline, because automation and defined processes are crucial to delivering reliable code and packages. I hope that with this blog post, you now have a better understanding of how Postman and Newman can be used to implement a test framework for REST APIs. Postman was used as a definition tool, whereas Newman was the runner of these definitions. Because of its nature, we have also seen that Newman is the tool for your build pipeline. Happy coding!

We’re hiring!

Data Engineering is your jam and you’re looking for a job? We’re currently looking for Junior Consultants and Consultants in Data Engineering. Check the requirements and benefits of working with us on our career site. We’re looking forward to your application!

At STATWORX, deploying our project results with the help of Shiny has become part of our daily business. Shiny is a great way of letting users interact with their own data and the data science products that we provide. Applying the philosophy of reactivity to your app’s UI is an interesting way of bringing your apps closer in line with the spirit of the Shiny package. Shiny was designed to be reactive, so why limit this to only the server-side of your app? Introducing dynamic UI elements to your apps will help you reduce visual clutter, make for cleaner code and enhance the overall feel of your applications. I have previously discussed the advantages of using renderUI in combination with lapply and do.call in the first part of this series on dynamic UI elements in Shiny. Building onto this, I would like to expand our toolbox for reactive UI design with a few more options.

The objective

In this particular case we’re trying to build an app where one of the inputs reacts to another input dynamically. Let’s assume we’d like to present the user with multiple options to choose from in the shape of a selectInput. Let’s also assume that one of the options may call for more input from the user, let’s say a comment, to explain more clearly the previous selection. One way to do this would be to add a static textInput or similar to the app. A much more elegant solution would be to conditionally render the second input to only appear if the proper option had been selected. The image below shows how this would look in practice.
[Image: the example app with a conditionally rendered comment field]
There are multiple ways of going about this in Shiny. I’d like to introduce two of them to you, both of which lead to the same result but with a few key differences between them.

A possible solution: req

What req is usually used for

req is a function from the Shiny package whose purpose is to check whether certain requirements are met before proceeding with your calculations inside a reactive environment. Usually this is used to avoid red error messages popping up in your ShinyApp UI when an element of your app depends on an input that doesn’t have a set value yet. You may have seen one of these before:
[Image: a typical Shiny error message in the UI]
These errors usually disappear once you have assigned a value to the needed inputs. req makes it so that your desired output is only calculated once its required inputs have been set, thus offering an elegant way to avoid the rather garish looking error messages in your app’s UI.

How we can make use of req

In terms of reactive UI design we can make use of req‘s functionality to introduce conditional statements to our uiOutputs. This is achieved by using renderUI and req in combination as shown in the following example:
output$conditional_comment <- renderUI({
    # specify condition
    req(input$select == "B")

    # execute only if condition is met
    textAreaInput(inputId = "comment", 
                  label = "please add a comment", 
                  placeholder = "write comment here") 
  })
Within req the condition to be met is specified and the rest of the code inside the reactive environment created by renderUI is only executed if that condition is met. What is nice about this solution is that if the condition has not been met there will be no red error messages or other visual clutter popping up in your app, just like what we’ve seen at the beginning of this chapter.

A simple example app

Here’s the complete code for a small example app:
library(shiny)
library(shinydashboard)

ui <- dashboardPage(

  dashboardHeader(),
  dashboardSidebar(
    selectInput(inputId = "select", 
                label = "please select an option", 
                choices = LETTERS[1:3]),
    uiOutput("conditional_comment")
  ),
  dashboardBody(
    uiOutput("selection_text"),
    uiOutput("comment_text")
  )
)

server <- function(input, output) {

  output$selection_text <- renderUI({
    paste("The selected option is", input$select)
  })

  output$conditional_comment <- renderUI({
    req(input$select == "B")
    textAreaInput(inputId = "comment", 
                  label = "please add a comment", 
                  placeholder = "write comment here")
  })

  output$comment_text <- renderText({
    input$comment
  })
}

shinyApp(ui = ui, server = server)
If you try this out by yourself you will find that the comment box isn’t hidden or disabled when it isn’t being shown, it simply doesn’t exist unless the selectInput takes on the value of “B”. That is because the uiOutput object containing the desired textAreaInput isn’t being rendered unless the condition stated inside of req is satisfied.

The popular choice: conditionalPanel

Out of all the tools available for reactive UI design this is probably the most widely used. The results obtained with conditionalPanel are quite similar to what req allowed us to do in the example above, but there are a few key differences.

How does this differ from req?

conditionalPanel was designed to specifically enable Shiny-programmers to conditionally show or hide UI elements. Unlike the req-method, conditionalPanel is evaluated within the UI-part of the app, meaning that it doesn’t rely on renderUI to conditionally render the various inputs of the shinyverse. But wait, you might ask, how can Shiny evaluate any conditions in the UI-side of the app? Isn’t that sort of thing always done in the server-part? Well yes, that is true if the expression is written in R. To get around this, conditionalPanel relies on JavaScript to evaluate its conditions. After stating the condition in JS we can add any given UI-elements to our conditionalPanel as shown below:
conditionalPanel(
      # specify condition
      condition = "input.select == 'B'",

      # execute only if condition is met
      textAreaInput(inputId = "comment", 
                    label = "please add a comment", 
                    placeholder = "write comment here")
    )
This code chunk displays the same behaviour as the example shown in the last chapter with one major difference: It is now part of our ShinyApp’s UI-function unlike the req-solution, which was a uiOutput calculated in the server-part of the app and later passed to our UI-function as a list-element.

A simple example app:

Rewriting the app to include conditionalPanel instead of req yields a script that looks something like this:
library(shiny)
library(shinydashboard)

ui <- dashboardPage(

  dashboardHeader(),
  dashboardSidebar(
    selectInput(inputId = "select", 
                label = "please select an option", 
                choices = LETTERS[1:3]),
    conditionalPanel(
      condition = "input.select == 'B'",
      textAreaInput(inputId = "comment", 
                    label = "please add a comment", 
                    placeholder = "write comment here")
    )
  ),
  dashboardBody(
    uiOutput("selection_text"),
    textOutput("comment_text")
    )
)

server <- function(input, output) {

  output$selection_text <- renderUI({
    paste("The selected option is", input$select)
  })

  output$comment_text <- renderText({
    input$comment
  })
}

shinyApp(ui = ui, server = server)
With these two simple examples we have demonstrated multiple ways of letting your displayed UI elements react to how a user interacts with your app – both on the server as well as the UI side of the application. In order to keep things simple I have used a basic textAreaInput for this demonstration, but both renderUI and conditionalPanel can hold so much more than just a simple input element. So get creative and utilize these tools, maybe even in combination with the functions from part 1 of this series, to make your apps even shinier!

Did you know that you can transform plain old static ggplot graphs into animated ones? Well, you can with the help of the package gganimate by RStudio’s Thomas Lin Pedersen and David Robinson, and the results are amazing! My STATWORX colleagues and I are very impressed by how effortlessly all kinds of geoms are transformed into suuuper smooth animations. That’s why in this post I will provide a short overview of some of the wonderful functionalities of gganimate. I hope you’ll enjoy them as much as we do! Since Valentine’s Day is just around the corner, we’re going to explore the Speed Dating Experiment dataset compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar. Hopefully, we’ll learn about gganimate as well as how to find our Valentine. If you like, you can download the data from Kaggle.

Defining the basic animation: transition_*

How are static plots put into motion? Essentially, gganimate creates data subsets, which are plotted individually and constitute the substantial frames, which, when played consecutively, create the basic animation. The results of gganimate are so seamless because gganimate takes care of the so-called tweening for us by calculating data points for transition frames displayed in between the frames with actual input data. The transition_* functions define how the data subsets are derived and thus define the general character of any animation. In this blogpost we’re going to explore three types of transitions: transition_states(), transition_reveal() and transition_filter(). But let’s start at the beginning with transition_states(). Here the data is split into subsets according to the categories of the variable provided to the states argument. If several rows of a dataset pertain to the same unit of observation and should be identifiable as such, a grouping variable defining the observation units needs to be supplied. Alternatively, an identifier can be mapped to any other aesthetic. Please note: to ensure the readability of this post, all text concerning the interpretation of the speed dating data is written in italics. If you’re not interested in that part, you can simply skip those paragraphs. For the data prep, I’d like to refer you to my GitHub. First, we’re going to explore what the participants of the Speed Dating Experiment look for in a partner. Participants were asked to rate the importance of attributes in a potential date by allocating a budget of 100 points to several characteristics, with higher values denoting a higher importance. The participants were asked to rate the attributes according to their own views. Further, the participants were asked to rate the same attributes according to the presumed wishes of their same-sex peers, meaning they allocated the points in the way they supposed their average same-sex peer would do. We’re going to plot all of these ratings (x-axis) for all attributes (y-axis). Since we want to compare the individual wishes to the individually presumed wishes of peers, we’re going to transition between both sets of ratings. Color always indicates the personal wishes of a participant. A given bubble indicates the rating of one specific participant for a given attribute, switching between one’s own wishes and the wishes assumed for peers.
## Static Plot
# ...characteristic vs. (presumed) rating...
# ...color&size mapped to own rating, grouped by ID
plot1 <- ggplot(df_what_look_for, 
       aes(x = value,
           y = variable,
           color = own_rating, # bubbles are always colored according to own wishes
           size = own_rating,
           group = iid)) + # identifier of observations across states
  geom_jitter(alpha = 0.5, # to reduce overplotting: jittering & alpha
              width = 5) + 
  scale_color_viridis(option = "plasma", # use viridis' plasma scale
                      begin = 0.2, # limit range of used hues
                      name = "Own Rating") +
  scale_size(guide = FALSE) + # no legend for size
  labs(y = "", # no axis label
       x = "Allocation of 100 Points",  # x-axis label
       title = "Importance of Characteristics for Potential Partner") +
  theme_minimal() +  # apply minimal theme
  theme(panel.grid = element_blank(),  # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot
plot1 + 
  transition_states(states = rating) # animate contrast subsets acc. to variable rating  
[Animation: transition_states with default arguments]
First off, if you’re a little confused which state is which, please be patient, we’ll explore dynamic labels in the section about ‘frame variables’. It’s apparent that different people look for different things in a partner. Yet attractiveness is often prioritized over other qualities. But the importance of attractiveness varies most strongly of all attributes between individuals. Interestingly, people are quite aware that their peer’s ratings might differ from their own views. Further, especially the collective presumptions (= the mean values) about others are not completely off, but of higher variance than the actual ratings. So there is hope for all of us that somewhere out there somebody is looking for someone just as ambitious or just as intelligent as ourselves. However, it’s not always the inner values that count. gganimate allows us to tailor the details of the animation according to our wishes. With the argument transition_length we can define the relative length of the transition from one real subset of data to the next, and with state_length how long, relatively speaking, each subset of original data is displayed. Only if the wrap argument is set to TRUE will the last frame get morphed back into the first frame of the animation, creating an endless and seamless loop. Of course, the arguments of different transition functions may vary.
## Animated Plot
# ...replace default arguments
plot1 + 
  transition_states(states = rating,
                    transition_length = 3, # 3/4 of total time for transitions
                    state_length = 1, # 1/4 of time to display actual data
                    wrap = FALSE) # no endless loop
[Animation: transition_states with custom transition_length, state_length and wrap arguments]

Styling transitions: ease_aes

As mentioned before, gganimate takes care of tweening and calculates additional data points to create smooth transitions between successively displayed points of actual input data. With ease_aes we can control which so-called easing function is used to ‘morph’ original data points into each other. The default argument is used to declare the easing function for all aesthetics in a plot. Alternatively, easing functions can be assigned to individual aesthetics by name. Amongst others, quadratic, cubic, sine and exponential easing functions are available, with the linear easing function being the default. These functions can be customized further by adding a modifier-suffix: with -in the function is applied as-is, with -out the function is reversely applied, and with -in-out the function is applied as-is in the first half of the transition and reversed in the second half. Here I played around with an easing function that models the bouncing of a ball.
## Animated Plot
# ...add special easing function
plot1 + 
  transition_states(states = rating) + 
  ease_aes("bounce-in") # bouncy easing function, as-is
[Animation: tran-states-ease]
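Easing functions don’t have to apply globally: they can also be assigned to individual aesthetics by name. As a minimal sketch (reusing plot1 from above; the exact visual effect will depend on your data), the following keeps the linear default for all aesthetics but eases only the x aesthetic:
## Animated Plot
# ...sketch: easing assigned to a single aesthetic by name
plot1 + 
  transition_states(states = rating) + 
  ease_aes(x = "cubic-in-out") # cubic easing for x only; all other aesthetics keep the linear default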

Dynamic labelling: {frame variables}

To ensure that we, mesmerized by our animations, do not lose the overview, gganimate provides so-called frame variables that supply metadata about the animation as a whole or the previous/current/next frame. The frame variables – when wrapped in curly brackets – are available for string literal interpretation within all plot labels. For example, we can label each frame with the value of the states variable that defines the currently (or soon to be) displayed subset of actual data:
## Animated Plot
# ...add dynamic label: subtitle with current/next value of states variable
plot1 +
  labs(subtitle = "{closest_state}") + # add frame variable as subtitle
  transition_states(states = rating) 
[Animation: tran-states-label]
The set of available variables depends on the transition function. To get both the names and values of the frame variables available for any animation (per default the most recently rendered one), the frame_vars() function can be called.
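For instance, after rendering one of the animations above, a quick sketch to inspect the frame metadata might look like this (the available columns, e.g. closest_state, depend on the transition used):
# render an animation, then inspect its frame metadata
animate(plot1 + transition_states(states = rating))
frame_vars() # data frame with one row per frame, incl. e.g. a closest_state column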

Indicating previous data: shadow_*

To accentuate the interconnection of different frames, we can apply one of gganimate’s ‘shadows’. Per default, shadow_null(), i.e. no shadow, is added to animations. In general, shadows display data points of past frames in different ways: shadow_trail() creates a trail of evenly spaced data points, while shadow_mark() displays all raw data points. We’ll use shadow_wake() to create a little ‘wake’ of past data points which gradually shrink and fade away. The argument wake_length allows us to set the length of the wake relative to the total number of frames. Since the wakes overlap, the transparency of geoms might need adjustment. Obviously, for plots with lots of data points, shadows can impede intelligibility.
plot1B + # same as plot1, but with alpha = 0.1 in geom_jitter
  labs(subtitle = "{closest_state}") +  
  transition_states(states = rating) +
  shadow_wake(wake_length = 0.5) # adding shadow
[Animation: tran-states-shadow]
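The other shadow types can be swapped in just as easily. Below is a minimal sketch using shadow_trail() instead, again based on plot1B; the chosen distance is only an illustrative value:
plot1B + 
  labs(subtitle = "{closest_state}") +  
  transition_states(states = rating) +
  shadow_trail(distance = 0.1) # trail of evenly spaced past points, roughly every 10% of the animation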

The benefits of transition_*

While I simply love the visuals of animated plots, I think they also offer an actual improvement. I feel transition_states, compared to faceting, has the advantage of making it easier to track individual observations through transitions. Further, no matter how many subplots we want to explore, we neither need lots of space, cluttering our document with dozens of plots, nor do we have to put up with tiny plots. Similarly, e.g. transition_reveal holds additional value for time series by not only mapping a time variable on one of the axes but also to actual time: the transition length between the individual frames’ displays of actual input data corresponds to the actual relative time differences of the mapped events. To illustrate this, let’s take a quick look at the ‘success’ of all the speed dates across the different speed dating events:
## Static Plot
# ... date of event vs. interest in second date for women, men or couples
plot2 <- ggplot(data = df_match,
                aes(x = date, # date of speed dating event
                    y = count, # interest in 2nd date
                    color = info, # which group: women/men/reciprocal
                    group = info)) +
  geom_point(aes(group = seq_along(date)), # needed, otherwise the transition doesn't work
             size = 4, # size of points
             alpha = 0.7) + # slightly transparent
  geom_line(aes(lty = info), # line type according to group
            alpha = 0.6) + # slightly transparent
  labs(y = "Interest After Speed Date",
       x = "Date of Event",
       title = "Overall Interest in Second Date") +
  scale_linetype_manual(values = c("Men" = "solid", # assign line types to groups
                                   "Women" = "solid",
                                   "Reciprocal" = "dashed"),
                        guide = FALSE) + # no legend for linetypes
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # y-axis in %
  scale_color_manual(values = c("Men" = "#2A00B6", # assign colors to groups
                                "Women" = "#9B0E84",
                                "Reciprocal" = "#E94657"),
                     name = "") +
  theme_minimal() + # apply minimal theme
  theme(panel.grid = element_blank(), # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot
plot2 +
  transition_reveal(along = date) 
[Animation: trans-reveal]
Displayed are the percentages of women and men who were interested in a second date after each of their speed dates, as well as the percentage of couples in which both partners wanted to see each other again. Most of the time, women were more interested in second dates than men. Further, the attraction between dating partners often didn’t go both ways: the instances in which both partners of a couple wanted a second date were always far more infrequent than the general interest of either men or women. While it’s hard to identify the most romantic time of the year, according to the data there seemed to be a slack in romance in early autumn. Maybe everybody was still heartbroken over their summer fling? Fortunately, Valentine’s Day is in February.

Another very handy option is transition_filter(): it’s a great way to present selected key insights of your data exploration. Here the animation browses through data subsets defined by a series of filter conditions. It’s up to you which data subsets you want to stage. The data is filtered according to logical statements defined in transition_filter(); all rows for which a statement holds true are included in the respective subset. We can assign names to the logical expressions, which can be accessed as frame variables. If the keep argument is set to TRUE, the data of previous frames is permanently displayed in later frames.

I want to explore whether one’s own characteristics relate to the attributes one looks for in a partner. Do opposites attract? Or do birds of a feather (want to) flock together? Displayed below are the importances the speed dating participants assigned to different attributes of a potential partner. Contrasted are subsets of participants who were rated especially funny, attractive, sincere, intelligent or ambitious by their speed dating partners. The rating scale went from 1 = low to 10 = high, thus I assume values above 7 to be rather outstanding.
## Static Plot (without geom)
# ...importance ratings for different attributes
plot3 <- ggplot(data = df_ratings, 
                 aes(x = variable, # different attributes
                     y = own_rating, # importance regarding potential partner
                     size = own_rating, 
                     color = variable, # different attributes
                     fill = variable)) +
  geom_jitter(alpha = 0.3) +
  labs(x = "Attributes of Potential Partner", # x-axis label
       y = "Allocation of 100 Points (Importance)",  # y-axis label
       title = "Importance of Characteristics of Potential Partner", # title
       subtitle = "Subset of {closest_filter} Participants") + # dynamic subtitle 
  scale_color_viridis_d(option = "plasma", # use viridis scale for color 
                        begin = 0.05, # limit range of used hues
                        end = 0.97,
                        guide = FALSE) + # don't show legend
  scale_fill_viridis_d(option = "plasma", # use viridis scale for filling
                       begin = 0.05, # limit range of used hues
                       end = 0.97, 
                       guide = FALSE) + # don't show legend
  scale_size_continuous(guide = FALSE) + # don't show legend
  theme_minimal() + # apply minimal theme
  theme(panel.grid = element_blank(),  # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot 
# ...show ratings for different subsets of participants
plot3 +
  geom_jitter(alpha = 0.3) +
  transition_filter("More Attractive" = Attractive > 7, # adding named filter expressions
                    "Less Attractive" = Attractive <= 7,
                    "More Intelligent" = Intelligent > 7,
                    "Less Intelligent" = Intelligent <= 7,
                    "More Fun" = Fun > 7,
                    "Less Fun" = Fun <= 5) 
[Animation: trans-filter]
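To also illustrate the keep argument mentioned above, here is a minimal sketch in which the data of previous subsets remains permanently displayed in later frames (restricted to two filter conditions for brevity):
## Animated Plot 
# ...sketch: keep data of previous frames displayed
plot3 +
  geom_jitter(alpha = 0.3) +
  transition_filter("More Attractive" = Attractive > 7,
                    "Less Attractive" = Attractive <= 7,
                    keep = TRUE) # previous subsets stay visible in later frames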
Of course, the number of extraordinarily attractive, intelligent or funny participants is relatively low. Surprisingly, there seem to be few differences between what the average low- vs. high-scoring participants look for in a partner. Rather, the lower-scoring group includes more people with outlying expectations regarding certain characteristics. Individual tastes seem to vary more or less independently of individual characteristics.

Styling the (dis)appearance of data: enter_* / exit_*

Especially if the displayed subsets of data do not or only partially overlap, it can be favorable to underscore this visually. A good way to do this is provided by the enter_*() and exit_*() functions, which enable us to style the entry and exit of data points that do not persist between frames. There are many combinable options: data points can simply (dis)appear (the default), fade (enter_fade()/exit_fade()), grow or shrink (enter_grow()/exit_shrink()), gradually change their color (enter_recolor()/exit_recolor()), fly (enter_fly()/exit_fly()) or drift (enter_drift()/exit_drift()) in and out. We can use these stylistic devices to emphasize changes in the underlying data of different frames. I used exit_fade() to let data points that are no longer included gradually fade away while flying them out of the plot area on a vertical route (y_loc = 100); data points re-entering the sample fly in vertically from the bottom of the plot (y_loc = 0):
## Animated Plot 
# ...show ratings for different subsets of participants
plot3 +
  geom_jitter(alpha = 0.3) +
  transition_filter("More Attractive" = Attractive > 7, # adding named filter expressions
                    "Less Attractive" = Attractive <= 7,
                    "More Intelligent" = Intelligent > 7,
                    "Less Intelligent" = Intelligent <= 7,
                    "More Fun" = Fun > 7,
                    "Less Fun" = Fun <= 5) +
  enter_fly(y_loc = 0) + # entering data: fly in vertically from bottom
  exit_fly(y_loc = 100) + # exiting data: fly out vertically to top...
  exit_fade() # ...while color is fading
[Animation: trans-filter-exit-enter]

Finetuning and saving: animate() & anim_save()

Gladly, gganimate makes it very easy to finalize and save our animations. We can pass our finished gganimate object to animate() to, amongst other things, define the number of frames to be rendered (nframes), the rate of frames per second (fps) and/or the number of seconds the animation should last (duration). We also have the option to define the device in which the individual frames are rendered (the default is device = "png", but all popular devices are available). Further, we can define arguments that are passed on to the device, like e.g. width or height. Note that simply printing a gganimate object is equivalent to passing it to animate() with default arguments. If we plan to save our animation, the renderer argument is of importance: the function anim_save() lets us effortlessly save any gganimate object, but only if it was rendered using one of the functions magick_renderer() or the default gifski_renderer(). The function anim_save() is quite straightforward: we can define the filename and path (defaults to the current working directory) as well as the animation object (defaults to the most recently created animation).
# create a gganimate object
gg_animation <- plot3 +
  transition_filter("More Attractive" = Attractive > 7,
                    "Less Attractive" = Attractive <= 7) 

# adjust the animation settings 
animate(gg_animation, 
        width = 900, # 900px wide
        height = 600, # 600px high
        nframes = 200, # 200 frames
        fps = 10) # 10 frames per second

# save the last created animation to the current directory 
anim_save("my_animated_plot.gif")

Conclusion (and a Happy Valentine’s Day)

I hope this blog post gave you an idea of how to use gganimate to upgrade your own ggplots to beautiful and informative animations. I only scratched the surface of gganimate’s functionalities, so please do not mistake this post for an exhaustive description of the presented functions or the package. There is much out there for you to explore, so don’t wait any longer and get started with gganimate! But even more importantly: don’t wait on love. The speed dating data shows that most likely there’s someone out there looking for someone just like you. So from everyone here at STATWORX: Happy Valentine’s Day!
[Animation: heart gif]
## 8 bit heart animation
animation2 <- ggplot(data = df_eight_bit_heart %>% # includes color and x/y position of pixels 
         dplyr::mutate(id = row_number()), # create row number as ID  
                aes(x = x, 
                    y = y,
                    color = color,
                    group = id)) +
  geom_point(size = 18, # depends on height & width of animation
             shape = 15) + # square
  scale_color_manual(values = c("black" = "black", # map values of color to actual colors
                                "red" = "firebrick2",
                                "dark red" = "firebrick",
                                "white" = "white"),
                     guide = FALSE) + # do not include legend
  theme_void() + # remove everything but geom from plot
  transition_states(-y, # reveal from high to low y values 
                    state_length = 0) +
  shadow_mark() + # keep all past data points
  enter_grow() + # new data grows 
  enter_fade() # new data starts without color

animate(animation2, 
        width = 250, # depends on size defined in geom_point 
        height = 250, # depends on size defined in geom_point 
        end_pause = 15) # pause at end of animation
   