Dplyr

Sentiment of Security Now! over time

Posted on May 29, 2018 | 7 minutes (1332 words)

If you believe some people, everything is getting worse, and even more so in infosec. Over the past few years I have listened to many, many hours of podcasts, and many of those hours were spent on the weekly show Security Now!. The hosts, Steve Gibson and Leo Laporte, have been talking about security-related news every week for over 13 years. The content has changed over time (there used to be more explanations, and most of the time is now filled with news), but we could still use the sentiment in the episodes to see if ‘everything is getting worse’. [Read More]

sentiment security-now tutorial tidytext tidyverse dplyr ggplot2 intermediate widyr

Cleaning up and combining data, a dataset for practice

Posted on March 12, 2018 | 3 minutes (564 words)

tldr: I created an open dataset for the explicit practice of data munging. Feel free to use it in assignments, but do mention where you got it from (CC-BY-4.0). Also, unicorns are awesome. Find the dataset at: https://github.com/RMHogervorst/unicorns_on_unicycles Data munging / cleaning / engineering At work I was working with two Excel files that were slightly different but could be combined into one dataset. This is very typical of the day-to-day cleaning operations that analysts and data scientists do (statisticians too). [Read More]

dirty data munging dplyr readxl unicorns unicycles exercise

add abbreviations to your rmarkdown doc

Posted on January 24, 2018 | 2 minutes (238 words)

Today a small tip for when you write rmarkdown documents: add a chunk on top with abbreviations. In the first chunks you set the options and load the packages. Next, create the abbreviations. You don’t have to care about the ordering; just put them down as you realize you are creating them. The first step makes a dataframe (a tibble, rowwise), and the second step orders them. tribble( ~Abbreviation, ~ Explanation, "CIA", "Central Intelligence Agency", "dplyr", "data. [Read More]
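The two steps described in this excerpt could look roughly like the sketch below. It assumes the tibble and dplyr packages are installed. Apart from the CIA row, the entries are illustrative placeholders and are not taken from the original post.

```r
library(tibble)
library(dplyr)

# Step 1: write the abbreviations down row by row, in whatever order
# they come to mind. The NSE and CRAN rows are made-up examples.
abbreviations <- tribble(
  ~Abbreviation, ~Explanation,
  "NSE",  "Non-standard evaluation",
  "CIA",  "Central Intelligence Agency",
  "CRAN", "Comprehensive R Archive Network"
)

# Step 2: order the rows alphabetically by abbreviation.
abbreviations <- arrange(abbreviations, Abbreviation)
abbreviations
```

In an rmarkdown document the ordered tibble could then be passed to a table function such as knitr::kable() to print it as a list of abbreviations.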

rmarkdown tibble dplyr

Where to live in the Netherlands based on temperature XKCD style

Posted on November 20, 2017 | 5 minutes (1037 words)

After seeing a plot of the best places to live in Spain and the USA based on the weather, I had to chime in and do the same thing for the Netherlands. The idea is simple: determine where you want to live based on your temperature preferences. First the end result: How to read this plot? In this xkcd comic we see that the top left of the graph represents “if you hate cold and hate heat”; if you go down from the top left to the bottom left, the winters get colder, ending in “if you love cold and hate heat”. [Read More]

XKCD weather humidex dplyr ggplot2 readr Netherlands

Generate text using Markov Chains (sort of)

Posted on January 21, 2017 | 6 minutes (1256 words)

Inspired by the hilarious podcast The Greatest Generation, I have again worked with all the lines from all the episode scripts of TNG. Today I will make a clunky bot (although it does nothing and is absolutely not useful) that talks like Captain Picard. I actually wanted to use a Markov Chain to generate text. A Markov Chain has a specific property: it doesn’t care what happened before, it only looks at the probabilities of moving from the current state to a next state. [Read More]
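The Markov property mentioned in this excerpt can be illustrated with a small word-level sketch in base R. This is not the code from the post: the three-line corpus, the variable names, and the helper functions next_word() and generate() are all made up for illustration, and the post itself works with the TNG scripts.

```r
# Hypothetical toy corpus standing in for the script lines.
corpus <- c("make it so number one", "make it work", "number one make it so")

# Build (current word, next word) pairs within each line.
pairs <- do.call(rbind, lapply(strsplit(corpus, " "), function(w) {
  data.frame(current = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Count the transitions and turn each row into probabilities.
counts <- table(pairs$current, pairs$nxt)
probs <- prop.table(counts, margin = 1)

# Pick the next word using only the current word (the Markov property).
next_word <- function(word) {
  p <- probs[word, ]
  sample(names(p), size = 1, prob = as.numeric(p))
}

# Chain up to n words together, stopping at a word with no known successor.
generate <- function(start, n = 5) {
  out <- start
  for (i in seq_len(n)) {
    last <- out[length(out)]
    if (!last %in% rownames(probs)) break
    out <- c(out, next_word(last))
  }
  paste(out, collapse = " ")
}

set.seed(1)
generate("make")
```

Each step looks only at the last word generated, so the output can drift between lines of the corpus without regard to what came earlier in the sentence.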

Markov TNG dplyr tidytext bot

Non-standard-evaluation and standard evaluation in dplyr

Posted on June 13, 2016 | 5 minutes (1060 words)

THIS POST IS NO LONGER ENTIRELY RELEVANT. DPLYR 0.7 HAS A SLIGHTLY DIFFERENT (AND SLIGHTLY MORE INTUITIVE) WAY OF WORKING WITH NON-STANDARD EVALUATION. I love the dplyr package with all of its functions. However, if you use normal dplyr in functions in your package, R CMD check will give you a NOTE: No visible binding for global variable NAME OF YOUR VARIABLE. The functions do work, and everything is normal, but if you submit your package to CRAN, such a NOTE is not acceptable. [Read More]

advanced dplyr NSE optimize-your-code duo2015 lazyeval reminder

From spss to R, part 4

Posted on April 4, 2016 | 15 minutes (3074 words)

This is the second part of working with ggplot. We will combine the packages dplyr and ggplot to improve our workflow. When you make a visualisation you often experiment with different versions of your plot. Our workflow will be dynamic: instead of saving every version of the plot you created, we will recreate the plot until it looks the way you want it. In the previous lesson we worked with some built-in datasets. [Read More]

beginner dplyr ggplot2 spps-to-r tutorial

Tidying your data

Posted on February 24, 2016 | 4 minutes (808 words)

Introduction To make analyses work we often need to change the way files look. Sometimes information is recorded in a way that was very efficient for input but not workable for your analyses. In other words, the data is messy and we need to make it tidy. Tidy data means: each variable forms a column, each observation forms a row, and each type of observational unit forms a table. [Read More]
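The tidy data rules in this excerpt can be made concrete with a small sketch. It assumes tidyr 1.0.0 or later is installed for pivot_longer(); the original post predates that function and may well use a different one. The data frame and its column names are invented for illustration.

```r
library(tidyr)

# Hypothetical messy data: one column per year, so the variable "year"
# is hidden in the column names.
messy <- data.frame(
  school = c("A", "B"),
  y2015 = c(120, 80),
  y2016 = c(130, 75)
)

# Tidy version: one row per school-year observation,
# one column per variable (school, year, students).
tidy <- pivot_longer(
  messy,
  cols = c(y2015, y2016),
  names_to = "year",
  names_prefix = "y",
  values_to = "students"
)
tidy
```

The wide layout is convenient for data entry, while the long layout gives every variable its own column, which is the shape that dplyr and ggplot2 functions generally expect.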

beginner dplyr tidyr duo2015 tutorial

From spss to R, part 2

Posted on February 22, 2016 | 9 minutes (1881 words)

Introduction In this lesson we will open a .sav file in RStudio and manipulate the data.frame. We will select parts of the file and create some simple overviews. First time with R? No problem, see lesson 1. Download a .sav (SPSS) file I downloaded the following dataset from DUO (Dienst uitvoering onderwijs): Aantal wo ingeschrevenen (binnen domein ho). This dataset has a CC0 declaration, which means it is in the public domain and we can do anything we want with this file. [Read More]

beginner haven dplyr spps-to-r duo2015 tutorial