Data munging with modern R tools

1st May 2017

What is data munging?

Data munging or data wrangling:

from raw data to cleaned data
from experiment to statistical result

Can involve:

reading from databases, text files and websites, and combining multiple sources
enriching data
error correction
cleaning
separating variables
recoding
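
Two of the steps above, separating variables and recoding, can be sketched with tidyr and dplyr. This is a minimal illustration, not part of the original demonstration: the `patients` data frame, its columns and the codes in it are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical raw data: sex and age group are packed into one column
patients <- tibble(
  id    = 1:3,
  group = c("m_18-30", "f_31-50", "f_18-30")
)

cleaned <- patients %>%
  # separating variables: split "group" into two columns at the underscore
  separate(group, into = c("sex", "age_group"), sep = "_") %>%
  # recoding: replace the one-letter codes with readable labels
  mutate(sex = recode(sex, m = "male", f = "female"))

cleaned
```

After these two steps each variable (id, sex, age_group) sits in its own column, which is what most modelling functions expect.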

Common problems

How projects go sideways in a standard statistics course:

inability to scrape data off the web
inability to get data from an API
inability to parse a JSON or XML file
utter defeat by date times
text encoding fiascos
ineptitude with regular expressions
R scripts that consume infinite time and RAM
software installations going wrong

from https://speakerdeck.com/jennybc/teach-data-science-and-they-will-come
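
Two of the problems listed above, date-times and regular expressions, can be handled in base R along these lines. This is only a sketch under stated assumptions: the `raw` vector is invented, and it assumes the timestamps come in exactly two known formats.

```r
# Hypothetical messy input: timestamps written in two different formats
raw <- c("2017-05-01 14:30", "01/05/2017 09:15")

# Parse with each known format; elements that do not match a format become NA
iso <- as.POSIXct(raw, format = "%Y-%m-%d %H:%M", tz = "UTC")
dmy <- as.POSIXct(raw, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Fill the gaps left by the first format with results from the second
parsed <- iso
parsed[is.na(parsed)] <- dmy[is.na(parsed)]

# Regular expression: pull the four-digit year out of each string
years <- regmatches(raw, regexpr("[0-9]{4}", raw))

format(parsed, "%Y-%m-%d %H:%M")
years
```

Parsing each format separately and then combining the results keeps failures visible: any value that is still NA after both attempts points to a format that was not anticipated.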

Inability to work with messy or uncleaned data

Change of focus in coursework?

Should we focus more on tidying data and data wrangling?

You can't do an analysis when you can't load or use the data.

Data Transforming

Very uncomfortable for students, so much so that we often just do it for them.

Tidy data:

Make your data rectangular.
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
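
The three rules above can be illustrated with a small reshaping example. This is a sketch under assumptions: the `untidy` data frame is made up, and `pivot_longer()` needs tidyr 1.0.0 or later (older tidyr versions offer `gather()` for the same job).

```r
library(dplyr)
library(tidyr)

# Hypothetical untidy data: the variable "year" is hidden in the column names
untidy <- tibble(
  country = c("A", "B"),
  `2015`  = c(10, 20),
  `2016`  = c(12, 25)
)

# Move the year columns into one "year" column and their values into "cases"
tidy <- untidy %>%
  pivot_longer(
    cols      = c(`2015`, `2016`),
    names_to  = "year",
    values_to = "cases"
  )

tidy
```

In the result every row is one observation (a country in a year) and every column is one variable: country, year and cases.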

Tidyverse

Packages in the tidyverse

Pipes

A poem

  "Little bunny Foo Foo
  Hopping through the forest
  Scooping up the field mice
  And bopping them on the head"

Base R version

# Option 1: nested calls, read from the inside out
foo_foo <- little_bunny()
bop_on( scoop_up( hop_through(foo_foo, forest), field_mouse ), head )

# Option 2: one intermediate object per step
foo_foo <- little_bunny()
hopped <- hop_through(foo_foo, forest)
scooped <- scoop_up( hopped, field_mouse )
result <- bop_on(scooped, head)
result

Pipe version

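# Same functions as in the base R version; the pipe passes the
# left-hand result in as the first argument of the next call
foo_foo <- little_bunny()
foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)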
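
The poem functions above do not exist in any package, so the snippets cannot be run as shown. A runnable sketch with toy stand-ins shows that the nested and the piped form compute the same thing. The function bodies below are hypothetical, and magrittr is assumed to be installed because it provides `%>%`.

```r
library(magrittr)  # provides %>%

# Toy stand-ins (hypothetical): each step appends a line to a character vector
little_bunny <- function() "Little bunny Foo Foo"
hop_through  <- function(x, place) c(x, paste("hops through the", place))
scoop_up     <- function(x, what)  c(x, paste("scoops up the", what))
bop_on       <- function(x, where) c(x, paste("bops them on the", where))

# Nested version: read from the inside out
nested <- bop_on(scoop_up(hop_through(little_bunny(), "forest"), "field mice"), "head")

# Piped version: read from top to bottom
piped <- little_bunny() %>%
  hop_through("forest") %>%
  scoop_up("field mice") %>%
  bop_on("head")

identical(nested, piped)  # TRUE: the pipe only changes how the code reads
```

The pipe inserts the left-hand value as the first argument of the next call, which is why functions that take the data as their first argument chain so easily.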

Demonstration!

Conclusion

if possible, make rectangular 'tidy' data
use pipes to make a sequence of steps
keep within the tidyverse

Final Thoughts

Replace "data scientist" with "statistician" in the following quote:

"Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

New York Times, 2014-08-18

Most master's students will go on to work as statisticians or data scientists in industry, where data cleaning can be up to 80% of the work.

Find this presentation on GitHub at https://github.com/rmhogervorst/datawrangling (look at Tijn's face now)

Links

STAT545 course online

Wikipedia page about data wrangling

Video by Jenny Bryan about putting lists and other types of data into a data frame (slides from that presentation are also online)

Book: R for Data Science by Hadley Wickham and Garrett Grolemund (online)

Chapter from that book about tidy data

Tidy data - Wickham 2013

Code-heavy explanation of the tidy data article