Recently I wanted to download all the transcripts of a podcast (600+ episodes). The transcripts are simple txt files, so in a way I am not even ‘web’-scraping but just reading in 600 or so text files, which is not really a big deal. Or so I thought.

This post shows you where I went wrong.

Also here is a picture I found of scraping.

### Web scraping in general

For every download you ask the server for a file and it returns the file (this is also how you normally browse the web btw, your browser requests the pages).

In general it is nice if you ask permission (I did, on Twitter, and the author was really nice! I recommend it!) and don’t push the website to its limit. The servers where these files are hosted are quite beefy and I will probably not even make a dent in them when I’m downloading these files. But still, be gentle.

No really, be a responsible scraper: tell the website owners you are scraping (in person or by identifying yourself in the header) and check whether it is allowed.

I recently witnessed a demo where someone explained a lot of dirty tricks on how to get over those pesky servers denying them access and generally ignoring good practices and it made me sick…

Here are some general guides:

There are multiple ways I could do this downloading. If I had used rvest to scrape a website, I would have set a user-agent header [1] and used incremental backoff: when the server refuses a connection you wait and retry; if it still refuses, you wait twice as long and retry again, and so on (doubling the wait each time is usually called exponential backoff).

However, since these are txt files I can just use read_lines [2] to read the txt file of a transcript and do further work downstream.
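The backoff idea can be sketched in base R. This is only an illustration and not code I used for this post: `retry_with_backoff`, `max_tries`, `base_wait` and the `flaky` stand-in function are all made-up names, and the sketch covers only the waiting and retrying, not the user-agent header.

```r
# Retry f() until it succeeds, doubling the wait after every failure.
# All names here are illustrative assumptions, not part of any package.
retry_with_backoff <- function(f, max_tries = 5, base_wait = 1) {
  wait <- base_wait
  for (attempt in seq_len(max_tries)) {
    # Turn an error into a value so we can inspect it instead of stopping
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt == max_tries) {
      stop("giving up after ", max_tries, " attempts: ", conditionMessage(result))
    }
    message("attempt ", attempt, " failed, waiting ", wait, " seconds")
    Sys.sleep(wait)
    wait <- wait * 2  # wait twice as long next time
  }
}

# A stand-in for a request that is refused twice and then succeeds
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("server refused the connection")
  "transcript text"
}

result <- retry_with_backoff(flaky, base_wait = 0.1)
```

In real use the function passed in would be the actual request, for example a small wrapper around the call that reads one transcript.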

#### A first, failing approach, tidy but wrong

This was my first approach:

• all episodes are numbered and the transcript files are sequential too, so just a paste0 of "https://first-part-of-link", number, ".txt" would work.
• put all links as rows into a dataframe
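
library(tidyverse)

latest_episode <- 636

# "https://first-part-of-link" is a placeholder for the real base URL;
# the column names and the read_lines step are assumed from the description above
system.time(
  transcripts <- tibble(
    link = paste0("https://first-part-of-link",
                  formatC(1:latest_episode, width = 3, flag = "0"), ".txt")) %>%
    mutate(transcript = map(link, read_lines))
)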

This failed.

Some episodes don’t exist or have no transcript (I didn’t know). Sometimes the internet connection didn’t want to work and just threw me out. Sometimes the server stopped my requests.

On each of those occasions the process would stop with an informative error [3]. But because the whole R process stopped, I had no end result.

#### Getting more information to my eyeballs and pausing in between requests

Also, I didn’t know where it failed. So I created a new function that also sometimes waited (to not overwhelm the server):

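## to see where we are, this function wraps read_lines and prints the episode number
## (read_lines_and_print is a placeholder name)
read_lines_and_print <- function(file){
  print(file)
  # runif(1, 0, 1) > 0.008 is TRUE about 99% of the time, so this pauses before almost every read
  if(runif(1, 0, 1) > 0.008) Sys.sleep(5)
  read_lines(file)
}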

This one also failed, but more informatively: I now knew on which episode it failed.

Also I wanted to let the logs show that I was the one doing the scraping and how to reach me if I was overwhelming the service.

Enter curl. Curl is a library that helps you download stuff; it is used by the httr package and is a wrapper around the C library with the same name (libcurl), wrapped by Jeroen ‘c-plus-plus’ Ooms.

Since I ran this function a few times I had already downloaded some of the files and didn’t really want to download every file again, so I also added a check to see if the file was already downloaded [4]. And I wanted it to print to the screen, because I like moving text over the screen when I’m debugging.

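library(curl)

download_file <- function(file){
  filename <- basename(file)
  if(file.exists(paste0("data/", filename))){
    print(paste("file exists: ", filename))
  }else{
    print(paste("downloading: ", filename))
    h <- new_handle(failonerror = FALSE)
    h <- handle_setheaders(h, "User-Agent" = "scraper by RM Hogervorst, @rmhoge, gh: rmhogervorst")
    # assumed download step: save the file under data/ using the handle with our header
    curl_download(url = file, destfile = paste0("data/", filename), handle = h)
    Sys.sleep(sample(seq(0, 2, 0.5), 1)) # copied this from Bob Rudis (@hrbrmstr)
  }
}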

I set the header (I think…) and I tell curl not to worry if it fails, we all need reassurance sometimes, but just to continue.

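library(purrr)

# we choose walk here, because we don't expect output (we do get prints)
# "https://first-part-of-link" is a placeholder for the real base URL
latest_episode <- 636
walk(paste0("https://first-part-of-link",
            formatC(1:latest_episode, width = 3, flag = "0"), ".txt"), download_file)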

## Conclusion

So in general, don’t be a dick, ask permission and take it easy.

The final download approach works great! And it doesn’t matter if you stop it halfway. In a future post you will see why I wanted all of these files.

I thought this would be the easy step. Will the rest be even harder? Tune in next time!

#### Cool things that I could have done:

• use purrr::safely? I think it will continue to work after a failure then?
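
A small sketch of how that could look. I did not use this for the actual download, and `fake_download` and the file names are made up so the example runs without touching any server. `safely()` wraps a function so that it returns a list with a `result` and an `error` element instead of stopping.

```r
library(purrr)

# Stand-in for a download function that fails on some inputs.
# fake_download and the file names are hypothetical.
fake_download <- function(file) {
  if (file == "002.txt") stop("no transcript for this episode")
  paste("contents of", file)
}

# safely() returns a new function that never throws an error
safe_download <- safely(fake_download)

files <- c("001.txt", "002.txt", "003.txt")
results <- map(files, safe_download)

# Each element has `result` and `error`; exactly one of the two is NULL
failed <- files[map_lgl(results, function(r) !is.null(r$error))]
failed
```

The loop runs over all three files even though the second one fails, and `failed` tells you afterwards which ones to look at again.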