Add the R language to your analyst toolbox today!

Recently, having some time on my hands, I stumbled on Kaggle, a site for data scientists and other enthusiast programmers who want to test their skills in building the best machine learning models. At first it seemed to me like something for only a small community of people. However, as it turns out, most machine learning models are already implemented (in R, Python, etc.), so even beginner programmers/analysts can try their luck and skills at building the best models.

Personally, this was an incentive for me to learn R – a free statistical language for data exploration/mining/analysis. R is a great and fairly simple language to learn, allowing you to load, explore, visualize and manipulate data very easily. I won’t go into detailed examples, as there are already plenty of tutorials and examples out there. I just wanted to share a few highlights of why every data analyst (not only statisticians) should have R in their toolbox:

Why use R?

…apart from some of the reasons mentioned by Inside-R, I wanted to list the practical features that are particularly important to me compared to, e.g., Excel:

  • Concise – most code in other languages will usually translate to a lot less code in R. R also resembles JavaScript in that functions are ordinary values you can pass to other functions, which is a good thing!
  • Easy and quick manipulation of various data structures – apply transformation functions to rows, columns, cells. Doing this in Excel is much more cumbersome and usually less efficient
  • Easy data file loading from various sources, e.g. CSV, Excel files – a CSV can be loaded in just one line of code!
  • Easy parallelism! – one of the drawbacks of Excel is that its VBA macros run on a single thread. In R, with packages such as foreach and doParallel, the iterations of a “for”-style loop can run concurrently
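
To make the point about rows, columns and cells a bit more concrete, here is a minimal sketch using only base R. The sales data frame and its column names are made up for illustration; with real data you would more likely start from read.csv.

```r
# Hypothetical sample data, invented purely for illustration
sales <- data.frame(
  region = c("North", "South", "East"),
  q1 = c(100, 150, 120),
  q2 = c(110, 140, 130)
)

# Column-wise: apply mean() to each numeric column
col_means <- sapply(sales[, c("q1", "q2")], mean)

# Row-wise: sum the quarters of every row into a new column
sales$total <- rowSums(sales[, c("q1", "q2")])

# Cell-wise: vectorised arithmetic works on whole columns at once
sales$q2_growth_pct <- round((sales$q2 - sales$q1) / sales$q1 * 100, 1)

print(col_means)
print(sales)
```

Each transformation is a single line, with no formulas to drag down a sheet.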

I also wanted to share some real-world examples where R could come in handy:

Example 1: Merging 2 CSV files into 1

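# Read both semicolon-separated files into data frames
csv1 <- read.csv(file = "testcsv1.csv", sep = ";")
csv2 <- read.csv(file = "testcsv2.csv", sep = ";")
# Stack the rows (both files must have the same columns)
csvRes <- rbind(csv1, csv2)
# Write the merged result without row numbers
write.table(csvRes, file = "resCsv.csv", sep = ";", row.names = FALSE)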

Wow, 4 lines of code! How cool is that? But why not load the files in parallel?

Example 2: Concurrency

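# Load the packages (install them first with install.packages if needed)
library(foreach)
library(doParallel)
# Start 4 worker processes and register them as the backend for %dopar%
cl <- makeCluster(4)
registerDoParallel(cl)
# Read each file on a worker and stack the results with rbind
csvRes <- foreach(i = 1:2, .combine = rbind) %dopar% {
   read.csv(file = paste0("testcsv", i, ".csv"), sep = ";")
}
write.table(csvRes, file = "resCsv.csv", sep = ";", row.names = FALSE)
# Shut the workers down when done
stopCluster(cl)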

What? That’s it? Yes! Of course, this example is very simplified – usually you will want to do some additional data processing/transformation. Imagine, however, having to load/process/merge hundreds of files in Excel. As in this example, R will spread the work over 4 parallel workers, or however many you specify (with makeCluster these are separate R processes rather than threads).
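
For the "hundreds of files" case you would not want to type the file names by hand. Below is a minimal sequential sketch, assuming all files sit in one folder and share the same columns. The csv_demo folder, the part1.csv … part3.csv names and the sample data are made up so that the snippet runs on its own.

```r
# Create three small CSV files in a temporary folder (hypothetical demo data)
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  part <- data.frame(id = c(2 * i - 1, 2 * i), source = paste0("file", i))
  write.table(part, file = file.path(demo_dir, paste0("part", i, ".csv")),
              sep = ";", row.names = FALSE)
}

# Find every matching file, read each one, then stack the results
files <- list.files(demo_dir, pattern = "^part[0-9]+\\.csv$", full.names = TRUE)
merged <- do.call(rbind, lapply(files, read.csv, sep = ";"))

print(merged)
```

Swapping the lapply call for the foreach … %dopar% loop from Example 2 should spread the reads across workers in the same way.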

I hope these short examples have encouraged you to add R to your toolbox!
