Start your Kaggle journey with R

As mentioned in my previous post, I started playing with Kaggle competitions, which pushed me into learning R. What I want to emphasize is that you do not have to be a data scientist to start competing on Kaggle. Some programming and analytical skills are, however, more than welcome. As far as I can see, Kaggle competitors mostly use R and Python; both are worth learning and adding to your toolbox.

Here I want to share a few short but important highlights on how to quickly start Kaggling using R.

Some general tips below:

Regression vs Classification

Kaggle competitions usually tackle one of two kinds of problem: classification (predicting a category) or regression (predicting a number). Each kind has its own algorithms, for which there are plenty of libraries available. Python has its scikit-learn toolkit and R has its vast CRAN package repository. So you really do not have to reinvent the wheel! I generally suggest using R as a beginner.
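To make the difference concrete, here is a minimal sketch in base R. The built-in mtcars dataset and the choice of lm and glm are my own stand-ins for illustration, not something a particular competition prescribes.

```r
# Regression: the target is a number (miles per gallon).
reg_fit  <- lm(mpg ~ wt + hp, data = mtcars)
reg_pred <- predict(reg_fit, newdata = mtcars)
rmse     <- sqrt(mean((mtcars$mpg - reg_pred)^2))

# Classification: the target is a category
# (transmission type: 0 = automatic, 1 = manual).
cls_fit  <- glm(am ~ wt, data = mtcars, family = binomial)
cls_prob <- predict(cls_fit, newdata = mtcars, type = "response")
cls_pred <- ifelse(cls_prob > 0.5, 1, 0)
accuracy <- mean(cls_pred == mtcars$am)

cat("Regression RMSE:", round(rmse, 2), "\n")
cat("Classification accuracy:", round(accuracy, 2), "\n")
```

Both scores here are computed on the same rows the models were fitted on, so they are optimistic; they only show which kind of output and metric goes with which kind of problem.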

Training and testing data

Usually, be prepared to have two types of datasets: training and testing data. Training data is for teaching your model to understand the data. Testing data is for checking how good your model is at making predictions. Kaggle expects you to make predictions on the testing data and submit them; your submission is then scored and ranked on the leaderboard.
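A rough sketch of that round trip in base R follows. It fakes the two files by splitting the built-in mtcars data; in a real competition you would load the provided files with read.csv() instead. The column names Id and Prediction and the output file name are hypothetical, since every competition defines its own submission format.

```r
set.seed(42)  # make the random split reproducible

# Stand-in for Kaggle's two files: 24 rows to train on, 8 rows to predict.
rows  <- sample(nrow(mtcars), size = 24)
train <- mtcars[rows, ]
# Like a real Kaggle test set, this one has no target column (mpg).
test  <- mtcars[-rows, setdiff(names(mtcars), "mpg")]

# Teach the model on the training data only.
fit <- lm(mpg ~ wt + hp, data = train)

# Predict the unknown targets and build the submission table.
submission <- data.frame(
  Id         = rownames(test),                    # hypothetical column name
  Prediction = unname(predict(fit, newdata = test))  # hypothetical column name
)

# Write the file you would upload (to a temp folder in this sketch).
out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE)
```

Because the real testing data comes without answers, it is common to also hold back part of the training data to check your model locally before submitting.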

A strategy that I have adopted, and I can suggest as a starting point, can be summarized in the following steps using R:

  1. Explore the data – plot, explore, cleanse
  2. Select some models that might fit this problem – here is a valuable list to look through
  3. Use the R caret library to test which model works best – select a metric for evaluation, e.g. Kappa for classification or RMSE for regression. Check evaluation tools such as cross-validation (CV), bootstrapping etc.
  4. Tweak it! – found your best model? Play with the parameters and with smoothing/centering/scaling the data. Find what works best for you

This strategy generally gets me into the top 30% of the leaderboard on the first day :).
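The four steps above can be sketched with caret roughly as follows. This is an illustration under my own assumptions rather than a recipe: the built-in iris data stands in for a competition dataset, and k-nearest neighbours and a decision tree are arbitrary candidate models. It needs the caret package (install.packages("caret")).

```r
library(caret)

set.seed(42)  # reproducible folds

# Step 1: explore the data (a real dataset needs plots and cleansing too).
summary(iris)

# Step 3 setup: 5-fold cross-validation, with the same folds for every
# model so that their scores are directly comparable.
folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", number = 5, index = folds)

# Step 2: a couple of candidate models, scored by Kappa.
fit_knn <- train(Species ~ ., data = iris, method = "knn",
                 metric = "Kappa", trControl = ctrl)
fit_tree <- train(Species ~ ., data = iris, method = "rpart",
                  metric = "Kappa", trControl = ctrl)

# Step 3: compare the cross-validated results side by side.
results <- resamples(list(knn = fit_knn, tree = fit_tree))
summary(results)

# Step 4: tweak the better candidate, here by centering/scaling the
# inputs and trying a wider range of neighbour counts.
fit_knn_tuned <- train(Species ~ ., data = iris, method = "knn",
                       metric = "Kappa", trControl = ctrl,
                       preProcess = c("center", "scale"),
                       tuneGrid = expand.grid(k = c(3, 5, 7, 9, 11)))
print(fit_knn_tuned$bestTune)
```

For a regression problem the same skeleton applies with a numeric target and metric = "RMSE"; switching trainControl to method = "boot" gives bootstrapping instead of cross-validation.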
