As mentioned in my previous post, I started playing with Kaggle competitions, which pushed me into learning R. What I want to emphasize is that you do not have to be a data scientist to enter a Kaggle competition. Some programming and analytical skills are, however, more than welcome. As far as I can see, Kaggle competitors mostly use R and Python; both are worth learning and adding to your toolbox.
In this post I want to share a few short but important highlights on how to quickly start Kaggling using R.
Some general tips below:
Regression vs Classification
Kaggle competitions usually tackle one of two kinds of problem: classification (predicting a category) or regression (predicting a number). Each kind has its own algorithms, and there are plenty of libraries that implement them. Python has its scikit-learn toolkit and R has its vast CRAN package repository. So you really do not have to reinvent the wheel! I generally suggest that beginners start with R.
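To make the difference concrete, here is a minimal sketch in base R. It uses the built-in `mtcars` dataset as a stand-in for competition data; the choice of columns and models (`lm` and `glm`) is my own assumption for illustration, not something a particular competition prescribes.

```r
# Regression: the target is a number (miles per gallon).
reg_fit  <- lm(mpg ~ wt + hp, data = mtcars)
reg_pred <- predict(reg_fit, newdata = mtcars)
rmse     <- sqrt(mean((mtcars$mpg - reg_pred)^2))  # root mean squared error

# Classification: the target is a category (gearbox type).
cars <- mtcars
cars$gearbox <- factor(cars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))
clf_fit  <- glm(gearbox ~ wt + hp, data = cars, family = binomial)

# predict() returns the probability of the second factor level ("manual").
prob     <- predict(clf_fit, newdata = cars, type = "response")
clf_pred <- ifelse(prob > 0.5, "manual", "automatic")
accuracy <- mean(clf_pred == cars$gearbox)  # share of correct labels

print(rmse)
print(accuracy)
```

Both scores here are computed on the same rows the models were fitted on, so they are optimistic; they only show what kind of output each problem type produces.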
Training and testing data
Be prepared to work with two types of dataset: training and testing data. The training data is for teaching your model to understand the data. The testing data is for checking how good your model is at making predictions. Kaggle expects you to make predictions for the testing data and submit them; it then evaluates how well you did and ranks you on the leaderboard.
A strategy that I have adopted, and can suggest as a starting point, can be summarized in the following steps using R:
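The train/predict/submit cycle could look roughly like the sketch below. It is only an illustration: `mtcars` stands in for the competition files (in practice you would load them with `read.csv()`), and the `id` and `mpg` column names, the hold-out size, and the `submission.csv` file name are assumptions, since every competition defines its own format. Holding back part of the training data is one common way to estimate your score before submitting.

```r
set.seed(42)  # make the random split reproducible

# Stand-ins for the competition's training and testing files.
rows  <- sample(nrow(mtcars))
train <- mtcars[rows[1:24], ]
test  <- mtcars[rows[25:32], setdiff(names(mtcars), "mpg")]  # no target column

# Hold back part of the training data to estimate the score locally.
holdout_idx  <- sample(nrow(train), 6)
fit          <- lm(mpg ~ wt + hp, data = train[-holdout_idx, ])
holdout_pred <- predict(fit, newdata = train[holdout_idx, ])
holdout_rmse <- sqrt(mean((train$mpg[holdout_idx] - holdout_pred)^2))
print(holdout_rmse)

# Refit on all training rows, predict the test set, write the submission.
final_fit  <- lm(mpg ~ wt + hp, data = train)
submission <- data.frame(id  = rownames(test),
                         mpg = predict(final_fit, newdata = test))
submission_path <- file.path(tempdir(), "submission.csv")
write.csv(submission, submission_path, row.names = FALSE)
```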
- Explore the data – plot, explore, cleanse
- Select some models that might fit this problem – here is a valuable list to look through
- Use the R caret package to test which model works best – select an evaluation metric, e.g. Kappa for classification or RMSE for regression. Look into resampling techniques for evaluation such as cross-validation (CV) and bootstrapping.
- Tweak it! – found your best model? Play with its parameters and with smoothing/centering/scaling the data. Find what works best for you.
This strategy generally gets me into the top 30% of the leaderboard on the first day :).
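The steps above can be sketched with caret as follows. This is a minimal example under my own assumptions: the built-in `iris` dataset stands in for competition data, and the two candidate models (`rpart` and `knn`), the 5-fold setting, and the `k` grid are arbitrary picks for illustration. It assumes the caret package is installed.

```r
library(caret)  # assumes install.packages("caret") has been run

# Step 1: explore the data (summary statistics and missing values).
summary(iris)
colSums(is.na(iris))

# 5-fold cross-validation; method = "boot" would use bootstrapping instead.
ctrl <- trainControl(method = "cv", number = 5)

# Steps 2-3: fit candidate models and evaluate them with Kappa.
# For a regression target you would use metric = "RMSE" instead.
set.seed(42)  # same seed before each call so both models see the same folds
tree_fit <- train(Species ~ ., data = iris, method = "rpart",
                  metric = "Kappa", trControl = ctrl)

set.seed(42)
knn_fit <- train(Species ~ ., data = iris, method = "knn",
                 metric = "Kappa", trControl = ctrl,
                 # Step 4: tweak preprocessing and tuning parameters.
                 preProcess = c("center", "scale"),
                 tuneGrid = data.frame(k = c(3, 5, 7, 9)))

# Compare the cross-validated Accuracy and Kappa of both models.
results <- resamples(list(tree = tree_fit, knn = knn_fit))
summary(results)

# The tuning value that scored best for k-nearest neighbours.
knn_fit$bestTune
```

Swapping in other models is mostly a matter of changing the `method` argument, which is why caret is convenient for the "which model works best" step.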