Tom. Excel / VBA / C# enthusiast and hobbist. Collecting and sharing my knowledge and experience with beginner/advanced analysts and VBA developers. My posts are written with one thing in mind: teaching analysts how to do things properly.
As mentioned in my previous post I started playing with Kaggle competitions, which pushed me into learning R. What I want to emphasize is that you do not have to be a data scientist to start in a Kaggle competition. Some programming/analytical skills are however more than welcome. Generally as I see it Kaggle competitors use R and Python – both worth learning and putting them into your toolbox.
However, I wanted to share a short but important highlights on how to quickly start Kaggling using R.
Some general tips below:
Regression vs Classification
Kaggle competitions usually tackle 2 problems – classification or regression. Both groups problems have their algorithms for which there are plenty of available libraries. Python has its scit-learn toolkit and R has it’s vast CRAN library. So you really do not have to invent the wheel by yourself! I generally suggest using R as a beginner.
Training and testing data
Usually be prepared to have to types of datasets: training and testing data. Training data is for teach your model to understand the data. The testing data is for testing how good your model is for making predictions. Kaggle will expect you to make your predictions and submit them for evaluating how you well you did against the leaderboard.
A strategy that I have adopted, and I can suggest as a starting point, can be summarized in the following steps using R:
Explore the data – plot, explore, cleanse
Select some models that might fit this problem – here is a valuable list to look through
Use the R caret library to test which model works best – select a metric for evaluation e.g. Kappa for classification, RMSE for regression. Check such evaluation tools as CV, bootstrapping etc.
Tweak it! – find your best model? Play with the parameters, smoothing/centering/scaling the data. Find what works best for you
This strategy generally gets me in the first 30% of the leaderboard on the first day :).
Recently, having some time on my hands, I stumbled on Kaggle. A site for data scientist and other enthusiast programmer who want to test their skills in building the best machine learning models. It seemed to me at first like something for only a small community of people. However, as it turns out most machine learning models are already there (in R, Python etc.). Even most beginner programmer/analysts can try their luck and skills at building the best models.
Personally, this was an incentive for me to learn R – a free statistical language for data exploration/mining/analysis. R is a great and fairly simple language to learn, allowing you to load, explore, visualize and manipulate data very easily. I won’t go into detailed examples as there are already plenty of tutorials examples out there. I wanted just to share a few highlights why every data analyst (not only statistician) should have R in their toolbox:
Why use R?
…apart from some reasons mentioned by Inside-R I wanted to list those practical features that are particularly important to me compared to e.g. Excel:
Easy and quick manipulation on various data structures – apply transformation functions to rows, columns, cells. Doing this in Excel is much more cumbersome and usually less efficient
Easy data file loading from various sourses e.g. CSV, Excel files – a CSV can be loaded in just one line of code!
Easy multithreading! – one of the setbacks of Excel is its lack of multithreading. In R “for” iterations loops can run concurrently
I wanted also to share some real world examples when R could come in handy:
What? That’s it? Yes! Of course this example is very simplified – usually you will want to do some additional data processing/transformation. Imagine, however, having to load/process/merge hundreds of files in Excel. As in this example R will do it in 4 or any amount specified of separate threads.
I hope these short examples have encouraged you to add R to your toolbox!
I always wanted to utilize the beautiful and interactive Google Charts in Excel. The Google Charts repository is constantly growing and sometimes Excel lacks those features. Hence I introduce the Excel Google Charts Tool to display a way to leverage some of those Google Charts directly in Excel.
The WebBrowser control is no longer supported by Office 2013 and above, hence this functionality might not work by default in those versions of Office
The Excel Google Charts Tool contains example Google Charts embedded inside an Excel xlsm file allowing you to visualize data in a more attractive way and enabling more user interaction.
Excel Google Charts: Gauge Chart
Gauge charts are extremely useful to highlight important values in reports. You can also visualize the good and bad ranges of values e.g. orange and red for too high values. These however, can be easily configured. I
Treemaps can be particularly useful when you want to drill-down data values e.g. used disk space and drill-down across folders. Google Treemaps have 2 values which you can visualize – the area of the treemap and the color.
I would say – one of the most useful charts when playing with geo-data. Using the Geo Chart you can easily visualize how your data is broken down across countries. You can zoom in the Geo Chart just to show a single continent, country or region.
The file below contains all examples of Google Charts used in the Excel Google Charts Tool.
Currently the Google Chart Tool contains examples of the following Google Charts:
Issues and errors
One issue you might stumble on when using the above Google Charts may be due to recent scriptable control restrictions imposed by Microsoft. Due to these in Excel 2013 and above Excel will restrict (by default) the use of some controls e.g. Microsoft Web Browser Control – which is required to run the above Google Charts. There is a way around that so utilize the link above to read more.
Many times I was irritated of the lack of some Excel functionality (or just I don’t know there is) to easily transform data w/o using pivot tables. SQL in VBA was the only thing that was missing for me.
Distinct, grouping rows of Excel data, running multiple selects etc. Some time agon when I had a moment of time to spare I did my homework on the topic only to discover that running SQL queries from Excel VBA is possible and easy…
Want to create SQL Queries directly from Excel instead? See my Excel SQL AddIn
Using SQL in VBA example
Let see how to run a simple SELECT SQL Query in Excel VBA on an example Excel Worksheet. On the right see my Excel Worksheet and the Message Box with the similar output from my VBA Macro. The VBA Code is below:
Being a corporate Consultant I always thought of the vast amount of data stored in various corporate databases that could be put to good use. Lot of data is stored and is accessible via Outlook – although mostly not appreciated.
I once received an assignment which required analyzing the data of all employees in our company. One way to approach this assignment was to contact HR, which would then need to contact regional HR … and so on. A process that would last at least a week before I could get my hands on the actual data. And even then this would require verifying the data and some cleansing. The other easier way – the corporate Outlook correspondence group containing all active employees and to somehow dump this data to an Excel/CSV file for analysis. Hence, I conducted a short research resulting in the following VBA function:
VBA to extract Outlook recipient information
Here is the VBA code:
Dim NewMail As Outlook.MailItem
Set NewMail = Application.ActiveInspector.CurrentItem
Dim s As String, filePath As String
s = ""
On Error Resume Next
For Each Recipient In NewMail.Recipients
s = s & Recipient.AddressEntry.GetExchangeUser.Alias
s = s & ";" & Recipient.AddressEntry.GetExchangeUser.Name
s = s & ";" & Recipient.AddressEntry.GetExchangeUser.JobTitle
s = s & ";" & Recipient.AddressEntry.GetExchangeUser.City
s = s & ";" & Recipient.AddressEntry.GetExchangeUser.Department & vbNewLine
Set FSO = CreateObject("Scripting.FileSystemObject")
filePath = Date & "-Outlook Data.txt"
Set oFile = FSO.CreateTextFile(filePath)
Call MsgBox("Saved to: " & filePath)
Notice that what it does is locate the current open email and inspect its recipients. Next it traverses the whole list of recipients and dumps that data to your default VBA directory.
How to dump Outlook recipient information – Step by step
To make sure you are using the code above correctly be sure to follow the instructions below:
Insert the function to your Outlook Visual Basic project
Go to Developer ribbon tab and select Visual Basic. In the VBE (Visual Basic Editor) right click on your VBA Project and select New Module. Insert the macro above.
Add recipients to Outlook draft email
Open a new email message in Outlook and add the desired recipients to the To: section e.g. if adding a correspondence group be sure to expand the list to list all recipients individually. This is an important step – be sure to list INDIVIDUAL recipients as the macro will not expand any group emails
Execute the macro
Go to Developer ribbon tab, select Macros and run the DumpOutlookData procedure. Wait for it to dump the data to the txt file (to your default folder – Documents, read more how to change default directory.). This might take some time depending on how many email recipients you want to dump to your file.
You can also extend the macro to get additional information on Outlook recipients like:
Measure employee retention over time – extract the data at least on a monthly basis and ther you go. You have yourself enough information to count the employee retention – for the entire company, your department etc.
Hopefully you see what an incredible resource everyone in a company has – which you can utilize for tons of internal company projects without needing to approach HR.