All posts by AnalystCave

Tom. Excel / VBA / C# enthusiast and hobbist. Collecting and sharing my knowledge and experience with beginner/advanced analysts and VBA developers. My posts are written with one thing in mind: teaching analysts how to do things properly.

Start your Kaggle journey with R

1 Star2 Stars3 Stars4 Stars5 Stars (3 votes, average: 3.67 out of 5)
Loading...

As mentioned in my previous post I started playing with Kaggle competitions, which pushed me into learning R. What I want to emphasize is that you do not have to be a data scientist to start in a Kaggle competition. Some programming/analytical skills are however more than welcome. Generally as I see it Kaggle competitors use R and Python – both worth learning and putting them into your toolbox.

However, I wanted to share a short but important highlights on how to quickly start Kaggling using R.

Some general tips below:

Regression vs Classification

Kaggle competitions usually tackle 2 problems – classification or regression. Both groups problems have their algorithms for which there are plenty of available libraries. Python has its scit-learn toolkit and R has it’s vast CRAN library. So you really do not have to invent the wheel by yourself! I generally suggest using R as a beginner.

Training and testing data

Usually be prepared to have to types of datasets: training and testing data. Training data is for teach your model to understand the data. The testing data is for testing how good your model is for making predictions. Kaggle will expect you to make your predictions and submit them for evaluating how you well you did against the leaderboard.

A strategy that I have adopted, and I can suggest as a starting point, can be summarized in the following steps using R:

  1. Explore the data – plot, explore, cleanse
  2. Select some models that might fit this problemhere is a valuable list to look through
  3. Use the R caret library to test which model works best – select a metric for evaluation e.g. Kappa for classification, RMSE for regression. Check such evaluation tools as CV, bootstrapping etc.
  4. Tweak it! – find your best model? Play with the parameters, smoothing/centering/scaling the data. Find what works best for you

This strategy generally gets me in the first 30% of the leaderboard on the first day :).

Add the R language to your analyst toolbox today!

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 4.00 out of 5)
Loading...

Recently, having some time on my hands, I stumbled on Kaggle. A site for data scientist and other enthusiast programmer who want to test their skills in building the best machine learning models. It seemed to me at first like something for only a small community of people. However, as it turns out most machine learning models are already there (in R, Python etc.). Even most beginner programmer/analysts can try their luck and skills at building the best models.

Personally, this was an incentive for me to learn R – a free statistical language for data exploration/mining/analysis. R is a great and fairly simple language to learn, allowing you to load, explore, visualize and manipulate data very easily. I won’t go into detailed examples as there are already plenty of tutorials examples out there. I wanted just to share a few highlights why every data analyst (not only statistician) should have R in their toolbox:

Why use R?

…apart from some reasons mentioned by Inside-R I wanted to list those practical features that are particularly important to me compared to e.g. Excel:

  • Concise – most code in other languages will translate usually to a lot less in R. R is also very similar to Javascript in some sense which is a good thing!
  • Easy and quick manipulation on various data structures – apply transformation functions to rows, columns, cells. Doing this in Excel is much more cumbersome and usually less efficient
  • Easy data file loading from various sourses e.g. CSV, Excel files – a CSV can be loaded in just one line of code!
  • Easy multithreading! – one of the setbacks of Excel is its lack of multithreading. In R “for” iterations loops can run concurrently

I wanted also to share some real world examples when R could come in handy:

Example 1: Merging 2 csv files into 1

csv1 <- read.csv(file="testcsv1.csv", sep=";")
csv2 <- read.csv(file="testcsv2.csv", sep=";")
csvRes <- rbind(csv1,csv2)
write.table(csvRes, file="resCsv.csv", sep=";",row.names=FALSE)

Wow 4 lines of code! How cool is that? But why not load the files in parallel?

Example 2: Concurrency

require(foreach)
require(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
csvRes <- foreach(i=1:2, .combine=rbind) %dopar% {
   read.csv(file= paste("testcsv",i,".csv",sep=""), sep=";")
}
write.table(csvRes, file="resCsv.csv", sep=";",row.names=FALSE)
stopCluster(cl)

What? That’s it? Yes! Of course this example is very simplified – usually you will want to do some additional data processing/transformation. Imagine, however, having to load/process/merge hundreds of files in Excel. As in this example R will do it in 4 or any amount specified of separate threads.

I hope these short examples have encouraged you to add R to your toolbox!

map

Excel Google Charts Tool

1 Star2 Stars3 Stars4 Stars5 Stars (4 votes, average: 3.75 out of 5)
Loading...

I always wanted to utilize the beautiful and interactive Google Charts in Excel. The Google Charts repository is constantly growing and sometimes Excel lacks those features. Hence I introduce the Excel Google Charts Tool to display a way to leverage some of those Google Charts directly in Excel.

The WebBrowser control is no longer supported by Office 2013 and above, hence this functionality might not work by default in those versions of Office

The Excel Google Charts Tool contains example Google Charts embedded inside an Excel xlsm file allowing you to visualize data in a more attractive way and enabling more user interaction.

Excel Google Charts: Gauge Chart

Gauge charts are extremely useful to highlight important values in reports. You can also visualize the good and bad ranges of values e.g. orange and red for too high values. These however, can be easily configured. I

Excel VBA Gauge Chart
Excel VBA Gauge Chart

How to configure a Gauge Chart?

Google Chart: Gauge Chart

Excel Google Charts: Treemap Chart

Treemaps can be particularly useful when you want to drill-down data values e.g. used disk space and drill-down across folders. Google Treemaps have 2 values which you can visualize – the area of the treemap and the color.

Excel VBA Treemap Chart
Excel VBA Treemap Chart

How to configure a Treemap Chart?

Google Charts: Treemap Chart

Excel Google Charts: Org Chart

Excel VBA Orgchart
Excel VBA Orgchart

How to configure a Org Chart?

Org Charts come in handy when you want to visualize the tree/organisational structure.
Google Charts: Org Chart

Excel Google Charts: Geo Chart

I would say – one of the most useful charts when playing with geo-data. Using the Geo Chart you can easily visualize how your data is broken down across countries. You can zoom in the Geo Chart just to show a single continent, country or region.

Excel VBA Geochart
Excel VBA Geochart

How to configure a Geo Chart?

Google Charts: Geo Chart

Download

The file below contains all examples of Google Charts used in the Excel Google Charts Tool.


Currently the Google Chart Tool contains examples of the following Google Charts:

Issues and errors

One issue you might stumble on when using the above Google Charts may be due to recent scriptable control restrictions imposed by Microsoft. Due to these in Excel 2013 and above Excel will restrict (by default) the use of some controls e.g. Microsoft Web Browser Control – which is required to run the above Google Charts. There is a way around that so utilize the link above to read more.

SQL in VBA

Using SQL in VBA on Excel. Run SELECT Queries from VBA

1 Star2 Stars3 Stars4 Stars5 Stars (5 votes, average: 4.40 out of 5)
Loading...

Many times I was irritated of the lack of some Excel functionality (or just I don’t know there is) to easily transform data w/o using pivot tables. SQL in VBA was the only thing that was missing for me.

Distinct, grouping rows of Excel data, running multiple selects etc.
Some time agon when I had a moment of time to spare I did my homework on the topic only to discover that running SQL queries from Excel VBA is possible and easy…

Using SQL in VBA example

sql in excel
An Excel Worksheet and the Output of the RunSELECT Macro
Let see how to run a simple SELECT SQL Query in Excel VBA on an example Excel Worksheet. On the right see my Excel Worksheet and the Message Box with the similar output from my VBA Macro. The VBA Code is below:

Explaining the Code

So what is happening in the macro above? Let us break it down:

Connecting to the Data Source

First we need to connect via the ADODB Driver to our Excel Worksheet. This is the same Driver which runs SQL Queries on MS Access Databases:

The Provider is the Drive which is responsible for running the query.

The ConnectionStrings defines the Connection properties, like the path to the Queries File (example above is for ThisWorkbook) or if the first row contains a header (HDR).

The Open command executes the connection.

You can find more information on the ADODB.Connection Object on MSDN.

Looking for other Connection Strings to XLS or Access files? Check out ConnectionStrings.com. Here are some examples:

Running the SQL Select Query

Having connected to our Data Source Excel Worksheet we can now run a SQL SELECT Query:

So what happens here? First we run the Execute command with our SELECT query:

What does it do? It indicates that our records are in Sheet1. We can obviously extend this query just to filter people above the age of 30:

This would be the result:

sql in excel - above 30
SELECT * FROM [Sheet1$] WHERE Age > 30

The Execute command returns a ADODB RecordSet. We need to loop through the recordset to get each record:

Clean up

Lastly we need to Clean up our Objects to free memory. This is actually quite an important step as if you VBA code is runs a lot of queries or computations you might see a slow-down soon enough!

What Else Can I Do?

You can do tons of great things with ADODB / MS Queries / SQL in Excel. Here are some additional ideas:

  • Run Queries Across Worksheets – you can run JOIN queries on multiple Excel Worksheets. E.g.

    On the below tables:
    sql in excel example
    Running SELECT with JOIN on 2 Worksheets
  • Extracting Data from External Data Sources – use different connection strings to connect to Access Databases, CSV files or text files
  • Do more efficient LOOKUPs – read my post on VLOOKUP vs SQL to learn more
outlook users

Get Outlook recipients information via VBA (Outlook users data)

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 3.00 out of 5)
Loading...

Being a corporate Consultant I always thought of the vast amount of data stored in various corporate databases that could be put to good use. Lot of data is stored and is accessible via Outlook – although mostly not appreciated.

I once received an assignment which required analyzing the data of all employees in our company. One way to approach this assignment was to contact HR, which would then need to contact regional HR … and so on. A process that would last at least a week before I could get my hands on the actual data. And even then this would require verifying the data and some cleansing. The other easier way – the corporate Outlook correspondence group containing all active employees and to somehow dump this data to an Excel/CSV file for analysis. Hence, I conducted a short research resulting in the following VBA function:

VBA to extract Outlook recipient information

Here is the VBA code:

Sub DumpOutlookData()
    Dim NewMail As Outlook.MailItem
    Set NewMail = Application.ActiveInspector.CurrentItem
    Dim s As String, filePath As String
    s = ""
    On Error Resume Next
    For Each Recipient In NewMail.Recipients
      s = s & Recipient.AddressEntry.GetExchangeUser.Alias
      s = s & ";" & Recipient.AddressEntry.GetExchangeUser.Name
      s = s & ";" & Recipient.AddressEntry.GetExchangeUser.JobTitle
      s = s & ";" & Recipient.AddressEntry.GetExchangeUser.City
      s = s & ";" & Recipient.AddressEntry.GetExchangeUser.Department & vbNewLine
    Next Recipient
    Set FSO = CreateObject("Scripting.FileSystemObject")
    filePath = Date & "-Outlook Data.txt"
    Set oFile = FSO.CreateTextFile(filePath)
    oFile.WriteLine s
    oFile.Close
    Call MsgBox("Saved to: " & filePath)
End Sub

Notice that what it does is locate the current open email and inspect its recipients. Next it traverses the whole list of recipients and dumps that data to your default VBA directory.

How to dump Outlook recipient information – Step by step

To make sure you are using the code above correctly be sure to follow the instructions below:

Insert the function to your Outlook Visual Basic project

Go to Developer ribbon tab and select Visual Basic. In the VBE (Visual Basic Editor) right click on your VBA Project and select New Module. Insert the macro above.

Add recipients to Outlook draft email

Open a new email message in Outlook and add the desired recipients to the To: section e.g. if adding a correspondence group be sure to expand the list to list all recipients individually. This is an important step – be sure to list INDIVIDUAL recipients as the macro will not expand any group emails

Execute the macro

Go to Developer ribbon tab, select Macros and run the DumpOutlookData procedure. Wait for it to dump the data to the txt file (to your default folder – Documents, read more how to change default directory.). This might take some time depending on how many email recipients you want to dump to your file.
outlook recipients dump
You can also extend the macro to get additional information on Outlook recipients like:

  • Manager (Recipient.AddressEntry.Manager.Alias)
  • Office Location (Recipient.AddressEntry.GetExchangeUser.OfficeLocation)
  • Phone Numbers (Recipient.AddressEntry.GetExchangeUser.MobileTelephoneNumber )

..and other information.

Where to go from here?

Ok so you got the data in a semicolon ; separated CSV file. What next?

Cool ideas to utilize this data

There are tons of useful things you can do with this Outlook recipient information. Here are just some:

Hopefully you see what an incredible resource everyone in a company has – which you can utilize for tons of internal company projects without needing to approach HR.