Tag Archives: SQL

How to create an MS Query in Excel

How to create a Microsoft Query in Excel (Excel Query)

1 Star2 Stars3 Stars4 Stars5 Stars (4 votes, average: 4.75 out of 5)
Loading...

You can use Microsoft Query in Excel to retrieve data from an Excel Workbook as well as External Data Sources using SQL SELECT Statements. Excel Queries created this way can be refreshed and rerun making them a comfortable and efficient tool in Excel

A Microsoft Query (aka MS Query, aka Excel Query) is in fact an SQL SELECT Statement. Excel as well as Access use Windows ACE.OLEDB or JET.OLEDB providers to run queries. Its an incredible often untapped tool underestimated by many users!

You can extract data from:

  • Excel Files – you can extract data from External Excel files as well as run a SELECT query on your current Workbook
  • Access – you can extract data from Access Database files
  • MS SQL Server – you can extract data from Microsoft SQL Server Tables
  • CSV and Text – you can upload CSV or tabular Text files

How to Create a Microsoft Query

In this step by step tutorial I will show you how to create an Microsoft Query to extract data from either you current Workbook or an external Excel file.

I will extract data from an External Excel file called Data Source.xlsx situated in C:\.

The below process shows how you can create a query for your current or an external Excel Workbook. However, the process for creating a MS Query for Access, SQL and Text (CSV) files is very similar

Open the MS Query (from Other Sources) wizard

Create a Microsoft  Query (QueryTable)Go to the DATA Ribbon Tab and click From Other Sources. Select the last option – From Microsoft Query.

Select the Data Source

Create a Microsoft Query (QueryTable)Next we need to specify the Data Source for our Microsoft Query. Select Excel Files to proceed.

Select Excel Source File

querytableCreate a Microsoft Query (QueryTable)3Now we need to select the Excel file that will be the source for our Microsoft Query. In my example I will select my current Workbook, the same from which I am creating my MS Query.

Select Columns for your MS Query

Create a Microsoft Query (QueryTable)3The Wizard now asks you to select Columns for your MS Query. If you plan to modify the MS Query manually later simply click OK. Otherwise select your Columns.

Return Query or Edit Query

Create a Microsoft Query (QueryTable)3Now you have two options:

  1. Return Data to Microsoft Excel this will return your query results to Excel and complete the Wizard

  2. View data or edit query in Microsoft Query this will open the Microsoft Query window and allow you to modify you Microsoft Query

Optional: Edit Query

Create a Microsoft Query (QueryTable)3If you select the View data or edit query in Microsoft Query option you can now open the SQL Edit Query window by hitting the SQL button. When you are done hit the return button (the one with the open door).

Import Data

Create a Microsoft Query (QueryTable)3Lastly, when you are done click OK on the Import Data window to complete running the query.

MS Query Trick

The above process is long and cumbersome. How about doing it quicker?
create ms query
Just use my VBA Code Snippet:

Just create a New VBA Module and paste the code above. You can run it hitting the CTRL+SHIFT+S Keyboardshortcut or Add the Macro to your Quick Access Toolbar.

Power Query vs Microsoft Query

Lastly, I would like to tackle the question of Why use MS Queries when I have Power Query? Microsoft has done a good job of understanding that users need a tool to transform data, but often don’t have the knowledge to use the SQL Language to create SELECT queries. That is why they created the Power Query AddIn (as part of Power BI Suite).

MS Query Pros: Power Query is an awesome tool, however, it doesn’t entirely invalidate Microsoft Queries. What is more, sometimes using Microsoft Queries is quicker and more convenient and here is why:

  • Microsoft Queries are more efficient when you know SQL. While you can click your way through to Transform Data via Power Query someone who knows SQL will likely be much quicker in writing a suitable SELECT query
  • You can’t re-run Power Queries without the AddIn. While this obviously will be a less valid statement probably in a couple of years (in newer Excel versions), currently if you don’t have the AddIn you won’t be able to edit or re-run Queries created in Power Query

MS Query Cons: Microsoft Query falls short of the Power Query AddIn in some other aspects however:

  • Power Query has a more convenient user interface. While Power Queries are relatively easy to create, the MS Query Wizard is like a website from the 90’s
  • Power Query stacks operations on top of each other allowing more convenient changes. While an MS Query works or just doesn’t compile, the Power Query stacks each transform operation providing visibility into your Data Transformation task, and making it easier to add / remove operations

In short I encourage learning Power Query if you don’t feel comfortable around SQL. If you are advanced in SQL I think you will find using good ole Microsoft Queries more convenient. I would compare this to the Age-Old discussion between Command Line devs vs GUI devs

excel data analysis

Data Analysis Excel Tools

1 Star2 Stars3 Stars4 Stars5 Stars (3 votes, average: 4.67 out of 5)
Loading...

Excel is most appreciated for it’s ease of use as a Data Analysis Tool. I mean to explore the basics, as well as the more advanced Data Analysis Excel Tools. Be sure to read through the Other Tools section below for other honorable mentions.

Data Analysis Excel Tools
excel vba
Sort & Filter
excel vba
PivotTable
excel vba
What If Analysis
excel vba
Data Analysis Toolpak
excel vba
Solver
excel vba
MS Query (SQL)

The DATA ribbon tab

As some of you already have noticed MOST (except for the PivotTable for no good reason) of the tools described in this post focus on features available BY DEFAULT within the Excel DATA ribbon tab (see image below). Some of them are AddIn that are installed by default along with MS Office, however, need to be activated manually. Skip to the relevant section of this post to read more.

excel data ribbon tab
Excel DATA ribbon tab (click to expand)
In this post I am focusing just on features that allow you to explore data or solve certain analytical problems. I am not focusing on simple data transformation features like removing duplicates, however, I do encourage that you learn more on these non-mentioned features.

Sort & Filter

Let’s kick off with the Sort & Filter group. See below for a quick look at how to use the Sort and Filter feature in Excel on an example data set:

PivotTable

PivotTables are so versatile a tool that most often no other Data Analysis Tool is needed. PivotTables are great for exploring data sets and for visualizing data with the help of automatically generated PivotCharts. See a simple PivotTable example below:

What If Analysis

The What If Analysis button in the DATA ribbon tab facilitates generally two useful features:

  • Scenario Manager – the scenario manager allows you to add several scenarios to a certain problem and then generate a summary comparing how the scenario values impact a selected cell value
  • Goal Seek
  • Data Tables – creating scenarios

Let’s explore these tools (apart from the last one) on a simple problem case example.
The problem:
what if the problem

Our example assumes there is a 500k$ Marketing Budget to spend on a Marketing Campaign which can consist of spending on Internet and/or TV ads. What is more, for the sake of the example let’s assume:

  • TV ads have a parabolic impact (see below) on Product demand i.e. at first the more you spend the more Product demand you generate, until you reach a peak where Product demand will actually fall (maybe viewers are tired of their favorite programs being bombarded by your ads).
  • Internet ads have a linear impact on Product demand i.e. the more you spend the more Product demand you generate

internet and tv spend
The Product demand cell formula is:

=2*D3+((C3-100)^2*-1+10000)

Scenario Manager

Using the example mentioned above let’s use the Scenario Manager to create 3 separate scenarios:

  1. 100% TV spend – 500k$ spent on TV ads
  2. 50% TV and Internet spend – 250k$ spent both on TV and Internet ads
  3. 100% Internet spend – 500k$ spent on Internet ads

Open the Scenario Manager

To add our scenarios go to the DATA ribbon tab, click What If Analysis and then select Scenario Manager....

Click Add to add a new Scenario

Specify the Scenario details – especially the range of Changing Cells that are to be specified by the scenario:

Scenario Manager: Adding a Scenario
Scenario Manager: Adding a Scenario

Next provide values as appropriate to your Scenario:
Scenario Manager: Setting the Scenario Values
Scenario Manager: Setting the Scenario Values

Repeat for each Scenario

You need to Name and specify the Changing Values for each Scenario separately.

Generate a Summary

Hit Summary... to generate a neat worksheet with a summary of your Scenarios. Specify the Result Cells – cells which will be modified in result to your Scenario. In my case it is just the Product demand cell:

Scenario Manager: Generate Summary
Scenario Manager: Generate Summary

Now admire the results:
Scenario Manager: Results
Scenario Manager: Results

Notice that in this example on purpose I created the formula as such so that it is not directly obvious (unless you work out the formula maximum) what is the best mix of TV and Internet ads. We will work through this example using different tools below.

Goal Seek

Goal Seek calculates the values necessary to achieve a specific goal. The “goal” is an amount returned by a formula. So in short – you should use Goal Seek if you have a function which characteristics are unknown or hard to figure out. Goal Seek is Solver’s younger (less developed) brother.

Some examples of when to use Goal Seek:

  • Solving single (changing) parameter mathematical functions (seeking a specific value e.g. 0)
  • Doing a quick check if a single parameter can help achieve a specific value for a formula (e.g. financial statements)

Using Goal Seek

Ok let’s learn how to apply Goal Seek. Let’s reuse our above example – let’s say we are aware that our TV ads function is a parabolic or a polynomial at least and we want to quickly locate some of it’s 0 values.

Our Product demand (just for TV ads) looks as so if we ignore the Internet ads:

=((C3-100)^2*-1+10000)

Where C3 is Marketing Spend on TV ads.

Open Goal Seek

Go to the DATA ribbon tab, click What If Analysis and then select Goal Seek.

Provide the Goal Cell, Value and the Changing Cell

For our example the Goal is our Product demand, and our Changing Cell will be the Marketing Spend on TV ads:

Goal Seek: Setting up
Goal Seek: Setting up

Run Goal Seek and BEWARE!

This is the result I received. Notice that the value I provided into the TV ads cell was 0 to start with.

Goal Seek: Results
Goal Seek: Results

This time Goal Seek came up with the result: 0. Ok now let’s rerun this example. But this time I will type a higher starting value for TV ads spend… say 150k$. Do you think it will provide the same result? See below:
Goal Seek: Results with different starting point
Goal Seek: Results with different starting point

The computed result is different! This time Goal Seek came up with the result: 200. No why is that? Goal Seek is an iterative algorithm that in an iterative manner seeks to reach the Goal Value. However, if a function (cell formula) has several solutions expect that Goal Seek may land at different values.

This is very important to keep in mind!

Goal Seek in VBA

Not many know that you can use Goal Seek easily in VBA too! The parameters are as such:
Range.GoalSeek(Goal, ChangingCell)
where:

  • Range – single cell with Goal function (formula)
  • Goal – the Goal Value to achieve
  • ChangingCell – single cell which will can be modified to achieve the Goal

MSDN provides a sexy example below:

Worksheets("Sheet1").Range("Polynomial").GoalSeek Goal:=15, ChangingCell:=Worksheets("Sheet1").Range("X")

Data Analysis Toolpak

Also know as Analysis Toolpak is a Microsoft AddIn delivered with Excel (usually disabled by default) that provides data analysis capabilities to Excel. Let’s start by enabling this AddIn:

Enabling the Analysis Toolpak

Find Add-Ins in Excel Options

To open the Window go to FILE, select Options and then pick AddIns.

Analysis Toolpak: Excel Options
Analysis Toolpak: Excel Options

Next from the above window hit Go to progress to Step 2.

Open the Add-Ins window and enable the Add-In

From the Window below select Analysis Toolpak and hit OK:

Analysis Toolpak: Enabling the Add-In
Analysis Toolpak: Enabling the Add-In

Using the Analysis Toolpak

To open the Analysis Toolpak window simply go to the DATA ribbon and you should find at your far right in a new group (Analysis) the Data Analysis button. Hit it to open the whole range of Data Analysis Excel tools.

Analysis Toolpak: Opening
Analysis Toolpak: Opening

Click on the button to see the whole list of different tools:
Analysis Toolpak Tools
Analysis Toolpak Tools

I won’t go through all these tools as they are pretty straightforward. Personally I tend to use only the Correlation feature which creates a nice correlation matrix out of a series of numerical columns.

Solver

Solver is a Microsoft AddIn to Excel that allows you to find optimal solutions for certain decision problems defined by a certain Objective (cell value). It’s like the older brother of the previously mentioned Goal Seek as it allows for more complicated computations and optimization making. Let’s start by enabling it in Excel.

Enabling the Solver AddIn

Find Add-Ins in Excel Options

To open the Window go to FILE, select Options and then pick AddIns.

Analysis Toolpak: Excel Options
Solver: Excel Options

Next from the above window hit Go to progress to Step 2.

Open the Add-Ins window and enable the Solver Add-In

From the Window below select Solver Add-In and hit OK:

Solver: Enabling the AddIn
Solver: Enabling the AddIn

Using the Solver AddIn

Now similarly as above let’s again consider our Problem from above.
what if the problem
As previously let’s exemplify this tool with our Marketing Budget example. Quick reminder – we have 500k$ to spend on TV and Internet ads and we have a formula that calculates our spend to the resulting Product demand.

Now we want to know what is the optimal value of our spend on TV and Internet ads to maximize Product demand. This is exactly what we can calculate with Solver:

To open Solver window simply go to the DATA ribbon and you should find at your far right in a new group (Analysis) the Solver button. Hit it to open the Solver wizard:

Solver
Solver

Solver requires that you:

  • Specify the Objective – set a single cell that measures your Objective. Next select whether you want to Maximize, Minimize or set this value by changing certain variables (specified later)
  • Specify the Changing Variables – specify the variables (cells) which should be modified by Solver to optimize your Objective function (formula)
  • Specify your Constraints – specify cells which should evaluate to certain constraints limiting the values of the Changing Variables

I used the settings above to run Solver on my Problem. See the results below:

Solver: Problem results
Solver: Problem results

The results are perfect (99 for TV ads and 401 for Internet ads)! Solver managed to find the optimal solution to my problem. Some of you probably thought the answer to be 100 and 400 :) – that was the hinted answer but not the optimal one! This is the power of Solver.

MS Query (SQL)

If you have been around my blog long enough you might have noticed my affection towards SQL in Excel manifested by my SQL Excel AddIn. I consider SQL a significant tool in the Data Analysis Excel toolbox. There are simply some things you won’t do easily with a PivotTable that can be easily written in a few SQL lines of code.

Let’s demonstrate a simple example of how to create a MS Query using the DATA ribbon tab using a simple source worksheet – Sheet1:
SQL Example data set

Open the MS Query window

Go to the DATA ribbon tab, click From Other Sources and then select From Microsoft Query.

Add MS Query (SQL)
Add MS Query (SQL)

Follow the MS Query window wizard to create your MS Query

This is the hard part. In my example I want to run a query against my working Workbook (Sheet1). See screens below:

Selecting the Data Source
Selecting the Data Source

Selecting the Excel file (can be your current Workbook)
Selecting the Excel file (can be your current Workbook)

Selecting the Worksheet and columns for your Query
Selecting the Worksheet and columns for your Query

Return data to Worksheet or Edit Query
Return data to Worksheet or Edit Query

Edit the Query (optional)
Edit the Query (optional)

Return Data to Worksheet (as Table/PivotTable/PivotChart)
Return Data to Worksheet (as Table/PivotTable/PivotChart)

Indeed that’s a lot of steps right? Hence, if you consider using MS Query (SQL) I strongly recommend my SQL Excel AddIn. It does the same in just 1 step. It does not add any new features to Excel but simply makes it so easy to create MS Queries in Excel – so you can just use it to create the queries using the AddIn. The AddIn is not required to use them.

Refresh or Edit your MS Query

This is the wonderful part – your MS Query is represented in Excel as a QueryTable. A queryable Excel Table that can be Refreshed or modified. See the options below available for QueryTables:

Refreshing/Editing the MS Query
Refreshing/Editing the MS Query

Summary: Data Analysis Excel

It is worth at least being aware of the wide range of Excel Data Analysis tools that are available by DEFAULT in Excel. Most Excel users are limited to the PivotTable and maybe Solver (2 common Excel questions asked for Analyst job interviews). However, Excel contains much more Data Analysis Excel tools than just those two.

I personally see much value in the MS Query (SQL) feature which has been so successfully hidden in Excel and made so hard to setup that it is simply neglected. Fortunately, there is a way around that with my SQL AddIn which I myself use on a regular basis. SQL however is a powerful tool (and language) and often allows you to create queries that can replace VBA macros and in most cases will run faster than VBA.

VLOOKUP

Excel VLOOKUP vs INDEX MATCH vs SQL vs VBA

1 Star2 Stars3 Stars4 Stars5 Stars (12 votes, average: 5.00 out of 5)
Loading...

VLOOKUP vs INDEX MATCH vs SQL vs VBA – today you are in for the ultimate Excel Showdown. The VLOOKUP Excel function is one of the most popular functions, around which there has always been much debate. You will most definitely find an article about this function on almost every Excel blog site out there that matters. Similarly there has been much argue about how efficient this function is, when compared to other combos like INDEX MATCH or DOUBLE TRUE VLOOKUPS. I have always wanted to put the dot over the “i” in at least the discussion around performance when using the VLOOKUP and the INDEX MATCH combo (VLOOKUP vs INDEX MATCH).

VLOOKUP vs INDEX MATCH
What to do with your VLOOKUPs to significantly gain performance, and what to replace them with if you are looking to make your workbook more maintainable? How much performance will you actually gain? Hopefully here you will find answers to these questions.

One additional thing I always wanted to bring into this discussion was the MS Query. Many Excel experts often forget to mention that when you need to lookup a lot of values within a certain table there is an approach almost as effective as any Excel trick out there – Microsoft Query (SQL). Excel features so call Query Tables which can execute OLEDB SQL queries on Excel data (worksheets treated like separate SQL Tables).

This means that instead of doing a lookup of a certain value cell-by-cell you can do it within a single query. This query can always be refreshed at the push of a button (or macro), instead of dealing with uncontrollable automatic recalculations. I felt the urge to include this approach in this post as it can challenge face on all the other approaches out there. But let us start from the beginning…

VLOOKUP Example

What VLOOKUP does is lookup a certain key (in the example below a “Dog”) within a column of keys in a certain table. Then it takes a value corresponding to the row in which the key was located and returns a corresponding value from another column.

Let’s see this in the below VLOOKUP example:

VLOOKUP Example
VLOOKUP Example

It is one of the most often used formulas and simple enough. However, the VLOOKUP function has several setbacks:

  • Hard to maintain when columns are added/removed to/from the lookup table
  • Key column needs to be first in the lookup table
  • Little flexibility – cannot be used to match against both rows and columns of a lookup table. Although it can be replaced with HLOOKUP (the forgotten twin brother of VLOOKUP) this can be a nuisance if you want to create a table lookuping up both columns and rows

Why INDEX MATCH?

There are many decent posts on why to consider using INDEX MATCH against the common VLOOKUP. But before we go into the pros and cons let’s understand how the INDEX MATCH combo works.

In short we can replace a VLOOKUP with a combo consisting of 2 functions:

  • INDEX – returning the value of an element in a table or an array, selected by the row and/or column number indexes)
  • MATCH – returning the relative position of an item in a specified range

In the example below the MATCH function will first return the relative position (the row number) of the Dog in the A column. Next the INDEX function will return a corresponding value from the same row in column B.

INDEX MATCH Combo
INDEX MATCH Combo

Not really much complicated than the VLOOKUP but the INDEX MATCH combo certainly handles all of the setbacks of the VLOOKUP pretty well (see section above).

But why use INDEX MATCH instead of VLOOKUP, especially if we get the exactly same result? Short summary of the pros and cons of the INDEX MATCH vs. the VLOOKUP:

  • More flexible – allows you to match both against rows and columns
  • Less error prone – adding/removing columns/rows from the lookup table should not crash the INDEX MATCH combo
  • Can be split to match multiple columns – by splitting the INDEX from the MATCH we can in fact match several column off a single INDEX column which points us to the result row
  • Both vertical and horizontal lookups – VLOOKUP and HLOOKUP address either only vertial or horizontal lookups, whereas with the INDEX MATCH you can easily do both
  • Harder to use – VLOOKUP is a little easier to understand than INDEX MATCH and I know some people have difficulty with this two step approach

What about SQL?

As promised a quick example as to how SQL (MS Query) can fit in. The query below will do the same lookup operation as the VLOOKUP and the INDEX-MATCH.

Returning the value

A Query to lookup a whole column of values would look more like this:

How to create an SQL query in Excel? Just go to:DATA From Other Sources From Microsoft Query or check out my Excel SQL Add-In.

Now SQL will not prove much useful for just a single lookup operation. Its benefits appear when needing to carry out A LOT of lookup operations. As you will see there are certain tricks you can use in Excel to get better performance than SQL can provide you, however, I would encourage learning SQL as in most cases it can easily replace the need to create complex queries or Array Formulas and still provide awesome performance.

What about VBA?

Couldn’t miss out on an opportunity to check how VBA compares to the other approaches, although it seems pretty obvious it wouldn’t be a fair match – as VBA is singlethreaded as opposed to Excel’s native formulas, which are recalculated concurrently in multiple threads. To challenge the other approaches I devised a simple VBA procedure using the VBA Dictionary object.
vlookup vba macro
How does the VBA lookup procedure work? It loads the entire lookup table to a VBA Dictionary object and then looks-up the entire lookup values against the VBA Dictionary.

DOUBLE TRUE VLOOKUP

The VLOOKUP allows you to either approximate a match (by providing range_lookup value to TRUE) or select an exact match. Strangely enough Excel defaults to the approximate options which is obviously a nuisance for most of us… but maybe that was actually a hint from Microsoft that this is the option to go with? It turn out that using the TRUE option that approximates the lookup result will return the lookup value providing a significant performance boost (scroll down to see how significant).

However, a VLOOKUP using the TRUE option will always return a result – not necessarily the one you were looking for. The trick therefore socialized by Charles Williams here shows how using 2 TRUE VLOOKUPS you can get accurate results with great performance. How does it work? In short what Charles suggests is using an IF function with the condition being the FIRST TRUE VLOOKUP to match for the lookup value (the key). If the result matches against the original lookup value (the key) the IF functions returns the SECOND TRUE VLOOKUP which returns the matched value in the right column. If not it returns any default value set. See below:

DOUBLE TRUE VLOOKUP
DOUBLE TRUE VLOOKUP

The formula below:

Notice that the condition in the formula verifies if the VLOOKUP located the right lookup value. If so, the second VLOOKUP will return the associated lookup result. Might seem strange at first that 2 VLOOKUPs are better than one. Let’s look at the statistics to see how it will compare against the other approaches…

There is one CONSIDERABLE setback to keep in mind. The lookup table has to be SORTED by the lookup column (key column)! Otherwise the query will return inconclusive results.

Let’s now dive right into the performances stats around these different approach.

Performance comparison

Before we start I want to level set a couple of things. I am running these tests with Excel 2013 installed so keep in mind that if you are using a different version you may see slightly different results (even the INDEX MATCH being quicker then the simple VLOOKUP as I am told to believe). As all things changes, so should the discussion around VLOOKUP need updating – especially performance-wise. Apart from that some additional things to mention:

  • Tests run on an Intel i5-4300U processor (2 physical cores)
  • Tests run on a large dataset (200k lookup table) and assuming a large number of lookups (>25k) to diffuse the problem of accurate performance measurements (25k – 200k lookup operations). Results should give a good approximation of the actual metrics

Lookups on UNSORTED data

In most common cases you are carrying out lookup operations on a UNSORTED lookup table. The chart below present results of the following alternatives:

  1. VLOOKUP (UnSorted) – a simple VLOOKUP on an unsorted lookup table
  2. INDEX MATCH (UnSorted) – a simple INDEX MATCH on an unsorted lookup table
  3. SQL (UnSorted) – an simple SELECT query matching against the lookup values (keys) of the VLOOKUP (returns same results in same order)
  4. VBA (Sorted) – a VBA procedure that creates a dictionary of the lookup table and matches the lookups using the VBA Dictionary. You can find the source code here: source code
Execution time of lookup operations on UNSORTED data
Execution time of lookup operations on UNSORTED data

VLOOKUP vs INDEX MATCH
So what do we see? It seems that the INDEX MATCH combo performs consistently slightly worse then the simple VLOOKUP. There are no large differences when increasing the amounts of operations performed which seems like there is no reason to jump to an INDEX MATCH combo in the need for just performance. SQL (and VBA) on the other hand wiped out the competition being almost 19x faster when executed against 200k lookups. It seems like the MS Query did not increase it’s execution time considerably probably leading to think that it may perform equally well on much larger lookup tables. VBA was also quite efficient when executed on less than 100k lookup operations. Nevertheless, I consider SQL the winner as clearly it performed better for more operations.

Lookups on SORTED data

Let’s now consider the ideal situation where we have a SORTED lookup table. The chart below will now present results of the following alternatives:

  1. VLOOKUP (Sorted) – a simple VLOOKUP on a sorted lookup table
  2. DOUBLE TRUE VLOOKUP (Sorted) – a DOUBLE TRUE VLOOKUP on a sorted lookup table
  3. INDEX MATCH (Sorted) – a simple INDEX MATCH on a sorted lookup table
  4. SQL (Sorted) – an simple SELECT query matching against the lookup values (keys) of the
    VLOOKUP (returns same results in same order)
  5. VBA (Sorted) – a VBA procedure that creates a dictionary of the lookup table and matches the lookups using the VBA Dictionary, you can find the source code here: source code
Execution time of lookup operations on SORTED data
Execution time of lookup operations on SORTED data

VLOOKUP vs INDEX MATCH
As we can see the DOUBLE TRUE VLOOKUP rules the stage with an astonishing 0.22 seconds vs. the aweful 110 seconds of a regular VLOOKUP. An astonishing improvement! SQL comes second with a score just slightly above 5 seconds which seems reasonable (although using only 1 core due to being single threaded). What comes as strange is that both the VLOOKUP and the INDEX MATCH actually performed worse when executed against a sorted lookup table. Not something you might expect, but broadly explained by Excel-guru Bill Jelen in this podcast really worth watching if you want to know more on the subject.

The RIGHT approach

Although the DOUBLE TRUE VLOOKUP proved superior to any other method, VLOOKUP is a function that is far from perfect (read section on INDEX MATCH above). In short VLOOKUP is less immune to changes than the INDEX MATCH. Ideally we would like to have an INDEX MATCH formula that is just as efficient as the DOUBLE TRUE VLOOKUP… Well in fact there is a way. At least if we mix a little of both worlds – by combining a TRUE VLOOKUP with a APPROXIMATE INDEX MATCH.

So what’s happening here? We are using the first TRUE VLOOKUP to check whether the lookup_value is present in the lookup table. Then once we have that confirmed we can do an APPROXIMATE INDEX MATCH (less than or equal in this case) to efficiently search for our corresponding value in the result_column.

See below how the MATCH function is defined in Excel:

Excel MATCH function definition
Excel MATCH function definition

A simple example below:
TRUE VLOOKUP with APPROXIMATE INDEX MATCH
TRUE VLOOKUP with APPROXIMATE INDEX MATCH

The APPROXIMATE INDEX MATCH is similarly as efficient as the TRUE VLOOKUP hence both approaches are equivalent in terms of performance! Fantastic right!?

Keep in mind that the lookup table has to be SORTED by the lookup column (key column)! Otherwise the query will return inconclusive results.

Conclusions on VLOOKUP vs INDEX MATCH

For me these results mean at least 4 things:

  1. On a daily basis swapping your VLOOKUPs for INDEX-MATCH combos will not affect your Excel workbook performance, although may provide you with more flexibility and reduce the number of errors when working on the lookup table, which I personally appreciate
  2. If performance is key sort your lookup table and swap those VLOOKUPS with DOUBLE TRUE VLOOKUPS or rather the TRUE VLOOKUP APPROXIMATE INDEX MATCH combo. Alternatively, swap the VLOOKUPs with a Microsoft Query (SQL) for more simplicity and control
  3. Unless you are using the DOUBLE TRUE VLOOKUP (or TRUE VLOOKUP APPROXIMATE INDEX MATCH combo) don’t sort your lookup data table if you want to gain performance – this will have the opposite effect
  4. When working with very large datasets (>100k rows) consider MS Query. Although it may be just a little slower (consistently a couple of sec) than the DOUBLE TRUE VLOOKUP, for lookup operations, it is more flexible (performs well against other Excel functions) and provides you more control over your query. This also releases you out of the mercy of the Excel Automatic Calculation feature which may cause cells to recalculate when source data is modified (you can refresh the query via macro or by clicking refresh)
  5. Don’t resort to VBA for performance. It is an overkill for this exercise, although it performed almost as well as SQL, it introduces unnecessary complexity and requires saving files in XLSM/XLSB/XLS file format. Other than for amusement, VBA has no justification for this scenario

The VLOOKUP vs INDEX MATCH topic is one of the most popular Excel debates, but I hope that this post will shed more light and provide a measurable comparison of the options out there. I also feel I need to encourage the use of MS Queries as having their own place in this debate.

Do you agree? What do you think about the comparison of these approaches?

programming languages

Programming languages analysts should learn

1 Star2 Stars3 Stars4 Stars5 Stars (8 votes, average: 4.63 out of 5)
Loading...

Programming is a skill often associated only with Developers and not so much with analysts or data scientists. Why learn to code when you have tools such as Excel with it’s PivotTables, filtering and formulas. On the other hand, when thinking about analytic tools it seems like Tableu, ClickView, SAS or IBM SPSS is the way to go.programming languagesThese tools are great when working with Big Data. But most of us work with data sets counting not more than a few hundred thousand records. So why use a sledgehammer to crack a nut? Programming languages such as VBA, SQL or R let you squeeze much more of Excel and traditional tools .

VBA (Visual Basic for Applications) (MS Office)

USEFULNESS:
EASY TO LEARN:

Excel is the entry-level tool for most analysts and data scientists, allowing for easy data crunching tasks. In most cases formulas and PivotTables suffice in everyday analytics tasks. However, VBA extends Excels capabilities almost without limits – allowing for creating complex algorithms, extraction of data from various data sources, web scraping and load others. VBA owns it popularity to Excel beings its programming and testing environment at the same time. Write a piece of code and see it run in seconds.
If you want to progress as an analyst / data scientist – learn VBA.

One other perk to learning Visual Basic for Applications (VBA) is that the syntax is identical to VBscript. If you are not familiar – VBscript is a Windows scripting language working alongside Windows Batch Scripting (used in BAT files). VBscript can be used to write Windows scripts that don’t need Excel or MS Office installed on the workstation. VBA code can often be copied identically and used to create VBscript (with some differences like VBscript cannot include data types e.g. “Dim str as String” would raise an exception).

Example uses of VBA

programming languages vbaWhen VBA can prove useful?

Well there are ample examples to consider below. One of the most often proven ways to start is by using the Record Macro feature on the Developer tab.

  • Automate mundane Excel tasks e.g. formatting, editing etc.
  • Creating new custom Excel functions (UDF)
  • Implementing algorithms or complex formulas e.g. complex computations
  • Extending capabilities e.g. working with files, connecting with Internet resources
  • Creating Excel-based apps with User Forms and interaction

Code example

Here is a simple VBA script to show a simple message box with Hello World!

Sub HelloWorldProcedure()
  MsgBox "Hello World!"
End Sub

Where to start?

SQL (Structured Query Language) (RDBMS)

USEFULNESS:
EASY TO LEARN:

SQL lets you unleash the potential of database development. While Visual Basic (VBA is an implementation of Visual Basic) is a general-purpose scripting programming language, SQL is a special-purpose programming language- aimed at running queries and CRUD (Create, Read, Update, Delete) operations on relational databases such as MySQL, MS SQL, Oracle, MS Access etc. SQL is ubiquitous in the data world.

What is often not appreciated about SQL is that it supported by MS Office. Microsoft developed a database engine (JET.OLEDB and the newer ACE.OLEDB), that supports SQL, and can be used to query several Microsoft products such as MS Excel and MS Access (but also other formats like text files). Why is this important? SQL let’s you carry out many operations much easier and quicker e.g. merging Excel worksheets, filtering Excel worksheets on multiple complex conditions, linking Excel Workbooks with other Workbooks or Access databases. If you wanna learn more check out my Excel SQL Add-In

When will you want to reach out for SQL? Whenever you are analyzing/processing large data sets. Excel will typically start falling apart when processing 1000k record databases with the use of formulas, PivotTables etc.

Feel free to check-out this SQL Quick start learning video for Beginners:

Example uses of SQL

programming languages sqlSQL currently rules the world of data, even though NoSQL databases have enriched the database mix. Although many have announced the end of SQL times to be near, recent developments in approaches to programming such as micro-services and the overall ease of use and integrity of SQL database seem to negate these forecasts. Most datasets sit on SQL databases (with MySQL in the lead) so as an analyst sooner or later you will benefit from knowing how to query these datasets directly looking for insights and when building reports.

But does SQL come in handy when you are not dealing with large datasets or structured relational databases? Sure both in Access and Excel SQL can and should be used when dealing with data tables. Here are some of the main aspects SQL is valued for:

  • CRUD: SQL enables Create, Update, Update, Delete operations on relational databases (relations of tabular data sets)
  • Queries: SQL allows you to create queries based on multiple tables of data
  • Performance: SQL is fast and efficient for processing databases including large amounts of data

Code example

List all people at the age above 30, and with red eyes from a hypothetical People table :

SELECT * FROM People
WHERE Age > 30 AND EyeColor = "Red"

Where to start?

R programming language

USEFULNESS:
EASY TO LEARN:

R is a powerful programming languages for statistical computing and graphics. It’s one of the top programming languages used by data scientists (along with e.g. Python) out there (at least on Kaggle as I am aware). R has a moderately steep learning curve as its not a typical programming language, and was historically created by statisticians not developers, although it certainly is worth the effort. R focuses on operations on data frames (tables) and its whole design is oriented around this. What is more important in the case of R is its vast CRAN library of packages that provide additional statistical, graphics and machine-learning capabilities. R is also the go to language when processing so called Big Data. It make slicing and dicing data easy and pleasant.

Another nice thing about R is that it comes with a software environment (RStudio) similarly like you have the Visual Basic environment packed together with MS Office applications. This makes starting to work with R much easier, especially for beginners.

Here is a nice zero-to-hero learning series on R:

Example uses of R

Typically R is considered a great tool for the below. However, keep in mind that you can almost extend R with any capability by downloading the right package for R’s CRAN library e.g. Web Scraping etc.

  • Data exploration e.g. plotting statistical data relationships
  • Statistical data analysis
  • Data cleansing – correcting, formatting series of data e.g. CSV and text files
  • Machine-learning algorithms

Code Example

A simple function that introduces itself:

hello <- function( name ) {
  sprintf( "Hello, %s", name );
}
hello("World")

Just how easy it is to load CSV/text file data with headers:

data <- read.csv("data.csv", header = TRUE)
x <- c(1,2,3,4,5,6)   # Vector of numbers
y <- x^2  #Result: c(1, 4, 9, 16, 25, 36)

Where to start?

Here are some useful links to get you started with the R programming language:

Other programming languages worth mentioning

Now for some other honorable mentions worth exploring:

Python

USEFULNESS:
EASY TO LEARN:

Python is one of the most popular script programming languages out there currently. Especially valued for its ease of use and for being applicable to many environments and scenarios. Python is great for writing quick scripts to process files, data or automate easy tasks. It doesn’t have the overhead of object oriented programming languages and is very flexible. I personally use Python most often for Web Scraping or for automating Linux/UNIX tasks.

If you already know VBA and Excel is your primary working environment, Python might be of less value to you.

C#

USEFULNESS:
EASY TO LEARN:

C# is Microsoft’s flagship programming language and a rival to the all-popular Java. C# is a typical Object Oriented programming language and it’s primary software development environment is Microsoft’s Visual Studio. C# is very useful when making efficient and optimal solutions. Visual Studio is a big asset of C# as programming in VS is pleasant and easy. When to use C#? When you need a Graphical User Interface to put a face on your solution and when you want to “click out” most of your code instead of writing it.

SQL in VBA

Using SQL in VBA on Excel. Run SELECT Queries from VBA

1 Star2 Stars3 Stars4 Stars5 Stars (5 votes, average: 4.40 out of 5)
Loading...

Many times I was irritated of the lack of some Excel functionality (or just I don’t know there is) to easily transform data w/o using pivot tables. SQL in VBA was the only thing that was missing for me.

Distinct, grouping rows of Excel data, running multiple selects etc.
Some time agon when I had a moment of time to spare I did my homework on the topic only to discover that running SQL queries from Excel VBA is possible and easy…

Using SQL in VBA example

sql in excel
An Excel Worksheet and the Output of the RunSELECT Macro
Let see how to run a simple SELECT SQL Query in Excel VBA on an example Excel Worksheet. On the right see my Excel Worksheet and the Message Box with the similar output from my VBA Macro. The VBA Code is below:

Explaining the Code

So what is happening in the macro above? Let us break it down:

Connecting to the Data Source

First we need to connect via the ADODB Driver to our Excel Worksheet. This is the same Driver which runs SQL Queries on MS Access Databases:

The Provider is the Drive which is responsible for running the query.

The ConnectionStrings defines the Connection properties, like the path to the Queries File (example above is for ThisWorkbook) or if the first row contains a header (HDR).

The Open command executes the connection.

You can find more information on the ADODB.Connection Object on MSDN.

Looking for other Connection Strings to XLS or Access files? Check out ConnectionStrings.com. Here are some examples:

Running the SQL Select Query

Having connected to our Data Source Excel Worksheet we can now run a SQL SELECT Query:

So what happens here? First we run the Execute command with our SELECT query:

What does it do? It indicates that our records are in Sheet1. We can obviously extend this query just to filter people above the age of 30:

This would be the result:

sql in excel - above 30
SELECT * FROM [Sheet1$] WHERE Age > 30

The Execute command returns a ADODB RecordSet. We need to loop through the recordset to get each record:

Clean up

Lastly we need to Clean up our Objects to free memory. This is actually quite an important step as if you VBA code is runs a lot of queries or computations you might see a slow-down soon enough!

What Else Can I Do?

You can do tons of great things with ADODB / MS Queries / SQL in Excel. Here are some additional ideas:

  • Run Queries Across Worksheets – you can run JOIN queries on multiple Excel Worksheets. E.g.

    On the below tables:
    sql in excel example
    Running SELECT with JOIN on 2 Worksheets
  • Extracting Data from External Data Sources – use different connection strings to connect to Access Databases, CSV files or text files
  • Do more efficient LOOKUPs – read my post on VLOOKUP vs SQL to learn more