How to find duplicates in Excel? Today we will explore this question in and out. We will start first by understanding How to Highlight Duplicates in Excel. Our second objective will be learning How to Remove Duplicates. I will explore several approaches to this task – using the Data ribbons Remove Duplicates feature, but also showing how we can use MS Queries to remove duplicates and also do additional filtering/sanitizing of our data set.
Find Duplicates in Excel
There are several ways to go around Finding or Highlighting duplicates in Excel:
Following today’s post we will use the example Excel Table on the right.
1. Find Duplicates using Excel Formulas
To Find Duplicates in Excel using Formulas we will use the COUNTIF Excel formula to verify if a certain row as been repeated above. This will work only on a single column (our unique column identifier). Scroll down to the end to learn how to check for duplicates based on all three columns.
Add Is Duplicate? Column to your Worksheet
Add a Is Duplicate? Column to your Worksheet. In our example assuming that the Last column uniquely identifies records in my table input the following formula in the first cell of your new Is Duplicate? Column:
Drag the Formula down to all Cells
Drag the COUNTIF Formula from the First Cell down to the Last.
Instead of dragging the Formula down manually you can:
Select the First Cell with the Formula
Copy by Formula by hitting CTRL+C
Select all cells down by hitting CTRL+▼
Paste the Formula by hitting CTRL+P
Find Entire Row Duplicates
Did you notice that in the previous example 2 rows were in fact not duplicates? See below:
What should you do if there is no Column that Uniquely identifies your rows? We need to use a Helper Column! See Column D below:
Simply add the Helper Column and Concatenate all other Columns like so:
Next you need to use the COUNTIF Formula on the Helper Column instead.
I recommend separating columns with some Separator (such as the _) as otherwise you may again get False Positive Duplicates
2. Highlight Duplicates using Conditional Formatting
To Highlight Duplicates in Excel we need to use the Conditional Formatting feature in the HOME Ribbon Tab in Excel. Follow the steps below:
Select the Column to Highlight Duplicates
Select the Column in which you want to Highlight your Duplicates. As a Tip you can Select the First Cell and hit CTRL+END.
Open Conditional Formatting
Go to the HOME Excel Ribbon Tab and select Conditional Formatting.
Create New Rule
As we want to Highlight Duplicate Cell Values in Excel we need to create a new Conditional Formatting Rule. Click New Rule.
Select Duplicate formatting
Select the Format only unique or duplicate values option to define the Formatting of Duplicate Values.
Select Duplicate formatting
In the new window make sure Format All is set to duplicate. Next set the formatting for your Duplicates.
Once completed you should see your Duplicate Cells Highlighted in your Excel spreadsheet.
Remove Duplicates in Excel
Having answered the question of How to Find Duplicates in Excel, but what you often want to do is Remove Duplicates. So lets now tackle the issue of How to Remove Duplicates in Excel.
Similarly as with Finding Duplicates, there several ways to go around Removing Duplicates:
Fortunately Removing Duplicates is a typical scenario in Excel supported by the Data Ribbon Tab.
Select Excel Table
Select your Entire Excel Table by hitting CTRL+A.
Go to the Data Ribbon Tab in Excel and select Remove Duplicates.
Select Columns with Duplicates
In this last step you need to Select All Columns in which there are Duplicates. If you just want to remove rows in which there are duplicates in a single column – remember to select only that Column.
Once you finish hit OK and done!
The DISTINCT SQL statement filters only the DISTINCT rows within a TABLE. You can use it on any SQL SELECT Query. See my example below:
The table on the Left is the Source Table the table on Right is the Result Table. This is the Microsoft Query you can use to filter only DISTINCT rows:
Where replace Sheet1 with the name of your Worksheet.
Count Duplicates in Excel
Similarly as with the previous case, there several ways to go around Counting Duplicates:
The Excel VLOOKUP function (Vertical Lookup) is one of need-to-know features in Excel! I can’t stress enough how many articles/posts out there have been published on using VLOOKUP. Probably rightly so. I couldn’t however resist to make an my attempt at consolidating all the knowledge around how to use VLOOKUP and all the bits and pieces of information you need to know to be able to fully leverage the vertical lookup function in Excel! Let’s start with a VLOOKUP example…
The VLOOKUP Phonebook analogy
To quickly explain what VLOOKUP does I usually like to use the common Phonebook analogy. In the example below imagine we have a friend called John White. Let’s say we want to call John. What we need to do is to locate John in our Phonebook by his Name and Surname and call him on his corresponding phone number located within the same row.
VLOOKUP returns a cell located at a row corresponding to your lookup value. In a phonebook analogy you would be looking-up a person’s phonenumber by his name and surname
Everyday we deal with similar lookup problems e.g. restaurant menu (looking up the dish price), grocery lists (how many bottles of milk did I need to buy?) etc. Hence the usefulness of using VLOOKUP. Let us move on to learn how to use VLOOKUP.
How to use VLOOKUP?
Considering the above VLOOKUP example let’s take a closer look at the parameters of the VLOOKUP function:
Let’s look at the arguments of VLOOKUP in more detail:
lookup_value – value you want to look up. Must be in the first column of the range of cells you specify in table-array
table_array – range of cells VLOOKUP will
Search FIRST column to find the ROW containing the lookup_value
Search and return corresponding value in column col_index_num of the table_array
col_index_num – column number within the table_array that contains return values. The first left-most column in the table_array starts numbering with 1
range_lookup – an OPTIONAL boolean (True/False) parameter that defines how the VLOOKUP function will behave:
FALSE – search for the first EXACT MATCH in the first column of the table_array
TRUE – Default option. assume the FIRST COLUMN in table_array is sorted either numerically or alphabetically and return an APPROXIMATE MATCH. This parameter works correctly ONLY IF THE FIRST COLUMN OF THE table_array IS SORTED!
FALSE VLOOKUP Example
Nothing beats a hands-on VLOOKUP example! See the below animation to witness for yourself how VLOOKUP works!
Notice that I am using the range_lookup set to FALSE. This means that the VLOOKUP function will always look for an EXACT MATCH.
TRUE VLOOKUP Example
In the previous example the range_lookup parameter was set to FALSE. This example will equally work with range_lookup equal to TRUE as long as we sort our FIRST COLUMN alphabetically (or numerically in other cases)!
So what’s the difference you will ask?PERFORMANCE. The TRUE (approximated match) VLOOKUP is significantly faster than the FALSE (exact match) VLOOKUP. Read here more on VLOOKUP Performance.
Important! If you got hooked up already on the performance bit of the TRUE VLOOKUP beware! The TRUE VLOOKUP will always return a result… even if it didn’t find an exact match. That is a setback. Luckily Excel experts have found a way around this with a DOUBLE TRUE VLOOKUP. Read on.
DOUBLE TRUE VLOOKUP
To exemplify the DOUBLE TRUE VLOOKUP let’s consider this table:
We want to match the Animal Monkey against a Category. If we use a standard FALSE VLOOKUP like this:
We will get the following results:
"Mammal" if found
"#N/A" if not found
Great! But what happens if we use a simple TRUE VLOOKUP?:
We will get the following results:
"Mammal" if found
"Amphibian" (or similar) if not found
Well that’s not very appropriate huh? How can we know for sure if the TRUE VLOOKUP cross-referenced the Animal correctly? Well.. why not use a second VLOOKUP for that? Consider the DOUBLE VLOOKUP below:
"Mammal" if found
"" if not found (replace "" with #N/A if needed)
See what happens here? The first VLOOKUP validates whether there is an EXACT MATCH in the first column. If so the second VLOOKUP will return the corresponding result. See below:
Are 2 DOUBLE TRUE VLOOKUPs really worth the trouble? If you are looking for performance than YES! Read more here.
INDEX MATCH vs VLOOKUP – Do’s and Don’t
If you’ve been using VLOOKUP long enough you probably stumbled across MANY articles/posts on why many consider a certain combination of two Excel functions i.e. INDEX MATCH, better than using a regular VLOOKUP. In fact MOST Google results for VLOOKUP vs INDEX MATCH won’t mention the significant advantage VLOOKUP has over INDEX MATCH which I hinted in the previous section.
Let’s compare the pro’s and con’s of using INDEX MATCH (instead of VLOOKUP):
More flexible – allows you to match both against rows and columns
Less error prone – adding/removing columns/rows from the lookup table should not crash the INDEX-MATCH combo
Both vertical and horizontal lookups – VLOOKUP and HLOOKUP address either only vertial or horizontal lookups, whereas with the INDEX MATCH you can easily do both
Harder to use – VLOOKUP is a little easier to understand than INDEX-MATCH and I know some people have difficulty with this two step approach
In some cases you would like to run a VLOOKUP against MORE THAN ONE COLUMN i.e. lookup a certain set of values against several columns instead of just one column. There is a simple way to achieve this by introducing a HELPER COLUMN with a concatenation of the lookup columns. Let’s consider the example below:
As you can see each month is specified in two separate columns. A simple one-column VLOOKUP will not do. So how to use VLOOKUP to get the result for multiple criteria? Using a HELPER COLUMN! See the solution below:
The above is the simplest approach to a multiple criteria VLOOKUP. There are more elegant approaches that need not a Helper Column e.g. check-out Chandoo’s example here.
If you are impatient for the answer – for best performance always use DOUBLE TRUE VLOOKUPS. Now for a more thorough explanation let’s start from the beginning and summarize what we know in bullet points:
The VLOOKUP function can be well replaced with other functions/features in Excel
A common practice it to replace a VLOOKUP with a INDEX MATCH function combo. This does not affect performance (much) but solve a lot of typical usability issues
I have hinted above that a TRUE (approximated) VLOOKUP on a sorted table_array will have better performance that a regular FALSE (exact match) VLOOKUP
The TRUE (approximate) VLOOKUP function seems to be best candidate in terms of performance. However, approximating the result may render certain issues – a TRUE VLOOKUP will ALWAYS return a result! Even if there is no exact match! This creates a certain issue for us. Luckily there is a certain trick to use 2 TRUE VLOOKUPs in a combo to replace a regular FALSE VLOOKUP and expect similar results. This is commonly called a DOUBLE TRUE VLOOKUP.
Performance: VLOOKUPs VS INDEX MATCH VS SQL
The below performance comparison was carried out by me in my separate post on VLOOKUP vs INDEX-MATCH vs SQL. Follow to my post to read more. In the meantime let’s look at the comparison:
What do the various categories mean?:
VLOOKUP (sorted) – a regular FALSE (exact match) VLOOKUP against a sorted table_array
DOUBLE TRUE VLOOKUP (sorted) – 2 combined TRUE (approximate match) VLOOKUPs against a sorted table_array
INDEX-MATCH (sorted) – a combination of the INDEX and MATCH functions against a sorted table_array
SQL (Sorted) – an Excel MS Query (SQL) executed against a sorted table_array
How to use VLOOKUP recap
Well I do hope this exhausts the subject of using VLOOKUP in Excel. VLOOKUP is a common used function in Excel sometimes wrongly which, I hope from my performance stats above, can seriously tamper with your Workbook performance causing Excel Workbooks to recalculate in matters of minutes instead of seconds. Feel free to share your comments/thoughts below or on the AnalystCave forum.
Today I want to elaborate shortly on how to correctly and easily number rows in Excel by adding dynamic row numbers using simple formulas. Every neat data table in Excel should have a numbering column in place so that every row can be easily reference at least by the item number. One way of numbering rows is to simply input the numbers and drag them down. However, this is a static manner and the numbering will not refresh automatically if you change the places of any rows or add/delete rows. Here I want to introduce several easy ways to achieve nice and neat dynamic row numbering in Excel. Ok so let’s dive right into 3 methods for achieving nice dynamic row numbers in Excel.
Let’s say we want to dynamically number rows counting also every empty row in between i.e. if there is an empty row between the numbers we want the numbering to account for that an increase the index. I use this approach most often due to it’s simplicity. See the example below on how this works.
The formula is very simple:
A$1 is simply the header of the column to guarantee we start numbering from 1. Easy and neat right?
Method 2 – Dynamic numbering skipping empty rows
This time let’s account for every empty row in between. We want to continue the numbering from the last index. This comes in handy when you have sections of data or if you group the rows into different headers but want to retain the right numbering. See the example below:
Again the formula is dead simple:
We use the static $ marker to make the range start from always the first cell in the column. The formula will count all non-blank rows so will skip any blanks we leave in between.
Dynamic nested numbering
Now for a bonus – dynamic nested numbering. Sometimes we need to add numbering with nested indices. I certainly encourage using nested numbering as it makes many tables more clear to read and the grouping more obvious. The approach/formula below can easily be reused to support nesting of additional levels. Numbering nested indices manually is often a nightmare if we need to frequently rearrange rows or add/delete some rows in between. Unfortunately, no formula will know for us when to automatically increment the first few nested indices (e.g. 1.2.1 to 1.3.1), but we can automatically increment the last index in the nested numbering index. See below.
We need to input manually add the first index e.g.
Then below we can now use an automatic formula that we do the increment for us:
We can repeat this process with the next indices e.g. 2.1, 3.1 etc. We will always need to type the first one manually but the formula can help renumber the subsequent indices automatically e.g. 2.2,2.3,2.4 etc.
Hope this helps you with generating those dynamic row numbers!