Tag Archives: extract data

split cells in excel

How to Split Cells in Excel – Split Excel Columns

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 5.00 out of 5)
Loading...

One common issue that many people have is to How to Split Columns in Excel upon non common delimiters e.g. Commas ,, Semicolons ;, or Non-Standard patterns such asCapital letters, a specific word – patterns which are not supported by the Text to Columns feature in the DATA ribbon tab.

How to deal with this? Usually you would need to resort to a VBA macro or really loooooong formula using the FIND, LEFT, RIGHT and LEN functions. That’s not an effective way of tackling this issue. What is the alternative? A simply Regex UDF function.

CONTENTS

Splitting Cells on Delimiter

A Delimiter is a sequence of 1 or more characters to separate columns within a Text String. An example of a Delimiter is the comma , in the following Text String Columns1,Column2 which separates the String Column1 from Column2. Popular Delimiters often used are:

  • Commas ,
  • Semicolons ;
  • Dots .
  • Tabs Tab
  • Spaces

A Delimiter can be any Sequence of Characters.

How to Split Cells in Excel using Text to Columns

The most obvious choice when wanting to Split Cells in Excel is to use the DATA Ribbon Text to Columns feature.

Select the Column

Select the Column with Cells you want to Split in Excel:

how to split cells in excel select column
Select Column

Select first column and proceed to Text to Columns

Select the entire first column where all your data should be located. Next click on the Text to Columns button in the DATA ribbon tab:

Data ribbon: Text to Columns
Data ribbon: Text to Columns

Proceed according to Wizard instructions

This is the hard part. Text to Columns need additional information on the delimiter and format of your columns.

Delimited or Fixed width?

Delimiters are any specific Sequence of Characters (like a comma or semicolon). Fixed Width means that each column in the Cell is separated by a Fixed Width of Whitespace Characters. In this case we select Delimited. Next click Next to proceed.

Text to Columns Wizard: Step 1
Text to Columns Wizard: Step 1

Select delimiter

Assuming your columns are separated with a specific Delimiter you need to provide this delimiter in the Wizard. Look at the Data preview to make sure your columns will be separated correctly. When finished proceed with Next

Text to Columns Wizard: Step 2
Text to Columns Wizard: Step 2

Format your columns (optional)

The last step is to format your columns if needed. If your columns represent Dates or you want to pull a column containing numbers/dates as text instead – be sure to format it appropriately. Usually, however, you are fine with hitting Finish:

Text to Columns Wizard: Step 3
Text to Columns Wizard: Step 3

The Resulting Split Columns

If you have proceeded according to the steps above you should have a neatly formatted spreadsheet like the one below.

how to split cells in excel the result
Splitted Columns in Excel

Split Cells on Patterns

Sometimes instead of Delimiters you want to Split your Excel Cells on Patterns that are dynamic and may be different for each cell in a certain column. Fortunately Excel supports Regular Expressions, which allow you to Define Patterns on which your Cells are to be Split.

Let us first introduce my often used GetRegex UDF function:

To install it – open your DEVELOPER Excel Tab, click Visual Basic and add the code above to any new VBA Module.
Its parameters are:

str
Any String.

reg
The Regular Expression including 1 Capture ().

index
The index of the capture e.g. if your Regex captures 3 matches then index=0 returns the first one and index=2 returns the 3rd one.

What does this UDF function do? If extracts any text via a Regular Expression. Let’s see it in action:

Split columns: Split the first column
Split first column

Example 1: Splitting Cells on Capital Letters

Let us take a common example where we have 1 Column of Cells that have 2 merged Columns inside. We want to split them on the second capital letter:

Split columns on regular expressions
Splitting columns on regular expressions

Now splitting this on the second capital letter using the FIND, LEFT, RIGHT and LEN functions will be a nightmare.

Let’s decipher the regular expression now:

Pattern Description
[A-Z].+? This will catch all words and whitespaces starting with a Capital letter the ? sign means that this is a non-greedy regular expression
([A-Z].+?) The () brackets will capture inside any pattern matching this regular expression
([A-Z].+?)[A-Z].+ The final regex this will capture only the first words and whitespaces starting with a Capital letter. Notice that the capture will end at the next Capital letter

Now for the second column:

Split columns: Split second column
Split second column

Great right? Now just drag the formula across the rows – and you are done!

Example 2: Splitting Cells on Whitespaces

A simple example – let us Split an Excel Cell on a Variable number of Whitespace characters. Let us say the Words in our String can have 1 or more Spaces in between.
split columns on pattern non whitespace
The Formula for the above is:

Again let us break it down:

Pattern Description
[^\s] This specifies a non-whitespace character
[^\s]+ This specifies at least 1 non-whitespace character
([^\s]+) The () brackets will capture inside any pattern matching this regular expression – hence all sequences of non-whitespace characters

Split Excel Function

If you just need an Excel Split function and you can introduce the following UDF Function (copy to VBA Module):

It uses the VBA Split function which is available in VBA only.

Parameters

str
A String.

delimiter
The delimiter on which the Split operation is to be done.

index
The index of the substring resulting from the Split.

How to use the Excel Split Function

The Function will be available as an Excel Function:
excel split string function
The above Formula looks like this (return second substring from Split):

automatic updating

Automatic Updating of Excel Worksheet

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 4.50 out of 5)
Loading...

Introducing the new extension to the Scrape HTML Add-In: Automatic Updating of Excel Worksheets. When configuring your Excel file to scrape data of HTML websites there is often the issue of how to refresh this data periodically. For example – you want to scrape stock price data and refresh it every 2 minutes. You can either do this manually, write some clever VBA using the Application.OnTime function or… simply easily configure the periodical refresh in the Automatic Update tool as part of the Add-In.

The Automatic Update tool

Automatic Update tool

So how does it work?

Configure the refresh rate for each worksheet

Automatic Update tool
Automatic Update tool

Click on the Automatic Updating button to see the list of worksheets and refresh rates. You can add any worksheet and configure the interval at which it should be updated. Then click “Off” to turn the updating on. And that is it! The tool will then refresh on the Get* functions (e.g. GetElementByRegex) on the interval.

No VBA required!

How to use this feature?

This will certainly be useful for users who:

  • Need fresh data – whether you are a stock broker, analyst in any other role. This will certainly make life easier
  • Are waiting for an important website update – maybe you are waiting for the new iPhone or want to sign up to a marathon with limited participation. This will also prove useful.
  • Want to make automatic reports – why not make a dashboard of reports based on web provided data? Make charts based on data all in one place w/o needing to constantly switch between websites to extract the information you need

See the video tutorial here:

scrape html add-in

Excel Scrape HTML Add-In now with HTML caching

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading...

I am constantly extending the Scrape HTML Add-In when I have some spare time. Today’s update adds Caching to the Get* functions.

See the latest version of the Excel Scrape HTML Add-In here:
Excel Scrape HTML Add-In

Why caching?

The primary concept of the Add-In is to reduce any need for writing VBA code. However, the problem with the Get* functions may be that if you specify multiple functions with the same URL there might be some overhead due to having to download the same HTML content just as many times – which might heavily impact performance. In order to solve this issue I have introduced the Cache functionality. If you specify the same URL in more than 1 Get* function then provided that the cache timeout has not passed the HTML content will not be refreshed from the Web but from the Cache. E.g. if you have 2 Get* functions with the same URL and a 60 sec timeout, then only the first call to the Get* function will download the content from the Web, whereas the second one will not – provided that the second call happens before the 60 sec timeout ends. You can change the setting at any time in the “Cache settings”.

Cache Settings
Cache Settings

regex

Excel Regex Tester Tool

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading...

I have recently added the Regex Tester Tool to the Scrape HTML Excel Add-In. Also as some might have noticed I decided to consolidate my HTML scraping UDF functions into the new Scrape HTML Excel Add-In.

What I was missing when using the UDF functions to scrape HTML content was some tool to test my regular expressions. Sometimes even a single character can render the whole regex to crash.

In the Scrape HTML Excel Add-In toolbox you will now find a very nifty tool called the Regex Tester.

Regex Tester Tool

Regex Tester Tool
Regex Tester Tool

The tool allows online evaulation of regular expressions – when any part of the expression is change the tool with evaluation your expression and returning the output. It comes useful when working with the GetElementByRegex and GetRegex functions but just as well can help you with any other regular expressions tasks!

Check out also this example video:

Download the tool together with the: Scrape HTML Excel Add-In.

I hope this will prove useful to some. Comment to let me know how this can be improved or what else should be in the Scraping HTML Excel Add-In toolbox!

scrape html add-in

Excel Scraping HTML by Regular expression continued…

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading...

After my post on “SCRAPE HTML BY ELEMENT ID, NAME OR… ANY REGEX!” I have been thinking about tinkering the macros a little bit more to make scraping HTML content even easier and reducing any additional needs for writing VBA code. What was missing in the puzzle was additional parsing of the scraped content i.e. let us say you want to download a HTML table row-by-row and cell-by-cell. Well the regex will probably capture your first row and cell… or the whole table leaving you with the dirty work of extracting the data you need for each row.

Struggling with Web Scraping using VBA? Check out my VBA Web Scraping Kit!

UDF VBA functions for scraping HTML

I therefore redefined the GetElementByRegex function and added an additional supporting function GetRegex:

'GetElementByRegex - capture HTML content by regular expression
Public Function GetElementByRegex(url As String, reg As String, Optional index As Integer)
    Dim XMLHTTP As Object, html As Object, objResult As Object
    Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    XMLHTTP.send
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = XMLHTTP.ResponseText
    Set regEx = CreateObject("VBScript.RegExp")
    regEx.Pattern = reg
    regEx.Global = True
    If regEx.Test(XMLHTTP.ResponseText) Then
        Set matches = regEx.Execute(XMLHTTP.ResponseText)
        If IsMissing(index) Then
            GetElementByRegex = matches(0).SubMatches(0)
        Else
            GetElementByRegex = matches(index).SubMatches(0)
        End If
        Exit Function
    End If
    GetElementByRegex = ""
End Function

'GetRegex - capture any regex from a string
Public Function GetRegex(str As String, reg As String, Optional index As Integer)
    Set regEx = CreateObject("VBScript.RegExp")
    regEx.Pattern = reg
    regEx.Global = True
    If regEx.Test(str) Then
        Set matches = regEx.Execute(str)
        If IsMissing(index) Then
            GetRegex = matches(0).SubMatches(0)
        Else
            GetRegex = matches(index).SubMatches(0)
        End If
        Exit Function
    End If
    GetRegex = ""
End Function

This may seem like a small change but see this example to appreciate how flexible and easy scraping HTML is now:

Example of Scraping HTML table

Let us use this example HTML table on w3schools.

Let us scrape each cell into a separate Excel cell. It took me only a couple minutes to get this done:

Scraping HTML table in Excel
Scraped HTML table in Excel

Now step by step:

First I scraped the whole table into cell B2 using the GetElementbyRegex function:

=GetElementByRegex("http://www.w3schools.com/html/html_tables.asp";"<table class=""reference"" style=""width:100%"">([^""]*?)</table>")

I did this in a separate cell to optimize the workbook (so that in case of a recalculation of the worksheet the site content does not have to be downloaded separately for each cell). Notice the regex ([^”]*?). This is a non-greedy capture of ALL characters (non-“). This guarantees that only this table is captured in the expression and not all tables. Using (.*)? would not be enough as the dot character does not match newlines.

Next getting the th header cells (next headers by changing the last index in the range 0-3):

=GetRegex(GetRegex($B$1;"<tr>([^""]*?)</tr>";0);"<th>([^""]*?)</th>";0)

This captures the first row and then extracts the first header.

Similarly the td cells (columns and rows depending on the indices):

=GetRegex(GetRegex($B$1;"<tr>([^""]*?)</tr>";1);"<td>([^""]*?)</td>";0)

This captures the second row and then extracts the first cell.

Download the Scrape HTML example

Download the full example:

Summary

This is in my opinion a very powerful set of tools for every analyst working daily on Internet based content. There is no need for writing any additional VBA as the GetRegex function can be nested any number of times to allow you to extract the data you need. Use the index parameter in these functions to capture cells in structured tables or repeating patterns to reduce the amount of code you need to write.

I appreciate your comments!