scrape html add-in

Excel Scrape html by element id, name or… any regex!

Sometimes I have a need to quickly scrape some data from website to be able to work on them and update their values when needed e.g. stock prices, temperature, search results, statistics etc. From time to time I stumble upon similar issues. Below find 2 quick UDF functions (user defined functions) that you can use to scrape html items by id and name. Scrape HTML elements in Excel by ID, name or Regex.

Get element by id

An example of getting an element by ID:

Public Function GetElementById(url As String, id As String)
    Dim XMLHTTP As Object, html As Object, objResult As Object
    Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    XMLHTTP.send
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = XMLHTTP.ResponseText
    Set objResult = html.GetElementById(id)
    GetElementById = objResult.innerHTML
End Function

Get element by name

An example of getting an element by name.

Public Function GetElementByName(url As String, name As String)
    Dim XMLHTTP As Object, html As Object, objResult As Object
    Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    XMLHTTP.send
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = XMLHTTP.ResponseText
    Set objResult = html.GetElementByName(name)
    GetElementByName = objResult.innerHTML
End Function

Examples

You can use these functions directly like other Excel functions like this:

But these functions are often not enough to scrape more complicated structured data or require additional cleansing before being able to use the data. I thought therefore of using xpath at first but regular expressions seemed the more obvious solution. So I knocked up this more flexible alternative of the above functions which allows you to use any regex to scrape data of a website:

Get element by regex

An example of scraping an element by regex (regular expression).

Public Function GetElementByRegex(url As String, reg As String)
    Dim XMLHTTP As Object, html As Object, objResult As Object
    Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    XMLHTTP.send
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = XMLHTTP.ResponseText
    Set regEx = CreateObject("VBScript.RegExp")
    regEx.Pattern = reg
    regEx.Global = True
    If regEx.Test(XMLHTTP.ResponseText) Then
        Set matches = regEx.Execute(XMLHTTP.ResponseText)
        GetElementByRegex = matches(0).SubMatches(0)
        Exit Function
    End If
    GetElementByRegex = ""
End Function

Need an example? Here is one.

Let us say we want to scrape the latest headline off the cnn website:
scrape: cnn

Try this regex then:

=GetElementByRegex("http://edition.cnn.com/";"<h2 data-analytics=""_list-hierarchical-xs_article_"" class=""banner-text banner-text--natural""><strong>(.*?)</strong></h2>")

The result:
scrape

What did I do? I looked at the HTML of the Cnn website and notice the following HTML:

<h2 data-analytics="_list-hierarchical-xs_article_" class="banner-text banner-text--natural"><strong>
Why these girls fear the summer
</strong></h2>

As this piece of HTML is quite unique in the whole HTML content we can simply replace the header between the strong tags with the following regex (.*?). This will extract any string of characters between these tags which are not end of line characters.

Simple? Yes. If you want to play around with this I recommend you read more on using regular expressions e.g. on stackoverflow.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.