
Excel Scrape HTML Add-In


What does it do?

The Scrape HTML Add-In is a simple Excel Add-In which lets you easily scrape HTML content from websites without writing a single line of VBA code. Most scraping tools/examples/add-ins are either very complicated or not very flexible, e.g. prepared only to handle a few simple HTML examples. The Scrape HTML Add-In relies on regular expressions instead of e.g. XPath.

Hello World!

Nothing beats a simple example. Before we drop into the details, here is a sneak peek at how easy it is to scrape the title from the Google web page using the Add-In's built-in functions:

Get Google web page title
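
For reference, a single formula along these lines should reproduce the screenshot above (the exact regex pattern is my own assumption; the GetElementByRegex function itself is described further below):

=GetElementByRegex("http://www.google.com";"<title>(.*?)</title>")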

How does it work?

See the example section for a more elaborate example. The core functionality of the Add-In is a set of Excel functions that become available after enabling the Add-In, e.g.:

  • GetElementById – extracts a single HTML element by id from a web page
  • GetElementByName – extracts a single HTML element by name from a web page
  • GetElementByRegex – extracts HTML from a web page using a defined regular expression
  • GetRegex – extracts a string from any string using a defined regular expression
  • RegexReplace – replaces all occurrences of a regular expression pattern with the specified string
  • And more…

Combined, these powerful functions allow you to scrape virtually any content from any website. Use e.g. the GetElementByRegex function to first scrape the raw part of the website, then use any number of nested GetRegex functions to extract the final content, and RegexReplace to cleanse the output.
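
As an illustration of this nesting idea only (the URL, id and patterns below are made-up placeholders, not taken from a real page), such a combination could look like this:

=RegexReplace(GetRegex(GetElementByRegex("http://www.example.com";"<div id=""prices"">([^""]*?)</div>");"<span>([^""]*?)</span>";0);"&nbsp;";" ")

Here GetElementByRegex grabs the raw div, the nested GetRegex extracts the first span, and RegexReplace strips the &nbsp; entities. The exact argument order of RegexReplace is my assumption based on its description above.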

The Add-In

The Add-In itself once enabled will appear in the Excel ribbon like this:

Scrape HTML Add-In Ribbon

The Add-In currently consists of the following features:

  • Insert function – inserts one of the useful Get* functions to the selected cell
  • Cache settings – allows you to change the caching of HTML requests to increase performance
  • Regex Tester – opens a tool for testing regular expressions (great when you are having trouble with your patterns)
  • Scrape HTML Tool – opens a tool for scraping all matches of a certain scraping pattern
  • Automatic Updating – opens a tool for enabling automatic updating of selected worksheets (only GetElementBy* functions)

The Regex Tester

When playing with scraping data from websites, it often happens that you do not land on the correct regular expression on your first go. The Regex Tester allows you to quickly test your regex against any string. The tool reacts to any change in the regular expression, validating it against the string as you type. This saves a lot of time, believe me.
See this video on using the Regex Tester:

The Scrape HTML Tool

The Get* functions are great for scraping very specific items from a website. If, however, you want to quickly scrape all items of a certain type, this is where the Scrape HTML Tool comes in handy. It comes with several predefined scraping expressions, e.g. for scraping URLs and img src properties.
See this video on how the Scrape HTML Tool can help you:
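
For reference, typical expressions of this kind look roughly like the patterns below (these are my own approximations, not necessarily the exact expressions shipped with the tool):

href="([^"]*?)" – captures link URLs
<img[^>]*src="([^"]*?)" – captures image src properties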

Caching

The primary concept of the Add-In is to avoid any need for writing VBA code. The catch with the Get* functions is that if you specify multiple functions with the same URL, the same HTML content would have to be downloaded just as many times, which might heavily impact performance. To solve this issue I have introduced the Cache functionality. If you specify the same URL in more than one Get* function, then, provided the cache timeout has not passed, the HTML content will be served from the Cache instead of being refreshed from the Web. E.g. if you have 2 Get* functions with the same URL and a 60 sec timeout, only the first call will download the content from the Web, whereas the second one will not, provided it happens before the 60 sec timeout ends. You can change the setting at any time in the "Cache settings".
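
For example, with a 60 sec timeout the two formulas below (the URL and patterns are just placeholders) will trigger only one download, because they share the same URL and recalculate within the cache window:

=GetElementByRegex("http://www.example.com";"<h1>([^""]*?)</h1>")
=GetElementByRegex("http://www.example.com";"<h2>([^""]*?)</h2>")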

Cache Settings

Automatic Updating

When configuring your Excel file to scrape data from HTML websites, there is often the question of how to refresh this data from the web periodically. For example, you want to scrape stock price data and refresh it every 2 minutes. You can either do this manually, write some clever VBA using the Application.OnTime function, or simply configure the periodic refresh in the Automatic Update tool that is part of the Add-In.
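
For comparison, the "clever VBA" route mentioned above could look roughly like the sketch below (this is not the Add-In's code; the interval is a placeholder):

' Minimal Application.OnTime sketch - reschedules itself every 2 minutes and forces
' a full recalculation so that the Get* formulas refresh. Interval is a placeholder.
Public Sub ScheduleRefresh()
    Application.OnTime Now + TimeValue("00:02:00"), "ScheduleRefresh"
    Application.CalculateFull ' recalculates every formula, including the Get* functions
End Sub

With the Automatic Update tool you get the same effect without writing or maintaining such a macro.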

See example below:

Example of scraping a HTML table

Let us use this example HTML table on w3schools:

Example HTML from w3schools

Let us scrape each cell into a separate Excel cell. It took me only a couple of minutes to get this done:

Scraped HTML table in Excel

Now step by step:

First I scraped the whole table into cell B1 using the GetElementByRegex function:

=GetElementByRegex("http://www.w3schools.com/html/html_tables.asp";"<table class=""reference"" style=""width:100%"">([^""]*?)</table>")

I did this in a separate cell to optimize the workbook (so that in case of a recalculation of the worksheet the site content does not have to be downloaded separately for each cell). Notice the regex ([^"]*?). This is a non-greedy capture of all characters except the double quote. This guarantees that only this table is captured in the expression and not all tables. Using (.*?) would not be enough, as the dot character does not match newlines.
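
If you want to check the dot-vs-newline behaviour yourself, here is a quick VBA sketch using the VBScript.RegExp object (a stand-in test with a made-up string; I am assuming the Add-In's regex engine behaves compatibly):

Sub RegexNewlineDemo()
    Dim re As Object, html As String
    ' multi-line HTML snippet
    html = "<table>" & vbCrLf & "<tr><td>1</td></tr>" & vbCrLf & "</table>"
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "<table>(.*?)</table>"
    Debug.Print re.Test(html)              ' False - the dot stops at the newline
    re.Pattern = "<table>([^""]*?)</table>"
    Debug.Print re.Test(html)              ' True - the negated class matches across newlines
End Sub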

Next, get the th header cells (move to the next header by changing the last index, in the range 0-3):

=GetRegex(GetRegex($B$1;"<tr>([^""]*?)</tr>";0);"<th>([^""]*?)</th>";0)

This captures the first row and then extracts the first header.

Similarly the td cells (columns and rows depending on the indices):

=GetRegex(GetRegex($B$1;"<tr>([^""]*?)</tr>";1);"<td>([^""]*?)</td>";0)

This captures the second row and then extracts the first cell.
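
To fill the whole grid without editing the indices by hand, you could also derive them from the cell position (this is my own extension, not shown in the screenshots above; it assumes the raw table sits in B1 and the output grid starts at B4):

=GetRegex(GetRegex($B$1;"<tr>([^""]*?)</tr>";ROW()-3);"<td>([^""]*?)</td>";COLUMN()-2)

Dragging this formula right and down then walks through the rows and columns automatically.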

Documentation

Find the complete documentation here:
Documentation

Installation

See an example of how to install Excel Add-Ins (Excel 2007 and above):
Installation

Download

Download the Scrape HTML AddIn here. Provide your email to get your download link:

Next steps

Check out these links:
Web Scraping Tutorial

38 thoughts on “Excel Scrape HTML Add-In”

  1. This code in the Ribbon Control:
    Public Sub ShowWebTreeviewTool(ByVal control As IRibbonControl)
    Dim wtv As WebTreeviewTool
    Set wtv = New WebTreeviewTool
    wtv.Show
    End Sub
    results in a request for a missing MSExchange control

    in the OnLoad(ribbon….
    you have GetElementByRegexArgDesc(1) repeated 3 times – you need to increment the number

    1. Thanks Sean! I had not noticed! Anyway, I have just updated the Add-In with a couple of new functionalities, so I will fix this too and post the new version in a moment.

      1. This is the error I get when using your example of scraping the table from the w3schools:
        Compile error in hidden module: ScrapeHTML.
        This error commonly occurs when code is incompatible with the version, platform, or architecture of this application. Click “Help” for information on how to correct this error.

        I am running Excel 2010 32 bit (14.0.7109.5000) SP1 MSO (14.0.7106.5003)
        Part of Microsoft Office Professional Plus 2010

        If you want to converse directly, outside of having multiple forum messages, feel free to use the email attached to these messages.

  2. I keep getting the following errors:

    Could not load some objects because they are not available on this machine.

    Compile error in hidden module: RibbonControl.
    This error commonly occurs when code is incompatible with the version, platform, or architecture of this application. Click “Help” for information on how to correct this error.

    I am running Excel 2013 x64. Is there any solution for this?

    1. Hi Randy,
      I am afraid that the Scrape HTML Add-In may be incompatible with Excel 2013 as it was developed and tested in Excel 2010.
      I will reach out to you and let's see together if we can find a way to fix this.

    1. Hi Harry,
      the link works now – feel free to try. Yesterday I was doing an upgrade to my site’s Download Manager therefore I was experiencing some issues with file links – solved now.

      1. The download link works now, thanks!

        However, when I installed the scraper (by just opening it), Excel returned an error about something missing. After that, the "Scrape HTML" ribbon is visible in Excel, but it doesn't work: when I test the Google example above, it returns "Compile error in hidden module: Misc".

        Is the scraper not compatible with Excel 2013?
        Or perhaps I installed it wrong.

        Also, both the links to “Installation” and “Documentation” on this page are broken.

          1. Hi Harry,

            thanks for mentioning this. Try the new version – it is open for editing. Go to Tools->References and see if there are any libraries missing. Maybe I can try to fix this somehow. Other than that, you can try to debug the functions that are not working for you and send me the error message.

  3. Astonished by your site and the capacities of this tool.
    But is it running under Google Chrome (I fear it doesn't)?

    Does it work with pages written in HTML with JavaScript or other esoteric languages?

    The import-from-web functionality of Excel 2007 does not work well with new complex pages, failing to find the tables to retrieve.
    Yours truly

    1. Hi Rando, thanks for the praise!

      But is it running under Google Chrome (I fear it doesn't)?

      Actually, the Excel Scrape HTML Add-In does not run in, nor require, any browser. This is because, in the backend, it uses the XMLHttpRequest object. What is the XMLHttpRequest object? It is a script object used for sending HTTP requests, similarly to how browsers do, but more lightweight and primitive (like a mini-browser). Therefore no browser object (IE, Chrome or any other) is used or created in the process of sending these HTTP requests. If you are familiar with HTML/Javascript, this is the same object used natively by AJAX, the feature that allows websites to asynchronously exchange data with the server without requiring you to refresh the webpage (e.g. hit F5).
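
      In VBA terms, such a request boils down to just a few lines, roughly like this (a generic sketch using the MSXML2.XMLHTTP object; the exact object and headers the Add-In uses may differ):

      Dim http As Object
      Set http = CreateObject("MSXML2.XMLHTTP")
      http.Open "GET", "http://www.example.com", False ' synchronous GET request
      http.send
      Debug.Print http.responseText ' the raw HTML, ready for the regex functions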

      Does it work with pages written in HTML with JavaScript or other esoteric languages?

      It will work with most pages using JavaScript. The only limitation is that some websites do not let you access all their resources using only GET parameters (the ones appended to the URL, e.g. http://site.com?getParam1=xxx). In some cases you need to interact with the website (click a button, enter text etc.) and only then will you get access to the data you need. In those cases the Scrape HTML Add-In will not be sufficient; feel free to leverage IE automation instead: http://analystcave.com/wp-admin/post.php?post=743&action=edit.

      Cheers

  4. Thanks for those precisions,
    I understand that it is an XL add-in which requires a specific library to run. So it should work on my system even though I am not allowed to install software? I hope so.
    I would like to automate different tasks:
    1. downloading all the (Excel) files targeted by http links in the web page,
    2. pulling into an Excel file the contents of tables ("multitables" or "multiframes"), that is financial data, embedded in the web page,
    3. feeding Excel and Word files (a "log" of my browsing) with the selected text on the webpage (MS Word), the URL of the current pages and some "home" keywords edited in an ad-hoc dialog box. That should be performed by launching a shortcut key or command from the webpage, without activating the Excel window.
    I would be glad for your reactions to those projects (especially the 3rd one) as to their feasibility and the amount of effort necessary (I am only a little gamer in XL VBA).
    I hope to hear from you soon.
    Thanks.

    1. It's an XLA add-in which only requires that it be placed here "C:\Users\USERNAME\AppData\Roaming\Microsoft\AddIns" and enabled in Excel Add-Ins. It does not use any external libraries (only internal Excel references), therefore it should work without admin privileges and does not require installing any additional software.

  5. Hello, I am using Excel 2007. How can I install the Excel Scrape HTML Add-In (v1.20141023)? I have other add-ins (special text replace). I have already downloaded this plug-in, but I don't know where and how to install it.

    Thanks, looking forward to your reply to this post.

  6. Hi
    An amazing tool it seems. But I cannot get it to work.
    Running MS Office Professional Plus 2010 32-bit
    The script seems to fail in this line:
    ______
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "Cache-Control", "no-cache"
    XMLHTTP.setRequestHeader "Pragma", "no-cache"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    XMLHTTP.send '<---- THIS LINE FAILS
    ______
    Stepping through the code, it simply stops. No error message, no debugging (even after disabling On Error Resume Next).

    /Thomas

    1. Hi Thomas,

      it may be failing for reasons associated with your Internet connection. Are you connecting via a proxy, e.g. an office server? If yes, then the method will fail. If this is the case, you need to either connect directly to the Internet or configure the proxy – see link here.

      1. Hi
        I installed the add-in:
        1417680049wpdm_ScrapeHTMLAddIn
        on my Excel 2013.
        Now when I open Excel I get this warning:
        [Microsoft Forms]
        Could not load an object because it is not available on this machine.
        I click [OK] and then open a new empty workbook.
        I select the Scrape HTML tab, then select GetElementByRegex or anything else from the drop-down box.
        A warning tone sounds, and 'f' appears in the selected cell.
        From then on, everything I click in the Scrape HTML bar does not do anything except beep a warning sound.
        The VBA window is stuck as well.
        Only after I select another cell can I continue working, but the same problem occurs again.
        Thanks for your help
        Eran

  7. Hi!,

    Thanks for your clear examples and hands-on information.
    I'm quite familiar with IE object scraping. But I'm facing a problem: alert windows/messages.

    The case: when the user forgets to answer a form question, an alert (popup) is shown.
    The problem is that, when automating from Excel, the VBA procedure 'hangs' until the popup is clicked.
    I'd like to catch this popup, to test whether the correct popups are shown to the user (and afterwards do a SendKeys (Enter) to continue).
    The popups are triggered by JavaScript events, but I'd like to keep the events in place to make sure the event is triggered (GUI testing).

    Any thoughts how to achieve this? I was thinking of your ‘multithreaded’ tutorial, but how can I interact (stop) the other thread after some time?

    Hope to hear from you,

    Kind regards, Tom

    1. Hi Tom,

      this is a common issue with VBA Web Scraping automation. In the case of VBA there are 2 ways around this:
      – Intercepting the pop-up window directly
      – Sending a key (Enter or Escape, whichever works) using the SendKeys command in VBA (the IE Object window has to be topmost and visible though)

      Whenever confronted with this issue, however, I resort to Selenium in Python, which handles this easily. Python is quite easy to learn – see the Selenium example in my Web Scraping Tutorial.
      What is more, if you are doing GUI testing you will find Selenium much more convenient.

      Let me know if this helps!

      1. Thanks Anthony. Error message is:

        “Can’t find project or library”

        It opens up a list of Available References (such as "Visual Basic for Applications"). Some boxes are checked, some aren't. I've experimented but I suspect the problems are the references listed as:

        MISSING: Ref Edit control#Ref edit control
        – Path: /C\/Program files (x86)\Microsoft Office\Office 14REFEDIT.dll
        MISSING: Microsoft Windows Common Controls 6.0 (SP6)#Microsoft Windows
        – Path: /C\/Windows\syswow64\MSCOMCTL.ocx

        Naturally these paths don't make much sense on a Mac.

        I've quit and restarted Excel, but when pasting your examples it tells me it's not available on this machine.

        Thanks.

          1. Many thanks. Just realised my last sentence could be clearer, if that's any help: when pasting your GetElementByRegex examples I get the message "Not available on this machine".

          2. Hi Jonathan, I just uploaded the latest version without the RefEdit reference. Let me know if this fixes the issue on Mac ;).

  8. Many thanks for looking into that. It now seems to recognise the commands (e.g. GetElementById) when I enter them as a formula, but I keep getting the same error messages. It also seems to be looking for the reference "MISSING: Microsoft Windows Common Controls 6.0 (SP6)".

    Tried this in Excel 2011 and 2016 and thanks again for your efforts on this.

  9. Hi, I have a very interesting problem. I have 2 Toshiba laptops, one running Windows 7 & Excel 2013 and the second running Windows 10 & Excel 2013. The Windows 7 & Excel 2013 machine works fine with the old ScrapeHTML add-in (1417680049wpdm_ScrapeHTMLAddIn). However, neither the old add-in nor the new add-in (2015-11-03 ScrapeHTMLAddIn) works on the Windows 10 & Excel 2013 machine. I get a "Compile Error: Can't find project or library" message when I enter the "=GetElementByRegex(A1,A10)" formula in any cell or in the formula bar. Also, I am using the exact same Excel file and VB macros on both machines. Any thoughts? Thanks. J

    1. Hi, I figured out my problem. It turns out that for some reason, when I installed Excel 2013 on my Windows 10 machine, the ActiveX Control file "MSCOMCTL.OCX", normally located in "C:\Windows\SysWOW64", was missing. Once I copied the file to the aforementioned path, the "Compile Error: Can't find project or library" message stopped occurring and the new Add-In worked! Thanks. J

        1. I am looking forward to the next AddIn update! Also, I am having a problem with Automatic Updating. Whenever I attempt to configure it, I get the following error msg.: “Could not load an object because it is not available on this machine.” Hope you’re able to get it to work with Windows 10, 64 bit & Excel 2013. I do have a special request. Can you code the Automatic Updating tool portion of the AddIn to allow Users to configure the update interval in seconds instead of minutes? Thanks. J

  10. I just found this Excel add-in tool yesterday and it looks like it could really help me with an Excel project I have. I am using Office 2007 on a Toshiba laptop running Windows 10. I downloaded your add-in tool and moved it to the designated directory location per your installation instructions. When I clicked on the worksheet it opened without a spreadsheet and without any error messages, so I started a new spreadsheet. I went to the Excel Options for Add-ins and verified that the ScrapeHTMLAddIn was available and the box was checked. However, the Scrape ribbon does not show on the Developer tab. Also, when I tried to cut and paste an example function from your web site into my Excel sheet, it said the formula was incorrect. Any help you can give me in getting this going so I can take it for a test drive would be appreciated.
