Category Archives: Web Scraping

hta example

Simple class for browser automation in VBA

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 5.00 out of 5)
Loading...

Web browser automation (using Microsoft’s Web Browser) is not an easy task in VBA when doing some Web Scraping. Excel is certainly a great tool for building your web automation scripts but at every corner you find some obstacles. For me the most challenging was always the nondeterministic state of the IE browser control whenever waiting for it to finish loading/refreshing a website.

Be sure to checkout my VBA Web Scraping Kit which includes a whole worth of VBA scripts for Web Scraping

The problem

Most browser automation code out there includes the following lines of code whenever waiting for the Internet Explorer browser control to finish loading or refreshing:

While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
        DoEvents
Wend
'The row below may sometimes raise an error!
Set element = objIE.Document.GetElementById(id) 

The problem is, however, that although IE might state that it has fully loaded the web page this in fact does not always have to be the case and some code might generate errors at random lines of code.

The solution

One way of going around this problem is using arbitrary delays – leveraging Application.OnTime / Application.Wait / Sleep. This workaround, although somewhat useful, in most cases will be unsatisfactory as still does not guarantee that you will actually be able to access a web element or click a button that simply might not be loaded yet due to some JavaScript running in the background. So how to solve this issue?

My answer? Wait until the element appears accessible. See an example function below that will wait until the element is accessible on the web page and only then will it return it.

Public Function GetElementById(id As String, Optional isBlocking As Boolean)
'id: id of the html element; isBlocking: is the code to be blocked until the element is found
    Dim timeout As Long
    On Error Resume Next
TryAgain:
    Set GetElementById = objIE.Document.GetElementById(id)
    If IIf(IsMissing(isBlocking), True, isBlocking) And _
     (Err.Number <> 0 Or (GetElementById Is Nothing)) And _ 
     timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    If (GetElementById Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementById"
End Function

So will the original code above change? Not too much really when we use this new function:

While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
        DoEvents
Wend
'This will wait maxTimeout miliseconds until raising an error or return 
Set element = GetElementById(id)

The code above waits for the element with id to appear accessible on the web page. The 5 millisecond Sleep interval is simply there so we are able to specify an upper threshold for an error to appear on a web page (10 seconds). This threshold is there to prevent us from waiting for an infinite period of time in case the web page crashed etc. This I believe is an honest way out of this conundrum.

A simple class for deterministic browser automation

Browser automation requires usually a lot of coding and in order to provide the deterministic automation we need we would need to encapsulate more procedures for getting elements by id, name, tagname, classname and even regular expressions.


The Google web page is also not an easy task for browser automation, assuming you would not embed your query in the GET params but treat it like a regular web page where all data is to be input and handled by forms.

The IE class I wrote handles queries to the Google Web Page like this:

 Dim linkText as String, query as String: query = "Dog Wikipedia"
 On Error GoTo CleanIE
 Set ieClass = New IE
 ieClass.Navigate "https://www.google.pl/", False
 ieClass.GetElementByName("q").Value = query
 ieClass.GetElementByTagName("form").Submit
 ieClass.WaitForIE 'First wait for the page to mostly load
 linkText = ieClass.GetRegex("<h3(?:.|/n)*?<a onmousedown=""return(?:.|/n)*?"" href=""(?:.|/n)*?"">((?:.|/n)*?)</a>")
CleanIE:
 ieClass.Quit

That’s 10 lines of code and not hassle with handling the accessibility of any web elements. All the blocking/waiting is handled in the Get* methods which will simply wait until the control is available in the Web Browser or raise an error otherwise if the maxTimeout threshold is breached. You might have noticed the there is still a method called WaitForIE which actually still waits for the browser to confirm that it is not Busy anymore. Why is it there? Because we need to be sure that the browser has at least mostly loaded the new content and that the element we are looking for is not found, by mistake, in the previous page content. This code has proven deterministic for me in over 1’000 tested queries!

One other interesting thing is the GetRegex method which I basically scrapped of my Excel Scrape HTML Add-In. I prefer regular expressions over any other methods for extracting contents from strings/web pages and I encourage you to do so too. It is much easier and once you get the hand of it, it really makes life easier.

The code of the IE class (click on the arrow to expand):

Deterministic browser automation class

'!!!Please reference in Tools->References "Microsoft Internet Controls library"!!!
Option Explicit
Public Enum READYSTATE
    READYSTATE_UNINITIALIZED = 0
    READYSTATE_LOADING = 1
    READYSTATE_LOADED = 2
    READYSTATE_INTERACTIVE = 3
    READYSTATE_COMPLETE = 4
End Enum
Private objIE As Object
Const maxTimeout As Long = 10000 'Max time in milliseconds to wait until a control is found before raising error
Public ElementNotFoundError As Long
Public Function GetIE()
    Set GetIE = objIE
End Function
Private Sub Class_Initialize()
    ElementNotFoundError = 1
End Sub
Public Sub Navigate(urlAddress As String, isVisible As Boolean)
'urlAddress: destination url; isVisible: should the IE window be visible
    Set objIE = New InternetExplorer: objIE.Visible = isVisible: objIE.Navigate urlAddress
    WaitForIE
End Sub
Public Sub WaitForIE()
    While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
        DoEvents
    Wend
End Sub
'----Get elements----
Public Function GetElementByName(name As String, Optional isBlocking As Boolean, Optional index As Long)
'name: name of the html element; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements with this name
    Dim elems As Object, timeout As Long
    On Error Resume Next
TryAgain:
    Set elems = objIE.Document.GetElementsByName(name): Set GetElementByName = elems(IIf(IsMissing(index), 0, index))
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementByName Is Nothing)) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    If (GetElementByName Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementByName"
End Function
Public Function GetElementById(id As String, Optional isBlocking As Boolean)
'id: id of the html element; isBlocking: is the code to be blocked until the element is found
    Dim timeout As Long
    On Error Resume Next
TryAgain:
    Set GetElementById = objIE.Document.GetElementById(id)
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementById Is Nothing)) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    If (GetElementById Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementById"
End Function
Public Function GetElementByTagName(tagName As String, Optional isBlocking As Boolean, Optional index As Long)
'tagName: tagname of the html element; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements with this name
    Dim elems As Object, timeout As Long
    On Error Resume Next
TryAgain:
    Set elems = objIE.Document.GetElementsByTagName(tagName): Set GetElementByTagName = elems(IIf(IsMissing(index), 0, index))
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementByTagName Is Nothing)) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    If (GetElementByTagName Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementByTagName"
End Function
Public Function GetElementByClassName(className As String, Optional isBlocking As Boolean, Optional index As Long)
'className: classname of the html element; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements with this name
    Dim elems As Object, timeout As Long
    On Error Resume Next
TryAgain:
    Set elems = objIE.Document.GetElementsByClassName(tagName): Set GetElementByClassName = elems(IIf(IsMissing(index), 0, index))
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementByClassName Is Nothing)) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    If (GetElementByClassName Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementByClassName"
End Function
'----Get HTML by regex----
Public Function GetRegex(reg As String, Optional isBlocking, Optional index As Integer) As String
'reg: regular expression with 1 capture "()"; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements fulfilling this regular expression
    On Error Resume Next
    Dim regex, matches, timeout As Long
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = reg
    regex.Global = True
    If index < 0 Then index = 0
TryAgain:
    If regex.Test(objIE.Document.body.innerHtml) Then
        Set matches = regex.Execute(objIE.Document.body.innerHtml)
        GetRegex = matches(index).SubMatches(0)
        Exit Function
    End If
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or GetRegex = vbNullString) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    GetRegex = ""
End Function
Public Function GetMatchCount(reg As String) As Long
'reg: regular expression with 1 capture "()"
    On Error Resume Next
    Dim regex, matches
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = reg
    regex.Global = True
    If regex.Test(objIE.Document.body.innerHtml) Then
        Set matches = regex.Execute(objIE.Document.body.innerHtml)
         GetMatchCount = matches.Count
        Exit Function
    End If
    GetMatchCount = 0
End Function
'----Quit and terminate----
Public Sub Quit()
    If Not (objIE Is Nothing) Then objIE.Quit
    Set objIE = Nothing
End Sub
Private Sub Class_Terminate()
    Quit
End Sub

I hope this VBA class becomes the cornerstone of every browser automation script you write in VBA!

Download

Feel free to download the whole class file here:

Next steps

Check out my Scrape HTML Add-In for easy HTML scraping without any VBA:
Excel Scrape HTML Add-In

google maps distance

Excel VBA Calculate distance between two addresses or coordinates

1 Star2 Stars3 Stars4 Stars5 Stars (23 votes, average: 4.57 out of 5)
Loading...

Ever wanted to calculate the distance between two addresses in Excel? I recently had the following issue: from a list of over approx. 50 administration offices across my city I wanted to find the one that is closest to my workplace. Of course open Google Maps and type each location one by one… and then choose the shortest connection. By why bother when we have Google Maps Distance Matrix API!!!

This post includes information on how to calculate the distance in Excel:

To get the COORDINATES of any ADDRESS read this post

Using the Google Maps Distance Matrix API

Google has a lot of useful API out there and I encourage you to go sometime to the Google API Explorer and have a look at what other information you can easily utilize from Excel or your other applications.

Let’s however focus on getting the distance between two addresses. Google facilitates Google Maps Distance Matrix API for limited usage.

The API is configured using GET HTTP Params. A simple example below.
Say we want to get the distance between San Francisco and Victoria BC. We want to get there by bicycle. Copy this link into your browser:

https://maps.googleapis.com/maps/api/distancematrix/json?origins=Vancouver+BC|Seattle&destinations=San+Francisco|Victoria+BC&mode=bicycling&language=en
Google Distance between Vancouver, BC, Canada and San Francisco, CA, USA
Google Distance between Vancouver, BC, Canada and San Francisco, CA, USA

Google Distance Matrix Builder

Want to quickly test the Google API? Below a simple form for building a quick Distance URL. Simply type in the From an To addresses, select the transportation mode and hit Build URL!. Go here for more options for configuring the Google Maps Distance Matrix API.

From:
To:
Mode:

Google Maps Distance Matrix URL:

Calculate distance between two addresses using Google Maps in Excel

So I knocked up quickly this VBA Function in Excel which uses Google API distance matrix function to calculate the Google Maps distance. See the VBA code here:

Get Google Maps distance in meters

'Calculate Google Maps distance between two addresses
Public Function GetDistance(start As String, dest As String)
    Dim firstVal As String, secondVal As String, lastVal As String
    firstVal = "http://maps.googleapis.com/maps/api/distancematrix/json?origins="
    secondVal = "&destinations="
    lastVal = "&mode=car&language=pl&sensor=false"
    Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    URL = firstVal & Replace(start, " ", "+") & secondVal & Replace(dest, " ", "+") & lastVal
    objHTTP.Open "GET", URL, False
    objHTTP.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
    objHTTP.send ("")
    If InStr(objHTTP.responseText, """distance"" : {") = 0 Then GoTo ErrorHandl
    Set regex = CreateObject("VBScript.RegExp"): regex.Pattern = """value"".*?([0-9]+)": regex.Global = False
    Set matches = regex.Execute(objHTTP.responseText)
    tmpVal = Replace(matches(0).SubMatches(0), ".", Application.International(xlListSeparator))
    GetDistance = CDbl(tmpVal)
    Exit Function
ErrorHandl:
    GetDistance = -1
End Function

Get Google Maps duration (in seconds)

Public Function GetDuration(start As String, dest As String)
    Dim firstVal As String, secondVal As String, lastVal As String
    firstVal = "http://maps.googleapis.com/maps/api/distancematrix/json?origins="
    secondVal = "&destinations="
    lastVal = "&mode=car&language=en&sensor=false"
    Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    URL = firstVal & Replace(start, " ", "+") & secondVal & Replace(dest, " ", "+") & lastVal
    objHTTP.Open "GET", URL, False
    objHTTP.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
    objHTTP.send ("")
    If InStr(objHTTP.responseText, """duration"" : {") = 0 Then GoTo ErrorHandl
    Set regex = CreateObject("VBScript.RegExp"): regex.Pattern = "duration(?:.|\n)*?""value"".*?([0-9]+)": regex.Global = False
    Set matches = regex.Execute(objHTTP.responseText)
    tmpVal = Replace(matches(0).SubMatches(0), ".", Application.International(xlListSeparator))
    GetDuration = CDbl(tmpVal)
    Exit Function
ErrorHandl:
    GetDuration = -1
End Function

Calculate distance between two coordinates

You can calculate the distance between two coordinates (2 pairs of latitudes and longitudes) using either Google’s API or a simple VBA Function.

Calculate distance between coordinates using a VBA function

Taking into account the elliptic nature of Mother Earth you can easily calculate the distance between coordinates without using Google API .

Warning – no the world is not flat as opposed to what the Flat Earch Society is stating

The function returns the distance and using the unit variable you can state if you want the function to return the distance in Kilometres ("K"), Miles ("M") or even nautical miles ("N").

Public Function GetDistanceCoord(ByVal lat1 As Double, ByVal lon1 As Double, ByVal lat2 As Double, ByVal lon2 As Double, ByVal unit As String) As Double
    Dim theta As Double: theta = lon1 - lon2
    Dim dist As Double: dist = Math.Sin(deg2rad(lat1)) * Math.Sin(deg2rad(lat2)) + Math.Cos(deg2rad(lat1)) * Math.Cos(deg2rad(lat2)) * Math.Cos(deg2rad(theta))
    dist = WorksheetFunction.Acos(dist)
    dist = rad2deg(dist)
    dist = dist * 60 * 1.1515
    If unit = "K" Then
        dist = dist * 1.609344
    ElseIf unit = "N" Then
        dist = dist * 0.8684
    End If
    GetDistanceCoord= dist
End Function

Function deg2rad(ByVal deg As Double) As Double
    deg2rad = (deg * WorksheetFunction.Pi / 180#)
End Function

Function rad2deg(ByVal rad As Double) As Double
    rad2deg = rad / WorksheetFunction.Pi * 180#
End Function

Calculate distance between coordinates using Google Maps in Excel

To get Google Maps distance between two coordinates simply use the same GetDistance function as above and replace the start and dest parameters with the coordinates in this format:

41.43216,-81.49992

Final call example:

Debug.Print GetDistance("41.43206,-81.38992", "41.43216,-81.49992")

Same goes for the duration function:

Debug.Print GetDuration("41.43206,-81.38992", "41.43216,-81.49992")

How to set it up in Excel?

Important: Please remember that you need a direct Internet connection! Not via proxy etc.
Follow the steps:

Add new VBA Module

Go to the DEVELOPER ribbon and add select Visual Basic a and new Module to your VBA Project
insert calculate distance macro

Insert the VBA code

Insert the code from sections above (notice that the function is “Public” – therefore will be visible in your workbook as a so called UDF (User Defined Function) add calculate distance macro

Input the function in Excel

Go to the worksheet and input the function as shown below:

=GetDistance("Chicago"; "New York")

Make sure to replace ; with your default formula list separator (e.g. comma in the US)!
Calculate distance between addresses or coordinates

Calculate distance using Google Maps between any number of coordinates/addresses in Excel

Now that we know how to leverage our newly learned GetDistance and GetDuration functions we can use them to measure the distance/duration for any routes, with any coordinates/addresses in between our starting point and our destination. For this I created 2 simple procedures MultiGetDistance and MultiGetDuration:

Get distance between any number of locations

By using our GetDistance function we can get the distance between any number of locations using the Visual Basic procedure below:

Public Function MultiGetDistance(ParamArray args() As Variant) As Double
    MultiGetDistance = 0
    Dim startLoc As String, endLoc As String, i As Long
    For i = LBound(args) To UBound(args) - 1
        startLoc = args(i): endLoc = args(i + 1)
        MultiGetDistance = MultiGetDistance + GetDistance(startLoc, endLoc)
    Next i
End Function

Below example usage:

=MultiGetDistance("Chicago";"New York";"San Jose")

And here is the output:

Calculate distance using Google Mapsbetween multiple addresses
Get Google Maps distance between multiple addresses

Get duration between any number of locations

Similarly by using our GetDuration function we can get the duration between any number of locations using the Visual Basic procedure below:

Public Function MultiGetDuration(ParamArray args() As Variant) As Double
    MultiGetDuration = 0
    Dim startLoc As String, endLoc As String, i As Long
    For i = LBound(args) To UBound(args) - 1
        startLoc = args(i): endLoc = args(i + 1)
        MultiGetDuration = MultiGetDuration + GetDuration(startLoc, endLoc)
    Next i
End Function

Below example usage:

=MultiGetDuration("Chicago";"New York";"San Jose")

Parameters for calculating the Google Maps Distance

The following parameters are available in the API query:

Param Description
key Your application’s API key. This key identifies your application for purposes of quota management. Learn how to get a key from the Google Developers Console.
mode (defaults to driving) — Specifies the mode of transport to use when calculating distance. Valid values and other request details are specified in the Travel Modes section of this document. Other modes:

  • driving (default) for road network.
  • walking for pedestrian paths & sidewalks (where available).
  • bicycling for bicycling via bicycle paths & preferred streets (where available).
  • transit for public transit routes (where available).
language The language in which to return results. See list here
avoid Restrictions to the route. Available:

  • avoid=tolls
  • avoid=highways
  • avoid=ferries
units Unit system to use when expressing distance as text. Available:

  • units=metric (fordistances in kilometers and meters)
  • units=imperial

For other parameters see the original Google Distance Matrix API page:
The Google Distance Matrix API

Limits and common issues

Read more on the limitations of the Google Distance Matrix API here:
The Google Distance Matrix API

Google limits the amount of free queries to the following:

  • 100 elements per query.
  • 100 elements per 10 seconds.
  • 2 500 elements per 24 hour period.
There is a way around these limitation via the use of HTTP Proxies. Read more here.

So beware of any mass recalculation of these functions as you may quickly find yourself exceeding these limits.

Is the function returning -1 or simply not working properly? Be sure you have a direct Internet connection! The XMLHttpRequest Object above is not configured to handle proxy servers common for Office environments.

Common issues

The functions above are not full proof and don’t take into account:

  • That Google won’t find the exact addresses and will approximate with a similar address (see component filtering for more precision)
  • That Google might return several matches. While the function takes the first one from the returned list
  • HTTP or timeouts. See my Web Scraping Kit for how I dealt with such
  • That distances/duration differ depending on which location is set as origin and which is set as destination (one way roads, detours etc.)

Having trouble with matching a large dataset with distances and durations? Reach out for a free quote.

Download an example

You can download an example of the above here:

Help!

Need help with calculating the distance between two addresses using the approach above? Feel free to comment below or ask a new Question via Questions page.

Next steps

Want to get the geolocation (coordinates) of any address?
Excel: Get geolocation (coordinates) of any address
Want to add maps and other cool Google Charts in Excel?
Excel: Google Charts Tool
Want to use Google translate in Excel?
Excel: Google Translate functionality