Web browser automation (using Microsoft’s Web Browser) is not an easy task in VBA when doing some Web Scraping. Excel is certainly a great tool for building your web automation scripts but at every corner you find some obstacles. For me the most challenging was always the nondeterministic state of the IE browser control whenever waiting for it to finish loading/refreshing a website.
The problem
Most browser automation code out there includes the following lines of code whenever waiting for the Internet Explorer browser control to finish loading or refreshing:
While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE) DoEvents Wend 'The row below may sometimes raise an error! Set element = objIE.Document.GetElementById(id)
The problem is, however, that although IE might state that it has fully loaded the web page this in fact does not always have to be the case and some code might generate errors at random lines of code.
The solution
One way of going around this problem is using arbitrary delays – leveraging Application.OnTime / Application.Wait / Sleep. This workaround, although somewhat useful, in most cases will be unsatisfactory as still does not guarantee that you will actually be able to access a web element or click a button that simply might not be loaded yet due to some JavaScript running in the background. So how to solve this issue?
My answer? Wait until the element appears accessible. See an example function below that will wait until the element is accessible on the web page and only then will it return it.
Public Function GetElementById(id As String, Optional isBlocking As Boolean) 'id: id of the html element; isBlocking: is the code to be blocked until the element is found Dim timeout As Long On Error Resume Next TryAgain: Set GetElementById = objIE.Document.GetElementById(id) If IIf(IsMissing(isBlocking), True, isBlocking) And _ (Err.Number <> 0 Or (GetElementById Is Nothing)) And _ timeout < maxTimeout Then Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5 GoTo TryAgain End If If (GetElementById Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementById" End Function
So will the original code above change? Not too much really when we use this new function:
While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE) DoEvents Wend 'This will wait maxTimeout miliseconds until raising an error or return Set element = GetElementById(id)
The code above waits for the element with id to appear accessible on the web page. The 5 millisecond Sleep interval is simply there so we are able to specify an upper threshold for an error to appear on a web page (10 seconds). This threshold is there to prevent us from waiting for an infinite period of time in case the web page crashed etc. This I believe is an honest way out of this conundrum.
A simple class for deterministic browser automation
Browser automation requires usually a lot of coding and in order to provide the deterministic automation we need we would need to encapsulate more procedures for getting elements by id, name, tagname, classname and even regular expressions.
The Google web page is also not an easy task for browser automation, assuming you would not embed your query in the GET params but treat it like a regular web page where all data is to be input and handled by forms.
The IE class I wrote handles queries to the Google Web Page like this:
Dim linkText as String, query as String: query = "Dog Wikipedia" On Error GoTo CleanIE Set ieClass = New IE ieClass.Navigate "https://www.google.pl/", False ieClass.GetElementByName("q").Value = query ieClass.GetElementByTagName("form").Submit ieClass.WaitForIE 'First wait for the page to mostly load linkText = ieClass.GetRegex("<h3(?:.|/n)*?<a onmousedown=""return(?:.|/n)*?"" href=""(?:.|/n)*?"">((?:.|/n)*?)</a>") CleanIE: ieClass.Quit
That’s 10 lines of code and not hassle with handling the accessibility of any web elements. All the blocking/waiting is handled in the Get* methods which will simply wait until the control is available in the Web Browser or raise an error otherwise if the maxTimeout threshold is breached. You might have noticed the there is still a method called WaitForIE which actually still waits for the browser to confirm that it is not Busy anymore. Why is it there? Because we need to be sure that the browser has at least mostly loaded the new content and that the element we are looking for is not found, by mistake, in the previous page content. This code has proven deterministic for me in over 1’000 tested queries!
One other interesting thing is the GetRegex method which I basically scrapped of my Excel Scrape HTML Add-In. I prefer regular expressions over any other methods for extracting contents from strings/web pages and I encourage you to do so too. It is much easier and once you get the hand of it, it really makes life easier.
The code of the IE class (click on the arrow to expand):
'!!!Please reference in Tools->References "Microsoft Internet Controls library"!!! Option Explicit Public Enum READYSTATE READYSTATE_UNINITIALIZED = 0 READYSTATE_LOADING = 1 READYSTATE_LOADED = 2 READYSTATE_INTERACTIVE = 3 READYSTATE_COMPLETE = 4 End Enum Private objIE As Object Const maxTimeout As Long = 10000 'Max time in milliseconds to wait until a control is found before raising error Public ElementNotFoundError As Long Public Function GetIE() Set GetIE = objIE End Function Private Sub Class_Initialize() ElementNotFoundError = 1 End Sub Public Sub Navigate(urlAddress As String, isVisible As Boolean) 'urlAddress: destination url; isVisible: should the IE window be visible Set objIE = New InternetExplorer: objIE.Visible = isVisible: objIE.Navigate urlAddress WaitForIE End Sub Public Sub WaitForIE() While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE) DoEvents Wend End Sub '----Get elements---- Public Function GetElementByName(name As String, Optional isBlocking As Boolean, Optional index As Long) 'name: name of the html element; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements with this name Dim elems As Object, timeout As Long On Error Resume Next TryAgain: Set elems = objIE.Document.GetElementsByName(name): Set GetElementByName = elems(IIf(IsMissing(index), 0, index)) If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementByName Is Nothing)) And timeout < maxTimeout Then Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5 GoTo TryAgain End If If (GetElementByName Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementByName" End Function Public Function GetElementById(id As String, Optional isBlocking As Boolean) 'id: id of the html element; isBlocking: is the code to be blocked until the element is found Dim timeout As Long On Error Resume Next TryAgain: Set GetElementById = objIE.Document.GetElementById(id) If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementById Is Nothing)) And timeout < maxTimeout Then Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5 GoTo TryAgain End If If (GetElementById Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementById" End Function Public Function GetElementByTagName(tagName As String, Optional isBlocking As Boolean, Optional index As Long) 'tagName: tagname of the html element; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements with this name Dim elems As Object, timeout As Long On Error Resume Next TryAgain: Set elems = objIE.Document.GetElementsByTagName(tagName): Set GetElementByTagName = elems(IIf(IsMissing(index), 0, index)) If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementByTagName Is Nothing)) And timeout < maxTimeout Then Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5 GoTo TryAgain End If If (GetElementByTagName Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementByTagName" End Function Public Function GetElementByClassName(className As String, Optional isBlocking As Boolean, Optional index As Long) 'className: classname of the html element; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements with this name Dim elems As Object, timeout As Long On Error Resume Next TryAgain: Set elems = objIE.Document.GetElementsByClassName(tagName): Set GetElementByClassName = elems(IIf(IsMissing(index), 0, index)) If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementByClassName Is Nothing)) And timeout < maxTimeout Then Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5 GoTo TryAgain End If If (GetElementByClassName Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementByClassName" End Function '----Get HTML by regex---- Public Function GetRegex(reg As String, Optional isBlocking, Optional index As Integer) As String 'reg: regular expression with 1 capture "()"; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements fulfilling this regular expression On Error Resume Next Dim regex, matches, timeout As Long Set regex = CreateObject("VBScript.RegExp") regex.Pattern = reg regex.Global = True If index < 0 Then index = 0 TryAgain: If regex.Test(objIE.Document.body.innerHtml) Then Set matches = regex.Execute(objIE.Document.body.innerHtml) GetRegex = matches(index).SubMatches(0) Exit Function End If If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or GetRegex = vbNullString) And timeout < maxTimeout Then Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5 GoTo TryAgain End If GetRegex = "" End Function Public Function GetMatchCount(reg As String) As Long 'reg: regular expression with 1 capture "()" On Error Resume Next Dim regex, matches Set regex = CreateObject("VBScript.RegExp") regex.Pattern = reg regex.Global = True If regex.Test(objIE.Document.body.innerHtml) Then Set matches = regex.Execute(objIE.Document.body.innerHtml) GetMatchCount = matches.Count Exit Function End If GetMatchCount = 0 End Function '----Quit and terminate---- Public Sub Quit() If Not (objIE Is Nothing) Then objIE.Quit Set objIE = Nothing End Sub Private Sub Class_Terminate() Quit End Sub
I hope this VBA class becomes the cornerstone of every browser automation script you write in VBA!
Download
Feel free to download the whole class file here:
Download
Next steps
Check out my Scrape HTML Add-In for easy HTML scraping without any VBA:
Excel Scrape HTML Add-In