Tag Archives: scrape html


Web Scraping Proxy HTTP request using VBA


Visual Basic for Applications (VBA) is great for making your first steps in Web Scraping, as Excel is ubiquitous and a great training arena for learning. Web Scraping comes down to making HTTP requests to websites. At some point, however, you will find that some websites cut you off or block repeated HTTP requests. This is where you need proxy servers to hide your HTTP requests behind multiple IPs. Using a proxy server is easy in VBA, so let’s study a couple of real-life examples.

What is a proxy server?

A proxy server is a server that acts as an intermediary for Internet/Intranet HTTP requests. A client machine connects to the proxy server, requesting a service, file, connection, web page, or other resource available from a different server and the proxy server serves that request. Proxy servers often provide anonymity.

A typical proxy server HTTP request

VBA HTTP Request using ServerXMLHTTP

Before we make a proxy HTTP request using VBA, let’s look at what a regular HTTP request looks like. I typically use the ServerXMLHTTP request object because, as I will show in the next section, it can easily be extended to run HTTP requests via a proxy IP.

Function GetResult(url As String) As String
'Returns the HTML response of a simple synchronous GET request
    Dim XMLHTTP As Object, ret As String
    Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP.4.0")
    XMLHTTP.Open "GET", url, False 'False = synchronous request
    XMLHTTP.send
    ret = XMLHTTP.ResponseText
    GetResult = ret
End Function

Sub Example()
    'Print the HTML of Google.com
    Debug.Print GetResult("http://www.google.com")
End Sub

VBA HTTP Request via proxy

Let us now extend the above example to run the same HTTP request via a proxy server. Notice the new setProxy line below:

Function GetResult(url As String) As String
    Dim XMLHTTP As Object, ret As String
    Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP.4.0")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setProxy 2, "xxx.xxx.xxx.xxx:80" '2 = SXH_PROXY_SET_PROXY: route the request via the given proxy
    XMLHTTP.send
    ret = XMLHTTP.ResponseText
    GetResult = ret
End Function

The xxx.xxx.xxx.xxx is the IP of your proxy server and 80 is the default HTTP port. The first argument (2, i.e. SXH_PROXY_SET_PROXY) tells ServerXMLHTTP to route the request through the specified proxy. You can google free online proxy IPs that can be used to run proxy requests.

Last thoughts on VBA Proxy HTTP requests

Right... so do play around with this new piece of knowledge, as this is one of the most important aspects of Web Scraping (unless you want to get your IP blocked). Here are some other tips to keep in mind:

  • Use at least 5-10 proxy servers, or else your proxies will quickly get blocked too
  • Be sure to handle proxy server errors e.g. 500, 403. A good approach is to retry any server error with an additional proxy server
  • Test your proxies before use – free proxy servers especially have a lot of downtime. A proxy IP that worked yesterday may not work today. I tend to run a simple HTTP request against all proxies and remove any that raise server errors before use (see the sketch below this list)
  • Don’t cross the line: Web Scraping is useful when you have a lot of queries you want to run automatically instead of running them manually. Don’t cross the line by stealing content or abusing a website’s TOS!
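
To make the proxy-testing tip concrete, here is a minimal sketch. GetResultViaProxy is a hypothetical variant of the proxy-enabled GetResult above with the proxy address passed in as a parameter, and the proxy IPs are placeholders:

Function GetResultViaProxy(url As String, proxyAddress As String) As String
'Like the proxy-enabled GetResult above, but with the proxy passed in as a parameter
    Dim XMLHTTP As Object
    Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP.4.0")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setProxy 2, proxyAddress '2 = SXH_PROXY_SET_PROXY
    XMLHTTP.send
    GetResultViaProxy = XMLHTTP.ResponseText
End Function

Function GetFirstWorkingProxy(proxies As Variant) As String
'Runs a simple test request through each proxy and returns the first one that responds
    Dim i As Long, html As String
    For i = LBound(proxies) To UBound(proxies)
        On Error Resume Next
        html = vbNullString
        html = GetResultViaProxy("http://www.google.com", CStr(proxies(i)))
        If Err.Number = 0 And Len(html) > 0 Then
            On Error GoTo 0
            GetFirstWorkingProxy = CStr(proxies(i)) 'This proxy works
            Exit Function
        End If
        Err.Clear: On Error GoTo 0
    Next i
    GetFirstWorkingProxy = vbNullString 'No working proxy found
End Function

Sub TestProxies()
    'Placeholder proxy addresses - replace with your own list
    Debug.Print GetFirstWorkingProxy(Array("xxx.xxx.xxx.xxx:80", "yyy.yyy.yyy.yyy:8080"))
End Sub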

Web Scraping Tools for Beginners and the Advanced


Web Scraping is a very wide topic and almost a profession of its own. It is an especially valuable tool for SEO specialists, data scientists, analysts and many others. Because of this, there are tons of tools out there, and trying to find the right one can be a real nightmare. For those who don’t have the time, I dedicate this overview:

Ranking of Web Scraping tools/libraries (ease of use)

Level: Beginner

You need not be an expert coder to start extracting the data you need from websites! Web Scraping is a well-known subject and many tools have been developed to make it easier to scrape HTML content. The tools below do not require any coding experience.

1. Excel Power Query (From Web)
  • Easy and quick to use
  • Limited to HTML tables
  • Available only in Excel 2010 and above (previous versions have the less useful “Data->From Web” feature)
2. Web Scraper plugin for Chrome
  • Quite easy to learn
  • Easy to configure straight from Chrome
  • Highly configurable (many options for selecting/scraping elements)
  • Available only for Chrome
3. Import.io
  • Quite easy to learn
  • Includes a wizard which walks you through the process
  • Can be used to scrape numerous subpages of a page
4. Scrape Box
  • Commercial (not free)
  • A simple app for scraping specific data off websites e.g. emails, links etc.
  • Easy configuration e.g. select the patterns you want to scrape and a list of websites/IPs
5. Scrape HTML Tool
  • Quite easy to learn – requires only knowing how to build regular expressions
  • Provides simple UDF Excel functions
  • Provides integrated features to enhance Web Scraping performance (caching, automatic updating etc.)
  • Available only in Excel 2010 and above

Level: Advanced

The tools/libraries below require some coding experience.

1. Selenium (Python, C#, Java, R etc.)
  • The best kit for programmatically simulating web browser interaction
  • Available in most popular programming languages like Python, C#, Java (even for R – RSelenium)
  • Very easy to learn
  • Provides drivers for simulating user interaction in most browsers e.g. Chrome, FireFox, IE etc.
2. Scraper Wiki
  • A popular platform with web scraping tools
  • Easily transform web content into publicly (or privately) available data sets
  • Allows you to learn web scraping in the most popular technologies
  • Lets you order web scraping assistance
3. Kimono Labs
  • A popular web scraping website
  • Enables you to easily create web interfaces to extract the web content you need
  • Interfaces available in multiple formats
4. Scrapy (Python)
  • Allows crawling (web spiders) and scraping of websites
  • Moderate level of coding proficiency required
  • One of the most popular web scraping libraries

Web Scraping libraries by programming language

With so many programming languages there are naturally many web scraping libraries out there. Here you can find a short list of the most popular web scraping libraries for each programming language.

Language Web Scraping Libraries
.NET (e.g. C#)
  • Html Agility Pack
  • WatiN
Java
  • Tag Soup
  • HtmlUnit
  • Web-Harvest
  • jARVEST
  • jsoup
  • Jericho HTML Parser
JavaScript
  • node.io
  • phantomjs
PHP
  • htmlSQL
Python
  • Scrapy
  • Selenium-Python
  • Beautiful Soup
  • lxml
  • HTQL
  • Mechanize
R
  • Rvest
  • RSelenium
Ruby
  • Nokogiri
  • Hpricot
  • Mechanize
  • scrAPI
  • scRUBYt!
  • wombat
  • Watir

Web Scraping Tools for Data Scientists

Are you a data scientist looking for the best Web Scraping tools out there? Currently in data science communities (e.g. Kaggle) Python and R are the most highly regarded programming languages. Therefore find below a short list of libraries to consider for both (drawn from the table above):

Python:
  • Scrapy
  • Beautiful Soup
  • lxml
  • Selenium-Python

R:
  • Rvest
  • RSelenium

Next steps

Want to learn more on Web Scraping? Check out these links:
Web Scraping Tutorial
Excel Scrape HTML Add-In


Simple class for browser automation in VBA


Web browser automation (using Microsoft’s Web Browser control) is not an easy task in VBA when doing some Web Scraping. Excel is certainly a great tool for building your web automation scripts, but at every corner you find some obstacles. For me the most challenging one was always the nondeterministic state of the IE browser control when waiting for it to finish loading/refreshing a website.

Be sure to check out my VBA Web Scraping Kit, which includes a wealth of VBA scripts for Web Scraping.

The problem

Most browser automation code out there includes the following lines of code whenever waiting for the Internet Explorer browser control to finish loading or refreshing:

While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
        DoEvents
Wend
'The line below may sometimes raise an error!
Set element = objIE.Document.GetElementById(id) 

The problem, however, is that although IE might state that it has fully loaded the web page, this in fact is not always the case, and the code may raise errors at seemingly random lines.

The solution

One way around this problem is using arbitrary delays – leveraging Application.OnTime / Application.Wait / Sleep. This workaround, although somewhat useful, will in most cases be unsatisfactory, as it still does not guarantee that you will actually be able to access a web element or click a button that simply has not loaded yet due to some JavaScript running in the background. So how to solve this issue?
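
For reference, the arbitrary-delay workaround typically looks like the sketch below (the 2-second pause is a guess that may be too short on a slow connection or wastefully long on a fast one):

While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
        DoEvents
Wend
'Pause for an arbitrary fixed interval and hope the JavaScript has finished
Application.Wait Now + TimeSerial(0, 0, 2)
'May still fail if the element has not been rendered yet!
Set element = objIE.Document.GetElementById(id)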

My answer? Wait until the element is accessible. See the example function below, which waits until the element is accessible on the web page and only then returns it.

Public Function GetElementById(id As String, Optional isBlocking As Variant)
'id: id of the html element; isBlocking: should the call block until the element is found (defaults to True)
'Note: isBlocking is a Variant so that IsMissing works; objIE, maxTimeout, ElementNotFoundError and Sleep are declared at class level (see the full class below)
    Dim timeout As Long
    On Error Resume Next
TryAgain:
    Set GetElementById = objIE.Document.GetElementById(id)
    If IIf(IsMissing(isBlocking), True, isBlocking) And _
     (Err.Number <> 0 Or (GetElementById Is Nothing)) And _
     timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5 'Poll every 5 ms
        GoTo TryAgain
    End If
    If (GetElementById Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementById"
End Function

So how much will the original code above change? Not too much, really, when we use this new function:

While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
        DoEvents
Wend
'This will wait up to maxTimeout milliseconds before raising an error or returning the element
Set element = GetElementById(id)

The code above waits for the element with the given id to become accessible on the web page. The 5 millisecond Sleep interval is there so we can enforce an upper threshold (maxTimeout, here 10 seconds) after which an error is raised. This threshold prevents us from waiting for an infinite period of time in case the web page crashed etc. This, I believe, is an honest way out of this conundrum.
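
Note that thanks to the optional isBlocking parameter you can also poll without blocking: the call below returns immediately with whatever is currently on the page (the "content" id is just an example):

Dim element As Object
On Error Resume Next 'The function raises ElementNotFoundError when the element is absent
Set element = GetElementById("content", isBlocking:=False)
On Error GoTo 0
If element Is Nothing Then Debug.Print "Element not loaded yet"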

A simple class for deterministic browser automation

Browser automation usually requires a lot of coding, and to achieve the deterministic automation we need, we have to encapsulate more procedures for getting elements by id, name, tag name, class name and even regular expressions.


The Google web page is also not an easy target for browser automation, assuming you do not embed your query in the GET params but treat it like a regular web page where all data is input and handled by forms.

The IE class I wrote handles queries to the Google Web Page like this:

 Dim ieClass As IE, linkText As String, query As String: query = "Dog Wikipedia"
 On Error GoTo CleanIE
 Set ieClass = New IE
 ieClass.Navigate "https://www.google.pl/", False
 ieClass.GetElementByName("q").Value = query
 ieClass.GetElementByTagName("form").Submit
 ieClass.WaitForIE 'First wait for the page to mostly load
 linkText = ieClass.GetRegex("<h3(?:.|\n)*?<a onmousedown=""return(?:.|\n)*?"" href=""(?:.|\n)*?"">((?:.|\n)*?)</a>")
CleanIE:
 ieClass.Quit

That’s 10 lines of code and no hassle with handling the accessibility of any web elements. All the blocking/waiting is handled in the Get* methods, which simply wait until the control is available in the Web Browser or raise an error if the maxTimeout threshold is breached. You might have noticed that there is still a method called WaitForIE which waits for the browser to confirm that it is not Busy anymore. Why is it there? Because we need to be sure that the browser has at least mostly loaded the new content, so that the element we are looking for is not found, by mistake, in the previous page’s content. This code has proven deterministic for me over 1,000 tested queries!

One other interesting thing is the GetRegex method, which I basically lifted from my Excel Scrape HTML Add-In. I prefer regular expressions over any other method for extracting content from strings/web pages, and I encourage you to use them too. It is a much more flexible approach and, once you get the hang of it, it really makes life easier.
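
If you want to experiment with this approach outside of the class, below is a minimal standalone sketch using VBScript.RegExp (the sample HTML string is made up):

Sub RegexScrapeExample()
'Extracts the contents of the <title> tag using one capture group "()"
    Dim regex As Object, matches As Object, html As String
    html = "<html><head><title>Dog - Wikipedia</title></head></html>" 'Made-up sample HTML
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = "<title>(.*?)</title>"
    regex.Global = True
    If regex.Test(html) Then
        Set matches = regex.Execute(html)
        Debug.Print matches(0).SubMatches(0) 'Prints: Dog - Wikipedia
    End If
End Sub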

The code of the IE class:

Deterministic browser automation class

'!!!Please reference in Tools->References "Microsoft Internet Controls library"!!!
Option Explicit
Public Enum READYSTATE
    READYSTATE_UNINITIALIZED = 0
    READYSTATE_LOADING = 1
    READYSTATE_LOADED = 2
    READYSTATE_INTERACTIVE = 3
    READYSTATE_COMPLETE = 4
End Enum
Private objIE As Object
'Windows API Sleep used for the polling delay below (drop PtrSafe on 32-bit Office versions)
Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Const maxTimeout As Long = 10000 'Max time in milliseconds to wait until a control is found before raising an error
Public ElementNotFoundError As Long
Public Function GetIE()
    Set GetIE = objIE
End Function
Private Sub Class_Initialize()
    ElementNotFoundError = 1
End Sub
Public Sub Navigate(urlAddress As String, isVisible As Boolean)
'urlAddress: destination url; isVisible: should the IE window be visible
    Set objIE = New InternetExplorer: objIE.Visible = isVisible: objIE.Navigate urlAddress
    WaitForIE
End Sub
Public Sub WaitForIE()
    While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
        DoEvents
    Wend
End Sub
'----Get elements----
Public Function GetElementByName(name As String, Optional isBlocking As Variant, Optional index As Variant) 'Variants so that IsMissing works
'name: name of the html element; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements with this name
    Dim elems As Object, timeout As Long
    On Error Resume Next
TryAgain:
    Set elems = objIE.Document.GetElementsByName(name): Set GetElementByName = elems(IIf(IsMissing(index), 0, index))
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementByName Is Nothing)) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    If (GetElementByName Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementByName"
End Function
Public Function GetElementById(id As String, Optional isBlocking As Variant)
'id: id of the html element; isBlocking: is the code to be blocked until the element is found
    Dim timeout As Long
    On Error Resume Next
TryAgain:
    Set GetElementById = objIE.Document.GetElementById(id)
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementById Is Nothing)) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    If (GetElementById Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementById"
End Function
Public Function GetElementByTagName(tagName As String, Optional isBlocking As Variant, Optional index As Variant)
'tagName: tagname of the html element; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements with this name
    Dim elems As Object, timeout As Long
    On Error Resume Next
TryAgain:
    Set elems = objIE.Document.GetElementsByTagName(tagName): Set GetElementByTagName = elems(IIf(IsMissing(index), 0, index))
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementByTagName Is Nothing)) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    If (GetElementByTagName Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementByTagName"
End Function
Public Function GetElementByClassName(className As String, Optional isBlocking As Variant, Optional index As Variant)
'className: classname of the html element; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements with this name
    Dim elems As Object, timeout As Long
    On Error Resume Next
TryAgain:
    Set elems = objIE.Document.GetElementsByClassName(className): Set GetElementByClassName = elems(IIf(IsMissing(index), 0, index))
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or (GetElementByClassName Is Nothing)) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    If (GetElementByClassName Is Nothing) Then Err.Raise vbObjectError + ElementNotFoundError, "IE.GetElementByClassName"
End Function
'----Get HTML by regex----
Public Function GetRegex(reg As String, Optional isBlocking, Optional index As Integer) As String
'reg: regular expression with 1 capture "()"; isBlocking: is the code to be blocked until the element is found; index: index of the element in the array of elements fulfilling this regular expression
    On Error Resume Next
    Dim regex, matches, timeout As Long
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = reg
    regex.Global = True
    If index < 0 Then index = 0
TryAgain:
    If regex.Test(objIE.Document.body.innerHtml) Then
        Set matches = regex.Execute(objIE.Document.body.innerHtml)
        GetRegex = matches(index).SubMatches(0)
        Exit Function
    End If
    If IIf(IsMissing(isBlocking), True, isBlocking) And (Err.Number <> 0 Or GetRegex = vbNullString) And timeout < maxTimeout Then
        Err.Clear: DoEvents: Sleep 5: timeout = timeout + 5
        GoTo TryAgain
    End If
    GetRegex = ""
End Function
Public Function GetMatchCount(reg As String) As Long
'reg: regular expression with 1 capture "()"
    On Error Resume Next
    Dim regex, matches
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = reg
    regex.Global = True
    If regex.Test(objIE.Document.body.innerHtml) Then
        Set matches = regex.Execute(objIE.Document.body.innerHtml)
         GetMatchCount = matches.Count
        Exit Function
    End If
    GetMatchCount = 0
End Function
'----Quit and terminate----
Public Sub Quit()
    If Not (objIE Is Nothing) Then objIE.Quit
    Set objIE = Nothing
End Sub
Private Sub Class_Terminate()
    Quit
End Sub

I hope this VBA class becomes the cornerstone of every browser automation script you write in VBA!

Download

Feel free to download the whole class file here:

Next steps

Check out my Scrape HTML Add-In for easy HTML scraping without any VBA:
Excel Scrape HTML Add-In


Automatic Updating of Excel Worksheet


Introducing the new extension to the Scrape HTML Add-In: Automatic Updating of Excel Worksheets. When configuring your Excel file to scrape data off HTML websites, there is often the issue of how to refresh this data periodically. For example, you want to scrape stock price data and refresh it every 2 minutes. You can do this manually, write some clever VBA using the Application.OnTime function or… simply configure the periodic refresh in the Automatic Update tool that is part of the Add-In.
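
For comparison, the “clever VBA” route with Application.OnTime looks roughly like the sketch below; the worksheet name and the 2-minute interval are placeholders for your own setup:

Public Sub RefreshStockData()
'A sketch of a self-rescheduling refresh (place in a standard module) - the Automatic Update tool replaces this boilerplate
    ThisWorkbook.Worksheets("Stocks").Calculate 'Placeholder: recalculate the scraping formulas
    Application.OnTime Now + TimeSerial(0, 2, 0), "RefreshStockData" 'Schedule the next run in 2 minutes
End Sub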

The Automatic Update tool


So how does it work?

Configure the refresh rate for each worksheet

Automatic Update tool

Click on the Automatic Updating button to see the list of worksheets and refresh rates. You can add any worksheet and configure the interval at which it should be updated. Then click the “Off” button to toggle updating on. And that is it! The tool will then refresh the Get* functions (e.g. GetElementByRegex) at the configured interval.

No VBA required!

How to use this feature?

This will certainly be useful for users who:

  • Need fresh data – whether you are a stock broker, analyst in any other role. This will certainly make life easier
  • Are waiting for an important website update – maybe you are waiting for the new iPhone or want to sign up to a marathon with limited participation. This will also prove useful.
  • Want to make automatic reports – why not make a dashboard of reports based on web provided data? Make charts based on data all in one place w/o needing to constantly switch between websites to extract the information you need

See the video tutorial here:


Excel Scrape HTML Add-In now with HTML caching


I am constantly extending the Scrape HTML Add-In when I have some spare time. Today’s update adds Caching to the Get* functions.

See the latest version of the Excel Scrape HTML Add-In here:
Excel Scrape HTML Add-In

Why caching?

The primary concept of the Add-In is to reduce any need for writing VBA code. The problem with the Get* functions, however, is that if you specify multiple functions with the same URL, the same HTML content has to be downloaded just as many times – which can heavily impact performance.

To solve this issue I have introduced the Cache functionality. If you specify the same URL in more than one Get* function then, provided that the cache timeout has not passed, the HTML content will be served not from the Web but from the Cache. E.g. if you have 2 Get* functions with the same URL and a 60 sec timeout, only the first call will download the content from the Web, whereas the second one will not – provided that the second call happens before the 60 sec timeout ends. You can change the setting at any time in the “Cache settings”.
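
To illustrate the concept (this is just a minimal sketch of a URL-keyed cache with a timeout, not the Add-In’s actual implementation):

Private cache As Object 'Scripting.Dictionary mapping url -> Array(timestamp, html)
Private Const CACHE_TIMEOUT_SEC As Long = 60

Function GetHtmlCached(url As String) As String
'Returns cached HTML if the cache timeout has not passed, otherwise downloads it again
    If cache Is Nothing Then Set cache = CreateObject("Scripting.Dictionary")
    If cache.Exists(url) Then
        If (Now - cache(url)(0)) * 86400# < CACHE_TIMEOUT_SEC Then
            GetHtmlCached = cache(url)(1) 'Cache hit - no HTTP request needed
            Exit Function
        End If
    End If
    GetHtmlCached = GetResult(url) 'GetResult as defined in the proxy post above
    cache(url) = Array(Now, GetHtmlCached) 'Store/refresh the cache entry
End Function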

Cache Settings