Category Archives: Web

excel webservice function

Excel WEBSERVICE and FILTERXML functions explained

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 5.00 out of 5)
Loading...

One of the incredible new features coming to Excel 2013 are the Excel WEBSERVICE and Excel FILTERXML Worksheet functions. Excel 2013 introduces over 50 new Worksheet functions but I will tell you why the 2 above-mentioned rock my world ever since I learn of them. It is incredible that the introduction of this awesome function was almost unnoticed by most Excel gurus out there…

Excel up till now has been mostly an offline application. Although, you can use VBA, PowerQuery or other similar data tools in Excel to gain access to Internet and Online data sets these could not have been easily used with the help of Third Party AddIns such as my Scrape HTML AddIn. With the introduction of the WEBSERVICE function we gain immediate and easy access to any REST WebAPI.

In today’s post I will show you several examples of how to use these functions in practice. And how to extract data from an XML REST WebApi.

How to use the Excel WEBSERVICE function

The WEBSERVICE function downloads the HTTP response of the provided URL.

excel webservice function
The Excel WEBSERVICE Worksheet function

Try running the Excel WEBSERVICE function on Google:

=WEBSERVICE("https://www.google.com")

What do you get? If all goes well you should get the HTML response for the Google Web Page.
Now past the above formula to cell A1 and the below to cell A2:

=MID(A1;FIND("<title>";A1)+LEN("<title>");FIND("</title>";A1)-FIND("<title>";A1)-LEN("<title>"))

The result of this Worksheet Excel Formula should be:

Google

Using FILTERXML in tandem with WEBSERVICE

Another function introduced in Excel 2013 is the FILTERXML function. It is designed to work in tandem with the Excel WEBSERVICE function.

What does the FILTERXML function do?

The Excel FILTERXML Worksheet function parses a XML string (string containing XML document) and returns a single element (node or attribute) provided by a XPath.

excel filterxml
The Excel FILTERXML Worksheet Function

Lots of odd words there right? So let us start breaking it down. XML is markup language for encoding documents. As a matter of fact HTML is based on XML and share a lot of similarities. Unfortunately for us HTML does not need often have to be as strictly parsed/validated as an XML does. Although some HTML could pass as XML files – in fact many Websites don’t validate as XML.

Want to learn how to manage XML documents in VBA instead? Read my VBA XML Tutorial

Now what is XPath? XPath is a query language for selecting XML elements such as nodes and attributes. XPath works for XML and HTML.

How to use the FILTERXML function

Now that we know what the FILTERXML functions let’s use it on a simple XML file. W3Schools fortunately has a lot of simple XML examples – let us use a simple XML Food Menu:

Example: WEBSERVICE and FILTERXML function

filterxml and webservice example 1
Using WEBSERVICE and FILTERXML in tandem – Formulas

Example: WEBSERVICE and FILTERXML result

filterxml and webservice example 2
Using WEBSERVICE and FILTERXML in tandem – Result

Explanation

xml example
W3Schools XML Example Food Menu
Now to explain what happens above. The XML file includes a couple of nodes – each one nested within the previous one. We start with node which hosts nodes. Each food node represents a single item in the menu. A food node contains , and node – which describe each menu item.

Now the FILTERXML functions used the following XPath: //food/name. This translates to: take the first food tag and return the contents of it name node.

XPath is an easy query language to learn. I personally recommend the WSchools XPath tutorial

Scraping a whole XML document

Now the example above is fine when you need just a single node from your XML document URL. What if you want to scrape the whole contents of that XML? Thankfully we can combine the WEBSERVICE and FILTERXML functions with Array Formulas.

In the example below I will show you how to acquire all the names of the food items in the menu. You can use a similar technique to get other items.

Input the FILTERXML formula

Input the FILTERXML formula as shown below:

=FILTERXML(B2;"//food/name")

filterxml and webservice example 4

Drag the formula down

Drag the formula down to row 8:
filterxml and webservice example 5

Hit CTRL+SHIFT+ENTER

Hit the following key combo to create and Array Formula: CTRL+SHIFT+ENTER.
That is it. Now in each row you should see the name of a food item from the menu.

Don’t like using Array Functions? You can also use the XPath node index instead:

=FILTERXML(B2;"//food[2]/name")

will return the name of the second food menu item. To replicate this across all items use this example:

=FILTERXML(B$2;"//food[" & (ROW()-ROW(B$4)+1) &"]/name")

Provide you entered this function in cell B2, simply drag it down – the items should automatically complete.

WEBSERVICE functions Restrictions

Now the WEBSERVICE function unfortunately has several restrictions that will cause the function to return a #VALUE! error instead of the string:

  • If you don’t have a working Internet connection (or you are working with a proxy server)
  • Incorrect arguments or URL address
  • If HTTP result is not valid or contains more than the cell limit of 32767 characters
  • URL is a string that contains more than the 2048 characters that are allowed for a GET HTTP request
  • Protocols that aren’t supported, such as ftp:// or file://

Excel WEBSERVICE summary

The WEBSERVICE and FILTERXML functions are a great step forward to enabling access to Internet resources. These still have unfortunately a lot of limits. Especially when most websites have HTML files that exceed the 32727 character limit and often don’t parse as XML files.

This is where my VBA Web Scraping Kit fills the gap together with my Scrape HTML AddIn. The Kit has all the Web Scraping scenarios I consider possible in Excel where as the Scrape HTML AddIn extends a little the constrains of the WEBSERVICE and the FILTERXML functions

Want to learn Web Scraping?

Not satisfied? Want to know more about Web Scraping in Excel using VBA? Read my zero-to-hero Web Scraping Tutorial.

VBA Web Scraping Kit

Web Scraping Kit – use Excel to get that Web data

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading...

I am proud to present the next Kit coming from AnalystCave.com! The Web Scraping Kit is a simple kit for VBA Web Scrapers, contains a set of ready examples for different scraping scenarios. The kit is equipped with several tools letting you leverage HTTP GET&POST, IE, proxies, XPath, Regex and more Web Scraping tools. Get it and support AnalystCave!

DOWNLOAD The Web Scraping Kit here!

vba proxy server

Web Scraping: Proxy HTTP request using VBA

1 Star2 Stars3 Stars4 Stars5 Stars (3 votes, average: 4.67 out of 5)
Loading...

Visual Basic for Application (VBA) is great for making your first steps in Web Scraping as Excel is ubiquitous and a great training arena to learn Web Scraping. Web Scraping comes down to making HTTP requests to websites. At some point however you will find that some websites will cut you off or prevent multiple HTTP requests. This is were you need to use proxy servers to hide your HTTP requests behind multiple IPs. Using a proxy server is easy in VBA so let’s study a couple of real life examples.

What is a proxy server?

A proxy server is a server that acts as an intermediary for Internet/Intranet HTTP requests. A client machine connects to the proxy server, requesting a service, file, connection, web page, or other resource available from a different server and the proxy server serves that request. Proxy servers often provide anonymity.

proxy server
A typical proxy server HTTP request

VBA HTTP Request using ServerXMLHttp

Before we do a proxy HTTP request using VBA let’s look a how a regular HTTP request looks like. Typically I tend to use the ServerXMLHTTP request object, as I will show in the next section, it can be easily extended to run HTTP requests via a proxy IP.

Function GetResult(url As String) As String
    Dim XMLHTTP As Object, ret As String
    Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP.4.0")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.send
    ret = XMLHTTP.ResponseText
    GetResult = ret
End Function

Sub Example()
    'Print the HTML of Google.com
    Debug.Print GetResult "http://www.google.com" 
End Sub

VBA HTTP Request via proxy

Let us now extend the above example to run the same HTTP request via a proxy server. Notice the new line highlighted below:

Function GetResult(url As String) As String
    Dim XMLHTTP As Object, ret As String
    Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP.4.0")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setProxy 2, "xxx.xxx.xxx:80"
    XMLHTTP.send
    ret = XMLHTTP.ResponseText
    GetResult = ret
End Function

The xxx.xxx.xxx is the IP of your proxy server and 80 is the default HTTP port. You can easily google free online proxy IPs which can easily be used to run proxy requests.

Last thoughts on VBA Proxy HTTP requests

Right.. so do play a little bit with this new piece of knowledge as this is one of the most important aspects of Web Scraping (unless you want to get your IP blocked). Here are some other tips to keep in mind:

  • Use at least 5-10 proxy servers or else your proxies will quickly be blocked too
  • Be sure to handle proxy server errors e.g. 500, 403. A good approach is to retest any server error with an additional proxy server
  • Test your proxies before use. Especially free proxy servers have a lot of down time. Although 1 proxy IP may have worked yesterday.. it may not be so today. I tend to run a simple HTTP request on all proxies and remove any raising server errors before use
  • Don’t cross the line: Web Scraping is useful when you got a lot of queries you want to run automatically instead of running them manually. Don’t cross the line by content-stealing or abusing websites TOS!
Scrape Google Search Results to CSV

Scrape Google Search Results to CSV using VBA

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 5.00 out of 5)
Loading...

Google is today’s entry point to the world greatest resource – information. If something can’t be found in Google it well can mean it is not worth finding. Similarly SEO experts blog and write about how to optimize your web pages to rank best in Google Search results often ignoring other search engines which contribution to page impressions & clicks is almost irrelevant nowadays due to the G-Monopoly. Naturally there are tons of tools out there for scraping Google Search results, which I don’t intend to compete with. On the other hand even Google makes scraping it’s web results dead easy by facilitating its dedicated Web API. My goal in this post we be to show how easy it is to utilize Excel & VBA to scrape those search results into a simple CSV file which as little effort as possible.

Before you start reading on remember that violation of Googles TOS will get you temporarily banned!

Google Ajax Search API

The obvious way in which we obtain Google Search results is via Googles Search Page. However, such HTTP requests return lot’s of unnecessary information (a whole HTML web page).

In my case a simple “cats and dogs” Google Search almost 1 second and contains over 300kb of data, as according to pingdom:

https://www.google.com.ua/?q=cats+and+dogs#
HTTP Google Request stats
HTTP Google Request stats

Thankfully Google is not oblivious to the fact that many people want to scrape it’s search results and facilitates a simple AJAX Google Search API. Let’s see how that compares. In my case the AJAX API call of “cats and dogs” returned in approx. 400 miliseconds and contained less than 2kb of data:

https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=cats%20and%20dogs

REST Google API Request stats
REST Google API Request stats

That’s an improvement right? The results are returned in JSON and we can leverage a whole range of parameters.

JSON Google Search Results format

The AJAX Google Search Web API returns results in JSON. To be able to scrape these results we need to understand the format in which Google returns these results.

The only part of the JSON query that interests us is the “results”: [ … ] section. This is the array that will iterate through the results providing the following parameters separated by commas:

  • “GsearchResultClass” – “GwebSearch” for simple text search
  • “unescapedUrl” – the unescaped URL of the results (e.g. http://en.wikipedia.org/wiki/Paris_Hilton)
  • “url” – escaped url
  • “visibleUrl” – shortened version of the URL of the result, stripped of protocol and path
  • “cacheUrl” – cached Google url
  • “title” – title of the page
  • “titleNoFormatting” – tile of the page w/o formatting
  • “content” – brief snippet of information from the page

Great. Seems like we have our data source and query string. Let’s get started with leveraging that in Excel VBA.

Scrape Google Search Results using VBA

As we know the structure of each Google Result let’s follow programming best practices and create a suitable VBA Class (I will explain why not a Type structure afterwards):

Public unescapedUrl As String
Public url As String
Public visibleUrl As String
Public cacheUrl As String
Public title As String
Public titleNoFormatting As String
Public content As String

Each object of type VBA Class GoogleSearchResult will represent a single search result record. To download the records and save the data as CSV file I have divided the task into 3 distinct steps:

Below the main Sub procedure SaveGoogleResultsCSV:

Sub SaveGoogleResultsCSV()
    Dim res As String, resCol As Collection
    'Step 1
    res = GetResult("https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=cats%20and%20dogs")
    'Step 2
    Set resCol = ExtractResults(res)
    'Step 3
    SaveToCSV resCol, "C:\results.csv"
End Sub

Below each section will address these steps.

Get the JSON query result

To scrape the JSON results we can welcome the XMLHttpObject which will efficiently download the entire HTTP response of the query:

Function GetResult(url As String) As String
    Dim XMLHTTP As Object, ret As String
    Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "Cache-Control", "no-cache"
    XMLHTTP.setRequestHeader "Pragma", "no-cache"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    XMLHTTP.send
    ret = XMLHTTP.ResponseText
    GetResult = ret
End Function

Extract the query results to a VBA Collection

The JSON result is mashed up and unfortunately is harder to read than an XML which we could more easily read in VBA. In this case we need to revert to using VBA Regex:

Function ExtractResults(res As String) As Collection
    Dim resCol As Collection, tmp As String, matches, match, subMatch, GoogleRes As GoogleSearchResult
    Set resCol = New Collection
    Dim regex As Object, str As String
    Set regex = CreateObject("VBScript.RegExp")
    With regex
      .Pattern = """results"":\[([^\]]+)\]": .Global = False
    End With
    Set matches = regex.Execute(res)
    tmp = matches(0).SubMatches(0)
    With regex
      .Pattern = "\{([^\}]+)\}": .Global = True
    End With
    Set matches = regex.Execute(tmp)
    For Each match In matches
      If match.SubMatches.Count > 0 Then
        For Each subMatch In match.SubMatches
          Set GoogleRes = New GoogleSearchResult
          GoogleRes.cacheUrl = ExtractAttribute(CStr(subMatch), "cacheUrl")
          GoogleRes.content = ExtractAttribute(CStr(subMatch), "content")
          GoogleRes.title = ExtractAttribute(CStr(subMatch), "title")
          GoogleRes.titleNoFormatting = ExtractAttribute(CStr(subMatch), "titleNoFormatting")
          GoogleRes.unescapedUrl = ExtractAttribute(CStr(subMatch), "unescapedUrl")
          GoogleRes.url = ExtractAttribute(CStr(subMatch), "url")
          GoogleRes.visibleUrl = ExtractAttribute(CStr(subMatch), "visibleUrl")
          resCol.Add GoogleRes
        Next subMatch
      End If
    Next match
    Set ExtractResults = resCol
End Function
Function ExtractResult(res As String, attrib As String) As String
    Dim regex As Object, str As String
    Set regex = CreateObject("VBScript.RegExp")
    With regex
      .Pattern = "{.*}": .Global = False
    End With
    Set matches = regex.Execute(str)
    ExtractAttribute = matches.SubMatches(0)
End Function
Function ExtractAttribute(res As String, attrib As String) As String
    Dim regex As Object, str As String, matches
    Set regex = CreateObject("VBScript.RegExp")
    With regex
      .Pattern = """" & attrib & """:""([^\""]*)""": .Global = False
    End With
    Set matches = regex.Execute(res)
    ExtractAttribute = matches(0).SubMatches(0)
End Function

Write the data to CSV

Lastly having all results uploaded to our resCol VBA Collection object we can write all the records to an CSV file:

Sub SaveToCSV(resCol As Collection, fileName As String)
    Dim textData As String, delimiter As String, it As GoogleSearchResult, fileNo As Integer
    delimiter = ";"
    For Each it In resCol
        textData = textData & it.cacheUrl & _
            delimiter & it.content & _
            delimiter & it.title & _
            delimiter & it.titleNoFormatting & _
            delimiter & it.unescapedUrl & _
            delimiter & it.url & _
            delimiter & it.visibleUrl & vbNewLine
    Next it
    fileNo = FreeFile
    Open fileName For Output As #fileNo
    Print #fileNo, textData
    Close #fileNo
End Sub

Google Search API parameters

Listed below are some key parameters that will definitely prove useful. You can find the rest here.

  • rsz – the number of results (e.g. rsz=4 will be 4 results)
  • hl – host language (e.g. hl=en for english)
  • start – the start index for the search results

A simple example below:

https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=cats%20and%20dogs&hl=pl&rsz=2

Will return 2 results, for host language Polish for the query “cats and dogs”.

Proxies

Do read my post on using proxies HTTP servers to distribute your queries.

Conclusions

Running the SaveGoogleResultsCSV procedure will provide us with a single result – as CSV file:

Scrape Google Results CSV
Scrape Google Results CSV

Obviously we can also refrain from actually saving the results and proceed with traversing our list of results to do an analysis or to save them to an Excel worksheet. For this we need only traverse the resCol VBA Collection and read only the attributes we need.

Before you start using the Google API extensively be sure to read Google’s Terms of service. Remember – scrape Google Search Results in a responsible way!

scrape html add-in

Web Scraping Tutorial

1 Star2 Stars3 Stars4 Stars5 Stars (13 votes, average: 4.46 out of 5)
Loading...

For years I have been reaching out to Web Scraping in order to download / scrape web content, however only recently have I really wanted to dive deep into the subject to really be aware of all the techniques out there. Ranging from the simple Excel “From Web” feature to simulating browser interaction – there are tons of ways to get the data you need, often limiting you only to user interaction you cannot (or it is really hard) simulate e.g. captcha, in-deterministic. Hence I summarize the tools I use in this brief Web Scraping Tutorial.

Web Scraping is almost a new profession – there tons of freelancers making their living off extracting web content and data. Having built your own “kit” of different tools any beginning coder can become quickly a professional full-blown Web Scraper. I hope this Web Scraping Tutorial will guide you safely through this journey. Making you a professional Web Scraper – From Zero To Hero!

Introduction

Although, I grew from C# and Java, VBA has really grown on me. Excel is a good tool for beginner Web Scrapers therefore I will often resort to code examples in VBA. Although when presenting more sophisticated techniques I will surely reach out for some Python and C#.

Now here is some target content I like to use in examples. This table is a great beginner target for Web Scraping which I will be using every now and then through out this article.

Web Scraping Tutorial:Example table from W3Schools
Example table from W3Schools

web scraping tutorial

First step – understanding HTML

The first thing you need to do is understand what HTML is. HTML is a markup language which structures the content of websites. In simple terms it is usually a text file (HTML or HTM), structured with the use of tags. The below is the simplest possible HTML page reading Hello World!:

<html>
<head></head>
<body>Hello World!</body>
</html>

Remind you anything? XML possibly?! No? Then do check-out this simple HTML DOM tutorial from W3Schools as a good starting point befor you do move on.

Basic tools (no coding required)

I assume not all of you are reviewing this Web Scraping Tutorial to master the art of Web Scraping. For some it is enough to be able to extract some simple web content without needing to know what XPath or Javascript is. For those of you I have gathered a list of basic out-of-the-box solutions that will enable you to quickly extract some web content.

Excel PowerQuery – From Web

Web Scraping Tutorial: Excel PowerQuery - From Web feature
Excel PowerQuery – From Web feature

Excel Power Query is a powerful must-have Microsoft Add-In to Excel which you can find here. It is a dedicated tool mainly for scraping HTML Tables. Just click the button, input your desired URL and select the table you want to scrape from the URL. E.g.

Web Scraping Tutorial: Imported table using Power Query - From Web
Imported table using Power Query – From Web

Too bad it does not support more complicated examples.

Import.io

If you are faced with a more complicated example then just a single HTML table then Import.io might be the tool for you. You can find the link to the website here. Import.io is a great tool for scraping any structured content. Want to scrape all the merchandise items of an e-commerce website? Welcome to Import.io. See an example below.

Web Scraping Tutorial: Import.io demo
Import.io demo

Although this tool does not require much coding experience it does require some practices and patience to learn.

Other tools worth mentioning

… and many more. Be aware the all these tools have their setbacks and most of the times it might actually turn out that doing it yourself is much easier.

What Web Scraping is about

Web Scraping is essentially about being able to query websites (or Web API) and extract the information needed:

  • Query websites (or Web API) – being able to send query Web Servers to request data (their HTML structure, associated data in XML/JSON/other formats, scripts (Javascript) and stylesheets (CSS) if needed too
  • Extract data from Websites – being able to extract only the information you need in the appropriate format. You need to be able to traverse the whole structure of the HTML document or XML/JSON output of your Web API request and extract those essential bits and pieces of data you need

Query websites (How websites makes HTTP server calls?)

Before we jump to the basic web scraping techniques in this Web Scraping Tutorial you need to understand how webpages exchange data with servers. Feel free to read more here. Servers can exchange data synchronously or asynchronously. The first, most popular, means that when you type in an URL in your browser or navigate over the website the browser will send a request to the server to load a certain URL e.g. when you search for “help” in Google, the browser will make a server call for the url: “https://www.google.com/search?q=help”. Asynchronous server calls happen without a need to refresh the whole web page e.g. you start typing something in Google and the webpage dynamically suggests some keywords. The latter method is sometimes also referred to as AJAX. Knowing what happens in the backend of the website can sometimes really make web scraping a lot easier and faster. I will dive deeper into this near the end of this article.

GET method

This is basically an URL which include the parameters of the web server call e.g.

https://www.google.com/search?q=help

The parameter here is q and the value of that parameter is help. The GET method is pretty straightforward as the URL can also be scraped for parameter values and you can scrape other URL simply by modifying the parameters in the URL.

POST method

Now the POST method is less user friendly as it submits the parameters in the body of the HTTP web server request. They are therefore not visible in the URL. How to check whether a web page passes its parameters via POST or GET? To view the HTML code of your webpages I do encourage you get familiar with FireBug or other similar browsers tools (hit F12). Looking through the HTML code will give you a good hint for the “method” attribute of the form the page is submitting:
This form is sent via GET:

<FORM action="http://example.com" method="GET">

This form is sent via POST:

<FORM action="http://example.com" method="POST">

Scraping static websites

Now before we jump into more sophisticated scraping techniques I would like to introduce you to the basics of string manipulation and text extraction. Websites are mostly HTML text files therefore being able to manipulate and extract text from them is a must-have capability. Some Web Scrapers are comfortable with just simple string manipulation functions, although knowing more advanced text / HTML element extraction functions will surely save you a lot more time and is a must in case you want to dive deeper into e.g. simulating user interaction.

String manipulation

The basic functions every Web Scraper needs to know are the following (VBA):

  • Len – returns the length of a certain string
  • InStr (Python: find, C#: IndexOf) – finds a substring in a certain string and returns its index
  • Left – returns a given amount of characters from the left of a given string
  • Right – returns a given amount of characters from the right of a given string
  • Mid – returns a given amount of characters from any position within a given string
  • Replace – replaces any occurrence(s) of a certain string in a given string

That is it. Want an example?
VBA / VBscript

Sub GetPageTitle(html as String)
  GetPageTitle = Mid(html, InStr(html, "<title>") _
                 + Len("<title>"), InStr(html, "</title>") _
                 - InStr(html, "<title>") - Len("</title>") + 1)
End Sub

This function will extract the title of a web page provided it is enclosed in a title tag without attributes and whitespaces.

Text / HTML element extraction

Using string manipulation is useful when text is unstructured e.g. articles, books etc. Whereas HTML is basically an extended version of XML, a structured markup language used for encoding structured data! Now why not benefit from this simple finding? A HTML file is built basically of HTML elements – tags insides other tags including, from time to time some actual content. HTML tags can also have attributes. If you want to learn more on HTML this is a good place to start: here.

In order to traverse these elements you can use three basic techniques:

  • XPath – a unique path for a HTML element within a HTML file. Want an example? “/html/head/title” would identify the title of a HTML document. The XPath below would identify the second div element inside the body of a HTML document.
    /html/body/div[2]
    
  • CSS selectors – patterns for identifying HTML elements by CSS. The CSS selector below will identify the first input element with an id “myBtn”. CSS selectors are said to be faster and more simple than XPath. Want to be a Web Scraper pro? Use CSS selectors!
    input[id=myBtn]
    
  • Regular expressions – regular expressions (regex) can be used to capture any patterns in text (not just HTML). I personally often prefer using regex other the previous two methods as in one go you can extract any pattern of text within a HTML page, whereas XPath and CSS selectors require usually at least 2 or more steps e.g. find the HTML element and extract some text from it. The example below will extract the title of a webpage w/o any trailing or preceding whitespaces.
    <title>s*((?:.|n)+?)s*</title>
    

Basic Web Scraping techniques

Now as we know how to extract text and HTML elements from HTML all we need to do is to be able to download the HTML data from the Website. These techniques allow you to download HTML content from static websites or URLs with specified GET parameters.

Excel Scrape HTML Tool

Excel is a great tool for beginner coders, due to its ubiquity and, as it includes both a developing and testing environment. I myself use Excel on a daily basis and so do you most probably. Therefore I want to introduce a simple Web Scraping Add-In that basically allows you to extract text and data off almost any static web site.

Web Scraping Tutorial: Excel Scrape HTML Add-In
Excel Scrape HTML Add-In

You can find the Excel Scrape HTML AddIn here.
The is no need of writing even a single line of VBA code, although… you will need to learn how to write regular expressions. Again the tool features a Scrape HTML Tool which will allow you to test your regex “on-the-fly”.

XMLHttpRequest object

The XMLHttpRequest object is simply a Javascript object used to exchange data with the server. It is often used in AJAX websites. If you see the website being refreshed without it being reloaded an XMLHttpRequest object was most definitely used to exchange data with the server. Knowing this object is a must for all Web Scrapers unless you use Scrapy or other libraries. Let’s start with a simple example in VBA:
VBA / VBscript:

Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
XMLHTTP.Open "GET", "http://www.google.com", False
XMLHTTP.setRequestHeader "Content-Type", "text/xml"
XMLHTTP.send
MsgBox XMLHTTP.ResponseText 'This will show a message box with the HTML

Easy! 5 lines of code and you get the HTML content of any provided URL. You can use ResponseText to extract all the web content you need.

Now let’s get the HTML response of a Google query:

Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
XMLHTTP.Open "GET", "http://www.google.com/search?q=hello", False
XMLHTTP.setRequestHeader "Content-Type", "text/xml"
XMLHTTP.send
MsgBox XMLHTTP.ResponseText 'This will show a message box with the HTML
If you want to scrape Google queries – read more in my post here

Notice that I added “search?q=hello” to the URL? That is because the parameter can be passed via the GET HTTP method.

Let’s try to pass some data via post (not sure if Google handles POST):

Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
XMLHTTP.Open "POST", "http://www.google.com/search"
XMLHTTP.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
XMLHTTP.send("q=help") 'Additional params need to be preceded by & and encoded
MsgBox XMLHTTP.ResponseText 'This will show a message box with the HTML

Easy right?

The XMLHttpRequest object is often all you need to extract the content from websites, web server calls etc. as websites make most of their web server calls via an POST/GET URL that returns a HTML, XML or JSON response. Resorting to simulating user interaction is often an overkill used by beginner Web Scrapers who are often to lazy to analyze the underlying Javascript and web server calls.

Scrapy

When you need to scrape a single URL the XMLHttpRequest object is basically all you need. However, when in need of scraping a collection of static websites or a certain subset of webpages on a website you may be in need of a Web Crawler i.e. an bot that can crawl through websites or traverse through the resources of a certain website. Writing your own solution is always an option. It makes sense, however, to reach out for ready solutions like Scrapy. I will not elaborate more on Scrapy as I encourage you to check out this simple tutorial:
Scrapy Tutorial.

Simulating Web browser user interaction

Now we finally reached the much appreciated methods for simulating user interaction. Because they are often misused these methods should be the last resort in case all other methods for scraping HTML content fail e.g. the website expects direct user interaction, drag’n’drop etc.

Excel IE Object

Feel free to download my VBA IE Automation class for easy VBA web scraping.
The Excel Internet.Explorer object in Excel VBA is a popular technique for leveraging the Internet Explorer web browser for simulating user interaction. Again I can recommend this approach for those who want to learn Web Scraping via Excel. Now let’s see a simple example of simulating a query in Google:
VBA / VBscript:

Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True 'Let's see the browser window
IE.Navigate "http://www.google.com/"
Do While IE.Busy 'We need to wait until the page has loaded
  Application.Wait DateAdd("s", 1, Now)
Loop
Set obj = IE.document.getElementById("lst-ib")
obj.Value = "help" 'Set the textbox value
Set objs = IE.document.getElementsByName("btnK")
objs(1).Click 'Click the search button

Using the Internet.Explorer object has some benefits e.g. being able to reference HTML elements by name, tag, class, id. For more elaborate solutions it is even possible to inject Javascript and load external JS libraries.

The Internet.Explorer objects has some setbacks e.g. indeterministic page loading (if IE.Busy is false it does not necessarily mean that the page has been fully loaded). To tackle this issue I have created a VBA class for using IE automation in Excel which wait for the necessary HTML elements to load instead of raising VBA exceptions. You can download the file from here.

Unfortunately the Internet.Explorer object does not “really” allow you to simulate user interaction without the browser window being visible. Manipulating HTML elements does not fire Javascript events e.g. onkeyup, onmouseover. This is an issue on some web pages which will not respond until an appropriate JS event is fired.

One way of going around this issue is simulating Excel keydown events e.g.
VBA / VBscript

Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True 'Let's see the browser window
IE.Navigate "http://www.google.com/"
Do While IE.Busy 'We need to wait until the page has loaded
  Application.Wait DateAdd("s", 1, Now)
Loop
SendKeys "help"
SendKeys "{ENTER}"

Again – this is not an elegant solution as the browser window needs to be topmost therefore you can not multitask when running such Web Scraping procedures. Here is where Selenium can help…

Selenium

Selenium is an elaborate solution designed for simulating multiple different browsers ranging from IE to Chrome. It was designed both for Web Scraping and building test scenarios for Web Developers. Selenium is available in many programming environments C#, Java, Python. I personally prefer python as there is not that much need for Objective Oriented Programming when building most Web Scrapers.

Again let’s dive deep into an example with the Google search page:

Python

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome() 'Chrome must be installed! Other browser drivers are available
driver.get("http://www.google.com")
elem = driver.find_element_by_id("lst-ib")
elem.send_keys("help")
elem = driver.find_element_by_name("btnK")
elem.click()

Selenium is one of the most powerful tools in a Web Scrapers toolbox. Although there are many webdrivers available in Selenium I would encourage you to use PhantomJS for final solutions and any other webdriver for testing (as PhantomJS is an invisible browser – that is the point).

Selenium is easy to learn (learning curve similar as the vba Internet.Explorer object) and selenium code can be easily migrated to C#, Java and other languages which is a real advantage.

Analyzing Javascript and network connections

The methods above basically cover most popular Web Scraping techniques. Knowing all of them basically guarantees that you will be able to scrape and crawl any website, whether static or dynamic, whether using POST or GET or requiring user interaction. As I mentioned above often Web Scrapers settle for the easy approach – simulating user interaction. That is exactly why I first introduced the XMLHttpRequest object which makes HTTP calls instead of IE in VBA or Selenium. Beginner Web Scrapers will always prefer copying user interaction, sometimes even being to lazy to inject it via Javascript and doing it on a topmost visible web browser window. The approach below explains how you should leverage all the tools mentioned above in order to optimize your Web Scraping solution. Remember if you intend to scrape / crawl 10000 web pages every additional second lost for simulating user interaction means almost an additional 3 hours of computing time.

Chrome DevTools / FireFox FireBug / IE Developer Tools

Before I jump into explaining my approach I need to introduce you to F12 tools. F12 because basically when you hit F12 on most browser windows they will pop-up. Chrome has its DevTools, FireFox its FireBug, IE its Developer Tools and Safari… nah Safari sux :P. I personally prefer the IE Developer Tool window as it lacks the complexity of the other tools and is a little easier to navigate.

Web Scraping Tutorial: IE Developer Tools
IE Developer Tools

When you open the IE Developer Tools window you will often leverage the click element feature – to locate HTML element on a HTML web page (the cursor icon in the left upper corner). This is one of the most frequently used features, however, as a Web Scraper you need to also learn to Network tab (similar name in Chrome). This is where the magic happens, often neglected by most Web Scrapers. In case where a web page is loaded in one go this may not be of much interest to you – as anyway you will need to scrape the text / data right of the HTML page. However, in many cases modern webpages utilize web service calls or AJAX calls. These will be immediately visible on the Network tab.

The right approach to Web Scraping

Instead of explaining the approach let’s use an example of a popular Polish e-commerce website Allegro.

Example

When inputing some text to the search box the page will suggest some answers during input:

As you need not refresh the webpage this obviously must mean that there are asynchronous web calls going on in the background. Never fear F12 is here!
Let’s see what is really going in the background. Open the Network tab and hit Start Capturing. Next start inputing some text and viola – see the web calls appearing in the Network tab.

As you will see the tool already noticed that the response is JSON structured which is great as JSON is pretty easy to parse and scrape. Let’s now click on any of these web calls to view the results.

Seems like the tool is right – this is definitely JSON, although containing encoded HTML strings as some of the suggested results are to be formatted differently.

Knowing this you can already easily build a Web Crawler that can traverse through most of the resources of this page looking for similar search results. How can we use this information to leverage these web calls? Easy – notice the calls use the GET HTTP method therefore you can simulate it as follows with our good of XMLHttpRequest object:
VBA / VBscript:

Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
XMLHTTP.Open "GET", "http://allegro.pl/Suggest/Index.php/suggestAjax?q=nike", False
XMLHTTP.setRequestHeader "Content-Type", "text/xml"
XMLHTTP.send
MsgBox XMLHTTP.ResponseText 'This will show a message box with the JSON

Simple right? No IE objects, selenium etc. Just some basic research and with just a few lines of codes you have the Allegro suggestion web service at your service :).

The approach

As you probably are already aware the approach basically requires doing some research on the website which you intend to scrape instead of immediately resorting to user interaction simulating techniques like Selenium.

I always proceed as follows:

  1. Analyze the web page HTML – verify which controls are used for input and which for user interaction (submitting the form).
  2. Analyze network calls – is the data you need to scrape contained in the HTML output of the web page or is it returned in a separate web service call (preferred). Are the HTTP requests sent via GET or POST HTTP methods? Simple web service calls are a blessing for every Web Scrapers. They significantly reduce the time needed to extract the data you need. Even if there are no web service calls and the data is returned within the HTML response body of the web page. Then no worries – use XPath, CSS selectors or just regex to extract the data you need. Remember don’t be swayed by POST HTTP requests! They are just as simple as GET!
  3. User interaction required – once every now and then there will be a tough nut to crack. This is indeed the time for Selenium or the IE object depending on your working environment. Sometimes it is just a matter of complicated Javascript calls, encoding or simply you don’t need that much performance for scraping just 20 web pages even if it takes a minute per page and don’t have the time for doing any research.

Want to earn money as a professional Web Scraper?

web scraping tutorial
You will find many blogs out there telling you can EASILY earn real money on the web – most are scams, while others won’t allow you to earn impressive money. Web Scraping is an honest way of making actual money in a repeatable manner by selling scraped data, making online Internet analyses or simply taking freelance web-scraping jobs. A couple of hours of work may allow you to earn from 50$ to even over 500$ in more complex Web Scraping tasks and sometimes even much more when you need to build a whole website built on the idea of simply scraping data of other websites (not recommended!- but look at how many websites are mirrors of the popular StackOverflow site).

Ok, I know how to scrape data. What now?

If you want to earn money by selling Internet data or taking freelance Web Scraping jobs – sign-up to one of these popular Freelance job websites:

Keep in mind, however, that the competition is fierce. Usually Web Scraping jobs have 10 or more applicants. Here are a couple of tips for you beginners:

  • Start immediately and show it off! – this is a risky technique but often works. Invest your time by doing the job ahead and include a video or demo in your proposal. This will prove to your client that the risk with hiring you for the job is low
  • Start cheap – I know your time is worth more that 10$ per hour. But start cheap and build your credibility. Clients won’t pay premium for someone who doesn’t have anything to show for it. Start with small Web Scraping jobs and charge not more than 30-50$
  • Be quick – winning a freelance Web-Scraping job is a race. Often jobs are small and clients expect immediate results or quick proposals/responses. To win – be quick with your proposal and best include a video or demo
  • Demo! Demo! Demo! – share demos of your jobs with your client. I personally think videos are the best option to go with. If you want to win a job – include in your proposal a recent video of your latest similar Web-Scraping job or an actual demo of what the client wants. This has a high probability of winning you some credibility with your potential client

GUI Testing – what to use?

So you wanna learn Web Scraping to test your Web Application GUI? Which language/framework to use? Honestly the options are plenty see here a comprehensive list on Wiki of all the GUI testing tools out there.

Unfortunately, there is no easy answer to this question as you probably will prefer to use a framework or programming language that is closer to your original application environment. If you are however framework/language agnostic I personally can suggest using Selenium. Selenium automates browser interaction and provides a wide range of supported browsers (from Chrome to IE). I use the Python implementation.

Summary

Hopefully you will appreciate this end-to-end Web Scraping Tutorial. I have introduced you to all basic and advanced methods for Web Scraping, Web Crawling and even simulating user interaction. If you are able to leverage all techniques you can definitely consider yourself a professional Web Scraper.

Web Scraping is really a lot of fun and open you up to all the resources of the Internet. Web Scraping can come in handy:

  • Scraping data of websites
  • Crawling websites
  • Making last moment bids in Internet auctions
  • SEO

You can basically make a living in the Internet as a Web Scraper. There are many tools out there to do the job like Import.io. But none are simple and versatile enough to tackle every Web Scraping / Crawling task.

Let me know what you think in the comments!

Next steps

Want to learn more on Web Scraping? Checkout these links:
Web Scraping Tools – ranking of easy to use scraping tools, mapping of scraping tools for programming languages
EXCEL: Excel Scrape HTML Add-In
EXCEL: Simple class for using IE automation in VBA