Scrape Google Search Results to CSV

Scrape Google Search Results to CSV using VBA

Google is today’s entry point to the world greatest resource – information. If something can’t be found in Google it well can mean it is not worth finding. Similarly SEO experts blog and write about how to optimize your web pages to rank best in Google Search results often ignoring other search engines which contribution to page impressions & clicks is almost irrelevant nowadays due to the G-Monopoly. Naturally there are tons of tools out there for scraping Google Search results, which I don’t intend to compete with. On the other hand even Google makes scraping it’s web results dead easy by facilitating its dedicated Web API. My goal in this post we be to show how easy it is to utilize Excel & VBA to scrape those search results into a simple CSV file which as little effort as possible.

Before you start reading on remember that violation of Googles TOS will get you temporarily banned!

As Google has retired its old AJAX Query API the post has been updated to use the new Custom Search JSON/Atom API. However I have not tested the new code yet 🙂 so let me know if it works

Google Ajax Search API

The obvious way in which we obtain Google Search results is via Googles Search Page. However, such HTTP requests return lot’s of unnecessary information (a whole HTML web page).

In my case a simple “cats and dogs” Google Search almost 1 second and contains over 300kb of data, as according to pingdom:

https://www.google.com.ua/?q=cats+and+dogs#
HTTP Google Request stats

Thankfully Google is not oblivious to the fact that many people want to scrape it’s search results and facilitates a simple AJAX Google Search API. Let’s see how that compares. In my case the AJAX API call of “cats and dogs” returned in approx. 400 miliseconds and contained less than 2kb of data:

https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_LICENSE_KEY&cx=017576662512468239146:omuauf_lfve&q=cats%20and%20dogs

REST Google API Request stats
That’s an improvement right? The results are returned in JSON and we can leverage a whole range of parameters.

JSON Google Search Results format

The AJAX Google Search Web API returns results in JSON. To be able to scrape these results we need to understand the format in which Google returns these results.

The only part of the JSON query that interests us is the “results”: [ … ] section. This is the array that will iterate through the results providing the following parameters separated by commas:

  • “GsearchResultClass” – “GwebSearch” for simple text search
  • “unescapedUrl” – the unescaped URL of the results (e.g. http://en.wikipedia.org/wiki/Paris_Hilton)
  • “url” – escaped url
  • “visibleUrl” – shortened version of the URL of the result, stripped of protocol and path
  • “cacheUrl” – cached Google url
  • “title” – title of the page
  • “titleNoFormatting” – tile of the page w/o formatting
  • “content” – brief snippet of information from the page

Great. Seems like we have our data source and query string. Let’s get started with leveraging that in Excel VBA.

Scrape Google Search Results using VBA

As we know the structure of each Google Result let’s follow programming best practices and create a suitable VBA Class (I will explain why not a Type structure afterwards):

Public unescapedUrl As String
Public url As String
Public visibleUrl As String
Public cacheUrl As String
Public title As String
Public titleNoFormatting As String
Public content As String

Each object of type VBA Class GoogleSearchResult will represent a single search result record. To download the records and save the data as CSV file I have divided the task into 3 distinct steps:

Below the main Sub procedure SaveGoogleResultsCSV:

Sub SaveGoogleResultsCSV()
    Dim res As String, resCol As Collection
    'Step 1
    res = GetResult("https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_LICENSE_KEY&cx=017576662512468239146:omuauf_lfve&q=cats%20and%20dogs")
    'Step 2
    Set resCol = ExtractResults(res)
    'Step 3
    SaveToCSV resCol, "C:\results.csv"
End Sub

Below each section will address these steps.

See also  Excel Scrape HTML Tool added to the Scrape HTML Add-In

Get the JSON query result

To scrape the JSON results we can welcome the XMLHttpObject which will efficiently download the entire HTTP response of the query:

Function GetResult(url As String) As String
    Dim XMLHTTP As Object, ret As String
    Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "Cache-Control", "no-cache"
    XMLHTTP.setRequestHeader "Pragma", "no-cache"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    XMLHTTP.send
    ret = XMLHTTP.ResponseText
    GetResult = ret
End Function

Extract the query results to a VBA Collection

The JSON result is mashed up and unfortunately is harder to read than an XML which we could more easily read in VBA. In this case we need to revert to using VBA Regex:

Function ExtractResults(res As String) As Collection
    Dim resCol As Collection, tmp As String, matches, match, subMatch, GoogleRes As GoogleSearchResult
    Set resCol = New Collection
    Dim regex As Object, str As String
    Set regex = CreateObject("VBScript.RegExp")
    With regex
      .Pattern = """results"":\[([^\]]+)\]": .Global = False
    End With
    Set matches = regex.Execute(res)
    tmp = matches(0).SubMatches(0)
    With regex
      .Pattern = "\{([^\}]+)\}": .Global = True
    End With
    Set matches = regex.Execute(tmp)
    For Each match In matches
      If match.SubMatches.Count > 0 Then
        For Each subMatch In match.SubMatches
          Set GoogleRes = New GoogleSearchResult
          GoogleRes.cacheUrl = ExtractAttribute(CStr(subMatch), "formattedUrl")
          GoogleRes.content = ExtractAttribute(CStr(subMatch), "snippet")
          GoogleRes.title = ExtractAttribute(CStr(subMatch), "title")
          resCol.Add GoogleRes
        Next subMatch
      End If
    Next match
    Set ExtractResults = resCol
End Function
Function ExtractResult(res As String, attrib As String) As String
    Dim regex As Object, str As String
    Set regex = CreateObject("VBScript.RegExp")
    With regex
      .Pattern = "{.*}": .Global = False
    End With
    Set matches = regex.Execute(str)
    ExtractAttribute = matches.SubMatches(0)
End Function
Function ExtractAttribute(res As String, attrib As String) As String
    Dim regex As Object, str As String, matches
    Set regex = CreateObject("VBScript.RegExp")
    With regex
      .Pattern = """" & attrib & """:""([^\""]*)""": .Global = False
    End With
    Set matches = regex.Execute(res)
    ExtractAttribute = matches(0).SubMatches(0)
End Function

Write the data to CSV

Lastly having all results uploaded to our resCol VBA Collection object we can write all the records to an CSV file:

Sub SaveToCSV(resCol As Collection, fileName As String)
    Dim textData As String, delimiter As String, it As GoogleSearchResult, fileNo As Integer
    delimiter = ";"
    For Each it In resCol
        textData = textData & it.cacheUrl & _
            delimiter & it.content & _
            delimiter & it.title & _
            delimiter & it.cacheUrl & vbNewLine
    Next it
    fileNo = FreeFile
    Open fileName For Output As #fileNo
    Print #fileNo, textData
    Close #fileNo
End Sub

Google Search API parameters

Listed below are some key parameters that will definitely prove useful. You can find the rest here.

  • rsz – the number of results (e.g. rsz=4 will be 4 results)
  • hl – host language (e.g. hl=en for english)
  • start – the start index for the search results

A simple example below:

https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_LICENSE_KEY&cx=017576662512468239146:omuauf_lfve&q=cats%20and%20dogs&hl=pl&rsz=2

Will return 2 results, for host language Polish for the query “cats and dogs”.

Proxies

Do read my post on using proxies HTTP servers to distribute your queries.

Conclusions

Running the SaveGoogleResultsCSV procedure will provide us with a single result – as CSV file:
Scrape Google Results CSV
Obviously we can also refrain from actually saving the results and proceed with traversing our list of results to do an analysis or to save them to an Excel worksheet. For this we need only traverse the resCol VBA Collection and read only the attributes we need.

Before you start using the Google API extensively be sure to read Google’s Terms of service. Remember – scrape Google Search Results in a responsible way!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.