Tag Archives: Google

Scrape Google Search Results to CSV

Scrape Google Search Results to CSV using VBA

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 5.00 out of 5)
Loading...

Google is today’s entry point to the world greatest resource – information. If something can’t be found in Google it well can mean it is not worth finding. Similarly SEO experts blog and write about how to optimize your web pages to rank best in Google Search results often ignoring other search engines which contribution to page impressions & clicks is almost irrelevant nowadays due to the G-Monopoly. Naturally there are tons of tools out there for scraping Google Search results, which I don’t intend to compete with. On the other hand even Google makes scraping it’s web results dead easy by facilitating its dedicated Web API. My goal in this post we be to show how easy it is to utilize Excel & VBA to scrape those search results into a simple CSV file which as little effort as possible.

Before you start reading on remember that violation of Googles TOS will get you temporarily banned!

As Google has retired its old AJAX Query API the post has been updated to use the new Custom Search JSON/Atom API. However I have not tested the new code yet :) so let me know if it works

Google Ajax Search API

The obvious way in which we obtain Google Search results is via Googles Search Page. However, such HTTP requests return lot’s of unnecessary information (a whole HTML web page).

In my case a simple “cats and dogs” Google Search almost 1 second and contains over 300kb of data, as according to pingdom:

https://www.google.com.ua/?q=cats+and+dogs#
HTTP Google Request stats
HTTP Google Request stats

Thankfully Google is not oblivious to the fact that many people want to scrape it’s search results and facilitates a simple AJAX Google Search API. Let’s see how that compares. In my case the AJAX API call of “cats and dogs” returned in approx. 400 miliseconds and contained less than 2kb of data:

https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_LICENSE_KEY&cx=017576662512468239146:omuauf_lfve&q=cats%20and%20dogs

REST Google API Request stats
REST Google API Request stats

That’s an improvement right? The results are returned in JSON and we can leverage a whole range of parameters.

JSON Google Search Results format

The AJAX Google Search Web API returns results in JSON. To be able to scrape these results we need to understand the format in which Google returns these results.

The only part of the JSON query that interests us is the “results”: [ … ] section. This is the array that will iterate through the results providing the following parameters separated by commas:

  • “GsearchResultClass” – “GwebSearch” for simple text search
  • “unescapedUrl” – the unescaped URL of the results (e.g. http://en.wikipedia.org/wiki/Paris_Hilton)
  • “url” – escaped url
  • “visibleUrl” – shortened version of the URL of the result, stripped of protocol and path
  • “cacheUrl” – cached Google url
  • “title” – title of the page
  • “titleNoFormatting” – tile of the page w/o formatting
  • “content” – brief snippet of information from the page

Great. Seems like we have our data source and query string. Let’s get started with leveraging that in Excel VBA.

Scrape Google Search Results using VBA

As we know the structure of each Google Result let’s follow programming best practices and create a suitable VBA Class (I will explain why not a Type structure afterwards):

Public unescapedUrl As String
Public url As String
Public visibleUrl As String
Public cacheUrl As String
Public title As String
Public titleNoFormatting As String
Public content As String

Each object of type VBA Class GoogleSearchResult will represent a single search result record. To download the records and save the data as CSV file I have divided the task into 3 distinct steps:

Below the main Sub procedure SaveGoogleResultsCSV:

Sub SaveGoogleResultsCSV()
    Dim res As String, resCol As Collection
    'Step 1
    res = GetResult("https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_LICENSE_KEY&cx=017576662512468239146:omuauf_lfve&q=cats%20and%20dogs")
    'Step 2
    Set resCol = ExtractResults(res)
    'Step 3
    SaveToCSV resCol, "C:\results.csv"
End Sub

Below each section will address these steps.

Get the JSON query result

To scrape the JSON results we can welcome the XMLHttpObject which will efficiently download the entire HTTP response of the query:

Function GetResult(url As String) As String
    Dim XMLHTTP As Object, ret As String
    Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "Cache-Control", "no-cache"
    XMLHTTP.setRequestHeader "Pragma", "no-cache"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    XMLHTTP.send
    ret = XMLHTTP.ResponseText
    GetResult = ret
End Function

Extract the query results to a VBA Collection

The JSON result is mashed up and unfortunately is harder to read than an XML which we could more easily read in VBA. In this case we need to revert to using VBA Regex:

Function ExtractResults(res As String) As Collection
    Dim resCol As Collection, tmp As String, matches, match, subMatch, GoogleRes As GoogleSearchResult
    Set resCol = New Collection
    Dim regex As Object, str As String
    Set regex = CreateObject("VBScript.RegExp")
    With regex
      .Pattern = """results"":\[([^\]]+)\]": .Global = False
    End With
    Set matches = regex.Execute(res)
    tmp = matches(0).SubMatches(0)
    With regex
      .Pattern = "\{([^\}]+)\}": .Global = True
    End With
    Set matches = regex.Execute(tmp)
    For Each match In matches
      If match.SubMatches.Count > 0 Then
        For Each subMatch In match.SubMatches
          Set GoogleRes = New GoogleSearchResult
          GoogleRes.cacheUrl = ExtractAttribute(CStr(subMatch), "formattedUrl")
          GoogleRes.content = ExtractAttribute(CStr(subMatch), "snippet")
          GoogleRes.title = ExtractAttribute(CStr(subMatch), "title")
          resCol.Add GoogleRes
        Next subMatch
      End If
    Next match
    Set ExtractResults = resCol
End Function
Function ExtractResult(res As String, attrib As String) As String
    Dim regex As Object, str As String
    Set regex = CreateObject("VBScript.RegExp")
    With regex
      .Pattern = "{.*}": .Global = False
    End With
    Set matches = regex.Execute(str)
    ExtractAttribute = matches.SubMatches(0)
End Function
Function ExtractAttribute(res As String, attrib As String) As String
    Dim regex As Object, str As String, matches
    Set regex = CreateObject("VBScript.RegExp")
    With regex
      .Pattern = """" & attrib & """:""([^\""]*)""": .Global = False
    End With
    Set matches = regex.Execute(res)
    ExtractAttribute = matches(0).SubMatches(0)
End Function

Write the data to CSV

Lastly having all results uploaded to our resCol VBA Collection object we can write all the records to an CSV file:

Sub SaveToCSV(resCol As Collection, fileName As String)
    Dim textData As String, delimiter As String, it As GoogleSearchResult, fileNo As Integer
    delimiter = ";"
    For Each it In resCol
        textData = textData & it.cacheUrl & _
            delimiter & it.content & _
            delimiter & it.title & _
            delimiter & it.cacheUrl & vbNewLine
    Next it
    fileNo = FreeFile
    Open fileName For Output As #fileNo
    Print #fileNo, textData
    Close #fileNo
End Sub

Google Search API parameters

Listed below are some key parameters that will definitely prove useful. You can find the rest here.

  • rsz – the number of results (e.g. rsz=4 will be 4 results)
  • hl – host language (e.g. hl=en for english)
  • start – the start index for the search results

A simple example below:

https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_LICENSE_KEY&cx=017576662512468239146:omuauf_lfve&q=cats%20and%20dogs&hl=pl&rsz=2

Will return 2 results, for host language Polish for the query “cats and dogs”.

Proxies

Do read my post on using proxies HTTP servers to distribute your queries.

Conclusions

Running the SaveGoogleResultsCSV procedure will provide us with a single result – as CSV file:

Scrape Google Results CSV
Scrape Google Results CSV

Obviously we can also refrain from actually saving the results and proceed with traversing our list of results to do an analysis or to save them to an Excel worksheet. For this we need only traverse the resCol VBA Collection and read only the attributes we need.

Before you start using the Google API extensively be sure to read Google’s Terms of service. Remember – scrape Google Search Results in a responsible way!

VBA PageSpeed

Using PageSpeed Insights API with Excel VBA (VBA PageSpeed)

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 4.50 out of 5)
Loading...

Google API are very useful and easy to use. Especially since queries are sent via GET and results are provided in JSON which is easy to scrape, from which data can be easily extracted. I always like to play with Google API as Google has such a vast range of different API addressing almost any area of the Internet.

From time to time I intend to share different examples of uses of the Google API using VBA as I think it may be of particular interest to some of my audiences. So let’s jump right in to the VBA PageSpeed guide.

What is the Google PageSpeed Insights API?

The Google PageSpeed Insights API runs an analysis on the page at the specified URL, and returns:

  • PageSpeed scores
  • Information on the URL resources (JS, CSS images)
  • Lists suggestions to make that page faster, and other information.

As you can see this information may be of particular interest to SEO experts, website admins or just a regular blogger like myself. There are literally multitudes of websites out living off this single Web API by sharing SEO and website stats with their users. Why use the middle man when you can easily use this Web API call yourself, and more easily even in Excel!

If you want to play around with the Google PageSpeed Insights API directly go here:
https://developers.google.com/apis-explorer/#p/pagespeedonline/v2/pagespeedonline.pagespeedapi.runpagespeed

The code (VBA PageSpeed)

Let’s start with defining some useful objects we will need for the procedure:

Enum Strategy
    Desktop
    Mobile
End Enum

Type PageSpeed
    responseCode As Long
    speedScore As Long
    'PageStats
    pageStats_numberResourses As Long
    pageStats_numberHosts As Long
    pageStats_totalRequestBytes As Long
    pageStats_numberStaticResources As Long
    pageStats_htmlResponseBytes As Long
    pageStats_cssResponseBytes As Long
    pageStats_imageResponseBytes  As Long
    pageStats_javascriptResponseBytes As Long
    pageStats_otherResponseBytes  As Long
    pageStats_numberJsResources As Long
    pageStats_numberCssResources As Long
End Type

Strategy – this enumeration lists the two possible options. Mobile – analyzing the URL for mobile devices, and Desktop – analyzing the URL for desktop devices. Pretty straightforward.

PageSpeed – this is the (partial) result of the API call. In this example I am only extracting the general pagespeed information like:

  • Response coderead here how to interpret these codes. Generally you would prefer this to be 200 (OK)
  • Speed Score – how well the URL scores in terms of speed. Read more here.
  • Other – number of page resources, JS/CSS/image files etc.

I intentionally ignored the other tons of info provided by the Google PageSpeed API response. You can always extract this data yourself by building up the VBA code and creating a regular expression that will match what you need.

The GetPageSpeed function

Public Function GetPageSpeed(url As String, analysisStrategy As Strategy, filterThirdPartyResources As Boolean) As PageSpeed
    Dim callURL As String, pageSpeedRes As PageSpeed, objHTTP As Object
    callURL = "https://www.googleapis.com/pagespeedonline/v2/runPagespeed?url=http%3A%2F%2F" & url & "&filter_third_party_resources=" & IIf(filterThirdPartyResources, "true", "false") & "&strategy=" & IIf(analysisStrategy = Desktop, "desktop", "mobile")
    Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    objHTTP.Open "GET", callURL, False
    objHTTP.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
    objHTTP.send ("")
    If InStr(objHTTP.responseText, """SPEED""") = 0 Then GoTo ErrorHandl
    '---Extract PageSpeed results---
    On Error Resume Next
    Dim regex As Object: Set regex = CreateObject("VBScript.RegExp"): regex.Global = False
    regex.Pattern = """responseCode"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.responseCode = CLng(matches(0).SubMatches(0))
    regex.Pattern = """score"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.speedScore = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberResources"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberResourses = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberHosts"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberHosts = CLng(matches(0).SubMatches(0))
    regex.Pattern = """totalRequestBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_totalRequestBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberStaticResources"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberStaticResources = CLng(matches(0).SubMatches(0))
    regex.Pattern = """htmlResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_htmlResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """cssResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_cssResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """imageResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_imageResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """javascriptResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_javascriptResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """otherResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_otherResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberJsResources"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberJsResources = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberCssResources"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberCssResources = CLng(matches(0).SubMatches(0))
    Set regex = Nothing: Set objHTTP = Nothing
    '---Return result---
    GetPageSpeed = pageSpeedRes
    Exit Function
ErrorHandl:
End Function

Ok. Now let’s put it to the test:

Sub TestPageSpeed()
    Dim res As PageSpeed
    res = GetPageSpeed("www.google.com", Desktop, True)
    Debug.Print "Responsecode: " & res.responseCode
    Debug.Print "Speedscore: " & res.speedScore
    Debug.Print "Number of resources: " & res.pageStats_numberResourses
    Debug.Print "Number of hosts: " & res.pageStats_numberHosts
    Debug.Print "Total request bytes: " & res.pageStats_totalRequestBytes
    Debug.Print "Number of static resources: " & res.pageStats_numberStaticResources
    Debug.Print "HTML response bytes: " & res.pageStats_htmlResponseBytes
    Debug.Print "CSS response bytes: " & res.pageStats_cssResponseBytes
    Debug.Print "Image response bytes: " & res.pageStats_imageResponseBytes
    Debug.Print "JS response bytes: " & res.pageStats_javascriptResponseBytes
    Debug.Print "dOther response bytes: " & res.pageStats_otherResponseBytes
    Debug.Print "Number of JS resources: " & res.pageStats_numberJsResources
    Debug.Print "Number of CSS resources: " & res.pageStats_numberCssResources
End Sub

The results for Google (obviously speed score = 99!!!).

VBA PageSpeed: Google
VBA PageSpeed: Google

Let me know what you think and if you want more cool examples of using the Google APIs with VBA.

map

Get Google Maps address coordinates (latitude & longitude) in Excel VBA

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 4.50 out of 5)
Loading...

Continuing my last post, on how to get the distance between any addresses using VBA, I wanted to add a capability to my previous Excel file that would allow me to get any address coordinates. How to quickly get the lat. and long. of an address? Again why not use Google API…

Address coordinates in Excel

Sub GetLocation(address As String, ByRef lat As String, ByRef lng As String)
    Dim firstVal As String, secondVal As String, lastVal As String
    firstVal = "http://maps.googleapis.com/maps/api/geocode/json?address="
    lastVal = "&sensor=false"
    Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    URL = firstVal & Replace(address, " ", "+") & lastVal
    objHTTP.Open "GET", URL, False
    objHTTP.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
    objHTTP.send ("")
    If InStr(objHTTP.responseText, """lat""") = 0 Then GoTo ErrorHandl
    tmpVal = Right(objHTTP.responseText, Len(objHTTP.responseText) - InStr(objHTTP.responseText, """lat"" : ") - 7)
    lat = Split(tmpVal, ",")(0)
    tmpVal = Right(objHTTP.responseText, Len(objHTTP.responseText) - InStr(objHTTP.responseText, """lng"" : ") - 7)
    lng = Replace(Split(tmpVal, "}")(0), " ", "")
    Exit Sub
ErrorHandl:
    lat = lng = 0
End Sub

This function can easily be used in any VBA algorithm like this:

Dim lat1 as String, lng1 as String
call GetLocation("Al. Jerozolimskie 2, Warsaw", lat:=lat1, lng:=lng1)

In the example above the address coordinates (latitude and longitude) will be returned to lat1 and lng1 variables.

Limits and common issues

Google limits the amount of free queries you can execute. Considering the limitations and common issues please read this section of my post.