Google is today’s entry point to the world greatest resource – information. If something can’t be found in Google it well can mean it is not worth finding. Similarly SEO experts blog and write about how to optimize your web pages to rank best in Google Search results often ignoring other search engines which contribution to page impressions & clicks is almost irrelevant nowadays due to the G-Monopoly. Naturally there are tons of tools out there for scraping Google Search results, which I don’t intend to compete with. On the other hand even Google makes scraping it’s web results dead easy by facilitating its dedicated Web API. My goal in this post we be to show how easy it is to utilize Excel & VBA to scrape those search results into a simple CSV file which as little effort as possible.
Before you start reading on remember that violation of Googles TOS will get you temporarily banned!
Google Ajax Search API
The obvious way in which we obtain Google Search results is via Googles Search Page. However, such HTTP requests return lot’s of unnecessary information (a whole HTML web page).
In my case a simple “cats and dogs” Google Search almost 1 second and contains over 300kb of data, as according to pingdom:
https://www.google.com.ua/?q=cats+and+dogs#
Thankfully Google is not oblivious to the fact that many people want to scrape it’s search results and facilitates a simple AJAX Google Search API. Let’s see how that compares. In my case the AJAX API call of “cats and dogs” returned in approx. 400 miliseconds and contained less than 2kb of data:
https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_LICENSE_KEY&cx=017576662512468239146:omuauf_lfve&q=cats%20and%20dogs
That’s an improvement right? The results are returned in JSON and we can leverage a whole range of parameters.
JSON Google Search Results format
The AJAX Google Search Web API returns results in JSON. To be able to scrape these results we need to understand the format in which Google returns these results.
The only part of the JSON query that interests us is the “results”: [ … ] section. This is the array that will iterate through the results providing the following parameters separated by commas:
- “GsearchResultClass” – “GwebSearch” for simple text search
- “unescapedUrl” – the unescaped URL of the results (e.g. http://en.wikipedia.org/wiki/Paris_Hilton)
- “url” – escaped url
- “visibleUrl” – shortened version of the URL of the result, stripped of protocol and path
- “cacheUrl” – cached Google url
- “title” – title of the page
- “titleNoFormatting” – tile of the page w/o formatting
- “content” – brief snippet of information from the page
Great. Seems like we have our data source and query string. Let’s get started with leveraging that in Excel VBA.
Scrape Google Search Results using VBA
As we know the structure of each Google Result let’s follow programming best practices and create a suitable VBA Class (I will explain why not a Type structure afterwards):
Public unescapedUrl As String Public url As String Public visibleUrl As String Public cacheUrl As String Public title As String Public titleNoFormatting As String Public content As String
Each object of type VBA Class GoogleSearchResult will represent a single search result record. To download the records and save the data as CSV file I have divided the task into 3 distinct steps:
- Step 1: Get the JSON query result
- Step 2: Extract the query results to a VBA Collection
- Step 3: Write the data to CSV
Below the main Sub procedure SaveGoogleResultsCSV:
Sub SaveGoogleResultsCSV() Dim res As String, resCol As Collection 'Step 1 res = GetResult("https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_LICENSE_KEY&cx=017576662512468239146:omuauf_lfve&q=cats%20and%20dogs") 'Step 2 Set resCol = ExtractResults(res) 'Step 3 SaveToCSV resCol, "C:\results.csv" End Sub
Below each section will address these steps.
Get the JSON query result
To scrape the JSON results we can welcome the XMLHttpObject which will efficiently download the entire HTTP response of the query:
Function GetResult(url As String) As String Dim XMLHTTP As Object, ret As String Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP") XMLHTTP.Open "GET", url, False XMLHTTP.setRequestHeader "Content-Type", "text/xml" XMLHTTP.setRequestHeader "Cache-Control", "no-cache" XMLHTTP.setRequestHeader "Pragma", "no-cache" XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0" XMLHTTP.send ret = XMLHTTP.ResponseText GetResult = ret End Function
Extract the query results to a VBA Collection
The JSON result is mashed up and unfortunately is harder to read than an XML which we could more easily read in VBA. In this case we need to revert to using VBA Regex:
Function ExtractResults(res As String) As Collection Dim resCol As Collection, tmp As String, matches, match, subMatch, GoogleRes As GoogleSearchResult Set resCol = New Collection Dim regex As Object, str As String Set regex = CreateObject("VBScript.RegExp") With regex .Pattern = """results"":\[([^\]]+)\]": .Global = False End With Set matches = regex.Execute(res) tmp = matches(0).SubMatches(0) With regex .Pattern = "\{([^\}]+)\}": .Global = True End With Set matches = regex.Execute(tmp) For Each match In matches If match.SubMatches.Count > 0 Then For Each subMatch In match.SubMatches Set GoogleRes = New GoogleSearchResult GoogleRes.cacheUrl = ExtractAttribute(CStr(subMatch), "formattedUrl") GoogleRes.content = ExtractAttribute(CStr(subMatch), "snippet") GoogleRes.title = ExtractAttribute(CStr(subMatch), "title") resCol.Add GoogleRes Next subMatch End If Next match Set ExtractResults = resCol End Function Function ExtractResult(res As String, attrib As String) As String Dim regex As Object, str As String Set regex = CreateObject("VBScript.RegExp") With regex .Pattern = "{.*}": .Global = False End With Set matches = regex.Execute(str) ExtractAttribute = matches.SubMatches(0) End Function Function ExtractAttribute(res As String, attrib As String) As String Dim regex As Object, str As String, matches Set regex = CreateObject("VBScript.RegExp") With regex .Pattern = """" & attrib & """:""([^\""]*)""": .Global = False End With Set matches = regex.Execute(res) ExtractAttribute = matches(0).SubMatches(0) End Function
Write the data to CSV
Lastly having all results uploaded to our resCol VBA Collection object we can write all the records to an CSV file:
Sub SaveToCSV(resCol As Collection, fileName As String) Dim textData As String, delimiter As String, it As GoogleSearchResult, fileNo As Integer delimiter = ";" For Each it In resCol textData = textData & it.cacheUrl & _ delimiter & it.content & _ delimiter & it.title & _ delimiter & it.cacheUrl & vbNewLine Next it fileNo = FreeFile Open fileName For Output As #fileNo Print #fileNo, textData Close #fileNo End Sub
Google Search API parameters
Listed below are some key parameters that will definitely prove useful. You can find the rest here.
- rsz – the number of results (e.g. rsz=4 will be 4 results)
- hl – host language (e.g. hl=en for english)
- start – the start index for the search results
A simple example below:
https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_LICENSE_KEY&cx=017576662512468239146:omuauf_lfve&q=cats%20and%20dogs&hl=pl&rsz=2
Will return 2 results, for host language Polish for the query “cats and dogs”.
Proxies
Do read my post on using proxies HTTP servers to distribute your queries.
Conclusions
Running the SaveGoogleResultsCSV procedure will provide us with a single result – as CSV file:
Obviously we can also refrain from actually saving the results and proceed with traversing our list of results to do an analysis or to save them to an Excel worksheet. For this we need only traverse the resCol VBA Collection and read only the attributes we need.
Before you start using the Google API extensively be sure to read Google’s Terms of service. Remember – scrape Google Search Results in a responsible way!