Category Archives: Web

Using PageSpeed Insights API with Excel VBA (VBA PageSpeed)


Google APIs are very useful and easy to use. Queries are sent via GET and results are returned as JSON, from which data can easily be extracted. I always like to play with Google APIs, as Google offers such a vast range of APIs addressing almost any area of the Internet.

From time to time I intend to share different examples of using the Google APIs with VBA, as I think they may be of particular interest to some of my audience. So let's jump right into the VBA PageSpeed guide.

What is the Google PageSpeed Insights API?

The Google PageSpeed Insights API runs an analysis on the page at the specified URL, and returns:

  • PageSpeed scores
  • Information on the URL's resources (JS, CSS, images)
  • Suggestions to make that page faster, and other information

As you can see, this information may be of particular interest to SEO experts, website admins or just regular bloggers like myself. There are literally multitudes of websites out there living off this single Web API by sharing SEO and website stats with their users. Why use a middleman when you can easily make this Web API call yourself, even in Excel!

If you want to play around with the Google PageSpeed Insights API directly go here:
https://developers.google.com/apis-explorer/#p/pagespeedonline/v2/pagespeedonline.pagespeedapi.runpagespeed

The code (VBA PageSpeed)

Let’s start with defining some useful objects we will need for the procedure:

Enum Strategy
    Desktop
    Mobile
End Enum

Type PageSpeed
    responseCode As Long
    speedScore As Long
    'PageStats
    pageStats_numberResourses As Long
    pageStats_numberHosts As Long
    pageStats_totalRequestBytes As Long
    pageStats_numberStaticResources As Long
    pageStats_htmlResponseBytes As Long
    pageStats_cssResponseBytes As Long
    pageStats_imageResponseBytes  As Long
    pageStats_javascriptResponseBytes As Long
    pageStats_otherResponseBytes  As Long
    pageStats_numberJsResources As Long
    pageStats_numberCssResources As Long
End Type

Strategy – this enumeration lists the two possible analysis options: Mobile – analyze the URL for mobile devices, and Desktop – analyze the URL for desktop devices. Pretty straightforward.

PageSpeed – this is the (partial) result of the API call. In this example I am only extracting the general PageSpeed information such as:

  • Response code – read here how to interpret these codes. Generally you would prefer this to be 200 (OK)
  • Speed Score – how well the URL scores in terms of speed. Read more here.
  • Other – number of page resources, JS/CSS/image files etc.

I intentionally ignored the tons of other info provided by the Google PageSpeed API response. You can always extract this data yourself by extending the VBA code and creating a regular expression that will match what you need.
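If you do, a small helper can save you from repeating the regex boilerplate. Below is a minimal sketch (the ExtractNumericField name is mine and not part of the original code) that pulls any numeric field out of the raw JSON response:

'Hypothetical helper - extracts the first numeric value following a given JSON field name
Function ExtractNumericField(json As String, fieldName As String) As Long
    Dim regex As Object, matches As Object
    Set regex = CreateObject("VBScript.RegExp")
    regex.Global = False
    regex.Pattern = """" & fieldName & """.*?([0-9]+)"
    Set matches = regex.Execute(json)
    If matches.Count > 0 Then ExtractNumericField = CLng(matches(0).SubMatches(0))
End Function

With it, each extraction in GetPageSpeed boils down to a single line, e.g. pageSpeedRes.pageStats_numberHosts = ExtractNumericField(objHTTP.responseText, "numberHosts").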

The GetPageSpeed function

Public Function GetPageSpeed(url As String, analysisStrategy As Strategy, filterThirdPartyResources As Boolean) As PageSpeed
    Dim callURL As String, pageSpeedRes As PageSpeed, objHTTP As Object, matches As Object
    callURL = "https://www.googleapis.com/pagespeedonline/v2/runPagespeed?url=http%3A%2F%2F" & url & _
        "&filter_third_party_resources=" & IIf(filterThirdPartyResources, "true", "false") & _
        "&strategy=" & IIf(analysisStrategy = Desktop, "desktop", "mobile")
    Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    objHTTP.Open "GET", callURL, False
    objHTTP.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
    objHTTP.send ("")
    If InStr(objHTTP.responseText, """SPEED""") = 0 Then GoTo ErrorHandl
    '---Extract PageSpeed results---
    On Error Resume Next
    Dim regex As Object: Set regex = CreateObject("VBScript.RegExp"): regex.Global = False
    regex.Pattern = """responseCode"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.responseCode = CLng(matches(0).SubMatches(0))
    regex.Pattern = """score"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.speedScore = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberResources"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberResourses = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberHosts"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberHosts = CLng(matches(0).SubMatches(0))
    regex.Pattern = """totalRequestBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_totalRequestBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberStaticResources"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberStaticResources = CLng(matches(0).SubMatches(0))
    regex.Pattern = """htmlResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_htmlResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """cssResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_cssResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """imageResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_imageResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """javascriptResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_javascriptResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """otherResponseBytes"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_otherResponseBytes = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberJsResources"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberJsResources = CLng(matches(0).SubMatches(0))
    regex.Pattern = """numberCssResources"".*?([0-9]+)": Set matches = regex.Execute(objHTTP.responseText)
    pageSpeedRes.pageStats_numberCssResources = CLng(matches(0).SubMatches(0))
    Set regex = Nothing: Set objHTTP = Nothing
    '---Return result---
    GetPageSpeed = pageSpeedRes
    Exit Function
ErrorHandl:
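    'On failure we simply fall through and return an empty (zeroed) PageSpeed structure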
End Function

Ok. Now let’s put it to the test:

Sub TestPageSpeed()
    Dim res As PageSpeed
    res = GetPageSpeed("www.google.com", Desktop, True)
    Debug.Print "Responsecode: " & res.responseCode
    Debug.Print "Speedscore: " & res.speedScore
    Debug.Print "Number of resources: " & res.pageStats_numberResourses
    Debug.Print "Number of hosts: " & res.pageStats_numberHosts
    Debug.Print "Total request bytes: " & res.pageStats_totalRequestBytes
    Debug.Print "Number of static resources: " & res.pageStats_numberStaticResources
    Debug.Print "HTML response bytes: " & res.pageStats_htmlResponseBytes
    Debug.Print "CSS response bytes: " & res.pageStats_cssResponseBytes
    Debug.Print "Image response bytes: " & res.pageStats_imageResponseBytes
    Debug.Print "JS response bytes: " & res.pageStats_javascriptResponseBytes
    Debug.Print "dOther response bytes: " & res.pageStats_otherResponseBytes
    Debug.Print "Number of JS resources: " & res.pageStats_numberJsResources
    Debug.Print "Number of CSS resources: " & res.pageStats_numberCssResources
End Sub

The results for Google (obviously speed score = 99!!!).

VBA PageSpeed: Google

Let me know what you think and if you want more cool examples of using the Google APIs with VBA.

Web Scraping Tools for Beginners and the Advanced


Web Scraping is a very broad topic and almost a separate profession. It is an especially valuable tool for SEO specialists, data scientists, analysts and many others. Because of this there are tons of tools out there, and trying to find the right one can be a real nightmare. To those who don't have the time, I dedicate this overview:

Ranking of Web Scraping tools/libraries (ease of use)

Level: Beginner

You need not be an expert coder to start extracting the data you need from websites! Web Scraping is a well known subject and many tools have been developed to make it easier to scrape HTML content. The tools below do not require any coding experience.

No. Name Comment
1 Excel Power Query (From Web)
  • Easy and quick to use
  • Limited to HTML tables
  • Available only in Excel 2010 and above (previous versions have the less useful "Data -> From Web" feature)
2 Web Scraper plugin for Chrome
  • Quite easy to learn
  • Easy to configure straight from Chrome
  • Highly configurable (many options for selecting/scraping elements)
  • Available only from Chrome
3 Import.io
  • Quite easy to learn
  • Includes a wizard which will walk you through the process
  • Can be used to scrape numerous subpages of a page
4 Scrape Box
  • Commercial (not free)
  • A simple app for scraping specific data off websites e.g. emails, links etc.
  • Easy configuration e.g. select the patterns you want to scrape and a list of websites/IPs
5 Scrape HTML Tool
  • Quite easy to learn. Requires only knowing how to build regular expressions (link to learn more on regex)
  • Provides simple UDF Excel functions
  • Provides integrated features to enhance Web Scraping performance (caching, automatic updating etc.)
  • Available only in Excel 2010 and above

Level: Advanced

The tools/libraries below require some coding experience.

No. Name Comment
1 Selenium (Python, C#, Java, R etc.)
  • The best kit for programmatically simulating web browser interaction
  • Available in most popular programming languages like Python, C#, Java (even for R – RSelenium)
  • Very easy to learn
  • Provides drivers for simulating user interaction in most browsers e.g. Chrome, FireFox, IE etc.
2 Scraper Wiki
  • A popular platform with web scraping tools
  • Easily transform web content into publicly (or privately) available data sets
  • Allows you to learn web scraping in popular technologies
  • Lets you order web scraping assistance
3 Kimono Labs
  • A popular web scraping website
  • Enables you to create easily available web interfaces to extract the web content you need
  • Interfaces available in multiple formats
4 Scrapy (Python)
  • Allows crawling (webspiders) /scraping websites
  • Moderate level of coding proficiency required
  • One of the most popular web scraping libraries

Web Scraping libraries by programming language

With so many programming languages around, there are multiple web scraping libraries available out there. Here you can find a short list of the most popular web scraping libraries for each programming language.

Language Web Scraping Libraries
.NET (e.g. C#)
  • Html Agility Pack
  • WatiN
Java
  • Tag Soup
  • HtmlUnit
  • Web-Harvest
  • jARVEST
  • jsoup
  • Jericho HTML Parser
JavaScript
  • node.io
  • phantomjs
PHP
  • htmlSQL
Python
  • Scrapy
  • Selenium-Python
  • Beautiful Soup
  • lxml
  • HTQL
  • Mechanize
R
  • Rvest
  • RSelenium
Ruby
  • Nokogiri
  • Hpricot
  • Mechanize
  • scrAPI
  • scRUBYt!
  • wombat
  • Watir

Web Scraping Tools for Data Scientists

Are you a data scientist looking for the best Web Scraping tools out there? Currently in data scientist communities (e.g. Kaggle), Python and R are the most highly regarded programming languages. Therefore find below a short list of libraries to consider for both:

Python: Scrapy, Selenium-Python, Beautiful Soup, lxml

R: rvest, RSelenium

Next steps

Want to learn more on Web Scraping? Check out these links:
Web Scraping Tutorial
Excel Scrape HTML Add-In

Multithreaded browser automation (VBA Web Scraping)


Web Scraping is very useful for getting the information you need directly off websites. Sometimes, however, simple browser automation is not enough in terms of performance. Having created both the IE and Parallel classes I decided to jump at the opportunity to create a simple example of how multithreaded browser automation can be achieved. Daniel Ferry achieved the same here. However, he used VBScript and a lot of coding was required to manage the swarm, as he called it. What I want to show you is rather an example of how you can combine the Parallel class and the IE class to achieve the same functionality in a much more concise and easy to manage way. So let's jump straight in.

What does it do?

The browser automation procedure queries the Google page and copies the first resulting link text to the Excel workbook. In the multithreaded example a "swarm" of 4 threads (or more if needed) is maintained to carry out simultaneous Google queries. This way the overall execution time is significantly reduced, as additional IE browser objects keep working while others are still waiting for a response.

A multithreaded browser automation example

Multithreaded browser automation: Video

Instead of going into the details I encourage you to watch this short video comparing single-threaded and multithreaded IE automation:

Download

Feel free to download the workbook here:

Next steps

Check out the deterministic IE automation class here:
EXCEL: Simple class for using IE automation in VBA

Check out the Parallel class multithreading tool here:
EXCEL: VBA Multithreading Tool

Simple class for browser automation in VBA


Web browser automation (using Microsoft's Web Browser control) is not an easy task in VBA when doing some Web Scraping. Excel is certainly a great tool for building your web automation scripts, but at every corner you find some obstacles. For me the most challenging one was always the nondeterministic state of the IE browser control when waiting for it to finish loading/refreshing a website.

Be sure to check out my VBA Web Scraping Kit which includes a wealth of VBA scripts for Web Scraping.

The problem

Most browser automation code out there includes the following lines of code whenever waiting for the Internet Explorer browser control to finish loading or refreshing:

While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
    DoEvents
Wend
'The row below may sometimes raise an error!
Set element = objIE.Document.GetElementById(id)

The problem is, however, that although IE might state that it has fully loaded the web page, this in fact does not always have to be the case, and the code might raise errors at seemingly random lines.

The solution

One way of getting around this problem is using arbitrary delays – leveraging Application.OnTime / Application.Wait / Sleep. This workaround, although somewhat useful, in most cases will be unsatisfactory, as it still does not guarantee that you will actually be able to access a web element or click a button that simply might not be loaded yet due to some JavaScript running in the background. So how do we solve this issue?
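For illustration, a minimal sketch of what that delay-based workaround typically looks like (the 2-second pause is an arbitrary value for this example; objIE and id are the same as in the snippet above):

objIE.Navigate "http://www.google.com"
While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
    DoEvents
Wend
'Arbitrary "safety" delay - no guarantee the page is actually ready afterwards
Application.Wait Now + TimeSerial(0, 0, 2)
Set element = objIE.Document.GetElementById(id)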

My answer? Wait until the element appears accessible. See the example function below, which waits until the element is accessible on the web page and only then returns it.
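Here is a minimal sketch of what such a function can look like (the exact implementation ships with the IE class further down; the Sleep declaration assumes 64-bit Office, drop PtrSafe on older 32-bit versions):

Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

'Returns the element only once it is actually accessible, or raises an error after maxTimeoutMs
Public Function GetElementById(objIE As Object, id As String, Optional maxTimeoutMs As Long = 10000) As Object
    Dim startTime As Single: startTime = Timer
    Do
        On Error Resume Next
        Set GetElementById = objIE.Document.GetElementById(id)
        On Error GoTo 0
        If Not GetElementById Is Nothing Then Exit Function
        Sleep 5 'poll every 5 milliseconds
        DoEvents
    Loop While Timer - startTime < maxTimeoutMs / 1000
    Err.Raise vbObjectError + 513, "GetElementById", "Element '" & id & "' not accessible within the timeout"
End Function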

So how much does the original code above change? Not too much really when we use this new function:
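With such a function in place, only the last line really changes (again a sketch, reusing the variables from the snippet above):

While (objIE.Busy Or objIE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE)
    DoEvents
Wend
'No more random errors - this call blocks until the element is truly accessible
Set element = GetElementById(objIE, id)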

The code above waits for the element with the given id to appear accessible on the web page. The 5 millisecond Sleep interval is there so that we can enforce an upper threshold (10 seconds) after which an error is raised. This threshold prevents us from waiting for an infinite period of time in case the web page crashed etc. This, I believe, is an honest way out of this conundrum.

A simple class for deterministic browser automation

Browser automation usually requires a lot of coding, and in order to provide the deterministic automation we need, we have to encapsulate more procedures for getting elements by id, name, tag name, class name and even by regular expressions.


The Google web page is also not an easy target for browser automation, assuming you do not embed your query in the GET params but treat it like a regular web page where all data is input and handled by forms.

The IE class I wrote handles queries to the Google Web Page like this:
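The snippet below is a purely illustrative sketch of such a query. The method names follow the descriptions in the next paragraphs, while the element names ("q", "btnG") and the regex for Google's results page are my assumptions and not necessarily what the downloadable class uses:

'Sketch only - element names and the result regex are assumptions
Sub TestGoogleQuery()
    Dim ie As New IE, firstResult As String
    ie.Navigate "http://www.google.com"
    ie.GetElementByName("q").Value = "analystcave.com"   'type the query into the search box
    ie.GetElementByName("btnG").Click                    'submit the search form
    ie.WaitForIE                                         'wait until IE reports it is no longer Busy
    firstResult = ie.GetRegex("<h3 class=""r""[^>]*>.*?>([^<]+)</a>") 'first result link text
    Debug.Print firstResult
    ie.Quit
End Sub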

That's 10 lines of code and no hassle with handling the accessibility of any web elements. All the blocking/waiting is handled in the Get* methods, which simply wait until the element is available in the Web Browser or raise an error if the maxTimeout threshold is breached. You might have noticed that there is still a method called WaitForIE, which waits for the browser to confirm that it is not Busy anymore. Why is it there? Because we need to be sure that the browser has at least mostly loaded the new content, so that the element we are looking for is not found, by mistake, in the previous page's content. This code has proven deterministic for me in over 1,000 tested queries!

One other interesting thing is the GetRegex method, which I basically lifted from my Excel Scrape HTML Add-In. I prefer regular expressions over any other method for extracting content from strings/web pages and I encourage you to use them too. Once you get the hang of them, they really make life easier.

The full code of the IE class (the deterministic browser automation class) is included in the download below.

I hope this VBA class becomes the cornerstone of every browser automation script you write in VBA!

Download

Feel free to download the whole class file here:

Next steps

Check out my Scrape HTML Add-In for easy HTML scraping without any VBA:
Excel Scrape HTML Add-In

Export Excel to HTML – convert tables to HTML


I received an interesting question today: how to easily publish an Excel file to a web page? Although there are a ton of ways to approach this problem (ranging from Excel Services in SharePoint to Excel Interop or ClosedXML), let us say we want to restrict ourselves to using only Excel and VBA. Printing Excel to HTML is a very useful feature if you want to publish your data/Workbook online.

The concept itself is very simple, as HTML files are text files, and therefore the problem boils down to structuring the data correctly in VBA before saving it to an HTML file.

Today I wanted to explore 2 options:

  • Generating an HTML file via VBA
  • Generating an HTML file via the Publish feature in Excel

Both options are fairly useful ones – with the first one offering more flexibility and the second one being much easier to use.

Generating HTML via VBA

So let's start with the simple example of generating an HTML file from scratch just with VBA.

We have the following Excel table (ranging from A1 to C3):

Excel table

The Code

Sub Test()
    RangeToHtml Range("A1:C3"), "test.html"
End Sub

Sub RangeToHtml(rng As Range, fileName As String)
    Dim resBeg As String, resEnd As String
    Dim i As Long, j As Long
    resBeg = "<html><head></head><body><table>"
    resEnd = "</table></body></html>"
    For i = 1 To rng.Rows.Count
        '---Rows---
        resBeg = resBeg & "<tr>"
        For j = 1 To rng.Columns.Count
            '---Columns---
            resBeg = resBeg & "<td>"
            resBeg = resBeg & rng.Cells(i, j).Value
            resBeg = resBeg & "</td>"
        Next j
        resBeg = resBeg & "</tr>"
    Next i
    Call SaveStringToFile(resBeg & resEnd, fileName)
End Sub

Sub SaveStringToFile(str As String, fileName As String)
    Dim fileNo As Integer: fileNo = FreeFile 'use a free file handle instead of hardcoding #1
    Open fileName For Output As #fileNo
    Print #fileNo, str
    Close #fileNo
End Sub

Excel to HTML: The result

Let's see the result (actual HTML posted to this page):

Col1 Col2 Col3
1 2 3
4 5 6

Nothing extraordinary – just a very simple table without any formatting.
What the code does is traverse the Excel Range, wrapping each row in a <tr> tag and each column (or rather cell) in a <td> tag, inserting the cell's contents within. A very simple example.

Excel Publish to HTML feature

The Publish to HTML feature is a neat capability that allows you to export your entire Workbook as an HTML web page through which you can easily navigate. You can publish your Excel Workbook either from VBA or directly from Excel.

To publish the Excel file go to Save As and select Publish to configure the publish options:

Publish to HTML Excel feature

Alternatively you can use the VBA code below to achieve the same:

With ActiveWorkbook.PublishObjects.Add(xlSourceRange, _
    "PublishToHtml.htm", "Sheet1", "$A$1:$C$4", _
    xlHtmlStatic, "PublishToHtml", "")
    .Publish True
End With

Easy as that! The additional advantage is that the Publish to HTML feature will keep some of your formatting settings, e.g. bold and italic fonts. Some, however, will usually be omitted, e.g. borders.

Conclusions

Without resorting to external solutions, there are at least 2 easy options for generating HTML files from Excel. Personally, I prefer to have control over my HTML file and use VBA, possibly adding my own CSS styles and formatting.
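As a taste of that flexibility, the opening string in RangeToHtml could, for example, embed a minimal stylesheet (a sketch only, the style values are arbitrary):

    'Replace the original resBeg assignment with a version that carries some inline CSS
    resBeg = "<html><head><style>" & _
             "table { border-collapse: collapse; font-family: Arial, sans-serif; } " & _
             "td { border: 1px solid #999999; padding: 4px; }" & _
             "</style></head><body><table>"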