Web Scraping Tools for Beginners and the Advanced

Web Scraping is a broad topic, almost a profession in its own right. It is an especially valuable tool for SEO specialists, data scientists, analysts and many others. As a result, there are tons of tools out there, and finding the right one can be a real nightmare. For those who don’t have the time, I have put together this overview:

Ranking of Web Scraping tools/libraries (ease of use)

Level: Beginner

You need not be an expert coder to start extracting the data you need from websites! Web Scraping is a well-known subject, and many tools have been developed to make it easier to scrape HTML content. The tools below do not require any coding experience.

No. Name Comment
1 Excel Power Query (From Web)
  • Easy and quick to use
  • Limited to HTML tables
  • Available only in Excel 2010 and above (previous versions have the less useful “Data->From Web” feature)
2 Web Scraper plugin for Chrome
  • Quite easy to learn
  • Easy to configure straight from Chrome
  • Highly configurable (many options for selecting/scraping elements)
  • Available only from Chrome
3 Import.io
  • Quite easy to learn
  • Includes a wizard which will walk you through the process
  • Can be used to scrape numerous subpages of a page
4 Scrape Box
  • Commercial (not free)
  • A simple app for scraping specific data off websites e.g. emails, links etc.
  • Easy configuration e.g. select the patterns you want to scrape and a list of websites/IPs
5 Scrape HTML Tool
  • Quite easy to learn; requires only knowing how to build regular expressions
  • Provides simple UDF Excel functions
  • Provides integrated features to enhance Web Scraping performance (caching, automatic updating etc.)
  • Available only in Excel 2010 and above
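
For readers curious what pattern-based scraping looks like underneath tools such as Scrape Box or the Scrape HTML Tool, here is a minimal Python sketch using only the standard library. It is not how either tool is implemented. The sample HTML, the two patterns and the function names are assumptions made up for illustration, and the patterns are deliberately simplified (regular expressions cannot handle every valid email or HTML attribute form).

```python
import re

# Hypothetical sample page; a real scrape would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <p>Contact: <a href="mailto:sales@example.com">sales@example.com</a></p>
  <p>Support: support@example.org</p>
  <a href="https://example.com/pricing">Pricing</a>
  <a href='/about'>About</a>
</body></html>
"""

# Simplified illustrative patterns (assumptions, not taken from any tool).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def extract_emails(html):
    """Return unique email-like strings in order of first appearance."""
    return list(dict.fromkeys(EMAIL_RE.findall(html)))


def extract_links(html):
    """Return the value of every quoted href attribute."""
    return HREF_RE.findall(html)


if __name__ == "__main__":
    print(extract_emails(SAMPLE_HTML))
    print(extract_links(SAMPLE_HTML))
```

Regular expressions work well for flat patterns like these. For nested or irregular markup, a proper HTML parser is usually the safer choice.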

Level: Advanced

The tools/libraries below require some coding experience

No. Name Comment
1 Selenium (Python, C#, Java, R etc.)
  • The best kit for programmatically simulating web browser interaction
  • Available in most popular programming languages like Python, C#, Java (even for R – RSelenium)
  • Very easy to learn
  • Provides drivers for simulating user interaction in most browsers e.g. Chrome, Firefox, IE etc.
2 Scraper Wiki
  • A popular platform with web scraping tools
  • Easily transform web content into publicly (or privately) available data sets
  • Allows you to learn web scraping in the popular available technologies
  • Lets you order web scraping help
3 Kimono Labs
  • A popular web scraping website
  • Enables you to create easily available web interfaces to extract the web content you need
  • Interfaces available in multiple formats
4 Scrapy (Python)
  • Allows crawling (web spiders) and scraping websites
  • Moderate level of coding proficiency required
  • One of the most popular web scraping libraries
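
Selenium and Scrapy have to be installed separately, so the sketch below uses only Python's standard library to illustrate the step a crawler repeats on every page: parse the HTML and collect the links to follow next. It is not Scrapy's or Selenium's API. The class name, function name, sample page and URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href=...> tags in one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url):
    """Parse html and return the absolute URLs of all links found."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content and URL, for illustration only.
PAGE = (
    '<ul>'
    '<li><a href="/page/2">Next</a></li>'
    '<li><a href="https://example.org/x">External</a></li>'
    '<li><a name="top">No link</a></li>'
    '</ul>'
)

if __name__ == "__main__":
    print(collect_links(PAGE, "https://example.com/page/1"))
```

A real spider would download each collected URL, skip the ones it has already visited, and repeat. Libraries like Scrapy handle that loop, along with throttling and retries, for you.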

Web Scraping libraries by programming language

With so many programming languages, there are bound to be multiple web scraping libraries out there. Here you can find a short list of the most popular web scraping libraries for each programming language.

Language Web Scraping Libraries
.NET (e.g. C#)
  • Html Agility Pack
  • WatiN
Java
  • Tag Soup
  • HtmlUnit
  • Web-Harvest
  • jARVEST
  • jsoup
  • Jericho HTML Parser
JavaScript
  • node.io
  • phantomjs
PHP
  • htmlSQL
Python
  • Scrapy
  • Selenium-Python
  • Beautiful Soup
  • lxml
  • HTQL
  • Mechanize
R
  • Rvest
  • RSelenium
Ruby
  • Nokogiri
  • Hpricot
  • Mechanize
  • scrAPI
  • scRUBYt!
  • wombat
  • Watir

Web Scraping Tools for Data Scientists

Are you a data scientist looking for the best tools out there for Web Scraping? Currently, in data science communities (e.g. Kaggle), Python and R are the most highly regarded programming languages. Below is a short list of libraries to consider for both:

Python:

R:

Next steps

Want to learn more about Web Scraping? Check out these links:
Web Scraping Tutorial
Excel Scrape HTML Add-In

4 thoughts on “Web Scraping Tools for Beginners and the Advanced”

  1. More of a question than a comment: why no mention of the ability to scrape from websites protected by a user id and/or password? Right now I’m googling, trying to find a preexisting solution to scrape data from a website protected by password into Excel rather than attempting (I’m not an expert) to code something from scratch in VBA. It is certainly worth mentioning that I’m assuming the user has the username and password to the site.

    1. Hi Brian,
      Well, unfortunately there is no one-size-fits-all solution, and the approach will probably depend on the website you are scraping. If you are familiar with VBA then use the IE object; if you know Python, try Selenium. Both should allow you to simply input the user id and password like any other web control and submit these to authenticate to the website.
      Share your example (URL) and I can give you a few tips to start you off.

      I might, however, take some time to add such a section to the Web Scraping Tutorial. Thanks for the suggestion!

  2. Thanks for the reply. A Web Scraping Tutorial would be awesome! I have some VBA experience but no Python experience. Some of the VBA coding solutions I’ve seen use the IE object. I think the best starting point I’ve found for a VBA solution is:

    http://dailydoseofexcel.com/archives/2011/03/08/get-data-from-website-that-requires-a-login/

    The above link also has a non-IE MSXML solution in the comments. Another starting point I’ve found is:

    http://www.mrexcel.com/forum/excel-questions/527216-web-query-password-protected-site.html

    The website I want to scrape from is:

    http://www.swapfinancial.com/fpmu/full

    after entering the password, I want to grab one of the tables and the Closest Financial Product Market Update date.

    Any tips would be appreciated.

    Thanks.

    1. Hi Brian,

      Seems like you are all set then. In fact, I have a Web Scraping Tutorial here. Feel free to check it out. If you are familiar with the IE Object, check out my IE Object class.

      A few tips:
      – Use the F12 button in your browser to locate the elements you need to interact with or scrape, e.g. (called on the IE document object)
      getElementsByName("password")(0).Value = "your_password" would enter the password on your website
      getElementsByTagName("form")(0).submit should submit the form

      Can’t help you more as I don’t have further access. Feel free to utilize the StackOverflow community if you need more help.