Web Scraping Tutorial

Web Scraping Tools for Beginners and the Advanced

Web Scraping is a broad topic, almost a profession in its own right. It is an especially valuable tool for SEO specialists, data scientists, analysts and many others. As a result there are tons of tools out there, and finding the right one can be a real nightmare. For those who don’t have the time to try them all, I dedicate this overview:

Ranking of Web Scraping tools/libraries (by ease of use)

Level: Beginner

You need not be an expert coder to start extracting the data you need from websites! Web Scraping is a well-known subject, and many tools have been built to make it easier to scrape HTML content. The tools below do not require any coding experience.

No. Name Comment
1 Excel Power Query (From Web)
  • Easy and quick to use
  • Limited to HTML tables
  • Available only in Excel 2010 and above (previous versions have the less useful “Data->From Web” feature)
2 Web Scraper plugin for Chrome
  • Quite easy to learn
  • Easy to configure straight from Chrome
  • Highly configurable (many options for selecting/scraping elements)
  • Available only from Chrome
3 Import.io
  • Quite easy to learn
  • Includes a wizard which will walk you through the process
  • Can be used to scrape numerous subpages of a page
4 Scrape Box
  • Commercial (not free)
  • A simple app for scraping specific data off websites e.g. emails, links etc.
  • Easy configuration e.g. select the patterns you want to scrape and a list of websites/IPs
5 Scrape HTML Tool
  • Quite easy to learn; requires only knowing how to build regular expressions
  • Provides simple UDF Excel functions
  • Provides integrated features to enhance Web Scraping performance (caching, automatic updating etc.)
  • Available only in Excel 2010 and above
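
Several of the tools above (Scrape Box, the Scrape HTML Tool) come down to running a pattern against a page's HTML. For readers curious what that looks like under the hood, here is a minimal sketch in Python. It is not how any of these tools is implemented: the HTML snippet is made up, the names `SAMPLE_HTML`, `EMAIL_RE`, `LINK_RE`, `extract_emails` and `extract_links` are purely illustrative, and the patterns are deliberately simple.

```python
import re

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<p>Contact: <a href="mailto:sales@example.com">sales@example.com</a></p>
<p>Docs: <a href="https://example.com/docs">read the docs</a></p>
"""

# Deliberately simple patterns (an assumption for this sketch);
# real-world e-mail and URL syntax is broader than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LINK_RE = re.compile(r'href="(https?://[^"]+)"')


def extract_emails(html):
    """Return the unique e-mail addresses found in html, sorted."""
    return sorted(set(EMAIL_RE.findall(html)))


def extract_links(html):
    """Return the http(s) link targets found in html, in page order."""
    return LINK_RE.findall(html)


print(extract_emails(SAMPLE_HTML))
print(extract_links(SAMPLE_HTML))
```

Regular expressions work well for flat patterns such as e-mails and links, but they get fragile on nested HTML, which is where the parser-based libraries listed later come in.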

Level: Advanced

The tools/libraries below require some coding experience

No. Name Comment
1 Selenium (Python, C#, Java, R etc.)
  • The best kit for programmatically simulating web browser interaction
  • Available in most popular programming languages like Python, C#, Java (even for R – RSelenium)
  • Very easy to learn
  • Provides drivers for simulating user interaction in most browsers e.g. Chrome, Firefox, Internet Explorer etc.
2 Scraper Wiki
  • A popular platform with web scraping tools
  • Easily transform web content into publicly (or privately) available data sets
  • Allows you to learn web scraping in the popular available technologies
  • Lets you order web scraping help
3 Kimono Labs
  • A popular web scraping website
  • Enables you to create easily available web interfaces to extract the web content you need
  • Interfaces available in multiple formats
4 Scrapy (Python)
  • Allows crawling (web spiders) and scraping websites
  • Moderate level of coding proficiency required
  • One of the most popular web scraping libraries
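
To give a feel for what crawling involves, here is a rough sketch of the core step a web spider repeats on every page: collect the links and resolve them to absolute URLs so they can be visited next. This is not Scrapy code. It uses only Python's standard library (`html.parser` and `urllib.parse`), and the class `LinkCollector`, the function `collect_links` and the sample `PAGE` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links a spider would follow.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url):
    """Return all link targets in html as absolute URLs, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content standing in for a downloaded page.
PAGE = '<a href="/page/2">Next</a> <a href="https://example.org/">Other</a>'
print(collect_links(PAGE, "https://example.com/page/1"))
```

A full crawler would also download each collected URL, remember which pages it has already visited, and throttle its requests. Frameworks such as Scrapy take care of that bookkeeping for you.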

Web Scraping libraries by programming language

With so many programming languages around, there are bound to be many web scraping libraries to choose from. Here you can find a short list of the most popular web scraping libraries for each programming language.

Language Web Scraping Libraries
.NET (e.g. C#)
  • Html Agility Pack
  • WatiN
Java
  • Tag Soup
  • HtmlUnit
  • Web-Harvest
  • jARVEST
  • jsoup
  • Jericho HTML Parser
JavaScript
  • node.io
  • phantomjs
PHP
  • htmlSQL
Python
  • Scrapy
  • Selenium-Python
  • Beautiful Soup
  • lxml
  • HTQL
  • Mechanize
R
  • rvest
  • RSelenium
Ruby
  • Nokogiri
  • Hpricot
  • Mechanize
  • scrAPI
  • scRUBYt!
  • wombat
  • Watir

Web Scraping Tools for Data Scientists

Are you a data scientist looking for the best Web Scraping tools? In data science communities (e.g. Kaggle), Python and R are currently the most highly regarded programming languages. Below is a short list of libraries to consider for both:

Python:

R:

Next steps

Want to learn more about Web Scraping? Check out these links:
Web Scraping Tutorial
Excel Scrape HTML Add-In
