
Ruby on Rails: Screen Scraping with Nokogiri and OpenURI

Screen scraping in Ruby on Rails can be done very easily with Nokogiri and OpenURI. All you need to do is:

require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(target_url))

That returns a Nokogiri HTML document, which you can search for the data you want using doc.at, much like querying the DOM. For example, let's retrieve the price and description of an Amazon Kindle:

require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.amazon.com/gp/product/B00I15SB16'))
# The item description lives in the content attribute of the description meta tag
item_description = doc.at('meta[@name="description"]')[:content]
# The price is the text of the bold element inside the #buyingPriceValue container
price = doc.at('#buyingPriceValue b').children[0].text
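
If you need every matching node rather than just the first, doc.css (or doc.search) returns a node set you can iterate over. As a minimal illustration (example.com and the 'a' selector here are just placeholders, not part of the Amazon example):

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('http://www.example.com'))

# doc.css returns all nodes matching the selector, while doc.at returns only the first
doc.css('a').each do |link|
  puts "#{link.text.strip} -> #{link[:href]}"
end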

That's pretty simple, but when doing large-scale scraping, you probably don't want to send requests directly from your server but rather through a proxy:

doc = Nokogiri::HTML(open(some_url, :proxy => "http://#{proxy_ip}:#{proxy_port}"))

However, public proxies are unreliable, so it's better to use private, authenticated proxies instead. Luckily, OpenURI supports proxy authentication as of Ruby 1.9 through the :proxy_http_basic_authentication option, which takes the full proxy URL, the username, and the password:

doc = Nokogiri::HTML(open(
  some_url, 
  :proxy_http_basic_authentication => ["http://#{proxy_ip}:#{proxy_port}", proxy_username, proxy_password]
))

That almost covers everything you need for basic screen scraping with Nokogiri and OpenURI. One thing to note is that big websites like eBay or Amazon will block your proxy if their servers receive too many requests from the same IP. It's a good idea to do some stress testing against the website you're trying to scrape to see how much traffic its servers will tolerate. I have written a Rake task for this purpose:

task :scrape_test_consecutive => :environment do |t|
  require 'open-uri'
  require 'nokogiri'
  require 'logger'

  proxy_username = 'your_proxy_username'
  proxy_password = 'your_proxy_password'
  proxy_ip = 'your_proxy_ip'
  proxy_port = 'your_proxy_port' 
  url = 'scraping_url' 
  log = Logger.new("#{Rails.root}/log/scrape_test.log")
  log.info "* [#{DateTime.now}] Test scraping url #{url} with proxy #{proxy_ip}:#{proxy_port}"
  scraping_start_time = DateTime.now
  num_of_query = 0

  while true do
    num_of_query += 1
    begin
      query_start_time = DateTime.now
      doc = Nokogiri::HTML(open( 
        url,
        :proxy_http_basic_authentication => ["http://#{proxy_ip}:#{proxy_port}", proxy_username, proxy_password]
      ))

    rescue OpenURI::HTTPError => e
      log.error "Query #{num_of_query} failed with error: #{e.message}"
      break # if an HTTP error occurs, stop testing
    end

    log.info "Query #{num_of_query} executed successfully in #{((DateTime.now - startQueryTime) * 24 * 60 * 60).to_f}"
  end

  log.info "* [#{DateTime.now}] Finished testing with proxy #{proxy_ip} in #{DateTime.now - scraping_start_time. Total queries: #{num_of_query}"
end
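
Assuming the task is saved as a .rake file under lib/tasks, you can run it with:

bundle exec rake scrape_test_consecutive

and watch log/scrape_test.log to see how many requests go through before the proxy gets blocked.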

The task sends consecutive requests to the remote server through a private proxy until an HTTP error occurs. By sacrificing that proxy, we get a rough idea of the remote server's limit, which lets us estimate how many private proxies we need. The proxies can then be rotated randomly or in a round-robin manner to coordinate the scraping, as in the sketch below.
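
As a minimal sketch of the round-robin approach, assuming you maintain your own list of authenticated proxies (the PROXIES constant and fetch_round_robin helper below are illustrative, not part of any library):

require 'open-uri'
require 'nokogiri'

# Hypothetical list of private, authenticated proxies (replace with your own)
PROXIES = [
  { :url => 'http://1.1.1.1:8080', :user => 'user1', :pass => 'pass1' },
  { :url => 'http://2.2.2.2:8080', :user => 'user2', :pass => 'pass2' }
]

# Fetch each URL through the next proxy in the list, wrapping around (round-robin),
# so no single proxy takes all of the traffic
def fetch_round_robin(urls)
  urls.each_with_index.map do |url, index|
    proxy = PROXIES[index % PROXIES.size]
    Nokogiri::HTML(open(
      url,
      :proxy_http_basic_authentication => [proxy[:url], proxy[:user], proxy[:pass]]
    ))
  end
end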
