Jake Austwick

A Scraper's Toolkit: Redis

Jake Austwick — Thu, 04 Sep 2014 20:06:40 GMT

In my opinion, Redis is now the Swiss Army knife for any developer writing a scraper. I can't remember a sizeable scraping project I started in the past year that didn't involve Redis somehow.

Queueing

Queueing isn't particularly tied to web scraping; it's just a necessary part of it. The chances are you want to parallelize your scraping, and Redis provides a great way to do this. You can have multiple worker processes using BLPOP on a list, retrieving and performing the work as needed. When several workers are blocked on the same list, Redis serves the one that has been waiting the longest, which means all your worker processes will get their fair share of jobs. RQ is a great job queue / worker system for this.

There are downsides to consider though: keeping everything in-process is faster for basic operations (even if Redis is really quick). On the flip side, depending on the implementation language you are using, a single process may be limited to one CPU, whereas a multi-process approach wouldn't be. Overall I find it unlikely that Redis would be your bottleneck.
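
Here is a minimal sketch of the BLPOP worker pattern described above. It assumes `client` is a redis-py connection (for example `redis.StrictRedis()`); the list name `scrape:jobs`, the JSON job format and the `handle` callback are my own placeholders, not part of any library.

```python
import json

QUEUE_KEY = "scrape:jobs"  # hypothetical list name


def enqueue(client, url):
    """Push a job onto the right-hand end of the list."""
    client.rpush(QUEUE_KEY, json.dumps({"url": url}))


def run_worker(client, handle, timeout=5):
    """Pop jobs from the left until the queue stays empty for `timeout` seconds.

    Returns the number of jobs processed.
    """
    processed = 0
    while True:
        # BLPOP blocks until a job arrives and returns a (key, value) pair,
        # or None once the timeout expires.
        item = client.blpop(QUEUE_KEY, timeout=timeout)
        if item is None:
            return processed
        _key, payload = item
        handle(json.loads(payload))
        processed += 1
```

Start as many copies of `run_worker` as you like in separate processes; each job is handed to exactly one of them.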

Priority Crawling

Sometimes your crawl isn't about speed, it's about quality / relevance. If you only want to scrape 500 pages of a website, then which 500 do you choose to get the highest quality crawl?

You could opt to just go for a breadth-first crawl; this isn't a bad method [PDF]. However logic suggests that the pages that are of most importance have the most links pointing to them. So why don't we crawl in this order?

Redis makes this easy for us, as we can use a sorted set to keep track of how many times we've found a link pointing to a particular URL (use the URL as the member, and increment its score every time you see the link), then choose which page to crawl next by selecting the member with the highest score. When a page has been crawled, simply set its score to -inf to place it at the bottom of the set so you don't crawl it again.

Get the item with the highest score (next to crawl):

ZREVRANGEBYSCORE crawling_priorities +inf -inf WITHSCORES LIMIT 0 1
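
From Python, the same bookkeeping might look roughly like the sketch below. It assumes `client` is a redis-py connection on version 3.0 or later (where `zincrby` takes the amount before the member and `zadd` takes a mapping); the helper names are my own.

```python
PRIORITY_KEY = "crawling_priorities"


def record_link(client, url):
    """Count one more inbound link for `url`."""
    # ZINCRBY creates the member with this score if it is new,
    # otherwise it adds to the existing score.
    client.zincrby(PRIORITY_KEY, 1, url)


def next_url(client):
    """Return the uncrawled URL with the most inbound links, or None."""
    # ZREVRANGE 0 0 returns the single highest-scoring member.
    top = client.zrevrange(PRIORITY_KEY, 0, 0, withscores=True)
    if not top:
        return None
    url, score = top[0]
    if score == float("-inf"):
        return None  # everything left has already been crawled
    return url


def mark_crawled(client, url):
    """Push `url` to the bottom of the set so it is not picked again."""
    client.zadd(PRIORITY_KEY, {url: float("-inf")})
```

A nice side effect of using -inf is that later increments leave the score at -inf, so a crawled page never climbs back up the set.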

Proxy Management / Limiting

Sooner or later, if you're writing scrapers, you're going to need to use proxies. There's a lot of hype on using Tor so I'll save you the trouble - it will suck. The connections are slow, unreliable and will just generally make your life hell. Just invest in some private proxies, I highly recommend MyPrivateProxy. I have been using them for years and have well over 20 clients I've referred to them now with not even a mention of a problem.

Now, back to Redis. It's no good using proxies if you end up using the same IP over again in quick succession, and this will happen eventually if you just go with using the random approach. We can utilize sorted sets again to store the last timestamp we used each proxy, and then we can grab the proxy that hasn't been used in the longest period. Simple, but effective.
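
A rough sketch of that idea, assuming `client` is a redis-py 3.0+ connection. The key name `proxies:last_used` and both helper functions are hypothetical.

```python
import time

PROXY_KEY = "proxies:last_used"  # hypothetical sorted-set name


def add_proxy(client, proxy):
    """Register a proxy. A score of 0 means "never used", so it gets picked first."""
    client.zadd(PROXY_KEY, {proxy: 0})


def acquire_proxy(client, now=None):
    """Return the proxy that has been idle the longest and stamp it as used."""
    # ZRANGE 0 0 returns the member with the lowest score (oldest timestamp).
    oldest = client.zrange(PROXY_KEY, 0, 0)
    if not oldest:
        return None
    proxy = oldest[0]
    client.zadd(PROXY_KEY, {proxy: time.time() if now is None else now})
    return proxy
```

Note that the read and the write are two separate commands, so two workers can occasionally grab the same proxy at the same moment. If that matters for your project, wrap the two steps in a transaction or a Lua script.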

Probabilistic Data Structures

Probabilistic data structures can be very useful in large-scale scraping, because keeping resource usage down is often a big concern, whereas the occasional counting error or incorrect membership query isn't a big issue.

Bloom Filters

By using a bloom filter, you can quickly perform membership queries to check whether you have already scraped a page. This can save a lot of memory by not having to store the URL itself. The downside is that it has a small chance of giving you a false positive (telling you a page has been scraped when it hasn't); it never reports a page you have added as unseen. This is often a worthwhile tradeoff. There are libraries in every major language that help you with implementing this on top of Redis. I recommend pyreBloom for Python users.
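
To show what such a library does under the hood, here is a toy bloom filter built on Redis bit operations. This is not how pyreBloom is implemented; it is just a sketch that assumes `client` is a redis-py connection, and the class name, key name, filter size and hash count are arbitrary choices of mine.

```python
import hashlib


class RedisBloom(object):
    """Toy bloom filter stored in a single Redis string via SETBIT/GETBIT."""

    def __init__(self, client, key="seen_urls", size_bits=2 ** 24, hashes=7):
        self.client = client
        self.key = key
        self.size_bits = size_bits
        self.hashes = hashes

    def _offsets(self, item):
        # Derive `hashes` bit positions by salting SHA-256 with a seed.
        for seed in range(self.hashes):
            data = ("%d:%s" % (seed, item)).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for offset in self._offsets(item):
            self.client.setbit(self.key, offset, 1)

    def __contains__(self, item):
        # Every bit must be set; a single zero bit proves the item is new.
        return all(self.client.getbit(self.key, offset)
                   for offset in self._offsets(item))
```

Each lookup here costs one round trip per hash, so for real workloads you would want to batch the commands in a pipeline or use one of the existing libraries.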

HyperLogLog

If you want to count something but don't need to check for membership of items, then you should seriously check out the new data structure in Redis: the HyperLogLog. It lets you count the number of unique things you place into it in a very small amount of memory. The catch is that the count is an approximation and you can't get back the things you're counting. It's perfect for things such as counting the number of unique URLs you come across on a particular website though.
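
Counting unique URLs per site only takes two commands, PFADD and PFCOUNT. The sketch below assumes `client` is a redis-py connection; the `unique_urls:<site>` key naming is just a convention I made up for the example.

```python
def count_url(client, site, url):
    """Record that `url` was seen on `site`."""
    # PFADD ignores items the HyperLogLog has (probably) already seen.
    client.pfadd("unique_urls:%s" % site, url)


def unique_url_count(client, site):
    """Return the approximate number of distinct URLs recorded for `site`."""
    return client.pfcount("unique_urls:%s" % site)
```

Redis documents a standard error of 0.81% for these counts, with each key using at most 12 KB however many items you add.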

Extras

I use Redis for more than the above in my web scraping tasks, these are just the most popular use-cases that apply to most projects. I'd love for you guys to comment below on what other things you have come up with!

If you haven't already checked it out, I highly recommend giving my Python Web Scraping Resource a read. It covers a lot of scraping principles and ideas that I will build upon in this series.

Shameless Plug

I'm available to hire for scraping projects. Use the contact link in the sidebar if you have some data you want collecting and drop me a few details :)

Why you should give PonyORM a chance

Jake Austwick — Mon, 26 May 2014 01:49:31 GMT

Pony is a neat new ORM on the block. You likely haven't heard of it, as it doesn't seem like anybody has.

It is similar to other ORMs such as SQLAlchemy in the sense that you can define your models, and query them using convenient syntax. However, the syntax for Pony is a lot more pythonic in my opinion. It lets you write complex SQL queries using the python generator syntax, and then converts and executes this as SQL for you.

# Example from Pony homepage
select(c for c in Customer if sum(c.orders.price) > 1000)

Look at that syntax and how visually appealing and easy to write it is. It almost seems like magic, and that's because it pretty much is. The developer explains here in detail how it "performs its tricks". It essentially decompiles the generator into its bytecode using the dis module, and then converts the bytecode into an AST.

Clear Queries

The clear benefit of pony is simplicity of writing fairly complex queries. Here are a few examples:

# Select product name and amount of orders
select((p.name, count(p.orders)) for p in Product)

# Select the oldest age out of every person
max(p.age for p in Person)

# Using SQL AVG aggregation query to work out avg GPA
avg(s.gpa for s in Student if s.group.dept.number == 44)

Transaction / Session Handling

My next favourite feature about Pony is the transaction/db session handling. It eliminates the need to use a function like commit() at the end of all your functions where you update a record. Pony provides a simple @db_session decorator method that automatically starts a database session upon your first SQL query, and then keeps track of any changes you make to your data within the function. If you have edited any of your models at the end of your function, then PonyORM will make sure to sync these changes to the DB for you. You can of course commit changes at anytime using the commit() function if you really need to.

Support

It supports everything you have come to love about your current ORM; default column values, specifying custom column types, relationships, etc. It has support for One-To-One, One-To-Many and Many-To-Many relationships. Here's a quick example of how a basic relationship looks:

class Post(db.Entity):  
    title = Required(unicode, 500)
    content = Required(LongUnicode)

    category = Required("Category")
    last_updated = Required(datetime, default=lambda: datetime.now())

class Category(db.Entity):  
    name = Required(unicode)
    posts = Set("Post")

Problems

Unfortunately, Pony doesn't yet have migrations. So if you're used to migrations with something like South then you're out of luck. The good news is that the Pony developers have confirmed they are working on this.

It also has a Dual-License, which means in some circumstances you may have to purchase a license. But it's a measly $100, so it isn't going to make a difference to your project in the long run.

Final Message

Overall, I'd say Pony is definitely worth giving a chance. I use it in all of my new projects and have never found a problem with it. It speeds up my development for sure. Check out the PonyORM website for more details.

Heartbleed affects clients too

Jake Austwick — Tue, 08 Apr 2014 23:12:10 GMT

If you're in any kind of tech circles, then all you will have heard today and yesterday is discussion regarding Heartbleed. It's a very serious bug within the OpenSSL library. The chances are if you have a server, you're using OpenSSL somewhere or other.

Everybody with public-facing servers using SSL has been rushing to update their servers to the new patched OpenSSL version. Amazon have released updates to their services, as have other major providers such as Cloudflare.

What I'm not seeing mentioned much is the fact that clients are susceptible to this vulnerability too! Don't fool yourself into thinking you're safe. If you are running web crawlers out on the open internet connecting over SSL using OpenSSL, there's no reason why somebody running a malicious server can't use this bug against you. They can extract memory from your client process, and the machine running your crawler could hold secret keys, passwords and all sorts of other sensitive material.

If you're running a web crawler, then make sure you patch your OpenSSL too.

Extending the requests response class

Jake Austwick — Sun, 16 Mar 2014 01:32:00 GMT

Requests is a fantastic library for python, one of the most enjoyable libraries I have used to this day. I use it on a daily basis for most of my scraping activities.

The chances are you have some convenience functions that you use in all of your scraping projects, but you may have just been copying them around for now, and passing your response objects into them as an argument. We can do better.

I'm just going to show you how to add a few simple methods to the Response class, so that you can use this technique for your own projects with your own methods.

We'll start by defining a Response class with a few convenience methods. The important method defined is doc(). It "caches" the parsed tree of the HTML, so all our other convenience methods don't cause the whole HTML to be re-parsed with each call.

import requests  
from lxml import html  
import inspect

class Response(object):  
    def doc(self):
        if not hasattr(self, '_doc'):
            self._doc = html.fromstring(self.text)
        return self._doc

    def links(self):
        return self.doc().xpath('//a/@href')

    def images(self, filter_extensions=['jpg', 'jpeg', 'gif', 'png']):
        return [link for link in self.doc().xpath('//img/@src') if link.endswith(tuple(filter_extensions))]

    def title(self):
        title = self.doc().xpath('//title/text()')
        if len(title):
            return title[0].strip()
        else:
            return None

Now we need to patch the requests.Response class with the methods inside our newly defined class. We'll use the getmembers() function from the inspect module, passing ismethod() as the predicate so that only the methods are returned.

for method_name, method in inspect.getmembers(Response, inspect.ismethod):  
    setattr(requests.models.Response, method_name, method.im_func)

We're all done. You can now access these convenience functions on any response object. See the following example:

r = requests.get('http://imgur.com/')  
print r.title()  
print r.images(filter_extensions=['png'])
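
One caveat: the patching loop relies on Python 2's unbound methods. On Python 3, functions defined in a class body are plain functions, so inspect.ismethod finds nothing on the class and im_func no longer exists. A sketch of the equivalent loop for Python 3 follows. To keep it runnable without requests installed it patches a stand-in class called `Target`; in a real project you would pass requests.models.Response to setattr instead. The `Extras` methods are made-up examples.

```python
import inspect


class Target(object):
    """Stand-in for requests.models.Response (only has a .text attribute)."""

    def __init__(self, text):
        self.text = text


class Extras(object):
    """Hypothetical convenience methods to graft onto the target class."""

    def word_count(self):
        return len(self.text.split())

    def is_empty(self):
        return not self.text.strip()


# isfunction matches the plain functions defined in the class body,
# and they can be attached to the other class directly.
for name, func in inspect.getmembers(Extras, inspect.isfunction):
    setattr(Target, name, func)
```

After the loop runs, every `Target` instance has `word_count()` and `is_empty()` available, exactly as the patched response objects do above.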

Now go ahead, and make your response objects as powerful as you desire. If you're interested in other scraping related hints / tips, check out my python web scraping resource.

Python web scraping resource

Jake Austwick — Sun, 09 Mar 2014 20:31:30 GMT

If you need to extract data from a web page, then the chances are you looked for their API. Unfortunately this isn't always available and you sometimes have to fall back to web scraping.

In this article I'm going to cover a lot of the things that apply to all web scraping projects and how to overcome some common gotchas.

Please Note: This is a work in progress. I am adding more things as I come across them. Got a suggestion? Drop me an email - jake.austwick@gmail.com

Prerequisites
- Requests
- lxml
Crawling / Spidering
- Storing seen URLs
CSS Selectors
Extracting The Main Content Block
Concurrency
Robots.txt
Avoiding Detection
Hire Me
Extending the Response class
Common Problems
Examples

Important

Always check whether you can extract the information via an API first, if they provide one. RSS / Atom feeds are also a great option if provided.

Prerequisites

We'll be using two external python libraries primarily.

Requests

We'll be using the requests library instead of urllib2. It's better in every way. I could go into details and explain why, but I think the requests page sums it all up in this short paragraph:

Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

lxml

lxml is an XML/HTML parser, so we'll be using it to extract the data from the responses we get back. Some people prefer BeautifulSoup, but I'm confident writing my own XPath expressions and so would prefer to stick with lxml for raw speed.

Crawling / Spidering

Crawling is exactly the same as a "general" scraper, except that you find the next URLs to scrape on each page that you parse. Crawling is a technique used to create large databases of information.

Examples:

Crawling an eBay category to gather product information. (Don't - they have an API).
Crawling a category in the yellow pages for the contact information of a certain profession.
etc...

Storing seen URLs

You'll want to store the URLs you've already visited, so that you don't scrape them multiple times. For a small scrape (< 50k URLs), I would suggest simply using a set for this. You can then check for the presence of a URL in the set before fetching it / adding it to the "To Fetch Queue".

If you're going to be crawling a large site, then the above technique of simply using a set won't hold up. The set will grow huge and will start using a ton of memory. The solution to this is a Bloom Filter. I won't go into much detail, but a bloom filter essentially lets us store which URLs we have seen, then ask the question "Have I seen this URL before?" and get an answer that is very likely correct: it will occasionally say yes for a URL it hasn't seen, but never no for one it has. It can do this using very little memory. pybloomfiltermmap is a good Bloom Filter implementation for Python.

Important: Make sure you normalize your URL before inserting it into the set / bloom filter. Depending on the site you may want to remove all query params etc. You don't want to be storing the same URL a ton of times.
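
What counts as "the same URL" depends on the site, so treat the following as one possible set of rules rather than the right answer. It is written for Python 3 (where urlparse lives in urllib.parse), and the default list of dropped tracking parameters is my own assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url, drop_params=("utm_source", "utm_medium", "utm_campaign")):
    """Return a canonical form of `url` for de-duplication."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # .hostname is already lowercased; note that it also drops any user:pass@ part.
    host = parts.hostname or ""
    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    if port and (scheme, port) not in (("http", 80), ("https", 443)):
        host = "%s:%d" % (host, port)
    path = parts.path or "/"
    # Drop unwanted params and sort the rest so ordering doesn't matter.
    query = urlencode(sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop_params
    ))
    # Fragments never reach the server, so they are removed entirely.
    return urlunsplit((scheme, host, path, query, ""))
```

Run every URL through a function like this before both the "have I seen it?" check and the insert, otherwise the two will disagree.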

CSS Selectors

If you're coming from a JavaScript background, you might be more comfortable querying the DOM with CSS selectors rather than XPath. lxml provides a way to do this: it has a css_to_xpath method. Check out this page for a short guide on how to use it:

http://lxml.de/cssselect.html

PyQuery also comes highly recommended to me and I know it's widely used, so it's worth checking out too.

Extracting The Main Content Block

Sometimes you want to perform analysis on just the main content of the page, doing a word count for example. You wouldn't want to include the menu / sidebar within the count.

You can use readability-lxml for this. Its job is simply to extract the main content block (note: it can still contain HTML). Following on with our word count example, you would tidy the HTML using something like clean_html from within lxml.html.clean, pull out the plain text (for example with lxml's text_content() method), then proceed to split on whitespace and count the results.

You can check out another article on my blog that shows how to use Readability to extract the main content.

Concurrency

We all know you want your data now, but hitting the server with 200 concurrent requests is going to make for one angry webmaster. Don't be that guy. Do everybody a favour and stick to a reasonable amount. 5 is usually enough.

To send requests concurrently easily with requests, check out the grequests library. It allows you to make your code concurrent in pretty much a line or two.

rs = (grequests.get(u) for u in urls)  
responses = grequests.map(rs)
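
Note that the snippet above sends everything at once unless you cap it; grequests.map takes a size argument for that. If you would rather not depend on gevent, the same limit can be sketched with the standard library on Python 3. Here `fetch` is a placeholder for whatever callable does the download (requests.get, for instance) and `fetch_all` is a name I made up.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=5):
    """Call fetch(url) for every URL with at most `max_workers` in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands out work as threads free up and
        # returns the results in the same order as the input.
        return list(pool.map(fetch, urls))
```

With `max_workers=5` the pool never has more than five requests open, no matter how long the URL list is.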

Robots.txt

Almost every guide out there will tell you to always follow robots.txt rules. I'm not going to do that. The simple fact is they are usually really restrictive and don't even reflect what the webmaster's restrictions would be. Most sites just use the default robots.txt for their framework, or whatever their SEO plugin generates. Just make sure you're reasonable.

An exception to this rule is if you are writing a general spider, something like Google/Bing. In this case, I would follow the rules for disallowed directories and crawl delay. If they don't want something indexed, then don't index it in your search engine / application.
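
If you do decide to honour the rules, Python 3 ships a parser for them in urllib.robotparser (crawl_delay needs 3.6 or newer). The sketch below parses a robots.txt body you have already downloaded; the rules and the "MyBot" user agent are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Example rules, made up for illustration.
rules = load_rules("""
User-agent: *
Disallow: /private/
Crawl-delay: 2
""")

allowed = rules.can_fetch("MyBot", "http://example.com/public/page")
blocked = rules.can_fetch("MyBot", "http://example.com/private/page")
delay = rules.crawl_delay("MyBot")  # seconds to wait between requests, or None
```

Check `can_fetch` before queueing a URL and sleep for `delay` seconds between requests to the same host.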

Avoiding Detection

Sometimes you don't want to get caught. I get that. It isn't actually that difficult to achieve. The first thing to do is randomize your user agent. Your IP address will still be the same for all the requests, but that isn't unusual. Places like universities / businesses often have a ton of computers that all route traffic via 1 IP.
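
Randomizing the user agent can be as small as the sketch below. The strings in `USER_AGENTS` are only examples of the usual format; swap in a list of current browser user agents for real use. `random_headers` is my own helper name.

```python
import random

# Example strings only; replace with up-to-date browser user agents.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


# Usage with requests would look like:
#   requests.get(url, headers=random_headers())
```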

The next step is to use proxies. You can pick up shared proxies at a very small monthly cost, and when you're using randomized user agents and sending requests from 100 different IPs spread across different IP ranges and even different datacenters, they're gonna have a hard time detecting you.

Recommended Proxy Provider: My Private Proxy

When I've found myself in the unfortunate place of getting my proxies banned on certain sites, they have been more than happy to switch them out for new IPs for me.

Disclosure: The above is an affiliate link, I have an account there as I use this proxy service myself.

Extending the Response class

I highly recommend that you extend the requests.models.Response object with some convenience methods of your own. I have written a dedicated article on this topic:

http://jakeaustwick.me/extending-the-requests-response-class/

Hire Me

I highly recommend you use this article as a resource to build / fix your own scraper. If you don't have the time, however, or would like to hire somebody to scrape/crawl something for you, then drop me a line. You can find my details on my contact page.

http://jakeaustwick.me/contact-me/

Common Problems

Installing lxml

You may run into some trouble when trying to install lxml, especially on a Linux distro. lxml requires the following packages to be installed first:

apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev

You can then run the pip install lxml command as per usual.

The site uses AJAX, I can't scrape it!

This is simply incorrect. Often if a site uses AJAX, it can make your job even easier. Use the Network tab in Chrome Developer Tools to find the AJAX request; you'll usually be greeted by a response in JSON. This saves you the effort of having to parse the page, as you can just hit the AJAX endpoint directly. If this isn't the case then I suggest using an actual browser for scraping. Check out the "The site is showing different content to my scraper" section for suggestions on this.

The site is showing different content to my scraper

Check the page source in your browser and make sure the information you want to scrape is actually there. It could be that the content was injected into the page via javascript and our scraper doesn't run javascript. If this is the case, your only real option is to run something like PhantomJS. You can control Phantom from Python though, just use something like selenium or splinter.
Sites often serve different content to different user agents. You should check the user agent that you are sending in your browser and set your scraper to have the same.
It's possible that the site is serving different content to different locations using geotargeting. If you're running your script from a server, then this could be affecting you.

The items I need to scrape are paginated

Sometimes you need to scrape items that aren't all available on one page. Luckily it's usually an easy fix. I suggest checking the URL when changing between the pages; you'll usually find there is a page or offset parameter changing. This is easy to scrape with something like the following:

base_url = "http://some-url.com/something/?page=%s"  
for url in [base_url % i for i in xrange(10)]:  
    r = requests.get(url)

I'm getting 403 errors / rate limited

The simple solution to this is don't hit the site so damn hard. Be conservative with the amount of concurrent requests you use. Do you really need the data within the hour, can you really not wait 24 hours for it?

If you really need to keep sending requests to the site, then you're going to have to invest in some proxies. You can then create a simple proxy manager class that makes sure you only use the same proxy once every X interval so that it doesn't get blocked.
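
An in-process version of such a proxy manager might look like this. The class name, the `min_interval` parameter and the injectable `clock` (handy for testing) are all my own choices, so treat it as a starting point rather than a finished tool.

```python
import time


class ProxyManager(object):
    """Hands out proxies, never reusing one within `min_interval` seconds."""

    def __init__(self, proxies, min_interval, clock=time.time):
        self.min_interval = min_interval
        self.clock = clock
        # Map each proxy to the time it was last handed out (-inf = never).
        self.last_used = {proxy: float("-inf") for proxy in proxies}

    def get(self):
        """Return the longest-idle proxy, or None if all were used too recently."""
        proxy = min(self.last_used, key=self.last_used.get)
        now = self.clock()
        if now - self.last_used[proxy] < self.min_interval:
            return None
        self.last_used[proxy] = now
        return proxy
```

When `get()` returns None, every proxy is still cooling down, so the caller should sleep for a bit and try again instead of hammering the site.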

Check out the Concurrency and Avoiding Detection sections above.

Code Examples

The following examples aren't really of any use on their own. They're simply here as a reference for you.

Simple requests example

I'm not really going to show much in this example, you can find out all you need over at the requests documentation.

import requests

response = requests.get('http://jakeaustwick.me')

# Response
print response.status_code # Response Code  
print response.headers # Response Headers  
print response.content # Response Body Content

# Request
print response.request.headers # Headers you sent with the request

Request with proxy

proxy = {'http' : 'http://102.32.3.1:8080',  
           'https': 'http://102.32.3.1:4444'}
response = requests.get('http://jakeaustwick.me', proxies=proxy)

Parsing the response body

It's easy to parse the HTML returned with lxml. Once we have it parsed into a tree, we can run XPath queries on it to grab the data we are interested in.

import requests  
from lxml import html

response = requests.get('http://jakeaustwick.me')

# Parse the body into a tree
parsed_body = html.fromstring(response.text)

# Perform xpaths on the tree
print parsed_body.xpath('//title/text()') # Get page title  
print parsed_body.xpath('//a/@href') # Get href attribute of all links

Download all images on a page

The following script will download all the images, and save them in downloaded_images/. Make sure that directory exists before running it.

import requests  
from lxml import html  
import sys  
import urlparse

response = requests.get('http://imgur.com/')  
parsed_body = html.fromstring(response.text)

# Grab links to all images
images = parsed_body.xpath('//img/@src')  
if not images:  
    sys.exit("Found No Images")

# Convert any relative urls to absolute urls
images = [urlparse.urljoin(response.url, url) for url in images]  
print 'Found %s images' % len(images)

# Only download first 10
for url in images[0:10]:
    r = requests.get(url)
    # Open in binary mode so the image bytes are written unmodified
    with open('downloaded_images/%s' % url.split('/')[-1], 'wb') as f:
        f.write(r.content)

Single Threaded Crawler (basic)

This crawler just shows the basic principle behind crawling. It uses a deque so you can pop from the left of the queue and therefore fetch URLs in the order you find them. It doesn't do anything useful with the pages; it just prints out the page title.

import requests  
from lxml import html  
import urlparse  
import collections

STARTING_URL = 'http://jakeaustwick.me'

urls_queue = collections.deque()  
urls_queue.append(STARTING_URL)  
found_urls = set()  
found_urls.add(STARTING_URL)

while len(urls_queue):  
    url = urls_queue.popleft()

    response = requests.get(url)
    parsed_body = html.fromstring(response.content)

    # Prints the page title
    print parsed_body.xpath('//title/text()')

    # Find all links
    links = {urlparse.urljoin(response.url, url) for url in parsed_body.xpath('//a/@href') if urlparse.urljoin(response.url, url).startswith('http')}

    # Set difference to find new URLs
    for link in (links - found_urls):
        found_urls.add(link)
        urls_queue.append(link)

Flask GeoIP API in python

Jake Austwick — Fri, 07 Mar 2014 21:27:30 GMT

We're going to create a simple flask webapp. I say simple because it only has three routes, however this doesn't mean it can't be extremely useful.

The webapp will provide the following three functions:

Root route - Details for requesters IP
IP route - Details for the IP url parameter
Domain route - Details for the domain parameter

The JSON output of our API will look like this:

{
  "area_code": 408,
  "city": "Campbell",
  "continent": "NA",
  "country_code": "US",
  "country_code3": "USA",
  "country_name": "United States",
  "dma_code": 807,
  "latitude": 37.28030000000001,
  "longitude": -121.9567,
  "metro_code": "San Francisco, CA",
  "postal_code": "95008",
  "region_code": "CA",
  "time_zone": "America/Los_Angeles"
}

Flask makes this very simple for us with its straightforward routing mechanism, and we can get the request IP with request.remote_addr.

For our GeoIP functionality we'll be using the pygeoip library. You'll also need to download the latest available database from MaxMind.

We will take advantage of the jsonify function that Flask has to offer. This will take care of converting the dictionary returned from pygeoip into a JSON encoded string, and also setting the content-type response header to application/json.

from flask import Flask, request, jsonify  
import pygeoip

app = Flask(__name__)

# Make sure this points to your downloaded file
gi = pygeoip.GeoIP('GeoIPCity.dat', pygeoip.MEMORY_CACHE)

@app.route('/')
def root():  
    geo_data = gi.record_by_addr(request.remote_addr)
    return jsonify(geo_data)

if __name__ == '__main__':  
    app.run(port=8000, debug=False)

The above will make the root url display the details for your IP address. However please note that it will not work on localhost, as your IP address for that request would simply be 127.0.0.1.

Now you can add the routes for the custom IP or custom domain which will be passed in the URL.

# The <name> part of each route is captured and passed to the view function
@app.route('/ip/<ip_address>')
def ip(ip_address):
    geo_data = gi.record_by_addr(ip_address)
    return jsonify(geo_data)

@app.route('/domain/<domain_name>')
def domain(domain_name):
    geo_data = gi.record_by_name(domain_name)
    return jsonify(geo_data)

The only problem so far is that if somebody sends a request for an invalid IP/domain, then geo_data will equal None. Upon calling jsonify(None) you will receive an exception.

Because our app only serves one simple function, we can register a global exception catcher for 500 errors, and display an error to the requester.

# In general catching all 500 errors like this could be considered bad practice
# But with an app which only serves single function like ours
# It's a neat way to follow the DRY principle because all errors are the same
@app.errorhandler(500)
def error_500(e):  
    return jsonify({'error': 'Error finding GeoIP data for that address'})

Final Code

That's it. You have yourself a JSON GeoIP API that can be used with ease. Here's all of it put together:

from flask import Flask, request, jsonify  
import pygeoip

app = Flask(__name__)

# Make sure this points to your downloaded file
gi = pygeoip.GeoIP('GeoIPCity.dat', pygeoip.MEMORY_CACHE)

@app.route('/')
def root():  
    geo_data = gi.record_by_addr(request.remote_addr)
    return jsonify(geo_data)

@app.route('/ip/<ip_address>')
def ip(ip_address):
    geo_data = gi.record_by_addr(ip_address)
    return jsonify(geo_data)

@app.route('/domain/<domain_name>')
def domain(domain_name):
    geo_data = gi.record_by_name(domain_name)
    return jsonify(geo_data)

# In general catching all 500 errors like this could be considered bad practice
# But with an app which only serves single function like ours
# It's a neat way to follow the DRY principle because all errors are the same
@app.errorhandler(500)
def error_500(e):  
    return jsonify({'error': 'Error finding GeoIP data for that address'})

if __name__ == '__main__':  
    app.run(port=8000, debug=False)

Scraping content with readability and python

Jake Austwick — Fri, 07 Mar 2014 13:02:09 GMT

This is only going to be a short one. If you want to scrape the main content of a site only (we aren't interested in the menu/sidebar etc), then you can use the python port of the readability library.

We'll simply fetch the page using Kenneth Reitz's requests library, and then extract the content with readability. I'll be using Redis to store the results, simply because I didn't want the extra overhead of having to install a RDBMS for this project.

Installing libraries

pip install redis requests readability-lxml

First we'll create a generator to yield urls we want to fetch from a text file. We'll only yield the url if we don't already have the content for it stored though.

    import redis
    r = redis.StrictRedis()

    def urls():
        for url in open('urls.txt', 'r'):
            if not r.hexists('url-content', url):
                yield url

Now we can iterate over this, fetching the content, running it through readability and then storing the results in redis. It's as simple as it sounds.

Final Code

    import redis
    import requests
    from readability.readability import Document

    r = redis.StrictRedis()

    def urls():
        for url in open('urls.txt', 'r'):
            if not r.hexists('url-content', url):
                yield url

    if __name__ == '__main__':
        for url in urls():
            try:
                response = requests.get(url, timeout=10)

                # Only store the content if the page load was successful
                if response.ok:
                    page_content = Document(response.content).summary()
                    r.hset('url-content', url, page_content)
            except Exception:
                print 'Error processing URL: %s' % url

        print 'Processed all URLs'

And there you have it, scraping and storing content of the urls loaded from a text file, all in < 25 lines of Python.