Spiders and robots [26.03.2009]

A couple of new, related features for Blueprint sites.

Firstly, you can define rules that prevent search engine spiders and other robots from accessing various parts of your site. Why would you do this? Mainly to improve the efficiency of your site and the relevance of the content that search engines store. For instance, you don't want search engine spiders adding stuff to their personal shopping basket.

Most big sites give search engines some direction about what should be indexed, using the robots.txt standard. Google's own robots.txt file is one example.

Blueprint gives you an easy way to define the robots.txt rules for a site. In lib/rules.rb, use the robot_disallow method to deny specific robots access to specific resources. Here are some examples:

# No robot should access the individual images for the gallery.
robot_disallow '*' => '/gallery/image'

# ...except Google.
robot_disallow 'Google' => ''

# The Yahoo robot shouldn't be allowed to see the Foo or Bar top-level pages, or the add_to_basket action.
robot_disallow 'Yahoo' => ['/foo', '/bar', 'add_to_basket']

Custom blueprints can define their own robots rules, which apply to any site that employs them. The SHP blueprint already uses this to explicitly deny access to its various basket-related actions.
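
To make the mapping from rules to robots.txt concrete, here's a rough sketch of how rules like the ones above might be rendered. This is not Blueprint's actual implementation: render_robots_txt is a hypothetical helper, and the array-of-hashes input is just an assumption that mirrors the robot_disallow calls.

```ruby
# Hypothetical helper, not part of Blueprint: turns a list of
# { user_agent => paths } rules into robots.txt text.
def render_robots_txt(rules)
  # Merge rules for the same user agent, keeping declaration order.
  grouped = Hash.new { |hash, agent| hash[agent] = [] }
  rules.each do |rule|
    rule.each do |agent, paths|
      # Array() lets a rule give either one path or a list of paths.
      grouped[agent].concat(Array(paths))
    end
  end

  # One block per user agent, with one Disallow line per path.
  blocks = grouped.map do |agent, paths|
    lines = ["User-agent: #{agent}"]
    paths.each { |path| lines << "Disallow: #{path}".rstrip }
    lines.join("\n")
  end

  blocks.join("\n\n") + "\n"
end

# A subset of the example rules, using path-style entries only.
rules = [
  { '*' => '/gallery/image' },
  { 'Google' => '' },
  { 'Yahoo' => ['/foo', '/bar'] }
]

puts render_robots_txt(rules)
```

Note that an empty Disallow line in robots.txt means nothing is disallowed for that robot, which is why the empty string works as an "except Google" rule.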

Concomitant with this change is Blueprint's very own spider. You can use this to test that all internal links on your site are working. It flags any HTTP status code above 399 (e.g. 404, 411, 500) and any ingredient errors. Here's an example:

$ rake site:spider domain=inventivelabs.com.au
SPIDERING [inventivelabs.com.au]
[200] http://inventivelabs.com.au/ -- 1.782209s
[200] http://inventivelabs.com.au/weblog/post/peeking-at-previews-in-blueprint -- 1.422189s
[200] http://inventivelabs.com.au/weblog/post/not-bad-for-a-one-year-old -- 0.967232s
[200] http://inventivelabs.com.au/weblog/post/5-minute-redesign-ebay -- 0.825736s
[200] http://inventivelabs.com.au/weblog/post/using-blueprint-as-a-drafting-tool -- 0.973908s
[200] http://inventivelabs.com.au/weblog/post/a-bloggie-for-ms-fits -- 0.825474s
[200] http://inventivelabs.com.au/weblog/post/harry-we-are-here-to-help -- 0.801349s
[200] http://inventivelabs.com.au/weblog/post/eight-days-at-the-labs -- 0.988544s
[200] http://inventivelabs.com.au/weblog/post/the-other-things -- 1.06554s
[200] http://inventivelabs.com.au/weblog/post/migrating-from-lighthouse-to-redmine -- 1.221826s
[200] http://inventivelabs.com.au/weblog -- 0.927285s
[200] http://inventivelabs.com.au/weblog/post/harry-we-are-here-to-help#comments -- 1.06135s
[200] http://inventivelabs.com.au/the-way-we-work -- 3.964896s
[200] http://inventivelabs.com.au/about-the-labs -- 2.099506s
[200] http://inventivelabs.com.au/blueprint -- 0.838506s
[200] http://inventivelabs.com.au/blueprint/screencast -- 0.655021s
[200] http://inventivelabs.com.au/portfolio -- 2.039005s
[200] http://inventivelabs.com.au/portfolio/project/resicon -- 0.860806s
[200] http://inventivelabs.com.au/portfolio/project/rmit-sro-database -- 0.747917s
[200] http://inventivelabs.com.au/portfolio/project/big-and-little-films -- 0.856033s
[200] http://inventivelabs.com.au/portfolio/project/australians-all -- 0.920293s
[200] http://inventivelabs.com.au/portfolio/project/cyclopharm -- 0.860129s
[200] http://inventivelabs.com.au/portfolio/project/shane-maloney -- 0.921109s
[200] http://inventivelabs.com.au/portfolio/project/register-of-urban-design-practices -- 0.923032s
[200] http://inventivelabs.com.au/portfolio/project/scribe-website-and-business-system -- 0.98052s
[200] http://inventivelabs.com.au/portfolio/project/iaia-reference-finder -- 0.921664s
[200] http://inventivelabs.com.au/portfolio/project/unico -- 0.922174s
[200] http://inventivelabs.com.au/portfolio/project/mcculloch-mcculloch-website -- 0.737803s
[200] http://inventivelabs.com.au/portfolio/project/readings -- 1.003813s
[200] http://inventivelabs.com.au/portfolio/project/plans-at-work-software-and-website -- 0.760026s
[200] http://inventivelabs.com.au/portfolio/project/back-to-back-theatre-website -- 1.073601s
[200] http://inventivelabs.com.au/portfolio/project/text-publishing-website -- 0.73745s
[200] http://inventivelabs.com.au/static/dark.css -- 1.054769s
[200] http://inventivelabs.com.au/static/style.css -- 0.993249s
[200] http://inventivelabs.com.au/static/images/apple-touch-icon.png -- 2.546059s
[200] http://inventivelabs.com.au/static/images/favicon.ico -- 0.519485s
[200] http://inventivelabs.com.au/static/files/assets/32bfd614/lighthouse_import.rake -- 1.212854s
[200] http://inventivelabs.com.au/weblog/post/the-writing-s-on-the-magnetic-wall -- 5.4549s
[200] http://inventivelabs.com.au/blueprint/ -- 1.267579s

You've got a few different ways to invoke the spider; to see them all, run:

$ rake -Tspider

It's pretty basic right now, but we'll be adding features to it over time.
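
For a sense of what a check like this involves, here's a minimal sketch of a link-walking spider. It is not Blueprint's spider: the spider method, the fetch callable and the stubbed SITE hash are all hypothetical, and a real version would make HTTP requests and pull links out of the returned HTML.

```ruby
require 'set'

# Hypothetical sketch, not Blueprint's spider: walks internal links
# breadth-first and collects any page whose status is above 399.
# `fetch` is any callable that takes a path and returns [status, links].
def spider(start_path, fetch)
  queue = [start_path]
  seen = Set.new(queue)
  failures = []

  until queue.empty?
    path = queue.shift
    status, links = fetch.call(path)
    puts "[#{status}] #{path}"
    failures << [status, path] if status > 399

    links.each do |link|
      # Set#add? returns nil if the link has already been seen.
      queue << link if seen.add?(link)
    end
  end

  failures
end

# A stubbed three-page site stands in for real HTTP requests.
SITE = {
  '/'        => [200, ['/weblog', '/missing']],
  '/weblog'  => [200, ['/']],
  '/missing' => [404, []]
}

failures = spider('/', ->(path) { SITE.fetch(path, [404, []]) })
```

Passing the fetcher in as a callable keeps the traversal separate from the network code, so the link-following logic can be tried out against a stub like this one.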
