A few months ago, I found myself with some spare time, and I decided to use it to take a dive into Django, a Python-based web framework. I wasn't really sure what I wanted to do at first, so I followed a few tutorials from around the web and created some simple example sites. Having some prior Python experience, I noticed right away that Django was pretty easy to use. I started to brainstorm things I could do with Django that would result in an actual, functional website. After some consideration, ammochamp.com was born: an ammunition price indexer. The idea was to gather ammunition price data from around the web and display the results to the end user in a way that made it easy to find the cheapest pricing available.

AmmoChamp First Thoughts

After some research, I settled on using Scrapy to gather the data from the web and store it in a PostgreSQL database, and Django to display the information once it was scraped. This was my first time using Scrapy, and it ended up being quite a joy to work with. I started by following the basic tutorial from the Scrapy website. While I did have to do additional research to complete this project, answers to almost every question I had were found at scrapy.org. Here is an example of the Scrapy spider I created:

We start by setting up our imports, then we set up our spider, giving it a name, a domain, and a start URL. XPaths are important for harvesting the content you want from a webpage; we will use XPaths within our spider to select separate elements of the page we are scraping and bring them back into Scrapy for storage in our database. We set the 'ammodeals_list_xpath' variable to '//tr[@class="odd"] | //tr[@class="even"]' to accommodate the alternating CSS classes that sgammo.com uses to display its data; we will use this variable later in the script. Scrapy will now begin to grab data from the webpage and parse it. Let's take a look at some example data from sgammo using Firebug:

Data from the SGAmmo Website

Here we have two tr elements, one with a class of odd and one with a class of even. Since the data we're harvesting appears in the same position in both kinds of rows, the extraction works regardless of which class the tr has. Let's look at the first piece of information we're harvesting, the title. It's extracted using this line of code:

item['title'] = sel.xpath('.//td[2]/a/text()').extract()

Using the XPath selector './/td[2]/a/text()', we are able to extract the 'title' of the item. Let's look at how that works. The start of the selector, './/', tells Scrapy that we are using an XPath relative to our selector XPath, which is set to any tr with a class of odd or even. The next portion, 'td[2]/', indicates we are looking for the second td within that tr, which is the td that contains our href. Next, 'a/' tells Scrapy to look for the first 'a' tag under the previously referenced td. Finally, 'text()' tells Scrapy to grab only the text portion of the a tag (the anchor text), which is what we want for our item title. Here is another way to look at it:

Here again is the xpath selector in question :

'.//td[2]/a/text()'

This selector reads like so: "Get the text from the first 'a' tag under the second table cell, relative to the selector XPath." If you look at the screenshot above, you should be able to follow the XPath selector and easily find the data under either the odd- or even-classed table rows.
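The title extraction can be approximated with Python's standard library alone (the sample HTML below is invented for illustration; note that ElementTree supports only a subset of XPath, so the '|' union and 'text()' step are emulated, whereas Scrapy's selectors handle them directly):

```python
import xml.etree.ElementTree as ET

# Invented sample mimicking sgammo's alternating odd/even row classes
HTML = """
<table>
  <tr class="odd">
    <td>1</td><td><a href="/p1">9mm Luger 115gr FMJ</a></td>
    <td>x</td><td><span>$12.99</span></td>
  </tr>
  <tr class="even">
    <td>2</td><td><a href="/p2">.223 Rem 55gr FMJ</a></td>
    <td>x</td><td><span>$7.49</span></td>
  </tr>
</table>
"""

root = ET.fromstring(HTML)

# ElementTree has no '|' union, so emulate
# '//tr[@class="odd"] | //tr[@class="even"]' with two queries.
rows = root.findall(".//tr[@class='odd']") + root.findall(".//tr[@class='even']")

# './/td[2]/a/text()' becomes: find the second td's anchor, read .text
titles = [row.find("./td[2]/a").text for row in rows]
print(titles)  # prints ['9mm Luger 115gr FMJ', '.223 Rem 55gr FMJ']
```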

The next selector is a little more difficult, as we will incorporate a regex at the end of the data extraction. Let’s look at the code:

item['price'] = sel.xpath('.//td[4]/span/text()').re(r'[0-9]+\.[0-9][0-9](?:[^0-9]|$)')

The XPath selector is easy enough, and reads as: "Get the text from the first span under the fourth table cell, relative to the selector XPath." The problem is, we don't want to store the data as-is, because it contains a non-numeric character, the dollar sign. We need to store this value as a number in order to sort the data later when we display it, so we need to strip any non-numeric characters before writing to the database. That's where the regex comes in. The regular expression '[0-9]+\.[0-9][0-9](?:[^0-9]|$)' ensures that only the digits and decimal point are returned from the scraped data, ignoring the dollar sign. It is important to do this here, before the data reaches the pipeline, which is where we will be writing it to the database.
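You can see the regex in action with the standard re module (the sample price string is invented; Scrapy's .re() returns the full match for each extracted string, just as re.findall does here):

```python
import re

# The same pattern passed to Scrapy's .re() in the spider
PRICE_RE = r'[0-9]+\.[0-9][0-9](?:[^0-9]|$)'

# The dollar sign falls outside the match, so only the number survives
matches = re.findall(PRICE_RE, '$12.99')
print(matches)         # prints ['12.99']
print(float(matches[0]))  # ready for numeric sorting and math
```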

I won't review the rest of the selectors that are similar to this one, as they should be self-explanatory. I do want to look at the ammoavailable selector, though; it looks like this:

item['ammoavailable'] = sel.xpath('.//td[4]/form/div/input[@name="op"]/@value').extract()

This is just a simple extraction of data, but the selector is a little different from the others, so we should look at how it works. At this point I would recommend installing the Firefox add-on FirePath. Firebug does well at locating XPaths (you can simply right-click on any HTML element and select 'Copy XPath'), but I find that having both helps. We also have to keep in mind that Scrapy does not load JavaScript out of the box. To see exactly what Scrapy sees code-wise, I use a Firefox add-on called JSOff, which lets you enable and disable the execution of JavaScript on the fly. Visiting the page with JavaScript disabled, we can see the data we are after.

The selector reads as follows: "Get the value from the input with a name of 'op' under the first div of the first form of the fourth table cell, relative to the selector XPath." For any product that is available, the value will be 'Add to cart', and this is what will be stored in the database through our pipeline. For any product that is not available, the input simply does not exist; we will use this later to ensure that we only display products that are in stock.

When scraping data with Scrapy for display on a website, it is always good practice to process it down to its final format before you store it in the database, to minimize parsing while displaying the data. Also, we will be grabbing a product image and storing it locally, to be used later when we display our data.
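Here is the same selector approximated against an invented sample row with the standard library (ElementTree can't do the final '/@value' step in the path, so .get('value') stands in for it):

```python
import xml.etree.ElementTree as ET

# Invented row: an in-stock product has an "Add to cart" input
ROW = """
<tr class="odd">
  <td>1</td><td><a href="/p1">9mm Luger</a></td><td>x</td>
  <td><form><div><input name="op" value="Add to cart"/></div></form></td>
</tr>
"""

row = ET.fromstring(ROW)

# './/td[4]/form/div/input[@name="op"]/@value' minus the attribute step
inp = row.find("./td[4]/form/div/input[@name='op']")
print(inp.get('value'))  # prints Add to cart
```

For an out-of-stock product the find() call would return None, which is exactly the absence we key off later.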

We will now store our data in a PostgreSQL database. Let's look at a diagram of how data flows in Scrapy:

Simple Scrapy dataflow diagram

We can see that the spider is grabbing product images and storing them, and sending data to the pipeline for processing. Let’s look at pipelines.py:
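As a rough sketch of the pipeline's logic (the duplicate check keyed on title and the in-memory storage are my simplifications; the real pipeline writes to PostgreSQL through a database session and raises Scrapy's DropItem on duplicates):

```python
class AmmoPipeline:
    """Simplified pipeline sketch: drop duplicates, compute price
    per round, then store the item. Items are collected in memory
    here so the logic is easy to follow."""

    def __init__(self):
        self.titles_seen = set()  # duplicate check (keyed on title)
        self.stored = []          # stand-in for the database table

    def process_item(self, item, spider):
        title = item['title'][0]
        if title in self.titles_seen:
            return None           # real code would raise DropItem
        self.titles_seen.add(title)
        # Price per round, rounded to four decimals, mirroring the
        # calculation shown below
        item['price_per_rd'] = round(
            float(item['price'][0]) / float(item['ammopieces'][0]), 4)
        self.stored.append(item)
        return item


pipe = AmmoPipeline()
out = pipe.process_item(
    {'title': ['9mm FMJ'], 'price': ['12.99'], 'ammopieces': ['50']}, None)
print(out['price_per_rd'])  # prints 0.2598
```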

First we set up our imports. Then we check for duplicates, initialize the database connection and session, and process the items from our spider. Everything here is pretty standard, except for this piece of code:

price_per_rd=round(float(item['price'][0].decode('unicode_escape')) / float(item['ammopieces'][0].decode('unicode_escape')), 4),

What I'm doing here is a little math within the pipeline. Converting the 'price' and 'ammopieces' items to floats lets me do the simple division needed to get the price of the ammunition per round before storing it to the database. This saves me from doing the math later at display time, which would cause performance issues, especially with a large number of products. You could read it like this:

price divided by ammopieces rounded to the fourth decimal

Please note that I had to add 'from __future__ import division' to get true (floating-point) division within the pipeline. This may be a bit of a hacky solution, but I find it works well.
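On Python 3 this import is a no-op, but on the Python 2 this project was written for it changes what the division operator returns (the example numbers below are invented):

```python
from __future__ import division  # no-op on Python 3; on Python 2 it
                                 # makes / return floats instead of
                                 # truncating integer operands

# Without the future import, Python 2 would truncate 649 / 50 to 12
price_per_rd = round(12.99 / 50, 4)
print(price_per_rd)  # prints 0.2598
```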

Let’s also look at items.py and models.py:


If you're familiar with Django, you will know what these files do. They are pretty basic, but still quite important: items.py needs to contain all the fields from the spider in order for Scrapy to recognize them, and models.py defines the table structure and the actual database connection. The real work has already been handled in spider.py and pipelines.py. You should now be able to execute your spider and crawl the sgammo site, populating your database with relevant ammunition pricing data. Stay tuned for part 2, where I'll show you how to display the data from the database.

– Using Django with Scrapy to index web content
