By scraping, I mean getting data from web pages in a programmable way. For example, check out Anand’s post about how he scraped Flipkart while hunting for a laptop. That’s how I came to know about ScraperWiki, and recently I wrote some quick and dirty scrapers to help a friend. I thought of sharing that knowledge, which might help you write simple scrapers of your own.
Yes, ScraperWiki makes it easy. Along with handy methods, it also gives you an infrastructure for your data processing. Even if you have a last-generation laptop, you can type in the code and let ScraperWiki do the job of processing large amounts of data while you sit back and relax. ScraperWiki gives you three languages to choose from:
- PHP 5.3.5
- Python 2.7.2
- Ruby 1.9.2
The only software you need is your browser; the rest is taken care of by ScraperWiki, unless you have some special need. The best place to start is the Documentation and the live tutorials, which give you everything you need to get going.
For starters, let’s scrape jobs from the HasGeek job board. Know your page before you scrape it! Here is the HTML structure of the page we are going to scrape.
Every job is posted in a stickie (the yellow post-it-note boxes you see on the page), which can be a single or a group posting. Now check the HTML source more closely to find out what these are really made of. What makes a stickie? It’s an li tag. Likewise, get to know all the elements involved in your scraping goal: a single stickie is a bunch of span elements inside an a, which is wrapped by an li tag.
```html
<li class="stickie">
  <a href="/view/f5ai9" rel="bookmark">
    <span class="location">Mumbai</span>
    <span class="date">Sep 27</span>
    <span class="headline">Android+Java hacker at Mobile Payments startup by IIT-IIM founders</span>
    <span class="company">PayMe</span><span class="new">New!</span>
  </a>
</li>
```
Group stickies contain more than one job posting. These stickies are also made of span elements, but with a slight difference: the first posting of the group is inside an anchor, while the others are inside divs. See below:
```html
<li class="stickie grouped">
  <a href="/by/d52f64f84d243b73dc01b15738520375">
    <span class="location">Bangalore</span>
    <span class="date">Sep 28</span>
    <span class="headline">Search, Relevance Architect,Bangalore</span>
    <span class="company">eCommerce</span><span class="new">New!</span>
  </a>
  <div class="stickie grouped under">
    <span class="location">Bangalore</span>
    <span class="date">Sep 28</span>
    <span class="headline">Demand Generation, Relevance Engineer(Bangalore)</span>
    <span class="company">eCommerce</span><span class="new">New!</span>
  </div>
  <div class="stickie grouped under">
    <span class="location">Bangalore</span>
    <span class="date">Sep 28</span>
    <span class="headline">Big Data, Systems Engineer in Bangalore</span>
    <span class="company">eCommerce</span><span class="new">New!</span>
  </div>
</li>
```
We are going to extract information from both single and grouped stickies. The twenty-odd lines of code below are my attempt to do that.
```python
# HasJobs Scraper
# Scrapes from HasGeek Job Board
# http://jobs.hasgeek.in
# By Santhosh Kumar Srinivasan
# Fork at https://github.com/sanspace/scrapers.git
# To run at scraperwiki.com
# or locally using http://blog.scraperwiki.com/2012/06/07/local-scraperwiki-library/
# or https://github.com/scraperwiki/scraperwiki_local
# Change src at the bottom to any desired job board URL such as
# http://jobs.hasgeek.com/category/programming
# or http://jobs.hasgeek.com/type/freelance
# to fetch categorized jobs instead of everything

import scraperwiki
import lxml.html

def save_data(elem, jobs):
    # Get all the span elements which hold the data we look for
    for span in elem.cssselect('span'):
        jobs[span.attrib['class']] = span.text_content()
    # Save to the DB; needs a dict and a unique key
    print scraperwiki.sqlite.save(unique_keys=['link'], data=jobs)

def scrape_content(url):
    html = scraperwiki.scrape(url)
    root = lxml.html.fromstring(html)
    # Select all the stickies except the first one,
    # i.e. the siblings of the first stickie that says POST A JOB
    # Refer http://api.jquery.com/next-siblings-selector/
    for job in root.cssselect('ul#stickie-area li#newpost ~ li'):
        jobs = dict()
        # Have to build the URL as the anchor is relative
        jobs['link'] = url + job.cssselect('a')[0].attrib['href']
        if job.attrib['class'] == "stickie grouped":
            # Group postings: get all direct children of the grouped stickie
            # Refer http://api.jquery.com/child-selector/
            for elem in job.cssselect('li > *'):
                save_data(elem, jobs)
        else:
            save_data(job, jobs)

# Let's get started
src = 'http://jobs.hasgeek.in'
scrape_content(src)
```
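If you want to exercise the saving step without a ScraperWiki account, the behaviour of scraperwiki.sqlite.save can be approximated with the standard library’s sqlite3 module. This is my own stand-in, not ScraperWiki’s implementation: it upserts on the link column, which is roughly what unique_keys=['link'] gives you.

```python
import sqlite3

FIELDS = ('link', 'location', 'date', 'headline', 'company')

def save_locally(job, conn):
    """Rough stand-in for scraperwiki.sqlite.save(unique_keys=['link'], ...)."""
    conn.execute('''CREATE TABLE IF NOT EXISTS jobs (
                        link TEXT PRIMARY KEY, location TEXT,
                        date TEXT, headline TEXT, company TEXT)''')
    # INSERT OR REPLACE mimics the unique-key upsert behaviour
    conn.execute('INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?, ?)',
                 tuple(job.get(f) for f in FIELDS))
    conn.commit()

conn = sqlite3.connect(':memory:')
save_locally({'link': 'http://jobs.hasgeek.in/view/f5ai9',
              'location': 'Mumbai', 'date': 'Sep 27',
              'headline': 'Android+Java hacker', 'company': 'PayMe'}, conn)
# Saving the same link again replaces the row instead of duplicating it
save_locally({'link': 'http://jobs.hasgeek.in/view/f5ai9',
              'location': 'Mumbai', 'date': 'Sep 28',
              'headline': 'Android+Java hacker', 'company': 'PayMe'}, conn)
row_count = conn.execute('SELECT COUNT(*) FROM jobs').fetchone()[0]
saved_date = conn.execute('SELECT date FROM jobs').fetchone()[0]
```

The unique key matters: without it, re-running the scraper would pile up duplicate rows for the same posting.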
The function scrape_content gets the HTML content from the URL and then extracts all the postings from the page. This involves selecting the desired elements from the HTML content using CSS selectors, which work just like the selectors you use in your stylesheets or in your jQuery code.
For each posting, it then calls save_data to pull the desired data out of the posting; if it is a group posting, save_data is called for each posting in the group. This function simply extracts all the data from the stickie’s contents, which are span elements, and then calls a ScraperWiki function to save them to the SQLite DB.
A bit of Python knowledge is enough to pull this off, and many third-party libraries are available for advanced users. It’d be fun to scrape some of your favorite sites. Let me know which site you are going to scrape!