Web Scraping: Introduction, Best Practices & Caveats

Introduction:

Web Scraping is the process of extracting data from websites. It has a wide variety of use cases:

  • Marketing & Sales Intelligence companies use web scraping to fetch lead-related information.
  • Real Estate Tech companies use web scraping to fetch real estate listings.
  • Price Comparison Portals use web scraping to fetch product and price information from various e-commerce sites.

The process of web scraping usually involves spiders, which fetch HTML documents from relevant websites, extract the needed content based on business logic, and finally store it in a specific format. This blog is meant to be a primer on building highly scalable scrapers. It will cover the following items:

  1. Ways to scrape: We’ll cover basic scraping techniques and frameworks in Python with some code snippets.

  2. Scraping at scale: Scraping a single page is straightforward but there are challenges in scraping millions of websites including managing the spider code, collecting data and maintaining a data warehouse. We’ll explore such challenges and solutions to make scraping easy and accurate.

  3. Scraping Guidelines: Scraping data from websites without the owner's permission can be deemed malicious, so there are certain guidelines that need to be followed to ensure our scrapers are not blacklisted. We’ll look at the guidelines and best practices one should follow while crawling.

So let’s start scraping. 

DIFFERENT TECHNIQUES FOR SCRAPING:

Here we will discuss how to scrape a page and the different libraries available in Python (the most popular language for scraping) that we can use for it.

1. Requests - HTTP Library in Python: To scrape a page or a website, we first need the content of the HTML page in an HTTP response object. The requests library for Python is pretty handy and very easy to use (it is built on top of urllib3). I like requests because it’s easy to use and the code stays readable.

#Example showing how to use the requests library
import requests
r = requests.get("https://velotio.com") #Fetch HTML Page
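
A slightly fuller sketch of the same fetch, adding a timeout and a status-code check before handing the HTML to a parser (the URL is the same example site used above):

#Minimal sketch: fetch a page defensively before parsing it
import requests

r = requests.get("https://velotio.com", timeout=10) #Fail fast on slow servers
if r.status_code == 200:
    html = r.text #Raw HTML, ready to be parsed
else:
    print("Request failed with status", r.status_code)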

2. BeautifulSoup: You’ve now got the webpage, but you still need to extract the data. BeautifulSoup is a very powerful Python library that helps you extract data from a page. It's easy to use and has a wide range of APIs for data extraction. We use the requests library to fetch an HTML page and then use BeautifulSoup to parse it. In this example, we can easily fetch the page title and all links on the page. Check out the documentation for all the possible ways in which BeautifulSoup can be used.

from bs4 import BeautifulSoup
import requests
r = requests.get("https://velotio.com") #Fetch HTML Page
soup = BeautifulSoup(r.text, "html.parser") #Parse HTML Page
print "Webpage Title:" + soup.title.string
print "Fetch All Links:" soup.find_all('a')

3. Python Scrapy Framework:

Scrapy is a web crawling framework in which developers write code to create spiders, which define how a certain site (or a group of sites) will be scraped. Its biggest feature is that it is built on Twisted, an asynchronous networking library, so Scrapy uses non-blocking (asynchronous) code for concurrency, which makes spider performance very good. Scrapy is faster than BeautifulSoup; moreover, it is a framework for writing scrapers, as opposed to BeautifulSoup, which is just a library for parsing HTML pages.

Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives you an interactive shell after fetching a website:

$ pip install scrapy #Install Scrapy
$ scrapy shell https://velotio.com
In [1]: response.xpath("//a").extract() #Fetch all a hrefs

Now let’s write a custom spider to parse a website.

$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}
EOF
$ scrapy runspider myspider.py

That’s it. Your first custom spider is created. Now let’s understand the code.

  • name: Name of the spider, in this case, it is “blogspider”. 
  • start_urls: A list of URLs where the spider will begin to crawl from.
  • parse(self, response): This function is called whenever the crawler successfully crawls a URL. Remember the response object from earlier in the scrapy shell? This is the same response object that is passed to the parse(..).

When you run this, Scrapy looks at the start URL, gives you all the h2 elements with the entry-title class, and extracts the associated text from them. You can either write your extraction logic in the parse method or create a separate class for extraction and call it from the parse method.

You’ve seen how to extract simple items from a website using Scrapy, but this is just scratching the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy, and here is the documentation for LinkExtractor, with which you can instruct Scrapy to extract links from a web page.

4. Python lxml.html library: This is another Python library, similar to BeautifulSoup; in fact, Scrapy uses lxml internally. It comes with a list of APIs you can use for data extraction. Why would you use it when Scrapy itself can extract the data? Say you want to iterate over every 'div' tag and perform some operation on each one: this library gives you the list of 'div' tags, and you can iterate over them with the iter() function and traverse each child tag inside a parent div. Such traversal operations are difficult to express with a scraping framework alone. Here is the documentation for this library.
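
A small sketch of the traversal described above, assuming the same example site: parse the fetched page with lxml.html and walk the children of every div tag.

#Minimal sketch: traverse the child elements of every <div> using lxml.html
import requests
from lxml import html

r = requests.get("https://velotio.com") #Fetch HTML Page
tree = html.fromstring(r.content) #Parse the raw HTML into an element tree
for div in tree.findall(".//div"): #Every <div> in the document
    for child in div.iter(): #Walk the subtree under this div
        print(child.tag)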


CHALLENGES WHILE SCRAPING AT SCALE

Let’s look at the challenges and solutions for scraping at large scale, i.e., scraping 100-200 websites regularly:

  1. Data warehousing: Data extraction at a large scale generates vast volumes of information. If the data warehousing infrastructure is not properly built, searching, filtering and exporting this data becomes a cumbersome and time-consuming task. The data warehousing infrastructure needs to be scalable, fault-tolerant and secure. To achieve this, instead of maintaining your own database or infrastructure you can use Amazon Web Services (AWS): RDS (Relational Database Service) for structured data and DynamoDB for non-relational data. AWS takes care of backing up the data, automatically takes snapshots of the database, and gives you database error logs as well. This is a blog explaining how to set up infrastructure in the cloud for scraping.
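
As a hedged illustration of the DynamoDB option, here is a minimal sketch that stores one scraped record with boto3; the table name scraped_items and the item fields are made up for the example, and AWS credentials are assumed to be configured.

#Minimal sketch: store one scraped record in DynamoDB (table name and fields are hypothetical)
import boto3

table = boto3.resource("dynamodb").Table("scraped_items") #Assumes AWS credentials are configured
table.put_item(Item={
    "url": "https://velotio.com", #Partition key in this assumed schema
    "title": "Velotio Technologies",
    "scraped_at": "2019-01-01T00:00:00Z",
})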

  2. Pattern Changes: Every website changes its UI now and then, and scrapers usually need adjustments every few weeks to keep up, since a layout change can either give you incomplete data or crash the scraper. This is the most commonly encountered problem. To deal with it, you can write test cases for the parsing and extraction logic and run them regularly via Jenkins or any other CI tool to catch failures.
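
As an illustration, here is a hedged pytest-style sketch that pins the extraction logic against a saved HTML fixture; the fixture path and the extract_titles helper are hypothetical names.

#Minimal sketch: regression test for extraction logic (fixture path and helper name are hypothetical)
from bs4 import BeautifulSoup

def extract_titles(html):
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

def test_extract_titles():
    with open("fixtures/sample_listing.html") as f: #A saved copy of a known page
        titles = extract_titles(f.read())
    assert titles, "Extraction returned nothing - the page layout may have changed"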

  3. Anti-Scraping Technologies: Some websites use anti-scraping technologies; LinkedIn is a good example of this. If you’re hitting a particular website from the same IP address, there is a high chance the target website will block that IP address. Proxy services with rotating IP addresses help in this regard: proxy servers mask your IP address and can improve crawling speed. There are several rotating proxy services available on the internet, and scraping frameworks like Scrapy provide easy integration for several of them.
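
A minimal sketch of routing requests through a rotating pool of proxies; the proxy addresses below are placeholders, and real rotating-proxy services usually hand you a single gateway endpoint instead.

#Minimal sketch: pick a random proxy per request (proxy addresses are placeholders)
import random
import requests

PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"] #Hypothetical proxy pool

def fetch(url):
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

r = fetch("https://velotio.com")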

  4. Javascript-based dynamic content: Websites that heavily rely on Javascript and AJAX to render dynamic content make data extraction difficult. Scrapy and related frameworks/libraries will only work with what they find in the HTML document; Ajax calls and Javascript are executed at runtime, so they can’t scrape that content. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. Another alternative is PhantomJS, which provides a headless WebKit-based environment.
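
One common way to do this (not the only one) is to drive Headless Chrome through Selenium; a minimal sketch, assuming Chrome and chromedriver are installed and on the PATH:

#Minimal sketch: render a Javascript-heavy page in Headless Chrome via Selenium
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless") #Run Chrome without a visible window
driver = webdriver.Chrome(options=options) #Assumes chromedriver is on the PATH
driver.get("https://velotio.com")
html = driver.page_source #HTML after Javascript has executed
driver.quit()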

  5. Honeypot traps: Some website designers put honeypot traps inside websites to detect web spiders: links that a normal user can’t see but a crawler can. Some honeypot links are given the CSS style “display: none” or are color-disguised to blend in with the page’s background color. Detecting such traps is obviously not easy and requires a significant amount of programming work to accomplish properly; as a result, the technique is not widely used on either side, the server side or the bot/scraper side.
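
As a rough illustration, here is a sketch that skips links hidden with inline CSS; real honeypots are more varied (CSS classes, off-screen positioning), so treat this as a partial heuristic only.

#Minimal sketch: skip anchors hidden via inline CSS (a partial heuristic only)
from bs4 import BeautifulSoup

def visible_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = a.get("style", "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            continue #Likely a honeypot link
        links.append(a["href"])
    return links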

  6. Quality of data: Records that do not meet the quality guidelines will affect the overall integrity of the data. Making sure the data meets quality guidelines while crawling is difficult because the checks need to be performed in real time, and faulty data can cause serious problems if you are running any ML or AI technologies on top of it. One thing you can do here is write test cases: make sure whatever your spiders are extracting is correct and that they are not scraping any bad data.
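
A hedged sketch of the idea: validate each record as it is scraped and drop or flag anything that fails; the field names here are made up for the example.

#Minimal sketch: validate a scraped record before storing it (field names are hypothetical)
def is_valid(record):
    if not record.get("title"):
        return False #Missing mandatory field
    price = record.get("price")
    if price is not None and price < 0:
        return False #Obviously bad value
    return True

sample = [{"title": "Item A", "price": 10}, {"title": "", "price": 5}]
clean = [r for r in sample if is_valid(r)] #Keeps only records that pass the checks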

  7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains, and the more data it contains, the longer it takes to scrape that site. This may be fine if your purpose for scanning the site isn’t time-sensitive, but that often isn’t the case: stock prices, sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case? One solution is to design your spiders carefully. If you’re using a framework like Scrapy, apply proper LinkExtractor rules so that the spider doesn’t waste time scraping unrelated URLs (see the sketch below). You can also use distributed crawling packages available in Python such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time but can hit multiple domains at once, making it great for parallel scraping, while Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and the variation seen in large websites.
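
For instance, here is a hedged sketch of a CrawlSpider that only follows blog-style URLs so the spider does not waste requests on unrelated pages; the allow pattern and spider name are just examples.

#Minimal sketch: restrict crawling with LinkExtractor rules (the allow pattern is an example)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class FocusedSpider(CrawlSpider):
    name = 'focusedspider'
    start_urls = ['https://blog.scrapinghub.com']
    rules = [
        Rule(LinkExtractor(allow=r'/20\d\d/'), callback='parse_item', follow=True), #Only date-style URLs
    ]

    def parse_item(self, response):
        yield {'title': response.css('h2.entry-title a::text').extract_first()}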

  8. Captchas: Captchas have been around for a long time, and they serve a great purpose: keeping spam away. However, they also pose a great challenge to web crawling bots. When captchas are present on a page you need to scrape data from, basic web scraping setups will fail and cannot get past this barrier. For this, you would need middleware that can take the captcha, solve it, and return the response.

  9. Maintaining Deployment: If you’re scraping millions of websites, you can imagine the size of the codebase; even executing the spiders becomes hard. In such cases, you can Dockerize your spiders and run them in an orchestration environment such as Kubernetes or Amazon ECS, where the containers can be scheduled to run at regular intervals.


Scraping Guidelines / Best Practices:

  1. Respect the robots.txt file: robots.txt is a text file webmasters create to instruct robots (typically search engine robots) how to crawl and index pages on their website, so this file generally contains instructions for crawlers. It should be the first thing to check when you are planning to scrape a website: every website sets out rules for how bots/spiders should interact with it in its robots.txt file, and some websites block bots altogether. If that is the case, it is best to leave the site alone and not attempt to crawl it; scraping sites that explicitly disallow bots can get you into legal trouble. Apart from blocking, the robots.txt file also specifies a set of rules the site considers good behavior, such as areas that are allowed to be crawled, restricted pages, and frequency limits for crawling. You should respect and follow all the rules set by a website while attempting to scrape it. The file usually lives at the root of the website (e.g. /robots.txt).
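
The Python standard library can do this check for you via urllib.robotparser; a minimal sketch (the path being checked is just an example):

#Minimal sketch: check robots.txt before fetching a URL
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://velotio.com/robots.txt")
rp.read() #Download and parse robots.txt
print(rp.can_fetch("mybot", "https://velotio.com/careers")) #True if this bot may crawl the URL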

  2. Do not hit the servers too frequently: Web servers are not fail-proof. Any web server will slow down or crash if the load on it exceeds the limit it can handle. Sending too many requests too frequently can bring the website’s server down or make the site too slow to load, which creates a bad user experience for the human visitors on the website. While scraping, you should always hit the website with a reasonable time gap and keep the number of parallel requests in control. This gives the website some breathing space, which it should indeed have. A well-configured website will set a crawl limit in its robots.txt file; if none is given, a standard delay of 10 seconds between requests is a safe default. This will also help you avoid getting blocked by the target website.
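
In Scrapy this can be expressed directly in settings.py; a minimal sketch, with values matching the conservative suggestions above (tune them per site):

#Minimal sketch: throttle a Scrapy spider in settings.py (values are examples, tune per site)
DOWNLOAD_DELAY = 10 #Wait 10 seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1 #No parallel hammering of a single domain
AUTOTHROTTLE_ENABLED = True #Back off automatically when the site slows down
ROBOTSTXT_OBEY = True #Respect robots.txt rules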

  3. User Agent Rotation and Spoofing: The User-Agent string in the request header identifies which browser is being used, which version, and on which operating system. Every request made from a web browser contains a user-agent header, and using the same user-agent consistently leads to the detection of a bot. User agent rotation and spoofing is the best solution for this: create a list of user agents and pick a random one for each request. Websites do not want to block genuine users, so you should try to look like one; set your user-agent to a common web browser instead of the default user-agent (such as wget/version or urllib/version). If you’re using Scrapy, you can set the USER_AGENT property in settings.py. Alternatively, if you want to be transparent, you can use the format ‘myspidername: myemailaddress’ so that the target website knows it is a spider and has a contact address.
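
A minimal sketch of rotating user agents with requests; the strings in the list are truncated placeholders for real browser user-agent strings.

#Minimal sketch: pick a random user agent per request (the strings are placeholders)
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...", #Replace with full browser UA strings
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ...",
]

headers = {"User-Agent": random.choice(USER_AGENTS)}
r = requests.get("https://velotio.com", headers=headers)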

  4. Disguise your requests by rotating IPs and proxy services: We’ve discussed this in the challenges section above. It’s always better to use rotating IPs and a proxy service so that your spider won’t get blocked.

  5. Do not follow the same crawling pattern: Unless specified otherwise, programmed bots follow very specific, repetitive logic, and only robots crawl that way. Sites with intelligent anti-crawling mechanisms can easily detect spiders by finding patterns in their actions. Humans generally do not perform repetitive tasks, so incorporate some random clicks, mouse movements and random actions to make a spider look like a human.

  6. Scrape during off-peak hours: To make sure that a website isn’t slowed down by high traffic from humans as well as bots, it is better to schedule your web-crawling tasks to run during off-peak hours. The off-peak hours of a site can be estimated from the geo-location of the site’s traffic. Scraping during off-peak hours avoids any extra load you might put on the server during peak hours, and it also significantly improves the speed of the scraping process.

  7. Use the scraped data responsibly: Scraping the web to acquire data is unavoidable in the present scenario, but you should respect copyright laws while using the scraped data. Republishing the data elsewhere is totally unacceptable and can be considered copyright infringement. While scraping, it is important to check the source website’s TOS page to stay on the safer side.

  8. Use Canonical URLs: When we scrape, we tend to hit duplicate URLs and hence collect duplicate data, which is the last thing we want. A single website may serve multiple URLs with the same data; in this situation, the duplicate URLs will usually declare a canonical URL that points to the parent or original URL. By scraping only the canonical URL, we make sure we don’t scrape duplicate content. In a framework like Scrapy, duplicate URLs are handled by default.
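
A minimal sketch of reading the canonical URL from a page with BeautifulSoup, so duplicates can be collapsed before they are scraped again:

#Minimal sketch: resolve a page to its canonical URL before deciding to scrape it
import requests
from bs4 import BeautifulSoup

r = requests.get("https://velotio.com")
soup = BeautifulSoup(r.text, "html.parser")
link = soup.find("link", rel="canonical")
canonical = link["href"] if link else r.url #Fall back to the fetched URL
print(canonical)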

  9. Be transparent: Don’t misrepresent your purpose, or use deceptive methods to gain access. If you have a login and password that identifies you to gain access to a source, use it.  Don’t hide who you are. If possible, share your credentials.


Conclusion:

We’ve seen the basics of scraping, the frameworks involved, how to crawl, and the do’s and don'ts of scraping. To conclude:

  • Follow the target site’s rules while scraping, and don’t give them a reason to block your spider.

  • Maintaining data and spiders at scale is difficult. Use Docker/Kubernetes and a public cloud provider like AWS to easily scale your web-scraping backend.

  • Always respect the rules of the websites you plan to crawl, and if APIs are available, always use them first.


About the Author


Abhishek is a passionate Software Engineer who loves to learn new technologies. He has developed applications using Python, Django and Node.js, and is currently working on scraping and loving it. He is learning Japanese and loves to watch different kinds of anime. In his free time, he likes to play basketball and table tennis.