For better or worse, web scraping has become an integral part of the Internet. In short, web scraping is the automated extraction of data from websites.
The whole process is carried out by a piece of code: it sends a GET request to a specific website, parses the HTML document it receives, finds the information it needs, and converts it to the format you're interested in.
Opinions about it are split: some people are for it, others against. In this article, we'll take the defender's perspective and look at how to protect your content from web scraping.
The truth, however, is that there's no reliable way to eliminate web scraping entirely; all you can do is make it harder. There's another side to the coin, though: the harder you make it for scrapers, the harder you risk making it for legitimate users.
Still, there are some effective strategies for putting major obstacles in a scraper's way.
Prohibit scraping in your terms of use
The first line of defense is legal: state in your terms of use that scraping is prohibited. Here's an example of such a clause:
“You may only use or reproduce the content on the website for your own personal and non-commercial use.”
Farming, scraping, extracting, collecting or mining the website's content in any form and by any means whatsoever is strictly prohibited.
Additionally, you may not mirror any material contained on the Website.
This means that no one may extract, use or mirror the website's content for any purpose other than personal use.
Bear in mind, though, that this strategy won't protect your content 100%, so let's look at the other strategies.
Rate limit individual IP addresses
If you're receiving thousands of requests from a single computer, there's a very high probability that the person behind it is making automated requests to your site.
The first measure you can take is to block requests from computers that make them too fast.
At the same time, take into account that corporate networks, VPNs and some proxy services route many users through the same IP address, so aggressive blocking can cut off whole groups of legitimate users at once.
There's another scenario, too: a scraper with enough resources can circumvent this protection by running on multiple machines, so that only a few requests come from any single computer.
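A per-IP rate limit can be sketched as a sliding-window counter. This is a minimal in-memory version (the window size and request cap are illustrative values, and the `allow_request` name is my own):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # look-back window (illustrative value)
MAX_REQUESTS = 20     # allowed requests per window per IP (illustrative value)

_hits = defaultdict(deque)  # ip -> timestamps of its recent requests

def allow_request(ip, now=None):
    """Return True if this IP is still under the rate limit, False otherwise."""
    now = time.time() if now is None else now
    window = _hits[ip]
    # Drop timestamps that have fallen out of the look-back window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False        # too many recent requests: block this one
    window.append(now)
    return True
```

In production, this state would live in something shared like Redis rather than process memory, but the windowing logic is the same.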
Use CAPTCHAs
This is probably the most common way of protecting data from scraping: a user is asked to solve a CAPTCHA before accessing the website's data. This method works well for systems whose data is accessed through infrequent, separate requests.
CAPTCHAs can be useful, but they should be used sparingly. For example, you may activate one only after a particular client has made dozens of requests in the past few seconds.
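The "activate only after a burst of requests" policy can be sketched the same way as the rate limiter, except that crossing the threshold triggers a challenge instead of a block (threshold, window, and the `needs_captcha` name are assumptions):

```python
import time
from collections import defaultdict, deque

CAPTCHA_THRESHOLD = 12  # requests in the window before a CAPTCHA is shown
WINDOW_SECONDS = 5      # illustrative burst window

_recent = defaultdict(deque)  # client id -> timestamps of recent requests

def needs_captcha(client_id, now=None):
    """Record a request and report whether this client should now solve a CAPTCHA."""
    now = time.time() if now is None else now
    q = _recent[client_id]
    # Forget requests older than the burst window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    q.append(now)
    return len(q) > CAPTCHA_THRESHOLD
```

Ordinary visitors never cross the threshold and never see a challenge; only burst traffic does.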
Use honeypots
A honeypot is a link to fake content that is invisible to a normal user but present in the HTML, so it will turn up when a program parses the website. By redirecting a scraper to such honeypots, you can detect it and make it waste resources visiting pages that contain no data.
However, make sure to disallow such links in your robots.txt file so that search engine crawlers don't end up in the honeypots.
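Server-side, the idea is: embed a link that browsers hide via CSS, and flag any client that follows it. A framework-free sketch (the trap path, handler and variable names are all hypothetical):

```python
HONEYPOT_PATH = "/trap-page"   # assumed URL; also disallow it in robots.txt
flagged_ips = set()            # clients that followed the trap link

def render_honeypot_link():
    # Invisible to humans (display:none), but present in the raw HTML
    # that a naive parser will extract and follow.
    return f'<a href="{HONEYPOT_PATH}" style="display:none">trap</a>'

def handle_request(path, ip):
    """Flag clients that request the honeypot; embed the trap in real pages."""
    if path == HONEYPOT_PATH:
        flagged_ips.add(ip)                 # only a bot should ever get here
        return "nothing to see here"        # empty decoy content
    return f"<html>real content {render_honeypot_link()}</html>"
```

Once an IP lands in `flagged_ips`, you can rate-limit it, serve it decoy data, or block it outright.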
Change DOM structure frequently
Most scrapers parse the HTML retrieved from the server. To make it hard for them to reach the data they want, you can frequently change the structure of that HTML, forcing an attacker to analyze the website's structure all over again to extract the data.
Require login for access
Requiring account creation in order to view your content is quite an effective way to deter scrapers, although it also puts up barriers for real users.
The good thing is that you'd be able to accurately track user and scraper actions, so you can easily detect when a specific account is being used for scraping and ban it. This way you identify individual scrapers, not just IP addresses.
To prevent scripts from creating many accounts, you should:
- Require an email address for registration, and verify it by sending a link that must be opened to activate the account.
- Require a CAPTCHA to be solved during registration.
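The email-verification step above can be sketched with a signed activation token, so the link in the email can't be forged (the secret and function names are hypothetical):

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # assumed; keep out of source control in practice

def make_activation_token(email):
    """Sign the email address; the token goes into the activation link."""
    return hmac.new(SECRET, email.encode(), hashlib.sha256).hexdigest()

def verify_activation(email, token):
    """Check, in constant time, that the token matches this email address."""
    return hmac.compare_digest(make_activation_token(email), token)
```

The account is only activated when `verify_activation` succeeds for the address the user registered with.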
Keep in mind, though, that requiring account creation to view content will drive some users and search engines away: if readers have to create an account before reading an article, many will simply go elsewhere.
Change Website HTML Regularly
Most scrapers rely on finding patterns in a site's HTML markup and then use those patterns as clues to help their scripts locate the right data in the site's HTML soup.
If your site's markup changes frequently or is thoroughly inconsistent, a scraper may eventually get frustrated enough to give up.
This doesn't mean you need a complete redesign, however. Small changes, such as renaming a class or id in the HTML, can be enough to break most scrapers.
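As an illustration, a build step could rewrite known class names to per-release aliases, so selectors like `.price` stop working after every deployment. A minimal sketch (the function name, class list and seeding scheme are assumptions):

```python
import random
import string

def randomize_classes(html, class_names, seed):
    """Rewrite known class names to per-release aliases so scraper selectors break."""
    rng = random.Random(seed)  # e.g. seed with the release number
    mapping = {}
    for name in class_names:
        alias = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        mapping[name] = alias
        html = html.replace(f'class="{name}"', f'class="{alias}"')
    return html, mapping  # the mapping drives the matching CSS rewrite
```

The site's own stylesheets would be rewritten with the same mapping, so real users see no difference while a scraper's hard-coded selectors go stale.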
To make it harder for a scraper to use your data, you can also obscure the data itself. Here are some options:
- Render part of the data as an image;
- Change the HTML frequently, so that an attacker has to keep updating their HTML parser.
Use geo-fencing
Geo-fencing means exposing a site only within the geographic regions where it conducts business. This creates additional difficulty for scrapers, which would have to run from within that region; admittedly, that may involve nothing more than a VPN connection to a local point of presence.
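At its core, geo-fencing is a country check on the client's IP. A minimal sketch, where the lookup table is a hypothetical stand-in for a real GeoIP database (such as MaxMind's GeoLite2):

```python
ALLOWED_COUNTRIES = {"US", "CA"}   # regions where the site does business (assumed)

# Hypothetical stand-in for a GeoIP database lookup.
IP_TO_COUNTRY = {
    "203.0.113.7": "US",
    "198.51.100.2": "DE",
}

def is_allowed(ip):
    """Serve content only to IPs that resolve to an allowed country."""
    country = IP_TO_COUNTRY.get(ip)      # None for unknown IPs
    return country in ALLOWED_COUNTRIES  # unknown IPs are rejected
```

Rejecting unknown IPs (rather than allowing them) is the conservative default; the trade-off is that some legitimate visitors behind odd routing will be turned away.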
The Bottom Line
Although we've covered a number of techniques, no single method can prevent web scraping 100 percent.
At the same time, if you combine a few of them, you can build a pretty effective deterrent. It's a good idea to stay vigilant and keep monitoring traffic to make sure your services are being used the way you intended.
Try to find a balance between usability for real users and resistance to scrapers. Bear in mind that everything you do may negatively impact the user experience in one way or another, so look for compromises.
By the way, take a look at this post about Proxies for web scraping.