This post is about switching your User-Agent to bypass web application paywalls. While it’s not a “hack”, it’s a useful bit of knowledge. I saw an interesting article on Wired today that I wanted to read. The article was about aliens, quantum computers and dark energy. I won’t bore you with the details, but after I had read one sentence their paywall popped up and wouldn’t let me continue reading. Obviously, this frustrated me, as I wanted to know why aliens use dark energy in their quantum computers.
This is where User-Agents come in. A User-Agent is a piece of information sent with each request to the web application. Essentially, when you request a web page, a GET request for that page is sent to the server. This GET request contains several headers, such as Host, Cookie, various security headers, and many others. One of these headers is the User-Agent, and it tells the server what type of client is requesting the page. This allows the server to make decisions on what to send based on the value in the User-Agent. For example, if your User-Agent is specific to mobile phones, then the server could see this and send you the mobile version of the page you requested.
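To make that last point concrete, here is a minimal Python sketch of how a server might pick a page variant from the User-Agent header. The function name `choose_variant`, the list of mobile tokens and the sample header values are all assumptions for illustration; real sites use far more thorough detection.

```python
def choose_variant(headers: dict) -> str:
    """Pick a page variant from the User-Agent header (hypothetical server logic)."""
    user_agent = headers.get("User-Agent", "")
    # Assumption: any User-Agent containing one of these tokens is a phone.
    mobile_tokens = ("Mobile", "Android", "iPhone")
    if any(token in user_agent for token in mobile_tokens):
        return "mobile"
    return "desktop"


# A simplified set of request headers, as the server might see them.
request_headers = {
    "Host": "example.com",
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) Mobile/15E148",
}

print(choose_variant(request_headers))  # prints "mobile"
```

The important detail is that the server only sees whatever string the client chooses to send.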
OK, so how does this help with paywalls? Well, for content to get indexed by search engines, the search engines need to crawl (visit and read) that content. If the content is protected by a paywall, then the search engine’s crawlers would not be able to read it and wouldn’t be able to index it properly. This would be terrible for search engine optimisation, and for a big site like Wired, SEO is important. The solution would be to whitelist the User-Agent of the search engine crawler.
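A paywall that whitelists crawlers by User-Agent could look roughly like the sketch below. This is not Wired’s actual logic, which I have not seen. The names `CRAWLER_TOKENS` and `should_show_paywall` are hypothetical, and the check is deliberately naive.

```python
# Hypothetical allowlist of crawler tokens; a real site would maintain its own.
CRAWLER_TOKENS = ("Googlebot", "bingbot")


def should_show_paywall(user_agent: str, is_subscriber: bool) -> bool:
    """Return True when the visitor should be stopped by the paywall."""
    if is_subscriber:
        return False
    # Weak point: the decision trusts a header that the client fully controls.
    lowered = user_agent.lower()
    return not any(token.lower() in lowered for token in CRAWLER_TOKENS)


browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"
crawler_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(should_show_paywall(browser_ua, is_subscriber=False))  # prints "True"
print(should_show_paywall(crawler_ua, is_subscriber=False))  # prints "False"
```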
User-Agent Switcher Plugin
I hope you can see where I’m going with this. Google’s search engine crawler is called Googlebot, and you can reasonably assume that the paywall is configured to allow it to read the content. By changing our User-Agent to match Google’s, we can bypass the paywall. This can be done in several ways, depending on the browser. The simplest way is to install a plugin for your browser. There are many out there, but the one I use is called User-Agent-Switcher. With the plugin installed, you should be able to select Googlebot from the plugin settings. If you refresh the page, you should now be able to read the article.
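The same idea works outside the browser. Below is a minimal sketch using Python’s standard `urllib.request` module to send a request with Googlebot’s User-Agent. The User-Agent string is the one Google documents for its crawler; the helper names `build_request` and `fetch` and the example URL are my own placeholders, and whether a given site responds with the full article is not guaranteed.

```python
import urllib.request

# User-Agent string Google documents for its crawler.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url: str, user_agent: str = GOOGLEBOT_UA) -> urllib.request.Request:
    """Create a GET request that carries the chosen User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Placeholder URL; no network traffic is sent until fetch() is called.
    request = build_request("https://example.com/article")
    # urllib stores header names capitalised as "User-agent".
    print(request.get_header("User-agent"))
```

Calling `fetch("https://example.com/article")` would then send the request with the spoofed header.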
User-Agent Switching in Chrome
In Google Chrome, this can be done manually by right-clicking the page and choosing the Inspect option. Once the inspection window appears, click the 3 dots in the top right corner and select More Tools, then select Network Conditions. In the Network Conditions window, untick the Use browser default check box and then select Googlebot from the dropdown menu. Refresh the page with the paywall and you should now be able to access it.
User-Agent Switching in Firefox
In Firefox, you need to head to about:config in your URL bar. You will get a warning message, but click the Accept the Risk and Continue button. Then search for general.useragent.override, select String, click the + button and add your desired User-Agent. Once that value is set, refresh the page you’re trying to access, and the paywall should be gone.
As with a lot of web technologies, it’s difficult to get the perfect balance of usability and security. Take, for instance, the robots.txt file. This file is used to tell web crawlers which pages they should not crawl and index. It could contain the location of sensitive directories, such as the login page. The search engine won’t index the page, but anybody could request the robots.txt file in a browser and view what secrets are hiding.
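To illustrate, here is a short sketch that reads a robots.txt file the way a curious visitor might. The file contents and paths are made up for the example. `urllib.robotparser` is part of the Python standard library, while `disallowed_paths` is a small helper of my own.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt contents for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /login
"""


def disallowed_paths(robots_txt: str) -> list:
    """Collect every non-empty Disallow value, regardless of which crawler it targets."""
    paths = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# A well-behaved crawler uses the same file to decide what it may fetch.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(disallowed_paths(ROBOTS_TXT))  # prints "['/admin/', '/login']"
print(parser.can_fetch("*", "https://example.com/admin/"))  # prints "False"
```

The crawler obeys the rules, but the list of paths is readable by anyone.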
The point I’m trying to make is that websites need to be indexed by search engines for users to find them. The contents of those sites need to be visible to the search engine crawlers, and the names of those crawlers are public knowledge. Unless organisations opt not to have their content indexed and rely on sharing their content through other marketing platforms, there will likely always be ways of bypassing paywalls, especially if they are client side. A better method of restricting access to the content would be to allow access based on the IP address of the request origin, rather than on a header the client controls. Personally, I don’t like the trend of hiding content behind paywalls. I understand that ad-blocking is having a negative impact on the revenue generated by advertising on websites. However, there are better methods of monetising content, and the Brave browser is a perfect example of this. It rewards the content provider for user retention.
Disclaimer: My colleague Archie has pointed out that impersonating Googlebot could result in the owner of the web application banning your IP address. It is possible to verify a genuine Googlebot, so if the site owner were to check on you, they could block your IP address. However, you wouldn’t access the web without a VPN, would you?
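For the curious, the check Archie is referring to is usually a reverse DNS lookup followed by a forward lookup. The sketch below shows the general idea using Python’s `socket` module. The function name `is_genuine_googlebot`, the injectable lookup parameters and the list of accepted domains are my assumptions, so check Google’s own documentation for the current list.

```python
import socket

# Assumption: hostnames of genuine Google crawlers end in one of these domains.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_genuine_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                         forward_lookup=socket.gethostbyname_ex):
    """Verify a claimed Googlebot IP with a reverse and then a forward DNS lookup."""
    try:
        # Step 1: reverse lookup, IP address to hostname.
        hostname = reverse_lookup(ip)[0].rstrip(".")
        if not hostname.endswith(GOOGLE_CRAWLER_DOMAINS):
            return False
        # Step 2: forward lookup, hostname back to IPs; it must include the original.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) are treated as "not verified".
        return False
```

A site owner could run this against the address of any request claiming to be Googlebot, which is why a spoofed User-Agent alone is easy to spot.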
I hope you found this article useful and that it wasn’t too long-winded. Thanks for reading, and please go ahead and check out some of my other posts on haXez.