The complete guide to proxies for web scraping

Web scraping is a huge industry with many applications, such as lead generation, machine learning, data aggregation, and much more. Every business would benefit from web data in some way.

Its biggest challenge, however, is gathering data reliably and at scale. Site owners don’t like their websites to be scraped so they often leverage anti-scraping measures like CAPTCHAs or honeypots, and even ban the offending IPs.

For this reason, quality proxies are in high demand.

In this article, we’ll look at what proxies are, why we need them for web scraping, a few ways they can be categorized, and some of the best proxy providers on the market.

What is a proxy?

A proxy is a third-party server that forwards your requests on your behalf, so the target site sees the proxy’s IP address instead of yours. It basically acts as the middleman between you and the site you’re connecting to.
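As a minimal sketch of the idea, here is how a Python client could be pointed at a proxy using only the standard library. The proxy address is a placeholder from a reserved documentation range, not a real server, so the actual request is left commented out.

```python
import urllib.request

# Hypothetical proxy address; replace it with one from your provider.
PROXY_URL = "http://203.0.113.10:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)
# With a working proxy, the target site would see the proxy's IP, not yours:
# html = opener.open("https://example.com", timeout=10).read()
```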

Why use proxies for web scraping?

Most web scraping projects (except perhaps the most trivial ones) will require proxies. Let’s see why.

Avoid IP bans

Using the same IP address for all requests, especially with the exact same interval between them (or any other non-human behavior, for that matter), might result in the site detecting you as a bot.
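One simple way to avoid perfectly regular intervals is to add random jitter to the delay between requests. The sketch below is only an illustration; the 5-second base and the 50% jitter are arbitrary assumptions, not values recommended by any particular site.

```python
import random


def jittered_delay(base_seconds=5.0, jitter=0.5, rng=random):
    """Return a delay drawn uniformly from base*(1 - jitter) to base*(1 + jitter)."""
    low = base_seconds * (1 - jitter)
    high = base_seconds * (1 + jitter)
    return rng.uniform(low, high)


# Usage: sleep a different amount before each request.
# import time
# for url in urls:
#     fetch(url)                      # fetch is a hypothetical download function
#     time.sleep(jittered_delay())
```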

Once that happens, it might blacklist your IP or even ban it permanently, instantly denying any further requests from it.

Proxies play a crucial role here because if one of them is banned, you can rely on the next one.
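The failover idea can be sketched as a small pool that retires a proxy once it is banned. The class name, its methods, and the addresses below are hypothetical and only illustrate the bookkeeping; how you detect a ban (a 403, a block page, a CAPTCHA) depends on the target site.

```python
class ProxyPool:
    """Keep track of which proxies are still usable."""

    def __init__(self, proxies):
        self._active = list(proxies)
        self._banned = set()

    def current(self):
        """Return the proxy to use next, or raise if none are left."""
        if not self._active:
            raise RuntimeError("all proxies in the pool are banned")
        return self._active[0]

    def mark_banned(self, proxy):
        """Retire a proxy so later requests fall through to the next one."""
        if proxy in self._active:
            self._active.remove(proxy)
            self._banned.add(proxy)


# Placeholder addresses from a reserved documentation range.
pool = ProxyPool(["http://203.0.113.1:8080", "http://203.0.113.2:8080"])
first = pool.current()
pool.mark_banned(first)   # e.g. after the site starts refusing this IP
second = pool.current()   # the pool now hands out the next proxy
```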

Large-scale scraping

Moderately complex projects will require large-scale scraping with multiple requests in parallel. Of course, you should always take into account the strain this will cause on the site and how much traffic you estimate they can handle. The last thing you’ll want is to DDoS the server.

Flooding the target site with many concurrent requests from the same IP will quickly get you flagged as a bot. The site might then block your requests with 429 (Too Many Requests) errors, or simply ban your IP.

Therefore, using a sizable proxy pool is crucial in this case.
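A common way to spread parallel requests over a pool is plain round-robin assignment. The sketch below only builds the plan of which proxy handles which URL; the URLs and proxy addresses are placeholders, and the actual fetching (threads, async tasks) is left out.

```python
from itertools import cycle


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy, cycling through the pool in order."""
    if not proxies:
        raise ValueError("proxy pool is empty")
    return list(zip(urls, cycle(proxies)))


urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
proxies = ["http://203.0.113.1:8080", "http://203.0.113.2:8080"]
plan = assign_proxies(urls, proxies)
# Each (url, proxy) pair can then be handed to a worker thread or async task.
```

With a larger pool, each IP ends up sending only a small share of the total traffic, which is the point of using a pool in the first place.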

Access region-specific content

Sites intended for audiences in a specific geographical region (local subsidiaries of major e-commerce sites, for instance) might block your IP if you’re outside of that region.

In this case, the only way to get access is to use proxies from that country.

Improved security

Quality proxy providers often run firewall software with packet filtering capabilities that can block malicious traffic before it reaches your machine.

Of course, this won’t be true for most public proxies that you will find on free listing sites.

Types of proxies

For the purpose of web scraping, we can categorize proxies in four ways: by the level of anonymity, by how the IP is assigned, whether it’s dedicated or shared, and by the protocol. Let’s see each of these.

By anonymity level

The level of anonymity of a proxy determines whether the site you’re scraping can tell that you’re behind a proxy, or even discover your real IP.

Proxies can be Transparent, Anonymous, or Elite.

Transparent (Level 3) proxies will always report your IP address to the target site. Technically, this means that they set the X-Forwarded-For header to your real IP and identify themselves in the Via header.

The target site can therefore instantly see that you’re using a proxy, making this the least suitable for web scraping.

Anonymous (Level 2) proxies are better because they will not report your real IP to the site. Instead, they either set the X-Forwarded-For header to the proxy’s IP or leave it blank.

However, they will still set the Via header, thereby advertising themselves as proxies.

Elite (Level 1) proxies are the best for avoiding getting blocked. They don’t set either of the headers above, and they also remove other headers (such as From, Proxy-Authorization, and Proxy-Connection) that could expose them as proxies.
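To check which level a given proxy falls into, you could send a request through it to an endpoint you control that echoes back the headers it received, then inspect them. The function below is a rough sketch of that check; it looks only at the X-Forwarded-For and Via headers described above, and the function name and return labels are my own.

```python
def classify_anonymity(headers, real_ip):
    """Classify a proxy from the headers the target received through it."""
    # HTTP header names are case-insensitive, so normalize them first.
    seen = {name.lower(): value for name, value in headers.items()}
    # X-Forwarded-For may hold a comma-separated list of addresses.
    forwarded = [
        part.strip()
        for part in seen.get("x-forwarded-for", "").split(",")
        if part.strip()
    ]
    if real_ip in forwarded:
        return "transparent"  # the real IP is leaked
    if "via" in seen or forwarded:
        return "anonymous"    # real IP hidden, but proxy use is visible
    return "elite"            # neither header gives the proxy away
```

For example, `classify_anonymity({"Via": "1.1 proxy"}, "198.51.100.7")` is classified as anonymous, because the Via header is present but the real IP is not exposed.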

Note, however, that even elite proxies can be detected in some cases. Major sites maintain large lists of blacklisted IPs and they will check your proxy against those lists. They could also check whether the port your proxy uses is a typical proxy port, like 8080 or 3128.

By how the IP is assigned

The second way to categorize proxies is how they get their IP addresses. We have Datacenter, Residential, and Mobile proxies. Let’s see the pros and cons of each.

Datacenter proxies get their IPs from datacenters, usually maintained by large cloud providers such as Amazon AWS, Microsoft Azure, or Google Cloud.

Pros

  • Cheap
  • Usually fast (datacenters have plenty of bandwidth)
  • Come in large bulks

Cons

  • Blacklisted by many sites
  • If they are in the same subnet, they could easily be detected
  • Aren’t always secure
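To see how concentrated a list of datacenter proxies is, you can group the IPs by subnet. The sketch below counts addresses per /24 network with the standard ipaddress module; the /24 prefix is an assumption for illustration (sites may block at other granularities), and the IPs are placeholders.

```python
import ipaddress
from collections import Counter


def count_by_subnet(ips, prefix=24):
    """Count how many proxy IPs fall into each /prefix IPv4 network."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us pass a host address and get its network back.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts


proxy_ips = ["203.0.113.5", "203.0.113.77", "203.0.113.200", "198.51.100.9"]
subnets = count_by_subnet(proxy_ips)
```

If most of your pool lands in a handful of subnets, a single range-level block on the target site could take out many proxies at once.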

Residential proxies are located in people’s homes. They get their IPs from the Internet Service Provider (ISP) of the customer. Customers then allow proxy sellers to use their IPs (through plug-ins) in exchange for some compensation (money, access to an app or a service, etc.).

Pros

  • Rarely blocked by sites (high success rate)
  • Their locations vary (ideal for geo-targeted content)
  • Usually more secure

Cons

  • Expensive
  • Slower than datacenter proxies

Finally, mobile proxies are portable devices (such as smartphones or tablets) connected to the Internet via mobile data. Their IPs are allocated by mobile carriers.

These devices are owned by real people, so just like in the case of residential proxies, the owners usually install an app on their devices to offer their bandwidth to the proxy network, and are compensated for it.

Pros

  • Almost never blocked (extremely high success rate)
  • Their locations vary (ideal for geo-targeted content)

Cons

  • Usually slow
  • Very expensive
  • Mostly only suitable for scraping pages shown on mobile devices

Shared vs Dedicated

Most proxy providers offer both shared and dedicated proxies. Deciding on which one to use depends on your budget and the complexity of your project.

Shared proxies are used by multiple users. This can be a problem because there is a higher chance that some of them are blacklisted by popular sites since other people might also scrape the same sites. Of course, they are also cheaper.

Dedicated (or private) proxies, on the other hand, are allocated exclusively to a single user. They mitigate the problem above but they also cost more.

By the protocol used

Lastly, proxies can be categorized by the protocol they use. This can be HTTP, HTTPS, or SOCKS.

An HTTP proxy takes a plain-text request from a client, creates a separate request to the destination server, gets its response, and then returns it to the client. The biggest problem with this approach is security. A malicious proxy could alter the response and inject ads or a script to steal cookies from the client, for instance.

An HTTPS proxy works differently. It takes a special CONNECT request from the client and will create an HTTP tunnel between it and the destination server. This means that after the connection is established, it will just route all raw TCP traffic back and forth between the client and server, making this protocol much more secure and ideal for web scraping.
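For illustration, this is roughly what the CONNECT request looks like on the wire. The helper below only builds the bytes; it does not open a connection, and the function name is hypothetical. After the proxy answers with a 2xx status, the client would start its TLS handshake with the destination through the tunnel.

```python
def build_connect_request(host, port=443):
    """Return the bytes of an HTTP/1.1 CONNECT request for host:port."""
    target = f"{host}:{port}"
    lines = [
        f"CONNECT {target} HTTP/1.1",
        f"Host: {target}",
        "",  # the empty line that ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request_bytes = build_connect_request("example.com")
```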

SOCKS (SOCKet Secure) is a lower-level protocol than HTTP. It just relays any TCP traffic between the client and the server, which makes it faster and more flexible. As such, it is ideal for high-traffic applications such as video streaming or P2P networks.

Security is a concern here as well. SOCKS itself does not encrypt traffic, so unless the payload is already encrypted (for example, HTTPS), a malicious proxy could sniff the traffic between client and server and steal sensitive data.

SOCKS5 also supports UDP as the underlying transport protocol (in addition to TCP), and also provides authentication options.
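To make the protocol a bit more concrete, here is a sketch of the first two messages a SOCKS5 client sends, following the layout in RFC 1928: a greeting listing the authentication methods it supports, and a CONNECT request for a domain name. The helper names are my own, and the code only builds the bytes without talking to a server.

```python
import struct

SOCKS_VERSION = 5
NO_AUTH = 0x00
USERNAME_PASSWORD = 0x02
CMD_CONNECT = 0x01
ATYP_DOMAIN = 0x03


def build_greeting(methods=(NO_AUTH, USERNAME_PASSWORD)):
    """Client greeting: version, number of auth methods, then the methods."""
    return bytes([SOCKS_VERSION, len(methods), *methods])


def build_connect(host, port):
    """CONNECT request addressed to a domain name (at most 255 bytes long)."""
    name = host.encode("ascii")
    header = bytes([SOCKS_VERSION, CMD_CONNECT, 0x00, ATYP_DOMAIN, len(name)])
    # The port is sent as two bytes in network (big-endian) order.
    return header + name + struct.pack("!H", port)


greeting = build_greeting()
connect = build_connect("example.com", 443)
```

In practice you would rarely write this by hand, since most HTTP clients and SOCKS libraries handle the handshake for you.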

Evaluating your proxy needs

To evaluate whether you need to use proxies in your scraping project (and which type), follow these steps:

  1. Check if the site uses geo-targeting. If they limit traffic to a certain region and you’re outside of it, you’ll get blocked so you need to use proxies.
  2. Decide if you can go slow. If your scraping project is low-scale and not time-critical, and you can crawl one page say every 5 seconds, you might not even need proxies. This will depend on the target site, of course. If you run into 429 (Too Many Requests) errors, you might need to slow down even more or rely on proxies.
  3. Try datacenter proxies. Ok, so you can’t go slow. In this case, try datacenter proxies first as they cost less and see whether you get blocked.
  4. Use residential proxies as a last resort. If nothing works, go with residential proxies. These will cost more, but sometimes there’s just no other option.
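The steps above can be condensed into a small decision helper. This is just one reading of the checklist, with hypothetical parameter names; in practice you find out whether datacenter proxies get blocked by trying them.

```python
def recommend_proxy(geo_restricted, can_go_slow, datacenter_blocked):
    """Mirror the four steps above and return a suggested starting point."""
    if not geo_restricted and can_go_slow:
        return "none"         # step 2: a slow crawl from your own IP may do
    if not datacenter_blocked:
        return "datacenter"   # step 3: the cheaper option, tried first
    return "residential"      # step 4: last resort


suggestion = recommend_proxy(
    geo_restricted=True, can_go_slow=True, datacenter_blocked=False
)
```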

Top proxy providers

Now that we’ve covered the basics of proxies, let me list a few providers that I think are the best on the market.

BrightData

With a huge pool of over 72 million residential proxies and 700,000 datacenter proxies, BrightData (formerly known as Luminati Networks) is one of the leading proxy providers in the world. They offer a pay-as-you-go model for occasional users as well as monthly subscriptions for people with recurring data needs.

For the pay-as-you-go option, you are charged $0.90/IP + $0.12/GB for datacenter proxies and $25/GB for residential proxies each month.

Subscription plans start from $500/month (for $0.60/IP + $0.095/GB) for datacenter proxies and $300/month (for $15.00/GB) for residential proxies, both of which are very affordable if your project is data-heavy.
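To get a feel for when the subscription pays off, here is a quick comparison for residential proxies using the prices quoted above. It assumes the $300 plan price acts as a minimum monthly charge, with usage billed at $15/GB beyond what that covers; that reading is my assumption, so check the provider’s current terms before relying on it.

```python
def residential_costs(gb_per_month):
    """Return (pay_as_you_go, subscription) monthly cost in USD."""
    pay_as_you_go = 25.0 * gb_per_month
    # Assumption: the $300 plan price is a minimum monthly charge.
    subscription = max(300.0, 15.0 * gb_per_month)
    return pay_as_you_go, subscription


for gb in (5, 12, 40):
    payg, sub = residential_costs(gb)
    cheaper = "pay-as-you-go" if payg < sub else "subscription"
    print(f"{gb} GB: ${payg:.2f} vs ${sub:.2f} -> {cheaper}")
```

Under these assumptions, the break-even point is 12 GB per month ($300 / $25 per GB); below that, pay-as-you-go is cheaper.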

We at DataGrab rely on BrightData as our proxy provider and are very satisfied with their service and customer support.

Oxylabs

Oxylabs is another major player in the industry with a pool of 100 million residential IPs and 2 million datacenter IPs. They charge you per IP and only offer subscription plans (no pay-as-you-go option), but since your bandwidth usage is unlimited, this might be a perfect fit if your project is large-scale. Pricing starts from $180/month for datacenter plans and $300/month for residential plans.

SmartProxy

SmartProxy is another big provider with over 40 million residential IPs across the world and 40,000 datacenter IPs in the US. They also only offer monthly subscriptions and charge you per GB of traffic. Pricing starts from $50/month for datacenter plans and $75/month for residential plans.

For a more detailed comparison between these three, check this article on Best Proxy Reviews.

Free proxy sites

There are many sites that offer a limited number (about 100) of free proxies. To do that, they usually scan a huge number of IPs for popular proxy ports like 8080 and 3128 and publish a small portion of them for free.

While you might think that free is great, relying on these proxies can be a huge security risk. You never know which one has malware installed and which could infect your computer or steal information.

Never use them to scrape sensitive data!

Another problem is that since they are public, they are heavily used and therefore are mostly blacklisted. They also tend to be slow and unreliable.

That being said, here are a few of them I have used in hobby projects with decent success:

Conclusion

We covered what proxies are, why we use them, and some of the ways they can be categorized.

Depending on your project, you might not even have to rely on them. For hobby projects with a small number of pages scraped, you can usually avoid them, whereas, for a large-scale enterprise project, they are crucial.

There are many proxy providers; I listed a few of them that I think are very good.

Thanks for reading this article. I hope you found it useful. :)

Which proxy provider do you prefer? What types of proxies do you use the most?