What is web scraping and how does it work?
Websites are loaded with valuable data: product information, reviews, stock prices, company contacts, sports stats, weather information, and so on.
Getting all that data, however, can be tedious. You could copy and paste it by hand, but that takes a lot of time and is error-prone. What if you could automate all this?
This is exactly where web scraping comes in.
What is web scraping?
Web scraping is the automated process of extracting data from web pages and storing it in a structured format.
Some sites offer APIs to get the data in a structured format (usually JSON). For sites that don’t, however, web scraping is the only option.
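To see the difference, here's a quick sketch in Python using the requests library (GitHub's public API is just a convenient example of a site that serves structured JSON):

```python
import requests

# A site with an API hands you structured data directly:
repo = requests.get("https://api.github.com/repos/python/cpython").json()
print(repo["stargazers_count"])  # a plain integer, ready to use

# A site without one only gives you raw HTML, which you must parse yourself:
html = requests.get("https://example.com").text
print(html[:100])  # just markup; this is where web scraping comes in
```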
Why scrape data?
There are many use cases for web scraping, and almost every business can benefit from it in some way. Let's look at a few examples.
Job aggregation
Most companies have career pages that list all currently open positions. Constantly checking them, however, is tedious.
This is where job aggregators can save a lot of time. They scrape job offers from multiple sources and usually provide advanced filtering options so people seeking jobs can find the ideal offers. In fact, I built such an aggregator to showcase the power of web scraping. It focuses on remote programming jobs.
Lead generation
After defining the ideal customer persona, businesses can scrape company directories for industry, address, contact information, and more. They can then launch a cold outreach campaign with their offering to get potential customers.
Price comparison
Both consumers and businesses benefit from price comparison tools. Businesses scrape competitors’ prices so they can adjust theirs accordingly. Consumers use these tools to find the best deals for products they want to buy.
Real estate
Web scraping has many applications in the real estate industry. Agencies (or even regular folks) can use scraped data to predict market trends and decide when to buy, sell, or rent properties to maximize profit.
Brand monitoring
Businesses use web scraping to monitor social media for posts mentioning their brand. Once the data is collected, sentiment analysis can flag negative reviews and comments, and customer support can then reach out to those people to clarify the issues.
Machine learning
Web scraping is perfect for supplying data to a machine learning model. A social media tool, for instance, could scrape posts/tweets, along with additional information about them (post date, likes, comments, etc.) and train a model to predict how well a new post will perform.
SEO
SEO (Search Engine Optimization) is all about monitoring and improving your site’s Google ranking.
Modern tools such as Ahrefs or Semrush crawl billions of pages regularly and provide insights such as how popular a keyword is, how well your page performs for a certain keyword, how much traffic you and your competitors get, how many backlinks a site has, and much more. They are an indispensable part of a content marketer's toolset.
These were just a few areas where web scraping can be applied with great success. There are many more applications, of course.
How does web scraping work?
A typical scraper performs the following steps (sketched in code below):
- Fetches the contents of a web page
- Extracts data from elements identified by CSS selectors or XPath
- Identifies and enqueues other URLs to visit
- Dequeues and visits the first URL from the queue
- Repeats the process until there are no more URLs to visit
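Here's what that loop might look like in Python with requests and BeautifulSoup, using https://quotes.toscrape.com/, a public sandbox site built for scraping practice:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = "https://quotes.toscrape.com/"
queue = [start_url]
seen = {start_url}
results = []

while queue:
    url = queue.pop(0)                      # dequeue the next URL
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Extract data from elements identified by CSS selectors
    for quote in soup.select("div.quote"):
        results.append({
            "text": quote.select_one("span.text").get_text(strip=True),
            "author": quote.select_one("small.author").get_text(strip=True),
        })

    # Identify and enqueue further URLs (here: the "next page" link)
    for link in soup.select("li.next a"):
        next_url = urljoin(url, link["href"])
        if next_url not in seen:
            seen.add(next_url)
            queue.append(next_url)

print(f"Scraped {len(results)} quotes")
```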
Crawlers and Scrapers
Sometimes the terms crawler and scraper are used interchangeably. But they actually mean different things.
A crawler focuses on discovering URLs and getting their contents. It takes one or more starting URLs, visits them, and follows the links on those pages according to a set of rules (URL patterns to follow or ignore, how deep to go, etc.). Crawlers such as Googlebot or AhrefsBot index pages and extract generic information (keywords, links to other pages, etc.) from them.
A scraper, on the other hand, focuses on extracting data in a structured format. It takes a configuration that defines what to extract (elements identified by CSS selectors or XPath) and how to paginate (next page link, infinite scroll, etc.). The data is then stored in a structured format such as CSV, JSON, or even a relational database.
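For instance, once a scraper has produced a list of records, writing them out as CSV or JSON takes only a few lines of standard-library Python (the sample record here is made up):

```python
import csv
import json

records = [{"author": "Albert Einstein", "quote": "A made-up sample quote."}]

# CSV: one line per record, handy for spreadsheets
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "quote"])
    writer.writeheader()
    writer.writerows(records)

# JSON: preserves nesting if the records aren't flat
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```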
Challenges
Web scraping is very powerful, but it’s not without challenges.
Fragile element selectors
Sometimes it can be hard to come up with CSS selectors that are robust enough to extract the correct data across multiple pages. Websites built with technologies such as Tailwind CSS or styled-components reuse the same utility classes for many elements on a page, making it harder to uniquely identify them.
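A small illustration with made-up markup: utility classes match too many elements, so anchoring on structure (or on semantic attributes such as data-*, when the site provides them) tends to be more robust:

```python
from bs4 import BeautifulSoup

html = """
<div class="flex p-4 text-sm">
  <span class="p-4 text-sm">$19.99</span>
  <span class="p-4 text-sm">In stock</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Brittle: the utility classes match both spans, so you may grab the wrong one
print(soup.select("span.p-4.text-sm"))                          # two results

# More robust: anchor on the element's position within its parent
print(soup.select_one("div > span:nth-of-type(1)").get_text())  # "$19.99"
```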
Getting blocked
Nobody likes having their site scraped. Abusive scraping can severely drain server resources and devalue the business. So, to avoid crashes and protect sensitive data, many sites leverage anti-scraping services such as Cloudflare to block scrapers.
These services use several techniques, such as displaying CAPTCHAs, setting up honeypots, or just flat-out blocking suspicious IPs.
They do a very good job of protecting sites, but on the flip side, they can worsen the browsing experience of legitimate users. So there is always a trade-off.
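If you're on the scraping side, the best way to stay off these blocklists is to scrape politely: identify yourself, honor robots.txt, and throttle your requests. A minimal sketch with Python's standard library (all URLs here are placeholders):

```python
import time
import requests
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

# Identify your bot honestly instead of faking a browser
headers = {"User-Agent": "my-scraper/1.0 (contact: you@example.com)"}

for url in ["https://example.com/a", "https://example.com/b"]:
    if not robots.can_fetch(headers["User-Agent"], url):
        continue              # the site asked crawlers to skip this URL
    requests.get(url, headers=headers, timeout=10)
    time.sleep(2)             # throttle so you don't drain server resources
```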
Large-scale scraping is complex
Finally, some scraping projects rely on a huge amount of data. For them, a single page scraped every few seconds doesn’t cut it. They need multiple pages processed in parallel.
Since bombarding a site with lots of requests from a single IP address in a short interval is quickly picked up by bot detectors, these applications rely on a large pool of proxies and rotate them regularly to bypass these limits.
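Here's a sketch of what that looks like in Python: a thread pool fetches pages in parallel while each request goes out through the next proxy in a pool (the proxy addresses and page URLs are hypothetical; in practice, the proxies come from a proxy provider):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical proxy pool; real ones come from a proxy provider
proxy_pool = itertools.cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

def fetch(url):
    proxy = next(proxy_pool)  # each request leaves through a different IP
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

urls = [f"https://example.com/page/{i}" for i in range(1, 51)]  # placeholders
with ThreadPoolExecutor(max_workers=10) as pool:
    responses = list(pool.map(fetch, urls))
```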
How to get scraped data?
All right, now that we've explored the applications of web scraping and how it works, let's look at the options for actually getting scraped data.
Build it yourself
The first option is to implement the scraper yourself, which is a good approach if you're a developer and have the necessary time. There are many libraries and frameworks that make this easier. Depending on the programming language, these are the most widely used ones (a minimal example follows the list):
- Python: Scrapy, BeautifulSoup, Selenium
- Node.js: Cheerio, Puppeteer, Playwright
- Java: JSoup, Selenium
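To give a taste, here's a complete, minimal Scrapy spider that extracts quotes and follows the pagination link on https://quotes.toscrape.com/ (the same sandbox site as above):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each div.quote holds one record; yield it as structured data
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link until there is none
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running it with `scrapy runspider quotes_spider.py -o quotes.json` crawls every page and writes the results to a JSON file.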
Use visual tools
If you’re not technically savvy, you can also use tools such as DataGrab to set up scrapers without coding. DataGrab provides a Chrome extension that allows you to click and select elements that you want to extract, and map them into fields. It also supports several pagination methods, such as next link(s), infinite scrolling, and clicking a “Load more” button.
Once you set it up, you can test it out in your local browser, then upload it to the cloud service and run it there at scale.
These tools can save you a tremendous amount of time because they solve many of the problems you'd face if you implemented the scraper yourself: proxy management, concurrent requests, data storage, etc.
Some tools have full-blown browser automation capabilities that allow you to simulate any interaction a user would perform in the browser. These are more complex but they can be used for much more than web scraping.
Hire a professional
Some sites are hard to scrape using visual tools. The site's structure might be very particular, part of the data might be loaded asynchronously, and so on. In these cases, a custom script tailored to that particular site might need to be developed.
Hiring a professional developer is usually the best option here. There are sites like Fiverr or Upwork that have a large pool of talented freelancers.
If you need help in this area, I’ll gladly offer my services as well. Just fill out the contact form, or drop me an email at robert@datagrab.io.
Conclusion
Web scraping is a fascinating field with tremendous benefits. It has so many uses that the only limit is pretty much your imagination. There are several ways to go about it, depending on whether you’re a technical person and have time to build it yourself, or you’d rather rely on tools instead.
Web scraping comes with many challenges, and there is no silver bullet that addresses all of them. It will always be a cat-and-mouse game between site owners who want to prevent their sites from getting scraped and the people (such as developers, data scientists, marketers, or founders) who need that data for a variety of reasons.
So what scraping projects are you working on? :)