What is DataGrab?
DataGrab is a point-and-click web scraping service designed to be easy to use but still flexible enough to support a variety of use cases for web data extraction. These include lead generation, data analytics, content aggregation, competitor monitoring, artificial intelligence, real estate and more.
DataGrab consists of a Chrome extension for setting up scrapers and a web application for managing them.
Installing the Chrome Extension
To get started, install the Chrome extension by following these steps:
- Go to the DataGrab page in the Chrome Web Store.
- Click Add to Chrome.
- Review the requested permissions, then click Add extension. This will install it.
- You can now use the extension by clicking the icon to the right of the address bar.
- You may want to pin it to the toolbar by clicking the Extensions (puzzle piece) icon, then clicking the pin icon next to DataGrab.
Setting up Your Account
If your account is still pending and you did not receive the confirmation email, you can request that it be resent by clicking the RESEND MAIL button on the sign-in page.
If you forget your password, follow the forgot password link on the sign-in page. Then enter the email address you used when you signed up and click SEND LINK. You'll get an email with a link to reset your password. Follow that link, enter your new password twice, then click the RESET PASSWORD button.
Setting up Scrapers
Scrapers are automated processes configured to extract data from web pages. They are the workhorses of the system. To set one up, open Chrome and navigate to the page you want to scrape, then open the extension by clicking the icon to the right of the address bar.
The extension's interface is divided into three tabs:
- Setup - for setting up scrapers
- Preview - for a live preview of data scraped from the current page for the active template
- Datasets - for managing the data scraped in the browser
A template is a named configuration to be applied to a page. It has a setup URL, the fields to be extracted, and the pagination settings.
The root template is always present and is applied to the starting URL of the scraping session. You can't delete or rename it, but you can change its setup URL if you wish.
In order to scrape the detail pages of a listing, you'd want to create a new template to be applied to detail pages (see the Fields section below). You can do that by clicking the NEW button and filling out the form. After submitting it, the extension will navigate to the setup URL you've given unless you uncheck the Navigate to page option.
A field stores a piece of data extracted from a web page, under a certain name. To add a new one, click the ADD button in the Fields section. This opens the field form.
Give it a name, then set its type. Currently, DataGrab supports the following three field types:
- Text
- Link
- Detail Page
As the core data type of the system, text fields are used for extracting textual data. You can select any element on the web page for them. If you select a container element (e.g. a div) having multiple textual elements (like paragraphs), DataGrab will concatenate their texts. You may also choose what to extract: the text content or the value of a certain attribute.
Link fields extract the absolute URLs of links, given by the href attribute of the anchor element. In case the URL is relative, it will be expanded with the base URL of the current page. So if you are scraping the site "https://mydomain.com" and the href attribute of the anchor element is "/employees/123", the extracted value would be "https://mydomain.com/employees/123".
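The URL expansion described above follows standard relative-URL resolution. As a minimal sketch (not DataGrab's own code), Python's standard library does the same thing; the helper name resolve_link is hypothetical and used only for illustration:

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Expand a possibly relative href against the URL of the page it was found on."""
    return urljoin(page_url, href)


# Relative href: expanded with the base URL of the current page.
print(resolve_link("https://mydomain.com", "/employees/123"))
# https://mydomain.com/employees/123

# Absolute href: returned unchanged.
print(resolve_link("https://mydomain.com/team/", "https://other.example/about"))
# https://other.example/about
```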
Detail page fields allow you to link the data scraped from detail pages to their listing pages. For example, if you scrape a product listing, some information about the product you'd want to scrape (description, reviews, etc.) is usually found on its detail page, which is linked from the listing page. In most cases, those detail pages have the same structure so the same template could be applied to them.
After you set the field's type, click the page elements you want to extract data from. Elements that you hover over are highlighted with a green border, selected elements are filled with green, and similar elements are highlighted with a thick yellow border.
DataGrab uses CSS selectors for identifying elements on a web page. Once you select an element, the form will display its selector. If you select multiple elements, DataGrab will try to detect similar ones that would form a column of a table. It does this by generating a selector that is general enough to include all selected elements and possibly others.
Most of the time, this selector is quite robust and will extract the correct data. However, if you're not satisfied with it, you can tweak it by editing the selector manually. Matching elements are instantly highlighted on the page. For a comprehensive guide to selector syntax, see CSS Selectors on W3Schools.
Along with the CSS selector, the form also displays the number of selected elements, the number of similar elements, or (if you edited the selector manually) the total number of matching elements.
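To give a feel for what "general enough to include all selected elements" can mean, here is a rough sketch of one possible generalization strategy. It is an assumption made for illustration, not DataGrab's actual algorithm: the function generalize and the sample selectors are hypothetical, and it only handles selectors of equal depth that differ by :nth-child positions.

```python
import re


def generalize(selectors: list[str]) -> str:
    """Build one selector matching all inputs by dropping positional
    :nth-child(...) filters from the steps where the inputs disagree."""
    split = [s.split(" > ") for s in selectors]
    if len({len(parts) for parts in split}) != 1:
        raise ValueError("selectors must have the same depth")
    general = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            general.append(steps[0])  # identical step: keep as-is
        else:
            stripped = {re.sub(r":nth-child\(\d+\)", "", s) for s in steps}
            if len(stripped) != 1:
                raise ValueError("steps differ by more than position")
            general.append(stripped.pop())  # same step, different position
    return " > ".join(general)


# Two elements picked from the first and third rows of a listing.
picked = [
    "ul.products > li:nth-child(1) > a.title",
    "ul.products > li:nth-child(3) > a.title",
]
print(generalize(picked))  # ul.products > li > a.title
```

The resulting selector also matches the second row, which was never clicked. That is the kind of "similar element" the form counts separately from the selected ones.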
Finally, submit the form by clicking the ADD button. Afterwards, you can edit or delete the field by clicking the corresponding button next to it.
Most websites that store a moderate amount of data use some pagination technique for displaying it in batches. DataGrab supports the following pagination methods:
- Next link(s)
- Infinite scrolling
- "Load more" button
Next links are the traditional method used by most sites. Data is divided across multiple pages, with navigation controls for going to the previous or next page displayed at the bottom of each page.
To set this up, click the button next to the selector. This opens the selector picker form. Now click the page elements you want to select. Just like in the case of fields, the form reports the number of selected and similar elements. You can also tweak the selector if you wish.
The pagination controls on most sites include some kind of left and right arrow links pointing to the previous and next pages of the listing, respectively. In these cases, you can just select the link to the next page. For sites that don't have them, you can select multiple elements to generate a more robust selector that includes all page links. Finally, click SAVE to set the link(s) and close the form.
You can also set a limit to the number of pages you want to scrape, or go until reaching the end of the series.
To test if pagination works correctly with your current settings, click the TEST PAGINATION button. This will open a new window with the current page loaded. Click the NEXT PAGE button to go to the next page in the series.
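The interaction between the next link and the page limit can be pictured as a simple loop. The sketch below is only an illustration under assumed names: PAGES is a hypothetical in-memory stand-in for a paginated site, and scrape is not part of DataGrab.

```python
from typing import Optional

# Hypothetical stand-in for a site: each URL maps to its rows and its "next" link.
PAGES = {
    "/list?page=1": {"rows": ["a", "b"], "next": "/list?page=2"},
    "/list?page=2": {"rows": ["c", "d"], "next": "/list?page=3"},
    "/list?page=3": {"rows": ["e"], "next": None},
}


def scrape(start_url: str, page_limit: Optional[int] = None) -> list[str]:
    """Follow next links from start_url until the limit or the end of the series."""
    rows: list[str] = []
    url: Optional[str] = start_url
    visited = 0
    while url is not None and (page_limit is None or visited < page_limit):
        page = PAGES[url]
        rows.extend(page["rows"])
        visited += 1
        url = page["next"]  # None means there is no further page
    return rows


print(scrape("/list?page=1"))                # ['a', 'b', 'c', 'd', 'e']
print(scrape("/list?page=1", page_limit=2))  # ['a', 'b', 'c', 'd']
```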
Infinite scrolling is widely used on sites designed for mobile devices. It allows the user to scroll through lots of content with no apparent end: whenever you scroll down to the bottom of the page, more content is loaded.
You can set the maximum number of times to scroll, or go until reaching the bottom of the page.
You can also set the time interval (in milliseconds) between scrolls. To pick an appropriate value, study the page and estimate how long it takes to fully load after scrolling.
"Load more" Button
Some sites batch their data by displaying a "Load more" button somewhere near the bottom of the page and loading more data each time the button is clicked.
To set it up, click the button next to the selector. This opens the selector picker form. Now click the button (or anchor) element on the page. You can only select a single element.
You can also tweak the selector, but note that if multiple elements match the selector, only the first one is used.
You can set the maximum number of times to click, or go until the button does not show up on the page anymore.
You can also set the time interval (in milliseconds) between clicks. To pick an appropriate value, study the page and estimate how long it takes to fully load after clicking the "Load more" button.
Managing the Configuration
The Configuration section has three buttons:
- IMPORT allows you to import your configuration from a JSON file
- EXPORT allows you to export your configuration to a JSON file
- RESET will reset your configuration (purging all fields and resetting the pagination settings)
DataGrab gives you an instant preview of the scraped data (from the current page), which will update automatically on field changes.
You can download it by selecting the format (CSV or JSON) and clicking the DOWNLOAD button. See the Export Formats section below for more details on them.
Running the Scraper
Once you've set up your scraper, you can choose to run it:
- locally in your browser by clicking the RUN IN BROWSER button
- on the cloud platform (and manage it from there) by clicking the RUN IN CLOUD button
Both approaches have their advantages.
For simpler projects, you can run the scraper in your own browser. This allows you to scrape any page you can display in your browser. Just like when you're browsing normally, you can log in if the page is behind a login, or solve a CAPTCHA if one is presented to you. Also, since the scraper runs in your own browser session rather than as a remote bot, sites are much less likely to block you.
In this case, a new window is opened with the current page, and the scraper automatically goes through the listing if you set up pagination. The interface displays data about the session (its status, the current URL, start time, end time and the number of pages scraped). The table is constantly updated with new data.
After the scraper has finished, you can download your data by selecting the format (CSV or JSON) and clicking the DOWNLOAD button. See the Export Formats section below for more details on them.
If you wish to stop the scraper before it has finished, click the STOP button. All scraped data up to that point will be preserved.
For more complex projects with moderate to high data needs, you'd want to run your scrapers in the cloud at scale. Choosing this option also allows you to:
- Schedule your scrapers to run automatically every hour, day, week or month
- Get the data delivered automatically via email
- Retain your data for 7 days
The cloud service follows a usage-based pricing structure starting from $20/month, which includes 2,000 cloud requests each month.
Above that, per-request pricing applies based on the following tiers:
- $0.006 per request for requests 2,001 to 5,000
- $0.003 per request for requests 5,001 to 10,000
- $0.001 per request for requests 10,001 to 50,000
- $0.0006 per request for requests 50,001 to 200,000
- $0.0003 per request for requests 200,001 and above
So the cost for 65,000 requests, for example, is calculated as: $20 + 3,000 × $0.006 + 5,000 × $0.003 + 40,000 × $0.001 + 15,000 × $0.0006 = $102 in total.
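If you want to estimate the cost for other volumes, the tiers above translate into a short calculation. This is a sketch based only on the prices listed in this section; the names BASE_FEE, INCLUDED, TIERS and monthly_cost are made up for the example, and actual billing may differ (for instance in rounding).

```python
# The first 2,000 requests are covered by the $20 base fee.
BASE_FEE = 20.0
INCLUDED = 2_000

# Tiers as (upper bound of the tier in requests, price per request in USD).
TIERS = [
    (5_000, 0.006),
    (10_000, 0.003),
    (50_000, 0.001),
    (200_000, 0.0006),
    (float("inf"), 0.0003),
]


def monthly_cost(requests: int) -> float:
    """Estimate the monthly cost in USD for a given number of cloud requests."""
    cost = BASE_FEE
    lower = INCLUDED
    for upper, price in TIERS:
        if requests <= lower:
            break
        billable = min(requests, upper) - lower  # requests falling in this tier
        cost += billable * price
        lower = upper
    return round(cost, 2)


print(monthly_cost(65_000))  # 102.0
```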
To let you evaluate the service before committing to it, we offer a Free Trial of 2,000 cloud requests, valid for 14 days.
For all scraping sessions that you run in the browser, the data will be preserved in the browser's local storage until you delete it. You can view the available data sets by switching to the Datasets tab. Each entry in the table contains information about the scraping session. The last column allows you to export your data in the current format (selected above the table), and to delete it.
Managing Cloud Scrapers
If you uploaded your scraper to the cloud, you can manage it from the web application, which includes running it, downloading the extracted data, and updating its settings.
The Scrapers List
The Scrapers page is the default page you arrive on after you log in. The table displays the list of scrapers created, along with their statuses (Idle, Running, Completed, etc.) and some additional information. While a scraper is running, the numbers of scraped and failed pages are displayed in green and red badges, respectively, inside the status chip.
You can sort the data by any column by clicking the column's name. The up or down arrow displayed after the name indicates whether the data is currently sorted in ascending or descending order.
The button in the last column opens a menu that allows you to:
- start/stop the scraper
- delete the scraper
Editing a Scraper's Settings
Clicking the scraper's name in the list takes you to its detail page. The top section displays basic information about the scraper.
You can run the scraper by clicking the START SCRAPER button, or export its configuration to a JSON file by clicking the EXPORT CONFIG button.
Below that is the Settings section.
Here you can change its name, page limit, scheduling and data delivery configuration, then click the UPDATE SETTINGS button to save the changes.
Finally, the page displays the scraper's run history.
You can export and download the data of a given run by clicking the button next to it and choosing the format (CSV or JSON).
Data Retention Policy
You can request data to be automatically delivered to you via email. However, emails may get marked as spam or not delivered at all, depending on a variety of factors including the mail server's configuration. For this reason, data is retained for a period of time, allowing you to download it manually.
The retention period (currently 7 days) is subject to change, but only within reasonable limits.
DataGrab supports two formats for exporting scraped data: CSV and JSON.
Let's briefly look at them, discussing their pros and cons.
CSV stands for Comma-Separated Values. It is a text format used for exchanging tabular data. Each line in the file is a data record with its fields (columns) separated by commas. If a value itself contains commas, it is enclosed in quotes (usually double quotes).
Although RFC 4180 proposes a specification for the format, it is not fully standardized, so there are many variations used in practice.
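The quoting rule is easiest to see on a small example. The sketch below uses Python's standard csv module with made-up sample rows; it shows how a typical CSV writer behaves and is not a statement about the exact bytes DataGrab produces.

```python
import csv
import io

rows = [
    ["name", "address"],
    ["Acme", "12 Main St, Springfield"],  # this value contains a comma
]

# Write the rows to an in-memory buffer instead of a file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)
# name,address
# Acme,"12 Main St, Springfield"

# Reading it back restores the original values, comma included.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[1][1])  # 12 Main St, Springfield
```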
Pros:
- Easily readable by humans, easily parsable by computers
- Very compact, so it takes up minimal space
- Widely supported by databases and DB management tools (like MySQL Workbench, phpMyAdmin) for importing and exporting data
- Efficient, as data can be loaded incrementally (no need to load the whole file in memory at once)
Cons:
- Impractical for hierarchical data
- Has no data types (like numbers, booleans, etc.), works only with text
- Poor support for special characters
JSON stands for JavaScript Object Notation. It is a text format that represents data as nested objects (key-value pairs) and arrays.
Pros:
- Supports hierarchical data
- Widely supported and considered the de facto format used by modern APIs
Cons:
- Inefficient for huge data sets, as the whole data set needs to be loaded in memory at once (though there are ways around this, like JSON Lines)
- More verbose than CSV, though much more compact than XML
- No way to add comments or attribute tags, limiting the ability to annotate data structures or provide metadata (contrary to XML)
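To illustrate the JSON Lines workaround mentioned above, the following sketch contrasts a regular JSON array with the same records written one per line. The sample records and the helper iter_jsonl are hypothetical; converting an export to JSON Lines is something you would do yourself after downloading.

```python
import io
import json

records = [
    {"name": "Widget", "tags": ["new", "sale"]},
    {"name": "Gadget", "tags": []},
]

# Plain JSON: a single array that a standard parser reads as a whole.
as_json = json.dumps(records)

# JSON Lines: one JSON object per line, so records can be read one at a time.
as_jsonl = "\n".join(json.dumps(r) for r in records) + "\n"


def iter_jsonl(stream):
    """Yield one parsed record per non-empty line of the stream."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


names = [rec["name"] for rec in iter_jsonl(io.StringIO(as_jsonl))]
print(names)  # ['Widget', 'Gadget']
```

Note that nested values such as the tags list survive in both forms, which is the hierarchical-data advantage over CSV.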
DataGrab is constantly being developed and improved. However, making it better is not possible without your feedback. So if you have something to propose, or perhaps a problem to report, please do not hesitate to do so by clicking the Feedback button (with its label oriented vertically) on the right and completing the feedback form.
Each piece of feedback falls into one of the following types:
- Bug Report - if you discovered a problem, like something broke in the system, or something doesn't work the way it should
- Comment - a general comment / observation about the tool
- Feature Request - if you have an idea that you think could improve the product
We'll follow up on all feedback via email.