What is DataGrab
DataGrab is a web scraping service designed to be easy to use but still flexible enough to support a variety of use cases for web data extraction. These include data analytics, content aggregation, competitor monitoring, etc.
The tool is currently in beta, which means it's free to use for up to 5000 pages. All I ask in exchange is some feedback that helps me make it better.
Setting up Your Account
After you sign up for DataGrab, you'll get a confirmation email to activate your account. Follow the link in that email. Once your account is validated, you can sign in and start setting up scrapers.
If your account is still pending and you did not get the confirmation email, you can have it resent by clicking the RESEND MAIL button on the sign in page.
If you forget your password, follow the forgot password link on the sign in page. Then enter the email address you used when you signed up and click SEND LINK. You'll get an email with a link to reset your password. Follow that link, enter your new password twice, then click the RESET PASSWORD button to have it reset.
Setting up Scrapers
Scrapers are automated processes configured to extract data from web pages. They are the workhorses of the system. To set one up, go to the Scrapers page and follow these steps:
- Click CREATE (or CREATE A SCRAPER if you haven't set one up yet).
- Follow the setup wizard process, covered next.
Setting up a scraper has three simple steps:
- Choosing your starting URL
- Defining fields and pagination
- Configuring the scraper itself
Choosing the Starting URL
First, tell the scraper what URL it should start with. Usually this would be the main listing page of products, job offers, forum topics, and so on. Then click NEXT.
Defining Fields and Pagination
The second step is the heart of the setup process. This is where you define what to extract and what link(s) to follow. After the page you chose has loaded, you're ready to set them up. The setup interface is made up of the Toolbox, the Browser, and the Properties Panel.
Let's go over each of these components.
The Toolbox contains buttons (tools) representing the types of data you can extract. DataGrab currently supports the following tools:
- text
- date & time
- link
- detail page
Each tool is a toggle: the first click activates it, and a second click deactivates it. For more details about the tools, see the Fields section below.
The Browser is the largest area, in the center. This is where the web page is loaded and where you click on elements to select them. It has a minimum width of 1366 pixels so that the page doesn't shrink; otherwise, sites that use responsive techniques might display a tablet-optimized version, potentially leaving out some elements. The Browser is scrollable both horizontally and vertically.
To select elements on it, toggle the desired tool on the Toolbox, then click on the elements. Elements that you hover on are highlighted with a green border, selected elements are filled with green, and similar elements are highlighted with a thick yellow border.
When you toggle a tool, a movable Selector Widget will appear on the Browser.
DataGrab uses CSS selectors for identifying elements on a web page. Once you select an element, the widget will display its selector. If you select multiple elements, DataGrab will try to detect similar ones that would form a column of a table. It does this by generating a selector that is general enough to include all selected elements and possibly others.
Most of the time, this selector is quite robust and it will extract the correct data. However, if you're not satisfied with it, you can tweak it by editing the selector manually. Matching elements will be instantly reflected on both the Browser and the widget. For a comprehensive guide to CSS selector syntax, see CSS Selectors on W3Schools.
Along with the CSS selector, the widget also displays the number of selected elements, the number of similar elements, or (if you edited the selector manually) the total number of matching elements.
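To give a feel for how a selector can be made "general enough", here is a minimal Python sketch. It is not DataGrab's actual algorithm: the Step class and the generalize function are hypothetical names used only for illustration, and the sketch assumes that all selected elements sit at the same depth in the page. It keeps whatever the selected elements have in common (tag, shared classes, position) and drops the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One level of the path from the page root down to an element (hypothetical)."""
    tag: str
    classes: frozenset
    index: int  # 1-based position among siblings


def generalize(paths):
    """Merge several element paths into one CSS selector that matches all of them."""
    if not paths:
        raise ValueError("need at least one path")
    if len({len(p) for p in paths}) != 1:
        raise ValueError("this sketch only handles paths of equal depth")
    parts = []
    for steps in zip(*paths):
        tags = {s.tag for s in steps}
        part = tags.pop() if len(tags) == 1 else "*"
        # Keep only the classes that every selected element shares at this level.
        shared = frozenset.intersection(*(s.classes for s in steps))
        part += "".join("." + c for c in sorted(shared))
        # Keep the position only if it is the same for all selected elements.
        indexes = {s.index for s in steps}
        if len(indexes) == 1:
            part += f":nth-child({indexes.pop()})"
        parts.append(part)
    return " > ".join(parts)


first = [Step("ul", frozenset({"products"}), 1),
         Step("li", frozenset({"item"}), 1),
         Step("a", frozenset({"title"}), 1)]
third = [Step("ul", frozenset({"products"}), 1),
         Step("li", frozenset({"item", "featured"}), 3),
         Step("a", frozenset({"title"}), 1)]

selector = generalize([first, third])
# selector == "ul.products:nth-child(1) > li.item > a.title:nth-child(1)"
```

Because the two selected list items differ in position, the position is dropped at that level, so the resulting selector also matches the second, fourth, and any further items. These are the "similar elements" that would form a column.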
In case the widget blocks the elements you want to select, you can move it anywhere on the screen.
The Properties Panel displays context-dependent settings. If no tool is toggled, it shows the list of fields currently defined and the pagination configuration. Otherwise, it displays the tool's property sheet.
Let's look at the fields list and pagination settings.
The Fields List displays the list of fields currently defined and allows you to update it. When you select a field, the Browser will highlight the elements you selected for it and similar ones.
The small button toolbar at the top of the list allows you to edit the selected field, delete it, or move it up or down in the list.
If the website uses pagination for batching its data, you can set up the scraper to walk through all of those pages and extract the fields you defined. For that, follow these steps:
- Enable pagination by checking Paginated
- Click the icon under the Next Link(s) property to toggle selecting the link to the next page
- Select the element(s) on the Browser, then click the SET button. You can scroll to and highlight them by clicking the icon.
The pagination controls on most sites include some kind of left and right arrow links pointing to the previous and next pages of the listing, respectively. In these cases, you can just select the link to the next page.
For those sites that don't, you can select multiple elements to generate a more robust selector that will include all links.
The four buttons at the bottom of the page allow you to proceed to the next step, preview a sample of the resulting data, go back to the previous step, or cancel the setup altogether. To preview the data or proceed to the next step, you need to have at least one field defined and, if pagination is enabled, the pagination link set.
When you're done setting up your fields and pagination, click NEXT to proceed to the final step.
Configuring the Scraper
The last step of the wizard is configuring the scraper itself. This includes setting its name, page limit, scheduling and automatic delivery configuration.
The page limit sets the maximum number of pages that will be processed in a scraping run. If this is unlimited, processing will continue until there are no more pages to process for the current setup.
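The interplay between pagination links and the page limit can be sketched in a few lines of Python. This is only an illustration under assumptions: crawl and get_next_links are hypothetical names, and the tiny in-memory site stands in for real pages. The loop stops either when there are no more pages to visit or when the page limit is reached, and it remembers pages it has already seen, which matters when several pagination links are selected and pages link back to each other.

```python
from collections import deque


def crawl(start_url, get_next_links, page_limit=None):
    """Visit pages in order until none remain or the page limit is reached."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and (page_limit is None or len(visited) < page_limit):
        url = queue.popleft()
        visited.append(url)  # a real scraper would extract the fields here
        for link in get_next_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# A made-up three-page listing where every page links to its neighbours.
site = {"page1": ["page2"], "page2": ["page1", "page3"], "page3": ["page2"]}

all_pages = crawl("page1", lambda url: site[url])
limited = crawl("page1", lambda url: site[url], page_limit=2)
# all_pages == ["page1", "page2", "page3"]
# limited == ["page1", "page2"]
```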
If you want to run your scrapers regularly (for instance, to monitor data changes), you can schedule them. DataGrab currently supports hourly, daily, weekly or monthly scheduling.
To have the scraped data delivered to you by email automatically after each run, enable automatic delivery, then set the email address and the desired format. For more information about the supported formats, see the Export Formats section below.
You can also choose to start the scraper right after submitting, by checking Start after submitting.
When you're ready, click CREATE (or CREATE & START) to submit.
Fields
A field stores a piece of data extracted from a web page under a given name. DataGrab supports the following field types:
- Text
- Date & Time
- Link
- Detail Page
Let's take a look at them in more detail.
Text Fields
As the core data type of the system, text fields are used for extracting textual data. You can select any element on the web page for them. If you select a container element (e.g. a div) that has multiple textual elements (like paragraphs), DataGrab will concatenate their texts.
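As a rough illustration of what concatenating the texts of a container means, here is a small Python sketch built on the standard html.parser module. The TextCollector class and the element_text function are hypothetical, and joining the pieces with a single space is an assumption; the separator DataGrab actually uses may differ.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every non-empty piece of text found inside the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def element_text(html):
    """Return the texts of all nested elements, joined with a space (assumed)."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


text = element_text("<div><p>Solid oak desk.</p><p>Ships in 3 days.</p></div>")
# text == "Solid oak desk. Ships in 3 days."
```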
Date & Time Fields
Date & Time fields are used for extracting date and time information. You can select any element on the web page for them. You can also choose between storing just the date or both date and time (storing only the time is not currently supported).
On web pages, dates and times are given as absolute values (for example, "2020-07-17, 14:38"), relative values (like "2 hours ago"), or a combination of the two ("yesterday at 15:10").
DataGrab tries to detect the format the page uses and extract the date/time from it as an absolute value. After all, storing relative dates textually as they are ("2 hours ago") wouldn't make much sense as the reference date is lost. If you know the date/time pattern used by the site, you can set its pattern string in the format property. DataGrab supports the following letters for it:
- yy, yyyy - Year (e.g. 96, or 1996, depending on the number of y's in the pattern)
- M, MM, MMM, MMMM - Month in year (e.g. 1, 01, Jan, or January, depending on the number of M's in the pattern)
- d, dd - Day in month (e.g. 2, or 02, depending on the number of d's in the pattern)
- a - AM/PM
- H, HH - Hour in day, between [0,23] (e.g. 1, or 01, depending on the number of H's in the pattern)
- k, kk - Hour in day, between [1,24] (e.g. 7, or 07, depending on the number of k's in the pattern)
- K, KK - Hour in am/pm, between [0,11] (e.g. 7, or 07, depending on the number of K's in the pattern)
- h, hh - Hour in am/pm, between [1,12] (e.g. 7, or 07, depending on the number of h's in the pattern)
- mm - Minute in hour (e.g. 18)
- ss - Second in minute (e.g. 35)
- SSS - Milliseconds (e.g. 559)
NOTE: Specifying timezone offsets is not currently supported.
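If you want to check a pattern string before entering it, one option is to translate it into Python's strptime directives and try it on a sample value. The sketch below does that for most of the letters listed above. It is not how DataGrab parses dates; DIRECTIVES and to_strptime are hypothetical names, the k and K letters are left out because strptime has no direct equivalent, and month names (MMM, MMMM) and AM/PM markers are matched according to the current locale in Python.

```python
import re
from datetime import datetime

# Pattern letters from the list above mapped to strptime directives.
DIRECTIVES = {
    "yyyy": "%Y", "yy": "%y",
    "MMMM": "%B", "MMM": "%b", "MM": "%m", "M": "%m",
    "dd": "%d", "d": "%d",
    "a": "%p",
    "HH": "%H", "H": "%H",
    "hh": "%I", "h": "%I",
    "mm": "%M", "ss": "%S", "SSS": "%f",
}


def to_strptime(pattern):
    """Translate a pattern such as 'yyyy-MM-dd, HH:mm' into a strptime format."""
    def convert(match):
        token = match.group(0)
        if token not in DIRECTIVES:
            raise ValueError(f"unsupported pattern token: {token!r}")
        return DIRECTIVES[token]

    # Each run of one repeated letter is a token; everything else is kept as is.
    return re.sub(r"([A-Za-z])\1*", convert, pattern.replace("%", "%%"))


fmt = to_strptime("yyyy-MM-dd, HH:mm")
parsed = datetime.strptime("2020-07-17, 14:38", fmt)
# fmt == "%Y-%m-%d, %H:%M"
# parsed == datetime(2020, 7, 17, 14, 38)
```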
For relative date/times, DataGrab recognizes the following string literals:
- second ago, seconds ago
- minute ago, minutes ago
- hour ago, hours ago
- day ago, days ago
- week ago, weeks ago
- month ago, months ago
- year ago, years ago
In this case, absolute dates are calculated relative to the exact time data is extracted; "1 day ago" will result in subtracting one day from the current time, and so on. Comparing the text extracted from the page to the literals above is case-insensitive.
NOTE: Literals for future dates, like "tomorrow", are not currently supported.
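The conversion from a relative value to an absolute one can be sketched as follows. This is an illustration only, not DataGrab's code: resolve_relative is a hypothetical function, it assumes the text contains a number followed by one of the literals above, and the way it handles months and years (same day of the month, clamped to the month's length) is an assumption.

```python
import calendar
import re
from datetime import datetime, timedelta

UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400, "week": 604800}
RELATIVE = re.compile(
    r"(\d+)\s+(second|minute|hour|day|week|month|year)s?\s+ago", re.IGNORECASE
)


def resolve_relative(text, now):
    """Turn text like '2 hours ago' into an absolute datetime, or None if no match."""
    match = RELATIVE.search(text)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2).lower()
    if unit in UNIT_SECONDS:
        return now - timedelta(seconds=amount * UNIT_SECONDS[unit])
    # Months and years have no fixed length, so step back through the calendar.
    months = amount * (12 if unit == "year" else 1)
    year, month0 = divmod(now.year * 12 + (now.month - 1) - months, 12)
    month = month0 + 1
    day = min(now.day, calendar.monthrange(year, month)[1])
    return now.replace(year=year, month=month, day=day)


extracted_at = datetime(2020, 7, 17, 14, 38)
two_hours = resolve_relative("2 hours ago", extracted_at)
one_month = resolve_relative("1 Month Ago", extracted_at)
# two_hours == datetime(2020, 7, 17, 12, 38)
# one_month == datetime(2020, 6, 17, 14, 38)
```

Passing the extraction time in explicitly, rather than reading the clock inside the function, makes the result reproducible when you test it.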
Link Fields
Link fields extract the absolute URLs of links, given by the href attribute of the anchor element. In case the URL is relative, it will be expanded with the base URL of the current page. So if you are scraping the site "https://mydomain.com" and the href attribute of the anchor element is "/employees/123", the extracted value would be "https://mydomain.com/employees/123".
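The same expansion can be reproduced with urljoin from Python's standard library, which is handy for checking what value to expect. This only illustrates the rule; the second and third URLs are made-up examples, and DataGrab's own implementation is not shown here.

```python
from urllib.parse import urljoin

# A root-relative href is resolved against the site's origin.
from_root = urljoin("https://mydomain.com", "/employees/123")

# A path-relative href is resolved against the current page's directory.
from_page = urljoin("https://mydomain.com/employees/list?page=2", "123")

# An href that is already absolute is left untouched.
already_absolute = urljoin("https://mydomain.com", "https://other.example/jobs")

# from_root == "https://mydomain.com/employees/123"
# from_page == "https://mydomain.com/employees/123"
# already_absolute == "https://other.example/jobs"
```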
Detail Page Fields
Say you're scraping a product listing. Each product name is actually a link to its detail page, and you want to follow the link and extract some data from that page. This is what detail page fields are intended for.
Every detail page field uses a template that configures the fields to be extracted and the pagination setting. You can select the template from the dropdown or create a new one.
To add a detail page field with a new template, follow these steps:
- Toggle the detail page field tool.
- Select at least one link (anchor) element on the page.
- On the field's property sheet, click the icon below the dropdown.
- The New Template Dialog will be displayed. Give the new template a name, then click SETUP.
- The setup will load the page of the link you selected in the second step (or of the first one, if you selected more than one). Proceed with setting up fields and pagination for the new template.
As we are now working with multiple templates, we need a way to navigate between them. This is what the Template Selector dropdown, shown to the left of the URL, is for.
The two buttons on the right side of the dropdown allow you to rename the current template or delete it. Note that the main template cannot be deleted.
When you're done setting up the fields and pagination, go back to the main template by selecting it from the dropdown.
This will reload the page of the main template, along with its configuration. Here you can finish adding the detail page field by setting its name and clicking ADD.
Detail page templates can themselves be paginated. In practice, this allows you to scrape an entire forum, starting from its top listing page of topics, then visiting each topic and looping through the pages that contain the replies to it.
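One way to picture how templates nest is as a small tree, where a detail page field points to another template that has its own fields and pagination. The Python sketch below models the forum example in that way. All of it is hypothetical: the Template and Field classes, the walk function, and the selectors are invented for illustration and say nothing about how DataGrab stores its configuration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Template:
    name: str
    fields: List["Field"] = field(default_factory=list)
    next_link_selector: Optional[str] = None  # None means the template is not paginated


@dataclass
class Field:
    name: str
    selector: str
    detail_template: Optional[Template] = None  # set only for detail page fields


def walk(template: Template) -> Iterator[str]:
    """Yield template names in the order a scraper would enter them."""
    yield template.name
    for f in template.fields:
        if f.detail_template is not None:
            yield from walk(f.detail_template)


# Topic pages list replies and are paginated themselves.
topic = Template("topic", [Field("reply", "div.reply p")], next_link_selector="a.next")
# The main template lists topics and follows each topic link into the topic template.
main = Template(
    "main",
    [Field("title", "a.topic"), Field("thread", "a.topic", detail_template=topic)],
    next_link_selector="a.next",
)

order = list(walk(main))
# order == ["main", "topic"]
```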
Export Formats
DataGrab stores all scraped data internally for a certain period of time. Before it can be used, however, it needs to be exported. DataGrab supports three formats for that:
- CSV
- Excel (.xlsx)
- JSON
Let's briefly cover each of them, discussing their pros and cons.
CSV
CSV stands for Comma-Separated Values. It is a text format used for exchanging tabular data. Each line in the file is a data record with its fields (columns) separated by commas. If a value itself contains commas, it is enclosed in quotes (usually double quotes).
Although RFC 4180 proposes a specification for the format, it is not fully standardized, so there are many variations used in practice.
Pros
- Easily readable by humans, easily parsable by computers
- Very compact, so it takes up minimal space
- Widely supported by databases and DB management tools (like MySQL Workbench, phpMyAdmin) for importing and exporting data
- Efficient, as data can be loaded incrementally (no need to load the whole file in memory at once)
Cons
- Impractical for hierarchical data
- Has no data types (like numbers, booleans, etc.), works only with text
- Poor support for special characters
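Both the quoting rule and the incremental loading mentioned above can be seen with Python's standard csv module. The rows below are made-up sample data, and the exact quoting in a file exported by DataGrab may differ, since CSV variations exist.

```python
import csv
import io

rows = [
    ["name", "price"],
    ["Desk, oak", "199.00"],    # contains a comma, so it gets quoted
    ['19" monitor', "89.50"],   # contains a quote, which gets doubled
]

# Write the rows using the module's default (Excel-style) dialect.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
# text == 'name,price\r\n"Desk, oak",199.00\r\n"19"" monitor",89.50\r\n'

# Read the data back one record at a time; only the current row is held in memory.
names = []
for record in csv.reader(io.StringIO(text)):
    names.append(record[0])
# names == ["name", "Desk, oak", '19" monitor']
```

Note that every value comes back as text: "199.00" is a string, not a number, which is the lack of data types listed among the drawbacks.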
Excel
Excel is a format developed by Microsoft and used by their spreadsheet application, Microsoft Excel. Although it has several variations, including XML-based ones, DataGrab only supports .xlsx, which is used by modern versions of MS Office and is basically a ZIP archive of XML documents.
Pros
- Even though intended mostly for tabular data, it can store more complex data
- Supports worksheets for logically dividing the document into multiple parts
- Supports formatting of cells
- Applications usually provide a variety of functions for manipulating it (filtering, sorting, or aggregating data)
Cons
- Can only be opened by specialized applications like Microsoft Excel or Google Sheets
- Harder to manipulate programmatically, as the format is complex and usually requires a dedicated library
JSON
JSON (JavaScript Object Notation) is a lightweight text format for exchanging structured data.
Pros
- Supports hierarchical data
- Widely supported and considered the de facto format used by modern APIs
Cons
- Inefficient for huge data sets, as the whole data set needs to be loaded in memory at once (though there are ways around this, like JSON Lines)
- More verbose than CSV, though much more compact than XML
- No way to add comments or attribute tags, limiting the ability to annotate data structures or provide metadata (unlike XML)
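The JSON Lines workaround mentioned above stores one JSON document per line, so a file can be processed record by record. Here is a minimal Python sketch with made-up forum records; note that JSON Lines is shown purely as an illustration of the technique and is not one of DataGrab's export formats, and the function names are hypothetical.

```python
import io
import json

# Hierarchical records: each topic carries a nested list of replies.
records = [
    {"topic": "Install help", "replies": [{"author": "ann", "text": "Try v2"}]},
    {"topic": "Feature idea", "replies": []},
]


def write_json_lines(items, stream):
    """Write one compact JSON document per line."""
    for item in items:
        stream.write(json.dumps(item) + "\n")


def read_json_lines(stream):
    """Yield one record at a time instead of loading the whole file."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


buffer = io.StringIO()
write_json_lines(records, buffer)
buffer.seek(0)

topics = [record["topic"] for record in read_json_lines(buffer)]
# topics == ["Install help", "Feature idea"]
```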
Managing Scrapers
This section covers managing your scrapers, including running them, downloading the extracted data, and updating their configurations.
The Scrapers List
The Scrapers page is the default page you arrive on after you log in. The table displays the list of scrapers created, along with their statuses (Idle, Running, etc.) and some additional information.
You can sort the data by any column by clicking on the column's name. The up or down arrow displayed after the name shows whether the data is currently sorted in ascending or descending order.
To refresh the statuses of scrapers, please refresh the page.
The button toolbar before each scraper allows you to start or stop the scraper, download the extracted data of the latest run, or delete the scraper. The download button will only appear if the data of the latest run is still available to download (i.e. its retention period hasn't passed).
Editing a Scraper's Settings
Clicking the scraper's name in the list will navigate to its detail page.
Here you can edit its name, page limit, scheduling and automatic delivery configuration by clicking the button next to the appropriate setting.
Web pages tend to get updated a lot. This includes their layout, styling and contents. This may result in empty data when scraping. To solve this, click the SETUP FIELDS button and then revise the scraper's field and / or pagination configuration. For more details, see the Defining Fields and Pagination section above.
Finally, the page also displays the scraper's run history.
Here you can export and download the data of the desired run by clicking the button and choosing the format (CSV, Excel or JSON).
Data Retention Policy
You can request data to be automatically delivered to you via email. However, emails may get marked as spam or not delivered at all, depending on a variety of factors including the mail server's configuration. For this reason, data is retained for a period of time, allowing you to download it manually.
This retention period is subject to change but only within reasonable limits.
Feedback
DataGrab is constantly being developed and improved. However, making it better is not possible without your feedback. So if you have something to propose, or perhaps a problem to report, please do not hesitate to do so: click the SEND FEEDBACK button on the right side of the header (if you are not in setup mode) or the little icon (if you are in setup mode), and complete the feedback form.
Each piece of feedback has one of the following types:
- Bug Report - if you discovered a problem, like something broke in the system, or something doesn't work the way it should
- Comment - a general comment / observation about the tool
- Feature Request - if you have an idea you think could improve the product
- Question - if you need insight / troubleshooting about something, etc.
I'll follow up on every piece of feedback via email.