We are introducing a new tool to scrape and interact with websites programmatically: WebParsy.

WebParsy is a NodeJS library and CLI (a terminal application) that scrapes websites using Puppeteer and YAML definitions: https://www.npmjs.com/package/webparsy

It can also perform logins, take screenshots, create PDF files, and more.

Installing WebParsy

WebParsy is available in the official NPM repository. Make sure you have a recent version of NodeJS installed.

To install it simply do:

      npm i webparsy -g


There are examples covering all of WebParsy's features in its GitHub repository: https://github.com/joseconstela/webparsy/tree/master/example

The YAML file

The first thing you need to do is to create a YAML file.

YAML is a human-readable data format that both humans and computers can easily parse.

This is where you tell WebParsy the steps it must follow, which websites to visit, what content you need, and where it can find it.

It can also contain some basic configuration parameters like the browser’s width and height, etc.

Below is the most basic content that your definition must contain:

      - goto: 

Add the steps

This is where you become a magician. At the moment of writing, WebParsy supports the following steps:

  • goto Navigates to a URL
  • goBack Navigates to the previous page in history
  • screenshot Takes a screenshot of the page
  • pdf Generates a PDF of the page
  • text Gets the text for a given CSS selector
  • title Gets the title of the current page
  • form Fills in and submits a form
  • html Returns the HTML code for the page or a DOM element

Each of these can be defined in the steps section of the file. Take a look at the documentation to see how to use each of them.
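As an illustration, several of these steps can be chained in one definition. The sketch below visits a page, grabs its title, and takes a screenshot. The target URL is a placeholder, and whether title accepts an as option is an assumption mirroring the text step, so verify the exact options against the documentation:

      - goto: https://example.com
      - title:
          as: pageTitle
      - screenshot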

To cover the example in this post, we will grab Madrid's temperature from weather.com.

The first thing the scraper must do is visit their website, so the first step of the process is:

      - goto: https://weather.com/es-ES/tiempo/hoy/l/SPXX0050:1:SP

The next thing to worry about is telling the scraper where the city's temperature is on the page.

You can use the browser's dev tools to find the CSS selectors for the details you want to grab.

In weather.com's case, the CSS selector for the temperature is .today_nowcard-temp span. All scraped information is treated as a text string, but we want the temperature returned as a number. WebParsy can cast it to a number, getting rid of the ° symbol for you. To do this, you can make use of type.

WebParsy can both transform and cast (type) values before returning the scraped details.

The step will look like:

      - text:
          selector: .today_nowcard-temp span
          type: number
          as: temp

as represents the name of the property to be returned.

Your YAML file should now look like this:

      - goto: https://weather.com/es-ES/tiempo/hoy/l/SPXX0050:1:SP
      - text:
          selector: .today_nowcard-temp span
          type: number
          as: temp


Simply do:

      $ webparsy mi_file.yaml

The command's result should look like this:

      "temp": 16
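Since WebParsy is also a NodeJS library, you can run the same definition from code instead of the CLI. The sketch below is based on the project's README at the time of writing; the init method and its file option are assumptions to double-check against the repository:

```javascript
// Hypothetical usage sketch: `init` and its `file` option are taken from
// WebParsy's README and may change; verify against the repository.
const webparsy = require('webparsy')

async function main () {
  // Run the YAML definition and collect the scraped properties
  const result = await webparsy.init({ file: 'mi_file.yaml' })
  console.log(result.temp) // the property named by `as` in the YAML
}

main().catch(console.error)
```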

Learn more about WebParsy

Although WebParsy might not suit everyone's needs, it's under active development and open to all kinds of suggestions. Take a look at the repository at https://github.com/joseconstela/webparsy and submit any issue you might experience.

You can also contribute by forking the repository and submitting pull requests.