
Extraction rules

Extraction rules let you define rules that are applied to the resulting HTML and return the extracted values as JSON.

Rules are passed in the extract_rules parameter.

tip

Remember to serialize this parameter to JSON and URL-encode it, as in the examples below.

The extract_rules parameter is an object in which each key maps to a CSS selector in either simple or extended form. The response is a JSON object instead of HTML: its keys are the same as in the extract_rules query parameter, and its values are extracted according to the provided rules.
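To make the encoding step concrete, here is a minimal sketch that builds the full request URL using only the Python standard library. The helper name build_request_url is hypothetical (it is not part of the API); the endpoint and parameter names are the ones used in this page's examples.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the examples on this page.
API_ENDPOINT = "https://scraping.narf.ai/api/v1/"


def build_request_url(api_key, target_url, rules):
    """Hypothetical helper: serialize rules to JSON, then percent-encode the query."""
    query = urlencode({
        "api_key": api_key,
        "url": target_url,
        # The rules object is sent as a JSON string.
        "extract_rules": json.dumps(rules),
    })
    return f"{API_ENDPOINT}?{query}"


request_url = build_request_url(
    "[your API key]", "https://example.com", {"paragraph": "p"}
)
print(request_url)
```

Passing the same dictionary through the params argument of requests.get, as the examples on this page do, performs the URL-encoding step for you.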

info

By default, only the first matching element is returned. To extract all matching elements, use the extended form with {"type": "all", ...}.

Simple form

Simple form is a CSS selector in the form of a string.

Extract the first paragraph example

The following example uses the simple extraction rule form to extract text content from the first paragraph (<p>) out of the scraped page:

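import requests
import json

# extract_rules must be a JSON string, so serialize the rules with json.dumps
payload = {
    "api_key": "[your API key]",
    "url": "https://example.com",
    "extract_rules": json.dumps({"paragraph": "p"}),
}

# requests URL-encodes every value passed in params
response = requests.get("https://scraping.narf.ai/api/v1/", params=payload)
print(response.content)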

In the response, you receive a JSON object with the contents of the first <p> element under the "paragraph" key:

{"paragraph":"This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission."}
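The extracted text keeps the whitespace of the source HTML, as the line break and indentation in the response above show. The following sketch decodes that body and collapses the whitespace; the body literal is copied from the response above, and the cleanup step is an optional post-processing choice, not something the API requires.

```python
import json

# Response body copied from the example above (as bytes, like response.content).
body = (
    b'{"paragraph":"This domain is for use in illustrative examples in documents. '
    b'You may use this\\n    domain in literature without prior coordination or '
    b'asking for permission."}'
)

# json.loads accepts bytes as well as str.
data = json.loads(body)

# Collapse runs of whitespace (including the embedded newline) into single spaces.
paragraph = " ".join(data["paragraph"].split())
print(paragraph)
```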

Extended form

Extended form is an object in which you can specify not only the selector (selector key) but also how the output should be processed (output key) and whether all or only the first matching element should be extracted (type key). Extraction rules using extended form have the following structure:

{
  "extract_rules": {
    "something": {
      "type": "all",
      "selector": "p",
      "output": "text"
    }
  }
}

type

Specifies whether to extract all or just the first element. Must be "all" or "first". Defaults to "first".

selector

A valid CSS selector.

output

Specifies the type of output. Must be one of:

  • "text" - extracts elements' inner text only (without HTML tags) - this is the default.
  • "html" - extracts elements' inner HTML.
  • "table_json" - extracts the headers and rows of an HTML table (or tables) to JSON, e.g. [{"header1": "row1_value1", "header2": "row1_value2"}, {"header1": "row2_value1", "header2": "row2_value2"}]. If there are no headers, the keys are incrementing integers as strings: "0", "1", and so on.
  • "table_array" - extracts the rows of an HTML table (or tables) to arrays, e.g. [["row1_value1", "row1_value2"], ["row2_value1", "row2_value2"]].
  • "@<any_string>" - extracts attribute value of elements, e.g. use @href to extract link values.
  • Nested object - extracts data from inner element. See: nested rules.
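The constraints listed above can be checked locally before a request is sent. The sketch below is a hypothetical client-side validator, not part of the API: validate_rules and its error messages are invented for illustration, and it assumes that the selector key is required in extended form, which this page does not state explicitly.

```python
ALLOWED_TYPES = {"all", "first"}
ALLOWED_OUTPUTS = {"text", "html", "table_json", "table_array"}


def validate_rules(rules, path="extract_rules"):
    """Return a list of problems found in an extraction rules object (empty if none)."""
    problems = []
    for key, rule in rules.items():
        where = f"{path}.{key}"
        if isinstance(rule, str):
            # Simple form: the value is a CSS selector string.
            if not rule.strip():
                problems.append(f"{where}: empty selector")
            continue
        if not isinstance(rule, dict):
            problems.append(f"{where}: expected a string or an object")
            continue
        # Extended form. Assumption: "selector" is mandatory here.
        selector = rule.get("selector")
        if not isinstance(selector, str) or not selector.strip():
            problems.append(f"{where}: missing or empty 'selector'")
        rule_type = rule.get("type", "first")  # documented default
        if not isinstance(rule_type, str) or rule_type not in ALLOWED_TYPES:
            problems.append(f"{where}: 'type' must be 'all' or 'first'")
        output = rule.get("output", "text")  # documented default
        if isinstance(output, dict):
            # Nested rules: validate the inner object recursively.
            problems.extend(validate_rules(output, f"{where}.output"))
        elif isinstance(output, str):
            is_attribute = output.startswith("@") and len(output) > 1
            if output not in ALLOWED_OUTPUTS and not is_attribute:
                problems.append(f"{where}: unsupported 'output' {output!r}")
        else:
            problems.append(f"{where}: 'output' must be a string or an object")
    return problems


print(validate_rules({"links": {"type": "many", "selector": "a", "output": "@href"}}))
```

The validator does not check that selectors are valid CSS; it only catches structural mistakes such as a misspelled type or output value.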

Here's how you can extract all links from the scraped website:

import requests
import json

payload = {
    "api_key": "[your API key]",
    "url": "https://example.com",
    "extract_rules": json.dumps({
        "links": {"type": "all", "selector": "a", "output": "@href"}
    }),
}

response = requests.get("https://scraping.narf.ai/api/v1/", params=payload)
print(response.content)

In the response, you receive a JSON object with a list of all links from the scraped website under the "links" key:

{"links":["https://www.iana.org/domains/example"]}
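Pages often contain relative or repeated links, so a small post-processing step can be useful. The sketch below assumes that attribute values may come back exactly as written in the HTML (this page does not say whether they are resolved); normalize_links is a hypothetical helper that resolves each value against the page URL and removes duplicates.

```python
from urllib.parse import urljoin


def normalize_links(page_url, hrefs):
    """Resolve each href against page_url and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Decoded body from the response shown above.
extracted = {"links": ["https://www.iana.org/domains/example"]}
links = normalize_links("https://example.com", extracted["links"])
print(links)
```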

Nested rules

If you want to create a nested output structure, set the "output" key to a nested extraction rules object in a recursive manner.

Example:

{
  "extract_rules": {
    "items": {
      "type": "all",
      "selector": ".item",
      "output": {
        "price": ".price",
        "date": ".date",
        "details": {
          "selector": ".details",
          "output": {
            "title": ".title",
            "description": ".description"
          }
        }
      }
    }
  }
}
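In Python, the nested rules above can be written as an ordinary dictionary and serialized once at the top level. This is a minimal sketch that only prepares the payload; the target URL is a placeholder, and the class names (.item, .price and so on) are the illustrative ones from the example, not selectors that exist on that page.

```python
import json

# Nested rules from the example above, written as a Python dictionary.
nested_rules = {
    "items": {
        "type": "all",
        "selector": ".item",
        "output": {
            "price": ".price",
            "date": ".date",
            "details": {
                "selector": ".details",
                "output": {
                    "title": ".title",
                    "description": ".description",
                },
            },
        },
    }
}

payload = {
    "api_key": "[your API key]",
    "url": "https://example.com",  # placeholder target page
    # Serialize the whole structure once; inner objects are not encoded separately.
    "extract_rules": json.dumps(nested_rules),
}

print(payload["extract_rules"])
```

The payload can then be passed as params to requests.get, in the same way as in the earlier examples.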