Crawler config

Parameters

schedule

How often a complete crawl should be performed.

  • once

  • monthly (Growth plan)

  • weekly (Advanced plan)

  • workDays (Professional plan)

  • daily (Professional plan)

executeAt

The time of day at which the crawler should run, formatted as hh:mm. Example: executeAt: 18:30
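
Taken together, a minimal scheduling sketch might look like this (values are illustrative; the key: value syntax mirrors the inline examples above):

  schedule: weekly
  executeAt: 18:30

This would start a complete crawl once a week at 18:30.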

startUrls

Array of URLs the crawler uses as entry points. Example: startUrls: ["https://example.com"]

sitemaps

Array of URLs pointing to a sitemap. URLs found in these sitemaps will be used as entry points. Example: sitemaps: ["https://example.com/sitemap.xml"]
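
Both entry-point options can be combined. A small sketch with placeholder URLs:

  startUrls: ["https://example.com"]
  sitemaps: ["https://example.com/sitemap.xml"]

The crawler then starts from the listed start URLs as well as from every URL found in the sitemap.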

ignoreQueryParams

Filters out specified query parameters from crawled URLs. This can help you avoid indexing duplicate URLs. Example: ignoreQueryParams: ["key", "q"]

ignoreAllQueryParams

Filters out all query parameters from crawled URLs. Defaults to true. Example: ignoreAllQueryParams: true
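
As an illustration of the difference, take the hypothetical URL https://example.com/search?q=shoes&page=2 and a sketch that only strips the q parameter (assuming ignoreAllQueryParams is turned off so that just the listed parameters are removed):

  ignoreAllQueryParams: false
  ignoreQueryParams: ["q"]

With this configuration the URL would presumably be indexed as https://example.com/search?page=2, while ignoreAllQueryParams: true would reduce it to https://example.com/search.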

extractUrls

The crawler will extract URLs from crawled pages and continue crawling them. Only URLs that match one of the start URLs will be extracted. Defaults to true. Example: extractUrls: true

concurrency

Number of concurrent tasks the crawler can run. Running multiple concurrent tasks may cause your server to return 429 errors. Accepts a value between 1 and 5, defaults to 1. Example: concurrency: 1

ignoreNoFollowTo

Whether the crawler should follow links marked rel="nofollow". Defaults to true. Example: ignoreNoFollowTo: true

maxDepth

Limits how deep the crawler can go into the website. Example: maxDepth: 3. With a max depth of three:

  • http://example.com depth = 1 ✅

  • http://example.com/products depth = 2 ✅

  • http://example.com/products/category depth = 3 ✅

  • http://example.com/products/category/detail depth = 4 ❌

maxRetries

How many times the crawler should retry a failed URL. Accepts a value between 0 and 8, defaults to 5.

Example: maxRetries: 5

maxUrls

Limits the number of URLs the crawler will process; this also applies to the startUrls. Accepts a value greater than or equal to 1.

Example: maxUrls: 1

characterLimit

Limits the number of characters the crawler can use. Can be helpful to keep your costs under control. Accepts a value greater than 500.

Example: characterLimit: 10000
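
The limits above can be combined into a rough crawl budget. A sketch with illustrative values:

  maxDepth: 3
  maxUrls: 500
  maxRetries: 3
  characterLimit: 100000

Here the crawler would go at most three levels deep, process at most 500 URLs and roughly 100 000 characters, and give up on a failing URL after 3 retries.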

delay

Time, in milliseconds, the crawler waits before crawling the next URL. This can help to avoid 429 errors.

  • Fixed value between 0 and 30 000. Example: delay: 4000

  • Range. Example: delay: { min: 500, max: 30000 }

A range randomly generates a delay within your configured bounds on each new crawl.

timeout

Time, in milliseconds, the crawler waits to get a response from each URL. When the timeout period expires, the document is marked as failed. Accepts a value between 0 and 30 000, defaults to 30 000.

Example: timeout: 30000
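
To stay friendly to the target server and avoid 429 responses, these settings can be tuned together. A sketch with illustrative values:

  concurrency: 1
  delay: { min: 500, max: 2000 }
  timeout: 15000

One task at a time, a random pause of 0.5 to 2 seconds between URLs, and a 15-second response timeout.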

renderJavaScript

When enabled, all web pages are rendered with a headless Chrome browser. This is slower but crawls pages in their most realistic form. Defaults to true.

Example: renderJavaScript: true

pathsToMatch

Only URLs that match this regex pattern will be crawled.

Example: pathsToMatch: /products/. URLs must also match one of your startUrls.

pathsToExclude

URLs that match this regex pattern will not be crawled.

Example: pathsToExclude: /products/
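
Both patterns can be combined to crawl one section of a site while skipping part of it. A sketch using hypothetical paths:

  startUrls: ["https://example.com"]
  pathsToMatch: /products/
  pathsToExclude: /products/archived/

Only URLs under /products/ would be crawled, except those under /products/archived/.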

documentsToInclude

Documents will only be crawled if their extension, content type, MIME type, or document type is included in the list.

If you provide an empty array, all documents will be ignored.

Example: documentsToInclude: ["word", "text/plain", "csv"]

documentsToExclude

Documents will only be crawled if their extension, content type, MIME type, or document type is not included in the list. Example: documentsToExclude: ["word", "text/plain", "csv"]

htmlParser

Define how the html of your webpage should be parsed. The parsing process might impact the performance of your agent.

Defaults to smart. Example: htmlParser: smart

Parser types:

  • plain: simple HTML parser

  • smart: preserves document structure, which can be important for context

headers

HTTP headers that will be added to every request your crawler makes. Example: headers: { "authentication": "JWT token" }. Only available for renderJavaScript: true.

userAgent

Customize the userAgent for the crawler.

Example: userAgent: ChathiveCrawler/1.0

Only available for renderJavaScript: true
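
A sketch combining the request-level options (the header name and token are placeholders):

  renderJavaScript: true
  userAgent: ChathiveCrawler/1.0
  headers: { "authentication": "JWT token" }

Both headers and userAgent require renderJavaScript: true, as noted above.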

languages

Crawl only webpages that match your configured languages.

  • languages: array of ISO 639-1 language codes

  • strict: if false, documents without a language will also be crawled; if true, only documents whose language matches your list will be crawled

Example: languages: { languages: ["nl", "en"], strict: false }

restoreDeletedPages

Defines whether the crawler should restore a document that was crawled previously but has been deleted in Chathive.

Defaults to false.

Example: restoreDeletedPages: false

acceptCookieConsent

Accepts the cookie consent policy so that the consent banner is not crawled. Uses an HTML querySelector to accept the cookie policy.

Example: acceptCookieConsent: { selector: "#buttonId" }

Only available for renderJavaScript: true

localStorage

The crawler will pre-fill the webpage's local storage on crawl.

Example: localStorage: { "key": "value" }

Only available for renderJavaScript: true

sessionStorage

The crawler will pre-fill the webpage's session storage on crawl.

Example: sessionStorage: { "key": "value" }

Only available for renderJavaScript: true

cookies

The crawler will set cookies for the webpage on crawl, which can be helpful for crawling behind a login wall.

Example: cookies: [{ name: "key", value: "value", domain: "example.com" }]

Only available for renderJavaScript: true

credentials

Provides HTTP authentication credentials to the crawler, which can be helpful for crawling behind a login wall.

Example: credentials: { username: "name", password: "pass" }

Only available for renderJavaScript: true
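
For content behind a login wall, several of these options can be combined. A sketch with placeholder names and values:

  renderJavaScript: true
  cookies: [{ name: "session", value: "abc123", domain: "example.com" }]
  credentials: { username: "name", password: "pass" }
  localStorage: { "authToken": "token" }

Which mechanism you need depends on how the site authenticates: HTTP auth uses credentials, cookie-based sessions use cookies, and single-page apps often read a token from local or session storage.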

projectIds

Defines to which project(s) the crawler should save the crawled documents. The ID can be found in the URL of your project.

Example: projectIds: ["66769ce07734ec30a64090e2"]

advancedLogs

Logs more of the steps and actions your crawler performed. Helpful for debugging or better understanding your crawler, but may make the log files harder to navigate. Example: advancedLogs: false

dryRun

In dry run mode, crawled documents are not saved. This setting is purely for debugging/testing your config without wasting resources.

Example: dryRun: false

debug

In debug mode the crawler runs like in dry run mode, but it only crawls 5 documents and logs their contents. Can be helpful for debugging your parser or other settings that depend on the document content.

Example: debug: false
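
When iterating on a config, the debugging options can be combined. A sketch:

  dryRun: true
  advancedLogs: true
  debug: false

Nothing is saved, the logs are verbose, and switching debug to true would additionally restrict the run to 5 documents and log their contents.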
