Crawler config
Parameters
schedule
How often a complete crawl should be performed.
once
monthly (growth)
weekly (advanced)
workDays (professional)
daily (professional)
executeAt
The time at which the crawler should run, formatted as hh:mm.
Example: executeAt: 18:30
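As an illustration of the hh:mm format, here is a minimal sketch of how such a value could be checked before it is put in the config. The helper name parseExecuteAt is hypothetical, and the sketch assumes a zero-padded 24-hour clock, which the text above does not state explicitly.

```typescript
// Hypothetical helper, not part of the crawler: parses an executeAt value.
// Assumption: zero-padded 24-hour time, e.g. "18:30" or "07:05".
function parseExecuteAt(value: string): { hours: number; minutes: number } | null {
  // Hours 00-23, a colon, minutes 00-59.
  const match = /^([01]\d|2[0-3]):([0-5]\d)$/.exec(value);
  if (match === null) {
    return null; // Not in hh:mm form.
  }
  return { hours: Number(match[1]), minutes: Number(match[2]) };
}
```

Calling parseExecuteAt("18:30") yields hours 18 and minutes 30, while a value such as "24:00" or "7:5" is rejected with null.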
startUrls
Array of URLs the crawler uses as entry points.
Example: startUrls: ["https://example.com"]
sitemaps
Array of URLs pointing to a sitemap. URLs found in these sitemaps will be used as entry points.
Example: sitemaps: ["https://example.com/sitemap.xml"]
ignoreQueryParams
Filters out specified query parameters from crawled URLs. This can help you avoid indexing duplicate URLs.
Example: ignoreQueryParams: ["key", "q"]
ignoreAllQueryParams
Filters out all query parameters from crawled URLs. Defaults to true.
Example: ignoreAllQueryParams: true
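To show why filtering query parameters avoids duplicate URLs, here is a rough sketch of the kind of normalization ignoreQueryParams and ignoreAllQueryParams describe. The function normalizeUrl and the QueryParamOptions type are hypothetical; the crawler's actual implementation is not documented here.

```typescript
// Hypothetical option shape mirroring the two parameters described above.
interface QueryParamOptions {
  ignoreQueryParams?: string[];
  ignoreAllQueryParams?: boolean;
}

// Sketch only: returns the URL with the ignored query parameters removed.
function normalizeUrl(rawUrl: string, options: QueryParamOptions): string {
  const url = new URL(rawUrl);
  // The text above states that ignoreAllQueryParams defaults to true.
  if (options.ignoreAllQueryParams ?? true) {
    url.search = ""; // Drop the whole query string.
  } else {
    for (const name of options.ignoreQueryParams ?? []) {
      url.searchParams.delete(name); // Drop only the listed parameters.
    }
  }
  return url.toString();
}
```

With ignoreAllQueryParams set to false and ignoreQueryParams set to ["key", "q"], the URL https://example.com/products?q=shoes&page=2 becomes https://example.com/products?page=2. With the default, it becomes https://example.com/products.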
extractUrls
The crawler extracts URLs from crawled data and continues crawling those URLs. Only URLs that match one of the start URLs are extracted. Defaults to true.
Example: extractUrls: true
concurrency
Number of concurrent tasks the crawler can run. Running multiple concurrent tasks may cause your server to return 429 errors. Accepts a value between 1 and 5. Defaults to 1.
Example: concurrency: 1
ignoreNoFollowTo
Whether the crawler should follow links with rel="nofollow". Defaults to true.
Example: ignoreNoFollowTo: true
maxDepth
Limits how deep the crawler can go into the website.
Example: maxDepth: 1
With a max depth of three, for instance:
http://example.com depth = 1 ✅
http://example.com/products depth = 2 ✅
http://example.com/products/category depth = 3 ✅
http://example.com/products/category/detail depth = 4 ❌
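The URLs listed for maxDepth suggest that depth grows with each path segment. A minimal sketch under that assumption follows; the helpers urlDepth and withinMaxDepth are hypothetical, and the real crawler may measure depth differently (for example by counting link hops from a start URL).

```typescript
// Assumption: depth = number of non-empty path segments + 1,
// so the site root is at depth 1.
function urlDepth(rawUrl: string): number {
  const segments = new URL(rawUrl).pathname
    .split("/")
    .filter((segment) => segment.length > 0);
  return segments.length + 1;
}

// Sketch only: true when the URL is within the configured maxDepth.
function withinMaxDepth(rawUrl: string, maxDepth: number): boolean {
  return urlDepth(rawUrl) <= maxDepth;
}
```

Under this reading, http://example.com/products/category has three levels and passes a maxDepth of 3, while http://example.com/products/category/detail has four and is skipped.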
maxRetries
How many times the crawler should retry a failed URL. Accepts a value between 0 and 8. Defaults to 5.
Example: maxRetries: 5
maxUrls
Limits the number of URLs the crawler will process. The limit also applies to the startUrls. Accepts a value greater than or equal to 1.
Example: maxUrls: 1
characterLimit
Limits the number of characters the crawler can use. This can help keep your costs under control. Accepts a value greater than 500.
Example: characterLimit: 10000
delay
Time, in milliseconds, the crawler waits before crawling the next URL. This can help to avoid 429 errors.
Accepts a value between 0 and 30 000.
Example: delay: 4000
Range example: delay: {min: 500, max: 30000}
With a range, the crawler picks a random delay within that range on each new crawl.
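The two forms of delay (a fixed number or a min/max range) can be pictured with the following sketch. The Delay type and resolveDelay function are hypothetical names, and treating both bounds as inclusive is an assumption.

```typescript
// Hypothetical type covering both documented forms of the delay setting.
type Delay = number | { min: number; max: number };

// Sketch only: resolves the delay, in milliseconds, for one crawl.
// The random source is injectable so the behaviour can be tested.
function resolveDelay(delay: Delay, random: () => number = Math.random): number {
  if (typeof delay === "number") {
    return delay; // Fixed delay, used as-is.
  }
  // Assumption: pick a whole number of milliseconds in [min, max].
  return Math.floor(delay.min + random() * (delay.max - delay.min + 1));
}
```

A fixed value such as 4000 is returned unchanged, while {min: 500, max: 30000} yields some value from 500 up to 30000.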
timeout
Time, in milliseconds, the crawler waits for a response from each URL. When the timeout period expires, the document is marked as failed. Accepts a value between 0 and 30 000. Defaults to 30 000.
Example: timeout: 30000
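Several of the parameters above only accept a bounded numeric value. As a convenience, a config could be checked against those bounds before use, along the lines of this sketch. The NumericLimits type and validateLimits function are hypothetical, the "between" ranges are assumed to be inclusive, and only the plain-number form of delay is covered.

```typescript
// Hypothetical subset of the config holding the bounded numeric parameters.
interface NumericLimits {
  concurrency?: number;
  maxRetries?: number;
  maxUrls?: number;
  characterLimit?: number;
  delay?: number;
  timeout?: number;
}

// Sketch only: returns a list of human-readable problems (empty when valid).
function validateLimits(config: NumericLimits): string[] {
  const errors: string[] = [];
  const check = (
    name: string,
    value: number | undefined,
    ok: (v: number) => boolean,
    rule: string,
  ): void => {
    // Unset parameters are skipped; they fall back to their defaults.
    if (value !== undefined && !ok(value)) {
      errors.push(`${name} must be ${rule}, got ${value}`);
    }
  };
  check("concurrency", config.concurrency, (v) => v >= 1 && v <= 5, "between 1 and 5");
  check("maxRetries", config.maxRetries, (v) => v >= 0 && v <= 8, "between 0 and 8");
  check("maxUrls", config.maxUrls, (v) => v >= 1, "greater than or equal to 1");
  check("characterLimit", config.characterLimit, (v) => v > 500, "greater than 500");
  check("delay", config.delay, (v) => v >= 0 && v <= 30000, "between 0 and 30000");
  check("timeout", config.timeout, (v) => v >= 0 && v <= 30000, "between 0 and 30000");
  return errors;
}
```

The example values used in this section (concurrency 1, maxRetries 5, maxUrls 1, characterLimit 10000, delay 4000, timeout 30000) pass without errors, whereas a concurrency of 6 or a characterLimit of 500 is reported.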
renderJavaScript
When enabled, all web pages are rendered with a headless Chrome browser. This is slower but crawls pages in their most realistic form. Defaults to true.
Example: renderJavaScript: true
pathsToMatch
Only URLs that match this regex pattern will be crawled.
Example: pathsToMatch: /products/
URLs must also match one of your startUrls.
pathsToExclude
URLs that match this regex pattern will not be crawled.
Example: pathsToExclude: /products/
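How pathsToMatch, pathsToExclude and startUrls combine can be sketched as below. This is an illustration under several assumptions: the pattern is treated as a regular-expression source string tested against the full URL, "matching a start URL" is taken to mean that the URL begins with it, and the names PathFilter and shouldCrawl are hypothetical.

```typescript
// Hypothetical shape holding the three settings involved.
interface PathFilter {
  startUrls: string[];
  pathsToMatch?: string;
  pathsToExclude?: string;
}

// Sketch only: decides whether a discovered URL would be crawled.
function shouldCrawl(url: string, filter: PathFilter): boolean {
  // Assumption: a URL "matches" a start URL when it begins with it.
  const underStartUrl = filter.startUrls.some((start) => url.startsWith(start));
  if (!underStartUrl) {
    return false;
  }
  // When set, pathsToMatch must match for the URL to be crawled.
  if (filter.pathsToMatch !== undefined && !new RegExp(filter.pathsToMatch).test(url)) {
    return false;
  }
  // When set, pathsToExclude must not match.
  if (filter.pathsToExclude !== undefined && new RegExp(filter.pathsToExclude).test(url)) {
    return false;
  }
  return true;
}
```

With startUrls ["https://example.com"] and pathsToMatch "/products/", the URL https://example.com/products/shoes is kept and https://example.com/blog/post is dropped. Swapping the pattern into pathsToExclude reverses the outcome.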
documentsToInclude
A document will only be crawled if its extension, content-type, mime-type or document type is included in the list.
If you provide an empty array, all documents will be ignored.
Example: documentsToInclude: ["word", "text/plain", "csv"]
documentsToExclude
A document will only be crawled if its extension, content-type, mime-type or document type is not included in the list.
Example: documentsToExclude: ["word", "text/plain", "csv"]
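The include and exclude rules for documents can be summarized in a short sketch. It assumes each document is described by a set of labels (its extension, content-type, mime-type and document type); that representation, along with the names DocumentFilter and shouldCrawlDocument, is hypothetical.

```typescript
// Hypothetical shape holding the two document filters.
interface DocumentFilter {
  documentsToInclude?: string[];
  documentsToExclude?: string[];
}

// Sketch only: labels are the document's extension, content-type,
// mime-type and document type, e.g. ["txt", "text/plain"].
function shouldCrawlDocument(labels: string[], filter: DocumentFilter): boolean {
  const hasAny = (list: string[]): boolean => labels.some((label) => list.includes(label));
  // An include list admits only listed documents; an empty list admits none.
  if (filter.documentsToInclude !== undefined && !hasAny(filter.documentsToInclude)) {
    return false;
  }
  // An exclude list rejects any listed document.
  if (filter.documentsToExclude !== undefined && hasAny(filter.documentsToExclude)) {
    return false;
  }
  return true;
}
```

With documentsToInclude set to ["word", "text/plain", "csv"], a plain-text document is crawled and a PDF is not. With an empty include array nothing is crawled, which matches the behaviour described above.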
htmlParser
Defines how the HTML of your web page should be parsed. The parsing process might impact the performance of your agent.
Defaults to smart.
Example: htmlParser: smart
Parser types:
plain: dumb HTML parser
smart: preserves document structure, which can be important for context
headers
HTTP headers that will be added to every request your crawler makes.
Example: headers: { "authentication": "JWT token" }
Only available for renderJavaScript: true
userAgent
Customize the userAgent for the crawler.
Example: userAgent: ChathiveCrawler/1.0
Only available for renderJavaScript: true
languages
Crawl only web pages that match your configured languages.
languages: Array of ISO 639-1 language codes.
strict: If false, documents that don't have a language will also be crawled. If true, only documents that have a language that matches your list will be crawled.
Example: languages: {languages: ["nl", "en"], strict: false}
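The effect of the strict flag is easiest to see in code. The sketch below is an illustration only; the names LanguageFilter and matchesLanguage are hypothetical, and it assumes the detected language of a document is available as a code such as "nl", or is undefined when no language could be determined.

```typescript
// Hypothetical shape mirroring the languages setting.
interface LanguageFilter {
  languages: string[];
  strict: boolean;
}

// Sketch only: decides whether a document passes the language filter.
function matchesLanguage(documentLanguage: string | undefined, filter: LanguageFilter): boolean {
  if (documentLanguage === undefined) {
    // Documents without a language are crawled only when strict is false.
    return !filter.strict;
  }
  return filter.languages.includes(documentLanguage);
}
```

With languages ["nl", "en"] and strict set to false, a Dutch document and a document without a language are both crawled, and a French one is skipped. Setting strict to true also skips the document without a language.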
restoreDeletedPages
Defines whether the crawler should restore a document that was crawled previously but has since been deleted in Chathive.
Defaults to false.
Example: restoreDeletedPages: false
acceptCookieConsent
Accepts the cookie consent policy to prevent it from being crawled. Uses an HTML querySelector to find the element that accepts the cookie policy.
Example: acceptCookieConsent: { selector: "#buttonId" }
Only available for renderJavaScript: true
localStorage
The crawler will pre-fill the web page's local storage on crawl.
Example: localStorage: { "key": "value" }
Only available for renderJavaScript: true
sessionStorage
The crawler will pre-fill the web page's session storage on crawl.
Example: sessionStorage: { "key": "value" }
Only available for renderJavaScript: true
cookies
The crawler will set cookies for the web page on crawl. This can be helpful for crawling behind a login wall.
Example: cookies: [{ name: "key", value: "value", domain: "example.com" }]
Only available for renderJavaScript: true
credentials
Provides HTTP authentication credentials to the crawler. This can be helpful for crawling behind a login wall.
Example: credentials: { username: "name", password: "pass" }
Only available for renderJavaScript: true
projectIds
Defines the projects to which the crawler saves the crawled documents. The ID can be found in the URL of your project.
Example: projectIds: ["66769ce07734ec30a64090e2"]
advancedLogs
Logs more of the steps and actions your crawler performed. This is helpful for debugging or better understanding your crawler, but it might make the log files harder to navigate.
Example: advancedLogs: false
dryRun
In dry run mode, crawled documents will not be saved. This setting is purely for debugging or testing your config without wasting resources.
Example: dryRun: false
debug
In debug mode, the crawler runs as in dryRun mode, but it only crawls 5 documents and logs the contents of those documents. This can be helpful for debugging your parser or other settings that depend on the document content.
Example: debug: false