# Crawler config

### Parameters

<table data-header-hidden><thead><tr><th width="224"></th><th></th></tr></thead><tbody>
<tr><td>schedule</td><td><p>How often a complete crawl should be performed.</p><ul><li>once</li><li>monthly <code>growth</code></li><li>weekly <code>advanced</code></li><li>workDays <code>professional</code></li><li>daily <code>professional</code></li></ul></td></tr>
<tr><td>executeAt</td><td>The time at which the crawler should run, formatted <code>hh:mm</code>.<br>Example: <code>executeAt: 18:30</code></td></tr>
<tr><td>startUrls</td><td>Array of URLs the crawler uses as entry points.<br>Example: <code>startUrls: ["https://example.com"]</code></td></tr>
<tr><td>sitemaps</td><td>Array of URLs pointing to a sitemap. URLs found in these sitemaps are used as entry points.<br>Example: <code>sitemaps: ["https://example.com/sitemap.xml"]</code></td></tr>
<tr><td>ignoreQueryParams</td><td>Filters out the specified query parameters from crawled URLs. This can help you avoid indexing duplicate URLs.<br>Example: <code>ignoreQueryParams: ["key", "q"]</code></td></tr>
<tr><td>ignoreAllQueryParams</td><td>Filters out all query parameters from crawled URLs. Defaults to true.<br>Example: <code>ignoreAllQueryParams: true</code></td></tr>
<tr><td>extractUrls</td><td>The crawler extracts URLs from crawled pages and continues crawling them. Only URLs that match one of the start URLs are extracted. Defaults to true.<br>Example: <code>extractUrls: true</code></td></tr>
<tr><td>concurrency</td><td>Number of concurrent tasks the crawler can run. Running multiple concurrent tasks may cause your server to return 429 errors. Accepts a value between 1 and 5, defaults to 1.<br>Example: <code>concurrency: 1</code></td></tr>
<tr><td>ignoreNoFollowTo</td><td>Whether the crawler should follow links marked with <code>rel="nofollow"</code>. Defaults to true.<br>Example: <code>ignoreNoFollowTo: true</code></td></tr>
<tr><td>followRedirect</td><td>Whether the crawler should follow redirect URLs. If redirects are disabled, the redirecting URL is treated as deleted. Defaults to true.<br>Example: <code>followRedirect: true</code></td></tr>
<tr><td>maxDepth</td><td><p>Limits how deep the crawler goes into the website.<br>Example: <code>maxDepth: 3</code><br>With a max depth of three:</p><ul><li>http://example.com <strong>depth = 1 ✅</strong></li><li>http://example.com/products <strong>depth = 2 ✅</strong></li><li>http://example.com/products/category <strong>depth = 3 ✅</strong></li><li>http://example.com/products/category/detail <strong>depth = 4 ❌</strong></li></ul></td></tr>
<tr><td>maxRetries</td><td><p>How many times the crawler should retry a failed URL. Accepts a value between 0 and 8, defaults to 5.</p><p>Example: <code>maxRetries: 5</code></p></td></tr>
<tr><td>maxUrls</td><td><p>Limits the number of URLs the crawler will process; this also applies to the startUrls. Accepts a value greater than or equal to 1.</p><p>Example: <code>maxUrls: 1</code></p></td></tr>
<tr><td>characterLimit</td><td><p>Limits the number of characters the crawler can use. Can be helpful to keep your costs under control. Accepts a value greater than 500.</p><p>Example: <code>characterLimit: 10000</code></p></td></tr>
<tr><td>delay</td><td><p>Time, in milliseconds, the crawler waits before crawling the next URL. This can help avoid 429 errors.</p><ul><li>Value between 0 and 30 000: <code>delay: 4000</code></li><li>Range: <code>delay: {min: 500, max: 30000}</code></li></ul><p>With a range, a random delay within your set range is generated on each new crawl.</p></td></tr>
<tr><td>timeout</td><td><p>Time, in milliseconds, the crawler waits for a response from each URL. When the timeout period expires, the document is marked as failed.<br>Accepts a value between 0 and 30 000, defaults to 30 000.</p><p>Example: <code>timeout: 30000</code></p></td></tr>
<tr><td>renderJavaScript</td><td><p>When enabled, all web pages are rendered with a headless Chrome browser. This is slower but crawls pages in their most realistic form.<br>Defaults to true.</p><p>Example: <code>renderJavaScript: true</code></p></td></tr>
<tr><td>pathsToMatch</td><td><p>Only URLs that match <strong>one</strong> of the regex patterns defined in the array will be crawled. <em>URLs must also match one of your startUrls.</em></p><p>Example: <code>pathsToMatch: ["/products\/*/i"]</code></p><p><em>The regex must be enclosed in a string</em> and can contain flags.</p></td></tr>
<tr><td>pathsToExclude</td><td><p>URLs that match <strong>one</strong> of the regex patterns defined in the array will not be crawled.</p><p>Example: <code>pathsToExclude: ["/products\/*/i"]</code><br><em>The regex must be enclosed in a string</em> and can contain flags.</p></td></tr>
<tr><td>documentsToInclude</td><td><p>A document is only crawled if its extension, content type, MIME type, or <a href="../attributes#document-types">document type</a> is included in the list.</p><p>If you provide an empty array, all documents will be ignored.</p><p>Example: <code>documentsToInclude: ["word", "text/plain", "csv"]</code></p></td></tr>
<tr><td>documentsToExclude</td><td>A document is only crawled if its extension, content type, MIME type, or <a href="../attributes#document-types">document type</a> is <strong>not</strong> included in the list.<br>Example: <code>documentsToExclude: ["word", "text/plain", "csv"]</code></td></tr>
<tr><td>htmlParser</td><td><p>Defines how the HTML of your webpage should be parsed. The parsing process might impact the performance of your agent.</p><p>Defaults to smart.<br>Example: <code>htmlParser: smart</code></p><p>Parser types:</p><ul><li><code>plain</code>: basic HTML parser</li><li><code>smart</code>: preserves document structure, which can be important for context</li></ul></td></tr>
<tr><td>headers</td><td>HTTP headers that will be added to every request your crawler makes.<br>Example: <code>headers: { "authentication": "JWT token" }</code><br><br><em>Only available for <code>renderJavaScript: true</code></em></td></tr>
<tr><td>userAgent</td><td><p>Customize the user agent for the crawler.</p><p>Example: <code>userAgent: ChathiveCrawler/1.0</code></p><p><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr>
<tr><td>languages</td><td><p>Crawl only webpages that match your configured languages.</p><ul><li><strong>Languages</strong>: array of <code>ISO 639-1</code> language codes</li><li><strong>Strict</strong>:<br>if false, documents that don't have a language will also be crawled<br>if true, only documents whose language matches your list will be crawled</li></ul><p>Example: <code>languages: {languages: ["nl", "en"], strict: false}</code></p></td></tr>
<tr><td>restoreDeletedPages</td><td><p>Defines whether the crawler should restore a document that has been crawled previously but has been deleted in Chathive.</p><p>Defaults to false.</p><p>Example: <code>restoreDeletedPages: false</code></p></td></tr>
<tr><td>acceptCookieConsent</td><td><p>Accepts the cookie consent policy to prevent it from being crawled. Uses an HTML querySelector to accept the cookie policy.</p><p>Example: <code>acceptCookieConsent: { selector: "#buttonId" }</code></p><p><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr>
<tr><td>localStorage</td><td><p>The crawler pre-fills the webpage's local storage on crawl.</p><p>Example: <code>localStorage: { "key": "value" }</code></p><p><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr>
<tr><td>sessionStorage</td><td><p>The crawler pre-fills the webpage's session storage on crawl.</p><p>Example: <code>sessionStorage: { "key": "value" }</code></p><p><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr>
<tr><td>cookies</td><td><p>The crawler sets cookies for the webpage on crawl; this can be helpful for crawling behind a login wall.</p><p>Example: <code>cookies: [{ name: "key", value: "value", domain: "example.com" }]</code></p><p><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr>
<tr><td>credentials</td><td><p>Provides HTTP authentication credentials to the crawler; this can be helpful to crawl behind a login wall.</p><p>Example: <code>credentials: { username: "name", password: "pass" }</code></p><p><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr>
<tr><td><a href="crawler-config/login">login</a></td><td><p>Configure complex login methods to crawl behind a login wall.</p><p><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr>
<tr><td>projectIds</td><td><p>Defines to which project the crawler saves the crawled documents. The ID can be found in the URL of your project.</p><p>Example: <code>projectIds: ["66769ce07734ec30a64090e2"]</code></p></td></tr>
<tr><td>advancedLogs</td><td>Logs more of the steps and actions the crawler performed. Helpful for debugging or better understanding your crawler, but may make the log files harder to navigate.<br>Example: <code>advancedLogs: false</code></td></tr>
<tr><td>dryRun</td><td><p>In dry-run mode, crawled documents are not saved. This setting is purely for debugging/testing your config without wasting resources.</p><p>Example: <code>dryRun: false</code></p></td></tr>
<tr><td>debug</td><td><p>In debug mode the crawler runs like dry-run mode, but it only crawls 5 documents and logs their contents. Can be helpful to debug your parser or other settings that depend on the document content.</p><p>Example: <code>debug: false</code></p></td></tr>
</tbody></table>
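
Putting several of these parameters together, a complete config could look like the sketch below. This assumes a YAML-style file using the same `key: value` notation as the examples above; the parameter names come from the table, but the chosen values (and the YAML framing itself) are purely illustrative.

```yaml
# Illustrative crawler config — values are examples, not recommendations.
schedule: weekly                      # requires the advanced plan or higher
executeAt: "18:30"                    # hh:mm
startUrls: ["https://example.com"]
sitemaps: ["https://example.com/sitemap.xml"]
ignoreAllQueryParams: true
extractUrls: true
concurrency: 1                        # keep low to avoid 429 errors from your server
maxDepth: 3
maxRetries: 5
delay: { min: 500, max: 30000 }       # random delay (ms) between requests
renderJavaScript: true                # required for headers, cookies, login, etc.
pathsToMatch: ["/products\/*/i"]      # regex enclosed in a string, flags allowed
documentsToExclude: ["word", "text/plain", "csv"]
headers: { "authentication": "JWT token" }
projectIds: ["66769ce07734ec30a64090e2"]
dryRun: true                          # test the config without saving documents
```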


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available on this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://developers.chathive.app/crawler/crawler-config.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
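
For instance, a concrete request might look like the line below; the question text is only an illustration and must be URL-encoded:

```
GET https://developers.chathive.app/crawler/crawler-config.md?ask=What%20is%20the%20default%20value%20of%20maxRetries
```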

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
