# Crawler config

### Parameters

<table data-header-hidden><thead><tr><th width="224"></th><th></th></tr></thead><tbody><tr><td>schedule<br><br><br><br><br><br><br></td><td><p>How often a complete crawl should be performed.</p><ul><li>once</li><li>monthly <code>growth</code></li><li>weekly <code>advanced</code></li><li>workDays <code>professional</code></li><li>daily <code>professional</code></li></ul></td></tr><tr><td>executeAt<br></td><td>At which time the crawler should be executed. Formatted as <code>hh:mm</code>.<br>Example: <code>executeAt: 18:30</code></td></tr><tr><td>startUrls<br></td><td>Array of URLs the crawler uses as entry points.<br>Example: <code>startUrls: ["https://example.com"]</code></td></tr><tr><td>sitemaps<br><br></td><td>Array of URLs pointing to a sitemap. URLs found in these sitemaps will be used as entry points.<br>Example: <code>sitemaps: ["https://example.com/sitemap.xml"]</code></td></tr><tr><td>ignoreQueryParams<br><br></td><td>Filters out specified query parameters from crawled URLs. This can help you avoid indexing duplicate URLs.<br>Example: <code>ignoreQueryParams: ["key", "q"]</code></td></tr><tr><td>ignoreAllQueryParams<br></td><td>Filters out all query parameters from crawled URLs. Defaults to true.<br>Example: <code>ignoreAllQueryParams: true</code></td></tr><tr><td>extractUrls<br><br></td><td>The crawler will extract URLs from crawled pages and continue crawling them. Only URLs that match one of the start URLs will be extracted. Defaults to true.<br>Example: <code>extractUrls: true</code></td></tr><tr><td>concurrency<br><br></td><td>Number of concurrent tasks the crawler can run. Running multiple concurrent tasks may cause your server to return 429 errors. Accepts a value between 1 and 5, defaults to 1.
<br>Example: <code>concurrency: 1</code></td></tr><tr><td><p>ignoreNoFollowTo</p><p><br></p></td><td>Whether the crawler should follow links with <code>rel="nofollow"</code>. Defaults to true.<br>Example: <code>ignoreNoFollowTo: true</code></td></tr><tr><td>followRedirect</td><td>Whether the crawler should follow redirect URLs or not.<br>If redirects are disabled, the current URL will be marked as deleted.<br>Defaults to true.<br>Example: <code>followRedirect: true</code></td></tr><tr><td>maxDepth<br><br><br><br><br><br><br></td><td><p>Limits how deep the crawler goes into the website. Example: <code>maxDepth: 3</code><br>With a max depth of three:</p><ul><li>http://example.com <strong>depth = 1 ✅</strong></li><li>http://example.com/products <strong>depth = 2 ✅</strong></li><li>http://example.com/products/category <strong>depth = 3 ✅</strong></li><li>http://example.com/products/category/detail <strong>depth = 4 ❌</strong></li></ul></td></tr><tr><td>maxRetries<br></td><td><p>How many times the crawler should retry a failed URL. Accepts a value between 0 and 8, defaults to 5.</p><p>Example: <code>maxRetries: 5</code></p></td></tr><tr><td>maxUrls<br><br></td><td><p>Limits the number of URLs the crawler will process; this also applies to the startUrls. Accepts a value greater than or equal to 1.</p><p>Example: <code>maxUrls: 1</code></p></td></tr><tr><td><p>characterLimit</p><p><br><br></p></td><td><p>Limits the number of characters the crawler can use. Can be helpful to keep your costs under control. Accepts a value greater than 500.</p><p>Example: <code>characterLimit: 10000</code></p></td></tr><tr><td>delay<br><br><br><br><br><br></td><td><p>Time, in milliseconds, the crawler waits before crawling the next URL.
This can help to avoid 429 errors.</p><ul><li>Value between 0 and 30 000: <code>delay: 4000</code></li><li>Range: <code>delay: {min: 500, max: 30000}</code></li></ul><p>When a range is set, a random delay within it is generated on each crawl.</p></td></tr><tr><td>timeout<br><br><br><br></td><td><p>Time, in milliseconds, the crawler waits to get a response from each URL. When the timeout period expires, the document will be marked as failed.<br>Accepts a value between 0 and 30 000, defaults to 30 000.</p><p>Example: <code>timeout: 30000</code></p></td></tr><tr><td><p>renderJavaScript</p><p><br><br></p></td><td><p>When enabled, all web pages are rendered with a headless Chrome browser. This is slower but crawls pages in their most realistic form.<br>Defaults to true.</p><p>Example: <code>renderJavaScript: true</code></p></td></tr><tr><td>pathsToMatch<br><br></td><td><p>Only URLs that match <strong>one</strong> of the regex patterns defined in the array will be crawled. <em>URLs must also match one of your startUrls.</em></p><p>Example: <code>pathsToMatch: ["/products\/*/i"]</code></p><p><em>The regex must be enclosed in a string</em> and can contain flags.</p></td></tr><tr><td>pathsToExclude<br></td><td><p>URLs that match <strong>one</strong> of the regex patterns defined in the array will not be crawled.
</p><p>Example: <code>pathsToExclude: ["/products\/*/i"]</code><br><em>The regex must be enclosed in a string</em> and can contain flags.</p></td></tr><tr><td><p>documentsToInclude</p><p></p><p></p><p><br><br></p></td><td><p>A document will only be crawled if its extension, content-type, MIME type, or <a href="../attributes#document-types">document type</a> is included in the list.</p><p>If you provide an empty array, all documents will be ignored.<br></p><p>Example: <code>documentsToInclude: ["word", "text/plain", "csv"]</code></p></td></tr><tr><td><p>documentsToExclude</p><p><br><br></p></td><td>A document will only be crawled if its extension, content-type, MIME type, or <a href="../attributes#document-types">document type</a> is <strong>not</strong> included in the list.<br>Example: <code>documentsToExclude: ["word", "text/plain", "csv"]</code></td></tr><tr><td><p>htmlParser</p><p><br><br><br><br><br><br><br></p></td><td><p>Defines how the HTML of your webpage should be parsed. The parsing process might impact the performance of your agent.</p><p>Defaults to smart.<br>Example: <code>htmlParser: smart</code></p><p>Parser types:</p><ul><li><code>plain</code>: basic HTML parser</li><li><code>smart</code>: preserves document structure, which can be important for context</li></ul></td></tr><tr><td><p>headers</p><p><br><br><br></p></td><td>HTTP headers that will be added to every request your crawler makes.<br>Example: <code>headers: { "authentication": "JWT token" }</code><br><br><em>Only available for <code>renderJavaScript: true</code></em></td></tr><tr><td>userAgent<br><br><br></td><td><p>Customizes the userAgent for the crawler.
</p><p>Example: <code>userAgent: ChathiveCrawler/1.0</code></p><p><br><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr><tr><td><p>languages<br></p><p><br><br><br><br><br><br></p></td><td><p>Crawl only webpages that match your configured languages.</p><ul><li><strong>Languages</strong>: array of <code>ISO 639-1</code> language codes</li><li><strong>Strict</strong>:<br>if false, documents without a language will also be crawled<br>if true, only documents whose language matches your list will be crawled</li></ul><p>Example: <code>languages: {languages: ["nl", "en"], strict: false}</code></p></td></tr><tr><td>restoreDeletedPages<br><br><br></td><td><p>Defines whether the crawler should restore a document that was crawled previously but has been deleted in Chathive.</p><p>Defaults to false.</p><p>Example: <code>restoreDeletedPages: false</code></p></td></tr><tr><td>acceptCookieConsent<br><br><br><br></td><td><p>Accepts the cookie consent policy to prevent it from being crawled.<br>Uses an HTML querySelector to accept the cookie policy.</p><p>Example: <code>acceptCookieConsent: { selector: "#buttonId" }</code></p><p><br><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr><tr><td>localStorage<br><br><br></td><td><p>The crawler will pre-fill the webpage's local storage on crawl.</p><p>Example: <code>localStorage: { "key": "value" }</code></p><p><br><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr><tr><td>sessionStorage<br><br><br></td><td><p>The crawler will pre-fill the webpage's session storage on crawl.
</p><p>Example: <code>sessionStorage: { "key": "value" }</code></p><p><br><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr><tr><td>cookies<br><br><br><br><br></td><td><p>The crawler will set cookies for the webpage on crawl; this can be helpful for crawling behind a login wall.</p><p>Example: <code>cookies: [{ name: "key", value: "value", domain: "example.com" }]</code></p><p><br><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr><tr><td><p>credentials</p><p><br><br><br><br></p></td><td><p>Provides HTTP authentication credentials to the crawler; this can be helpful for crawling behind a login wall.</p><p>Example: <code>credentials: { username: "name", password: "pass" }</code></p><p><br><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr><tr><td><a href="crawler-config/login">login</a><br><br></td><td><p>Configure complex login methods to crawl behind a login wall.</p><p><br><em>Only available for <code>renderJavaScript: true</code></em></p></td></tr><tr><td>projectIds<br><br></td><td><p>Defines which project the crawler saves the crawled documents to. The ID can be found in the URL of your project.</p><p>Example: <code>projectIds: ["66769ce07734ec30a64090e2"]</code></p></td></tr><tr><td>advancedLogs<br><br><br></td><td>Logs more of the steps and actions your crawler performed. Helpful for debugging or better understanding your crawler, but may make the log files harder to navigate.<br>Example: <code>advancedLogs: false</code></td></tr><tr><td><p>dryRun<br></p><p><br></p></td><td><p>In dry run mode, crawled documents will not be saved. This setting is purely for debugging/testing your config without wasting resources.</p><p>Example: <code>dryRun: false</code></p></td></tr><tr><td>debug<br><br><br><br></td><td><p>In debug mode the crawler runs like dry run mode, but it will only crawl 5 documents and log their contents.
This can be helpful for debugging your parser or other settings that depend on the document content.</p><p>Example: <code>debug: false</code></p></td></tr></tbody></table>
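A config combining several of the parameters above might look like the following sketch. The YAML file layout is an assumption; the parameter names, value ranges, and plan labels come from the table above.

```yaml
# Hypothetical crawler config; each parameter is documented in the table above.
schedule: weekly                        # requires the advanced plan or higher
executeAt: "18:30"                      # hh:mm
startUrls: ["https://example.com"]
sitemaps: ["https://example.com/sitemap.xml"]
ignoreQueryParams: ["key", "q"]
concurrency: 2                          # 1-5; higher values may trigger 429 errors
maxDepth: 3
maxRetries: 5
delay: { min: 500, max: 30000 }         # random delay within the range, in ms
renderJavaScript: true                  # required for headers, cookies, login, etc.
pathsToMatch: ['/products\/*/i']        # regex enclosed in a string, flags allowed
languages: { languages: ["nl", "en"], strict: false }
projectIds: ["66769ce07734ec30a64090e2"]
dryRun: true                            # remove once the config is verified
```

Starting with `dryRun: true` (or `debug: true`) lets you verify which URLs the crawler picks up before spending resources on a full crawl.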
