Automated and self-cleaning WebCrawler
Fireflies
To enhance our AI agent's capabilities, we need an automated and self-cleaning WebCrawler. It should run on a configurable schedule, automatically adding new URLs and removing those that are no longer present in the sitemap. This would ensure our AI agent always has the most up-to-date information, improving its efficiency and accuracy.
Milou Wolsing
Merged in a post:
Automatically crawl URLs
Freek Vermolen
I would like the Web Crawler to automatically crawl my URLs.
Wouter Rosekrans
It would be nice if it were possible to set the frequency with which this update runs. For our situation, it would be ideal if the web crawler could compare the sitemap from the most recent crawl with the current sitemap and then apply only the changes: remove pages from the knowledge base that have disappeared from the sitemap and add pages that are new.
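A minimal sketch of how such a sitemap diff could work, assuming a standard XML sitemap and hypothetical add_url/remove_url calls into the knowledge base (neither is part of the original request):

```python
# Sketch of the sitemap-diff approach described above.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def fetch_sitemap_urls(sitemap_url: str) -> set[str]:
    """Download the sitemap and return the set of <loc> URLs it lists."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    return {loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text}

def sync_knowledge_base(previous_urls: set[str], sitemap_url: str) -> set[str]:
    """Compare the current sitemap with the last crawl and apply only the diff."""
    current_urls = fetch_sitemap_urls(sitemap_url)
    for url in current_urls - previous_urls:   # new pages -> add to the knowledge base
        add_url(url)                           # hypothetical knowledge-base call
    for url in previous_urls - current_urls:   # removed pages -> clean up
        remove_url(url)                        # hypothetical knowledge-base call
    return current_urls  # persist for the next scheduled run (e.g. daily)
```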
Fleur Nouwens
Hey Fireflies, thanks for your feedback! Following up on this:
- What specific frequency do you envision for the WebCrawler to run (e.g., daily, weekly)?
- Are there any specific types of URLs or content that should be prioritized or excluded by the WebCrawler?
- How should the WebCrawler handle URLs that are temporarily unavailable or return errors?
Freek Vermolen
I think once a day, and maybe a maximum of 5 URLs.
Fleur Nouwens
Hey Freek Vermolen, thanks for your feedback! Following up on this:
- What specific types of URLs do you want the crawler to target?
- How frequently would you like the URLs to be crawled?
- Are there any specific data points or information you want to extract from the crawled URLs?