Indexing a Website with the Crawler

Learn how to transform an entire website into a knowledge base for your chatbot, without having to manually copy-paste text.

Introduction

The DoxyChat Web Crawler (or indexing bot) is a powerful tool that navigates a single page or an entire site, extracts the text, strips out unnecessary elements (ads, menus, footers), and adds the cleaned content to your chatbot’s knowledge base.

It’s the ideal solution if your documentation is online, or if you want your chatbot to learn about your products and services directly from your storefront.

The Two Import Modes

When you add a Web source, two options are available:

1. “Single Page” Mode

DoxyChat will read only the specific URL you provide. It won’t click on any links.

Recommended use: To add a specific blog article, a “Terms and Conditions” page, or a pricing page, without cluttering the chatbot with the rest of the site.

2. “Entire Site” Mode (Recursive)

DoxyChat starts with the provided URL (usually the homepage), then discovers the site’s subpages, within your plan’s document limit, to build comprehensive knowledge.

Recommended use: To index all your technical documentation, help center, or product catalog.

How Does Automatic Discovery Work?

Our Crawler is designed to be “intelligent” and to find as many relevant pages as possible in as little time as possible. It uses a hybrid strategy:

Sitemap Search (Priority): The bot first checks if your site has a “map” file (sitemap.xml). This is the most reliable method as it provides the exact list of pages you want to index.
Link Exploration (Fallback): If no sitemap is found, the bot analyzes the homepage and follows all internal links (those pointing to the same domain). It navigates from page to page until it has mapped the whole site or reached your plan’s document limit.
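The hybrid strategy above can be sketched in a few lines of Python. This is only an illustration, not DoxyChat's actual implementation: the function names (urls_from_sitemap, internal_links, discover) are hypothetical, the sitemap and homepage contents are passed in as strings instead of being downloaded, and only the first hop of the link-exploration fallback is shown.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml):
    """Return the <loc> entries of a sitemap.xml document, in order."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute, de-duplicated links that stay on the page's domain."""
    collector = _LinkCollector()
    collector.feed(html)
    domain = urlparse(page_url).netloc
    seen, result = set(), []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href).split("#")[0]  # drop fragments
        if urlparse(absolute).netloc == domain and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def discover(start_url, sitemap_xml, homepage_html):
    """Sitemap first (priority); internal links as the fallback."""
    if sitemap_xml:
        urls = urls_from_sitemap(sitemap_xml)
        if urls:
            return urls
    return internal_links(start_url, homepage_html)
```

A full fallback crawl would repeat internal_links on each newly found page until no new URLs appear or the document limit is reached.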

Technical note: Our crawler can read modern and complex sites (JavaScript, React, etc.) thanks to self-healing technology that simulates a real browser when necessary.

Quota and Limit Management

Importing an entire site can represent a large volume of data. DoxyChat includes safeguards to respect your subscription:

Predictive “Slots” Calculation

Before launching the exploration, the system calculates how many documents you can still add according to your plan (Discovery, Starter, Pro…).

Example: If your plan allows 50 documents and you already have 10, the Crawler will index a maximum of 40 pages from the website.
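The arithmetic behind this example is a simple subtraction. The short Python sketch below illustrates it; the function name remaining_slots is hypothetical, and flooring the result at zero is an assumption for the case where a plan is already full.

```python
def remaining_slots(plan_limit, documents_used):
    """Number of pages the Crawler may still index under the plan."""
    # Assumption: a full or over-full plan leaves zero slots, never a negative number.
    return max(plan_limit - documents_used, 0)


print(remaining_slots(50, 10))  # 40
```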

Automatic Stop

As soon as the limit is reached, the Crawler stops cleanly. Already indexed pages are kept and remain active. You’ll receive a notification indicating that the import is partial because your plan’s document limit was reached.
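The stop behavior can be pictured as a loop that checks the remaining slots before each page. The sketch below is an illustration only: ImportResult and import_pages are hypothetical names, and appending the URL stands in for the real fetching and indexing work.

```python
from dataclasses import dataclass, field


@dataclass
class ImportResult:
    indexed: list = field(default_factory=list)  # pages kept after the import
    partial: bool = False                        # True if the limit cut the import short


def import_pages(discovered_urls, slots):
    """Index pages in order and stop cleanly once the slots are used up."""
    result = ImportResult()
    for url in discovered_urls:
        if len(result.indexed) >= slots:
            # Limit reached: keep what is already indexed and flag the import as partial.
            result.partial = True
            break
        result.indexed.append(url)  # placeholder for fetching and indexing the page
    return result
```

With three discovered pages and two free slots, the first two pages are kept and partial is set, which is the condition that would trigger the notification.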

Intelligent Filtering

To save your quota and improve response quality, our bot:

Ignores unnecessary technical pages (shopping carts, user accounts, admin pages).
Prioritizes recent content (current year) on news sites to avoid archiving outdated articles.
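As a rough illustration of these two rules, the Python sketch below drops URLs that look like technical pages and moves older dated articles to the end of the queue. Everything in it is an assumption made for the example: the list of skipped path segments, the idea of reading the year from the URL path, and the function names are hypothetical and do not describe DoxyChat's real filtering rules.

```python
import re
from urllib.parse import urlparse

# Assumed examples of path segments that mark technical pages.
SKIPPED_SEGMENTS = {"cart", "checkout", "account", "login", "admin"}


def is_technical_page(url):
    """True if any path segment matches a skipped keyword."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    return any(s in SKIPPED_SEGMENTS for s in segments)


def url_year(url):
    """Return a four-digit year found as a path segment, or None."""
    match = re.search(r"/((?:19|20)\d{2})(?:/|$)", urlparse(url).path)
    return int(match.group(1)) if match else None


def prioritize_recent(urls, current_year):
    """Drop technical pages, then push older dated pages to the end."""
    kept = [u for u in urls if not is_technical_page(u)]
    # sorted() is stable: undated and current-year pages keep their order,
    # pages dated before current_year move to the back.
    return sorted(kept, key=lambda u: (url_year(u) or current_year) < current_year)
```

Because older pages are crawled last, they are the first to be left out when the document limit is reached.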

Addition Procedure

1. Go to the Sources tab of your chatbot.
2. Select the Website option.
3. Enter the starting URL (e.g., https://mysite.com).
4. Choose the mode: Single page or Entire site.
5. Click on Launch import.

Pages will progressively appear in your source list as they are discovered and processed.