Data Sources
Note: Google verification for our data source connectors is still pending. If you need access to Google integrations (such as Google Drive) in the meantime, contact support@magus.digital. If you’re interested in other integrations, reach out to us as well.
Magus supports several types of data sources you can connect to your custom agents. Agents can search ingested data, monitor for errors, and proactively alert you when something needs attention.
Supported Sources
Web
Connect web-based content (documentation sites, CMS APIs, support pages, wikis) to your agents as a knowledge source. Your agent can search and reference this content semantically, just like uploaded files.
You provide a seed URL and Magus auto-detects the best way to discover pages:
- OpenAPI / Swagger: for CMS APIs and documented REST endpoints. Point it at an OpenAPI spec URL and Magus ingests all GET endpoints as searchable documentation. Supports tag and path filtering.
- Sitemap: for sites with a `sitemap.xml`. Magus parses the sitemap and ingests all listed pages.
- Link Following: for sites without a sitemap or API. Magus crawls from the seed URL following links, with configurable allowed domains, path prefixes, max depth, and max pages.
- Pagination: for paginated APIs using `Link: <url>; rel="next"` headers or JSON cursor fields.
Magus auto-selects the right strategy, or you can choose one explicitly.
Setting up a web source:
- Go to Knowledge and add a new source. Select Web as the provider.
- Enter the seed URL (an OpenAPI spec URL, a docs site root, or any web page).
- Optionally set the strategy, authentication (bearer token or basic auth), and boundary rules (allowed domains, path prefixes, max depth).
- Create a collection from the source and trigger a full sync.
Magus syncs web sources on a configurable schedule (default: hourly). During incremental sync, new pages are ingested, removed pages are soft-deleted, and changed pages are re-chunked and re-embedded. Content hashes (SHA-256) ensure only genuinely changed content triggers re-processing.
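The content-hash check described above can be sketched like this (function names are illustrative, not Magus internals):

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 of the page content, used to detect real changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reprocessing(stored_hash: str, new_text: str) -> bool:
    """Only re-chunk and re-embed when the content actually changed."""
    return content_hash(new_text) != stored_hash
```

Because the hash is over the extracted content, cosmetic fetch differences (headers, timestamps in markup that is stripped) do not trigger re-embedding.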
For HTML content, Magus uses Spider.cloud for clean content extraction (requires SPIDER_API_KEY). Non-HTML sources are fetched directly. Magus respects robots.txt and applies a default 500ms delay between requests.
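The politeness rules amount to a robots.txt check plus a delay between requests. A sketch using Python's standard library (the inline robots rules stand in for a fetched `robots.txt`; the function is illustrative):

```python
import time
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
# Normally fetched from https://example.com/robots.txt; inlined for illustration.
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def polite_fetch_allowed(url: str, delay_seconds: float = 0.5) -> bool:
    """Check robots.txt before fetching and pause between requests."""
    if not robots.can_fetch("*", url):
        return False
    time.sleep(delay_seconds)  # default 500ms delay between requests
    return True
```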
Log Source
Ingest application logs via webhook. Works with any log shipper that can POST JSON, including Fly Log Shipper (Vector), Logflare, or custom HTTP senders.
What your agent can do:
- Search recent logs by keyword, severity, or time range
- Get a health summary (error counts, last activity)
- Receive alerts when error thresholds are breached (e.g., “5 errors in 5 minutes”)
RSS Feed
Subscribe to RSS or Atom feeds. Magus polls the feed at a configurable interval and ingests new items automatically.
What your agent can do:
- Search feed content by keyword or time range
- Get a summary of recent items
- Receive notifications when new items appear
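A polling cycle comes down to fetching the feed and extracting new items. A minimal RSS 2.0 parsing sketch using the standard library (Atom feeds would additionally need namespace handling; the sample feed is made up):

```python
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example</title>
  <item><title>First post</title><pubDate>Sat, 21 Mar 2026 10:00:00 GMT</pubDate></item>
  <item><title>Second post</title><pubDate>Sun, 22 Mar 2026 10:00:00 GMT</pubDate></item>
</channel></rss>"""

def feed_item_titles(feed_xml: str) -> list[str]:
    """Extract item titles from an RSS 2.0 feed document."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]
```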
Setup
Step 1: Create or edit a custom agent
Go to Agents and create a new agent (or edit an existing one). The agent must have integrations enabled (i.e., integrations must not appear in `disabled_tool_categories`).
Step 2: Add a data source integration
In the agent’s settings, go to Integrations and add one:
For Log Source:
- Select Log Source as the provider
- Configure thresholds (optional):
  - Error threshold: number of errors in the window to trigger an alert (default: 5)
  - Window minutes: rolling window size (default: 5)
  - Retention days: how long to keep entries (default: 7)
- Save and activate the integration
- Copy the webhook URL (you’ll need it for your log shipper)
For RSS Feed:
- Select RSS Feed as the provider
- Enter configuration:
  - Feed URL: the RSS or Atom feed URL (e.g., `https://example.com/feed.xml`)
  - Poll interval minutes: how often to check for new items (default: 30)
  - Retention days: how long to keep entries (default: 30)
- Save and activate the integration
Step 3: Configure your log shipper (Log Source only)
Point your log shipper’s HTTP output at the webhook URL from Step 2.
Fly.io with Log Shipper (Vector)
Deploy the Fly Log Shipper app and configure Vector’s HTTP sink:
```toml
[sinks.magus]
type = "http"
uri = "https://your-magus-instance.com/webhooks/log_source/YOUR_INTEGRATION_ID"
encoding.codec = "json"
```
Custom HTTP sender
POST JSON to your webhook URL in this format:
```json
{
  "message": "GenServer terminating: timeout",
  "level": "error",
  "timestamp": "2026-03-21T10:30:00Z",
  "metadata": {
    "fly_region": "iad",
    "app": "myapp",
    "instance": "abc123"
  }
}
```
For batch sending:
```json
{
  "entries": [
    {"message": "Request started", "level": "info", "timestamp": "..."},
    {"message": "DB timeout", "level": "error", "timestamp": "..."}
  ]
}
```
Supported fields:
- `message` (required): the log line content
- `level` (optional): `debug`, `info`, `warning`, `error`, `critical` (defaults to `info` if missing)
- `timestamp` (optional): ISO 8601 datetime (defaults to the current time)
- `metadata` (optional): any structured data you want to include
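With those fields in hand, a custom sender is just a JSON POST. A sketch using only the Python standard library (`build_log_request` is an illustrative helper, and the URL is the placeholder from the Vector example; send the request with `urllib.request.urlopen`):

```python
import json
import urllib.request

def build_log_request(webhook_url: str, entries: list[dict]) -> urllib.request.Request:
    """Build a batch POST for the log-source webhook."""
    body = json.dumps({"entries": entries}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_log_request(
    "https://your-magus-instance.com/webhooks/log_source/YOUR_INTEGRATION_ID",
    [{"message": "DB timeout", "level": "error", "timestamp": "2026-03-21T10:30:00Z"}],
)
```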
Step 4: Enable agent tools
The data source tools (Search Ingested Data and Get Source Status) are available under the integration’s enabled tools. Make sure they’re enabled for your agent.
How alerts work
Data sources don’t wake your agent on every log line or feed item. Instead:
For logs: a threshold checker runs after each batch of ingested entries. If the number of errors (severity error or critical) in the configured rolling window meets or exceeds the threshold, a single summarized inbox event is created for your agent’s triage. The event includes:
- How many errors occurred
- The top distinct error messages
- Sample entry IDs for investigation
The inbox event uses idempotency keys to avoid duplicate alerts within the same window.
For RSS: when new items are ingested during a poll, a summarized inbox event is created listing the titles of new items. One event per day per feed.
In both cases, the inbox event has deferred urgency: your agent’s triage will pick it up on the next heartbeat sweep, not immediately. This keeps costs low while ensuring nothing is missed.
Automatic cleanup
Ingested entries are automatically purged based on your configured retention_days (default: 7 for logs, 30 for RSS). A daily maintenance job runs at 3 AM UTC and deletes entries older than the retention period.
Agent tools reference
Search Ingested Data (search_ingested_data)
Search across all your data sources. Available parameters:
| Parameter | Type | Description |
|---|---|---|
| `source_type` | `log`, `rss`, `email` | Filter by source type (optional) |
| `query` | string | Text search on content and title (optional) |
| `severity` | `critical`, `error`, `warning`, `info`, `debug` | Filter by severity (optional) |
| `since` | ISO 8601 datetime | Start of time range (optional) |
| `until` | ISO 8601 datetime | End of time range (optional) |
| `limit` | integer | Max results, default 20 (optional) |
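A hypothetical set of call arguments following the parameter table (the field names come from the table above; the values are made up):

```python
# Example arguments for a search_ingested_data tool call.
search_args = {
    "source_type": "log",
    "query": "timeout",
    "severity": "error",
    "since": "2026-03-21T00:00:00Z",
    "limit": 10,
}

# All keys must come from the documented parameter set.
ALLOWED_PARAMS = {"source_type", "query", "severity", "since", "until", "limit"}
```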
Get Source Status (get_source_status)
Get a health summary of your data sources. Available parameters:
| Parameter | Type | Description |
|---|---|---|
| `source_type` | `log`, `rss`, `email` | Filter by source type (optional) |
Returns per-source: total entries in the last hour, error count, last sync time, and current configuration.
Crash detection
The log source automatically detects crash signatures and escalates their severity to critical:
- `GenServer terminating`
- `(EXIT)`
- `SIGTERM` / `SIGKILL`
- `(RuntimeError)` / `(FunctionClauseError)`
- `Process.*crashed`
- `Ranch listener.*connection process.*exit`
These patterns are checked during ingestion; no LLM cost involved.
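The escalation logic can be sketched as a simple regex scan (the pattern list is reconstructed from the signatures above; the exact regexes Magus uses may differ):

```python
import re

# Crash signatures checked at ingestion time (illustrative list).
CRASH_PATTERNS = [
    r"GenServer terminating",
    r"\(EXIT\)",
    r"SIGTERM|SIGKILL",
    r"\(RuntimeError\)",
    r"\(FunctionClauseError\)",
    r"Process.*crashed",
    r"Ranch listener.*connection process.*exit",
]

def effective_severity(message: str, level: str) -> str:
    """Escalate to critical when a message matches a crash signature."""
    if any(re.search(p, message) for p in CRASH_PATTERNS):
        return "critical"
    return level
```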
Deduplication
Entries are deduplicated per integration using a SHA-256 hash of the content. If the same log line or RSS item is sent twice, the second ingestion is silently skipped. This means:
- Re-polling an RSS feed won’t create duplicate entries
- Log shipper retries won’t create duplicates
- The dedup is per-integration, so the same content in different integrations creates separate entries
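The dedup rule above can be sketched as a set keyed on integration plus content hash (names are illustrative, not Magus internals):

```python
import hashlib

seen: set[tuple[str, str]] = set()  # (integration_id, content_hash)

def ingest(integration_id: str, content: str) -> bool:
    """Return True if the entry is new; False if it is a duplicate and skipped."""
    key = (integration_id, hashlib.sha256(content.encode("utf-8")).hexdigest())
    if key in seen:
        return False
    seen.add(key)
    return True
```

Note that the same content under a different integration ID produces a different key, so it is stored separately, as described above.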