Client failover configuration
This page is the configuration reference for client failover. For the model behind these keys — host-health states, zone tiers, role filtering, and the two retry loops — read Concepts first.
Common keys
addr and auth_timeout_ms apply to every WS / WSS / HTTP / HTTPS client.
zone is accepted everywhere but only takes effect on egress; target is an
egress-only key and is rejected as an unknown key on an ingress connect string.
They are documented in full on the
connect-string reference;
the table below summarises the failover-relevant subset.
| Key | Type | Default | Notes |
|---|---|---|---|
| addr | host:port[,host:port…] | required | Comma-separated peer list. The two syntactic forms (addr=h1,h2 and repeated addr=h1;addr=h2) accumulate. Empty entries are rejected. |
| zone | string | unset | Client's zone identifier (opaque, case-insensitive; for example eu-west-1a or dc-amsterdam). Egress prefers same-zone peers when target is any or replica. Silently accepted but ignored on ingress. |
| target | any \| primary \| replica | any | Egress only. Which server role the query client accepts. Rejected as an unknown key on an ingress connect string. See Role filter for the role table. |
| auth_timeout_ms | int (ms) | 15000 | Upper bound on the HTTP-upgrade response read per host. Does not cover the TCP connect or TLS handshake; those use the OS default. Set lower if you have well-known network paths and want faster failover; set higher only if upgrade is genuinely slow. |
addr syntax — both of these are equivalent and produce the same three-peer
list:
addr=node-a:9000,node-b:9000,node-c:9000
addr=node-a:9000;addr=node-b:9000;addr=node-c:9000
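The accumulation rule can be illustrated with a small parser. This is a minimal sketch, not the client's real parser: `AddrListSketch` and `parseAddrs` are hypothetical names, the input is assumed to be the key=value part of the connect string (after the `ws::` scheme), and all keys other than addr are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the client library.
final class AddrListSketch {

    /** Collects every addr value from the key=value part of a connect string. */
    static List<String> parseAddrs(String keyValues) {
        List<String> peers = new ArrayList<>();
        for (String pair : keyValues.split(";")) {
            if (pair.isEmpty()) {
                continue; // tolerate an empty segment between or after ';'
            }
            int eq = pair.indexOf('=');
            if (eq < 0 || !pair.substring(0, eq).equals("addr")) {
                continue; // other keys are out of scope for this sketch
            }
            // limit -1 keeps trailing empty strings, so "addr=a:1," is caught below
            for (String entry : pair.substring(eq + 1).split(",", -1)) {
                if (entry.isEmpty()) {
                    throw new IllegalArgumentException("empty addr entry");
                }
                peers.add(entry); // both syntactic forms append to the same list
            }
        }
        return peers;
    }

    public static void main(String[] args) {
        System.out.println(parseAddrs("addr=node-a:9000,node-b:9000,node-c:9000"));
        System.out.println(parseAddrs("addr=node-a:9000;addr=node-b:9000;addr=node-c:9000"));
    }
}
```

Both calls in `main` print the same three-peer list, and a mixed string such as `addr=a:1,b:2;addr=c:3` yields three peers in order of appearance.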
Ingress (write)
The ingress reconnect loop is driven by store-and-forward connect-string keys. See Store-and-forward configuration and the connect-string reference for the full list. The failover-relevant keys are:
| Key | Type | Default | Notes |
|---|---|---|---|
| reconnect_max_duration_millis | int (ms) | 300000 (5 min) | Per-outage wall-clock budget. Resets on every successful reconnect. Make it long enough to span your largest expected failover window, but short enough to surface permanent topology issues. |
| reconnect_initial_backoff_millis | int (ms) | 100 | Starting backoff sleep at round exhaustion. Doubles up to reconnect_max_backoff_millis. |
| reconnect_max_backoff_millis | int (ms) | 5000 | Cap on the exponential backoff. With equal-jitter, the actual sleep lands in [max, 2·max) once the base saturates. |
| initial_connect_retry | off \| on \| async | off | Whether to apply the same retry loop to the very first connect attempt. See below. |
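To make the backoff keys concrete, the sketch below walks the doubling schedule with the default values. It is an illustration under stated assumptions, not the client's implementation: `ReconnectBackoffSketch` is a hypothetical name, and the jitter formula (base plus a uniform draw from [0, base)) is assumed because it reproduces the [max, 2·max) range quoted in the table.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the ingress backoff schedule; names and the jitter
// formula are assumptions chosen to match the documented sleep range.
final class ReconnectBackoffSketch {

    /** Backoff base for a given round: doubles from initial, capped at max. */
    static long baseMillis(int round, long initial, long max) {
        long base = Math.min(initial, max);
        for (int i = 0; i < round && base < max; i++) {
            base = Math.min(max, base * 2);
        }
        return base;
    }

    /** Assumed jitter: base + uniform[0, base), so the sleep lands in [base, 2*base). */
    static long sleepMillis(int round, long initial, long max) {
        long base = baseMillis(round, initial, max);
        return base + ThreadLocalRandom.current().nextLong(base);
    }

    public static void main(String[] args) {
        long initial = 100;  // reconnect_initial_backoff_millis default
        long max = 5000;     // reconnect_max_backoff_millis default
        for (int round = 0; round < 8; round++) {
            long base = baseMillis(round, initial, max);
            System.out.println("round " + round + ": base=" + base
                    + " ms, sleep in [" + base + ", " + (2 * base) + ") ms");
        }
    }
}
```

With the defaults, the base reaches the 5000 ms cap at round 6 (100, 200, 400, 800, 1600, 3200, 5000), after which every sleep falls in [5000, 10000) ms under this assumed formula.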
initial_connect_retry
By default, the first connect failure is terminal. A failed first attempt usually means a misconfiguration (wrong host, wrong port, no network), and retrying for five minutes only hides it.
| Value | Behaviour |
|---|---|
| off (default; alias false) | First-connect failure is terminal. The producer's call to build the sender throws immediately. |
| on (aliases sync, true) | First-connect failures enter the same reconnect loop as mid-stream failures. The constructor blocks until success or the per-outage budget expires. |
| async | The constructor returns immediately; the background I/O thread drives the reconnect loop. The producer experiences backpressure if it tries to publish before the connection comes up. Intended for unattended producers where the SF directory may already carry segments from a prior process and the server may come up later. |
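The value-to-mode mapping, including aliases, can be summarised in a few lines. This is a hypothetical sketch: `InitialConnectRetrySketch` is an illustrative name, and exact lower-case matching is an assumption, since the table does not say whether values are case-sensitive.

```java
// Hypothetical enum for illustration; the client's internal representation may differ.
enum InitialConnectRetrySketch {
    OFF,    // first-connect failure is terminal
    ON,     // constructor blocks inside the reconnect loop
    ASYNC;  // constructor returns; the I/O thread retries in the background

    /** Maps a connect-string value (or alias) to a mode. Assumes exact lower-case match. */
    static InitialConnectRetrySketch parse(String value) {
        switch (value) {
            case "off":
            case "false":
                return OFF;
            case "on":
            case "sync":
            case "true":
                return ON;
            case "async":
                return ASYNC;
            default:
                throw new IllegalArgumentException(
                        "unknown initial_connect_retry value: " + value);
        }
    }

    public static void main(String[] args) {
        for (String v : new String[] {"off", "false", "on", "sync", "true", "async"}) {
            System.out.println(v + " -> " + parse(v));
        }
    }
}
```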
Egress (query)
The egress failover loop wraps each execute() call on the read-side query
client. The full key list lives on the
connect-string reference;
the user-visible knobs are:
| Key | Type | Default | Notes |
|---|---|---|---|
| failover | on \| off | on | Global on/off. With failover=off, a single failed execute() call surfaces the underlying error without walking the address list. |
| failover_max_attempts | int | 8 | Hard cap on attempts within a single execute() call. |
| failover_max_duration_ms | int (ms) | 30000 | Wall-clock budget for failover eligibility. Bounds when failover stops, not the wall-clock of execute() itself: a final WalkTracker round can still cost up to hostCount × auth_timeout_ms after the budget expires. |
| failover_backoff_initial_ms | int (ms) | 50 | Starting backoff sleep. Doubles up to the cap. |
| failover_backoff_max_ms | int (ms) | 1000 | Cap on the exponential backoff. With full-jitter, the actual sleep lands in [0, max). |
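The interaction between the attempt cap and the backoff keys is easier to see with numbers. The sketch below is illustrative only: `EgressBackoffSketch` is a hypothetical name, and it assumes one backoff sleep between consecutive attempts, which the table does not state explicitly.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the egress full-jitter schedule; not the client's code.
final class EgressBackoffSketch {

    /** Upper bound of the n-th sleep: doubles from initial, capped at max. */
    static long capMillis(int sleepIndex, long initial, long max) {
        long cap = Math.min(initial, max);
        for (int i = 0; i < sleepIndex && cap < max; i++) {
            cap = Math.min(max, cap * 2);
        }
        return cap;
    }

    /** Full jitter: uniform draw from [0, cap). */
    static long sleepMillis(int sleepIndex, long initial, long max) {
        return ThreadLocalRandom.current().nextLong(capMillis(sleepIndex, initial, max));
    }

    /** Sum of sleep caps, assuming one sleep between consecutive attempts. */
    static long maxTotalSleepMillis(int attempts, long initial, long max) {
        long total = 0;
        for (int i = 0; i < attempts - 1; i++) {
            total += capMillis(i, initial, max);
        }
        return total;
    }

    public static void main(String[] args) {
        // Defaults: failover_max_attempts=8, initial=50 ms, max=1000 ms.
        System.out.println(maxTotalSleepMillis(8, 50, 1000)); // prints 3550
    }
}
```

Under that assumption, the default backoff contributes less than 3550 ms in total across eight attempts (50 + 100 + 200 + 400 + 800 + 1000 + 1000). When hosts fail fast, the attempt cap is therefore likely to end the loop well before the 30000 ms duration budget; the budget matters mainly when individual attempts are slow.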
Worked examples
Three-node Enterprise cluster, default failover
Most users need only the addr list — defaults cover the rest.
try (Sender sender = Sender.fromConfig(
        "ws::addr=node-a:9000,node-b:9000,node-c:9000;sf_dir=/var/lib/qdb-sender;")) {
    sender.table("events")
            .symbol("source", "edge-42")
            .longColumn("count", 1)
            .atNow();
}
The ws:: scheme picks the QWP WebSocket transport. sf_dir enables the
disk-backed store-and-forward substrate, which keeps unacked data across
sender restarts; see
Store-and-forward concepts.
Zone-aware read replicas
For read-only queries spread across same-zone replicas, with a primary as final fallback:
try (QwpQueryClient client = QwpQueryClient.fromConfig(
        "ws::addr=replica-eu-1a:9000,replica-eu-1b:9000,primary:9000;"
                + "zone=eu-west-1a;target=any;")) {
    client.connect();
    // handler is a QwpColumnBatchHandler that receives the result batches
    client.execute("SELECT * FROM trades WHERE ts > now() - 1h", handler);
}
Setting target=replica would skip the primary entirely; target=any is
usually preferable so the query still completes after a replica outage.
Long-tolerated ingest with async first connect
Useful for unattended ingest processes (edge sensors, ETL jobs) that may restart before the server comes up:
try (Sender sender = Sender.fromConfig(
        "ws::addr=primary:9000;sf_dir=/var/lib/qdb-sender;"
                + "initial_connect_retry=async;"
                + "reconnect_max_duration_millis=1800000;")) {
    // appendBlocking() will absorb up to sf_max_total_bytes of writes
    // while the I/O thread retries the initial connect.
}
The 30-minute reconnect budget gives a wide failover window; the async
initial-connect policy lets the producer thread proceed immediately.
Tight egress failover for an interactive dashboard
try (QwpQueryClient client = QwpQueryClient.fromConfig(
        "ws::addr=node-a:9000,node-b:9000;"
                + "failover_max_duration_ms=5000;failover_max_attempts=3;")) {
    client.connect();
    // Failover stops after 5 s or 3 attempts, whichever comes first.
    // A final round can still add up to hostCount × auth_timeout_ms,
    // so lower auth_timeout_ms as well if that tail matters.
}
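A rough upper bound on how long execute() can take in this setup follows from the failover_max_duration_ms note: the budget plus one final round of hostCount × auth_timeout_ms. The sketch below only does that arithmetic. `ExecuteLatencyBoundSketch` is a hypothetical name, the 1000 ms auth_timeout_ms is an illustrative value rather than a recommendation, and the bound excludes TCP connect and TLS handshake time, which auth_timeout_ms does not cover.

```java
// Hypothetical arithmetic helper; not part of the client library.
final class ExecuteLatencyBoundSketch {

    /**
     * Failover budget plus one final walk of the host list, where each host
     * may consume a full auth_timeout_ms on the upgrade read.
     * TCP connect and TLS handshake time are NOT included.
     */
    static long upperBoundMillis(long failoverMaxDurationMs, int hostCount, long authTimeoutMs) {
        return failoverMaxDurationMs + hostCount * authTimeoutMs;
    }

    public static void main(String[] args) {
        // Two hosts, 5 s budget, default auth_timeout_ms of 15000.
        System.out.println(upperBoundMillis(5000, 2, 15000)); // prints 35000
        // Same setup with an illustrative auth_timeout_ms=1000.
        System.out.println(upperBoundMillis(5000, 2, 1000));  // prints 7000
    }
}
```

With the default auth_timeout_ms the tail can dominate the 5-second budget, which is why an interactive client may want to set both keys.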
Where each key is documented
| Key | Concept | Reference |
|---|---|---|
| addr, zone, target, auth_timeout_ms | Host selection, role filter | connect-string #failover-keys |
| reconnect_*, initial_connect_retry | Ingress retry budget | connect-string #reconnect-keys |
| failover, failover_* | Egress retry budget | connect-string #egress-flow |
| username / password / token | Authentication | connect-string #auth |
| tls_* | TLS configuration | connect-string #tls |