Client failover configuration

This page is the configuration reference for client failover. For the model behind these keys — host-health states, zone tiers, role filtering, and the two retry loops — read Concepts first.

Common keys

addr and auth_timeout_ms apply to every WS / WSS / HTTP / HTTPS client. zone is accepted everywhere but only takes effect on egress; target is an egress-only key and is rejected as an unknown key on an ingress connect string. They are documented in full on the connect-string reference; the table below summarises the failover-relevant subset.

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| addr | host:port[,host:port…] | required | Comma-separated peer list. The two syntactic forms (addr=h1,h2 and repeated addr=h1;addr=h2) accumulate. Empty entries are rejected. |
| zone | string | unset | Client's zone identifier (opaque, case-insensitive; for example eu-west-1a or dc-amsterdam). Egress prefers same-zone peers when target is any or replica. Silently accepted but ignored on ingress. |
| target | any \| primary \| replica | any | Egress only. Which server role the query client accepts. Rejected as an unknown key on an ingress connect string. See Role filter for the role table. |
| auth_timeout_ms | int (ms) | 15000 | Upper bound on the HTTP-upgrade response read per host. Does not cover the TCP connect or TLS handshake; those use the OS default. Set lower if you have well-known network paths and want faster failover; set higher only if upgrade is genuinely slow. |

addr syntax — both of these are equivalent and produce the same three-peer list:

addr=node-a:9000,node-b:9000,node-c:9000
addr=node-a:9000;addr=node-b:9000;addr=node-c:9000
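
The accumulation rule can be illustrated with a minimal sketch. This is not the client's parser: the class AddrListSketch and the method parseAddrs are hypothetical names, and the input is assumed to be the key=value portion of a connect string without the scheme prefix. It only shows that both forms yield the same ordered peer list and that an empty entry is rejected.

```java
import java.util.ArrayList;
import java.util.List;

public class AddrListSketch {
    // Hypothetical helper: collects every addr entry from a ';'-separated
    // key=value string, in order of appearance.
    static List<String> parseAddrs(String params) {
        List<String> peers = new ArrayList<>();
        for (String pair : params.split(";")) {
            if (pair.isEmpty()) {
                continue; // tolerate a trailing or doubled ';'
            }
            int eq = pair.indexOf('=');
            if (eq < 0 || !pair.substring(0, eq).equals("addr")) {
                continue; // other keys are ignored in this sketch
            }
            // limit -1 keeps trailing empty strings, so "a,b," is rejected too
            for (String entry : pair.substring(eq + 1).split(",", -1)) {
                if (entry.isEmpty()) {
                    throw new IllegalArgumentException("empty addr entry");
                }
                peers.add(entry);
            }
        }
        return peers;
    }

    public static void main(String[] args) {
        // Both lines print [node-a:9000, node-b:9000, node-c:9000]
        System.out.println(parseAddrs("addr=node-a:9000,node-b:9000,node-c:9000"));
        System.out.println(parseAddrs("addr=node-a:9000;addr=node-b:9000;addr=node-c:9000"));
    }
}
```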

Ingress (write)

The ingress reconnect loop is driven by store-and-forward connect-string keys. See Store-and-forward configuration and the connect-string reference for the full list. The failover-relevant keys are:

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| reconnect_max_duration_millis | int (ms) | 300000 (5 min) | Per-outage wall-clock budget. Resets on every successful reconnect. Size this to span your largest expected failover window, but keep it short enough to surface permanent topology issues. |
| reconnect_initial_backoff_millis | int (ms) | 100 | Starting backoff sleep at round exhaustion. Doubles up to reconnect_max_backoff_millis. |
| reconnect_max_backoff_millis | int (ms) | 5000 | Cap on the exponential backoff. With equal-jitter, the actual sleep lands in [max, 2·max) once the base saturates. |
| initial_connect_retry | off \| on \| async | off | Whether to apply the same retry loop to the very first connect attempt. See below. |
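
To make the doubling rule concrete, the sketch below computes the base backoff per round for the default values (100 ms initial, 5000 ms cap). It is an illustration only: ReconnectBackoffSketch and baseSchedule are hypothetical names, and the jitter the client applies on top of the base is deliberately not modelled.

```java
import java.util.Arrays;

public class ReconnectBackoffSketch {
    // Hypothetical helper: base backoff (before jitter) for each round,
    // doubling from initialMillis and saturating at maxMillis.
    static long[] baseSchedule(long initialMillis, long maxMillis, int rounds) {
        long[] out = new long[rounds];
        long base = Math.min(initialMillis, maxMillis);
        for (int i = 0; i < rounds; i++) {
            out[i] = base;
            base = Math.min(base * 2, maxMillis);
        }
        return out;
    }

    public static void main(String[] args) {
        // Defaults: reconnect_initial_backoff_millis=100, reconnect_max_backoff_millis=5000
        // prints [100, 200, 400, 800, 1600, 3200, 5000, 5000]
        System.out.println(Arrays.toString(baseSchedule(100, 5000, 8)));
    }
}
```

With the defaults, the base saturates at the cap from the seventh round onward, so most of a long outage is spent sleeping at the capped value.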

initial_connect_retry

By default, the first connect failure is terminal — typically the first attempt failing means a misconfiguration (wrong host, wrong port, no network), and retrying for five minutes only hides it.

| Value | Behaviour |
| --- | --- |
| off (default; alias false) | First-connect failure is terminal. The producer's call to build the sender throws immediately. |
| on (aliases sync, true) | First-connect failures enter the same reconnect loop as mid-stream failures. The constructor blocks until success or the per-outage budget expires. |
| async | The constructor returns immediately; the background I/O thread drives the reconnect loop. The producer experiences backpressure if it tries to publish before the connection comes up. Intended for unattended producers where the SF directory may already carry segments from a prior process and the server may come up later. |
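
The alias mapping in the table can be summarised as a small lookup. The sketch below is not the client's code: InitialConnectRetrySketch, Mode, and parse are hypothetical names, and exact-match lowercase comparison is an assumption, since this page does not say whether values are case-sensitive.

```java
public class InitialConnectRetrySketch {
    // Hypothetical enum mirroring the three behaviours in the table.
    enum Mode { OFF, ON, ASYNC }

    // Hypothetical helper: maps a connect-string value (and its aliases) to a mode.
    // Assumes lowercase values; case handling in the real client is not documented here.
    static Mode parse(String value) {
        switch (value) {
            case "off":
            case "false":
                return Mode.OFF;
            case "on":
            case "sync":
            case "true":
                return Mode.ON;
            case "async":
                return Mode.ASYNC;
            default:
                throw new IllegalArgumentException("unknown initial_connect_retry value: " + value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("sync"));  // prints ON
        System.out.println(parse("false")); // prints OFF
        System.out.println(parse("async")); // prints ASYNC
    }
}
```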

Egress (query)

The egress failover loop wraps each execute() call on the read-side query client. The full key list lives on the connect-string reference; the user-visible knobs are:

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| failover | on \| off | on | Global on/off. With failover=off, a single failed execute() call surfaces the underlying error without walking the address list. |
| failover_max_attempts | int | 8 | Hard cap on attempts within a single execute() call. |
| failover_max_duration_ms | int (ms) | 30000 | Wall-clock budget for failover eligibility. Bounds when failover stops, not the wall-clock of execute() itself: a final WalkTracker round can still cost up to hostCount × auth_timeout_ms after the budget expires. |
| failover_backoff_initial_ms | int (ms) | 50 | Starting backoff sleep. Doubles up to the cap. |
| failover_backoff_max_ms | int (ms) | 1000 | Cap on the exponential backoff. With full-jitter, the actual sleep lands in [0, max). |
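
Two of the notes above lend themselves to a quick calculation: the rough upper bound on a single execute() call, and the range of a full-jitter sleep. The sketch below is illustrative only. EgressBudgetSketch and its methods are hypothetical names, the three-host count is an assumption, and the bound ignores TCP connect and TLS handshake time, which auth_timeout_ms does not cover.

```java
import java.util.concurrent.ThreadLocalRandom;

public class EgressBudgetSketch {
    // Rough bound: the failover budget plus one final walk of the address list,
    // each host costing at most auth_timeout_ms for the upgrade read.
    static long roughUpperBoundMillis(long failoverMaxDurationMs, int hostCount, long authTimeoutMs) {
        return failoverMaxDurationMs + hostCount * authTimeoutMs;
    }

    // Base backoff before jitter: doubles from initialMs and saturates at maxMs.
    static long baseBackoffMillis(long initialMs, long maxMs, int attempt) {
        long base = Math.min(initialMs, maxMs);
        for (int i = 0; i < attempt && base < maxMs; i++) {
            base *= 2;
        }
        return Math.min(base, maxMs);
    }

    // Full jitter: a uniform sample in [0, base). Assumes base is positive.
    static long fullJitterSleepMillis(long initialMs, long maxMs, int attempt) {
        return ThreadLocalRandom.current().nextLong(baseBackoffMillis(initialMs, maxMs, attempt));
    }

    public static void main(String[] args) {
        // Defaults with an assumed three-host addr list: 30000 + 3 * 15000
        System.out.println(roughUpperBoundMillis(30_000, 3, 15_000)); // prints 75000
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt
                    + " base=" + baseBackoffMillis(50, 1_000, attempt)
                    + " sleep=" + fullJitterSleepMillis(50, 1_000, attempt));
        }
    }
}
```

Under these assumptions, an interactive caller that needs a firm deadline should lower auth_timeout_ms as well as failover_max_duration_ms, since the final walk is bounded by the former rather than the latter.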

Worked examples

Three-node Enterprise cluster, default failover

Most users need only the addr list — defaults cover the rest.

try (Sender sender = Sender.fromConfig(
        "ws::addr=node-a:9000,node-b:9000,node-c:9000;sf_dir=/var/lib/qdb-sender;")) {
    sender.table("events")
            .symbol("source", "edge-42")
            .longColumn("count", 1)
            .atNow();
}

The ws:: scheme picks the QWP WebSocket transport. sf_dir enables the disk-backed store-and-forward substrate, which keeps unacked data across sender restarts; see Store-and-forward concepts.

Zone-aware read replicas

For read-only queries spread across same-zone replicas, with a primary as final fallback:

try (QwpQueryClient client = QwpQueryClient.fromConfig(
        "ws::addr=replica-eu-1a:9000,replica-eu-1b:9000,primary:9000;"
                + "zone=eu-west-1a;target=any;")) {
    client.connect();
    // handler is a QwpColumnBatchHandler that receives the result batches
    client.execute("SELECT * FROM trades WHERE ts > now() - 1h", handler);
}

Setting target=replica would skip the primary entirely; target=any is usually preferable so the query still completes after a replica outage.

Long-tolerated ingest with async first connect

Useful for unattended ingest processes (edge sensors, ETL jobs) that may restart before the server comes up:

try (Sender sender = Sender.fromConfig(
        "ws::addr=primary:9000;sf_dir=/var/lib/qdb-sender;"
                + "initial_connect_retry=async;"
                + "reconnect_max_duration_millis=1800000;")) {
    // appendBlocking() will absorb up to sf_max_total_bytes of writes
    // while the I/O thread retries the initial connect.
}

The 30-minute reconnect budget gives a wide failover window; the async initial-connect policy lets the producer thread proceed immediately.

Tight egress failover for an interactive dashboard

try (QwpQueryClient client = QwpQueryClient.fromConfig(
        "ws::addr=node-a:9000,node-b:9000;"
                + "failover_max_duration_ms=5000;failover_max_attempts=3;")) {
    client.connect();
    // Surfaces an error within a few seconds if the cluster is unreachable.
}

Where each key is documented

| Key | Concept | Reference |
| --- | --- | --- |
| addr, zone, target, auth_timeout_ms | Host selection, role filter | connect-string #failover-keys |
| reconnect_*, initial_connect_retry | Ingress retry budget | connect-string #reconnect-keys |
| failover, failover_* | Egress retry budget | connect-string #egress-flow |
| username / password / token | Authentication | connect-string #auth |
| tls_* | TLS configuration | connect-string #tls |