Server-Sent Events look simple on paper: one HTTP connection, a stream of data: lines, and reconnect with replay via Last-Event-ID. The mechanism has shipped in most major browsers since 2011. Over two weeks and 15 commits, most of it still went wrong, though not in the way we expected: only one problem was SSE-specific. The rest were about commit ordering, delivery guarantees, and what happens when a consumer disconnects for a weekend.

This is a walkthrough of the SSE signal feed we built. It streams signal events from a PostgreSQL-backed screener to remote trader_bot consumers over plain HTTP. We cover five things: why we chose SSE over WebSockets, the append-only event log that made replay exact, the migration from PostgreSQL sequences to a gapless counter, the five fixes that hardened the feed in production, and the consumer resilience patterns that emerged. SSE is not the right transport for every problem. This shows what a production implementation looks like and where the non-obvious edges are.

Why SSE, not WebSockets

The feed has one job: push signal events in one direction, from server to consumer. The consumer never sends data upstream; it only receives. A full-duplex transport like WebSocket would buy nothing but complexity: you define your own message protocol, write your own reconnection and replay logic, and monitor the liveness of a persistent socket yourself.

SSE solves several of these out of the box. The Last-Event-ID header gives you replay on reconnect with no custom handshake. The consumer sends its last known feed_seq, and the server resumes from the next event. The :ping heartbeat, an SSE comment line invisible to event handlers, keeps the connection alive through load balancers. Because SSE rides on HTTP/1.1, it inherits proxies, TLS termination, and authentication headers without modification.
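
To make the wire format concrete, here is a minimal sketch of how one feed event, the heartbeat, and the reconnect cursor could be handled. The function names, the "signal" event name, and the payload fields are assumptions for illustration, not the feed's actual code. The header lookup is also simplified: a real server framework would match header names case-insensitively.

```python
import json
from typing import Optional


def format_sse_event(feed_seq: int, payload: dict, event: str = "signal") -> str:
    """Serialize one event as an SSE frame. The id: field is what the
    client echoes back in Last-Event-ID when it reconnects."""
    # Compact JSON contains no raw newlines, so one data: line is enough.
    data = json.dumps(payload, separators=(",", ":"))
    # A blank line terminates the frame.
    return f"id: {feed_seq}\nevent: {event}\ndata: {data}\n\n"


def format_heartbeat() -> str:
    """A line starting with a colon is an SSE comment: event handlers never
    see it, but intermediaries see traffic on the connection."""
    return ":ping\n\n"


def parse_last_event_id(headers: dict) -> Optional[int]:
    """Return the replay cursor, or None if the header is absent or malformed."""
    raw = headers.get("Last-Event-ID")
    if raw is None:
        return None
    try:
        seq = int(raw)
    except ValueError:
        return None
    return seq if seq >= 0 else None


frame = format_sse_event(12346, {"symbol": "ABC"})
cursor = parse_last_event_id({"Last-Event-ID": "12345"})
```

With a cursor of 12345, the server would resume at feed_seq 12346, which is the frame built above.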

The trade-off: SSE pins a long-lived TCP connection to one server instance, which makes horizontal scaling harder. For our use case, a single feed server behind a load balancer with sticky routing, this was acceptable. A restart drops all clients, but each one reconnects and replays its missed events from the cursor. High availability is out of scope for now; the consumer tolerates connection gaps by design.

The append-only event log

The screener inserts and updates rows in stock_market.signal. The naive approach is to have the feed watch that mutable table and stream new or changed rows. This breaks immediately. A poll that runs SELECT ... ORDER BY created_at can miss a row that commits late, because its created_at falls slightly earlier than the high-water mark the feed has already passed. An UPDATE to an existing row has no stable ordering against other events.

Instead we built an append-only event log: stock_market.signal_feed_event. A trigger fires on every INSERT or meaningful UPDATE to stock_market.signal. It appends one immutable row to the log: the full signal snapshot plus event_time. The feed never reads from the mutable signal table. It fetches from the log by a monotonic feed_seq. Rows are never updated or deleted, so replay is exact. A consumer that reconnects with Last-Event-ID: 12345 gets events 12346 onward, with no coalescing, no gaps, and no fetch-by-seq race.
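
Replay by cursor reduces to a single range query over the log. The sketch below uses an in-memory SQLite table as a stand-in for PostgreSQL so that it runs anywhere. The column names other than feed_seq and event_time, the helper names, and the batch limit are assumptions. In the real system the trigger appends the rows, not application code.

```python
import json
import sqlite3

# SQLite stand-in for stock_market.signal_feed_event (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signal_feed_event ("
    " feed_seq INTEGER PRIMARY KEY,"
    " event_time TEXT NOT NULL,"
    " snapshot TEXT NOT NULL)"
)


def append_event(feed_seq: int, event_time: str, snapshot: dict) -> None:
    """Append one immutable row. Rows are never updated or deleted."""
    with conn:
        conn.execute(
            "INSERT INTO signal_feed_event VALUES (?, ?, ?)",
            (feed_seq, event_time, json.dumps(snapshot)),
        )


def replay_after(cursor: int, limit: int = 500) -> list:
    """Return events strictly after the cursor, in feed_seq order."""
    rows = conn.execute(
        "SELECT feed_seq, snapshot FROM signal_feed_event"
        " WHERE feed_seq > ? ORDER BY feed_seq LIMIT ?",
        (cursor, limit),
    ).fetchall()
    return [(seq, json.loads(snap)) for seq, snap in rows]


# An UPDATE to a signal shows up as a second log row, not a changed first row.
append_event(12345, "2024-01-05T14:30:00Z", {"signal_id": 7, "state": "open"})
append_event(12346, "2024-01-05T14:31:00Z", {"signal_id": 8, "state": "open"})
append_event(12347, "2024-01-05T14:35:00Z", {"signal_id": 7, "state": "closed"})

replayed = replay_after(12345)
```

Here replayed holds events 12346 and 12347, in that order. Because the rows never change, the same cursor always yields the same replay.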

The trigger also issues a lightweight pg_notify('signal_feed', '{feed_seq}') to wake the feed’s dedicated LISTEN connection. The previous signal_inserted channel carries the full row payload and stays unchanged for legacy in-house bots that still consume via LISTEN. The feed channel carries only the sequence number, which keeps it well under PostgreSQL’s 8000-byte notify payload limit.

When sequences are not enough

The first version of the event log used a PostgreSQL SEQUENCE to allocate feed_seq values. Sequences are designed for this, so it seemed obvious. The problem: nextval() is not transactional. The increment takes effect immediately and is never undone, whether the surrounding transaction commits or rolls back. Two concurrent transactions each call nextval(). One commits, the other rolls back. Now the log has a gap. Event 100 and event 102 exist, but event 101 never will.

That gap is not cosmetic. The feed’s single worker advances a high-water mark and delivers each consecutive feed_seq in order. It sees event 100, delivers it, and waits for event 101. If 101 was allocated by a rolled-back transaction, the worker waits forever, or until someone notices. Event 101 does not exist and the worker should move on. But the worker cannot tell a genuine in-flight transaction from a phantom gap without external knowledge it does not have.
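
The stall is easy to reproduce in isolation. This is a simplified model of a worker that only delivers consecutive feed_seq values. The function name and the use of a set to stand for committed rows are illustrative assumptions.

```python
def deliver_contiguous(committed: set, high_water: int):
    """Deliver consecutive feed_seq values until the next one is missing."""
    delivered = []
    while high_water + 1 in committed:
        high_water += 1
        delivered.append(high_water)
    return delivered, high_water


# Event 101 was allocated by a transaction that rolled back.
committed = {100, 102}
delivered, high_water = deliver_contiguous(committed, high_water=99)
```

The worker delivers 100 and then stops with its high-water mark at 100. Event 102 is committed and visible, but the worker never reaches it, because from its point of view a missing 101 looks exactly like a transaction that has not committed yet.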

The fix replaced the SEQUENCE with a single-row counter table: stock_market.signal_feed_counter. The trigger runs UPDATE signal_feed_counter SET value = value + 1 RETURNING value inside the same transaction as the event log insert. The UPDATE takes a row lock that is held until the enclosing transaction ends, so concurrent writers queue behind it and feed_seq order matches commit order. If the transaction rolls back, the increment rolls back with it, so no value is consumed. Committed events form a contiguous, ascending prefix with no holes.
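
The rollback behavior can be demonstrated with a small stand-in. The sketch uses in-memory SQLite, not PostgreSQL, and a separate SELECT in place of RETURNING so it runs on older SQLite builds. The helper name, the fail flag, and the starting value of 99 are assumptions. It shows only the transactional property. The cross-session serialization comes from PostgreSQL's row lock, which a single SQLite connection does not exercise.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE signal_feed_counter (value INTEGER NOT NULL);
    INSERT INTO signal_feed_counter (value) VALUES (99);
    CREATE TABLE signal_feed_event (
        feed_seq INTEGER PRIMARY KEY,
        snapshot TEXT NOT NULL
    );
    """
)


def append_event(snapshot: str, fail: bool = False) -> Optional[int]:
    """Allocate feed_seq and insert the event in one transaction."""
    try:
        with conn:  # commits on success, rolls back if the body raises
            conn.execute("UPDATE signal_feed_counter SET value = value + 1")
            (seq,) = conn.execute(
                "SELECT value FROM signal_feed_counter"
            ).fetchone()
            if fail:
                raise RuntimeError("simulated failure before commit")
            conn.execute(
                "INSERT INTO signal_feed_event (feed_seq, snapshot)"
                " VALUES (?, ?)",
                (seq, snapshot),
            )
        return seq
    except RuntimeError:
        return None  # the counter increment was rolled back too


first = append_event("first")
rolled_back = append_event("never committed", fail=True)
second = append_event("second")
seqs = [row[0] for row in conn.execute(
    "SELECT feed_seq FROM signal_feed_event ORDER BY feed_seq"
)]
```

The failed transaction leaves no trace: the committed events are numbered 100 and 101 with no hole between them, where a SEQUENCE would have skipped a value.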

The migration itself had to be race-free. We locked the signal table, swapped from sequence to counter in a single transaction, and released the lock only then. The counter starts at the sequence’s current value, so no existing event is renumbered and no new event gets a duplicate feed_seq.

The hardening pass: five fixes

The feed shipped on a Friday. By Monday, five categories of failure had surfaced. None of them were SSE-related.

1. Truncation-gap detection. The replay window holds 72 hours of feed events. A consumer that disconnects for longer reconnects with a Last-Event-ID that predates the oldest available event. Without a check, the feed would silently deliver from the current high-water mark, and the consumer would never know it missed a weekend of events, including exits on open positions. The fix: when Last-Event-ID falls outside the window, the feed sends a system event with reason: replay_truncated and the earliest available feed_seq. The consumer enters close-only mode, which blocks new entries and permits exits, and fires a CRITICAL alert. Recovery needs a manual restart with --recovery, which reconciles from the database directly.

2. Fan-out isolation. The feed delivers events to multiple consumers through per-client queues. In the initial code, one slow consumer whose queue filled up could block the worker that fans out to every client. The fix: each client gets its own bounded asyncio.Queue with put_nowait semantics. The fan-out is isolated, so a slow consumer’s backpressure affects only that consumer. On overflow, the feed drops the oldest event and sends a system event with the dropped count, so the consumer knows to reconnect from its cursor.

3. Poison quarantine. A malformed signal row could crash the worker and stop delivery for every client. Think a JSON payload that fails to serialize, or a row with a missing column. The fix: a per-event try/except wraps the delivery. A poisoned event is logged, quarantined, and skipped, and the worker continues to the next event. No single bad row can take down the feed.

4. No event loss on IBKR disconnect. The trader_bot’s feed consumer runs in a daemon thread. When the IBKR gateway drops, which happens routinely, the bot reconnects to IBKR and re-enters its position loop. In the initial code, the feed consumer thread did not pause during this window. It kept dequeuing and processing events, but place_order calls during an IBKR outage fail silently. The fix: on an IBKR disconnect timeout, the feed consumer breaks its loop and reconnects from the unchanged cursor. No event is consumed during the outage. When the feed reconnects, it replays everything that arrived while IBKR was down. A missed exit is not silently dropped.

5. Start-from-now default. A fresh consumer with no cursor file, on first run or after a lost cursor, would request a full backfill with ?cursor=0. That could replay tens of thousands of events and overflow the per-client queue. Most of those events were already reconciled by the bot’s startup --recovery pass. The fix: a cursorless connect defaults to the current high-water mark, not a full backfill. An explicit ?cursor=0 still works for debugging, but the default is the safe one.
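
Fixes 2 and 3 fit in a few lines when sketched together. The following is a minimal illustration under stated assumptions, not the feed's actual worker: the ClientQueue class, the fan_out function, the queue sizes, and the quarantine list are hypothetical names and values. Only asyncio.Queue, put_nowait, and the drop-oldest policy come from the description above.

```python
import asyncio
import json
import logging

log = logging.getLogger("signal_feed")


class ClientQueue:
    """One bounded queue per connected consumer (maxsize is an assumed value)."""

    def __init__(self, maxsize: int = 1000) -> None:
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.dropped = 0  # would be reported to the client in a system event

    def offer(self, frame: str) -> None:
        """Enqueue without ever blocking; on overflow, drop the oldest frame."""
        try:
            self.queue.put_nowait(frame)
        except asyncio.QueueFull:
            self.queue.get_nowait()  # discard the oldest queued frame
            self.dropped += 1
            self.queue.put_nowait(frame)


def fan_out(events, clients, quarantine: list) -> None:
    """Deliver each event to every client; a bad event is skipped, not fatal."""
    for feed_seq, row in events:
        try:
            frame = f"id: {feed_seq}\ndata: {json.dumps(row)}\n\n"
            for client in clients:
                client.offer(frame)
        except Exception as exc:  # poison event: log, quarantine, move on
            log.error("quarantined feed_seq=%s: %s", feed_seq, exc)
            quarantine.append(feed_seq)


async def demo():
    slow, fast = ClientQueue(maxsize=2), ClientQueue(maxsize=10)
    quarantine: list = []
    # Event 2 carries a value that json.dumps cannot serialize.
    events = [(1, {"px": 1}), (2, {"px": object()}), (3, {"px": 3}), (4, {"px": 4})]
    fan_out(events, [slow, fast], quarantine)
    return slow, fast, quarantine


slow, fast, quarantine = asyncio.run(demo())
print(slow.dropped, fast.queue.qsize(), quarantine)  # prints: 1 3 [2]
```

The slow client loses its oldest frame and records one drop, the fast client receives all three valid events, and the unserializable event 2 is quarantined without stopping delivery of events 3 and 4.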

These five fixes share a pattern. None of them changed the happy path. The feed streams events the same way it always did. What changed is the behavior at the edges: a consumer gone for 72 hours, a queue overflow, a malformed row, a broker drop, a new consumer with no history. Production hardening is mostly about the edges.
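
Two of those edges, the truncated replay window and the cursorless connect, come down to one decision at connect time. A sketch of that decision follows, with assumptions labeled: ReplayPlan, plan_replay, and the earliest_feed_seq field name are hypothetical. Continuing to stream from the oldest available event after a truncation notice is also an assumption of this sketch, since the text only specifies that the system event is sent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplayPlan:
    start_after: int                     # stream events with feed_seq > this
    system_event: Optional[dict] = None  # sent before any replayed event


def plan_replay(cursor: Optional[int], oldest_available: int,
                high_water: int) -> ReplayPlan:
    """Decide where a connecting consumer starts reading."""
    if cursor is None:
        # Start-from-now default: no cursor means no backfill.
        return ReplayPlan(start_after=high_water)
    if cursor + 1 < oldest_available:
        # The next event the consumer needs has left the replay window.
        return ReplayPlan(
            start_after=oldest_available - 1,
            system_event={
                "reason": "replay_truncated",
                "earliest_feed_seq": oldest_available,  # assumed field name
            },
        )
    return ReplayPlan(start_after=min(cursor, high_water))


fresh = plan_replay(None, oldest_available=5000, high_water=9000)
stale = plan_replay(100, oldest_available=5000, high_water=9000)
normal = plan_replay(7000, oldest_available=5000, high_water=9000)
```

A fresh consumer starts at 9000, a consumer with cursor 7000 resumes exactly there, and a consumer whose cursor of 100 predates the window gets the replay_truncated notice first, so the gap is never silent.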

Consumer-side resilience: the failure spectrum

On the consumer side, a trader_bot instance started with --feed-url and --feed-api-key, the failure modes form a spectrum from transient to fatal.

Transient: 503 (degraded). The feed returns 503 when its LISTEN connection to PostgreSQL is down. The consumer retries with capped exponential backoff. This is the most common failure mode, and it recovers automatically.

Recoverable: read timeout. The feed sends a heartbeat every 15 seconds. If the consumer sees more than 35 seconds of silence (two missed heartbeats plus a few seconds of allowance for jitter), it treats the connection as dead and reconnects from the cursor. No events are lost.

Recoverable: dropped events. A system event with reason: dropped means the per-client queue overflowed. The consumer reconnects from its cursor, replaying the dropped window.

Semi-fatal: replay truncated. A replay_truncated system event means the consumer was down longer than 72 hours. The bot enters close-only mode, with exits allowed and entries blocked, fires a CRITICAL alert, and waits for manual recovery via --recovery. This is the right behavior: a bot that might have missed exits should not open new positions.

Fatal: 401 / 403. A bad or expired API key returns 401. The consumer shuts down immediately. There is no retry on a bad key. The only remediation is human: mint a new key and restart the bot.
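
The spectrum above can be read as a small decision table. The sketch below is one way a consumer might encode it. The Action enum, the function names, and the backoff base and cap are assumptions. The status codes, the reason strings, and the 35-second timeout come from the text.

```python
from enum import Enum
from typing import Optional

READ_TIMEOUT_S = 35.0  # from the text: two 15-second heartbeats plus slack


class Action(Enum):
    CONTINUE = "continue"
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    RECONNECT_FROM_CURSOR = "reconnect_from_cursor"
    CLOSE_ONLY = "close_only"  # exits allowed, entries blocked, alert fired
    SHUTDOWN = "shutdown"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential backoff: 1s, 2s, 4s, ... up to the cap (assumed values)."""
    return min(cap, base * (2 ** attempt))


def classify(http_status: Optional[int] = None,
             system_reason: Optional[str] = None,
             silence_s: float = 0.0) -> Action:
    """Map an observed failure to the consumer's response, most severe first."""
    if http_status in (401, 403):
        return Action.SHUTDOWN               # bad key: no retry
    if http_status == 503:
        return Action.RETRY_WITH_BACKOFF     # feed degraded, recovers on its own
    if system_reason == "replay_truncated":
        return Action.CLOSE_ONLY             # may have missed exits
    if system_reason == "dropped" or silence_s > READ_TIMEOUT_S:
        return Action.RECONNECT_FROM_CURSOR  # replay covers the gap
    return Action.CONTINUE


delays = [backoff_delay(n) for n in range(8)]
```

Checking the fatal cases first means an ambiguous situation resolves toward the more conservative action.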

The consumer’s fallback is its startup --recovery pass, which re-reads recent signal rows from the database and re-dispatches them. It runs independently of the feed. If the cursor file was lost in a restart and the window was too short, the recovery pass reconciles the gap from the database. The bot needs database access for this to work, which is one reason the standalone-bot split is phased rather than all at once.

The documentation arc

The feed’s documentation started as a design plan, SIGNAL_FEED_PLAN.md, written before a single line of server code. It described the architecture in future tense: “the feed will use an append-only event log”, “the counter will produce a gapless sequence”. Once the feed shipped, that plan became a liability. It described a system that no longer matched reality. The real system was already running and diverging in small but important ways: the dual-channel notify, the start-from-now default, the delayed-tier replay scheduling.

The fix folded the plan into a single runbook: doc/SIGNAL_FEED.md. The as-built architecture replaced the aspirational one. We kept the design rationale, the “why” questions: why SSE, why a separate process, why no migration events. We moved it to a “Design notes & history” section at the bottom. The body of the document is operational: deploy commands, environment variables, auth and key management, wire format, schema, consumer setup, and failure modes. An operator reads it when the feed is down, not during the design review.

This pattern is worth surfacing because it is easy to skip: plan, build, then retire the plan. The temptation is to leave the design doc untouched and add an “as-built” appendix, or to keep both documents alive and let them drift. Neither works. A plan that outlives the build is misinformation. Fold it in, delete the old file, and repoint every reference.

What we learned

SSE itself was the easy part. The hard parts were all about delivery guarantees on top of PostgreSQL: commit ordering, gap detection, queue backpressure, poison isolation, and the consumer’s failure spectrum. If you build an event stream on a relational database, the database is where the interesting problems live. The transport carries the data and little else.

The append-only event log pattern generalizes well beyond this feed: immutable rows, gapless counter, replay by cursor. It is the same pattern behind event sourcing and change data capture, scaled down to a single table in pure PostgreSQL. The counter migration from SEQUENCE to UPDATE ... RETURNING is worth knowing on its own. nextval() looks simple and is silently wrong for any use case that needs contiguous ordering.

Finally: ship, then harden. The feed went from first commit to production in under a week. We expected the edges to surface in production and get fixed there. The five fixes, shipped in three commits, were not a failure of planning. They were the plan working as intended. You cannot anticipate every failure mode in advance. You can design the system so that when failures surface, they hit one consumer rather than all of them, and they surface visibly as a system event, a log line, or a CRITICAL alert, rather than silently dropping data.


This article describes infrastructure methodology for informational and educational purposes only. It does not constitute financial advice, trading recommendations, or investment guidance of any kind. Past system behavior does not guarantee future reliability.

