ADR 0001: Build the Queue Service as the standalone waiting-room repo

Context

The original E5 plan called for extracting the queue functionality out of the Symfony monolith into a dedicated service: "Node.js or Go" runtime, Redis sorted-set backed, WebSocket/SSE realtime, 100k+ concurrent connections, 20-min purchase window, 30-min position persistence. Cross-service signalling to the monolith was sketched as Redis Pub/Sub (queue.turn_granted, queue.expired) on the shared Redis cluster, with queue state co-located in the same Redis instance that holds cart reservations, sessions, and the notification job queue.

We built that service — it lives in the sibling repo ../waiting-room/ (currently v0.15.5). During implementation we landed on several design choices that differ from the original sketch and are worth documenting:

  • Built as a multi-tenant service from day one (single instance can later gate other surfaces such as the Drupal webshop), not the single-purpose service the original plan implied.
  • Capacity mode (auto-admit up to a concurrency cap, hold a session for session_ttl_seconds) is the mode we integrate with. The service also offers an operator mode (manual call-next-N), which we do not use.
  • Cross-service signalling to the backend is pull-based via GET /access rather than push-based via Redis Pub/Sub.
  • Its own Postgres + Valkey stack — not co-located on the platform's shared Redis cluster.
  • Valkey (BSD-3, Linux Foundation fork), not Redis-the-product (re-licensed to SSPL/RSALv2 in 2024). Redis-protocol clients (ioredis, BullMQ) work unchanged.
  • Stack: Node 20 + TypeScript 5.7 + Fastify 5 + Prisma 6 + BullMQ; FCM (firebase-admin) + APNs (@parse/node-apn) dispatchers; cross-instance WebSocket fan-out via Valkey Pub/Sub.
  • Released under an OSI-compatible licence (per a hard product constraint in the repo's own AGENTS.md).

Decision

The Queue Service for HNS Ticketing is the waiting-room repo. HNS Ticketing runs it as one of its sibling stacks and integrates as the first tenant.

Consequences for the platform architecture:

  1. Standalone deployment. waiting-room is an additional sibling repo with its own Postgres + Valkey + BullMQ. It does not share the ticketing backend's Redis or Postgres.
  2. Valkey, not Redis-the-product. A deliberate choice driven by the 2024 SSPL/RSALv2 re-license. The Redis-protocol clients we already use (Predis, ioredis, BullMQ) continue to work; only the server they connect to changes, from Redis to Valkey.
  3. Capacity mode is the integration model. waiting-room supports both operator (call-next-N, helpdesk style) and capacity (auto-admit up to a concurrency cap, then hold a session for session_ttl_seconds). We use capacity mode to gate protected endpoints (checkout / seat selection) on the ticketing backend.
  4. GET /access replaces the planned queue.turn_granted pub/sub. The ticketing backend (the protected origin) validates each protected request by calling GET /access on waiting-room with the mobile client's session token. This is pull-based, one Valkey lookup per validation, designed to be cached for 1–5s on the origin side. We do not subscribe to queue.* Redis pub/sub channels and we do not put queue.* subjects on our NATS Event Bus.
  5. Two new token types in the auth model. Mobile clients receive a ticketToken (talks to waiting-room: ticket state, WebSocket upgrade, voluntary leave) and a sessionToken (talks to our origin endpoints; validated via /access). Both are orthogonal to the Keycloak-issued user JWT.
  6. Queue notifications fan out from waiting-room, not from our Notification Workers. waiting-room owns delivery of admitted, position_changed, session_expired, expired, cancelled over its own WebSocket gateway and FCM/APNs dispatchers. Our Notification Workers continue to own order/ticket/quota/loyalty notifications.
  7. queue_entries and queue:* Redis structures leave the ticketing backend schema. Durable queue history lives in waiting-room's Postgres; hot queue state lives in waiting-room's Valkey. The ticketing backend no longer models a Queue entity.
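
As a rough illustration of point 4, the origin-side check could be wrapped in a small cache along the lines of the TypeScript sketch below. The ticketing backend itself is a Symfony application, so this only shows the logic. The AccessResult shape, the AccessLookup transport function, and the 2-second default TTL are assumptions made for the example, not waiting-room's documented API.

```typescript
// Result of one GET /access lookup as the origin sees it.
// Field names are assumptions for this sketch, not waiting-room's schema.
type AccessResult =
  | { valid: true; sessionExpiresAt: number } // epoch milliseconds
  | { valid: false };

// Hypothetical transport: performs the real GET /access call for one token.
type AccessLookup = (sessionToken: string) => Promise<AccessResult>;

interface CacheEntry {
  result: AccessResult;
  cachedUntil: number; // epoch milliseconds
}

class CachedAccessChecker {
  // Unbounded map for brevity; a real cache would also evict old tokens.
  private readonly cache = new Map<string, CacheEntry>();
  private readonly lookup: AccessLookup;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(
    lookup: AccessLookup,
    ttlMs: number = 2_000, // assumed default inside the 1–5 s window
    now: () => number = Date.now,
  ) {
    this.lookup = lookup;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  async isAllowed(sessionToken: string): Promise<boolean> {
    const t = this.now();
    const hit = this.cache.get(sessionToken);
    if (hit !== undefined && hit.cachedUntil > t) {
      return hit.result.valid; // served from cache, no call to waiting-room
    }

    let result: AccessResult;
    try {
      result = await this.lookup(sessionToken);
    } catch {
      // Fail closed: an unreachable waiting-room means reject, and cache nothing.
      this.cache.delete(sessionToken);
      return false;
    }

    // A positive answer is never cached beyond the session's own expiry.
    const cachedUntil = result.valid
      ? Math.min(t + this.ttlMs, result.sessionExpiresAt)
      : t + this.ttlMs;
    this.cache.set(sessionToken, { result, cachedUntil });
    return result.valid && result.sessionExpiresAt > t;
  }
}
```

With this shape, a burst of protected requests carrying the same session token costs at most one /access call per TTL window, and an outage of waiting-room blocks protected endpoints rather than opening them.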

Consequences

Positive

  • E5-F1 through E5-F5 are owned end-to-end by waiting-room, including parts (E5-F2 real-time updates, E5-F3 FCM/APNs push) that the original plan left as separate workstreams.
  • The Symfony monolith does not take on long-lived WebSocket connections — PHP-FPM is unchanged.
  • Multi-tenant by design: the same waiting-room deployment can later gate the Drupal webshop, admin portal, or other surfaces without spinning up additional services.
  • Capacity sizing is documented in waiting-room/docs/bottlenecks_estimate.md: a single 8 vCPU / 32 GB host sustains ~100–150k waiting users.

Negative / risks

  • Same-user-same-position dedup is not built in. waiting-room keys tickets by ticket id, not user id, because it is tenant-agnostic about how callers identify users. The original spec called for "multi-device sync — same user = same position." We close the gap by enforcing one ticket per (user_id, queue_id) at the call site (mobile or backend) before issuing POST /queues/{id}/tickets.
  • No backend-side event hook today. When a fan is admitted, the ticketing backend learns about it lazily — either via the mobile app forwarding the admitted event, or via the subsequent /access validation call. A future WebhookDispatcher (the NotificationDispatcher interface in waiting-room is explicitly designed to support this) would close the gap; not required for v1.
  • Two services to operate instead of one (waiting-room + ticketing backend) — counter-balanced by waiting-room being a single Docker image with its own self-contained compose stack.
  • /access is on the hot path for every protected request. Origin-side caching (1–5s, capped at sessionExpiresAt) is mandatory; fail-closed is the documented default.
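
The call-site dedup from the first risk above might look like the following TypeScript sketch. It is illustrative only: IssueTicket stands in for whatever code performs POST /queues/{id}/tickets, the class name is hypothetical, and the in-memory Map would have to be replaced by a shared store (for example a unique constraint on (user_id, queue_id)) once more than one process issues tickets.

```typescript
// Hypothetical call that performs POST /queues/{id}/tickets
// and resolves to the new ticket id.
type IssueTicket = (queueId: string) => Promise<string>;

class TicketDeduplicator {
  // Keyed by (userId, queueId). Storing the promise, not the resolved id,
  // means two devices joining at the same moment share one issuance.
  private readonly tickets = new Map<string, Promise<string>>();
  private readonly issueTicket: IssueTicket;

  constructor(issueTicket: IssueTicket) {
    this.issueTicket = issueTicket;
  }

  ticketFor(userId: string, queueId: string): Promise<string> {
    // JSON encoding keeps ("a:b", "c") and ("a", "b:c") distinct.
    const key = JSON.stringify([userId, queueId]);
    const existing = this.tickets.get(key);
    if (existing !== undefined) {
      return existing; // same user, same queue: same ticket, same position
    }

    const pending = this.issueTicket(queueId).catch((err: unknown) => {
      // Forget failed attempts so the user can retry.
      this.tickets.delete(key);
      throw err;
    });
    this.tickets.set(key, pending);
    return pending;
  }
}
```

A second device for the same user then receives the ticket id that the first device obtained, instead of a new position at the back of the queue.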

Neutral

  • Mobile app integration follows waiting-room/docs/workflow.md. The shape (POST join → WS for live updates → call origin with session token) is consistent with what the original architecture spec implied.
  • Admin operations (create/edit/list queues) use waiting-room's own admin UI or REST API. The ticketing backend provisions one queue per match at publish time.

Implementation outline

Not an implementation plan; just enough to make the consequences concrete.

  1. waiting-room is added to the sibling-repos layout with its own compose stack (own Postgres, own Valkey, own Traefik route).
  2. Bootstrap a tenant for HNS Ticketing via npm run admin -- tenant:create; store the resulting tenant API key in the ticketing backend's secret store.
  3. On Match.publish, the backend creates a capacity-mode queue in waiting-room (one queue per match) and stores waiting_room_queue_id on the Match.
  4. The mobile app joins via POST /queues/{waiting_room_queue_id}/tickets with the tenant API key, opens the WebSocket, and receives admitted with a sessionToken.
  5. The backend's checkout / seat-selection endpoints become "protected origin" endpoints: each request goes through a RequireSession middleware that calls GET /access (cached 1–5s) and rejects the request on 401/410.
  6. Drop the QueueController, QueueService, QueueEntry entity, and queue:* Redis usage from the ticketing backend. A migration removes the queue_entries table.
  7. Remove queue.* subjects from the NATS Event Bus catalog; keep order.*, ticket.*, payment.*, etc.
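
Step 3 of the outline can be pictured with the TypeScript sketch below, which keeps queue provisioning idempotent so that republishing a match does not create a second queue. Everything beyond "capacity mode", session_ttl_seconds, and waiting_room_queue_id is an assumption: the CreateQueueRequest field names, the CreateQueue transport, and the queue naming scheme are invented for illustration and do not describe waiting-room's actual admin API.

```typescript
// Minimal view of the backend's Match for this sketch.
interface Match {
  id: string;
  waitingRoomQueueId: string | null; // persisted as waiting_room_queue_id
}

// Hypothetical request body. Only the capacity mode and session_ttl_seconds
// are named in this ADR; the other field names are invented.
interface CreateQueueRequest {
  name: string;
  mode: "capacity";
  concurrencyCap: number;
  sessionTtlSeconds: number;
}

// Hypothetical transport that calls waiting-room with the tenant API key.
type CreateQueue = (request: CreateQueueRequest) => Promise<{ id: string }>;

interface QueueSettings {
  concurrencyCap: number;
  sessionTtlSeconds: number;
}

async function provisionQueueOnPublish(
  match: Match,
  settings: QueueSettings,
  createQueue: CreateQueue,
): Promise<Match> {
  // Idempotent: a match that already has a queue is returned unchanged.
  if (match.waitingRoomQueueId !== null) {
    return match;
  }
  const queue = await createQueue({
    name: `match-${match.id}`, // assumed naming scheme
    mode: "capacity",
    concurrencyCap: settings.concurrencyCap,
    sessionTtlSeconds: settings.sessionTtlSeconds,
  });
  // The caller is responsible for persisting the returned Match.
  return { ...match, waitingRoomQueueId: queue.id };
}
```

The returned waitingRoomQueueId is the value the mobile app later uses in POST /queues/{waiting_room_queue_id}/tickets.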

Alternatives considered

  • Keep the queue logic in the Symfony monolith. Rejected: PHP-FPM is unsuitable for 100k+ long-lived WebSocket connections; this was the original reason the architecture called for extraction.
  • Build the Queue Service single-tenant and tightly coupled to HNS Ticketing, with Redis Pub/Sub events back to the monolith (the literal reading of the original microservices-strategy spec). Rejected: makes the service hard to reuse for other surfaces (Drupal webshop is the obvious next consumer), and ties cross-service signalling to a transport (shared Redis Pub/Sub) that doesn't survive separating Valkey instances. The chosen pull-based /access design is cheaper, cacheable, and works across multiple consuming origins.
  • Push-based events (Redis Pub/Sub or NATS) instead of pull-based /access. Rejected: pull-based is simpler for the consumer (no subscriber wiring, no replay logic), naturally cacheable on the origin side, and aligns with the model the protected origin needs anyway ("is this token still valid right now?"). Push-based would require origins to handle out-of-order delivery and at-least-once retries.

References

  • ../waiting-room/ARCHITECTURE.md — server-side data model, ticket lifecycle, dispatcher pattern.
  • ../waiting-room/docs/workflow.md — mobile + origin integration guide.
  • ../waiting-room/docs/bottlenecks_estimate.md — hardware sizing.
  • microservices-strategy.md — Queue Service section (rewritten to reflect this decision).
  • architecture-overview.md — repository layout and event catalog (updated).
  • E5 epic and feature docs (e5-f*.md) — bannered with a pointer to this ADR.