Microservices Strategy

Overview

The HNS Ticketing System follows a "modular monolith with selective service extraction" strategy. Most functionality remains in the PHP/Symfony monolith, organised into modules with clear bounded contexts, while one component, the Queue Service, is extracted as a separate service to address specific scaling and technology requirements.

Architecture Decision

Why Not Full Microservices?

| Consideration | Decision |
| --- | --- |
| Team expertise | PHP/Symfony - single technology reduces cognitive load |
| Transaction integrity | Ticket purchase requires atomic inventory + payment operations |
| Operational complexity | 15 microservices would require significant DevOps investment |
| Current scale | Most operations under 10k concurrent users |

Why Extract the Queue Service?

| Service | Scale Requirement | Technology Mismatch |
| --- | --- | --- |
| Queue Service (waiting-room) | 100k+ concurrent WebSocket connections | PHP unsuitable for long-lived connections |

Notification delivery used to be on the "extract" list (as "Notification Workers"), but the implementation took a different shape. Email delivery moved out of the backend: messages are published to the NATS Event Bus and consumed by the external mailer microservice. Push delivery stays inside the request handler and calls the Firebase SDK directly. See "Notification Delivery" below for the current shape.

Extracted Services

Queue Service (waiting-room)

The Queue Service is the standalone waiting-room service (sibling repo ../waiting-room/), developed by the HNS Ticketing team as a multi-tenant queue so that the same service can later gate other surfaces (e.g. the Drupal webshop) without spinning up additional deployments. HNS Ticketing is its first tenant. This section documents the integration shape; for service internals see ../waiting-room/ARCHITECTURE.md.

Responsibilities (owned by waiting-room)

  • Queue join and FIFO position assignment via Valkey sorted set.
  • Real-time position updates over WebSocket; cross-instance fan-out via Valkey Pub/Sub.
  • Capacity mode: auto-admit waiters up to a per-queue concurrency cap; hold each admitted session for session_ttl_seconds. Auto-admit the next waiter when an active session is cancelled, expires, or ends.
  • Operator mode (not used by HNS Ticketing): explicit "call next N tickets" by a human operator.
  • TTL enforcement via BullMQ delayed jobs (expire-ticket, session-expire) plus a periodic safety-net sweep.
  • Notification fan-out per state transition through a NotificationDispatcher interface (WebSocket + FCM via firebase-admin + APNs via node-apn).
  • Hot-path session validation for protected origins via GET /access.

Responsibilities the ticketing backend owns

  • Provisioning one capacity-mode queue per match at publish time (admin REST call to waiting-room).
  • Storing the resulting waiting_room_queue_id on the match record.
  • Enforcing 1 ticket per (user_id, queue_id) at the call site, because waiting-room keys tickets by ticket id, not user id (multi-device dedup is not built-in).
  • Gating protected endpoints (checkout / seat selection) with a RequireSession middleware that calls GET /access against waiting-room (cached 1–5s, fail-closed).

Technology Stack (locked by waiting-room)

| Component | Technology | Rationale |
| --- | --- | --- |
| Runtime | Node.js 20 | Event loop for 100k+ long-lived connections |
| Language | TypeScript 5.7 (strict) | Maintainability |
| HTTP framework | Fastify 5 | Schema-validated routes, low overhead |
| Database | PostgreSQL 16 + Prisma 6 | Durable ticket/queue history |
| Cache / sorted set | Valkey 8 (not Redis) | Redis-the-product re-licensed to SSPL/RSALv2 in 2024; we picked Valkey (BSD-3, Linux Foundation fork). Redis-protocol clients (ioredis, BullMQ) work unchanged. |
| Job queue | BullMQ (Valkey-backed) | TTL expiry jobs |
| WebSocket | @fastify/websocket + Valkey Pub/Sub fan-out | Cross-instance broadcast |
| Push | firebase-admin (FCM), @parse/node-apn (APNs) | Backgrounded-device delivery |

Data Model (Valkey, owned by waiting-room)

Source: ../waiting-room/src/valkey/keys.ts.

# Waiting position (sorted set, score = joined_at ms)
Key:     queue:{queue_id}:waiting
Members: ticket_id
Used by: position math (ZRANK + 1), join, cancel.

# Operator-mode "called" tickets (sorted set, score = called_at ms)
Key:     queue:{queue_id}:called
Members: ticket_id
Used by: operator advance flow (not used by HNS Ticketing).

# Capacity-mode active sessions (sorted set, score = admitted_at ms)
Key:     queue:{queue_id}:active
Members: ticket_id
Used by: concurrency cap check before auto-admit.

# Per-ticket session record (string, JSON, EXPIREAT to sessionExpiresAt)
Key:     session:{ticket_id}
Value:   {secret, tenantId, queueId, sessionExpiresAt}
Used by: GET /access (one Valkey lookup, no Postgres on the hot path).

Durable ticket history (status, joinedAt, expiresAt, admittedAt, sessionExpiresAt, etc.) lives in waiting-room's Postgres Ticket table — separate from the ticketing backend's Postgres.
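The key layout and the ZRANK + 1 position rule can be illustrated with a small in-memory model. This is a sketch only: the helper names (waitingKey, activeKey, sessionKey, zrank, positionOf) are hypothetical and are not taken from ../waiting-room/src/valkey/keys.ts, and a Map stands in for the Valkey sorted set.

```typescript
// Hypothetical key builders mirroring the key patterns listed above.
const waitingKey = (queueId: string): string => `queue:${queueId}:waiting`;
const activeKey = (queueId: string): string => `queue:${queueId}:active`;
const sessionKey = (ticketId: string): string => `session:${ticketId}`;

// In-memory stand-in for a sorted set: member -> score (joined_at in ms).
type SortedSet = Map<string, number>;

// Equivalent of ZRANK: 0-based rank by ascending score. Equal scores are
// ordered lexicographically by member, as Redis-protocol sorted sets do.
function zrank(set: SortedSet, member: string): number | null {
  if (!set.has(member)) return null;
  const ordered = [...set.entries()].sort(
    ([ma, sa], [mb, sb]) => sa - sb || (ma < mb ? -1 : ma > mb ? 1 : 0),
  );
  return ordered.findIndex(([m]) => m === member);
}

// The position shown to a waiting user is 1-based: ZRANK + 1.
function positionOf(set: SortedSet, ticketId: string): number | null {
  const rank = zrank(set, ticketId);
  return rank === null ? null : rank + 1;
}

// Three waiters inserted out of order; scores decide the FIFO order.
const waiting: SortedSet = new Map([
  ["ticket-b", 1700000000500],
  ["ticket-a", 1700000000100],
  ["ticket-c", 1700000000900],
]);
```

Because the score is the join timestamp, a cancel (removing a member) shifts everyone behind it forward by one without any renumbering.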

Queue mode used by HNS Ticketing

We use capacity mode exclusively. Per-queue config supplied at provisioning time:

  • mode: 'capacity'
  • ticket_ttl_seconds: hard expiry from join (e.g. 1800 = 30 min).
  • concurrency: maximum concurrent active sessions on the protected origin (i.e. concurrent checkout sessions per match).
  • session_ttl_seconds: how long an admitted session lives (e.g. 1200 = 20 min, matching the original purchase-window spec).
  • capacity (optional): absolute hard cap on waiting queue length; rejects joins with 409 queue_full.
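As a rough illustration of how these fields interact, the sketch below types the config and derives two decisions from it. The field names come from the list above; the interface, the helper names (admitSlots, isQueueFull), the example values for concurrency and capacity, and the exact comparison used for queue_full are assumptions, not waiting-room's implementation.

```typescript
// Field names follow the per-queue config listed above; the type is illustrative.
interface CapacityQueueConfig {
  mode: "capacity";
  ticket_ttl_seconds: number;  // hard expiry from join
  concurrency: number;         // max concurrent active sessions
  session_ttl_seconds: number; // lifetime of an admitted session
  capacity?: number;           // optional hard cap on waiting-queue length
}

// How many waiters could be auto-admitted right now without exceeding the cap.
function admitSlots(config: CapacityQueueConfig, activeSessions: number): number {
  return Math.max(0, config.concurrency - activeSessions);
}

// Whether a new join would be rejected with 409 queue_full.
// Assumption: the join is rejected once the waiting length has reached the cap.
function isQueueFull(config: CapacityQueueConfig, waitingCount: number): boolean {
  return config.capacity !== undefined && waitingCount >= config.capacity;
}

const matchQueue: CapacityQueueConfig = {
  mode: "capacity",
  ticket_ttl_seconds: 1800,
  concurrency: 500,          // assumed value for illustration
  session_ttl_seconds: 1200,
  capacity: 100000,          // assumed value for illustration
};
```

With 498 active sessions this config leaves two slots, so the next two waiters by rank would be admitted when capacity mode runs its auto-admit step.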

API Endpoints (waiting-room)

Authoritative reference: ../waiting-room/openapi.yaml (mounted at /docs in dev).

| Method | Path | Auth | Used by |
| --- | --- | --- | --- |
| POST | /queues | Operator JWT | Backend at match-publish (one queue per match) |
| PATCH | /queues/{id} | Operator JWT | Backend to retune concurrency / TTLs |
| POST | /queues/{id}/tickets | Tenant API key | Mobile (join) |
| GET | /tickets/{id} | Per-ticket JWT | Mobile (poll fallback / state on reconnect) |
| DELETE | /tickets/{id} | Per-ticket JWT | Mobile (voluntary leave) |
| WS | /tickets/{id}/ws?token=… | Per-ticket JWT (query) | Mobile (live updates) |
| GET | /access | Bearer sessionToken | Backend (protected-origin validation) |

WebSocket events streamed to the client: state (snapshot on connect), position_changed, admitted (carries the sessionToken), expired, session_expired, cancelled. Backgrounded clients receive admitted, expired, session_expired via FCM/APNs.
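A client consuming this stream might model the events as a discriminated union. In the sketch below the event names come from the list above, and sessionToken on admitted is stated there; every other payload field (such as position) and the ClientView type are assumptions made for illustration. The authoritative shapes live in ../waiting-room/openapi.yaml.

```typescript
// Event names are from the list above; payload fields other than
// `sessionToken` are assumptions for this sketch.
type QueueEvent =
  | { type: "state"; position: number | null }
  | { type: "position_changed"; position: number }
  | { type: "admitted"; sessionToken: string }
  | { type: "expired" }
  | { type: "session_expired" }
  | { type: "cancelled" };

// Hypothetical view model for the mobile client.
type ClientView =
  | { screen: "waiting"; position: number | null }
  | { screen: "checkout"; sessionToken: string }
  | { screen: "ended"; reason: "expired" | "session_expired" | "cancelled" };

// Map an incoming event to what the client should display next.
function reduceEvent(event: QueueEvent): ClientView {
  switch (event.type) {
    case "state":
    case "position_changed":
      return { screen: "waiting", position: event.position };
    case "admitted":
      return { screen: "checkout", sessionToken: event.sessionToken };
    case "expired":
    case "session_expired":
    case "cancelled":
      return { screen: "ended", reason: event.type };
  }
}

// Events that are also delivered over FCM/APNs to backgrounded clients.
const PUSH_DELIVERED = new Set<QueueEvent["type"]>([
  "admitted",
  "expired",
  "session_expired",
]);
```

Keeping the reducer exhaustive over the union means a newly added event type fails type-checking until the client decides how to handle it.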

Communication with the Ticketing Backend

┌─────────────────┐                                ┌──────────────────────────┐
│   Mobile App    │                                │  HNS Ticketing Backend   │
└────────┬────────┘                                │  (protected origin for   │
         │                                         │   checkout / seat select)│
         │ POST /queues/{id}/tickets               └──────────┬───────────────┘
         │ WSS  /tickets/{id}/ws                              │
         ▼                                                    │
┌──────────────────────────┐                                  │
│      waiting-room        │                                  │
│  ┌────────────────────┐  │                                  │
│  │ Valkey 8           │  │◄── GET /access  ─────────────────┤
│  │  queue:{id}:waiting│  │    Authorization: Bearer         │
│  │  queue:{id}:active │  │      <sessionToken>              │
│  │  session:{ticket}  │  │                                  │
│  └────────────────────┘  │  200 active | 410 expired | 401  │
│  ┌────────────────────┐  │ ─────────────────────────────────►
│  │ Postgres (durable) │  │                                  │
│  │   Ticket history   │  │                                  │
│  └────────────────────┘  │                                  │
│  ┌────────────────────┐  │                                  │
│  │ BullMQ workers     │  │                                  │
│  │   expire-ticket    │  │                                  │
│  │   session-expire   │  │                                  │
│  │   sweep-expired    │  │                                  │
│  └────────────────────┘  │                                  │
└──────────────────────────┘                                  │

There is no Redis Pub/Sub channel and no NATS subject carrying queue events to the backend. The backend learns about admission through one of two pull-based mechanisms:

  1. GET /access (primary, mandatory). Called from a RequireSession middleware on every protected-origin request. One Valkey GET on waiting-room. Cache 1–5s on the backend side, capped at sessionExpiresAt.
  2. Mobile-forwarded admitted event (optional). The mobile app can POST the admitted payload to a backend endpoint if pre-warming (e.g. eager seat lock) is wanted. Not required for correctness.

A future WebhookDispatcher in waiting-room (extending its NotificationDispatcher interface) would push events server-to-server; not built yet.

Session expiry doubles as cart expiry

The sessionExpiresAt returned by /access is also the canonical clock for cart and seat-lock lifetime. The backend writes match_seat_inventory.locked_until = sessionExpiresAt on every seat lock so the seat and the queue session expire simultaneously — no second TTL mechanism in the backend, no Redis seat-lock keys.

Origin gate caching (mandatory on the backend side)

// Pseudocode: RequireSession middleware on protected backend endpoints.
$token  = extractBearer($request);
$key    = "wr_access:{$token}";
$cached = $redis->get($key);                                 // 1. cache lookup
if ($cached !== null && $cached !== false) {                 // a miss is null or false, depending on the client
    return decode($cached);                                  // same shape as the decisions returned below
}
try {
    $resp = $httpClient->get(WAITING_ROOM_URL.'/access', [
        'headers' => ['Authorization' => "Bearer {$token}"],
    ]);
} catch (\Throwable $e) {
    return ['ok' => false, 'status' => 503];                 // network error: fail-closed; emit a metric
}
if ($resp->status === 200) {
    $decision = ['ok' => true] + json_decode($resp->body, true);
    $ttl      = min(5, max(1, sessionExpiresAtSecondsFromNow($decision)));
    $redis->setex($key, $ttl, encode($decision));            // 2. cache the positive decision
    return $decision;
}
if ($resp->status === 410 || $resp->status === 401) {
    $decision = ['ok' => false, 'status' => $resp->status];
    $redis->setex($key, 1, encode($decision));               // 3. cache the rejection for 1s
    return $decision;
}
// 5xx or any other status: fail-closed by default; emit a metric.
return ['ok' => false, 'status' => 503];

Rate limit: /access is capped at 1000 req/min per source IP at waiting-room. With caching this is comfortable; without it, a busy backend trips the limit.
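The TTL clamp and its effect on the /access call rate can be checked with a little arithmetic. The sketch below restates the clamp from the middleware in TypeScript and adds a worst-case estimate; both function names are hypothetical, and the estimate assumes a shared cache, every token being requested continuously, and no concurrent cache misses for the same token.

```typescript
// Cache TTL for an /access decision: between 1 and 5 seconds, shortened
// when the session itself expires sooner than the 5-second ceiling.
function accessCacheTtlSeconds(sessionExpiresAtMs: number, nowMs: number): number {
  const remaining = Math.floor((sessionExpiresAtMs - nowMs) / 1000);
  return Math.min(5, Math.max(1, remaining));
}

// Rough ceiling on upstream /access calls per minute: each cached token
// causes about one upstream call per TTL window while it stays busy.
function maxAccessCallsPerMinute(activeTokens: number, ttlSeconds: number): number {
  return activeTokens * Math.ceil(60 / ttlSeconds);
}
```

Under those assumptions, 80 continuously busy tokens at a 5-second TTL produce at most 960 calls per minute from one source IP, just under the 1000 req/min cap. Real checkout traffic is burstier and lighter than that ceiling, but the number of concurrently active sessions per egress IP is the figure to watch.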

Auth credentials at the boundary

| Credential | Owner | Used for |
| --- | --- | --- |
| Tenant API key (wr_<tenantId>_<secret>) | Mobile app build config | POST /queues/{id}/tickets |
| Per-ticket JWT (ticketToken) | Mobile app, runtime | GET/DELETE /tickets/{id}, WS upgrade |
| Session token (<ticketId>.<secret>) | Mobile app, runtime | Calls to the protected backend; backend validates via /access |
| Operator JWT | Ticketing backend's secret store | Admin operations on waiting-room (POST /queues, PATCH /queues/{id}) |

These are orthogonal to the Keycloak user JWT used elsewhere in HNS Ticketing — the user-identity token never goes to waiting-room.
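The two structured credential formats in the table can be split mechanically, which is handy for rejecting obviously malformed values before any network call. This is a sketch under assumptions: the function names are hypothetical, and the splitting rules (first "." for the session token, first "_" after the wr_ prefix for the API key) are guesses that hold only if ticket ids contain no "." and tenant ids contain no "_". Validation itself remains waiting-room's job.

```typescript
interface SessionToken { ticketId: string; secret: string }
interface TenantApiKey { tenantId: string; secret: string }

// Session token format from the table: <ticketId>.<secret>
function parseSessionToken(token: string): SessionToken | null {
  const dot = token.indexOf(".");
  if (dot <= 0 || dot === token.length - 1) return null; // missing or empty part
  return { ticketId: token.slice(0, dot), secret: token.slice(dot + 1) };
}

// Tenant API key format from the table: wr_<tenantId>_<secret>
function parseTenantApiKey(key: string): TenantApiKey | null {
  if (!key.startsWith("wr_")) return null;
  const rest = key.slice(3);
  const sep = rest.indexOf("_");
  if (sep <= 0 || sep === rest.length - 1) return null; // missing or empty part
  return { tenantId: rest.slice(0, sep), secret: rest.slice(sep + 1) };
}
```

A null result only means the value cannot be well-formed; a non-null result says nothing about whether the credential is valid.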

Sizing (waiting-room/docs/bottlenecks_estimate.md)

| Tier | Sustained join req/s | Concurrent WS | Practical queue capacity |
| --- | --- | --- | --- |
| 4 vCPU / 16 GB | 6–10k | 50–80k | ~30–50k waiting users |
| 8 vCPU / 32 GB | 15–25k | 150–200k | ~100–150k waiting users |
| 16 vCPU / 64 GB | 30–50k | 400–600k | ~300–500k waiting users |

Bottleneck is CPU first (Node single-threaded; JSON / JWT / Prisma overhead) then RAM at WS scale (~20–25 KB per idle connection). /access is much cheaper than join because it's Valkey-only on the hot path.
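The per-connection figure translates into RAM as follows. This is a back-of-the-envelope sketch using only the 20 to 25 KB per idle connection quoted above; the function name is hypothetical and the result ignores heap used by anything other than idle sockets.

```typescript
// Memory held by idle WebSocket connections, in GiB.
function wsMemoryGiB(connections: number, kbPerConnection: number): number {
  const bytes = connections * kbPerConnection * 1024;
  return bytes / 1024 ** 3;
}

// 100k idle connections at the low and high end of the quoted range.
const lowEstimate = wsMemoryGiB(100000, 20);  // about 1.91 GiB
const highEstimate = wsMemoryGiB(100000, 25); // about 2.38 GiB
```

At 100k connections the idle-socket footprint stays near 2 GiB, which fits the claim that CPU is exhausted first; only at the top of the largest tier (600k connections at 25 KB, about 14.3 GiB) does connection memory become a large share of the machine.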


Notification Delivery

The "Notification Workers" originally planned as Redis-list-consuming PHP workers were never built that way. Email and push took different paths; both are described below as actually implemented. Queue-related notifications (admitted, position_changed, etc.) are owned end-to-end by waiting-room and do not enter either of these pipelines.

Email — decoupled via NATS

EmailService::send() and EmailService::sendBulk() publish each email as a mail.send message to the NATS JetStream Event Bus (see App\Service\EmailService). The external mailer microservice (sibling stack) consumes the mail.> subjects via the eventbus's mailer-auth-events subscription and is the only component that talks to Mailgun. The ticketing backend has no Mailgun client and no Redis email queue.

┌──────────────────────────────────────────────────────────────┐
│                  HNS Ticketing Backend                        │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ Ticket Purchase (E4)  Quota (E7)  Support (E10) …        │ │
│  │         │                                                │ │
│  │         ▼                                                │ │
│  │   EmailService::send / sendBulk                          │ │
│  │   - renders Twig template                                │ │
│  │   - publishes `mail.send` with PubAck (JetStream)        │ │
│  │   - audit-logs PUBLISHED / PUBLISH_FAILED                │ │
│  └────────────────────┬─────────────────────────────────────┘ │
└───────────────────────┼───────────────────────────────────────┘
                        │
                        ▼  NATS JetStream
              ┌─────────────────────────┐
              │ stream:                 │
              │   hns_ticketing_events  │
              │ subjects: mail.>        │
              │ retention: 72h          │
              └────────────┬────────────┘
                           │
                eventbus subscription: `mailer-auth-events`
                           │
                           ▼ HTTP POST (X-API-Key)
                  ┌─────────────────┐
                  │ mailer service  │
                  │   (Mailgun)     │
                  └─────────────────┘

Properties:

  • Fire-and-forget after PubAck. Once JetStream returns PubAck, the backend considers the email delivered to the bus. The mailer service is responsible for the actual delivery through Mailgun downstream.
  • Idempotency via correlation_id in the envelope. Pass a deterministic value (e.g. order_confirmation:{orderId}) on retry-prone paths so the mailer's DEDUP_TTL_SECS window skips duplicates.
  • Bulk sends chunk recipients into Mailgun batches (chunkSize = 1000) and publish each chunk as one NATS message with recipient_variables.
  • Driver toggle: EMAIL_DRIVER=direct falls back to Symfony Mailer (used by MailerAssertionsTrait in tests). Production uses nats.
  • No retry/backoff loop in the backend. JetStream durability + the mailer-side consumer handle delivery semantics.
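The chunking and idempotency properties above can be sketched as two small helpers. The batch size of 1000 and the order_confirmation:{orderId} pattern come from the list; the function names and the :chunk<n> suffix for bulk sends are hypothetical and are not EmailService's actual code (which is PHP).

```typescript
// Split recipients into batches of at most `chunkSize` (1000 per the text above).
// Each batch would be published as one NATS message.
function chunkRecipients(recipients: readonly string[], chunkSize = 1000): string[][] {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks: string[][] = [];
  for (let i = 0; i < recipients.length; i += chunkSize) {
    chunks.push(recipients.slice(i, i + chunkSize));
  }
  return chunks;
}

// Deterministic correlation id, so a retried publish falls inside the
// mailer's dedup window instead of producing a second email.
// The `:chunk<n>` suffix is an assumed extension for bulk sends.
function correlationId(templateKey: string, entityId: string, chunkIndex?: number): string {
  const base = `${templateKey}:${entityId}`;
  return chunkIndex === undefined ? base : `${base}:chunk${chunkIndex}`;
}
```

The important property is that the id is derived from stable inputs (template and entity), never from a timestamp or random value, so two attempts at the same logical email produce the same id.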

Push — synchronous from the request thread

PushService calls the Firebase SDK (kreait/firebase-php) directly from the request thread, mirroring each message to the ntfy HTTP endpoint when NTFY_URL is set for dev visibility. There is no queue, no worker, no NATS publish.

Ticket Purchase / Loyalty Award / Support → PushService::sendToUser
                                                │
                                                ├─► Firebase FCM (kreait SDK, sendMulticast)
                                                ├─► ntfy.sh HTTP POST (dev only)
                                                │
                                                └─► writes PushLog row, deactivates invalid tokens

Known limitations (deliberately accepted in v1):

  • Blocks the request handler for the duration of the FCM round-trip. Fine at current scale; flagged as something to revisit if push volume grows or large fan-outs become common.
  • No retry/backoff inside the backend. Invalid-token errors are caught and the token row is marked inactive; everything else is logged at WARN.
  • No rate limiting. Firebase's documented quota (500 tokens/request, ~1k req/s) is not enforced on the backend side; we rely on FCM's own throttling.

The longer-term plan, tracked in ../hns-ticketing-eventbus/EVENT-CONTRACT.md, is to publish ticket.generated / loyalty.awarded / etc. NATS events and have a push-delivery consumer take over. Not built yet.

mail.send payload (canonical)

The shape EmailService publishes to mail.send, consumed by the mailer microservice. Bulk sends use the same shape with to as an array and recipient_variables set.

{
  "to": ["user@example.com"],
  "subject": "Vaše ulaznice — Hrvatska vs Italija",
  "html": "<rendered HTML>",
  "text": "<rendered text>",
  "from_email": "noreply@ulaznice.hns.family",
  "from_name": "HNS Ulaznice",
  "attachments": null,
  "recipient_variables": {"user@example.com": {}},
  "metadata": {
    "source": "ticketing-backend",
    "template_key": "order_confirmation",
    "correlation_id": "order_confirmation:01HZ…"
  }
}

For the full envelope (event, version, occurred_at, data) used across the bus, see ../hns-ticketing-eventbus/EVENT-CONTRACT.md.
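For consumers written in TypeScript, the payload above could be typed roughly as follows. The field names mirror the JSON sample; the element type of attachments is not specified in this document and is left as unknown, and the helper that checks for a recipient_variables entry per recipient is a hypothetical sanity check modelled on the sample, not a documented contract.

```typescript
// Shape of the mail.send payload, derived from the JSON sample above.
interface MailSendPayload {
  to: string[];
  subject: string;
  html: string;
  text: string;
  from_email: string;
  from_name: string;
  attachments: unknown[] | null; // element shape not specified here
  recipient_variables: Record<string, Record<string, unknown>>;
  metadata: {
    source: string;
    template_key: string;
    correlation_id: string;
  };
}

// Recipients listed in `to` that have no `recipient_variables` entry.
// In the sample every recipient has one, so a non-empty result is suspicious.
function missingRecipientVariables(payload: MailSendPayload): string[] {
  return payload.to.filter(
    (addr) => !Object.prototype.hasOwnProperty.call(payload.recipient_variables, addr),
  );
}
```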


Rate Limiting

Backend API is currently unprotected

Rate limiting today exists only at the waiting-room boundary — its public endpoints (POST /queues/{id}/tickets 30/min/IP, GET /access 1000/min/IP, POST /sessions 20/min/IP, etc.) use @fastify/rate-limit backed by waiting-room's own Valkey.

The ticketing backend's own API (/api/v1/* on hns-ticketing-backend) has no application-level rate limiting. Anything reachable without a sessionToken in front of it — cart, orders, users/me, portal endpoints, admin writes — can be hit at unbounded rates today. Closing this gap is tracked in HNSTIK-99.


Event Catalog

Cross-service events flow over NATS JetStream. The single source of truth for event shapes is ../hns-ticketing-eventbus/EVENT-CONTRACT.md; this section summarises what is wired up today vs. planned.

Published today

| Event | Publisher | Subscriber(s) | Trigger |
| --- | --- | --- | --- |
| mail.send (and mail.* family) | EmailService::send / sendBulk | mailer-auth-events → mailer microservice → Mailgun | Every transactional email |
| user.updated | EventBusPublisher::publishUserUpdate (from AdminUserService) | user-sync-keycloak, user-sync-drupal, user-sync-backend | Admin endpoints mutate user-profile fields |
| stripe.backend | Stripe Webhook Router (not the backend) | Ticketing backend (/api/v1/internal/webhooks/stripe) | Stripe payment event with metadata.site=backend |
| stripe.drupal | Stripe Webhook Router | Drupal webshop | Stripe payment event with metadata.site=drupal |

Roadmap (planned in EVENT-CONTRACT.md, not yet published)

Backend service code still fires these side effects synchronously; the move to NATS is staged work.

| Event | Currently fires synchronously in | Intended consumers when published |
| --- | --- | --- |
| ticket.generated, ticket.cancelled, ticket.transferred | TicketGenerationService, TicketCancellationService, TicketTransferService | Email + push delivery |
| order.completed, payment.failed | OrderController, PaymentService | Email + push, analytics |
| quota.invitation | QuotaService | Email + push |
| loyalty.awarded | LoyaltyAwardService (batch) | Push per user |
| match.cancelled | MatchController::cancel | Mass ticket cancellation, mass push/email |
| blacklist.enforced | BlacklistCancelService | Audit log |

Queue events

queue.turn_granted, queue.expired, and queue.position_update from the original spec do not exist on any platform-wide transport. They are dispatched inside waiting-room (WebSocket / FCM / APNs) and reach the backend only via the pull-based GET /access check on protected requests. See ADR 0001.


Deployment Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                              Edge / Ingress                              │
│         api.hns.hr           queue.hns.hr (waiting-room)                │
└──────────┬────────────────────────────────┬─────────────────────────────┘
           │                                │
           ▼                                ▼
┌──────────────────────┐   ┌────────────────────────────────────────────┐
│  HNS Ticketing       │   │  waiting-room (separate deployment)         │
│  Backend (Symfony)   │   │  ┌──────────────────────────────────────┐  │
│  Replicas: 5–20      │   │  │ Node 20 + Fastify 5                  │  │
│                      │   │  │ Replicas: 3–20 (Valkey Pub/Sub       │  │
│  Publishes:          │   │  │   fan-out across instances)          │  │
│   mail.send → NATS   │   │  └──────────────┬───────────────────────┘  │
│   user.updated →NATS │   │                 │                          │
│                      │   │   ┌─────────────▼─────────────┐            │
│                      │   │   │ Valkey 8                  │            │
└──────────┬───────────┘   │   │  queue:{id}:waiting       │            │
           │               │   │  queue:{id}:active        │            │
           │ GET /access   │   │  session:{ticket}         │            │
           │ (1–5s cached) │   │  + Pub/Sub channels       │            │
           └───────────────┼──▶└───────────────────────────┘            │
                           │   ┌───────────────────────────┐            │
                           │   │ PostgreSQL 16             │            │
                           │   │  Ticket / Queue / Tenant  │            │
                           │   └───────────────────────────┘            │
                           │   ┌───────────────────────────┐            │
                           │   │ BullMQ workers            │            │
                           │   │  expire-ticket / session  │            │
                           │   │  / sweep-expired          │            │
                           │   └───────────────────────────┘            │
                           └────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                       Backend's own Data Layer                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │  Redis 7 — backend-internal only                                 │    │
│  │   - /access decision cache (1–5s, capped at sessionExpiresAt)    │    │
│  │   - Symfony session/cache                                        │    │
│  │  No cross-service events (those are NATS). No queue state        │    │
│  │  (waiting-room's Valkey). No cart or seat-lock state (Postgres). │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │  PostgreSQL 16 — Tickets, orders, quotas, seats, blacklist,     │    │
│  │  cart + seat-lock state via match_seat_inventory.locked_until.  │    │
│  │  (No queue_entries — owned by waiting-room's Postgres.)         │    │
│  └─────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘

Failure Scenarios

Queue Service (waiting-room) Failure

| Scenario | Impact | Mitigation |
| --- | --- | --- |
| Single pod crash | WS clients on that pod disconnect | Kubernetes auto-restart < 5s; clients reconnect and receive a fresh state snapshot |
| Valkey unavailable | All queue operations + /access fail | Valkey with replicas; backend caches /access decisions for 1–5s, so a short outage is masked |
| Network partition between backend and waiting-room | /access returns 5xx / times out | Backend defaults to fail-closed (reject protected requests); emit a metric so the outage is loud |

Recovery: Tickets and admitted-session state survive a pod crash because both live outside the pod, in Valkey (sorted sets and session records) and Postgres (durable history). The hard TTLs (ticket_ttl_seconds, session_ttl_seconds) are the contract; a voluntary WS reconnect is only an optimisation.

Notification Delivery Failure

| Scenario | Impact | Mitigation |
| --- | --- | --- |
| NATS unavailable when backend publishes mail.send | EmailService logs PUBLISH_FAILED in the audit log and continues; the email is lost. | Fire-and-forget is the documented contract. Critical communications should be re-driveable (e.g. user re-requests the ticket PDF). NATS itself runs with JetStream durability + replicas; outages are short. |
| Mailer microservice down while NATS is up | Messages buffer in the mail.> JetStream consumer until the mailer comes back; 72h retention window before the bus drops them. | JetStream durability + the eventbus's per-subscription retry/backoff. No backend-side action required. |
| Mailgun outage | Mailer service handles delivery retries against Mailgun's documented backoff. | Owned by the mailer microservice; see its repo. |
| Firebase outage while a PushService call is in flight | The request handler logs WARN, writes a FAILED PushLog row, and returns to the user successfully; the push is not retried. | Accepted limitation in v1. Once push moves to NATS per the EVENT-CONTRACT roadmap, a consumer-side retry policy will close this gap. |
| Invalid FCM token | The token row is marked inactive so subsequent sends skip it. | Handled inline by PushService::deactivateToken. |


Last Updated: May 2026 (Queue Service built as waiting-room — see ADR 0001)