
SQS vs EventBridge — Criteria for Designing an Event-Driven Architecture During MSA Transition

During the transition from a monolith to MSA, we adopted an AWS event-driven architecture to decouple services. This post shares the criteria we used to choose between SQS and EventBridge, and how we designed idempotency and DLQs.

Tags: AWS, SQS, EventBridge, MSA, Catenoid, Event-driven

Note: The code in this article has been conceptually rewritten based on actual work experience. It is not associated with the actual company code.

Event-Driven Architecture Structure

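[Diagram: The Video Upload Service sends a command through SQS to the Transcoding Lambda. The completion event goes through EventBridge routing to the Metadata Service, Notification Service, and Channel Service. A message that fails 3 times moves to the DLQ, which triggers an operator alert.]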

Introduction

At Catenoid, I was responsible for transitioning the Loomex media distribution management solution from a .NET monolith to a Node.js Microservices Architecture (MSA).

The biggest challenge in transitioning to an MSA is not the separation of features itself. It's the question of how the separated services communicate with each other. If designed poorly, you end up with a "distributed monolith" where the services are separated, but the coupling remains the same.


The Problem: The Limits of Synchronous Inter-Service Calls

If you separate services from a monolith and have them call each other via HTTP APIs, it looks like an MSA on the surface. However, in reality, problems like these arise:

[Video Upload Service]
    → HTTP Call → [Transcoding Service]
    → HTTP Call → [Metadata Service]
    → HTTP Call → [Notification Service]

  • The Upload Service is blocked until the Transcoding Service responds.
  • If the Transcoding Service goes down, the Upload Service also fails.
  • Every time a new follow-up task is added, the Upload Service code must be modified.

This is simply a monolith wearing an HTTP shell.


Design Criteria: Command vs Event

When introducing an event-driven architecture, the very first criterion we established was distinguishing between "Command" and "Event".

Classification | Definition | AWS Service
Command | A request to perform a specific action. Has exactly one handler. | SQS (Queue)
Event | A fact that a state change has occurred. Can have multiple handlers. | EventBridge (Bus)
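
The distinction in the table can be written down as types. The following is a minimal sketch, not the production code: the `Command` and `DomainEvent` shapes, the `chooseTransport` function, and the name `TranscodeVideo` are all hypothetical and only illustrate the rule "one handler goes to a queue, many subscribers go to a bus."

```typescript
// Hypothetical message shapes; only the command/event split comes from the table above.
type Command = {
  kind: 'command';
  name: string;      // imperative, e.g. "TranscodeVideo"
  payload: unknown;
};

type DomainEvent = {
  kind: 'event';
  name: string;      // past tense, e.g. "TranscodingCompleted"
  payload: unknown;
};

type Message = Command | DomainEvent;
type Transport = 'sqs' | 'eventbridge';

// A command has exactly one handler, so it goes to a queue.
// An event may have many subscribers, so it goes to a bus.
function chooseTransport(message: Message): Transport {
  switch (message.kind) {
    case 'command':
      return 'sqs';
    case 'event':
      return 'eventbridge';
  }
}

const transcodeRequest: Command = {
  kind: 'command',
  name: 'TranscodeVideo',
  payload: { videoId: 'video-123' },
};

const transcodingDone: DomainEvent = {
  kind: 'event',
  name: 'TranscodingCompleted',
  payload: { videoId: 'video-123', status: 'COMPLETED' },
};

console.log(chooseTransport(transcodeRequest)); // sqs
console.log(chooseTransport(transcodingDone));  // eventbridge
```

Naming helps keep the two apart in practice: commands read as imperatives, events as facts in the past tense.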

SQS Use Case: VOD Transcoding Request

When a video upload is complete, transcoding is required. Transcoding is a command delegated to a specific Lambda function.

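// Request transcoding after upload completion (Command → SQS)
// Assumes `sqs` is an AWS SDK v2 SQS client and the target is a FIFO queue.
await sqs.sendMessage({
  QueueUrl: TRANSCODING_QUEUE_URL,
  MessageBody: JSON.stringify({
    videoId: video.id,
    sourcePath: video.s3Path,
    targetFormats: ['mp4_720p', 'mp4_1080p']
  }),
  // FIFO queue: messages sharing a MessageGroupId are delivered in order.
  // The queue also needs content-based deduplication enabled, or a
  // MessageDeduplicationId passed here.
  MessageGroupId: video.id
}).promise();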

By using SQS, even if the Lambda goes down, the message stays in the queue until the queue's retention period expires. Once the Lambda recovers, processing resumes automatically.

EventBridge Use Case: Transcoding Completion Event

When transcoding is complete, multiple services need to know about this fact:

  • Metadata Service: Updates video information.
  • Notification Service: Sends a completion notification to the advertiser.
  • Channel Service: Updates the status of related channels.

// Publish transcoding completion event (Event → EventBridge)
await eventBridge.putEvents({
  Entries: [{
    Source: 'catenoid.transcoding',
    DetailType: 'TranscodingCompleted',
    Detail: JSON.stringify({
      videoId: video.id,
      status: 'COMPLETED',
      outputPaths: { mp4_720p: '...', mp4_1080p: '...' }
    }),
    EventBusName: 'catenoid-media-bus'
  }]
}).promise();

Each service independently subscribes to this event using EventBridge rules. Even if a new service is added, the Transcoding Service code doesn't need to be touched.
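
To make the subscription side concrete, here is a sketch of the event pattern a subscriber's rule could use. The `source` and `detail-type` keys are the real EventBridge pattern fields, and a rule takes the pattern as a JSON string (`JSON.stringify(pattern)`). The `matches` function is a simplified local stand-in written for this post, not EventBridge's actual matching logic, and the `catenoid.upload` / `UploadCompleted` event is a hypothetical counterexample.

```typescript
// Fields of a delivered event that the pattern below looks at.
type BusEvent = {
  source: string;
  'detail-type': string;
  detail: Record<string, unknown>;
};

// Subset of the EventBridge event pattern syntax: each key lists accepted values.
type EventPattern = {
  source?: string[];
  'detail-type'?: string[];
};

// Pattern a subscriber (for example the Metadata Service) could attach to its rule.
const transcodingCompletedPattern: EventPattern = {
  source: ['catenoid.transcoding'],
  'detail-type': ['TranscodingCompleted'],
};

// Simplified matcher for illustration only. Real matching happens inside
// EventBridge and supports more operators than exact value lists.
function matches(pattern: EventPattern, busEvent: BusEvent): boolean {
  const sources = pattern.source;
  const types = pattern['detail-type'];
  const sourceOk = sources === undefined || sources.indexOf(busEvent.source) !== -1;
  const typeOk = types === undefined || types.indexOf(busEvent['detail-type']) !== -1;
  return sourceOk && typeOk;
}

const completed: BusEvent = {
  source: 'catenoid.transcoding',
  'detail-type': 'TranscodingCompleted',
  detail: { videoId: 'video-123', status: 'COMPLETED' },
};

// Hypothetical unrelated event on the same bus.
const unrelated: BusEvent = {
  source: 'catenoid.upload',
  'detail-type': 'UploadCompleted',
  detail: {},
};

console.log(matches(transcodingCompletedPattern, completed)); // true
console.log(matches(transcodingCompletedPattern, unrelated)); // false
```

Because the pattern lives in the subscriber's rule, adding a fourth consumer means adding one more rule with the same pattern and a different target.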


Stability Design: DLQ and Idempotency

Dead Letter Queue (DLQ)

If processing a message fails repeatedly, SQS moves that message to the DLQ. This gives us three things:

  • Messages that fail to process are not lost.
  • They can be reprocessed after analyzing the cause of failure.
  • They don't block the processing of normal messages.

// Set up the DLQ when creating the SQS queue
const queueAttributes = {
  RedrivePolicy: JSON.stringify({
    deadLetterTargetArn: DLQ_ARN,
    maxReceiveCount: '3'  // Move to DLQ after 3 failures
  })
};
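
How a message accumulates those failed receives depends on what the consumer reports back. The sketch below assumes the Lambda's SQS event source mapping has `ReportBatchItemFailures` enabled, in which case the handler returns the IDs of the messages it could not process. Those messages become visible again and count toward `maxReceiveCount`, while the rest of the batch is deleted. The local types and the `makeHandler` / `Worker` names are hypothetical. Because the queue in this post is FIFO, the sketch stops at the first failure and returns the remaining messages unprocessed so that ordering is kept.

```typescript
// Minimal local types mirroring the fields of an SQS-triggered Lambda event used here.
type SqsRecord = { messageId: string; body: string };
type SqsBatch = { Records: SqsRecord[] };
type BatchResponse = { batchItemFailures: { itemIdentifier: string }[] };

// Hypothetical worker; stands in for the real transcoding logic.
type Worker = (job: unknown) => Promise<void>;

function makeHandler(work: Worker) {
  return async (batch: SqsBatch): Promise<BatchResponse> => {
    const batchItemFailures: { itemIdentifier: string }[] = [];
    let failed = false;

    for (const record of batch.Records) {
      if (failed) {
        // FIFO queue: after the first failure, hand the rest back unprocessed
        // so that ordering within the message group is preserved.
        batchItemFailures.push({ itemIdentifier: record.messageId });
        continue;
      }
      try {
        await work(JSON.parse(record.body));
      } catch (err) {
        console.error(`Processing failed for ${record.messageId}`, err);
        failed = true;
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    }

    // Messages not listed here are deleted from the queue. Listed ones are
    // retried and eventually moved to the DLQ by the redrive policy.
    return { batchItemFailures };
  };
}
```

Without this response type, a thrown error fails the whole batch, so messages that were processed successfully are retried along with the one that failed.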

Idempotency

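SQS Standard Queues guarantee "At-Least-Once Delivery," which means the same message can be delivered, and therefore processed, more than once. A FIFO queue narrows this but does not remove it: a message whose processing fails or outlives the visibility timeout is delivered again. Handlers therefore need to be idempotent.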

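// Prevent duplicate processing with an idempotency key
// (idempotencyStore and processTranscoding are assumed to be defined elsewhere)
import type { SQSEvent } from 'aws-lambda';

export const handler = async (event: SQSEvent) => {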
  for (const record of event.Records) {
    const messageId = record.messageId;
    
    // Check if the message has already been processed
    const isProcessed = await idempotencyStore.exists(messageId);
    if (isProcessed) {
      console.log(`Skipping already processed: ${messageId}`);
      continue;
    }
    
    await processTranscoding(JSON.parse(record.body));
    
    // Record completion
    await idempotencyStore.set(messageId, { processedAt: new Date() });
  }
};
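
One gap in a check-then-record flow is that two concurrent deliveries can both pass the check before either records completion. A common way to close it is to make the check and the write a single atomic "claim." The sketch below is one possible shape under that assumption: the `IdempotencyStore` interface, `InMemoryStore`, and `processOnce` are hypothetical names, and the in-memory store only stands in for a shared store that supports conditional writes.

```typescript
// Hypothetical store interface. claim() must be atomic: it returns true only
// for the first caller of a given key (for example via a conditional write).
interface IdempotencyStore {
  claim(key: string): Promise<boolean>;
  release(key: string): Promise<void>;
}

// In-memory stand-in for local experiments; it is not shared across Lambda instances.
class InMemoryStore implements IdempotencyStore {
  private keys = new Set<string>();

  async claim(key: string): Promise<boolean> {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }

  async release(key: string): Promise<void> {
    this.keys.delete(key);
  }
}

// Runs work at most once per key. On failure the claim is released so that
// a redelivered message can be processed again.
async function processOnce(
  store: IdempotencyStore,
  key: string,
  work: () => Promise<void>,
): Promise<'processed' | 'skipped'> {
  if (!(await store.claim(key))) return 'skipped';
  try {
    await work();
    return 'processed';
  } catch (err) {
    await store.release(key);
    throw err;
  }
}
```

In a handler this would wrap each record, for example `processOnce(store, record.messageId, () => processTranscoding(...))`. A production version would likely also need an expiry on claims, so that a process that crashes before releasing does not block the message forever.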

Bonus: Automatic S3 Cleanup

Temporary chunk files and thumbnails generated during live streaming should be automatically deleted after the stream ends.

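// Periodic cleanup using EventBridge Scheduler
export const cleanupHandler = async () => {
  // Channels that ended more than seven days ago
  const sevenDaysAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);
  const expiredChannels = await db.channels
    .findMany({ where: { status: 'ENDED', endedAt: { lt: sevenDaysAgo } } });

  for (const channel of expiredChannels) {
    // listChunkFiles is assumed to return S3 object identifiers: [{ Key: '...' }, ...]
    const objects = await listChunkFiles(channel.id);
    if (objects.length === 0) continue;  // Nothing to delete for this channel

    await s3.deleteObjects({
      Bucket: MEDIA_BUCKET,
      Delete: { Objects: objects }
    }).promise();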
  }
};
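
One detail to watch in a cleanup like this: S3's DeleteObjects accepts at most 1000 keys per request, and a long live stream can easily produce more chunk files than that. The following sketch splits a key list into request-sized batches. `toDeleteBatches` and the example key names are hypothetical.

```typescript
// S3 DeleteObjects accepts at most 1000 keys per request.
const MAX_KEYS_PER_DELETE = 1000;

type ObjectIdentifier = { Key: string };

// Splits a list of keys into batches that each fit into one delete request.
function toDeleteBatches(
  keys: string[],
  batchSize: number = MAX_KEYS_PER_DELETE,
): ObjectIdentifier[][] {
  if (batchSize <= 0) throw new Error('batchSize must be positive');
  const batches: ObjectIdentifier[][] = [];
  for (let i = 0; i < keys.length; i += batchSize) {
    batches.push(keys.slice(i, i + batchSize).map((Key) => ({ Key })));
  }
  return batches;
}

// Example: 2500 hypothetical chunk keys become batches of 1000, 1000, and 500.
const chunkKeys: string[] = [];
for (let i = 0; i < 2500; i++) {
  chunkKeys.push(`live/channel-1/chunk-${i}.ts`);
}

const deleteBatches = toDeleteBatches(chunkKeys);
console.log(deleteBatches.map((b) => b.length)); // [ 1000, 1000, 500 ]
```

Each batch can then be passed as `Delete: { Objects: batch }` in its own `deleteObjects` call.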

Results

Transitioning to an event-driven architecture gave us the following:

  • Looser inter-service coupling: The Upload Service functions normally even if the Transcoding Service goes down.
  • Easier feature additions: We can connect new services just by adding EventBridge rules, without modifying existing code.
  • Infrastructure cost efficiency: Heavy tasks like transcoding are decoupled into Lambdas that run only when needed.

Conclusion: When is Event-Driven Appropriate?

Event-driven architecture is not always the answer. It is particularly effective in the following cases:

  1. Long-running asynchronous tasks (video transcoding, sending emails, etc.)
  2. When multiple services must react to the same event (Pub/Sub pattern)
  3. Tasks that require retries upon failure (utilizing DLQs)

On the other hand, if an immediate response is required (e.g., payment confirmation) or if processing order is extremely strict, synchronous calls might be more appropriate.