Portfolio
Focused on technical problems I led, the thought processes behind my decisions, and the outcomes achieved.
AI-based Workflow Improvements
Separate initiatives focused on improving development, analysis, and QA workflows with AI.
-
Project Overview: For the AIDC (Ali International Digital Center) ad center integration project, directly designed and built an AI Agent-based development workflow and Playwright E2E QA automation pipeline to enable team members without frontend experience to participate in Next.js · React development
-
Key Achievements:
- Designed an AI Agent 4-source multimodal context framework in an environment without a formal design system
- Automated per-issue QA first-pass validation via Jira Evidence mapping + Playwright E2E + monitoring dashboard
- Built a development system enabling non-frontend engineers to actively contribute
-
Problem Situation
- AI Agent development needs a clear design reference point (a Design System); without one, the AI hallucinates.
- No formal design system existed, only the UX designer's Figma work, so there were no explicit component tokens or style guides for the AI to reference.
-
Technical Approach & Implementation
- Informal Design Token: Utilized HTML/CSS markup output agreed upon by publishers and designers as an informal Design Token to establish a Next.js porting workflow
- 4-Source Multimodal Context: Provided 4 data sources simultaneously to maximize the AI Agent's comprehension and accuracy
- Figma MCP metadata (layout & component intent), Publisher HTML/CSS output (markup reference), Publisher screenshots (visual grounding), Refined spec documents (feature & interaction definitions)
- QA Automation Pipeline:
- Jira ticket parsing → AI issue analysis & task decomposition → Evidence mapping (Jira·Confluence·GitHub grounding to constrain AI scope) → Playwright E2E test generation (Screenshot, Video Recording) → Dashboard 1st-pass automated validation → PR Review 2nd-pass human validation
-
Limitations & Improvement Directions
- No Visual Regression Testing: Need to evolve to Expected vs Actual pixel comparison for fully automated 1st-pass validation
- No Artifact Sync Strategy: Require version control or change-detection hooks to update AI context when publisher markup changes
- Unmeasured Test Coverage: Need Coverage Report integration to verify AI-generated E2E test coverage across critical user flows
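The Expected vs Actual pixel comparison named in the first limitation could start from something like the sketch below. It is not part of the current pipeline. The image representation (nested lists of RGB tuples), the function names, and the thresholds are all assumptions chosen for illustration; a real setup would decode the Playwright screenshots with an image library first.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels; hypothetical in-memory form


def diff_ratio(expected: Image, actual: Image, tolerance: int = 0) -> float:
    """Return the fraction of pixels where any channel differs by more than `tolerance`."""
    same_shape = len(expected) == len(actual) and all(
        len(row_e) == len(row_a) for row_e, row_a in zip(expected, actual)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")

    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0

    changed = 0
    for row_e, row_a in zip(expected, actual):
        for pixel_e, pixel_a in zip(row_e, row_a):
            if any(abs(ce - ca) > tolerance for ce, ca in zip(pixel_e, pixel_a)):
                changed += 1
    return changed / total


def passes_first_pass(expected: Image, actual: Image,
                      max_ratio: float = 0.01, tolerance: int = 8) -> bool:
    """Assumed gate: pass when at most `max_ratio` of the pixels changed."""
    return diff_ratio(expected, actual, tolerance) <= max_ratio
```

A per-channel tolerance absorbs anti-aliasing noise, while the ratio threshold decides whether a ticket still needs a human look.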
-
Project Overview: Designed and built an internal DeepWiki-like AI Agent analysis plugin for company legacy environments (Private Repos, MSSQL, Jira/Wiki) where public GitHub repo-oriented tools could not be applied as-is
-
Key Achievements:
- Built a business flow documentation system from client → Kafka topic → consumer → SP internals
- Reduced unnecessary DB access risk via a stepwise Evidence Architecture (Filesystem → GitHub → MSSQL)
- Separated verified facts from inference through evidence-grounded cross-validation
- Built a custom Atlassian CLI (atls) to cover gaps in MCP ecosystem support
- Designed a chain of 17 specialized Agents (analysis → validation → Confluence auto-publish)
- Multi-environment support: Gemini CLI, Claude Code, Codex, Antigravity IDE
-
Problem Situation
- Legacy systems where change history and documentation had become fragmented, making onboarding heavily dependent on veteran engineers
- Business logic context left only in code and operational traces after original owners had moved on
- Needed a DeepWiki-like automatic documentation experience, but public GitHub repo-oriented tools could not be applied directly to private repos, legacy DBs, and Jira/Wiki context
-
Technical Approach & Implementation
- Stepwise Evidence Architecture: Avoids direct DB exploration; narrows scope before querying in 3 stages
- Filesystem MCP (Local Search): First pass over API, consumer, SQL call sites, and config keywords in the local workspace
- GitHub MCP (Codebase): PRs, commits, CODEOWNERS to add change context, ownership, and call-path evidence
- MSSQL MCP (Targeted Query): Queries only the table and SP names identified in steps 1–2, then verifies SP definitions and dependencies
- Multi-source Cross-Validation: Cross-validates Filesystem → GitHub → Jira/Confluence → MSSQL sources to clearly separate verified facts from inference
- atls Python CLI (Custom Built): Wrapper CLI covering Atlassian features unsupported by MCP (Confluence page create/update, recursive Wiki scraping)
- 17 Specialized Agent Chain: Role-separated agent pipeline instead of a monolithic analyzer
- Core 6: bridge-analyzer (full flow), db-schema-analyst (DB·SP), business-workflow-analyst (business logic), cross-repo-tracer (multi-repo), evidence-auditor (quality audit), wiki-publisher (Confluence publish)
- 11 additional domain-specific Agents
- Mermaid 2-Stage Pre-Refinement: Pre-refine node relationships before asking AI to draw — prevents spaghetti diagrams
- Evidence Ledger: Maps every document claim to its source (code/config/wiki/github), clearly separating inference from verified fact
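The Evidence Ledger idea can be illustrated with a small data structure. This is a sketch, not the plugin's actual code: the class names, the `locator` field, and especially the rule that a claim counts as verified only when two distinct source types back it are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Source types named in the ledger description.
ALLOWED_SOURCES = {"code", "config", "wiki", "github"}


@dataclass(frozen=True)
class Evidence:
    source: str    # one of ALLOWED_SOURCES
    locator: str   # hypothetical pointer such as a file path or page title


@dataclass
class Claim:
    text: str
    evidence: List[Evidence] = field(default_factory=list)

    @property
    def status(self) -> str:
        # Assumed rule: cross-validated by at least two distinct source types.
        kinds = {item.source for item in self.evidence}
        return "verified" if len(kinds) >= 2 else "inference"


class EvidenceLedger:
    def __init__(self) -> None:
        self._claims: List[Claim] = []

    def add(self, text: str, *evidence: Evidence) -> Claim:
        for item in evidence:
            if item.source not in ALLOWED_SOURCES:
                raise ValueError(f"unknown source type: {item.source}")
        claim = Claim(text, list(evidence))
        self._claims.append(claim)
        return claim

    def by_status(self) -> Dict[str, List[str]]:
        """Group claim texts so a document can render facts and inference separately."""
        grouped: Dict[str, List[str]] = {"verified": [], "inference": []}
        for claim in self._claims:
            grouped[claim.status].append(claim.text)
        return grouped
```

Keeping the status derived from the evidence, rather than stored, means a claim cannot be marked verified without the sources to back it.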
-
Limitations & Improvement Roadmap
Business flow and data flow documentation is complete. Code-level detailed analysis remains the next frontier.
Level: Goal
- Level 2: AST-based Call Graph, SP branch logic → business rule mapping, ETL pattern tracing
- Level 3: Full bidirectional traceability: JIRA → code lines → Kafka → consumers → DB → SP
- Level 4: PR merge webhook → doc drift detection → auto-flag stale documentation
- Level 5: Natural language queries → auto-generated sequence diagrams with Evidence citations
Project Cases
- Project Overview: Completely redesigned a .NET & MS-SQL based CPC ad ranking system into a Spring Webflux + MongoDB reactive architecture, led as a 2-person team alongside the existing maintainer.
- Key Achievements:
- Reduced ranking update time from a maximum of 4 hours to under 3 minutes (a reduction of more than 98%).
- Achieved query response times under 10ms at the p99.9 percentile.
- Reduced CPU usage by 60%.
- Stably handled a peak TPS of 50,000.
-
Problem Situation
- Synchronous ranking updates based on .NET & MS-SQL took up to 4 hours, causing increased advertiser CS inquiries and refunds.
- Lack of processing capacity and instability during peak times (around TPS 30,000).
- High CPU usage due to synchronous blocking processing.
-
Technical Approach & Implementation
- Transition to Reactive Architecture: Introduced Spring Webflux to switch to non-blocking, asynchronous processing, securing high concurrency with fewer threads.
- MongoDB Optimization: Designed a denormalized model tailored to read-heavy data characteristics and configured compound indexes, continuously verifying query performance at the p99.9 percentile.
- Asynchronous Stream Processing: Improved CPU and memory efficiency by building the large-scale ranking update logic on Reactor Flux/Mono.
- Kafka Messaging: Delegated heavy update tasks at the trigger point to Kafka consumers to distribute API load.
- Datadog Monitoring: Real-time tracking of latency, TPS, and MongoDB query performance, plus bottleneck analysis.
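To make the p99.9 check concrete, here is a minimal sketch of a nearest-rank percentile over latency samples. In the project these figures came from monitoring rather than hand-written code; the function names and the 10ms budget parameter are illustrative assumptions. `Fraction` is used so that a percentile such as 99.9 does not pick up binary floating-point error when computing the rank.

```python
import math
from fractions import Fraction
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Exact rational arithmetic avoids off-by-one ranks from float rounding.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def within_budget(latencies_ms: Sequence[float], budget_ms: float = 10.0) -> bool:
    """Assumed check: p99.9 latency must stay under the budget."""
    return percentile(latencies_ms, 99.9) < budget_ms
```

At p99.9, one slow request in a thousand is enough to move the number, which is why the tail rather than the average is the figure tracked for a read-heavy ranking API.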
- Project Overview: Migrated CPC ad settlement (Bill/Pay) data and ranking update snapshot data from a Hue/Hadoop based environment to Databricks; led the pipeline design, implementation, and verification.
- Key Achievements: 0 batch failures post-migration (completely resolved recurring manual re-executions at dawn) / Secured ad settlement data accuracy through a step-by-step verification structure.
-
Problem Situation
- Frequent batch failures due to resource shortages in the Hue/Hadoop infrastructure → Required the person in charge to manually re-execute at dawn on a regular basis.
- Ad settlement (Bill/Pay) data and MongoDB ranking snapshot data were processed separately in disparate environments.
- Given the nature of ad settlement data, preventing omissions, duplications, and amount discrepancies was mandatory.
-
Technical Approach & Implementation
- Legacy Flow Analysis: Mapped the Hue/Hadoop based Bill/Pay creation/aggregation/verification flow, along with the purpose and processing order of the MS-SQL raw settlement data and the MongoDB ranking snapshot data.
- Reconstructing the Databricks Pipeline: Transitioned to a workflow of loading raw data → creating intermediate aggregation tables → producing the final settlement result table.
- Designing a Step-by-step Verification Structure: Built a structure in which failure points are traceable, preventing settlement errors in advance by automatically comparing record counts and total amounts against the intermediate aggregation tables.
- Verifying Migration Results against MS-SQL & MongoDB Raw Data: Verified migration accuracy based on major keys across ad, seller, and period dimensions.
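The count-and-amount comparison can be sketched as follows. The real checks ran as Databricks jobs against tables; this plain-Python version only shows the shape of the comparison. The tuple layout, the key columns (ad, seller, period), and the function names are assumptions for illustration.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]       # (ad_id, seller_id, period); hypothetical key columns
Summary = Tuple[int, Decimal]    # (record count, total amount)
RawRow = Tuple[str, str, str, Decimal]


def summarize_raw(rows: Iterable[RawRow]) -> Dict[Key, Summary]:
    """Recompute count and total per key directly from raw settlement rows."""
    counts: Dict[Key, int] = defaultdict(int)
    totals: Dict[Key, Decimal] = defaultdict(Decimal)  # Decimal() starts at 0
    for ad_id, seller_id, period, amount in rows:
        key = (ad_id, seller_id, period)
        counts[key] += 1
        totals[key] += amount
    return {key: (counts[key], totals[key]) for key in counts}


def compare(expected: Dict[Key, Summary], actual: Dict[Key, Summary]) -> List[str]:
    """List every key that is missing, unexpected, or has a different count or total."""
    problems: List[str] = []
    for key in sorted(expected.keys() | actual.keys()):
        if key not in actual:
            problems.append(f"missing {key}")
        elif key not in expected:
            problems.append(f"unexpected {key}")
        elif expected[key] != actual[key]:
            problems.append(f"mismatch {key}: expected {expected[key]}, got {actual[key]}")
    return problems
```

Running the same comparison after each stage is what makes a failure point traceable: the first stage that returns a non-empty list is where the omission, duplication, or amount discrepancy entered. `Decimal` keeps monetary totals exact.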
- Project Overview: Designed and implemented a custom 3-stage synchronization pipeline to solve data consistency issues during the transition of the CPC ad ranking system from MS-SQL to MongoDB.
- Key Achievements: Guaranteed real-time data consistency without a commercial CDC solution / Automatic recovery of failed transactions / Reduced advertiser CS inquiries through eventual consistency verification in the background.
-
Problem Situation
- Intermittent data inconsistencies occurred during the transition from MS-SQL to MongoDB.
- Securing real-time consistency was mandatory as ad ranking data is a core business asset.
- Adopting a commercial CDC (Change Data Capture) solution was impossible due to budget and technical constraints.
-
Technical Approach & Implementation
The initial approach of writing simultaneously to both DBs was rejected due to the complexity of distributed transaction management and concerns over worsening inconsistencies upon simultaneous failures. Instead, we custom-designed a 3-stage tiered architecture.
- Stage 1 (Real-time Sync, API Level): Attempted immediate reflection to MongoDB upon data changes using Spring Webflux asynchronous processing.
- Stage 2 (Failure Recovery via Message Queue): If the MongoDB write failed, published the change to a Kafka topic, where a Sync Consumer subscribed and safely reprocessed it. Preserved per-key ordering by partitioning on the message key (uuid).
- Stage 3 (Background Verification, Eventual Consistency): Developed a Batch Job that periodically compared both DBs based on the MS-SQL change history and automatically corrected any inconsistencies.
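The three stages can be sketched in memory as below. This is an illustration only: plain dictionaries stand in for MS-SQL and MongoDB, a deque stands in for the Kafka topic, and the class and function names are hypothetical. The production version used Spring Webflux, a Kafka Sync Consumer, and a batch job.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Doc = Dict[str, object]
Writer = Callable[[str, Doc], None]


class SyncPipeline:
    def __init__(self, source: Dict[str, Doc], write_target: Writer) -> None:
        self.source = source              # stands in for MS-SQL (source of truth)
        self.write_target = write_target  # stands in for the MongoDB write
        self.retry_queue: Deque[Tuple[str, Doc]] = deque()  # stands in for the Kafka topic

    def on_change(self, key: str, doc: Doc) -> None:
        """Stage 1: write through immediately; on failure, hand off to stage 2."""
        self.source[key] = doc
        try:
            self.write_target(key, doc)
        except ConnectionError:
            self.retry_queue.append((key, doc))

    def drain_retries(self) -> int:
        """Stage 2: replay failed writes in arrival order; a repeat failure propagates."""
        processed = 0
        while self.retry_queue:
            key, doc = self.retry_queue.popleft()
            self.write_target(key, doc)
            processed += 1
        return processed


def reconcile(source: Dict[str, Doc], target: Dict[str, Doc]) -> List[str]:
    """Stage 3: copy any source record the target lacks or holds differently."""
    fixed: List[str] = []
    for key, doc in source.items():
        if target.get(key) != doc:
            target[key] = dict(doc)
            fixed.append(key)
    return fixed
```

Each stage only has to catch what the previous one missed, so no stage needs a distributed transaction across both stores.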
-
Results
- Secured data reliability with one-way failure recovery + two-way eventual consistency verification.
- Improved operational efficiency and reduced advertiser CS inquiries caused by data inconsistencies.
- Project Overview: Analyzed and resolved Kafka partition skew and Redis bulk processing performance degradation that occurred during the ranking system improvement process.
- Key Achievements: 4x faster Redis bulk processing speed (reduced from 240 sec to 60 sec per 1 million records) / Resolved Kafka message distribution imbalance → Secured stable and predictable processing performance.
-
Issue 1 — Partition Skew Due to Kafka StickyPartitioner Bug
- Symptom: Abnormal distribution where messages were concentrated in specific partitions, leaving others nearly empty.
- Root Cause Analysis: Identified a skew bug in the StickyPartitioner of Kafka Client below 3.3 that occurs when a low linger.ms setting overlaps with broker latency. Additionally, found a bug in the RoundRobinPartitioner for versions 2.4 and above where the partition() method is called twice per record, so messages land only on even or only on odd partitions.
- Resolution: Upgraded to Kafka Client 3.3 or higher and relied on the producer's built-in default partitioning, which contains the fix, restoring even distribution.
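The even/odd effect of a doubled partition() call is easy to reproduce in a toy model. The simulation below is not Kafka client code; the class and function names are made up for illustration. It only shows why advancing a round-robin counter twice per record skips every other partition when the partition count is even.

```python
from collections import Counter
from itertools import count


class RoundRobin:
    """Toy round-robin chooser: every call advances the shared counter."""

    def __init__(self, num_partitions: int) -> None:
        self._num_partitions = num_partitions
        self._counter = count()

    def partition(self) -> int:
        return next(self._counter) % self._num_partitions


def send_all(records: int, num_partitions: int, calls_per_record: int) -> Counter:
    """Count records per partition when partition() runs `calls_per_record` times each."""
    chooser = RoundRobin(num_partitions)
    hits: Counter = Counter()
    for _ in range(records):
        chosen = -1
        for _ in range(calls_per_record):
            chosen = chooser.partition()  # only the last result is actually used
        hits[chosen] += 1
    return hits


healthy = send_all(records=8, num_partitions=4, calls_per_record=1)
buggy = send_all(records=8, num_partitions=4, calls_per_record=2)
```

In this model `healthy` spreads two records onto each of the four partitions, while `buggy` puts four records each on partitions 1 and 3 and none on 0 and 2.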
-
Issue 2 — Bulk Processing Performance Degradation Due to RedisTemplate Individual TCP Connections
- Symptom: Batch processing of 1 million Redis SET commands took 240 seconds (4 minutes), 4 times longer than expected.
- Root Cause Analysis: While monitoring Redis connections via Datadog, identified an explosive increase in connections during batch execution. Found that the RedisTemplate default configuration establishes and tears down a new TCP connection for every command.
- Resolution: Switched to the Lettuce native API, cutting I/O overhead with connection pooling + pipelining (internal command queuing) → Processing time reduced from 240 sec to 60 sec (75% reduction).
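Why pipelining helps can be shown by counting round trips. The sketch below uses an in-memory stand-in, not a real Redis client; `FakeRedis`, `execute_pipeline`, and the batch size of 1,000 are assumptions for illustration, and the 4x figure above comes from the actual measurement, not from this model.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


class FakeRedis:
    """In-memory stand-in that counts network round trips; not a real client."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.round_trips = 0

    def set(self, key: str, value: str) -> None:
        self.round_trips += 1          # one request/response per command
        self.data[key] = value

    def execute_pipeline(self, commands: List[Pair]) -> None:
        self.round_trips += 1          # the whole batch shares one round trip
        for key, value in commands:
            self.data[key] = value


def batched(items: Iterable[Pair], size: int) -> Iterator[List[Pair]]:
    """Yield lists of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Pair] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_set(client: FakeRedis, pairs: Iterable[Pair], batch_size: int = 1000) -> None:
    for batch in batched(pairs, batch_size):
        client.execute_pipeline(batch)
```

Bounding the batch size keeps memory use on both sides predictable while still removing most of the per-command latency.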
- Project Overview: Launched a new video ad product by integrating an external VOD solution (Shoplive) into a legacy .NET ad system originally focused on image ads.
- Key Achievements: Contributed to new ad revenue worth at least billions of KRW with the launch of the new video ad product / Successfully added features without impacting the stability of the legacy system.
-
Problem Situation
- Needed to add new video ad capabilities to a legacy .NET ad system designed exclusively for image ads.
- Given the nature of the legacy system, code changes carried the risk of cascading failures, making minimizing the scope of impact a core constraint.
- Required real-time metadata synchronization with the external VOD solution (Shoplive).
-
Technical Approach & Implementation
- Webhook-based Sync Design: To avoid touching the internals of the legacy system as much as possible, adopted a method of one-way metadata synchronization by receiving processing completion events from the external solution via webhooks.
- Stabilizing External Integration:
- Implemented retry logic for external system integration failures.
- Updated the internal status to the latest by checking the actual processing status of the external system when querying ads (preventing status inconsistencies).
- Secured operational visibility by adding detailed logging.
- Minimizing Legacy Impact: Performed minimal changes such as adding a video type field, implementing file processing logic, and extending the subsequent ad API server, followed by deployment after integration testing.
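The retry logic for external integration failures might look like the sketch below. The original was implemented inside the .NET system; this version is illustrative only, and the exponential backoff policy, the attempt count, and the choice of `ConnectionError` as the retryable failure are assumptions, not details taken from the project.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(operation: Callable[[], T],
                    attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying on ConnectionError with exponential backoff.

    Waits base_delay, then 2x, then 4x ... between attempts and re-raises
    the last error once `attempts` is exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the function testable without real waiting. Only transport-level failures are retried here, so a malformed payload fails fast instead of being retried.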
- Project Overview: Designed and introduced an event-driven architecture to resolve inter-service coupling during the transition of the Loomex media distribution management solution from a .NET monolith to a Node.js MSA.
- Key Achievements: Eliminated synchronous dependencies between services and secured a flexibly scalable structure / Optimized infrastructure costs by introducing asynchronous serverless processing.
-
Problem Situation
- Synchronous calls between services in the tightly coupled monolithic legacy system → Increased coupling, propagated failures.
- Structural limitations of performing long-running tasks like media transcoding via API synchronous processing.
-
Technical Approach & Implementation
Designed the infrastructure by distinguishing inter-service communication methods into "Commands" and "Events":
- AWS SQS (Command): Used when specific tasks require asynchronous processing (e.g., VOD transcoding requests). The processing service or Lambda pulls and processes messages from the queue.
- AWS EventBridge (Event): Used when multiple services need to independently subscribe to state change events (e.g., encoding completion, channel state changes). Eliminated inter-service coupling using the Publish/Subscribe pattern.
- Message Processing Stability: Supported preservation and reprocessing of failed messages by configuring a DLQ (Dead Letter Queue). Prevented duplicate processing by applying an idempotency key (message ID) in the Lambda processing logic.
- Automated Resource Cleanup: Automatically deleted S3 stream chunks and thumbnails using a Lambda + EventBridge scheduler (applied DB-based retention policies).
- CloudWatch Monitoring: Set up alarms for core metrics such as SQS queue depth and Lambda execution error rates.
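The idempotency-key and DLQ behavior can be modeled in a few lines. This is an in-process sketch with hypothetical names, not the Lambda code: in SQS the move to the dead-letter queue is done by the queue's redrive policy once a message exceeds its maximum receive count, and a real handler would keep processed IDs in a durable store rather than in memory.

```python
from typing import Callable, List, Set, Tuple


class IdempotentConsumer:
    def __init__(self, handler: Callable[[str], None], max_receives: int = 3) -> None:
        self.handler = handler
        self.max_receives = max_receives        # mirrors a redrive receive limit
        self.processed_ids: Set[str] = set()    # idempotency keys already handled
        self.dead_letters: List[Tuple[str, str]] = []

    def receive(self, message_id: str, body: str, receive_count: int) -> str:
        """Process one delivery and report what happened to it."""
        if message_id in self.processed_ids:
            return "duplicate"                  # redelivery of finished work: skip
        try:
            self.handler(body)
        except Exception:
            if receive_count >= self.max_receives:
                self.dead_letters.append((message_id, body))
                return "dead-lettered"          # preserved for later reprocessing
            return "retry"                      # leave on the queue for redelivery
        self.processed_ids.add(message_id)
        return "processed"
```

Recording the message ID only after the handler succeeds means a crash mid-processing leads to a retry, not a silently dropped message.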
- Project Overview: Fully automated the 6-step manual setup process for customers' social media simulcasting on a live streaming platform into a single authentication flow.
- Key Achievements: Automated the 6-step setup process (like manual stream key input) into a 1-time authentication / Drastically reduced CS inquiries caused by setup errors / Achieved full automation for simultaneous YouTube and Facebook broadcasting.
-
Problem Situation
- To simulcast to YouTube and Facebook, customers had to manually visit each platform and complete a 6-step manual setup, including issuing a stream key.
- Visit social platform → Login → Go Live → Issue Stream Key & Server URL → Paste into our solution → Start Broadcast.
- Non-technical customers frequently generated CS inquiries due to typos and setup errors while manually entering stream keys.
- The complexity of integrating and managing different API specs, authentication methods, and setup procedures for each platform.
-
Technical Approach & Implementation
- Implementing OAuth 2.0 Auth Flow: Registered apps in the developer consoles of each platform (YouTube, Facebook), handled the Redirect URI on the backend, exchanged authorization codes, acquired Tokens, and stored them in the DB. Eliminated the burden of re-authentication through automatic Access Token renewal based on Refresh Tokens.
- Abstracting API Modules per Platform: Implemented YouTube and Facebook API clients respectively and abstracted features like live creation and stream info lookup into an internal standard interface → Making it easy to add new platforms.
- Implementing Automation Workflow: After the initial authentication, invoked APIs sequentially using stored tokens (Create Live → Acquire Stream Info → Auto-configure Catenoid Streaming Engine). Handled success/failure and rollbacks at each step.
- Security: Encrypted and stored user token information in the DB.
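The automatic Access Token renewal can be sketched as a small manager that refreshes shortly before expiry. The names, the 60-second safety margin, and the injected `refresh` callable are assumptions for illustration; the real refresh call goes to each platform's token endpoint, and the tokens are stored encrypted in the DB.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TokenSet:
    access_token: str
    refresh_token: str
    expires_at: float   # epoch seconds when the access token expires


class TokenManager:
    def __init__(self,
                 tokens: TokenSet,
                 refresh: Callable[[str], TokenSet],
                 clock: Callable[[], float] = time.time,
                 skew_seconds: float = 60.0) -> None:
        self._tokens = tokens
        self._refresh = refresh          # hypothetical: exchanges a refresh token for new tokens
        self._clock = clock
        self._skew = skew_seconds        # refresh slightly early to avoid mid-request expiry

    def access_token(self) -> str:
        """Return a usable access token, renewing it first if it is about to expire."""
        if self._clock() >= self._tokens.expires_at - self._skew:
            self._tokens = self._refresh(self._tokens.refresh_token)
        return self._tokens.access_token
```

With one manager per platform behind the internal standard interface, callers can always ask for a token without caring whether a renewal happened.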
- Project Overview: Resolved memory shortage and timeout issues that occurred when uploading media files of several GBs by implementing chunked uploads and a resume feature.
- Key Achievements: Completely resolved timeouts and memory shortages during large file uploads / Secured practically a 100% upload success rate barring file-specific issues / Supported resuming from the point of interruption upon network disconnection.
-
Problem Situation
- Single HTTP uploads of several GBs frequently caused browser memory shortages, server memory overflows, and network timeouts.
- User inconvenience of having to retry from the beginning upon an upload failure.
-
Technical Approach & Implementation
- Client Splitting & Transmission: Split the file into fixed-size chunks in JavaScript and transmitted them via FormData including metadata like file ID, total chunk count, and chunk index (supported parallel transmission).
- Server Reception, Storage, & Reassembly: Implemented a chunk reception API on the Node.js server and saved chunks in a temporary directory (/tmp/{fileId}/{chunkIndex}). Upon receiving all chunks, merged them sequentially, saved the final file, and deleted the temporary files.
- Implementing Resume:
- Called an API to query the list of already uploaded chunks on the server before starting the upload.
- The server scanned the temporary directory and responded with a list of existing chunk indexes.
- The client re-transmitted starting from the last successful chunk (applied defensive logic to re-transmit the last successful chunk as well, considering the possibility of it terminating abnormally).
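The resume decision can be expressed as a pure function over the chunk indexes the server reports. The production client was JavaScript; this sketch uses Python for consistency with the other examples. The function names are hypothetical, and treating the highest reported index as the "last successful chunk" is an assumption made for the example.

```python
from typing import Iterable, List, Tuple


def chunk_count(file_size: int, chunk_size: int) -> int:
    """Number of fixed-size chunks needed to cover the file (integer ceiling)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return (file_size + chunk_size - 1) // chunk_size


def byte_range(index: int, chunk_size: int, file_size: int) -> Tuple[int, int]:
    """Half-open [start, end) byte range for a chunk; the last chunk may be shorter."""
    start = index * chunk_size
    return start, min(start + chunk_size, file_size)


def chunks_to_send(total_chunks: int, uploaded: Iterable[int]) -> List[int]:
    """Indexes to (re)transmit: every missing chunk plus the last reported one.

    The last reported chunk is re-sent defensively because the previous
    session may have been cut off while it was still being written.
    """
    present = {i for i in uploaded if 0 <= i < total_chunks}
    pending = {i for i in range(total_chunks) if i not in present}
    if present:
        pending.add(max(present))
    return sorted(pending)
```

Because parallel transmission can leave gaps, the function returns every missing index rather than assuming the uploaded chunks form a contiguous prefix.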
- Project Overview: Completely redeveloped the server for Gmarket's external display ad network, EDN+ (Ebay Display Network AD), from a Node.js legacy to a Spring base, and implemented ad placement integration with Karrot (Danggeun Market).
- Key Achievements:
- Node.js Legacy → Completed construction of the new Spring-based server.
- Implemented Karrot Ad Placement Integration — Directly served APIs for banner ads displayed on Karrot.
- Handled the overall backend system of external network ad products similar to Google GDN and Criteo AD Choices.
-
EDN+ Product Introduction
- EDN+ is Gmarket's external display ad network product.
- Exposes Gmarket ads on external digital media (news sites, portals, online communities, etc.) like Google GDN and Criteo AD Choices.
- Includes a DMP (Data Management Platform) and an ad selection engine based on retargeting, lookalike audiences, and user targeting.
-
Development Scope & Implementation Details
- New Server Development: Fully redeveloped the existing Node.js-based legacy EDN+ server to a Spring base.
- Implemented backend logic for the ad selection engine (Retargeting, Lookalike, User Targeting).
- Implemented APIs for external media ad placement quality management.
- Implemented an HTML-based ad creative template processing server.
- Developed a backend admin system for advertiser/publisher settlements.
- Karrot Ad Placement Integration:
- Discussed and designed API integration specs with Karrot.
- Implemented an API that selects and returns the appropriate ad when an ad is requested from a Karrot placement.
- All banner ads exposed on Karrot are served through this API.