Overview
Company Introduction
We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did I ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation as a dominant and reliable force in South Korean commerce.
We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels our continued growth and lets us launch new services at the pace we have kept since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.
Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional trade-offs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.
Role Overview
We are seeking a Sr Staff System Engineer, GPU Fleet for our Coupang Intelligent Cloud (CIC) team, to serve as the senior technical owner for our hyperscale GPU compute infrastructure. In this role, you will define fleet architecture, drive reliability and automation at scale, and lead the operation and evolution of GPU systems supporting large‑scale AI training and inference workloads. This is a hands‑on, staff‑level individual contributor role with broad technical ownership, high operational impact, and significant cross‑functional influence across hardware, infrastructure, and datacenter operations.
CIC builds the infrastructure for abundant intelligence. We partner with leading AI labs, governments, and enterprises to deliver hyperscale GPU compute with high reliability, performance, and efficiency. Our infrastructure supports some of the most demanding AI training and inference workloads in production today.
We operate with urgency, deep ownership, and a strong bias toward execution. Reliability, operational excellence, and rigorous systems engineering are core to our business.
What You Will Do
As a Sr Staff System Engineer, GPU Fleet, you will be the senior technical owner for CIC’s large‑scale GPU compute infrastructure. This is a hands‑on senior individual contributor role with fleet‑level responsibility and broad cross‑functional influence.
You will define the technical direction for how GPU fleets are architected, operated, automated, and evolved across multiple generations of hardware. Your work will directly affect fleet reliability, operating efficiency, scalability, and customer success.
This role does not involve people management, but it carries principal‑level scope, autonomy, and decision‑making authority across infrastructure, hardware, and operations.
Key Responsibilities:
Fleet Architecture & Technical Ownership
- Own the end‑to‑end technical architecture of hyperscale GPU fleets, including hardware platform selection, firmware strategy, OS configuration, drivers, networking, and observability.
- Define and enforce technical standards and best practices for fleet reliability, availability, performance, and operability.
- Lead major fleet‑wide initiatives such as new GPU platform bring‑ups, multi‑generation hardware transitions, and architectural redesigns.
- Evaluate trade‑offs across cost, performance, reliability, and time‑to‑deploy, and make technically sound decisions under ambiguity.
Reliability, Availability & Performance
- Set and drive fleet‑level reliability, availability, and performance objectives.
- Lead root‑cause analysis and resolution of complex, systemic failures affecting large portions of the fleet or multiple datacenters.
- Identify recurring failure patterns and drive long‑term fixes spanning hardware, software, automation, and operational processes.
- Work directly with hardware vendors and partners to resolve platform‑level issues and influence future hardware designs.
Automation & Systems Engineering
- Design and build large‑scale automation systems for:
  - GPU fleet provisioning and lifecycle management
  - GPU health validation, diagnostics, and certification
  - Automated remediation, recovery, and replacement workflows
- Eliminate manual operational toil through durable, well‑designed tooling that scales with fleet growth.
- Ensure all fleet systems are observable, testable, and resilient under failure conditions.
Operational Leadership
- Act as a senior escalation point for critical production incidents impacting GPU availability or customer workloads.
- Participate in on‑call rotations with a strong emphasis on preventing future incidents, not just responding to them.
- Lead high‑severity post‑incident reviews and ensure learnings are translated into concrete engineering and process improvements.
Technical Influence & Mentorship
- Provide technical mentorship and guidance to system and infrastructure engineers across the organization.
- Serve as a trusted technical partner to platform engineering, networking, datacenter operations, and leadership teams.
- Influence CIC’s long‑term infrastructure roadmap through strong technical judgment and data‑driven recommendations.
Basic Qualifications
- 12+ years of overall experience, including at least 8 years in Linux systems engineering, infrastructure engineering, or datacenter operations, operating production environments with strict uptime and performance requirements.
- Deep, hands‑on expertise in Linux system internals, including process scheduling, memory management, filesystem behavior, networking, kernel behavior, and system performance analysis.
- Demonstrated experience operating hardware‑intensive infrastructure in production, including bare‑metal servers at scale.
- Proven ability to debug complex issues across multiple system layers, including hardware components, firmware/BIOS, kernel drivers, OS configuration, and user‑space services.
- Extensive experience writing production‑grade automation using Python and Bash for provisioning, configuration management, diagnostics, remediation, and fleet operations.
- Strong understanding of how to design systems that are observable, resilient, and safe under failure, rather than reliant on manual intervention.
Preferred Qualifications
- Direct experience operating large‑scale GPU fleets supporting AI/ML training and/or inference workloads in production.
- Familiarity with modern GPU platforms and ecosystems, including GPU drivers, CUDA, NCCL, and high‑performance compute workloads.
- Experience with high‑speed interconnects and datacenter networking, such as NVLink, InfiniBand, RDMA, and high‑throughput Ethernet.
- Prior ownership of fleet‑wide or platform‑wide initiatives, such as new hardware bring‑ups, major architectural changes, or reliability transformations.
- Experience partnering directly with hardware vendors or manufacturers to troubleshoot systemic issues or influence future platform designs.
- Strong intuition for failure modes at scale, including cascading failures, correlated faults, and second‑order effects across systems.
- History of acting as a technical authority or escalation point for ambiguous, high‑impact production problems.
- Ability to mentor engineers through design reviews and technical problem solving, and to model strong operational ownership.
- Experience participating in on‑call rotations and responding to high‑severity production incidents with clear ownership, urgency, and technical leadership.
- Strong written and verbal communication skills, including clear post‑incident reviews and technical documentation.
Type of work model
Hybrid
Details to consider
- Those eligible for employment protection (recipients of veterans’ benefits, people with disabilities, etc.) may receive preferential treatment for employment in accordance with applicable laws.
Privacy Notice
- Your personal information will be collected and managed by Coupang as stated in the Application Privacy Notice, available at: https://privacy.coupang.com/en/land/jobs/