Staff, Site Reliability Engineer (Tech Infra), Seoul, South Korea

Description

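About Us

Coupang exists to wow our customers. We know we are fulfilling our mission when customers ask, "How did I ever live without Coupang?" Driven by a single-minded commitment to making shopping, eating, and everyday life easier, Coupang is leading innovation across an e-commerce industry worth hundreds of millions of dollars. As one of the fastest-growing e-commerce companies, Coupang has built an unrivaled position and customer trust in South Korean commerce.

We are proud to be a large, global public company built on a startup culture. That culture is our growth engine: it lets us keep the agility of our founding days while continually launching new services and expanding the business. Every Coupang employee has the opportunity to act with an entrepreneurial spirit and drive new innovations and initiatives. The boldness to dive into work without hesitation and deliver results is the essence of how Coupang works. At Coupang, you will see yourself, your colleagues, your team, and the entire company grow every day.

Every Coupang employee is committed to our mission of building the future of commerce. We solve customers' problems, challenge conventional thinking, and push past the limits of what is possible. If you want an extraordinary work experience in an always-on, high-tech, hyper-connected world, join Coupang now.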

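About the Role

A Site Reliability Engineer (SRE) at Coupang fills a core role that combines software engineering and systems engineering to build, operate, and scale large-scale e-commerce systems.

As a member of the SRE team, you will be responsible for ensuring that all customer-facing services run reliably, are continuously monitored, are automated, and are designed to scale. The SRE organization takes an automation-first approach under the philosophy of solving operations as an engineering problem.

In this position, you will build best-in-class infrastructure automation across areas such as Observability, Incident Management, Disaster Recovery, Load Testing, and Capacity Engineering. You will also collaborate closely with development teams across the full lifecycle, from the early stages of product development through resolving issues that arise in production.

In addition, you will maintain the SLI/SLA standards of production services and spread SRE principles and best practices across the organization.

If you are passionate about solving complex technical problems in large-scale distributed systems, have a strong sense of ownership, and can collaborate and communicate smoothly across teams, join Coupang now!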

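Responsibilities

Serve as a primary owner responsible for the reliability, health, and performance of all Coupang customer-facing services
Gain a deep understanding of Coupang application workflows and dependencies
Define and manage KPIs and SLOs related to system availability, performance, and reliability
Build Incident Management processes and automation, including fast incident recovery, operational reviews, and postmortems
Establish best practices for building and operating effective monitoring, alerting, and telemetry systems
Build automation for regular Disaster Recovery testing and Load Testing to prepare for service growth
Work closely with product development teams to deliver designs that account for scalability and operability
Build production deployment guardrails and automation to maintain service reliability
Participate in a 24x7 on-call rotation and respond to issues in a fast-paced environment
Communicate effectively with people at all levels of the organization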

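Basic Qualifications

5+ years of experience building and operating large-scale distributed systems
Deep understanding of and operational experience with UNIX/Linux systems
Programming skills in one or more of Python, Java, Golang, and Ruby
Problem-solving and analytical skills spanning systems, networks (TCP/IP), and code, including data-driven decision-making
Experience with cloud infrastructure such as AWS, Azure, or Google Cloud Platform
Practical understanding of DevOps and SRE practices such as CI/CD and IaC (Terraform experience is a plus)
Experience with container and orchestration technologies such as Docker and Kubernetes
Communication skills that enable collaboration across different organizations and technical domains
Experience with Observability tools such as Prometheus, Grafana, Elastic Stack, Datadog, or New Relic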

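Preferred Qualifications

Bachelor's degree in computer science, engineering, or a related field
Experience with large-scale web-based Java architectures and JVM configuration
Certifications in related technologies such as cloud platforms or monitoring
Experience with a large-scale e-commerce platform

Office: Coupang Seolleung Office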

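Hiring Process and Notes

Hiring Process
- Resume screening (English resume required) - 1st video technical interview - 2nd video technical interview - Final offer
- The hiring process may differ by role and may change depending on schedule and circumstances.
- Interview schedules and results will be sent individually to the email address registered in your application.

Notes
- This posting may close early once the position is filled.
- An offer may be withdrawn if the application contains false information.
- Applicants eligible for employment protection (veterans, persons with disabilities, etc.) may receive preferential treatment in hiring in accordance with applicable law.
- Job level and scope of responsibilities may be adjusted in consideration of the candidate's overall career, experience, and other circumstances. If such a change is needed, it will be communicated to the candidate at an appropriate time before the final offer notification.
- Hiring may be restricted if the candidate does not hold the legal qualifications required for employment and performance of the role.
- This is a regular full-time position and includes a 12-week probation period. However, where required for business reasons, the probation period may be waived, shortened, or extended.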

Privacy Policy

Coupang Group collects and processes your personal information in accordance with the Applicant Privacy Policy (link below). https://www.coupang.jobs/kr/privacy-policy

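Document Return Policy

1. This notice is given pursuant to Article 11(6) of the Fair Hiring Procedure Act.
2. Under the Fair Hiring Procedure Act, applicants who applied to the company and were not finally selected may request the return of the hiring documents they submitted. However, this does not apply to documents submitted via the website or email, or to documents the applicant submitted voluntarily without the company's request. If hiring documents are lost or destroyed due to a natural disaster or other causes for which the company is not responsible, they are deemed to have been returned.
3. An applicant requesting the return of hiring documents under the main text of item 2 above should complete the Hiring Document Return Request Form [Attached Form No. 3 of the Enforcement Rule of the Fair Hiring Procedure Act] and submit it by email ([email protected]). The documents will be sent by registered mail to the designated address within 14 days of the date the submission is confirmed. Please note that the registered mail postage is borne by the recipient.
4. To prepare for return requests under the main text of item 2 above, the company retains the original hiring documents submitted by applicants for 180 days from the date the hiring decision is finalized. If no return request is made by then, all hiring documents will be destroyed without delay in accordance with the Personal Information Protection Act.
However, items 1 through 4 above apply only where the labor laws of the Republic of Korea apply. They do not apply in any other case.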

About the Role:

Site Reliability Engineering (SRE) at Coupang is a mission-critical role that combines software and systems engineering to build, run, and scale our complex, large-scale e-commerce systems. As part of the Site Reliability Engineering team, you will be responsible for ensuring that all our customer-facing services are healthy, monitored, automated, and designed to scale. As an SRE organization, we take pride in treating operations as an engineering problem, with an automation-first approach. You will use your background to build best-in-class infrastructure automation for areas such as observability, incident management, disaster recovery, load testing, capacity engineering, and more. In this role, you will work closely with our product development teams from the early stages of design all the way through resolving production incidents, maintaining the SLI/SLA bar for production services, and influencing those teams with SRE principles and best practices. If you take pride in complete ownership, have a passion for solving complex technical challenges in large-scale distributed systems, and can work and communicate effectively across team boundaries, this is the role for you!

Key Responsibilities:

· Serve as a primary point of responsibility for the reliability, health, and performance of all Coupang customer-facing services.

· Gain deep knowledge of Coupang application workflows and dependencies.

· Define and track key performance indicators (KPIs) and service-level objectives (SLOs) related to system availability, performance, and reliability.

· Build a world-class incident management process and automation, including fast incident remediation, incident operational reviews, and retrospectives.

· Develop and implement best practices for creating and maintaining effective monitoring, alerting, and telemetry systems.

· Build automation to run regular disaster recovery testing and load testing to stay ahead of the expected growth of Coupang services.

· Work closely with product development teams to ensure the products are designed with scale and operability in mind.

· Build the right guardrails and automation for deploying production changes while holding the reliability bar.

· Participate in a 24x7 on-call rotation for production issue escalations and function well in a fast-paced environment.

· Communicate effectively with people at all levels of the organization.

Essential Qualifications:

· 5+ years of industry experience building and operating large-scale distributed systems.

· Deep UNIX/Linux systems knowledge and administration background.

· Demonstrated programming skills in one or more of: Python, Java, Golang, Ruby.

· Strong problem-solving and analytical skills spanning systems, networks (TCP/IP), and code, with a focus on data-driven decision-making.

· Experience with cloud-based infrastructure, including AWS, Azure, or Google Cloud Platform.

· Strong understanding of DevOps and SRE practices, including continuous integration, continuous delivery, and infrastructure as code (IaC). Experience with Terraform is a plus.

· Experience with containerization and orchestration technologies, such as Docker and Kubernetes.

· Excellent communication and collaboration skills, with the ability to work with teams across distinct functions and technical domains.

· Knowledge of the observability ecosystem, including metrics, logging, and tracing, and of tools such as Prometheus, Grafana, Elastic Stack, Datadog, or New Relic.

Preferred Qualifications:

· Bachelor's degree in computer science, engineering, or a related technical field.

· Prior experience working with large-scale web-based Java architectures and JVM configuration.

· Professional certifications in cloud platforms, monitoring tools, or related technologies.

· Previous experience working on a large-scale e-commerce platform.

Office: Seoul, Korea

Location	Seoul, South Korea
Updated	6/24/2026