As a Senior Staff Back-end SRE Engineer within the Site Reliability organization, you will work with large-scale cloud infrastructure handling billions of metrics and petabytes of logs.
You will use this data to help internal teams monitor service reliability and predict and prevent incidents. You will have the opportunity to build the next-generation Observability Platform based on Kubernetes and other OSS solutions, as well as build software components from scratch. You will work directly with engineering teams across Coupang, influence them with SRE principles and best practices, and see your impact directly.
- Architect and drive the build-out of fault injection and chaos engineering practices.
- Collect requirements, then architect, design, and implement the next-generation APM Platform on AWS so that teams company-wide can improve observability, reliability, and service availability.
- Work directly with internal teams to help them make effective use of our monitoring infrastructure, and evangelize SRE best practices.
- Write reliable, reusable code that scales to very large data volumes.
- BS or advanced degree in Computer Science, Computer Engineering, or Electrical Engineering
- 10+ years of software engineering experience
- Experience architecting, building, and maintaining large-scale service infrastructures
- Experience with Java, distributed systems, and microservices
- In-depth knowledge of metrics collection and visualization, log collection and aggregation, and tracing
- Strong AWS cloud background in both development and operations
- A strong team player with the ability to quickly triage and troubleshoot complex problems
- A strong SRE/DevOps background
- SLO/SLA management and implementation experience
- Experience building and deploying Kubernetes (K8s) or similar technology on AWS
- Experience successfully delivering large-scale platform services and tools
- Experience with various metrics, logging, tracing, and APM tools
- Publications and/or open-source projects related to service observability and system monitoring