About Us
To keep Coupang's IT services stable, the IT Reliability Engineering team operates monitoring systems and processes for IT infrastructure and applications. The team is responsible for ensuring and improving monitoring visibility. When an event or incident occurs, the team works with the engineering team to resolve it and manages the relevant metrics. To ensure service continuity, the team regularly conducts disaster recovery (DR) tests.
Key Responsibilities:
Strategic Vision & Leadership
- Define and drive the observability strategy and roadmap, aligning with business and technology goals.
- Establish a mature observability framework covering infrastructure, network, applications, and end-user experience.
- Advocate for observability best practices across engineering, operations, and product teams.
Monitoring & Tool Implementation
- Lead the design, implementation, and optimization of observability platforms (e.g., Prometheus, Grafana, Datadog, New Relic, Splunk).
- Evaluate and onboard new tools and technologies to enhance visibility and telemetry across systems.
- Ensure scalable and resilient monitoring architectures are in place for hybrid and cloud-native environments.
Gap Analysis & Continuous Improvement
- Conduct gap assessments of existing monitoring setups and identify areas for improvement.
- Implement automated solutions to address low-hanging fruit and reduce manual overhead.
- Continuously refine monitoring configurations to improve the signal-to-noise ratio and reduce alert fatigue.
End-to-End Observability
- Build and maintain end-to-end visibility across infrastructure, network, applications, and user journeys.
- Integrate observability tools with incident management, ticketing, and reporting systems.
- Develop and enforce tagging strategies, metrics standards, and log enrichment.
Collaboration & Enablement
- Partner with DevOps, SRE, and application teams to embed observability into CI/CD pipelines and development workflows.
- Provide technical guidance and training to teams on observability tools and practices.
- Support incident response and post-mortem analysis with automated diagnostics and telemetry insights.
Data-Driven Insights
- Leverage observability data to generate actionable insights for performance tuning, capacity planning, and reliability engineering.
- Create dashboards and reports that provide meaningful visibility to stakeholders at all levels.
Qualifications:
Observability & Monitoring Tools
- Prometheus, Grafana, Zabbix, SolarWinds
- Datadog, New Relic, Dynatrace, Splunk
- ELK Stack (Elasticsearch, Logstash, Kibana)
- OpenTelemetry (for standardized telemetry collection)
Infrastructure & Automation
- Terraform, Ansible, Puppet, Chef (IaC tools)
- Scripting languages: Python, Bash, PowerShell
- REST APIs: Experience integrating and automating observability tools via APIs
Cloud & Container Platforms
- AWS, Azure, Google Cloud Platform
- Kubernetes and Docker (monitoring containerized environments)
- Cloud-native monitoring tools: CloudWatch, Azure Monitor, GCP Operations Suite
CI/CD & DevOps Tooling
- Jenkins, GitLab CI, GitHub Actions
- Git (version control)
- Integration of observability into CI/CD pipelines
Data Analysis & Visualization
- Experience with metrics, logs, and traces
- Building dashboards and custom visualizations
- Familiarity with SQL or time-series databases (e.g., InfluxDB, TimescaleDB)
Alerting & Incident Management
- Tools like PagerDuty, Opsgenie, ServiceNow, Jira
- Knowledge of alert tuning, event correlation, and automated diagnostics
Architecture & Design
- Understanding of distributed systems, microservices, and network protocols
- Ability to design scalable observability architectures
Preferred Qualifications:
- 15+ years of hands-on experience in monitoring, observability, and infrastructure operations.
- Proven track record of designing and implementing observability platforms in complex environments.
- Experience in gap analysis and optimization of monitoring setups across infrastructure, network, applications, and end-user layers.
- Strong background in DevOps or SRE.
Technical Proficiency
- Deep expertise in observability tools (Prometheus, Grafana, Dynatrace, etc.)
- Strong skills in Infrastructure as Code, automation scripting, and API integrations.
- Familiarity with cloud-native architectures and microservices.
- Experience integrating observability into CI/CD pipelines and incident management workflows.
Soft Skills
- Strategic thinker with a vision for mature observability practices.
- Excellent communication and collaboration skills to work across teams.
- Ability to mentor and guide teams on observability principles and tooling.
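Commitment to Equality
Coupang has always been committed to equality among its employees. Our unprecedented success would not have been possible without the efforts of our diverse teams around the world.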