Overview
To keep Coupang's IT services stable, the IT Reliability Engineering team operates the monitoring systems and processes for IT infrastructure and applications. The team is responsible for ensuring and improving monitoring visibility. When an event or incident occurs, the team works with the engineering team to resolve it and manages the relevant metrics. To ensure service continuity, the team regularly conducts disaster recovery (DR) tests.
Key Responsibilities:
- Grafana Dashboard Development: Design, build, and maintain advanced Grafana dashboards for real-time monitoring and analytics.
- Data Integration: Connect Grafana with data sources like Prometheus, InfluxDB, Elasticsearch, Loki, and CloudWatch.
- Collaboration: Work closely with SRE, DevOps, and application teams to define and track key metrics and KPIs.
- Automation: Automate dashboard provisioning using Grafana APIs, Terraform, or Grafana-as-Code.
- Performance Optimization: Ensure dashboards are optimized for performance and usability in large-scale environments.
- Alerting & Incident Support: Implement alerting strategies and integrate with tools like VictorOps and Slack; support incident response and post-mortem analysis.
- Observability Strategy: Contribute to the organization’s observability architecture and tooling strategy.
- ITSM Integration: Apply ITIL practices and integrate monitoring with ITSM workflows.
- Monitoring Management: Oversee monitoring for enterprise IT applications and infrastructure, including SaaS services (e.g., O365, Zoom, Zscaler).
- Incident Handling: Manage incidents/events, review tickets, and track preventive actions.
- Reporting & Process Improvement: Maintain metrics/KPIs, generate regular reports, and improve incident/event processes.
Qualifications:
- Experience:
  - 12–15 years in engineering with a strong focus on monitoring and observability.
  - 5+ years of hands-on experience with Grafana in production environments.
- Technical Skills:
  - Deep knowledge of Grafana (plugins, templating, variables, custom panels).
  - Proficiency in at least one of PromQL, InfluxQL, or LogQL (Loki's query language).
  - Experience with Terraform, Ansible, or other Infrastructure-as-Code tools.
  - Strong scripting skills (Python, Shell, or Go).
  - Familiarity with Kubernetes, Docker, and cloud platforms (AWS, GCP, Azure).
  - Solid understanding of CI/CD pipelines, logging, and tracing tools.
- Monitoring Tools:
  - Experience with Zabbix, Prometheus, NMS, and other monitoring systems.
  - Ability to architect integrated monitoring for SaaS, servers, networks, AWS, applications, and endpoints.
- Process & Communication:
  - Understanding of ITIL practices and ITSM workflows.
  - Strong analytical, structured-thinking, and communication skills.
  - Ability to handle incidents systematically and collaborate across multi-domain teams.
Preferred Qualifications:
- Contributions to Grafana open-source projects or plugin development.
- Knowledge of OpenTelemetry, Jaeger, or Zipkin.
- Experience with SLO/SLI implementation and service health monitoring.
- Professional certifications (e.g., CCIE, LPIC, VCAP/DX, AWS).
- Familiarity with IT standards (ITIL, ISO/IEC 20000, ISO 22301).
- Experience with large-scale IT infrastructure (IDCs, offices, logistics centers).
- Proficiency in English (spoken and written).