Site Reliability Engineering (SRE) Group Leader
About The Position
Coralogix is a modern, full-stack observability platform transforming how businesses process and understand their data. Our unique architecture powers in-stream analytics without reliance on expensive indexing or hot storage. We specialize in comprehensive monitoring of logs, metrics, traces, and security events, with features such as APM, RUM, SIEM, Kubernetes monitoring, and more, all enhancing operational efficiency and reducing observability spend by up to 70%.
We are seeking a Site Reliability Engineering (SRE) Group Leader to join our fast-paced, dynamic environment. In this role, you will be at the forefront of ensuring the availability, stability, and performance of Coralogix's production platform. You will lead three specialized teams focusing on production availability and stability, observability, and production insights, while maintaining 99.9% uptime and ensuring immediate response to production issues.
This role requires deep expertise in cloud technologies, Kubernetes, and the observability ecosystem. You'll work collaboratively across teams, setting objectives, defining metrics, and driving measurable improvements in platform reliability.
Key Responsibilities
- Production Reliability: Ensure the platform achieves and maintains 99.9% uptime by implementing robust SRE practices.
- Incident Response: Oversee immediate response to any production issues, ensuring timely resolution and minimizing downtime.
- Strategic Leadership: Lead and mentor three teams specializing in production availability, observability, and production insights, fostering a culture of accountability and collaboration.
- Cloud and Kubernetes Expertise: Drive optimization and reliability improvements using cloud technologies, Kubernetes, and Kubernetes operators.
- Observability Leadership: Develop and enhance observability solutions, ensuring comprehensive monitoring, alerting, and actionable insights across production systems.
- Data-Driven Decision-Making: Leverage production insights and metrics to drive system optimization and improvements.
- Cross-Team Collaboration: Partner with engineering, product, and support teams to align on priorities, objectives, and deliverables for production excellence.
Requirements
- Production Focus: Extensive experience managing large-scale production systems with a focus on maintaining high availability (≥99.9%).
- Incident Management Expertise: Proven ability to manage incident response processes and ensure rapid resolution of production issues.
- Observability Knowledge: Strong understanding of observability tools like Prometheus, Grafana, OpenTelemetry, and the broader observability ecosystem.
- Leadership Skills: Proven ability to manage and scale engineering teams, with experience leading multiple teams or groups.
- OKR Experience: Ability to define objectives, measure performance, and drive results through OKR frameworks.
- Problem-Solving Skills: Demonstrated expertise in troubleshooting and optimizing distributed systems and cloud environments.
- Collaboration Skills: Strong ability to work across teams and departments, aligning technical efforts with organizational goals.
Preferred Qualifications
- Experience in companies within the observability domain (e.g., Datadog, New Relic, Sumo Logic).
- Familiarity with incident management tools (PagerDuty, Opsgenie, etc.) and chaos engineering practices.
- Background in designing and implementing SLOs for production systems.
- Experience optimizing systems for high-throughput and low-latency workloads.
Cultural Fit
We’re seeking candidates who are hungry, humble, and smart. Coralogix fosters a culture of innovation and continuous learning, where team members are encouraged to challenge the status quo and contribute to our shared mission. If you thrive in dynamic environments and are eager to shape the future of observability solutions, we’d love to hear from you.
Coralogix is an equal opportunity employer and encourages applicants from all backgrounds to apply.