Our next-gen architecture is built to help you make sense of your ever-growing data.


Data Is Never at Rest, and Neither Are We

We’re constantly on the lookout for people who are hungry, humble, and smart. In that order. If that sounds like you, join us on our journey to revolutionize observability.


In 2023, Dun & Bradstreet ranked Coralogix as one of the best tech startups to work for.

Join the Team

Our stateful streaming analytics approach enables teams to monitor, visualize, and alert on observability data in real-time with no reliance on storage or indexing.
We’re looking for new team members to join us in our mission to build our next-gen data-less data platform.

2K+ Global Customers
10K+ DevOps and Engineering Users
500K+ Applications Monitored
3M+ Events Processed Per Second

Our Benefits

Global Presence

With offices in Boston, Dublin, Gurgaon, London and Tel Aviv, we operate on a global scale.

Competitive Salary

We pride ourselves on rewarding great work with great compensation.

Generous Share Package

We want you to have skin in the game and share in our future success.

Commuter Benefits

We offer monthly credits for ride-sharing, parking, and public transportation to make getting to the office a breeze.

Team Events

Regular happy hours, annual company trips, and employee parties – these are just a few ways we like to keep things friendly.

Continuous Learning

We encourage everyone to continue learning new things – developing both personally and professionally.

Site Reliability Engineering (SRE) Group Leader

Ramat Gan, Israel · Full-time · Management

About The Position

Coralogix is a modern, full-stack observability platform transforming how businesses process and understand their data. Our unique architecture powers in-stream analytics without reliance on expensive indexing or hot storage. We specialize in comprehensive monitoring of logs, metrics, traces, and security events with features such as APM, RUM, SIEM, Kubernetes monitoring and more, all enhancing operational efficiency and reducing observability spend by up to 70%.

We are seeking a Site Reliability Engineering (SRE) Group Leader to join our fast-paced and dynamic environment. As SRE Group Leader, you will be at the forefront of ensuring the availability, stability, and performance of Coralogix's production platform. You will lead three specialized teams focusing on production availability and stability, observability, and production insights, while maintaining 99.9% uptime and ensuring immediate response to production issues. This role requires deep expertise in cloud technologies, Kubernetes, and the observability ecosystem. You'll work collaboratively across teams, setting objectives, defining metrics, and driving measurable improvements in platform reliability.

Key Responsibilities

  • Production Reliability: Ensure the platform achieves and maintains 99.9% uptime by implementing robust SRE practices.
  • Incident Response: Oversee immediate response to any production issues, ensuring timely resolution and minimizing downtime.
  • Strategic Leadership: Lead and mentor three teams specializing in production availability, observability, and production insights, fostering a culture of accountability and collaboration.
  • Cloud and Kubernetes Expertise: Drive optimization and reliability improvements using cloud technologies, Kubernetes, and Kubernetes operators.
  • Observability Leadership: Develop and enhance observability solutions, ensuring comprehensive monitoring, alerting, and actionable insights across production systems.
  • Data-Driven Decision-Making: Leverage production insights and metrics to drive system optimization and improvements.
  • Cross-Team Collaboration: Partner with engineering, product, and support teams to align on priorities, objectives, and deliverables for production excellence.

Requirements

  • Production Focus: Extensive experience managing large-scale production systems with a focus on maintaining high availability (≥99.9%).
  • Incident Management Expertise: Proven ability to manage incident response processes and ensure rapid resolution of production issues.
  • Observability Knowledge: Strong understanding of observability tools like Prometheus, Grafana, OpenTelemetry, and the broader observability ecosystem.
  • Leadership Skills: Proven ability to manage and scale engineering teams, with experience leading multiple teams or groups.
  • OKR Experience: Ability to define objectives, measure performance, and drive results through OKR frameworks.
  • Problem-Solving Skills: Demonstrated expertise in troubleshooting and optimizing distributed systems and cloud environments.
  • Collaboration Skills: Strong ability to work across teams and departments, aligning technical efforts with organizational goals.

Preferred Qualifications

  • Experience in companies within the observability domain (e.g., Datadog, New Relic, Sumo Logic).
  • Familiarity with incident management tools (PagerDuty, OpsGenie, etc.) and chaos engineering practices.
  • Background in designing and implementing SLOs for production systems.
  • Experience optimizing systems for high-throughput and low-latency workloads.

Cultural Fit

We’re seeking candidates who are hungry, humble, and smart. Coralogix fosters a culture of innovation and continuous learning, where team members are encouraged to challenge the status quo and contribute to our shared mission. If you thrive in dynamic environments and are eager to shape the future of observability solutions, we’d love to hear from you.

Coralogix is an equal opportunity employer and encourages applicants from all backgrounds to apply.
