Apply on Kit Job: kitjob.in/job/46rabd
Overview: We are looking for a Senior Site Reliability Engineer to join our Engineering Infrastructure team. In this role, you will own the reliability, performance, and operational excellence of Generac’s cloud-native software platforms. You will bridge the gap between development and operations—embedding SRE practices across engineering squads, driving automation, and ensuring our systems meet the highest availability and performance standards.
You will report directly to the Sr. Manager, Site Reliability Engineering.
Responsibilities:

1. Incident Response & On-Call Management
- Own and lead incident response for production outages, coordinating cross-functional teams to drive rapid resolution and minimize customer impact.
- Maintain and evolve on-call runbooks, escalation paths, and post-mortem processes to build a culture of blameless learning.
- Conduct thorough root cause analysis (RCA) and implement preventive measures to reduce mean time to recovery (MTTR) and increase mean time between failures (MTBF).
- Define, track, and report on SLOs, SLIs, and error budgets, using Grafana dashboards to surface real-time reliability signals to engineering leadership.
- Champion proactive alerting strategies, eliminating alert fatigue and ensuring actionable notifications reach the right teams at the right time.

2. Infrastructure Automation & IaC
- Design, build, and maintain infrastructure-as-code (IaC) using Terraform and Ansible to provision and manage cloud resources across AWS (primary), GCP, and Azure.
- Automate repeatable operational tasks—reducing toil and enabling engineering teams to move faster with confidence.
- Lead Kubernetes cluster management and lifecycle operations, including upgrades, scaling, networking, and security hardening across environments.
- Manage and optimize GitHub Actions CI/CD pipelines, ensuring reliable, quick, and secure software delivery from code commit to production.
- Establish standards and best practices for environment consistency, secret management, and infrastructure drift detection.

3. Performance & Capacity Planning
- Lead capacity planning initiatives for multi-cloud infrastructure (AWS primary, GCP, Azure legacy), ensuring systems scale efficiently to meet business demand.
- Develop load testing frameworks and performance benchmarking strategies to identify bottlenecks before they impact customers.
- Analyze trends in system resource utilization and provide data-driven recommendations for cost optimization and right-sizing.
- Collaborate with engineering leadership on architecture reviews to ensure systems are designed with scalability and reliability as first-class concerns.
- Build and maintain Grafana dashboards and alerting rules that provide end-to-end visibility into system performance and capacity headroom.

4. Developer Tooling & Platform Engineering
- Build and maintain internal developer platforms that improve engineering velocity, standardize observability, and reduce operational complexity.
- Partner with software engineering teams to embed reliability practices early in the SDLC—shift-left on reliability, security, and performance.
- Provide SRE consultation to product squads on service architecture, deployment patterns, and observability instrumentation.
- Evangelize and implement best practices around feature flags, canary deployments, blue/green strategies, and rollback mechanisms.
- Contribute to a shared services model that enables development teams to self-serve infrastructure needs safely and efficiently.

Required Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
- Deep expertise in AWS (EC2, EKS, RDS, Lambda, S3, CloudWatch, IAM, VPC, Route 53); familiarity with GCP and Azure is a plus.
- Hands-on experience with Kubernetes administration, including cluster upgrades, RBAC, networking (CNI plugins), and storage.
- Proficiency in infrastructure-as-code tools, especially Terraform; experience with Ansible or similar configuration management tools.
- Experience designing and managing GitHub Actions CI/CD pipelines at scale.
- Strong observability skills—experience with Grafana, Prometheus, or equivalent monitoring and alerting stacks.
- Solid programming or scripting skills in Python, Go, Bash, or similar languages for automation and tooling.
- Demonstrated ability to lead incident response and drive structured post-mortem processes.
- Experience defining and managing SLOs, SLIs, and error budgets in production environments.
- Excellent communication skills—able to translate complex technical concepts for both engineering and business stakeholders.

Preferred Skills
- Exp
📌 Senior Site Reliability Engineer (Pune)
🏢 Generac
📍 Pune