Site Reliability Engineer
Pune, Maharashtra, India
Full Time
Mid Level
Onit, Inc. is looking for a Site Reliability Engineer L2 to join our Core Infrastructure team. This role will help to ensure the reliability of a diverse set of applications across our AWS infrastructure. To be successful in this role you will need to collaborate and pair with team members, have strong technical skills, and a passion for technology. The individual we seek is skilled in observability, excellent at troubleshooting, and has strong problem-solving skills. You must be able to multi-task in a fast-paced environment and be a self-starter with the ability to work independently.
Responsibilities:
- Troubleshoot deployment failures and infrastructure issues across our full AWS infrastructure stack (EKS, RDS, etc.); this includes dev, test, and production environments
- Create and maintain monitors for uptime and performance using Datadog, CloudWatch and other monitoring tools.
- Provide T3 technical support for Onit products.
- Find ways to help reduce errors in systems and reduce noise in monitors and alerts
- Work with others on user stories to improve system health
- Help create and prioritize work / stories
- Participate in standups with the US and India teams
- Help define runbooks and automation to solve production problems
- Troubleshoot applications from a configuration and logging perspective
- Assist with responding to and analyzing security events from security tooling
- Help train others to take on SRE responsibilities
- Assist with performance optimization by identifying performance bottlenecks and recommending improvements
- Verify that systems are monitored, backed up, and following best practices via audits and automation
- Investigate how to take better advantage of the tools we use for monitoring, security, etc.
Requirements:
- Bachelor's degree in computer science or equivalent experience is required.
- 4+ years of experience with the following:
- AWS (EC2, EKS, ECS, S3, RDS, CloudWatch, CloudTrail, IAM, AWS CLI, etc.). Experience with containers and EKS is a must.
- Problem-solving skills on both the production infrastructure/Linux system side and the application side
- Linux (CentOS, Amazon Linux, Ubuntu, etc.)
- Git source code management (GitLab, GitHub)
- Prior application coding and debugging experience (Ruby, Python, etc.)
- SaaS-based web application experience
- Experience with Jenkins or similar CI/CD tooling
- A solid understanding of the components that make up production systems (memory, CPU, disk space, disk I/O, network I/O, etc.) is required
- Experience with monitoring tools and on-call support
- Ability to read and interpret application server logs, outputs, CloudTrail and other critical logging output
- Excellent troubleshooting skills required
- Experience creating Jenkins/automation scripts.
- Relational database performance and monitoring.
- Terraform and/or CloudFormation
- Experience troubleshooting application integrations
- Other technologies: Cloudflare, AWS GuardDuty, CrowdStrike