Job Description:
We are seeking a highly skilled AWS Site Reliability Engineer to join our team. In this role, you will be responsible for the reliability, scalability, and performance of our AWS infrastructure. You will lead engineering initiatives, drive automation efforts, monitor system health, respond to incidents, and collaborate closely with cross-functional teams to architect and deploy resilient services. The role requires a strong understanding of AWS services, excellent problem-solving skills, and a proactive approach to maintaining system reliability.
Responsibilities:
- Spearhead engineering initiatives to ensure AWS infrastructure reliability, scalability, and performance.
- Drive the development of automation tools to optimize operations and enhance system reliability.
- Proactively monitor system health, swiftly respond to incidents, and conduct in-depth post-incident analysis.
- Collaborate closely with teams to architect and deploy resilient services that meet reliability and performance requirements.
- Participate in an on-call rotation to provide round-the-clock support for critical systems.
- Continuously evaluate and implement best practices, tools, and technologies to improve system reliability and performance.
- Participate in capacity planning and performance tuning activities to ensure optimal resource utilization.
- Contribute to the design and implementation of disaster recovery and business continuity plans.
- Document processes, procedures, and configurations to support knowledge sharing and keep system documentation current.
- Stay updated on AWS best practices, new features, and industry trends to recommend and implement improvements.
Requirements:
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- Proven experience as an AWS Site Reliability Engineer or in a similar role.
- Strong understanding of core AWS services such as EC2, S3, RDS, Lambda, and CloudWatch.
- Experience with infrastructure as code (IaC) tools such as Terraform or CloudFormation.
- Proficiency in scripting and automation using Python, Bash, or similar languages.
- Hands-on experience with monitoring and logging tools such as CloudWatch, Prometheus, Grafana, and the ELK Stack.
- Excellent problem-solving skills and a proactive approach to troubleshooting and issue resolution.
- Ability to work effectively in a fast-paced, collaborative environment with cross-functional teams.
- Strong communication skills with the ability to articulate complex technical concepts to non-technical stakeholders.
- AWS certifications such as AWS Certified DevOps Engineer or AWS Certified SysOps Administrator are a plus.