Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to company engineering principles.
Responsibilities
- Engage in and improve the software development lifecycle, from inception and design through development, deployment, operation and refinement, for greater reliability
- Influence and design infrastructure, architecture, standards and methods for large-scale systems
- Support services prior to production via infrastructure design, software platform development, load testing, capacity planning and launch reviews
- Maintain services during deployment and in production by measuring and monitoring key performance and service level indicators including availability, latency, and overall system health
- Automate system scalability and continually work to improve system resiliency, performance and efficiency
- Practice sustainable incident response as part of an on-call rotation and through blameless postmortems
- Remediate tasks within corrective action plans through sustainable, preventative and automated measures whenever possible
Qualifications
- BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience
- Experience developing and/or administering software in cloud infrastructure
- Experience monitoring infrastructure and application uptime and availability to ensure functional and performance objectives are met
- 5-7 years of experience in languages such as Python, Ruby, Bash, PHP, Perl, JavaScript and/or Node.js
- Demonstrable cross-functional knowledge with systems, storage, networking, security and databases
- System administration skills, including automation and orchestration of Linux/Windows using Chef, Puppet, Ansible, SaltStack and/or containers (Docker, Kubernetes, etc.)
- Proficiency with continuous integration and continuous delivery tooling and practices
- Strong analytical and troubleshooting skills
Preferred Qualifications
- Expertise designing, analyzing and troubleshooting large-scale distributed systems
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
- Experience managing infrastructure as code via tools such as Terraform or CloudFormation
- A passion for automation with a desire to eliminate toil whenever possible
- Experience building software or maintaining systems in a highly secure, regulated or compliant industry
- Experience with, and passion for, working within a DevOps culture and as part of a team