Site Reliability Engineer in St. Louis, MO at HUNTER Technical Resources

Date Posted: 10/8/2020

Job Description

Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to company engineering principles.
 

  • Work with teams across the organization to ensure the reliability of core services and keep an eye on capacity and performance.
  • Responsible for blameless postmortems and proactive identification of potential outages, both of which factor into iterative improvement.
  • Work closely with development and operations teams to build highly available, cost-effective systems with extremely high uptime.
  • Hands-on experience configuring and administering SCM (Git, SVN), build tools (CMake, Makefiles, Maven), Nexus, CI (Jenkins), and CD automation tools.
  • Responsible for establishing end-to-end monitoring and alerting on all critical aspects of all systems to ensure SLAs are met and to provide proactive notification of possible issues.
  • Work with the cloud operations team to resolve trouble tickets, develop and run scripts, and troubleshoot issues.
  • Participate in 24x7x365 on-call support for multiple core platforms globally. Using a “Follow the Sun” model, we expect working patterns to include on-call duty as well as weekend and holiday season cover.
  • You will engage in and improve the software development lifecycle, from inception and design through development, deployment, operation, and refinement.
  • You will influence and design infrastructure, architecture, standards, and methods for large-scale systems.
  • You will support services prior to production via infrastructure design, software platform development, load testing, capacity planning, and launch reviews.
  • You will maintain services during deployment and in production by measuring and monitoring key performance and service level indicators, including availability, latency, and overall system health.
  • You will automate system scalability and continually work to improve system resiliency, performance, and efficiency.
  • You will practice sustainable incident response as part of an on-call rotation and through blameless postmortems.
  • You will remediate tasks within the corrective action plan via sustainable, preventative, and automated measures whenever possible.