Site Reliability Engineer in Alpharetta, GA at HUNTER Technical Resources

Date Posted: 10/29/2020

Job Description

  • You will be responsible for mission-critical business functions and partner with other infrastructure, operations, and development teams to identify and implement automation opportunities to drive down toil, reduce technical debt, and improve system reliability.
  • You will support the production operations of our systems, as well as development/engineering of solutions to maximize system reliability & automation.
  • You will be responsible for root cause analysis of incidents and proactive prevention of recurrence through the creative design and development of technical solutions as well as process improvements.
  • You will engage in and improve the whole lifecycle of software development services, from inception and design through deployment, operation, and refinement.
  • You will support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
  • You will maintain services once they are live by measuring and monitoring availability, latency, and overall system health in a 24x7 environment.
  • You will scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
  • You will practice sustainable incident response and blameless postmortems.
  • You will bind and orchestrate the system infrastructure with the application layer to enable high availability, clustering, load balancing, and integration.
  • You will be responsible for establishing end-to-end monitoring and alerting on all critical aspects to ensure SLOs, SLIs, and SLAs are met and to get proactive notifications of possible issues for all systems.
  • You will develop automated solutions to address potential problems before they result in a service interruption and demonstrate a passion for automation, including CI/CD automation.
  • You will establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria.

Must Haves
  • Bachelor of Science degree in Computer Science, Engineering, or equivalent relevant experience;
  • Good understanding of Site Reliability Engineering (SRE) and DevOps philosophies, technologies, platforms and tools, SLA management, incident resolution, and automation;
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive;
  • Ability to debug and optimize code and automate routine tasks;
  • 5+ years of experience in one or more of the following: Amazon Web Services, Google Cloud Platform, Kubernetes, etc.;
  • 5+ years of experience building JavaEE applications using build tools like Maven/Ant, Subversion, JIRA, Jenkins, Bitbucket, and Chef;
  • 5+ years of experience in continuous integration tools (Jenkins, SonarQube, JIRA, Nexus, Confluence, Git/Bitbucket, Maven, Gradle; Rundeck is a plus);
  • 5+ years of experience as SCM/release engineer, or in a position with similar skill sets and responsibilities (Software Engineer, Systems Engineer, Systems Administrator);
  • 5+ years of experience performing source code control management (Subversion/Git), including branching, merging, tagging, etc.;
  • 5+ years of experience configuring and administering JavaEE application servers (Tomcat, WebSphere, WebLogic, etc.);
  • 5+ years of experience with scripting languages (Python, Perl, and Unix shells such as bash and ksh);
  • 3-5 years of experience configuring, building, and supporting apps and operations in a public cloud environment (AWS, GCP);
  • 5+ years of experience with monitoring and logging tools (Elasticsearch, ELK, AppDynamics, Splunk, etc.);