Site Reliability Engineer

  • Development
  • Remote job

Job description

Job Brief

We are looking to hire a Site Reliability Engineer to help us make our service stable and resilient and to strengthen the reliability practices of our Engineering Department.

Responsibilities


  • Work with development teams to implement observability: logging, monitoring, alerting
  • Lead efforts to define and put in place SLOs and SLAs for critical systems
  • Help design a highly scalable and resilient architecture for our backend services
  • Design and implement automation to help investigate and resolve issues and to scale key services
  • Lead the incident response process, including post-mortems and runbook maintenance
  • Participate in the on-call rotation together with the engineers who develop the services

Technology Stack

  • Tooling: Elasticsearch / Logstash / Kibana, Prometheus, New Relic
  • Architecture: Microservices
  • Database / cache: MariaDB, Redis
  • Runtime: JVM, Java 11+
  • Cloud: Docker Swarm, AWS
  • CI/CD: Bitbucket Pipelines
  • Tech Culture: DevOps Philosophy / Kanban Software Development / Code Review / Pair Programming / Continuous Integration & Continuous Delivery


Job Requirements

  • 3+ years of experience in reliability engineering
  • Commitment to continuous professional development
  • Experience investigating issues in production environments
  • Willingness to work outside normal working hours
  • Drive to improve your own working efficiency
  • Good social skills
  • Proven track record of performance at previous companies
  • Attention to detail

Nice to have

  • Knowledge of the ELK stack and Prometheus
  • Knowledge of Docker and Docker orchestration (ideally Swarm)
  • Experience troubleshooting and optimizing the JVM runtime
  • Experience with scaling distributed systems, ideally implemented in Java
  • Good Linux skills
  • Experience with the DevOps approach and CI/CD
  • Basic development skills (scripting)
  • Ability to read Java code
  • Very good understanding of networking and load balancing