Site Reliability Engineer
Full-Time
San Francisco, CA
Introduction
As a Site Reliability Engineer, you will be part of a small, tight-knit, dynamic, hands-on team that takes pride in its exceptional coding standards and development practices. You will be the primary interface between our developers and our production operations.
We work in both the dev and systems worlds, instrumenting key parts of core architecture while supporting developers as they try to do the same. We are looking for someone who lives, breathes and dreams automation and troubleshooting. You'll implement monitoring and alerting systems to support site stability and performance. You'll scale our infrastructure to meet ever-increasing demand. You'll make sure that when something goes bump in the night, someone hears it. You'll play a key role in keeping us fast, stable and growing, and you'll be central to our incident management process.
Responsibilities
Collaborate with the Engineering teams to increase productivity by enhancing our suite of internal web tools.
Leverage tooling to automate processes.
Write documentation and develop capacity plans.
Analyze cross-process and cross-service problems in a complex environment.
Troubleshoot site issues.
Develop custom tools as necessary.
Document system design, daily operations, procedures, and run books.
Participate in a light on-call rotation.
Requirements
B.S. or M.S. degree in Computer Science or equivalent.
Extensive knowledge of Unix/Linux – Ubuntu/Debian flavors.
Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Proven experience administering one or more data stores such as MySQL or Postgres (high availability, scale-out replication).
A systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Knowledge of best practices related to security, performance, and disaster recovery.
Has defined prescriptive ways to measure values such as application performance metrics and service level objectives, and communicated them to the right people.
Experience with release engineering and/or configuration management.
Understands the importance of security and compliance when working with personal data.
Experience at a large-scale consumer internet site.
Solid understanding of fundamental networking technologies.
Tests infrastructure changes thoroughly and likes to double and triple check everything.
Experience with our current tech stack (Ruby on Rails, MySQL, Postgres, Memcache, Docker, Solr, Kubernetes and GCP) and the ability to grow with us as we expand our technologies.