- Maintain and scale production services and servers across multiple data centers for complex, data-intensive cloud services.
- Improve scalability, service reliability, capacity, and performance.
- Write automation code for provisioning and operating infrastructure at massive scale.
- You are not an operator; you're an experienced software engineer focused on operations.
- Work with development teams to make sure applications fit well within the infrastructure and that scalability and reliability are designed and implemented from the ground up.
- Work with QA on building pipelines and automation for delivering and deploying applications to production.
- Participate in the occasional on-call rotation supporting the infrastructure.
- You will roll up your sleeves to troubleshoot incidents, formulate theories, test your hypotheses, and narrow down possibilities to find the root cause.
- Hands-on experience in building fault-tolerant and scalable systems.
- Strong development/automation skills. Must be very comfortable with reading and writing Python code. Java is a plus.
- 5+ years of Unix/Linux experience, including some experience managing 100+ nodes.
- Tools-first mindset. You build tools for yourself and others to increase efficiency and to make hard or repetitive tasks easy and quick.
- Experience with AWS/GCP and their APIs.
- Experience with configuration management and CI/CD. Salt and Jenkins preferred.
- Familiarity with web servers (Nginx preferred) and HAProxy.
- Preferred experience: Hadoop, Kafka, RabbitMQ, Spark, HBase, Elasticsearch, containers, OpenStack.
- Organized and focused on building, improving, resolving, and delivering.
- Good communicator within and across teams, willing to take the lead.