We’re looking for a dedicated and talented SRE to join the team of our client, a cutting-edge company building next-generation AI infrastructure.
The company is focused on redefining how AI workloads are deployed and scaled by leveraging a distributed GPU network. Their platform enables seamless deployment across multiple environments, optimizing for cost, performance, and flexibility. The mission is to empower AI teams with a fast, scalable, and cost-efficient cloud experience, removing vendor lock-in and supporting the growing demands of modern AI systems.
Tasks and responsibilities:
* Ensure reliability, availability, and performance of the production platform
* Monitor and maintain infrastructure supporting AI workloads running in production
* Set up and improve monitoring, alerting, and incident response processes
* Participate in on-call rotations and handle production incidents
* Work with observability tools (metrics, logs, tracing) to track system health (latency, error rates, SLA)
* Support and optimize platform stability for customers running production workloads
What you need to be successful in this position:
* 5+ years of experience in SRE
* Strong Linux administration skills
* Solid understanding of networking fundamentals
* Hands-on experience with monitoring, alerting, incident response, and on-call practices
* Experience with observability (metrics, logs, tracing) and system reliability metrics (latency, error rates, SLA)
* Upper-intermediate English level (B2+)
Additionally:
* Experience with Kubernetes and cloud/GPU infrastructure
* Familiarity with containers and CI/CD pipelines
* Understanding of performance and cost optimization for AI/GPU workloads
* Basic knowledge of production security and data handling
* Experience with APIs and distributed systems reliability
* Knowledge of autoscaling and capacity planning
* Experience with AWS and tools like Grafana, Prometheus, Loki, EKS
This is a great opportunity to join a modern, fast-growing team working on cutting-edge AI infrastructure and solving complex, real-world challenges.
Please include a short summary of your relevant experience in your cover letter, and specify your English level as well as your experience working in a fully English-speaking environment. Thank you, and I look forward to the opportunity to discuss more in person!