We are looking for a dedicated SRE resource to cover European/Eastern European business hours, preferably coupled with an incident response service (MSP).
As part of the SRE team (working REMOTELY), you will be tasked with maintaining our AI infrastructure platform outside California (PST) business hours.
What you’ll be doing:
1) Ensure reliability and scalability of our AI infrastructure platform and hybrid Linux environments.
2) Manage Linux infrastructure to ensure maximum uptime.
3) Conduct performance and reliability testing. This may include reviewing configuration, software choices/versions, hardware specs, etc.
4) Advance our technology stack with innovative ideas and creative new solutions.
5) Participate in capacity management of core systems and services, application analysis, and performance and security tuning. Provide operational support for systems and build automation that remediates issues and addresses their root causes, with the goal of automating the response to all non-exceptional service conditions.
6) Create strategies for long-term, permanent fixes to critical production incidents.
7) Maintain documentation, build tooling, and create alerts to both identify and address infrastructure reliability issues.
8) Proactively identify system anomalies.
Typical tasks might include moving clusters, troubleshooting, handling unresponsive hosts, etc.