Infra Engineer – SRE (Kubernetes)
Budget: $500.00 (fixed price)
India
embedded-systems, cisco-routers, computer-networking, cisco-certified-network-associate-ccna
About the Role
We are seeking a skilled Site Reliability Engineer specializing in Kubernetes to join a Global Infrastructure team. This hands-on role is critical to the stability, efficiency, and reliability of large-scale, high-performance AI/ML clusters in data centers. The ideal candidate brings expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to keep these clusters running at peak performance. Experience with large-scale infrastructure automation is a strong plus.
Responsibilities
* Design, implement, and maintain scalable AI/ML infrastructure solutions.
* Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, networking, and storage systems.
* Automate deployment, configuration, and management of infrastructure resources.
* Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, decommissioning, and upgrades.
* Implement CI/CD pipelines for infrastructure deployment and orchestration.
* Ensure security, compliance, and operational best practices across infrastructure environments.
* Manage incident response related to infrastructure resources, including GPU, CPU, storage, and network components.
* Handle customer provisioning requests for GPU resources, including onboarding, configuration, and troubleshooting.
* Resolve customer service requests related to infrastructure and platform operations while maintaining high customer satisfaction.
* Stay current with emerging GPU hardware and software technologies and integrate improvements where appropriate.
* Travel regionally and internationally to data center locations when necessary.
Qualifications
* Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
* 3+ years of experience in data center operations, infrastructure engineering, systems engineering, or site reliability engineering.
* Proven experience with infrastructure automation tools such as Terraform and Ansible.
* Strong experience with Kubernetes and container orchestration technologies.
* Familiarity with NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI, and similar Kubernetes ecosystem tools.
* Experience with job scheduling systems such as Slurm.
* Strong Linux system administration skills.
* Proficiency in scripting and automation using Python and Bash.
* Experience with observability and monitoring platforms such as Prometheus, Grafana, and Loki.
* Knowledge of GPU architectures, NVIDIA CUDA, NCCL, and AI/ML infrastructure is a strong advantage.
* Strong troubleshooting and root-cause analysis skills with the ability to analyze logs, metrics, and system performance data.
* Excellent communication, collaboration, and problem-solving abilities.
Preferred Skills
* Large-scale Kubernetes cluster operations.
* AI/ML infrastructure and GPU cluster management.
* Infrastructure-as-Code (IaC) and automation-first mindset.
* Production incident management and reliability engineering.
* Data center operations and hardware troubleshooting.
* CI/CD platform design and implementation.
Meeting every qualification is not required. Candidates with strong technical foundations, relevant experience, and a passion for building reliable large-scale infrastructure are encouraged to apply.