Infra Engineer – SRE (Kubernetes)

Budget: $500.0 FIXED / ⭐ 0.00 (0) India

embedded-systems, cisco-routers, computer-networking, cisco-certified-network-associate-ccna

About the Role

We are seeking a skilled Site Reliability Engineer specializing in Kubernetes to join a Global Infrastructure team. This hands-on role is critical to ensuring the stability, efficiency, and reliability of large-scale, high-performance AI/ML clusters in data centers. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to keep infrastructure environments performing at their best. Experience with large-scale infrastructure automation is a strong plus.

Responsibilities

* Design, implement, and maintain scalable AI/ML infrastructure solutions.
* Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, networking, and storage systems.
* Automate deployment, configuration, and management of infrastructure resources.
* Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, decommissioning, and upgrades.
* Implement CI/CD pipelines for infrastructure deployment and orchestration.
* Ensure security, compliance, and operational best practices across infrastructure environments.
* Manage incident response related to infrastructure resources, including GPU, CPU, storage, and network components.
* Handle customer provisioning requests for GPU resources, including onboarding, configuration, and troubleshooting.
* Resolve customer service requests related to infrastructure and platform operations while maintaining high customer satisfaction.
* Stay current with emerging GPU hardware and software technologies and integrate improvements where appropriate.
* Travel regionally and internationally to data center locations when necessary.

Qualifications

* Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
* 3+ years of experience in data center operations, infrastructure engineering, systems engineering, or site reliability engineering.
* Proven experience with infrastructure automation tools such as Terraform and Ansible.
* Strong experience with Kubernetes and container orchestration technologies.
* Familiarity with NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI, and similar Kubernetes ecosystem tools.
* Experience with job scheduling systems such as Slurm.
* Strong Linux system administration skills.
* Proficiency in scripting and automation using Python and Bash.
* Experience with observability and monitoring platforms such as Prometheus, Grafana, and Loki.
* Knowledge of GPU architectures, NVIDIA CUDA, NCCL, and AI/ML infrastructure is a strong advantage.
* Strong troubleshooting and root-cause analysis skills, with the ability to analyze logs, metrics, and system performance data.
* Excellent communication, collaboration, and problem-solving abilities.

Preferred Skills

* Large-scale Kubernetes cluster operations.
* AI/ML infrastructure and GPU cluster management.
* Infrastructure-as-Code (IaC) and automation-first mindset.
* Production incident management and reliability engineering.
* Data center operations and hardware troubleshooting.
* CI/CD platform design and implementation.

Meeting every qualification is not required. Candidates with strong technical foundations, relevant experience, and a passion for building reliable large-scale infrastructure are encouraged to apply.
