
Senior DevOps Engineer (AI/LLM Infrastructure)

Budget: $500.00 FIXED / ⭐ 5.00 (7) Australia

devops, docker

*Experience:* 6+ years

*Location:* Remote

*Employment Type:* 40 hours/week, $500/month salary (part of a larger development team)

## About ReSkill AI

ReSkill AI is an AI-powered enterprise platform focused on transforming businesses through intelligent automation, Large Language Models (LLMs), AI Agents, and scalable cloud-native solutions. We are looking for a highly skilled DevOps Engineer who can build, automate, and manage secure, scalable, and highly available infrastructure supporting AI-driven applications.

## Key Responsibilities

* Design, implement, and maintain scalable cloud infrastructure on AWS (preferred), Azure, or Google Cloud Platform.
* Build and manage CI/CD pipelines using tools such as GitHub Actions, GitLab CI, Jenkins, or Azure DevOps.
* Deploy and maintain containerized applications using Docker and Kubernetes.
* Manage Infrastructure as Code (IaC) using Terraform, CloudFormation, or Pulumi.
* Implement monitoring, logging, and observability using Prometheus, Grafana, ELK Stack, Datadog, CloudWatch, or similar platforms.
* Automate deployments, system provisioning, configuration management, and operational workflows.
* Optimize infrastructure performance, reliability, security, and cost.
* Collaborate with AI engineers to deploy and scale LLM-based applications and AI services.
* Support model serving infrastructure for open-source and commercial LLMs.
* Work closely with software engineering teams to improve deployment efficiency and release management.
* Implement backup strategies, disaster recovery procedures, and production support processes.
* Ensure compliance with security best practices, including IAM, secrets management, encryption, and vulnerability management.
* Participate in on-call production support and incident response when required.

## Required Qualifications

* Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
* Minimum 6 years of professional experience in DevOps or Site Reliability Engineering.
* Strong experience with AWS services such as EC2, ECS/EKS, Lambda, S3, RDS, IAM, CloudWatch, VPC, Route 53, and Auto Scaling.
* Hands-on experience with Docker and Kubernetes in production environments.
* Strong scripting skills in Bash, Python, or similar languages.
* Experience building CI/CD pipelines and deployment automation.
* Experience with Infrastructure as Code tools (Terraform preferred).
* Solid understanding of Linux administration and networking fundamentals.
* Experience with Git version control and branching strategies.
* Familiarity with security best practices and DevSecOps principles.

## Preferred AI/LLM Experience

* Understanding of Large Language Models (LLMs) and Generative AI architectures.
* Experience deploying inference servers such as vLLM, Ollama, Hugging Face Text Generation Inference (TGI), or similar technologies.
* Familiarity with vector databases such as Pinecone, Milvus, Weaviate, or pgvector.
* Understanding of Retrieval-Augmented Generation (RAG) architectures.
* Experience working with OpenAI APIs, Anthropic APIs, Google Gemini, or open-source models.
* Knowledge of GPU infrastructure, CUDA environments, and AI workload optimization is a plus.
* Familiarity with AI observability, model monitoring, and prompt management.

## Nice-to-Have Skills

* Experience with Kafka, Redis, RabbitMQ, or event-driven architectures.
* Experience with ArgoCD, Helm, or FluxCD.
* Knowledge of HIPAA, SOC 2, ISO 27001, or other security compliance frameworks.
* Experience supporting multi-region deployments and high-availability architectures.

## What We Offer

* Opportunity to work on cutting-edge AI and enterprise automation products.
* Collaborative and innovative engineering culture.
* Flexible work environment.
* Competitive salary and performance incentives.
* Career growth opportunities in AI infrastructure and cloud technologies.
This is a 40-hour-a-week role. The offer is a monthly salary of $500 USD. The salary will increase at regular intervals depending on performance. Only apply if you understand the above.