Senior DevOps Engineer (AI/LLM Infrastructure)
Budget: $500.00 (fixed price)
⭐ 5.00 (7)
Australia
devops, docker
*Experience:* 6+ Years
*Location:* Remote
*Employment Type:* 40 hours/week - $500/month salary (part of a larger development team)
## About ReSkill AI
ReSkill AI is an AI-powered enterprise platform focused on transforming businesses through intelligent automation, Large Language Models (LLMs), AI Agents, and scalable cloud-native solutions. We are looking for a highly skilled DevOps Engineer who can build, automate, and manage secure, scalable, and highly available infrastructure supporting AI-driven applications.
## Key Responsibilities
* Design, implement, and maintain scalable cloud infrastructure on AWS (preferred), Azure, or Google Cloud Platform.
* Build and manage CI/CD pipelines using tools such as GitHub Actions, GitLab CI, Jenkins, or Azure DevOps.
* Deploy and maintain containerized applications using Docker and Kubernetes.
* Manage Infrastructure as Code (IaC) using Terraform, CloudFormation, or Pulumi.
* Implement monitoring, logging, and observability using Prometheus, Grafana, ELK Stack, Datadog, CloudWatch, or similar platforms.
* Automate deployments, system provisioning, configuration management, and operational workflows.
* Optimize infrastructure performance, reliability, security, and cost.
* Collaborate with AI engineers to deploy and scale LLM-based applications and AI services.
* Support model serving infrastructure for open-source and commercial LLMs.
* Work closely with software engineering teams to improve deployment efficiency and release management.
* Implement backup strategies, disaster recovery procedures, and production support processes.
* Ensure compliance with security best practices, including IAM, secrets management, encryption, and vulnerability management.
* Participate in on-call production support and incident response when required.
## Required Qualifications
* Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
* Minimum 6 years of professional experience in DevOps or Site Reliability Engineering.
* Strong experience with AWS services such as EC2, ECS/EKS, Lambda, S3, RDS, IAM, CloudWatch, VPC, Route 53, and Auto Scaling.
* Hands-on experience with Docker and Kubernetes in production environments.
* Strong scripting skills in Bash, Python, or similar languages.
* Experience building CI/CD pipelines and deployment automation.
* Experience with Infrastructure as Code tools (Terraform preferred).
* Solid understanding of Linux administration and networking fundamentals.
* Experience with Git version control and branching strategies.
* Familiarity with security best practices and DevSecOps principles.
## Preferred AI/LLM Experience
* Understanding of Large Language Models (LLMs) and Generative AI architectures.
* Experience deploying inference servers such as vLLM, Ollama, Hugging Face Text Generation Inference (TGI), or similar technologies.
* Familiarity with vector databases such as Pinecone, Milvus, Weaviate, or pgvector.
* Understanding of Retrieval-Augmented Generation (RAG) architectures.
* Experience working with OpenAI APIs, Anthropic APIs, Google Gemini, or open-source models.
* Knowledge of GPU infrastructure, CUDA environments, and AI workload optimization is a plus.
* Familiarity with AI observability, model monitoring, and prompt management.
## Nice-to-Have Skills
* Experience with Kafka, Redis, RabbitMQ, or event-driven architectures.
* Experience with ArgoCD, Helm, or FluxCD.
* Knowledge of HIPAA, SOC 2, ISO 27001, or other security compliance frameworks.
* Experience supporting multi-region deployments and high-availability architectures.
## What We Offer
* Opportunity to work on cutting-edge AI and enterprise automation products.
* Collaborative and innovative engineering culture.
* Flexible work environment.
* Competitive salary and performance incentives.
* Career growth opportunities in AI infrastructure and cloud technologies.
This is a 40-hour-per-week role; the offer is a monthly salary of $500 USD.
Salary increases are considered at regular intervals, depending on performance.
Only apply if you understand and accept the terms above.