Senior Site Reliability Engineer
Budget: $35.0 - $60.0
HOURLY / FULL_TIME
⭐ 4.94 (18978)
United States
embedded-systems, civil-engineering, mechanical-engineering, electrical-engineering
Role Overview:
As a Senior SRE at Upwork, you will scale and secure our global freelance marketplace while leading our transition to autonomous operations. By integrating Agentic AI frameworks to automate triage, tracing, and root-cause discovery, you will shift our team from reactive firefighting to high-leverage architectural engineering. Collaborating cross-functionally, you will drive security, observability, cost-efficiency, and a culture of technical innovation.
Work / Project Scope
1. Technical Leadership & Reliability Architecture
- Serve as a technical leader in modern SRE practices with a focus on zero-trust infrastructure, platform observability, and cloud-native scalability.
- Guide the architectural evolution of reliability systems, including multi-cluster Kubernetes environments, GitOps workflows, and service mesh integration.
- Champion SLO-driven engineering across teams and establish frameworks for defining, tracking, and enforcing reliability standards.
- Mentor engineers across infrastructure and application teams on designing and operating reliable, scalable systems.
2. Incident Management & Advanced AI-Assisted Operations
- Participate in our production on-call rotation during daytime hours and on some weekends (roughly once every 2–3 weeks).
- Develop and deploy AI-assisted tools, agentic frameworks, and workflows to streamline incident triage, automate remediation, and accelerate resolution.
- Lead high-priority incident reviews and reliability audits to surface systemic technical or process gaps.
- Standardize blameless postmortems and drive a culture of continuous learning from root cause analysis (RCA) data.
3. Advanced Observability & Telemetry
- Define and maintain end-to-end observability strategies including distributed tracing, high-cardinality metrics pipelines, and log enrichment.
- Architect and optimize telemetry data pools using ELK/OpenSearch, Prometheus, Grafana, and long-term storage layers like VictoriaMetrics / VictoriaLogs to drive SLO-based alerting.
4. Zero-Trust Networking & Governance
- Partner with platform and security teams to enable service-to-service authentication, policy enforcement, and resilient control planes.
- Enforce workload identity, network isolation, and mutual TLS (mTLS) policies utilizing Istio Service Mesh.
- Drive infrastructure automation using IaC best practices (Terraform) with an emphasis on policy-as-code, workload identity, and platform governance.
What It Takes To Catch Our Eye
- Experience: 5+ years in SRE, DevOps, or production engineering roles, including proven experience operating large-scale, revenue-producing distributed systems in production.
- Kubernetes & Service Mesh: Deep expertise in Kubernetes operations, including multi-cluster orchestration, Istio (or equivalent service mesh), and workload policy management (e.g., OPA, Kyverno).
- GitOps Continuous Delivery: Proven experience building and maintaining robust GitOps pipelines using tools like ArgoCD or Flux.
- Observability & SLOs: Strong fluency in observability tooling (Prometheus, OpenTelemetry, Grafana, ELK/OpenSearch, or VictoriaLogs/VictoriaMetrics), with a programmatic focus on SLO-based alerting and incident detection. Familiarity with scaling high-density log storage via VictoriaLogs is highly desired.
- Automation & AI-Enhanced Coding: Fluency in automation using scripting languages (Python, Go, or Bash) paired with AI-assisted software workflows (e.g., AI agents, Cursor, automated PR engines).
- Networking & Zero Trust: Extensive experience in network architecture, including firewall management, load balancing, and implementing Zero Trust security models.
- Incident Review Leadership: A clear track record of leading incident review programs, standardizing postmortems, and translating RCAs into systemic reliability improvements.
- Cross-Functional Collaboration: Ability to work effectively with platform, security, and developer enablement teams to embed resilience across the entire software development lifecycle (SDLC).