Senior AI Engineer for Critical Production Memory Leak Resolution
Budget: $25.0 - $50.0
HOURLY / PART_TIME
⭐ 0.00 (0)
Pakistan
kubernetes, aws-lambda, python, artificial-intelligence
We are looking for an experienced AI Engineer to join our team immediately and lead the investigation and resolution of a critical production issue: a long-running, persistent memory leak affecting our AI platform.
This is a high-impact role for someone with deep experience debugging complex AI systems in production environments. You will work directly with our engineering team to identify the root cause, implement a robust solution, validate the fix under production workloads, and ensure long-term platform stability.
## Responsibilities
* Investigate and resolve a persistent memory leak in a production AI system.
* Perform deep root-cause analysis across application code, AI frameworks, runtime environments, and infrastructure.
* Profile CPU and memory usage using advanced debugging and performance analysis tools.
* Identify memory retention issues, object lifecycle problems, resource leaks, and concurrency-related bottlenecks.
* Optimize long-running AI services for reliability, performance, and efficient resource utilization.
* Validate fixes through stress testing and production-level workload simulations.
* Collaborate closely with backend, infrastructure, and platform engineers.
* Document findings, recommendations, and preventive measures to improve long-term system reliability.
## Required Experience
* Extensive experience building and operating AI systems in production.
* Strong expertise with Python and asynchronous programming.
* Deep understanding of memory management, garbage collection, object lifecycle, and profiling techniques.
* Experience debugging memory leaks in long-running services.
* Strong knowledge of AI frameworks such as PyTorch, TensorFlow, Hugging Face Transformers, LangChain, or similar technologies.
* Experience with containerized environments including Docker and Kubernetes.
* Familiarity with Linux performance analysis and production debugging tools.
* Experience working with distributed systems, background workers, APIs, and high-availability services.
* Ability to quickly isolate complex production issues and deliver reliable long-term solutions.
## Preferred Qualifications
* Experience debugging GPU memory issues and CUDA memory management.
* Experience with vector databases, inference servers, and large language model deployments.
* Familiarity with observability platforms including Prometheus, Grafana, OpenTelemetry, or similar monitoring solutions.
* Experience improving production reliability for enterprise-scale AI platforms.
## What Success Looks Like
The successful candidate will identify the root cause of the production memory leak, implement a verified long-term fix, improve overall system stability and performance, and help establish engineering practices that prevent similar issues in the future.
This is a mission-critical engagement requiring exceptional debugging skills, production engineering experience, and a disciplined approach to solving complex AI infrastructure problems.