Senior AI Engineer for Critical Production Memory Leak Resolution

Budget: $25.0 - $50.0 HOURLY / PART_TIME ⭐ 0.00 (0) Pakistan

kubernetes, aws-lambda, python, artificial-intelligence

We are looking for an experienced AI Engineer to join our team immediately to lead the investigation and resolution of a critical production issue involving a long-running, persistent memory leak affecting our AI platform. This is a high-impact role for someone who has deep experience debugging complex AI systems in production environments. You will work directly with our engineering team to identify the root cause, implement a robust solution, validate the fix under production workloads, and ensure long-term platform stability.

## Responsibilities

* Investigate and resolve a persistent memory leak in a production AI system.
* Perform deep root cause analysis across application code, AI frameworks, runtime environments, and infrastructure.
* Profile CPU and memory usage using advanced debugging and performance analysis tools.
* Identify memory retention issues, object lifecycle problems, resource leaks, and concurrency-related bottlenecks.
* Optimize long-running AI services for reliability, performance, and efficient resource utilization.
* Validate fixes through stress testing and production-level workload simulations.
* Collaborate closely with backend, infrastructure, and platform engineers.
* Document findings, recommendations, and preventive measures to improve long-term system reliability.

## Required Experience

* Extensive experience building and operating AI systems in production.
* Strong expertise with Python and asynchronous programming.
* Deep understanding of memory management, garbage collection, object lifecycle, and profiling techniques.
* Experience debugging memory leaks in long-running services.
* Strong knowledge of AI frameworks such as PyTorch, TensorFlow, Hugging Face Transformers, LangChain, or similar technologies.
* Experience with containerized environments including Docker and Kubernetes.
* Familiarity with Linux performance analysis and production debugging tools.
* Experience working with distributed systems, background workers, APIs, and high-availability services.
* Ability to quickly isolate complex production issues and deliver reliable long-term solutions.

## Preferred Qualifications

* Experience debugging GPU memory issues and CUDA memory management.
* Experience with vector databases, inference servers, and large language model deployments.
* Familiarity with observability platforms including Prometheus, Grafana, OpenTelemetry, or similar monitoring solutions.
* Experience improving production reliability for enterprise-scale AI platforms.

## What Success Looks Like

The successful candidate will identify the root cause of the production memory leak, implement a verified long-term fix, improve overall system stability and performance, and help establish engineering practices that prevent similar issues in the future. This is a mission-critical engagement requiring exceptional debugging skills, production engineering experience, and a disciplined approach to solving complex AI infrastructure problems.

Open on Upwork