At IBM, work is more than a job - it’s a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you’ve never thought possible. Are you ready to lead in this new era of technology and solve some of the world’s most challenging problems? If so, let’s talk.
The internship will explore advanced routing strategies and KV-cache–aware optimizations in distributed inference systems, with an emphasis on improving performance, scalability, and GPU cost efficiency.
What you will work on
- Designing and evaluating routing algorithms to optimize inference latency, throughput, and cost
- Investigating KV cache management strategies for large-scale distributed inference serving
- Prototyping, benchmarking, and analyzing inference optimization techniques
- Working with modern inference frameworks and real production-like workloads
Why join us?
This internship offers a unique opportunity to work at the intersection of AI systems and distributed infrastructure, with real-world impact on scalable, cost-efficient inference serving used in production environments.
Required qualifications
- MSc or PhD student in Computer Science, Machine Learning, Systems, or a related field
- Strong background or interest in distributed systems, systems research, or ML infrastructure
- Strong programming skills (Python, Go, or similar)
- Hands-on experience or familiarity with vLLM (architecture, KV cache behavior, scheduling, or extensions)
- Interest in AI infrastructure, performance optimization, and cost efficiency
- Ability to work independently while collaborating effectively within a research and engineering team
Preferred qualifications
- Experience with Kubernetes (K8s) and cloud-native systems
- Familiarity with inference serving stacks, networking, or GPU-based systems
- Experience with benchmarking, profiling, or performance analysis

Please include your grade sheet with your application.