At IBM, work is more than a job - it’s a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you’ve never thought possible. Are you ready to lead in this new era of technology and solve some of the world’s most challenging problems? If so, let’s talk.
If you’re a student excited about the intersection of large language models with speech and audio analysis—and want to contribute to research with both academic and industrial impact—this internship is for you.
Our team at IBM Research develops models, algorithms, and technologies that drive IBM products and advance the broader AI community. We publish papers, release open-source models, and file patents based on our work.
As an intern, you’ll tackle real-world problems using cutting-edge deep learning methods to advance the state of the art in speech understanding and generation. You’ll collaborate closely with researchers, leverage large-scale GPU compute, and focus on one of the following areas:
- Speech and Audio — Advancing the recognition, analysis, and generation of natural speech and audio for more expressive, human-like interaction. Research spans generative and conversational AI, speech synthesis, and multimodal representation learning.
- Multimodal and Foundation Models — Exploring large-scale unified models that jointly learn from text and audio. Topics include self-supervised learning, realistic data synthesis, expressive speech generation, and tokenization strategies.
The goal of the internship is to produce a high-quality research outcome and publish in a leading AI venue (e.g., ICLR, Interspeech, NeurIPS, ACL, ICML).
This is a 3-month, full-time summer internship at our Haifa or Tel Aviv research sites (flexible).
Sample of 2025 publications by the group:
- Granite Speech (ASRU 2025)
- ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models (COLM 2025)
- Spoken Question Answering for Visual Queries (Interspeech 2025)
- Continuous Speech Synthesis Using per-token Latent Diffusion (ASRU 2025)
- A Non-autoregressive Model for Joint STT and TTS (ICASSP 2025)
• M.Sc. or Ph.D. student with knowledge of Machine Learning and Multimodal Large Language Models.
• Strong background in modern deep learning methods and deep knowledge of the recent literature; prior CV/ML/DL/LLM publications are an advantage.
• Strong Python coding skills. Experience with Transformers and LLMs is an advantage.
• A team player with great social skills and willingness to collaborate.
• Publication(s) at top-tier peer-reviewed conferences or journals are an advantage.
Please add your grade sheet to your application.