Principal / Senior Systems Performance Engineer
Job Description
Our vision is to transform how the world uses information to enrich life for all.
Micron Technology is a world leader in innovating memory and storage solutions that accelerate the transformation of information into intelligence, inspiring the world to learn, communicate and advance faster than ever.
Micron Data Center Workload Engineering in Austin, Texas, is seeking a senior engineer to join our team. We build, performance-tune, and test data center solutions using innovative DRAM, SSD, and emerging memory hardware! Understanding key data center workloads is a Micron imperative: it lets us improve current products in the deep-memory hierarchy (HBM, CXL, DDR) and deliver a total value proposition to customers by efficiently applying several Micron products in concert.
In particular, with the proliferation of generative AI, there is an urgent need to better understand how large language model training is affected by HBM and GPU characteristics. To this end, the successful candidate will primarily contribute to the HBM program by analyzing how AI/ML, HPC, data analytics, and other compute-intensive workloads perform on the latest MU-HBM3e / NVIDIA H100 GPU / AMD GPU platforms; conducting competitive analysis of SS / SK HBM3 products; showcasing the benefits that workloads see from MU-HBM3e's capacity, bandwidth, and thermals; contributing to marketing collateral; and extracting AI/ML and HPC workload traces to help optimize future HBM designs.
Job Responsibilities:
These include but are not limited to the following:
- Analysis and characterization of data center workloads in several areas such as AI/ML, HPC, data analytics.
- Profiling AI training / inference models in generative AI, computer vision, recommendation, and NLP on CPU/GPU systems using HBM/DDR.
- Performance benchmarking of HBM using both microbenchmarks and the aforementioned data center applications and benchmarks.
- Particular focus on generative AI, large language models and how they are impacted by HBM and GPU features.
- Overlaying deep learning models on multi GPU-based system architectures to understand their interplay
- Understanding how HBM capacity, bandwidth and power/thermals impact DL models
- Competitive analysis of HBM power / thermals under workload behavior
- Detailed telemetry analysis of workloads on the HBM / GPU
- Understand how ML models in LLM, vision, recommendation and NLP are sensitive to additional HBM bandwidth and capacity
- Study the impact of model quantization and GPU / HBM resource usage / needs
- Profile DL model training / inference in a GPU + CPU-offload scenario, and in a cluster of GPU setting
- Build workload memory access traces from AI models and HPC applications
- Study system balance ratios for DRAM/HBM/NVMe in terms of capacity and bandwidth, analyze memory expansion vis-à-vis DRAM/HBM/NVMe, and study the interplay between these products to understand TCO
- Study memory/core, byte/FLOP and memory bandwidth/core/FLOP requirements for a variety of workloads to influence future products
- Study data movement between CPU, GPU and the associated memory subsystems (DDR, HBM) in heterogeneous system architectures via connectivity such as PCIe/NVLINK/Infinity Fabric to understand the bottlenecks in data movement for different workloads (particularly AI)
- Understand how additional DDR memory footprint or CXL memory footprint can help with AI training / inference occurring on the GPU-CPU complex
- Develop an automated testing framework through scripting
- Customer engagements to present findings
Preferred Qualifications:
- Strong background in AI/ML training/inference models and frameworks such as PyTorch, TensorFlow, and DeepSpeed
- Strong computer systems foundations
- Strong foundation in GPU and CPU processor architecture
- Familiarity with synthetic memory bandwidth testing applications
- Familiarity with and knowledge of server system memory (DRAM) and server system architecture
- Experience with performance analysis
- Understanding of memory and storage hierarchy including HBM
- Knowledge of AI, ML/DL frameworks and running them in CPU-GPU heterogeneous architectures
- Strong software development skills using leading scripting and programming languages and technologies (Python, CUDA, ROCm, C, C++)
- Strong systems software development skills, demonstrated by developing solutions for memory, storage, CPU, and big-data subsystems
- Familiarity with PCIe and NVLINK connectivity
- Knowledge of big-data and data-intensive analysis tools (e.g., Spark)
- Up to date with the state of the art in deep learning and its optimizations
- Familiarity with system level automation tools and processes
- Excellent oral communication skills
- Ability to be hands-on as well as to set and supervise goals for a team
- Excellent written and presentation skills to detail the findings
Education:
Bachelor's degree (with 5+ years of experience), Master's degree (with 3+ years of experience), or Ph.D. in Computer Science or a related field
As a world leader in the semiconductor industry, Micron is dedicated to your personal wellbeing and professional growth. Micron benefits are designed to help you stay well, provide peace of mind and help you prepare for the future. We offer a choice of medical, dental and vision plans in all locations enabling team members to select the plans that best meet their family healthcare needs and budget. Micron also provides benefit programs that help protect your income if you are unable to work due to illness or injury, and paid family leave. Additionally, Micron benefits include a robust paid time-off program and paid holidays. For additional information regarding the Benefit programs available, please see the Benefits Guide posted on micron.com/careers/benefits.
Micron is proud to be an equal opportunity workplace and is an affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, age, national origin, disability, protected veteran status, gender identity or any other factor protected by applicable federal, state, or local laws.
To learn more about Micron, please visit micron.com/careers
For US Sites Only: To request assistance with the application process and/or for reasonable accommodations, please contact Micron's People Organization at [email protected] or 1-800-336-8918 (select option #3)
Micron Prohibits the use of child labor and complies with all applicable laws, rules, regulations, and other international and industry labor standards.
Micron does not charge candidates any recruitment fees or unlawfully collect any other payment from candidates as consideration for their employment with Micron.
Date Posted
12/17/2023