Principal / Senior Systems Performance Engineer
Job Description
Our vision is to transform how the world uses information to enrich life for all.
Micron Technology is a world leader in innovating memory and storage solutions that accelerate the transformation of information into intelligence, inspiring the world to learn, communicate and advance faster than ever.
Micron Data Center Workload Engineering in Austin, Texas, is seeking a senior engineer to join our team. We build, performance-tune, and test data center solutions using innovative DRAM, SSD, and emerging memory hardware! Understanding key data center workloads is a Micron imperative: it lets us improve current products in the deep-memory hierarchy (HBM, CXL, DDR) and deliver a total value proposition to customers by efficiently applying several Micron products in concert.
In particular, with the proliferation of generative AI, there is an urgent need to better understand how large language model training is affected by HBM and GPU characteristics. To this end, the successful candidate will primarily contribute to the HBM program by analyzing how AI/ML, HPC, data analytics, and other compute-intensive workloads perform on the latest MU-HBM3e / NVIDIA H100 GPU / AMD GPU platforms; conducting competitive analysis of SS / SK HBM3 products; showcasing the benefits that workloads see from MU-HBM3e's capacity, bandwidth, and thermals; contributing to marketing collateral; and extracting AI/ML and HPC workload traces to help optimize future HBM designs.
Job Responsibilities:
These include but are not limited to the following:
- Analysis and characterization of data center workloads in several areas such as AI/ML, HPC, data analytics.
- Profiling AI training / inference models in generative AI, computer vision, recommendation, and NLP on CPU/GPU systems using HBM/DDR.
- Performance benchmarking of HBM using both microbenchmarks and the aforementioned data center applications and benchmarks.
- Particular focus on generative AI, large language models and how they are impacted by HBM and GPU features.
- Overlaying deep learning models on multi GPU-based system architectures to understand their interplay
- Understanding how HBM capacity, bandwidth and power/thermals impact DL models
- Competitive analysis of HBM power / thermals under workload behavior
- Detailed telemetry analysis of workloads on the HBM / GPU
- Understand how ML models in LLM, vision, recommendation and NLP are sensitive to additional HBM bandwidth and capacity
- Study the impact of model quantization and GPU / HBM resource usage / needs
- Profile DL model training / inference in a GPU + CPU-offload scenario, and in a cluster of GPU setting
- Build workload memory access traces from AI models and HPC applications
- Study system balance ratios for DRAM/HBM/NVMe in terms of capacity and bandwidth, analyze memory expansion vis-à-vis DRAM/HBM/NVMe, and study the interplay between these products to understand TCO
- Study memory/core, byte/FLOP and memory bandwidth/core/FLOP requirements for a variety of workloads to influence future products
- Study data movement between CPU, GPU and the associated memory subsystems (DDR, HBM) in heterogeneous system architectures via connectivity such as PCIe/NVLINK/Infinity Fabric to understand the bottlenecks in data movement for different workloads (particularly AI)
- Understand how additional DDR memory footprint or CXL memory footprint can help with AI training / inference occurring on the GPU-CPU complex
- Develop an automated testing framework through scripting
- Customer engagements to present findings
Preferred Qualifications:
- Strong background in AI/ML training/inference models and frameworks such as PyTorch, TensorFlow, and DeepSpeed
- Strong computer systems foundations
- Strong foundation in GPU and CPU processor architecture
- Familiarity with synthetic memory bandwidth testing applications
- Familiarity with and knowledge of server system memory (DRAM) and server system architecture
- Experience with performance analysis
- Understanding of memory and storage hierarchy including HBM
- Knowledge of AI, ML/DL frameworks and running them in CPU-GPU heterogeneous architectures
- Strong software development skills using leading scripting and programming languages and technologies (Python, CUDA, ROCm, C, C++)
- Strong systems software development skills, demonstrated by developing solutions for memory, storage, CPU, and big-data subsystems
- Familiarity with PCIe and NVLINK connectivity
- Knowledge of big-data and data-intensive analysis tools (e.g., Spark)
- Up to date with the state of the art in deep learning and its optimizations
- Familiarity with system level automation tools and processes
- Excellent oral communication skills
- Ability to be hands-on as well as to set and supervise goals for a team
- Excellent written and presentation skills to detail the findings
Education:
Bachelor's degree (with 5+ years of experience), Master's degree (with 3+ years of experience), or Ph.D. in Computer Science or a related field
As a world leader in the semiconductor industry, Micron is dedicated to your personal wellbeing and professional growth. Micron benefits are designed to help you stay well, provide peace of mind and help you prepare for the future. We offer a choice of medical, dental and vision plans in all locations enabling team members to select the plans that best meet their family healthcare needs and budget. Micron also provides benefit programs that help protect your income if you are unable to work due to illness or injury, and paid family leave. Additionally, Micron benefits include a robust paid time-off program and paid holidays. For additional information regarding the Benefit programs available, please see the Benefits Guide posted on micron.com/careers/benefits.
Micron is proud to be an equal opportunity workplace and is an affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, age, national origin, disability, protected veteran status, gender identity or any other factor protected by applicable federal, state, or local laws.
To learn more about Micron, please visit micron.com/careers
For US Sites Only: To request assistance with the application process and/or for reasonable accommodations, please contact Micron's People Organization at [email protected] or 1-800-336-8918 (select option #3)
Micron Prohibits the use of child labor and complies with all applicable laws, rules, regulations, and other international and industry labor standards.
Micron does not charge candidates any recruitment fees or unlawfully collect any other payment from candidates as consideration for their employment with Micron.
Date Posted
12/17/2023