HPC and AI System Administrator
Company
Intel Corporation
Location
Albuquerque, NM
Type
Full Time
Job Description
Job Details:
As AI reshapes not only computing but also business and society, Intel is making major bets on the future of AI, particularly in data center computing. Intel's Datacenter and AI Solutions (DAIS) organization is a critical part of Intel's broader AI efforts. DAIS's responsibility spans data center workloads from generative AI and deep learning to analytics, HPC, and graphics, all of which are intertwined in the future of computing.
CRT Labs runs Endeavour, the Intel High Performance Computing benchmarking cluster. Endeavour is our largest and best-known cluster; it showcases Intel Architecture and supports deals, development, performance optimization, and much more. We are system integrators of future platforms. We also host other clusters that support AI, HPC, Cloud, and Enterprise workloads, as well as clusters for Technology, Pathfinding, and Innovation.
We partner closely with the Sales and Marketing team, as well as multiple Software Enabling and optimization organizations, to deliver performant clusters at scale with unreleased and sometimes unstable hardware. This includes Intel Xeon and discrete graphics products, both the latest generations and yet-to-be-released versions, high-performance storage systems, and the fastest fabric interconnects available.
We are seeking an HPC and AI Systems Administrator who has a passion for working on Intel's latest technology. The HPC and AI Systems Administrator has deep technical knowledge of the design and deployment of data centers and the associated subsystems. This can include expertise in data center layout, mechanical design systems, cooling (both air and liquid), power delivery, and other critical areas of data center design. The deliverables of the role may take the form of designing Intel's data centers, supporting customers in designing their data centers, or developing new products and technologies based on data center design expertise.
The HPC and AI Systems Administrator will be responsible for, but not limited to:
- Provide support and maintenance of large cluster hardware and software for optimized performance, security, consistency, and high availability.
- Manage various Linux OS distributions.
- Support hardware such as rack-mounted servers, network switches, and firewalls.
- Support Intel HPC data center technologies, including servers, fabric, and storage.
- Provide support in cluster debugging, Linux scripting, cluster validation tests, server expansion, file system tests, benchmarking, and job scheduling.
- Serve as a consultant for all projects and customers of the CRT Datacenter, creating and improving methodologies used in the datacenter to enhance the performance, reliability, and manageability of the CRT clusters.
- Research emerging capabilities in external HPC and AI clusters to help set direction on where the team needs to be internally.
Qualifications:
This position is not eligible for Intel immigration sponsorship.
Requirements listed would be obtained through a combination of industry-relevant job experience, internship experience, and/or schoolwork, classes, or research. Minimum qualifications are required to be initially considered for this position. Preferred qualifications are in addition to the minimum requirements and are considered a plus factor in identifying top candidates.
Minimum Education:
Bachelor's degree in Computer Science, Computer Engineering, or a related field and 6+ years of experience, OR a Master's degree in Computer Science, Computer Engineering, or a related field and 4+ years of experience.
Minimum Qualifications
- 6+ years of Linux experience supporting complex HPC clusters.
- 6+ years of experience writing bash scripts, Python, and/or C programs.
- 1+ year of experience with the technical concepts, architecture, systems, development methods, and disciplines associated with the defined program, and with using that knowledge to accelerate project completion.
Preferred Qualifications
- Experience managing cluster systems with 100+ nodes.
- Experience managing HPC clusters with discrete GPUs.
- Experience in data center layout, mechanical design systems, cooling (both air and liquid), power delivery, and other critical areas of data center design.
- Experience using and supporting job schedulers such as SLURM, PBS or other schedulers.
- Experience with high performance interconnects, preferably Mellanox InfiniBand, Omni-Path, or Converged Ethernet.
- Experience administering high performance cluster file systems (Lustre, GPFS, others).
Job Type:
Experienced Hire
Shift:
Shift 1 (United States of America)
Primary Location:
US, New Mexico, Albuquerque
Additional Locations:
Business group:
The Data Center & Artificial Intelligence Group (DCAI) is at the heart of Intel's transformation from a PC company to a company that runs the cloud and billions of smart, connected computing devices. The data center is the underpinning for every data-driven service, from artificial intelligence to 5G to high-performance computing, and DCAI delivers the products and technologies (spanning software, processors, storage, I/O, and networking solutions) that fuel cloud, communications, enterprise, and government data centers around the world.
Posting Statement:
All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genetic information, military and veteran status, marital status, pregnancy, gender, gender expression, gender identity, sexual orientation, or any other characteristic protected by local law, regulation, or ordinance.
Position of Trust
N/A
Work Model for this Role
This role will be eligible for our hybrid work model, which allows employees to split their time between working on-site at their assigned Intel site and off-site. In certain circumstances, the work model may change to accommodate business needs.
Date Posted
03/20/2024