Senior System Administrator / Site Reliability Engineer

Boson AI · Other US Location

Type

Full Time

Job Description

Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola and Mu Li) and a team of scientists and engineers in deep learning, optimization, NLP, AutoML, and statistics are working on high-quality generative AI models for language and beyond.


About The Role


We are looking for a Senior Infrastructure Engineer / System Administrator to help us operate our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, OPNsense, networking, and related tools is a big plus. You should be comfortable performing some hardware configuration.


You will have the opportunity to work with the latest NVIDIA H100 GPUs, many petabytes of storage, terabit networking, and hundreds of machines. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

  • Manage large, private, high-end GPU clusters
  • Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage, and troubleshooting
  • Configure and maintain network switches (Tomahawk TH3, Mellanox InfiniBand)
  • Configure and maintain MAAS (Metal as a Service), Ceph, and Slurm
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
  • Configure and maintain network and security tools, including VPN, VLAN, DHCP, SSO, and MFA
  • Learn about new tools and deploy them

You might be a great fit if you have:

  • Strong background in system operations, including Slurm, Ansible, MAAS, Ceph, OPNsense and Kubernetes
  • Experience with on-premises data center operations and technologies
  • Experience managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates, e.g. on Supermicro hardware

The ability to solve problems and to learn new techniques is key.

Date Posted

11/25/2024
