Job Description
Summary
Reality Labs Research (RL-R) brings together a diverse and highly interdisciplinary team of researchers and engineers to create the future of augmented and virtual reality. On the Codec Avatars Infrastructure team, you'll work on building tools, libraries, and frameworks that will help researchers collaborate with each other and empower their research towards the generation of Codec Avatars. Our team cultivates an honest and considerate environment where self-motivated individuals thrive. We encourage a strong sense of ownership and embrace the ambiguity that comes with working on the frontiers of research. In this systems engineer role on the Codec Avatar Research Infrastructure team, you will foster our scientific explorations and generate viable paths to the consumer products that will connect people in meaningful ways for decades to come.
Required Skills
Systems Engineer Responsibilities:
- Build, scale, and secure the HPC clusters within Meta research labs, a heterogeneous environment containing diverse operating systems and applications
- Work side by side with research scientists to set up compute, storage, network, OS, and schedulers to enable HPC clusters for large-scale multi-GPU training jobs
- Provide on-call support and lead incident root cause analysis through multiple infrastructure layers (compute, storage, network) for HPC clusters and act as a final escalation point
- Apply modern systems engineering methodologies such as Infrastructure-as-Code, container orchestration, and software-defined storage for large-scale compute clusters
- Collaborate in a diverse team environment across multiple scientific and engineering disciplines, making the architectural tradeoffs required to rapidly deliver software and infrastructure solutions
- Find ways to leverage the scale and complexity of the larger Meta production infrastructure to solve problems for Reality Labs researchers
- Provide guidance to other engineers on best practices to build mature services which are highly available, reliable, secure, and scalable
- Help others around you move faster by identifying issues and driving them to resolution
- Influence outcomes within your immediate team, peer engineering teams, and with cross-functional stakeholders
- Work independently, handle multiple large projects simultaneously, and prioritize the team roadmap and deliverables by balancing required effort with resulting impact
Minimum Qualifications:
- BS or MS in Computer Science, Engineering, or a related technical discipline (or equivalent experience)
- 3+ years of experience in systems engineering
- 3+ years of experience working with monitoring and configuration management tools such as Chef, Ansible, Puppet, or SaltStack
- 3+ years of experience automating the management of infrastructure and services
- 3+ years of experience coding in at least one of the following languages: Python, Ruby, PHP, Rust, or Go
- Thorough understanding of Linux operating system internals
- Experience managing HPC schedulers and orchestrators such as Slurm, Kubernetes, or LSF
- Experience with Python environment and package management tools such as Conda or venv
Preferred Qualifications:
- Prior experience in building out HPC clusters, handling power, cooling, compute, storage, network, operating systems, schedulers, and stakeholder discussions
- Prior experience in cluster on-call operations, including troubleshooting server/scheduler/storage errors, maintaining compute/storage environments/libraries/tools, helping onboard users to the cluster, and answering general questions from users
- Prior experience in cluster coordination and strategy planning, including collecting/understanding needs of users, developing tools to improve user experience, providing guidance on best practices, coordinating distribution of compute/storage resources, forecasting compute/storage needs, and developing long-term user experience/compute/storage strategies
- Prior experience building tooling for monitoring and telemetry
- Prior experience supporting configuration management in a multi-region environment
- Prior experience optimizing multi-tenant HPC clusters for performance and maintenance
- Prior experience with containerization or virtualization technologies such as Docker or virtual machines
- Prior experience building services
- Prior experience building PaaS or internal clouds
- Prior experience in developing/managing distributed network file systems
- Prior academic or development experience with machine learning and/or deep learning
- Prior experience with ML libraries such as PyTorch, TensorFlow, or cuDNN
- Prior experience with computer vision libraries such as OpenCV
- Prior experience in GPGPU development with CUDA, OpenCL, or DirectCompute
EOE
Meta is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, transgender status, sexual stereotypes, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. We also consider qualified applicants with criminal histories, consistent with applicable federal, state and local law. Meta participates in the E-Verify program in certain locations, as required by law. Please note that Meta may leverage artificial intelligence and machine learning technologies in connection with applications for employment.
Meta is committed to providing reasonable accommodations for candidates with disabilities in our recruiting process. If you need any assistance or accommodations due to a disability, please let us know at [email protected].
Jobs in Pittsburgh, PA
Date Posted
10/24/2023