Staff Research Engineer, Pre-training Data at Reddit

Reddit is continuing to grow our teams with the best talent. This role is completely remote-friendly within the United States. If you happen to live close to one of our physical office locations (San Francisco, Los Angeles, New York City & Chicago), our doors are open for you to come into the office as often as you'd like.

The AI Engineering team at Reddit is embarking on a strategic initiative to build our own Reddit-native foundational Large Language Models (LLMs). This team sits at the intersection of applied research and massive-scale infrastructure, tasked with training models that truly understand the unique culture, language, and structure of Reddit communities. You will be joining a team of distinguished engineers and safety experts to build the 'engine room' of Reddit's AI future: the foundational models that will power Safety & Moderation, Search, Ads, and the next generation of user products.

As a Staff Research Engineer for Pre-training Data, you will define the technical strategy and architecture for the data curriculum pipelines that power our next-generation foundation models. Sitting at the intersection of distributed infrastructure, multimodal processing, and mathematics, you will design systems that transform Reddit's unique corpus of human conversation (petabytes of text, images, and video) into high-quality training signals. You will move beyond flat text processing to engineer solutions that respect the complex, tree-structured nature of Reddit threads, ensuring our models learn the nuance of community interaction.

Responsibilities:

Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale.
Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities.
Engineer the logic for serializing Reddit's complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding.
Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality.
Design the 'Safety-First' ingestion layer: build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts.
Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure.
Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments.

Required Qualifications:

8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training.
Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders).
Experience handling unstructured and semi-structured data at scale (not just tabular data), specifically text, code, images, and audio/video.
Strong mathematical foundation in probability, statistics, and importance sampling theory.
Deep understanding of pre-training dynamics and the impact of data quality/ordering on model performance.
Experience working with graph data structures or serializing conversation trees is highly valued.

Nice to Have:

Experience with JAX or PyTorch internals related to distributed data loading.
Experience with multimodal datasets (image/video + text) and vision-language preprocessing.
Proficiency in Rust or C++ for performance-critical data path optimization.
Published research or significant practical experience in active learning or automated data selection.

Benefits:

Comprehensive Healthcare Benefits and Income Replacement Programs
401k with Employer Match
Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
Family Planning Support
Gender-Affirming Care
Mental Health & Coaching Benefits
Flexible Vacation & Paid Volunteer Time Off
Generous Paid Parental Leave

Pay Transparency:

This job posting may span more than one career level.

In addition to base salary, this job is eligible to receive equity in the form of restricted stock units, and depending on the position offered, it may also be eligible to receive a commission. Additionally, Reddit offers a wide range of benefits to U.S.-based employees, including medical, dental, and vision insurance, a 401(k) program with employer match, and generous time off for vacation and parental leave. To learn more, please visit https://www.redditinc.com/careers/.

To provide greater transparency to candidates, we share base salary ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar-stage growth companies. Final offer amounts are determined by multiple factors, including skills, depth of work experience, and relevant licenses/credentials, and may vary from the amounts listed below.

The base salary range for this position is:

$230,000 - $322,000 USD
