Founding Data Engineer
Company
Elicit
Location
USA
Type
Full Time
Job Description
About Elicit
Elicit is an AI research assistant that uses language models to help professional researchers and high-stakes decision makers break down hard questions, gather evidence from scientific/academic sources, and reason through uncertainty.
What we're aiming for:
-
Elicit radically increases the amount of good reasoning in the world.
-
For experts, Elicit pushes the frontier forward.
-
For non-experts, Elicit makes good reasoning more accessible. People who don't have the tools, expertise, time, or mental energy to make carefully reasoned decisions on their own can do so with Elicit.
-
Elicit is a scalable ML system based on human-understandable task decompositions, with supervision of process, not outcomes. This expands our collective understanding of safe AGI architectures.
Visit our Twitter to learn more about how Elicit is helping researchers and making progress on our mission.
Why we're hiring for this role
Two main reasons:
-
Currently, Elicit operates over academic papers and clinical trials. One of your key initial responsibilities will be to build a complete corpus of these documents, available as soon as they're published, combining different data sources and ingestion methods. Once that's done, there is a growing list of other document types and sources we'd love to integrate!
-
One of our main initiatives is to broaden the sorts of tasks you can complete in Elicit. We need a data engineer to figure out the best way to ingest massive amounts of heterogeneous data in such a way as to make it usable by LLMs. We need your help to integrate with our customers' custom data providers so that they can create task-specific workflows over them.
In general, we're looking for someone who can architect and implement robust, scalable solutions to handle our growing data needs while maintaining high performance and data quality.
Our tech stack
-
Data pipeline: Python, Flyte, Spark
-
Probably less relevant to you, but ICOI:
-
Backend: Node and Python, event sourcing
-
Frontend: Next.js, TypeScript, and Tailwind
-
We like static type checking in Python and TypeScript!
-
All infrastructure runs in Kubernetes across a couple of clouds
-
We use GitHub for code reviews and CI
-
We deploy using the gitops pattern (i.e. deploys are defined and tracked by diffs in our k8s manifests)
Am I a good fit?
Consider the questions:
-
How would you optimize a Spark job that's processing a large amount of data but running slowly?
-
What are the differences between RDD, DataFrame, and Dataset in Spark? When would you use each?
-
How does data partitioning work in distributed systems, and why is it important?
-
How would you implement a data pipeline to handle regular updates from multiple academic paper sources, ensuring efficient deduplication?
If you have a solid answer for these—without reference to documentation—then we should chat!
Location and travel
We have a lovely office in Oakland, CA; there are people there every day, but we don't all work from there all the time. It's important to us to spend time with our teammates, however, so we ask that all Elicians spend about 1 week out of every quarter with teammates.
We wrote up more details on this page.
What you'll bring to the role
-
5+ years of experience as a data engineer: owning make-or-break decisions about how to ingest, manage, and use data
-
Strong proficiency in Python (5+ years experience)
-
You have created and owned a data platform at rapidly growing startups: gathering needs from colleagues, planning an architecture, deploying the infrastructure, and implementing the tooling
-
Experience with architecting and optimizing large data pipelines, ideally with particular experience with Spark; ideally these are pipelines which directly support user-facing features (rather than internal BI, for example)
-
Strong SQL skills, including understanding of aggregation functions, window functions, UDFs, self-joins, partitioning, and clustering approaches
-
Experience with columnar data storage formats like Parquet
-
Strong opinions, weakly held, about approaches to data quality management
-
Creative and user-centric problem-solving
-
You should be excited to play a key role in shipping new features to users—not just building out a data platform!
Nice to Have
-
Experience in developing deduplication processes for large datasets
-
Hands-on experience with full-text extraction and processing from various document formats (PDF, HTML, XML, etc.)
-
Familiarity with machine learning concepts and their application in search technologies
-
Experience with distributed computing frameworks beyond Spark (e.g. Dask, Ray)
-
Experience in science and academia: familiarity with academic publications and the ability to accurately model the needs of our users
-
Hands-on experience with industry-standard tools like Airflow, DBT, or Hadoop
-
Hands-on experience with standard paradigms like data lake, data warehouse, or lakehouse
What you'll do
-
Building and optimizing our academic research paper pipeline
-
You'll architect and implement robust, scalable systems to handle data ingestion while maintaining high performance and quality.
-
You'll work on efficiently deduplicating hundreds of millions of research papers and calculating embeddings.
-
Your goal will be to make Elicit the most complete and up-to-date database of scholarly sources.
-
Expanding the datasets Elicit works over
-
Our users want Elicit to work over court documents, SEC filings, … your job will be to figure out how to ingest and index a rapidly increasing ontology of documents.
-
We also want to support less structured documents, spreadsheets, presentations, all the way up to rich media like audio and video.
-
Larger customers often want us to integrate private data into Elicit for their organisation to use. We'll look to you to define and build a secure, reliable, fast, and auditable approach to these data connectors.
-
Data for our ML systems
-
You'll figure out the best way to preprocess all the data mentioned above to make it useful to models.
-
We often need datasets for our model fine-tuning. You'll work with our ML engineers and evaluation experts to find, gather, version, and apply these datasets in training runs.
Your first week:
-
Start building foundational context
-
Get to know your team, our stack (including Python, Flyte, and Spark), and the product roadmap.
-
Familiarize yourself with our current data pipeline architecture and identify areas for potential improvement.
-
Make your first contribution to Elicit
-
Complete your first Linear issue related to our data pipeline or academic paper processing.
-
Have a PR merged into our monorepo demonstrating your understanding of our development workflow.
-
Gain an understanding of our CI/CD pipeline, monitoring, and logging tools specific to our data infrastructure.
Your first month:
-
You'll complete your first multi-issue project
-
Tackle a significant data pipeline optimization or enhancement project.
-
Collaborate with the team to implement improvements in our academic paper processing workflow.
-
You're actively improving the team
-
Contribute to regular team meetings and hack days, sharing insights from your data engineering expertise.
-
Add documentation or diagrams explaining our data pipeline architecture and best practices.
-
Suggest improvements to our data processing and storage methodologies.
Your first quarter:
-
You're flying solo
-
Independently implement significant enhancements to our data pipeline, improving efficiency and scalability.
-
Make impactful decisions regarding our data architecture and processing strategies.
-
You've developed an area of expertise
-
Become the go-to resource for questions related to our academic paper processing pipeline and data infrastructure.
-
Lead discussions on optimizing our data storage and retrieval processes for academic literature.
-
You actively research and improve the product
-
Propose and scope improvements to make Elicit more comprehensive and up-to-date in terms of scholarly sources.
-
Identify and implement technical improvements to surpass competitors like Google Scholar in terms of coverage and data quality.
Compensation, benefits, and perks
In addition to working on important problems as part of a productive and positive team, we also offer great benefits (with some variation based on location):
-
Flexible work environment: work from our office in Oakland or remotely with time zone overlap (between GMT and GMT-8), as long as you can travel for in-person offsites
-
Fully covered health, dental, vision, and life insurance for you, generous coverage for the rest of your family
-
Flexible vacation policy with a minimum recommendation of 20 days/year + company holidays
-
401K with a 6% employer match
-
A new Mac + $1000 budget to set up your workstation or home office in your first year, then $500 every year thereafter
-
$1000 quarterly AI Experimentation & Learning budget so you can freely experiment with new AI tools to incorporate into your workflow, take courses, purchase educational resources, or attend AI-focused conferences and events
-
A team administrative assistant who can help you with personal and work tasks
-
You can find more reasons to work with us in this thread!
For all roles at Elicit, we use a data-backed compensation framework to keep salaries market-competitive, equitable, and simple to understand. For this role, we target starting ranges of:
-
Senior (L4): $185-270k + equity
-
Expert (L5): $215-305k + equity
-
Principal (L6): >$260k + significant equity
We're optimizing for a hire who can contribute at an L4/senior level or above.
We also offer above-market equity for all roles at Elicit, as well as employee-friendly equity terms.
Date Posted
11/30/2025