Data Engineer

Elicit · USA

Company: Elicit

Location: USA

Type: Full Time

Job Description

About Elicit

Elicit is an AI research assistant that uses language models to help researchers figure out what’s true and make better decisions, starting with common research tasks like literature review.

What we're aiming for:

  • Elicit radically increases the amount of good reasoning in the world.

    • For experts, Elicit pushes the frontier forward.

    • For non-experts, Elicit makes good reasoning more affordable. People who don't have the tools, expertise, time, or mental energy to make well-reasoned decisions on their own can do so with Elicit.

  • Elicit is a scalable ML system based on human-understandable task decompositions, with supervision of process, not outcomes. This expands our collective understanding of safe AGI architectures.

Visit our Twitter to learn more about how Elicit is helping researchers and making progress on our mission.

Why we're hiring for this role

Since launching the newest version of Elicit last fall, response has been strong. We introduced Elicit Plus, our monthly subscription plan, and added thousands of paying users in a matter of months, as well as hundreds of thousands of new sign-ups. This has been energizing for our team, but we want to ship more useful functionality to our users even faster.

Our academic research paper search pipeline is at the heart of Elicit's capabilities, and we need a skilled Data Engineer to take it to the next level. Our mission is to make Elicit the most complete and up-to-date database of scholarly sources. As we continue to add new data sources and expand our coverage of academic literature, we face challenges in efficiently processing, deduplicating, and indexing hundreds of millions of research papers. We're looking for someone who can architect and implement robust, scalable solutions to handle our growing data needs while maintaining high performance and data quality.

Our tech stack

  • Data pipeline: Python, Flyte, Spark

  • Frontend: Next.js, TypeScript, and Tailwind

  • Backend: Node and Python

  • We like static type checking in Python and TypeScript

  • All infrastructure runs in Kubernetes across a couple of clouds

  • We use GitHub for code reviews and CI

Am I a good fit?

Consider these questions:

  • How would you optimize a Spark job that's processing a large amount of data but running slowly?

  • What are the differences between RDD, DataFrame, and Dataset in Spark? When would you use each?

  • How does data partitioning work in distributed systems and why is it important?

  • How would you implement a data pipeline to handle regular updates from multiple academic paper sources, ensuring efficient deduplication?

If you have a solid answer for these—without reference to documentation—then we should chat!

What you'll bring to the role

  • 5+ years of experience as a data engineer building core datasets and supporting business verticals with high data volumes

  • Strong proficiency in Python (5+ years experience)

  • Experience with architecting and optimizing large data pipelines with Spark

  • Strong SQL skills, including understanding of aggregation functions, window functions, UDFs, self-joins, partitioning, and clustering approaches

  • Experience with Parquet file formats and other columnar data storage formats

  • Strong data quality management skills

  • Ability to balance technical expertise with creative problem-solving

  • Excited to interact with product/web app and help ship new features (e.g., real-time updates, advanced filtering options)

  • Experience shipping scalable data solutions in the cloud (e.g., AWS, GCP, Azure) across multiple data stores and methodologies

Nice to Have

  • Familiarity with web crawling technologies and best practices for data extraction from websites

  • Experience in developing deduplication processes for large datasets

  • Hands-on experience with full-text extraction and processing from various document formats (PDF, HTML, XML, etc.)

  • Experience working with academic research databases or scientific literature

  • Familiarity with machine learning concepts and their application in search technologies

  • Experience with distributed computing frameworks beyond Spark (e.g., Dask, Ray)

  • Knowledge of academic publishing processes and metadata standards

  • Hands-on experience with Airflow, DBT, or Hadoop

What you'll do

You'll own:

  • Building and optimizing our academic research paper pipeline

    • You'll architect and implement robust, scalable solutions to handle our growing data needs while maintaining high performance and data quality.

    • You'll work on efficiently processing, deduplicating, and indexing hundreds of millions of research papers.

    • Your goal will be to make Elicit the most complete and up-to-date database of scholarly sources.

  • Enhancing Elicit's data infrastructure

    • You'll optimize our Spark jobs and data pipelines to handle large amounts of data efficiently.

    • You'll implement data partitioning strategies in our distributed systems to improve performance.

    • You'll develop processes to handle regular updates from multiple academic paper sources, ensuring efficient deduplication.

  • Maintaining and improving data quality

    • You'll implement robust data quality management processes to ensure the accuracy and reliability of our academic database.

    • You'll work on developing defenses against unexpected changes from publishers to maintain data integrity.

Your first week:

  • Start building foundational context

    • Get to know your team, our stack (including Python, Flyte, and Spark), and the product roadmap.

    • Familiarize yourself with our current data pipeline architecture and identify areas for potential improvement.

  • Make your first contribution to Elicit

    • Complete your first Linear issue related to our data pipeline or academic paper processing.

    • Have a PR merged into our monorepo, demonstrating your understanding of our development workflow.

    • Gain an understanding of our CI/CD pipeline, monitoring, and logging tools specific to our data infrastructure.

Your first month:

  • You'll complete your first multi-issue project

    • Tackle a significant data pipeline optimization or enhancement project.

    • Collaborate with the team to implement improvements in our academic paper processing workflow.

  • You're actively improving the team

    • Contribute to regular team meetings and hack days, sharing insights from your data engineering expertise.

    • Add documentation or diagrams explaining our data pipeline architecture and best practices.

    • Suggest improvements to our data processing and storage methodologies.

Your first quarter:

  • You're flying solo

    • Independently implement significant enhancements to our data pipeline, improving efficiency and scalability.

    • Make impactful decisions regarding our data architecture and processing strategies.

  • You've developed an area of expertise

    • Become the go-to resource for questions related to our academic paper processing pipeline and data infrastructure.

    • Lead discussions on optimizing our data storage and retrieval processes for academic literature.

  • You actively research and improve the product

    • Propose and scope improvements to make Elicit more comprehensive and up-to-date in terms of scholarly sources.

    • Identify and implement technical improvements to surpass competitors like Google Scholar in terms of coverage and data quality.

Who you'll work with

This role will report directly to James, our Head of Engineering, and work very closely with the rest of the engineering team.

You'll also spend a lot of time collaborating with Kevin (Head of Product) and co-founders Jungwon & Andreas.

Compensation, benefits, and perks

In addition to working on important problems as part of a productive and positive team, we also offer great benefits (with some variation based on location):

  • Flexible work environment: work from our office in Oakland or remotely with time zone overlap (between GMT and GMT-8), as long as you can travel for in-person retreats and coworking events

  • Fully covered health, dental, vision, and life insurance for you, generous coverage for the rest of your family

  • Flexible vacation policy with a minimum recommendation of 20 days/year + company holidays

  • 401K with a 6% employer match

  • $2,000 device budget to start, with more accumulating for each month of work

  • $500 / year personal development budget

  • A team administrative assistant who can help you with personal and work tasks

  • You can find more reasons to work with us in this thread!

For all roles at Elicit, we use a data-backed compensation framework to keep salaries market-competitive, equitable, and simple to understand.

  • This role starts between $195-230K + equity, depending on your level. We're optimizing for a hire who can contribute at an L4/senior level or above.

Date Posted: 10/23/2024
