Senior Software Engineer — AI Evaluation & Benchmarks (Python)

Jobgether · US

Company

Jobgether

Location

US

Type

Full Time

Job Description

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Software Engineer — AI Evaluation & Benchmarks (Python) in the United States.

In this highly specialized engineering role, you will help define how frontier AI systems are evaluated on real-world software engineering tasks. You will design and build the benchmarks, datasets, and evaluation pipelines used to measure coding ability across debugging, reasoning, and production-grade development scenarios. This position sits at the intersection of software engineering and AI research, where your work directly influences how next-generation models are trained and improved. You will develop scalable systems to run evaluations across large and complex codebases, analyze model outputs for correctness and edge-case failures, and translate findings into structured improvements in benchmark design. The role requires deep technical rigor, strong Python expertise, and a product-minded approach to experimentation and iteration. You will operate in a fast-moving, remote-first environment focused on innovation, precision, and impact on the future of AI systems.

Accountabilities:

  • Design and build coding benchmarks that evaluate frontier AI models on real-world software engineering tasks, including debugging, reasoning, and production-level coding challenges.
  • Develop and maintain scalable evaluation pipelines and data infrastructure to support large-scale model testing workflows.
  • Analyze AI-generated code for correctness, robustness, performance issues, and edge-case failures across diverse programming scenarios.
  • Construct structured evaluation environments across large repositories and multi-language codebases to ensure rigorous model assessment.
  • Provide detailed technical feedback on model behavior, failure modes, and performance patterns to improve benchmarking frameworks.
  • Contribute to the design and evolution of evaluation methodologies that define standards for measuring coding capability in AI systems.
  • Collaborate with research and engineering stakeholders to refine benchmarks and integrate findings into iterative model improvement cycles.
  • Ensure evaluation systems are reliable, reproducible, and optimized for scale and accuracy.

Requirements:

  • 4+ years of professional software engineering experience in high-quality production environments.
  • Expert-level Python development skills with strong emphasis on clean, performant, and well-tested code.
  • Hands-on experience working within large, complex, and production-grade codebases.
  • Proven experience building or contributing to LLM evaluation systems, coding benchmarks, or AI model testing pipelines.
  • Strong understanding of Git workflows, software engineering best practices, and modern development processes.
  • Experience working in high-growth technology companies or top-tier engineering organizations.
  • Excellent analytical and problem-solving skills with strong attention to detail.
  • Strong written communication skills in English with the ability to articulate technical insights clearly.
  • Experience with CI/CD systems and unit testing frameworks is highly valued.
  • Familiarity with additional programming languages such as JavaScript, Go, or C++ is a plus.
  • Background in ML evaluation methodologies, open-source contributions, or security engineering is considered an advantage.

Benefits:

  • Competitive hourly compensation ranging from $80 to $100 per hour based on experience and location.
  • Fully remote contract opportunity with global flexibility across approved locations.
  • Weekly payments via PayPal or Stripe.
  • Short-term 3-month contract with potential for extension based on performance and project needs.
  • Opportunity to work on cutting-edge AI systems shaping frontier model evaluation standards.
  • High-impact technical role influencing how future AI coding capabilities are measured and improved.
  • Exposure to advanced AI research workflows, benchmarking methodologies, and large-scale evaluation systems.
  • Flexible, project-based engagement within a fast-evolving AI engineering environment.

Date Posted

05/15/2026
