Senior Software Engineer — AI Evaluation & Benchmarks (Python)
Job Description
Team: IT
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Software Engineer — AI Evaluation & Benchmarks (Python) in the United States.
In this highly specialized engineering role, you will help define how frontier AI systems are evaluated on real-world software engineering tasks. You will design and build the benchmarks, datasets, and evaluation pipelines used to measure coding ability across debugging, reasoning, and production-grade development scenarios. This position sits at the intersection of software engineering and AI research, where your work directly influences how next-generation models are trained and improved. You will develop scalable systems to run evaluations across large and complex codebases, analyze model outputs for correctness and edge-case failures, and translate findings into structured improvements in benchmark design. The role requires deep technical rigor, strong Python expertise, and a product-minded approach to experimentation and iteration. You will operate in a fast-moving, remote-first environment focused on innovation, precision, and impact on the future of AI systems.
Accountabilities:
- Design and build coding benchmarks that evaluate frontier AI models on real-world software engineering tasks, including debugging, reasoning, and production-level coding challenges.
- Develop and maintain scalable evaluation pipelines and data infrastructure to support large-scale model testing workflows.
- Analyze AI-generated code for correctness, robustness, performance issues, and edge-case failures across diverse programming scenarios.
- Construct structured evaluation environments across large repositories and multi-language codebases to ensure rigorous model assessment.
- Provide detailed technical feedback on model behavior, failure modes, and performance patterns to improve benchmarking frameworks.
- Contribute to the design and evolution of evaluation methodologies that define standards for measuring coding capability in AI systems.
- Collaborate with research and engineering stakeholders to refine benchmarks and integrate findings into iterative model improvement cycles.
- Ensure evaluation systems are reliable, reproducible, and optimized for scale and accuracy.
Requirements:
- 4+ years of professional software engineering experience in high-quality production environments.
- Expert-level Python development skills with strong emphasis on clean, performant, and well-tested code.
- Hands-on experience working within large, complex, and production-grade codebases.
- Proven experience building or contributing to LLM evaluation systems, coding benchmarks, or AI model testing pipelines.
- Strong understanding of Git workflows, software engineering best practices, and modern development processes.
- Experience working in high-growth technology companies or top-tier engineering organizations.
- Excellent analytical and problem-solving skills with strong attention to detail.
- Strong written communication skills in English with the ability to articulate technical insights clearly.
- Experience with CI/CD systems and unit testing frameworks is highly valued.
- Familiarity with additional programming languages such as JavaScript, Go, or C++ is a plus.
- Background in ML evaluation methodologies, open-source contributions, or security engineering is considered an advantage.
Benefits:
- Competitive hourly compensation ranging from $80 to $100 per hour based on experience and location.
- Fully remote contract opportunity with global flexibility across approved locations.
- Weekly payments via PayPal or Stripe.
- Short-term 3-month contract with potential for extension based on performance and project needs.
- Opportunity to work on cutting-edge AI systems shaping frontier model evaluation standards.
- High-impact technical role influencing how future AI coding capabilities are measured and improved.
- Exposure to advanced AI research workflows, benchmarking methodologies, and large-scale evaluation systems.
- Flexible, project-based engagement within a fast-evolving AI engineering environment.
Date Posted
05/15/2026