Director, SRE (Site Reliability Engineering)

Company: Litera

Location: Remote

Type: Full Time

Job Description

Our Story: Litera, headquartered in Chicago, IL, is a fast-growing and diverse software company and one of the leading legal technology suppliers in the world. Serving more than 90% of the world's largest law firms, our software is used by hundreds of thousands of lawyers every day. As a company recognized as one of the best places to work, we believe professional development, rewards programs, open communication, and transparent leadership all contribute to a unique and open work environment. Our employees are driven, energetic, passionate, and have the ability to make a direct impact on the future of the company.
The Opportunity: We are seeking a highly skilled and experienced Director of Site Reliability Engineering (SRE) to join our dynamic and innovative technology team. As the Director of cloud operations, you will lead and manage a team of Site Reliability Engineers to ensure the reliability, scalability, and performance of our systems and applications. This is a hands-on technical leadership role where you will guide the team in designing, implementing, and maintaining robust infrastructure and automation solutions. This position will report directly to the Vice President of Information Technology.
A Day in the Life/Role Responsibilities:
Strategy and Planning:
  • Develop and execute a comprehensive SRE strategy aligned with the organization's goals and objectives.
  • Collaborate with stakeholders to understand uptime requirements and formulate effective solutions.
  • Define and implement best practices, standards, and processes for site reliability engineering, focusing on scalability, availability, and performance.

Team Leadership and Management:
  • Provide hands-on leadership and guidance to a team of SRE engineers, fostering a collaborative and high-performing environment.
  • Define team goals, monitor progress, and ensure timely delivery of projects.
  • Mentor and coach team members, promoting their professional growth and skill development.

Development and Maintenance:
  • Develop and maintain monitoring, alerting, and incident response systems to proactively identify and resolve system failures or anomalies.
  • Fine-tune the operating system and database for performance and scalability.
  • Support application teams with systems tuning, indexing, and partitioning approaches to achieve read/write efficiency.
  • Drive the automation of infrastructure deployment, configuration management, and continuous integration/continuous delivery (CI/CD) processes.
  • Troubleshoot application issues, and proactively escalate and own them through to resolution before they become larger problems.
  • Design and implement processes and procedures to reduce and eliminate application outages and downtime.
  • Automate routine maintenance, monitoring, and deployment procedures.

Performance Optimization and Observability:
  • Work directly with development teams to enhance the performance and observability of various services.
  • Proactively work with engineering teams to ensure data and access standards are followed.
  • Work closely with software engineering teams to influence and guide the design and architecture of applications for improved reliability and performance.
  • Maintain monitoring and alerting, and participate in a rotating on-call schedule.
  • Support the design of systems, mission architecture, and associated hardware.

Collaboration and Communication:
  • Collaborate with cross-functional teams, including development, operations, and product management, to identify and address system-level bottlenecks and performance issues.
  • Communicate application updates, progress, and challenges to stakeholders and senior management.
  • Provide detailed analysis and feedback to the executive team for escalated tickets.

Role Progression:
Within 1 month, you will:
  • Complete our team onboarding process
  • Meet the team and learn the nuances and direction of our business units
  • Understand our product, business strategy, and roadmap
  • Review existing processes and systems to identify gaps in understanding and opportunities for new ways of working, and provide input

Within 3 months, you will:
  • Become familiar with the organization's infrastructure, systems, applications, and team dynamics. Understand the existing SRE processes, tools, and technologies in use.
  • Assess the skills, strengths, and areas for improvement within the SRE team. Identify any skills gaps and create a plan for addressing them through training, hiring, or cross-functional collaboration.
  • Establish relationships and open lines of communication with key stakeholders, including development teams, operations teams, product managers, and other leaders within the organization. Understand their needs, expectations, and pain points.
  • Evaluate the existing SRE processes and workflows. Identify areas for improvement, such as automation opportunities, scalability enhancements, or performance optimizations. Propose and implement changes to enhance the reliability and efficiency of the systems.
  • Collaborate with the SRE team to align goals, priorities, and performance metrics. Set clear expectations and define key performance indicators (KPIs) to measure the team's success. Establish a shared vision and ensure that everyone understands their roles and responsibilities.
  • Gain a deep understanding of the organization's incident management process. Participate in incident response activities and analyze post-incident reports to identify recurring issues or patterns. Propose and implement improvements to minimize the occurrence and impact of incidents.

Within 6 months, you will:
  • Gain a deep understanding of the organization's incident management process. Own the incident response activities and analyze post-incident reports to identify recurring issues or patterns. Propose and implement improvements to minimize the occurrence and impact of incidents.
  • Conduct a technical assessment of the organization's infrastructure, systems, and applications. Identify any areas that require immediate attention, such as performance bottlenecks or critical vulnerabilities. Develop a roadmap for addressing technical debt, system improvements, and automation initiatives.
  • Provide regular updates to senior leadership and stakeholders on the SRE team's progress, achievements, and challenges. Communicate effectively about the team's impact on system reliability, performance improvements, and cost savings. Prepare and present reports on key metrics and initiatives.
  • Establish and monitor reliability metrics, service level agreements (SLAs), and error budgets to ensure that the SRE team meets performance targets and drives continuous improvement. Implement robust monitoring and alerting systems to provide real-time visibility into system health and performance.
  • Conduct regular capacity planning exercises to ensure the infrastructure and systems can handle anticipated growth and traffic demands. Collaborate with infrastructure teams to optimize resource allocation and scalability strategies.
  • Evaluate and manage relationships with third-party vendors, ensuring they meet SLAs and performance expectations. Identify opportunities for cost optimization, service enhancements, or alternative vendor solutions.

About You: Qualifications and Traits
  • Bachelor's degree in Computer Science, Information Technology, or a related field. A master's degree is a plus.
  • 8+ years of proven experience in a hands-on technical leadership role, leading and managing a team of SRE engineers.
  • Strong expertise in designing and building highly scalable and reliable distributed systems.
  • Proficiency in scripting and programming languages (e.g., C#, Python, JavaScript, Shell scripting).
  • Extensive experience with cloud platforms (e.g., AWS, Azure) and containerization technologies (e.g., Docker, Kubernetes).
  • Deep understanding of system-level troubleshooting, performance optimization, and root cause analysis.
  • Strong ability to create, update, and maintain stored procedures.
  • Familiarity with infrastructure-as-code (IaC) tools such as Terraform or CloudFormation.
  • Knowledge of tools such as JIRA, Confluence, and Azure DevOps.
  • Proven track record of designing, building, optimizing,
  • Experience working in a demanding environment with highly motivated and driven professionals.
  • Ability to handle projects with multiple workstreams, with proven leadership success.

Why Litera?
We strive to stay current with all employment trends and prioritize flexibility, employee well-being, and diversity, equity, and inclusion (DEI). Most of our employees are fully remote, and we do not have a return-to-office plan as we are more successful working remotely. Our generous PTO and ten flexible holidays promote work-life balance. All our employees are also encouraged to access personal development courses and tools in our internal learning management system.
Litera is an equal opportunity employer, and proud to be committed to diversity and inclusiveness. We will consider all qualified applicants without regard to race, color, nationality, gender, gender identity or expression, sexual orientation, religion, disability, or age.
Note: As we continually reevaluate pay transparency rules and requirements, in order to ensure we are compliant, we believe a direct approach to be most accurate. If you are a California, Colorado, Connecticut, Maryland, Nevada, New Jersey, New York, Ohio, or Washington resident and this role is physically available in your state or classified as remote, you may be eligible to receive additional information about the compensation and benefits for this role, which we will provide upon request. Please send an email identifying the title of the role you are interested in and the state you reside in to [email protected].
Litera is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Date Posted: 10/19/2023