Principal Site Reliability Engineer
Job Description
Azure Cosmos DB is Microsoft's next-generation globally distributed, massively scalable, multi-model cloud database service. It is designed to enable developers to build planet-scale applications. Azure Cosmos DB is one of the fastest-growing Azure services. Joining the Azure Cosmos DB team is a fantastic opportunity to work with incredibly talented engineers who operate like a startup, to be at the forefront of building and shaping the Livesite Automation and AI Ops stack in Cosmos DB, and to lead the way toward broader adoption across Microsoft Azure.
Cosmos DB is a database of choice for customers ranging from the hobbyist developer to the largest Fortune 500 companies. It provides the data backbone of many critical systems in Health Care, Retail, Telecommunications, IoT, and other industries where service availability and latency are paramount. Cosmos DB provides financially backed service level agreements (SLAs) of 99.99% availability and < 10 ms latency, and we take pride in holding ourselves to even more stringent service level objectives (SLOs) that delight our customers. Beyond a resilient and fault-tolerant architecture, a key to attaining the SLOs is automating the root cause analysis and mitigation of issues, often addressing them proactively before there is any customer impact. This team prides itself on building systems where a vast majority of Livesite issues are automatically mitigated without the need for human intervention.
We are looking for a self-driven Principal Site Reliability Engineer (SRE) who likes taking a data-driven, systems-based approach to solving service reliability problems. You will be responsible for building and optimizing solutions that analyze massive amounts of telemetry and other service health indicators in near real time and perform automated root cause analysis and the mitigations necessary to restore SLOs.
Our team values diversity of all kinds among candidates for our roles, and we strive to hire people with different experiences and perspectives. To that end, we know that no candidate has every desired skill and experience, but all of us together make our team strong.
Qualifications:
Required Qualifications:
- 8+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
- OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.
- 8+ years of experience running large scale cloud services.
- 3+ years of operational experience in improving Service Reliability, Availability and Performance.
- 5+ years of hands-on experience in Python/Java/C#.
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
- Understanding of Observability and MELT implementation patterns for large-scale services.
- Experience in Logic Apps and authoring Jupyter Notebooks.
- Expertise in analyzing, troubleshooting, and automating root cause analysis and mitigation of incidents impacting large-scale distributed systems.
- Systematic problem-solving approach, coupled with effective communication skills and a sense of curiosity.
- Ability to deal with the ambiguity associated with working in a fast-paced environment.
- Influencing the product architecture and roadmap to make sure the customer-experienced supportability is always a key consideration when evolving the product.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.
Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
Responsibilities:
Responsibilities include but are not limited to:
- Collaborating closely with engineering teams to build and enhance tooling and automation solutions that resolve issues impacting SLOs faster and avert incidents altogether when possible.
- Collaborating with customers to understand their pain points around Supportability and SLO attainment, and formulating strategies for addressing recurring issues in a sustainable way.
- Communicating on a deeply technical level and serving as the single point of contact for large enterprise customers, handling service escalations and driving issues to resolution.
- Designing and implementing changes to service telemetry when the data the automation needs to consume is not already available.
- Enhancing the customer-facing experience through proactive alerting based on utilization, trends, resource health, etc.
- Analyzing data and providing operational insights into customer experience to Design and Product teams, so that we can design features with Supportability in mind.
Date Posted
08/10/2023