A career in IBM Software means you’ll be part of a team that transforms our customers’ challenges into solutions.
Seeking new possibilities and always staying curious, we are a team dedicated to creating the world’s leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.
IBM’s product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting to deploying the latest software updates and fixes.
Your primary responsibilities include:
- 24x7 Observability: Be part of a worldwide team that monitors the health of production systems and services around the clock, ensuring continuous reliability and an optimal customer experience.
- Cross-Functional Troubleshooting: Collaborate with engineering teams to provide initial assessments and possible workarounds for production issues. Troubleshoot and resolve production issues effectively.
- Deployment and Configuration: Leverage Continuous Integration and Continuous Delivery (CI/CD) tools to deploy services and configuration changes at enterprise scale.
- Maintenance and Support: Apply security patches and upgrades, and collaborate with Product support on issue resolution.
- Documentation: Review, update, and create documentation and runbooks as necessary, with strong attention to detail.
- System Monitoring and Troubleshooting: 1 year of experience in monitoring/observability, issue response, and troubleshooting for optimal system performance.
- Automation Proficiency: 1 year of experience in automation for production environment changes, streamlining processes for efficiency and reducing toil.
- Linux: 1 year of experience working with Linux operating systems.
- Operation and Support Experience: 1 year of experience in handling day-to-day operations, alert management, incident support, migration tasks, and break-fix support.
- Kubernetes/OpenShift: Knowledge of or experience with Kubernetes/OpenShift environments.
- Automation/Scripting: Knowledge of or experience with Ansible, Python, Terraform, and CI/CD tools such as Jenkins, IBM Continuous Delivery, and ArgoCD.
- Monitoring/Observability: Knowledge of or experience with crafting alerts and dashboards using tools such as Instana, New Relic, Grafana/Prometheus, and OpenSearch.
- DBA: Interest in or experience with configuring and maintaining SQL, NoSQL, and data streaming technologies (e.g., DB2, OpenSearch, PostgreSQL, CouchDB, Kafka, Spark).
- Communication: Ability to communicate with external audiences including remote technicians at data centers and colocation facilities.
- Networking: Understanding of the basic OSI networking model. CCNA and/or JNCIA certifications are strongly preferred.