Site Reliability Engineer at AddShoppers

AddShoppers is searching for a full-time Site Reliability Engineer (SRE) with deep Linux and automation knowledge. As we continue to scale our fully distributed operations, we need a talented SRE to join our remote team and help ensure the reliability, scalability, and performance of our infrastructure. Your primary responsibility will be to keep that infrastructure available and stable by proactively monitoring, troubleshooting, and resolving incidents. You will collaborate closely with cross-functional teams, including software engineers, operations, and product managers, to optimize our systems and enhance their resilience.

Principal Responsibilities:

Monitor and maintain the health, performance, and availability of our production systems, ensuring proactive identification and resolution of potential issues
Troubleshoot and resolve incidents and outages promptly and effectively, minimizing downtime and impact on end-users
Develop and implement monitoring and alerting solutions to proactively detect and address performance bottlenecks, security vulnerabilities, and other issues
Collaborate with software engineering teams to design and implement scalable, reliable, and highly available systems
Own the development pipeline stages for cloud applications, from planning through environments and structures to logging and monitoring solutions
Lead infrastructure and platform deployments across cloud environments
Plan and perform project work aimed at increasing the availability and scalability of all components of our infrastructure
Perform end-to-end proofs of concept (POCs) on new tools or technologies, and help adopt and implement new DevOps tools and processes
Troubleshoot and debug issues using tools like tcpdump, nmap, and netstat, and interpret their output to direct development teams on what fixes are needed
Stay updated with industry trends and emerging technologies, applying them to enhance our infrastructure and SRE practices

Experience:

3-5+ years of experience with DevOps and overall cloud engineering, including 3+ years of hands-on engineering with related tools
Bachelor's degree in computer science, engineering, or mathematics
Excellent skills in automating, deploying, and configuring infrastructure tools, and in developing cloud-native distributed systems with automation and IaC (infrastructure as code) tools like Ansible, Puppet, and Terraform
GCP is the preferred platform but we realize skills are transferable
Experience developing container orchestration platforms, microservices, and serverless architectures
Solid understanding of Internet protocols (IP, TCP, HTTP, DNS, SSL/TLS)
Strong background in build and deployment processes, CI/CD, and application configuration management
Track record of efficiently migrating legacy workloads to scalable platforms
Exposure to security concepts, best practices, and policies for cloud-based deployments
Familiarity with our tech stack (or related technologies), which includes GCP, Python, MongoDB, Bash, and Elasticsearch

Knowledge of logging and performance management tools such as Splunk, Dynatrace, AppDynamics
Experience with web application firewall (WAF) concepts and technologies
Expertise in communication protocols, particularly the details of HTTP and HTTP server implementations
Knowledge of network configuration, including subnets, VPN, DNS, and DHCP
