Job Description
Job Title: Site Reliability Engineer
Location: Fully Remote
Job Brief: We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team. The ideal candidate will have extensive experience in service reliability and operations, automation scripting, and application performance management. You will be responsible for ensuring the reliability, performance, and availability of our large-scale, high-performance applications in a hybrid environment.
Responsibilities:
- Manage and maintain large-scale, high-performance applications in both on-prem and cloud environments.
- Write automation scripts and build application performance management dashboards that track transaction journeys.
- Develop and maintain containerized applications in GKE/RKE/AKS environments.
- Implement cloud observability using OpenTelemetry (OTel) for real-time monitoring, distributed tracing, and incident resolution.
- Migrate platforms to the cloud and to containers using GCP, AWS, Rancher, CloudFormation, Azure, and OpenShift.
- Work with programming languages such as Go, Python, Java, Rust, etc.
- Use databases such as Oracle (PL/SQL), SQL Server, Redis, ClickHouse, Postgres, Mongo, or any time-series database.
- Implement and manage GraphQL frameworks (Apollo, Prisma, Hasura, etc.).
- Troubleshoot issues using knowledge of networking protocols such as TCP/IP, DNS, load balancing, and service mesh.
- Monitor and troubleshoot HashiCorp Vault environments to ensure minimal downtime and rapid recovery from incidents.
- Manage application availability and build creative solutions that automate repetitive activities, improve gating, and detect issues on a 24x7 high-availability platform.
- Use monitoring tools like Splunk, AppDynamics, Grafana/Prometheus, and Dynatrace.
- Implement in-memory caching solutions (Redis experience is a plus).
- Debug issues across a variety of integrated technical platforms on the API gateway.
- Work with GCS, Cloud SQL, PL/SQL, and Spanner.
- Utilize Vertex AI, Gen AI, and BigQuery for advanced data analysis and machine learning tasks.
Requirements:
- 3-5 years of service reliability or operations experience running large-scale, high-performance applications in a hybrid environment.
- 3-5 years of experience writing automation scripts and building dashboards for application performance management.
- 2-4 years of experience working with programming languages such as Go, Python, Java, Rust, etc.
- Working knowledge of one or more databases: Oracle (PL/SQL), SQL Server, Redis, ClickHouse, Postgres, Mongo, or any time-series database.
- At least 2 years of experience migrating platforms to the cloud and to containers (GCP, AWS, Rancher, CloudFormation, Azure, OpenShift).
- Experience maintaining containerized applications in GKE/RKE/AKS environments.
- Experience implementing cloud observability using OpenTelemetry (OTel).
- Experience working with specific GraphQL frameworks (Apollo, Prisma, Hasura, etc.).
- Knowledge of networking protocols such as TCP/IP, DNS, load balancing, and service mesh.
- Proven experience managing application availability and building solutions for a 24x7 high-availability platform.
- Working knowledge of monitoring tools (Splunk, AppDynamics, Grafana/Prometheus, Dynatrace).
- Experience with tools like Rally, Confluence, and other CI/CD extenders.
- Hands-on experience implementing in-memory caching solutions (Redis is a plus).
- Excellent debugging skills across various integrated technical platforms on the API gateway.
- Hands-on experience with GCS, Cloud SQL, PL/SQL, and Spanner.
- Experience monitoring and troubleshooting HashiCorp Vault environments.
- Working knowledge of Vertex AI, Gen AI, and BigQuery.
Job Tags
Remote job