Site Reliability Engineer

Salary not available.

Capital on Tap, City of Westminster

Full time
Permanent
Onsite working

Posted 17 Oct

Closing date: not specified

Job ref: 8a5a2667695c4a838b999eb1b00d54b8

Full Job Description

Capital on Tap was founded with the mission to help small business owners and make their lives easier. Today, we provide an all-in-one business credit card & spend management platform that helps business owners save time and money. Capital on Tap proudly serves over 200,000 businesses across the world, and our goal is to help 1 million small businesses by 2030.

Why Join Us?

We empower you to be innovative and solve complex problems. Take ownership, make an impact, and thrive in our scaling and agile environment.

This is a hybrid role: the Site Reliability Team works from our London (Shoreditch) offices 1 day per week.

What You'll Be Doing

Our Site Reliability Engineers work closely with our Platform and Engineering teams to ensure our application infrastructure is robust and scalable. As a Site Reliability Engineer at Capital on Tap, you will be responsible for designing, building, and monitoring systems to maximise platform uptime and efficiency for the best possible end-user experience. You are also tasked with identifying and resolving potential outages and performance issues before they become a problem.

Manage Azure services and resources, Cloudflare edge security, and traffic management as code
Create, manage, and monitor development resources within Kubernetes clusters and serverless platforms (e.g. Function Apps, Automation Accounts) for Product Engineering teams
Own Terraform / Ansible / Pulumi Infrastructure as Code for each Product Engineering team
Continuously identify opportunities for improvement in systems, processes, and technologies, and implement changes to improve the overall reliability and performance of the platforms
Improve monitoring to provide insights into uptime and availability, and work towards the agreed SLO
Own and lead the troubleshooting of incidents that impact the customer experience

Our Tech Stack

Cloud: Azure and GCP
Containerisation: Kubernetes, Docker
IaC: Terraform
CI/CD: Azure DevOps, Octopus Deploy
Monitoring: Datadog, Prometheus, Grafana
Scripting: Python, PowerShell, Bash

What We're Looking For

Experience in managing public cloud processes
Experience in Azure DevOps, Octopus, and other CI/CD tools
Experience in Python, PowerShell, Bash, or other scripting languages
Experience with Terraform
Experience working with a cloud monitoring solution (Datadog would be advantageous)
Experience with Kubernetes and Docker (advantageous)

We welcome, consider and encourage applications from anyone who shares our commitment to inclusivity. Join us in creating a space where authenticity thrives, and everyone can do their best work.