Site Reliability Architect

Macrometa

This job is no longer accepting applications

United States · India

Posted 6+ months ago

Our imagination is fueled by a vision of enabling developers to build apps and APIs without any limitations of time, space and cloud architectures. A world where ideas can be expressed instantly on a smart and reliable edge cloud platform that does all the heavy lifting of delivering their apps and data across the cloud and edge anywhere in the world.

Our mission is to make every developer a hero by making globally distributed application development and deployment simple and instant. For us, this means taking responsibility for the entire experience of building and running cloud and edge apps. To do this, we must provide the most powerful globally distributed stateful edge runtime, deep capillary networks, and a developer experience second to none.

Macrometa's culture is built on mutual respect and honest interactions. We value humble people who are curious to learn from and help each other. We prioritize our people first, customers second, and everything else third.

The Role:

Are you excited to work with a talented and experienced team on groundbreaking new ideas in building a planetary-scale, distributed, decentralized, real-time data platform?

Are you interested in delivering cutting-edge geo-distributed cloud infrastructure software, then maintaining, securing, and scaling it to meet users' needs while keeping an ever-watchful eye on capacity and performance? If so, we may have your dream job at Macrometa.

We are seeking an experienced SRE Architect to manage and operate our geo-distributed data platform, which runs on AWS, GCP, Linode, and on-premises infrastructure. The SRE Architect will be responsible for identifying which operations to automate, which processes to follow for deploying, managing, and troubleshooting the production systems, and what needs to be done to maintain SLAs on availability and performance.

What You Will Do:

Manage and operate a highly available, scalable geo-distributed data platform that runs across multiple cloud providers (AWS, GCP, and Linode) as well as on-premises infrastructure, and that supports a high volume of traffic.
Identify operations to automate, and design and implement automation frameworks for provisioning, configuration, and deployment of infrastructure and applications.
Develop and implement effective incident management processes and procedures to minimize service disruptions and mitigate the impact of incidents when they occur.
Coordinate and lead incident response teams to quickly and effectively address incidents.
Analyze incidents to identify root causes and implement measures to prevent similar incidents in the future.
Develop and implement disaster recovery and business continuity plans to minimize service disruptions and data loss.
Develop and implement security best practices and standards to ensure the security and privacy of the data and platform.
Design and implement effective monitoring, logging, and alerting systems to ensure the stability and performance of the platform.
Collaborate with developers, product managers, and operations teams to ensure the platform meets the needs of the business.
Develop and implement continuous integration and delivery pipelines to ensure faster and more reliable software delivery.
Maintain a deep understanding of emerging technologies, best practices, and industry trends, and make recommendations for improvements to the platform.
Establish processes for deploying, managing, and troubleshooting production systems across multiple cloud providers and on-premise infrastructure.
Maintain SLAs on availability and performance, and ensure that the platform is meeting the needs of the business.

Who You Are:

Bachelor’s or Master’s degree in Computer Science or a related field.
10+ years of experience in designing and managing large-scale, geo-distributed systems, with expertise in cloud computing, networking, and security.
Experience managing and running production systems on AWS, GCP, Linode, and on-premises infrastructure.
Strong understanding of incident management processes and procedures and ability to lead incident response teams effectively.
Experience in automating operations tasks and deploying applications using continuous integration and delivery pipelines.
Strong problem-solving skills and ability to analyze incidents and identify root causes.
Passion for learning and staying up-to-date with emerging technologies and best practices.
Excellent communication and collaboration skills.

Note to recruitment agencies: Macrometa will not accept unsolicited resumes/CVs and will not pay fees of any kind for unsolicited resumes/CVs sent to us by third parties.