available for opportunities · Dublin, Ireland

Salman Hadi

Site Reliability Engineer  ·  Incident Response Manager

7+ years keeping high-availability fintech and enterprise systems reliable, observable, and resilient. Specialist in monitoring strategy, incident management, and operational automation.

7+ Years Experience
3 Cloud Platforms
6 Observability Tools

Experience

Feb 2022 — Present  ·  Dublin, Ireland
FreedomPay
Senior Platform Operations Engineer (Site Reliability)
  • Lead production operations and reliability improvements for a global cloud-based payment platform in high-availability environments.
  • Design and implement monitoring strategies using Dynatrace, Splunk, and New Relic — reducing alert noise and improving incident detection across production services.
  • Build end-to-end observability across infrastructure and services for proactive monitoring and performance optimisation.
  • Lead major incident response and root cause analysis (RCA), coordinating cross-team investigations and defining long-term corrective actions.
  • Develop Python automation utilities to streamline incident management, including automated timeline capture and P1 incident template generation.
  • Develop production support readiness frameworks, improving release processes and operational stability across engineering teams.
  • Perform SQL-based investigations to validate data flows and troubleshoot service anomalies.
  • Facilitate weekly Change Advisory Board (CAB) reviews for maintenance window activities.
Jan 2019 — Feb 2022  ·  Dublin, Ireland
Version 1
Technology Consultant
  • Acted as SME and lead analyst for Revenue.ie customs applications, supporting large-scale transactional systems in national customs operations.
  • Worked closely with business stakeholders and delivery teams to analyse requirements, support implementations, and troubleshoot production issues.
  • Investigated data integrity and transactional issues using SQL across Java and COBOL systems in high-volume environments.
  • Designed Bash automation scripts to streamline operational tasks, reducing manual workload by 30+ hours per month.
  • Supported AWS cloud transformation projects and virtualisation environments (VMware, Hyper-V).
  • Delivered internal training and knowledge-sharing sessions to improve team capability and system understanding.

Projects

⚙️ Incident Timeline Automation Tool

Built a Python utility that automatically captures and formats incident timelines during major outages, eliminating manual logging during high-pressure P1 situations and significantly reducing documentation effort post-incident.

Python · Automation · Incident Mgmt · FreedomPay
📡 Observability & Monitoring Strategy

Designed and implemented a full-stack monitoring strategy across production payment services using Dynatrace, Splunk, and New Relic. Reduced alert noise by consolidating thresholds and introducing synthetic monitoring for critical user journeys.

Dynatrace · Splunk · New Relic · SRE
🛡️ Production Support Readiness Framework

Developed a structured readiness framework to assess and improve operational stability before major releases — covering monitoring coverage, runbook completeness, on-call preparedness, and rollback plans across engineering teams.

Operations · Documentation · Release Mgmt · Fintech
🖥️ Bash Automation Suite — Revenue.ie

Designed a suite of Bash scripts to automate repetitive operational tasks for national customs infrastructure at Version 1, saving 30+ engineering hours per month and reducing risk from manual processes in high-volume environments.

Bash · Linux · Automation · Version 1
☁️ AWS Cloud Transformation Support

Contributed to AWS cloud migration projects at Version 1, supporting workload transitions from on-premises VMware and Hyper-V environments to AWS while maintaining operational continuity for public sector clients.

AWS · VMware · Hyper-V · Cloud Migration
📋 P1 Incident Template Generator

Built a Python tool that generates structured P1 incident report templates pre-populated with service context, stakeholder lists, and escalation paths — cutting initial triage time and ensuring consistent communication during critical outages.

Python · Incident Response · Tooling · FreedomPay

Skills & Tooling

☁️ Infrastructure
Cloud Platforms
AWS · Azure · GCP · VMware · Hyper-V

📡 Observability
Monitoring & Alerting
Dynatrace · Splunk · New Relic · CloudWatch · Datadog · Pingdom

Automation
Scripting & Querying
Python · Bash · SQL · SPL · DQL

🔴 Operations
Incident Management
Incident Response · Root Cause Analysis · CAB Reviews · SRE

🐧 Systems
Operating Systems
Linux · Windows Server

🛠️ Collaboration
Dev & Ops Tooling
Jira · Azure DevOps · Confluence · SharePoint

Certifications

AWS Cloud Practitioner
Google Cloud Platform Fundamentals
Google SRE Fundamentals
Google Project Management

Education

2017 – 2018
MSc Cloud Computing
National College of Ireland
2011 – 2016
BEng Information Technology
University of Pune

open to new roles

Let's Build Something Reliable

Looking for a Senior SRE or Incident Response Manager role in Dublin or remote. Let's connect.