Resilience & Availability

This section describes our approach to delivering resilience and high availability across all our services, in order to meet our clients' need to serve their own customers 24x7x365.

Business Continuity and Disaster Recovery fall under our ISO27001:2013 accreditation, and we have defined processes that are tested every three months as part of our ongoing accreditation. Eagle Eye maintains a framework approach to Business Continuity Management, with structured plans covering Organization, IT, Personnel and Reputational Recovery, as well as Fraud Response.

Backup & data availability

Our database is deployed as a cluster to protect against server and zone failure. The database is replicated across all three zones in the GCP region: one server is active and two are hot standbys, ready to take over if there is an issue with the active (master) node. The data disks are snapshotted every 2 hours, and the snapshot images are stored in multiple regions. As part of our compliance with ISO27001:2013, we have documented processes for rebuilding the platform and for using backup datasets to re-establish a working copy of it. We routinely test our failover and backup restore processes to ensure that they are adequate and that we remain familiar with them.

All other servers within the platform are built using automation tools for rapid deployment.

Monitoring services

The Eagle Eye AIR service is proactively monitored using tools such as New Relic, Nagios and Google Cloud Monitoring. We email all named contacts for each of our clients to advise them of any active issues or incidents raised or detected, along with their potential impact, and we also publish them on our public status page: https://status.eagleeye.com. All our APIs and interfaces are monitored at multiple levels to ensure we’re providing access according to our SLAs. These levels include:

Uptime monitoring – we monitor all our endpoints to ensure they are available and responding correctly. Alerts are sent to our operations team if an endpoint becomes unresponsive. We use multiple third-party tools for this monitoring.

Performance monitoring – we monitor all our endpoints to ensure response times are within our acceptable range. Alerts are sent to our operations team if any degradation is detected. Alerts are triggered early so investigations can start before any issues begin to impact customers.

Application trace logging – we log and monitor the internal communications within our platform to identify any bottlenecks or slowdowns in internal applications and services, so that our response times stay within our acceptable range. Alerts are sent if internal response times begin to rise, and are investigated by our operations and development teams.

Error rate logging – we track and monitor errors reported from the platform in real time. Any service-impacting errors are picked up and resolved by our operations and development teams.

All these logs are routinely reviewed, and proactive engineering work is carried out during our normal sprint cycles to address any potential future issues.
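
As a rough illustration of the early-alerting idea described in the performance monitoring item above, the sketch below classifies a window of response times against two thresholds. The threshold values, the choice of the 95th percentile and the function names are assumptions made for this example only; they are not Eagle Eye's actual alerting configuration, which is handled by the monitoring tools named earlier.

```python
from statistics import quantiles

# Hypothetical thresholds, chosen only for illustration.
WARN_MS = 300      # early-warning level: start investigating
CRITICAL_MS = 500  # assumed upper limit of the acceptable range


def p95(samples_ms):
    """95th percentile of a window of response times (needs >= 2 samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples_ms, n=20)[-1]


def classify(samples_ms):
    """Return 'ok', 'warning' or 'critical' for a window of response times."""
    latency = p95(samples_ms)
    if latency >= CRITICAL_MS:
        return "critical"
    if latency >= WARN_MS:
        return "warning"
    return "ok"
```

Setting the warning level well below the critical level is what allows an investigation to start while response times are still inside the acceptable range.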

Availability SLA

Eagle Eye AIR is a multi-tenanted, SaaS-based platform hosted on Google Cloud Platform (GCP), with a published service availability SLA of 99.9% each month. It leverages the inherent failover provided by GCP’s three physically separate zones within each region.
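
To put the 99.9% monthly figure in concrete terms, the short calculation below converts an availability percentage into a monthly downtime budget. It assumes availability is measured over every minute of the calendar month; how the SLA actually treats items such as scheduled maintenance is not stated here, and the function name is hypothetical.

```python
def downtime_budget_minutes(availability_pct: float, days_in_month: int) -> float:
    """Maximum downtime (in minutes) that still meets the availability target."""
    total_minutes = days_in_month * 24 * 60
    # The unavailable share of the month is (100 - availability) percent.
    return round(total_minutes * (100 - availability_pct) / 100, 2)


if __name__ == "__main__":
    for days in (28, 30, 31):
        print(days, "days:", downtime_budget_minutes(99.9, days), "minutes")
```

Under these assumptions, a 30-day month at 99.9% allows at most 43.2 minutes of downtime.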

RTO / RPO

In a worst-case Disaster Recovery scenario, we have a published Recovery Time Objective (RTO) of 4 hours and a Recovery Point Objective (RPO) of 2 hours. These are tested every 6 months as part of our ongoing ISO27001:2013 and SOC2 Type 2 accreditations.
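
The sketch below shows one way a recovery test could be checked against these objectives. The RTO and RPO values come from this section; the function names and the simplification that data loss equals the time since the last completed snapshot are assumptions for illustration. With disks snapshotted every 2 hours, a restore from snapshot loses at most about 2 hours of data, which lines up with the published RPO.

```python
from datetime import datetime, timedelta, timezone

RTO = timedelta(hours=4)  # published Recovery Time Objective
RPO = timedelta(hours=2)  # published Recovery Point Objective


def meets_rpo(last_snapshot_at: datetime, failure_at: datetime,
              rpo: timedelta = RPO) -> bool:
    """True if restoring the last snapshot loses no more data than the RPO allows."""
    return failure_at - last_snapshot_at <= rpo


def meets_rto(failure_at: datetime, restored_at: datetime,
              rto: timedelta = RTO) -> bool:
    """True if service was restored within the RTO after the failure."""
    return restored_at - failure_at <= rto


if __name__ == "__main__":
    # Hypothetical test timeline, not real incident data.
    snapshot = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
    failure = snapshot + timedelta(hours=1, minutes=30)
    restored = failure + timedelta(hours=3)
    print("RPO met:", meets_rpo(snapshot, failure))
    print("RTO met:", meets_rto(failure, restored))
```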