Summary
Between April 4 and April 19, users in the European region experienced intermittent performance degradation and connection issues when using the ROGER365.io Teams app. These disruptions included slow page loading, call routing delays, and temporary disconnections during gateway failovers.
Timeline & Impact
April 4, 2025 (11:05–11:31 CEST): Latency on required Azure services exceeded acceptable thresholds, triggering a gateway-side issue that caused slowness and partial call disruption. Traffic was automatically redirected, resulting in brief call interruptions for some users.
April 7, 2025 (10:00–10:12 CEST): A similar spike in latency again triggered the issue, affecting a portion of users and causing temporary service degradation.
April 14, 2025 (12:10–12:48 CEST): Elevated network latency caused the primary gateway infrastructure to stall. The platform automatically failed over to backup infrastructure, causing temporary disruption for all users.
April 15, 2025 (12:58 CEST): Our team confirmed that the issue originated from latency and timeouts in essential Azure services. We engaged Microsoft, escalated the case, and worked closely with their teams throughout the investigation.
April 16, 2025 (08:08–08:10 CEST and 14:05–15:00 CEST): Additional instability caused the gateways to stall several times. We proactively failed over our gateways and databases to other Azure regions to reduce our exposure to the affected Azure environment. One of these failovers was performed manually as a precautionary measure. Microsoft extended their internal investigation to include these new events.
April 16, 2025 (20:30 CEST): A software fix was deployed to the gateway infrastructure to improve resilience against slow or unresponsive services.
April 17–18, 2025: Following the fix, the platform handled further Azure network instability and a Microsoft Support-initiated database failover without issue, confirming under real conditions the changes that had been validated under simulated latency and load. Microsoft continued their internal investigation.
April 19, 2025 (05:30 CEST): All systems successfully transitioned back to the original Azure region.
April 17–24, 2025: No new disruptions have been observed since the fix was deployed on April 16. Monitoring confirmed the stability and effectiveness of the fix under varying load and network conditions.
Root Cause
The incident was triggered by elevated latency and timeouts in required Azure services outside the ROGER365 domain. This affected the gateway infrastructure’s ability to communicate with key platform components within Azure and revealed a threading-related issue in the gateway logic. The issue was not introduced by a recent code change; it was exposed by levels of Azure latency that had not previously been encountered. Troubleshooting was complicated by the intermittent nature of the issue and by limited visibility into Azure’s underlying infrastructure. The issue was escalated to Microsoft, and several Azure teams were engaged. While no single root cause was identified, the investigation revealed multiple planned and unplanned maintenance events in the Azure network during the incident window. These are suspected to have contributed to the latency patterns that triggered the issue in our platform.
Resolution
We implemented a code-level fix in the gateway infrastructure that improves handling of Azure network latency and service unresponsiveness, making the gateways more resilient to external service degradation. The fix was validated under simulated latency conditions and increased load to ensure reliability. Automatic failover mechanisms are in place, and manual actions were used during the incident to minimize disruption. Since implementation, the platform has remained stable.
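For readers interested in the general technique, the following is a minimal, illustrative sketch of one common way to keep a slow or unresponsive dependency from tying up worker threads: a bounded timeout combined with a simple circuit breaker. It is not the actual gateway code. The names `CircuitBreaker` and `call_with_timeout`, and all thresholds, are assumptions chosen for illustration only.

```python
import concurrent.futures
import time


class CircuitBreaker:
    """Hypothetical helper: stops calling a dependency after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds (assumed value)
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        # Closed breaker: calls pass through.
        if self.opened_at is None:
            return True
        # Open breaker: allow a trial call only after the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def call_with_timeout(executor, breaker, func, timeout_s, fallback):
    """Run func on the executor, returning fallback if it is slow or fails."""
    if not breaker.allow():
        return fallback  # fail fast instead of queuing behind a slow service
    future = executor.submit(func)
    try:
        result = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only prevents a not-yet-started task from running;
        # a call already in progress keeps its worker until it returns.
        future.cancel()
        breaker.record_failure()
        return fallback
    except Exception:
        breaker.record_failure()
        return fallback
    breaker.record_success()
    return result
```

In a sketch like this, the caller waits at most `timeout_s` for each request, and once the breaker opens, further calls return the fallback immediately rather than accumulating behind the degraded service.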
Going Forward
We continue to collaborate with Microsoft and follow their escalation protocols for critical service issues. Additional monitoring has been implemented to detect abnormal latency patterns in Azure services and further safeguard against similar disruptions.
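As an illustration of what detecting abnormal latency patterns can look like in general, the sketch below tracks a rolling window of latency samples and flags the window when a high percentile exceeds a threshold. It does not describe our production monitoring. The `LatencyMonitor` class, the window size, the 95th percentile, and the 500 ms threshold are all assumptions for the example.

```python
from collections import deque


class LatencyMonitor:
    """Hypothetical rolling-window latency check (illustrative values only)."""

    def __init__(self, window=100, threshold_ms=500.0, min_samples=20, quantile=0.95):
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically
        self.threshold_ms = threshold_ms
        self.min_samples = min_samples
        self.quantile = quantile

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self):
        # Nearest-rank style percentile over the current window.
        ordered = sorted(self.samples)
        index = min(len(ordered) - 1, int(self.quantile * len(ordered)))
        return ordered[index]

    def is_abnormal(self):
        # Avoid alerting on too little data.
        if len(self.samples) < self.min_samples:
            return False
        return self.percentile() > self.threshold_ms
```

Using a percentile rather than a single sample means one isolated slow request does not raise an alert, while a sustained shift in latency does.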
We appreciate your patience and understanding during this incident.
Posted Apr 24, 2025 - 10:19 CEST