Agent app slow and not loading

Incident Report for ROGER365.io Platform Status Page

Resolved

Summary
Between April 4 and April 19, users in the European region experienced intermittent performance degradation and connection issues when using the ROGER365.io Teams app. These disruptions included slow page loading, call routing delays, and temporary disconnections during gateway failovers.

Timeline & Impact
April 4, 2025 (11:05–11:31 CEST): Latency on required Azure services exceeded acceptable thresholds, triggering a gateway-side issue that caused slowness and partial call disruption. Traffic was automatically redirected, resulting in brief call interruptions for some users.

April 7, 2025 (10:00–10:12 CEST): A similar spike in latency again triggered the issue, affecting a portion of users and causing temporary service degradation.

April 14, 2025 (12:10–12:48 CEST): Elevated network latency caused the primary gateway infrastructure to stop responding (a gateway hold). The platform automatically failed over to backup infrastructure, causing temporary disruption for all users.

April 15, 2025 (12:58 CEST): Our team confirmed that the issue originated from latency and timeouts in essential Azure services. We engaged Microsoft, escalated the case, and worked closely with their teams throughout the investigation.

April 16, 2025 (08:08–08:10 CEST and 14:05–15:00 CEST): Additional instability triggered several further gateway holds. We proactively failed over our gateways and databases to other Azure regions to reduce our exposure to the impacted Azure environment. One of these failovers was performed manually as a precautionary measure. Microsoft continued their internal investigation, taking these new events into account.

April 16, 2025 (20:30 CEST): A software fix was deployed to the gateway infrastructure to improve resilience against slow or unresponsive services.

April 17–18, 2025: Following the fix, the platform handled further Azure network instability and a Microsoft Support-initiated database failover without issue. This confirmed the effectiveness of the changes, which had been validated under simulated latency and load conditions. Microsoft continued their internal investigation.

April 19, 2025 (05:30 CEST): All systems successfully transitioned back to the original Azure region.

April 17–24, 2025: No new disruptions were observed after the fix was deployed on April 16. Monitoring confirmed the stability and effectiveness of the fix under varying load and network conditions.

Root Cause
The incident was triggered by elevated latency and timeouts in required Azure services outside the ROGER365 domain. This affected the gateway infrastructure’s ability to communicate with key platform components within Azure and revealed a threading-related issue in the gateway logic. The issue was not introduced by a recent code change; it was exposed by levels of Azure latency that had not previously been encountered. Troubleshooting was complicated by the intermittent nature of the issue and by limited visibility into Azure’s underlying infrastructure. The issue was escalated to Microsoft, and several Azure teams were engaged. While no single root cause was identified, the investigation revealed multiple planned and unplanned maintenance events in the Azure network during the incident window. These are suspected to have contributed to the latency patterns that triggered the issue in our platform.

Resolution
We implemented a code-level fix to the gateway infrastructure to improve handling of Azure network latency and service unresponsiveness. Gateway code has been enhanced for greater resilience against external service degradation. Fixes were validated under simulated latency conditions and increased load to ensure reliability. Automatic failover mechanisms are in place, and manual actions were used during the incident to minimize disruption. Since implementation, the platform has remained stable.
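As an illustration only, the general technique of bounding calls to a slow or unresponsive dependency can be sketched as follows. This is not the ROGER365.io gateway code; the function names, the pool size, and the timeout values are all hypothetical. The sketch runs a call in a worker thread and returns a fallback value if the call does not finish within a deadline, so the caller is never blocked indefinitely.

```python
import concurrent.futures
import time

# Hypothetical sketch; names and values are assumptions, not the actual fix.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it exceeds timeout_s."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only prevents a queued call from starting; a call that is
        # already running keeps its worker thread until it returns.
        future.cancel()
        return fallback


def slow_dependency():
    time.sleep(0.5)  # simulates an unresponsive external service
    return "fresh"


def fast_dependency():
    return "fresh"


print(call_with_deadline(fast_dependency, 1.0, "cached"))
print(call_with_deadline(slow_dependency, 0.05, "cached"))
```

Because a timed-out call still occupies its worker thread until it returns, a deadline in the caller works best alongside a timeout configured in the client library for the dependency itself, so that slow calls cannot accumulate and exhaust the pool.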

Going Forward
We continue to collaborate with Microsoft and follow their escalation protocols for critical service issues. Additional monitoring has been implemented to detect abnormal latency patterns in Azure services and further safeguard against similar disruptions.
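One possible way to detect abnormal latency patterns is sketched below, purely as an illustration. It is not a description of the monitoring that was deployed; the class name, the window size, and the 500 ms threshold are hypothetical. The sketch keeps a rolling window of recent latency samples and flags the window as abnormal when its 95th percentile exceeds a threshold.

```python
from collections import deque
from statistics import quantiles


class LatencyMonitor:
    """Hypothetical rolling-window monitor; all thresholds are assumptions."""

    def __init__(self, window=100, p95_threshold_ms=500.0, min_samples=20):
        self.samples = deque(maxlen=window)  # oldest samples drop off
        self.p95_threshold_ms = p95_threshold_ms
        self.min_samples = min_samples

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p95(self):
        # quantiles(n=20) returns 19 cut points; the last one is the
        # 95th percentile of the samples currently in the window.
        return quantiles(self.samples, n=20)[-1]

    def is_abnormal(self):
        # Avoid alerting on too little data.
        if len(self.samples) < self.min_samples:
            return False
        return self.p95() > self.p95_threshold_ms


monitor = LatencyMonitor(window=50)
for _ in range(30):
    monitor.record(40.0)   # normal latency
print(monitor.is_abnormal())
for _ in range(30):
    monitor.record(900.0)  # sustained spike
print(monitor.is_abnormal())
```

A percentile over a window is used here rather than a single-sample threshold so that one slow request does not raise an alert, while a sustained spike does.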

We appreciate your patience and understanding during this incident.

Posted Apr 24, 2025 - 10:19 CEST

Update

Over the past few days, we've worked closely with Microsoft Support to investigate intermittent latency affecting Azure Services. Our joint investigation suggests the issue is likely related to Azure network performance.

To reduce the impact, we've implemented code improvements to mitigate the latency while continuing to press Microsoft for a root cause and resolution.

Additionally, we observed unplanned maintenance by Microsoft over the weekend, which may have a positive effect on the issue.

Posted Apr 22, 2025 - 09:49 CEST

Update

From 14:45 to 15:00 (UTC+2) we again had issues with the Azure component. Because the network in the West Europe region is suspected of having an issue, we failed over to North Europe and will keep the platform there for the time being to minimize the impact on the system.

Posted Apr 16, 2025 - 15:26 CEST

Update

Our investigation indicates that the issue is related to an underlying Azure component.
Our team is actively working in close collaboration with Microsoft to resolve this as quickly as possible.

The mitigation we implemented yesterday at 12:45 CEST has ensured that customers can continue to work as expected. We are still closely monitoring the platform to ensure continued stability and will provide updates as soon as more information is available.

Posted Apr 15, 2025 - 12:58 CEST

Update

We are actively working on a permanent fix for this issue in close collaboration with Microsoft.

The mitigation we implemented yesterday at 12:45 CEST has ensured that customers can continue to work as expected. We are closely monitoring the platform to ensure continued stability and will provide updates as soon as more information is available.

Posted Apr 15, 2025 - 10:01 CEST

Update

We are continuing to work on a fix for this issue.

Posted Apr 14, 2025 - 14:32 CEST

Identified

We have identified that the issue is with the incoming gateway, which handles all incoming and outgoing traffic. We are still analyzing the logs and dumps to find the root cause of the issue.

Posted Apr 14, 2025 - 13:16 CEST

Update

We have mitigated the issue by performing a manual failover. We will investigate further to determine the root cause of the issue.

Posted Apr 14, 2025 - 12:50 CEST

Investigating

There seems to be an issue with loading the agent apps and with slow responses on call routing. We are investigating this issue with the highest priority. The issue started at 12:15 (UTC+2).

Posted Apr 14, 2025 - 12:15 CEST

This incident affected: ROGER365.io Contact Center (Voice Routing Engine, Agent App).