
Aviation Incident Log 2024/07/19: Microsoft device-side and cloud global outage; what happened today was two separate incidents


8th update: 2024/07/20 18:00

What happened?

Microsoft was hit by two things at the same time today. News channels and other media will tend to blend them into one story because both involve Microsoft, but from a technical standpoint Jesse treats them as two independent incidents that happened to land on the same day: the scope of impact, the affected equipment and the way they get fixed are all different.

That said, the two problems could be encountered separately or together; an airport delay today may well be the combined result of both.

  1. Blue-screen crashes: an update file for CrowdStrike's antivirus/EDR software left Windows systems unable to boot normally.
    • Affected: organizations running that company's software, mostly large enterprises.
    • Recovery will take a while, because each machine has to be fixed individually.
  2. Cloud outage: a storage failure in a Microsoft cloud data-center region in the central United States caused a large-scale service disruption.
    • Microsoft has since declared its online services recovered, and the airports report their systems are back to normal operation.
    • Affected: airlines using Navitaire (an Amadeus subsidiary), which runs on Microsoft's cloud; they had to wait for Microsoft to recover and then verify the state of their data before they could resume service.
    • The cloud-side cleanup only wrapped up around 8 p.m. Taiwan time on the 19th, and the software vendor could not start restoring its own services until the cloud was fixed, which is why it dragged on so long.

Incident one, the big one: CrowdStrike

First, who is CrowdStrike? It is currently the leading vendor in endpoint detection and response (EDR); see the Gartner market analysis below. It supplies antivirus and EDR software and services to large enterprises.

  • The faulty update left affected systems unable to boot normally, showing the blue screen of death (BSOD).
  • Known affected organizations (judged from photos posted on social media):
    • Universal Studios Japan (Osaka)
    • Marriott hotel group (unclear whether this was actually the cloud incident)
    • McDonald's outlets overseas
    • Sky News (UK) internal back-office systems
    • Some departure boards at Amsterdam Schiphol airport
    • Some UNIQLO digital signage
    • Some departure boards at Delhi airport

What we know so far is that CrowdStrike has already fixed the underlying problem, but enterprise IT staff have to manually repair every affected device and machine, so recovery will take a long time; the existing automated update mechanisms cannot be used for this.

GARTNER research: the current leading brands in endpoint protection software
  • Because affected machines sit at a BSOD and cannot boot normally, the cleanup is going to be very painful.
  • Our current assessment is that there is no fully automated, large-scale way to push the repair, because today's automated update mechanisms all assume the machine can boot into the operating system; semi-automated workarounds have since appeared.
  • On Windows you have to go in at a very low level, via Windows PE, the Windows RE recovery environment or Safe Mode, to reach the operating system and fix each affected computer one by one. Because most corporate machines are encrypted with BitLocker, you also need the recovery key to get into the system, which adds friction to the procedure. Even a fast fix takes more than fifteen minutes per machine, so quite a few companies will be affected for a while. A quick-fix tool may yet appear, but touching every machine individually is the part that cannot be avoided.

How the fix actually works

Technical note: a quick workaround is circulating online (a scripted version of steps 2 and 3 follows the list below)

  1. Boot Windows into Safe Mode or WinRE (you will need to enter the BitLocker recovery key).
  2. Go to C:\Windows\System32\drivers\CrowdStrike
  3. Locate and delete the file(s) matching "C-00000291*.sys"
  4. Boot normally.
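
If you would rather script steps 2 and 3 than click through them by hand, a minimal PowerShell sketch is below. It assumes you have already booted into Safe Mode or WinRE, that the BitLocker volume has already been unlocked with its recovery key, and that the system volume is mounted as C: (adjust the drive letter if WinRE maps it differently).

    # Minimal sketch of steps 2-3: remove the faulty CrowdStrike channel file.
    # Run from Safe Mode / WinRE after the BitLocker volume has been unlocked.
    $driverDir = 'C:\Windows\System32\drivers\CrowdStrike'
    Get-ChildItem -Path $driverDir -Filter 'C-00000291*.sys' -ErrorAction SilentlyContinue |
        ForEach-Object {
            Write-Host "Removing $($_.FullName)"
            Remove-Item -LiteralPath $_.FullName -Force
        }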

Technical note: a semi-automated approach has also been published

Automated CrowdStrike BSOD Workaround in Safe Mode using Group Policy (github.com)

All affected computers must be joined to the domain; the workaround then runs the PowerShell script from the repository linked above automatically at startup.
Alternatively, use whatever automated deployment mechanism you already have, provided it can boot into Safe Mode and execute there. Since every machine will also prompt for its BitLocker recovery key, a sketch of looking those keys up in Active Directory follows below.
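
Because every machine stops at the BitLocker prompt, help-desk staff end up looking recovery keys up over and over. Assuming your domain escrows BitLocker recovery passwords to Active Directory (common, but not universal), a sketch of that lookup with the RSAT ActiveDirectory module is below; 'PC-0042' is a placeholder computer name.

    # Sketch: read the BitLocker recovery password(s) escrowed in Active Directory
    # for one computer. Needs the RSAT ActiveDirectory module and read access to
    # the msFVE-RecoveryInformation objects under the computer account.
    Import-Module ActiveDirectory

    $computer = Get-ADComputer -Identity 'PC-0042'   # placeholder computer name
    Get-ADObject -SearchBase $computer.DistinguishedName `
                 -Filter 'objectClass -eq "msFVE-RecoveryInformation"' `
                 -Properties 'msFVE-RecoveryPassword', 'whenCreated' |
        Sort-Object whenCreated -Descending |
        Select-Object whenCreated, 'msFVE-RecoveryPassword'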


Trivia: CrowdStrike CEO George Kurtz

2010/04/21: Windows XP machines worldwide crashed because of a faulty update to the McAfee antivirus software.
George Kurtz was McAfee's chief technology officer (CTO) at the time.
News source: Defective McAfee update causes worldwide meltdown of XP PCs | ZDNET

2024/07/19: Windows machines worldwide crashed because of CrowdStrike.
George Kurtz is CrowdStrike's founder, CEO and president.

Fake news: someone posing as a CrowdStrike employee claimed he was the one who pushed the update

"Did this blunderer cause the global Microsoft outage? He posed as an engineer, 'confessed', and fooled a crowd of netizens" (UDN)
https://udn.com/news/story/6811/8108167


Incident two: Microsoft cloud outage takes down airline check-in services

  • Now recovered; systems are operating normally.
  • Microsoft's cloud services suffered an outage on 7/19, beginning around 6 a.m. Taiwan time.
  • Navitaire (an Amadeus subsidiary) runs on Microsoft's cloud. It is unclear exactly how it is deployed or how many Microsoft services it consumes, but its data clearly became inaccessible for a time, and restoring service took a while.
  • Impact on non-airline companies:
    • NVIDIA GeForce NOW outage
    • Nintendo eShop
    • Uber
  • Because many airlines use Amadeus services, quite a few carriers were affected:
    • Europe
      • Berlin airport suspended operations for a time (unclear whether this was actually the antivirus issue from incident one)
      • Lufthansa
    • United States
      • FAA ground stops for United Airlines, Delta Air Lines and American Airlines (per Flightradar24 they have since resumed operations)
      • Frontier Airlines
    • Japan / Korea
      • Japan Airlines
      • Eastar Jet
      • Jeju Air
      • Air Premia
    • Rest of Asia
    • Taiwan
      • Most low-cost carriers use this system:
        • Scoot, Peach, Tigerair Taiwan, HK Express, the Jetstar group, the AirAsia group, Jeju Air, Cebu Pacific
      • China Airlines: airport operations normal, but online check-in was interrupted.
      • EVA Air: services were briefly interrupted from 14:30 to 16:30; airport operations normal.

Flight changes (to and from Taiwan)

So far 19 flights to or from Taiwan have been cancelled, across
5J Cebu Pacific, GK Jetstar, DL Delta Air Lines and UA United Airlines.
The cancellation of CI16 shown on the Taoyuan airport timetable is a schedule adjustment, not part of this event.

  • 0719 5J310 MNL-TPE 2240-0055 cancelled
  • 0720 5J311 TPE-MNL 0215-0450 cancelled
  • 0719 GK11 NRT-TPE 2250-0140 cancelled
  • 0720 GK12 TPE-NRT 0240-0700 cancelled
  • 0719 DL68 TPE-SEA 1725-1405
    Initially shown as cancelled, then for a while delayed to a 10 p.m. departure; around 10 p.m. the official website showed it as cancelled.
  • 0718 UA853 SFO-TPE 2345-0405 cancelled
  • 0720 UA852 TPE-SFO 1210-0905 cancelled
  • HK Express cancelled the following 7/20 flights (for other routes out of Hong Kong, see the notice below)
    • Taipei: UO110/UO111/UO112/UO113
    • Kaohsiung: UO120/UO131/UO132/UO133
    • Taichung: UO172/UO173/UO192/UO193
HK Express notice on 7/20 flight changes
https://static.hkexpress.com/media/1113/hk-express-flight-cancellation-due-to-global-it-outage_20240719_v2.pdf

Who uses Navitaire

Navitaire is an all-cloud solution, so when the cloud went down, everything went down. Most traditional full-service carriers instead use the hybrid-cloud offering from its parent company, Amadeus. The operators using this system are listed below.


Microsoft's investigation report on the cloud outage (Tracking Id: 1K80-N_8)

Source: Azure status history | Microsoft Azure (status.microsoft)

Summary: according to Microsoft, the incident ran from 05:56 to 20:15 Taiwan time on 7/19 (Azure Storage itself was confirmed recovered by 11:23).

Between 21:56 UTC on 18 July 2024 and 12:15 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region, including failures with service management operations and connectivity or service availability. A storage incident impacted the availability of Virtual Machines, which may have also restarted unexpectedly. Services dependent on the affected virtual machines and storage resources would have experienced an impact.


Taiwan time: 7/19 05:56 to 7/19 20:15

What do we know so far?

We determined that a backend cluster management workflow deployed a configuration change causing backend access to be blocked between a subset of Azure Storage clusters and compute resources in the Central US region. This resulted in the compute resources automatically restarting when connectivity was lost to virtual disks hosted on impacted storage resources.

How did we respond?

  • 21:56 UTC, 18 July 2024 (Taiwan 7/19 05:56) – Customer impact began
  • 22:13 UTC, 18 July 2024 (Taiwan 7/19 06:13) – Storage team started investigating
  • 22:41 UTC, 18 July 2024 (Taiwan 7/19 06:41) – Additional teams engaged to assist investigations
  • 23:27 UTC, 18 July 2024 (Taiwan 7/19 07:27) – All deployments in Central US stopped
  • 23:35 UTC, 18 July 2024 (Taiwan 7/19 07:35) – All deployments paused for all regions
  • 00:45 UTC, 19 July 2024 (Taiwan 7/19 08:45) – A configuration change was confirmed as the underlying cause
  • 01:10 UTC, 19 July 2024 (Taiwan 7/19 09:10) – Mitigation started
  • 01:30 UTC, 19 July 2024 (Taiwan 7/19 09:30) – Customers started seeing signs of recovery
  • 02:51 UTC, 19 July 2024 (Taiwan 7/19 10:51) – 99% of all impacted compute resources recovered
  • 03:23 UTC, 19 July 2024 (Taiwan 7/19 11:23) – All Azure Storage clusters confirmed recovery
  • 03:41 UTC, 19 July 2024 (Taiwan 7/19 11:41) – Mitigation confirmed for computing resources
  • 03:41 to 12:15 UTC, 19 July 2024 (Taiwan 7/19 11:41 to 20:15) – Services that were impacted by this outage recovered progressively, and engineers from the respective teams intervened where further manual recovery was needed. Following an extended monitoring period, we determined that impacted services had returned to their expected availability levels.

Update, 08/02

The table above has since been revised. Re-reading the report on August 2, I found the impact was broader than first described: it covered storage plus essentially all of the database services, which is a very large blast radius, and the databases took a great deal of time to recover. Both the duration and the scope are larger than in the summary above.

What happened?

Between 21:40 UTC on 18 July and 22:00 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region, due to an Azure Storage availability event that was resolved by 02:55 UTC on 19 July 2024. This issue affected Virtual Machine (VM) availability, which caused downstream impact to multiple Azure services, including service availability and connectivity issues, and service management failures. Storage scale units hosting Premium v2 and Ultra Disk offerings were not affected.


Services affected by this event included but were not limited to –

  • Active Directory B2C,
  • App Configuration,
  • App Service,
  • Application Insights,
  • Azure Databricks,
  • Azure DevOps,
  • Azure Resource Manager (ARM),
  • Cache for Redis,
  • Chaos Studio,
  • Cognitive Services,
  • Communication Services,
  • Container Registry,
  • Cosmos DB,
  • Data Factory,
  • Database for MariaDB,
  • Database for MySQL-Flexible Server,
  • Database for PostgreSQL-Flexible Server,
  • Entra ID (Azure AD),
  • Event Grid,
  • Event Hubs,
  • IoT Hub,
  • Load Testing,
  • Log Analytics,
  • Microsoft Defender,
  • Microsoft Sentinel,
  • NetApp Files,
  • Service Bus,
  • SignalR Service,
  • SQL Database,
  • SQL Managed Instance,
  • Stream Analytics,
  • Red Hat OpenShift
  • Virtual Machines.

Microsoft cloud services across Microsoft 365, Dynamics 365 and Microsoft Entra were affected as they had dependencies on Azure services impacted during this event.

What went wrong and why?

Virtual Machines with persistent disks utilize disks backed by Azure Storage. As part of security defense-in-depth measures, Storage scale units only accept disk reads and write requests from ranges of network addresses that are known to belong to the physical hosts on which Azure VMs run. As VM hosts are added and removed, this set of addresses changes, and the updated information is published to all Storage scale units in the region as an ‘allow list’. In large regions, these updates typically happen at least once per day. 


On 18 July 2024, due to routine changes to the VM host fleet, an update to the allow list was being generated for publication to Storage scale units. The source information for the list is read from a set of infrastructure file servers and is structured as one file per datacenter. Due to recent changes in the network configuration of those file servers, some became inaccessible from the server which was generating the allow list. The workflow which generates the list did not detect the ‘missing’ source files, and published an allow list with incomplete VM Host address range information to all Storage scale units in the region. This caused Storage servers to reject all VM disk requests from VM hosts for which the information was missing.

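One of the "Completed" repair items listed further down is making this generation workflow detect incomplete source information and halt. Microsoft has not published its internal tooling, so the snippet below is only an illustration of that idea, with hypothetical datacenter names, file paths and publishing step: build a new allow list only if every datacenter's source file is present, otherwise stop and keep the last-known-good list.

    # Illustration only (not Microsoft's tooling): refuse to publish an allow list
    # when any per-datacenter source file is missing, rather than shipping a partial list.
    $expectedDatacenters = @('dc-01', 'dc-02', 'dc-03')        # hypothetical datacenter names
    $sourceRoot          = '\\fileserver\allowlist-sources'    # hypothetical share

    $missing = $expectedDatacenters | Where-Object {
        -not (Test-Path (Join-Path $sourceRoot "$($_).json"))
    }

    if ($missing) {
        Write-Error "Source files missing for: $($missing -join ', '). Keeping last-known-good allow list."
        return
    }

    # ...only now merge the per-datacenter files and publish the new allow list...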

The allow list updates are applied to Storage scale units in batches but deploy through a region over a relatively short time window, generally within an hour. This deployment workflow did not check for drops in VM availability, so continued deploying through the region without following Safe Deployment Practices (SDP) such as Availability Zone sequencing, leading to widespread regional impact.

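The Safe Deployment Practices referred to here boil down to rolling a change out one Availability Zone at a time, waiting, and stopping as soon as a health signal degrades. Again, this is only a sketch of that pattern with hypothetical helpers (Publish-AllowList, Get-VmAvailabilityPercent), not Microsoft's actual workflow.

    # Illustration of an AZ-sequenced, health-gated rollout (hypothetical helper functions).
    $zones             = @('zone-1', 'zone-2', 'zone-3')
    $availabilityFloor = 99.5      # halt if VM availability (%) drops below this

    foreach ($zone in $zones) {
        Publish-AllowList -AvailabilityZone $zone            # hypothetical deployment step
        Start-Sleep -Seconds (4 * 60 * 60)                   # bake time before the next zone

        $vmAvailability = Get-VmAvailabilityPercent -AvailabilityZone $zone   # hypothetical health probe
        if ($vmAvailability -lt $availabilityFloor) {
            Write-Warning "VM availability fell to $vmAvailability% in $zone; halting rollout."
            break
        }
    }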

Azure SQL Database and Managed Instance:

Due to the storage availability failures, the VMs across various control and data plane clusters failed. As a result, the clusters became unhealthy, resulting in failed service management operations as well as connectivity failures for Azure SQL DB and Azure SQL Managed Instance customers in the region. Within an hour of the incident, we initiated failovers for databases with automatic failover policies. As a part of the failover, the geo-secondary is elevated as the new primary. The failover group (FOG) endpoint is updated to point to the new primary. This means that applications that connect through the FOG endpoint (as recommended) would be automatically directed to the new region. While this generally happened automatically, less than 0.5% of the failed-over databases/instances had issues in completing the failover. These databases had to be converted to geo-secondary through manual intervention. During this period, if the application did not route their connection or use FOG endpoints it would have experienced prolonged writes to the old primary. The cause of the issue was failover workflows getting terminated or throttled due to high demand on the service manager component.

After storage recovery in Central US, 98% of databases recovered and resumed normal operations. However, about 2% of databases had prolonged unavailability as they required additional mitigation to ensure gateways redirected traffic to the primary node. This was caused by the metadata information on the gateway nodes being out of sync with the actual placement of the database replicas. 

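The practical takeaway from the failover group (FOG) behaviour described above is to have applications connect through the failover group's listener names rather than a specific server, so that a geo-failover redirects them automatically. A brief sketch follows, using a placeholder group name 'contoso-fog', the documented listener name pattern, and the Az.Sql cmdlet for a customer-initiated failover; treat the exact parameters as something to verify against current documentation.

    # Connect through the failover group listeners (placeholder names), not the server name,
    # so connections follow the primary after a geo-failover.
    $readWrite = 'Server=tcp:contoso-fog.database.windows.net,1433;Database=appdb'
    $readOnly  = 'Server=tcp:contoso-fog.secondary.database.windows.net,1433;Database=appdb;ApplicationIntent=ReadOnly'

    # Customer-managed failover: promote the current secondary server to primary.
    Switch-AzSqlDatabaseFailoverGroup -ResourceGroupName 'rg-app' `
                                      -ServerName 'sql-secondary-server' `
                                      -FailoverGroupName 'contoso-fog'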

Azure Cosmos DB:

Users experienced failed service management operations and connectivity failures because both the control plane and data plane rely on Azure Virtual Machine Scale Sets (VMSS) that use Azure Storage for operating system disks, which were inaccessible. The region-wide success rate of requests to Cosmos DB in Central US dropped to 82% at its lowest point, with about 50% of the VMs running the Cosmos DB service in the region being down. The impact spanned multiple availability zones, and the infrastructure went down progressively over the course of 68 minutes. Impact on the individual Cosmos DB accounts varied depending on the customer database account regional configurations and consistency settings as noted below: 

使用者遇到服務管理操作失敗和連接失敗,因為控制平面和資料平面都依賴於使用 Azure 儲存體作為作業系統磁碟的 Azure 虛擬機器規模集 (VMSS),這些磁碟無法訪問。美國中部對 Cosmos DB 的請求的區域範圍成功率在最低點降至 82%,該區域中運行 Cosmos DB 服務的約 50% 虛擬機器出現故障。影響範圍涵蓋多個可用性區域,基礎設施在 68 分鐘內逐步下降。對個別 Cosmos DB 帳戶的影響根據客戶資料庫帳戶區域配置和一致性設定而有所不同,如下所示:

  • Customer database accounts configured with multi-region writes (i.e. active-active) were not impacted by the incident, and maintained availability for reads and writes by automatically directing traffic to other regions. 
  • Customer database accounts configured with multiple read regions, with single write region outside of the Central US configured with session or lower consistency were not impacted by the incident and maintained availability for reads and writes by directing traffic to other regions. When strong or bounded staleness consistency levels are configured, write requests can be throttled to maintain configured consistency guarantees, impacting availability, until Central US region is put offline for the database account or recovered, unblocking writes. This behavior is expected. 
  • Customer database accounts configured with multiple read regions, with a single write region in the Central US region (i.e. active-passive) maintained read availability but write availability was impacted until accounts were failed over to the other region.
  • Customer database accounts configured with single region (multi-zonal or single zone) in the Central US region were impacted if at least one partition resided on impacted nodes. 

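Of the configurations above, the multi-region-writes (active-active) accounts are the ones that kept both reads and writes available throughout. Below is a sketch of provisioning such an account with the Az.CosmosDB PowerShell module; the resource names are placeholders and the parameter names are given from memory, so double-check them against current documentation before relying on this.

    # Sketch: a Cosmos DB (NoSQL) account with two write regions and session consistency.
    # Placeholder names; verify Az.CosmosDB parameter names before use.
    New-AzCosmosDBAccount -ResourceGroupName 'rg-app' `
                          -Name 'contoso-cosmos' `
                          -ApiKind 'Sql' `
                          -Location @('Central US', 'East US 2') `
                          -EnableMultipleWriteLocations `
                          -DefaultConsistencyLevel 'Session'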

Additionally, some customers observed errors impacting application availability even if the database accounts were available to serve traffic in other regions. Initial investigations of these reports point to client-side timeouts due to connectivity issues observed during the incident. SDKs’ ability to automatically retry read requests in another region, upon request timeout, depends on the timeout configuration. For more details, please refer to https://learn.microsoft.com/azure/cosmos-db/nosql/conceptual-resilient-sdk-applications

Azure DevOps:

The Azure DevOps service experienced impact during this event, where multiple micro-services were impacted. A subset of Azure DevOps customers experienced impact in regions outside of Central US due to some of their data or metadata residing in Central US scale units. Azure DevOps does not offer regional affinity, which means that customers are not tied to a single region within a geography. A deep-dive report specific to the Azure DevOps impact will be published to their dedicated status page, see: https://status.dev.azure.com/_history

Privileged Identity Management (PIM):

Privileged Identity Management (PIM) experienced degradations due to the unavailability of upstream services such as Azure SQL and Cosmos DB, as well as capacity issues. PIM is deployed in multiple regions including Central US, so a subset of customers whose PIM requests are served in the Central US region were impacted. Failover to another healthy region succeeded for SQL and compute, but it took longer for Cosmos DB failover (see Cosmos DB response section for details). The issue was resolved once the failover was completed.

Azure Resource Manager (ARM):

In the United States there was an impact to ARM due to unavailability of dependent services such as Cosmos DB. ARM has a hub model for storage of global state, like subscription metadata. The Central US, West US 3, West Central US, and Mexico Central regions had a backend state in Cosmos DB in Central US. Calls into ARM going to those regions were impacted until the Central US Cosmos DB replicas were marked as offline. ARM’s use of Azure Front Door (AFD) for traffic shaping meant that callers in the United States would have seen intermittent failures if calls were routed to a degraded region. As the regions were partially degraded, health checks did not take them offline. Calls eventually succeeded on retries as they were routed to different regions. Any Central US dependency would have failed throughout the primary incident’s lifetime. During the incident, this caused a wider perceived impact for ARM across multiple regions due to customers in other regions homing resources in Central US.

Azure NetApp:

The Azure NetApp Files (ANF) service was impacted by this event, causing new volume creation attempts to fail in all NetApp regions. The ANF Resource Provider (RP) relies on virtual network data (utilization data) used to decide on the placement of volumes, which is provided by the Azure Dedicated RP (DRP) platform to create new volumes. The Storage issue impacted the data and control plane of several Platform-as-a-Service (PaaS) services used by DRP. This event globally affected ANF because the DRP utilization data’s primary location is in Central US, which could not efficiently failover writes, or redirect reads to replicas in other healthy regions. To recover, the DRP engineering group worked with utilization data engineers to perform administrative failovers to healthy regions to recover the ANF control plane. However, by the time the failover attempt could be made, the Storage service recovered in the region and the ANF service recovered on its own by 04:16 UTC on 19 July 2024.

How did we respond?

As the allow list update was being published in the Central US region, our service monitoring began to detect VM availability dropping, and our engineering teams were engaged. Due to the widespread impact, and the primary symptom initially appearing to be a drop in VM disk traffic to Storage scale units, it took time to rule out other possible causes and identify the incomplete storage allow list as the trigger of these issues. 

Once correlated, we halted the allow list update workflow worldwide, and our engineering team updated configurations on all Storage scale units in the Central US region to restore availability, which was completed at 02:55 UTC on 19 July. Due to the scale of failures, downstream services took additional time to recover following this mitigation of the underlying Storage issue.

Azure SQL Database and Managed Instance:

Within one minute of SQL unavailability, SQL monitoring detected unhealthy nodes and login failures in the region. Investigation and mitigation workstreams were established, and customers were advised to consider putting into action their disaster recovery (DR) strategies. While we recommend customers manage their failovers, for 0.01% of databases Microsoft initiated the failovers as authorized by customers. 

80% of the SQL databases became available within two hours of storage recovery, and 98% were available over the next three hours. Less than 2% required additional mitigations to achieve availability. We restarted gateway nodes to refresh the caches and ensure connections were being routed to the right nodes, and we forced completion of failovers that had not completed.

Azure Cosmos DB:

To mitigate impacted multi-region active-passive accounts with write region in the Central US region, we initiated failover of the control plane right after impact started and completed failover of customer accounts at 22:48 UTC on 18 July 2024, 34 minutes after impact detected. On average, failover of individual accounts took approximately 15 minutes. 95% of failovers were completed without additional mitigations, completing at 02:29 UTC on 19 July 2024, 4 hours 15 minutes after impact detected. The remaining 5% of database accounts required additional mitigations to complete failovers. We cancelled “graceful switch region” operations triggered by customers via the Azure portal, where this was preventing service-managed failover triggered by Microsoft to complete. We also force completed failovers that did not complete, for database accounts that had a long-running control operation with a lock on the service metadata, by removing the lock.  

As storage recovery initiated and backend nodes started to come online at various times, Cosmos DB declared impacted partitions as unhealthy. A second workstream in parallel with failovers focused on repairs of impacted partitions. For customer database accounts that stayed in the Central US region (single region accounts) availability was restored to >99.9% by 09:41 UTC on, with all databases impact mitigated by approximately 19:30 UTC.

As availability of impacted customer accounts was being restored, a third workstream focused on repairs of the backend nodes required prior to initiating failback for multi-region accounts. Failback for database accounts that were previously failed-over by Microsoft, started at 08:51 UTC, as repairs progressed, and continued. During failback, we brought the Central US region online as a read region for the database accounts, then customers could switch write region to Central US if and when desired. 

A subset of customer database accounts encountered issues during failback that delayed their return to Central US. These issues were addressed but required a redo of the failback by Microsoft. Firstly, a subset of MongoDB API database accounts accessed by certain versions of MongoDB drivers experienced intermittent connectivity issues during failback, which required us to redo the failback in coordination with customers. Secondly, a subset of database accounts with private endpoints after failback to Central US experienced issues connecting to Central US, requiring us to redo the failback.

Detailed timeline of events:

  • 21:40 UTC on 18 July 2024 – Customer impact began.
  • 22:06 UTC on 18 July 2024 – Service monitoring detected drop in VM availability.
  • 22:09 UTC on 18 July 2024 – Initial targeted messaging sent to a subset of customers via Service Health (Azure Portal) as services began to become unhealthy.
  • 22:09 UTC on 18 July 2024 – Customer impact for Cosmos DB began.
  • 22:13 UTC on 18 July 2024 – Customer impact for Azure SQL DB began.
  • 22:14 UTC on 18 July 2024 – Monitoring detected availability drop for Cosmos DB, SQL DB and SQL DB Managed Instance.
  • 22:14 UTC on 18 July 2024 – Cosmos DB control plane failover was initiated.
  • 22:30 UTC on 18 July 2024 – SQL DB impact was correlated to the Storage incident under investigation.
  • 22:45 UTC on 18 July 2024 – Deployment of the incomplete allow list completed, and VM availability in the region reaches the lowest level experienced during the incident
  • 22:48 UTC on 18 July 2024 – Cosmos DB control plane failover out of the Central US region completed, initiated service managed failover for impacted active-passive multi-region customer databases.
  • 22:56 UTC on 18 July 2024 – Initial public Status Page banner posted, investigating alerts in the Central US region.
  • 23:27 UTC on 18 July 2024 – All deployments in the Central US region were paused.
  • 23:27 UTC on 18 July 2024 – Initial broad notifications sent via Service Health for known services impacted at the time. 
  • 23:35 UTC on 18 July 2024 – All compute buildout deployments paused for all regions.
  • 00:15 UTC on 19 July 2024 – Azure SQL DB and SQL Managed Instance Geo-failover completed for databases with failover group policy set to Microsoft Managed.
  • 00:45 UTC on 19 July 2024 – Partial storage ‘allow list’ confirmed as the underlying cause. 
  • 00:50 UTC on 19 July 2024 – Control plane availability improving on Azure Resource Manager (ARM).
  • 01:10 UTC on 19 July 2024 – Azure Storage began updating storage scale unit configurations to restore availability. 
  • 01:30 UTC on 19 July 2024 – Customers and downstream services began seeing signs of recovery.
  • 02:29 UTC on 19 July 2024 – 95% Cosmos DB account failovers completed.
  • 02:30 UTC on 19 July 2024 – Azure SQL DB and SQL Managed Instance databases started recovering as the mitigation process for the underlying Storage incident was progressing.
  • 02:51 UTC on 19 July 2024 – 99% of all impacted compute resources had recovered.
  • 02:55 UTC on 19 July 2024 – Updated configuration completed on all Storage scale units in the Central US region, restoring availability of all Azure Storage scale units. Downstream service recovery and restoration of isolated customer reported issues continue.
  • 05:57 UTC on 19 July 2024 – Cosmos DB availability (% requests succeeded) in the region sustained recovery to >99%.
  • 08:00 UTC on 19 July 2024 – 98% of Azure SQL databases had been recovered.  
  • 08:15 UTC on 19 July 2024 – SQL DB team identified additional issues with gateway nodes as well as failovers that had not completed. 
  • 08:51 UTC on 19 July 2024 – Cosmos DB started to failback database accounts that were failed over by Microsoft, as Central US infrastructure repair progress allowed.
  • 09:15 UTC on 19 July 2024 – SQL DB team started applying mitigations for the impacted databases.
  • 09:41 UTC on 19 July 2024 – Cosmos DB availability (% requests succeeded) in the Central US region sustained recovery to >99.9%.
  • 15:00 UTC on 19 July 2024 – SQL DB team forced completion of the incomplete failovers. 
  • 18:00 UTC on 19 July 2024 – SQL DB team completed all gateway node restarts.
  • 19:30 UTC on 19 July 2024 – Cosmos DB mitigated all databases.
  • 20:00 UTC on 19 July 2024 – SQL DB team completed additional verifications to ensure all impacted databases in the region were in the expected states.
  • 22:00 UTC on 19 July 2024 – SQL DB and SQL Managed Instance issue was mitigated, and all databases were verified as recovered.

How are we making incidents like this less likely or less impactful?

  • Storage: Fix the allow list generation workflow, to detect incomplete source information and halt. (Completed)
  • Storage: Add alerting for requests to storage being rejected by ‘allow list’ checks. (Completed)
  • Storage: Change ‘allow list’ deployment flow to serialize by Availability Zones and storage types, and increase deployment period to 24 hours. (Completed)
  • Storage: Add additional VM health checks and auto-stop in the allow list deployment workflow. (Estimated completion: July 2024)
  • SQL: Reevaluate the policy to initiate Microsoft managed failover of SQL failover groups. Reiterate recommendation for customers to manage their failovers. (Estimated completion: August 2024)
  • Cosmos DB: Improve fail-back workflow affecting a subset of MongoDB API customers causing certain versions of MongoDB drivers to fail in connection to all regions. (Estimated completion: August 2024)
  • Cosmos DB: Improve the fail-back workflow for database accounts with private endpoints experiencing connectivity issues to Central US after failback, enabling successful failback without requiring Microsoft to redo the process. (Estimated completion: August 2024)
  • Storage: Storage data-plane firewall evaluation will detect invalid allow list deployments, and continue to use last-known-good state. (Estimated completion: September 2024)
  • Azure NetApp Files: Improve the logic of several monitors to ensure timely detection and appropriate classification of impacting events. (Estimated completion: September 2024)
  • Azure NetApp Files: Additional monitoring of several service metrics to help detect similar issues and correlate events more quickly. (Estimated completion: September 2024)
  • SQL: Improve Service Fabric cluster location change notification mechanism’s reliability under load. (Estimated completion: in phases starting October 2024)
  • SQL: Improve robustness of geo-failover workflows, to address completion issues. (Estimated completion: in phases starting October 2024)
  • Cosmos DB: Eliminate issues that caused delay for the 5% of failovers. (Estimated completion: November 2024)
  • Azure DevOps: Working to ensure that all customer metadata is migrated to the appropriate geography, to help limit multi-geography impact. (Estimated completion: January 2025)
  • Azure NetApp Files: Decouple regional read/writes, to help reduce the blast radius to single region for this class of issue. (Estimated completion: January 2025)
  • Azure NetApp Files: Evaluate the use of caching to reduce reliance on utilization data persisted in stores, to help harden service resilience for similar scenarios. (Estimated completion: January 2025)
  • Cosmos DB: Adding automatic per-partition failover for multi-region active-passive accounts, to expedite incident mitigation by automatically handling affected partitions. (Estimated completion: March 2025)
  • SQL and Cosmos DB: Azure Virtual Machines is working on the Resilient Ephemeral OS disk improvement, which improves VM resilience to Storage incidents. (Estimated completion: May 2025)

Statements from the companies involved

Microsoft cloud status update

Microsoft's statement: we are aware of this issue and multiple teams are engaged. We have determined the underlying cause: a backend cluster management workflow deployed a configuration change that blocked backend access between a subset of Azure Storage clusters and compute resources in the Central US region, causing compute resources to restart automatically when connectivity to their virtual disks was lost. All Azure Storage clusters have been mitigated and most services have now recovered; a small subset of services still sees residual impact. Affected customers will continue to be kept informed through the Azure Service Health portal.

Taoyuan Airport press release: the airport company said today (the 19th) that, because of the global Microsoft system incident, some airlines at Taoyuan International Airport that use the Microsoft-cloud-based Navitaire system (an Amadeus subsidiary) were unable to run departure check-in on their computers. Affected carriers, including the AirAsia group, Tigerair Taiwan, the Jetstar group, HK Express, Jeju Air and Scoot, have switched to manual check-in; Delta Air Lines and United Airlines were also affected and temporarily halted departures from outstations. Passengers should contact their airline for details.

China Airlines statement: China Airlines has been notified by its system vendor Amadeus that some reservation functions are affected and under maintenance; airport check-in and boarding systems are currently operating normally. We apologize for the inconvenience.

Tigerair Taiwan statements:

2024/07/20: All Navitaire systems have returned to normal service.
Our reservation system provider (Navitaire) has restored normal service. Ticket purchase, booking management, flight search, the call center and airport counter check-in are all operating normally again. Passengers on Tigerair Taiwan flights may arrive at the airport at the usual time. Tigerair Taiwan apologizes for the inconvenience caused by the earlier system failure and thanks you again for your patience and understanding.

2024/07/19: We were notified by Navitaire that some services for airlines using the Navitaire system, including ticket purchase, booking management, flight search and the call center, are affected and temporarily suspended; a recovery time has not yet been announced. Flight operations and check-in at the airports are still running normally, but because airports are currently checking passengers in manually, flights may be delayed; we recommend arriving at the airport 3-4 hours before departure to check in. We apologize for the inconvenience.

Cathay Pacific notice: Cathay Pacific's self-service check-in facilities at Hong Kong International Airport have resumed operation; the service was suspended earlier because of an unexpected issue at a technology vendor. Meanwhile, the facial-recognition function at the airport's baggage drop facilities remains unavailable; if you are traveling with checked baggage, please allow extra time for check-in.

Microsoft service disruption notices

Microsoft's office suite, Microsoft 365, was also heavily affected
Source: Downdetector



About this series

Jesse Explains the News: Jesse is often interviewed by TV news channels (or quizzed by friends in the community), but news segments run to roughly two minutes, which is never enough to lay out the background of an event properly. This is where those ten seconds get expanded into a full article. I fed my writing style into an AI and asked it to analyze it; here is its breakdown.

  • Concise and clear: gets straight to the point, conveys the key facts in short sentences, and organizes information as bullet points for easier reading.
  • Objective reporting style: states facts in a neutral tone without injecting much personal opinion.
  • Background provided: briefly introduces the relevant history and supplies numbers for reference.
  • Comparative analysis: adds depth by comparing similar situations elsewhere, as in the "same as Kyoto" passages.
  • Clear timelines: lays out implementation schedules explicitly.
  • Global perspective: mentions similar issues in multiple countries, showing attention to worldwide trends.
  • Minimal embellishment: direct, plain language without excessive adjectives or qualifiers.

