Microsoft open sources Data Accelerator, its in-house tool for building Spark streaming data pipelines

News from: iThome & Microsoft Azure Web Site.

Data Accelerator can infer the schema of incoming events, modify the events according to user-configured rules, and then write the data to output sinks.

Web site: https://github.com/Microsoft/Data-Accelerator
Web site: https://azure.microsoft.com/zh-tw/blog/microsoft-open-sources-data-accelerator-an-easy-to-configure-pipeline-for-streaming-at-scale/

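Microsoft has open sourced Data Accelerator, a big data project originally built for internal use. It handles large-scale data processing, simplifies streaming on Apache Spark, supports SQL and real-time queries, and lets users set up processing rules and alerts without writing code. In development since 2017, it has been deployed at scale across the pipelines of various Microsoft products and is now open source on GitHub.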



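Microsoft began developing Data Accelerator in 2017 to process streaming data from multiple sources, recombine it, and route it to different output sinks for later analysis. According to Microsoft, normalization is a heavy burden in this process: in an environment of heterogeneous events, capturing and tuning event parsers takes considerable time and resources.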

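Data Accelerator simplifies this work by inferring the structure of the data from sample events and writing events from the stream out to a variety of data stores. Microsoft notes that Data Accelerator can serve not only as a pipeline between the event ingestion service Event Hubs and databases, but can also reshape incoming events while streaming and route different parts of the same event to different databases.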
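
To make the idea concrete, here is a minimal plain-Python sketch of inferring a schema from sample events and routing different parts of one event to different stores. It is not Data Accelerator's implementation or API: the function names `infer_schema` and `route_event`, the field names, and the sink names are all hypothetical and chosen only for illustration.

```python
def infer_schema(samples):
    """Infer a flat field -> type-name mapping from sample events (dicts).

    Hypothetical helper for illustration; not part of Data Accelerator.
    """
    schema = {}
    for event in samples:
        for key, value in event.items():
            type_name = type(value).__name__
            # If samples disagree on a field's type, mark it as "mixed".
            if schema.get(key, type_name) != type_name:
                schema[key] = "mixed"
            else:
                schema[key] = type_name
    return schema


def route_event(event, routes):
    """Split one event into per-sink records.

    `routes` maps a (hypothetical) sink name to the fields that sink receives.
    Sinks that would receive no fields are omitted from the result.
    """
    routed = {}
    for sink, fields in routes.items():
        record = {name: event[name] for name in fields if name in event}
        if record:
            routed[sink] = record
    return routed


samples = [
    {"deviceId": "a1", "temperature": 21.5},
    {"deviceId": "a2", "temperature": 22.0, "status": "ok"},
]
schema = infer_schema(samples)
routed = route_event(
    samples[1],
    {"metrics_db": ["deviceId", "temperature"], "status_db": ["deviceId", "status"]},
)
```

Here the second sample event ends up as two records, one per sink, each carrying only the fields that sink asked for.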


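Data Accelerator greatly speeds up building streaming pipelines on Spark. With its simple plug-and-play design, users only need to configure an input source and output sinks, and a pipeline can be up in minutes. Data Accelerator supports reading data from Event Hubs and IoT Hub and writing it to services such as Azure Blob storage, Cosmos DB, and Event Hubs.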

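By combining events with their schema, Data Accelerator can identify events as they flow through the pipeline and modify, split, or merge them, or even drop the parts of an event that are not needed. It provides a configuration UI along with easy-to-use query and rule designers, so users can set up alerts or data processing rules without writing any code. Data Accelerator also supports complex processing of streaming data: whether processing data over sliding time windows or accumulating data over time, users can work with these advanced features in a simple way.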
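
The two streaming patterns mentioned above, sliding time windows and accumulation over time, can be sketched in plain Python as follows. This only illustrates the concepts and is not how Data Accelerator or Spark implements them; the function names and the `(timestamp, value)` event shape are assumptions made for the example.

```python
from collections import deque
from itertools import accumulate


def sliding_window_sums(events, window_seconds):
    """For each (timestamp, value) event, sum the values whose timestamps
    fall in the window (timestamp - window_seconds, timestamp].

    Assumes events are already sorted by timestamp.
    """
    window = deque()
    total = 0.0
    sums = []
    for timestamp, value in events:
        window.append((timestamp, value))
        total += value
        # Evict events that have fallen out of the window.
        while window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        sums.append((timestamp, total))
    return sums


def running_totals(values):
    """Accumulate values over time: element i is the sum of values[0..i]."""
    return list(accumulate(values))


events = [(0, 1.0), (5, 2.0), (12, 3.0), (14, 4.0)]
window_sums = sliding_window_sums(events, window_seconds=10)
totals = running_totals([1, 2, 3])
```

The deque keeps only the events still inside the window, so each event is added and evicted at most once.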

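Microsoft says that Data Accelerator supports a fast dev-test validation loop, so an event query can be iterated on until it works before it is deployed, which saves a great deal of time spent testing pipeline processing. Data Accelerator also supports SQL queries, so users can run complex queries with SQL alone, without using Scala.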
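
As a rough illustration of this loop, the sketch below runs a SQL query against a handful of sampled events locally before anything is deployed. It uses Python's built-in `sqlite3` module purely as a stand-in for Spark SQL; the table name `events`, its columns, the alert threshold, and the helper `run_query_on_sample` are all assumptions made for the example, not part of Data Accelerator.

```python
import sqlite3


def run_query_on_sample(sample_events, query):
    """Load sampled events into an in-memory table and run `query` on them.

    sqlite3 stands in for Spark SQL here; the `events` table layout is
    hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (deviceId TEXT, temperature REAL)")
        conn.executemany(
            "INSERT INTO events VALUES (:deviceId, :temperature)", sample_events
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# A SQL-only alert rule: devices whose maximum temperature exceeds 30.
ALERT_QUERY = """
SELECT deviceId, MAX(temperature) AS maxTemperature
FROM events
GROUP BY deviceId
HAVING MAX(temperature) > 30
ORDER BY deviceId
"""

sample = [
    {"deviceId": "a1", "temperature": 21.5},
    {"deviceId": "a2", "temperature": 35.0},
    {"deviceId": "a2", "temperature": 28.0},
]
alerts = run_query_on_sample(sample, ALERT_QUERY)
```

Because the query runs locally against a small sample, a syntax mistake shows up immediately as an `sqlite3.OperationalError` rather than after a cluster job has been submitted.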

---------------------------------------------------------------



Microsoft open sources Data Accelerator, an easy-to-configure pipeline for streaming at scale

 Principal Program Manager, Microsoft

This blog post was co-authored by Dinesh Chandnani, Principal Group Engineering Manager, Microsoft.
Standing up a data pipeline for the first time can be a challenge and decisions you make at the start of a project can limit your choices long after the initial deployment has been rolled out. Often what is needed is a playground in which to learn about and evaluate the available options and capabilities in the solution space. To that end, we are excited to be announcing that an internal Microsoft project known as Data Accelerator is now being open sourced.
Data Accelerator started in 2017 as a large-scale data processing project in Microsoft’s Developer Division that eventually landed on streaming on Apache Spark for reasons of scale and speed. The pipeline today operates at Microsoft scale.
Some of the reasons we think it will have value to the wider community:
  • Fast Dev-Test loop: Events can be sampled to support local execution of queries, short-circuiting the wait and delay of submitting your job to the cluster only for it to fail seven minutes later due to a misplaced semicolon.
  • One-box deployment for local testing and discovery: Learn before you commit to a prototype.
  • Designer-based rules and query building: Stand up an end-to-end ETL pipeline without writing any code or dig right into the details.
  • Time-windowing, reference data, and output capabilities added to SQL-Spark syntax: Keyword extensions to SQL-Spark syntax avoid the complexity and error-prone management of these common tasks.
The Developer Division of Microsoft is using Data Accelerator in production every day and will continue to make improvements in the toolchain over time, but we recognize the toolset could do many more things given the need. We hope that by opening this project some of you will find Data Accelerator even more helpful.
To learn more about the open sourcing of Data Accelerator visit the announcement on the Open Source blog.

