Intelligence Base: Auto Partitioning

Ever wondered how Auto Partitioning actually picks a partitioning method for us? It is worth knowing, so that Auto Partitioning does not fool us into producing wrong data.

If we choose Auto Partitioning, DataStage selects the partitioning method automatically, basing its choice on producing the correct result for the stage. Specifically:

If it is the very first stage of the job, Auto means Round Robin when going Sequential-to-Parallel, or Same when going Parallel-to-Parallel
It picks Hash in stages that need to match key values, e.g. Join, Merge, Remove Duplicates
It picks Entire on the Lookup reference link, which is a poor fit for MPP/cluster because it wastes memory
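
To make the three methods above concrete, here is a minimal Python sketch of what each one does to a list of rows. It is only an illustration: the function names (round_robin, hash_partition, entire) are made up for this post, and the CRC32-based hash is an assumption standing in for whatever hash DataStage uses internally.

```python
import zlib
from typing import Any, Dict, List

Row = Dict[str, Any]


def round_robin(rows: List[Row], n: int) -> List[List[Row]]:
    """Deal rows out one by one, like cards: row i goes to partition i % n."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows: List[Row], key: str, n: int) -> List[List[Row]]:
    """Send every row with the same key value to the same partition.

    CRC32 is an illustrative stand-in, not DataStage's real hash function.
    """
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % n
        parts[bucket].append(row)
    return parts


def entire(rows: List[Row], n: int) -> List[List[Row]]:
    """Give every partition its own full copy of the data set."""
    return [list(rows) for _ in range(n)]
```

Note how entire keeps n full copies of the reference data, which is where the memory cost comes from when each partition sits on a separate node with its own memory.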

Since DataStage has no idea what our data and business rules look like, we should specify Hash partitioning ourselves in cases like these:

Before Sort and Aggregator stages, use Hash Partitioning
When processing requires groups of related records
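
A small Python sketch shows why grouping needs Hash. It sums an amount per customer separately inside each partition, the way a parallel Aggregator would. Everything here is hypothetical and for illustration only: the column names cust and amt, the helper names, and the CRC32 hash are assumptions, not DataStage internals.

```python
import zlib
from collections import defaultdict


def round_robin(rows, n):
    """Row i goes to partition i % n, regardless of its key."""
    return [rows[i::n] for i in range(n)]


def hash_partition(rows, key, n):
    """Rows with the same key value always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(str(row[key]).encode("utf-8")) % n].append(row)
    return parts


def aggregate_per_partition(partitions, key, value):
    """Sum `value` per `key` independently in each partition, then collect."""
    out = []
    for part in partitions:
        totals = defaultdict(int)
        for row in part:
            totals[row[key]] += row[value]
        out.extend(totals.items())
    return out


rows = [
    {"cust": "A", "amt": 10},
    {"cust": "A", "amt": 7},
    {"cust": "B", "amt": 5},
    {"cust": "B", "amt": 1},
]

# Round Robin splits each customer across both partitions,
# so every customer comes out as two partial sums.
rr_result = aggregate_per_partition(round_robin(rows, 2), "cust", "amt")

# Hash keeps each customer in one partition, so there is one total each.
hash_result = aggregate_per_partition(hash_partition(rows, "cust", 2), "cust", "amt")
```

With this input, rr_result holds four partial rows (A and B each appear twice), while hash_result holds exactly one total per customer. The job runs without error either way, which is exactly why the wrong partitioning is easy to miss.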

Sometimes DataStage also likes to add re-partition steps that are not needed. The only way to find out is to read the Job Score.

10 April 2010
