Guidelines for Creating Indexes and Choosing a Sharding Strategy in MongoDB

Choose well and life is good; choose badly and it is hard to turn back.

Published in iamgoangle, Jul 18, 2019

A confession: I am writing this post to review, and to remind my future self about, index creation and shard key design in MongoDB, because I forget! I forget which option fits which situation and under what conditions to use it. Going back to change these choices is hard, and when you build a system that has to scale from day one, changing your mind later may already be too late.

I assume many readers already know why a database system needs indexes, and some are familiar with distributing data across servers, that is, sharding. If I have missed something or explained it wrong, please fill in the gaps in the comments.

All right, let's get to know each index type first, and then get comfortable with sharding.

Index

A world without indexes

Without an index there is no map to follow. MongoDB scans the documents one by one until it finds what it is looking for (a collection scan). If you are lucky and the match is among the first documents, the search is quick; if it happens to be the last document, the search takes a long time.

With an index

An index lets MongoDB reach the matching documents almost immediately, going straight to the target without detours. For example, to find documents with score < 30, it uses a divide-and-conquer strategy: it narrows the search to one range of the ordered data instead of looking everywhere, which makes the lookup much faster.
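To make the difference concrete, here is a minimal sketch in plain JavaScript. It is not how MongoDB is implemented: the in-memory `docs` array, the sorted `scoreIndex` array, and both function names are made up for illustration, and a sorted array with binary search stands in for the B-tree.

```javascript
// Hypothetical in-memory "collection": 1,000 documents with distinct scores 0..999.
const docs = Array.from({ length: 1000 }, (_, i) => ({ _id: i, score: (i * 37) % 1000 }));

// Without an index: every document has to be examined (a collection scan).
function collectionScan(limit) {
  let examined = 0;
  const result = [];
  for (const doc of docs) {
    examined += 1;
    if (doc.score < limit) result.push(doc._id);
  }
  return { result, examined };
}

// A toy "index": [score, _id] pairs kept sorted by score.
const scoreIndex = docs.map((d) => [d.score, d._id]).sort((a, b) => a[0] - b[0]);

// With the index: binary-search for the first entry with score >= limit;
// every entry to the left of that position matches score < limit.
function indexScan(limit) {
  let lo = 0;
  let hi = scoreIndex.length;
  let steps = 0;
  while (lo < hi) {
    steps += 1;
    const mid = (lo + hi) >>> 1;
    if (scoreIndex[mid][0] < limit) lo = mid + 1;
    else hi = mid;
  }
  return { result: scoreIndex.slice(0, lo).map(([, id]) => id), steps };
}

console.log(collectionScan(30).examined, indexScan(30).steps);
```

Both functions return the same documents, but the scan examines all 1,000 documents while the binary search needs only a handful of comparisons to find the boundary of the range.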

P.S. MongoDB uses a B-tree as the data structure for its indexes.

_id is indexed automatically as soon as the collection is created

Every document has an _id field that prevents the same document from being created twice; the system enforces this with a unique index.

{
 "_id" : ObjectId("5d20b925ab28f162bf54acc6"),
 "requestId" : "1123:123456789:chunk:000001",
 "callServiceCount" : 3,
 "channelId" : "1150",
 ....
}

Index types

Single Field Index

Pick one field in the document to promote to an index, so that read operations, and writes that first have to locate a document by that field, reach the data faster.

Compound Indexes

If a single-field index cannot support a more complex workload and we want to index more than one field, this is the type to use.

Field order matters for this type of index: entries are sorted by the first (leftmost) field, then by each following field in turn.

{
 "_id": ObjectId(...),
 "item": "Banana",
 "category": ["food", "produce", "grocery"],
 "location": "4th Street Store",
 "stock": 4,
 "type": "cases"
}

db.products.createIndex( { "item": 1, "stock": 1 } )

For example, MongoDB sorts by item: 1 first, then sorts the entries that share the same item by stock: 1. Limit: a compound index can contain at most 32 fields.
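The ordering rule can be sketched in plain JavaScript. The `products` array below is hypothetical sample data, and the comparator only imitates the order in which an { item: 1, stock: 1 } index would keep its entries.

```javascript
// Hypothetical product documents (only the two indexed fields are shown).
const products = [
  { item: "Banana", stock: 4 },
  { item: "Apple", stock: 9 },
  { item: "Banana", stock: 1 },
  { item: "Apple", stock: 2 },
];

// Compare on item first; fall back to stock only when the items are equal.
const indexOrder = [...products].sort((a, b) =>
  a.item < b.item ? -1 : a.item > b.item ? 1 : a.stock - b.stock
);

const keys = indexOrder.map((p) => `${p.item}:${p.stock}`);
console.log(keys); // [ 'Apple:2', 'Apple:9', 'Banana:1', 'Banana:4' ]
```

Because the entries are grouped by item first, a query on item alone can still use this index, while a query on stock alone generally cannot: the stock values are only ordered within each item.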

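Multikey Indexes

If the field we want to index holds an array, including a field nested inside an array of embedded documents, this is the index type to consider. MongoDB stores one index entry per array element.

Limitation: a multikey index cannot be used as a shard key index.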
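A rough way to picture a multikey index, assuming two made-up documents shaped like the product example above: each element of the category array becomes its own index entry that points back to the document. The `catalog` data and the `findByCategory` helper are illustrative only.

```javascript
// Hypothetical documents with an array field.
const catalog = [
  { _id: 1, item: "Banana", category: ["food", "produce", "grocery"] },
  { _id: 2, item: "Soap", category: ["grocery"] },
];

// Toy multikey index on `category`: one [value, _id] entry per array element.
const categoryEntries = catalog
  .flatMap((doc) => doc.category.map((value) => [value, doc._id]))
  .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : a[1] - b[1]));

// Looking up a single value finds every document whose array contains it.
function findByCategory(value) {
  return categoryEntries.filter(([v]) => v === value).map(([, id]) => id);
}

console.log(categoryEntries.length, findByCategory("grocery"));
```

Two documents produce four index entries here, which is also why multikey indexes can grow much larger than the collection they describe.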

Geospatial Indexes

Suited to workloads that store geolocation data, GeoJSON, geographic and map data, x-y coordinates in a mathematical plane, or polygons to be plotted, and so on.

Text Indexes

Suited to text search queries. They support features such as wildcard field specifiers, character-encoding-aware searching, tokenization, and stop words. Come to think of it, that is a little like Elasticsearch.

Hashed Indexes

For workloads that need evenly distributed writes: MongoDB runs the field value through a hash function, then partitions the data across the shards by that hash value.

MongoDB computes the hash itself, at a small compute cost; the application does not have to.

MongoDB lets us pick any single field for this index type, but a hashed index does not support multikey (array) fields.

The documentation at the time of writing says: "You may not create compound indexes that have hashed index fields or specify a unique constraint on a hashed index." (MongoDB 4.4 and later do allow a compound index that contains a single hashed field.)

The hash value is 64 bits of the 128-bit md5 hash.

The main reason to choose this index type is hashed sharding, which is covered later.

Index properties

Unique Indexes

An index with this property immediately rejects any document whose indexed value duplicates an existing one, so a value that is already stored cannot be saved again.

Partial Indexes

If we want the index to cover only the documents that match a filter, we can add a condition like this:

db.restaurants.createIndex(
   { cuisine: 1, name: 1 },
   { partialFilterExpression: { rating: { $gt: 5 } } }
)

For the restaurants collection, this

creates a compound index sorted by cuisine and then name, both ascending
indexes only the documents where rating > 5
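The effect of partialFilterExpression can be sketched in plain JavaScript. The four restaurant documents are invented for illustration, and `matchesFilter` only mimics the { rating: { $gt: 5 } } condition; it is not MongoDB's matcher.

```javascript
// Hypothetical documents in the restaurants collection.
const restaurants = [
  { name: "Aroy", cuisine: "thai", rating: 8 },
  { name: "Baan", cuisine: "thai", rating: 3 },
  { name: "Chiba", cuisine: "sushi", rating: 6 },
  { name: "Daruma", cuisine: "sushi" }, // no rating field at all
];

// Mimics { rating: { $gt: 5 } }: documents without a numeric rating do not match.
const matchesFilter = (doc) => typeof doc.rating === "number" && doc.rating > 5;

// Only matching documents get an entry; entries are ordered by cuisine, then name.
const partialIndexEntries = restaurants
  .filter(matchesFilter)
  .map((doc) => `${doc.cuisine}|${doc.name}`)
  .sort();

console.log(partialIndexEntries); // [ 'sushi|Chiba', 'thai|Aroy' ]
```

The index holds two entries instead of four, which is the point: it is smaller and cheaper to maintain. The trade-off is that a query can only use it when its own conditions guarantee rating > 5; otherwise the index would miss documents and MongoDB will not pick it.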

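Sparse Indexes

https://www.slideshare.net/sharkag/index-management-in-shallow-depth

In general database terms, a sparse index is viewed as data blocks and pointers that locate a range of data: for example, the entry for Page 10 points at the range from ≥ 10 up to ≤ 20, as in the figure above. In MongoDB specifically, the sparse property means the index only contains entries for documents that have the indexed field.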

https://www.quora.com/What-is-difference-between-clustered-index-and-primary-index-or-both-are-same

TTL Indexes

MongoDB automatically deletes the documents covered by this kind of index once the configured timeout has passed.
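As a hedged sketch, a TTL index is an ordinary single-field index on a date field plus the expireAfterSeconds option. The sessions collection and the createdAt field below are assumptions for illustration, and `isExpired` is only a toy restatement of the expiry rule, not MongoDB code. The createIndex call is guarded so the snippet also runs under plain Node, where `db` does not exist.

```javascript
// Hypothetical names: a `sessions` collection with a `createdAt` date field.
const ttlKeys = { createdAt: 1 };
const ttlOptions = { expireAfterSeconds: 3600 }; // expire about one hour after createdAt

// Toy version of the rule: a document is due once createdAt + TTL has passed.
function isExpired(doc, now) {
  return doc.createdAt.getTime() + ttlOptions.expireAfterSeconds * 1000 <= now.getTime();
}

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.sessions.createIndex(ttlKeys, ttlOptions);
}
```

Deletion is done by a background task that runs periodically, so an expired document can stay in the collection for a short while after its deadline.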

Next, let's look at each sharding strategy.

Sharding

If our system works with a very large data set and needs very high throughput, the data has to be distributed well, so distributing data through sharding matters. This is not only true of MongoDB: Redis, HBase, and even file systems and logs need it too.

Horizontal scaling suits this kind of workload very well. We spread a collection across several shards in a sharded cluster, and with good HA the system can write data quickly and read it quickly too.

Replica Set

Failover matters. When we run a very high-throughput system we cannot guarantee that it will never go down at a critical moment, and we also have to decide how reads and writes should be divided.

That is why the primary / secondary roles exist: the primary takes the heavy writes, while the secondaries can serve reads. If the worst happens and the primary goes down, failover kicks in: an election promotes one of the secondaries.

Mongo Router

If you are familiar with distributed systems, for example with Zookeeper or other service discovery tools, you know how useful this is. Instead of us hunting for the destination machine ourselves, the router (mongos) relies on stored metadata and directs our requests to the right place automatically, which makes life much easier.

So how should a software engineer or software architect tell the team which kind of sharding to choose, and how to design the key?

I will skip the benefits of sharding here; you can read more in the references at the end.

Hashed Sharding

even data distribution

https://www.digitalocean.com/community/tutorials/understanding-database-sharding

Spreading data evenly is the goal of this sharding strategy, and it does that well as long as the shard key has many distinct values.

Pros

Writes are spread evenly across shards, so writing to disk is fast.

Things to consider

Computing the hash function takes some time (in my experience, not enough to be a problem).
The data is scattered, so read-heavy workloads need some thought. For example, if we need data whose shard key values are close together, such as all documents for the students of class 00001, it takes time to gather the data spread over different shards and chunks before showing the result.

https://www.slideshare.net/mongodb/mongodb-for-time-series-data-part-3-sharding

db.collection.createIndex( { _id: "hashed" } )
sh.shardCollection( "database.collection", { _id : "hashed" } )
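To see why hashing spreads sequential keys, here is a toy simulation in Node. It is only a sketch under loud assumptions: it takes 64 bits of an md5 digest of the key's string form and uses a modulo over a made-up SHARD_COUNT, whereas MongoDB hashes the BSON value and assigns ranges of the hash space to chunks. The function name is hypothetical.

```javascript
const crypto = require("crypto");

const SHARD_COUNT = 4; // hypothetical cluster size

// Toy placement rule: first 64 bits of md5(key), modulo the number of shards.
function toyHashedShard(key) {
  const digest = crypto.createHash("md5").update(String(key)).digest();
  const hash64 = digest.readBigUInt64BE(0);
  return Number(hash64 % BigInt(SHARD_COUNT));
}

// Insert 10,000 strictly increasing ids and count where each one lands.
const perShard = new Array(SHARD_COUNT).fill(0);
for (let id = 0; id < 10000; id++) {
  perShard[toyHashedShard(id)] += 1;
}

console.log(perShard);
```

Even though the ids arrive in order, each shard ends up with roughly a quarter of them. The flip side is visible too: ids 1, 2, and 3 may sit on three different shards, which is why range queries over a hashed key have to visit every shard.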

Range Sharding

https://docs.mongodb.com/manual/sharding/#ranged-sharding

The goal of this kind of sharding is to keep related data as close together as possible. Before choosing it, make sure the key is designed well, because a poorly designed key can lead to hot spotting, that is, a hot chunk.

It splits the data into ranges of shard key values, as a form of horizontal scaling.
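Both sides of ranged sharding can be sketched with a small simulation. The chunk boundaries, shard names, and `shardFor` helper below are invented for illustration; a real cluster keeps this mapping in its config metadata and splits and moves chunks over time.

```javascript
// Hypothetical chunks on an increasing numeric key: [min, max) ranges, each owned by a shard.
const chunks = [
  { min: -Infinity, max: 1000, shard: "A" },
  { min: 1000, max: 2000, shard: "B" },
  { min: 2000, max: Infinity, shard: "C" },
];

function shardFor(key) {
  return chunks.find((c) => key >= c.min && key < c.max).shard;
}

// The upside: a range read over nearby keys touches a single shard.
const readShards = new Set();
for (let key = 1100; key <= 1200; key++) readShards.add(shardFor(key));

// The downside: new documents with ever-growing keys all hit the last chunk.
const writesPerShard = {};
for (let key = 2500; key < 2600; key++) {
  const shard = shardFor(key);
  writesPerShard[shard] = (writesPerShard[shard] || 0) + 1;
}

console.log([...readShards], writesPerShard); // [ 'B' ] { C: 100 }
```

With a monotonically increasing key such as a timestamp or a counter, shard C takes every insert while A and B sit idle, which is the hot chunk the text warns about.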

Designing the Shard Key

Every sharded collection must have an index that supports the shard key. If the collection already contains documents, create the index first and then shard the collection.

Unique Indexes

We cannot put a unique constraint on a hashed index, so a call like the following is rejected:

db.items.createIndex( { item: "hashed" }, { unique: true } )

The background, from http://www.dba86.com/docs/mongo/2.4/tutorial/enforce-unique-keys-for-sharded-collections.html:

MongoDB does not support creating new unique indexes in sharded clusters and will not allow you to shard collections with unique indexes on fields other than the _id field.

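For a ranged sharded collection that needs a unique key, the unique index has to be one of these:

a single-field index on the shard key
a compound index that has the shard key as its prefix
the _id index, although it only enforces uniqueness per shard, not across shards (unless _id is the shard key or its prefix)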

UNIQUENESS AND THE _ID INDEX

If the _id field is not the shard key or the prefix of the shard key, _id index only enforces the uniqueness constraint per shard and not across shards.

For example, consider a sharded collection (with shard key {x: 1}) that spans two shards A and B. Because the _id key is not part of the shard key, the collection could have a document with _id value 1 in shard A and another document with _id value 1 in shard B.

If the _id field is not the shard key nor the prefix of the shard key, MongoDB expects applications to enforce the uniqueness of the _id values across the shards.

MongoDB recommends analysing three things about a shard key: cardinality, frequency, and rate of change.

Cardinality

The cardinality of a shard key determines the maximum number of chunks the balancer can create, which in turn indicates how effectively the cluster can scale horizontally.

Cardinality is the number of distinct members of a set; for example, X = { 1, 2, 1 } has only two distinct values. Low cardinality leads to the result shown in the figure.

Because writes then land on only a subset of the shards in the cluster, MongoDB recommends a compound index when the key has low cardinality. The extra field helps spread the data and lets the balancer create more chunks.

db.collection.createIndex( { X: 1, ID: 1 }, { unique: true } )
sh.shardCollection( "db.collection", { X: 1, ID: 1 }, true )

https://docs.mongodb.com/manual/reference/method/sh.shardCollection/#sh.shardCollection
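The effect of adding a second field can be checked with a quick count. All documents with the same shard key value have to live in the same chunk, so the number of distinct key values is an upper bound on the number of chunks. The sample documents and the `distinctCount` helper below are made up for illustration.

```javascript
// Hypothetical documents: X has few distinct values, ID is different for each document.
const sampleDocs = [
  { X: 1, ID: 101 },
  { X: 2, ID: 102 },
  { X: 1, ID: 103 },
  { X: 1, ID: 104 },
  { X: 2, ID: 105 },
];

const distinctCount = (values) => new Set(values).size;

// Shard key { X: 1 }: at most 2 chunks can ever exist.
const cardinalityX = distinctCount(sampleDocs.map((d) => d.X));

// Shard key { X: 1, ID: 1 }: every document has its own key value.
const cardinalityCompound = distinctCount(sampleDocs.map((d) => `${d.X}|${d.ID}`));

console.log(cardinalityX, cardinalityCompound); // 2 5
```

With only two possible chunks, adding a third shard would not help at all; the compound key removes that ceiling.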

Shard Key Frequency

Values that occur with very high frequency can create a hot chunk if the key is poorly designed, as in the figure. MongoDB's suggestion, quoted from the documentation:

If your data model requires sharding on a key that has high frequency values, consider using a compound index, using a unique or low frequency value.

So which one is right?

That is hard to answer; it depends on the workload. Let's look at one example.

In this case a hot spot forms at Chunk C in Shard A.

The fix for this use case (and I stress, for this use case only) relied on the fact that the team did not care about read-heavy access, so this is what they did:

Use hashed sharding

A fun fact about sharding: you need at least 2 shard servers for the data to actually be distributed. With only one machine, who would it distribute to?!

If anything in this post is wrong, I apologise. Readers are welcome to leave suggestions in the comments. I hope this post helps you see how indexes work and how to choose between the sharding types; it comes from my own limited trial-and-error experience with both sharding strategies.

References

http://macroart.net/2013/10/mongodb-lessons-learned-on-pantip/
https://www.slideshare.net/mongodb/sharding-v-final
https://docs.mongodb.com/manual/sharding
https://docs.mongodb.com/manual/indexes