[MongoDB] Study Notes (2) - Advanced find Usage and Aggregation

Dec 19, 2020

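The previous chapter covered basic create, read, update, and delete operations. Here I want to record in more detail how to query with find, which operators each method uses, and how to run more advanced grouped queries with aggregation.

The data used in this chapter can be found here; import it with mongoimport.

Mongo version: v4.4.1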

mongoimport C:/Users/user/Downloads/movie.json -d demo -c movies --jsonArray --drop

The dataset describes movies. It contains 100 documents, each with fields such as the title, year, director, actors, and IMDB score.

find and findOne syntax, Query Selectors and Projection Operators
Query Selectors - Comparison Operator
Query Selectors - Logical Operator
Query Selectors - Element Operator
Query Selectors - Array Operator & Array Operations
Projection Operators - Adjusting the Output: Arrays
Cursor Methods - Modifying Query Results
Aggregation Pipeline Stages and Operators Syntax
Aggregation Pipeline Stages: $match, $project
Aggregation Pipeline Stages: $group, $bucket, $bucketAuto
Aggregation Pipeline Operators: $cond, $switch in $project, $group
Aggregation Pipeline Stages: $out for Saving Query Results
Conclusion

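find and findOne syntax, Query Selectors and Projection Operators

find takes two parameters, Query and Projection. Both are optional: with no arguments, db.<collection_name>.find() returns every document and db.<collection_name>.findOne() returns the first one.

Query filters the data by condition. For example, to find movies with a score equal to 8.5, use db.<collection_name>.find({imdb_score:8.5}). Query operators give finer control, such as $gt to find scores greater than 8.5: db.<collection_name>.find({imdb_score:{$gt: 8.5}}).

Projection controls what is returned. For example, after Query has selected the documents for movies scoring above 8.5, I may only want to see the movie title in each document. Projection lets me pick the fields: db.<collection_name>.find({imdb_score: {$gt: 8.5}}, {title:1}). Finer control comes from projection operators. The overall structure is shown in the figure below.

The methods are find and findOne. Both query and projection are objects, so you can use curly braces to list several fields separated by commas and filter on multiple conditions, as in db.<collection_name>.find({field:value, field:value, ...}, {field:value, field:value, ...})

Of course, the diagram is only an overview. Real usage varies more than it shows, as detailed later.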

Query Selectors - Comparison Operator

As the name suggests, these are operators for comparison filters: equal $eq, not equal $ne, greater than $gt, greater than or equal $gte, less than $lt, less than or equal $lte, in a list $in, and not in a list $nin.

Equal $eq: find imdb_score = 8.5

Equality can be written two ways: db.movies.findOne({imdb_score: 8.5}) and db.movies.findOne({imdb_score: {$eq: 8.5}}). To make the results easier to inspect, the remaining examples all add a projection that shows only imdb_score.

Not equal $ne: find imdb_score != 8.5

db.movies.find({imdb_score: {$ne: 8.5}}, {_id:0, imdb_score:1 })

Greater than $gt: find imdb_score > 8.5

db.movies.find({imdb_score: {$gt: 8.5}}, {_id:0, imdb_score:1 })

Greater than or equal $gte: find imdb_score >= 8.5

db.movies.find({imdb_score: {$gte: 8.5}}, {_id:0, imdb_score:1 })

Less than $lt: find imdb_score < 5.5

db.movies.find({imdb_score: {$lt: 5.5}}, {_id:0, imdb_score:1 })

Less than or equal $lte: find imdb_score <= 5.5

db.movies.find({imdb_score: {$lte: 5.5}}, {_id:0, imdb_score:1 })

In a list $in: find imdb_score equal to 8.5 or 9.0

db.movies.find({imdb_score: {$in: [8.5, 9.0]}}, {_id:0, imdb_score:1 })

Not in a list $nin: find imdb_score equal to neither 8.5 nor 9.0

db.movies.find({imdb_score: {$nin: [8.5, 9.0]}}, {_id:0, imdb_score:1 })
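
To make the semantics of these operators concrete, here is a minimal plain-JavaScript sketch that evaluates the same conditions against a list of numbers. It runs without a MongoDB server. The helper matchesCondition and the sample scores are my own assumptions for illustration, not part of MongoDB, and the sketch only covers a single numeric field.

```javascript
// Plain-JavaScript model of the comparison operators for one numeric value.
// matchesCondition is a hypothetical helper, not a MongoDB API.
function matchesCondition(value, condition) {
  const checks = {
    $eq: (v, x) => v === x,
    $ne: (v, x) => v !== x,
    $gt: (v, x) => v > x,
    $gte: (v, x) => v >= x,
    $lt: (v, x) => v < x,
    $lte: (v, x) => v <= x,
    $in: (v, xs) => xs.includes(v),
    $nin: (v, xs) => !xs.includes(v),
  };
  // Every operator listed in the condition object must hold (implicit AND).
  return Object.entries(condition).every(([op, operand]) => checks[op](value, operand));
}

const scores = [5.5, 8.0, 8.5, 9.0]; // hypothetical imdb_score values

const greaterThan85 = scores.filter((s) => matchesCondition(s, { $gt: 8.5 }));
const inList = scores.filter((s) => matchesCondition(s, { $in: [8.5, 9.0] }));
const notInList = scores.filter((s) => matchesCondition(s, { $nin: [8.5, 9.0] }));

console.log(greaterThan85, inList, notInList);
```

Because a condition object may hold several operators, {$gte: 8, $lte: 8.5} in this sketch behaves like a range check, which mirrors how the same object behaves in a find query.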

Query Selectors - Logical Operator

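There are four operators. Three of them, $and (all conditions hold), $or (at least one condition holds), and $nor (none of the conditions holds), chain multiple conditions together. $not negates a condition.

All conditions $and: find imdb_score greater than or equal to 8 and less than or equal to 8.5

db.movies.find({$and: [{imdb_score: {$gte: 8}}, {imdb_score: {$lte: 8.5}}]}, {'_id':0, imdb_score: 1})

At least one condition $or: find imdb_score greater than or equal to 8.8 or less than or equal to 5.8

db.movies.find({$or: [{imdb_score: {$gte: 8.8}}, {imdb_score: {$lte: 5.8}}]}, {'_id':0, imdb_score: 1})

None of the conditions $nor: find imdb_score that is neither greater than or equal to 8.5 nor less than or equal to 8

db.movies.find({$nor: [{imdb_score: {$gte: 8.5}}, {imdb_score: {$lte: 8}}]}, {'_id':0, imdb_score: 1})

Negation $not: find imdb_score not greater than 5.5

db.movies.find({imdb_score: {$not: {$gt: 5.5}}}, {'_id':0,imdb_score:1})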
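
One detail worth a sketch: $not: {$gt: 5.5} is not quite the same as $lte: 5.5, because $not also selects documents in which the field does not exist. The plain-JavaScript model below illustrates that difference. The three documents are hypothetical and the filters only imitate the query semantics.

```javascript
// Hypothetical documents; the last one has no imdb_score field at all.
const docs = [
  { title: "A", imdb_score: 5.0 },
  { title: "B", imdb_score: 7.2 },
  { title: "C" },
];

// Model of {imdb_score: {$lte: 5.5}}: the field must exist and be <= 5.5.
const lte = docs.filter((d) => d.imdb_score !== undefined && d.imdb_score <= 5.5);

// Model of {imdb_score: {$not: {$gt: 5.5}}}: every document for which
// "imdb_score > 5.5" is not true, including documents without the field.
const notGt = docs.filter((d) => !(d.imdb_score !== undefined && d.imdb_score > 5.5));

console.log(lte.map((d) => d.title));   // titles selected by the $lte model
console.log(notGt.map((d) => d.title)); // titles selected by the $not model
```

In the movies dataset every document presumably has imdb_score, so the two queries would return the same result there. The difference only shows up on collections with missing fields.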

Query Selectors - Element Operator

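This family of operators tests properties of a field in a document. It has two members: $exists, which selects documents by whether a field exists, and $type, which selects documents in which a field has a given type.

Here I first insert a new collection named test. It is an array of 4 documents that differ in how many fields they have and in the types of those fields. The element operators below filter on it.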

db.test.drop()
db.test.insert(
[{var1: 1, var2: 'Ben'}, 
 {var1: 2, var2: 44573}, 
 {var1: 2, var2: null},
 {var1: 3}]
)

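Whether a field exists: $exists

To check whether a document has a given field, use {$exists: true}. Conversely, to select documents that do not have the field, use {$exists: false}. For example, selecting documents with and without the var2 field is db.test.find({var2: {$exists: true}}) and db.test.find({var2: {$exists: false}}) respectively.

Filtering on a field's type: $type

To select documents in which a field has a given type, use {$type: <type>}. For example, to select documents where var2 is a number, write db.test.find({var2: {$type: 'number'}}). To match several types, such as documents where var2 is either a number or a string, pass an array: db.test.find({var2: {$type: ['number', 'string']}}). For the full list of types MongoDB defines, see the page below.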

$type - MongoDB Manual (docs.mongodb.com)

Filtering missing values and null

In MongoDB, null has its own type, 'null', so the query can be written as db.test.find({var2: {$type: 'null'}})
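
The difference between "the field is missing", "the field is null", and "the field has a given type" is easy to mix up, so here is a minimal plain-JavaScript sketch over the same four test documents. The filters are only a model of the query semantics, written under the assumption that a missing field corresponds to an absent JavaScript property.

```javascript
// The same four documents that were inserted into the test collection.
const testDocs = [
  { var1: 1, var2: "Ben" },
  { var1: 2, var2: 44573 },
  { var1: 2, var2: null },
  { var1: 3 },
];

// Model of {var2: {$exists: true}}: the field is present, even if it is null.
const exists = testDocs.filter((d) => "var2" in d);

// Model of {var2: {$exists: false}}: the field is absent.
const missing = testDocs.filter((d) => !("var2" in d));

// Model of {var2: {$type: 'number'}}.
const isNumber = testDocs.filter((d) => typeof d.var2 === "number");

// Model of {var2: {$type: 'null'}}: present and explicitly null.
const isNull = testDocs.filter((d) => "var2" in d && d.var2 === null);

// Model of {var2: null}: matches both explicit null and a missing field.
const equalsNull = testDocs.filter((d) => d.var2 === null || d.var2 === undefined);

console.log(exists.length, missing.length, isNumber.length, isNull.length, equalsNull.length);
```

So $type: 'null' is the stricter choice when only explicit null values are wanted, while a plain {var2: null} query also picks up documents where var2 is absent.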

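Query Selectors - Array Operator & Array Operations

When a field's value is an array, it may hold several values, and ordinary operators may not give the result you want. In that case the Array Operators are needed. There are three: $all, $elemMatch, and $size. Definitions:

$all: the document's array contains all of the given values

$elemMatch: at least one element of the document's array satisfies all of the given conditions

$size: the document's array has the given number of elements

The following demonstrates how to match various element conditions. First insert a collection whose documents hold a plain string, arrays, embedded documents, nested arrays, and so on.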

db.demo_1.drop()
db.demo_1.insert(
  [
   {color: 'R'},
   {color: ['R']},
   {color: {var1: 'R'}},
   {color: [['R']]},
   {color: [[['R']]]},
   {color: [{var1: 'R'}]},
   {color: ['R', 'G']}, 
   {color: ['G', 'R']},  
   {color: ['R', 'B']}, 
   {color: [['R'], 'B']}, 
   {color: [{var1: 'R'}, 'B']},
   {color: [['R'], ['B']]}, 
   {color: [{var1: 'R'}, {var2: 'B'}]}, 
   {color: [['R'], {var2: 'B'}]}, 
   {color: ['B', 'B', 'B']},
   {color: ['R', 'G', 'B']}
  ])

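Getting a specific element from an array

If I want to find 'R' in the demo_1 collection using the non-array style of filtering, the query is db.demo_1.find({color: "R"}, {_id:0}). This returns the document whose value is the plain string 'R' and the documents whose array has 'R' at the first level, but it cannot reach an 'R' inside nested structures.

db.demo_1.find({color: ["R"]}, {_id:0}) filters on ['R']. It finds the document whose value is exactly ['R'] and the documents whose array contains ['R'] as a first-level element.

To filter on the 'R' inside an embedded document, select the element with dot notation, 'color.var1': db.demo_1.find({'color.var1': "R"}).

To get all three cases at once, chain them with $or: db.demo_1.find({$or: [{color: "R"}, {color: ["R"]}, {'color.var1': "R"}]}, {_id:0})

Getting a specific element at a specific position in an array

For an embedded document you specify a field to state the condition. For an array you can specify a position: to address the first element of the array, write the field as <field>.0, the second element as <field>.1, and so on.

db.demo_1.find({"color.0": "R"}, {_id:0}) therefore selects documents whose array has R as its first element. Documents whose value is not an array at all do not appear, which is why the result below contains no documents whose value is a string or an embedded document.

Of course, I can also filter on the first element of the array that is itself the first element of the array: db.demo_1.find({"color.0.0": "R"}, {_id:0})

Getting multiple specific elements from an array

To get documents whose array contains both R and G, the intuitive query is db.demo_1.find({color: ['R', 'G']}, {_id:0}). However, this means "the value is exactly ['R', 'G']", so it does not find ['G', 'R'].

The next option is $or: db.demo_1.find({$or: [{color: "G"}, {color: "R"}]}, {_id:0}). This means "G appears, or R appears", so the result can include documents whose value is not an array, as well as arrays that contain only R.

The correct approach is $and: db.demo_1.find({$and: [{color: "G"}, {color: "R"}]}, {_id:0}). It selects arrays that contain both R and G, so it also selects arrays with more than 2 elements, such as ['R', 'G', 'B']. The Array Operator $all is more concise: db.demo_1.find({color: {$all: ["G", "R"]}}, {_id:0}).

To add the condition that the array size is 2, use db.demo_1.find({$and: [{color: {$all: ["G", "R"]}}, {color: {$size:2}}]}, {_id:0}). This selects only the documents holding ['R', 'G'] and ['G', 'R'].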
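
The difference between exact array equality, $all, and $all combined with $size can be modeled in a few lines of plain JavaScript. This is only a sketch: hasAll is a hypothetical helper, the list is a subset of the color values from demo_1, and the model deliberately ignores non-array values.

```javascript
// A subset of the color values stored in demo_1.
const colors = [
  "R",
  ["R"],
  ["R", "G"],
  ["G", "R"],
  ["R", "B"],
  [["R"], "B"],
  ["B", "B", "B"],
  ["R", "G", "B"],
];

// Model of {color: {$all: wanted}} for array values: every wanted value
// must appear as a first-level element, in any order.
const hasAll = (value, wanted) =>
  Array.isArray(value) && wanted.every((w) => value.includes(w));

// Model of {color: {$all: ["G", "R"]}}.
const withBoth = colors.filter((c) => hasAll(c, ["G", "R"]));

// Model of adding {color: {$size: 2}} on top of the $all condition.
const exactlyTwo = withBoth.filter((c) => c.length === 2);

console.log(withBoth);
console.log(exactlyTwo);
```

Note that [['R'], 'B'] is not selected, because the nested ['R'] is an array element rather than the string 'R'.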

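Projection Operators - Adjusting the Output: Arrays

The basic usage has already appeared above: in the second parameter of find, Projection, write <field>: 0 or 1 to control whether a field is shown. For example, db.movies.findOne({}, {_id: 0, imdb_score: 1}) shows only imdb_score in the result and hides _id.

When the field you need is an array, use $slice to adjust it. For example, to show only the first element of the actors field, use {actors: {$slice: [0,1]}}, which skips 0 elements and returns 1.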

db.movies.find({}, {actors: {$slice: [0,1]}}).limit(1).pretty()

Cursor Methods - Modifying Query Results

The Cursor is a MongoDB Collection of the document which is returned upon the find method execution. Cursor methods modify the way that the underlying query is executed.

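The result of a find query is called a Cursor. To process that result further, for example to count it, control how it is displayed, or apply a function to it, use Cursor Methods. Some of these methods are tied to indexes. Here I only cover a few simple ones. They are called as cursor.<method>, and the commonly used ones are described below.

Display: skip, limit

These limit how many results are returned. For example, to show only the first 10 results, use limit: db.movies.find({},{ _id:0, imdb_score: 1}).limit(10). To show the last 5 results, use skip to ignore the first N results. Since there are 100 documents, skip the first 95: db.movies.find({},{ _id:0, imdb_score: 1}).skip(95).

To see results 6 through 10, the query is db.movies.find({},{ _id:0, imdb_score: 1}).skip(5).limit(5).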
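
The arithmetic behind skip and limit is the same as slicing an array. Here is a minimal plain-JavaScript sketch, assuming 100 hypothetical documents numbered 1 to 100 stand in for the movies collection. The skip and limit helpers are my own, not the cursor methods themselves.

```javascript
// 100 hypothetical documents numbered 1..100, standing in for the collection.
const docs = Array.from({ length: 100 }, (_, i) => ({ n: i + 1 }));

// Array-based stand-ins for cursor.skip(n) and cursor.limit(n).
const skip = (arr, n) => arr.slice(n);
const limit = (arr, n) => arr.slice(0, n);

const firstTen = limit(docs, 10);         // like .limit(10)
const lastFive = skip(docs, 95);          // like .skip(95) on 100 documents
const sixToTen = limit(skip(docs, 5), 5); // like .skip(5).limit(5)

console.log(firstTen.length);
console.log(lastFive.map((d) => d.n));
console.log(sixToTen.map((d) => d.n));
```

The number passed to skip has to be the total minus the number of documents wanted, which is why skip(95) yields the last 5 of 100.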

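Counting: count, size

Both methods count the documents in the result of find. The difference is that the result of count is not affected by skip or limit, while size counts only the documents that remain after skip and limit are applied.

The query below returns 10 documents. Calling count on it still returns 100, while calling size returns 10.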

db.movies.find({},{ _id:0, imdb_score: 1}).limit(10)

Converting to an array: toArray

This method converts the query result into an array. Once converted, the array can be saved into another collection with insert.

db.movies.find({},{ _id:0, imdb_score: 1}).limit(5).toArray()

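Function mapping: map

The first argument of map is a JavaScript function, which is applied to each document in the query result. For example, suppose I want to run a calculation on the query result and show the values before and after the calculation.

I first create a function that squares imdb_score and returns an array holding the value before and after squaring.

temp_fun = function(u) {var a = u.imdb_score**2; return [u.imdb_score, a]}

Then I can pass this function to map:

db.movies.find({},{ _id:0, imdb_score: 1}).limit(3).map(temp_fun)

The execution result is shown below.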
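
Since the original output is a screenshot, here is a minimal plain-JavaScript sketch of the same mapping using Array.prototype.map on three hypothetical documents. It only illustrates what the function returns per document. The scores are made up and are not taken from the dataset.

```javascript
// Same logic as the mapping function above: return [original, squared].
const tempFun = function (u) {
  const squared = u.imdb_score ** 2;
  return [u.imdb_score, squared];
};

// Three hypothetical documents, as if returned by find(...).limit(3).
const docs = [{ imdb_score: 9 }, { imdb_score: 8.5 }, { imdb_score: 8 }];

// Array.prototype.map plays the role of cursor.map here.
const result = docs.map(tempFun);

console.log(result); // one [score, score squared] pair per document
```

Each input document becomes one two-element array, so the result has exactly as many entries as the query returned documents.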

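Aggregation Pipeline Stages and Operators Syntax

As the name suggests, an aggregation pipeline chains query steps together so that they run one after another. Its building blocks come at two levels. The first level is the Stage, the basic unit that a pipeline chains. For example, chaining three stages looks like this: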

db.<collection_name>.aggregate(
   [
      { <stage1> },
      { <stage2> },
      { <stage3> },
      ...
   ]
)

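The second level is Operators, which are used inside stages, such as the accumulators $max and $min or the conditional $cond. They do not appear at the first level. Some common features are described below.

Aggregation Pipeline Stages: $match, $project

These two stages basically correspond to Query and Projection in the find method, except that $project can create new fields in addition to choosing which fields are shown.

For example, to select results with imdb_score greater than or equal to 8.5 and show only imdb_score, the find version is: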

db.movies.find(
   {imdb_score: {$gte: 8.5}}, 
   {_id:0, imdb_score:1}
)

The corresponding aggregate version uses basically the same elements:

db.movies.aggregate(
   [
     {$match: {imdb_score: {$gte: 8.5}}},
     {$project: {_id:0, imdb_score:1}}
   ]
)

To select results with imdb_score greater than or equal to 8.5 and less than 9:

The find version, chained with $and:

db.movies.find(
  {$and: [{imdb_score: {$gte: 8.5}},
          {imdb_score: {$lt: 9}}]},
  {_id:0, imdb_score:1}
)

The corresponding aggregate version can be written as:

db.movies.aggregate(
   [
      {$match:   {$and: [{imdb_score: {$gte: 8.5}},
                         {imdb_score: {$lt: 9}}]}},
      {$project: {_id:0, imdb_score:1}}
   ]
)

Because aggregate is a pipeline, it can also be written in the following form, which makes the operations more flexible:

db.movies.aggregate(
   [
      {$match: {imdb_score: {$gte: 8.5}}},
      {$match: {imdb_score: {$lt: 9}}},
      {$project: {_id:0, imdb_score:1}}
   ]
)
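
The idea that each stage receives the output of the previous stage can be sketched in plain JavaScript as a list of functions applied in order. The names runPipeline, match, and project, and the four sample movies, are all hypothetical. This models the concept only and says nothing about how MongoDB executes a pipeline internally.

```javascript
// Apply each stage to the output of the previous one, starting from docs.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Hypothetical stand-ins for the $match and $project stages.
const match = (predicate) => (docs) => docs.filter(predicate);
const project = (shape) => (docs) => docs.map(shape);

const movies = [
  { title: "A", imdb_score: 8.4 },
  { title: "B", imdb_score: 8.5 },
  { title: "C", imdb_score: 8.9 },
  { title: "D", imdb_score: 9.0 },
];

// One match stage holding both conditions.
const oneMatch = runPipeline(movies, [
  match((d) => d.imdb_score >= 8.5 && d.imdb_score < 9),
  project((d) => ({ imdb_score: d.imdb_score })),
]);

// Two consecutive match stages, one condition each.
const twoMatches = runPipeline(movies, [
  match((d) => d.imdb_score >= 8.5),
  match((d) => d.imdb_score < 9),
  project((d) => ({ imdb_score: d.imdb_score })),
]);

console.log(oneMatch, twoMatches);
```

Both pipelines select the same documents. Splitting the condition across stages is mainly useful when another stage needs to sit between the two filters.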

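Aggregation Pipeline Stages: $group, $bucket, $bucketAuto

Grouped statistics are the most important difference between aggregate and the find method. $group lets us group by a given field, while $bucket computes grouped statistics over ranges of a numeric field.

$group

$group groups documents by the value of a given field. The syntax is below, where _id is the grouping key, <field1> is a custom field name, and <accumulator1> is the operation to perform.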

{$group:
   {
     _id: <expression>, // Group By Expression
     <field1>: { <accumulator1> : <expression1> },
     …
    }
 }

For example, to compute the sum of movie scores for each country, you can write:

db.movies.aggregate(
   [
      {$group: 
         {_id:   "$country", 
          total: {$sum :"$imdb_score"}
         }
      }
   ]
)

To group by two fields, such as country and year, you can write:

db.movies.aggregate(
   [
      {$group: 
         {_id: {country: "$country", year: "$year"}, 
          total: {$sum :"$imdb_score"}
         }
      }
   ]
)

To compute several statistics at once, define several custom fields one after another:

db.movies.aggregate(
   [
      {$group: 
         {_id: "$country", 
          total: {$sum :"$imdb_score"},
          avg: {$avg :"$imdb_score"},
          max: {$max :"$imdb_score"},
          std: {$stdDevSamp :"$imdb_score"},
          first: {$first :"$imdb_score"},
          count: {$sum :1}
         }
      }
   ]
).pretty()
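
For readers who think in terms of loops, here is a minimal plain-JavaScript sketch of what grouping by country with a few accumulators amounts to. The three movies are hypothetical, and the sketch covers only the total, count, max, and avg fields, not the full query above.

```javascript
// Hypothetical documents with only the two fields used for grouping.
const movies = [
  { country: "USA", imdb_score: 8 },
  { country: "USA", imdb_score: 9 },
  { country: "UK", imdb_score: 7 },
];

// One accumulator object per distinct country, keyed like _id: "$country".
const groups = new Map();
for (const m of movies) {
  const g = groups.get(m.country) || { _id: m.country, total: 0, count: 0, max: -Infinity };
  g.total += m.imdb_score;               // like {$sum: "$imdb_score"}
  g.count += 1;                          // like {$sum: 1}
  g.max = Math.max(g.max, m.imdb_score); // like {$max: "$imdb_score"}
  groups.set(m.country, g);
}

// The average is derived once all documents of a group have been seen.
const result = [...groups.values()].map((g) => ({ ...g, avg: g.total / g.count }));

console.log(result);
```

This also shows why count is written as {$sum: 1}: every document simply adds 1 to its group's counter.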

$bucket

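$bucket splits the values of a given field into groups that you define. The syntax is below. groupBy sets the field to group on. boundaries splits that field at the values you choose: for example, if the field holds real numbers from 1 to 100, I can define one group for every 10. output holds the custom fields, which compute grouped statistics in the same way as in $group.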

{$bucket:
   {
     groupBy: <expression>, // Group By Expression
     boundaries: [ <lowerbound1>, <lowerbound2>, ... ],
     output: {
        <field1>: { <accumulator1> : <expression1> },
        …
     }
   }
}

For example, to compute the average movie score and the number of movies for specific year ranges, grouped as 0-1999, 2000-2004, 2005-2009, and 2010-2999, you can write:

db.movies.aggregate([
{
  $bucket: {
      groupBy: "$year",
      boundaries: [ 0, 2000, 2005, 2010, 3000],
      output: {
          avg: {$avg :"$imdb_score"},
          count: {$sum :1}
      }
   }
}]).pretty()
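
How a value is assigned to a bucket is worth spelling out: each boundary is an inclusive lower bound and the next boundary is an exclusive upper bound, and the bucket's _id is its lower bound. The plain-JavaScript sketch below models that rule. The helper bucketId and the sample years are hypothetical, and returning null for out-of-range values is my own simplification, since the real stage raises an error in that case unless a default bucket is configured.

```javascript
// Return the lower bound of the bucket that contains value, or null when the
// value falls outside every bucket (a simplification made by this sketch).
function bucketId(value, boundaries) {
  for (let i = 0; i < boundaries.length - 1; i++) {
    // Lower bound inclusive, upper bound exclusive.
    if (value >= boundaries[i] && value < boundaries[i + 1]) {
      return boundaries[i];
    }
  }
  return null;
}

const boundaries = [0, 2000, 2005, 2010, 3000];
const years = [1999, 2000, 2004, 2005, 2009, 2010]; // hypothetical year values

const ids = years.map((y) => bucketId(y, boundaries));
console.log(ids);
```

This is why the last boundary is 3000 rather than 2999: the upper bound itself is never included in a bucket.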

$bucketAuto

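Similar to $bucket, but instead of listing boundaries you set the number of groups with buckets, and MongoDB chooses the boundaries so that the documents are spread across the groups as evenly as possible. The syntax is: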

{$bucketAuto:
   {
     groupBy: <expression>, // Group By Expression
     buckets: <number>,
     output: {
        <field1>: { <accumulator1> : <expression1> },
        …
     }
   }
}

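For example, to compute the average movie score and the number of movies per year range, with the years split automatically into 5 groups, you can write:

db.movies.aggregate([
{
  $bucketAuto: {
      groupBy: "$year",
      buckets: 5,
      output: {
          avg: {$avg :"$imdb_score"},
          count: {$sum :1}
      }
   }
}]).pretty()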

Aggregation Pipeline Operators: $cond, $switch in $project, $group

When a field needs to be defined by a condition, use the $cond and $switch operators. The former is an if/else, and the latter accepts multiple conditions.

$cond

$cond is an operator that builds a field from a given condition. The syntax is:

{$project:
   {
     <field1>: { $cond: { if: <boolean-expression>,
                          then: <true-case>,
                          else: <false-case> } },
     ...
   }
}

To group by a condition, you can first build a field with $project + $cond and then use $group. For example, suppose I want to create a new field based on country, with the value USA or not USA:

db.movies.aggregate([
{
  $project:{ 
      area:{$cond: { // create the new field "area" with $cond
          if: {$eq: ["$country", "USA"]},
          then: "USA", 
          else: "Other"
      }}, imdb_score: 1
  }
},
{$group:    
         {_id: "$area", // group by the newly created field
          total: {$sum :"$imdb_score"},
          count: {$sum :1}
         }
      }
])

Of course, since $cond can be used directly in $group, the field can also be created with $cond inside $group.

db.movies.aggregate([
{
  $group:{
      _id:{$cond: {
          if: {$eq: ["$country", "USA"]},
          then: "USA", 
          else: "Other"
      }}, 
      total: {$sum :"$imdb_score"},
      count: {$sum :1}
  }
}])

$switch

$switch produces a new field from multiple conditions. The syntax is below: case holds the condition, then holds the new value to assign, and default is the value used when no case matches.

{$project:
   {
     <field1>: {$switch: {
        branches: [
           { case: <expression>, then: <expression> },
           { case: <expression>, then: <expression> },
           ...
        ],
        default: <expression>
     }}
   }
}

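For example, suppose I want to build a new category from the movie year: years equal to 2007 are "Old", years from 2008 through 2015 are "Middle", years after 2015 are "New", and anything that matches none of these is "Other". It can be written as follows: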

db.movies.aggregate([{
    $project:{
        _id:0,
        year:1,
        new_year_type: {$switch:{
            branches:[
            {
                case: {$eq: ["$year", 2007]},
                then: "Old"
            },
            {
                case: {$and: [{$gt: ["$year", 2007]}, 
                              {$lte: ["$year", 2015]}]},
                then: "Middle"
            },
            {
                case: {$gt: ["$year", 2015]},
                then: "New"
            }
        ], default: "Other"}
        }
    }
}])
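
The branch logic can be checked in isolation with a small plain-JavaScript function. This is a sketch that assumes the first branch whose condition is true wins, with the default used otherwise. The function yearType and the branch layout are hypothetical names for illustration.

```javascript
// Model of the $switch above: branches are tried in order, first match wins.
function yearType(year) {
  const branches = [
    { when: (y) => y === 2007, result: "Old" },
    { when: (y) => y > 2007 && y <= 2015, result: "Middle" },
    { when: (y) => y > 2015, result: "New" },
  ];
  const hit = branches.find((b) => b.when(year));
  return hit ? hit.result : "Other"; // "Other" plays the role of default
}

const sampleYears = [2000, 2007, 2010, 2015, 2016]; // hypothetical inputs
const labels = sampleYears.map(yearType);
console.log(labels);
```

With these conditions, any year before 2007 falls through every branch and receives the default value.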

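Aggregation Pipeline Stages: $out for Saving Query Results

$out saves the query result to another collection. That collection can be an existing one, in which case its contents are replaced, or a new one. The syntax is: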

{ $out: { db: "<output-db>", coll: "<output-collection>" } }

For example, to save the movies scoring 8.5 or higher into a collection named "high_score_movie":

db.high_score_movie.drop()
db.movies.aggregate(
   [
     {$match: {imdb_score: {$gte: 8.5}}},
     {$out: "high_score_movie"}
   ]
)
db.high_score_movie.find().pretty()

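Conclusion

I originally planned to take notes on all of the create, read, update, and delete operations, but querying alone is several times longer than the previous post, and many important query features are still not covered, such as array operations in aggregate, merging documents, and type conversion. I will update these notes when I get the chance.

Since this post is long, there are bound to be omissions, hahaha. Corrections are welcome.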