[Python] NumPy Study Notes: random [np-002]

ChunJen Wang

Published in

jimmy-wang

10 min read · Apr 11, 2021

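This post shares my notes from learning to use NumPy.

The content combines material from w3schools with my notes from an AI course at ITRI (Industrial Technology Research Institute). Every part comes with a Colab link to an .ipynb notebook, which you can view after signing in to a Google account. The series is organized as follows:

Getting started: operations on NumPy's own ndarray (reshaping, joining, splitting, searching).
Random: generating simulated data with the random module.
ufunc: universal functions, NumPy arithmetic, and broadcasting.
Exercises: simply more practice.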

Source: https://towardsdatascience.com/how-to-create-numpy-arrays-from-scratch-3e0341f9ffea

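2. Random

The random module opens up much richer applications. Keep in mind that the values are produced by an algorithm (pseudo-random); here we use them, together with noise, to simulate random data.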
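To make the "pseudo-random" point concrete, here is a minimal sketch showing that the same seed replays the same numbers. The seed value 42 is an arbitrary choice for illustration and does not come from the original notes.

```python
from numpy import random

random.seed(42)          # 42 is an arbitrary seed picked for this sketch
first = random.rand(3)

random.seed(42)          # same seed again, so the generator restarts
second = random.rand(3)

same = bool((first == second).all())
print(same)  # True
```

Fixing a seed like this is handy when a notebook needs to give the same result on every run.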

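This post uses plots to illustrate the probability density function (PDF), a concept that comes up constantly in statistics, and generates random data from those PDFs.

Import the packages: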

from numpy import random
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

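Generating random numbers

Random generation can target integers, floats, or draw from an existing array.

random.randint(50000) # the argument is the exclusive upper bound: returns one integer from 0 to 49999
random.rand(10)       # the argument is the count: returns 10 random floats in [0, 1)

To generate several integers, use the size parameter to set the output shape.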

random.randint(100, size=(2,2)) #[[45 95]  [14 96]]
random.rand(2,2)    #[[0.909492 0.041678]  [0.338248 0.542733]]
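As a quick sanity check on the ranges and shapes above, the following sketch verifies the bounds instead of trusting the comments. The two-argument call randint(10, 20) is an extra I added for illustration: it draws from 10 (inclusive) up to 20 (exclusive).

```python
from numpy import random

ints = random.randint(100, size=(2, 2))       # integers in [0, 100)
floats = random.rand(2, 2)                    # floats in [0.0, 1.0)
ranged = random.randint(10, 20, size=1000)    # low=10 inclusive, high=20 exclusive

print(ints.shape, floats.shape)               # (2, 2) (2, 2)
print(ranged.min() >= 10, ranged.max() < 20)  # True True
```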

To pick elements at random from an array, use the choice function.

random.choice([1,3,5,7,9], size=(2,2)) #[[3 1]  [9 3]]

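Generating random values from common probability density functions

When we make decisions, things can rarely be judged in black and white; they come with complicated factors that influence one another. Here we treat random generation in terms of probabilities.
For example:
As a small retail investor, I read yesterday's news and messages from friends, combine them with the opening market information, and conclude that today's market looks good. I still worry about unexpected events, so to spread the risk I pick from 7 stocks, assign probabilities to rising, staying flat, and falling, and let Python randomly pick the ones that rise.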

stock = ['2330', '2603', '2303', '2317', '2891', '2885', '2884']
# draw one outcome per stock using the estimated probabilities
x = random.choice(['up', 'flat', 'down'], p=[0.6, 0.3, 0.1], size=len(stock))
final = []
for i in range(len(stock)):
    if x[i] == 'up':
        final.append(stock[i])
print(final) # e.g. ['2330', '2303', '2317', '2885'] (purely fictional)
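To see that the p argument really controls how often each label is drawn, here is a rough sketch that draws many times and compares the observed frequencies with the requested probabilities. The English labels, the seed, and the sample size of 10000 are my own assumptions for the illustration.

```python
import numpy as np
from numpy import random

random.seed(0)  # arbitrary seed so the run is repeatable
outcomes = ['up', 'flat', 'down']   # hypothetical labels for this sketch
draws = random.choice(outcomes, p=[0.6, 0.3, 0.1], size=10000)

# count how often each label was drawn and turn the counts into frequencies
labels, counts = np.unique(draws, return_counts=True)
freq = dict(zip(labels.tolist(), (counts / draws.size).tolist()))
print(freq)  # frequencies should land near 0.6, 0.3 and 0.1
```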

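Normal distribution (Gaussian distribution)

Arguably the most important distribution of all: it is used to describe how many kinds of data are spread, and it is an assumption behind many statistical hypothesis tests. Examples: human height, IQ scores.

Key parameters: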
1. loc — (Mean) where the peak of the bell exists.
2. scale — (Standard Deviation) how flat the graph distribution should be.
3. size — The shape of the returned array.

nor = random.normal(loc=0, scale=1, size=(100))
sns.distplot(nor)
plt.show()
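The roles of loc and scale can also be checked numerically. The sketch below uses made-up height-like values (loc=170, scale=10), which are my own assumption, and confirms that the sample mean and standard deviation land near those parameters and that roughly 68% of the samples fall within one standard deviation.

```python
from numpy import random

random.seed(1)  # arbitrary seed for repeatability
heights = random.normal(loc=170, scale=10, size=100000)  # hypothetical values

sample_mean = heights.mean()
sample_std = heights.std()
# fraction of samples within one standard deviation of the mean
within_one_sd = ((heights > 160) & (heights < 180)).mean()

print(f"mean={sample_mean:.1f}, std={sample_std:.1f}, within 1 sd={within_one_sd:.3f}")
```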

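Binomial Distribution

Commonly used to model coin tosses, where each trial has two outcomes. It generalizes to multi-outcome scenarios such as rolling a die, where the probability of each event can be set individually (see the multinomial distribution below).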

Key parameters:
1. n: number of trials.
2. p: probability of occurrence of each trial (e.g. for toss of a coin 0.5 each).
3. size: The shape of the returned array.

Toss a coin 10 times (0 = tails, 1 = heads) and look at the generated values.

x = random.binomial(n=1, p=0.5, size=10)
print(x) #[0 0 1 0 1 0 0 0 0 0]

Alternatively, toss 100 coins each time, sum the 0/1 results, and repeat this 1000 times.

sns.distplot(random.binomial(n=100, p=0.5, size=1000), hist=True, kde=True)
plt.show()

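Now the result gets quite interesting. As we learned in statistics, as the number of trials grows the result approaches a normal distribution, and its expected value (the mean, or the center of the distribution) is n * p (here, 50).

We can also swap in any p we like. Imagine a rigged coin where 1 comes up with probability p = 0.1 and 0 with probability 0.9: the result still approaches a normal distribution, but the center shifts to around n * p = 10.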
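The claim about the center can be checked without a plot. This sketch, with p read as the probability of getting a 1, compares the sample mean and variance against n * p and n * p * (1 - p) for a fair coin and two rigged ones. The seed and the sample size of 100000 are arbitrary choices of mine.

```python
from numpy import random

random.seed(2)  # arbitrary seed for repeatability
n = 100
means = {}
variances = {}
for p in (0.5, 0.1, 0.9):
    # each draw is the number of 1s obtained from n tosses
    sums = random.binomial(n=n, p=p, size=100000)
    means[p] = sums.mean()
    variances[p] = sums.var()
    print(f"p={p}: mean={means[p]:.2f} (n*p={n * p:.0f}), "
          f"var={variances[p]:.2f} (n*p*(1-p)={n * p * (1 - p):.0f})")
```

With p = 0.1 the center sits near 10, and with p = 0.9 it sits near 90.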

But what about rolling a die?

Multinomial Distribution

We can use the pvals parameter to model the probability of each face of a die, from 1 through 6. For example, build the list [1/6, 1/6, 1/6, 1/6, 1/6, 1/6].

x = random.multinomial(n=6, pvals=[1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
print(x) #[0 0 1 1 3 1] index 0 counts face 1, index 1 counts face 2, and so on
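Since the returned array holds counts per face, the counts always add up to n. The sketch below rolls a fair die 6000 times (a number I picked so that each face is expected about 1000 times) and prints the count for each face.

```python
from numpy import random

random.seed(3)  # arbitrary seed for repeatability
counts = random.multinomial(n=6000, pvals=[1/6] * 6)

# index 0 is face 1, index 1 is face 2, and so on
for face, count in enumerate(counts, start=1):
    print(face, count)

total = int(counts.sum())
print(total)  # 6000
```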

Poisson Distribution

Estimates how many times an event will occur within a given time interval.

Key parameters:
1. lam: rate, or known average number of occurrences in the interval (e.g. 2).
2. size: The shape of the returned array.

x = random.poisson(lam=2, size=5)
print(x) #[1 5 1 1 3]
sns.distplot(random.poisson(lam=2, size=1000), kde=False)
plt.show()
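A property worth checking here is that both the mean and the variance of a Poisson sample sit near lam. The sketch also compares the share of intervals with zero events against exp(-lam); the seed and sample size are arbitrary assumptions.

```python
import math
from numpy import random

random.seed(4)  # arbitrary seed for repeatability
lam = 2
events = random.poisson(lam=lam, size=100000)

sample_mean = events.mean()
sample_var = events.var()
p_zero = (events == 0).mean()      # observed share of intervals with no event
expected_p_zero = math.exp(-lam)   # theoretical value, about 0.135

print(f"mean={sample_mean:.2f}, var={sample_var:.2f}, "
      f"P(0)={p_zero:.3f} vs {expected_p_zero:.3f}")
```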

Uniform Distribution

Used when every outcome is equally likely: values are drawn at random from a uniform distribution.

Key parameters:
1. low: lower bound, default 0.
2. high: upper bound, default 1.
3. size: The shape of the returned array.

sns.distplot(random.uniform(size=1000), hist=True)
plt.show()
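The bounds do not have to stay at 0 and 1. Here is a small sketch with bounds I chose for illustration (-1 and 3), checking that every sample stays inside the interval and that the mean lands near the midpoint.

```python
from numpy import random

random.seed(5)  # arbitrary seed for repeatability
low, high = -1.0, 3.0   # hypothetical bounds for this sketch
u = random.uniform(low=low, high=high, size=100000)

inside = bool((u >= low).all() and (u < high).all())
midpoint = (low + high) / 2

print(inside)                          # True
print(f"{u.mean():.2f} vs {midpoint}") # sample mean should be close to 1.0
```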

Logistic Distribution

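Widely used in machine learning and deep learning, for example in logistic regression and neural networks.

Key parameters:
1. loc: mean, where the peak is. Default 0.
2. scale: controls the flatness (spread) of the distribution; the standard deviation is scale * pi / sqrt(3). Default 1.
3. size: The shape of the returned array.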

x = random.logistic(loc=1, scale=2, size=(2, 3))
print(x) # [[1.1680 -0.2599  3.9169][6.3415 4.5419 -0.3042]]
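One detail that is easy to miss: for the logistic distribution the standard deviation equals scale * pi / sqrt(3), not scale itself. A rough numerical check, reusing loc=1 and scale=2 from above with an arbitrary seed and sample size:

```python
import math
from numpy import random

random.seed(6)  # arbitrary seed for repeatability
loc, scale = 1, 2
x = random.logistic(loc=loc, scale=scale, size=200000)

expected_std = scale * math.pi / math.sqrt(3)   # about 3.63 for scale=2
print(f"mean={x.mean():.2f}, std={x.std():.2f}, expected std={expected_std:.2f}")
```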

Exponential Distribution

Used to model the time interval until the next event (such as a success or failure) occurs.

scale — inverse of rate ( see lam in poisson distribution ) defaults to 1.0.
size — The shape of the returned array.

x = random.exponential(scale=2, size=(2, 3))
print(x)  #[[0.4202 2.0437 3.2408] [0.4533 1.4538 1.6089]]
sns.distplot(random.exponential(size=1000), hist=False)
plt.show()
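Because scale is the inverse of the rate, an event that happens lam times per unit of time has an average gap of 1 / lam. The sketch below assumes a rate of 2 (my own choice, matching the Poisson example) and checks that the mean gap comes out near 0.5.

```python
from numpy import random

random.seed(7)  # arbitrary seed for repeatability
lam = 2                      # assumed rate: 2 events per unit of time
gaps = random.exponential(scale=1 / lam, size=100000)

mean_gap = gaps.mean()
print(f"mean gap={mean_gap:.3f}, expected={1 / lam}")
```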

Chi Square Distribution

A distribution used as the basis for hypothesis testing.

df — (degree of freedom).
size — The shape of the returned array.

sns.distplot(random.chisquare(df=1, size=1000), hist=False)
plt.show()
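One way to build intuition for df: a chi-square variable with df degrees of freedom is the sum of df squared standard normal values, so its mean is df. The sketch below compares random.chisquare with that manual construction; df=3, the seed, and the sample size are assumptions for illustration.

```python
from numpy import random

random.seed(8)  # arbitrary seed for repeatability
df = 3
size = 100000

direct = random.chisquare(df=df, size=size)
# sum of df squared standard normal draws per row
built = (random.normal(size=(size, df)) ** 2).sum(axis=1)

print(f"direct mean={direct.mean():.2f}, built mean={built.mean():.2f}, df={df}")
```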

Other special PDFs

Rayleigh Distribution: used in signal processing.

1. scale: decides how flat the distribution will be (default 1.0). It equals the mode of the distribution rather than the standard deviation of the samples.
2. size: The shape of the returned array.

sns.distplot(random.rayleigh(size=1000), hist=False)
plt.show()
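The link to signal processing can be sketched as follows: the magnitude of a two-component signal whose components are independent zero-mean normal values with the same standard deviation follows a Rayleigh distribution with that value as its scale. The scale of 2.0 and the variable names are my own assumptions.

```python
import math
import numpy as np
from numpy import random

random.seed(9)  # arbitrary seed for repeatability
scale = 2.0
size = 100000

r = random.rayleigh(scale=scale, size=size)

# magnitude of two independent normal components with the same spread
x = random.normal(scale=scale, size=size)
y = random.normal(scale=scale, size=size)
magnitude = np.hypot(x, y)

expected_mean = scale * math.sqrt(math.pi / 2)   # theoretical Rayleigh mean
print(f"{r.mean():.2f} {magnitude.mean():.2f} {expected_mean:.2f}")
```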

Pareto Distribution: used to model data that follows the 80/20 rule.

1. a — shape parameter.
2. size — The shape of the returned array.

sns.distplot(random.pareto(a=10, size=1000), kde=True)
plt.show()
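A caveat when reading the plot: random.pareto draws from the Pareto II (Lomax) form, whose values start at 0. Adding 1 and multiplying by a minimum value m gives the classical Pareto form. In the sketch below, m = 1.0 is a hypothetical choice, and the means are compared with 1 / (a - 1) and a * m / (a - 1).

```python
from numpy import random

random.seed(10)  # arbitrary seed for repeatability
a = 10
m = 1.0                                    # hypothetical minimum value
lomax = random.pareto(a=a, size=100000)    # Pareto II (Lomax), starts at 0
classical = (lomax + 1) * m                # classical Pareto, starts at m

print(f"lomax mean={lomax.mean():.3f} vs {1 / (a - 1):.3f}")
print(f"classical mean={classical.mean():.3f} vs {a * m / (a - 1):.3f}")
```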

Zipf Distribution: used for natural language and corpus data. Also known as Zipf's law.

1. a — distribution parameter.
2. size — The shape of the returned array.

x = random.zipf(a=2, size=1000)
# keep only the top ranks (values below 10)
sns.distplot(x[x<10], kde=False)
plt.show()
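To get a feel for how top-heavy the Zipf distribution is, this sketch measures the share of samples equal to 1. For a = 2 the theoretical share is 6 / pi^2, about 0.61, because the probability of rank k is proportional to k^(-a). The seed and sample size are arbitrary assumptions.

```python
import math
from numpy import random

random.seed(11)  # arbitrary seed for repeatability
x = random.zipf(a=2, size=100000)

share_of_one = (x == 1).mean()
expected_share = 6 / math.pi ** 2   # 1 / zeta(2), about 0.608

print(f"share of rank 1: {share_of_one:.3f} vs {expected_share:.3f}")
print(x.min())  # 1, ranks start at 1
```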

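Thanks for reading this far. This post also covers the basics;
it is by accumulating knowledge like this that complete applications come together.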