About python, pandas, numpy, etc…

Just a recap and notes about the forgettable things I learn. Most of the materials I use are from the internet, and I don’t own the copyright to them.

1. random and seed

import numpy as np

np.random.seed(0)

Here 0 is the seed integer used to initialize the random number generator. The same sequence of random numbers will be generated with the same seed integer. If we don’t set the seed, the random numbers will be generated from an unpredictable source such as the system time.
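For example, re-seeding reproduces exactly the same draws:

np.random.seed(0)
print(np.random.rand(3))  # [0.5488135  0.71518937 0.60276338]
np.random.seed(0)
print(np.random.rand(3))  # identical to the first call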

2. scale VS normalization

Scaling doesn’t change the shape of the data distribution; it only modifies the range of the data.

Normalization, in contrast, changes the shape of the distribution so that it approximates a normal distribution.
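A minimal sketch of the difference, assuming scipy is available (Box-Cox is one common normalization):

import numpy as np
from scipy import stats
data = np.random.exponential(size=1000) + 0.01  # strictly positive, as boxcox requires
scaled = (data - data.min()) / (data.max() - data.min())  # same shape, new range [0, 1]
normalized, _ = stats.boxcox(data)                        # reshaped toward a normal distribution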

3. list vs numpy.array

1> numpy.array can only contain one data type

2> [1, 2, 3] + [1, 2, 3] => [1, 2, 3, 1, 2, 3]

array([1, 2, 3]) + array([1, 2, 3]) => array([2, 4, 6])
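A runnable check of the two behaviours:

import numpy as np
print([1, 2, 3] + [1, 2, 3])                      # list concatenation: [1, 2, 3, 1, 2, 3]
print(np.array([1, 2, 3]) + np.array([1, 2, 3]))  # element-wise addition: [2 4 6]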

4. draw the graph with matplotlib

import matplotlib.pyplot as plt

line graph:

plt.plot(x, y)

scatter graph:

plt.scatter(x, y)

histogram:

plt.hist(x[, bins=10, …])
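A minimal end-to-end sketch on made-up data:

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 50)
y = np.sin(x)
plt.plot(x, y)        # line graph
plt.show()
plt.scatter(x, y)     # scatter graph
plt.show()
plt.hist(y, bins=10)  # histogram
plt.show()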

5. row & column access in pandas
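A quick sketch of the usual access patterns (data made up):

import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
df['a']           # column by name
df.loc['x']       # row by label
df.iloc[0]        # row by integer position
df.loc['x', 'b']  # single cell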

6. local variable vs global variable

Inside a function, Python searches for a local variable first, then falls back to the global one (which it can read but not reassign).

If we want to alter the value of a global variable, we should declare “global x” (where x is the name of the global variable) inside the function.
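For example:

x = 10
def read_only():
    print(x)   # reads the global x
def modify():
    global x   # required before reassigning the global
    x = 20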

7. import data

1> just read the file

e.g.: file = open('path/to/file', mode='r'); text = file.read(); file.close()

2> use numpy to import the data as array

e.g.: data = np.loadtxt('path/to/file.txt', delimiter=','[, skiprows=1, usecols=[1, 3], dtype=str])

We can also use genfromtxt() in numpy to import a txt file; it’s quite similar to loadtxt(), but if we set dtype=None, genfromtxt() will figure out what type each column should be. The resulting data is a structured array: a 1D array in which each element is a tuple containing all the data of one row.

Also, we can use recfromcsv() in numpy to import a csv file, which is quite similar to genfromtxt(); it has the defaults delimiter=',', names=True, and dtype=None.
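A small sketch of the auto-typing behaviour, assuming a 'data.csv' with a header row:

import numpy as np
data = np.genfromtxt('data.csv', delimiter=',', names=True, dtype=None, encoding=None)
print(data.dtype.names)  # column names taken from the header
print(data[0])           # first row, as a tuple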

3> use pandas to import data as DataFrame

e.g.: data = pd.read_csv('path/to/file.csv')

for other file types (not flat files):

1<< pickle files (for serialized Python objects)

import pickle

with open('pickled_fruit.pkl', 'rb') as file:
    data = pickle.load(file)

2<< excel files

data = pd.ExcelFile('path/to/excel/file.xlsx')
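The ExcelFile object is a handle to the workbook; sheets are then parsed into dataframes (file name made up):

xls = pd.ExcelFile('urbanpop.xlsx')
print(xls.sheet_names)  # available sheets
df = xls.parse(0)       # first sheet, by position (or pass a sheet name)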

3<< SAS files

from sas7bdat import SAS7BDAT

with SAS7BDAT('urbanpop.sas7bdat') as file:

df_sas = file.to_data_frame()

4<< Stata files

data = pd.read_stata('urbanpop.dta')

5<< HDF5 files
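A minimal sketch with the h5py library (file name made up):

import h5py
data = h5py.File('file.hdf5', 'r')
print(list(data.keys()))  # top-level groups in the file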

6<< matlab files
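A minimal sketch with scipy (file name made up):

import scipy.io
mat = scipy.io.loadmat('workspace.mat')  # dict: MATLAB variable names -> numpy arrays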

4> relational database
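A sketch with sqlalchemy (1.x style; database and table names are made up):

from sqlalchemy import create_engine
engine = create_engine('sqlite:///example.sqlite')
with engine.connect() as con:
    rs = con.execute('SELECT * FROM orders')
    rows = rs.fetchall()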

pandas way to do the same thing:
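e.g., reusing the engine from above:

import pandas as pd
df = pd.read_sql_query('SELECT * FROM orders', engine)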

5> import data from web

1<< automate file download
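A sketch with urllib (URL made up):

from urllib.request import urlretrieve
urlretrieve('https://example.com/winequality.csv', 'winequality.csv')  # saves a local copy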

Or we can also use pandas to download flat files directly:

e.g.: df = pd.read_csv(url_of_csv_file[, sep=';'])

We can also use the “BeautifulSoup” library to extract information from HTML text.
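A sketch (URL made up):

import requests
from bs4 import BeautifulSoup
r = requests.get('https://example.com')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title.get_text())                         # page title
print([a.get('href') for a in soup.find_all('a')])   # all hyperlinks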

2<< load json file
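e.g. (file name made up):

import json
with open('data.json') as file:
    data = json.load(file)  # a dict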

3<< connect to api in python
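A sketch against a hypothetical endpoint:

import requests
r = requests.get('https://api.example.com/items?q=python')
json_data = r.json()  # response decoded into a dict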

4<< use twitter api (a sketch follows these steps)

Authentication handler

Define stream listener class

Stream tweets
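A hedged sketch of those three steps, assuming the tweepy 3.x API and placeholder credentials:

import tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)  # authentication handler; keys are placeholders
auth.set_access_token(access_token, access_token_secret)
class MyStreamListener(tweepy.StreamListener):             # stream listener class
    def on_status(self, status):
        print(status.text)
stream = tweepy.Stream(auth, MyStreamListener())           # stream tweets
stream.filter(track=['python'])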

8. count the values in a dataframe column (including the missing/NaN ones)
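e.g. (column name made up):

df['continent'].value_counts(dropna=False)  # counts NaN as its own row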

9. draw the histogram and box plot from the column data in a dataframe
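e.g. (column names made up):

df['population'].plot(kind='hist')
df.boxplot(column='population', by='continent')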

10. to tidy data, we can use pd.melt() to reorganize the structure of a dataframe

If value_vars is not given, it will melt all columns except the ones specified in id_vars.
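A sketch (column names made up):

pd.melt(frame=df, id_vars='name', value_vars=['treatment a', 'treatment b'],
        var_name='treatment', value_name='result')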

11. to better report data, we can also use dataframe.pivot() to unmelt the data

However, if there are duplicate rows with the same index value and pivot value, it will raise an exception. In this situation, we can use df.pivot_table() instead.
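A sketch (column names made up):

df.pivot(index='date', columns='element', values='value')
df.pivot_table(index='date', columns='element', values='value', aggfunc='mean')  # tolerates duplicates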

12. To concatenate data, we can use pd.concat(), which will not change the original indexes

Or we can pass the parameter ignore_index=True to rebuild the index.
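e.g.:

pd.concat([df1, df2])                     # keeps each frame's original index
pd.concat([df1, df2], ignore_index=True)  # renumbers 0..n-1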

13. we can use globbing to find all the files with matching file names
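A sketch:

import glob
import pandas as pd
csv_files = glob.glob('*.csv')  # all csv files in the current directory
frames = [pd.read_csv(f) for f in csv_files]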

14. we can also merge data (which is similar to join in SQL)

If the left df and right df use the same column name as the key, we can specify it with on='corresponding_key_name'; otherwise we use left_on and right_on.
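e.g. (column names made up):

pd.merge(left=df1, right=df2, on='city')
pd.merge(left=df1, right=df2, left_on='city', right_on='town')  # different key names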

15. we can use the astype() method to convert data types

we can also use pd.to_numeric() with the parameter errors='coerce' to convert data to a numeric type; all invalid values will become NaN.
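e.g. (column names made up):

df['sex'] = df['sex'].astype('category')
df['total'] = pd.to_numeric(df['total'], errors='coerce')  # bad entries -> NaN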

16. use regular expression to match data
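A sketch: matching dollar amounts like '$17.89':

import re
pattern = re.compile(r'^\$\d*\.\d{2}$')
print(bool(pattern.match('$17.89')))  # True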

17. we can use df.apply() to apply a function across a dataframe
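e.g.:

import numpy as np
df.apply(np.mean, axis=0)  # column-wise
df.apply(np.mean, axis=1)  # row-wise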

18. drop the duplicated rows from a dataframe
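e.g.:

df = df.drop_duplicates()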

19. deal with missing data

fill missing values
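e.g. (column name made up):

df = df.dropna()                                # drop rows with missing values
df['col'] = df['col'].fillna(df['col'].mean())  # or fill them in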

20. test with asserts
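e.g., assuming an all-numeric dataframe:

assert df.notnull().all().all()  # no missing values anywhere
assert (df >= 0).all().all()     # no negative values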

21. recap a little

22. index and columns

23. how to build DataFrame

1> from csv files

2> from dictionary
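e.g.:

df = pd.read_csv('data.csv')                              # from a csv file
df = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})  # from a dictionary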

24. to plot the graph

1> plt.plot(data)

2> df.plot(kind='line')

25. fix the scale of the graph, customize and save it
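e.g.:

plt.yscale('log')        # change the axis scale
plt.xlabel('year')       # customize labels
plt.title('my plot')
plt.savefig('plot.png')  # save to disk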

26. to plot the PDF (probability density function) and CDF (cumulative distribution function)
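One way is via normalized histograms (column name made up; the extra keywords are passed through to matplotlib):

df['value'].plot(kind='hist', bins=30, density=True)                   # PDF
df['value'].plot(kind='hist', bins=30, density=True, cumulative=True)  # CDF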

27. to make multiple columns the index with set_index(), sort it with sort_index(), and slice it with slice()
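e.g. (column names made up):

df = df.set_index(['state', 'month']).sort_index()
df.loc[(slice(None), 1), :]  # every state, month == 1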

28. use different columns to reshape the dataframe with pivot()

If “values” is not specified, all the remaining columns will be included in the result.

If the index/column pair contains duplicate values, it will raise an error.

But we can use pivot_table() instead, which will aggregate (by default, take the mean of) the values sharing the same index; we can use “aggfunc” to specify a different aggregation function.

29. unstack the multi-level index with unstack() and swap the stack levels with swaplevel()

We can pass either a level number or a level name to the “level” parameter.
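e.g., on the multi-indexed df from above:

df.unstack(level='month')  # or level=1
df.swaplevel(0, 1)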

30. we can transform the dataframe back into its original state with melt()

31. Storing discrete values as the ‘category’ dtype rather than ‘object’ can help (see the sketch below):

1> improve the computation speed

2> save memory
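e.g. (column name made up):

df['sex'] = df['sex'].astype('category')
df.info()  # note the reduced memory usage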

32. multiple aggregation with groupby() and agg()
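e.g. (column names made up):

df.groupby('sex')['fare'].agg(['max', 'median'])       # several aggregations of one column
df.groupby('sex').agg({'fare': 'mean', 'age': 'max'})  # a different aggregation per column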

33. after groupby(), when to use transform()/filter()/agg()/apply()

If you want to get a single value for each group, use aggregate() (or one of its shortcuts).

If you want to get a subset of the original rows, use filter().

And if you want to get a new value for each original row, use transform(). [However, if the function passed to transform() involves multiple columns, it will raise an error.]

apply() is quite similar to transform(), but it operates on each group’s whole dataframe at once rather than on one series (column) at a time, so it can combine multiple columns.
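A sketch of the four, with made-up columns:

g = df.groupby('sex')
g['age'].agg('mean')                            # one value per group
g.filter(lambda d: len(d) > 10)                 # keep only large groups, original rows
g['age'].transform(lambda s: s - s.mean())      # one new value per original row
g.apply(lambda d: d['fare'] / d['age'].mean())  # sees the whole group dataframe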

34. import data

pd.read_csv()

pd.read_excel()

pd.read_json()

pd.read_html()

we can use a for loop or a list comprehension to import data from multiple files

35. indexes VS indices

36. the percentage changes of a dataframe with pct_change()
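e.g. (column name made up):

df['close'].pct_change() * 100  # percent change from the previous row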

37. to combine data (a combined sketch follows this list)

1> stack 2 dataframes vertically with append()

2> stack dataframes either row-wise or column-wise

concat() VS append()

3> to stack and concat numpy arrays

1<< horizontally

2<< vertically
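A combined sketch:

pd.concat([df1, df2])          # row-wise; append() does the same but is deprecated in newer pandas
pd.concat([df1, df2], axis=1)  # column-wise
np.hstack([a, b])              # numpy arrays, horizontally
np.vstack([a, b])              # numpy arrays, vertically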

38. join (inner, outer) dataframes
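e.g.:

pd.concat([df1, df2], axis=1, join='inner')  # keep only shared index labels
df1.join(df2, how='outer')                   # join on the index (assumes disjoint column names)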

39. merge dataframes with merge()

40. connect to database with sqlalchemy

41. reflection reads the database and builds a sqlalchemy Table object
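A sketch (sqlalchemy 1.x style; database and table names made up):

from sqlalchemy import MetaData, Table, create_engine
engine = create_engine('sqlite:///example.sqlite')
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
print(repr(census))  # the reflected columns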

42. to execute the sql queries (a combined sketch follows this list)

In a more pythonic way:

case()

cast()

select_from(), join()

self join:

build a table

to build a table with constraints

insert data into a table

to update a table

to delete data in a table

to drop a table [ the metadata object still keeps this table’s definition until python is restarted ]

to drop all the tables
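A hedged sketch of these operations, reusing the reflected census table above (sqlalchemy 1.x style; column names made up):

from sqlalchemy import select, insert, update, delete
with engine.connect() as con:
    stmt = select([census]).where(census.columns.state == 'New York')  # the pythonic query style
    results = con.execute(stmt).fetchall()
    con.execute(insert(census), [{'state': 'CA', 'pop2000': 100}])     # insert rows
    con.execute(update(census).where(census.columns.state == 'CA').values(pop2000=200))
    con.execute(delete(census).where(census.columns.state == 'CA'))
census.drop(engine)        # drop one table
metadata.drop_all(engine)  # drop all tables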

43. to draw the graph with annotate()
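e.g.:

plt.plot(x, y)
plt.annotate('peak', xy=(2, 1), xytext=(3, 1.5), arrowprops={'arrowstyle': '->'})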

44. create a 2D numpy array with meshgrid() and fill the color with pcolor()
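e.g.:

u = np.linspace(-2, 2, 41)
v = np.linspace(-1, 1, 21)
X, Y = np.meshgrid(u, v)  # 2D coordinate grids
Z = X**2 + Y**2
plt.pcolor(X, Y, Z, cmap='Blues')
plt.colorbar()
plt.show()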

45. different kinds of plot (a combined seaborn sketch follows this list)

1> strip plot

Grouped strip plot

spreading out strip plot

2> swarm plot

more groupings:

changing orientation

3> violin plot

4> box plot

5> combining plots

6> joint plot

joint plot using KDE

7> pair plot

pair plot with “hue”

8> heat map
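A combined sketch using seaborn’s built-in ‘tips’ dataset:

import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
sns.stripplot(x='day', y='total_bill', data=tips, jitter=True)  # spread-out strip plot
plt.show()
sns.swarmplot(x='day', y='total_bill', data=tips, hue='sex')    # grouped swarm plot
plt.show()
sns.violinplot(x='day', y='total_bill', data=tips)              # violin plot
plt.show()
sns.jointplot(x='total_bill', y='tip', data=tips, kind='kde')   # joint plot using KDE
plt.show()
sns.pairplot(tips, hue='sex')                                   # pair plot with hue
plt.show()
sns.heatmap(tips[['total_bill', 'tip', 'size']].corr())         # heat map of correlations
plt.show()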

46. plot time series slice
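e.g., assuming a datetime-indexed dataframe with a made-up column:

df.loc['2010-01':'2010-03', 'Temperature'].plot()  # partial-string date slicing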

47. selecting & formatting dates

cleaning up ticks on axis
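A sketch for tick cleanup (column name made up):

import matplotlib.dates as mdates
ax = df['Temperature'].plot()
ax.xaxis.set_major_locator(mdates.MonthLocator())            # one tick per month
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))  # e.g. 'Jan 2010'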

48. time series with moving windows
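e.g.:

df['Temperature'].rolling(window=24).mean().plot()  # 24-point moving average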

49. image histogram

rescale image

image histogram & CDF

equalizing intensity values
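A hedged sketch tying these steps together (file name made up; jpg reading assumes Pillow is installed):

import numpy as np
import matplotlib.pyplot as plt
image = plt.imread('photo.jpg')  # uint8 RGB, values 0..255
gray = image.mean(axis=2)        # collapse to grayscale
pixels = gray.flatten()
plt.hist(pixels, bins=64, range=(0, 256), density=True)  # image histogram
plt.show()
# rescale intensities to the full 0..255 range
rescaled = 255 * (gray - gray.min()) / (gray.max() - gray.min())
# equalize intensity values by mapping each pixel through the CDF
counts, _ = np.histogram(pixels, bins=256, range=(0, 256))
cdf = counts.cumsum() / pixels.size
equalized = 255 * cdf[pixels.astype(int)]
plt.imshow(equalized.reshape(gray.shape), cmap='gray')
plt.show()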