Indexing in Python for Data Scientists (NumPy, Pandas and Python native)

Atul kumar
Sep 2, 2018 · 7 min read

The following topics are covered. Explanations are kept to a minimum; this is meant to be a ready reckoner.

  1. Python native indexing
  2. Python List Indexing
  3. NumPy Array indexing
  4. Pandas series indexing
  5. Pandas Dataframe indexing

Python List Indexing

With due recognition to https://www.i-programmer.info/programming/python/3942-arrays-in-python.html

myList=[1,2,3,4,5,6] — Python list
myList[2] — 3rd element
myList[2:5] — 3rd Element to 5th element
myList[5:] — 6th element to last element
myList[:5] — 1st element to 5th element
myList[:] — All elements
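myList[-2:] — last two elements: [5, 6]
myList[-3:-1] — third-last up to (but not including) the last element: [4, 5]
myList[0:3:2] — every second element of the first three: [1, 3]
myList[::2] — every second element of the whole list: [1, 3, 5]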
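for e in myList: # loop over the elements
for i in range(len(myList)): # loop over the indices
for i in range(10):
    myList.append(1) # append 1 ten times
myList_squared=[i*i for i in range(10)] # list comprehension: squares of 0..9
newList=[myList[i] for i in range(1,3)] # 2nd and 3rd elements
newList=myList[1:3] # same result using a slice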
myArray=[[1,2],[3,4]] — two dimensional arrays
myArray[i][j] — ith row and jth column
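for i in range(len(myArray)):
    for j in range(len(myArray[i])): # visit every cell by index as myArray[i][j]
for row in myArray:
    for e in row: # visit every element directly
myArray=[[0 for j in range(3)] for i in range(3)] # 3x3 nested list of zeros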
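The slicing entries above can be checked with a short sketch. The names `my_list` and `grid` are illustrative only and stand in for the `myList` and `myArray` used in the cheat sheet.

```python
# Assumed sample list, matching the one used in the cheat sheet above.
my_list = [1, 2, 3, 4, 5, 6]

third = my_list[2]          # single element, 0-based
middle = my_list[2:5]       # start included, stop excluded
last_two = my_list[-2:]     # negative indices count from the end
every_other = my_list[::2]  # step of 2

# A slice returns a new list, so changing it leaves the original alone.
copy_of_list = my_list[:]
copy_of_list[0] = 99

# Nested lists are indexed one level at a time: grid[row][column].
grid = [[0 for _ in range(3)] for _ in range(3)]
grid[1][2] = 7

print(third, middle, last_two, every_other)
print(my_list[0], copy_of_list[0], grid)
```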

NumPy Array indexing

myArray=np.ndarray((3,3)) # uninitialized 3x3 array
np.arange(start,end,increment) # evenly spaced values; end is excluded
myArray=np.array([[1,2,3],[4,5,6],[7,8,9]])
myArray[0:2] — array([[1, 2, 3],[4, 5, 6]])
myArray[0:2][0:2] — array([[1, 2, 3],[4, 5, 6]]); this is not the 2x2 sub matrix in the top left hand corner, because the second [0:2] slices the rows again
myArray[1,2] — 6
myArray[0:2,0:2] — array([[1, 2],[4, 5]]) 2x2 sub matrix in the top left hand corner
myArray[0:2] — myArray[0:2,:]
bigArray[...,0] — bigArray[:,:,:,:,0]
myArray[0,:] — array([1, 2, 3])
myArray[0:1,:] — array([[1, 2, 3]])
Using an integer i returns an array with one fewer dimension than using the slice [i:i+1], which returns the same elements.
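A minimal sketch of the two points above: an integer index drops an axis while a length-one slice keeps it, and chained slices behave differently from a single tuple index. The variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_as_1d = a[0, :]    # integer index: the row axis is dropped
row_as_2d = a[0:1, :]  # length-one slice: the row axis is kept

# Chained slices both act on the rows; a tuple index slices rows and columns.
chained = a[0:2][0:2]
top_left = a[0:2, 0:2]

print(row_as_1d.shape, row_as_2d.shape)
print(chained.shape, top_left.shape)
```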
myArray[[0,2],[1,2]] — myArray[0,1] and myArray[2,2] which is array([2, 9])
x = np.arange(10) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x[2] — 2
x[-2] — 8
x[2:5] — array([2, 3, 4])
x[:-7] — array([0, 1, 2])
x[1:7:2] — array([1, 3, 5])
x.shape = (2,5) # now x is the 2-dimensional array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]])
x[1,3] — 8
x[1,-1] — 9
x[0] — array([0, 1, 2, 3, 4])
x[0][2] — 2
y = np.arange(35).reshape(5,7)
y[1:5:2, ::3] — array([[ 7, 10, 13], [21, 24, 27]])

Slices of arrays do not copy the internal array data but produce new views.
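The view behaviour can be seen in a small sketch. The contrast with index arrays, which return copies, is standard NumPy behaviour but is added here for illustration and is not stated in the line above.

```python
import numpy as np

x = np.arange(10)

view = x[2:5]  # basic slice: a view onto x's data
view[0] = 100  # the write is visible through x as well

fancy = x[np.array([0, 1])]  # index array: a copy
fancy[0] = -1                # x is not affected

independent = x[2:5].copy()  # explicit copy when a view is not wanted
independent[1] = -5

print(x)
print(np.shares_memory(x, view), np.shares_memory(x, fancy))
```

Calling `.copy()` on a slice is the usual way to get data that can be modified without touching the original array.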

x = np.arange(10,1,-1) — array([10, 9, 8, 7, 6, 5, 4, 3, 2])
x[np.array([3, 3, 1, 8])] — array([7, 7, 9, 2])
x[np.array([3,3,-3,8])] — array([7, 7, 4, 2])
x[np.array([[1,1],[2,3]])] — array([[9, 9], [8, 7]])
What is returned when index arrays are used is an array with the same shape as the index array, but with the type and values of the array being indexed.
y[np.array([0,2,4]), np.array([0,1,2])] — array([ 0, 15, 30])
y[np.array([0,2,4]), 1] — array([ 1, 15, 29])
y[np.array([0,2,4])] — array([[ 0, 1, 2, 3, 4, 5, 6],
[14, 15, 16, 17, 18, 19, 20],
[28, 29, 30, 31, 32, 33, 34]])

Shape of the resultant array will be the concatenation of the shape of the index array (or the shape that all the index arrays were broadcast to) with the shape of any unused dimensions (those not indexed) in the array being indexed.
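The shape rule above can be checked on the same `y` array. This is only a sketch; `rows` and `idx2d` are illustrative names.

```python
import numpy as np

y = np.arange(35).reshape(5, 7)
rows = np.array([0, 2, 4])

# Index shape (3,) followed by the unused column dimension (7,) gives (3, 7).
picked_rows = y[rows]

# A 2-D index array of shape (2, 2) on the first axis gives (2, 2, 7).
idx2d = np.array([[0, 1], [2, 3]])
picked_2d = y[idx2d]

# Both axes indexed: the index arrays are paired element by element.
pairs = y[rows, np.array([0, 1, 2])]

# The scalar 1 is broadcast against rows, so the result again has shape (3,).
column_broadcast = y[rows, 1]

print(picked_rows.shape, picked_2d.shape, pairs, column_broadcast)
```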

b = y>20 # boolean array with the same shape as y
y[b] — array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34])
b[:,5] # use a 1-D boolean whose first dim agrees with the first dim of y
array([False, False, False, True, True])
y[b[:,5]] — array([[21, 22, 23, 24, 25, 26, 27], [28, 29, 30, 31, 32, 33, 34]])
y[np.array([0,2,4]),1:3] — array([[ 1, 2], [15, 16], [29, 30]])
The slice is converted to an index array np.array([[1,2]]) (shape (1,2)) that is broadcast with the index array to produce a resultant array of shape (3,2).
y[b[:,5],1:3] — array([[22, 23], [29, 30]])
z[[1,1,1,1]] # with z = np.arange(81).reshape(3,3,3,3), a list index produces a large array: array([[[[27, 28, 29], [30, 31, 32], ...
z[(1,1,1,1)] # a tuple index returns a single value: 40
y.shape — (5, 7)
y[:,np.newaxis,:].shape — (5, 1, 7)
x = np.arange(5)
x[:,np.newaxis] + x[np.newaxis,:]
array([[0, 1, 2, 3, 4],[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7], [4, 5, 6, 7, 8]])
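The boolean-mask and np.newaxis entries above can be reproduced with the following sketch. The names `row_mask` and `table` are illustrative.

```python
import numpy as np

y = np.arange(35).reshape(5, 7)

b = y > 20         # boolean array with the same shape as y
flat = y[b]        # full-shape mask: result is 1-D
row_mask = b[:, 5] # 1-D mask over the rows
masked_rows = y[row_mask]
masked_block = y[row_mask, 1:3]  # mask on rows combined with a column slice

x = np.arange(5)
column = x[:, np.newaxis]  # shape (5, 1)
row = x[np.newaxis, :]     # shape (1, 5)
table = column + row       # broadcast to (5, 5); table[i, j] == i + j

print(flat.shape, masked_rows.shape, masked_block)
print(table)
```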

z = np.arange(81).reshape(3,3,3,3)
z[1,...,2] — z[1,:,:,2]
array([[29, 32, 35], [38, 41, 44], [47, 50, 53]])
x = np.arange(10)
x[2:7] = 1
x[2:7] = np.arange(5)
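A short sketch of the last few entries: `...` (Ellipsis) stands for as many full slices as are needed, and a slice can be assigned either a scalar or an array of matching length. The name `after_scalar` is illustrative.

```python
import numpy as np

z = np.arange(81).reshape(3, 3, 3, 3)
with_ellipsis = z[1, ..., 2]  # Ellipsis expands to the missing full slices
explicit = z[1, :, :, 2]

x = np.arange(10)
x[2:7] = 1             # a scalar is broadcast over the slice
after_scalar = x.copy()
x[2:7] = np.arange(5)  # an array must match the slice length

print(np.array_equal(with_ellipsis, explicit))
print(after_scalar, x)
```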

Pandas Series/Dataframe indexing

Pandas supports three types of multi-axis indexing.
.loc is primarily label based, but may also be used with a boolean array. Allowed inputs are:

  • A single label, e.g. 5 or 'a'
  • A list or array of labels ['a', 'b', 'c'].
  • A slice object with labels 'a':'f' (both the start and the stop are included)
  • A boolean array
  • A callable function with one argument (the calling Series, DataFrame or Panel). Returns valid output for indexing (one of the above).

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. Allowed inputs are:

  • An integer e.g. 5.
  • A list or array of integers [4, 3, 0].
  • A slice object with ints 1:7.
  • A boolean array.
  • A callable function with one argument (the calling Series, DataFrame or Panel) Returns valid output for indexing (one of the above).

.loc, .iloc, and also [] indexing can accept a callable as indexer.
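A minimal sketch of callable indexers, using a small made-up DataFrame. In each case the callable receives the object being indexed and returns something that is itself a valid indexer.

```python
import pandas as pd

# Illustrative data, not taken from the article.
df = pd.DataFrame({"A": [1, -2, 3], "B": [10, 20, 30]}, index=list("abc"))

# .loc: the callable returns a boolean Series.
positive_a = df.loc[lambda d: d["A"] > 0]

# .iloc: the callable returns a list of integer positions.
first_two = df.iloc[lambda d: [0, 1]]

# []: the callable returns a column label.
first_col = df[lambda d: d.columns[0]]

print(positive_a)
print(first_two)
print(first_col)
```

Callables are mostly useful in method chains, where the intermediate object has no name to refer to.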

Return value types when indexing pandas objects with [] are as follows:

Object Type | Selection | Return Value Type

  • Series | series[label] | scalar value
  • DataFrame | frame[colname] | Series corresponding to colname
  • Panel| panel[itemname] | DataFrame corresponding to the itemname

Getting values from an object with multi-axes selection uses the following notation (.loc as well as .iloc). Any of the axes accessors may be the null slice :. Axes left out of the specification are assumed to be :, e.g. p.loc['a'] is equivalent to p.loc['a', :, :].

Object Type | Indexers

  • Series | s.loc[indexer]
  • DataFrame | df.loc[row_indexer,column_indexer]
  • Panel | p.loc[item_indexer,major_indexer,minor_indexer]
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
panel = pd.Panel({'one' : df, 'two' : df - df.mean()}) # Panel was removed in pandas 0.25, so this line runs only on older versions
s = df['A']
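s[dates[5]] - scalar value at that date, e.g. -0.67368970808837059
df[['B', 'A']] = df[['A', 'B']] - swaps the values of columns A and B
df.loc[:,['B', 'A']] = df[['A', 'B']].values - swaps the columns through .loc; the raw values are required
pandas aligns all AXES when setting Series and DataFrame from .loc, and .iloc, so df.loc[:,['B', 'A']] = df[['A', 'B']] leaves df unchanged: the columns are aligned by label before the values are assigned. The correct way to swap column values is by using raw values.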
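The column-swap entries above can be compared side by side. This is a sketch on a tiny made-up DataFrame; `aligned`, `swapped` and `direct` are illustrative names.

```python
import pandas as pd

# Illustrative data, not taken from the article.
df = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# Assigning a DataFrame through .loc aligns on column labels first,
# so A receives A and B receives B: nothing changes.
aligned = df.copy()
aligned.loc[:, ["B", "A"]] = aligned[["A", "B"]]

# Raw values carry no labels, so they are assigned by position: a real swap.
swapped = df.copy()
swapped.loc[:, ["B", "A"]] = swapped[["A", "B"]].values

# Plain [] assignment with a list of columns also swaps.
direct = df.copy()
direct[["B", "A"]] = direct[["A", "B"]]

print(aligned)
print(swapped)
print(direct)
```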
Access an index on a Series, a column on a DataFrame, and an item on a Panel directly as an attribute:
sa = pd.Series([1,2,3],index=list('abc'))
sa.b - 2
df.A - Gets column A
panel.one - Gets the item 'one' (the first DataFrame in the panel)
dfa = df.copy()
dfa.A = list(range(len(dfa.index))) # ok if A already exists
dfa['A'] = list(range(len(dfa.index))) # use this form to create a new column
Attribute access is unavailable when the label is not a valid Python identifier or clashes with an existing method or attribute name. Standard indexing still works in those cases: s['1'], s['min'], and s['index'] will access the corresponding element or column.
Assign a dict to a row of a DataFrame:
x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
x.iloc[1] = dict(x=9, y=99) - Row 1 becomes (9, 99) now
Attribute access can modify an existing element of a Series or column of a DataFrame, but using it to create a new column creates a new attribute rather than a new column.
s[:5] - Top 5 rows
s[::2] - Alternate rows
s[::-1] - Reverse
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.
.loc is strict when slicers are not compatible with (or convertible to) the index type. For example, using integers with a DatetimeIndex raises a TypeError.
dfl = pd.DataFrame(np.random.randn(5,4), columns=list('ABCD'), index=pd.date_range('20130101',periods=5))
String-like values in a slice can be converted to the type of the index, which leads to natural slicing.
dfl.loc['20130102':'20130104']
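A sketch of the string slicing above, with deterministic values substituted for the random ones so the result is reproducible. The names `window` and `explicit` are illustrative.

```python
import numpy as np
import pandas as pd

# Same layout as dfl above, but with fixed values instead of random ones.
dfl = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    columns=list("ABCD"),
    index=pd.date_range("20130101", periods=5),
)

# The strings are converted to timestamps; both end points are included.
window = dfl.loc["20130102":"20130104"]

# The same selection written with explicit Timestamp labels.
explicit = dfl.loc[pd.Timestamp("2013-01-02"):pd.Timestamp("2013-01-04")]

print(window)
```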

Purely label based indexing.

A strict inclusion based protocol. Every label asked for must be in the index. When slicing, both the start bound AND the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.

The .loc attribute is the primary access method.

  • A single label, e.g. 5 or 'a'
  • A list or array of labels ['a', 'b', 'c']
  • A slice object with labels 'a':'f' (both the start and the stop are included).
  • A boolean array.
  • A callable, see Selection By Callable.
s1 = pd.Series(np.random.randn(6),index=list('abcdef'))
s1.loc['c':]
s1.loc['b']
s1.loc['c':] = 0
df1 = pd.DataFrame(np.random.randn(6,4),index=list('abcdef'), columns=list('ABCD'))
df1.loc[['a', 'b', 'd'], :] - Selected rows
df1.loc['d':, 'A':'C'] - Selected columns and rows
df1.loc['a'] - Row
df1.loc['a'] > 0 - Boolean Series with one entry per column
df1.loc[:, df1.loc['a'] > 0] - Columns where row 'a' is positive (boolean indexing)
df1.loc['a', 'A'] - cell
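The .loc entries above can be tried on deterministic data. This sketch replaces the random values with a fixed range so the results are predictable; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones; row 'a' is [-2, -1, 0, 1].
df1 = pd.DataFrame(
    np.arange(24).reshape(6, 4) - 2,
    index=list("abcdef"),
    columns=list("ABCD"),
)

selected_rows = df1.loc[["a", "b", "d"], :]  # list of labels
block = df1.loc["d":, "A":"C"]               # label slices include both ends
row_a = df1.loc["a"]                         # one row as a Series
mask = row_a > 0                             # boolean Series over the columns
positive_cols = df1.loc[:, mask]             # keep columns where row 'a' > 0
cell = df1.loc["a", "A"]                     # single value

print(block)
print(positive_cols.columns.tolist(), cell)
```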

When using .loc with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned:

s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s.loc[3:5]

If at least one of the two is absent, but the index is sorted, and can be compared against start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two:

s.sort_index().loc[1:6]
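A sketch of both cases, using the same Series as above. With an unsorted index, the slice runs from the position of the start label to the position of the stop label; after sorting, labels that are absent from the index can still be used as bounds.

```python
import pandas as pd

s = pd.Series(list("abcde"), index=[0, 3, 2, 5, 4])

# Both labels exist: everything located between them, in index order.
between = s.loc[3:5]

# Labels 1 and 6 are absent, but the sorted index can be compared with them.
ranked = s.sort_index().loc[1:6]

print(between)
print(ranked)
```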

Purely integer-based indexing. The semantics closely follow Python and NumPy slicing: indexing is 0-based, and when slicing, the start bound is included while the upper bound is excluded. Non-integer indexers are not accepted, even if they are valid labels.

The .iloc attribute is the primary access method. The following are valid inputs:

  • An integer e.g. 5.
  • A list or array of integers [4, 3, 0].
  • A slice object with ints 1:7.
  • A boolean array.
  • A callable, see Selection By Callable.
s1 = pd.Series(np.random.randn(5), index=list(range(0,10,2)))
s1.iloc[:3] - First 3 items
s1.iloc[:3] = 0
df1 = pd.DataFrame(np.random.randn(6,4), index=list(range(0,12,2)), columns=list(range(0,8,2)))
df1.iloc[:3] - Top 3 rows
df1.iloc[1:5, 2:4] - Selected columns and rows
df1.iloc[[1, 3, 5], [1, 3]] - Integer list
df1.iloc[1:3, :] - All columns
df1.iloc[:, 1:3] - All rows
df1.iloc[1, 1] - Specific cell
df1.iloc[1] - 2nd row
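The .iloc entries above, on deterministic data. Because the index and columns here are even integers, the sketch also shows how position-based .iloc differs from label-based .loc when the labels are integers. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones; labels are 0, 2, 4, ...
df1 = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=list(range(0, 12, 2)),
    columns=list(range(0, 8, 2)),
)

block = df1.iloc[1:5, 2:4]             # positions, stop excluded
picked = df1.iloc[[1, 3, 5], [1, 3]]   # lists of positions
cell_by_position = df1.iloc[1, 1]      # second row, second column
cell_by_label = df1.loc[2, 2]          # row labelled 2, column labelled 2

by_position = df1.iloc[2]  # third row (label 4)
by_label = df1.loc[2]      # row whose label is 2 (second row)

print(block.shape, picked.to_numpy().tolist())
print(by_position.tolist(), by_label.tolist())
```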

Out-of-range slice indexes are handled gracefully, just as in Python/NumPy. They can result in an empty axis (e.g. an empty DataFrame being returned).

A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out of bounds will raise an IndexError.
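A small sketch of the out-of-bounds rules above, on made-up data. Slices are clipped silently, while a single position or a list containing an out-of-bounds position raises IndexError.

```python
import pandas as pd

# Illustrative data, not taken from the article.
s = pd.Series(range(5))
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

clipped = s.iloc[3:100]     # slice is clipped to the available positions
empty = s.iloc[10:20]       # entirely out of range: empty Series
no_columns = df.iloc[:, 4:] # empty column axis

# A single position or a list with an out-of-bounds position raises IndexError.
errors = []
for bad in (10, [0, 10]):
    try:
        s.iloc[bad]
    except IndexError as exc:
        errors.append(type(exc).__name__)

print(len(clipped), len(empty), no_columns.shape, errors)
```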
