Python: NumPy

There is a Jupyter Notebook accompanying this post HERE.

NumPy is a Python package built around the concept of ndarrays (n-dimensional arrays) along with a suite of efficient functions for applying operations over those arrays. Many of the other important packages for data scientists are built on top of NumPy (e.g. Pandas, scikit-learn). In the previous tutorial we discussed arrays and why they may be more useful than lists in many cases (although, remember lists are useful because they are dynamically typed and can store multiple data-types in a single container). We also discussed how NumPy arrays are represented in memory. Here we will take a deeper dive into NumPy and start to explore some of the more advanced functionality.

1 Numpy Arrays

Numpy arrays are n-dimensional structures with homogenous d-type and minimal and array-level rather than element-level header information, making them memory efficient. Implicit vectorization makes them very fast to work with. We will return to this idea shortly. For now, lets look at creating a NumPy array:

# Method 1: Created NumPy array from a list
import numpy as np

list = ([3,0,1,3,4,7,5,2,7,8,9])
nparray = np.array(list)

print("List: ",list)
print("NumPy array from list: ",nparray)

# Method 2: create NumPy array directly
nparray2 = np.array([3,0,1,3,4,7,5,2,7,8,9])
print("NumPy array (direct creation): ",nparray2)

# show that the resulting arrays are identical (array_equal returns True if 2 parsed arrays are identical,
# otherwise False):
print("Arrays are identical: ",np.array_equal(nparray,nparray2))
List:  [3, 0, 1, 3, 4, 7, 5, 2, 7, 8, 9]
NumPy array from list:  [3 0 1 3 4 7 5 2 7 8 9]
NumPy array (direct creation):  [3 0 1 3 4 7 5 2 7 8 9]
Arrays are identical:  True

Two and three dimensional arrays can be created in the same way, wither directly or from lists of lists

listoflists2D = [[0,3,4,2,7,6],[9,2,5,2,4,2],[8,5,3,6,7,8],[7,5,6,3,4,2]] # create list of lists
nparray2D_1 = np.array(listoflists2D) # convert to NumPy array
nparray2D_2 = np.array([[0,3,4,2,7,6],[9,2,5,2,4,2],[8,5,3,6,7,8],[7,5,6,3,4,2]]) # create NumPy array directly

print("2D Array:\n\n",nparray2D_2)
print("2D arrays are identical: ", np.array_equal(nparray2D_1,nparray2D_2))
2D Array:

 [[0 3 4 2 7 6]
 [9 2 5 2 4 2]
 [8 5 3 6 7 8]
 [7 5 6 3 4 2]]
2D arrays are identical:  True

We also have other options for creating arrays, for example, we can append to arrays in loops, or created arrays of zeros, ones, random integers or ordered values using native NumPy functions…

randomArray = np.random.randint(100,size=(10,10))
print("RandomArray:\n\n",randomArray)
print()

zeroArray = np.zeros([10,10])
print("ZeroArray: \n\n",zeroArray)
print()

onesArray = np.ones([10,10])
print("onesArray: \n\n", onesArray)
print()

RangeArray = np.arange(0,100,1).reshape(10,10)
print("RangeArray: \n\n",RangeArray)
RandomArray:

 [[30  4 33 95 48 60 28 24 81 28]
 [37  3  8 30 17 93 26 65 77 41]
 [77 52 51 97 42 42  9 25 75 51]
 [71 44 22 88 28 32 11 57 74 97]
 [17 72  8 89  9 77 35 97 49 60]
 [ 7 74 27 45 63 82 52 56 79 53]
 [13 74  7 90 50 18 26 50 54 69]
 [24 67 97 22 93 37 94 68 92 20]
 [66 49 83 70 82 80 72 45 51 41]
 [85 56 40 71 37 20 79 56  1 24]]

ZeroArray: 

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

onesArray: 

 [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

RangeArray: 

 [[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]

2 Array Shapes

2.1 dimensions and indices

As we discussed in the last session, a NumPy array is a continguous block of homogenous dtype organized into a grid. This grid can be 1D (called a vector), 2D (called a matrix) or higher dimensional. When we print a >2D array to the console it is represented as a collection of 2D arrays that we have to mentally stack of top of one another.

We can access the shape of an array using the NumPy native function np.shape(). The shape is represented as an n-tuple that reports the size of each axis. For a vector, the size tuple simply shows the number of elements (there is only one dimension and all elements are contained within it). For a matrix, the shape tuple represents number of rows and number of columns, and the array size is equal to their product. As the dimensionality of the arrays increases by one, one additional dimension is added to the front of the shape tuple (e.g an array with 20 rows, 20 columns and a size of 2 in the z dimension would be represented as (2,20,20) and it size equal to 22020.

The following figure illustrates the structure of a 3D (2,3,4) array that contains 24 elements:

https://ipython-books.github.io/13-introducing-the-multidimensional-array-in-numpy-for-fast-array-computations/

Let’s create that array and assign it to the variable name nparray3D. You will see 2 arrays, each with 3 rows and 4 columns. We can confirm this by querying the array shape and size.

nparray3D = np.random.randint(100,size=(2,3,4))
print("3D Array: \n\n", nparray3D)

print("\nArray shape = ", np.shape(nparray3D))
print("2 * 3 * 4 = ", 2*3*4)

print("Array Size = ", np.size(nparray3D))
3D Array: 

 [[[49 12 23 53]
  [12 93  7 94]
  [72 98 15 65]]

 [[30 83 93 62]
  [70  0 69 20]
  [66 81 63 37]]]

Array shape =  (2, 3, 4)
2 * 3 * 4 =  24
Array Size =  24

2.2 Slicing nd-arrays

Understanding the shape and structure of nd-arrays enables us to select individual elements or collections of elements from across the various dimensions. For example, we might want to select a particular value, or a 2D slice of a 3D array, or a data cube of certain dimensions from a 3D array. We can do this by indexing each dimension in turn, using square brackets to indicate indexes. Remember as always that we index from zero in Python!

# select first element in 3D array (i.e. top-left element in bottom-most 2D array in 3D stack)
nparray3D[0,0,0]

# select 2nd row, 3rd column value in bottom array in 3D stack
nparray3D[0,1,2]


7

Notice that providing the index only for the 3rd dimension returns a matrix, providing 3rd dimension plus row or column returns a vector, and indexing into all available dimensions returns a scalar.

A = nparray3D[0]
B = nparray3D[0,0]
C = nparray3D[0,0,0]

print("Providing 1 index value returns \n\n",A, "\n\ni.e. a matrix\n\n")
print("Providing 2 index values returns \n\n",B, "\n\ni.e. a vector\n\n")
print("Providing 3 index values returns \n\n",C, "\n\ni.e. a scalar\n\n")
Providing 1 index value returns 

 [[49 12 23 53]
 [12 93  7 94]
 [72 98 15 65]] 

i.e. a matrix


Providing 2 index values returns 

 [49 12 23 53] 

i.e. a vector


Providing 3 index values returns 

 49 

i.e. a scalar


We can also take slices of individual dimensions, or combinations of scalar and vector indices…

A = nparray3D[:,0,0] # for all arrays in 3rd dimension, print 1st row and 1st column
B = nparray3D[0,0,-1] # print last column, first row, base of 3D stack
C = nparray3D[0,0:3,0] # for 1st array in 3rd dimension, print 2st 3 rows in 1st column
D = nparray3D[0,:,:] # print matrix at base of 3D stack

print("A = \n\n", A,"\n\n")
print("B = \n\n", B,"\n\n")
print("C = \n\n", C,"\n\n")
print("D = \n\n", D,"\n\n")
A = 

 [49 30] 


B = 

 53 


C = 

 [49 12 72] 


D = 

 [[49 12 23 53]
 [12 93  7 94]
 [72 98 15 65]] 


We can also use fancy indexing to parse arrays of indices.

fancyResult = nparray3D[:,[0],[0,1,3]] #select both arrays in 3D stack, first row, first, second and fourth columns

print(fancyResult)
[[49 12 53]
 [30 83 62]]

NumPy also includes some useful functions to find the index of the max and min values of an array rather than the values themselves…

X = np.random.randint(100,size=(100))

# find min, max and mean values
print("minimum value = ", np.min(X))
print("maximum value = ",np.max(X))

# find indexes of min and max values
print("minimum value = ", np.argmin(X))
print("maximum value = ",np.argmax(X))

# to prove this works we can check that the minimum value is equal to the value at the argmin index

if X[np.argmin(X)]==np.min(X):
    print("values are equal")
else:
    print("values are not equal")
minimum value =  1
maximum value =  98
minimum value =  67
maximum value =  24
values are equal

2.3 Updating arrays by indexing

As well as selecting and viewing parts of arrays by indexing, fancy indexing and slicing, we can also use the same techniques to update the array. For example, we can change the first five values in array X to a constant scalar:

X= np.zeros(100) # create 1D array of 100 zeros
print(X[0:15]) # print first 15 elements

X[0:5] = 5 # set first 5 elements to 5

print(X[0:15]) # print updated array
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[5. 5. 5. 5. 5. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Here’s an example with fancy indexing…

X[[1,3,5,7,9]] = 1

print(X[0:15])
[5. 1. 5. 1. 5. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]

2.4 More about indexing

2.4.1 Strides

This is a good moment to stop and consider indexing in a little more detail. In both lists and arrays each element has a value and also some header information associated with it. Being dynamically-typed the elements in a list have more of this “metadata” including the d-type; however, much of this is removed from elements in arrays because they are statically-typed. In either case, along with the data value, each element has an index. This is the position inside the data container where the value can be found, for example the first element in a vector has index 0.

However, this indexing scheme is abstracted from how the information is stored in memory (strings of binary bits). To report the correct value the computer needs to know how many bits each value occupies in memory, where they start and finish and the spacing between them. This information is stored in the array’s “stride”.

For example, to traverse a vector to the 8-bit integer in index [1], the computer will stride over the 8 bits representing index[0] and retrieve the 8 bits that follow – the data associated with index[1]. Similarly, to obtain the 8-bit integer associated with index [5] it will traverse the array 8 * 5 bits and retrieve the following 8 bits because it knows stride = 8.

Ultimately, a n-dimensional array is stored as a continuous string of bits and is abstracted up to a shaped array by the NumPy architecture. To do so, the number of bits in each element of each dimension is required. For example, in a small 2 x 2 array of 8-bit integers, there are 16 bits per row. To access the second row, NumPy must traverse the array 16 bits then begin reading the bits that follow. Similarly, accessing rows requires NumPy to read 8 bits separated by an entire row length. This principle holds true for arrays of any dimension.

 

We can demonstrate this with some simple queries to our 3D array. The native function “strides” returns an n-tuple containing the stride required to traverse each dimension.

print("3D array shape = ", nparray3D.shape)
print("Stride to traverse each dimension = ", nparray3D.strides)
3D array shape =  (2, 3, 4)
Stride to traverse each dimension =  (96, 32, 8)

What this shows is that to move from, say, the top-left value in the bottom 2D array in the 3D stack to the top-left value in the second-from-bottom 2D array in the 3D stack, NumPy needs to traverse 96 bits. This makes sense because each 2D array is made up of 3 rows and 4 columns, which gives the array a size of 12 integers. Each integer is 8 bits long, so 12 * 8 = 96. Each time NumPy moves upwards or downwards in the 3rd dimenion it has to move back or forward the length of one entire 2D array – 96 bits.

The second value in the tuple shows the stride to traverse a row. This also makes sense because to move from one row to the next, the entire row is traversed from left to right. The row-length is 4, so the stride required to reach the next row is 4* 8 = 32.

To move along the columns, NumPy simply has to traverse one element at a time, so the stride is 8 bits. To move 2 columns, the stride is 8 * 2 = 16 bits.

3 Operations on Arrays

3.1 Operations

At this point we understand what an array is, both in abstracted form as it appears in Python and the underlying structure. One of the major advantages of NumPy is the suite of very efficient native functions that benefit from an ability to compile the Python code down to very efficient C code, which cannot be achieved for list data with additional metadata attached. Let’s start with some basic array descriptors…

nparray = np.random.randint(100,size=[100,100,3])

#there are two ways to apply a ufunc to an array, first call the function and parse the array as an arg
print("METHOD 1\n")
print("mean = ",np.mean(nparray))
print("standard deviation = ",np.std(nparray))
print("max = ",np.max(nparray))
print("min = ",np.min(nparray))

# second, call the function as a property of the array
print("\n\nMETHOD 2 \n")
print("mean = ",nparray.mean())
print("standard deviation = ",nparray.std())
print("max = ",nparray.max())
print("min = ",nparray.min())
METHOD 1

mean =  49.5108
standard deviation =  28.964175861916047
max =  99
min =  0


METHOD 2 

mean =  49.5108
standard deviation =  28.964175861916047
max =  99
min =  0

We can also apply these functions over slices of the original array

sliceMean = np.mean(nparray[0,0:5,:])
print(sliceMean)
39.4

3.2 uFuncs

uFuncs (universal functions) are native to NumPy and they greatly accelerate the application of functions over arrays. They do by vectorizing the operation (applying it to the entire array at once rather than element-wise in a loop) and by delegating the iteration down to Python’s base language (C) rather than looping in Python where the overheads are greater.

Python can be quite a slow language, especially when it has to repeat operations over many elements. With lists and when element-wise loops are used to access arrays, the code is slow because it has to retrieve the relevant value from the array, check the dtype and look up the relevant function for that specific dtype, then apply it, then repeat for each element in the list or array. This is the bottleneck that is overcome by using NumPy’s uFuncs.

These uFuncs provide an interface to static-typed, compiled routines in C that do not have the overheads associated with looping. Operations can then be applied at the array-level rather than element by element. This is known as vectorisation.

Let’s create a large array and try some uFuncs…

# create 2D array with ten-million elements
bigArray = np.random.randint(100,size=(10000,1000))
print("Shape = ",bigArray.shape,"\nSize = ",bigArray.size)
Shape =  (10000, 1000) 
Size =  10000000

In many cases we barely notice we are using uFuncs – they use the same operators as vanilla Python, e.g. + (add) – (subtract) / (divide) % (remainder) ** (square). When these are used on a single array these are known as “unary” functions. Let’s try some and time them against a looped eqivalent.

%timeit bigArray/2 #second uFunc is to divide all element in the array by 2
%timeit bigArray + 5
%timeit bigArray * 2
%timeit bigArray ** 2

# compare these times to looping element-wise
def testLoop(bigArray):
    for i in np.arange(0,len(np.ravel(bigArray))):
        i**2
    return

%timeit testLoop(bigArray)
41.1 ms ± 968 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
32.3 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
32.3 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
32.7 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.73 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As well as operating on a single array, these operators can be used across multiple arrays…

# define new array of equal size to our original big array
bigArray2 = np.random.randint(100,size=(10000,1000))

# time some addition and subtraction across the two arrays
%timeit bigArray + bigArray2
%timeit bigArray - bigArray2

# compare to achieving the same in a loop
def testLoop(bigArray,bigArray2):
    for i in np.arange(0,len(bigArray[:,0]),1):
        for j in np.arange(0,len(bigArray[0,:]),1):
            bigArray[i,j] + bigArray2[i,j]
    return

%timeit testLoop(bigArray,bigArray2)
33.8 ms ± 460 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
33.8 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.17 s ± 88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In both of these examples, the speed of computation has increased by over 1000 times by vectorizing rather than looping!

The full list of uFuncs can be found at https://docs.scipy.org/doc/numpy-1.15.1/reference/ufuncs.html and there is an excellent blog detailing more of the uFuncs here: https://jakevdp.github.io/PythonDataScienceHandbook/02.03-computation-on-arrays-ufuncs.html.

3.3 Broadcasting

Whe we perform a mathematical operation between two arrays, we perform it element-wise, for example in the case of adding array X to array Y, X[0,0] is added to Y[0,0], X[0,1] is added to Y[0,1] etc through to X[-1,-1] + Y[-1,-1].

Element-wise operations are only possible when there are the same number of elements organised into the same shapes in both arrays. For example, a 10 x 10 element array can only be added or subtracted from another 10 x 10 array, a 2 x 2 array can only be multiplied by another 2 x 2 array etc. This is a major limitation to the operations that can be performed as arrays very often have different sizes and shapes. However, NumPy can broadcast arrays to overcome this problem and enable operations on arrays of different sizes. This allows vectorization (i.e. delegation of looping from Python down to the base language C).

3.3.1 Arrays and scalars

The simplest example is adding a scalar (i.e. a 0 dimensional array) to a 1D array.

In [19]:
X = np.arange(0,10,1)
Y = 2

print("X: ",type(X))
print("Y: ", type(Y))
print("Array before addition operation", X)
print()

print("Array after addition operation: ", X+Y)
X:  <class 'numpy.ndarray'>
Y:  <class 'int'>
Array before addition operation [0 1 2 3 4 5 6 7 8 9]

Array after addition operation:  [ 2  3  4  5  6  7  8  9 10 11]

This is made possible because NumPy effectively stretches the scalar to fit the dimensions of the larger array. In our example we can think of NumPy creating the array Y = [5,5,5,5,5,5,5,5,5,5] and then adding this element-wise to X. In reality this is not quite what NumPy is doing – it’s more memory efficient than that, but it is a good mental model for the broadcasting process.

This also works for multidimensional arrays. In the cell below X is a 2D array and Y is a scalar. NumPy effectively creates a 2D array of 2’s to add element-wise to X.

X = np.arange(0,100,1).reshape(10,10)
Y = 2

print("X: ",type(X))
print("Y: ", type(Y))
print("\nArray before addition operation: \n", X)
print()

print("Array after addition operation: \n", X+2)
X:  <class 'numpy.ndarray'>
Y:  <class 'int'>

Array before addition operation: 
 [[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]

Array after addition operation: 
 [[  2   3   4   5   6   7   8   9  10  11]
 [ 12  13  14  15  16  17  18  19  20  21]
 [ 22  23  24  25  26  27  28  29  30  31]
 [ 32  33  34  35  36  37  38  39  40  41]
 [ 42  43  44  45  46  47  48  49  50  51]
 [ 52  53  54  55  56  57  58  59  60  61]
 [ 62  63  64  65  66  67  68  69  70  71]
 [ 72  73  74  75  76  77  78  79  80  81]
 [ 82  83  84  85  86  87  88  89  90  91]
 [ 92  93  94  95  96  97  98  99 100 101]]

3.3.2 Arrays of different shapes/sizes

The same idea can be applied to arrays of different sizes and shapes. In the following example two arrays are multiplied together. A is a 2D array with shape (2,5) and B is a 1D array with shape (1,5). The arrays agree in the second dimension but differ in the first. Broadcasting can occur by stretching B in the first dimension until the array sizes match, then multiply elementwise.

In [21]:
A = np.arange(0,10,1).reshape(2,5) # 2D array
B = np.arange(0,5,1).reshape(1,5)

A*B
Out[21]:
array([[ 0,  1,  4,  9, 16],
       [ 0,  6, 14, 24, 36]])

This can only occur when the shapes of the arrays in each dimension are identical or the dimension size is equal to 1. If these conditions are not met, the arrays are incompatible. In the examples in the cell below, the first three calculations work because their dimensions allow broadcasting – either the dimension values are equal or the smaller array has value 1 for the mismatching dimension. The final example fails because the arrays do not meet those criteria – there is no way to pad these arrays to make them equivalent in shape, despite their size being identical.

A = np.arange(0,12,1).reshape(4,3) # 2D array
B = np.arange(0,12,1).reshape(4,3)
A*B
print("A*B works because shapes are equal")

C = np.arange(0,10,1).reshape(2,5)
D = np.arange(0,5,1).reshape(1,5)
C*D
print("C*D works because the arrays match in 2nd dimension and D = 1 in first dimension")

E= np.arange(0,100,1).reshape(5,5,4)
F = np.arange(0,20,1).reshape(1,5,4)
E*F
print("E*F works because the arrays match in 2nd and 3rd dimension and F = 1 in first dimension")

G = np.arange(0,100,1).reshape(1,10,10)
H = np.arange(0,100,1).reshape(5,5,4)

try:
    G*H
except:print("G* H Failed: Incompatible shape")
A*B works because shapes are equal
C*D works because the arrays match in 2nd dimension and D = 1 in first dimension
E*F works because the arrays match in 2nd and 3rd dimension and F = 1 in first dimension
G* H Failed: Incompatible shape

To reiterate:
Two dimensions are compatible for broadcasting if they are identical or if one of them has value 1. 

4 Take Aways

1) NumPy is a package that enables fast computation across arrays

2) Arrays are n-dimensional

3) Individual elements can be accessed by indexing and sets of elements accessed by slicing

4) NumPy has native functions that vectorise code to accelerate computation

5) NumPy broadcasts arrays so that arrays of different size can be combined

 

Reading:

NumPy Docs
Broadcasting
Python Data Science Handbook

One thought on “Python: NumPy

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s