Python: Arrays

Run and experiment with the code in this tutorial using the Jupyter Notebook

Intro

Abstraction is one of the fundamental concepts in object-oriented programming. It enables a user to work with an idea without having to grapple with its implementation; put another way, it allows access to a high-level concept while hiding away lower-level details. Arrays are a good example of an abstracted concept: while it is tempting to think of arrays as a distinct data type, they are actually classes that organise fundamental data types into a common structure with associated methods. When we use an array we rarely need to interact with the low-level code that determines precisely how integers, floating point decimals or character strings are organised in memory. Instead, we use array methods that exist as a core feature of the array and save us from having to write fresh code to, for example, reshape the array or slice it.

In this tutorial I’ll discuss raw Python arrays and NumPy arrays. A Jupyter Notebook for this tutorial is available HERE.

Python data structures: A double edged sword

One of the major strengths of Python is that it is dynamically typed. This means a name can be bound to objects of different types at different times during the execution of some Python code. This gives great flexibility and makes Python an intuitive language for building complex software. The alternative to dynamically typed languages is statically typed languages such as C or Java. In statically typed languages the type of each variable must be declared and stays fixed throughout the code execution.

With this distinction in mind, consider that the reference implementation of Python (CPython) is built predominantly using C. That means that dynamic typing requires manipulation of C variables “under-the-hood”. In C, switching a variable from one type to another would lead to a compilation failure, yet Python, which is itself built using C, allows it.

In the code shown below I assign the value ten to variable “x” as an integer, then overwrite it as a string and then as a floating point decimal value: three different fundamental data types, all assigned to the same variable name. Python runs this without complaint because of dynamic typing. This would fail in a statically typed language, first because the type of x was never declared, and then because whatever type x started with would be violated twice.


# Demonstrating the dynamic-type nature of Python: integer, string, float
x = 10
x = "ten"
x = 10.0
print(x)
10.0

This dynamic typing enables a huge amount of flexibility for Python programmers: we don’t need to explicitly assign data types to variables and we can be efficient in juggling data types for a single variable. However, this flexibility comes at a cost. Each Python object has to carry additional information alongside its value, such as its type, so that a name can be freely rebound to an object of any type, which means there is a memory implication.
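As a small illustrative sketch (the list name types_seen is introduced here and is not part of the original example), the same rebinding can be followed step by step by recording the type of the object that x refers to after each assignment:

```python
# Record the type of the object bound to x after each assignment
x = 10
types_seen = [type(x).__name__]

x = "ten"
types_seen.append(type(x).__name__)

x = 10.0
types_seen.append(type(x).__name__)

print(types_seen)  # ['int', 'str', 'float']
```

The name x stays the same throughout; it is the object it points to, and therefore the type, that changes.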

 

What Python objects look like in memory

Since Python is written in C, every Python object is ultimately a C structure. That structure contains the value, but also a collection of other crucial data that enables dynamic typing. An integer variable in C is essentially a label for a position in memory whose bytes encode the value (“x”, 10). An integer in Python is a pointer to a C structure containing four items:

1) ob_refcnt: a reference count that helps Python silently handle memory allocation and deallocation

2) ob_type: encodes the type of the variable

3) ob_size: specifies the size of the data members that follow

4) ob_digit: contains the actual integer value that we expect the Python variable to represent

In Python, a value is therefore stored in a structure that holds header information (the first three items above) alongside the value itself. This is illustrated in Figure 1 below, where the C integer is simply a value, whereas the Python integer is a pointer to a structure containing the header information and the integer value.

Figure 1: Integers in C and Python (image from Jake VanderPlas’s Python Data Science Handbook)
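The cost of that header can be glimpsed from Python itself. The following is a rough sketch rather than a precise measurement: it compares the size of a whole Python integer object with the size of one bare C integer as stored by the standard library array module. The variable names are illustrative, and the exact byte counts depend on the platform and Python version.

```python
import array
import sys

python_int = 10
int_object_bytes = sys.getsizeof(python_int)   # whole Python object: header plus value

raw = array.array('i', [10])                   # one C int stored without a per-value header
raw_value_bytes = raw.itemsize                 # bytes used by a single element

print("Python int object:", int_object_bytes, "bytes")
print("C int inside an array:", raw_value_bytes, "bytes")
print("type recorded in the object:", type(python_int).__name__)
```

On a typical 64 bit CPython build the Python object is several times larger than the bare C value.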

 

Python Lists and Arrays

Lists

In Python collections of data can be stored together under a single variable name. A simple one-dimensional sequence of data can be stored as a list. A list is one of the simplest types of data container. Lists make use of the dynamic-type nature of Python, since they can contain different data types in a single container. Lists are mutable, meaning they can be altered even after their creation.

First, let’s make a list containing integers…

L = list(range(10))
L
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Now, let’s add some strings to the same container, L…

L = L + ["one","two","three"]
L
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'one', 'two', 'three']

Finally, let’s also add some floating point numbers…

L = L + [1.0, 2.0, 3.0]
L
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'one', 'two', 'three', 1.0, 2.0, 3.0]

At this stage, we have created a list and populated it with integers. Then we made use of Python’s dynamic typing to extend it with strings and floats. (Strictly, L = L + [...] builds a new list and rebinds the name L to it, whereas L.extend(...) would modify the existing list in place.) The result is a single object containing three different data types. Let’s just confirm with a quick check…

print("container type = ",type(L), "\ntype of first element = ", type(L[0]), 
      "\ntype of twelfth element = ", type(L[11]), "\ntype of final element = ", type(L[-1]))
container type =  <class 'list'> 
type of first element =  <class 'int'> 
type of twelfth element =  <class 'str'> 
type of final element =  <class 'float'>
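Mutability is easiest to see by keeping a second name for the same list. This is a minimal sketch with illustrative names (numbers, alias): concatenation with + produces a new list, while extend changes the existing list in place.

```python
numbers = list(range(3))
alias = numbers                  # second name for the same list object

numbers = numbers + ["one"]      # concatenation builds a new list and rebinds `numbers`
rebound = numbers is not alias   # True: `alias` still refers to the original list

alias.extend([1.0])              # extend modifies the original list in place

print(alias)     # [0, 1, 2, 1.0]
print(numbers)   # [0, 1, 2, 'one']
```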

Arrays

While the capability to do this is very useful, it is also inefficient and comparatively memory-hungry. Therefore, Python offers several options for storing data in C-style static-type containers. For example, a dynamic-type “list” can become a static-type “array”. In this case, we must specify the data type for the array.

import array
L = list(range(10)) # make list containing ten integers
integer_array = array.array('i',L) # convert list to static-type array of integers
float_array = array.array('f',L) # convert list L to static-type array of floats

print("integer array = ", integer_array)
print("float array = ", float_array)
integer array =  array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
float array =  array('f', [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
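The static typing of these arrays is enforced at run time. As a brief sketch (the names rejected and candidate are illustrative), appending a string or a float to an integer array raises a TypeError, whereas a list would have accepted either:

```python
import array

integer_array = array.array('i', range(10))

rejected = []
for candidate in ("ten", 10.5):
    try:
        integer_array.append(candidate)   # not an integer, so the array refuses it
    except TypeError:
        rejected.append(candidate)

integer_array.append(10)                  # an integer is accepted

print(rejected)             # ['ten', 10.5]
print(len(integer_array))   # 11
```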

NumPy arrays

The size of the array in memory is much less than the size of the equivalent list because the type is static and the header information required for each object is reduced. The Python package “NumPy” (Numerical Python) has its own array structure that is also static-type. In addition, NumPy provides a rich collection of operations designed specifically to operate efficiently on its arrays.

Figure 2: Comparison of structure of numpy arrays and lists (image from Jake VanderPlas’s Python Data Science Handbook)

First we will import numpy under the alias np, then convert our list, integer array and float array to numpy arrays.

import numpy as np

A = np.array(L)
B = np.array(integer_array)
C = np.array(float_array)

print("A type = ", type(A), "A content type = ", type(A[0]), "A contents = ", A)
print("B type = ", type(B), "B content type = ", type(B[0]), "B contents = ", B)
print("C type = ", type(C), "C content type = ", type(C[0]), "C contents = ", C)
A type =  <class 'numpy.ndarray'> A content type =  <class 'numpy.int64'> A contents =  [0 1 2 3 4 5 6 7 8 9]
B type =  <class 'numpy.ndarray'> B content type =  <class 'numpy.int32'> B contents =  [0 1 2 3 4 5 6 7 8 9]
C type =  <class 'numpy.ndarray'> C content type =  <class 'numpy.float32'> C contents =  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]

Notice that all three containers are numpy arrays, but the types of the contents differ in each array. The original list is converted into 64 bit integers (the default integer type on the platform used here). Where we defined the data type as “integer” in the Python array ('i'), the values have been stored as 32 bit integers. In array C the floating point numbers from “float_array” are still floating point numbers, stored as 32 bit floats. We can set the data type by explicitly defining it in the array creation.

A = np.array(L,dtype="int64")
print("Type of container = ", type(A), "type of content = ", type(A[0]))
print(A)
Type of container =  <class 'numpy.ndarray'> type of content =  <class 'numpy.int64'>
[0 1 2 3 4 5 6 7 8 9]
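To see the data type actually change, it helps to pick one that differs from the default. The sketch below (variable names are illustrative) creates the array as 32 bit floats and then converts it to 16 bit integers with astype, printing the bytes per element (itemsize) and the total data size (nbytes) each time:

```python
import numpy as np

values = list(range(10))

as_float32 = np.array(values, dtype="float32")   # set the type at creation
as_int16 = as_float32.astype("int16")            # convert an existing array

print(as_float32.dtype, as_float32.itemsize, as_float32.nbytes)  # float32 4 40
print(as_int16.dtype, as_int16.itemsize, as_int16.nbytes)        # int16 2 20
```

Smaller element types reduce memory use but also reduce the range of values that can be stored, so the choice depends on the data.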

The reason arrays and numpy arrays are more memory efficient than lists (or, worse, lists of lists!) is that they are static type: each element in the structure is a single value of a fixed type, so the additional overheads necessary for dynamic typing are stripped away and data type information can be stored once for the array as a whole rather than for each element. Below we create three data structures (numpy array, python array and python list) and apply a function by looping over each element. Then we print the computation time and the amount of memory (bytes) each data structure occupies.

#set up three identical datasets as different structures
numpy_array = np.array(range(1000000),dtype='int') # create numpy array
python_array = array.array('q',range(1000000)) # create python array - q = 8-byte (64 bit) signed integer
python_list = list(range(1000000)) # create python list


# time a simple element-wise squaring
%timeit [i **2 for i in numpy_array]
%timeit [i **2 for i in python_array]
%timeit [i **2 for i in python_list]

# display size of each dataset in memory
import sys
print("size of numpy_array = ", sys.getsizeof(numpy_array))
print("size of python_array = ", sys.getsizeof(python_array))
print("size of python_list = ", sys.getsizeof(python_list))
191 ms ± 5.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
227 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
218 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
size of numpy_array =  8000096
size of python_array =  8183800
size of python_list =  9000120

Notice that the numpy array has the smallest memory allocation and the fastest computation time. This is because it is a static-type data structure that consists of a contiguous 1D sequence of values, with the structure size, shape and data type all saved once at the array level rather than element-wise. The list has the highest memory allocation because it is dynamically typed: it stores a pointer to each element, and each element is a full Python object with its own header information. In fact, sys.getsizeof reports only the list object and its pointers, not the integer objects they point to, so the true footprint of the list is larger still than the figure printed above. Notice also that performing the operation over the numpy array was not dramatically faster than operating over the python array or list. This is because we applied the function in a loop that forced numpy to access each element in turn.
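A fuller estimate of the list’s memory use adds the size of every element object to the size of the list itself. The following is a rough sketch using a smaller dataset of 100000 values so that it runs quickly; the variable names are illustrative and the exact byte counts will vary between Python versions.

```python
import sys
import numpy as np

n = 100_000
python_list = list(range(n))
numpy_array = np.arange(n, dtype=np.int64)

list_container_bytes = sys.getsizeof(python_list)                       # list object and its pointers
list_element_bytes = sum(sys.getsizeof(item) for item in python_list)   # the int objects themselves
list_total_bytes = list_container_bytes + list_element_bytes

print("list container only:", list_container_bytes)
print("list plus elements: ", list_total_bytes)
print("numpy data buffer:  ", numpy_array.nbytes)   # 8 bytes x 100000 = 800000
```

Counted this way, the list typically needs several times the memory of the numpy array holding the same values.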

However, the real power of numpy becomes apparent when we consider the implicit vectorisation and efficient storage of complex multidimensional data. Where we can vectorise computation over a numpy array the computation speed is very high, but applying loops to numpy arrays is comparatively slow. We can demonstrate this by applying the same function, but allowing numpy to vectorise…

%timeit numpy_array **2 # vectorise
1.01 ms ± 78.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The computation for the numpy array is now faster by a factor of roughly 190 (191 ms for the loop versus about 1 ms vectorised)! This is an important point: looping over elements in numpy arrays is inefficient and should be avoided where possible, whereas vectorised operations over numpy arrays are very efficient and should be used as often as possible!
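The %timeit command only works inside IPython or Jupyter. As a sketch of the same comparison in plain Python, the standard library timeit module can be used instead. The example below uses a smaller array (100000 values, an arbitrary choice) and also checks that the loop and the vectorised expression give identical results; the timings themselves will differ from machine to machine.

```python
import timeit
import numpy as np

values = np.arange(100_000, dtype=np.int64)

looped = np.array([v ** 2 for v in values])   # element-by-element in a Python loop
vectorised = values ** 2                      # one vectorised operation
same_result = np.array_equal(looped, vectorised)

# total time for 3 repetitions of each approach
loop_time = timeit.timeit(lambda: [v ** 2 for v in values], number=3)
vector_time = timeit.timeit(lambda: values ** 2, number=3)

print("results identical:", same_result)
print("loop:       %.4f s" % loop_time)
print("vectorised: %.4f s" % vector_time)
```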

Visualising 1D arrays

We can visualise the array using the Python package Matplotlib. As the array has one dimension it can easily be plotted on a simple x,y plot where the x value is simply the index of each element in the array.

import matplotlib.pyplot as plt
numpy_array = np.array(np.sin(np.arange(0,100,1)))
plt.figure(figsize=(10,8)),plt.plot(numpy_array)
(<Figure size 720x576 with 1 Axes>,
 [<matplotlib.lines.Line2D at 0x7f1f26891d50>])

Multidimensional Numpy Arrays

2D Arrays

So far, we have used arrays with only one dimension, i.e. a single row of elements. However, numpy really comes into its own when dealing with multidimensional arrays. A two dimensional array looks something like a table, or an Excel spreadsheet, in that there are both rows and columns. Let’s take a look at a random set of integers organised into a 2D array:

twoDimArray = np.random.randint(100, size=(10, 10))
twoDimArray
array([[24, 45, 20, 80, 29, 90, 83, 44, 49, 33],
       [39,  5, 39, 75, 15, 49, 48, 52, 54, 91],
       [93, 73, 94, 86, 50, 88,  9, 19, 58, 70],
       [14, 26, 53, 52, 67, 55, 11,  5, 91, 61],
       [88, 27,  1, 39, 50, 44, 61, 43, 86, 33],
       [47, 62, 12, 17, 38, 76, 65,  2, 70,  9],
       [33,  9, 54, 27, 29, 98, 22, 20, 70, 60],
       [26, 59, 43, 18, 32, 94, 59, 42, 69, 37],
       [74, 39, 37, 41, 22, 81, 60, 56, 69, 13],
       [40, 76, 15,  6, 75, 32, 66, 81, 24, 18]])

We can use numpy’s native functions to query this 2D array and gather information about its structure and contents…

size = np.size(twoDimArray)
shape = np.shape(twoDimArray)
mean = np.mean(twoDimArray)
stdev = np.std(twoDimArray)

print(size)
print(shape)
print(mean)
print(np.round(stdev,2))
100
(10, 10)
47.35
26.01
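The same information is also available directly from the array as attributes and methods, and the statistics can be computed per column or per row with the axis argument. Here is a minimal sketch on a small, fixed array (the name table is illustrative) so that the results are easy to verify by hand:

```python
import numpy as np

table = np.arange(12).reshape(3, 4)   # rows: [0 1 2 3], [4 5 6 7], [8 9 10 11]

print(table.size)    # 12
print(table.shape)   # (3, 4)
print(table.mean())  # 5.5

column_means = table.mean(axis=0)   # collapse the rows: one mean per column
row_means = table.mean(axis=1)      # collapse the columns: one mean per row

print(column_means)  # [4. 5. 6. 7.]
print(row_means)     # [1.5 5.5 9.5]
```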

Let’s create a list of 10 lists, each containing 10000 elements. This is equivalent to creating a 10 x 10000 element array, but the values are stored in 10 list objects. We can check the size and shape to confirm this…

# build ten separate lists of 10000 integers and collect them in an outer list
L = [list(range(10000)) for _ in range(10)]

print(len(L))
print(np.size(L))
10
100000

Now we organise the same data into a 2D array of identical shape and size, but this time the container is a NumPy array…

nparray = np.arange(0,100000,1).reshape(10,10000)
print(np.shape(nparray))
(10, 10000)
print("size of list of lists = ",sys.getsizeof(L)) 
print("size of numpy array = ", sys.getsizeof(nparray))

%timeit [[i**2 for i in p] for p in L]  #nested list comprehension to loop through individual list, then list of lists
%timeit [i**2 for i in nparray]
%timeit nparray**2
size of list of lists =  208
size of numpy array =  112
20.6 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.3 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
38.6 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

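As for the 1D examples, the computation time is reduced significantly by using numpy arrays instead of lists (notice the reported time is in milliseconds for the list and microseconds for the arrays), and especially when the operation is vectorised. Note that looping over a 2D numpy array iterates over its rows, so the middle timing is really ten vectorised row operations rather than 100000 element-wise ones. The reported sizes need more care: sys.getsizeof counts only the outer list and its ten pointers (208 bytes), not the inner lists, and for the numpy array it counts only the array header (112 bytes), because reshape returns a view that does not own its data. The data itself occupies nparray.nbytes, which is 800000 bytes for 64 bit integers.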
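The difference between an array that owns its data and a view onto another array’s data can be checked directly. In this sketch (the names view and owned are illustrative), the reshaped array is a view, so sys.getsizeof reports only its small header, whereas a copy owns its buffer and reports the header plus the data:

```python
import sys
import numpy as np

view = np.arange(0, 100000, 1).reshape(10, 10000)   # reshape returns a view of the arange data
owned = view.copy()                                  # copy allocates its own data buffer

print(view.flags.owndata, sys.getsizeof(view))     # False, header only
print(owned.flags.owndata, sys.getsizeof(owned))   # True, header plus data
print(view.nbytes)                                 # bytes of data, regardless of ownership
```

For comparing memory use between arrays, nbytes is usually the more reliable figure.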

Slicing Arrays

We can identify individual elements in the array by defining a row and column index…

print("element in top left of array (i.e. first element) = ", twoDimArray[0,0])
print("element in bottom right of array (i.e. last element)", twoDimArray[-1,-1])
print("element at bottom of second column = ", twoDimArray[-1,1])
element in top left of array (i.e. first element) =  24
element in bottom right of array (i.e. last element) 18
element at bottom of second column =  76

Or we can identify entire rows and columns, or sections of rows and columns…

print("entire first row = ", twoDimArray[0,:])
print("entire second column = ", twoDimArray[:,1])
print("first 5 elements of third row = ", twoDimArray[2,0:5])
entire first row =  [24 45 20 80 29 90 83 44 49 33]
entire second column =  [45  5 73 26 27 62  9 59 39 76]
first 5 elements of third row =  [93 73 94 86 50]

This process of isolating individual elements or parts of an array is known as “slicing”. It relies on “indexing”, which is the concept of each data element having a value representing its position in the array in addition to the actual value stored in that location.
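Two further details of slicing are worth a short sketch: a slice can take a step as well as a start and stop, and a basic numpy slice is a view onto the original array rather than a copy, so writing to the slice changes the original. The array name grid and its contents are chosen here purely for illustration.

```python
import numpy as np

grid = np.arange(25).reshape(5, 5)   # values 0..24 laid out row by row

first_row = grid[0, :]               # [0 1 2 3 4]
second_column = grid[:, 1]           # [ 1  6 11 16 21]
every_other = grid[::2, ::2]         # rows 0, 2, 4 and columns 0, 2, 4

first_row[0] = -1                    # basic slices are views, so this writes into grid too

print(second_column)                 # [ 1  6 11 16 21]
print(every_other.shape)             # (3, 3)
print(grid[0, 0])                    # -1
```

If an independent copy is needed, call .copy() on the slice before modifying it.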

Visualising 2D arrays

We can also visualise a 2D array using matplotlib. Since we have data distributed in two dimensions, this will be represented as a plane of data divided up into individual pixels, with one pixel representing one element in the 2D array. For this we need to use the imshow function from matplotlib.pyplot instead of the plot function.

plt.figure(figsize=(10,8)),plt.imshow(twoDimArray),plt.colorbar(label = 'DATA VALUES')
plt.ylabel('ROW INDEX'), plt.xlabel('COLUMN INDEX')
(Text(0, 0.5, 'ROW INDEX'), Text(0.5, 0, 'COLUMN INDEX'))

3D Arrays

We can also do the same thing for higher dimensional arrays. As we have seen a two dimensional array can be thought of as a dataset organised into rows and columns. Like a map, a 2D array has x and y coordinates, which are equivalent to the data indices.

A three dimensional array adds the equivalent of a “z” dimension. This can be thought of as a stack of two dimensional arrays on top of one another, where a vertical column extends through each index of the 2D array. Let’s create a 3D array of random integers…

threeDimArray = np.random.randint(100, size=(1000,10,10))
threeDimArray
array([[[48, 62, 61, ..., 72, 10, 33],
        [37, 14,  6, ..., 95, 67, 39],
        [66, 11, 83, ..., 71, 69, 32],
        ...,
        [23, 89, 49, ...,  4, 35, 27],
        [74, 75,  5, ..., 16, 33, 12],
        [31, 87, 13, ..., 62, 17, 45]],

       [[71, 45, 47, ..., 23, 52, 42],
        [43, 92, 59, ..., 16,  9, 75],
        [93, 23, 29, ..., 21, 86, 31],
        ...,
        [47, 18, 58, ..., 56, 79, 46],
        [ 1, 64, 18, ..., 59, 91, 54],
        [45, 23, 32, ..., 89, 96, 46]],

       [[20, 75, 21, ..., 50, 57,  6],
        [17,  1, 75, ..., 64, 68, 27],
        [16, 23, 70, ..., 83,  3, 50],
        ...,
        [ 0, 84, 77, ..., 34, 64, 83],
        [22, 74, 25, ..., 85, 48, 26],
        [95, 33, 49, ..., 22, 16, 54]],

       ...,

       [[61, 41, 59, ..., 88, 18, 53],
        [39,  9, 50, ..., 52,  8, 34],
        [70, 39, 98, ..., 56, 36, 43],
        ...,
        [11, 60, 55, ..., 40, 87, 30],
        [18, 59, 11, ..., 72, 85, 38],
        [65, 43, 61, ..., 36, 61, 88]],

       [[ 0, 69, 57, ..., 22, 40, 82],
        [ 0, 17, 22, ..., 28, 90, 12],
        [63, 63, 83, ..., 88, 65, 23],
        ...,
        [ 1, 16, 92, ..., 89, 45, 50],
        [71, 11, 98, ..., 66,  5, 80],
        [95, 56,  1, ..., 73, 24, 53]],

       [[62, 52,  5, ..., 22,  3, 16],
        [59, 12, 50, ..., 39, 14, 90],
        [68,  7, 21, ..., 26, 31, 28],
        ...,
        [70, 93, 49, ..., 88, 23, 22],
        [17, 43, 15, ..., 44, 16, 99],
        [98,  5, 19, ...,  2, 14, 91]]])

Note that printing the 3D array to the console causes a series of 2D arrays to be displayed (abbreviated here with “...” because there are 1000 of them, each 10 x 10). These 2D arrays are the layers that stack into one 3 dimensional array: the third dimension, equivalent to a “z” axis, is deconstructed to enable us to see the contents of the array without stacked layers obscuring those beneath.

print(np.shape(threeDimArray))
print(np.size(threeDimArray))
(1000, 10, 10)
100000
%timeit threeDimArray**2
38 µs ± 555 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Notice that the computation time for applying the operation over the 3D array is essentially the same as the computation time over the 2D array of the same size (both contain 100000 elements). The number of dimensions has not had a noticeable effect on computation time.
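To make the “stack of 2D layers” picture concrete, here is a minimal sketch on a small, fixed 3D array (the name cube is illustrative). Indexing with one value picks out a whole layer, indexing with three values picks out a single element, and an operation along axis 0 collapses the stack:

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)   # 2 layers, each 3 rows x 4 columns

first_layer = cube[0]            # a 3 x 4 2D array
last_element = cube[1, 2, 3]     # layer 1, row 2, column 3
layer_sum = cube.sum(axis=0)     # add the layers together element-wise

print(first_layer.shape)   # (3, 4)
print(last_element)        # 23
print(layer_sum[0, 0])     # 0 + 12 = 12
```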

Visualising 3D arrays

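from mpl_toolkits import mplot3d
fig = plt.figure(figsize=(10,8))
ax = plt.axes(projection='3d')
# plot_surface needs X, Y and Z grids of matching shape, so plot the first
# 10 x 10 layer of the array against its column and row indices
cols, rows = np.meshgrid(np.arange(10), np.arange(10))
ax.plot_surface(cols, rows, threeDimArray[0,:,:], rstride=1, cstride=1,
                cmap='viridis', edgecolor='none')
ax.set_title('surface')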
Text(0.5, 0.92, 'surface')

Take-aways

1. Python objects are C structures inside some wrapper code

2. The less additional information stored with each value, the smaller the amount of memory required

3. Arrays are structures that store multiple elements and can be multidimensional

4. NumPy arrays are computationally efficient, especially when operations can be vectorised

5. Looping over NumPy arrays should be avoided in favour of vectorised operations
