Introduction to NumPy Arrays - Crash course in Data Science with Python

NumPy documentation¶

Note: NumPy has very good and extensive documentation, which you can find at https://numpy.org/doc/stable/. If you need further details about NumPy arrays, you can always refer to it.

Motivation for arrays¶

We have seen that for numerical data, we often use NumPy arrays. Why do we need this additional container, and why can’t we just use Python lists?

Let’s illustrate this with an example. Imagine we have a list containing weights in grams:

gramms = [5400, 3491, 2591, 14100]

Now we want to convert these weights to kilograms. With a plain list, we have no choice but to use a for loop (or a list comprehension) to divide each element by 1000:

kilogramms = []
for i in range(len(gramms)):
    new_value = gramms[i]/1000
    kilogramms.append(new_value)
kilogramms
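For comparison, the same conversion can be written as a list comprehension. This is a minimal sketch that reuses the `gramms` list from above; it is more compact than the loop, but Python still processes one element at a time.

```python
# Same weights as above, in grams
gramms = [5400, 3491, 2591, 14100]

# List comprehension: still an explicit per-element loop, just written in one line
kilogramms = [value / 1000 for value in gramms]
print(kilogramms)
```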

You can imagine much more complex cases, e.g. where we combine multiple lists, in which this style becomes cumbersome to write and slow to run. What arrays give us instead is vectorized computation.

Creating an array from a list¶

To see how this works, let’s create a NumPy array. First of all, let’s import NumPy:

import numpy as np

We can easily turn our list from above into an array using the np.array function:

gramms_array = np.array(gramms)
gramms_array

Vectorized operations¶

Vectorization means that we can operate on the whole array as one object, i.e. we can do mathematics with it as if it were a single number. In our example:

kilogramms_array = gramms_array / 1000
kilogramms_array

As mentioned above, this also works if we need to perform a computation that uses multiple arrays. Let’s imagine we have a list of prices per $m^2$ and a list of surfaces for a series of apartments:

price_per_m2 = [6, 10.3, 12.4, 10.6, 5.7, 4.3, 14, 0.5, 0.5, 17.8, 12.7, 16, 2.7, 17.5, 5.2, 7.1, 1.2, 7.2, 14.5, 11.9]
surface = [238, 239, 265, 212, 143, 132, 142, 133, 109, 291, 225, 165, 141, 197, 298, 289, 123,  90, 132, 203]

Now if we want to calculate the price of each apartment, we can multiply each price per $m^2$ by the corresponding surface. We could do that by creating a for loop and filling a new list with the values:

price = []
for i in range(len(price_per_m2)):
    current_price = price_per_m2[i] * surface[i]
    price.append(current_price)

price

As you might have guessed, it makes more sense to once more transform the two lists into arrays:

price_per_m2_array = np.array(price_per_m2)
surface_array = np.array(surface)

Instead of having to write a for loop, NumPy now allows us to use a standard mathematical operation where we multiply the two arrays:

price_array = price_per_m2_array * surface_array
price_array

You see that when multiplying two arrays, NumPy simply multiplies each element of one array by the corresponding element of the other array.

Advantages of vectorization¶

There are two main advantages to this approach. First, it makes the code much simpler: we achieved in a single line what took a cumbersome for loop with lists (note that a list comprehension would be more compact and slightly more efficient in plain Python, but still far from NumPy).

Second, it makes our code run much faster. In a for loop, each operation is executed separately by the Python interpreter, and since Python is dynamically typed (you don’t have to declare whether a variable is text or a number), it has to check the type of each value over and over. In the NumPy vectorized version, all multiplications are carried out in one go in optimized compiled code because: 1) the array contains only one type of value, so no per-element type checks are needed, and 2) arrays are stored efficiently as contiguous blocks in memory, so individual values don’t have to be “searched” for.

With this very simple example, we can compare the execution time using the magic command %%timeit:

%%timeit -n 10000 -r 5 
price = []
for i in range(len(price_per_m2)):
    current_price = price_per_m2[i] * surface[i]
    price.append(current_price)

%%timeit -n 10000 -r 5
price_array = price_per_m2_array * surface_array
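The `%%timeit` magic only works in Jupyter/IPython. Outside a notebook, a similar comparison can be sketched with the standard library `timeit` module. The data below is made up for illustration (1000 hypothetical apartments rather than the 20 above), and the helper names `loop_version` and `numpy_version` are assumptions, not part of the original notebook.

```python
import timeit

import numpy as np

# Hypothetical data: 1000 made-up prices per m^2 and surfaces
price_per_m2 = [0.5 + (i % 20) for i in range(1000)]
surface = [90 + (i % 200) for i in range(1000)]
price_per_m2_array = np.array(price_per_m2)
surface_array = np.array(surface)


def loop_version():
    # Plain Python: multiply element by element
    price = []
    for i in range(len(price_per_m2)):
        price.append(price_per_m2[i] * surface[i])
    return price


def numpy_version():
    # NumPy: one vectorized multiplication
    return price_per_m2_array * surface_array


# Run each version 200 times and report the total time in seconds
loop_time = timeit.timeit(loop_version, number=200)
numpy_time = timeit.timeit(numpy_version, number=200)
print(f"for loop: {loop_time:.4f} s, NumPy: {numpy_time:.4f} s")
```

The exact numbers depend on the machine, so only the relative difference between the two timings is meaningful.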

Applying functions to arrays¶

As we’ve seen, we can do operations between arrays of the same size, or between an array and a single number. We will see later that there are exceptions to this rule (a mechanism called broadcasting).

However, we can also do mathematics with just a single array. NumPy implements many standard mathematical functions that you can apply directly to arrays as if you were dealing with a single number. These always have the same syntax as in regular mathematics, $y = f(x)$, except that here $y$ and $x$ are arrays and the function is applied to each element. Here are a few examples, e.g. in trigonometry:

np.cos(price_array)

... or for exponentials and logarithms:

np.exp(price_array)

np.log10(price_array)

Accessing elements by index¶

1D arrays¶

As with lists, accessing elements is straightforward. The standard way to extract information from an array is to use square bracket notation. For example, if we want to extract the second element of the array, we write:

price_array[1]

Remember that we start counting from 0 in Python, which is why the second element has index 1.

We can use negative indices to count from the end of the array. For example, if we want to access the last element, we can write:

price_array[-1]

Higher dimensions¶

When working in higher dimensions, NumPy arrays show their potential. We can simply put the indices of each dimension in the square brackets, separated by commas. For example, if we have a 2D array (a matrix) and want to access the element in the first row and second column, we write:

array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array_2d[0, 1]
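As a small sketch of how the comma-separated notation relates to list-style indexing, the example below reuses `array_2d` and compares a few ways of accessing it. The variable names on the left are only illustrative.

```python
import numpy as np

array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

element = array_2d[0, 1]          # first row, second column
same_element = array_2d[0][1]     # chained list-style indexing gives the same value
first_row = array_2d[0]           # a single index selects a whole row
last_element = array_2d[-1, -1]   # negative indices work in each dimension

print(element, same_element, last_element)
print(first_row)
```

Both notations return the same value here; the comma-separated form is the usual NumPy idiom because it addresses all dimensions in a single indexing operation.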

Array dimensions¶

Arrays are well suited for work in higher dimensions. Let’s check the number of dimensions, the shape, and the size of the following array:

(If you want, you can try to understand how this array was created, but this is not essential at the moment.)

my_array = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]] * 2)

print("Number of dimensions:", my_array.ndim)
print("Shape:", my_array.shape)
print("Size:", my_array.size)
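The three attributes are related: `ndim` is the length of `shape`, and `size` is the product of the entries of `shape`. The sketch below checks this on the same array; the variable names `expected_ndim` and `expected_size` are only illustrative.

```python
import numpy as np

# Multiplying the outer list by 2 repeats its single 3x3 block twice
my_array = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]] * 2)

expected_ndim = len(my_array.shape)     # one entry in shape per dimension
expected_size = np.prod(my_array.shape)  # total number of elements

print(my_array.shape)                    # (2, 3, 3)
print(my_array.ndim == expected_ndim)    # True
print(my_array.size == expected_size)    # True
```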

Array type¶

We have mentioned above that computation is fast because the type of the array elements is known. This means that all the elements of an array must have the same type. NumPy implements its own types, called dtypes. We can access the type of an array using the dtype attribute:

price_per_m2_array.dtype

We see that by default NumPy decided that the price had a float64 dtype because some of the numbers we used had a decimal point. Notice that it also turned the numbers without a decimal point into floats (like the first element, 6). Since all elements of an array need to have the same type, NumPy just selects the most complex one for the entire array.
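This conversion can be seen on a minimal example. The sketch below uses a made-up two-element list, `mixed`, containing one integer and one float.

```python
import numpy as np

# One integer and one float in the same list
mixed = np.array([6, 10.3])

print(mixed)        # the integer 6 is stored as the float 6.0
print(mixed.dtype)  # float64
```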

Let’s see what dtype the surface array has:

surface_array.dtype

We only used integer numbers in that list, and therefore NumPy can use a “simpler” integer dtype for that array.

Finally let’s see the result of our multiplication:

price_array.dtype

When combining multiple arrays, NumPy selects the most complex dtype among the inputs for the output, here float64.
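If you want to know the resulting dtype without doing the computation, NumPy provides `np.result_type`. The sketch below uses two small made-up arrays, `int_array` and `float_array`, rather than the apartment data.

```python
import numpy as np

int_array = np.array([1, 2, 3])          # integer dtype
float_array = np.array([0.5, 1.5, 2.5])  # float64

# Ask NumPy which dtype a combination of the two would have
predicted = np.result_type(int_array, float_array)

product = int_array * float_array
print(predicted)      # float64
print(product.dtype)  # float64
```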

If needed, we can also change the dtype of an array explicitly using the astype method. For example, if we want our surface_array to contain floats instead of integers, we can write:

surface_array_float = surface_array.astype(np.float64)
surface_array_float.dtype

Notice how we had to create a new array: by default, most operations on NumPy arrays are not done in place, i.e. the array itself is not changed.
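The following sketch makes this visible on a shortened, made-up version of `surface_array`: after calling `astype`, the original array still holds integers, and reassigning the name is one way to keep only the converted version.

```python
import numpy as np

surface_array = np.array([238, 239, 265])

# astype returns a new array and leaves the original untouched
surface_array_float = surface_array.astype(np.float64)
print(surface_array.dtype)        # still an integer dtype
print(surface_array_float.dtype)  # float64

# To "replace" the original, assign the result back to the same name
surface_array = surface_array.astype(np.float64)
print(surface_array.dtype)        # float64
```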

Exercises¶

Create an array with 3 elements and one with 5 elements, both containing integers.
Try to multiply the two arrays.
You should get an error message. Do you understand the problem? Change the size of one of the arrays so that you can multiply them.
Change the dtype of the output to float32.


### YOUR CODE HERE

Getting shape and size of an array: follow the instructions in the comments of the cell below.

# a) Guess the shape, number of dimensions and size of the arrays below

array_1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

array_2 = np.array([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]])

array_3 = np.array([1, 2, 3, 4, 5, 6])

array_4 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])


# b) Check your guesses using the appropriate attributes of the array.

### YOUR CODE HERE

Accessing elements in arrays: follow the instructions in the comments of the cell below.

# a) Access the element in the second row and third column of array_2d (should be 6)

array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

### YOUR CODE HERE


# b) Access the number 7 in array_2d using indexing

### YOUR CODE HERE


# c) Access the number 5 in array_3d using indexing

array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

### YOUR CODE HERE

(Optional) Creating and accessing a 2D array

Create the array shown below
Output the dimensions of the array
Access the elements marked in bold (individually or per list) and output them

11	3
5	16
23	10
13	14
7	28


### YOUR CODE HERE