Introduction to numpy and matplotlib#

This lecture introduces NumPy and Matplotlib, two of the most fundamental parts of the scientific Python ecosystem. Almost everything else is built on top of them.

NumPy: The fundamental package for scientific computing with Python. NumPy is the standard Python library used for working with arrays (i.e., vectors & matrices), linear algebra, and other numerical computations.

Note

Documentation for this package is available at https://numpy.org/doc/stable/index.html.

Matplotlib: Matplotlib is a comprehensive library for creating static and animated visualizations in Python.

Note

Documentation for this package is available at https://matplotlib.org/stable/index.html.

Note

If you have not yet set up Python on your computer, you can execute this tutorial in your browser via Google Colab. Click on the rocket in the top right corner and launch “Colab”. If that doesn’t work, download the .ipynb file and import it in Google Colab.

Then install numpy and matplotlib by executing the following command in a Jupyter cell at the top of the notebook.

!pip install matplotlib numpy

Importing a Package#

This will be our first experience with importing a package.

Usually we import numpy with the alias np.

import numpy as np

NDArrays#

ndarrays (short for n-dimensional arrays) are the key data structure in NumPy. They are similar to Python lists, but they allow for fast, efficient computations on large arrays and matrices of numerical data. An ndarray can have any number of dimensions, and ndarrays are used for a wide range of numerical and scientific computing tasks, including linear algebra, statistical analysis, and image processing.

Thus, the main differences between a numpy array and a list are the following:

  • numpy arrays can have N dimensions (while lists only have 1)

  • numpy arrays hold values of the same datatype (e.g. int, float), while lists can contain anything.

  • numpy optimizes numerical operations on arrays. Numpy is fast!
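
The differences listed above can be seen in a minimal sketch. The variable names (`as_list`, `as_array`, `mixed`, `grid`) are only illustrative and not part of the lecture's own code.

```python
import numpy as np

values = [9, 0, 2, 1, 0]
as_list = list(values)
as_array = np.array(values)

# multiplying a list repeats it; multiplying an array scales each element
print(as_list * 2)    # [9, 0, 2, 1, 0, 9, 0, 2, 1, 0]
print(as_array * 2)   # [18  0  4  2  0]

# mixing an int and a float in one array promotes everything to a single dtype
mixed = np.array([1, 2.5])
print(mixed.dtype)    # float64

# arrays report their number of dimensions
grid = np.array([[1, 2], [3, 4]])
print(grid.ndim)      # 2
```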

# create an array from a list
a = np.array([9, 0, 2, 1, 0])

Note

If you’re in Jupyter, you can use <shift> + <tab> to inspect a function.

# find out the datatype
a.dtype
dtype('int64')
# find out the shape
a.shape
(5,)
# another array with a different datatype and shape
b = np.array([[5, 3, 1, 9], [9, 2, 3, 0]], dtype=np.float64)
# check dtype
b.dtype
dtype('float64')
# check shape
b.shape
(2, 4)

Array Creation#

There are lots of ways to create arrays.

np.zeros((4, 4))
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
np.ones((2, 2, 3))
array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])
np.full((3, 2), np.pi)
array([[3.14159265, 3.14159265],
       [3.14159265, 3.14159265],
       [3.14159265, 3.14159265]])
np.random.rand(5, 2)
array([[0.41896038, 0.67123746],
       [0.65026313, 0.52958136],
       [0.38550543, 0.43134556],
       [0.46974043, 0.32469587],
       [0.9457798 , 0.25889625]])
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(2, 4, 0.25)
array([2.  , 2.25, 2.5 , 2.75, 3.  , 3.25, 3.5 , 3.75])

A frequent need is to generate an array of N numbers, evenly spaced between two values. That is what linspace is for.

np.linspace(2, 4, 20)
array([2.        , 2.10526316, 2.21052632, 2.31578947, 2.42105263,
       2.52631579, 2.63157895, 2.73684211, 2.84210526, 2.94736842,
       3.05263158, 3.15789474, 3.26315789, 3.36842105, 3.47368421,
       3.57894737, 3.68421053, 3.78947368, 3.89473684, 4.        ])
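
To make the contrast with arange concrete, the following small sketch builds the same range both ways. The names `by_step` and `by_count` are illustrative assumptions.

```python
import numpy as np

# arange takes a step size and excludes the stop value
by_step = np.arange(2, 4, 0.5)
print(by_step)     # [2.  2.5 3.  3.5]

# linspace takes the number of points and includes the stop value by default
by_count = np.linspace(2, 4, 5)
print(by_count)    # [2.  2.5 3.  3.5 4. ]
```

As a rule of thumb, arange is convenient when the step matters and linspace when the number of points matters.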

Numpy also has some utilities for helping us generate multi-dimensional arrays. For instance, meshgrid creates 2D arrays out of a combination of 1D arrays.

x = np.linspace(-2 * np.pi, 2 * np.pi, 5)
y = np.linspace(-np.pi, np.pi, 4)
xx, yy = np.meshgrid(x, y)
xx.shape, yy.shape
((4, 5), (4, 5))
yy
array([[-3.14159265, -3.14159265, -3.14159265, -3.14159265, -3.14159265],
       [-1.04719755, -1.04719755, -1.04719755, -1.04719755, -1.04719755],
       [ 1.04719755,  1.04719755,  1.04719755,  1.04719755,  1.04719755],
       [ 3.14159265,  3.14159265,  3.14159265,  3.14159265,  3.14159265]])
xx
array([[-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531],
       [-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531],
       [-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531],
       [-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531]])
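
The outputs above suggest how the two grids relate to the 1D inputs. The sketch below checks this relationship explicitly; the indices `i` and `j` are arbitrary choices for illustration.

```python
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 5)
y = np.linspace(-np.pi, np.pi, 4)
xx, yy = np.meshgrid(x, y)

# every row of xx is a copy of x; every column of yy is a copy of y
print(np.array_equal(xx[0], x))       # True
print(np.array_equal(yy[:, 0], y))    # True

# so the pair (xx[i, j], yy[i, j]) is the grid point (x[j], y[i])
i, j = 2, 3
print(xx[i, j] == x[j], yy[i, j] == y[i])    # True True
```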

Indexing#

Basic indexing in numpy is similar to lists.

# get some individual elements of xx
xx[3, 4]
6.283185307179586
# get some whole rows
xx[0]
array([-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531])
# get some whole columns
xx[:, -1]
array([6.28318531, 6.28318531, 6.28318531, 6.28318531])
# get some ranges (also called slicing)
xx[0:2, 3:5]
array([[3.14159265, 6.28318531],
       [3.14159265, 6.28318531]])
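
To see where array indexing matches list indexing and where it goes further, here is a small comparison. The names `nested` and `arr` are illustrative.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]
arr = np.array(nested)

# single elements: chained brackets for lists, one bracket with a comma for arrays
print(nested[1][2])    # 6
print(arr[1, 2])       # 6

# a whole column needs a loop for nested lists, but only a slice for arrays
print([row[0] for row in nested])    # [1, 4]
print(arr[:, 0])                     # [1 4]
```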

Visualizing Arrays with Matplotlib#

Let’s create a slightly bigger array:

x = np.linspace(-2 * np.pi, 2 * np.pi, 100)
y = np.linspace(-np.pi, np.pi, 50)
xx, yy = np.meshgrid(x, y)
xx.shape, yy.shape
((50, 100), (50, 100))

For plotting a 1D array as a line, we use the plot command.

To use this function, we first need to import it from the matplotlib library.

from matplotlib import pyplot as plt

This line imports the pyplot module from the matplotlib library and gives it the alias plt for brevity.

plt.plot(x);
_images/f3b91cf911970c049049aa9c4b3641a5b282c5d1f39f25a4a3531a745950c2e3.png

There are many ways to visualize 2D data. Here we use pcolormesh.

plt.pcolormesh(xx);
_images/3c5da4eb2641bd35c2fafdf882bf83883b5a07f7838d4373de191113d91997d5.png

Array Operations#

There is a huge number of operations available on arrays.

All the familiar arithmetic operators are applied on an element-by-element basis.

Basic Math#

f = np.sin(xx) * np.cos(0.5 * yy)
plt.pcolormesh(f)
<matplotlib.collections.QuadMesh at 0x7f9f87faef50>
_images/556807bf81e1f66ef8c62ad4d4fc20b7d047f710e4a03d0aee1a64fd9b99d033.png
xx == yy
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
np.any(xx == yy)
False

Broadcasting#

Not all the arrays we want to work with will have the same size!

Broadcasting is a powerful feature in numpy that allows you to perform operations on arrays of different shapes and sizes. It automatically expands the smaller array to match the dimensions of the larger array, without actually making copies of the data, so that element-wise operations can be performed. This is done by following a set of rules that determine how the shapes of the arrays align.

Broadcasting allows you to vectorize operations and avoid explicit loops, leading to more concise and efficient code. It’s particularly useful when working with large data sets, as it helps optimize memory usage and computational speed.

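Dimensions are automatically aligned starting with the last (trailing) dimension and working backwards. Two dimensions are compatible if they have the same length or if one of them has length 1; a missing leading dimension is treated as having length 1. If every aligned pair of dimensions is compatible, the two arrays can be broadcast.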

print(f.shape, x.shape)
(50, 100) (100,)
g = f * x
print(g.shape)
(50, 100)
plt.pcolormesh(g)
<matplotlib.collections.QuadMesh at 0x7f9f87f24050>
_images/4c7e43ccedba821ae48dd4f0eed465cb1f5fb5771aeff20ee43e035219f7d655.png
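
The following sketch shows one shape pair that broadcasts, one that does not, and a common way to fix the failing case. It relies on NumPy's rule that aligned dimensions must be equal or one of them must be 1. The arrays here are stand-ins with the same shapes as above, not the lecture's actual data.

```python
import numpy as np

f = np.ones((50, 100))
x = np.arange(100)    # shape (100,)
y = np.arange(50)     # shape (50,)

# trailing dimensions match (100 and 100), so x is stretched across all rows
print((f * x).shape)    # (50, 100)

# (50, 100) and (50,) do not align: 100 != 50 and neither is 1
try:
    f * y
except ValueError as err:
    print("cannot broadcast:", err)

# giving y a trailing dimension of length 1 makes the shapes compatible
print((f * y[:, np.newaxis]).shape)    # (50, 100)
```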

Reduction Operations#

In data science, we usually start with a lot of data and want to reduce it down in order to make plots or summary tables.

There are many different reduction operations. The table below lists the most common functions:

| Reduction Operation | Description |
|---|---|
| numpy.sum() | Computes the sum of array elements over a given axis. |
| numpy.mean() | Computes the arithmetic mean along a specified axis. |
| numpy.min() | Computes the minimum value along a specified axis. |
| numpy.max() | Computes the maximum value along a specified axis. |
| numpy.prod() | Computes the product of array elements over a given axis. |
| numpy.std() | Computes the standard deviation along a specified axis. |
| numpy.var() | Computes the variance along a specified axis. |

# sum
g.sum()
-3083.038387807155
# mean
g.mean()
-0.616607677561431
# standard deviation
g.std()
1.6402280119141424

A key property of numpy reductions is the ability to operate on just one axis.

# apply on just one axis
g_ymean = g.mean(axis=0)
g_xmean = g.mean(axis=1)
plt.plot(x, g_ymean)
[<matplotlib.lines.Line2D at 0x7f9f8c096b10>]
_images/ed5029d8c4fc9f3498116cc932b855dfcb4f1b0fe0044dfd6de0b057c12d83da.png
plt.plot(g_xmean, y)
[<matplotlib.lines.Line2D at 0x7f9f8c0f9cd0>]
_images/eebc3b066cd07052ed2236a6c72c8da403d1d515c8a4785dd9140c5b5544c065.png
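
Which axis is which is easier to see on a tiny array. The array `small` below is a made-up example for illustration.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])    # shape (2, 3)

# axis=0 collapses the rows, leaving one value per column
print(small.sum(axis=0))    # [5 7 9]

# axis=1 collapses the columns, leaving one value per row
print(small.sum(axis=1))    # [ 6 15]

# without an axis argument, the whole array is reduced to a single number
print(small.sum())          # 21
```

This is why g.mean(axis=0) above has the same length as x, and g.mean(axis=1) the same length as y.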

Figures and Axes#

The figure is the highest level of organization of matplotlib objects.

fig = plt.figure()
<Figure size 640x480 with 0 Axes>
fig = plt.figure(figsize=(13, 5))
<Figure size 1300x500 with 0 Axes>
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
_images/d3b6a3ba6024bf41d06c58085cb7074269d9d358612f29999f231f88a7a827f0.png

Subplots#

Subplot syntax is a more convenient way to specify the creation of multiple axes.

fig, ax = plt.subplots()
_images/e31401c8857f5c6af71ce3480f5ffd1c513ec207433282f4b54c2825e474a247.png
ax
<Axes: >
fig, axes = plt.subplots(ncols=2, figsize=(8, 4), subplot_kw={"facecolor": "blue"})
_images/950af5a0dd74972b7b11213e395b4f30753d466a005aceba4399e4e53ad5af31.png
axes
array([<Axes: >, <Axes: >], dtype=object)
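
Since plt.subplots returns the axes as a numpy array, the indexing shown earlier carries over. The following is a minimal sketch with a hypothetical 2 x 3 grid; the panel titles and the background colour are arbitrary.

```python
import matplotlib.pyplot as plt

# a 2 x 3 grid returns a 2D array of Axes, indexed like any numpy array
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 4))
print(axes.shape)    # (2, 3)

# pick a single Axes by row and column
top_right = axes[0, 2]
top_right.set_facecolor("lightgray")

# flatten() gives a 1D array that is convenient for loops
for number, ax in enumerate(axes.flatten()):
    ax.set_title(f"panel {number}")
```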

Drawing into Axes#

All plots are drawn into axes.

# create some data to plot
import numpy as np

x = np.linspace(-np.pi, np.pi, 100)
y = np.cos(x)
z = np.sin(6 * x)
fig, ax = plt.subplots()
ax.plot(x, y)
[<matplotlib.lines.Line2D at 0x7f9f84db2950>]
_images/7e85171690b3e29d389220cce2eb0471b7e5600c84659ada68a5212167d65e91.png

This does the same thing as

plt.plot(x, y)
[<matplotlib.lines.Line2D at 0x7f9f84ce03d0>]
_images/7e85171690b3e29d389220cce2eb0471b7e5600c84659ada68a5212167d65e91.png

This starts to matter when we have multiple axes to manage.

fig, axes = plt.subplots(figsize=(8, 4), ncols=2)
ax0, ax1 = axes
ax0.plot(x, y)
ax1.plot(x, z)
[<matplotlib.lines.Line2D at 0x7f9f84b4e710>]
_images/22d1aff2deef9841baf1e45953937f5b6074d960706a543c5ae25c651419dba9.png

Labeling Plots#

Labeling plots is very important! We want to know what data is shown and what the units are. matplotlib offers some functions to label graphics.

fig, ax = plt.subplots(figsize=(4, 4))

ax.plot(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("x vs. y")

# squeeze everything in
plt.tight_layout()
_images/f5db3fd01e2758a774e611851ef0ade34fb0a80028f5b82143087b67de951888.png
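
When several lines share one axes, a legend helps to tell them apart. The sketch below assumes the x values are angles in radians; the label texts are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 100)

fig, ax = plt.subplots(figsize=(4, 4))

# the label keyword names each line for the legend
ax.plot(x, np.cos(x), label="cos(x)")
ax.plot(x, np.sin(x), label="sin(x)")

# axis labels are a good place for units (assumed here: radians)
ax.set_xlabel("x [rad]")
ax.set_ylabel("amplitude")
ax.set_title("cosine and sine")

# draw the legend from the labels given above
ax.legend()

fig.tight_layout()
```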

Customizing Plots#

fig, ax = plt.subplots()
ax.plot(x, y, x, z)
[<matplotlib.lines.Line2D at 0x7f9f8c268b90>,
 <matplotlib.lines.Line2D at 0x7f9f8c2bedd0>]
_images/26202ca5b3a82fa0c4f62d140999f538007e032bc878b6146966918d4a1dda1d.png

It’s simple to switch axes:

fig, ax = plt.subplots()
ax.plot(y, x, z, x)
[<matplotlib.lines.Line2D at 0x7f9f84891c10>,
 <matplotlib.lines.Line2D at 0x7f9f848ced10>]
_images/8a3bb7f84a439d9538cedeb190fa7a0b99adb10aaad8bd240fa6e13242d133fc.png

Line Styles#

fig, ax = plt.subplots()
ax.plot(x, y, linestyle="--")
ax.plot(x, z, linestyle=":")
[<matplotlib.lines.Line2D at 0x7f9f84a125d0>]
_images/12dff60e63db6e9dc61c6fa0cdb30bb780075d1fde4d4d8f8a973e5bf65ff8ba.png

Colors#

As described in the colors documentation, there are some special codes for commonly used colors.

fig, ax = plt.subplots()
ax.plot(x, y, color="black")
ax.plot(x, z, color="red")
[<matplotlib.lines.Line2D at 0x7f9f84ab1690>]
_images/b5bd117246d75a84da5eb418221ae0779ccdd3e294c28b26ecffa0eb1ff313f7.png

Markers#

There are lots of different markers available in matplotlib!

fig, ax = plt.subplots()
ax.plot(x[:20], y[:20], marker="o", markerfacecolor="red", markeredgecolor="black")
ax.plot(x[:20], z[:20], marker="^", markersize=10)
[<matplotlib.lines.Line2D at 0x7f9f8474a610>]
_images/fa934539014f325921ad077ee2c60cecd0a5c46a5d30c09e31fe3499a75b5999.png

Axis Limits#

fig, ax = plt.subplots()
ax.plot(x, y, x, z)
ax.set_xlim(-5, 5)
ax.set_ylim(-3, 3)
(-3.0, 3.0)
_images/74fe939321b5a2916c68607551490170788f6f4756fcfd7374454d19bd3cb8ba.png

Scatter Plots#

fig, ax = plt.subplots()

splot = ax.scatter(y, z, c=x, s=(100 * z**2 + 5), cmap="viridis")
fig.colorbar(splot)
<matplotlib.colorbar.Colorbar at 0x7f9f84a80610>
_images/72f213c2632e412d8c6bf04697c5f548825722fdd82e63127c7101b901648d27.png

There are many different colormaps available in matplotlib: https://matplotlib.org/stable/tutorials/colors/colormaps.html

Bar Plots#

labels = ["Reuter West", "Mitte", "Lichterfelde"]
values = [600, 400, 450]

fig, ax = plt.subplots(figsize=(5, 5))
ax.bar(labels, values)

ax.set_ylabel("MW")

plt.tight_layout()
_images/182857339f42a6a816c618762ef0238ecaae9e8777896356950ea918733a3688.png

Exercises#

Import numpy under the alias np.

import numpy as np

Create the following arrays:

  1. Create an array of 5 zeros.

  2. Create an array of 10 ones.

  3. Create an array of 5 \(\pi\) values.

  4. Create an array of the integers 1 to 20.

  5. Create a 5 x 5 matrix of ones with a dtype int.

np.zeros(5)
array([0., 0., 0., 0., 0.])
np.ones(10)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
np.full(5, np.pi)
array([3.14159265, 3.14159265, 3.14159265, 3.14159265, 3.14159265])
np.arange(1, 21)
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20])
np.ones((5, 5), dtype=np.int8)
array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]], dtype=int8)

Create a 3D matrix of 3 x 3 x 3 full of random numbers drawn from a standard normal distribution (hint: np.random.randn())

np.random.randn(3, 3, 3)
array([[[-0.96098253,  0.73365019,  0.55603546],
        [-0.97260311,  0.01229601, -1.86541728],
        [ 1.33196736,  1.19672972,  1.22791807]],

       [[-0.51689411,  0.03008594, -0.77110762],
        [-1.72408945, -0.79713237,  1.04254781],
        [ 0.75471135,  0.15011543,  0.88710601]],

       [[ 0.08709144, -2.01345111, -1.07395192],
        [ 1.44489885,  1.34628547, -0.69745533],
        [-0.96929149, -1.02811627,  0.22968666]]])

Create an array of 20 linearly spaced numbers between 1 and 10.

np.linspace(1, 10, 20)
array([ 1.        ,  1.47368421,  1.94736842,  2.42105263,  2.89473684,
        3.36842105,  3.84210526,  4.31578947,  4.78947368,  5.26315789,
        5.73684211,  6.21052632,  6.68421053,  7.15789474,  7.63157895,
        8.10526316,  8.57894737,  9.05263158,  9.52631579, 10.        ])

Below I’ve defined an array of shape 5 x 5. Use indexing to produce the given outputs.

a = np.arange(1, 26).reshape(5, -1)
a
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])
a[1:, 3:]
array([[ 9, 10],
       [14, 15],
       [19, 20],
       [24, 25]])
a[1]
array([ 6,  7,  8,  9, 10])
a[2:4]
array([[11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])
a[1:3, 2:4]
array([[ 8,  9],
       [13, 14]])

Calculate the sum of all the numbers in a.

a.sum()
325

Calculate the sum of each row in a.

# sum along axis 1: one value per row
a.sum(axis=1)
array([ 15,  40,  65,  90, 115])
# for comparison, sum along axis 0: one value per column
a.sum(axis=0)
array([55, 60, 65, 70, 75])

Extract all values of a greater than the mean of a (hint: use a boolean mask).

a[a > a.mean()]
array([14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25])
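
The one-line solution above can be unpacked into its steps. This sketch names the intermediate mask so that it can be inspected; `mask` and `selected` are illustrative names.

```python
import numpy as np

a = np.arange(1, 26).reshape(5, -1)

# the comparison gives a boolean array with the same shape as a
mask = a > a.mean()
print(a.mean())       # 13.0
print(mask.shape)     # (5, 5)

# True counts as 1, so the sum is the number of selected values
print(mask.sum())     # 12

# indexing with the mask returns the selected values as a 1D array
selected = a[mask]
print(selected.shape)    # (12,)
```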