|
|
Self described:
"NumPy is the fundamental package for scientific computing with Python. It contains among other things:
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
Numpy is licensed under the BSD license, enabling reuse with few restrictions."
If you ever want to use Python in the context of numerical computation, you'll be using Numpy. You can use Numpy separately with Sage -- it installs easily into any standard operating system (including MS Windows). It also integrates well with Cython (see the Section below in this worksheet for an example).
The following tutorial is based on the first part of http://www.scipy.org/Tentative_NumPy_Tutorial, but with shortening, and changes to make it more Sage relevant.
|
|
IMPORTANT: To use numpy in Sage, you must first do "import numpy"!
|
|
NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers.
In Numpy dimensions are called axes. The number of axes is rank.
For example, the coordinates of a point in 3D space [1, 2, 1] is an array of rank 1, because it has one axis. That axis has a length of 3.
array([1, 2, 1]) array([1, 2, 1]) |
(3,) (3,) |
The following array has rank 2 (it is 2-dimensional). The first dimension (axis) has a length of 2, the second dimension has a length of 3.
array([[1, 0, 0],
[0, 1, 2]])
array([[1, 0, 0],
[0, 1, 2]])
|
(2, 3) (2, 3) |
And this array has rank 3 (it is 3-dimensional). The first dimension has a length of 2, the second dimension has a length of 2, and the third has a length of 3.
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
|
(2, 2, 3) (2, 2, 3) |
1 1 |
10 10 |
array([[1, 0, 0],
[0, 1, 2]])
array([[1, 0, 0],
[0, 1, 2]])
|
array([[ 5, 6, 7],
[ 8, 9, 10]])
array([[ 5, 6, 7],
[ 8, 9, 10]])
|
You can also call reshape to get back a view of the array with a different shape. It appears at first glance to be a new copy of your data with a different shape, but is really a reference to the original data. This is very powerful/good, but can be surprising, so watch out!
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
|
array([[[ 1, 0, 0, 0, 1, 2, 5, 6, 7, 8, 9, 10]]]) array([[[ 1, 0, 0, 0, 1, 2, 5, 6, 7, 8, 9, 10]]]) |
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
|
array([[[199, 0, 0, 0, 1, 2, 5, 6, 7, 8, 9, 10]]]) array([[[199, 0, 0, 0, 1, 2, 5, 6, 7, 8, 9, 10]]]) |
array([[[199, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
array([[[199, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
|
|
|
(Note: these are not function calls like they would be in Sage, so don't put parantheses after them.)
3 3 |
(2, 2, 3) (2, 2, 3) |
12 12 |
dtype('int64')
dtype('int64')
|
8 8 |
<read-write buffer for 0x7fbc439de9b0, size 96, offset 0 at 0x10cb3b2b0> <read-write buffer for 0x7fbc439de9b0, size 96, offset 0 at 0x10cb3b2b0> |
|
|
|
|
WARNING: Note that numpy.array is not the same as the Standard Python Library class array.array, which only handles one-dimensional arrays and offers less functionality.
array('i', [1, 0, 5])
array('i', [1, 0, 5])
|
<type 'array.array'> <type 'array.array'> |
<type 'numpy.ndarray'> <type 'numpy.ndarray'> |
True True |
|
|
The dtype (data type) is analoguous to the ``base ring'' for Sage matrices.
An array of integers that aren't too big default to having dtype int64 (or int32 on some computers). Watch out, the int64 data type (though fast) will silently overflow (just like "cdef long")!
dtype('int64')
dtype('int64')
|
array([2, 6]) array([2, 6]) |
int64 [ 1 4611686018427387904] int64 [ 1 4611686018427387904] |
Scary answer below!
array([ 2, -9223372036854775808]) array([ 2, -9223372036854775808]) |
So make sure you understand what numpy arrays are all about: They do their best to map the input list to some fixed precision machine level data type, e.g., int, float, etc. There is one exception, if such a mapping can't be done, then the dtype is 'object', which allows for arbitrary Python objects (but will be much slower).
object [1 4611686018427387904] object [1 4611686018427387904] |
array([2, 9223372036854775808], dtype=object) array([2, 9223372036854775808], dtype=object) |
array([1, 1180591620717411303424], dtype=object) array([1, 1180591620717411303424], dtype=object) |
array([1, 1180591620717411303424], dtype=object) array([1, 1180591620717411303424], dtype=object) |
[ 1. 0.75] float64 [ 1. 0.75] float64 |
[ 2.+3.j 0.+1.41421356j] complex128 [ 2.+3.j 0.+1.41421356j] complex128 |
array([[[ 0., 0., 0.],
[ 0., 0., 0.]],
[[ 0., 0., 0.],
[ 0., 0., 0.]]])
array([[[ 0., 0., 0.],
[ 0., 0., 0.]],
[[ 0., 0., 0.],
[ 0., 0., 0.]]])
|
array([[[ 1.+0.j, 1.+0.j, 1.+0.j],
[ 1.+0.j, 1.+0.j, 1.+0.j]],
[[ 1.+0.j, 1.+0.j, 1.+0.j],
[ 1.+0.j, 1.+0.j, 1.+0.j]]])
array([[[ 1.+0.j, 1.+0.j, 1.+0.j],
[ 1.+0.j, 1.+0.j, 1.+0.j]],
[[ 1.+0.j, 1.+0.j, 1.+0.j],
[ 1.+0.j, 1.+0.j, 1.+0.j]]])
|
array([[[ -2.31584178e+077, -3.11108817e+231, 3.45845952e-323],
[ 0.00000000e+000, 0.00000000e+000, 0.00000000e+000]],
[[ 0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
[ 0.00000000e+000, 0.00000000e+000, 0.00000000e+000]]])
array([[[ -2.31584178e+077, -3.11108817e+231, 3.45845952e-323],
[ 0.00000000e+000, 0.00000000e+000, 0.00000000e+000]],
[[ 0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
[ 0.00000000e+000, 0.00000000e+000, 0.00000000e+000]]])
|
625 loops, best of 3: 1.79 µs per loop 625 loops, best of 3: 115 µs per loop 625 loops, best of 3: 1.79 µs per loop 625 loops, best of 3: 115 µs per loop |
arange is incredibly useful in numerical applications:
array([ 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. ]) array([ 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. ]) |
|
|
pylab = matlab plotting
In Sage: use pylab.savefig instead of pylab.show!
|
Click to the left again to hide and once more to show the dynamic interactive window |
|
|
|
|
array([[ 6, 8],
[10, 12]])
array([[ 6, 8],
[10, 12]])
|
array([[ 5, 12],
[21, 32]])
array([[ 5, 12],
[21, 32]])
|
array([[19, 22],
[43, 50]])
array([[19, 22],
[43, 50]])
|
|
|
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
|
array([[1, 0, 0],
[0, 1, 2]])
array([[1, 0, 0],
[0, 1, 2]])
|
array([[[ 0, 0],
[ 1, 2]],
[[ 6, 7],
[ 9, 10]]])
array([[[ 0, 0],
[ 1, 2]],
[[ 6, 7],
[ 9, 10]]])
|
WARNING/OPPORTUNITY: Unlike with Sage Matrices, these slices reference the original data!
array([[[ 1, 123, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
array([[[ 1, 123, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
|
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
array([[[ 1, 0, 0],
[ 0, 1, 2]],
[[ 5, 6, 7],
[ 8, 9, 10]]])
|
|
|
|
|
Bonus worksheet below about combining Numpy and Cython -- look if you're interested (I will not go over it in class).
NEXT WEEK:
|
|
|
|
This lecture is about how to efficiently combine Numpy and Cython to write fast numerical code.
We will focus on the problem of computing the standard deviation of a list of floating point numbers. Let x_1, \ldots, x_n be a list of n real numbers, and let
Note: In statistics it is common to divide by n-1 instead of n when computing the standard deviation of a sample and using it to estimate the standard deviation of a population. We will not do this, since our goal today is illustrating programming techniques, not learning techniques of statistics.
Running Example: Compute the standard deviation of a list of 64-bit floating point numbers. Our data set is computed using the random.random method, which generates numbers uniformly between 0 and 1.
|
|
[0.43811732887872634, 0.78344784289564662, 0.7917672531341533, 0.43546157784257289, 0.99879630143646858, 0.21470214570255253, 0.52818002353940696, 0.51667692205628102, 0.67726646422202585, 0.92183464863760212, 0.54553592061123968, 0.2143866131543859, 0.90130600825854523, 0.71144055233831971, 0.080614713472959898, 0.81024524942438758, 0.8403186842969067, 0.26527690630696821, 0.9755892062984004, 0.94353224947123771] [0.43811732887872634, 0.78344784289564662, 0.7917672531341533, 0.43546157784257289, 0.99879630143646858, 0.21470214570255253, 0.52818002353940696, 0.51667692205628102, 0.67726646422202585, 0.92183464863760212, 0.54553592061123968, 0.2143866131543859, 0.90130600825854523, 0.71144055233831971, 0.080614713472959898, 0.81024524942438758, 0.8403186842969067, 0.26527690630696821, 0.9755892062984004, 0.94353224947123771] |
First we write a naive straightforward implementation of computation of the standard deviation.
|
|
0.28871143425255896 Time: CPU 0.06 s, Wall: 0.06 s 0.28871143425255896 Time: CPU 0.06 s, Wall: 0.06 s |
10 loops, best of 3: 64.8 ms per loop 10 loops, best of 3: 64.8 ms per loop |
Next we try the std function in Sage, which was implemented by UW undergrad Andrew Hou as part of paid work he did on Sage after he took Math 480.
0.28871143425255896 Time: CPU 0.03 s, Wall: 0.03 s 0.28871143425255896 Time: CPU 0.03 s, Wall: 0.03 s |
25 loops, best of 3: 26.4 ms per loop 25 loops, best of 3: 26.4 ms per loop |
Next we try Numpy, which is much faster than the above.
|
|
dtype('float64')
dtype('float64')
|
0.28871143425255896 0.28871143425255896 |
625 loops, best of 3: 1.25 ms per loop 625 loops, best of 3: 1.25 ms per loop |
18.0000000000000 18.0000000000000 |
Sage also has code for working with TimeSeries, which happens to have a method for computing the standard deviation. It is a couple of times faster than Numpy.
|
|
|
|
0.28871143425255896 0.28871143425255896 |
625 loops, best of 3: 240 µs per loop 625 loops, best of 3: 240 µs per loop |
The TimeSeries code is nearly optimal. A TimeSeries is represented by a contiguous array of double's, and the code for computing standard deviation is very straightforward Cython that maps directly to C. (I wrote it, by the way.)
5.29661016949153 5.29661016949153 |
Goal: Write a function that computes the standard deviation of a numpy array as quickly as stats.TimeSeries does, hence is faster than Numpy itself.
First approach: Use numpy "vectorized operations". This doesn't help at all (and is also wasteful of memory, by the way).
|
|
|
|
864.90625 864.90625 |
|
|
865.671875 865.671875 |
0.28871143425255896 0.28871143425255896 |
0.28871143425255896 0.28871143425255896 |
625 loops, best of 3: 1.25 ms per loop 625 loops, best of 3: 1.25 ms per loop |
Let's see how the time gets spent between each step. It turns out to be about equally spent among each line.
625 loops, best of 3: 140 µs per loop 625 loops, best of 3: 140 µs per loop |
625 loops, best of 3: 241 µs per loop 625 loops, best of 3: 241 µs per loop |
625 loops, best of 3: 157 µs per loop 625 loops, best of 3: 157 µs per loop |
625 loops, best of 3: 143 µs per loop 625 loops, best of 3: 143 µs per loop |
sqrt(2) sqrt(2) |
1.4142135623730951 1.4142135623730951 |
100000 loops, best of 3: 589 ns per loop 100000 loops, best of 3: 589 ns per loop |
100000 loops, best of 3: 216 ns per loop 100000 loops, best of 3: 216 ns per loop |
|
|
Next try Cython with no special type declarations. Not surprisingly, this does not help in the least bit.
|
|
0.28871143425255896 0.28871143425255896 |
625 loops, best of 3: 1.3 ms per loop 625 loops, best of 3: 1.3 ms per loop |
Next try Cython with special support for Numpy. This gets powerful... as we will see.
|
|
Traceback (click to the left of this block for traceback) ... TypeError: Argument 'v' has incorrect type (expected numpy.ndarray, got NoneType) Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "_sage_input_68.py", line 10, in <module>
exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("c3RkX251bXB5MyhOb25lKQ=="),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))
File "", line 1, in <module>
File "/tmp/tmpXdvvgn/___code___.py", line 2, in <module>
exec compile(u'std_numpy3(None)
File "", line 1, in <module>
File "_sagenb_flask_sage_notebook_sagenb_home_openidSfmMv1OuVE_44_code_sage70_spyx_0.pyx", line 8, in _sagenb_flask_sage_notebook_sagenb_home_openidSfmMv1OuVE_44_code_sage70_spyx_0.std_numpy3 (_sagenb_flask_sage_notebook_sagenb_home_openidSfmMv1OuVE_44_code_sage70_spyx_0.c:713)
def std_numpy3(ndarray v not None):
TypeError: Argument 'v' has incorrect type (expected numpy.ndarray, got NoneType)
|
0.28871143425255896 0.28871143425255896 |
625 loops, best of 3: 1.7 ms per loop 625 loops, best of 3: 1.7 ms per loop |
Look at Cython + Numpy documentation (by Googling "cython numpy"), and we learn that if we declare v a little more precisely, then we get fast direct access to the underlying elements in v.
|
|
0.49896857465357608 0.49896857465357608 |
Timing looks good...
625 loops, best of 3: 376 µs per loop 625 loops, best of 3: 376 µs per loop |
|
|
Let's finish it the function and see how it compares.
|
|
0.28871143425255896 0.28871143425255896 |
625 loops, best of 3: 274 µs per loop 625 loops, best of 3: 274 µs per loop |
625 loops, best of 3: 238 µs per loop 625 loops, best of 3: 238 µs per loop |
625 loops, best of 3: 1.27 ms per loop 625 loops, best of 3: 1.27 ms per loop |
Very nice!!
|
|
Finally, we try again, after disabling bounds checking. This is even better; almost as good as stats.TimeSeries.
|
|
625 loops, best of 3: 227 µs per loop 625 loops, best of 3: 227 µs per loop |
625 loops, best of 3: 240 µs per loop 625 loops, best of 3: 240 µs per loop |
For smaller input, interestingly we get a massive win over Numpy. If you were, e.g., computing a sliding window of standard deviations (say) for a time series, this would be important.
array([ 1., 2., 3., 4.]) array([ 1., 2., 3., 4.]) |
625 loops, best of 3: 483 ns per loop 625 loops, best of 3: 483 ns per loop |
625 loops, best of 3: 24.4 µs per loop 625 loops, best of 3: 24.4 µs per loop |
625 loops, best of 3: 534 ns per loop 625 loops, best of 3: 534 ns per loop |
|
|