Writing Functions

In a few of my example notebooks, I have written functions.

In this notebook I will say more about what they are and how they work, and how you can use them to save time.

A First Function

Let’s write a simple function:

def add(a, b):
    return a + b

There are a few things going on here:

  • The def keyword (for define) begins a function definition.

  • The function’s name is add.

  • The function has two arguments (sometimes called parameters): a and b.

  • The function returns a value with the return statement. In this case, the returned value is \(a + b\).

We can call it like any other function in Python:

add(3, 5)

We can’t call it with missing arguments:

add(2)
TypeError                                 Traceback (most recent call last)
<ipython-input-3-c846fc9fb86e> in <module>
----> 1 add(2)

TypeError: add() missing 1 required positional argument: 'b'

We also can’t call it with too many arguments:

add(2, 5, 3)
TypeError                                 Traceback (most recent call last)
<ipython-input-4-0ca3abe18cf7> in <module>
----> 1 add(2, 5, 3)

TypeError: add() takes 2 positional arguments but 3 were given
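Both failures are ordinary exceptions, so code can catch them instead of stopping. The following is a minimal sketch that reuses add from above; the variable name problem is only illustrative.

```python
def add(a, b):
    return a + b

problem = None
try:
    add(2)  # one argument short
except TypeError as err:
    # keep the text; err itself is unbound once the except block ends
    problem = str(err)

print(problem)
```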

Reusing Code

Functions make it possible to reuse code. We often need to perform essentially the same operation several times, with different data or with different values for some parts of a computation. Functions let us abstract that operation and give it a name.

In the Penguin Inference example, I did this with a few functions. For example, the bootstrap function for comparing the means of two independent samples:

import numpy as np
import pandas as pd

## assumption: the original notebook defines a NumPy random generator named rng
rng = np.random.default_rng()

def boot_ind(s1, s2, nboot=10000):
    ## we will ignore NAs here
    obs1 = s1.dropna()
    obs2 = s2.dropna()
    n1 = len(obs1)
    n2 = len(obs2)
    ## pool the observations together
    pool = pd.concat([obs1, obs2])
    ## grab the observed difference in means (using the NA-free observations)
    md = np.mean(obs1) - np.mean(obs2)
    ## compute our bootstrap samples of the mean under H0
    b1 = np.array([np.mean(rng.choice(pool, size=n1)) for i in range(nboot)])
    b2 = np.array([np.mean(rng.choice(pool, size=n2)) for i in range(nboot)])
    ## the P-value is the probability that we observe a difference as large
    ## as we did in the raw data, if the null hypothesis were true
    return md, np.mean(np.abs(b1 - b2) >= np.abs(md))

This function allows us to bootstrap the difference in means for any two independent samples. The process of bootstrapping does not change from data to data; all it needs is the data to operate on.

We can do this for a wide range of operations. We can manipulate data frames in functions, we can plot in them, all kinds of things. The key limitation, in terms of code that lives in notebooks, is that a function can only return one value (or a tuple of values); it is difficult to write a function that displays multiple tables using Jupyter’s table formatting, for example. (It is not impossible, but the necessary code is outside the scope of this class.) You can draw multiple plots, however, with multiple plt.show() calls.
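Returning a tuple is how boot_ind hands back both the observed difference and the P-value. As a small sketch of the same pattern, the hypothetical helper mean_and_range below returns two statistics, and the caller unpacks them into two variables.

```python
def mean_and_range(values):
    # hypothetical helper: summarize a list of numbers with two statistics
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    # the comma builds a tuple, so the caller receives both values at once
    return mean, spread

# unpack the returned tuple into two variables
m, r = mean_and_range([2, 4, 9, -1])
print(m, r)
```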

My general rule is that I’ll copy code once, and when I need to use it a third time, I will refactor it into a function.

Default Values

The boot_ind function demonstrates another feature: default values for arguments. Let’s write a function to increment a value:

def incr(x, step=1):
    return x + step

If we call this with one value, it will add 1:

incr(3)

But we can also specify a second value:

incr(3, 2)

Position and Keyword

There are two primary ways to provide arguments to a Python function: by position and by keyword.

What we’ve seen above is by position:

incr(3, 2)

We can also specify the name of one or both arguments:

incr(3, step=2)
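To make the options concrete, here is a short sketch using the same incr function. All four calls should give the same result; when every argument is named, the order no longer matters.

```python
def incr(x, step=1):
    return x + step

# four equivalent ways to pass the same arguments
by_position = incr(3, 2)
mixed = incr(3, step=2)
by_keyword = incr(x=3, step=2)
reordered = incr(step=2, x=3)  # keyword order does not matter

print(by_position, mixed, by_keyword, reordered)
```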

A few notes on keyword arguments:

  • Keyword arguments come after positional arguments. Once you start using keywords, the rest of the arguments in the function call need to use keywords.

  • Some functions have keyword-only arguments: arguments that must be provided by keyword, and cannot be given by position only.
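As a sketch of the second note, a bare * in the argument list makes every argument after it keyword-only. The function incr_kw below is a hypothetical variant of incr written for this illustration.

```python
def incr_kw(x, *, step=1):
    # the bare * marks every argument after it as keyword-only
    return x + step

print(incr_kw(3, step=2))  # step must be named

try:
    incr_kw(3, 2)  # passing step by position is rejected
except TypeError as err:
    print('TypeError:', err)
```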

In my own code, I tend to use positional arguments for the core data the function is going to operate on, and keyword arguments for options that change how it is going to operate on that data.

Variables and Scopes

One thing that quickly becomes relevant with functions is the notion of variable scope. In our notebooks, we have a set of variables that are the global variables. We can assign a variable and then use it:

x = 20

These global variables are available in functions:

def get_x():
    return x

However, if a function ever assigns to a variable, that variable is local to the function - it isn’t visible outside:

def greet(who):
    message = f'Hello, {who}!'
    return message
greet('doctor')
'Hello, doctor!'

We have called the function, so its code has run and assigned a value to message. However, message is not available globally:

message
NameError                                 Traceback (most recent call last)
<ipython-input-16-1133e3acf0a4> in <module>
----> 1 message

NameError: name 'message' is not defined

This is because message is a local variable within the greet function. Functions are the main way to create local variables in Python: unlike Java, blocks such as if and for do not introduce a new scope. (Classes have something that may look like local variables, but they are different.)
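One way to see the difference from Java is that a variable assigned inside a for loop is still visible after the loop ends. This is a minimal sketch; the names last_square and square_all are only illustrative.

```python
for i in range(3):
    last_square = i * i  # assigned inside the loop body

# both names are still available here: the loop did not create a scope
print(i, last_square)

def square_all(values):
    # squares is local to square_all and disappears when the call returns
    squares = [v * v for v in values]
    return squares

print(square_all([1, 2, 3]))
```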

If a local variable has the same name as a global variable, the function uses the local one:

def add_and_square(a, b):
    x = a + b
    return x * x
add_and_square(3, 2)

That returns what we expect (\((3 + 2) ^ 2 = 25\)). It stored an intermediate value in x; however, that was the function’s local x. Our global x still has its old value:

x
20

This introduces our local variable principle: all variable assignments in functions are local. (There is a way to do a global assignment, but we aren’t going to use it.)

Python detects local variables by looking for an assignment when it parses the function. If the variable is assigned anywhere in the function, it is treated as local. So you can’t access a global and use a local of the same name — this won’t work:

def add_to_x_and_square(y):
    x = x + y
    return x * x
add_to_x_and_square(5)
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-21-aeb316e58333> in <module>
----> 1 add_to_x_and_square(5)

<ipython-input-20-af14518c375a> in add_to_x_and_square(y)
      1 def add_to_x_and_square(y):
----> 2     x = x + y
      3     return x * x

UnboundLocalError: local variable 'x' referenced before assignment

The UnboundLocalError means exactly what it says: the variable x is referenced (to compute x + y) before it is first assigned (to store the result). We can fix this with a different name:

def add_to_x_and_square(y):
    z = x + y
    return z * z

The function’s arguments are also local variables, and they can be reassigned in the function:

def add_and_square(a, b):
    a = a + b
    return a * a
add_and_square(3, 4)

The assignment only affects the local variable. If we use a variable as the value to the argument that gets reassigned, the caller’s variable is not changed:

add_and_square(x, 2)

x is still 20:

x
20

This is called pass by value: when we pass an argument to a function, its value is passed. Inside the function, it knows nothing about our x variable (unless it references it as a global), or the fact that the value of the argument a came from x.

I generally do not recommend reassigning function arguments — it’s a very good way to confuse yourself and your readers. I do it sometimes, but primarily when the operation I’m doing is some kind of a clean-up for the argument data: at the beginning of my function, I’ll do cleanup operations (e.g. dealing with missing data, filling in default values for unspecified parameters, etc.), so the rest of the function can use “cleaned up” argument values.

There is one place where this gets weird. If we pass a mutable object and modify the object itself (rather than reassigning the variable to a new object), then the original object is modified. This is because Python passes a reference to the object (as in Java).

So if we have a list, and we append:

xs = [2, 4, 9, -1]
def add_x(x):
    x.append(5)
    return x
add_x(xs)
[2, 4, 9, -1, 5]

The list that xs points to has changed:

xs
[2, 4, 9, -1, 5]

The variable xs did not change — it still points to the same list. But Python list objects can be changed in-place: that list object now contains another value.

If you’ve taken a programming languages course: this is not the same thing as pass by reference. It is passing a reference to the object by value.
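The distinction can be sketched with two hypothetical functions: rebind assigns a new list to its parameter, which the caller never sees, while mutate changes the list object that the caller also refers to.

```python
def rebind(items):
    # assigning to the parameter points the local name at a new list
    items = items + [5]
    return items

def mutate(items):
    # append changes the list object that the caller also refers to
    items.append(5)
    return items

xs = [2, 4, 9, -1]
rebind(xs)
print(xs)  # the caller's list is unchanged
mutate(xs)
print(xs)  # the caller's list now ends with 5
```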

The primary mutable objects you’ll see are:

  • lists

  • dictionaries

  • arrays

  • Pandas objects when you do in-place mutation operations


Generators

You probably won’t use this a lot, but Python supports a special kind of function called a generator. Whereas a normal function returns a value with return, a generator returns multiple values with yield, and you iterate over them with a for loop (or anything else that takes an iterable).

For example:

def recipients():
    yield 'alpha'
    yield 'beta'
for recip in recipients():
    print(f'Hello, {recip}!')
Hello, alpha!
Hello, beta!

Most generators put the yield statement(s) in a loop.
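As a small sketch of that pattern, the hypothetical generator countdown below yields from inside a while loop, producing one value per iteration.

```python
def countdown(start):
    # hypothetical example: yield start, start - 1, ..., down to 1
    n = start
    while n > 0:
        yield n
        n -= 1

for value in countdown(3):
    print(value)

# list() also accepts any iterable, so it collects the yielded values
print(list(countdown(3)))
```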


There are a lot of things we can do with functions. Their basic capabilities are pretty simple, but they give us building blocks for many more things.

Another use case for functions is the Pandas apply method: it applies a function to each element of a series, to each row or column of a data frame, or to each group of a groupby.

Functions don’t have to return a value: if a function has no return (or yield), it returns None.
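A quick check of the None behavior, using a hypothetical report function that prints a line but never executes a return statement:

```python
def report(name, value):
    # prints a line but has no return statement
    print(f'{name} = {value}')

result = report('x', 20)
print(result is None)
```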

For more reading, see the Python tutorial.