# 5. Abstract data types¶

In Chapter 3, we introduced the concept of type as an abstraction comprising a set of possible values, and a set of operations. All types in Python are abstractions in this sense because we deal with them in terms of their defined properties rather than their actual implementation. However, it can also be useful to define purely mathematical types, distinct from their concrete realisation in code. Indeed, there are typically multiple possible concrete realisations of a mathematical idea.

Definition 5.1 (Abstract Data Type)

An abstract data type is a purely mathematical type, defined independently of its concrete realisation as code.

Abstract data types enable the programmer to reason about algorithms and their cost separately from the task of implementing them. In contrast, we can also define the concrete realisation of a data type:

Definition 5.2 (Data structure)

A data structure describes the concrete implementation of a data type: which data is stored, in what order, and how it is interpreted.

In understanding abstract data types, it is often useful to think about the data structures which could be used to implement them, but there is also value in understanding an abstract type purely in terms of its mathematical properties.

## 5.1. Stacks¶

Video: stacks as an abstract data type.

Imperial students can also watch this video on Panopto.

Possibly the simplest abstract data type which is not synonymous with a Python type is the stack. A stack is a sequence of objects in which only the most recently added object is accessible. The image to have in mind is a stack of plates on a spring-loaded holder of the type found in many university and workplace canteens. Each time a plate is added to the stack, the whole pile is pushed down to keep the top of the stack in place. If the top plate is removed, then the whole stack pops back up. An alternative name for a stack is a LIFO (last in, first out), because the last object added to the stack is the first object retrieved.

Recall that a type is defined by a set of possible values and a set of operations. The value of stack is an ordered sequence of objects of any type. The operations are push to add a new object to the sequence, and pop to return the most recently added object, and remove it from the sequence. Fig. 5.1 shows these operations. It is also common to add an additional operation of peek, which returns the most recently added object without removing it from the stack.

### 5.1.1. An example: reverse Polish notation¶

Reverse Polish notation, or postfix notation, is a way of writing mathematical operations without using operator precedence or brackets in order to determine the order of operations. This makes the algorithm for evaluating expressions written in reverse Polish notation arithmetic particularly simple. Reverse Polish calculators require fewer button pushes for complex calculations and were popular in the 1970s. They are still available, most famously from HP. In a more current example, the PostScript language used to describe documents for printers is written in reverse Polish notation.

In reverse Polish notation, the operator follows its operands. For example to add the numbers one and two, one would write $$1\ 2\ +$$. Formally, a reverse Polish calculator comprises a set of numbers, a set of operators (each of which takes a fixed number of arguments), and a stack. Each number encountered in the expression is pushed onto the stack, while each operator pops the right number of arguments off the stack and pushes the result onto the stack. At the end of the calculation, the result of the calculation is on the top of the stack. Listing 5.1 shows pseudocode for a reverse Polish calculator.

Listing 5.1 Pseudocode for a reverse Polish calculator implemented using a stack
1for item in inputs:
2    if item is number:
3        stack.push(number)
4    elif item is operator:
5        operand2 = stack.pop()
6        operand1 = stack.pop()
7        stack.push(operand1 operator operand2)
8return stack.pop()


Notice that we pop the second operand before the first. This is because $$4\ 2\ -$$ means $$4 - 2$$, not $$2 - 4$$. Table 5.1 shows how a reverse Polish calculator would evaluate an arithmetic expression.

Table 5.1 Evaluation of the reverse Polish expression 6 2 / 2 4 ** + using a stack (equivalent to $$6/2 + 2^4 = 3 + 16 = 19$$).

Expression

Stack

Action

6 2 / 2 4 ** +

()

2 / 2 4 ** +

(6)

push

/ 2 4 ** +

(6 2)

push

2 4 ** +

(3)

pop, pop, divide, push

4 ** +

(3 2)

push

** +

(3 2 4)

push

+

(3 16)

pop, pop, power, push

(19)

pop, pop, add, push

### 5.1.2. Implementing stacks in Python¶

While it is strictly true that Python does not have a stack type, the list class functions as a perfectly good stack. The relationship between the two is shown in Table 5.2.

Table 5.2 Correspondence between abstract stack operations, and Python list operations. We assume a list called my_list

Stack operation

List operation

Description

push(x)

my_list.append(x)

Add x to the top of the stack.

pop

my_list.pop()

Return and remove the top item on the stack.

peek

my_list[-1]

Return the last item on the stack, but leave the stack unchanged.

len(my_list)

Return the number of items on the stack. Not a strictly required stack operation, but often useful.

## 5.2. Separation of concerns¶

At first sight, discussions of abstract data types can seem like a complication of what, at the end of the day, are just operations on some objects. Instead of talking about stacks, why don’t we just say that a reverse Polish calculator can be implemented using a list?

The critical conceptual difference here is that a list is a Python construct, while a stack is a mathematical concept with universal applicability. If you understand the concept of a stack, then you will be able to use this to design algorithms and write programs in other languages where the concrete implementation might be a different type, or you might have to create your own stack from lower-level types and operations.

This is an example of a fundamental computer science concept called separation of concerns. Separation of concerns is a design principle that underpins much of what is considered to be good practice in programming. The idea is to divide larger tasks into smaller units, each responsible for doing one thing (addressing one concern). Different units communicate with each other using mathematically well-defined interfaces. This makes the internal design of each unit more-or-less independent of the other units. Why is this important? There are two key reasons. The first is that in programming, as in maths, complexity is the enemy of understanding. Directly addressing a large and complex problem is likely to result in a large and complex piece of code which nobody understands. Such a program will almost inevitably produce the wrong answer, and finding out what is wrong will be exceptionally difficult.

Abstract data types provide part of the mathematical interface that separates different concerns. The user of an abstract data type has an object with a simple set of operations which is easy to reason about. At the same time, the implementer of an abstract data type only has to provide an object with the required methods: they do not have to reason about all the ways in which that object might be used. By learning to think about programming in terms of abstract types and objects, you will become a better programmer who can address more complex programming tasks.

## 5.3. Algorithmic complexity¶

Video: dynamic arrays and algorithmic complexity.

Imperial students can also watch this video on Panopto.

The second reason that understanding abstract data types is important is that a good implementation of a well-designed abstract data type will have well-defined performance characteristics. In particular, the optimal algorithmic complexity, expressed in big $$O$$ notation, of operations on abstract data types will be known. Recall the definition of big $$O$$:

Definition 5.3 ($$O$$)

Let $$f$$, $$g$$, be real-valued functions. Then:

$f(n) = O(g(n)) \textrm{ as } n\rightarrow \infty$

if there exists $$M>0$$ and $$N>0$$ such that:

$n>N\, \Rightarrow\, |f(n)| < M g(n).$

We use $$n$$ rather than $$x$$ as the independent variable, because we are primarily interested in characterising the number of primitive operations or the amount of memory that an algorithm will use as a function of the number of objects stored in the relevant abstract data type.

For example, in the Python list implementation, all of the stack operations are, on average, $$O(1)$$. This means that each of pushing, popping, and peeking has an approximately fixed cost that does not depend on the current size of the stack. This does not obviously have to be the case, especially for the push and pop operations, which modify the stack. Listing 5.2 provides an implementation of a stack in which the data is stored as a Python tuple. Here, every time item is pushed onto or popped from the stack, a new copy of the tuple has to be made. This touches every one of the $$n$$ items currently in the stack, and therefore costs $$O(n)$$ operations. It is often useful to distinguish between time complexity, which is an indication of the number of operations required to execute an algorithm, and space complexity, which measures the peak memory usage of an algorithm or data structure.

Listing 5.2 A poorly designed stack implementation in which push and pop cost $$O(n)$$ operations, where $$n$$ is the current number of objects on the stack.
 1class BadStack:
2    def __init__(self):
3        self.data = ()
4
5    def push(self, value):
6        self.data += (value,)
7
8    def pop(self):
9        value = self.data[-1]
10        self.data = self.data[:-1]
11        return value
12
13    def peek(self):
14        return self.data[-1]


Note

You may already have seen big O notation in numerical analysis. The distinction is that in analysing algorithmic complexity, the limit is taken as $$n$$ approaches infinity, while in numerical analysis the independent variable approaches 0. This difference between two closely related fields is often confusing, particularly since both disciplines conventionally leave out the limit. It’s worth keeping in mind the difference, because a numerical algorithm with $$O(h^4)$$ error is really rather good since h is small, but an algorithm with $$O(n^4)$$ cost is very expensive indeed!

### 5.3.1. Amortised complexity and worst case complexity¶

The actual implementation of a list is of a contiguous sequence of locations in memory, each of which can hold a reference to a Python object. How, then, can appending an item to a list work? The next location in memory might already be in use for some other data. The obvious naïve implementation would be to allocate a new contiguous block of memory, one location longer than the previous one, and copy the existing values into that before placing the appended value in the final location. This amounts to the approach in Listing 5.2, with the result that appending an item to a list would have a time complexity of $$O(n)$$.

In fact, this is not how Python lists are implemented. Instead of only allocating the exact amount of memory needed, Python allocates a bit more and keeps track of how many memory locations are currently in use to implement the list. Only when all the current memory locations are full does a further append operation cause Python to allocate more memory. The amount of memory allocated is approximately proportional to the current length of the list. That is, if the current list length is $$n$$ then the new memory allocation will be of size approximately $$kn$$ for some $$k>1$$. This data structure is called a dynamic array. Fig. 5.2 illustrates its operation.

What does this memory allocation strategy mean for the computational complexity of appending items to the list? There are two cases. If there is a spare location for the appended value, then a reference to the value is simply inserted into that location. The cost of this does not depend on the current length of the list, so it’s $$O(1)$$. If all of the allocated memory locations are now in use, then a new chunk of memory is allocated, and the existing values are copied there. This is an $$O(n)$$ operation. However, this $$O(n)$$ operation only occurs when the list has to be extended. How often is that? Suppose the list has just been reallocated (at a cost of $$O(n)$$). The new memory allocation is $$kn$$ large, but we’ve already used $$n$$ locations so we get $$(k-1)n$$ more cheap $$O(1)$$ append operations before we have to reallocate again. $$(k-1)n = O(n)$$ so this means that adding $$O(n)$$ items to the list costs:

$\underbrace{O(n)}_{\textrm{reallocation}} + \underbrace{O(n)\times O(1)}_{O(n) \textrm{ cheap appends.}} = O(n)$

If appending $$O(n)$$ items to a list has a time complexity of $$O(n)$$, it follows that the cost of appending one item to a list, averaged over a suitably large number of operations, is $$O(1)$$. This measure of complexity, in which the cost of occasional expensive operations is considered averaged over a large number of operations, is called amortised complexity. In contrast, the occasional reallocate and copy operation is an example of the worst case complexity of the algorithm. Appending an item to a list has an amortised time complexity of $$O(1)$$ but a worst-case time complexity of $$O(n)$$.

We can use Python’s introspection capabilities to illustrate how the dynamic allocation of space for a list works as the list is appended. The sys.getsizeof() function returns the amount of computer memory that an object consumes. The function in Listing 5.3 uses this to diagnose the memory consumption of progressively longer lists, and Listing 5.4 demonstrates this.

Listing 5.3 Code to progressively lengthen a list and observe the impact on its memory consumption. This function is available as example_code.linked_list.byte_size().
 1import sys
2
3def byte_size(n):
4    """Print the size in bytes of lists up to length n."""
5    data = []
6    for i in range(n):
7        a = len(data)
8        b = sys.getsizeof(data)
9        print(f"Length:{a}; Size in bytes:{b}")
10        data.append(i)

Listing 5.4 The memory consumption of lists of length 0 to 19. We can infer that the list is reallocated at lengths 1, 5, 9, and 17.
In [1]: from example_code.linked_list import byte_size

In [2]: byte_size(20)
Length:0; Size in bytes:56
Length:1; Size in bytes:88
Length:2; Size in bytes:88
Length:3; Size in bytes:88
Length:4; Size in bytes:88
Length:5; Size in bytes:120
Length:6; Size in bytes:120
Length:7; Size in bytes:120
Length:8; Size in bytes:120
Length:9; Size in bytes:184
Length:10; Size in bytes:184
Length:11; Size in bytes:184
Length:12; Size in bytes:184
Length:13; Size in bytes:184
Length:14; Size in bytes:184
Length:15; Size in bytes:184
Length:16; Size in bytes:184
Length:17; Size in bytes:256
Length:18; Size in bytes:256
Length:19; Size in bytes:256


## 5.4. Queues and deques¶

Video: deques and ring buffers.

Imperial students can also watch this video on Panopto.

A queue is, like a stack, an ordered sequence of objects. The difference is that the only accessible item in the sequence is the earliest added. Items can be added to the back of the queue and taken from the front. As with a stack, the optimal implementations of item insertion and removal are $$O(1)$$.

A deque (Double Ended QUEue) is a generalisation of a queue to permit adding and removing items at either end. The observant reader will note that stacks and queues are both special cases of deques. Python’s standard library contains the collections.deque class, providing a simple and efficient implementation of a deque.

### 5.4.1. Ring buffers¶

How might one go about implementing a deque? A dynamic array allows values to be appended with $$O(1)$$ complexity, but doesn’t offer an efficient mechanism for prepending values. One might think that the natural solution for this would be to create a double-ended dynamic array: a buffer with spare space at each end. Unfortunately this is not optimally efficient in the case where the deque is used to implement a queue of approximately constant length. In that case, values are consistently added at one end of the data structure and removed from the other. Even in the case of a double-ended dynamic array, the buffer space at the append end of the queue will constantly run out, necessitating an expensive copy operation. The solution is to use a dynamic array, but to logically join up its ends, so that the first position in the buffer follows on from the last. Only in the case where all positions in the buffer are full will the buffer be reallocated. This data structure is called a ring buffer.

Fig. 5.3 shows a ring buffer being used as a queue. At each step, an object is appended to the end of the queue, or removed from its start. At step 7, the contents of the buffer wrap around: the queue at this stage contains D, E, F. At step 9 there is insufficient space in the buffer to append G, so new space is allocated and the buffer’s contents copied to the start of the new buffer.

Imperial students can also watch this video on Panopto.

One disadvantage of a deque (and hence of a stack or queue) is that inserting an object into the middle of the sequence is often an $$O(n)$$ operation, because on average half of the items in the sequence need to be shuffled to make space. A linked list provides a mechanism for avoiding this. A singly linked list is a collection of links. Each link contains a reference to a data item and a reference to the next link. Starting from the first link in a list, it is possible to move along the list by following the references to successive further links. A new item can be inserted at the current point in the list by creating a new link, pointing the link reference of the new link to the next link, and pointing the link reference of the current link to the new link. Fig. 5.4 shows this process, while Listing 5.5 shows a minimal implementation of a linked list in Python. Notice that there is no object for the list itself: a linked list is simply a linked set of links. Some linked list implementations do store an object for the list itself, in order to record convenient information such as the list length, but it’s not strictly necessary.

Listing 5.5 A simple singly linked list implementation.
 1class Link:
2    def __init__(self, value, next=None):
3       self.value = value
4       self.next = next
5
7       '''Insert a new link after the current one.'''
8


Linked lists tend to have advantages where data is sparse. For example, our implementation of a Polynomial in Chapter 3 would represent $$x^{100} + 1$$ very inefficiently, with 98 zeroes. Squaring this polynomial would cause tens of thousands of operations, almost all of them on zeroes. Conversely, if we implemented polynomials with linked lists of terms, this squaring operation would take the handful of operations we expect.

A doubly linked list differs from a singly linked list in that each link contains links both to the next link and to the previous one. This enables the list to be traversed both forwards and backwards.

A deque, and therefore a stack or a queue can be implemented using a linked list, however the constant creation of new link objects is typically less efficient than implementations based on ring buffers.

## 5.6. The iterator protocol¶

Video: the iterator protocol.

Imperial students can also watch this video on Panopto.

The abstract data types we have considered here are collections of objects, and one common abstract operation which is applicable to collections is to iterate over them. That is to say, to loop over the objects in the collection and perform some action for each one. This operation is sufficiently common that Python provides a special syntax for it: the for loop. You will already be very familiar with looping over sequences such as lists:

In [1]: for planet in ["World", "Mars", "Venus"]:
...:     print(f"Hello {planet}")
...:
Hello World
Hello Mars
Hello Venus


Python offers a useful abstraction of this concept. By implementing the correct special methods, a container class can provide the ability to be iterated over. This is a great example of abstraction in action: the user doesn’t need to know or care how a particular container is implemented and therefore how to find all of its contents, they can simply write a for loop to access every item in turn.

There are two special methods required for iteration. Neither take any arguments beyond the object itself. The first, __iter__(), needs to be implemented by the container type. Its role is to return an object which implements iteration. This could be the container itself, or it could be a special iteration object (for example because it is necessary to store a number recording where the iteration is up to).

The object returned by __iter__() is called an iterator. It also needs to implement __iter__() (for example it could simply return self). In addition, it needs to implement the __next__() method. This is called by Python repeatedly to obtain the next object in the iteration sequence. Once the sequence is exhausted, subsequent calls to __next__() should raise the built-in StopIteration exception. This tells Python that the iteration is over. This arrangement is called the iterator protocol, and it’s further documented in the official Python documentation.

Hint

Raising exceptions is the subject of Section 6.4, to which we will turn presently. For current purposes, it is sufficient to know that iteration is halted when __next__() executes this line of code:

raise StopIteration


Let’s suppose we want to make the linked list in Listing 5.5 iterable. We’ll need to make another object (an iterator) to keep track of where we are in the list at each point in the iteration. Listing 5.6 shows the code. The iterator class LinkIterator is never seen by the user, it’s just there to keep track of the iteration.

Listing 5.6 A simple linked list implementation that supports the iterator protocol.
 1class Link:
2    def __init__(self, value, next=None):
3        self.value = value
4        self.next = next
5
7        '''Insert a new link after the current one.'''
8
11
12    def __iter__(self):
14
15
19
20    def __iter__(self):
21        return self
22
23    def __next__(self):
24        if self.here:
25            next = self.here
26            self.here = self.here.next
27            return next.value
28        else:
29            raise StopIteration


As a trivial example, we can set up a short linked list and iterate over it, printing its values:

In [3]: linked_list = Link(1, Link(2, Link(3)))

In [4]: for l in linked_list:
...:     print(l)
...:
1
2
3


Indeed, since Python now knows how to iterate over our linked list, converting it to a sequence type such as a tuple will now work automatically:

In [5]: tuple(linked_list)
Out[5]: (1, 2, 3)


## 5.7. Other abstract data types¶

Here we have introduced in some detail a few relatively simple abstract data types that illustrate the distinction between the mathematical properties of a type and the concrete details of its implementation. There are many other abstract data types, some of which you will have already met, and we will encounter a few more in this course. For context, here are a few other examples.

set

A set is an unordered collection of objects with the property that objects that compare equal can only occur once in the set. Inserting or accessing a set member has $$O(1)$$ amortised complexity. Python provides the set built in class as an implementation.

dictionary or map

A generalised function in which unique (up to equality) keys are mapped to arbitrary values. Again, insertion and deletion cost $$O(1)$$ operations on average. The Python dict type is an implementation.

graph

A general relation between a set and itself defined by a set of vertices and a set of edges, where each edge connects exactly two vertices. Graphs can be used to describe very general relationships among data.

tree

A particular sort of graph in which edges have a direction (a from and a to node), and each node is the origin of at most one edge. Trees can be used to describe many types of structured relationship. We will show how trees and related structures can be used in symbolic maths in Chapter 9.

## 5.8. Glossary¶

abstract data type

A mathematical type, defined independently of any concrete implementation in code. Contrast data structure

algorithmic complexity

A measure of the number of operations (time complexity) or amount of storage (space complexity) required by an algorithm or data structure. Algorithmic complexity is usually stated in terms of a bound given in big ‘O’ notation.

amortised complexity

The average complexity of an algorithm considered over a suitably large number of invocations of that algorithm. Amortised complexity takes into account circumstances where the worst case complexity of an algorithm is known to occur only rarely.

data structure

The concrete implementation of a data type in code. The data structure is the organisation of the actual information in the computer’s memory or on disk. Contrast abstract data type.

deque

A double ended queue. An abstract data type representing an ordered sequence in which objects can be added or removed at either end. A deque is a generalisation of both a stack and a queue.

dynamic array

A data structure for efficiently storing a variable length sequence of values. A fixed length piece of memory, called a buffer, is reserved for the sequence. If the number of items exceeds the size of the buffer then a larger buffer is reserved, the contents of the sequence are copied over, and the original buffer is returned to the system.

introspection

The ability to inspect the implementation of a program from inside that program while it is running.

queue
FIFO (first in, first out)

An abstract data type representing an ordered sequence of objects in which objects are accessed in the order in which they were added.

ring buffer

A generalisation of a dynamic array in which two ends of the memory buffer are considered connected in order to enable the sequence to be efficiently lengthened or shortened at either end.

separation of concerns

A design principle under which individual components each address a specific well defined need and communicate through well defined interfaces with other components. Separation of concerns enables reasoning about one part of a problem independently of other parts.

stack
LIFO (last in, first out)

An abstract data type representing an ordered sequence of objects, in which only the most recently added object can be directly accessed.

worst case complexity

An upper bound on the algorithmic complexity of an algorithm. Many algorithms have a relatively low algorithmic complexity most of the times they are run, but for some inputs are much more complex. amortised complexity is a mechanism for taking into account the frequency at which the worst case complexity can be expected to occur.

## 5.9. Exercises¶

Using the information on the book website obtain the skeleton code for these exercises. You will also need to install the pytest-timeout package.

Exercise 5.4

In the exercise repository, create a package called adt_examples with a module called adt_examples.fibonacci. Make the package installable and install it in editable mode. Create a class Fib implementing the iterator protocol which returns the Fibonacci numbers. In other words, the following code should print the Fibonacci numbers under 100:

from adt_examples.fibonacci import Fib

for n in Fib():
print(n)
if n >= 100:
break


Obviously the Fibonacci sequence is infinite, so your iterator will never raise StopIteration. Make sure that calculating the next number is always a $$O(1)$$ operation: don’t recalculate from 1 each time.

Exercise 5.5

In the exercise repository, create a module adt_examples.rpcalc containing a class RPCalc implementing a reverse Polish calculator. The calculator should have the following methods:

push(n)

This takes a single argument. If it is a number then it should be pushed onto the calculator’s internal stack. If it is the string for a recognised operator, then the appropriate number of operands should be popped from the internal stack, and the result pushed back on the stack. Your calculator should support the following operators: "+", "-", "*", "/", "sin", "cos". This method should not return anything.

pop()

This method, which takes no arguments, should pop the top item on the internal stack and return it.

peek()

This method, which takes no arguments, should return the top item on the internal stack without popping it.

__len__()

This is the __len__() special method, which takes no arguments but returns the length of the object. In this case, the length of the calculator is defined to be the number of items on its internal stack.

Exercise 5.6

In the exercise repository, create a module adt_examples.deque containing a class Deque implementing a deque. Your implementation should use a ring buffer implemented as a Python list. In order to make things somewhat simpler, we will use a fixed size ring buffer, which doesn’t grow and shrink with the queue. The constructor of your Deque should take a single integer argument which is the size of the list you will use as your ring buffer.

Implement the following methods:

append(x)

Append x to the end of the Deque

appendleft(x)

Append x to the start of the Deque

pop()

Remove the last item in the Deque and return it.

popleft()

Remove the first item in the Deque and return it.

peek()

Return the last item in the Deque without removing it.

peekleft()

Return the first item in the Deque without removing it.

__len__()

The __len__() special method. This should return the number of items currently in the Deque.

In addition to the above methods, you should ensure that Deque implements the iterator protocol. This should return the items in the queue, starting from the first to the last. Iterating over the Deque should not modify the Deque.

Hint

You can create a list of length n containing only None using the following syntax:

l = [None] * n


The modulo operator, % and integer division operator // are also likely to be very useful.

Hint

When removing an item from the Deque, it is important to actually overwrite the place in the ring buffer occupied by that item, perhaps with None. Failing to do so can cause a program to “leak” memory (i.e. fail to free memory that is no longer needed).

Note

You may not use collections.deque to implement this exercise.

Footnotes

1

https://object-oriented-python.github.io/edition2/exercises.html