5. Abstract data types¶
In Chapter 3, we introduced the concept of type as an abstraction comprising a set of possible values, and a set of operations. All types in Python are abstractions in this sense because we deal with them in terms of their defined properties rather than their actual implementation. However, it can also be useful to define purely mathematical types, distinct from their concrete realisation in code. Indeed, there are typically multiple possible concrete realisations of a mathematical idea.
An abstract data type is a purely mathematical type, defined independently of its concrete realisation as code.
Abstract data types enable the programmer to reason about algorithms and their cost separately from the task of implementing them. In contrast, we can also define the concrete realisation of a data type:
A data structure describes the concrete implementation of a data type: which data is stored, in what order, and how it is interpreted.
In understanding abstract data types, it is often useful to think about the data structures which could be used to implement them, but there is also value in understanding an abstract type purely in terms of its mathematical properties.
5.1. Stacks¶
Possibly the simplest abstract data type which is not synonymous with a Python type is the stack. A stack is a sequence of objects in which only the most recently added object is accessible. The image to have in mind is a stack of plates on a spring-loaded holder of the type found in many university and workplace canteens. Each time a plate is added to the stack, the whole pile is pushed down to keep the top of the stack in place. If the top plate is removed, then the whole stack pops back up. An alternative name for a stack is a LIFO (last in, first out), because the last object added to the stack is the first object retrieved.
Recall that a type is defined by a set of possible values and a set of
operations. The value of stack is an ordered sequence of objects of any type.
The operations are push
to add a new object to the sequence, and pop
to
return the most recently added object, and remove it from the sequence.
Fig. 5.1 shows these operations. It is also common to add an
additional operation of peek
, which returns the most recently added object
without removing it from the stack.
5.1.1. An example: reverse Polish notation¶
Reverse Polish notation, or postfix notation, is a way of writing mathematical operations without using operator precedence or brackets in order to determine the order of operations. This makes the algorithm for evaluating expressions written in reverse Polish notation arithmetic particularly simple. Reverse Polish calculators require fewer button pushes for complex calculations and were popular in the 1970s. They are still available, most famously from HP. In a more current example, the PostScript language used to describe documents for printers is written in reverse Polish notation.
In reverse Polish notation, the operator follows its operands. For example to add the numbers one and two, one would write \(1\ 2\ +\). Formally, a reverse Polish calculator comprises a set of numbers, a set of operators (each of which takes a fixed number of arguments), and a stack. Each number encountered in the expression is pushed onto the stack, while each operator pops the right number of arguments off the stack and pushes the result onto the stack. At the end of the calculation, the result of the calculation is on the top of the stack. Listing 5.1 shows pseudocode for a reverse Polish calculator.
1for item in inputs:
2 if item is number:
3 stack.push(number)
4 elif item is operator:
5 operand2 = stack.pop()
6 operand1 = stack.pop()
7 stack.push(operand1 operator operand2)
8return stack.pop()
Notice that we pop the second operand before the first. This is because \(4\ 2\ -\) means \(4 - 2\), not \(2 - 4\). Table 5.1 shows how a reverse Polish calculator would evaluate an arithmetic expression.
Expression |
Stack |
Action |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
5.1.2. Implementing stacks in Python¶
While it is strictly true that Python does not have a stack type, the
list
class functions as a perfectly good stack. The
relationship between the two is shown in Table 5.2.
Stack operation |
List operation |
Description |
---|---|---|
|
|
Add |
|
|
Return and remove the top item on the stack. |
|
|
Return the last item on the stack, but leave the stack unchanged. |
|
Return the number of items on the stack. Not a strictly required stack operation, but often useful. |
5.2. Separation of concerns¶
At first sight, discussions of abstract data types can seem like a
complication of what, at the end of the day, are just operations on
some objects. Instead of talking about stacks, why don’t we just say
that a reverse Polish calculator can be implemented using a
list
?
The critical conceptual difference here is that a
list
is a Python construct, while a stack is a mathematical
concept with universal applicability. If you understand the concept of
a stack, then you will be able to use this to design algorithms and
write programs in other languages where the concrete implementation
might be a different type, or you might have to create your own stack
from lower-level types and operations.
This is an example of a fundamental computer science concept called separation of concerns. Separation of concerns is a design principle that underpins much of what is considered to be good practice in programming. The idea is to divide larger tasks into smaller units, each responsible for doing one thing (addressing one concern). Different units communicate with each other using mathematically well-defined interfaces. This makes the internal design of each unit more-or-less independent of the other units. Why is this important? There are two key reasons. The first is that in programming, as in maths, complexity is the enemy of understanding. Directly addressing a large and complex problem is likely to result in a large and complex piece of code which nobody understands. Such a program will almost inevitably produce the wrong answer, and finding out what is wrong will be exceptionally difficult.
Abstract data types provide part of the mathematical interface that separates different concerns. The user of an abstract data type has an object with a simple set of operations which is easy to reason about. At the same time, the implementer of an abstract data type only has to provide an object with the required methods: they do not have to reason about all the ways in which that object might be used. By learning to think about programming in terms of abstract types and objects, you will become a better programmer who can address more complex programming tasks.
5.3. Algorithmic complexity¶
Video: dynamic arrays and algorithmic complexity.
Imperial students can also watch this video on Panopto.
The second reason that understanding abstract data types is important is that a good implementation of a well-designed abstract data type will have well-defined performance characteristics. In particular, the optimal algorithmic complexity, expressed in big \(O\) notation, of operations on abstract data types will be known. Recall the definition of big \(O\):
Let \(f\), \(g\), be real-valued functions. Then:
if there exists \(M>0\) and \(N>0\) such that:
We use \(n\) rather than \(x\) as the independent variable, because we are primarily interested in characterising the number of primitive operations or the amount of memory that an algorithm will use as a function of the number of objects stored in the relevant abstract data type.
For example, in the Python list
implementation, all of
the stack operations are, on average, \(O(1)\). This means that
each of pushing, popping, and peeking has an approximately fixed cost
that does not depend on the current size of the stack. This does not
obviously have to be the case, especially for the push and pop
operations, which modify the stack. Listing 5.2 provides an
implementation of a stack in which the data is stored as a Python
tuple
. Here, every time item is pushed onto or popped from
the stack, a new copy of the tuple
has to be made. This
touches every one of the \(n\) items currently in the stack, and
therefore costs \(O(n)\) operations. It is often useful to
distinguish between time complexity, which is an indication of the
number of operations required to execute an algorithm, and space
complexity, which measures the peak memory usage of an algorithm or
data structure.
1class BadStack:
2 def __init__(self):
3 self.data = ()
4
5 def push(self, value):
6 self.data += (value,)
7
8 def pop(self):
9 value = self.data[-1]
10 self.data = self.data[:-1]
11 return value
12
13 def peek(self):
14 return self.data[-1]
Note
You may already have seen big O notation in numerical analysis. The
distinction is that in analysing algorithmic complexity, the limit is taken
as \(n\) approaches infinity, while in numerical analysis the
independent variable approaches 0. This difference between two closely
related fields is often confusing, particularly since both disciplines
conventionally leave out the limit. It’s worth keeping in mind the
difference, because a numerical algorithm with \(O(h^4)\) error is
really rather good since h
is small, but an algorithm with \(O(n^4)\)
cost is very expensive indeed!
5.3.1. Amortised complexity and worst case complexity¶
The actual implementation of a list
is of a contiguous
sequence of locations in memory, each of which can hold a reference to
a Python object. How, then, can appending an item to a list work? The
next location in memory might already be in use for some other
data. The obvious naïve implementation would be to allocate a new
contiguous block of memory, one location longer than the previous one,
and copy the existing values into that before placing the appended
value in the final location. This amounts to the approach in
Listing 5.2, with the result that appending an item to a list
would have a time complexity of \(O(n)\).
In fact, this is not how Python lists are implemented. Instead of only allocating the exact amount of memory needed, Python allocates a bit more and keeps track of how many memory locations are currently in use to implement the list. Only when all the current memory locations are full does a further append operation cause Python to allocate more memory. The amount of memory allocated is approximately proportional to the current length of the list. That is, if the current list length is \(n\) then the new memory allocation will be of size approximately \(kn\) for some \(k>1\). This data structure is called a dynamic array. Fig. 5.2 illustrates its operation.
What does this memory allocation strategy mean for the computational complexity of appending items to the list? There are two cases. If there is a spare location for the appended value, then a reference to the value is simply inserted into that location. The cost of this does not depend on the current length of the list, so it’s \(O(1)\). If all of the allocated memory locations are now in use, then a new chunk of memory is allocated, and the existing values are copied there. This is an \(O(n)\) operation. However, this \(O(n)\) operation only occurs when the list has to be extended. How often is that? Suppose the list has just been reallocated (at a cost of \(O(n)\)). The new memory allocation is \(kn\) large, but we’ve already used \(n\) locations so we get \((k-1)n\) more cheap \(O(1)\) append operations before we have to reallocate again. \((k-1)n = O(n)\) so this means that adding \(O(n)\) items to the list costs:
If appending \(O(n)\) items to a list has a time complexity of \(O(n)\), it follows that the cost of appending one item to a list, averaged over a suitably large number of operations, is \(O(1)\). This measure of complexity, in which the cost of occasional expensive operations is considered averaged over a large number of operations, is called amortised complexity. In contrast, the occasional reallocate and copy operation is an example of the worst case complexity of the algorithm. Appending an item to a list has an amortised time complexity of \(O(1)\) but a worst-case time complexity of \(O(n)\).
We can use Python’s introspection capabilities to illustrate how the
dynamic allocation of space for a list works as the list is appended. The
sys.getsizeof()
function returns the amount of computer memory that an
object consumes. The function in Listing 5.3 uses this to diagnose the
memory consumption of progressively longer lists, and Listing 5.4
demonstrates this.
1import sys
2
3def byte_size(n):
4 """Print the size in bytes of lists up to length n."""
5 data = []
6 for i in range(n):
7 a = len(data)
8 b = sys.getsizeof(data)
9 print(f"Length:{a}; Size in bytes:{b}")
10 data.append(i)
In [1]: from example_code.linked_list import byte_size
In [2]: byte_size(20)
Length:0; Size in bytes:56
Length:1; Size in bytes:88
Length:2; Size in bytes:88
Length:3; Size in bytes:88
Length:4; Size in bytes:88
Length:5; Size in bytes:120
Length:6; Size in bytes:120
Length:7; Size in bytes:120
Length:8; Size in bytes:120
Length:9; Size in bytes:184
Length:10; Size in bytes:184
Length:11; Size in bytes:184
Length:12; Size in bytes:184
Length:13; Size in bytes:184
Length:14; Size in bytes:184
Length:15; Size in bytes:184
Length:16; Size in bytes:184
Length:17; Size in bytes:256
Length:18; Size in bytes:256
Length:19; Size in bytes:256
5.4. Queues and deques¶
A queue is, like a stack, an ordered sequence of objects. The difference is that the only accessible item in the sequence is the earliest added. Items can be added to the back of the queue and taken from the front. As with a stack, the optimal implementations of item insertion and removal are \(O(1)\).
A deque (Double Ended QUEue) is a generalisation of a queue to permit
adding and removing items at either end. The observant reader will note that
stacks and queues are both special cases of deques. Python’s standard library
contains the collections.deque
class, providing a simple and efficient
implementation of a deque.
5.4.1. Ring buffers¶
How might one go about implementing a deque? A dynamic array allows values to be appended with \(O(1)\) complexity, but doesn’t offer an efficient mechanism for prepending values. One might think that the natural solution for this would be to create a double-ended dynamic array: a buffer with spare space at each end. Unfortunately this is not optimally efficient in the case where the deque is used to implement a queue of approximately constant length. In that case, values are consistently added at one end of the data structure and removed from the other. Even in the case of a double-ended dynamic array, the buffer space at the append end of the queue will constantly run out, necessitating an expensive copy operation. The solution is to use a dynamic array, but to logically join up its ends, so that the first position in the buffer follows on from the last. Only in the case where all positions in the buffer are full will the buffer be reallocated. This data structure is called a ring buffer.
Fig. 5.3 shows a ring buffer being used as a queue. At each step,
an object is appended to the end of the queue, or removed from its start. At
step 7, the contents of the buffer wrap around: the queue at this stage contains
D, E, F
. At step 9 there is insufficient space in the buffer to append G
, so
new space is allocated and the buffer’s contents copied to the start of the new
buffer.
5.5. Linked lists¶
One disadvantage of a deque (and hence of a stack or queue) is that inserting an object into the middle of the sequence is often an \(O(n)\) operation, because on average half of the items in the sequence need to be shuffled to make space. A linked list provides a mechanism for avoiding this. A singly linked list is a collection of links. Each link contains a reference to a data item and a reference to the next link. Starting from the first link in a list, it is possible to move along the list by following the references to successive further links. A new item can be inserted at the current point in the list by creating a new link, pointing the link reference of the new link to the next link, and pointing the link reference of the current link to the new link. Fig. 5.4 shows this process, while Listing 5.5 shows a minimal implementation of a linked list in Python. Notice that there is no object for the list itself: a linked list is simply a linked set of links. Some linked list implementations do store an object for the list itself, in order to record convenient information such as the list length, but it’s not strictly necessary.
1class Link:
2 def __init__(self, value, next=None):
3 self.value = value
4 self.next = next
5
6 def insert(self, link):
7 '''Insert a new link after the current one.'''
8 link.next = self.next
9 self.next = link
Linked lists tend to have advantages where data is sparse. For example, our
implementation of a Polynomial
in
Chapter 3 would represent \(x^{100} + 1\) very
inefficiently, with 98 zeroes. Squaring this polynomial would cause tens of
thousands of operations, almost all of them on zeroes. Conversely, if we
implemented polynomials with linked lists of terms, this squaring operation
would take the handful of operations we expect.
A doubly linked list differs from a singly linked list in that each link contains links both to the next link and to the previous one. This enables the list to be traversed both forwards and backwards.
A deque, and therefore a stack or a queue can be implemented using a linked list, however the constant creation of new link objects is typically less efficient than implementations based on ring buffers.
5.6. The iterator protocol¶
The abstract data types we have considered here are collections of objects, and one common abstract operation which is applicable to collections is to iterate over them. That is to say, to loop over the objects in the collection and perform some action for each one. This operation is sufficiently common that Python provides a special syntax for it: the for loop. You will already be very familiar with looping over sequences such as lists:
In [1]: for planet in ["World", "Mars", "Venus"]:
...: print(f"Hello {planet}")
...:
Hello World
Hello Mars
Hello Venus
Python offers a useful abstraction of this concept. By implementing the correct special methods, a container class can provide the ability to be iterated over. This is a great example of abstraction in action: the user doesn’t need to know or care how a particular container is implemented and therefore how to find all of its contents, they can simply write a for loop to access every item in turn.
There are two special methods required for iteration.
Neither take any arguments beyond the object itself. The first,
__iter__()
, needs to be implemented by the container type. Its
role is to return an object which implements iteration. This could be the
container itself, or it could be a special iteration object (for example
because it is necessary to store a number recording where the iteration is up
to).
The object returned by __iter__()
is called an iterator. It
also needs to implement __iter__()
(for example it could simply
return self
). In addition, it needs to implement the
__next__()
method. This is called by Python repeatedly to
obtain the next object in the iteration sequence. Once the sequence is
exhausted, subsequent calls to __next__()
should raise the
built-in StopIteration
exception. This tells Python that the iteration
is over. This arrangement is called the iterator protocol, and it’s further
documented in the official Python documentation.
Hint
Raising exceptions is the subject of Section 6.4,
to which we will turn presently. For current purposes, it is
sufficient to know that iteration is halted when __next__()
executes this line of code:
raise StopIteration
Let’s suppose we want to make the linked list in Listing 5.5
iterable. We’ll need to make another object (an iterator) to keep track of
where we are in the list at each point in the iteration.
Listing 5.6 shows the code. The iterator class
LinkIterator
is never seen by the user, it’s just there to keep track
of the iteration.
1class Link:
2 def __init__(self, value, next=None):
3 self.value = value
4 self.next = next
5
6 def insert(self, link):
7 '''Insert a new link after the current one.'''
8 link.next = self.next
9 self.next = link
10
11 def __iter__(self):
12 return LinkIterator(self)
13
14
15class LinkIterator:
16 def __init__(self, link):
17 self.here = link
18
19 def __iter__(self):
20 return self
21
22 def __next__(self):
23 if self.here:
24 next = self.here
25 self.here = self.here.next
26 return next.value
27 else:
28 raise StopIteration
As a trivial example, we can set up a short linked list and iterate over it, printing its values:
In [3]: linked_list = Link(1, Link(2, Link(3)))
In [4]: for l in linked_list:
...: print(l)
...:
1
2
3
Indeed, since Python now knows how to iterate over our linked list,
converting it to a sequence type such as a tuple
will now work
automatically:
In [5]: tuple(linked_list)
Out[5]: (1, 2, 3)
5.7. Other abstract data types¶
Here we have introduced in some detail a few relatively simple abstract data types that illustrate the distinction between the mathematical properties of a type and the concrete details of its implementation. There are many other abstract data types, some of which you will have already met, and we will encounter a few more in this course. For context, here are a few other examples.
- set
A set is an unordered collection of objects with the property that objects that compare equal can only occur once in the set. Inserting or accessing a set member has \(O(1)\) amortised complexity. Python provides the
set
built in class as an implementation.- dictionary or map
A generalised function in which unique (up to equality) keys are mapped to arbitrary values. Again, insertion and deletion cost \(O(1)\) operations on average. The Python
dict
type is an implementation.- graph
A general relation between a set and itself defined by a set of vertices and a set of edges, where each edge connects exactly two vertices. Graphs can be used to describe very general relationships among data.
- tree
A particular sort of graph in which edges have a direction (a from and a to node), and each node is the origin of at most one edge. Trees can be used to describe many types of structured relationship. We will show how trees and related structures can be used in symbolic maths in Chapter 9.
5.8. Glossary¶
- abstract data type¶
A mathematical type, defined independently of any concrete implementation in code. Contrast data structure
- algorithmic complexity¶
A measure of the number of operations (time complexity) or amount of storage (space complexity) required by an algorithm or data structure. Algorithmic complexity is usually stated in terms of a bound given in big ‘O’ notation.
- amortised complexity¶
The average complexity of an algorithm considered over a suitably large number of invocations of that algorithm. Amortised complexity takes into account circumstances where the worst case complexity of an algorithm is known to occur only rarely.
- data structure¶
The concrete implementation of a data type in code. The data structure is the organisation of the actual information in the computer’s memory or on disk. Contrast abstract data type.
- deque¶
A double ended queue. An abstract data type representing an ordered sequence in which objects can be added or removed at either end. A deque is a generalisation of both a stack and a queue.
- dynamic array¶
A data structure for efficiently storing a variable length sequence of values. A fixed length piece of memory, called a buffer, is reserved for the sequence. If the number of items exceeds the size of the buffer then a larger buffer is reserved, the contents of the sequence are copied over, and the original buffer is returned to the system.
- introspection¶
The ability to inspect the implementation of a program from inside that program while it is running.
- queue¶
- FIFO (first in, first out)¶
An abstract data type representing an ordered sequence of objects in which objects are accessed in the order in which they were added.
- ring buffer¶
A generalisation of a dynamic array in which two ends of the memory buffer are considered connected in order to enable the sequence to be efficiently lengthened or shortened at either end.
- separation of concerns¶
A design principle under which individual components each address a specific well defined need and communicate through well defined interfaces with other components. Separation of concerns enables reasoning about one part of a problem independently of other parts.
- stack¶
- LIFO (last in, first out)¶
An abstract data type representing an ordered sequence of objects, in which only the most recently added object can be directly accessed.
- worst case complexity¶
An upper bound on the algorithmic complexity of an algorithm. Many algorithms have a relatively low algorithmic complexity most of the times they are run, but for some inputs are much more complex. amortised complexity is a mechanism for taking into account the frequency at which the worst case complexity can be expected to occur.
5.9. Exercises¶
Using the information on the book website obtain the skeleton code for these exercises. You will also need to install the pytest-timeout package.
In the exercise repository, create a package called
adt_examples
with a module called
adt_examples.fibonacci
. Make the package installable and install it
in editable mode. Create a class Fib
implementing the iterator
protocol which returns the Fibonacci numbers. In other words, the following
code should print the Fibonacci
numbers under 100:
from adt_examples.fibonacci import Fib
for n in Fib():
print(n)
if n >= 100:
break
Obviously the Fibonacci sequence is infinite, so your iterator will never
raise StopIteration
. Make sure that calculating the next number is
always a \(O(1)\) operation: don’t recalculate from 1 each time.
In the exercise repository, create a module
adt_examples.rpcalc
containing a class RPCalc
implementing a
reverse Polish calculator. The calculator should have the following methods:
push(n)
This takes a single argument. If it is a number then it should be pushed onto the calculator’s internal stack. If it is the string for a recognised operator, then the appropriate number of operands should be popped from the internal stack, and the result pushed back on the stack. Your calculator should support the following operators:
"+"
,"-"
,"*"
,"/"
,"sin"
,"cos"
. This method should not return anything.pop()
This method, which takes no arguments, should pop the top item on the internal stack and return it.
peek()
This method, which takes no arguments, should return the top item on the internal stack without popping it.
__len__()
This is the
__len__()
special method, which takes no arguments but returns the length of the object. In this case, the length of the calculator is defined to be the number of items on its internal stack.
In the exercise repository, create a module
adt_examples.deque
containing a class Deque
implementing a
deque. Your implementation should use a ring buffer implemented
as a Python list. In order to make things somewhat simpler, we will use a
fixed size ring buffer, which doesn’t grow and shrink with the queue. The
constructor of your Deque
should take a single integer
argument which is the size of the list you will use as your ring buffer.
Implement the following methods:
append(x)
Append
x
to the end of theDeque
appendleft(x)
Append
x
to the start of theDeque
pop()
Remove the last item in the
Deque
and return it.popleft()
Remove the first item in the
Deque
and return it.peek()
Return the last item in the
Deque
without removing it.peekleft()
Return the first item in the
Deque
without removing it.__len__()
The
__len__()
special method. This should return the number of items currently in theDeque
.
In addition to the above methods, you should ensure that Deque
implements the iterator protocol. This should return the items in the queue,
starting from the first to the last. Iterating over the
Deque
should not modify the Deque
.
Hint
You can create a list of length n
containing only
None
using the following syntax:
l = [None] * n
The modulo operator, %
and integer division operator //
are also likely
to be very useful.
Hint
When removing an item from the Deque
, it is important to
actually overwrite the place in the ring buffer occupied by that item,
perhaps with None
. Failing to do so can cause a program to “leak”
memory (i.e. fail to free memory that is no longer needed).
Note
You may not use collections.deque
to implement this exercise.
Footnotes