Graph problems are a staple of the programming interview. Luckily, most graph problems you’ll encounter in an interview can be solved by one of these two algorithms: depth-first search (DFS) or breadth-first search (BFS). The hard part isn’t in writing the algorithm; it’s in figuring out how you can apply the algorithm to your problem.

Today, we’ll look at a common graph interview question that showcases an one of the two search algorithms above.

## Problem: Given a root file, find a valid ordering to compile the file and all of its dependencies.

Suppose we want to compile a file `main.c.`

This file might depend on some additional files, such as `stdio.h`

or `math.h`

. In order to compile our original file, we need to compile `stdio.h`

and `math.h`

first. Those two files, in turn, may require other files, such as `stdlib.h`

or `assert.h`

, to be compiled first, and so on.

If we represent each file as a node and each edge as a dependency, we will end up with a directed graph like the one below. Each arrow points to a prerequisite file that must be compiled before we can compile the file at the base of the arrow.

In this example graph, we want to compile `main.c`

, which requires us to compile `stdio.h`

, `stdlib.h`

, `time.h`

, and `assert.h`

in some order. Importantly, **n****ot all orderings work**.

For this hypothetical example, compiling `stdlib.h`

before `assert.h`

will fail, presumably because `stdlib.h`

uses a function from the `assert.h`

library. However, we **can** compile `assert.h`

before `time.h`

, because `assert.h`

does not depend on `time.h`

. Finding this ordering is the crux of the problem.

First, let’s try to find a valid ordering for the example graph above. We can start by compiling the files that have no dependencies: `assert.h`

and `time.h`

.

Next, we look for all files that depend only on the files we already compiled. In this example, it would be `stdlib.h`

.

Now, we’re left with two files left to compile `stdio.h`

and `main.c`

. The root file `main.c`

depends on `stdio.h`

, so we can’t compile it yet. However, we can compile `stdio.h`

, because it only depends on `stdlib.h`

.

Finally, there is one file left, `main.c`

. Since we’ve compiled all of its dependencies, we can now compile `main.c`

.

In the end, we found that we can compile `main.c`

by compiling the dependencies in this order:

`time.h`

`assert.h`

`stdlib.h`

`stdio.h`

`main.c`

At a high level, our approach was to compile a file’s dependencies before compiling the current file, and then repeat that process until every file has been compiled. To compile a file’s dependencies, we have to look at the dependencies’ dependencies and compile those first, and so on.

We can convert this approach into graph search language. When we visit a node, we want to explore all of its children nodes. For each child node, we want to explore all of the nodes reachable from that child before moving on to the next child. The reason why we have to explore the reachable nodes from a given child first is because those are all dependencies we need to compile before we can compile the child node. This is, at its core, depth-first search.

Of course, there are some subtleties here. This problem is asking for an ordering of the nodes, but DFS doesn’t return anything. Nevertheless, looking at the psuedocode for DFS is still a good starting point.

DFS: if current node has not been visited before: mark current node as visited for each child node call DFS on child node

Now, let’s think about how we can modify DFS to create this ordered list. Where do we add nodes to this ordered list?

## STOP. Try to figure this out yourself before moving on.

In order to compile the node we are currently at, we must first compile all of its dependencies (children nodes). Once we’ve compiled every dependency/explored each child node, we know that we’ve met all of the requirements to compile the current node.

This suggests that we should add nodes to the ordered list after we recurse on all of the children. Below is what the pseudocode would look like.

DFS: if current node has not been visited before: mark current node as visited for each child node call DFS on child node add current node's filename to ordered list

If you remain unconvinced, I would recommend running this algorithm on the example above. Assuming the directed graph does not have a cycle, this algorithm is guaranteed to work.

Now that we have a good handle on our algorithm, let’s turn this into code. We will represent the nodes in the graph as an object with three fields: `name`

, `has_visited`

, and `children`

. The `name`

and `has_visited`

fields are self-explanatory, and the `children`

field is a list of other node objects.

In the example below, `node`

has one outgoing edge to `node2`

, and `node2`

has no outgoing edges.

node2 = { "name": "time.h" "has_visited": True, "children": [] } node = { "name": "main.c", "has_visited": False, "children": [node2], }

The topological sort pseudocode we wrote above requires two pieces of information: 1) The current node 2) The order to compile the files. Since the only information we expect from the user is the initial node, we can define a helper function that will also keep track of the compilation order.

# This code assumes the directed graph has no cycles def topological_sort(node): compile_order = [] topological_sort_helper(node, compile_order) return compile_order def topological_sort_helper(node, compile_order): if not node.visited: node.visited = True for child in node.children: topological_sort_helper(child, compile_order) compile_order.append(node.name)

## Additional Questions

### What if the graph has a cycle?

A cycle means that you can start at one of the nodes in the graph, and then take a path that leads you back to the starting node. This may be kind of abstract, so take a look at the example below.

There is a cycle in this graph because, starting from `assert.h`

, we can go to `time.h`

, and then back to `assert.h`

. Concretely, what this means is that, if there is a cycle, then there is no way to compile the file and its dependencies. The example above highlights the impossibility of compilation. In order to compile `assert.h`

, we need to compile `time.h`

. But in order to compile `time.h`

, we need to compile `assert.h`

.

When the graph has a cycle, our code still returns a list of file names, which does not make sense. Instead, the code should raise an exception or alert the user that a topological ordering does not exist.

How can we detect cycles in the graph? (Hint: use DFS)

A simpler question you could ask is this: If I gave you a node in the graph, can you tell me if there’s a path that takes me back to the same node?

One thing we could do is mark the current node with some special color, let’s say red. We want to check if this node is reachable from any of its children, so we call DFS on each of its children. If any of its children reach a node and see that it’s marked red, we know that there is a cycle. If there were no cycles, then none of the children nodes would be able to reach a node marked red.

Once we detect a cycle, we can raise an exception that explains that the graph has a cycle and thus doesn’t have a valid compilation order.

I’ll leave the code as an exercise for you (Email me if you’re stuck).

### Does BFS work?

Nodes that are further away from the root node tend to be compiled earlier. This intuition may lead you to ask if we can modify BFS to give a valid topological ordering. For example, we could run BFS, store each node we explore in a list, and then reverse the list. BFS visits each node in the order of their distance from the root node, so reversing the list would bring the furthest nodes from the root to the front of the list.

BFS: add root node to queue while queue is not empty take node from front of queue add node to ordered list reverse ordered list

Consider the example below.

If we run BFS on `main.c`

, there are two possible orderings, either `main.c, stdlib.h, time.h, assert.h`

, or `main.c, time.h, stdlib.h, assert.h`

. At any rate, the last file BFS finds is `assert.h`

, which would imply that we should compile `assert.h`

first. However, in this picture, `assert.h`

depends on `time.h`

, so we actually need to compile `time.h`

first. Neither possible output of our modified BFS will give an ordering where we compile `time.h`

before anything else, so this approach does not work.

## Leave a Reply