CMSC 435  Algorithm Analysis & Design

2008 Final Exam Take Home Questions

 

1.      Given the following table used by an instance of the union-find algorithm (see back pages of the exam booklet for the union-find algorithm if this problem is on the exam).

 

index

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

parent

1

2

9

12

16

10

1

2

9

12

5

2

9

14

9

9

rank

1

3

0

0

1

0

0

0

3

1

0

2

0

0

0

2

 

The table is an array representation of a set of equivalence classes, each of whose members can be depicted as forming a tree.  

 

a.       Draw the tree structure for each equivalence class from the information contained in this table.

 

b.      Suppose the edge (6,11) is the next edge selected from the heap in Kruskal's algorithm.  What values are returned by find(6) and find (11) respectively?

 

c.       Does edge (6,11) create a cycle in the MST?

 

d.      Show how the table is changed, if at all, by the two find operations (and a possible union).  (You might find it useful to first show any changes to any of the tree structures that you drew in part (a) and then translate these changes to the table.

 

index

 1

 2

 3

 4

 5

 6

 7

 8

 9

10

11

12

13

14

15

16

parent

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

rank

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2.      Solve the following recurrence relations and give a Q bound for each of them

a.       T(n) = 2T(n/3) + n

b.      T(n) = 9T(n/3) + n2

c.       T(n) = 49T(n/25) + n3/2log n

 

3.      Write the code to determine the maximum value that can be placed in a knapsack of (integer) size W when selecting from a set of n items that may include duplicates.  Your algorithm must take as parameters the size W of the knapsack, the number of items in the set to be selected from and two vectors of size n:  double [] v -- the value of each of the items in the set, and int [] q -- the quantity or number of each of the items.  You are to maximize the value placed in the knapsack without exceeding its capacity W.  What two fields will you need in your table in order to determine the maximum value that the knapsack can hold and to be able to determine the contents that produce this value?

 

4.      When a new gene is discovered, a standard approach to understanding its function is to look through a database of known genes and find close matches.  The closeness of two genes is measured by the extent to which they are aligned.  to formalize this, think of a gene as being a long string over the alphabet S = {A, C, T, G}.  Consider two genes x = ATGCC and y = TACGCA.  An alignment is a way of matching up these two strings by writing them in columns, for instance:

 

                  _   A  T  _  G  C  C

                  T  A   _  C  G  C  A

 

Here the "_" indicates a gap.  The characters of each string must appear in order, and each column must contain a character from at least one of the strings.  The score of an alignment is specified by a scoring matrix of size (|S| + 1) x (|S| + 1); where the extra row and column are to accommodate gaps.  For instance, the preceding alignment has the following score:

 

                  d(_,T) + d(A,A) + d(T,_) + d(_,C) + d(G,G) + d(C,C) + d(C,A)

 

Write a dynamic programming algorithm that takes as input two strings            

x = [x1, …, xn] and y = [y1, …, ym] and a scoring matrix d and returns the highest scoring alignment.  The running time of this algorithm should be O(mn).  Determine the cost and alignment of the two strings given above produced by your algorithm.  (I got an equivalent, but different alignment from the one shown in the example when I simulated my algorithm.)

 

Hint.  Read the notes on edit distance posted on the website.  The scoring matrix should use the following weights: d(_, x) = 2; d(x,_) = 2; d(x, y) = 1 if x != y;

d(x,x) = 3  where x and y are any of the characters in the set {A, C, G, T}.  Is this problem also like one that we have already done in class? (yes)

 

5.      Given the following set of characters with their corresponding probabilities

 

a          0.32                 e          0.09

b          0.24                 f           0.04

c          0.15                 g          0.03

d          0.12                 h          0.01

 

a.       Compute the entropy of this source

b.      Construct a Huffman code for the same source, and compute its weighted average code length.

c.       Encode the string: beachhead using Huffman coding and the code giving equal number of bits to each character.

 

6.      You are interested in analyzing some hard-to-obtain data from two separate databases.  Each database contains n numerical values -- so there are 2n values total -- and you may assume that no two values are the same.  You'd like to determine the median value of this set of 2n values, which we will define here to be the nth smallest value.  However, the only way you can access these values is through queries to the databases.  In a single query, you can specify a value k to one of the two databases, and the chosen database will return the kth smallest value that it contains.  Since queries are expensive, you would like to compute the median using as few queries as possible.  Give an algorithm that finds the median value using at most O(log n) queries.

7.      Prove that IndependentSet(G, k) is in NP-Complete, where G is a graph, G = (V, E), and k is the size of an Independent Set V' contained in V. Show that 3-SAT ≤P IndependentSet.

 

8.      You are helping to run a high-performance computing system capable of processing several terabytes of data per day. For each of n days, you are presented with a quantity of data: on day t, you're presented with xt terabytes. For each terabyte you process, you receive a fixed revenue, but any unprocessed data becomes unavailable at the end of the day. Your system is running with software that is not totally reliable, and so the amount you can process goes down with each day that passes since the most recent reboot of the system. On the first day after a reboot, you can process s1 terabytes, on the second day after a reboot you can process s2 terabytes, and so on down to sn; we assume s1 > s2 > s 3> … > sn . (Of course you can only process up to xt terabytes on day t now matter how fast your system is.) To get the system back to peak performance, you can choose to reboot it, but on any day you choose to reboot the system, you can't process any data at all. Given the amounts of available data for the next n days x1 , x2 , x3, …, xn and given the profile of your system as expressed by s1 , s2 , s 3, …, sn (and from a freshly rebooted system on day 1) choose the days on which you're going to reboot so as to maximize the total amount of data you process. Give an efficient algorithm that takes values for x1 , x2 , x3, …, xn and s1 , s2 , s 3, …, sn and returns the total number of terabytes processed by an optimal solution.

Let P(1,n) be the maximum number of terabytes processed during the span from day 1 to day n, then we have

P(1, n) = max{P(1, n-2) + min(xn , s1 ), max1< k < n-1(P(1, k-1) + P(k+1, n), P(1, n-1) + min(xn,sn)}

Note. It does not make sense to reboot on day n and get no processing on the last day, therefore the above formula gives the optimal output as the larger of the optimal output for the first n-2 days with a reboot on day n-1 plus the maximal output on day n, or the maximal output for the span from days 1 to k -1 plus the maximal output for the days from k+1 to n where a reboot on day k is chosen to maximize this sum, or choosing not to reboot at all. A reboot on day k means that the number of bytes processed on day k + 1 equals min (s1, xk+1).  The same formula will hold for finding the maximum output for the interval between days i and j inclusive, where a reboot is assumed to have occurred before day i (and xj is the available work for day j). What is the value you will assign to each P(i, i)? Your dynamic programming algorithm will fill in your table by computing values determined from previous entries. (Is this problem similar to one(s) I already know how to solve?)

7.      Let G = (V, E) be an undirected graph with n nodes. Recall that a subset of the nodes is called an independent set if no two of them are joined by an edge. In general, finding an independent set in a graph is NP-Complete, but for this simple case an efficient algorithm can be found. Call a graph a path if its nodes can be written as v1 , v2 , …, vn with an edge between vi and vj if and only if i and j differ by exactly 1. With each node vi, we associate a positive weight wi . The problem is to find an independent set in the Graph G with maximal total weight.

a.       Give a counter example to show that the following algorithm does not always find an independent set of maximal total weight.

The “heaviest-first” greedy algorithm

Start with S equal to the empty set

While some node remains in G

            Pick a node vi of maximum weight

            Add vi to S

            Delete vi and its neighbors from G

End while

Return S

 

b.      Give a counter example to show that the following algorithm also does not always find an independent set of maximal total weight.

Let S1 be the set of all vi where i is an odd number

Let S2 be the set of all vj where j is an even number

//both S1 and S2 are independent sets

Determine which of S1, S2 has greater total weight and return that one

 

c.       Give an algorithm that takes an n-node path G with weights and returns an independent set of maximal total weight.  The runtime should be polynomial in n, independent of the given set of weights.  (Hint!  This is a variation of the knapsack problem where the items to be put into the sack are the vertices of the independent set and the capacity of the knapsack can reach a maximum of n/2.  The recurrence relation that you will write will determine the entry for cell i,j that will choose between adding the jth vertex to the independent set or leaving it out.  Which two previously filled cells will you need to examine to make this selection?  What fields will you need in each cell if you are top determine the contents of the maximal set as well as its cardinality and total weight?  Note also that treatment of the first vertex in the path is a special case that will need to be handled during initialization.)

 

d.      For the case of the graph depicted below, show the maximal set determined from your algorithm. (Show how the table constructed using the algorithm described in part c is filled when the input is the graph depicted below.)

Path graph.jpg

 

 

10.  Write an O(n lg(n)) algorithm to compute the union of two sets of integers.  The input consists of two arrays, each representing one of the sets.  In each of these (unsorted) arrays, each value appears only one time.  The output is an array representing the union of the two sets in which each value appears only one time.  If n is the sum of the sizes of the input arrays, the algorithm must run in time O(n lg(n)).

 

11.  Show that SUBSET-SUM is NP-Complete.  Describe a reduction for reducing any instance of 3-SAT into an instance of SUBSET-SUM in time O(p(n)), such that the instance of 3-SAT is satisfiable if and only if a solution to the SUBSET-SUM problem exists.  Using your reduction, reduce the following instance of 3-SAT into an instance of SUBSET-SUM that has a solution if and only if 3-SAT is satisfiable. (Note! You will have to generalize the example given in class.  The target and slack variables will have to take into account that a literal for each variable must appear in at least one clause and could possibly appear in all clauses.)

 

Φ = (x1, ~x2, x4) (x1, x2, ~x3) (~x1, ~x2, x5) (x2, ~x3, x5) (x2, x4, ~x5)

 

12.  Given a library class PrimHeap with the methods listed below, write a method for Dijkstra’s algorithm that determines the distance from a single source to every vertex in a graph G.  Call your method public void dijkstra(Graph g, Vertex v)  where the input parameters are a Graph g and the source Vertex v.  Assume that this method is contained in a class in which an int [n] dist holding place for the distance to each vertex from the source and an array of List<Vertex>[n] path, that can be used for holding the neighbors of each vertex visited on the tree of shortest paths to all vertices from the source, are instance variables global to method dijkstra.   The class Graph has methods:

            Iterator<Vertex>  iterator( )

            void reset( )     //resets all vertices to unvisited

           

            class Vertex has methods:

                        Iterator<Edge>  neighbors( )     //iterates over outgoing edges from Vertex

            boolean isVisited( )

            void visit( )

            int  getNum( ) //Each vertex is identified by a number

 

            class Edge has methods:

 

                        Vertex getSource( )

                        Vertex getDest( )

                        int getCost( )

 

            Prim Heap has these public methods:

 

                        PrimHeap(Tuple[ ] verts)          //the Heapify Constructor

                        PrimHeap( )                             //the Empty Heap Constructor

                        boolean isEmpty( )

                        void insert(Tuple v)

                        Tuple deleteMin( )

                        void decreaseKey(int vertNum, int value)

 

            where a Tuple is just a pair (Vert, int)  //a vertex and a cost (distance) to it

            assume you have getters and setters for accessing these two objects.

 

            Implement method dijkstra using these methods and show that its run-time cost is O(|V| lg(|V|))