\documentclass[12pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{amsmath,amsthm,amssymb,amsfonts}
\usepackage{graphicx,epstopdf,fancyhdr,xspace,algorithm,algorithmic}
\usepackage[space]{grffile}
\pagestyle{fancy}
\usepackage{epsfig}
\usepackage[space]{grffile}
\usepackage{titlesec}
\usepackage[table]{xcolor}
\usepackage{listings}
\usepackage{verbatim}
\usepackage[hyphens]{url}
\usepackage{enumitem} % package used to generate bulleted lists
\usepackage{tikz}
\usetikzlibrary{arrows,automata}
\theoremstyle{definition}
\renewcommand{\epsilon}{\varepsilon}
\newtheorem{problem}{Problem} %
\cfoot{\thepage}
\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\headwidth}{\textwidth}
%\renewcommand{\footwidth}{\textwidth}
%\addtolength{\headwidth}{\marginparwidth}
%\addtolength{\headwidth}{\marginparsep}
\renewcommand{\footrulewidth}{0.4pt}
\newcommand{\mymin}{\ensuremath{\mbox{\sc FindMin}}\xspace}
\newcommand{\ms}{\ensuremath{\mbox{\sc Mergesort}}\xspace}
\newcommand{\qs}{\ensuremath{\mbox{\sc Quicksort}}\xspace}
\newtheorem{claim}{Claim}
\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{observation}{Observation}
\newtheorem{question}{Problem}
\titleformat*{\section}{\large\bfseries}
\newcommand{\N}{\mathbb N} %you can define shorthands this way, here we are defined \N as a shorthand for math bold N for natural numbers
\newenvironment{solution}{\bigskip\noindent{\em Solution.} \ignorespaces}{\bigskip}
\usepackage{hyperref}
\hypersetup{
colorlinks=true,
linkcolor=blue,
filecolor=magenta,
urlcolor=cyan,
}
\urlstyle{same}
\PassOptionsToPackage{hyphens}{url}
\newcommand{\homework}[6]{
\pagestyle{myheadings}
\thispagestyle{plain}
\newpage
\setcounter{page}{1}
\noindent
\begin{center}
\framebox{
\vbox{\vspace{2mm}
\hbox to 6.28in { {\bf CS358: Applied Algorithms\hfill} }
\vspace{6mm}
\hbox to 6.28in { {\Large \hfill #1 \normalsize{(#2)} \hfill} }
\vspace{6mm}
\hbox to 6.28in { {\it Instructor: #3 \hfill } }
%\hbox to 6.28in { {\it TA: #4 \hfill #6}}
% \vspace{2mm}
}
}
\end{center}
\markboth{#1}{#1}
\vspace*{4mm}
}
\definecolor{grayX}{gray}{0.95}
\begin{document}
\homework{Assignment 2: Space-Efficient Edit Distance}{due 9/29/21}{Sam McCauley}{}{}{}
\section*{Instructions}
All submissions are to be done through github. This process is detailed in the handout ``Handing In Assignments'' on the course website. Answers to the questions below should be submitted by editing this document. All places where you are expected to fill in solution are marked in comments with ``FILL IN.''
Please contact me at \url{srm2@cs.williams.edu} if you have any questions or find any problems with the assignment materials.
\section*{Problem Description}
\textsc{Input:} The input consists of a sequence of tests. Each test begins with a line that has four numbers on it. These numbers are the length of \texttt{string1}, the length of \texttt{string2}, the length of the intended solution string, and finally the size of the alphabet. Following this line, \texttt{string1} is listed in 80-character lines, followed by \texttt{string2} and finally the intended solution. These strings each have brief comments (to make it more human-readable) that are ignored by the input reader provided to you.
\bigskip
\noindent
\textsc{Output:} The output is a string describing a sequence of edits. Each character in the string should be \texttt{`i'}, \texttt{`r'}, \texttt{`d'}, or \texttt{`m'}. The string should be null-terminated.
\bigskip
\noindent
\textsc{Goal:} The output is a string describing how \texttt{string2} can be modified to obtain \texttt{string1}. The output is a string of characters, where each character is \texttt{`i'}, \texttt{`r'}, \texttt{`d'}, or \texttt{`m'}, representing an insert, replacement, delete, or match, respectively.
Thus, if \texttt{string2} is \texttt{ac}, and \texttt{string1} is \texttt{a}, the optimal output string is \texttt{md}.
\textbf{Please pay attention to ties.} Your algorithm should favor going as far right in the dynamic programming table as possible, if \texttt{string1} is represented vertically and \texttt{string2} is represented horizontally.
\begin{itemize}
\item This means that when implementing the recursive algorithm, if there are multiple splits that have the same cost, you should take the rightmost option.
\item Equivalently, if you are implementing a non-recursive algorithm, if there is a tie while backtracking you should use an insert whenever possible, then a match or replace, and finally a delete.
\item If your algorithm breaks ties correctly, if \texttt{string1} is \texttt{aaa} and \texttt{string2} is \texttt{aa}, it should output \texttt{mmi} (not \texttt{mim} or \texttt{imm}). The input file \texttt{testBaseCases.txt} has more examples to help you ensure that your algorithm breaks ties correctly.
\end{itemize}
\section*{Testing Parameters}
The \texttt{main()} method of the testing program (in \texttt{test.c}) takes two arguments, each of which is a file containing edit distance instances. You can test your program by first running \texttt{make}, and then running \texttt{./test.out testData.txt timeData.txt}.
\begin{itemize}
\item All instances on this assignment will have an alphabet of size $4$ or $256$; in either case the input will be taken as a normal array of \texttt{char}s. You may optimize your solution under these assumption (so it's OK if your algorithm fails with alphabets that are of any size other than $4$ or $256$).
\item There are several testing and timing instances provided. Your performance will be judged based on the total time across three instances. The three testing instances will look very similar to those provided in \texttt{timeData.txt}.
\item All 4-character instances have \texttt{string1} as an actual subsequence of human DNA, and \texttt{string2} as a perturbed version of this DNA. All 256-character instances have \texttt{string1} consist of English text (including punctuation), and \texttt{string2} as a perturbed version of this text.
\item Tiebreaking is unfortunately necessary and unavoidable even in the large instances.
\item Two solutions with running times within $.1$ seconds of each other will be considered tied for the purposes of this assignment.
\end{itemize}
\section*{Questions}
\textbf{Code}~(50 points). Implement the standard dynamic program for edit distance. Then, implement Hirschberg's space-efficient algorithm. Please give a very brief (2-3 sentence) description of your implementation below.
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\begin{question}[10 points]
Use \texttt{cachegrind} to compare the performance of your space-inefficient approach, and your approach using Hirschberg's algorithm. Interpret the data: which has more (data) cache misses? What level of cache is incurring those misses?
Describe how you performed your tests, and include the output of your tests.
\end{question}
\begin{solution}
%FILL IN YOUR ANSWER HERE
%Latex hint: it's not required, but may be useful to use a verbatim environment:
% \begin{verbatim}
% \end{verbatim}
% to include this data. Anything between those two commands will be printed verbatim to the pdf. (Make sure it fits on the page, however.)
\end{solution}
\subsubsection*{External Memory}
\begin{question}[\textbf{Extra credit}: 15pts]
Give an algorithm that can find the edit distance (not necessarily the sequence of edits) between strings of length $n$ and length $m$ in $O(nm/MB + (n + m)/B)$ I/Os. (Your algorithm should work even for very large $m$ and $n$.) Analyze your algorithm, explaining why it meets this bound.
\end{question}
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\subsubsection*{Shelving Books with Labels}
Let's say you work in a library. You have to put $m$ books (let's call them $b_1,\ldots, b_m$) on $n \leq m$ shelves (which we'll call $s_1,\ldots s_n$). The books are numbered using the Dewey Decimal system, and must be placed in order starting on shelf $s_1$. Furthermore, \textbf{there may not be an empty shelf between two shelves that contain books}.\footnote{Your boss at the library is a stickler for aesthetics.} For example, if book $10$ goes on shelf $2$, book $11$ must go on shelf $2$ or shelf $3$. Each shelf can hold any number of books; even all $m$ books may be placed on a single shelf.
This would normally be fairly easy---for example, you could just put all books on the first shelf. Unfortunately, this library also keeps track of $k$ \emph{topics} to help people browse for books they may be interested in.
Each shelf $s_j$ has a label $\ell_j$ representing the topics of books on that shelf. Similarly, each book $b_i$ has a list of topics $t_i$ representing what topics are covered in that book.
You were instructed to reprint all the labels on the shelves so that the shelves indicate the topics of their books: if book $b_i$ is on shelf $s_j$, then each topic in $t_i$ can be found in $\ell_j$. However, in an effort to stay green you want to keep the labels as-is, and place the books so that they match the current labels as closely as possible (while still retaining Dewey Decimal order).
Let's say that the \emph{cost} of placing book $b_i$ on shelf $s_j$ is the number of topics $t_i$ that do not appear in the list $\ell_j$.
This leads to an algorithmic problem: how can the books be assigned to shelves to minimize the total cost; i.e. the number of missing topics over all books on all shelves?
Let's start with the basic dynamic program. (Of course, an answer to a later question works for this one as well---this is just an opportunity to focus on correctness.) Remember that for a dynamic program, you should state the recurrence, state how you're storing the dynamic programming table (what entries does the table have? What does the value they store represent?), state the base case, and describe in what order one should fill out the table.
\begin{question}[15 points]
\label{question:dynamicprogram}
Give an algorithm to find the assignment of books to shelves that minimizes the number of mismatches. Your algorithm should run in $O(nmk)$ time and $O(nm)$ space. Analyze this algorithm in the external memory model---how many cache misses does it have in terms of $n$, $m$, and $B$? (You should assume for this analysis that $n$ and $m$ are large.)
\end{question}
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
As we saw with edit distance, if we only care about the \emph{cost} of a solution (rather than reconstructing the solution itself), there's an immediate optimization that saves space.
\begin{question}[15 points]
Give an algorithm to find the minimum number of mismatches in $O(nmk)$ time and $O(n+m)$ space (you do not need to find the optimal assignment of books to shelves).\footnote{You may notice that it's probably possible to tighten the space a bit, to something like $O(\min\{n,m\})$. This is not required.}
\end{question}
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
Finally, let's use the ideas from Hirschberg's algorithm to improve the space usage while giving the assignment itself, not just the cost.
\begin{question}[10 points]
Finally, give an algorithm to find the assignment of books to shelves that minimizes the number of mismatches in $O(nmk)$ time and $O(n+m)$ space.
\end{question}
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\section*{Tips and tricks}
\begin{itemize}
\item Remember to create a correct program before worrying about creating a fast one! Even more importantly: create a correct program before creating one that cleverly reuses unnecessary space. Even your first Hirschberg's implementation should probably be fairly wasteful!
\item I would fairly strongly suggest that you create a working version of the simple (not space-efficient) edit distance algorithm to help you with debugging.
\item Keep an eye on memory management! If you are not careful you may wind up with $\Theta(nm)$ memory usage even with a recursive algorithm.
\item As I mentioned in class, a correct implementation of Hirschberg's algorithm almost certainly has disjoint (nonoverlapping) subproblems.
\item In class we discussed that maintaining the solution using Hirschberg's algorithm can be a bit tricky. I believe that the easiest way is to do it ``bottom-up:'' construct a solution in the base case, and pass it to the calling function. The calling function can then allocate space for a solution that combines its two recursive calls, and (again) pass it up to its calling function; and so on.
\item As a reminder, valgrind is an excellent tool if you are having trouble keeping track of memory. It is very easy to use, and it is available on the lab computers.
\item While it is likely possible to implement an $O(\min\{m,n\})$ space algorithm, it is probably better to implement an $O(n+m)$ space algorithm and then work on other avenues of improving efficiency.
\end{itemize}
\end{document}