
[Solved]: Automated optimization of 0-1 matrix vector multiplication

Problem Detail: 

Question:

Is there an established procedure or theory for generating code that efficiently applies a matrix-vector multiplication, when the matrix is dense and filled with only zeros and ones? Ideally, the optimized code would make systematic use of previously computed information to reduce duplicated work.

In other words, I have a matrix $M$ and I want to do some precomputation based upon $M$, that will make computing $Mv$ as efficient as possible when I later receive the vector $v$.

$M$ is a rectangular dense binary matrix that is known at "compile time", whereas $v$ is an unknown real vector that is only known at "run time".

Example 1: (sliding window)

Let me use a small example to illustrate my point. Consider the matrix, $$M = \begin{bmatrix}1 & 1 & 1 & 1 & 1\\ & 1 & 1 & 1 & 1 & 1 \\ & & 1 & 1 & 1 & 1 & 1\\ & & & 1 & 1 & 1 & 1 & 1\end{bmatrix}.$$ Suppose we apply this matrix to a vector $v$ to get $w = Mv$. Then the entries of the result are, \begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5\\ w_2 &= v_2 + v_3 + v_4 + v_5 + v_6\\ w_3 &= v_3 + v_4 + v_5 + v_6 + v_7\\ w_4 &= v_4 + v_5 + v_6 + v_7 + v_8 \end{align}

A standard matrix-vector multiplication computes exactly this way. However, much of this work is redundant. We could do the same matrix computation at less cost by keeping track of a "running total", and adding/subtracting to get each next entry: \begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5\\ w_2 &= w_1 + v_6 - v_1\\ w_3 &= w_2 + v_7 - v_2\\ w_4 &= w_3 + v_8 - v_3 \end{align}
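The running-total trick can be sketched in a few lines of Python/NumPy. This is only an illustration of the idea; the function name and the fixed $4 \times 8$ shape are my own choices, not part of the original question:

```python
import numpy as np

def sliding_window_apply(v, width=5, rows=4):
    """Apply the sliding-window matrix from Example 1 via a running total.

    w[i] sums `width` consecutive entries of v starting at index i, so
    w[i+1] = w[i] + v[i+width] - v[i]: two operations per row after the
    first, instead of width - 1 additions each.
    """
    w = np.empty(rows)
    w[0] = v[:width].sum()
    for i in range(1, rows):
        w[i] = w[i - 1] + v[i + width - 1] - v[i - 1]
    return w

# Check against an explicit dense multiply.
M = np.array([[1.0 if 0 <= j - i < 5 else 0.0 for j in range(8)]
              for i in range(4)])
v = np.arange(1.0, 9.0)          # v = [1, 2, ..., 8]
print(np.allclose(M @ v, sliding_window_apply(v)))  # True
```

The dense multiply costs $4 \times 4 = 16$ additions here, while the running total costs $4 + 2 \cdot 3 = 10$ operations, matching the count discussed in the answer below.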

Example 2: (hierarchical structure)

In the previous example, we could just keep track of a running total. However, usually one would need to create and store a tree of intermediate results. For example, consider $$M = \begin{bmatrix}1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1 \\ & & & & 1 & 1 & 1 & 1\\ 1 & 1 \\ & & & & 1 & 1 \\ & & 1 & 1 \\ & & & & & & 1 & 1\end{bmatrix}$$ One could compute $w = Mv$ efficiently using a tree of intermediate results:

  1. Compute $w_5$ and $w_7$, and add them to get $w_3$.
  2. Compute $w_4$ and $w_6$, and add them to get $w_2$.
  3. Add $w_2$ and $w_3$ to get $w_1$.
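The three steps above can be written out directly. The following sketch (function name and 0-based indexing are mine) applies Example 2's matrix in 7 additions instead of the 14 a dense multiply would use:

```python
import numpy as np

def hierarchical_apply(v):
    """Apply Example 2's 7x8 matrix via a tree of partial sums.

    Indices are 0-based, so w[0] here is w_1 in the text. The four leaf
    rows are pairwise sums; the inner rows reuse them, for 7 additions
    total instead of the 14 a dense multiply would need.
    """
    w = np.empty(7)
    w[3] = v[0] + v[1]    # w_4
    w[4] = v[4] + v[5]    # w_5
    w[5] = v[2] + v[3]    # w_6
    w[6] = v[6] + v[7]    # w_7
    w[2] = w[4] + w[6]    # w_3 = w_5 + w_7
    w[1] = w[3] + w[5]    # w_2 = w_4 + w_6
    w[0] = w[1] + w[2]    # w_1 = w_2 + w_3
    return w

# Check against the dense matrix written out explicitly.
M = np.array([[1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 1, 1],
              [1, 1, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 0, 0],
              [0, 0, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1, 1]], dtype=float)
v = np.arange(1.0, 9.0)
print(np.allclose(M @ v, hierarchical_apply(v)))  # True
```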

The structure in the examples above is easy to see, but for the actual matrices I'm interested in, the structure is not so simple.

Example 3: (low rank)

To clear up some confusion, the matrices are generally not sparse. In particular, a method solving this problem needs to be able to find efficient ways to apply matrices in which large blocks are filled with ones. For example, consider

$$M = \begin{bmatrix}1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & & \\ 1 & 1 & 1 & 1 & & \\ 1 & 1 & 1 & 1 & & \end{bmatrix}.$$

This matrix can be decomposed as a difference of two rank-1 matrices, $$M = \begin{bmatrix}1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1\end{bmatrix} - \begin{bmatrix}& & & & & \\ & & & & & \\ & & & & 1 & 1 \\ & & & & 1 & 1 \\ & & & & 1 & 1\end{bmatrix}$$

so its action on a vector, $w := Mv$, can be computed efficiently by \begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5 + v_6 \\ w_2 &= w_1 \\ w_3 &= w_2 - v_5 - v_6 \\ w_4 &= w_3 \\ w_5 &= w_4. \end{align}
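In code, the rank-structured evaluation amounts to one full sum plus one correction term (a sketch; the helper name is my own):

```python
import numpy as np

def low_rank_apply(v):
    """Apply Example 3's 5x6 matrix as a difference of two rank-1 pieces:
    the all-ones matrix minus the bottom-right block of ones."""
    total = v.sum()          # all-ones matrix: every output row sees this
    corner = v[4] + v[5]     # correction subtracted in the last three rows
    return np.array([total, total,
                     total - corner, total - corner, total - corner])

M = np.array([[1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [1, 1, 1, 1, 0, 0]], dtype=float)
v = np.arange(1.0, 7.0)
print(np.allclose(M @ v, low_rank_apply(v)))  # True
```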

Motivation:

I'm working on a numerical method for some image processing, and there are several large dense $0$-$1$ matrices with different structures that are fixed for all time. Later these matrices will need to be applied to many unknown vectors $v_i$ that will depend on the user's input. Right now I'm using pencil and paper to come up with efficient code on a per-matrix basis, but I'm wondering if the process can be automated.

Edit: (postscript)

All of the answers here so far (as of 9/5/15) are interesting, but none answer the question as satisfactorily as I had hoped. It may be that this is a hard research question and no one knows a good answer.

Since time has run out, I am awarding the bounty to EvilJS's answer, since it addresses the right question. However, I wish the answer contained clearer and more detailed explanations.

tranisstor's answer makes a connection between this question and the Online Boolean Matrix-Vector Multiplication (OMv) problem, but the connection is not exactly what this question is asking. In particular, the following assumption doesn't really fit (bold emphasis mine),

Now assume that for all $n \leq n_0$ and all $n \times n$ matrices $M$ we know an algorithm $A_{n,M}$, that for all vectors $v$ computes $Mv$ in truly subquadratic time, i.e. in time $O(n^{2 - \varepsilon})$ for some $\varepsilon > 0$.

Whether or not there exist subquadratic algorithms for all matrices is orthogonal to the question of finding an algorithm for a specific matrix that is as fast as possible. Most 0-1 matrices look like random noise and (if I were to guess) probably don't have subquadratic algorithms. However, the fact that there are really bad matrices out there doesn't prevent me from finding a fast algorithm on a good matrix, for example, a "sliding window" matrix.

vzn's two answers (a first and a second) are interesting (and in my opinion don't deserve so many downvotes), but they don't apply to the question for reasons discussed in the comments there.

Asked By : Nick Alger

Answered By : Evil

If possible, try to exploit a banded (e.g. tridiagonal) structure of the matrix.
Otherwise, if the matrix contains only a constant number of distinct values (which is certainly the case for a binary matrix), try the Mailman algorithm (Edo Liberty and Steven W. Zucker, Yale University technical report #1402), which is optimized for multiplication over a finite dictionary.
Common subexpression elimination (CSE) has been known for some time in the form of Multiple Constant Multiplication (MCM), but going down to gate level is also an option: the patterns used there could serve as a standalone solution or be merged with other methods. See "Improving Common Sub-expression Elimination Algorithm with A New Gate-Level Delay Computing Method" by Ning Wu, Xiaoqiang Zhang, Yunfei Ye, and Lidong Lan, in Proceedings of the World Congress on Engineering and Computer Science 2013, Vol. II (WCECS 2013, 23-25 October 2013, San Francisco, USA).

There is also a crude but working method: generate a symbolic matrix with constants and a vector with variables, then feed the product through a compiler's Static Single Assignment (SSA) form. This automates the process of deriving the code by hand.
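As a rough illustration of that approach, SymPy's `cse` routine can play the role of the compiler pass: build the symbolic matrix-vector expressions and let it pull out shared subexpressions automatically. This is a sketch assuming SymPy is available; the exact intermediates it finds depend on the SymPy version, so no particular operation count is guaranteed:

```python
import sympy as sp

# Symbolic input vector and the sliding-window matrix from Example 1.
v = sp.symbols('v1:9')                                    # v1 .. v8
M = [[1 if 0 <= j - i < 5 else 0 for j in range(8)] for i in range(4)]
exprs = [sp.Add(*[v[j] for j in range(8) if M[i][j]]) for i in range(4)]

# Compiler-style pass: factor out shared subexpressions automatically.
replacements, reduced = sp.cse(exprs)
for sym, ex in replacements:
    print(sym, '=', ex)
for i, ex in enumerate(reduced, start=1):
    print(f'w{i} =', ex)
```

Substituting the replacements back in recovers the original expressions, so the emitted schedule is guaranteed to compute $Mv$ exactly.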

New algorithm prototype
Your running-sum formulation, \begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5 \\ w_2 &= w_1 + v_6 - v_1 \\ w_3 &= w_2 + v_7 - v_2 \\ w_4 &= w_3 + v_8 - v_3 \end{align}
gives 10 operations; with my initial idea of using the Thomas algorithm, the cost is equivalent.
I am still writing and testing the new algorithm (and the runtimes are nasty), but the first test gave a surprising result:

\begin{align} tmp_1 &= v_2 + v_3 + v_4 + v_5 \\ w_1 &= v_1 + tmp_1 \\ w_2 &= tmp_1 + v_6 \\ w_3 &= w_2 + v_7 - v_2 \\ w_4 &= w_3 + v_8 - v_3 \end{align}
This gives 9 operations, counting each $+$ or $-$ as 1 operation and each $=$ as 0.

For Example 3, your scheme \begin{align} w_1 &= v_1 + v_2 + v_3 + v_4 + v_5 + v_6 \\ w_2 &= w_1 \\ w_3 &= w_2 - v_5 - v_6 \\ w_4 &= w_3 \\ w_5 &= w_4 \end{align}

gives 7 operations, while my algorithm's result,
\begin{align} tmp_1 &= v_1 + v_2 + v_3 + v_4 \\ tmp_2 &= v_5 + v_6 \\ w_1 &= tmp_1 + tmp_2 \\ w_2 &= w_1 \\ w_3 &= w_2 - tmp_2 \\ w_4 &= w_3 \\ w_5 &= w_4, \end{align}
gives 6 operations. For now I can say that I am using Hamming distance, bitwise & and | operations, and usage counts, and that I build blocks of variables with something like the Cocke-Younger-Kasami (CYK) algorithm - "a parsing algorithm for context-free grammars ... [that] employs bottom-up parsing and dynamic programming" (Wikipedia).
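The bottom-up pairing idea can be sketched as a greedy pass. This is my own simplification for illustration, not EvilJS's actual algorithm: treat each row as a set of term indices and repeatedly fuse the most frequently co-occurring pair into a new intermediate variable.

```python
from collections import Counter
from itertools import combinations

def greedy_cse(rows):
    """Greedy CSE over 0-1 matrix rows (an illustrative sketch).

    Each row is a set of term indices. Repeatedly replace the index pair
    shared by the most rows with a fresh intermediate term, bottom-up,
    until no pair occurs in more than one row.
    """
    rows = [set(r) for r in rows]
    temps = {}                     # new term id -> the pair it stands for
    next_id = max((i for r in rows for i in r), default=-1) + 1
    while True:
        counts = Counter(p for r in rows for p in combinations(sorted(r), 2))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:               # no pair shared by two rows: stop
            break
        temps[next_id] = pair
        for r in rows:
            if pair[0] in r and pair[1] in r:
                r -= set(pair)
                r.add(next_id)
        next_id += 1
    return rows, temps

# Example 2's rows as index sets over columns 0..7.
rows_in = [set(range(8)), {0, 1, 2, 3}, {4, 5, 6, 7},
           {0, 1}, {4, 5}, {2, 3}, {6, 7}]
out, temps = greedy_cse(rows_in)
# One addition per intermediate pair, plus len(r) - 1 per remaining row.
ops = sum(len(r) - 1 for r in out) + len(temps)
print(ops)  # 7 additions, matching the tree in Example 2
```

On Example 2 this greedy pass rediscovers the 7-addition tree, but greedy pair selection is not optimal in general; finding the minimum number of additions is the hard part the answer is grappling with.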

Best Answer from CS Stack Exchange

Question Source : http://cs.stackexchange.com/questions/45596
