
## Operating Systems

Non-contiguous memory allocation does not require that a file's size be declared at the start: the file grows as needed over time. Major advantages are reduced waste of disk space and flexibility in memory allocation, since the operating system allocates memory to the file only when needed.
Non-contiguous memory allocation offers the following advantages over contiguous memory allocation:
• It allows the sharing of code and data among processes.
• External fragmentation is nonexistent.
• Virtual memory is strongly supported.
Non-contiguous memory allocation methods include paging and segmentation.

## Paging

Paging is a non-contiguous memory allocation method in which physical memory is divided into fixed-size blocks called frames, whose size is a power of 2 ranging from 512 to 8192 bytes. Logical memory is divided into blocks of the same size called pages. For a program of size n pages to be executed, n free frames are needed to load the program.
Some of the advantages and disadvantages of paging, as noted by Dhotre, include the following:
• Paging eliminates external fragmentation.
• Multiprogramming is supported.
• The overheads that come with compaction during relocation are eliminated.
• Paging increases the cost of computer hardware, as page addresses are mapped in hardware.
• Memory is forced to store extra structures such as page tables.
• Some memory space stays unused when the available blocks are not sufficient for the address space of jobs to run.
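As a small illustration of how paging maps addresses, here is a sketch that splits a logical address into a page number and an offset, assuming a hypothetical 4096-byte page size (a power of 2 in the 512..8192 range mentioned above):

```python
# Sketch: splitting a logical address into page number and offset.
# PAGE_SIZE is an illustrative assumption (any power of 2 works).
PAGE_SIZE = 4096

def split_address(addr):
    page = addr // PAGE_SIZE    # which page the address falls in
    offset = addr % PAGE_SIZE   # position within that page
    return page, offset

page, offset = split_address(10000)   # 10000 = 2 * 4096 + 1808
```

Because the page size is a power of 2, hardware performs this split with a shift and a mask rather than a division.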

## Segmentation

Segmentation is a non-contiguous memory allocation technique that supports a user view of memory. A program is seen as a collection of segments such as the main program, procedures, functions, methods, the stack, objects, and so on.
Some of the advantages and disadvantages of segmentation, as noted by Godse et al., include the following:
• Internal fragmentation is eliminated in segmentation memory allocation.
• Segmentation fully supports virtual memory.
• Dynamic segment growth is fully supported.
• Segmentation allows the user to view memory in a logical sense.

On the disadvantages of segmentation:
• Main memory limits the size of a segment; that is, segmentation is bound by the size of memory.
• It is difficult to manage segments on secondary storage.
• Segmentation is slower than paging.
• Segmentation falls victim to external fragmentation even though it eliminates internal fragmentation.
Problem Detail:

A typical HDD would represent information as either 1 (e.g. spin up) or 0 (e.g. spin down). Let's assume you want to represent the information physically in a hex system with 16 states, and assume this is possible with using some physical form (maybe the same spin).

What is the minimum physical size of a memory element in this new system in units of binary bits? It seems to me that the minimum is 8 bits = 1 byte. Therefore, going from a binary representation to a higher representation will, everything else equal, make the minimum variable size equal 1 byte instead of 1 bit. Is this logic correct?

###### Answered By : Yuval Filmus

One hexadecimal digit contains 4 binary digits. You can compute this as follows: $\log_2 16 = 4$. Alternatively, $2^4 = 16$. So the minimal memory element will contain 4 bits' worth of information.

This also works when the number of states is not a power of 2, but you have to be more flexible in your interpretation.
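A quick way to check this, including the non-power-of-2 case:

```python
import math

# One symbol with 16 distinguishable states carries log2(16) bits.
states = 16
bits_per_symbol = math.log2(states)   # 4.0

# For a non-power-of-2 state count the result is fractional,
# e.g. a 10-state element carries log2(10) ≈ 3.32 bits.
```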

Question Source : http://cs.stackexchange.com/questions/52233

Problem Detail:

To avoid noise as much as possible, I'm planning to take multiple scenes from an RGB-D camera and then try to merge them.

So are there any research papers, thoughts, ideas, algorithms, or anything else that would help?

Yes, one technique is known as super-resolution imaging. There's a rich literature on the subject, at least for RGB images. You could check Google Scholar to see if there has been any research on super-resolution for RGB-D images (e.g., from 3D cameras such as Kinect, Intel RealSense, etc.).

Question Source : http://cs.stackexchange.com/questions/56093

Problem Detail:

There are some questions arising from the proof of Lemma 5.8.1 of Cover's book on information theory that confuse me.

The first question is why he assumes that we can "consider an optimal code $C_m$". Is he assuming that we are encoding a finite number of words, so that $\sum p_i l_i$ must have a minimum value? Here I give you the relevant snapshot.

Second, there is an observation made in these notes that was also made in my class before proving the theorem on the optimality of Huffman codes, namely:

observe that earlier results allow us to restrict our attention to instantaneously decodable codes

I don't really understand why this observation is necessary.

###### Answered By : Yuval Filmus

To answer your first question, the index $i$ goes over the range $1,\ldots,m$. The assumption is that there are finitely many symbols. While some theoretical papers consider encodings of countably infinite domains (such as universal codes), usually the set of symbols is assumed to be finite.

To answer your second question, the claim is that a Huffman code has minimum redundancy among the class of uniquely decodable codes. The proof of Theorem 10 in your notes, however, only directly proves that a Huffman code has minimum redundancy among the class of instantaneously decodable codes. It does so when it takes an optimal encoding for $p_1,\ldots,p_{n-2},p_{n-1}+p_n$ and produces an optimal encoding for $p_1,\ldots,p_n$ by adding a disambiguating bit to the codeword corresponding to $p_{n-1}+p_n$; it's not clear how to carry out a similar construction for an arbitrary uniquely decodable code.

Question Source : http://cs.stackexchange.com/questions/64727

Problem Detail:

Assume that we have $l \leq \frac{u}{v}$ and assume that $u=O(x^2)$ and $v=\Omega(x)$. Can we say that $l=O(x)$?

Thank you.

###### Answered By : Yuval Filmus

Since $u = O(x^2)$, there exist $N_1,C_1>0$ such that $u \leq C_1x^2$ for all $x \geq N_1$. Since $v = \Omega(x)$, there exist $N_2,C_2>0$ such that $v \geq C_2x$ for all $x \geq N_2$. Therefore for all $x \geq \max(N_1,N_2)$ we have $$l \leq \frac{u}{v} \leq \frac{C_1x^2}{C_2x} = \frac{C_1}{C_2} x.$$ So $l = O(x)$.

Question Source : http://cs.stackexchange.com/questions/45258

Problem Detail:

While doing some digging around in the GNU implementation of the C++ standard library I came across a section in bits/hashtable.h that refers to a hash function "in the terminology of Tavori and Dreizin" (see below). I have tried without success to find information on these people, in the hope of learning about their hash function -- everything points to online versions of the file that the following extract is from. Can anyone give me some information on this?

    *  @tparam _H1  The hash function. A unary function object with
    *  argument type _Key and result type size_t. Return values should
    *  be distributed over the entire range [0, numeric_limits<size_t>::max()].
    *
    *  @tparam _H2  The range-hashing function (in the terminology of
    *  Tavori and Dreizin).  A binary function object whose argument
    *  types and result type are all size_t.  Given arguments r and N,
    *  the return value is in the range [0, N).
    *
    *  @tparam _Hash  The ranged hash function (Tavori and Dreizin). A
    *  binary function whose argument types are _Key and size_t and
    *  whose result type is size_t.  Given arguments k and N, the
    *  return value is in the range [0, N).  Default: hash(k, N) =
    *  h2(h1(k), N).  If _Hash is anything other than the default, _H1
    *  and _H2 are ignored.

I read that passage as saying that Tavori and Dreizin introduced the terminology/concept of a "range-hashing function". Presumably, that's a name they use for a hash function with some special properties. In other words, I read that as implying not that Tavori and Dreizen introduced a specific hash function, but that they talk about a category of hash functions and gave it a name.

I don't know if that is what the authors actually meant; that's just how I would interpret it.

I tried searching on Google Scholar for these names and found nothing that seemed relevant. A quick search turns up a reference to Ami Tavori at IBM (a past student of Prof. Meir Feder, working on computer science), but I don't know if that's who this is referring to.

Question Source : http://cs.stackexchange.com/questions/66931

Problem Detail:

In unification, there is an "occur-check". For example, $X = a \, X$ fails to find a substitution for $X$ since $X$ appears on the right-hand side too. First-order unification and higher-order unification both have an occur-check.

The paper on nominal unification describes a kind of unification based on nominal concepts, but it does not mention an "occur-check" at all.

So I am wondering why. Does it have an occur-check?

Yes, it has the occur check. The ~variable transformation rule of nominal unification has a condition which states

   provided X does not occur in t 

What it is saying is exactly the occur check.

Question Source : http://cs.stackexchange.com/questions/65833

Problem Detail:

In Sipser's text, he writes:

When a probabilistic Turing machine recognizes a language, it must accept all strings in the language and reject all strings not in the language as usual, except that now we allow the machine a small probability of error.

Why is he using "recognizes" instead of "decides"? If the machine rejects all strings that are not in the language, then it always halts, so aren't we restricted to deciders in this case?

The definition goes on:

For $0 < \epsilon < 1/2$ we say that $M$ recognizes language $A$ with error probability $\epsilon$ if

1) $w \in A$ implies $P(M \text{ accepts } w) \ge 1 - \epsilon$, and

2) $w \notin A$ implies $P(M \text{ rejects } w) \ge 1 - \epsilon$.

So it seems like the case of $M$ looping is simply not allowed for probabilistic Turing machines?

###### Answered By : Yuval Filmus

Complexity theory makes no distinction between "deciding" and "recognizing". The two words are used interchangeably. Turing machines considered in complexity theory are usually assumed to always halt. Indeed, usually only time-bounded machines are considered (such as polytime Turing machines), and these halt by definition.

In your particular case, you can interpret accept as halting in an accepting state, and reject as halting in a rejecting state. The Turing machine is thus allowed not to halt. However, the class BPP also requires the machine to run in polynomial time, that is, to halt in polynomial time. In particular, the machine must always halt.

Question Source : http://cs.stackexchange.com/questions/65491

Problem Detail:

In Johnson's 1975 paper 'Finding All the Elementary Circuits of a Directed Graph', his pseudocode refers to two separate data structures: logical array blocked and list array B. What is the difference between them and what do they represent? Moreover, what does '$V_K$' mean?

In the pseudocode, T array means an array where each element has type T. Logical is the type of a boolean (i.e., it can hold the value true or false). Integer list is the type of a list of integers.

Thus, in the pseudocode, logical array blocked(n) is the declaration of an array called blocked containing n elements, where each element is a boolean. integer list array B(n) is the declaration of an array called B containing n elements, where each element is a list of integers.

$V_K$ isn't clearly defined, but from context, I'd guess it is the set of vertices in $A_K$.
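In Python terms, the two declarations can be sketched as follows; the vertex count and the sample updates are hypothetical:

```python
# Sketch of Johnson's two declarations:
#   logical array blocked(n)      -> an array of n booleans
#   integer list array B(n)       -> an array of n lists of integers
n = 5                              # hypothetical vertex count
blocked = [False] * n              # blocked(v): is vertex v currently blocked?
B = [[] for _ in range(n)]         # B(v): vertices to reconsider when v unblocks

blocked[2] = True                  # e.g. mark vertex 2 as blocked
B[2].append(4)                     # e.g. unblocking 2 should also unblock 4
```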

Question Source : http://cs.stackexchange.com/questions/58180

Problem Detail:

I ran the following grammar (pulled from the dragon book) in the Java Cup Eclipse plugin:

    S' ::= S
    S  ::= L = R | R
    L  ::= * R | id
    R  ::= L

The items associated with state 0 given in the Automaton View are as follows:

    S ::= ⋅S, EOF
    S ::= ⋅L = R, EOF
    S ::= ⋅R, EOF
    L ::= ⋅* R, {=, EOF}
    L ::= ⋅id, {=, EOF}
    R ::= ⋅L, EOF

Shouldn't the last item's lookahead set be {=, EOF}? This item could be derived from S ::= ⋅R (in which case the lookahead set is {EOF}) or from L ::= ⋅* R (in which case the lookahead set is {=, EOF}).

In state 0, R ::= ⋅L can only be generated by S ::= ⋅R. In L ::= ⋅* R, the dot precedes *, not R, so no further items are generated by it.

The dragon book uses this grammar as an example of the inadequacy of SLR, and the correct computation of the lookahead in this case is an instance; the SLR algorithm bases lookahead decisions on the FOLLOW set rather than actual lookahead possibilities in the state, which will eventually lead to a shift/reduce conflict on lookahead symbol =.

Question Source : http://cs.stackexchange.com/questions/59548

Problem Detail:

Would it be possible to form such a group using the ADD instruction and the NOT instruction?

Sure. The integers modulo $2^{32}$ form a group. The group operation is addition modulo $2^{32}$, which can be implemented by the ADD instruction. You don't need the NOT instruction.

There are other groups you could form, such as the integers modulo 2, and many more. I recommend you read the definition of a group and play around with some examples.
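A quick sketch of the first group, modelling the 32-bit ADD wrap-around with a mask:

```python
# The integers modulo 2**32 under addition form a group; masking with
# 2**32 - 1 models the wrap-around behaviour of a 32-bit ADD instruction.
M = 1 << 32

def add32(a, b):
    return (a + b) & (M - 1)

# Spot-check the group axioms on sample elements.
a, b, c = 7, M - 3, 12345
assert add32(a, 0) == a                                 # identity element 0
assert add32(a, M - a) == 0                             # inverse of a is M - a
assert add32(add32(a, b), c) == add32(a, add32(b, c))   # associativity
```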

Question Source : http://cs.stackexchange.com/questions/53882

Problem Detail:

Consider the generator of the selections:

    for (i1 = 0;  i1 < 10; i1++)
        for (i2 = i1; i2 < 10; i2++)
            for (i3 = i2; i3 < 10; i3++)
                for (i4 = i3; i4 < 10; i4++)
                    printf("%d%d%d%d", i1, i2, i3, i4);

Result:

    0000
    0001
    0002
    ...
    0009
    0011   <-- 0010 is skipped
    0012
    ...
    0019
    0022   <-- 0020 and 0021 are skipped
    0023
    ...
    8999
    9999

Generated selections have the following property: the order within the selection does not matter, i.e. 0011, 0101, 1001, 1010, 1100 are all the same.

Basically it is a 4-combination of the multiset { 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ..., 9, 9, 9, 9 }.

What would you call this type of selection?

I always get stuck when I say:

4-xxxxxxxx of the set {0, 1, 2, ..., 9} 

where xxxxxxxx is the name of this selection.

###### Answered By : David Richerby

The sequences your program generates are non-decreasing sequences.
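Indeed, the nested loops generate exactly the combinations with repetition (non-decreasing tuples), which can be checked with Python's itertools:

```python
from itertools import combinations_with_replacement

# The nested loops from the question enumerate non-decreasing 4-tuples
# over 0..9, which is exactly what combinations_with_replacement yields.
nested = [(i1, i2, i3, i4)
          for i1 in range(10)
          for i2 in range(i1, 10)
          for i3 in range(i2, 10)
          for i4 in range(i3, 10)]
cwr = list(combinations_with_replacement(range(10), 4))

assert nested == cwr
assert len(cwr) == 715   # C(10 + 4 - 1, 4) = C(13, 4) = 715
```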

Question Source : http://cs.stackexchange.com/questions/33196

Problem Detail:

I have been hearing the phrases quasipolynomial, superpolynomial and subexponential.

I think I know what quasipolynomial and subexponential are. I believe these are functions respectively of the form $n^{\log^c n}$ and $n^{n^{1/c}}$ for some $c>1$.

What does superpolynomial mean?

###### Answered By : Yuval Filmus

Superpolynomial means $n^{\omega(1)}$, that is, growing faster than any polynomial. More clearly, $n^{f(n)}$ for some function $f$ satisfying $\lim_{n\to\infty} f(n) = \infty$.

Question Source : http://cs.stackexchange.com/questions/50641

Problem Detail:

So for these 5 values, I am trying to find the solution/formula for them. What would $a_n$ equal, basically? If it helps, the recurrence relation these 5 values were generated from was $a_n = a_{n - 1} + 2n$. Any help would be greatly appreciated.

\begin{align*} a_0 &= 4 \\ a_1 &= 6 \\ a_2 &= 10 \\ a_3 &= 16 \\ a_4 &= 24 \end{align*}

###### Answered By : Yuval Filmus

There are infinitely many sequences starting $4,6,10,16,24$. If I understand you correctly, this sequence was generated according to the rule $a_n = a_{n-1} + 2n$ with the initial value $a_0 = 4$. In that case, you have $$a_n = a_0 + \sum_{k=1}^n (2k) = a_0 + 2 \frac{n(n+1)}{2} = n^2 + n + 4.$$
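A quick check of the closed form against the recurrence:

```python
# Verify that n^2 + n + 4 satisfies a_n = a_{n-1} + 2n with a_0 = 4.
a = 4
for n in range(1, 20):
    a = a + 2 * n                  # apply the recurrence
    assert a == n * n + n + 4      # compare with the closed form
```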

Question Source : http://cs.stackexchange.com/questions/37632

Problem Detail:

I am developing an algorithm for solving the following problem:

Given a set of items with unknown feature(s) (but a known distribution of the feature(s)), the algorithm must choose which items to measure (every measurement has some cost). I use value-of-information theory to choose these measurements.

After the measurements are done, the algorithm chooses the K best items from the set, using a utility function that depends on the feature values of the items.

I have crafted a few synthetic data sets, but perhaps there are benchmark data sets that are used for this kind of problem?

Regards.

In this sort of situation, two standard answers are:

1. Figure out what the practical applications of your algorithm are. Find a dataset associated with that particular application, and try your algorithm on it and see how well it works. Measure success using some metric that is appropriate for that particular application.

2. Look through the research literature to find previously published papers that try to solve the same problem. Look at what benchmarks they used. Use the same benchmarks, so that you can compare how well your algorithm does to previously published algorithms.

Question Source : http://cs.stackexchange.com/questions/48501

Problem Detail:

I have a deterministic function $f(x_1, x_2, ..., x_n)$ that takes $n$ arguments.

Given a set of arguments $X = (x_i)$, I can compute $U_X = \{ i \in [1, n] : x_i \text{ was read during the evaluation of } f(X) \}$

Would it be valid to use the set $K_X = \{(i, x_i): i \in U_X\}$ as a memoization key for $f(X)$?

In particular, I am worried that there may exist $X=(x_i)$ and $Y=(y_i)$ such that:

$$\tag1 U_X \subset U_Y$$ $$\tag2 \forall i \in U_X, x_i = y_i$$ $$\tag3 f(X) \neq f(Y)$$

In my case, the consequence of the existence of such $X$ and $Y$ would be that $K_X$ would be used as a memoization key for $f(Y)$, and would thus return the wrong result.

My intuition says that, with $f$ being a deterministic function of its arguments, there should not even exist $X$ and $Y$ (with $U_X$ a strict subset of $U_Y$) such that both $(1)$ and $(2)$ hold (much less all three!), but I would like a demonstration of it (and, if it turns out to be trivial, at least pointers to the formalism that makes it trivial).

###### Asked By : Jean Hominal

As long as $f$ is deterministic and depends only on its arguments (not on other global variables), yes, you can safely use that as a memoization key. The bad case cannot happen.

You can see this by thinking about what happens as it reads the $t$th argument and using induction on $t$. Let $i_1,i_2,i_3,\dots,i_t$ be the sequence of argument indices read (i.e., $f$ first reads $i_1$ when running on input $x$, then reads $i_2$, etc.). For each $t$, if $f(x)$ and $f(y)$ have behaved the same up until just before they read the $t$th of these, then the next index examined by $f(x)$ and $f(y)$ will be the same, say $i_t$; if additionally $x_{i_t} = y_{i_t}$, then it follows $f(x)$ and $f(y)$ will behave the same up until just before they read the $t+1$st of their arguments. Since $i_t \in U_X$, $x_{i_t} = y_{i_t}$ is guaranteed by your assumptions. Now use induction.

This assumes everything is deterministic. For instance, $f$ had better not be allowed to read a random-number generator, the current time of day, or other global variables. Beware that on some platforms there can be surprising sources of non-determinism, e.g., due to aspects of behavior that are unspecified and might depend on external factors. For instance, I have a recollection that floating-point math in Java can introduce non-determinism in some cases, and so if you want to write code that is verifiably deterministic, you probably want to avoid all floating-point math.

Question Source : http://cs.stackexchange.com/questions/53309

Problem Detail:

I'm studying program verification and came across the following triple:
$$\{\top\} \;P \; \{y=(x+1)\}$$ What's the meaning of the $\top$ symbol on the precondition? Does it mean $P$ can take any input?

###### Answered By : Yuval Filmus

The symbol $\top$, known as top, stands for "True". There is also a symbol $\bot$, known as bottom, which stands for "False". Top is always true, and bottom is always false. In your case, having a precondition that always holds is the same as having no precondition.

Question Source : http://cs.stackexchange.com/questions/66151

Problem Detail:

I'm describing the semantics of a new optimization for Java using operational semantics, but I'm not sure how to define the transition system.

I found this link: http://www.irisa.fr/celtique/teaching/PAS/opsem-2016.pdf where slide 20 defines the transition system, but I don't get it.

###### Asked By : El Marce

Operational semantics utilizes the tools of logic, so as a prerequisite we must understand judgements and inference rules.

A judgement is like a proposition, but more general. It asserts a relation between two entities of our language. For example, in programming, we often employ the judgement $e: \tau$, asserting that expression $e$ has type $\tau$.

Inference rules are used to define judgements. They have the general form

$$\frac{J_1 \dots J_n}{J}$$

which reads: if we know judgements $J_1$ through $J_n$, we can infer judgement $J$. For example, we may have the following self-explanatory inference rule for the previously defined typing judgement:

$$\frac{n: \text{int} \quad m: \text{int}}{n+m: \text{int}}.$$

A structural operational semantics is defined using a transition system between states. In a programming language, the states are all closed expressions in the language, and the final states are values. Formally, we make use of two judgements:

1. $e_1 \to e_2$, stating that expression $e_1$ transitions to state $e_2$ in one step
2. $e \space \text{final}$, stating that expression $e$ is a final state of the system.

Here are some example inference rules in a language with arithmetic and function abstraction:

$$\frac{}{n \text{ final}}$$ $$\frac{}{n+0 \to n}$$ $$\frac{n + m \to k}{s(n) + m \to s(k)}$$ $$\frac{}{\lambda x. e \ \text{final}}$$ $$\frac{}{(\lambda x.\, e_1)\,e_2 \to [e_2/x]e_1}$$

Most languages do not have a formal definition, including Java (as far as I know). There are also other methods for defining the semantics of a language, but for describing an optimization, I believe structural dynamics is a wise choice, as it has a natural notion of time complexity (the number of transitions).
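The transition judgement $e_1 \to e_2$ can be sketched as a tiny step function; the term encoding (ints for numerals, tuples for addition) is an assumption made for illustration:

```python
# Minimal small-step transition function for an arithmetic fragment:
# numerals are ints, addition is ('add', e1, e2).
def is_final(e):
    return isinstance(e, int)           # e final  <=>  e is a numeral

def step(e):
    """One transition e1 -> e2 (call only on non-final states)."""
    op, l, r = e
    if not is_final(l):
        return (op, step(l), r)         # reduce the left operand first
    if not is_final(r):
        return (op, l, step(r))         # then the right operand
    return l + r                        # ('add', n, m) -> n + m

# Iterate the transition relation until a final state is reached.
e = ('add', ('add', 1, 2), 4)
while not is_final(e):
    e = step(e)
assert e == 7
```

The number of `step` calls is exactly the notion of time complexity mentioned above.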

Question Source : http://cs.stackexchange.com/questions/65976

Problem Detail:

A CFG is in strong GNF when all rewrite rules are in the following form:

$A \rightarrow aA_1...A_n$

where $n \leq 2$.

###### Answered By : Yuval Filmus

Question Source : http://cs.stackexchange.com/questions/66546

Problem Detail:

The definition of the variable

<number> ::= <digit> | <digit><number>

where <digit> is defined as

<digit> ::= 1|2|3|4|5

Apparently reflects the following syntax diagram

Please explain why this is the case. I am particularly confused with the clause after the vertical OR line (i.e. < digit >< number >) and whether it has something to do with the fact that number can consist of many digits.

<number> ::= <digit> | <digit><number> means that a number is either just a digit, or a digit followed by a number. So 1 is a number (just a single digit), and 12 is also a number (a single digit followed by a number that consists of just the single digit 2). Similarly, 123 is a number that consists of the single digit 1 followed by a number (which itself consists of the single digit 2 followed by a number consisting of the single digit 3).
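This recursive reading of the rule can be sketched directly as a recognizer:

```python
# Recognizer for the grammar: <number> ::= <digit> | <digit><number>
# with <digit> ::= 1|2|3|4|5.
DIGITS = set("12345")

def is_number(s):
    if not s:
        return False
    if len(s) == 1:
        return s in DIGITS                        # <number> ::= <digit>
    return s[0] in DIGITS and is_number(s[1:])    # <digit><number>

assert is_number("1")
assert is_number("123")
assert not is_number("106")   # 0 and 6 are not in <digit>
```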

Question Source : http://cs.stackexchange.com/questions/64215

Problem Detail:

Given a digraph, determine if the graph has any vertex-disjoint cycle cover.

I understand that the permanent of the adjacency matrix will give me the number of cycle covers for the graph, which is 0 if there are no cycle covers. But is it possible to check the above condition directly?

###### Answered By : Yuval Filmus

Your problem is answered in the Wikipedia article on vertex-disjoint cycle covers. According to the article, you can reduce this problem to that of finding whether a related graph contains a perfect matching. Details can be found in a paper of Tutte or in recitation notes of a course given by Avrim Blum.

As a comment, in the graph-theoretic literature a vertex-disjoint cycle cover is known as a 2-factor.

Question Source : http://cs.stackexchange.com/questions/67044

Problem Detail:

The CS188 course from Berkeley goes to great lengths in explaining why the optimality of the $A^*$ algorithm is conditioned on the admissibility of the heuristic.

Note: admissibility of heuristic means that:

$$\forall n, 0 \leq h(n) \leq h^*(n)$$

With $h^*(n)$ being the true optimal distance of $n$ to the goal.

The course's proof reasoning is sound and makes perfect sense.

However, this example uses a heuristic that is not admissible. For instance, the distance between the start state and the goal state is only $h^* = 2$ while the heuristic of the start state is $h = 17$ (see the example with the 8-puzzle, using Nilsson heuristic). Obviously, $17 \gt 2$ and therefore the heuristic is not admissible.

However, after having implemented the algorithm and tested it, it seems that it is able to find the optimal solution each time. Trying to alter the heuristic function only makes it worse than optimal.

So, if the admissibility of the heuristic is a necessary condition for the guaranteed optimality of $A^*$, how come this example seems to refute it?

Did I miss something? Or does the condition on admissibility mean that sometimes the algorithm will be optimal even with an inadmissible heuristic, but only an admissible one guarantees optimality?

I'd be curious to hear any thought about that.

It's exactly as you suspected:

Sometimes, the algorithm will find an optimal solution even with an inadmissible heuristic... only an admissible heuristic guarantees the optimality of the solution returned.

If you use an inadmissible heuristic, there's no guarantee what will happen. The solution you get back might be the best one; or it might not be. And you probably won't have any way of telling whether the solution you got back was optimal or not, so you won't even know for sure when this is a problem and when it isn't.

If you use an inadmissible heuristic, you'll often (but not always) get sub-optimal solutions.
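As an illustration, here is a minimal A* sketch on a hypothetical three-node graph where an inadmissible heuristic (it overestimates at A) makes the search return the sub-optimal direct path:

```python
import heapq

# Hypothetical graph: optimal path S -> A -> G costs 2, direct edge
# S -> G costs 3. h(A) = 5 overestimates the true cost 1, so h is
# inadmissible and A* pops G via the direct edge first.
graph = {'S': [('A', 1), ('G', 3)], 'A': [('G', 1)], 'G': []}
h = {'S': 0, 'A': 5, 'G': 0}

def astar(start, goal):
    frontier = [(h[start], 0, start)]            # (f = g + h, g, node)
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if node == goal:
            return g                             # cost of the path found
        for nxt, cost in graph[node]:
            heapq.heappush(frontier, (g + cost + h[nxt], g + cost, nxt))
    return None

assert astar('S', 'G') == 3   # sub-optimal: the true optimum costs 2
```

With an admissible h (e.g. h(A) = 1), the same search would return 2.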

Question Source : http://cs.stackexchange.com/questions/67175

Problem Detail:

Question: How many 32 bit integers can be stored in a 16 bit cache line. Answer: 4

Can somebody please explain to me why the answer is 4? I did not understand the reason, and I think they should have given us more information, like the number of blocks in the cache.

###### Answered By : Yuval Filmus

There's probably a typo there: it's not 16 bits but rather 16 bytes. Each 32-bit integer takes 4 bytes, so 16/4 = 4.

Question Source : http://cs.stackexchange.com/questions/42324

Problem Detail:

I was reading about the Multi-Layer Perceptron (MLP) and how we can learn patterns using it. The algorithm was stated as:

• Initialize all weights to small values.
• Compute the activation of each neuron using the sigmoid function.
• Compute the error at the output layer using $\delta_{ok} = (t_{k} - y_{k})y_{k}(1-y_{k})$.
• Compute the error in the hidden layer(s) using $\delta_{hj} = a_{j}(1 - a_{j})\sum_{k}w_{jk}\delta_{ok}$.
• Update the output-layer weights using $w_{jk} := w_{jk} + \eta\delta_{ok}a_{j}^{hidden}$.
• Update the hidden-layer weights using $v_{ij} := v_{ij} + \eta\delta_{hj}x_{i}$.

where $a_{j}$ is the activation of neuron $j$, $t_{k}$ is the target, $y_{k}$ is the output, and $w_{jk}$ is the weight of the connection between neurons $j$ and $k$.

My question is: how do we get that $\delta_{ok}$? And where do we get $\delta_{hj}$ from? How do we know these are the errors? Where does the chain rule from differential calculus play a role here?

## How do we get that $\delta_{ok}$?

You calculate the gradient of the network. Have a look at Tom Mitchell's "Machine Learning" if you want to see it in detail. In short, your weight update rule is

$$w \gets w + \Delta w$$ with the $j$-th component of the weight vector update being $$\Delta w^{(j)} = - \eta \frac{\partial E}{\partial w^{(j)}}$$ where $\eta \in \mathbb{R}_+$ is the learning rate and $E$ is the error of your network.

$\delta_{ok}$ is just $\frac{\partial E}{\partial w^{(o,k)}}$. So I guess $o$ is an output neuron and $k$ a neuron of the last hidden layer.

## Where does the chain rule play a role?

The chain rule is applied to compute the gradient. This is where the "backpropagation" comes from. You first calculate the gradient of the last weights, then the weights before, ... This is done by applying the chain rule to the error of the network.

## Note

The weight initialization is not only small: the initial weights within a layer must also differ from one another. Typically one chooses (pseudo)random weights.
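As a sketch of the update rules discussed above, here is one gradient step on a toy 2-2-1 network in NumPy; the network shape, data, and learning rate are illustrative assumptions:

```python
import numpy as np

# One backpropagation step on a tiny 2-2-1 network (sketch).
rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.3])            # input
t = np.array([1.0])                  # target
eta = 0.5                            # learning rate

V = rng.normal(0, 0.1, (2, 2))       # input -> hidden weights v_ij
W = rng.normal(0, 0.1, (2, 1))       # hidden -> output weights w_jk

def forward(V, W):
    a = sigmoid(x @ V)               # hidden activations a_j
    y = sigmoid(a @ W)               # outputs y_k
    return a, y

a, y = forward(V, W)
delta_o = (t - y) * y * (1 - y)            # output-layer error delta_ok
delta_h = a * (1 - a) * (W @ delta_o)      # hidden-layer error delta_hj
W = W + eta * np.outer(a, delta_o)         # w_jk += eta * delta_ok * a_j
V = V + eta * np.outer(x, delta_h)         # v_ij += eta * delta_hj * x_i

_, y_new = forward(V, W)
assert abs(t - y_new).item() < abs(t - y).item()   # error shrank
```

Both updates are exactly gradient descent on the squared error $E = \frac{1}{2}(t-y)^2$; the deltas are the chain-rule factors of $\partial E/\partial w$.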

Question Source : http://cs.stackexchange.com/questions/60098

Problem Detail:

I have an exercise about cache memory; at first the cache is empty.
The cache has 16 lines, each line holds 16 bytes, and addresses are 16 bits.

So I know that the INDEX will be composed of 4 bits and the OFFSET will be composed of 4 bits too, so I will have this:

bit numbers: (15 ... TAG ... 8)(7 ... INDEX ... 4)(3 ... OFFSET ... 0)

Now I have to say, for each address, whether there is a cache miss or not, and whether it loads a line or not. The addresses I have:

3000 : tag = 30 / index = 0 , offset 0

2040 : tag = 20 / index = 4 , offset = 0

3001 : tag = 30 / index = 0 , offset = 1

2404 : tag =24 / index = 0 , offset = 4

3002 : tag = 30 / index = 0, offset = 2

20C4 tag = 20 / index = 12 , offset =4

3003 tag = 30 / index= 0 , offset =3

24C4 tag = 24 / index = 12 , offset = 4

If someone can explain to me how I know whether it loads and whether I have a cache miss, I would be very happy. Thanks

• Starting with an empty cache.
• Assuming a direct-mapped cache.

The tag memory will contain, for each of the 16 indexes, address bits [15:8] and a "valid" bit.

    3000 : Miss -> Load 30 into tag 0 and fill line (3000..300F)
    2040 : Miss -> Load 20 into tag 4 and fill line (2040..204F)
    3001 : Hit  -> Already in the cache
    2404 : Miss -> Load 24 into tag 0 and fill line (replaces 3000..300F -> 2400..240F)
    3002 : Miss -> Load 30 into tag 0 and fill line (replaces 2400..240F -> 3000..300F)
    20C4 : Miss -> Load 20 into tag C and fill line (20C0..20CF)
    3003 : Hit  -> Already in the cache
    24C4 : Miss -> Load 24 into tag C and fill line (24C0..24CF)

Very few hits here; this workload would greatly benefit from a two-way set-associative cache.
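The trace above can be reproduced with a short direct-mapped cache sketch (tag = bits 15..8, index = bits 7..4, as in the question):

```python
# Direct-mapped cache sketch: 16 lines of 16 bytes, 16-bit addresses.
tags = [None] * 16                   # one tag slot per index; None = invalid

def access(addr):
    index = (addr >> 4) & 0xF        # bits 7..4
    tag = addr >> 8                  # bits 15..8
    if tags[index] == tag:
        return "hit"
    tags[index] = tag                # miss: load the line, evicting the old tag
    return "miss"

trace = [0x3000, 0x2040, 0x3001, 0x2404, 0x3002, 0x20C4, 0x3003, 0x24C4]
results = [access(a) for a in trace]
assert results == ["miss", "miss", "hit", "miss",
                   "miss", "miss", "hit", "miss"]
```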

Question Source : http://cs.stackexchange.com/questions/51772

Problem Detail:

I encountered a system of ~5000 random nodes connected by ~8000 non-Hookean springs, with ~1300 nodes at the boundary fixed as the "wall". The potential of the springs is of the form $dx \cdot e^{dx/a}$, where $a$ is a constant and $dx$ is the strain (displacement/original length) of the spring. I am using the Monte Carlo method to find the energy-minimized configuration after I perform some "perturbation", say a simple shear or an isotropic expansion of the whole system.

It seems that conventional energy-minimization schemes such as steepest descent or simulated annealing do not work as efficiently here as in the linear case; they always fail to converge to a satisfactorily balanced state.

Could someone share their experience in dealing with such non-linear situations?

Thank you so much!

###### Answered By : Long Liang

OK, I finally fixed this issue. The right thing to do in such a non-linear situation is to use simulated annealing. I am implementing a gradient-guided simulated annealing, which works pretty efficiently.

Thanks to everyone who gave me suggestions and guided me to the right path!

Have fun (mixed with a lot of frustration) with modeling!

Question Source : http://cs.stackexchange.com/questions/51403

Problem Detail:

I'm writing a computer game and one of my game's objects must follow a movement path that is very similar to the following graph.

The bold lines are the Y and X axes. In order to code this movement, though, I need the algebraic equation that corresponds to this graph, for example f(x) = x^2.

Is there any mathematical technique I could use to obtain the algebraic form of this function?

We can start from $\sin(x)$, which has a nice regular graph. To avoid negative values you can simply use the absolute value $|\sin(x)|$. This produces a graph similar to yours but with constant height.

To decrease height as $x \to \infty$ we want to multiply that function by something that decrements. For example $\frac{1}{1+|x|}$ goes to $0$ as $x \to \infty$ and is symmetric. So $\frac{1}{1+|x|}|\sin(x)|$ is something similar to what you want with a maximum height of $1$.

If you multiply by a constant $A$ you "set" the maximum height to $A$. If you want to change the width of the bumps you can simply multiply the argument of the $\sin$ function, for example $\frac{A}{1+|x|}|\sin(2x)|$ will have maximum height $A$ and a width that is half the width of the normal $\sin$, while using $\sin(x/2)$ would produce a bump that is twice the width of the normal $\sin$.

See this wolfram alpha's graph for an example result.

You can also change the degradation of height by setting a different power for the $|x|$. For example using $\frac{1}{1+x^2}$ you'd have a faster drop to $0$, while using $\frac{1}{1+\sqrt{|x|}}$ would produce a slower change in height.
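A hedged sketch of this whole family of curves in Python (the function name `bumps` and its parameter names are my own, not from the question):

```python
import math

def bumps(x, A=1.0, freq=1.0, power=1.0):
    # Decaying bumps: A / (1 + |x|^power) * |sin(freq * x)|
    # A     -> maximum height
    # freq  -> 2 halves the bump width, 0.5 doubles it
    # power -> 2 makes the height drop faster, 0.5 slower
    return A / (1.0 + abs(x) ** power) * abs(math.sin(freq * x))
```

Evaluate it over a range of x and plot the result to compare against the target graph.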

Question Source : http://cs.stackexchange.com/questions/54448

Problem Detail:

Consider the problem of representing in memory numbers in the range $\{1,\ldots,n\}$.

Obviously, exact representation of such number requires $\lceil\log_2(n)\rceil$ bits.

In contrast, assume we are allowed to have compressed representations such that when reading the number we get a 2-approximation for the original number. Now we can encode each number using $O(\log\log n)$ bits. For example, given an integer $x\in\{1,\ldots,n\}$, we can store $z = \text{Round}(\log_2 x)$. When asked to reconstruct an approximation for $x$, we compute $\widetilde{x} = 2^z$. Obviously, $z$ is in the range $\{0,1,\ldots,\log_2 n\}$ and only requires $O(\log\log n)$ bits to represent.

In general, given a parameter $\epsilon>0$, what is the minimal number of bits required for saving an approximate representation of an integer in the above set, so that we can reconstruct its value up to a multiplicative error of $(1+\epsilon)$?

The above example shows that for $\epsilon=1$, approximately $\log\log n$ bits are enough. This probably holds asymptotically for every constant $\epsilon$. But how does epsilon affect the memory (e.g., does it require $\Theta(\frac{1}{\epsilon^2}\log\log n)$ bits?).

###### Asked By : R B

Storing $x$ to within a $1+\epsilon$ approximation can be done with $\lg \lg n - \lg(\epsilon) + O(1)$ bits.

Given an integer $x$, you can store $z = \text{Round}(\log_{1+\epsilon} x)$. $z$ is in the range $\{0,1,\dots,\log_{1+\epsilon} n\}$, so requires about $b = \lg \log_{1+\epsilon} n$ bits. Doing a bit of math, we find

$$b = \lg \frac{\lg n}{\lg(1+\epsilon)} = \lg \lg n - \lg \lg (1+\epsilon).$$

Now $\lg(1+\epsilon) = \log(1+\epsilon)/\log(2) \approx \epsilon/\log(2)$, by a Taylor series approximation. Plugging in, we get

$$b = \lg \lg n - \lg(\epsilon) + \lg(\log(2)).$$
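A small sketch of the general scheme (the `encode`/`decode` names are mine; note that rounding to the nearest exponent actually gives a $\sqrt{1+\epsilon}$ factor, which is within the required $1+\epsilon$):

```python
import math

def encode(x, eps):
    # Store z = round(log_{1+eps} x): about lg lg n - lg(eps) + O(1) bits.
    return round(math.log(x, 1 + eps))

def decode(z, eps):
    # Reconstruct the multiplicative approximation (1 + eps)^z.
    return (1 + eps) ** z
```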

Question Source : http://cs.stackexchange.com/questions/53633

Problem Detail:

I'm studying from "Numerical Solution of Partial Differential Equations" by K. W. Morton and D. F. Mayers. On page 25, it says "2(add) + 2(multiply) operations per mesh point for the explicit algorithm (2.19)", but it seems like 3(add) + 2(multiply) to me. Where did I go wrong?

(2.19) $U_j^{n+1}=U_j^{n}+\mu(U_{j+1}^{n}-2U_j^{n}+U_{j-1}^{n})$

My counting is

2(add) and 1(multiply) inside the bracket

1(multiply) for $\mu$ and the brackets

1(add) for $U_j^{n}$ and the rest

###### Answered By : Tom van der Zanden

Note that

$U^n_j + \mu (U^n_{j+1}-2U^n_j+U^n_{j-1}) = (1-2\mu)U^n_j+\mu(U^n_{j+1}+U^n_{j-1})$

$1-2\mu$ may be precomputed once per time step, leaving only 2 multiplies and 2 adds per mesh point.
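For concreteness, here is a sketch of one explicit time step written in the rearranged form (the variable names are mine):

```python
def explicit_step(U, mu):
    # U_j^{n+1} = U_j^n + mu*(U_{j+1}^n - 2 U_j^n + U_{j-1}^n),
    # rewritten as (1 - 2 mu) U_j^n + mu*(U_{j+1}^n + U_{j-1}^n).
    c = 1.0 - 2.0 * mu               # precomputed once per time step
    V = list(U)                      # boundary values carried over unchanged
    for j in range(1, len(U) - 1):
        V[j] = c * U[j] + mu * (U[j + 1] + U[j - 1])  # 2 multiplies + 2 adds
    return V
```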

Question Source : http://cs.stackexchange.com/questions/37321

Problem Detail:

For a certain maximization problem, a "constant-factor approximation algorithm" is an algorithm that returns a solution with value at least $F\cdot \textrm{Max}$, where $F<1$ is some constant and $\textrm{Max}$ is the exact maximal value.

What term describes an algorithm in which the approximation factor is a function $F(n)$ of the problem size, and $F(n)\to 1$ as $n\to \infty$?

###### Answered By : Yuval Filmus

You can say your algorithm is asymptotically optimal. One example is universal codes, which are a certain type of codes for the natural numbers. They satisfy the following property. Let $D$ be a monotone probability distribution over the natural numbers, that is $\Pr[D=n] > \Pr[D=n+1]$. The average codeword length under a universal code is $H(D)(1 + o(1))$. Since $H(D)$ is the optimal length, universal codes are asymptotically optimal, in exactly the same sense as the one you're after.

Question Source : http://cs.stackexchange.com/questions/49220

Problem Detail:

Good evening! I am studying scheduling problems and I have some difficulties understanding constraints of potentials:

Let $t_j$ be the time when task $j$ starts, and $t_i$ the time when task $i$ starts. I'm assuming that $a_{ij}$ is the length of task $i$, but I'm not sure. Why is a "constraint of potential" mathematically expressed as:

$$t_j-t_i \le a_{ij}$$

Shouldn't it be the reverse, $t_j-t_i \ge a_{ij}$? If we know that the length of task $i$ is $a_{ij}$, isn't it impossible to do something in less than the necessary allocated time?

I suspect there's some misunderstanding about the definition/meaning of $a_{ij}$. The time to complete task $i$ depends only on $i$ (not on $j$), so it wouldn't make sense to use notation like $a_{ij}$ for the length of task $i$: we'd instead expect to see only the index $i$, but not $j$, appear in that notation.

I suspect you'll probably need to go back to your textbook or other source on scheduling and look for a precise definition of the notation that it uses. If your textbook doesn't define its notation, look for a better textbook.

As for the constraint $t_j - t_i \le a_{ij}$, it expresses that task $j$ should start at most $a_{ij}$ time units after task $i$ starts. So, it'd make more sense for $a_{ij}$ to be a permissible delay: the maximum time you can wait to start task $j$ once task $i$ has been started.
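Under that reading (which, as noted, is an assumption about the notation), checking a schedule against a set of potential constraints is a one-liner:

```python
def satisfies_potentials(t, a):
    # t: start time of each task; a[(i, j)]: maximum delay allowed between
    # the start of task i and the start of task j (assumed interpretation).
    return all(t[j] - t[i] <= d for (i, j), d in a.items())
```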

Question Source : http://cs.stackexchange.com/questions/49132

Problem Detail:

Hello, I am a layman trying to analyze game data from League of Legends, specifically looking at predicting the win rate for a given champion given an item build.

### Outline

A player can own up to 6 items at the end of a game. They could have purchased these items in different orders or adjusted their inventory position during the course of the game.

In this fashion the dataset may contain the following rows with:

    champion id |               items ids              | win(1)/loss(0)
    ------------|--------------------------------------|---------------
         45     | [3089, 3135, 3151, 3157, 3165, 3285] |       1
         45     | [3151, 3285, 3135, 3089, 3157, 3165] |       1
         45     | [3165, 3285, 3089, 3135, 3157, 3151] |       0

While the items are in a different order, the build is the same. My initial thought was to simply multiply the item ids, as this would give me an integer value representing that combination of 6 items.

While there are hundreds of items, in reality a champion draws off a small subset (~20) of those to form the core (3 items) of their build. A game may also finish before players have had time to purchase 6 items:

                 items ids
    ------------------------------------------
     [3089, XXXX, 3151, 3285, 3165, 0000]
     [XXXX, 3285, XXXX, 3165, 3151, 0000]
     [3165, 3285, 3089, XXXX, 0000, 0000]

    XXXX = item from outside the core subset
    0000 = empty inventory slot

As item 3089 complements champion 45, core builds that include item 3089 have a higher win rate than core builds that are missing it.

The size of the data set available for each champion varies between 10000 and 100000. The mean is probably around 35000.

### Questions

1. Is this a suitable problem for supervised classification?
2. How should I approach finding groups of core items and their win rates?

Yes. If you have a non-trivial data set of this form, this would be a reasonable fit for statistical analysis.

For independent variables, you have one binary feature for each core item (indicating whether that item was purchased or not); the outcome is a binary variable (win or loss). Accordingly, one reasonable approach would be to try logistic regression. You'll have one independent variable $X_i$ for each item; $X_i=1$ means that item $i$ is one of the 6 items that the champion purchased in this game, $X_i=0$ means item $i$ was not purchased. You'll have a dependent variable $Y$; $Y=1$ means that the champion won this game, $Y=0$ means the champion lost. Then logistic regression will tell you which items tend to be associated with winning games.

There are other methods you could try as well: pretty much any method for supervised classification that works well with binary/categorical variables.

The one thing I don't recommend you do is multiply the item codes. That's not going to help. Instead, just have 20 features, where each feature indicates whether a particular item was purchased or wasn't.
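A minimal sketch of that featurization step (`featurize` is my own helper name; the item ids are just the ones from the example rows):

```python
def featurize(items, core_items):
    # One binary feature per core item: 1 if it was among the items bought.
    owned = set(items)
    return [1 if item in owned else 0 for item in core_items]

core = [3089, 3135, 3151, 3157, 3165, 3285]
row = featurize([3165, 3285, 3089], core)  # order and empty slots don't matter
```

Feed rows like this, together with the win/loss label, into any logistic regression implementation.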

Question Source : http://cs.stackexchange.com/questions/45645

Problem Detail:

I am reading a somewhat old article entitled "Adaptive image region-growing" by Chang and Li, published in IEEE Transactions on Image Processing, DOI:10.1109/83.336259.

Well, the problem is that the method is not well explained.

For example, in equations (2) and (3) I don't get the meaning of Pr(). Any idea about that?

###### Asked By : ALJI Mohamed

$\Pr(E)$ normally refers to the probability of the event $E$. Consult a textbook on probability to learn more about standard notation for probability and statistics.

Question Source : http://cs.stackexchange.com/questions/44788

Problem Detail:

If I have the integer "1234" stored in a file, the size of the file when I use the command "ls -l" or "wc -c" is 5. If I have "12345", the size of the file is 6. Basically, the number of bytes in a file = (number of individual digits) + 1. Now, I am assuming the "+1" comes from the space it takes to create the name of the file. But why is the rest of the byte count equal to the number of individual characters? Does the file system read the file character by character even when we "write an integer" to the file? Elaboration on this would be very helpful.

The +1 is for the newline character.

The byte count equals the number of characters, in this case, because each character takes exactly one byte. That need not be true in general (see, e.g., UTF-8), but it is for the example you listed.

You didn't write an integer to the file. You wrote a string -- a sequence of characters, namely the character 1, then the character 2, then 3, then 4, then a newline. Files don't store integers or other types. They store a sequence of bytes. Each character is converted to one or more bytes using an encoding, and then the sequence of bytes is written.
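You can verify this yourself (a Python sketch; any language's file API shows the same thing):

```python
import os
import tempfile

# Write the characters '1', '2', '3', '4' plus a newline to a file.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    f.write("1234\n")
    path = f.name

size = os.path.getsize(path)  # 5 bytes: one per character, incl. the newline
os.remove(path)
```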

Question Source : http://cs.stackexchange.com/questions/55364

Problem Detail:

(This question was originally posted at http://academia.stackexchange.com/questions/68675 but was considered too specific for Academia.)

In computer science/engineering research papers on performance improvement, execution time is often normalized to control benchmarks (i.e. speedup plots). When the control(s) are executed multiple times to avoid other processes/interrupts/etc. taking processor time and skewing the results, how is that normalization generally implemented? My first thought would be to use the minimum time achieved for each benchmark, but that would probably not be the minimum possible execution time in most cases, and, depending on the situation, overall execution time rather than best-case execution time may be more important. Is it better to just accept the skewing and go off the means? Or is it better to use the median of each set of results?

One simple approach is to use the median. It is less sensitive to outliers than the mean, but still serves as a good summary of typical running time. Alternatively, you could use the mean but in that case you should inspect all the results yourself to check for the possibility of outliers. I personally would recommend the median.

Papers often show confidence intervals as well, so that you can assess the effect of statistical noise.
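A sketch of how such a summary might be computed (the helper name is mine):

```python
from statistics import mean, median

def summarize(times):
    # Median resists outliers caused by OS interference; mean and min are
    # reported alongside it for comparison.
    return {"median": median(times), "mean": mean(times), "min": min(times)}
```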

Normally, benchmarks are run on an isolated system to minimize interference from other tasks.

Note that there are many challenges with benchmarking. Even simple irrelevant changes can cause significant changes to performance, e.g., because they happened to change cache alignment to something that is randomly better or worse. Therefore, it can be difficult to separate out whether your optimization led to a 3% performance improvement because your optimization is an improvement, or if that's just randomness (e.g., just recompiling with slightly different settings can change the running time by +/- 3%, and you happened to get unlucky once and lucky the other time).

Question Source : http://cs.stackexchange.com/questions/57427

Problem Detail:

So I'm reading "Search Through Systematic Set Enumeration" by Ron Rymon (currently available online for free). I'm having a problem with the notation in the following definition presented below:

The Set-Enumeration (SE)-tree is a vehicle for representing and/or enumerating sets in a best-first fashion. The complete SE-tree systematically enumerates elements of a power-set using a pre-imposed order on the underlying set of elements. In problems where the search space is a subset of that power-set that is (or can be) closed under set-inclusion, the SE-tree induces a complete irredundant search technique. Let E be the underlying set of elements. We first index E's elements using a one-to-one function ind: E -> $\mathbb{N}$. Then, given any subset S $\subseteq$ E, we define its SE-tree view:

Definition 2.1 A Node's View

$$\mathrm{View}(\mathrm{ind},S) \stackrel{\mathrm{def}}{=} \{e \in E \mid \mathrm{ind}(e) > \max_{e' \in S} \mathrm{ind}(e')\}$$

In the paper there is an example of a tree made with what appears to be E={1,2,3,4}. I have some familiarity with set-builder notation, but much of the other parts of the "node's view" is confusing me. I skimmed ahead to see if there were clarifications, but I didn't manage to find them so either: a) the author is assuming a competency I do not have, b) the explanation is there and I couldn't find it, or c) the author is doing a horrible job as an author.

So with the hope that it is one of the first two:

• I'm assuming that the prime in e' is for the complement of the set e, so if e = {1}, then e' = {2,3,4}. Is this correct?
• What is this ind function? What would ind({3,4}) be for example?
• $\max_{e' \in S}$? Is this the maximum height of the subtree of the complement of $e$?

Any assistance on this would be most appreciated.

###### Answered By : Pål GD

No, the prime is not complement, $e'$ is just a different variable than $e$.

In words, $\text{View}(\text{ind},S)$ is the set of all elements whose index is higher than the indices of the elements in $S$. The function $\text{ind}: E \to \mathbb{N}$ gives a number to each element, and $\max_{e' \in S}\text{ind}(e')$ is simply the highest index in $S$.

Then $\text{View}(\text{ind},S)$ is the set of all elements whose index is higher than $\max_{e' \in S}\text{ind}(e')$.
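A direct transcription of the definition into code (a sketch, representing $\text{ind}$ as a dict and using the paper's running example $E = \{1,2,3,4\}$ with identity indexing):

```python
def view(ind, S, E):
    # View(ind, S): elements of E whose index exceeds every index in S.
    threshold = max(ind[e] for e in S)
    return {e for e in E if ind[e] > threshold}

E = {1, 2, 3, 4}
ind = {e: e for e in E}       # identity indexing of the underlying set
children = view(ind, {2}, E)  # {3, 4}: the elements that may extend node {2}
```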

Question Source : http://cs.stackexchange.com/questions/40146

Problem Detail:

When we perform average case analysis of algorithms, we assume that the inputs to the algorithm are sampled uniformly from some underlying space. For example, the average case analysis of quicksort assumes that the unsorted array is uniformly sampled from the $n!$ permutations of $\{1,\ldots,n\}$.

Suppose instead that the inputs to an algorithm are chosen non-uniformly over the input space.

Is the resulting analysis still "average case" analysis?

If the distribution causes the algorithm to perform at its worst (resp. best), is there a standard name for it? E.g. "adversarial (resp. favourable) distribution of inputs".

Yes, the expected running time under some other distribution would still count as an example of average-case analysis. However, when you describe it to someone, make sure you explain what distribution you're using. I wouldn't recommend that you just call it "average-case analysis" without also explaining that you're using a non-standard probability distribution.

The worst-case distribution is always a point distribution that assigns all its probability to a single input. In other words, the worst-case probability distribution that makes the expected running time as large as possible is in fact just a single input: the input that makes the algorithm run as long as possible. Consequently, this kind of worst-case analysis coincides with the "standard" notion of worst-case running time. For this reason, there is no need for a special "name" for this; worst-case running time already covers it.

The same is true for best-case running time and the probability distribution that makes the expected running time as small as possible.

Question Source : http://cs.stackexchange.com/questions/64352

Problem Detail:

I am getting properly stuck into reinforcement learning and I am currently reading the review paper by Kober et al. (2013).

And there is one constant feature that I cannot get my head around, but which is mentioned a lot, not just in this paper, but others too; namely the existence of gradients.

In section 2.2.2. they say:

The approach is very straightforward and even applicable to policies that are not differentiable.

What does it mean to say that the gradients exist and indeed, how do we know that they exist? When wouldn't they exist?

The gradient doesn't exist / isn't well-defined for non-differentiable functions. What they mean by that statement is that there is an analogous version of gradients that can be used, instead of the gradient.

## Discrete functions

In the discrete case, finite differences are the discrete version of derivatives. The derivative of a single-variable continuous function $f:\mathbb{R} \to \mathbb{R}$ is $df/dx$; the partial difference (a discrete derivative) of a single-variable discrete function $g:\mathbb{Z} \to \mathbb{Z}$ is $\Delta g : \mathbb{Z} \to \mathbb{Z}$ given by

$$\Delta g(x) = g(x+1)-g(x).$$

There's a similar thing that's analogous to a gradient. If we have a function $f(x,y)$ of two variables, the gradient consists of partial derivatives $\partial f / \partial x$ and $\partial f / \partial y$. In the discrete case $g(x,y)$, we have partial differences $\Delta_x g$ and $\Delta_y g$, where

$$\Delta_x g(x,y) = g(x+1,y) - g(x,y)$$

and similarly for $\Delta_y g$.
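As a tiny illustration of the discrete case (the helper names are mine):

```python
def delta_x(f, x, y):
    # Discrete partial difference in x: f(x+1, y) - f(x, y)
    return f(x + 1, y) - f(x, y)

def delta_y(f, x, y):
    # Discrete partial difference in y: f(x, y+1) - f(x, y)
    return f(x, y + 1) - f(x, y)
```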

## Continuous functions

If you have a continuous function that is not differentiable at some points (so the gradient does not exist), sometimes you can use the subgradient in lieu of the gradient.