The post Probability as a measure of ignorance appeared first on Quick Math Intuitions.

**What does a probability measure measure?** Sure, the sets of the $\sigma$-algebra that supports the measure space. But really, what? Thinking about it, it is very difficult to define.

Well, **probability measures our ignorance about something**.

When we make some claim with 90% probability, what we are really saying is that *the knowledge we have* allows us to make a prediction that is that accurate. And the main point here is that **different people may assign different probabilities to the very same claim!** If you have ever seen weather forecasts for the same day disagree, you know what I am talking about. **Different data or different models can generate different knowledge, and thus different probability figures.**

But we do not have to go that far to find reasonable examples. Let’s consider a very simple one. Imagine you are on a train, and sitting in front of you is a girl wearing Patagonia-branded clothes. What would be the odds that the girl has been to Patagonia? Not more than average, you would guess, because Patagonia is just a brand that makes warm clothes, which can be purchased in several stores all around the world, probably even more than in Patagonia itself! So you would probably say that it is surely no more than 50% likely.

**But now imagine a kid in the same scenario.** If they see a girl with Patagonia clothes, they would immediately think that she had been to Patagonia (with probability 100% this time), because they lack a good amount of important information that you instead hold. And so the figure associated with the very same claim is pretty **different depending on the observer, or rather on the knowledge (or lack thereof) they possess**. In this sense probability is a measure of our ignorance.

The post But WHY is the Lattices Bounded Distance Decoding Problem difficult? appeared first on Quick Math Intuitions.

A lattice is a discrete subgroup $\mathcal{L} \subset \mathbb{R}^n$, where the word discrete means that each $x \in \mathcal{L}$ has a neighborhood in $\mathbb{R}^n$ that, when intersected with $\mathcal{L}$, results in $x$ itself only. One can **think of lattices as being grids**, although the coordinates of the points need not be integers. Indeed, all (full-rank) lattices are isomorphic to $\mathbb{Z}^n$, but a lattice may be a grid of points with non-integer coordinates.

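Another very nice way to define a lattice is: given $n$ independent vectors $b_1, \dots, b_n \in \mathbb{R}^n$, the lattice generated by that basis is the set of all linear combinations of them **with integer coefficients:**

$$\mathcal{L}(b_1, \dots, b_n) = \left\{ \sum_{i=1}^{n} a_i b_i \;:\; a_i \in \mathbb{Z} \right\}$$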

Then, we can go on to define the **Bounded Distance Decoding problem** (BDD), which is used in **lattice-based cryptography** (more specifically, for example in trapdoor homomorphic encryption) and believed to be hard in general.

Given an arbitrary basis of a lattice $\mathcal{L}$, and a point $x$ *not necessarily belonging* to $\mathcal{L}$, find the point of $\mathcal{L}$ that is closest to $x$. We are also guaranteed that $x$ is *very close* to one of the lattice points. Notice how we are relying on an *arbitrary* basis – if we claim to be able to solve the problem, we should be able to do so with *any* basis.

Now, as the literature goes, this is a problem that is *hard in general, but easy if the basis is nice enough*. So, for example for encryption, the idea is that we can encode our secret message as a lattice point, and then add to it some small noise (i.e. a small element ). This basically generates an instance of the BDD problem, and then the decoding can only be done by someone who holds the good basis for the lattice, while those having a bad basis are going to have a hard time decrypting the ciphertext.

However, although of course there is no proof of this (it is a problem *believed* to be hard), I wanted to get at least some clue on **why** it should be easy with a nice basis and hard with a bad one (GGH is an example of a scheme that employs techniques based on this).

So now to our real question. But WHY is the Bounded Distance Decoding problem hard (or easy)?

Let’s first say what a good basis is. **A basis is good if it is made of nearly orthogonal short vectors**. This is a pretty vague definition, so let’s make it a bit more specific (although more restrictive): we want a basis in which each vector $b_i$ is of the form $b_i = k_i e_i$ for some $k_i \in \mathbb{R}$, where $e_i$ is the $i$-th canonical basis vector. One can imagine $k_i$ being smaller than some arbitrary value, like 10. (This shortness is pretty vague and its role will be clearer later.) In other words, **a nice basis is the canonical one, in which each vector has been re-scaled by an independent real factor.**

To get a flavor of why the Bounded Distance Decoding problem is easy with a nice basis, let’s make an example. Consider , with as basis vectors. Suppose we are given as challenge point. It does not belong to the lattice generated by , but it is only away from the point , which does belong to the lattice.

Now, what does one have to do to solve this problem? Let’s get a graphical feeling for it and formalize it.

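We are looking for the lattice point closest to $x$. So, sitting on $x$, we are looking for the linear combination with integer coefficients of the basis vectors that is closest to us. Breaking it down component-wise, with a nice basis $b_1 = (k_1, 0)$, $b_2 = (0, k_2)$ and $x = (x_1, x_2)$, we are looking for integers $a$ and $b$ that solve:

$$\min_{a \in \mathbb{Z}} |x_1 - a k_1|, \qquad \min_{b \in \mathbb{Z}} |x_2 - b k_2|$$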

This may seem a difficult optimization problem, but in truth it is very simple! **The reason is that each of the equations is independent, so we can solve them one by one – the individual minimum problems are easy and can be solved quickly**. (One could also put boundaries on with respect to the norm of the basis vectors, but it is not vital now.)

So the overall complexity of solving BDD with a good basis is $O(n)$, one easy minimization per coordinate, which is okay.
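To see the coordinate-by-coordinate idea at work, here is a minimal Python sketch. The basis factors and the challenge point are hypothetical values I picked for illustration, and `decode_with_nice_basis` is just a name I made up; it assumes the basis is exactly a re-scaled canonical one.

```python
# Hypothetical "nice" basis: the canonical basis of R^2 with each vector
# re-scaled by its own factor (values chosen only for illustration).
SCALES = (0.5, 1.25)          # b1 = (0.5, 0), b2 = (0, 1.25)


def decode_with_nice_basis(point, scales):
    """Return the closest lattice point, solving one coordinate at a time."""
    closest = []
    for coordinate, scale in zip(point, scales):
        # Each coordinate is an independent one-dimensional problem:
        # pick the integer multiple of `scale` nearest to `coordinate`.
        coefficient = round(coordinate / scale)
        closest.append(coefficient * scale)
    return tuple(closest)


challenge = (2.3, 2.4)        # hypothetical challenge point, not on the lattice
print(decode_with_nice_basis(challenge, SCALES))   # (2.5, 2.5)
```

The loop runs once per coordinate and never looks at the other coordinates, which is the whole point of having a basis of this shape.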

**A bad basis** is any basis that fails at least one of the two conditions of a nice basis: it **may be poorly orthogonal, or may be made of long vectors.** We will later try to understand what roles these differences play in solving the problem: for now, let’s just consider an example again.

Another basis for the lattice generated by the nice basis we picked before () is . This is a bad one.

Let’s write down the system of equations coordinate-wise as we did for the nice basis. We are looking for and such that they are solution of:

Now look! This may look similar to before, but **this time it really is a system, the equations are no longer independent:** we have 3 unknowns and 2 equations. The system is under-determined! This already means that, in principle, there are infinitely many solutions. Moreover, we are also trying to find a solution that is constrained to be a minimum. Especially with big $n$, solving this optimization problem can definitely be non-trivial!
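One way to feel the difference is to try the most naive strategy with both kinds of basis: write the challenge point in the basis, round the coefficients, and map back (this is usually called Babai's rounding; it is my choice for the illustration, not something the argument above depends on). The sketch below is a hedged toy example: both bases and the challenge point are hypothetical values, and the long, skewed basis is built from the short one through an integer matrix of determinant 1, so the two generate the same lattice.

```python
def round_in_basis(point, basis):
    """Write `point` in a 2D basis, round the coefficients, map back."""
    (b1x, b1y), (b2x, b2y) = basis
    px, py = point
    det = b1x * b2y - b1y * b2x
    # Cramer's rule for point = s * b1 + t * b2 (real-valued s and t).
    s = (px * b2y - py * b2x) / det
    t = (b1x * py - b1y * px) / det
    s, t = round(s), round(t)
    return (s * b1x + t * b2x, s * b1y + t * b2y)


# Hypothetical nice basis: short, orthogonal vectors.
nice = ((0.5, 0.0), (0.0, 1.25))
# Hypothetical bad basis for the same lattice:
# c1 = 3*b1 + 4*b2 and c2 = 2*b1 + 3*b2 (integer matrix, determinant 1).
bad = ((1.5, 5.0), (1.0, 3.75))

challenge = (2.3, 2.4)
print(round_in_basis(challenge, nice))   # (2.5, 2.5), the closest lattice point
print(round_in_basis(challenge, bad))    # (2.0, 1.25), a lattice point much farther away
```

With the nice basis, rounding lands on the closest point; with the bad one, the very same procedure returns a valid lattice point that is not the closest. Of course a smarter attacker would not stop at naive rounding, so this only shows where the easy approach breaks, not that the problem is hard.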

So far so good: we have discovered why the Bounded Distance Decoding problem is easy with a good basis and difficult with a bad one. But still, **what does a good basis have that makes it easy? How do its properties relate to ease of solution?**

We enforced two conditions: orthogonality and shortness. Actually, we even required something stronger than orthogonality: that the good basis was basically a stretched version of the canonical one – i.e. each vector had only one non-zero entry.

**Let’s think for a second in terms of the canonical basis** $\{e_1, \dots, e_n\}$, where each vector touches one coordinate only. **This is what makes the minimum problems independent** and allows for easy resolution of the BDD problem. However, when dealing with cryptography matters, **we cannot always use the same basis,** we need some randomness. That is why we required a set of independent vectors each having only one non-zero coordinate: it is the main feature that makes the problem easy (at least for the party having the good basis).

We also asked for **shortness. This does not give an immediate advantage to whoever holds the good basis, but makes it harder to solve the problem for those holding the bad one.** The idea is that, given a challenge point $x$, if we have short basis vectors, *we can take small steps* from it and look around us for nearby points. It may take some time to find the best one, but we are still not looking totally astray. Instead, **if we have long vectors, every time we use one we have to make a big leap in one direction**. In other words, *whoever has the good basis knows the step size of the lattice, and thus can take steps of considerate size, slowly poking around*; whoever has the bad basis takes huge jumps and may have a hard time pinpointing the right point.

It is true, though, that the features of a good basis usually only include shortness and orthogonality, and not the “rescaling of the canonical basis” we assumed in the first place. So, let’s consider a basis of that kind, like . If we wrote down the minimum problem we would have to solve given a challenge point, it would be pretty similar to the one with the bad basis, with the equations not being independent. Looks like bad luck, uh?

However, not all hope is lost! In fact, **we can look for the rotation matrix that will turn that basis into a stretching of the canonical one,** finding ! Then we can rotate the challenge point as well, and solve the problem with respect to those new basis vectors. Of course that is not going to be the solution to the problem, but we can easily rotate it back to find the real solution!

However, given that using a basis of this kind does not make the opponent's job any harder, but only increases the computational cost for the honest party, I do not see why this should ever be used. Instead, I guess the best choices for good bases are the stretched canonical ones.

(This may be obvious, but having a generic orthogonal basis is not enough for an opponent to break the problem. If it is orthogonal, but its vectors are long, bad luck!)

The post Conditional probability: why is it defined like that? appeared first on Quick Math Intuitions.

We need to wonder: **what does the happening of event $B$ tell about the odds of happening of event $A$?** How much *more likely* does $A$ become if $B$ happens? Think in terms of **how $B$ affects $A$**.

**If $A$ and $B$ are independent**, then knowing something about $B$ will not tell us anything at all about $A$, at least not that we did not know already. In this case $P(A \cap B) = P(A)P(B)$ (which is not the same as $A \cap B$ being empty) and thus $P(A|B) = P(A)$. This makes sense! In fact, consider this example: how does me buying a copybook affect the likelihood that your grandma is going to buy a frying pan? It does not: the first event has no influence on the second, thus the conditional probability is just the same as the normal probability of the second event.

**If $A$ and $B$ are not independent**, several things can happen, and that is where things get interesting. We know that $B$ happened, and we should now **think as if $B$ was our whole universe**. The idea is: we already know what the odds of $A$ are, right? It is just $P(A)$. **But how do they change if we know that we do not really have to consider all possible events, but just a subset of them?** As an example, think of the probability of drawing a red ball versus the same probability *knowing that* all balls are red. This makes a huge difference, right? (As an aside, that is what we mean when we say that probability is a measure of our ignorance.)

So anyway, now we ask: what is the probability of $A$? Well, it would just be $P(A)$, but we must account for the fact that we now *live inside* $B$, and everything that is outside it is as if it did not exist. So $P(A)$ actually becomes $P(A \cap B)$: we only care about the part of $A$ that is inside $B$, because that is where we live now.

But there is a caveat. We are *thinking* as if $B$ was the whole universe but, in terms of probabilities, it actually is not, because nobody has informed the probability distribution. In fact, we compute $P(A \cap B)$ precisely because **we still live in the bigger universe** $\Omega$, but we need to account for the fact that $B$ is our real universe now. That is why we need a *re-scaling factor*: something that will scale $P(A \cap B)$ to make it numerically correct, to account for the fact that $B$ is our current universe. This is what the $P(B)$ at the denominator does: $P(A|B) = \frac{P(A \cap B)}{P(B)}$.
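If you want to see the re-scaling with actual numbers, here is a small Python sketch with two fair dice. The experiment and the two events are hypothetical choices of mine, nothing more. It computes the conditional probability in two ways: by really moving into the smaller universe and counting there, and by staying in the big universe and dividing by the probability of the conditioning event.

```python
from fractions import Fraction
from itertools import product

# Hypothetical experiment: roll two fair dice (36 equally likely outcomes).
universe = list(product(range(1, 7), repeat=2))


def is_a(outcome):      # A: the two dice sum to 8
    return sum(outcome) == 8


def is_b(outcome):      # B: the first die shows an even number
    return outcome[0] % 2 == 0


def prob(event, space):
    """Probability of `event` when every outcome of `space` is equally likely."""
    return Fraction(sum(1 for o in space if event(o)), len(space))


# Route 1: really move into B and count there.
inside_b = [o for o in universe if is_b(o)]
direct = prob(is_a, inside_b)

# Route 2: stay in the big universe and re-scale by P(B).
p_a_and_b = prob(lambda o: is_a(o) and is_b(o), universe)
rescaled = p_a_and_b / prob(is_b, universe)

print(direct, rescaled)   # 1/6 1/6
```

Both routes agree, and both differ from the unconditional probability of the sum being 8, which is 5/36.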

In fact, the factor $P(B)$ accounts for *how relevant the information that $B$ happened is*. If $P(B) = 1$, it means that, for probability purposes, $B = \Omega$ – the switch of universe was just apparent! A further consequence is that $P(A \cap B) = P(A)$, because $A$ is basically inside $B$ (apart from silly null-measure caveats). In turn, this has the consequence of making $P(A|B) = P(A)$. This makes sense: if $B$ is sure to happen, then what does it tell us about the odds of something else? As an example, if we are considering strings of digits ($\Omega$), what is the likelihood that a certain string is made of ones or twos, knowing that it is made out of digits ($B$)? It sounds tautological, and it certainly is.

What about a $P(B)$ that is big, yet $P(B) < 1$? This is trickier, as it mostly depends on the interplay between $A$ and $B$. But you are not on a university level website to read about inverse proportionality, are you?

Another case worth inspection is when $B \subseteq A$. In that case,

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B)}{P(B)} = 1$$

Makes sense, right? If $B$ happened, and $B$ is inside $A$, then clearly $A$ must happen as well. If I bought a red umbrella, what are the odds that I bought a generic umbrella as a consequence? Full, yep.

Finally, let’s consider the case in which $P(B)$ is very small. Suppose that $P(B) = \varepsilon$ for some tiny $\varepsilon$. An $\varepsilon$ at the denominator will make the resulting fraction become significantly bigger.

The idea here is that if $B$ is very narrow, if it talks about **something very unlikely, and it happened, this greatly influences the overall conditional probability**. What are the odds that I get hospitalized in Japan as a 25-year-old man? Very low. What are the odds that today there is an earthquake in Japan? Very low. What are the odds that I get hospitalized in Japan, knowing that today an earthquake happened? Quite high. That’s the idea: the more unlikely $B$ is, the higher $P(A|B)$ tends to be (as long as $A \cap B$ does not shrink just as fast). In a sense, the narrower $B$ is, the more information it brings to know that it happened.

The post Diagonalizing a matrix NOT having full rank, what does it mean? appeared first on Quick Math Intuitions.

Every matrix can be seen as a linear map between vector spaces. Stating that a matrix is similar to a diagonal matrix is equivalent to stating that there exists a basis of the source vector space in which the linear transformation can be seen as a simple *stretching of the space*, as re-scaling the space. In other words, diagonalizing a matrix is the same as (at least for symmetric matrices) *finding an orthogonal grid that is transformed into another orthogonal grid*. I recommend this article from AMS for good visual representations of the topic.

That’s all right – when we have a matrix from $\mathbb{R}^n$ to $\mathbb{R}^n$, if it can be diagonalized, we can find a basis in which the transformation is a re-scaling of the space, fine.

But what does it mean to diagonalize a matrix that has null determinant? The associated transformations have the effect of killing at least one dimension: indeed, an $n \times n$ matrix of rank $k$ has the effect of lowering the output dimension by $n - k$. For example, a $3 \times 3$ matrix of rank 2 will have an image of dimension 2, instead of 3. This happens because two basis vectors are merged into the same vector in the output, so one dimension is bound to collapse.

Let’s consider the sample matrix

which does not have full rank because it has two equal rows. Indeed, one can check that the two vectors $e_1$ and $e_3$ go to the same vector. This means that $\dim \operatorname{Im}(A) = 2$ instead of 3. In fact, it is common intuition that when the rank is not full, some dimensions are lost in the transformation. Even if it’s a $3 \times 3$ matrix, the output only has 2 dimensions. It’s like at the end of Interstellar when the 4D space in which Cooper is floating gets shut.

However, $A$ is also a symmetric matrix, so from the spectral theorem we know that it can be diagonalized. And now to the vital questions: what do we expect? What meaning does it have? *Do we expect a basis of three vectors even if the map destroys one dimension?*

Pause and ponder.

Diagonalize the matrix and, indeed, you obtain three eigenvalues:

The eigenvalues are thus , and , each giving a different eigenvector. Taken all together, they form an orthogonal basis of $\mathbb{R}^3$. The fact that $0$ is among the eigenvalues is important: it means that *all the vectors belonging to the associated eigenspace go to the same value*: zero. This is the mathematical representation of the fact that **one dimension collapses**.

At first, I naively thought that, since the transformation destroys one dimension, I should expect to find a 2D basis of eigenvectors. But this was because I confused the source of the map with its image! The point is that we can still find a basis of the source space from the perspective of which the transformation is just a re-scaling of the space. However, that doesn’t tell anything about the behavior of the transformation, whether it will preserve all dimensions: *it is possible that two vectors of the basis will go to the same vector in the image*!

In fact, the fact that the first and third rows of the matrix are the same means that the basis vectors $e_1$ and $e_3$ both go into . A basis of the image is simply , and we should not be surprised by the fact that those vectors have three entries. In fact, *two* vectors (even with *three* coordinates) only allow us to represent a *2D* space. In theory, one could express any vector that is a combination of the basis above as a combination of the usual 2D basis $\{(1, 0), (0, 1)\}$, to confirm that $\dim \operatorname{Im}(A) = 2$.
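Here is a tiny Python check of the whole story on a concrete matrix. Careful: the matrix `M` below is a hypothetical example I chose because it is symmetric and has equal first and third rows; the eigenpairs are the ones I worked out by hand for this particular `M`, and the code merely verifies them.

```python
# Hypothetical symmetric 3x3 matrix with equal first and third rows (rank 2).
M = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1]]


def apply(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Candidate (eigenvalue, eigenvector) pairs for M, one of them with eigenvalue 0.
eigenpairs = [(0, [1, 0, -1]), (1, [0, 1, 0]), (2, [1, 0, 1])]
for value, vector in eigenpairs:
    # M v must equal value * v for a genuine eigenpair.
    assert apply(M, vector) == [value * v for v in vector]

# e1 and e3 are sent to the same vector: one dimension collapses.
print(apply(M, [1, 0, 0]), apply(M, [0, 0, 1]))   # [1, 0, 1] [1, 0, 1]
```

So there really are three eigenvectors forming a basis of the source space, even though the image is only two-dimensional: the eigenvector with eigenvalue 0 is exactly the direction that gets squashed.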

The post Finding paths of length n in a graph appeared first on Quick Math Intuitions.

For example, in the graph alongside there is one path of length 2 that links nodes A and B (A-D-B). How can this be discovered from its adjacency matrix?

It turns out there is a beautiful mathematical way of obtaining this information! Although this is not the way it is done in practice, it is still very nice. In practice, Breadth First Search is used to find paths of any length given a starting node.

**PROP**. $(A^n)_{ij}$ holds the number of paths of length $n$ from node $i$ to node $j$, where $A$ is the adjacency matrix of the graph.

Let’s see how this proposition works. Consider the adjacency matrix of the graph above:

With we should find paths of length 2. So we first need to square the adjacency matrix:

Back to our original question: how to discover that there is only one path of length 2 between nodes A and B? Just look at the value $(A^2)_{12}$, which is 1 as expected! Another example: $(A^2)_{22} = 3$, because there are 3 paths that link B with itself: B-A-B, B-D-B and B-E-B.

This will work with any pair of nodes, of course, as well as with any power to get paths of any length.

Now to the intuition on why this method works. Let’s focus on $n = 2$ for the sake of simplicity, and let’s look, again, at paths linking A to B. $(A^2)_{12}$, which is what we look at, comes from the dot product of the first row with the second column of $A$:

Now, the result is non-zero due to the fourth component, in which both vectors have a 1. Now, let us think what that 1 means in each of them:

- $A_{14} = 1$ – first row -> first node (A) is linked to fourth node (D)
- $A_{42} = 1$ – second column -> second node (B) is linked to fourth node (D)

So overall this means that **A and B are both linked to the same intermediate node**, they *share a node* in some sense. Thus we can go from A to B in two steps: going through their common node.

The same intuition will work for longer paths: when the two vectors of a dot product agree on some component, it means that the two nodes are both linked to another common node. For paths of length three, for example, instead of thinking in terms of two nodes, think in terms of paths of length 2 linked to other nodes: when there is a node in common between a 2-path and another node, it means there is a 3-path!
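A quick way to play with the proposition is to code it up. The sketch below uses plain Python lists; the graph is a hypothetical one that I built to agree with the facts quoted above (A-D-B is the only 2-step route from A to B, and B is linked to A, D and E), so the edge C-D in particular is my own assumption. Note that what gets counted are really walks: nodes may repeat, as in B-A-B.

```python
NODES = "ABCDE"
# Hypothetical undirected graph with edges A-B, A-D, B-D, B-E and C-D.
ADJ = [
    [0, 1, 0, 1, 0],   # A
    [1, 0, 0, 1, 1],   # B
    [0, 0, 0, 1, 0],   # C
    [1, 1, 1, 0, 0],   # D
    [0, 1, 0, 0, 0],   # E
]


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def count_walks(adjacency, length):
    """Return adjacency ** length; entry [i][j] counts walks from i to j."""
    result = adjacency
    for _ in range(length - 1):
        result = mat_mul(result, adjacency)
    return result


two_step = count_walks(ADJ, 2)
a, b = NODES.index("A"), NODES.index("B")
print(two_step[a][b], two_step[b][b])   # 1 3
```

Changing the second argument of `count_walks` gives the counts for any other length.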

The post On the relationship between L^p spaces and C_c functions for p = infinity appeared first on Quick Math Intuitions.

When we discover that $C_c(\Omega)$ (continuous functions with compact support) is dense in $L^p(\Omega)$, we also discover that it does not hold if $\Omega = \mathbb{R}^n$ and $p = \infty$.

What that intuitively means is that if you take away functions in $C_c$ from $L^p$, you take away something fundamental for $L^p$: you are somehow taking away a net that keeps the ceiling up.

The fact that it becomes false for limitless spaces ($\Omega = \mathbb{R}^n$) and $p = \infty$ means that the functions in $L^\infty$ *do not need* functions in $C_c$ to survive.

This is reasonable: functions in $L^\infty$ are not required to exist only in a specific (compact) region of space, whereas functions in $C_c$ are. Functions in $L^\infty$ are simply bounded – their image stays below some value, but they can go however far they want in the *x* direction. **Very roughly speaking, they have a limit on their height, but not on their width**.

What we find out, however, is that the following chain of inclusions holds:

$$C_c \subset C_0 \subset L^\infty$$

That’s reasonable! Think about it:

- Functions in $C_c$ live in a well defined area of space – a *confined* area of space.
- Functions in $C_0$ are allowed to live everywhere, with the constraint that they become more and more negligible the farther and farther we go. Not required to ever be zero though.
- Functions in $L^\infty$ are simply required to have an upper bound (a finite one, obviously).
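To make the three bullets concrete, here is a small worked example on $\mathbb{R}$. The specific functions are my own picks for illustration, and I am writing $C_0$ for continuous functions that vanish at infinity. It also contains the one-line reason why compactly supported functions cannot get close to everything that is merely bounded.

```latex
% Both inclusions are strict:
e^{-x^2} \in C_0 \setminus C_c
\qquad \text{and} \qquad
f(x) = 1 \in L^\infty \setminus C_0 .

% Any g in C_c vanishes outside a compact set, i.e. on a set of positive
% measure where f - g = 1, hence
\| f - g \|_\infty \;\ge\; 1 \qquad \text{for every } g \in C_c .
```

So the constant function stays at distance at least 1 from every function with compact support: no sequence of them can approximate it in the sup norm.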

I’m not saying this is simple (advanced analysis is at least as difficult as driving a nail with a needle as a hammer), but after careful thinking, it’s just the way it should be, given the definitions.

The post The meaning of F Value in the Analysis of Variance for Linear regression appeared first on Quick Math Intuitions.

The F Value is computed by dividing the value in the Mean Square column for Model by the value in the Mean Square column for Error. In our example, it’s .

There are **two possible interpretations for the F Value** in the Analysis of Variance table for the linear regression.

**We are comparing the variances of the model and of the error**.

The two factors each represent the numerator of the variance of the model and of the error. What do we want? The only hypothesis of the linear regression model is that the error $\varepsilon$ is a normal variable with zero mean. Thus we want a small variance for the error, so we can say the errors are close to zero.

**We are comparing the model with all the variables with the model with only the intercept as variable**.

This ambiguity exists because can either be seen as the numerator of the variance of , or as a comparison between the complete model and the reduced model in which only the intercept is used.
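The two readings can be checked numerically. Below is a minimal Python sketch for a simple linear regression with one predictor; the data points are made up, `f_value` is a hypothetical helper name, and the degrees of freedom (1 for the model, n - 2 for the error) are the ones for this one-predictor case only.

```python
def f_value(xs, ys):
    """Return the F value computed in two ways for a one-predictor regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares fit of y = intercept + slope * x.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * x for x in xs]

    ss_model = sum((f - mean_y) ** 2 for f in fitted)
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    # Squared error of the reduced model that only uses the intercept.
    ss_reduced = sum((y - mean_y) ** 2 for y in ys)

    df_model, df_error = 1, n - 2
    ms_error = ss_error / df_error
    # Reading 1: Mean Square of the Model over Mean Square of the Error.
    by_variances = (ss_model / df_model) / ms_error
    # Reading 2: improvement of the full model over the intercept-only model.
    by_model_comparison = ((ss_reduced - ss_error) / df_model) / ms_error
    return by_variances, by_model_comparison


# Made-up data, roughly on a line.
print(f_value([1, 2, 3, 4, 5], [2.0, 4.1, 5.9, 8.2, 9.8]))
```

The two numbers printed coincide up to floating point noise, because the squared error of the intercept-only model splits exactly into the model part plus the error part.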

The post On the meaning of hypothesis and p-value in statistical hypothesis testing appeared first on Quick Math Intuitions.

In statistical hypothesis testing, we

- have some data, whatever it is, which we imagine as being values of some random variable;
- make a hypothesis about the data, such as that the expected value of the random variable is some value $\mu_0$;
- find a distribution for some affine transformation of the random variable we are making inference about – this is the test statistic;
- run the test, i.e. numerically say how probable our observations were in relation to the hypothesis we made.

I had a couple of A-HA moments I’d like to share.

There is a reason why this is called *hypothesis testing* and not *hypothesis choice*. There are indeed two hypotheses, the null and the alternative hypothesis. However, their roles are widely different! 90% of what we do, both from a conceptual and a numerical point of view, has to do with the null hypothesis. They really are not symmetric. The question we are asking is “With the data I have, am I certain enough my null hypothesis no longer stands?”, not at all “With the data I have, which of the two hypotheses is better?”

In fact, **the alternative hypothesis is only relevant in determining what kind of alternative we have**: whether it’s one-sided (and which side) or two-sided. This affects calculations. But other than that, the math doesn’t really care about the specific value of the alternative. In other words, the two following tests are really equivalent:

This accounts for why, when evaluating a p-value, we reject the null hypothesis only for very low figures. The way I first thought about it had been: “Well, the p-value is, intuitively, a measure of the proximity of the observed data to the null hypothesis. Then, if I get something around , I should reject the null hypothesis and switch to the alternative, as it seems a better theory.” But this is a flawed argument indeed. **To see if the alternative was really better I should run a test using it as principal hypothesis!** We reject for very low p-values because that means the null hypothesis really isn’t any good anymore, and should be thrown in the bin. Then we need to care about finding another good theory that can suit the data.

However, before throwing the current theory out of the window, we don’t accept all kinds of evidence against it: **we want very strong evidence**. We don’t want to discard the current theory for another that could only be marginally better. It must be crushingly better!
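To see how little the alternative enters the computation, here is a hedged Python sketch of a z-test. It assumes a normal population with known standard deviation; the sample, the null value and the standard deviation are made-up numbers, and the function names are mine. The null value is the only thing used to build the statistic; the alternative only picks which tail(s) to measure.

```python
from math import erfc, sqrt


def z_statistic(sample, mu0, sigma):
    """Standardized distance of the sample mean from the null value mu0."""
    n = len(sample)
    return (sum(sample) / n - mu0) / (sigma / sqrt(n))


def p_value(z, alternative):
    """p-value of a z-test; `alternative` only selects the tail(s)."""
    upper_tail = 0.5 * erfc(z / sqrt(2))     # P(Z >= z) for a standard normal Z
    if alternative == "greater":
        return upper_tail
    if alternative == "less":
        return 1 - upper_tail
    if alternative == "two-sided":
        return erfc(abs(z) / sqrt(2))        # 2 * P(Z >= |z|)
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


# Made-up sample, null value and (assumed known) standard deviation.
z = z_statistic([5.3, 4.9, 5.6, 5.1, 5.4], mu0=5.0, sigma=0.5)
for alternative in ("greater", "less", "two-sided"):
    print(alternative, p_value(z, alternative))
```

Notice that no specific alternative value ever appears as an argument: only the null value and the sidedness do.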

The post Why hash tables should use a prime-number size appeared first on Quick Math Intuitions.

I believe that it just has to do with the fact that computers work in base 2. Just think about how the same thing works for base 10:

- 8 % 10 = 8
- 18 % 10 = 8
- 87865378 % 10 = 8
- 2387762348 % 10 = 8

It doesn’t matter what the number is: as long as it ends with 8, its modulo 10 will be 8. You could pick a huge power of 10 as modulus, such as 10^k (with k > 10, let’s say), but

- you would need a huge table to store the values
- the hash function is still pretty stupid: it just trims the number, retaining only the last *k* digits.

However, if you pick a different number as modulo operator, such as 12, then things are different:

- 8 % 12 = 8
- 18 % 12 = 6
- 87865378 % 12 = 10
- 2387762348 % 12 = 8

We still have a collision, but the pattern becomes more complicated, and the collision is just due to the fact that 12 is still a small number.

**Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.**

For example, with 367:

- 8 % 367 = 8
- 18 % 367 = 18
- 87865378 % 367 = 73
- 2387762348 % 367 = 160

What is worth noting is that there may be a pattern even with modulo 367, but it would be way less trivial than with modulo 10 (or with modulo 2 in binary). **We don’t really need a prime number**, just having a big non-power of two is enough. Having a prime number, obviously, is just a guaranteed way of satisfying those conditions.
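The examples are easy to reproduce. Here is a short Python sketch using the same four keys; the table size of 16 in the second part is a hypothetical value I picked to show what happens with a power of two.

```python
keys = [8, 18, 87865378, 2387762348]

for modulus in (10, 12, 367):
    print(modulus, [key % modulus for key in keys])
# 10 [8, 8, 8, 8]
# 12 [8, 6, 10, 8]
# 367 [8, 18, 73, 160]

# With a power-of-two table size, the bucket is just the low bits of the key:
# the remaining bits never influence where the key lands.
table_size = 16
low_bits_mask = table_size - 1
print([key % table_size == key & low_bits_mask for key in keys])
# [True, True, True, True]
```

The last line is the binary analogue of "ends with 8": modulo a power of two only ever looks at the last few bits.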

The post Metaphysics on geometric distribution in probability theory appeared first on Quick Math Intuitions.

This may be a naive post, I warn you, but I was really stunned when I realized this.

Let’s jump to the point. We know (or at least, I was taught) that the geometric distribution is used to calculate the probability that the first success in $k$ trials (all independent and with success probability $p$) will happen precisely at the $k$-th trial.

Remember that a geometric random variable $X$ is a random variable whose distribution is

$$P(X = k) = p (1 - p)^{k - 1}$$

How can we **relate the above distribution with the fact that it matches the first success**? Well, we need to have one success, which explains the factor $p$. Moreover, we want to have just one success, so all other trials must be unsuccessful, which explains the $(1 - p)^{k - 1}$.

But hey, **where would *first* ever be written**? Unless you do probability in a non-commutative ring (in which case, I don’t know what you are doing), multiplication is commutative. So **who can tell the order** between the events in a Bernoulli process?

In fact, $p(1-p)^{k-1}$ could just as well refer to having unsuccessful outcomes for the first $k-1$ trials and then a successful one at the $k$-th trial, as to having a success in the very first attempt and then all failures. As it is, **as long as we fix one (and only one) position for the success among the $k$ attempts, the very same formula holds**!

Apparently then, the geometric formula *is* about the time of first success, but it is not *just* about that. It encompasses way more cases, all equally likely: it gives the probability of *exactly one success* at any one prescribed position among $k$ trials in a Bernoulli process. (The probability of exactly one success at *some* position is the sum over the $k$ positions, $k\,p(1-p)^{k-1}$.)

**The universe does not care about the order of events** (in a Bernoulli process, at least). As long as we do $k$ trials, regardless of when the success happens, the universe does not care. This stuns me!
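If you do not trust the algebra, brute force will do. The Python sketch below enumerates every sequence of trials with exactly one success and computes its probability with exact fractions. The values of `p` and `k` are hypothetical, chosen only to keep the numbers small.

```python
from fractions import Fraction
from itertools import product


def sequence_probability(outcomes, p):
    """Probability of one specific sequence of independent Bernoulli trials."""
    prob = Fraction(1)
    for success in outcomes:
        prob *= p if success else 1 - p
    return prob


p, k = Fraction(1, 4), 4      # hypothetical success probability and number of trials

# All sequences of k trials containing exactly one success (True).
single = [s for s in product([False, True], repeat=k) if sum(s) == 1]
probs = [sequence_probability(s, p) for s in single]
geometric = p * (1 - p) ** (k - 1)

print(geometric, sum(probs))   # 27/256 27/64
```

Every one of the `k` sequences has the same probability as the geometric formula, wherever the success sits; adding them up gives the probability of exactly one success at some position, which is `k` times larger.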
