Conditional probability: why is it defined like that?

So, you want to calculate the probability of an event knowing that another has happened. There is a formula for that, it is called conditional probability, but why is it the way it is? Let’s first write down the definition of conditional probability:

    \[\mathbb{P}(A | B) = \dfrac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}\]

We need to wonder: what does the happening of event B tell about the odds of happening of event A? How much more likely A becomes if B happens? Think in terms of how B affects A.

If A and B are independent, then knowing something about B will not tell us anything at all about A, at least not that we did not know already. In this case A \cap B is empty and thus \mathbb{P}(A | B) = \mathbb{P}(A). This makes sense! In fact, consider this example: how does me buying a copybook affects the likelihood that your grandma is going to buy a frying pan? It does not: the first event has no influence on the second, thus the conditional probability is just the same as the normal probability of the first event.

Sets no intersection

If A and B are not independent, several things can happen, and that is where things get interesting. We know that B happened, and we should now think as if B was our whole universe. The idea is: we already know what are the odds of A, right? It is just \mathbb{P}(A). But how do they increase if we know that we do not really have to consider all possible events, but just a subset of them? As an example, think of \mathbb{P}(\text{drawing a red ball}) versus \mathbb{P}(\text{drawing a red ball}) knowing that all balls are red. This makes a huge difference, right? (As an aside, that is what we mean when we say that probability is a measure of our ignorance.)

So anyway, now we ask: what is the probability of A? Well, it would just be \mathbb{P}(A), but we must account for the fact that we now live inside B, and everything that is outside it is as if it did not existed. So \mathbb{P}(A) actually becomes \mathbb{P}(A \cap B): we only care about the part of A that is inside B, because that is where we live now.

But, there is a caveat. Continue reading “Conditional probability: why is it defined like that?”

Diagonalizing a matrix NOT having full rank: what does it mean?

This is going to be a quick intuition about what it means to diagonalize a matrix that does not have full rank (i.e. has null determinant).

Every matrix can be seen as a linear map between vector spaces. Stating that a matrix is similar to a diagonal matrix equals to stating that there exists a basis of the source vector space in which the linear transformation can be seen as a simple stretching of the space, as re-scaling the space. In other words, diagonalizing a matrix is the same as finding an orthogonal grid that is transformed in another orthogonal grid. I recommend this article from AMS for good visual representations of the topic.

Image Taken from AMS - We Recommend a Singular Value Decomposition
Taken from AMS – We Recommend a Singular Value Decomposition

Diagonalization on non full rank matrices

That’s all right – when we have a matrix from \mathbb{R}^3 in \mathbb{R}^3, if it can be diagonalized, we can find a basis in which the transformation is a re-scaling of the space, fine.

But what does it mean to diagonalize a matrix that has null determinant? The associated transformations have the effect of killing at least one dimension: indeed, a nxn matrix of rank k has the effect of lowering the output dimension by n-k. For example, a 3×3 matrix of rank 2 will have an image of size 2, instead of 3. This happens because two basis vectors are merged in the same vector in the output, so one dimension is bound to collapse.

Let’s consider the sample matrix

    \[A = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}\]

which has non full rank because has two equal rows. Indeed, one can check that the two vectors (1,0,0); (0,0,1) go in the same basis vector. This means that dim(Im(f_A))=2 instead of 3. In fact, it is common intuition that when the rank is not full, some dimensions are lost in the transformation. Even if it’s a 3×3 matrix, the output only has 2 dimensions. It’s like at the end of Interstellar when the 4D space in which Cooper is floating gets shut.

However, A is also a symmetric matrix, so from the spectral theorem we know that it can be diagonalized. And now to the vital questions: what do we expect? What meaning does it have? Do we expect a basis of three vectors even if the map destroys one dimension?

Pause and ponder.

Continue reading “Diagonalizing a matrix NOT having full rank: what does it mean?”

Finding paths of length n in a graph

Suppose you have a non-directed graph, represented through its adjacency matrix. How would you discover how many paths of length n link any two nodes?

Graph Adjacency Matrix 1For example, in the graph aside there is one path of length 2 that links nodes A and B (A-D-B). How can this be discovered from its adjacency matrix?

It turns out there is a beautiful mathematical way of obtaining this information! Although this is not the way it is used in practice, it is still very nice. In fact, Breadth First Search is used to find paths of any length given a starting node.

PROP. (A^n)_{ij} holds the number of paths of length n from node i to node j.

Let’s see how this proposition works. Consider the adjacency matrix of the graph above:

    \[(A)_{ij} \ \ \begin{matrix} \textbf{-} & \textbf{A} & \textbf{B} & \textbf{C} & \textbf{D} & \textbf{E} \\ \textbf{A} & 0 & 1 & 1 & 1 & 0 \\ \textbf{B} & 1 & 0 & 0 & 1 & 1 \\ \textbf{C} & 1 & 0 & 0 & 1 & 0 \\ \textbf{D} & 1 & 1 & 1 & 0 & 1 \\ \textbf{E} & 0 & 1 & 0 & 1 & 0 \\ \end{matrix}\]

With n = 2 we should find paths of length 2. So we first need to square the adjacency matrix:

Continue reading “Finding paths of length n in a graph”

On the relationship between L^p spaces and C_c functions for p = infinity

Very quick post on the relationship between \mathcal{L}^p, \mathcal{C}_c(X) and \mathcal{L}^\infty. I will assume you already know what I am talking about, I’ll just be sharing some intuition on what those mean, but won’t bother with details. It’s more a reminder for me rather than something that intends to be useful, actually, but there’s almost nothing on the Internet about this!

When we discover that \mathcal{C}_c(X) (continuous functions with compact support) is dense in \mathcal{L}^p, we also discover that it does not hold if p = \infty and \mu(X) = \infty.

What that intuitively means is that if you take away functions in \mathcal{C}_c(X) from \mathcal{L}^p, you take away something fundamental for \mathcal{L}^p: you are somehow taking away a net that keeps the ceiling up.

The fact that it becomes false for limitless spaces (\mu(X) = \infty) and p = \infty means that the functions in \mathcal{L}^\infty do not need functions in \mathcal{C}_c(X) to survive.

This is reasonable: functions in \mathcal{L}^\infty are not required to exist only in a specific (compact) region of space, whereas functions in \mathcal{C}_c(X) do. Functions in\mathcal{L}^\infty are simply bounded – their image keeps below some value, but can go however far they want in x direction. Very roughly speaking, they have a limit on their height, but not on their width.

What we find out, however, is that the following chain of inclusions holds:

\mathcal{C}_c(X) \subset \mathcal{C}_\infty(X) \subset \mathcal{L}^\infty

Continue reading “On the relationship between L^p spaces and C_c functions for p = infinity”

The meaning of F Value in the Analysis of Variance for Linear regression

This is a sample output for linear regression:

Linear regression output

The F Value is computed by dividing the value in the Mean Square column for Model with the value in the Mean Square column for Error. In our example, it’s 178.11288/5.34346 = 33.33.

There are two possible interpretations for the F Value in the Analysis of Variance table for the linear regression.

Continue reading “The meaning of F Value in the Analysis of Variance for Linear regression”

On the meaning of hypothesis and p-value in statistical hypothesis testing

Statistical hypothesis testing is really an interesting topic. I’ll just briefly sum up what statistical hypothesis testing is about, and what you do to test an hypothesis, but will assume you are already familiar with it, so that I can quickly cover a couple of A-HAs moments I had.

In statistical hypothesis testing, we

  • have some data, whatever it is, which we imagine as being values of some random variable;
  • make an hypothesis about the data, such as that the expected value of the random variable is \mu;
  • find a distribution for any affine transformation of the random variable we are making inference about – this is the test statistic;
  • run the test, i.e. numerically say how much probable how observations were in relation to the hypothesis we made.

I had a couple of A-HA moments I’d like to share.

There is a reason why this is called hypothesis testing and not hypothesis choice. There are indeed two hypothesis, the null and the alternative hypothesis. However, their roles are widely different! 90% of what we do, both from a conceptual and a numerical point of view, has to do with the null hypothesis. They really are not symmetric. The question we are asking is “With the data I have, am I certain enough my null hypothesis no longer stands?” not at all “With the data I have, which of the two hypothesis is better?”

Continue reading “On the meaning of hypothesis and p-value in statistical hypothesis testing”

Why hash tables should use a prime-number size

I read in several books and online pages that hash tables should use a prime number for the size. Nobody really justified this statement properly. Here’s my attempt!

I believe that it just has to do with the fact that computers work with in base 2. Just think at how the same thing works for base 10:

  • 8 % 10 = 8
  • 18 % 10 = 8
  • 87865378 % 10 = 8
  • 2387762348 % 10 = 8

It doesn’t matter what the number is: as long as it ends with 8, its modulo 10 will be 8. You could pick a huge power of 10 as modulo operator, such as 10^k (with k > 10, let’s say), but

  1. you would need a huge table to store the values
  2. the hash function is still pretty stupid: it just trims the number retaining only the first k digits starting from the right.

However, if you pick a different number as modulo operator, such as 12, then things are different:

  • 8 % 12 = 8
  • 18 % 12 = 6
  • 87865378 % 12 = 10
  • 2387762348 % 12 = 8

We still have a collision, but the pattern becomes more complicated, and the collision is just due to the fact that 12 is still a small number.

Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.

Continue reading “Why hash tables should use a prime-number size”

Metaphysics on geometric distribution in probability theory

I realized geometric distribution is not exactly about the time needed to get the first success in a given number of trials. This is a very odd feeling. It is probably a feeling applied mathematicians get sometimes, when they feel they are doing the best they can, and yet the theory is not perfect.

This may be a naive post, I warn you, but I was really stunned when I realized this.

Geometric distribution is not about the first success

Let’s jump to the point. We know (or at least, I was taught) that geometric distribution is used to calculate the probability that the first success in k trials (all independent and of probability p) will happen precisely at the k-th trial.

Remember that a geometric distribution is a random variable X such that its distribution is


How can we relate the above distribution with the fact that it matches the first success? Well, we need to have one success, which explains the p at the bottom. Moreover, we want to have just one success, so all other trials must be unsuccessful, which explains the (1-p)^{k-1}.

But hey, where would first ever be written? Continue reading “Metaphysics on geometric distribution in probability theory”

Random variables: what are they and why are they needed?

This article aims at providing some intuition for what random variables are and why random variables are useful and needed in probability theory.

Intuition for random variables

Informally speaking, random variables encode questions about the world in a numerical way.

How many heads can I get if I flip a coin 3 times?

How many people will vote the Democrats at the US presidential elections?

I want to make pizza. What is the possible overall cost of the ingredients, considering all combinations of different brands of them?

These are all examples of random variables. What a random variable does, in plain words, is to take a set of possible world configurations and group them to a number. What I mean when I say world configurations will be clearer soon, when talking about the sample space \Omega (which, appropriately, is also called universe).

Continue reading “Random variables: what are they and why are they needed?”

Relationship between reduced rings, radical ideals and nilpotent elements

This post aims at providing some intuition and meaning for the following algebra relationship:

Reduced ring – Radical ideal – Nilpotent

Reduced ring – Radical ideal – Nilpotent

A basic fact of ring theory is that if you take a ring A and quotient it for a (double-sided) radical ideal I you get a reduced ring. Let us suppose A is a commutative ring and understand why this fact is true.

Nilpotent element
Def. a \in A is nilpotent \Leftrightarrow \exists n \in \mathbb{N} : a^n = 0

Informally, a nilpotent element is like a road ending in the middle of nowhere, collapsing in the depth of an abyss. You are driving on it, following the powers of a, and then all of a sudden, with no explanation, your road ends in a big black hole. Indeed, the zero really acts as some kind of black hole, attracting nilpotent-made roads at some point or another: we can think of nilpotent roads as spiraling into the zero.

Continue reading “Relationship between reduced rings, radical ideals and nilpotent elements”