This is the third chapter of the in-progress book on linear algebra, “A birds eye view of linear algebra”. Stay tuned for future chapters.
Here, we’ll describe operations we can do with two matrices, while keeping in mind that they’re just representations of linear maps.
I) Why care about matrix multiplication?
Almost any information can be embedded in a vector space: images, video, language, speech, biometric information and whatever else you can imagine. And all the applications of machine learning and artificial intelligence (like the recent chat-bots, text to image, etc.) work on top of these vector embeddings. Since linear algebra is the science of dealing with high dimensional vector spaces, it’s an indispensable building block.

A lot of the techniques involve taking some input vectors from one space and mapping them to other vectors from some other space.
But why the focus on “linear” when most interesting functions are non-linear? It’s because the problem of making our models high dimensional and that of making them non-linear (general enough to capture all kinds of complex relationships) turn out to be orthogonal to each other. Many neural network architectures work by using linear layers with simple one dimensional non-linearities in between them. And there is a theorem (the universal approximation theorem) that says this kind of architecture can approximate any continuous function as closely as we like.
Since the way we manipulate high-dimensional vectors is primarily matrix multiplication, it isn’t a stretch to say it’s the bedrock of the modern AI revolution.

II) Algebra on maps
In chapter 2, we learnt how to quantify linear maps with determinants. Now, let’s do some algebra with them. We’ll need two linear maps and a basis.

II-A) Addition
If we can add matrices, we can add linear maps since matrices are the representations of linear maps. And matrix addition is not very interesting if you know scalar addition. Just as with vectors, it’s only defined if the two matrices are the same size (same number of rows and columns) and involves lining them up and adding element by element.

So, we’re just doing a bunch of scalar additions. Which means that the properties of scalar addition logically extend.
Commutative: if you switch, the result won’t twitch
A+B = B+A
But commuting to work might not be commutative since going from A to B might take longer than B to A.
Associative: in a chain, don’t refrain, take any 2 and continue
A+(B+C) = (A+B)+C
Identity: And here I am where I began! That’s no way to treat a man!
The presence of a special element that when added to anything results in the same thing. In the case of scalars, it is the number 0. In the case of matrices, it’s a matrix full of zeros.
A + 0 = A or 0 + A = A
Also, it’s possible to start at any element and end up at any other via addition. So it must be possible to start at A and end up at the additive identity, 0. The thing that must be added to A to achieve this is the additive inverse of A and it’s called -A.
A + (-A) = 0
For matrices, you just go to each scalar element in the matrix and replace it with its additive inverse (switching the signs if the scalars are numbers) to get the additive inverse of the matrix.
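The properties listed above can be checked on a small example. This is only a minimal sketch in plain Python; the helper names mat_add and mat_neg and the sample matrices are made up for illustration and are not from the book.

```python
def mat_add(A, B):
    """Add two matrices (lists of rows) element by element."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same number of rows and columns")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_neg(A):
    """Additive inverse: flip the sign of every entry."""
    return [[-a for a in row] for row in A]


# Hypothetical sample matrices, chosen only for illustration.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, -1], [2, 9]]
ZERO = [[0, 0], [0, 0]]

print(mat_add(A, B) == mat_add(B, A))                          # commutative
print(mat_add(A, mat_add(B, C)) == mat_add(mat_add(A, B), C))  # associative
print(mat_add(A, ZERO) == A)                                   # additive identity
print(mat_add(A, mat_neg(A)) == ZERO)                          # additive inverse
```

Each line prints True for these inputs, which is a spot check on one example and not a proof.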
II-B) Subtraction
Subtraction is just addition with the additive inverse of the second matrix instead.
A-B = A+(-B)
II-C) Multiplication
We could have defined matrix multiplication just as we defined matrix addition. Just take two matrices that are the same size (rows and columns) and then multiply the scalars element by element. There is a name for that kind of operation, the Hadamard product.
But no, we defined matrix multiplication as a far more convoluted operation, more “exotic” than addition. And it isn’t complex just for the sake of it. It’s the most important operation in linear algebra by far.
It enjoys this special status because it’s the means by which linear maps are applied to vectors, building on top of dot products.
The way it works requires a dedicated section, so we’ll cover that in section III. Here, let’s list some of its properties.
Commutative
Unlike addition, matrix multiplication is not always commutative. Which means that the order in which you apply linear maps to your input vector matters. In general,
A.B != B.A
Associative
It is still associative
A.B.C = A.(B.C) = (A.B).C
And there is a lot of depth to this property, as we’ll see in section IV.
Identity
Just like addition, matrix multiplication also has an identity element, I: a matrix that leaves any matrix it is multiplied with unchanged. The big caveat is that a single such element (one that works on both sides) only exists for square matrices and is itself square.
Now, because of the importance of matrix multiplication, “the identity matrix” in general is defined as the identity element of matrix multiplication (not that of addition or the Hadamard product, for example).
The identity element for addition is a matrix composed of 0’s and that of the Hadamard product is a matrix composed of 1’s. The identity element of matrix multiplication is:
I = [[1, 0, …, 0],
     [0, 1, …, 0],
     …,
     [0, 0, …, 1]]
So, 1’s on the main diagonal and 0’s everywhere else. What kind of definition for matrix multiplication would lead to an identity element like this? We’ll need to describe how it works to see, but first let’s go to the final operation.
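To make the properties in this section concrete, here is a small sketch comparing the usual matrix product with the Hadamard product. It borrows the standard rows-times-columns rule that section III goes on to motivate; the function names and the 2x2 matrices are hypothetical choices for illustration.

```python
def matmul(A, B):
    """Standard matrix product: each row of A dotted with each column of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def hadamard(A, B):
    """Element-by-element product of two same-sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Hypothetical sample matrices.
A = [[1, 1], [0, 1]]
B = [[1, 0], [1, 1]]
I2 = [[1, 0], [0, 1]]    # identity for the standard product
ONES = [[1, 1], [1, 1]]  # identity for the Hadamard product

print(matmul(A, B))                        # [[2, 1], [1, 1]]
print(matmul(B, A))                        # [[1, 1], [1, 2]]
print(hadamard(A, B) == hadamard(B, A))    # the Hadamard product does commute
print(matmul(A, I2) == A == matmul(I2, A))
print(hadamard(A, ONES) == A)
```

The two products A.B and B.A differ here, while the two identity checks both come out True.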
II-D) Division
Just as with addition, the presence of an identity matrix suggests that a matrix, A can be multiplied with another matrix, A^-1 and taken to the identity (at least when A is square and its determinant is non-zero). This is called the inverse. Since matrix multiplication isn’t commutative, there are two ways to do this. Luckily, both lead to the identity matrix.
A.(A^-1) = (A^-1).A = I
So, “dividing” a matrix by another is simply multiplication with the second one’s inverse, A.B^-1. If matrix multiplication is very important, then this operation is as well since it’s the inverse. It is also related to how we historically developed (or maybe stumbled upon) linear algebra. But more on that in the next chapter (4).
Another property we’ll be using that is a combined property of addition and multiplication is the distributive property. It applies to all kinds of matrix multiplication from the traditional one to the Hadamard product:
A.(B+C) = A.B + A.C
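As a quick sanity check of the distributive property for both kinds of product, here is a sketch on one example. The helper names and matrices are assumptions for illustration, and the standard product uses the rule explained in section III.

```python
def mat_add(A, B):
    """Element-by-element sum."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def matmul(A, B):
    """Standard matrix product (rows of A dotted with columns of B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def hadamard(A, B):
    """Element-by-element product."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Hypothetical sample matrices.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]

for product in (matmul, hadamard):
    left = product(A, mat_add(B, C))               # A.(B+C)
    right = mat_add(product(A, B), product(A, C))  # A.B + A.C
    print(product.__name__, left == right)
```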
III) Why is matrix multiplication defined this way?
We have arrived at last at the section where we’ll answer the question in the title, the meat of this chapter.
Matrix multiplication is the way linear maps act on vectors. So, we get to motivate it that way.
III-A) How are linear maps applied in practice?
Consider a linear map that takes m dimensional vectors (from R^m) as input and maps them to n dimensional vectors (in R^n). Let’s call the m dimensional input vector, v.
At this point, it might be helpful to imagine yourself actually coding up this linear map in some programming language. It should be a function that takes the m-dimensional vector, v as input and returns the n dimensional vector, u.
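The author’s original listing is not reproduced in this text, so what follows is only a guess at what such a first attempt might look like: a function with the right input and output shapes that simply produces a random output. The name naive_map and passing n as an argument are assumptions made for this sketch.

```python
import random


def naive_map(v, n):
    """Take an m-dimensional vector v and return some n-dimensional vector.

    Note that v is never used: the output is just random numbers.
    """
    return [random.random() for _ in range(n)]


u = naive_map([1.0, 2.0, 3.0], n=2)
print(len(u))  # 2
```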
The linear map has to take this vector and turn it into an n dimensional vector somehow. A first attempt might just generate some vector at random. But this completely ignores the input vector, v. That’s unreasonable, v should have some say. Now, v is just an ordered list of m scalars v = [v1, v2, v3, …, vm]. What do scalars do? They scale vectors. And the output vector we need should be n dimensional. How about we take some (fixed) m vectors (pulled out of thin air, each n dimensional), w1, w2, …, wm. Then, scale w1 by v1, w2 by v2 and so on and add them all up. This leads to an equation for our linear map (with the output on the left).
u = f(v) = v1.w1 + v2.w2 + … + vm.wm    (1)
Make note of the equation (1) above since we’ll be using it again.
Since the w1, w2,… are all n dimensional, so is u. And all the elements of v=[v1, v2, …, vm] have an influence on the output, u. To implement the idea in equation (1), we can take some randomly generated vectors for the w’s but with fixed seeds (ensuring that the vectors are the same across every call of the function).
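The original listing is missing from this text, so here is a minimal sketch of the idea, assuming the standard-library random module with a fixed seed. The names make_w and linear_map, the seed value and the uniform range are all choices made for this illustration.

```python
import random


def make_w(m, n, seed=0):
    """Return m fixed vectors, each n-dimensional (same values on every call)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(m)]


def linear_map(v, n, seed=0):
    """Equation (1): u = v1.w1 + v2.w2 + ... + vm.wm."""
    w = make_w(len(v), n, seed)
    u = [0.0] * n
    for v_i, w_i in zip(v, w):  # scale w_i by v_i ...
        for j in range(n):
            u[j] += v_i * w_i[j]  # ... and accumulate into the output
    return u


v = [1.0, 2.0, 3.0]
print(linear_map(v, n=2) == linear_map(v, n=2))  # same w's on every call
```

Because the seed is fixed, calling the function twice with the same input returns the same output, so it behaves like one fixed map and not a fresh random one each time.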
We now have a way to “map” m dimensional vectors (v) to n dimensional vectors (u). But does this “map” satisfy the properties of a linear map? Recall from chapter 1, section II the properties of a linear map, f (here, a and b are vectors and c is a scalar):
f(a+b) = f(a) + f(b)
f(c.a) = c.f(a)
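It’s clear that the map specified by equation (1) satisfies the above two properties of a linear map. For the first, f(a+b) = (a1+b1).w1 + … + (am+bm).wm = (a1.w1 + … + am.wm) + (b1.w1 + … + bm.wm) = f(a) + f(b). For the second, f(c.a) = (c.a1).w1 + … + (c.am).wm = c.(a1.w1 + … + am.wm) = c.f(a).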
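The two properties can also be spot-checked numerically. This sketch uses small integer w vectors (picked arbitrarily for illustration, not taken from the book) so that the comparisons are exact and free of floating-point rounding.

```python
# Hypothetical fixed vectors: m = 2 of them, each n = 3 dimensional.
W = [[1, 0, 2], [-1, 3, 1]]


def f(v):
    """Equation (1): scale each w_i by v_i and add the results."""
    return [sum(v_i * w_i[j] for v_i, w_i in zip(v, W)) for j in range(len(W[0]))]


def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]


def scale(c, a):
    return [c * x for x in a]


a, b, c = [2, 5], [-3, 4], 7
print(f(vec_add(a, b)) == vec_add(f(a), f(b)))  # f(a+b) = f(a) + f(b)
print(f(scale(c, a)) == scale(c, f(a)))         # f(c.a) = c.f(a)
```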


The m vectors, w1, w2, …, wm are arbitrary and no matter what we choose for them, the function, f defined in equation (1) is a linear map. So, different choices for those w vectors result in different linear maps. Moreover, for any linear map you can imagine, there will be some vectors w1, w2,… that can be used in conjunction with equation (1) to represent it.
Now, for a given linear map, we can collect the vectors w1, w2,… into the columns of a matrix. Such a matrix will have n rows and m columns. This matrix represents the linear map, f and its multiplication with an input vector, v represents the application of the linear map, f to v. And this application is where the definition of matrix multiplication comes from.

We can now see why the identity element for matrix multiplication is the way it is: if the i-th column has a 1 in position i and 0’s everywhere else, then scaling the columns by v1, v2, … and adding them up just gives back v.

We start with a column vector, v and end with a column vector, u (so just one column for each of them). And since the elements of v must align with the column vectors of the matrix representing the linear map, the number of columns of the matrix must equal the number of elements in v. More on this in section III-C.
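A minimal sketch of matrix-vector multiplication written exactly in this “scale the columns and add” style, together with a check that the identity matrix returns its input. The function name mat_vec and the example values are hypothetical.

```python
def mat_vec(M, v):
    """Multiply matrix M (a list of rows) with vector v by combining M's columns."""
    n_rows, n_cols = len(M), len(M[0])
    if n_cols != len(v):
        raise ValueError("number of columns of M must equal the number of elements in v")
    u = [0] * n_rows
    for j in range(n_cols):      # column j of M plays the role of w_j
        for i in range(n_rows):
            u[i] += v[j] * M[i][j]
    return u


# A 3x2 matrix whose columns are w1 = [1, 0, 2] and w2 = [-1, 3, 1].
M = [[1, -1],
     [0, 3],
     [2, 1]]
v = [2, 5]
print(mat_vec(M, v))  # [-3, 15, 9], i.e. 2.w1 + 5.w2

I2 = [[1, 0], [0, 1]]
print(mat_vec(I2, v) == v)  # the identity matrix leaves v unchanged
```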
III-B) Matrix multiplication as a composition of linear maps
Now that we have described how a matrix is multiplied with a vector, we can move on to multiplying a matrix with another matrix.
The definition of matrix multiplication is much more natural when we consider the matrices as representations of linear maps.
Linear maps are functions that take a vector as input and produce a vector as output. Let’s say the linear maps corresponding to two matrices are f and g. How would you think of adding these maps (f+g)?
(f+g)(v) = f(v)+g(v)
This is reminiscent of the distributive property of addition where the argument goes inside the bracket to both the functions and we add the results. And if we fix a basis, this corresponds to applying both linear maps to the input vector and adding the results. By the distributive property of matrix and vector multiplication, this is the same as adding the matrices corresponding to the linear maps and applying the result to the vector.
Now, let’s think of multiplication (f.g).
(f.g)(v) = f(g(v))
Since linear maps are functions, the most natural interpretation of multiplication is to compose them (apply them one at a time, in sequence to the input vector).
When two matrices are multiplied, the resulting matrix represents the composition of the corresponding linear maps. Consider matrices A and B; the product AB embodies the transformation achieved by applying the linear map represented by B to the input vector first and then applying the linear map represented by A.
So we have a linear map corresponding to the matrix, A and a linear map corresponding to the matrix, B. We’d like to know the matrix, C corresponding to the composition of the two linear maps. So, applying B to any vector first and then applying A to the result should be equivalent to just applying C.
A.(B.v) = C.v = (A.B).v
In the last section, we learnt how to multiply a matrix and a vector. Let’s do that twice for A.(B.v). Say the columns of B are the column vectors, b1, b2, …, bm. From equation (1) in the previous section,
A.(B.v) = A.(v1.b1 + v2.b2 + … + vm.bm) = v1.(A.b1) + v2.(A.b2) + … + vm.(A.bm)
And what if we applied the linear map corresponding to C=A.B directly to the vector, v? The column vectors of the matrix C are c1, c2, …, cm.
C.v = v1.c1 + v2.c2 + … + vm.cm
Comparing the two equations above we get,
c1 = A.b1, c2 = A.b2, …, cm = A.bm
So, the columns of the product matrix, C=AB are obtained by applying the linear map corresponding to matrix A to each of the columns of the matrix B. And collecting those resulting vectors into a matrix gives us C.
We have just extended our matrix-vector multiplication result from the previous section to the multiplication of two matrices. We just break the second matrix into a collection of vectors, multiply the first matrix with each of them and collect the resulting vectors into the columns of the result matrix.

So the entry in the first row and first column of the result matrix, C is the dot product of the first row of A and the first column of B. And in general the entry in the i-th row and j-th column of C is the dot product of the i-th row of A and the j-th column of B. This is the definition of matrix multiplication most of us first learn.
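The two descriptions (apply A to each column of B, versus take dot products of rows of A with columns of B) should give the same matrix. Here is a small sketch that computes the product both ways; the function names and the example matrices are hypothetical.

```python
def mat_vec(A, v):
    """Matrix-vector product: one dot product per row of A."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def matmul_by_columns(A, B):
    """Apply A to each column of B, then collect the results as columns of C."""
    result_columns = [mat_vec(A, list(col)) for col in zip(*B)]
    return [list(row) for row in zip(*result_columns)]  # transpose back to rows


def matmul_by_dots(A, B):
    """Entry (i, j) of C is the dot product of row i of A and column j of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


A = [[1, 2], [3, 4], [5, 6]]   # 3x2
B = [[1, 0, 2], [-1, 3, 1]]    # 2x3
print(matmul_by_columns(A, B))  # [[-1, 6, 4], [-1, 12, 10], [-1, 18, 16]]
print(matmul_by_columns(A, B) == matmul_by_dots(A, B))
```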

Associative proof
We can also show that matrix multiplication is associative now. Instead of the single vector, v, let’s apply the product C=AB individually to a group of vectors, w1, w2, …, wl. Let’s say the matrix that has these as column vectors is W. We can use the exact same trick as above to show:
(A.B).W = A.(B.W)
It’s because (A.B).w1 = A.(B.w1) and the same holds for all the other w vectors.
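A numerical spot check of this argument, as a sketch: integer matrices are used so the comparison is exact, and the names and values are made up for illustration.

```python
def matmul(A, B):
    """Standard matrix product (rows of A dotted with columns of B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2], [3, 4]]            # 2x2
B = [[1, 0, 2], [-1, 3, 1]]     # 2x3
W = [[1, 0], [2, 1], [0, -1]]   # 3x2, columns play the role of w1, w2

print(matmul(matmul(A, B), W) == matmul(A, matmul(B, W)))  # (A.B).W = A.(B.W)

# The same statement, one column of W at a time.
for col in zip(*W):
    w = [[x] for x in col]  # a single column as a 3x1 matrix
    print(matmul(matmul(A, B), w) == matmul(A, matmul(B, w)))
```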
Sum of outer products
Say we’re multiplying two matrices A and B:

Equation (3) can be generalized to show that the i,j element of the resulting matrix, C is:
c_{i,j} = a_{i,1}.b_{1,j} + a_{i,2}.b_{2,j} + … + a_{i,k}.b_{k,j}    (4)
We have a sum over k terms. What if we took each of those terms and created k individual matrices out of them? For example, the first matrix will have as its i,j-th entry: a_{i,1}.b_{1,j}. The k matrices and their relationship to C:
C = C_1 + C_2 + … + C_k, where the i,j-th entry of C_t is a_{i,t}.b_{t,j}
This process of summing over k matrices can be visualized as follows (reminiscent of the animation in section III-A that visualized a matrix multiplied with a vector):

We see here the sum over k matrices, all of the same size (n x m), which is the same size as the result matrix, C. Notice in equation (4) how within each term the column index of the first matrix, A, stays the same while for the second matrix, B, the row index stays the same. So the k matrices we’re getting are the matrix products of the i-th column of A and the i-th row of B.
Matrix multiplication as a sum of outer products. Image by author.
Inside the summation, two vectors are multiplied to produce matrices. It’s a special case of matrix multiplication when applied to vectors (special cases of matrices) and is called the “outer product”. Here is yet another animation to show this sum of outer products process:

This tells us why the number of row vectors in B should be the same as the number of column vectors in A: they have to be paired up to get the individual matrices.
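The sum-of-outer-products view can be sketched in a few lines. The helper names and the example matrices are assumptions made for illustration; the final comparison is against the usual row-times-column definition.

```python
def outer(col, row):
    """Outer product: a column vector times a row vector gives a matrix."""
    return [[c * r for r in row] for c in col]


def matmul_outer(A, B):
    """Build A.B as the sum of (t-th column of A) outer (t-th row of B)."""
    n, m = len(A), len(B[0])
    C = [[0] * m for _ in range(n)]
    for t in range(len(B)):                 # one term per column of A / row of B
        col_t = [A[i][t] for i in range(n)]
        term = outer(col_t, B[t])
        C = [[c + x for c, x in zip(rc, rx)] for rc, rx in zip(C, term)]
    return C


def matmul(A, B):
    """Usual definition, for comparison."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2], [3, 4], [5, 6]]   # 3x2
B = [[1, 0, 2], [-1, 3, 1]]    # 2x3
print(matmul_outer(A, B))      # [[-1, 6, 4], [-1, 12, 10], [-1, 18, 16]]
print(matmul_outer(A, B) == matmul(A, B))
```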
We’ve seen a lot of visualizations and some math, now let’s see the same thing via code for the special case where A and B are square matrices. This is based on section 4.2 of the book “Introduction to Algorithms”, [2].
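The original listing is not included in this text. As a stand-in, here is a sketch of the straightforward triple-loop procedure for square matrices that the cited section opens with; whether the author’s code looked like this (and not, say, a recursive variant) is an assumption.

```python
def square_matrix_multiply(A, B):
    """Multiply two n x n matrices with three nested loops."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):          # row of the result
        for j in range(n):      # column of the result
            for k in range(n):  # walk along row i of A and column j of B
                C[i][j] += A[i][k] * B[k][j]
    return C


print(square_matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

The three nested loops make the cost grow as n cubed, which is easy to read off from the code.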
III-C) Matrix multiplication: the structural choices

Matrix multiplication seems to be structured in a weird way. It’s clear that we need to take a bunch of dot products. So, one of the dimensions has to match. But why make the number of columns of the first matrix be equal to the number of rows of the second?
Won’t it make things more straightforward if we redefine it in a way that the number of rows of the two matrices should be the same (or the number of columns)? This would make it much easier to identify when two matrices can be multiplied.
The traditional definition, where each row of the first matrix is lined up against each column of the second (so the number of columns of the first must equal the number of rows of the second), has more than one advantage. Let’s go first to matrix-vector multiplication. Animation (1) in section III-A showed us how the traditional version works. Let’s visualize what it would look like if we required the rows of the matrix to align with the number of elements in the vector instead. Now, the n rows of the matrix will need to align with the n elements of the vector.

We see that we’d have to start with a column vector, v with n rows and one column and end up with a row vector, u with 1 row and m columns. This is awkward and makes defining an identity element for matrix multiplication challenging since the input and output vectors can never have the same shape. With the traditional definition, this isn’t an issue since the input is a column vector and the output is also a column vector (see animation (1)).
Another consideration is multiplying a chain of matrices. With the traditional method, it’s easy to see, first of all, that the chain of matrices below can be multiplied together based on their dimensions.

Further, we can tell that the output matrix will have l rows and p columns.
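With the traditional definition, this bookkeeping is mechanical enough to write down as a short function. This is a sketch only; the function name chain_shape and the concrete sizes (standing in for l, m, n, p) are hypothetical.

```python
def chain_shape(shapes):
    """Given (rows, columns) pairs for a chain of matrices, return the shape of the product.

    Raises ValueError if some neighbouring pair cannot be multiplied.
    """
    rows, cols = shapes[0]
    for next_rows, next_cols in shapes[1:]:
        if cols != next_rows:  # inner dimensions must agree
            raise ValueError(f"cannot multiply: {cols} columns vs {next_rows} rows")
        cols = next_cols       # the inner dimension "cancels out"
    return rows, cols


# Shapes (l x m), (m x n), (n x p) with l=2, m=3, n=4, p=5.
print(chain_shape([(2, 3), (3, 4), (4, 5)]))  # (2, 5)
```

Only neighbouring inner dimensions need to match, and the result keeps the first row count and the last column count.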
In the framework where the rows of the two matrices should line up, this quickly becomes a mess. For the first two matrices, we can tell that the rows should align and that the result will have n rows and l columns. But visualizing how many rows and columns the result will have and then reasoning about whether it’ll be compatible with C, etc. becomes a nightmare.

And that is why we line up the rows of the first matrix against the columns of the second matrix. But maybe I missed something. Maybe there is an alternate definition that is “cleaner” and manages to side-step these two challenges. Would love to hear ideas in the comments 🙂
III-D) Matrix multiplication as a change of basis
So far, we’ve thought of matrix multiplication with vectors as a linear map that takes a vector as input and returns some other vector as output. But there is another way to think about matrix multiplication: as a way to change perspective.
Let’s consider two-dimensional space, R². We represent any vector in this space with two numbers. What do those numbers represent? The coordinates along the x-axis and y-axis. A unit vector that points just along the x-axis is [1,0] and one that points along the y-axis is [0,1]. These are our basis for the space. Every vector now has an address. For example, the vector [2,3] means we scale the first basis vector by 2 and the second by 3 and add the two.
But this isn’t the only basis for the space. Someone else (say, he who shall not be named) might want to use two other vectors as their basis. For example, the vectors e1=[3,2] and e2=[1,1]. Any vector in the space R² can also be expressed in their basis. The same vector would have different representations in our basis and their basis. Like different addresses for the same house (perhaps based on different postal systems).
When we’re in the basis of he who shall not be named, the vector e1 = [1,0] and the vector e2 = [0,1] (which are the basis vectors from his perspective by definition of basis vectors). And the functions that translate vectors from our basis system to that of he who shall not be named and vice-versa are linear maps. And so the translations can be represented as matrix multiplications. Let’s call the matrix that takes vectors from our basis to that of he who shall not be named, M1 and the matrix that does the opposite, M2. How do we find the matrices for these maps?

We know that the vectors we call e1=[3,2] and e2=[1,1], he who shall not be named calls e1=[1,0] and e2=[0,1]. Let’s collect our version of the vectors into the columns of a matrix.
[[3, 1],
 [2, 1]]
And also collect the vectors, e1 and e2 of he who shall not be named into the columns of another matrix. This is just the identity matrix.
[[1, 0],
 [0, 1]]
Since matrix multiplication operates independently on the columns of the second matrix,
M1.[[3, 1], [2, 1]] = [[1, 0], [0, 1]]
Multiplying both sides on the right by the inverse of [[3, 1], [2, 1]] gives us M1:
M1 = [[3, 1], [2, 1]]^-1 = [[1, -1], [-2, 3]]
Doing the same thing in reverse gives us M2:
M2.[[1, 0], [0, 1]] = [[3, 1], [2, 1]], so M2 = [[3, 1], [2, 1]]
This can all be generalized into the following statement: A matrix with column vectors w1, w2, …, wn translates vectors expressed in a basis where w1, w2, …, wn are the basis vectors to our basis.
And the inverse of that matrix translates vectors from our basis to the one where w1, w2, …, wn are the basis.
All invertible square matrices can hence be thought of as “basis changers”.
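The example with e1=[3,2] and e2=[1,1] can be worked through in code. This is a sketch using exact fractions from the standard library; the helper names inverse_2x2 and mat_vec are made up for this illustration.

```python
from fractions import Fraction


def inverse_2x2(M):
    """Inverse of a 2x2 matrix via the determinant formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]


def mat_vec(M, v):
    """Matrix-vector product: one dot product per row of M."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


M2 = [[3, 1], [2, 1]]   # columns are e1 and e2 written in our basis
M1 = inverse_2x2(M2)    # translates from our basis to theirs

print(M1 == [[1, -1], [-2, 3]])
print(mat_vec(M1, [3, 2]) == [1, 0])   # our [3,2] is their first basis vector
print(mat_vec(M1, [1, 1]) == [0, 1])   # our [1,1] is their second basis vector
print(mat_vec(M2, [2, 3]) == [9, 7])   # their [2,3] is 2.e1 + 3.e2 = [9,7] for us
print(mat_vec(M1, [9, 7]) == [2, 3])   # and translating back recovers [2,3]
```

Each check prints True, so the two matrices really do undo each other on these vectors.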
Note: In the special case of an orthonormal matrix (where every column is a unit vector and orthogonal to every other column), the inverse becomes the same as the transpose. So, changing to the basis of the columns of such a matrix becomes equivalent to taking the dot product of a vector with each of the columns (which are the rows of the transpose).
For extra on this, see the 3B1B video, [1].
Conclusion
Matrix multiplication is arguably one of the most important operations in modern computing and in almost any data science field. Understanding deeply how it works is important for any data scientist. Most linear algebra textbooks describe the “what” but not why it’s structured the way it is. Hopefully this blog filled that gap.
[1] 3B1B video on change of basis: https://www.youtube.com/watch?v=P2LTAUO1TdA&t=2s
[2] Introduction to Algorithms by Cormen et al. Third edition
[3] Matrix multiplication as sum of outer products: https://math.stackexchange.com/questions/2335457/matrix-at-a-as-sum-of-outer-products
[4] Catalan numbers Wikipedia article: https://en.wikipedia.org/wiki/Catalan_number