The Channel-Wise Attention | Squeeze and Excitation


When we talk about attention in computer vision, the first thing that probably comes to mind is the one used in the Vision Transformer (ViT) architecture. In fact, that is not the only attention mechanism we have for image data. There is another one called the Squeeze and Excitation Network (SENet). Whereas the attention in ViT operates spatially, i.e., it assigns weights to different patches of an image, the attention mechanism proposed in SENet operates channel-wise, i.e., it assigns weights to different channels. In this article, we are going to discuss how the Squeeze and Excitation architecture works, how to implement it from scratch, and how to integrate it into the ResNeXt model.

The Squeeze and Excitation Module

SENet, which was first proposed in a paper titled “Squeeze-and-Excitation Networks” by Hu et al. [1], is not a standalone network like VGG, Inception, or ResNet. Instead, it is a building block to be placed on an existing network. In CNN-based models, we assume that pixels spatially close to each other are highly correlated, which is why we employ small kernels to capture these correlations. This assumption is basically the inductive bias of CNNs. SENet, on the other hand, introduces a new inductive bias: the authors assume that each channel contributes differently to predicting a particular class. By applying SE modules to a CNN, the model not only relies on spatial patterns but also captures the importance of each channel. To better illustrate this, we can think of an image of fire, where the red channel would theoretically contribute more to the final prediction than the blue and green channels.

The structure of the SE module itself is shown in Figure 1. As the name of the network suggests, there are two main steps in this module: squeeze and excitation. The squeeze part corresponds to the operation denoted as F_sq, whereas the excitation part consists of both F_ex and F_scale. The F_tr operation, on the other hand, is not part of the SE module. Rather, it represents a transformation function that originally belongs to the model to which the SE module is applied. For example, if we were to place this SE module on ResNet, F_tr would refer to the stack of convolution layers inside the bottleneck block.

Figure 1. The structure of the Squeeze and Excitation module [1].

Talking more specifically about the F_sq operation, it essentially uses global average pooling to capture information from the entire spatial dimension of each channel. By doing so, every channel of the input tensor is represented by a single number, which is simply the average value of that channel. The authors refer to this operation as global information embedding. Mathematically, this can be written as the equation shown in Figure 2, where we sum all values across the height H and width W and then divide the result by the number of pixels in that channel (H×W).

Figure 2. The mathematical expression of the global average pooling mechanism in the SE module [1].
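
To make the squeeze step concrete, here is a minimal sketch (not taken from the paper) that computes the per-channel average by hand and compares it with PyTorch's built-in adaptive average pooling. The tensor shape and the variable names u, z_manual, and z_pooled are my own choices for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature map: batch of 2, 4 channels, 3x5 spatial grid.
u = torch.randn(2, 4, 3, 5)
H, W = u.shape[2], u.shape[3]

# Squeeze written out by hand: sum over height and width, divide by H*W.
z_manual = u.sum(dim=(2, 3)) / (H * W)

# The same operation using adaptive average pooling to a 1x1 output,
# followed by dropping the two singleton spatial axes.
z_pooled = F.adaptive_avg_pool2d(u, output_size=1).flatten(1)

print(z_manual.shape)                                   # one number per channel
print(torch.allclose(z_manual, z_pooled, atol=1e-6))    # both routes agree
```

Both routes should give one value per channel for every image in the batch, which is exactly the tensor z that the excitation step consumes.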

Meanwhile, the excitation and scaling operations together are referred to as adaptive recalibration, since what they essentially do is dynamically adjust the weighting of each channel in the input tensor according to its importance. Note that the diagram in Figure 1 does not completely depict the entire SENet architecture. In the figure, F_ex appears to be a single operation, yet it actually consists of two linear layers, each followed by an activation function. See Figure 3 below for the details.

Figure 3. The mathematical formulation of the F_ex operation [1].

The two linear layers are denoted as W_1 and W_2, while δ and σ represent the ReLU and sigmoid activation functions, respectively. So, based on this definition, what we need to do later in the implementation is pass tensor z (the average-pooled tensor) through the first linear layer, followed by the ReLU activation function, the second linear layer, and finally the sigmoid activation function. Remember that the sigmoid function maps input values into the range of 0 to 1. In this case, we interpret the resulting output as the weight of each channel: a value close to 1 indicates that the corresponding channel contains important information, so we allow the model to pay more attention to that channel. Conversely, a value close to 0 indicates that the corresponding channel does not contribute much to the output.
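
The excitation step described above can be sketched with plain matrix multiplications. This is only an illustration: the sizes C and r, the random weights W1 and W2, and the 0.1 scaling factor are assumptions of mine, not values from the paper.

```python
import torch

torch.manual_seed(0)

C, r = 8, 4                          # illustrative channel count and reduction ratio
z = torch.randn(1, C)                # squeezed descriptor: one value per channel
W1 = 0.1 * torch.randn(C // r, C)    # first linear layer, reduces C -> C/r
W2 = 0.1 * torch.randn(C, C // r)    # second linear layer, expands C/r -> C

hidden = torch.relu(z @ W1.T)        # delta(W_1 z)
s = torch.sigmoid(hidden @ W2.T)     # sigma(W_2 delta(W_1 z))

print(hidden.shape)                  # reduced to C/r values
print(s.shape)                       # back to one weight per channel
print(s.min().item(), s.max().item())
```

Whatever the weights are, the sigmoid at the end keeps every entry of s between 0 and 1, which is what lets us read s as a set of channel weights.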

To make use of these channel weights, we perform the F_scale operation, which is simply a multiplication of the original tensor u and the weight tensor s, as shown in Figure 4 below. By doing this, we essentially retain the values in the important channels while suppressing the values in the unimportant ones.

Figure 4. The scaling process is just a multiplication of the original and the weight tensors [1].
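
As a small worked example of the scaling step, the sketch below multiplies a toy tensor by hand-picked channel weights. The values 1.0, 0.5, and 0.0 are made up for illustration so that the effect on each channel is easy to see.

```python
import torch

u = torch.ones(1, 3, 2, 2)               # three channels, all filled with ones
s = torch.tensor([[1.0, 0.5, 0.0]])      # hypothetical channel weights, shape (1, 3)

# Add two singleton axes so the weights broadcast over height and width.
x_tilde = u * s[:, :, None, None]

print(x_tilde[0, 0])    # channel kept as is
print(x_tilde[0, 1])    # channel halved
print(x_tilde[0, 2])    # channel suppressed entirely
```

Every pixel in a channel is multiplied by the same number, so the spatial pattern inside a channel is untouched and only its overall strength changes.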

By the way, sorry for getting a bit too mathy here, lol. But I believe this will help you understand the code later in the implementation section.

Where to Put the SE Module

Applying the SE module to a plain CNN model like VGG is easy, as we can simply place it right after each convolution layer. However, it is less straightforward in the case of Inception or ResNet due to the presence of parallel branches in these two networks. To address this, the authors provide a guide for implementing the SE module on these two models specifically, as shown in Figure 5 below.

Figure 5. Where the SE module is placed in Inception and ResNet [1].

For the Inception model, instead of placing an SE module right after each convolution layer, we pass the input tensor through the entire Inception block (including all the branches inside) and then attach the SE module afterwards. The same approach also works for ResNet, but keep in mind that the summation between the tensor in the skip connection and the main flow happens after the main tensor has been processed by the SE module.

As I mentioned earlier, the excitation stage essentially consists of two linear layers. If we take a closer look at the above structure, we can see that the output shape of the first linear layer is 1×1×C/r. The variable r is called the reduction ratio; it reduces the dimensionality of the weight tensor before the second linear layer projects it back to 1×1×C. The dimensionality reduction done by the first layer acts as a bottleneck, which is useful for limiting model complexity and improving generalization. The authors conducted experiments with different r values and found that r = 16 produces the best balance between accuracy and complexity.
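
To get a feel for how the reduction ratio controls model complexity, here is a rough sketch that counts the weights added by bias-free SE modules. Each module with C channels adds two weight matrices of C × C/r entries. The helper name se_extra_params is hypothetical, and the stage widths and block counts are those of a ResNet-50-style network.

```python
def se_extra_params(channels_per_stage, blocks_per_stage, r):
    """Count weights added by bias-free SE modules: 2 * C * (C // r) per block."""
    return sum(n * 2 * c * (c // r)
               for c, n in zip(channels_per_stage, blocks_per_stage))

channels = [256, 512, 1024, 2048]   # output channels of the four residual stages
blocks = [3, 4, 6, 3]               # number of blocks in each stage

# A single SE module on a 512-channel tensor with r = 16.
single = se_extra_params([512], [1], r=16)
print(single)

# Total overhead for the whole network at several reduction ratios.
totals = {r: se_extra_params(channels, blocks, r) for r in (2, 4, 8, 16, 32)}
for r, total in totals.items():
    print(r, total)
```

With r = 16 the total comes out to roughly 2.5 million extra weights, and halving r doubles that overhead, which is the trade-off the reduction ratio controls.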

Figure 6. Several possible ways to attach the SE module in ResNet [1].

Besides the standard way of implementing the SE module in ResNet, Figure 6 shows that there are actually several approaches we can follow. According to the experimental results in Figure 7, the standard SE, SE-PRE, and SE-Identity blocks obtained similar results, while all of them outperformed SE-POST by a significant margin. This suggests that the placement of the SE module affects model accuracy. Based on these findings, the authors argue that we will obtain good results as long as we apply the SE module before the element-wise summation. Later in the coding section, I am going to demonstrate how to implement the standard SE block.

Figure 7. Experimental results on different SE module integration strategies [1].
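
The four integration strategies can be sketched as four small functions. This reflects my reading of the variants in Figure 6, so treat it as an approximation. The residual_unit and toy_gate stand-ins are hypothetical placeholders, not the real bottleneck block or a trained SE module.

```python
import torch
import torch.nn as nn

def standard_se(x, residual_unit, se):
    # SE on the residual branch, right before the summation.
    return torch.relu(x + se(residual_unit(x)))

def se_pre(x, residual_unit, se):
    # SE applied before the residual unit.
    return torch.relu(x + residual_unit(se(x)))

def se_post(x, residual_unit, se):
    # SE applied after the summation and ReLU.
    return se(torch.relu(x + residual_unit(x)))

def se_identity(x, residual_unit, se):
    # SE applied on the identity (skip) connection.
    return torch.relu(se(x) + residual_unit(x))

# Stand-ins for illustration only.
residual_unit = nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False)

def toy_gate(t):
    # Parameter-free channel gate: sigmoid of each channel's mean.
    return t * torch.sigmoid(t.mean(dim=(2, 3), keepdim=True))

x = torch.randn(1, 8, 6, 6)
outputs = [f(x, residual_unit, toy_gate)
           for f in (standard_se, se_pre, se_post, se_identity)]
for out in outputs:
    print(out.shape)
```

All four variants keep the tensor shape unchanged. They differ only in where the channel gate sits relative to the residual unit and the summation.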

More Experimental Results

There are actually many more experimental results discussed in the paper. One of them is a table showing the accuracy improvements when the SE module is applied to existing CNN-based models. The table I am referring to is displayed in Figure 8 below.

Figure 8. Experimental results on applying the SE module to different models [1][2].

The columns highlighted in blue represent the error rates of each model, and the ones in pink indicate the computational complexity measured in GFLOPs. The re-implementation column refers to the plain model that the authors implemented themselves, while the SENet column represents the same model equipped with SE modules. The table clearly shows that both top-1 and top-5 errors decrease when the SE module is applied. It is worth noting that although adding the SE module increases the GFLOPs, this increase is marginal compared to the reduction in error rate.

Next, we can reveal interesting insights by printing out the values produced by the SE modules during the inference phase. Let’s take a look at the charts in Figure 9 below to better illustrate this. The x axis of these charts denotes the channel index, the y axis represents how much weight each channel receives according to its importance, and the color of the lines indicates the class being predicted.

Figure 9. What the activations of SE modules look like at different network depths [1].

In shallower layers, the features captured by the SE module are class-agnostic, which basically means that it captures generic information required to predict all classes. Charts (a) and (b), which show the SE modules from ResNet stages 2 and 3, reveal little difference in channel activity from one class to another, indicating that these two modules do not capture information specific to any class. The case is different for the SE modules in deeper layers, i.e., the ones in stage 4 (c) and stage 5 (d). We can see that these two modules adjust channel weights differently depending on the class being predicted. This is essentially why the SE modules in deeper layers are said to be class-specific. However, the authors acknowledge that some SE modules behave unusually, as seen in the 2nd block of stage 5 (e). Here the SE module does not show meaningful channel recalibration behavior, indicating that it does not contribute as much as the ones we discussed earlier.

The Detailed Architecture

In this article we are going to implement the SE-ResNeXt-50 (32×4d) model, which corresponds to the rightmost column in Figure 10. The ResNeXt model itself is similar to ResNet, except that the groups parameter of the second convolution layer within each block is set to 32. If you’re familiar with ResNeXt, this is essentially the simplest yet effective way to implement the so-called cardinality. I recommend reading my previous article about ResNeXt if you are not yet familiar with it; the link is provided at reference number [3] at the end of this article.
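
To see what setting the groups parameter to 32 does in practice, the following sketch compares a regular 3×3 convolution with a grouped one. The channel count of 256 is just an illustrative choice.

```python
import torch.nn as nn

# A regular 3x3 convolution versus the same layer split into 32 groups.
dense = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False, groups=32)

n_dense = sum(p.numel() for p in dense.parameters())
n_grouped = sum(p.numel() for p in grouped.parameters())

print(grouped.weight.shape)   # each filter only sees 256 / 32 = 8 input channels
print(n_dense, n_grouped)     # the grouped layer has 32 times fewer weights
```

Each of the 32 groups processes its own slice of the input channels independently, which is why the weight count drops by a factor of 32.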

Taking a closer look at the architecture, what differentiates SE-ResNet-50 from ResNet-50 is simply the presence of SE modules. The same also applies to SE-ResNeXt-50 (32×4d) compared to ResNeXt-50 (32×4d) (not displayed in the table). Notice in the figure below that the models with SE modules have an fc layer attached after the last convolution layer within each block, where the two corresponding numbers refer to the first and second fully-connected layers inside the SE module.

Figure 10. The complete architecture of ResNet-50, SE-ResNet-50 and SE-ResNeXt-50 (32×4d) [1].

From Scratch Implementation

Remember that here we are about to integrate the SE module into ResNeXt, so we need to implement both of them from scratch. Technically speaking, it is possible to take the ResNeXt architecture directly from PyTorch and then manually attach the SE module to it. However, I decided to use the ResNeXt implementation from my previous article instead, since I feel it is much easier to understand than the one from PyTorch. Note that I will focus on constructing the SE module and attaching it to the ResNeXt model rather than explaining ResNeXt itself, since I have already covered it in that article [3].

Now let’s start the code by importing the required modules.

# Codeblock 1
import torch
import torch.nn as nn

Squeeze and Excitation Module

The following SE module implementation follows the diagram shown in Figure 5 (right). It is worth noting that the SEModule class below does not include the skip connection (curved arrow), as the entire SE module is applied after the initial branching but before the merging (summation).

The __init__() method of this class accepts two parameters: num_channels and r, as shown at line #(1) in Codeblock 2a. We definitely want this SE module to be usable throughout the entire network, so the num_channels parameter needs to be adjustable because the number of output channels varies across ResNeXt blocks at different stages, as shown back in Figure 10. Meanwhile, even though we generally use the same reduction ratio r for the SE modules throughout the network, it is technically possible to use a different r for each stage, which might be an interesting thing to experiment with. That is essentially why I also made the r parameter adjustable.

# Codeblock 2a
class SEModule(nn.Module):
    def __init__(self, num_channels, r):                     #(1)
        super().__init__()
        
        self.global_pooling = nn.AdaptiveAvgPool2d(output_size=(1,1))  #(2)
        self.fc0 = nn.Linear(in_features=num_channels,       #(3)
                             out_features=num_channels//r, 
                             bias=False)
        self.relu = nn.ReLU()                                #(4)
        self.fc1 = nn.Linear(in_features=num_channels//r,    #(5)
                             out_features=num_channels, 
                             bias=False)
        self.sigmoid = nn.Sigmoid()                          #(6)

There are five layers we need to initialize inside the __init__() method. I write them down according to the sequence given in Figure 5, i.e., the global average pooling layer (#(2)), a linear layer (#(3)), the ReLU activation function (#(4)), another linear layer (#(5)), and the sigmoid activation function (#(6)). Here you can see that the first linear layer is responsible for dimensionality reduction by shrinking the number of channels from num_channels to num_channels//r, which is then expanded back to num_channels by the second linear layer. Note that we set the bias term of both linear layers to False, which means that we only use the weight tensors. The absence of bias terms in the two layers forces the SE module to learn the correlation between channels rather than just adding fixed adjustments.

Still within the SEModule class, let’s now move on to the forward() method to define the flow of the network. You can see at line #(1) in Codeblock 2b that we start from a single input x, which in the case of ResNeXt is the tensor produced by the third convolution layer within the same ResNeXt block. As shown in Figure 5, what we need to do next is branch out the network. Here we directly process the branch with the global_pooling layer and name the resulting tensor squeezed (#(2)). The original input tensor x itself is left as is, since we are not going to perform any operation on it until the scaling phase. Next, we drop the spatial dimensions of the squeezed tensor using torch.flatten() (#(3)). This is done because we want to process it further with the linear layers at lines #(4) and #(5), which act on the last dimension of their input and therefore need the channels to sit there. The spatial dimensions are then reintroduced at line #(6), allowing us to multiply x (the original tensor) by excited (the channel weights) at line #(7). This entire process produces a recalibrated version of x, which we refer to as scaled. Here I print out the tensor shape after each step so that you can better understand the flow of this SE module.

# Codeblock 2b
    def forward(self, x):                                  #(1)
        print(f'original\t\t: {x.size()}')
        
        squeezed = self.global_pooling(x)                  #(2)
        print(f'after avgpool\t\t: {squeezed.size()}')
        
        squeezed = torch.flatten(squeezed, 1)              #(3)
        print(f'after flatten\t\t: {squeezed.size()}')
        
        excited = self.relu(self.fc0(squeezed))            #(4)
        print(f'after fc0-relu\t\t: {excited.size()}')
        
        excited = self.sigmoid(self.fc1(excited))          #(5)
        print(f'after fc1-sigmoid\t: {excited.size()}')
        
        excited = excited[:, :, None, None]                #(6)
        print(f'after reshape\t\t: {excited.size()}')
        
        scaled = x * excited                               #(7)
        print(f'after scaling\t\t: {scaled.size()}')
        
        return scaled

Now we are going to check whether we have implemented the module correctly by passing a dummy tensor through it. In Codeblock 3 below, I initialize an SE module configured to accept a tensor of 512 channels with a reduction ratio of 16 (#(1)). If you take a look at the SE-ResNeXt architecture in Figure 10, this SE module corresponds to the one in the third stage (where the output size is 28×28). Thus, at line #(2) we adjust the shape of the dummy tensor accordingly. We then feed this tensor into the module using the code at line #(3).

# Codeblock 3
semodule = SEModule(num_channels=512, r=16)    #(1)
x = torch.randn(1, 512, 28, 28)                #(2)

out = semodule(x)      #(3)

And below is what the print functions give us.

# Codeblock 3 Output
original          : torch.Size([1, 512, 28, 28])    #(1)
after avgpool     : torch.Size([1, 512, 1, 1])      #(2)
after flatten     : torch.Size([1, 512])            #(3)
after fc0-relu    : torch.Size([1, 32])             #(4)
after fc1-sigmoid : torch.Size([1, 512])            #(5)
after reshape     : torch.Size([1, 512, 1, 1])      #(6)
after scaling     : torch.Size([1, 512, 28, 28])    #(7)

You can see that the original tensor shape matches our dummy tensor exactly, i.e., 1×512×28×28 (#(1)). By the way, we can ignore the 1 in the 0th axis since it denotes the batch size; in this case I assume that we only have a single image in the batch. After being pooled, the spatial dimension collapses to 1×1 since each channel is now represented by a single number (#(2)). The purpose of the flatten operation I explained earlier is to drop the two singleton axes (#(3)) so that the channels sit in the last dimension, which is the one the subsequent linear layers operate on. Here you can see that the first linear layer reduces the tensor dimension to 32 due to the reduction ratio, which we previously set to 16 (#(4)). The length of this tensor is then expanded back to 512 by the second linear layer (#(5)). Next, we unsqueeze the tensor so that we get our 1×1 spatial dimension back (#(6)), allowing us to multiply it with the input tensor (#(7)). Based on this detailed flow, you can see that an SE module preserves the original tensor dimension, which shows that this module can be attached to any CNN-based model without disrupting the original flow of the network.
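
For readers who prefer a version without the print statements, the same flow can be packed into a few lines. The sketch below is an alternative written for illustration, not the class used in the rest of this article. The name CompactSE and the batch size of 2 are my own choices.

```python
import torch
import torch.nn as nn

class CompactSE(nn.Module):
    """Hypothetical compact variant of the SE module, without debug prints."""
    def __init__(self, num_channels, r=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                 # squeeze
            nn.Flatten(),                                            # (N, C, 1, 1) -> (N, C)
            nn.Linear(num_channels, num_channels // r, bias=False),  # reduce
            nn.ReLU(),
            nn.Linear(num_channels // r, num_channels, bias=False),  # expand
            nn.Sigmoid(),                                            # weights in [0, 1]
        )

    def forward(self, x):
        weights = self.gate(x)[:, :, None, None]   # restore 1x1 spatial axes
        return x * weights                         # scale each channel

se = CompactSE(num_channels=512, r=16)
x = torch.randn(2, 512, 28, 28)
out = se(x)

n_params = sum(p.numel() for p in se.parameters())
print(out.shape)     # same shape as the input
print(n_params)      # weights of the two bias-free linear layers
```

Since every channel weight lies between 0 and 1, the module can only keep or shrink the magnitude of each activation. It never amplifies it.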

ResNeXt

Now that we understand how to implement the SE module from scratch, I am going to show you how we can attach it to a ResNeXt model. Before doing so, we need to initialize the parameters required to implement the ResNeXt architecture. In Codeblock 4 below, the first four variables are determined according to the ResNeXt-50 (32×4d) variant, while the last one (R) represents the reduction ratio for the SE modules.

# Codeblock 4
CARDINALITY  = 32
NUM_CHANNELS = [3, 64, 256, 512, 1024, 2048]
NUM_BLOCKS   = [3, 4, 6, 3]
NUM_CLASSES  = 1000
R = 16
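
As a quick sanity check on these constants, the sketch below derives the per-stage widths they imply. It assumes that the bottleneck width is half of the block's output channels, which is how the Block class in this article computes mid_channels. The variable names are mine.

```python
CARDINALITY = 32
R = 16
stage_out_channels = [256, 512, 1024, 2048]   # the last four entries of NUM_CHANNELS

summary = []
for out_channels in stage_out_channels:
    mid_channels = out_channels // 2              # width of the bottleneck convolutions
    width_per_group = mid_channels // CARDINALITY # channels handled by each group
    se_hidden = out_channels // R                 # size of the SE bottleneck
    summary.append((out_channels, mid_channels, width_per_group, se_hidden))

for row in summary:
    print(row)
```

The first stage ends up with 32 groups of 4 channels each, which is where the 32×4d in the model name comes from.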

The Block class defined in Codeblock 5a and 5b is the ResNeXt block from my previous article. There are actually quite a few things we do inside the __init__() method, but the general idea is that we initialize three convolution layers called conv0 (#(1)), conv1 (#(2)), and conv2 (#(3)) before initializing the SE module at line #(4). We will later configure these layers according to the SE-ResNeXt architecture shown back in Figure 10.

# Codeblock 5a
class Block(nn.Module):
    def __init__(self, 
                 in_channels,
                 add_channel=False,
                 channel_multiplier=2,
                 downsample=False):
        super().__init__()

        self.add_channel = add_channel
        self.channel_multiplier = channel_multiplier
        self.downsample = downsample
        
        
        if self.add_channel:
            out_channels = in_channels*self.channel_multiplier
        else:
            out_channels = in_channels
        
        mid_channels = out_channels//2
        
        
        if self.downsample:
            stride = 2
        else:
            stride = 1
            

        if self.add_channel or self.downsample:
            self.projection = nn.Conv2d(in_channels=in_channels,
                                        out_channels=out_channels, 
                                        kernel_size=1, 
                                        stride=stride, 
                                        padding=0, 
                                        bias=False)
            nn.init.kaiming_normal_(self.projection.weight, nonlinearity='relu')
            self.bn_proj = nn.BatchNorm2d(num_features=out_channels)

        self.conv0 = nn.Conv2d(in_channels=in_channels,       #(1)
                               out_channels=mid_channels,
                               kernel_size=1, 
                               stride=1, 
                               padding=0, 
                               bias=False)
        nn.init.kaiming_normal_(self.conv0.weight, nonlinearity='relu')
        self.bn0 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv1 = nn.Conv2d(in_channels=mid_channels,      #(2)
                               out_channels=mid_channels, 
                               kernel_size=3, 
                               stride=stride,
                               padding=1, 
                               bias=False, 
                               groups=CARDINALITY)
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        self.bn1 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv2 = nn.Conv2d(in_channels=mid_channels,      #(3)
                               out_channels=out_channels,
                               kernel_size=1, 
                               stride=1, 
                               padding=0, 
                               bias=False)
        nn.init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
        self.bn2 = nn.BatchNorm2d(num_features=out_channels)
        
        self.relu = nn.ReLU()
        
        self.semodule = SEModule(num_channels=out_channels, r=R)    #(4)

The forward() method itself is generally the same as in the original ResNeXt model, except that here we need to put the SE module right before the element-wise summation, as shown at line #(1) in Codeblock 5b below. Remember that this implementation follows the standard SE block architecture in Figure 6 (b).

# Codeblock 5b
    def forward(self, x):
        print(f'original\t\t: {x.size()}')
        
        if self.add_channel or self.downsample:
            residual = self.bn_proj(self.projection(x))
            print(f'after projection\t: {residual.size()}')
        else:
            residual = x
            print(f'no projection\t\t: {residual.size()}')
        
        x = self.conv0(x)
        x = self.bn0(x)
        x = self.relu(x)
        print(f'after conv0-bn0-relu\t: {x.size()}')

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        print(f'after conv1-bn1-relu\t: {x.size()}')
        
        x = self.conv2(x)
        x = self.bn2(x)
        print(f'after conv2-bn2\t\t: {x.size()}')
        
        x = self.semodule(x)      #(1)
        print(f'after semodule\t\t: {x.size()}')
        
        x = x + residual
        x = self.relu(x)
        print(f'after summation\t\t: {x.size()}')
        
        return x

With the above implementation, every time we instantiate a Block object we get a ResNeXt block that is already equipped with an SE module. Now we are going to test the above class to see whether we have implemented it correctly. Here I am going to simulate a ResNeXt block within the third stage. The add_channel and downsample parameters are set to False since we want to preserve both the number of channels and the spatial dimension of the input tensor.

# Codeblock 6
block = Block(in_channels=512, add_channel=False, downsample=False)
x = torch.randn(1, 512, 28, 28)

out = block(x)

Below is what the output looks like. Here you can see that our first convolution layer successfully reduces the number of channels from 512 to 256 (#(1)), which is then expanded back to its original dimension by the third convolution layer (#(2)). Afterwards, the tensor goes through the SE module, whose output size is the same as its input, just like what we saw earlier in Codeblock 3 (#(3)). Once the processing with the SE module is done, we can finally perform the element-wise summation between the tensor from the main branch and the one from the skip connection (#(4)).

original             : torch.Size([1, 512, 28, 28])
no projection        : torch.Size([1, 512, 28, 28])
after conv0-bn0-relu : torch.Size([1, 256, 28, 28])    #(1)
after conv1-bn1-relu : torch.Size([1, 256, 28, 28])
after conv2-bn2      : torch.Size([1, 512, 28, 28])    #(2)
after semodule       : torch.Size([1, 512, 28, 28])    #(3)
after summation      : torch.Size([1, 512, 28, 28])    #(4)

And below is how I implement the entire architecture. What we need to do is simply stack multiple SE-ResNeXt blocks according to the architecture in Figure 10. In fact, the SEResNeXt class in Codeblock 7 is exactly the same as the ResNeXt class in my previous article [3] (I literally copy-pasted it), since what makes SE-ResNeXt different from the original ResNeXt is just the presence of the SE module within the Block class we discussed earlier.

# Codeblock 7
class SEResNeXt(nn.Module):
    def __init__(self):
        super().__init__()

        # conv1 stage
        self.resnext_conv1 = nn.Conv2d(in_channels=NUM_CHANNELS[0],
                                       out_channels=NUM_CHANNELS[1],
                                       kernel_size=7,
                                       stride=2,
                                       padding=3, 
                                       bias=False)
        nn.init.kaiming_normal_(self.resnext_conv1.weight, 
                                nonlinearity='relu')
        self.resnext_bn1 = nn.BatchNorm2d(num_features=NUM_CHANNELS[1])
        self.relu = nn.ReLU()
        self.resnext_maxpool1 = nn.MaxPool2d(kernel_size=3,
                                             stride=2, 
                                             padding=1)

        # conv2 stage
        self.resnext_conv2 = nn.ModuleList([
            Block(in_channels=NUM_CHANNELS[1],
                  add_channel=True,
                  channel_multiplier=4,
                  downsample=False)
        ])
        for _ in range(NUM_BLOCKS[0]-1):
            self.resnext_conv2.append(Block(in_channels=NUM_CHANNELS[2]))

        # conv3 stage
        self.resnext_conv3 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[2],
                                                  add_channel=True, 
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[1]-1):
            self.resnext_conv3.append(Block(in_channels=NUM_CHANNELS[3]))
            
            
        # conv4 stage
        self.resnext_conv4 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[3],
                                                  add_channel=True, 
                                                  downsample=True)])
        
        for _ in range(NUM_BLOCKS[2]-1):
            self.resnext_conv4.append(Block(in_channels=NUM_CHANNELS[4]))
            
            
        # conv5 stage
        self.resnext_conv5 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[4],
                                                  add_channel=True, 
                                                  downsample=True)])
        
        for _ in range(NUM_BLOCKS[3]-1):
            self.resnext_conv5.append(Block(in_channels=NUM_CHANNELS[5]))
 
       
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))

        self.fc = nn.Linear(in_features=NUM_CHANNELS[5],
                            out_features=NUM_CLASSES)
        

    def ahead(self, x):
        print(f'originaltt: {x.measurement()}')
        
        x = self.relu(self.resnext_bn1(self.resnext_conv1(x)))
        print(f'after resnext_conv1t: {x.measurement()}')
        
        x = self.resnext_maxpool1(x)
        print(f'after resnext_maxpool1t: {x.measurement()}')
        
        for i, block in enumerate(self.resnext_conv2):
            x = block(x)
            print(f'after resnext_conv2 #{i}t: {x.measurement()}')
            
        for i, block in enumerate(self.resnext_conv3):
            x = block(x)
            print(f'after resnext_conv3 #{i}t: {x.measurement()}')
            
        for i, block in enumerate(self.resnext_conv4):
            x = block(x)
            print(f'after resnext_conv4 #{i}t: {x.measurement()}')
            
        for i, block in enumerate(self.resnext_conv5):
            x = block(x)
            print(f'after resnext_conv5 #{i}t: {x.measurement()}')
        
        x = self.avgpool(x)
        print(f'after avgpooltt: {x.measurement()}')
        
        x = torch.flatten(x, start_dim=1)
        print(f'after flattentt: {x.measurement()}')
        
        x = self.fc(x)
        print(f'after fctt: {x.measurement()}')
        
        return x
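
The print calls inside forward() are handy for a one-off shape check, but they also fire on every training step. One possible alternative is to attach PyTorch forward hooks that record the output shapes and are removed afterwards. The sketch below uses a small stand-in network rather than the SEResNeXt class itself, and the names trace_shapes and toy are hypothetical, chosen only for this illustration.

```python
import torch
from torch import nn

def trace_shapes(model, x):
    """Run one forward pass and return (name, output shape) for each direct child."""
    shapes, handles = [], []
    for name, module in model.named_children():
        # Bind `name` through a default argument so each hook keeps its own label.
        def hook(_module, _inputs, output, name=name):
            shapes.append((name, tuple(output.shape)))
        handles.append(module.register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(x)
    finally:
        for handle in handles:
            handle.remove()  # leave the model without hooks afterwards
    return shapes

# Hypothetical stand-in for the stem of a ResNeXt-style network.
toy = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

for name, shape in trace_shapes(toy, torch.randn(1, 3, 224, 224)):
    print(f'{name}\t: {shape}')
```

For a model that stores its blocks in nn.ModuleList containers, the hooks would have to be registered on the individual blocks, since a ModuleList is never called directly and a hook attached to it would not fire.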

Now that the entire SE-ResNeXt-50 (32×4d) architecture is complete, we are going to test it by passing a tensor of size 1×3×224×224 through the network, simulating a single RGB image of size 224×224. You can see in the output of Codeblock 8 below that the model appears to work properly, since the tensor passes through all layers of the seresnext model without raising any error. Thus, I believe this model is now ready to be trained. By the way, don't forget to change the number of neurons in the output layer according to the number of classes in your dataset if you actually want to train this model.

# Codeblock 8
seresnext = SEResNeXt()
x = torch.randn(1, 3, 224, 224)

out = seresnext(x)

# Codeblock 8 Output
original               : torch.Size([1, 3, 224, 224])
after resnext_conv1    : torch.Size([1, 64, 112, 112])
after resnext_maxpool1 : torch.Size([1, 64, 56, 56])
after resnext_conv2 #0 : torch.Size([1, 256, 56, 56])
after resnext_conv2 #1 : torch.Size([1, 256, 56, 56])
after resnext_conv2 #2 : torch.Size([1, 256, 56, 56])
after resnext_conv3 #0 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #1 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #2 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #3 : torch.Size([1, 512, 28, 28])
after resnext_conv4 #0 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #1 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #2 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #3 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #4 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #5 : torch.Size([1, 1024, 14, 14])
after resnext_conv5 #0 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #1 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #2 : torch.Size([1, 2048, 7, 7])
after avgpool          : torch.Size([1, 2048, 1, 1])
after flatten          : torch.Size([1, 2048])
after fc               : torch.Size([1, 1000])
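
The last line of the output shows 1000 logits, so the final layer has to be adapted before training on a dataset with a different number of classes. Here is a minimal sketch of one way to do that, assuming the classifier lives in an attribute called fc as in the forward() method above. TinyClassifier and replace_head are hypothetical names, and the stand-in class exists only so that the snippet runs by itself.

```python
import torch
from torch import nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in exposing the same avgpool -> flatten -> fc tail."""
    def __init__(self, num_features=2048, num_classes=1000):
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
        self.fc = nn.Linear(in_features=num_features, out_features=num_classes)

    def forward(self, x):
        x = torch.flatten(self.avgpool(x), start_dim=1)
        return self.fc(x)

def replace_head(model, num_classes):
    """Swap model.fc for a freshly initialised layer with num_classes outputs."""
    model.fc = nn.Linear(in_features=model.fc.in_features,
                         out_features=num_classes)
    return model

model = replace_head(TinyClassifier(), num_classes=10)
out = model(torch.randn(1, 2048, 7, 7))  # same shape as the conv5 output above
print(out.size())
```

If you are building the model from scratch anyway, setting NUM_CLASSES before instantiating it achieves the same result. Swapping fc is mostly useful when you want to keep the rest of an already trained network.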

Furthermore, we can also print out the number of parameters this model has using the following code. Here you can see that the codeblock returns 27,543,848. This is roughly 10% more than the original ResNeXt counterpart, which has only 25,028,904 parameters, as mentioned in my previous article as well as in the official PyTorch documentation [4]. Such an increase in model size makes sense, since the ResNeXt blocks throughout the entire network now contain additional layers due to the presence of the SE modules.

# Codeblock 9
def count_parameters(model):
    return sum([params.numel() for params in model.parameters()])

count_parameters(seresnext)

# Codeblock 9 Output
27543848
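
The gap between the two counts can be sanity-checked by hand. The sketch below assumes that each SE module consists of two bias-free linear layers with a reduction ratio of r = 16 (the default in the paper [1]; neither detail is restated in this section). It takes the block counts and channel widths from the Codeblock 8 output.

```python
# Output channels and number of blocks per stage, read off the Codeblock 8 output.
STAGE_CHANNELS = [256, 512, 1024, 2048]
STAGE_BLOCKS = [3, 4, 6, 3]
REDUCTION = 16  # assumed reduction ratio r

def se_weights(channels, r=REDUCTION):
    """Weights in one SE module: C -> C/r and C/r -> C, assuming no bias terms."""
    hidden = channels // r
    return channels * hidden + hidden * channels

se_total = sum(n * se_weights(c) for c, n in zip(STAGE_CHANNELS, STAGE_BLOCKS))
difference = 27_543_848 - 25_028_904  # SE-ResNeXt-50 minus plain ResNeXt-50

print(se_total)
print(difference)
```

Under these assumptions both values come out to 2,514,944, which suggests that the extra parameters are accounted for entirely by the SE modules.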

Ending

And that's pretty much everything about the Squeeze and Excitation module. I encourage you to explore further by training this model on your own dataset, so that you can see whether the findings presented in the paper also apply to your case. Beyond that, I think it would also be interesting to try implementing the SE module on other neural network architectures like VGG or Inception by yourself.

I hope you learned something new today. Thanks for reading!

By the way, you can also find the code used in this article in my GitHub repo [5].

[1] Jie Hu et al. Squeeze-and-Excitation Networks. arXiv. https://arxiv.org/abs/1709.01507 [Accessed March 17, 2025].

[2] Image originally created by author.

[3] Taking ResNet to the Next Level. Towards Data Science. https://towardsdatascience.com/taking-resnet-to-the-next-level/ [Accessed July 22, 2025].

[4] Resnext50_32x4d. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.resnext50_32x4d.html#torchvision.models.resnext50_32x4d [Accessed March 17, 2025].

[5] MuhammadArdiPutra. The Channel-Wise Attention - Squeeze and Excitation. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Channel-Wise%20Attention%20-%20Squeeze%20and%20Excitation.ipynb [Accessed April 7, 2025].