My research at Multitel has focused on fine-grained visual classification (FGVC). Specifically, I worked on building a robust car classifier that can run in real time on edge devices. This post is part of what may become a small series of reflections on that experience. I'm writing to share some of the lessons I learned, but also to organize and consolidate what I've learned. At the same time, I hope it gives a sense of the kind of high-level engineering and applied research we do at Multitel, work that blends academic rigor with real-world constraints. Whether you're a fellow researcher, a curious engineer, or someone considering joining our team, I hope this post offers both insight and inspiration.
1. The problem:
We needed a system that could identify specific car models, not just "this is a BMW," but which BMW model and year. And it needed to run in real time on resource-constrained edge devices alongside other models. This kind of task falls under what is known as fine-grained visual classification (FGVC).

FGVC aims to recognize images belonging to multiple subordinate categories of a super-category (e.g. species of animals or plants, models of cars, etc.). The difficulty lies in capturing fine-grained visual differences that sufficiently discriminate between objects that are highly similar in overall appearance but differ in fine-grained features [2].

What makes FGVC particularly challenging?
- Small inter-class variation: The visual differences between classes can be extremely subtle.
- Large intra-class variation: At the same time, instances within the same class may vary drastically due to changes in lighting, pose, background, or other environmental factors.
- Subtle visual differences can easily be overwhelmed by other factors such as pose and viewpoint.
- Long-tailed distributions: Datasets typically have a few classes with many samples and many classes with only a few examples. For instance, you might have only a couple of images of a rare spider species found in a remote region, while common species have thousands of images. This imbalance makes it difficult for models to learn equally well across all categories.

2. The landscape:
When we first started tackling this problem, we naturally turned to the literature. We dove into academic papers, examined benchmark datasets, and explored state-of-the-art FGVC methods. At first, the problem looked far more complicated than it actually turned out to be, at least in our specific context.
FGVC has been actively researched for years, and there is no shortage of approaches that introduce increasingly complex architectures and pipelines. Many early works, for example, proposed two-stage models: a localization subnetwork would first identify discriminative object parts, and a second network would then classify based on those parts. Others focused on custom loss functions, high-order feature interactions, or label dependency modeling using hierarchical structures.
All of these methods were designed to handle the subtle visual distinctions that make FGVC so challenging. If you are curious about the evolution of these approaches, Wei et al. [2] provide a solid survey that covers many of them in depth.

When we looked closer at recent benchmark results (archived from Papers with Code), many of the top-performing solutions were based on transformer architectures. These models often reached state-of-the-art accuracy, but with little to no discussion of inference time or deployment constraints. Given our requirements, we were fairly certain that these models would not hold up in real time on an edge device already running several models in parallel.
At the time of this work, the best reported result on Stanford Cars was 97.1% accuracy, achieved by CMAL-Net.
3. Our approach:
Instead of starting with the most complex or specialized solutions, we took the opposite approach: could a model that we already knew would meet our real-time and deployment constraints perform well enough on the task? Specifically, we asked whether a solid general-purpose architecture could get us close to the performance of newer, heavier models, if trained properly.
That line of thinking led us to a paper by Ross Wightman et al., "ResNet Strikes Back: An Improved Training Procedure in timm" [7]. In it, Wightman makes a compelling argument: most new architectures are trained using the latest advances and techniques but are then compared against older baselines trained with outdated recipes. Wightman argues that ResNet-50, which is frequently used as a benchmark, is often not given the benefit of these modern improvements. His paper proposes a refined training procedure and shows that, when trained properly, even a vanilla ResNet-50 can achieve surprisingly strong results, including on several FGVC benchmarks.
With these constraints and goals in mind, we set out to build our own strong, reusable training procedure, one that could deliver high performance on FGVC tasks without relying on architecture-specific tricks. The idea was simple: start with a known, efficient backbone like ResNet-50 and focus entirely on improving the training pipeline rather than modifying the model itself. That way, the same recipe could later be applied to other architectures with minimal adjustments.
We began collecting ideas, techniques, and training refinements from several sources, compounding best practices into a single, cohesive pipeline. Specifically, we drew from four key sources:
- Bag of Tricks for Image Classification with Convolutional Neural Networks (He et al.)
- Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network (Lee et al.)
- ResNet Strikes Back: An Improved Training Procedure in timm (Wightman et al.)
- How to Train State-of-the-Art Models Using TorchVision's Latest Primitives (Vryniotis)
Our goal was to create a robust training pipeline that did not rely on model-specific tweaks. That meant focusing on techniques that are broadly applicable across architectures.
To test and validate our training pipeline, we used the Stanford Cars dataset [9], a widely used fine-grained classification benchmark that closely aligns with our real-world use case. The dataset contains 196 car categories and 16,185 images, all taken from the rear to emphasize subtle inter-class differences. The data is split almost evenly into 8,144 training images and 8,041 testing images. To simulate our deployment scenario, where the classification model operates downstream of an object detection system, we crop each image to its annotated bounding box before training and evaluation.
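For illustration, here is a minimal sketch of that bounding-box cropping step, assuming annotations are available as (x1, y1, x2, y2) pixel coordinates; the function and variable names are ours, not part of the original pipeline:

```python
from PIL import Image

def crop_to_bbox(image_path, bbox):
    """Crop an image to its annotated bounding box.

    bbox is assumed to be (x1, y1, x2, y2) in pixel coordinates,
    matching the Stanford Cars annotation format.
    """
    image = Image.open(image_path).convert("RGB")
    return image.crop(bbox)  # PIL expects (left, upper, right, lower)

# Hypothetical usage, with annotations as a list of (path, bbox, label) tuples:
# cropped = [(crop_to_bbox(path, bbox), label) for path, bbox, label in annotations]
```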
While the original hosting site for the dataset is no longer available, it remains accessible through curated repositories such as Kaggle and Hugging Face. The dataset is distributed under the BSD-3-Clause license, which permits both commercial and non-commercial use. In this work, it was used solely in a research context to produce the results presented here.

Building the Recipe
What follows is the distilled training recipe we arrived at, built through experimentation, iteration, and careful aggregation of ideas from the works mentioned above. The goal is to show that by simply applying modern training best practices, without any architecture-specific hacks, we could get a general-purpose model like ResNet-50 to perform competitively on a fine-grained benchmark.
We'll start with a vanilla ResNet-50 trained using a basic setup and progressively introduce improvements, one step at a time.
With each technique, we'll report:
- The individual performance gain
- The cumulative gain when added to the pipeline
While many of the techniques used are likely familiar, our intent is to highlight how powerful they can be when compounded deliberately. Benchmarks often obscure this by comparing new architectures trained with the latest advances against old baselines trained with outdated recipes. Here, we want to flip that around and show what is possible with a carefully tuned recipe applied to a widely available, efficient backbone.
We also acknowledge that many of these techniques interact with one another. So, in practice, we tuned some combinations through greedy or grid search to account for synergies and interdependencies.
The Base Recipe:
Before diving into optimizations, we start with a clean, simple baseline.
We train a ResNet-50 model pretrained on ImageNet on the Stanford Cars dataset. Each model is trained for up to 600 epochs on a single RTX 4090 GPU, with early stopping based on validation accuracy using a patience of 200 epochs.
We use:
- Nesterov Accelerated Gradient (NAG) for optimization
- Learning rate: 0.01
- Batch size: 32
- Momentum: 0.9
- Loss function: Cross-entropy
All training and validation images are cropped to their bounding boxes and resized to 224×224 pixels. We start with the same standard augmentation policy as in [5].
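As a rough sketch, the baseline setup could look like the following in PyTorch; the pretrained weight variant and the augmentation stack shown here are simplified stand-ins for the standard policy of [5], not the exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# ImageNet-pretrained ResNet-50 with a new 196-way classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 196)

# NAG is plain SGD with momentum and the Nesterov flag enabled.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
criterion = nn.CrossEntropyLoss()

# Simplified stand-in for the standard augmentation policy.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```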
Here is a summary of the base training configuration and its performance:
| Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience | Augmentation | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.01 | 0.9 | 32 | Cross-entropy | 224×224 | 600 | 200 | Standard | 88.22 |
We fix the random seed across runs to ensure reproducibility and reduce variance between experiments. To assess the true effect of a change in the recipe, we follow best practices and average results over multiple runs (typically 3 to 5).
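For completeness, a typical seeding helper looks like the sketch below; the exact routine we used may differ:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix random seeds for Python, NumPy and PyTorch to reduce run-to-run variance."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```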
We'll now build on top of this baseline step by step, introducing one technique at a time and tracking its impact on accuracy. The goal is to isolate what each component contributes and how the components compound when applied together.
Large batch training:
In mini-batch SGD, gradient descent is a stochastic process because the examples are randomly selected in each batch. Increasing the batch size does not change the expectation of the stochastic gradient but reduces its variance. Using a large batch size, however, may slow down training progress: for the same number of epochs, training with a large batch size results in a model with degraded validation accuracy compared to models trained with smaller batch sizes.
He et al. [5] argue that linearly increasing the learning rate with the batch size works empirically for ResNet-50 training.
To improve both the accuracy and the speed of our training, we increase the batch size to 128 and the learning rate to 0.1. We also add a StepLR scheduler that decays the learning rate of each parameter group by a factor of 0.1 every 30 epochs.
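In PyTorch this amounts to raising the learning rate and attaching a StepLR scheduler; a minimal sketch, assuming the model and data loader (now with batch size 128) are defined as in the baseline above and that the scheduler is stepped once per epoch:

```python
import torch

# Higher learning rate (0.1) for the larger batch, plus step decay:
# the learning rate is multiplied by 0.1 every 30 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Per-epoch loop (train_one_epoch is a hypothetical helper):
# for epoch in range(epochs):
#     train_one_epoch(model, train_loader, optimizer, criterion)
#     scheduler.step()
```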
Learning rate warmup:
Since at the beginning of training all parameters are typically random values, using a learning rate that is too large may result in numerical instability.
With the warmup heuristic, we use a small learning rate at the start and then switch back to the initial learning rate once the training process is stable. We use a gradual warmup strategy that increases the learning rate from 0 to the initial value linearly.
We add a linear warmup strategy for 5 epochs.
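One way to implement this, assuming schedulers are stepped per epoch, is to chain a LinearLR warmup with the main schedule via SequentialLR; the 0.01 start factor mirrors the warmup decay in the table below:

```python
import torch

# 5 epochs of linear warmup from 1% of the base LR, then the StepLR schedule.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=5)
step = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, step], milestones=[5]
)
```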

| Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience | Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method | Warmup epochs | Warmup decay | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 | 128 | Cross-entropy | 224×224 | 600 | 200 | Standard | StepLR | 30 | 0.1 | Linear | 5 | 0.01 | 89.21 | +0.99 | +0.99 |
TrivialAugment:
To explore the impact of stronger data augmentation, we replaced the baseline augmentation with TrivialAugment [10]. TrivialAugment works as follows: it takes an image x and a set of augmentations A as input. It then simply samples an augmentation from A uniformly at random and applies it to the given image x with a strength m, sampled uniformly at random from the set of possible strengths {0, ..., 30}, and returns the augmented image.
What makes TrivialAugment especially attractive is that it is completely parameter-free: it requires no search or tuning, making it a simple yet effective drop-in replacement that reduces experimental complexity.
While it may seem counterintuitive that such a generic and randomized strategy would outperform augmentations specifically tailored to the dataset, or more sophisticated automated augmentation methods, we tried a variety of alternatives, and TrivialAugment consistently delivered strong results across runs. Its simplicity, stability, and surprisingly high effectiveness make it a compelling default choice.
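TrivialAugment ships with torchvision (as TrivialAugmentWide, from version 0.12 onward), so dropping it in is essentially a one-line change to the transform pipeline; a sketch:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),  # parameter-free augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```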

| Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience | Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method | Warmup epochs | Warmup decay | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 | 128 | Cross-entropy | 224×224 | 600 | 200 | TrivialAugment | StepLR | 30 | 0.1 | Linear | 5 | 0.01 | 92.66 | +3.45 | +4.44 |
Cosine Learning Rate Decay:
Next, we explored modifying the learning rate schedule. We switched to a cosine annealing strategy, which decreases the learning rate from the initial value to 0 by following the cosine function. A big advantage of cosine annealing is that it has no hyperparameters to optimize, which again cuts down our search space.
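A sketch of the combined warmup-plus-cosine schedule; the T_max value here assumes the cosine phase spans the remaining 595 of the 600 epochs, which is our assumption rather than a stated detail:

```python
import torch

warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=595)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[5]
)
```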

| Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience | Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method | Warmup epochs | Warmup decay | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 | 128 | Cross-entropy | 224×224 | 600 | 200 | TrivialAugment | Cosine | – | – | Linear | 5 | 0.01 | 93.22 | +0.56 | +5.00 |
Label Smoothing:
One approach to reduce overfitting is to stop the model from becoming overconfident. This can be achieved by softening the ground truth using label smoothing. The idea is to change the construction of the true label to:
\[ q_i = \begin{cases} 1 - \varepsilon, & \text{if } i = y, \\ \frac{\varepsilon}{K - 1}, & \text{otherwise}. \end{cases} \]
There is a single parameter ε, which controls the degree of smoothing (the higher, the stronger), that we need to specify. We used a smoothing factor of ε = 0.1, which is the standard value proposed in the original paper and widely adopted in the literature.
Interestingly, we found empirically that adding label smoothing reduced gradient variance during training. This allowed us to safely increase the learning rate without destabilizing training. As a result, we increased the initial learning rate from 0.1 to 0.4.
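In PyTorch (1.10 and later), label smoothing is available directly in the cross-entropy loss, so the change is a single argument:

```python
import torch.nn as nn

# Cross-entropy with label smoothing, epsilon = 0.1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```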
| Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience | Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method | Warmup epochs | Warmup decay | Label smoothing | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 | 128 | Cross-entropy | 224×224 | 600 | 200 | TrivialAugment | StepLR | 30 | 0.1 | Linear | 5 | 0.01 | 0.1 | 94.5 | +1.28 | +6.28 |
Random Erasing:
As an additional form of regularization, we introduced Random Erasing into the training pipeline. This technique randomly selects a rectangular region within an image and replaces its pixels with random values, with a fixed probability.
Often paired with automated augmentation methods, it usually yields additional accuracy improvements thanks to its regularization effect. We added Random Erasing with a probability of 0.1.
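Since torchvision's RandomErasing operates on tensors, it goes after ToTensor in the transform pipeline; a sketch of the updated augmentation stack:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.1),  # erase a random rectangle 10% of the time
])
```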

| Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience | Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method | Warmup epochs | Warmup decay | Label smoothing | Random erasing | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 | 128 | Cross-entropy | 224×224 | 600 | 200 | TrivialAugment | StepLR | 30 | 0.1 | Linear | 5 | 0.01 | 0.1 | 0.1 | 94.93 | +0.43 | +6.71 |
Exponential Moving Average (EMA):
Training a neural network with mini-batches introduces noise, and the gradients used to update the model parameters between batches are less accurate. An exponential moving average of the weights is commonly used when training deep neural networks to improve their stability and generalization.
Instead of just using the raw weights that are directly learned during training, EMA maintains a running average of the model weights, which is updated at each training step as a weighted average of the current weights and the previous EMA values.
Specifically, at each training step, the EMA weights are updated using:
\[ \theta_{\mathrm{EMA}} \leftarrow \alpha \, \theta_{\mathrm{EMA}} + (1 - \alpha) \, \theta \]
where θ are the current model weights and α is a decay factor controlling how much weight is given to the past.
By evaluating the EMA weights rather than the raw ones at test time, we found improved consistency in performance across runs, especially in the later stages of training.
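A minimal EMA wrapper implementing the update rule above is sketched below; the class is ours (torch.optim.swa_utils.AveragedModel can serve a similar purpose). With the settings in the table that follows, it would be invoked every 32 optimizer steps with a decay of 0.994:

```python
import copy
import torch

class ModelEMA:
    """Keep a shadow copy of the weights, updated as
    theta_ema <- alpha * theta_ema + (1 - alpha) * theta."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.994):
        self.decay = decay
        self.ema = copy.deepcopy(model).eval()
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        ema_state = self.ema.state_dict()
        for name, param in model.state_dict().items():
            if param.dtype.is_floating_point:
                ema_state[name].mul_(self.decay).add_(param, alpha=1 - self.decay)
            else:
                ema_state[name].copy_(param)  # e.g. BatchNorm step counters
```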
| Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience | Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method | Warmup epochs | Warmup decay | Label smoothing | Random erasing | EMA steps | EMA decay | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 | 128 | Cross-entropy | 224×224 | 600 | 200 | TrivialAugment | StepLR | 30 | 0.1 | Linear | 5 | 0.01 | 0.1 | 0.1 | 32 | 0.994 | 94.93 | 0 | +6.71 |
We tested EMA in isolation and found that it led to notable improvements in both training stability and validation performance. But when we integrated EMA into the full recipe alongside the other techniques, it did not provide any further improvement. The results seemed to plateau, suggesting that most of the gains had already been captured by the other components.
Because our goal is to develop a general-purpose training recipe rather than one overly tailored to a single dataset, we chose to keep EMA in the final setup. Its benefits may be more pronounced in other settings, and its low overhead makes it a safe inclusion.
Optimizations we tested but did not adopt:
We also explored a range of additional techniques that are commonly effective in other image classification tasks, but found that they either did not lead to significant improvements or, in some cases, slightly regressed performance on the Stanford Cars dataset:
- Weight Decay: Adds L2 regularization to discourage large weights during training. We experimented extensively with weight decay in our use case, but it consistently regressed performance.
- CutMix/MixUp: CutMix replaces random patches between images and mixes the corresponding labels. MixUp creates new training samples by linearly combining pairs of images and labels. We tried applying either CutMix or MixUp randomly with equal probability during training, but this approach regressed results.
- AutoAugment: Delivered strong results and competitive accuracy, but we found TrivialAugment to be better. More importantly, TrivialAugment is completely parameter-free, which cuts down our search space and simplifies tuning.
- Other Optimizers and Schedulers: We experimented with a range of optimizers and learning rate schedules. Nesterov Accelerated Gradient (NAG) consistently gave us the best performance among optimizers, and cosine annealing stood out as the best scheduler, delivering strong results with no additional hyperparameters to tune.
4. Conclusion:
The graph below summarizes the improvements as we progressively built up our training recipe:

Using just a standard ResNet-50, we were able to achieve strong performance on the Stanford Cars dataset, demonstrating that careful tuning of a few simple techniques can go a long way in fine-grained classification.
However, it is important to keep this in perspective. These results mainly show that we can train a model to distinguish between fine-grained, well-represented classes in a clean, curated dataset. The Stanford Cars dataset is fairly class-balanced, with high-quality, mostly frontal images and no major occlusion or real-world noise. It does not address challenges like long-tailed distributions, domain shift, or recognition of unseen classes.
In practice, you will never have a dataset that covers every car model, especially one that is updated daily as new models appear. Real-world systems have to deal with distributional shift, open-set recognition, and imperfect inputs.
So while this served as a strong baseline and proof of concept, there was still significant work to be done to build something robust and production-ready.
References:
[1] Krause, Deng, et al. Collecting a Large-Scale Dataset of Fine-Grained Cars.
[2] Wei, et al. Fine-Grained Image Analysis with Deep Learning: A Survey.
[3] Reslan, Farou. Automatic Fine-grained Classification of Bird Species Using Deep Learning.
[5] He, et al. Bag of Tricks for Image Classification with Convolutional Neural Networks.
[7] Wightman, et al. ResNet Strikes Back: An Improved Training Procedure in timm.
[9] Krause, et al. 3D Object Representations for Fine-Grained Categorization.
[10] Müller, Hutter. TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation.