5 Ways to Fine-Tune Chronos-2, the Time Series Foundation Model

In Part 1 of this series, we introduced Chronos-2, a time-series foundation model. We got our hands dirty by walking through a real case study and saw what Chronos-2 can do straight out of the box, with no training.

But as we noted at the end of Part 1, zero-shot isn’t always enough.

That is typically the case when:

Your data looks unlike anything in the pretraining mix.
The model keeps making systematic errors.
You do have rich historical data that can be leveraged.
Your downstream objective may be misaligned with the objective that Chronos-2’s training optimizes for.

In those cases, fine-tuning is the natural next step.

In this post, we’ll continue the same building electricity-demand case study from Part 1 and walk through five fine-tuning scenarios for Chronos-2:

Single-building adaptation: how to fine-tune on a single asset.
Portfolio fine-tuning: how to pool history across the fleet for a shared adapter.
Covariate-informed fine-tuning: how to fine-tune with known-future signals.
Portfolio + covariates: how to leverage both covariate and fleet information.
Held-out transfer: how to adapt once, then deploy on assets the model never saw during fine-tuning.

By the end, you’ll have a working template for fine-tuning a time-series foundation model (TSFM) that is ready to adapt to your own data.

Part 1 of this series introduces how to use Chronos-2 for forecasting in univariate, multivariate, covariate-informed, and cross-learning scenarios. If you want to use Chronos-2 out of the box, check the post here.

1. The case study, recapped

Let’s quickly revisit the setup from Part 1.

We have a synthetic dataset of eight commercial buildings that records hourly electricity demand. The task we aim to solve is to forecast the total electricity load one week ahead, i.e., 168 hours. We use a physical simulator to generate the dataset, where the total load is decomposed into base, plug, lighting, and HVAC loads. Physically, plug and lighting loads are determined by weekday occupancy patterns, while HVAC load is determined by outdoor temperature.

What’s new for Part 2 is that we simulate a longer time span so that we have data for fine-tuning, and we keep a clean separation between fine-tuning data and inference data. Specifically, we divide the timeline into four contiguous windows:

Train (roughly 12 weeks): 2025-03-01 to 2025-05-22, the only window used to update the adapter weights.
Validation (1 week): 2025-05-23 to 2025-05-29, used for checkpoint selection and early stopping.
Inference context (45 days): 2025-05-30 to 2025-07-13, the window used as context when making forecasts. The zero-shot pipeline in Part 1 also consumed 45 days of context.
Test (1 week): 2025-07-14 to 2025-07-20, the forecast horizon for testing the fine-tuned model.

Note that the fine-tuning process only sees data in the train and validation sets, so there is no leakage into the evaluation.
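As a quick sanity check on the split, the sketch below recomputes the length of each window from the dates listed above and verifies that the windows are contiguous. The dictionary and helper names are my own, not part of the notebook. It also shows that the train window spans 83 calendar days, i.e., just under 12 weeks.

```python
from datetime import date

# Window boundaries copied from the split above (both ends inclusive).
WINDOWS = {
    "train": (date(2025, 3, 1), date(2025, 5, 22)),
    "validation": (date(2025, 5, 23), date(2025, 5, 29)),
    "inference_context": (date(2025, 5, 30), date(2025, 7, 13)),
    "test": (date(2025, 7, 14), date(2025, 7, 20)),
}


def window_days(start, end):
    """Number of calendar days in an inclusive [start, end] window."""
    return (end - start).days + 1


def is_contiguous(windows):
    """True if each window starts the day after the previous one ends."""
    spans = list(windows.values())
    return all(
        (next_start - prev_end).days == 1
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    )


lengths = {name: window_days(*span) for name, span in WINDOWS.items()}
hours = {name: days * 24 for name, days in lengths.items()}

print(lengths)
print(hours)
print("contiguous:", is_contiguous(WINDOWS))
```

The 45-day inference context corresponds to 1080 hourly steps, which is the context length used in the fine-tuning calls later on.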

Figure 1. Train/val/context/test split. (Image by author)

2. A brief on fine-tuning and LoRA

Before our walk-through, let’s first briefly discuss the concept of fine-tuning and one of its specific techniques, i.e., LoRA.

2.1 What is fine-tuning?

Fine-tuning means we continue training a pretrained model on our own data. Effectively, we’re adapting the weights of the pretrained model so that it understands and follows the patterns specific to our problem.

Chronos-2 specifically is a 120M-parameter Transformer that has already learned a lot of generic time-series structure. Fine-tuning allows us to further nudge its behavior in the direction of our data.

But should we update all 120M parameters?

Probably not.

This would be expensive in both compute and storage. Also, in practice, we might not have enough data to support adjusting all 120M parameters.

We need a more efficient way to do the fine-tuning. One such solution is LoRA.

2.2 What is LoRA?

LoRA stands for Low-Rank Adaptation [1]. Its core idea is simple: instead of updating the full weight matrices, we freeze the original pretrained model and only learn a small set of additional parameters that slightly modify its behavior.

To give an example, suppose one layer in the pretrained model contains a weight matrix W with shape d_out x d_in, where d_out = d_in = 1024.

A full update of the weight matrix would take the form:

$$W' = W + \Delta W$$

Here ΔW must also have size 1024 x 1024, so a full update would mean more than a million trainable parameters (1024 x 1024 = 1,048,576) for this one layer.

The trick that LoRA adopts is that ΔW is not learned as a full matrix. Instead, LoRA represents it as the product of two much smaller matrices:

$$\Delta W = B A$$

where A has shape r x d_in, B has shape d_out x r, and r is the rank of the adapter. It is called a low-rank method because r is usually quite small, such as 4, 8, 16, or 32.

What this implies is that LoRA does not allow fine-tuning to make an arbitrary full-dimensional change to W. The updates are restricted to a lower-dimensional subspace, and that restriction is exactly where the efficiency comes from.

This works in practice because many downstream adaptations do not really require changing the model in every possible direction. Often, the useful change lives in a much smaller subspace. LoRA directly exploits this assumption.

In practice, this gives us several advantages. Since we have far fewer trainable parameters, the GPU memory consumed by gradients and optimizer states is much lower. We also get smaller checkpoints, because we don’t need to save a full copy of the 120M-parameter model for every experiment; we only save the adapter. And it reduces overfitting risk, especially when the downstream dataset is not large.
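To put numbers on the savings, here is a small sketch that counts trainable parameters for the 1024 x 1024 example layer, once for a full update and once for a LoRA update with factors of shape d_out x r and r x d_in. The function names are illustrative only.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out, d_in, r):
    """Trainable parameters for B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in


d_out = d_in = 1024
full = full_update_params(d_out, d_in)

for r in (4, 8, 16, 32):
    lora = lora_update_params(d_out, d_in, r)
    print(f"r={r:>2}: {lora:>6} LoRA params vs {full} full ({lora / full:.2%})")
```

For r = 8, the adapter for this layer has 16,384 trainable parameters, which is under 2% of the full update.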

3. How to do LoRA for Chronos-2?

To do LoRA for the Chronos-2 model, the first thing we need to decide is which layers of Chronos-2 we want to adapt.

To answer this question, we should first take a look at how the model is built.

In Part 1, we explained that Chronos-2 is a Transformer encoder organized around three building blocks:

An input patch embedding.
A stack of attention layers, alternating between time attention and group attention.
An output patch embedding.

Our LoRA configuration adapts two of these three blocks:

The Q, K, V, and O projections in every attention layer. This is where we can fine-tune how the model attends both temporally within each series and across series within a group.

In Chronos-2, each attention layer involves four linear projections that map the layer’s input to its output. The query (Q), key (K), and value (V) projections produce three different views of the input. The attention mechanism then computes a similarity score between every query and every key, and uses these scores to compute a weighted aggregation of the values. The result then passes through the output projection (O), which mixes information across attention heads and maps it back to the layer’s standard output dimension.

The output patch embedding. This allows us to fine-tune the way the model projects its internal states into final forecasts.
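To make the role of the four projections concrete, here is a toy single-head attention written in plain Python. It is a generic scaled dot-product sketch, not Chronos-2’s actual implementation: it has no multiple heads, masking, or positional handling, and all names and values are made up for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(tokens, w_q, w_k, w_v, w_o):
    """Toy single-head attention over a list of token vectors."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))

    outputs = []
    for q in queries:
        # Similarity of this query with every key, turned into weights.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted aggregation of the values.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        # Output projection back to the layer's output dimension.
        outputs.append(matvec(w_o, mixed))
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = single_head_attention(tokens, identity, identity, identity, identity)
print(out)
```

LoRA leaves the pretrained projection matrices (here w_q, w_k, w_v, w_o) frozen and learns a small low-rank correction for each of them.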

In code, the LoRA configuration is:

LORA_CONFIG = {
    "r": 8,
    "lora_alpha": 16,
    "target_modules": [
        "self_attention.q",
        "self_attention.v",
        "self_attention.k",
        "self_attention.o",
        "output_patch_embedding.output_layer",
    ],
}

Here, lora_alpha is a scaling factor that controls how strongly the LoRA update is applied: a larger α means a more aggressive adaptation.
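The sketch below shows how the scaling enters a LoRA layer, assuming the common convention from the LoRA paper in which the low-rank update is multiplied by lora_alpha / r. With r = 8 and lora_alpha = 16, that factor is 2. The tiny matrices and the "trained" B are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, r, lora_alpha):
    """Frozen W x plus the scaled low-rank update (lora_alpha / r) * B (A x)."""
    scaling = lora_alpha / r
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [h + scaling * u for h, u in zip(base, update)]


# Tiny example: d_in = d_out = 2, rank r = 1.
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight
a = [[1.0, 1.0]]               # A: r x d_in
b_zero = [[0.0], [0.0]]        # B: d_out x r, all zeros -> adapter is a no-op
b_trained = [[0.5], [-0.5]]    # hypothetical B after some training
x = [2.0, 3.0]

before = lora_forward(w, a, b_zero, x, r=1, lora_alpha=2)
after = lora_forward(w, a, b_trained, x, r=1, lora_alpha=2)
print(before)
print(after)
print("scaling for r=8, lora_alpha=16:", 16 / 8)
```

When B is all zeros, the adapted layer reproduces the frozen layer exactly, which is why a LoRA adapter whose B starts at zero begins from the pretrained behavior.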

In our current study, we use the Hugging Face peft library to fine-tune Chronos-2.

Now we’re ready to get hands-on.

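4. Five fine-tuning scenarios

For all of the following experiments, we start from the same base model, i.e., the amazon/chronos-2 checkpoint, with the same LoRA configuration. What changes is the data we expose to fine-tuning.

The main metric we’ll use is the weighted absolute percentage error (WAPE), which in its standard form is:

$$\mathrm{WAPE} = \frac{\sum_t \lvert y_t - \hat{y}_t \rvert}{\sum_t \lvert y_t \rvert}$$

where y_t is the observed load and ŷ_t is the forecast at hour t.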
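A minimal sketch of the metric, assuming the standard WAPE definition (sum of absolute errors divided by the sum of absolute actual values). The function name and the example numbers are my own.

```python
def wape(actual, forecast):
    """Weighted absolute percentage error: sum|y - y_hat| / sum|y|."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    denominator = sum(abs(y) for y in actual)
    if denominator == 0:
        raise ValueError("WAPE is undefined when all actual values are zero")
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / denominator


# Made-up hourly loads in kW, for illustration only.
actual = [100.0, 120.0, 80.0, 100.0]
forecast = [90.0, 130.0, 80.0, 110.0]
print(f"WAPE = {wape(actual, forecast):.1%}")
```

Because errors are normalized by the total load rather than hour by hour, hours with very low demand do not blow up the metric the way they can with a plain mean absolute percentage error.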

With that setup, let’s walk through the five scenarios one by one.

If you haven’t yet set up the proper Chronos environment, please refer to Part 1: 4.1 Setting up the Chronos-2 model.

4.1 Single-building adaptation

Can we fine-tune on one asset?

Suppose we only care about one building, say Building 03. We do have its historical load data, and we want to adapt Chronos-2 to this particular building’s patterns.

This is the simplest fine-tuning setup: no covariates, no portfolio information, just one target series.

As mentioned earlier, we start from the amazon/chronos-2 checkpoint, leave the base model frozen, and only learn a small LoRA adapter on top of it.

Chronos-2’s fine-tuning API expects training data as a list of task dictionaries. For our current target-only univariate task, each dictionary only needs one key: target.

For Building 03, we can prepare the fine-tuning input like this:

story_building = "Building 03"

# Train window only: everything before the validation week starts.
train_df = full_df[full_df["timestamp"] < "2025-05-23"]

single_building_train = train_df[
    train_df["building"].eq(story_building)
].sort_values("timestamp")

# One task dictionary per series; the target array has shape (1, T).
train_inputs = [
    {
        "target": single_building_train[["total_load_kw"]]
        .to_numpy(dtype="float32")
        .T
    }
]

The reason we need the transpose above is that Chronos-2 expects the target array to have shape:

(num_target_series, time_steps)

Since we only have a single univariate target, we have:

(1, T)
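The layout change can be illustrated without pandas or NumPy. In the sketch below, nested lists stand in for arrays: selecting one dataframe column gives a (T, 1) layout, and transposing it gives the (1, T) layout described above. The load values are made up.

```python
def shape(rows):
    """Shape of a rectangular nested list as (num_rows, num_cols)."""
    return (len(rows), len(rows[0]))


hourly_load = [41.2, 39.8, 38.5, 40.1, 45.7]              # made-up kW values
column_layout = [[value] for value in hourly_load]        # (T, 1): one row per hour
row_layout = [list(col) for col in zip(*column_layout)]   # transpose -> (1, T)

print(shape(column_layout))
print(shape(row_layout))
```

With several target series of equal length, the same layout would simply have one row per series.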

In addition to training data, we should prepare validation data in the same format:

validation_df = full_df[full_df["timestamp"] < "2025-05-30"]
single_building_validation = validation_df[
    validation_df["building"].eq(story_building)
].sort_values("timestamp")

validation_inputs = [
    {
        "target": single_building_validation[["total_load_kw"]]
        .to_numpy(dtype="float32")
        .T
    }
]

There are two things worth mentioning here:

First, a reminder: the validation data is not used to update the LoRA adapter; it is used to decide which adapter checkpoint to keep. It’s the same pattern you’d normally use when training a neural network.

Second, you may notice that validation_df is not only May 23-29, but also contains everything before that. We need that because Chronos-2 needs context to make forecasts. Based on the configured prediction_length, Chronos-2 internally treats the last prediction_length hours of validation_df as the true validation forecast target. The preceding values are the context.

In the current case, we only configured one validation task in validation_inputs. This means we effectively have only one validation forecast window, because internally Chronos-2 always uses the last prediction_length steps of each task as the target window and the preceding context_length steps as the context, no matter how many additional steps you feed in. In other words, simply feeding a longer validation dataframe does not automatically create more validation windows.

In practice, if you want more validation forecast windows, e.g., for rolling-window validation, you need to create multiple validation tasks, each ending at a different cutoff date. This way, Chronos-2 validates on the last 168 hours of each task.
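A rough sketch of how such rolling validation tasks could be built is shown below. The helper name and its arguments are hypothetical, and a nested list stands in for the (1, T) array that the real task dictionaries hold. Note that in this article’s split, any cutoff earlier than May 29 would place the validation target inside the train window, so those weeks would have to be held out of training to avoid leakage.

```python
def rolling_validation_tasks(series, prediction_length, num_windows, min_context):
    """Build one task per cutoff; consecutive cutoffs are one horizon apart."""
    tasks = []
    for k in range(num_windows):
        end = len(series) - k * prediction_length
        if end - prediction_length < min_context:
            break  # not enough history left to serve as context
        # Each task ends at its own cutoff; its last prediction_length
        # steps act as that task's validation target.
        tasks.append({"target": [series[:end]]})
    return tasks[::-1]  # oldest cutoff first


# 90 days of hourly data (2025-03-01 to 2025-05-29), with dummy values.
series = [float(t) for t in range(90 * 24)]
tasks = rolling_validation_tasks(
    series, prediction_length=168, num_windows=3, min_context=1080
)
print([len(task["target"][0]) for task in tasks])
```

Each task contributes exactly one validation window, so three tasks give three one-week validation forecasts.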

For training, though, we don’t need any special treatment: we can simply pass Chronos-2 a long historical series and let it sample many training windows internally.

Now we can fine-tune:

from transformers import EarlyStoppingCallback

# base_model is the amazon/chronos-2 pipeline loaded as in Part 1.
fine_tuned_model = base_model.fit(
    train_inputs,
    prediction_length=168,
    validation_inputs=validation_inputs,
    finetune_mode="lora",
    lora_config=LORA_CONFIG,
    context_length=1080,         # 45-day context window
    learning_rate=2e-5,
    num_steps=1000,
    batch_size=32,
    output_dir="finetuned_models/fine_tuning_modes/single_target",
    finetuned_ckpt_name="checkpoint",
    callbacks=[EarlyStoppingCallback(early_stopping_patience=6)],
    save_steps=25,
    eval_steps=25,
)

Here, we set prediction_length=168, so that the model is trained for the same task we care about at test time, i.e., one-week-ahead hourly forecasting. We also set context_length=1080 (45 * 24), which represents a 45-day context window, the same context length we used in Part 1. Finally, since we have passed validation_inputs, checkpoint selection is activated. Every 25 training steps, Chronos-2 evaluates the validation loss, and if the validation loss stops improving for six validation checks in a row (early_stopping_patience=6), early stopping kicks in and ends the fine-tuning.
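The patience logic can be sketched in a few lines. This is a simplified stand-in for what an early-stopping callback does, not the actual Hugging Face implementation (which has additional options such as an improvement threshold). The function name and the loss values are made up.

```python
def run_early_stopping(val_losses, eval_steps, patience):
    """Return (best_step, stop_step) for a sequence of validation losses."""
    best_loss = float("inf")
    best_step = None
    bad_evals = 0
    step = 0
    for i, loss in enumerate(val_losses, start=1):
        step = i * eval_steps
        if loss < best_loss:
            # New best checkpoint: remember it and reset the counter.
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, step


# Made-up losses where validation is best at the first check and then worsens.
val_losses = [0.50, 0.52, 0.53, 0.55, 0.56, 0.58, 0.60, 0.61]
best_step, stop_step = run_early_stopping(val_losses, eval_steps=25, patience=6)
print(best_step, stop_step)
```

With eval_steps=25 and a patience of 6, a run whose validation loss never improves after the first check stops at step 175 and keeps the checkpoint from step 25.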

Figure 2. Training loss keeps falling, but validation loss rises after the first checkpoint. (Image by author)

I ran the fine-tuning job on an NVIDIA RTX 2000 Ada Laptop GPU with 8 GB VRAM. This run finished in about 42s.

Once the adapter is trained, inference looks almost the same as zero-shot forecasting:

single_context = test_context_df[
    test_context_df["building"].eq(story_building)
][["building", "timestamp", "total_load_kw"]]

pred_single_finetuned = fine_tuned_model.predict_df(
    single_context,
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
)

For Building 03, the target-only zero-shot baseline has a WAPE of 8.3%. After fine-tuning on Building 03 only, WAPE drops to 7.6%. So fine-tuning does bring some improvement, although a modest one.

4.2 Portfolio fine-tuning

Can we pool history across the fleet for a shared adapter?

In practice, we often have multiple related assets in a portfolio.

In our case, that means eight buildings. They are not identical, but they follow similar daily and weekly demand patterns.

So the next natural question is: can we fine-tune one adapter on the whole building portfolio, instead of just one building at a time?

Here, we still forecast only total_load_kw, which means the setup is almost the same as before:

target_column = "total_load_kw"

# One training task per building, each with a (1, T) target array.
train_inputs = [
    {
        "target": building_df[[target_column]].to_numpy(dtype="float32").T,
    }
    for _, building_df in train_df.groupby("building", sort=True)
]

# Validation tasks follow the same per-building layout.
validation_inputs = [
    {
        "target": building_df[[target_column]].to_numpy(dtype="float32").T,
    }
    for _, building_df in validation_df.groupby("building", sort=True)
]

Effectively, each building becomes one training task. Then we fine-tune Chronos-2 with the same LoRA configuration as before:

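fine_tuned_model = base_model.fit(
    inputs=train_inputs,
    validation_inputs=validation_inputs,
    prediction_length=168,
    context_length=1080,
    finetune_mode="lora",        # train a LoRA adapter, as in 4.1
    lora_config=LORA_CONFIG,
    learning_rate=2e-5,
    num_steps=1000,              # same argument name as in the other fit calls
)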

It’s worth emphasizing that here we’re not training eight separate adapters. Instead, we’re asking Chronos-2 to learn one shared adaptation that works across the fleet. In practice, if there are recurring patterns across buildings, the adapter has more chances to learn them. However, if each building is completely independent, this strategy may not help much.

The fine-tuning results are shown below, where we compare the forecasting quality of the zero-shot and fine-tuned Chronos-2:

Building      Zero-shot WAPE    Fine-tuned WAPE
Building 01   8.0%              7.4%
Building 02   12.2%             11.3%
Building 03   8.3%              7.5%
Building 04   8.0%              7.6%
Building 05   7.2%              6.8%
Building 06   10.9%             9.9%
Building 07   7.7%              7.2%
Building 08   6.6%              6.3%

We see improvements across all the buildings, which is a good sign that every building is benefiting from the shared adapter.
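The table can be checked with a few lines of Python. The sketch below copies the rounded WAPE values from the table, confirms that every building improved, and computes the relative reduction per building. Since the inputs are rounded to one decimal, the percentages are approximate.

```python
# (zero-shot WAPE %, fine-tuned WAPE %) copied from the table above.
wape_by_building = {
    "Building 01": (8.0, 7.4),
    "Building 02": (12.2, 11.3),
    "Building 03": (8.3, 7.5),
    "Building 04": (8.0, 7.6),
    "Building 05": (7.2, 6.8),
    "Building 06": (10.9, 9.9),
    "Building 07": (7.7, 7.2),
    "Building 08": (6.6, 6.3),
}

relative_reduction = {
    name: (zero_shot - fine_tuned) / zero_shot
    for name, (zero_shot, fine_tuned) in wape_by_building.items()
}
all_improved = all(ft < zs for zs, ft in wape_by_building.values())

for name, reduction in relative_reduction.items():
    print(f"{name}: {reduction:.1%} relative reduction")
print("all buildings improved:", all_improved)
```

On these rounded numbers, the relative gains range from roughly 5% to 10% per building.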

4.3 Covariate-informed fine-tuning

Can we give Chronos-2 the known covariates during fine-tuning?

So far, Chronos-2 only sees the target series itself, i.e., historical total_load_kw.

But in our building-demand case, we do know, or can reasonably well forecast, the underlying driving factors, including outdoor temperature, occupancy pattern, solar irradiance, and a weekend indicator. These are the covariates that drive changes in total_load_kw.

Therefore, in this fine-tuning scenario, we want to know whether we can fine-tune Chronos-2 not only on the target history, but also on the relationship between the target and known-future covariates.

This is where the fine-tuning input needs to change. Instead of only passing the target, each training task should now also contain past_covariates and future_covariates:

known_future_columns = [
    "outdoor_temp_c",
    "occupancy",
    "solar_irradiance",
    "is_weekend",
]

single_building_train = train_df[
    train_df["building"].eq(story_building)
].sort_values("timestamp")

train_inputs = [
    {
        "target": single_building_train[["total_load_kw"]]
        .to_numpy(dtype="float32")
        .T,
        "past_covariates": {
            column: single_building_train[column].to_numpy(dtype="float32")
            for column in known_future_columns
        },
        "future_covariates": {
            column: None
            for column in known_future_columns
        },
    }
]

The past_covariates part contains the historical values of the covariate series. During fine-tuning, Chronos-2 can see how temperature, occupancy, solar irradiance, and weekends change the load.

The future_covariates part tells Chronos-2 that these covariates are also available in the forecast horizon. We set them to None here because Chronos-2 constructs the future windows internally from the same historical series. Later, at inference time, we’ll provide the actual future covariate values through future_df, just like we did in Part 1.
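To see why no separate future values are needed during fine-tuning, consider how one training window can be cut out of a historical series. The sketch below is only an illustration of the idea, not Chronos-2’s actual window sampler. The function name, the cutoff, and the data are all made up.

```python
def slice_training_window(target, covariate, cutoff, context_length, prediction_length):
    """Split one historical window into context and horizon parts."""
    start = cutoff - context_length
    end = cutoff + prediction_length
    if start < 0 or end > len(target):
        raise ValueError("window does not fit inside the series")
    return {
        "context_target": target[start:cutoff],
        "context_covariate": covariate[start:cutoff],
        "horizon_covariate": covariate[cutoff:end],  # plays the known-future role
        "horizon_target": target[cutoff:end],        # values the forecast is scored on
    }


target = [float(t) for t in range(20)]          # dummy load history
temperature = [10.0 + t for t in range(20)]     # dummy covariate history
window = slice_training_window(
    target, temperature, cutoff=12, context_length=8, prediction_length=4
)
print(window["horizon_covariate"])
```

Inside the training history, the "future" of any window is already part of the recorded covariate series, so real future values only have to be supplied at inference time.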

The fine-tuning call itself remains almost the same:

fine_tuned_model = base_model.fit(
    train_inputs,
    prediction_length=168,
    validation_inputs=validation_inputs,
    finetune_mode="lora",
    lora_config=LORA_CONFIG,
    context_length=1080,
    learning_rate=2e-5,
    num_steps=1000,
    batch_size=32,
    output_dir="finetuned_models/fine_tuning_modes/single_covariate",
    finetuned_ckpt_name="checkpoint",
    callbacks=[EarlyStoppingCallback(early_stopping_patience=6)],
    save_steps=25,
    eval_steps=25,
)

After fine-tuning is done, at inference time, we pass both the historical context and the known future covariates:

context_with_covariates = test_context_df[
    ["building", "timestamp", "total_load_kw"] + known_future_columns
]

future_covariates_df = test_truth_df[
    ["building", "timestamp"] + known_future_columns
]

pred_single_covariate = fine_tuned_model.predict_df(
    context_with_covariates,
    future_df=future_covariates_df,
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
)

For Building 03, covariate-informed zero-shot WAPE is 4.0%. After fine-tuning the covariate-informed adapter on Building 03, WAPE drops to 2.8%, a 30.7% relative reduction.

This is a much larger gain than target-only fine-tuning.
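The relative reduction is simply the drop in WAPE divided by the baseline WAPE. The sketch below applies it to the rounded Building 03 numbers quoted in this post. From rounded inputs, the covariate-informed case comes out at about 30%, so the 30.7% reported above presumably comes from unrounded WAPE values.

```python
def relative_reduction(baseline, fine_tuned):
    """Fractional WAPE reduction relative to the zero-shot baseline."""
    return (baseline - fine_tuned) / baseline


# Rounded Building 03 WAPE values (in %) quoted in the text.
target_only = relative_reduction(8.3, 7.6)
covariate_informed = relative_reduction(4.0, 2.8)

print(f"target-only:        {target_only:.1%}")
print(f"covariate-informed: {covariate_informed:.1%}")
```

On these numbers, the covariate-informed adapter delivers more than three times the relative gain of the target-only adapter.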

There is also an interesting practical lesson here: sometimes the biggest win is not “fine-tuning” by itself. It’s fine-tuning the model with the right information.

4.4 Portfolio + covariates

Can we leverage both covariate and fleet information for fine-tuning?

The previous two scenarios added the “portfolio” ingredient and the “covariate” ingredient separately. Naturally, we want to use both.

This is the setup I believe to be most relevant in many real use cases, because in practice, we rarely have just one asset, and more often than not, we do have known or forecastable external signals that can help target series forecasting. Using both for fine-tuning is not only logical, but probably also preferable.

Concretely, for our current case, we fine-tune on all eight buildings, and for each building, we provide total_load_kw as the target and outdoor_temp_c, occupancy, solar_irradiance, and is_weekend as known-future covariates:

train_inputs = []

# One task per building: target plus the known-future covariates.
for building, building_df in train_df.groupby("building", sort=True):
    building_df = building_df.sort_values("timestamp")

    train_inputs.append(
        {
            "target": building_df[["total_load_kw"]]
            .to_numpy(dtype="float32")
            .T,
            "past_covariates": {
                column: building_df[column].to_numpy(dtype="float32")
                for column in known_future_columns
            },
            "future_covariates": {
                column: None
                for column in known_future_columns
            },
        }
    )

In the code snippet above, we create one task per building. The same idea applies to the validation data as well. Each building is associated with one validation task, and Chronos-2 uses the last 168 hours of each task as the validation forecast window.

The fine-tuning call itself still remains the same:

fine_tuned_model = base_model.fit(
    train_inputs,
    prediction_length=168,
    validation_inputs=validation_inputs,
    finetune_mode="lora",
    lora_config=LORA_CONFIG,
    context_length=1080,
    learning_rate=2e-5,
    num_steps=1000,
    batch_size=32,
    output_dir="finetuned_models/fine_tuning_modes/portfolio_covariate",
    finetuned_ckpt_name="checkpoint",
    callbacks=[EarlyStoppingCallback(early_stopping_patience=6)],
    save_steps=25,
    eval_steps=25,
)

For inference, we pass the 45-day historical context, as well as the known future covariates for the forecast week:

context_with_covariates = test_context_df[
    ["building", "timestamp", "total_load_kw"] + known_future_columns
]

future_covariates_df = test_truth_df[
    ["building", "timestamp"] + known_future_columns
]

pred_portfolio_covariate = fine_tuned_model.predict_df(
    context_with_covariates,
    future_df=future_covariates_df,
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
)

The figure below shows the fine-tuning results for Building 03, where we can clearly see the improvement brought by fine-tuning:

Figure 3. Portfolio + covariate fine-tuning compared with the plain zero-shot forecast for Building 03. (Image by author)

Across all eight buildings, the plain zero-shot baseline has a WAPE of 8.4%. After portfolio + covariate fine-tuning, WAPE drops to 2.8%, a 66.8% relative reduction.

4.5 Held-out transfer

Can we adapt once, then deploy on assets the model never saw during fine-tuning?

So far, every fine-tuning scenario has used the same buildings that later appear at inference time.

But there is one more important question: what if a new building came online only very recently?

So in this final scenario, we hold out Building 06 during fine-tuning, so that Chronos-2 never sees its data while learning the LoRA adapter. We fine-tune on the other seven buildings, using both target histories and known-future covariates. Then, at inference time, we apply the adapter to Building 06.

The code change is small:

held_out_building = "Building 06"

# Every building except the held-out one is used for fine-tuning.
train_buildings = [
    building
    for building in sorted(train_df["building"].unique())
    if building != held_out_building
]

train_inputs = []

for building in train_buildings:
    building_df = train_df[
        train_df["building"].eq(building)
    ].sort_values("timestamp")

    train_inputs.append(
        {
            "target": building_df[["total_load_kw"]]
            .to_numpy(dtype="float32")
            .T,
            "past_covariates": {
                column: building_df[column].to_numpy(dtype="float32")
                for column in known_future_columns
            },
            "future_covariates": {
                column: None
                for column in known_future_columns
            },
        }
    )

Then, at inference time, we target Building 06 for forecasting:

building_06_context = test_context_df[
    test_context_df["building"].eq(held_out_building)
][["building", "timestamp", "total_load_kw"] + known_future_columns]

building_06_future_covariates = test_truth_df[
    test_truth_df["building"].eq(held_out_building)
][["building", "timestamp"] + known_future_columns]

pred_heldout = fine_tuned_model.predict_df(
    building_06_context,
    future_df=building_06_future_covariates,
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
)

For Building 06, the covariate-informed zero-shot baseline has a WAPE of 4.2%. After applying the adapter fine-tuned on the other seven buildings, WAPE drops to 3.1%. That’s a 26.8% relative reduction.

For real deployment, this fifth scenario represents a more scalable pattern: we fine-tune an adapter on a representative portfolio, then deploy it to related assets as they come online. For each new asset, we still provide its recent context and known-future covariates, but we do not have to fine-tune again immediately. We wouldn’t have enough data for that anyway.

5. What did we learn?

After walking through the five scenarios one by one, let’s put their results side by side.

For each row, I compare the fine-tuned model against the matching zero-shot baseline. Concretely, that means target-only fine-tuning is compared with target-only zero-shot, and covariate-informed fine-tuning is compared with covariate-informed zero-shot:

Figure 4. Fine-tuning improves all five scenarios. Covariate-informed setups brought the largest gains. (Image by author)

The pattern is pretty clear. Target-only fine-tuning helps, but only modestly. The larger gains appear when we give Chronos-2 the known-future covariates and then fine-tune the adapter around them. The held-out transfer result is also encouraging: even for a building excluded from fine-tuning, the adapter can learn from related buildings and still improve over the covariate-informed zero-shot baseline.

You can find the full notebook here: https://github.com/ShuaiGuo16/chronos-2-forecasting/blob/main/02_chronos2_fine_tuning_building_demand.ipynb