This article is co-authored by Ugo Pradère and David Haüet
How hard can it be to transcribe an interview? You feed the audio to an AI model, wait a few minutes, and boom: perfect transcript, right? Well… not quite.
When it comes to accurately transcribing long audio interviews, even more so when the spoken language isn’t English, things get a lot more complicated. You need high-quality transcription with reliable speaker identification, precise timestamps, and all that at an affordable price. Not so simple after all.
In this article, we take you behind the scenes of our journey to build a scalable and production-ready transcription pipeline using Google’s Vertex AI and Gemini models. From unexpected model limitations to budget evaluation and timestamp drift disasters, we’ll walk you through the real challenges, and how we solved them.
Whether you are building your own audio processing tool or just curious about what happens “under the hood” of a robust transcription system using a multimodal model, you will find practical insights, clever workarounds, and lessons learned that should be worth your time.
Context of the project and constraints
At the beginning of 2025, we started an interview transcription project with a clear goal: to build a system capable of transcribing interviews in French, typically involving a journalist and a guest, but not restricted to this case, and lasting from a few minutes to over an hour. The final output was expected to be more than just a raw transcript: it had to reflect the natural spoken dialogue, written as a “book-like” dialogue, ensuring both a faithful transcription of the original audio content and good readability.
Before diving into development, we conducted a short market review of existing solutions, but the results were never satisfactory: the quality was often disappointing, the pricing definitely too high for intensive usage, and sometimes both at once. At that point, we realized a custom pipeline would be necessary.
Because our team operates within the Google ecosystem, we were required to use Google Vertex AI services. Google Vertex AI offers a variety of Speech-to-Text (S2T) models for audio transcription, including specialized ones such as “Chirp,” “Latestlong,” or “Phone call,” whose names already hint at their intended use cases. However, producing a complete transcription of an interview that combines high accuracy, speaker diarization, and precise timestamping, especially for long recordings, remains a real technical and operational challenge.
First attempts and limitations
We initiated our project by evaluating all these models on our use case. However, after extensive testing, we quickly came to the following conclusion: no Vertex AI service fully meets the complete set of requirements and would allow us to reach our goal in a simple and effective way. There was always at least one missing specification, usually on timestamping or diarization.
The terrible Google documentation, it must be said, cost us a significant amount of time during this preliminary research. This prompted us to ask Google for a meeting with a Google Cloud Machine Learning Specialist to try to find a solution to our problem. After a quick video call, our discussion with the Google rep quickly confirmed our conclusions: what we aimed to achieve was not as simple as it seemed at first. The entire set of requirements could not be fulfilled by a single Google service, and a custom implementation of a Vertex AI S2T service had to be developed.
We presented our preliminary work and decided to continue exploring two strategies:
- Use Chirp2 to generate the transcription and timestamping of long audio files, then use Gemini for diarization.
- Use Gemini 2.0 Flash for transcription and diarization, although the timestamping is approximate and the token output length requires looping.
In parallel with these investigations, we also had to consider the financial aspect. The tool would be used for hundreds of hours of transcription per month. Unlike text, which is generally cheap enough not to have to think about it, audio can be quite costly. We therefore included this parameter from the beginning of our exploration to avoid ending up with a solution that worked but was too expensive to be used in production.
Deep dive into transcription with Chirp2
We started with a deeper investigation of the Chirp2 mannequin since it’s thought of because the “finest at school” Google S2T service. A simple software of the documentation offered the anticipated outcome. The mannequin turned out to be fairly efficient, providing good transcription with word-by-word timestamping in line with the next output in json format:
"transcript":"Oui, en effet",
"confidence":0.7891818284988403
"phrases":[
{
"word":"Oui",
"start-offset":{
"seconds":3.68
},
"end-offset":{
"seconds":3.84
},
"confidence":0.5692862272262573
}
{
"word":"en",
"start-offset":{
"seconds":3.84
},
"end-offset":{
"seconds":4.0
},
"confidence":0.758037805557251
},
{
"word":"effet",
"start-offset":{
"seconds":4.0
},
"end-offset":{
"seconds":4.64
},
"confidence":0.8176857233047485
},
]
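This word-level output is straightforward to exploit downstream. As a quick illustration, here is a minimal Python sketch (the file name and the exact key names are taken from the excerpt above, so treat them as assumptions) that rebuilds a timestamped line from the word list:

```python
import json

def words_to_line(chirp_result: dict) -> str:
    """Rebuild a single timestamped line from Chirp2 word-level output."""
    words = chirp_result.get("words", [])
    if not words:
        return ""
    start = words[0]["start-offset"]["seconds"]
    end = words[-1]["end-offset"]["seconds"]
    text = " ".join(w["word"] for w in words)
    return f"[{start:.2f}s - {end:.2f}s] {text}"

with open("chirp2_result.json", encoding="utf-8") as f:
    result = json.load(f)

print(words_to_line(result))  # e.g. [3.68s - 4.64s] Oui en effet
```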
However, a new requirement came along, added by the operational team: the transcription must be as faithful as possible to the original audio content and include small filler words, interjections, onomatopoeia and even mumbling that can add meaning to a conversation, and often come from the non-speaking participant, either at the same time or toward the end of a sentence of the speaking one. We are talking about words like “oui oui,” “en effet” but also simple expressions like (hmm, ah, etc.), so typical of the French language! It is actually not uncommon to validate or, more rarely, oppose someone’s point with a simple “Hmm Hmm”. Upon analyzing the Chirp2 transcription, we noticed that while some of these small words were present, a lot of these expressions were missing. First drawback for Chirp2.
The main challenge in this approach lies in the reconstruction of the speakers’ sentences while performing diarization. We quickly abandoned the idea of giving Gemini the context of the interview and the transcription text, and asking it to determine who said what. This method could easily result in incorrect diarization. We instead explored sending the interview context, the audio file, and the transcription content in a compact format, instructing Gemini to only perform diarization and sentence reconstruction without re-transcribing the audio file. We requested a TSV format, a well-structured format for transcription: “human readable” for quick quality checking, easy to process algorithmically, and lightweight. Its structure is as follows:
First line with speaker presentation:
Diarization\tSpeaker_1:speaker_name\tSpeaker_2:speaker_name\tSpeaker_3:speaker_name\tSpeaker_4:speaker_name, etc.
Then the transcription in the following format:
speaker_id\ttime_start\ttime_stop\ttext with:
- speaker_id: Numeric speaker ID (e.g., 1, 2, etc.)
- time_start: Segment start time in the format 00:00:00
- time_stop: Segment end time in the format 00:00:00
- text: Transcribed text of the dialogue segment
An example output:
Diarization Speaker_1:Lea Finch Speaker_2:David Albec
1 00:00:00 00:03:00 Hi Andrew, how are you?
2 00:03:00 00:03:00 Sure thanks.
1 00:04:00 00:07:00 So, let’s start the interview
2 00:07:00 00:08:00 All right.
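To show how easily this TSV structure can be checked and processed, here is a minimal parsing sketch (the field layout follows the specification above; the helper name is ours, and the header is assumed to be tab-separated as specified):

```python
def parse_tsv_transcript(raw: str):
    """Parse the diarization header and the tab-separated transcript lines."""
    lines = [line for line in raw.splitlines() if line.strip()]
    header, *rows = lines
    # Header is assumed to look like: "Diarization\tSpeaker_1:Lea Finch\tSpeaker_2:David Albec"
    speakers = dict(field.split(":", 1) for field in header.split("\t")[1:])
    segments = []
    for row in rows:
        speaker_id, time_start, time_stop, text = row.split("\t", 3)
        segments.append((speaker_id, time_start, time_stop, text))
    return speakers, segments
```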
A simple version of the context provided to the LLM:
Here is the interview of David Albec, professional football player, by journalist Lea Finch
The result was fairly good, with what appeared to be accurate diarization and sentence reconstruction. However, instead of getting the exact same text, it appeared slightly modified in several places. Our conclusion was that, despite our clear instructions, Gemini probably carries out more than just diarization and actually performed partial re-transcription.
We also evaluated at this point the cost of transcription with this method. Below is the approximate calculation based only on audio processing:
Chirp2 cost/min: 0.016 USD
Gemini 2.0 Flash cost/min: 0.001875 USD
Cost/hour: (0.016 + 0.001875) × 60 = 1.0725 USD
Chirp2 is indeed quite “expensive”, about ten times more than Gemini 2.0 Flash at the time of writing, and still requires the audio to be processed by Gemini for diarization. We therefore decided to put this method aside for now and find a way to use the brand new multimodal Gemini 2.0 Flash alone, which had just left experimental mode.
Next: exploring audio transcription with Gemini 2.0 Flash
We provided Gemini with both the interview context and the audio file, requesting a structured output in a consistent format. By carefully crafting our prompt with standard LLM guidelines, we were able to specify our transcription requirements with a high degree of precision. In addition to the usual elements any prompt engineer would include, we emphasized several key instructions essential for ensuring a quality transcription (comments in italics; a minimal call sketch follows the list):
- Transcribe interjections and onomatopoeia even when mid-sentence.
- Preserve the full expression of words, including slang, insults, or inappropriate language. => the model tends to change words it considers inappropriate. For this specific point, we had to ask Google to deactivate the safety rules on our Google Cloud Project.
- Build complete sentences, paying particular attention to changes of speaker mid-sentence, for example when one speaker finishes another’s sentence or interrupts. => Such errors affect diarization and accumulate throughout the transcript until the context is strong enough for the LLM to correct itself.
- Normalize prolonged words or interjections like “euuuuuh” to “euh” and not “euh euh euh euh euh…” => this was a classic bug we kept encountering, called the “repetition bug”, and is discussed in more detail below.
- Identify speakers by voice tone while using context to determine who is the journalist and who is the interviewee. => in addition, we can pass the identity of the first speaker in the prompt.
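To make the setup concrete, here is a rough sketch of what such a prompted transcription call can look like with the Vertex AI Python SDK. The project ID, region, model name, file path, and the shortened prompt are placeholders, not our production values:

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

vertexai.init(project="my-gcp-project", location="europe-west1")
model = GenerativeModel("gemini-2.0-flash")

# Abbreviated prompt; the real one carries all the instructions listed above.
prompt = (
    "Here is the interview of David Albec, professional football player, "
    "by journalist Lea Finch. Transcribe it in French as TSV lines "
    "(speaker_id\ttime_start\ttime_stop\ttext), keeping interjections and "
    "onomatopoeia, and identifying speakers by voice tone and context."
)

with open("interview.mp3", "rb") as f:
    audio_part = Part.from_data(data=f.read(), mime_type="audio/mpeg")

response = model.generate_content(
    [prompt, audio_part],
    generation_config=GenerationConfig(temperature=0.0),
)
print(response.text)  # expected: the TSV transcript described above
```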
Initial results were actually quite satisfying in terms of transcription, diarization, and sentence construction. Transcribing short test files made us feel like the project was nearly complete… until we tried longer files.
Dealing with Long Audio and LLM Token Limitations
Our early tests on short audio clips were encouraging, but scaling the process to longer audio quickly revealed new challenges: what initially seemed like a simple extension of our pipeline turned out to be a technical hurdle in itself. Processing files longer than a few minutes indeed revealed a series of challenges related to model constraints, token limits, and output reliability:
- One of the first problems we encountered with long audio was the token limit: the number of output tokens exceeded the maximum allowed (MAX_INPUT_TOKEN = 8192), forcing us to implement a looping mechanism by repeatedly calling Gemini while resending the previously generated transcript, the initial prompt, a continuation prompt, and the same audio file (a sketch of this loop is shown after this list).
Here is an example of the continuation prompt we used:
Continue transcribing the audio interview from the previous result. Start processing the audio file from the previously generated text. Do not start from the beginning of the audio. Be careful to continue the previously generated content, which is available between the following tags.
- Using this transcription loop with large data inputs seems to significantly degrade the LLM output quality, especially for timestamping. In this configuration, timestamps can drift by over 10 minutes on an hour-long interview. While a few seconds of drift was considered compatible with our intended use, a few minutes made timestamping useless.
Our initial tests on short audios of a few minutes resulted in a maximum drift of 5 to 10 seconds, and significant drift was generally observed after the first loop, when the max input token limit was reached. We concluded from these experimental observations that while this looping technique ensures continuity in transcription fairly well, it not only leads to cumulative timestamp errors but also to a drastic loss of LLM timestamping accuracy.
- We also encountered a recurring and particularly frustrating bug: the model would sometimes fall into a loop, repeating the same word or sentence over dozens of lines. This behavior made entire portions of the transcript unusable and typically looked something like this:
1 00:00:00 00:03:00 Hi Andrew, how are you?
2 00:03:00 00:03:00 Sure thanks.
2 00:03:00 00:03:00 Sure thanks
2 00:03:00 00:03:00 Sure thanks
2 00:03:00 00:03:00 Sure thanks.
2 00:03:00 00:03:00 Sure thanks
2 00:03:00 00:03:00 Sure thanks.
etc.
This bug seems erratic but appears more frequently with medium-quality audio, with strong background noise or a distant speaker for example. And “in the field”, this is often the case. Likewise, speaker hesitations or word repetitions seem to trigger it. We still don’t know exactly what causes this “repetition bug”. The Google Vertex team is aware of it but hasn’t provided a clear explanation.
The consequences of this bug were especially limiting: once it occurred, the only viable solution was to restart the transcription from scratch. Unsurprisingly, the longer the audio file, the higher the probability of encountering the issue. In our tests, it affected roughly one out of every three runs on recordings longer than an hour, making it extremely difficult to deliver a reliable, production-quality service under such conditions.
- To make things worse, resuming transcription after a Max_token “cutoff” was reached required resending the entire audio file each time. Although we only needed the next segment, the LLM would still process the full file again (without outputting the earlier transcription), meaning we were billed for the full audio length on every resend.
In practice, we found that the token limit was typically reached between the 15th and 20th minute of the audio. As a result, transcribing a one-hour interview usually required 4 to 5 separate LLM calls, leading to a total billing equivalent of 4 to 5 hours of audio for a single file.
With this process, the cost of audio transcription does not scale linearly. While a 15-minute audio is billed as 15 minutes in a single LLM call, a 1-hour file could effectively cost 4 hours, and a 2-hour file could increase to 16 hours, following a near-quadratic pattern (≈ 4x², where x = number of hours).
This made long audio processing not just unreliable, but also expensive.
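For completeness, here is roughly what the continuation loop looked like in practice. It is a simplified sketch, not our production code: the prompts are abbreviated, and checking the finish reason is our heuristic for detecting that the output token window was exhausted.

```python
from vertexai.generative_models import GenerativeModel, Part

def transcribe_with_loop(model: GenerativeModel, audio: Part,
                         base_prompt: str, continuation_prompt: str,
                         max_rounds: int = 6) -> str:
    """Call Gemini repeatedly, resending the audio plus the previously
    generated transcript, until the model stops hitting the token ceiling."""
    transcript = ""
    for _ in range(max_rounds):
        parts = [base_prompt, audio] if not transcript else [
            base_prompt,
            continuation_prompt,
            f"<previous_transcript>{transcript}</previous_transcript>",
            audio,  # the full audio is resent (and billed) on every round
        ]
        response = model.generate_content(parts)
        transcript += response.text
        # Any finish reason other than MAX_TOKENS means the model ended on its own.
        if response.candidates[0].finish_reason.name != "MAX_TOKENS":
            break
    return transcript
```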
Pivoting to Chunked Audio Transcription
Given these major limitations, and being much more confident in the ability of the LLM to handle text-based tasks than audio ones, we decided to shift our approach and isolate the audio transcription step in order to maintain high transcription quality. A quality transcription is indeed the key part of the need, and it makes sense to ensure that this part of the process sits at the core of the strategy.
At this point, splitting the audio into chunks became the best solution. Not only did it seem likely to drastically improve timestamp accuracy, by avoiding the LLM timestamping degradation after looping and the cumulative drift, but it also reduced cost, since each chunk would ideally be run only once. While it introduced new uncertainties around merging partial transcriptions, the tradeoff seemed to be in our favor.
We thus focused on breaking long audio into shorter chunks that could be handled in a single LLM transcription request. During our tests, we observed that issues like repetition loops or timestamp drift typically began around the 18-minute mark in most interviews. It became clear that we should use 15-minute (or shorter) chunks for safety. Why not use 5-minute chunks? The quality improvement looked minimal to us while tripling the number of segments. In addition, shorter chunks reduce the overall context, which can hurt diarization.
Although this setup drastically reduced the repetition bug, we observed that it still occurred occasionally. Wanting to provide the best possible service, we definitely had to find an efficient countermeasure to this problem, and we identified an opportunity in our previously annoying max_input_token: with 10-minute chunks, we could be confident that the token limit would not be exceeded in nearly all cases. Thus, if the token limit was hit, we knew for sure the repetition bug had occurred and could restart that chunk’s transcription. This pragmatic strategy turned out to be very effective at detecting and avoiding the bug. Great news.
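A sketch of this detection logic, under the same assumptions as the previous snippets (hitting the output token ceiling on a 10-minute chunk is treated as a symptom of the repetition bug):

```python
def transcribe_chunk(model, audio_chunk, prompt: str, max_retries: int = 3) -> str:
    """Transcribe one ~10-minute chunk; retry when the repetition bug is suspected."""
    for _ in range(max_retries):
        response = model.generate_content([prompt, audio_chunk])
        # A healthy 10-minute chunk never fills the output window, so a
        # MAX_TOKENS finish almost certainly means the repetition bug occurred.
        if response.candidates[0].finish_reason.name != "MAX_TOKENS":
            return response.text
    raise RuntimeError("Repetition bug suspected on this chunk after retries")
```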
Correcting audio chunk transcriptions
With good transcripts of 10-minute audio chunks in hand, we implemented at this stage an algorithmic post-processing of each transcript to address minor issues (a short sketch follows this list):
- Removal of header tags like tsv or json added at the beginning and the end of the transcription content:
Despite optimizing the prompt, we couldn’t fully eliminate this side effect without hurting the transcription quality. Since this is easily handled algorithmically, we chose to do so.
- Replacing speaker IDs with names:
Speaker identification by name only starts once the LLM has enough context to determine who is the journalist and who is being interviewed. This results in incomplete diarization at the beginning of the transcript, with early segments using numeric IDs (first speaker in the chunk = 1, etc.). Moreover, since each chunk may have a different ID order (the first person to talk being speaker 1), this can create confusion during merging. We therefore instructed the LLM to only use IDs during the transcription step and to provide a diarization mapping in the first line. The speaker IDs are then replaced during the algorithmic correction and the diarization header line removed.
- Rarely, malformed or empty transcript lines are encountered. These lines are deleted, but we flag them with a note to the user: “formatting issue on this line”, so users are at least aware of a potential content loss and can correct it manually if needed. In our final optimized version, such lines were extremely rare.
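This correction step is purely algorithmic. A minimal sketch (the speaker map is assumed to associate numeric IDs such as "1" with the names parsed from the diarization header):

```python
def clean_chunk_transcript(raw: str, speaker_names: dict[str, str]) -> list[str]:
    """Strip stray fences, drop the diarization header, replace numeric speaker
    IDs with names, and flag malformed lines instead of silently keeping them."""
    cleaned = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("```") or line.startswith("Diarization"):
            continue  # fences, empty lines, and the header line are removed
        fields = line.split("\t", 3)
        if len(fields) != 4:
            cleaned.append("formatting issue on this line")
            continue
        speaker_id, start, stop, text = fields
        name = speaker_names.get(speaker_id, f"Speaker_{speaker_id}")
        cleaned.append("\t".join([name, start, stop, text]))
    return cleaned
```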
Merging chunks and maintaining content continuity
At the earlier audio chunking stage, we initially tried to make chunks with clean cuts. Unsurprisingly, this led to the loss of words or even full sentences at cut points. So we naturally switched to overlapping chunk cuts to avoid such content loss, leaving the optimization of the size of the overlap to the chunk merging process.
Without a clean cut between chunks, the possibility of merging the chunks algorithmically disappeared. For the same audio input, the transcript lines output can be quite different, with breaks at different points of the sentences, and even filler words or hesitations rendered differently. In such a situation, it is complex, not to say impossible, to write an effective algorithm for a clean merge.
This left us with the LLM option, of course. Quickly, a few tests showed that the LLM could better merge segments together when the overlaps included full sentences. A 30-second overlap proved sufficient. With a 10-minute audio chunk structure, this implies the following chunk cuts (a short sketch of the computation follows this list):
- 1st transcript: 0 to 10 minutes
- 2nd transcript: 9m30s to 19m30s
- 3rd transcript: 19m to 29m… and so on
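In code, these overlapping cut points can be computed in a few lines. A minimal sketch (chunk length and overlap are the values discussed above; the actual audio slicing, e.g. with ffmpeg or pydub, is left out):

```python
CHUNK_SEC = 10 * 60   # 10-minute chunks
OVERLAP_SEC = 30      # 30-second overlap between consecutive chunks

def chunk_boundaries(duration_sec: float) -> list[tuple[float, float]]:
    """Return overlapping (start, end) cut points in seconds:
    0-600, 570-1170, 1140-1740, ... until the end of the audio."""
    bounds, start = [], 0.0
    while start < duration_sec:
        end = min(start + CHUNK_SEC, duration_sec)
        bounds.append((start, end))
        if end >= duration_sec:
            break
        start = end - OVERLAP_SEC
    return bounds

print(chunk_boundaries(28 * 60 + 42))  # e.g. for the 28m42s interview shown later
```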

These overlapping chunk transcripts were corrected by the previously described algorithm and sent to the LLM for merging in order to reconstruct the full audio transcript. The idea was to send the full set of chunk transcripts with a prompt instructing the LLM to merge them and return the full merged audio transcript in the same TSV format as the previous LLM transcription step. In this configuration, the merging process has essentially three quality criteria:
- Ensure transcription continuity without content loss or duplication.
- Adjust timestamps to resume from where the previous chunk ended.
- Preserve diarization.
As expected, max_input_token was exceeded, forcing us into an LLM call loop. However, since we were now using text input, we were more confident in the reliability of the LLM… probably too much. The result of the merge was satisfactory in most cases but prone to several issues: tag insertions, multi-line entries merged into one line, incomplete lines, and even hallucinated continuations of the interview. Despite many prompt optimizations, we couldn’t achieve sufficiently reliable results for production use.
As with audio transcription, we identified the amount of input information as the main issue. We were sending several hundred, even thousands, of text lines containing the set of partial transcripts to fuse, a roughly similar amount with the previous merged transcript, and a few more with the prompt and its example. Definitely too much for a precise application of our set of instructions.
On the plus side, timestamp accuracy did indeed improve significantly with this chunking approach: we maintained a drift of just 5 to 10 seconds at most on transcriptions of over an hour. As the start of a transcript should have minimal drift in timestamping, we instructed the LLM to use the timestamps of the “ending chunk” as the reference for the fusion and to correct any drift by one second per sentence. This made the cut points seamless and preserved overall timestamp accuracy.
Splitting the chunk transcripts for full transcript reconstruction
In a modular approach similar to the workaround we used for transcription, we decided to carry out the merge of the transcripts separately, in order to avoid the previously described issues. To do so, each 10-minute transcript is split into three parts based on the start_time of the segments (a sketch of this split follows the note below):
- Overlap segment to merge at the start: 0 to 1 minute
- Main segment to paste: 1 to 9 minutes
- Overlap segment to merge at the end: 9 to 10 minutes
NB: Since every chunk, including the first and last ones, is processed the same way, the overlap at the start of the first chunk is directly merged with the main segment, and the overlap at the end of the last chunk (if there is one) is merged accordingly.
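A minimal sketch of this splitting step, assuming each corrected chunk transcript is a list of (speaker, time_start, time_stop, text) rows with timestamps relative to the chunk:

```python
def to_seconds(ts: str) -> float:
    """Convert an 'HH:MM:SS' timestamp into seconds."""
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def split_chunk_transcript(rows, chunk_len_sec: float = 600.0,
                           overlap_sec: float = 60.0):
    """Split a chunk transcript into (start overlap, main, end overlap) parts
    based on each segment's start time."""
    start_part, main_part, end_part = [], [], []
    for row in rows:
        t = to_seconds(row[1])  # row = (speaker, time_start, time_stop, text)
        if t < overlap_sec:
            start_part.append(row)
        elif t < chunk_len_sec - overlap_sec:
            main_part.append(row)
        else:
            end_part.append(row)
    return start_part, main_part, end_part
```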
The end and start segments are then sent in pairs to be merged. As expected, the quality of the output drastically increased, resulting in an efficient and reliable merge between the chunk transcripts. With this procedure, the response of the LLM proved to be highly reliable and showed none of the previously mentioned errors encountered during the looping process.
The transcript assembly process for an audio file of 28 minutes and 42 seconds:

Full transcript reconstruction
At this final stage, the only remaining task was to reconstruct the complete transcript from the processed splits. To achieve this, we algorithmically combined the main content segments with their corresponding merged overlaps, alternating between them.
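Algorithmically, this final assembly is a simple interleaving. A minimal sketch (inputs are assumed to come from the splitting and LLM merging steps above):

```python
def assemble_full_transcript(main_parts, merged_overlaps):
    """Interleave main segments with the LLM-merged overlaps:
    main_1, merge(end_1 + start_2), main_2, merge(end_2 + start_3), ..., main_n."""
    full = []
    for i, main in enumerate(main_parts):
        full.extend(main)
        if i < len(merged_overlaps):
            full.extend(merged_overlaps[i])
    return full
```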
Overall process overview
The overall process involves 6 steps, of which 2 are carried out by Gemini:
- Chunking the audio into overlapping audio chunks
- Transcribing each chunk into a partial text transcript (LLM step)
- Correcting the partial transcripts
- Splitting the audio chunk transcripts into start, main, and end text splits
- Fusing the end and start splits of each pair of chunk splits (LLM step)
- Reconstructing the full transcript

The overall process takes about 5 minutes per hour of transcription delivered to the user in an asynchronous tool. Quite reasonable considering the amount of work done behind the scenes, and this for a fraction of the price of other tools or pre-built Google models like Chirp2.
One additional improvement that we considered but ultimately decided not to implement was timestamp correction. We observed that timestamps at the end of each chunk typically ran about 5 seconds ahead of the actual audio. A simple solution could have been to incrementally adjust the timestamps algorithmically by roughly one second every two minutes to correct most of this drift. However, we chose not to implement this adjustment, as the minor discrepancy was acceptable for our business needs.
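For reference, such a correction would be a one-liner on timestamps expressed in seconds. A sketch of the adjustment we considered but did not ship (the drift rate is the rough estimate mentioned above):

```python
def drift_corrected(ts_seconds: float, drift_per_two_min: float = 1.0) -> float:
    """Pull a timestamp back by roughly one second for every two minutes elapsed,
    compensating the ~5 s of drift observed at the end of each 10-minute chunk."""
    return ts_seconds - (ts_seconds / 120.0) * drift_per_two_min
```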
Conclusion
Building a high-quality, scalable transcription pipeline for long interviews turned out to be much more complex than simply picking the “right” Speech-to-Text model. Our journey with Google’s Vertex AI and Gemini models highlighted key challenges around diarization, timestamping, cost-efficiency, and long audio handling, especially when aiming to capture the full information of an audio recording.
Using careful prompt engineering, smart audio chunking strategies, and iterative refinements, we were able to build a robust system that balances accuracy, performance, and operational cost, turning an initially fragmented process into a smooth, production-ready pipeline.
There is still room for improvement, but this workflow now forms a solid foundation for scalable, high-fidelity audio transcription. As LLMs continue to evolve and APIs become more flexible, we are optimistic about even more streamlined solutions in the near future.
Key takeaways
- No Vertex AI S2T model met all our needs: Google Vertex AI provides specialized models, but each has limitations in terms of transcription accuracy, diarization, or timestamping for long audios.
- Token limits and long prompts drastically impact transcription quality: Gemini’s output token limitation significantly degrades transcription quality for long audios, requiring heavily prompted looping strategies and finally forcing us to shift to shorter audio chunks.
- Chunked audio transcription and transcript reconstruction significantly improve quality and cost-efficiency: splitting audio into 10-minute overlapping segments minimized critical bugs like repeated sentences and timestamp drift, enabling higher-quality results and drastically reduced costs.
- Careful prompt engineering remains essential: precision in prompts, particularly regarding diarization and interjections for transcription, as well as transcript fusions, proved to be crucial for reliable LLM performance.
- Short transcript fusion merges maximize reliability: splitting each chunk transcript into smaller segments, with end-to-start merging of the overlaps, provided high accuracy and avoided common LLM issues like hallucinations or incorrect formatting.