Saturday, November 29, 2025

Training a Tokenizer for BERT Models

Admin by Admin
November 29, 2025
in Artificial Intelligence


BERT is an early transformer-based model for NLP tasks that’s small and fast enough to train on a home computer. Like all deep learning models, it requires a tokenizer to convert text into integer tokens. This article shows how to train a WordPiece tokenizer following BERT’s original design.

Let’s get started.

Training a Tokenizer for BERT Models
Photo by John Towner. Some rights reserved.

Overview

This article is divided into two parts; they are:

  • Selecting a Dataset
  • Training a Tokenizer

Selecting a Dataset

To keep things simple, we’ll use English text only. WikiText is a popular preprocessed dataset for experiments, available through the Hugging Face datasets library:

import random
from datasets import load_dataset

# path and name of the dataset
path, name = "wikitext", "wikitext-2-raw-v1"
dataset = load_dataset(path, name, split="train")
print(f"size: {len(dataset)}")

# Print a few random samples
for idx in random.sample(range(len(dataset)), 5):
    text = dataset[idx]["text"].strip()
    print(f"{idx}: {text}")

On first run, the dataset downloads to ~/.cache/huggingface/datasets and is cached for future use. WikiText-2, used above, is a smaller dataset suitable for quick experiments, while WikiText-103 is larger and more representative of real-world text, which makes for a better model.

The output of this code may look like this:

size: 36718
23905: Dudgeon Creek
4242: In 1825 the Congress of Mexico established the Port of Galveston and in 1830 …
7181: Crew : 5
24596: On March 19 , 2007 , Sports Illustrated posted on its website an article in its …
12920: The latest building included in the list is in the Quantock Hills . The …

The dataset contains strings of varying lengths with spaces around punctuation marks. While you could split on whitespace, this wouldn’t capture sub-word components. That’s what the WordPiece tokenization algorithm is good at.
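To make the difference concrete, here is a minimal pure-Python sketch of how a WordPiece vocabulary is applied to a word: a greedy longest-match-first search, with “##” marking pieces that continue a word. The vocabulary `toy_vocab` and the helper name `wordpiece_encode` are made up for illustration; a trained tokenizer would hold tens of thousands of pieces, and this is not the tokenizers library's implementation.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"Hell", "##o", "world", ",", "!", "token", "##izer", "##s"}

sentence = "Hello , world ! tokenizers"
whitespace_tokens = sentence.split()
subword_tokens = [p for w in whitespace_tokens for p in wordpiece_encode(w, toy_vocab)]

print(whitespace_tokens)  # ['Hello', ',', 'world', '!', 'tokenizers']
print(subword_tokens)     # ['Hell', '##o', ',', 'world', '!', 'token', '##izer', '##s']
```

Whitespace splitting alone would need “Hello” and “tokenizers” as whole vocabulary entries, while the sub-word split covers them with smaller, reusable pieces.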

Training a Tokenizer

Several tokenization algorithms support sub-word components. BERT uses WordPiece, while modern LLMs often use Byte-Pair Encoding (BPE). We’ll train a WordPiece tokenizer following BERT’s original design.

The tokenizers library implements several tokenization algorithms that can be configured to your needs. It saves you the effort of implementing the tokenization algorithm from scratch. You can install it with the pip command:

pip install tokenizers

Let’s train a tokenizer:

import tokenizers
from datasets import load_dataset

path, name = "wikitext", "wikitext-103-raw-v1"
vocab_size = 30522
dataset = load_dataset(path, name, split="train")
print(f"size: {len(dataset)}")

# Collect texts, skip empty lines and title lines starting with "="
texts = []
for line in dataset["text"]:
    line = line.strip()
    if line and not line.startswith("="):
        texts.append(line)

# Configure WordPiece tokenizer with NFKC normalization and special tokens
tokenizer = tokenizers.Tokenizer(tokenizers.models.WordPiece())
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
tokenizer.decoder = tokenizers.decoders.WordPiece()
tokenizer.normalizer = tokenizers.normalizers.NFKC()
trainer = tokenizers.trainers.WordPieceTrainer(
    vocab_size=vocab_size,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
)

# Train the tokenizer and save it
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
tokenizer_path = f"{name}_wordpiece.json"
tokenizer.save(tokenizer_path, pretty=True)

# Test the tokenizer by reloading it from the saved file
tokenizer = tokenizers.Tokenizer.from_file(tokenizer_path)
print(tokenizer.encode("Hello, world!").tokens)
print(tokenizer.decode(tokenizer.encode("Hello, world!").ids))

Running this code may print the following output:

wikitext-103-raw-v1/train-00000-of-00002(…): 100%|█████| 157M/157M [00:46<00:00, 3.40MB/s]
wikitext-103-raw-v1/train-00001-of-00002(…): 100%|█████| 157M/157M [00:04<00:00, 37.0MB/s]
Generating test split: 100%|███████████████| 4358/4358 [00:00<00:00, 174470.75 examples/s]
Generating train split: 100%|████████| 1801350/1801350 [00:09<00:00, 199210.10 examples/s]
Generating validation split: 100%|█████████| 3760/3760 [00:00<00:00, 201086.14 examples/s]
size: 1801350
[00:00:04] Pre-processing sequences ████████████████████████████ 0 / 0
[00:00:00] Tokenize words ████████████████████████████ 606445 / 606445
[00:00:00] Count pairs ████████████████████████████ 606445 / 606445
[00:00:04] Compute merges ████████████████████████████ 22020 / 22020
['Hell', '##o', ',', 'world', '!']
Hello, world!

This code uses the WikiText-103 dataset. The first run downloads the training data (two files of 157MB each in the log above) containing 1.8 million lines. The training takes a few seconds. The example shows how “Hello, world!” becomes 5 tokens, with “Hello” split into “Hell” and “##o” (the “##” prefix indicates a sub-word component).

The tokenizer created in the code above has the following properties:

  • Vocabulary size: 30,522 tokens (matching the original BERT model)
  • Special tokens: [PAD], [CLS], [SEP], [MASK], and [UNK] are added to the vocabulary even though they are not in the dataset.
  • Pre-tokenizer: Whitespace splitting (since the dataset has spaces around punctuation)
  • Normalizer: NFKC normalization for Unicode text. Note that you can also configure the tokenizer to convert everything into lowercase, as the common BERT-uncased model does.
  • Algorithm: WordPiece is used. Hence the decoder should be set accordingly so that the “##” prefix for sub-word components is recognized.
  • Padding: Enabled with the [PAD] token for batch processing. This is not demonstrated in the code above, but it will be useful when you are training a BERT model.
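Two of these properties, normalization and padding, can be illustrated without the library. The sketch below uses Python's standard unicodedata module to show what NFKC (optionally followed by lowercasing) does to a string, and a small helper that right-pads a batch of ID sequences. The function names `normalize` and `pad_batch` are hypothetical, and `pad_id=0` assumes [PAD] received ID 0 because it is listed first among the special tokens; check `tokenizer.token_to_id("[PAD]")` on your own tokenizer.

```python
import unicodedata


def normalize(text, lowercase=False):
    """Apply NFKC normalization, optionally followed by lowercasing."""
    text = unicodedata.normalize("NFKC", text)
    return text.lower() if lowercase else text


def pad_batch(batch, pad_id=0):
    """Right-pad every ID sequence to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return padded, mask


# "\ufb01" is the single-character "fi" ligature; "\uff22\uff25\uff32\uff34" is full-width "BERT"
raw = "\ufb01ne \uff22\uff25\uff32\uff34"
print(normalize(raw))                  # fine BERT
print(normalize(raw, lowercase=True))  # fine bert

padded, mask = pad_batch([[5, 6, 7], [8]])
print(padded)  # [[5, 6, 7], [8, 0, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0]]
```

NFKC folds visually equivalent characters into one canonical form, so the vocabulary is not wasted on variants of the same text, and padding lets sequences of different lengths be stacked into one rectangular batch.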

The tokenizer saves to a fairly large JSON file containing the full vocabulary, allowing you to reload the tokenizer later without retraining.

To convert a string into a list of tokens, use tokenizer.encode(text).tokens, in which each token is just a string. For use in a model, use tokenizer.encode(text).ids instead, which gives a list of integers. The decode method converts a list of integers back to a string. This is demonstrated in the code above.
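The relationship between tokens, IDs, and the decoded string can be sketched in plain Python. The mapping `toy_vocab` and its ID values are invented for illustration and do not come from the trained tokenizer. Unlike the library's WordPiece decoder, this sketch does not clean up the space before punctuation.

```python
# Hypothetical token-to-ID mapping, for illustration only
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "Hell": 2, "##o": 3, ",": 4, "world": 5, "!": 6}
id_to_token = {i: t for t, i in toy_vocab.items()}


def tokens_to_ids(tokens):
    """Look up each token's integer ID, falling back to [UNK]."""
    return [toy_vocab.get(t, toy_vocab["[UNK]"]) for t in tokens]


def decode(ids, prefix="##"):
    """Join tokens with spaces, gluing '##' pieces onto the previous token."""
    text = ""
    for i in ids:
        token = id_to_token[i]
        if token.startswith(prefix):
            text += token[len(prefix):]  # continuation: no space
        elif text:
            text += " " + token
        else:
            text = token
    return text


ids = tokens_to_ids(["Hell", "##o", ",", "world", "!"])
print(ids)          # [2, 3, 4, 5, 6]
print(decode(ids))  # Hello , world !
```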

Below are some resources that you may find useful:

This article demonstrated how to train a WordPiece tokenizer for BERT using the WikiText dataset. You learned to configure the tokenizer with appropriate normalization and special tokens, and how to encode text to tokens and decode back to strings. This is just a starting point for tokenizer training. Consider leveraging existing libraries and tools to optimize tokenizer training speed so it doesn’t become a bottleneck in your training process.

