
Anatomy of a Parquet File

Admin by Admin
March 17, 2025


In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages:

  • Faster query execution when only a subset of columns is being processed
  • Quick calculation of statistics across all data
  • Reduced storage volume thanks to efficient compression

When combined with storage frameworks like Delta Lake or Apache Iceberg, it seamlessly integrates with query engines (e.g., Trino) and data warehouse compute clusters (e.g., Snowflake, BigQuery). In this article, the content of a Parquet file is dissected using mainly standard Python tools to better understand its structure and how it contributes to such performance.

Writing Parquet file(s)

To produce Parquet files, we use PyArrow, a Python binding for Apache Arrow that stores dataframes in memory in columnar format. PyArrow allows fine-grained parameter tuning when writing the file. This makes PyArrow ideal for Parquet manipulation (one can also simply use Pandas).

# generator.py

import pyarrow as pa
import pyarrow.parquet as pq
from faker import Faker

fake = Faker()
Faker.seed(12345)
num_records = 100

# Generate fake data
names = [fake.name() for _ in range(num_records)]
# Faker returns multi-line addresses: flatten them onto a single line
addresses = [fake.address().replace("\n", ", ") for _ in range(num_records)]
birth_dates = [
    fake.date_of_birth(minimum_age=67, maximum_age=75) for _ in range(num_records)
]
cities = [addr.split(", ")[1] for addr in addresses]
birth_years = [date.year for date in birth_dates]

# Cast the data to the Arrow format
name_array = pa.array(names, type=pa.string())
address_array = pa.array(addresses, type=pa.string())
birth_date_array = pa.array(birth_dates, type=pa.date32())
city_array = pa.array(cities, type=pa.string())
birth_year_array = pa.array(birth_years, type=pa.int32())

# Create schema with non-nullable fields
schema = pa.schema(
    [
        pa.field("name", pa.string(), nullable=False),
        pa.field("address", pa.string(), nullable=False),
        pa.field("date_of_birth", pa.date32(), nullable=False),
        pa.field("city", pa.string(), nullable=False),
        pa.field("birth_year", pa.int32(), nullable=False),
    ]
)

table = pa.Table.from_arrays(
    [name_array, address_array, birth_date_array, city_array, birth_year_array],
    schema=schema,
)

print(table)

pyarrow.Table
name: string not null
address: string not null
date_of_birth: date32[day] not null
city: string not null
birth_year: int32 not null
----
name: [["Adam Bryan","Jacob Lee","Candice Martinez","Justin Thompson","Heather Rubio"]]
address: [["822 Jennifer Field Suite 507, Anthonyhaven, UT 98088","292 Garcia Mall, Lake Belindafurt, IN 69129","31738 Jonathan Mews Apt. 024, East Tammiestad, ND 45323","00716 Kristina Trail Suite 381, Howelltown, SC 64961","351 Christopher Expressway Suite 332, West Edward, CO 68607"]]
date_of_birth: [[1955-06-03,1950-06-24,1955-01-29,1957-02-18,1956-09-04]]
city: [["Anthonyhaven","Lake Belindafurt","East Tammiestad","Howelltown","West Edward"]]
birth_year: [[1955,1950,1955,1957,1956]]

The output clearly reflects column-oriented storage, unlike Pandas, which usually displays a traditional “row-wise” table.
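To make the difference between the two layouts concrete, here is a minimal pure-Python sketch. It reuses a few values from the output above; the `to_columns` helper is a hypothetical name introduced for illustration and has nothing to do with how Arrow stores data internally.

```python
# A few records from the output above, laid out row-wise (one dict per person)
rows = [
    {"name": "Adam Bryan", "city": "Anthonyhaven", "birth_year": 1955},
    {"name": "Jacob Lee", "city": "Lake Belindafurt", "birth_year": 1950},
    {"name": "Candice Martinez", "city": "East Tammiestad", "birth_year": 1955},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists (hypothetical helper)."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


columns = to_columns(rows)

# Row-wise: every record is visited even though a single field is needed
oldest_row_wise = min(record["birth_year"] for record in rows)

# Column-wise: one list already holds all the values of interest
oldest_column_wise = min(columns["birth_year"])

print(columns["birth_year"])
print(oldest_row_wise == oldest_column_wise)
```

With the columnar layout, a statistic over one column never touches the names or cities, which is the intuition behind the first two advantages listed in the introduction.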

How is a Parquet file stored?

Parquet files are generally stored in cheap object storage services like S3 (AWS) or GCS (GCP) so that they are easily accessible by data processing pipelines. These files are usually organized with a partitioning strategy that leverages directory structures:

# generator.py

num_records = 100

# ...

# Writing the parquet information to disk
pq.write_to_dataset(
    table,
    root_path='dataset',
    partition_cols=['birth_year', 'city']
)

If the birth_year and city columns are defined as partitioning keys, PyArrow creates the following tree structure in the dataset directory:

dataset/
├─ birth_year=1949/
├─ birth_year=1950/
│ ├─ city=Aaronbury/
│ │ ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
│ │ ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
│ │ ├─ …
│ ├─ city=Alicialand/
│ ├─ …
├─ birth_year=1951/
├─ ...

This strategy enables partition pruning: when a query filters on these columns, the engine can use folder names to read only the necessary files. This is why the partitioning strategy is crucial for limiting latency, I/O, and compute resources when handling large volumes of data (as has been the case for decades with traditional relational databases).

The pruning effect can be easily verified by counting the files opened by a Python script that filters on the birth year:

# query.py
import duckdb

duckdb.sql(
    """
    SELECT *
    FROM read_parquet('dataset/*/*/*.parquet', hive_partitioning = true)
    WHERE birth_year = 1949
    """
).show()

> strace -e trace=open,openat,read -f python query.py 2>&1 | grep "dataset/.*\.parquet"

[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3

Only 23 files are read out of 100.
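The pruning logic itself can be sketched in a few lines of standard Python. This is only an illustration of the idea, not how DuckDB implements it: `parse_partitions` and `prune` are hypothetical helpers, and the file listing below is made up (with shortened file names).

```python
from urllib.parse import unquote


def parse_partitions(path):
    """Extract Hive-style key=value pairs from the folder names of a path."""
    partitions = {}
    for part in path.split("/")[:-1]:  # the last element is the file name
        if "=" in part:
            key, value = part.split("=", 1)
            # Folder names are percent-encoded, e.g. "East%20Morgan"
            partitions[key] = unquote(value)
    return partitions


def prune(paths, **filters):
    """Keep only the paths whose partition values match every filter."""
    kept = []
    for path in paths:
        partitions = parse_partitions(path)
        if all(partitions.get(key) == str(value) for key, value in filters.items()):
            kept.append(path)
    return kept


# Hypothetical file listing (file names shortened for readability)
paths = [
    "dataset/birth_year=1949/city=East%20Morgan/part-0.parquet",
    "dataset/birth_year=1949/city=Clarkemouth/part-0.parquet",
    "dataset/birth_year=1950/city=Aaronbury/part-0.parquet",
]

selected = prune(paths, birth_year=1949)
print(selected)
```

The decision is taken from the paths alone: no file has to be opened to discard the 1950 partition.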

Reading a raw Parquet file

Let’s decode a raw Parquet file without specialized libraries. For simplicity, the dataset is dumped into a single file without compression or encoding.

# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    use_dictionary=False,
    compression="NONE",
    write_statistics=True,
    column_encoding=None,
)

The first thing to know is that the binary file is framed by 4 bytes whose ASCII representation is “PAR1”: they appear both at the very beginning and at the very end of the file. The file is corrupted if this is not the case.

# reader.py

with open("dataset.parquet", "rb") as file:
    parquet_data = file.read()

assert parquet_data[:4] == b"PAR1", "Not a valid parquet file"
assert parquet_data[-4:] == b"PAR1", "File footer is corrupted"

As indicated in the documentation, the file is divided into two parts: the “row groups” containing actual data, and the footer containing metadata (schema below).

The footer

The size of the footer is indicated in the 4 bytes preceding the end marker as an unsigned integer written in “little endian” format (noted “<I” in the unpack function).

# reader.py

import struct

# ...

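footer_length = struct.unpack("<I", parquet_data[-8:-4])[0]
footer_start = len(parquet_data) - footer_length - 8
footer_data = parquet_data[footer_start:-8]

print(f"Footer size in bytes: {footer_length}")

Footer size in bytes: 1088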
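The framing described so far (magic bytes, data, footer, footer length, magic bytes) can be wrapped in a small function. The sketch below is a hedged illustration: `split_parquet` is a hypothetical helper, and it is exercised on a synthetic buffer with placeholder payloads rather than on a real Parquet file.

```python
import struct

MAGIC = b"PAR1"


def split_parquet(buffer):
    """Return (body, footer) from the raw bytes of a Parquet-framed file."""
    if buffer[:4] != MAGIC or buffer[-4:] != MAGIC:
        raise ValueError("Missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the footer length ("<I")
    (footer_length,) = struct.unpack("<I", buffer[-8:-4])
    footer_start = len(buffer) - 8 - footer_length
    if footer_start < 4:
        raise ValueError("Footer length exceeds file size")
    return buffer[4:footer_start], buffer[footer_start:-8]


# Synthetic buffer with placeholder payloads (not a real Parquet file)
fake_body = b"row-group-bytes"
fake_footer = b"thrift-metadata"
buffer = MAGIC + fake_body + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

body, footer = split_parquet(buffer)
print(body, footer)
```

Because the footer length sits at a fixed position from the end, a reader can locate the metadata with a single seek, without scanning the data that precedes it.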

The footer information is encoded in a cross-language serialization format called Apache Thrift. Using a human-readable but verbose format like JSON and then translating it into binary would be less efficient in terms of memory usage. With Thrift, one can declare data structures as follows:

struct Customer {
	1: required string name,
	2: optional i16 birthYear,
	3: optional list<string> interests
}

On the basis of this declaration, Thrift can generate Python code to decode byte strings with such a data structure (it also generates code to perform the encoding part). The thrift file containing all the data structures implemented in a Parquet file (parquet.thrift) can be downloaded from the apache/parquet-format repository. After having installed the thrift binary, let’s run:

thrift -r --gen py parquet.thrift

The generated Python code is located in the “gen-py” folder. The footer’s data structure is represented by the FileMetaData class, a Python class automatically generated from the Thrift schema. Using Thrift’s Python utilities, binary data is parsed and populated into an instance of this FileMetaData class.

# reader.py

import sys

# ...

# Add the generated classes to the python path
sys.path.append("gen-py")
from parquet.ttypes import FileMetaData, PageHeader
from thrift.transport import TTransport
from thrift.protocol import TCompactProtocol

def read_thrift(data, thrift_instance):
    """
    Read a Thrift object from a binary buffer.
    Returns the Thrift object and the number of bytes read.
    """
    transport = TTransport.TMemoryBuffer(data)
    protocol = TCompactProtocol.TCompactProtocol(transport)
    thrift_instance.read(protocol)
    return thrift_instance, transport._buffer.tell()

# The number of bytes read is not used for now
file_metadata_thrift, _ = read_thrift(footer_data, FileMetaData())

print(f"Number of rows in the whole file: {file_metadata_thrift.num_rows}")
print(f"Number of row groups: {len(file_metadata_thrift.row_groups)}")

Number of rows in the whole file: 100
Number of row groups: 1
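To see why a binary format is more compact than text, it helps to look at how Thrift's compact protocol writes integer fields such as num_rows: as zigzag-encoded, variable-length integers (varints). The sketch below reimplements that integer encoding in plain Python for illustration only; the function names are hypothetical, and the real protocol also writes field headers and types, which are left out here.

```python
def zigzag_encode(value):
    """Map a signed 64-bit integer to an unsigned one (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return (value << 1) ^ (value >> 63)


def zigzag_decode(value):
    """Inverse of zigzag_encode."""
    return (value >> 1) ^ -(value & 1)


def write_varint(value):
    """Encode an unsigned integer as a base-128 varint, low-order group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(data, offset=0):
    """Decode a varint, returning (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


encoded = write_varint(zigzag_encode(100))
decoded, _ = read_varint(encoded)
print(encoded.hex(), zigzag_decode(decoded))
```

The value 100 fits in two bytes, whereas a fixed-width 64-bit field would always take eight; small numbers, which are common in metadata, stay small on disk.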

The footer contains extensive information about the file’s structure and content. For instance, it accurately tracks the number of rows in the generated dataframe. These rows are all contained within a single “row group”. But what is a “row group”?

Row groups

Unlike purely column-oriented formats, Parquet employs a hybrid approach. Before writing column blocks, the dataframe is first partitioned horizontally (that is, by rows) into row groups (the Parquet file we generated is too small to be split into multiple row groups).

This hybrid structure offers several advantages:

  • Parquet calculates statistics (such as min/max values) for each column within each row group. These statistics are crucial for query optimization, allowing query engines to skip entire row groups that don’t match filtering criteria. For example, if a query filters for birth_year > 1955 and a row group’s maximum birth year is 1954, the engine can efficiently skip that entire data section. This optimisation is called “predicate pushdown”. Parquet also stores other useful statistics like distinct value counts and null counts.

# reader.py
# ...

first_row_group = file_metadata_thrift.row_groups[0]
birth_year_column = first_row_group.columns[4]

min_stat_bytes = birth_year_column.meta_data.statistics.min
max_stat_bytes = birth_year_column.meta_data.statistics.max

# INT32 statistics are stored as 4-byte little-endian signed integers
min_year = struct.unpack("<i", min_stat_bytes)[0]
max_year = struct.unpack("<i", max_stat_bytes)[0]

print(f"The birth year range is between {min_year} and {max_year}")

The birth year range is between 1949 and 1958

  • Row groups enable parallel processing of data (particularly valuable for frameworks like Apache Spark). The size of these row groups can be configured based on the computing resources available (using the row_group_size parameter of the write_table function when using PyArrow).

# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    row_group_size=100,
)

# /!\ Keep the default value of "row_group_size" for the next parts

  • Even if this is not the primary objective of a column format, Parquet’s hybrid structure maintains reasonable performance when reconstructing full rows. Without row groups, rebuilding an entire row might require scanning the entirety of each column, which would be extremely inefficient for large files.
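The predicate pushdown described in the first point can be sketched as follows. The statistics are invented for the example (our file has a single row group), and `RowGroupStats` and `row_groups_to_scan` are hypothetical names; a real engine would read these values from the footer, as done above for birth_year.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max statistics of one column within one row group."""

    index: int
    min_value: int
    max_value: int


def row_groups_to_scan(stats, lower_bound):
    """Return the indexes of row groups that may contain values > lower_bound."""
    return [s.index for s in stats if s.max_value > lower_bound]


# Hypothetical statistics for the birth_year column of three row groups
stats = [
    RowGroupStats(index=0, min_value=1949, max_value=1954),
    RowGroupStats(index=1, min_value=1952, max_value=1957),
    RowGroupStats(index=2, min_value=1956, max_value=1958),
]

# birth_year > 1955: row group 0 can be skipped without reading its data
print(row_groups_to_scan(stats, lower_bound=1955))
```

Note that the statistics only rule row groups out: row group 1 still has to be read and filtered, since it contains values on both sides of the bound.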

Data Pages

The smallest substructure of a Parquet file is the page. It contains a sequence of values from the same column and, therefore, of the same type. The choice of page size is the result of a trade-off:

  • Larger pages mean less metadata to store and read, which is optimal for queries with minimal filtering.
  • Smaller pages reduce the amount of unnecessary data read, which is better when queries target small, scattered data ranges.

Now let’s decode the contents of the first page of the column dedicated to addresses. Its location can be found in the footer (given by the data_page_offset attribute of the right ColumnMetaData). The offset actually points to a Thrift binary representation of the page metadata that precedes the page itself: each page is preceded by a Thrift PageHeader object, whose class can also be found in the gen-py directory.
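This trade-off can be illustrated with a deliberately simplified cost model. Every number below is an assumption chosen for the example (4-byte values, a 20-byte page header, whole pages read at once), and the helper names are hypothetical.

```python
def pages_touched(value_indexes, values_per_page):
    """Set of page numbers that hold at least one requested value."""
    return {index // values_per_page for index in value_indexes}


def bytes_read(value_indexes, values_per_page, value_size=4, header_size=20):
    """Bytes fetched when whole pages (header + values) must be read."""
    page_size = header_size + values_per_page * value_size
    return len(pages_touched(value_indexes, values_per_page)) * page_size


def header_overhead(n_values, values_per_page, header_size=20):
    """Total bytes spent on page headers for a column of n_values values."""
    n_pages = -(-n_values // values_per_page)  # ceiling division
    return n_pages * header_size


# Hypothetical query needing 3 scattered values out of a 100,000-value column
wanted = [10, 5_000, 90_000]

for values_per_page in (100, 10_000):
    print(
        values_per_page,
        bytes_read(wanted, values_per_page),
        header_overhead(100_000, values_per_page),
    )
```

Under these assumptions, small pages read far fewer bytes for the scattered lookup but cost a hundred times more header bytes over the whole column, which is the tension described in the two points above.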

💡 Between the PageHeader and the actual values contained within the page, there may be a few bytes dedicated to implementing the Dremel format, which allows encoding nested data structures. Since our data has a regular tabular format and the values are not nullable, these bytes are skipped when writing the file (https://parquet.apache.org/docs/file-format/data-pages/).

# reader.py
# ...

address_column = first_row_group.columns[1]
column_start = address_column.meta_data.data_page_offset
column_end = column_start + address_column.meta_data.total_compressed_size
column_content = parquet_data[column_start:column_end]

page_thrift, page_header_size = read_thrift(column_content, PageHeader())
page_content = column_content[
    page_header_size : (page_header_size + page_thrift.compressed_page_size)
]
print(page_content[:100])

b'6\x00\x00\x00481 Mata Squares Suite 260, Lake Rachelville, KY 874642\x00\x00\x00671 Barker Crossing Suite 390, Mooreto'

The generated values finally appear, in plain text and not encoded (as specified when writing the Parquet file). However, to optimize the columnar format, it is recommended to use one of the following encoding algorithms: dictionary encoding, run length encoding (RLE), or delta encoding (its DELTA_BINARY_PACKED variant being reserved for int32 and int64 types), followed by compression using gzip or snappy (available codecs are listed in the Parquet documentation). Since encoded pages contain similar values (all addresses, all decimal numbers, etc.), compression ratios can be particularly advantageous.
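For intuition, the three encoding families mentioned above can be sketched in a simplified form. These toy functions are hypothetical helpers that only convey the idea behind each scheme; Parquet's actual encodings are bit-level formats (for instance, RLE is combined with bit-packing and delta encoding works on blocks of values).

```python
def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary = []
    positions = {}
    indexes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(positions[value])
    return dictionary, indexes


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def delta_encode(values):
    """Keep the first value, then only the differences between neighbours."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


cities = ["Howelltown", "Howelltown", "West Edward", "Howelltown"]
print(dictionary_encode(cities))
print(run_length_encode([1955, 1955, 1955, 1950]))
print(delta_encode([1949, 1950, 1952, 1952]))
```

In each case long or repetitive values are replaced by small integers, which is also what makes the subsequent compression step more effective.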

As documented in the specification, when character strings (BYTE_ARRAY) are not encoded, each value is preceded by its size represented as a 4-byte integer. This can be observed in the previous output: the first four bytes (the character “6”, whose ASCII code is 54, followed by three null bytes) are the little-endian representation of 54, which is the length of the first address.

To read the values (for example, the first 10), the loop is rather simple:

idx = 0
for _ in range(10):
    # Each value is prefixed by its length as a 4-byte little-endian integer
    str_size = struct.unpack("<I", page_content[idx : (idx + 4)])[0]
    print(page_content[(idx + 4) : (idx + 4 + str_size)].decode())
    idx += 4 + str_size

481 Mata Squares Suite 260, Lake Rachelville, KY 87464
671 Barker Crossing Suite 390, Mooretown, MI 21488
62459 Jordan Knoll Apt. 970, Emilyfort, DC 80068
948 Victor Square Apt. 753, Braybury, RI 67113
365 Edward Place Apt. 162, Calebborough, AL 13037
894 Reed Lock, New Davidmouth, NV 84612
24082 Allison Squares Suite 345, North Sharonberg, WY 97642
00266 Johnson Drives, South Lori, MI 98513
15255 Kelly Plains, Richardmouth, GA 33438
260 Thomas Glens, Port Gabriela, OH 96758
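The same length-prefixed layout can be checked with a small round trip that needs neither the Parquet file nor Thrift. The two functions below are hypothetical helpers written for this illustration; they cover only unencoded BYTE_ARRAY values, as in the page decoded above.

```python
import struct


def encode_plain_byte_arrays(strings):
    """Prefix each UTF-8 string with its length as a 4-byte little-endian int."""
    chunks = []
    for text in strings:
        raw = text.encode("utf-8")
        chunks.append(struct.pack("<I", len(raw)) + raw)
    return b"".join(chunks)


def decode_plain_byte_arrays(page, count):
    """Read `count` length-prefixed strings from the start of `page`."""
    values = []
    idx = 0
    for _ in range(count):
        (size,) = struct.unpack_from("<I", page, idx)
        idx += 4
        values.append(page[idx : idx + size].decode("utf-8"))
        idx += size
    return values


# The first two addresses printed above
addresses = [
    "481 Mata Squares Suite 260, Lake Rachelville, KY 87464",
    "671 Barker Crossing Suite 390, Mooretown, MI 21488",
]
page = encode_plain_byte_arrays(addresses)
print(page[:4])
print(decode_plain_byte_arrays(page, 2) == addresses)
```

The first four bytes of the re-encoded page match the beginning of the page content printed earlier, since the first address is 54 bytes long.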

And there we have it! We have successfully recreated, in a very simple way, how a specialized library would read a Parquet file. By understanding its building blocks, including headers, footers, row groups, and data pages, we can better appreciate how features like predicate pushdown and partition pruning deliver such impressive performance benefits in data-intensive environments. I am convinced that knowing how Parquet works under the hood helps in making better decisions about storage strategies, compression choices, and performance optimization.

All the code used in this article is available on my GitHub repository at https://github.com/kili-mandjaro/anatomy-parquet, where you can explore more examples and experiment with different Parquet file configurations.

Whether you are building data pipelines, optimizing query performance, or simply curious about data storage formats, I hope this deep dive into Parquet’s inner structures has provided valuable insights for your Data Engineering journey.

All images are by the author.
