As tools for Python type annotations (or hints) have evolved, more complex data structures can be typed, improving maintainability and static analysis. Arrays and DataFrames, as complex containers, have only recently supported complete type annotations in Python. NumPy 1.22 introduced generic specification of arrays and dtypes. Building on NumPy’s foundation, StaticFrame 2.0 introduced complete type specification of DataFrames, employing NumPy primitives and variadic generics. This article demonstrates practical approaches to fully type-hinting arrays and DataFrames, and shows how the same annotations can improve code quality with both static analysis and runtime validation.
StaticFrame is an open-source DataFrame library of which I am an author.
Type hints (see PEP 484) improve code quality in a number of ways. Instead of using variable names or comments to communicate types, Python-object-based type annotations provide maintainable and expressive tools for type specification. These type annotations can be tested with type checkers such as mypy or pyright, quickly discovering potential bugs without executing code.
The same annotations can be used for runtime validation. While reliance on duck-typing over runtime validation is common in Python, runtime validation is more often needed with complex data structures such as arrays and DataFrames. For example, an interface expecting a DataFrame argument, if given a Series, might not need explicit validation as usage of the wrong type will likely raise. However, an interface expecting a 2D array of floats, if given an array of Booleans, might benefit from validation as usage of the wrong type may not raise.
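To make the last point concrete, here is a minimal sketch of an interface intended for 2D float arrays that silently accepts Booleans. The function name scale_row_means is hypothetical and only serves this illustration.

```python
import numpy as np

def scale_row_means(a):
    """Intended for a 2D array of floats; nothing enforces that."""
    return a.mean(axis=1) * 0.5

floats = np.array([[1.0, 2.0], [3.0, 4.0]])
bools = np.array([[True, False], [True, True]])

expected = scale_row_means(floats)  # the intended usage
surprise = scale_row_means(bools)   # wrong input type, but no exception

# Booleans are silently treated as 0 and 1, so a float64 result comes back
print(surprise.dtype, surprise)
```

Because nothing raises, the mistake can travel far from its origin before anyone notices, which is the case where explicit validation pays off.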
Many important typing utilities are only available with the most-recent versions of Python. Fortunately, the typing-extensions package back-ports standard library utilities for older versions of Python. A related challenge is that type checkers can take time to implement full support for new features: many of the examples shown here require at least mypy 1.9.0.
Without type annotations, a Python function signature gives no indication of the expected types. For example, the function below might take and return any types:
def process0(v, q): ... # no type information
By adding type annotations, the signature informs readers of the expected types. With modern Python, user-defined and built-in classes can be used to specify types, with additional resources (such as Any, Iterator, cast(), and Annotated) found in the standard library typing module. For example, the interface below improves the one above by making expected types explicit:
def process0(v: int, q: bool) -> list[float]: ...
When used with a type checker like mypy, code that violates the specifications of the type annotations will raise an error during static analysis (shown as comments, below). For example, providing an integer when a Boolean is required is an error:
x = process0(v=5, q=20)
# tp.py: error: Argument "q" to "process0"
# has incompatible type "int"; expected "bool" [arg-type]
Static analysis can only validate statically defined types. The full range of runtime inputs and outputs is often more diverse, suggesting some form of runtime validation. The best of both worlds is possible by reusing type annotations for runtime validation. While there are libraries that do this (e.g., typeguard and beartype), StaticFrame offers CallGuard, a tool specialized for comprehensive array and DataFrame type-annotation validation.
A Python decorator is ideal for leveraging annotations for runtime validation. CallGuard offers two decorators: @CallGuard.check, which raises an informative Exception on error, or @CallGuard.warn, which issues a warning.
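To show why a decorator suits this job, the following is a minimal standard-library sketch of the general technique. It is not how CallGuard is implemented: the decorator name check_simple and the function halve are hypothetical, and only annotations that are plain classes are validated.

```python
import functools
import inspect
import typing

def check_simple(func):
    """Hypothetical decorator: validates arguments annotated with plain classes."""
    hints = typing.get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            hint = hints.get(name)
            # only plain classes are checked; generics such as list[float] are skipped
            is_plain = typing.get_origin(hint) is None and isinstance(hint, type)
            if is_plain and not isinstance(value, hint):
                raise TypeError(
                    f"{name}: expected {hint.__name__}, provided {type(value).__name__}"
                )
        return func(*args, **kwargs)
    return wrapper

@check_simple
def halve(v: int, q: bool) -> float:
    return v * (0.5 if q else 0.25)

result = halve(4, True)  # both arguments match their annotations
try:
    halve(4, 20)         # 20 is an int, not a bool
except TypeError as e:
    message = str(e)
    print(message)
```

The decorator reads the same annotations a static checker would, so the signature remains the single source of truth. A specialized tool is still needed for nested generics such as arrays and DataFrames, which this sketch deliberately skips.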
Further extending the process0 function above with @CallGuard.check, the same type annotations can be used to raise an Exception (shown again as comments) when runtime objects violate the requirements of the type annotations:
import static_frame as sf
@sf.CallGuard.check
def process0(v: int, q: bool) -> list[float]:
    return [x * (0.5 if q else 0.25) for x in range(v)]
z = process0(v=5, q=20)
# static_frame.core.type_clinic.ClinicError:
# In args of (v: int, q: bool) -> list[float]
# └── Expected bool, provided int invalid
While type annotations must be valid Python, they are irrelevant at runtime and can be wrong: it is possible to have correctly verified types that do not reflect runtime reality. As shown above, reusing type annotations for runtime checks ensures annotations are valid.
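As a small illustration of annotations being irrelevant at runtime, the undecorated function below (the same process0 signature, without any runtime checker) accepts an argument that contradicts its annotation:

```python
def process0(v: int, q: bool) -> list[float]:
    return [x * (0.5 if q else 0.25) for x in range(v)]

# annotations are stored on the function but never enforced by the interpreter
print(process0.__annotations__)

# q=20 violates the annotation, yet the call succeeds because 20 is truthy
z = process0(v=5, q=20)
print(z)
```

The interpreter simply evaluates the body; only a static checker or a runtime validator gives the annotation any force.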
Python classes that permit component type specification are “generic”. Component types are specified with positional “type variables”. A list of integers, for example, is annotated with list[int]; a dictionary of floats keyed by tuples of integers and strings is annotated dict[tuple[int, str], float].
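The positional nature of these type variables can be seen by taking the two annotations apart with the standard library helpers get_origin and get_args. The alias names TIntList and TLookup are chosen here only for illustration.

```python
from typing import get_args, get_origin

TIntList = list[int]
TLookup = dict[tuple[int, str], float]

# a generic alias is its origin class plus positional component types
print(get_origin(TIntList), get_args(TIntList))

# nested generics can be taken apart the same way
key_type, value_type = get_args(TLookup)
print(get_origin(key_type), get_args(key_type), value_type)
```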
With NumPy 1.20, ndarray and dtype become generic. The generic ndarray requires two arguments, a shape and a dtype. As the usage of the first argument is still under development, Any is commonly used. The second argument, dtype, is itself a generic that requires a type variable for a NumPy type such as np.int64. NumPy also offers more general generic types such as np.integer[Any].
For example, an array of Booleans is annotated np.ndarray[Any, np.dtype[np.bool_]]; an array of any type of integer is annotated np.ndarray[Any, np.dtype[np.integer[Any]]].
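The two-argument structure described above can be inspected at runtime. This sketch assumes a NumPy version in which np.ndarray and np.dtype can be subscripted at runtime (1.22 or later).

```python
from typing import Any, get_args, get_origin
import numpy as np

TNDArrayBool = np.ndarray[Any, np.dtype[np.bool_]]

# first argument: the shape (Any for now); second argument: a generic dtype
shape_arg, dtype_arg = get_args(TNDArrayBool)
print(get_origin(TNDArrayBool), shape_arg)
print(get_origin(dtype_arg), get_args(dtype_arg))
```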
As generic annotations with component type specifications can become verbose, it is practical to store them as type aliases (here prefixed with “T”). The following code defines such aliases and then uses them in a function.
from typing import Any
import numpy as np
TNDArrayInt8 = np.ndarray[Any, np.dtype[np.int8]]
TNDArrayBool = np.ndarray[Any, np.dtype[np.bool_]]
TNDArrayFloat64 = np.ndarray[Any, np.dtype[np.float64]]
def process1(
    v: TNDArrayInt8,
    q: TNDArrayBool,
) -> TNDArrayFloat64:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s
As before, when used with mypy, code that violates the type annotations will raise an error during static analysis. For example, providing an array of integers where an array of Booleans is required is an error:
v1: TNDArrayInt8 = np.arange(20, dtype=np.int8)
x = process1(v1, v1)
# tp.py: error: Argument 2 to "process1" has incompatible type
# "ndarray[Any, dtype[signedinteger[_8Bit]]]"; expected "ndarray[Any, dtype[bool_]]" [arg-type]
The interface requires 8-bit signed integers (np.int8); attempting to use a different sized integer is also an error:
TNDArrayInt64 = np.ndarray[Any, np.dtype[np.int64]]
v2: TNDArrayInt64 = np.arange(20, dtype=np.int64)
q: TNDArrayBool = np.arange(20) % 3 == 0
x = process1(v2, q)
# tp.py: error: Argument 1 to "process1" has incompatible type
# "ndarray[Any, dtype[signedinteger[_64Bit]]]"; expected "ndarray[Any, dtype[signedinteger[_8Bit]]]" [arg-type]
While some interfaces might benefit from such narrow numeric type specifications, broader specification is possible with NumPy’s generic types such as np.integer[Any], np.signedinteger[Any], np.floating[Any], etc. For example, we can define a new function that accepts any size signed integer. Static analysis now passes with both TNDArrayInt8 and TNDArrayInt64 arrays.
TNDArrayIntAny = np.ndarray[Any, np.dtype[np.signedinteger[Any]]]
def process2(
    v: TNDArrayIntAny, # a more flexible interface
    q: TNDArrayBool,
) -> TNDArrayFloat64:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s
x = process2(v1, q) # no mypy error
x = process2(v2, q) # no mypy error
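The broader annotation works because NumPy’s scalar types form a class hierarchy. As a runtime illustration of that hierarchy only (not of how mypy or CallGuard evaluate it), np.issubdtype shows which concrete types fall under np.signedinteger:

```python
import numpy as np

candidates = (np.int8, np.int64, np.uint8, np.float64)

# True where the concrete type is a kind of signed integer
accepted = {
    dt.__name__: bool(np.issubdtype(dt, np.signedinteger))
    for dt in candidates
}
print(accepted)

# unsigned integers are still integers, and floats are still numbers
print(np.issubdtype(np.uint8, np.integer), np.issubdtype(np.float64, np.number))
```

Choosing a node higher in the hierarchy, such as np.integer or np.number, widens the interface accordingly.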
Just as shown above with elements, generically specified NumPy arrays can be validated at runtime if decorated with CallGuard.check:
@sf.CallGuard.check
def process3(v: TNDArrayIntAny, q: TNDArrayBool) -> TNDArrayFloat64:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s
x = process3(v1, q) # no error, same as mypy
x = process3(v2, q) # no error, same as mypy
v3: TNDArrayFloat64 = np.arange(20, dtype=np.float64) * 0.5
x = process3(v3, q) # error
# static_frame.core.type_clinic.ClinicError:
# In args of (v: ndarray[Any, dtype[signedinteger[Any]]],
# q: ndarray[Any, dtype[bool_]]) -> ndarray[Any, dtype[float64]]
# └── ndarray[Any, dtype[signedinteger[Any]]]
#     └── dtype[signedinteger[Any]]
#         └── Expected signedinteger, provided float64 invalid
StaticFrame provides utilities to extend runtime validation beyond type checking. Using the typing module’s Annotated class (see PEP 593), we can extend the type specification with one or more StaticFrame Require objects. For example, to validate that an array has a 1D shape of (24,), we can replace TNDArrayIntAny with Annotated[TNDArrayIntAny, sf.Require.Shape(24)]. To validate that a float array has no NaNs, we can replace TNDArrayFloat64
with Annotated[TNDArrayFloat64, sf.Require.Apply(lambda a: ~np.isnan(a).any())].
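The mechanism that makes this possible is that Annotated carries arbitrary metadata objects alongside the type. The sketch below uses only the standard library to show that mechanism. RequireLen is a hypothetical stand-in written for this example, not a StaticFrame class, and it makes no claim about how Require objects are implemented.

```python
from typing import Annotated, get_args, get_type_hints

class RequireLen:
    """Hypothetical validator, loosely analogous in spirit to a Require object."""
    def __init__(self, size: int) -> None:
        self.size = size

    def validate(self, value) -> bool:
        return len(value) == self.size

def total(values: Annotated[list[int], RequireLen(3)]) -> int:
    return sum(values)

# include_extras=True keeps the Annotated metadata instead of stripping it
hint = get_type_hints(total, include_extras=True)["values"]
base_type, validator = get_args(hint)

ok = validator.validate([1, 2, 3])  # length matches the requirement
bad = validator.validate([1, 2])    # length does not match
print(base_type, ok, bad)
```

Static checkers see only the base type, while a runtime validator can retrieve the metadata and apply it, so one annotation serves both purposes.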
Implementing a new function, we can require that all input and output arrays have the shape (24,). Calling this function with the previously created arrays raises an error:
from typing import Annotated
@sf.CallGuard.check
def process4(
    v: Annotated[TNDArrayIntAny, sf.Require.Shape(24)],
    q: Annotated[TNDArrayBool, sf.Require.Shape(24)],
) -> Annotated[TNDArrayFloat64, sf.Require.Shape(24)]:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s
x = process4(v1, q) # types pass, but Require.Shape fails
# static_frame.core.type_clinic.ClinicError:
# In args of (v: Annotated[ndarray[Any, dtype[int8]], Shape((24,))], q: Annotated[ndarray[Any, dtype[bool_]], Shape((24,))]) -> Annotated[ndarray[Any, dtype[float64]], Shape((24,))]
# └── Annotated[ndarray[Any, dtype[int8]], Shape((24,))]
#     └── Shape((24,))
#         └── Expected shape ((24,)), provided shape (20,)
Just like a dictionary, a DataFrame is a complex data structure composed of many component types: the index labels, column labels, and the column values are all distinct types.
A challenge of generically specifying a DataFrame is that a DataFrame has a variable number of columns, where each column might be a different type. The Python TypeVarTuple variadic generic specifier (see PEP 646), first introduced in Python 3.11, permits defining a variable number of column type variables.
With StaticFrame 2.0, Frame, Series, Index and related containers become generic. Support for variable column type definitions is provided by TypeVarTuple, back-ported with the implementation in typing-extensions for compatibility down to Python 3.9.
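To illustrate what a variadic generic looks like, here is a minimal sketch of a container with two fixed type variables followed by any number of additional ones. The class Table and the type variable names are hypothetical and unrelated to StaticFrame’s actual definitions. The fallback import assumes typing-extensions is installed on Python versions before 3.11.

```python
try:  # Python 3.11+
    from typing import Generic, TypeVar, TypeVarTuple, Unpack, get_args
except ImportError:  # older Pythons, assuming typing-extensions is installed
    from typing import Generic, TypeVar, get_args
    from typing_extensions import TypeVarTuple, Unpack

TIndex = TypeVar("TIndex")
TColumns = TypeVar("TColumns")
TDtypes = TypeVarTuple("TDtypes")

class Table(Generic[TIndex, TColumns, Unpack[TDtypes]]):
    """Hypothetical container: two fixed type variables, then zero or more."""

# two required type variables followed by two column types
TTwoCols = Table[str, str, int, float]
# the variadic part may also be empty
TNoCols = Table[str, str]

print(get_args(TTwoCols), get_args(TNoCols))
```

Without a variadic type variable, a separate generic class would be needed for every possible column count.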
A generic Frame requires two or more type variables: the type of the index, the type of the columns, and zero or more specifications of columnar value types specified with NumPy types. A generic Series requires two type variables: the type of the index and a NumPy type for the values. The Index is itself generic, also requiring a NumPy type as a type variable.
With generic specification, a Series of floats, indexed by dates, can be annotated with sf.Series[sf.IndexDate, np.float64]. A Frame with dates as index labels, strings as column labels, and column values of integers and floats can be annotated with sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.float64].
Given a complex Frame, deriving the annotation might be difficult. StaticFrame offers the via_type_clinic interface to provide a complete generic specification for any component at runtime:
>>> v4 = sf.Frame.from_fields([range(5), np.arange(3, 8) * 0.5],
...     columns=('a', 'b'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))
>>> v4
<Frame>
<Index>         a       b         <<U1>
<IndexDate>
2021-12-30      0       1.5
2021-12-31      1       2.0
2022-01-01      2       2.5
2022-01-02      3       3.0
2022-01-03      4       3.5
<datetime64[D]> <int64> <float64>
# get a string representation of the annotation
>>> v4.via_type_clinic
Frame[IndexDate, Index[str_], int64, float64]
As shown with arrays, storing annotations as type aliases permits reuse and more concise function signatures. Below, a new function is defined with generic Frame and Series arguments fully annotated. A cast is required as not all operations can statically resolve their return type.
from typing import cast
TFrameDateInts = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.int64]
TSeriesYMBool = sf.Series[sf.IndexYearMonth, np.bool_]
TSeriesDFloat = sf.Series[sf.IndexDate, np.float64]
def process5(v: TFrameDateInts, q: TSeriesYMBool) -> TSeriesDFloat:
    t = v.index.iter_label().apply(lambda l: q[l.astype('datetime64[M]')]) # type: ignore
    s = np.where(t, 0.5, 0.25)
    return cast(TSeriesDFloat, (v.via_T * s).mean(axis=1))
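It is worth remembering that cast only informs the type checker; it performs no conversion or check at runtime. A minimal standard-library sketch (the function first_label is hypothetical) shows a cast that is simply wrong going unnoticed:

```python
from typing import cast

def first_label(labels: list[object]) -> str:
    # the checker is told to treat the result as str; nothing is converted
    return cast(str, labels[0])

value = first_label([42, "a"])
print(type(value).__name__, value)
```

This is one more reason to pair static annotations with runtime validation: a decorator that checks the return value would catch a cast that does not match reality.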
These more complex annotated interfaces can also be validated with mypy. Below, a Frame without the expected column value types is passed, causing mypy to error (shown as comments, below).
TFrameDateIntFloat = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.float64]
v5: TFrameDateIntFloat = sf.Frame.from_fields([range(5), np.arange(3, 8) * 0.5],
    columns=('a', 'b'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))
q: TSeriesYMBool = sf.Series([True, False],
    index=sf.IndexYearMonth.from_date_range('2021-12', '2022-01'))
x = process5(v5, q)
# tp.py: error: Argument 1 to "process5" has incompatible type
# "Frame[IndexDate, Index[str_], signedinteger[_64Bit], floating[_64Bit]]"; expected
# "Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]]" [arg-type]
To use the same type hints for runtime validation, the sf.CallGuard.check decorator can be applied to process5. Below, a Frame of three integer columns is provided where a Frame of two columns is expected.
# a Frame of three columns of integers
TFrameDateIntIntInt = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.int64, np.int64]
v6: TFrameDateIntIntInt = sf.Frame.from_fields([range(5), range(3, 8), range(1, 6)],
    columns=('a', 'b', 'c'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))
x = process5(v6, q) # with process5 decorated by @sf.CallGuard.check
# static_frame.core.type_clinic.ClinicError:
# In args of (v: Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]],
# q: Series[IndexYearMonth, bool_]) -> Series[IndexDate, float64]
# └── Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]]
#     └── Expected Frame has 2 dtype, provided Frame has 3 dtype
It might not be practical to annotate every column of every Frame: it is common for interfaces to work with Frame of variable column sizes. TypeVarTuple supports this through the usage of *tuple[] expressions (introduced in Python 3.11, back-ported with the Unpack annotation). For example, the function above could be defined to take any number of integer columns with the annotation Frame[IndexDate, Index[np.str_], *tuple[np.int64, ...]], where *tuple[np.int64, ...] means zero or more integer columns.
The same implementation can be annotated with a far more general specification of columnar types. Below, the column values are annotated with np.number[Any] (permitting any type of numeric NumPy type) and a *tuple[] expression (permitting any number of columns): *tuple[np.number[Any], ...]. Now neither mypy nor CallGuard errors with either previously created Frame.
import typing as tp
TFrameDateNums = sf.Frame[sf.IndexDate, sf.Index[np.str_], *tuple[np.number[Any], ...]]
@sf.CallGuard.check
def process6(v: TFrameDateNums, q: TSeriesYMBool) -> TSeriesDFloat:
    t = v.index.iter_label().apply(lambda l: q[l.astype('datetime64[M]')]) # type: ignore
    s = np.where(t, 0.5, 0.25)
    return tp.cast(TSeriesDFloat, (v.via_T * s).mean(axis=1))
x = process6(v5, q) # a Frame with integer, float columns passes
x = process6(v6, q) # a Frame with three integer columns passes
As with NumPy arrays, Frame annotations can wrap Require specifications in Annotated generics, permitting the definition of additional run-time validations.
While StaticFrame might be the first DataFrame library to offer complete generic specification and a unified solution for both static type analysis and run-time type validation, other array and DataFrame libraries offer related utilities.
Neither the Tensor class in PyTorch (2.4.0), nor the Tensor class in TensorFlow (2.17.0) support generic type or shape specification. While both libraries offer a TensorSpec object that can be used to perform run-time type and shape validation, static type checking with tools like mypy is not supported.
As of Pandas 2.2.2, neither the Pandas Series nor DataFrame support generic type specifications. A number of third-party packages have offered partial solutions. The pandas-stubs library, for example, provides type annotations for the Pandas API, but does not make the Series or DataFrame classes generic. The Pandera library permits defining DataFrameSchema classes that can be used for run-time validation of Pandas DataFrames. For static-analysis with mypy, Pandera offers alternative DataFrame and Series subclasses that permit generic specification with the same DataFrameSchema classes. This approach does not permit the expressive opportunities of using generic NumPy types or the unpack operator for supplying variadic generic expressions.
Python type annotations can make static analysis of types a valuable check of code quality, discovering errors before code is even executed. Up until recently, an interface might take an array or a DataFrame, but no specification of the types contained in those containers was possible. Now, complete specification of component types is possible in NumPy and StaticFrame, permitting more powerful static analysis of types.
Providing correct type annotations is an investment. Reusing those annotations for runtime checks provides the best of both worlds. StaticFrame’s CallGuard runtime type checker is specialized to correctly evaluate fully specified generic NumPy types, as well as all generic StaticFrame containers.