Why you should read this article
Data scientists spin up a Jupyter Notebook, play around in a few cells, and then keep entire data processing and model training pipelines in that same notebook.
The code is tested once when the notebook is first written, and then it's neglected for some undetermined amount of time: days, weeks, months, years, until:
- The notebook needs to be rerun to regenerate outputs that were lost.
- The notebook needs to be rerun with different parameters to retrain a model.
- Something changed upstream, and the notebook needs to be rerun to refresh downstream datasets.
Many of you will have felt shivers down your spine reading this…
Why?
Because you instinctively know that this notebook is never going to run.
You know it in your bones: the code in that notebook will need to be debugged for hours at best, rewritten from scratch at worst.
In both cases, it will take you a long time to get what you need.
Why does this happen?
Is there any way of avoiding this?
Is there a better way of writing and maintaining code?
That is the question we will be answering in this article.
The Solution: Automated Testing
What is it?
As the name suggests, automated testing is the practice of running a predefined set of tests on your code to ensure it is working as expected.
These tests verify that your code behaves as expected, especially after changes or additions, and warn you when something breaks. It removes the need for a human to manually test your code, and there's no need to run it on the actual data.
Convenient, isn't it?
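As a minimal, hypothetical illustration of the idea (the function and file name here are made up for this example):
# test_prices.py -- a minimal, hypothetical example of an automated test
def compute_total_price(quantity: int, price: float) -> float:
    """Multiply quantity by unit price."""
    return quantity * price

def test_compute_total_price():
    # Given a known input, the output should match what we expect
    assert compute_total_price(3, 2.5) == 7.5
A test runner (we'll use pytest later in this article) picks up any function whose name starts with test_ and reports a failure whenever an assertion does not hold.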
Types of Automated Testing
There are many different types of testing, and covering them all is beyond the scope of this article.
Let's just focus on the two main types most relevant to a data scientist:
- Unit Tests
- Integration Tests
Unit Tests

Test the smallest pieces of code in isolation (e.g., a single function).
The function should do one thing only, to make it easy to test. Give it a known input, and check that the output is as expected.
Integration Tests

Test how multiple components work together.
For us data scientists, this means checking whether data loading, merging, and preprocessing steps produce the expected final dataset, given a known input dataset.
A Practical Example
Enough with the theory, let's see how it works in practice.
We will go through a simple example where a data scientist has written some code in a Jupyter notebook (or script), one that many data scientists will have seen in their jobs.
We will pick apart why the code is bad. Then, we'll try to make it better.
By better, we mean:
- Easy to test
- Easy to read
which ultimately means easy to maintain, because in the long run, good code is code that works, keeps working, and is easy to look after.
We will then design some unit tests for our improved code, highlighting why the changes are beneficial for testing. To keep this article from becoming too long, I'll defer examples of integration testing to a future article.
Then, we will go through some rules of thumb for what code to test.
Finally, we will cover how to run tests and how to structure projects.

Example Pipeline
We will use the following pipeline as an example:
# bad_pipeline.py
import pandas as pd

# Load data
df1 = pd.read_csv("data/users.csv")
df2 = pd.read_parquet("data/transactions.parquet")
df3 = pd.read_parquet("data/products.parquet")

# Preprocessing
# Merge user and transaction data
df = df2.merge(df1, how='left', on='user_id')
# Merge with product data
df = df.merge(df3, how='left', on='product_id')
# Filter for recent transactions
df = df[df['transaction_date'] > '2023-01-01']
# Calculate total price
df['total_price'] = df['quantity'] * df['price']
# Create customer segment
df['segment'] = df['total_price'].apply(lambda x: 'high' if x > 100 else 'low')
# Drop unnecessary columns
df = df.drop(['user_email', 'product_description', 'price'], axis=1)
# Group by user and segment to get total amount spent
df = df.groupby(['user_id', 'segment']).agg({'total_price': 'sum'}).reset_index()

# Save output
df.to_parquet("data/final_output.parquet")
In real life, we might see hundreds of lines of code crammed into a single notebook. But this script is representative of all the problems that need fixing in typical data science notebooks.
This code does the following:
- Loads user, transaction, and product data.
- Merges them into a unified dataset.
- Filters recent transactions.
- Adds calculated fields (total_price, segment).
- Drops irrelevant columns.
- Aggregates total spending per user and segment.
- Saves the result as a Parquet file.
Why is this pipeline bad?
Oh, there are so many reasons coding in this fashion is bad, depending on which lens you look at it through. It's not the content that's the problem, but how it's structured.
While there are many angles from which we could discuss the disadvantages of writing code this way, for this article we will focus on testability.
1. Tightly coupled logic (in other words, no modularity)
All operations are crammed into a single script and run at once. It's unclear what each part does unless you read every line. Even for a script this simple, that is difficult to do. In real-life scripts it only gets worse, as the code can reach hundreds of lines.
This makes it impossible to test.
The only way to do so would be to run the entire thing from start to finish, probably on the actual data that you're going to use.
If your dataset is small, then perhaps you can get away with this. But typically, data scientists work with a truck-load of data, so it's infeasible to run any kind of test or sanity check quickly.
We need to be able to break the code up into manageable chunks that do one thing only, and do it well. Then, we can control what goes in, and check that what we expect comes out.
2. No Parameterization
Hardcoded file paths and values like 2023-01-01 make the code brittle and inflexible. Again, it's hard to test with anything but the live/production data.
There's no flexibility in how we can run the code; everything is fixed.
What's worse, as soon as you change something, you have no assurance that nothing's broken further down the script.
For example, how many times have you made a change you thought was benign, only to run the code and find a completely unexpected part of it breaking?
How to improve?
Now, let's see step by step how we can improve this code.
Please note, we will assume that we're using the pytest module for our tests going forwards.
1. A clear, configurable entry point
def run_pipeline(
    user_path: str,
    transaction_path: str,
    product_path: str,
    output_path: str,
    cutoff_date: str = '2023-01-01'
):
    # Load data
    ...
    # Process data
    ...
    # Save result
    ...
We start by creating a single function that we can run from anywhere, with clear arguments that can be changed.
What does this achieve?
It allows us to run the pipeline under specific test conditions.
# GIVEN SOME TEST DATA
test_args = dict(
    user_path="fake_users.csv",
    transaction_path="fake_transactions.parquet",
    product_path="fake_products.parquet",
    output_path="fake_output.parquet",
    cutoff_date="2023-01-01",
)

# RUN THE PIPELINE THAT'S TO BE TESTED
run_pipeline(**test_args)

# TEST THE OUTPUT IS AS EXPECTED
output = ...
expected_output = ...
assert output == expected_output
Immediately, you can start passing in different inputs and different parameters, depending on the edge case you want to test for.
It gives you the flexibility to run the code in different settings by making it easier to control the inputs and outputs of your code.
Writing your pipeline in this way also paves the way for integration testing it. More on this in a later article.
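To make that concrete, here is a minimal sketch of what such an integration test could look like once the body of run_pipeline is filled in with the steps from the original script. It uses pytest's built-in tmp_path fixture for temporary files; the fake data and expected values are assumptions for illustration, and run_pipeline is assumed to be importable from your pipeline module.
import pandas as pd

def test_run_pipeline(tmp_path):
    # GIVEN: small, fake input files written to a temporary directory
    users = pd.DataFrame({
        'user_id': [1], 'name': ["John"], 'user_email': ["john@example.com"]
    })
    transactions = pd.DataFrame({
        'user_id': [1], 'product_id': [1],
        'transaction_date': ["2023-06-01"], 'quantity': [2],
    })
    products = pd.DataFrame({
        'product_id': [1], 'price': [60.0],
        'product_name': ["apple"], 'product_description': ["a fruit"],
    })
    user_path = tmp_path / "users.csv"
    transaction_path = tmp_path / "transactions.parquet"
    product_path = tmp_path / "products.parquet"
    output_path = tmp_path / "output.parquet"
    users.to_csv(user_path, index=False)
    transactions.to_parquet(transaction_path)
    products.to_parquet(product_path)

    # RUN the whole pipeline end to end on the fake files
    run_pipeline(
        str(user_path), str(transaction_path), str(product_path),
        str(output_path), cutoff_date="2023-01-01",
    )

    # TEST: one user in the 'high' segment, total spend = 2 * 60.0 = 120.0
    output = pd.read_parquet(output_path)
    assert output['segment'].tolist() == ['high']
    assert output['total_price'].tolist() == [120.0]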
2. Group code into meaningful chunks that do one thing, and do it well
Now, this is where a bit of art comes in: different people will organise code differently depending on which parts they find important.
There is no right or wrong answer, but the common-sense rule is to make sure a function does one thing and does it well. Do that, and it becomes easy to test.
One way we could group our code is shown below:
def load_data(user_path: str, transaction_path: str, product_path: str):
    """Load data from the specified paths."""
    df1 = pd.read_csv(user_path)
    df2 = pd.read_parquet(transaction_path)
    df3 = pd.read_parquet(product_path)
    return df1, df2, df3


def create_user_product_transaction_dataset(
    user_df: pd.DataFrame,
    transaction_df: pd.DataFrame,
    product_df: pd.DataFrame
):
    """Merge user, transaction, and product data into a single dataset.

    The dataset identifies which user bought what product at what time and price.

    Args:
        user_df (pd.DataFrame):
            A dataframe containing user information. Must have a column
            'user_id' that uniquely identifies each user.
        transaction_df (pd.DataFrame):
            A dataframe containing transaction information. Must have
            columns 'user_id' and 'product_id' that are foreign keys
            to the user and product dataframes, respectively.
        product_df (pd.DataFrame):
            A dataframe containing product information. Must have a
            column 'product_id' that uniquely identifies each product.

    Returns:
        A dataframe that merges the user, transaction, and product data
        into a single dataset.
    """
    df = transaction_df.merge(user_df, how='left', on='user_id')
    df = df.merge(product_df, how='left', on='product_id')
    return df


def drop_unnecessary_date_period(df: pd.DataFrame, cutoff_date: str):
    """Drop transactions that occurred before the cutoff date.

    Note:
        Anything before the cutoff date can be dropped because
        of <business reason>.

    Args:
        df (pd.DataFrame): A dataframe with a column `transaction_date`.
        cutoff_date (str): A date in the format 'yyyy-MM-dd'.

    Returns:
        A dataframe with the transactions that occurred after the cutoff date.
    """
    df = df[df['transaction_date'] > cutoff_date]
    return df


def compute_secondary_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute secondary features.

    Args:
        df (pd.DataFrame): A dataframe with columns `quantity` and `price`.

    Returns:
        A dataframe with columns `total_price` and `segment` added to it.
    """
    df['total_price'] = df['quantity'] * df['price']
    df['segment'] = df['total_price'].apply(lambda x: 'high' if x > 100 else 'low')
    return df
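The remaining steps from the original script (dropping the unneeded columns and aggregating spend per user and segment) would be wrapped up in the same way. A possible sketch, with a function name of my own choosing:
def aggregate_spend_per_user_segment(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate total spend per user and segment.

    Drops columns that are not needed downstream, then sums
    `total_price` for each (user_id, segment) pair.
    """
    df = df.drop(['user_email', 'product_description', 'price'], axis=1)
    df = df.groupby(['user_id', 'segment']).agg({'total_price': 'sum'}).reset_index()
    return df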
What does the grouping achieve?
Better documentation
Well, first of all, you end up with some natural real estate in your code to add docstrings. Why is this important? Well, have you ever tried reading your own code a month after writing it?
People forget details very quickly, and even code *you've* written can become undecipherable within just a few days.
It's essential to document what the code is doing, what it expects as input, and what it returns, at the very least.
Including docstrings in your code provides context and sets expectations for how a function should behave, making it easier to understand and debug failing tests in the future.
Better Readability
By 'encapsulating' the complexity of your code into smaller functions, you make it easier to read and understand the overall flow of a pipeline without having to read every single line of code.
def run_pipeline(
    user_path: str,
    transaction_path: str,
    product_path: str,
    output_path: str,
    cutoff_date: str
):
    user_df, transaction_df, product_df = load_data(
        user_path,
        transaction_path,
        product_path
    )
    df = create_user_product_transaction_dataset(
        user_df,
        transaction_df,
        product_df
    )
    df = drop_unnecessary_date_period(df, cutoff_date)
    df = compute_secondary_features(df)
    df.to_parquet(output_path)
You've provided the reader with a hierarchy of information, and it gives them a step-by-step breakdown of what's happening in the run_pipeline function through meaningful function names.
The reader then has the choice of looking at a function's definition, and the complexity within it, depending on their needs.
Combining code into 'meaningful' chunks like this demonstrates the concepts of 'Encapsulation' and 'Abstraction'.
For more details on encapsulation, you can read my article on the topic here.
Smaller packets of code to test
Next, we have a very specific, well-defined set of functions that each do one thing. This makes them easier to test and debug, since we only have one thing to worry about.
See below for how we construct a test.
Constructing a Unit Test
1. Follow the AAA Pattern
def test_create_user_product_transaction_dataset():
    # GIVEN
    # RUN
    # TEST
    ...
Firstly, we define a test function, appropriately prefixed with test_.
Then, we divide it into three sections:
- GIVEN: the inputs to the function, and the expected output. Set up everything required to run the function we want to test.
- RUN: run the function with the given inputs.
- TEST: compare the output of the function to the expected output.
This is a generic pattern that unit tests should follow. The standard name for this design pattern is the 'AAA pattern', which stands for Arrange, Act, Assert.
I don't find this naming intuitive, which is why I use GIVEN, RUN, TEST.
2. Arrange: set up the test
# GIVEN
user_df = pd.DataFrame({
    'user_id': [1, 2, 3], 'name': ["John", "Jane", "Bob"]
})
transaction_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'product_id': [1, 1, 2],
    'extra-column1-str': ['1', '2', '3'],
    'extra-column2-int': [4, 5, 6],
    'extra-column3-float': [1.1, 2.2, 3.3],
})
product_df = pd.DataFrame({
    'product_id': [1, 2], 'product_name': ["apple", "banana"]
})
expected_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'product_id': [1, 1, 2],
    'extra-column1-str': ['1', '2', '3'],
    'extra-column2-int': [4, 5, 6],
    'extra-column3-float': [1.1, 2.2, 3.3],
    'name': ["John", "Jane", "Bob"],
    'product_name': ["apple", "apple", "banana"],
})
Secondly, we define the inputs to the function, and the expected output. This is where we bake in our expectations about what the inputs will look like, and what the output should look like.
As you can see, we don't need to define every single feature we expect to see in production, only the ones that matter for the test.
For example, transaction_df defines the user_id and product_id columns properly, whilst also adding three columns of different types (str, int, float) to simulate the fact that there will be other columns.
The same goes for product_df and user_df, though these tables are expected to be dimension tables, so just defining the name and product_name columns will suffice.
3. Act: Run the function under test
# RUN
output_df = create_user_product_transaction_dataset(
    user_df, transaction_df, product_df
)
Thirdly, we run the function with the inputs we defined, and collect the output.
4. Assert: Check the result is as expected
# TEST
pd.testing.assert_frame_equal(
    output_df,
    expected_df
)
And finally, we check whether the output matches the expected output.
Note, we use the pandas testing module since we're comparing pandas dataframes. For non-pandas outputs, you can use a plain assert statement instead.
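For instance, here is a quick sketch of a test for a small, hypothetical helper that returns a plain dictionary:
from collections import Counter

def summarise_segments(segments: list) -> dict:
    """Hypothetical helper: count how many rows fall into each segment."""
    return dict(Counter(segments))

def test_summarise_segments():
    # GIVEN
    segments = ['high', 'low', 'high']
    expected = {'high': 2, 'low': 1}
    # RUN
    output = summarise_segments(segments)
    # TEST: a plain assert is enough for built-in Python types
    assert output == expected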
The full testing code will look like this:
import pandas as pd

def test_create_user_product_transaction_dataset():
    # GIVEN
    user_df = pd.DataFrame({
        'user_id': [1, 2, 3], 'name': ["John", "Jane", "Bob"]
    })
    transaction_df = pd.DataFrame({
        'user_id': [1, 2, 3],
        'product_id': [1, 1, 2],
        'transaction_date': ["2021-01-01", "2021-01-01", "2021-01-01"],
        'extra-column1': [1, 2, 3],
        'extra-column2': [4, 5, 6],
    })
    product_df = pd.DataFrame({
        'product_id': [1, 2], 'product_name': ["apple", "banana"]
    })
    expected_df = pd.DataFrame({
        'user_id': [1, 2, 3],
        'product_id': [1, 1, 2],
        'transaction_date': ["2021-01-01", "2021-01-01", "2021-01-01"],
        'extra-column1': [1, 2, 3],
        'extra-column2': [4, 5, 6],
        'name': ["John", "Jane", "Bob"],
        'product_name': ["apple", "apple", "banana"],
    })

    # RUN
    output_df = create_user_product_transaction_dataset(
        user_df, transaction_df, product_df
    )

    # TEST
    pd.testing.assert_frame_equal(
        output_df,
        expected_df
    )
To organise your tests better and make them cleaner, you can start using a combination of classes, fixtures, and parametrisation.
It's beyond the scope of this article to delve into each of these concepts in detail, so for those who are interested, I'd point you to the pytest How-To guides as a reference.
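As a small taste of what these look like, here is a minimal sketch: a fixture shares set-up data between tests, and parametrisation runs one test body over several cases. It assumes compute_secondary_features is importable from your pipeline module.
import pandas as pd
import pytest

@pytest.fixture
def user_df():
    # Any test that takes `user_df` as an argument receives this dataframe
    return pd.DataFrame({'user_id': [1, 2, 3], 'name': ["John", "Jane", "Bob"]})

def test_user_ids_are_unique(user_df):
    # The fixture is injected automatically by pytest
    assert user_df['user_id'].is_unique

@pytest.mark.parametrize(
    "quantity, price, expected_segment",
    [(2, 60.0, 'high'), (1, 100.0, 'low'), (1, 20.0, 'low')],
)
def test_compute_secondary_features_segment(quantity, price, expected_segment):
    # The same test body runs once per (quantity, price, expected_segment) case
    df = pd.DataFrame({'quantity': [quantity], 'price': [price]})
    output = compute_secondary_features(df)
    assert output['segment'].iloc[0] == expected_segment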

What to Test?
Now that we've created a unit test for one function, we turn our attention to the remaining functions. Astute readers will now be thinking:
"Wow, do I have to write a test for everything? That's a lot of work!"
Yes, it's true. It's extra code that you need to write and maintain.
The good news is that it's not necessary to test absolutely everything; you just need to know what's important in the context of what your work is doing.
Below, I'll give you a few rules of thumb and the considerations I make when deciding what to test, and why.
1. Is the code critical to the outcome of the project?
There are junctures in a data science project that are simply pivotal to its success, many of which come at the data-preparation and model evaluation/explanation stages.
The test we saw above for the create_user_product_transaction_dataset function is a good example.
This dataset will form the basis of all downstream modelling activity.
If the user -> product join is incorrect in any way, it will affect everything we do downstream.
Thus, it's worth taking the time to make sure this code works correctly.
At a bare minimum, the test we've established makes sure the function keeps behaving in exactly the same way after every code change.
Example
Suppose the join needs to be rewritten to improve memory efficiency.
After making the change, the unit test ensures the output stays the same.
If something was inadvertently altered such that the output started to look different (missing rows, missing columns, different datatypes), the test would immediately flag the issue.
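For illustration, one possible shape such a rewrite could take is switching from merge to index-based joins (whether this actually saves memory depends on your data; the point is that the existing unit test covers the change):
def create_user_product_transaction_dataset(
    user_df: pd.DataFrame,
    transaction_df: pd.DataFrame,
    product_df: pd.DataFrame
):
    """Same contract as before, rewritten with index-based joins."""
    df = transaction_df.join(user_df.set_index('user_id'), on='user_id')
    df = df.join(product_df.set_index('product_id'), on='product_id')
    return df
If this rewrite accidentally changed the row order, dropped a column, or altered a dtype, the pd.testing.assert_frame_equal call in the existing test would fail and point straight at the difference.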
2. Is the code mostly using third-party libraries?
Take the load data function, for example:
def load_data(user_path: str, transaction_path: str, product_path: str):
    """Load data from the specified paths."""
    df1 = pd.read_csv(user_path)
    df2 = pd.read_parquet(transaction_path)
    df3 = pd.read_parquet(product_path)
    return df1, df2, df3
This function encapsulates the process of reading data from different files. Under the hood, all it does is call three pandas load functions.
The main value of this code is the encapsulation.
Meanwhile, it contains no business logic, and in my opinion, the function's scope is so specific that you wouldn't expect any logic to be added in the future.
If it is, then the function name should be changed, since it would be doing more than just loading data.
Therefore, this function does not require a unit test.
A unit test for this function would just be testing that pandas works properly, and we should be able to trust that pandas has tested its own code.
3. Is the code likely to change over time?
This point is already implied in 1 & 2. For maintainability, it is perhaps the most important consideration.
You should be asking:
- How complex is the code? Are there many ways to achieve the same output?
- What could cause someone to modify this code? Is the data source susceptible to changes in the future?
- Is the code clear? Are there behaviours that could easily be overlooked during a refactor?
Take create_user_product_transaction_dataset as an example.
- The input data may have changes to its schema in the future.
- Perhaps the dataset becomes larger, and we need to split the merge into multiple steps for performance reasons.
- Perhaps a dirty hack needs to go in temporarily to handle nulls because of an issue with the data source.
In each case, a change to the underlying code may be necessary, and each time we need to be sure the output doesn't change.
In contrast, load_data does nothing but load data from a file.
I don't see this changing much in the future, other than perhaps a change in file format. So I would defer writing a test for it until a significant change to the upstream data source occurs (something like that would probably require changing a lot of the pipeline anyway).
Where to Put Tests and How to Run Them
So far, we've covered how to write testable code and how to create the tests themselves.
Now, let's look at how to structure your project to include tests, and how to run them effectively.
Project Structure
Typically, a data science project can follow the structure below:
|-- data            # where data is stored
|-- conf            # where config files for your pipelines are stored
|-- src             # all the code needed to reproduce your project lives here
|-- notebooks       # one-off experiments, explorations, etc. live here
|-- tests           # all the tests live here
|-- pyproject.toml
|-- README.md
|-- requirements.txt
The src folder should contain all the code that is essential for the delivery of your project.
General rule of thumb
If it's code you expect to run multiple times (with different inputs or parameters), it should go in the src folder.
Examples include:
- data processing
- feature engineering
- model training
- model evaluation
Meanwhile, anything that is a one-off piece of analysis can live in Jupyter notebooks, stored in the notebooks folder.
This mainly consists of:
- EDA
- ad-hoc model experimentation
- analysis of local model explanations
Why?
Because Jupyter notebooks are notoriously flaky, difficult to manage, and hard to test. We don't want to be rerunning critical code via notebooks.
The Test Folder Structure
Let's say your src folder looks like this:
src
|-- pipelines
    |-- data_processing.py
    |-- feature_engineering.py
    |-- model_training.py
    |-- __init__.py
Each file contains functions and pipelines, similar to the example we saw above.
The tests folder should then look like this:
tests
|-- pipelines
    |-- test_data_processing.py
    |-- test_feature_engineering.py
    |-- test_model_training.py
where the tests directory mirrors the structure of the src directory and each file starts with the test_ prefix.
The reasons for this are simple:
- It's easy to find the tests for a given file, since the tests folder structure mirrors the src folder.
- It keeps test code cleanly separated from source code.
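Inside each test file, you then import the functions you want to test from the corresponding src module. A minimal sketch, assuming the src package is importable (for example because the project is installed in editable mode or src is on your PYTHONPATH):
# tests/pipelines/test_data_processing.py
import pandas as pd
from src.pipelines.data_processing import create_user_product_transaction_dataset

def test_create_user_product_transaction_dataset():
    ...  # GIVEN / RUN / TEST, exactly as shown earlier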
Running Tests
Once you have your tests set up as above, you can run them in a variety of ways:
1. Via the terminal
pytest -v
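Running pytest from the project root automatically discovers and runs everything under the tests folder; the -v flag simply makes the output more verbose. You can also target a subset of tests, for example:
pytest tests/pipelines/test_data_processing.py -v      # run a single test file
pytest -k "create_user_product_transaction_dataset"    # run tests matching a name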
2. Via a code editor
I use this for all my projects.
Visual Studio Code is my editor of choice; it auto-discovers the tests for me, and it's super easy to debug.
After having a read of the docs, I don't think there's any point in me reiterating their contents since they're quite self-explanatory, so here's the link:
Similarly, most code editors have comparable capabilities, so there's no excuse for not writing tests.
It really is simple: read the docs and get started.
3. Via a CI pipeline (e.g. GitHub Actions, GitLab, etc.)
It's easy to set up tests to run automatically on pull requests via GitHub.
The idea is that whenever you open a PR, it will automatically discover and run the tests for you.
This means that even if you forget to run the tests locally via options 1 or 2, they will always be run for you whenever you want to merge your changes.
Again, there's no point in me reiterating the docs; here's the link.
The End Goal We Want to Achieve
Following on from the instructions above, I think it's a better use of both of our time to highlight some important points about what we want to achieve through automated tests, rather than regurgitating instructions you can find in the links above.
Firstly, automated tests are written to establish trust in your code, and to minimise human error.
This is for the benefit of:
- Yourself
- Your team
- and the business as a whole.
Therefore, to truly get the most out of the tests you've written, you must get round to setting up a CI pipeline.
It makes a world of difference to be able to forget to run the tests locally and still have the assurance that they will be run when you create a PR or push changes.
You don't want to be the person responsible for a bug that causes a production incident because you forgot to run the tests, or the one who missed a bug during a PR review.
So please, if you write some tests, invest some time in setting up a CI pipeline. Read the GitHub docs, I implore you. It's trivial to set up, and it will do you wonders.
Final Remarks
After reading this article, I hope it has impressed upon you:
- The importance of writing tests, particularly within the context of data science
- How easy it is to write and run them
But there's one last reason why you should know how to write automated tests.
That reason is that
Data Science is changing.
Data science used to be largely proof-of-concept work: building models in Jupyter notebooks and handing them to engineers for deployment. Meanwhile, data scientists built up a notoriety for writing terrible code.
But now, the industry has matured.
It's becoming easier to quickly build and deploy models as MLOps and ML engineering mature.
Thus,
- model building
- deployment
- retraining
- maintenance
are becoming the responsibility of machine learning engineers.
At the same time, the data wrangling we used to do has become so complex that it is now being handed over to dedicated data engineering teams.
As a result, data science sits in a very narrow space between these two disciplines, and quite soon the lines between data scientist and data analyst will blur.
The trajectory is that data scientists will no longer be building cutting-edge models, but will become more business- and product-focused, producing insights and MI reports instead.
If you want to stay closer to the model building, it no longer suffices just to code.
You need to learn how to code properly, and how to maintain that code well. Machine learning is no longer a novelty, no longer just PoCs; it is becoming software engineering.
If You Want To Learn More
If you want to learn more about software engineering skills applied to Data Science, here are some related articles:
You can also become a Group Member on Patreon here!
We have dedicated discussion threads for all articles; ask me questions about automated testing, discuss the topic in more detail, and share experiences with other data scientists. The learning doesn't have to stop here.
You can find the dedicated discussion thread for this article here.