In my last article, I mentioned that scheduling is the next wall I’ll be walking towards.
So I guess here I am, walking towards it.
But before I get into what happened, let me give some context for anyone stumbling on this for the first time.
I’m a systems analyst who decided to transition into data engineering. Instead of just taking courses and collecting certificates, I decided to learn by building and writing about it publicly. Every article in this series documents something I actually built, the decisions I made, the things that broke, and what I learned from it.
The first article was my 12-month self-study roadmap, where I laid out the plan for how I was going to approach this transition. The second was me building my first ETL pipeline from scratch using the GitHub API, as a complete beginner. In the third, I took that same pipeline and made it more production-ready by adding SQLite storage, idempotency handling, and Google Drive persistence, all inside Google Colab.
This article is the fourth. And it picks up exactly where the last one ended.
I expected to spend most of my time picking a scheduling tool and configuring it. What I didn’t expect was that before I could even think about scheduling, I had to deal with something more fundamental. My pipeline couldn’t run outside of Google Colab. And until that changed, no scheduler in the world could help me.
This is the story of what actually happened.
The First Wall: My Pipeline Lived in Colab
Before I even got to scheduling, I wanted to know what it would actually take to run my pipeline automatically. So I looked at my code properly for the first time with that question in mind.
Here’s what the load section looked like:
conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db')
That path, /content/drive/MyDrive/, only exists inside Google Colab. It’s the mounted Google Drive path that Colab gives you when you connect your Drive to a notebook. Outside Colab, that path doesn’t exist. If any scheduler tried to run this script, it would crash right there.
The interesting thing is that my code had no google.colab imports. No Colab-specific libraries. Just one hardcoded path that I had been typing without really thinking about it. That path was the dependency, not the code.
This was the first thing I didn’t expect. I thought the challenge would be learning a scheduling tool. Instead, the first lesson was that my environment was part of my pipeline, and I hadn’t noticed.
The fix was simple. Instead of hardcoding the Colab path, I made the database path configurable through an environment variable:
import os
import sqlite3
# Use DB_PATH from the environment if set, otherwise a local file
DB_PATH = os.environ.get('DB_PATH', 'github_repos.db')
conn = sqlite3.connect(DB_PATH)
Now the script uses whatever path is set in the environment. If nothing is set, it falls back to creating a local github_repos.db file in the folder the script is run from. One change, and the pipeline was no longer tied to Colab.
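To see the fallback behaviour in isolation, here is a minimal sketch. The helper names resolve_db_path and open_database, and the demo table, are made up for illustration and are not part of the pipeline from the earlier articles.

```python
import os
import sqlite3
import tempfile


def resolve_db_path(default="github_repos.db"):
    """Return DB_PATH from the environment, or the local default if unset."""
    return os.environ.get("DB_PATH", default)


def open_database():
    """Open (and create if needed) the SQLite file at the resolved path."""
    return sqlite3.connect(resolve_db_path())


if __name__ == "__main__":
    # Point the script at a throwaway folder, the way a scheduler could
    # by setting DB_PATH before the run.
    with tempfile.TemporaryDirectory() as tmp:
        os.environ["DB_PATH"] = os.path.join(tmp, "demo.db")
        conn = open_database()
        conn.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER PRIMARY KEY)")
        conn.commit()
        conn.close()
        print("Database created at", resolve_db_path())
```

Reading the variable inside a function, rather than once at import time, means the same code picks up whatever value the surrounding environment provides at the moment it runs.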
Running It Outside Colab for the First Time
Before setting up any scheduler, I wanted to confirm the script actually worked on its own. So I saved it as pipeline.py, created a requirements.txt with the two libraries it needs:
requests
pandas
And ran it from my terminal:
python pipeline.py
It printed: Pipeline complete. Duplicates handled.
And a file called github_repos.db appeared in my folder. The same pipeline I had been running in Colab was now running as a plain Python script, anywhere.
That felt like a bigger deal than I expected. Not because the change was complex, it wasn’t. But because I realized I had been thinking of my pipeline as a notebook, when what I actually had was a script that happened to live inside one.
Choosing a Scheduling Tool
At this point I had a standalone script. Now I needed something to run it on a schedule.
I looked at a few options. APScheduler lets you define schedules inside your Python code, which works while a session is running but stops the moment you close your terminal. That’s not really scheduling, that’s just a loop. Airflow is the industry standard for orchestrating pipelines, but it requires running a server, a metadata database, and a web interface. That’s a lot of infrastructure for where I am right now.
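To make the "that’s just a loop" point concrete, here is a stdlib-only sketch of in-process scheduling. It is not APScheduler’s API; run_every and its parameters are hypothetical names chosen for this illustration.

```python
import time


def run_every(interval_seconds, job, max_runs, sleep=time.sleep):
    """Call job() repeatedly, pausing interval_seconds between calls.

    The schedule exists only inside this process: once the process exits
    (for example, when the terminal is closed), nothing triggers job() again.
    """
    runs = 0
    while runs < max_runs:
        job()
        runs += 1
        if runs < max_runs:
            sleep(interval_seconds)
    return runs


if __name__ == "__main__":
    # Three runs, one second apart, then the "schedule" is gone.
    run_every(1, lambda: print("pipeline would run here"), max_runs=3)
```

The sleep function is passed in as a parameter only so the loop can be exercised without actually waiting.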
GitHub Actions sat in the middle. It’s free, it runs on GitHub’s servers, the schedule is defined in code, and it doesn’t require me to maintain any infrastructure. The tradeoff is that it’s designed for CI/CD workflows, not pipeline orchestration, so it has limits around complex dependencies and monitoring. But for a pipeline at my stage, it’s a practical choice.
I also want to be honest: tools like Airflow exist for a reason. When a pipeline grows, when you have dependencies between tasks, when you need visibility into what ran and what failed, you need proper orchestration. GitHub Actions is not that. But it’s a good first step, and understanding why it’s limited is part of learning what those more serious tools are actually solving.
Setting Up GitHub Actions
GitHub Actions works through workflow files, which are YAML files you place in a specific folder in your repository. The folder structure looks like this:
github-etl/
├── .github/
│   └── workflows/
│       └── schedule.yml
├── pipeline.py
└── requirements.txt
Here’s the full workflow file I created:
name: Run ETL Pipeline
on:
  schedule:
    - cron: '0 9 * * *'
  workflow_dispatch:
jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run pipeline
        run: python pipeline.py
Let me walk through what each part is doing.
cron: '0 9 * * *' is the actual schedule. Cron is a time-based job scheduling format that’s been around in Unix systems for decades. The five values represent minute, hour, day of month, month, and day of week. So 0 9 * * * means: at minute 0 of hour 9, every day, every month, every day of the week. In other words, 9am UTC every day.
workflow_dispatch adds a manual trigger. This means you can also run the workflow by clicking a button in GitHub, without waiting for the scheduled time. This is useful for testing.
runs-on: ubuntu-latest tells GitHub to spin up a fresh Linux machine for each run. Every time the workflow triggers, GitHub creates a clean environment, installs your dependencies, runs your script, and then shuts everything down. There’s no persistent machine sitting somewhere running your code. It’s ephemeral.
The steps are simple. Checkout pulls your code from the repository into the runner. Set up Python installs the version you specify. Install dependencies runs pip install -r requirements.txt. And then Run pipeline executes your script.
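The five cron fields described above can be sketched in a few lines of Python. This is only an illustration of how the fields line up, not how GitHub or cron itself evaluates schedules; parse_cron and matches are hypothetical helpers, and only * and plain integers are supported.

```python
from datetime import datetime, timezone

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression):
    """Split a five-field cron expression into a dict of named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def matches(expression, moment):
    """Check a datetime against a cron expression.

    Only '*' and plain integers are handled; ranges, lists, and steps
    (such as '1-5' or '*/15') are deliberately left out of this sketch.
    Real cron also treats the two day fields as either/or when both are
    restricted; here every field must match.
    """
    fields = parse_cron(expression)
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        # cron counts Sunday as 0; isoweekday() counts Sunday as 7
        "day_of_week": moment.isoweekday() % 7,
    }
    return all(
        value == "*" or int(value) == actual[name]
        for name, value in fields.items()
    )


if __name__ == "__main__":
    nine_utc = datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc)
    print(parse_cron("0 9 * * *"))
    print(matches("0 9 * * *", nine_utc))
```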
What Happened When I Ran It
After pushing the workflow file to GitHub, I went to the Actions tab in my repository and triggered it manually using the workflow_dispatch button.
It ran. Twenty-seven seconds from start to finish. The pipeline pulled data from the GitHub API, transformed it, and loaded it into SQLite, all on a GitHub server, without me doing anything after clicking the button.
I did get one warning on the first run:
Node.js 20 actions are deprecated...
This was because I had used older versions of the checkout and setup-python actions. The fix was updating actions/checkout@v3 to actions/checkout@v4 and actions/setup-python@v4 to actions/setup-python@v5. After that, the workflow ran clean.
What I Actually Learned
Going into this, I thought scheduling was about picking the right tool. What I learned was that scheduling forced me to think about something I hadn’t thought carefully about before: portability.
A pipeline that only runs in one specific environment isn’t really a pipeline. It’s a script tied to a platform. Making it schedulable meant making it portable first, and making it portable meant understanding what it actually depended on.
The hardcoded path was a small thing. But catching it changed how I think about writing pipeline code going forward. Every time I write a path or a credential or an environment-specific value, I now ask whether that thing will exist outside the context I’m building in.
The other thing I learned is that scheduling and orchestration are different problems. GitHub Actions handles scheduling well. It doesn’t handle things like retrying failed runs with backoff, alerting when something goes wrong, visualizing pipeline dependencies, or managing multiple pipelines that depend on each other. Those are orchestration problems, and they’re what tools like Airflow are built to solve.
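As a rough idea of what "retrying failed runs with backoff" means, here is a hand-rolled sketch. It is not how Airflow implements retries; run_with_retries and its parameters are hypothetical names, and an orchestrator would pair this with logging and alerting.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(), retrying on any exception with exponential backoff.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on.
    Re-raises the last exception once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            # Double the wait after each failed attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Wrapping a pipeline’s extract step in something like this covers a flaky API call, but it still says nothing about who gets told when all attempts fail, which is part of why dedicated orchestration tools exist.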
I’m not there yet. But I understand now why those tools exist in a way I didn’t before.
What’s Next
The pipeline is now running every day at 9am UTC. Data is being collected. And I’m starting to notice something: when you have a pipeline running daily, you start caring about the data it produces differently.
Are all the records clean? Are there repos slipping through with missing fields? Is the viral flag actually meaningful, or did I define it in a way that makes almost everything “No”?
These are data quality questions. And they’re the next wall I’m walking toward.
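As a first pass at those questions, a sketch like the following could count missing values and show how a flag is distributed. The table name repos and the columns name, language, and is_viral are assumptions made for this example; the real schema may differ.

```python
import sqlite3


def count_missing(conn, table, columns):
    """Return {column: number of rows where the column is NULL or empty}."""
    result = {}
    for column in columns:
        # Identifiers cannot be bound as parameters, so only pass trusted names.
        query = (
            f"SELECT COUNT(*) FROM {table} "
            f"WHERE {column} IS NULL OR {column} = ''"
        )
        result[column] = conn.execute(query).fetchone()[0]
    return result


def value_shares(conn, table, column):
    """Return {value: share of rows} for one column, e.g. a Yes/No flag."""
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if total == 0:
        return {}
    rows = conn.execute(
        f"SELECT {column}, COUNT(*) FROM {table} GROUP BY {column}"
    ).fetchall()
    return {value: count / total for value, count in rows}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for github_repos.db
    conn.execute("CREATE TABLE repos (name TEXT, language TEXT, is_viral TEXT)")
    conn.executemany(
        "INSERT INTO repos VALUES (?, ?, ?)",
        [("a", "Python", "No"), ("b", None, "No"),
         ("c", "Go", "Yes"), ("d", "", "No")],
    )
    print(count_missing(conn, "repos", ["name", "language"]))
    print(value_shares(conn, "repos", "is_viral"))
```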
This is part of my ongoing series documenting my transition from systems analyst to data engineer. If you’ve been following along, thank you. If this is your first article in the series, the earlier ones are linked below.
From Data Analyst to Data Engineer: My 12-Month Self-Study Roadmap
I Built My First ETL Pipeline as a Complete Beginner. Here’s How.
I Thought Data Engineering Was Just Writing Scripts. I Was Wrong.