Friday, June 19, 2026
newsaiworld

I Tried to Schedule My ETL Pipeline. Here’s What I Didn’t Expect.

Admin by Admin
June 19, 2026

In my last article, I mentioned that scheduling is the next wall I’ll be walking toward.

So I guess here I am, walking toward it.

But before I get into what happened, let me give some context for anyone stumbling on this for the first time.

I’m a systems analyst who decided to transition into data engineering. Instead of just taking courses and collecting certificates, I decided to learn by building and writing about it publicly. Every article in this series documents something I actually built, the decisions I made, the things that broke, and what I learned from it.

The first article was my 12-month self-study roadmap, where I laid out the plan for how I was going to approach this transition. The second was me building my first ETL pipeline from scratch using the GitHub API, as a complete beginner. In the third, I took that same pipeline and made it more production-ready by adding SQLite storage, idempotency handling, and Google Drive persistence, all inside Google Colab.

This article is the fourth. And it picks up exactly where the last one ended.

I expected to spend most of my time picking a scheduling tool and configuring it. What I didn’t expect was that before I could even think about scheduling, I had to deal with something more fundamental. My pipeline couldn’t run outside of Google Colab. And until that changed, no scheduler in the world could help me.

This is the story of what actually happened.

The First Wall: My Pipeline Lived in Colab

Before I even got to scheduling, I wanted to know what it would actually take to run my pipeline automatically. So I looked at my code properly for the first time with that question in mind.

Here’s what the load section looked like:

conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db')

That path, /content/drive/MyDrive/, only exists inside Google Colab. It’s the mounted Google Drive path that Colab gives you when you connect your Drive to a notebook. Outside Colab, that path doesn’t exist. If any scheduler tried to run this script, it would crash right there.

The interesting thing is that my code had no google.colab imports. No Colab-specific libraries. Just one hardcoded path that I had been typing without really thinking about it. That path was the dependency, not the code.

This was the first thing I didn’t expect. I thought the challenge would be learning a scheduling tool. Instead, the first lesson was that my environment was part of my pipeline, and I hadn’t noticed.

The fix was simple. Instead of hardcoding the Colab path, I made the database path configurable through an environment variable:

import os
import sqlite3

# Read the database path from the environment; default to a local file
DB_PATH = os.environ.get('DB_PATH', 'github_repos.db')
conn = sqlite3.connect(DB_PATH)

Now the script uses whatever path is set in the environment. If nothing is set, it falls back to creating a local github_repos.db file in the folder the script is run from. One change, and the pipeline was no longer tied to Colab.
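Here is a minimal sketch of the same idea with the lookup pulled into a function, so the fallback can be checked without touching the real environment. The names resolve_db_path and open_database, and the env parameter, are assumptions made for illustration; they are not part of the original pipeline.

```python
import os
import sqlite3


def resolve_db_path(env=None, default="github_repos.db"):
    """Return the database path from DB_PATH, or the local default.

    `env` is a hypothetical parameter: passing a plain dict makes the lookup
    testable without changing the real process environment.
    """
    env = os.environ if env is None else env
    return env.get("DB_PATH", default)


def open_database(env=None):
    """Open a SQLite connection at the resolved path."""
    return sqlite3.connect(resolve_db_path(env))
```

In Colab, DB_PATH could be set to the mounted Drive path before running the script; anywhere else it can be left unset and the local default applies.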

Running It Outside Colab for the First Time

Before setting up any scheduler, I wanted to confirm the script actually worked on its own. So I saved it as pipeline.py, created a requirements.txt with the two libraries it needs:

requests
pandas

And ran it from my terminal.

It printed: Pipeline complete. Duplicates handled.

And a file called github_repos.db appeared in my folder. The same pipeline I had been running in Colab was now running as a plain Python script, anywhere.
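One way to go a step beyond "the file appeared" is to open it and count what landed inside. The sketch below is an assumption about how that check might look, not something from the original pipeline; it reads table names from SQLite's own catalog, so it does not need to know what the pipeline called its table. The function name summarize_database is hypothetical.

```python
import sqlite3


def summarize_database(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        # Table names come from the database catalog, so they are
        # double-quoted as identifiers rather than passed as parameters.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()
```

Calling summarize_database('github_repos.db') after a run would show whether any rows were actually loaded. Note that sqlite3.connect creates an empty file if the path does not exist, so an empty result is also worth treating as a failure.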

That felt like a bigger deal than I expected. Not because the change was complex, it wasn’t. But because I realized I had been thinking of my pipeline as a notebook, when what I actually had was a script that happened to live inside one.

Choosing a Scheduling Tool

At this point I had a standalone script. Now I needed something to run it on a schedule.

I looked at a few options. APScheduler lets you define schedules inside your Python code, which works while a session is running but stops the moment you close your terminal. That’s not really scheduling, that’s just a loop. Airflow is the industry standard for orchestrating pipelines, but it requires running a server, a metadata database, and a web interface. That’s a lot of infrastructure for where I am right now.
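To make the "just a loop" point concrete, here is a plain-Python sketch of in-process scheduling. It is not APScheduler's API; it only illustrates the general shape, where the schedule exists solely inside a running process. The function name run_every and the max_runs and sleep parameters are assumptions added so the loop can be stopped and tested.

```python
import time


def run_every(interval_seconds, job, max_runs=None, sleep=time.sleep):
    """Call `job` repeatedly, pausing `interval_seconds` between calls.

    The schedule lives only in this process: once the process exits, nothing
    is left to trigger the next run. With max_runs=None the loop never ends.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        # Skip the final pause when a run limit has been reached.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

Closing the terminal kills the process, and the loop goes with it. That is the gap a scheduler running on someone else's always-on machine fills.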

GitHub Actions sat in the middle. It’s free for public repositories, it runs on GitHub’s servers, the schedule is defined in code, and it doesn’t require me to maintain any infrastructure. The tradeoff is that it’s designed for CI/CD workflows, not pipeline orchestration, so it has limits around complex dependencies and monitoring. But for a pipeline at my stage, it’s a practical choice.

I also want to be honest: tools like Airflow exist for a reason. When a pipeline grows, when you have dependencies between tasks, when you need visibility into what ran and what failed, you need proper orchestration. GitHub Actions is not that. But it’s a good first step, and understanding why it’s limited is part of learning what those more serious tools are actually solving.

Setting Up GitHub Actions

GitHub Actions works through workflow files, which are YAML files you place in a specific folder in your repository. The folder structure looks like this:

github-etl/
├── .github/
│   └── workflows/
│       └── schedule.yml
├── pipeline.py
└── requirements.txt

Here’s the full workflow file I created:

name: Run ETL Pipeline

on:
  schedule:
    # 09:00 UTC every day
    - cron: '0 9 * * *'
  workflow_dispatch:

jobs:
  run-pipeline:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run pipeline
        run: python pipeline.py

Let me walk through what each part is doing.

  • cron: '0 9 * * *' is the actual schedule. Cron is a time-based job scheduling format that’s been around in Unix systems for decades. The five values represent minute, hour, day of month, month, and day of week. So 0 9 * * * means: at minute 0 of hour 9, every day of the month, every month, every day of the week. In other words, 9am UTC every day.
  • workflow_dispatch adds a manual trigger. This means you can also run the workflow by clicking a button in GitHub, without waiting for the scheduled time. This is useful for testing.
  • runs-on: ubuntu-latest tells GitHub to spin up a fresh Linux machine for each run. Every time the workflow triggers, GitHub creates a clean environment, installs your dependencies, runs your script, and then shuts everything down. There’s no persistent machine sitting somewhere running your code. It’s ephemeral, so anything a run writes to the runner’s disk goes away with it unless the workflow saves it somewhere else.
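The five cron fields can be made concrete with a small sketch that works out when a daily expression such as '0 9 * * *' fires next. This is a deliberately narrow illustration, not a cron parser: it assumes the minute and hour are single integers and the other three fields are '*'. The function name next_daily_run is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(cron_expr, now):
    """Next run time for a daily cron expression like '0 9 * * *'.

    Simplifying assumption: minute and hour are single integers and the
    remaining three fields are '*'. Ranges, lists, and steps are not handled.
    """
    minute, hour, day_of_month, month, day_of_week = cron_expr.split()
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily schedules")
    candidate = now.replace(
        hour=int(hour), minute=int(minute), second=0, microsecond=0
    )
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    # GitHub Actions evaluates schedules in UTC, so the clock is UTC here too.
    print(next_daily_run("0 9 * * *", datetime.now(timezone.utc)))
```

Passing a timezone-aware UTC datetime keeps the result in the same timezone the workflow schedule uses.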

The steps are straightforward. Checkout pulls your code from the repository into the runner. Set up Python installs the version you specify. Install dependencies runs pip install -r requirements.txt. And then Run pipeline executes your script.

What Happened When I Ran It

After pushing the workflow file to GitHub, I went to the Actions tab in my repository and triggered it manually using the workflow_dispatch button.

It ran. Twenty-seven seconds from start to finish. The pipeline pulled data from the GitHub API, transformed it, and loaded it into SQLite, all on a GitHub server, without me doing anything after clicking the button.

I did get one warning on the first run:

Node.js 20 actions are deprecated...

This was because I had used older versions of the checkout and setup-python actions. The fix was updating actions/checkout@v3 to actions/checkout@v4 and actions/setup-python@v4 to actions/setup-python@v5. After that, the workflow ran clean.

What I Actually Learned

Going into this, I thought scheduling was about picking the right tool. What I learned was that scheduling forced me to think about something I hadn’t thought carefully about before: portability.

A pipeline that only runs in one specific environment isn’t really a pipeline. It’s a script tied to a platform. Making it schedulable meant making it portable first, and making it portable meant understanding what it actually depended on.

The hardcoded path was a small thing. But catching it changed how I think about writing pipeline code going forward. Every time I write a path or a credential or an environment-specific value, I now ask whether that thing will exist outside the context I’m building in.
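That question can be partly automated. The sketch below is one possible heuristic, not something from the original pipeline: it walks a script's syntax tree and reports string literals that look like absolute paths, which are the values most likely to exist on one machine only. The function name find_absolute_path_literals and the matching rule are assumptions, and the rule will miss some cases and flag some harmless ones.

```python
import ast


def find_absolute_path_literals(source):
    """Return sorted (line_number, value) pairs for path-like string literals.

    Heuristic only: a literal counts as path-like if it starts with '/' or
    '~', or looks like a Windows drive path such as C:\\data.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            value = node.value
            if value.startswith(("/", "~")) or value[1:3] == ":\\":
                hits.append((node.lineno, value))
    return sorted(hits)
```

Run against the original load step, this would flag the /content/drive/MyDrive/github_repos.db literal and leave ordinary strings alone.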

The other thing I learned is that scheduling and orchestration are different things. GitHub Actions handles scheduling well. It doesn’t handle things like retrying failed runs with backoff, alerting when something goes wrong, visualizing pipeline dependencies, or managing multiple pipelines that depend on each other. Those are orchestration problems, and they’re what tools like Airflow are built to solve.
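To give a sense of what one of those features involves, here is a minimal sketch of retrying with exponential backoff. It is not how Airflow implements retries; it is a plain-Python illustration of the idea. The name retry_with_backoff and its parameters are assumptions, and the sleep argument is injectable only so the waits can be observed in a test.

```python
import time


def retry_with_backoff(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `task`, retrying on any exception with exponentially growing waits.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last exception once max_attempts is used up.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Wrapping the API call in the extract step with something like this would cover transient failures, but it still leaves alerting and cross-pipeline dependencies unsolved, which is where an orchestrator earns its overhead.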

I’m not there yet. But I understand now why those tools exist in a way I didn’t before.

What’s Next

The pipeline is now running every day at 9am UTC. Data is being collected. And I’m starting to notice something: when you have a pipeline running daily, you start caring about the data it produces differently.

Are all the records clean? Are there repos slipping through with missing fields? Is the viral flag actually meaningful, or did I define it in a way that makes almost everything “No”?

These are data quality questions. And they’re the next wall I’m walking toward.

This is part of my ongoing series documenting my transition from systems analyst to data engineer. If you’ve been following along, thank you. If this is your first article in the series, the earlier ones are linked below.

From Data Analyst to Data Engineer: My 12-Month Self-Study Roadmap

I Built My First ETL Pipeline as a Complete Beginner. Here’s How.

I Thought Data Engineering Was Just Writing Scripts. I Was Wrong.

Connect with me on LinkedIn, YouTube, and Twitter.
