newsaiworld
From Configuration to Orchestration: Building an ETL Workflow with AWS Is No Longer a Struggle

By Admin
June 22, 2025
in Artificial Intelligence

AWS continues to lead the cloud industry with a whopping 32% share thanks to its early market entry, robust technology and comprehensive service offerings. However, many users find AWS difficult to navigate, and this discontent has led more companies and organisations to prefer its rivals, Microsoft Azure and Google Cloud Platform.

Despite its steeper learning curve and less intuitive interface, AWS remains the top cloud service thanks to its reliability, hybrid cloud capabilities and the breadth of its service offerings. More importantly, choosing the right strategies can significantly reduce configuration complexity, streamline workflows, and boost performance.

In this article, I'll introduce an efficient way to set up a complete ETL pipeline with orchestration on AWS, based on my own experience. It should also give you a refreshed view of building data pipelines with AWS, or make configuration feel like less of a struggle if this is your first time using AWS for such tasks.

Strategy for Designing an Efficient Data Pipeline

AWS has the most comprehensive ecosystem, with a vast catalogue of services. Building a production-ready data warehouse on AWS requires at least the following services:

  • IAM – Although this service isn't part of any workflow step itself, it's the foundation for accessing all the other services.
  • AWS S3 – data lake storage
  • AWS Glue – ETL processing
  • Amazon Redshift – data warehouse
  • CloudWatch – monitoring and logging
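As an illustration of the IAM point, a development-stage policy attached to the pipeline's role might look roughly like the sketch below. This is an assumption for illustration only: the bucket name my-dev-bucket is an example, and a production policy should be scoped far more tightly.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DataLakeAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-dev-bucket",
        "arn:aws:s3:::my-dev-bucket/*"
      ]
    },
    {
      "Sid": "GlueJobs",
      "Effect": "Allow",
      "Action": ["glue:StartJobRun", "glue:GetJobRun"],
      "Resource": "*"
    }
  ]
}
```

Granting each service only the actions it needs up front saves a lot of opaque permission errors later in the workflow.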

You also need access to Airflow if you want to schedule more complex dependencies and implement advanced retries and error handling, although Redshift itself can handle some basic cron jobs.

To make your work easier, I highly recommend installing an IDE (Visual Studio Code or PyCharm, or of course your own favourite). An IDE dramatically improves your efficiency with complex Python code, local testing/debugging, version control integration and team collaboration. In the next section, I'll show the step-by-step configuration.

Initial Setup

Here are the initial configuration steps:

  • Launch a virtual environment in your IDE
  • Install dependencies – basically, the libraries that will be used later on:
pip install apache-airflow==2.7.0 boto3 pandas pyspark sqlalchemy
  • Install the AWS CLI – this lets you script various AWS operations and manage AWS resources more efficiently.
  • AWS configuration – make sure to enter your IAM user credentials when prompted:
    • AWS Access Key ID: from your IAM user
    • AWS Secret Access Key: from your IAM user
    • Default region: us-east-1 (or your preferred region)
    • Default output format: json
  • Integrate Airflow – here are the steps:
    • Initialize Airflow
    • Create DAG files in Airflow
    • Run the web server at http://localhost:8080 (login: admin/admin)
    • Open another terminal tab and start the scheduler
export AIRFLOW_HOME=$(pwd)/airflow
airflow db init                  # initialize Airflow's metadata database
airflow users create \
  --username admin \
  --password admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email [email protected]
airflow webserver --port 8080    # run the webserver
airflow scheduler                # start the scheduler (in another terminal)
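The virtual-environment step above can be sketched in a terminal as follows (assuming a Unix-like shell; the .venv directory name is just a convention):

```shell
python3 -m venv .venv            # create an isolated environment
source .venv/bin/activate        # activate it (on Windows: .venv\Scripts\activate)
python -m pip --version          # confirm pip now resolves inside the venv
```

With the environment active, the pip install command from the dependency step installs everything into .venv rather than your system Python.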

Development Workflow: COVID-19 Data Case Study

I'm using JHU's public COVID-19 dataset (CC BY 4.0 licensed) for demonstration purposes. You can refer to the data here.

The chart below shows the workflow from data ingestion through to loading the data into Redshift tables in the development environment.

Development workflow created by author

Data Ingestion

In the first step of ingesting data into AWS S3, I processed the data by melting it to long format and converting the date format. I saved the data in Parquet format to improve storage efficiency, enhance query performance and reduce storage costs. The code for this step is below:

import pandas as pd
from datetime import datetime
import os
import boto3
import sys

def process_covid_data():
    try:
        # Load raw data
        url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
        df = pd.read_csv(url)
        
        # --- Data Processing ---
        # 1. Melt to long format
        df = df.melt(
            id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
            var_name='date_str',
            value_name='confirmed_cases'
        )
        
        # 2. Convert dates (JHU format: MM/DD/YY)
        df['date'] = pd.to_datetime(
            df['date_str'], 
            format='%m/%d/%y',
            errors='coerce'
        )
        df = df.dropna(subset=['date'])  # drop rows whose dates failed to parse
        
        # 3. Save as partitioned Parquet
        output_dir = "covid_processed"
        df.to_parquet(
            output_dir,
            engine='pyarrow',
            compression='snappy',
            partition_cols=['date']
        )
        
        # 4. Upload to S3
        s3 = boto3.client('s3')
        total_files = 0
        
        for root, _, files in os.walk(output_dir):
            for file in files:
                local_path = os.path.join(root, file)
                s3_path = os.path.join(
                    'raw/covid/',
                    os.path.relpath(local_path, output_dir)
                )
                s3.upload_file(
                    Filename=local_path,
                    Bucket='my-dev-bucket',
                    Key=s3_path
                )
            total_files += len(files)
        
        print(f"Successfully processed and uploaded {total_files} Parquet files")
        print(f"Data covers from {df['date'].min()} to {df['date'].max()}")
        return True

    except Exception as e:
        print(f"Error: {str(e)}", file=sys.stderr)
        return False

if __name__ == "__main__":
    process_covid_data()

After running the Python code, you should be able to see the Parquet files in the S3 bucket, under the 'raw/covid/' prefix.
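As a quick illustration of what the melt step does, here is a minimal, self-contained sketch on a tiny made-up frame (the values are hypothetical, not real JHU data):

```python
import pandas as pd

# A tiny wide-format frame mimicking the JHU layout (hypothetical values)
wide = pd.DataFrame({
    'Province/State': ['Hubei'],
    'Country/Region': ['China'],
    'Lat': [30.97],
    'Long': [112.27],
    '1/22/20': [444],
    '1/23/20': [444],
})

# Melt the date columns into rows, as in the ingestion script above
long_df = wide.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    var_name='date_str',
    value_name='confirmed_cases',
)

# Parse the MM/DD/YY date strings into proper datetimes
long_df['date'] = pd.to_datetime(long_df['date_str'], format='%m/%d/%y')

print(long_df[['date_str', 'date', 'confirmed_cases']])
```

Each date column becomes one row per location, which is exactly what makes the later partitioning by date possible.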

Screenshot by author

ETL Pipeline Development

AWS Glue is mainly used for ETL pipeline development. Although it can also ingest data that hasn't yet landed in S3, its strength lies in processing data once it's in S3 for data warehousing purposes. Here's the PySpark script for the data transformation:

# transform_covid.py
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import current_date

glueContext = GlueContext(SparkContext.getOrCreate())
df = glueContext.create_dynamic_frame.from_options(
    "s3",
    {"paths": ["s3://my-dev-bucket/raw/covid/"]},
    format="parquet"
).toDF()

# Add transformations here
df_transformed = df.withColumn("load_date", current_date())

# Write to the processed zone
df_transformed.write.parquet(
    "s3://my-dev-bucket/processed/covid/",
    mode="overwrite"
)
Screenshot by author

The next step is to load the data into Redshift. In the Redshift console, click "Query Editor v2" on the left side, where you can edit your SQL and run the Redshift COPY.

-- Create a table covid_data in the dev schema
CREATE TABLE dev.covid_data (
    "Province/State" VARCHAR(100),
    "Country/Region" VARCHAR(100),
    "Lat" FLOAT8,
    "Long" FLOAT8,
    date_str VARCHAR(100),
    confirmed_cases FLOAT8
)
DISTKEY("Country/Region")
SORTKEY(date_str);

-- COPY data into Redshift
COPY dev.covid_data (
    "Province/State",
    "Country/Region",
    "Lat",
    "Long",
    date_str,
    confirmed_cases
)
FROM 's3://my-dev-bucket/processed/covid/'
IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
REGION 'your-region'
FORMAT PARQUET;

You'll then see the data successfully loaded into the data warehouse.

Screenshot by author
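To sanity-check the load, a quick query like the sketch below (assuming the dev.covid_data table defined above) confirms that rows arrived and how many distinct dates they cover:

```sql
-- Verify the load: row count and distinct dates
SELECT COUNT(*)                AS total_rows,
       COUNT(DISTINCT date_str) AS distinct_dates
FROM dev.covid_data;
```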

Pipeline Automation

The easiest way to automate your data pipeline is to schedule jobs in Redshift Query Editor v2 by creating a stored procedure (I have a more detailed introduction to SQL stored procedures; you can refer to this article).

CREATE OR REPLACE PROCEDURE dev.run_covid_etl()
AS $$
BEGIN
  TRUNCATE TABLE dev.covid_data;
  COPY dev.covid_data
  FROM 's3://simba-dev-bucket/raw/covid'
  IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
  REGION 'your-region'
  FORMAT PARQUET;
END;
$$ LANGUAGE plpgsql;
Screenshot by author

Alternatively, you can run Airflow for scheduled jobs.

from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator

default_args = {
    'owner': 'data_team',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
    'retries': 2
}

with DAG(
    'redshift_etl_dev',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
) as dag:

    run_etl = RedshiftSQLOperator(
        task_id='run_covid_etl',
        redshift_conn_id='redshift_dev',
        sql='CALL dev.run_covid_etl()',
    )

Production Workflow

An Airflow DAG is powerful for orchestrating your entire ETL pipeline when there are many dependencies, and it's also good practice in a production environment.

After developing and testing your ETL pipeline, you can automate your tasks in the production environment using Airflow.

Production workflow created by author

Here is a checklist of the key preparation steps for a successful deployment in Airflow:

  • Create the S3 bucket my-prod-bucket
  • Create the Glue job prod_covid_transformation in the AWS Console
  • Create the Redshift stored procedure prod.load_covid_data()
  • Configure Airflow
  • Configure SMTP for emails in airflow.cfg
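For the SMTP step, the relevant section of airflow.cfg looks roughly like this (a sketch; the host, port and credentials are placeholders for your own provider's settings):

```ini
[smtp]
smtp_host = smtp.example.com
smtp_starttls = True
smtp_ssl = False
smtp_user = your-smtp-user
smtp_password = your-smtp-password
smtp_port = 587
smtp_mail_from = airflow@example.com
```

Without this block, the EmailOperator in the production DAG has nowhere to send notifications through.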

Then the deployment of the data pipeline in Airflow looks like this:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
from airflow.operators.email import EmailOperator

# 1. DAG CONFIGURATION
default_args = {
    'owner': 'data_team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2023, 1, 1)
}

# 2. DATA INGESTION FUNCTION
def load_covid_data():
    import pandas as pd

    url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
    df = pd.read_csv(url)

    df = df.melt(
        id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
        var_name='date_str',
        value_name='confirmed_cases'
    )
    df['date'] = pd.to_datetime(df['date_str'], format='%m/%d/%y')

    # Writing straight to S3 requires s3fs to be installed
    df.to_parquet(
        's3://my-prod-bucket/raw/covid/',
        engine='pyarrow',
        partition_cols=['date']
    )

# 3. DAG DEFINITION
with DAG(
    'covid_etl',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
) as dag:

    # Task 1: Ingest data
    ingest = PythonOperator(
        task_id='ingest_data',
        python_callable=load_covid_data
    )

    # Task 2: Transform with Glue
    transform = GlueJobOperator(
        task_id='transform_data',
        job_name='prod_covid_transformation',
        script_args={
            '--input_path': 's3://my-prod-bucket/raw/covid/',
            '--output_path': 's3://my-prod-bucket/processed/covid/'
        }
    )

    # Task 3: Load to Redshift
    load = RedshiftSQLOperator(
        task_id='load_data',
        sql="CALL prod.load_covid_data()"
    )

    # Task 4: Notifications
    notify = EmailOperator(
        task_id='send_email',
        to='your-email-address',
        subject='ETL Status: {{ ds }}',
        html_content='ETL job completed: View Logs'
    )

    # Run the tasks in order
    ingest >> transform >> load >> notify

My Final Thoughts

Although some users, especially those who are new to the cloud and looking for simple solutions, tend to be daunted by AWS's high barrier to entry and overwhelmed by its vast choice of services, it's worth the time and effort, and here's why:

  • The process of configuring, designing, building and testing data pipelines gives you a deep understanding of a typical data engineering workflow. These skills will benefit you even if you deliver your projects on other cloud services, such as Azure, GCP or Alibaba Cloud.
  • AWS's mature ecosystem and the huge array of services it offers let users customize their data architecture strategies and enjoy more flexibility and scalability in their projects.

Thanks for reading! I hope this article helps you build your cloud-based data pipeline!
