What is ETL?

E - Extract

Extract raw data from various sources, such as logs, databases, APIs, crawlers, local files, and Google Sheets. These sources generally fall into three categories (a short extraction sketch follows the list):

  • Structured Data
    • RDB tables and other tabular data (CSV, Excel sheets)
  • Semi-Structured Data
    • JSON, XML, YAML
  • Unstructured Data
    • Free-form text files, raw logs
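For example, a minimal extraction sketch in pandas covering all three categories; the file paths and the API endpoint are hypothetical:

```python
import pandas as pd
import requests

# Structured: read a local tabular file (hypothetical path)
sales = pd.read_csv("data/sales.csv")

# Semi-structured: pull JSON from a hypothetical API endpoint
resp = requests.get("https://api.example.com/orders")
resp.raise_for_status()
orders = pd.json_normalize(resp.json())

# Unstructured: read raw log text, one line per record
with open("data/server.log", encoding="utf-8") as f:
    log_lines = f.readlines()
```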

T - Transform

Any processing performed on the data is considered a transformation.

  • Examples: filling nulls, dropping nulls, normalization, aggregation, etc. (sketched below)
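A hedged pandas sketch of those transformations, assuming the extracted data has hypothetical `amount` and `region` columns:

```python
import pandas as pd

def transform(orders: pd.DataFrame) -> pd.DataFrame:
    df = orders.copy()

    # Fill missing amounts with 0, then drop rows still missing a region
    df["amount"] = df["amount"].fillna(0)
    df = df.dropna(subset=["region"])

    # Min-max normalization (assumes amounts are not all identical)
    lo, hi = df["amount"].min(), df["amount"].max()
    df["amount_norm"] = (df["amount"] - lo) / (hi - lo)

    # Aggregation: total and average amount per region
    return df.groupby("region")["amount"].agg(total="sum", average="mean").reset_index()
```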

L - Load

Load the processed data into the target destination, which can be any of the same kinds of systems listed under Extract (a database, files, etc.).
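A minimal load step, using SQLite from the standard library as a stand-in for any target database; the table and file names are hypothetical:

```python
import sqlite3
import pandas as pd

def load(df: pd.DataFrame, db_path: str = "warehouse.db") -> None:
    # Write the transformed data to a table that analysts can query
    with sqlite3.connect(db_path) as conn:
        df.to_sql("region_summary", conn, if_exists="replace", index=False)
```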


Data Pipeline

  • Extract
    • Data Sources: Logs, Databases, APIs, Local Files, Google Sheets, etc.
  • Transform
    • Data Cleaning: Remove incomplete or incorrect records from the collected data.
    • Data Transformation: Convert the cleaned data into a format suitable for analysis; the result is often stored in a relational database.
  • Load
    • Data Storage: Store the transformed data in a location (e.g., a database) accessible to analysts for further analysis.
  • Application
    • BI Reports: Analyze the stored data to produce BI reports or visualizations.
    • Decision Making: Use insights from the reports to support decision-making processes.
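Wiring the stages together in one callable keeps the pipeline easy to test and schedule. A minimal sketch, assuming the earlier extraction code is wrapped in an extract() function alongside the transform() and load() sketches above:

```python
def run_pipeline() -> None:
    raw = extract()          # pull raw data from the sources
    clean = transform(raw)   # clean and reshape for analysis
    load(clean)              # persist where analysts can reach it

if __name__ == "__main__":
    run_pipeline()
```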

Tools for ETL

1. Python Libraries

  • Pandas: For cleaning and transforming data.
  • Database Connectors:
    • pymysql (MySQL)
    • pymongo (MongoDB)
  • Web Scraping:
    • requests, Beautiful Soup, selenium
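For instance, a small scraping sketch combining requests and Beautiful Soup; the URL and CSS selector are hypothetical:

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/products")
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Collect product names from a hypothetical listing page
names = [tag.get_text(strip=True) for tag in soup.select("h2.product-name")]
```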

2. ETL Orchestration Tools

  • Airflow: Schedules and monitors ETL workflows.
  • Prefect: Simplifies building and debugging ETL pipelines.
  • dbt: Focuses on transforming data with SQL.
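As an example of orchestration, a minimal Airflow 2.x DAG that runs the three stages daily (the my_etl module is hypothetical; older Airflow versions spell the schedule parameter schedule_interval):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from my_etl import extract, transform, load  # hypothetical module

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Enforce strict ordering: extract -> transform -> load.
    # In practice, tasks hand data off via external storage, not return values.
    t_extract >> t_transform >> t_load
```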

3. Big Data Frameworks

  • Spark: For large-scale data processing.
  • Flink: For real-time data streaming.
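A short PySpark sketch of the same ETL pattern at scale; paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Extract: read a large partitioned dataset (hypothetical path)
df = spark.read.parquet("s3://bucket/events/")

# Transform: drop invalid rows and aggregate per user
summary = (
    df.filter(F.col("amount") > 0)
      .groupBy("user_id")
      .agg(F.sum("amount").alias("total_amount"))
)

# Load: write the results back out for downstream use
summary.write.mode("overwrite").parquet("s3://bucket/summaries/")
```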