Objective

Data Engineering is about designing systems that collect, store, and convert data from various sources into valuable information. In today’s world, organizations generate and collect vast amounts of data, and making this data useful for data scientists and analysts requires skilled people and the right technology.

The purpose of this document is to guide those interested in exploring the construction of data pipelines and the technologies involved in Data Engineering. You will play a pivotal role in an organization’s success by ensuring that data is readily accessible for data scientists, analysts, and decision-makers. Your proficiency in programming and problem-solving will be crucial for creating scalable solutions.

Top skills you need:

  1. Proficiency in programming languages, with a preference for Python and Java.
  2. Strong understanding of database architecture, data modeling, and processing.
  3. Mastery of SQL, the core of data engineering.
  4. Expertise in ETL (Extract, Transform, Load) processes.
  5. Solid knowledge of Big Data technologies and their components.
  6. Familiarity with Cloud services such as AWS, Azure, and GCP.
  7. Experience with orchestration tools like Jenkins, Oozie, and Airflow.
  8. Hands-on proficiency with BI tools including Tableau, QuickSight, and Superset.

Step-by-Step Guide to Becoming a Data Engineer

  • Python

To start with, learn Python: basic concepts such as variables, data types, conditional statements, and lambda functions, along with the pandas and NumPy packages. Install Python, set up a local environment, and start running and testing code in any of the available IDEs.
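
As a quick, hedged sketch of those basics (the values and column names below are made up for illustration), the following runs with just pandas and NumPy installed:

    # A quick tour of the basics: variables, conditionals, a lambda, NumPy, pandas.
    import numpy as np
    import pandas as pd

    row_count = 5                                      # variable with an integer type
    label = "small" if row_count < 10 else "large"     # conditional expression
    square = lambda x: x * x                           # lambda function
    print(label, square(row_count))                    # -> small 25

    values = np.array([10, 20, 30, 40, 50])            # NumPy: vectorized math
    print(values.mean())                               # -> 30.0

    # pandas: build a small DataFrame and filter it
    df = pd.DataFrame({"city": ["Austin", "Oslo", "Pune"], "temp_c": [30, 12, 28]})
    print(df[df["temp_c"] > 20])                       # rows where temp_c exceeds 20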

How to install Python: https://www.dataquest.io/blog/installing-python-on-mac/

Python documentation: The Python Tutorial

Video series on Python basics and OOP concepts:

Python Programming Beginner Tutorials – YouTube

Python OOP Tutorials – Working with Classes – YouTube

#0 Python for Beginners | Programming Tutorial

  • Linux

Next, learn basic Linux commands and shell scripting, which are used day to day for programming, building ETL pipelines, and writing validation scripts.
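
As one hedged illustration of where those commands show up, here is a small Python sketch (the file path is hypothetical) that shells out to the standard Linux utility wc to validate an extract before it moves further down a pipeline:

    # Minimal validation step: call a Linux utility before loading a file.
    import subprocess
    import sys

    path = "/data/exports/orders.csv"   # hypothetical file produced by an ETL job

    # `wc -l` counts lines; an empty or header-only file means the extract failed.
    result = subprocess.run(["wc", "-l", path], capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit(f"File missing or unreadable: {result.stderr.strip()}")

    line_count = int(result.stdout.split()[0])
    if line_count <= 1:
        sys.exit(f"Validation failed: {path} has only {line_count} line(s)")

    print(f"OK: {path} has {line_count} lines")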

Linux Journey

Linux/Unix Tutorial – javatpoint

  • Database Knowledge, SQL, and DWH Concepts

https://learn.udacity.com/courses/ud150

https://www.tutorialspoint.com/dbms/index.htm

https://youtube.com/playlist?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB

SQLBolt

SQLZOO

Advanced SQL:

  • SQL Functions, Subqueries, Rules, Views, Window functions
  • Nested Queries, string functions, pattern matching
  • Mathematical functions, Date-time functions, etc. 
  • Types of UDFs: inline table-valued and multi-statement table-valued functions.
  • Stored procedures, rank function, triggers, etc.
  • Record grouping, searching, sorting, etc. 
  • Clustered indexes, common table expressions.
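
The topics above can be practiced without installing a database server. The sketch below is one hedged example, using Python's built-in sqlite3 module (it assumes a SQLite build with window-function support, version 3.25 or later) to run a CTE plus RANK and SUM window functions over a throwaway table; the table and columns are invented for illustration:

    # Practice a CTE plus window functions locally with sqlite3 (SQLite 3.25+).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (region TEXT, amount REAL);
        INSERT INTO sales VALUES
            ('east', 100), ('east', 250), ('west', 80), ('west', 300), ('west', 120);
    """)

    query = """
        WITH regional AS (
            SELECT region, amount FROM sales
        )
        SELECT region,
               amount,
               RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
               SUM(amount) OVER (PARTITION BY region) AS region_total
        FROM regional
        ORDER BY region, amount_rank
    """

    for row in conn.execute(query):
        print(row)   # e.g. ('east', 250.0, 1, 350.0)
    conn.close()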

https://www.geeksforgeeks.org/structured-query-language/

https://downloads.yugabyte.com/marketing-assets/O-Reilly-SQL-Cookbook-2nd-Edition-Final.pdf

Complex SQL Questions for Interview Preparation – YouTube

  • Data Modeling

Start by learning what data modeling is, why it is required for a well-architected database, and how it helps when designing OLTP or OLAP systems. Learn the different types of data modeling, then build a dimensional model featuring dimension and fact tables. Design a database and an ETL process for transforming and loading data into the dimensional model, and optimize the ETL process so that data transfer happens quickly.
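
As a minimal, hypothetical sketch of a dimensional model (table and column names are illustrative only), the following creates one dimension table and one fact table in SQLite and loads a couple of rows in the usual dimensions-first order:

    # Tiny star schema: a customer dimension plus a sales fact that references it.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY,   -- surrogate key
            customer_id  TEXT,                  -- natural/business key
            name         TEXT,
            country      TEXT
        );
        CREATE TABLE fact_sales (
            sale_id      INTEGER PRIMARY KEY,
            customer_key INTEGER REFERENCES dim_customer(customer_key),
            sale_date    TEXT,
            amount       REAL
        );
    """)

    # ETL-style load: dimensions first, then facts that reference them.
    conn.execute("INSERT INTO dim_customer VALUES (1, 'C-1001', 'Asha', 'IN')")
    conn.execute("INSERT INTO fact_sales VALUES (1, 1, '2024-01-15', 199.99)")

    # A typical analytical query joins the fact table to its dimension.
    print(conn.execute("""
        SELECT d.country, SUM(f.amount)
        FROM fact_sales f JOIN dim_customer d USING (customer_key)
        GROUP BY d.country
    """).fetchall())
    conn.close()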

Dimensional Modeling Techniques – Kimball Group

Data Warehouse Tutorials Playlist – YouTube

Database Design in DBMS Tutorial: Learn Data Modeling

  • Big Data

Acquire expertise in the big data ecosystem and Apache Spark for handling substantial data volumes. Utilize Spark SQL for data querying and optimization. Develop an ETL pipeline to extract data from various sources like HDFS and S3, working with a variety of data formats including CSV, TXT, JSON, Parquet, fixed format, and streaming data. Execute tasks on a Spark Cluster in an AWS environment.
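
The sketch below shows the general shape of such a pipeline in PySpark, assuming pyspark is installed; the input path (a local CSV here, though an s3a:// or hdfs:// URI works the same way on a cluster) and the column names are hypothetical:

    # Minimal PySpark ETL: read CSV, transform, query with Spark SQL, write Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-etl").getOrCreate()

    # Extract: the same reader works for s3a:// or hdfs:// paths on a real cluster.
    orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

    # Transform: cast a column and filter out bad rows.
    cleaned = (orders
               .withColumn("amount", F.col("amount").cast("double"))
               .filter(F.col("amount") > 0))

    # Query the same data with Spark SQL.
    cleaned.createOrReplaceTempView("orders")
    daily = spark.sql("""
        SELECT order_date, SUM(amount) AS total
        FROM orders
        GROUP BY order_date
    """)

    # Load: write Parquet for downstream consumers.
    daily.write.mode("overwrite").parquet("output/daily_totals")
    spark.stop()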

  • Introduction to HDFS and Apache Spark
  • Spark Basics
  • Working with RDDs in Spark
  • Aggregating Data with Pair RDDs
  • Writing and Deploying Spark Applications
  • Parallel Processing
  • Spark RDD Persistence
  • Integrating Apache Flume and Apache Kafka
  • Spark Streaming
  • Improving Spark Performance
  • Spark SQL and Data Frames
  • Scheduling or Partitioning

https://spark.apache.org/documentation.html

Introduction to Apache Spark – Processing Big Data | Coursera

Introduction to Big Data with Spark and Hadoop

Big Data & Hadoop Full Course – Learn Hadoop In 10 Hours | Hadoop Tutorial For Beginners | Edureka

Spark Best Practices

  • ETL:
  • What is ETL, Use Cases, and Applications?
  • Why Do We Need ETL Tools?
  • Working with Different Data Sources—Relational Databases, NoSQL, HDFS, Stream Data, CSV Files, TXT Files, JSON or XML Files, and Fixed File Formats
  • Transformation of Data
  • Loading Data into a Data Model or File System
  • Using SQL for Data Transformation
  • Optimizing ETL Processes
  • Understanding ETL Architecture for Tracking the Data Flow and Data Pipelines
  • Understanding Data Quality Checks
  • A few recommended ETL tools are Airbyte, Pentaho, SSIS, Informatica, Fivetran, and DMS (a minimal hand-rolled sketch follows this list)
  • DBT (Data Build Tool):
  • What is DBT, Use Cases, and Applications?
  • Why Do We Need DBT?
  • How to transform data using DBT
  • How to write “models” that can be reused across multiple projects and data pipelines, reducing the time and effort required to develop and maintain new pipelines
  • Testing, validation, and documentation using DBT
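
To tie the ETL topics above together, here is the hand-rolled sketch mentioned in the tool list. It uses pandas and SQLite, assumes a hypothetical customers.csv with customer_id, email, and signup_date columns, and stands in for what a dedicated ETL tool or a DBT model would do in production:

    # Hand-rolled ETL: extract a CSV, transform with pandas, quality-check, load to SQLite.
    import sqlite3
    import pandas as pd

    # Extract (hypothetical source file).
    raw = pd.read_csv("data/customers.csv")

    # Transform: deduplicate, normalize text, add a derived column.
    clean = (raw
             .drop_duplicates(subset="customer_id")
             .assign(email=lambda d: d["email"].str.lower(),
                     signup_year=lambda d: pd.to_datetime(d["signup_date"]).dt.year))

    # Data-quality check before loading.
    if clean["customer_id"].isna().any():
        raise ValueError("Quality check failed: null customer_id found")

    # Load into a warehouse-like target (SQLite stands in for the warehouse here).
    with sqlite3.connect("warehouse.db") as conn:
        clean.to_sql("stg_customers", conn, if_exists="replace", index=False)
    print(f"Loaded {len(clean)} rows into stg_customers")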

ETL Design Process & Best Practices

 https://docs.getdbt.com/

Intro to Data Build Tool (dbt) // Create your first project!

  • Cloud Services:

Become proficient in constructing a highly scalable data warehouse on AWS. Build ETL pipelines that extract data from AWS RDS and other services and load it into an Amazon Redshift data warehouse (a small sketch of this load pattern follows the service list below).

  • AWS Data Storage Services—S3, S3 Glacier, Amazon DynamoDB
  • AWS Processing Services—AWS EMR, EMR Cluster, Hadoop, Hue with EMR, Spark with EMR, AWS Lambda, HCatalog, Glue, and Glue Lab
  • AWS Data Analysis Services—Amazon Redshift, Tuning Query Performance, Amazon ML, Amazon Athena, Amazon Elasticsearch, and ES Domain
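
Following on from the note above (the bucket, IAM role, cluster name, and table below are all hypothetical), one common pattern is to stage an extract in S3 with boto3 and then submit a Redshift COPY through the Redshift Data API:

    # Stage a file in S3, then COPY it into Redshift via the Redshift Data API.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file("exports/orders.csv", "my-staging-bucket", "orders/orders.csv")

    redshift = boto3.client("redshift-data")
    copy_sql = """
        COPY analytics.orders
        FROM 's3://my-staging-bucket/orders/orders.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV
        IGNOREHEADER 1
    """
    response = redshift.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="analytics",
        DbUser="etl_user",
        Sql=copy_sql,
    )
    print("Submitted COPY, statement id:", response["Id"])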

Jayendra’s Cloud Certification Blog

AWS re:Invent 2022 – Keynote with Adam Selipsky

https://www.youtube.com/playlist?list=PLS-kiDL9KrIgXkHG0C6joPeS6rq2mZAs5

  • Scheduler:

Gain expertise in scheduling, automating, and monitoring ETL pipelines using tools like Apache Airflow, Luigi, and Cron. Master the implementation of data quality checks for ETL processes in production, ensuring a robust process and architecture to prevent data quality-related ETL failures. Learn how to effectively address ETL failure issues in a production environment.
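
A minimal Airflow DAG sketch (assuming Airflow 2.x; the DAG id, task names, and callables are placeholders) with a daily schedule and a simple extract-then-load dependency:

    # Minimal Airflow 2.x DAG: two Python tasks on a daily schedule.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source system")

    def load():
        print("write data to the warehouse")

    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",   # cron expressions work here too
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> load_task    # load runs only after extract succeeds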

What is Airflow? — Airflow Documentation

Apache Airflow Tutorials – YouTube

https://github.com/aws/aws-mwaa-local-runner

  • Data Engineering DevOps

In the current data engineering landscape, it’s advantageous to learn DevOps tools such as Docker, Kubernetes, and Terraform; to practice setting up environments; to understand VPCs and subnets; and to become familiar with the AWS services relevant to DevOps.