

Data Engineering is the practice of designing, building, and maintaining systems that collect, store, process, and convert raw data from a variety of sources into usable information. Every organisation generates and collects massive amounts of data, and it needs the right people and technology to ensure that data is in a highly usable state by the time it reaches data scientists and analysts.
The objective of this article is to show where to start in Data Engineering if you want to explore how data pipelines are built and which technologies are involved.
A career in this field can be both rewarding and challenging. You’ll play an important role in an organisation’s success, providing easier access to data that data scientists, analysts, and decision-makers need to do their jobs. You’ll rely on your programming and problem-solving skills to create scalable solutions.
Top skills you need:
- Proficiency in programming languages (Python, Java preferred)
- Strong grasp of database architecture, data modelling, and data processing
- Mastery of SQL. It's the heart of data engineering.
- Mastery of ETL (Extract, Transform, Load)
- Good knowledge of Big Data, its components, and its processing
- Good knowledge of cloud services (AWS, Azure, GCP)
- Orchestration tools for scheduling (Jenkins, Oozie, Airflow)
- Good hands-on knowledge of BI tools (Tableau, QuickSight, Superset)
Step-by-Step Guide to Becoming a Data Engineer
1. Python
- Introduction to Python and IDEs – The basics of the Python programming language and how to use various IDEs for Python development, such as Jupyter, PyCharm, etc.
- Python Basics – Variables, data types, loops, conditional statements, functions, decorators, lambda functions, file handling, exception handling, etc.
- Object-Oriented Programming – Introduction to OOP concepts such as classes, objects, inheritance, abstraction, polymorphism, encapsulation, etc.
How to install Python: https://www.dataquest.io/blog/installing-python-on-mac/
Python Docs: The Python Tutorial
Video series on Python basics and OOP concepts:
Python Programming Beginner Tutorials – YouTube
Python OOP Tutorials – Working with Classes – YouTube
#0 Python for Beginners | Programming Tutorial
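To tie a few of these concepts together, here is a minimal, self-contained sketch (the file name and class names are purely illustrative) touching classes, inheritance, file handling, exception handling, and a lambda:

```python
class DataSource:
    def __init__(self, name):
        self.name = name

    def read(self):
        raise NotImplementedError("Subclasses must implement read()")


class CsvSource(DataSource):
    """Reads comma-separated rows from a text file."""

    def __init__(self, name, path):
        super().__init__(name)          # inheritance: reuse the parent constructor
        self.path = path

    def read(self):
        try:
            with open(self.path) as f:  # file handling
                return [line.strip().split(",") for line in f if line.strip()]
        except FileNotFoundError:       # exception handling
            return []


if __name__ == "__main__":
    source = CsvSource("orders", "orders.csv")   # hypothetical file name
    rows = source.read()
    # lambda + built-in sorting: order rows by their first column
    print(sorted(rows, key=lambda row: row[0]))
```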
2. Linux
- Introduction to Linux – Establish fundamental knowledge of how Linux works and how to get started with the Linux OS.
- Linux Basics – File handling, data extraction, etc.
Linux/Unix Tutorial – javatpoint
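Much of day-to-day Linux work is chaining small command-line tools together; as a tiny illustration (assuming a Linux machine with the standard GNU utilities and a hypothetical app.log file), here is how you might drive them from Python:

```python
import subprocess

# Count the ERROR lines in a log file: grep filters, wc -l counts.
result = subprocess.run(
    "grep 'ERROR' app.log | wc -l",
    shell=True, capture_output=True, text=True,
)
print("error lines:", result.stdout.strip())
```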
3. Databases, SQL and DWH Concepts
- Learn the internal distribution mechanism of MPP databases
- Data warehouse (DWH) concepts
- Fundamentals of Structured Query Language
- SQL tables, joins, variables
https://learn.udacity.com/courses/ud150
https://www.tutorialspoint.com/dbms/index.htm
https://youtube.com/playlist?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB
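As a hands-on starting point, here is a minimal sketch using SQLite (which ships with Python) and made-up customer/order tables, showing table definitions, a join, and an aggregate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 99.5), (12, 2, 40.0);
""")

# An inner join: total order amount per customer.
for name, total in conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""):
    print(name, total)
```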
Advanced SQL
- SQL Functions, Subqueries, Rules, Views, Window functions
- Nested Queries, string functions, pattern matching
- Mathematical functions, Date-time functions, etc.
- Types of UDFs: inline table-valued and multi-statement table-valued functions.
- Stored procedures, rank function, triggers, etc.
- Record grouping, searching, sorting, etc.
- Clustered indexes, common table expressions.
https://www.geeksforgeeks.org/structured-query-language/
https://downloads.yugabyte.com/marketing-assets/O-Reilly-SQL-Cookbook-2nd-Edition-Final.pdf
Complex SQL Questions for Interview Preparation – YouTube
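To see a couple of the constructs above in action, here is a small sketch of a common table expression plus a window function, again on SQLite (window functions need SQLite 3.25+, which recent Python builds bundle); the sales table is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('south', '2023-01', 100), ('south', '2023-02', 180),
        ('north', '2023-01', 250), ('north', '2023-02', 90);
""")

query = """
WITH ranked AS (                                  -- common table expression
    SELECT region, month, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk  -- window function
    FROM sales
)
SELECT region, month, amount FROM ranked WHERE rnk = 1   -- best month per region
"""
for row in conn.execute(query):
    print(row)
```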
4. Data Modeling
- Understand the difference between SQL and NoSQL. Create relational data models and NoSQL-based data models from business reporting requirements. Work with ETL tools to push the data into the model.
- Work with MS SQL Server and MongoDB to create databases, and use ETL tools for data extraction, transformation, and loading into the models.
- Try creating a dimensional model with dimension and fact tables. Create the database and the ETL for transforming and loading data into the dimensional model. Optimise the ETL process for faster data loading.
Dimensional Modeling Techniques – Kimball Group
Data Warehouse Tutorials Playlist – YouTube
Database Design in DBMS Tutorial: Learn Data Modeling
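For a concrete picture of a dimensional model, here is a minimal star-schema sketch with two dimension tables and one fact table (hypothetical table and column names, with SQLite used only for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key     INTEGER PRIMARY KEY,   -- e.g. 20230115
        full_date    TEXT,
        month        INTEGER,
        year         INTEGER
    );

    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );

    -- The fact table references the dimensions and holds the measures.
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        quantity     INTEGER,
        amount       REAL
    );
""")
print("star schema tables:", [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
```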
5. Big Data
Learn the big data ecosystem and Apache Spark to load large volumes of data. Work with Spark SQL for querying data and optimising those queries. Build an ETL pipeline that pulls data from different sources, such as HDFS and S3, and loads data files in different formats, such as CSV, TXT, JSON, fixed-width, and streaming data. Work on a Spark cluster using AWS.
- Introduction to HDFS and Apache Spark
- Spark Basics
- Working with RDDs in Spark
- Aggregating Data with Pair RDDs
- Writing and Deploying Spark Applications
- Parallel Processing
- Spark RDD Persistence
- Integrating Apache Flume and Apache Kafka
- Spark Streaming
- Improving Spark Performance
- Spark SQL and Data Frames
- Scheduling or Partitioning
https://spark.apache.org/documentation.html
Introduction to Apache Spark – Processing Big Data | Coursera
Introduction to Big Data with Spark and Hadoop
Big Data & Hadoop Full Course – Learn Hadoop In 10 Hours | Hadoop Tutorial For Beginners | Edureka
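Here is a minimal PySpark sketch of the kind of pipeline described above; it assumes pyspark is installed, and the file path and column names (orders.csv, order_date, amount) are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical input path; could equally be "hdfs://..." or "s3a://bucket/key".
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Spark SQL on top of the same DataFrame.
orders.createOrReplaceTempView("orders")
daily = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# Equivalent DataFrame API, then write out partitioned Parquet.
daily_df = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))
daily_df.write.mode("overwrite").partitionBy("order_date").parquet("daily_sales/")

spark.stop()
```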
6. ETL
- What is ETL, Use Cases, and Applications?
- Why Do We Need ETL Tools?
- Working with Different Data Sources – Relational Databases, NoSQL, HDFS, Stream Data, CSV Files, TXT Files, JSON or XML Files, and Fixed File Formats
- Transformation of Data
- Loading Data into a Data Model or File System
- Using SQL for Data Transformation
- Optimizing ETL Processes
- Understanding ETL Architecture for Tracking the Data Flow and Data Pipelines
- Understanding Data Quality Checks
- A few recommended ETL tools are Airbyte, Pentaho, SSIS, Informatica, Fivetran, and DMS.
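Before reaching for any of these tools, it helps to understand the mechanics; below is a minimal hand-rolled extract/transform/load sketch in plain Python (a hypothetical sales.csv with order_id and amount columns, loaded into SQLite), with a basic data quality check in the transform step:

```python
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        # Basic data-quality check: skip rows with a missing amount.
        if not row.get("amount"):
            continue
        row["amount"] = round(float(row["amount"]), 2)
        yield row

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (order_id, amount) VALUES (?, ?)",
        ((r["order_id"], r["amount"]) for r in rows),
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("sales.csv")), conn)
```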
7. DBT (Data Build Tool)
- What is DBT, Use Cases, and Applications?
- Why Do We Need DBT?
- How to transform data using DBT
- How to write "models" that can be reused across multiple projects and data pipelines, reducing the time and effort required to develop and maintain new pipelines.
- Testing, validations and documentation using DBT
ETL Design Process & Best Practices
Intro to Data Build Tool (dbt) // Create your first project!
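For a feel of what dbt artefacts look like, here is a sketch that writes out a toy model and its schema.yml from Python, purely for illustration (in practice you create a project with dbt init and edit these files directly; stg_orders here is a hypothetical upstream model):

```python
from pathlib import Path

models = Path("models")
models.mkdir(exist_ok=True)

# A dbt model is just a SELECT statement; {{ ref(...) }} points at another model.
(models / "daily_sales.sql").write_text(
    "SELECT order_date, SUM(amount) AS total_amount\n"
    "FROM {{ ref('stg_orders') }}\n"
    "GROUP BY order_date\n"
)

# Tests and documentation live alongside the model in a YAML file.
(models / "schema.yml").write_text(
    "version: 2\n"
    "models:\n"
    "  - name: daily_sales\n"
    "    description: 'Daily revenue per order date'\n"
    "    columns:\n"
    "      - name: order_date\n"
    "        tests: [not_null, unique]\n"
)
```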
8. Cloud Services
Master the skills of building a highly scalable data warehouse on AWS. Work with Redshift, pull data from RDS and other AWS services using an ETL pipeline, and load the data into the data warehouse.
- AWS Data Storage Services—S3, S3 Glacier, Amazon DynamoDB
- AWS Processing Services—AWS EMR, EMR Cluster, Hadoop, Hue with EMR, Spark with EMR, AWS Lambda, HCatalog, Glue, and Glue Lab
- AWS Data Analysis Services—Amazon Redshift, Tuning Query Performance, Amazon ML, Amazon Athena, Amazon Elasticsearch, and ES Domain
Jayendra’s Cloud Certification Blog
AWS re:Invent 2022 – Keynote with Adam Selipsky
https://www.youtube.com/playlist?list=PLS-kiDL9KrIgXkHG0C6joPeS6rq2mZAs5
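Here is a minimal boto3 sketch of the first hops of such a pipeline, landing a file in S3 and querying it with Athena; it assumes AWS credentials are configured, and the bucket, database, and table names are hypothetical:

```python
import boto3

# Land a raw file in S3, the usual entry point of an AWS data pipeline.
s3 = boto3.client("s3")
s3.upload_file("sales.csv", "my-data-lake-bucket", "raw/sales/sales.csv")

# Query it with Athena (e.g. via a Glue-catalogued external table over that prefix).
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) FROM sales GROUP BY order_date",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-bucket/athena-results/"},
)
```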
9. Scheduler
Learn to schedule, automate, and monitor ETL pipelines with Apache Airflow, Luigi, and cron. Master implementing data quality checks and processes for running ETL in a production environment. Understand and create a robust process and architecture to avoid ETL failures caused by data quality issues, and learn how to handle ETL failures in production.
What is Airflow? — Airflow Documentation
Apache Airflow Tutorials – YouTube
https://github.com/aws/aws-mwaa-local-runner
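Here is a minimal Airflow DAG sketch (assuming Airflow 2.x) with two placeholder tasks, wired so the load only runs after the extract succeeds:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def load():
    print("load data into the warehouse")

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",   # run once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task     # load runs only after extract succeeds
```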
10. Data Engineering DevOps
In today's data engineering environment, it is also worth learning DevOps tools such as Docker, Kubernetes, and Terraform: how to set up the environment, what a VPC and a subnet are, and the AWS services related to DevOps.
The Data Org at Yubi is a mature one, spread across various levels. We use most of the technologies mentioned above to solve challenges and to streamline data pertaining to all of Yubi's products. The Data team works cross-functionally across Yubi, providing dashboards, DS models, and recommendations, and integrating with downstream systems as well as clients' systems. If you are keen to start your career in Data Engineering and want to join us, skill up and look for a relevant opportunity here.