Awesome Data Engineering
Learning path and resources to become a data engineer
Best books, best courses and best articles on each subject.
Other sections: Data engineering best books
How to read it: First, not every subject needs to be mastered; look at the "essentiality" measure for each one. Then, each resource stands on its own with its own measures: "coverage" and "depth" are relative to the subject of the specific resource, not the entire category.
SQL
Essentiality: Querying data using SQL is an essential skill for anyone who works with data.
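As a taste of what that looks like in practice, here is a minimal sketch using Python's built-in sqlite3 module; the orders table and its rows are made up purely for illustration.

```python
import sqlite3

# In-memory database with a small, made-up table to practice queries on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.5)],
)

# A typical analytical query: total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
for customer, total in conn.execute(query):
    print(customer, total)
```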
Programming language
Essentiality: As a data engineer you'll be writing a lot of code to handle various business cases such as ETLs, data pipelines, etc. The de facto standard language for data engineering is Python (not to be confused with languages like R that are geared toward data science; they see little use in data engineering).
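To make "ETL code" concrete, here is a toy extract-transform-load pipeline in plain Python; the file names and the status/amount fields are hypothetical and stand in for whatever your source data looks like.

```python
import csv
import json

def extract(path):
    """Read raw rows from a CSV file (the path is hypothetical)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Keep only completed events and cast the amount field to a number."""
    for row in rows:
        if row.get("status") == "completed":
            row["amount"] = float(row["amount"])
            yield row

def load(rows, path):
    """Write the cleaned rows out as JSON lines."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    load(transform(extract("raw_events.csv")), "clean_events.jsonl")
```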
Relational Databases - Design & Architecture
Essentiality: RDBMS are the basic building blocks for any application's data. A data engineer should know how to design and architect their structures, and learn about the concepts related to them.
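For instance, here is a minimal sketch of a normalized one-to-many design with a foreign key, again using sqlite3; the customers/orders schema is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on

# One-to-many design: every order row references exactly one customer row.
conn.executescript("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount      REAL NOT NULL
);
""")
```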
NoSQL
Essentiality: NoSQL is an umbrella term for any non-relational database model: key-value, document, column, graph, and more. A basic acquaintance is required, but going deeper into any one model depends on the job (except columnar, covered in the next section).
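To make the key-value and document models concrete, here is a tiny in-memory sketch in plain Python; it mimics the two models rather than any particular database's API.

```python
# Key-value model: opaque values addressed only by their key.
kv_store = {}
kv_store["session:42"] = b"serialized session blob"
blob = kv_store["session:42"]

# Document model: values are nested, self-describing documents,
# queried by fields inside the document rather than by key alone.
doc_store = [
    {"_id": 1, "name": "alice", "tags": ["admin"]},
    {"_id": 2, "name": "bob", "tags": []},
]
admins = [doc for doc in doc_store if "admin" in doc["tags"]]
```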
Columnar Databases
Essentiality: Columnar databases are a kind of NoSQL database. They deserve their own section because they are essential for the data engineer: working with Big Data online (as opposed to offline batching) usually requires a columnar back-end.
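The difference from a row-oriented layout is easiest to see side by side; the records below are made up, and real columnar stores add compression and vectorized execution on top of this basic idea.

```python
# Row-oriented layout: each record is stored together, so reading one
# column still touches every record.
rows = [
    {"user": "alice", "country": "US", "amount": 30.0},
    {"user": "bob",   "country": "DE", "amount": 12.5},
]

# Column-oriented layout: each column is stored contiguously, so an
# analytical query like "total amount" scans only the column it needs.
columns = {
    "user":    ["alice", "bob"],
    "country": ["US", "DE"],
    "amount":  [30.0, 12.5],
}
total_amount = sum(columns["amount"])
```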
Data warehouses
Essentiality: Understand the concepts behind data warehouses and familiarize yourself with common data warehouse solutions.
OLAP Data modeling
Essentiality: Data modeling concepts for OLAP (analytical) databases, as used in data warehouses. Modeling the data correctly is essential for a functioning data warehouse.
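As a rough illustration of what "modeling the data correctly" means here, below is a minimal star-schema sketch (one fact table of measures surrounded by dimension tables), written as sqlite3 DDL; the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: a central fact table holds the measures, and foreign keys
# point at descriptive dimension tables used for slicing and grouping.
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    revenue     REAL
);
""")
```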
A note on data processing - Batch, MapReduce, Streaming
The next two categories are all about data processing mechanisms. We'll start with batch processing and MapReduce, typically done with Hadoop; this is considered the first generation of data processing. From there we'll move to stream processing, typically done with Spark. These subjects are deeply connected: for example, Spark can operate on HDFS, which is Hadoop's file system. Even though learning about batch processing with Hadoop may seem outdated, it is essential to understand the subject even if you plan to live the streaming data life.
Batch data processing & MapReduce
Essentiality: The "first" generation of data processing, using Hadoop and MapReduce. Everyone should know how it works, but going deep into the details and operations is recommended only if necessary. Focus more on streaming with tools like Spark today.
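The classic word-count example shows the map/shuffle/reduce phases; the sketch below runs them in plain Python rather than on Hadoop, just to illustrate the model.

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog"]  # made-up input

# Map phase: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```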
Stream data processing
Essentiality: The "next" generation of data processing. It's suggested to get a good grasp of the subject from the "Streaming Systems" book and then dive deep into a specific tool like Kafka, Spark, Flink, etc.
Pipeline / Workflow Management
Essentiality: Scheduling and orchestration tools for data processing. Airflow is considered the de facto standard, but any understanding of DAGs (directed acyclic graphs of tasks) will serve you well.
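For a feel of what a DAG definition looks like, here is a minimal Airflow sketch; the dag_id, task names, and callables are hypothetical, and the import path or schedule argument may differ between Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # import path varies by Airflow version

def extract():
    print("extracting...")  # placeholder task body

def load():
    print("loading...")  # placeholder task body

# Two tasks wired into a tiny DAG: extract must succeed before load runs.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # called schedule_interval in older Airflow releases
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```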
Security and privacy