Awesome Data Engineering

Learning path and resources to become a data engineer

Best books, best courses and best articles on each subject.

How to read this: First, not every subject needs to be mastered. Each subject carries an "essentiality" measure; use it to decide how deep to go. Each resource then stands on its own measurements: "coverage" and "depth" are relative to the subject of that specific resource, not to the entire category.

SQL

Querying data using SQL is an essential skill for anyone who works with data.

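To make this concrete, here's a minimal sketch of the kind of query you'll write daily, run through Python's built-in sqlite3 module. The table and data are invented for illustration:

```python
import sqlite3

# A toy in-memory database; table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)

# The bread and butter of data work: filter, aggregate, group, order.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
):
    print(customer, total)
```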

Programming language

As a data engineer you'll be writing a lot of code to handle various business cases such as ETLs, data pipelines, and so on. The de facto standard language for data engineering is Python (not to be confused with R or Nim, which are used for data science and have no use in data engineering).

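Much of that code takes the extract-transform-load shape. Here's a minimal sketch using only the standard library; the input data and field names are made up:

```python
import csv
import io

# Hypothetical raw input standing in for an extract from some source system.
raw = io.StringIO("user,signup_date\nalice,2021-01-03\nBOB,2021-02-11\n")

# Extract: parse the source into records.
rows = list(csv.DictReader(raw))

# Transform: normalize casing and derive a new field.
for row in rows:
    row["user"] = row["user"].lower()
    row["signup_year"] = row["signup_date"][:4]

# Load: here we just print; a real pipeline would write to a database or file.
for row in rows:
    print(row)
```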

Relational Databases - Design & Architecture

RDBMSs are the basic building blocks of any application's data. A data engineer should know how to design and architect their structures, and learn about the concepts related to them.

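As a small illustration of what "designing the structure" means in practice, here's a hypothetical two-table schema showing normalization and foreign-key integrity in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# Normalized design: customers live in one table, and orders reference them
# by key instead of repeating customer details on every order row.
conn.executescript("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount      REAL NOT NULL
);
""")

conn.execute("INSERT INTO customers (email) VALUES ('alice@example.com')")
conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 30.0)")
# This would raise sqlite3.IntegrityError: there is no customer with id 999.
# conn.execute("INSERT INTO orders (customer_id, amount) VALUES (999, 5.0)")
```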

NoSQL

NoSQL is an umbrella term for any non-relational database model: key-value, document, column, graph, and more. A basic acquaintance is required, but going deeper into any particular model depends on the job (except for columnar databases, covered in the next section).

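To ground the terminology, here's a toy sketch of the key-value and document models using plain Python dictionaries. Real systems such as Redis or MongoDB add persistence, distribution, and query languages on top:

```python
import json

# Key-value model: opaque values addressed by key (think Redis, DynamoDB).
kv_store = {}
kv_store["session:42"] = "alice"

# Document model: self-describing nested records addressed by key
# (think MongoDB); documents in a collection need not share a schema.
doc_store = {}
doc_store["user:alice"] = json.dumps({
    "name": "alice",
    "tags": ["admin", "beta"],
    "address": {"city": "Tel Aviv"},
})

print(kv_store["session:42"])
print(json.loads(doc_store["user:alice"])["address"]["city"])
```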

Columnar Databases

Columnar databases are a kind of NoSQL database. They deserve their own section because they are essential for the data engineer: working with Big Data online (as opposed to offline batching) usually requires a columnar back-end.

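The core idea fits in a few lines. Here's a sketch contrasting row-oriented and column-oriented layouts, and why analytical scans favor the latter:

```python
# Row-oriented layout: each record is stored together (good for point lookups).
rows = [
    {"user": "alice", "country": "IL", "amount": 30.0},
    {"user": "bob",   "country": "US", "amount": 12.5},
]

# Column-oriented layout: each column is stored together (good for analytics).
columns = {
    "user":    ["alice", "bob"],
    "country": ["IL", "US"],
    "amount":  [30.0, 12.5],
}

# An aggregation touches only the one column it needs. That locality (plus
# per-column compression) is why columnar engines such as Redshift, BigQuery,
# or ClickHouse shine at large scans.
print(sum(columns["amount"]))
```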

Data warehouses

Understand the concepts behind data warehouses and familiarize yourself with common data warehouse solutions.


OLAP Data modeling

Data modeling concepts for the OLAP (analytical) databases used in data warehouses. Modeling the data correctly is essential for a functioning data warehouse.

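The classic OLAP modeling pattern is the star schema: a central fact table of measurements joined to descriptive dimension tables. A minimal sketch in SQLite; all table names and data are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: fact_sales holds the measurements; dim_date and dim_product
# describe them. Queries join the fact table out to its dimensions.
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units       INTEGER,
    revenue     REAL
);
""")

conn.execute("INSERT INTO dim_date VALUES (20240101, 2024, 1)")
conn.execute("INSERT INTO dim_product VALUES (1, 'widget', 'gadgets')")
conn.execute("INSERT INTO fact_sales VALUES (20240101, 1, 3, 29.97)")

# A typical OLAP question: revenue by category and year.
query = """
SELECT p.category, d.year, SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d    ON d.date_key    = f.date_key
GROUP BY p.category, d.year
"""
for row in conn.execute(query):
    print(row)  # ('gadgets', 2024, 29.97)
```
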
(A note)

Data processing - Batch, MapReduce, Streaming

The next two categories are all about data processing mechanisms. We'll start with batch processing and MapReduce, typically done with Hadoop; this is considered the first generation of data processing. From there we'll move on to stream processing, typically done with Spark. These subjects are deeply connected. For example, Spark can operate on HDFS, which is Hadoop's file system. Even though it may seem outdated to learn about batch processing with Hadoop, understanding the subject is essential even if you plan to live the streaming-data life.

Batch data processing & MapReduce

The "first" generation of data processing, using Hadoop and Spring. Everyone should know how it works, but going deep into the details and operations are recommended only if necessary. Focus more on streaming with tools like Spark today.

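The MapReduce model itself is small enough to sketch in plain Python. Here's a toy word count showing the map, shuffle, and reduce phases that a framework like Hadoop runs, distributed and fault-tolerant, across a cluster:

```python
from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group all values by key (the framework does this between phases).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: fold each key's values into a result.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}
```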

Stream data processing

The "next" generation of data processing. Suggested to get a good grasp of the subject from the "Streaming Systems" book and then dive deep into a specific tool like Kafka, Spark, Flink, etc.

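A core concept from that book is windowing: slicing an unbounded stream into finite chunks you can aggregate. A toy sketch of tumbling-window counts in plain Python; real engines like Flink or Spark Structured Streaming add out-of-order handling, watermarks, and fault-tolerant state:

```python
from collections import defaultdict

# Hypothetical event stream: (event_time_seconds, user) pairs.
events = [(3, "alice"), (7, "bob"), (12, "alice"), (14, "alice"), (21, "bob")]

WINDOW = 10  # tumbling 10-second windows

# Assign each event to a window and count events per (window, user).
counts = defaultdict(int)
for event_time, user in events:
    window_start = (event_time // WINDOW) * WINDOW
    counts[(window_start, user)] += 1

for (window_start, user), n in sorted(counts.items()):
    print(f"[{window_start}-{window_start + WINDOW}) {user}: {n}")
```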

Pipeline / Workflow Management

Scheduling tools for data processing. Airflow is considered the de facto standard, but any understanding of DAGs (directed acyclic graphs of tasks) will serve you well.

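Here's a minimal sketch of what an Airflow DAG looks like, assuming Airflow 2.4+ (for the schedule argument); the dag_id and task callables are invented for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task bodies; a real pipeline would call out to your ETL code.
def extract():
    print("pulling data...")

def load():
    print("writing data...")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The edge of the DAG: load runs only after extract succeeds.
    t_extract >> t_load
```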

Security and privacy

How to manage sensitive data, comply with regulations such as GDPR, and more.

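One technique you'll run into when handling direct identifiers is pseudonymization. A minimal sketch using a keyed hash; the key name and how you'd store it are assumptions:

```python
import hashlib
import hmac

# Hypothetical secret kept outside the dataset (e.g., in a secrets manager).
SECRET_KEY = b"rotate-me-and-store-me-securely"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed hash.

    The same input always maps to the same token, so joins across tables
    still work, but the token can't be reversed without the key.
    """
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

print(pseudonymize("alice@example.com"))
```
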
Made with ♥ by Snir David