
Batch processing with Apache Beam


Easy to follow, hands-on introduction to batch data processing in Python

Created by
Alexandra Abbas, Data Engineer
Rating: 4.9/5

What you'll learn

  • Core concepts of the Apache Beam framework
  • How to design a pipeline in Apache Beam
  • How to install Apache Beam locally
  • How to build a real-world ETL pipeline in Apache Beam
  • How to read and write CSV data in Apache Beam
  • How to apply built-in and custom transformations on a dataset
  • How to deploy your pipeline to Cloud Dataflow on Google Cloud

Description

Apache Beam is an open-source programming model for defining large-scale ETL, batch, and streaming data processing pipelines. It is used by companies like Google, Discord and PayPal.

In this course you will learn Apache Beam in a practical manner: every lecture comes with a full coding screencast. By the end of the course you'll be able to build your own custom batch data processing pipeline in Apache Beam.

This course includes 20 concise, bite-size lectures and a real-life coding project that you can add to your GitHub portfolio! You're expected to follow the instructor and code along with her.

You will learn:

  • How to install Apache Beam on your machine
  • Basic and advanced Apache Beam concepts
  • How to develop a real-world batch processing pipeline
  • How to define custom transformation steps
  • How to deploy your pipeline on Cloud Dataflow

This course is for all levels. You do not need any previous knowledge of Apache Beam or Cloud Dataflow.

The complete source code for this course is freely available on GitHub.

Introduction video

Course content

20 lectures, 1h 15min total length

Welcome (2:17) Preview
What is Apache Beam (2:14) Preview
Apache Beam concepts (2:06)
Design a pipeline (2:24)
Install Apache Beam (2:51)
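A typical local setup for following along looks like the following; the environment name and Python version are arbitrary choices, not requirements from the course:

```shell
conda create -n beam python=3.10
conda activate beam
pip install apache-beam
```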

Create a pipeline (2:45) Preview
Configure pipeline options (3:25)
Read data from CSV file (3:42)
Format data with Map (6:02)
Transform data with ParDo and DoFn (8:00)
Call external API from DoFn (5:27)
Access side input (7:06)
Combine data (2:23)
Format output (2:02)
Write output CSV file (3:27)

What is Cloud Dataflow (1:26)
Set up Google Cloud environment (2:57)
Run pipeline on Cloud Dataflow (7:40)
Clean up in Google Cloud (0:49)

Who this course is for

  • Data Engineers
  • Aspiring Data Engineers
  • Python developers interested in Apache Beam

Requirements

  • Python programming experience
  • Some familiarity with distributed data processing, e.g. you have used Spark before
  • Conda (or another virtual environment manager) installed on your machine

Instructor

Alexandra Abbas
Google Cloud Certified Data Engineer & Architect

Meet Alexandra, your instructor for this course

Alexandra is a Google Cloud Certified Data Engineer & Architect and Apache Airflow Contributor.

She has experience with large-scale data science and engineering projects. She spends her time building data pipelines with Apache Airflow and Apache Beam and creating production-ready machine learning pipelines with TensorFlow.

Alexandra was a speaker at Serverless Days London 2019 and presented at the TensorFlow London meetup.


Related Courses

  • Introduction to Docker for Data Engineers
  • Learn Scala programming in two hours
  • Explore the landscape with our roadmap