Big Data Mastery: Data Pipelines, Processing, and Messaging

Master the art of building robust data pipelines with our Big Data Mastery course. Dive deep into Hadoop architecture, Spark optimization, and Kafka messaging under expert guidance. This comprehensive training program equips you with essential skills for designing efficient data workflows tailored to industry demands.

Face-to-Face: May 5 - 6, 2025
Level: Intermediate
MYR 3500

Training Provider Pricing

Material Fees: MYR 600

Pax: MYR 5600

Features

2 days (9:00 AM - 5:00 PM)
14 modules
13 intakes
Full lifetime access
English

Subsidies

HRDC Claimable

What you'll learn

  • Gain expertise in Hive data warehousing for efficient large dataset analysis.
  • Explore Spark core concepts and optimization strategies for enhanced performance.
  • Develop proficiency in MapReduce programming with a focus on design patterns and optimization techniques (see the sketch after this list).
  • Master Kafka core concepts and Streams API for real-time data processing.
  • Understand the fundamentals of Hadoop architecture including HDFS storage and YARN resource management.
  • Design end-to-end pipelines integrating multiple big data technologies.
  • Implement security measures within the Hadoop ecosystem using Kerberos and Apache Ranger.
  • Learn advanced batch processing patterns, such as Lambda and Kappa architectures, to handle late-arriving data effectively.

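As a flavor of the MapReduce design patterns listed above, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are plain Python scripts wired together by Hadoop's shuffle/sort phase. The file names are illustrative, and this is a sketch of the pattern rather than actual course material.

    # mapper.py - emits one tab-separated (word, 1) pair per token
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - sums counts per word; Hadoop delivers input sorted by key
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Because summation is associative and commutative, the same reducer script can also be registered as a combiner to cut shuffle traffic, which is the combiner design pattern referenced above.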
Why should you attend?

This course offers an in-depth exploration of big data technologies and methodologies, focusing on the creation and optimization of data pipelines, processing frameworks, and messaging systems. Participants will begin by understanding the foundational architecture of Hadoop, including HDFS storage mechanisms and YARN resource management. Through hands-on exercises, learners will set up pseudo-clusters using Docker to gain practical experience.

The course progresses into MapReduce programming, where students will learn about mapper, reducer, and combiner design patterns. Optimization techniques for the shuffle/sort phase are covered to enhance performance. Learners will also delve into Hive data warehousing concepts, comparing HiveQL with ANSI SQL syntax and exploring partitioning strategies to efficiently analyze large datasets.

Spark's core concepts are thoroughly examined, highlighting the tradeoffs between RDDs and DataFrames. The Catalyst optimizer is explored in detail to understand how Spark executes queries efficiently. Students will engage in hands-on ETL pipeline development with complex joins to solidify their understanding. Advanced topics include Spark optimization techniques such as memory management and handling data skew.

The course also covers batch processing patterns like Lambda and Kappa architectures, ensuring participants can handle late-arriving data effectively. Security within the Hadoop ecosystem is addressed through Kerberos authentication and Apache Ranger policy control. Finally, learners will explore Kafka's core concepts and Streams API for real-time data processing, culminating in a comprehensive end-to-end pipeline design project that integrates multiple big data technologies.
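To make the RDD-versus-DataFrame tradeoff concrete, here is a minimal PySpark sketch; the input path and column names are invented for illustration. The DataFrame query passes through the Catalyst optimizer, whose physical plan explain() prints, while the RDD version runs exactly as written with no optimizer in the loop.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

    # Hypothetical input: a CSV of (user_id, amount) transaction records
    df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

    # DataFrame API: declarative, so Catalyst can prune columns and
    # reorder work before anything executes
    totals = df.groupBy("user_id").agg(F.sum("amount").alias("total"))
    totals.explain()   # prints the optimized physical plan

    # Equivalent RDD version: imperative, executed exactly as written
    totals_rdd = (df.rdd
                  .map(lambda row: (row["user_id"], row["amount"]))
                  .reduceByKey(lambda a, b: a + b))
    print(totals_rdd.take(5))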

Course Syllabus

HDFS storage: NameNode/DataNode roles
YARN resource management overview
Cluster setups: Pseudo vs Fully Distributed
Comparison: HDFS vs S3/GCS
Hands-on: Docker-based pseudo-cluster (see the sketch below)
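As a taste of the hands-on HDFS exercise above, here is a minimal sketch that talks to a pseudo-cluster's NameNode over WebHDFS using the third-party hdfs Python package. The host, user, and paths are assumptions; port 9870 is simply the Hadoop 3 NameNode web default.

    from hdfs import InsecureClient  # pip install hdfs

    # WebHDFS endpoint of the pseudo-cluster's NameNode (assumed address)
    client = InsecureClient("http://localhost:9870", user="hadoop")

    # Write a small file, list its directory, and read it back
    client.write("/demo/hello.txt", data=b"hello hdfs", overwrite=True)
    print(client.list("/demo"))
    with client.read("/demo/hello.txt") as reader:
        print(reader.read())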
Each training day runs from 9:00 AM to 5:00 PM, with two 15-minute breaks and a 15-minute recap and Q&A in the morning, a 1-hour lunch, and three 15-minute breaks followed by a closing 15-minute recap and Q&A in the afternoon.
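The course closes with Kafka's core concepts and the Streams API. The Streams API itself is a Java/Scala library, so as a language-consistent stand-in this sketch shows the underlying produce/consume model with the third-party kafka-python client; the broker address and topic name are assumptions.

    from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

    BROKER = "localhost:9092"   # assumed broker address
    TOPIC = "demo-events"       # hypothetical topic name

    # Producer: publish three small messages to the topic
    producer = KafkaProducer(bootstrap_servers=BROKER)
    for i in range(3):
        producer.send(TOPIC, f"event-{i}".encode("utf-8"))
    producer.flush()

    # Consumer: replay the topic from the earliest offset,
    # giving up after five seconds of silence
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKER,
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,
    )
    for message in consumer:
        print(message.value.decode("utf-8"))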

Minimum Qualification

Undergraduate

Target Audience

Students
Entry-level
Engineers

Methodologies

Lecture
Slides
Labs
Q&A
