Course Overview

This course is designed for developers who need to create applications to analyze Big Data stored in Apache Hadoop using Pig and Hive. Topics include: Hadoop, YARN, HDFS, MapReduce, data ingestion, workflow definition, data analytics with Pig and Hive, and an introduction to Spark Core and Spark SQL.

CLASS INFORMATION
Price: $2,800
Duration: 4 days
COURSE OBJECTIVES

    • Describe Hadoop, YARN and use cases for Hadoop
    • Describe Hadoop ecosystem tools and frameworks
    • Describe the HDFS architecture
    • Use the Hadoop client to input data into HDFS
    • Transfer data between Hadoop and a relational database
    • Explain YARN and MapReduce architectures
    • Run a MapReduce job on YARN
    • Use Pig to explore and transform data in HDFS
    • Understand how Hive tables are defined and implemented
    • Use Hive to explore and analyze data sets
    • Use the new Hive windowing functions
    • Explain and use the various Hive file formats
    • Create and populate a Hive table that uses ORC file formats
    • Use Hive to run SQL-like queries to perform data analysis
    • Use Hive to join datasets using a variety of techniques
    • Write efficient Hive queries
    • Create ngrams and context ngrams using Hive
    • Perform data analytics using the DataFu Pig library
    • Explain the uses and purpose of HCatalog
    • Use HCatalog with Pig and Hive
    • Define and schedule an Oozie workflow
    • Present the Spark ecosystem and high-level architecture
    • Perform data analysis with Spark’s Resilient Distributed Dataset API
    • Explore Spark SQL and the DataFrame API
FORMAT

    • 50% Lecture/Discussion
    • 50% Hands-on Labs

AGENDA

    DAY 1 – AN INTRODUCTION TO THE HADOOP DISTRIBUTED FILE SYSTEM

    • Understanding Hadoop
    • The Hadoop Distributed File System
    • Ingesting Data into HDFS
    • The MapReduce Framework
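The MapReduce topic above can be previewed with a small, self-contained Python sketch of the map, shuffle, and reduce phases of a word count. This is a conceptual model only, it uses no Hadoop APIs:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key before they reach the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["the"])  # "the" appears twice across the two input lines
```

In a real Hadoop job, the map and reduce functions run in parallel across the cluster and the framework performs the shuffle between them.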

    DAY 2 – AN INTRODUCTION TO APACHE PIG

    • Introduction to Apache Pig
    • Advanced Apache Pig Programming
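Pig's data-flow style (LOAD, GROUP, FOREACH ... GENERATE) can be approximated in plain Python. The sketch below mirrors a hypothetical Pig script that groups stock trades by symbol and averages the price; the course itself writes this in Pig Latin:

```python
# Rows as a Pig relation would hold them: (symbol, price) tuples.
trades = [("AAPL", 10.0), ("AAPL", 12.0), ("IBM", 8.0)]

# Roughly: grouped = GROUP trades BY symbol;
grouped = {}
for symbol, price in trades:
    grouped.setdefault(symbol, []).append(price)

# Roughly: FOREACH grouped GENERATE group, AVG(trades.price);
averages = {symbol: sum(prices) / len(prices)
            for symbol, prices in grouped.items()}
print(averages)  # {'AAPL': 11.0, 'IBM': 8.0}
```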

    DAY 3 – AN INTRODUCTION TO APACHE HIVE

    • Apache Hive Programming
    • Using HCatalog
    • Advanced Apache Hive Programming
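Among the advanced Hive topics are windowing functions, which compute aggregates over ordered windows of rows. A minimal Python sketch of the idea behind a running total (what `SUM(amount) OVER (ORDER BY day)` would produce in HiveQL), using hypothetical daily sales:

```python
from itertools import accumulate

# Hypothetical rows, already ordered by day as an ORDER BY would arrange them.
sales = [("2015-01-01", 100), ("2015-01-02", 50), ("2015-01-03", 25)]

# Running total over the ordered window: each row sees all rows up to itself.
running = list(accumulate(amount for _, amount in sales))
print(running)  # [100, 150, 175]
```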

    DAY 4 – WORKING WITH SPARK CORE, SPARK SQL AND OOZIE

    • Advanced Apache Hive Programming (Continued)
    • Hadoop 2 and YARN
    • Introduction to Spark Core and Spark SQL
    • Defining Workflow with Oozie
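Spark's Resilient Distributed Dataset API chains transformations such as map and filter, then triggers computation with an action such as reduce. Since a live cluster is out of scope here, the sketch below mimics that chain with Python built-ins; real PySpark code would call the same-named methods on an RDD:

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

# Roughly: sc.parallelize(data).map(lambda x: x * x) \
#            .filter(lambda x: x > 5).reduce(lambda a, b: a + b)
squared = map(lambda x: x * x, data)      # map transformation
large = filter(lambda x: x > 5, squared)  # filter transformation
total = reduce(lambda a, b: a + b, large) # reduce action
print(total)  # 9 + 16 + 25 = 50
```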

HANDS-ON LABS

    DAY 1 – AN INTRODUCTION TO THE HADOOP DISTRIBUTED FILE SYSTEM

    • Starting an HDP Cluster
    • Demonstration: Understanding Block Storage
    • Using HDFS Commands
    • Importing RDBMS Data into HDFS
    • Exporting HDFS Data to an RDBMS
    • Importing Log Data into HDFS Using Flume
    • Demonstration: Understanding MapReduce
    • Running a MapReduce Job
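The block-storage demonstration comes down to simple arithmetic: HDFS splits files into fixed-size blocks (128 MB by default in Hadoop 2), and each block is replicated across the cluster (3 copies by default). A quick sketch of the footprint calculation:

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size in Hadoop 2
REPLICATION = 3      # default HDFS replication factor

def hdfs_footprint(file_size_mb):
    """Return (block count, raw storage in MB) for a file of the given size."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION

blocks, raw_mb = hdfs_footprint(300)
print(blocks, raw_mb)  # a 300 MB file uses 3 blocks and 900 MB of raw storage
```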

    DAY 2 – AN INTRODUCTION TO APACHE PIG

    • Demonstration: Understanding Apache Pig
    • Getting Started with Apache Pig
    • Exploring Data with Apache Pig
    • Splitting a Dataset
    • Joining Datasets with Apache Pig
    • Preparing Data for Apache Hive
    • Demonstration: Computing Page Rank
    • Analyzing Clickstream Data
    • Analyzing Stock Market Data Using Quantiles
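The quantiles lab uses DataFu's quantile support in Pig; the underlying idea can be previewed with Python's standard library (`statistics.quantiles`, Python 3.8+), which returns the cut points that split sorted data into equal-sized groups:

```python
import statistics

# Hypothetical closing prices for one stock.
prices = [10, 12, 11, 15, 14, 13, 20, 18]

# Quartile cut points: 3 values dividing the data into 4 groups.
quartiles = statistics.quantiles(prices, n=4)
print(quartiles)
```

The middle cut point is the median; the outer two bound the middle half of the prices.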

    DAY 3 – AN INTRODUCTION TO APACHE HIVE

    • Understanding Hive Tables
    • Understanding Partition and Skew
    • Analyzing Big Data with Apache Hive
    • Demonstration: Computing NGrams
    • Joining Datasets in Apache Hive
    • Computing NGrams of Emails in Avro Format
    • Using HCatalog with Apache Pig
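Hive's `ngrams()` function finds the most frequent word sequences in a body of text. The same idea in miniature Python, using a sliding window and a Counter to find the top bigram:

```python
from collections import Counter

def ngrams(words, n):
    """All length-n word sequences, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the cat sat on the cat mat".split()
top = Counter(ngrams(words, 2)).most_common(1)
print(top)  # ('the', 'cat') is the only bigram that occurs twice
```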

    DAY 4 – WORKING WITH SPARK CORE, SPARK SQL AND OOZIE

    • Advanced Apache Hive Programming
    • Running a YARN Application
    • Getting Started with Apache Spark
    • Exploring Apache Spark SQL
    • Defining an Apache Oozie Workflow
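An Oozie workflow is defined in XML as a directed graph of actions. A minimal hypothetical `workflow.xml` with a single Hive action is sketched below; the workflow name, script name, and parameter values are illustrative:

```xml
<workflow-app name="daily-analysis" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-node"/>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>analysis.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Control-flow nodes (`start`, `end`, `kill`) steer execution, while each action node runs one job and declares where to go on success (`ok`) or failure (`error`).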
PREREQUISITES

    • Students should be familiar with programming principles and have experience in software development. SQL knowledge is also helpful. No prior Hadoop knowledge is required.

TARGET AUDIENCE

    • Software developers who need to understand and develop applications for Hadoop.