Course Overview

This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.

Duration

3 days

Course Objectives

    • Recognize use cases for data science on Hadoop
    • Describe the Hadoop and YARN architecture
    • Describe supervised and unsupervised learning differences
    • Use Mahout to run a machine learning algorithm on Hadoop
    • Describe the data science life cycle
    • Use Pig to transform and prepare data on Hadoop
    • Write a Python script
    • Describe options for running Python code on a Hadoop cluster
    • Write a Pig User-Defined Function in Python
    • Use Pig streaming on Hadoop with a Python script
    • Use machine learning algorithms
    • Describe use cases for Natural Language Processing (NLP)
    • Use the Natural Language Toolkit (NLTK)
    • Describe the components of a Spark application
    • Write a Spark application in Python
    • Run machine learning algorithms using Spark MLlib
    • Take data science into production

Format

    • 50% Lecture/Discussion
    • 50% Hands-on Labs

Hands-On Labs

    • Lab: Setting Up a Development Environment
    • Demo: Block Storage
    • Lab: Using HDFS Commands
    • Demo: MapReduce
    • Lab: Using Apache Mahout for Machine Learning
    • Demo: Apache Pig
    • Lab: Getting Started with Apache Pig
    • Lab: Exploring Data with Pig
    • Lab: Using the IPython Notebook
    • Demo: The NumPy Package
    • Demo: The pandas Library
    • Lab: Data Analysis with Python
    • Lab: Interpolating Data Points
    • Lab: Defining a Pig UDF in Python
    • Lab: Streaming Python with Pig
    • Demo: Classification with Scikit-Learn
    • Lab: Computing K-Nearest Neighbor
    • Lab: Generating a K-Means Clustering
    • Lab: POS Tagging Using a Decision Tree
    • Lab: Using NLTK for Natural Language Processing
    • Lab: Classifying Text using Naive Bayes
    • Lab: Using Spark Transformations and Actions
    • Lab: Using Spark MLlib
    • Lab: Creating a Spam Classifier with MLlib
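
To give a flavor of the kind of exercise the "Computing K-Nearest Neighbor" lab title suggests, here is a minimal sketch in plain Python. It is not the course's lab material: the function name `knn_predict`, the toy data, and the choice of Euclidean distance with majority voting are all assumptions made for illustration.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs, where features are
    equal-length tuples of numbers. Distance is Euclidean (an assumption;
    other metrics are equally valid for k-NN).
    """
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled "a" and "b".
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
    ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

In practice a library implementation such as scikit-learn's `KNeighborsClassifier` would likely be preferred over a hand-written version like this one; the sketch only shows the idea of voting among nearest neighbors.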

Prerequisites

  • Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.

Target Audience

  • Architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Hadoop.