Course Overview

Data is the residue of every action that takes place in a company, with customers, and in the marketplace. It is created when customers buy products, users interact with services, and colleagues collaborate.

In an increasingly connected world, our ability to capture and leverage data has increased exponentially, but data in the wrong hands is useless, if not dangerous. In the right hands, data can drive new insights and better-informed decisions. This course introduces fundamental techniques and technologies from data science, predictive analytics, and machine learning that can help you get a handle on the modern information flood. Using the Python programming language, you will learn:

  • Analytics skills that will enable you to evaluate, query, and visualize data using open source tools: NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn
  • Strategies to create data-driven questions that can provide scientific or business value
  • Methods for assembling data from multiple sources and preparing powerful machine learning (ML) models
  • Techniques for deploying models as part of larger systems
3 days (with an optional extension of one or two extra days to dive deeper into specific topics)
  • Data Fundamentals

    Introducing Python

    Introduce the Python programming language, its syntax, and core libraries that are used for working with data.

    • Python Modules: Toolboxes
      • Importing modules
      • Listing methods
      • Creating modules
    • Python Syntax and Structure
      • Core programming language structure
      • Functions
      • Object oriented programming
      • Comprehensions and other syntactic niceties
    • Python Data Science Libraries
      • NumPy
      • NumPy Arrays
      • SciPy
      • Pandas
    • Python Dev Tools, Analytic Environments, and REPLs
      • IPython
      • Jupyter
      • Jupyter Operation Modes
      • Anaconda
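
    As a rough illustration of the topics above (importing modules, listing what a module provides, functions, comprehensions, and NumPy arrays), a minimal sketch might look like the following. The function name zscores and the sample values are made up for this example and are not part of the course materials.

```python
import math

import numpy as np


def zscores(values):
    """Return standard scores (mean 0, standard deviation 1) for a sequence."""
    arr = np.asarray(values, dtype=float)  # convert any sequence to a NumPy array
    return (arr - arr.mean()) / arr.std()  # vectorized arithmetic, no explicit loop


# List the public names a module provides, using dir() and a comprehension
public_math_names = [name for name in dir(math) if not name.startswith("_")]

# A list comprehension: squares of 0..4
squares = [n * n for n in range(5)]

# Call the function on a small, made-up sample
scores = zscores([1.0, 2.0, 3.0])
```

    In an IPython or Jupyter session, the same exploration is often done interactively, for example with tab completion or by appending ? to a name to view its documentation.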

    Practical Data Science

    Describe how the use of data is changing and how this has led to the emergence of the “Data Scientist,” or “a programmer who knows more statistics than a software engineer and more programming than a statistician.”

    • How is data being used innovatively to ask new and interesting questions?
    • What is Data Science?
    • Data Science, Machine Learning, AI: What is the difference?
    • Case Study: Applied Data Science at Google
    • Case Study: Predictive Models in Advertising
    • Case Study: Recommender Systems in E-commerce
    • Data Analytics Life-cycle
      • Priming
      • Exploratory Data Analysis
      • Model Planning
      • Model Building
      • Validation
      • Production Roll-out

    Data Fundamentals

    Aggregating, repairing, normalizing, exploring, and visualizing data.

    • Working with data in Python
      • Importing data from external sources
      • Dealing with missing data
      • Dropping columns
      • Interpolating missing data in Pandas
      • Replacing data
      • Scaling/normalizing data
    • Exploratory Data Analysis and Visualization: Pandas, Matplotlib, and Plotly
      • Transformation, validation, and interpretation
      • Getting started with Matplotlib and Seaborn
      • Plotting Windows and Figures
    • Distributions and variance:
      • Show how to represent a distribution in pictures (histograms and related charts) and numbers (summaries)
      • Introduce outliers and describe the effect they might have on a distribution
      • Variance: measuring the spread of a distribution
      • Modeling distributions: normal, lognormal, and Pareto distributions
      • Lab: Visualizing and Summarizing Distributions
    • Analyzing Relationships
      • Show how Pandas can be used to assess relationships amongst variables
      • Visualizing relationships: scatterplots and beyond
      • Measuring relationships: correlation and covariance
      • Testing relationships: is it meaningful?
      • Classical hypothesis testing: means, correlation, and proportions
      • Demonstration: Analyzing Relationships
      • Lab: More Relationship Analysis
    • Data Grouping and Aggregation in Python
      • Data aggregation and grouping
      • core.groupby.SeriesGroupBy
      • Grouping multiple columns
      • Pivot Tables
      • Cross-Tabulation
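
    The cleaning and aggregation steps listed above can be sketched with Pandas roughly as follows. The table, its column names, and its values are invented for illustration, and min-max scaling is only one of several possible normalization choices.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records with a missing value and an empty column
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "product": ["a", "b", "a", "a", "b"],
    "units": [10.0, np.nan, 30.0, 40.0, 50.0],
    "notes": [None, None, None, None, None],
})

# Dropping columns: remove a column that carries no information
df = df.drop(columns=["notes"])

# Interpolating missing data: linear interpolation between neighboring rows
df["units"] = df["units"].interpolate()

# Scaling/normalizing: min-max scaling to the range [0, 1]
span = df["units"].max() - df["units"].min()
df["units_scaled"] = (df["units"] - df["units"].min()) / span

# Grouping: selecting one column of a groupby yields a SeriesGroupBy object
grouped = df.groupby("region")["units"]
totals = grouped.sum()

# Pivot table: total units per region (rows) and product (columns)
pivot = df.pivot_table(index="region", columns="product",
                       values="units", aggfunc="sum")

# Cross-tabulation: how many rows fall into each region/product pair
counts = pd.crosstab(df["region"], df["product"])
```

    Whether interpolation is appropriate depends on the data; for unordered records, filling with a summary value or dropping the row may be the safer choice.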

    What is Machine Learning?

    • The Machines are Coming: Machine Learning and Artificial Intelligence
      • What are machine learning and artificial intelligence?
      • What are some ML techniques and how can they be used to solve business problems?
    • Supervised versus unsupervised learning: what are the differences?
    • Terminology and definitions
      • Features and observations
      • Labels
      • Continuous and categorical features
    • Practical Machine Learning
      • Data preparation
      • Model training
      • Model validation and assessment
    • scikit-learn: Estimators, Models, and Predictors
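
    The data preparation, training, and validation steps above map onto the scikit-learn estimator interface (fit, predict) roughly as in this sketch. The synthetic dataset, the choice of logistic regression, and the 75/25 split are assumptions made for the example, not choices prescribed by the course.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: X holds features (one row per observation), y holds labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Data preparation: hold out a quarter of the observations for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An estimator pipeline: scale the features, then fit a classifier
model = make_pipeline(StandardScaler(), LogisticRegression())

# Model training: learn parameters from the training set only
model.fit(X_train, y_train)

# Model validation: compare predictions against held-out labels
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
```

    Keeping the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking information from the held-out set.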

    Machine Learning Algorithms

    Introduce common machine learning algorithms and explore their use.

    • Classification and Regression
      • How do you build machine learning models to “make guesses” and “put things into buckets”?
      • Classification
      • Regression
    • Clustering and Principal Components Analysis
    • Time Series
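
    As one possible illustration of clustering combined with principal components analysis, the sketch below reduces synthetic four-dimensional data to two components and then groups the points with k-means. The data, the number of clusters, and the use of k-means specifically are assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic, unlabeled data: 150 observations with 4 features
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

# PCA: project the data onto the two directions of greatest variance
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

# Clustering: assign each observation to one of three groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)
```

    The two-dimensional output of PCA is also convenient for plotting, with the cluster labels used to color each point.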

    Case Study: Machine Learning and Natural Language Processing

    Show how machine learning techniques can be applied alongside feature engineering to solve complex problems.

    • Introduce Natural Language Processing (NLP) and the core constructs that can be used to work with human language.
    • Explore computational models of human language that can be used for classification and clustering.
    • Show how keyword extraction using NLP and data normalization can be used to locate patients who have a specific condition or disease.
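
    A very small sketch of the keyword-and-normalization idea is shown below. The clinical notes, patient identifiers, and keyword set are all fabricated for illustration, and the helper names normalize and mentions_condition are hypothetical; a real system would need far more careful language handling.

```python
import re

# Hypothetical keyword set for the condition of interest
KEYWORDS = {"diabetes", "diabetic"}


def normalize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def mentions_condition(note, keywords=KEYWORDS):
    """Return True if any normalized token in the note is a keyword."""
    return any(token in keywords for token in normalize(note))


# Made-up free-text notes keyed by a made-up patient identifier
notes = {
    "patient-1": "Pt. reports Type 2 DIABETES, managed with diet.",
    "patient-2": "No chronic conditions noted.",
    "patient-3": "Diabetic; follow-up in 3 months.",
}

# Patients whose notes mention the condition
flagged = sorted(pid for pid, note in notes.items() if mentions_condition(note))
```

    This sketch ignores negation (for example, "no history of diabetes" would still be flagged), which is one reason keyword matching is usually only a first step.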

    Deep Learning

    Introduce neural networks and their basic function.

    • What is a deep neural network? How does it differ from other types of machine learning techniques?
    • What are the mathematical techniques behind neural networks? How do they work?
    • How do we teach networks to “learn”?
    • What are some of the applications for these types of tools in healthcare, finance, and advertising?
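
    To make the idea of "learning" concrete, the sketch below trains a single sigmoid neuron with gradient descent on the logical OR function. This is deliberately not a deep network: it is the smallest building block, and the data, learning rate, and step count are assumptions chosen for the example.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, p):
    """Mean cross-entropy between labels y and predicted probabilities p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# Inputs and targets for logical OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

weights = np.zeros(2)
bias = 0.0
learning_rate = 1.0

initial_loss = log_loss(y, sigmoid(X @ weights + bias))

for _ in range(1000):
    p = sigmoid(X @ weights + bias)   # forward pass
    error = p - y                     # gradient of the loss w.r.t. the pre-activation
    weights -= learning_rate * (X.T @ error) / len(y)  # gradient step for weights
    bias -= learning_rate * error.mean()               # gradient step for bias

final_loss = log_loss(y, sigmoid(X @ weights + bias))
predicted = (sigmoid(X @ weights + bias) > 0.5).astype(int)
```

    A deep network stacks many such units in layers and uses backpropagation to compute the same kind of gradients for every layer.
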
  • Participants should have a working knowledge of Python and be familiar with core statistical concepts (variance, correlation, etc.).