Practical Machine Learning with Apache Spark


Course Number: PYTH-250WA
Duration: 3 days (19.5 hours)
Format: Live, hands-on

ML with Spark Training Overview

This Practical Machine Learning with Apache Spark training course teaches attendees how to use Python to scale data processing and machine learning (ML) on the Apache Spark platform. Attendees also learn the terminology, concepts, and algorithms used in ML.

Location and Pricing

Accelebrate offers instructor-led enterprise training for groups of 3 or more online or at your site. Most Accelebrate classes can be flexibly scheduled for your group, including delivery in half-day segments across a week or set of weeks. To receive a customized proposal and price quote for private corporate training on-site or online, please contact us.

In addition, some courses are available as live, instructor-led training from one of our partners.

Objectives

  • Understand the elements of Functional Programming with Python
  • Use the Spark Shell
  • Use the spark-submit Tool
  • Understand the DataFrame Object
  • Transform data with PySpark
  • Switch to PySpark Jupyter Notebooks
  • Use matplotlib for data visualization
  • Work with descriptive statistics and EDA
  • Use PySpark for data repair and normalization
  • Understand linear regression
  • Work with logistic regression
  • Perform classification with Naive Bayes
  • Work with Random Forest Classification
  • Perform Support Vector Machine classification
  • Use the k-Means algorithm

Prerequisites

All attendees must have basic knowledge of statistics and programming.

Outline

Introduction
Defining Data Science
  • Data Science, Machine Learning, AI?
  • The Data-Related Roles
  • Data Science Ecosystem
  • Business Analytics vs. Data Science
  • Who is a Data Scientist?
  • The Break-Down of Data Science Project Activities
  • Data Scientists at Work
  • The Data Engineer Role
  • What is Data Wrangling (Munging)?
  • Examples of Data Science Projects
  • Data Science Gotchas
Machine Learning Life-cycle Phases
  • Data Analytics Pipeline
  • Data Discovery Phase
  • Data Harvesting Phase
  • Data Priming Phase
  • Data Cleansing
  • Feature Engineering
  • Data Logistics and Data Governance
  • Exploratory Data Analysis
  • Model Planning Phase
  • Model Building Phase
  • Communicating the Results
  • Production Roll-out
Quick Introduction to Python Programming
  • Module Overview
  • Some Basic Facts about Python
  • Dynamic Typing Examples
  • Code Blocks and Indentation
  • Importing Modules
  • Lists and Tuples
  • Dictionaries
  • List Comprehension
  • What is Functional Programming (FP)?
  • Terminology: Higher-Order Functions
  • A Short List of Languages that Support FP
  • Lambda
  • Common Higher-Order Functions in Python 3
Introduction to Apache Spark
  • What is Apache Spark
  • Where to Get Spark?
  • The Spark Platform
  • Spark Logo
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Driver Process
  • Spark Applications
  • Spark Shell
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • The Executor and Worker Processes
  • The Spark Application Architecture
  • Interfaces with Data Storage Systems
  • Limitations of Hadoop's MapReduce
  • Spark vs. MapReduce
  • Spark as an Alternative to Apache Tez
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Spark SQL
  • Spark Machine Learning Library
  • GraphX
The Spark Shell
  • The Spark Shell
  • The Spark v2+ Shells
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • The Spark Context (sc) and Spark Session (spark)
  • The Shell Spark Context Object (sc)
  • The Shell Spark Session Object (spark)
  • Loading Files
  • Saving Files
Quick Intro to Jupyter Notebooks
  • Python Dev Tools and REPLs
  • IPython
  • Jupyter
  • Jupyter Operation Modes
  • Basic Edit Mode Shortcuts
  • Basic Command Mode Shortcuts
Data Visualization in Python using matplotlib
  • Data Visualization
  • What is matplotlib?
  • Getting Started with matplotlib
  • The matplotlib.pyplot.plot() Function
  • The matplotlib.pyplot.scatter() Function
  • Labels and Titles
  • Styles
  • The matplotlib.pyplot.bar() Function
  • The matplotlib.pyplot.hist() Function
  • The matplotlib.pyplot.pie() Function
  • The Figure Object
  • The matplotlib.pyplot.subplot() Function
  • Selecting a Grid Cell
  • Saving Figures to a File
Data Science and ML Algorithms with PySpark
  • In-Class Discussion
  • Types of Machine Learning
  • Supervised vs Unsupervised Machine Learning
  • Supervised Machine Learning Algorithms
  • Classification (Supervised ML) Examples
  • Unsupervised Machine Learning Algorithms
  • Clustering (Unsupervised ML) Examples
  • Choosing the Right Algorithm
  • Terminology: Observations, Features, and Targets
  • Representing Observations
  • Terminology: Labels
  • Terminology: Continuous and Categorical Features
  • Continuous Features
  • Categorical Features
  • Common Distance Metrics
  • The Euclidean Distance
  • What is a Model
  • Model Evaluation
  • The Classification Error Rate
  • Data Split for Training and Test Data Sets
  • Data Splitting in PySpark
  • Hold-Out Data
  • Cross-Validation Technique
  • Spark ML Overview
  • DataFrame-based API is the Primary Spark ML API
  • Estimators, Models, and Predictors
  • Descriptive Statistics
  • Data Visualization and EDA
  • Correlations
  • Feature Engineering
  • Scaling of the Features
  • Feature Blending (Creating Synthetic Features)
  • The 'One-Hot' Encoding Scheme
  • Example of 'One-Hot' Encoding Scheme
  • Bias-Variance (Underfitting vs Overfitting) Trade-off
  • The Modeling Error Factors
  • One Way to Visualize Bias and Variance
  • Underfitting vs Overfitting Visualization
  • Balancing Off the Bias-Variance Ratio
  • Linear Model Regularization
  • ML Model Tuning Visually
  • Linear Model Regularization in Spark
  • Regularization, Take Two
  • Dimensionality Reduction
  • PCA and Isomap
  • The Advantages of Dimensionality Reduction
  • Spark Dense and Sparse Vectors
  • Labeled Point
  • Python Example of Using the LabeledPoint Class
  • The LIBSVM format
  • LIBSVM in PySpark
  • Example of Reading a File In LIBSVM Format
  • Life-cycles of Machine Learning Development
  • Regression Analysis
  • Regression vs Correlation
  • Regression vs Classification
  • Simple Linear Regression Model
  • Linear Regression Illustration
  • Least-Squares Method (LSM)
  • Gradient Descent Optimization
  • Locally Weighted Linear Regression
  • Regression Models in Excel
  • Multiple Regression Analysis
  • Evaluating Regression Model Accuracy
  • The R² Model Score
  • The MSE Model Score
  • Linear Logistic (Logit) Regression
  • Interpreting Logistic Regression Results
  • Hands-on Exercise
  • Naive Bayes Classifier (SL)
  • Naive Bayesian Probabilistic Model in a Nutshell
  • Bayes Formula
  • Classification of Documents with Naive Bayes
  • Decision Trees
  • Decision Tree Terminology
  • Properties of Decision Trees
  • Decision Tree Classification in the Context of Information Theory
  • The Simplified Decision Tree Algorithm
  • Using Decision Trees
  • Random Forests
  • Support Vector Machines (SVMs)
  • Unsupervised Learning Type: Clustering
  • k-Means Clustering (UL)
  • k-Means Clustering in a Nutshell
  • k-Means Characteristics
  • Global vs. Local Minimum Explained
  • Time-Series Analysis
  • Decomposing Time-Series
  • A Better Algorithm or More Data?
Conclusion

Training Materials

All Machine Learning training students receive comprehensive courseware.

Software Requirements

  • Windows, Mac, or Linux with at least 8 GB RAM
    • In most class activities, attendees create Spark code and visualizations in a browser-based notebook environment. The class also covers how to export these notebooks and how to run code outside of this environment.
  • A current version of Anaconda for Python 3.x
  • Related lab files that Accelebrate will provide
  • Internet access


Learn faster

Our live, instructor-led lectures are far more effective than pre-recorded classes

Satisfaction guarantee

If your team is not 100% satisfied with your training, we do what's necessary to make it right

Learn online from anywhere

Whether you are at home or in the office, we make learning interactive and engaging

Multiple Payment Options

We accept check, ACH/EFT, major credit cards, and most purchase orders


