Data Science and Big Data Analytics


Course Number: DATA-120WA

Duration: 5 days (32.5 hours)

Format: Live, hands-on

Data Science Training Overview

This Data Science and Big Data Analytics training course teaches attendees how to discover valuable business insights from large datasets and then use these insights to make data-driven decisions. Attendees learn to use Apache Spark for distributed computing, machine learning for predictive analytics, natural language processing for sentiment analysis, R programming, and more, to succeed in a data-driven world.

Location and Pricing

Accelebrate offers instructor-led enterprise training for groups of 3 or more online or at your site. Most Accelebrate classes can be flexibly scheduled for your group, including delivery in half-day segments across a week or set of weeks. To receive a customized proposal and price quote for private corporate training on-site or online, please contact us.

In addition, we offer some courses as live, instructor-led online classes for individuals.

Objectives

  • Understand applied data science and business analytics
  • Work with algorithms, techniques, and common analytical methods
  • Implement machine learning
  • Visualize and report processed results
  • Use the R Programming Language to perform data analysis
  • Understand the elements of functional programming
  • Work with Apache Spark, Spark SQL, and ETL with Spark
  • Use MLlib (Spark's Machine Learning Library)
  • Use GraphX for graph processing

Prerequisites

All participants must have a general knowledge of statistics and of modern programming.

Outline

Introduction
Data Science Algorithms and Analytical Methods
  • Supervised vs. Unsupervised Machine Learning
  • Supervised Machine Learning Algorithms
  • Unsupervised Machine Learning Algorithms
  • Choose the Right Algorithm
  • Life-cycles of Machine Learning Development
  • Classifying with k-Nearest Neighbors (SL)
  • k-Nearest Neighbors Algorithm
  • The Error Rate
  • Decision Trees (SL)
  • Using Decision Trees
  • Random Forests
  • Naive Bayes Classifier (SL)
  • Classification of Documents with Naive Bayes
  • Unsupervised Learning Type: Clustering
  • K-Means Clustering (UL)
  • K-Means Clustering in a Nutshell
  • Regression Analysis
  • Types of Regression
  • Simple Linear Regression Model
  • Linear Regression Illustration
  • Least-Squares Method (LSM)
  • LSM Assumptions
  • Fitting Linear Regression Models in R
  • Example of Using R's lm() Function
  • Example of Using lm() with a Data Frame
  • Regression Models in Excel
  • Logistic Regression
  • Regression vs. Classification
  • Time-Series Analysis
  • Decomposing Time-Series
Getting Started with R
  • Positioning of R in the Data Science Arena
  • R Integrated Development Environments
  • Running R
  • Running RStudio
  • Ending the Current R Session
  • Getting Help
  • Getting System Information
  • General Notes on R Commands and Statements
  • R Data Structures
  • R Objects and Workspace
  • Assignment Operators
  • Assignment Example
  • Arithmetic Operators
  • Logical Operators
  • System Date and Time
  • Operations
  • User-defined Functions
  • R Code Example
  • Type Conversion (Coercion)
  • Control Statements
  • Conditional Execution
  • Repetitive Execution
  • Built-in Functions
  • Reading Data from Files into Vectors
  • Example of Reading Data from a File
  • Writing Data to a File
  • Example of Writing Data to a File
  • Logical Vectors
  • Character Vectors
  • Matrix Data Structure
  • Creating Matrices
  • Working with Data Frames
  • Matrices vs. Data Frames
  • A Data Frame Sample
  • Accessing Data Cells
  • Getting Info About a Data Frame
  • Selecting Columns in Data Frames
  • Selecting Rows in Data Frames
  • Getting a Subset of a Data Frame
  • Sorting (ordering) Data in Data Frames by Attribute(s)
  • Applying Functions to Matrices and Data Frames
  • Using the apply() Function
  • Example of Using apply()
  • Executing External R commands
  • Loading External Scripts in RStudio
  • Listing Objects in Workspace
  • Removing Objects in Workspace
  • Saving Your Workspace in R
  • Saving Your Workspace in RStudio
  • Saving Your Workspace in R GUI
  • Loading Your Workspace
  • Getting and Setting up the Working Directory
  • Getting the List of Files in a Directory
  • Diverting Output to a File
  • Batch (Unattended) Processing
  • Importing Data into R
  • Exporting Data from R
  • Standard R Packages
  • Extending R
  • CRAN Page
Text Mining
  • What is Text Mining?
  • The Common Text Mining Tasks
  • What is Natural Language Processing (NLP)?
  • Some of the NLP Use Cases
  • Machine Learning in Text Mining and NLP
  • Machine Learning in NLP
  • TF-IDF
  • The Feature Hashing Trick
  • Stemming
  • Example of Stemming
  • Stop Words
  • Popular Text Mining and NLP Libraries and Packages
Introduction to Functional Programming
  • What is Functional Programming (FP)?
  • Terminology: Higher-Order Functions
  • Terminology: Lambda vs. Closure
  • A Short List of Languages that Support FP
  • FP with Java
  • FP with JavaScript
  • Imperative Programming in JavaScript
  • The JavaScript map (FP) Example
  • The JavaScript reduce (FP) Example
  • Using reduce to Flatten an Array of Arrays (FP) Example
  • The JavaScript filter (FP) Example
  • Common Higher-Order Functions in Python
  • Common Higher-Order Functions in Scala
  • Elements of FP in R
What is NoSQL?
  • Limitations of Relational Databases
  • Limitations of Relational Databases (Cont'd)
  • Defining NoSQL
  • What are NoSQL (Not Only SQL) Databases?
  • The Past and Present of the NoSQL World
  • NoSQL Database Properties
  • NoSQL Benefits
  • NoSQL Database Storage Types
  • The CAP Theorem
  • NoSQL Systems CAP Triangle
  • Mechanisms to Guarantee a Single CAP Property
  • Limitations of NoSQL Databases
  • Big Data Sharding
  • Sharding Example
MapReduce Overview
  • The Client-Server Processing Pattern
  • Distributed Computing Challenges
  • MapReduce Defined
  • Google's MapReduce
  • MapReduce Phases
  • The Map Phase
  • The Reduce Phase
  • MapReduce Word Count Job
  • MapReduce Shared-Nothing Architecture
  • Similarity with SQL Aggregation Operations
  • Example of Map & Reduce Operations using JavaScript
  • Problems Suitable for Solving with MapReduce
  • Typical MapReduce Jobs
  • Fault-tolerance of MapReduce
  • Distributed Computing Economics
  • MapReduce Systems
Hadoop Overview
  • Apache Hadoop
  • Typical Hadoop Applications
  • Hadoop Clusters
  • Hadoop Design Principles
  • Hadoop Versions
  • Hadoop's Main Components
  • Hadoop Simple Definition
  • Side-by-Side Comparison: Hadoop 1 and Hadoop 2
  • Hadoop-based Systems for Data Analysis
  • Other Hadoop Ecosystem Projects
  • Hadoop Caveats
  • Hadoop Distributions
  • Cloudera Distribution of Hadoop (CDH)
  • Cloudera Distributions
  • Hortonworks Data Platform (HDP)
  • MapR
Hadoop Distributed File System Overview
  • Hadoop Distributed File System (HDFS)
  • HDFS Considerations
  • HDFS High Availability
  • Storing Raw Data in HDFS
  • HDFS Security
  • HDFS Rack-awareness
  • Data Blocks
  • Data Block Replication Example
  • HDFS NameNode Directory Diagram
  • File Metadata Records (Conceptual View)
  • NameNode Meta Information Size
  • HDFS Balancing
  • Accessing HDFS
  • Examples of HDFS Commands
  • Other Supported File Systems
  • WebHDFS
  • Examples of WebHDFS Calls
  • HDFS Daemon Web UI Ports
  • Viewing Replication Factor and Block Size in NameNode Web UI
  • HDFS Write Operation
  • HDFS Read Operation
  • Read Operation Sequence Diagram
  • Communication inside HDFS
MapReduce with Hadoop
  • Hadoop's MapReduce
  • MapReduce 1 and MapReduce 2
  • Why Do I Need a Discussion of the Old MapReduce?
  • MapReduce v1 ("Classic MapReduce")
  • JobTracker and TaskTracker (the "Classic MapReduce")
  • YARN (MapReduce v2)
  • YARN vs. MR1
  • YARN As Data Operating System
  • MapReduce Programming Options
  • Hadoop's Streaming MapReduce
  • Python Word Count Mapper Program Example
  • Python Word Count Reducer Program Example
  • Setting up Java Classpath for Streaming Support
  • Streaming Use Cases
  • The Streaming API vs. Java MapReduce API
  • Amazon Elastic MapReduce
  • Apache Tez
Apache Pig Scripting Platform
  • What is Pig?
  • Pig Latin
  • Apache Pig Logo
  • Pig Execution Modes
  • Local Execution Mode
  • MapReduce Execution Mode
  • Running Pig
  • Running Pig in Batch Mode
  • What is Grunt?
  • Pig Latin Statements
  • Pig Programs
  • Pig Latin Script Example
  • SQL Equivalent
  • Differences between Pig and SQL
  • Statement Processing in Pig
  • Comments in Pig
  • Supported Simple Data Types
  • Supported Complex Data Types
  • Arrays
  • Defining Relation's Schema
  • Not Matching the Defined Schema
  • The bytearray Generic Type
  • Using Field Delimiters
  • Loading Data with TextLoader()
  • Referencing Fields in Relations
Apache Pig Relational and Eval Operators
  • Pig Relational Operators
  • Example of Using the JOIN Operator
  • Example of Using the Order By Operator
  • Caveats of Using Relational Operators
  • Pig Eval Functions
  • Caveats of Using Eval Functions (Operators)
  • Example of Using Single-column Eval Operations
  • Example of Using Eval Operators For Global Operations
Hive
  • What is Hive?
  • Apache Hive Logo
  • Hive's Value Proposition
  • Who uses Hive?
  • What Hive Does Not Have
  • Hive's Main Sub-Systems
  • Hive Features
  • The "Classic" Hive Architecture
  • The New Hive Architecture
  • HiveQL
  • Where are the Hive Tables Located?
  • Hive Command-line Interface (CLI)
  • The Beeline Command Shell
  • Summary
Hive Command-line Interface
  • Hive Command-line Interface (CLI)
  • The Hive Interactive Shell
  • Running Host OS Commands from the Hive Shell
  • Interfacing with HDFS from the Hive Shell
  • Hive in Unattended Mode
  • The Hive CLI Integration with the OS Shell
  • Executing HiveQL Scripts
  • Comments in Hive Scripts
  • Variables and Properties in Hive CLI
  • Setting Properties in CLI
  • Example of Setting Properties in CLI
  • Hive Namespaces
  • Using the SET Command
  • Setting Properties in the Shell
  • Setting Properties for the New Shell Session
  • Setting Alternative Hive Execution Engines
  • The Beeline Shell
  • Connecting to the Hive Server in Beeline
  • Beeline Command Switches
  • Beeline Internal Commands
Hive Data Definition Language
  • Hive Data Definition Language
  • Creating Databases in Hive
  • Using Databases
  • Creating Tables in Hive
  • Supported Data Type Categories
  • Common Numeric Types
  • String and Date / Time Types
  • Miscellaneous Types
  • Example of the CREATE TABLE Statement
  • Working with Complex Types
  • Table Partitioning
  • Table Partitioning on Multiple Columns
  • Viewing Table Partitions
  • Row Format
  • Data Serializers / Deserializers
  • File Format Storage
  • File Compression
  • More on File Formats
  • The ORC Data Format
  • Converting Text to ORC Data Format
  • The EXTERNAL DDL Parameter
  • Example of Using EXTERNAL
  • Creating an Empty Table
  • Dropping a Table
  • Table/Partition(s) Truncation
  • Alter Table/Partition/Column
  • Views
  • Create View Statement
  • Why Use Views?
  • Restricting Amount of Viewable Data
  • Examples of Restricting Amount of Viewable Data
  • Creating and Dropping Indexes
  • Describing Data
Apache Sqoop
  • What is Sqoop?
  • Apache Sqoop Logo
  • Sqoop Import/Export
  • Sqoop Help
  • Examples of Using Sqoop Commands
  • Data Import Example
  • Fine-tuning Data Import
  • Controlling the Number of Import Processes
  • Data Splitting
  • Helping Sqoop Out
  • Example of Executing Sqoop Load in Parallel
  • A Word of Caution: Avoid Complex Free-Form Queries
  • Using Direct Export from Databases
  • Example of Using Direct Export from MySQL
  • More on Direct Mode Import
  • Data Export from HDFS
  • Export Tool Common Arguments
  • Data Export Control Arguments
  • Data Export Example
  • INSERT and UPDATE Statements
  • INSERT Operations
  • UPDATE Operations
  • Example of the Update Operation
  • Failed Exports
  • Sqoop2
Introduction to Apache Spark
  • What is Apache Spark
  • A Short History of Spark
  • Where to Get Spark?
  • The Spark Platform
  • Spark Logo
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Driver Process
  • Spark Applications
  • Spark Shell
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • The Executor and Worker Processes
  • The Spark Application Architecture
  • Interfaces with Data Storage Systems
  • Limitations of Hadoop's MapReduce
  • Spark vs. MapReduce
  • Spark as an Alternative to Apache Tez
  • The Resilient Distributed Dataset (RDD)
  • Spark Streaming (Micro-batching)
  • Spark SQL
  • Example of Spark SQL
  • Spark Machine Learning Library
  • GraphX
  • Spark vs. R
The Spark Shell
  • The Spark Shell
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • The Spark Context (sc) and SQL Context (sqlContext)
  • The Shell Spark Context
  • Loading Files
  • Saving Files
  • Basic Spark ETL Operations
Spark RDDs
  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Custom RDDs
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • The Big Picture
  • What May Go Wrong
  • Checkpointing RDDs
  • Local Checkpointing
  • Parallelized Collections
  • More on parallelize() Method
  • The Pair RDD
  • Where Do I Use Pair RDDs?
  • Example of Creating a Pair RDD with Map
  • Example of Creating a Pair RDD with keyBy
  • Miscellaneous Pair RDD Operations
  • RDD Caching
  • RDD Persistence
  • The Tachyon Storage
Parallel Data Processing with Spark
  • Running Spark on a Cluster
  • Spark Stand-alone Option
  • The High-Level Execution Flow in Stand-alone Spark Cluster
  • Data Partitioning
  • Data Partitioning Diagram
  • Single Local File System RDD Partitioning
  • Multiple File RDD Partitioning
  • Special Cases for Small-sized Files
  • Parallel Data Processing of Partitions
  • Spark Application, Jobs, and Tasks
  • Stages and Shuffles
  • The "Big Picture"
Shared Variables in Spark
  • Shared Variables in Spark
  • Broadcast Variables
  • Creating and Using Broadcast Variables
  • Example of Using Broadcast Variables
  • Accumulators
  • Creating and Using Accumulators
  • Example of Using Accumulators
  • Custom Accumulators
Introduction to Spark SQL
  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Hive Integration
  • Hive Interface
  • Integration with BI Tools
  • Spark SQL is No Longer an Experimental Developer API!
  • What is a DataFrame?
  • The SQLContext Object
  • The SQLContext API
  • Changes from Spark SQL 1.3 to 1.4
  • Example of Spark SQL (Scala Example)
  • Example of Working with a JSON File
  • Example of Working with a Parquet File
  • Using JDBC Sources
  • JDBC Connection Example
  • Performance & Scalability of Spark SQL
Graph Processing with GraphX
  • What is GraphX?
  • Supported Languages
  • Vertices and Edges
  • Graph Terminology
  • Example of Property Graph
  • The GraphX API
  • The GraphX Views
  • The Triplet View
  • Graph Algorithms
  • Graphs and RDDs
  • Constructing Graphs
  • Graph Operators
  • Example of Using GraphX Operators
  • GraphX Performance Optimization
  • The PageRank Algorithm
  • GraphX Support for PageRank
The Spark Machine Learning Library
  • What is MLlib?
  • Supported Languages
  • MLlib Packages
  • Dense and Sparse Vectors
  • Labeled Point
  • Python Example of Using the LabeledPoint Class
  • LIBSVM format
  • An Example of a LIBSVM File
  • Loading LIBSVM Files
  • Local Matrices
  • Example of Creating Matrices in MLlib
  • Distributed Matrices
  • Example of Using a Distributed Matrix
  • Classification and Regression Algorithms
  • Clustering
Machine Learning with BigML
  • What is BigML?
  • How BigML Service Works
  • Data Files
  • Data Sets
  • Data Sets Example
  • Models
  • Predictions
  • The Prediction UI Form
  • Text Analysis in BigML
  • REST API
Conclusion

Training Materials

All Data Analytics training students receive comprehensive courseware.

Software Requirements

  • Windows, Mac, or Linux with at least 8 GB RAM
    • In most class activities, attendees create Spark code and visualizations in a browser-based notebook environment. The class also covers how to export these notebooks and how to run code outside this environment.
  • A current version of Anaconda for Python 3.x
  • Related lab files that Accelebrate will provide
  • Internet access



