Big Data Architecture Training Overview
This Data Science for Solution Architects training course teaches attendees how to process big data to make data-driven business decisions. Participants learn to use R, Hadoop, Pig, Hive, Spark, NoSQL databases, and more to build cost-effective data analytics and data processing solutions.
Location and Pricing
Accelebrate offers instructor-led enterprise training for groups of 3 or more online or at your site. Most Accelebrate classes can be flexibly scheduled for your group, including delivery in half-day segments across a week or set of weeks. To receive a customized proposal and price quote for private corporate training on-site or online, please contact us.
In addition, we offer some courses as live, instructor-led online classes for individuals.
Objectives
- Understand applied data science and business analytics
- Incorporate algorithms, techniques, and common analytical methods
- Understand NoSQL and big data systems
- Use MapReduce
- Work with big data business intelligence and analytics
- Visualize and report processed results
- Analyze data with R
- Work with the Hadoop programming ecosystem
- Work with data sets in Apache Pig
- Use the Spark ETL and HDFS interface
Prerequisites
Participants must have basic statistics and programming knowledge.
Outline
Introduction
Applied Data Science
- What is Data Science?
- Data Science Ecosystem
- Data Mining vs. Data Science
- Business Analytics vs. Data Science
- Who is a Data Scientist?
- Data Science Skill Sets Venn Diagram
- Data Scientists at Work
- Examples of Data Science Projects
- An Example of a Data Product
- Applied Data Science at Google
- Data Science Gotchas
Data Analytics Life-cycle Phases
- Big Data Analytics Pipeline
- Data Discovery Phase
- Data Harvesting Phase
- Data Priming Phase
- Exploratory Data Analysis
- Model Planning Phase
- Model Building Phase
- Communicating the Results
- Production Roll-out
- Summary
Getting Started with R
- Positioning of R in the Data Science Arena
- R Integrated Development Environments
- Running R
- Running RStudio
- Ending the Current R Session
- Getting Help
- Getting System Information
- General Notes on R Commands and Statements
- R Data Structures
- R Objects and Workspace
- Assignment Operators
- Assignment Example
- Arithmetic Operators
- Logical Operators
- System Date and Time
- Operations
- User-defined Functions
- User-defined Function Example
- R Code Example
- Type Conversion (Coercion)
- Control Statements
- Conditional Execution
- Repetitive Execution
- Built-in Functions
- Reading Data from Files into Vectors
- Example of Reading Data from a File
- Writing Data to a File
- Example of Writing Data to a File
- Logical Vectors
- Character Vectors
- Matrix Data Structure
- Creating Matrices
- Working with Data Frames
- Matrices vs Data Frames
- A Data Frame Sample
- Accessing Data Cells
- Getting Info About a Data Frame
- Selecting Columns in Data Frames
- Selecting Rows in Data Frames
- Getting a Subset of a Data Frame
- Sorting (ordering) Data in Data Frames by Attribute(s)
- Applying Functions to Matrices and Data Frames
- Using the apply() Function
- Example of Using apply()
- Executing External R Commands
- Loading External Scripts in RStudio
- Listing Objects in Workspace
- Removing Objects in Workspace
- Saving Your Workspace in R
- Saving Your Workspace in RStudio
- Saving Your Workspace in R GUI
- Loading Your Workspace
- Hands-on Exercises
- Getting and Setting the Working Directory
- Getting the List of Files in a Directory
- Diverting Output to a File
- Batch (Unattended) Processing
- Importing Data into R
- Exporting Data from R
- Hands-on Exercise
- Standard R Packages
- Extending R
- Extending R in R GUI
- Extending R in RStudio
- CRAN Page
R Statistical Computing Features
- Statistical Computing Features
- Descriptive Statistics
- Basic Statistical Functions
- Examples of Using Basic Statistical Functions
- Using the summary() Function
- Math Functions Used in Data Analysis
- Examples of Using Math Functions
- Correlations
- Correlation Example
Data Science Algorithms and Analytical Methods
- Supervised vs. Unsupervised Machine Learning
- Supervised Machine Learning Algorithms
- Unsupervised Machine Learning Algorithms
- Choose the Right Algorithm
- Life-cycles of Machine Learning Development
- Classifying with k-Nearest Neighbors (SL)
- k-Nearest Neighbors Algorithm
- The Error Rate
- Hands-on Exercise
- Decision Trees (SL)
- Using Decision Trees
- Random Forests
- Naive Bayes Classifier (SL)
- Classification of Documents with Naive Bayes
- Unsupervised Learning Type: Clustering
- K-Means Clustering (UL)
- K-Means Clustering in a Nutshell
- Regression Analysis
- Types of Regression
- Simple Linear Regression Model
- Linear Regression Illustration
- Least-Squares Method (LSM)
- LSM Assumptions
- Fitting Linear Regression Models in R
- Example of Using R's lm() Function
- Example of Using lm() with a Data Frame
- Regression Models in Excel
- Hands-on Exercise
- Logistic Regression
- Regression vs Classification
- Time-Series Analysis
- Decomposing Time-Series
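The regression topics in this module are taught with R's lm() function. Purely as an illustration of the least-squares arithmetic behind the Simple Linear Regression Model, here is a minimal sketch in plain Python with hypothetical sample data:

```python
# A minimal least-squares sketch in plain Python (the course itself uses R's lm()).
# It fits y = b0 + b1*x using the closed-form solution b1 = cov(x, y) / var(x).
from statistics import mean

def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # cross-deviations
    s_xx = sum((xi - x_bar) ** 2 for xi in x)                        # squared deviations of x
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
score = [52, 58, 65, 70, 78]
b0, b1 = fit_simple_linear_regression(hours, score)
print(f"score ~= {b0:.2f} + {b1:.2f} * hours")
```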
Text Mining
- What is Text Mining?
- The Common Text Mining Tasks
- What is Natural Language Processing (NLP)?
- Some of the NLP Use Cases
- Machine Learning in Text Mining and NLP
- Machine Learning in NLP
- TF-IDF
- The Feature Hashing Trick
- Stemming
- Example of Stemming
- Stop Words
- Popular Text Mining and NLP Libraries and Packages
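As a rough illustration of the TF-IDF topic listed above, the following minimal Python sketch weights terms in a hypothetical three-document corpus; real projects would normally rely on the text mining and NLP libraries mentioned in the last item:

```python
# A minimal TF-IDF sketch in plain Python: terms that appear in every document
# get an IDF of zero, while terms unique to one document are weighted highest.
import math
from collections import Counter

docs = [
    "big data makes data driven decisions",
    "spark processes big data in memory",
    "hive queries data with sql",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    """Term frequency times inverse document frequency for one document."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf

print(tf_idf("spark", tokenized[1]))  # distinctive term -> higher weight
print(tf_idf("data", tokenized[1]))   # appears everywhere -> idf = log(1) = 0
```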
What is NoSQL?
- Limitations of Relational Databases
- Limitations of Relational Databases (Cont'd)
- Defining NoSQL
- What are NoSQL (Not Only SQL) Databases?
- The Past and Present of the NoSQL World
- NoSQL Database Properties
- NoSQL Benefits
- NoSQL Database Storage Types
- The CAP Theorem
- NoSQL Systems CAP Triangle
- Mechanisms to Guarantee a Single CAP Property
- Limitations of NoSQL Databases
- Big Data Sharding
- Sharding Example
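As a rough illustration of the sharding topics above, this minimal Python sketch routes hypothetical record keys to shards by hashing, which is one common sharding strategy:

```python
# A minimal hash-based sharding sketch: each record key is deterministically
# mapped to one of a fixed number of shards.
import hashlib

NUM_SHARDS = 4
shards = {i: [] for i in range(NUM_SHARDS)}

def shard_for(key: str) -> int:
    """Map a record key to a shard number between 0 and NUM_SHARDS - 1."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Hypothetical customer records keyed by customer ID
for customer_id in ["C001", "C002", "C003", "C004", "C005", "C006"]:
    shards[shard_for(customer_id)].append(customer_id)

for shard_id, keys in shards.items():
    print(f"shard {shard_id}: {keys}")
```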
MapReduce Overview
- The Client-Server Processing Pattern
- Distributed Computing Challenges
- MapReduce Defined
- Google's MapReduce
- MapReduce Phases
- The Map Phase
- The Reduce Phase
- MapReduce Word Count Job
- MapReduce Shared-Nothing Architecture
- Similarity with SQL Aggregation Operations
- Example of Map & Reduce Operations using JavaScript
- Problems Suitable for Solving with MapReduce
- Typical MapReduce Jobs
- Fault-tolerance of MapReduce
- Distributed Computing Economics
- MapReduce Systems
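The outline's own map/reduce example uses JavaScript; as a language-neutral illustration, this minimal in-memory Python sketch walks a word count through the map, shuffle, and reduce phases described above:

```python
# A minimal in-memory word-count sketch that mirrors the MapReduce phases:
# map emits (key, value) pairs, the shuffle groups them by key, reduce aggregates.
from itertools import groupby
from operator import itemgetter

lines = ["big data big decisions", "data drives decisions"]

# Map phase: emit (word, 1) pairs from each input line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the intermediate pairs by key (the word)
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in group] for key, group in groupby(mapped, key=itemgetter(0))}

# Reduce phase: sum the counts for each word
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # {'big': 2, 'data': 2, 'decisions': 2, 'drives': 1}
```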
Hadoop Overview
- Apache Hadoop
- Apache Hadoop Logo
- Typical Hadoop Applications
- Hadoop Clusters
- Hadoop Design Principles
- Hadoop Versions
- Hadoop's Main Components
- Hadoop Simple Definition
- Side-by-Side Comparison: Hadoop 1 and Hadoop 2
- Hadoop-based Systems for Data Analysis
- Other Hadoop Ecosystem Projects
- Hadoop Caveats
- Hadoop Distributions
- Cloudera Distribution of Hadoop (CDH)
- Cloudera Distributions
- Hortonworks Data Platform (HDP)
- MapR
Hadoop Distributed File System Overview
- Hadoop Distributed File System (HDFS)
- HDFS Considerations
- HDFS High Availability
- Storing Raw Data in HDFS
- HDFS Security
- HDFS Rack-awareness
- Data Blocks
- Data Block Replication Example
- HDFS NameNode Directory Diagram
- File Metadata Records (Conceptual View)
- NameNode Meta Information Size
- HDFS Balancing
- Accessing HDFS
- Examples of HDFS Commands
- Other Supported File Systems
- WebHDFS
- Examples of WebHDFS Calls
- HDFS Daemon Web UI Ports
- Viewing Replica Factor and Block Size in NameNode Web UI
- HDFS Write Operation
- HDFS Read Operation
- Read Operation Sequence Diagram
- Communication inside HDFS
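As a rough illustration of the WebHDFS items above, the sketch below lists an HDFS directory over the WebHDFS REST API from Python. The NameNode host, port (9870 is the Hadoop 3 default), and path are assumptions to adapt for your cluster, and the third-party requests package is required:

```python
# A minimal WebHDFS call from Python: LISTSTATUS returns a directory listing as JSON.
import requests

NAMENODE = "http://namenode.example.com:9870"   # hypothetical NameNode host and port

resp = requests.get(f"{NAMENODE}/webhdfs/v1/user/hive?op=LISTSTATUS")
resp.raise_for_status()

for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"], entry["length"])
```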
MapReduce with Hadoop
- Hadoop's MapReduce
- MapReduce 1 and MapReduce 2
- Why Discuss the Old MapReduce?
- MapReduce v1 ("Classic MapReduce")
- JobTracker and TaskTracker (the "Classic MapReduce")
- YARN (MapReduce v2)
- YARN vs MR1
- YARN As Data Operating System
- MapReduce Programming Options
- Hadoop's Streaming MapReduce
- Python Word Count Mapper Program Example
- Python Word Count Reducer Program Example
- Setting up Java Classpath for Streaming Support
- Streaming Use Cases
- The Streaming API vs Java MapReduce API
- Amazon Elastic MapReduce
- Apache Tez
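In the spirit of the Python word-count mapper and reducer items above, here is a minimal Hadoop Streaming sketch; the tab-delimited line format is the conventional default, so adapt it as needed:

```python
# mapper.py -- read lines from stdin and emit tab-separated (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop Streaming delivers the mapper output sorted by key,
# so equal words arrive on consecutive lines and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

On a cluster, scripts like these are typically submitted through the Hadoop Streaming JAR using its -files, -mapper, -reducer, -input, and -output options, as covered in this module.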
Apache Pig Scripting Platform
- What is Pig?
- Pig Latin
- Apache Pig Logo
- Pig Execution Modes
- Local Execution Mode
- MapReduce Execution Mode
- Running Pig
- Running Pig in Batch Mode
- What is Grunt?
- Pig Latin Statements
- Pig Programs
- Pig Latin Script Example
- SQL Equivalent
- Differences between Pig and SQL
- Statement Processing in Pig
- Comments in Pig
- Supported Simple Data Types
- Supported Complex Data Types
- Arrays
- Defining a Relation's Schema
- Not Matching the Defined Schema
- The bytearray Generic Type
- Using Field Delimiters
- Loading Data with TextLoader()
- Referencing Fields in Relations
Apache Pig Relational and Eval Operators
- Pig Relational Operators
- Example of Using the JOIN Operator
- Example of Using the Order By Operator
- Caveats of Using Relational Operators
- Pig Eval Functions
- Caveats of Using Eval Functions (Operators)
- Example of Using Single-column Eval Operations
- Example of Using Eval Operators For Global Operations
Hive
- What is Hive?
- Apache Hive Logo
- Hive's Value Proposition
- Who uses Hive?
- What Hive Does Not Have
- Hive's Main Sub-Systems
- Hive Features
- The "Classic" Hive Architecture
- The New Hive Architecture
- HiveQL
- Where are the Hive Tables Located?
- Hive Command-line Interface (CLI)
- The Beeline Command Shell
Hive Command-line Interface
- Hive Command-line Interface (CLI)
- The Hive Interactive Shell
- Running Host OS Commands from the Hive Shell
- Interfacing with HDFS from the Hive Shell
- Hive in Unattended Mode
- The Hive CLI Integration with the OS Shell
- Executing HiveQL Scripts
- Comments in Hive Scripts
- Variables and Properties in Hive CLI
- Setting Properties in CLI
- Example of Setting Properties in CLI
- Hive Namespaces
- Using the SET Command
- Setting Properties in the Shell
- Setting Properties for the New Shell Session
- Setting Alternative Hive Execution Engines
- The Beeline Shell
- Connecting to the Hive Server in Beeline
- Beeline Command Switches
- Beeline Internal Commands
Hive Data Definition Language
- Hive Data Definition Language
- Creating Databases in Hive
- Using Databases
- Creating Tables in Hive
- Supported Data Type Categories
- Common Numeric Types
- String and Date / Time Types
- Miscellaneous Types
- Example of the CREATE TABLE Statement
- Working with Complex Types
- Table Partitioning
- Table Partitioning on Multiple Columns
- Viewing Table Partitions
- Row Format
- Data Serializers / Deserializers
- File Format Storage
- File Compression
- More on File Formats
- The ORC Data Format
- Converting Text to ORC Data Format
- The EXTERNAL DDL Parameter
- Example of Using EXTERNAL
- Creating an Empty Table
- Dropping a Table
- Table / Partition(s) Truncation
- Alter Table/Partition/Column
- Views
- Create View Statement
- Why Use Views?
- Restricting Amount of Viewable Data
- Examples of Restricting Amount of Viewable Data
- Creating and Dropping Indexes
- Describing Data
Apache Sqoop
- What is Sqoop?
- Apache Sqoop Logo
- Sqoop Import / Export
- Sqoop Help
- Examples of Using Sqoop Commands
- Data Import Example
- Fine-tuning Data Import
- Controlling the Number of Import Processes
- Data Splitting
- Helping Sqoop Out
- Example of Executing Sqoop Load in Parallel
- A Word of Caution: Avoid Complex Free-Form Queries
- Using Direct Export from Databases
- Example of Using Direct Export from MySQL
- More on Direct Mode Import
- Data Export from HDFS
- Export Tool Common Arguments
- Data Export Control Arguments
- Data Export Example
- INSERT and UPDATE Statements
- INSERT Operations
- UPDATE Operations
- Example of the Update Operation
- Failed Exports
- Sqoop2
Introduction to Functional Programming
- What is Functional Programming (FP)?
- Terminology: Higher-Order Functions
- Terminology: Lambda vs Closure
- A Short List of Languages that Support FP
- FP with Java
- FP With JavaScript
- Imperative Programming in JavaScript
- The JavaScript map (FP) Example
- The JavaScript reduce (FP) Example
- Using reduce to Flatten an Array of Arrays (FP) Example
- The JavaScript filter (FP) Example
- Common Higher-Order Functions in Python
- Common Higher-Order Functions in Scala
- Elements of FP in R
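For orientation, the common higher-order functions listed above look like this in Python; the sketch mirrors the JavaScript map, filter, reduce, and flatten examples:

```python
# map, filter, and reduce as higher-order functions in Python.
from functools import reduce

numbers = [1, 2, 3, 4, 5]

squares = list(map(lambda n: n * n, numbers))          # map: transform each element
evens   = list(filter(lambda n: n % 2 == 0, numbers))  # filter: keep matching elements
total   = reduce(lambda acc, n: acc + n, numbers, 0)   # reduce: fold to a single value

# Flattening a list of lists with reduce, analogous to the JavaScript example above
nested = [[1, 2], [3, 4], [5]]
flat = reduce(lambda acc, xs: acc + xs, nested, [])

print(squares, evens, total, flat)
```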
Introduction to Apache Spark
- What is Apache Spark
- A Short History of Spark
- Where to Get Spark?
- The Spark Platform
- Spark Logo
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Driver Process
- Spark Applications
- Spark Shell
- The spark-submit Tool
- The spark-submit Tool Configuration
- The Executor and Worker Processes
- The Spark Application Architecture
- Interfaces with Data Storage Systems
- Limitations of Hadoop's MapReduce
- Spark vs MapReduce
- Spark as an Alternative to Apache Tez
- The Resilient Distributed Dataset (RDD)
- Spark Streaming (Micro-batching)
- Spark SQL
- Example of Spark SQL
- Spark Machine Learning Library
- GraphX
- Spark vs. R
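As a preview of the Spark SQL item above, here is a minimal PySpark sketch with a hypothetical in-line dataset; it uses the Spark 2+ SparkSession entry point, which subsumes the older sqlContext mentioned in the outline:

```python
# A minimal Spark SQL sketch: register a DataFrame as a temporary view and query it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Hypothetical sales data built in-line; real jobs would load from HDFS, Hive, etc.
sales = spark.createDataFrame(
    [("north", 120.0), ("south", 75.5), ("north", 42.0)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
spark.stop()
```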
The Spark Shell
- The Spark Shell
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- The Spark Context (sc) and SQL Context (sqlContext)
- The Shell Spark Context
- Loading Files
- Saving Files
- Basic Spark ETL Operations
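A minimal sketch of the basic ETL flow covered in this module, written as it might be typed into the PySpark shell, where the Spark context is already available as sc; the HDFS paths are hypothetical:

```python
# Extract, transform, and load with RDDs in the PySpark shell ('sc' is pre-created there).
log_lines = sc.textFile("hdfs:///data/logs/access.log")         # extract
errors = (log_lines
          .filter(lambda line: "ERROR" in line)                  # transform: keep error lines
          .map(lambda line: line.lower()))                       # transform: normalize case
errors.saveAsTextFile("hdfs:///data/logs/errors_out")            # load: write the results

print(errors.count())                                            # action: how many error lines?
```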
Spark RDDs
- The Resilient Distributed Dataset (RDD)
- Ways to Create an RDD
- Custom RDDs
- Supported Data Types
- RDD Operations
- RDDs are Immutable
- Spark Actions
- RDD Transformations
- Other RDD Operations
- Chaining RDD Operations
- RDD Lineage
- The Big Picture
- What May Go Wrong
- Checkpointing RDDs
- Local Checkpointing
- Parallelized Collections
- More on parallelize() Method
- The Pair RDD
- Where do I use Pair RDDs?
- Example of Creating a Pair RDD with Map
- Example of Creating a Pair RDD with keyBy
- Miscellaneous Pair RDD Operations
- RDD Caching
- RDD Persistence
- The Tachyon Storage
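Several of the RDD and pair-RDD operations listed above, sketched in PySpark with a hypothetical in-line word list:

```python
# parallelize, transformations, pair RDDs, reduceByKey, caching, and actions.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
words = sc.parallelize(["big", "data", "big", "spark", "data", "big"])

pairs = words.map(lambda w: (w, 1))             # transformation -> pair RDD (RDDs are immutable)
counts = pairs.reduceByKey(lambda a, b: a + b)  # combine values per key
counts.cache()                                  # keep the result in memory for reuse

print(counts.collect())                         # action, e.g. [('big', 3), ('data', 2), ('spark', 1)]
print(words.keyBy(lambda w: w[0]).groupByKey().mapValues(list).collect())
```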
Parallel Data Processing with Spark
- Running Spark on a Cluster
- Spark Stand-alone Option
- The High-Level Execution Flow in Stand-alone Spark Cluster
- Data Partitioning
- Data Partitioning Diagram
- Single Local File System RDD Partitioning
- Multiple File RDD Partitioning
- Special Cases for Small-sized Files
- Parallel Data Processing of Partitions
- Spark Application, Jobs, and Tasks
- Stages and Shuffles
- The "Big Picture"
- Summary
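A tiny PySpark sketch relating to the data partitioning items above: create an RDD with an explicit number of partitions and inspect how the elements were distributed (assumes a local SparkContext):

```python
# Explicit partitioning: ask for 4 partitions and look inside them with glom().
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(10), numSlices=4)   # request 4 partitions
print(rdd.getNumPartitions())                  # 4
print(rdd.glom().collect())                    # elements grouped per partition
```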
The Spark Machine Learning Library
- What is MLlib?
- Supported Languages
- MLlib Packages
- Dense and Sparse Vectors
- Labeled Point
- Python Example of Using the LabeledPoint Class
- LIBSVM format
- An Example of a LIBSVM File
- Loading LIBSVM Files
- Local Matrices
- Example of Creating Matrices in MLlib
- Distributed Matrices
- Example of Using a Distributed Matrix
- Classification and Regression Algorithms
- Clustering
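A minimal pyspark.mllib sketch of the dense/sparse vector and LabeledPoint items above, using hypothetical feature values:

```python
# Dense and sparse feature vectors, and labeled training examples, in MLlib.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

dense = Vectors.dense([1.0, 0.0, 3.5])
# Sparse vector of size 3 with non-zero entries at indices 0 and 2
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.5])

# Labeled training examples: a label (e.g. 1.0 / 0.0) plus a feature vector
positive_example = LabeledPoint(1.0, dense)
negative_example = LabeledPoint(0.0, sparse)

print(positive_example.label, positive_example.features)
print(negative_example.label, negative_example.features)
```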
Conclusion
Training Materials
All Data Science training students will receive comprehensive courseware.
Software Requirements
- Computer with Internet connectivity
- Ability to install software on the computer
- Recent 64-bit OS, such as Windows 10, macOS, or Linux