Analyzing Big Data with R Programming

Course Number: RPROG-112
Duration: 4 days (26 hours)
Format: Live, hands-on

Big Data with R Training Overview

Accelebrate's Analyzing Big Data with R Programming training teaches attendees how to perform in-memory, on-disk, and distributed analysis in R using H2O, Hadoop, and Apache Spark, and how to integrate Microsoft Machine Learning Server with R.

Location and Pricing

Accelebrate offers instructor-led enterprise training for groups of 3 or more online or at your site. Most Accelebrate classes can be flexibly scheduled for your group, including delivery in half-day segments across a week or set of weeks. To receive a customized proposal and price quote for private corporate training on-site or online, please contact us.

In addition, some courses are available as live, instructor-led training from one of our partners.

Objectives

  • Understand how R works with big data sets
  • Manage big data in memory with data.table
  • Conduct exploratory data analysis with data.table
  • Learn big data management strategies such as sampling, chunk-and-pull, and pushing compute to the database
  • Run SQL queries directly against R dataframes using DuckDB
  • Use DuckDB as an out-of-memory backend for R dataframes
  • Perform machine learning operations using mlr3
  • Interface with Apache Spark using sparklyr or SparkR
  • Use H2O for data munging and machine learning

Prerequisites

In addition to their professional experience, students who attend this course should have:

  • Programming experience using R, and familiarity with common R packages
  • Knowledge of common statistical methods and data analysis best practices
  • Basic knowledge of the Microsoft Windows operating system and its core functionality

Outline

Introduction
  • Does R work with big datasets?
  • What challenges does big data introduce when using R?
  • ETL and descriptive data tasks
  • Modeling tasks, optimization challenges
In-memory Big Data: data.table
  • Why do we need data.table?
  • The i and the j arguments in data.table
  • Renaming columns
  • Adding new columns
  • Binning data (continuous to categorical)
  • Combining categorical values
  • Transforming variables
  • Group-by functions with data.table
  • Chaining commands with data.table
  • data.table special symbols .N and .SD, and the .SDcols argument
  • Handling missing data
EDA with data.table
  • Data subsetting, splitting, and merging
  • Managing datasets
  • Long to wide and back
  • Merging datasets together
  • Stacking datasets together (concatenation)
  • Data summarization
    • Numerical summaries
    • Categorical summaries
    • Multivariate summaries
  • Creating visualizations
Big Three Strategies for dealing with Big Data in R
  • https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/
  • 1. Sampling
  • 2. Chunk-and-pull
  • 3. Push compute to DB
DuckDB 
  • Overview: DuckDB works nicely with R
  • Basic SQL commands for working with DuckDB
  • Understanding query performance optimizations
  • Using dbplyr to work with DuckDB
mlr3 for Machine Learning in R
  • Overview of mlr3
  • Goals of machine learning
  • R6 object-oriented programming and methods in mlr3
  • Defining a task
  • Assigning roles to data
  • Performing a classification
  • Performing a regression
  • Visualization with mlr3
  • Pipelines
  • Model assessment
  • Model optimization
  • Implementing general linear models
  • Establishing and leveraging partitions/clusters
  • Fitting regression models and making predictions
  • Decision trees and random forests
  • Naïve Bayes
  • Implementing stacked models via pipelines
  • Implementing an AutoML model via pipelines
  • Managing resource utilization through parallelization
Apache Spark
  • Overview of Spark
  • APIs to use Apache Spark with R
  • Sparklyr versus SparkR
  • R, Python, Java and Scala APIs to Spark
  • Applied Examples using SparkR
  • Spark and H2O together: Sparkling Water
  • Data import and manipulation in Spark(R)
  • The Spark machine learning library MLlib:
    • Generalized linear models
    • Random forest
    • Naïve Bayes
Data Munging and Machine Learning via H2O
  • Intro to H2O
  • Launching the cluster, checking status
  • Data import and manipulation in H2O
  • Fitting models in H2O
  • Generalized linear models
  • Naïve Bayes
  • Random forest
  • Gradient boosting machine (GBM)
  • Ensemble model building
  • AutoML
  • Methods for explaining modeling output
Conclusion

Training Materials

All R training students receive comprehensive courseware.

Software Requirements

  • A recent release of R 4.x
  • IDE or text editor of your choice (RStudio recommended)


Accelebrate really brought the goods. By the end of the class, every one of the students was solving their own business-specific problem.

 David - National Renewable Energy Laboratory, Golden, CO

Learn faster

Our live, instructor-led lectures are far more effective than pre-recorded classes

Satisfaction guarantee

If your team is not 100% satisfied with your training, we do what's necessary to make it right

Learn online from anywhere

Whether you are at home or in the office, we make learning interactive and engaging

Multiple Payment Options

We accept check, ACH/EFT, major credit cards, and most purchase orders


