27 Aug 2013 webinar: High Performance Predictive Analytics in Hadoop and R, presented by Mario E. Inchiosa, Ph.D., US Chief Scientist, and Kathleen Rohrecker, Director of Product Marketing
1. High Performance Predictive Analytics in R and Hadoop: Achieving Big Data Big Analytics
Presented by:
Mario E. Inchiosa, Ph.D.
US Chief Scientist
August 27, 2013
2. Revolution Confidential
Agenda
Riding the Hadoop Wave
Big Data Big Analytics
R + Hadoop from Revolution Analytics
Revolution R Enterprise ScaleR
Getting Started
Q&A
16. Innovate with R
Most widely used data analysis software
Used by 2M+ data scientists, statisticians and analysts
Most powerful statistical programming language
Flexible, extensible and comprehensive for productivity
Create beautiful and unique data visualizations
As seen in New York Times, Twitter and Flowing Data
Thriving open-source community
Leading edge of analytics research
Fills the talent gap
New graduates prefer R
Download the White Paper
R is Hot
bit.ly/r-is-hot
17. R is open source and drives analytic innovation but has some limitations for enterprises
Dimension                 Open source R                         Revolution R Enterprise
Big Data                  In-memory bound                       Disk-based scalability
Speed of Analysis         Single threaded                       Parallel threading
Enterprise Readiness      Community support                     Commercial support
Analytic Breadth & Depth  5,000+ innovative analytic packages   Leverage open source packages plus Big Data-ready packages
18. Revolution R Enterprise: High Performance, Multi-Platform Analytics Platform
Revolution Analytics Value-Add Components Providing Power and Scale to Open Source R:
DeployR: Web Services Software Development Kit
DevelopR: Integrated Development Environment
ConnectR: High Speed & Direct Connectors (Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, ODBC)
ScaleR: High Performance, Scalable, Portable, Parallelized, Full-Featured Big Data Analytics
DistributeR: Parallel & Distributed Computing Framework (LSF, HPC Server, Azure Burst, Hadoop)
Open Source R Plus Revolution Analytics performance enhancements:
RevoR: Performance Enhanced Open Source R + CRAN packages
IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM BigInsights, Intel Hadoop, SMP servers, Teradata
19. Big Data Speed @ Scale with Revolution R Enterprise
Fast Math Libraries
Parallelized Algorithms
In-Database Execution
Multi-Threaded Execution
Multi-Core Execution
In-Hadoop Execution
Memory Management
Parallelized User Code
20. Our Objectives with Respect to Hadoop
Provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box Predictive Analytics suite running in Hadoop
Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their workstations
Scalable and High Performance
21. Simplicity Goal: Hadoop As An R Engine
Run Revolution R Enterprise Code In Hadoop Without Change
Provide ScaleR Pre-Parallelized Algorithms
No Need To “Think In MapReduce”
Eliminate Movement to Slash Latencies
Expanded Deployment Options
22. Revolution R Enterprise ScaleR
An R package that adds capabilities to R:
Data Import/Clean/Explore/Transform
Analytics – Descriptive and Predictive
Parallel and distributed computing
Visualization
Scales from small local data to huge distributed data
Scales from workstation to server to cluster to cloud
Portable – the same code works on small and big data, and on workstation, server, cluster, Hadoop
23. High Performance Big Data Analytics with Revolution R Enterprise ScaleR
R Data Step
Descriptive Statistics
Statistical Tests
Sampling
Predictive Models
Machine Learning
Simulation
Data Visualization
24. ScaleR: High Performance Scalable Parallel External Memory Algorithms
Data Prep, Distillation & Descriptive Analytics
R Data Step:
Data import – Delimited, Fixed, SAS, SPSS, ODBC
Variable creation & transformation
Recode variables
Factor variables
Missing value handling
Sort
Merge
Split
Aggregate by category (means, sums)
Use any of the functionality of the R language to transform and clean data row by row!
Descriptive Statistics:
Min / Max
Mean
Median (approx.)
Quantiles (approx.)
Standard Deviation
Variance
Correlation
Covariance
Sum of Squares (cross product matrix for set variables)
Risk Ratio & Odds Ratio
Cross-Tabulation of Data (standard tables & long form)
Marginal Summaries of Cross Tabulations
Statistical Tests:
Chi Square Test
t-Test
F-Test
Plus 1,000s of other tests available in R!
Sampling:
Subsample (observations & variables)
Random Sampling
High quality, fast, parallel random number generators
25. Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Statistical Modeling, Machine Learning
Predictive Models:
Covariance, Correlation, Sum of Squares (cross product matrix for set variables) matrices
Multiple Linear Regression
Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.
Logistic Regression
Classification & Regression Trees
Decision Forests
Predictions/scoring for models
Residuals for all models
Data Visualization:
Histogram
Line Plot
Lorenz Curve
ROC Curves (actual data and predicted values)
Cluster Analysis:
K-Means
Classification:
Decision Trees
Decision Forests
Simulation:
Parallel random number generators for Monte Carlo
Use the rich functionality of R for simulations
Variable Selection:
Stepwise Regression
PCA
26. ScaleR Scalability and Performance
Handles an arbitrarily large number of rows in a fixed amount of memory
Scales linearly with the number of rows
Scales linearly with the number of nodes
Scales well with the number of cores per node
Scales well with the number of parameters
Extremely high performance
27. GLM comparison using in-memory data: glm() and ScaleR’s rxGlm()
Revolution R Enterprise 27
28. Allstate compares SAS, Hadoop and R for Big-Data Insurance Models
Approach                 Platform                        Time to fit
SAS                      16-core Sun Server              5 hours
rmr/MapReduce            10-node 80-core Hadoop Cluster  > 10 hours
R                        250 GB Server                   Impossible (> 3 days)
Revolution R Enterprise  5-node 20-core LSF cluster      5.7 minutes
Generalized linear model, 150 million observations, 70 degrees of freedom
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
29. SAS HPA Benchmarking comparison*
Logistic Regression
               SAS HPA         Revolution R Enterprise  Revolution R Enterprise vs. SAS HPA
Rows of data   1 billion       1 billion
Parameters     “just a few”    7                        Double
Time           80 seconds      44 seconds               45%
Data location  In memory       On disk
Nodes          32              5                        1/6th
Cores          384             20                       5%
RAM            1,536 GB        80 GB                    5%
Revolution R Enterprise is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
Revolution R Enterprise Delivers Performance at 2% of the Cost
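The ratios quoted on this slide can be checked directly from the table figures. The short sketch below only redoes that arithmetic; the variable names are illustrative, and the reading of the slide's "45%" as a time reduction (44 s is 55% of 80 s) is an assumption, since the slide does not say which way the percentage runs.

```r
# Figures copied from the benchmark comparison on this slide.
sas <- c(time_sec = 80, nodes = 32, cores = 384, ram_gb = 1536)
rre <- c(time_sec = 44, nodes = 5,  cores = 20,  ram_gb = 80)

# Revolution R Enterprise figure as a fraction of the SAS HPA figure.
ratio <- rre / sas

# Time: 44 / 80 = 0.55, i.e. 45% less elapsed time (assumed reading of "45%").
time_reduction <- 1 - ratio[["time_sec"]]

# Nodes: 5 / 32 is about 1/6.4; cores and RAM: both about 1/19.2 (roughly 5%).
inverse <- sas / rre

print(round(ratio, 3))
print(round(inverse, 1))
```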
30. Specific speed-related factors
Efficient computational algorithms
Efficient memory management – minimize data copying and data conversion
Heavy use of C++ templates; optimal code
Efficient data file format; fast access by row and column
Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
Handle categorical variables efficiently
31. ScaleR Parallel External Memory Algorithms (PEMAs)
The ScaleR analytics algorithms are all built on a platform (DistributeR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
These Parallel External Memory Algorithms (PEMAs) process data a chunk at a time in parallel across cores and nodes
1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize
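The four steps can be illustrated with a small base R sketch that computes a mean and variance one chunk at a time. This is not the ScaleR implementation: the function names (pema_init, pema_process_chunk, pema_aggregate, pema_finalize) are hypothetical, and the in-memory vector split into chunks stands in for data read from disk.

```r
# Hypothetical helper names; a sketch of the four-step pattern, not ScaleR code.

# 1) Initialize: an empty intermediate result.
pema_init <- function() list(n = 0, mean = 0, m2 = 0)

# 2) Process chunk: summarize one chunk independently of all the others.
pema_process_chunk <- function(x) {
  m <- mean(x)
  list(n = length(x), mean = m, m2 = sum((x - m)^2))
}

# 3) Aggregate: merge two intermediate results with the standard pairwise
#    update for means and sums of squared deviations.
pema_aggregate <- function(a, b) {
  n <- a$n + b$n
  if (n == 0) return(a)
  delta <- b$mean - a$mean
  list(n = n,
       mean = a$mean + delta * b$n / n,
       m2 = a$m2 + b$m2 + delta^2 * a$n * b$n / n)
}

# 4) Finalize: turn the merged intermediate result into the final statistics.
pema_finalize <- function(s) list(n = s$n, mean = s$mean, var = s$m2 / (s$n - 1))

set.seed(1)
x <- rnorm(10000)
chunks <- split(x, ceiling(seq_along(x) / 1000))   # ten chunks of 1,000 rows

partials <- lapply(chunks, pema_process_chunk)     # each call could run on a different core or node
total <- Reduce(pema_aggregate, partials, pema_init())
result <- pema_finalize(total)
```

Only the small intermediate results are held across chunks, so memory use does not grow with the number of rows, and because the aggregate step is associative the chunks can be processed in any order.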
32. Scalability and portability of Revolution Analytics’ implementation of PEMAs
These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability
They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions
They are independent of where the data is coming from, giving portability with respect to data sources
33. Simplified ScaleR Internal Architecture
Analytics Engine: PEMAs are implemented here (Scalable, Parallelized, Threaded, Distributable)
Inter-process Communication: MPI, RPC, Sockets, Files
Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF
34. ScaleR on Hadoop
Each pass through the data is one MapReduce job
Prediction (Scoring), Transformation, Simulation: Map tasks store results in HDFS or return to client
Statistics, Model Building, Visualization: Map tasks produce “intermediate result objects” that are aggregated by a Reduce task; the Master process decides if another pass through the data is required
Data can be cached or stored in XDF binary format for increased speed, especially on iterative algorithms
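The model-building flow on this slide can be mimicked in plain R. The sketch below fits a logistic regression by iteratively reweighted least squares, with each iteration written as one map step over chunks, one reduce step, and a master-side convergence check. It runs locally with lapply and Reduce rather than on Hadoop, the function names (map_chunk, reduce_parts) and the simulated data are hypothetical, and it is not a description of how ScaleR implements rxLogit internally.

```r
# Map: one chunk -> an "intermediate result object" (pieces of the normal equations).
map_chunk <- function(chunk, beta) {
  X <- cbind(1, chunk$x)               # design matrix with intercept
  eta <- drop(X %*% beta)              # linear predictor
  p <- 1 / (1 + exp(-eta))             # fitted probabilities
  w <- p * (1 - p)                     # IRLS weights
  z <- eta + (chunk$y - p) / w         # working response
  list(xtwx = crossprod(X, X * w), xtwz = crossprod(X, w * z))
}

# Reduce: intermediate results are combined by simple addition.
reduce_parts <- function(a, b) list(xtwx = a$xtwx + b$xtwx, xtwz = a$xtwz + b$xtwz)

# Simulated data, split into five chunks standing in for HDFS blocks.
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
dat <- data.frame(x = x, y = y)
chunks <- split(dat, rep(1:5, length.out = n))

beta <- c(0, 0)
passes <- 0
repeat {
  parts <- lapply(chunks, map_chunk, beta = beta)   # one pass = one "MapReduce job"
  agg <- Reduce(reduce_parts, parts)
  beta_new <- drop(solve(agg$xtwx, agg$xtwz))
  passes <- passes + 1
  converged <- max(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (converged || passes >= 25) break              # master decides on another pass
}

# Reference fit with open source R on the full in-memory data.
reference <- unname(coef(glm(y ~ x, family = binomial, data = dat)))
```

Each pass re-reads every chunk, which is why caching the data or storing it in a fast binary format matters most for iterative algorithms like this one.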
35. Sample code for logit on workstation
# Specify local data source (myLocalDataSource is a placeholder)
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
          UniqueCarrier + F(CRSDepTime),
        data = airData)
36. Sample code for logit on Hadoop
# Change the “compute context” (myHadoopCluster is a placeholder)
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
          UniqueCarrier + F(CRSDepTime),
        data = airData)
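For comparison, a model of the same shape can be fitted with open source R's in-memory glm() on a small sample. The sketch below is an assumption-laden stand-in rather than the webinar's code: the data frame airSample is randomly generated, Year and Month are left out to keep it short, and factor(CRSDepTime) is used on the assumption that ScaleR's F() treats a numeric variable as categorical.

```r
# Synthetic stand-in for a sample of the airline data (values are random, not real).
set.seed(7)
n <- 2000
airSample <- data.frame(
  ArrDelay      = round(rnorm(n, mean = 8, sd = 30)),
  Origin        = factor(sample(c("ORD", "SFO", "JFK"), n, replace = TRUE)),
  DayOfWeek     = factor(sample(1:7, n, replace = TRUE)),
  UniqueCarrier = factor(sample(c("AA", "UA", "DL"), n, replace = TRUE)),
  CRSDepTime    = sample(6:22, n, replace = TRUE)   # scheduled departure hour
)

# Logistic regression on "arrival delay over 15 minutes", all data held in memory.
fit <- glm(ArrDelay > 15 ~ Origin + DayOfWeek + UniqueCarrier + factor(CRSDepTime),
           data = airSample, family = binomial)

n_coef <- length(coef(fit))
```

This in-memory approach is the one the earlier slides describe as limited by available RAM; the rxLogit calls above keep the same formula style while reading the data in chunks.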
40. Revolution R Enterprise 7 on Hadoop and Analytics Clusters
“Right Tool For The Job”
RRE 7 “Inside” and “Beside” Hadoop
Connect a Compute Server or Cluster to Hadoop
When To Use:
Production Hadoop Cluster
Need Parallelized Algorithms
Heavy Random Workloads
Extensive “Sandboxing”
Big Data Scoring
Data Security Constraints
Legacy Data Sources
Advantages:
Independent Scalability
Flexibility
Low Latency
[Architecture diagram. Recoverable labels: Data (EDW & Other Sources, MapReduce Applications); Hadoop (Other MapReduce Jobs); Analytics Server or Cluster: Linux, Windows, LSF or Azure; Compute Server/Cluster; Revolution R Enterprise (DistributedR Framework, ScaleR Algorithms, ConnectR: HBase, HDFS, ODBC & High-Speed Connectors); DeployR; BI and Browser]
41. Consulting Services and Training
Services:
Remote & On site
Projects & Staff Augmentation
Quick Start Programs
Entire project lifecycle
Service areas:
Big Data Analytics Strategy
Design & Architecture
Use Case Definition
Model Development & Deployment
Support & Maintenance
Training:
Comprehensive Topics
Self Paced & Classroom
Customizable
Training topics:
R with Hadoop
R for SAS Users
Data Visualization
Parallel Computing with RRE
Big Data Analytics with RRE