High Performance
Predictive Analytics
in R and Hadoop:
Achieving Big Data Big Analytics
Presented by:
Mario E. Inchiosa, Ph.D.
US Chief Scientist
August 27, 2013
Revolution Confidential
Agenda
• Riding the Hadoop Wave
• Big Data Big Analytics
• R + Hadoop from Revolution Analytics
• Revolution R Enterprise ScaleR
• Getting Started
• Q&A
Riding the Hadoop Wave
Solve old problems in new ways
Solve new problems
If you want something you’ve never had, you must be willing to do something you’ve never done.
• Major entertainment company integrates analytics across brands
• Fraud detection interval reduced from 2 weeks to 7 hours
• Predict mortgage default in time to avoid it
• Recommend optimal growing plan
Big Data Big Analytics is different
• Big Data is big, complex and messy
• Big Analytics are compute intensive
Big Data Big Analytics rewards you
Innovate with R
• Most widely used data analysis software
  - Used by 2M+ data scientists, statisticians and analysts
• Most powerful statistical programming language
  - Flexible, extensible and comprehensive for productivity
• Create beautiful and unique data visualizations
  - As seen in New York Times, Twitter and Flowing Data
• Thriving open-source community
  - Leading edge of analytics research
• Fills the talent gap
  - New graduates prefer R
Download the White Paper
R is Hot
bit.ly/r-is-hot
R is open source and drives analytic innovation, but has some limitations for enterprises
Area | Open source R | Revolution R Enterprise
Big Data | In-memory bound | Disk-based scalability
Speed of Analysis | Single threaded | Parallel threading
Enterprise Readiness | Community support | Commercial support
Analytic Breadth & Depth | 5,000+ innovative analytic packages | Leverage open source packages plus Big Data-ready packages
Revolution R Enterprise
High Performance, Multi-Platform Analytics Platform
• DeployR: Web Services Software Development Kit
• DevelopR: Integrated Development Environment
• ConnectR: High Speed & Direct Connectors (Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, ODBC)
• ScaleR: High Performance, Scalable, Portable, Parallelized, Full-Featured Big Data Analytics
• DistributeR: Parallel & Distributed Computing Framework (LSF, HPC Server, Azure Burst, Hadoop)
• RevoR: Performance Enhanced Open Source R + CRAN packages
IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM BigInsights, Intel Hadoop, SMP servers, Teradata
Open Source R plus Revolution Analytics performance enhancements
Revolution Analytics Value-Add Components: Providing Power and Scale to Open Source R
Big Data Speed @ Scale with
Revolution R Enterprise
Fast Math Libraries
Parallelized Algorithms
In-Database Execution
Multi-Threaded Execution
Multi-Core Execution
In-Hadoop Execution
Memory Management
Parallelized User Code
Our Objectives with Respect to Hadoop
• Provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box Predictive Analytics suite running in Hadoop
• Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their workstations
• Scalable and High Performance
Simplicity Goal: Hadoop As An R Engine.
• Run Revolution R Enterprise Code In Hadoop Without Change
• Provide ScaleR Pre-Parallelized Algorithms
• No Need To “Think In MapReduce”
• Eliminate Movement to Slash Latencies
• Expanded Deployment Options
Revolution R Enterprise ScaleR
• An R package that adds capabilities to R:
  - Data Import/Clean/Explore/Transform
  - Analytics: Descriptive and Predictive
  - Parallel and distributed computing
  - Visualization
• Scales from small local data to huge distributed data
• Scales from workstation to server to cluster to cloud
• Portable: the same code works on small and big data, and on workstation, server, cluster, Hadoop
High Performance Big Data Analytics with Revolution R Enterprise ScaleR
• Statistical Tests
• Machine Learning
• Simulation
• Descriptive Statistics
• Data Visualization
• R Data Step
• Predictive Models
• Sampling
ScaleR: High Performance Scalable Parallel External Memory Algorithms
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort
• Merge
• Split
• Aggregate by category (means, sums)
• Use any of the functionality of the R language to transform and clean data row by row!
Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• t-Test
• F-Test
• Plus 1,000s of other tests available in R!
Sampling
• Subsample (observations & variables)
• Random Sampling
• High quality, fast, parallel random number generators
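The slide marks Median and Quantiles as approximate. The deck does not say how ScaleR computes them; one common single-pass technique, sketched here in base R as an assumption, is to accumulate fixed-width bin counts chunk by chunk and read the quantile off the cumulative counts. The function name `approx_quantile`, the known value range `[lo, hi]`, and the synthetic data are all hypothetical.

```r
# Approximate quantiles from per-chunk bin counts (base R only, not a ScaleR API).
# Assumption: the value range [lo, hi] is known before the pass starts.
approx_quantile <- function(chunks, probs, lo, hi, nbins = 1000) {
  breaks <- seq(lo, hi, length.out = nbins + 1)
  counts <- numeric(nbins)
  for (chunk in chunks) {
    # Map each value to its bin; rightmost.closed keeps hi in the last bin
    idx <- findInterval(chunk, breaks, rightmost.closed = TRUE)
    counts <- counts + tabulate(idx, nbins = nbins)
  }
  cum <- cumsum(counts) / sum(counts)
  # First bin whose cumulative share reaches each requested probability
  bin <- vapply(probs, function(p) which(cum >= p)[1], integer(1))
  (breaks[bin] + breaks[bin + 1]) / 2   # report the bin midpoint
}

set.seed(1)
x <- runif(100000, 0, 100)                          # synthetic data
chunks <- split(x, ceiling(seq_along(x) / 10000))   # ten chunks of 10,000 values
est <- approx_quantile(chunks, c(0.25, 0.5, 0.75), lo = 0, hi = 100)
exact <- unname(quantile(x, c(0.25, 0.5, 0.75)))
```

Only the vector of bin counts is kept between chunks, so memory use depends on `nbins`, not on the number of rows; the price is an error of up to about one bin width.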
Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Statistical Modeling
Predictive Models
• Covariance, Correlation, Sum of Squares (cross product matrix for set variables) matrices
• Multiple Linear Regression
• Generalized Linear Models (GLM)
  - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Logistic Regression
• Classification & Regression Trees
• Decision Forests
• Predictions/scoring for models
• Residuals for all models
Data Visualization
• Histogram
• Line Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)
Variable Selection
• Stepwise Regression
• PCA
Simulation
• Parallel random number generators for Monte Carlo
• Use the rich functionality of R for simulations
Machine Learning
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
ScaleR Scalability and Performance
• Handles an arbitrarily large number of rows in a fixed amount of memory
• Scales linearly with the number of rows
• Scales linearly with the number of nodes
• Scales well with the number of cores per node
• Scales well with the number of parameters
• Extremely high performance
GLM comparison using in-memory data:
glm() and ScaleR’s rxGlm()
Allstate compares SAS, Hadoop and R for Big-Data Insurance Models
Approach | Platform | Time to fit
SAS | 16-core Sun Server | 5 hours
rmr/MapReduce | 10-node 80-core Hadoop Cluster | > 10 hours
R | 250 GB Server | Impossible (> 3 days)
Revolution R Enterprise | 5-node 20-core LSF cluster | 5.7 minutes
Generalized linear model, 150 million observations, 70 degrees of freedom
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
SAS HPA Benchmarking comparison*
Logistic Regression | SAS HPA | Revolution R Enterprise
Rows of data | 1 billion | 1 billion
Parameters | “just a few” | 7
Time | 80 seconds | 44 seconds
Data location | In memory | On disk
Nodes | 32 | 5
Cores | 384 | 20
RAM | 1,536 GB | 80 GB
Revolution R Enterprise is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
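The "a 20th" and "a 6th" figures follow directly from the numbers in the comparison. A short base R check, using only values copied from the table (the variable names are illustrative):

```r
# Resource figures copied from the SAS HPA comparison above
sas <- c(nodes = 32, cores = 384, ram_gb = 1536, seconds = 80)
rre <- c(nodes = 5,  cores = 20,  ram_gb = 80,   seconds = 44)

ratio <- rre / sas    # Revolution R Enterprise as a fraction of SAS HPA
fold  <- sas / rre    # how many times more SAS HPA used

round(fold, 1)
#   nodes   cores  ram_gb seconds
#     6.4    19.2    19.2     1.8
```

So the cluster had 1/6.4 as many nodes, 1/19.2 as many cores and 1/19.2 as much RAM, and the fit took 55% of the time.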
Slide callouts: Double; 45%; 1/6th; 5%; 5%
Revolution R Enterprise Delivers Performance at 2% of the Cost
Specific speed-related factors
• Efficient computational algorithms
• Efficient memory management: minimize data copying and data conversion
• Heavy use of C++ templates; optimal code
• Efficient data file format; fast access by row and column
• Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
• Handle categorical variables efficiently
ScaleR Parallel External Memory Algorithms (PEMAs)
• The ScaleR analytics algorithms are all built on a platform (DistributeR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
• These Parallel External Memory Algorithms (PEMAs) process data a chunk at a time in parallel across cores and nodes
• Steps: 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize
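The four steps can be illustrated with ordinary least squares, where each chunk contributes its own cross-product matrices and the final solve happens once. This is a minimal base R sketch of the pattern, not ScaleR's implementation; the `pema_*` function names and the synthetic data are hypothetical.

```r
# The four PEMA steps for least squares, written in base R.
# 1) Initialize: empty accumulators for X'X, X'y and the row count
pema_initialize <- function(p) {
  list(xtx = matrix(0, p, p), xty = numeric(p), n = 0)
}
# 2) Process Chunk: summarize one chunk without looking at any other
pema_process_chunk <- function(chunk) {
  X <- cbind(1, chunk$x1, chunk$x2)   # design matrix for this chunk only
  list(xtx = crossprod(X), xty = drop(crossprod(X, chunk$y)), n = nrow(X))
}
# 3) Aggregate: combine two intermediate results by addition
pema_aggregate <- function(a, b) {
  list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty, n = a$n + b$n)
}
# 4) Finalize: solve the normal equations once, after all chunks are in
pema_finalize <- function(state) {
  drop(solve(state$xtx, state$xty))
}

set.seed(42)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))     # synthetic data
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)
chunks <- split(dat, rep(1:10, each = n / 10))      # ten chunks of 1,000 rows

state <- pema_initialize(3)
for (chunk in chunks) {
  state <- pema_aggregate(state, pema_process_chunk(chunk))
}
beta_chunked <- pema_finalize(state)
beta_lm <- unname(coef(lm(y ~ x1 + x2, data = dat)))   # in-memory reference fit
```

Because the intermediate results combine by addition, chunks can be processed in any order or on different workers, and only a 3 x 3 matrix and a length-3 vector are held between chunks regardless of the row count.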
Scalability and portability of Revolution Analytics’ implementation of PEMAs
• These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability
• They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions
• They are independent of where the data is coming from, giving portability with respect to data sources
Simplified ScaleR Internal Architecture
• Analytics Engine: PEMAs are implemented here (Scalable, Parallelized, Threaded, Distributable)
• Inter-process Communication: MPI, RPC, Sockets, Files
• Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF
ScaleR on Hadoop
• Each pass through the data is one MapReduce job
• Prediction (Scoring), Transformation, Simulation:
  - Map tasks store results in HDFS or return to client
• Statistics, Model Building, Visualization:
  - Map tasks produce “intermediate result objects” that are aggregated by a Reduce task
  - Master process decides if another pass through the data is required
• Data can be cached or stored in XDF binary format for increased speed, especially on iterative algorithms
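The map / reduce / master loop described above can be mimicked on a single machine. The sketch below fits a logistic regression by Newton's method: each pass maps over the chunks to produce intermediate results, reduces them by addition, and the master loop decides whether another pass is needed. It is base R only and makes no claim about ScaleR's internals; `map_chunk`, `reduce_results`, `fit_logit` and the synthetic data are hypothetical.

```r
# One pass = map over chunks, then reduce; the master loop decides whether to go again.
map_chunk <- function(chunk, beta) {
  X   <- cbind(1, chunk$x)
  mu  <- 1 / (1 + exp(-drop(X %*% beta)))          # fitted probabilities
  w   <- mu * (1 - mu)
  list(grad = drop(crossprod(X, chunk$y - mu)),    # score contribution
       hess = crossprod(X, X * w))                 # information contribution
}
reduce_results <- function(a, b) {
  list(grad = a$grad + b$grad, hess = a$hess + b$hess)
}
fit_logit <- function(chunks, tol = 1e-8, max_passes = 25) {
  beta <- c(0, 0)
  for (pass in seq_len(max_passes)) {
    total <- Reduce(reduce_results, lapply(chunks, map_chunk, beta = beta))
    step  <- drop(solve(total$hess, total$grad))   # Newton update
    beta  <- beta + step
    if (max(abs(step)) < tol) break                # master: no further pass needed
  }
  list(coef = beta, passes = pass)
}

set.seed(7)
n <- 20000
x <- rnorm(n)                                      # synthetic predictor
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))          # synthetic 0/1 response
chunks <- split(data.frame(x = x, y = y), rep(1:8, each = n / 8))
fit <- fit_logit(chunks)
ref <- unname(coef(glm(y ~ x, family = binomial))) # in-memory reference fit
```

Each pass rereads every chunk, which is why the slide's point about caching data or storing it in a fast binary format matters most for iterative algorithms like this one.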
Sample code for logit on workstation
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
        UniqueCarrier + F(CRSDepTime),
        data = airData)
Sample code for logit on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
        UniqueCarrier + F(CRSDepTime),
        data = airData)
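For readers without Revolution R Enterprise, a similar model can be written with base R's glm() on a small in-memory data frame. This is only a sketch of the formula's meaning: the data below are randomly generated stand-ins that reuse the slide's column names, Year and Month are left out for brevity, and factor() is used on the assumption that the slide's F(CRSDepTime) treats the departure time as a categorical variable.

```r
# Synthetic stand-in for the airline data; values are random and carry no real meaning.
set.seed(123)
n <- 2000
air <- data.frame(
  ArrDelay      = round(rnorm(n, mean = 8, sd = 25)),
  Origin        = sample(c("ORD", "SFO", "JFK"), n, replace = TRUE),
  DayOfWeek     = sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                         n, replace = TRUE),
  UniqueCarrier = sample(c("AA", "DL", "UA"), n, replace = TRUE),
  CRSDepTime    = sample(6:22, n, replace = TRUE)   # hypothetical departure hour
)
air$Late <- as.integer(air$ArrDelay > 15)           # the slide's ArrDelay > 15 response

# Logistic regression with the departure hour treated as a factor
fit <- glm(Late ~ Origin + DayOfWeek + UniqueCarrier + factor(CRSDepTime),
           data = air, family = binomial)
n_coef <- length(coef(fit))
```

glm() holds the whole data frame in memory, which is the limitation the chunked rxLogit() call is meant to remove.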
Demo rxLinMod in Hadoop - Launching
Demo rxLinMod in Hadoop - In Progress
Demo rxLinMod in Hadoop - Completed
Revolution R Enterprise 7 on Hadoop
• Revolution R Enterprise 7 on Hadoop and Analytics Clusters
  - “Right Tool For The Job”
  - RRE 7 “Inside” and “Beside” Hadoop
  - Connect a Compute Server or Cluster to Hadoop
• When To Use:
  - Production Hadoop Cluster
  - Need Parallelized Algorithms
  - Heavy Random Workloads
  - Extensive “Sandboxing”
  - Big Data Scoring
  - Data Security Constraints
  - Legacy Data Sources
• Advantages:
  - Independent Scalability
  - Flexibility
  - Low Latency
[Diagram labels: Data (EDW & Other Sources); Hadoop (MapReduce Applications, Other MapReduce Jobs); ConnectR (HBase, HDFS, ODBC & High-Speed Connectors); Analytics Server or Cluster (Linux, Windows, LSF or Azure); Compute Server/Cluster; Revolution R Enterprise (DistributedR Framework, ScaleR Algorithms, ConnectR: HDFS, ODBC & High-Speed Connectors); DeployR; BI and Browser]
Consulting Services and Training
Services
• Remote & On site
• Projects & Staff Aug
• Quick Start Programs
• Entire project lifecycle
  - Big Data Analytics Strategy
  - Design & Architecture
  - Use Case Definition
  - Model Development & Deployment
  - Support & Maintenance
Training
• Comprehensive Topics
  - R with Hadoop
  - R for SAS Users
  - Data Visualization
  - Parallel Computing with RRE
  - Big Data Analytics with RRE
• Self Paced & Classroom
• Customizable
Polling Question 3
Questions
Contact Revolution Analytics at
info@revolutionanalytics.com

More Related Content

What's hot

Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedRevolution Analytics
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for HadoopWilly Marroquin (WillyDevNET)
 
Intro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarIntro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarRevolution Analytics
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...Revolution Analytics
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Revolution Analytics
 
DeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsDeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsRevolution Analytics
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R ServicesGregg Barrett
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Revolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution AnalyticsRevolution Analytics
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with RTechsparks
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
R server and spark
R server and sparkR server and spark
R server and sparkBAINIDA
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionRevolution Analytics
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
Revolution Analytics: a 5-minute history
Revolution Analytics: a 5-minute historyRevolution Analytics: a 5-minute history
Revolution Analytics: a 5-minute historyRevolution Analytics
 
R for SAS Users Complement or Replace Two Strategies
R for SAS Users Complement or Replace Two StrategiesR for SAS Users Complement or Replace Two Strategies
R for SAS Users Complement or Replace Two StrategiesRevolution Analytics
 

What's hot (20)

Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
 
Intro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarIntro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User Webinar
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
 
DeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsDeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence Applications
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with R
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
R server and spark
R server and sparkR server and spark
R server and spark
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and Revolution
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Revolution Analytics: a 5-minute history
Revolution Analytics: a 5-minute historyRevolution Analytics: a 5-minute history
Revolution Analytics: a 5-minute history
 
R for SAS Users Complement or Replace Two Strategies
R for SAS Users Complement or Replace Two StrategiesR for SAS Users Complement or Replace Two Strategies
R for SAS Users Complement or Replace Two Strategies
 

Viewers also liked

Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsAjay Ohri
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part Ijayroy
 
Data Science - Part I - Sustaining Predictive Analytics Capabilities
Data Science - Part I - Sustaining Predictive Analytics CapabilitiesData Science - Part I - Sustaining Predictive Analytics Capabilities
Data Science - Part I - Sustaining Predictive Analytics CapabilitiesDerek Kane
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
BIG Data Science: A Path Forward
BIG Data Science:  A Path ForwardBIG Data Science:  A Path Forward
BIG Data Science: A Path ForwardDan Mallinger
 
Revolution Analytics Supports the Open Source R Community
Revolution Analytics Supports the Open Source R CommunityRevolution Analytics Supports the Open Source R Community
Revolution Analytics Supports the Open Source R CommunityRevolution Analytics
 
Big Analytics: Building Lasting Value
Big Analytics: Building Lasting ValueBig Analytics: Building Lasting Value
Big Analytics: Building Lasting ValueDan Mallinger
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersRevolution Analytics
 
American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)Revolution Analytics
 
Apachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to knowApachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to knowChristian Gügi
 
R + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterR + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterJeffrey Breen
 
Big Data at #WADAY11
Big Data at #WADAY11 Big Data at #WADAY11
Big Data at #WADAY11 Cosimo Accoto
 
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink:  Fast and reliable large-scale data processingJanuary 2015 HUG: Apache Flink:  Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processingYahoo Developer Network
 
CGT Research May 2013: Analytics & Insights
CGT Research May 2013: Analytics & InsightsCGT Research May 2013: Analytics & Insights
CGT Research May 2013: Analytics & InsightsCognizant
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RRsquared Academy
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big DataRevolution Analytics
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-iDr. Awase Khirni Syed
 

Viewers also liked (20)

Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media Analytics
 
Predictive Analytics using R
Predictive Analytics using RPredictive Analytics using R
Predictive Analytics using R
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part I
 
Data Science - Part I - Sustaining Predictive Analytics Capabilities
Data Science - Part I - Sustaining Predictive Analytics CapabilitiesData Science - Part I - Sustaining Predictive Analytics Capabilities
Data Science - Part I - Sustaining Predictive Analytics Capabilities
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
BIG Data Science: A Path Forward
BIG Data Science:  A Path ForwardBIG Data Science:  A Path Forward
BIG Data Science: A Path Forward
 
Revolution Analytics Supports the Open Source R Community
Revolution Analytics Supports the Open Source R CommunityRevolution Analytics Supports the Open Source R Community
Revolution Analytics Supports the Open Source R Community
 
Big Analytics: Building Lasting Value
Big Analytics: Building Lasting ValueBig Analytics: Building Lasting Value
Big Analytics: Building Lasting Value
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
 
American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)
 
Apachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to knowApachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to know
 
R + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterR + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop cluster
 
Big Data at #WADAY11
Big Data at #WADAY11 Big Data at #WADAY11
Big Data at #WADAY11
 
R2DOCX example
R2DOCX exampleR2DOCX example
R2DOCX example
 
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink:  Fast and reliable large-scale data processingJanuary 2015 HUG: Apache Flink:  Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
 
CGT Research May 2013: Analytics & Insights
CGT Research May 2013: Analytics & InsightsCGT Research May 2013: Analytics & Insights
CGT Research May 2013: Analytics & Insights
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big Data
 
Just in time
Just in timeJust in time
Just in time
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-i
 

Similar to High Performance Predictive Analytics in R and Hadoop

What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2Revolution Analytics
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution Analytics
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computingBAINIDA
 
Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analyticstempledf
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
useR2011 - Edlefsen
useR2011 - EdlefsenuseR2011 - Edlefsen
useR2011 - Edlefsenrusersla
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
Big data analytics on teradata with revolution r enterprise bill jacobs
Big data analytics on teradata with revolution r enterprise   bill jacobsBig data analytics on teradata with revolution r enterprise   bill jacobs
Big data analytics on teradata with revolution r enterprise bill jacobsBill Jacobs
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationRevolution Analytics
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL ServerŁukasz Grala
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & RŁukasz Grala
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 

Similar to High Performance Predictive Analytics in R and Hadoop (20)

High Performance Predictive Analytics in R and Hadoop

  • 1. High Performance Predictive Analytics in R and Hadoop: Achieving Big Data Big Analytics. Presented by: Mario E. Inchiosa, Ph.D., US Chief Scientist. August 27, 2013
  • 2. Revolution Confidential: Agenda
      ◦ Riding the Hadoop Wave
      ◦ Big Data Big Analytics
      ◦ R + Hadoop from Revolution Analytics
      ◦ Revolution R Enterprise ScaleR
      ◦ Getting Started
      ◦ Q&A
  • 4. Solve old problems in new ways Solve new problems
  • 5. If you want something you’ve never had, you must be willing to do something you’ve never done.
  • 6. Major entertainment company integrates analytics across brands
  • 7. Fraud detection interval reduced from 2 weeks to 7 hours
  • 8. Predict mortgage default in time to avoid it
  • 10. Big Data Big Analytics is different
  • 12. Big Data is big, complex and messy
  • 13. Big Analytics are compute intensive
  • 14. Big Data Big Analytics rewards you
  • 16. Innovate with R
      ◦ Most widely used data analysis software
          ▪ Used by 2M+ data scientists, statisticians and analysts
      ◦ Most powerful statistical programming language
          ▪ Flexible, extensible and comprehensive for productivity
      ◦ Create beautiful and unique data visualizations
          ▪ As seen in New York Times, Twitter and Flowing Data
      ◦ Thriving open-source community
          ▪ Leading edge of analytics research
      ◦ Fills the talent gap
          ▪ New graduates prefer R
      ◦ Download the White Paper "R is Hot": bit.ly/r-is-hot
  • 17. R is open source and drives analytic innovation, but it has some limitations for enterprises
      ◦ Big Data: in-memory bound → disk-based scalability
      ◦ Speed of Analysis: single threaded → parallel threading
      ◦ Enterprise Readiness: community support → commercial support
      ◦ Analytic Breadth & Depth: 5,000+ innovative analytic packages → leverage open source packages plus Big Data-ready packages
  • 18. Revolution R Enterprise: High Performance, Multi-Platform Analytics Platform
      ◦ DeployR: Web Services Software Development Kit
      ◦ DevelopR: Integrated Development Environment
      ◦ ConnectR: High Speed & Direct Connectors (Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, ODBC)
      ◦ ScaleR: High Performance, Scalable, Portable, Parallelized, Full-Featured Big Data Analytics
      ◦ DistributeR: Parallel & Distributed Computing Framework (LSF, HPC Server, Azure Burst, Hadoop)
      ◦ RevoR: Performance Enhanced Open Source R + CRAN packages
      ◦ Platforms: IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM BigInsights, Intel Hadoop, SMP servers, Teradata
      ◦ Open Source R plus Revolution Analytics performance enhancements
      ◦ Revolution Analytics Value-Add Components Providing Power and Scale to Open Source R
  • 19. Big Data Speed @ Scale with Revolution R Enterprise
      ◦ Fast Math Libraries
      ◦ Parallelized Algorithms
      ◦ In-Database Execution
      ◦ Multi-Threaded Execution
      ◦ Multi-Core Execution
      ◦ In-Hadoop Execution
      ◦ Memory Management
      ◦ Parallelized User Code
  • 20. Our Objectives with Respect to Hadoop
      ◦ Provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box Predictive Analytics suite running in Hadoop
      ◦ Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their workstations
      ◦ Scalable and High Performance
  • 21. Simplicity Goal: Hadoop As An R Engine
      ◦ Run Revolution R Enterprise Code In Hadoop Without Change
      ◦ Provide ScaleR Pre-Parallelized Algorithms
      ◦ No Need To “Think In MapReduce”
      ◦ Eliminate Movement to Slash Latencies
      ◦ Expanded Deployment Options
  • 22. Revolution R Enterprise ScaleR
      ◦ An R package that adds capabilities to R:
          ▪ Data Import/Clean/Explore/Transform
          ▪ Analytics: Descriptive and Predictive
          ▪ Parallel and distributed computing
          ▪ Visualization
      ◦ Scales from small local data to huge distributed data
      ◦ Scales from workstation to server to cluster to cloud
      ◦ Portable: the same code works on small and big data, and on workstation, server, cluster, Hadoop
  • 23. High Performance Big Data Analytics with Revolution R Enterprise ScaleR
      ◦ Statistical Tests
      ◦ Machine Learning
      ◦ Simulation
      ◦ Descriptive Statistics
      ◦ Data Visualization
      ◦ R Data Step
      ◦ Predictive Models
      ◦ Sampling
  • 24. ScaleR: High Performance Scalable Parallel External Memory Algorithms. Data Prep, Distillation & Descriptive Analytics
      ◦ R Data Step
          ▪ Data import: Delimited, Fixed, SAS, SPSS, ODBC
          ▪ Variable creation & transformation
          ▪ Recode variables
          ▪ Factor variables
          ▪ Missing value handling
          ▪ Sort
          ▪ Merge
          ▪ Split
          ▪ Aggregate by category (means, sums)
          ▪ Use any of the functionality of the R language to transform and clean data row by row!
      ◦ Descriptive Statistics
          ▪ Min / Max
          ▪ Mean
          ▪ Median (approx.)
          ▪ Quantiles (approx.)
          ▪ Standard Deviation
          ▪ Variance
          ▪ Correlation
          ▪ Covariance
          ▪ Sum of Squares (cross product matrix for set variables)
          ▪ Risk Ratio & Odds Ratio
          ▪ Cross-Tabulation of Data (standard tables & long form)
          ▪ Marginal Summaries of Cross Tabulations
      ◦ Statistical Tests
          ▪ Chi Square Test
          ▪ t-Test
          ▪ F-Test
          ▪ Plus 1,000s of other tests available in R!
      ◦ Sampling
          ▪ Subsample (observations & variables)
          ▪ Random Sampling
          ▪ High quality, fast, parallel random number generators
  • 25. Revolution R Enterprise ScaleR: High Performance Big Data Analytics
      ◦ Statistical Modeling / Predictive Models
          ▪ Covariance, Correlation, Sum of Squares (cross product matrix for set variables) matrices
          ▪ Multiple Linear Regression
          ▪ Generalized Linear Models (GLM): all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user-defined distributions & link functions
          ▪ Logistic Regression
          ▪ Classification & Regression Trees
          ▪ Decision Forests
          ▪ Predictions/scoring for models
          ▪ Residuals for all models
      ◦ Data Visualization
          ▪ Histogram
          ▪ Line Plot
          ▪ Lorenz Curve
          ▪ ROC Curves (actual data and predicted values)
      ◦ Cluster Analysis
          ▪ K-Means
      ◦ Machine Learning / Classification
          ▪ Decision Trees
          ▪ Decision Forests
      ◦ Simulation
          ▪ Parallel random number generators for Monte Carlo
          ▪ Use the rich functionality of R for simulations
      ◦ Variable Selection
          ▪ Stepwise Regression
          ▪ PCA
  • 26. ScaleR Scalability and Performance
      ◦ Handles an arbitrarily large number of rows in a fixed amount of memory
      ◦ Scales linearly with the number of rows
      ◦ Scales linearly with the number of nodes
      ◦ Scales well with the number of cores per node
      ◦ Scales well with the number of parameters
      ◦ Extremely high performance
  • 27. GLM comparison using in-memory data: glm() and ScaleR’s rxGlm()
  • 28. Allstate compares SAS, Hadoop and R for Big-Data Insurance Models
      Generalized linear model, 150 million observations, 70 degrees of freedom
      Approach                   Platform                          Time to fit
      SAS                        16-core Sun Server                5 hours
      rmr/MapReduce              10-node 80-core Hadoop Cluster    > 10 hours
      R                          250 GB Server                     Impossible (> 3 days)
      Revolution R Enterprise    5-node 20-core LSF cluster        5.7 minutes
      http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
  • 29. SAS HPA Benchmarking comparison* (Logistic Regression)
                       SAS HPA          Revolution R Enterprise    Slide annotation
      Rows of data     1 billion        1 billion
      Parameters       “just a few”     7                          Double
      Time             80 seconds       44 seconds                 45%
      Data location    In memory        On disk
      Nodes            32               5                          1/6th
      Cores            384              20                         5%
      RAM              1,536 GB         80 GB                      5%
      Revolution R Enterprise is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
      Revolution R Enterprise Delivers Performance at 2% of the Cost
      *As published by SAS in HPC Wire, April 21, 2011
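The ratios quoted on slide 29 can be checked directly from the figures in the comparison. The short R sketch below only redoes that arithmetic; the vector names are illustrative and do not come from the slides.

```r
# Figures as listed on the slide (SAS HPA vs. Revolution R Enterprise).
sas <- c(seconds = 80, nodes = 32, cores = 384, ram_gb = 1536)
rre <- c(seconds = 44, nodes = 5, cores = 20, ram_gb = 80)

# RRE's figure as a fraction of the SAS figure, row by row.
ratio <- rre / sas

# Percentage reduction in run time (44 s against 80 s).
time_saving_pct <- 100 * (1 - ratio[["seconds"]])

print(round(100 * ratio, 1))
print(time_saving_pct)
```

Cores and RAM both come out at a little over 5% (about a 20th), nodes at a little under 16% (about a 6th), and the 44-second run is 45% shorter than the 80-second one.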
  • 30. Specific speed-related factors
      ◦ Efficient computational algorithms
      ◦ Efficient memory management: minimize data copying and data conversion
      ◦ Heavy use of C++ templates; optimal code
      ◦ Efficient data file format; fast access by row and column
      ◦ Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
      ◦ Handle categorical variables efficiently
  • 31. ScaleR Parallel External Memory Algorithms (PEMAs)
      ◦ The ScaleR analytics algorithms are all built on a platform (DistributeR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
      ◦ These Parallel External Memory Algorithms (PEMAs) process data a chunk at a time in parallel across cores and nodes
      ◦ 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize
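The four steps named on slide 31 can be illustrated with a toy example in base R. This is only a sketch of the general pattern, not ScaleR's implementation: the `pema_*` function names are hypothetical, and the statistic (mean and variance from running sums) was chosen because it reduces cleanly to per-chunk results.

```r
# 1) Initialize: an empty accumulator of sufficient statistics.
pema_init <- function() list(n = 0, sum = 0, sumsq = 0)

# 2) Process chunk: reduce one chunk to its sufficient statistics.
pema_process <- function(x) list(n = length(x), sum = sum(x), sumsq = sum(x^2))

# 3) Aggregate: combine two intermediate results into one.
pema_aggregate <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

# 4) Finalize: turn the aggregated statistics into the final answer.
pema_finalize <- function(s) {
  m <- s$sum / s$n
  list(n = s$n, mean = m, var = (s$sumsq - s$n * m^2) / (s$n - 1))
}

set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)

# Split into chunks of 1,000 values; each chunk could be handled by a
# different core or node because pema_process() needs no shared state.
chunks <- split(x, ceiling(seq_along(x) / 1000))
partials <- lapply(chunks, pema_process)
total <- Reduce(pema_aggregate, partials, pema_init())
result <- pema_finalize(total)

print(result)
```

Only the small accumulator is kept between chunks, so memory use does not grow with the number of rows. The sum-of-squares formula is used for brevity; it is numerically fragile for data with a large mean relative to its spread.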
  • 32. Scalability and portability of Revolution Analytics’ implementation of PEMAs
      ◦ These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability
      ◦ They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions
      ◦ They are independent of where the data is coming from, giving portability with respect to data sources
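One way to picture the data-source independence described on slide 32 is to hide each source behind a "give me the next chunk" function and write the algorithm against that function alone. The base R sketch below does this for an aggregate-by-category sum; `chunks_from_df`, `chunks_from_csv` and `sum_by_group` are hypothetical helpers written for this illustration and are not part of ScaleR.

```r
# Chunk iterator over an in-memory data frame.
chunks_from_df <- function(df, chunk_rows) {
  start <- 1L
  function() {
    if (start > nrow(df)) return(NULL)
    end <- min(start + chunk_rows - 1L, nrow(df))
    out <- df[start:end, , drop = FALSE]
    start <<- end + 1L
    out
  }
}

# Chunk iterator over a CSV file: only chunk_rows lines are held at a time.
chunks_from_csv <- function(path, chunk_rows) {
  con <- file(path, open = "r")
  col_names <- scan(text = readLines(con, n = 1L), what = "", sep = ",", quiet = TRUE)
  function() {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) {
      close(con)
      return(NULL)
    }
    read.csv(text = lines, header = FALSE, col.names = col_names)
  }
}

# The algorithm sees only next_chunk(); it keeps one running sum per group.
sum_by_group <- function(next_chunk) {
  totals <- numeric(0)
  while (!is.null(chunk <- next_chunk())) {
    part <- vapply(split(chunk$x, chunk$g), function(v) sum(as.numeric(v)), numeric(1))
    for (k in names(part)) {
      prev <- if (k %in% names(totals)) totals[[k]] else 0
      totals[k] <- prev + part[[k]]
    }
  }
  totals[order(names(totals))]
}

df <- data.frame(x = c(1.5, 2, 3.5, 4, 5, 6.5, 7, 8, 9, 10), g = rep(c("a", "b"), 5))
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE, quote = FALSE)

from_memory <- sum_by_group(chunks_from_df(df, chunk_rows = 3))
from_disk <- sum_by_group(chunks_from_csv(path, chunk_rows = 4))
print(from_memory)
print(from_disk)
```

The two calls use different sources and different chunk sizes but the same `sum_by_group` code, and the result does not depend on how the rows were chunked.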
  • 33. Simplified ScaleR Internal Architecture
      ◦ Analytics Engine: PEMAs are implemented here (Scalable, Parallelized, Threaded, Distributable)
      ◦ Inter-process Communication: MPI, RPC, Sockets, Files
      ◦ Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF
  • 34. ScaleR on Hadoop
      ◦ Each pass through the data is one MapReduce job
      ◦ Prediction (Scoring), Transformation, Simulation:
          ▪ Map tasks store results in HDFS or return to client
      ◦ Statistics, Model Building, Visualization:
          ▪ Map tasks produce “intermediate result objects” that are aggregated by a Reduce task
          ▪ Master process decides if another pass through the data is required
      ◦ Data can be cached or stored in XDF binary format for increased speed, especially on iterative algorithms
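The model-building flow on slide 34 (map tasks emit intermediate results, a reduce step aggregates them, and a master decides whether another pass is needed) can be imitated in base R. The sketch below fits a logistic regression by iteratively reweighted least squares, with one "pass" per iteration. It is an illustration under stated assumptions, not ScaleR's algorithm or its Hadoop integration: `map_chunk`, `reduce_results` and `fit_logit` are hypothetical names, and the chunks are plain in-memory data frames rather than HDFS blocks.

```r
# Map: turn one chunk into an intermediate result object (X'WX and X'Wz).
map_chunk <- function(chunk, beta) {
  X <- cbind(1, chunk$x)                    # intercept plus one predictor
  eta <- drop(X %*% beta)
  p <- 1 / (1 + exp(-eta))
  w <- p * (1 - p)
  list(xtwx = crossprod(X, X * w),          # X' W X
       xtwz = crossprod(X, w * eta + (chunk$y - p)))  # X' W z
}

# Reduce: intermediate results simply add up.
reduce_results <- function(a, b) {
  list(xtwx = a$xtwx + b$xtwx, xtwz = a$xtwz + b$xtwz)
}

# Master: run passes over all chunks until the coefficients stop changing.
fit_logit <- function(chunks, tol = 1e-8, max_passes = 25) {
  beta <- c(0, 0)
  for (pass in seq_len(max_passes)) {
    total <- Reduce(reduce_results, lapply(chunks, map_chunk, beta = beta))
    new_beta <- drop(solve(total$xtwx, total$xtwz))
    converged <- max(abs(new_beta - beta)) < tol
    beta <- new_beta
    if (converged) break
  }
  list(coef = beta, passes = pass)
}

set.seed(42)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
dat <- data.frame(x = x, y = y)
chunks <- split(dat, rep(1:4, length.out = n))  # four "blocks" of data

fit <- fit_logit(chunks)
print(fit)
```

Each pass touches every chunk once and returns only two small matrices per chunk, which is why iterative algorithms of this kind benefit from cached or fast-to-read data, as the slide notes.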
  • 35. Sample code for logit on workstation
      # Specify local data source
      # (myLocalDataSource is a placeholder for a data source defined elsewhere)
      airData <- myLocalDataSource
      # Specify model formula and parameters
      rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
                UniqueCarrier + F(CRSDepTime),
              data = airData)
  • 36. Sample code for logit on Hadoop
      # Change the "compute context"
      # (myHadoopCluster is a placeholder for a compute context defined elsewhere)
      rxSetComputeContext(myHadoopCluster)
      # Change the data source if necessary
      airData <- myHadoopDataSource
      # Otherwise, the code is the same
      rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
                UniqueCarrier + F(CRSDepTime),
              data = airData)
  • 37. Demo rxLinMod in Hadoop - Launching
  • 38. Demo rxLinMod in Hadoop - In Progress
  • 39. Demo rxLinMod in Hadoop - Completed
  • 40. Revolution R Enterprise 7 on Hadoop
      ◦ Revolution R Enterprise 7 on Hadoop and Analytics Clusters
      ◦ “Right Tool For The Job”
      ◦ RRE 7 “Inside” and “Beside” Hadoop
      ◦ Connect a Compute Server or Cluster to Hadoop
      ◦ When To Use:
          ▪ Production Hadoop Cluster
          ▪ Need Parallelized Algorithms
          ▪ Heavy Random Workloads
          ▪ Extensive “Sandboxing”
          ▪ Big Data Scoring
          ▪ Data Security Constraints
          ▪ Legacy Data Sources
      ◦ Advantages:
          ▪ Independent Scalability
          ▪ Flexibility
          ▪ Low Latency
      [Diagram labels: Data; EDW & Other Sources; MapReduce Applications; Hadoop; Other MapReduce Jobs; ConnectR (HBase, HDFS, ODBC & High-Speed Connectors); Analytics Server or Cluster (Linux, Windows, LSF or Azure); Compute Server/Cluster; Analytics; Revolution R Enterprise (DistributedR Framework, ScaleR Algorithms, ConnectR: HDFS, ODBC & High-Speed Connectors); DeployR; BI and Browser]
  • 41. Consulting Services and Training
      ◦ Services
          ▪ Remote & On site
          ▪ Projects & Staff Aug
          ▪ Quick Start Programs
          ▪ Entire project lifecycle
          ▪ Big Data Analytics Strategy
          ▪ Design & Architecture
          ▪ Use Case Definition
          ▪ Model Development & Deployment
          ▪ Support & Maintenance
      ◦ Training
          ▪ Comprehensive Topics
          ▪ Self Paced & Classroom
          ▪ Customizable
          ▪ R with Hadoop
          ▪ R for SAS Users
          ▪ Data Visualization
          ▪ Parallel Computing with RRE
          ▪ Big Data Analytics with RRE
  • 43. Questions? Contact Revolution Analytics at info@revolutionanalytics.com