Using Sparklyr
https://spark.rstudio.com/guides/connections/
Configuring Spark Connections
Local mode
Local mode is an excellent way to learn and experiment with Spark.
Local mode also provides a convenient development environment for analyses, reports, and applications that you plan to eventually deploy to a multi-node Spark cluster.
To work in local mode, you should first install a version of Spark for local use.
You can do this using the spark_install()
function, for example:
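A minimal call looks like the following; the version number is only an illustration, so use whichever Spark version your sparklyr release supports:
library(sparklyr)
spark_install(version = "2.1.0")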
Recommended properties
The following are the recommended Spark properties to set when connecting via R:
sparklyr.cores.local - Defaults to using all of the available cores. It does not need to be set unless there is a reason to use fewer cores than are available for a given Spark session.
sparklyr.shell.driver-memory - The limit is the amount of RAM available on the computer minus what is needed for OS operations.
spark.memory.fraction - The default is set to 60% of the requested memory per executor.
For more information, please see this Memory Management Overview page in the official Spark website.
Connection example
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "16G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local",
                    version = "2.1.0",
                    config = conf)
Executors page
To see how the requested configuration affected the Spark connection, go to the Executors page in the Spark Web UI, available at http://localhost:4040/storage/
Customizing connections
A connection to Spark can be customized by setting the values of certain Spark properties.
In sparklyr
, Spark properties can be set by using the config
argument in the spark_connect()
function.
By default, spark_connect()
uses spark_config()
as the default configuration.
But that can be customized as shown in the example code below.
Because of the vast number of possible combinations, spark_config()
contains only a basic configuration, so additional settings will very likely be needed to connect properly to the cluster.
conf <- spark_config() # Load variable with spark_config()
conf$spark.executor.memory <- "16G" # Use `$` to add or set values
sc <- spark_connect(master = "yarn-client",
config = conf) # Pass the conf variable
Spark definitions
It may be useful to provide some simple definitions for the Spark nomenclature:
Node: A server
Worker Node: A server that is part of the cluster and is available to run Spark jobs
Master Node: The server that coordinates the Worker nodes.
Executor: A sort of virtual machine inside a node.
One Node can have multiple Executors.
Driver Node: The Node that initiates the Spark session.
Typically, this will be the server where sparklyr
is located.
Driver (Executor): The Driver Node will also show up in the Executor list.
Useful concepts
Spark configuration properties passed by R are just requests - In most cases, the cluster has the final say regarding the resources apportioned to a given Spark session.
The cluster overrides ‘silently’ - Many times, no errors are returned when more resources than allowed are requested, or if an attempt is made to change a setting fixed by the cluster.
YARN
Background
Using Spark and R inside a Hadoop based Data Lake is becoming a common practice at companies.
Currently, there is no good way to manage user connections to the Spark service centrally.
There are some caps and settings that can be applied, but in most cases there are configurations that the R user will need to customize.
The Running on YARN page on Spark’s official website is the best place to start for configuration-settings reference; please bookmark it.
Cluster administrators and users can benefit from this document.
If Spark is new to the company, the YARN tuning article, courtesy of Cloudera, does a great job of explaining how the Spark/YARN architecture works.
Recommended properties
The following are the recommended Spark properties to set when connecting via R:
spark.executor.memory - The maximum possible is managed by the YARN cluster.
See the Executor Memory Error
spark.executor.cores - Number of cores assigned per Executor.
spark.executor.instances - Number of executors to start.
This property is acknowledged by the cluster if spark.dynamicAllocation.enabled is set to “false”.
spark.dynamicAllocation.enabled - Overrides the mechanism that Spark provides to dynamically adjust resources.
Disabling it provides more control over the number of Executors that can be started, which in turn impacts the amount of storage available for the session.
For more information, please see the Dynamic Resource Allocation page in the official Spark website.
Client mode
Using yarn-client
as the value for the master
argument in spark_connect()
makes the server where R is running the driver of the Spark session.
Here is a sample connection:
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master = "yarn-client",
spark_home = "/usr/lib/spark/",
version = "1.6.0",
config = conf)
Executors page
To see how the requested configuration affected the Spark connection, go to the Executors page in the Spark Web UI.
Typically, the Spark Web UI can be found using the exact same URL used for RStudio but on port 4040.
Notice that 155.3MB per executor are assigned instead of the 300MB requested.
This is because spark.memory.fraction has been fixed by the cluster, and a fixed amount of memory is also designated for overhead.
Cluster mode
Running in cluster mode means that YARN will choose where the driver of the Spark session will run.
This means that the server where R is running may not necessarily be the driver for that session.
Here is a good write-up explaining how Spark applications run on YARN: Running Spark on YARN
The server will need to have copies of at least two files: yarn-site.xml
and hive-site.xml
.
There may be other files needed based on your cluster’s individual setup.
This is an example of connecting to a Cloudera cluster:
library(sparklyr)
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-7-oracle-cloudera/")
Sys.setenv(SPARK_HOME = '/opt/cloudera/parcels/CDH/lib/spark')
Sys.setenv(YARN_CONF_DIR = '/opt/cloudera/parcels/CDH/lib/spark/conf/yarn-conf')
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master = "yarn-cluster",
                    config = conf)
Executor memory error
Requesting more memory or CPUs for Executors than allowed will return an error.
This is one of the exceptions to the cluster’s ‘silent’ overrides.
It will return a message similar to this:
Failed during initialize_connection: java.lang.IllegalArgumentException: Required executor memory (16384+1638 MB) is above the max threshold (8192 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'
A cluster’s administrator is the only person who can make changes to the settings mentioned in the error.
If the cluster is supported by a vendor, like Cloudera or Hortonworks, then the change can be made using the cluster’s web UI.
Otherwise, changes to those settings are done directly in the yarn-default.xml file.
Kerberos
There are two options to access a “kerberized” data lake:
Use kinit to get and cache the ticket.
After kinit is installed and configured, it can be used in R via a system()
call prior to connecting to the cluster: system("echo '<password>' | kinit <username>")
For more information visit this site: Apache - Authenticate with kinit
A preferred option may be to use the out-of-the-box integration with Kerberos that the commercial version of RStudio Server offers.
Standalone mode
Recommended properties
The following are the recommended Spark properties to set when connecting via R:
The default behavior in Standalone mode is to create one executor per worker.
So in a 3-worker-node cluster, there will be 3 executors set up.
The basic properties that can be set are:
spark.executor.memory - The requested memory cannot exceed the actual RAM available.
spark.memory.fraction - The default is set to 60% of the requested memory per executor.
For more information, please see this Memory Management Overview page in the official Spark website.
spark.executor.cores - The requested cores cannot be higher than the cores available in each worker.
Dynamic Allocation
If dynamic allocation is disabled, then Spark will attempt to assign all of the available cores evenly across the cluster.
The property used is spark.dynamicAllocation.enabled.
For example, the Standalone cluster used for this article has 3 worker nodes.
Each node has 14.7GB in RAM and 4 cores.
This means that there are a total of 12 cores (3 workers with 4 cores) and 44.1GB in RAM (3 workers with 14.7GB in RAM each).
If the spark.executor.cores
property is set to 2, and dynamic allocation is disabled, then Spark will spawn 6 executors.
The spark.executor.memory
property should be set so that, when multiplied by 6 (the number of executors), it does not exceed the total available RAM.
In this case, the value can be safely set to 7GB so that the total memory requested will be 42GB, which is under the available 44.1GB.
Connection example
conf <- spark_config()
conf$spark.executor.memory <- "7GB"
conf$spark.memory.fraction <- 0.9
conf$spark.executor.cores <- 2
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master="spark://master-url:7077",
version = "2.1.0",
config = conf,
spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/")
Executors page
To see how the requested configuration affected the Spark connection, go to the Executors page in the Spark Web UI.
Typically, the Spark Web UI can be found using the exact same URL used for RStudio but on port 4040:
Troubleshooting
Help with code debugging
For general programming questions with sparklyr
, please ask on Stack Overflow.
Code does not work after upgrading to the latest sparklyr version
Please refer to the NEWS section of the sparklyr package to find out if any of the updates listed may have changed the way your code needs to work.
If it seems that the current version of the package has a bug, or the new functionality does not perform as stated, please refer to the sparklyr ISSUES page.
If no existing issue matches your problem, please open a new issue.
Not able to connect, or the jobs take a long time when working with a Data Lake
The Configuring Spark Connections section contains an overview and recommendations for requesting resources from the cluster.
The articles in the Guides section provide best-practice information about specific operations that may match the intent of your code.
To verify your infrastructure, please review the Deployment Examples section.
Manipulating Data with dplyr
Overview
dplyr is
an R package for working with structured data both in and outside of R.
dplyr makes data manipulation for R users easy, consistent, and
performant.
With dplyr as an interface to manipulating Spark DataFrames,
you can:
Select, filter, and aggregate data
Use window functions (e.g. for sampling)
Perform joins on DataFrames
Collect data from Spark into R
Statements in dplyr can be chained together using pipes defined by the
magrittr
R package.
dplyr also supports non-standard evaluation of its arguments.
For more information on dplyr, see the
introduction,
a guide for connecting to
databases,
and a variety of
vignettes.
Reading Data
You can read data into Spark DataFrames using the following
functions:
spark_read_csv - Reads a CSV file and provides a data source compatible with dplyr
spark_read_json - Reads a JSON file and provides a data source compatible with dplyr
spark_read_parquet - Reads a Parquet file and provides a data source compatible with dplyr
Regardless of the format of your data, Spark supports reading data from
a variety of different data sources.
These include data stored on HDFS
(hdfs://
protocol), Amazon S3 (s3n://
protocol), or local files
available to the Spark worker nodes (file://
protocol).
Each of these functions returns a reference to a Spark DataFrame which
can be used as a dplyr table (tbl
).
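As a minimal sketch (the file path below is hypothetical, and sc is assumed to be an active Spark connection with sparklyr and dplyr loaded), reading a CSV file and then using the result as a dplyr table might look like this:
# Read a CSV file into Spark; the path is only an illustration
flights_csv <- spark_read_csv(sc, name = "flights_csv",
                              path = "file:///tmp/flights.csv")
# The returned reference behaves like a dplyr tbl
flights_csv %>% head(5)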
Flights Data
This guide will demonstrate some of the basic data manipulation verbs of
dplyr by using data from the nycflights13
R package.
This package
contains data for all 336,776 flights departing New York City in 2013.
It also includes useful metadata on airlines, airports, weather, and
planes.
The data comes from the US Bureau of Transportation
Statistics,
and is documented in ?nycflights13
Connect to the cluster and copy the flights data using the copy_to
function.
Caveat: The flight data in nycflights13
is convenient for
dplyr demonstrations because it is small, but in practice large data
should rarely be copied directly from R objects.
library(sparklyr)
library(dplyr)
library(nycflights13)
library(ggplot2)
sc <- spark_connect(master="local")
flights <- copy_to(sc, flights, "flights")
airlines <- copy_to(sc, airlines, "airlines")
src_tbls(sc)
## [1] "airlines" "flights"
dplyr Verbs
Verbs are dplyr commands for manipulating data.
When connected to a
Spark DataFrame, dplyr translates the commands into Spark SQL
statements.
Remote data sources use exactly the same five verbs as local
data sources.
Here are the five verbs with their corresponding SQL
commands:
select
~ SELECT
filter
~ WHERE
arrange
~ ORDER
summarise
~ aggregators: sum, min, sd, etc.
mutate
~ operators: +, *, log, etc.
select(flights, year:day, arr_delay, dep_delay)
## # Source: lazy query [?? x 5]
## # Database: spark_connection
## year month day arr_delay dep_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 11.0 2.00
## 2 2013 1 1 20.0 4.00
## 3 2013 1 1 33.0 2.00
## 4 2013 1 1 -18.0 -1.00
## 5 2013 1 1 -25.0 -6.00
## 6 2013 1 1 12.0 -4.00
## 7 2013 1 1 19.0 -5.00
## 8 2013 1 1 -14.0 -3.00
## 9 2013 1 1 - 8.00 -3.00
## 10 2013 1 1 8.00 -2.00
## # ... with more rows
filter(flights, dep_delay > 1000)
## # Source: lazy query [?? x 19]
## # Database: spark_connection
## year month day dep_t~ sche~ dep_~ arr_~ sche~ arr_~ carr~ flig~ tail~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384~
## 2 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517~
## 3 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504~
## 4 2013 7 22 845 1600 1005 1044 1815 989 MQ 3075 N665~
## 5 2013 9 20 1139 1845 1014 1457 2210 1007 AA 177 N338~
## # ... with 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>
arrange(flights, desc(dep_delay))
## # Source: table<flights> [?? x 19]
## # Database: spark_connection
## # Ordered by: desc(dep_delay)
## year month day dep_~ sche~ dep_~ arr_~ sche~ arr_~ carr~ flig~ tail~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384~
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504~
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517~
## 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA 177 N338~
## 5 2013 7 22 845 1600 1005 1044 1815 989 MQ 3075 N665~
## 6 2013 4 10 1100 1900 960 1342 2211 931 DL 2391 N959~
## 7 2013 3 17 2321 810 911 135 1020 915 DL 2119 N927~
## 8 2013 6 27 959 1900 899 1236 2226 850 DL 2007 N376~
## 9 2013 7 22 2257 759 898 121 1026 895 DL 2047 N671~
## 10 2013 12 5 756 1700 896 1058 2020 878 AA 172 N5DM~
## # ... with more rows, and 7 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour
## # <dbl>
summarise(flights, mean_dep_delay = mean(dep_delay))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## # Source: lazy query [?? x 1]
## # Database: spark_connection
## mean_dep_delay
## <dbl>
## 1 12.6
mutate(flights, speed = distance / air_time * 60)
## # Source: lazy query [?? x 20]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>, speed <dbl>
Laziness
When working with databases, dplyr tries to be as lazy as possible:
It never pulls data into R unless you explicitly ask for it.
It delays doing any work until the last possible moment: it collects
together everything you want to do and then sends it to the database
in one step.
For example, take the following
code:
c1 <- filter(flights, day == 17, month == 5, carrier %in% c('UA', 'WN', 'AA', 'DL'))
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- arrange(c2, year, month, day, carrier)
c4 <- mutate(c3, air_time_hours = air_time / 60)
This sequence of operations never actually touches the database.
It’s
not until you ask for the data (e.g. by printing c4
) that dplyr
requests the results from the database.
c4
## # Source: lazy query [?? x 8]
## # Database: spark_connection
## # Ordered by: year, month, day, carrier
## year month day carrier dep_delay air_time distance air_time_hours
## <int> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2013 5 17 AA -2.00 294 2248 4.90
## 2 2013 5 17 AA -1.00 146 1096 2.43
## 3 2013 5 17 AA -2.00 185 1372 3.08
## 4 2013 5 17 AA -9.00 186 1389 3.10
## 5 2013 5 17 AA 2.00 147 1096 2.45
## 6 2013 5 17 AA -4.00 114 733 1.90
## 7 2013 5 17 AA -7.00 117 733 1.95
## 8 2013 5 17 AA -7.00 142 1089 2.37
## 9 2013 5 17 AA -6.00 148 1089 2.47
## 10 2013 5 17 AA -7.00 137 944 2.28
## # ... with more rows
Piping
You can use
magrittr
pipes to write cleaner syntax.
Using the same example from above, you
can write a much cleaner version like this:
c4 <- flights %>%
filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
select(carrier, dep_delay, air_time, distance) %>%
arrange(carrier) %>%
mutate(air_time_hours = air_time / 60)
Grouping
The group_by
function corresponds to the GROUP BY
statement in SQL.
c4 %>%
group_by(carrier) %>%
summarize(count = n(), mean_dep_delay = mean(dep_delay))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## carrier count mean_dep_delay
## <chr> <dbl> <dbl>
## 1 AA 94.0 1.47
## 2 DL 136 6.24
## 3 UA 172 9.63
## 4 WN 34.0 7.97
Collecting to R
You can copy data from Spark into R’s memory by using collect()
.
carrierhours <- collect(c4)
collect()
executes the Spark query and returns the results to R for
further analysis and visualization.
# Test the significance of pairwise differences and plot the results
with(carrierhours, pairwise.t.test(air_time, carrier))
##
## Pairwise comparisons using t tests with pooled SD
##
## data: air_time and carrier
##
## AA DL UA
## DL 0.25057 - -
## UA 0.07957 0.00044 -
## WN 0.07957 0.23488 0.00041
##
## P value adjustment method: holm
ggplot(carrierhours, aes(carrier, air_time_hours)) + geom_boxplot()
SQL Translation
It’s relatively straightforward to translate R code to SQL (or indeed to
any programming language) when doing simple mathematical operations of
the form you normally use when filtering, mutating and summarizing.
dplyr knows how to convert the following R functions to Spark SQL:
# Basic math operators
+, -, *, /, %%, ^
# Math functions
abs, acos, asin, asinh, atan, atan2, ceiling, cos, cosh, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
# Logical comparisons
<, <=, !=, >=, >, ==, %in%
# Boolean operations
&, &&, |, ||, !
# Character functions
paste, tolower, toupper, nchar
# Casting
as.double, as.integer, as.logical, as.character, as.date
# Basic aggregations
mean, sum, min, max, sd, var, cor, cov, n
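To check how a particular pipeline is translated, you can render the generated SQL with dbplyr::sql_render(); here is a minimal sketch using the flights table created above:
# Render the Spark SQL that dplyr generates for a simple pipeline
flights %>%
  filter(dep_delay > 60) %>%
  mutate(air_time_hours = air_time / 60) %>%
  dbplyr::sql_render()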
Window Functions
dplyr supports Spark SQL window functions.
Window functions are used in
conjunction with mutate and filter to solve a wide range of problems.
You can compare the dplyr syntax to the query it has generated by using dbplyr::sql_render()
.
# Find the most and least delayed flight each day
bestworst <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
filter(dep_delay == min(dep_delay) || dep_delay == max(dep_delay))
dbplyr::sql_render(bestworst)
## Warning: Missing values are always removed in SQL.
## Use `min(x, na.rm = TRUE)` to silence this warning
## Warning: Missing values are always removed in SQL.
## Use `max(x, na.rm = TRUE)` to silence this warning
## <SQL> SELECT `year`, `month`, `day`, `dep_delay`
## FROM (SELECT `year`, `month`, `day`, `dep_delay`, min(`dep_delay`) OVER (PARTITION BY `year`, `month`, `day`) AS `zzz3`, max(`dep_delay`) OVER (PARTITION BY `year`, `month`, `day`) AS `zzz4`
## FROM (SELECT `year`, `month`, `day`, `dep_delay`
## FROM `flights`) `coaxmtqqbj`) `efznnpuovy`
## WHERE (`dep_delay` = `zzz3` OR `dep_delay` = `zzz4`)
bestworst
## Warning: Missing values are always removed in SQL.
## Use `min(x, na.rm = TRUE)` to silence this warning
## Warning: Missing values are always removed in SQL.
## Use `max(x, na.rm = TRUE)` to silence this warning
## # Source: lazy query [?? x 4]
## # Database: spark_connection
## # Groups: year, month, day
## year month day dep_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 853
## 2 2013 1 1 - 15.0
## 3 2013 1 1 - 15.0
## 4 2013 1 9 1301
## 5 2013 1 9 - 17.0
## 6 2013 1 24 - 15.0
## 7 2013 1 24 329
## 8 2013 1 29 - 27.0
## 9 2013 1 29 235
## 10 2013 2 1 - 15.0
## # ... with more rows
# Rank each flight within each day
ranked <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay)))
dbplyr::sql_render(ranked)
## <SQL> SELECT `year`, `month`, `day`, `dep_delay`, rank() OVER (PARTITION BY `year`, `month`, `day` ORDER BY `dep_delay` DESC) AS `rank`
## FROM (SELECT `year`, `month`, `day`, `dep_delay`
## FROM `flights`) `mauqwkxuam`
ranked
## # Source: lazy query [?? x 5]
## # Database: spark_connection
## # Groups: year, month, day
## year month day dep_delay rank
## <int> <int> <int> <dbl> <int>
## 1 2013 1 1 853 1
## 2 2013 1 1 379 2
## 3 2013 1 1 290 3
## 4 2013 1 1 285 4
## 5 2013 1 1 260 5
## 6 2013 1 1 255 6
## 7 2013 1 1 216 7
## 8 2013 1 1 192 8
## 9 2013 1 1 157 9
## 10 2013 1 1 155 10
## # ... with more rows
Joins
It’s rare that a data analysis involves only a single table of data.
In
practice, you’ll normally have many tables that contribute to an
analysis, and you need flexible tools to combine them.
In dplyr, there
are three families of verbs that work with two tables at a time:
Mutating joins, which add new variables to one table from matching
rows in another.
Filtering joins, which filter observations from one table based on
whether or not they match an observation in the other table.
Set operations, which combine the observations in the data sets as
if they were set elements.
All two-table verbs work similarly.
The first two arguments are x
and y
, and provide the tables to combine.
The output is always a new table
with the same type as x
.
The following statements are equivalent:
flights %>% left_join(airlines)
## Joining, by = "carrier"
## # Source: lazy query [?? x 20]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>, name <chr>
flights %>% left_join(airlines, by = "carrier")
## # Source: lazy query [?? x 20]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>, name <chr>
flights %>% left_join(airlines, by = c("carrier", "carrier"))
## # Source: lazy query [?? x 20]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>, name <chr>
Sampling
You can use sample_n()
and sample_frac()
to take a random sample of
rows: use sample_n()
for a fixed number and sample_frac()
for a
fixed fraction.
sample_n(flights, 10)
## # Source: lazy query [?? x 19]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 8 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>
sample_frac(flights, 0.01)
## # Source: lazy query [?? x 19]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 655 655 0 1021 1030 - 9.00 DL 1415
## 2 2013 1 1 656 700 - 4.00 854 850 4.00 AA 305
## 3 2013 1 1 1044 1045 - 1.00 1231 1212 19.0 EV 4322
## 4 2013 1 1 1056 1059 - 3.00 1203 1209 - 6.00 EV 4479
## 5 2013 1 1 1317 1325 - 8.00 1454 1505 -11.0 MQ 4475
## 6 2013 1 1 1708 1700 8.00 2037 2005 32.0 WN 1066
## 7 2013 1 1 1825 1829 - 4.00 2056 2053 3.00 9E 3286
## 8 2013 1 1 1843 1845 - 2.00 1955 2024 -29.0 DL 904
## 9 2013 1 1 2108 2057 11.0 25 39 -14.0 UA 1517
## 10 2013 1 2 557 605 - 8.00 832 823 9.00 DL 544
## # ... with more rows, and 8 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>
Writing Data
It is often useful to save the results of your analysis or the tables
that you have generated on your Spark cluster into persistent storage.
The best option in many scenarios is to write the table out to a
Parquet file using the
spark_write_parquet
function.
For example:
spark_write_parquet(tbl, "hdfs://hdfs.company.org:9000/hdfs-path/data")
This will write the Spark DataFrame referenced by the tbl R variable to
the given HDFS path.
You can use the
spark_read_parquet
function to read the same table back into a subsequent Spark
session:
tbl <- spark_read_parquet(sc, "data", "hdfs://hdfs.company.org:9000/hdfs-path/data")
You can also write data as CSV or JSON using the
spark_write_csv and
spark_write_json
functions.
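As an illustration (the output paths below are hypothetical variations of the HDFS path used above), the CSV and JSON writers follow the same pattern:
# Write the Spark DataFrame referenced by tbl as CSV and as JSON
spark_write_csv(tbl, "hdfs://hdfs.company.org:9000/hdfs-path/data-csv")
spark_write_json(tbl, "hdfs://hdfs.company.org:9000/hdfs-path/data-json")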
Hive Functions
Many of Hive’s built-in functions (UDF) and built-in aggregate functions
(UDAF) can be called inside dplyr’s mutate and summarize.
The Language
Reference
UDF
page provides the list of available functions.
The following example uses the datediff and current_date Hive
UDFs to calculate the difference between the flight_date and the current
system date:
flights %>%
mutate(flight_date = paste(year,month,day,sep="-"),
days_since = datediff(current_date(), flight_date)) %>%
group_by(flight_date,days_since) %>%
tally() %>%
arrange(-days_since)
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## # Groups: flight_date
## # Ordered by: -days_since
## flight_date days_since n
## <chr> <int> <dbl>
## 1 2013-1-1 1844 842
## 2 2013-1-2 1843 943
## 3 2013-1-3 1842 914
## 4 2013-1-4 1841 915
## 5 2013-1-5 1840 720
## 6 2013-1-6 1839 832
## 7 2013-1-7 1838 933
## 8 2013-1-8 1837 899
## 9 2013-1-9 1836 902
## 10 2013-1-10 1835 932
## # ... with more rows
Spark Machine Learning Library (MLlib)
Overview
sparklyr provides bindings to Spark's distributed machine learning library.
In particular, sparklyr allows you to access the machine learning routines provided by the spark.ml package.
Together with sparklyr's dplyr interface, you can easily create and tune machine learning workflows on Spark, orchestrated entirely within R.
sparklyr provides three families of functions that you can use with Spark machine learning:
Machine learning algorithms for analyzing data (ml_*
)
Feature transformers for manipulating individual features (ft_*
)
Functions for manipulating Spark DataFrames (sdf_*
)
An analytic workflow with sparklyr might be composed of the following stages.
For an example see Example Workflow.
Perform SQL queries through the sparklyr dplyr interface,
Use the sdf_*
and ft_*
family of functions to generate new columns, or partition your data set,
Choose an appropriate machine learning algorithm from the ml_*
family of functions to model your data,
Inspect the quality of your model fit, and use it to make predictions with new data.
Collect the results for visualization and further analysis in R
Algorithms
Spark's machine learning library can be accessed from sparklyr through the ml_*
set of functions, such as ml_linear_regression(), ml_logistic_regression(), ml_kmeans(), ml_pca(), and ml_random_forest().
The ml_*
functions take the arguments response
and features
.
But features
can also be a formula with main effects (it currently does not accept interaction terms).
The intercept term can be omitted by using -1
.
# Equivalent statements
ml_linear_regression(z ~ -1 + x + y)
ml_linear_regression(intercept = FALSE, response = "z", features = c("x", "y"))
Options
The Spark model output can be modified with the ml_options
argument in the ml_*
functions.
The ml_options
argument is an experts-only interface for tweaking the model output.
For example, model.transform
can be used to mutate the Spark model object before the fit is performed.
Transformers
A model is often fit not on a dataset as-is, but instead on some transformation of that dataset.
Spark provides feature transformers, facilitating many common transformations of data within a Spark DataFrame, and sparklyr exposes these within the ft_*
family of functions.
These routines generally take one or more input columns, and generate a new output column formed as a transformation of those columns.
ft_binarizer - Threshold numerical features to binary (0/1) feature
ft_bucketizer - Bucketizer transforms a column of continuous features to a column of feature buckets
ft_discrete_cosine_transform - Transforms a length N real-valued sequence in the time domain into another length N real-valued sequence in the frequency domain
ft_elementwise_product - Multiplies each input vector by a provided weight vector, using element-wise multiplication
ft_index_to_string - Maps a column of label indices back to a column containing the original labels as strings
ft_quantile_discretizer - Takes a column with continuous features and outputs a column with binned categorical features
sql_transformer - Implements the transformations which are defined by a SQL statement
ft_string_indexer - Encodes a string column of labels to a column of label indices
ft_vector_assembler - Combines a given list of columns into a single vector column
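As a quick sketch of the pattern (argument names have varied across sparklyr versions, so the input and output columns are passed positionally here), a binarizer applied to a copy of the iris data might look like this:
# Copy iris to Spark and add a 0/1 column flagging petals wider than 1.0
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
iris_tbl %>%
  ft_binarizer("Petal_Width", "Petal_Width_large", threshold = 1.0)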
Examples
We will use the iris
data set to examine a handful of learning algorithms and transformers.
The iris data set measures attributes for 150 flowers in 3 different species of iris.
library(sparklyr)
## Warning: package 'sparklyr' was built under R version 3.4.3
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sc <- spark_connect(master = "local")
## * Using Spark: 2.1.0
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
iris_tbl
## # Source: table<iris> [?? x 5]
## # Database: spark_connection
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with more rows
K-Means Clustering
Use Spark's K-means clustering to partition a dataset into groups.
K-means clustering partitions points into k
groups, such that the sum of squares from points to the assigned cluster centers is minimized.
kmeans_model <- iris_tbl %>%
select(Petal_Width, Petal_Length) %>%
ml_kmeans(centers = 3)
## * No rows dropped by 'na.omit' call
# print our model fit
kmeans_model
## K-means clustering with 3 clusters
##
## Cluster centers:
## Petal_Width Petal_Length
## 1 1.359259 4.292593
## 2 0.246000 1.462000
## 3 2.047826 5.626087
##
## Within Set Sum of Squared Errors = 31.41289
# predict the associated class
predicted <- sdf_predict(kmeans_model, iris_tbl) %>%
collect
table(predicted$Species, predicted$prediction)
##
## 0 1 2
## setosa 0 50 0
## versicolor 48 0 2
## virginica 6 0 44
# plot cluster membership
sdf_predict(kmeans_model) %>%
collect() %>%
ggplot(aes(Petal_Length, Petal_Width)) +
geom_point(aes(Petal_Width, Petal_Length, col = factor(prediction + 1)),
size = 2, alpha = 0.5) +
geom_point(data = kmeans_model$centers, aes(Petal_Width, Petal_Length),
col = scales::muted(c("red", "green", "blue")),
pch = 'x', size = 12) +
scale_color_discrete(name = "Predicted Cluster",
labels = paste("Cluster", 1:3)) +
labs(
x = "Petal Length",
y = "Petal Width",
title = "K-Means Clustering",
subtitle = "Use Spark.ML to predict cluster membership with the iris dataset."
)
Linear Regression
Use Spark's linear regression to model the linear relationship between a response variable and one or more explanatory variables.
lm_model <- iris_tbl %>%
select(Petal_Width, Petal_Length) %>%
ml_linear_regression(Petal_Length ~ Petal_Width)
## * No rows dropped by 'na.omit' call
iris_tbl %>%
select(Petal_Width, Petal_Length) %>%
collect %>%
ggplot(aes(Petal_Length, Petal_Width)) +
geom_point(aes(Petal_Width, Petal_Length), size = 2, alpha = 0.5) +
geom_abline(aes(slope = coef(lm_model)[["Petal_Width"]],
intercept = coef(lm_model)[["(Intercept)"]]),
color = "red") +
labs(
x = "Petal Width",
y = "Petal Length",
title = "Linear Regression: Petal Length ~ Petal Width",
subtitle = "Use Spark.ML linear regression to predict petal length as a function of petal width."
)
Logistic Regression
Use Spark's logistic regression to perform logistic regression, modeling a binary outcome as a function of one or more explanatory variables.
# Prepare beaver dataset
beaver <- beaver2
beaver$activ <- factor(beaver$activ, labels = c("Non-Active", "Active"))
copy_to(sc, beaver, "beaver")
## # Source: table<beaver> [?? x 4]
## # Database: spark_connection
## day time temp activ
## <dbl> <dbl> <dbl> <chr>
## 1 307 930 36.58 Non-Active
## 2 307 940 36.73 Non-Active
## 3 307 950 36.93 Non-Active
## 4 307 1000 37.15 Non-Active
## 5 307 1010 37.23 Non-Active
## 6 307 1020 37.24 Non-Active
## 7 307 1030 37.24 Non-Active
## 8 307 1040 36.90 Non-Active
## 9 307 1050 36.95 Non-Active
## 10 307 1100 36.89 Non-Active
## # ... with more rows
beaver_tbl <- tbl(sc, "beaver")
glm_model <- beaver_tbl %>%
mutate(binary_response = as.numeric(activ == "Active")) %>%
ml_logistic_regression(binary_response ~ temp)
## * No rows dropped by 'na.omit' call
glm_model
## Call: binary_response ~ temp
##
## Coefficients:
## (Intercept) temp
## -550.52331 14.69184
PCA
Use Spark's Principal Components Analysis (PCA) to perform dimensionality reduction.
PCA is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible.
pca_model <- tbl(sc, "iris") %>%
select(-Species) %>%
ml_pca()
## * No rows dropped by 'na.omit' call
print(pca_model)
## Explained variance:
##
## PC1 PC2 PC3 PC4
## 0.924618723 0.053066483 0.017102610 0.005212184
##
## Rotation:
## PC1 PC2 PC3 PC4
## Sepal_Length -0.36138659 -0.65658877 0.58202985 0.3154872
## Sepal_Width 0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal_Length -0.85667061 0.17337266 -0.07623608 -0.4798390
## Petal_Width -0.35828920 0.07548102 -0.54583143 0.7536574
Random Forest
Use Spark's Random Forest to perform regression or multiclass classification.
rf_model <- iris_tbl %>%
ml_random_forest(Species ~ Petal_Length + Petal_Width, type = "classification")
## * No rows dropped by 'na.omit' call
rf_predict <- sdf_predict(rf_model, iris_tbl) %>%
ft_string_indexer("Species", "Species_idx") %>%
collect
table(rf_predict$Species_idx, rf_predict$prediction)
##
## 0 1 2
## 0 49 1 0
## 1 0 50 0
## 2 0 0 50
SDF Partitioning
Split a Spark DataFrame into training and test datasets.
partitions <- tbl(sc, "iris") %>%
sdf_partition(training = 0.75, test = 0.25, seed = 1099)
fit <- partitions$training %>%
ml_linear_regression(Petal_Length ~ Petal_Width)
## * No rows dropped by 'na.omit' call
estimate_mse <- function(df){
sdf_predict(fit, df) %>%
mutate(resid = Petal_Length - prediction) %>%
summarize(mse = mean(resid ^ 2)) %>%
collect
}
sapply(partitions, estimate_mse)
## $training.mse
## [1] 0.2374596
##
## $test.mse
## [1] 0.1898848
FT String Indexing
Use ft_string_indexer
and ft_index_to_string
to convert a character column into a numeric column and back again.
ft_string2idx <- iris_tbl %>%
ft_string_indexer("Species", "Species_idx") %>%
ft_index_to_string("Species_idx", "Species_remap") %>%
collect
table(ft_string2idx$Species, ft_string2idx$Species_remap)
##
## setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 50 0
## virginica 0 0 50
SDF Mutate
sdf_mutate is provided as a helper function to allow you to use feature transformers.
For example, the previous code snippet could have been written as:
ft_string2idx <- iris_tbl %>%
sdf_mutate(Species_idx = ft_string_indexer(Species)) %>%
sdf_mutate(Species_remap = ft_index_to_string(Species_idx)) %>%
collect
ft_string2idx %>%
select(Species, Species_idx, Species_remap) %>%
distinct
## # A tibble: 3 x 3
## Species Species_idx Species_remap
## <chr> <dbl> <chr>
## 1 setosa 2 setosa
## 2 versicolor 0 versicolor
## 3 virginica 1 virginica
Example Workflow
Let's walk through a simple example to demonstrate the use of Spark's machine learning algorithms within R.
We'll use ml_linear_regression to fit a linear regression model.
Using the built-in mtcars
dataset, we'll try to predict a car's fuel consumption (mpg
) based on its weight (wt
), and the number of cylinders the engine contains (cyl
).
First, we will copy the mtcars
dataset into Spark.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
Transform the data with Spark SQL, feature transformers, and DataFrame functions.
Use Spark SQL to remove all cars with horsepower less than 100
Use Spark feature transformers to bucket cars into two groups based on cylinders
Use Spark DataFrame functions to partition the data into test and training
Then fit a linear model using spark ML.
Model MPG as a function of weight and cylinders.
# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
filter(hp >= 100) %>%
sdf_mutate(cyl8 = ft_bucketizer(cyl, c(0,8,12))) %>%
sdf_partition(training = 0.5, test = 0.5, seed = 888)
# fit a linear model to the training dataset
fit <- partitions$training %>%
ml_linear_regression(mpg ~ wt + cyl)
## * No rows dropped by 'na.omit' call
# summarize the model
summary(fit)
## Call: ml_linear_regression(., mpg ~ wt + cyl)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0947 -1.2747 -0.1129 1.0876 2.2185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.79558 2.67240 12.6462 4.92e-07 ***
## wt -1.59625 0.73729 -2.1650 0.05859 .
## cyl -1.58036 0.49670 -3.1817 0.01115 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-Squared: 0.8267
## Root Mean Squared Error: 1.437
The summary()
output suggests that our model is a fairly good fit, and that both a car's weight and the number of cylinders in its engine are powerful predictors of its average fuel consumption.
(The model suggests that, on average, heavier cars consume more fuel.)
Let's use our Spark model fit to predict the average fuel consumption on our test data set, and compare the predicted response with the true measured fuel consumption.
We'll build a simple ggplot2 plot that will allow us to inspect the quality of our predictions.
# Score the data
pred <- sdf_predict(fit, partitions$test) %>%
collect
# Plot the predicted versus actual mpg
ggplot(pred, aes(x = mpg, y = prediction)) +
geom_abline(lty = "dashed", col = "red") +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
coord_fixed(ratio = 1) +
labs(
x = "Actual Fuel Consumption",
y = "Predicted Fuel Consumption",
title = "Predicted vs.
Actual Fuel Consumption"
)
Although simple, our model appears to do a fairly good job of predicting a car's average fuel consumption.
As you can see, we can easily and effectively combine feature transformers, machine learning algorithms, and Spark DataFrame functions into a complete analysis with Spark and R.
Understanding Spark Caching
Introduction
Spark also supports pulling data sets into a cluster-wide in-memory cache.
This is very useful when data is accessed repeatedly, such as when querying a small dataset or when running an iterative algorithm like random forests.
Since operations in Spark are lazy, caching can help force computation.
Sparklyr tools can be used to cache and uncache DataFrames.
The Spark UI will tell you which DataFrames and what percentages are in memory.
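For reference, caching and uncaching are done with tbl_cache() and tbl_uncache(); here is a minimal sketch, assuming a table named "flights_spark" has already been registered (as it is later in this article):
# Load the registered table into Spark's in-memory cache
tbl_cache(sc, "flights_spark")
# Drop it from the cache when it is no longer needed
tbl_uncache(sc, "flights_spark")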
Using a reproducible example, we will review some of the main configuration settings, commands, and command arguments that can help you get the best out of Spark’s memory management options.
Preparation
Download Test Data
The 2008 and 2007 Flights data from the Statistical Computing site will be used for this exercise.
The spark_read_csv function supports reading CSV files compressed in the bz2 format, so no additional file preparation is needed.
if(!file.exists("2008.csv.bz2"))
{download.file("http://stat-computing.org/dataexpo/2009/2008.csv.bz2", "2008.csv.bz2")}
if(!file.exists("2007.csv.bz2"))
{download.file("http://stat-computing.org/dataexpo/2009/2007.csv.bz2", "2007.csv.bz2")}
Start a Spark session
A local deployment will be used for this example.
library(sparklyr)
library(dplyr)
library(ggplot2)
# Install Spark version 2
spark_install(version = "2.0.0")
# Customize the connection configuration
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "16G"
# Connect to Spark
sc <- spark_connect(master = "local", config = conf, version = "2.0.0")
The Memory Argument
In the spark_read_… functions, the memory argument controls whether the data will be loaded into memory as an RDD.
Setting it to FALSE means that Spark will essentially map the file, but not make a copy of it in memory.
This makes the spark_read_csv command run faster, but the trade off is that any data transformation operations will take much longer.
spark_read_csv(sc, "flights_spark_2008", "2008.csv.bz2", memory = FALSE)
In the RStudio IDE, the flights_spark_2008 table now shows up in the Spark tab.
To access the Spark Web UI, click the SparkUI button in the RStudio Spark Tab.
As expected, the Storage page shows no tables loaded into memory.
Loading Less Data into Memory
Using the pre-processing capabilities of Spark, the data will be transformed before being loaded into memory.
In this section, we will continue to build on the example started in the Spark Read section
Lazy Transform
The following dplyr script is not run immediately, so the code is processed quickly.
Some checks are made, but for the most part it is just building a Spark SQL statement in the background.
flights_table <- tbl(sc,"flights_spark_2008") %>%
mutate(DepDelay = as.numeric(DepDelay),
ArrDelay = as.numeric(ArrDelay),
DepDelay > 15 , DepDelay < 240,
ArrDelay > -60 , ArrDelay < 360,
Gain = DepDelay - ArrDelay) %>%
filter(ArrDelay > 0) %>%
select(Origin, Dest, UniqueCarrier, Distance, DepDelay, ArrDelay, Gain)
Register in Spark
sdf_register will register the resulting Spark SQL query as a table in Spark.
The results will show up as a table called flights_spark.
But a table of the same name is still not loaded into memory in Spark.
sdf_register(flights_table, "flights_spark")
Cache into Memory
The tbl_cache command loads the results into a Spark RDD in memory, so any analysis from there on will not need to re-read and re-transform the original file.
The resulting Spark RDD is smaller than the original file because the transformations created a smaller data set.
tbl_cache(sc, "flights_spark")
Driver Memory
In the Executors page of the Spark Web UI, we can see that the Storage Memory is at about half of the 16 gigabytes requested.
This is mainly because of a Spark setting called spark.memory.fraction, which reserves by default 40% of the memory requested.
Process on the fly
The plan is to read the Flights 2007 file, combine it with the 2008 file and summarize the data without bringing either file fully into memory.
spark_read_csv(sc, "flights_spark_2007" , "2007.csv.bz2", memory = FALSE)
Union and Transform
The union command is akin to the bind_rows dplyr command.
It will allow us to append the 2007 file to the 2008 file, and as with the previous transform, this script will be evaluated lazily.
all_flights <- tbl(sc, "flights_spark_2008") %>%
union(tbl(sc, "flights_spark_2007")) %>%
group_by(Year, Month) %>%
tally()
Collect into R
When receiving a collect command, Spark will execute the SQL statement and send the results back to R in a data frame.
In this case, R only loads 24 observations into a data frame called all_flights.
all_flights <- all_flights %>%
collect()
Plot in R
Now the smaller data set can be plotted:
ggplot(data = all_flights, aes(x = Month, y = n/1000, fill = factor(Year))) +
geom_area(position = "dodge", alpha = 0.5) +
geom_line(alpha = 0.4) +
scale_fill_brewer(palette = "Dark2", name = "Year") +
scale_x_continuous(breaks = 1:12, labels = c("J","F","M","A","M","J","J","A","S","O","N","D")) +
theme_light() +
labs(y="Number of Flights (Thousands)", title = "Number of Flights Year-Over-Year")
Deployment and Configuration
Deployment
There are two well supported deployment modes for sparklyr:
Local — Working on a local desktop typically with smaller/sampled datasets
Cluster — Working directly within or alongside a Spark cluster (standalone, YARN, Mesos, etc.)
Local Deployment
Local mode is an excellent way to learn and experiment with Spark.
Local mode also provides a convenient development environment for analyses, reports, and applications that you plan to eventually deploy to a multi-node Spark cluster.
To work in local mode you should first install a version of Spark for local use.
You can do this using the spark_install function, for example:
sparklyr::spark_install(version = "2.1.0")
To connect to the local Spark instance you pass “local” as the value of the Spark master node to spark_connect:
library(sparklyr)
sc <- spark_connect(master = "local")
For the local development scenario, see the Configuration section below for details on how to have the same code work seamlessly in both development and production environments.
Cluster Deployment
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster).
In this setup, client mode is appropriate.
In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster.
The input and output of the application is attached to the console.
Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
For more information see Submitting Applications.
To use sparklyr with a Spark cluster you should locate your R session on a machine that is either directly on one of the cluster nodes or is close to the cluster (for networking performance).
In the case where R is not running directly on the cluster you should also ensure that the machine has a Spark version and configuration identical to that of the cluster nodes.
The most straightforward way to run R within or near to the cluster is either a remote SSH session or via RStudio Server.
In cluster mode you use the version of Spark already deployed on the cluster node.
This version is located via the SPARK_HOME
environment variable, so you should be sure that this variable is correctly defined on your server before attempting a connection.
This would typically be done within the Renviron.site configuration file.
For example:
SPARK_HOME=/opt/spark/spark-2.0.0-bin-hadoop2.6
To connect, pass the address of the master node to spark_connect, for example:
library(sparklyr)
sc <- spark_connect(master = "spark://local:7077")
For a Hadoop YARN cluster, you can connect using the YARN master, for example:
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
If you are running on EC2 using the Spark EC2 deployment scripts then you can read the master from /root/spark-ec2/cluster-url
, for example:
library(sparklyr)
cluster_url <- system('cat /root/spark-ec2/cluster-url', intern=TRUE)
sc <- spark_connect(master = cluster_url)
Livy Connections
Livy, “An Open Source REST Service for Apache Spark (Apache License)”, is available starting in sparklyr 0.5
as an experimental feature.
Among many scenarios, this enables connections from the RStudio desktop to Apache Spark when Livy is available and correctly configured in the remote cluster.
To work with Livy locally, sparklyr
supports livy_install(), which installs Livy in your local environment; this is similar to spark_install().
Since Livy is a service to enable remote connections into Apache Spark, the service needs to be started with livy_service_start()
.
Once the service is running, spark_connect()
needs to reference the running service and use method = "livy"
, then sparklyr
can be used as usual.
A short example follows:
livy_install()
livy_service_start()
sc <- spark_connect(master = "http://localhost:8998", method = "livy")
copy_to(sc, iris)
spark_disconnect(sc)
livy_service_stop()
Connection Tools
You can view the Spark web UI via the spark_web function, and view the Spark log via the spark_log function:
spark_web(sc)
spark_log(sc)
You can disconnect from Spark using the spark_disconnect function:
spark_disconnect(sc)
Collect
The collect
function transfers data from Spark into R.
The data are collected from a cluster environment and transferred into local R memory.
In the process, all data is first transferred from executor nodes to the driver node.
Therefore, the driver node must have enough memory to collect all the data.
Collecting data on the driver node is relatively slow.
The process also inflates the data as it moves from the executor nodes to the driver node.
Caution should be used when collecting large data.
The following parameters could be adjusted to avoid OutOfMemory and Timeout errors (see the sketch after this list):
spark.executor.heartbeatInterval
spark.network.timeout
spark.driver.extraJavaOptions
spark.driver.memory
spark.yarn.driver.memoryOverhead
spark.driver.maxResultSize
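Here is a minimal sketch of raising a few of these limits through spark_config() before connecting; the values are illustrative, not recommendations:
conf <- spark_config()
# Allow larger results to be returned to the driver
conf$spark.driver.maxResultSize <- "4G"
# Give the driver more memory to hold collected data
conf$spark.driver.memory <- "8G"
# Relax the network timeout for long-running collects
conf$spark.network.timeout <- "600s"
sc <- spark_connect(master = "yarn-client", config = conf)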
Configuration
This section describes the various options available for configuring both the behavior of the sparklyr package as well as the underlying Spark cluster.
Creating multiple configuration profiles (e.g. development, test, production) is also covered.
Config Files
The configuration for a Spark connection is specified via the config
parameter of the spark_connect function.
By default the configuration is established by calling the spark_config function.
This code represents the default behavior:
spark_connect(master = "local", config = spark_config())
By default the spark_config function reads configuration data from a file named config.yml
located in the current working directory (or in parent directories if not located in the working directory).
This file is not required and only need be provided for overriding default behavior.
You can also specify an alternate config file name and/or location.
The config.yml
file is in turn processed using the config package, which enables support for multiple named configuration profiles.
Package Options
There are a number of options available to configure the behavior of the sparklyr package:
For example, this configuration file sets the number of local cores to 4 and the amount of memory allocated for the Spark driver to 4G:
default:
sparklyr.cores.local: 4
sparklyr.shell.driver-memory: 4G
Note that the use of default
will be explained below in Multiple Profiles.
Spark
sparklyr.shell.* - Command line parameters to pass to spark-submit. For example, sparklyr.shell.executor-memory: 20G configures --executor-memory 20G (see the Spark documentation for details on supported options).
Runtime
sparklyr.cores.local - Number of cores to use when running in local mode (defaults to parallel::detectCores).
sparklyr.sparkui.url - Configures the url to the Spark UI web interface when calling spark_web.
sparklyr.defaultPackages - List of default Spark packages to install in the cluster (defaults to “com.databricks:spark-csv_2.11:1.3.0” and “com.amazonaws:aws-java-sdk-pom:1.10.34”).
sparklyr.sanitize.column.names - Allows Spark to automatically rename column names to conform to Spark naming restrictions.
Diagnostics
sparklyr.backend.threads - Number of threads to use in the sparklyr backend to process incoming connections from the sparklyr client.
sparklyr.app.jar - The application jar to be submitted in Spark submit.
sparklyr.ports.file - Path to the ports file used to share connection information with the sparklyr backend.
sparklyr.ports.wait.seconds - Number of seconds to wait for the Spark connection to initialize.
sparklyr.verbose - Provide additional feedback while performing operations. Currently used to communicate which column names are being sanitized by sparklyr.sanitize.column.names.
Spark Options
You can also use config.yml
to specify arbitrary Spark configuration properties:
spark.* |
Configuration settings for the Spark context (applied by creating a SparkConf containing the specified properties).
For example, spark.executor.memory: 1g configures the memory available in each executor (see Spark Configuration for additional options.) |
spark.sql.* |
Configuration settings for the Spark SQL context (applied using SET).
For instance, spark.sql.shuffle.partitions configures the number of partitions to use while shuffling (see the SQL Programming Guide for additional options). |
For example, this configuration file sets a custom scratch directory for Spark and specifies 100 as the number of partitions to use when shuffling data for joins or aggregations:
default:
  spark.local.dir: /tmp/spark-scratch
  spark.sql.shuffle.partitions: 100
User Options
You can also include arbitrary custom user options within the config.yml
file.
These can be named anything you like so long as they do not use either spark
or sparklyr
as a prefix.
For example, this configuration file defines dataset
and sample-size
options:
default:
  dataset: "observations.parquet"
  sample-size: 10000
Multiple Profiles
The config package enables the definition of multiple named configuration profiles for different environments (e.g. default, test, production).
All environments automatically inherit from the default
environment and can optionally also inherit from each other.
For example, you might want to use distinct datasets for development and testing, or custom Spark configuration properties that are only applied when running on a production cluster.
Here’s how that would be expressed in config.yml
:
default:
  dataset: "observations-dev.parquet"
  sample-size: 10000

production:
  spark.memory.fraction: 0.9
  spark.rdd.compress: true
  dataset: "observations.parquet"
  sample-size: null
You can also use this feature to specify distinct Spark master nodes for different environments, for example:
default:
  spark.master: "local"

production:
  spark.master: "spark://local:7077"
With this configuration, you can omit the master
argument entirely from the call to spark_connect:
sc <- spark_connect()
Note that the currently active configuration is determined by the value of the R_CONFIG_ACTIVE
environment variable.
See the config package documentation for additional details.
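For example, a minimal sketch of activating the production profile shown above before connecting:
Sys.setenv(R_CONFIG_ACTIVE = "production")   # select the "production" profile from config.yml
sc <- spark_connect()                        # spark.master is read from the active profile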
Tuning
In general, you will need to tune a Spark cluster for it to perform well.
Spark applications tend to consume a lot of resources.
There are many knobs to control the performance of Yarn and executor (i.e. worker) nodes in a cluster.
Some of the parameters to pay attention to are as follows:
spark.executor.heartbeatInterval
spark.network.timeout
spark.executor.extraJavaOptions
spark.executor.memory
spark.yarn.executor.memoryOverhead
spark.executor.cores
spark.executor.instances (if dynamic allocation is not enabled)
Example Config
Here is an example spark configuration for an EMR cluster on AWS with 1 master and 2 worker nodes.
Each node has 8 vCPUs and 61 GiB of memory.
spark.driver.extraJavaOptions | append -XX:MaxPermSize=30G |
spark.driver.maxResultSize | 0 |
spark.driver.memory | 30G |
spark.yarn.driver.memoryOverhead | 4096 |
spark.yarn.executor.memoryOverhead | 4096 |
spark.executor.memory | 4G |
spark.executor.cores | 2 |
spark.dynamicAllocation.maxExecutors | 15 |
Configuration parameters can be set in the config R object, in the config.yml file, or in spark-defaults.conf.
Configuration in R script
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client", config = config, version = '2.0.0')
Configuration in YAML script
default:
  spark.executor.cores: 2
  spark.executor.memory: 4G
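Configuration in spark-defaults.conf (for reference, a sketch of the equivalent entries; this file typically lives under $SPARK_HOME/conf)
spark.executor.cores   2
spark.executor.memory  4G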
RStudio Server
RStudio Server provides a web-based IDE interface to a remote R session, making it ideal for use as a front-end to a Spark cluster.
This section covers some additional configuration options that are useful for RStudio Server.
Connection Options
The RStudio IDE Spark pane provides a New Connection dialog to assist in connecting with both local instances of Spark and Spark clusters:
You can configure which connection choices are presented using the rstudio.spark.connections
option.
By default, users are presented with the possibility of both local and cluster connections; however, you can modify this behavior to present only one of these, or even a specific Spark master URL.
Some commonly used combinations of connection choices include:
c("local", "cluster") |
Default.
Present connections to both local and cluster Spark instances. |
"local" |
Present only connections to local Spark instances. |
"spark://local:7077" |
Present only a connection to a specific Spark cluster. |
c("spark://local:7077", "cluster") |
Present a connection to a specific Spark cluster and other clusters. |
This option should generally be set within Rprofile.site.
For example:
options(rstudio.spark.connections = "spark://local:7077")
Spark Installations
If you are running within local mode (as opposed to cluster mode) you may want to provide pre-installed Spark version(s) to be shared by all users of the server.
You can do this by installing Spark versions within a shared directory (e.g. /opt/spark
) then designating it as the Spark installation directory.
For example, after installing one or more versions of Spark to /opt/spark
you would add the following to Rprofile.site:
options(spark.install.dir = "/opt/spark")
If this directory is read-only for ordinary users then RStudio will not offer installation of additional versions, which will help guide users to a version that is known to be compatible with versions of Spark deployed on clusters in the same organization.
Distributing R Computations
Overview
sparklyr provides support to run arbitrary R code at scale within your Spark Cluster through spark_apply()
.
This is especially useful when you need functionality that is only available in R or in R packages and is not available in Apache Spark or in Spark packages.
spark_apply()
applies an R function to a Spark object (typically, a Spark DataFrame).
Spark objects are partitioned so they can be distributed across a cluster.
You can use spark_apply
with the default partitions or you can define your own partitions with the group_by
argument.
Your R function must return another data frame (or an object that can be coerced to one). spark_apply
will run your R function on each partition and output a single Spark DataFrame.
Apply an R function to a Spark Object
Let's run a simple example.
We will apply the identity function, I()
, over a list of numbers we created with the sdf_len
function.
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_len(sc, 5, repartition = 1) %>%
spark_apply(function(e) I(e))
## # Source: table<sparklyr_tmp_378c2e4fb50> [?? x 1]
## # Database: spark_connection
## id
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
Your R function should be designed to operate on an R data frame.
The R function passed to spark_apply
expects a DataFrame and will return an object that can be cast as a DataFrame.
We can use the class
function to verify the class of the data.
sdf_len(sc, 10, repartition = 1) %>%
spark_apply(function(e) class(e))
## # Source: table<sparklyr_tmp_378c7ce7618d> [?? x 1]
## # Database: spark_connection
## id
## <chr>
## 1 data.frame
Spark will partition your data by hash or range so it can be distributed across a cluster.
In the following example we create two partitions and count the number of rows in each partition.
Then we print the first record in each partition.
trees_tbl <- sdf_copy_to(sc, trees, repartition = 2)
trees_tbl %>%
spark_apply(function(e) nrow(e), names = "n")
## # Source: table<sparklyr_tmp_378c15c45eb1> [?? x 1]
## # Database: spark_connection
## n
## <int>
## 1 16
## 2 15
trees_tbl %>%
spark_apply(function(e) head(e, 1))
## # Source: table<sparklyr_tmp_378c29215418> [?? x 3]
## # Database: spark_connection
## Girth Height Volume
## <dbl> <dbl> <dbl>
## 1 8.3 70 10.3
## 2 8.6 65 10.3
We can apply any arbitrary function to the partitions in the Spark DataFrame.
For instance, we can scale or jitter the columns.
Notice that spark_apply
applies the R function to all partitions and returns a single DataFrame.
trees_tbl %>%
spark_apply(function(e) scale(e))
## # Source: table<sparklyr_tmp_378c8922ba8> [?? x 3]
## # Database: spark_connection
## Girth Height Volume
## <dbl> <dbl> <dbl>
## 1 -1.4482330 -0.99510521 -1.1503645
## 2 -1.3021313 -2.06675697 -1.1558670
## 3 -0.7469449 0.68891899 -0.6826528
## 4 -0.6592839 -1.60747764 -0.8587325
## 5 -0.6300635 0.53582588 -0.4735581
## 6 -0.5716229 0.38273277 -0.3855183
## 7 -0.5424025 -0.07654655 -0.5395880
## 8 -0.3670805 -0.22963966 -0.6661453
## 9 -0.1040975 1.30129143 0.1427209
## 10 0.1296653 -0.84201210 -0.3029809
## # ... with more rows
trees_tbl %>%
spark_apply(function(e) lapply(e, jitter))
## # Source: table<sparklyr_tmp_378c43237574> [?? x 3]
## # Database: spark_connection
## Girth Height Volume
## <dbl> <dbl> <dbl>
## 1 8.319392 70.04321 10.30556
## 2 8.801237 62.85795 10.21751
## 3 10.719805 81.15618 18.78076
## 4 11.009892 65.98926 15.58448
## 5 11.089322 80.14661 22.58749
## 6 11.309682 79.01360 24.18158
## 7 11.418486 75.88748 21.38380
## 8 11.982421 74.85612 19.09375
## 9 12.907616 84.81742 33.80591
## 10 13.691892 71.05309 25.70321
## # ... with more rows
By default spark_apply()
derives the column names from the input Spark data frame.
Use the names
argument to rename or add new columns.
trees_tbl %>%
spark_apply(
function(e) data.frame(2.54 * e$Girth, e),
names = c("Girth(cm)", colnames(trees)))
## # Source: table<sparklyr_tmp_378c14e015b5> [?? x 4]
## # Database: spark_connection
## `Girth(cm)` Girth Height Volume
## <dbl> <dbl> <dbl> <dbl>
## 1 21.082 8.3 70 10.3
## 2 22.352 8.8 63 10.2
## 3 27.178 10.7 81 18.8
## 4 27.940 11.0 66 15.6
## 5 28.194 11.1 80 22.6
## 6 28.702 11.3 79 24.2
## 7 28.956 11.4 76 21.4
## 8 30.480 12.0 75 19.1
## 9 32.766 12.9 85 33.8
## 10 34.798 13.7 71 25.7
## # ... with more rows
Group By
In some cases you may want to apply your R function to specific groups in your data.
For example, suppose you want to compute regression models against specific subgroups.
To solve this, you can specify a group_by
argument.
This example counts the number of rows in iris
by species and then fits a simple linear model for each species.
iris_tbl <- sdf_copy_to(sc, iris)
iris_tbl %>%
spark_apply(nrow, group_by = "Species")
## # Source: table<sparklyr_tmp_378c1b8155f3> [?? x 2]
## # Database: spark_connection
## Species Sepal_Length
## <chr> <int>
## 1 versicolor 50
## 2 virginica 50
## 3 setosa 50
iris_tbl %>%
spark_apply(
function(e) summary(lm(Petal_Length ~ Petal_Width, e))$r.squared,
names = "r.squared",
group_by = "Species")
## # Source: table<sparklyr_tmp_378c30e6155> [?? x 2]
## # Database: spark_connection
## Species r.squared
## <chr> <dbl>
## 1 versicolor 0.6188467
## 2 virginica 0.1037537
## 3 setosa 0.1099785
Distributing Packages
With spark_apply()
you can use any R package inside Spark.
For instance, you can use the broom package to create a tidy data frame from linear regression output.
spark_apply(
iris_tbl,
function(e) broom::tidy(lm(Petal_Length ~ Petal_Width, e)),
names = c("term", "estimate", "std.error", "statistic", "p.value"),
group_by = "Species")
## # Source: table<sparklyr_tmp_378c5502500b> [?? x 6]
## # Database: spark_connection
## Species term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 versicolor (Intercept) 1.7812754 0.2838234 6.276000 9.484134e-08
## 2 versicolor Petal_Width 1.8693247 0.2117495 8.827999 1.271916e-11
## 3 virginica (Intercept) 4.2406526 0.5612870 7.555230 1.041600e-09
## 4 virginica Petal_Width 0.6472593 0.2745804 2.357267 2.253577e-02
## 5 setosa (Intercept) 1.3275634 0.0599594 22.141037 7.676120e-27
## 6 setosa Petal_Width 0.5464903 0.2243924 2.435422 1.863892e-02
To use R packages inside Spark, your packages must be installed on the worker nodes.
The first time you call spark_apply
all of the contents in your local .libPaths()
will be copied into each Spark worker node via the SparkConf.addFile()
function.
Packages will only be copied once and will persist as long as the connection remains open.
It's not uncommon for R libraries to be several gigabytes in size, so be prepared for a one-time tax while the R packages are copied over to your Spark cluster.
You can disable package distribution by setting packages = FALSE
.
Note: packages are not copied in local mode (master="local"
) because the packages already exist on the system.
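As a minimal sketch, package distribution can be turned off for a single call (reusing the trees_tbl table from the examples above):
trees_tbl %>%
  spark_apply(function(e) nrow(e), names = "n", packages = FALSE)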
Handling Errors
It can be more difficult to troubleshoot R issues in a cluster than in local mode.
For instance, the following R code causes the distributed execution to fail and suggests you check the logs for details.
spark_apply(iris_tbl, function(e) stop("Make this fail"))
Error in force(code) :
sparklyr worker rscript failure, check worker logs for details
In local mode, sparklyr
will retrieve the logs for you.
The logs point out the real failure as ERROR sparklyr: RScript (4190) Make this fail
as you might expect.
---- Output Log ----
(17/07/27 21:24:18 ERROR sparklyr: Worker (2427) is shutting down with exception ,java.net.SocketException: Socket closed)
17/07/27 21:24:18 WARN TaskSetManager: Lost task 0.0 in stage 389.0 (TID 429, localhost, executor driver): 17/07/27 21:27:21 INFO sparklyr: RScript (4190) retrieved 150 rows
17/07/27 21:27:21 INFO sparklyr: RScript (4190) computing closure
17/07/27 21:27:21 ERROR sparklyr: RScript (4190) Make this fail
It is worth mentioning that different cluster providers and platforms expose worker logs in different ways.
Specific documentation for your environment will point out how to retrieve these logs.
Requirements
The R Runtime is expected to be pre-installed in the cluster for spark_apply
to function.
Failure to install R on the cluster will trigger a Cannot run program, no such file or directory
error while attempting to use spark_apply()
.
Contact your cluster administrator to consider making the R runtime available throughout the entire cluster.
A Homogeneous Cluster is required since the driver node distributes, and potentially compiles, packages to the workers.
For instance, the driver and workers must have the same processor architecture, system libraries, etc.
Configuration
The following table describes relevant parameters while making use of spark_apply
.
spark.r.command |
The path to the R binary.
Useful to select from multiple R versions. |
sparklyr.worker.gateway.address |
The gateway address to use under each worker node.
Defaults to sparklyr.gateway.address . |
sparklyr.worker.gateway.port |
The gateway port to use under each worker node.
Defaults to sparklyr.gateway.port . |
For example, one could make use of a specific R version by running:
config <- spark_config()
config[["spark.r.command"]] <- "<path-to-r-version>"
sc <- spark_connect(master = "local", config = config)
sdf_len(sc, 10) %>% spark_apply(function(e) e)
Limitations
Closures
Closures are serialized using serialize
, which is described as "A simple low-level interface for serializing to connections."
One of the current limitations of serialize
is that it won't serialize objects referenced outside of its environment.
For instance, the following function will error out since the closure references external_value
:
external_value <- 1
spark_apply(iris_tbl, function(e) e + external_value)
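One possible workaround, shown here only as a sketch, is to inject the value into the Spark data frame as a column before calling spark_apply(), so the closure no longer references anything outside its environment:
library(dplyr)
external_value <- 1
iris_tbl %>%
  mutate(ev = !!external_value) %>%                        # ship the value as a literal column
  spark_apply(function(e) e$Petal_Length + e$ev, names = "result")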
Livy
Currently, Livy connections do not support distributing packages, since the client machine where the libraries are precompiled might not have the same processor architecture or operating system as the cluster machines.
Computing over Groups
While performing computations over groups, spark_apply()
will partition the data over the selected column; however, this implies that each partition must fit into a worker node, and if it does not, an exception will be thrown.
To perform operations over groups that exceed the resources of a single node, consider partitioning into smaller units or using dplyr::do
, which is currently optimized for large partitions.
Package Installation
Since packages are copied only once for the duration of the spark_connect()
connection, installing additional packages is not supported while the connection is active.
Therefore, if a new package needs to be installed, spark_disconnect()
the connection, modify packages and reconnect.
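A minimal sketch of that workflow (the package name and master value are illustrative):
spark_disconnect(sc)                          # close the active connection
install.packages("broom")                     # install the additional package locally
sc <- spark_connect(master = "yarn-client")   # packages are copied again on the next spark_apply()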
Data Science using a Data Lake
Audience
This article aims to explain how to take advantage of Apache Spark inside organizations that have already implemented, or are in the process of implementing, a Hadoop based Big Data Lake.
Introduction
We have noticed that the types of questions we field after a demo of sparklyr to our customers were more about high-level architecture than how the package works.
To answer those questions, we put together a set of slides that illustrate and discuss important concepts, to help customers see where Spark, R, and sparklyr fit in a Big Data Platform implementation.
In this article, we’ll review those slides and provide a narrative that will help you better envision how you can take advantage of our products.
R for Data Science
It is very important to preface the Use Case review with some background information about where RStudio focuses its efforts when developing packages and products.
Many vendors offer R integration, but in most cases, what this means is that they will add a model built in R to their pipeline or interface, and pass new inputs to that model to generate outputs that can be used in the next step in the pipeline, or in a calculation for the interface.
In contrast, our focus is on the process that happens before that: the discipline that produces the model, meaning Data Science.
In their R for Data Science book, Hadley Wickham and Garrett Grolemund provide a great diagram that nicely illustrates the Data Science process: We import data into memory with R and clean and tidy the data.
Then we go into a cyclical process called understand, which helps us to get to know our data, and hopefully find the answer to the question we started with.
This cycle typically involves making transformations to our tidied data, using the transformed data to fit models, and visualizing results.
Once we find an answer to our question, we then communicate the results.
Data Scientists like using R because it allows them to complete a Data Science project from beginning to end inside the R environment, and in memory.
Hadoop as a Data Source
What happens when the data that needs to be analyzed is very large, like the data sets found in a Hadoop cluster? It would be impossible to fit these in memory, so workarounds are normally used.
Possible workarounds include using a comparatively minuscule data sample, or downloading as much data as possible.
This becomes disruptive to Data Scientists because either the small sample may not be representative, or they have to wait a long time in every iteration of importing a lot of data, exploring a lot of data, and modeling a lot of data.
Spark as an Analysis Engine
We noticed that a very important mental leap to make is to see Spark not just as a gateway to Hadoop (or worse, as an additional data source), but as a computing engine.
As such, it is an excellent vehicle to scale our analytics.
Spark has many capabilities that make it ideal for Data Science in a data lake, such as close integration with Hadoop and Hive, the ability to cache data into memory across multiple nodes, data transformers, and its Machine Learning libraries.
The approach, then, is to push as much compute to the cluster as possible, using R primarily as an interface to Spark for the Data Scientist, which will then collect as few results as possible back into R memory, mostly to visualize and communicate.
As shown in the slide, the more import, tidy, transform and modeling work we can push to Spark, the faster we can analyze very large data sets.
Cluster Setup
Here is an illustration of how R, RStudio, and sparklyr can be added to the YARN managed cluster.
The highlights are:
R, RStudio, and sparklyr need to be installed on one node only, typically an edge node
The Data Scientist can access R, Spark, and the cluster via a web browser by navigating to the RStudio IDE inside the edge node
Considerations
There are some important considerations to keep in mind when combining your Data Lake and R for large scale analytics:
Spark’s Machine Learning libraries may not contain specific models that a Data Scientist needs.
For those cases, workarounds would include using a sparklyr extension like H2O, or collecting a sample of the data into R memory for modeling.
Spark does not have visualization functionality; currently, the best approach is to collect pre-calculated data into R for plotting.
A good way to drastically reduce the number of rows being brought back into memory is to push as much computation as possible to Spark, and return just the results to be plotted.
For example, the bins of a Histogram can be calculated in Spark, so that only the final bucket values would be returned to R for visualization.
Here is sample code for such a scenario: sparkDemos/Histogram
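As a rough sketch of that idea, the bin counts can be computed in Spark with dplyr and only the summarized rows collected into R for plotting; flights_tbl, the arr_delay column, and the 15-minute bin width are assumptions used for illustration:
library(dplyr)
library(ggplot2)

binned <- flights_tbl %>%                        # hypothetical Spark table
  filter(!is.na(arr_delay)) %>%
  mutate(bin = floor(arr_delay / 15) * 15) %>%   # bins computed inside Spark
  count(bin) %>%                                 # one row per bin
  collect()                                      # only the bin totals reach R

ggplot(binned, aes(x = bin, y = n)) + geom_col()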
A particular use case may require a different way of scaling analytics.
We have published an article that provides a very good overview of the options that are available: R for Enterprise: How to Scale Your Analytics Using R
R for Data Science Toolchain with Spark
With sparklyr, the Data Scientist will be able to access the Data Lake’s data, and also gain an additional, very powerful understand layer via Spark.
sparklyr, along with the RStudio IDE and the tidyverse packages, provides the Data Scientist with an excellent toolbox to analyze data, big and small.
Spark ML Pipelines
Spark’s ML Pipelines provide a way to easily combine multiple
transformations and algorithms into a single workflow, or pipeline.
For R users, the insights gathered during the interactive sessions with
Spark can now be converted to a formal pipeline.
This makes the hand-off
from Data Scientists to Big Data Engineers a lot easier, because
there should be no additional changes needed from the latter
group.
The final list of selected variables, data manipulation, feature
transformations and modeling can be easily re-written into a ml_pipeline()
object, saved, and ultimately placed into a Production
environment.
The sparklyr
output of a saved Spark ML Pipeline object
is in Scala code, which means that the code can be added to the
scheduled Spark ML jobs without any dependency on R.
Introduction to ML Pipelines
The official Apache Spark site contains a more complete overview of ML
Pipelines.
This
article will focus on introducing the basic concepts and steps for working
with ML Pipelines via sparklyr
.
There are two important stages in building an ML Pipeline.
The first one
is creating a Pipeline.
A good way to think of it is as
an "empty" pipeline: this step just defines the steps that the data
will go through.
It is roughly equivalent to doing this in R:
r_pipeline <- . %>%
  mutate(cyl = paste0("c", cyl)) %>%
  lm(am ~ cyl + mpg, data = .)
r_pipeline
## Functional sequence with the following components:
##
## 1. mutate(., cyl = paste0("c", cyl))
## 2. lm(am ~ cyl + mpg, data = .)
##
## Use 'functions' to extract the individual functions.
The r_pipeline
object has all the steps needed to transform and fit
the model, but it has not yet transformed any data.
The second step is to pass data through the pipeline, which in turn
will output a fitted model, called a PipelineModel.
The
PipelineModel can then be used to produce predictions.
r_model <- r_pipeline(mtcars)
r_model
##
## Call:
## lm(formula = am ~ cyl + mpg, data = .)
##
## Coefficients:
## (Intercept) cylc6 cylc8 mpg
## -0.54388 0.03124 -0.03313 0.04767
Taking advantage of Pipelines and PipelineModels
The two stage ML Pipeline approach produces two final data products:
A PipelineModel that can be added to the daily Spark jobs which
will produce new predictions for the incoming data, and again, with
no R dependencies.
A Pipeline that can be easily re-fitted on a regular
interval, say every month.
All that is needed is to pass a new
sample to obtain the new coefficients.
Pipeline
An additional goal of this article is that the reader can follow along,
so the data, transformations and Spark connection in this example will
be kept as easy to reproduce as possible.
library(nycflights13)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", spark_version = "2.2.0")
## * Using Spark: 2.2.0
spark_flights <- sdf_copy_to(sc, flights)
Pipelines make heavy use of Feature
Transformers.
If you are new to Spark and sparklyr
, it would be good to review what these
transformers do.
These functions use the Spark API directly to transform
the data, and may be faster at making the data manipulations than a dplyr
(SQL) transformation.
In sparklyr
, the ft
functions are essentially wrappers around the
original Spark feature
transformers.
This example will start with dplyr
transformations, which are
ultimately SQL transformations, loaded into the df
variable.
In sparklyr
, there is one feature transformer that is not available in
Spark, ft_dplyr_transformer()
.
The goal of this function is to convert
the dplyr
code to a SQL Feature Transformer that can then be used in a
Pipeline.
df <- spark_flights %>%
filter(!is.na(dep_delay)) %>%
mutate(
month = paste0("m", month),
day = paste0("d", day)
) %>%
select(dep_delay, sched_dep_time, month, day, distance)
This is the resulting pipeline stage produced from the dplyr
code:
ft_dplyr_transformer(sc, df)
Use the ml_param()
function to extract the “statement” attribute.
That
attribute contains the finalized SQL statement.
Notice that the flights
table name has been replaced with __THIS__
.
This allows the
pipeline to accept different table names as its source, making the
pipeline very modular.
ft_dplyr_transformer(sc, df) %>%
ml_param("statement")
## [1] "SELECT `dep_delay`, `sched_dep_time`, `month`, `day`, `distance`\nFROM (SELECT `year`, CONCAT(\"m\", `month`) AS `month`, CONCAT(\"d\", `day`) AS `day`, `dep_time`, `sched_dep_time`, `dep_delay`, `arr_time`, `sched_arr_time`, `arr_delay`, `carrier`, `flight`, `tailnum`, `origin`, `dest`, `air_time`, `distance`, `hour`, `minute`, `time_hour`\nFROM (SELECT *\nFROM `__THIS__`\nWHERE (NOT(((`dep_delay`) IS NULL)))) `bjbujfpqzq`) `axbwotqnbr`"
Creating the Pipeline
The following step will create a 5 stage pipeline:
SQL transformer - Resulting from the ft_dplyr_transformer()
transformation
Binarizer - To determine if the flight should be considered delayed;
this becomes the eventual outcome variable.
Bucketizer - To split the day into specific hour buckets
R Formula - To define the model’s formula
Logistic Model
flights_pipeline <- ml_pipeline(sc) %>%
ft_dplyr_transformer(
tbl = df
) %>%
ft_binarizer(
input.col = "dep_delay",
output.col = "delayed",
threshold = 15
) %>%
ft_bucketizer(
input.col = "sched_dep_time",
output.col = "hours",
splits = c(400, 800, 1200, 1600, 2000, 2400)
) %>%
ft_r_formula(delayed ~ month + day + hours + distance) %>%
ml_logistic_regression()
Another nice feature of ML Pipelines in sparklyr
is the print-out.
It makes it really easy to see how each stage is set up:
flights_pipeline
## Pipeline (Estimator) with 5 stages
## <pipeline_24044e4f2e21>
## Stages
## |--1 SQLTransformer (Transformer)
## | <dplyr_transformer_2404e6a1b8e>
## | (Parameters -- Column Names)
## |--2 Binarizer (Transformer)
## | <binarizer_24045c9227f2>
## | (Parameters -- Column Names)
## | input_col: dep_delay
## | output_col: delayed
## |--3 Bucketizer (Transformer)
## | <bucketizer_240412366b1e>
## | (Parameters -- Column Names)
## | input_col: sched_dep_time
## | output_col: hours
## |--4 RFormula (Estimator)
## | <r_formula_240442d75f00>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | (Parameters)
## | force_index_label: FALSE
## | formula: delayed ~ month + day + hours + distance
## |--5 LogisticRegression (Estimator)
## | <logistic_regression_24044321ad0>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | prediction_col: prediction
## | probability_col: probability
## | raw_prediction_col: rawPrediction
## | (Parameters)
## | aggregation_depth: 2
## | elastic_net_param: 0
## | family: auto
## | fit_intercept: TRUE
## | max_iter: 100
## | reg_param: 0
## | standardization: TRUE
## | threshold: 0.5
## | tol: 1e-06
Notice that there are no coefficients defined yet.
That's because no
data has actually been processed.
Even though df
references spark_flights
, recall that the final SQL transformer replaces that
table name with __THIS__, so there's no data to process yet.
PipelineModel
A quick partition of the data is created for this exercise.
partitioned_flights <- sdf_partition(
spark_flights,
training = 0.01,
testing = 0.01,
rest = 0.98
)
The ml_fit()
function produces the PipelineModel.
The training
partition of the partitioned_flights
data is used to train the model:
fitted_pipeline <- ml_fit(
flights_pipeline,
partitioned_flights$training
)
fitted_pipeline
## PipelineModel (Transformer) with 5 stages
## <pipeline_24044e4f2e21>
## Stages
## |--1 SQLTransformer (Transformer)
## | <dplyr_transformer_2404e6a1b8e>
## | (Parameters -- Column Names)
## |--2 Binarizer (Transformer)
## | <binarizer_24045c9227f2>
## | (Parameters -- Column Names)
## | input_col: dep_delay
## | output_col: delayed
## |--3 Bucketizer (Transformer)
## | <bucketizer_240412366b1e>
## | (Parameters -- Column Names)
## | input_col: sched_dep_time
## | output_col: hours
## |--4 RFormulaModel (Transformer)
## | <r_formula_240442d75f00>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | (Transformer Info)
## | formula: chr "delayed ~ month + day + hours + distance"
## |--5 LogisticRegressionModel (Transformer)
## | <logistic_regression_24044321ad0>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | prediction_col: prediction
## | probability_col: probability
## | raw_prediction_col: rawPrediction
## | (Transformer Info)
## | coefficient_matrix: num [1, 1:43] 0.709 -0.3401 -0.0328 0.0543 -0.4774 ...
## | coefficients: num [1:43] 0.709 -0.3401 -0.0328 0.0543 -0.4774 ...
## | intercept: num -3.04
## | intercept_vector: num -3.04
## | num_classes: int 2
## | num_features: int 43
## | threshold: num 0.5
Notice that the print-out for the fitted pipeline now displays the
model’s coefficients.
The ml_transform()
function can be used to run predictions; in other
words, it is used instead of predict()
or sdf_predict()
.
predictions <- ml_transform(
fitted_pipeline,
partitioned_flights$testing
)
predictions %>%
group_by(delayed, prediction) %>%
tally()
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## # Groups: delayed
## delayed prediction n
## <dbl> <dbl> <dbl>
## 1      0.         1.   51.
## 2      0.         0. 2599.
## 3      1.         0.  666.
## 4      1.         1.   69.
Save the pipelines to disk
The ml_save()
command can be used to save the Pipeline and
PipelineModel to disk.
The resulting output is a folder with the
selected name, which contains all of the necessary Scala scripts:
ml_save(
flights_pipeline,
"flights_pipeline",
overwrite = TRUE
)
## NULL
ml_save(
fitted_pipeline,
"flights_model",
overwrite = TRUE
)
## NULL
Use an existing PipelineModel
The ml_load()
command can be used to re-load Pipelines and
PipelineModels.
The saved ML Pipeline files can only be loaded into an
open Spark session.
reloaded_model <- ml_load(sc, "flights_model")
A simple query can be used as the table that will provide the data for the
new predictions.
This, of course, does not have to be done in R; at this
point the "flights_model" can be loaded into an independent Spark
session outside of R.
new_df <- spark_flights %>%
filter(
month == 7,
day == 5
)
ml_transform(reloaded_model, new_df)
## # Source: table<sparklyr_tmp_24041e052b5> [?? x 12]
## # Database: spark_connection
## dep_delay sched_dep_time month day distance delayed hours features
## <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <list>
##  1       39.           2359 m7    d5       1617.      1.    4. <dbl [43]>
##  2      141.           2245 m7    d5       2475.      1.    4. <dbl [43]>
##  3        0.            500 m7    d5        529.      0.    0. <dbl [43]>
##  4       -5.            536 m7    d5       1400.      0.    0. <dbl [43]>
##  5       -2.            540 m7    d5       1089.      0.    0. <dbl [43]>
##  6       -7.            545 m7    d5       1416.      0.    0. <dbl [43]>
##  7       -3.            545 m7    d5       1576.      0.    0. <dbl [43]>
##  8       -7.            600 m7    d5       1076.      0.    0. <dbl [43]>
##  9       -7.            600 m7    d5         96.      0.    0. <dbl [43]>
## 10       -6.            600 m7    d5        937.      0.    0. <dbl [43]>
## # ... with more rows, and 4 more variables: label <dbl>,
## #   rawPrediction <list>, probability <list>, prediction <dbl>
Re-fit an existing Pipeline
First, reload the pipeline into an open Spark session:
reloaded_pipeline <- ml_load(sc, "flights_pipeline")
Use ml_fit()
again to pass new data; in this case, sample_frac()
is
used instead of sdf_partition()
to provide the new data.
The idea is that the re-fitting would happen at a later date than when the
model was initially fitted.
new_model <- ml_fit(reloaded_pipeline, sample_frac(spark_flights, 0.01))
new_model
## PipelineModel (Transformer) with 5 stages
## <pipeline_24044e4f2e21>
## Stages
## |--1 SQLTransformer (Transformer)
## | <dplyr_transformer_2404e6a1b8e>
## | (Parameters -- Column Names)
## |--2 Binarizer (Transformer)
## | <binarizer_24045c9227f2>
## | (Parameters -- Column Names)
## | input_col: dep_delay
## | output_col: delayed
## |--3 Bucketizer (Transformer)
## | <bucketizer_240412366b1e>
## | (Parameters -- Column Names)
## | input_col: sched_dep_time
## | output_col: hours
## |--4 RFormulaModel (Transformer)
## | <r_formula_240442d75f00>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | (Transformer Info)
## | formula: chr "delayed ~ month + day + hours + distance"
## |--5 LogisticRegressionModel (Transformer)
## | <logistic_regression_24044321ad0>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | prediction_col: prediction
## | probability_col: probability
## | raw_prediction_col: rawPrediction
## | (Transformer Info)
## | coefficient_matrix: num [1, 1:43] 0.258 0.648 -0.317 0.36 -0.279 ...
## | coefficients: num [1:43] 0.258 0.648 -0.317 0.36 -0.279 ...
## | intercept: num -3.77
## | intercept_vector: num -3.77
## | num_classes: int 2
## | num_features: int 43
## | threshold: num 0.5
The new model can be saved using ml_save()
.
A new name is used in this
case, but the same name as the existing PipelineModel could be used to replace it.
ml_save(new_model, "new_flights_model", overwrite = TRUE)
## NULL
Finally, complete the example by closing the Spark session.
spark_disconnect(sc)
Text mining with Spark & sparklyr
This article focuses on a set of functions that can be used for text mining with Spark and sparklyr
.
The main goal is to illustrate how to perform most of the data preparation and analysis with commands that will run inside the Spark cluster, as opposed to locally in R.
Because of that, the amount of data used will be small.
Data source
For this example, two files will be analyzed: the full works of Sir Arthur Conan Doyle and the full works of Mark Twain.
The files were downloaded from the Gutenberg Project site via the gutenbergr
package.
Intentionally, no data cleanup was done to the files prior to this analysis.
See the appendix below to see how the data was downloaded and prepared.
readLines("arthur_doyle.txt", 10)
## [1] "THE RETURN OF SHERLOCK HOLMES,"
## [2] ""
## [3] "A Collection of Holmes Adventures"
## [4] ""
## [5] ""
## [6] "by Sir Arthur Conan Doyle"
## [7] ""
## [8] ""
## [9] ""
## [10] ""
Data Import
Connect to Spark
An additional goal of this article is to encourage the reader to try it out, so a simple Spark local mode session is used.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.1.0")
spark_read_text()
The spark_read_text()
is a new function which works like readLines()
but for sparklyr
.
It comes in handy when non-structured data, such as lines in a book, is what is available for analysis.
# Imports Mark Twain's file
# Setting up the path to the file in a Windows OS laptop
twain_path <- paste0("file:///", getwd(), "/mark_twain.txt")
twain <- spark_read_text(sc, "twain", twain_path)
# Imports Sir Arthur Conan Doyle's file
doyle_path <- paste0("file:///", getwd(), "/arthur_doyle.txt")
doyle <- spark_read_text(sc, "doyle", doyle_path)
The objective is to end up with a tidy table inside Spark with one row per word used.
The steps will be:
The needed data transformations apply to the data from both authors.
The data sets will be appended to one another
Punctuation will be removed
The words inside each line will be separated, or tokenized
For a cleaner analysis, stop words will be removed
To tidy the data, each word in a line will become its own row
The results will be saved to Spark memory
sdf_bind_rows()
sdf_bind_rows()
appends the doyle
Spark Dataframe to the twain
Spark Dataframe.
This function can be used in lieu of a dplyr::bind_rows()
wrapper function.
For this exercise, the column author
is added to differentiate between the two bodies of work.
all_words <- doyle %>%
mutate(author = "doyle") %>%
sdf_bind_rows({
twain %>%
mutate(author = "twain")}) %>%
filter(nchar(line) > 0)
regexp_replace
The Hive UDF, regexp_replace, is used as a sort of gsub()
that works inside Spark.
In this case it is used to remove punctuation.
The usual [:punct:]
regular expression did not work well during development, so a custom list is provided.
For more information, see the Hive Functions section in the dplyr
page.
all_words <- all_words %>%
mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
ft_tokenizer()
ft_tokenizer()
uses the Spark API to separate each word.
It creates a new list column with the results.
all_words <- all_words %>%
ft_tokenizer(input.col = "line",
output.col = "word_list")
head(all_words, 4)
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## author line word_list
## <chr> <chr> <list>
## 1 doyle THE RETURN OF SHERLOCK HOLMES <list [5]>
## 2 doyle A Collection of Holmes Adventures <list [5]>
## 3 doyle by Sir Arthur Conan Doyle <list [5]>
## 4 doyle CONTENTS <list [1]>
ft_stop_words_remover()
ft_stop_words_remover()
is a new function that, as its name suggests, takes care of removing stop words from the previous transformation.
It expects a list column, so it is important to sequence it correctly after a ft_tokenizer()
command.
In the sample results, notice that the new wo_stop_words
column contains fewer items than word_list
.
all_words <- all_words %>%
ft_stop_words_remover(input.col = "word_list",
output.col = "wo_stop_words")
head(all_words, 4)
## # Source: lazy query [?? x 4]
## # Database: spark_connection
## author line word_list wo_stop_words
## <chr> <chr> <list> <list>
## 1 doyle THE RETURN OF SHERLOCK HOLMES <list [5]> <list [3]>
## 2 doyle A Collection of Holmes Adventures <list [5]> <list [3]>
## 3 doyle by Sir Arthur Conan Doyle <list [5]> <list [4]>
## 4 doyle CONTENTS <list [1]> <list [1]>
explode
The Hive UDF explode performs the job of unnesting the tokens into their own row.
Some further filtering and field selection is done to reduce the size of the dataset.
all_words <- all_words %>%
mutate(word = explode(wo_stop_words)) %>%
select(word, author) %>%
filter(nchar(word) > 2)
head(all_words, 4)
## # Source: lazy query [?? x 2]
## # Database: spark_connection
## word author
## <chr> <chr>
## 1 return doyle
## 2 sherlock doyle
## 3 holmes doyle
## 4 collection doyle
compute()
compute()
will execute this transformation and cache the results in Spark memory.
It is a good idea to pass a name to compute()
to make it easier to identify the table inside the Spark environment.
In this case the name will be all_words.
all_words <- all_words %>%
compute("all_words")
Full code
This is what the code would look like on an actual analysis:
all_words <- doyle %>%
mutate(author = "doyle") %>%
sdf_bind_rows({
twain %>%
mutate(author = "twain")}) %>%
filter(nchar(line) > 0) %>%
mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>%
ft_tokenizer(input.col = "line",
output.col = "word_list") %>%
ft_stop_words_remover(input.col = "word_list",
output.col = "wo_stop_words") %>%
mutate(word = explode(wo_stop_words)) %>%
select(word, author) %>%
filter(nchar(word) > 2) %>%
compute("all_words")
Data Analysis
Words used the most
word_count <- all_words %>%
group_by(author, word) %>%
tally() %>%
arrange(desc(n))
word_count
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## # Groups: author
## # Ordered by: desc(n)
## author word n
## <chr> <chr> <dbl>
## 1 twain one 20028
## 2 doyle upon 16482
## 3 twain would 15735
## 4 doyle one 14534
## 5 doyle said 13716
## 6 twain said 13204
## 7 twain could 11301
## 8 doyle would 11300
## 9 twain time 10502
## 10 doyle man 10478
## # ... with more rows
Words used by Doyle and not Twain
doyle_unique <- filter(word_count, author == "doyle") %>%
anti_join(filter(word_count, author == "twain"), by = "word") %>%
arrange(desc(n)) %>%
compute("doyle_unique")
doyle_unique
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## # Groups: author
## # Ordered by: desc(n), desc(n)
## author word n
## <chr> <chr> <dbl>
## 1 doyle nigel 972
## 2 doyle alleyne 500
## 3 doyle ezra 421
## 4 doyle maude 337
## 5 doyle aylward 336
## 6 doyle catinat 301
## 7 doyle sharkey 281
## 8 doyle lestrade 280
## 9 doyle summerlee 248
## 10 doyle congo 211
## # ... with more rows
doyle_unique %>%
head(100) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9")))
Twain and Sherlock
The word cloud highlighted something interesting.
The word lestrade is listed as one of the words used by Doyle but not Twain.
Lestrade is the last name of a major character in the Sherlock Holmes books.
It makes sense that the word “sherlock” appears considerably more times than “lestrade” in Doyle's books, so why is Sherlock not in the word cloud? Did Mark Twain use the word “sherlock” in his writings?
all_words %>%
filter(author == "twain",
word == "sherlock") %>%
tally()
## # Source: lazy query [?? x 1]
## # Database: spark_connection
## n
## <dbl>
## 1 16
The all_words
table contains 16 instances of the word sherlock in the words used by Twain in his works.
The instr Hive UDF is used to extract the lines that contain that word in the twain
table.
This Hive function can be used instead of base::grep()
or stringr::str_detect()
.
To account for any word capitalization, the lower command will be used in mutate()
to make all of the words in the full text lower case.
instr & lower
Most of these lines are in a short story by Mark Twain called A Double Barrelled Detective Story.
As per the Wikipedia page about this story, this is a satire by Twain on the mystery novel genre, published in 1902.
twain %>%
mutate(line = lower(line)) %>%
filter(instr(line, "sherlock") > 0) %>%
pull(line)
## [1] "late sherlock holmes, and yet discernible by a member of a race charged"
## [2] "sherlock holmes."
## [3] "\"uncle sherlock! the mean luck of it!--that he should come just"
## [4] "another trouble presented itself.
\"uncle sherlock 'll be wanting to talk"
## [5] "flint buckner's cabin in the frosty gloom.
they were sherlock holmes and"
## [6] "\"uncle sherlock's got some work to do, gentlemen, that 'll keep him till"
## [7] "\"by george, he's just a duke, boys! three cheers for sherlock holmes,"
## [8] "he brought sherlock holmes to the billiard-room, which was jammed with"
## [9] "of interest was there--sherlock holmes.
the miners stood silent and"
## [10] "the room; the chair was on it; sherlock holmes, stately, imposing,"
## [11] "\"you have hunted me around the world, sherlock holmes, yet god is my"
## [12] "\"if it's only sherlock holmes that's troubling you, you needn't worry"
## [13] "they sighed; then one said: \"we must bring sherlock holmes.
he can be"
## [14] "i had small desire that sherlock holmes should hang for my deeds, as you"
## [15] "\"my name is sherlock holmes, and i have not been doing anything.\""
## [16] "late sherlock holmes, and yet discernible by a member of a race charged"
spark_disconnect(sc)
Appendix
gutenbergr package
This is an example of how the data for this article was pulled from the Gutenberg site:
library(gutenbergr)
gutenberg_works() %>%
filter(author == "Twain, Mark") %>%
pull(gutenberg_id) %>%
gutenberg_download() %>%
pull(text) %>%
writeLines("mark_twain.txt")
Intro to Spark Streaming with sparklyr
The sparklyr
interface
As stated in the Spark’s official site, Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
Because is part of the Spark API, it is possible to re-use query code that queries the current state of the stream, as well as joining the streaming data with historical data.
Please see Spark’s official documentation for a deeper look into Spark Streaming.
The sparklyr
interface provides the following:
Ability to run dplyr, SQL, spark_apply(), and PipelineModels against a stream
Read in multiple formats: CSV, text, JSON, parquet, Kafka, JDBC, and orc
Write stream results to Spark memory and the following file formats: CSV, text, JSON, parquet, Kafka, JDBC, and orc
An out-of-the box graph visualization to monitor the stream
A new reactiveSpark() function that allows Shiny apps to poll the contents of the stream
Interacting with a stream
A good way to think about how Spark streams update is as a three-stage operation:
Input - Spark reads the data inside a given folder.
The folder is expected to contain multiple data files, with new files being created containing the most current stream data.
Processing - Spark applies the desired operations on top of the data.
These operations could be data manipulations (dplyr
, SQL), data transformations (sdf
operations, PipelineModel predictions), or native R manipulations (spark_apply()
).
Output - The results of processing the input files are saved in a different folder.
As with all of the read and write operations in sparklyr
for Spark Standalone, or in sparklyr
's local mode, the input and output folders are actual OS file system folders.
For Hadoop clusters, these will be folder locations inside the HDFS.
Example 1 - Input/Output
The first intro example is a small script that can be used with a local master.
The result should be that the stream_view()
app shows, live, the number of records processed for each iteration of test data sent to the stream.
library(future)
library(sparklyr)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
stream_generate_test(iterations = 1)
read_folder <- stream_read_csv(sc, "source")
write_output <- stream_write_csv(read_folder, "source-out")
invisible(future(stream_generate_test(interval = 0.5)))
stream_view(write_output)
stream_stop(write_output)
spark_disconnect(sc)
Code breakdown
Open the Spark connection
library(sparklyr)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
Optional step.
This resets the input and output folders.
It makes it easier to run the code multiple times in a clean manner.
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
Produces a single test file inside the “source” folder.
This allows the “read” function to infer CSV file definition.
stream_generate_test(iterations = 1)
list.files("source")
[1] "stream_1.csv"
Points the stream reader to the folder where the streaming files will be placed.
Since it is primed with a single CSV file, it will use that file as the expected layout of subsequent files.
By default, stream_read_csv()
creates a single integer variable data frame.
read_folder <- stream_read_csv(sc, "source")
The output writer is what starts the streaming job.
It will start monitoring the input folder, and then write the new results in the “source-out” folder.
So as new records stream in, new files will be created in the “source-out” folder.
Since there are no operations on the incoming data at this time, the output files will have the same exact raw data as the input files.
The only difference is that the files and sub folders within “source-out” will be structured how Spark structures data folders.
write_output <- stream_write_csv(read_folder, "source-out")
list.files("source-out")
[1] "_spark_metadata" "checkpoint"
[3] "part-00000-1f29719a-2314-40e1-b93d-a647a3d57154-c000.csv"
The test generation function will produce 100 files, one every 0.2 seconds.
To run the tests “out-of-sync” with the current R session, the future
package is used.
library(future)
invisible(future(stream_generate_test(interval = 0.2, iterations = 100)))
The stream_view()
function can be used before the 100 test iterations are complete because of the use of the future
package.
It will monitor the status of the job that write_output
is pointing to and provide information on the amount of data coming into the “source” folder and going out into the “source-out” folder.
stream_view(write_output)
The monitor will continue to run even after the tests are complete.
To end the experiment, stop the Shiny app and then use the following to stop the stream and close the Spark session.
stream_stop(write_output)
spark_disconnect(sc)
Example 2 - Processing
The second example builds on the first.
It adds a processing step that manipulates the input data before saving it to the output folder.
In this case, a new binary field is added indicating if the value from x
is over 400 or not.
This time, run the second code chunk in this example a few times during the stream tests to see the aggregated values change.
library(future)
library(sparklyr)
library(dplyr, warn.conflicts = FALSE)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
stream_generate_test(iterations = 1)
read_folder <- stream_read_csv(sc, "source")
process_stream <- read_folder %>%
mutate(x = as.double(x)) %>%
ft_binarizer(
input_col = "x",
output_col = "over",
threshold = 400
)
write_output <- stream_write_csv(process_stream, "source-out")
invisible(future(stream_generate_test(interval = 0.2, iterations = 100)))
Run this code a few times during the experiment:
spark_read_csv(sc, "stream", "source-out", memory = FALSE) %>%
group_by(over) %>%
tally()
The results would look similar to this.
The n
totals will increase as the experiment progresses.
# Source: lazy query [?? x 2]
# Database: spark_connection
over n
<dbl> <dbl>
1 0 40215
2 1 60006
Clean up after the experiment
stream_stop(write_output)
spark_disconnect(sc)
Code breakdown
The processing starts with the read_folder
variable that contains the input stream.
It coerces the integer field x
to type double.
This is because the next function, ft_binarizer()
does not accept integers.
The binarizer determines if x
is over 400 or not.
This is a good illustration of how dplyr
can help simplify the manipulation needed during the processing stage.
process_stream <- read_folder %>%
mutate(x = as.double(x)) %>%
ft_binarizer(
input_col = "x",
output_col = "over",
threshold = 400
)
The output now needs to write out the processed data instead of the raw input data.
Swap read_folder
with process_stream
.
write_output <- stream_write_csv(process_stream, "source-out")
The “source-out” folder can be treated as a if it was a single table within Spark.
Using spark_read_csv()
, the data can be mapped, but not brought into memory (memory = FALSE
).
This allows the current results to be further analyzed using regular dplyr
commands.
spark_read_csv(sc, "stream", "source-out", memory = FALSE) %>%
group_by(over) %>%
tally()
Example 3 - Aggregate in process and output to memory
Another option is to save the results of the processing into an in-memory Spark table.
Unless intentionally saving it to disk, the table and its data will only exist while the Spark session is active.
The biggest advantage of using Spark memory as the target is that it allows aggregation to happen during processing.
This is an advantage because aggregation is not allowed for any file output, except Kafka, at the input/process stage.
Using example 2 as the base, this example code will perform some aggregations to the current stream input and save only those summarized results into Spark memory:
library(future)
library(sparklyr)
library(dplyr, warn.conflicts = FALSE)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
stream_generate_test(iterations = 1)
read_folder <- stream_read_csv(sc, "source")
process_stream <- read_folder %>%
stream_watermark() %>%
group_by(timestamp) %>%
summarise(
max_x = max(x, na.rm = TRUE),
min_x = min(x, na.rm = TRUE),
count = n()
)
write_output <- stream_write_memory(process_stream, name = "stream")
invisible(future(stream_generate_test()))
Run this command a few times while the experiment is running:
tbl(sc, "stream")
Clean up after the experiment
stream_stop(write_output)
spark_disconnect(sc)
Code breakdown
The stream_watermark()
function adds a new timestamp
variable that is then used in the group_by()
command.
This is required by Spark Stream to accept summarized results as output of the stream.
The second step is to simply decide what kinds of aggregations we need to perform.
In this case, a simple max, min, and count are performed.
process_stream <- read_folder %>%
stream_watermark() %>%
group_by(timestamp) %>%
summarise(
max_x = max(x, na.rm = TRUE),
min_x = min(x, na.rm = TRUE),
count = n()
)
The stream_write_memory()
function is used to write the output to Spark memory.
The results will appear as a table of the Spark session with the name assigned in the name
argument; in this case the name selected is "stream".
write_output <- stream_write_memory(process_stream, name = "stream")
The current data in the "stream" table can be queried using the dplyr
tbl()
command.
tbl(sc, "stream")
Example 4 - Shiny integration
sparklyr
provides a new Shiny function called reactiveSpark()
.
It can take a Spark data frame, in this case the one created as a result of the stream processing, and then create a Spark memory stream table, the same way a table is created in example 3.
library(future)
library(sparklyr)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
stream_generate_test(iterations = 1)
read_folder <- stream_read_csv(sc, "source")
process_stream <- read_folder %>%
stream_watermark() %>%
group_by(timestamp) %>%
summarise(
max_x = max(x, na.rm = TRUE),
min_x = min(x, na.rm = TRUE),
count = n()
)
invisible(future(stream_generate_test(interval = 0.2, iterations = 100)))
library(shiny)
ui <- function(){
tableOutput("table")
}
server <- function(input, output, session){
ps <- reactiveSpark(process_stream)
output$table <- renderTable({
ps() %>%
mutate(timestamp = as.character(timestamp))
})
}
runGadget(ui, server)
Code breakdown
Notice that there is no stream_write_...
command.
The reason is that the reactiveSpark()
function already contains the stream_write_memory()
function.
This very basic Shiny app simply displays the output of a table in the ui
section.
library(shiny)
ui <- function(){
tableOutput("table")
}
In the server
section, the reactiveSpark()
function will update every time there’s a change to the stream and return a data frame.
The results are saved to a variable called ps()
in this script.
Treat the ps()
variable as a regular table that can be piped from, as shown in the example.
In this case, the timestamp
variable is converted to a string to make it easier to read.
server <- function(input, output, session){
ps <- reactiveSpark(process_stream)
output$table <- renderTable({
ps() %>%
mutate(timestamp = as.character(timestamp))
})
}
Use runGadget()
to display the Shiny app in the Viewer pane.
This is optional; the app can be run using the normal Shiny run functions.
runGadget(ui, server)
Example 5 - ML Pipeline Model
This example uses a fitted Pipeline Model to process the input, and saves the predictions to the output.
This approach would be used to apply Machine Learning on top of streaming data.
library(sparklyr)
library(dplyr, warn.conflicts = FALSE)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
df <- data.frame(x = rep(1:1000), y = rep(2:1001))
stream_generate_test(df = df, iteration = 1)
model_sample <- spark_read_csv(sc, "sample", "source")
pipeline <- sc %>%
ml_pipeline() %>%
ft_r_formula(x ~ y) %>%
ml_linear_regression()
fitted_pipeline <- ml_fit(pipeline, model_sample)
ml_stream <- stream_read_csv(
sc = sc,
path = "source",
columns = c(x = "integer", y = "integer")
) %>%
ml_transform(fitted_pipeline, .) %>%
select(- features) %>%
stream_write_csv("source-out")
stream_generate_test(df = df, interval = 0.5)
spark_read_csv(sc, "stream", "source-out", memory = FALSE)
## # Source: spark<stream> [?? x 4]
##        x     y label prediction
##  * <int> <int> <dbl>      <dbl>
##  1   276   277   276       276.
##  2   277   278   277       277.
##  3   278   279   278       278.
##  4   279   280   279       279.
##  5   280   281   280       280.
##  6   281   282   281       281.
##  7   282   283   282       282.
##  8   283   284   283       283.
##  9   284   285   284       284.
## 10   285   286   285       285.
## # ... with more rows
stream_stop(ml_stream)
spark_disconnect(sc)
Code Breakdown
Creates and fits a pipeline
df <- data.frame(x = rep(1:1000), y = rep(2:1001))
stream_generate_test(df = df, iterations = 1)
model_sample <- spark_read_csv(sc, "sample", "source")
pipeline <- sc %>%
ml_pipeline() %>%
ft_r_formula(x ~ y) %>%
ml_linear_regression()
fitted_pipeline <- ml_fit(pipeline, model_sample)
This example pipes the input, processing, and output together in a single code segment.
The ml_transform()
function is used to create the predictions.
Because the CSV format does not support list type fields, the features
column is removed before the results are sent to the output.
ml_stream <- stream_read_csv(
sc = sc,
path = "source",
columns = c(x = "integer", y = "integer")
) %>%
ml_transform(fitted_pipeline, .) %>%
select(- features) %>%
stream_write_csv("source-out")
Using Spark with AWS S3 buckets
AWS Access Keys
AWS Access Keys are needed to access S3 data.
To learn how to set up new keys, please review the AWS documentation: http://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html. We then pass the keys to R via environment variables:
Sys.setenv(AWS_ACCESS_KEY_ID="[Your access key]")
Sys.setenv(AWS_SECRET_ACCESS_KEY="[Your secret access key]")
Connecting to Spark
There are four key settings needed to connect to Spark and use S3:
A Hadoop-AWS package
Executor memory (key but not critical)
The master URL
The Spark Home
To connect to Spark, we first need to initialize a variable with the contents of sparklyr's default config (spark_config()), which we will then customize for our needs:
library(sparklyr)
conf <- spark_config()
Hadoop-AWS package:
A Spark connection can be enhanced by using packages; please note that these are not R packages.
For example, there are packages that tell Spark how to read CSV files, or how to work with Hadoop and Hadoop in AWS.
In order to read S3 buckets, our Spark connection will need a package called hadoop-aws.
If needed, multiple packages can be used; a hypothetical example follows the code below.
We experimented with many combinations of packages, and determined that for reading data in S3 we only need this one.
The version we used, 2.7.3, refers to the latest Hadoop version at the time of writing, so as this article ages, please check this site to ensure that you are using the latest version: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
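If more than one package were needed, a character vector can be assigned to the same setting. The second entry below is only a hypothetical illustration of the syntax, not a package this article requires:
conf$sparklyr.defaultPackages <- c(
  "org.apache.hadoop:hadoop-aws:2.7.3",
  "com.amazonaws:aws-java-sdk-pom:1.10.34"  # hypothetical second package, for illustration only
)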
Executor Memory
As mentioned above, this setting is key but not critical.
There are two points worth highlighting about it:
It is the only performance-related setting in a Spark Stand Alone cluster that can be tweaked, and in most cases, because Spark defaults to a fraction of what is available, we need to increase it by manually passing a value to that setting.
If more than the available RAM is requested, then Spark will set the Cores to 0, thus rendering the session unusable.
conf$spark.executor.memory <- "14g"
Master URL and Spark home
There are three important points to mention when executing the spark_connect command:
The master will be the Spark Master’s URL.
To find the URL, please see the Spark Cluster section.
Point the Spark Home to the location where Spark was installed on this node.
Make sure to pass the conf variable as the value for the config argument.
sc <- spark_connect(master = "spark://ip-172-30-1-5.us-west-2.compute.internal:7077",
spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/",
config = conf)
Data Import/Wrangle approach
We experimented with multiple approaches.
Most of the factors for settling on a recommended approach were made based on the speed of each step.
The premise is that we would rather wait longer during Data Import if it means that we can register and cache our data subsets much faster during Data Wrangling, especially since we would expect to end up with many subsets as we explore and model. The selected combination was the second slowest during the Import stage, but the fastest, by a lot, when caching a subset.
In our tests, it took 72 seconds to read and cache the 29 columns of the 41 million rows of data, the slowest was 77 seconds.
But when it comes to registering and caching a considerably sizable subset of 3 columns and almost all of the 41 million records, this approach was 17X faster than the second fastest approach.
It took 1/3 of a second to register and cache the subset, the second fastest was 5 seconds.
To implement this approach, we need to set three arguments in the spark_read_csv()
step:
memory
infer_schema
columns
Again, this is a recommended approach.
The columns argument is needed only if infer_schema is set to FALSE.
When memory is set to TRUE, Spark loads the entire dataset into memory, and setting infer_schema to FALSE prevents Spark from trying to figure out what the schema of the files is.
By trying different combinations of the memory and infer_schema arguments, you may be able to find an approach that better fits your needs.
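As a sketch of one such alternative, assuming the same S3 path used in the import below, the files can be mapped without caching and with schema inference turned on; this is slower to wrangle later, but convenient for a first look at the column types:
flights_probe <- spark_read_csv(sc, "flights_probe",
                                path = "s3a://flights-data/full",
                                memory = FALSE,        # do not load the data into Spark memory yet
                                infer_schema = TRUE)   # let Spark guess the column types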
The S3 URI scheme
Surprisingly, another critical detail that can easily be overlooked is choosing the right S3 URI scheme.
There are two options: s3n and s3a.
In most examples and tutorials I found, there was no reason given for why or when to use which one.
The article that finally clarified it was this one: https://wiki.apache.org/hadoop/AmazonS3
The gist of it is that s3a is the recommended scheme going forward, especially for Hadoop versions 2.7 and above.
This means that if we copy from older examples that used Hadoop 2.6, we would most likely also use s3n, thus making data import much, much slower.
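For illustration only, using the same hypothetical bucket as the rest of this article, the two schemes differ only in the URI prefix; with the hadoop-aws 2.7.x package the first form is the one to use:
path_s3a <- "s3a://flights-data/full"   # recommended scheme for Hadoop 2.7 and above
path_s3n <- "s3n://flights-data/full"   # older scheme; noticeably slower data imports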
Data Import
After the long introduction in the previous section, there is only one point to add about the following code chunk.
If there are any NA
values in numeric fields, then define the column as character and convert it in later subsets using dplyr.
The data import will fail if it finds any NA values in numeric fields.
This is a small trade-off of this approach; the next fastest one does not have this issue but is 17X slower at caching subsets.
flights <- spark_read_csv(sc, "flights_spark",
path = "s3a://flights-data/full",
memory = TRUE,
columns = list(
Year = "character",
Month = "character",
DayofMonth = "character",
DayOfWeek = "character",
DepTime = "character",
CRSDepTime = "character",
ArrTime = "character",
CRSArrTime = "character",
UniqueCarrier = "character",
FlightNum = "character",
TailNum = "character",
ActualElapsedTime = "character",
CRSElapsedTime = "character",
AirTime = "character",
ArrDelay = "character",
DepDelay = "character",
Origin = "character",
Dest = "character",
Distance = "character",
TaxiIn = "character",
TaxiOut = "character",
Cancelled = "character",
CancellationCode = "character",
Diverted = "character",
CarrierDelay = "character",
WeatherDelay = "character",
NASDelay = "character",
SecurityDelay = "character",
LateAircraftDelay = "character"),
infer_schema = FALSE)
Data Wrangle
There are a few points we need to highlight about the following simple dplyr code:
Because there were NAs in the original fields, we have to mutate them to a number.
Coerce variables to integer instead of numeric where possible; this will save a lot of space when cached to Spark memory.
The sdf_register() command can be piped at the end of the code.
After running the code, a new table will appear in the RStudio IDE's Spark tab.
tidy_flights <- tbl(sc, "flights_spark") %>%
mutate(ArrDelay = as.integer(ArrDelay),
DepDelay = as.integer(DepDelay),
Distance = as.integer(Distance)) %>%
filter(!is.na(ArrDelay)) %>%
select(DepDelay, ArrDelay, Distance) %>%
sdf_register("tidy_spark")
Next, we use tbl_cache() to load the tidy_spark table into Spark memory.
We can see the new table in the Storage page of our Spark session.
tbl_cache(sc, "tidy_spark")
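As a quick sketch of what that buys us, the same dplyr verbs now run against the in-memory tidy_spark table:
tbl(sc, "tidy_spark") %>%
  summarise(mean_arr_delay = mean(ArrDelay, na.rm = TRUE),   # computed inside Spark
            flights = n()) %>%
  collect()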
Using Apache Arrow
Introduction
Apache Arrow is a cross-language development platform for in-memory data.
Arrow is supported starting with sparklyr 1.0.0
to improve performance when transferring data between Spark and R.
You can find some performance benchmarks under:
sparklyr 1.0: Arrow, XGBoost, Broom and TFRecords.
Speeding up R and Apache Spark using Apache Arrow.
Installation
Using Arrow from R requires installing:
The Arrow Runtime: Provides a cross-language runtime library.
The Arrow R Package: Provides support for using Arrow from R through an R package.
Runtime
OS X
Installing on OS X requires Homebrew and executing the following from a terminal:
brew install apache-arrow
Windows
Currently, installing Arrow in Windows requires Conda and executing from a terminal:
conda install arrow-cpp=0.12.* -c conda-forge
conda install pyarrow=0.12.* -c conda-forge
Linux
Please reference arrow.apache.org/install when installing Arrow for Linux.
Package
As of this writing, the arrow R package is not yet available on CRAN; however, this package can be installed using the remotes package.
First, install remotes
:
install.packages("remotes")
Then install the R package from github as follows:
remotes::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.12.0")
If you happen to have Arrow 0.11 installed, you will have to install the R package from the matching commit instead:
remotes::install_github("apache/arrow", subdir = "r", ref = "dc5df8f")
Use Cases
There are three main use cases for arrow
in sparklyr
:
Data Copying: When copying data with copy_to(), Arrow will be used.
Data Collection: When collecting data, either implicitly by printing datasets or explicitly by calling collect(), Arrow will be used.
R Transformations: When using spark_apply(), data will be transferred using Arrow when possible.
To use arrow in sparklyr, one simply needs to attach the package:
library(arrow)
Attaching package: ‘arrow’
The following object is masked from ‘package:utils’:
timestamp
The following objects are masked from ‘package:base’:
array, table
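A minimal sketch of the three use cases, assuming a local connection: once arrow is attached, the same copy_to(), spark_apply(), and collect() calls are used, and the transfer happens over Arrow without any extra arguments:
library(arrow)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)       # data copied to Spark via Arrow
cars_tbl %>%
  spark_apply(function(df) df[df$mpg > 20, ]) %>%       # R function applied on the workers
  collect()                                             # results collected back via Arrow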
Considerations
Types
Some data types are mapped to slightly different (one could argue more correct) types when using Arrow.
For instance, consider collecting 64 bit integers in sparklyr
:
library(sparklyr)
sc <- spark_connect(master = "local")
integer64 <- sdf_len(sc, 2, type = "integer64")
integer64
# Source: spark<?> [?? x 1]
     id
  <dbl>
1     1
2     2
Notice that sparklyr
collects 64 bit integers as double
; however, using arrow
:
library(arrow)
integer64
# Source: spark<?> [?? x 1]
               id
  <S3: integer64>
1               1
2               2
64 bit integers are now being collected as proper 64 bit integers using the bit64 package.
Fallback
The Arrow R package supports many data types; however, in cases where a type is unsupported, sparklyr
will fall back to not using Arrow and print a warning.
library(sparklyr.nested)
library(sparklyr)
library(dplyr)
library(arrow)
sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars)
sdf_nest(cars, hp) %>%
group_by(cyl) %>%
summarize(data = collect_list(data))
# Source: spark<?> [?? x 2]
cyl data
<dbl> <list>
1 6 <list [7]>
2 4 <list [11]>
3 8 <list [14]>
Warning message:
In arrow_enabled_object.spark_jobj(sdf) :
Arrow disabled due to columns: data
Creating Extensions for sparklyr
Introduction
The sparklyr package provides a dplyr interface to Spark DataFrames as well as an R interface to Spark’s distributed machine learning pipelines.
However, since Spark is a general-purpose cluster computing system there are many other R interfaces that could be built (e.g. interfaces to custom machine learning pipelines, interfaces to 3rd party Spark packages, etc.).
The facilities used internally by sparklyr for its dplyr and machine learning interfaces are available to extension packages.
This guide describes how you can use these tools to create your own custom R interfaces to Spark.
Examples
Here’s an example of an extension function that calls the text file line counting function available via the SparkContext:
library(sparklyr)
count_lines <- function(sc, file) {
spark_context(sc) %>%
invoke("textFile", file, 1L) %>%
invoke("count")
}
The count_lines
function takes a spark_connection
(sc
) argument which enables it to obtain a reference to the SparkContext
object, and in turn call the textFile().count()
method.
You can use this function with an existing sparklyr connection as follows:
library(sparklyr)
sc <- spark_connect(master = "local")
count_lines(sc, "hdfs://path/data.csv")
Here are links to some additional examples of extension packages:
spark.sas7bdat | Read in SAS data in parallel into Apache Spark.
rsparkling | Extension for using H2O machine learning algorithms against Spark Data Frames.
sparkhello | Simple example of including a custom JAR file within an extension package.
rddlist | Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
sparkwarc | Load WARC files into Apache Spark with sparklyr.
sparkavro | Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro.
crassy | Connect to Cassandra with sparklyr using the Spark-Cassandra-Connector.
sparklygraphs | R interface for GraphFrames which aims to provide the functionality of GraphX.
sparklyr.nested | Extension for working with nested data.
sparklyudf | Simple example registering a Scala UDF within an extension package.
Core Types
Three classes are defined for representing the fundamental types of the R to Java bridge: spark_connection, spark_jobj, and spark_dataframe.
S3 methods are defined for each of these classes so they can be easily converted to or from objects that contain or wrap them.
Note that for any given spark_jobj it's possible to discover the underlying spark_connection.
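A small sketch of that idea, assuming an existing connection sc: any spark_jobj carries its connection, which spark_connection() can recover for further invoke() calls:
big_int <- invoke_new(sc, "java.math.BigInteger", "42")  # a spark_jobj
sc_from_jobj <- spark_connection(big_int)                # the connection the jobj was created on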
Calling Spark from R
There are several functions available for calling the methods of Java objects and static methods of Java classes: invoke(), invoke_new(), and invoke_static().
For example, to create a new instance of the java.math.BigInteger
class and then call the longValue()
method on it you would use code like this:
billionBigInteger <- invoke_new(sc, "java.math.BigInteger", "1000000000")
billion <- invoke(billionBigInteger, "longValue")
Note the sc
argument: that’s the spark_connection
object which is provided by the front-end package (e.g. sparklyr).
The previous example can be re-written to be more compact and clear using magrittr pipes:
billion <- sc %>%
invoke_new("java.math.BigInteger", "1000000000") %>%
invoke("longValue")
This syntax is similar to the method-chaining syntax often used with Scala code so is generally preferred.
Calling a static method of a class is also straightforward.
For example, to call the Math::hypot()
static function you would use this code:
hypot <- sc %>%
invoke_static("java.lang.Math", "hypot", 10, 20)
Wrapper Functions
Creating an extension typically consists of writing R wrapper functions for a set of Spark services.
In this section we’ll describe the typical form of these functions as well as how to handle special types like Spark DataFrames.
Here’s the wrapper function for textFile().count()
which we defined earlier:
count_lines <- function(sc, file) {
spark_context(sc) %>%
invoke("textFile", file, 1L) %>%
invoke("count")
}
The count_lines
function takes a spark_connection
(sc
) argument which enables it to obtain a reference to the SparkContext
object, and in turn call the textFile().count()
method.
The following functions are useful for implementing wrapper functions of various kinds:
spark_connection | Get the Spark connection associated with an object (S3)
spark_jobj | Get the Spark jobj associated with an object (S3)
spark_dataframe | Get the Spark DataFrame associated with an object (S3)
spark_context | Get the SparkContext for a spark_connection
hive_context | Get the HiveContext for a spark_connection
spark_version | Get the version of Spark (as a numeric_version) for a spark_connection
The use of these functions is illustrated in this simple example:
analyze <- function(x, features) {
# normalize whatever we were passed (e.g. a dplyr tbl) into a DataFrame
df <- spark_dataframe(x)
# get the underlying connection so we can create new objects
sc <- spark_connection(df)
# create an object to do the analysis and call its `analyze` and `summary`
# methods (note that the df and features are passed to the analyze function)
summary <- sc %>%
invoke_new("com.example.tools.Analyzer") %>%
invoke("analyze", df, features) %>%
invoke("summary")
# return the results
summary
}
The first argument is an object that can be accessed using the Spark DataFrame API (this might be an actual reference to a DataFrame or could rather be a dplyr tbl
which has a DataFrame reference inside it).
After using the spark_dataframe
function to normalize the reference, we extract the underlying Spark connection associated with the data frame using the spark_connection
function.
Finally, we create a new Analyzer
object, call it’s analyze
method with the DataFrame and list of features, and then call the summary
method on the results of the analysis.
Accepting a spark_jobj
or spark_dataframe
as the first argument of a function makes it very easy to incorporate into magrittr pipelines so this pattern is highly recommended when possible.
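For example, here is a minimal sketch of that pattern: count_rows() below is a hypothetical helper (not part of sparklyr) whose first argument is the data object, so it slots naturally at the end of a pipeline:
library(sparklyr)
library(dplyr)
count_rows <- function(x) {
  spark_dataframe(x) %>%   # normalize a tbl or DataFrame reference
    invoke("count")        # call the DataFrame's count() method
}
# usage, assuming an existing connection sc:
# copy_to(sc, mtcars) %>% filter(cyl == 8) %>% count_rows()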
Dependencies
When creating R packages which implement interfaces to Spark you may need to include additional dependencies.
Your dependencies might be a set of Spark Packages or might be a custom JAR file.
In either case, you’ll need a way to specify that these dependencies should be included during the initialization of a Spark session.
A Spark dependency is defined using the spark_dependency
function:
spark_dependency | Define a Spark dependency consisting of JAR files and Spark packages
Your extension package can specify its dependencies by implementing a function named spark_dependencies
within the package (this function should not be publicly exported).
For example, let’s say you were creating an extension package named sparkds that needs to include a custom JAR as well as the Redshift and Apache Avro packages:
spark_dependencies <- function(spark_version, scala_version, ...) {
spark_dependency(
jars = c(
system.file(
sprintf("java/sparkds-%s-%s.jar", spark_version, scala_version),
package = "sparkds"
)
),
packages = c(
sprintf("com.databricks:spark-redshift_%s:0.6.0", scala_version),
sprintf("com.databricks:spark-avro_%s:2.0.1", scala_version)
)
)
}
.onLoad <- function(libname, pkgname) {
sparklyr::register_extension(pkgname)
}
The spark_version
argument is provided so that a package can support multiple Spark versions for its JARs.
Note that the argument will include just the major and minor versions (e.g. 1.6
or 2.0
) and will not include the patch level (as JARs built for a given major/minor version are expected to work for all patch levels).
The scala_version
argument is provided so that a single package can support multiple Scala compiler versions for its JARs and packages (currently Spark 1.6 downloadable binaries are compiled with Scala 2.10 and Spark 2.0 downloadable binaries are compiled with Scala 2.11).
The ...
argument is unused but nevertheless should be included to ensure compatibility if new arguments are added to spark_dependencies
in the future.
The .onLoad
function registers your extension package so that its spark_dependencies
function will be automatically called when new connections to Spark are made via spark_connect
:
library(sparklyr)
library(sparkds)
sc <- spark_connect(master = "local")
Compiling JARs
The sparklyr package includes a utility function (compile_package_jars
) that will automatically compile a JAR file from your Scala source code for the required permutations of Spark and Scala compiler versions.
To use the function just invoke it from the root directory of your R package as follows:
sparklyr::compile_package_jars()
Note that a prerequisite to calling compile_package_jars
is the installation of the Scala 2.10 and 2.11 compilers to one of the following paths:
/opt/scala
/opt/local/scala
/usr/local/scala
~/scala (Windows-only)
See the sparkhello repository for a complete example of including a custom JAR within an extension package.
CRAN
When including a JAR file within an R package distributed on CRAN, you should follow the guidelines provided in Writing R Extensions:
Java code is a special case: except for very small programs, .java files should be byte-compiled (to a .class file) and distributed as part of a .jar file: the conventional location for the .jar file(s) is inst/java
.
It is desirable (and required under an Open Source license) to make the Java source files available: this is best done in a top-level java
directory in the package – the source files should not be installed.
Data Types
The ensure_*
family of functions can be used to enforce specific data types that are passed to a Spark routine.
For example, Spark routines that require an integer will not accept an R numeric element.
Use these functions to ensure certain parameters are scalar integers, scalar doubles, and so on; a brief sketch follows the list below.
ensure_scalar_integer
ensure_scalar_double
ensure_scalar_boolean
ensure_scalar_character
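Here is a brief sketch using a hypothetical wrapper: the number of partitions supplied by the caller is validated (and coerced if safe) before it is handed to a Java method that expects an Int:
library(sparklyr)
count_lines_checked <- function(sc, file, partitions = 1) {
  partitions <- ensure_scalar_integer(partitions)  # errors unless a single integer-like value
  spark_context(sc) %>%
    invoke("textFile", file, partitions) %>%
    invoke("count")
}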
In order to match the correct data types while calling Scala code from R, or retrieving results from Scala back to R, consider the following types mapping:
From R | Scala | To R
NULL | void | NULL
integer | Int | integer
character | String | character
logical | Boolean | logical
double | Double | double
numeric | Double | double
 | Float | double
 | Decimal | double
 | Long | double
raw | Array[Byte] | raw
Date | Date | Date
POSIXlt | Time |
POSIXct | Time | POSIXct
list | Array[T] | list
environment | Map[String, T] |
jobj | Object | jobj
Compiling
Most Spark extensions won’t need to define their own compilation specification, and can instead rely on the default behavior of compile_package_jars
.
For users who would like to take more control over where the scalac compilers should be looked up, use the spark_compilation_spec
function.
The Spark compilation specification is used when compiling Spark extension Java Archives, and defines which versions of Spark, as well as which versions of Scala, should be used for compilation.
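As a hedged sketch of that option, with the argument names taken from the spark_compilation_spec() help page and the paths and JAR name purely hypothetical, a single Spark/Scala pairing can be compiled instead of the default matrix:
spec <- sparklyr::spark_compilation_spec(
  spark_version = "2.4.0",                         # assumed target Spark version
  scalac_path   = "/usr/local/scala/bin/scalac",   # assumed scalac location
  jar_name      = "sparkds-2.4-2.11.jar"           # hypothetical output JAR name
)
sparklyr::compile_package_jars(spec = spec)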
Sparkling Water (H2O) Machine Learning
Overview
The rsparkling extension package provides bindings to H2O's distributed machine learning algorithms via sparklyr.
In particular, rsparkling allows you to access the machine learning routines provided by the Sparkling Water Spark package.
Together with sparklyr's dplyr interface, you can easily create and tune H2O machine learning workflows on Spark, orchestrated entirely within R.
rsparkling provides a few simple conversion functions that allow the user to transfer data between Spark DataFrames and H2O Frames.
Once the Spark DataFrames are available as H2O Frames, the h2o R interface can be used to train H2O machine learning algorithms on the data.
A typical machine learning pipeline with rsparkling might be composed of the following stages.
To fit a model, you might need to:
Perform SQL queries through the sparklyr dplyr interface,
Use the sdf_*
and ft_*
family of functions to generate new columns, or partition your data set,
Convert your training, validation and/or test data frames into H2O Frames using the as_h2o_frame
function,
Choose an appropriate H2O machine learning algorithm to model your data,
Inspect the quality of your model fit, and use it to make predictions with new data.
Installation
You can install the rsparkling package from CRAN as follows:
install.packages("rsparkling")
Then set the Sparkling Water version for rsparkling:
options(rsparkling.sparklingwater.version = "2.1.14")
For Spark 2.0.x
set rsparkling.sparklingwater.version
to 2.0.3
instead, for Spark 1.6.2
use 1.6.8
.
Using H2O
Now let's walk through a simple example to demonstrate the use of H2O's machine learning algorithms within R.
We'll use h2o.glm to fit a linear regression model.
Using the built-in mtcars
dataset, we'll try to predict a car's fuel consumption (mpg
) based on its weight (wt
), and the number of cylinders the engine contains (cyl
).
First, we will initialize a local Spark connection, and copy the mtcars
dataset into Spark.
library(rsparkling)
library(sparklyr)
library(h2o)
library(dplyr)
sc <- spark_connect("local", version = "2.1.0")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
Now, let's perform some simple transformations – we'll
Remove all cars with horsepower less than 100,
Produce a column encoding whether a car has 8 cylinders or not,
Partition the data into separate training and test data sets,
Fit a model to our training data set,
Evaluate our predictive performance on our test dataset.
# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
filter(hp >= 100) %>%
mutate(cyl8 = cyl == 8) %>%
sdf_partition(training = 0.5, test = 0.5, seed = 1099)
Now, we convert our training and test sets into H2O Frames using rsparkling conversion functions.
We have already split the data into training and test frames using dplyr.
training <- as_h2o_frame(sc, partitions$training, strict_version_check = FALSE)
test <- as_h2o_frame(sc, partitions$test, strict_version_check = FALSE)
Alternatively, we can use the h2o.splitFrame()
function instead of sdf_partition()
to partition the data within H2O instead of Spark (e.g. partitions <- h2o.splitFrame(as_h2o_frame(mtcars_tbl), 0.5)
)
# fit a linear model to the training dataset
glm_model <- h2o.glm(x = c("wt", "cyl"),
y = "mpg",
training_frame = training,
lambda_search = TRUE)
For linear regression models produced by H2O, we can use either print()
or summary()
to learn a bit more about the quality of our fit.
The summary()
method returns some extra information about scoring history and variable importance.
glm_model
## Model Details:
## ==============
##
## H2ORegressionModel: glm
## Model ID: GLM_model_R_1510348062048_1
## GLM Model: summary
## family link regularization
## 1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.05468 )
## lambda_search
## 1 nlambda = 100, lambda.max = 5.4682, lambda.min = 0.05468, lambda.1se = -1.0
## number_of_predictors_total number_of_active_predictors
## 1 2 2
## number_of_iterations training_frame
## 1 100 frame_rdd_32_929e407384e0082416acd4c9897144a0
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept 32.997281 16.625000
## 2 cyl -0.906688 -1.349195
## 3 wt -2.712562 -2.282649
##
## H2ORegressionMetrics: glm
## ** Reported on training data. **
##
## MSE: 2.03293
## RMSE: 1.425808
## MAE: 1.306314
## RMSLE: 0.08238032
## Mean Residual Deviance : 2.03293
## R^2 : 0.8265696
## Null Deviance :93.775
## Null D.o.F. :7
## Residual Deviance :16.26344
## Residual D.o.F. :5
## AIC :36.37884
The output suggests that our model is a fairly good fit, and that both a car's weight, as well as the number of cylinders in its engine, are powerful predictors of its average fuel consumption.
(The model suggests that, on average, heavier cars consume more fuel.)
Let's use our H2O model fit to predict the average fuel consumption on our test data set, and compare the predicted response with the true measured fuel consumption.
We'll build a simple ggplot2 plot that will allow us to inspect the quality of our predictions.
library(ggplot2)
# compute predicted values on our test dataset
pred <- h2o.predict(glm_model, newdata = test)
# convert from H2O Frame to Spark DataFrame
predicted <- as_spark_dataframe(sc, pred, strict_version_check = FALSE)
# extract the true 'mpg' values from our test dataset
actual <- partitions$test %>%
select(mpg) %>%
collect() %>%
`[[`("mpg")
# produce a data.frame housing our predicted + actual 'mpg' values
data <- data.frame(
predicted = predicted,
actual = actual
)
# a bug in data.frame does not set colnames properly; reset here
names(data) <- c("predicted", "actual")
# plot predicted vs. actual values
ggplot(data, aes(x = actual, y = predicted)) +
geom_abline(lty = "dashed", col = "red") +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
coord_fixed(ratio = 1) +
labs(
x = "Actual Fuel Consumption",
y = "Predicted Fuel Consumption",
title = "Predicted vs. Actual Fuel Consumption"
)
Although simple, our model appears to do a fairly good job of predicting a car's average fuel consumption.
As you can see, we can easily and effectively combine dplyr data transformation pipelines with the machine learning algorithms provided by H2O's Sparkling Water.
Algorithms
Once the H2OContext
is made available to Spark (as demonstrated below), all of the functions in the standard h2o R interface can be used with H2O Frames (converted from Spark DataFrames).
Here is a table of the available algorithms:
Additionally, the h2oEnsemble R package can be used to generate Super Learner ensembles of H2O algorithms:
A model is often fit not on a dataset as-is, but instead on some transformation of that dataset.
Spark provides feature transformers, facilitating many common transformations of data within a Spark DataFrame, and sparklyr exposes these within the ft_*
family of functions.
Transformers can be used on Spark DataFrames, and the final training set can be sent to the H2O cluster for machine learning; a short sketch of this pattern follows the table below.
ft_binarizer | Threshold numerical features to binary (0/1) feature
ft_bucketizer | Bucketizer transforms a column of continuous features to a column of feature buckets
ft_discrete_cosine_transform | Transforms a length N real-valued sequence in the time domain into another length N real-valued sequence in the frequency domain
ft_elementwise_product | Multiplies each input vector by a provided weight vector, using element-wise multiplication
ft_index_to_string | Maps a column of label indices back to a column containing the original labels as strings
ft_quantile_discretizer | Takes a column with continuous features and outputs a column with binned categorical features
ft_sql_transformer | Implements the transformations which are defined by a SQL statement
ft_string_indexer | Encodes a string column of labels to a column of label indices
ft_vector_assembler | Combines a given list of columns into a single vector column
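As a short sketch of that pattern (not from the original article), a Spark feature transformer is applied to the Spark DataFrame first, and only the transformed result is converted into an H2O Frame for modeling:
binarized_tbl <- mtcars_tbl %>%
  ft_binarizer("hp", "big_hp", threshold = 100)    # new 0/1 column created in Spark
binarized_hf <- as_h2o_frame(sc, binarized_tbl, strict_version_check = FALSE)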
Examples
We will use the iris
data set to examine a handful of learning algorithms and transformers.
The iris data set measures attributes for 150 flowers in 3 different species of iris.
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
iris_tbl
## # Source: table<iris> [?? x 5]
## # Database: spark_connection
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with more rows
Convert to an H2O Frame:
iris_hf <- as_h2o_frame(sc, iris_tbl, strict_version_check = FALSE)
K-Means Clustering
Use H2O's K-means clustering to partition a dataset into groups.
K-means clustering partitions points into k
groups, such that the sum of squares from points to the assigned cluster centers is minimized.
kmeans_model <- h2o.kmeans(training_frame = iris_hf,
x = 3:4,
k = 3,
seed = 1)
To look at particular metrics of the K-means model, we can use h2o.centroid_stats()
and h2o.centers()
or simply print out all the model metrics using print(kmeans_model)
.
# print the cluster centers
h2o.centers(kmeans_model)
## petal_length petal_width
## 1 1.462000 0.24600
## 2 5.566667 2.05625
## 3 4.296154 1.32500
# print the centroid statistics
h2o.centroid_stats(kmeans_model)
## Centroid Statistics:
## centroid size within_cluster_sum_of_squares
## 1 1 50.00000 1.41087
## 2 2 48.00000 9.29317
## 3 3 52.00000 7.20274
PCA
Use H2O's Principal Components Analysis (PCA) to perform dimensionality reduction.
PCA is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible.
pca_model <- h2o.prcomp(training_frame = iris_hf,
x = 1:4,
k = 4,
seed = 1)
## Warning in doTryCatch(return(expr), name, parentenv, handler): _train:
## Dataset used may contain fewer number of rows due to removal of rows with
## NA/missing values.
If this is not desirable, set impute_missing argument in
## pca call to TRUE/True/true/...
depending on the client language.
pca_model
## Model Details:
## ==============
##
## H2ODimReductionModel: pca
## Model ID: PCA_model_R_1510348062048_3
## Importance of components:
## pc1 pc2 pc3 pc4
## Standard deviation 7.861342 1.455041 0.283531 0.154411
## Proportion of Variance 0.965303 0.033069 0.001256 0.000372
## Cumulative Proportion 0.965303 0.998372 0.999628 1.000000
##
##
## H2ODimReductionMetrics: pca
##
## No model metrics available for PCA
Random Forest
Use H2O's Random Forest to perform regression or classification on a dataset.
We will continue to use the iris dataset as an example for this problem.
As usual, we define the response and predictor variables using the x
and y
arguments.
Since we'd like to do a classification, we need to ensure that the response column is encoded as a factor (enum) column.
y <- "Species"
x <- setdiff(names(iris_hf), y)
iris_hf[,y] <- as.factor(iris_hf[,y])
We can split the iris_hf
H2O Frame into a train and test set (the split defaults to 75⁄25 train/test).
splits <- h2o.splitFrame(iris_hf, seed = 1)
Then we can train a Random Forest model:
rf_model <- h2o.randomForest(x = x,
y = y,
training_frame = splits[[1]],
validation_frame = splits[[2]],
nbins = 32,
max_depth = 5,
ntrees = 20,
seed = 1)
Since we passed a validation frame, the validation metrics will be calculated.
We can retrieve individual metrics using functions such as h2o.mse(rf_model, valid = TRUE)
.
The confusion matrix can be printed using the following:
h2o.confusionMatrix(rf_model, valid = TRUE)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## setosa versicolor virginica Error Rate
## setosa 7 0 0 0.0000 = 0 / 7
## versicolor 0 13 0 0.0000 = 0 / 13
## virginica 0 1 10 0.0909 = 1 / 11
## Totals 7 14 10 0.0323 = 1 / 31
To view the variable importance computed from an H2O model, you can use either the h2o.varimp()
or h2o.varimp_plot()
functions:
h2o.varimp_plot(rf_model)
Gradient Boosting Machine
The Gradient Boosting Machine (GBM) is one of H2O's most popular algorithms, as it works well on many types of data.
We will continue to use the iris dataset as an example for this problem.
Using the same dataset and x
and y
from above, we can train a GBM:
gbm_model <- h2o.gbm(x = x,
y = y,
training_frame = splits[[1]],
validation_frame = splits[[2]],
ntrees = 20,
max_depth = 3,
learn_rate = 0.01,
col_sample_rate = 0.7,
seed = 1)
Since this is a multi-class problem, we may be interested in inspecting the confusion matrix on a hold-out set.
Since we passed along a validation_frame at train time, the validation metrics are already computed and we just need to retrieve them from the model object.
h2o.confusionMatrix(gbm_model, valid = TRUE)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## setosa versicolor virginica Error Rate
## setosa 7 0 0 0.0000 = 0 / 7
## versicolor 0 13 0 0.0000 = 0 / 13
## virginica 0 1 10 0.0909 = 1 / 11
## Totals 7 14 10 0.0323 = 1 / 31
Deep Learning
Use H2O's Deep Learning to perform regression or classification on a dataset, extract non-linear features generated by the deep neural network, and/or detect anomalies using a deep learning model with auto-encoding.
In this example, we will use the prostate
dataset available within the h2o package:
path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate_df <- spark_read_csv(sc, "prostate", path)
head(prostate_df)
## # Source: lazy query [?? x 9]
## # Database: spark_connection
## ID CAPSULE AGE RACE DPROS DCAPS PSA VOL GLEASON
## <int> <int> <int> <int> <int> <int> <dbl> <dbl> <int>
## 1 1 0 65 1 2 1 1.4 0.0 6
## 2 2 0 72 1 3 2 6.7 0.0 7
## 3 3 0 70 1 1 2 4.9 0.0 6
## 4 4 0 76 2 2 1 51.2 20.0 7
## 5 5 0 69 1 1 1 12.3 55.9 6
## 6 6 1 71 1 3 2 3.3 0.0 8
Once we've done whatever data manipulation is required to run our model, we get a reference to it as an H2O Frame and then split it into training and test sets using the h2o.splitFrame() function:
prostate_hf <- as_h2o_frame(sc, prostate_df, strict_version_check = FALSE)
splits <- h2o.splitFrame(prostate_hf, seed = 1)
Next we define the response and predictor columns.
y <- "VOL"
#remove response and ID cols
x <- setdiff(names(prostate_hf), c("ID", y))
Now we can train a deep neural net.
dl_fit <- h2o.deeplearning(x = x, y = y,
training_frame = splits[[1]],
epochs = 15,
activation = "Rectifier",
hidden = c(10, 5, 10),
input_dropout_ratio = 0.7)
Evaluate performance on a test set:
h2o.performance(dl_fit, newdata = splits[[2]])
## H2ORegressionMetrics: deeplearning
##
## MSE: 253.7022
## RMSE: 15.92803
## MAE: 12.90077
## RMSLE: 1.885052
## Mean Residual Deviance : 253.7022
Note that the above metrics are not reproducible when H2O's Deep Learning is run on multiple cores, however, the metrics should be fairly stable across repeat runs.
Grid Search
H2O's grid search capabilities currently support traditional (Cartesian) grid search and random grid search.
Grid search in R provides the following capabilities:
H2OGrid class: Represents the results of the grid search.
h2o.getGrid(<grid_id>, sort_by, decreasing): Displays the specified grid.
h2o.grid: Starts a new grid search parameterized by:
model builder name (e.g., algorithm = "gbm")
model parameters (e.g., ntrees = 100)
hyper_parameters: attribute for passing a list of hyper parameters (e.g., list(ntrees = c(1, 100), learn_rate = c(0.1, 0.001)))
search_criteria: optional attribute for specifying a more advanced search strategy
Cartesian Grid Search
By default, h2o.grid()
will train a Cartesian grid search – meaning, all possible models in the specified grid.
In this example, we will re-use the prostate data as an example dataset for a regression problem.
splits <- h2o.splitFrame(prostate_hf, seed = 1)
y <- "VOL"
#remove response and ID cols
x <- setdiff(names(prostate_hf), c("ID", y))
After prepping the data, we define a grid and execute the grid search.
# GBM hyperparameters
gbm_params1 <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0),
col_sample_rate = c(0.2, 0.5, 1.0))
# Train and validate a grid of GBMs
gbm_grid1 <- h2o.grid("gbm", x = x, y = y,
grid_id = "gbm_grid1",
training_frame = splits[[1]],
validation_frame = splits[[1]],
ntrees = 100,
seed = 1,
hyper_params = gbm_params1)
# Get the grid results, sorted by validation MSE
gbm_gridperf1 <- h2o.getGrid(grid_id = "gbm_grid1",
sort_by = "mse",
decreasing = FALSE)
gbm_gridperf1
## H2O Grid Details
## ================
##
## Grid ID: gbm_grid1
## Used hyper parameters:
## - col_sample_rate
## - learn_rate
## - max_depth
## - sample_rate
## Number of models: 36
## Number of failed models: 0
##
## Hyper-Parameter Search Summary: ordered by increasing mse
## col_sample_rate learn_rate max_depth sample_rate model_ids
## 1 1.0 0.1 9 1.0 gbm_grid1_model_35
## 2 0.5 0.1 9 1.0 gbm_grid1_model_34
## 3 1.0 0.1 9 0.8 gbm_grid1_model_17
## 4 0.5 0.1 9 0.8 gbm_grid1_model_16
## 5 1.0 0.1 5 0.8 gbm_grid1_model_11
## mse
## 1 88.10947523138782
## 2 102.3118989994892
## 3 102.78632321923726
## 4 126.4217260351778
## 5 149.6066650109763
##
## ---
## col_sample_rate learn_rate max_depth sample_rate model_ids
## 31 0.5 0.01 3 0.8 gbm_grid1_model_1
## 32 0.2 0.01 5 1.0 gbm_grid1_model_24
## 33 0.5 0.01 3 1.0 gbm_grid1_model_19
## 34 0.2 0.01 5 0.8 gbm_grid1_model_6
## 35 0.2 0.01 3 1.0 gbm_grid1_model_18
## 36 0.2 0.01 3 0.8 gbm_grid1_model_0
## mse
## 31 324.8117304723162
## 32 325.10992525687294
## 33 325.27898443785045
## 34 329.36983845305735
## 35 338.54411936919456
## 36 339.7744828617712
Random Grid Search
H2O's Random Grid Search samples from the given parameter space until a set of constraints is met.
The user can specify the total number of desired models (e.g. max_models = 40), the maximum amount of time (e.g. max_runtime_secs = 1000), or tell the grid to stop after performance stops improving by a specified amount.
Random Grid Search is a practical way to arrive at a good model without too much effort.
The example below is set to run fairly quickly – increase max_runtime_secs
or max_models
to cover more of the hyperparameter space in your grid search.
Also, you can expand the hyperparameter space of each of the algorithms by modifying the definition of hyper_param
below.
# GBM hyperparameters
gbm_params2 <- list(learn_rate = seq(0.01, 0.1, 0.01),
max_depth = seq(2, 10, 1),
sample_rate = seq(0.5, 1.0, 0.1),
col_sample_rate = seq(0.1, 1.0, 0.1))
search_criteria2 <- list(strategy = "RandomDiscrete",
max_models = 50)
# Train and validate a grid of GBMs
gbm_grid2 <- h2o.grid("gbm", x = x, y = y,
grid_id = "gbm_grid2",
training_frame = splits[[1]],
validation_frame = splits[[2]],
ntrees = 100,
seed = 1,
hyper_params = gbm_params2,
search_criteria = search_criteria2)
# Get the grid results, sorted by validation MSE
gbm_gridperf2 <- h2o.getGrid(grid_id = "gbm_grid2",
sort_by = "mse",
decreasing = FALSE)
To get the best model, as measured by validation MSE, we simply grab the first row of the gbm_gridperf2@summary_table
object, since this table is already sorted such that the lowest MSE model is on top.
gbm_gridperf2@summary_table[1,]
## Hyper-Parameter Search Summary: ordered by increasing mse
## col_sample_rate learn_rate max_depth sample_rate model_ids
## 1 0.8 0.01 2 0.7 gbm_grid2_model_35
## mse
## 1 244.61196951586288
In the examples above, we generated two different grids, specified by grid_id
.
The first grid was called grid_id = "gbm_grid1"
and the second was called grid_id = "gbm_grid2"
.
However, if we are using the same dataset & algorithm in two grid searches, it probably makes more sense just to add the results of the second grid search to the first.
If you want to add models to an existing grid, rather than create a new one, you simply re-use the same grid_id
.
Exporting Models
There are two ways of exporting models from H2O – saving models as a binary file, or saving models as pure Java code.
Binary Models
The more traditional method is to save a binary model file to disk using the h2o.saveModel()
function.
To load the models using h2o.loadModel()
, the same version of H2O that generated the models is required.
This method is commonly used when H2O is being used in a non-production setting.
A binary model can be saved as follows:
h2o.saveModel(my_model, path = "/Users/me/h2omodels")
Java (POJO) Models
One of the most valuable features of H2O is its ability to export models as pure Java code, or rather, a "Plain Old Java Object" (POJO).
You can learn more about H2O POJO models in this POJO quickstart guide.
The POJO method is used most commonly when a model is deployed in a production setting.
POJO models are ideal for when you need very fast prediction response times, and minimal requirements – the POJO is a standalone Java class with no dependencies on the full H2O stack.
To generate the POJO for your model, use the following command:
h2o.download_pojo(my_model, path = "/Users/me/h2omodels")
Finally, disconnect with:
spark_disconnect_all()
## [1] 1
You can learn more about how to take H2O models to production in the productionizing H2O models section of the H2O docs.
Additional Resources
Main documentation site for Sparkling Water (and all H2O software projects)
H2O.ai website
If you are new to H2O for machine learning, we recommend you start with the Intro to H2O Tutorial, followed by the H2O Grid Search & Model Selection Tutorial.
There are a number of other H2O R tutorials and demos available, as well as the H2O World 2015 Training Gitbook, and the Machine Learning with R and H2O Booklet (pdf).
R interface for GraphFrames
Highlights
Support for GraphFrames which aims to provide the functionality of GraphX.
Perform graph algorithms like: PageRank, ShortestPaths and many others.
Designed to work with sparklyr and the sparklyr extensions.
Installation
To install from CRAN, run:
install.packages("graphframes")
For the development version, run:
devtools::install_github("rstudio/graphframes")
Examples
The examples make use of the highschool
dataset from the ggraph
package.
Create a GraphFrame
The base for graph analyses in Spark, using sparklyr
, will be a GraphFrame.
Open a new Spark connection using sparklyr
, and copy the highschool
data set
library(graphframes)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.1.0")
highschool_tbl <- copy_to(sc, ggraph::highschool, "highschool")
head(highschool_tbl)
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## from to year
## <dbl> <dbl> <dbl>
## 1    1.   14. 1957.
## 2    1.   15. 1957.
## 3    1.   21. 1957.
## 4    1.   54. 1957.
## 5    1.   55. 1957.
## 6    2.   21. 1957.
The vertices table will be constructed using dplyr.
The variable name expected by the GraphFrame is id.
from_tbl <- highschool_tbl %>%
distinct(from) %>%
transmute(id = from)
to_tbl <- highschool_tbl %>%
distinct(to) %>%
transmute(id = to)
vertices_tbl <- from_tbl %>%
sdf_bind_rows(to_tbl)
head(vertices_tbl)
## # Source: lazy query [?? x 1]
## # Database: spark_connection
## id
## <dbl>
## 1 6.
## 2 7.
## 3 12.
## 4 13.
## 5 55.
## 6 58.
The edges table can also be created using dplyr
.
In order for the GraphFrame to work, the from variable needs to be renamed src, and the to variable dst.
# Create a table with <source, destination> edges
edges_tbl <- highschool_tbl %>%
transmute(src = from, dst = to)
The gf_graphframe()
function creates a new GraphFrame
gf_graphframe(vertices_tbl, edges_tbl)
## GraphFrame
## Vertices:
## $ id <dbl> 6, 7, 12, 13, 55, 58, 63, 41, 44, 48, 59, 1, 4, 17, 20, 22,...
## Edges:
## $ src <dbl> 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 7, 8...
## $ dst <dbl> 14, 15, 21, 54, 55, 21, 22, 9, 15, 5, 18, 19, 43, 19, 43, ...
Basic Page Rank
We will calculate PageRank over this dataset.
The gf_graphframe()
command can easily be piped into the gf_pagerank()
function to execute the Page Rank.
gf_graphframe(vertices_tbl, edges_tbl) %>%
gf_pagerank(reset_prob = 0.15, max_iter = 10L, source_id = "1")
## GraphFrame
## Vertices:
## $ id <dbl> 12, 12, 59, 59, 1, 1, 20, 20, 45, 45, 8, 8, 9, 9, 26,...
## $ pagerank <dbl> 1.216914e-02, 1.216914e-02, 1.151867e-03, 1.151867e-0...
## Edges:
## $ src <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,...
## $ dst <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 22, 22,...
## $ weight <dbl> 0.02777778, 0.02777778, 0.02777778, 0.02777778, 0.02777...
Additionally, one can calculate the degrees of vertices using gf_degrees
as follows:
gf_graphframe(vertices_tbl, edges_tbl) %>%
gf_degrees()
## # Source: table<sparklyr_tmp_27b034635ad> [?? x 2]
## # Database: spark_connection
## id degree
## <dbl> <int>
##  1   55.     25
##  2    6.     10
##  3   13.     16
##  4    7.      6
##  5   12.     11
##  6   63.     21
##  7   58.      8
##  8   41.     19
##  9   48.     15
## 10   59.     11
## # ... with more rows
Visualizations
In order to visualize large graphframes, one can use sample_n
and then use ggraph
with igraph
to visualize the graph as follows:
library(ggraph)
library(igraph)
graph <- highschool_tbl %>%
sample_n(20) %>%
collect() %>%
graph_from_data_frame()
ggraph(graph, layout = 'kk') +
geom_edge_link(aes(colour = factor(year))) +
geom_node_point() +
ggtitle('An example')
Additional functions
Apart from calculating PageRank
using gf_pagerank
, the following functions are available:
gf_bfs(): Breadth-first search (BFS).
gf_connected_components(): Connected components.
gf_shortest_paths(): Shortest paths algorithm (see the sketch below).
gf_scc(): Strongly connected components.
gf_triangle_count(): Computes the number of triangles passing through each vertex.
...and others.
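For instance, a quick sketch using gf_shortest_paths(), assuming the numeric vertex ids from this dataset can be passed as landmarks: it computes shortest-path lengths from every vertex to vertices 1 and 14:
gf_graphframe(vertices_tbl, edges_tbl) %>%
  gf_shortest_paths(landmarks = c(1, 14))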
R interface for MLeap
mleap is a sparklyr extension that provides an interface to MLeap, which allows us to take Spark pipelines to production.
Install mleap
mleap can be installed from CRAN via
install.packages("mleap")
or, for the latest development version from GitHub, using
devtools::install_github("rstudio/mleap")
Setup
Once mleap
has been installed, we can install the external dependencies using:
library(mleap)
install_mleap()
Another dependency of mleap
is Maven.
If it is already installed, just point mleap
to its location:
options(maven.home = "path/to/maven")
If Maven is not yet installed, which is the most likely case, use the following to install it:
install_maven()
Create an MLeap Bundle
Start Spark session using sparklyr
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.2.0")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
Create and fit an ML Pipeline
pipeline <- ml_pipeline(sc) %>%
ft_binarizer("hp", "big_hp", threshold = 100) %>%
ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
ml_gbt_regressor(label_col = "mpg")
pipeline_model <- ml_fit(pipeline, mtcars_tbl)
A transformed data frame with the appropriate schema is required for exporting the Pipeline model
transformed_tbl <- ml_transform(pipeline_model, mtcars_tbl)
Export the model using the ml_write_bundle()
function from mleap
model_path <- file.path(tempdir(), "mtcars_model.zip")
ml_write_bundle(pipeline_model, transformed_tbl, model_path)
## Model successfully exported.
Close Spark session
spark_disconnect(sc)
At this point, we can share mtcars_model.zip
with the deployment/implementation engineers, and they would be able to embed the model in another application.
See the MLeap docs for details.
Test the mleap
bundle
The mleap
package also provides R functions for testing that the saved models behave as expected.
Here we load the previously saved model:
model <- mleap_load_bundle(model_path)
model
## MLeap Transformer
## <db23a9f1-7b3d-4d27-9eb0-8675125ab3a5>
## Name: pipeline_fe6b8cb0028f
## Format: json
## MLeap Version: 0.10.0-SNAPSHOT
To retrieve the schema associated with the model use the mleap_model_schema()
function
mleap_model_schema(model)
## # A tibble: 6 x 4
## name type nullable dimension
## <chr> <chr> <lgl> <chr>
## 1 qsec double TRUE <NA>
## 2 hp double FALSE <NA>
## 3 wt double TRUE <NA>
## 4 big_hp double FALSE <NA>
## 5 features double TRUE (3)
## 6 prediction double FALSE <NA>
Then, we create a new data frame to be scored, and make predictions using the model:
newdata <- tibble::tribble(
~qsec, ~hp, ~wt,
16.2, 101, 2.68,
18.1, 99, 3.08
)
# Transform the data frame
transformed_df <- mleap_transform(model, newdata)
dplyr::glimpse(transformed_df)
## Observations: 2
## Variables: 6
## $ qsec <dbl> 16.2, 18.1
## $ hp <dbl> 101, 99
## $ wt <dbl> 2.68, 3.08
## $ big_hp <dbl> 1, 0
## $ features <list> [[[1, 2.68, 16.2], [3]], [[0, 3.08, 18.1], [3]]]
## $ prediction <dbl> 21.06529, 22.36667
Examples
Overview: With this configuration, RStudio Server Pro is installed outside of the Spark cluster and allows users to connect to Spark remotely using sparklyr with Databricks Connect.
This is the recommended configuration because it targets separate environments, involves a typical configuration process, avoids resource contention, and allows RStudio Server Pro to connect to Databricks as well as other remote storage and compute resources.
Advantages: RStudio Server Pro will remain functional if Databricks clusters are terminated; it provides the ability to communicate with one or more Databricks clusters as a remote compute resource; and it avoids resource contention between RStudio Server Pro and Databricks.
Overview: If the recommended path of connecting to Spark remotely with Databricks Connect does not apply to your use case, then you can install RStudio Server Pro directly within a Databricks cluster.
With this configuration, RStudio Server Pro is installed on the Spark driver node and allows users to work locally with Spark using sparklyr.
This configuration can result in increased complexity, limited connectivity to other storage and compute resources, resource contention between RStudio Server Pro and Databricks, and maintenance concerns due to the ephemeral nature of Databricks clusters.
Overview This documentation demonstrates how to use sparklyr with Apache Spark in Databricks along with RStudio Team, RStudio Server Pro, RStudio Connect, and RStudio Package Manager.
Using RStudio Team with Databricks RStudio Team is a bundle of our popular professional software for developing data science projects, publishing data products, and managing packages.
RStudio Team and sparklyr can be used with Databricks to work with large datasets and distributed computations with Apache Spark. |
|
Summary This document demonstrates how to use sparklyr with an Cloudera Hadoop & Spark cluster.
Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes.
RStudio Server is installed on the master node and orchestrates the analysis in spark.
Cloudera Cluster This demonstration is focused on adding RStudio integration to an existing Cloudera cluster.
The assumption will be made that there no aid is needed to setup and administer the cluster. |
|
Spark Standalone Deployment in AWS
Overview
The plan is to launch 4 identical EC2 server instances.
One server will be the Master node and the other 3 the worker nodes.
In one of the worker nodes, we will install RStudio server.
What makes a server the Master node is only the fact that it is running the master service, while the other machines are running the slave service and are pointed to that first master.
This simple setup allows us to install the same Spark components on all 4 servers and then just add RStudio to one of them.
The topology will look something like this:
AWS EC2 Instances
Here are the details of the EC2 instance, just deploy one at this point:
Type: t2.medium
OS: Ubuntu 16.04 LTS
Disk space: At least 20GB
Security group: Open the following ports: 8080 (Spark UI), 4040 (Spark Worker UI), 8088 (sparklyr UI) and 8787 (RStudio).
Also open All TCP ports for the machines inside the security group.
Spark
Perform the steps in this section on all of the servers that will be part of the cluster.
Install Java 8
We will add the Java 8 repository, install it, and set it as the default:
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
sudo apt-get update
Download Spark
Download and unpack a pre-compiled version of Spark.
Here is the link to the official Spark download page:
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7
Create and launch AMI
We will create an image of the server.
In Amazon, these are called AMIs, for information please see the User Guide.
Launch 3 instances of the AMI
RStudio Server
Select one of the nodes to execute this section.
Please check the RStudio download page for the latest version
Install R
In order to get the latest R core, we will need to update the source list in Ubuntu:
sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key 0x51716619e084dab9
gpg -a --export 0x51716619e084dab9 | sudo apt-key add -
sudo apt-get update
Now we can install R:
sudo apt-get install r-base
sudo apt-get install gdebi-core
Install RStudio
We will download and install RStudio Server.
To find the latest version, please visit the RStudio website.
In order to get the enhanced integration with Spark, RStudio version 1.0.44 or later will be needed:
wget https://download2.rstudio.org/rstudio-server-1.0.153-amd64.deb
sudo gdebi rstudio-server-1.0.153-amd64.deb
Install dependencies
Run the following commands:
sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev
sudo apt-get -y install libxml2-dev
Add default user
Run the following command to add a default user:
sudo adduser rstudio-user
Start the Master node
Select one of the servers to become your Master node
Run the command that starts the master service:
sudo spark-2.1.0-bin-hadoop2.7/sbin/start-master.sh
Close the terminal connection (optional)
Start Worker nodes
Start the slave service.
Important: Use dots, not dashes, as separators in the Spark Master node's address.
sudo spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://[Master node's IP address]:7077
sudo spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://ip-172-30-1-94.us-west-2.compute.internal:7077
Close the terminal connection (optional)
Pre-load packages
Log into RStudio (port 8787)
Use the 'rstudio-user' account and run:
install.packages("sparklyr")
Connect to the Spark Master
Navigate to the Spark Master’s UI, typically on port 8080
Note the Spark Master URL
Logon to RStudio
Run the following code
library(sparklyr)
conf <- spark_config()
conf$spark.executor.memory <- "2GB"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master="[Spark Master URL]",
version = "2.1.0",
config = conf,
spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/")
Using sparklyr with an Apache Spark cluster
This document demonstrates how to use sparklyr
with an Apache Spark cluster.
Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes.
RStudio Server is installed on the master node and orchestrates the analysis in spark.
Here is the basic workflow.
Data preparation
Set up the cluster
This demonstration uses Amazon Web Services (AWS), but it could just as easily use Microsoft, Google, or any other provider.
We will use Elastic Map Reduce (EMR) to easily set up a cluster with two core nodes and one master node.
Nodes use virtual servers from the Elastic Compute Cloud (EC2).
Note: There is no free tier for EMR, charges will apply.
Before beginning this setup we assume you have:
Familiarity with and access to an AWS account
Familiarity with basic Linux commands
Sudo privileges in order to install software from the command line
Build an EMR cluster
Before beginning the EMR wizard setup, make sure you create the following in AWS:
An AWS key pair (.pem key) so you can SSH into the EC2 master node
A security group that gives you access to port 22 on your IP and port 8787 from anywhere
Step 1: Select software
Make sure to select Hive and Spark as part of the install.
Note that by choosing Spark, R will also be installed on the master node as part of the distribution.
Step 2: Select hardware
Provision 2 core nodes and one master node using m3.xlarge instances with 80 GiB of storage per node.
You can easily increase the number of nodes later.
Step 3: Select general cluster settings
Click next on the general cluster settings.
Step 4: Select security
Enter your EC2 key pair and security group.
Make sure the security group has ports 22 and 8787 open.
Connect to EMR
The cluster page will give you details about your EMR cluster and instructions on connecting.
Connect to the master node via SSH using your key pair.
Once you connect you will see the EMR welcome.
# Log in to master node
ssh -i ~/spark-demo.pem hadoop@ec2-52-10-102-11.us-west-2.compute.amazonaws.com
Install RStudio Server
EMR uses Amazon Linux, which is based on CentOS.
Update your master node and install dependencies that will be used by R packages.
# Update
sudo yum update
sudo yum install libcurl-devel openssl-devel # used for devtools
The installation of RStudio Server is easy.
Download the preview version of RStudio Server and install it on the master node.
# Install RStudio Server
wget -P /tmp https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-0.99.1266-x86_64.rpm
sudo yum install --nogpgcheck /tmp/rstudio-server-rhel-0.99.1266-x86_64.rpm
Create a User
Create a user called rstudio-user
that will perform the data analysis.
Create a user directory for rstudio-user
on HDFS with the hadoop fs
command.
# Make User
sudo useradd -m rstudio-user
sudo passwd rstudio-user
# Create new directory in hdfs
hadoop fs -mkdir /user/rstudio-user
hadoop fs -chmod 777 /user/rstudio-user
Download flights data
The flights data is a well-known data source representing 123 million flights over 22 years.
It consumes roughly 12 GiB of storage in uncompressed CSV format in yearly files.
Switch User
For data loading and analysis, make sure you are logged in as a regular user.
# create directories on hdfs for new user
hadoop fs -mkdir /user/rstudio-user
hadoop fs -chmod 777 /user/rstudio-user
# switch user
su rstudio-user
Download data
Run the following script to download data from the web onto your master node.
Download the yearly flight data and the airlines lookup table.
# Make download directory
mkdir /tmp/flights
# Download flight data by year
for i in {1987..2008}
do
echo "$(date) $i Download"
fnam=$i.csv.bz2
wget -O /tmp/flights/$fnam http://stat-computing.org/dataexpo/2009/$fnam
echo "$(date) $i Unzip"
bunzip2 /tmp/flights/$fnam
done
# Download airline carrier data
wget -O /tmp/airlines.csv http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS
# Download airports data
wget -O /tmp/airports.csv https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
Distribute into HDFS
Copy data into HDFS using the hadoop fs
command.
# Copy flight data to HDFS
hadoop fs -mkdir /user/rstudio-user/flights/
hadoop fs -put /tmp/flights /user/rstudio-user/
# Copy airline data to HDFS
hadoop fs -mkdir /user/rstudio-user/airlines/
hadoop fs -put /tmp/airlines.csv /user/rstudio-user/airlines
# Copy airport data to HDFS
hadoop fs -mkdir /user/rstudio-user/airports/
hadoop fs -put /tmp/airports.csv /user/rstudio-user/airports
Create Hive tables
Launch Hive from the command line.
# Open Hive prompt
hive
Create the metadata that will structure the flights table.
Load data into the Hive table.
# Create metadata for flights
CREATE EXTERNAL TABLE IF NOT EXISTS flights
(
year int,
month int,
dayofmonth int,
dayofweek int,
deptime int,
crsdeptime int,
arrtime int,
crsarrtime int,
uniquecarrier string,
flightnum int,
tailnum string,
actualelapsedtime int,
crselapsedtime int,
airtime string,
arrdelay int,
depdelay int,
origin string,
dest string,
distance int,
taxiin string,
taxiout string,
cancelled int,
cancellationcode string,
diverted int,
carrierdelay string,
weatherdelay string,
nasdelay string,
securitydelay string,
lateaircraftdelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");
# Load data into table
LOAD DATA INPATH '/user/rstudio-user/flights' INTO TABLE flights;
Create the metadata that will structure the airlines table.
Load data into the Hive table.
# Create metadata for airlines
CREATE EXTERNAL TABLE IF NOT EXISTS airlines
(
Code string,
Description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
# Load data into table
LOAD DATA INPATH '/user/rstudio-user/airlines' INTO TABLE airlines;
Create the metadata that will structure the airports table.
Load data into the Hive table.
# Create metadata for airports
CREATE EXTERNAL TABLE IF NOT EXISTS airports
(
id string,
name string,
city string,
country string,
faa string,
icao string,
lat double,
lon double,
alt int,
tz_offset double,
dst string,
tz_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE;
# Load data into table
LOAD DATA INPATH '/user/rstudio-user/airports' INTO TABLE airports;
Connect to Spark
Log in to RStudio Server by pointing a browser at your master node IP:8787.
Set the environment variable SPARK_HOME
and then run spark_connect
.
After connecting you will be able to browse the Hive metadata in the RStudio Server Spark pane.
# Connect to Spark
library(sparklyr)
library(dplyr)
library(ggplot2)
Sys.setenv(SPARK_HOME="/usr/lib/spark")
config <- spark_config()
sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.2')
Once you are connected, you will see the Spark pane appear along with your hive tables.
You can inspect your tables by clicking on the data icon.
Data analysis
Is there evidence to suggest that some airline carriers make up time in flight? This analysis predicts time gained in flight by airline carrier.
Cache the tables into memory
Use tbl_cache
to load the flights table into memory.
Caching tables will make analysis much faster.
Create a dplyr reference to the Spark DataFrame.
# Cache flights Hive table into Spark
tbl_cache(sc, 'flights')
flights_tbl <- tbl(sc, 'flights')
# Cache airlines Hive table into Spark
tbl_cache(sc, 'airlines')
airlines_tbl <- tbl(sc, 'airlines')
# Cache airports Hive table into Spark
tbl_cache(sc, 'airports')
airports_tbl <- tbl(sc, 'airports')
Create a model data set
Filter the data to contain only the records to be used in the fitted model.
Join carrier descriptions for reference.
Create a new variable called gain
which represents the amount of time gained (or lost) in flight.
# Filter records and create target variable 'gain'
model_data <- flights_tbl %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360) %>%
filter(year >= 2003 & year <= 2007) %>%
left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
mutate(gain = depdelay - arrdelay) %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain)
# Summarize data by carrier
model_data %>%
group_by(uniquecarrier) %>%
summarize(description = min(description), gain=mean(gain),
distance=mean(distance), depdelay=mean(depdelay)) %>%
select(description, gain, distance, depdelay) %>%
arrange(gain)
Source: query [?? x 4]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
   description                    gain        distance   depdelay
   <chr>                          <dbl>       <dbl>      <dbl>
1  ATA Airlines d/b/a ATA         -3.3480120  1134.7084  56.06583
2  ExpressJet Airlines Inc. (1)   -3.0326180   519.7125  59.41659
3  Envoy Air                      -2.5434415   416.3716  53.12529
4  Northwest Airlines Inc.        -2.2030586   779.2342  48.52828
5  Delta Air Lines Inc.           -1.8248026   868.3997  50.77174
6  AirTran Airways Corporation    -1.4331555   641.8318  54.96702
7  Continental Air Lines Inc.     -0.9617003  1116.6668  57.00553
8  American Airlines Inc.         -0.8860262  1074.4388  55.45045
9  Endeavor Air Inc.              -0.6392733   467.1951  58.47395
10 JetBlue Airways                -0.3262134  1139.0443  54.06156
# ... with more rows
Train a linear model
Predict time gained or lost in flight as a function of distance, departure delay, and airline carrier.
# Partition the data into training and validation sets
model_partition <- model_data %>%
sdf_partition(train = 0.8, valid = 0.2, seed = 5555)
# Fit a linear model
ml1 <- model_partition$train %>%
ml_linear_regression(gain ~ distance + depdelay + uniquecarrier)
# Summarize the linear model
summary(ml1)
Deviance Residuals: (approximate):
Min 1Q Median 3Q Max
-305.422 -5.593 2.699 9.750 147.871
Coefficients:
Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -1.24342576 0.10248281 -12.1330 < 2.2e-16 ***
distance 0.00326600 0.00001670 195.5709 < 2.2e-16 ***
depdelay -0.01466233 0.00020337 -72.0977 < 2.2e-16 ***
uniquecarrier_AA -2.32650517 0.10522524 -22.1098 < 2.2e-16 ***
uniquecarrier_AQ 2.98773637 0.28798507 10.3746 < 2.2e-16 ***
uniquecarrier_AS 0.92054894 0.11298561 8.1475 4.441e-16 ***
uniquecarrier_B6 -1.95784698 0.11728289 -16.6934 < 2.2e-16 ***
uniquecarrier_CO -2.52618081 0.11006631 -22.9514 < 2.2e-16 ***
uniquecarrier_DH 2.23287189 0.11608798 19.2343 < 2.2e-16 ***
uniquecarrier_DL -2.68848119 0.10621977 -25.3106 < 2.2e-16 ***
uniquecarrier_EV 1.93484736 0.10724290 18.0417 < 2.2e-16 ***
uniquecarrier_F9 -0.89788137 0.14422281 -6.2257 4.796e-10 ***
uniquecarrier_FL -1.46706706 0.11085354 -13.2343 < 2.2e-16 ***
uniquecarrier_HA -0.14506644 0.25031456 -0.5795 0.5622
uniquecarrier_HP 2.09354855 0.12337515 16.9690 < 2.2e-16 ***
uniquecarrier_MQ -1.88297535 0.10550507 -17.8473 < 2.2e-16 ***
uniquecarrier_NW -2.79538927 0.10752182 -25.9983 < 2.2e-16 ***
uniquecarrier_OH 0.83520117 0.11032997 7.5700 3.730e-14 ***
uniquecarrier_OO 0.61993842 0.10679884 5.8047 6.447e-09 ***
uniquecarrier_TZ -4.99830389 0.15912629 -31.4109 < 2.2e-16 ***
uniquecarrier_UA -0.68294396 0.10638099 -6.4198 1.365e-10 ***
uniquecarrier_US -0.61589284 0.10669583 -5.7724 7.815e-09 ***
uniquecarrier_WN 3.86386059 0.10362275 37.2878 < 2.2e-16 ***
uniquecarrier_XE -2.59658123 0.10775736 -24.0966 < 2.2e-16 ***
uniquecarrier_YV 3.11113140 0.11659679 26.6828 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-Squared: 0.02385
Root Mean Squared Error: 17.74
Assess model performance
Compare the model performance using the validation data.
# Calculate average gains by predicted decile
model_deciles <- lapply(model_partition, function(x) {
sdf_predict(ml1, x) %>%
mutate(decile = ntile(desc(prediction), 10)) %>%
group_by(decile) %>%
summarize(gain = mean(gain)) %>%
select(decile, gain) %>%
collect()
})
# Create a summary dataset for plotting
deciles <- rbind(
data.frame(data = 'train', model_deciles$train),
data.frame(data = 'valid', model_deciles$valid),
make.row.names = FALSE
)
# Plot average gains by predicted decile
deciles %>%
ggplot(aes(factor(decile), gain, fill = data)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = 'Average gain by predicted decile', x = 'Decile', y = 'Minutes')
Visualize predictions
Compare actual gains to predicted gains for an out of time sample.
# Select data from an out of time sample
data_2008 <- flights_tbl %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360) %>%
filter(year == 2008) %>%
left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
mutate(gain = depdelay - arrdelay) %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain, origin,dest)
# Summarize data by carrier
carrier <- sdf_predict(ml1, data_2008) %>%
group_by(description) %>%
summarize(gain = mean(gain), prediction = mean(prediction), freq = n()) %>%
filter(freq > 10000) %>%
collect
# Plot actual gains and predicted gains by airline carrier
ggplot(carrier, aes(gain, prediction)) +
geom_point(alpha = 0.75, color = 'red', shape = 3) +
geom_abline(intercept = 0, slope = 1, alpha = 0.15, color = 'blue') +
geom_text(aes(label = substr(description, 1, 20)), size = 3, alpha = 0.75, vjust = -1) +
labs(title='Average Gains Forecast', x = 'Actual', y = 'Predicted')
Some carriers make up more time than others in flight, but the differences are relatively small.
The difference in average time gained between the best and worst airlines is only six minutes.
The best predictor of time gained is not carrier but flight distance.
The biggest gains were associated with the longest flights.
Share Insights
This simple linear model contains a wealth of detailed information about carriers, distances traveled, and flight delays.
These detailed insights can be conveyed to a non-technical audience via an interactive flexdashboard.
Build dashboard
Aggregate the scored data by origin, destination, and airline.
Save the aggregated data.
# Summarize by origin, destination, and carrier
summary_2008 <- sdf_predict(ml1, data_2008) %>%
rename(carrier = uniquecarrier, airline = description) %>%
group_by(origin, dest, carrier, airline) %>%
summarize(
flights = n(),
distance = mean(distance),
avg_dep_delay = mean(depdelay),
avg_arr_delay = mean(arrdelay),
avg_gain = mean(gain),
pred_gain = mean(prediction)
)
# Collect and save objects
pred_data <- collect(summary_2008)
airports <- collect(select(airports_tbl, name, faa, lat, lon))
ml1_summary <- capture.output(summary(ml1))
save(pred_data, airports, ml1_summary, file = 'flights_pred_2008.RData')
Publish dashboard
Use the saved data to build an R Markdown flexdashboard.
Publish the flexdashboard to Shiny Server, Shinyapps.io or RStudio Connect.
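For a concrete illustration, here is a minimal publishing sketch; the file name flights_dashboard.Rmd is a placeholder, and it assumes a publishing account (RStudio Connect or shinyapps.io) has already been registered with the rsconnect package.
library(rsconnect)
# Register the target account beforehand, e.g. with rsconnect::connectUser()
# for RStudio Connect or rsconnect::setAccountInfo() for shinyapps.io.
# Deploy the R Markdown flexdashboard (the file name is a placeholder).
deployDoc(
  doc      = "flights_dashboard.Rmd",
  appTitle = "Flights 2008 - Average Gains Forecast"
)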
Using sparklyr with an Apache Spark cluster
Summary
This document demonstrates how to use sparklyr
with a Cloudera Hadoop & Spark cluster.
Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes.
RStudio Server is installed on the master node and orchestrates the analysis in spark.
Cloudera Cluster
This demonstration is focused on adding RStudio integration to an existing Cloudera cluster.
We assume that no help is needed to set up and administer the cluster.
CDH 5
We will start with a Cloudera cluster running CDH version 5.8.2 (free version) on an underlying Ubuntu Linux distribution.
Spark 1.6
The default Spark 1.6.0 parcel is installed and running.
Hive data
For this demo, we have created and populated 3 tables in Hive.
The table names are: flights, airlines and airports.
Using Hue, we can see the loaded tables.
For the links to the data files and their Hive import scripts please see Appendix A.
Install RStudio
The latest version of R is needed.
In Ubuntu, the default core R is not the latest, so we have to update the source list.
We will also install a few other dependencies.
sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key 0x51716619e084dab9
gpg -a --export 0x51716619e084dab9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base
sudo apt-get install gdebi-core
sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev
We will install the preview version of RStudio Server
wget https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-1.0.40-amd64.deb
sudo gdebi rstudio-server-1.0.40-amd64.deb
Create and configure a User
Create a user called rstudio
that will perform the data analysis.
sudo adduser rstudio
To ease security restrictions in this demo, we will add the new user to the default supergroup defined by the dfs.permissions.superusergroup setting in CDH
sudo groupadd supergroup
sudo usermod -a -G supergroup rstudio
Connect to Spark
Log in to RStudio Server by pointing a browser at your master node IP:8787.
Set the environment variable SPARK_HOME
and then run spark_connect
.
After connecting you will be able to browse the Hive metadata in the RStudio Server Spark pane.
library(sparklyr)
library(dplyr)
library(ggplot2)
sc <- spark_connect(master = "yarn-client", version="1.6.0", spark_home = '/opt/cloudera/parcels/CDH/lib/spark/')
Once you are connected, you will see the Spark pane appear along with your hive tables.
You can inspect your tables by clicking on the data icon.
This is what the tables look like loaded in Spark via the History Server Web UI (port 18088)
Data analysis
Is there evidence to suggest that some airline carriers make up time in flight? This analysis predicts time gained in flight by airline carrier.
Cache the tables into memory
Use tbl_cache
to load the flights table into memory.
Caching tables will make analysis much faster.
Create a dplyr reference to the Spark DataFrame.
# Cache flights Hive table into Spark
tbl_cache(sc, 'flights')
flights_tbl <- tbl(sc, 'flights')
# Cache airlines Hive table into Spark
tbl_cache(sc, 'airlines')
airlines_tbl <- tbl(sc, 'airlines')
# Cache airports Hive table into Spark
tbl_cache(sc, 'airports')
airports_tbl <- tbl(sc, 'airports')
Create a model data set
Filter the data to contain only the records to be used in the fitted model.
Join carrier descriptions for reference.
Create a new variable called gain
which represents the amount of time gained (or lost) in flight.
# Filter records and create target variable 'gain'
model_data <- flights_tbl %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360) %>%
filter(year >= 2003 & year <= 2007) %>%
left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
mutate(gain = depdelay - arrdelay) %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain)
# Summarize data by carrier
model_data %>%
group_by(uniquecarrier) %>%
summarize(description = min(description), gain=mean(gain),
distance=mean(distance), depdelay=mean(depdelay)) %>%
select(description, gain, distance, depdelay) %>%
arrange(gain)
Source: query [?? x 4]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
   description                    gain        distance   depdelay
   <chr>                          <dbl>       <dbl>      <dbl>
1  ATA Airlines d/b/a ATA         -5.5679651  1240.7219  61.84391
2  Northwest Airlines Inc.        -3.1134556   779.1926  48.84979
3  Envoy Air                      -2.2056576   437.0883  54.54923
4  PSA Airlines Inc.              -1.9267647   500.6955  55.60335
5  ExpressJet Airlines Inc. (1)   -1.5886314   537.3077  61.58386
6  JetBlue Airways                -1.3742524  1087.2337  59.80750
7  SkyWest Airlines Inc.          -1.1265678   419.6489  54.04198
8  Delta Air Lines Inc.           -0.9829374   956.9576  50.19338
9  American Airlines Inc.         -0.9631200  1066.8396  56.78222
10 AirTran Airways Corporation    -0.9411572   665.6574  53.38363
# ... with more rows
Train a linear model
Predict time gained or lost in flight as a function of distance, departure delay, and airline carrier.
# Partition the data into training and validation sets
model_partition <- model_data %>%
sdf_partition(train = 0.8, valid = 0.2, seed = 5555)
# Fit a linear model
ml1 <- model_partition$train %>%
ml_linear_regression(gain ~ distance + depdelay + uniquecarrier)
# Summarize the linear model
summary(ml1)
Call: ml_linear_regression(., gain ~ distance + depdelay + uniquecarrier)
Deviance Residuals: (approximate):
Min 1Q Median 3Q Max
-302.343 -5.669 2.714 9.832 104.130
Coefficients:
Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -1.26566581 0.10385870 -12.1864 < 2.2e-16 ***
distance 0.00308711 0.00002404 128.4155 < 2.2e-16 ***
depdelay -0.01397013 0.00028816 -48.4812 < 2.2e-16 ***
uniquecarrier_AA -2.18483090 0.10985406 -19.8885 < 2.2e-16 ***
uniquecarrier_AQ 3.14330242 0.29114487 10.7964 < 2.2e-16 ***
uniquecarrier_AS 0.09210380 0.12825003 0.7182 0.4726598
uniquecarrier_B6 -2.66988794 0.12682192 -21.0523 < 2.2e-16 ***
uniquecarrier_CO -1.11611186 0.11795564 -9.4621 < 2.2e-16 ***
uniquecarrier_DL -1.95206198 0.11431110 -17.0767 < 2.2e-16 ***
uniquecarrier_EV 1.70420830 0.11337215 15.0320 < 2.2e-16 ***
uniquecarrier_F9 -1.03178176 0.15384863 -6.7065 1.994e-11 ***
uniquecarrier_FL -0.99574060 0.12034738 -8.2739 2.220e-16 ***
uniquecarrier_HA -1.16970713 0.34894788 -3.3521 0.0008020 ***
uniquecarrier_MQ -1.55569040 0.10975613 -14.1741 < 2.2e-16 ***
uniquecarrier_NW -3.58502418 0.11534938 -31.0797 < 2.2e-16 ***
uniquecarrier_OH -1.40654797 0.12034858 -11.6873 < 2.2e-16 ***
uniquecarrier_OO -0.39069404 0.11132164 -3.5096 0.0004488 ***
uniquecarrier_TZ -7.26285217 0.34428509 -21.0955 < 2.2e-16 ***
uniquecarrier_UA -0.56995737 0.11186757 -5.0949 3.489e-07 ***
uniquecarrier_US -0.52000028 0.11218498 -4.6352 3.566e-06 ***
uniquecarrier_WN 4.22838982 0.10629405 39.7801 < 2.2e-16 ***
uniquecarrier_XE -1.13836940 0.11332176 -10.0455 < 2.2e-16 ***
uniquecarrier_YV 3.17149538 0.11709253 27.0854 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-Squared: 0.02301
Root Mean Squared Error: 17.83
Assess model performance
Compare the model performance using the validation data.
# Calculate average gains by predicted decile
model_deciles <- lapply(model_partition, function(x) {
sdf_predict(ml1, x) %>%
mutate(decile = ntile(desc(prediction), 10)) %>%
group_by(decile) %>%
summarize(gain = mean(gain)) %>%
select(decile, gain) %>%
collect()
})
# Create a summary dataset for plotting
deciles <- rbind(
data.frame(data = 'train', model_deciles$train),
data.frame(data = 'valid', model_deciles$valid),
make.row.names = FALSE
)
# Plot average gains by predicted decile
deciles %>%
ggplot(aes(factor(decile), gain, fill = data)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = 'Average gain by predicted decile', x = 'Decile', y = 'Minutes')
Visualize predictions
Compare actual gains to predicted gains for an out of time sample.
# Select data from an out of time sample
data_2008 <- flights_tbl %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360) %>%
filter(year == 2008) %>%
left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
mutate(gain = depdelay - arrdelay) %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain, origin,dest)
# Summarize data by carrier
carrier <- sdf_predict(ml1, data_2008) %>%
group_by(description) %>%
summarize(gain = mean(gain), prediction = mean(prediction), freq = n()) %>%
filter(freq > 10000) %>%
collect
# Plot actual gains and predicted gains by airline carrier
ggplot(carrier, aes(gain, prediction)) +
geom_point(alpha = 0.75, color = 'red', shape = 3) +
geom_abline(intercept = 0, slope = 1, alpha = 0.15, color = 'blue') +
geom_text(aes(label = substr(description, 1, 20)), size = 3, alpha = 0.75, vjust = -1) +
labs(title='Average Gains Forecast', x = 'Actual', y = 'Predicted')
Some carriers make up more time than others in flight, but the differences are relatively small.
The difference in average time gained between the best and worst airlines is only six minutes.
The best predictor of time gained is not carrier but flight distance.
The biggest gains were associated with the longest flights.
Share Insights
This simple linear model contains a wealth of detailed information about carriers, distances traveled, and flight delays.
These detailed insights can be conveyed to a non-technical audience via an interactive flexdashboard.
Build dashboard
Aggregate the scored data by origin, destination, and airline.
Save the aggregated data.
# Summarize by origin, destination, and carrier
summary_2008 <- sdf_predict(ml1, data_2008) %>%
rename(carrier = uniquecarrier, airline = description) %>%
group_by(origin, dest, carrier, airline) %>%
summarize(
flights = n(),
distance = mean(distance),
avg_dep_delay = mean(depdelay),
avg_arr_delay = mean(arrdelay),
avg_gain = mean(gain),
pred_gain = mean(prediction)
)
# Collect and save objects
pred_data <- collect(summary_2008)
airports <- collect(select(airports_tbl, name, faa, lat, lon))
ml1_summary <- capture.output(summary(ml1))
save(pred_data, airports, ml1_summary, file = 'flights_pred_2008.RData')
Publish dashboard
Use the saved data to build an R Markdown flexdashboard.
Publish the flexdashboard to Shiny Server, Shinyapps.io or RStudio Connect.
Appendix
Appendix A - Data files
Run the following script to download data from the web onto your master node.
Download the yearly flight data and the airlines lookup table.
# Make download directory
mkdir /tmp/flights
# Download flight data by year
for i in {2006..2008}
do
echo "$(date) $i Download"
fnam=$i.csv.bz2
wget -O /tmp/flights/$fnam http://stat-computing.org/dataexpo/2009/$fnam
echo "$(date) $i Unzip"
bunzip2 /tmp/flights/$fnam
done
# Download airline carrier data
wget -O /tmp/airlines.csv http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS
# Download airports data
wget -O /tmp/airports.csv https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
Hive tables
We used the Hue interface, logged in as ‘admin’ to load the data into HDFS and then into Hive.
CREATE EXTERNAL TABLE IF NOT EXISTS flights
(
year int,
month int,
dayofmonth int,
dayofweek int,
deptime int,
crsdeptime int,
arrtime int,
crsarrtime int,
uniquecarrier string,
flightnum int,
tailnum string,
actualelapsedtime int,
crselapsedtime int,
airtime string,
arrdelay int,
depdelay int,
origin string,
dest string,
distance int,
taxiin string,
taxiout string,
cancelled int,
cancellationcode string,
diverted int,
carrierdelay string,
weatherdelay string,
nasdelay string,
securitydelay string,
lateaircraftdelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");
LOAD DATA INPATH '/user/admin/flights/2006.csv/' INTO TABLE flights;
LOAD DATA INPATH '/user/admin/flights/2007.csv/' INTO TABLE flights;
LOAD DATA INPATH '/user/admin/flights/2008.csv/' INTO TABLE flights;
# Create metadata for airlines
CREATE EXTERNAL TABLE IF NOT EXISTS airlines
(
Code string,
Description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
LOAD DATA INPATH '/user/admin/L_UNIQUE_CARRIERS.csv' INTO TABLE airlines;
CREATE EXTERNAL TABLE IF NOT EXISTS airports
(
id string,
name string,
city string,
country string,
faa string,
icao string,
lat double,
lon double,
alt int,
tz_offset double,
dst string,
tz_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/admin/airports.dat' INTO TABLE airports;
Using sparklyr with Databricks
Overview
This documentation demonstrates how to use sparklyr
with Apache Spark in
Databricks along with RStudio Team, RStudio Server Pro, RStudio Connect, and
RStudio Package Manager.
Using RStudio Team with Databricks
RStudio Team is a bundle of our popular professional software for developing
data science projects, publishing data products, and managing packages.
RStudio Team and sparklyr
can be used with Databricks to work with large
datasets and distributed computations with Apache Spark.
The most common use
case is to perform interactive analysis and exploratory development with RStudio
Server Pro and sparklyr
; write out the results to a database, file system, or
cloud storage; then publish apps, reports, and APIs to RStudio Connect that
query and access the results.
The sections below describe best practices and different options for configuring
specific RStudio products to work with Databricks.
Best practices for working with Databricks
Maintain separate installation environments - Install RStudio Server Pro,
RStudio Connect, and RStudio Package Manager outside of the Databricks cluster
so that they are not limited to the compute resources or ephemeral nature of
Databricks clusters.
Connect to Databricks remotely - Work with Databricks as a remote compute
resource, similar to how you would connect remotely to external databases,
data sources, and storage systems.
This can be accomplished using Databricks
Connect (as described in the
Connecting to Databricks remotely
section below) or by performing SQL queries with JDBC/ODBC using the
Databricks Spark SQL Driver on
AWS or
Azure.
Restrict workloads to interactive analysis - Only perform workloads
related to exploratory or interactive analysis with Spark, then write the
results to a database, file system, or cloud storage for more efficient
retrieval in apps, reports, and APIs.
Load and query results efficiently - Because of the nature of Spark
computations and the associated overhead, Shiny apps that use Spark on the
backend tend to have performance and runtime issues; consider reading the
results from a database, file system, or cloud storage instead.
Using RStudio Server Pro with Databricks
There are two options for using sparklyr
and RStudio Server Pro with
Databricks:
Option 1:
Connecting to Databricks remotely
(Recommended Option)
Option 2:
Working inside of Databricks
(Alternative Option)
Option 1 - Connecting to Databricks remotely
With this configuration, RStudio Server Pro is installed outside of the Spark
cluster and allows users to connect to Spark remotely using sparklyr
with
Databricks Connect.
This is the recommended configuration because it targets separate environments,
involves a typical configuration process, avoids resource contention, and allows
RStudio Server Pro to connect to Databricks as well as other remote storage and
compute resources.
View steps for connecting to Databricks remotely
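As a minimal sketch of this option, assuming Databricks Connect has already been installed and configured on the RStudio Server Pro host, the connection can be opened with sparklyr's "databricks" method:
library(sparklyr)
# Databricks Connect supplies the local Spark distribution; ask it where
# SPARK_HOME is and connect with the "databricks" method.
sc <- spark_connect(
  method     = "databricks",
  spark_home = system2("databricks-connect", "get-spark-home", stdout = TRUE)
)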
Option 2 - Working inside of Databricks
If you cannot work with Spark remotely, you should install RStudio Server Pro on
the Driver node of a long-running, persistent Databricks cluster as opposed to a
worker node or an ephemeral cluster.
With this configuration, RStudio Server Pro is installed on the Spark driver
node and allows users to connect to Spark locally using sparklyr
.
This configuration can result in increased complexity, limited connectivity to
other storage and compute resources, resource contention between RStudio Server
Pro and Databricks, and maintenance concerns due to the ephemeral nature of
Databricks clusters.
View steps for working inside of Databricks
Using RStudio Connect with Databricks
The server environment within Databricks clusters is not permissive enough to
support RStudio Connect or the process sandboxing mechanisms that it uses to
isolate published content.
Therefore, the only supported configuration is to install RStudio Connect
outside of the Databricks cluster and connect to Databricks remotely.
Whether RStudio Server Pro is installed outside of the Databricks cluster
(Recommended Option) or within the Databricks cluster (Alternative Option), you
can publish content to RStudio Connect as long as HTTP/HTTPS network traffic is
allowed from RStudio Server Pro to RStudio Connect.
There are two options for using RStudio Connect with Databricks:
Performing SQL queries with JDBC/ODBC using the Databricks Spark SQL Driver
on AWS or
Azure
(Recommended Option)
Adding calls in your R code to create and run Databricks jobs
with bricksteR and the Databricks Jobs API
(Alternative Option)
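As a hedged sketch of the first (recommended) option, content running on RStudio Connect can query Databricks over ODBC with the DBI and odbc packages. The driver name, environment variables, and connection fields below are placeholders that depend on how the Databricks Spark SQL Driver is installed and configured.
library(DBI)
# All connection details are placeholders; adjust them to match your
# Databricks Spark SQL (Simba) ODBC driver installation and workspace.
con <- dbConnect(
  odbc::odbc(),
  driver   = "Simba Spark ODBC Driver",
  host     = Sys.getenv("DATABRICKS_HOST"),
  httpPath = Sys.getenv("DATABRICKS_HTTP_PATH"),
  port     = 443,
  authMech = 3,
  uid      = "token",
  pwd      = Sys.getenv("DATABRICKS_TOKEN"),
  ssl      = 1
)
# Query a table and bring the (small) result into R for the app or report.
carrier_counts <- dbGetQuery(
  con,
  "SELECT uniquecarrier, COUNT(*) AS flights FROM flights GROUP BY uniquecarrier"
)
dbDisconnect(con)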
Using RStudio Package Manager with Databricks
Whether RStudio Server Pro is installed outside of the Databricks cluster
(Recommended Option) or within the Databricks cluster (Alternative Option), you
can install packages from repositories in RStudio Package Manager as long as
HTTP/HTTPS network traffic is allowed from RStudio Server Pro to RStudio Package
Manager.
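For example, a hypothetical sketch of pointing an R session at an RStudio Package Manager repository (the URL is a placeholder for your organization's instance):
# The repository URL is a placeholder for your RStudio Package Manager instance.
options(repos = c(RSPM = "https://packagemanager.example.com/cran/latest"))
# Packages now install from RStudio Package Manager over HTTP/HTTPS.
install.packages("sparklyr")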
Development
Function Reference - version 1.04
Read Spark Configuration
Arguments
Value
Details
Read Spark Configuration
spark_config(file = "config.yml", use_default = TRUE)
Arguments
file |
Name of the configuration file |
use_default |
TRUE to use the built-in defaults provided in this package |
Value
Named list with configuration data
Details
Read Spark configuration using the config package.
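As a brief sketch, the hypothetical config.yml below would be picked up by spark_config(); keys under the default section become entries of the returned list.
library(sparklyr)
# Hypothetical config.yml in the working directory:
# default:
#   spark.executor.memory: 4G
#   sparklyr.shell.driver-memory: 8G
conf <- spark_config(file = "config.yml", use_default = TRUE)
conf$spark.executor.memory   # expected to return "4G"
sc <- spark_connect(master = "local", config = conf)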
Manage Spark Connections
Arguments
Details
Examples
These routines allow you to manage your connections to Spark.
spark_connect(master, spark_home = Sys.getenv("SPARK_HOME"),
method = c("shell", "livy", "databricks", "test", "qubole"),
app_name = "sparklyr", version = NULL, config = spark_config(),
extensions = sparklyr::registered_extensions(), packages = NULL, ...)
spark_connection_is_open(sc)
spark_disconnect(sc, ...)
spark_disconnect_all()
spark_submit(master, file, spark_home = Sys.getenv("SPARK_HOME"),
app_name = "sparklyr", version = NULL, config = spark_config(),
extensions = sparklyr::registered_extensions(), ...)
Arguments
master |
Spark cluster url to connect to.
Use "local" to
connect to a local instance of Spark installed via spark_install . |
spark_home |
The path to a Spark installation.
Defaults to the path
provided by the SPARK_HOME environment variable.
If SPARK_HOME is defined, it will always be used unless the version parameter is specified to force the use of a locally
installed version. |
method |
The method used to connect to Spark.
Default connection method
is "shell" to connect using spark-submit, use "livy" to
perform remote connections using HTTP, or "databricks" when using a
Databricks cluster. |
app_name |
The application name to be used while running in the Spark
cluster. |
version |
The version of Spark to use.
Required for "local" Spark
connections, optional otherwise. |
config |
Custom configuration for the generated Spark connection.
See spark_config for details. |
extensions |
Extension R packages to enable for this connection.
By
default, all packages enabled through the use of sparklyr::register_extension will be passed here. |
packages |
A list of Spark packages to load.
For example, "delta" or "kafka" to enable Delta Lake or Kafka.
Also supports full versions like "io.delta:delta-core_2.11:0.4.0" .
This is similar to adding packages into the sparklyr.shell.packages configuration option.
Notice that the version
parameter is used to choose the correct package; otherwise the latest version
is assumed. |
... |
Optional arguments; currently unused. |
sc |
A spark_connection . |
file |
Path to R source file to submit for batch execution. |
Details
When using method = "livy"
, it is recommended to specify version
parameter to improve performance by using precompiled code rather than uploading
sources.
By default, jars are downloaded from GitHub but the path to the correct sparklyr
JAR can also be specified through the livy.jars
setting.
Examples
sc <- spark_connect(master = "spark://HOST:PORT")
connection_is_open(sc)
#> [1] TRUE
spark_disconnect(sc)
Find a given Spark installation by version.
Arguments
Value
Install versions of Spark for use with local Spark connections
(i.e. spark_connect(master = "local"))
spark_install_find(version = NULL, hadoop_version = NULL,
installed_only = TRUE, latest = FALSE, hint = FALSE)
spark_install(version = NULL, hadoop_version = NULL, reset = TRUE,
logging = "INFO", verbose = interactive())
spark_uninstall(version, hadoop_version)
spark_install_dir()
spark_install_tar(tarfile)
spark_installed_versions()
spark_available_versions(show_hadoop = FALSE, show_minor = FALSE)
Arguments
version |
Version of Spark to install.
See spark_available_versions for a list of supported versions |
hadoop_version |
Version of Hadoop to install.
See spark_available_versions for a list of supported versions |
installed_only |
Search only the locally installed versions? |
latest |
Check for latest version? |
hint |
On failure should the installation code be provided? |
reset |
Attempts to reset settings to defaults. |
logging |
Logging level to configure install.
Supported options: "WARN", "INFO" |
verbose |
Report information as Spark is downloaded / installed |
tarfile |
Path to TAR file conforming to the pattern spark-###-bin-(hadoop)?### where ###
references the Spark and Hadoop versions, respectively. |
show_hadoop |
Show Hadoop distributions? |
show_minor |
Show minor Spark versions? |
Value
List with information about the installed version.
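For example, a minimal sketch installing the Spark version used throughout this document:
library(sparklyr)
# List the Spark / Hadoop combinations available for download
spark_available_versions(show_hadoop = TRUE)
# Install Spark 2.1.0 built against Hadoop 2.7, then confirm it is present
spark_install(version = "2.1.0", hadoop_version = "2.7")
spark_installed_versions()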
View Entries in the Spark Log
Arguments
View the most recent entries in the Spark log.
This can be useful when
inspecting output / errors produced by Spark during the invocation of
various commands.
spark_log(sc, n = 100, filter = NULL, ...)
Arguments
sc |
A spark_connection . |
n |
The max number of log entries to retrieve.
Use NULL to
retrieve all entries within the log. |
filter |
Character string to filter log entries. |
... |
Optional arguments; currently unused. |
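For example, assuming an open connection sc (a minimal sketch):
# Show the 50 most recent log entries, keeping only lines that mention errors
spark_log(sc, n = 50, filter = "ERROR")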
Open the Spark web interface
Arguments
Open the Spark web interface
spark_web(sc, ...)
Arguments
sc |
A spark_connection . |
... |
Optional arguments; currently unused. |
Check whether the connection is open
Arguments
Check whether the connection is open
connection_is_open(sc)
Arguments
A Shiny app that can be used to construct a spark_connect statement
A Shiny app that can be used to construct a spark_connect
statement
connection_spark_shinyapp()
Runtime configuration interface for the Spark Session
Arguments
Retrieves or sets runtime configuration entries for the Spark Session
spark_session_config(sc, config = TRUE, value = NULL)
Arguments
sc |
A spark_connection . |
config |
The configuration entry name(s) (e.g., "spark.sql.shuffle.partitions" ).
Defaults to NULL to retrieve all configuration entries. |
value |
The configuration value to be set.
Defaults to NULL to retrieve
configuration entries. |
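A short sketch, assuming an open connection sc:
# Retrieve all runtime configuration entries for the current session
spark_session_config(sc)
# Set the number of shuffle partitions for this session only
spark_session_config(sc, config = "spark.sql.shuffle.partitions", value = 8)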
Set/Get Spark checkpoint directory
Arguments
Set/Get Spark checkpoint directory
spark_set_checkpoint_dir(sc, dir)
spark_get_checkpoint_dir(sc)
Arguments
sc |
A spark_connection . |
dir |
checkpoint directory, must be HDFS path of running on cluster |
Generate a Table Name from Expression
Arguments
Attempts to generate a table name from an expression; otherwise,
assigns an auto-generated generic name with "sparklyr_" prefix.
spark_table_name(expr)
Arguments
expr |
The expression to attempt to use as name |
Get the Spark Version Associated with a Spark Installation
Arguments
Retrieve the version of Spark associated with a Spark installation.
spark_version_from_home(spark_home, default = NULL)
Arguments
spark_home |
The path to a Spark installation. |
default |
The default version to be inferred, in case
version lookup failed, e.g.
no Spark installation was found
at spark_home . |
Retrieves a data frame of available Spark versions that can be installed.
Arguments
Retrieves a data frame of available Spark versions that can be installed.
spark_versions(latest = TRUE)
Arguments
latest |
Check for latest version? |
Kubernetes Configuration
Arguments
Convenience function to initialize a Kubernetes configuration instead
of spark_config()
, exposes common properties to set in Kubernetes
clusters.
spark_config_kubernetes(master, version = "2.3.2",
image = "spark:sparklyr", driver = random_string("sparklyr-"),
account = "spark", jars = "local:///opt/sparklyr", forward = TRUE,
executors = NULL, conf = NULL, timeout = 120, ports = c(8880,
8881, 4040), fix_config = identical(.Platform$OS.type, "windows"), ...)
Arguments
master |
Kubernetes url to connect to, found by running kubectl cluster-info . |
version |
The version of Spark being used. |
image |
Container image to use to launch Spark and sparklyr.
Also known
as spark.kubernetes.container.image . |
driver |
Name of the driver pod.
If not set, the driver pod name is set
to "sparklyr" suffixed by id to avoid name conflicts.
Also known as spark.kubernetes.driver.pod.name . |
account |
Service account that is used when running the driver pod.
The driver
pod uses this service account when requesting executor pods from the API
server.
Also known as spark.kubernetes.authenticate.driver.serviceAccountName . |
jars |
Path to the sparklyr jars; either, a local path inside the container
image with the sparklyr jars copied when the image was created or, a path
accessible by the container where the sparklyr jars were copied.
You can find
a path to the sparklyr jars by running system.file("java/", package = "sparklyr") . |
forward |
Should ports used in sparklyr be forwarded automatically through Kubernetes?
Defaults to TRUE, which runs kubectl port-forward and pkill kubectl
on disconnection. |
executors |
Number of executors to request while connecting. |
conf |
A named list of additional entries to add to sparklyr.shell.conf . |
timeout |
Total seconds to wait before giving up on connection. |
ports |
Ports to forward using kubectl. |
fix_config |
Should the spark-defaults.conf get fixed? TRUE for Windows. |
... |
Additional parameters, currently not in use. |
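A hedged sketch of a Kubernetes connection; the master URL (taken from kubectl cluster-info) and the container image are placeholders:
library(sparklyr)
# Master URL and image are placeholders; the image must bundle Spark and
# the sparklyr jars at the path given by the jars argument.
sc <- spark_connect(config = spark_config_kubernetes(
  master    = "k8s://https://kubernetes.example.com:443",
  version   = "2.3.2",
  image     = "spark:sparklyr",
  account   = "spark",
  executors = 2
))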
Retrieve Available Settings
Retrieves available sparklyr settings that can be used in configuration files or spark_config()
.
spark_config_settings()
Find Spark Connection
Arguments
Finds an active spark connection in the environment given the
connection parameters.
spark_connection_find(master = NULL, app_name = NULL, method = NULL)
Arguments
master |
The Spark master parameter. |
app_name |
The Spark application name. |
method |
The method used to connect to Spark. |
Fallback to Spark Dependency
Arguments
Value
Helper function to assist falling back to previous Spark versions.
spark_dependency_fallback(spark_version, supported_versions)
Arguments
spark_version |
The Spark version being requested in spark_dependencies . |
supported_versions |
The Spark versions that are supported by this extension. |
Value
A Spark version to use.
Create Spark Extension
Arguments
Creates an R package ready to be used as a Spark extension.
spark_extension(path)
Arguments
path |
Location where the extension will be created. |
Reads from a Spark Table into a Spark DataFrame.
Arguments
See also
Reads from a Spark Table into a Spark DataFrame.
spark_load_table(sc, name, path, options = list(), repartition = 0,
memory = TRUE, overwrite = TRUE)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
See also
Other Spark serialization routines: spark_read_csv
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_json
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read libsvm file into a Spark DataFrame.
Arguments
See also
Read libsvm file into a Spark DataFrame.
spark_read_libsvm(sc, name = NULL, path = name, repartition = 0,
memory = TRUE, overwrite = TRUE, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_json
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read a CSV file into a Spark DataFrame
Arguments
Details
See also
Read a tabular data file into a Spark DataFrame.
spark_read_csv(sc, name = NULL, path = name, header = TRUE,
columns = NULL, infer_schema = is.null(columns), delimiter = ",",
quote = "\"", escape = "\\", charset = "UTF-8",
null_value = NULL, options = list(), repartition = 0,
memory = TRUE, overwrite = TRUE, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
header |
Boolean; should the first row of data be used as a header?
Defaults to TRUE . |
columns |
A vector of column names or a named vector of column types. |
infer_schema |
Boolean; should column types be automatically inferred?
Requires one extra pass over the data.
Defaults to is.null(columns) . |
delimiter |
The character used to delimit each column.
Defaults to ','. |
quote |
The character used as a quote.
Defaults to '"'. |
escape |
The character used to escape other characters.
Defaults to '\'. |
charset |
The character set.
Defaults to "UTF-8". |
null_value |
The character to use for null, or missing, values.
Defaults to NULL . |
options |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://
), S3 (s3a://
),
as well as the local file system (file://
).
If you are reading from a secure S3 bucket be sure to set the following in your spark-defaults.conf spark.hadoop.fs.s3a.access.key
, spark.hadoop.fs.s3a.secret.key
or any of the methods outlined in the aws-sdk
documentation Working with AWS credentials
In order to work with the newer s3a://
protocol also set the values for spark.hadoop.fs.s3a.impl
and spark.hadoop.fs.s3a.endpoint
.
In addition, to support v4 of the S3 api be sure to pass the -Dcom.amazonaws.services.s3.enableV4
driver options
for the config key spark.driver.extraJavaOptions
For instructions on how to configure s3n://
check the hadoop documentation:
s3n authentication properties
When header
is FALSE
, the column names are generated with a V
prefix; e.g. V1, V2, ...
.
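As a minimal sketch, assuming a local Spark installation and the flights CSV files downloaded earlier in this document (the path is a placeholder):
library(sparklyr)
sc <- spark_connect(master = "local")
# Read one year of the flights data from the local file system
flights_2008 <- spark_read_csv(
  sc,
  name         = "flights_2008",
  path         = "file:///tmp/flights/2008.csv",
  header       = TRUE,
  infer_schema = TRUE,
  delimiter    = ","
)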
See also
Other Spark serialization routines: spark_load_table
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_json
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read from Delta Lake into a Spark DataFrame.
Arguments
See also
Read from Delta Lake into a Spark DataFrame.
spark_read_delta(sc, path, name = NULL, version = NULL,
timestamp = NULL, options = list(), repartition = 0,
memory = TRUE, overwrite = TRUE, ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated table. |
version |
The version of the delta table to read. |
timestamp |
The timestamp of the delta table to read.
For example, "2019-01-01" or "2019-01-01'T'00:00:00.000Z" . |
options |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_jdbc
,
spark_read_json
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read from JDBC connection into a Spark DataFrame.
Arguments
See also
Read from JDBC connection into a Spark DataFrame.
spark_read_jdbc(sc, name, options = list(), repartition = 0,
memory = TRUE, overwrite = TRUE, columns = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_delta
,
spark_read_json
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read a JSON file into a Spark DataFrame
Arguments
Details
See also
Read a table serialized in the JavaScript
Object Notation format into a Spark DataFrame.
spark_read_json(sc, name = NULL, path = name, options = list(),
repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL,
...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
options |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://
), S3 (s3a://
), as well as
the local file system (file://
).
If you are reading from a secure S3 bucket be sure to set the following in your spark-defaults.conf spark.hadoop.fs.s3a.access.key
, spark.hadoop.fs.s3a.secret.key
or any of the methods outlined in the aws-sdk
documentation Working with AWS credentials
In order to work with the newer s3a://
protocol also set the values for spark.hadoop.fs.s3a.impl
and spark.hadoop.fs.s3a.endpoint
.
In addition, to support v4 of the S3 api be sure to pass the -Dcom.amazonaws.services.s3.enableV4
driver options
for the config key spark.driver.extraJavaOptions
For instructions on how to configure s3n://
check the hadoop documentation:
s3n authentication properties
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read an ORC file into a Spark DataFrame
Arguments
Details
See also
Read an ORC file into a Spark
DataFrame.
spark_read_orc(sc, name = NULL, path = name, options = list(),
repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL,
schema = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
schema |
A (java) read schema.
Useful for optimizing read operation on nested data. |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://
), S3 (s3a://
), as well as
the local file system (file://
).
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_json
,
spark_read_libsvm
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read a Parquet file into a Spark DataFrame
Arguments
Details
See also
Read a Parquet file into a Spark
DataFrame.
spark_read_parquet(sc, name = NULL, path = name, options = list(),
repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL,
schema = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
schema |
A (java) read schema.
Useful for optimizing read operation on nested data. |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://).
If you are reading from a secure S3 bucket, be sure to set the following in your spark-defaults.conf: spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key, or use any of the methods outlined in the aws-sdk documentation Working with AWS credentials.
In order to work with the newer s3a:// protocol, also set the values for spark.hadoop.fs.s3a.impl and spark.hadoop.fs.s3a.endpoint.
In addition, to support v4 of the S3 API, be sure to pass the -Dcom.amazonaws.services.s3.enableV4 driver option for the config key spark.driver.extraJavaOptions.
For instructions on how to configure s3n://, check the Hadoop documentation: s3n authentication properties.
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Read from a generic source into a Spark DataFrame.
Arguments
See also
Read from a generic source into a Spark DataFrame.
spark_read_source(sc, name = NULL, path = name, source,
options = list(), repartition = 0, memory = TRUE,
overwrite = TRUE, columns = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
source |
A data source capable of reading data. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
... |
Optional arguments; currently unused. |
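A minimal sketch of reading through a named data source. The "csv" source name and its header option follow the standard Spark data source options, and the path is a placeholder:
library(sparklyr)
sc <- spark_connect(master = "local")
# Use the built-in "csv" data source with a source-specific option
flights_tbl <- spark_read_source(
  sc,
  name = "flights",
  path = "hdfs://path/to/flights.csv",   # placeholder path
  source = "csv",
  options = list(header = "true")
)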
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Reads from a Spark Table into a Spark DataFrame.
Arguments
See also
Reads from a Spark Table into a Spark DataFrame.
spark_read_table(sc, name, options = list(), repartition = 0,
memory = TRUE, columns = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
columns |
A vector of column names or a named vector of column types. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Read a Text file into a Spark DataFrame
Arguments
Details
See also
Read a text file into a Spark DataFrame.
spark_read_text(sc, name = NULL, path = name, repartition = 0,
memory = TRUE, overwrite = TRUE, options = list(), whole = FALSE,
...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
options |
A list of strings with additional options. |
whole |
Read the entire text file as a single entry? Defaults to FALSE . |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://).
If you are reading from a secure S3 bucket, be sure to set the following in your spark-defaults.conf: spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key, or use any of the methods outlined in the aws-sdk documentation Working with AWS credentials.
In order to work with the newer s3a:// protocol, also set the values for spark.hadoop.fs.s3a.impl and spark.hadoop.fs.s3a.endpoint.
In addition, to support v4 of the S3 API, be sure to pass the -Dcom.amazonaws.services.s3.enableV4 driver option for the config key spark.driver.extraJavaOptions.
For instructions on how to configure s3n://, check the Hadoop documentation: s3n authentication properties.
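A minimal sketch, assuming a local connection and a placeholder local file path; the second call shows the whole = TRUE variant that loads the entire file as a single entry:
library(sparklyr)
sc <- spark_connect(master = "local")
# One row per line of text (path is a placeholder)
readme_lines <- spark_read_text(sc, name = "readme", path = "file:///tmp/README.md")
# The entire file as a single entry
readme_whole <- spark_read_text(sc, name = "readme_whole",
                                path = "file:///tmp/README.md", whole = TRUE)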
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Write a Spark DataFrame to a CSV
Arguments
See also
Write a Spark DataFrame to a tabular (typically, comma-separated) file.
spark_write_csv(x, path, header = TRUE, delimiter = ",",
quote = "\"", escape = "\\", charset = "UTF-8",
null_value = NULL, options = list(), mode = NULL,
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
header |
Should the first row of data be used as a header? Defaults to TRUE . |
delimiter |
The character used to delimit each column, defaults to , . |
quote |
The character used as a quote.
Defaults to '"'. |
escape |
The character used to escape other characters, defaults to \ . |
charset |
The character set, defaults to "UTF-8" . |
null_value |
The character to use for default values, defaults to NULL . |
options |
A list of strings with additional options. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
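A minimal sketch, assuming iris has been copied into Spark; the output path is a placeholder:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Write semicolon-delimited CSV files, replacing any existing output at the path
spark_write_csv(iris_tbl, path = "file:///tmp/iris_csv",
                delimiter = ";", mode = "overwrite")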
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Writes a Spark DataFrame into Delta Lake
Arguments
See also
Writes a Spark DataFrame into Delta Lake.
spark_write_delta(x, path, mode = NULL, options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Writes a Spark DataFrame into a JDBC table
Arguments
See also
Writes a Spark DataFrame into a JDBC table.
spark_write_jdbc(x, name, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated table. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
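A minimal sketch; the option names follow the Spark JDBC data source (url, driver, user, password), and the connection string, driver class, and credentials below are placeholders. The JDBC driver jar must also be available on the Spark classpath.
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
spark_write_jdbc(
  iris_tbl,
  name = "iris_table",                           # destination table name
  mode = "overwrite",
  options = list(
    url = "jdbc:postgresql://dbhost:5432/mydb",  # placeholder connection string
    driver = "org.postgresql.Driver",            # driver jar must be on the classpath
    user = "me",
    password = "secret"
  )
)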
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Write a Spark DataFrame to a JSON file
Arguments
See also
Serialize a Spark DataFrame to the JavaScript
Object Notation format.
spark_write_json(x, path, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Write a Spark DataFrame to an ORC file
Arguments
See also
Serialize a Spark DataFrame to the
ORC format.
spark_write_orc(x, path, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Write a Spark DataFrame to a Parquet file
Arguments
See also
Serialize a Spark DataFrame to the
Parquet format.
spark_write_parquet(x, path, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
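A minimal sketch showing partition_by, assuming iris has been copied into Spark; the output path is a placeholder:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# One sub-directory per Species value under the output path
spark_write_parquet(iris_tbl, path = "file:///tmp/iris_parquet",
                    mode = "overwrite", partition_by = "Species")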
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_source, spark_write_table, spark_write_text
Writes a Spark DataFrame into a generic source
Arguments
See also
Writes a Spark DataFrame into a generic source.
spark_write_source(x, source, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
source |
A data source capable of reading data. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_table, spark_write_text
Writes a Spark DataFrame into a Spark table
Arguments
See also
Writes a Spark DataFrame into a Spark table.
spark_write_table(x, name, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated table. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
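A minimal sketch, assuming iris has been copied into Spark; the destination table name is a placeholder, and on a cluster with a Hive metastore the table is persisted there:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Persist the DataFrame as a table, appending if it already exists
spark_write_table(iris_tbl, name = "iris_persisted", mode = "append")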
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_text
Write a Spark DataFrame to a Text file
Arguments
See also
Serialize a Spark DataFrame to the plain text format.
spark_write_text(x, path, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table
Save / Load a Spark DataFrame
Arguments
Routines for saving and loading Spark DataFrames.
sdf_save_table(x, name, overwrite = FALSE, append = FALSE)
sdf_load_table(sc, name)
sdf_save_parquet(x, path, overwrite = FALSE, append = FALSE)
sdf_load_parquet(sc, path)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
name |
The table name to assign to the saved Spark DataFrame. |
overwrite |
Boolean; overwrite a pre-existing table of the same name? |
append |
Boolean; append to a pre-existing table of the same name? |
sc |
A spark_connection object. |
path |
The path where the Spark DataFrame should be saved. |
Spark ML -- Transform, fit, and predict methods (sdf_ interface)
Arguments
Value
Deprecated methods for transformation, fit, and prediction.
These are mirrors of the corresponding ml-transform-methods.
sdf_predict(x, model, ...)
sdf_transform(x, transformer, ...)
sdf_fit(x, estimator, ...)
sdf_fit_and_transform(x, estimator, ...)
Arguments
x |
A tbl_spark . |
model |
A ml_transformer or a ml_model object. |
... |
Optional arguments passed to the corresponding ml_ methods. |
transformer |
A ml_transformer object. |
estimator |
A ml_estimator object. |
Value
sdf_predict(), sdf_transform(), and sdf_fit_and_transform() return a transformed dataframe, whereas sdf_fit() returns a ml_transformer.
Create DataFrame for along Object
Arguments
Creates a DataFrame along the given object.
sdf_along(sc, along, repartition = NULL, type = c("integer",
"integer64"))
Arguments
sc |
The associated Spark connection. |
along |
Takes the length from the length of this argument. |
repartition |
The number of partitions to use when distributing the
data across the Spark cluster. |
type |
The data type to use for the index, either "integer" or "integer64" . |
Bind multiple Spark DataFrames by row and column
Arguments
Value
Details
sdf_bind_rows() and sdf_bind_cols() are implementations of the common pattern of do.call(rbind, sdfs) or do.call(cbind, sdfs) for binding many Spark DataFrames into one.
sdf_bind_rows(..., id = NULL)
sdf_bind_cols(...)
Arguments
... |
Spark tbls to combine.
Each argument can either be a Spark DataFrame or a list of
Spark DataFrames
When row-binding, columns are matched by name, and any missing columns will be filled with NA.
When column-binding, rows are matched by position, so all data
frames must have the same number of rows. |
id |
Data frame identifier.
When id is supplied, a new column of identifiers is
created to link each row to its original Spark DataFrame.
The labels
are taken from the named arguments to sdf_bind_rows() .
When a
list of Spark DataFrames is supplied, the labels are taken from the
names of the list.
If no names are found a numeric sequence is
used instead. |
Value
sdf_bind_rows()
and sdf_bind_cols()
return tbl_spark
Details
The output of sdf_bind_rows()
will contain a column if that column
appears in any of the inputs.
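A minimal sketch of the id behaviour described above, using two small local data frames copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
df1 <- sdf_copy_to(sc, data.frame(x = 1:3), name = "df1", overwrite = TRUE)
df2 <- sdf_copy_to(sc, data.frame(x = 4:6, y = 7:9), name = "df2", overwrite = TRUE)
# Row-bind; the column y missing from df1 is filled with NA, and the "source"
# column records which input each row came from ("a" or "b")
sdf_bind_rows(a = df1, b = df2, id = "source")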
Broadcast hint
Arguments
Used to force broadcast hash joins.
sdf_broadcast(x)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
Checkpoint a Spark DataFrame
Arguments
Checkpoint a Spark DataFrame
sdf_checkpoint(x, eager = TRUE)
Arguments
x |
an object coercible to a Spark DataFrame |
eager |
whether to truncate the lineage of the DataFrame |
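A minimal sketch; the assumption here is that a checkpoint directory is set on the connection first via spark_set_checkpoint_dir(), and the directory path is a placeholder:
library(sparklyr)
sc <- spark_connect(master = "local")
spark_set_checkpoint_dir(sc, "file:///tmp/spark-checkpoints")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Eagerly checkpoint, truncating the DataFrame's lineage
iris_ckpt <- sdf_checkpoint(iris_tbl, eager = TRUE)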
Coalesces a Spark DataFrame
Arguments
Coalesces a Spark DataFrame
sdf_coalesce(x, partitions)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
partitions |
number of partitions |
Collect a Spark DataFrame into R.
Arguments
Collects a Spark dataframe into R.
sdf_collect(object, ...)
Arguments
object |
Spark dataframe to collect |
... |
Additional options. |
Copy an Object into Spark
Arguments
Advanced Usage
See also
Examples
Copy an object into Spark, and return an R object wrapping the
copied object (typically, a Spark DataFrame).
sdf_copy_to(sc, x, name, memory, repartition, overwrite, ...)
sdf_import(x, sc, name, memory, repartition, overwrite, ...)
Arguments
sc |
The associated Spark connection. |
x |
An R object from which a Spark DataFrame can be generated. |
name |
The name to assign to the copied table in Spark. |
memory |
Boolean; should the table be cached into memory? |
repartition |
The number of partitions to use when distributing the
table across the Spark cluster.
The default (0) can be used to avoid
partitioning. |
overwrite |
Boolean; overwrite a pre-existing table with the name name
if one already exists? |
... |
Optional arguments, passed to implementing methods. |
Advanced Usage
sdf_copy_to
is an S3 generic that, by default, dispatches to sdf_import
.
Package authors that would like to implement sdf_copy_to
for a custom object type can accomplish this by
implementing the associated method on sdf_import
.
See also
Other Spark data frames: sdf_random_split, sdf_register, sdf_sample, sdf_sort
Examples
sc <- spark_connect(master = "spark://HOST:PORT")
sdf_copy_to(sc, iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> ... (output truncated; 150 rows total)
Cross Tabulation
Arguments
Value
Builds a contingency table at each combination of factor levels.
sdf_crosstab(x, col1, col2)
Arguments
x |
A Spark DataFrame |
col1 |
The name of the first column.
Distinct items will make the first item of each row. |
col2 |
The name of the second column.
Distinct items will make the column names of the DataFrame. |
Value
A DataFrame containing the contingency table.
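A minimal sketch, assuming mtcars has been copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
# Counts of each cyl/gear combination; cyl values become rows, gear values become columns
sdf_crosstab(mtcars_tbl, "cyl", "gear")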
Debug Info for Spark DataFrame
Arguments
Prints the plan of execution used to generate x.
This plan will, among other things, show the number of partitions in parentheses at the far left and indicate stages using indentation.
sdf_debug_string(x, print = TRUE)
Arguments
x |
An R object wrapping, or containing, a Spark DataFrame. |
print |
Print debug information? |
Compute summary statistics for columns of a data frame
Arguments
Compute summary statistics for columns of a data frame
sdf_describe(x, cols = colnames(x))
Arguments
x |
An object coercible to a Spark DataFrame |
cols |
Columns to compute statistics for, given as a character vector |
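A minimal sketch, assuming iris has been copied into Spark (sdf_copy_to replaces dots in column names with underscores, hence Sepal_Length):
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Summary statistics (count, mean, stddev, min, max) for the selected columns
sdf_describe(iris_tbl, cols = c("Sepal_Length", "Petal_Length"))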
Support for Dimension Operations
Arguments
sdf_dim(), sdf_nrow() and sdf_ncol() provide similar functionality to dim(), nrow() and ncol().
sdf_dim(x)
sdf_nrow(x)
sdf_ncol(x)
Arguments
x |
An object (usually a spark_tbl ). |
Spark DataFrame is Streaming
Arguments
Is the given Spark DataFrame streaming data?
sdf_is_streaming(x)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
Returns the last index of a Spark DataFrame
Arguments
Returns the last index of a Spark DataFrame.
The Spark mapPartitionsWithIndex
function is used to iterate
through the last nonempty partition of the RDD to find the last record.
sdf_last_index(x, id = "id")
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
id |
The name of the index column. |
Create DataFrame for Length
Arguments
Creates a DataFrame for the given length.
sdf_len(sc, length, repartition = NULL, type = c("integer",
"integer64"))
Arguments
sc |
The associated Spark connection. |
length |
The desired length of the sequence. |
repartition |
The number of partitions to use when distributing the
data across the Spark cluster. |
type |
The data type to use for the index, either "integer" or "integer64" . |
Gets number of partitions of a Spark DataFrame
Arguments
Gets number of partitions of a Spark DataFrame
sdf_num_partitions(x)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
Persist a Spark DataFrame
Arguments
Details
Persist a Spark DataFrame, forcing any pending computations and (optionally)
serializing the results to disk.
sdf_persist(x, storage.level = "MEMORY_AND_DISK")
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
storage.level |
The storage level to be used.
Please view the
Spark Documentation
for information on what storage levels are accepted. |
Details
Spark DataFrames invoke their operations lazily -- pending operations are
deferred until their results are actually needed.
Persisting a Spark
DataFrame effectively 'forces' any pending computations, and then persists
the generated Spark DataFrame as requested (to memory, to disk, or
otherwise).
Users of Spark should be careful to persist the results of any computations
which are non-deterministic -- otherwise, one might see that the values
within a column seem to 'change' as new operations are performed on that
data set.
Pivot a Spark DataFrame
Arguments
Examples
Construct a pivot table over a Spark Dataframe, using a syntax similar to
that from reshape2::dcast
.
sdf_pivot(x, formula, fun.aggregate = "count")
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
A two-sided R formula of the form x_1 + x_2 + ...
~ y_1 .
The left-hand side of the formula indicates which variables are used for grouping,
and the right-hand side indicates which variable is used for pivoting.
Currently,
only a single pivot column is supported. |
fun.aggregate |
How should the grouped dataset be aggregated? Can be
a length-one character vector, giving the name of a Spark aggregation function
to be called; a named R list mapping column names to an aggregation method,
or an R function that is invoked on the grouped dataset. |
Examples
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# aggregating by mean
iris_tbl %>%
mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low" )) %>%
sdf_pivot(Petal_Width ~ Species,
fun.aggregate = list(Petal_Length = "mean"))
# aggregating all observations in a list
iris_tbl %>%
mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low" )) %>%
sdf_pivot(Petal_Width ~ Species,
fun.aggregate = list(Petal_Length = "collect_list"))
}
Project features onto principal components
Arguments
Transforming Spark DataFrames
Project features onto principal components
sdf_project(object, newdata, features = dimnames(object$pc)[[1]],
feature_prefix = NULL, ...)
Arguments
object |
A Spark PCA model object |
newdata |
An object coercible to a Spark DataFrame |
features |
A vector of names of columns to be projected |
feature_prefix |
The prefix used in naming the output features |
... |
Optional arguments; currently unused. |
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
Compute (Approximate) Quantiles with a Spark DataFrame
Arguments
Given a numeric column within a Spark DataFrame, compute
approximate quantiles (to some relative error).
sdf_quantile(x, column, probabilities = c(0, 0.25, 0.5, 0.75, 1),
relative.error = 1e-05)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
column |
The column for which quantiles should be computed. |
probabilities |
A numeric vector of probabilities, for
which quantiles should be computed. |
relative.error |
The relative error -- lower values imply more
precision in the computed quantiles. |
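A minimal sketch, assuming mtcars has been copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
# Approximate quartiles of the mpg column
sdf_quantile(mtcars_tbl, column = "mpg", probabilities = c(0.25, 0.5, 0.75))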
Partition a Spark Dataframe
Arguments
Value
Details
Transforming Spark DataFrames
See also
Examples
Partition a Spark DataFrame into multiple groups.
This routine is useful
for splitting a DataFrame into, for example, training and test datasets.
sdf_random_split(x, ..., weights = NULL,
seed = sample(.Machine$integer.max, 1))
sdf_partition(x, ..., weights = NULL,
seed = sample(.Machine$integer.max, 1))
Arguments
x |
An object coercible to a Spark DataFrame. |
... |
Named parameters, mapping table names to weights.
The weights
will be normalized such that they sum to 1. |
weights |
An alternate mechanism for supplying weights -- when
specified, this takes precedence over the ... arguments. |
seed |
Random seed to use for randomly partitioning the dataset.
Set
this if you want your partitioning to be reproducible on repeated runs. |
Value
An R list of tbl_sparks.
Details
The sampling weights define the probability that a particular observation
will be assigned to a particular partition, not the resulting size of the
partition.
This implies that partitioning a DataFrame with, for example,
sdf_random_split(x, training = 0.5, test = 0.5)
is not guaranteed to produce training
and test
partitions
of equal size.
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
See also
Other Spark data frames: sdf_copy_to, sdf_register, sdf_sample, sdf_sort
Examples
if (FALSE) {
# randomly partition data into a 'training' and 'test'
# dataset, with 60% of the observations assigned to the
# 'training' dataset, and 40% assigned to the 'test' dataset
data(diamonds, package = "ggplot2")
diamonds_tbl <- copy_to(sc, diamonds, "diamonds")
partitions <- diamonds_tbl %>%
sdf_random_split(training = 0.6, test = 0.4)
print(partitions)
# alternate way of specifying weights
weights <- c(training = 0.6, test = 0.4)
diamonds_tbl %>% sdf_random_split(weights = weights)
}
Read a Column from a Spark DataFrame
Arguments
Details
Read a single column from a Spark DataFrame, and return
the contents of that column back to R.
sdf_read_column(x, column)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
column |
The name of a column within x . |
Details
This operation is expected to preserve row order.
Register a Spark DataFrame
Arguments
Transforming Spark DataFrames
See also
Registers a Spark DataFrame (giving it a table name for the
Spark SQL context), and returns a tbl_spark
.
sdf_register(x, name = NULL)
Arguments
x |
A Spark DataFrame. |
name |
A name to assign this table. |
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
See also
Other Spark data frames: sdf_copy_to, sdf_random_split, sdf_sample, sdf_sort
Repartition a Spark DataFrame
Arguments
Repartition a Spark DataFrame
sdf_repartition(x, partitions = NULL, partition_by = NULL)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
partitions |
number of partitions |
partition_by |
vector of column names used for partitioning, only supported for Spark 2.0+ |
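A minimal sketch, assuming iris has been copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Repartition into 4 partitions, partitioning rows by Species (Spark 2.0+)
iris_repart <- sdf_repartition(iris_tbl, partitions = 4, partition_by = "Species")
sdf_num_partitions(iris_repart)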
Model Residuals
Arguments
This generic method returns a Spark DataFrame with model
residuals added as a column to the model training data.
# S3 method for ml_model_generalized_linear_regression
sdf_residuals(object,
type = c("deviance", "pearson", "working", "response"), ...)
# S3 method for ml_model_linear_regression
sdf_residuals(object, ...)
sdf_residuals(object, ...)
Arguments
object |
Spark ML model object. |
type |
type of residuals which should be returned. |
... |
additional arguments |
Randomly Sample Rows from a Spark DataFrame
Arguments
Transforming Spark DataFrames
See also
Draw a random sample of rows (with or without replacement)
from a Spark DataFrame.
sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
Arguments
x |
An object coercible to a Spark DataFrame. |
fraction |
The fraction to sample. |
replacement |
Boolean; sample with replacement? |
seed |
An (optional) integer seed. |
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
See also
Other Spark data frames: sdf_copy_to, sdf_random_split, sdf_register, sdf_sort
Read the Schema of a Spark DataFrame
Arguments
Value
Details
Read the schema of a Spark DataFrame.
sdf_schema(x)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
Value
An R list, with each list element describing the name and type of a column.
Details
The type
column returned gives the string representation of the
underlying Spark type for that column; for example, a vector of numeric
values would be returned with the type "DoubleType"
.
Please see the
Spark Scala API Documentation
for information on what types are available and exposed by Spark.
Separate a Vector Column into Scalar Columns
Arguments
Given a vector column in a Spark DataFrame, split that
into n
separate columns, each column made up of
the different elements in the column column
.
sdf_separate_column(x, column, into = NULL)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
column |
The name of a (vector-typed) column. |
into |
A specification of the columns that should be
generated from column .
This can either be a
vector of column names, or an R list mapping column
names to the (1-based) index at which a particular
vector element should be extracted. |
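A minimal sketch; the vector column here is produced with ft_vector_assembler() purely for illustration, and the output column names are arbitrary:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Assemble two numeric columns into a single vector column, then split it back out
assembled_tbl <- ft_vector_assembler(iris_tbl,
                                     input_cols = c("Sepal_Length", "Petal_Length"),
                                     output_col = "features")
sdf_separate_column(assembled_tbl, column = "features",
                    into = c("sepal_length_out", "petal_length_out"))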
Create DataFrame for Range
Arguments
Creates a DataFrame for the given range
sdf_seq(sc, from = 1L, to = 1L, by = 1L, repartition = type,
type = c("integer", "integer64"))
Arguments
sc |
The associated Spark connection. |
from, to |
The start and end to use as a range |
by |
The increment of the sequence. |
repartition |
The number of partitions to use when distributing the
data across the Spark cluster. |
type |
The data type to use for the index, either "integer" or "integer64" . |
Sort a Spark DataFrame
Arguments
Transforming Spark DataFrames
See also
Sort a Spark DataFrame by one or more columns, with each column
sorted in ascending order.
sdf_sort(x, columns)
Arguments
x |
An object coercible to a Spark DataFrame. |
columns |
The column(s) to sort by. |
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
See also
Other Spark data frames: sdf_copy_to, sdf_random_split, sdf_register, sdf_sample
Spark DataFrame from SQL
Arguments
Defines a Spark DataFrame from a SQL query, useful to create Spark DataFrames
without collecting the results immediately.
sdf_sql(sc, sql)
Arguments
sc |
A spark_connection . |
sql |
a 'SQL' query used to generate a Spark DataFrame. |
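A minimal sketch, assuming iris has been copied into Spark under the table name "iris_tbl" so it is visible to Spark SQL:
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Define a DataFrame from SQL without collecting the result to R
species_counts <- sdf_sql(sc, "SELECT Species, COUNT(*) AS n FROM iris_tbl GROUP BY Species")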
Add a Sequential ID Column to a Spark DataFrame
Arguments
Add a sequential ID column to a Spark DataFrame.
The Spark zipWithIndex
function is used to produce these.
This differs from sdf_with_unique_id
in that the IDs generated are independent of
partitioning.
sdf_with_sequential_id(x, id = "id", from = 1L)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
id |
The name of the column to host the generated IDs. |
from |
The starting value of the id column |
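A minimal sketch, assuming iris has been copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Add a sequential ID column named "row_id", starting at 1
iris_ids <- sdf_with_sequential_id(iris_tbl, id = "row_id", from = 1L)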
Add a Unique ID Column to a Spark DataFrame
Arguments
Add a unique ID column to a Spark DataFrame.
The Spark monotonicallyIncreasingId
function is used to produce these and is
guaranteed to produce unique, monotonically increasing ids; however, there
is no guarantee that these IDs will be sequential.
The table is persisted
immediately after the column is generated, to ensure that the column is
stable -- otherwise, it can differ across new computations.
sdf_with_unique_id(x, id = "id")
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
id |
The name of the column to host the generated IDs. |
Spark ML -- Decision Trees
Arguments
Value
Details
See also
Examples
Perform classification and regression using decision trees.
ml_decision_tree_classifier(x, formula = NULL, max_depth = 5,
max_bins = 32, min_instances_per_node = 1, min_info_gain = 0,
impurity = "gini", seed = NULL, thresholds = NULL,
cache_node_ids = FALSE, checkpoint_interval = 10,
max_memory_in_mb = 256, features_col = "features",
label_col = "label", prediction_col = "prediction",
probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("decision_tree_classifier_"), ...)
ml_decision_tree(x, formula = NULL, type = c("auto", "regression",
"classification"), features_col = "features", label_col = "label",
prediction_col = "prediction", variance_col = NULL,
probability_col = "probability",
raw_prediction_col = "rawPrediction", checkpoint_interval = 10L,
impurity = "auto", max_bins = 32L, max_depth = 5L,
min_info_gain = 0, min_instances_per_node = 1L, seed = NULL,
thresholds = NULL, cache_node_ids = FALSE, max_memory_in_mb = 256L,
uid = random_string("decision_tree_"), response = NULL,
features = NULL, ...)
ml_decision_tree_regressor(x, formula = NULL, max_depth = 5,
max_bins = 32, min_instances_per_node = 1, min_info_gain = 0,
impurity = "variance", seed = NULL, cache_node_ids = FALSE,
checkpoint_interval = 10, max_memory_in_mb = 256,
variance_col = NULL, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("decision_tree_regressor_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
max_depth |
Maximum depth of the tree (>= 0); that is, the maximum
number of nodes separating any leaves from the root of the tree. |
max_bins |
The maximum number of bins used for discretizing
continuous features and for choosing how to split on features at
each node.
More bins give higher granularity. |
min_instances_per_node |
Minimum number of instances each child must
have after split. |
min_info_gain |
Minimum information gain for a split to be considered
at a tree node.
Should be >= 0, defaults to 0. |
impurity |
Criterion used for information gain calculation.
Supported: "entropy"
and "gini" (default) for classification and "variance" (default) for regression.
For ml_decision_tree , setting "auto" will default to the appropriate
criterion based on model type. |
seed |
Seed for random numbers. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
cache_node_ids |
If FALSE , the algorithm will pass trees to executors to match instances with nodes.
If TRUE , the algorithm will cache node IDs for each instance.
Caching can speed up training of deeper trees.
Defaults to FALSE . |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g.
10 means that the cache will get checkpointed every 10 iterations, defaults to 10. |
max_memory_in_mb |
Maximum memory in MB allocated to histogram aggregation.
If too small, then 1 node will be split per iteration,
and its aggregates may exceed this size.
Defaults to 256. |
features_col |
Features column name, as a length-one character vector.
The column should be single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a.
confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
type |
The type of model to fit. "regression" treats the response
as a continuous variable, while "classification" treats the response
as a categorical variable.
When "auto" is used, the model type is
inferred based on the response variable type -- if it is a numeric type,
then regression is used; classification otherwise. |
variance_col |
(Optional) Column name for the biased sample variance of prediction. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor.
The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.
ml_decision_tree
is a wrapper around ml_decision_tree_regressor.tbl_spark
and ml_decision_tree_classifier.tbl_spark
and calls the appropriate method based on model type.
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
dt_model <- iris_training %>%
ml_decision_tree(Species ~ .)
pred <- ml_predict(dt_model, iris_test)
ml_multiclass_classification_evaluator(pred)
}
Spark ML -- Generalized Linear Regression
Arguments
Value
Details
See also
Examples
Perform regression using a Generalized Linear Model (GLM).
ml_generalized_linear_regression(x, formula = NULL,
family = "gaussian", link = NULL, fit_intercept = TRUE,
offset_col = NULL, link_power = NULL, link_prediction_col = NULL,
reg_param = 0, max_iter = 25, weight_col = NULL, solver = "irls",
tol = 1e-06, variance_power = 0, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("generalized_linear_regression_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
family |
Name of family which is a description of the error distribution to be used in the model.
Supported options: "gaussian", "binomial", "poisson", "gamma" and "tweedie".
Default is "gaussian". |
link |
Name of link function which provides the relationship between the linear predictor and the mean of the distribution function.
See Details for supported link functions. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
offset_col |
Offset column name.
If this is not set, we treat all instance offsets as 0.0.
The feature specified as offset has a constant coefficient of 1.0. |
link_power |
Index in the power link function.
Only applicable to the Tweedie family.
Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt link, respectively.
When not set, this value defaults to 1 - variancePower, which matches the R "statmod" package. |
link_prediction_col |
Link prediction (linear predictor) column name.
Default is not set, which means we do not output link prediction. |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
weight_col |
The name of the column to use as weights for the model fit. |
solver |
Solver algorithm for optimization. |
tol |
Param for the convergence tolerance for iterative algorithms. |
variance_power |
Power in the variance function of the Tweedie distribution which provides the relationship between the variance and mean of the distribution.
Only applicable to the Tweedie family.
(see Tweedie Distribution (Wikipedia)) Supported values: 0 and [1, Inf).
Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma family, respectively. |
features_col |
Features column name, as a length-one character vector.
The column should be single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor.
The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.
The valid link functions for each family are listed below.
The first link function of each family is the default one.
gaussian: "identity", "log", "inverse"
binomial: "logit", "probit", "cloglog"
poisson: "log", "identity", "sqrt"
gamma: "inverse", "identity", "log"
tweedie: power link function specified through link_power
.
The default link power in the tweedie family is 1 - variance_power
.
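As a short illustration of the Tweedie settings above, here is a minimal sketch (not part of the original reference; it assumes the mtcars_tbl table created in the Examples section below).
# Minimal sketch: Tweedie GLM on mtcars_tbl (see the Examples section below for setup).
# With variance_power = 1.2 the default link_power would be 1 - 1.2 = -0.2;
# link_power = 0 requests the log link explicitly.
tweedie_model <- mtcars_tbl %>%
  ml_generalized_linear_regression(
    mpg ~ wt + hp,
    family = "tweedie",
    variance_power = 1.2,
    link_power = 0
  )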
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
partitions <- mtcars_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
mtcars_training <- partitions$training
mtcars_test <- partitions$test
# Specify the grid
family <- c("gaussian", "gamma", "poisson")
link <- c("identity", "log")
family_link <- expand.grid(family = family, link = link, stringsAsFactors = FALSE)
family_link <- data.frame(family_link, rmse = 0)
# Train the models
for (i in 1:nrow(family_link)) {
glm_model <- mtcars_training %>%
ml_generalized_linear_regression(mpg ~ .,
family = family_link[i, 1],
link = family_link[i, 2]
)
pred <- ml_predict(glm_model, mtcars_test)
family_link[i, 3] <- ml_regression_evaluator(pred, label_col = "mpg")
}
family_link
}
Spark ML -- Gradient Boosted Trees
Arguments
Value
Details
See also
Examples
Perform binary classification and regression using gradient boosted trees.
Multiclass classification is not supported yet.
ml_gbt_classifier(x, formula = NULL, max_iter = 20, max_depth = 5,
step_size = 0.1, subsampling_rate = 1,
feature_subset_strategy = "auto", min_instances_per_node = 1L,
max_bins = 32, min_info_gain = 0, loss_type = "logistic",
seed = NULL, thresholds = NULL, checkpoint_interval = 10,
cache_node_ids = FALSE, max_memory_in_mb = 256,
features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("gbt_classifier_"), ...)
ml_gradient_boosted_trees(x, formula = NULL, type = c("auto",
"regression", "classification"), features_col = "features",
label_col = "label", prediction_col = "prediction",
probability_col = "probability",
raw_prediction_col = "rawPrediction", checkpoint_interval = 10,
loss_type = c("auto", "logistic", "squared", "absolute"),
max_bins = 32, max_depth = 5, max_iter = 20L, min_info_gain = 0,
min_instances_per_node = 1, step_size = 0.1, subsampling_rate = 1,
feature_subset_strategy = "auto", seed = NULL, thresholds = NULL,
cache_node_ids = FALSE, max_memory_in_mb = 256,
uid = random_string("gradient_boosted_trees_"), response = NULL,
features = NULL, ...)
ml_gbt_regressor(x, formula = NULL, max_iter = 20, max_depth = 5,
step_size = 0.1, subsampling_rate = 1,
feature_subset_strategy = "auto", min_instances_per_node = 1,
max_bins = 32, min_info_gain = 0, loss_type = "squared",
seed = NULL, checkpoint_interval = 10, cache_node_ids = FALSE,
max_memory_in_mb = 256, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("gbt_regressor_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
max_iter |
Maximum number of iterations. |
max_depth |
Maximum depth of the tree (>= 0); that is, the maximum
number of nodes separating any leaves from the root of the tree. |
step_size |
Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.
(default = 0.1) |
subsampling_rate |
Fraction of the training data used for learning each decision tree, in range (0, 1].
(default = 1.0) |
feature_subset_strategy |
The number of features to consider for splits at each tree node.
See details for options. |
min_instances_per_node |
Minimum number of instances each child must
have after split. |
max_bins |
The maximum number of bins used for discretizing
continuous features and for choosing how to split on features at
each node.
More bins give higher granularity. |
min_info_gain |
Minimum information gain for a split to be considered
at a tree node.
Should be >= 0, defaults to 0. |
loss_type |
Loss function which GBT tries to minimize.
Supported: "squared" (L2) and "absolute" (L1) (default = squared) for regression and "logistic" (default) for classification.
For ml_gradient_boosted_trees , setting "auto"
will default to the appropriate loss type based on model type. |
seed |
Seed for random numbers. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g. 10 means that the cache will get checkpointed every 10 iterations; defaults to 10. |
cache_node_ids |
If FALSE , the algorithm will pass trees to executors to match instances with nodes.
If TRUE , the algorithm will cache node IDs for each instance.
Caching can speed up training of deeper trees.
Defaults to FALSE . |
max_memory_in_mb |
Maximum memory in MB allocated to histogram aggregation.
If too small, then 1 node will be split per iteration,
and its aggregates may exceed this size.
Defaults to 256. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
type |
The type of model to fit. "regression" treats the response
as a continuous variable, while "classification" treats the response
as a categorical variable.
When "auto" is used, the model type is
inferred based on the response variable type -- if it is a numeric type,
then regression is used; classification otherwise. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
The supported options for feature_subset_strategy
are
"auto"
: Choose automatically for task: If num_trees == 1
, set to "all"
.
If num_trees > 1
(forest), set to "sqrt"
for classification and to "onethird"
for regression.
"all"
: use all features
"onethird"
: use 1/3 of the features
"sqrt"
: use sqrt(number of features)
"log2"
: use log2(number of features)
"n"
: when n
is in the range (0, 1.0], use n * number of features.
When n
is in the range (1, number of features), use n
features.
(default = "auto"
)
ml_gradient_boosted_trees
is a wrapper around ml_gbt_regressor.tbl_spark
and ml_gbt_classifier.tbl_spark
and calls the appropriate method based on model type.
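The example below covers the regression case; the following is a minimal, hedged sketch of the classification variant, assuming a mtcars_tbl table prepared as in earlier examples on this page.
# Minimal sketch (assumes mtcars_tbl as created in earlier examples): binary
# classification on the am column, restricting candidate features per split.
gbt_class <- mtcars_tbl %>%
  ml_gbt_classifier(
    am ~ gear + carb + hp,
    max_iter = 10,
    feature_subset_strategy = "sqrt"
  )
pred <- ml_predict(gbt_class, mtcars_tbl)
ml_binary_classification_evaluator(pred)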
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
gbt_model <- iris_training %>%
ml_gradient_boosted_trees(Sepal_Length ~ Petal_Length + Petal_Width)
pred <- ml_predict(gbt_model, iris_test)
ml_regression_evaluator(pred, label_col = "Sepal_Length")
}
Spark ML -- K-Means Clustering
Arguments
Value
See also
Examples
K-means clustering with support for k-means|| initialization proposed by Bahmani et al.
Using `ml_kmeans()` with the formula interface requires Spark 2.0+.
ml_kmeans(x, formula = NULL, k = 2, max_iter = 20, tol = 1e-04,
init_steps = 2, init_mode = "k-means||", seed = NULL,
features_col = "features", prediction_col = "prediction",
uid = random_string("kmeans_"), ...)
ml_compute_cost(model, dataset)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
tol |
Param for the convergence tolerance for iterative algorithms. |
init_steps |
Number of steps for the k-means|| initialization mode.
This is an advanced setting -- the default of 2 is almost always enough.
Must be > 0.
Default: 2. |
init_mode |
Initialization algorithm.
This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012).
Default: k-means||. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
model |
A fitted K-means model returned by ml_kmeans() |
dataset |
Dataset on which to calculate K-means cost |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the clustering estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, an estimator is constructed then
immediately fit with the input tbl_spark
, returning a clustering model.
tbl_spark
, with formula
or features
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the estimator.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
This signature does not apply to ml_lda()
.
ml_compute_cost()
returns the K-means cost (sum of
squared distances of points to their nearest center) for the model
on the given data.
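A minimal sketch of ml_compute_cost(), assuming the iris_tbl table created in the example below.
# Minimal sketch: fit k-means on two petal measurements, then report the
# within-set sum of squared distances on the same data.
kmeans_model <- iris_tbl %>%
  ml_kmeans(~ Petal_Length + Petal_Width, k = 3, seed = 123)
ml_compute_cost(kmeans_model, iris_tbl)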
See also
See http://spark.apache.org/docs/latest/ml-clustering.html for
more information on the set of clustering algorithms.
Other ml clustering algorithms: ml_bisecting_kmeans
,
ml_gaussian_mixture
, ml_lda
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
ml_kmeans(iris_tbl, Species ~ .)
}
Spark ML -- Latent Dirichlet Allocation
Arguments
Value
Details
Parameter details
See also
Examples
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
ml_lda(x, formula = NULL, k = 10, max_iter = 20,
doc_concentration = NULL, topic_concentration = NULL,
subsampling_rate = 0.05, optimizer = "online",
checkpoint_interval = 10, keep_last_checkpoint = TRUE,
learning_decay = 0.51, learning_offset = 1024,
optimize_doc_concentration = TRUE, seed = NULL,
features_col = "features",
topic_distribution_col = "topicDistribution",
uid = random_string("lda_"), ...)
ml_describe_topics(model, max_terms_per_topic = 10)
ml_log_likelihood(model, dataset)
ml_log_perplexity(model, dataset)
ml_topics_matrix(model)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
doc_concentration |
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
See details. |
topic_concentration |
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
subsampling_rate |
(For Online optimizer only) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].
Note that this should be adjusted in sync with max_iter so the entire corpus is used.
Specifically, set both so that maxIterations * miniBatchFraction is greater than or equal to 1. |
optimizer |
Optimizer or inference algorithm used to estimate the LDA model.
Supported: "online" for Online Variational Bayes (default) and "em" for Expectation-Maximization. |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g. 10 means that the cache will get checkpointed every 10 iterations; defaults to 10. |
keep_last_checkpoint |
(Spark 2.0.0+) (For EM optimizer only) If using checkpointing, this indicates whether to keep the last checkpoint.
If FALSE , then the checkpoint will be deleted.
Deleting the checkpoint can cause failures if a data partition is lost, so set this bit with care.
Note that checkpoints will be cleaned up via reference counting, regardless. |
learning_decay |
(For Online optimizer only) Learning rate, set as an exponential decay rate.
This should be between (0.5, 1.0] to guarantee asymptotic convergence.
This is called "kappa" in the Online LDA paper (Hoffman et al., 2010).
Default: 0.51, based on Hoffman et al. |
learning_offset |
(For Online optimizer only) A (positive) learning parameter that downweights early iterations.
Larger values make early iterations count less.
This is called "tau0" in the Online LDA paper (Hoffman et al., 2010). Default: 1024, following Hoffman et al. |
optimize_doc_concentration |
(For Online optimizer only) Indicates whether the doc_concentration (Dirichlet parameter for document-topic distribution) will be optimized during training.
Setting this to true will make the model more expressive and fit the training data better.
Default: FALSE |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
topic_distribution_col |
Output column with estimates of the topic mixture distribution for each document (often called "theta" in the literature).
Returns a vector of zeros for an empty document. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
model |
A fitted LDA model returned by ml_lda() . |
max_terms_per_topic |
Maximum number of terms to collect for each topic.
Default value of 10. |
dataset |
test corpus to use for calculating log likelihood or log perplexity |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the clustering estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, an estimator is constructed then
immediately fit with the input tbl_spark
, returning a clustering model.
tbl_spark
, with formula
or features
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the estimator.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
This signature does not apply to ml_lda()
.
ml_describe_topics
returns a DataFrame with topics and their top-weighted terms.
ml_log_likelihood
calculates a lower bound on the log likelihood of
the entire corpus.
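A minimal sketch of these helpers, assuming the lda_model and lines_tbl_tidy objects built in the Examples section below.
# Minimal sketch (objects come from the Examples section below).
ml_describe_topics(lda_model, max_terms_per_topic = 5)  # top-weighted terms per topic
ml_log_likelihood(lda_model, lines_tbl_tidy)            # lower bound on corpus log likelihood
ml_log_perplexity(lda_model, lines_tbl_tidy)            # lower values indicate a better fit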
Details
For `ml_lda.tbl_spark` with the formula interface, you can specify named arguments in `...` that will
be passed to `ft_regex_tokenizer()`, `ft_stop_words_remover()`, and `ft_count_vectorizer()`.
For example, to increase the
default `min_token_length`, you can use `ml_lda(dataset, ~ text, min_token_length = 4)`.
Terminology for LDA:
"term" = "word": an element of the vocabulary
"token": instance of a term appearing in a document
"topic": multinomial distribution over terms representing some concept
"document": one piece of text, corresponding to one row in the input data
Original LDA paper (journal version): Blei, Ng, and Jordan.
"Latent Dirichlet Allocation." JMLR, 2003.
Input data (features_col
): LDA is given a collection of documents as input data, via the features_col
parameter.
Each document is specified as a Vector of length vocab_size
, where each entry is the count for the corresponding term (word) in the document.
Feature transformers such as ft_tokenizer
and ft_count_vectorizer
can be useful for converting text to word count vectors.
Parameter details
doc_concentration
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
If not set by the user, then doc_concentration
is set automatically.
If set to a singleton vector [alpha], then alpha is replicated to a vector of length k in fitting.
Otherwise, the doc_concentration
vector must be length k.
(default = automatic)
Optimizer-specific parameter settings:
EM: Currently only supports symmetric distributions, so all values in the vector should be the same.
Values should be greater than 1.0.
Default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Online: Values should be greater than or equal to 0.
Default = uniformly (1.0 / k), following the implementation from here.
topic_concentration
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
If not set by the user, then topic_concentration
is set automatically.
(default = automatic)
Optimizer-specific parameter settings:
EM: Value should be greater than 1.0.
Default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Online: Value should be greater than or equal to 0.
Default = (1.0 / k), following the implementation from here.
topic_distribution_col
This uses a variational approximation following Hoffman et al. (2010), where the approximate distribution is called "gamma." Technically, this method returns this approximation "gamma" for each document.
See also
See http://spark.apache.org/docs/latest/ml-clustering.html for
more information on the set of clustering algorithms.
Other ml clustering algorithms: ml_bisecting_kmeans
,
ml_gaussian_mixture
,
ml_kmeans
Examples
if (FALSE) {
library(janeaustenr)
library(dplyr)
sc <- spark_connect(master = "local")
lines_tbl <- sdf_copy_to(sc,
austen_books()[c(1:30), ],
name = "lines_tbl",
overwrite = TRUE
)
# transform the data in a tidy form
lines_tbl_tidy <- lines_tbl %>%
ft_tokenizer(
input_col = "text",
output_col = "word_list"
) %>%
ft_stop_words_remover(
input_col = "word_list",
output_col = "wo_stop_words"
) %>%
mutate(text = explode(wo_stop_words)) %>%
filter(text != "") %>%
select(text, book)
lda_model <- lines_tbl_tidy %>%
ml_lda(~text, k = 4)
# vocabulary and topics
tidy(lda_model)
}
Spark ML -- Linear Regression
Arguments
Value
Details
See also
Examples
Perform regression using linear regression.
ml_linear_regression(x, formula = NULL, fit_intercept = TRUE,
elastic_net_param = 0, reg_param = 0, max_iter = 100,
weight_col = NULL, loss = "squaredError", solver = "auto",
standardization = TRUE, tol = 1e-06, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("linear_regression_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
elastic_net_param |
ElasticNet mixing parameter, in range [0, 1].
For alpha = 0, the penalty is an L2 penalty.
For alpha = 1, it is an L1 penalty. |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
weight_col |
The name of the column to use as weights for the model fit. |
loss |
The loss function to be optimized.
Supported options: "squaredError"
and "huber".
Default: "squaredError" |
solver |
Solver algorithm for optimization. |
standardization |
Whether to standardize the training features before fitting the model. |
tol |
Param for the convergence tolerance for iterative algorithms. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
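As a minimal sketch of the regularization arguments (assuming the mtcars_tbl table created in the example below), an elastic-net fit mixes the L1 and L2 penalties.
# Minimal sketch: elastic-net penalized linear regression.
# elastic_net_param = 0 is pure L2 (ridge), 1 is pure L1 (lasso);
# reg_param controls the overall penalty strength.
enet_model <- mtcars_tbl %>%
  ml_linear_regression(
    mpg ~ .,
    elastic_net_param = 0.5,
    reg_param = 0.1
  )
pred <- ml_predict(enet_model, mtcars_tbl)
ml_regression_evaluator(pred, label_col = "mpg")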
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
partitions <- mtcars_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
mtcars_training <- partitions$training
mtcars_test <- partitions$test
lm_model <- mtcars_training %>%
ml_linear_regression(mpg ~ .)
pred <- ml_predict(lm_model, mtcars_test)
ml_regression_evaluator(pred, label_col = "mpg")
}
Spark ML -- Logistic Regression
Arguments
Value
Details
See also
Examples
Perform classification using logistic regression.
ml_logistic_regression(x, formula = NULL, fit_intercept = TRUE,
elastic_net_param = 0, reg_param = 0, max_iter = 100,
threshold = 0.5, thresholds = NULL, tol = 1e-06,
weight_col = NULL, aggregation_depth = 2,
lower_bounds_on_coefficients = NULL,
lower_bounds_on_intercepts = NULL,
upper_bounds_on_coefficients = NULL,
upper_bounds_on_intercepts = NULL, features_col = "features",
label_col = "label", family = "auto",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("logistic_regression_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
elastic_net_param |
ElasticNet mixing parameter, in range [0, 1].
For alpha = 0, the penalty is an L2 penalty.
For alpha = 1, it is an L1 penalty. |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
threshold |
Threshold in binary classification prediction, in range [0, 1]. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
tol |
Param for the convergence tolerance for iterative algorithms. |
weight_col |
The name of the column to use as weights for the model fit. |
aggregation_depth |
(Spark 2.1.0+) Suggested depth for treeAggregate (>= 2). |
lower_bounds_on_coefficients |
(Spark 2.2.0+) Lower bounds on coefficients if fitting under bound constrained optimization.
The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. |
lower_bounds_on_intercepts |
(Spark 2.2.0+) Lower bounds on intercepts if fitting under bound constrained optimization.
The bounds vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression. |
upper_bounds_on_coefficients |
(Spark 2.2.0+) Upper bounds on coefficients if fitting under bound constrained optimization.
The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. |
upper_bounds_on_intercepts |
(Spark 2.2.0+) Upper bounds on intercepts if fitting under bound constrained optimization.
The bounds vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
family |
(Spark 2.1.0+) Param for the name of family which is a description of the label distribution to be used in the model.
Supported options: "auto", "binomial", and "multinomial." |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
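The example below shows the binomial case; the following is a minimal, hedged sketch of a multinomial fit, assuming the iris_tbl table created in other examples on this page.
# Minimal sketch: three-class logistic regression with light regularization.
multi_lr <- iris_tbl %>%
  ml_logistic_regression(
    Species ~ .,
    family = "multinomial",
    reg_param = 0.01,
    max_iter = 50
  )
pred <- ml_predict(multi_lr, iris_tbl)
ml_multiclass_classification_evaluator(pred)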
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
partitions <- mtcars_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
mtcars_training <- partitions$training
mtcars_test <- partitions$test
lr_model <- mtcars_training %>%
ml_logistic_regression(am ~ gear + carb)
pred <- ml_predict(lr_model, mtcars_test)
ml_binary_classification_evaluator(pred)
}
Extracts data associated with a Spark ML model
Arguments
Value
Extracts data associated with a Spark ML model
ml_model_data(object)
Arguments
Value
A tbl_spark
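A minimal sketch, assuming a fitted ml_model such as the lm_model built in the linear regression example above.
# Minimal sketch: retrieve the Spark DataFrame associated with a fitted model.
training_data <- ml_model_data(lm_model)
head(training_data)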
Spark ML -- Multilayer Perceptron
Arguments
Value
Details
See also
Examples
Classification model based on the Multilayer Perceptron.
Each layer has sigmoid activation function, output layer has softmax.
ml_multilayer_perceptron_classifier(x, formula = NULL, layers = NULL,
max_iter = 100, step_size = 0.03, tol = 1e-06, block_size = 128,
solver = "l-bfgs", seed = NULL, initial_weights = NULL,
thresholds = NULL, features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("multilayer_perceptron_classifier_"), ...)
ml_multilayer_perceptron(x, formula = NULL, layers, max_iter = 100,
step_size = 0.03, tol = 1e-06, block_size = 128,
solver = "l-bfgs", seed = NULL, initial_weights = NULL,
features_col = "features", label_col = "label", thresholds = NULL,
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("multilayer_perceptron_classifier_"),
response = NULL, features = NULL, ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
layers |
A numeric vector describing the layers -- each element in the vector gives the size of a layer.
For example, c(4, 5, 2) would imply three layers, with an input (feature) layer of size 4, an intermediate layer of size 5, and an output (class) layer of size 2. |
max_iter |
The maximum number of iterations to use. |
step_size |
Step size to be used for each iteration of optimization (> 0). |
tol |
Param for the convergence tolerance for iterative algorithms. |
block_size |
Block size for stacking input data in matrices to speed up the computation.
Data is stacked within partitions.
If the block size is larger than the remaining data in a partition, it is adjusted to the size of that data.
Recommended size is between 10 and 1000.
Default: 128 |
solver |
The solver algorithm for optimization.
Supported options: "gd" (minibatch gradient descent) or "l-bfgs".
Default: "l-bfgs" |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
initial_weights |
The initial weights of the model. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
ml_multilayer_perceptron()
is an alias for ml_multilayer_perceptron_classifier()
for backwards compatibility.
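A minimal sketch of the pipeline signature described in the Value section, assuming sc and iris_tbl as in the example below; the layer sizes follow the iris data (4 features in, 3 classes out).
# Minimal sketch: an unfitted multilayer perceptron stage inside a pipeline,
# fit explicitly with ml_fit().
mlp_pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_multilayer_perceptron_classifier(layers = c(4, 8, 3))
mlp_fitted <- ml_fit(mlp_pipeline, iris_tbl)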
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
mlp_model <- iris_training %>%
ml_multilayer_perceptron_classifier(Species ~ ., layers = c(4,3,3))
pred <- ml_predict(mlp_model, iris_test)
ml_multiclass_classification_evaluator(pred)
}
Spark ML -- Naive-Bayes
Arguments
Value
Details
See also
Examples
Naive Bayes Classifiers.
It supports Multinomial NB (see here), which can handle finitely supported discrete data.
For example, by converting documents into TF-IDF vectors, it can be used for document classification.
By converting every feature vector to binary (0/1) data, it can also be used as Bernoulli NB (see here).
The input feature values must be nonnegative.
ml_naive_bayes(x, formula = NULL, model_type = "multinomial",
smoothing = 1, thresholds = NULL, weight_col = NULL,
features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("naive_bayes_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
model_type |
The model type.
Supported options: "multinomial"
and "bernoulli" .
(default = multinomial ) |
smoothing |
The (Laplace) smoothing parameter.
Defaults to 1. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
weight_col |
(Spark 2.1.0+) Weight column name.
If this is not set or empty, we treat all instance weights as 1.0. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
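As a minimal, hedged sketch of the Bernoulli variant described above -- the documents_tbl table and its text/label columns are hypothetical placeholders, not objects from this reference.
# Minimal sketch: tokenize free text, build binary (0/1) term vectors, and fit
# Bernoulli Naive Bayes as part of a pipeline.
nb_pipeline <- ml_pipeline(sc) %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_count_vectorizer(input_col = "words", output_col = "features", binary = TRUE) %>%
  ml_naive_bayes(model_type = "bernoulli", features_col = "features", label_col = "label")
nb_fitted <- ml_fit(nb_pipeline, documents_tbl)  # documents_tbl: hypothetical table with text + numeric label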
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
nb_model <- iris_training %>%
ml_naive_bayes(Species ~ .)
pred <- ml_predict(nb_model, iris_test)
ml_multiclass_classification_evaluator(pred)
}
Spark ML -- OneVsRest
Arguments
Value
Details
See also
Reduction of Multiclass Classification to Binary Classification.
Performs reduction using the one-against-all strategy.
For a multiclass classification problem with k classes, train k models (one per class).
Each example is scored against all k models, and the model with the highest score is picked to label the example.
ml_one_vs_rest(x, formula = NULL, classifier = NULL,
features_col = "features", label_col = "label",
prediction_col = "prediction", uid = random_string("one_vs_rest_"),
...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
classifier |
Object of class ml_estimator .
The base binary classifier to which the multiclass problem is reduced. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
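This page has no Examples section; the following is a minimal, hedged sketch, assuming sc and iris_tbl as in other examples on this page.
# Minimal sketch: use a binary logistic regression estimator as the base
# classifier and reduce the three-class problem to three one-vs-rest models.
base_lr <- ml_logistic_regression(sc, max_iter = 10)
ovr_model <- iris_tbl %>%
  ml_one_vs_rest(Species ~ ., classifier = base_lr)
pred <- ml_predict(ovr_model, iris_tbl)
ml_multiclass_classification_evaluator(pred)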
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_random_forest_classifier
Feature Transformation -- PCA (Estimator)
Arguments
Value
Details
See also
Examples
PCA trains a model to project vectors to a lower dimensional space of the top k principal components.
ft_pca(x, input_col = NULL, output_col = NULL, k = NULL,
uid = random_string("pca_"), ...)
ml_pca(x, features = tbl_vars(x), k = length(features),
pc_prefix = "PC", ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
k |
The number of principal components |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
features |
The columns to use in the principal components
analysis.
Defaults to all columns in x . |
pc_prefix |
Length-one character vector used to prepend names of components. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns a ml_transformer
,
a ml_estimator
, or one of their subclasses.
The object contains a pointer to
a Spark Transformer
or Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the transformer or estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a transformer is constructed then
immediately applied to the input tbl_spark
, returning a tbl_spark.
Details
In the case where x
is a tbl_spark
, the estimator fits against x
to obtain a transformer, which is then immediately used to transform x
, returning a tbl_spark
.
ml_pca()
is a wrapper around ft_pca()
that returns a
ml_model
.
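The example below uses the ml_pca() wrapper; as a minimal sketch of the ft_pca() form (assuming iris_tbl as in the example below), the input must first be assembled into a single vector column.
# Minimal sketch: assemble the numeric columns into a vector column, then
# project onto the top 2 principal components.
iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features"
  ) %>%
  ft_pca(input_col = "features", output_col = "pca_features", k = 2)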
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer
,
ft_bucketizer
,
ft_chisq_selector
,
ft_count_vectorizer
, ft_dct
,
ft_elementwise_product
,
ft_feature_hasher
,
ft_hashing_tf
, ft_idf
,
ft_imputer
,
ft_index_to_string
,
ft_interaction
, ft_lsh
,
ft_max_abs_scaler
,
ft_min_max_scaler
, ft_ngram
,
ft_normalizer
,
ft_one_hot_encoder_estimator
,
ft_one_hot_encoder
,
ft_polynomial_expansion
,
ft_quantile_discretizer
,
ft_r_formula
,
ft_regex_tokenizer
,
ft_sql_transformer
,
ft_standard_scaler
,
ft_stop_words_remover
,
ft_string_indexer
,
ft_tokenizer
,
ft_vector_assembler
,
ft_vector_indexer
,
ft_vector_slicer
, ft_word2vec
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
select(-Species) %>%
ml_pca(k = 2)
}
Spark ML -- Random Forest
Arguments
Value
Details
See also
Examples
Perform classification and regression using random forests.
ml_random_forest_classifier(x, formula = NULL, num_trees = 20,
subsampling_rate = 1, max_depth = 5, min_instances_per_node = 1,
feature_subset_strategy = "auto", impurity = "gini",
min_info_gain = 0, max_bins = 32, seed = NULL, thresholds = NULL,
checkpoint_interval = 10, cache_node_ids = FALSE,
max_memory_in_mb = 256, features_col = "features",
label_col = "label", prediction_col = "prediction",
probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("random_forest_classifier_"), ...)
ml_random_forest(x, formula = NULL, type = c("auto", "regression",
"classification"), features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
feature_subset_strategy = "auto", impurity = "auto",
checkpoint_interval = 10, max_bins = 32, max_depth = 5,
num_trees = 20, min_info_gain = 0, min_instances_per_node = 1,
subsampling_rate = 1, seed = NULL, thresholds = NULL,
cache_node_ids = FALSE, max_memory_in_mb = 256,
uid = random_string("random_forest_"), response = NULL,
features = NULL, ...)
ml_random_forest_regressor(x, formula = NULL, num_trees = 20,
subsampling_rate = 1, max_depth = 5, min_instances_per_node = 1,
feature_subset_strategy = "auto", impurity = "variance",
min_info_gain = 0, max_bins = 32, seed = NULL,
checkpoint_interval = 10, cache_node_ids = FALSE,
max_memory_in_mb = 256, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("random_forest_regressor_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
num_trees |
Number of trees to train (>= 1).
If 1, then no bootstrapping is used.
If > 1, then bootstrapping is done. |
subsampling_rate |
Fraction of the training data used for learning each decision tree, in range (0, 1].
(default = 1.0) |
max_depth |
Maximum depth of the tree (>= 0); that is, the maximum
number of nodes separating any leaves from the root of the tree. |
min_instances_per_node |
Minimum number of instances each child must
have after split. |
feature_subset_strategy |
The number of features to consider for splits at each tree node.
See details for options. |
impurity |
Criterion used for information gain calculation.
Supported: "entropy"
and "gini" (default) for classification and "variance" (default) for regression.
For ml_decision_tree , setting "auto" will default to the appropriate
criterion based on model type. |
min_info_gain |
Minimum information gain for a split to be considered
at a tree node.
Should be >= 0, defaults to 0. |
max_bins |
The maximum number of bins used for discretizing
continuous features and for choosing how to split on features at
each node.
More bins give higher granularity. |
seed |
Seed for random numbers. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g. 10 means that the cache will get checkpointed every 10 iterations; defaults to 10. |
cache_node_ids |
If FALSE , the algorithm will pass trees to executors to match instances with nodes.
If TRUE , the algorithm will cache node IDs for each instance.
Caching can speed up training of deeper trees.
Defaults to FALSE . |
max_memory_in_mb |
Maximum memory in MB allocated to histogram aggregation.
If too small, then 1 node will be split per iteration,
and its aggregates may exceed this size.
Defaults to 256. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
type |
The type of model to fit. "regression" treats the response
as a continuous variable, while "classification" treats the response
as a categorical variable.
When "auto" is used, the model type is
inferred based on the response variable type -- if it is a numeric type,
then regression is used; classification otherwise. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
The supported options for feature_subset_strategy
are
"auto"
: Choose automatically for task: If num_trees == 1
, set to "all"
.
If num_trees > 1
(forest), set to "sqrt"
for classification and to "onethird"
for regression.
"all"
: use all features
"onethird"
: use 1/3 of the features
"sqrt"
: use sqrt(number of features)
"log2"
: use log2(number of features)
"n"
: when n
is in the range (0, 1.0], use n * number of features.
When n
is in the range (1, number of features), use n
features.
(default = "auto"
)
ml_random_forest
is a wrapper around ml_random_forest_regressor.tbl_spark
and ml_random_forest_classifier.tbl_spark
and calls the appropriate method based on model type.
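The example below covers classification; the following is a minimal, hedged sketch of the regression variant, assuming the mtcars_tbl table created in earlier examples on this page.
# Minimal sketch: random forest regression with an explicit per-split feature
# subset strategy.
rf_reg <- mtcars_tbl %>%
  ml_random_forest_regressor(
    mpg ~ .,
    num_trees = 50,
    feature_subset_strategy = "onethird"
  )
pred <- ml_predict(rf_reg, mtcars_tbl)
ml_regression_evaluator(pred, label_col = "mpg")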
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
rf_model <- iris_training %>%
ml_random_forest(Species ~ ., type = "classification")
pred <- ml_predict(rf_model, iris_test)
ml_multiclass_classification_evaluator(pred)
}
Spark ML -- Survival Regression
Arguments
Value
Details
See also
Examples
Fit a parametric survival regression model named accelerated failure time (AFT) model (see Accelerated failure time model (Wikipedia)) based on the Weibull distribution of the survival time.
ml_aft_survival_regression(x, formula = NULL, censor_col = "censor",
quantile_probabilities = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95,
0.99), fit_intercept = TRUE, max_iter = 100L, tol = 1e-06,
aggregation_depth = 2, quantiles_col = NULL,
features_col = "features", label_col = "label",
prediction_col = "prediction",
uid = random_string("aft_survival_regression_"), ...)
ml_survival_regression(x, formula = NULL, censor_col = "censor",
quantile_probabilities = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95,
0.99), fit_intercept = TRUE, max_iter = 100L, tol = 1e-06,
aggregation_depth = 2, quantiles_col = NULL,
features_col = "features", label_col = "label",
prediction_col = "prediction",
uid = random_string("aft_survival_regression_"), response = NULL,
features = NULL, ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
censor_col |
Censor column name.
The value of this column could be 0 or 1.
If the value is 1, the event has occurred, i.e. the observation is
uncensored; otherwise it is censored. |
quantile_probabilities |
Quantile probabilities array.
Values of the quantile probabilities array should be in the range (0, 1) and the array should be non-empty. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
max_iter |
The maximum number of iterations to use. |
tol |
Param for the convergence tolerance for iterative algorithms. |
aggregation_depth |
(Spark 2.1.0+) Suggested depth for treeAggregate (>= 2). |
quantiles_col |
Quantiles column name.
This column will output quantiles of corresponding quantileProbabilities if it is set. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark
, with formula
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
ml_survival_regression()
is an alias for ml_aft_survival_regression()
for backwards compatibility.
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
library(survival)
library(sparklyr)
sc <- spark_connect(master = "local")
ovarian_tbl <- sdf_copy_to(sc, ovarian, name = "ovarian_tbl", overwrite = TRUE)
partitions <- ovarian_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
ovarian_training <- partitions$training
ovarian_test <- partitions$test
sur_reg <- ovarian_training %>%
ml_aft_survival_regression(futime ~ ecog_ps + rx + age + resid_ds, censor_col = "fustat")
pred <- ml_predict(sur_reg, ovarian_test)
pred
}
Add a Stage to a Pipeline
Arguments
Adds a stage to a pipeline.
ml_add_stage(x, stage)
Arguments
x |
A pipeline or a pipeline stage. |
stage |
A pipeline stage. |
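A minimal illustrative sketch, assuming a local connection sc; the stop-words remover stage and its column names are arbitrary choices for the example.
if (FALSE) {
sc <- spark_connect(master = "local")
# Build an empty pipeline, then append a transformer stage to it
remover <- ft_stop_words_remover(sc, input_col = "words", output_col = "clean_words")
pipeline <- ml_pipeline(sc) %>%
  ml_add_stage(remover)
pipeline
}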
Spark ML -- ALS
Arguments
Value
Details
Examples
Perform recommendation using Alternating Least Squares (ALS) matrix factorization.
ml_als(x, formula = NULL, rating_col = "rating", user_col = "user",
item_col = "item", rank = 10, reg_param = 0.1,
implicit_prefs = FALSE, alpha = 1, nonnegative = FALSE,
max_iter = 10, num_user_blocks = 10, num_item_blocks = 10,
checkpoint_interval = 10, cold_start_strategy = "nan",
intermediate_storage_level = "MEMORY_AND_DISK",
final_storage_level = "MEMORY_AND_DISK", uid = random_string("als_"),
...)
ml_recommend(model, type = c("items", "users"), n = 1)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details.
The ALS model requires a specific formula format, please use rating_col ~ user_col + item_col . |
rating_col |
Column name for ratings.
Default: "rating" |
user_col |
Column name for user ids.
Ids must be integers.
Other numeric types are supported for this column, but will be cast to integers as long as they fall within the integer value range.
Default: "user" |
item_col |
Column name for item ids.
Ids must be integers.
Other numeric types are supported for this column, but will be cast to integers as long as they fall within the integer value range.
Default: "item" |
rank |
Rank of the matrix factorization (positive).
Default: 10 |
reg_param |
Regularization parameter. |
implicit_prefs |
Whether to use implicit preference.
Default: FALSE. |
alpha |
Alpha parameter in the implicit preference formulation (nonnegative). |
nonnegative |
Whether to apply nonnegativity constraints.
Default: FALSE. |
max_iter |
Maximum number of iterations. |
num_user_blocks |
Number of user blocks (positive).
Default: 10 |
num_item_blocks |
Number of item blocks (positive).
Default: 10 |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g.
10 means that the cache will get checkpointed every 10 iterations, defaults to 10. |
cold_start_strategy |
(Spark 2.2.0+) Strategy for dealing with unknown or new users/items at prediction time.
This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data.
Supported values: - "nan": predicted value for unknown ids will be NaN.
- "drop": rows in the input DataFrame containing unknown ids will be dropped from the output DataFrame containing predictions.
Default: "nan". |
intermediate_storage_level |
(Spark 2.0.0+) StorageLevel for intermediate datasets.
Pass in a string representation of StorageLevel .
Cannot be "NONE".
Default: "MEMORY_AND_DISK". |
final_storage_level |
(Spark 2.0.0+) StorageLevel for ALS model factors.
Pass in a string representation of StorageLevel .
Default: "MEMORY_AND_DISK". |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
model |
An ALS model object |
type |
What to recommend, one of items or users |
n |
Maximum number of recommendations to return |
Value
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e.
X * Yt = R.
Typically these approximations are called 'factor' matrices.
The general approach is iterative.
During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares.
The newly-solved factor matrix is then held constant while solving for the other factor matrix.
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector.
This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on).
This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at https://doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.
Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0.
The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_als
recommender object, which is an Estimator.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the recommender appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a recommender
estimator is constructed then immediately fit with the input
tbl_spark
, returning a recommendation model, i.e. ml_als_model
.
Details
ml_recommend()
returns the top n
users/items recommended for each item/user, for all items/users.
The output has been transformed (exploded and separated) from the default Spark outputs to be more user-friendly.
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
movies <- data.frame(
user = c(1, 2, 0, 1, 2, 0),
item = c(1, 1, 1, 2, 2, 0),
rating = c(3, 1, 2, 4, 5, 4)
)
movies_tbl <- sdf_copy_to(sc, movies)
model <- ml_als(movies_tbl, rating ~ user + item)
ml_predict(model, movies_tbl)
ml_recommend(model, type = "item", 1)
}
Utility functions for LSH models
Arguments
Utility functions for LSH models
ml_approx_nearest_neighbors(model, dataset, key, num_nearest_neighbors,
dist_col = "distCol")
ml_approx_similarity_join(model, dataset_a, dataset_b, threshold,
dist_col = "distCol")
Arguments
model |
A fitted LSH model, returned by either ft_minhash_lsh()
or ft_bucketed_random_projection_lsh() . |
dataset |
The dataset to search for nearest neighbors of the key. |
key |
Feature vector representing the item to search for. |
num_nearest_neighbors |
The maximum number of nearest neighbors. |
dist_col |
Output column for storing the distance between each result row and the key. |
dataset_a |
One of the datasets to join. |
dataset_b |
Another dataset to join. |
threshold |
The threshold for the distance of row pairs. |
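A minimal illustrative sketch, assuming a local connection sc and the iris data; the bucket length and distance threshold are arbitrary values chosen for the example.
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Assemble the numeric columns into a single vector column named "features"
iris_vec <- ft_vector_assembler(
  iris_tbl,
  input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
  output_col = "features"
)
# Fit a bucketed random projection LSH model on the feature column
lsh_model <- ft_bucketed_random_projection_lsh(
  sc, input_col = "features", output_col = "hashes",
  bucket_length = 2, num_hash_tables = 3
) %>%
  ml_fit(iris_vec)
# Approximate self-join: row pairs whose Euclidean distance is below the threshold
ml_approx_similarity_join(lsh_model, iris_vec, iris_vec, threshold = 1.5)
}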
Frequent Pattern Mining -- FPGrowth
Arguments
A parallel FP-growth algorithm to mine frequent itemsets.
ml_fpgrowth(x, items_col = "items", min_confidence = 0.8,
min_support = 0.3, prediction_col = "prediction",
uid = random_string("fpgrowth_"), ...)
ml_association_rules(model)
ml_freq_itemsets(model)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
items_col |
Items column name.
Default: "items" |
min_confidence |
Minimal confidence for generating association rules. min_confidence will not affect the mining of frequent itemsets, but
will affect the generation of association rules.
Default: 0.8 |
min_support |
Minimal support level of the frequent pattern, in [0.0, 1.0].
Any pattern that appears more than (min_support * size-of-the-dataset) times
will be output in the frequent itemsets.
Default: 0.3 |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
model |
A fitted FPGrowth model returned by ml_fpgrowth() |
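A minimal illustrative sketch, assuming a local connection sc; the toy basket data and the use of Spark SQL's split() (via dplyr translation) to build the array-typed items column are assumptions made for the example.
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
# Toy market-basket data: one comma-separated string of items per row
baskets <- data.frame(raw = c("a,b,c", "a,b", "a,c", "b,c"))
baskets_tbl <- sdf_copy_to(sc, baskets, name = "baskets_tbl", overwrite = TRUE) %>%
  mutate(items = split(raw, ","))  # Spark SQL split() builds an array column
fp_model <- ml_fpgrowth(baskets_tbl, items_col = "items",
                        min_support = 0.5, min_confidence = 0.6)
# Inspect the mined frequent itemsets and the generated association rules
ml_freq_itemsets(fp_model)
ml_association_rules(fp_model)
}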
Spark ML - Evaluators
Arguments
Value
Details
Examples
A set of functions to calculate performance metrics for prediction models.
Also see the Spark ML Documentation https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.package
ml_binary_classification_evaluator(x, label_col = "label",
raw_prediction_col = "rawPrediction", metric_name = "areaUnderROC",
uid = random_string("binary_classification_evaluator_"), ...)
ml_binary_classification_eval(x, label_col = "label",
prediction_col = "prediction", metric_name = "areaUnderROC")
ml_multiclass_classification_evaluator(x, label_col = "label",
prediction_col = "prediction", metric_name = "f1",
uid = random_string("multiclass_classification_evaluator_"), ...)
ml_classification_eval(x, label_col = "label",
prediction_col = "prediction", metric_name = "f1")
ml_regression_evaluator(x, label_col = "label",
prediction_col = "prediction", metric_name = "rmse",
uid = random_string("regression_evaluator_"), ...)
Arguments
x |
A spark_connection object or a tbl_spark containing label and prediction columns.
The latter should be the output of sdf_predict . |
label_col |
Name of the column containing the true labels or values. |
raw_prediction_col |
Raw prediction (a.k.a.
confidence) column name. |
metric_name |
The performance metric.
See details. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
prediction_col |
Name of the column that contains the predicted
label or value NOT the scored probability.
Column should be of type Double . |
Value
The calculated performance metric
Details
The following metrics are supported
Binary Classification: areaUnderROC
(default) or areaUnderPR
(not available in Spark 2.X.)
Multiclass Classification: f1
(default), precision
, recall
, weightedPrecision
, weightedRecall
or accuracy
; for Spark 2.X: f1
(default), weightedPrecision
, weightedRecall
or accuracy
.
Regression: rmse
(root mean squared error, default),
mse
(mean squared error), r2
, or mae
(mean absolute error.)
ml_binary_classification_eval()
is an alias for ml_binary_classification_evaluator()
for backwards compatibility.
ml_classification_eval()
is an alias for ml_multiclass_classification_evaluator()
for backwards compatibility.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
partitions <- mtcars_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
mtcars_training <- partitions$training
mtcars_test <- partitions$test
# for multiclass classification
rf_model <- mtcars_training %>%
ml_random_forest(cyl ~ ., type = "classification")
pred <- ml_predict(rf_model, mtcars_test)
ml_multiclass_classification_evaluator(pred)
# for regression
rf_model <- mtcars_training %>%
ml_random_forest(cyl ~ ., type = "regression")
pred <- ml_predict(rf_model, mtcars_test)
ml_regression_evaluator(pred, label_col = "cyl")
# for binary classification
rf_model <- mtcars_training %>%
ml_random_forest(am ~ gear + carb, type = "classification")
pred <- ml_predict(rf_model, mtcars_test)
ml_binary_classification_evaluator(pred)
}
Spark ML -- Bisecting K-Means Clustering
Arguments
Value
See also
Examples
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
The algorithm starts from a single cluster that contains all points.
Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible.
The bisecting steps of clusters on the same level are grouped together to increase parallelism.
If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
ml_bisecting_kmeans(x, formula = NULL, k = 4, max_iter = 20,
seed = NULL, min_divisible_cluster_size = 1,
features_col = "features", prediction_col = "prediction",
uid = random_string("bisecting_bisecting_kmeans_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
min_divisible_cluster_size |
The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0). |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the clustering estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, an estimator is constructed then
immediately fit with the input tbl_spark
, returning a clustering model.
tbl_spark
, with formula
or features
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the estimator.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
This signature does not apply to ml_lda()
.
See also
See http://spark.apache.org/docs/latest/ml-clustering.html for
more information on the set of clustering algorithms.
Other ml clustering algorithms: ml_gaussian_mixture
,
ml_kmeans
, ml_lda
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
ml_bisecting_kmeans(Species ~ ., k = 4)
}
Wrap a Spark ML JVM object
Arguments
Identifies the associated sparklyr ML constructor for the JVM object by inspecting its
class and performing a lookup.
The lookup table is specified by the
`sparkml/class_mapping.json` files of sparklyr and the loaded extensions.
ml_call_constructor(jobj)
Arguments
jobj |
The jobj for the pipeline stage. |
Chi-square hypothesis testing for categorical data.
Arguments
Value
Examples
Conduct Pearson's independence test for every feature against the
label.
For each feature, the (feature, label) pairs are converted
into a contingency matrix for which the Chi-squared statistic is
computed.
All label and feature values must be categorical.
ml_chisquare_test(x, features, label)
Arguments
x |
A tbl_spark . |
features |
The name(s) of the feature columns.
This can also be the name
of a single vector column created using ft_vector_assembler() . |
label |
The name of the label column. |
Value
A data frame with one row for each (feature, label) pair with p-values,
degrees of freedom, and test statistics.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width")
ml_chisquare_test(iris_tbl, features = features, label = "Species")
}
Spark ML - Clustering Evaluator
Arguments
Value
Examples
Evaluator for clustering results.
The metric computes the Silhouette measure using the squared
Euclidean distance.
The Silhouette is a measure for the validation of the consistency
within clusters.
It ranges between 1 and -1, where a value close to 1 means that the
points in a cluster are close to the other points in the same cluster and far from the
points of the other clusters.
ml_clustering_evaluator(x, features_col = "features",
prediction_col = "prediction", metric_name = "silhouette",
uid = random_string("clustering_evaluator_"), ...)
Arguments
x |
A spark_connection object or a tbl_spark containing label and prediction columns.
The latter should be the output of sdf_predict . |
features_col |
Name of features column. |
prediction_col |
Name of the prediction column. |
metric_name |
The performance metric.
Currently supports "silhouette". |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
Value
The calculated performance metric
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
formula <- Species ~ .
# Train the models
kmeans_model <- ml_kmeans(iris_training, formula = formula)
b_kmeans_model <- ml_bisecting_kmeans(iris_training, formula = formula)
gmm_model <- ml_gaussian_mixture(iris_training, formula = formula)
# Predict
pred_kmeans <- ml_predict(kmeans_model, iris_test)
pred_b_kmeans <- ml_predict(b_kmeans_model, iris_test)
pred_gmm <- ml_predict(gmm_model, iris_test)
# Evaluate
ml_clustering_evaluator(pred_kmeans)
ml_clustering_evaluator(pred_b_kmeans)
ml_clustering_evaluator(pred_gmm)
}
Constructors for `ml_model` Objects
Arguments
Functions for developers writing extensions for Spark ML.
These functions are constructors
for `ml_model` objects that are returned when using the formula interface.
new_ml_model_prediction(pipeline_model, formula, dataset, label_col,
features_col, ..., class = character())
new_ml_model(pipeline_model, formula, dataset, ..., class = character())
new_ml_model_classification(pipeline_model, formula, dataset, label_col,
features_col, predicted_label_col, ..., class = character())
new_ml_model_regression(pipeline_model, formula, dataset, label_col,
features_col, ..., class = character())
new_ml_model_clustering(pipeline_model, formula, dataset, features_col,
..., class = character())
ml_supervised_pipeline(predictor, dataset, formula, features_col,
label_col)
ml_clustering_pipeline(predictor, dataset, formula, features_col)
ml_construct_model_supervised(constructor, predictor, formula, dataset,
features_col, label_col, ...)
ml_construct_model_clustering(constructor, predictor, formula, dataset,
features_col, ...)
Arguments
pipeline_model |
The pipeline model object returned by `ml_supervised_pipeline()`. |
formula |
The formula used for data preprocessing |
dataset |
The training dataset. |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
class |
Name of the subclass. |
predictor |
The pipeline stage corresponding to the ML algorithm. |
constructor |
The constructor function for the `ml_model`. |
Compute correlation matrix
Arguments
Value
Examples
Compute correlation matrix
ml_corr(x, columns = NULL, method = c("pearson", "spearman"))
Arguments
x |
A tbl_spark . |
columns |
The names of the columns to calculate correlations of.
If only one
column is specified, it must be a vector column (for example, assembled using ft_vector_assembler() ). |
method |
The method to use, either "pearson" or "spearman" . |
Value
A correlation matrix organized as a data frame.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width")
ml_corr(iris_tbl, columns = features , method = "pearson")
}
Spark ML -- Tuning
Arguments
Value
Details
Examples
Perform hyper-parameter tuning using either K-fold cross validation or train-validation split.
ml_sub_models(model)
ml_validation_metrics(model)
ml_cross_validator(x, estimator = NULL, estimator_param_maps = NULL,
evaluator = NULL, num_folds = 3, collect_sub_models = FALSE,
parallelism = 1, seed = NULL,
uid = random_string("cross_validator_"), ...)
ml_train_validation_split(x, estimator = NULL,
estimator_param_maps = NULL, evaluator = NULL, train_ratio = 0.75,
collect_sub_models = FALSE, parallelism = 1, seed = NULL,
uid = random_string("train_validation_split_"), ...)
Arguments
model |
A cross validation or train-validation-split model. |
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
estimator |
A ml_estimator object. |
estimator_param_maps |
A named list of stages and hyper-parameter sets to tune.
See details. |
evaluator |
A ml_evaluator object, see ml_evaluator. |
num_folds |
Number of folds for cross validation.
Must be >= 2.
Default: 3 |
collect_sub_models |
Whether to collect a list of sub-models trained during tuning.
If set to FALSE , then only the single best sub-model will be available after fitting.
If set to TRUE , then all sub-models will be available.
Warning: For large models, collecting
all sub-models can cause OOMs on the Spark driver. |
parallelism |
The number of threads to use when running parallel algorithms.
Default is 1 for serial execution. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
train_ratio |
Ratio between train and validation data.
Must be between 0 and 1.
Default: 0.75 |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_cross_validator
or ml_train_validation_split
object.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the tuning estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a tuning estimator is constructed then
immediately fit with the input tbl_spark
, returning a ml_cross_validation_model
or a
ml_train_validation_split_model
object.
For cross validation, ml_sub_models()
returns a nested
list of models, where the first layer represents fold indices and the
second layer represents param maps.
For train-validation split,
ml_sub_models()
returns a list of models, corresponding to the
order of the estimator param maps.
ml_validation_metrics()
returns a data frame of performance
metrics and hyperparameter combinations.
Details
ml_cross_validator()
performs k-fold cross validation while ml_train_validation_split()
performs tuning on one pair of train and validation datasets.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Create a pipeline
pipeline <- ml_pipeline(sc) %>%
ft_r_formula(Species ~ . ) %>%
ml_random_forest_classifier()
# Specify hyperparameter grid
grid <- list(
random_forest = list(
num_trees = c(5,10),
max_depth = c(5,10),
impurity = c("entropy", "gini")
)
)
# Create the cross validator object
cv <- ml_cross_validator(
sc, estimator = pipeline, estimator_param_maps = grid,
evaluator = ml_multiclass_classification_evaluator(sc),
num_folds = 3,
parallelism = 4
)
# Train the models
cv_model <- ml_fit(cv, iris_tbl)
# Print the metrics
ml_validation_metrics(cv_model)
}
Default stop words
Arguments
Value
Details
See also
Loads the default stop words for the given language.
ml_default_stop_words(sc, language = c("english", "danish", "dutch",
"finnish", "french", "german", "hungarian", "italian", "norwegian",
"portuguese", "russian", "spanish", "swedish", "turkish"), ...)
Arguments
sc |
A spark_connection |
language |
A character string. |
... |
Optional arguments; currently unused. |
Value
A list of stop words.
Details
Supported languages: danish, dutch, english, finnish, french,
german, hungarian, italian, norwegian, portuguese, russian, spanish,
swedish, turkish.
Defaults to English.
See http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/
for more details
See also
ft_stop_words_remover
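A minimal usage sketch, assuming a local connection sc:
if (FALSE) {
sc <- spark_connect(master = "local")
# English is the default; pass another supported language to switch lists
english_sw <- ml_default_stop_words(sc)
spanish_sw <- ml_default_stop_words(sc, language = "spanish")
head(english_sw)
}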
Evaluate the Model on a Validation Set
Arguments
Compute performance metrics.
ml_evaluate(x, dataset)
# S3 method for ml_model_logistic_regression
ml_evaluate(x, dataset)
# S3 method for ml_logistic_regression_model
ml_evaluate(x, dataset)
# S3 method for ml_model_linear_regression
ml_evaluate(x, dataset)
# S3 method for ml_linear_regression_model
ml_evaluate(x, dataset)
# S3 method for ml_model_generalized_linear_regression
ml_evaluate(x, dataset)
# S3 method for ml_generalized_linear_regression_model
ml_evaluate(x, dataset)
# S3 method for ml_evaluator
ml_evaluate(x, dataset)
Arguments
x |
An ML model object or an evaluator object. |
dataset |
The dataset to validate the model on. |
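A minimal illustrative sketch, assuming a local connection and a binary subset of iris; the exact contents of the returned summary object depend on the model type.
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
  filter(Species != "setosa") %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
# Fit a binary logistic regression, then compute metrics on the held-out set
lr_model <- ml_logistic_regression(partitions$training, Species ~ .)
ml_evaluate(lr_model, partitions$test)
}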
Spark ML - Feature Importance for Tree Models
Arguments
Value
Spark ML - Feature Importance for Tree Models
ml_feature_importances(model, ...)
ml_tree_feature_importance(model, ...)
Arguments
model |
A decision tree-based model. |
... |
Optional arguments; currently unused. |
Value
For ml_model
, a sorted data frame with feature labels and their relative importance.
For ml_prediction_model
, a vector of relative importances.
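A minimal sketch, assuming a local connection sc and the iris data:
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
rf_model <- ml_random_forest(iris_tbl, Species ~ ., type = "classification")
# Sorted data frame of feature labels and their relative importance
ml_feature_importances(rf_model)
}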
Feature Transformation -- Word2Vec (Estimator)
Arguments
Value
Details
See also
Word2Vec transforms a word into a code for further natural language processing or machine learning process.
ft_word2vec(x, input_col = NULL, output_col = NULL,
vector_size = 100, min_count = 5, max_sentence_length = 1000,
num_partitions = 1, step_size = 0.025, max_iter = 1, seed = NULL,
uid = random_string("word2vec_"), ...)
ml_find_synonyms(model, word, num)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
vector_size |
The dimension of the vectors that words are transformed into.
Default: 100 |
min_count |
The minimum number of times a token must appear to be included in
the word2vec model's vocabulary.
Default: 5 |
max_sentence_length |
(Spark 2.0.0+) Sets the maximum length (in words) of each sentence
in the input data.
Any sentence longer than this threshold will be divided into
chunks of up to max_sentence_length size.
Default: 1000 |
num_partitions |
Number of partitions for sentences of words.
Default: 1 |
step_size |
Param for Step size to be used for each iteration of optimization (> 0). |
max_iter |
The maximum number of iterations to use. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
model |
A fitted Word2Vec model, returned by ft_word2vec() . |
word |
A word, as a length-one character vector. |
num |
Number of words closest in similarity to the given word to find. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns a ml_transformer
,
a ml_estimator
, or one of their subclasses.
The object contains a pointer to
a Spark Transformer
or Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the transformer or estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a transformer is constructed then
immediately applied to the input tbl_spark
, returning a tbl_spark
ml_find_synonyms()
returns a DataFrame of synonyms and cosine similarities
Details
In the case where x
is a tbl_spark
, the estimator fits against x
to obtain a transformer, which is then immediately used to transform x
, returning a tbl_spark
.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer
,
ft_bucketizer
,
ft_chisq_selector
,
ft_count_vectorizer
, ft_dct
,
ft_elementwise_product
,
ft_feature_hasher
,
ft_hashing_tf
, ft_idf
,
ft_imputer
,
ft_index_to_string
,
ft_interaction
, ft_lsh
,
ft_max_abs_scaler
,
ft_min_max_scaler
, ft_ngram
,
ft_normalizer
,
ft_one_hot_encoder_estimator
,
ft_one_hot_encoder
, ft_pca
,
ft_polynomial_expansion
,
ft_quantile_discretizer
,
ft_r_formula
,
ft_regex_tokenizer
,
ft_sql_transformer
,
ft_standard_scaler
,
ft_stop_words_remover
,
ft_string_indexer
,
ft_tokenizer
,
ft_vector_assembler
,
ft_vector_indexer
,
ft_vector_slicer
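A minimal illustrative sketch, assuming a local connection sc; the two-sentence corpus and the small vector_size are toy values, so the reported synonyms are not meaningful.
if (FALSE) {
sc <- spark_connect(master = "local")
docs <- data.frame(text = c("the quick brown fox jumps", "the lazy brown dog sleeps"))
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)
# Tokenize the text, then fit a Word2Vec model explicitly to keep the model object
tokenized <- ft_tokenizer(docs_tbl, input_col = "text", output_col = "words")
w2v_model <- ft_word2vec(
  sc, input_col = "words", output_col = "vectors",
  vector_size = 8, min_count = 1
) %>%
  ml_fit(tokenized)
# Embed each document, then look up words close to "brown"
ml_transform(w2v_model, tokenized)
ml_find_synonyms(w2v_model, "brown", num = 2)
}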
Spark ML -- Transform, fit, and predict methods (ml_ interface)
Arguments
Value
Details
Methods for transformation, fit, and prediction.
These are mirrors of the corresponding sdf-transform-methods.
is_ml_transformer(x)
is_ml_estimator(x)
ml_fit(x, dataset, ...)
ml_transform(x, dataset, ...)
ml_fit_and_transform(x, dataset, ...)
ml_predict(x, dataset, ...)
# S3 method for ml_model_classification
ml_predict(x, dataset,
probability_prefix = "probability_", ...)
Arguments
x |
A ml_estimator , ml_transformer (or a list thereof), or ml_model object. |
dataset |
A tbl_spark . |
... |
Optional arguments; currently unused. |
probability_prefix |
String used to prepend the class probability output columns. |
Value
When x
is an estimator, ml_fit()
returns a transformer whereas ml_fit_and_transform()
returns a transformed dataset.
When x
is a transformer, ml_transform()
and ml_predict()
return a transformed dataset.
When ml_predict()
is called on a ml_model
object, additional columns (e.g.
probabilities in case of classification models) are appended to the transformed output for the user's convenience.
Details
These methods are mirrors of the corresponding sdf-transform-methods.
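A minimal sketch of the fit/transform flow, assuming a local connection sc and the iris data:
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# A StringIndexer is an estimator; fitting it yields a transformer
indexer <- ft_string_indexer(sc, input_col = "Species", output_col = "label")
is_ml_estimator(indexer)
indexer_model <- ml_fit(indexer, iris_tbl)
is_ml_transformer(indexer_model)
ml_transform(indexer_model, iris_tbl)
# Equivalent one-step shortcut
ml_fit_and_transform(indexer, iris_tbl)
}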
Spark ML -- Gaussian Mixture clustering.
Arguments
Value
See also
Examples
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs).
A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than tol
, or until it has reached the max number of iterations.
While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
ml_gaussian_mixture(x, formula = NULL, k = 2, max_iter = 100,
tol = 0.01, seed = NULL, features_col = "features",
prediction_col = "prediction", probability_col = "probability",
uid = random_string("gaussian_mixture_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
tol |
Param for the convergence tolerance for iterative algorithms. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities.
Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the clustering estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, an estimator is constructed then
immediately fit with the input tbl_spark
, returning a clustering model.
tbl_spark
, with formula
or features
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the estimator.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
This signature does not apply to ml_lda()
.
See also
See http://spark.apache.org/docs/latest/ml-clustering.html for
more information on the set of clustering algorithms.
Other ml clustering algorithms: ml_bisecting_kmeans
,
ml_kmeans
, ml_lda
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
gmm_model <- ml_gaussian_mixture(iris_tbl, Species ~ .)
pred <- sdf_predict(iris_tbl, gmm_model)
ml_clustering_evaluator(pred)
}
Spark ML -- ML Params
Arguments
Helper methods for working with parameters for ML objects.
ml_is_set(x, param, ...)
ml_param_map(x, ...)
ml_param(x, param, allow_null = FALSE, ...)
ml_params(x, params = NULL, allow_null = FALSE, ...)
Arguments
x |
A Spark ML object, either a pipeline stage or an evaluator. |
param |
The parameter to extract or set. |
... |
Optional arguments; currently unused. |
allow_null |
Whether to allow NULL results when extracting parameters.
If FALSE , an error will be thrown if the specified parameter is not found.
Defaults to FALSE . |
params |
A vector of parameters to extract. |
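A minimal sketch, assuming a local connection sc and that parameters are referred to by sparklyr's snake_case names:
if (FALSE) {
sc <- spark_connect(master = "local")
lr <- ml_logistic_regression(sc, max_iter = 50, reg_param = 0.01)
# Extract a single parameter, several at once, or the full parameter map
ml_param(lr, "max_iter")
ml_params(lr, c("max_iter", "reg_param"))
ml_param_map(lr)
# Check whether an optional parameter has been set
ml_is_set(lr, "weight_col")
}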
Spark ML -- Isotonic Regression
Arguments
Value
Details
See also
Examples
Currently implemented using parallelized pool adjacent violators algorithm.
Only univariate (single feature) algorithm supported.
ml_isotonic_regression(x, formula = NULL, feature_index = 0,
isotonic = TRUE, weight_col = NULL, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("isotonic_regression_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
feature_index |
Index of the feature if features_col is a vector column (default: 0), no effect otherwise. |
isotonic |
Whether the output sequence should be isotonic/increasing (true) or antitonic/decreasing (false).
Default: true |
weight_col |
The name of the column to use as weights for the model fit. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark
, with formula
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
iso_res <- iris_tbl %>%
ml_isotonic_regression(Petal_Length ~ Petal_Width)
pred <- ml_predict(iso_res, iris_test)
pred
}
Feature Transformation -- StringIndexer (Estimator)
Arguments
Value
Details
See also
A label indexer that maps a string column of labels to an ML column of
label indices.
If the input column is numeric, we cast it to string and
index the string values.
The indices are in [0, numLabels)
, ordered by
label frequencies.
So the most frequent label gets index 0.
This function
is the inverse of ft_index_to_string
.
ft_string_indexer(x, input_col = NULL, output_col = NULL,
handle_invalid = "error", string_order_type = "frequencyDesc",
uid = random_string("string_indexer_"), ...)
ml_labels(model)
ft_string_indexer_model(x, input_col = NULL, output_col = NULL, labels,
handle_invalid = "error",
uid = random_string("string_indexer_model_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries.
Options are
'skip' (filter out rows with invalid values), 'error' (throw an error), or
'keep' (keep invalid values in a special additional bucket).
Default: "error" |
string_order_type |
(Spark 2.3+) How to order labels of the string column.
The first label after ordering is assigned an index of 0.
Options are "frequencyDesc" , "frequencyAsc" , "alphabetDesc" , and "alphabetAsc" .
Defaults to "frequencyDesc" . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
model |
A fitted StringIndexer model returned by ft_string_indexer() |
labels |
Vector of labels, corresponding to indices to be assigned. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns a ml_transformer
,
a ml_estimator
, or one of their subclasses.
The object contains a pointer to
a Spark Transformer
or Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the transformer or estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a transformer is constructed then
immediately applied to the input tbl_spark
, returning a tbl_spark
ml_labels()
returns a vector of labels, corresponding to indices to be assigned.
Details
In the case where x
is a tbl_spark
, the estimator fits against x
to obtain a transformer, which is then immediately used to transform x
, returning a tbl_spark
.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ft_index_to_string
Other feature transformers: ft_binarizer
,
ft_bucketizer
,
ft_chisq_selector
,
ft_count_vectorizer
, ft_dct
,
ft_elementwise_product
,
ft_feature_hasher
,
ft_hashing_tf
, ft_idf
,
ft_imputer
,
ft_index_to_string
,
ft_interaction
, ft_lsh
,
ft_max_abs_scaler
,
ft_min_max_scaler
, ft_ngram
,
ft_normalizer
,
ft_one_hot_encoder_estimator
,
ft_one_hot_encoder
, ft_pca
,
ft_polynomial_expansion
,
ft_quantile_discretizer
,
ft_r_formula
,
ft_regex_tokenizer
,
ft_sql_transformer
,
ft_standard_scaler
,
ft_stop_words_remover
,
ft_tokenizer
,
ft_vector_assembler
,
ft_vector_indexer
,
ft_vector_slicer
, ft_word2vec
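A minimal illustrative sketch, assuming a local connection sc and the iris data; species_idx is an arbitrary output column name.
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Applied to a tbl_spark, the estimator is fit and the data transformed in one step
iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  distinct(Species, species_idx)
# Fit the estimator explicitly to keep the model and recover its labels
indexer_model <- ft_string_indexer(
  sc, input_col = "Species", output_col = "species_idx"
) %>%
  ml_fit(iris_tbl)
ml_labels(indexer_model)
}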
Spark ML -- LinearSVC
Arguments
Value
Details
See also
Examples
Perform classification using linear support vector machines (SVM).
This binary classifier optimizes the Hinge Loss using the OWLQN optimizer.
Only supports L2 regularization currently.
ml_linear_svc(x, formula = NULL, fit_intercept = TRUE, reg_param = 0,
max_iter = 100, standardization = TRUE, weight_col = NULL,
tol = 1e-06, threshold = 0, aggregation_depth = 2,
features_col = "features", label_col = "label",
prediction_col = "prediction", raw_prediction_col = "rawPrediction",
uid = random_string("linear_svc_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
standardization |
Whether to standardize the training features before fitting the model. |
weight_col |
The name of the column to use as weights for the model fit. |
tol |
Param for the convergence tolerance for iterative algorithms. |
threshold |
The threshold in binary classification prediction, in range [0, 1]. |
aggregation_depth |
(Spark 2.1.0+) Suggested depth for treeAggregate (>= 2). |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
raw_prediction_col |
Raw prediction (a.k.a.
confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark
, with formula
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
filter(Species != "setosa") %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
svc_model <- iris_training %>%
ml_linear_svc(Species ~ .)
pred <- ml_predict(svc_model, iris_test)
ml_binary_classification_evaluator(pred)
}
Spark ML -- Model Persistence
Arguments
Value
Save/load Spark ML objects
ml_save(x, path, overwrite = FALSE, ...)
# S3 method for ml_model
ml_save(x, path, overwrite = FALSE,
type = c("pipeline_model", "pipeline"), ...)
ml_load(sc, path)
Arguments
x |
A ML object, which could be a ml_pipeline_stage or a ml_model |
path |
The path where the object is to be serialized/deserialized. |
overwrite |
Whether to overwrite the existing path, defaults to FALSE . |
... |
Optional arguments; currently unused. |
type |
Whether to save the pipeline model or the pipeline. |
sc |
A Spark connection. |
Value
ml_save()
serializes a Spark object into a format that can be read back into sparklyr
or by the Scala or PySpark APIs.
When called on ml_model
objects, i.e.
those that were created via the tbl_spark - formula
signature, the associated pipeline model is serialized.
In other words, the saved model contains both the data processing (RFormulaModel
) stage and the machine learning stage.
ml_load()
reads a saved Spark object into sparklyr
.
It calls the correct Scala load
method based on parsing the saved metadata.
Note that a PipelineModel
object saved from a sparklyr ml_model
via ml_save()
will be read back in as an ml_pipeline_model
, rather than the ml_model
object.
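A minimal sketch, assuming a local connection sc; the temporary path is arbitrary.
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
lr_model <- ml_logistic_regression(iris_tbl, Species ~ .)
# Persist the fitted pipeline model, then read it back as an ml_pipeline_model
model_path <- file.path(tempdir(), "iris_lr_model")
ml_save(lr_model, model_path, overwrite = TRUE)
reloaded <- ml_load(sc, model_path)
ml_predict(reloaded, iris_tbl)
}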
Spark ML -- Pipelines
Arguments
Value
Create Spark ML Pipelines
ml_pipeline(x, ..., uid = random_string("pipeline_"))
Arguments
x |
Either a spark_connection or ml_pipeline_stage objects |
... |
ml_pipeline_stage objects. |
uid |
A character string used to uniquely identify the ML estimator. |
Value
When x
is a spark_connection
, ml_pipeline()
returns an empty pipeline object.
When x
is a ml_pipeline_stage
, ml_pipeline()
returns an ml_pipeline
with the stages set to x
and any transformers or estimators given in ...
.
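A minimal sketch showing both construction styles, assuming a local connection sc:
if (FALSE) {
sc <- spark_connect(master = "local")
# Start from an empty pipeline and pipe stages into it
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_logistic_regression()
# Or construct the pipeline directly from existing stage objects
ml_pipeline(
  ft_r_formula(sc, Species ~ .),
  ml_logistic_regression(sc)
)
}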
Spark ML -- Pipeline stage extraction
Arguments
Value
Extraction of stages from a Pipeline or PipelineModel object.
ml_stage(x, stage)
ml_stages(x, stages = NULL)
Arguments
x |
A ml_pipeline or a ml_pipeline_model object |
stage |
The UID of a stage in the pipeline. |
stages |
The UIDs of stages in the pipeline as a character vector. |
Value
For ml_stage()
: The stage specified.
For ml_stages()
: A list of stages.
If stages
is not set, the function returns all stages of the pipeline in a list.
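A minimal sketch, assuming a local connection sc; the explicit uid is set only to make the stage easy to look up.
if (FALSE) {
sc <- spark_connect(master = "local")
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_logistic_regression(uid = "logreg")
# Extract a single stage by its UID, or all stages as a list
ml_stage(pipeline, "logreg")
ml_stages(pipeline)
}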
Standardize Formula Input for `ml_model`
Arguments
Generates a formula string from user inputs, to be used in the `ml_model` constructor.
ml_standardize_formula(formula = NULL, response = NULL,
features = NULL)
Arguments
formula |
The `formula` argument. |
response |
The `response` argument. |
features |
The `features` argument. |
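An illustrative sketch; the expectation that both calls describe the same model specification is an assumption based on the description above.
if (FALSE) {
# A formula passed directly, versus the deprecated response/features inputs
ml_standardize_formula(formula = Species ~ Petal_Width + Petal_Length)
ml_standardize_formula(
  response = "Species",
  features = c("Petal_Width", "Petal_Length")
)
}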
Spark ML -- Extraction of summary metrics
Arguments
Extracts a metric from the summary object of a Spark ML model.
ml_summary(x, metric = NULL, allow_null = FALSE)
Arguments
x |
A Spark ML model that has a summary. |
metric |
The name of the metric to extract.
If not set, returns the summary object. |
allow_null |
Whether null results are allowed when the metric is not found in the summary. |
Spark ML -- UID
Arguments
Extracts the UID of an ML object.
ml_uid(x)
Arguments
x |
A Spark ML object. |
Feature Transformation -- CountVectorizer (Estimator)
Arguments
Value
Details
See also
Extracts a vocabulary from document collections.
ft_count_vectorizer(x, input_col = NULL, output_col = NULL,
binary = FALSE, min_df = 1, min_tf = 1, vocab_size = 2^18,
uid = random_string("count_vectorizer_"), ...)
ml_vocabulary(model)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
binary |
Binary toggle to control the output vector values.
If TRUE , all nonzero counts (after min_tf filter applied)
are set to 1.
This is useful for discrete probabilistic models that
model binary events rather than integer counts.
Default: FALSE |
min_df |
Specifies the minimum number of different documents a
term must appear in to be included in the vocabulary.
If this is an
integer greater than or equal to 1, this specifies the number of
documents the term must appear in; if this is a double in [0,1), then
this specifies the fraction of documents.
Default: 1. |
min_tf |
Filter to ignore rare words in a document.
For each
document, terms with frequency/count less than the given threshold
are ignored.
If this is an integer greater than or equal to 1, then
this specifies a count (of times the term must appear in the document);
if this is a double in [0,1), then this specifies a fraction (out of
the document's token count).
Default: 1. |
vocab_size |
Build a vocabulary that only considers the top vocab_size terms ordered by term frequency across the corpus.
Default: 2^18 . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
model |
A ml_count_vectorizer_model . |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns a ml_transformer
,
a ml_estimator
, or one of their subclasses.
The object contains a pointer to
a Spark Transformer
or Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the transformer or estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a transformer is constructed then
immediately applied to the input tbl_spark
, returning a tbl_spark
ml_vocabulary()
returns the fitted vocabulary as a character vector.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
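This entry has no Examples section; a minimal sketch (not from the original documentation) follows, assuming a small made-up corpus tokenized with ft_tokenizer() first. The explicit ml_fit() step is shown only to illustrate ml_vocabulary() and is an assumed workflow:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
docs <- data.frame(text = c("the cat sat on the mat", "the dog sat"))
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)
tokens_tbl <- docs_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words")
# apply the estimator directly to a tbl_spark
tokens_tbl %>%
  ft_count_vectorizer(input_col = "words", output_col = "features", min_df = 1)
# or fit it explicitly to inspect the learned vocabulary
cv_model <- ml_fit(ft_count_vectorizer(sc, input_col = "words", output_col = "features"),
                   tokens_tbl)
ml_vocabulary(cv_model)
}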
Feature Transformation -- Binarizer (Transformer)
Arguments
Value
See also
Examples
Apply thresholding to a column, such that values less than or equal to the threshold
are assigned the value 0.0, and values greater than the
threshold are assigned the value 1.0.
Column output is numeric for
compatibility with other modeling functions.
ft_binarizer(x, input_col, output_col, threshold = 0,
uid = random_string("binarizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
threshold |
Threshold used to binarize continuous features. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_binarizer(input_col = "Sepal_Length",
               output_col = "Sepal_Length_bin",
               threshold = 5) %>%
  select(Sepal_Length, Sepal_Length_bin, Species)
}
Feature Transformation -- Bucketizer (Transformer)
Arguments
Value
See also
Examples
Similar to R's cut
function, this transforms a numeric column
into a discretized column, with breaks specified through the splits
parameter.
ft_bucketizer(x, input_col = NULL, output_col = NULL, splits = NULL,
input_cols = NULL, output_cols = NULL, splits_array = NULL,
handle_invalid = "error", uid = random_string("bucketizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
splits |
A numeric vector of cutpoints, indicating the bucket boundaries. |
input_cols |
Names of input columns. |
output_cols |
Names of output columns. |
splits_array |
Parameter for specifying multiple splits parameters.
Each
element in this array can be used to map continuous features into buckets. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries.
Options are
'skip' (filter out rows with invalid values), 'error' (throw an error), or
'keep' (keep invalid values in a special additional bucket).
Default: "error" |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_bucketizer(input_col = "Sepal_Length",
                output_col = "Sepal_Length_bucket",
                splits = c(0, 4.5, 5, 8)) %>%
  select(Sepal_Length, Sepal_Length_bucket, Species)
}
Feature Transformation -- ChiSqSelector (Estimator)
Arguments
Value
Details
See also
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.
ft_chisq_selector(x, features_col = "features", output_col = NULL,
label_col = "label", selector_type = "numTopFeatures", fdr = 0.05,
fpr = 0.05, fwe = 0.05, num_top_features = 50, percentile = 0.1,
uid = random_string("chisq_selector_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
features_col |
Features column name, as a length-one character vector.
The column should be single vector column of numeric values.
Usually this column is output by ft_r_formula . |
output_col |
The name of the output column. |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
selector_type |
(Spark 2.1.0+) The selector type of the ChisqSelector.
Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe". |
fdr |
(Spark 2.2.0+) The upper bound of the expected false discovery rate.
Only applicable when selector_type = "fdr".
Default value is 0.05. |
fpr |
(Spark 2.1.0+) The highest p-value for features to be kept.
Only applicable when selector_type= "fpr".
Default value is 0.05. |
fwe |
(Spark 2.2.0+) The upper bound of the expected family-wise error rate.
Only applicable when selector_type = "fwe".
Default value is 0.05. |
num_top_features |
Number of features that selector will select, ordered by ascending p-value.
If the number of features is less than num_top_features , then this will select all features.
Only applicable when selector_type = "numTopFeatures".
The default value of num_top_features is 50. |
percentile |
(Spark 2.1.0+) Percentile of features that selector will select, ordered by statistics value descending.
Only applicable when selector_type = "percentile".
Default value is 0.1. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
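No Examples section is provided here; the sketch below (not from the original documentation) builds the features and label columns with ft_r_formula(), as the argument descriptions suggest, and keeps only the single strongest feature:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_r_formula(Species ~ Petal_Length + Petal_Width) %>%
  ft_chisq_selector(features_col = "features", label_col = "label",
                    output_col = "selected", num_top_features = 1)
}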
Feature Transformation -- Discrete Cosine Transform (DCT) (Transformer)
Arguments
Value
Details
See also
A feature transformer that takes the 1D discrete cosine transform of a real
vector.
No zero padding is performed on the input vector.
It returns a real
vector of the same length representing the DCT.
The return vector is scaled
such that the transform matrix is unitary (aka scaled DCT-II).
ft_dct(x, input_col = NULL, output_col = NULL, inverse = FALSE,
uid = random_string("dct_"), ...)
ft_discrete_cosine_transform(x, input_col, output_col, inverse = FALSE,
uid = random_string("dct_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
inverse |
Indicates whether to perform the inverse DCT (TRUE) or forward DCT (FALSE). |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
ft_discrete_cosine_transform() is an alias for ft_dct for backwards compatibility.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
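A minimal sketch (not from the original documentation), assembling the iris measurements into a vector column first, in the same style as the scaler examples elsewhere on this site:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features, output_col = "features_vec") %>%
  ft_dct(input_col = "features_vec", output_col = "features_dct")
}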
Feature Transformation -- ElementwiseProduct (Transformer)
Arguments
Value
See also
Outputs the Hadamard product (i.e., the element-wise product) of each input vector
with a provided "weight" vector.
In other words, it scales each column of the
dataset by a scalar multiplier.
ft_elementwise_product(x, input_col = NULL, output_col = NULL,
scaling_vec = NULL, uid = random_string("elementwise_product_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
scaling_vec |
the vector to multiply with input vectors |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
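A minimal sketch (not from the original documentation); passing scaling_vec as a plain numeric vector, with one weight per assembled column, is an assumption about the R-side interface:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features, output_col = "features_vec") %>%
  ft_elementwise_product(input_col = "features_vec", output_col = "features_scaled",
                         scaling_vec = c(2, 1, 0.5, 1))  # assumed: one weight per column
}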
Feature Transformation -- FeatureHasher (Transformer)
Arguments
Value
Details
See also
Projects a set of categorical or numerical features into a feature vector of a specified dimension using the hashing trick.
ft_feature_hasher(x, input_cols = NULL, output_col = NULL,
num_features = 2^18, categorical_cols = NULL,
uid = random_string("feature_hasher_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
Names of input columns. |
output_col |
Name of output column. |
num_features |
Number of features.
Defaults to \(2^18\). |
categorical_cols |
Numeric columns to treat as categorical features.
By default only string and boolean columns are treated as categorical,
so this param can be used to explicitly specify the numerical columns to
treat as categorical. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categorical_cols.
- String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with drop_last = FALSE).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the num_features parameter; otherwise the features will not be mapped evenly to the vector indices.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
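A minimal sketch (not from the original documentation), hashing one string column and two numeric columns of iris into a feature vector; num_features = 2^10 is an arbitrary power of two chosen only for illustration:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_feature_hasher(input_cols = c("Species", "Petal_Length", "Petal_Width"),
                    output_col = "features", num_features = 2^10)
}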
Feature Transformation -- HashingTF (Transformer)
Arguments
Value
See also
Maps a sequence of terms to their term frequencies using the hashing trick.
ft_hashing_tf(x, input_col = NULL, output_col = NULL, binary = FALSE,
num_features = 2^18, uid = random_string("hashing_tf_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
binary |
Binary toggle to control term frequency counts.
If true, all non-zero counts are set to 1.
This is useful for discrete
probabilistic models that model binary events rather than integer
counts.
(default = FALSE ) |
num_features |
Number of features.
Should be greater than 0.
(default = 2^18 ) |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
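A minimal sketch (not from the original documentation), tokenizing a made-up text column before hashing:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("spark is fast", "sparklyr talks to spark from R"))
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_hashing_tf(input_col = "words", output_col = "tf", num_features = 2^10)
}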
Feature Transformation -- IDF (Estimator)
Arguments
Value
Details
See also
Compute the Inverse Document Frequency (IDF) given a collection of documents.
ft_idf(x, input_col = NULL, output_col = NULL, min_doc_freq = 0,
uid = random_string("idf_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
min_doc_freq |
The minimum number of documents in which a term should appear.
Default: 0 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
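A minimal sketch (not from the original documentation): IDF is typically applied to the term-frequency vectors produced by ft_hashing_tf() or ft_count_vectorizer():
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("spark is fast", "sparklyr talks to spark from R"))
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_hashing_tf(input_col = "words", output_col = "tf", num_features = 2^10) %>%
  ft_idf(input_col = "tf", output_col = "tf_idf")
}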
Feature Transformation -- Imputer (Estimator)
Arguments
Value
Details
See also
Imputation estimator for completing missing values, either using the mean or
the median of the columns in which the missing values are located.
The input
columns should be of numeric type.
This function requires Spark 2.2.0+.
ft_imputer(x, input_cols = NULL, output_cols = NULL,
missing_value = NULL, strategy = "mean",
uid = random_string("imputer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
The names of the input columns |
output_cols |
The names of the output columns. |
missing_value |
The placeholder for the missing values.
All occurrences of missing_value will be imputed.
Note that null values are always treated
as missing. |
strategy |
The imputation strategy.
Currently only "mean" and "median" are
supported.
If "mean", then replace missing values using the mean value of the
feature.
If "median", then replace missing values using the approximate median
value of the feature.
Default: mean |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
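A minimal sketch (not from the original documentation); NA values copied to Spark become nulls, which the imputer always treats as missing. This assumes a Spark 2.2.0+ connection:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
df <- data.frame(a = c(1, 2, NA, 4), b = c(NA, 10, 20, 30))
df_tbl <- sdf_copy_to(sc, df, name = "df_tbl", overwrite = TRUE)
df_tbl %>%
  ft_imputer(input_cols = c("a", "b"),
             output_cols = c("a_imputed", "b_imputed"),
             strategy = "median")
}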
Feature Transformation -- IndexToString (Transformer)
Arguments
Value
See also
A Transformer that maps a column of indices back to a new column of
corresponding string values.
The index-string mapping is either from
the ML attributes of the input column, or from user-supplied labels
(which take precedence over ML attributes).
This function is the inverse
of ft_string_indexer
.
ft_index_to_string(x, input_col = NULL, output_col = NULL,
labels = NULL, uid = random_string("index_to_string_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
labels |
Optional param for array of labels specifying index-string mapping. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ft_string_indexer
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
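A minimal sketch (not from the original documentation), indexing a string column and then mapping the indices back to strings:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  ft_index_to_string(input_col = "species_idx", output_col = "species_str") %>%
  select(Species, species_idx, species_str)
}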
Feature Transformation -- Interaction (Transformer)
Arguments
Value
See also
Implements the feature interaction transform.
This transformer takes in Double and
Vector type columns and outputs a flattened vector of their feature interactions.
To handle interaction, we first one-hot encode any nominal features.
Then, a
vector of the feature cross-products is produced.
ft_interaction(x, input_cols = NULL, output_col = NULL,
uid = random_string("interaction_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
The names of the input columns |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
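A minimal sketch (not from the original documentation), crossing two numeric columns into a flattened interaction vector:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_interaction(input_cols = c("Sepal_Length", "Sepal_Width"),
                 output_col = "sepal_interaction")
}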
Feature Transformation -- LSH (Estimator)
Arguments
Value
Details
See also
Locality Sensitive Hashing functions for Euclidean distance
(Bucketed Random Projection) and Jaccard distance (MinHash).
ft_bucketed_random_projection_lsh(x, input_col = NULL,
output_col = NULL, bucket_length = NULL, num_hash_tables = 1,
seed = NULL, uid = random_string("bucketed_random_projection_lsh_"),
...)
ft_minhash_lsh(x, input_col = NULL, output_col = NULL,
num_hash_tables = 1L, seed = NULL,
uid = random_string("minhash_lsh_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
bucket_length |
The length of each hash bucket, a larger bucket lowers the
false negative rate.
The number of buckets will be (max L2 norm of input vectors) /
bucketLength. |
num_hash_tables |
Number of hash tables used in LSH OR-amplification.
LSH
OR-amplification can be used to reduce the false negative rate.
Higher values
for this param lead to a reduced false negative rate, at the expense of added
computational complexity. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ft_lsh_utils
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
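A minimal sketch (not from the original documentation) of the Euclidean-distance variant; bucket_length = 2, num_hash_tables = 3, and the seed are arbitrary illustrative choices, and the assembler call mirrors the scaler examples on this site:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features, output_col = "features_vec") %>%
  ft_bucketed_random_projection_lsh(input_col = "features_vec", output_col = "hashes",
                                    bucket_length = 2, num_hash_tables = 3, seed = 42)
}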
Feature Transformation -- MaxAbsScaler (Estimator)
Arguments
Value
Details
See also
Examples
Rescale each feature individually to range [-1, 1] by dividing through the
largest maximum absolute value in each feature.
It does not shift/center the
data, and thus does not destroy any sparsity.
ft_max_abs_scaler(x, input_col = NULL, output_col = NULL,
uid = random_string("max_abs_scaler_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features,
                      output_col = "features_temp") %>%
  ft_max_abs_scaler(input_col = "features_temp",
                    output_col = "features")
}
Feature Transformation -- MinMaxScaler (Estimator)
Arguments
Value
Details
See also
Examples
Rescale each feature individually to a common range [min, max] linearly using
column summary statistics, which is also known as min-max normalization or
Rescaling
ft_min_max_scaler(x, input_col = NULL, output_col = NULL, min = 0,
max = 1, uid = random_string("min_max_scaler_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
min |
Lower bound after transformation, shared by all features Default: 0.0 |
max |
Upper bound after transformation, shared by all features Default: 1.0 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features,
                      output_col = "features_temp") %>%
  ft_min_max_scaler(input_col = "features_temp",
                    output_col = "features")
}
Feature Transformation -- NGram (Transformer)
Arguments
Value
Details
See also
A feature transformer that converts the input array of strings into an array of n-grams.
Null values in the input array are ignored.
It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
ft_ngram(x, input_col = NULL, output_col = NULL, n = 2,
uid = random_string("ngram_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
n |
Minimum n-gram length, greater than or equal to 1.
Default: 2, bigram features |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
When the input is empty, an empty array is returned.
When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
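A minimal sketch (not from the original documentation), tokenizing a made-up sentence and then producing bigrams:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("the quick brown fox jumps over the lazy dog"))
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_ngram(input_col = "words", output_col = "bigrams", n = 2)
}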
Feature Transformation -- Normalizer (Transformer)
Arguments
Value
See also
Normalize a vector to have unit norm using the given p-norm.
ft_normalizer(x, input_col = NULL, output_col = NULL, p = 2,
uid = random_string("normalizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
p |
Normalization in L^p space.
Must be >= 1.
Defaults to 2. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Feature Transformation -- OneHotEncoder (Transformer)
Arguments
Value
See also
One-hot encoding maps a column of label indices to a column of binary
vectors, with at most a single one-value.
This encoding allows algorithms
which expect continuous features, such as Logistic Regression, to use
categorical features.
Typically, used with ft_string_indexer()
to
index a column first.
ft_one_hot_encoder(x, input_col = NULL, output_col = NULL,
drop_last = TRUE, uid = random_string("one_hot_encoder_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
drop_last |
Whether to drop the last category.
Defaults to TRUE . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
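A minimal sketch (not from the original documentation), following the ft_string_indexer() then ft_one_hot_encoder() pattern described above:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  ft_one_hot_encoder(input_col = "species_idx", output_col = "species_onehot")
}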
Feature Transformation -- OneHotEncoderEstimator (Estimator)
Arguments
Value
Details
See also
A one-hot encoder that maps a column of category indices
to a column of binary vectors, with at most a single one-value
per row that indicates the input category index.
For example
with 5 categories, an input value of 2.0 would map to an output
vector of [0.0, 0.0, 1.0, 0.0].
The last category is not included
by default (configurable via dropLast), because it makes the
vector entries sum up to one, and hence linearly dependent.
So
an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
ft_one_hot_encoder_estimator(x, input_cols = NULL, output_cols = NULL,
handle_invalid = "error", drop_last = TRUE,
uid = random_string("one_hot_encoder_estimator_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
Names of input columns. |
output_cols |
Names of output columns. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries.
Options are
'skip' (filter out rows with invalid values), 'error' (throw an error), or
'keep' (keep invalid values in a special additional bucket).
Default: "error" |
drop_last |
Whether to drop the last category.
Defaults to TRUE . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Feature Transformation -- PolynomialExpansion (Transformer)
Arguments
Value
See also
Perform feature expansion in a polynomial space. For example, take a 2-variable feature vector (x, y): expanding it with degree 2 yields (x, x * x, y, x * y, y * y).
ft_polynomial_expansion(x, input_col = NULL, output_col = NULL,
degree = 2, uid = random_string("polynomial_expansion_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
degree |
The polynomial degree to expand, which should be greater
than equal to 1.
A value of 1 means no expansion.
Default: 2 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Feature Transformation -- QuantileDiscretizer (Estimator)
Arguments
Value
Details
See also
ft_quantile_discretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the num_buckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
ft_quantile_discretizer(x, input_col = NULL, output_col = NULL,
num_buckets = 2, input_cols = NULL, output_cols = NULL,
num_buckets_array = NULL, handle_invalid = "error",
relative_error = 0.001, uid = random_string("quantile_discretizer_"),
...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
num_buckets |
Number of buckets (quantiles, or categories) into which data
points are grouped.
Must be greater than or equal to 2. |
input_cols |
Names of input columns. |
output_cols |
Names of output columns. |
num_buckets_array |
Array of number of buckets (quantiles, or categories)
into which data points are grouped.
Each value must be greater than or equal to 2. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries.
Options are
'skip' (filter out rows with invalid values), 'error' (throw an error), or
'keep' (keep invalid values in a special additional bucket).
Default: "error" |
relative_error |
(Spark 2.0.0+) Relative error (see documentation for
org.apache.spark.sql.DataFrameStatFunctions.approxQuantile
here
for description).
Must be in the range [0, 1].
default: 0.001 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handle_invalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relative_error parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values. Note that the result may be different every time you run it, since the sample strategy behind it is non-deterministic.
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ft_bucketizer
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
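A minimal sketch (not from the original documentation), binning one numeric column into quartiles:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_quantile_discretizer(input_col = "Petal_Length", output_col = "petal_bin",
                          num_buckets = 4) %>%
  select(Petal_Length, petal_bin)
}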
Feature Transformation -- RFormula (Estimator)
Arguments
Value
Details
See also
Implements the transforms required for fitting a dataset against an R model
formula.
Currently we support a limited subset of the R operators, including ~, ., :, +, and -. Also see the R formula docs here:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html
ft_r_formula(x, formula = NULL, features_col = "features",
label_col = "label", force_index_label = FALSE,
uid = random_string("r_formula_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
R formula as a character string or a formula.
Formula objects are
converted to character strings directly and the environment is not captured. |
features_col |
Features column name, as a length-one character vector.
The column should be single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
force_index_label |
(Spark 2.1.0+) Force to index label whether it is numeric or
string type.
Usually we index label only when it is string type.
If
the formula was used by classification algorithms, we can force to index
label even it is numeric type by setting this param with true.
Default: FALSE . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
The basic operators in the formula are:
~ separate target and terms
+ concat terms, "+ 0" means removing intercept
- remove a term, "- 1" means removing intercept
: interaction (multiplication for numeric values, or binarized categorical values)
. all columns except target
Suppose a and b are double columns; the following simple examples illustrate the effect of RFormula:
y ~ a + b means model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept and w1, w2 are coefficients.
y ~ a + b + a:b - 1 means model y ~ w1 * a + w2 * b + w3 * a * b, where w1, w2, w3 are coefficients.
RFormula produces a vector column of features and a double or string column of label. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If the label column is of type string, it will be first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
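The following is a minimal, hedged sketch (not part of the original reference) showing ft_r_formula applied to a local copy of the iris dataset; the connection, table name, and column names are assumptions.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Produces a "features" vector column and an indexed "label" column
iris_tbl %>%
  ft_r_formula(Species ~ Petal_Length + Petal_Width)
}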
Feature Transformation -- RegexTokenizer (Transformer)
Arguments
Value
See also
A regex based tokenizer that extracts tokens either by using the provided
regex pattern to split the text (default) or repeatedly matching the regex
(if gaps
is false).
Optional parameters also allow filtering tokens using a
minimal length.
It returns an array of strings that can be empty.
ft_regex_tokenizer(x, input_col = NULL, output_col = NULL,
gaps = TRUE, min_token_length = 1, pattern = "\\s+",
to_lower_case = TRUE, uid = random_string("regex_tokenizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
gaps |
Indicates whether regex splits on gaps (TRUE) or matches tokens (FALSE). |
min_token_length |
Minimum token length, greater than or equal to 0. |
pattern |
The regular expression pattern to be used. |
to_lower_case |
Indicates whether to convert all characters to lowercase before tokenizing. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
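A minimal sketch, not from the original reference, assuming a local connection and a small hypothetical sentences table; it splits on any run of non-word characters.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("Hello, Spark!", "Regex-based tokenization."),
                        stringsAsFactors = FALSE)
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_regex_tokenizer(input_col = "text", output_col = "tokens", pattern = "\\W+")
}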
Feature Transformation -- StandardScaler (Estimator)
Arguments
Value
Details
See also
Examples
Standardizes features by removing the mean and scaling to unit variance using
column summary statistics on the samples in the training set.
The "unit std"
is computed using the corrected sample standard deviation, which is computed
as the square root of the unbiased sample variance.
ft_standard_scaler(x, input_col = NULL, output_col = NULL,
with_mean = FALSE, with_std = TRUE,
uid = random_string("standard_scaler_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
with_mean |
Whether to center the data with mean before scaling.
It will
build a dense output, so take care when applying to sparse input.
Default: FALSE |
with_std |
Whether to scale the data to unit standard deviation.
Default: TRUE |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
ft_vector_assembler(input_col = features,
output_col = "features_temp") %>%
ft_standard_scaler(input_col = "features_temp",
output_col = "features",
with_mean = TRUE)
}
Feature Transformation -- StopWordsRemover (Transformer)
Arguments
Value
See also
A feature transformer that filters out stop words from input.
ft_stop_words_remover(x, input_col = NULL, output_col = NULL,
case_sensitive = FALSE,
stop_words = ml_default_stop_words(spark_connection(x), "english"),
uid = random_string("stop_words_remover_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
case_sensitive |
Whether to do a case sensitive comparison over the stop words. |
stop_words |
The words to be filtered out. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ml_default_stop_words
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
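A minimal sketch, not from the original reference, assuming a local connection; the sentences table and column names are hypothetical, and the default English stop words are used.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("The quick brown fox", "jumps over the lazy dog"),
                        stringsAsFactors = FALSE)
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_stop_words_remover(input_col = "words", output_col = "terms")
}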
Feature Transformation -- Tokenizer (Transformer)
Arguments
Value
See also
A tokenizer that converts the input string to lowercase and then splits it
by white spaces.
ft_tokenizer(x, input_col = NULL, output_col = NULL,
uid = random_string("tokenizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
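A minimal sketch, not from the original reference, assuming a local connection and a hypothetical one-column text table.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sentences_tbl <- sdf_copy_to(sc,
  data.frame(text = "Spark is fun", stringsAsFactors = FALSE),
  name = "sentences_tbl", overwrite = TRUE)
# Lowercases the text and splits it on whitespace into a "words" array column
sentences_tbl %>% ft_tokenizer(input_col = "text", output_col = "words")
}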
Feature Transformation -- VectorAssembler (Transformer)
Arguments
Value
See also
Combine multiple vectors into a single row-vector; that is,
where each row element of the newly generated column is a
vector formed by concatenating each row element from the
specified input columns.
ft_vector_assembler(x, input_cols = NULL, output_col = NULL,
uid = random_string("vector_assembler_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
The names of the input columns |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_indexer, ft_vector_slicer, ft_word2vec
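A minimal sketch, not from the original reference, assuming a local connection and the iris dataset copied to Spark (column names follow sparklyr's underscore convention).
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features")
}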
Feature Transformation -- VectorIndexer (Estimator)
Arguments
Value
Details
See also
Indexing categorical feature columns in a dataset of Vector.
ft_vector_indexer(x, input_col = NULL, output_col = NULL,
max_categories = 20, uid = random_string("vector_indexer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
max_categories |
Threshold for the number of values a categorical feature can take.
If a feature is found to have > max_categories values, then it is declared continuous.
Must be greater than or equal to 2.
Defaults to 20. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_slicer, ft_word2vec
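A minimal sketch, not from the original reference, assuming a local connection and the iris dataset; the assembled vector column is indexed with a hypothetical max_categories threshold.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_vector_assembler(input_cols = c("Sepal_Length", "Sepal_Width"),
                      output_col = "features") %>%
  ft_vector_indexer(input_col = "features", output_col = "indexed",
                    max_categories = 10)
}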
Feature Transformation -- VectorSlicer (Transformer)
Arguments
Value
See also
Takes a feature vector and outputs a new feature vector with a subarray of the original features.
ft_vector_slicer(x, input_col = NULL, output_col = NULL,
indices = NULL, uid = random_string("vector_slicer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
indices |
A vector of indices to select features from a vector column.
Note that the indices are 0-based. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_word2vec
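A minimal sketch, not from the original reference, assuming a local connection and the iris dataset; the 0-based indices 2 and 3 select the petal measurements from the assembled vector.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features") %>%
  ft_vector_slicer(input_col = "features", output_col = "petal_features",
                   indices = c(2L, 3L))
}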
Feature Transformation -- SQLTransformer
Arguments
Value
Details
See also
Implements the transformations which are defined by a SQL statement.
Currently we only support SQL syntax like 'SELECT ... FROM __THIS__ ...' where '__THIS__' represents the underlying table of the input dataset.
The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports.
Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns.
ft_sql_transformer(x, statement = NULL,
uid = random_string("sql_transformer_"), ...)
ft_dplyr_transformer(x, tbl, uid = random_string("dplyr_transformer_"),
...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
statement |
A SQL statement. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
tbl |
A tbl_spark generated using dplyr transformations. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
ft_dplyr_transformer() is a wrapper around ft_sql_transformer() that takes a tbl_spark instead of a SQL statement.
Internally, ft_dplyr_transformer() extracts the dplyr transformations used to generate tbl as a SQL statement, then passes it on to ft_sql_transformer().
Note that only single-table dplyr verbs are supported; the sdf_ family of functions is not.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
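A minimal sketch, not from the original reference, assuming a local connection and the iris dataset; the derived column name is hypothetical. The first pipeline uses a raw SQL statement against __THIS__, the second derives the equivalent SQL from a dplyr pipeline.
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# SQL variant: __THIS__ stands in for the input dataset
iris_tbl %>%
  ft_sql_transformer("SELECT *, Petal_Length * Petal_Width AS petal_area FROM __THIS__")
# dplyr variant: wrap a dplyr-derived tbl_spark into a pipeline stage
transformed <- iris_tbl %>% mutate(petal_area = Petal_Length * Petal_Width)
pipeline <- ml_pipeline(sc) %>% ft_dplyr_transformer(transformed)
}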
Compile Scala sources into a Java Archive (jar)
Arguments
Compile the scala source files contained within an R package into a Java Archive (jar) file that can be loaded and used within a Spark environment.
compile_package_jars(..., spec = NULL)
Arguments
... |
Optional compilation specifications, as generated by spark_compilation_spec .
When no arguments are passed, spark_default_compilation_spec is used instead. |
spec |
An optional list of compilation specifications.
When
set, this option takes precedence over arguments passed to ... . |
Read configuration values for a connection
Arguments
Value
Read configuration values for a connection
connection_config(sc, prefix, not_prefix = list())
Arguments
sc |
spark_connection |
prefix |
Prefix to read parameters for
(e.g. spark.context. , spark.sql. , etc.) |
not_prefix |
Prefix to not include. |
Value
Named list of config parameters (note that if a prefix was
specified then the names will not include the prefix)
Downloads default Scala Compilers
Arguments
Details
compile_package_jars requires several versions of the scala compiler to work; this is to match Spark scala versions.
To help set up your environment, this function will download the required compilers under the default search path.
download_scalac(dest_path = NULL)
Arguments
dest_path |
The destination path where scalac will be
downloaded to. |
Details
See find_scalac
for a list of paths searched and used by
this function to install the required compilers.
Discover the Scala Compiler
Arguments
Find the scalac compiler for a particular version of scala, by scanning some common directories containing scala installations.
find_scalac(version, locations = NULL)
Arguments
version |
The scala version to search for.
Versions
of the form major.minor will be matched against the scalac installation with version major.minor.patch ;
if multiple compilers are discovered the most recent one will be
used. |
locations |
Additional locations to scan.
By default, the
directories /opt/scala and /usr/local/scala will
be scanned. |
Access the Spark API
Arguments
Details
Spark Context
Java Spark Context
Hive Context
Spark Session
Access the commonly-used Spark objects associated with a Spark instance.
These objects provide access to different facets of the Spark API.
spark_context(sc)
java_context(sc)
hive_context(sc)
spark_session(sc)
Arguments
Details
The Scala API documentation
is useful for discovering what methods are available for each of these
objects.
Use invoke
to call methods on these objects.
Spark Context
The main entry point for Spark functionality.
The Spark Context
represents the connection to a Spark cluster, and can be used to create RDD
s, accumulators and broadcast variables on that cluster.
Java Spark Context
A Java-friendly version of the aforementioned Spark Context.
Hive Context
An instance of the Spark SQL execution engine that integrates with data
stored in Hive.
Configuration for Hive is read from hive-site.xml
on
the classpath.
Starting with Spark >= 2.0.0, the Hive Context class has been deprecated -- it is superseded by the Spark Session class, and hive_context will return a Spark Session object instead.
Note that both classes share a SQL interface, and therefore one can invoke
SQL through these objects.
Spark Session
Available since Spark 2.0.0, the Spark Session unifies the
Spark Context and Hive Context classes into a single
interface.
Its use is recommended over the older APIs for code
targeting Spark 2.0.0 and above.
Runtime configuration interface for Hive
Arguments
Retrieves the runtime configuration interface for Hive.
hive_context_config(sc)
Arguments
Invoke a Method on a JVM Object
Arguments
Details
Examples
Invoke methods on Java object references.
These functions provide a
mechanism for invoking various Java object methods directly from R.
invoke(jobj, method, ...)
invoke_static(sc, class, method, ...)
invoke_new(sc, class, ...)
Arguments
jobj |
An R object acting as a Java object reference (typically, a spark_jobj ). |
method |
The name of the method to be invoked. |
... |
Optional arguments, currently unused. |
sc |
A spark_connection . |
class |
The name of the Java class whose methods should be invoked. |
Details
Use each of these functions in the following scenarios:
invoke | Execute a method on a Java object reference (typically, a spark_jobj ). |
invoke_static | Execute a static method associated with a Java class. |
invoke_new | Invoke a constructor associated with a Java class. |
Examples
sc <- spark_connect(master = "spark://HOST:PORT")
spark_context(sc) %>%
invoke("textFile", "file.csv", 1L) %>%
invoke("count")
Register a Package that Implements a Spark Extension
Arguments
Note
Registering an extension package will result in the package being
automatically scanned for spark dependencies when a connection to Spark is
created.
register_extension(package)
registered_extensions()
Arguments
package |
The package(s) to register. |
Note
Packages should typically register their extensions in their
.onLoad
hook -- this ensures that their extensions are registered
when their namespaces are loaded.
Define a Spark Compilation Specification
Arguments
Details
For use with compile_package_jars
.
The Spark compilation
specification is used when compiling Spark extension Java Archives, and
defines which versions of Spark, as well as which versions of Scala, should
be used for compilation.
spark_compilation_spec(spark_version = NULL, spark_home = NULL,
scalac_path = NULL, scala_filter = NULL, jar_name = NULL,
jar_path = NULL, jar_dep = NULL)
Arguments
spark_version |
The Spark version to build against.
This can
be left unset if the path to a suitable Spark home is supplied. |
spark_home |
The path to a Spark home installation.
This can
be left unset if spark_version is supplied; in such a case, sparklyr will attempt to discover the associated Spark
installation using spark_home_dir . |
scalac_path |
The path to the scalac compiler to be used
during compilation of your Spark extension.
Note that you should
ensure the version of scalac selected matches the version of scalac used with the version of Spark you are compiling against. |
scala_filter |
An optional R function that can be used to filter
which scala files are used during compilation.
This can be
useful if you have auxiliary files that should only be included with
certain versions of Spark. |
jar_name |
The name to be assigned to the generated jar . |
jar_path |
The path to the jar tool to be used
during compilation of your Spark extension. |
jar_dep |
An optional list of additional jar dependencies. |
Details
Most Spark extensions won't need to define their own compilation specification,
and can instead rely on the default behavior of compile_package_jars
.
Default Compilation Specification for Spark Extensions
Arguments
This is the default compilation specification used for
Spark extensions, when used with compile_package_jars
.
spark_default_compilation_spec(pkg = infer_active_package_name(),
locations = NULL)
Arguments
pkg |
The package containing Spark extensions to be compiled. |
locations |
Additional locations to scan.
By default, the
directories /opt/scala and /usr/local/scala will
be scanned. |
Retrieve the Spark Connection Associated with an R Object
Arguments
Retrieve the spark_connection
associated with an R object.
spark_connection(x, ...)
Arguments
x |
An R object from which a spark_connection can be obtained. |
... |
Optional arguments; currently unused. |
Runtime configuration interface for the Spark Context.
Arguments
Retrieves the runtime configuration interface for the Spark Context.
spark_context_config(sc)
Arguments
Retrieve a Spark DataFrame
Arguments
Value
This S3 generic is used to access a Spark DataFrame object (as a Java
object reference) from an R object.
spark_dataframe(x, ...)
Arguments
x |
An R object wrapping, or containing, a Spark DataFrame. |
... |
Optional arguments; currently unused. |
Value
A spark_jobj
representing a Java object reference
to a Spark DataFrame.
Define a Spark dependency
Arguments
Value
Define a Spark dependency consisting of a set of custom JARs and Spark packages.
spark_dependency(jars = NULL, packages = NULL, initializer = NULL,
catalog = NULL, repositories = NULL, ...)
Arguments
jars |
Character vector of full paths to JAR files. |
packages |
Character vector of Spark packages names. |
initializer |
Optional callback function called when initializing a connection. |
catalog |
Optional location where extension JAR files can be downloaded for Livy. |
repositories |
Character vector of Spark package repositories. |
... |
Additional optional arguments. |
Value
An object of type `spark_dependency`
Set the SPARK_HOME environment variable
Arguments
Value
Examples
Set the SPARK_HOME
environment variable.
This slightly speeds up some
operations, including the connection time.
spark_home_set(path = NULL, ...)
Arguments
path |
A string containing the path to the installation location of
Spark.
If NULL , the path to the latest Spark/Hadoop version is
used. |
... |
Additional parameters not currently used. |
Value
The function is mostly invoked for the side-effect of setting the SPARK_HOME
environment variable.
It also returns TRUE
if the
environment was successfully set, and FALSE
otherwise.
Examples
if (FALSE) {
# Not run due to side-effects
spark_home_set()
}
Retrieve a Spark JVM Object Reference
Arguments
See also
This S3 generic is used for accessing the underlying Java Virtual Machine
(JVM) Spark objects associated with R objects.
These objects act as
references to Spark objects living in the JVM.
Methods on these objects
can be called with the invoke
family of functions.
spark_jobj(x, ...)
Arguments
x |
An R object containing, or wrapping, a spark_jobj . |
... |
Optional arguments; currently unused. |
See also
invoke
, for calling methods on Java object references.
Get the Spark Version Associated with a Spark Connection
Arguments
Value
Details
Retrieve the version of Spark associated with a Spark connection.
spark_version(sc)
Arguments
Value
The Spark version as a numeric_version
.
Details
Suffixes for e.g.
preview versions, or snapshotted versions,
are trimmed -- if you require the full Spark version, you can
retrieve it with invoke(spark_context(sc), "version")
.
Apply an R Function in Spark
Arguments
Configuration
Examples
Applies an R function to a Spark object (typically, a Spark DataFrame).
spark_apply(x, f, columns = NULL, memory = !is.null(name),
group_by = NULL, packages = NULL, context = NULL, name = NULL,
...)
Arguments
x |
An object (usually a spark_tbl ) coercable to a Spark DataFrame. |
f |
A function that transforms a data frame partition into a data frame.
The function f has signature f(df, context, group1, group2, ...) where
df is a data frame with the data to be processed, context
is an optional object passed as the context parameter, and group1 to
groupN contain the values of the group_by columns.
When group_by is not specified, f takes only one argument.
Can also be an rlang anonymous function; for example, ~ .x + 1
defines an expression that adds one to the given .x data frame. |
columns |
A vector of column names or a named vector of column types for
the transformed object.
When not specified, a sample of 10 rows is taken to infer the output columns automatically; to avoid this performance penalty, specify the column types.
The sample size is configurable using the sparklyr.apply.schema.infer configuration option. |
memory |
Boolean; should the table be cached into memory? |
group_by |
Column name used to group by data frame partitions. |
packages |
Boolean to distribute .libPaths() packages to each node,
a list of packages to distribute, or a package bundle created with
spark_apply_bundle() .
Defaults to TRUE or the sparklyr.apply.packages value set in
spark_config() .
For clusters using Yarn cluster mode, packages can point to a package
bundle created using spark_apply_bundle() and made available as a Spark
file using config$sparklyr.shell.files .
For clusters using Livy, packages
can be manually installed on the driver node.
For offline clusters where available.packages() is not available,
manually download the packages database from
https://cran.r-project.org/web/packages/packages.rds and set
Sys.setenv(sparklyr.apply.packagesdb = "<path-to-rds>") .
Otherwise,
all packages will be used by default.
For clusters where R packages are already installed in every worker node,
the spark.r.libpaths config entry can be set in spark_config()
to the local packages library.
To specify multiple paths collapse them
(without spaces) with a comma delimiter (e.g., "/lib/path/one,/lib/path/two" ). |
context |
Optional object to be serialized and passed back to f() . |
name |
Optional table name while registering the resulting data frame. |
... |
Optional arguments; currently unused. |
Configuration
spark_config() settings can be specified to change the workers' environment.
For instance, to set additional environment variables on each worker node use the sparklyr.apply.env.* config; to launch workers without --vanilla use sparklyr.apply.options.vanilla set to FALSE; to run a custom script before launching Rscript use sparklyr.apply.options.rscript.before.
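As a hedged illustration of these options (not from the original reference; the environment variable name and values are placeholders):
if (FALSE) {
library(sparklyr)
config <- spark_config()
config$sparklyr.apply.env.DATA_DIR <- "/tmp/data"   # extra environment variable on each worker
config$sparklyr.apply.options.vanilla <- FALSE      # do not launch worker R sessions with --vanilla
sc <- spark_connect(master = "local", config = config)
sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)
}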
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
# creates an Spark data frame with 10 elements then multiply times 10 in R
sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)
}
Create Bundle for Spark Apply
Arguments
Creates a bundle of packages for spark_apply()
.
spark_apply_bundle(packages = TRUE, base_path = getwd())
Arguments
packages |
List of packages to pack or TRUE to pack all. |
base_path |
Base path used to store the resulting bundle. |
Log Writer for Spark Apply
Arguments
Writes data to log under spark_apply()
.
spark_apply_log(..., level = "INFO")
Arguments
... |
Arguments to write to log. |
level |
Severity level for this entry; recommended values: INFO , ERROR or WARN . |
Create a Spark Configuration for Livy
Arguments
Value
Details
Create a Spark Configuration for Livy
livy_config(config = spark_config(), username = NULL,
password = NULL, negotiate = FALSE,
custom_headers = list(`X-Requested-By` = "sparklyr"), ...)
Arguments
config |
Optional base configuration |
username |
The username to use in the Authorization header |
password |
The password to use in the Authorization header |
negotiate |
Whether to use gssnegotiate method or not |
custom_headers |
List of custom headers to append to http requests.
Defaults to list("X-Requested-By" = "sparklyr") . |
... |
additional Livy session parameters |
Value
Named list with configuration data
Details
Extends a Spark spark_config()
configuration with settings
for Livy.
For instance, username
and password
define the basic authentication settings for a Livy session.
The default value of "custom_headers"
is set to list("X-Requested-By" = "sparklyr")
in order to facilitate connection to Livy servers with CSRF protection enabled.
Additional parameters for Livy sessions are:
proxy_user
- User to impersonate when starting the session
jars
- jars to be used in this session
py_files
- Python files to be used in this session
files
- files to be used in this session
driver_memory
- Amount of memory to use for the driver process
driver_cores
- Number of cores to use for the driver process
executor_memory
- Amount of memory to use per executor process
executor_cores
- Number of cores to use for each executor
num_executors
- Number of executors to launch for this session
archives
- Archives to be used in this session
queue
- The name of the YARN queue to which the session is submitted
name
- The name of this session
heartbeat_timeout
- Timeout in seconds after which the session is orphaned
Note that queue
is supported only by version 0.4.0 of Livy or newer.
If you are using the older one, specify queue via config
(e.g. config = spark_config(spark.yarn.queue = "my_queue")
).
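A hedged sketch, not from the original reference; the Livy endpoint, credentials, and resource sizes are placeholders.
if (FALSE) {
library(sparklyr)
config <- livy_config(username = "<username>", password = "<password>",
                      driver_memory = "2G", num_executors = 2)
sc <- spark_connect(master = "http://<livy-server>:8998", method = "livy",
                    config = config)
spark_disconnect(sc)
}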
Start Livy
Arguments
Starts the livy service.
Stops the running instances of the livy service.
livy_service_start(version = NULL, spark_version = NULL, stdout = "",
stderr = "", ...)
livy_service_stop()
Arguments
version |
The version of livy to use. |
spark_version |
The version of spark to connect to. |
stdout, stderr |
where output to 'stdout' or 'stderr' should
be sent.
Same options as system2 . |
... |
Optional arguments; currently unused. |
Find Stream
Arguments
Examples
Finds and returns a stream based on the stream's identifier.
stream_find(sc, id)
Arguments
sc |
The associated Spark connection. |
id |
The stream identifier to find. |
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>%
spark_write_parquet(path = "parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>%
stream_write_parquet("parquet-out")
stream_id <- stream_id(stream)
stream_find(sc, stream_id)
}
Generate Test Stream
Arguments
Details
Generates a local test stream, useful when testing streams locally.
stream_generate_test(df = rep(1:1000), path = "source",
distribution = floor(10 + 1e+05 * stats::dbinom(1:20, 20, 0.5)),
iterations = 50, interval = 1)
Arguments
df |
The data frame used as a source of rows to the stream, will
be cast to a data frame if needed.
Defaults to a sequence of one thousand
entries. |
path |
Path to save stream of files to, defaults to "source" . |
distribution |
The distribution of rows to use over each iteration,
defaults to a binomial distribution.
The stream will cycle through the
distribution if needed. |
iterations |
Number of iterations to execute before stopping, defaults
to fifty. |
interval |
The interval in seconds used to write the stream, defaults
to one second. |
Details
This function requires the callr
package to be installed.
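A minimal sketch, not from the original reference, assuming a local connection and the callr package; the generated files are written to the default "source" directory and read back as a CSV stream.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
stream_generate_test(iterations = 5)
stream_read_csv(sc, "source")
}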
Spark Stream's Identifier
Arguments
Retrieves the identifier of the Spark stream.
stream_id(stream)
Arguments
stream |
The spark stream object. |
Spark Stream's Name
Arguments
Retrieves the name of the Spark stream if available.
stream_name(stream)
Arguments
stream |
The spark stream object. |
Read CSV Stream
Arguments
See also
Examples
Reads a CSV stream as a Spark dataframe stream.
stream_read_csv(sc, path, name = NULL, header = TRUE, columns = NULL,
delimiter = ",", quote = "\"", escape = "\\",
charset = "UTF-8", null_value = NULL, options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
header |
Boolean; should the first row of data be used as a header?
Defaults to TRUE . |
columns |
A vector of column names or a named vector of column types. |
delimiter |
The character used to delimit each column.
Defaults to ','. |
quote |
The character used as a quote.
Defaults to '"'. |
escape |
The character used to escape other characters.
Defaults to '\'. |
charset |
The character set.
Defaults to "UTF-8". |
null_value |
The character to use for null, or missing, values.
Defaults to NULL . |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)
csv_path <- file.path("file://", getwd(), "csv-in")
stream <- stream_read_csv(sc, csv_path) %>% stream_write_csv("csv-out")
stream_stop(stream)
}
Read JSON Stream
Arguments
See also
Examples
Reads a JSON stream as a Spark dataframe stream.
stream_read_json(sc, path, name = NULL, columns = NULL,
options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
columns |
A vector of column names or a named vector of column types. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("json-in")
jsonlite::write_json(list(a = c(1,2), b = c(10,20)), "json-in/data.json")
json_path <- file.path("file://", getwd(), "json-in")
stream <- stream_read_json(sc, json_path) %>% stream_write_json("json-out")
stream_stop(stream)
}
Read Kafka Stream
Arguments
Details
See also
Examples
Reads a Kafka stream as a Spark dataframe stream.
stream_read_kafka(sc, name = NULL, options = list(), ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated stream. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
Details
Please note that Kafka requires installing the appropriate
package by connecting with a config setting where sparklyr.shell.packages
is set to, for Spark 2.3.2, "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"
.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
config <- spark_config()
# The following package is dependent to Spark version, for Spark 2.3.2:
config$sparklyr.shell.packages <- "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"
sc <- spark_connect(master = "local", config = config)
read_options <- list(kafka.bootstrap.servers = "localhost:9092", subscribe = "topic1")
write_options <- list(kafka.bootstrap.servers = "localhost:9092", topic = "topic2")
stream <- stream_read_kafka(sc, options = read_options) %>%
stream_write_kafka(options = write_options)
stream_stop(stream)
}
Read ORC Stream
Arguments
See also
Examples
Reads an ORC stream as a Spark dataframe stream.
stream_read_orc(sc, path, name = NULL, columns = NULL,
options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
columns |
A vector of column names or a named vector of column types. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_orc("orc-in")
stream <- stream_read_orc(sc, "orc-in") %>% stream_write_orc("orc-out")
stream_stop(stream)
}
Read Parquet Stream
Arguments
See also
Examples
Reads a parquet stream as a Spark dataframe stream.
stream_read_parquet(sc, path, name = NULL, columns = NULL,
options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
columns |
A vector of column names or a named vector of column types. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>% stream_write_parquet("parquet-out")
stream_stop(stream)
}
Read Socket Stream
Arguments
See also
Examples
Reads a Socket stream as a Spark dataframe stream.
stream_read_scoket(sc, name = NULL, columns = NULL, options = list(),
...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated stream. |
columns |
A vector of column names or a named vector of column types. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
# Start socket server from terminal, example: nc -lk 9999
stream <- stream_read_scoket(sc, options = list(host = "localhost", port = 9999))
stream
}
Read Text Stream
Arguments
See also
Examples
Reads a text stream as a Spark dataframe stream.
stream_read_text(sc, path, name = NULL, options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("text-in")
writeLines("A text entry", "text-in/text.txt")
text_path <- file.path("file://", getwd(), "text-in")
stream <- stream_read_text(sc, text_path) %>% stream_write_text("text-out")
stream_stop(stream)
}
Render Stream
Arguments
Examples
Collects streaming statistics to render the stream as an 'htmlwidget'.
stream_render(stream = NULL, collect = 10, stats = NULL, ...)
Arguments
stream |
The stream to render |
collect |
The interval in seconds to collect data before rendering the
'htmlwidget'. |
stats |
Optional stream statistics collected using stream_stats() ,
when specified, stream should be omitted. |
... |
Additional optional arguments. |
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
dir.create("iris-in")
write.csv(iris, "iris-in/iris.csv", row.names = FALSE)
stream <- stream_read_csv(sc, "iris-in/") %>%
stream_write_csv("iris-out/")
stream_render(stream)
stream_stop(stream)
}
Stream Statistics
Arguments
Value
Examples
Collects streaming statistics, usually, to be used with stream_render()
to render streaming statistics.
stream_stats(stream, stats = list())
Arguments
stream |
The stream to collect statistics from. |
stats |
An optional stats object generated using stream_stats() . |
Value
A stats object containing streaming statistics that can be passed
back to the stats
parameter to continue aggregating streaming stats.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>%
spark_write_parquet(path = "parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>%
stream_write_parquet("parquet-out")
stream_stats(stream)
}
Stops a Spark Stream
Arguments
Stops processing data from a Spark stream.
stream_stop(stream)
Arguments
stream |
The spark stream object to be stopped. |
Spark Stream Continuous Trigger
Arguments
See also
Creates a Spark structured streaming trigger to execute
continuously.
This mode is the most performant but not all operations
are supported.
stream_trigger_continuous(checkpoint = 5000)
Arguments
checkpoint |
The checkpoint interval specified in milliseconds. |
See also
stream_trigger_interval
Spark Stream Interval Trigger
Arguments
See also
Creates a Spark structured streaming trigger to execute
over the specified interval.
stream_trigger_interval(interval = 1000)
Arguments
interval |
The execution interval specified in milliseconds. |
See also
stream_trigger_continuous
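A hedged sketch, not from the original reference, showing how either trigger is passed to a stream writer; the paths are placeholders.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet("parquet-out",
                       trigger = stream_trigger_interval(interval = 5000))
stream_stop(stream)
}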
View Stream
Arguments
Examples
Opens a Shiny gadget to visualize the given stream.
stream_view(stream, ...)
Arguments
stream |
The stream to visualize. |
... |
Additional optional arguments. |
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
dir.create("iris-in")
write.csv(iris, "iris-in/iris.csv", row.names = FALSE)
stream_read_csv(sc, "iris-in/") %>%
stream_write_csv("iris-out/") %>%
stream_view() %>%
stream_stop()
}
Watermark Stream
Arguments
Ensures a stream has a watermark defined, which is required for some
operations over streams.
stream_watermark(x, column = "timestamp", threshold = "10 minutes")
Arguments
x |
An object coercable to a Spark Streaming DataFrame. |
column |
The name of the column that contains the event time of the row;
if the column is missing, a column with the current time will be added. |
threshold |
The minimum delay to wait for data to arrive late, defaults
to ten minutes. |
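A minimal sketch, not from the original reference, assuming a local connection; it adds the default "timestamp" watermark column before writing the stream, and the paths are placeholders.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_watermark() %>%
  stream_write_parquet("parquet-out")
stream_stop(stream)
}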
Write Console Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into console logs.
stream_write_console(x, mode = c("append", "complete", "update"),
options = list(), trigger = stream_trigger_interval(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
options |
A list of strings with additional options. |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% dplyr::transmute(text = as.character(id)) %>% spark_write_text("text-in")
stream <- stream_read_text(sc, "text-in") %>% stream_write_console()
stream_stop(stream)
}
Write CSV Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a tabular (typically, comma-separated) stream.
stream_write_csv(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoint"), header = TRUE, delimiter = ",", quote = "\"",
escape = "\\", charset = "UTF-8", null_value = NULL,
options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
checkpoint |
The location where the system will write all the checkpoint
information to guarantee end-to-end fault-tolerance. |
header |
Should the first row of data be used as a header? Defaults to TRUE . |
delimiter |
The character used to delimit each column, defaults to , . |
quote |
The character used as a quote.
Defaults to '"'. |
escape |
The character used to escape other characters, defaults to \ . |
charset |
The character set, defaults to "UTF-8" . |
null_value |
The character to use for default values, defaults to NULL . |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)
csv_path <- file.path("file://", getwd(), "csv-in")
stream <- stream_read_csv(sc, csv_path) %>% stream_write_csv("csv-out")
stream_stop(stream)
}
Write JSON Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a JSON stream.
stream_write_json(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoints", random_string("")), options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The destination path.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
checkpoint |
The location where the system will write all the checkpoint
information to guarantee end-to-end fault-tolerance. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("json-in")
jsonlite::write_json(list(a = c(1,2), b = c(10,20)), "json-in/data.json")
json_path <- file.path("file://", getwd(), "json-in")
stream <- stream_read_json(sc, json_path) %>% stream_write_json("json-out")
stream_stop(stream)
}
Write Kafka Stream
Arguments
Details
See also
Examples
Writes a Spark dataframe stream into a Kafka stream.
stream_write_kafka(x, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(),
checkpoint = file.path("checkpoints", random_string("")),
options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
checkpoint |
The location where the system will write all the checkpoint
information to guarantee end-to-end fault-tolerance. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
Details
Please note that Kafka requires installing the appropriate
package by connecting with a config setting where sparklyr.shell.packages
is set to, for Spark 2.3.2, "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"
.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
config <- spark_config()
# The following package is dependent to Spark version, for Spark 2.3.2:
config$sparklyr.shell.packages <- "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"
sc <- spark_connect(master = "local", config = config)
read_options <- list(kafka.bootstrap.servers = "localhost:9092", subscribe = "topic1")
write_options <- list(kafka.bootstrap.servers = "localhost:9092", topic = "topic2")
stream <- stream_read_kafka(sc, options = read_options) %>%
stream_write_kafka(options = write_options)
stream_stop(stream)
}
Write Memory Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a memory stream.
stream_write_memory(x, name = random_string("sparklyr_tmp_"),
mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(),
checkpoint = file.path("checkpoints", name, random_string("")),
options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated stream. |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
checkpoint |
The location where the system will write all the checkpoint
information to guarantee end-to-end fault-tolerance. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)
csv_path <- file.path("file://", getwd(), "csv-in")
stream <- stream_read_csv(sc, csv_path) %>% stream_write_memory("csv-out")
stream_stop(stream)
}
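The memory sink materializes the stream into an in-memory table whose name is the name argument ("csv-out" in the example above), so while the stream is running (before stream_stop()) it can be queried like any other Spark table. A minimal sketch, assuming dplyr is attached:
library(dplyr)
# Query the in-memory table registered by the memory sink; the table name
# matches the `name` argument passed to stream_write_memory()
tbl(sc, "csv-out") %>% head(10)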
Write an ORC Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into an ORC stream.
stream_write_orc(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoints", random_string("")), options = list(), ...)
Arguments
x | A Spark DataFrame or dplyr operation
path | The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode | Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger | The trigger for the stream query; defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint | The location where the system will write all the checkpoint information to guarantee end-to-end fault tolerance.
options | A list of strings with additional options.
... | Optional arguments; currently unused.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_orc("orc-in")
stream <- stream_read_orc(sc, "orc-in") %>% stream_write_orc("orc-out")
stream_stop(stream)
}
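While a stream such as the one above is running, sparklyr also exposes helpers for inspecting it. The sketch below reuses the connection and paths from the example; stream_name() returns the query name assigned to the stream and stream_stats() its latest progress statistics.
stream <- stream_read_orc(sc, "orc-in") %>% stream_write_orc("orc-out")
stream_name(stream)   # query name assigned to the running stream
stream_stats(stream)  # latest progress statistics reported by Spark
stream_stop(stream)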
Write Parquet Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a parquet stream.
stream_write_parquet(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoints", random_string("")), options = list(), ...)
Arguments
x | A Spark DataFrame or dplyr operation
path | The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode | Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger | The trigger for the stream query; defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint | The location where the system will write all the checkpoint information to guarantee end-to-end fault tolerance.
options | A list of strings with additional options.
... | Optional arguments; currently unused.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>% stream_write_parquet("parquet-out")
stream_stop(stream)
}
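The trigger and checkpoint arguments documented above can also be set explicitly. A hedged variation on the example, using a one-second micro-batch trigger and an illustrative checkpoint location:
stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet(
    "parquet-out",
    trigger = stream_trigger_interval(interval = 1000),             # run a micro-batch every second
    checkpoint = file.path("parquet-out", "checkpoints", "my-run")  # illustrative path
  )
stream_stop(stream)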
Write Text Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a text stream.
stream_write_text(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoints", random_string("")), options = list(), ...)
Arguments
x | A Spark DataFrame or dplyr operation
path | The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode | Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger | The trigger for the stream query; defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint | The location where the system will write all the checkpoint information to guarantee end-to-end fault tolerance.
options | A list of strings with additional options.
... | Optional arguments; currently unused.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("text-in")
writeLines("A text entry", "text-in/text.txt")
text_path <- file.path("file://", getwd(), "text-in")
stream <- stream_read_text(sc, text_path) %>% stream_write_text("text-out")
stream_stop(stream)
}
Reactive spark reader
Arguments
Given a Spark object, returns a reactive data source for the contents of that object.
This function is most useful for reading Spark streams.
reactiveSpark(x, intervalMillis = 1000, session = NULL)
Arguments
x | An object coercible to a Spark DataFrame.
intervalMillis | Approximate number of milliseconds to wait before retrieving an updated data frame. This can be a numeric value, or a function that returns a numeric value.
session | The user session to associate this file reader with, or NULL if none. If non-NULL, the reader will automatically stop when the session ends.
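A minimal Shiny sketch of how reactiveSpark() is typically used: a streaming pipeline is wrapped as a reactive, and calling that reactive inside a render function returns the stream's current contents. The folder name "csv-in" and the refresh interval are assumptions for illustration.
library(sparklyr)
library(shiny)

sc <- spark_connect(master = "local")

ui <- fluidPage(tableOutput("table"))

server <- function(input, output, session) {
  # Wrap a streaming source as a reactive that refreshes about once per second
  ps <- stream_read_csv(sc, "csv-in") %>%
    reactiveSpark(intervalMillis = 1000, session = session)

  # Calling the reactive returns the stream's current contents as a data frame
  output$table <- renderTable(ps())
}

shinyApp(ui, server)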