Using Sparklyr
https://spark.rstudio.com/guides/connections/
Configuring Spark Connections
Local mode
Local mode is an excellent way to learn and experiment with Spark.
Local mode also provides a convenient development environment for analyses, reports, and applications that you plan to eventually deploy to a multi-node Spark cluster.
To work in local mode, you should first install a version of Spark for local use.
You can do this using the spark_install()
function, for example:
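A minimal call looks like the following; the version number is only an illustration, so use whichever Spark version your sparklyr release supports:
library(sparklyr)
spark_install(version = "2.1.0")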
Recommended properties
The following are the recommended Spark properties to set when connecting via R:
sparklyr.cores.local - Defaults to using all of the available cores. It does not need to be set unless there is a reason to use fewer cores than are available for a given Spark session.
sparklyr.shell.driver-memory - The limit is the amount of RAM available on the computer minus what is needed for OS operations.
spark.memory.fraction - The default is set to 60% of the requested memory per executor.
For more information, please see this Memory Management Overview page in the official Spark website.
Connection example
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "16G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local",
                    version = "2.1.0",
                    config = conf)
Executors page
To see how the requested configuration affected the Spark connection, go to the Executors page in the Spark Web UI, available at http://localhost:4040/storage/
Customizing connections
A connection to Spark can be customized by setting the values of certain Spark properties.
In sparklyr
, Spark properties can be set by using the config
argument in the spark_connect()
function.
By default, spark_connect()
uses spark_config()
as the default configuration.
But that can be customized as shown in the example code below.
Because of the vast number of possible combinations, spark_config()
contains only a basic configuration, so additional settings will very likely be needed to connect properly to the cluster.
conf <- spark_config() # Load variable with spark_config()
conf$spark.executor.memory <- "16G" # Use `$` to add or set values
sc <- spark_connect(master = "yarn-client",
config = conf) # Pass the conf variable
Spark definitions
It may be useful to provide some simple definitions for the Spark nomenclature:
Node: A server
Worker Node: A server that is part of the cluster and is available to run Spark jobs
Master Node: The server that coordinates the Worker nodes.
Executor: A sort of virtual machine inside a node.
One Node can have multiple Executors.
Driver Node: The Node that initiates the Spark session.
Typically, this will be the server where sparklyr
is located.
Driver (Executor): The Driver Node will also show up in the Executor list.
Useful concepts
Spark configuration properties passed by R are just requests - In most cases, the cluster has the final say regarding the resources apportioned to a given Spark session.
The cluster overrides ‘silently’ - Many times, no errors are returned when more resources than allowed are requested, or if an attempt is made to change a setting fixed by the cluster.
YARN
Background
Using Spark and R inside a Hadoop based Data Lake is becoming a common practice at companies.
Currently, there is no good way to manage user connections to the Spark service centrally.
There are some caps and settings that can be applied, but in most cases there are configurations that the R user will need to customize.
The Running on YARN page on Spark’s official website is the best place to start for configuration-settings reference; please bookmark it.
Cluster administrators and users can benefit from this document.
If Spark is new to the company, the YARN tuning article, courtesy of Cloudera, does a great job of explaining how the Spark/YARN architecture works.
Recommended properties
The following are the recommended Spark properties to set when connecting via R:
spark.executor.memory - The maximum possible is managed by the YARN cluster.
See the Executor Memory Error
spark.executor.cores - Number of cores assigned per Executor.
spark.executor.instances - Number of executors to start.
This property is acknowledged by the cluster if spark.dynamicAllocation.enabled is set to “false”.
spark.dynamicAllocation.enabled - Overrides the mechanism that Spark provides to dynamically adjust resources.
Disabling it provides more control over the number of Executors that can be started, which in turn impacts the amount of storage available for the session.
For more information, please see the Dynamic Resource Allocation page in the official Spark website.
Client mode
Using yarn-client
as the value for the master
argument in spark_connect()
makes the server where R is running the driver of the Spark session.
Here is a sample connection:
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master = "yarn-client",
spark_home = "/usr/lib/spark/",
version = "1.6.0",
config = conf)
Executors page
To see how the requested configuration affected the Spark connection, go to the Executors page in the Spark Web UI.
Typically, the Spark Web UI can be found using the exact same URL used for RStudio but on port 4040.
Notice that 155.3MB per executor are assigned instead of the 300MB requested.
This is because spark.memory.fraction has been fixed by the cluster, and a fixed amount of memory is also designated for overhead.
Cluster mode
Running in cluster mode means that YARN will choose where the driver of the Spark session will run.
This means that the server where R is running may not necessarily be the driver for that session.
Here is a good write-up explaining how Spark applications run on YARN: Running Spark on YARN
The server will need to have copies of at least two files: yarn-site.xml
and hive-site.xml
.
There may be other files needed based on your cluster’s individual setup.
This is an example of connecting to a Cloudera cluster:
library(sparklyr)
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-7-oracle-cloudera/")
Sys.setenv(SPARK_HOME = '/opt/cloudera/parcels/CDH/lib/spark')
Sys.setenv(YARN_CONF_DIR = '/opt/cloudera/parcels/CDH/lib/spark/conf/yarn-conf')
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master = "yarn-cluster",
                    config = conf)
Executor memory error
Requesting more memory or CPUs for Executors than allowed will return an error.
This is one of the exceptions to the cluster’s ‘silent’ overrides.
It will return a message similar to this:
Failed during initialize_connection: java.lang.IllegalArgumentException: Required executor memory (16384+1638 MB) is above the max threshold (8192 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'
A cluster’s administrator is the only person who can make changes to the settings mentioned in the error.
If the cluster is supported by a vendor, like Cloudera or Hortonworks, then the change can be made using the cluster’s web UI.
Otherwise, changes to those settings are done directly in the yarn-default.xml file.
Kerberos
There are two options to access a “kerberized” data lake:
Use kinit to get and cache the ticket.
After kinit is installed and configured, it can be used in R via a system()
call prior to connecting to the cluster: system("echo '<password>' | kinit <username>")
For more information visit this site: Apache - Authenticate with kinit
A preferred option may be to use the out-of-the-box integration with Kerberos that the commercial version of RStudio Server offers.
Standalone mode
Recommended properties
The following are the recommended Spark properties to set when connecting via R:
The default behavior in Standalone mode is to create one executor per worker.
So in a 3-worker-node cluster, there will be 3 executors set up.
The basic properties that can be set are:
spark.executor.memory - The requested memory cannot exceed the actual RAM available.
spark.memory.fraction - The default is set to 60% of the requested memory per executor.
For more information, please see this Memory Management Overview page in the official Spark website.
spark.executor.cores - The requested cores cannot be higher than the cores available in each worker.
Dynamic Allocation
If dynamic allocation is disabled, then Spark will attempt to assign all of the available cores evenly across the cluster.
The property used is spark.dynamicAllocation.enabled.
For example, the Standalone cluster used for this article has 3 worker nodes.
Each node has 14.7GB in RAM and 4 cores.
This means that there are a total of 12 cores (3 workers with 4 cores) and 44.1GB in RAM (3 workers with 14.7GB in RAM each).
If the spark.executor.cores
property is set to 2, and dynamic allocation is disabled, then Spark will spawn 6 executors.
The spark.executor.memory
property should be set so that, when multiplied by 6 (the number of executors), it does not exceed the total available RAM.
In this case, the value can be safely set to 7GB so that the total memory requested will be 42GB, which is under the available 44.1GB.
Connection example
conf <- spark_config()
conf$spark.executor.memory <- "7GB"
conf$spark.memory.fraction <- 0.9
conf$spark.executor.cores <- 2
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master="spark://master-url:7077",
version = "2.1.0",
config = conf,
spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/")
Executors page
To see how the requested configuration affected the Spark connection, go to the Executors page in the Spark Web UI.
Typically, the Spark Web UI can be found using the exact same URL used for RStudio but on port 4040:
Troubleshooting
Help with code debugging
For general programming questions with sparklyr
, please ask on Stack Overflow.
Code does not work after upgrading to the latest sparklyr version
Please refer to the NEWS section of the sparklyr package to find out if any of the updates listed may have changed the way your code needs to work.
If it seems that the current version of the package has a bug, or the new functionality does not perform as stated, please refer to the sparklyr ISSUES page.
If no existing issue matches your problem, please open a new issue.
Not able to connect, or the jobs take a long time when working with a Data Lake
The Configuring Spark Connections section contains an overview and recommendations for requesting resources from the cluster.
The articles in the Guides section provide best-practice information about specific operations that may match the intent of your code.
To verify your infrastructure, please review the Deployment Examples section.
Manipulating Data with dplyr
Overview
dplyr is
an R package for working with structured data both in and outside of R.
dplyr makes data manipulation for R users easy, consistent, and
performant.
With dplyr as an interface to manipulating Spark DataFrames,
you can:
Select, filter, and aggregate data
Use window functions (e.g. for sampling)
Perform joins on DataFrames
Collect data from Spark into R
Statements in dplyr can be chained together using pipes defined by the
magrittr
R package.
dplyr also supports non-standard evaluation of its arguments.
For more information on dplyr, see the
introduction,
a guide for connecting to
databases,
and a variety of
vignettes.
Reading Data
You can read data into Spark DataFrames using the following
functions:
spark_read_csv - Reads a CSV file and provides a data source compatible with dplyr
spark_read_json - Reads a JSON file and provides a data source compatible with dplyr
spark_read_parquet - Reads a Parquet file and provides a data source compatible with dplyr
Regardless of the format of your data, Spark supports reading data from
a variety of different data sources.
These include data stored on HDFS
(hdfs://
protocol), Amazon S3 (s3n://
protocol), or local files
available to the Spark worker nodes (file://
protocol).
Each of these functions returns a reference to a Spark DataFrame which
can be used as a dplyr table (tbl
).
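As a minimal sketch (the file path below is hypothetical, and sc is assumed to be an active Spark connection with sparklyr and dplyr loaded), reading a CSV file and then using the result as a dplyr table might look like this:
# Read a CSV file into Spark; the path is only an illustration
flights_csv <- spark_read_csv(sc, name = "flights_csv",
                              path = "file:///tmp/flights.csv")
# The returned reference behaves like a dplyr tbl
flights_csv %>% head(5)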
Flights Data
This guide will demonstrate some of the basic data manipulation verbs of
dplyr by using data from the nycflights13
R package.
This package
contains data for all 336,776 flights departing New York City in 2013.
It also includes useful metadata on airlines, airports, weather, and
planes.
The data comes from the US Bureau of Transportation
Statistics,
and is documented in ?nycflights13
Connect to the cluster and copy the flights data using the copy_to
function.
Caveat: The flight data in nycflights13
is convenient for
dplyr demonstrations because it is small, but in practice large data
should rarely be copied directly from R objects.
library(sparklyr)
library(dplyr)
library(nycflights13)
library(ggplot2)
sc <- spark_connect(master="local")
flights <- copy_to(sc, flights, "flights")
airlines <- copy_to(sc, airlines, "airlines")
src_tbls(sc)
## [1] "airlines" "flights"
dplyr Verbs
Verbs are dplyr commands for manipulating data.
When connected to a
Spark DataFrame, dplyr translates the commands into Spark SQL
statements.
Remote data sources use exactly the same five verbs as local
data sources.
Here are the five verbs with their corresponding SQL
commands:
select
~ SELECT
filter
~ WHERE
arrange
~ ORDER
summarise
~ aggregators: sum, min, sd, etc.
mutate
~ operators: +, *, log, etc.
select(flights, year:day, arr_delay, dep_delay)
## # Source: lazy query [?? x 5]
## # Database: spark_connection
## year month day arr_delay dep_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 11.0 2.00
## 2 2013 1 1 20.0 4.00
## 3 2013 1 1 33.0 2.00
## 4 2013 1 1 -18.0 -1.00
## 5 2013 1 1 -25.0 -6.00
## 6 2013 1 1 12.0 -4.00
## 7 2013 1 1 19.0 -5.00
## 8 2013 1 1 -14.0 -3.00
## 9 2013 1 1 - 8.00 -3.00
## 10 2013 1 1 8.00 -2.00
## # ... with more rows
filter(flights, dep_delay > 1000)
## # Source: lazy query [?? x 19]
## # Database: spark_connection
## year month day dep_t~ sche~ dep_~ arr_~ sche~ arr_~ carr~ flig~ tail~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384~
## 2 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517~
## 3 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504~
## 4 2013 7 22 845 1600 1005 1044 1815 989 MQ 3075 N665~
## 5 2013 9 20 1139 1845 1014 1457 2210 1007 AA 177 N338~
## # ... with 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>
arrange(flights, desc(dep_delay))
## # Source: table<flights> [?? x 19]
## # Database: spark_connection
## # Ordered by: desc(dep_delay)
## year month day dep_~ sche~ dep_~ arr_~ sche~ arr_~ carr~ flig~ tail~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384~
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504~
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517~
## 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA 177 N338~
## 5 2013 7 22 845 1600 1005 1044 1815 989 MQ 3075 N665~
## 6 2013 4 10 1100 1900 960 1342 2211 931 DL 2391 N959~
## 7 2013 3 17 2321 810 911 135 1020 915 DL 2119 N927~
## 8 2013 6 27 959 1900 899 1236 2226 850 DL 2007 N376~
## 9 2013 7 22 2257 759 898 121 1026 895 DL 2047 N671~
## 10 2013 12 5 756 1700 896 1058 2020 878 AA 172 N5DM~
## # ... with more rows, and 7 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour
## # <dbl>
summarise(flights, mean_dep_delay = mean(dep_delay))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## # Source: lazy query [?? x 1]
## # Database: spark_connection
## mean_dep_delay
## <dbl>
## 1 12.6
mutate(flights, speed = distance / air_time * 60)
## # Source: lazy query [?? x 20]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>, speed <dbl>
Laziness
When working with databases, dplyr tries to be as lazy as possible:
It never pulls data into R unless you explicitly ask for it.
It delays doing any work until the last possible moment: it collects
together everything you want to do and then sends it to the database
in one step.
For example, take the following
code:
c1 <- filter(flights, day == 17, month == 5, carrier %in% c('UA', 'WN', 'AA', 'DL'))
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- arrange(c2, year, month, day, carrier)
c4 <- mutate(c3, air_time_hours = air_time / 60)
This sequence of operations never actually touches the database.
It’s
not until you ask for the data (e.g. by printing c4
) that dplyr
requests the results from the database.
c4
## # Source: lazy query [?? x 8]
## # Database: spark_connection
## # Ordered by: year, month, day, carrier
## year month day carrier dep_delay air_time distance air_time_hours
## <int> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2013 5 17 AA -2.00 294 2248 4.90
## 2 2013 5 17 AA -1.00 146 1096 2.43
## 3 2013 5 17 AA -2.00 185 1372 3.08
## 4 2013 5 17 AA -9.00 186 1389 3.10
## 5 2013 5 17 AA 2.00 147 1096 2.45
## 6 2013 5 17 AA -4.00 114 733 1.90
## 7 2013 5 17 AA -7.00 117 733 1.95
## 8 2013 5 17 AA -7.00 142 1089 2.37
## 9 2013 5 17 AA -6.00 148 1089 2.47
## 10 2013 5 17 AA -7.00 137 944 2.28
## # ... with more rows
Piping
You can use
magrittr
pipes to write cleaner syntax.
Using the same example from above, you
can write a much cleaner version like this:
c4 <- flights %>%
filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
select(carrier, dep_delay, air_time, distance) %>%
arrange(carrier) %>%
mutate(air_time_hours = air_time / 60)
Grouping
The group_by
function corresponds to the GROUP BY
statement in SQL.
c4 %>%
group_by(carrier) %>%
summarize(count = n(), mean_dep_delay = mean(dep_delay))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## carrier count mean_dep_delay
## <chr> <dbl> <dbl>
## 1 AA 94.0 1.47
## 2 DL 136 6.24
## 3 UA 172 9.63
## 4 WN 34.0 7.97
Collecting to R
You can copy data from Spark into R’s memory by using collect()
.
carrierhours <- collect(c4)
collect()
executes the Spark query and returns the results to R for
further analysis and visualization.
# Test the significance of pairwise differences and plot the results
with(carrierhours, pairwise.t.test(air_time, carrier))
##
## Pairwise comparisons using t tests with pooled SD
##
## data: air_time and carrier
##
## AA DL UA
## DL 0.25057 - -
## UA 0.07957 0.00044 -
## WN 0.07957 0.23488 0.00041
##
## P value adjustment method: holm
ggplot(carrierhours, aes(carrier, air_time_hours)) + geom_boxplot()
SQL Translation
It’s relatively straightforward to translate R code to SQL (or indeed to
any programming language) when doing simple mathematical operations of
the form you normally use when filtering, mutating and summarizing.
dplyr knows how to convert the following R functions to Spark SQL:
# Basic math operators
+, -, *, /, %%, ^
# Math functions
abs, acos, asin, asinh, atan, atan2, ceiling, cos, cosh, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
# Logical comparisons
<, <=, !=, >=, >, ==, %in%
# Boolean operations
&, &&, |, ||, !
# Character functions
paste, tolower, toupper, nchar
# Casting
as.double, as.integer, as.logical, as.character, as.date
# Basic aggregations
mean, sum, min, max, sd, var, cor, cov, n
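To check how a particular pipeline is translated, you can render the generated SQL with dbplyr::sql_render(); here is a minimal sketch using the flights table created above:
# Render the Spark SQL that dplyr generates for a simple pipeline
flights %>%
  filter(dep_delay > 60) %>%
  mutate(air_time_hours = air_time / 60) %>%
  dbplyr::sql_render()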
Window Functions
dplyr supports Spark SQL window functions.
Window functions are used in
conjunction with mutate and filter to solve a wide range of problems.
You can compare the dplyr syntax to the query it has generated by using dbplyr::sql_render()
.
# Find the most and least delayed flight each day
bestworst <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
filter(dep_delay == min(dep_delay) || dep_delay == max(dep_delay))
dbplyr::sql_render(bestworst)
## Warning: Missing values are always removed in SQL.
## Use `min(x, na.rm = TRUE)` to silence this warning
## Warning: Missing values are always removed in SQL.
## Use `max(x, na.rm = TRUE)` to silence this warning
## <SQL> SELECT `year`, `month`, `day`, `dep_delay`
## FROM (SELECT `year`, `month`, `day`, `dep_delay`, min(`dep_delay`) OVER (PARTITION BY `year`, `month`, `day`) AS `zzz3`, max(`dep_delay`) OVER (PARTITION BY `year`, `month`, `day`) AS `zzz4`
## FROM (SELECT `year`, `month`, `day`, `dep_delay`
## FROM `flights`) `coaxmtqqbj`) `efznnpuovy`
## WHERE (`dep_delay` = `zzz3` OR `dep_delay` = `zzz4`)
bestworst
## Warning: Missing values are always removed in SQL.
## Use `min(x, na.rm = TRUE)` to silence this warning
## Warning: Missing values are always removed in SQL.
## Use `max(x, na.rm = TRUE)` to silence this warning
## # Source: lazy query [?? x 4]
## # Database: spark_connection
## # Groups: year, month, day
## year month day dep_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 853
## 2 2013 1 1 - 15.0
## 3 2013 1 1 - 15.0
## 4 2013 1 9 1301
## 5 2013 1 9 - 17.0
## 6 2013 1 24 - 15.0
## 7 2013 1 24 329
## 8 2013 1 29 - 27.0
## 9 2013 1 29 235
## 10 2013 2 1 - 15.0
## # ... with more rows
# Rank each flight within each day
ranked <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay)))
dbplyr::sql_render(ranked)
## <SQL> SELECT `year`, `month`, `day`, `dep_delay`, rank() OVER (PARTITION BY `year`, `month`, `day` ORDER BY `dep_delay` DESC) AS `rank`
## FROM (SELECT `year`, `month`, `day`, `dep_delay`
## FROM `flights`) `mauqwkxuam`
ranked
## # Source: lazy query [?? x 5]
## # Database: spark_connection
## # Groups: year, month, day
## year month day dep_delay rank
## <int> <int> <int> <dbl> <int>
## 1 2013 1 1 853 1
## 2 2013 1 1 379 2
## 3 2013 1 1 290 3
## 4 2013 1 1 285 4
## 5 2013 1 1 260 5
## 6 2013 1 1 255 6
## 7 2013 1 1 216 7
## 8 2013 1 1 192 8
## 9 2013 1 1 157 9
## 10 2013 1 1 155 10
## # ... with more rows
Joins
It’s rare that a data analysis involves only a single table of data.
In
practice, you’ll normally have many tables that contribute to an
analysis, and you need flexible tools to combine them.
In dplyr, there
are three families of verbs that work with two tables at a time:
Mutating joins, which add new variables to one table from matching
rows in another.
Filtering joins, which filter observations from one table based on
whether or not they match an observation in the other table.
Set operations, which combine the observations in the data sets as
if they were set elements.
All two-table verbs work similarly.
The first two arguments are x
and y
, and provide the tables to combine.
The output is always a new table
with the same type as x
.
The following statements are equivalent:
flights %>% left_join(airlines)
## Joining, by = "carrier"
## # Source: lazy query [?? x 20]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>, name <chr>
flights %>% left_join(airlines, by = "carrier")
## # Source: lazy query [?? x 20]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>, name <chr>
flights %>% left_join(airlines, by = c("carrier", "carrier"))
## # Source: lazy query [?? x 20]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>, name <chr>
Sampling
You can use sample_n()
and sample_frac()
to take a random sample of
rows: use sample_n()
for a fixed number and sample_frac()
for a
fixed fraction.
sample_n(flights, 10)
## # Source: lazy query [?? x 19]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with more rows, and 8 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>
sample_frac(flights, 0.01)
## # Source: lazy query [?? x 19]
## # Database: spark_connection
## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 655 655 0 1021 1030 - 9.00 DL 1415
## 2 2013 1 1 656 700 - 4.00 854 850 4.00 AA 305
## 3 2013 1 1 1044 1045 - 1.00 1231 1212 19.0 EV 4322
## 4 2013 1 1 1056 1059 - 3.00 1203 1209 - 6.00 EV 4479
## 5 2013 1 1 1317 1325 - 8.00 1454 1505 -11.0 MQ 4475
## 6 2013 1 1 1708 1700 8.00 2037 2005 32.0 WN 1066
## 7 2013 1 1 1825 1829 - 4.00 2056 2053 3.00 9E 3286
## 8 2013 1 1 1843 1845 - 2.00 1955 2024 -29.0 DL 904
## 9 2013 1 1 2108 2057 11.0 25 39 -14.0 UA 1517
## 10 2013 1 2 557 605 - 8.00 832 823 9.00 DL 544
## # ... with more rows, and 8 more variables: tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dbl>
Writing Data
It is often useful to save the results of your analysis or the tables
that you have generated on your Spark cluster into persistent storage.
The best option in many scenarios is to write the table out to a
Parquet file using the
spark_write_parquet
function.
For example:
spark_write_parquet(tbl, "hdfs://hdfs.company.org:9000/hdfs-path/data")
This will write the Spark DataFrame referenced by the tbl R variable to
the given HDFS path.
You can use the
spark_read_parquet
function to read the same table back into a subsequent Spark
session:
tbl <- spark_read_parquet(sc, "data", "hdfs://hdfs.company.org:9000/hdfs-path/data")
You can also write data as CSV or JSON using the
spark_write_csv and
spark_write_json
functions.
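As an illustration (the output paths below are hypothetical variations of the HDFS path used above), the CSV and JSON writers follow the same pattern:
# Write the Spark DataFrame referenced by tbl as CSV and as JSON
spark_write_csv(tbl, "hdfs://hdfs.company.org:9000/hdfs-path/data-csv")
spark_write_json(tbl, "hdfs://hdfs.company.org:9000/hdfs-path/data-json")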
Hive Functions
Many of Hive’s built-in functions (UDF) and built-in aggregate functions
(UDAF) can be called inside dplyr’s mutate and summarize.
The Language
Reference
UDF
page provides the list of available functions.
The following example uses the datediff and current_date Hive
UDFs to calculate the difference between the flight_date and the current
system date:
flights %>%
mutate(flight_date = paste(year,month,day,sep="-"),
days_since = datediff(current_date(), flight_date)) %>%
group_by(flight_date,days_since) %>%
tally() %>%
arrange(-days_since)
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## # Groups: flight_date
## # Ordered by: -days_since
## flight_date days_since n
## <chr> <int> <dbl>
## 1 2013-1-1 1844 842
## 2 2013-1-2 1843 943
## 3 2013-1-3 1842 914
## 4 2013-1-4 1841 915
## 5 2013-1-5 1840 720
## 6 2013-1-6 1839 832
## 7 2013-1-7 1838 933
## 8 2013-1-8 1837 899
## 9 2013-1-9 1836 902
## 10 2013-1-10 1835 932
## # ... with more rows
Spark Machine Learning Library (MLlib)
Overview
sparklyr provides bindings to Spark's distributed machine learning library.
In particular, sparklyr allows you to access the machine learning routines provided by the spark.ml package.
Together with sparklyr's dplyr interface, you can easily create and tune machine learning workflows on Spark, orchestrated entirely within R.
sparklyr provides three families of functions that you can use with Spark machine learning:
Machine learning algorithms for analyzing data (ml_*
)
Feature transformers for manipulating individual features (ft_*
)
Functions for manipulating Spark DataFrames (sdf_*
)
An analytic workflow with sparklyr might be composed of the following stages.
For an example see Example Workflow.
Perform SQL queries through the sparklyr dplyr interface,
Use the sdf_*
and ft_*
family of functions to generate new columns, or partition your data set,
Choose an appropriate machine learning algorithm from the ml_*
family of functions to model your data,
Inspect the quality of your model fit, and use it to make predictions with new data.
Collect the results for visualization and further analysis in R
Algorithms
Spark's machine learning library can be accessed from sparklyr through the ml_*
set of functions, such as ml_linear_regression(), ml_logistic_regression(), ml_kmeans(), ml_pca(), and ml_random_forest().
The ml_*
functions take the arguments response
and features
.
But features
can also be a formula with main effects (it currently does not accept interaction terms).
The intercept term can be omitted by using -1
.
# Equivalent statements
ml_linear_regression(z ~ -1 + x + y)
ml_linear_regression(intercept = FALSE, response = "z", features = c("x", "y"))
Options
The Spark model output can be modified with the ml_options
argument in the ml_*
functions.
The ml_options
argument is an experts-only interface for tweaking the model output.
For example, model.transform
can be used to mutate the Spark model object before the fit is performed.
Transformers
A model is often fit not on a dataset as-is, but instead on some transformation of that dataset.
Spark provides feature transformers, facilitating many common transformations of data within a Spark DataFrame, and sparklyr exposes these within the ft_*
family of functions.
These routines generally take one or more input columns, and generate a new output column formed as a transformation of those columns.
ft_binarizer - Threshold numerical features to binary (0/1) feature
ft_bucketizer - Bucketizer transforms a column of continuous features to a column of feature buckets
ft_discrete_cosine_transform - Transforms a length N real-valued sequence in the time domain into another length N real-valued sequence in the frequency domain
ft_elementwise_product - Multiplies each input vector by a provided weight vector, using element-wise multiplication
ft_index_to_string - Maps a column of label indices back to a column containing the original labels as strings
ft_quantile_discretizer - Takes a column with continuous features and outputs a column with binned categorical features
sql_transformer - Implements the transformations which are defined by a SQL statement
ft_string_indexer - Encodes a string column of labels to a column of label indices
ft_vector_assembler - Combines a given list of columns into a single vector column
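As a quick sketch of the pattern (argument names have varied across sparklyr versions, so the input and output columns are passed positionally here), a binarizer applied to a copy of the iris data might look like this:
# Copy iris to Spark and add a 0/1 column flagging petals wider than 1.0
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
iris_tbl %>%
  ft_binarizer("Petal_Width", "Petal_Width_large", threshold = 1.0)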
Examples
We will use the iris
data set to examine a handful of learning algorithms and transformers.
The iris data set measures attributes for 150 flowers in 3 different species of iris.
library(sparklyr)
## Warning: package 'sparklyr' was built under R version 3.4.3
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sc <- spark_connect(master = "local")
## * Using Spark: 2.1.0
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
iris_tbl
## # Source: table<iris> [?? x 5]
## # Database: spark_connection
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with more rows
K-Means Clustering
Use Spark's K-means clustering to partition a dataset into groups.
K-means clustering partitions points into k
groups, such that the sum of squares from points to the assigned cluster centers is minimized.
kmeans_model <- iris_tbl %>%
select(Petal_Width, Petal_Length) %>%
ml_kmeans(centers = 3)
## * No rows dropped by 'na.omit' call
# print our model fit
kmeans_model
## K-means clustering with 3 clusters
##
## Cluster centers:
## Petal_Width Petal_Length
## 1 1.359259 4.292593
## 2 0.246000 1.462000
## 3 2.047826 5.626087
##
## Within Set Sum of Squared Errors = 31.41289
# predict the associated class
predicted <- sdf_predict(kmeans_model, iris_tbl) %>%
collect
table(predicted$Species, predicted$prediction)
##
## 0 1 2
## setosa 0 50 0
## versicolor 48 0 2
## virginica 6 0 44
# plot cluster membership
sdf_predict(kmeans_model) %>%
collect() %>%
ggplot(aes(Petal_Length, Petal_Width)) +
geom_point(aes(Petal_Width, Petal_Length, col = factor(prediction + 1)),
size = 2, alpha = 0.5) +
geom_point(data = kmeans_model$centers, aes(Petal_Width, Petal_Length),
col = scales::muted(c("red", "green", "blue")),
pch = 'x', size = 12) +
scale_color_discrete(name = "Predicted Cluster",
labels = paste("Cluster", 1:3)) +
labs(
x = "Petal Length",
y = "Petal Width",
title = "K-Means Clustering",
subtitle = "Use Spark.ML to predict cluster membership with the iris dataset."
)
Linear Regression
Use Spark's linear regression to model the linear relationship between a response variable and one or more explanatory variables.
lm_model <- iris_tbl %>%
select(Petal_Width, Petal_Length) %>%
ml_linear_regression(Petal_Length ~ Petal_Width)
## * No rows dropped by 'na.omit' call
iris_tbl %>%
select(Petal_Width, Petal_Length) %>%
collect %>%
ggplot(aes(Petal_Length, Petal_Width)) +
geom_point(aes(Petal_Width, Petal_Length), size = 2, alpha = 0.5) +
geom_abline(aes(slope = coef(lm_model)[["Petal_Width"]],
intercept = coef(lm_model)[["(Intercept)"]]),
color = "red") +
labs(
x = "Petal Width",
y = "Petal Length",
title = "Linear Regression: Petal Length ~ Petal Width",
subtitle = "Use Spark.ML linear regression to predict petal length as a function of petal width."
)
Logistic Regression
Use Spark's logistic regression to perform logistic regression, modeling a binary outcome as a function of one or more explanatory variables.
# Prepare beaver dataset
beaver <- beaver2
beaver$activ <- factor(beaver$activ, labels = c("Non-Active", "Active"))
copy_to(sc, beaver, "beaver")
## # Source: table<beaver> [?? x 4]
## # Database: spark_connection
## day time temp activ
## <dbl> <dbl> <dbl> <chr>
## 1 307 930 36.58 Non-Active
## 2 307 940 36.73 Non-Active
## 3 307 950 36.93 Non-Active
## 4 307 1000 37.15 Non-Active
## 5 307 1010 37.23 Non-Active
## 6 307 1020 37.24 Non-Active
## 7 307 1030 37.24 Non-Active
## 8 307 1040 36.90 Non-Active
## 9 307 1050 36.95 Non-Active
## 10 307 1100 36.89 Non-Active
## # ... with more rows
beaver_tbl <- tbl(sc, "beaver")
glm_model <- beaver_tbl %>%
mutate(binary_response = as.numeric(activ == "Active")) %>%
ml_logistic_regression(binary_response ~ temp)
## * No rows dropped by 'na.omit' call
glm_model
## Call: binary_response ~ temp
##
## Coefficients:
## (Intercept) temp
## -550.52331 14.69184
PCA
Use Spark's Principal Components Analysis (PCA) to perform dimensionality reduction.
PCA is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible.
pca_model <- tbl(sc, "iris") %>%
select(-Species) %>%
ml_pca()
## * No rows dropped by 'na.omit' call
print(pca_model)
## Explained variance:
##
## PC1 PC2 PC3 PC4
## 0.924618723 0.053066483 0.017102610 0.005212184
##
## Rotation:
## PC1 PC2 PC3 PC4
## Sepal_Length -0.36138659 -0.65658877 0.58202985 0.3154872
## Sepal_Width 0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal_Length -0.85667061 0.17337266 -0.07623608 -0.4798390
## Petal_Width -0.35828920 0.07548102 -0.54583143 0.7536574
Random Forest
Use Spark's Random Forest to perform regression or multiclass classification.
rf_model <- iris_tbl %>%
ml_random_forest(Species ~ Petal_Length + Petal_Width, type = "classification")
## * No rows dropped by 'na.omit' call
rf_predict <- sdf_predict(rf_model, iris_tbl) %>%
ft_string_indexer("Species", "Species_idx") %>%
collect
table(rf_predict$Species_idx, rf_predict$prediction)
##
## 0 1 2
## 0 49 1 0
## 1 0 50 0
## 2 0 0 50
SDF Partitioning
Split a Spark DataFrame into training and test datasets.
partitions <- tbl(sc, "iris") %>%
sdf_partition(training = 0.75, test = 0.25, seed = 1099)
fit <- partitions$training %>%
ml_linear_regression(Petal_Length ~ Petal_Width)
## * No rows dropped by 'na.omit' call
estimate_mse <- function(df){
sdf_predict(fit, df) %>%
mutate(resid = Petal_Length - prediction) %>%
summarize(mse = mean(resid ^ 2)) %>%
collect
}
sapply(partitions, estimate_mse)
## $training.mse
## [1] 0.2374596
##
## $test.mse
## [1] 0.1898848
FT String Indexing
Use ft_string_indexer
and ft_index_to_string
to convert a character column into a numeric column and back again.
ft_string2idx <- iris_tbl %>%
ft_string_indexer("Species", "Species_idx") %>%
ft_index_to_string("Species_idx", "Species_remap") %>%
collect
table(ft_string2idx$Species, ft_string2idx$Species_remap)
##
## setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 50 0
## virginica 0 0 50
SDF Mutate
sdf_mutate is provided as a helper function to allow you to use feature transformers.
For example, the previous code snippet could have been written as:
ft_string2idx <- iris_tbl %>%
sdf_mutate(Species_idx = ft_string_indexer(Species)) %>%
sdf_mutate(Species_remap = ft_index_to_string(Species_idx)) %>%
collect
ft_string2idx %>%
select(Species, Species_idx, Species_remap) %>%
distinct
## # A tibble: 3 x 3
## Species Species_idx Species_remap
## <chr> <dbl> <chr>
## 1 setosa 2 setosa
## 2 versicolor 0 versicolor
## 3 virginica 1 virginica
Example Workflow
Let's walk through a simple example to demonstrate the use of Spark's machine learning algorithms within R.
We'll use ml_linear_regression to fit a linear regression model.
Using the built-in mtcars
dataset, we'll try to predict a car's fuel consumption (mpg
) based on its weight (wt
), and the number of cylinders the engine contains (cyl
).
First, we will copy the mtcars
dataset into Spark.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
Transform the data with Spark SQL, feature transformers, and DataFrame functions.
Use Spark SQL to remove all cars with horsepower less than 100
Use Spark feature transformers to bucket cars into two groups based on cylinders
Use Spark DataFrame functions to partition the data into test and training
Then fit a linear model using spark ML.
Model MPG as a function of weight and cylinders.
# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
filter(hp >= 100) %>%
sdf_mutate(cyl8 = ft_bucketizer(cyl, c(0,8,12))) %>%
sdf_partition(training = 0.5, test = 0.5, seed = 888)
# fit a linear model to the training dataset
fit <- partitions$training %>%
ml_linear_regression(mpg ~ wt + cyl)
## * No rows dropped by 'na.omit' call
# summarize the model
summary(fit)
## Call: ml_linear_regression(., mpg ~ wt + cyl)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0947 -1.2747 -0.1129 1.0876 2.2185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.79558 2.67240 12.6462 4.92e-07 ***
## wt -1.59625 0.73729 -2.1650 0.05859 .
## cyl -1.58036 0.49670 -3.1817 0.01115 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-Squared: 0.8267
## Root Mean Squared Error: 1.437
The summary()
output suggests that our model is a fairly good fit, and that both a car's weight and the number of cylinders in its engine are powerful predictors of its average fuel consumption.
(The model suggests that, on average, heavier cars consume more fuel.)
Let's use our Spark model fit to predict the average fuel consumption on our test data set, and compare the predicted response with the true measured fuel consumption.
We'll build a simple ggplot2 plot that will allow us to inspect the quality of our predictions.
# Score the data
pred <- sdf_predict(fit, partitions$test) %>%
collect
# Plot the predicted versus actual mpg
ggplot(pred, aes(x = mpg, y = prediction)) +
geom_abline(lty = "dashed", col = "red") +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
coord_fixed(ratio = 1) +
labs(
x = "Actual Fuel Consumption",
y = "Predicted Fuel Consumption",
title = "Predicted vs.
Actual Fuel Consumption"
)
Although simple, our model appears to do a fairly good job of predicting a car's average fuel consumption.
As you can see, we can easily and effectively combine feature transformers, machine learning algorithms, and Spark DataFrame functions into a complete analysis with Spark and R.
Understanding Spark Caching
Introduction
Spark also supports pulling data sets into a cluster-wide in-memory cache.
This is very useful when data is accessed repeatedly, such as when querying a small dataset or when running an iterative algorithm like random forests.
Since operations in Spark are lazy, caching can help force computation.
Sparklyr tools can be used to cache and uncache DataFrames.
The Spark UI will tell you which DataFrames and what percentages are in memory.
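For reference, caching and uncaching are done with tbl_cache() and tbl_uncache(); here is a minimal sketch, assuming a table named "flights_spark" has already been registered (as it is later in this article):
# Load the registered table into Spark's in-memory cache
tbl_cache(sc, "flights_spark")
# Drop it from the cache when it is no longer needed
tbl_uncache(sc, "flights_spark")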
Using a reproducible example, we will review some of the main configuration settings, commands, and command arguments that can help you get the best out of Spark’s memory management options.
Preparation
Download Test Data
The 2008 and 2007 Flights data from the Statistical Computing site will be used for this exercise.
The spark_read_csv function supports reading CSV files compressed in the bz2 format, so no additional file preparation is needed.
if(!file.exists("2008.csv.bz2"))
{download.file("http://stat-computing.org/dataexpo/2009/2008.csv.bz2", "2008.csv.bz2")}
if(!file.exists("2007.csv.bz2"))
{download.file("http://stat-computing.org/dataexpo/2009/2007.csv.bz2", "2007.csv.bz2")}
Start a Spark session
A local deployment will be used for this example.
library(sparklyr)
library(dplyr)
library(ggplot2)
# Install Spark version 2
spark_install(version = "2.0.0")
# Customize the connection configuration
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "16G"
# Connect to Spark
sc <- spark_connect(master = "local", config = conf, version = "2.0.0")
The Memory Argument
In the spark_read_… functions, the memory argument controls whether the data will be loaded into memory as an RDD.
Setting it to FALSE means that Spark will essentially map the file, but not make a copy of it in memory.
This makes the spark_read_csv command run faster, but the trade off is that any data transformation operations will take much longer.
spark_read_csv(sc, "flights_spark_2008", "2008.csv.bz2", memory = FALSE)
In the RStudio IDE, the flights_spark_2008 table now shows up in the Spark tab.
To access the Spark Web UI, click the SparkUI button in the RStudio Spark Tab.
As expected, the Storage page shows no tables loaded into memory.
Loading Less Data into Memory
Using the pre-processing capabilities of Spark, the data will be transformed before being loaded into memory.
In this section, we will continue to build on the example started in the Spark Read section
Lazy Transform
The following dplyr script is not run immediately, so the code is processed quickly.
Some checks are made, but for the most part it is just building a Spark SQL statement in the background.
flights_table <- tbl(sc,"flights_spark_2008") %>%
mutate(DepDelay = as.numeric(DepDelay),
ArrDelay = as.numeric(ArrDelay),
DepDelay > 15 , DepDelay < 240,
ArrDelay > -60 , ArrDelay < 360,
Gain = DepDelay - ArrDelay) %>%
filter(ArrDelay > 0) %>%
select(Origin, Dest, UniqueCarrier, Distance, DepDelay, ArrDelay, Gain)
Register in Spark
sdf_register will register the resulting Spark SQL query as a table in Spark.
The results will show up as a table called flights_spark.
But a table of the same name is still not loaded into memory in Spark.
sdf_register(flights_table, "flights_spark")
Cache into Memory
The tbl_cache command loads the results into a Spark RDD in memory, so any analysis from there on will not need to re-read and re-transform the original file.
The resulting Spark RDD is smaller than the original file because the transformations created a smaller data set.
tbl_cache(sc, "flights_spark")
Driver Memory
In the Executors page of the Spark Web UI, we can see that the Storage Memory is at about half of the 16 gigabytes requested.
This is mainly because of a Spark setting called spark.memory.fraction, which reserves by default 40% of the memory requested.
Process on the fly
The plan is to read the Flights 2007 file, combine it with the 2008 file and summarize the data without bringing either file fully into memory.
spark_read_csv(sc, "flights_spark_2007" , "2007.csv.bz2", memory = FALSE)
Union and Transform
The union command is akin to the bind_rows dplyr command.
It will allow us to append the 2007 file to the 2008 file, and as with the previous transform, this script will be evaluated lazily.
all_flights <- tbl(sc, "flights_spark_2008") %>%
union(tbl(sc, "flights_spark_2007")) %>%
group_by(Year, Month) %>%
tally()
Collect into R
When receiving a collect command, Spark will execute the SQL statement and send the results back to R in a data frame.
In this case, R only loads 24 observations into a data frame called all_flights.
all_flights <- all_flights %>%
collect()
Plot in R
Now the smaller data set can be plotted:
ggplot(data = all_flights, aes(x = Month, y = n/1000, fill = factor(Year))) +
geom_area(position = "dodge", alpha = 0.5) +
geom_line(alpha = 0.4) +
scale_fill_brewer(palette = "Dark2", name = "Year") +
scale_x_continuous(breaks = 1:12, labels = c("J","F","M","A","M","J","J","A","S","O","N","D")) +
theme_light() +
labs(y="Number of Flights (Thousands)", title = "Number of Flights Year-Over-Year")
Deployment and Configuration
Deployment
There are two well supported deployment modes for sparklyr:
Local — Working on a local desktop typically with smaller/sampled datasets
Cluster — Working directly within or alongside a Spark cluster (standalone, YARN, Mesos, etc.)
Local Deployment
Local mode is an excellent way to learn and experiment with Spark.
Local mode also provides a convenient development environment for analyses, reports, and applications that you plan to eventually deploy to a multi-node Spark cluster.
To work in local mode you should first install a version of Spark for local use.
You can do this using the spark_install function, for example:
sparklyr::spark_install(version = "2.1.0")
To connect to the local Spark instance you pass “local” as the value of the Spark master node to spark_connect:
library(sparklyr)
sc <- spark_connect(master = "local")
For the local development scenario, see the Configuration section below for details on how to have the same code work seamlessly in both development and production environments.
Cluster Deployment
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster).
In this setup, client mode is appropriate.
In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster.
The input and output of the application is attached to the console.
Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
For more information see Submitting Applications.
To use sparklyr with a Spark cluster you should locate your R session on a machine that is either directly on one of the cluster nodes or is close to the cluster (for networking performance).
In the case where R is not running directly on the cluster you should also ensure that the machine has a Spark version and configuration identical to that of the cluster nodes.
The most straightforward way to run R within or near to the cluster is either a remote SSH session or via RStudio Server.
In cluster mode you use the version of Spark already deployed on the cluster node.
This version is located via the SPARK_HOME
environment variable, so you should be sure that this variable is correctly defined on your server before attempting a connection.
This would typically be done within the Renviron.site configuration file.
For example:
SPARK_HOME=/opt/spark/spark-2.0.0-bin-hadoop2.6
To connect, pass the address of the master node to spark_connect, for example:
library(sparklyr)
sc <- spark_connect(master = "spark://local:7077")
For a Hadoop YARN cluster, you can connect using the YARN master, for example:
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
If you are running on EC2 using the Spark EC2 deployment scripts then you can read the master from /root/spark-ec2/cluster-url
, for example:
library(sparklyr)
cluster_url <- system('cat /root/spark-ec2/cluster-url', intern=TRUE)
sc <- spark_connect(master = cluster_url)
Livy Connections
Livy, “An Open Source REST Service for Apache Spark (Apache License)”, is available starting in sparklyr 0.5
as an experimental feature.
Among many scenarios, this enables connections from the RStudio desktop to Apache Spark when Livy is available and correctly configured in the remote cluster.
To work with Livy locally, sparklyr
supports livy_install(), which installs Livy in your local environment; this is similar to spark_install().
Since Livy is a service to enable remote connections into Apache Spark, the service needs to be started with livy_service_start()
.
Once the service is running, spark_connect()
needs to reference the running service and use method = "livy"
, then sparklyr
can be used as usual.
A short example follows:
livy_install()
livy_service_start()
sc <- spark_connect(master = "http://localhost:8998", method = "livy")
copy_to(sc, iris)
spark_disconnect(sc)
livy_service_stop()
Connection Tools
You can view the Spark web UI via the spark_web function, and view the Spark log via the spark_log function:
spark_web(sc)
spark_log(sc)
You can disconnect from Spark using the spark_disconnect function:
spark_disconnect(sc)
Collect
The collect
function transfers data from Spark into R.
The data are collected from a cluster environment and transferred into local R memory.
In the process, all data is first transferred from executor nodes to the driver node.
Therefore, the driver node must have enough memory to collect all the data.
Collecting data on the driver node is relatively slow.
The process also inflates the data as it moves from the executor nodes to the driver node.
Caution should be used when collecting large data.
The following parameters could be adjusted to avoid OutOfMemory and Timeout errors (see the sketch after this list):
spark.executor.heartbeatInterval
spark.network.timeout
spark.driver.extraJavaOptions
spark.driver.memory
spark.yarn.driver.memoryOverhead
spark.driver.maxResultSize
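Here is a minimal sketch of raising a few of these limits through spark_config() before connecting; the values are illustrative, not recommendations:
conf <- spark_config()
# Allow larger results to be returned to the driver
conf$spark.driver.maxResultSize <- "4G"
# Give the driver more memory to hold collected data
conf$spark.driver.memory <- "8G"
# Relax the network timeout for long-running collects
conf$spark.network.timeout <- "600s"
sc <- spark_connect(master = "yarn-client", config = conf)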
Configuration
This section describes the various options available for configuring both the behavior of the sparklyr package as well as the underlying Spark cluster.
Creating multiple configuration profiles (e.g. development, test, production) is also covered.
Config Files
The configuration for a Spark connection is specified via the config
parameter of the spark_connect function.
By default the configuration is established by calling the spark_config function.
This code represents the default behavior:
spark_connect(master = "local", config = spark_config())
By default the spark_config function reads configuration data from a file named config.yml
located in the current working directory (or in parent directories if not located in the working directory).
This file is not required and only need be provided for overriding default behavior.
You can also specify an alternate config file name and/or location.
The config.yml
file is in turn processed using the config package, which enables support for multiple named configuration profiles.
Package Options
There are a number of options available to configure the behavior of the sparklyr package:
For example, this configuration file sets the number of local cores to 4 and the amount of memory allocated for the Spark driver to 4G:
default:
sparklyr.cores.local: 4
sparklyr.shell.driver-memory: 4G
Note that the use of default
will be explained below in Multiple Profiles.
Spark
sparklyr.shell.* - Command line parameters to pass to spark-submit. For example, sparklyr.shell.executor-memory: 20G configures --executor-memory 20G (see the Spark documentation for details on supported options).
Runtime
sparklyr.cores.local - Number of cores to use when running in local mode (defaults to parallel::detectCores).
sparklyr.sparkui.url - Configures the url to the Spark UI web interface when calling spark_web.
sparklyr.defaultPackages - List of default Spark packages to install in the cluster (defaults to “com.databricks:spark-csv_2.11:1.3.0” and “com.amazonaws:aws-java-sdk-pom:1.10.34”).
sparklyr.sanitize.column.names - Allows Spark to automatically rename column names to conform to Spark naming restrictions.
Diagnostics
sparklyr.backend.threads - Number of threads to use in the sparklyr backend to process incoming connections from the sparklyr client.
sparklyr.app.jar - The application jar to be submitted in Spark submit.
sparklyr.ports.file - Path to the ports file used to share connection information with the sparklyr backend.
sparklyr.ports.wait.seconds - Number of seconds to wait for the Spark connection to initialize.
sparklyr.verbose - Provide additional feedback while performing operations. Currently used to communicate which column names are being sanitized by sparklyr.sanitize.column.names.
Spark Options
You can also use config.yml
to specify arbitrary Spark configuration properties:
spark.* |
Configuration settings for the Spark context (applied by creating a SparkConf containing the specified properties).
For example, spark.executor.memory: 1g configures the memory available in each executor (see Spark Configuration for additional options.) |
spark.sql.* |
Configuration settings for the Spark SQL context (applied using SET).
For instance, spark.sql.shuffle.partitions configures the number of partitions to use while shuffling (see the SQL Programming Guide for additional options). |
For example, this configuration file sets a custom scratch directory for Spark and specifies 100 as the number of partitions to use when shuffling data for joins or aggregations:
default:
  spark.local.dir: /tmp/spark-scratch
  spark.sql.shuffle.partitions: 100
User Options
You can also include arbitrary custom user options within the config.yml
file.
These can be named anything you like so long as they do not use either spark
or sparklyr
as a prefix.
For example, this configuration file defines dataset
and sample-size
options:
default:
  dataset: "observations.parquet"
  sample-size: 10000
Multiple Profiles
The config package enables the definition of multiple named configuration profiles for different environments (e.g. default, test, production).
All environments automatically inherit from the default
environment and can optionally also inherit from each other.
For example, you might want to use distinct datasets for development and testing, or custom Spark configuration properties that are only applied when running on a production cluster.
Here’s how that would be expressed in config.yml
:
default:
  dataset: "observations-dev.parquet"
  sample-size: 10000

production:
  spark.memory.fraction: 0.9
  spark.rdd.compress: true
  dataset: "observations.parquet"
  sample-size: null
You can also use this feature to specify distinct Spark master nodes for different environments, for example:
default:
  spark.master: "local"

production:
  spark.master: "spark://local:7077"
With this configuration, you can omit the master
argument entirely from the call to spark_connect:
sc <- spark_connect()
Note that the currently active configuration is determined by the value of the R_CONFIG_ACTIVE
environment variable.
See the config package documentation for additional details.
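For example, a minimal sketch of activating the production profile shown above before connecting:
Sys.setenv(R_CONFIG_ACTIVE = "production")   # select the "production" profile from config.yml
sc <- spark_connect()                        # spark.master is read from the active profile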
Tuning
In general, you will need to tune a Spark cluster for it to perform well.
Spark applications tend to consume a lot of resources.
There are many knobs to control the performance of Yarn and executor (i.e. worker) nodes in a cluster.
Some of the parameters to pay attention to are as follows:
spark.executor.heartbeatInterval
spark.network.timeout
spark.executor.extraJavaOptions
spark.executor.memory
spark.yarn.executor.memoryOverhead
spark.executor.cores
spark.executor.instances (if dynamic allocation is not enabled)
Example Config
Here is an example spark configuration for an EMR cluster on AWS with 1 master and 2 worker nodes.
Each node has 8 vCPUs and 61 GiB of memory.
spark.driver.extraJavaOptions | append -XX:MaxPermSize=30G |
spark.driver.maxResultSize | 0 |
spark.driver.memory | 30G |
spark.yarn.driver.memoryOverhead | 4096 |
spark.yarn.executor.memoryOverhead | 4096 |
spark.executor.memory | 4G |
spark.executor.cores | 2 |
spark.dynamicAllocation.maxExecutors | 15 |
Configuration parameters can be set in the config R object, in the config.yml file, or in spark-defaults.conf.
Configuration in R script
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client", config = config, version = '2.0.0')
Configuration in YAML script
default:
  spark.executor.cores: 2
  spark.executor.memory: 4G
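Configuration in spark-defaults.conf (for reference, a sketch of the equivalent entries; this file typically lives under $SPARK_HOME/conf)
spark.executor.cores   2
spark.executor.memory  4G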
RStudio Server
RStudio Server provides a web-based IDE interface to a remote R session, making it ideal for use as a front-end to a Spark cluster.
This section covers some additional configuration options that are useful for RStudio Server.
Connection Options
The RStudio IDE Spark pane provides a New Connection dialog to assist in connecting with both local instances of Spark and Spark clusters:
You can configure which connection choices are presented using the rstudio.spark.connections
option.
By default, users are presented with the possibility of both local and cluster connections; however, you can modify this behavior to present only one of these, or even a specific Spark master URL.
Some commonly used combinations of connection choices include:
c("local", "cluster") |
Default.
Present connections to both local and cluster Spark instances. |
"local" |
Present only connections to local Spark instances. |
"spark://local:7077" |
Present only a connection to a specific Spark cluster. |
c("spark://local:7077", "cluster") |
Present a connection to a specific Spark cluster and other clusters. |
This option should generally be set within Rprofile.site.
For example:
options(rstudio.spark.connections = "spark://local:7077")
Spark Installations
If you are running within local mode (as opposed to cluster mode) you may want to provide pre-installed Spark version(s) to be shared by all users of the server.
You can do this by installing Spark versions within a shared directory (e.g. /opt/spark
) then designating it as the Spark installation directory.
For example, after installing one or more versions of Spark to /opt/spark
you would add the following to Rprofile.site:
options(spark.install.dir = "/opt/spark")
If this directory is read-only for ordinary users then RStudio will not offer installation of additional versions, which will help guide users to a version that is known to be compatible with versions of Spark deployed on clusters in the same organization.
Distributing R Computations
Overview
sparklyr provides support to run arbitrary R code at scale within your Spark Cluster through spark_apply()
.
This is especially useful when you need functionality that is only available in R or in R packages and is not available in Apache Spark or in Spark packages.
spark_apply()
applies an R function to a Spark object (typically, a Spark DataFrame).
Spark objects are partitioned so they can be distributed across a cluster.
You can use spark_apply
with the default partitions or you can define your own partitions with the group_by
argument.
Your R function must return another data frame (or an object that can be coerced to one). spark_apply
will run your R function on each partition and output a single Spark DataFrame.
Apply an R function to a Spark Object
Let's run a simple example.
We will apply the identity function, I()
, over a list of numbers we created with the sdf_len
function.
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_len(sc, 5, repartition = 1) %>%
spark_apply(function(e) I(e))
## # Source: table<sparklyr_tmp_378c2e4fb50> [?? x 1]
## # Database: spark_connection
## id
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
Your R function should be designed to operate on an R data frame.
The R function passed to spark_apply
expects a DataFrame and will return an object that can be cast as a DataFrame.
We can use the class
function to verify the class of the data.
sdf_len(sc, 10, repartition = 1) %>%
spark_apply(function(e) class(e))
## # Source: table<sparklyr_tmp_378c7ce7618d> [?? x 1]
## # Database: spark_connection
## id
## <chr>
## 1 data.frame
Spark will partition your data by hash or range so it can be distributed across a cluster.
In the following example we create two partitions and count the number of rows in each partition.
Then we print the first record in each partition.
trees_tbl <- sdf_copy_to(sc, trees, repartition = 2)
trees_tbl %>%
spark_apply(function(e) nrow(e), names = "n")
## # Source: table<sparklyr_tmp_378c15c45eb1> [?? x 1]
## # Database: spark_connection
## n
## <int>
## 1 16
## 2 15
trees_tbl %>%
spark_apply(function(e) head(e, 1))
## # Source: table<sparklyr_tmp_378c29215418> [?? x 3]
## # Database: spark_connection
## Girth Height Volume
## <dbl> <dbl> <dbl>
## 1 8.3 70 10.3
## 2 8.6 65 10.3
We can apply any arbitrary function to the partitions in the Spark DataFrame.
For instance, we can scale or jitter the columns.
Notice that spark_apply
applies the R function to all partitions and returns a single DataFrame.
trees_tbl %>%
spark_apply(function(e) scale(e))
## # Source: table<sparklyr_tmp_378c8922ba8> [?? x 3]
## # Database: spark_connection
## Girth Height Volume
## <dbl> <dbl> <dbl>
## 1 -1.4482330 -0.99510521 -1.1503645
## 2 -1.3021313 -2.06675697 -1.1558670
## 3 -0.7469449 0.68891899 -0.6826528
## 4 -0.6592839 -1.60747764 -0.8587325
## 5 -0.6300635 0.53582588 -0.4735581
## 6 -0.5716229 0.38273277 -0.3855183
## 7 -0.5424025 -0.07654655 -0.5395880
## 8 -0.3670805 -0.22963966 -0.6661453
## 9 -0.1040975 1.30129143 0.1427209
## 10 0.1296653 -0.84201210 -0.3029809
## # ... with more rows
trees_tbl %>%
spark_apply(function(e) lapply(e, jitter))
## # Source: table<sparklyr_tmp_378c43237574> [?? x 3]
## # Database: spark_connection
## Girth Height Volume
## <dbl> <dbl> <dbl>
## 1 8.319392 70.04321 10.30556
## 2 8.801237 62.85795 10.21751
## 3 10.719805 81.15618 18.78076
## 4 11.009892 65.98926 15.58448
## 5 11.089322 80.14661 22.58749
## 6 11.309682 79.01360 24.18158
## 7 11.418486 75.88748 21.38380
## 8 11.982421 74.85612 19.09375
## 9 12.907616 84.81742 33.80591
## 10 13.691892 71.05309 25.70321
## # ... with more rows
By default spark_apply()
derives the column names from the input Spark data frame.
Use the names
argument to rename or add new columns.
trees_tbl %>%
spark_apply(
function(e) data.frame(2.54 * e$Girth, e),
names = c("Girth(cm)", colnames(trees)))
## # Source: table<sparklyr_tmp_378c14e015b5> [?? x 4]
## # Database: spark_connection
## `Girth(cm)` Girth Height Volume
## <dbl> <dbl> <dbl> <dbl>
## 1 21.082 8.3 70 10.3
## 2 22.352 8.8 63 10.2
## 3 27.178 10.7 81 18.8
## 4 27.940 11.0 66 15.6
## 5 28.194 11.1 80 22.6
## 6 28.702 11.3 79 24.2
## 7 28.956 11.4 76 21.4
## 8 30.480 12.0 75 19.1
## 9 32.766 12.9 85 33.8
## 10 34.798 13.7 71 25.7
## # ... with more rows
Group By
In some cases you may want to apply your R function to specific groups in your data.
For example, suppose you want to compute regression models against specific subgroups.
To solve this, you can specify a group_by
argument.
This example counts the number of rows in iris
by species and then fits a simple linear model for each species.
iris_tbl <- sdf_copy_to(sc, iris)
iris_tbl %>%
spark_apply(nrow, group_by = "Species")
## # Source: table<sparklyr_tmp_378c1b8155f3> [?? x 2]
## # Database: spark_connection
## Species Sepal_Length
## <chr> <int>
## 1 versicolor 50
## 2 virginica 50
## 3 setosa 50
iris_tbl %>%
spark_apply(
function(e) summary(lm(Petal_Length ~ Petal_Width, e))$r.squared,
names = "r.squared",
group_by = "Species")
## # Source: table<sparklyr_tmp_378c30e6155> [?? x 2]
## # Database: spark_connection
## Species r.squared
## <chr> <dbl>
## 1 versicolor 0.6188467
## 2 virginica 0.1037537
## 3 setosa 0.1099785
Distributing Packages
With spark_apply()
you can use any R package inside Spark.
For instance, you can use the broom package to create a tidy data frame from linear regression output.
spark_apply(
iris_tbl,
function(e) broom::tidy(lm(Petal_Length ~ Petal_Width, e)),
names = c("term", "estimate", "std.error", "statistic", "p.value"),
group_by = "Species")
## # Source: table<sparklyr_tmp_378c5502500b> [?? x 6]
## # Database: spark_connection
## Species term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 versicolor (Intercept) 1.7812754 0.2838234 6.276000 9.484134e-08
## 2 versicolor Petal_Width 1.8693247 0.2117495 8.827999 1.271916e-11
## 3 virginica (Intercept) 4.2406526 0.5612870 7.555230 1.041600e-09
## 4 virginica Petal_Width 0.6472593 0.2745804 2.357267 2.253577e-02
## 5 setosa (Intercept) 1.3275634 0.0599594 22.141037 7.676120e-27
## 6 setosa Petal_Width 0.5464903 0.2243924 2.435422 1.863892e-02
To use R packages inside Spark, your packages must be installed on the worker nodes.
The first time you call spark_apply
all of the contents in your local .libPaths()
will be copied into each Spark worker node via the SparkConf.addFile()
function.
Packages will only be copied once and will persist as long as the connection remains open.
It's not uncommon for R libraries to be several gigabytes in size, so be prepared for a one-time tax while the R packages are copied over to your Spark cluster.
You can disable package distribution by setting packages = FALSE
.
Note: packages are not copied in local mode (master="local"
) because the packages already exist on the system.
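As a minimal sketch, package distribution can be turned off for a single call (reusing the trees_tbl table from the examples above):
trees_tbl %>%
  spark_apply(function(e) nrow(e), names = "n", packages = FALSE)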
Handling Errors
It can be more difficult to troubleshoot R issues in a cluster than in local mode.
For instance, the following R code causes the distributed execution to fail and suggests you check the logs for details.
spark_apply(iris_tbl, function(e) stop("Make this fail"))
Error in force(code) :
sparklyr worker rscript failure, check worker logs for details
In local mode, sparklyr
will retrieve the logs for you.
The logs point out the real failure as ERROR sparklyr: RScript (4190) Make this fail
as you might expect.
---- Output Log ----
(17/07/27 21:24:18 ERROR sparklyr: Worker (2427) is shutting down with exception ,java.net.SocketException: Socket closed)
17/07/27 21:24:18 WARN TaskSetManager: Lost task 0.0 in stage 389.0 (TID 429, localhost, executor driver): 17/07/27 21:27:21 INFO sparklyr: RScript (4190) retrieved 150 rows
17/07/27 21:27:21 INFO sparklyr: RScript (4190) computing closure
17/07/27 21:27:21 ERROR sparklyr: RScript (4190) Make this fail
It is worth mentioning that different cluster providers and platforms expose worker logs in different ways.
Specific documentation for your environment will point out how to retrieve these logs.
Requirements
The R Runtime is expected to be pre-installed in the cluster for spark_apply
to function.
Failure to install R on the cluster will trigger a Cannot run program, no such file or directory
error while attempting to use spark_apply()
.
Contact your cluster administrator to consider making the R runtime available throughout the entire cluster.
A Homogeneous Cluster is required since the driver node distributes, and potentially compiles, packages to the workers.
For instance, the driver and workers must have the same processor architecture, system libraries, etc.
Configuration
The following table describes relevant parameters while making use of spark_apply
.
spark.r.command |
The path to the R binary.
Useful to select from multiple R versions. |
sparklyr.worker.gateway.address |
The gateway address to use under each worker node.
Defaults to sparklyr.gateway.address . |
sparklyr.worker.gateway.port |
The gateway port to use under each worker node.
Defaults to sparklyr.gateway.port . |
For example, one could make use of a specific R version by running:
config <- spark_config()
config[["spark.r.command"]] <- "<path-to-r-version>"
sc <- spark_connect(master = "local", config = config)
sdf_len(sc, 10) %>% spark_apply(function(e) e)
Limitations
Closures
Closures are serialized using serialize
, which is described as "A simple low-level interface for serializing to connections."
One of the current limitations of serialize
is that it won't serialize objects referenced outside of its environment.
For instance, the following function will error out since the closure references external_value
:
external_value <- 1
spark_apply(iris_tbl, function(e) e + external_value)
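One possible workaround, shown here only as a sketch, is to inject the value into the Spark data frame as a column before calling spark_apply(), so the closure no longer references anything outside its environment:
library(dplyr)
external_value <- 1
iris_tbl %>%
  mutate(ev = !!external_value) %>%                        # ship the value as a literal column
  spark_apply(function(e) e$Petal_Length + e$ev, names = "result")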
Livy
Currently, Livy connections do not support distributing packages, since the client machine where the libraries are precompiled might not have the same processor architecture or operating system as the cluster machines.
Computing over Groups
While performing computations over groups, spark_apply()
will partition the data over the selected column; however, this implies that each partition must fit into a worker node, and if it does not, an exception will be thrown.
To perform operations over groups that exceed the resources of a single node, consider partitioning into smaller units or using dplyr::do
, which is currently optimized for large partitions.
Package Installation
Since packages are copied only once for the duration of the spark_connect()
connection, installing additional packages is not supported while the connection is active.
Therefore, if a new package needs to be installed, spark_disconnect()
the connection, modify packages and reconnect.
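A minimal sketch of that workflow (the package name and master value are illustrative):
spark_disconnect(sc)                          # close the active connection
install.packages("broom")                     # install the additional package locally
sc <- spark_connect(master = "yarn-client")   # packages are copied again on the next spark_apply()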
Data Science using a Data Lake
Audience
This article aims to explain how to take advantage of Apache Spark inside organizations that have already implemented, or are in the process of implementing, a Hadoop based Big Data Lake.
Introduction
We have noticed that the types of questions we field after a demo of sparklyr to our customers were more about high-level architecture than how the package works.
To answer those questions, we put together a set of slides that illustrate and discuss important concepts, to help customers see where Spark, R, and sparklyr fit in a Big Data Platform implementation.
In this article, we’ll review those slides and provide a narrative that will help you better envision how you can take advantage of our products.
R for Data Science
It is very important to preface the Use Case review with some background information about where RStudio focuses its efforts when developing packages and products.
Many vendors offer R integration, but in most cases, what this means is that they will add a model built in R to their pipeline or interface, and pass new inputs to that model to generate outputs that can be used in the next step in the pipeline, or in a calculation for the interface.
In contrast, our focus is on the process that happens before that: the discipline that produces the model, meaning Data Science.
In their R for Data Science book, Hadley Wickham and Garrett Grolemund provide a great diagram that nicely illustrates the Data Science process: We import data into memory with R and clean and tidy the data.
Then we go into a cyclical process called understand, which helps us to get to know our data, and hopefully find the answer to the question we started with.
This cycle typically involves making transformations to our tidied data, using the transformed data to fit models, and visualizing results.
Once we find an answer to our question, we then communicate the results.
Data Scientists like using R because it allows them to complete a Data Science project from beginning to end inside the R environment, and in memory.
Hadoop as a Data Source
What happens when the data that needs to be analyzed is very large, like the data sets found in a Hadoop cluster? It would be impossible to fit these in memory, so workarounds are normally used.
Possible workarounds include using a comparatively minuscule data sample, or downloading as much data as possible.
This becomes disruptive to Data Scientists because either the small sample may not be representative, or they have to wait a long time in every iteration of importing a lot of data, exploring a lot of data, and modeling a lot of data.
Spark as an Analysis Engine
We noticed that a very important mental leap to make is to see Spark not just as a gateway to Hadoop (or worse, as an additional data source), but as a computing engine.
As such, it is an excellent vehicle to scale our analytics.
Spark has many capabilities that make it ideal for Data Science in a data lake, such as close integration with Hadoop and Hive, the ability to cache data into memory across multiple nodes, data transformers, and its Machine Learning libraries.
The approach, then, is to push as much compute to the cluster as possible, using R primarily as an interface to Spark for the Data Scientist, which will then collect as few results as possible back into R memory, mostly to visualize and communicate.
As shown in the slide, the more import, tidy, transform and modeling work we can push to Spark, the faster we can analyze very large data sets.
Cluster Setup
Here is an illustration of how R, RStudio, and sparklyr can be added to the YARN managed cluster.
The highlights are:
R, RStudio, and sparklyr need to be installed on one node only, typically an edge node
The Data Scientist can access R, Spark, and the cluster via a web browser by navigating to the RStudio IDE inside the edge node
Considerations
There are some important considerations to keep in mind when combining your Data Lake and R for large scale analytics:
Spark’s Machine Learning libraries may not contain specific models that a Data Scientist needs.
For those cases, workarounds would include using a sparklyr extension like H2O, or collecting a sample of the data into R memory for modeling.
Spark does not have visualization functionality; currently, the best approach is to collect pre-calculated data into R for plotting.
A good way to drastically reduce the number of rows being brought back into memory is to push as much computation as possible to Spark, and return just the results to be plotted.
For example, the bins of a Histogram can be calculated in Spark, so that only the final bucket values would be returned to R for visualization.
Here is sample code for such a scenario: sparkDemos/Histogram
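As a rough sketch of that idea, the bin counts can be computed in Spark with dplyr and only the summarized rows collected into R for plotting; flights_tbl, the arr_delay column, and the 15-minute bin width are assumptions used for illustration:
library(dplyr)
library(ggplot2)

binned <- flights_tbl %>%                        # hypothetical Spark table
  filter(!is.na(arr_delay)) %>%
  mutate(bin = floor(arr_delay / 15) * 15) %>%   # bins computed inside Spark
  count(bin) %>%                                 # one row per bin
  collect()                                      # only the bin totals reach R

ggplot(binned, aes(x = bin, y = n)) + geom_col()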
A particular use case may require a different way of scaling analytics.
We have published an article that provides a very good overview of the options that are available: R for Enterprise: How to Scale Your Analytics Using R
R for Data Science Toolchain with Spark
With sparklyr, the Data Scientist will be able to access the Data Lake’s data, and also gain an additional, very powerful understand layer via Spark.
sparklyr, along with the RStudio IDE and the tidyverse packages, provides the Data Scientist with an excellent toolbox to analyze data, big and small.
Spark ML Pipelines
Spark’s ML Pipelines provide a way to easily combine multiple
transformations and algorithms into a single workflow, or pipeline.
For R users, the insights gathered during the interactive sessions with
Spark can now be converted to a formal pipeline.
This makes the hand-off
from Data Scientists to Big Data Engineers a lot easier, because
there should be no additional changes needed from the latter
group.
The final list of selected variables, data manipulation, feature
transformations and modeling can be easily re-written into a ml_pipeline()
object, saved, and ultimately placed into a Production
environment.
The sparklyr
output of a saved Spark ML Pipeline object
is in Scala code, which means that the code can be added to the
scheduled Spark ML jobs without any dependency on R.
Introduction to ML Pipelines
The official Apache Spark site contains a more complete overview of ML
Pipelines.
This
article will focus on introducing the basic concepts and steps for working
with ML Pipelines via sparklyr
.
There are two important stages in building an ML Pipeline.
The first one
is creating a Pipeline.
A good way to think of it is as
an "empty" pipeline: this step just defines the steps that the data
will go through.
It is roughly equivalent to doing this in R:
r_pipeline <- . %>%
  mutate(cyl = paste0("c", cyl)) %>%
  lm(am ~ cyl + mpg, data = .)
r_pipeline
## Functional sequence with the following components:
##
## 1. mutate(., cyl = paste0("c", cyl))
## 2. lm(am ~ cyl + mpg, data = .)
##
## Use 'functions' to extract the individual functions.
The r_pipeline
object has all the steps needed to transform and fit
the model, but it has not yet transformed any data.
The second step is to pass data through the pipeline, which in turn
will output a fitted model, called a PipelineModel.
The
PipelineModel can then be used to produce predictions.
r_model <- r_pipeline(mtcars)
r_model
##
## Call:
## lm(formula = am ~ cyl + mpg, data = .)
##
## Coefficients:
## (Intercept) cylc6 cylc8 mpg
## -0.54388 0.03124 -0.03313 0.04767
Taking advantage of Pipelines and PipelineModels
The two stage ML Pipeline approach produces two final data products:
A PipelineModel that can be added to the daily Spark jobs which
will produce new predictions for the incoming data, and again, with
no R dependencies.
A Pipeline that can be easily re-fitted on a regular
interval, say every month.
All that is needed is to pass a new
sample to obtain the new coefficients.
Pipeline
An additional goal of this article is that the reader can follow along,
so the data, transformations and Spark connection in this example will
be kept as easy to reproduce as possible.
library(nycflights13)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", spark_version = "2.2.0")
## * Using Spark: 2.2.0
spark_flights <- sdf_copy_to(sc, flights)
Pipelines make heavy use of Feature
Transformers.
If you are new to Spark and sparklyr
, it would be good to review what these
transformers do.
These functions use the Spark API directly to transform
the data, and may be faster at making the data manipulations than a dplyr
(SQL) transformation.
In sparklyr
, the ft
functions are essentially wrappers around the
original Spark feature
transformers.
This example will start with dplyr
transformations, which are
ultimately SQL transformations, loaded into the df
variable.
In sparklyr
, there is one feature transformer that is not available in
Spark, ft_dplyr_transformer()
.
The goal of this function is to convert
the dplyr
code to a SQL Feature Transformer that can then be used in a
Pipeline.
df <- spark_flights %>%
filter(!is.na(dep_delay)) %>%
mutate(
month = paste0("m", month),
day = paste0("d", day)
) %>%
select(dep_delay, sched_dep_time, month, day, distance)
This is the resulting pipeline stage produced from the dplyr
code:
ft_dplyr_transformer(sc, df)
Use the ml_param()
function to extract the “statement” attribute.
That
attribute contains the finalized SQL statement.
Notice that the flights
table name has been replaced with __THIS__
.
This allows the
pipeline to accept different table names as its source, making the
pipeline very modular.
ft_dplyr_transformer(sc, df) %>%
ml_param("statement")
## [1] "SELECT `dep_delay`, `sched_dep_time`, `month`, `day`, `distance`\nFROM (SELECT `year`, CONCAT(\"m\", `month`) AS `month`, CONCAT(\"d\", `day`) AS `day`, `dep_time`, `sched_dep_time`, `dep_delay`, `arr_time`, `sched_arr_time`, `arr_delay`, `carrier`, `flight`, `tailnum`, `origin`, `dest`, `air_time`, `distance`, `hour`, `minute`, `time_hour`\nFROM (SELECT *\nFROM `__THIS__`\nWHERE (NOT(((`dep_delay`) IS NULL)))) `bjbujfpqzq`) `axbwotqnbr`"
Creating the Pipeline
The following step will create a 5 stage pipeline:
SQL transformer - Resulting from the ft_dplyr_transformer()
transformation
Binarizer - To determine if the flight should be considered delayed;
this becomes the eventual outcome variable.
Bucketizer - To split the day into specific hour buckets
R Formula - To define the model’s formula
Logistic Model
flights_pipeline <- ml_pipeline(sc) %>%
ft_dplyr_transformer(
tbl = df
) %>%
ft_binarizer(
input.col = "dep_delay",
output.col = "delayed",
threshold = 15
) %>%
ft_bucketizer(
input.col = "sched_dep_time",
output.col = "hours",
splits = c(400, 800, 1200, 1600, 2000, 2400)
) %>%
ft_r_formula(delayed ~ month + day + hours + distance) %>%
ml_logistic_regression()
Another nice feature of ML Pipelines in sparklyr
is the print-out.
It makes it really easy to see how each stage is set up:
flights_pipeline
## Pipeline (Estimator) with 5 stages
## <pipeline_24044e4f2e21>
## Stages
## |--1 SQLTransformer (Transformer)
## | <dplyr_transformer_2404e6a1b8e>
## | (Parameters -- Column Names)
## |--2 Binarizer (Transformer)
## | <binarizer_24045c9227f2>
## | (Parameters -- Column Names)
## | input_col: dep_delay
## | output_col: delayed
## |--3 Bucketizer (Transformer)
## | <bucketizer_240412366b1e>
## | (Parameters -- Column Names)
## | input_col: sched_dep_time
## | output_col: hours
## |--4 RFormula (Estimator)
## | <r_formula_240442d75f00>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | (Parameters)
## | force_index_label: FALSE
## | formula: delayed ~ month + day + hours + distance
## |--5 LogisticRegression (Estimator)
## | <logistic_regression_24044321ad0>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | prediction_col: prediction
## | probability_col: probability
## | raw_prediction_col: rawPrediction
## | (Parameters)
## | aggregation_depth: 2
## | elastic_net_param: 0
## | family: auto
## | fit_intercept: TRUE
## | max_iter: 100
## | reg_param: 0
## | standardization: TRUE
## | threshold: 0.5
## | tol: 1e-06
Notice that there are no coefficients defined yet.
That's because no
data has actually been processed.
Even though df
references spark_flights
, recall that the final SQL transformer replaces that
table name with __THIS__, so there's no data to process yet.
PipelineModel
A quick partition of the data is created for this exercise.
partitioned_flights <- sdf_partition(
spark_flights,
training = 0.01,
testing = 0.01,
rest = 0.98
)
The ml_fit()
function produces the PipelineModel.
The training
partition of the partitioned_flights
data is used to train the model:
fitted_pipeline <- ml_fit(
flights_pipeline,
partitioned_flights$training
)
fitted_pipeline
## PipelineModel (Transformer) with 5 stages
## <pipeline_24044e4f2e21>
## Stages
## |--1 SQLTransformer (Transformer)
## | <dplyr_transformer_2404e6a1b8e>
## | (Parameters -- Column Names)
## |--2 Binarizer (Transformer)
## | <binarizer_24045c9227f2>
## | (Parameters -- Column Names)
## | input_col: dep_delay
## | output_col: delayed
## |--3 Bucketizer (Transformer)
## | <bucketizer_240412366b1e>
## | (Parameters -- Column Names)
## | input_col: sched_dep_time
## | output_col: hours
## |--4 RFormulaModel (Transformer)
## | <r_formula_240442d75f00>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | (Transformer Info)
## | formula: chr "delayed ~ month + day + hours + distance"
## |--5 LogisticRegressionModel (Transformer)
## | <logistic_regression_24044321ad0>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | prediction_col: prediction
## | probability_col: probability
## | raw_prediction_col: rawPrediction
## | (Transformer Info)
## | coefficient_matrix: num [1, 1:43] 0.709 -0.3401 -0.0328 0.0543 -0.4774 ...
## | coefficients: num [1:43] 0.709 -0.3401 -0.0328 0.0543 -0.4774 ...
## | intercept: num -3.04
## | intercept_vector: num -3.04
## | num_classes: int 2
## | num_features: int 43
## | threshold: num 0.5
Notice that the print-out for the fitted pipeline now displays the
model’s coefficients.
The ml_transform()
function can be used to run predictions; in other
words, it is used instead of predict()
or sdf_predict()
.
predictions <- ml_transform(
fitted_pipeline,
partitioned_flights$testing
)
predictions %>%
group_by(delayed, prediction) %>%
tally()
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## # Groups: delayed
## delayed prediction n
## <dbl> <dbl> <dbl>
## 1      0.         1.   51.
## 2      0.         0. 2599.
## 3      1.         0.  666.
## 4      1.         1.   69.
Save the pipelines to disk
The ml_save()
command can be used to save the Pipeline and
PipelineModel to disk.
The resulting output is a folder with the
selected name, which contains all of the necessary Scala scripts:
ml_save(
flights_pipeline,
"flights_pipeline",
overwrite = TRUE
)
## NULL
ml_save(
fitted_pipeline,
"flights_model",
overwrite = TRUE
)
## NULL
Use an existing PipelineModel
The ml_load()
command can be used to re-load Pipelines and
PipelineModels.
The saved ML Pipeline files can only be loaded into an
open Spark session.
reloaded_model <- ml_load(sc, "flights_model")
A simple query can be used as the table that will provide the data for the
new predictions.
This, of course, does not have to be done in R; at this
point the "flights_model" can be loaded into an independent Spark
session outside of R.
new_df <- spark_flights %>%
filter(
month == 7,
day == 5
)
ml_transform(reloaded_model, new_df)
## # Source: table<sparklyr_tmp_24041e052b5> [?? x 12]
## # Database: spark_connection
## dep_delay sched_dep_time month day distance delayed hours features
## <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <list>
##  1       39.           2359 m7    d5       1617.      1.    4. <dbl [43]>
##  2      141.           2245 m7    d5       2475.      1.    4. <dbl [43]>
##  3        0.            500 m7    d5        529.      0.    0. <dbl [43]>
##  4       -5.            536 m7    d5       1400.      0.    0. <dbl [43]>
##  5       -2.            540 m7    d5       1089.      0.    0. <dbl [43]>
##  6       -7.            545 m7    d5       1416.      0.    0. <dbl [43]>
##  7       -3.            545 m7    d5       1576.      0.    0. <dbl [43]>
##  8       -7.            600 m7    d5       1076.      0.    0. <dbl [43]>
##  9       -7.            600 m7    d5         96.      0.    0. <dbl [43]>
## 10       -6.            600 m7    d5        937.      0.    0. <dbl [43]>
## # ... with more rows, and 4 more variables: label <dbl>,
## #   rawPrediction <list>, probability <list>, prediction <dbl>
Re-fit an existing Pipeline
First, reload the pipeline into an open Spark session:
reloaded_pipeline <- ml_load(sc, "flights_pipeline")
Use ml_fit()
again to pass new data; in this case, sample_frac()
is
used instead of sdf_partition()
to provide the new data.
The idea is that the re-fitting would happen at a later date than when the
model was initially fitted.
new_model <- ml_fit(reloaded_pipeline, sample_frac(spark_flights, 0.01))
new_model
## PipelineModel (Transformer) with 5 stages
## <pipeline_24044e4f2e21>
## Stages
## |--1 SQLTransformer (Transformer)
## | <dplyr_transformer_2404e6a1b8e>
## | (Parameters -- Column Names)
## |--2 Binarizer (Transformer)
## | <binarizer_24045c9227f2>
## | (Parameters -- Column Names)
## | input_col: dep_delay
## | output_col: delayed
## |--3 Bucketizer (Transformer)
## | <bucketizer_240412366b1e>
## | (Parameters -- Column Names)
## | input_col: sched_dep_time
## | output_col: hours
## |--4 RFormulaModel (Transformer)
## | <r_formula_240442d75f00>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | (Transformer Info)
## | formula: chr "delayed ~ month + day + hours + distance"
## |--5 LogisticRegressionModel (Transformer)
## | <logistic_regression_24044321ad0>
## | (Parameters -- Column Names)
## | features_col: features
## | label_col: label
## | prediction_col: prediction
## | probability_col: probability
## | raw_prediction_col: rawPrediction
## | (Transformer Info)
## | coefficient_matrix: num [1, 1:43] 0.258 0.648 -0.317 0.36 -0.279 ...
## | coefficients: num [1:43] 0.258 0.648 -0.317 0.36 -0.279 ...
## | intercept: num -3.77
## | intercept_vector: num -3.77
## | num_classes: int 2
## | num_features: int 43
## | threshold: num 0.5
The new model can be saved using ml_save()
.
A new name is used in this
case, but the same name as the existing PipelineModel could be used to replace it.
ml_save(new_model, "new_flights_model", overwrite = TRUE)
## NULL
Finally, complete the example by closing the Spark session.
spark_disconnect(sc)
Text mining with Spark & sparklyr
This article focuses on a set of functions that can be used for text mining with Spark and sparklyr
.
The main goal is to illustrate how to perform most of the data preparation and analysis with commands that will run inside the Spark cluster, as opposed to locally in R.
Because of that, the amount of data used will be small.
Data source
For this example, two files will be analyzed: the full works of Sir Arthur Conan Doyle and the full works of Mark Twain.
The files were downloaded from the Gutenberg Project site via the gutenbergr
package.
Intentionally, no data cleanup was done to the files prior to this analysis.
See the appendix below to see how the data was downloaded and prepared.
readLines("arthur_doyle.txt", 10)
## [1] "THE RETURN OF SHERLOCK HOLMES,"
## [2] ""
## [3] "A Collection of Holmes Adventures"
## [4] ""
## [5] ""
## [6] "by Sir Arthur Conan Doyle"
## [7] ""
## [8] ""
## [9] ""
## [10] ""
Data Import
Connect to Spark
An additional goal of this article is to encourage the reader to try it out, so a simple Spark local mode session is used.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.1.0")
spark_read_text()
The spark_read_text()
is a new function which works like readLines()
but for sparklyr
.
It comes in handy when non-structured data, such as lines in a book, is what is available for analysis.
# Imports Mark Twain's file
# Setting up the path to the file in a Windows OS laptop
twain_path <- paste0("file:///", getwd(), "/mark_twain.txt")
twain <- spark_read_text(sc, "twain", twain_path)
# Imports Sir Arthur Conan Doyle's file
doyle_path <- paste0("file:///", getwd(), "/arthur_doyle.txt")
doyle <- spark_read_text(sc, "doyle", doyle_path)
The objective is to end up with a tidy table inside Spark with one row per word used.
The steps will be:
The needed data transformations apply to the data from both authors.
The data sets will be appended to one another
Punctuation will be removed
The words inside each line will be separated, or tokenized
For a cleaner analysis, stop words will be removed
To tidy the data, each word in a line will become its own row
The results will be saved to Spark memory
sdf_bind_rows()
sdf_bind_rows()
appends the doyle
Spark Dataframe to the twain
Spark Dataframe.
This function can be used in lieu of a dplyr::bind_rows()
wrapper function.
For this exercise, the column author
is added to differentiate between the two bodies of work.
all_words <- doyle %>%
mutate(author = "doyle") %>%
sdf_bind_rows({
twain %>%
mutate(author = "twain")}) %>%
filter(nchar(line) > 0)
regexp_replace
The Hive UDF, regexp_replace, is used as a sort of gsub()
that works inside Spark.
In this case it is used to remove punctuation.
The usual [:punct:]
regular expression did not work well during development, so a custom list is provided.
For more information, see the Hive Functions section in the dplyr
page.
all_words <- all_words %>%
mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
ft_tokenizer()
ft_tokenizer()
uses the Spark API to separate each word.
It creates a new list column with the results.
all_words <- all_words %>%
ft_tokenizer(input.col = "line",
output.col = "word_list")
head(all_words, 4)
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## author line word_list
## <chr> <chr> <list>
## 1 doyle THE RETURN OF SHERLOCK HOLMES <list [5]>
## 2 doyle A Collection of Holmes Adventures <list [5]>
## 3 doyle by Sir Arthur Conan Doyle <list [5]>
## 4 doyle CONTENTS <list [1]>
ft_stop_words_remover()
ft_stop_words_remover()
is a new function that, as its name suggests, takes care of removing stop words from the previous transformation.
It expects a list column, so it is important to sequence it correctly after a ft_tokenizer()
command.
In the sample results, notice that the new wo_stop_words
column contains fewer items than word_list
.
all_words <- all_words %>%
ft_stop_words_remover(input.col = "word_list",
output.col = "wo_stop_words")
head(all_words, 4)
## # Source: lazy query [?? x 4]
## # Database: spark_connection
## author line word_list wo_stop_words
## <chr> <chr> <list> <list>
## 1 doyle THE RETURN OF SHERLOCK HOLMES <list [5]> <list [3]>
## 2 doyle A Collection of Holmes Adventures <list [5]> <list [3]>
## 3 doyle by Sir Arthur Conan Doyle <list [5]> <list [4]>
## 4 doyle CONTENTS <list [1]> <list [1]>
explode
The Hive UDF explode performs the job of unnesting the tokens into their own row.
Some further filtering and field selection is done to reduce the size of the dataset.
all_words <- all_words %>%
mutate(word = explode(wo_stop_words)) %>%
select(word, author) %>%
filter(nchar(word) > 2)
head(all_words, 4)
## # Source: lazy query [?? x 2]
## # Database: spark_connection
## word author
## <chr> <chr>
## 1 return doyle
## 2 sherlock doyle
## 3 holmes doyle
## 4 collection doyle
compute()
compute()
will execute this transformation and cache the results in Spark memory.
It is a good idea to pass a name to compute()
to make it easier to identify the table inside the Spark environment.
In this case the name will be all_words.
all_words <- all_words %>%
compute("all_words")
Full code
This is what the code would look like on an actual analysis:
all_words <- doyle %>%
mutate(author = "doyle") %>%
sdf_bind_rows({
twain %>%
mutate(author = "twain")}) %>%
filter(nchar(line) > 0) %>%
mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>%
ft_tokenizer(input.col = "line",
output.col = "word_list") %>%
ft_stop_words_remover(input.col = "word_list",
output.col = "wo_stop_words") %>%
mutate(word = explode(wo_stop_words)) %>%
select(word, author) %>%
filter(nchar(word) > 2) %>%
compute("all_words")
Data Analysis
Words used the most
word_count <- all_words %>%
group_by(author, word) %>%
tally() %>%
arrange(desc(n))
word_count
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## # Groups: author
## # Ordered by: desc(n)
## author word n
## <chr> <chr> <dbl>
## 1 twain one 20028
## 2 doyle upon 16482
## 3 twain would 15735
## 4 doyle one 14534
## 5 doyle said 13716
## 6 twain said 13204
## 7 twain could 11301
## 8 doyle would 11300
## 9 twain time 10502
## 10 doyle man 10478
## # ... with more rows
Words used by Doyle and not Twain
doyle_unique <- filter(word_count, author == "doyle") %>%
anti_join(filter(word_count, author == "twain"), by = "word") %>%
arrange(desc(n)) %>%
compute("doyle_unique")
doyle_unique
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## # Groups: author
## # Ordered by: desc(n), desc(n)
## author word n
## <chr> <chr> <dbl>
## 1 doyle nigel 972
## 2 doyle alleyne 500
## 3 doyle ezra 421
## 4 doyle maude 337
## 5 doyle aylward 336
## 6 doyle catinat 301
## 7 doyle sharkey 281
## 8 doyle lestrade 280
## 9 doyle summerlee 248
## 10 doyle congo 211
## # ... with more rows
doyle_unique %>%
head(100) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9")))
Twain and Sherlock
The word cloud highlighted something interesting.
The word lestrade is listed as one of the words used by Doyle but not Twain.
Lestrade is the last name of a major character in the Sherlock Holmes books.
It makes sense that the word “sherlock” appears considerably more times than “lestrade” in Doyle's books, so why is Sherlock not in the word cloud? Did Mark Twain use the word “sherlock” in his writings?
all_words %>%
filter(author == "twain",
word == "sherlock") %>%
tally()
## # Source: lazy query [?? x 1]
## # Database: spark_connection
## n
## <dbl>
## 1 16
The all_words
table contains 16 instances of the word sherlock in the words used by Twain in his works.
The instr Hive UDF is used to extract the lines that contain that word in the twain
table.
This Hive function can be used instead of base::grep()
or stringr::str_detect()
.
To account for any word capitalization, the lower command will be used in mutate()
to make all of the words in the full text lower case.
instr & lower
Most of these lines are in a short story by Mark Twain called A Double Barrelled Detective Story.
As per the Wikipedia page about this story, this is a satire by Twain on the mystery novel genre, published in 1902.
twain %>%
mutate(line = lower(line)) %>%
filter(instr(line, "sherlock") > 0) %>%
pull(line)
## [1] "late sherlock holmes, and yet discernible by a member of a race charged"
## [2] "sherlock holmes."
## [3] "\"uncle sherlock! the mean luck of it!--that he should come just"
## [4] "another trouble presented itself.
\"uncle sherlock 'll be wanting to talk"
## [5] "flint buckner's cabin in the frosty gloom.
they were sherlock holmes and"
## [6] "\"uncle sherlock's got some work to do, gentlemen, that 'll keep him till"
## [7] "\"by george, he's just a duke, boys! three cheers for sherlock holmes,"
## [8] "he brought sherlock holmes to the billiard-room, which was jammed with"
## [9] "of interest was there--sherlock holmes.
the miners stood silent and"
## [10] "the room; the chair was on it; sherlock holmes, stately, imposing,"
## [11] "\"you have hunted me around the world, sherlock holmes, yet god is my"
## [12] "\"if it's only sherlock holmes that's troubling you, you needn't worry"
## [13] "they sighed; then one said: \"we must bring sherlock holmes.
he can be"
## [14] "i had small desire that sherlock holmes should hang for my deeds, as you"
## [15] "\"my name is sherlock holmes, and i have not been doing anything.\""
## [16] "late sherlock holmes, and yet discernible by a member of a race charged"
spark_disconnect(sc)
Appendix
gutenbergr package
This is an example of how the data for this article was pulled from the Gutenberg site:
library(gutenbergr)
gutenberg_works() %>%
filter(author == "Twain, Mark") %>%
pull(gutenberg_id) %>%
gutenberg_download() %>%
pull(text) %>%
writeLines("mark_twain.txt")
Intro to Spark Streaming with sparklyr
The sparklyr
interface
As stated in the Spark’s official site, Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
Because is part of the Spark API, it is possible to re-use query code that queries the current state of the stream, as well as joining the streaming data with historical data.
Please see Spark’s official documentation for a deeper look into Spark Streaming.
The sparklyr
interface provides the following:
Ability to run dplyr, SQL, spark_apply(), and PipelineModels against a stream
Read in multiple formats: CSV, text, JSON, parquet, Kafka, JDBC, and orc
Write stream results to Spark memory and the following file formats: CSV, text, JSON, parquet, Kafka, JDBC, and orc
An out-of-the box graph visualization to monitor the stream
A new reactiveSpark() function that allows Shiny apps to poll the contents of the stream
Interacting with a stream
A good way to think about how Spark streams update is as a three-stage operation:
Input - Spark reads the data inside a given folder.
The folder is expected to contain multiple data files, with new files being created containing the most current stream data.
Processing - Spark applies the desired operations on top of the data.
These operations could be data manipulations (dplyr
, SQL), data transformations (sdf
operations, PipelineModel predictions), or native R manipulations (spark_apply()
).
Output - The results of processing the input files are saved in a different folder.
As with all of the read and write operations in sparklyr
for Spark Standalone, or in sparklyr
's local mode, the input and output folders are actual OS file system folders.
For Hadoop clusters, these will be folder locations inside the HDFS.
Example 1 - Input/Output
The first intro example is a small script that can be used with a local master.
The result should be that the stream_view()
app shows, live, the number of records processed for each iteration of test data sent to the stream.
library(future)
library(sparklyr)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
stream_generate_test(iterations = 1)
read_folder <- stream_read_csv(sc, "source")
write_output <- stream_write_csv(read_folder, "source-out")
invisible(future(stream_generate_test(interval = 0.5)))
stream_view(write_output)
stream_stop(write_output)
spark_disconnect(sc)
Code breakdown
Open the Spark connection
library(sparklyr)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
Optional step.
This resets the input and output folders.
It makes it easier to run the code multiple times in a clean manner.
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
Produces a single test file inside the “source” folder.
This allows the “read” function to infer CSV file definition.
stream_generate_test(iterations = 1)
list.files("source")
[1] "stream_1.csv"
Points the stream reader to the folder where the streaming files will be placed.
Since it is primed with a single CSV file, it will use that file as the expected layout of subsequent files.
By default, stream_read_csv()
creates a single integer variable data frame.
read_folder <- stream_read_csv(sc, "source")
The output writer is what starts the streaming job.
It will start monitoring the input folder, and then write the new results in the “source-out” folder.
So as new records stream in, new files will be created in the “source-out” folder.
Since there are no operations on the incoming data at this time, the output files will have the same exact raw data as the input files.
The only difference is that the files and sub folders within “source-out” will be structured how Spark structures data folders.
write_output <- stream_write_csv(read_folder, "source-out")
list.files("source-out")
[1] "_spark_metadata" "checkpoint"
[3] "part-00000-1f29719a-2314-40e1-b93d-a647a3d57154-c000.csv"
The test generation function will produce 100 files, one every 0.2 seconds.
To run the tests “out-of-sync” with the current R session, the future
package is used.
library(future)
invisible(future(stream_generate_test(interval = 0.2, iterations = 100)))
The stream_view()
function can be used before the 100 test iterations are complete because of the use of the future
package.
It will monitor the status of the job that write_output
is pointing to and provide information on the amount of data coming into the “source” folder and going out into the “source-out” folder.
stream_view(write_output)
The monitor will continue to run even after the tests are complete.
To end the experiment, stop the Shiny app and then use the following to stop the stream and close the Spark session.
stream_stop(write_output)
spark_disconnect(sc)
Example 2 - Processing
The second example builds on the first.
It adds a processing step that manipulates the input data before saving it to the output folder.
In this case, a new binary field is added indicating if the value from x
is over 400 or not.
This time, run the second code chunk in this example a few times during the stream tests to see the aggregated values change.
library(future)
library(sparklyr)
library(dplyr, warn.conflicts = FALSE)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
stream_generate_test(iterations = 1)
read_folder <- stream_read_csv(sc, "source")
process_stream <- read_folder %>%
mutate(x = as.double(x)) %>%
ft_binarizer(
input_col = "x",
output_col = "over",
threshold = 400
)
write_output <- stream_write_csv(process_stream, "source-out")
invisible(future(stream_generate_test(interval = 0.2, iterations = 100)))
Run this code a few times during the experiment:
spark_read_csv(sc, "stream", "source-out", memory = FALSE) %>%
group_by(over) %>%
tally()
The results would look similar to this.
The n
totals will increase as the experiment progresses.
# Source: lazy query [?? x 2]
# Database: spark_connection
over n
<dbl> <dbl>
1 0 40215
2 1 60006
Clean up after the experiment
stream_stop(write_output)
spark_disconnect(sc)
Code breakdown
The processing starts with the read_folder
variable that contains the input stream.
It coerces the integer field x
to type double.
This is because the next function, ft_binarizer()
does not accept integers.
The binarizer determines if x
is over 400 or not.
This is a good illustration of how dplyr
can help simplify the manipulation needed during the processing stage.
process_stream <- read_folder %>%
mutate(x = as.double(x)) %>%
ft_binarizer(
input_col = "x",
output_col = "over",
threshold = 400
)
The output now needs to write out the processed data instead of the raw input data.
Swap read_folder
with process_stream
.
write_output <- stream_write_csv(process_stream, "source-out")
The “source-out” folder can be treated as a if it was a single table within Spark.
Using spark_read_csv()
, the data can be mapped, but not brought into memory (memory = FALSE
).
This allows the current results to be further analyzed using regular dplyr
commands.
spark_read_csv(sc, "stream", "source-out", memory = FALSE) %>%
group_by(over) %>%
tally()
Example 3 - Aggregate in process and output to memory
Another option is to save the results of the processing into an in-memory Spark table.
Unless intentionally saving it to disk, the table and its data will only exist while the Spark session is active.
The biggest advantage of using Spark memory as the target is that it allows aggregation to happen during processing.
This is an advantage because aggregation is not allowed for any file output, except Kafka, at the input/process stage.
Using example 2 as the base, this example code will perform some aggregations to the current stream input and save only those summarized results into Spark memory:
library(future)
library(sparklyr)
library(dplyr, warn.conflicts = FALSE)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
stream_generate_test(iterations = 1)
read_folder <- stream_read_csv(sc, "source")
process_stream <- read_folder %>%
stream_watermark() %>%
group_by(timestamp) %>%
summarise(
max_x = max(x, na.rm = TRUE),
min_x = min(x, na.rm = TRUE),
count = n()
)
write_output <- stream_write_memory(process_stream, name = "stream")
invisible(future(stream_generate_test()))
Run this command a few times while the experiment is running:
tbl(sc, "stream")
Clean up after the experiment
stream_stop(write_output)
spark_disconnect(sc)
Code breakdown
The stream_watermark()
function adds a new timestamp
variable that is then used in the group_by()
command.
This is required by Spark Stream to accept summarized results as output of the stream.
The second step is to simply decide what kinds of aggregations we need to perform.
In this case, a simple max, min, and count are performed.
process_stream <- read_folder %>%
stream_watermark() %>%
group_by(timestamp) %>%
summarise(
max_x = max(x, na.rm = TRUE),
min_x = min(x, na.rm = TRUE),
count = n()
)
The stream_write_memory()
function is used to write the output to Spark memory.
The results will appear as a table of the Spark session with the name assigned in the name
argument; in this case the name selected is "stream".
write_output <- stream_write_memory(process_stream, name = "stream")
The current data in the "stream" table can be queried using the dplyr
tbl()
command.
tbl(sc, "stream")
Example 4 - Shiny integration
sparklyr
provides a new Shiny function called reactiveSpark()
.
It can take a Spark data frame, in this case the one created as a result of the stream processing, and then create a Spark memory stream table, the same way a table is created in example 3.
library(future)
library(sparklyr)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
stream_generate_test(iterations = 1)
read_folder <- stream_read_csv(sc, "source")
process_stream <- read_folder %>%
stream_watermark() %>%
group_by(timestamp) %>%
summarise(
max_x = max(x, na.rm = TRUE),
min_x = min(x, na.rm = TRUE),
count = n()
)
invisible(future(stream_generate_test(interval = 0.2, iterations = 100)))
library(shiny)
ui <- function(){
tableOutput("table")
}
server <- function(input, output, session){
ps <- reactiveSpark(process_stream)
output$table <- renderTable({
ps() %>%
mutate(timestamp = as.character(timestamp))
})
}
runGadget(ui, server)
Code breakdown
Notice that there is no stream_write_...
command.
The reason is that the reactiveSpark()
function already contains the stream_write_memory()
function.
This very basic Shiny app simply displays the output of a table in the ui
section.
library(shiny)
ui <- function(){
tableOutput("table")
}
In the server
section, the reactiveSpark()
function will update every time there’s a change to the stream and return a data frame.
The results are saved to a variable called ps()
in this script.
Treat the ps()
variable as a regular table that can be piped from, as shown in the example.
In this case, the timestamp
variable is converted to a string to make it easier to read.
server <- function(input, output, session){
ps <- reactiveSpark(process_stream)
output$table <- renderTable({
ps() %>%
mutate(timestamp = as.character(timestamp))
})
}
Use runGadget()
to display the Shiny app in the Viewer pane.
This is optional; the app can be run using the normal Shiny run functions.
runGadget(ui, server)
Example 5 - ML Pipeline Model
This example uses a fitted Pipeline Model to process the input, and saves the predictions to the output.
This approach would be used to apply Machine Learning on top of streaming data.
library(sparklyr)
library(dplyr, warn.conflicts = FALSE)
sc <- spark_connect(master = "local", spark_version = "2.3.0")
if(file.exists("source")) unlink("source", TRUE)
if(file.exists("source-out")) unlink("source-out", TRUE)
df <- data.frame(x = rep(1:1000), y = rep(2:1001))
stream_generate_test(df = df, iteration = 1)
model_sample <- spark_read_csv(sc, "sample", "source")
pipeline <- sc %>%
ml_pipeline() %>%
ft_r_formula(x ~ y) %>%
ml_linear_regression()
fitted_pipeline <- ml_fit(pipeline, model_sample)
ml_stream <- stream_read_csv(
sc = sc,
path = "source",
columns = c(x = "integer", y = "integer")
) %>%
ml_transform(fitted_pipeline, .) %>%
select(- features) %>%
stream_write_csv("source-out")
stream_generate_test(df = df, interval = 0.5)
spark_read_csv(sc, "stream", "source-out", memory = FALSE)
## # Source: spark<stream> [?? x 4]
##        x     y label prediction
##  * <int> <int> <dbl>      <dbl>
##  1   276   277   276       276.
##  2   277   278   277       277.
##  3   278   279   278       278.
##  4   279   280   279       279.
##  5   280   281   280       280.
##  6   281   282   281       281.
##  7   282   283   282       282.
##  8   283   284   283       283.
##  9   284   285   284       284.
## 10   285   286   285       285.
## # ... with more rows
stream_stop(ml_stream)
spark_disconnect(sc)
Code Breakdown
Creates and fits a pipeline
df <- data.frame(x = rep(1:1000), y = rep(2:1001))
stream_generate_test(df = df, iterations = 1)
model_sample <- spark_read_csv(sc, "sample", "source")
pipeline <- sc %>%
ml_pipeline() %>%
ft_r_formula(x ~ y) %>%
ml_linear_regression()
fitted_pipeline <- ml_fit(pipeline, model_sample)
This example pipes the input, processing, and output together in a single code segment.
The ml_transform()
function is used to create the predictions.
Because the CSV format does not support list type fields, the features
column is removed before the results are sent to the output.
ml_stream <- stream_read_csv(
sc = sc,
path = "source",
columns = c(x = "integer", y = "integer")
) %>%
ml_transform(fitted_pipeline, .) %>%
select(- features) %>%
stream_write_csv("source-out")
Using Spark with AWS S3 buckets
AWS Access Keys
AWS Access Keys are needed to access S3 data.
To learn how to set up new keys, please review the AWS documentation: http://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html. We then pass the keys to R via environment variables:
Sys.setenv(AWS_ACCESS_KEY_ID="[Your access key]")
Sys.setenv(AWS_SECRET_ACCESS_KEY="[Your secret access key]")
Connecting to Spark
There are four key settings needed to connect to Spark and use S3:
A Hadoop-AWS package
Executor memory (key but not critical)
The master URL
The Spark Home
To connect to Spark, we first need to initialize a variable with the contents of sparklyr's default config (spark_config()), which we will then customize for our needs:
library(sparklyr)
conf <- spark_config()
Hadoop-AWS package:
A Spark connection can be enhanced by using packages; please note that these are not R packages.
For example, there are packages that tell Spark how to read CSV files, or how to work with Hadoop and Hadoop in AWS.
In order to read S3 buckets, our Spark connection will need a package called hadoop-aws.
If needed, multiple packages can be used; a hypothetical example follows the code below.
We experimented with many combinations of packages, and determined that for reading data in S3 we only need this one.
The version we used, 2.7.3, refers to the latest Hadoop version at the time of writing, so as this article ages, please check this site to ensure that you are using the latest version: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
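If more than one package were needed, a character vector can be assigned to the same setting. The second entry below is only a hypothetical illustration of the syntax, not a package this article requires:
conf$sparklyr.defaultPackages <- c(
  "org.apache.hadoop:hadoop-aws:2.7.3",
  "com.amazonaws:aws-java-sdk-pom:1.10.34"  # hypothetical second package, for illustration only
)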
Executor Memory
As mentioned above, this setting is key but not critical.
There are two points worth highlighting about it:
It is the only performance-related setting in a Spark Stand Alone cluster that can be tweaked, and in most cases, because Spark defaults to a fraction of what is available, we need to increase it by manually passing a value to that setting.
If more than the available RAM is requested, then Spark will set the Cores to 0, thus rendering the session unusable.
conf$spark.executor.memory <- "14g"
Master URL and Spark home
There are three important points to mention when executing the spark_connect command:
The master will be the Spark Master’s URL.
To find the URL, please see the Spark Cluster section.
Point the Spark Home to the location where Spark was installed on this node.
Make sure to pass the conf variable as the value for the config argument.
sc <- spark_connect(master = "spark://ip-172-30-1-5.us-west-2.compute.internal:7077",
spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/",
config = conf)
Data Import/Wrangle approach
We experimented with multiple approaches.
Most of the factors for settling on a recommended approach were made based on the speed of each step.
The premise is that we would rather wait longer during Data Import if it means that we can register and cache our data subsets much faster during Data Wrangling, especially since we would expect to end up with many subsets as we explore and model. The selected combination was the second slowest during the Import stage, but the fastest, by a lot, when caching a subset.
In our tests, it took 72 seconds to read and cache the 29 columns of the 41 million rows of data, the slowest was 77 seconds.
But when it comes to registering and caching a considerably sizable subset of 3 columns and almost all of the 41 million records, this approach was 17X faster than the second fastest approach.
It took 1/3 of a second to register and cache the subset, the second fastest was 5 seconds.
To implement this approach, we need to set three arguments in the spark_read_csv()
step:
memory
infer_schema
columns
Again, this is a recommended approach.
The columns argument is needed only if infer_schema is set to FALSE.
When memory is set to TRUE, Spark loads the entire dataset into memory, and setting infer_schema to FALSE prevents Spark from trying to figure out what the schema of the files is.
By trying different combinations of the memory and infer_schema arguments, you may be able to find an approach that better fits your needs.
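As a sketch of one such alternative, assuming the same S3 path used in the import below, the files can be mapped without caching and with schema inference turned on; this is slower to wrangle later, but convenient for a first look at the column types:
flights_probe <- spark_read_csv(sc, "flights_probe",
                                path = "s3a://flights-data/full",
                                memory = FALSE,        # do not load the data into Spark memory yet
                                infer_schema = TRUE)   # let Spark guess the column types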
The S3 URI scheme
Surprisingly, another critical detail that can easily be overlooked is choosing the right S3 URI scheme.
There are two options: s3n and s3a.
In most examples and tutorials I found, there was no reason given for why or when to use which one.
The article that finally clarified it was this one: https://wiki.apache.org/hadoop/AmazonS3
The gist of it is that s3a is the recommended scheme going forward, especially for Hadoop versions 2.7 and above.
This means that if we copy from older examples that used Hadoop 2.6, we would most likely also use s3n, thus making data import much, much slower.
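For illustration only, using the same hypothetical bucket as the rest of this article, the two schemes differ only in the URI prefix; with the hadoop-aws 2.7.x package the first form is the one to use:
path_s3a <- "s3a://flights-data/full"   # recommended scheme for Hadoop 2.7 and above
path_s3n <- "s3n://flights-data/full"   # older scheme; noticeably slower data imports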
Data Import
After the long introduction in the previous section, there is only one point to add about the following code chunk.
If there are any NA
values in numeric fields, then define the column as character and convert it in later subsets using dplyr.
The data import will fail if it finds any NA values in numeric fields.
This is a small trade-off of this approach; the next fastest one does not have this issue but is 17X slower at caching subsets.
flights <- spark_read_csv(sc, "flights_spark",
path = "s3a://flights-data/full",
memory = TRUE,
columns = list(
Year = "character",
Month = "character",
DayofMonth = "character",
DayOfWeek = "character",
DepTime = "character",
CRSDepTime = "character",
ArrTime = "character",
CRSArrTime = "character",
UniqueCarrier = "character",
FlightNum = "character",
TailNum = "character",
ActualElapsedTime = "character",
CRSElapsedTime = "character",
AirTime = "character",
ArrDelay = "character",
DepDelay = "character",
Origin = "character",
Dest = "character",
Distance = "character",
TaxiIn = "character",
TaxiOut = "character",
Cancelled = "character",
CancellationCode = "character",
Diverted = "character",
CarrierDelay = "character",
WeatherDelay = "character",
NASDelay = "character",
SecurityDelay = "character",
LateAircraftDelay = "character"),
infer_schema = FALSE)
Data Wrangle
There are a few points we need to highlight about the following simple dplyr code:
Because there were NAs in the original fields, we have to mutate them to a number.
Coerce variables to integer instead of numeric where possible; this will save a lot of space when cached to Spark memory.
The sdf_register() command can be piped at the end of the code.
After running the code, a new table will appear in the RStudio IDE's Spark tab.
tidy_flights <- tbl(sc, "flights_spark") %>%
mutate(ArrDelay = as.integer(ArrDelay),
DepDelay = as.integer(DepDelay),
Distance = as.integer(Distance)) %>%
filter(!is.na(ArrDelay)) %>%
select(DepDelay, ArrDelay, Distance) %>%
sdf_register("tidy_spark")
Next, we use tbl_cache() to load the tidy_spark table into Spark memory.
We can see the new table in the Storage page of our Spark session.
tbl_cache(sc, "tidy_spark")
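As a quick sketch of what that buys us, the same dplyr verbs now run against the in-memory tidy_spark table:
tbl(sc, "tidy_spark") %>%
  summarise(mean_arr_delay = mean(ArrDelay, na.rm = TRUE),   # computed inside Spark
            flights = n()) %>%
  collect()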
Using Apache Arrow
Introduction
Apache Arrow is a cross-language development platform for in-memory data.
Arrow is supported starting with sparklyr 1.0.0
to improve performance when transferring data between Spark and R.
You can find some performance benchmarks under:
sparklyr 1.0: Arrow, XGBoost, Broom and TFRecords.
Speeding up R and Apache Spark using Apache Arrow.
Installation
Using Arrow from R requires installing:
The Arrow Runtime: Provides a cross-language runtime library.
The Arrow R Package: Provides support for using Arrow from R through an R package.
Runtime
OS X
Installing on OS X requires Homebrew and executing the following from a terminal:
brew install apache-arrow
Windows
Currently, installing Arrow in Windows requires Conda and executing from a terminal:
conda install arrow-cpp=0.12.* -c conda-forge
conda install pyarrow=0.12.* -c conda-forge
Linux
Please reference arrow.apache.org/install when installing Arrow for Linux.
Package
As of this writing, the arrow R package is not yet available on CRAN; however, this package can be installed using the remotes package.
First, install remotes
:
install.packages("remotes")
Then install the R package from github as follows:
remotes::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.12.0")
If you happen to have Arrow 0.11 installed, you will have to install the R package from the matching commit instead:
remotes::install_github("apache/arrow", subdir = "r", ref = "dc5df8f")
Use Cases
There are three main use cases for arrow
in sparklyr
:
Data Copying: When copying data with copy_to(), Arrow will be used.
Data Collection: When collecting data, either implicitly by printing datasets or explicitly by calling collect(), Arrow will be used.
R Transformations: When using spark_apply(), data will be transferred using Arrow when possible.
To use arrow in sparklyr, one simply needs to attach the package:
library(arrow)
Attaching package: ‘arrow’
The following object is masked from ‘package:utils’:
timestamp
The following objects are masked from ‘package:base’:
array, table
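A minimal sketch of the three use cases, assuming a local connection: once arrow is attached, the same copy_to(), spark_apply(), and collect() calls are used, and the transfer happens over Arrow without any extra arguments:
library(arrow)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)       # data copied to Spark via Arrow
cars_tbl %>%
  spark_apply(function(df) df[df$mpg > 20, ]) %>%       # R function applied on the workers
  collect()                                             # results collected back via Arrow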
Considerations
Types
Some data types are mapped to slightly different (one could argue more correct) types when using Arrow.
For instance, consider collecting 64 bit integers in sparklyr
:
library(sparklyr)
sc <- spark_connect(master = "local")
integer64 <- sdf_len(sc, 2, type = "integer64")
integer64
# Source: spark<?> [?? x 1]
     id
  <dbl>
1     1
2     2
Notice that sparklyr
collects 64 bit integers as double
; however, using arrow
:
library(arrow)
integer64
# Source: spark<?> [?? x 1]
               id
  <S3: integer64>
1               1
2               2
64 bit integers are now being collected as proper 64 bit integers using the bit64 package.
Fallback
The Arrow R package supports many data types; however, in cases where a type is unsupported, sparklyr
will fall back to not using Arrow and print a warning.
library(sparklyr.nested)
library(sparklyr)
library(dplyr)
library(arrow)
sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars)
sdf_nest(cars, hp) %>%
group_by(cyl) %>%
summarize(data = collect_list(data))
# Source: spark<?> [?? x 2]
cyl data
<dbl> <list>
1 6 <list [7]>
2 4 <list [11]>
3 8 <list [14]>
Warning message:
In arrow_enabled_object.spark_jobj(sdf) :
Arrow disabled due to columns: data
Creating Extensions for sparklyr
Introduction
The sparklyr package provides a dplyr interface to Spark DataFrames as well as an R interface to Spark’s distributed machine learning pipelines.
However, since Spark is a general-purpose cluster computing system there are many other R interfaces that could be built (e.g. interfaces to custom machine learning pipelines, interfaces to 3rd party Spark packages, etc.).
The facilities used internally by sparklyr for its dplyr and machine learning interfaces are available to extension packages.
This guide describes how you can use these tools to create your own custom R interfaces to Spark.
Examples
Here’s an example of an extension function that calls the text file line counting function available via the SparkContext:
library(sparklyr)
count_lines <- function(sc, file) {
spark_context(sc) %>%
invoke("textFile", file, 1L) %>%
invoke("count")
}
The count_lines
function takes a spark_connection
(sc
) argument which enables it to obtain a reference to the SparkContext
object, and in turn call the textFile().count()
method.
You can use this function with an existing sparklyr connection as follows:
library(sparklyr)
sc <- spark_connect(master = "local")
count_lines(sc, "hdfs://path/data.csv")
Here are links to some additional examples of extension packages:
spark.sas7bdat | Read in SAS data in parallel into Apache Spark.
rsparkling | Extension for using H2O machine learning algorithms against Spark Data Frames.
sparkhello | Simple example of including a custom JAR file within an extension package.
rddlist | Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
sparkwarc | Load WARC files into Apache Spark with sparklyr.
sparkavro | Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro.
crassy | Connect to Cassandra with sparklyr using the Spark-Cassandra-Connector.
sparklygraphs | R interface for GraphFrames which aims to provide the functionality of GraphX.
sparklyr.nested | Extension for working with nested data.
sparklyudf | Simple example registering a Scala UDF within an extension package.
Core Types
Three classes are defined for representing the fundamental types of the R to Java bridge: spark_connection, spark_jobj, and spark_dataframe.
S3 methods are defined for each of these classes so they can be easily converted to or from objects that contain or wrap them.
Note that for any given spark_jobj it's possible to discover the underlying spark_connection.
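A small sketch of that idea, assuming an existing connection sc: any spark_jobj carries its connection, which spark_connection() can recover for further invoke() calls:
big_int <- invoke_new(sc, "java.math.BigInteger", "42")  # a spark_jobj
sc_from_jobj <- spark_connection(big_int)                # the connection the jobj was created on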
Calling Spark from R
There are several functions available for calling the methods of Java objects and static methods of Java classes: invoke(), invoke_new(), and invoke_static().
For example, to create a new instance of the java.math.BigInteger
class and then call the longValue()
method on it you would use code like this:
billionBigInteger <- invoke_new(sc, "java.math.BigInteger", "1000000000")
billion <- invoke(billionBigInteger, "longValue")
Note the sc
argument: that’s the spark_connection
object which is provided by the front-end package (e.g. sparklyr).
The previous example can be re-written to be more compact and clear using magrittr pipes:
billion <- sc %>%
invoke_new("java.math.BigInteger", "1000000000") %>%
invoke("longValue")
This syntax is similar to the method-chaining syntax often used with Scala code so is generally preferred.
Calling a static method of a class is also straightforward.
For example, to call the Math::hypot()
static function you would use this code:
hypot <- sc %>%
invoke_static("java.lang.Math", "hypot", 10, 20)
Wrapper Functions
Creating an extension typically consists of writing R wrapper functions for a set of Spark services.
In this section we’ll describe the typical form of these functions as well as how to handle special types like Spark DataFrames.
Here’s the wrapper function for textFile().count()
which we defined earlier:
count_lines <- function(sc, file) {
spark_context(sc) %>%
invoke("textFile", file, 1L) %>%
invoke("count")
}
The count_lines
function takes a spark_connection
(sc
) argument which enables it to obtain a reference to the SparkContext
object, and in turn call the textFile().count()
method.
The following functions are useful for implementing wrapper functions of various kinds:
spark_connection | Get the Spark connection associated with an object (S3)
spark_jobj | Get the Spark jobj associated with an object (S3)
spark_dataframe | Get the Spark DataFrame associated with an object (S3)
spark_context | Get the SparkContext for a spark_connection
hive_context | Get the HiveContext for a spark_connection
spark_version | Get the version of Spark (as a numeric_version) for a spark_connection
The use of these functions is illustrated in this simple example:
analyze <- function(x, features) {
# normalize whatever we were passed (e.g. a dplyr tbl) into a DataFrame
df <- spark_dataframe(x)
# get the underlying connection so we can create new objects
sc <- spark_connection(df)
# create an object to do the analysis and call its `analyze` and `summary`
# methods (note that the df and features are passed to the analyze function)
summary <- sc %>%
invoke_new("com.example.tools.Analyzer") %>%
invoke("analyze", df, features) %>%
invoke("summary")
# return the results
summary
}
The first argument is an object that can be accessed using the Spark DataFrame API (this might be an actual reference to a DataFrame or could rather be a dplyr tbl
which has a DataFrame reference inside it).
After using the spark_dataframe
function to normalize the reference, we extract the underlying Spark connection associated with the data frame using the spark_connection
function.
Finally, we create a new Analyzer
object, call it’s analyze
method with the DataFrame and list of features, and then call the summary
method on the results of the analysis.
Accepting a spark_jobj
or spark_dataframe
as the first argument of a function makes it very easy to incorporate into magrittr pipelines so this pattern is highly recommended when possible.
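For example, here is a minimal sketch of that pattern: count_rows() below is a hypothetical helper (not part of sparklyr) whose first argument is the data object, so it slots naturally at the end of a pipeline:
library(sparklyr)
library(dplyr)
count_rows <- function(x) {
  spark_dataframe(x) %>%   # normalize a tbl or DataFrame reference
    invoke("count")        # call the DataFrame's count() method
}
# usage, assuming an existing connection sc:
# copy_to(sc, mtcars) %>% filter(cyl == 8) %>% count_rows()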
Dependencies
When creating R packages which implement interfaces to Spark you may need to include additional dependencies.
Your dependencies might be a set of Spark Packages or might be a custom JAR file.
In either case, you’ll need a way to specify that these dependencies should be included during the initialization of a Spark session.
A Spark dependency is defined using the spark_dependency
function:
spark_dependency | Define a Spark dependency consisting of JAR files and Spark packages
Your extension package can specify its dependencies by implementing a function named spark_dependencies
within the package (this function should not be publicly exported).
For example, let’s say you were creating an extension package named sparkds that needs to include a custom JAR as well as the Redshift and Apache Avro packages:
spark_dependencies <- function(spark_version, scala_version, ...) {
spark_dependency(
jars = c(
system.file(
sprintf("java/sparkds-%s-%s.jar", spark_version, scala_version),
package = "sparkds"
)
),
packages = c(
sprintf("com.databricks:spark-redshift_%s:0.6.0", scala_version),
sprintf("com.databricks:spark-avro_%s:2.0.1", scala_version)
)
)
}
.onLoad <- function(libname, pkgname) {
sparklyr::register_extension(pkgname)
}
The spark_version
argument is provided so that a package can support multiple Spark versions for its JARs.
Note that the argument will include just the major and minor versions (e.g. 1.6
or 2.0
) and will not include the patch level (as JARs built for a given major/minor version are expected to work for all patch levels).
The scala_version
argument is provided so that a single package can support multiple Scala compiler versions for its JARs and packages (currently Spark 1.6 downloadable binaries are compiled with Scala 2.10 and Spark 2.0 downloadable binaries are compiled with Scala 2.11).
The ...
argument is unused but nevertheless should be included to ensure compatibility if new arguments are added to spark_dependencies
in the future.
The .onLoad
function registers your extension package so that its spark_dependencies
function will be automatically called when new connections to Spark are made via spark_connect
:
library(sparklyr)
library(sparkds)
sc <- spark_connect(master = "local")
Compiling JARs
The sparklyr package includes a utility function (compile_package_jars
) that will automatically compile a JAR file from your Scala source code for the required permutations of Spark and Scala compiler versions.
To use the function just invoke it from the root directory of your R package as follows:
sparklyr::compile_package_jars()
Note that a prerequisite to calling compile_package_jars
is the installation of the Scala 2.10 and 2.11 compilers to one of the following paths:
/opt/scala
/opt/local/scala
/usr/local/scala
~/scala (Windows-only)
See the sparkhello repository for a complete example of including a custom JAR within an extension package.
CRAN
When including a JAR file within an R package distributed on CRAN, you should follow the guidelines provided in Writing R Extensions:
Java code is a special case: except for very small programs, .java files should be byte-compiled (to a .class file) and distributed as part of a .jar file: the conventional location for the .jar file(s) is inst/java
.
It is desirable (and required under an Open Source license) to make the Java source files available: this is best done in a top-level java
directory in the package – the source files should not be installed.
Data Types
The ensure_*
family of functions can be used to enforce specific data types that are passed to a Spark routine.
For example, Spark routines that require an integer will not accept an R numeric element.
Use these functions to ensure certain parameters are scalar integers, scalar doubles, and so on; a brief sketch follows the list below.
ensure_scalar_integer
ensure_scalar_double
ensure_scalar_boolean
ensure_scalar_character
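Here is a brief sketch using a hypothetical wrapper: the number of partitions supplied by the caller is validated (and coerced if safe) before it is handed to a Java method that expects an Int:
library(sparklyr)
count_lines_checked <- function(sc, file, partitions = 1) {
  partitions <- ensure_scalar_integer(partitions)  # errors unless a single integer-like value
  spark_context(sc) %>%
    invoke("textFile", file, partitions) %>%
    invoke("count")
}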
In order to match the correct data types while calling Scala code from R, or retrieving results from Scala back to R, consider the following types mapping:
From R | Scala | To R
NULL | void | NULL
integer | Int | integer
character | String | character
logical | Boolean | logical
double | Double | double
numeric | Double | double
 | Float | double
 | Decimal | double
 | Long | double
raw | Array[Byte] | raw
Date | Date | Date
POSIXlt | Time |
POSIXct | Time | POSIXct
list | Array[T] | list
environment | Map[String, T] |
jobj | Object | jobj
Compiling
Most Spark extensions won’t need to define their own compilation specification, and can instead rely on the default behavior of compile_package_jars
.
For users who would like to take more control over where the scalac compilers should be looked up, use the spark_compilation_spec
function.
The Spark compilation specification is used when compiling Spark extension Java Archives, and defines which versions of Spark, as well as which versions of Scala, should be used for compilation.
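As a hedged sketch of that option, with the argument names taken from the spark_compilation_spec() help page and the paths and JAR name purely hypothetical, a single Spark/Scala pairing can be compiled instead of the default matrix:
spec <- sparklyr::spark_compilation_spec(
  spark_version = "2.4.0",                         # assumed target Spark version
  scalac_path   = "/usr/local/scala/bin/scalac",   # assumed scalac location
  jar_name      = "sparkds-2.4-2.11.jar"           # hypothetical output JAR name
)
sparklyr::compile_package_jars(spec = spec)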
Sparkling Water (H2O) Machine Learning
Overview
The rsparkling extension package provides bindings to H2O's distributed machine learning algorithms via sparklyr.
In particular, rsparkling allows you to access the machine learning routines provided by the Sparkling Water Spark package.
Together with sparklyr's dplyr interface, you can easily create and tune H2O machine learning workflows on Spark, orchestrated entirely within R.
rsparkling provides a few simple conversion functions that allow the user to transfer data between Spark DataFrames and H2O Frames.
Once the Spark DataFrames are available as H2O Frames, the h2o R interface can be used to train H2O machine learning algorithms on the data.
A typical machine learning pipeline with rsparkling might be composed of the following stages.
To fit a model, you might need to:
Perform SQL queries through the sparklyr dplyr interface,
Use the sdf_*
and ft_*
family of functions to generate new columns, or partition your data set,
Convert your training, validation and/or test data frames into H2O Frames using the as_h2o_frame
function,
Choose an appropriate H2O machine learning algorithm to model your data,
Inspect the quality of your model fit, and use it to make predictions with new data.
Installation
You can install the rsparkling package from CRAN as follows:
install.packages("rsparkling")
Then set the Sparkling Water version for rsparkling:
options(rsparkling.sparklingwater.version = "2.1.14")
For Spark 2.0.x
set rsparkling.sparklingwater.version
to 2.0.3
instead, for Spark 1.6.2
use 1.6.8
.
Using H2O
Now let's walk through a simple example to demonstrate the use of H2O's machine learning algorithms within R.
We'll use h2o.glm to fit a linear regression model.
Using the built-in mtcars
dataset, we'll try to predict a car's fuel consumption (mpg
) based on its weight (wt
), and the number of cylinders the engine contains (cyl
).
First, we will initialize a local Spark connection, and copy the mtcars
dataset into Spark.
library(rsparkling)
library(sparklyr)
library(h2o)
library(dplyr)
sc <- spark_connect("local", version = "2.1.0")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
Now, let's perform some simple transformations – we'll
Remove all cars with horsepower less than 100,
Produce a column encoding whether a car has 8 cylinders or not,
Partition the data into separate training and test data sets,
Fit a model to our training data set,
Evaluate our predictive performance on our test dataset.
# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
filter(hp >= 100) %>%
mutate(cyl8 = cyl == 8) %>%
sdf_partition(training = 0.5, test = 0.5, seed = 1099)
Now, we convert our training and test sets into H2O Frames using rsparkling conversion functions.
We have already split the data into training and test frames using dplyr.
training <- as_h2o_frame(sc, partitions$training, strict_version_check = FALSE)
test <- as_h2o_frame(sc, partitions$test, strict_version_check = FALSE)
Alternatively, we can use the h2o.splitFrame()
function instead of sdf_partition()
to partition the data within H2O instead of Spark (e.g. partitions <- h2o.splitFrame(as_h2o_frame(mtcars_tbl), 0.5)
)
# fit a linear model to the training dataset
glm_model <- h2o.glm(x = c("wt", "cyl"),
y = "mpg",
training_frame = training,
lambda_search = TRUE)
For linear regression models produced by H2O, we can use either print()
or summary()
to learn a bit more about the quality of our fit.
The summary()
method returns some extra information about scoring history and variable importance.
glm_model
## Model Details:
## ==============
##
## H2ORegressionModel: glm
## Model ID: GLM_model_R_1510348062048_1
## GLM Model: summary
## family link regularization
## 1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.05468 )
## lambda_search
## 1 nlambda = 100, lambda.max = 5.4682, lambda.min = 0.05468, lambda.1se = -1.0
## number_of_predictors_total number_of_active_predictors
## 1 2 2
## number_of_iterations training_frame
## 1 100 frame_rdd_32_929e407384e0082416acd4c9897144a0
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept 32.997281 16.625000
## 2 cyl -0.906688 -1.349195
## 3 wt -2.712562 -2.282649
##
## H2ORegressionMetrics: glm
## ** Reported on training data. **
##
## MSE: 2.03293
## RMSE: 1.425808
## MAE: 1.306314
## RMSLE: 0.08238032
## Mean Residual Deviance : 2.03293
## R^2 : 0.8265696
## Null Deviance :93.775
## Null D.o.F. :7
## Residual Deviance :16.26344
## Residual D.o.F. :5
## AIC :36.37884
The output suggests that our model is a fairly good fit, and that both a car's weight, as well as the number of cylinders in its engine, are powerful predictors of its average fuel consumption.
(The model suggests that, on average, heavier cars consume more fuel.)
Let's use our H2O model fit to predict the average fuel consumption on our test data set, and compare the predicted response with the true measured fuel consumption.
We'll build a simple ggplot2 plot that will allow us to inspect the quality of our predictions.
library(ggplot2)
# compute predicted values on our test dataset
pred <- h2o.predict(glm_model, newdata = test)
# convert from H2O Frame to Spark DataFrame
predicted <- as_spark_dataframe(sc, pred, strict_version_check = FALSE)
# extract the true 'mpg' values from our test dataset
actual <- partitions$test %>%
select(mpg) %>%
collect() %>%
`[[`("mpg")
# produce a data.frame housing our predicted + actual 'mpg' values
data <- data.frame(
predicted = predicted,
actual = actual
)
# a bug in data.frame does not set colnames properly; reset here
names(data) <- c("predicted", "actual")
# plot predicted vs. actual values
ggplot(data, aes(x = actual, y = predicted)) +
geom_abline(lty = "dashed", col = "red") +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
coord_fixed(ratio = 1) +
labs(
x = "Actual Fuel Consumption",
y = "Predicted Fuel Consumption",
title = "Predicted vs. Actual Fuel Consumption"
)
Although simple, our model appears to do a fairly good job of predicting a car's average fuel consumption.
As you can see, we can easily and effectively combine dplyr data transformation pipelines with the machine learning algorithms provided by H2O's Sparkling Water.
Algorithms
Once the H2OContext
is made available to Spark (as demonstrated below), all of the functions in the standard h2o R interface can be used with H2O Frames (converted from Spark DataFrames).
Here is a table of the available algorithms:
Additionally, the h2oEnsemble R package can be used to generate Super Learner ensembles of H2O algorithms:
A model is often fit not on a dataset as-is, but instead on some transformation of that dataset.
Spark provides feature transformers, facilitating many common transformations of data within a Spark DataFrame, and sparklyr exposes these within the ft_*
family of functions.
Transformers can be used on Spark DataFrames, and the final training set can be sent to the H2O cluster for machine learning; a short sketch of this pattern follows the table below.
ft_binarizer | Threshold numerical features to binary (0/1) feature
ft_bucketizer | Bucketizer transforms a column of continuous features to a column of feature buckets
ft_discrete_cosine_transform | Transforms a length N real-valued sequence in the time domain into another length N real-valued sequence in the frequency domain
ft_elementwise_product | Multiplies each input vector by a provided weight vector, using element-wise multiplication
ft_index_to_string | Maps a column of label indices back to a column containing the original labels as strings
ft_quantile_discretizer | Takes a column with continuous features and outputs a column with binned categorical features
ft_sql_transformer | Implements the transformations which are defined by a SQL statement
ft_string_indexer | Encodes a string column of labels to a column of label indices
ft_vector_assembler | Combines a given list of columns into a single vector column
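As a short sketch of that pattern (not from the original article), a Spark feature transformer is applied to the Spark DataFrame first, and only the transformed result is converted into an H2O Frame for modeling:
binarized_tbl <- mtcars_tbl %>%
  ft_binarizer("hp", "big_hp", threshold = 100)    # new 0/1 column created in Spark
binarized_hf <- as_h2o_frame(sc, binarized_tbl, strict_version_check = FALSE)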
Examples
We will use the iris
data set to examine a handful of learning algorithms and transformers.
The iris data set measures attributes for 150 flowers in 3 different species of iris.
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
iris_tbl
## # Source: table<iris> [?? x 5]
## # Database: spark_connection
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with more rows
Convert to an H2O Frame:
iris_hf <- as_h2o_frame(sc, iris_tbl, strict_version_check = FALSE)
K-Means Clustering
Use H2O's K-means clustering to partition a dataset into groups.
K-means clustering partitions points into k
groups, such that the sum of squares from points to the assigned cluster centers is minimized.
kmeans_model <- h2o.kmeans(training_frame = iris_hf,
x = 3:4,
k = 3,
seed = 1)
To look at particular metrics of the K-means model, we can use h2o.centroid_stats()
and h2o.centers()
or simply print out all the model metrics using print(kmeans_model)
.
# print the cluster centers
h2o.centers(kmeans_model)
## petal_length petal_width
## 1 1.462000 0.24600
## 2 5.566667 2.05625
## 3 4.296154 1.32500
# print the centroid statistics
h2o.centroid_stats(kmeans_model)
## Centroid Statistics:
## centroid size within_cluster_sum_of_squares
## 1 1 50.00000 1.41087
## 2 2 48.00000 9.29317
## 3 3 52.00000 7.20274
PCA
Use H2O's Principal Components Analysis (PCA) to perform dimensionality reduction.
PCA is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible.
pca_model <- h2o.prcomp(training_frame = iris_hf,
x = 1:4,
k = 4,
seed = 1)
## Warning in doTryCatch(return(expr), name, parentenv, handler): _train:
## Dataset used may contain fewer number of rows due to removal of rows with
## NA/missing values.
If this is not desirable, set impute_missing argument in
## pca call to TRUE/True/true/...
depending on the client language.
pca_model
## Model Details:
## ==============
##
## H2ODimReductionModel: pca
## Model ID: PCA_model_R_1510348062048_3
## Importance of components:
## pc1 pc2 pc3 pc4
## Standard deviation 7.861342 1.455041 0.283531 0.154411
## Proportion of Variance 0.965303 0.033069 0.001256 0.000372
## Cumulative Proportion 0.965303 0.998372 0.999628 1.000000
##
##
## H2ODimReductionMetrics: pca
##
## No model metrics available for PCA
Random Forest
Use H2O's Random Forest to perform regression or classification on a dataset.
We will continue to use the iris dataset as an example for this problem.
As usual, we define the response and predictor variables using the x
and y
arguments.
Since we'd like to do a classification, we need to ensure that the response column is encoded as a factor (enum) column.
y <- "Species"
x <- setdiff(names(iris_hf), y)
iris_hf[,y] <- as.factor(iris_hf[,y])
We can split the iris_hf
H2O Frame into a train and test set (the split defaults to 75⁄25 train/test).
splits <- h2o.splitFrame(iris_hf, seed = 1)
Then we can train a Random Forest model:
rf_model <- h2o.randomForest(x = x,
y = y,
training_frame = splits[[1]],
validation_frame = splits[[2]],
nbins = 32,
max_depth = 5,
ntrees = 20,
seed = 1)
Since we passed a validation frame, the validation metrics will be calculated.
We can retrieve individual metrics using functions such as h2o.mse(rf_model, valid = TRUE)
.
The confusion matrix can be printed using the following:
h2o.confusionMatrix(rf_model, valid = TRUE)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## setosa versicolor virginica Error Rate
## setosa 7 0 0 0.0000 = 0 / 7
## versicolor 0 13 0 0.0000 = 0 / 13
## virginica 0 1 10 0.0909 = 1 / 11
## Totals 7 14 10 0.0323 = 1 / 31
To view the variable importance computed from an H2O model, you can use either the h2o.varimp()
or h2o.varimp_plot()
functions:
h2o.varimp_plot(rf_model)
Gradient Boosting Machine
The Gradient Boosting Machine (GBM) is one of H2O's most popular algorithms, as it works well on many types of data.
We will continue to use the iris dataset as an example for this problem.
Using the same dataset and x
and y
from above, we can train a GBM:
gbm_model <- h2o.gbm(x = x,
y = y,
training_frame = splits[[1]],
validation_frame = splits[[2]],
ntrees = 20,
max_depth = 3,
learn_rate = 0.01,
col_sample_rate = 0.7,
seed = 1)
Since this is a multi-class problem, we may be interested in inspecting the confusion matrix on a hold-out set.
Since we passed along a validation_frame at train time, the validation metrics are already computed and we just need to retrieve them from the model object.
h2o.confusionMatrix(gbm_model, valid = TRUE)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## setosa versicolor virginica Error Rate
## setosa 7 0 0 0.0000 = 0 / 7
## versicolor 0 13 0 0.0000 = 0 / 13
## virginica 0 1 10 0.0909 = 1 / 11
## Totals 7 14 10 0.0323 = 1 / 31
Deep Learning
Use H2O's Deep Learning to perform regression or classification on a dataset, extract non-linear features generated by the deep neural network, and/or detect anomalies using a deep learning model with auto-encoding.
In this example, we will use the prostate
dataset available within the h2o package:
path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate_df <- spark_read_csv(sc, "prostate", path)
head(prostate_df)
## # Source: lazy query [?? x 9]
## # Database: spark_connection
## ID CAPSULE AGE RACE DPROS DCAPS PSA VOL GLEASON
## <int> <int> <int> <int> <int> <int> <dbl> <dbl> <int>
## 1 1 0 65 1 2 1 1.4 0.0 6
## 2 2 0 72 1 3 2 6.7 0.0 7
## 3 3 0 70 1 1 2 4.9 0.0 6
## 4 4 0 76 2 2 1 51.2 20.0 7
## 5 5 0 69 1 1 1 12.3 55.9 6
## 6 6 1 71 1 3 2 3.3 0.0 8
Once we've done whatever data manipulation is required to run our model, we get a reference to it as an H2O Frame and then split it into training and test sets using the h2o.splitFrame() function:
prostate_hf <- as_h2o_frame(sc, prostate_df, strict_version_check = FALSE)
splits <- h2o.splitFrame(prostate_hf, seed = 1)
Next we define the response and predictor columns.
y <- "VOL"
#remove response and ID cols
x <- setdiff(names(prostate_hf), c("ID", y))
Now we can train a deep neural net.
dl_fit <- h2o.deeplearning(x = x, y = y,
training_frame = splits[[1]],
epochs = 15,
activation = "Rectifier",
hidden = c(10, 5, 10),
input_dropout_ratio = 0.7)
Evaluate performance on a test set:
h2o.performance(dl_fit, newdata = splits[[2]])
## H2ORegressionMetrics: deeplearning
##
## MSE: 253.7022
## RMSE: 15.92803
## MAE: 12.90077
## RMSLE: 1.885052
## Mean Residual Deviance : 253.7022
Note that the above metrics are not reproducible when H2O's Deep Learning is run on multiple cores, however, the metrics should be fairly stable across repeat runs.
Grid Search
H2O's grid search capabilities currently support traditional (Cartesian) grid search and random grid search.
Grid search in R provides the following capabilities:
H2OGrid class: Represents the results of the grid search.
h2o.getGrid(<grid_id>, sort_by, decreasing): Displays the specified grid.
h2o.grid: Starts a new grid search parameterized by:
model builder name (e.g., algorithm = "gbm")
model parameters (e.g., ntrees = 100)
hyper_parameters: attribute for passing a list of hyper parameters (e.g., list(ntrees = c(1, 100), learn_rate = c(0.1, 0.001)))
search_criteria: optional attribute for specifying a more advanced search strategy
Cartesian Grid Search
By default, h2o.grid()
will train a Cartesian grid search – meaning, all possible models in the specified grid.
In this example, we will re-use the prostate data as an example dataset for a regression problem.
splits <- h2o.splitFrame(prostate_hf, seed = 1)
y <- "VOL"
#remove response and ID cols
x <- setdiff(names(prostate_hf), c("ID", y))
After prepping the data, we define a grid and execute the grid search.
# GBM hyperparameters
gbm_params1 <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0),
col_sample_rate = c(0.2, 0.5, 1.0))
# Train and validate a grid of GBMs
gbm_grid1 <- h2o.grid("gbm", x = x, y = y,
grid_id = "gbm_grid1",
training_frame = splits[[1]],
validation_frame = splits[[1]],
ntrees = 100,
seed = 1,
hyper_params = gbm_params1)
# Get the grid results, sorted by validation MSE
gbm_gridperf1 <- h2o.getGrid(grid_id = "gbm_grid1",
sort_by = "mse",
decreasing = FALSE)
gbm_gridperf1
## H2O Grid Details
## ================
##
## Grid ID: gbm_grid1
## Used hyper parameters:
## - col_sample_rate
## - learn_rate
## - max_depth
## - sample_rate
## Number of models: 36
## Number of failed models: 0
##
## Hyper-Parameter Search Summary: ordered by increasing mse
## col_sample_rate learn_rate max_depth sample_rate model_ids
## 1 1.0 0.1 9 1.0 gbm_grid1_model_35
## 2 0.5 0.1 9 1.0 gbm_grid1_model_34
## 3 1.0 0.1 9 0.8 gbm_grid1_model_17
## 4 0.5 0.1 9 0.8 gbm_grid1_model_16
## 5 1.0 0.1 5 0.8 gbm_grid1_model_11
## mse
## 1 88.10947523138782
## 2 102.3118989994892
## 3 102.78632321923726
## 4 126.4217260351778
## 5 149.6066650109763
##
## ---
## col_sample_rate learn_rate max_depth sample_rate model_ids
## 31 0.5 0.01 3 0.8 gbm_grid1_model_1
## 32 0.2 0.01 5 1.0 gbm_grid1_model_24
## 33 0.5 0.01 3 1.0 gbm_grid1_model_19
## 34 0.2 0.01 5 0.8 gbm_grid1_model_6
## 35 0.2 0.01 3 1.0 gbm_grid1_model_18
## 36 0.2 0.01 3 0.8 gbm_grid1_model_0
## mse
## 31 324.8117304723162
## 32 325.10992525687294
## 33 325.27898443785045
## 34 329.36983845305735
## 35 338.54411936919456
## 36 339.7744828617712
Random Grid Search
H2O's Random Grid Search samples from the given parameter space until a set of constraints is met.
The user can specify the total number of desired models (e.g. max_models = 40), the maximum amount of time (e.g. max_runtime_secs = 1000), or tell the grid to stop after performance stops improving by a specified amount.
Random Grid Search is a practical way to arrive at a good model without too much effort.
The example below is set to run fairly quickly – increase max_runtime_secs
or max_models
to cover more of the hyperparameter space in your grid search.
Also, you can expand the hyperparameter space of each of the algorithms by modifying the definition of hyper_param
below.
# GBM hyperparameters
gbm_params2 <- list(learn_rate = seq(0.01, 0.1, 0.01),
max_depth = seq(2, 10, 1),
sample_rate = seq(0.5, 1.0, 0.1),
col_sample_rate = seq(0.1, 1.0, 0.1))
search_criteria2 <- list(strategy = "RandomDiscrete",
max_models = 50)
# Train and validate a grid of GBMs
gbm_grid2 <- h2o.grid("gbm", x = x, y = y,
grid_id = "gbm_grid2",
training_frame = splits[[1]],
validation_frame = splits[[2]],
ntrees = 100,
seed = 1,
hyper_params = gbm_params2,
search_criteria = search_criteria2)
# Get the grid results, sorted by validation MSE
gbm_gridperf2 <- h2o.getGrid(grid_id = "gbm_grid2",
sort_by = "mse",
decreasing = FALSE)
To get the best model, as measured by validation MSE, we simply grab the first row of the gbm_gridperf2@summary_table
object, since this table is already sorted such that the lowest MSE model is on top.
gbm_gridperf2@summary_table[1,]
## Hyper-Parameter Search Summary: ordered by increasing mse
## col_sample_rate learn_rate max_depth sample_rate model_ids
## 1 0.8 0.01 2 0.7 gbm_grid2_model_35
## mse
## 1 244.61196951586288
In the examples above, we generated two different grids, specified by grid_id
.
The first grid was called grid_id = "gbm_grid1"
and the second was called grid_id = "gbm_grid2"
.
However, if we are using the same dataset & algorithm in two grid searches, it probably makes more sense just to add the results of the second grid search to the first.
If you want to add models to an existing grid, rather than create a new one, you simply re-use the same grid_id
.
Exporting Models
There are two ways of exporting models from H2O – saving models as a binary file, or saving models as pure Java code.
Binary Models
The more traditional method is to save a binary model file to disk using the h2o.saveModel()
function.
To load the models using h2o.loadModel()
, the same version of H2O that generated the models is required.
This method is commonly used when H2O is being used in a non-production setting.
A binary model can be saved as follows:
h2o.saveModel(my_model, path = "/Users/me/h2omodels")
Java (POJO) Models
One of the most valuable features of H2O is its ability to export models as pure Java code, or rather, a "Plain Old Java Object" (POJO).
You can learn more about H2O POJO models in this POJO quickstart guide.
The POJO method is used most commonly when a model is deployed in a production setting.
POJO models are ideal for when you need very fast prediction response times, and minimal requirements – the POJO is a standalone Java class with no dependencies on the full H2O stack.
To generate the POJO for your model, use the following command:
h2o.download_pojo(my_model, path = "/Users/me/h2omodels")
Finally, disconnect with:
spark_disconnect_all()
## [1] 1
You can learn more about how to take H2O models to production in the productionizing H2O models section of the H2O docs.
Additional Resources
Main documentation site for Sparkling Water (and all H2O software projects)
H2O.ai website
If you are new to H2O for machine learning, we recommend you start with the Intro to H2O Tutorial, followed by the H2O Grid Search & Model Selection Tutorial.
There are a number of other H2O R tutorials and demos available, as well as the H2O World 2015 Training Gitbook, and the Machine Learning with R and H2O Booklet (pdf).
R interface for GraphFrames
Highlights
Support for GraphFrames which aims to provide the functionality of GraphX.
Perform graph algorithms like: PageRank, ShortestPaths and many others.
Designed to work with sparklyr and the sparklyr extensions.
Installation
To install from CRAN, run:
install.packages("graphframes")
For the development version, run:
devtools::install_github("rstudio/graphframes")
Examples
The examples make use of the highschool
dataset from the ggraph
package.
Create a GraphFrame
The base for graph analyses in Spark, using sparklyr
, will be a GraphFrame.
Open a new Spark connection using sparklyr
, and copy the highschool
data set
library(graphframes)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.1.0")
highschool_tbl <- copy_to(sc, ggraph::highschool, "highschool")
head(highschool_tbl)
## # Source: lazy query [?? x 3]
## # Database: spark_connection
## from to year
## <dbl> <dbl> <dbl>
## 1    1.   14. 1957.
## 2    1.   15. 1957.
## 3    1.   21. 1957.
## 4    1.   54. 1957.
## 5    1.   55. 1957.
## 6    2.   21. 1957.
The vertices table will be constructed using dplyr.
The variable name expected by the GraphFrame is id.
from_tbl <- highschool_tbl %>%
distinct(from) %>%
transmute(id = from)
to_tbl <- highschool_tbl %>%
distinct(to) %>%
transmute(id = to)
vertices_tbl <- from_tbl %>%
sdf_bind_rows(to_tbl)
head(vertices_tbl)
## # Source: lazy query [?? x 1]
## # Database: spark_connection
## id
## <dbl>
## 1 6.
## 2 7.
## 3 12.
## 4 13.
## 5 55.
## 6 58.
The edges table can also be created using dplyr
.
In order for the GraphFrame to work, the from variable needs to be renamed src, and the to variable dst.
# Create a table with <source, destination> edges
edges_tbl <- highschool_tbl %>%
transmute(src = from, dst = to)
The gf_graphframe()
function creates a new GraphFrame
gf_graphframe(vertices_tbl, edges_tbl)
## GraphFrame
## Vertices:
## $ id <dbl> 6, 7, 12, 13, 55, 58, 63, 41, 44, 48, 59, 1, 4, 17, 20, 22,...
## Edges:
## $ src <dbl> 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 7, 8...
## $ dst <dbl> 14, 15, 21, 54, 55, 21, 22, 9, 15, 5, 18, 19, 43, 19, 43, ...
Basic Page Rank
We will calculate PageRank over this dataset.
The gf_graphframe()
command can easily be piped into the gf_pagerank()
function to execute the Page Rank.
gf_graphframe(vertices_tbl, edges_tbl) %>%
gf_pagerank(reset_prob = 0.15, max_iter = 10L, source_id = "1")
## GraphFrame
## Vertices:
## $ id <dbl> 12, 12, 59, 59, 1, 1, 20, 20, 45, 45, 8, 8, 9, 9, 26,...
## $ pagerank <dbl> 1.216914e-02, 1.216914e-02, 1.151867e-03, 1.151867e-0...
## Edges:
## $ src <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,...
## $ dst <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 22, 22,...
## $ weight <dbl> 0.02777778, 0.02777778, 0.02777778, 0.02777778, 0.02777...
Additionally, one can calculate the degrees of vertices using gf_degrees
as follows:
gf_graphframe(vertices_tbl, edges_tbl) %>%
gf_degrees()
## # Source: table<sparklyr_tmp_27b034635ad> [?? x 2]
## # Database: spark_connection
## id degree
## <dbl> <int>
##  1   55.     25
##  2    6.     10
##  3   13.     16
##  4    7.      6
##  5   12.     11
##  6   63.     21
##  7   58.      8
##  8   41.     19
##  9   48.     15
## 10   59.     11
## # ... with more rows
Visualizations
In order to visualize large graphframes, one can use sample_n
and then use ggraph
with igraph
to visualize the graph as follows:
library(ggraph)
library(igraph)
graph <- highschool_tbl %>%
sample_n(20) %>%
collect() %>%
graph_from_data_frame()
ggraph(graph, layout = 'kk') +
geom_edge_link(aes(colour = factor(year))) +
geom_node_point() +
ggtitle('An example')
Additional functions
Apart from calculating PageRank
using gf_pagerank
, the following functions are available:
gf_bfs(): Breadth-first search (BFS).
gf_connected_components(): Connected components.
gf_shortest_paths(): Shortest paths algorithm (see the sketch below).
gf_scc(): Strongly connected components.
gf_triangle_count(): Computes the number of triangles passing through each vertex.
...and others.
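For instance, a quick sketch using gf_shortest_paths(), assuming the numeric vertex ids from this dataset can be passed as landmarks: it computes shortest-path lengths from every vertex to vertices 1 and 14:
gf_graphframe(vertices_tbl, edges_tbl) %>%
  gf_shortest_paths(landmarks = c(1, 14))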
R interface for MLeap
mleap is a sparklyr extension that provides an interface to MLeap, which allows us to take Spark pipelines to production.
Install mleap
mleap can be installed from CRAN via
install.packages("mleap")
or, for the latest development version from GitHub, using
devtools::install_github("rstudio/mleap")
Setup
Once mleap
has been installed, we can install the external dependencies using:
library(mleap)
install_mleap()
Another dependency of mleap
is Maven.
If it is already installed, just point mleap
to its location:
options(maven.home = "path/to/maven")
If Maven is not yet installed, which is the most likely case, use the following to install it:
install_maven()
Create an MLeap Bundle
Start Spark session using sparklyr
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.2.0")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
Create and fit an ML Pipeline
pipeline <- ml_pipeline(sc) %>%
ft_binarizer("hp", "big_hp", threshold = 100) %>%
ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
ml_gbt_regressor(label_col = "mpg")
pipeline_model <- ml_fit(pipeline, mtcars_tbl)
A transformed data frame with the appropriate schema is required for exporting the Pipeline model
transformed_tbl <- ml_transform(pipeline_model, mtcars_tbl)
Export the model using the ml_write_bundle()
function from mleap
model_path <- file.path(tempdir(), "mtcars_model.zip")
ml_write_bundle(pipeline_model, transformed_tbl, model_path)
## Model successfully exported.
Close Spark session
spark_disconnect(sc)
At this point, we can share mtcars_model.zip
with the deployment/implementation engineers, and they would be able to embed the model in another application.
See the MLeap docs for details.
Test the mleap
bundle
The mleap
package also provides R functions for testing that the saved models behave as expected.
Here we load the previously saved model:
model <- mleap_load_bundle(model_path)
model
## MLeap Transformer
## <db23a9f1-7b3d-4d27-9eb0-8675125ab3a5>
## Name: pipeline_fe6b8cb0028f
## Format: json
## MLeap Version: 0.10.0-SNAPSHOT
To retrieve the schema associated with the model use the mleap_model_schema()
function
mleap_model_schema(model)
## # A tibble: 6 x 4
## name type nullable dimension
## <chr> <chr> <lgl> <chr>
## 1 qsec double TRUE <NA>
## 2 hp double FALSE <NA>
## 3 wt double TRUE <NA>
## 4 big_hp double FALSE <NA>
## 5 features double TRUE (3)
## 6 prediction double FALSE <NA>
Then, we create a new data frame to be scored, and make predictions using the model:
newdata <- tibble::tribble(
~qsec, ~hp, ~wt,
16.2, 101, 2.68,
18.1, 99, 3.08
)
# Transform the data frame
transformed_df <- mleap_transform(model, newdata)
dplyr::glimpse(transformed_df)
## Observations: 2
## Variables: 6
## $ qsec <dbl> 16.2, 18.1
## $ hp <dbl> 101, 99
## $ wt <dbl> 2.68, 3.08
## $ big_hp <dbl> 1, 0
## $ features <list> [[[1, 2.68, 16.2], [3]], [[0, 3.08, 18.1], [3]]]
## $ prediction <dbl> 21.06529, 22.36667
Examples
Overview: With this configuration, RStudio Server Pro is installed outside of the Spark cluster and allows users to connect to Spark remotely using sparklyr with Databricks Connect.
This is the recommended configuration because it targets separate environments, involves a typical configuration process, avoids resource contention, and allows RStudio Server Pro to connect to Databricks as well as other remote storage and compute resources.
Advantages: RStudio Server Pro will remain functional if Databricks clusters are terminated; it provides the ability to communicate with one or more Databricks clusters as a remote compute resource; and it avoids resource contention between RStudio Server Pro and Databricks.
Overview: If the recommended path of connecting to Spark remotely with Databricks Connect does not apply to your use case, then you can install RStudio Server Pro directly within a Databricks cluster.
With this configuration, RStudio Server Pro is installed on the Spark driver node and allows users to work locally with Spark using sparklyr.
This configuration can result in increased complexity, limited connectivity to other storage and compute resources, resource contention between RStudio Server Pro and Databricks, and maintenance concerns due to the ephemeral nature of Databricks clusters.
Overview This documentation demonstrates how to use sparklyr with Apache Spark in Databricks along with RStudio Team, RStudio Server Pro, RStudio Connect, and RStudio Package Manager.
Using RStudio Team with Databricks RStudio Team is a bundle of our popular professional software for developing data science projects, publishing data products, and managing packages.
RStudio Team and sparklyr can be used with Databricks to work with large datasets and distributed computations with Apache Spark. |
|
Summary This document demonstrates how to use sparklyr with an Cloudera Hadoop & Spark cluster.
Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes.
RStudio Server is installed on the master node and orchestrates the analysis in spark.
Cloudera Cluster This demonstration is focused on adding RStudio integration to an existing Cloudera cluster.
The assumption will be made that there no aid is needed to setup and administer the cluster. |
|
Spark Standalone Deployment in AWS
Overview
The plan is to launch 4 identical EC2 server instances.
One server will be the Master node and the other 3 the worker nodes.
In one of the worker nodes, we will install RStudio server.
What makes a server the Master node is only the fact that it is running the master service, while the other machines are running the slave service and are pointed to that first master.
This simple setup allows us to install the same Spark components on all 4 servers and then just add RStudio to one of them.
The topology will look something like this:
AWS EC2 Instances
Here are the details of the EC2 instance, just deploy one at this point:
Type: t2.medium
OS: Ubuntu 16.04 LTS
Disk space: At least 20GB
Security group: Open the following ports: 8080 (Spark UI), 4040 (Spark Worker UI), 8088 (sparklyr UI) and 8787 (RStudio).
Also open All TCP ports for the machines inside the security group.
Spark
Perform the steps in this section on all of the servers that will be part of the cluster.
Install Java 8
We will add the Java 8 repository, install it, and set it as the default:
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
sudo apt-get update
Download Spark
Download and unpack a pre-compiled version of Spark.
Here is the link to the official Spark download page:
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7
Create and launch AMI
We will create an image of the server.
In Amazon, these are called AMIs, for information please see the User Guide.
Launch 3 instances of the AMI
RStudio Server
Select one of the nodes to execute this section.
Please check the RStudio download page for the latest version
Install R
In order to get the latest R core, we will need to update the source list in Ubuntu:
sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key 0x51716619e084dab9
gpg -a --export 0x51716619e084dab9 | sudo apt-key add -
sudo apt-get update
Now we can install R:
sudo apt-get install r-base
sudo apt-get install gdebi-core
Install RStudio
We will download and install RStudio Server.
To find the latest version, please visit the RStudio website.
In order to get the enhanced integration with Spark, RStudio version 1.0.44 or later will be needed:
wget https://download2.rstudio.org/rstudio-server-1.0.153-amd64.deb
sudo gdebi rstudio-server-1.0.153-amd64.deb
Install dependencies
Run the following commands:
sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev
sudo apt-get -y install libxml2-dev
Add default user
Run the following command to add a default user:
sudo adduser rstudio-user
Start the Master node
Select one of the servers to become your Master node
Run the command that starts the master service:
sudo spark-2.1.0-bin-hadoop2.7/sbin/start-master.sh
Close the terminal connection (optional)
Start Worker nodes
Start the slave service.
Important: Use dots, not dashes, as separators in the Spark Master node's address.
sudo spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://[Master node's IP address]:7077
sudo spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://ip-172-30-1-94.us-west-2.compute.internal:7077
Close the terminal connection (optional)
Pre-load packages
Log into RStudio (port 8787)
Use the 'rstudio-user' account and run:
install.packages("sparklyr")
Connect to the Spark Master
Navigate to the Spark Master’s UI, typically on port 8080
Note the Spark Master URL
Logon to RStudio
Run the following code
library(sparklyr)
conf <- spark_config()
conf$spark.executor.memory <- "2GB"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master="[Spark Master URL]",
version = "2.1.0",
config = conf,
spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/")
Using sparklyr with an Apache Spark cluster
This document demonstrates how to use sparklyr
with an Apache Spark cluster.
Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes.
RStudio Server is installed on the master node and orchestrates the analysis in spark.
Here is the basic workflow.
Data preparation
Set up the cluster
This demonstration uses Amazon Web Services (AWS), but it could just as easily use Microsoft, Google, or any other provider.
We will use Elastic Map Reduce (EMR) to easily set up a cluster with two core nodes and one master node.
Nodes use virtual servers from the Elastic Compute Cloud (EC2).
Note: There is no free tier for EMR, charges will apply.
Before beginning this setup we assume you have:
Familiarity with and access to an AWS account
Familiarity with basic Linux commands
Sudo privileges in order to install software from the command line
Build an EMR cluster
Before beginning the EMR wizard setup, make sure you create the following in AWS:
An AWS key pair (.pem key) so you can SSH into the EC2 master node
A security group that gives you access to port 22 on your IP and port 8787 from anywhere
Step 1: Select software
Make sure to select Hive and Spark as part of the install.
Note that by choosing Spark, R will also be installed on the master node as part of the distribution.
Step 2: Select hardware
Provision 2 core nodes and one master node using m3.xlarge instances with 80 GiB of storage per node.
You can easily increase the number of nodes later.
Step 3: Select general cluster settings
Click next on the general cluster settings.
Step 4: Select security
Enter your EC2 key pair and security group.
Make sure the security group has ports 22 and 8787 open.
Connect to EMR
The cluster page will give you details about your EMR cluster and instructions on connecting.
Connect to the master node via SSH using your key pair.
Once you connect you will see the EMR welcome.
# Log in to master node
ssh -i ~/spark-demo.pem hadoop@ec2-52-10-102-11.us-west-2.compute.amazonaws.com
Install RStudio Server
EMR uses Amazon Linux, which is based on CentOS.
Update your master node and install dependencies that will be used by R packages.
# Update
sudo yum update
sudo yum install libcurl-devel openssl-devel # used for devtools
The installation of RStudio Server is easy.
Download the preview version of RStudio Server and install it on the master node.
# Install RStudio Server
wget -P /tmp https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-0.99.1266-x86_64.rpm
sudo yum install --nogpgcheck /tmp/rstudio-server-rhel-0.99.1266-x86_64.rpm
Create a User
Create a user called rstudio-user
that will perform the data analysis.
Create a user directory for rstudio-user
on HDFS with the hadoop fs
command.
# Make User
sudo useradd -m rstudio-user
sudo passwd rstudio-user
# Create new directory in hdfs
hadoop fs -mkdir /user/rstudio-user
hadoop fs -chmod 777 /user/rstudio-user
Download flights data
The flights data is a well-known data source representing 123 million flights over 22 years.
It consumes roughly 12 GiB of storage in uncompressed CSV format in yearly files.
Switch User
For data loading and analysis, make sure you are logged in as a regular user.
# create directories on hdfs for new user
hadoop fs -mkdir /user/rstudio-user
hadoop fs -chmod 777 /user/rstudio-user
# switch user
su rstudio-user
Download data
Run the following script to download data from the web onto your master node.
Download the yearly flight data and the airlines lookup table.
# Make download directory
mkdir /tmp/flights
# Download flight data by year
for i in {1987..2008}
do
echo "$(date) $i Download"
fnam=$i.csv.bz2
wget -O /tmp/flights/$fnam http://stat-computing.org/dataexpo/2009/$fnam
echo "$(date) $i Unzip"
bunzip2 /tmp/flights/$fnam
done
# Download airline carrier data
wget -O /tmp/airlines.csv http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS
# Download airports data
wget -O /tmp/airports.csv https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
Distribute into HDFS
Copy data into HDFS using the hadoop fs
command.
# Copy flight data to HDFS
hadoop fs -mkdir /user/rstudio-user/flights/
hadoop fs -put /tmp/flights /user/rstudio-user/
# Copy airline data to HDFS
hadoop fs -mkdir /user/rstudio-user/airlines/
hadoop fs -put /tmp/airlines.csv /user/rstudio-user/airlines
# Copy airport data to HDFS
hadoop fs -mkdir /user/rstudio-user/airports/
hadoop fs -put /tmp/airports.csv /user/rstudio-user/airports
Create Hive tables
Launch Hive from the command line.
# Open Hive prompt
hive
Create the metadata that will structure the flights table.
Load data into the Hive table.
# Create metadata for flights
CREATE EXTERNAL TABLE IF NOT EXISTS flights
(
year int,
month int,
dayofmonth int,
dayofweek int,
deptime int,
crsdeptime int,
arrtime int,
crsarrtime int,
uniquecarrier string,
flightnum int,
tailnum string,
actualelapsedtime int,
crselapsedtime int,
airtime string,
arrdelay int,
depdelay int,
origin string,
dest string,
distance int,
taxiin string,
taxiout string,
cancelled int,
cancellationcode string,
diverted int,
carrierdelay string,
weatherdelay string,
nasdelay string,
securitydelay string,
lateaircraftdelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");
# Load data into table
LOAD DATA INPATH '/user/rstudio-user/flights' INTO TABLE flights;
Create the metadata that will structure the airlines table.
Load data into the Hive table.
# Create metadata for airlines
CREATE EXTERNAL TABLE IF NOT EXISTS airlines
(
Code string,
Description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
# Load data into table
LOAD DATA INPATH '/user/rstudio-user/airlines' INTO TABLE airlines;
Create the metadata that will structure the airports table.
Load data into the Hive table.
# Create metadata for airports
CREATE EXTERNAL TABLE IF NOT EXISTS airports
(
id string,
name string,
city string,
country string,
faa string,
icao string,
lat double,
lon double,
alt int,
tz_offset double,
dst string,
tz_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE;
# Load data into table
LOAD DATA INPATH '/user/rstudio-user/airports' INTO TABLE airports;
Connect to Spark
Log in to RStudio Server by pointing a browser at your master node IP:8787.
Set the environment variable SPARK_HOME
and then run spark_connect
.
After connecting you will be able to browse the Hive metadata in the RStudio Server Spark pane.
# Connect to Spark
library(sparklyr)
library(dplyr)
library(ggplot2)
Sys.setenv(SPARK_HOME="/usr/lib/spark")
config <- spark_config()
sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.2')
Once you are connected, you will see the Spark pane appear along with your hive tables.
You can inspect your tables by clicking on the data icon.
Data analysis
Is there evidence to suggest that some airline carriers make up time in flight? This analysis predicts time gained in flight by airline carrier.
Cache the tables into memory
Use tbl_cache
to load the flights table into memory.
Caching tables will make analysis much faster.
Create a dplyr reference to the Spark DataFrame.
# Cache flights Hive table into Spark
tbl_cache(sc, 'flights')
flights_tbl <- tbl(sc, 'flights')
# Cache airlines Hive table into Spark
tbl_cache(sc, 'airlines')
airlines_tbl <- tbl(sc, 'airlines')
# Cache airports Hive table into Spark
tbl_cache(sc, 'airports')
airports_tbl <- tbl(sc, 'airports')
Create a model data set
Filter the data to contain only the records to be used in the fitted model.
Join carrier descriptions for reference.
Create a new variable called gain
which represents the amount of time gained (or lost) in flight.
# Filter records and create target variable 'gain'
model_data <- flights_tbl %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360) %>%
filter(year >= 2003 & year <= 2007) %>%
left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
mutate(gain = depdelay - arrdelay) %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain)
# Summarize data by carrier
model_data %>%
group_by(uniquecarrier) %>%
summarize(description = min(description), gain=mean(gain),
distance=mean(distance), depdelay=mean(depdelay)) %>%
select(description, gain, distance, depdelay) %>%
arrange(gain)
Source: query [?? x 4]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
   description                    gain        distance   depdelay
   <chr>                          <dbl>       <dbl>      <dbl>
1  ATA Airlines d/b/a ATA         -3.3480120  1134.7084  56.06583
2  ExpressJet Airlines Inc. (1)   -3.0326180   519.7125  59.41659
3  Envoy Air                      -2.5434415   416.3716  53.12529
4  Northwest Airlines Inc.        -2.2030586   779.2342  48.52828
5  Delta Air Lines Inc.           -1.8248026   868.3997  50.77174
6  AirTran Airways Corporation    -1.4331555   641.8318  54.96702
7  Continental Air Lines Inc.     -0.9617003  1116.6668  57.00553
8  American Airlines Inc.         -0.8860262  1074.4388  55.45045
9  Endeavor Air Inc.              -0.6392733   467.1951  58.47395
10 JetBlue Airways                -0.3262134  1139.0443  54.06156
# ... with more rows
Train a linear model
Predict time gained or lost in flight as a function of distance, departure delay, and airline carrier.
# Partition the data into training and validation sets
model_partition <- model_data %>%
sdf_partition(train = 0.8, valid = 0.2, seed = 5555)
# Fit a linear model
ml1 <- model_partition$train %>%
ml_linear_regression(gain ~ distance + depdelay + uniquecarrier)
# Summarize the linear model
summary(ml1)
Deviance Residuals: (approximate):
Min 1Q Median 3Q Max
-305.422 -5.593 2.699 9.750 147.871
Coefficients:
Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -1.24342576 0.10248281 -12.1330 < 2.2e-16 ***
distance 0.00326600 0.00001670 195.5709 < 2.2e-16 ***
depdelay -0.01466233 0.00020337 -72.0977 < 2.2e-16 ***
uniquecarrier_AA -2.32650517 0.10522524 -22.1098 < 2.2e-16 ***
uniquecarrier_AQ 2.98773637 0.28798507 10.3746 < 2.2e-16 ***
uniquecarrier_AS 0.92054894 0.11298561 8.1475 4.441e-16 ***
uniquecarrier_B6 -1.95784698 0.11728289 -16.6934 < 2.2e-16 ***
uniquecarrier_CO -2.52618081 0.11006631 -22.9514 < 2.2e-16 ***
uniquecarrier_DH 2.23287189 0.11608798 19.2343 < 2.2e-16 ***
uniquecarrier_DL -2.68848119 0.10621977 -25.3106 < 2.2e-16 ***
uniquecarrier_EV 1.93484736 0.10724290 18.0417 < 2.2e-16 ***
uniquecarrier_F9 -0.89788137 0.14422281 -6.2257 4.796e-10 ***
uniquecarrier_FL -1.46706706 0.11085354 -13.2343 < 2.2e-16 ***
uniquecarrier_HA -0.14506644 0.25031456 -0.5795 0.5622
uniquecarrier_HP 2.09354855 0.12337515 16.9690 < 2.2e-16 ***
uniquecarrier_MQ -1.88297535 0.10550507 -17.8473 < 2.2e-16 ***
uniquecarrier_NW -2.79538927 0.10752182 -25.9983 < 2.2e-16 ***
uniquecarrier_OH 0.83520117 0.11032997 7.5700 3.730e-14 ***
uniquecarrier_OO 0.61993842 0.10679884 5.8047 6.447e-09 ***
uniquecarrier_TZ -4.99830389 0.15912629 -31.4109 < 2.2e-16 ***
uniquecarrier_UA -0.68294396 0.10638099 -6.4198 1.365e-10 ***
uniquecarrier_US -0.61589284 0.10669583 -5.7724 7.815e-09 ***
uniquecarrier_WN 3.86386059 0.10362275 37.2878 < 2.2e-16 ***
uniquecarrier_XE -2.59658123 0.10775736 -24.0966 < 2.2e-16 ***
uniquecarrier_YV 3.11113140 0.11659679 26.6828 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-Squared: 0.02385
Root Mean Squared Error: 17.74
Assess model performance
Compare the model performance using the validation data.
# Calculate average gains by predicted decile
model_deciles <- lapply(model_partition, function(x) {
sdf_predict(ml1, x) %>%
mutate(decile = ntile(desc(prediction), 10)) %>%
group_by(decile) %>%
summarize(gain = mean(gain)) %>%
select(decile, gain) %>%
collect()
})
# Create a summary dataset for plotting
deciles <- rbind(
data.frame(data = 'train', model_deciles$train),
data.frame(data = 'valid', model_deciles$valid),
make.row.names = FALSE
)
# Plot average gains by predicted decile
deciles %>%
ggplot(aes(factor(decile), gain, fill = data)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = 'Average gain by predicted decile', x = 'Decile', y = 'Minutes')
Visualize predictions
Compare actual gains to predicted gains for an out of time sample.
# Select data from an out of time sample
data_2008 <- flights_tbl %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360) %>%
filter(year == 2008) %>%
left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
mutate(gain = depdelay - arrdelay) %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain, origin,dest)
# Summarize data by carrier
carrier <- sdf_predict(ml1, data_2008) %>%
group_by(description) %>%
summarize(gain = mean(gain), prediction = mean(prediction), freq = n()) %>%
filter(freq > 10000) %>%
collect
# Plot actual gains and predicted gains by airline carrier
ggplot(carrier, aes(gain, prediction)) +
geom_point(alpha = 0.75, color = 'red', shape = 3) +
geom_abline(intercept = 0, slope = 1, alpha = 0.15, color = 'blue') +
geom_text(aes(label = substr(description, 1, 20)), size = 3, alpha = 0.75, vjust = -1) +
labs(title='Average Gains Forecast', x = 'Actual', y = 'Predicted')
Some carriers make up more time than others in flight, but the differences are relatively small.
The difference in average time gained between the best and worst airlines is only six minutes.
The best predictor of time gained is not carrier but flight distance.
The biggest gains were associated with the longest flights.
Share Insights
This simple linear model contains a wealth of detailed information about carriers, distances traveled, and flight delays.
These detailed insights can be conveyed to a non-technical audience via an interactive flexdashboard.
Build dashboard
Aggregate the scored data by origin, destination, and airline.
Save the aggregated data.
# Summarize by origin, destination, and carrier
summary_2008 <- sdf_predict(ml1, data_2008) %>%
rename(carrier = uniquecarrier, airline = description) %>%
group_by(origin, dest, carrier, airline) %>%
summarize(
flights = n(),
distance = mean(distance),
avg_dep_delay = mean(depdelay),
avg_arr_delay = mean(arrdelay),
avg_gain = mean(gain),
pred_gain = mean(prediction)
)
# Collect and save objects
pred_data <- collect(summary_2008)
airports <- collect(select(airports_tbl, name, faa, lat, lon))
ml1_summary <- capture.output(summary(ml1))
save(pred_data, airports, ml1_summary, file = 'flights_pred_2008.RData')
Publish dashboard
Use the saved data to build an R Markdown flexdashboard.
Publish the flexdashboard to Shiny Server, Shinyapps.io or RStudio Connect.
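For a concrete illustration, here is a minimal publishing sketch; the file name flights_dashboard.Rmd is a placeholder, and it assumes a publishing account (RStudio Connect or shinyapps.io) has already been registered with the rsconnect package.
library(rsconnect)
# Register the target account beforehand, e.g. with rsconnect::connectUser()
# for RStudio Connect or rsconnect::setAccountInfo() for shinyapps.io.
# Deploy the R Markdown flexdashboard (the file name is a placeholder).
deployDoc(
  doc      = "flights_dashboard.Rmd",
  appTitle = "Flights 2008 - Average Gains Forecast"
)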
Using sparklyr with an Apache Spark cluster
Summary
This document demonstrates how to use sparklyr
with a Cloudera Hadoop & Spark cluster.
Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes.
RStudio Server is installed on the master node and orchestrates the analysis in spark.
Cloudera Cluster
This demonstration is focused on adding RStudio integration to an existing Cloudera cluster.
We assume that no help is needed to set up and administer the cluster.
CDH 5
We will start with a Cloudera cluster running CDH version 5.8.2 (free version) on an underlying Ubuntu Linux distribution.
Spark 1.6
The default Spark 1.6.0 parcel is installed and running.
Hive data
For this demo, we have created and populated 3 tables in Hive.
The table names are: flights, airlines and airports.
Using Hue, we can see the loaded tables.
For the links to the data files and their Hive import scripts please see Appendix A.
Install RStudio
The latest version of R is needed.
In Ubuntu, the default core R is not the latest, so we have to update the source list.
We will also install a few other dependencies.
sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key 0x51716619e084dab9
gpg -a --export 0x51716619e084dab9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base
sudo apt-get install gdebi-core
sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev
We will install the preview version of RStudio Server
wget https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-1.0.40-amd64.deb
sudo gdebi rstudio-server-1.0.40-amd64.deb
Create and configure a User
Create a user called rstudio
that will perform the data analysis.
sudo adduser rstudio
To ease security restrictions in this demo, we will add the new user to the default supergroup defined by the dfs.permissions.superusergroup setting in CDH
sudo groupadd supergroup
sudo usermod -a -G supergroup rstudio
Connect to Spark
Log in to RStudio Server by pointing a browser at your master node IP:8787.
Set the environment variable SPARK_HOME
and then run spark_connect
.
After connecting you will be able to browse the Hive metadata in the RStudio Server Spark pane.
library(sparklyr)
library(dplyr)
library(ggplot2)
sc <- spark_connect(master = "yarn-client", version="1.6.0", spark_home = '/opt/cloudera/parcels/CDH/lib/spark/')
Once you are connected, you will see the Spark pane appear along with your hive tables.
You can inspect your tables by clicking on the data icon.
This is what the tables look like loaded in Spark via the History Server Web UI (port 18088)
Data analysis
Is there evidence to suggest that some airline carriers make up time in flight? This analysis predicts time gained in flight by airline carrier.
Cache the tables into memory
Use tbl_cache
to load the flights table into memory.
Caching tables will make analysis much faster.
Create a dplyr reference to the Spark DataFrame.
# Cache flights Hive table into Spark
tbl_cache(sc, 'flights')
flights_tbl <- tbl(sc, 'flights')
# Cache airlines Hive table into Spark
tbl_cache(sc, 'airlines')
airlines_tbl <- tbl(sc, 'airlines')
# Cache airports Hive table into Spark
tbl_cache(sc, 'airports')
airports_tbl <- tbl(sc, 'airports')
Create a model data set
Filter the data to contain only the records to be used in the fitted model.
Join carrier descriptions for reference.
Create a new variable called gain
which represents the amount of time gained (or lost) in flight.
# Filter records and create target variable 'gain'
model_data <- flights_tbl %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360) %>%
filter(year >= 2003 & year <= 2007) %>%
left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
mutate(gain = depdelay - arrdelay) %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain)
# Summarize data by carrier
model_data %>%
group_by(uniquecarrier) %>%
summarize(description = min(description), gain=mean(gain),
distance=mean(distance), depdelay=mean(depdelay)) %>%
select(description, gain, distance, depdelay) %>%
arrange(gain)
Source: query [?? x 4]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
   description                    gain        distance   depdelay
   <chr>                          <dbl>       <dbl>      <dbl>
1  ATA Airlines d/b/a ATA         -5.5679651  1240.7219  61.84391
2  Northwest Airlines Inc.        -3.1134556   779.1926  48.84979
3  Envoy Air                      -2.2056576   437.0883  54.54923
4  PSA Airlines Inc.              -1.9267647   500.6955  55.60335
5  ExpressJet Airlines Inc. (1)   -1.5886314   537.3077  61.58386
6  JetBlue Airways                -1.3742524  1087.2337  59.80750
7  SkyWest Airlines Inc.          -1.1265678   419.6489  54.04198
8  Delta Air Lines Inc.           -0.9829374   956.9576  50.19338
9  American Airlines Inc.         -0.9631200  1066.8396  56.78222
10 AirTran Airways Corporation    -0.9411572   665.6574  53.38363
# ... with more rows
Train a linear model
Predict time gained or lost in flight as a function of distance, departure delay, and airline carrier.
# Partition the data into training and validation sets
model_partition <- model_data %>%
sdf_partition(train = 0.8, valid = 0.2, seed = 5555)
# Fit a linear model
ml1 <- model_partition$train %>%
ml_linear_regression(gain ~ distance + depdelay + uniquecarrier)
# Summarize the linear model
summary(ml1)
Call: ml_linear_regression(., gain ~ distance + depdelay + uniquecarrier)
Deviance Residuals: (approximate):
Min 1Q Median 3Q Max
-302.343 -5.669 2.714 9.832 104.130
Coefficients:
Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -1.26566581 0.10385870 -12.1864 < 2.2e-16 ***
distance 0.00308711 0.00002404 128.4155 < 2.2e-16 ***
depdelay -0.01397013 0.00028816 -48.4812 < 2.2e-16 ***
uniquecarrier_AA -2.18483090 0.10985406 -19.8885 < 2.2e-16 ***
uniquecarrier_AQ 3.14330242 0.29114487 10.7964 < 2.2e-16 ***
uniquecarrier_AS 0.09210380 0.12825003 0.7182 0.4726598
uniquecarrier_B6 -2.66988794 0.12682192 -21.0523 < 2.2e-16 ***
uniquecarrier_CO -1.11611186 0.11795564 -9.4621 < 2.2e-16 ***
uniquecarrier_DL -1.95206198 0.11431110 -17.0767 < 2.2e-16 ***
uniquecarrier_EV 1.70420830 0.11337215 15.0320 < 2.2e-16 ***
uniquecarrier_F9 -1.03178176 0.15384863 -6.7065 1.994e-11 ***
uniquecarrier_FL -0.99574060 0.12034738 -8.2739 2.220e-16 ***
uniquecarrier_HA -1.16970713 0.34894788 -3.3521 0.0008020 ***
uniquecarrier_MQ -1.55569040 0.10975613 -14.1741 < 2.2e-16 ***
uniquecarrier_NW -3.58502418 0.11534938 -31.0797 < 2.2e-16 ***
uniquecarrier_OH -1.40654797 0.12034858 -11.6873 < 2.2e-16 ***
uniquecarrier_OO -0.39069404 0.11132164 -3.5096 0.0004488 ***
uniquecarrier_TZ -7.26285217 0.34428509 -21.0955 < 2.2e-16 ***
uniquecarrier_UA -0.56995737 0.11186757 -5.0949 3.489e-07 ***
uniquecarrier_US -0.52000028 0.11218498 -4.6352 3.566e-06 ***
uniquecarrier_WN 4.22838982 0.10629405 39.7801 < 2.2e-16 ***
uniquecarrier_XE -1.13836940 0.11332176 -10.0455 < 2.2e-16 ***
uniquecarrier_YV 3.17149538 0.11709253 27.0854 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-Squared: 0.02301
Root Mean Squared Error: 17.83
Assess model performance
Compare the model performance using the validation data.
# Calculate average gains by predicted decile
model_deciles <- lapply(model_partition, function(x) {
sdf_predict(ml1, x) %>%
mutate(decile = ntile(desc(prediction), 10)) %>%
group_by(decile) %>%
summarize(gain = mean(gain)) %>%
select(decile, gain) %>%
collect()
})
# Create a summary dataset for plotting
deciles <- rbind(
data.frame(data = 'train', model_deciles$train),
data.frame(data = 'valid', model_deciles$valid),
make.row.names = FALSE
)
# Plot average gains by predicted decile
deciles %>%
ggplot(aes(factor(decile), gain, fill = data)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = 'Average gain by predicted decile', x = 'Decile', y = 'Minutes')
Visualize predictions
Compare actual gains to predicted gains for an out of time sample.
# Select data from an out of time sample
data_2008 <- flights_tbl %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360) %>%
filter(year == 2008) %>%
left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
mutate(gain = depdelay - arrdelay) %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain, origin,dest)
# Summarize data by carrier
carrier <- sdf_predict(ml1, data_2008) %>%
group_by(description) %>%
summarize(gain = mean(gain), prediction = mean(prediction), freq = n()) %>%
filter(freq > 10000) %>%
collect
# Plot actual gains and predicted gains by airline carrier
ggplot(carrier, aes(gain, prediction)) +
geom_point(alpha = 0.75, color = 'red', shape = 3) +
geom_abline(intercept = 0, slope = 1, alpha = 0.15, color = 'blue') +
geom_text(aes(label = substr(description, 1, 20)), size = 3, alpha = 0.75, vjust = -1) +
labs(title='Average Gains Forecast', x = 'Actual', y = 'Predicted')
Some carriers make up more time than others in flight, but the differences are relatively small.
The difference in average time gained between the best and worst airlines is only six minutes.
The best predictor of time gained is not carrier but flight distance.
The biggest gains were associated with the longest flights.
Share Insights
This simple linear model contains a wealth of detailed information about carriers, distances traveled, and flight delays.
These detailed insights can be conveyed to a non-technical audience via an interactive flexdashboard.
Build dashboard
Aggregate the scored data by origin, destination, and airline.
Save the aggregated data.
# Summarize by origin, destination, and carrier
summary_2008 <- sdf_predict(ml1, data_2008) %>%
rename(carrier = uniquecarrier, airline = description) %>%
group_by(origin, dest, carrier, airline) %>%
summarize(
flights = n(),
distance = mean(distance),
avg_dep_delay = mean(depdelay),
avg_arr_delay = mean(arrdelay),
avg_gain = mean(gain),
pred_gain = mean(prediction)
)
# Collect and save objects
pred_data <- collect(summary_2008)
airports <- collect(select(airports_tbl, name, faa, lat, lon))
ml1_summary <- capture.output(summary(ml1))
save(pred_data, airports, ml1_summary, file = 'flights_pred_2008.RData')
Publish dashboard
Use the saved data to build an R Markdown flexdashboard.
Publish the flexdashboard to Shiny Server, Shinyapps.io or RStudio Connect.
Appendix
Appendix A - Data files
Run the following script to download data from the web onto your master node.
Download the yearly flight data and the airlines lookup table.
# Make download directory
mkdir /tmp/flights
# Download flight data by year
for i in {2006..2008}
do
echo "$(date) $i Download"
fnam=$i.csv.bz2
wget -O /tmp/flights/$fnam http://stat-computing.org/dataexpo/2009/$fnam
echo "$(date) $i Unzip"
bunzip2 /tmp/flights/$fnam
done
# Download airline carrier data
wget -O /tmp/airlines.csv http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS
# Download airports data
wget -O /tmp/airports.csv https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
Hive tables
We used the Hue interface, logged in as ‘admin’ to load the data into HDFS and then into Hive.
CREATE EXTERNAL TABLE IF NOT EXISTS flights
(
year int,
month int,
dayofmonth int,
dayofweek int,
deptime int,
crsdeptime int,
arrtime int,
crsarrtime int,
uniquecarrier string,
flightnum int,
tailnum string,
actualelapsedtime int,
crselapsedtime int,
airtime string,
arrdelay int,
depdelay int,
origin string,
dest string,
distance int,
taxiin string,
taxiout string,
cancelled int,
cancellationcode string,
diverted int,
carrierdelay string,
weatherdelay string,
nasdelay string,
securitydelay string,
lateaircraftdelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");
LOAD DATA INPATH '/user/admin/flights/2006.csv/' INTO TABLE flights;
LOAD DATA INPATH '/user/admin/flights/2007.csv/' INTO TABLE flights;
LOAD DATA INPATH '/user/admin/flights/2008.csv/' INTO TABLE flights;
# Create metadata for airlines
CREATE EXTERNAL TABLE IF NOT EXISTS airlines
(
Code string,
Description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
LOAD DATA INPATH '/user/admin/L_UNIQUE_CARRIERS.csv' INTO TABLE airlines;
CREATE EXTERNAL TABLE IF NOT EXISTS airports
(
id string,
name string,
city string,
country string,
faa string,
icao string,
lat double,
lon double,
alt int,
tz_offset double,
dst string,
tz_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/admin/airports.dat' INTO TABLE airports;
Using sparklyr with Databricks
Overview
This documentation demonstrates how to use sparklyr
with Apache Spark in
Databricks along with RStudio Team, RStudio Server Pro, RStudio Connect, and
RStudio Package Manager.
Using RStudio Team with Databricks
RStudio Team is a bundle of our popular professional software for developing
data science projects, publishing data products, and managing packages.
RStudio Team and sparklyr
can be used with Databricks to work with large
datasets and distributed computations with Apache Spark.
The most common use
case is to perform interactive analysis and exploratory development with RStudio
Server Pro and sparklyr
; write out the results to a database, file system, or
cloud storage; then publish apps, reports, and APIs to RStudio Connect that
query and access the results.
The sections below describe best practices and different options for configuring
specific RStudio products to work with Databricks.
Best practices for working with Databricks
Maintain separate installation environments - Install RStudio Server Pro,
RStudio Connect, and RStudio Package Manager outside of the Databricks cluster
so that they are not limited to the compute resources or ephemeral nature of
Databricks clusters.
Connect to Databricks remotely - Work with Databricks as a remote compute
resource, similar to how you would connect remotely to external databases,
data sources, and storage systems.
This can be accomplished using Databricks
Connect (as described in the
Connecting to Databricks remotely
section below) or by performing SQL queries with JDBC/ODBC using the
Databricks Spark SQL Driver on
AWS or
Azure.
Restrict workloads to interactive analysis - Only perform workloads
related to exploratory or interactive analysis with Spark, then write the
results to a database, file system, or cloud storage for more efficient
retrieval in apps, reports, and APIs.
Load and query results efficiently - Because of the nature of Spark
computations and the associated overhead, Shiny apps that use Spark on the
backend tend to have performance and runtime issues; consider reading the
results from a database, file system, or cloud storage instead.
Using RStudio Server Pro with Databricks
There are two options for using sparklyr
and RStudio Server Pro with
Databricks:
Option 1:
Connecting to Databricks remotely
(Recommended Option)
Option 2:
Working inside of Databricks
(Alternative Option)
Option 1 - Connecting to Databricks remotely
With this configuration, RStudio Server Pro is installed outside of the Spark
cluster and allows users to connect to Spark remotely using sparklyr
with
Databricks Connect.
This is the recommended configuration because it targets separate environments,
involves a typical configuration process, avoids resource contention, and allows
RStudio Server Pro to connect to Databricks as well as other remote storage and
compute resources.
View steps for connecting to Databricks remotely
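As a minimal sketch of this option, assuming Databricks Connect has already been installed and configured on the RStudio Server Pro host, the connection can be opened with sparklyr's "databricks" method:
library(sparklyr)
# Databricks Connect supplies the local Spark distribution; ask it where
# SPARK_HOME is and connect with the "databricks" method.
sc <- spark_connect(
  method     = "databricks",
  spark_home = system2("databricks-connect", "get-spark-home", stdout = TRUE)
)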
Option 2 - Working inside of Databricks
If you cannot work with Spark remotely, you should install RStudio Server Pro on
the Driver node of a long-running, persistent Databricks cluster as opposed to a
worker node or an ephemeral cluster.
With this configuration, RStudio Server Pro is installed on the Spark driver
node and allows users to connect to Spark locally using sparklyr
.
This configuration can result in increased complexity, limited connectivity to
other storage and compute resources, resource contention between RStudio Server
Pro and Databricks, and maintenance concerns due to the ephemeral nature of
Databricks clusters.
View steps for working inside of Databricks
Using RStudio Connect with Databricks
The server environment within Databricks clusters is not permissive enough to
support RStudio Connect or the process sandboxing mechanisms that it uses to
isolate published content.
Therefore, the only supported configuration is to install RStudio Connect
outside of the Databricks cluster and connect to Databricks remotely.
Whether RStudio Server Pro is installed outside of the Databricks cluster
(Recommended Option) or within the Databricks cluster (Alternative Option), you
can publish content to RStudio Connect as long as HTTP/HTTPS network traffic is
allowed from RStudio Server Pro to RStudio Connect.
There are two options for using RStudio Connect with Databricks:
Performing SQL queries with JDBC/ODBC using the Databricks Spark SQL Driver
on AWS or
Azure
(Recommended Option)
Adding calls in your R code to create and run Databricks jobs
with bricksteR and the Databricks Jobs API
(Alternative Option)
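As a hedged sketch of the first (recommended) option, content running on RStudio Connect can query Databricks over ODBC with the DBI and odbc packages. The driver name, environment variables, and connection fields below are placeholders that depend on how the Databricks Spark SQL Driver is installed and configured.
library(DBI)
# All connection details are placeholders; adjust them to match your
# Databricks Spark SQL (Simba) ODBC driver installation and workspace.
con <- dbConnect(
  odbc::odbc(),
  driver   = "Simba Spark ODBC Driver",
  host     = Sys.getenv("DATABRICKS_HOST"),
  httpPath = Sys.getenv("DATABRICKS_HTTP_PATH"),
  port     = 443,
  authMech = 3,
  uid      = "token",
  pwd      = Sys.getenv("DATABRICKS_TOKEN"),
  ssl      = 1
)
# Query a table and bring the (small) result into R for the app or report.
carrier_counts <- dbGetQuery(
  con,
  "SELECT uniquecarrier, COUNT(*) AS flights FROM flights GROUP BY uniquecarrier"
)
dbDisconnect(con)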
Using RStudio Package Manager with Databricks
Whether RStudio Server Pro is installed outside of the Databricks cluster
(Recommended Option) or within the Databricks cluster (Alternative Option), you
can install packages from repositories in RStudio Package Manager as long as
HTTP/HTTPS network traffic is allowed from RStudio Server Pro to RStudio Package
Manager.
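For example, a hypothetical sketch of pointing an R session at an RStudio Package Manager repository (the URL is a placeholder for your organization's instance):
# The repository URL is a placeholder for your RStudio Package Manager instance.
options(repos = c(RSPM = "https://packagemanager.example.com/cran/latest"))
# Packages now install from RStudio Package Manager over HTTP/HTTPS.
install.packages("sparklyr")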
Development
Function Reference - version 1.04
Read Spark Configuration
Arguments
Value
Details
Read Spark Configuration
spark_config(file = "config.yml", use_default = TRUE)
Arguments
file |
Name of the configuration file |
use_default |
TRUE to use the built-in defaults provided in this package |
Value
Named list with configuration data
Details
Read Spark configuration using the config package.
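As a brief sketch, the hypothetical config.yml below would be picked up by spark_config(); keys under the default section become entries of the returned list.
library(sparklyr)
# Hypothetical config.yml in the working directory:
# default:
#   spark.executor.memory: 4G
#   sparklyr.shell.driver-memory: 8G
conf <- spark_config(file = "config.yml", use_default = TRUE)
conf$spark.executor.memory   # expected to return "4G"
sc <- spark_connect(master = "local", config = conf)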
Manage Spark Connections
Arguments
Details
Examples
These routines allow you to manage your connections to Spark.
spark_connect(master, spark_home = Sys.getenv("SPARK_HOME"),
method = c("shell", "livy", "databricks", "test", "qubole"),
app_name = "sparklyr", version = NULL, config = spark_config(),
extensions = sparklyr::registered_extensions(), packages = NULL, ...)
spark_connection_is_open(sc)
spark_disconnect(sc, ...)
spark_disconnect_all()
spark_submit(master, file, spark_home = Sys.getenv("SPARK_HOME"),
app_name = "sparklyr", version = NULL, config = spark_config(),
extensions = sparklyr::registered_extensions(), ...)
Arguments
master |
Spark cluster url to connect to.
Use "local" to
connect to a local instance of Spark installed via spark_install . |
spark_home |
The path to a Spark installation.
Defaults to the path
provided by the SPARK_HOME environment variable.
If SPARK_HOME is defined, it will always be used unless the version parameter is specified to force the use of a locally
installed version. |
method |
The method used to connect to Spark.
Default connection method
is "shell" to connect using spark-submit, use "livy" to
perform remote connections using HTTP, or "databricks" when using a
Databricks cluster. |
app_name |
The application name to be used while running in the Spark
cluster. |
version |
The version of Spark to use.
Required for "local" Spark
connections, optional otherwise. |
config |
Custom configuration for the generated Spark connection.
See spark_config for details. |
extensions |
Extension R packages to enable for this connection.
By
default, all packages enabled through the use of sparklyr::register_extension will be passed here. |
packages |
A list of Spark packages to load.
For example, "delta" or "kafka" to enable Delta Lake or Kafka.
Also supports full versions like "io.delta:delta-core_2.11:0.4.0" .
This is similar to adding packages into the sparklyr.shell.packages configuration option.
Notice that the version
parameter is used to choose the correct package; otherwise the latest version
is assumed. |
... |
Optional arguments; currently unused. |
sc |
A spark_connection . |
file |
Path to R source file to submit for batch execution. |
Details
When using method = "livy"
, it is recommended to specify version
parameter to improve performance by using precompiled code rather than uploading
sources.
By default, jars are downloaded from GitHub but the path to the correct sparklyr
JAR can also be specified through the livy.jars
setting.
Examples
sc <- spark_connect(master = "spark://HOST:PORT")
connection_is_open(sc)
#> [1] TRUE
spark_disconnect(sc)
Find a given Spark installation by version.
Arguments
Value
Install versions of Spark for use with local Spark connections
(i.e. spark_connect(master = "local"))
spark_install_find(version = NULL, hadoop_version = NULL,
installed_only = TRUE, latest = FALSE, hint = FALSE)
spark_install(version = NULL, hadoop_version = NULL, reset = TRUE,
logging = "INFO", verbose = interactive())
spark_uninstall(version, hadoop_version)
spark_install_dir()
spark_install_tar(tarfile)
spark_installed_versions()
spark_available_versions(show_hadoop = FALSE, show_minor = FALSE)
Arguments
version |
Version of Spark to install.
See spark_available_versions for a list of supported versions |
hadoop_version |
Version of Hadoop to install.
See spark_available_versions for a list of supported versions |
installed_only |
Search only the locally installed versions? |
latest |
Check for latest version? |
hint |
On failure should the installation code be provided? |
reset |
Attempts to reset settings to defaults. |
logging |
Logging level to configure install.
Supported options: "WARN", "INFO" |
verbose |
Report information as Spark is downloaded / installed |
tarfile |
Path to TAR file conforming to the pattern spark-###-bin-(hadoop)?### where ###
references the Spark and Hadoop versions, respectively. |
show_hadoop |
Show Hadoop distributions? |
show_minor |
Show minor Spark versions? |
Value
List with information about the installed version.
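For example, a minimal sketch installing the Spark version used throughout this document:
library(sparklyr)
# List the Spark / Hadoop combinations available for download
spark_available_versions(show_hadoop = TRUE)
# Install Spark 2.1.0 built against Hadoop 2.7, then confirm it is present
spark_install(version = "2.1.0", hadoop_version = "2.7")
spark_installed_versions()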
View Entries in the Spark Log
Arguments
View the most recent entries in the Spark log.
This can be useful when
inspecting output / errors produced by Spark during the invocation of
various commands.
spark_log(sc, n = 100, filter = NULL, ...)
Arguments
sc |
A spark_connection . |
n |
The max number of log entries to retrieve.
Use NULL to
retrieve all entries within the log. |
filter |
Character string to filter log entries. |
... |
Optional arguments; currently unused. |
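For example, assuming an open connection sc (a minimal sketch):
# Show the 50 most recent log entries, keeping only lines that mention errors
spark_log(sc, n = 50, filter = "ERROR")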
Open the Spark web interface
Arguments
Open the Spark web interface
spark_web(sc, ...)
Arguments
sc |
A spark_connection . |
... |
Optional arguments; currently unused. |
Check whether the connection is open
Arguments
Check whether the connection is open
connection_is_open(sc)
Arguments
A Shiny app that can be used to construct a spark_connect statement
A Shiny app that can be used to construct a spark_connect
statement
connection_spark_shinyapp()
Runtime configuration interface for the Spark Session
Arguments
Retrieves or sets runtime configuration entries for the Spark Session
spark_session_config(sc, config = TRUE, value = NULL)
Arguments
sc |
A spark_connection . |
config |
The configuration entry name(s) (e.g., "spark.sql.shuffle.partitions" ).
Defaults to NULL to retrieve all configuration entries. |
value |
The configuration value to be set.
Defaults to NULL to retrieve
configuration entries. |
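A short sketch, assuming an open connection sc:
# Retrieve all runtime configuration entries for the current session
spark_session_config(sc)
# Set the number of shuffle partitions for this session only
spark_session_config(sc, config = "spark.sql.shuffle.partitions", value = 8)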
Set/Get Spark checkpoint directory
Arguments
Set/Get Spark checkpoint directory
spark_set_checkpoint_dir(sc, dir)
spark_get_checkpoint_dir(sc)
Arguments
sc |
A spark_connection . |
dir |
checkpoint directory, must be HDFS path of running on cluster |
Generate a Table Name from Expression
Arguments
Attempts to generate a table name from an expression; otherwise,
assigns an auto-generated generic name with "sparklyr_" prefix.
spark_table_name(expr)
Arguments
expr |
The expression to attempt to use as name |
Get the Spark Version Associated with a Spark Installation
Arguments
Retrieve the version of Spark associated with a Spark installation.
spark_version_from_home(spark_home, default = NULL)
Arguments
spark_home |
The path to a Spark installation. |
default |
The default version to be inferred, in case
version lookup failed, e.g.
no Spark installation was found
at spark_home . |
Retrieves a data frame of available Spark versions that can be installed.
Arguments
Retrieves a data frame of available Spark versions that can be installed.
spark_versions(latest = TRUE)
Arguments
latest |
Check for latest version? |
Kubernetes Configuration
Arguments
Convenience function to initialize a Kubernetes configuration instead
of spark_config()
, exposes common properties to set in Kubernetes
clusters.
spark_config_kubernetes(master, version = "2.3.2",
image = "spark:sparklyr", driver = random_string("sparklyr-"),
account = "spark", jars = "local:///opt/sparklyr", forward = TRUE,
executors = NULL, conf = NULL, timeout = 120, ports = c(8880,
8881, 4040), fix_config = identical(.Platform$OS.type, "windows"), ...)
Arguments
master |
Kubernetes url to connect to, found by running kubectl cluster-info . |
version |
The version of Spark being used. |
image |
Container image to use to launch Spark and sparklyr.
Also known
as spark.kubernetes.container.image . |
driver |
Name of the driver pod.
If not set, the driver pod name is set
to "sparklyr" suffixed by id to avoid name conflicts.
Also known as spark.kubernetes.driver.pod.name . |
account |
Service account that is used when running the driver pod.
The driver
pod uses this service account when requesting executor pods from the API
server.
Also known as spark.kubernetes.authenticate.driver.serviceAccountName . |
jars |
Path to the sparklyr jars; either, a local path inside the container
image with the sparklyr jars copied when the image was created or, a path
accessible by the container where the sparklyr jars were copied.
You can find
a path to the sparklyr jars by running system.file("java/", package = "sparklyr") . |
forward |
Should ports used in sparklyr be forwarded automatically through Kubernetes?
Defaults to TRUE, which runs kubectl port-forward and pkill kubectl
on disconnection. |
executors |
Number of executors to request while connecting. |
conf |
A named list of additional entries to add to sparklyr.shell.conf . |
timeout |
Total seconds to wait before giving up on connection. |
ports |
Ports to forward using kubectl. |
fix_config |
Should the spark-defaults.conf get fixed? TRUE for Windows. |
... |
Additional parameters, currently not in use. |
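A hedged sketch of a Kubernetes connection; the master URL (taken from kubectl cluster-info) and the container image are placeholders:
library(sparklyr)
# Master URL and image are placeholders; the image must bundle Spark and
# the sparklyr jars at the path given by the jars argument.
sc <- spark_connect(config = spark_config_kubernetes(
  master    = "k8s://https://kubernetes.example.com:443",
  version   = "2.3.2",
  image     = "spark:sparklyr",
  account   = "spark",
  executors = 2
))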
Retrieve Available Settings
Retrieves available sparklyr settings that can be used in configuration files or spark_config()
.
spark_config_settings()
Find Spark Connection
Arguments
Finds an active spark connection in the environment given the
connection parameters.
spark_connection_find(master = NULL, app_name = NULL, method = NULL)
Arguments
master |
The Spark master parameter. |
app_name |
The Spark application name. |
method |
The method used to connect to Spark. |
Fallback to Spark Dependency
Arguments
Value
Helper function to assist falling back to previous Spark versions.
spark_dependency_fallback(spark_version, supported_versions)
Arguments
spark_version |
The Spark version being requested in spark_dependencies . |
supported_versions |
The Spark versions that are supported by this extension. |
Value
A Spark version to use.
Create Spark Extension
Arguments
Creates an R package ready to be used as a Spark extension.
spark_extension(path)
Arguments
path |
Location where the extension will be created. |
Reads from a Spark Table into a Spark DataFrame.
Arguments
See also
Reads from a Spark Table into a Spark DataFrame.
spark_load_table(sc, name, path, options = list(), repartition = 0,
memory = TRUE, overwrite = TRUE)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
See also
Other Spark serialization routines: spark_read_csv
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_json
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read libsvm file into a Spark DataFrame.
Arguments
See also
Read libsvm file into a Spark DataFrame.
spark_read_libsvm(sc, name = NULL, path = name, repartition = 0,
memory = TRUE, overwrite = TRUE, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_json
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read a CSV file into a Spark DataFrame
Arguments
Details
See also
Read a tabular data file into a Spark DataFrame.
spark_read_csv(sc, name = NULL, path = name, header = TRUE,
columns = NULL, infer_schema = is.null(columns), delimiter = ",",
quote = "\"", escape = "\\", charset = "UTF-8",
null_value = NULL, options = list(), repartition = 0,
memory = TRUE, overwrite = TRUE, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
header |
Boolean; should the first row of data be used as a header?
Defaults to TRUE . |
columns |
A vector of column names or a named vector of column types. |
infer_schema |
Boolean; should column types be automatically inferred?
Requires one extra pass over the data.
Defaults to is.null(columns) . |
delimiter |
The character used to delimit each column.
Defaults to ','. |
quote |
The character used as a quote.
Defaults to '"'. |
escape |
The character used to escape other characters.
Defaults to '\'. |
charset |
The character set.
Defaults to "UTF-8". |
null_value |
The character to use for null, or missing, values.
Defaults to NULL . |
options |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://
), S3 (s3a://
),
as well as the local file system (file://
).
If you are reading from a secure S3 bucket be sure to set the following in your spark-defaults.conf spark.hadoop.fs.s3a.access.key
, spark.hadoop.fs.s3a.secret.key
or any of the methods outlined in the aws-sdk
documentation Working with AWS credentials
In order to work with the newer s3a://
protocol also set the values for spark.hadoop.fs.s3a.impl
and spark.hadoop.fs.s3a.endpoint
.
In addition, to support v4 of the S3 api be sure to pass the -Dcom.amazonaws.services.s3.enableV4
driver options
for the config key spark.driver.extraJavaOptions
For instructions on how to configure s3n://
check the hadoop documentation:
s3n authentication properties
When header
is FALSE
, the column names are generated with a V
prefix; e.g. V1, V2, ...
.
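As a minimal sketch, assuming a local Spark installation and the flights CSV files downloaded earlier in this document (the path is a placeholder):
library(sparklyr)
sc <- spark_connect(master = "local")
# Read one year of the flights data from the local file system
flights_2008 <- spark_read_csv(
  sc,
  name         = "flights_2008",
  path         = "file:///tmp/flights/2008.csv",
  header       = TRUE,
  infer_schema = TRUE,
  delimiter    = ","
)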
See also
Other Spark serialization routines: spark_load_table
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_json
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read from Delta Lake into a Spark DataFrame.
Arguments
See also
Read from Delta Lake into a Spark DataFrame.
spark_read_delta(sc, path, name = NULL, version = NULL,
timestamp = NULL, options = list(), repartition = 0,
memory = TRUE, overwrite = TRUE, ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated table. |
version |
The version of the delta table to read. |
timestamp |
The timestamp of the delta table to read.
For example, "2019-01-01" or "2019-01-01'T'00:00:00.000Z" . |
options |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_jdbc
,
spark_read_json
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read from JDBC connection into a Spark DataFrame.
Arguments
See also
Read from JDBC connection into a Spark DataFrame.
spark_read_jdbc(sc, name, options = list(), repartition = 0,
memory = TRUE, overwrite = TRUE, columns = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_delta
,
spark_read_json
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read a JSON file into a Spark DataFrame
Arguments
Details
See also
Read a table serialized in the JavaScript
Object Notation format into a Spark DataFrame.
spark_read_json(sc, name = NULL, path = name, options = list(),
repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL,
...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
options |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://
), S3 (s3a://
), as well as
the local file system (file://
).
If you are reading from a secure S3 bucket be sure to set the following in your spark-defaults.conf spark.hadoop.fs.s3a.access.key
, spark.hadoop.fs.s3a.secret.key
or any of the methods outlined in the aws-sdk
documentation Working with AWS credentials
In order to work with the newer s3a://
protocol also set the values for spark.hadoop.fs.s3a.impl
and spark.hadoop.fs.s3a.endpoint
.
In addition, to support v4 of the S3 api be sure to pass the -Dcom.amazonaws.services.s3.enableV4
driver options
for the config key spark.driver.extraJavaOptions
For instructions on how to configure s3n://
check the hadoop documentation:
s3n authentication properties
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_libsvm
,
spark_read_orc
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read an ORC file into a Spark DataFrame
Arguments
Details
See also
Read an ORC file into a Spark
DataFrame.
spark_read_orc(sc, name = NULL, path = name, options = list(),
repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL,
schema = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
schema |
A (java) read schema.
Useful for optimizing read operation on nested data. |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://
), S3 (s3a://
), as well as
the local file system (file://
).
See also
Other Spark serialization routines: spark_load_table
,
spark_read_csv
,
spark_read_delta
,
spark_read_jdbc
,
spark_read_json
,
spark_read_libsvm
,
spark_read_parquet
,
spark_read_source
,
spark_read_table
,
spark_read_text
,
spark_save_table
,
spark_write_csv
,
spark_write_delta
,
spark_write_jdbc
,
spark_write_json
,
spark_write_orc
,
spark_write_parquet
,
spark_write_source
,
spark_write_table
,
spark_write_text
Read a Parquet file into a Spark DataFrame
Arguments
Details
See also
Read a Parquet file into a Spark
DataFrame.
spark_read_parquet(sc, name = NULL, path = name, options = list(),
repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL,
schema = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
schema |
A (java) read schema.
Useful for optimizing read operation on nested data. |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://).
If you are reading from a secure S3 bucket, be sure to set the following in your spark-defaults.conf: spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key, or use any of the methods outlined in the aws-sdk documentation Working with AWS credentials.
In order to work with the newer s3a:// protocol, also set the values for spark.hadoop.fs.s3a.impl and spark.hadoop.fs.s3a.endpoint.
In addition, to support v4 of the S3 API, be sure to pass the -Dcom.amazonaws.services.s3.enableV4 driver option for the config key spark.driver.extraJavaOptions.
For instructions on how to configure s3n://, check the Hadoop documentation: s3n authentication properties.
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Read from a generic source into a Spark DataFrame.
Arguments
See also
Read from a generic source into a Spark DataFrame.
spark_read_source(sc, name = NULL, path = name, source,
options = list(), repartition = 0, memory = TRUE,
overwrite = TRUE, columns = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
source |
A data source capable of reading data. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
columns |
A vector of column names or a named vector of column types. |
... |
Optional arguments; currently unused. |
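A minimal sketch of reading through a named data source. The "csv" source name and its header option follow the standard Spark data source options, and the path is a placeholder:
library(sparklyr)
sc <- spark_connect(master = "local")
# Use the built-in "csv" data source with a source-specific option
flights_tbl <- spark_read_source(
  sc,
  name = "flights",
  path = "hdfs://path/to/flights.csv",   # placeholder path
  source = "csv",
  options = list(header = "true")
)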
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Reads from a Spark Table into a Spark DataFrame.
Arguments
See also
Reads from a Spark Table into a Spark DataFrame.
spark_read_table(sc, name, options = list(), repartition = 0,
memory = TRUE, columns = NULL, ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
columns |
A vector of column names or a named vector of column types. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Read a Text file into a Spark DataFrame
Arguments
Details
See also
Read a text file into a Spark DataFrame.
spark_read_text(sc, name = NULL, path = name, repartition = 0,
memory = TRUE, overwrite = TRUE, options = list(), whole = FALSE,
...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated table. |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
repartition |
The number of partitions used to distribute the
generated table.
Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That
is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it
already exists? |
options |
A list of strings with additional options. |
whole |
Read the entire text file as a single entry? Defaults to FALSE . |
... |
Optional arguments; currently unused. |
Details
You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://).
If you are reading from a secure S3 bucket, be sure to set the following in your spark-defaults.conf: spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key, or use any of the methods outlined in the aws-sdk documentation Working with AWS credentials.
In order to work with the newer s3a:// protocol, also set the values for spark.hadoop.fs.s3a.impl and spark.hadoop.fs.s3a.endpoint.
In addition, to support v4 of the S3 API, be sure to pass the -Dcom.amazonaws.services.s3.enableV4 driver option for the config key spark.driver.extraJavaOptions.
For instructions on how to configure s3n://, check the Hadoop documentation: s3n authentication properties.
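A minimal sketch, assuming a local connection and a placeholder local file path; the second call shows the whole = TRUE variant that loads the entire file as a single entry:
library(sparklyr)
sc <- spark_connect(master = "local")
# One row per line of text (path is a placeholder)
readme_lines <- spark_read_text(sc, name = "readme", path = "file:///tmp/README.md")
# The entire file as a single entry
readme_whole <- spark_read_text(sc, name = "readme_whole",
                                path = "file:///tmp/README.md", whole = TRUE)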
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Write a Spark DataFrame to a CSV
Arguments
See also
Write a Spark DataFrame to a tabular (typically, comma-separated) file.
spark_write_csv(x, path, header = TRUE, delimiter = ",",
quote = "\"", escape = "\\", charset = "UTF-8",
null_value = NULL, options = list(), mode = NULL,
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
header |
Should the first row of data be used as a header? Defaults to TRUE . |
delimiter |
The character used to delimit each column, defaults to , . |
quote |
The character used as a quote.
Defaults to '"'. |
escape |
The character used to escape other characters, defaults to \ . |
charset |
The character set, defaults to "UTF-8" . |
null_value |
The character to use for default values, defaults to NULL . |
options |
A list of strings with additional options. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
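A minimal sketch, assuming iris has been copied into Spark; the output path is a placeholder:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Write semicolon-delimited CSV files, replacing any existing output at the path
spark_write_csv(iris_tbl, path = "file:///tmp/iris_csv",
                delimiter = ";", mode = "overwrite")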
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Writes a Spark DataFrame into Delta Lake
Arguments
See also
Writes a Spark DataFrame into Delta Lake.
spark_write_delta(x, path, mode = NULL, options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Writes a Spark DataFrame into a JDBC table
Arguments
See also
Writes a Spark DataFrame into a JDBC table.
spark_write_jdbc(x, name, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated table. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
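A minimal sketch; the option names follow the Spark JDBC data source (url, driver, user, password), and the connection string, driver class, and credentials below are placeholders. The JDBC driver jar must also be available on the Spark classpath.
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
spark_write_jdbc(
  iris_tbl,
  name = "iris_table",                           # destination table name
  mode = "overwrite",
  options = list(
    url = "jdbc:postgresql://dbhost:5432/mydb",  # placeholder connection string
    driver = "org.postgresql.Driver",            # driver jar must be on the classpath
    user = "me",
    password = "secret"
  )
)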
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Write a Spark DataFrame to a JSON file
Arguments
See also
Serialize a Spark DataFrame to the JavaScript
Object Notation format.
spark_write_json(x, path, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Write a Spark DataFrame to an ORC file
Arguments
See also
Serialize a Spark DataFrame to the
ORC format.
spark_write_orc(x, path, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
Write a Spark DataFrame to a Parquet file
Arguments
See also
Serialize a Spark DataFrame to the
Parquet format.
spark_write_parquet(x, path, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options.
See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
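A minimal sketch showing partition_by, assuming iris has been copied into Spark; the output path is a placeholder:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# One sub-directory per Species value under the output path
spark_write_parquet(iris_tbl, path = "file:///tmp/iris_parquet",
                    mode = "overwrite", partition_by = "Species")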
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_source, spark_write_table, spark_write_text
Writes a Spark DataFrame into a generic source
Arguments
See also
Writes a Spark DataFrame into a generic source.
spark_write_source(x, source, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
source |
A data source capable of reading data. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_table, spark_write_text
Writes a Spark DataFrame into a Spark table
Arguments
See also
Writes a Spark DataFrame into a Spark table.
spark_write_table(x, name, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated table. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
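A minimal sketch, assuming iris has been copied into Spark; the destination table name is a placeholder, and on a cluster with a Hive metastore the table is persisted there:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Persist the DataFrame as a table, appending if it already exists
spark_write_table(iris_tbl, name = "iris_persisted", mode = "append")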
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_text
Write a Spark DataFrame to a Text file
Arguments
See also
Serialize a Spark DataFrame to the plain text format.
spark_write_text(x, path, mode = NULL, options = list(),
partition_by = NULL, ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
A character element.
Specifies the behavior when data or
table already exists.
Supported values include: 'error', 'append', 'overwrite' and 'ignore'.
Notice that 'overwrite' will also change the column structure.
For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A character vector.
Partitions the output by the given columns on the file system. |
... |
Optional arguments; currently unused. |
See also
Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table
Save / Load a Spark DataFrame
Arguments
Routines for saving and loading Spark DataFrames.
sdf_save_table(x, name, overwrite = FALSE, append = FALSE)
sdf_load_table(sc, name)
sdf_save_parquet(x, path, overwrite = FALSE, append = FALSE)
sdf_load_parquet(sc, path)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
name |
The table name to assign to the saved Spark DataFrame. |
overwrite |
Boolean; overwrite a pre-existing table of the same name? |
append |
Boolean; append to a pre-existing table of the same name? |
sc |
A spark_connection object. |
path |
The path where the Spark DataFrame should be saved. |
Spark ML -- Transform, fit, and predict methods (sdf_ interface)
Arguments
Value
Deprecated methods for transformation, fit, and prediction.
These are mirrors of the corresponding ml-transform-methods.
sdf_predict(x, model, ...)
sdf_transform(x, transformer, ...)
sdf_fit(x, estimator, ...)
sdf_fit_and_transform(x, estimator, ...)
Arguments
x |
A tbl_spark . |
model |
A ml_transformer or a ml_model object. |
... |
Optional arguments passed to the corresponding ml_ methods. |
transformer |
A ml_transformer object. |
estimator |
A ml_estimator object. |
Value
sdf_predict(), sdf_transform(), and sdf_fit_and_transform() return a transformed dataframe, whereas sdf_fit() returns a ml_transformer.
Create DataFrame for along Object
Arguments
Creates a DataFrame along the given object.
sdf_along(sc, along, repartition = NULL, type = c("integer",
"integer64"))
Arguments
sc |
The associated Spark connection. |
along |
Takes the length from the length of this argument. |
repartition |
The number of partitions to use when distributing the
data across the Spark cluster. |
type |
The data type to use for the index, either "integer" or "integer64" . |
Bind multiple Spark DataFrames by row and column
Arguments
Value
Details
sdf_bind_rows() and sdf_bind_cols() are implementations of the common pattern of do.call(rbind, sdfs) or do.call(cbind, sdfs) for binding many Spark DataFrames into one.
sdf_bind_rows(..., id = NULL)
sdf_bind_cols(...)
Arguments
... |
Spark tbls to combine.
Each argument can either be a Spark DataFrame or a list of
Spark DataFrames
When row-binding, columns are matched by name, and any missing columns will be filled with NA.
When column-binding, rows are matched by position, so all data
frames must have the same number of rows. |
id |
Data frame identifier.
When id is supplied, a new column of identifiers is
created to link each row to its original Spark DataFrame.
The labels
are taken from the named arguments to sdf_bind_rows() .
When a
list of Spark DataFrames is supplied, the labels are taken from the
names of the list.
If no names are found a numeric sequence is
used instead. |
Value
sdf_bind_rows()
and sdf_bind_cols()
return tbl_spark
Details
The output of sdf_bind_rows()
will contain a column if that column
appears in any of the inputs.
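A minimal sketch of the id behaviour described above, using two small local data frames copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
df1 <- sdf_copy_to(sc, data.frame(x = 1:3), name = "df1", overwrite = TRUE)
df2 <- sdf_copy_to(sc, data.frame(x = 4:6, y = 7:9), name = "df2", overwrite = TRUE)
# Row-bind; the column y missing from df1 is filled with NA, and the "source"
# column records which input each row came from ("a" or "b")
sdf_bind_rows(a = df1, b = df2, id = "source")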
Broadcast hint
Arguments
Used to force broadcast hash joins.
sdf_broadcast(x)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
Checkpoint a Spark DataFrame
Arguments
Checkpoint a Spark DataFrame
sdf_checkpoint(x, eager = TRUE)
Arguments
x |
an object coercible to a Spark DataFrame |
eager |
whether to truncate the lineage of the DataFrame |
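A minimal sketch; the assumption here is that a checkpoint directory is set on the connection first via spark_set_checkpoint_dir(), and the directory path is a placeholder:
library(sparklyr)
sc <- spark_connect(master = "local")
spark_set_checkpoint_dir(sc, "file:///tmp/spark-checkpoints")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Eagerly checkpoint, truncating the DataFrame's lineage
iris_ckpt <- sdf_checkpoint(iris_tbl, eager = TRUE)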
Coalesces a Spark DataFrame
Arguments
Coalesces a Spark DataFrame
sdf_coalesce(x, partitions)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
partitions |
number of partitions |
Collect a Spark DataFrame into R.
Arguments
Collects a Spark dataframe into R.
sdf_collect(object, ...)
Arguments
object |
Spark dataframe to collect |
... |
Additional options. |
Copy an Object into Spark
Arguments
Advanced Usage
See also
Examples
Copy an object into Spark, and return an R object wrapping the
copied object (typically, a Spark DataFrame).
sdf_copy_to(sc, x, name, memory, repartition, overwrite, ...)
sdf_import(x, sc, name, memory, repartition, overwrite, ...)
Arguments
sc |
The associated Spark connection. |
x |
An R object from which a Spark DataFrame can be generated. |
name |
The name to assign to the copied table in Spark. |
memory |
Boolean; should the table be cached into memory? |
repartition |
The number of partitions to use when distributing the
table across the Spark cluster.
The default (0) can be used to avoid
partitioning. |
overwrite |
Boolean; overwrite a pre-existing table with the name name
if one already exists? |
... |
Optional arguments, passed to implementing methods. |
Advanced Usage
sdf_copy_to
is an S3 generic that, by default, dispatches to sdf_import
.
Package authors that would like to implement sdf_copy_to
for a custom object type can accomplish this by
implementing the associated method on sdf_import
.
See also
Other Spark data frames: sdf_random_split, sdf_register, sdf_sample, sdf_sort
Examples
sc <- spark_connect(master = "spark://HOST:PORT")
sdf_copy_to(sc, iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> ... (output truncated; 150 rows total)
Cross Tabulation
Arguments
Value
Builds a contingency table at each combination of factor levels.
sdf_crosstab(x, col1, col2)
Arguments
x |
A Spark DataFrame |
col1 |
The name of the first column.
Distinct items will make the first item of each row. |
col2 |
The name of the second column.
Distinct items will make the column names of the DataFrame. |
Value
A DataFrame containing the contingency table.
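A minimal sketch, assuming mtcars has been copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
# Counts of each cyl/gear combination; cyl values become rows, gear values become columns
sdf_crosstab(mtcars_tbl, "cyl", "gear")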
Debug Info for Spark DataFrame
Arguments
Prints the plan of execution used to generate x.
This plan will, among other things, show the number of partitions in parentheses at the far left and indicate stages using indentation.
sdf_debug_string(x, print = TRUE)
Arguments
x |
An R object wrapping, or containing, a Spark DataFrame. |
print |
Print debug information? |
Compute summary statistics for columns of a data frame
Arguments
Compute summary statistics for columns of a data frame
sdf_describe(x, cols = colnames(x))
Arguments
x |
An object coercible to a Spark DataFrame |
cols |
Columns to compute statistics for, given as a character vector |
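A minimal sketch, assuming iris has been copied into Spark (sdf_copy_to replaces dots in column names with underscores, hence Sepal_Length):
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Summary statistics (count, mean, stddev, min, max) for the selected columns
sdf_describe(iris_tbl, cols = c("Sepal_Length", "Petal_Length"))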
Support for Dimension Operations
Arguments
sdf_dim(), sdf_nrow() and sdf_ncol() provide similar functionality to dim(), nrow() and ncol().
sdf_dim(x)
sdf_nrow(x)
sdf_ncol(x)
Arguments
x |
An object (usually a spark_tbl ). |
Spark DataFrame is Streaming
Arguments
Is the given Spark DataFrame streaming data?
sdf_is_streaming(x)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
Returns the last index of a Spark DataFrame
Arguments
Returns the last index of a Spark DataFrame.
The Spark mapPartitionsWithIndex
function is used to iterate
through the last nonempty partition of the RDD to find the last record.
sdf_last_index(x, id = "id")
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
id |
The name of the index column. |
Create DataFrame for Length
Arguments
Creates a DataFrame for the given length.
sdf_len(sc, length, repartition = NULL, type = c("integer",
"integer64"))
Arguments
sc |
The associated Spark connection. |
length |
The desired length of the sequence. |
repartition |
The number of partitions to use when distributing the
data across the Spark cluster. |
type |
The data type to use for the index, either "integer" or "integer64" . |
Gets number of partitions of a Spark DataFrame
Arguments
Gets number of partitions of a Spark DataFrame
sdf_num_partitions(x)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
Persist a Spark DataFrame
Arguments
Details
Persist a Spark DataFrame, forcing any pending computations and (optionally)
serializing the results to disk.
sdf_persist(x, storage.level = "MEMORY_AND_DISK")
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
storage.level |
The storage level to be used.
Please view the
Spark Documentation
for information on what storage levels are accepted. |
Details
Spark DataFrames invoke their operations lazily -- pending operations are
deferred until their results are actually needed.
Persisting a Spark
DataFrame effectively 'forces' any pending computations, and then persists
the generated Spark DataFrame as requested (to memory, to disk, or
otherwise).
Users of Spark should be careful to persist the results of any computations
which are non-deterministic -- otherwise, one might see that the values
within a column seem to 'change' as new operations are performed on that
data set.
Pivot a Spark DataFrame
Arguments
Examples
Construct a pivot table over a Spark Dataframe, using a syntax similar to
that from reshape2::dcast
.
sdf_pivot(x, formula, fun.aggregate = "count")
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
A two-sided R formula of the form x_1 + x_2 + ...
~ y_1 .
The left-hand side of the formula indicates which variables are used for grouping,
and the right-hand side indicates which variable is used for pivoting.
Currently,
only a single pivot column is supported. |
fun.aggregate |
How should the grouped dataset be aggregated? Can be
a length-one character vector, giving the name of a Spark aggregation function
to be called; a named R list mapping column names to an aggregation method,
or an R function that is invoked on the grouped dataset. |
Examples
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# aggregating by mean
iris_tbl %>%
mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low" )) %>%
sdf_pivot(Petal_Width ~ Species,
fun.aggregate = list(Petal_Length = "mean"))
# aggregating all observations in a list
iris_tbl %>%
mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low" )) %>%
sdf_pivot(Petal_Width ~ Species,
fun.aggregate = list(Petal_Length = "collect_list"))
}
Project features onto principal components
Arguments
Transforming Spark DataFrames
Project features onto principal components
sdf_project(object, newdata, features = dimnames(object$pc)[[1]],
feature_prefix = NULL, ...)
Arguments
object |
A Spark PCA model object |
newdata |
An object coercible to a Spark DataFrame |
features |
A vector of names of columns to be projected |
feature_prefix |
The prefix used in naming the output features |
... |
Optional arguments; currently unused. |
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
Compute (Approximate) Quantiles with a Spark DataFrame
Arguments
Given a numeric column within a Spark DataFrame, compute
approximate quantiles (to some relative error).
sdf_quantile(x, column, probabilities = c(0, 0.25, 0.5, 0.75, 1),
relative.error = 1e-05)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
column |
The column for which quantiles should be computed. |
probabilities |
A numeric vector of probabilities, for
which quantiles should be computed. |
relative.error |
The relative error -- lower values imply more
precision in the computed quantiles. |
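A minimal sketch, assuming mtcars has been copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
# Approximate quartiles of the mpg column
sdf_quantile(mtcars_tbl, column = "mpg", probabilities = c(0.25, 0.5, 0.75))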
Partition a Spark Dataframe
Arguments
Value
Details
Transforming Spark DataFrames
See also
Examples
Partition a Spark DataFrame into multiple groups.
This routine is useful
for splitting a DataFrame into, for example, training and test datasets.
sdf_random_split(x, ..., weights = NULL,
seed = sample(.Machine$integer.max, 1))
sdf_partition(x, ..., weights = NULL,
seed = sample(.Machine$integer.max, 1))
Arguments
x |
An object coercible to a Spark DataFrame. |
... |
Named parameters, mapping table names to weights.
The weights
will be normalized such that they sum to 1. |
weights |
An alternate mechanism for supplying weights -- when
specified, this takes precedence over the ... arguments. |
seed |
Random seed to use for randomly partitioning the dataset.
Set
this if you want your partitioning to be reproducible on repeated runs. |
Value
An R list of tbl_sparks.
Details
The sampling weights define the probability that a particular observation
will be assigned to a particular partition, not the resulting size of the
partition.
This implies that partitioning a DataFrame with, for example,
sdf_random_split(x, training = 0.5, test = 0.5)
is not guaranteed to produce training
and test
partitions
of equal size.
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
See also
Other Spark data frames: sdf_copy_to, sdf_register, sdf_sample, sdf_sort
Examples
if (FALSE) {
# randomly partition data into a 'training' and 'test'
# dataset, with 60% of the observations assigned to the
# 'training' dataset, and 40% assigned to the 'test' dataset
data(diamonds, package = "ggplot2")
diamonds_tbl <- copy_to(sc, diamonds, "diamonds")
partitions <- diamonds_tbl %>%
sdf_random_split(training = 0.6, test = 0.4)
print(partitions)
# alternate way of specifying weights
weights <- c(training = 0.6, test = 0.4)
diamonds_tbl %>% sdf_random_split(weights = weights)
}
Read a Column from a Spark DataFrame
Arguments
Details
Read a single column from a Spark DataFrame, and return
the contents of that column back to R.
sdf_read_column(x, column)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
column |
The name of a column within x . |
Details
This operation is expected to preserve row order.
Register a Spark DataFrame
Arguments
Transforming Spark DataFrames
See also
Registers a Spark DataFrame (giving it a table name for the
Spark SQL context), and returns a tbl_spark
.
sdf_register(x, name = NULL)
Arguments
x |
A Spark DataFrame. |
name |
A name to assign this table. |
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
See also
Other Spark data frames: sdf_copy_to, sdf_random_split, sdf_sample, sdf_sort
Repartition a Spark DataFrame
Arguments
Repartition a Spark DataFrame
sdf_repartition(x, partitions = NULL, partition_by = NULL)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
partitions |
number of partitions |
partition_by |
vector of column names used for partitioning, only supported for Spark 2.0+ |
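A minimal sketch, assuming iris has been copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Repartition into 4 partitions, partitioning rows by Species (Spark 2.0+)
iris_repart <- sdf_repartition(iris_tbl, partitions = 4, partition_by = "Species")
sdf_num_partitions(iris_repart)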
Model Residuals
Arguments
This generic method returns a Spark DataFrame with model
residuals added as a column to the model training data.
# S3 method for ml_model_generalized_linear_regression
sdf_residuals(object,
type = c("deviance", "pearson", "working", "response"), ...)
# S3 method for ml_model_linear_regression
sdf_residuals(object, ...)
sdf_residuals(object, ...)
Arguments
object |
Spark ML model object. |
type |
type of residuals which should be returned. |
... |
additional arguments |
Randomly Sample Rows from a Spark DataFrame
Arguments
Transforming Spark DataFrames
See also
Draw a random sample of rows (with or without replacement)
from a Spark DataFrame.
sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
Arguments
x |
An object coercible to a Spark DataFrame. |
fraction |
The fraction to sample. |
replacement |
Boolean; sample with replacement? |
seed |
An (optional) integer seed. |
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
See also
Other Spark data frames: sdf_copy_to, sdf_random_split, sdf_register, sdf_sort
Read the Schema of a Spark DataFrame
Arguments
Value
Details
Read the schema of a Spark DataFrame.
sdf_schema(x)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
Value
An R list, with each list element describing the name and type of a column.
Details
The type
column returned gives the string representation of the
underlying Spark type for that column; for example, a vector of numeric
values would be returned with the type "DoubleType"
.
Please see the
Spark Scala API Documentation
for information on what types are available and exposed by Spark.
Separate a Vector Column into Scalar Columns
Arguments
Given a vector column in a Spark DataFrame, split that
into n
separate columns, each column made up of
the different elements in the column column
.
sdf_separate_column(x, column, into = NULL)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
column |
The name of a (vector-typed) column. |
into |
A specification of the columns that should be
generated from column .
This can either be a
vector of column names, or an R list mapping column
names to the (1-based) index at which a particular
vector element should be extracted. |
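A minimal sketch; the vector column here is produced with ft_vector_assembler() purely for illustration, and the output column names are arbitrary:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Assemble two numeric columns into a single vector column, then split it back out
assembled_tbl <- ft_vector_assembler(iris_tbl,
                                     input_cols = c("Sepal_Length", "Petal_Length"),
                                     output_col = "features")
sdf_separate_column(assembled_tbl, column = "features",
                    into = c("sepal_length_out", "petal_length_out"))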
Create DataFrame for Range
Arguments
Creates a DataFrame for the given range
sdf_seq(sc, from = 1L, to = 1L, by = 1L, repartition = type,
type = c("integer", "integer64"))
Arguments
sc |
The associated Spark connection. |
from, to |
The start and end to use as a range |
by |
The increment of the sequence. |
repartition |
The number of partitions to use when distributing the
data across the Spark cluster. |
type |
The data type to use for the index, either "integer" or "integer64" . |
Sort a Spark DataFrame
Arguments
Transforming Spark DataFrames
See also
Sort a Spark DataFrame by one or more columns, with each column
sorted in ascending order.
sdf_sort(x, columns)
Arguments
x |
An object coercible to a Spark DataFrame. |
columns |
The column(s) to sort by. |
The family of functions prefixed with sdf_
generally access the Scala
Spark DataFrame API directly, as opposed to the dplyr
interface which
uses Spark SQL.
These functions will 'force' any pending SQL in a dplyr
pipeline, such that the resulting tbl_spark
object
returned will no longer have the attached 'lazy' SQL operations.
Note that
the underlying Spark DataFrame does execute its operations lazily, so
that even though the pending set of operations (currently) are not exposed at
the R level, these operations will only be executed when you explicitly collect()
the table.
See also
Other Spark data frames: sdf_copy_to, sdf_random_split, sdf_register, sdf_sample
Spark DataFrame from SQL
Arguments
Defines a Spark DataFrame from a SQL query, useful to create Spark DataFrames
without collecting the results immediately.
sdf_sql(sc, sql)
Arguments
sc |
A spark_connection . |
sql |
a 'SQL' query used to generate a Spark DataFrame. |
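A minimal sketch, assuming iris has been copied into Spark under the table name "iris_tbl" so it is visible to Spark SQL:
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Define a DataFrame from SQL without collecting the result to R
species_counts <- sdf_sql(sc, "SELECT Species, COUNT(*) AS n FROM iris_tbl GROUP BY Species")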
Add a Sequential ID Column to a Spark DataFrame
Arguments
Add a sequential ID column to a Spark DataFrame.
The Spark zipWithIndex
function is used to produce these.
This differs from sdf_with_unique_id
in that the IDs generated are independent of
partitioning.
sdf_with_sequential_id(x, id = "id", from = 1L)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
id |
The name of the column to host the generated IDs. |
from |
The starting value of the id column |
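A minimal sketch, assuming iris has been copied into Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Add a sequential ID column named "row_id", starting at 1
iris_ids <- sdf_with_sequential_id(iris_tbl, id = "row_id", from = 1L)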
Add a Unique ID Column to a Spark DataFrame
Arguments
Add a unique ID column to a Spark DataFrame.
The Spark monotonicallyIncreasingId
function is used to produce these and is
guaranteed to produce unique, monotonically increasing ids; however, there
is no guarantee that these IDs will be sequential.
The table is persisted
immediately after the column is generated, to ensure that the column is
stable -- otherwise, it can differ across new computations.
sdf_with_unique_id(x, id = "id")
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
id |
The name of the column to host the generated IDs. |
Spark ML -- Decision Trees
Arguments
Value
Details
See also
Examples
Perform classification and regression using decision trees.
ml_decision_tree_classifier(x, formula = NULL, max_depth = 5,
max_bins = 32, min_instances_per_node = 1, min_info_gain = 0,
impurity = "gini", seed = NULL, thresholds = NULL,
cache_node_ids = FALSE, checkpoint_interval = 10,
max_memory_in_mb = 256, features_col = "features",
label_col = "label", prediction_col = "prediction",
probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("decision_tree_classifier_"), ...)
ml_decision_tree(x, formula = NULL, type = c("auto", "regression",
"classification"), features_col = "features", label_col = "label",
prediction_col = "prediction", variance_col = NULL,
probability_col = "probability",
raw_prediction_col = "rawPrediction", checkpoint_interval = 10L,
impurity = "auto", max_bins = 32L, max_depth = 5L,
min_info_gain = 0, min_instances_per_node = 1L, seed = NULL,
thresholds = NULL, cache_node_ids = FALSE, max_memory_in_mb = 256L,
uid = random_string("decision_tree_"), response = NULL,
features = NULL, ...)
ml_decision_tree_regressor(x, formula = NULL, max_depth = 5,
max_bins = 32, min_instances_per_node = 1, min_info_gain = 0,
impurity = "variance", seed = NULL, cache_node_ids = FALSE,
checkpoint_interval = 10, max_memory_in_mb = 256,
variance_col = NULL, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("decision_tree_regressor_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
max_depth |
Maximum depth of the tree (>= 0); that is, the maximum
number of nodes separating any leaves from the root of the tree. |
max_bins |
The maximum number of bins used for discretizing
continuous features and for choosing how to split on features at
each node.
More bins give higher granularity. |
min_instances_per_node |
Minimum number of instances each child must
have after split. |
min_info_gain |
Minimum information gain for a split to be considered
at a tree node.
Should be >= 0, defaults to 0. |
impurity |
Criterion used for information gain calculation.
Supported: "entropy"
and "gini" (default) for classification and "variance" (default) for regression.
For ml_decision_tree , setting "auto" will default to the appropriate
criterion based on model type. |
seed |
Seed for random numbers. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
cache_node_ids |
If FALSE , the algorithm will pass trees to executors to match instances with nodes.
If TRUE , the algorithm will cache node IDs for each instance.
Caching can speed up training of deeper trees.
Defaults to FALSE . |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g.
10 means that the cache will get checkpointed every 10 iterations, defaults to 10. |
max_memory_in_mb |
Maximum memory in MB allocated to histogram aggregation.
If too small, then 1 node will be split per iteration,
and its aggregates may exceed this size.
Defaults to 256. |
features_col |
Features column name, as a length-one character vector.
The column should be single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a.
confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
type |
The type of model to fit. "regression" treats the response
as a continuous variable, while "classification" treats the response
as a categorical variable.
When "auto" is used, the model type is
inferred based on the response variable type -- if it is a numeric type,
then regression is used; classification otherwise. |
variance_col |
(Optional) Column name for the biased sample variance of prediction. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor.
The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.
ml_decision_tree
is a wrapper around ml_decision_tree_regressor.tbl_spark
and ml_decision_tree_classifier.tbl_spark
and calls the appropriate method based on model type.
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
dt_model <- iris_training %>%
ml_decision_tree(Species ~ .)
pred <- ml_predict(dt_model, iris_test)
ml_multiclass_classification_evaluator(pred)
}
Spark ML -- Generalized Linear Regression
Arguments
Value
Details
See also
Examples
Perform regression using a Generalized Linear Model (GLM).
ml_generalized_linear_regression(x, formula = NULL,
family = "gaussian", link = NULL, fit_intercept = TRUE,
offset_col = NULL, link_power = NULL, link_prediction_col = NULL,
reg_param = 0, max_iter = 25, weight_col = NULL, solver = "irls",
tol = 1e-06, variance_power = 0, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("generalized_linear_regression_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
family |
Name of family which is a description of the error distribution to be used in the model.
Supported options: "gaussian", "binomial", "poisson", "gamma" and "tweedie".
Default is "gaussian". |
link |
Name of link function which provides the relationship between the linear predictor and the mean of the distribution function.
See Details for supported link functions. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
offset_col |
Offset column name.
If this is not set, we treat all instance offsets as 0.0.
The feature specified as offset has a constant coefficient of 1.0. |
link_power |
Index in the power link function.
Only applicable to the Tweedie family.
Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt link, respectively.
When not set, this value defaults to 1 - variancePower, which matches the R "statmod" package. |
link_prediction_col |
Link prediction (linear predictor) column name.
Default is not set, which means we do not output link prediction. |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
weight_col |
The name of the column to use as weights for the model fit. |
solver |
Solver algorithm for optimization. |
tol |
Param for the convergence tolerance for iterative algorithms. |
variance_power |
Power in the variance function of the Tweedie distribution which provides the relationship between the variance and mean of the distribution.
Only applicable to the Tweedie family.
(see Tweedie Distribution (Wikipedia)) Supported values: 0 and [1, Inf).
Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma family, respectively. |
features_col |
Features column name, as a length-one character vector.
The column should be single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor.
The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.
The valid link functions for each family are listed below.
The first link function of each family is the default one.
gaussian: "identity", "log", "inverse"
binomial: "logit", "probit", "cloglog"
poisson: "log", "identity", "sqrt"
gamma: "inverse", "identity", "log"
tweedie: power link function specified through link_power
.
The default link power in the tweedie family is 1 - variance_power
.
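As a short illustration of the Tweedie settings above, here is a minimal sketch (not part of the original reference; it assumes the mtcars_tbl table created in the Examples section below).
# Minimal sketch: Tweedie GLM on mtcars_tbl (see the Examples section below for setup).
# With variance_power = 1.2 the default link_power would be 1 - 1.2 = -0.2;
# link_power = 0 requests the log link explicitly.
tweedie_model <- mtcars_tbl %>%
  ml_generalized_linear_regression(
    mpg ~ wt + hp,
    family = "tweedie",
    variance_power = 1.2,
    link_power = 0
  )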
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
partitions <- mtcars_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
mtcars_training <- partitions$training
mtcars_test <- partitions$test
# Specify the grid
family <- c("gaussian", "gamma", "poisson")
link <- c("identity", "log")
family_link <- expand.grid(family = family, link = link, stringsAsFactors = FALSE)
family_link <- data.frame(family_link, rmse = 0)
# Train the models
for (i in 1:nrow(family_link)) {
glm_model <- mtcars_training %>%
ml_generalized_linear_regression(mpg ~ .,
family = family_link[i, 1],
link = family_link[i, 2]
)
pred <- ml_predict(glm_model, mtcars_test)
family_link[i, 3] <- ml_regression_evaluator(pred, label_col = "mpg")
}
family_link
}
Spark ML -- Gradient Boosted Trees
Arguments
Value
Details
See also
Examples
Perform binary classification and regression using gradient boosted trees.
Multiclass classification is not supported yet.
ml_gbt_classifier(x, formula = NULL, max_iter = 20, max_depth = 5,
step_size = 0.1, subsampling_rate = 1,
feature_subset_strategy = "auto", min_instances_per_node = 1L,
max_bins = 32, min_info_gain = 0, loss_type = "logistic",
seed = NULL, thresholds = NULL, checkpoint_interval = 10,
cache_node_ids = FALSE, max_memory_in_mb = 256,
features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("gbt_classifier_"), ...)
ml_gradient_boosted_trees(x, formula = NULL, type = c("auto",
"regression", "classification"), features_col = "features",
label_col = "label", prediction_col = "prediction",
probability_col = "probability",
raw_prediction_col = "rawPrediction", checkpoint_interval = 10,
loss_type = c("auto", "logistic", "squared", "absolute"),
max_bins = 32, max_depth = 5, max_iter = 20L, min_info_gain = 0,
min_instances_per_node = 1, step_size = 0.1, subsampling_rate = 1,
feature_subset_strategy = "auto", seed = NULL, thresholds = NULL,
cache_node_ids = FALSE, max_memory_in_mb = 256,
uid = random_string("gradient_boosted_trees_"), response = NULL,
features = NULL, ...)
ml_gbt_regressor(x, formula = NULL, max_iter = 20, max_depth = 5,
step_size = 0.1, subsampling_rate = 1,
feature_subset_strategy = "auto", min_instances_per_node = 1,
max_bins = 32, min_info_gain = 0, loss_type = "squared",
seed = NULL, checkpoint_interval = 10, cache_node_ids = FALSE,
max_memory_in_mb = 256, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("gbt_regressor_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
max_iter |
Maximum number of iterations. |
max_depth |
Maximum depth of the tree (>= 0); that is, the maximum
number of nodes separating any leaves from the root of the tree. |
step_size |
Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.
(default = 0.1) |
subsampling_rate |
Fraction of the training data used for learning each decision tree, in range (0, 1].
(default = 1.0) |
feature_subset_strategy |
The number of features to consider for splits at each tree node.
See details for options. |
min_instances_per_node |
Minimum number of instances each child must
have after split. |
max_bins |
The maximum number of bins used for discretizing
continuous features and for choosing how to split on features at
each node.
More bins give higher granularity. |
min_info_gain |
Minimum information gain for a split to be considered
at a tree node.
Should be >= 0, defaults to 0. |
loss_type |
Loss function which GBT tries to minimize.
Supported: "squared" (L2) and "absolute" (L1) (default = squared) for regression and "logistic" (default) for classification.
For ml_gradient_boosted_trees , setting "auto"
will default to the appropriate loss type based on model type. |
seed |
Seed for random numbers. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g. 10 means that the cache will get checkpointed every 10 iterations; defaults to 10. |
cache_node_ids |
If FALSE , the algorithm will pass trees to executors to match instances with nodes.
If TRUE , the algorithm will cache node IDs for each instance.
Caching can speed up training of deeper trees.
Defaults to FALSE . |
max_memory_in_mb |
Maximum memory in MB allocated to histogram aggregation.
If too small, then 1 node will be split per iteration,
and its aggregates may exceed this size.
Defaults to 256. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
type |
The type of model to fit. "regression" treats the response
as a continuous variable, while "classification" treats the response
as a categorical variable.
When "auto" is used, the model type is
inferred based on the response variable type -- if it is a numeric type,
then regression is used; classification otherwise. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
The supported options for feature_subset_strategy
are
"auto"
: Choose automatically for task: If num_trees == 1
, set to "all"
.
If num_trees > 1
(forest), set to "sqrt"
for classification and to "onethird"
for regression.
"all"
: use all features
"onethird"
: use 1/3 of the features
"sqrt"
: use sqrt(number of features)
"log2"
: use log2(number of features)
"n"
: when n
is in the range (0, 1.0], use n * number of features.
When n
is in the range (1, number of features), use n
features.
(default = "auto"
)
ml_gradient_boosted_trees
is a wrapper around ml_gbt_regressor.tbl_spark
and ml_gbt_classifier.tbl_spark
and calls the appropriate method based on model type.
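The example below covers the regression case; the following is a minimal, hedged sketch of the classification variant, assuming a mtcars_tbl table prepared as in earlier examples on this page.
# Minimal sketch (assumes mtcars_tbl as created in earlier examples): binary
# classification on the am column, restricting candidate features per split.
gbt_class <- mtcars_tbl %>%
  ml_gbt_classifier(
    am ~ gear + carb + hp,
    max_iter = 10,
    feature_subset_strategy = "sqrt"
  )
pred <- ml_predict(gbt_class, mtcars_tbl)
ml_binary_classification_evaluator(pred)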
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
gbt_model <- iris_training %>%
ml_gradient_boosted_trees(Sepal_Length ~ Petal_Length + Petal_Width)
pred <- ml_predict(gbt_model, iris_test)
ml_regression_evaluator(pred, label_col = "Sepal_Length")
}
Spark ML -- K-Means Clustering
Arguments
Value
See also
Examples
K-means clustering with support for k-means|| initialization proposed by Bahmani et al.
Using `ml_kmeans()` with the formula interface requires Spark 2.0+.
ml_kmeans(x, formula = NULL, k = 2, max_iter = 20, tol = 1e-04,
init_steps = 2, init_mode = "k-means||", seed = NULL,
features_col = "features", prediction_col = "prediction",
uid = random_string("kmeans_"), ...)
ml_compute_cost(model, dataset)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
tol |
Param for the convergence tolerance for iterative algorithms. |
init_steps |
Number of steps for the k-means|| initialization mode.
This is an advanced setting -- the default of 2 is almost always enough.
Must be > 0.
Default: 2. |
init_mode |
Initialization algorithm.
This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012).
Default: k-means||. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
model |
A fitted K-means model returned by ml_kmeans() |
dataset |
Dataset on which to calculate K-means cost |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the clustering estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, an estimator is constructed then
immediately fit with the input tbl_spark
, returning a clustering model.
tbl_spark
, with formula
or features
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the estimator.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
This signature does not apply to ml_lda()
.
ml_compute_cost()
returns the K-means cost (sum of
squared distances of points to their nearest center) for the model
on the given data.
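A minimal sketch of ml_compute_cost(), assuming the iris_tbl table created in the example below.
# Minimal sketch: fit k-means on two petal measurements, then report the
# within-set sum of squared distances on the same data.
kmeans_model <- iris_tbl %>%
  ml_kmeans(~ Petal_Length + Petal_Width, k = 3, seed = 123)
ml_compute_cost(kmeans_model, iris_tbl)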
See also
See http://spark.apache.org/docs/latest/ml-clustering.html for
more information on the set of clustering algorithms.
Other ml clustering algorithms: ml_bisecting_kmeans
,
ml_gaussian_mixture
, ml_lda
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
ml_kmeans(iris_tbl, Species ~ .)
}
Spark ML -- Latent Dirichlet Allocation
Arguments
Value
Details
Parameter details
See also
Examples
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
ml_lda(x, formula = NULL, k = 10, max_iter = 20,
doc_concentration = NULL, topic_concentration = NULL,
subsampling_rate = 0.05, optimizer = "online",
checkpoint_interval = 10, keep_last_checkpoint = TRUE,
learning_decay = 0.51, learning_offset = 1024,
optimize_doc_concentration = TRUE, seed = NULL,
features_col = "features",
topic_distribution_col = "topicDistribution",
uid = random_string("lda_"), ...)
ml_describe_topics(model, max_terms_per_topic = 10)
ml_log_likelihood(model, dataset)
ml_log_perplexity(model, dataset)
ml_topics_matrix(model)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
doc_concentration |
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
See details. |
topic_concentration |
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
subsampling_rate |
(For Online optimizer only) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].
Note that this should be adjusted in sync with max_iter so the entire corpus is used.
Specifically, set both so that maxIterations * miniBatchFraction is greater than or equal to 1. |
optimizer |
Optimizer or inference algorithm used to estimate the LDA model.
Supported: "online" for Online Variational Bayes (default) and "em" for Expectation-Maximization. |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g. 10 means that the cache will get checkpointed every 10 iterations; defaults to 10. |
keep_last_checkpoint |
(Spark 2.0.0+) (For EM optimizer only) If using checkpointing, this indicates whether to keep the last checkpoint.
If FALSE , then the checkpoint will be deleted.
Deleting the checkpoint can cause failures if a data partition is lost, so set this bit with care.
Note that checkpoints will be cleaned up via reference counting, regardless. |
learning_decay |
(For Online optimizer only) Learning rate, set as an exponential decay rate.
This should be between (0.5, 1.0] to guarantee asymptotic convergence.
This is called "kappa" in the Online LDA paper (Hoffman et al., 2010).
Default: 0.51, based on Hoffman et al. |
learning_offset |
(For Online optimizer only) A (positive) learning parameter that downweights early iterations.
Larger values make early iterations count less.
This is called "tau0" in the Online LDA paper (Hoffman et al., 2010). Default: 1024, following Hoffman et al. |
optimize_doc_concentration |
(For Online optimizer only) Indicates whether the doc_concentration (Dirichlet parameter for document-topic distribution) will be optimized during training.
Setting this to true will make the model more expressive and fit the training data better.
Default: FALSE |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
topic_distribution_col |
Output column with estimates of the topic mixture distribution for each document (often called "theta" in the literature).
Returns a vector of zeros for an empty document. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
model |
A fitted LDA model returned by ml_lda() . |
max_terms_per_topic |
Maximum number of terms to collect for each topic.
Default value of 10. |
dataset |
test corpus to use for calculating log likelihood or log perplexity |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the clustering estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, an estimator is constructed then
immediately fit with the input tbl_spark
, returning a clustering model.
tbl_spark
, with formula
or features
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the estimator.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
This signature does not apply to ml_lda()
.
ml_describe_topics
returns a DataFrame with topics and their top-weighted terms.
ml_log_likelihood
calculates a lower bound on the log likelihood of
the entire corpus.
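A minimal sketch of these helpers, assuming the lda_model and lines_tbl_tidy objects built in the Examples section below.
# Minimal sketch (objects come from the Examples section below).
ml_describe_topics(lda_model, max_terms_per_topic = 5)  # top-weighted terms per topic
ml_log_likelihood(lda_model, lines_tbl_tidy)            # lower bound on corpus log likelihood
ml_log_perplexity(lda_model, lines_tbl_tidy)            # lower values indicate a better fit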
Details
For `ml_lda.tbl_spark` with the formula interface, you can specify named arguments in `...` that will
be passed to `ft_regex_tokenizer()`, `ft_stop_words_remover()`, and `ft_count_vectorizer()`.
For example, to increase the
default `min_token_length`, you can use `ml_lda(dataset, ~ text, min_token_length = 4)`.
Terminology for LDA:
"term" = "word": an element of the vocabulary
"token": instance of a term appearing in a document
"topic": multinomial distribution over terms representing some concept
"document": one piece of text, corresponding to one row in the input data
Original LDA paper (journal version): Blei, Ng, and Jordan.
"Latent Dirichlet Allocation." JMLR, 2003.
Input data (features_col
): LDA is given a collection of documents as input data, via the features_col
parameter.
Each document is specified as a Vector of length vocab_size
, where each entry is the count for the corresponding term (word) in the document.
Feature transformers such as ft_tokenizer
and ft_count_vectorizer
can be useful for converting text to word count vectors.
Parameter details
doc_concentration
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
If not set by the user, then doc_concentration
is set automatically.
If set to a singleton vector [alpha], then alpha is replicated to a vector of length k in fitting.
Otherwise, the doc_concentration
vector must be length k.
(default = automatic)
Optimizer-specific parameter settings:
EM: Currently only supports symmetric distributions, so all values in the vector should be the same.
Values should be greater than 1.0.
Default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Online: Values should be greater than or equal to 0.
Default = uniformly (1.0 / k), following the implementation from here.
topic_concentration
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
If not set by the user, then topic_concentration
is set automatically.
(default = automatic)
Optimizer-specific parameter settings:
EM: Value should be greater than 1.0.
Default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Online: Value should be greater than or equal to 0.
Default = (1.0 / k), following the implementation from here.
topic_distribution_col
This uses a variational approximation following Hoffman et al. (2010), where the approximate distribution is called "gamma." Technically, this method returns this approximation "gamma" for each document.
See also
See http://spark.apache.org/docs/latest/ml-clustering.html for
more information on the set of clustering algorithms.
Other ml clustering algorithms: ml_bisecting_kmeans
,
ml_gaussian_mixture
,
ml_kmeans
Examples
if (FALSE) {
library(janeaustenr)
library(dplyr)
sc <- spark_connect(master = "local")
lines_tbl <- sdf_copy_to(sc,
austen_books()[c(1:30), ],
name = "lines_tbl",
overwrite = TRUE
)
# transform the data in a tidy form
lines_tbl_tidy <- lines_tbl %>%
ft_tokenizer(
input_col = "text",
output_col = "word_list"
) %>%
ft_stop_words_remover(
input_col = "word_list",
output_col = "wo_stop_words"
) %>%
mutate(text = explode(wo_stop_words)) %>%
filter(text != "") %>%
select(text, book)
lda_model <- lines_tbl_tidy %>%
ml_lda(~text, k = 4)
# vocabulary and topics
tidy(lda_model)
}
Spark ML -- Linear Regression
Arguments
Value
Details
See also
Examples
Perform regression using linear regression.
ml_linear_regression(x, formula = NULL, fit_intercept = TRUE,
elastic_net_param = 0, reg_param = 0, max_iter = 100,
weight_col = NULL, loss = "squaredError", solver = "auto",
standardization = TRUE, tol = 1e-06, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("linear_regression_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
elastic_net_param |
ElasticNet mixing parameter, in range [0, 1].
For alpha = 0, the penalty is an L2 penalty.
For alpha = 1, it is an L1 penalty. |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
weight_col |
The name of the column to use as weights for the model fit. |
loss |
The loss function to be optimized.
Supported options: "squaredError"
and "huber".
Default: "squaredError" |
solver |
Solver algorithm for optimization. |
standardization |
Whether to standardize the training features before fitting the model. |
tol |
Param for the convergence tolerance for iterative algorithms. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
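As a minimal sketch of the regularization arguments (assuming the mtcars_tbl table created in the example below), an elastic-net fit mixes the L1 and L2 penalties.
# Minimal sketch: elastic-net penalized linear regression.
# elastic_net_param = 0 is pure L2 (ridge), 1 is pure L1 (lasso);
# reg_param controls the overall penalty strength.
enet_model <- mtcars_tbl %>%
  ml_linear_regression(
    mpg ~ .,
    elastic_net_param = 0.5,
    reg_param = 0.1
  )
pred <- ml_predict(enet_model, mtcars_tbl)
ml_regression_evaluator(pred, label_col = "mpg")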
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
partitions <- mtcars_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
mtcars_training <- partitions$training
mtcars_test <- partitions$test
lm_model <- mtcars_training %>%
ml_linear_regression(mpg ~ .)
pred <- ml_predict(lm_model, mtcars_test)
ml_regression_evaluator(pred, label_col = "mpg")
}
Spark ML -- Logistic Regression
Arguments
Value
Details
See also
Examples
Perform classification using logistic regression.
ml_logistic_regression(x, formula = NULL, fit_intercept = TRUE,
elastic_net_param = 0, reg_param = 0, max_iter = 100,
threshold = 0.5, thresholds = NULL, tol = 1e-06,
weight_col = NULL, aggregation_depth = 2,
lower_bounds_on_coefficients = NULL,
lower_bounds_on_intercepts = NULL,
upper_bounds_on_coefficients = NULL,
upper_bounds_on_intercepts = NULL, features_col = "features",
label_col = "label", family = "auto",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("logistic_regression_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
elastic_net_param |
ElasticNet mixing parameter, in range [0, 1].
For alpha = 0, the penalty is an L2 penalty.
For alpha = 1, it is an L1 penalty. |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
threshold |
Threshold in binary classification prediction, in range [0, 1]. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
tol |
Param for the convergence tolerance for iterative algorithms. |
weight_col |
The name of the column to use as weights for the model fit. |
aggregation_depth |
(Spark 2.1.0+) Suggested depth for treeAggregate (>= 2). |
lower_bounds_on_coefficients |
(Spark 2.2.0+) Lower bounds on coefficients if fitting under bound constrained optimization.
The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. |
lower_bounds_on_intercepts |
(Spark 2.2.0+) Lower bounds on intercepts if fitting under bound constrained optimization.
The bounds vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression. |
upper_bounds_on_coefficients |
(Spark 2.2.0+) Upper bounds on coefficients if fitting under bound constrained optimization.
The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. |
upper_bounds_on_intercepts |
(Spark 2.2.0+) Upper bounds on intercepts if fitting under bound constrained optimization.
The bounds vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
family |
(Spark 2.1.0+) Param for the name of family which is a description of the label distribution to be used in the model.
Supported options: "auto", "binomial", and "multinomial." |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
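The example below shows the binomial case; the following is a minimal, hedged sketch of a multinomial fit, assuming the iris_tbl table created in other examples on this page.
# Minimal sketch: three-class logistic regression with light regularization.
multi_lr <- iris_tbl %>%
  ml_logistic_regression(
    Species ~ .,
    family = "multinomial",
    reg_param = 0.01,
    max_iter = 50
  )
pred <- ml_predict(multi_lr, iris_tbl)
ml_multiclass_classification_evaluator(pred)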
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
partitions <- mtcars_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
mtcars_training <- partitions$training
mtcars_test <- partitions$test
lr_model <- mtcars_training %>%
ml_logistic_regression(am ~ gear + carb)
pred <- ml_predict(lr_model, mtcars_test)
ml_binary_classification_evaluator(pred)
}
Extracts data associated with a Spark ML model
Arguments
Value
Extracts data associated with a Spark ML model
ml_model_data(object)
Arguments
Value
A tbl_spark
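A minimal sketch, assuming a fitted ml_model such as the lm_model built in the linear regression example above.
# Minimal sketch: retrieve the Spark DataFrame associated with a fitted model.
training_data <- ml_model_data(lm_model)
head(training_data)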
Spark ML -- Multilayer Perceptron
Arguments
Value
Details
See also
Examples
Classification model based on the Multilayer Perceptron.
Each layer has sigmoid activation function, output layer has softmax.
ml_multilayer_perceptron_classifier(x, formula = NULL, layers = NULL,
max_iter = 100, step_size = 0.03, tol = 1e-06, block_size = 128,
solver = "l-bfgs", seed = NULL, initial_weights = NULL,
thresholds = NULL, features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("multilayer_perceptron_classifier_"), ...)
ml_multilayer_perceptron(x, formula = NULL, layers, max_iter = 100,
step_size = 0.03, tol = 1e-06, block_size = 128,
solver = "l-bfgs", seed = NULL, initial_weights = NULL,
features_col = "features", label_col = "label", thresholds = NULL,
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("multilayer_perceptron_classifier_"),
response = NULL, features = NULL, ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
layers |
A numeric vector describing the layers -- each element in the vector gives the size of a layer.
For example, c(4, 5, 2) would imply three layers, with an input (feature) layer of size 4, an intermediate layer of size 5, and an output (class) layer of size 2. |
max_iter |
The maximum number of iterations to use. |
step_size |
Step size to be used for each iteration of optimization (> 0). |
tol |
Param for the convergence tolerance for iterative algorithms. |
block_size |
Block size for stacking input data in matrices to speed up the computation.
Data is stacked within partitions.
If the block size is larger than the remaining data in a partition, it is adjusted to the size of that data.
Recommended size is between 10 and 1000.
Default: 128 |
solver |
The solver algorithm for optimization.
Supported options: "gd" (minibatch gradient descent) or "l-bfgs".
Default: "l-bfgs" |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
initial_weights |
The initial weights of the model. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
ml_multilayer_perceptron()
is an alias for ml_multilayer_perceptron_classifier()
for backwards compatibility.
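A minimal sketch of the pipeline signature described in the Value section, assuming sc and iris_tbl as in the example below; the layer sizes follow the iris data (4 features in, 3 classes out).
# Minimal sketch: an unfitted multilayer perceptron stage inside a pipeline,
# fit explicitly with ml_fit().
mlp_pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_multilayer_perceptron_classifier(layers = c(4, 8, 3))
mlp_fitted <- ml_fit(mlp_pipeline, iris_tbl)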
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
mlp_model <- iris_training %>%
ml_multilayer_perceptron_classifier(Species ~ ., layers = c(4,3,3))
pred <- ml_predict(mlp_model, iris_test)
ml_multiclass_classification_evaluator(pred)
}
Spark ML -- Naive-Bayes
Arguments
Value
Details
See also
Examples
Naive Bayes Classifiers.
It supports Multinomial NB (see here), which can handle finitely supported discrete data.
For example, by converting documents into TF-IDF vectors, it can be used for document classification.
By converting every feature vector to binary (0/1) data, it can also be used as Bernoulli NB (see here).
The input feature values must be nonnegative.
ml_naive_bayes(x, formula = NULL, model_type = "multinomial",
smoothing = 1, thresholds = NULL, weight_col = NULL,
features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("naive_bayes_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
model_type |
The model type.
Supported options: "multinomial"
and "bernoulli" .
(default = multinomial ) |
smoothing |
The (Laplace) smoothing parameter.
Defaults to 1. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
weight_col |
(Spark 2.1.0+) Weight column name.
If this is not set or empty, we treat all instance weights as 1.0. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
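As a minimal, hedged sketch of the Bernoulli variant described above -- the documents_tbl table and its text/label columns are hypothetical placeholders, not objects from this reference.
# Minimal sketch: tokenize free text, build binary (0/1) term vectors, and fit
# Bernoulli Naive Bayes as part of a pipeline.
nb_pipeline <- ml_pipeline(sc) %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_count_vectorizer(input_col = "words", output_col = "features", binary = TRUE) %>%
  ml_naive_bayes(model_type = "bernoulli", features_col = "features", label_col = "label")
nb_fitted <- ml_fit(nb_pipeline, documents_tbl)  # documents_tbl: hypothetical table with text + numeric label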
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
nb_model <- iris_training %>%
ml_naive_bayes(Species ~ .)
pred <- ml_predict(nb_model, iris_test)
ml_multiclass_classification_evaluator(pred)
}
Spark ML -- OneVsRest
Arguments
Value
Details
See also
Reduction of Multiclass Classification to Binary Classification.
Performs reduction using the one-against-all strategy.
For a multiclass classification problem with k classes, train k models (one per class).
Each example is scored against all k models, and the model with the highest score is picked to label the example.
ml_one_vs_rest(x, formula = NULL, classifier = NULL,
features_col = "features", label_col = "label",
prediction_col = "prediction", uid = random_string("one_vs_rest_"),
...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
classifier |
Object of class ml_estimator .
The base binary classifier to which the multiclass problem is reduced. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
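This page has no Examples section; the following is a minimal, hedged sketch, assuming sc and iris_tbl as in other examples on this page.
# Minimal sketch: use a binary logistic regression estimator as the base
# classifier and reduce the three-class problem to three one-vs-rest models.
base_lr <- ml_logistic_regression(sc, max_iter = 10)
ovr_model <- iris_tbl %>%
  ml_one_vs_rest(Species ~ ., classifier = base_lr)
pred <- ml_predict(ovr_model, iris_tbl)
ml_multiclass_classification_evaluator(pred)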
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_random_forest_classifier
Feature Transformation -- PCA (Estimator)
Arguments
Value
Details
See also
Examples
PCA trains a model to project vectors to a lower dimensional space of the top k principal components.
ft_pca(x, input_col = NULL, output_col = NULL, k = NULL,
uid = random_string("pca_"), ...)
ml_pca(x, features = tbl_vars(x), k = length(features),
pc_prefix = "PC", ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
k |
The number of principal components |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
features |
The columns to use in the principal components
analysis.
Defaults to all columns in x . |
pc_prefix |
Length-one character vector used to prepend names of components. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns a ml_transformer
,
a ml_estimator
, or one of their subclasses.
The object contains a pointer to
a Spark Transformer
or Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the transformer or estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a transformer is constructed then
immediately applied to the input tbl_spark
, returning a tbl_spark.
Details
In the case where x
is a tbl_spark
, the estimator fits against x
to obtain a transformer, which is then immediately used to transform x
, returning a tbl_spark
.
ml_pca()
is a wrapper around ft_pca()
that returns a
ml_model
.
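The example below uses the ml_pca() wrapper; as a minimal sketch of the ft_pca() form (assuming iris_tbl as in the example below), the input must first be assembled into a single vector column.
# Minimal sketch: assemble the numeric columns into a vector column, then
# project onto the top 2 principal components.
iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features"
  ) %>%
  ft_pca(input_col = "features", output_col = "pca_features", k = 2)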
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer
,
ft_bucketizer
,
ft_chisq_selector
,
ft_count_vectorizer
, ft_dct
,
ft_elementwise_product
,
ft_feature_hasher
,
ft_hashing_tf
, ft_idf
,
ft_imputer
,
ft_index_to_string
,
ft_interaction
, ft_lsh
,
ft_max_abs_scaler
,
ft_min_max_scaler
, ft_ngram
,
ft_normalizer
,
ft_one_hot_encoder_estimator
,
ft_one_hot_encoder
,
ft_polynomial_expansion
,
ft_quantile_discretizer
,
ft_r_formula
,
ft_regex_tokenizer
,
ft_sql_transformer
,
ft_standard_scaler
,
ft_stop_words_remover
,
ft_string_indexer
,
ft_tokenizer
,
ft_vector_assembler
,
ft_vector_indexer
,
ft_vector_slicer
, ft_word2vec
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
select(-Species) %>%
ml_pca(k = 2)
}
Spark ML -- Random Forest
Arguments
Value
Details
See also
Examples
Perform classification and regression using random forests.
ml_random_forest_classifier(x, formula = NULL, num_trees = 20,
subsampling_rate = 1, max_depth = 5, min_instances_per_node = 1,
feature_subset_strategy = "auto", impurity = "gini",
min_info_gain = 0, max_bins = 32, seed = NULL, thresholds = NULL,
checkpoint_interval = 10, cache_node_ids = FALSE,
max_memory_in_mb = 256, features_col = "features",
label_col = "label", prediction_col = "prediction",
probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("random_forest_classifier_"), ...)
ml_random_forest(x, formula = NULL, type = c("auto", "regression",
"classification"), features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
feature_subset_strategy = "auto", impurity = "auto",
checkpoint_interval = 10, max_bins = 32, max_depth = 5,
num_trees = 20, min_info_gain = 0, min_instances_per_node = 1,
subsampling_rate = 1, seed = NULL, thresholds = NULL,
cache_node_ids = FALSE, max_memory_in_mb = 256,
uid = random_string("random_forest_"), response = NULL,
features = NULL, ...)
ml_random_forest_regressor(x, formula = NULL, num_trees = 20,
subsampling_rate = 1, max_depth = 5, min_instances_per_node = 1,
feature_subset_strategy = "auto", impurity = "variance",
min_info_gain = 0, max_bins = 32, seed = NULL,
checkpoint_interval = 10, cache_node_ids = FALSE,
max_memory_in_mb = 256, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("random_forest_regressor_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
num_trees |
Number of trees to train (>= 1).
If 1, then no bootstrapping is used.
If > 1, then bootstrapping is done. |
subsampling_rate |
Fraction of the training data used for learning each decision tree, in range (0, 1].
(default = 1.0) |
max_depth |
Maximum depth of the tree (>= 0); that is, the maximum
number of nodes separating any leaves from the root of the tree. |
min_instances_per_node |
Minimum number of instances each child must
have after split. |
feature_subset_strategy |
The number of features to consider for splits at each tree node.
See details for options. |
impurity |
Criterion used for information gain calculation.
Supported: "entropy"
and "gini" (default) for classification and "variance" (default) for regression.
For ml_decision_tree , setting "auto" will default to the appropriate
criterion based on model type. |
min_info_gain |
Minimum information gain for a split to be considered
at a tree node.
Should be >= 0, defaults to 0. |
max_bins |
The maximum number of bins used for discretizing
continuous features and for choosing how to split on features at
each node.
More bins give higher granularity. |
seed |
Seed for random numbers. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class.
Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g. 10 means that the cache will get checkpointed every 10 iterations; defaults to 10. |
cache_node_ids |
If FALSE , the algorithm will pass trees to executors to match instances with nodes.
If TRUE , the algorithm will cache node IDs for each instance.
Caching can speed up training of deeper trees.
Defaults to FALSE . |
max_memory_in_mb |
Maximum memory in MB allocated to histogram aggregation.
If too small, then 1 node will be split per iteration,
and its aggregates may exceed this size.
Defaults to 256. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
type |
The type of model to fit. "regression" treats the response
as a continuous variable, while "classification" treats the response
as a categorical variable.
When "auto" is used, the model type is
inferred based on the response variable type -- if it is a numeric type,
then regression is used; classification otherwise. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark, with formula specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
The supported options for feature_subset_strategy
are
"auto"
: Choose automatically for task: If num_trees == 1
, set to "all"
.
If num_trees > 1
(forest), set to "sqrt"
for classification and to "onethird"
for regression.
"all"
: use all features
"onethird"
: use 1/3 of the features
"sqrt"
: use sqrt(number of features)
"log2"
: use log2(number of features)
"n"
: when n
is in the range (0, 1.0], use n * number of features.
When n
is in the range (1, number of features), use n
features.
(default = "auto"
)
ml_random_forest
is a wrapper around ml_random_forest_regressor.tbl_spark
and ml_random_forest_classifier.tbl_spark
and calls the appropriate method based on model type.
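The example below covers classification; the following is a minimal, hedged sketch of the regression variant, assuming the mtcars_tbl table created in earlier examples on this page.
# Minimal sketch: random forest regression with an explicit per-split feature
# subset strategy.
rf_reg <- mtcars_tbl %>%
  ml_random_forest_regressor(
    mpg ~ .,
    num_trees = 50,
    feature_subset_strategy = "onethird"
  )
pred <- ml_predict(rf_reg, mtcars_tbl)
ml_regression_evaluator(pred, label_col = "mpg")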
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
rf_model <- iris_training %>%
ml_random_forest(Species ~ ., type = "classification")
pred <- ml_predict(rf_model, iris_test)
ml_multiclass_classification_evaluator(pred)
}
Spark ML -- Survival Regression
Arguments
Value
Details
See also
Examples
Fit a parametric survival regression model named accelerated failure time (AFT) model (see Accelerated failure time model (Wikipedia)) based on the Weibull distribution of the survival time.
ml_aft_survival_regression(x, formula = NULL, censor_col = "censor",
quantile_probabilities = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95,
0.99), fit_intercept = TRUE, max_iter = 100L, tol = 1e-06,
aggregation_depth = 2, quantiles_col = NULL,
features_col = "features", label_col = "label",
prediction_col = "prediction",
uid = random_string("aft_survival_regression_"), ...)
ml_survival_regression(x, formula = NULL, censor_col = "censor",
quantile_probabilities = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95,
0.99), fit_intercept = TRUE, max_iter = 100L, tol = 1e-06,
aggregation_depth = 2, quantiles_col = NULL,
features_col = "features", label_col = "label",
prediction_col = "prediction",
uid = random_string("aft_survival_regression_"), response = NULL,
features = NULL, ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
censor_col |
Censor column name.
The value of this column could be 0 or 1.
If the value is 1, the event has occurred, i.e. the observation is
uncensored; otherwise it is censored. |
quantile_probabilities |
Quantile probabilities array.
Values of the quantile probabilities array should be in the range (0, 1) and the array should be non-empty. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
max_iter |
The maximum number of iterations to use. |
tol |
Param for the convergence tolerance for iterative algorithms. |
aggregation_depth |
(Spark 2.1.0+) Suggested depth for treeAggregate (>= 2). |
quantiles_col |
Quantiles column name.
This column will output quantiles of corresponding quantileProbabilities if it is set. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark
, with formula
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
ml_survival_regression()
is an alias for ml_aft_survival_regression()
for backwards compatibility.
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
library(survival)
library(sparklyr)
sc <- spark_connect(master = "local")
ovarian_tbl <- sdf_copy_to(sc, ovarian, name = "ovarian_tbl", overwrite = TRUE)
partitions <- ovarian_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
ovarian_training <- partitions$training
ovarian_test <- partitions$test
sur_reg <- ovarian_training %>%
ml_aft_survival_regression(futime ~ ecog_ps + rx + age + resid_ds, censor_col = "fustat")
pred <- ml_predict(sur_reg, ovarian_test)
pred
}
Add a Stage to a Pipeline
Arguments
Adds a stage to a pipeline.
ml_add_stage(x, stage)
Arguments
x |
A pipeline or a pipeline stage. |
stage |
A pipeline stage. |
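A minimal illustrative sketch, assuming a local connection sc; the stop-words remover stage and its column names are arbitrary choices for the example.
if (FALSE) {
sc <- spark_connect(master = "local")
# Build an empty pipeline, then append a transformer stage to it
remover <- ft_stop_words_remover(sc, input_col = "words", output_col = "clean_words")
pipeline <- ml_pipeline(sc) %>%
  ml_add_stage(remover)
pipeline
}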
Spark ML -- ALS
Arguments
Value
Details
Examples
Perform recommendation using Alternating Least Squares (ALS) matrix factorization.
ml_als(x, formula = NULL, rating_col = "rating", user_col = "user",
item_col = "item", rank = 10, reg_param = 0.1,
implicit_prefs = FALSE, alpha = 1, nonnegative = FALSE,
max_iter = 10, num_user_blocks = 10, num_item_blocks = 10,
checkpoint_interval = 10, cold_start_strategy = "nan",
intermediate_storage_level = "MEMORY_AND_DISK",
final_storage_level = "MEMORY_AND_DISK", uid = random_string("als_"),
...)
ml_recommend(model, type = c("items", "users"), n = 1)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details.
The ALS model requires a specific formula format, please use rating_col ~ user_col + item_col . |
rating_col |
Column name for ratings.
Default: "rating" |
user_col |
Column name for user ids.
Ids must be integers.
Other numeric types are supported for this column, but will be cast to integers as long as they fall within the integer value range.
Default: "user" |
item_col |
Column name for item ids.
Ids must be integers.
Other numeric types are supported for this column, but will be cast to integers as long as they fall within the integer value range.
Default: "item" |
rank |
Rank of the matrix factorization (positive).
Default: 10 |
reg_param |
Regularization parameter. |
implicit_prefs |
Whether to use implicit preference.
Default: FALSE. |
alpha |
Alpha parameter in the implicit preference formulation (nonnegative). |
nonnegative |
Whether to apply nonnegativity constraints.
Default: FALSE. |
max_iter |
Maximum number of iterations. |
num_user_blocks |
Number of user blocks (positive).
Default: 10 |
num_item_blocks |
Number of item blocks (positive).
Default: 10 |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1).
E.g.
10 means that the cache will get checkpointed every 10 iterations, defaults to 10. |
cold_start_strategy |
(Spark 2.2.0+) Strategy for dealing with unknown or new users/items at prediction time.
This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data.
Supported values: - "nan": predicted value for unknown ids will be NaN.
- "drop": rows in the input DataFrame containing unknown ids will be dropped from the output DataFrame containing predictions.
Default: "nan". |
intermediate_storage_level |
(Spark 2.0.0+) StorageLevel for intermediate datasets.
Pass in a string representation of StorageLevel .
Cannot be "NONE".
Default: "MEMORY_AND_DISK". |
final_storage_level |
(Spark 2.0.0+) StorageLevel for ALS model factors.
Pass in a string representation of StorageLevel .
Default: "MEMORY_AND_DISK". |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
model |
An ALS model object |
type |
What to recommend, one of items or users |
n |
Maximum number of recommendations to return |
Value
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e.
X * Yt = R.
Typically these approximations are called 'factor' matrices.
The general approach is iterative.
During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares.
The newly-solved factor matrix is then held constant while solving for the other factor matrix.
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector.
This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on).
This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at https://doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.
Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0.
The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_als
recommender object, which is an Estimator.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the recommender appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a recommender
estimator is constructed then immediately fit with the input
tbl_spark
, returning a recommendation model, i.e. ml_als_model
.
Details
ml_recommend()
returns the top n
users/items recommended for each item/user, for all items/users.
The output has been transformed (exploded and separated) from the default Spark outputs to be more user-friendly.
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
movies <- data.frame(
user = c(1, 2, 0, 1, 2, 0),
item = c(1, 1, 1, 2, 2, 0),
rating = c(3, 1, 2, 4, 5, 4)
)
movies_tbl <- sdf_copy_to(sc, movies)
model <- ml_als(movies_tbl, rating ~ user + item)
ml_predict(model, movies_tbl)
ml_recommend(model, type = "item", 1)
}
Utility functions for LSH models
Arguments
Utility functions for LSH models
ml_approx_nearest_neighbors(model, dataset, key, num_nearest_neighbors,
dist_col = "distCol")
ml_approx_similarity_join(model, dataset_a, dataset_b, threshold,
dist_col = "distCol")
Arguments
model |
A fitted LSH model, returned by either ft_minhash_lsh()
or ft_bucketed_random_projection_lsh() . |
dataset |
The dataset to search for nearest neighbors of the key. |
key |
Feature vector representing the item to search for. |
num_nearest_neighbors |
The maximum number of nearest neighbors. |
dist_col |
Output column for storing the distance between each result row and the key. |
dataset_a |
One of the datasets to join. |
dataset_b |
Another dataset to join. |
threshold |
The threshold for the distance of row pairs. |
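A minimal illustrative sketch, assuming a local connection sc and the iris data; the bucket length and distance threshold are arbitrary values chosen for the example.
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Assemble the numeric columns into a single vector column named "features"
iris_vec <- ft_vector_assembler(
  iris_tbl,
  input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
  output_col = "features"
)
# Fit a bucketed random projection LSH model on the feature column
lsh_model <- ft_bucketed_random_projection_lsh(
  sc, input_col = "features", output_col = "hashes",
  bucket_length = 2, num_hash_tables = 3
) %>%
  ml_fit(iris_vec)
# Approximate self-join: row pairs whose Euclidean distance is below the threshold
ml_approx_similarity_join(lsh_model, iris_vec, iris_vec, threshold = 1.5)
}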
Frequent Pattern Mining -- FPGrowth
Arguments
A parallel FP-growth algorithm to mine frequent itemsets.
ml_fpgrowth(x, items_col = "items", min_confidence = 0.8,
min_support = 0.3, prediction_col = "prediction",
uid = random_string("fpgrowth_"), ...)
ml_association_rules(model)
ml_freq_itemsets(model)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
items_col |
Items column name.
Default: "items" |
min_confidence |
Minimal confidence for generating association rules. min_confidence will not affect the mining of frequent itemsets, but
will affect the generation of association rules.
Default: 0.8 |
min_support |
Minimal support level of the frequent pattern, in [0.0, 1.0].
Any pattern that appears more than (min_support * size-of-the-dataset) times
will be output in the frequent itemsets.
Default: 0.3 |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
model |
A fitted FPGrowth model returned by ml_fpgrowth() |
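A minimal illustrative sketch, assuming a local connection sc; the toy basket data and the use of Spark SQL's split() (via dplyr translation) to build the array-typed items column are assumptions made for the example.
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
# Toy market-basket data: one comma-separated string of items per row
baskets <- data.frame(raw = c("a,b,c", "a,b", "a,c", "b,c"))
baskets_tbl <- sdf_copy_to(sc, baskets, name = "baskets_tbl", overwrite = TRUE) %>%
  mutate(items = split(raw, ","))  # Spark SQL split() builds an array column
fp_model <- ml_fpgrowth(baskets_tbl, items_col = "items",
                        min_support = 0.5, min_confidence = 0.6)
# Inspect the mined frequent itemsets and the generated association rules
ml_freq_itemsets(fp_model)
ml_association_rules(fp_model)
}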
Spark ML - Evaluators
Arguments
Value
Details
Examples
A set of functions to calculate performance metrics for prediction models.
Also see the Spark ML Documentation https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.package
ml_binary_classification_evaluator(x, label_col = "label",
raw_prediction_col = "rawPrediction", metric_name = "areaUnderROC",
uid = random_string("binary_classification_evaluator_"), ...)
ml_binary_classification_eval(x, label_col = "label",
prediction_col = "prediction", metric_name = "areaUnderROC")
ml_multiclass_classification_evaluator(x, label_col = "label",
prediction_col = "prediction", metric_name = "f1",
uid = random_string("multiclass_classification_evaluator_"), ...)
ml_classification_eval(x, label_col = "label",
prediction_col = "prediction", metric_name = "f1")
ml_regression_evaluator(x, label_col = "label",
prediction_col = "prediction", metric_name = "rmse",
uid = random_string("regression_evaluator_"), ...)
Arguments
x |
A spark_connection object or a tbl_spark containing label and prediction columns.
The latter should be the output of sdf_predict . |
label_col |
Name of the column containing the true labels or values. |
raw_prediction_col |
Raw prediction (a.k.a.
confidence) column name. |
metric_name |
The performance metric.
See details. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
prediction_col |
Name of the column that contains the predicted
label or value NOT the scored probability.
Column should be of type Double . |
Value
The calculated performance metric
Details
The following metrics are supported
Binary Classification: areaUnderROC
(default) or areaUnderPR
(not available in Spark 2.X.)
Multiclass Classification: f1
(default), precision
, recall
, weightedPrecision
, weightedRecall
or accuracy
; for Spark 2.X: f1
(default), weightedPrecision
, weightedRecall
or accuracy
.
Regression: rmse
(root mean squared error, default),
mse
(mean squared error), r2
, or mae
(mean absolute error.)
ml_binary_classification_eval()
is an alias for ml_binary_classification_evaluator()
for backwards compatibility.
ml_classification_eval()
is an alias for ml_multiclass_classification_evaluator()
for backwards compatibility.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
partitions <- mtcars_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
mtcars_training <- partitions$training
mtcars_test <- partitions$test
# for multiclass classification
rf_model <- mtcars_training %>%
ml_random_forest(cyl ~ ., type = "classification")
pred <- ml_predict(rf_model, mtcars_test)
ml_multiclass_classification_evaluator(pred)
# for regression
rf_model <- mtcars_training %>%
ml_random_forest(cyl ~ ., type = "regression")
pred <- ml_predict(rf_model, mtcars_test)
ml_regression_evaluator(pred, label_col = "cyl")
# for binary classification
rf_model <- mtcars_training %>%
ml_random_forest(am ~ gear + carb, type = "classification")
pred <- ml_predict(rf_model, mtcars_test)
ml_binary_classification_evaluator(pred)
}
Spark ML -- Bisecting K-Means Clustering
Arguments
Value
See also
Examples
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
The algorithm starts from a single cluster that contains all points.
Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible.
The bisecting steps of clusters on the same level are grouped together to increase parallelism.
If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
ml_bisecting_kmeans(x, formula = NULL, k = 4, max_iter = 20,
seed = NULL, min_divisible_cluster_size = 1,
features_col = "features", prediction_col = "prediction",
uid = random_string("bisecting_bisecting_kmeans_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
min_divisible_cluster_size |
The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0). |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the clustering estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, an estimator is constructed then
immediately fit with the input tbl_spark
, returning a clustering model.
tbl_spark
, with formula
or features
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the estimator.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
This signature does not apply to ml_lda()
.
See also
See http://spark.apache.org/docs/latest/ml-clustering.html for
more information on the set of clustering algorithms.
Other ml clustering algorithms: ml_gaussian_mixture
,
ml_kmeans
, ml_lda
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
ml_bisecting_kmeans(Species ~ ., k = 4)
}
Wrap a Spark ML JVM object
Arguments
Identifies the associated sparklyr ML constructor for the JVM object by inspecting its
class and performing a lookup.
The lookup table is specified by the
`sparkml/class_mapping.json` files of sparklyr and the loaded extensions.
ml_call_constructor(jobj)
Arguments
jobj |
The jobj for the pipeline stage. |
Chi-square hypothesis testing for categorical data.
Arguments
Value
Examples
Conduct Pearson's independence test for every feature against the
label.
For each feature, the (feature, label) pairs are converted
into a contingency matrix for which the Chi-squared statistic is
computed.
All label and feature values must be categorical.
ml_chisquare_test(x, features, label)
Arguments
x |
A tbl_spark . |
features |
The name(s) of the feature columns.
This can also be the name
of a single vector column created using ft_vector_assembler() . |
label |
The name of the label column. |
Value
A data frame with one row for each (feature, label) pair with p-values,
degrees of freedom, and test statistics.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width")
ml_chisquare_test(iris_tbl, features = features, label = "Species")
}
Spark ML - Clustering Evaluator
Arguments
Value
Examples
Evaluator for clustering results.
The metric computes the Silhouette measure using the squared
Euclidean distance.
The Silhouette is a measure for the validation of the consistency
within clusters.
It ranges between 1 and -1, where a value close to 1 means that the
points in a cluster are close to the other points in the same cluster and far from the
points of the other clusters.
ml_clustering_evaluator(x, features_col = "features",
prediction_col = "prediction", metric_name = "silhouette",
uid = random_string("clustering_evaluator_"), ...)
Arguments
x |
A spark_connection object or a tbl_spark containing label and prediction columns.
The latter should be the output of sdf_predict . |
features_col |
Name of features column. |
prediction_col |
Name of the prediction column. |
metric_name |
The performance metric.
Currently supports "silhouette". |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
Value
The calculated performance metric
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
formula <- Species ~ .
# Train the models
kmeans_model <- ml_kmeans(iris_training, formula = formula)
b_kmeans_model <- ml_bisecting_kmeans(iris_training, formula = formula)
gmm_model <- ml_gaussian_mixture(iris_training, formula = formula)
# Predict
pred_kmeans <- ml_predict(kmeans_model, iris_test)
pred_b_kmeans <- ml_predict(b_kmeans_model, iris_test)
pred_gmm <- ml_predict(gmm_model, iris_test)
# Evaluate
ml_clustering_evaluator(pred_kmeans)
ml_clustering_evaluator(pred_b_kmeans)
ml_clustering_evaluator(pred_gmm)
}
Constructors for `ml_model` Objects
Arguments
Functions for developers writing extensions for Spark ML.
These functions are constructors
for `ml_model` objects that are returned when using the formula interface.
new_ml_model_prediction(pipeline_model, formula, dataset, label_col,
features_col, ..., class = character())
new_ml_model(pipeline_model, formula, dataset, ..., class = character())
new_ml_model_classification(pipeline_model, formula, dataset, label_col,
features_col, predicted_label_col, ..., class = character())
new_ml_model_regression(pipeline_model, formula, dataset, label_col,
features_col, ..., class = character())
new_ml_model_clustering(pipeline_model, formula, dataset, features_col,
..., class = character())
ml_supervised_pipeline(predictor, dataset, formula, features_col,
label_col)
ml_clustering_pipeline(predictor, dataset, formula, features_col)
ml_construct_model_supervised(constructor, predictor, formula, dataset,
features_col, label_col, ...)
ml_construct_model_clustering(constructor, predictor, formula, dataset,
features_col, ...)
Arguments
pipeline_model |
The pipeline model object returned by `ml_supervised_pipeline()`. |
formula |
The formula used for data preprocessing |
dataset |
The training dataset. |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
class |
Name of the subclass. |
predictor |
The pipeline stage corresponding to the ML algorithm. |
constructor |
The constructor function for the `ml_model`. |
Compute correlation matrix
Arguments
Value
Examples
Compute correlation matrix
ml_corr(x, columns = NULL, method = c("pearson", "spearman"))
Arguments
x |
A tbl_spark . |
columns |
The names of the columns to calculate correlations of.
If only one
column is specified, it must be a vector column (for example, assembled using ft_vector_assembler() ). |
method |
The method to use, either "pearson" or "spearman" . |
Value
A correlation matrix organized as a data frame.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width")
ml_corr(iris_tbl, columns = features , method = "pearson")
}
Spark ML -- Tuning
Arguments
Value
Details
Examples
Perform hyper-parameter tuning using either K-fold cross validation or train-validation split.
ml_sub_models(model)
ml_validation_metrics(model)
ml_cross_validator(x, estimator = NULL, estimator_param_maps = NULL,
evaluator = NULL, num_folds = 3, collect_sub_models = FALSE,
parallelism = 1, seed = NULL,
uid = random_string("cross_validator_"), ...)
ml_train_validation_split(x, estimator = NULL,
estimator_param_maps = NULL, evaluator = NULL, train_ratio = 0.75,
collect_sub_models = FALSE, parallelism = 1, seed = NULL,
uid = random_string("train_validation_split_"), ...)
Arguments
model |
A cross validation or train-validation-split model. |
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
estimator |
A ml_estimator object. |
estimator_param_maps |
A named list of stages and hyper-parameter sets to tune.
See details. |
evaluator |
A ml_evaluator object, see ml_evaluator. |
num_folds |
Number of folds for cross validation.
Must be >= 2.
Default: 3 |
collect_sub_models |
Whether to collect a list of sub-models trained during tuning.
If set to FALSE , then only the single best sub-model will be available after fitting.
If set to TRUE , then all sub-models will be available.
Warning: For large models, collecting
all sub-models can cause OOMs on the Spark driver. |
parallelism |
The number of threads to use when running parallel algorithms.
Default is 1 for serial execution. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
train_ratio |
Ratio between train and validation data.
Must be between 0 and 1.
Default: 0.75 |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_cross_validator
or ml_train_validation_split
object.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the tuning estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a tuning estimator is constructed then
immediately fit with the input tbl_spark
, returning a ml_cross_validation_model
or a
ml_train_validation_split_model
object.
For cross validation, ml_sub_models()
returns a nested
list of models, where the first layer represents fold indices and the
second layer represents param maps.
For train-validation split,
ml_sub_models()
returns a list of models, corresponding to the
order of the estimator param maps.
ml_validation_metrics()
returns a data frame of performance
metrics and hyperparameter combinations.
Details
ml_cross_validator()
performs k-fold cross validation while ml_train_validation_split()
performs tuning on one pair of train and validation datasets.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Create a pipeline
pipeline <- ml_pipeline(sc) %>%
ft_r_formula(Species ~ . ) %>%
ml_random_forest_classifier()
# Specify hyperparameter grid
grid <- list(
random_forest = list(
num_trees = c(5,10),
max_depth = c(5,10),
impurity = c("entropy", "gini")
)
)
# Create the cross validator object
cv <- ml_cross_validator(
sc, estimator = pipeline, estimator_param_maps = grid,
evaluator = ml_multiclass_classification_evaluator(sc),
num_folds = 3,
parallelism = 4
)
# Train the models
cv_model <- ml_fit(cv, iris_tbl)
# Print the metrics
ml_validation_metrics(cv_model)
}
Default stop words
Arguments
Value
Details
See also
Loads the default stop words for the given language.
ml_default_stop_words(sc, language = c("english", "danish", "dutch",
"finnish", "french", "german", "hungarian", "italian", "norwegian",
"portuguese", "russian", "spanish", "swedish", "turkish"), ...)
Arguments
sc |
A spark_connection |
language |
A character string. |
... |
Optional arguments; currently unused. |
Value
A list of stop words.
Details
Supported languages: danish, dutch, english, finnish, french,
german, hungarian, italian, norwegian, portuguese, russian, spanish,
swedish, turkish.
Defaults to English.
See http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/
for more details
See also
ft_stop_words_remover
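A minimal usage sketch, assuming a local connection sc:
if (FALSE) {
sc <- spark_connect(master = "local")
# English is the default; pass another supported language to switch lists
english_sw <- ml_default_stop_words(sc)
spanish_sw <- ml_default_stop_words(sc, language = "spanish")
head(english_sw)
}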
Evaluate the Model on a Validation Set
Arguments
Compute performance metrics.
ml_evaluate(x, dataset)
# S3 method for ml_model_logistic_regression
ml_evaluate(x, dataset)
# S3 method for ml_logistic_regression_model
ml_evaluate(x, dataset)
# S3 method for ml_model_linear_regression
ml_evaluate(x, dataset)
# S3 method for ml_linear_regression_model
ml_evaluate(x, dataset)
# S3 method for ml_model_generalized_linear_regression
ml_evaluate(x, dataset)
# S3 method for ml_generalized_linear_regression_model
ml_evaluate(x, dataset)
# S3 method for ml_evaluator
ml_evaluate(x, dataset)
Arguments
x |
An ML model object or an evaluator object. |
dataset |
The dataset to validate the model on. |
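A minimal illustrative sketch, assuming a local connection and a binary subset of iris; the exact contents of the returned summary object depend on the model type.
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
  filter(Species != "setosa") %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
# Fit a binary logistic regression, then compute metrics on the held-out set
lr_model <- ml_logistic_regression(partitions$training, Species ~ .)
ml_evaluate(lr_model, partitions$test)
}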
Spark ML - Feature Importance for Tree Models
Arguments
Value
Spark ML - Feature Importance for Tree Models
ml_feature_importances(model, ...)
ml_tree_feature_importance(model, ...)
Arguments
model |
A decision tree-based model. |
... |
Optional arguments; currently unused. |
Value
For ml_model
, a sorted data frame with feature labels and their relative importance.
For ml_prediction_model
, a vector of relative importances.
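A minimal sketch, assuming a local connection sc and the iris data:
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
rf_model <- ml_random_forest(iris_tbl, Species ~ ., type = "classification")
# Sorted data frame of feature labels and their relative importance
ml_feature_importances(rf_model)
}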
Feature Transformation -- Word2Vec (Estimator)
Arguments
Value
Details
See also
Word2Vec transforms a word into a code for further natural language processing or machine learning process.
ft_word2vec(x, input_col = NULL, output_col = NULL,
vector_size = 100, min_count = 5, max_sentence_length = 1000,
num_partitions = 1, step_size = 0.025, max_iter = 1, seed = NULL,
uid = random_string("word2vec_"), ...)
ml_find_synonyms(model, word, num)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
vector_size |
The dimension of the vectors that words are transformed into.
Default: 100 |
min_count |
The minimum number of times a token must appear to be included in
the word2vec model's vocabulary.
Default: 5 |
max_sentence_length |
(Spark 2.0.0+) Sets the maximum length (in words) of each sentence
in the input data.
Any sentence longer than this threshold will be divided into
chunks of up to max_sentence_length size.
Default: 1000 |
num_partitions |
Number of partitions for sentences of words.
Default: 1 |
step_size |
Param for Step size to be used for each iteration of optimization (> 0). |
max_iter |
The maximum number of iterations to use. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
model |
A fitted Word2Vec model, returned by ft_word2vec() . |
word |
A word, as a length-one character vector. |
num |
Number of words closest in similarity to the given word to find. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns a ml_transformer
,
a ml_estimator
, or one of their subclasses.
The object contains a pointer to
a Spark Transformer
or Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the transformer or estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a transformer is constructed then
immediately applied to the input tbl_spark
, returning a tbl_spark
ml_find_synonyms()
returns a DataFrame of synonyms and cosine similarities
Details
In the case where x
is a tbl_spark
, the estimator fits against x
to obtain a transformer, which is then immediately used to transform x
, returning a tbl_spark
.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer
,
ft_bucketizer
,
ft_chisq_selector
,
ft_count_vectorizer
, ft_dct
,
ft_elementwise_product
,
ft_feature_hasher
,
ft_hashing_tf
, ft_idf
,
ft_imputer
,
ft_index_to_string
,
ft_interaction
, ft_lsh
,
ft_max_abs_scaler
,
ft_min_max_scaler
, ft_ngram
,
ft_normalizer
,
ft_one_hot_encoder_estimator
,
ft_one_hot_encoder
, ft_pca
,
ft_polynomial_expansion
,
ft_quantile_discretizer
,
ft_r_formula
,
ft_regex_tokenizer
,
ft_sql_transformer
,
ft_standard_scaler
,
ft_stop_words_remover
,
ft_string_indexer
,
ft_tokenizer
,
ft_vector_assembler
,
ft_vector_indexer
,
ft_vector_slicer
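A minimal illustrative sketch, assuming a local connection sc; the two-sentence corpus and the small vector_size are toy values, so the reported synonyms are not meaningful.
if (FALSE) {
sc <- spark_connect(master = "local")
docs <- data.frame(text = c("the quick brown fox jumps", "the lazy brown dog sleeps"))
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)
# Tokenize the text, then fit a Word2Vec model explicitly to keep the model object
tokenized <- ft_tokenizer(docs_tbl, input_col = "text", output_col = "words")
w2v_model <- ft_word2vec(
  sc, input_col = "words", output_col = "vectors",
  vector_size = 8, min_count = 1
) %>%
  ml_fit(tokenized)
# Embed each document, then look up words close to "brown"
ml_transform(w2v_model, tokenized)
ml_find_synonyms(w2v_model, "brown", num = 2)
}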
Spark ML -- Transform, fit, and predict methods (ml_ interface)
Arguments
Value
Details
Methods for transformation, fit, and prediction.
These are mirrors of the corresponding sdf-transform-methods.
is_ml_transformer(x)
is_ml_estimator(x)
ml_fit(x, dataset, ...)
ml_transform(x, dataset, ...)
ml_fit_and_transform(x, dataset, ...)
ml_predict(x, dataset, ...)
# S3 method for ml_model_classification
ml_predict(x, dataset,
probability_prefix = "probability_", ...)
Arguments
x |
A ml_estimator , ml_transformer (or a list thereof), or ml_model object. |
dataset |
A tbl_spark . |
... |
Optional arguments; currently unused. |
probability_prefix |
String used to prepend the class probability output columns. |
Value
When x
is an estimator, ml_fit()
returns a transformer whereas ml_fit_and_transform()
returns a transformed dataset.
When x
is a transformer, ml_transform()
and ml_predict()
return a transformed dataset.
When ml_predict()
is called on a ml_model
object, additional columns (e.g.
probabilities in case of classification models) are appended to the transformed output for the user's convenience.
Details
These methods are mirrors of the corresponding sdf-transform-methods.
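A minimal sketch of the fit/transform flow, assuming a local connection sc and the iris data:
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# A StringIndexer is an estimator; fitting it yields a transformer
indexer <- ft_string_indexer(sc, input_col = "Species", output_col = "label")
is_ml_estimator(indexer)
indexer_model <- ml_fit(indexer, iris_tbl)
is_ml_transformer(indexer_model)
ml_transform(indexer_model, iris_tbl)
# Equivalent one-step shortcut
ml_fit_and_transform(indexer, iris_tbl)
}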
Spark ML -- Gaussian Mixture clustering.
Arguments
Value
See also
Examples
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs).
A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than tol
, or until it has reached the max number of iterations.
While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
ml_gaussian_mixture(x, formula = NULL, k = 2, max_iter = 100,
tol = 0.01, seed = NULL, features_col = "features",
prediction_col = "prediction", probability_col = "probability",
uid = random_string("gaussian_mixture_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
tol |
Param for the convergence tolerance for iterative algorithms. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities.
Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the clustering estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, an estimator is constructed then
immediately fit with the input tbl_spark
, returning a clustering model.
tbl_spark
, with formula
or features
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the estimator.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
This signature does not apply to ml_lda()
.
See also
See http://spark.apache.org/docs/latest/ml-clustering.html for
more information on the set of clustering algorithms.
Other ml clustering algorithms: ml_bisecting_kmeans
,
ml_kmeans
, ml_lda
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
gmm_model <- ml_gaussian_mixture(iris_tbl, Species ~ .)
pred <- sdf_predict(iris_tbl, gmm_model)
ml_clustering_evaluator(pred)
}
Spark ML -- ML Params
Arguments
Helper methods for working with parameters for ML objects.
ml_is_set(x, param, ...)
ml_param_map(x, ...)
ml_param(x, param, allow_null = FALSE, ...)
ml_params(x, params = NULL, allow_null = FALSE, ...)
Arguments
x |
A Spark ML object, either a pipeline stage or an evaluator. |
param |
The parameter to extract or set. |
... |
Optional arguments; currently unused. |
allow_null |
Whether to allow NULL results when extracting parameters.
If FALSE , an error will be thrown if the specified parameter is not found.
Defaults to FALSE . |
params |
A vector of parameters to extract. |
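A minimal sketch, assuming a local connection sc and that parameters are referred to by sparklyr's snake_case names:
if (FALSE) {
sc <- spark_connect(master = "local")
lr <- ml_logistic_regression(sc, max_iter = 50, reg_param = 0.01)
# Extract a single parameter, several at once, or the full parameter map
ml_param(lr, "max_iter")
ml_params(lr, c("max_iter", "reg_param"))
ml_param_map(lr)
# Check whether an optional parameter has been set
ml_is_set(lr, "weight_col")
}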
Spark ML -- Isotonic Regression
Arguments
Value
Details
See also
Examples
Currently implemented using parallelized pool adjacent violators algorithm.
Only univariate (single feature) algorithm supported.
ml_isotonic_regression(x, formula = NULL, feature_index = 0,
isotonic = TRUE, weight_col = NULL, features_col = "features",
label_col = "label", prediction_col = "prediction",
uid = random_string("isotonic_regression_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
feature_index |
Index of the feature if features_col is a vector column (default: 0), no effect otherwise. |
isotonic |
Whether the output sequence should be isotonic/increasing (true) or antitonic/decreasing (false).
Default: true |
weight_col |
The name of the column to use as weights for the model fit. |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark
, with formula
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_linear_regression
,
ml_linear_svc
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
iso_res <- iris_tbl %>%
ml_isotonic_regression(Petal_Length ~ Petal_Width)
pred <- ml_predict(iso_res, iris_test)
pred
}
Feature Transformation -- StringIndexer (Estimator)
Arguments
Value
Details
See also
A label indexer that maps a string column of labels to an ML column of
label indices.
If the input column is numeric, we cast it to string and
index the string values.
The indices are in [0, numLabels)
, ordered by
label frequencies.
So the most frequent label gets index 0.
This function
is the inverse of ft_index_to_string
.
ft_string_indexer(x, input_col = NULL, output_col = NULL,
handle_invalid = "error", string_order_type = "frequencyDesc",
uid = random_string("string_indexer_"), ...)
ml_labels(model)
ft_string_indexer_model(x, input_col = NULL, output_col = NULL, labels,
handle_invalid = "error",
uid = random_string("string_indexer_model_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries.
Options are
'skip' (filter out rows with invalid values), 'error' (throw an error), or
'keep' (keep invalid values in a special additional bucket).
Default: "error" |
string_order_type |
(Spark 2.3+) How to order labels of the string column.
The first label after ordering is assigned an index of 0.
Options are "frequencyDesc" , "frequencyAsc" , "alphabetDesc" , and "alphabetAsc" .
Defaults to "frequencyDesc" . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
model |
A fitted StringIndexer model returned by ft_string_indexer() |
labels |
Vector of labels, corresponding to indices to be assigned. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns a ml_transformer
,
a ml_estimator
, or one of their subclasses.
The object contains a pointer to
a Spark Transformer
or Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the transformer or estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a transformer is constructed then
immediately applied to the input tbl_spark
, returning a tbl_spark
ml_labels()
returns a vector of labels, corresponding to indices to be assigned.
Details
In the case where x
is a tbl_spark
, the estimator fits against x
to obtain a transformer, which is then immediately used to transform x
, returning a tbl_spark
.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ft_index_to_string
Other feature transformers: ft_binarizer
,
ft_bucketizer
,
ft_chisq_selector
,
ft_count_vectorizer
, ft_dct
,
ft_elementwise_product
,
ft_feature_hasher
,
ft_hashing_tf
, ft_idf
,
ft_imputer
,
ft_index_to_string
,
ft_interaction
, ft_lsh
,
ft_max_abs_scaler
,
ft_min_max_scaler
, ft_ngram
,
ft_normalizer
,
ft_one_hot_encoder_estimator
,
ft_one_hot_encoder
, ft_pca
,
ft_polynomial_expansion
,
ft_quantile_discretizer
,
ft_r_formula
,
ft_regex_tokenizer
,
ft_sql_transformer
,
ft_standard_scaler
,
ft_stop_words_remover
,
ft_tokenizer
,
ft_vector_assembler
,
ft_vector_indexer
,
ft_vector_slicer
, ft_word2vec
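A minimal illustrative sketch, assuming a local connection sc and the iris data; species_idx is an arbitrary output column name.
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Applied to a tbl_spark, the estimator is fit and the data transformed in one step
iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  distinct(Species, species_idx)
# Fit the estimator explicitly to keep the model and recover its labels
indexer_model <- ft_string_indexer(
  sc, input_col = "Species", output_col = "species_idx"
) %>%
  ml_fit(iris_tbl)
ml_labels(indexer_model)
}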
Spark ML -- LinearSVC
Arguments
Value
Details
See also
Examples
Perform classification using linear support vector machines (SVM).
This binary classifier optimizes the Hinge Loss using the OWLQN optimizer.
Only supports L2 regularization currently.
ml_linear_svc(x, formula = NULL, fit_intercept = TRUE, reg_param = 0,
max_iter = 100, standardization = TRUE, weight_col = NULL,
tol = 1e-06, threshold = 0, aggregation_depth = 2,
features_col = "features", label_col = "label",
prediction_col = "prediction", raw_prediction_col = "rawPrediction",
uid = random_string("linear_svc_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
Used when x is a tbl_spark .
R formula as a character string or a formula.
This is used to transform the input dataframe before fitting, see ft_r_formula for details. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
standardization |
Whether to standardize the training features before fitting the model. |
weight_col |
The name of the column to use as weights for the model fit. |
tol |
Param for the convergence tolerance for iterative algorithms. |
threshold |
The threshold in binary classification prediction, in range [0, 1]. |
aggregation_depth |
(Spark 2.1.0+) Suggested depth for treeAggregate (>= 2). |
features_col |
Features column name, as a length-one character vector.
The column should be a single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
prediction_col |
Prediction column name. |
raw_prediction_col |
Raw prediction (a.k.a.
confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_estimator
object.
The object contains a pointer to
a Spark Predictor
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the predictor appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a predictor is constructed then
immediately fit with the input tbl_spark
, returning a prediction model.
tbl_spark
, with formula
specified: When formula
is specified, the input tbl_spark
is first transformed using a
RFormula
transformer before being fit by
the predictor.
The object returned in this case is a ml_model
which is a
wrapper of a ml_pipeline_model
.
Details
When x
is a tbl_spark
and formula
(alternatively, response
and features
) is specified, the function returns a ml_model
object wrapping a ml_pipeline_model
which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels.
For classification, an optional argument predicted_label_col
(defaults to "predicted_label"
) can be used to specify the name of the predicted label column.
In addition to the fitted ml_pipeline_model
, ml_model
objects also contain a ml_pipeline
object where the ML predictor stage is an estimator ready to be fit against data.
This is utilized by ml_save
with type = "pipeline"
to facilitate model refresh workflows.
See also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for
more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression
,
ml_decision_tree_classifier
,
ml_gbt_classifier
,
ml_generalized_linear_regression
,
ml_isotonic_regression
,
ml_linear_regression
,
ml_logistic_regression
,
ml_multilayer_perceptron_classifier
,
ml_naive_bayes
,
ml_one_vs_rest
,
ml_random_forest_classifier
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
partitions <- iris_tbl %>%
filter(Species != "setosa") %>%
sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test
svc_model <- iris_training %>%
ml_linear_svc(Species ~ .)
pred <- ml_predict(svc_model, iris_test)
ml_binary_classification_evaluator(pred)
}
Spark ML -- Model Persistence
Arguments
Value
Save/load Spark ML objects
ml_save(x, path, overwrite = FALSE, ...)
# S3 method for ml_model
ml_save(x, path, overwrite = FALSE,
type = c("pipeline_model", "pipeline"), ...)
ml_load(sc, path)
Arguments
x |
A ML object, which could be a ml_pipeline_stage or a ml_model |
path |
The path where the object is to be serialized/deserialized. |
overwrite |
Whether to overwrite the existing path, defaults to FALSE . |
... |
Optional arguments; currently unused. |
type |
Whether to save the pipeline model or the pipeline. |
sc |
A Spark connection. |
Value
ml_save()
serializes a Spark object into a format that can be read back into sparklyr
or by the Scala or PySpark APIs.
When called on ml_model
objects, i.e.
those that were created via the tbl_spark - formula
signature, the associated pipeline model is serialized.
In other words, the saved model contains both the data processing (RFormulaModel
) stage and the machine learning stage.
ml_load()
reads a saved Spark object into sparklyr
.
It calls the correct Scala load
method based on parsing the saved metadata.
Note that a PipelineModel
object saved from a sparklyr ml_model
via ml_save()
will be read back in as an ml_pipeline_model
, rather than the ml_model
object.
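A minimal sketch, assuming a local connection sc; the temporary path is arbitrary.
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
lr_model <- ml_logistic_regression(iris_tbl, Species ~ .)
# Persist the fitted pipeline model, then read it back as an ml_pipeline_model
model_path <- file.path(tempdir(), "iris_lr_model")
ml_save(lr_model, model_path, overwrite = TRUE)
reloaded <- ml_load(sc, model_path)
ml_predict(reloaded, iris_tbl)
}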
Spark ML -- Pipelines
Arguments
Value
Create Spark ML Pipelines
ml_pipeline(x, ..., uid = random_string("pipeline_"))
Arguments
x |
Either a spark_connection or ml_pipeline_stage objects |
... |
ml_pipeline_stage objects. |
uid |
A character string used to uniquely identify the ML estimator. |
Value
When x
is a spark_connection
, ml_pipeline()
returns an empty pipeline object.
When x
is a ml_pipeline_stage
, ml_pipeline()
returns an ml_pipeline
with the stages set to x
and any transformers or estimators given in ...
.
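A minimal sketch showing both construction styles, assuming a local connection sc:
if (FALSE) {
sc <- spark_connect(master = "local")
# Start from an empty pipeline and pipe stages into it
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_logistic_regression()
# Or construct the pipeline directly from existing stage objects
ml_pipeline(
  ft_r_formula(sc, Species ~ .),
  ml_logistic_regression(sc)
)
}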
Spark ML -- Pipeline stage extraction
Arguments
Value
Extraction of stages from a Pipeline or PipelineModel object.
ml_stage(x, stage)
ml_stages(x, stages = NULL)
Arguments
x |
A ml_pipeline or a ml_pipeline_model object |
stage |
The UID of a stage in the pipeline. |
stages |
The UIDs of stages in the pipeline as a character vector. |
Value
For ml_stage()
: The stage specified.
For ml_stages()
: A list of stages.
If stages
is not set, the function returns all stages of the pipeline in a list.
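A minimal sketch, assuming a local connection sc; the explicit uid is set only to make the stage easy to look up.
if (FALSE) {
sc <- spark_connect(master = "local")
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_logistic_regression(uid = "logreg")
# Extract a single stage by its UID, or all stages as a list
ml_stage(pipeline, "logreg")
ml_stages(pipeline)
}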
Standardize Formula Input for `ml_model`
Arguments
Generates a formula string from user inputs, to be used in the `ml_model` constructor.
ml_standardize_formula(formula = NULL, response = NULL,
features = NULL)
Arguments
formula |
The `formula` argument. |
response |
The `response` argument. |
features |
The `features` argument. |
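An illustrative sketch; the expectation that both calls describe the same model specification is an assumption based on the description above.
if (FALSE) {
# A formula passed directly, versus the deprecated response/features inputs
ml_standardize_formula(formula = Species ~ Petal_Width + Petal_Length)
ml_standardize_formula(
  response = "Species",
  features = c("Petal_Width", "Petal_Length")
)
}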
Spark ML -- Extraction of summary metrics
Arguments
Extracts a metric from the summary object of a Spark ML model.
ml_summary(x, metric = NULL, allow_null = FALSE)
Arguments
x |
A Spark ML model that has a summary. |
metric |
The name of the metric to extract.
If not set, returns the summary object. |
allow_null |
Whether null results are allowed when the metric is not found in the summary. |
Spark ML -- UID
Arguments
Extracts the UID of an ML object.
ml_uid(x)
Arguments
x |
A Spark ML object. |
Feature Transformation -- CountVectorizer (Estimator)
Arguments
Value
Details
See also
Extracts a vocabulary from document collections.
ft_count_vectorizer(x, input_col = NULL, output_col = NULL,
binary = FALSE, min_df = 1, min_tf = 1, vocab_size = 2^18,
uid = random_string("count_vectorizer_"), ...)
ml_vocabulary(model)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
binary |
Binary toggle to control the output vector values.
If TRUE , all nonzero counts (after min_tf filter applied)
are set to 1.
This is useful for discrete probabilistic models that
model binary events rather than integer counts.
Default: FALSE |
min_df |
Specifies the minimum number of different documents a
term must appear in to be included in the vocabulary.
If this is an
integer greater than or equal to 1, this specifies the number of
documents the term must appear in; if this is a double in [0,1), then
this specifies the fraction of documents.
Default: 1. |
min_tf |
Filter to ignore rare words in a document.
For each
document, terms with frequency/count less than the given threshold
are ignored.
If this is an integer greater than or equal to 1, then
this specifies a count (of times the term must appear in the document);
if this is a double in [0,1), then this specifies a fraction (out of
the document's token count).
Default: 1. |
vocab_size |
Build a vocabulary that only considers the top vocab_size terms ordered by term frequency across the corpus.
Default: 2^18 . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
model |
A ml_count_vectorizer_model . |
Value
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns a ml_transformer
,
a ml_estimator
, or one of their subclasses.
The object contains a pointer to
a Spark Transformer
or Estimator
object and can be used to compose
Pipeline
objects.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the transformer or estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a transformer is constructed then
immediately applied to the input tbl_spark
, returning a tbl_spark
ml_vocabulary()
returns the fitted vocabulary as a character vector.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
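This entry has no Examples section; a minimal sketch (not from the original documentation) follows, assuming a small made-up corpus tokenized with ft_tokenizer() first. The explicit ml_fit() step is shown only to illustrate ml_vocabulary() and is an assumed workflow:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
docs <- data.frame(text = c("the cat sat on the mat", "the dog sat"))
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)
tokens_tbl <- docs_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words")
# apply the estimator directly to a tbl_spark
tokens_tbl %>%
  ft_count_vectorizer(input_col = "words", output_col = "features", min_df = 1)
# or fit it explicitly to inspect the learned vocabulary
cv_model <- ml_fit(ft_count_vectorizer(sc, input_col = "words", output_col = "features"),
                   tokens_tbl)
ml_vocabulary(cv_model)
}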
Feature Transformation -- Binarizer (Transformer)
Arguments
Value
See also
Examples
Apply thresholding to a column, such that values less than or equal to the threshold
are assigned the value 0.0, and values greater than the
threshold are assigned the value 1.0.
Column output is numeric for
compatibility with other modeling functions.
ft_binarizer(x, input_col, output_col, threshold = 0,
uid = random_string("binarizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
threshold |
Threshold used to binarize continuous features. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_binarizer(input_col = "Sepal_Length",
               output_col = "Sepal_Length_bin",
               threshold = 5) %>%
  select(Sepal_Length, Sepal_Length_bin, Species)
}
Feature Transformation -- Bucketizer (Transformer)
Arguments
Value
See also
Examples
Similar to R's cut
function, this transforms a numeric column
into a discretized column, with breaks specified through the splits
parameter.
ft_bucketizer(x, input_col = NULL, output_col = NULL, splits = NULL,
input_cols = NULL, output_cols = NULL, splits_array = NULL,
handle_invalid = "error", uid = random_string("bucketizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
splits |
A numeric vector of cutpoints, indicating the bucket boundaries. |
input_cols |
Names of input columns. |
output_cols |
Names of output columns. |
splits_array |
Parameter for specifying multiple splits parameters.
Each
element in this array can be used to map continuous features into buckets. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries.
Options are
'skip' (filter out rows with invalid values), 'error' (throw an error), or
'keep' (keep invalid values in a special additional bucket).
Default: "error" |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_bucketizer(input_col = "Sepal_Length",
                output_col = "Sepal_Length_bucket",
                splits = c(0, 4.5, 5, 8)) %>%
  select(Sepal_Length, Sepal_Length_bucket, Species)
}
Feature Transformation -- ChiSqSelector (Estimator)
Arguments
Value
Details
See also
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.
ft_chisq_selector(x, features_col = "features", output_col = NULL,
label_col = "label", selector_type = "numTopFeatures", fdr = 0.05,
fpr = 0.05, fwe = 0.05, num_top_features = 50, percentile = 0.1,
uid = random_string("chisq_selector_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
features_col |
Features column name, as a length-one character vector.
The column should be single vector column of numeric values.
Usually this column is output by ft_r_formula . |
output_col |
The name of the output column. |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
selector_type |
(Spark 2.1.0+) The selector type of the ChisqSelector.
Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe". |
fdr |
(Spark 2.2.0+) The upper bound of the expected false discovery rate.
Only applicable when selector_type = "fdr".
Default value is 0.05. |
fpr |
(Spark 2.1.0+) The highest p-value for features to be kept.
Only applicable when selector_type= "fpr".
Default value is 0.05. |
fwe |
(Spark 2.2.0+) The upper bound of the expected family-wise error rate.
Only applicable when selector_type = "fwe".
Default value is 0.05. |
num_top_features |
Number of features that selector will select, ordered by ascending p-value.
If the number of features is less than num_top_features , then this will select all features.
Only applicable when selector_type = "numTopFeatures".
The default value of num_top_features is 50. |
percentile |
(Spark 2.1.0+) Percentile of features that selector will select, ordered by statistics value descending.
Only applicable when selector_type = "percentile".
Default value is 0.1. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
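No Examples section is provided here; the sketch below (not from the original documentation) builds the features and label columns with ft_r_formula(), as the argument descriptions suggest, and keeps only the single strongest feature:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_r_formula(Species ~ Petal_Length + Petal_Width) %>%
  ft_chisq_selector(features_col = "features", label_col = "label",
                    output_col = "selected", num_top_features = 1)
}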
Feature Transformation -- Discrete Cosine Transform (DCT) (Transformer)
Arguments
Value
Details
See also
A feature transformer that takes the 1D discrete cosine transform of a real
vector.
No zero padding is performed on the input vector.
It returns a real
vector of the same length representing the DCT.
The return vector is scaled
such that the transform matrix is unitary (aka scaled DCT-II).
ft_dct(x, input_col = NULL, output_col = NULL, inverse = FALSE,
uid = random_string("dct_"), ...)
ft_discrete_cosine_transform(x, input_col, output_col, inverse = FALSE,
uid = random_string("dct_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
inverse |
Indicates whether to perform the inverse DCT (TRUE) or forward DCT (FALSE). |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
ft_discrete_cosine_transform() is an alias for ft_dct for backwards compatibility.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
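A minimal sketch (not from the original documentation), assembling the iris measurements into a vector column first, in the same style as the scaler examples elsewhere on this site:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features, output_col = "features_vec") %>%
  ft_dct(input_col = "features_vec", output_col = "features_dct")
}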
Feature Transformation -- ElementwiseProduct (Transformer)
Arguments
Value
See also
Outputs the Hadamard product (i.e., the element-wise product) of each input vector
with a provided "weight" vector.
In other words, it scales each column of the
dataset by a scalar multiplier.
ft_elementwise_product(x, input_col = NULL, output_col = NULL,
scaling_vec = NULL, uid = random_string("elementwise_product_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
scaling_vec |
the vector to multiply with input vectors |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
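A minimal sketch (not from the original documentation); passing scaling_vec as a plain numeric vector, with one weight per assembled column, is an assumption about the R-side interface:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features, output_col = "features_vec") %>%
  ft_elementwise_product(input_col = "features_vec", output_col = "features_scaled",
                         scaling_vec = c(2, 1, 0.5, 1))  # assumed: one weight per column
}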
Feature Transformation -- FeatureHasher (Transformer)
Arguments
Value
Details
See also
Projects a set of categorical or numerical features into a feature vector of a specified dimension using the hashing trick.
ft_feature_hasher(x, input_cols = NULL, output_col = NULL,
num_features = 2^18, categorical_cols = NULL,
uid = random_string("feature_hasher_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
Names of input columns. |
output_col |
Name of output column. |
num_features |
Number of features.
Defaults to \(2^18\). |
categorical_cols |
Numeric columns to treat as categorical features.
By default only string and boolean columns are treated as categorical,
so this param can be used to explicitly specify the numerical columns to
treat as categorical. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categorical_cols.
- String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with drop_last = FALSE).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the num_features parameter; otherwise the features will not be mapped evenly to the vector indices.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
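A minimal sketch (not from the original documentation), hashing one string column and two numeric columns of iris into a feature vector; num_features = 2^10 is an arbitrary power of two chosen only for illustration:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_feature_hasher(input_cols = c("Species", "Petal_Length", "Petal_Width"),
                    output_col = "features", num_features = 2^10)
}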
Feature Transformation -- HashingTF (Transformer)
Arguments
Value
See also
Maps a sequence of terms to their term frequencies using the hashing trick.
ft_hashing_tf(x, input_col = NULL, output_col = NULL, binary = FALSE,
num_features = 2^18, uid = random_string("hashing_tf_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
binary |
Binary toggle to control term frequency counts.
If true, all non-zero counts are set to 1.
This is useful for discrete
probabilistic models that model binary events rather than integer
counts.
(default = FALSE ) |
num_features |
Number of features.
Should be greater than 0.
(default = 2^18 ) |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
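A minimal sketch (not from the original documentation), tokenizing a made-up text column before hashing:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("spark is fast", "sparklyr talks to spark from R"))
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_hashing_tf(input_col = "words", output_col = "tf", num_features = 2^10)
}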
Feature Transformation -- IDF (Estimator)
Arguments
Value
Details
See also
Compute the Inverse Document Frequency (IDF) given a collection of documents.
ft_idf(x, input_col = NULL, output_col = NULL, min_doc_freq = 0,
uid = random_string("idf_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
min_doc_freq |
The minimum number of documents in which a term should appear.
Default: 0 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
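A minimal sketch (not from the original documentation): IDF is typically applied to the term-frequency vectors produced by ft_hashing_tf() or ft_count_vectorizer():
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("spark is fast", "sparklyr talks to spark from R"))
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_hashing_tf(input_col = "words", output_col = "tf", num_features = 2^10) %>%
  ft_idf(input_col = "tf", output_col = "tf_idf")
}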
Feature Transformation -- Imputer (Estimator)
Arguments
Value
Details
See also
Imputation estimator for completing missing values, either using the mean or
the median of the columns in which the missing values are located.
The input
columns should be of numeric type.
This function requires Spark 2.2.0+.
ft_imputer(x, input_cols = NULL, output_cols = NULL,
missing_value = NULL, strategy = "mean",
uid = random_string("imputer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
The names of the input columns |
output_cols |
The names of the output columns. |
missing_value |
The placeholder for the missing values.
All occurrences of missing_value will be imputed.
Note that null values are always treated
as missing. |
strategy |
The imputation strategy.
Currently only "mean" and "median" are
supported.
If "mean", then replace missing values using the mean value of the
feature.
If "median", then replace missing values using the approximate median
value of the feature.
Default: mean |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
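A minimal sketch (not from the original documentation); NA values copied to Spark become nulls, which the imputer always treats as missing. This assumes a Spark 2.2.0+ connection:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
df <- data.frame(a = c(1, 2, NA, 4), b = c(NA, 10, 20, 30))
df_tbl <- sdf_copy_to(sc, df, name = "df_tbl", overwrite = TRUE)
df_tbl %>%
  ft_imputer(input_cols = c("a", "b"),
             output_cols = c("a_imputed", "b_imputed"),
             strategy = "median")
}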
Feature Transformation -- IndexToString (Transformer)
Arguments
Value
See also
A Transformer that maps a column of indices back to a new column of
corresponding string values.
The index-string mapping is either from
the ML attributes of the input column, or from user-supplied labels
(which take precedence over ML attributes).
This function is the inverse
of ft_string_indexer
.
ft_index_to_string(x, input_col = NULL, output_col = NULL,
labels = NULL, uid = random_string("index_to_string_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
labels |
Optional param for array of labels specifying index-string mapping. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ft_string_indexer
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
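A minimal sketch (not from the original documentation), indexing a string column and then mapping the indices back to strings:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  ft_index_to_string(input_col = "species_idx", output_col = "species_str") %>%
  select(Species, species_idx, species_str)
}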
Feature Transformation -- Interaction (Transformer)
Arguments
Value
See also
Implements the feature interaction transform.
This transformer takes in Double and
Vector type columns and outputs a flattened vector of their feature interactions.
To handle interaction, we first one-hot encode any nominal features.
Then, a
vector of the feature cross-products is produced.
ft_interaction(x, input_cols = NULL, output_col = NULL,
uid = random_string("interaction_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
The names of the input columns |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
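A minimal sketch (not from the original documentation), crossing two numeric columns into a flattened interaction vector:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_interaction(input_cols = c("Sepal_Length", "Sepal_Width"),
                 output_col = "sepal_interaction")
}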
Feature Transformation -- LSH (Estimator)
Arguments
Value
Details
See also
Locality Sensitive Hashing functions for Euclidean distance
(Bucketed Random Projection) and Jaccard distance (MinHash).
ft_bucketed_random_projection_lsh(x, input_col = NULL,
output_col = NULL, bucket_length = NULL, num_hash_tables = 1,
seed = NULL, uid = random_string("bucketed_random_projection_lsh_"),
...)
ft_minhash_lsh(x, input_col = NULL, output_col = NULL,
num_hash_tables = 1L, seed = NULL,
uid = random_string("minhash_lsh_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
bucket_length |
The length of each hash bucket, a larger bucket lowers the
false negative rate.
The number of buckets will be (max L2 norm of input vectors) /
bucketLength. |
num_hash_tables |
Number of hash tables used in LSH OR-amplification.
LSH
OR-amplification can be used to reduce the false negative rate.
Higher values
for this param lead to a reduced false negative rate, at the expense of added
computational complexity. |
seed |
A random seed.
Set this value if you need your results to be
reproducible across repeated calls. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ft_lsh_utils
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
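A minimal sketch (not from the original documentation) of the Euclidean-distance variant; bucket_length = 2, num_hash_tables = 3, and the seed are arbitrary illustrative choices, and the assembler call mirrors the scaler examples on this site:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features, output_col = "features_vec") %>%
  ft_bucketed_random_projection_lsh(input_col = "features_vec", output_col = "hashes",
                                    bucket_length = 2, num_hash_tables = 3, seed = 42)
}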
Feature Transformation -- MaxAbsScaler (Estimator)
Arguments
Value
Details
See also
Examples
Rescale each feature individually to range [-1, 1] by dividing through the
largest maximum absolute value in each feature.
It does not shift/center the
data, and thus does not destroy any sparsity.
ft_max_abs_scaler(x, input_col = NULL, output_col = NULL,
uid = random_string("max_abs_scaler_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features,
                      output_col = "features_temp") %>%
  ft_max_abs_scaler(input_col = "features_temp",
                    output_col = "features")
}
Feature Transformation -- MinMaxScaler (Estimator)
Arguments
Value
Details
See also
Examples
Rescale each feature individually to a common range [min, max] linearly using
column summary statistics, which is also known as min-max normalization or
Rescaling
ft_min_max_scaler(x, input_col = NULL, output_col = NULL, min = 0,
max = 1, uid = random_string("min_max_scaler_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
min |
Lower bound after transformation, shared by all features Default: 0.0 |
max |
Upper bound after transformation, shared by all features Default: 1.0 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(input_col = features,
                      output_col = "features_temp") %>%
  ft_min_max_scaler(input_col = "features_temp",
                    output_col = "features")
}
Feature Transformation -- NGram (Transformer)
Arguments
Value
Details
See also
A feature transformer that converts the input array of strings into an array of n-grams.
Null values in the input array are ignored.
It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
ft_ngram(x, input_col = NULL, output_col = NULL, n = 2,
uid = random_string("ngram_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
n |
Minimum n-gram length, greater than or equal to 1.
Default: 2, bigram features |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
When the input is empty, an empty array is returned.
When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
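A minimal sketch (not from the original documentation), tokenizing a made-up sentence and then producing bigrams:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("the quick brown fox jumps over the lazy dog"))
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_ngram(input_col = "words", output_col = "bigrams", n = 2)
}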
Feature Transformation -- Normalizer (Transformer)
Arguments
Value
See also
Normalize a vector to have unit norm using the given p-norm.
ft_normalizer(x, input_col = NULL, output_col = NULL, p = 2,
uid = random_string("normalizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
p |
Normalization in L^p space.
Must be >= 1.
Defaults to 2. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Feature Transformation -- OneHotEncoder (Transformer)
Arguments
Value
See also
One-hot encoding maps a column of label indices to a column of binary
vectors, with at most a single one-value.
This encoding allows algorithms
which expect continuous features, such as Logistic Regression, to use
categorical features.
Typically, used with ft_string_indexer()
to
index a column first.
ft_one_hot_encoder(x, input_col = NULL, output_col = NULL,
drop_last = TRUE, uid = random_string("one_hot_encoder_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
drop_last |
Whether to drop the last category.
Defaults to TRUE . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
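A minimal sketch (not from the original documentation), following the ft_string_indexer() then ft_one_hot_encoder() pattern described above:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  ft_one_hot_encoder(input_col = "species_idx", output_col = "species_onehot")
}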
Feature Transformation -- OneHotEncoderEstimator (Estimator)
Arguments
Value
Details
See also
A one-hot encoder that maps a column of category indices
to a column of binary vectors, with at most a single one-value
per row that indicates the input category index.
For example
with 5 categories, an input value of 2.0 would map to an output
vector of [0.0, 0.0, 1.0, 0.0].
The last category is not included
by default (configurable via dropLast), because it makes the
vector entries sum up to one, and hence linearly dependent.
So
an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
ft_one_hot_encoder_estimator(x, input_cols = NULL, output_cols = NULL,
handle_invalid = "error", drop_last = TRUE,
uid = random_string("one_hot_encoder_estimator_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
Names of input columns. |
output_cols |
Names of output columns. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries.
Options are
'skip' (filter out rows with invalid values), 'error' (throw an error), or
'keep' (keep invalid values in a special additional bucket).
Default: "error" |
drop_last |
Whether to drop the last category.
Defaults to TRUE . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Feature Transformation -- PolynomialExpansion (Transformer)
Arguments
Value
See also
Perform feature expansion in a polynomial space. For example, take a 2-variable feature vector (x, y): expanding it with degree 2 yields (x, x * x, y, x * y, y * y).
ft_polynomial_expansion(x, input_col = NULL, output_col = NULL,
degree = 2, uid = random_string("polynomial_expansion_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
degree |
The polynomial degree to expand, which should be greater
than equal to 1.
A value of 1 means no expansion.
Default: 2 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Feature Transformation -- QuantileDiscretizer (Estimator)
Arguments
Value
Details
See also
ft_quantile_discretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the num_buckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
ft_quantile_discretizer(x, input_col = NULL, output_col = NULL,
num_buckets = 2, input_cols = NULL, output_cols = NULL,
num_buckets_array = NULL, handle_invalid = "error",
relative_error = 0.001, uid = random_string("quantile_discretizer_"),
...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
num_buckets |
Number of buckets (quantiles, or categories) into which data
points are grouped.
Must be greater than or equal to 2. |
input_cols |
Names of input columns. |
output_cols |
Names of output columns. |
num_buckets_array |
Array of number of buckets (quantiles, or categories)
into which data points are grouped.
Each value must be greater than or equal to 2. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries.
Options are
'skip' (filter out rows with invalid values), 'error' (throw an error), or
'keep' (keep invalid values in a special additional bucket).
Default: "error" |
relative_error |
(Spark 2.0.0+) Relative error (see documentation for
org.apache.spark.sql.DataFrameStatFunctions.approxQuantile
here
for description).
Must be in the range [0, 1].
default: 0.001 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handle_invalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relative_error parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values. Note that the result may be different every time you run it, since the sample strategy behind it is non-deterministic.
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ft_bucketizer
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
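A minimal sketch (not from the original documentation), binning one numeric column into quartiles:
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_quantile_discretizer(input_col = "Petal_Length", output_col = "petal_bin",
                          num_buckets = 4) %>%
  select(Petal_Length, petal_bin)
}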
Feature Transformation -- RFormula (Estimator)
Arguments
Value
Details
See also
Implements the transforms required for fitting a dataset against an R model
formula.
Currently we support a limited subset of the R operators, including ~, ., :, +, and -. Also see the R formula docs here:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html
ft_r_formula(x, formula = NULL, features_col = "features",
label_col = "label", force_index_label = FALSE,
uid = random_string("r_formula_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
formula |
R formula as a character string or a formula.
Formula objects are
converted to character strings directly and the environment is not captured. |
features_col |
Features column name, as a length-one character vector.
The column should be single vector column of numeric values.
Usually this column is output by ft_r_formula . |
label_col |
Label column name.
The column should be a numeric column.
Usually this column is output by ft_r_formula . |
force_index_label |
(Spark 2.1.0+) Force to index label whether it is numeric or
string type.
Usually we index label only when it is string type.
If
the formula was used by classification algorithms, we can force to index
label even it is numeric type by setting this param with true.
Default: FALSE . |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
The basic operators in the formula are:
~ separate target and terms
+ concat terms, "+ 0" means removing intercept
- remove a term, "- 1" means removing intercept
: interaction (multiplication for numeric values, or binarized categorical values)
. all columns except target
Suppose a and b are double columns; the following simple examples illustrate the effect of RFormula:
y ~ a + b means model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept and w1, w2 are coefficients.
y ~ a + b + a:b - 1 means model y ~ w1 * a + w2 * b + w3 * a * b, where w1, w2, w3 are coefficients.
RFormula produces a vector column of features and a double or string column of label. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If the label column is of type string, it will be first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
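The following is a minimal, hedged sketch (not part of the original reference) showing ft_r_formula applied to a local copy of the iris dataset; the connection, table name, and column names are assumptions.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# Produces a "features" vector column and an indexed "label" column
iris_tbl %>%
  ft_r_formula(Species ~ Petal_Length + Petal_Width)
}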
Feature Transformation -- RegexTokenizer (Transformer)
Arguments
Value
See also
A regex based tokenizer that extracts tokens either by using the provided
regex pattern to split the text (default) or repeatedly matching the regex
(if gaps
is false).
Optional parameters also allow filtering tokens using a
minimal length.
It returns an array of strings that can be empty.
ft_regex_tokenizer(x, input_col = NULL, output_col = NULL,
gaps = TRUE, min_token_length = 1, pattern = "\\s+",
to_lower_case = TRUE, uid = random_string("regex_tokenizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
gaps |
Indicates whether regex splits on gaps (TRUE) or matches tokens (FALSE). |
min_token_length |
Minimum token length, greater than or equal to 0. |
pattern |
The regular expression pattern to be used. |
to_lower_case |
Indicates whether to convert all characters to lowercase before tokenizing. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
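A minimal sketch, not from the original reference, assuming a local connection and a small hypothetical sentences table; it splits on any run of non-word characters.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("Hello, Spark!", "Regex-based tokenization."),
                        stringsAsFactors = FALSE)
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_regex_tokenizer(input_col = "text", output_col = "tokens", pattern = "\\W+")
}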
Feature Transformation -- StandardScaler (Estimator)
Arguments
Value
Details
See also
Examples
Standardizes features by removing the mean and scaling to unit variance using
column summary statistics on the samples in the training set.
The "unit std"
is computed using the corrected sample standard deviation, which is computed
as the square root of the unbiased sample variance.
ft_standard_scaler(x, input_col = NULL, output_col = NULL,
with_mean = FALSE, with_std = TRUE,
uid = random_string("standard_scaler_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
with_mean |
Whether to center the data with mean before scaling.
It will
build a dense output, so take care when applying to sparse input.
Default: FALSE |
with_std |
Whether to scale the data to unit standard deviation.
Default: TRUE |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
ft_vector_assembler(input_col = features,
output_col = "features_temp") %>%
ft_standard_scaler(input_col = "features_temp",
output_col = "features",
with_mean = TRUE)
}
Feature Transformation -- StopWordsRemover (Transformer)
Arguments
Value
See also
A feature transformer that filters out stop words from input.
ft_stop_words_remover(x, input_col = NULL, output_col = NULL,
case_sensitive = FALSE,
stop_words = ml_default_stop_words(spark_connection(x), "english"),
uid = random_string("stop_words_remover_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
case_sensitive |
Whether to do a case sensitive comparison over the stop words. |
stop_words |
The words to be filtered out. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
ml_default_stop_words
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
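A minimal sketch, not from the original reference, assuming a local connection; the sentences table and column names are hypothetical, and the default English stop words are used.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sentences <- data.frame(text = c("The quick brown fox", "jumps over the lazy dog"),
                        stringsAsFactors = FALSE)
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl", overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_stop_words_remover(input_col = "words", output_col = "terms")
}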
Feature Transformation -- Tokenizer (Transformer)
Arguments
Value
See also
A tokenizer that converts the input string to lowercase and then splits it
by white spaces.
ft_tokenizer(x, input_col = NULL, output_col = NULL,
uid = random_string("tokenizer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
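A minimal sketch, not from the original reference, assuming a local connection and a hypothetical one-column text table.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sentences_tbl <- sdf_copy_to(sc,
  data.frame(text = "Spark is fun", stringsAsFactors = FALSE),
  name = "sentences_tbl", overwrite = TRUE)
# Lowercases the text and splits it on whitespace into a "words" array column
sentences_tbl %>% ft_tokenizer(input_col = "text", output_col = "words")
}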
Feature Transformation -- VectorAssembler (Transformer)
Arguments
Value
See also
Combine multiple vectors into a single row-vector; that is,
where each row element of the newly generated column is a
vector formed by concatenating each row element from the
specified input columns.
ft_vector_assembler(x, input_cols = NULL, output_col = NULL,
uid = random_string("vector_assembler_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_cols |
The names of the input columns |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_indexer, ft_vector_slicer, ft_word2vec
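A minimal sketch, not from the original reference, assuming a local connection and the iris dataset copied to Spark (column names follow sparklyr's underscore convention).
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features")
}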
Feature Transformation -- VectorIndexer (Estimator)
Arguments
Value
Details
See also
Indexing categorical feature columns in a dataset of Vector.
ft_vector_indexer(x, input_col = NULL, output_col = NULL,
max_categories = 20, uid = random_string("vector_indexer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
max_categories |
Threshold for the number of values a categorical feature can take.
If a feature is found to have > max_categories values, then it is declared continuous.
Must be greater than or equal to 2.
Defaults to 20. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_slicer, ft_word2vec
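A minimal sketch, not from the original reference, assuming a local connection and the iris dataset; the assembled vector column is indexed with a hypothetical max_categories threshold.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_vector_assembler(input_cols = c("Sepal_Length", "Sepal_Width"),
                      output_col = "features") %>%
  ft_vector_indexer(input_col = "features", output_col = "indexed",
                    max_categories = 10)
}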
Feature Transformation -- VectorSlicer (Transformer)
Arguments
Value
See also
Takes a feature vector and outputs a new feature vector with a subarray of the original features.
ft_vector_slicer(x, input_col = NULL, output_col = NULL,
indices = NULL, uid = random_string("vector_slicer_"), ...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
indices |
A vector of indices to select features from a vector column.
Note that the indices are 0-based. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_word2vec
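A minimal sketch, not from the original reference, assuming a local connection and the iris dataset; the 0-based indices 2 and 3 select the petal measurements from the assembled vector.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features") %>%
  ft_vector_slicer(input_col = "features", output_col = "petal_features",
                   indices = c(2L, 3L))
}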
Feature Transformation -- SQLTransformer
Arguments
Value
Details
See also
Implements the transformations which are defined by a SQL statement.
Currently we only support SQL syntax like 'SELECT ... FROM __THIS__ ...' where '__THIS__' represents the underlying table of the input dataset.
The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports.
Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns.
ft_sql_transformer(x, statement = NULL,
uid = random_string("sql_transformer_"), ...)
ft_dplyr_transformer(x, tbl, uid = random_string("dplyr_transformer_"),
...)
Arguments
x |
A spark_connection , ml_pipeline , or a tbl_spark . |
statement |
A SQL statement. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
tbl |
A tbl_spark generated using dplyr transformations. |
Value
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
Details
ft_dplyr_transformer() is a wrapper around ft_sql_transformer() that takes a tbl_spark instead of a SQL statement.
Internally, ft_dplyr_transformer() extracts the dplyr transformations used to generate tbl as a SQL statement, then passes it on to ft_sql_transformer().
Note that only single-table dplyr verbs are supported; the sdf_ family of functions is not.
See also
See http://spark.apache.org/docs/latest/ml-features.html for
more information on the set of transformations available for DataFrame
columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
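A minimal sketch, not from the original reference, assuming a local connection and the iris dataset; the derived column name is hypothetical. The first pipeline uses a raw SQL statement against __THIS__, the second derives the equivalent SQL from a dplyr pipeline.
if (FALSE) {
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# SQL variant: __THIS__ stands in for the input dataset
iris_tbl %>%
  ft_sql_transformer("SELECT *, Petal_Length * Petal_Width AS petal_area FROM __THIS__")
# dplyr variant: wrap a dplyr-derived tbl_spark into a pipeline stage
transformed <- iris_tbl %>% mutate(petal_area = Petal_Length * Petal_Width)
pipeline <- ml_pipeline(sc) %>% ft_dplyr_transformer(transformed)
}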
Compile Scala sources into a Java Archive (jar)
Arguments
Compile the scala source files contained within an R package into a Java Archive (jar) file that can be loaded and used within a Spark environment.
compile_package_jars(..., spec = NULL)
Arguments
... |
Optional compilation specifications, as generated by spark_compilation_spec .
When no arguments are passed, spark_default_compilation_spec is used instead. |
spec |
An optional list of compilation specifications.
When
set, this option takes precedence over arguments passed to ... . |
Read configuration values for a connection
Arguments
Value
Read configuration values for a connection
connection_config(sc, prefix, not_prefix = list())
Arguments
sc |
spark_connection |
prefix |
Prefix to read parameters for
(e.g. spark.context. , spark.sql. , etc.) |
not_prefix |
Prefix to not include. |
Value
Named list of config parameters (note that if a prefix was
specified then the names will not include the prefix)
Downloads default Scala Compilers
Arguments
Details
compile_package_jars requires several versions of the scala compiler to work; this is to match Spark scala versions.
To help set up your environment, this function will download the required compilers under the default search path.
download_scalac(dest_path = NULL)
Arguments
dest_path |
The destination path where scalac will be
downloaded to. |
Details
See find_scalac
for a list of paths searched and used by
this function to install the required compilers.
Discover the Scala Compiler
Arguments
Find the scalac compiler for a particular version of scala, by scanning some common directories containing scala installations.
find_scalac(version, locations = NULL)
Arguments
version |
The scala version to search for.
Versions
of the form major.minor will be matched against the scalac installation with version major.minor.patch ;
if multiple compilers are discovered the most recent one will be
used. |
locations |
Additional locations to scan.
By default, the
directories /opt/scala and /usr/local/scala will
be scanned. |
Access the Spark API
Arguments
Details
Spark Context
Java Spark Context
Hive Context
Spark Session
Access the commonly-used Spark objects associated with a Spark instance.
These objects provide access to different facets of the Spark API.
spark_context(sc)
java_context(sc)
hive_context(sc)
spark_session(sc)
Arguments
Details
The Scala API documentation
is useful for discovering what methods are available for each of these
objects.
Use invoke
to call methods on these objects.
Spark Context
The main entry point for Spark functionality.
The Spark Context
represents the connection to a Spark cluster, and can be used to create RDD
s, accumulators and broadcast variables on that cluster.
Java Spark Context
A Java-friendly version of the aforementioned Spark Context.
Hive Context
An instance of the Spark SQL execution engine that integrates with data
stored in Hive.
Configuration for Hive is read from hive-site.xml
on
the classpath.
Starting with Spark >= 2.0.0, the Hive Context class has been deprecated -- it is superseded by the Spark Session class, and hive_context will return a Spark Session object instead.
Note that both classes share a SQL interface, and therefore one can invoke
SQL through these objects.
Spark Session
Available since Spark 2.0.0, the Spark Session unifies the
Spark Context and Hive Context classes into a single
interface.
Its use is recommended over the older APIs for code
targeting Spark 2.0.0 and above.
Runtime configuration interface for Hive
Arguments
Retrieves the runtime configuration interface for Hive.
hive_context_config(sc)
Arguments
Invoke a Method on a JVM Object
Arguments
Details
Examples
Invoke methods on Java object references.
These functions provide a
mechanism for invoking various Java object methods directly from R.
invoke(jobj, method, ...)
invoke_static(sc, class, method, ...)
invoke_new(sc, class, ...)
Arguments
jobj |
An R object acting as a Java object reference (typically, a spark_jobj ). |
method |
The name of the method to be invoked. |
... |
Optional arguments, currently unused. |
sc |
A spark_connection . |
class |
The name of the Java class whose methods should be invoked. |
Details
Use each of these functions in the following scenarios:
invoke | Execute a method on a Java object reference (typically, a spark_jobj ). |
invoke_static | Execute a static method associated with a Java class. |
invoke_new | Invoke a constructor associated with a Java class. |
Examples
sc <- spark_connect(master = "spark://HOST:PORT")
spark_context(sc) %>%
invoke("textFile", "file.csv", 1L) %>%
invoke("count")
Register a Package that Implements a Spark Extension
Arguments
Note
Registering an extension package will result in the package being
automatically scanned for spark dependencies when a connection to Spark is
created.
register_extension(package)
registered_extensions()
Arguments
package |
The package(s) to register. |
Note
Packages should typically register their extensions in their
.onLoad
hook -- this ensures that their extensions are registered
when their namespaces are loaded.
Define a Spark Compilation Specification
Arguments
Details
For use with compile_package_jars
.
The Spark compilation
specification is used when compiling Spark extension Java Archives, and
defines which versions of Spark, as well as which versions of Scala, should
be used for compilation.
spark_compilation_spec(spark_version = NULL, spark_home = NULL,
scalac_path = NULL, scala_filter = NULL, jar_name = NULL,
jar_path = NULL, jar_dep = NULL)
Arguments
spark_version |
The Spark version to build against.
This can
be left unset if the path to a suitable Spark home is supplied. |
spark_home |
The path to a Spark home installation.
This can
be left unset if spark_version is supplied; in such a case, sparklyr will attempt to discover the associated Spark
installation using spark_home_dir . |
scalac_path |
The path to the scalac compiler to be used
during compilation of your Spark extension.
Note that you should
ensure the version of scalac selected matches the version of scalac used with the version of Spark you are compiling against. |
scala_filter |
An optional R function that can be used to filter
which scala files are used during compilation.
This can be
useful if you have auxiliary files that should only be included with
certain versions of Spark. |
jar_name |
The name to be assigned to the generated jar . |
jar_path |
The path to the jar tool to be used
during compilation of your Spark extension. |
jar_dep |
An optional list of additional jar dependencies. |
Details
Most Spark extensions won't need to define their own compilation specification,
and can instead rely on the default behavior of compile_package_jars
.
Default Compilation Specification for Spark Extensions
Arguments
This is the default compilation specification used for
Spark extensions, when used with compile_package_jars
.
spark_default_compilation_spec(pkg = infer_active_package_name(),
locations = NULL)
Arguments
pkg |
The package containing Spark extensions to be compiled. |
locations |
Additional locations to scan.
By default, the
directories /opt/scala and /usr/local/scala will
be scanned. |
Retrieve the Spark Connection Associated with an R Object
Arguments
Retrieve the spark_connection
associated with an R object.
spark_connection(x, ...)
Arguments
x |
An R object from which a spark_connection can be obtained. |
... |
Optional arguments; currently unused. |
Runtime configuration interface for the Spark Context.
Arguments
Retrieves the runtime configuration interface for the Spark Context.
spark_context_config(sc)
Arguments
Retrieve a Spark DataFrame
Arguments
Value
This S3 generic is used to access a Spark DataFrame object (as a Java
object reference) from an R object.
spark_dataframe(x, ...)
Arguments
x |
An R object wrapping, or containing, a Spark DataFrame. |
... |
Optional arguments; currently unused. |
Value
A spark_jobj
representing a Java object reference
to a Spark DataFrame.
Define a Spark dependency
Arguments
Value
Define a Spark dependency consisting of a set of custom JARs and Spark packages.
spark_dependency(jars = NULL, packages = NULL, initializer = NULL,
catalog = NULL, repositories = NULL, ...)
Arguments
jars |
Character vector of full paths to JAR files. |
packages |
Character vector of Spark packages names. |
initializer |
Optional callback function called when initializing a connection. |
catalog |
Optional location where extension JAR files can be downloaded for Livy. |
repositories |
Character vector of Spark package repositories. |
... |
Additional optional arguments. |
Value
An object of type `spark_dependency`
Set the SPARK_HOME environment variable
Arguments
Value
Examples
Set the SPARK_HOME
environment variable.
This slightly speeds up some
operations, including the connection time.
spark_home_set(path = NULL, ...)
Arguments
path |
A string containing the path to the installation location of
Spark.
If NULL , the path to the latest Spark/Hadoop version is
used. |
... |
Additional parameters not currently used. |
Value
The function is mostly invoked for the side-effect of setting the SPARK_HOME
environment variable.
It also returns TRUE
if the
environment was successfully set, and FALSE
otherwise.
Examples
if (FALSE) {
# Not run due to side-effects
spark_home_set()
}
Retrieve a Spark JVM Object Reference
Arguments
See also
This S3 generic is used for accessing the underlying Java Virtual Machine
(JVM) Spark objects associated with R objects.
These objects act as
references to Spark objects living in the JVM.
Methods on these objects
can be called with the invoke
family of functions.
spark_jobj(x, ...)
Arguments
x |
An R object containing, or wrapping, a spark_jobj . |
... |
Optional arguments; currently unused. |
See also
invoke
, for calling methods on Java object references.
Get the Spark Version Associated with a Spark Connection
Arguments
Value
Details
Retrieve the version of Spark associated with a Spark connection.
spark_version(sc)
Arguments
Value
The Spark version as a numeric_version
.
Details
Suffixes for e.g.
preview versions, or snapshotted versions,
are trimmed -- if you require the full Spark version, you can
retrieve it with invoke(spark_context(sc), "version")
.
Apply an R Function in Spark
Arguments
Configuration
Examples
Applies an R function to a Spark object (typically, a Spark DataFrame).
spark_apply(x, f, columns = NULL, memory = !is.null(name),
group_by = NULL, packages = NULL, context = NULL, name = NULL,
...)
Arguments
x |
An object (usually a spark_tbl ) coercable to a Spark DataFrame. |
f |
A function that transforms a data frame partition into a data frame.
The function f has signature f(df, context, group1, group2, ...) where
df is a data frame with the data to be processed, context
is an optional object passed as the context parameter, and group1 to
groupN contain the values of the group_by columns.
When group_by is not specified, f takes only one argument.
Can also be an rlang anonymous function; for example, ~ .x + 1
defines an expression that adds one to the given .x data frame. |
columns |
A vector of column names or a named vector of column types for
the transformed object.
When not specified, a sample of 10 rows is taken to infer the output columns automatically; to avoid this performance penalty, specify the column types.
The sample size is configurable using the sparklyr.apply.schema.infer configuration option. |
memory |
Boolean; should the table be cached into memory? |
group_by |
Column name used to group by data frame partitions. |
packages |
Boolean to distribute .libPaths() packages to each node,
a list of packages to distribute, or a package bundle created with
spark_apply_bundle() .
Defaults to TRUE or the sparklyr.apply.packages value set in
spark_config() .
For clusters using Yarn cluster mode, packages can point to a package
bundle created using spark_apply_bundle() and made available as a Spark
file using config$sparklyr.shell.files .
For clusters using Livy, packages
can be manually installed on the driver node.
For offline clusters where available.packages() is not available,
manually download the packages database from
https://cran.r-project.org/web/packages/packages.rds and set
Sys.setenv(sparklyr.apply.packagesdb = "<path-to-rds>") .
Otherwise,
all packages will be used by default.
For clusters where R packages are already installed in every worker node,
the spark.r.libpaths config entry can be set in spark_config()
to the local packages library.
To specify multiple paths collapse them
(without spaces) with a comma delimiter (e.g., "/lib/path/one,/lib/path/two" ). |
context |
Optional object to be serialized and passed back to f() . |
name |
Optional table name while registering the resulting data frame. |
... |
Optional arguments; currently unused. |
Configuration
spark_config() settings can be specified to change the workers' environment.
For instance, to set additional environment variables on each worker node use the sparklyr.apply.env.* config; to launch workers without --vanilla use sparklyr.apply.options.vanilla set to FALSE; to run a custom script before launching Rscript use sparklyr.apply.options.rscript.before.
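As a hedged illustration of these options (not from the original reference; the environment variable name and values are placeholders):
if (FALSE) {
library(sparklyr)
config <- spark_config()
config$sparklyr.apply.env.DATA_DIR <- "/tmp/data"   # extra environment variable on each worker
config$sparklyr.apply.options.vanilla <- FALSE      # do not launch worker R sessions with --vanilla
sc <- spark_connect(master = "local", config = config)
sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)
}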
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
# creates an Spark data frame with 10 elements then multiply times 10 in R
sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)
}
Create Bundle for Spark Apply
Arguments
Creates a bundle of packages for spark_apply()
.
spark_apply_bundle(packages = TRUE, base_path = getwd())
Arguments
packages |
List of packages to pack or TRUE to pack all. |
base_path |
Base path used to store the resulting bundle. |
Log Writer for Spark Apply
Arguments
Writes data to log under spark_apply()
.
spark_apply_log(..., level = "INFO")
Arguments
... |
Arguments to write to log. |
level |
Severity level for this entry; recommended values: INFO , ERROR or WARN . |
Create a Spark Configuration for Livy
Arguments
Value
Details
Create a Spark Configuration for Livy
livy_config(config = spark_config(), username = NULL,
password = NULL, negotiate = FALSE,
custom_headers = list(`X-Requested-By` = "sparklyr"), ...)
Arguments
config |
Optional base configuration |
username |
The username to use in the Authorization header |
password |
The password to use in the Authorization header |
negotiate |
Whether to use gssnegotiate method or not |
custom_headers |
List of custom headers to append to http requests.
Defaults to list("X-Requested-By" = "sparklyr") . |
... |
additional Livy session parameters |
Value
Named list with configuration data
Details
Extends a Spark spark_config()
configuration with settings
for Livy.
For instance, username
and password
define the basic authentication settings for a Livy session.
The default value of "custom_headers"
is set to list("X-Requested-By" = "sparklyr")
in order to facilitate connection to Livy servers with CSRF protection enabled.
Additional parameters for Livy sessions are:
proxy_user
- User to impersonate when starting the session
jars
- jars to be used in this session
py_files
- Python files to be used in this session
files
- files to be used in this session
driver_memory
- Amount of memory to use for the driver process
driver_cores
- Number of cores to use for the driver process
executor_memory
- Amount of memory to use per executor process
executor_cores
- Number of cores to use for each executor
num_executors
- Number of executors to launch for this session
archives
- Archives to be used in this session
queue
- The name of the YARN queue to which the session is submitted
name
- The name of this session
heartbeat_timeout
- Timeout in seconds after which the session is orphaned
Note that queue
is supported only by version 0.4.0 of Livy or newer.
If you are using the older one, specify queue via config
(e.g. config = spark_config(spark.yarn.queue = "my_queue")
).
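A hedged sketch, not from the original reference; the Livy endpoint, credentials, and resource sizes are placeholders.
if (FALSE) {
library(sparklyr)
config <- livy_config(username = "<username>", password = "<password>",
                      driver_memory = "2G", num_executors = 2)
sc <- spark_connect(master = "http://<livy-server>:8998", method = "livy",
                    config = config)
spark_disconnect(sc)
}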
Start Livy
Arguments
Starts the livy service.
Stops the running instances of the livy service.
livy_service_start(version = NULL, spark_version = NULL, stdout = "",
stderr = "", ...)
livy_service_stop()
Arguments
version |
The version of livy to use. |
spark_version |
The version of spark to connect to. |
stdout, stderr |
where output to 'stdout' or 'stderr' should
be sent.
Same options as system2 . |
... |
Optional arguments; currently unused. |
Find Stream
Arguments
Examples
Finds and returns a stream based on the stream's identifier.
stream_find(sc, id)
Arguments
sc |
The associated Spark connection. |
id |
The stream identifier to find. |
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>%
spark_write_parquet(path = "parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>%
stream_write_parquet("parquet-out")
stream_id <- stream_id(stream)
stream_find(sc, stream_id)
}
Generate Test Stream
Arguments
Details
Generates a local test stream, useful when testing streams locally.
stream_generate_test(df = rep(1:1000), path = "source",
distribution = floor(10 + 1e+05 * stats::dbinom(1:20, 20, 0.5)),
iterations = 50, interval = 1)
Arguments
df |
The data frame used as a source of rows to the stream, will
be cast to a data frame if needed.
Defaults to a sequence of one thousand
entries. |
path |
Path to save stream of files to, defaults to "source" . |
distribution |
The distribution of rows to use over each iteration,
defaults to a binomial distribution.
The stream will cycle through the
distribution if needed. |
iterations |
Number of iterations to execute before stopping, defaults
to fifty. |
interval |
The interval in seconds used to write the stream, defaults
to one second. |
Details
This function requires the callr
package to be installed.
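A minimal sketch, not from the original reference, assuming a local connection and the callr package; the generated files are written to the default "source" directory and read back as a CSV stream.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
stream_generate_test(iterations = 5)
stream_read_csv(sc, "source")
}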
Spark Stream's Identifier
Arguments
Retrieves the identifier of the Spark stream.
stream_id(stream)
Arguments
stream |
The spark stream object. |
Spark Stream's Name
Arguments
Retrieves the name of the Spark stream if available.
stream_name(stream)
Arguments
stream |
The spark stream object. |
Read CSV Stream
Arguments
See also
Examples
Reads a CSV stream as a Spark dataframe stream.
stream_read_csv(sc, path, name = NULL, header = TRUE, columns = NULL,
delimiter = ",", quote = "\"", escape = "\\",
charset = "UTF-8", null_value = NULL, options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
header |
Boolean; should the first row of data be used as a header?
Defaults to TRUE . |
columns |
A vector of column names or a named vector of column types. |
delimiter |
The character used to delimit each column.
Defaults to ','. |
quote |
The character used as a quote.
Defaults to '"'. |
escape |
The character used to escape other characters.
Defaults to '\'. |
charset |
The character set.
Defaults to "UTF-8". |
null_value |
The character to use for null, or missing, values.
Defaults to NULL . |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)
csv_path <- file.path("file://", getwd(), "csv-in")
stream <- stream_read_csv(sc, csv_path) %>% stream_write_csv("csv-out")
stream_stop(stream)
}
Read JSON Stream
Arguments
See also
Examples
Reads a JSON stream as a Spark dataframe stream.
stream_read_json(sc, path, name = NULL, columns = NULL,
options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
columns |
A vector of column names or a named vector of column types. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("json-in")
jsonlite::write_json(list(a = c(1,2), b = c(10,20)), "json-in/data.json")
json_path <- file.path("file://", getwd(), "json-in")
stream <- stream_read_json(sc, json_path) %>% stream_write_json("json-out")
stream_stop(stream)
}
Read Kafka Stream
Arguments
Details
See also
Examples
Reads a Kafka stream as a Spark dataframe stream.
stream_read_kafka(sc, name = NULL, options = list(), ...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated stream. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
Details
Please note that Kafka requires installing the appropriate
package by connecting with a config setting where sparklyr.shell.packages
is set to, for Spark 2.3.2, "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"
.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
config <- spark_config()
# The following package is dependent to Spark version, for Spark 2.3.2:
config$sparklyr.shell.packages <- "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"
sc <- spark_connect(master = "local", config = config)
read_options <- list(kafka.bootstrap.servers = "localhost:9092", subscribe = "topic1")
write_options <- list(kafka.bootstrap.servers = "localhost:9092", topic = "topic2")
stream <- stream_read_kafka(sc, options = read_options) %>%
stream_write_kafka(options = write_options)
stream_stop(stream)
}
Read ORC Stream
Arguments
See also
Examples
Reads an ORC stream as a Spark dataframe stream.
stream_read_orc(sc, path, name = NULL, columns = NULL,
options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
columns |
A vector of column names or a named vector of column types. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_orc("orc-in")
stream <- stream_read_orc(sc, "orc-in") %>% stream_write_orc("orc-out")
stream_stop(stream)
}
Read Parquet Stream
Arguments
See also
Examples
Reads a parquet stream as a Spark dataframe stream.
stream_read_parquet(sc, path, name = NULL, columns = NULL,
options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
columns |
A vector of column names or a named vector of column types. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>% stream_write_parquet("parquet-out")
stream_stop(stream)
}
Read Socket Stream
Arguments
See also
Examples
Reads a Socket stream as a Spark dataframe stream.
stream_read_scoket(sc, name = NULL, columns = NULL, options = list(),
...)
Arguments
sc |
A spark_connection . |
name |
The name to assign to the newly generated stream. |
columns |
A vector of column names or a named vector of column types. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
# Start socket server from terminal, example: nc -lk 9999
stream <- stream_read_scoket(sc, options = list(host = "localhost", port = 9999))
stream
}
Read Text Stream
Arguments
See also
Examples
Reads a text stream as a Spark dataframe stream.
stream_read_text(sc, path, name = NULL, options = list(), ...)
Arguments
sc |
A spark_connection . |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
name |
The name to assign to the newly generated stream. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("text-in")
writeLines("A text entry", "text-in/text.txt")
text_path <- file.path("file://", getwd(), "text-in")
stream <- stream_read_text(sc, text_path) %>% stream_write_text("text-out")
stream_stop(stream)
}
Render Stream
Arguments
Examples
Collects streaming statistics to render the stream as an 'htmlwidget'.
stream_render(stream = NULL, collect = 10, stats = NULL, ...)
Arguments
stream |
The stream to render |
collect |
The interval in seconds to collect data before rendering the
'htmlwidget'. |
stats |
Optional stream statistics collected using stream_stats() ,
when specified, stream should be omitted. |
... |
Additional optional arguments. |
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
dir.create("iris-in")
write.csv(iris, "iris-in/iris.csv", row.names = FALSE)
stream <- stream_read_csv(sc, "iris-in/") %>%
stream_write_csv("iris-out/")
stream_render(stream)
stream_stop(stream)
}
Stream Statistics
Arguments
Value
Examples
Collects streaming statistics, usually, to be used with stream_render()
to render streaming statistics.
stream_stats(stream, stats = list())
Arguments
stream |
The stream to collect statistics from. |
stats |
An optional stats object generated using stream_stats() . |
Value
A stats object containing streaming statistics that can be passed
back to the stats
parameter to continue aggregating streaming stats.
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>%
spark_write_parquet(path = "parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>%
stream_write_parquet("parquet-out")
stream_stats(stream)
}
Stops a Spark Stream
Arguments
Stops processing data from a Spark stream.
stream_stop(stream)
Arguments
stream |
The spark stream object to be stopped. |
Spark Stream Continuous Trigger
Arguments
See also
Creates a Spark structured streaming trigger to execute
continuously.
This mode is the most performant but not all operations
are supported.
stream_trigger_continuous(checkpoint = 5000)
Arguments
checkpoint |
The checkpoint interval specified in milliseconds. |
See also
stream_trigger_interval
Spark Stream Interval Trigger
Arguments
See also
Creates a Spark structured streaming trigger to execute
over the specified interval.
stream_trigger_interval(interval = 1000)
Arguments
interval |
The execution interval specified in milliseconds. |
See also
stream_trigger_continuous
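A hedged sketch, not from the original reference, showing how either trigger is passed to a stream writer; the paths are placeholders.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet("parquet-out",
                       trigger = stream_trigger_interval(interval = 5000))
stream_stop(stream)
}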
View Stream
Arguments
Examples
Opens a Shiny gadget to visualize the given stream.
stream_view(stream, ...)
Arguments
stream |
The stream to visualize. |
... |
Additional optional arguments. |
Examples
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
dir.create("iris-in")
write.csv(iris, "iris-in/iris.csv", row.names = FALSE)
stream_read_csv(sc, "iris-in/") %>%
stream_write_csv("iris-out/") %>%
stream_view() %>%
stream_stop()
}
Watermark Stream
Arguments
Ensures a stream has a watermark defined, which is required for some
operations over streams.
stream_watermark(x, column = "timestamp", threshold = "10 minutes")
Arguments
x |
An object coercable to a Spark Streaming DataFrame. |
column |
The name of the column that contains the event time of the row;
if the column is missing, a column with the current time will be added. |
threshold |
The minimum delay to wait for data to arrive late, defaults
to ten minutes. |
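A minimal sketch, not from the original reference, assuming a local connection; it adds the default "timestamp" watermark column before writing the stream, and the paths are placeholders.
if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_watermark() %>%
  stream_write_parquet("parquet-out")
stream_stop(stream)
}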
Write Console Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into console logs.
stream_write_console(x, mode = c("append", "complete", "update"),
options = list(), trigger = stream_trigger_interval(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
options |
A list of strings with additional options. |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% dplyr::transmute(text = as.character(id)) %>% spark_write_text("text-in")
stream <- stream_read_text(sc, "text-in") %>% stream_write_console()
stream_stop(stream)
}
Write CSV Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a tabular (typically, comma-separated) stream.
stream_write_csv(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoint"), header = TRUE, delimiter = ",", quote = "\"",
escape = "\\", charset = "UTF-8", null_value = NULL,
options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
checkpoint |
The location where the system will write all the checkpoint
information to guarantee end-to-end fault-tolerance. |
header |
Should the first row of data be used as a header? Defaults to TRUE . |
delimiter |
The character used to delimit each column, defaults to , . |
quote |
The character used as a quote.
Defaults to '"'. |
escape |
The character used to escape other characters, defaults to \ . |
charset |
The character set, defaults to "UTF-8" . |
null_value |
The character to use for default values, defaults to NULL . |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)
csv_path <- file.path("file://", getwd(), "csv-in")
stream <- stream_read_csv(sc, csv_path) %>% stream_write_csv("csv-out")
stream_stop(stream)
}
Write JSON Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a JSON stream.
stream_write_json(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoints", random_string("")), options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The destination path.
Needs to be accessible from the cluster.
Supports the "hdfs://", "s3a://" and "file://" protocols. |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
checkpoint |
The location where the system will write all the checkpoint
information to guarantee end-to-end fault-tolerance. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("json-in")
jsonlite::write_json(list(a = c(1,2), b = c(10,20)), "json-in/data.json")
json_path <- file.path("file://", getwd(), "json-in")
stream <- stream_read_json(sc, json_path) %>% stream_write_json("json-out")
stream_stop(stream)
}
Write Kafka Stream
Arguments
Details
See also
Examples
Writes a Spark dataframe stream into a Kafka stream.
stream_write_kafka(x, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(),
checkpoint = file.path("checkpoints", random_string("")),
options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
checkpoint |
The location where the system will write all the checkpoint
information to guarantee end-to-end fault-tolerance. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
Details
Please note that Kafka requires installing the appropriate
package by connecting with a config setting where sparklyr.shell.packages
is set to, for Spark 2.3.2, "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"
.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
config <- spark_config()
# The following package is dependent to Spark version, for Spark 2.3.2:
config$sparklyr.shell.packages <- "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"
sc <- spark_connect(master = "local", config = config)
read_options <- list(kafka.bootstrap.servers = "localhost:9092", subscribe = "topic1")
write_options <- list(kafka.bootstrap.servers = "localhost:9092", topic = "topic2")
stream <- stream_read_kafka(sc, options = read_options) %>%
stream_write_kafka(options = write_options)
stream_stop(stream)
}
Write Memory Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a memory stream.
stream_write_memory(x, name = random_string("sparklyr_tmp_"),
mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(),
checkpoint = file.path("checkpoints", name, random_string("")),
options = list(), ...)
Arguments
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated stream. |
mode |
Specifies how data is written to a streaming sink.
Valid values are "append" , "complete" or "update" . |
trigger |
The trigger for the stream query, defaults to micro-batches running
every 5 seconds.
See stream_trigger_interval and stream_trigger_continuous . |
checkpoint |
The location where the system will write all the checkpoint
information to guarantee end-to-end fault-tolerance. |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_orc, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)
csv_path <- file.path("file://", getwd(), "csv-in")
stream <- stream_read_csv(sc, csv_path) %>% stream_write_memory("csv-out")
stream_stop(stream)
}
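The memory sink materializes the stream into an in-memory table whose name is the name argument ("csv-out" in the example above), so while the stream is running (before stream_stop()) it can be queried like any other Spark table. A minimal sketch, assuming dplyr is attached:
library(dplyr)
# Query the in-memory table registered by the memory sink; the table name
# matches the `name` argument passed to stream_write_memory()
tbl(sc, "csv-out") %>% head(10)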
Write an ORC Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into an ORC stream.
stream_write_orc(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoints", random_string("")), options = list(), ...)
Arguments
x | A Spark DataFrame or dplyr operation
path | The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode | Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger | The trigger for the stream query; defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint | The location where the system will write all the checkpoint information to guarantee end-to-end fault tolerance.
options | A list of strings with additional options.
... | Optional arguments; currently unused.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_parquet, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_orc("orc-in")
stream <- stream_read_orc(sc, "orc-in") %>% stream_write_orc("orc-out")
stream_stop(stream)
}
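While a stream such as the one above is running, sparklyr also exposes helpers for inspecting it. The sketch below reuses the connection and paths from the example; stream_name() returns the query name assigned to the stream and stream_stats() its latest progress statistics.
stream <- stream_read_orc(sc, "orc-in") %>% stream_write_orc("orc-out")
stream_name(stream)   # query name assigned to the running stream
stream_stats(stream)  # latest progress statistics reported by Spark
stream_stop(stream)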
Write Parquet Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a parquet stream.
stream_write_parquet(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoints", random_string("")), options = list(), ...)
Arguments
x | A Spark DataFrame or dplyr operation
path | The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode | Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger | The trigger for the stream query; defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint | The location where the system will write all the checkpoint information to guarantee end-to-end fault tolerance.
options | A list of strings with additional options.
... | Optional arguments; currently unused.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_text
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")
stream <- stream_read_parquet(sc, "parquet-in") %>% stream_write_parquet("parquet-out")
stream_stop(stream)
}
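The trigger and checkpoint arguments documented above can also be set explicitly. A hedged variation on the example, using a one-second micro-batch trigger and an illustrative checkpoint location:
stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet(
    "parquet-out",
    trigger = stream_trigger_interval(interval = 1000),             # run a micro-batch every second
    checkpoint = file.path("parquet-out", "checkpoints", "my-run")  # illustrative path
  )
stream_stop(stream)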
Write Text Stream
Arguments
See also
Examples
Writes a Spark dataframe stream into a text stream.
stream_write_text(x, path, mode = c("append", "complete", "update"),
trigger = stream_trigger_interval(), checkpoint = file.path(path,
"checkpoints", random_string("")), options = list(), ...)
Arguments
x | A Spark DataFrame or dplyr operation
path | The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode | Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger | The trigger for the stream query; defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint | The location where the system will write all the checkpoint information to guarantee end-to-end fault tolerance.
options | A list of strings with additional options.
... | Optional arguments; currently unused.
See also
Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet
Examples
if (FALSE) {
sc <- spark_connect(master = "local")
dir.create("text-in")
writeLines("A text entry", "text-in/text.txt")
text_path <- file.path("file://", getwd(), "text-in")
stream <- stream_read_text(sc, text_path) %>% stream_write_text("text-out")
stream_stop(stream)
}
Reactive spark reader
Arguments
Given a Spark object, returns a reactive data source for the contents of that object.
This function is most useful for reading Spark streams.
reactiveSpark(x, intervalMillis = 1000, session = NULL)
Arguments
x | An object coercible to a Spark DataFrame.
intervalMillis | Approximate number of milliseconds to wait before retrieving an updated data frame. This can be a numeric value, or a function that returns a numeric value.
session | The user session to associate this file reader with, or NULL if none. If non-NULL, the reader will automatically stop when the session ends.
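A minimal Shiny sketch of how reactiveSpark() is typically used: a streaming pipeline is wrapped as a reactive, and calling that reactive inside a render function returns the stream's current contents. The folder name "csv-in" and the refresh interval are assumptions for illustration.
library(sparklyr)
library(shiny)

sc <- spark_connect(master = "local")

ui <- fluidPage(tableOutput("table"))

server <- function(input, output, session) {
  # Wrap a streaming source as a reactive that refreshes about once per second
  ps <- stream_read_csv(sc, "csv-in") %>%
    reactiveSpark(intervalMillis = 1000, session = session)

  # Calling the reactive returns the stream's current contents as a data frame
  output$table <- renderTable(ps())
}

shinyApp(ui, server)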