Using sparklyr

https://spark.rstudio.com/guides/connections/


Configuring Spark Connections

Local mode

Local mode is an excellent way to learn and experiment with Spark. Local mode also provides a convenient development environment for analyses, reports, and applications that you plan to eventually deploy to a multi-node Spark cluster. To work in local mode, you should first install a version of Spark for local use. You can do this using the spark_install() function, for example:
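A minimal sketch follows (the version number is illustrative and mirrors the Local Deployment section later in this guide; pick any Spark release supported by your sparklyr version):

library(sparklyr)
spark_install(version = "2.1.0")        # download and install a local copy of Spark
sc <- spark_connect(master = "local")   # connect to the local instance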

Recommended properties

The following are the recommended Spark properties to set when connecting via R:
sparklyr.cores.local - Defaults to using all of the available cores. It is not necessary to set unless there is a reason to use fewer cores than are available for a given Spark session.
sparklyr.shell.driver-memory - The limit is the amount of RAM available in the computer minus what would be needed for OS operations.
spark.memory.fraction - The default is set to 60% of the requested memory per executor. For more information, see the Memory Management Overview page on the official Spark website.

Connection example

conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "16G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local", version = "2.1.0", config = conf)

Executors page

To see how the requested configuration affected the Spark connection, go to the Executors page in the Spark Web UI, available at http://localhost:4040

Customizing connections

A connection to Spark can be customized by setting the values of certain Spark properties. In sparklyr, Spark properties can be set by using the config argument in the spark_connect() function. By default, spark_connect() uses spark_config() as the default configuration, but that can be customized as shown in the example below. Because of the vast number of possible property combinations, spark_config() contains only a basic configuration, so it is very likely that additional settings will be needed to properly connect to the cluster.
conf <- spark_config()                                       # Load variable with spark_config()
conf$spark.executor.memory <- "16G"                          # Use `$` to add or set values
sc <- spark_connect(master = "yarn-client", config = conf)   # Pass the conf variable

Spark definitions

It may be useful to provide some simple definitions for the Spark nomenclature:
Node: A server.
Worker Node: A server that is part of the cluster and is available to run Spark jobs.
Master Node: The server that coordinates the Worker Nodes.
Executor: A sort of virtual machine inside a node. One Node can have multiple Executors.
Driver Node: The Node that initiates the Spark session. Typically, this will be the server where sparklyr is located.
Driver (Executor): The Driver Node will also show up in the Executor list.

Useful concepts

Spark configuration properties passed by R are just requests - in most cases, the cluster has the final say regarding the resources apportioned to a given Spark session. The cluster overrides ‘silently’ - many times, no error is returned when more resources than allowed are requested, or when an attempt is made to change a setting fixed by the cluster.
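As a hedged illustration of this request-versus-grant behavior, the short sketch below deliberately over-requests executor memory and then opens the Spark UI so the Executors page can be compared against the request (the 32G figure and the yarn-client master are illustrative):

library(sparklyr)
conf <- spark_config()
conf$spark.executor.memory <- "32G"   # a request only; the cluster may silently cap it
sc <- spark_connect(master = "yarn-client", config = conf)
spark_web(sc)                         # open the Spark UI and check the Executors page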

YARN

Background

Using Spark and R inside a Hadoop-based data lake is becoming a common practice at companies. Currently, there is no good way to centrally manage user connections to the Spark service. There are some caps and settings that can be applied, but in most cases there are configurations that the R user will need to customize. The Running on YARN page on Spark's official website is the best place to start for configuration settings reference, so please bookmark it. Both cluster administrators and users can benefit from this document. If Spark is new to the company, the YARN tuning article, courtesy of Cloudera, does a great job of explaining how the Spark/YARN architecture works.

Recommended properties

The following are the recommended Spark properties to set when connecting via R:
spark.executor.memory - The maximum possible value is managed by the YARN cluster; see the Executor Memory Error section below.
spark.executor.cores - Number of cores assigned per Executor.
spark.executor.instances - Number of Executors to start. This property is acknowledged by the cluster if spark.dynamicAllocation.enabled is set to "false".
spark.dynamicAllocation.enabled - Overrides the mechanism that Spark provides to dynamically adjust resources. Disabling it provides more control over the number of Executors that can be started, which in turn impacts the amount of storage available for the session. For more information, see the Dynamic Resource Allocation page on the official Spark website.

Client mode

Using yarn-client as the value of the master argument in spark_connect() makes the server where R is running the driver of the Spark session. Here is a sample connection:
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master = "yarn-client", spark_home = "/usr/lib/spark/", version = "1.6.0", config = conf)

Executors page

To see how the requested configuration affected the Spark connection, go to the Executors page in the Spark Web UI. Typically, the Spark Web UI can be found at the same URL used for RStudio but on port 4040. Notice that 155.3MB per executor is assigned instead of the 300MB requested. This is because spark.memory.fraction has been fixed by the cluster and, in addition, a fixed amount of memory is designated for overhead.

Cluster mode

Running in cluster mode means that YARN will choose where the driver of the Spark session runs. This means that the server where R is running may not necessarily be the driver for that session. Here is a good write-up explaining how Spark applications run on YARN: Running Spark on YARN. The server will need copies of at least two files: yarn-site.xml and hive-site.xml. There may be other files needed based on your cluster's individual setup. This is an example of connecting to a Cloudera cluster:
library(sparklyr)
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-7-oracle-cloudera/")
Sys.setenv(SPARK_HOME = '/opt/cloudera/parcels/CDH/lib/spark')
Sys.setenv(YARN_CONF_DIR = '/opt/cloudera/parcels/CDH/lib/spark/conf/yarn-conf')
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master = "yarn-cluster", config = conf)

Executor memory error

Requesting more memory or CPUs for Executors than allowed will return an error. This is one of the exceptions to the cluster's ‘silent’ overrides. It will return a message similar to this:
Failed during initialize_connection: java.lang.IllegalArgumentException: Required executor memory (16384+1638 MB) is above the max threshold (8192 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'
Only a cluster administrator can change the settings mentioned in the error. If the cluster is supported by a vendor, like Cloudera or Hortonworks, then the change can be made using the cluster's web UI. Otherwise, changes to those settings are made directly in the yarn-default.xml file.
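As a hedged sketch of a corrected request for the cluster in the error above, keep the executor memory, plus its roughly 10% YARN overhead, under the 8192 MB maximum (the exact value is illustrative):

conf <- spark_config()
conf$spark.executor.memory <- "6G"   # 6G plus ~10% overhead stays below the 8192 MB threshold
sc <- spark_connect(master = "yarn-client", config = conf)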

Kerberos

There are two options to access a “kerberized” data lake:
Use kinit to get and cache the ticket. Once kinit is installed and configured, it can be used in R via a system() call prior to connecting to the cluster: system("echo '<password>' | kinit <username>"). For more information, visit this site: Apache - Authenticate with kinit.
A preferred option may be to use the out-of-the-box integration with Kerberos that the commercial version of RStudio Server offers.
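For context, a hedged sketch of the full sequence is shown below; the <username> and <password> placeholders must be supplied securely in practice (for example, prompted for interactively rather than stored in the script):

system("echo '<password>' | kinit <username>")   # obtain and cache the Kerberos ticket
sc <- spark_connect(master = "yarn-client", config = spark_config())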

Standalone mode

Recommended properties

The following are the recommended Spark properties to set when connecting via R. The default behavior in Standalone mode is to create one executor per worker, so in a 3-worker-node cluster there will be 3 executors set up. The basic properties that can be set are:
spark.executor.memory - The requested memory cannot exceed the actual RAM available.
spark.memory.fraction - The default is set to 60% of the requested memory per executor. For more information, see the Memory Management Overview page on the official Spark website.
spark.executor.cores - The requested cores cannot be higher than the cores available in each worker.

Dynamic Allocation

If dynamic allocation is disabled, Spark will attempt to assign all of the available cores evenly across the cluster; the property that controls this is spark.dynamicAllocation.enabled. For example, the Standalone cluster used for this article has 3 worker nodes, each with 14.7GB of RAM and 4 cores. This means there are a total of 12 cores (3 workers with 4 cores each) and 44.1GB of RAM (3 workers with 14.7GB each). If the spark.executor.cores property is set to 2, and dynamic allocation is disabled, then Spark will spawn 6 executors. The spark.executor.memory property should be set so that, when multiplied by 6 (the number of executors), it does not exceed the total available RAM. In this case, it can safely be set to 7GB, so that the total memory requested will be 42GB, which is under the available 44.1GB.

Connection example

conf <- spark_config()
conf$spark.executor.memory <- "7GB"
conf$spark.memory.fraction <- 0.9
conf$spark.executor.cores <- 2
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master = "spark://master-url:7077", version = "2.1.0", config = conf, spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/")

Executors page

To see how the requested configuration affected the Spark connection, go to the Executors page in the Spark Web UI. Typically, the Spark Web UI can be found using the exact same URL used for RStudio but on port 4040:

Troubleshooting

Help with code debugging

For general programming questions with sparklyr, please ask on Stack Overflow.

Code does not work after upgrading to the latest sparklyr version

Please refer to the NEWS section of the sparklyr package to find out whether any of the listed updates may have changed the way your code needs to work. If it seems that the current version of the package has a bug, or that new functionality does not perform as stated, please refer to the sparklyr ISSUES page. If no existing issue matches your problem, please open a new issue.

Not able to connect, or the jobs take a long time when working with a Data Lake

The Configuring Spark Connections section contains an overview and recommendations for requesting resources from the cluster. The articles in the Guides section provide best-practice information about specific operations that may match the intent of your code. To verify your infrastructure, please review the Deployment Examples section.

Manipulating Data with dplyr

Overview

dplyr is an R package for working with structured data both in and outside of R. dplyr makes data manipulation for R users easy, consistent, and performant. With dplyr as an interface to manipulating Spark DataFrames, you can: select, filter, and aggregate data; use window functions (e.g. for sampling); perform joins on DataFrames; and collect data from Spark into R. Statements in dplyr can be chained together using pipes defined by the magrittr R package. dplyr also supports non-standard evaluation of its arguments. For more information on dplyr, see the introduction, a guide for connecting to databases, and a variety of vignettes.

Reading Data

You can read data into Spark DataFrames using the following functions:
Function Description
spark_read_csv Reads a CSV file and provides a data source compatible with dplyr
spark_read_json Reads a JSON file and provides a data source compatible with dplyr
spark_read_parquet Reads a parquet file and provides a data source compatible with dplyr
Regardless of the format of your data, Spark supports reading data from a variety of different data sources. These include data stored on HDFS (hdfs:// protocol), Amazon S3 (s3n:// protocol), or local files available to the Spark worker nodes (file:// protocol). Each of these functions returns a reference to a Spark DataFrame, which can be used as a dplyr table (tbl).
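For example, a hedged sketch using spark_read_csv against two of these protocols (the paths and host names are placeholders, and an active connection sc is assumed):

flights_local <- spark_read_csv(sc, name = "flights_local",
                                path = "file:///tmp/flights.csv")
flights_hdfs  <- spark_read_csv(sc, name = "flights_hdfs",
                                path = "hdfs://namenode:8020/data/flights.csv")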

Flights Data

This guide will demonstrate some of the basic data manipulation verbs of dplyr by using data from the nycflights13 R package. This package contains data for all 336,776 flights departing New York City in 2013. It also includes useful metadata on airlines, airports, weather, and planes. The data comes from the US Bureau of Transportation Statistics, and is documented in ?nycflights13. Connect to the cluster and copy the flights data using the copy_to function. Caveat: The flight data in nycflights13 is convenient for dplyr demonstrations because it is small, but in practice large data should rarely be copied directly from R objects.
library(sparklyr)
library(dplyr)
library(nycflights13)
library(ggplot2)
sc <- spark_connect(master = "local")
flights <- copy_to(sc, flights, "flights")
airlines <- copy_to(sc, airlines, "airlines")
src_tbls(sc)
## [1] "airlines" "flights"

dplyr Verbs

Verbs are dplyr commands for manipulating data. When connected to a Spark DataFrame, dplyr translates the commands into Spark SQL statements. Remote data sources use exactly the same five verbs as local data sources. Here are the five verbs with their corresponding SQL commands: select ~ SELECT filter ~ WHERE arrange ~ ORDER summarise ~ aggregators: sum, min, sd, etc. mutate ~ operators: +, *, log, etc. select(flights, year:day, arr_delay, dep_delay) ## # Source: lazy query [?? x 5] ## # Database: spark_connection ## year month day arr_delay dep_delay ## <int> <int> <int> <dbl> <dbl> ## 1 2013 1 1 11.0 2.00 ## 2 2013 1 1 20.0 4.00 ## 3 2013 1 1 33.0 2.00 ## 4 2013 1 1 -18.0 -1.00 ## 5 2013 1 1 -25.0 -6.00 ## 6 2013 1 1 12.0 -4.00 ## 7 2013 1 1 19.0 -5.00 ## 8 2013 1 1 -14.0 -3.00 ## 9 2013 1 1 - 8.00 -3.00 ## 10 2013 1 1 8.00 -2.00 ## # ... with more rows filter(flights, dep_delay > 1000) ## # Source: lazy query [?? x 19] ## # Database: spark_connection ## year month day dep_t~ sche~ dep_~ arr_~ sche~ arr_~ carr~ flig~ tail~ ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> ## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384~ ## 2 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517~ ## 3 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504~ ## 4 2013 7 22 845 1600 1005 1044 1815 989 MQ 3075 N665~ ## 5 2013 9 20 1139 1845 1014 1457 2210 1007 AA 177 N338~ ## # ... with 7 more variables: origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl> arrange(flights, desc(dep_delay)) ## # Source: table<flights> [?? x 19] ## # Database: spark_connection ## # Ordered by: desc(dep_delay) ## year month day dep_~ sche~ dep_~ arr_~ sche~ arr_~ carr~ flig~ tail~ ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> ## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384~ ## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504~ ## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517~ ## 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA 177 N338~ ## 5 2013 7 22 845 1600 1005 1044 1815 989 MQ 3075 N665~ ## 6 2013 4 10 1100 1900 960 1342 2211 931 DL 2391 N959~ ## 7 2013 3 17 2321 810 911 135 1020 915 DL 2119 N927~ ## 8 2013 6 27 959 1900 899 1236 2226 850 DL 2007 N376~ ## 9 2013 7 22 2257 759 898 121 1026 895 DL 2047 N671~ ## 10 2013 12 5 756 1700 896 1058 2020 878 AA 172 N5DM~ ## # ... with more rows, and 7 more variables: origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour ## # <dbl> summarise(flights, mean_dep_delay = mean(dep_delay)) ## Warning: Missing values are always removed in SQL. ## Use `AVG(x, na.rm = TRUE)` to silence this warning ## # Source: lazy query [?? x 1] ## # Database: spark_connection ## mean_dep_delay ## <dbl> ## 1 12.6 mutate(flights, speed = distance / air_time * 60) ## # Source: lazy query [?? 
x 20] ## # Database: spark_connection ## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~ ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545 ## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714 ## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141 ## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725 ## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461 ## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696 ## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507 ## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708 ## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79 ## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301 ## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>, ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dbl>, speed <dbl>

Laziness

When working with databases, dplyr tries to be as lazy as possible: It never pulls data into R unless you explicitly ask for it. It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step. For example, take the following code: c1 <- filter(flights, day == 17, month == 5, carrier %in% c('UA', 'WN', 'AA', 'DL')) c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance) c3 <- arrange(c2, year, month, day, carrier) c4 <- mutate(c3, air_time_hours = air_time / 60) This sequence of operations never actually touches the database. It’s not until you ask for the data (e.g. by printing c4) that dplyr requests the results from the database. c4 ## # Source: lazy query [?? x 8] ## # Database: spark_connection ## # Ordered by: year, month, day, carrier ## year month day carrier dep_delay air_time distance air_time_hours ## <int> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 2013 5 17 AA -2.00 294 2248 4.90 ## 2 2013 5 17 AA -1.00 146 1096 2.43 ## 3 2013 5 17 AA -2.00 185 1372 3.08 ## 4 2013 5 17 AA -9.00 186 1389 3.10 ## 5 2013 5 17 AA 2.00 147 1096 2.45 ## 6 2013 5 17 AA -4.00 114 733 1.90 ## 7 2013 5 17 AA -7.00 117 733 1.95 ## 8 2013 5 17 AA -7.00 142 1089 2.37 ## 9 2013 5 17 AA -6.00 148 1089 2.47 ## 10 2013 5 17 AA -7.00 137 944 2.28 ## # ... with more rows

Piping

You can use magrittr pipes to write cleaner syntax. Using the same example from above, you can write a much cleaner version like this:
c4 <- flights %>%
  filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
  select(carrier, dep_delay, air_time, distance) %>%
  arrange(carrier) %>%
  mutate(air_time_hours = air_time / 60)

Grouping

The group_by function corresponds to the GROUP BY statement in SQL. c4 %>% group_by(carrier) %>% summarize(count = n(), mean_dep_delay = mean(dep_delay)) ## Warning: Missing values are always removed in SQL. ## Use `AVG(x, na.rm = TRUE)` to silence this warning ## # Source: lazy query [?? x 3] ## # Database: spark_connection ## carrier count mean_dep_delay ## <chr> <dbl> <dbl> ## 1 AA 94.0 1.47 ## 2 DL 136 6.24 ## 3 UA 172 9.63 ## 4 WN 34.0 7.97

Collecting to R

You can copy data from Spark into R’s memory by using collect(). carrierhours <- collect(c4) collect() executes the Spark query and returns the results to R for further analysis and visualization. # Test the significance of pairwise differences and plot the results with(carrierhours, pairwise.t.test(air_time, carrier)) ## ## Pairwise comparisons using t tests with pooled SD ## ## data: air_time and carrier ## ## AA DL UA ## DL 0.25057 - - ## UA 0.07957 0.00044 - ## WN 0.07957 0.23488 0.00041 ## ## P value adjustment method: holm ggplot(carrierhours, aes(carrier, air_time_hours)) + geom_boxplot()

SQL Translation

It’s relatively straightforward to translate R code to SQL (or indeed to any programming language) when doing simple mathematical operations of the form you normally use when filtering, mutating and summarizing. dplyr knows how to convert the following R functions to Spark SQL: # Basic math operators +, -, *, /, %%, ^ # Math functions abs, acos, asin, asinh, atan, atan2, ceiling, cos, cosh, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh # Logical comparisons <, <=, !=, >=, >, ==, %in% # Boolean operations &, &&, |, ||, ! # Character functions paste, tolower, toupper, nchar # Casting as.double, as.integer, as.logical, as.character, as.date # Basic aggregations mean, sum, min, max, sd, var, cor, cov, n
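As a quick, hedged illustration of that translation, the snippet below renders the SQL that dplyr generates for a simple filter-and-aggregate on the flights table (the exact SQL text varies by dplyr/dbplyr version):

flights %>%
  filter(distance > 1000) %>%
  summarise(avg_air_time = mean(air_time)) %>%
  dbplyr::sql_render()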

Window Functions

dplyr supports Spark SQL window functions. Window functions are used in conjunction with mutate and filter to solve a wide range of problems. You can compare the dplyr syntax to the query it has generated by using dbplyr::sql_render(). # Find the most and least delayed flight each day bestworst <- flights %>% group_by(year, month, day) %>% select(dep_delay) %>% filter(dep_delay == min(dep_delay) || dep_delay == max(dep_delay)) dbplyr::sql_render(bestworst) ## Warning: Missing values are always removed in SQL. ## Use `min(x, na.rm = TRUE)` to silence this warning ## Warning: Missing values are always removed in SQL. ## Use `max(x, na.rm = TRUE)` to silence this warning ## <SQL> SELECT `year`, `month`, `day`, `dep_delay` ## FROM (SELECT `year`, `month`, `day`, `dep_delay`, min(`dep_delay`) OVER (PARTITION BY `year`, `month`, `day`) AS `zzz3`, max(`dep_delay`) OVER (PARTITION BY `year`, `month`, `day`) AS `zzz4` ## FROM (SELECT `year`, `month`, `day`, `dep_delay` ## FROM `flights`) `coaxmtqqbj`) `efznnpuovy` ## WHERE (`dep_delay` = `zzz3` OR `dep_delay` = `zzz4`) bestworst ## Warning: Missing values are always removed in SQL. ## Use `min(x, na.rm = TRUE)` to silence this warning ## Warning: Missing values are always removed in SQL. ## Use `max(x, na.rm = TRUE)` to silence this warning ## # Source: lazy query [?? x 4] ## # Database: spark_connection ## # Groups: year, month, day ## year month day dep_delay ## <int> <int> <int> <dbl> ## 1 2013 1 1 853 ## 2 2013 1 1 - 15.0 ## 3 2013 1 1 - 15.0 ## 4 2013 1 9 1301 ## 5 2013 1 9 - 17.0 ## 6 2013 1 24 - 15.0 ## 7 2013 1 24 329 ## 8 2013 1 29 - 27.0 ## 9 2013 1 29 235 ## 10 2013 2 1 - 15.0 ## # ... with more rows # Rank each flight within a daily ranked <- flights %>% group_by(year, month, day) %>% select(dep_delay) %>% mutate(rank = rank(desc(dep_delay))) dbplyr::sql_render(ranked) ## <SQL> SELECT `year`, `month`, `day`, `dep_delay`, rank() OVER (PARTITION BY `year`, `month`, `day` ORDER BY `dep_delay` DESC) AS `rank` ## FROM (SELECT `year`, `month`, `day`, `dep_delay` ## FROM `flights`) `mauqwkxuam` ranked ## # Source: lazy query [?? x 5] ## # Database: spark_connection ## # Groups: year, month, day ## year month day dep_delay rank ## <int> <int> <int> <dbl> <int> ## 1 2013 1 1 853 1 ## 2 2013 1 1 379 2 ## 3 2013 1 1 290 3 ## 4 2013 1 1 285 4 ## 5 2013 1 1 260 5 ## 6 2013 1 1 255 6 ## 7 2013 1 1 216 7 ## 8 2013 1 1 192 8 ## 9 2013 1 1 157 9 ## 10 2013 1 1 155 10 ## # ... with more rows

Performing Joins

It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables at a time: Mutating joins, which add new variables to one table from matching rows in another. Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table. Set operations, which combine the observations in the data sets as if they were set elements. All two-table verbs work similarly. The first two arguments are x and y, and provide the tables to combine. The output is always a new table with the same type as x. The following statements are equivalent: flights %>% left_join(airlines) ## Joining, by = "carrier" ## # Source: lazy query [?? x 20] ## # Database: spark_connection ## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~ ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545 ## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714 ## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141 ## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725 ## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461 ## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696 ## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507 ## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708 ## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79 ## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301 ## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>, ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dbl>, name <chr> flights %>% left_join(airlines, by = "carrier") ## # Source: lazy query [?? x 20] ## # Database: spark_connection ## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~ ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545 ## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714 ## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141 ## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725 ## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461 ## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696 ## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507 ## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708 ## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79 ## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301 ## # ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>, ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dbl>, name <chr> flights %>% left_join(airlines, by = c("carrier", "carrier")) ## # Source: lazy query [?? x 20] ## # Database: spark_connection ## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~ ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545 ## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714 ## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141 ## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725 ## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461 ## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696 ## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507 ## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708 ## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79 ## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301 ## # ... 
with more rows, and 9 more variables: tailnum <chr>, origin <chr>, ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dbl>, name <chr>

Sampling

You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction. sample_n(flights, 10) ## # Source: lazy query [?? x 19] ## # Database: spark_connection ## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~ ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545 ## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714 ## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141 ## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725 ## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461 ## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696 ## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507 ## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708 ## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79 ## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301 ## # ... with more rows, and 8 more variables: tailnum <chr>, origin <chr>, ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dbl> sample_frac(flights, 0.01) ## # Source: lazy query [?? x 19] ## # Database: spark_connection ## year month day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~ ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 655 655 0 1021 1030 - 9.00 DL 1415 ## 2 2013 1 1 656 700 - 4.00 854 850 4.00 AA 305 ## 3 2013 1 1 1044 1045 - 1.00 1231 1212 19.0 EV 4322 ## 4 2013 1 1 1056 1059 - 3.00 1203 1209 - 6.00 EV 4479 ## 5 2013 1 1 1317 1325 - 8.00 1454 1505 -11.0 MQ 4475 ## 6 2013 1 1 1708 1700 8.00 2037 2005 32.0 WN 1066 ## 7 2013 1 1 1825 1829 - 4.00 2056 2053 3.00 9E 3286 ## 8 2013 1 1 1843 1845 - 2.00 1955 2024 -29.0 DL 904 ## 9 2013 1 1 2108 2057 11.0 25 39 -14.0 UA 1517 ## 10 2013 1 2 557 605 - 8.00 832 823 9.00 DL 544 ## # ... with more rows, and 8 more variables: tailnum <chr>, origin <chr>, ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dbl>

Writing Data

It is often useful to save the results of your analysis or the tables that you have generated on your Spark cluster into persistent storage. The best option in many scenarios is to write the table out to a Parquet file using the spark_write_parquet function. For example: spark_write_parquet(tbl, "hdfs://hdfs.company.org:9000/hdfs-path/data") This will write the Spark DataFrame referenced by the tbl R variable to the given HDFS path. You can use the spark_read_parquet function to read the same table back into a subsequent Spark session: tbl <- spark_read_parquet(sc, "data", "hdfs://hdfs.company.org:9000/hdfs-path/data") You can also write data as CSV or JSON using the spark_write_csv and spark_write_json functions.

Hive Functions

Many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize. The Languange Reference UDF page provides the list of available functions. The following example uses the datediff and current_date Hive UDFs to figure the difference between the flight_date and the current system date: flights %>% mutate(flight_date = paste(year,month,day,sep="-"), days_since = datediff(current_date(), flight_date)) %>% group_by(flight_date,days_since) %>% tally() %>% arrange(-days_since) ## # Source: lazy query [?? x 3] ## # Database: spark_connection ## # Groups: flight_date ## # Ordered by: -days_since ## flight_date days_since n ## <chr> <int> <dbl> ## 1 2013-1-1 1844 842 ## 2 2013-1-2 1843 943 ## 3 2013-1-3 1842 914 ## 4 2013-1-4 1841 915 ## 5 2013-1-5 1840 720 ## 6 2013-1-6 1839 832 ## 7 2013-1-7 1838 933 ## 8 2013-1-8 1837 899 ## 9 2013-1-9 1836 902 ## 10 2013-1-10 1835 932 ## # ... with more rows

Spark Machine Learning Library (MLlib)

Overview

sparklyr provides bindings to Spark's distributed machine learning library. In particular, sparklyr allows you to access the machine learning routines provided by the spark.ml package. Together with sparklyr's dplyr interface, you can easily create and tune machine learning workflows on Spark, orchestrated entirely within R. sparklyr provides three families of functions that you can use with Spark machine learning:
Machine learning algorithms for analyzing data (ml_*)
Feature transformers for manipulating individual features (ft_*)
Functions for manipulating Spark DataFrames (sdf_*)
An analytic workflow with sparklyr might be composed of the following stages. For an example see Example Workflow.
    Perform SQL queries through the sparklyr dplyr interface
    Use the sdf_* and ft_* family of functions to generate new columns, or partition your data set
    Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data
    Inspect the quality of your model fit, and use it to make predictions with new data
    Collect the results for visualization and further analysis in R

Algorithms

Spark's machine learning library can be accessed from sparklyr through the ml_* set of functions:
Function Description
ml_kmeans K-Means Clustering
ml_linear_regression Linear Regression
ml_logistic_regression Logistic Regression
ml_survival_regression Survival Regression
ml_generalized_linear_regression Generalized Linear Regression
ml_decision_tree Decision Trees
ml_random_forest Random Forests
ml_gradient_boosted_trees Gradient-Boosted Trees
ml_pca Principal Components Analysis
ml_naive_bayes Naive-Bayes
ml_multilayer_perceptron Multilayer Perceptron
ml_lda Latent Dirichlet Allocation
ml_one_vs_rest One vs Rest

Formulas

The ml_* functions take the arguments response and features, but features can also be a formula with main effects (it currently does not accept interaction terms). The intercept term can be omitted by using -1.
# Equivalent statements
ml_linear_regression(z ~ -1 + x + y)
ml_linear_regression(intercept = FALSE, response = "z", features = c("x", "y"))

Options

The Spark model output can be modified with the ml_options argument in the ml_* functions. ml_options is an experts-only interface for tweaking the model output. For example, model.transform can be used to mutate the Spark model object before the fit is performed.

Transformers

A model is often fit not on a dataset as-is, but instead on some transformation of that dataset. Spark provides feature transformers, facilitating many common transformations of data within a Spark DataFrame, and sparklyr exposes these within the ft_* family of functions. These routines generally take one or more input columns, and generate a new output column formed as a transformation of those columns.
Function Description
ft_binarizer Threshold numerical features to binary (0/1) feature
ft_bucketizer Bucketizer transforms a column of continuous features to a column of feature buckets
ft_discrete_cosine_transform Transforms a length N real-valued sequence in the time domain into another length N real-valued sequence in the frequency domain
ft_elementwise_product Multiplies each input vector by a provided weight vector, using element-wise multiplication.
ft_index_to_string Maps a column of label indices back to a column containing the original labels as strings
ft_quantile_discretizer Takes a column with continuous features and outputs a column with binned categorical features
ft_sql_transformer Implements the transformations which are defined by a SQL statement
ft_string_indexer Encodes a string column of labels to a column of label indices
ft_vector_assembler Combines a given list of columns into a single vector column
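For instance, a hedged sketch using ft_binarizer to threshold a numeric column; it reuses the iris_tbl created in the Examples section below, and the output column name is arbitrary (argument names differ slightly across sparklyr versions):

iris_tbl %>%
  ft_binarizer("Petal_Width", "petal_wide", threshold = 1.0) %>%  # 1 if Petal_Width > 1.0, else 0
  select(Petal_Width, petal_wide) %>%
  head(5)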

Examples

We will use the iris data set to examine a handful of learning algorithms and transformers. The iris data set measures attributes for 150 flowers in 3 different species of iris. library(sparklyr) ## Warning: package 'sparklyr' was built under R version 3.4.3 library(ggplot2) library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union sc <- spark_connect(master = "local") ## * Using Spark: 2.1.0 iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE) iris_tbl ## # Source: table<iris> [?? x 5] ## # Database: spark_connection ## Sepal_Length Sepal_Width Petal_Length Petal_Width Species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5.0 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## # ... with more rows

K-Means Clustering

Use Spark's K-means clustering to partition a dataset into groups. K-means clustering partitions points into k groups, such that the sum of squares from points to the assigned cluster centers is minimized. kmeans_model <- iris_tbl %>% select(Petal_Width, Petal_Length) %>% ml_kmeans(centers = 3) ## * No rows dropped by 'na.omit' call # print our model fit kmeans_model ## K-means clustering with 3 clusters ## ## Cluster centers: ## Petal_Width Petal_Length ## 1 1.359259 4.292593 ## 2 0.246000 1.462000 ## 3 2.047826 5.626087 ## ## Within Set Sum of Squared Errors = 31.41289 # predict the associated class predicted <- sdf_predict(kmeans_model, iris_tbl) %>% collect table(predicted$Species, predicted$prediction) ## ## 0 1 2 ## setosa 0 50 0 ## versicolor 48 0 2 ## virginica 6 0 44 # plot cluster membership sdf_predict(kmeans_model) %>% collect() %>% ggplot(aes(Petal_Length, Petal_Width)) + geom_point(aes(Petal_Width, Petal_Length, col = factor(prediction + 1)), size = 2, alpha = 0.5) + geom_point(data = kmeans_model$centers, aes(Petal_Width, Petal_Length), col = scales::muted(c("red", "green", "blue")), pch = 'x', size = 12) + scale_color_discrete(name = "Predicted Cluster", labels = paste("Cluster", 1:3)) + labs( x = "Petal Length", y = "Petal Width", title = "K-Means Clustering", subtitle = "Use Spark.ML to predict cluster membership with the iris dataset." )

Linear Regression

Use Spark's linear regression to model the linear relationship between a response variable and one or more explanatory variables. lm_model <- iris_tbl %>% select(Petal_Width, Petal_Length) %>% ml_linear_regression(Petal_Length ~ Petal_Width) ## * No rows dropped by 'na.omit' call iris_tbl %>% select(Petal_Width, Petal_Length) %>% collect %>% ggplot(aes(Petal_Length, Petal_Width)) + geom_point(aes(Petal_Width, Petal_Length), size = 2, alpha = 0.5) + geom_abline(aes(slope = coef(lm_model)[["Petal_Width"]], intercept = coef(lm_model)[["(Intercept)"]]), color = "red") + labs( x = "Petal Width", y = "Petal Length", title = "Linear Regression: Petal Length ~ Petal Width", subtitle = "Use Spark.ML linear regression to predict petal length as a function of petal width." )

Logistic Regression

Use Spark's logistic regression to perform logistic regression, modeling a binary outcome as a function of one or more explanatory variables. # Prepare beaver dataset beaver <- beaver2 beaver$activ <- factor(beaver$activ, labels = c("Non-Active", "Active")) copy_to(sc, beaver, "beaver") ## # Source: table<beaver> [?? x 4] ## # Database: spark_connection ## day time temp activ ## <dbl> <dbl> <dbl> <chr> ## 1 307 930 36.58 Non-Active ## 2 307 940 36.73 Non-Active ## 3 307 950 36.93 Non-Active ## 4 307 1000 37.15 Non-Active ## 5 307 1010 37.23 Non-Active ## 6 307 1020 37.24 Non-Active ## 7 307 1030 37.24 Non-Active ## 8 307 1040 36.90 Non-Active ## 9 307 1050 36.95 Non-Active ## 10 307 1100 36.89 Non-Active ## # ... with more rows beaver_tbl <- tbl(sc, "beaver") glm_model <- beaver_tbl %>% mutate(binary_response = as.numeric(activ == "Active")) %>% ml_logistic_regression(binary_response ~ temp) ## * No rows dropped by 'na.omit' call glm_model ## Call: binary_response ~ temp ## ## Coefficients: ## (Intercept) temp ## -550.52331 14.69184

PCA

Use Spark's Principal Components Analysis (PCA) to perform dimensionality reduction. PCA is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. pca_model <- tbl(sc, "iris") %>% select(-Species) %>% ml_pca() ## * No rows dropped by 'na.omit' call print(pca_model) ## Explained variance: ## ## PC1 PC2 PC3 PC4 ## 0.924618723 0.053066483 0.017102610 0.005212184 ## ## Rotation: ## PC1 PC2 PC3 PC4 ## Sepal_Length -0.36138659 -0.65658877 0.58202985 0.3154872 ## Sepal_Width 0.08452251 -0.73016143 -0.59791083 -0.3197231 ## Petal_Length -0.85667061 0.17337266 -0.07623608 -0.4798390 ## Petal_Width -0.35828920 0.07548102 -0.54583143 0.7536574

Random Forest

Use Spark's Random Forest to perform regression or multiclass classification. rf_model <- iris_tbl %>% ml_random_forest(Species ~ Petal_Length + Petal_Width, type = "classification") ## * No rows dropped by 'na.omit' call rf_predict <- sdf_predict(rf_model, iris_tbl) %>% ft_string_indexer("Species", "Species_idx") %>% collect table(rf_predict$Species_idx, rf_predict$prediction) ## ## 0 1 2 ## 0 49 1 0 ## 1 0 50 0 ## 2 0 0 50

SDF Partitioning

Split a Spark DataFrame into training, test datasets. partitions <- tbl(sc, "iris") %>% sdf_partition(training = 0.75, test = 0.25, seed = 1099) fit <- partitions$training %>% ml_linear_regression(Petal_Length ~ Petal_Width) ## * No rows dropped by 'na.omit' call estimate_mse <- function(df){ sdf_predict(fit, df) %>% mutate(resid = Petal_Length - prediction) %>% summarize(mse = mean(resid ^ 2)) %>% collect } sapply(partitions, estimate_mse) ## $training.mse ## [1] 0.2374596 ## ## $test.mse ## [1] 0.1898848

FT String Indexing

Use ft_string_indexer and ft_index_to_string to convert a character column into a numeric column and back again. ft_string2idx <- iris_tbl %>% ft_string_indexer("Species", "Species_idx") %>% ft_index_to_string("Species_idx", "Species_remap") %>% collect table(ft_string2idx$Species, ft_string2idx$Species_remap) ## ## setosa versicolor virginica ## setosa 50 0 0 ## versicolor 0 50 0 ## virginica 0 0 50

SDF Mutate

sdf_mutate is provided as a helper function, to allow you to use feature transformers. For example, the previous code snippet could have been written as: ft_string2idx <- iris_tbl %>% sdf_mutate(Species_idx = ft_string_indexer(Species)) %>% sdf_mutate(Species_remap = ft_index_to_string(Species_idx)) %>% collect ft_string2idx %>% select(Species, Species_idx, Species_remap) %>% distinct ## # A tibble: 3 x 3 ## Species Species_idx Species_remap ## <chr> <dbl> <chr> ## 1 setosa 2 setosa ## 2 versicolor 0 versicolor ## 3 virginica 1 virginica

Example Workflow

Let's walk through a simple example to demonstrate the use of Spark's machine learning algorithms within R. We'll use ml_linear_regression to fit a linear regression model. Using the built-in mtcars dataset, we'll try to predict a car's fuel consumption (mpg) based on its weight (wt) and the number of cylinders the engine contains (cyl). First, we will copy the mtcars dataset into Spark:
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
Transform the data with Spark SQL, feature transformers, and DataFrame functions.
    Use Spark SQL to remove all cars with horsepower less than 100
    Use Spark feature transformers to bucket cars into two groups based on cylinders
    Use Spark DataFrame functions to partition the data into test and training
Then fit a linear model using spark ML. Model MPG as a function of weight and cylinders. # transform our data set, and then partition into 'training', 'test' partitions <- mtcars_tbl %>% filter(hp >= 100) %>% sdf_mutate(cyl8 = ft_bucketizer(cyl, c(0,8,12))) %>% sdf_partition(training = 0.5, test = 0.5, seed = 888) # fit a linear mdoel to the training dataset fit <- partitions$training %>% ml_linear_regression(mpg ~ wt + cyl) ## * No rows dropped by 'na.omit' call # summarize the model summary(fit) ## Call: ml_linear_regression(., mpg ~ wt + cyl) ## ## Deviance Residuals:: ## Min 1Q Median 3Q Max ## -2.0947 -1.2747 -0.1129 1.0876 2.2185 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 33.79558 2.67240 12.6462 4.92e-07 *** ## wt -1.59625 0.73729 -2.1650 0.05859 . ## cyl -1.58036 0.49670 -3.1817 0.01115 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-Squared: 0.8267 ## Root Mean Squared Error: 1.437 The summary() suggests that our model is a fairly good fit, and that both a cars weight, as well as the number of cylinders in its engine, will be powerful predictors of its average fuel consumption. (The model suggests that, on average, heavier cars consume more fuel.) Let's use our Spark model fit to predict the average fuel consumption on our test data set, and compare the predicted response with the true measured fuel consumption. We'll build a simple ggplot2 plot that will allow us to inspect the quality of our predictions. # Score the data pred <- sdf_predict(fit, partitions$test) %>% collect # Plot the predicted versus actual mpg ggplot(pred, aes(x = mpg, y = prediction)) + geom_abline(lty = "dashed", col = "red") + geom_point() + theme(plot.title = element_text(hjust = 0.5)) + coord_fixed(ratio = 1) + labs( x = "Actual Fuel Consumption", y = "Predicted Fuel Consumption", title = "Predicted vs. Actual Fuel Consumption" ) Although simple, our model appears to do a fairly good job of predicting a car's average fuel consumption. As you can see, we can easily and effectively combine feature transformers, machine learning algorithms, and Spark DataFrame functions into a complete analysis with Spark and R.

Understanding Spark Caching

Introduction

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small dataset or when running an iterative algorithm like random forests. Since operations in Spark are lazy, caching can help force computation. sparklyr tools can be used to cache and uncache DataFrames. The Spark UI will tell you which DataFrames and what percentages are in memory. By using a reproducible example, we will review some of the main configuration settings, commands, and command arguments that can help you get the best out of Spark's memory management options.

Preparation

Download Test Data

The 2008 and 2007 flights data from the Statistical Computing site will be used for this exercise. The spark_read_csv function supports reading compressed CSV files in the bz2 format, so no additional file preparation is needed.
if(!file.exists("2008.csv.bz2")) {
  download.file("http://stat-computing.org/dataexpo/2009/2008.csv.bz2", "2008.csv.bz2")
}
if(!file.exists("2007.csv.bz2")) {
  download.file("http://stat-computing.org/dataexpo/2009/2007.csv.bz2", "2007.csv.bz2")
}

Start a Spark session

A local deployment will be used for this example.
library(sparklyr)
library(dplyr)
library(ggplot2)
# Install Spark version 2
spark_install(version = "2.0.0")
# Customize the connection configuration
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "16G"
# Connect to Spark
sc <- spark_connect(master = "local", config = conf, version = "2.0.0")

The Memory Argument

In the spark_read_… functions, the memory argument controls whether the data will be loaded into memory as an RDD. Setting it to FALSE means that Spark will essentially map the file, but not make a copy of it in memory. This makes the spark_read_csv command run faster, but the trade-off is that any data transformation operations will take much longer.
spark_read_csv(sc, "flights_spark_2008", "2008.csv.bz2", memory = FALSE)
In the RStudio IDE, the flights_spark_2008 table now shows up in the Spark tab.
To access the Spark Web UI, click the SparkUI button in the RStudio Spark Tab. As expected, the Storage page shows no tables loaded into memory.

Loading Less Data into Memory

Using the pre-processing capabilities of Spark, the data will be transformed before being loaded into memory. In this section, we will continue to build on the example started in the previous section.

Lazy Transform

The following dplyr script will not be immediately run, so the code is processed quickly. Some checks are performed, but for the most part the script is building a Spark SQL statement in the background.
flights_table <- tbl(sc, "flights_spark_2008") %>%
  mutate(DepDelay = as.numeric(DepDelay),
         ArrDelay = as.numeric(ArrDelay),
         DepDelay > 15, DepDelay < 240,
         ArrDelay > -60, ArrDelay < 360,
         Gain = DepDelay - ArrDelay) %>%
  filter(ArrDelay > 0) %>%
  select(Origin, Dest, UniqueCarrier, Distance, DepDelay, ArrDelay, Gain)

Register in Spark

sdf_register will register the resulting Spark SQL in Spark. The results will show up as a table called flights_spark. But a table of the same name is still not loaded into memory in Spark.
sdf_register(flights_table, "flights_spark")

Cache into Memory

The tbl_cache command loads the results into a Spark RDD in memory, so any analysis from there on will not need to re-read and re-transform the original file. The resulting RDD is smaller than the original file because the transformations created a smaller data set.
tbl_cache(sc, "flights_spark")

Driver Memory

In the Executors page of the Spark Web UI, we can see that the Storage Memory is at about half of the 16 gigabytes requested. This is mainly because of a Spark setting called spark.memory.fraction, which reserves by default 40% of the memory requested.

Process on the fly

The plan is to read the Flights 2007 file, combine it with the 2008 file and summarize the data without bringing either file fully into memory.
spark_read_csv(sc, "flights_spark_2007", "2007.csv.bz2", memory = FALSE)

Union and Transform

The union command is akin to the bind_rows dplyr command. It will allow us to append the 2007 file to the 2008 file, and as with the previous transform, this script will be evaluated lazily.
all_flights <- tbl(sc, "flights_spark_2008") %>%
  union(tbl(sc, "flights_spark_2007")) %>%
  group_by(Year, Month) %>%
  tally()

Collect into R

When receiving a collect command, Spark will execute the SQL statement and send the results back to R in a data frame. In this case, R only loads 24 observations into a data frame called all_flights.
all_flights <- all_flights %>% collect()

Plot in R

Now the smaller data set can be plotted:
ggplot(data = all_flights, aes(x = Month, y = n/1000, fill = factor(Year))) +
  geom_area(position = "dodge", alpha = 0.5) +
  geom_line(alpha = 0.4) +
  scale_fill_brewer(palette = "Dark2", name = "Year") +
  scale_x_continuous(breaks = 1:12, labels = c("J","F","M","A","M","J","J","A","S","O","N","D")) +
  theme_light() +
  labs(y = "Number of Flights (Thousands)", title = "Number of Flights Year-Over-Year")

Deployment and Configuration

Deployment

There are two well supported deployment modes for sparklyr:
Local — Working on a local desktop typically with smaller/sampled datasets
Cluster — Working directly within or alongside a Spark cluster (standalone, YARN, Mesos, etc.)

Local Deployment

Local mode is an excellent way to learn and experiment with Spark. Local mode also provides a convenient development environment for analyses, reports, and applications that you plan to eventually deploy to a multi-node Spark cluster. To work in local mode, you should first install a version of Spark for local use. You can do this using the spark_install function, for example:
sparklyr::spark_install(version = "2.1.0")
To connect to the local Spark instance, you pass "local" as the value of the Spark master node to spark_connect:
library(sparklyr)
sc <- spark_connect(master = "local")
For the local development scenario, see the Configuration section below for details on how to have the same code work seamlessly in both development and production environments.

Cluster Deployment

A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. the master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. the Spark shell). For more information see Submitting Applications.
To use sparklyr with a Spark cluster, you should locate your R session on a machine that is either directly on one of the cluster nodes or close to the cluster (for networking performance). In the case where R is not running directly on the cluster, you should also ensure that the machine has a Spark version and configuration identical to that of the cluster nodes. The most straightforward way to run R within or near the cluster is either a remote SSH session or RStudio Server.
In cluster mode you use the version of Spark already deployed on the cluster node. This version is located via the SPARK_HOME environment variable, so you should be sure that this variable is correctly defined on your server before attempting a connection. This would typically be done within the Renviron.site configuration file. For example:
SPARK_HOME=/opt/spark/spark-2.0.0-bin-hadoop2.6
To connect, pass the address of the master node to spark_connect, for example:
library(sparklyr)
sc <- spark_connect(master = "spark://local:7077")
For a Hadoop YARN cluster, you can connect using the YARN master, for example:
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
If you are running on EC2 using the Spark EC2 deployment scripts, then you can read the master from /root/spark-ec2/cluster-url, for example:
library(sparklyr)
cluster_url <- system('cat /root/spark-ec2/cluster-url', intern = TRUE)
sc <- spark_connect(master = cluster_url)

Livy Connections

Livy, “An Open Source REST Service for Apache Spark (Apache License)”, is available starting in sparklyr 0.5 as an experimental feature. Among many scenarios, this enables connections from the RStudio desktop to Apache Spark when Livy is available and correctly configured in the remote cluster. To work with Livy locally, sparklyr supports livy_install(), which installs Livy in your local environment; this is similar to spark_install(). Since Livy is a service that enables remote connections into Apache Spark, the service needs to be started with livy_service_start(). Once the service is running, spark_connect() needs to reference the running service and use method = "livy"; then sparklyr can be used as usual. A short example follows:
livy_install()
livy_service_start()
sc <- spark_connect(master = "http://localhost:8998", method = "livy")
copy_to(sc, iris)
spark_disconnect(sc)
livy_service_stop()

Connection Tools

You can view the Spark web UI via the spark_web function, and view the Spark log via the spark_log function:
spark_web(sc)
spark_log(sc)
You can disconnect from Spark using the spark_disconnect function:
spark_disconnect(sc)

Collect

The collect function transfers data from Spark into R. The data are collected from a cluster environment and transferred into local R memory. In the process, all data is first transferred from the executor nodes to the driver node. Therefore, the driver node must have enough memory to collect all the data. Collecting data on the driver node is relatively slow. The process also inflates the data as it moves from the executor nodes to the driver node. Caution should be used when collecting large data. The following parameters could be adjusted to avoid OutOfMemory and Timeout errors:
spark.executor.heartbeatInterval
spark.network.timeout
spark.driver.extraJavaOptions
spark.driver.memory
spark.yarn.driver.memoryOverhead
spark.driver.maxResultSize
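A hedged sketch of raising the driver-side limits before collecting a large table is shown below; all values are illustrative and should be sized to your data, and dplyr is assumed to be loaded for the pipe:

conf <- spark_config()
conf$spark.driver.memory        <- "8G"
conf$spark.driver.maxResultSize <- "4G"
conf$spark.network.timeout      <- "600s"
sc <- spark_connect(master = "yarn-client", config = conf)
local_df <- sdf_len(sc, 1000) %>% collect()   # pull a small demo table back into R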

Configuration

This section describes the various options available for configuring both the behavior of the sparklyr package as well as the underlying Spark cluster. Creating multiple configuration profiles (e.g. development, test, production) is also covered.

Config Files

The configuration for a Spark connection is specified via the config parameter of the spark_connect function. By default the configuration is established by calling the spark_config function. This code represents the default behavior:
spark_connect(master = "local", config = spark_config())
By default the spark_config function reads configuration data from a file named config.yml located in the current working directory (or in parent directories if not located in the working directory). This file is not required and only needs to be provided for overriding default behavior. You can also specify an alternate config file name and/or location. The config.yml file is in turn processed using the config package, which enables support for multiple named configuration profiles.

Package Options

There are a number of options available to configure the behavior of the sparklyr package. For example, this configuration file sets the number of local cores to 4 and the amount of memory allocated for the Spark driver to 4G:
default:
  sparklyr.cores.local: 4
  sparklyr.shell.driver-memory: 4G
Note that the use of default will be explained below in Multiple Profiles.

Spark

Option Description
sparklyr.shell.* Command line parameters to pass to spark-submit. For example, sparklyr.shell.executor-memory: 20G configures --executor-memory 20G (see the Spark documentation for details on supported options).

Runtime

Option Description
sparklyr.cores.local Number of cores to use when running in local mode (defaults to parallel::detectCores).
sparklyr.sparkui.url Configures the url to the Spark UI web interface when calling spark_web.
sparklyr.defaultPackages List of default Spark packages to install in the cluster (defaults to “com.databricks:spark-csv_2.11:1.3.0” and “com.amazonaws:aws-java-sdk-pom:1.10.34”).
sparklyr.sanitize.column.names Allows Spark to automatically rename column names to conform to Spark naming restrictions.

Diagnostics

Option Description
sparklyr.backend.threads Number of threads to use in the sparklyr backend to process incoming connections from the sparklyr client.
sparklyr.app.jar The application jar to be submitted in Spark submit.
sparklyr.ports.file Path to the ports file used to share connection information to the sparklyr backend.
sparklyr.ports.wait.seconds Number of seconds to wait for the Spark connection to initialize.
sparklyr.verbose Provide additional feedback while performing operations. Currently used to communicate which column names are being sanitized in sparklyr.sanitize.column.names.
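As a small sketch, these diagnostics options can be set through spark_config() like any other property (the values are illustrative):
library(sparklyr)
conf <- spark_config()
conf$sparklyr.verbose <- TRUE            # report which column names get sanitized
conf$sparklyr.backend.threads <- 4       # threads for incoming sparklyr client connections
sc <- spark_connect(master = "local", config = conf)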

Spark Options

You can also use config.yml to specify arbitrary Spark configuration properties:
Option Description
spark.* Configuration settings for the Spark context (applied by creating a SparkConf containing the specified properties). For example, spark.executor.memory: 1g configures the memory available in each executor (see Spark Configuration for additional options.)
spark.sql.* Configuration settings for the Spark SQL context (applied using SET). For instance, spark.sql.shuffle.partitions configures number of partitions to use while shuffling (see SQL Programming Guide for additional options).
For example, this configuration file sets a custom scratch directory for Spark and specifies 100 as the number of partitions to use when shuffling data for joins or aggregations:
default:
  spark.local.dir: /tmp/spark-scratch
  spark.sql.shuffle.partitions: 100
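The same properties could also be set from R via spark_config(), which may be convenient for quick experiments; a minimal sketch:
library(sparklyr)
conf <- spark_config()
conf$spark.local.dir <- "/tmp/spark-scratch"     # scratch directory for shuffle and spill files
conf$spark.sql.shuffle.partitions <- 100         # partitions used when shuffling for joins or aggregations
sc <- spark_connect(master = "local", config = conf)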

User Options

You can also include arbitrary custom user options within the config.yml file. These can be named anything you like so long as they do not use either spark or sparklyr as a prefix. For example, this configuration file defines dataset and sample-size options:
default:
  dataset: "observations.parquet"
  sample-size: 10000
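These values can then be read back in R with the config package; a brief sketch using the option names from the example above:
library(config)
dataset <- config::get("dataset")            # "observations.parquet"
sample_size <- config::get("sample-size")    # 10000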

Multiple Profiles

The config package enables the definition of multiple named configuration profiles for different environments (e.g. default, test, production). All environments automatically inherit from the default environment and can optionally also inherit from each other. For example, you might want to use distinct datasets for development and testing, or use custom Spark configuration properties that are only applied when running on a production cluster. Here’s how that would be expressed in config.yml:
default:
  dataset: "observations-dev.parquet"
  sample-size: 10000
production:
  spark.memory.fraction: 0.9
  spark.rdd.compress: true
  dataset: "observations.parquet"
  sample-size: null
You can also use this feature to specify distinct Spark master nodes for different environments, for example:
default:
  spark.master: "local"
production:
  spark.master: "spark://local:7077"
With this configuration, you can omit the master argument entirely from the call to spark_connect: sc <- spark_connect() Note that the currently active configuration is determined via the value of the R_CONFIG_ACTIVE environment variable. See the config package documentation for additional details.
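For instance, to activate the production profile before connecting (a sketch using the profile names from the example above):
Sys.setenv(R_CONFIG_ACTIVE = "production")   # select the active configuration profile
library(sparklyr)
sc <- spark_connect()                        # master is taken from spark.master in the active profile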

Tuning

In general, you will need to tune a Spark cluster for it to perform well. Spark applications tend to consume a lot of resources. There are many knobs to control the performance of YARN and executor (i.e. worker) nodes in a cluster. Some of the parameters to pay attention to are as follows: spark.executor.heartbeatInterval spark.network.timeout spark.executor.extraJavaOptions spark.executor.memory spark.yarn.executor.memoryOverhead spark.executor.cores spark.executor.instances (if dynamic allocation is not enabled)
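A sketch of how a few of these might be requested from R; the values are illustrative assumptions, and as noted earlier the cluster may silently override them:
library(sparklyr)
conf <- spark_config()
conf$spark.executor.memory <- "4G"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 4                  # only honored when dynamic allocation is off
conf$spark.yarn.executor.memoryOverhead <- 1024     # in MB
sc <- spark_connect(master = "yarn-client", config = conf)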

Example Config

Here is an example Spark configuration for an EMR cluster on AWS with 1 master and 2 worker nodes. Each node has 8 vCPUs and 61 GiB of memory.
Parameter Value
spark.driver.extraJavaOptions append -XX:MaxPermSize=30G
spark.driver.maxResultSize 0
spark.driver.memory 30G
spark.yarn.driver.memoryOverhead 4096
spark.yarn.executor.memoryOverhead 4096
spark.executor.memory 4G
spark.executor.cores 2
spark.dynamicAllocation.maxExecutors 15
Configuration parameters can be set in the config R object or can be set in the config.yml. Alternatively, they can be set in the spark-defaults.conf.
Configuration in R script
config <- spark_config() config$spark.executor.cores <- 2 config$spark.executor.memory <- "4G" sc <- spark_connect(master = "yarn-client", config = config, version = '2.0.0')
Configuration in YAML script
default:
  spark.executor.cores: 2
  spark.executor.memory: 4G

RStudio Server

RStudio Server provides a web-based IDE interface to a remote R session, making it ideal for use as a front-end to a Spark cluster. This section covers some additional configuration options that are useful for RStudio Server.

Connection Options

The RStudio IDE Spark pane provides a New Connection dialog to assist in connecting with both local instances of Spark and Spark clusters. You can configure which connection choices are presented using the rstudio.spark.connections option. By default, users are presented with the possibility of both local and cluster connections; however, you can modify this behavior to present only one of these, or even a specific Spark master URL. Some commonly used combinations of connection choices include:
Value Description
c("local", "cluster") Default. Present connections to both local and cluster Spark instances.
"local" Present only connections to local Spark instances.
"spark://local:7077" Present only a connection to a specific Spark cluster.
c("spark://local:7077", "cluster") Present a connection to a specific Spark cluster and other clusters.
This option should generally be set within Rprofile.site. For example: options(rstudio.spark.connections = "spark://local:7077")

Spark Installations

If you are running within local mode (as opposed to cluster mode) you may want to provide pre-installed Spark version(s) to be shared by all users of the server. You can do this by installing Spark versions within a shared directory (e.g. /opt/spark) then designating it as the Spark installation directory. For example, after installing one or more versions of Spark to /opt/spark you would add the following to Rprofile.site: options(spark.install.dir = "/opt/spark") If this directory is read-only for ordinary users then RStudio will not offer installation of additional versions, which will help guide users to a version that is known to be compatible with versions of Spark deployed on clusters in the same organization.
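For example, an administrator with write access to the shared directory might pre-install a version there (a sketch, assuming spark_install() honors the spark.install.dir option as described above; the directory and version are placeholders):
options(spark.install.dir = "/opt/spark")   # shared, admin-writable location
library(sparklyr)
spark_install(version = "2.1.0")            # installs under /opt/spark for all users of the server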

Distributing R Computations

Overview

sparklyr provides support to run arbitrary R code at scale within your Spark Cluster through spark_apply(). This is especially useful where there is a need to use functionality that is available only in R, or in R packages, and is not available in Apache Spark or Spark Packages. spark_apply() applies an R function to a Spark object (typically, a Spark DataFrame). Spark objects are partitioned so they can be distributed across a cluster. You can use spark_apply with the default partitions or you can define your own partitions with the group_by argument. Your R function must return another data frame. spark_apply will run your R function on each partition and output a single Spark DataFrame.

Apply an R function to a Spark Object

Let's run a simple example. We will apply the identity function, I(), over a list of numbers we created with the sdf_len function. library(sparklyr) sc <- spark_connect(master = "local") sdf_len(sc, 5, repartition = 1) %>% spark_apply(function(e) I(e)) ## # Source: table<sparklyr_tmp_378c2e4fb50> [?? x 1] ## # Database: spark_connection ## id ## <dbl> ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5 Your R function should be designed to operate on an R data frame. The R function passed to spark_apply expects a DataFrame and will return an object that can be cast as a DataFrame. We can use the class function to verify the class of the data. sdf_len(sc, 10, repartition = 1) %>% spark_apply(function(e) class(e)) ## # Source: table<sparklyr_tmp_378c7ce7618d> [?? x 1] ## # Database: spark_connection ## id ## <chr> ## 1 data.frame Spark will partition your data by hash or range so it can be distributed across a cluster. In the following example we create two partitions and count the number of rows in each partition. Then we print the first record in each partition. trees_tbl <- sdf_copy_to(sc, trees, repartition = 2) trees_tbl %>% spark_apply(function(e) nrow(e), names = "n") ## # Source: table<sparklyr_tmp_378c15c45eb1> [?? x 1] ## # Database: spark_connection ## n ## <int> ## 1 16 ## 2 15 trees_tbl %>% spark_apply(function(e) head(e, 1)) ## # Source: table<sparklyr_tmp_378c29215418> [?? x 3] ## # Database: spark_connection ## Girth Height Volume ## <dbl> <dbl> <dbl> ## 1 8.3 70 10.3 ## 2 8.6 65 10.3 We can apply any arbitrary function to the partitions in the Spark DataFrame. For instance, we can scale or jitter the columns. Notice that spark_apply applies the R function to all partitions and returns a single DataFrame. trees_tbl %>% spark_apply(function(e) scale(e)) ## # Source: table<sparklyr_tmp_378c8922ba8> [?? x 3] ## # Database: spark_connection ## Girth Height Volume ## <dbl> <dbl> <dbl> ## 1 -1.4482330 -0.99510521 -1.1503645 ## 2 -1.3021313 -2.06675697 -1.1558670 ## 3 -0.7469449 0.68891899 -0.6826528 ## 4 -0.6592839 -1.60747764 -0.8587325 ## 5 -0.6300635 0.53582588 -0.4735581 ## 6 -0.5716229 0.38273277 -0.3855183 ## 7 -0.5424025 -0.07654655 -0.5395880 ## 8 -0.3670805 -0.22963966 -0.6661453 ## 9 -0.1040975 1.30129143 0.1427209 ## 10 0.1296653 -0.84201210 -0.3029809 ## # ... with more rows trees_tbl %>% spark_apply(function(e) lapply(e, jitter)) ## # Source: table<sparklyr_tmp_378c43237574> [?? x 3] ## # Database: spark_connection ## Girth Height Volume ## <dbl> <dbl> <dbl> ## 1 8.319392 70.04321 10.30556 ## 2 8.801237 62.85795 10.21751 ## 3 10.719805 81.15618 18.78076 ## 4 11.009892 65.98926 15.58448 ## 5 11.089322 80.14661 22.58749 ## 6 11.309682 79.01360 24.18158 ## 7 11.418486 75.88748 21.38380 ## 8 11.982421 74.85612 19.09375 ## 9 12.907616 84.81742 33.80591 ## 10 13.691892 71.05309 25.70321 ## # ... with more rows By default spark_apply() derives the column names from the input Spark data frame. Use the names argument to rename or add new columns. trees_tbl %>% spark_apply( function(e) data.frame(2.54 * e$Girth, e), names = c("Girth(cm)", colnames(trees))) ## # Source: table<sparklyr_tmp_378c14e015b5> [?? x 4] ## # Database: spark_connection ## `Girth(cm)` Girth Height Volume ## <dbl> <dbl> <dbl> <dbl> ## 1 21.082 8.3 70 10.3 ## 2 22.352 8.8 63 10.2 ## 3 27.178 10.7 81 18.8 ## 4 27.940 11.0 66 15.6 ## 5 28.194 11.1 80 22.6 ## 6 28.702 11.3 79 24.2 ## 7 28.956 11.4 76 21.4 ## 8 30.480 12.0 75 19.1 ## 9 32.766 12.9 85 33.8 ## 10 34.798 13.7 71 25.7 ## # ... with more rows

Group By

In some cases you may want to apply your R function to specific groups in your data. For example, suppose you want to compute regression models against specific subgroups. To solve this, you can specify a group_by argument. This example counts the number of rows in iris by species and then fits a simple linear model for each species. iris_tbl <- sdf_copy_to(sc, iris) iris_tbl %>% spark_apply(nrow, group_by = "Species") ## # Source: table<sparklyr_tmp_378c1b8155f3> [?? x 2] ## # Database: spark_connection ## Species Sepal_Length ## <chr> <int> ## 1 versicolor 50 ## 2 virginica 50 ## 3 setosa 50 iris_tbl %>% spark_apply( function(e) summary(lm(Petal_Length ~ Petal_Width, e))$r.squared, names = "r.squared", group_by = "Species") ## # Source: table<sparklyr_tmp_378c30e6155> [?? x 2] ## # Database: spark_connection ## Species r.squared ## <chr> <dbl> ## 1 versicolor 0.6188467 ## 2 virginica 0.1037537 ## 3 setosa 0.1099785

Distributing Packages

With spark_apply() you can use any R package inside Spark. For instance, you can use the broom package to create a tidy data frame from linear regression output. spark_apply( iris_tbl, function(e) broom::tidy(lm(Petal_Length ~ Petal_Width, e)), names = c("term", "estimate", "std.error", "statistic", "p.value"), group_by = "Species") ## # Source: table<sparklyr_tmp_378c5502500b> [?? x 6] ## # Database: spark_connection ## Species term estimate std.error statistic p.value ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 versicolor (Intercept) 1.7812754 0.2838234 6.276000 9.484134e-08 ## 2 versicolor Petal_Width 1.8693247 0.2117495 8.827999 1.271916e-11 ## 3 virginica (Intercept) 4.2406526 0.5612870 7.555230 1.041600e-09 ## 4 virginica Petal_Width 0.6472593 0.2745804 2.357267 2.253577e-02 ## 5 setosa (Intercept) 1.3275634 0.0599594 22.141037 7.676120e-27 ## 6 setosa Petal_Width 0.5464903 0.2243924 2.435422 1.863892e-02 To use R packages inside Spark, your packages must be installed on the worker nodes. The first time you call spark_apply all of the contents in your local .libPaths() will be copied into each Spark worker node via the SparkConf.addFile() function. Packages will only be copied once and will persist as long as the connection remains open. It's not uncommon for R libraries to be several gigabytes in size, so be prepared for a one-time tax while the R packages are copied over to your Spark cluster. You can disable package distribution by setting packages = FALSE. Note: packages are not copied in local mode (master="local") because the packages already exist on the system.

Handling Errors

It can be more difficult to troubleshoot R issues in a cluster than in local mode. For instance, the following R code causes the distributed execution to fail and suggests you check the logs for details. spark_apply(iris_tbl, function(e) stop("Make this fail")) Error in force(code) : sparklyr worker rscript failure, check worker logs for details In local mode, sparklyr will retrieve the logs for you. The logs point out the real failure as ERROR sparklyr: RScript (4190) Make this fail as you might expect. ---- Output Log ---- (17/07/27 21:24:18 ERROR sparklyr: Worker (2427) is shutting down with exception ,java.net.SocketException: Socket closed) 17/07/27 21:24:18 WARN TaskSetManager: Lost task 0.0 in stage 389.0 (TID 429, localhost, executor driver): 17/07/27 21:27:21 INFO sparklyr: RScript (4190) retrieved 150 rows 17/07/27 21:27:21 INFO sparklyr: RScript (4190) computing closure 17/07/27 21:27:21 ERROR sparklyr: RScript (4190) Make this fail It is worth mentioning that different cluster providers and platforms expose worker logs in different ways. Specific documentation for your environment will point out how to retrieve these logs.

Requirements

The R Runtime is expected to be pre-installed in the cluster for spark_apply to function. Failure to install R on the cluster will trigger a Cannot run program, no such file or directory error while attempting to use spark_apply(). Contact your cluster administrator to consider making the R runtime available throughout the entire cluster. A Homogeneous Cluster is required since the driver node distributes, and potentially compiles, packages to the workers. For instance, the driver and workers must have the same processor architecture, system libraries, etc.

Configuration

The following table describes relevant parameters while making use of spark_apply.
Value Description
spark.r.command The path to the R binary. Useful to select from multiple R versions.
sparklyr.worker.gateway.address The gateway address to use under each worker node. Defaults to sparklyr.gateway.address.
sparklyr.worker.gateway.port The gateway port to use under each worker node. Defaults to sparklyr.gateway.port.
For example, one could make use of a specific R version by running: config <- spark_config() config[["spark.r.command"]] <- "<path-to-r-version>" sc <- spark_connect(master = "local", config = config) sdf_len(sc, 10) %>% spark_apply(function(e) e)

Limitations

Closures

Closures are serialized using serialize, which is described as “A simple low-level interface for serializing to connections.” One of the current limitations of serialize is that it won't serialize objects referenced outside of its environment. For instance, the following function will error out since the closure references external_value: external_value <- 1 spark_apply(iris_tbl, function(e) e + external_value)
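Possible workarounds are to make the value part of the function body, or to pass it through the context argument of spark_apply(), which is serialized along with the closure; the sketch below assumes an open connection sc and that the context object is supplied as the function's second argument:
library(sparklyr)
library(dplyr)
external_value <- 1
# Workaround 1: hard-code the value inside the function body
sdf_len(sc, 5) %>% spark_apply(function(e) e + 1)
# Workaround 2: pass the value explicitly through the context argument
sdf_len(sc, 5) %>% spark_apply(function(e, ctx) e + ctx, context = external_value)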

Livy

Currently, Livy connections do not support distributing packages, since the client machine where the libraries are precompiled might not have the same processor architecture or operating system as the cluster machines.

Computing over Groups

While performing computations over groups, spark_apply() will partition the data over the selected column; however, this implies that each partition must fit into a worker node. If this is not the case, an exception will be thrown. To perform operations over groups that exceed the resources of a single node, consider partitioning into smaller units, or use dplyr::do, which is currently optimized for large partitions.

Package Installation

Since packages are copied only once for the duration of the spark_connect() connection, installing additional packages is not supported while the connection is active. Therefore, if a new package needs to be installed, disconnect with spark_disconnect(), install or update the packages locally, and then reconnect.
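A minimal sketch of that cycle (the package and master shown are assumptions):
spark_disconnect(sc)                          # close the active connection
install.packages("broom")                     # install or update packages locally
sc <- spark_connect(master = "yarn-client")   # reconnect; .libPaths() is copied again on the next spark_apply()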

Data Science using a Data Lake

Audience

This article aims to explain how to take advantage of Apache Spark inside organizations that have already implemented, or are in the process of implementing, a Hadoop based Big Data Lake.

Introduction

We have noticed that the types of questions we field after a demo of sparklyr to our customers were more about high-level architecture than how the package works. To answer those questions, we put together a set of slides that illustrate and discuss important concepts, to help customers see where Spark, R, and sparklyr fit in a Big Data Platform implementation. In this article, we’ll review those slides and provide a narrative that will help you better envision how you can take advantage of our products.

R for Data Science

It is very important to preface the Use Case review with some background information about where RStudio focuses its efforts when developing packages and products. Many vendors offer R integration, but in most cases, what this means is that they will add a model built in R to their pipeline or interface, and pass new inputs to that model to generate outputs that can be used in the next step in the pipeline, or in a calculation for the interface. In contrast, our focus is on the process that happens before that: the discipline that produces the model, meaning Data Science.
In their R for Data Science book, Hadley Wickham and Garrett Grolemund provide a great diagram that nicely illustrates the Data Science process: We import data into memory with R and clean and tidy the data. Then we go into a cyclical process called understand, which helps us to get to know our data, and hopefully find the answer to the question we started with. This cycle typically involves making transformations to our tidied data, using the transformed data to fit models, and visualizing results. Once we find an answer to our question, we then communicate the results. Data Scientists like using R because it allows them to complete a Data Science project from beginning to end inside the R environment, and in memory.

Hadoop as a Data Source

What happens when the data that needs to be analyzed is very large, like the data sets found in a Hadoop cluster? It would be impossible to fit these in memory, so workarounds are normally used. Possible workarounds include using a comparatively minuscule data sample, or downloading as much data as possible. This becomes disruptive to Data Scientists because either the small sample may not be representative, or they have to wait a long time in every iteration of importing a lot of data, exploring a lot of data, and modeling a lot of data.

Spark as an Analysis Engine

We noticed that a very important mental leap to make is to see Spark not just as a gateway to Hadoop (or worse, as an additional data source), but as a computing engine. As such, it is an excellent vehicle to scale our analytics. Spark has many capabilities that make it ideal for Data Science in a data lake, such as close integration with Hadoop and Hive, the ability to cache data into memory across multiple nodes, data transformers, and its Machine Learning libraries. The approach, then, is to push as much compute to the cluster as possible, using R primarily as an interface to Spark for the Data Scientist, and collecting as few results as possible back into R memory, mostly to visualize and communicate. As shown in the slide, the more import, tidy, transform and modeling work we can push to Spark, the faster we can analyze very large data sets.

Cluster Setup

Here is an illustration of how R, RStudio, and sparklyr can be added to the YARN managed cluster. The highlights are:
    R, RStudio, and sparklyr need to be installed on one node only, typically an edge node.
    The Data Scientist can access R, Spark, and the cluster via a web browser by navigating to the RStudio IDE inside the edge node.

Considerations

There are some important considerations to keep in mind when combining your Data Lake and R for large scale analytics:
    Spark’s Machine Learning libraries may not contain specific models that a Data Scientist needs. For those cases, workarounds would include using a sparklyr extension like H2O, or collecting a sample of the data into R memory for modeling.
    Spark does not have visualization functionality; currently, the best approach is to collect pre-calculated data into R for plotting. A good way to drastically reduce the number of rows being brought back into memory is to push as much computation as possible to Spark, and return just the results to be plotted. For example, the bins of a Histogram can be calculated in Spark, so that only the final bucket values would be returned to R for visualization (see the sketch after this list). Here is sample code for such a scenario: sparkDemos/Histogram
    A particular use case may require a different way of scaling analytics. We have published an article that provides a very good overview of the options that are available: R for Enterprise: How to Scale Your Analytics Using R
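As a rough sketch of the histogram idea (and not the sparkDemos code itself), the bins can be computed with dplyr inside Spark and only the bucket totals collected for plotting; the data set, column, and bucket width below are assumptions:
library(sparklyr)
library(dplyr)
library(ggplot2)
sc <- spark_connect(master = "local")
flights_tbl <- sdf_copy_to(sc, nycflights13::flights, "flights")
bins <- flights_tbl %>%
  filter(!is.na(dep_delay)) %>%
  mutate(bin = floor(dep_delay / 15) * 15) %>%   # 15-minute wide buckets, computed in Spark
  group_by(bin) %>%
  summarise(n = n()) %>%
  collect()                                      # only the summarized buckets reach R memory
ggplot(bins, aes(x = bin, y = n)) + geom_col()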

R for Data Science Toolchain with Spark

With sparklyr, the Data Scientist will be able to access the Data Lake’s data, and also gain an additional, very powerful understand layer via Spark. sparklyr, along with the RStudio IDE and the tidyverse packages, provides the Data Scientist with an excellent toolbox to analyze data, big and small.

Spark ML Pipelines

Spark’s ML Pipelines provide a way to easily combine multiple transformations and algorithms into a single workflow, or pipeline. For R users, the insights gathered during the interactive sessions with Spark can now be converted to a formal pipeline. This makes the hand-off from Data Scientists to Big Data Engineers a lot easier, because no additional changes should need to be made by the latter group. The final list of selected variables, data manipulation, feature transformations and modeling can be easily re-written into an ml_pipeline() object, saved, and ultimately placed into a Production environment. The sparklyr output of a saved Spark ML Pipeline object is in Scala code, which means that the code can be added to the scheduled Spark ML jobs, and without any dependencies in R.

Introduction to ML Pipelines

The official Apache Spark site contains a more complete overview of ML Pipelines. This article will focus on introducing the basic concepts and steps to work with ML Pipelines via sparklyr. There are two important stages in building an ML Pipeline. The first one is creating a Pipeline. A good way to look at it, or call it, is as an “empty” pipeline. This step just builds the steps that the data will go through. This is somewhat equivalent to doing this in R: r_pipeline <- . %>% mutate(cyl = paste0("c", cyl)) %>% lm(am ~ cyl + mpg, data = .) r_pipeline ## Functional sequence with the following components: ## ## 1. mutate(., cyl = paste0("c", cyl)) ## 2. lm(am ~ cyl + mpg, data = .) ## ## Use 'functions' to extract the individual functions. The r_pipeline object has all the steps needed to transform and fit the model, but it has not yet transformed any data. The second step is to pass data through the pipeline, which in turn will output a fitted model. That is called a PipelineModel. The PipelineModel can then be used to produce predictions. r_model <- r_pipeline(mtcars) r_model ## ## Call: ## lm(formula = am ~ cyl + mpg, data = .) ## ## Coefficients: ## (Intercept) cylc6 cylc8 mpg ## -0.54388 0.03124 -0.03313 0.04767

Taking advantage of Pipelines and PipelineModels

The two stage ML Pipeline approach produces two final data products:
    A PipelineModel that can be added to the daily Spark jobs, which will produce new predictions for the incoming data, and again, with no R dependencies.
    A Pipeline that can be easily re-fitted on a regular interval, say every month. All that is needed is to pass a new sample to obtain the new coefficients.

Pipeline

An additional goal of this article is that the reader can follow along, so the data, transformations and Spark connection in this example will be kept as easy to reproduce as possible. library(nycflights13) library(sparklyr) library(dplyr) sc <- spark_connect(master = "local", spark_version = "2.2.0") ## * Using Spark: 2.2.0 spark_flights <- sdf_copy_to(sc, flights)

Feature Transformers

Pipelines make heavy use of Feature Transformers. If you are new to Spark and sparklyr, it would be good to review what these transformers do. These functions use the Spark API directly to transform the data, and may be faster at making the data manipulations than a dplyr (SQL) transformation. In sparklyr, the ft_ functions are essentially wrappers to the original Spark feature transformers.

ft_dplyr_transformer

This example will start with dplyr transformations, which are ultimately SQL transformations, loaded into the df variable. In sparklyr, there is one feature transformer that is not available in Spark, ft_dplyr_transformer(). The goal of this function is to convert the dplyr code to a SQL Feature Transformer that can then be used in a Pipeline. df <- spark_flights %>% filter(!is.na(dep_delay)) %>% mutate( month = paste0("m", month), day = paste0("d", day) ) %>% select(dep_delay, sched_dep_time, month, day, distance) This is the resulting pipeline stage produced from the dplyr code: ft_dplyr_transformer(sc, df) Use the ml_param() function to extract the “statement” attribute. That attribute contains the finalized SQL statement. Notice that the flights table name has been replaced with __THIS__. This allows the pipeline to accept different table names as its source, making the pipeline very modular. ft_dplyr_transformer(sc, df) %>% ml_param("statement") ## [1] "SELECT `dep_delay`, `sched_dep_time`, `month`, `day`, `distance`\nFROM (SELECT `year`, CONCAT(\"m\", `month`) AS `month`, CONCAT(\"d\", `day`) AS `day`, `dep_time`, `sched_dep_time`, `dep_delay`, `arr_time`, `sched_arr_time`, `arr_delay`, `carrier`, `flight`, `tailnum`, `origin`, `dest`, `air_time`, `distance`, `hour`, `minute`, `time_hour`\nFROM (SELECT *\nFROM `__THIS__`\nWHERE (NOT(((`dep_delay`) IS NULL)))) `bjbujfpqzq`) `axbwotqnbr`"

Creating the Pipeline

The following step will create a 5 stage pipeline:
    SQL transformer - Resulting from the ft_dplyr_transformer() transformation
    Binarizer - To determine if the flight should be considered delayed. The eventual outcome variable.
    Bucketizer - To split the day into specific hour buckets
    R Formula - To define the model’s formula
    Logistic Model
flights_pipeline <- ml_pipeline(sc) %>% ft_dplyr_transformer( tbl = df ) %>% ft_binarizer( input.col = "dep_delay", output.col = "delayed", threshold = 15 ) %>% ft_bucketizer( input.col = "sched_dep_time", output.col = "hours", splits = c(400, 800, 1200, 1600, 2000, 2400) ) %>% ft_r_formula(delayed ~ month + day + hours + distance) %>% ml_logistic_regression() Another nice feature of ML Pipelines in sparklyr is the print-out. It makes it really easy to see how each stage is set up: flights_pipeline ## Pipeline (Estimator) with 5 stages ## <pipeline_24044e4f2e21> ## Stages ## |--1 SQLTransformer (Transformer) ## | <dplyr_transformer_2404e6a1b8e> ## | (Parameters -- Column Names) ## |--2 Binarizer (Transformer) ## | <binarizer_24045c9227f2> ## | (Parameters -- Column Names) ## | input_col: dep_delay ## | output_col: delayed ## |--3 Bucketizer (Transformer) ## | <bucketizer_240412366b1e> ## | (Parameters -- Column Names) ## | input_col: sched_dep_time ## | output_col: hours ## |--4 RFormula (Estimator) ## | <r_formula_240442d75f00> ## | (Parameters -- Column Names) ## | features_col: features ## | label_col: label ## | (Parameters) ## | force_index_label: FALSE ## | formula: delayed ~ month + day + hours + distance ## |--5 LogisticRegression (Estimator) ## | <logistic_regression_24044321ad0> ## | (Parameters -- Column Names) ## | features_col: features ## | label_col: label ## | prediction_col: prediction ## | probability_col: probability ## | raw_prediction_col: rawPrediction ## | (Parameters) ## | aggregation_depth: 2 ## | elastic_net_param: 0 ## | family: auto ## | fit_intercept: TRUE ## | max_iter: 100 ## | reg_param: 0 ## | standardization: TRUE ## | threshold: 0.5 ## | tol: 1e-06 Notice that there are no coefficients defined yet. That’s because no data has been actually processed. Even though df is referenced in the dplyr transformer, recall that the SQL transformer only stores the statement (with the table name replaced by __THIS__), so there’s no data to process yet.

PipelineModel

A quick partition of the data is created for this exercise. partitioned_flights <- sdf_partition( spark_flights, training = 0.01, testing = 0.01, rest = 0.98 ) The ml_fit() function produces the PipelineModel. The training partition of the partitioned_flights data is used to train the model: fitted_pipeline <- ml_fit( flights_pipeline, partitioned_flights$training ) fitted_pipeline ## PipelineModel (Transformer) with 5 stages ## <pipeline_24044e4f2e21> ## Stages ## |--1 SQLTransformer (Transformer) ## | <dplyr_transformer_2404e6a1b8e> ## | (Parameters -- Column Names) ## |--2 Binarizer (Transformer) ## | <binarizer_24045c9227f2> ## | (Parameters -- Column Names) ## | input_col: dep_delay ## | output_col: delayed ## |--3 Bucketizer (Transformer) ## | <bucketizer_240412366b1e> ## | (Parameters -- Column Names) ## | input_col: sched_dep_time ## | output_col: hours ## |--4 RFormulaModel (Transformer) ## | <r_formula_240442d75f00> ## | (Parameters -- Column Names) ## | features_col: features ## | label_col: label ## | (Transformer Info) ## | formula: chr "delayed ~ month + day + hours + distance" ## |--5 LogisticRegressionModel (Transformer) ## | <logistic_regression_24044321ad0> ## | (Parameters -- Column Names) ## | features_col: features ## | label_col: label ## | prediction_col: prediction ## | probability_col: probability ## | raw_prediction_col: rawPrediction ## | (Transformer Info) ## | coefficient_matrix: num [1, 1:43] 0.709 -0.3401 -0.0328 0.0543 -0.4774 ... ## | coefficients: num [1:43] 0.709 -0.3401 -0.0328 0.0543 -0.4774 ... ## | intercept: num -3.04 ## | intercept_vector: num -3.04 ## | num_classes: int 2 ## | num_features: int 43 ## | threshold: num 0.5 Notice that the print-out for the fitted pipeline now displays the model’s coefficients. The ml_transform() function can be used to run predictions, in other words it is used instead of predict() or sdf_predict(). predictions <- ml_transform( fitted_pipeline, partitioned_flights$testing ) predictions %>% group_by(delayed, prediction) %>% tally() ## # Source: lazy query [?? x 3] ## # Database: spark_connection ## # Groups: delayed ## delayed prediction n ## <dbl> <dbl> <dbl> ## 1 0. 1. 51. ## 2 0. 0. 2599. ## 3 1. 0. 666. ## 4 1. 1. 69.

Save the pipelines to disk

The ml_save() command can be used to save the Pipeline and PipelineModel to disk. The resulting output is a folder with the selected name, which contains all of the necessary Scala scripts: ml_save( flights_pipeline, "flights_pipeline", overwrite = TRUE ) ## NULL ml_save( fitted_pipeline, "flights_model", overwrite = TRUE ) ## NULL

Use an existing PipelineModel

The ml_load() command can be used to re-load Pipelines and PipelineModels. The saved ML Pipeline files can only be loaded into an open Spark session. reloaded_model <- ml_load(sc, "flights_model") A simple query can be used as the table that will be used to make the new predictions. This, of course, does not have to be done in R; at this time, the “flights_model” can be loaded into an independent Spark session outside of R. new_df <- spark_flights %>% filter( month == 7, day == 5 ) ml_transform(reloaded_model, new_df) ## # Source: table<sparklyr_tmp_24041e052b5> [?? x 12] ## # Database: spark_connection ## dep_delay sched_dep_time month day distance delayed hours features ## <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <list> ## 1 39. 2359 m7 d5 1617. 1. 4. <dbl [43]> ## 2 141. 2245 m7 d5 2475. 1. 4. <dbl [43]> ## 3 0. 500 m7 d5 529. 0. 0. <dbl [43]> ## 4 -5. 536 m7 d5 1400. 0. 0. <dbl [43]> ## 5 -2. 540 m7 d5 1089. 0. 0. <dbl [43]> ## 6 -7. 545 m7 d5 1416. 0. 0. <dbl [43]> ## 7 -3. 545 m7 d5 1576. 0. 0. <dbl [43]> ## 8 -7. 600 m7 d5 1076. 0. 0. <dbl [43]> ## 9 -7. 600 m7 d5 96. 0. 0. <dbl [43]> ## 10 -6. 600 m7 d5 937. 0. 0. <dbl [43]> ## # ... with more rows, and 4 more variables: label <dbl>, ## # rawPrediction <list>, probability <list>, prediction <dbl>

Re-fit an existing Pipeline

First, reload the pipeline into an open Spark session: reloaded_pipeline <- ml_load(sc, "flights_pipeline") Use ml_fit() again to pass new data; in this case, sample_frac() is used instead of sdf_partition() to provide the new data. The idea is that the re-fitting would happen at a later date than when the model was initially fitted. new_model <- ml_fit(reloaded_pipeline, sample_frac(spark_flights, 0.01)) new_model ## PipelineModel (Transformer) with 5 stages ## <pipeline_24044e4f2e21> ## Stages ## |--1 SQLTransformer (Transformer) ## | <dplyr_transformer_2404e6a1b8e> ## | (Parameters -- Column Names) ## |--2 Binarizer (Transformer) ## | <binarizer_24045c9227f2> ## | (Parameters -- Column Names) ## | input_col: dep_delay ## | output_col: delayed ## |--3 Bucketizer (Transformer) ## | <bucketizer_240412366b1e> ## | (Parameters -- Column Names) ## | input_col: sched_dep_time ## | output_col: hours ## |--4 RFormulaModel (Transformer) ## | <r_formula_240442d75f00> ## | (Parameters -- Column Names) ## | features_col: features ## | label_col: label ## | (Transformer Info) ## | formula: chr "delayed ~ month + day + hours + distance" ## |--5 LogisticRegressionModel (Transformer) ## | <logistic_regression_24044321ad0> ## | (Parameters -- Column Names) ## | features_col: features ## | label_col: label ## | prediction_col: prediction ## | probability_col: probability ## | raw_prediction_col: rawPrediction ## | (Transformer Info) ## | coefficient_matrix: num [1, 1:43] 0.258 0.648 -0.317 0.36 -0.279 ... ## | coefficients: num [1:43] 0.258 0.648 -0.317 0.36 -0.279 ... ## | intercept: num -3.77 ## | intercept_vector: num -3.77 ## | num_classes: int 2 ## | num_features: int 43 ## | threshold: num 0.5 The new model can be saved using ml_save(). A new name is used in this case, but the same name as the existing PipelineModel could be used to replace it. ml_save(new_model, "new_flights_model", overwrite = TRUE) ## NULL Finally, the example is completed by closing the Spark session. spark_disconnect(sc)

Text mining with Spark & sparklyr

This article focuses on a set of functions that can be used for text mining with Spark and sparklyr. The main goal is to illustrate how to perform most of the data preparation and analysis with commands that will run inside the Spark cluster, as opposed to locally in R. Because of that, the amount of data used will be small.

Data source

For this example, there are two files that will be analyzed. They are both the full works of Sir Arthur Conan Doyle and Mark Twain. The files were downloaded from the Gutenberg Project site via the gutenbergr package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below to see how the data was downloaded and prepared. readLines("arthur_doyle.txt", 10) ## [1] "THE RETURN OF SHERLOCK HOLMES," ## [2] "" ## [3] "A Collection of Holmes Adventures" ## [4] "" ## [5] "" ## [6] "by Sir Arthur Conan Doyle" ## [7] "" ## [8] "" ## [9] "" ## [10] ""

Data Import

Connect to Spark

An additional goal of this article is to encourage the reader to try it out, so a simple Spark local mode session is used. library(sparklyr) library(dplyr) sc <- spark_connect(master = "local", version = "2.1.0")

spark_read_text()

spark_read_text() is a new function that works like readLines(), but for sparklyr. It comes in handy when unstructured data, such as lines in a book, is what is available for analysis. # Imports Mark Twain's file # Setting up the path to the file in a Windows OS laptop twain_path <- paste0("file:///", getwd(), "/mark_twain.txt") twain <- spark_read_text(sc, "twain", twain_path) # Imports Sir Arthur Conan Doyle's file doyle_path <- paste0("file:///", getwd(), "/arthur_doyle.txt") doyle <- spark_read_text(sc, "doyle", doyle_path)

Data transformation

The objective is to end up with a tidy table inside Spark with one row per word used. The steps will be:
    The needed data transformations apply to the data from both authors, so the data sets will be appended to one another
    Punctuation will be removed
    The words inside each line will be separated, or tokenized
    For a cleaner analysis, stop words will be removed
    To tidy the data, each word in a line will become its own row
    The results will be saved to Spark memory

sdf_bind_rows()

sdf_bind_rows() appends the doyle Spark Dataframe to the twain Spark Dataframe. This function can be used in lieu of a dplyr::bind_rows() wrapper function. For this exercise, the column author is added to differentiate between the two bodies of work. all_words <- doyle %>% mutate(author = "doyle") %>% sdf_bind_rows({ twain %>% mutate(author = "twain")}) %>% filter(nchar(line) > 0)

regexp_replace

The Hive UDF, regexp_replace, is used as a sort of gsub() that works inside Spark. In this case it is used to remove punctuation. The usual [:punct:] regular expression did not work well during development, so a custom list is provided. For more information, see the Hive Functions section in the dplyr page. all_words <- all_words %>% mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))

ft_tokenizer()

ft_tokenizer() uses the Spark API to separate each word. It creates a new list column with the results. all_words <- all_words %>% ft_tokenizer(input.col = "line", output.col = "word_list") head(all_words, 4) ## # Source: lazy query [?? x 3] ## # Database: spark_connection ## author line word_list ## <chr> <chr> <list> ## 1 doyle THE RETURN OF SHERLOCK HOLMES <list [5]> ## 2 doyle A Collection of Holmes Adventures <list [5]> ## 3 doyle by Sir Arthur Conan Doyle <list [5]> ## 4 doyle CONTENTS <list [1]>

ft_stop_words_remover()

ft_stop_words_remover() is a new function that, as its name suggests, takes care of removing stop words from the previous transformation. It expects a list column, so it is important to sequence it correctly after an ft_tokenizer() command. In the sample results, notice that the new wo_stop_words column contains fewer items than word_list. all_words <- all_words %>% ft_stop_words_remover(input.col = "word_list", output.col = "wo_stop_words") head(all_words, 4) ## # Source: lazy query [?? x 4] ## # Database: spark_connection ## author line word_list wo_stop_words ## <chr> <chr> <list> <list> ## 1 doyle THE RETURN OF SHERLOCK HOLMES <list [5]> <list [3]> ## 2 doyle A Collection of Holmes Adventures <list [5]> <list [3]> ## 3 doyle by Sir Arthur Conan Doyle <list [5]> <list [4]> ## 4 doyle CONTENTS <list [1]> <list [1]>

explode

The Hive UDF explode performs the job of unnesting the tokens into their own row. Some further filtering and field selection is done to reduce the size of the dataset. all_words <- all_words %>% mutate(word = explode(wo_stop_words)) %>% select(word, author) %>% filter(nchar(word) > 2) head(all_words, 4) ## # Source: lazy query [?? x 2] ## # Database: spark_connection ## word author ## <chr> <chr> ## 1 return doyle ## 2 sherlock doyle ## 3 holmes doyle ## 4 collection doyle

compute()

compute() will operate this transformation and cache the results in Spark memory. It is a good idea to pass a name to compute() to make it easier to identify it inside the Spark environment. In this case the name will be all_words all_words <- all_words %>% compute("all_words")

Full code

This is what the code would look like on an actual analysis: all_words <- doyle %>% mutate(author = "doyle") %>% sdf_bind_rows({ twain %>% mutate(author = "twain")}) %>% filter(nchar(line) > 0) %>% mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>% ft_tokenizer(input.col = "line", output.col = "word_list") %>% ft_stop_words_remover(input.col = "word_list", output.col = "wo_stop_words") %>% mutate(word = explode(wo_stop_words)) %>% select(word, author) %>% filter(nchar(word) > 2) %>% compute("all_words")

Data Analysis

Words used the most

word_count <- all_words %>% group_by(author, word) %>% tally() %>% arrange(desc(n)) word_count ## # Source: lazy query [?? x 3] ## # Database: spark_connection ## # Groups: author ## # Ordered by: desc(n) ## author word n ## <chr> <chr> <dbl> ## 1 twain one 20028 ## 2 doyle upon 16482 ## 3 twain would 15735 ## 4 doyle one 14534 ## 5 doyle said 13716 ## 6 twain said 13204 ## 7 twain could 11301 ## 8 doyle would 11300 ## 9 twain time 10502 ## 10 doyle man 10478 ## # ... with more rows

Words used by Doyle and not Twain

doyle_unique <- filter(word_count, author == "doyle") %>% anti_join(filter(word_count, author == "twain"), by = "word") %>% arrange(desc(n)) %>% compute("doyle_unique") doyle_unique ## # Source: lazy query [?? x 3] ## # Database: spark_connection ## # Groups: author ## # Ordered by: desc(n), desc(n) ## author word n ## <chr> <chr> <dbl> ## 1 doyle nigel 972 ## 2 doyle alleyne 500 ## 3 doyle ezra 421 ## 4 doyle maude 337 ## 5 doyle aylward 336 ## 6 doyle catinat 301 ## 7 doyle sharkey 281 ## 8 doyle lestrade 280 ## 9 doyle summerlee 248 ## 10 doyle congo 211 ## # ... with more rows doyle_unique %>% head(100) %>% collect() %>% with(wordcloud::wordcloud( word, n, colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9")))

Twain and Sherlock

The word cloud highlighted something interesting. The word lestrade is listed as one of the words used by Doyle but not Twain. Lestrade is the last name of a major character in the Sherlock Holmes books. It makes sense that the word “sherlock” appears considerably more times than “lestrade” in Doyle's books, so why is Sherlock not in the word cloud? Did Mark Twain use the word “sherlock” in his writings? all_words %>% filter(author == "twain", word == "sherlock") %>% tally() ## # Source: lazy query [?? x 1] ## # Database: spark_connection ## n ## <dbl> ## 1 16 The all_words table contains 16 instances of the word sherlock in the words used by Twain in his works. The instr Hive UDF is used to extract the lines that contain that word in the twain table. This Hive function can be used instead of base::grep() or stringr::str_detect(). To account for any word capitalization, the lower command will be used in mutate() to make all words in the full text lower case.

instr & lower

Most of these lines are in a short story by Mark Twain called A Double Barrelled Detective Story. As per the Wikipedia page about this story, this is a satire by Twain on the mystery novel genre, published in 1902. twain %>% mutate(line = lower(line)) %>% filter(instr(line, "sherlock") > 0) %>% pull(line) ## [1] "late sherlock holmes, and yet discernible by a member of a race charged" ## [2] "sherlock holmes." ## [3] "\"uncle sherlock! the mean luck of it!--that he should come just" ## [4] "another trouble presented itself. \"uncle sherlock 'll be wanting to talk" ## [5] "flint buckner's cabin in the frosty gloom. they were sherlock holmes and" ## [6] "\"uncle sherlock's got some work to do, gentlemen, that 'll keep him till" ## [7] "\"by george, he's just a duke, boys! three cheers for sherlock holmes," ## [8] "he brought sherlock holmes to the billiard-room, which was jammed with" ## [9] "of interest was there--sherlock holmes. the miners stood silent and" ## [10] "the room; the chair was on it; sherlock holmes, stately, imposing," ## [11] "\"you have hunted me around the world, sherlock holmes, yet god is my" ## [12] "\"if it's only sherlock holmes that's troubling you, you needn't worry" ## [13] "they sighed; then one said: \"we must bring sherlock holmes. he can be" ## [14] "i had small desire that sherlock holmes should hang for my deeds, as you" ## [15] "\"my name is sherlock holmes, and i have not been doing anything.\"" ## [16] "late sherlock holmes, and yet discernible by a member of a race charged" spark_disconnect(sc)

Appendix

gutenbergr package

This is an example of how the data for this article was pulled from the Gutenberg site: library(gutenbergr) gutenberg_works() %>% filter(author == "Twain, Mark") %>% pull(gutenberg_id) %>% gutenberg_download() %>% pull(text) %>% writeLines("mark_twain.txt")

Intro to Spark Streaming with sparklyr

The sparklyr interface

As stated in Spark’s official site, Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. Because it is part of the Spark API, it is possible to re-use query code that queries the current state of the stream, as well as joining the streaming data with historical data. Please see Spark’s official documentation for a deeper look into Spark Streaming. The sparklyr interface provides the following:
    Ability to run dplyr, SQL, spark_apply(), and PipelineModels against a stream
    Read in multiple formats: CSV, text, JSON, parquet, Kafka, JDBC, and orc
    Write stream results to Spark memory and the following file formats: CSV, text, JSON, parquet, Kafka, JDBC, and orc
    An out-of-the-box graph visualization to monitor the stream
    A new reactiveSpark() function that allows Shiny apps to poll the contents of the stream, i.e., to create Shiny apps that are able to read the contents of the stream

Interacting with a stream

A good way of looking at how Spark streams update is as a three stage operation:
    Input - Spark reads the data inside a given folder. The folder is expected to contain multiple data files, with new files being created containing the most current stream data.
    Processing - Spark applies the desired operations on top of the data. These operations could be data manipulations (dplyr, SQL), data transformations (sdf operations, PipelineModel predictions), or native R manipulations (spark_apply()).
    Output - The results of processing the input files are saved in a different folder.
As with all of the read and write operations in sparklyr for Spark Standalone, or in sparklyr’s local mode, the input and output folders are actual OS file system folders. For Hadoop clusters, these will be folder locations inside the HDFS.

Example 1 - Input/Output

The first intro example is a small script that can be used with a local master. The result should be that the stream_view() app shows, live, the number of records processed for each iteration of test data sent to the stream. library(future) library(sparklyr) sc <- spark_connect(master = "local", spark_version = "2.3.0") if(file.exists("source")) unlink("source", TRUE) if(file.exists("source-out")) unlink("source-out", TRUE) stream_generate_test(iterations = 1) read_folder <- stream_read_csv(sc, "source") write_output <- stream_write_csv(read_folder, "source-out") invisible(future(stream_generate_test(interval = 0.5))) stream_view(write_output)
stream_stop(write_output) spark_disconnect(sc)

Code breakdown

    Open the Spark connection: library(sparklyr) sc <- spark_connect(master = "local", spark_version = "2.3.0")
    Optional step. This resets the input and output folders. It makes it easier to run the code multiple times in a clean manner. if(file.exists("source")) unlink("source", TRUE) if(file.exists("source-out")) unlink("source-out", TRUE)
    Produces a single test file inside the "source" folder. This allows the "read" function to infer the CSV file definition. stream_generate_test(iterations = 1) list.files("source") [1] "stream_1.csv"
    Points the stream reader to the folder where the streaming files will be placed. Since it is primed with a single CSV file, it will use it as the expected layout of subsequent files. By default, stream_read_csv() creates a single integer variable data frame. read_folder <- stream_read_csv(sc, "source")
    The output writer is what starts the streaming job. It will start monitoring the input folder, and then write the new results in the "source-out" folder. So as new records stream in, new files will be created in the "source-out" folder. Since there are no operations on the incoming data at this time, the output files will have the same exact raw data as the input files. The only difference is that the files and sub folders within "source-out" will be structured the way Spark structures data folders. write_output <- stream_write_csv(read_folder, "source-out") list.files("source-out") [1] "_spark_metadata" "checkpoint" [3] "part-00000-1f29719a-2314-40e1-b93d-a647a3d57154-c000.csv"
    The test generation function will create 100 test files, one every 0.2 seconds. To run the tests "out-of-sync" with the current R session, the future package is used. library(future) invisible(future(stream_generate_test(interval = 0.2, iterations = 100)))
    The stream_view() function can be used before the 100 iterations are complete because of the use of the future package. It will monitor the status of the job that write_output is pointing to and provide information on the amount of data coming into the "source" folder and going out into the "source-out" folder. stream_view(write_output)
    The monitor will continue to run even after the tests are complete. To end the experiment, stop the Shiny app and then use the following to stop the stream and close the Spark session. stream_stop(write_output) spark_disconnect(sc)

Example 2 - Processing

The second example builds on the first. It adds a processing step that manipulates the input data before saving it to the output folder. In this case, a new binary field is added indicating if the value from x is over 400 or not. This time, while the stream tests run, execute the second code chunk in this example a few times to see the aggregated values change. library(future) library(sparklyr) library(dplyr, warn.conflicts = FALSE) sc <- spark_connect(master = "local", spark_version = "2.3.0") if(file.exists("source")) unlink("source", TRUE) if(file.exists("source-out")) unlink("source-out", TRUE) stream_generate_test(iterations = 1) read_folder <- stream_read_csv(sc, "source") process_stream <- read_folder %>% mutate(x = as.double(x)) %>% ft_binarizer( input_col = "x", output_col = "over", threshold = 400 ) write_output <- stream_write_csv(process_stream, "source-out") invisible(future(stream_generate_test(interval = 0.2, iterations = 100))) Run this code a few times during the experiment: spark_read_csv(sc, "stream", "source-out", memory = FALSE) %>% group_by(over) %>% tally() The results would look similar to this. The n totals will increase as the experiment progresses. # Source: lazy query [?? x 2] # Database: spark_connection over n <dbl> <dbl> 1 0 40215 2 1 60006 Clean up after the experiment: stream_stop(write_output) spark_disconnect(sc)

Code breakdown

    The processing starts with the read_folder variable that contains the input stream. It coerces the integer field x into type double, because the next function, ft_binarizer(), does not accept integers. The binarizer determines if x is over 400 or not. This is a good illustration of how dplyr can help simplify the manipulation needed during the processing stage. process_stream <- read_folder %>% mutate(x = as.double(x)) %>% ft_binarizer( input_col = "x", output_col = "over", threshold = 400 )
    The output now needs to write out the processed data instead of the raw input data. Swap read_folder with process_stream. write_output <- stream_write_csv(process_stream, "source-out")
    The "source-out" folder can be treated as if it were a single table within Spark. Using spark_read_csv(), the data can be mapped, but not brought into memory (memory = FALSE). This allows the current results to be further analyzed using regular dplyr commands. spark_read_csv(sc, "stream", "source-out", memory = FALSE) %>% group_by(over) %>% tally()

Example 3 - Aggregate in process and output to memory

Another option is to save the results of the processing into an in-memory Spark table. Unless intentionally saving it to disk, the table and its data will only exist while the Spark session is active. The biggest advantage of using Spark memory as the target is that it allows aggregation to happen during processing. This is an advantage because aggregation is not allowed for any file output, except Kafka, on the input/process stage. Using example 2 as the base, this example code will perform some aggregations on the current stream input and save only those summarized results into Spark memory: library(future) library(sparklyr) library(dplyr, warn.conflicts = FALSE) sc <- spark_connect(master = "local", spark_version = "2.3.0") if(file.exists("source")) unlink("source", TRUE) stream_generate_test(iterations = 1) read_folder <- stream_read_csv(sc, "source") process_stream <- read_folder %>% stream_watermark() %>% group_by(timestamp) %>% summarise( max_x = max(x, na.rm = TRUE), min_x = min(x, na.rm = TRUE), count = n() ) write_output <- stream_write_memory(process_stream, name = "stream") invisible(future(stream_generate_test())) Run this command a few times while the experiment is running: tbl(sc, "stream") Clean up after the experiment: stream_stop(write_output) spark_disconnect(sc)

Code breakdown

    The stream_watermark() function adds a new timestamp variable that is then used in the group_by() command. This is required by Spark Streaming to accept summarized results as the output of the stream. The second step is to simply decide what kinds of aggregations we need to perform. In this case, a simple max, min, and count are performed. process_stream <- read_folder %>% stream_watermark() %>% group_by(timestamp) %>% summarise( max_x = max(x, na.rm = TRUE), min_x = min(x, na.rm = TRUE), count = n() )
    The stream_write_memory() function is used to write the output to Spark memory. The results will appear as a table of the Spark session with the name assigned in the name argument; in this case the name selected is "stream". write_output <- stream_write_memory(process_stream, name = "stream")
    The current data in the "stream" table can be queried by using the dplyr tbl() command. tbl(sc, "stream")

Example 4 - Shiny integration

sparklyr provides a new Shiny function called reactiveSpark(). It can take a Spark data frame, in this case the one created as a result of the stream processing, and then creates a Spark memory stream table, the same way a table is created in example 3. library(future) library(sparklyr) library(dplyr, warn.conflicts = FALSE) library(ggplot2) sc <- spark_connect(master = "local", spark_version = "2.3.0") if(file.exists("source")) unlink("source", TRUE) if(file.exists("source-out")) unlink("source-out", TRUE) stream_generate_test(iterations = 1) read_folder <- stream_read_csv(sc, "source") process_stream <- read_folder %>% stream_watermark() %>% group_by(timestamp) %>% summarise( max_x = max(x, na.rm = TRUE), min_x = min(x, na.rm = TRUE), count = n() ) invisible(future(stream_generate_test(interval = 0.2, iterations = 100))) library(shiny) ui <- function(){ tableOutput("table") } server <- function(input, output, session){ ps <- reactiveSpark(process_stream) output$table <- renderTable({ ps() %>% mutate(timestamp = as.character(timestamp)) }) } runGadget(ui, server)

Code breakdown

    Notice that there is no stream_write_... command. The reason is that the reactiveSpark() function contains the stream_write_memory() function.
    This very basic Shiny app simply displays the output of a table in the ui section: library(shiny) ui <- function(){ tableOutput("table") }
    In the server section, the reactiveSpark() function will update every time there’s a change to the stream and return a data frame. The results are saved to a variable called ps() in this script. Treat the ps() variable as a regular table that can be piped from, as shown in the example. In this case, the timestamp variable is converted to a string to make it easier to read. server <- function(input, output, session){ ps <- reactiveSpark(process_stream) output$table <- renderTable({ ps() %>% mutate(timestamp = as.character(timestamp)) }) }
    Use runGadget() to display the Shiny app in the Viewer pane. This is optional; the app can be run using normal Shiny run functions. runGadget(ui, server)

Example 5 - ML Pipeline Model

This example uses a fitted Pipeline Model to process the input, and saves the predictions to the output. This approach would be used to apply Machine Learning on top of streaming data. library(sparklyr) library(dplyr, warn.conflicts = FALSE) sc <- spark_connect(master = "local", spark_version = "2.3.0") if(file.exists("source")) unlink("source", TRUE) if(file.exists("source-out")) unlink("source-out", TRUE) df <- data.frame(x = rep(1:1000), y = rep(2:1001)) stream_generate_test(df = df, iteration = 1) model_sample <- spark_read_csv(sc, "sample", "source") pipeline <- sc %>% ml_pipeline() %>% ft_r_formula(x ~ y) %>% ml_linear_regression() fitted_pipeline <- ml_fit(pipeline, model_sample) ml_stream <- stream_read_csv( sc = sc, path = "source", columns = c(x = "integer", y = "integer") ) %>% ml_transform(fitted_pipeline, .) %>% select(- features) %>% stream_write_csv("source-out") stream_generate_test(df = df, interval = 0.5) spark_read_csv(sc, "stream", "source-out", memory = FALSE) ### Source: spark<stream> [?? x 4] ## x y label prediction ## * <int> <int> <dbl> <dbl> ## 1 276 277 276 276. ## 2 277 278 277 277. ## 3 278 279 278 278. ## 4 279 280 279 279. ## 5 280 281 280 280. ## 6 281 282 281 281. ## 7 282 283 282 282. ## 8 283 284 283 283. ## 9 284 285 284 284. ##10 285 286 285 285. ### ... with more rows stream_stop(ml_stream) spark_disconnect(sc)

Code Breakdown

    Creates and fits a pipeline df <- data.frame(x = rep(1:1000), y = rep(2:1001)) stream_generate_test(df = df, iteration = 1) model_sample <- spark_read_csv(sc, "sample", "source") pipeline <- sc %>% ml_pipeline() %>% ft_r_formula(x ~ y) %>% ml_linear_regression() fitted_pipeline <- ml_fit(pipeline, model_sample) This example pipelines the input, process and output in a single code segment. The ml_transform() function is used to create the predictions. Because the CSV format does not support list type fields, the features column is removed before the results are sent to the output. ml_stream <- stream_read_csv( sc = sc, path = "source", columns = c(x = "integer", y = "integer") ) %>% ml_transform(fitted_pipeline, .) %>% select(- features) %>% stream_write_csv("source-out")

Using Spark with AWS S3 buckets

AWS Access Keys

AWS Access Keys are needed to access S3 data. To learn how to set up new keys, please review the AWS documentation: http://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html. We then pass the keys to R via environment variables: Sys.setenv(AWS_ACCESS_KEY_ID="[Your access key]") Sys.setenv(AWS_SECRET_ACCESS_KEY="[Your secret access key]")
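As a side note, hard-coding keys inside scripts makes them easy to leak. One alternative, sketched here, is to place the same variables in the ~/.Renviron file so that they are set automatically for every R session:

# contents of ~/.Renviron (read when R starts; not an R script)
# AWS_ACCESS_KEY_ID=[Your access key]
# AWS_SECRET_ACCESS_KEY=[Your secret access key]

# confirm the variables are visible to the current session
Sys.getenv("AWS_ACCESS_KEY_ID")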

Connecting to Spark

There are four key settings needed to connect to Spark and use S3: a Hadoop-AWS package, executor memory (key but not critical), the master URL, and the Spark home. To connect to Spark, we first need to initialize a variable with the contents of sparklyr's default config (spark_config()), which we will then customize for our needs: library(sparklyr) conf <- spark_config()

Hadoop-AWS package:

A Spark connection can be enhanced by using packages; please note that these are not R packages. For example, there are packages that tell Spark how to read CSV files, or how to work with Hadoop and Hadoop in AWS. In order to read S3 buckets, our Spark connection will need a package called hadoop-aws. If needed, multiple packages can be used. We experimented with many combinations of packages, and determined that for reading data in S3 we only need this one. The version we used, 2.7.3, refers to the Hadoop version it was built for (the latest at the time of writing), so as this article ages, please make sure to check this site to ensure that you are using the latest version: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
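If more than one package is needed, sparklyr.defaultPackages accepts a character vector. As a hypothetical sketch, a connection that also needs the spark-csv package for an older Spark version could be configured as:

conf$sparklyr.defaultPackages <- c(
  "org.apache.hadoop:hadoop-aws:2.7.3",   # S3 support
  "com.databricks:spark-csv_2.11:1.3.0"   # CSV reader for Spark 1.x
)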

Executor Memory

As mentioned above, this setting is key but not critical. There are two points worth highlighting about it: It is the main performance-related setting that can be tweaked in a Spark Standalone cluster, and because Spark defaults to a fraction of what is available, in most cases we need to increase it by manually passing a value to the setting. If more than the available RAM is requested, then Spark will set the cores to 0, thus rendering the session unusable. conf$spark.executor.memory <- "14g"

Master URL and Spark home

There are three important points to mention when executing the spark_connect command:
    The master will be the Spark Master’s URL. To find the URL, please see the Spark Cluster section. Point the Spark home to the location where Spark was installed on this node. Make sure to pass the conf variable as the value for the config argument.
sc <- spark_connect(master = "spark://ip-172-30-1-5.us-west-2.compute.internal:7077", spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/", config = conf)

Data Import/Wrangle approach

We experimented with multiple approaches. The decision to settle on a recommended approach was based mostly on the speed of each step. The premise is that we would rather wait longer during Data Import if it meant that we could register and cache our data subsets much faster during Data Wrangling, especially since we would expect to end up with many subsets as we explore and model. The selected combination was the second slowest during the Import stage, but the fastest when caching a subset, by a lot. In our tests, it took 72 seconds to read and cache the 29 columns of the 41 million rows of data; the slowest was 77 seconds. But when it comes to registering and caching a considerably sizable subset of 3 columns and almost all of the 41 million records, this approach was 17X faster than the second fastest approach. It took 1/3 of a second to register and cache the subset; the second fastest was 5 seconds. To implement this approach, we need to set three arguments in the spark_read_csv() step: memory, infer_schema and columns. Again, this is a recommended approach. The columns argument is needed only if infer_schema is set to FALSE. Setting memory to TRUE makes Spark load the entire dataset into memory, and setting infer_schema to FALSE prevents Spark from trying to figure out what the schema of the files is. By trying different combinations of the memory and infer_schema arguments you may be able to find an approach that better fits your needs.
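To make the arguments concrete, here is a minimal sketch of both ends of the trade-off; the bucket path and column list are placeholders, not the flights data used below:

# eager load with a supplied schema: slower import, much faster caching of subsets later
eager_tbl <- spark_read_csv(sc, "data_eager",
  path = "s3a://my-bucket/data",
  memory = TRUE,
  infer_schema = FALSE,
  columns = list(x = "character", y = "character"))

# lazy mapping with schema inference: Spark scans the files to guess the types and
# does not cache the data up front
lazy_tbl <- spark_read_csv(sc, "data_lazy",
  path = "s3a://my-bucket/data",
  memory = FALSE,
  infer_schema = TRUE)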

The S3 URI scheme

Surprisingly, another critical detail that can easily be overlooked is choosing the right s3 URI scheme. There are two options: s3n and s3a. In most examples and tutorials I found, there was no reason given for why or when to use which one. The article that finally clarified it was this one: https://wiki.apache.org/hadoop/AmazonS3 The gist of it is that s3a is the recommended scheme going forward, especially for Hadoop versions 2.7 and above. This means that if we copy from older examples that used Hadoop 2.6 we would most likely also use s3n, making data import much, much slower.

Data Import

After the long introduction in the previous section, there is only one point to add about the following code chunk. If there are any NA values in numeric fields, then define the column as character and then convert it on later subsets using dplyr. The data import will fail if it finds any NA values on numeric fields. This is a small trade off in this approach because the next fastest one does not have this issue but is 17X slower at caching subsets. flights <- spark_read_csv(sc, "flights_spark", path = "s3a://flights-data/full", memory = TRUE, columns = list( Year = "character", Month = "character", DayofMonth = "character", DayOfWeek = "character", DepTime = "character", CRSDepTime = "character", ArrTime = "character", CRSArrTime = "character", UniqueCarrier = "character", FlightNum = "character", TailNum = "character", ActualElapsedTime = "character", CRSElapsedTime = "character", AirTime = "character", ArrDelay = "character", DepDelay = "character", Origin = "character", Dest = "character", Distance = "character", TaxiIn = "character", TaxiOut = "character", Cancelled = "character", CancellationCode = "character", Diverted = "character", CarrierDelay = "character", WeatherDelay = "character", NASDelay = "character", SecurityDelay = "character", LateAircraftDelay = "character"), infer_schema = FALSE)

Data Wrangle

There are a few points we need to highlight about the following simple dplyr code: Because there were NAs in the original fields, we have to mutate them to a number. Try coercing variables to integer instead of numeric; this will save a lot of space when cached to Spark memory. The sdf_register command can be piped at the end of the code. After running the code, a new table will appear in the RStudio IDE’s Spark tab. tidy_flights <- tbl(sc, "flights_spark") %>% mutate(ArrDelay = as.integer(ArrDelay), DepDelay = as.integer(DepDelay), Distance = as.integer(Distance)) %>% filter(!is.na(ArrDelay)) %>% select(DepDelay, ArrDelay, Distance) %>% sdf_register("tidy_spark") Afterwards, we use tbl_cache() to load the tidy_spark table into Spark memory. We can see the new table in the Storage page of our Spark session. tbl_cache(sc, "tidy_spark")

Using Apache Arrow

Introduction

Apache Arrow is a cross-language development platform for in-memory data. Arrow is supported starting with sparklyr 1.0.0 to improve performance when transferring data between Spark and R. You can find some performance benchmarks under: sparklyr 1.0: Arrow, XGBoost, Broom and TFRecords. Speeding up R and Apache Spark using Apache Arrow.

Installation

Using Arrow from R requires installing: The Arrow Runtime: Provides a cross-language runtime library. The Arrow R Package: Provides support for using Arrow from R through an R package.

Runtime

OS X

Installing from OS X requires Homebrew and executing from a terminal: brew install apache-arrow

Windows

Currently, installing Arrow in Windows requires Conda and executing from a terminal: conda install arrow-cpp=0.12.* -c conda-forge conda install pyarrow=0.12.* -c conda-forge

Linux

Please reference arrow.apache.org/install when installing Arrow for Linux.

Package

As of this writing, the arrow R package is not yet available on CRAN; however, it can be installed using the remotes package. First, install remotes: install.packages("remotes") Then install the R package from GitHub as follows: remotes::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.12.0") If you happen to have the Arrow 0.11 runtime installed, install the matching R package instead: remotes::install_github("apache/arrow", subdir = "r", ref = "dc5df8f")

Use Cases

There are three main use cases for arrow in sparklyr: Data Copying: When copying data with copy_to(), Arrow will be used. Data Collection: When collecting data, either implicitly by printing datasets or explicitly by calling collect(), Arrow will also be used. R Transformations: When using spark_apply(), data will be transferred using Arrow when possible. To use arrow in sparklyr one simply needs to attach the library: library(arrow) Attaching package: ‘arrow’ The following object is masked from ‘package:utils’: timestamp The following objects are masked from ‘package:base’: array, table
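A minimal sketch of the three paths, assuming a local connection and the built-in mtcars data; with arrow attached, the same calls transparently transfer data through Arrow:

library(arrow)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# 1. data copying: R -> Spark
cars_tbl <- copy_to(sc, mtcars, "mtcars_arrow", overwrite = TRUE)

# 2. data collection: Spark -> R
cars_local <- collect(cars_tbl)

# 3. R transformations: rows are serialized with Arrow when possible
doubled <- spark_apply(cars_tbl, function(df) df * 2)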

Considerations

Types

Some data types are mapped to slightly different, one can argue more correct, types when using Arrow. For instance, consider collecting 64 bit integers in sparklyr: library(sparklyr) sc <- spark_connect(master = "local") integer64 <- sdf_len(sc, 2, type = "integer64") integer64 # Source: spark<?> [?? x 1] id <dbl> 1 1 2 2 Notice that sparklyr collects 64 bit integers as double; however, using arrow: library(arrow) integer64 # Source: spark<?> [?? x 1] id <S3: integer64> 1 1 2 2 64 bit integers are now being collected as proper 64 bit integer using the bit64 package.

Fallback

The Arrow R package supports many data types; however, in cases where a type is unsupported, sparklyr will fall back to not using arrow and print a warning. library(sparklyr.nested) library(sparklyr) library(dplyr) library(arrow) sc <- spark_connect(master = "local") cars <- copy_to(sc, mtcars) sdf_nest(cars, hp) %>% group_by(cyl) %>% summarize(data = collect_list(data)) # Source: spark<?> [?? x 2] cyl data <dbl> <list> 1 6 <list [7]> 2 4 <list [11]> 3 8 <list [14]> Warning message: In arrow_enabled_object.spark_jobj(sdf) : Arrow disabled due to columns: data

Creating Extensions for sparklyr

Introduction

The sparklyr package provides a dplyr interface to Spark DataFrames as well as an R interface to Spark’s distributed machine learning pipelines. However, since Spark is a general-purpose cluster computing system there are many other R interfaces that could be built (e.g. interfaces to custom machine learning pipelines, interfaces to 3rd party Spark packages, etc.). The facilities used internally by sparklyr for its dplyr and machine learning interfaces are available to extension packages. This guide describes how you can use these tools to create your own custom R interfaces to Spark.

Examples

Here’s an example of an extension function that calls the text file line counting function available via the SparkContext: library(sparklyr) count_lines <- function(sc, file) { spark_context(sc) %>% invoke("textFile", file, 1L) %>% invoke("count") } The count_lines function takes a spark_connection (sc) argument which enables it to obtain a reference to the SparkContext object, and in turn call the textFile().count() method. You can use this function with an existing sparklyr connection as follows: library(sparklyr) sc <- spark_connect(master = "local") count_lines(sc, "hdfs://path/data.csv") Here are links to some additional examples of extension packages:
Package Description
spark.sas7bdat Read in SAS data in parallel into Apache Spark.
rsparkling Extension for using H2O machine learning algorithms against Spark Data Frames.
sparkhello Simple example of including a custom JAR file within an extension package.
rddlist Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
sparkwarc Load WARC files into Apache Spark with sparklyr.
sparkavro Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro
crassy Connect to Cassandra with sparklyr using the Spark-Cassandra-Connector.
sparklygraphs R interface for GraphFrames which aims to provide the functionality of GraphX.
sparklyr.nested Extension for working with nested data.
sparklyudf Simple example registering a Scala UDF within an extension package.

Core Types

Three classes are defined for representing the fundamental types of the R to Java bridge:
Function Description
spark_connection Connection between R and the Spark shell process
spark_jobj Instance of a remote Spark object
spark_dataframe Instance of a remote Spark DataFrame object
S3 methods are defined for each of these classes so they can be easily converted to or from objects that contain or wrap them. Note that for any given spark_jobj it’s possible to discover the underlying spark_connection.
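For example, here is a small sketch of these conversions; the spark_connection() method recovers the owning connection from a jobj, and spark_dataframe() extracts the underlying DataFrame reference from a dplyr tbl:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# a remote Java object; its connection can always be recovered
big_int <- invoke_new(sc, "java.math.BigInteger", "1000000000")
spark_connection(big_int)           # the connection that owns the object

# a dplyr tbl wraps a Spark DataFrame; spark_dataframe() returns the jobj
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_ext", overwrite = TRUE)
spark_dataframe(mtcars_tbl) %>% invoke("count")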

Calling Spark from R

There are several functions available for calling the methods of Java objects and static methods of Java classes:
Function Description
invoke Call a method on an object
invoke_new Create a new object by invoking a constructor
invoke_static Call a static method of a class
For example, to create a new instance of the java.math.BigInteger class and then call the longValue() method on it you would use code like this: billionBigInteger <- invoke_new(sc, "java.math.BigInteger", "1000000000") billion <- invoke(billionBigInteger, "longValue") Note the sc argument: that’s the spark_connection object which is provided by the front-end package (e.g. sparklyr). The previous example can be re-written to be more compact and clear using magrittr pipes: billion <- sc %>% invoke_new("java.math.BigInteger", "1000000000") %>% invoke("longValue") This syntax is similar to the method-chaining syntax often used with Scala code so is generally preferred. Calling a static method of a class is also straightforward. For example, to call the Math::hypot() static function you would use this code: hypot <- sc %>% invoke_static("java.lang.Math", "hypot", 10, 20)

Wrapper Functions

Creating an extension typically consists of writing R wrapper functions for a set of Spark services. In this section we’ll describe the typical form of these functions as well as how to handle special types like Spark DataFrames. Here’s the wrapper function for textFile().count() which we defined earlier: count_lines <- function(sc, file) { spark_context(sc) %>% invoke("textFile", file, 1L) %>% invoke("count") } The count_lines function takes a spark_connection (sc) argument which enables it to obtain a reference to the SparkContext object, and in turn call the textFile().count() method. The following functions are useful for implementing wrapper functions of various kinds:
Function Description
spark_connection Get the Spark connection associated with an object (S3)
spark_jobj Get the Spark jobj associated with an object (S3)
spark_dataframe Get the Spark DataFrame associated with an object (S3)
spark_context Get the SparkContext for a spark_connection
hive_context Get the HiveContext for a spark_connection
spark_version Get the version of Spark (as a numeric_version) for a spark_connection
The use of these functions is illustrated in this simple example: analyze <- function(x, features) { # normalize whatever we were passed (e.g. a dplyr tbl) into a DataFrame df <- spark_dataframe(x) # get the underlying connection so we can create new objects sc <- spark_connection(df) # create an object to do the analysis and call its `analyze` and `summary` # methods (note that the df and features are passed to the analyze function) summary <- sc %>% invoke_new("com.example.tools.Analyzer") %>% invoke("analyze", df, features) %>% invoke("summary") # return the results summary } The first argument is an object that can be accessed using the Spark DataFrame API (this might be an actual reference to a DataFrame or could rather be a dplyr tbl which has a DataFrame reference inside it). After using the spark_dataframe function to normalize the reference, we extract the underlying Spark connection associated with the data frame using the spark_connection function. Finally, we create a new Analyzer object, call its analyze method with the DataFrame and list of features, and then call the summary method on the results of the analysis. Accepting a spark_jobj or spark_dataframe as the first argument of a function makes it very easy to incorporate into magrittr pipelines so this pattern is highly recommended when possible.

Dependencies

When creating R packages which implement interfaces to Spark you may need to include additional dependencies. Your dependencies might be a set of Spark Packages or might be a custom JAR file. In either case, you’ll need a way to specify that these dependencies should be included during the initialization of a Spark session. A Spark dependency is defined using the spark_dependency function:
Function Description
spark_dependency Define a Spark dependency consisting of JAR files and Spark packages
Your extension package can specify its dependencies by implementing a function named spark_dependencies within the package (this function should not be publicly exported). For example, let’s say you were creating an extension package named sparkds that needs to include a custom JAR as well as the Redshift and Apache Avro packages: spark_dependencies <- function(spark_version, scala_version, ...) { spark_dependency( jars = c( system.file( sprintf("java/sparkds-%s-%s.jar", spark_version, scala_version), package = "sparkds" ) ), packages = c( sprintf("com.databricks:spark-redshift_%s:0.6.0", scala_version), sprintf("com.databricks:spark-avro_%s:2.0.1", scala_version) ) ) } .onLoad <- function(libname, pkgname) { sparklyr::register_extension(pkgname) } The spark_version argument is provided so that a package can support multiple Spark versions for its JARs. Note that the argument will include just the major and minor versions (e.g. 1.6 or 2.0) and will not include the patch level (as JARs built for a given major/minor version are expected to work for all patch levels). The scala_version argument is provided so that a single package can support multiple Scala compiler versions for its JARs and packages (currently Spark 1.6 downloadable binaries are compiled with Scala 2.10 and Spark 2.0 downloadable binaries are compiled with Scala 2.11). The ... argument is unused but nevertheless should be included to ensure compatibility if new arguments are added to spark_dependencies in the future. The .onLoad function registers your extension package so that its spark_dependencies function will be automatically called when new connections to Spark are made via spark_connect: library(sparklyr) library(sparkds) sc <- spark_connect(master = "local")

Compiling JARs

The sparklyr package includes a utility function (compile_package_jars) that will automatically compile a JAR file from your Scala source code for the required permutations of Spark and Scala compiler versions. To use the function just invoke it from the root directory of your R package as follows: sparklyr::compile_package_jars() Note that a prerequisite to calling compile_package_jars is the installation of the Scala 2.10 and 2.11 compilers to one of the following paths: /opt/scala /opt/local/scala /usr/local/scala ~/scala (Windows-only) See the sparkhello repository for a complete example of including a custom JAR within an extension package.

CRAN

When including a JAR file within an R package distributed on CRAN, you should follow the guidelines provided in Writing R Extensions:
Java code is a special case: except for very small programs, .java files should be byte-compiled (to a .class file) and distributed as part of a .jar file: the conventional location for the .jar file(s) is inst/java. It is desirable (and required under an Open Source license) to make the Java source files available: this is best done in a top-level java directory in the package – the source files should not be installed.

Data Types

The ensure_* family of functions can be used to enforce specific data types that are passed to a Spark routine. For example, Spark routines that require an integer will not accept an R numeric element. Use these functions to ensure certain parameters are scalar integers, scalar doubles, and so on: ensure_scalar_integer ensure_scalar_double ensure_scalar_boolean ensure_scalar_character (a short usage sketch follows the table below). In order to match the correct data types while calling Scala code from R, or retrieving results from Scala back to R, consider the following types mapping:
From R Scala To R
NULL void NULL
integer Int integer
character String character
logical Boolean logical
double Double double
numeric Double double
Float double
Decimal double
Long double
raw Array[Byte] raw
Date Date Date
POSIXlt Time
POSIXct Time POSIXct
list Array[T] list
environment Map[String, T]
jobj Object jobj
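As a small sketch of how the ensure_* helpers might be used, here is a variant of the earlier count_lines() wrapper that validates its inputs before invoking Scala; the partitions argument is added here purely for illustration:

library(sparklyr)

count_lines <- function(sc, file, partitions = 1) {
  # coerce/validate the arguments so the Scala side receives the expected types
  file <- ensure_scalar_character(file)
  partitions <- ensure_scalar_integer(partitions)

  spark_context(sc) %>%
    invoke("textFile", file, partitions) %>%
    invoke("count")
}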

Compiling

Most Spark extensions won’t need to define their own compilation specification, and can instead rely on the default behavior of compile_package_jars. For users who would like to take more control over where the scalac compilers should be looked up, use the spark_compilation_spec() function. The Spark compilation specification is used when compiling Spark extension Java Archives, and defines which versions of Spark, as well as which versions of Scala, should be used for compilation.
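A hedged sketch of what a custom specification might look like; the paths, versions, and JAR name below are placeholders that would need to match your local Scala installation and package:

library(sparklyr)

# compile only against Spark 2.3 / Scala 2.11, using a non-standard scalac location
spec <- spark_compilation_spec(
  spark_version = "2.3.0",
  scalac_path   = "/opt/scala-2.11.8/bin/scalac",
  jar_name      = "sparkds-2.3-2.11.jar"
)

compile_package_jars(spec = spec)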

Sparkling Water (H2O) Machine Learning

Overview

The rsparkling extension package provides bindings to H2O's distributed machine learning algorithms via sparklyr. In particular, rsparkling allows you to access the machine learning routines provided by the Sparkling Water Spark package. Together with sparklyr's dplyr interface, you can easily create and tune H2O machine learning workflows on Spark, orchestrated entirely within R. rsparkling provides a few simple conversion functions that allow the user to transfer data between Spark DataFrames and H2O Frames. Once the Spark DataFrames are available as H2O Frames, the h2o R interface can be used to train H2O machine learning algorithms on the data. A typical machine learning pipeline with rsparkling might be composed of the following stages. To fit a model, you might need to:
    Perform SQL queries through the sparklyr dplyr interface, Use the sdf_* and ft_* family of functions to generate new columns, or partition your data set, Convert your training, validation and/or test data frames into H2O Frames using the as_h2o_frame function, Choose an appropriate H2O machine learning algorithm to model your data, Inspect the quality of your model fit, and use it to make predictions with new data.

Installation

You can install the rsparkling package from CRAN as follows: install.packages("rsparkling") Then set the Sparkling Water version for rsparkling: options(rsparkling.sparklingwater.version = "2.1.14") For Spark 2.0.x set rsparkling.sparklingwater.version to 2.0.3 instead; for Spark 1.6.2 use 1.6.8.

Using H2O

Now let's walk through a simple example to demonstrate the use of H2O's machine learning algorithms within R. We'll use h2o.glm to fit a linear regression model. Using the built-in mtcars dataset, we'll try to predict a car's fuel consumption (mpg) based on its weight (wt), and the number of cylinders the engine contains (cyl). First, we will initialize a local Spark connection, and copy the mtcars dataset into Spark. library(rsparkling) library(sparklyr) library(h2o) library(dplyr) sc <- spark_connect("local", version = "2.1.0") mtcars_tbl <- copy_to(sc, mtcars, "mtcars") Now, let's perform some simple transformations – we'll
    Remove all cars with horsepower less than 100, Produce a column encoding whether a car has 8 cylinders or not, Partition the data into separate training and test data sets, Fit a model to our training data set, Evaluate our predictive performance on our test dataset.
# transform our data set, and then partition into 'training', 'test' partitions <- mtcars_tbl %>% filter(hp >= 100) %>% mutate(cyl8 = cyl == 8) %>% sdf_partition(training = 0.5, test = 0.5, seed = 1099) Now, we convert our training and test sets into H2O Frames using rsparkling conversion functions. We have already split the data into training and test frames using dplyr. training <- as_h2o_frame(sc, partitions$training, strict_version_check = FALSE) test <- as_h2o_frame(sc, partitions$test, strict_version_check = FALSE) Alternatively, we can use the h2o.splitFrame() function instead of sdf_partition() to partition the data within H2O instead of Spark (e.g. partitions <- h2o.splitFrame(as_h2o_frame(mtcars_tbl), 0.5)) # fit a linear model to the training dataset glm_model <- h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = training, lambda_search = TRUE) For linear regression models produced by H2O, we can use either print() or summary() to learn a bit more about the quality of our fit. The summary() method returns some extra information about scoring history and variable importance. glm_model ## Model Details: ## ============== ## ## H2ORegressionModel: glm ## Model ID: GLM_model_R_1510348062048_1 ## GLM Model: summary ## family link regularization ## 1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.05468 ) ## lambda_search ## 1 nlambda = 100, lambda.max = 5.4682, lambda.min = 0.05468, lambda.1se = -1.0 ## number_of_predictors_total number_of_active_predictors ## 1 2 2 ## number_of_iterations training_frame ## 1 100 frame_rdd_32_929e407384e0082416acd4c9897144a0 ## ## Coefficients: glm coefficients ## names coefficients standardized_coefficients ## 1 Intercept 32.997281 16.625000 ## 2 cyl -0.906688 -1.349195 ## 3 wt -2.712562 -2.282649 ## ## H2ORegressionMetrics: glm ## ** Reported on training data. ** ## ## MSE: 2.03293 ## RMSE: 1.425808 ## MAE: 1.306314 ## RMSLE: 0.08238032 ## Mean Residual Deviance : 2.03293 ## R^2 : 0.8265696 ## Null Deviance :93.775 ## Null D.o.F. :7 ## Residual Deviance :16.26344 ## Residual D.o.F. :5 ## AIC :36.37884 The output suggests that our model is a fairly good fit, and that both a cars weight, as well as the number of cylinders in its engine, will be powerful predictors of its average fuel consumption. (The model suggests that, on average, heavier cars consume more fuel.) Let's use our H2O model fit to predict the average fuel consumption on our test data set, and compare the predicted response with the true measured fuel consumption. We'll build a simple ggplot2 plot that will allow us to inspect the quality of our predictions. library(ggplot2) # compute predicted values on our test dataset pred <- h2o.predict(glm_model, newdata = test) # convert from H2O Frame to Spark DataFrame predicted <- as_spark_dataframe(sc, pred, strict_version_check = FALSE) # extract the true 'mpg' values from our test dataset actual <- partitions$test %>% select(mpg) %>% collect() %>% `[[`("mpg") # produce a data.frame housing our predicted + actual 'mpg' values data <- data.frame( predicted = predicted, actual = actual ) # a bug in data.frame does not set colnames properly; reset here names(data) <- c("predicted", "actual") # plot predicted vs. actual values ggplot(data, aes(x = actual, y = predicted)) + geom_abline(lty = "dashed", col = "red") + geom_point() + theme(plot.title = element_text(hjust = 0.5)) + coord_fixed(ratio = 1) + labs( x = "Actual Fuel Consumption", y = "Predicted Fuel Consumption", title = "Predicted vs. 
Actual Fuel Consumption" ) Although simple, our model appears to do a fairly good job of predicting a car's average fuel consumption. As you can see, we can easily and effectively combine dplyr data transformation pipelines with the machine learning algorithms provided by H2O's Sparkling Water.

Algorithms

Once the H2OContext is made available to Spark (as demonstrated below), all of the functions in the standard h2o R interface can be used with H2O Frames (converted from Spark DataFrames). Here is a table of the available algorithms:
Function Description
h2o.glm Generalized Linear Model
h2o.deeplearning Multilayer Perceptron
h2o.randomForest Random Forest
h2o.gbm Gradient Boosting Machine
h2o.naiveBayes Naive-Bayes
h2o.prcomp Principal Components Analysis
h2o.svd Singular Value Decomposition
h2o.glrm Generalized Low Rank Model
h2o.kmeans K-Means Clustering
h2o.anomaly Anomaly Detection via Deep Learning Autoencoder
Additionally, the h2oEnsemble R package can be used to generate Super Learner ensembles of H2O algorithms:
Function Description
h2o.ensemble Super Learner / Stacking
h2o.stack Super Learner / Stacking

Transformers

A model is often fit not on a dataset as-is, but instead on some transformation of that dataset. Spark provides feature transformers, facilitating many common transformations of data within a Spark DataFrame, and sparklyr exposes these within the ft_* family of functions. Transformers can be used on Spark DataFrames, and the final training set can be sent to the H2O cluster for machine learning.
Function Description
ft_binarizer Threshold numerical features to binary (0/1) feature
ft_bucketizer Bucketizer transforms a column of continuous features to a column of feature buckets
ft_discrete_cosine_transform Transforms a length N real-valued sequence in the time domain into another length N real-valued sequence in the frequency domain
ft_elementwise_product Multiplies each input vector by a provided weight vector, using element-wise multiplication.
ft_index_to_string Maps a column of label indices back to a column containing the original labels as strings
ft_quantile_discretizer Takes a column with continuous features and outputs a column with binned categorical features
ft_sql_transformer Implements the transformations which are defined by a SQL statement
ft_string_indexer Encodes a string column of labels to a column of label indices
ft_vector_assembler Combines a given list of columns into a single vector column
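As an illustrative sketch of chaining a couple of these transformers before handing the data to H2O (column names follow the Spark copy of iris used in the next section, and the rsparkling setup from the Installation section is assumed):

library(rsparkling)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.1.0")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# index the label and binarize one continuous feature on the Spark side
prepared_tbl <- iris_tbl %>%
  ft_string_indexer("Species", "species_idx") %>%
  ft_binarizer("Sepal_Length", "long_sepal", threshold = 5.8)

# the transformed Spark DataFrame can then be converted into an H2O Frame
prepared_hf <- as_h2o_frame(sc, prepared_tbl, strict_version_check = FALSE)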

Examples

We will use the iris data set to examine a handful of learning algorithms and transformers. The iris data set measures attributes for 150 flowers in 3 different species of iris. iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE) iris_tbl ## # Source: table<iris> [?? x 5] ## # Database: spark_connection ## Sepal_Length Sepal_Width Petal_Length Petal_Width Species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5.0 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## # ... with more rows Convert to an H2O Frame: iris_hf <- as_h2o_frame(sc, iris_tbl, strict_version_check = FALSE)

K-Means Clustering

Use H2O's K-means clustering to partition a dataset into groups. K-means clustering partitions points into k groups, such that the sum of squares from points to the assigned cluster centers is minimized. kmeans_model <- h2o.kmeans(training_frame = iris_hf, x = 3:4, k = 3, seed = 1) To look at particular metrics of the K-means model, we can use h2o.centroid_stats() and h2o.centers() or simply print out all the model metrics using print(kmeans_model). # print the cluster centers h2o.centers(kmeans_model) ## petal_length petal_width ## 1 1.462000 0.24600 ## 2 5.566667 2.05625 ## 3 4.296154 1.32500 # print the centroid statistics h2o.centroid_stats(kmeans_model) ## Centroid Statistics: ## centroid size within_cluster_sum_of_squares ## 1 1 50.00000 1.41087 ## 2 2 48.00000 9.29317 ## 3 3 52.00000 7.20274

PCA

Use H2O's Principal Components Analysis (PCA) to perform dimensionality reduction. PCA is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. pca_model <- h2o.prcomp(training_frame = iris_hf, x = 1:4, k = 4, seed = 1) ## Warning in doTryCatch(return(expr), name, parentenv, handler): _train: ## Dataset used may contain fewer number of rows due to removal of rows with ## NA/missing values. If this is not desirable, set impute_missing argument in ## pca call to TRUE/True/true/... depending on the client language. pca_model ## Model Details: ## ============== ## ## H2ODimReductionModel: pca ## Model ID: PCA_model_R_1510348062048_3 ## Importance of components: ## pc1 pc2 pc3 pc4 ## Standard deviation 7.861342 1.455041 0.283531 0.154411 ## Proportion of Variance 0.965303 0.033069 0.001256 0.000372 ## Cumulative Proportion 0.965303 0.998372 0.999628 1.000000 ## ## ## H2ODimReductionMetrics: pca ## ## No model metrics available for PCA

Random Forest

Use H2O's Random Forest to perform regression or classification on a dataset. We will continue to use the iris dataset as an example for this problem. As usual, we define the response and predictor variables using the x and y arguments. Since we'd like to do a classification, we need to ensure that the response column is encoded as a factor (enum) column. y <- "Species" x <- setdiff(names(iris_hf), y) iris_hf[,y] <- as.factor(iris_hf[,y]) We can split the iris_hf H2O Frame into a train and test set (the split defaults to 75/25 train/test). splits <- h2o.splitFrame(iris_hf, seed = 1) Then we can train a Random Forest model: rf_model <- h2o.randomForest(x = x, y = y, training_frame = splits[[1]], validation_frame = splits[[2]], nbins = 32, max_depth = 5, ntrees = 20, seed = 1) Since we passed a validation frame, the validation metrics will be calculated. We can retrieve individual metrics using functions such as h2o.mse(rf_model, valid = TRUE). The confusion matrix can be printed using the following: h2o.confusionMatrix(rf_model, valid = TRUE) ## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class ## setosa versicolor virginica Error Rate ## setosa 7 0 0 0.0000 = 0 / 7 ## versicolor 0 13 0 0.0000 = 0 / 13 ## virginica 0 1 10 0.0909 = 1 / 11 ## Totals 7 14 10 0.0323 = 1 / 31 To view the variable importance computed from an H2O model, you can use either the h2o.varimp() or h2o.varimp_plot() functions: h2o.varimp_plot(rf_model)

Gradient Boosting Machine

The Gradient Boosting Machine (GBM) is one of H2O's most popular algorithms, as it works well on many types of data. We will continue to use the iris dataset as an example for this problem. Using the same dataset and x and y from above, we can train a GBM: gbm_model <- h2o.gbm(x = x, y = y, training_frame = splits[[1]], validation_frame = splits[[2]], ntrees = 20, max_depth = 3, learn_rate = 0.01, col_sample_rate = 0.7, seed = 1) Since this is a multi-class problem, we may be interested in inspecting the confusion matrix on a hold-out set. Since we passed along a validation_frame at train time, the validation metrics are already computed and we just need to retrieve them from the model object. h2o.confusionMatrix(gbm_model, valid = TRUE) ## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class ## setosa versicolor virginica Error Rate ## setosa 7 0 0 0.0000 = 0 / 7 ## versicolor 0 13 0 0.0000 = 0 / 13 ## virginica 0 1 10 0.0909 = 1 / 11 ## Totals 7 14 10 0.0323 = 1 / 31

Deep Learning

Use H2O's Deep Learning to perform regression or classification on a dataset, extact non-linear features generated by the deep neural network, and/or detect anomalies using a deep learning model with auto-encoding. In this example, we will use the prostate dataset available within the h2o package: path <- system.file("extdata", "prostate.csv", package = "h2o") prostate_df <- spark_read_csv(sc, "prostate", path) head(prostate_df) ## # Source: lazy query [?? x 9] ## # Database: spark_connection ## ID CAPSULE AGE RACE DPROS DCAPS PSA VOL GLEASON ## <int> <int> <int> <int> <int> <int> <dbl> <dbl> <int> ## 1 1 0 65 1 2 1 1.4 0.0 6 ## 2 2 0 72 1 3 2 6.7 0.0 7 ## 3 3 0 70 1 1 2 4.9 0.0 6 ## 4 4 0 76 2 2 1 51.2 20.0 7 ## 5 5 0 69 1 1 1 12.3 55.9 6 ## 6 6 1 71 1 3 2 3.3 0.0 8 Once we've done whatever data manipulation is required to run our model we'll get a reference to it as an h2o frame then split it into training and test sets using the h2o.splitFrame function: prostate_hf <- as_h2o_frame(sc, prostate_df, strict_version_check = FALSE) splits <- h2o.splitFrame(prostate_hf, seed = 1) Next we define the response and predictor columns. y <- "VOL" #remove response and ID cols x <- setdiff(names(prostate_hf), c("ID", y)) Now we can train a deep neural net. dl_fit <- h2o.deeplearning(x = x, y = y, training_frame = splits[[1]], epochs = 15, activation = "Rectifier", hidden = c(10, 5, 10), input_dropout_ratio = 0.7) Evaluate performance on a test set: h2o.performance(dl_fit, newdata = splits[[2]]) ## H2ORegressionMetrics: deeplearning ## ## MSE: 253.7022 ## RMSE: 15.92803 ## MAE: 12.90077 ## RMSLE: 1.885052 ## Mean Residual Deviance : 253.7022 Note that the above metrics are not reproducible when H2O's Deep Learning is run on multiple cores, however, the metrics should be fairly stable across repeat runs. H2O's grid search capabilities currently supports traditional (Cartesian) grid search and random grid search. Grid search in R provides the following capabilities: H2OGrid class: Represents the results of the grid search h2o.getGrid(<grid_id>, sort_by, decreasing): Display the specified grid h2o.grid: Start a new grid search parameterized by model builder name (e.g., algorithm = "gbm") model parameters (e.g., ntrees = 100) hyper_parameters: attribute for passing a list of hyper parameters (e.g., list(ntrees=c(1,100), learn_rate=c(0.1,0.001))) search_criteria: optional attribute for specifying more a advanced search strategy By default, h2o.grid() will train a Cartesian grid search – meaning, all possible models in the specified grid. In this example, we will re-use the prostate data as an example dataset for a regression problem. splits <- h2o.splitFrame(prostate_hf, seed = 1) y <- "VOL" #remove response and ID cols x <- setdiff(names(prostate_hf), c("ID", y)) After prepping the data, we define a grid and execute the grid search. 
# GBM hyperparamters gbm_params1 <- list(learn_rate = c(0.01, 0.1), max_depth = c(3, 5, 9), sample_rate = c(0.8, 1.0), col_sample_rate = c(0.2, 0.5, 1.0)) # Train and validate a grid of GBMs gbm_grid1 <- h2o.grid("gbm", x = x, y = y, grid_id = "gbm_grid1", training_frame = splits[[1]], validation_frame = splits[[1]], ntrees = 100, seed = 1, hyper_params = gbm_params1) # Get the grid results, sorted by validation MSE gbm_gridperf1 <- h2o.getGrid(grid_id = "gbm_grid1", sort_by = "mse", decreasing = FALSE) gbm_gridperf1 ## H2O Grid Details ## ================ ## ## Grid ID: gbm_grid1 ## Used hyper parameters: ## - col_sample_rate ## - learn_rate ## - max_depth ## - sample_rate ## Number of models: 36 ## Number of failed models: 0 ## ## Hyper-Parameter Search Summary: ordered by increasing mse ## col_sample_rate learn_rate max_depth sample_rate model_ids ## 1 1.0 0.1 9 1.0 gbm_grid1_model_35 ## 2 0.5 0.1 9 1.0 gbm_grid1_model_34 ## 3 1.0 0.1 9 0.8 gbm_grid1_model_17 ## 4 0.5 0.1 9 0.8 gbm_grid1_model_16 ## 5 1.0 0.1 5 0.8 gbm_grid1_model_11 ## mse ## 1 88.10947523138782 ## 2 102.3118989994892 ## 3 102.78632321923726 ## 4 126.4217260351778 ## 5 149.6066650109763 ## ## --- ## col_sample_rate learn_rate max_depth sample_rate model_ids ## 31 0.5 0.01 3 0.8 gbm_grid1_model_1 ## 32 0.2 0.01 5 1.0 gbm_grid1_model_24 ## 33 0.5 0.01 3 1.0 gbm_grid1_model_19 ## 34 0.2 0.01 5 0.8 gbm_grid1_model_6 ## 35 0.2 0.01 3 1.0 gbm_grid1_model_18 ## 36 0.2 0.01 3 0.8 gbm_grid1_model_0 ## mse ## 31 324.8117304723162 ## 32 325.10992525687294 ## 33 325.27898443785045 ## 34 329.36983845305735 ## 35 338.54411936919456 ## 36 339.7744828617712 H2O's Random Grid Search samples from the given parameter space until a set of constraints is met. The user can specify the total number of desired models using (e.g. max_models = 40), the amount of time (e.g. max_runtime_secs = 1000), or tell the grid to stop after performance stops improving by a specified amount. Random Grid Search is a practical way to arrive at a good model without too much effort. The example below is set to run fairly quickly – increase max_runtime_secs or max_models to cover more of the hyperparameter space in your grid search. Also, you can expand the hyperparameter space of each of the algorithms by modifying the definition of hyper_param below. # GBM hyperparamters gbm_params2 <- list(learn_rate = seq(0.01, 0.1, 0.01), max_depth = seq(2, 10, 1), sample_rate = seq(0.5, 1.0, 0.1), col_sample_rate = seq(0.1, 1.0, 0.1)) search_criteria2 <- list(strategy = "RandomDiscrete", max_models = 50) # Train and validate a grid of GBMs gbm_grid2 <- h2o.grid("gbm", x = x, y = y, grid_id = "gbm_grid2", training_frame = splits[[1]], validation_frame = splits[[2]], ntrees = 100, seed = 1, hyper_params = gbm_params2, search_criteria = search_criteria2) # Get the grid results, sorted by validation MSE gbm_gridperf2 <- h2o.getGrid(grid_id = "gbm_grid2", sort_by = "mse", decreasing = FALSE) To get the best model, as measured by validation MSE, we simply grab the first row of the gbm_gridperf2@summary_table object, since this table is already sorted such that the lowest MSE model is on top. gbm_gridperf2@summary_table[1,] ## Hyper-Parameter Search Summary: ordered by increasing mse ## col_sample_rate learn_rate max_depth sample_rate model_ids ## 1 0.8 0.01 2 0.7 gbm_grid2_model_35 ## mse ## 1 244.61196951586288 In the examples above, we generated two different grids, specified by grid_id. 
The first grid was called grid_id = "gbm_grid1" and the second was called grid_id = "gbm_grid2". However, if we are using the same dataset & algorithm in two grid searches, it probably makes more sense just to add the results of the second grid search to the first. If you want to add models to an existing grid, rather than create a new one, you simply re-use the same grid_id.

Exporting Models

There are two ways of exporting models from H2O – saving models as a binary file, or saving models as pure Java code.

Binary Models

The more traditional method is to save a binary model file to disk using the h2o.saveModel() function. To load the models using h2o.loadModel(), the same version of H2O that generated the models is required. This method is commonly used when H2O is being used in a non-production setting. A binary model can be saved as follows: h2o.saveModel(my_model, path = "/Users/me/h2omodels")
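The counterpart for reading the model back is h2o.loadModel(). Note that h2o.saveModel() returns the full path of the file it wrote, which makes round-tripping straightforward; the directory below is illustrative:

# save returns the full path to the model file on disk
saved_path <- h2o.saveModel(glm_model, path = "/Users/me/h2omodels")

# reload it later (with the same H2O version that created it)
reloaded_model <- h2o.loadModel(saved_path)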

Java (POJO) Models

One of the most valuable features of H2O is its ability to export models as pure Java code, or rather, a “Plain Old Java Object” (POJO). You can learn more about H2O POJO models in this POJO quickstart guide. The POJO method is used most commonly when a model is deployed in a production setting. POJO models are ideal for when you need very fast prediction response times, and minimal requirements – the POJO is a standalone Java class with no dependencies on the full H2O stack. To generate the POJO for your model, use the following command: h2o.download_pojo(my_model, path = "/Users/me/h2omodels") Finally, disconnect with: spark_disconnect_all() ## [1] 1 You can learn more about how to take H2O models to production in the productionizing H2O models section of the H2O docs.

Additional Resources

Main documentation site for Sparkling Water (and all H2O software projects) H2O.ai website If you are new to H2O for machine learning, we recommend you start with the Intro to H2O Tutorial, followed by the H2O Grid Search & Model Selection Tutorial. There are a number of other H2O R tutorials and demos available, as well as the H2O World 2015 Training Gitbook, and the Machine Learning with R and H2O Booklet (pdf).

R interface for GraphFrames

Highlights

Support for GraphFrames which aims to provide the functionality of GraphX. Perform graph algorithms like: PageRank, ShortestPaths and many others. Designed to work with sparklyr and the sparklyr extensions.

Installation

To install from CRAN, run: install.packages("graphframes") For the development version, run: devtools::install_github("rstudio/graphframes")

Examples

The examples make use of the highschool dataset from the ggraph package.

Create a GraphFrame

The base for graph analyses in Spark, using sparklyr, will be a GraphFrame. Open a new Spark connection using sparklyr, and copy the highschool data set library(graphframes) library(sparklyr) library(dplyr) sc <- spark_connect(master = "local", version = "2.1.0") highschool_tbl <- copy_to(sc, ggraph::highschool, "highschool") head(highschool_tbl) ## # Source: lazy query [?? x 3] ## # Database: spark_connection ## from to year ## <dbl> <dbl> <dbl> ## 1 1. 14. 1957. ## 2 1. 15. 1957. ## 3 1. 21. 1957. ## 4 1. 54. 1957. ## 5 1. 55. 1957. ## 6 2. 21. 1957. The vertices table is be constructed using dplyr. The variable name expected by the GraphFrame is id. from_tbl <- highschool_tbl %>% distinct(from) %>% transmute(id = from) to_tbl <- highschool_tbl %>% distinct(to) %>% transmute(id = to) vertices_tbl <- from_tbl %>% sdf_bind_rows(to_tbl) head(vertices_tbl) ## # Source: lazy query [?? x 1] ## # Database: spark_connection ## id ## <dbl> ## 1 6. ## 2 7. ## 3 12. ## 4 13. ## 5 55. ## 6 58. The edges table can also be created using dplyr. In order for the GraphFrame to work, the from variable needs be renamed src, and the to variable dst. # Create a table with <source, destination> edges edges_tbl <- highschool_tbl %>% transmute(src = from, dst = to) The gf_graphframe() function creates a new GraphFrame gf_graphframe(vertices_tbl, edges_tbl) ## GraphFrame ## Vertices: ## $ id <dbl> 6, 7, 12, 13, 55, 58, 63, 41, 44, 48, 59, 1, 4, 17, 20, 22,... ## Edges: ## $ src <dbl> 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 7, 8... ## $ dst <dbl> 14, 15, 21, 54, 55, 21, 22, 9, 15, 5, 18, 19, 43, 19, 43, ...

Basic Page Rank

We will calculate PageRank over this dataset. The gf_graphframe() command can easily be piped into the gf_pagerank() function to execute the Page Rank. gf_graphframe(vertices_tbl, edges_tbl) %>% gf_pagerank(reset_prob = 0.15, max_iter = 10L, source_id = "1") ## GraphFrame ## Vertices: ## $ id <dbl> 12, 12, 59, 59, 1, 1, 20, 20, 45, 45, 8, 8, 9, 9, 26,... ## $ pagerank <dbl> 1.216914e-02, 1.216914e-02, 1.151867e-03, 1.151867e-0... ## Edges: ## $ src <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,... ## $ dst <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 22, 22,... ## $ weight <dbl> 0.02777778, 0.02777778, 0.02777778, 0.02777778, 0.02777... Additionaly, one can calculate the degrees of vertices using gf_degrees as follows: gf_graphframe(vertices_tbl, edges_tbl) %>% gf_degrees() ## # Source: table<sparklyr_tmp_27b034635ad> [?? x 2] ## # Database: spark_connection ## id degree ## <dbl> <int> ## 1 55. 25 ## 2 6. 10 ## 3 13. 16 ## 4 7. 6 ## 5 12. 11 ## 6 63. 21 ## 7 58. 8 ## 8 41. 19 ## 9 48. 15 ## 10 59. 11 ## # ... with more rows

Visualizations

In order to visualize large graphframes, one can use sample_n and then use ggraph with igraph to visualize the graph as follows: library(ggraph) library(igraph) graph <- highschool_tbl %>% sample_n(20) %>% collect() %>% graph_from_data_frame() ggraph(graph, layout = 'kk') + geom_edge_link(aes(colour = factor(year))) + geom_node_point() + ggtitle('An example')

Additional functions

Apart from calculating PageRank using gf_pagerank, the following functions are available: gf_bfs(): Breadth-first search (BFS). gf_connected_components(): Connected components. gf_shortest_paths(): Shortest paths algorithm. gf_scc(): Strongly connected components. gf_triangle_count: Computes the number of triangles passing through each vertex and others.
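For instance, here is a minimal sketch that reuses the vertices and edges tables from above to compute shortest paths to a couple of landmark vertices:

# shortest paths from every vertex to the selected landmark ids
gf_graphframe(vertices_tbl, edges_tbl) %>%
  gf_shortest_paths(landmarks = c(1, 55))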

R interface for MLeap

mleap is a sparklyr extension that provides an interface to MLeap, which allows us to take Spark pipelines to production.

Install mleap

mleap can be installed from CRAN via install.packages("mleap") or, for the latest development version from GitHub, using devtools::install_github("rstudio/mleap")

Setup

Once mleap has been installed, we can install the external dependencies using: library(mleap) install_mleap() Another dependency of mleap is Maven. If it is already installed, just point mleap to its location: options(maven.home = "path/to/maven") If Maven is not yet installed, which is the most likely case, use the following to install it: install_maven()

Create an MLeap Bundle

    Start Spark session using sparklyr library(sparklyr) sc <- spark_connect(master = "local", version = "2.2.0") mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE) Create and fit an ML Pipeline pipeline <- ml_pipeline(sc) %>% ft_binarizer("hp", "big_hp", threshold = 100) %>% ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>% ml_gbt_regressor(label_col = "mpg") pipeline_model <- ml_fit(pipeline, mtcars_tbl) A transformed data frame with the appropriate schema is required for exporting the Pipeline model transformed_tbl <- ml_transform(pipeline_model, mtcars_tbl) Export the model using the ml_write_bundle() function from mleap model_path <- file.path(tempdir(), "mtcars_model.zip") ml_write_bundle(pipeline_model, transformed_tbl, model_path) ## Model successfully exported. Close Spark session spark_disconnect(sc)
At this point, we can share mtcars_model.zip with the deployment/implementation engineers, and they would be able to embed the model in another application. See the MLeap docs for details.

Test the mleap bundle

The mleap package also provides R functions for testing that the saved models behave as expected. Here we load the previously saved model: model <- mleap_load_bundle(model_path) model ## MLeap Transformer ## <db23a9f1-7b3d-4d27-9eb0-8675125ab3a5> ## Name: pipeline_fe6b8cb0028f ## Format: json ## MLeap Version: 0.10.0-SNAPSHOT To retrieve the schema associated with the model use the mleap_model_schema() function mleap_model_schema(model) ## # A tibble: 6 x 4 ## name type nullable dimension ## <chr> <chr> <lgl> <chr> ## 1 qsec double TRUE <NA> ## 2 hp double FALSE <NA> ## 3 wt double TRUE <NA> ## 4 big_hp double FALSE <NA> ## 5 features double TRUE (3) ## 6 prediction double FALSE <NA> Then, we create a new data frame to be scored, and make predictions using the model: newdata <- tibble::tribble( ~qsec, ~hp, ~wt, 16.2, 101, 2.68, 18.1, 99, 3.08 ) # Transform the data frame transformed_df <- mleap_transform(model, newdata) dplyr::glimpse(transformed_df) ## Observations: 2 ## Variables: 6 ## $ qsec <dbl> 16.2, 18.1 ## $ hp <dbl> 101, 99 ## $ wt <dbl> 2.68, 3.08 ## $ big_hp <dbl> 1, 0 ## $ features <list> [[[1, 2.68, 16.2], [3]], [[0, 3.08, 18.1], [3]]] ## $ prediction <dbl> 21.06529, 22.36667

Examples

Option 1 - Connecting to Databricks remotely Overview With this configuration, RStudio Server Pro is installed outside of the Spark cluster and allows users to connect to Spark remotely using sparklyr with Databricks Connect. This is the recommended configuration because it targets separate environments, involves a typical configuration process, avoids resource contention, and allows RStudio Server Pro to connect to Databricks as well as other remote storage and compute resources. Advantages and limitations Advantages: RStudio Server Pro will remain functional if Databricks clusters are terminated Provides the ability to communicate with one or more Databricks clusters as a remote compute resource Avoids resource contention between RStudio Server Pro and Databricks Limitations:
Option 2 - Working inside of Databricks Overview If the recommended path of connecting to Spark remotely with Databricks Connect does not apply to your use case, then you can install RStudio Server Pro directly within a Databricks cluster as described in the sections below. With this configuration, RStudio Server Pro is installed on the Spark driver node and allows users to work locally with Spark using sparklyr. This configuration can result in increased complexity, limited connectivity to other storage and compute resources, resource contention between RStudio Server Pro and Databricks, and maintenance concerns due to the ephemeral nature of Databricks clusters.
Using sparklyr with Databricks Overview This documentation demonstrates how to use sparklyr with Apache Spark in Databricks along with RStudio Team, RStudio Server Pro, RStudio Connect, and RStudio Package Manager. Using RStudio Team with Databricks RStudio Team is a bundle of our popular professional software for developing data science projects, publishing data products, and managing packages. RStudio Team and sparklyr can be used with Databricks to work with large datasets and distributed computations with Apache Spark.
Using sparklyr with an Apache Spark cluster Summary This document demonstrates how to use sparklyr with a Cloudera Hadoop & Spark cluster. Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes. RStudio Server is installed on the master node and orchestrates the analysis in Spark. Cloudera Cluster This demonstration is focused on adding RStudio integration to an existing Cloudera cluster. The assumption will be made that no help is needed to set up and administer the cluster.

Spark Standalone Deployment in AWS

Overview

The plan is to launch 4 identical EC2 server instances. One server will be the Master node and the other 3 the worker nodes. On one of the worker nodes, we will install RStudio Server. What makes a server the Master node is only the fact that it is running the master service, while the other machines are running the slave service and are pointed to that first master. This simple setup allows us to install the same Spark components on all 4 servers and then just add RStudio to one of them. The topology will look something like this:

AWS EC2 Instances

Here are the details of the EC2 instance, just deploy one at this point: Type: t2.medium OS: Ubuntu 16.04 LTS Disk space: At least 20GB Security group: Open the following ports: 8080 (Spark UI), 4040 (Spark Worker UI), 8088 (sparklyr UI) and 8787 (RStudio). Also open All TCP ports for the machines inside the security group.

Spark

Perform the steps in this section on all of the servers that will be part of the cluster.

Install Java 8

We will add the Java 8 repository, install it and set it as default:

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
sudo apt-get update

Download Spark

Download and unpack a pre-compiled version of Spark. Here is the link to the official Spark download page.

wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7

Create and launch AMI

We will create an image of the server. In Amazon, these are called AMIs; for more information, please see the User Guide. Launch 3 instances of the AMI.

RStudio Server

Select one of the nodes to execute this section. Please check the RStudio download page for the latest version.

Install R

In order to get the latest R core, we will need to update the source list in Ubuntu.

sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key 0x51716619e084dab9
gpg -a --export 0x51716619e084dab9 | sudo apt-key add -
sudo apt-get update

Now we can install R:

sudo apt-get install r-base
sudo apt-get install gdebi-core

Install RStudio

We will download and install version 1.0.153 of RStudio Server. To find the latest version, please visit the RStudio website. In order to get the enhanced integration with Spark, RStudio version 1.0.44 or later will be needed.

wget https://download2.rstudio.org/rstudio-server-1.0.153-amd64.deb
sudo gdebi rstudio-server-1.0.153-amd64.deb

Install dependencies

Run the following commands:

sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev
sudo apt-get -y install libxml2-dev

Add default user

Run the following command to add a default user:

sudo adduser rstudio-user

Start the Master node

Select one of the servers to become your Master node. Run the command that starts the master service:

sudo spark-2.1.0-bin-hadoop2.7/sbin/start-master.sh

Close the terminal connection (optional).

Start Worker nodes

Start the slave service on the remaining servers. Important: use dots, not dashes, as separators for the Spark Master node's address.

sudo spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://[Master node's IP address]:7077

For example:

sudo spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://ip-172-30-1-94.us-west-2.compute.internal:7077

Close the terminal connection (optional).

Pre-load packages

Log into RStudio (port 8787) as ‘rstudio-user’ and install sparklyr:

install.packages("sparklyr")

Connect to the Spark Master

Navigate to the Spark Master's UI, typically on port 8080, and note the Spark Master URL. Log on to RStudio and run the following code:

library(sparklyr)
conf <- spark_config()
conf$spark.executor.memory <- "2GB"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "[Spark Master URL]",
                    version = "2.1.0",
                    config = conf,
                    spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/")
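As an optional sanity check, you can copy a small R data set into the cluster, count its rows, and open the Spark UI to confirm that the workers registered with the master. This is only a minimal sketch; the table name below is arbitrary.

# Copy a small data set to Spark and count its rows
library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_check", overwrite = TRUE)
count(mtcars_tbl)

# Open the Spark web UI to verify the registered executors
spark_web(sc)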

Using sparklyr with an Apache Spark cluster

This document demonstrates how to use sparklyr with an Apache Spark cluster. Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes. RStudio Server is installed on the master node and orchestrates the analysis in Spark. Here is the basic workflow.

Data preparation

Set up the cluster

This demonstration uses Amazon Web Services (AWS), but it could just as easily use Microsoft, Google, or any other provider. We will use Elastic MapReduce (EMR) to easily set up a cluster with two core nodes and one master node. Nodes use virtual servers from the Elastic Compute Cloud (EC2). Note: there is no free tier for EMR; charges will apply. Before beginning this setup, we assume you have:

Familiarity with and access to an AWS account
Familiarity with basic Linux commands
Sudo privileges in order to install software from the command line

Build an EMR cluster

Before beginning the EMR wizard setup, make sure you create the following in AWS:

An AWS key pair (.pem key) so you can SSH into the EC2 master node
A security group that gives you access to port 22 on your IP and port 8787 from anywhere

Step 1: Select software

Make sure to select Hive and Spark as part of the install. Note that by choosing Spark, R will also be installed on the master node as part of the distribution.

Step 2: Select hardware

Install 2 core nodes and one master node using m3.xlarge instances with 80 GiB of storage per node. You can easily increase the number of nodes later.

Step 3: Select general cluster settings

Click next on the general cluster settings.

Step 4: Select security

Enter your EC2 key pair and security group. Make sure the security group has ports 22 and 8787 open.

Connect to EMR

The cluster page will give you details about your EMR cluster and instructions on connecting. Connect to the master node via SSH using your key pair. Once you connect, you will see the EMR welcome message.

# Log in to master node
ssh -i ~/spark-demo.pem hadoop@ec2-52-10-102-11.us-west-2.compute.amazonaws.com

Install RStudio Server

EMR uses Amazon Linux, which is based on CentOS. Update your master node and install dependencies that will be used by R packages.

# Update
sudo yum update
sudo yum install libcurl-devel openssl-devel # used for devtools

The installation of RStudio Server is easy. Download the preview version of RStudio and install it on the master node.

# Install RStudio Server
wget -P /tmp https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-0.99.1266-x86_64.rpm
sudo yum install --nogpgcheck /tmp/rstudio-server-rhel-0.99.1266-x86_64.rpm

Create a User

Create a user called rstudio-user that will perform the data analysis. Create a user directory for rstudio-user on HDFS with the hadoop fs command.

# Make User
sudo useradd -m rstudio-user
sudo passwd rstudio-user

# Create new directory in hdfs
hadoop fs -mkdir /user/rstudio-user
hadoop fs -chmod 777 /user/rstudio-user

Download flights data

The flights data is a well known data source representing 123 million flights over 22 years. It consumes roughly 12 GiB of storage in uncompressed CSV format in yearly files.

Switch User

For data loading and analysis, make sure you are logged in as the regular user.

# Create directories on HDFS for new user
hadoop fs -mkdir /user/rstudio-user
hadoop fs -chmod 777 /user/rstudio-user

# Switch user
su rstudio-user

Download data

Run the following script to download data from the web onto your master node. Download the yearly flight data and the airlines lookup table.

# Make download directory
mkdir /tmp/flights

# Download flight data by year
for i in {1987..2008}
do
  echo "$(date) $i Download"
  fnam=$i.csv.bz2
  wget -O /tmp/flights/$fnam http://stat-computing.org/dataexpo/2009/$fnam
  echo "$(date) $i Unzip"
  bunzip2 /tmp/flights/$fnam
done

# Download airline carrier data
wget -O /tmp/airlines.csv http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS

# Download airports data
wget -O /tmp/airports.csv https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

Distribute into HDFS

Copy data into HDFS using the hadoop fs command.

# Copy flight data to HDFS
hadoop fs -mkdir /user/rstudio-user/flights/
hadoop fs -put /tmp/flights /user/rstudio-user/

# Copy airline data to HDFS
hadoop fs -mkdir /user/rstudio-user/airlines/
hadoop fs -put /tmp/airlines.csv /user/rstudio-user/airlines

# Copy airport data to HDFS
hadoop fs -mkdir /user/rstudio-user/airports/
hadoop fs -put /tmp/airports.csv /user/rstudio-user/airports

Create Hive tables

Launch Hive from the command line.

# Open Hive prompt
hive

Create the metadata that will structure the flights table. Load data into the Hive table.

# Create metadata for flights
CREATE EXTERNAL TABLE IF NOT EXISTS flights (
  year int, month int, dayofmonth int, dayofweek int, deptime int, crsdeptime int,
  arrtime int, crsarrtime int, uniquecarrier string, flightnum int, tailnum string,
  actualelapsedtime int, crselapsedtime int, airtime string, arrdelay int, depdelay int,
  origin string, dest string, distance int, taxiin string, taxiout string, cancelled int,
  cancellationcode string, diverted int, carrierdelay string, weatherdelay string,
  nasdelay string, securitydelay string, lateaircraftdelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");

# Load data into table
LOAD DATA INPATH '/user/rstudio-user/flights' INTO TABLE flights;

Create the metadata that will structure the airlines table. Load data into the Hive table.

# Create metadata for airlines
CREATE EXTERNAL TABLE IF NOT EXISTS airlines (
  Code string, Description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = '\,',
  "quoteChar" = '\"'
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");

# Load data into table
LOAD DATA INPATH '/user/rstudio-user/airlines' INTO TABLE airlines;

Create the metadata that will structure the airports table. Load data into the Hive table.

# Create metadata for airports
CREATE EXTERNAL TABLE IF NOT EXISTS airports (
  id string, name string, city string, country string, faa string, icao string,
  lat double, lon double, alt int, tz_offset double, dst string, tz_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = '\,',
  "quoteChar" = '\"'
)
STORED AS TEXTFILE;

# Load data into table
LOAD DATA INPATH '/user/rstudio-user/airports' INTO TABLE airports;

Connect to Spark

Log in to RStudio Server by pointing a browser at your master node IP:8787. Set the environment variable SPARK_HOME and then run spark_connect. After connecting you will be able to browse the Hive metadata in the RStudio Server Spark pane.

# Connect to Spark
library(sparklyr)
library(dplyr)
library(ggplot2)
Sys.setenv(SPARK_HOME = "/usr/lib/spark")
config <- spark_config()
sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.2')

Once you are connected, you will see the Spark pane appear along with your Hive tables. You can inspect your tables by clicking on the data icon.
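If you prefer to check from the console rather than the IDE, a minimal sketch (assuming the connection above succeeded) is to list the tables that Spark can see through the Hive catalog:

# List the tables visible to the Spark connection
src_tbls(sc)

# Or, equivalently, through sparklyr's DBI interface
DBI::dbListTables(sc)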

Data analysis

Is there evidence to suggest that some airline carriers make up time in flight? This analysis predicts time gained in flight by airline carrier.

Cache the tables into memory

Use tbl_cache to load the flights table into memory. Caching tables will make analysis much faster. Create a dplyr reference to the Spark DataFrame.

# Cache flights Hive table into Spark
tbl_cache(sc, 'flights')
flights_tbl <- tbl(sc, 'flights')

# Cache airlines Hive table into Spark
tbl_cache(sc, 'airlines')
airlines_tbl <- tbl(sc, 'airlines')

# Cache airports Hive table into Spark
tbl_cache(sc, 'airports')
airports_tbl <- tbl(sc, 'airports')

Create a model data set

Filter the data to contain only the records to be used in the fitted model. Join carrier descriptions for reference. Create a new variable called gain which represents the amount of time gained (or lost) in flight.

# Filter records and create target variable 'gain'
model_data <- flights_tbl %>%
  filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
  filter(depdelay > 15 & depdelay < 240) %>%
  filter(arrdelay > -60 & arrdelay < 360) %>%
  filter(year >= 2003 & year <= 2007) %>%
  left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
  mutate(gain = depdelay - arrdelay) %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain)

# Summarize data by carrier
model_data %>%
  group_by(uniquecarrier) %>%
  summarize(description = min(description), gain = mean(gain),
            distance = mean(distance), depdelay = mean(depdelay)) %>%
  select(description, gain, distance, depdelay) %>%
  arrange(gain)

Source: query [?? x 4]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

                    description       gain  distance depdelay
                          <chr>      <dbl>     <dbl>    <dbl>
1        ATA Airlines d/b/a ATA -3.3480120 1134.7084 56.06583
2  ExpressJet Airlines Inc. (1) -3.0326180  519.7125 59.41659
3                     Envoy Air -2.5434415  416.3716 53.12529
4       Northwest Airlines Inc. -2.2030586  779.2342 48.52828
5          Delta Air Lines Inc. -1.8248026  868.3997 50.77174
6   AirTran Airways Corporation -1.4331555  641.8318 54.96702
7    Continental Air Lines Inc. -0.9617003 1116.6668 57.00553
8        American Airlines Inc. -0.8860262 1074.4388 55.45045
9             Endeavor Air Inc. -0.6392733  467.1951 58.47395
10              JetBlue Airways -0.3262134 1139.0443 54.06156
# ... with more rows

Train a linear model

Predict time gained or lost in flight as a function of distance, departure delay, and airline carrier.

# Partition the data into training and validation sets
model_partition <- model_data %>%
  sdf_partition(train = 0.8, valid = 0.2, seed = 5555)

# Fit a linear model
ml1 <- model_partition$train %>%
  ml_linear_regression(gain ~ distance + depdelay + uniquecarrier)

# Summarize the linear model
summary(ml1)

Deviance Residuals: (approximate):
     Min       1Q   Median       3Q      Max
-305.422   -5.593    2.699    9.750  147.871

Coefficients:
                    Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)      -1.24342576  0.10248281  -12.1330 < 2.2e-16 ***
distance          0.00326600  0.00001670  195.5709 < 2.2e-16 ***
depdelay         -0.01466233  0.00020337  -72.0977 < 2.2e-16 ***
uniquecarrier_AA -2.32650517  0.10522524  -22.1098 < 2.2e-16 ***
uniquecarrier_AQ  2.98773637  0.28798507   10.3746 < 2.2e-16 ***
uniquecarrier_AS  0.92054894  0.11298561    8.1475 4.441e-16 ***
uniquecarrier_B6 -1.95784698  0.11728289  -16.6934 < 2.2e-16 ***
uniquecarrier_CO -2.52618081  0.11006631  -22.9514 < 2.2e-16 ***
uniquecarrier_DH  2.23287189  0.11608798   19.2343 < 2.2e-16 ***
uniquecarrier_DL -2.68848119  0.10621977  -25.3106 < 2.2e-16 ***
uniquecarrier_EV  1.93484736  0.10724290   18.0417 < 2.2e-16 ***
uniquecarrier_F9 -0.89788137  0.14422281   -6.2257 4.796e-10 ***
uniquecarrier_FL -1.46706706  0.11085354  -13.2343 < 2.2e-16 ***
uniquecarrier_HA -0.14506644  0.25031456   -0.5795    0.5622
uniquecarrier_HP  2.09354855  0.12337515   16.9690 < 2.2e-16 ***
uniquecarrier_MQ -1.88297535  0.10550507  -17.8473 < 2.2e-16 ***
uniquecarrier_NW -2.79538927  0.10752182  -25.9983 < 2.2e-16 ***
uniquecarrier_OH  0.83520117  0.11032997    7.5700 3.730e-14 ***
uniquecarrier_OO  0.61993842  0.10679884    5.8047 6.447e-09 ***
uniquecarrier_TZ -4.99830389  0.15912629  -31.4109 < 2.2e-16 ***
uniquecarrier_UA -0.68294396  0.10638099   -6.4198 1.365e-10 ***
uniquecarrier_US -0.61589284  0.10669583   -5.7724 7.815e-09 ***
uniquecarrier_WN  3.86386059  0.10362275   37.2878 < 2.2e-16 ***
uniquecarrier_XE -2.59658123  0.10775736  -24.0966 < 2.2e-16 ***
uniquecarrier_YV  3.11113140  0.11659679   26.6828 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-Squared: 0.02385
Root Mean Squared Error: 17.74

Assess model performance

Compare the model performance using the validation data.

# Calculate average gains by predicted decile
model_deciles <- lapply(model_partition, function(x) {
  sdf_predict(ml1, x) %>%
    mutate(decile = ntile(desc(prediction), 10)) %>%
    group_by(decile) %>%
    summarize(gain = mean(gain)) %>%
    select(decile, gain) %>%
    collect()
})

# Create a summary dataset for plotting
deciles <- rbind(
  data.frame(data = 'train', model_deciles$train),
  data.frame(data = 'valid', model_deciles$valid),
  make.row.names = FALSE
)

# Plot average gains by predicted decile
deciles %>%
  ggplot(aes(factor(decile), gain, fill = data)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(title = 'Average gain by predicted decile', x = 'Decile', y = 'Minutes')

Visualize predictions

Compare actual gains to predicted gains for an out of time sample.

# Select data from an out of time sample
data_2008 <- flights_tbl %>%
  filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
  filter(depdelay > 15 & depdelay < 240) %>%
  filter(arrdelay > -60 & arrdelay < 360) %>%
  filter(year == 2008) %>%
  left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
  mutate(gain = depdelay - arrdelay) %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain, origin, dest)

# Summarize data by carrier
carrier <- sdf_predict(ml1, data_2008) %>%
  group_by(description) %>%
  summarize(gain = mean(gain), prediction = mean(prediction), freq = n()) %>%
  filter(freq > 10000) %>%
  collect

# Plot actual gains and predicted gains by airline carrier
ggplot(carrier, aes(gain, prediction)) +
  geom_point(alpha = 0.75, color = 'red', shape = 3) +
  geom_abline(intercept = 0, slope = 1, alpha = 0.15, color = 'blue') +
  geom_text(aes(label = substr(description, 1, 20)), size = 3, alpha = 0.75, vjust = -1) +
  labs(title = 'Average Gains Forecast', x = 'Actual', y = 'Predicted')

Some carriers make up more time than others in flight, but the differences are relatively small. The average time gained by the best and worst airlines differs by only six minutes. The best predictor of time gained is not carrier but flight distance: the biggest gains were associated with the longest flights.

Share Insights

This simple linear model contains a wealth of detailed information about carriers, distances traveled, and flight delays. These detailed insights can be conveyed to a non-technical audience via an interactive flexdashboard.

Build dashboard

Aggregate the scored data by origin, destination, and airline. Save the aggregated data.

# Summarize by origin, destination, and carrier
summary_2008 <- sdf_predict(ml1, data_2008) %>%
  rename(carrier = uniquecarrier, airline = description) %>%
  group_by(origin, dest, carrier, airline) %>%
  summarize(
    flights = n(),
    distance = mean(distance),
    avg_dep_delay = mean(depdelay),
    avg_arr_delay = mean(arrdelay),
    avg_gain = mean(gain),
    pred_gain = mean(prediction)
  )

# Collect and save objects
pred_data <- collect(summary_2008)
airports <- collect(select(airports_tbl, name, faa, lat, lon))
ml1_summary <- capture.output(summary(ml1))
save(pred_data, airports, ml1_summary, file = 'flights_pred_2008.RData')

Publish dashboard

Use the saved data to build an R Markdown flexdashboard. Publish the flexdashboard to Shiny Server, Shinyapps.io or RStudio Connect.
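As a minimal sketch of the publishing step, the flexdashboard can load the saved .RData file and then be deployed with the rsconnect package. The dashboard file name below is a placeholder, and the sketch assumes an RStudio Connect (or shinyapps.io) account has already been configured in rsconnect.

# In a setup chunk of the (hypothetical) flights_dashboard.Rmd
load("flights_pred_2008.RData")   # loads pred_data, airports, ml1_summary

# From the R console, publish the rendered document
rsconnect::deployDoc("flights_dashboard.Rmd")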

Using sparklyr with an Apache Spark cluster

Summary

This document demonstrates how to use sparklyr with a Cloudera Hadoop & Spark cluster. Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes. RStudio Server is installed on the master node and orchestrates the analysis in Spark.

Cloudera Cluster

This demonstration is focused on adding RStudio integration to an existing Cloudera cluster. The assumption will be made that no help is needed to set up and administer the cluster.

CDH 5

We will start with a Cloudera cluster, CDH version 5.8.2 (free version), with an underlying Ubuntu Linux distribution.

Spark 1.6

The default Spark 1.6.0 parcel is installed and running.

Hive data

For this demo, we have created and populated 3 tables in Hive. The table names are: flights, airlines and airports. Using Hue, we can see the loaded tables. For the links to the data files and their Hive import scripts please see Appendix A.

Install RStudio

The latest version of R is needed. In Ubuntu, the default core R is not the latest, so we have to update the source list. We will also install a few other dependencies.

sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key 0x51716619e084dab9
gpg -a --export 0x51716619e084dab9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base
sudo apt-get install gdebi-core
sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev

We will install the preview version of RStudio Server:

wget https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-1.0.40-amd64.deb
sudo gdebi rstudio-server-1.0.40-amd64.deb

Create and configure a User

Create a user called rstudio that will perform the data analysis.

sudo adduser rstudio

To ease security restrictions in this demo, we will add the new user to the default super group defined in the dfs.permissions.superusergroup setting in CDH.

sudo groupadd supergroup
sudo usermod -a -G supergroup rstudio

Connect to Spark

Log in to RStudio Server by pointing a browser at your master node IP:8787. Pass the Spark home directory to spark_connect and run the code below. After connecting you will be able to browse the Hive metadata in the RStudio Server Spark pane.

library(sparklyr)
library(dplyr)
library(ggplot2)
sc <- spark_connect(master = "yarn-client", version = "1.6.0",
                    spark_home = '/opt/cloudera/parcels/CDH/lib/spark/')

Once you are connected, you will see the Spark pane appear along with your Hive tables. You can inspect your tables by clicking on the data icon. This is what the tables look like loaded in Spark via the History Server Web UI (port 18088).

Data analysis

Is there evidence to suggest that some airline carriers make up time in flight? This analysis predicts time gained in flight by airline carrier.

Cache the tables into memory

Use tbl_cache to load the flights table into memory. Caching tables will make analysis much faster. Create a dplyr reference to the Spark DataFrame.

# Cache flights Hive table into Spark
tbl_cache(sc, 'flights')
flights_tbl <- tbl(sc, 'flights')

# Cache airlines Hive table into Spark
tbl_cache(sc, 'airlines')
airlines_tbl <- tbl(sc, 'airlines')

# Cache airports Hive table into Spark
tbl_cache(sc, 'airports')
airports_tbl <- tbl(sc, 'airports')

Create a model data set

Filter the data to contain only the records to be used in the fitted model. Join carrier descriptions for reference. Create a new variable called gain which represents the amount of time gained (or lost) in flight.

# Filter records and create target variable 'gain'
model_data <- flights_tbl %>%
  filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
  filter(depdelay > 15 & depdelay < 240) %>%
  filter(arrdelay > -60 & arrdelay < 360) %>%
  filter(year >= 2003 & year <= 2007) %>%
  left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
  mutate(gain = depdelay - arrdelay) %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain)

# Summarize data by carrier
model_data %>%
  group_by(uniquecarrier) %>%
  summarize(description = min(description), gain = mean(gain),
            distance = mean(distance), depdelay = mean(depdelay)) %>%
  select(description, gain, distance, depdelay) %>%
  arrange(gain)

Source: query [?? x 4]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

                    description       gain  distance depdelay
                          <chr>      <dbl>     <dbl>    <dbl>
1        ATA Airlines d/b/a ATA -5.5679651 1240.7219 61.84391
2       Northwest Airlines Inc. -3.1134556  779.1926 48.84979
3                     Envoy Air -2.2056576  437.0883 54.54923
4             PSA Airlines Inc. -1.9267647  500.6955 55.60335
5  ExpressJet Airlines Inc. (1) -1.5886314  537.3077 61.58386
6               JetBlue Airways -1.3742524 1087.2337 59.80750
7         SkyWest Airlines Inc. -1.1265678  419.6489 54.04198
8          Delta Air Lines Inc. -0.9829374  956.9576 50.19338
9        American Airlines Inc. -0.9631200 1066.8396 56.78222
10  AirTran Airways Corporation -0.9411572  665.6574 53.38363
# ... with more rows

Train a linear model

Predict time gained or lost in flight as a function of distance, departure delay, and airline carrier.

# Partition the data into training and validation sets
model_partition <- model_data %>%
  sdf_partition(train = 0.8, valid = 0.2, seed = 5555)

# Fit a linear model
ml1 <- model_partition$train %>%
  ml_linear_regression(gain ~ distance + depdelay + uniquecarrier)

# Summarize the linear model
summary(ml1)

Call: ml_linear_regression(., gain ~ distance + depdelay + uniquecarrier)

Deviance Residuals: (approximate):
     Min       1Q   Median       3Q      Max
-302.343   -5.669    2.714    9.832  104.130

Coefficients:
                    Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)      -1.26566581  0.10385870  -12.1864 < 2.2e-16 ***
distance          0.00308711  0.00002404  128.4155 < 2.2e-16 ***
depdelay         -0.01397013  0.00028816  -48.4812 < 2.2e-16 ***
uniquecarrier_AA -2.18483090  0.10985406  -19.8885 < 2.2e-16 ***
uniquecarrier_AQ  3.14330242  0.29114487   10.7964 < 2.2e-16 ***
uniquecarrier_AS  0.09210380  0.12825003    0.7182 0.4726598
uniquecarrier_B6 -2.66988794  0.12682192  -21.0523 < 2.2e-16 ***
uniquecarrier_CO -1.11611186  0.11795564   -9.4621 < 2.2e-16 ***
uniquecarrier_DL -1.95206198  0.11431110  -17.0767 < 2.2e-16 ***
uniquecarrier_EV  1.70420830  0.11337215   15.0320 < 2.2e-16 ***
uniquecarrier_F9 -1.03178176  0.15384863   -6.7065 1.994e-11 ***
uniquecarrier_FL -0.99574060  0.12034738   -8.2739 2.220e-16 ***
uniquecarrier_HA -1.16970713  0.34894788   -3.3521 0.0008020 ***
uniquecarrier_MQ -1.55569040  0.10975613  -14.1741 < 2.2e-16 ***
uniquecarrier_NW -3.58502418  0.11534938  -31.0797 < 2.2e-16 ***
uniquecarrier_OH -1.40654797  0.12034858  -11.6873 < 2.2e-16 ***
uniquecarrier_OO -0.39069404  0.11132164   -3.5096 0.0004488 ***
uniquecarrier_TZ -7.26285217  0.34428509  -21.0955 < 2.2e-16 ***
uniquecarrier_UA -0.56995737  0.11186757   -5.0949 3.489e-07 ***
uniquecarrier_US -0.52000028  0.11218498   -4.6352 3.566e-06 ***
uniquecarrier_WN  4.22838982  0.10629405   39.7801 < 2.2e-16 ***
uniquecarrier_XE -1.13836940  0.11332176  -10.0455 < 2.2e-16 ***
uniquecarrier_YV  3.17149538  0.11709253   27.0854 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-Squared: 0.02301
Root Mean Squared Error: 17.83

Assess model performance

Compare the model performance using the validation data.

# Calculate average gains by predicted decile
model_deciles <- lapply(model_partition, function(x) {
  sdf_predict(ml1, x) %>%
    mutate(decile = ntile(desc(prediction), 10)) %>%
    group_by(decile) %>%
    summarize(gain = mean(gain)) %>%
    select(decile, gain) %>%
    collect()
})

# Create a summary dataset for plotting
deciles <- rbind(
  data.frame(data = 'train', model_deciles$train),
  data.frame(data = 'valid', model_deciles$valid),
  make.row.names = FALSE
)

# Plot average gains by predicted decile
deciles %>%
  ggplot(aes(factor(decile), gain, fill = data)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(title = 'Average gain by predicted decile', x = 'Decile', y = 'Minutes')

Visualize predictions

Compare actual gains to predicted gains for an out of time sample.

# Select data from an out of time sample
data_2008 <- flights_tbl %>%
  filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
  filter(depdelay > 15 & depdelay < 240) %>%
  filter(arrdelay > -60 & arrdelay < 360) %>%
  filter(year == 2008) %>%
  left_join(airlines_tbl, by = c("uniquecarrier" = "code")) %>%
  mutate(gain = depdelay - arrdelay) %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier, description, gain, origin, dest)

# Summarize data by carrier
carrier <- sdf_predict(ml1, data_2008) %>%
  group_by(description) %>%
  summarize(gain = mean(gain), prediction = mean(prediction), freq = n()) %>%
  filter(freq > 10000) %>%
  collect

# Plot actual gains and predicted gains by airline carrier
ggplot(carrier, aes(gain, prediction)) +
  geom_point(alpha = 0.75, color = 'red', shape = 3) +
  geom_abline(intercept = 0, slope = 1, alpha = 0.15, color = 'blue') +
  geom_text(aes(label = substr(description, 1, 20)), size = 3, alpha = 0.75, vjust = -1) +
  labs(title = 'Average Gains Forecast', x = 'Actual', y = 'Predicted')

Some carriers make up more time than others in flight, but the differences are relatively small. The average time gained by the best and worst airlines differs by only six minutes. The best predictor of time gained is not carrier but flight distance: the biggest gains were associated with the longest flights.

Share Insights

This simple linear model contains a wealth of detailed information about carriers, distances traveled, and flight delays. These detailed insights can be conveyed to a non-technical audience via an interactive flexdashboard.

Build dashboard

Aggregate the scored data by origin, destination, and airline. Save the aggregated data.

# Summarize by origin, destination, and carrier
summary_2008 <- sdf_predict(ml1, data_2008) %>%
  rename(carrier = uniquecarrier, airline = description) %>%
  group_by(origin, dest, carrier, airline) %>%
  summarize(
    flights = n(),
    distance = mean(distance),
    avg_dep_delay = mean(depdelay),
    avg_arr_delay = mean(arrdelay),
    avg_gain = mean(gain),
    pred_gain = mean(prediction)
  )

# Collect and save objects
pred_data <- collect(summary_2008)
airports <- collect(select(airports_tbl, name, faa, lat, lon))
ml1_summary <- capture.output(summary(ml1))
save(pred_data, airports, ml1_summary, file = 'flights_pred_2008.RData')

Publish dashboard

Use the saved data to build an R Markdown flexdashboard. Publish the flexdashboard to Shiny Server, Shinyapps.io or RStudio Connect.

Appendix A - Data files

Run the following script to download data from the web onto your master node. Download the yearly flight data and the airlines lookup table.

# Make download directory
mkdir /tmp/flights

# Download flight data by year
for i in {2006..2008}
do
  echo "$(date) $i Download"
  fnam=$i.csv.bz2
  wget -O /tmp/flights/$fnam http://stat-computing.org/dataexpo/2009/$fnam
  echo "$(date) $i Unzip"
  bunzip2 /tmp/flights/$fnam
done

# Download airline carrier data
wget -O /tmp/airlines.csv http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS

# Download airports data
wget -O /tmp/airports.csv https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

Hive tables

We used the Hue interface, logged in as ‘admin’ to load the data into HDFS and then into Hive.

CREATE EXTERNAL TABLE IF NOT EXISTS flights (
  year int, month int, dayofmonth int, dayofweek int, deptime int, crsdeptime int,
  arrtime int, crsarrtime int, uniquecarrier string, flightnum int, tailnum string,
  actualelapsedtime int, crselapsedtime int, airtime string, arrdelay int, depdelay int,
  origin string, dest string, distance int, taxiin string, taxiout string, cancelled int,
  cancellationcode string, diverted int, carrierdelay string, weatherdelay string,
  nasdelay string, securitydelay string, lateaircraftdelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");

LOAD DATA INPATH '/user/admin/flights/2006.csv/' INTO TABLE flights;
LOAD DATA INPATH '/user/admin/flights/2007.csv/' INTO TABLE flights;
LOAD DATA INPATH '/user/admin/flights/2008.csv/' INTO TABLE flights;

# Create metadata for airlines
CREATE EXTERNAL TABLE IF NOT EXISTS airlines (
  Code string, Description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = '\,',
  "quoteChar" = '\"'
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");

LOAD DATA INPATH '/user/admin/L_UNIQUE_CARRIERS.csv' INTO TABLE airlines;

CREATE EXTERNAL TABLE IF NOT EXISTS airports (
  id string, name string, city string, country string, faa string, icao string,
  lat double, lon double, alt int, tz_offset double, dst string, tz_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = '\,',
  "quoteChar" = '\"'
)
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/admin/airports.dat' INTO TABLE airports;

Using sparklyr with Databricks

Overview

This documentation demonstrates how to use sparklyr with Apache Spark in Databricks along with RStudio Team, RStudio Server Pro, RStudio Connect, and RStudio Package Manager.

Using RStudio Team with Databricks

RStudio Team is a bundle of our popular professional software for developing data science projects, publishing data products, and managing packages. RStudio Team and sparklyr can be used with Databricks to work with large datasets and distributed computations with Apache Spark. The most common use case is to perform interactive analysis and exploratory development with RStudio Server Pro and sparklyr; write out the results to a database, file system, or cloud storage; then publish apps, reports, and APIs to RStudio Connect that query and access the results. The sections below describe best practices and different options for configuring specific RStudio products to work with Databricks.

Best practices for working with Databricks

Maintain separate installation environments - Install RStudio Server Pro, RStudio Connect, and RStudio Package Manager outside of the Databricks cluster so that they are not limited to the compute resources or ephemeral nature of Databricks clusters.

Connect to Databricks remotely - Work with Databricks as a remote compute resource, similar to how you would connect remotely to external databases, data sources, and storage systems. This can be accomplished using Databricks Connect (as described in the Connecting to Databricks remotely section below) or by performing SQL queries with JDBC/ODBC using the Databricks Spark SQL Driver on AWS or Azure.

Restrict workloads to interactive analysis - Only perform workloads related to exploratory or interactive analysis with Spark, then write the results to a database, file system, or cloud storage for more efficient retrieval in apps, reports, and APIs.

Load and query results efficiently - Because of the nature of Spark computations and the associated overhead, Shiny apps that use Spark on the backend tend to have performance and runtime issues; consider reading the results from a database, file system, or cloud storage instead (a short sketch of this pattern follows this list).
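The following is a minimal sketch of the write-out-then-read-back pattern described above. It assumes an existing Spark connection sc, a small aggregated Spark DataFrame named summary_tbl, and a placeholder output path; adapt the names and path to your environment.

library(sparklyr)
library(dplyr)

# Write the small, aggregated result set out of Spark (path is a placeholder)
spark_write_parquet(summary_tbl, path = "/srv/data/flights_summary", mode = "overwrite")

# Later, in a Shiny app, report, or API, read the saved results without a Spark connection
results <- arrow::open_dataset("/srv/data/flights_summary") %>% collect()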

Using RStudio Server Pro with Databricks

There are two options for using sparklyr and RStudio Server Pro with Databricks:

Option 1: Connecting to Databricks remotely (Recommended Option)
Option 2: Working inside of Databricks (Alternative Option)

Option 1 - Connecting to Databricks remotely

With this configuration, RStudio Server Pro is installed outside of the Spark cluster and allows users to connect to Spark remotely using sparklyr with Databricks Connect. This is the recommended configuration because it targets separate environments, involves a typical configuration process, avoids resource contention, and allows RStudio Server Pro to connect to Databricks as well as other remote storage and compute resources.
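A minimal sketch of what an Option 1 connection can look like with sparklyr and Databricks Connect is shown below. It assumes the databricks-connect client has already been installed and configured against your cluster (for example via pip install databricks-connect and databricks-connect configure), which is not covered here.

library(sparklyr)

# Ask the Databricks Connect client for its local Spark installation
spark_home <- system2("databricks-connect", "get-spark-home", stdout = TRUE)

# Connect to the remote Databricks cluster through Databricks Connect
sc <- spark_connect(method = "databricks", spark_home = spark_home)

# Tables registered in Databricks are now available through dplyr/DBI as usual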

View steps for connecting to Databricks remotely

Option 2 - Working inside of Databricks

If you cannot work with Spark remotely, you should install RStudio Server Pro on the Driver node of a long-running, persistent Databricks cluster as opposed to a worker node or an ephemeral cluster. With this configuration, RStudio Server Pro is installed on the Spark driver node and allows users to connect to Spark locally using sparklyr. This configuration can result in increased complexity, limited connectivity to other storage and compute resources, resource contention between RStudio Server Pro and Databricks, and maintenance concerns due to the ephemeral nature of Databricks clusters.

View steps for working inside of Databricks

Using RStudio Connect with Databricks

The server environment within Databricks clusters is not permissive enough to support RStudio Connect or the process sandboxing mechanisms that it uses to isolate published content. Therefore, the only supported configuration is to install RStudio Connect outside of the Databricks cluster and connect to Databricks remotely. Whether RStudio Server Pro is installed outside of the Databricks cluster (Recommended Option) or within the Databricks cluster (Alternative Option), you can publish content to RStudio Connect as long as HTTP/HTTPS network traffic is allowed from RStudio Server Pro to RStudio Connect. There are two options for using RStudio Connect with Databricks:
Performing SQL queries with JDBC/ODBC using the Databricks Spark SQL Driver on AWS or Azure (Recommended Option; a short sketch follows below)
Adding calls in your R code to create and run Databricks jobs with bricksteR and the Databricks Jobs API (Alternative Option)
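As a hedged sketch of the first option, content published to RStudio Connect can query Databricks through an ODBC data source configured with the Databricks Spark SQL driver. The DSN name and SQL below are placeholders, and the driver configuration itself happens outside of R.

library(DBI)

# Connect through a pre-configured ODBC data source (DSN name is a placeholder)
con <- dbConnect(odbc::odbc(), dsn = "Databricks")

# Run a SQL query against a table registered in Databricks
carrier_counts <- dbGetQuery(con, "SELECT uniquecarrier, COUNT(*) AS n FROM flights GROUP BY uniquecarrier")

dbDisconnect(con)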

Using RStudio Package Manager with Databricks

Whether RStudio Server Pro is installed outside of the Databricks cluster (Recommended Option) or within the Databricks cluster (Alternative Option), you can install packages from repositories in RStudio Package Manager as long as HTTP/HTTPS network traffic is allowed from RStudio Server Pro to RStudio Package Manager.
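As a minimal sketch, pointing an R session at RStudio Package Manager only requires setting the repos option; the URL below is a placeholder for your own RStudio Package Manager repository.

# Hypothetical RStudio Package Manager repository URL
options(repos = c(CRAN = "https://packagemanager.example.com/cran/latest"))

# Packages now install from RStudio Package Manager
install.packages("sparklyr")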

Development

Function Reference

Spark Operations

spark_config() Read Spark Configuration
spark_connect() spark_connection_is_open() spark_disconnect() spark_disconnect_all() spark_submit() Manage Spark Connections
spark_install_find() spark_install() spark_uninstall() spark_install_dir() spark_install_tar() spark_installed_versions() spark_available_versions() Find a given Spark installation by version.
spark_log() View Entries in the Spark Log
spark_web() Open the Spark web interface
connection_is_open() Check whether the connection is open
connection_spark_shinyapp() A Shiny app that can be used to construct a spark_connect statement
spark_session_config() Runtime configuration interface for the Spark Session
spark_set_checkpoint_dir() spark_get_checkpoint_dir() Set/Get Spark checkpoint directory
spark_table_name() Generate a Table Name from Expression
spark_version_from_home() Get the Spark Version Associated with a Spark Installation
spark_versions() Retrieves a data frame of available Spark versions that can be installed.
spark_config_kubernetes() Kubernetes Configuration
spark_config_settings() Retrieve Available Settings
spark_connection_find() Find Spark Connection
spark_dependency_fallback() Fallback to Spark Dependency
spark_extension() Create Spark Extension
spark_load_table() Reads from a Spark Table into a Spark DataFrame.
spark_read_libsvm() Read libsvm file into a Spark DataFrame.

Read/Write data

spark_read_csv() Read a CSV file into a Spark DataFrame
spark_read_delta() Read from Delta Lake into a Spark DataFrame.
spark_read_jdbc() Read from JDBC connection into a Spark DataFrame.
spark_read_json() Read a JSON file into a Spark DataFrame
spark_read_libsvm() Read libsvm file into a Spark DataFrame.
spark_read_orc() Read a ORC file into a Spark DataFrame
spark_read_parquet() Read a Parquet file into a Spark DataFrame
spark_read_source() Read from a generic source into a Spark DataFrame.
spark_read_table() Reads from a Spark Table into a Spark DataFrame.
spark_read_text() Read a Text file into a Spark DataFrame
spark_write_csv() Write a Spark DataFrame to a CSV
spark_write_delta() Writes a Spark DataFrame into Delta Lake
spark_write_jdbc() Writes a Spark DataFrame into a JDBC table
spark_write_json() Write a Spark DataFrame to a JSON file
spark_write_orc() Write a Spark DataFrame to a ORC file
spark_write_parquet() Write a Spark DataFrame to a Parquet file
spark_write_source() Writes a Spark DataFrame into a generic source
spark_write_table() Writes a Spark DataFrame into a Spark table
spark_write_text() Write a Spark DataFrame to a Text file

Spark Tables

sdf_save_table() sdf_load_table() sdf_save_parquet() sdf_load_parquet() Save / Load a Spark DataFrame
sdf_predict() sdf_transform() sdf_fit() sdf_fit_and_transform() Spark ML -- Transform, fit, and predict methods (sdf_ interface)
sdf_along() Create DataFrame for along Object
sdf_bind_rows() sdf_bind_cols() Bind multiple Spark DataFrames by row and column
sdf_broadcast() Broadcast hint
sdf_checkpoint() Checkpoint a Spark DataFrame
sdf_coalesce() Coalesces a Spark DataFrame
sdf_collect() Collect a Spark DataFrame into R.
sdf_copy_to() sdf_import() Copy an Object into Spark
sdf_crosstab() Cross Tabulation
sdf_debug_string() Debug Info for Spark DataFrame
sdf_describe() Compute summary statistics for columns of a data frame
sdf_dim() sdf_nrow() sdf_ncol() Support for Dimension Operations
sdf_is_streaming() Spark DataFrame is Streaming
sdf_last_index() Returns the last index of a Spark DataFrame
sdf_len() Create DataFrame for Length
sdf_num_partitions() Gets number of partitions of a Spark DataFrame
sdf_persist() Persist a Spark DataFrame
sdf_pivot() Pivot a Spark DataFrame
sdf_project() Project features onto principal components
sdf_quantile() Compute (Approximate) Quantiles with a Spark DataFrame
sdf_random_split() sdf_partition() Partition a Spark Dataframe
sdf_read_column() Read a Column from a Spark DataFrame
sdf_register() Register a Spark DataFrame
sdf_repartition() Repartition a Spark DataFrame
sdf_residuals() Model Residuals
sdf_sample() Randomly Sample Rows from a Spark DataFrame
sdf_schema() Read the Schema of a Spark DataFrame
sdf_separate_column() Separate a Vector Column into Scalar Columns
sdf_seq() Create DataFrame for Range
sdf_sort() Sort a Spark DataFrame
sdf_sql() Spark DataFrame from SQL
sdf_with_sequential_id() Add a Sequential ID Column to a Spark DataFrame
sdf_with_unique_id() Add a Unique ID Column to a Spark DataFrame

Spark Machine Learning

ml_decision_tree_classifier() ml_decision_tree() ml_decision_tree_regressor() Spark ML -- Decision Trees
ml_generalized_linear_regression() Spark ML -- Generalized Linear Regression
ml_gbt_classifier() ml_gradient_boosted_trees() ml_gbt_regressor() Spark ML -- Gradient Boosted Trees
ml_kmeans() ml_compute_cost() Spark ML -- K-Means Clustering
ml_lda() ml_describe_topics() ml_log_likelihood() ml_log_perplexity() ml_topics_matrix() Spark ML -- Latent Dirichlet Allocation
ml_linear_regression() Spark ML -- Linear Regression
ml_logistic_regression() Spark ML -- Logistic Regression
ml_model_data() Extracts data associated with a Spark ML model
ml_multilayer_perceptron_classifier() ml_multilayer_perceptron() Spark ML -- Multilayer Perceptron
ml_naive_bayes() Spark ML -- Naive-Bayes
ml_one_vs_rest() Spark ML -- OneVsRest
ft_pca() ml_pca() Feature Transformation -- PCA (Estimator)
ml_random_forest_classifier() ml_random_forest() ml_random_forest_regressor() Spark ML -- Random Forest
ml_aft_survival_regression() ml_survival_regression() Spark ML -- Survival Regression
ml_add_stage() Add a Stage to a Pipeline
ml_als() ml_recommend() Spark ML -- ALS
ml_approx_nearest_neighbors() ml_approx_similarity_join() Utility functions for LSH models
ml_fpgrowth() ml_association_rules() ml_freq_itemsets() Frequent Pattern Mining -- FPGrowth
ml_binary_classification_evaluator() ml_binary_classification_eval() ml_multiclass_classification_evaluator() ml_classification_eval() ml_regression_evaluator() Spark ML - Evaluators
ml_bisecting_kmeans() Spark ML -- Bisecting K-Means Clustering
ml_call_constructor() Wrap a Spark ML JVM object
ml_chisquare_test() Chi-square hypothesis testing for categorical data.
ml_clustering_evaluator() Spark ML - Clustering Evaluator
new_ml_model_prediction() new_ml_model() new_ml_model_classification() new_ml_model_regression() new_ml_model_clustering() ml_supervised_pipeline() ml_clustering_pipeline() ml_construct_model_supervised() ml_construct_model_clustering() Constructors for `ml_model` Objects
ml_corr() Compute correlation matrix
ml_sub_models() ml_validation_metrics() ml_cross_validator() ml_train_validation_split() Spark ML -- Tuning
ml_default_stop_words() Default stop words
ml_evaluate() Evaluate the Model on a Validation Set
ml_feature_importances() ml_tree_feature_importance() Spark ML - Feature Importance for Tree Models
ft_word2vec() ml_find_synonyms() Feature Transformation -- Word2Vec (Estimator)
is_ml_transformer() is_ml_estimator() ml_fit() ml_transform() ml_fit_and_transform() ml_predict() Spark ML -- Transform, fit, and predict methods (ml_ interface)
ml_gaussian_mixture() Spark ML -- Gaussian Mixture clustering.
ml_is_set() ml_param_map() ml_param() ml_params() Spark ML -- ML Params
ml_isotonic_regression() Spark ML -- Isotonic Regression
ft_string_indexer() ml_labels() ft_string_indexer_model() Feature Transformation -- StringIndexer (Estimator)
ml_linear_svc() Spark ML -- LinearSVC
ml_save() ml_load() Spark ML -- Model Persistence
ml_pipeline() Spark ML -- Pipelines
ml_stage() ml_stages() Spark ML -- Pipeline stage extraction
ml_standardize_formula() Standardize Formula Input for `ml_model`
ml_summary() Spark ML -- Extraction of summary metrics
ml_uid() Spark ML -- UID
ft_count_vectorizer() ml_vocabulary() Feature Transformation -- CountVectorizer (Estimator)

Spark Feature Transformers

ft_binarizer() Feature Transformation -- Binarizer (Transformer)
ft_bucketizer() Feature Transformation -- Bucketizer (Transformer)
ft_chisq_selector() Feature Transformation -- ChiSqSelector (Estimator)
ft_count_vectorizer() ml_vocabulary() Feature Transformation -- CountVectorizer (Estimator)
ft_dct() ft_discrete_cosine_transform() Feature Transformation -- Discrete Cosine Transform (DCT) (Transformer)
ft_elementwise_product() Feature Transformation -- ElementwiseProduct (Transformer)
ft_feature_hasher() Feature Transformation -- FeatureHasher (Transformer)
ft_hashing_tf() Feature Transformation -- HashingTF (Transformer)
ft_idf() Feature Transformation -- IDF (Estimator)
ft_imputer() Feature Transformation -- Imputer (Estimator)
ft_index_to_string() Feature Transformation -- IndexToString (Transformer)
ft_interaction() Feature Transformation -- Interaction (Transformer)
ft_bucketed_random_projection_lsh() ft_minhash_lsh() Feature Transformation -- LSH (Estimator)
ml_approx_nearest_neighbors() ml_approx_similarity_join() Utility functions for LSH models
ft_max_abs_scaler() Feature Transformation -- MaxAbsScaler (Estimator)
ft_min_max_scaler() Feature Transformation -- MinMaxScaler (Estimator)
ft_ngram() Feature Transformation -- NGram (Transformer)
ft_normalizer() Feature Transformation -- Normalizer (Transformer)
ft_one_hot_encoder() Feature Transformation -- OneHotEncoder (Transformer)
ft_one_hot_encoder_estimator() Feature Transformation -- OneHotEncoderEstimator (Estimator)
ft_pca() ml_pca() Feature Transformation -- PCA (Estimator)
ft_polynomial_expansion() Feature Transformation -- PolynomialExpansion (Transformer)
ft_quantile_discretizer() Feature Transformation -- QuantileDiscretizer (Estimator)
ft_r_formula() Feature Transformation -- RFormula (Estimator)
ft_regex_tokenizer() Feature Transformation -- RegexTokenizer (Transformer)
ft_standard_scaler() Feature Transformation -- StandardScaler (Estimator)
ft_stop_words_remover() Feature Transformation -- StopWordsRemover (Transformer)
ft_string_indexer() ml_labels() ft_string_indexer_model() Feature Transformation -- StringIndexer (Estimator)
ft_tokenizer() Feature Transformation -- Tokenizer (Transformer)
ft_vector_assembler() Feature Transformation -- VectorAssembler (Transformer)
ft_vector_indexer() Feature Transformation -- VectorIndexer (Estimator)
ft_vector_slicer() Feature Transformation -- VectorSlicer (Transformer)
ft_word2vec() ml_find_synonyms() Feature Transformation -- Word2Vec (Estimator)
ft_sql_transformer() ft_dplyr_transformer() Feature Transformation -- SQLTransformer

Spark Machine Learning Utilities

ml_binary_classification_evaluator() ml_binary_classification_eval() ml_multiclass_classification_evaluator() ml_classification_eval() ml_regression_evaluator() Spark ML - Evaluators
ml_feature_importances() ml_tree_feature_importance() Spark ML - Feature Importance for Tree Models

Extensions

compile_package_jars() Compile Scala sources into a Java Archive (jar)
connection_config() Read configuration values for a connection
download_scalac() Downloads default Scala Compilers
find_scalac() Discover the Scala Compiler
spark_context() java_context() hive_context() spark_session() Access the Spark API
hive_context_config() Runtime configuration interface for Hive
invoke() invoke_static() invoke_new() Invoke a Method on a JVM Object
register_extension() registered_extensions() Register a Package that Implements a Spark Extension
spark_compilation_spec() Define a Spark Compilation Specification
spark_default_compilation_spec() Default Compilation Specification for Spark Extensions
spark_connection() Retrieve the Spark Connection Associated with an R Object
spark_context_config() Runtime configuration interface for the Spark Context.
spark_dataframe() Retrieve a Spark DataFrame
spark_dependency() Define a Spark dependency
spark_home_set() Set the SPARK_HOME environment variable
spark_jobj() Retrieve a Spark JVM Object Reference
spark_version() Get the Spark Version Associated with a Spark Connection

Distributed Computing

spark_apply() Apply an R Function in Spark
spark_apply_bundle() Create Bundle for Spark Apply
spark_apply_log() Log Writer for Spark Apply

Livy

livy_config() Create a Spark Configuration for Livy
livy_service_start() livy_service_stop() Start Livy

Streaming

stream_find() Find Stream
stream_generate_test() Generate Test Stream
stream_id() Spark Stream's Identifier
stream_name() Spark Stream's Name
stream_read_csv() Read CSV Stream
stream_read_json() Read JSON Stream
stream_read_kafka() Read Kafka Stream
stream_read_orc() Read ORC Stream
stream_read_parquet() Read Parquet Stream
stream_read_socket() Read Socket Stream
stream_read_text() Read Text Stream
stream_render() Render Stream
stream_stats() Stream Statistics
stream_stop() Stops a Spark Stream
stream_trigger_continuous() Spark Stream Continuous Trigger
stream_trigger_interval() Spark Stream Interval Trigger
stream_view() View Stream
stream_watermark() Watermark Stream
stream_write_console() Write Console Stream
stream_write_csv() Write CSV Stream
stream_write_json() Write JSON Stream
stream_write_kafka() Write Kafka Stream
stream_write_memory() Write Memory Stream
stream_write_orc() Write a ORC Stream
stream_write_parquet() Write Parquet Stream
stream_write_text() Write Text Stream
reactiveSpark() Reactive spark reader

Function Reference - version 1.04

Spark Operations

spark_config() Read Spark Configuration
spark_connect() spark_connection_is_open() spark_disconnect() spark_disconnect_all() spark_submit() Manage Spark Connections
spark_install_find() spark_install() spark_uninstall() spark_install_dir() spark_install_tar() spark_installed_versions() spark_available_versions() Find a given Spark installation by version.
spark_log() View Entries in the Spark Log
spark_web() Open the Spark web interface
connection_is_open() Check whether the connection is open
connection_spark_shinyapp() A Shiny app that can be used to construct a spark_connect statement
spark_session_config() Runtime configuration interface for the Spark Session
spark_set_checkpoint_dir() spark_get_checkpoint_dir() Set/Get Spark checkpoint directory
spark_table_name() Generate a Table Name from Expression
spark_version_from_home() Get the Spark Version Associated with a Spark Installation
spark_versions() Retrieves a data frame of available Spark versions that can be installed.
spark_config_kubernetes() Kubernetes Configuration
spark_config_settings() Retrieve Available Settings
spark_connection_find() Find Spark Connection
spark_dependency_fallback() Fallback to Spark Dependency
spark_extension() Create Spark Extension
spark_load_table() Reads from a Spark Table into a Spark DataFrame.
spark_read_libsvm() Read libsvm file into a Spark DataFrame.

Spark Data

spark_read_csv() Read a CSV file into a Spark DataFrame
spark_read_jdbc() Read from JDBC connection into a Spark DataFrame.
spark_read_json() Read a JSON file into a Spark DataFrame
spark_read_parquet() Read a Parquet file into a Spark DataFrame
spark_read_source() Read from a generic source into a Spark DataFrame.
spark_read_table() Reads from a Spark Table into a Spark DataFrame.
spark_read_orc() Read a ORC file into a Spark DataFrame
spark_read_text() Read a Text file into a Spark DataFrame
spark_save_table() Saves a Spark DataFrame as a Spark table
spark_write_orc() Write a Spark DataFrame to a ORC file
spark_write_text() Write a Spark DataFrame to a Text file
spark_write_csv() Write a Spark DataFrame to a CSV
spark_write_jdbc() Writes a Spark DataFrame into a JDBC table
spark_write_json() Write a Spark DataFrame to a JSON file
spark_write_parquet() Write a Spark DataFrame to a Parquet file
spark_write_source() Writes a Spark DataFrame into a generic source
spark_write_table() Writes a Spark DataFrame into a Spark table

Spark Tables

src_databases() Show database list
tbl_cache() Cache a Spark Table
tbl_change_db() Use specific database
tbl_uncache() Uncache a Spark Table

Spark DataFrames

sdf_along() Create DataFrame for along Object
sdf_bind_rows() sdf_bind_cols() Bind multiple Spark DataFrames by row and column
sdf_broadcast() Broadcast hint
sdf_checkpoint() Checkpoint a Spark DataFrame
sdf_coalesce() Coalesces a Spark DataFrame
sdf_copy_to() sdf_import() Copy an Object into Spark
sdf_len() Create DataFrame for Length
sdf_num_partitions() Gets number of partitions of a Spark DataFrame
sdf_random_split() sdf_partition() Partition a Spark Dataframe
sdf_pivot() Pivot a Spark DataFrame
sdf_predict() sdf_transform() sdf_fit() sdf_fit_and_transform() Spark ML -- Transform, fit, and predict methods (sdf_ interface)
sdf_read_column() Read a Column from a Spark DataFrame
sdf_register() Register a Spark DataFrame
sdf_repartition() Repartition a Spark DataFrame
sdf_residuals() Model Residuals
sdf_sample() Randomly Sample Rows from a Spark DataFrame
sdf_separate_column() Separate a Vector Column into Scalar Columns
sdf_seq() Create DataFrame for Range
sdf_sort() Sort a Spark DataFrame
sdf_with_unique_id() Add a Unique ID Column to a Spark DataFrame
sdf_collect() Collect a Spark DataFrame into R.
sdf_crosstab() Cross Tabulation
sdf_debug_string() Debug Info for Spark DataFrame
sdf_describe() Compute summary statistics for columns of a data frame
sdf_dim() sdf_nrow() sdf_ncol() Support for Dimension Operations
sdf_is_streaming() Spark DataFrame is Streaming
sdf_last_index() Returns the last index of a Spark DataFrame
sdf_save_table() sdf_load_table() sdf_save_parquet() sdf_load_parquet() Save / Load a Spark DataFrame
sdf_persist() Persist a Spark DataFrame
sdf_project() Project features onto principal components
sdf_quantile() Compute (Approximate) Quantiles with a Spark DataFrame
sdf_schema() Read the Schema of a Spark DataFrame
sdf_sql() Spark DataFrame from SQL
sdf_with_sequential_id() Add a Sequential ID Column to a Spark DataFrame

Spark Machine Learning

ml_decision_tree_classifier() ml_decision_tree() ml_decision_tree_regressor() Spark ML -- Decision Trees
ml_generalized_linear_regression() Spark ML -- Generalized Linear Regression
ml_gbt_classifier() ml_gradient_boosted_trees() ml_gbt_regressor() Spark ML -- Gradient Boosted Trees
ml_kmeans() ml_compute_cost() Spark ML -- K-Means Clustering
ml_lda() ml_describe_topics() ml_log_likelihood() ml_log_perplexity() ml_topics_matrix() Spark ML -- Latent Dirichlet Allocation
ml_linear_regression() Spark ML -- Linear Regression
ml_logistic_regression() Spark ML -- Logistic Regression
ml_model_data() Extracts data associated with a Spark ML model
ml_multilayer_perceptron_classifier() ml_multilayer_perceptron() Spark ML -- Multilayer Perceptron
ml_naive_bayes() Spark ML -- Naive-Bayes
ml_one_vs_rest() Spark ML -- OneVsRest
ft_pca() ml_pca() Feature Transformation -- PCA (Estimator)
ml_random_forest_classifier() ml_random_forest() ml_random_forest_regressor() Spark ML -- Random Forest
ml_aft_survival_regression() ml_survival_regression() Spark ML -- Survival Regression
ml_add_stage() Add a Stage to a Pipeline
ml_als() ml_recommend() Spark ML -- ALS
ml_approx_nearest_neighbors() ml_approx_similarity_join() Utility functions for LSH models
ml_fpgrowth() ml_association_rules() ml_freq_itemsets() Frequent Pattern Mining -- FPGrowth
ml_binary_classification_evaluator() ml_binary_classification_eval() ml_multiclass_classification_evaluator() ml_classification_eval() ml_regression_evaluator() Spark ML - Evaluators
ml_bisecting_kmeans() Spark ML -- Bisecting K-Means Clustering
ml_call_constructor() Wrap a Spark ML JVM object
ml_chisquare_test() Chi-square hypothesis testing for categorical data.
ml_clustering_evaluator() Spark ML - Clustering Evaluator
new_ml_model_prediction() new_ml_model() new_ml_model_classification() new_ml_model_regression() new_ml_model_clustering() ml_supervised_pipeline() ml_clustering_pipeline() ml_construct_model_supervised() ml_construct_model_clustering() Constructors for `ml_model` Objects
ml_corr() Compute correlation matrix
ml_sub_models() ml_validation_metrics() ml_cross_validator() ml_train_validation_split() Spark ML -- Tuning
ml_default_stop_words() Default stop words
ml_evaluate() Evaluate the Model on a Validation Set
ml_feature_importances() ml_tree_feature_importance() Spark ML - Feature Importance for Tree Models
ft_word2vec() ml_find_synonyms() Feature Transformation -- Word2Vec (Estimator)
is_ml_transformer() is_ml_estimator() ml_fit() ml_transform() ml_fit_and_transform() ml_predict() Spark ML -- Transform, fit, and predict methods (ml_ interface)
ml_gaussian_mixture() Spark ML -- Gaussian Mixture clustering.
ml_is_set() ml_param_map() ml_param() ml_params() Spark ML -- ML Params
ml_isotonic_regression() Spark ML -- Isotonic Regression
ft_string_indexer() ml_labels() ft_string_indexer_model() Feature Transformation -- StringIndexer (Estimator)
ml_linear_svc() Spark ML -- LinearSVC
ml_save() ml_load() Spark ML -- Model Persistence
ml_pipeline() Spark ML -- Pipelines
ml_stage() ml_stages() Spark ML -- Pipeline stage extraction
ml_standardize_formula() Standardize Formula Input for `ml_model`
ml_summary() Spark ML -- Extraction of summary metrics
ml_uid() Spark ML -- UID
ft_count_vectorizer() ml_vocabulary() Feature Transformation -- CountVectorizer (Estimator)

Spark Feature Transformers

ft_binarizer() Feature Transformation -- Binarizer (Transformer)
ft_bucketizer() Feature Transformation -- Bucketizer (Transformer)
ft_count_vectorizer() ml_vocabulary() Feature Transformation -- CountVectorizer (Estimator)
ft_dct() ft_discrete_cosine_transform() Feature Transformation -- Discrete Cosine Transform (DCT) (Transformer)
ft_elementwise_product() Feature Transformation -- ElementwiseProduct (Transformer)
ft_index_to_string() Feature Transformation -- IndexToString (Transformer)
ft_one_hot_encoder() Feature Transformation -- OneHotEncoder (Transformer)
ft_quantile_discretizer() Feature Transformation -- QuantileDiscretizer (Estimator)
ft_sql_transformer() ft_dplyr_transformer() Feature Transformation -- SQLTransformer
ft_string_indexer() ml_labels() ft_string_indexer_model() Feature Transformation -- StringIndexer (Estimator)
ft_vector_assembler() Feature Transformation -- VectorAssembler (Transformer)
ft_tokenizer() Feature Transformation -- Tokenizer (Transformer)
ft_regex_tokenizer() Feature Transformation -- RegexTokenizer (Transformer)
ft_bucketed_random_projection_lsh() ft_minhash_lsh() Feature Transformation -- LSH (Estimator)
ft_chisq_selector() Feature Transformation -- ChiSqSelector (Estimator)
ft_feature_hasher() Feature Transformation -- FeatureHasher (Transformer)
ft_hashing_tf() Feature Transformation -- HashingTF (Transformer)
ft_idf() Feature Transformation -- IDF (Estimator)
ft_imputer() Feature Transformation -- Imputer (Estimator)
ft_interaction() Feature Transformation -- Interaction (Transformer)
ft_max_abs_scaler() Feature Transformation -- MaxAbsScaler (Estimator)
ft_min_max_scaler() Feature Transformation -- MinMaxScaler (Estimator)
ft_ngram() Feature Transformation -- NGram (Transformer)
ft_normalizer() Feature Transformation -- Normalizer (Transformer)
ft_one_hot_encoder_estimator() Feature Transformation -- OneHotEncoderEstimator (Estimator)
ft_pca() ml_pca() Feature Transformation -- PCA (Estimator)
ft_polynomial_expansion() Feature Transformation -- PolynomialExpansion (Transformer)
ft_r_formula() Feature Transformation -- RFormula (Estimator)
ft_standard_scaler() Feature Transformation -- StandardScaler (Estimator)
ft_stop_words_remover() Feature Transformation -- StopWordsRemover (Transformer)
ft_vector_indexer() Feature Transformation -- VectorIndexer (Estimator)
ft_vector_slicer() Feature Transformation -- VectorSlicer (Transformer)
ft_word2vec() ml_find_synonyms() Feature Transformation -- Word2Vec (Estimator)

Spark Machine Learning Utilities

ml_binary_classification_evaluator() ml_binary_classification_eval() ml_multiclass_classification_evaluator() ml_classification_eval() ml_regression_evaluator() Spark ML - Evaluators
ml_feature_importances() ml_tree_feature_importance() Spark ML - Feature Importance for Tree Models

Extensions

compile_package_jars() Compile Scala sources into a Java Archive (jar)
connection_config() Read configuration values for a connection
download_scalac() Downloads default Scala Compilers
find_scalac() Discover the Scala Compiler
spark_context() java_context() hive_context() spark_session() Access the Spark API
hive_context_config() Runtime configuration interface for Hive
invoke() invoke_static() invoke_new() Invoke a Method on a JVM Object
register_extension() registered_extensions() Register a Package that Implements a Spark Extension
spark_compilation_spec() Define a Spark Compilation Specification
spark_default_compilation_spec() Default Compilation Specification for Spark Extensions
spark_connection() Retrieve the Spark Connection Associated with an R Object
spark_context_config() Runtime configuration interface for the Spark Context.
spark_dataframe() Retrieve a Spark DataFrame
spark_dependency() Define a Spark dependency
spark_home_set() Set the SPARK_HOME environment variable
spark_jobj() Retrieve a Spark JVM Object Reference
spark_version() Get the Spark Version Associated with a Spark Connection

Distributed Computing

spark_apply() Apply an R Function in Spark
spark_apply_bundle() Create Bundle for Spark Apply
spark_apply_log() Log Writer for Spark Apply

Livy

livy_install() livy_available_versions() livy_install_dir() livy_installed_versions() livy_home_dir() Install Livy
livy_config() Create a Spark Configuration for Livy
livy_service_start() livy_service_stop() Start Livy

Streaming

stream_find() Find Stream
stream_generate_test() Generate Test Stream
stream_id() Spark Stream's Identifier
stream_name() Spark Stream's Name
stream_read_csv() Read CSV Stream
stream_read_json() Read JSON Stream
stream_read_kafka() Read Kafka Stream
stream_read_orc() Read ORC Stream
stream_read_parquet() Read Parquet Stream
stream_read_scoket() Read Socket Stream
stream_read_text() Read Text Stream
stream_render() Render Stream
stream_stats() Stream Statistics
stream_stop() Stops a Spark Stream
stream_trigger_continuous() Spark Stream Continuous Trigger
stream_trigger_interval() Spark Stream Interval Trigger
stream_view() View Stream
stream_watermark() Watermark Stream
stream_write_console() Write Console Stream
stream_write_csv() Write CSV Stream
stream_write_json() Write JSON Stream
stream_write_kafka() Write Kafka Stream
stream_write_memory() Write Memory Stream
stream_write_orc() Write an ORC Stream
stream_write_parquet() Write Parquet Stream
stream_write_text() Write Text Stream
reactiveSpark() Reactive spark reader

Read Spark Configuration

Arguments Value Details Read Spark Configuration spark_config(file = "config.yml", use_default = TRUE)

Arguments

file Name of the configuration file
use_default TRUE to use the built-in defaults provided in this package

Value

Named list with configuration data

Details

Read Spark configuration using the config package.
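For instance, a minimal sketch of pairing spark_config() with a configuration file; the config.yml contents shown in the comments are illustrative assumptions, not part of this reference:

library(sparklyr)

# Hypothetical config.yml in the project directory:
# default:
#   spark.executor.memory: 2G
#   spark.executor.cores: 2

conf <- spark_config(file = "config.yml", use_default = TRUE)
conf$spark.executor.instances <- 2   # values can still be added or overridden in R

sc <- spark_connect(master = "yarn-client", config = conf)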

Manage Spark Connections

Arguments Details Examples These routines allow you to manage your connections to Spark. spark_connect(master, spark_home = Sys.getenv("SPARK_HOME"), method = c("shell", "livy", "databricks", "test", "qubole"), app_name = "sparklyr", version = NULL, config = spark_config(), extensions = sparklyr::registered_extensions(), packages = NULL, ...) spark_connection_is_open(sc) spark_disconnect(sc, ...) spark_disconnect_all() spark_submit(master, file, spark_home = Sys.getenv("SPARK_HOME"), app_name = "sparklyr", version = NULL, config = spark_config(), extensions = sparklyr::registered_extensions(), ...)

Arguments

master Spark cluster url to connect to. Use "local" to connect to a local instance of Spark installed via spark_install.
spark_home The path to a Spark installation. Defaults to the path provided by the SPARK_HOME environment variable. If SPARK_HOME is defined, it will always be used unless the version parameter is specified to force the use of a locally installed version.
method The method used to connect to Spark. The default connection method is "shell", which connects using spark-submit; use "livy" to perform remote connections over HTTP, or "databricks" when using a Databricks cluster.
app_name The application name to be used while running in the Spark cluster.
version The version of Spark to use. Required for "local" Spark connections, optional otherwise.
config Custom configuration for the generated Spark connection. See spark_config for details.
extensions Extension R packages to enable for this connection. By default, all packages enabled through the use of sparklyr::register_extension will be passed here.
packages A list of Spark packages to load. For example, "delta" or "kafka" to enable Delta Lake or Kafka. Also supports full versions like "io.delta:delta-core_2.11:0.4.0". This is similar to adding packages into the sparklyr.shell.packages configuration option. Note that the version parameter is used to choose the correct package; otherwise, the latest version is assumed.
... Optional arguments; currently unused.
sc A spark_connection.
file Path to R source file to submit for batch execution.

Details

When using method = "livy", it is recommended to specify the version parameter to improve performance by using precompiled code rather than uploading sources. By default, jars are downloaded from GitHub, but the path to the correct sparklyr JAR can also be specified through the livy.jars setting.

Examples

sc <- spark_connect(master = "spark://HOST:PORT")
connection_is_open(sc)
#> [1] TRUE
spark_disconnect(sc)
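A hedged sketch of a remote connection through Livy, following the Details above; the endpoint, credentials, and Spark version are placeholders:

library(sparklyr)

sc <- spark_connect(
  master  = "http://livy-host:8998",   # hypothetical Livy endpoint
  method  = "livy",
  version = "2.4.3",                   # lets sparklyr use precompiled jars (see Details)
  config  = livy_config(username = "some_user", password = "some_password")
)

spark_disconnect(sc)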

Find a given Spark installation by version.

Arguments Value Install versions of Spark for use with local Spark connections (i.e. spark_connect(master = "local")). spark_install_find(version = NULL, hadoop_version = NULL, installed_only = TRUE, latest = FALSE, hint = FALSE) spark_install(version = NULL, hadoop_version = NULL, reset = TRUE, logging = "INFO", verbose = interactive()) spark_uninstall(version, hadoop_version) spark_install_dir() spark_install_tar(tarfile) spark_installed_versions() spark_available_versions(show_hadoop = FALSE, show_minor = FALSE)

Arguments

version Version of Spark to install. See spark_available_versions for a list of supported versions
hadoop_version Version of Hadoop to install. See spark_available_versions for a list of supported versions
installed_only Search only the locally installed versions?
latest Check for latest version?
hint On failure should the installation code be provided?
reset Attempts to reset settings to defaults.
logging Logging level to configure install. Supported options: "WARN", "INFO"
verbose Report information as Spark is downloaded / installed
tarfile Path to a TAR file conforming to the pattern spark-###-bin-(hadoop)?###, where ### references the Spark and Hadoop versions, respectively.
show_hadoop Show Hadoop distributions?
show_minor Show minor Spark versions?

Value

List with information about the installed version.
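A short sketch of a typical local installation workflow with these functions (the version number is illustrative):

library(sparklyr)

spark_available_versions()               # which Spark versions can be installed
spark_install(version = "2.4.3")         # download and install a local copy
spark_installed_versions()               # confirm what is now installed

info <- spark_install_find(version = "2.4.3")   # list describing the installed version
str(info)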

View Entries in the Spark Log

Arguments View the most recent entries in the Spark log. This can be useful when inspecting output / errors produced by Spark during the invocation of various commands. spark_log(sc, n = 100, filter = NULL, ...)

Arguments

sc A spark_connection.
n The max number of log entries to retrieve. Use NULL to retrieve all entries within the log.
filter Character string to filter log entries.
... Optional arguments; currently unused.
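For example, assuming sc is an open spark_connection:

spark_log(sc, n = 50)                      # the 50 most recent log entries
spark_log(sc, n = NULL, filter = "ERROR")  # all entries containing "ERROR"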

Open the Spark web interface

Arguments Open the Spark web interface spark_web(sc, ...)

Arguments

sc A spark_connection.
... Optional arguments; currently unused.

Check whether the connection is open

Arguments Check whether the connection is open connection_is_open(sc)

Arguments

sc spark_connection

A Shiny app that can be used to construct a spark_connect statement

A Shiny app that can be used to construct a spark_connect statement connection_spark_shinyapp()

Runtime configuration interface for the Spark Session

Arguments Retrieves or sets runtime configuration entries for the Spark Session spark_session_config(sc, config = TRUE, value = NULL)

Arguments

sc A spark_connection.
config The configuration entry name(s) (e.g., "spark.sql.shuffle.partitions"). Defaults to NULL to retrieve all configuration entries.
value The configuration value to be set. Defaults to NULL to retrieve configuration entries.
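A brief sketch, assuming sc is an open spark_connection; the configuration key is just an example:

spark_session_config(sc)                                      # retrieve all runtime entries
spark_session_config(sc, "spark.sql.shuffle.partitions")      # retrieve a single entry
spark_session_config(sc, "spark.sql.shuffle.partitions", 8L)  # set an entry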

Set/Get Spark checkpoint directory

Arguments Set/Get Spark checkpoint directory spark_set_checkpoint_dir(sc, dir) spark_get_checkpoint_dir(sc)

Arguments

sc A spark_connection.
dir The checkpoint directory; must be an HDFS path when running on a cluster.
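For example, assuming sc is an open connection (the directory below is an arbitrary choice):

spark_set_checkpoint_dir(sc, "/tmp/spark-checkpoints")
spark_get_checkpoint_dir(sc)

# A DataFrame's lineage can then be truncated with sdf_checkpoint():
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)
iris_chk <- sdf_checkpoint(iris_tbl, eager = TRUE)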

Generate a Table Name from Expression

Arguments Attempts to generate a table name from an expression; otherwise, assigns an auto-generated generic name with "sparklyr_" prefix. spark_table_name(expr)

Arguments

expr The expression to attempt to use as name

Get the Spark Version Associated with a Spark Installation

Arguments Retrieve the version of Spark associated with a Spark installation. spark_version_from_home(spark_home, default = NULL)

Arguments

spark_home The path to a Spark installation.
default The default version to be inferred, in case version lookup failed, e.g. no Spark installation was found at spark_home.

Retrieves a data frame of available Spark versions that can be installed.

Arguments Retrieves a data frame of available Spark versions that can be installed. spark_versions(latest = TRUE)

Arguments

latest Check for latest version?

Kubernetes Configuration

Arguments Convenience function to initialize a Kubernetes configuration instead of spark_config(), exposes common properties to set in Kubernetes clusters. spark_config_kubernetes(master, version = "2.3.2", image = "spark:sparklyr", driver = random_string("sparklyr-"), account = "spark", jars = "local:///opt/sparklyr", forward = TRUE, executors = NULL, conf = NULL, timeout = 120, ports = c(8880, 8881, 4040), fix_config = identical(.Platform$OS.type, "windows"), ...)

Arguments

master Kubernetes url to connect to, found by running kubectl cluster-info.
version The version of Spark being used.
image Container image to use to launch Spark and sparklyr. Also known as spark.kubernetes.container.image.
driver Name of the driver pod. If not set, the driver pod name is set to "sparklyr" suffixed by id to avoid name conflicts. Also known as spark.kubernetes.driver.pod.name.
account Service account that is used when running the driver pod. The driver pod uses this service account when requesting executor pods from the API server. Also known as spark.kubernetes.authenticate.driver.serviceAccountName.
jars Path to the sparklyr jars: either a local path inside the container image where the sparklyr jars were copied when the image was created, or a path accessible to the container where the sparklyr jars were copied. You can find the path to the sparklyr jars by running system.file("java/", package = "sparklyr").
forward Should ports used in sparklyr be forwarded automatically through Kubernetes? Defaults to TRUE, which runs kubectl port-forward and pkill kubectl on disconnection.
executors Number of executors to request while connecting.
conf A named list of additional entries to add to sparklyr.shell.conf.
timeout Total seconds to wait before giving up on connection.
ports Ports to forward using kubectl.
fix_config Should the spark-defaults.conf get fixed? TRUE for Windows.
... Additional parameters, currently not in use.
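A rough sketch of building a Kubernetes configuration and connecting with it; the cluster URL, image name, and executor count are placeholders, and the exact values depend on your deployment:

library(sparklyr)

master <- "k8s://https://192.168.99.100:8443"   # from `kubectl cluster-info` (placeholder)

conf <- spark_config_kubernetes(
  master,
  version   = "2.3.2",
  image     = "spark:sparklyr",
  account   = "spark",
  executors = 2
)

sc <- spark_connect(master = master, config = conf)
spark_disconnect(sc)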

Retrieve Available Settings

Retrieves available sparklyr settings that can be used in configuration files or spark_config(). spark_config_settings()

Find Spark Connection

Arguments Finds an active spark connection in the environment given the connection parameters. spark_connection_find(master = NULL, app_name = NULL, method = NULL)

Arguments

master The Spark master parameter.
app_name The Spark application name.
method The method used to connect to Spark.

Fallback to Spark Dependency

Arguments Value Helper function to assist falling back to previous Spark versions. spark_dependency_fallback(spark_version, supported_versions)

Arguments

spark_version The Spark version being requested in spark_dependencies.
supported_versions The Spark versions that are supported by this extension.

Value

A Spark version to use.
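As a hedged sketch, an extension package's spark_dependencies() callback might use this helper to pick the closest jar it ships; the package and jar names below are hypothetical:

spark_dependencies <- function(spark_version, scala_version, ...) {
  # Choose the Spark version this (hypothetical) extension provides jars for
  jar_version <- spark_dependency_fallback(spark_version, c("2.3", "2.4"))
  spark_dependency(
    jars = system.file(
      sprintf("java/example-extension-%s.jar", jar_version),
      package = "exampleExtension"   # hypothetical extension package
    )
  )
}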

Create Spark Extension

Arguments Creates an R package ready to be used as a Spark extension. spark_extension(path)

Arguments

path Location where the extension will be created.

Reads from a Spark Table into a Spark DataFrame.

Arguments See also Reads from a Spark Table into a Spark DataFrame. spark_load_table(sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
options A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?

See also

Other Spark serialization routines: spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Read libsvm file into a Spark DataFrame.

Arguments See also Read libsvm file into a Spark DataFrame. spark_read_libsvm(sc, name = NULL, path = name, repartition = 0, memory = TRUE, overwrite = TRUE, ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Read a CSV file into a Spark DataFrame

Arguments Details See also Read a tabular data file into a Spark DataFrame. spark_read_csv(sc, name = NULL, path = name, header = TRUE, columns = NULL, infer_schema = is.null(columns), delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
header Boolean; should the first row of data be used as a header? Defaults to TRUE.
columns A vector of column names or a named vector of column types.
infer_schema Boolean; should column types be automatically inferred? Requires one extra pass over the data. Defaults to is.null(columns).
delimiter The character used to delimit each column. Defaults to ','.
quote The character used as a quote. Defaults to '"'.
escape The character used to escape other characters. Defaults to '\'.
charset The character set. Defaults to "UTF-8".
null_value The character to use for null, or missing, values. Defaults to NULL.
options A list of strings with additional options.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
... Optional arguments; currently unused.

Details

You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://). If you are reading from a secure S3 bucket, be sure to set spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in your spark-defaults.conf, or use any of the methods outlined in the aws-sdk documentation (Working with AWS credentials). In order to work with the newer s3a:// protocol, also set the values for spark.hadoop.fs.s3a.impl and spark.hadoop.fs.s3a.endpoint. In addition, to support v4 of the S3 API, be sure to pass the -Dcom.amazonaws.services.s3.enableV4 driver option for the config key spark.driver.extraJavaOptions. For instructions on how to configure s3n://, check the Hadoop documentation (s3n authentication properties). When header is FALSE, the column names are generated with a V prefix; e.g. V1, V2, ....

See also

Other Spark serialization routines: spark_load_table, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
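Tying the Details above together, a hedged sketch of reading a CSV file from S3; the bucket, credentials, and column settings are assumptions:

library(sparklyr)

conf <- spark_config()
conf$spark.hadoop.fs.s3a.access.key <- Sys.getenv("AWS_ACCESS_KEY_ID")
conf$spark.hadoop.fs.s3a.secret.key <- Sys.getenv("AWS_SECRET_ACCESS_KEY")

sc <- spark_connect(master = "local", config = conf)

flights_tbl <- spark_read_csv(
  sc,
  name   = "flights",
  path   = "s3a://my-bucket/flights/*.csv",   # hypothetical bucket
  header = TRUE,
  infer_schema = TRUE,
  memory = FALSE   # avoid eagerly caching a large remote dataset
)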

Read from Delta Lake into a Spark DataFrame.

Arguments See also Read from Delta Lake into a Spark DataFrame. spark_read_delta(sc, path, name = NULL, version = NULL, timestamp = NULL, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, ...)

Arguments

sc A spark_connection.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
name The name to assign to the newly generated table.
version The version of the delta table to read.
timestamp The timestamp of the delta table to read. For example, "2019-01-01" or "2019-01-01'T'00:00:00.000Z".
options A list of strings with additional options.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
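A minimal sketch of reading a Delta table, assuming the connection was opened with Delta Lake available (for example via the packages argument of spark_connect); the paths are placeholders:

library(sparklyr)

sc <- spark_connect(master = "local", version = "2.4.3", packages = "delta")

events <- spark_read_delta(sc, path = "/tmp/delta/events", name = "events")

# Read an earlier snapshot of the same table ("time travel")
events_v0 <- spark_read_delta(sc, path = "/tmp/delta/events", name = "events_v0", version = 0)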

Read from JDBC connection into a Spark DataFrame.

Arguments See also Read from JDBC connection into a Spark DataFrame. spark_read_jdbc(sc, name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
options A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
columns A vector of column names or a named vector of column types.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
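A hedged sketch of reading from a hypothetical PostgreSQL database; the option names follow Spark's standard JDBC data source options, and the JDBC driver jar is assumed to already be on the Spark classpath:

orders_tbl <- spark_read_jdbc(
  sc,
  name = "orders",
  options = list(
    url      = "jdbc:postgresql://db-host:5432/sales",  # placeholder host and database
    dbtable  = "public.orders",
    user     = "reporting",
    password = Sys.getenv("PG_PASSWORD"),
    driver   = "org.postgresql.Driver"
  )
)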

Read a JSON file into a Spark DataFrame

Arguments Details See also Read a table serialized in the JavaScript Object Notation format into a Spark DataFrame. spark_read_json(sc, name = NULL, path = name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
options A list of strings with additional options.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
columns A vector of column names or a named vector of column types.
... Optional arguments; currently unused.

Details

You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://). If you are reading from a secure S3 bucket, be sure to set spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in your spark-defaults.conf, or use any of the methods outlined in the aws-sdk documentation (Working with AWS credentials). In order to work with the newer s3a:// protocol, also set the values for spark.hadoop.fs.s3a.impl and spark.hadoop.fs.s3a.endpoint. In addition, to support v4 of the S3 API, be sure to pass the -Dcom.amazonaws.services.s3.enableV4 driver option for the config key spark.driver.extraJavaOptions. For instructions on how to configure s3n://, check the Hadoop documentation (s3n authentication properties).

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Read an ORC file into a Spark DataFrame

Arguments Details See also Read an ORC file into a Spark DataFrame. spark_read_orc(sc, name = NULL, path = name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, schema = NULL, ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
options A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
columns A vector of column names or a named vector of column types.
schema A (java) read schema. Useful for optimizing read operation on nested data.
... Optional arguments; currently unused.

Details

You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://).

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Read a Parquet file into a Spark DataFrame

Arguments Details See also Read a Parquet file into a Spark DataFrame. spark_read_parquet(sc, name = NULL, path = name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, schema = NULL, ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
options A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
columns A vector of column names or a named vector of column types.
schema A (java) read schema. Useful for optimizing read operation on nested data.
... Optional arguments; currently unused.

Details

You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://). If you are reading from a secure S3 bucket, be sure to set spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in your spark-defaults.conf, or use any of the methods outlined in the aws-sdk documentation (Working with AWS credentials). In order to work with the newer s3a:// protocol, also set the values for spark.hadoop.fs.s3a.impl and spark.hadoop.fs.s3a.endpoint. In addition, to support v4 of the S3 API, be sure to pass the -Dcom.amazonaws.services.s3.enableV4 driver option for the config key spark.driver.extraJavaOptions. For instructions on how to configure s3n://, check the Hadoop documentation (s3n authentication properties).

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Read from a generic source into a Spark DataFrame.

Arguments See also Read from a generic source into a Spark DataFrame. spark_read_source(sc, name = NULL, path = name, source, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
source A data source capable of reading data.
options A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
columns A vector of column names or a named vector of column types.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Reads from a Spark Table into a Spark DataFrame.

Arguments See also Reads from a Spark Table into a Spark DataFrame. spark_read_table(sc, name, options = list(), repartition = 0, memory = TRUE, columns = NULL, ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
options A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
columns A vector of column names or a named vector of column types.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Read a Text file into a Spark DataFrame

Arguments Details See also Read a text file into a Spark DataFrame. spark_read_text(sc, name = NULL, path = name, repartition = 0, memory = TRUE, overwrite = TRUE, options = list(), whole = FALSE, ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated table.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
options A list of strings with additional options.
whole Read the entire text file as a single entry? Defaults to FALSE.
... Optional arguments; currently unused.

Details

You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://). If you are reading from a secure S3 bucket, be sure to set spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in your spark-defaults.conf, or use any of the methods outlined in the aws-sdk documentation (Working with AWS credentials). In order to work with the newer s3a:// protocol, also set the values for spark.hadoop.fs.s3a.impl and spark.hadoop.fs.s3a.endpoint. In addition, to support v4 of the S3 API, be sure to pass the -Dcom.amazonaws.services.s3.enableV4 driver option for the config key spark.driver.extraJavaOptions. For instructions on how to configure s3n://, check the Hadoop documentation (s3n authentication properties).

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Write a Spark DataFrame to a CSV

Arguments See also Write a Spark DataFrame to a tabular (typically, comma-separated) file. spark_write_csv(x, path, header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL, options = list(), mode = NULL, partition_by = NULL, ...)

Arguments

x A Spark DataFrame or dplyr operation
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
header Should the first row of data be used as a header? Defaults to TRUE.
delimiter The character used to delimit each column, defaults to ,.
quote The character used as a quote. Defaults to '"'.
escape The character used to escape other characters, defaults to \.
charset The character set, defaults to "UTF-8".
null_value The character to use for default values, defaults to NULL.
options A list of strings with additional options.
mode A character element. Specifies the behavior when data or table already exists. Supported values include: 'error', 'append', 'overwrite' and 'ignore'. Notice that 'overwrite' will also change the column structure. For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark.
partition_by A character vector. Partitions the output by the given columns on the file system.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
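For example, assuming flights_tbl is a tbl_spark with a year column (both names are illustrative):

spark_write_csv(
  flights_tbl,
  path         = "hdfs:///data/flights_csv",   # placeholder output path
  header       = TRUE,
  mode         = "overwrite",    # replace any existing output
  partition_by = "year"          # one subdirectory per year on the file system
)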

Writes a Spark DataFrame into Delta Lake

Arguments See also Writes a Spark DataFrame into Delta Lake. spark_write_delta(x, path, mode = NULL, options = list(), ...)

Arguments

x A Spark DataFrame or dplyr operation
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode A character element. Specifies the behavior when data or table already exists. Supported values include: 'error', 'append', 'overwrite' and 'ignore'. Notice that 'overwrite' will also change the column structure. For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Writes a Spark DataFrame into a JDBC table

Arguments See also Writes a Spark DataFrame into a JDBC table. spark_write_jdbc(x, name, mode = NULL, options = list(), partition_by = NULL, ...)

Arguments

x A Spark DataFrame or dplyr operation
name The name to assign to the newly generated table.
mode A character element. Specifies the behavior when data or table already exists. Supported values include: 'error', 'append', 'overwrite' and 'ignore'. Notice that 'overwrite' will also change the column structure. For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark.
options A list of strings with additional options.
partition_by A character vector. Partitions the output by the given columns on the file system.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Write a Spark DataFrame to a JSON file

Arguments See also Serialize a Spark DataFrame to the JavaScript Object Notation format. spark_write_json(x, path, mode = NULL, options = list(), partition_by = NULL, ...)

Arguments

x A Spark DataFrame or dplyr operation
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode A character element. Specifies the behavior when data or table already exists. Supported values include: 'error', 'append', 'overwrite' and 'ignore'. Notice that 'overwrite' will also change the column structure. For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark.
options A list of strings with additional options.
partition_by A character vector. Partitions the output by the given columns on the file system.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Write a Spark DataFrame to an ORC file

Arguments See also Serialize a Spark DataFrame to the ORC format. spark_write_orc(x, path, mode = NULL, options = list(), partition_by = NULL, ...)

Arguments

x A Spark DataFrame or dplyr operation
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode A character element. Specifies the behavior when data or table already exists. Supported values include: 'error', 'append', 'overwrite' and 'ignore'. Notice that 'overwrite' will also change the column structure. For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark.
options A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration.
partition_by A character vector. Partitions the output by the given columns on the file system.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Write a Spark DataFrame to a Parquet file

Arguments See also Serialize a Spark DataFrame to the Parquet format. spark_write_parquet(x, path, mode = NULL, options = list(), partition_by = NULL, ...)

Arguments

x A Spark DataFrame or dplyr operation
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode A character element. Specifies the behavior when data or table already exists. Supported values include: 'error', 'append', 'overwrite' and 'ignore'. Notice that 'overwrite' will also change the column structure. For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark.
options A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration.
partition_by A character vector. Partitions the output by the given columns on the file system.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_source, spark_write_table, spark_write_text

Writes a Spark DataFrame into a generic source

Arguments See also Writes a Spark DataFrame into a generic source. spark_write_source(x, source, mode = NULL, options = list(), partition_by = NULL, ...)

Arguments

x A Spark DataFrame or dplyr operation
source A data source capable of reading data.
mode A character element. Specifies the behavior when data or table already exists. Supported values include: 'error', 'append', 'overwrite' and 'ignore'. Notice that 'overwrite' will also change the column structure. For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark.
options A list of strings with additional options.
partition_by A character vector. Partitions the output by the given columns on the file system.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_table, spark_write_text

Writes a Spark DataFrame into a Spark table

Arguments See also Writes a Spark DataFrame into a Spark table. spark_write_table(x, name, mode = NULL, options = list(), partition_by = NULL, ...)

Arguments

x A Spark DataFrame or dplyr operation
name The name to assign to the newly generated table.
mode A character element. Specifies the behavior when data or table already exists. Supported values include: 'error', 'append', 'overwrite' and 'ignore'. Notice that 'overwrite' will also change the column structure. For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark.
options A list of strings with additional options.
partition_by A character vector. Partitions the output by the given columns on the file system.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_text

Write a Spark DataFrame to a Text file

Arguments See also Serialize a Spark DataFrame to the plain text format. spark_write_text(x, path, mode = NULL, options = list(), partition_by = NULL, ...)

Arguments

x A Spark DataFrame or dplyr operation
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode A character element. Specifies the behavior when data or table already exists. Supported values include: 'error', 'append', 'overwrite' and 'ignore'. Notice that 'overwrite' will also change the column structure. For more details see also http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark.
options A list of strings with additional options.
partition_by A character vector. Partitions the output by the given columns on the file system.
... Optional arguments; currently unused.

See also

Other Spark serialization routines: spark_load_table, spark_read_csv, spark_read_delta, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_orc, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_delta, spark_write_jdbc, spark_write_json, spark_write_orc, spark_write_parquet, spark_write_source, spark_write_table

Save / Load a Spark DataFrame

Arguments Routines for saving and loading Spark DataFrames. sdf_save_table(x, name, overwrite = FALSE, append = FALSE) sdf_load_table(sc, name) sdf_save_parquet(x, path, overwrite = FALSE, append = FALSE) sdf_load_parquet(sc, path)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
name The table name to assign to the saved Spark DataFrame.
overwrite Boolean; overwrite a pre-existing table of the same name?
append Boolean; append to a pre-existing table of the same name?
sc A spark_connection object.
path The path where the Spark DataFrame should be saved.

Spark ML -- Transform, fit, and predict methods (sdf_ interface)

Arguments Value Deprecated methods for transformation, fit, and prediction. These are mirrors of the corresponding ml-transform-methods. sdf_predict(x, model, ...) sdf_transform(x, transformer, ...) sdf_fit(x, estimator, ...) sdf_fit_and_transform(x, estimator, ...)

Arguments

x A tbl_spark.
model A ml_transformer or a ml_model object.
... Optional arguments passed to the corresponding ml_ methods.
transformer A ml_transformer object.
estimator A ml_estimator object.

Value

sdf_predict(), sdf_transform(), and sdf_fit_and_transform() return a transformed dataframe whereas sdf_fit() returns a ml_transformer.

Create DataFrame for along Object

Arguments Creates a DataFrame along the given object. sdf_along(sc, along, repartition = NULL, type = c("integer", "integer64"))

Arguments

sc The associated Spark connection.
along Takes the length from the length of this argument.
repartition The number of partitions to use when distributing the data across the Spark cluster.
type The data type to use for the index, either "integer" or "integer64".

Bind multiple Spark DataFrames by row and column

Arguments Value Details sdf_bind_rows() and sdf_bind_cols() are implementations of the common pattern of do.call(rbind, sdfs) or do.call(cbind, sdfs) for binding many Spark DataFrames into one. sdf_bind_rows(..., id = NULL) sdf_bind_cols(...)

Arguments

... Spark tbls to combine. Each argument can either be a Spark DataFrame or a list of Spark DataFrames. When row-binding, columns are matched by name, and any missing columns will be filled with NA. When column-binding, rows are matched by position, so all data frames must have the same number of rows.
id Data frame identifier. When id is supplied, a new column of identifiers is created to link each row to its original Spark DataFrame. The labels are taken from the named arguments to sdf_bind_rows(). When a list of Spark DataFrames is supplied, the labels are taken from the names of the list. If no names are found a numeric sequence is used instead.

Value

sdf_bind_rows() and sdf_bind_cols() return tbl_spark

Details

The output of sdf_bind_rows() will contain a column if that column appears in any of the inputs.
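A short sketch, assuming sc is an open connection; the split by species is purely illustrative:

library(sparklyr)
library(dplyr)

iris_tbl  <- sdf_copy_to(sc, iris, overwrite = TRUE)
setosa    <- iris_tbl %>% filter(Species == "setosa")
virginica <- iris_tbl %>% filter(Species == "virginica")

# Row-bind, adding an identifier column that records each row's source
combined <- sdf_bind_rows(setosa = setosa, virginica = virginica, id = "source")

# Column-bind two Spark DataFrames with the same number of rows
wide <- sdf_bind_cols(
  iris_tbl %>% select(Petal_Length),
  iris_tbl %>% select(Petal_Width)
)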

Broadcast hint

Arguments Used to force broadcast hash joins. sdf_broadcast(x)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.

Checkpoint a Spark DataFrame

Arguments Checkpoint a Spark DataFrame sdf_checkpoint(x, eager = TRUE)

Arguments

x an object coercible to a Spark DataFrame
eager whether to truncate the lineage of the DataFrame

Coalesces a Spark DataFrame

Arguments Coalesces a Spark DataFrame sdf_coalesce(x, partitions)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
partitions number of partitions

Collect a Spark DataFrame into R.

Arguments Collects a Spark dataframe into R. sdf_collect(object, ...)

Arguments

object Spark dataframe to collect
... Additional options.

Copy an Object into Spark

Arguments Advanced Usage See also Examples Copy an object into Spark, and return an R object wrapping the copied object (typically, a Spark DataFrame). sdf_copy_to(sc, x, name, memory, repartition, overwrite, ...) sdf_import(x, sc, name, memory, repartition, overwrite, ...)

Arguments

sc The associated Spark connection.
x An R object from which a Spark DataFrame can be generated.
name The name to assign to the copied table in Spark.
memory Boolean; should the table be cached into memory?
repartition The number of partitions to use when distributing the table across the Spark cluster. The default (0) can be used to avoid partitioning.
overwrite Boolean; overwrite a pre-existing table with the name name if one already exists?
... Optional arguments, passed to implementing methods.

Advanced Usage

sdf_copy_to is an S3 generic that, by default, dispatches to sdf_import. Package authors that would like to implement sdf_copy_to for a custom object type can accomplish this by implementing the associated method on sdf_import.

See also

Other Spark data frames: sdf_random_split, sdf_register, sdf_sample, sdf_sort

Examples

sc <- spark_connect(master = "spark://HOST:PORT")
sdf_copy_to(sc, iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> (remaining 147 rows of the iris output omitted)

Cross Tabulation

Arguments Value Builds a contingency table at each combination of factor levels. sdf_crosstab(x, col1, col2)

Arguments

x A Spark DataFrame
col1 The name of the first column. Distinct items will make the first item of each row.
col2 The name of the second column. Distinct items will make the column names of the DataFrame.

Value

A DataFrame containing the contingency table.

Debug Info for Spark DataFrame

Arguments Prints the plan of execution to generate x. This plan will, among other things, show the number of partitions in parentheses at the far left and indicate stages using indentation. sdf_debug_string(x, print = TRUE)

Arguments

x An R object wrapping, or containing, a Spark DataFrame.
print Print debug information?

Compute summary statistics for columns of a data frame

Arguments Compute summary statistics for columns of a data frame sdf_describe(x, cols = colnames(x))

Arguments

x An object coercible to a Spark DataFrame
cols Columns to compute statistics for, given as a character vector
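
A short sketch under assumed conditions (local connection, iris_tbl copy of iris); column names use the underscore form produced by sdf_copy_to:

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# count, mean, stddev, min and max for the selected columns
sdf_describe(iris_tbl, cols = c("Sepal_Length", "Petal_Width"))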

Support for Dimension Operations

sdf_dim(), sdf_nrow() and sdf_ncol() provide similar functionality to dim(), nrow() and ncol().

sdf_dim(x)
sdf_nrow(x)
sdf_ncol(x)

Arguments

x An object (usually a spark_tbl).
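
A minimal sketch, assuming a local connection and an iris_tbl copy of iris:

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
sdf_dim(iris_tbl)   # c(150, 5)
sdf_nrow(iris_tbl)  # 150
sdf_ncol(iris_tbl)  # 5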

Spark DataFrame is Streaming

Is the given Spark DataFrame streaming data?

sdf_is_streaming(x)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.

Returns the last index of a Spark DataFrame

Returns the last index of a Spark DataFrame. The Spark mapPartitionsWithIndex function is used to iterate through the last nonempty partition of the RDD to find the last record.

sdf_last_index(x, id = "id")

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
id The name of the index column.

Create DataFrame for Length

Creates a DataFrame for the given length.

sdf_len(sc, length, repartition = NULL, type = c("integer", "integer64"))

Arguments

sc The associated Spark connection.
length The desired length of the sequence.
repartition The number of partitions to use when distributing the data across the Spark cluster.
type The data type to use for the index, either "integer" or "integer64".
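
A minimal sketch, assuming a local connection:

library(sparklyr)
sc <- spark_connect(master = "local")
# a single-column Spark DataFrame with ids 1 through 10, spread over 2 partitions
ten <- sdf_len(sc, 10, repartition = 2)
sdf_num_partitions(ten)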

Gets number of partitions of a Spark DataFrame

Gets the number of partitions of a Spark DataFrame.

sdf_num_partitions(x)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.

Persist a Spark DataFrame

Persist a Spark DataFrame, forcing any pending computations and (optionally) serializing the results to disk.

sdf_persist(x, storage.level = "MEMORY_AND_DISK")

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
storage.level The storage level to be used. Please view the Spark Documentation for information on what storage levels are accepted.

Details

Spark DataFrames invoke their operations lazily -- pending operations are deferred until their results are actually needed. Persisting a Spark DataFrame effectively 'forces' any pending computations, and then persists the generated Spark DataFrame as requested (to memory, to disk, or otherwise). Users of Spark should be careful to persist the results of any computations which are non-deterministic -- otherwise, one might see that the values within a column seem to 'change' as new operations are performed on that data set.
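
The point about non-deterministic computations can be illustrated by persisting a random sample so that downstream operations all see the same rows. A hedged sketch, assuming a local connection and an iris_tbl copy of iris:

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# without persisting, each later action could re-draw the random sample
sampled <- iris_tbl %>%
  sdf_sample(fraction = 0.5, replacement = FALSE) %>%
  sdf_persist(storage.level = "MEMORY_AND_DISK")
sdf_nrow(sampled)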

Pivot a Spark DataFrame

Construct a pivot table over a Spark DataFrame, using a syntax similar to that from reshape2::dcast.

sdf_pivot(x, formula, fun.aggregate = "count")

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula A two-sided R formula of the form x_1 + x_2 + ... ~ y_1. The left-hand side of the formula indicates which variables are used for grouping, and the right-hand side indicates which variable is used for pivoting. Currently, only a single pivot column is supported.
fun.aggregate How should the grouped dataset be aggregated? Can be a length-one character vector, giving the name of a Spark aggregation function to be called; a named R list mapping column names to an aggregation method, or an R function that is invoked on the grouped dataset.

Examples

if (FALSE) {
  library(sparklyr)
  library(dplyr)
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  # aggregating by mean
  iris_tbl %>%
    mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low")) %>%
    sdf_pivot(Petal_Width ~ Species, fun.aggregate = list(Petal_Length = "mean"))

  # aggregating all observations in a list
  iris_tbl %>%
    mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low")) %>%
    sdf_pivot(Petal_Width ~ Species, fun.aggregate = list(Petal_Length = "collect_list"))
}

Project features onto principal components

Project features onto principal components.

sdf_project(object, newdata, features = dimnames(object$pc)[[1]], feature_prefix = NULL, ...)

Arguments

object A Spark PCA model object
newdata An object coercible to a Spark DataFrame
features A vector of names of columns to be projected
feature_prefix The prefix used in naming the output features
... Optional arguments; currently unused.

Transforming Spark DataFrames

The family of functions prefixed with sdf_ generally access the Scala Spark DataFrame API directly, as opposed to the dplyr interface which uses Spark SQL. These functions will 'force' any pending SQL in a dplyr pipeline, such that the resulting tbl_spark object returned will no longer have the attached 'lazy' SQL operations. Note that the underlying Spark DataFrame does execute its operations lazily, so that even though the pending set of operations (currently) are not exposed at the R level, these operations will only be executed when you explicitly collect() the table.

Compute (Approximate) Quantiles with a Spark DataFrame

Given a numeric column within a Spark DataFrame, compute approximate quantiles (to some relative error).

sdf_quantile(x, column, probabilities = c(0, 0.25, 0.5, 0.75, 1), relative.error = 1e-05)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
column The column for which quantiles should be computed.
probabilities A numeric vector of probabilities, for which quantiles should be computed.
relative.error The relative error -- lower values imply more precision in the computed quantiles.
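
A minimal sketch (assumptions: local connection, iris_tbl copy of iris):

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# approximate quartiles of Petal_Length, returned as a named numeric vector
sdf_quantile(iris_tbl, "Petal_Length", probabilities = c(0.25, 0.5, 0.75))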

Partition a Spark DataFrame

Partition a Spark DataFrame into multiple groups. This routine is useful for splitting a DataFrame into, for example, training and test datasets.

sdf_random_split(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))

sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))

Arguments

x An object coercible to a Spark DataFrame.
... Named parameters, mapping table names to weights. The weights will be normalized such that they sum to 1.
weights An alternate mechanism for supplying weights -- when specified, this takes precedence over the ... arguments.
seed Random seed to use for randomly partitioning the dataset. Set this if you want your partitioning to be reproducible on repeated runs.

Value

An R list of tbl_sparks.

Details

The sampling weights define the probability that a particular observation will be assigned to a particular partition, not the resulting size of the partition. This implies that partitioning a DataFrame with, for example, sdf_random_split(x, training = 0.5, test = 0.5) is not guaranteed to produce training and test partitions of equal size.

Transforming Spark DataFrames

The family of functions prefixed with sdf_ generally access the Scala Spark DataFrame API directly, as opposed to the dplyr interface which uses Spark SQL. These functions will 'force' any pending SQL in a dplyr pipeline, such that the resulting tbl_spark object returned will no longer have the attached 'lazy' SQL operations. Note that the underlying Spark DataFrame does execute its operations lazily, so that even though the pending set of operations (currently) are not exposed at the R level, these operations will only be executed when you explicitly collect() the table.

See also

Other Spark data frames: sdf_copy_to, sdf_register, sdf_sample, sdf_sort

Examples

if (FALSE) {
  # randomly partition data into a 'training' and 'test'
  # dataset, with 60% of the observations assigned to the
  # 'training' dataset, and 40% assigned to the 'test' dataset
  data(diamonds, package = "ggplot2")
  diamonds_tbl <- copy_to(sc, diamonds, "diamonds")
  partitions <- diamonds_tbl %>%
    sdf_random_split(training = 0.6, test = 0.4)
  print(partitions)

  # alternate way of specifying weights
  weights <- c(training = 0.6, test = 0.4)
  diamonds_tbl %>% sdf_random_split(weights = weights)
}

Read a Column from a Spark DataFrame

Read a single column from a Spark DataFrame, and return the contents of that column back to R.

sdf_read_column(x, column)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
column The name of a column within x.

Details

It is expected for this operation to preserve row order.
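
A minimal sketch, assuming a local connection and an iris_tbl copy of iris:

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# returns an ordinary R character vector of length 150
species <- sdf_read_column(iris_tbl, "Species")
head(species)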

Register a Spark DataFrame

Registers a Spark DataFrame (giving it a table name for the Spark SQL context), and returns a tbl_spark.

sdf_register(x, name = NULL)

Arguments

x A Spark DataFrame.
name A name to assign this table.

Transforming Spark DataFrames

The family of functions prefixed with sdf_ generally access the Scala Spark DataFrame API directly, as opposed to the dplyr interface which uses Spark SQL. These functions will 'force' any pending SQL in a dplyr pipeline, such that the resulting tbl_spark object returned will no longer have the attached 'lazy' SQL operations. Note that the underlying Spark DataFrame does execute its operations lazily, so that even though the pending set of operations (currently) are not exposed at the R level, these operations will only be executed when you explicitly collect() the table.

See also

Other Spark data frames: sdf_copy_to, sdf_random_split, sdf_sample, sdf_sort

Repartition a Spark DataFrame

Repartition a Spark DataFrame.

sdf_repartition(x, partitions = NULL, partition_by = NULL)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
partitions number of partitions
partition_by vector of column names used for partitioning, only supported for Spark 2.0+
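
A minimal sketch pairing sdf_repartition() with sdf_num_partitions() (assumptions: local connection, iris_tbl copy of iris, Spark 2.0+ for partition_by):

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
by_species <- sdf_repartition(iris_tbl, partitions = 8, partition_by = "Species")
sdf_num_partitions(by_species)  # 8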

Model Residuals

This generic method returns a Spark DataFrame with model residuals added as a column to the model training data.

# S3 method for ml_model_generalized_linear_regression
sdf_residuals(object, type = c("deviance", "pearson", "working", "response"), ...)

# S3 method for ml_model_linear_regression
sdf_residuals(object, ...)

sdf_residuals(object, ...)

Arguments

object Spark ML model object.
type type of residuals which should be returned.
... additional arguments

Randomly Sample Rows from a Spark DataFrame

Draw a random sample of rows (with or without replacement) from a Spark DataFrame.

sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)

Arguments

x An object coercible to a Spark DataFrame.
fraction The fraction to sample.
replacement Boolean; sample with replacement?
seed An (optional) integer seed.

Transforming Spark DataFrames

The family of functions prefixed with sdf_ generally access the Scala Spark DataFrame API directly, as opposed to the dplyr interface which uses Spark SQL. These functions will 'force' any pending SQL in a dplyr pipeline, such that the resulting tbl_spark object returned will no longer have the attached 'lazy' SQL operations. Note that the underlying Spark DataFrame does execute its operations lazily, so that even though the pending set of operations (currently) are not exposed at the R level, these operations will only be executed when you explicitly collect() the table.

See also

Other Spark data frames: sdf_copy_to, sdf_random_split, sdf_register, sdf_sort

Read the Schema of a Spark DataFrame

Read the schema of a Spark DataFrame.

sdf_schema(x)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.

Value

An R list, with each list element describing the name and type of a column.

Details

The type column returned gives the string representation of the underlying Spark type for that column; for example, a vector of numeric values would be returned with the type "DoubleType". Please see the Spark Scala API Documentation for information on what types are available and exposed by Spark.
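
A minimal sketch, assuming a local connection and an iris_tbl copy of iris:

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# a named list; each element holds the column name and its Spark type,
# e.g. "DoubleType" for the measurements and "StringType" for Species
sdf_schema(iris_tbl)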

Separate a Vector Column into Scalar Columns

Given a vector column in a Spark DataFrame, split that into n separate columns, each column made up of the different elements in the column column.

sdf_separate_column(x, column, into = NULL)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
column The name of a (vector-typed) column.
into A specification of the columns that should be generated from column. This can either be a vector of column names, or an R list mapping column names to the (1-based) index at which a particular vector element should be extracted.
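
Vector columns typically come out of feature transformers or ML models; the sketch below builds one with ft_vector_assembler() purely for illustration (the local connection, table and column names are assumptions):

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
assembled <- ft_vector_assembler(iris_tbl, input_cols = c("Sepal_Length", "Sepal_Width"), output_col = "sepal_vec")
# split the 2-element vector column back into two scalar columns
sdf_separate_column(assembled, "sepal_vec", into = c("sepal_length_out", "sepal_width_out"))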

Create DataFrame for Range

Creates a DataFrame for the given range.

sdf_seq(sc, from = 1L, to = 1L, by = 1L, repartition = NULL, type = c("integer", "integer64"))

Arguments

sc The associated Spark connection.
from, to The start and end to use as a range
by The increment of the sequence.
repartition The number of partitions to use when distributing the data across the Spark cluster.
type The data type to use for the index, either "integer" or "integer64".
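
A minimal sketch, assuming a local connection:

library(sparklyr)
sc <- spark_connect(master = "local")
# integers 1, 6, 11, ..., 46 as a one-column Spark DataFrame
sdf_seq(sc, from = 1L, to = 50L, by = 5L)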

Sort a Spark DataFrame

Sort a Spark DataFrame by one or more columns, with each column sorted in ascending order.

sdf_sort(x, columns)

Arguments

x An object coercible to a Spark DataFrame.
columns The column(s) to sort by.

Transforming Spark DataFrames

The family of functions prefixed with sdf_ generally access the Scala Spark DataFrame API directly, as opposed to the dplyr interface which uses Spark SQL. These functions will 'force' any pending SQL in a dplyr pipeline, such that the resulting tbl_spark object returned will no longer have the attached 'lazy' SQL operations. Note that the underlying Spark DataFrame does execute its operations lazily, so that even though the pending set of operations (currently) are not exposed at the R level, these operations will only be executed when you explicitly collect() the table.

See also

Other Spark data frames: sdf_copy_to, sdf_random_split, sdf_register, sdf_sample

Spark DataFrame from SQL

Defines a Spark DataFrame from a SQL query, useful for creating Spark DataFrames without collecting the results immediately.

sdf_sql(sc, sql)

Arguments

sc A spark_connection.
sql a 'SQL' query used to generate a Spark DataFrame.
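
A minimal sketch; the table name iris_tbl comes from the name argument of sdf_copy_to() and, like the local connection, is an assumption for illustration:

library(sparklyr)
sc <- spark_connect(master = "local")
sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
counts <- sdf_sql(sc, "SELECT Species, COUNT(*) AS n FROM iris_tbl GROUP BY Species")
counts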

Add a Sequential ID Column to a Spark DataFrame

Add a sequential ID column to a Spark DataFrame. The Spark zipWithIndex function is used to produce these. This differs from sdf_with_unique_id in that the IDs generated are independent of partitioning.

sdf_with_sequential_id(x, id = "id", from = 1L)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
id The name of the column to host the generated IDs.
from The starting value of the id column

Add a Unique ID Column to a Spark DataFrame

Add a unique ID column to a Spark DataFrame. The Spark monotonicallyIncreasingId function is used to produce these and is guaranteed to produce unique, monotonically increasing ids; however, there is no guarantee that these IDs will be sequential. The table is persisted immediately after the column is generated, to ensure that the column is stable -- otherwise, it can differ across new computations.

sdf_with_unique_id(x, id = "id")

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
id The name of the column to host the generated IDs.
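
A minimal sketch contrasting the two ID helpers (assumptions: local connection, iris_tbl copy of iris):

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# consecutive ids 1..150, independent of partitioning
iris_tbl %>% sdf_with_sequential_id(id = "row_id")
# unique, monotonically increasing ids that may have gaps
iris_tbl %>% sdf_with_unique_id(id = "unique_id")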

Spark ML -- Decision Trees

Perform classification and regression using decision trees.

ml_decision_tree_classifier(x, formula = NULL, max_depth = 5, max_bins = 32, min_instances_per_node = 1, min_info_gain = 0, impurity = "gini", seed = NULL, thresholds = NULL, cache_node_ids = FALSE, checkpoint_interval = 10, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("decision_tree_classifier_"), ...)

ml_decision_tree(x, formula = NULL, type = c("auto", "regression", "classification"), features_col = "features", label_col = "label", prediction_col = "prediction", variance_col = NULL, probability_col = "probability", raw_prediction_col = "rawPrediction", checkpoint_interval = 10L, impurity = "auto", max_bins = 32L, max_depth = 5L, min_info_gain = 0, min_instances_per_node = 1L, seed = NULL, thresholds = NULL, cache_node_ids = FALSE, max_memory_in_mb = 256L, uid = random_string("decision_tree_"), response = NULL, features = NULL, ...)

ml_decision_tree_regressor(x, formula = NULL, max_depth = 5, max_bins = 32, min_instances_per_node = 1, min_info_gain = 0, impurity = "variance", seed = NULL, cache_node_ids = FALSE, checkpoint_interval = 10, max_memory_in_mb = 256, variance_col = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("decision_tree_regressor_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
max_depth Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree.
max_bins The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
min_instances_per_node Minimum number of instances each child must have after split.
min_info_gain Minimum information gain for a split to be considered at a tree node. Should be >= 0, defaults to 0.
impurity Criterion used for information gain calculation. Supported: "entropy" and "gini" (default) for classification and "variance" (default) for regression. For ml_decision_tree, setting "auto" will default to the appropriate criterion based on model type.
seed Seed for random numbers.
thresholds Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
cache_node_ids If FALSE, the algorithm will pass trees to executors to match instances with nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Defaults to FALSE.
checkpoint_interval Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10.
max_memory_in_mb Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. Defaults to 256.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
probability_col Column name for predicted class conditional probabilities.
raw_prediction_col Raw prediction (a.k.a. confidence) column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.
type The type of model to fit. "regression" treats the response as a continuous variable, while "classification" treats the response as a categorical variable. When "auto" is used, the model type is inferred based on the response variable type -- if it is a numeric type, then regression is used; classification otherwise.
variance_col (Optional) Column name for the biased sample variance of prediction.
response (Deprecated) The name of the response column (as a length-one character vector.)
features (Deprecated) The name of features (terms) to use for the model fit.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows. ml_decision_tree is a wrapper around ml_decision_tree_regressor.tbl_spark and ml_decision_tree_classifier.tbl_spark and calls the appropriate method based on model type.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
  partitions <- iris_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
  iris_training <- partitions$training
  iris_test <- partitions$test
  dt_model <- iris_training %>%
    ml_decision_tree(Species ~ .)
  pred <- ml_predict(dt_model, iris_test)
  ml_multiclass_classification_evaluator(pred)
}

Spark ML -- Generalized Linear Regression

Perform regression using Generalized Linear Model (GLM).

ml_generalized_linear_regression(x, formula = NULL, family = "gaussian", link = NULL, fit_intercept = TRUE, offset_col = NULL, link_power = NULL, link_prediction_col = NULL, reg_param = 0, max_iter = 25, weight_col = NULL, solver = "irls", tol = 1e-06, variance_power = 0, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("generalized_linear_regression_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
family Name of family which is a description of the error distribution to be used in the model. Supported options: "gaussian", "binomial", "poisson", "gamma" and "tweedie". Default is "gaussian".
link Name of link function which provides the relationship between the linear predictor and the mean of the distribution function. See the Details section below for the supported link functions.
fit_intercept Boolean; should the model be fit with an intercept term?
offset_col Offset column name. If this is not set, we treat all instance offsets as 0.0. The feature specified as offset has a constant coefficient of 1.0.
link_power Index in the power link function. Only applicable to the Tweedie family. Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt link, respectively. When not set, this value defaults to 1 - variancePower, which matches the R "statmod" package.
link_prediction_col Link prediction (linear predictor) column name. Default is not set, which means we do not output link prediction.
reg_param Regularization parameter (aka lambda)
max_iter The maximum number of iterations to use.
weight_col The name of the column to use as weights for the model fit.
solver Solver algorithm for optimization.
tol Param for the convergence tolerance for iterative algorithms.
variance_power Power in the variance function of the Tweedie distribution which provides the relationship between the variance and mean of the distribution. Only applicable to the Tweedie family. (see Tweedie Distribution (Wikipedia)) Supported values: 0 and [1, Inf). Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma family, respectively.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.
Valid link functions for each family are listed below; the first link function of each family is the default one.
gaussian: "identity", "log", "inverse"
binomial: "logit", "probit", "loglog"
poisson: "log", "identity", "sqrt"
gamma: "inverse", "identity", "log"
tweedie: power link function specified through link_power. The default link power in the tweedie family is 1 - variance_power.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_gbt_classifier, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
  library(sparklyr)
  sc <- spark_connect(master = "local")
  mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
  partitions <- mtcars_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
  mtcars_training <- partitions$training
  mtcars_test <- partitions$test

  # Specify the grid
  family <- c("gaussian", "gamma", "poisson")
  link <- c("identity", "log")
  family_link <- expand.grid(family = family, link = link, stringsAsFactors = FALSE)
  family_link <- data.frame(family_link, rmse = 0)

  # Train the models
  for (i in 1:nrow(family_link)) {
    glm_model <- mtcars_training %>%
      ml_generalized_linear_regression(mpg ~ .,
        family = family_link[i, 1],
        link = family_link[i, 2]
      )
    pred <- ml_predict(glm_model, mtcars_test)
    family_link[i, 3] <- ml_regression_evaluator(pred, label_col = "mpg")
  }

  family_link
}

Spark ML -- Gradient Boosted Trees

Perform binary classification and regression using gradient boosted trees. Multiclass classification is not supported yet.

ml_gbt_classifier(x, formula = NULL, max_iter = 20, max_depth = 5, step_size = 0.1, subsampling_rate = 1, feature_subset_strategy = "auto", min_instances_per_node = 1L, max_bins = 32, min_info_gain = 0, loss_type = "logistic", seed = NULL, thresholds = NULL, checkpoint_interval = 10, cache_node_ids = FALSE, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("gbt_classifier_"), ...)

ml_gradient_boosted_trees(x, formula = NULL, type = c("auto", "regression", "classification"), features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", checkpoint_interval = 10, loss_type = c("auto", "logistic", "squared", "absolute"), max_bins = 32, max_depth = 5, max_iter = 20L, min_info_gain = 0, min_instances_per_node = 1, step_size = 0.1, subsampling_rate = 1, feature_subset_strategy = "auto", seed = NULL, thresholds = NULL, cache_node_ids = FALSE, max_memory_in_mb = 256, uid = random_string("gradient_boosted_trees_"), response = NULL, features = NULL, ...)

ml_gbt_regressor(x, formula = NULL, max_iter = 20, max_depth = 5, step_size = 0.1, subsampling_rate = 1, feature_subset_strategy = "auto", min_instances_per_node = 1, max_bins = 32, min_info_gain = 0, loss_type = "squared", seed = NULL, checkpoint_interval = 10, cache_node_ids = FALSE, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("gbt_regressor_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
max_iter Maximum number of iterations.
max_depth Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree.
step_size Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator. (default = 0.1)
subsampling_rate Fraction of the training data used for learning each decision tree, in range (0, 1]. (default = 1.0)
feature_subset_strategy The number of features to consider for splits at each tree node. See details for options.
min_instances_per_node Minimum number of instances each child must have after split.
max_bins The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
min_info_gain Minimum information gain for a split to be considered at a tree node. Should be >= 0, defaults to 0.
loss_type Loss function which GBT tries to minimize. Supported: "squared" (L2) and "absolute" (L1) (default = squared) for regression and "logistic" (default) for classification. For ml_gradient_boosted_trees, setting "auto" will default to the appropriate loss type based on model type.
seed Seed for random numbers.
thresholds Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
checkpoint_interval Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10.
cache_node_ids If FALSE, the algorithm will pass trees to executors to match instances with nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Defaults to FALSE.
max_memory_in_mb Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. Defaults to 256.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
probability_col Column name for predicted class conditional probabilities.
raw_prediction_col Raw prediction (a.k.a. confidence) column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.
type The type of model to fit. "regression" treats the response as a continuous variable, while "classification" treats the response as a categorical variable. When "auto" is used, the model type is inferred based on the response variable type -- if it is a numeric type, then regression is used; classification otherwise.
response (Deprecated) The name of the response column (as a length-one character vector.)
features (Deprecated) The name of features (terms) to use for the model fit.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.
The supported options for feature_subset_strategy are:
"auto": Choose automatically for task: If num_trees == 1, set to "all". If num_trees > 1 (forest), set to "sqrt" for classification and to "onethird" for regression.
"all": use all features
"onethird": use 1/3 of the features
"sqrt": use sqrt(number of features)
"log2": use log2(number of features)
"n": when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features. (default = "auto")
ml_gradient_boosted_trees is a wrapper around ml_gbt_regressor.tbl_spark and ml_gbt_classifier.tbl_spark and calls the appropriate method based on model type.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
  partitions <- iris_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
  iris_training <- partitions$training
  iris_test <- partitions$test
  gbt_model <- iris_training %>%
    ml_gradient_boosted_trees(Sepal_Length ~ Petal_Length + Petal_Width)
  pred <- ml_predict(gbt_model, iris_test)
  ml_regression_evaluator(pred, label_col = "Sepal_Length")
}

Spark ML -- K-Means Clustering

K-means clustering with support for k-means|| initialization proposed by Bahmani et al. Using `ml_kmeans()` with the formula interface requires Spark 2.0+.

ml_kmeans(x, formula = NULL, k = 2, max_iter = 20, tol = 1e-04, init_steps = 2, init_mode = "k-means||", seed = NULL, features_col = "features", prediction_col = "prediction", uid = random_string("kmeans_"), ...)

ml_compute_cost(model, dataset)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
k The number of clusters to create
max_iter The maximum number of iterations to use.
tol Param for the convergence tolerance for iterative algorithms.
init_steps Number of steps for the k-means|| initialization mode. This is an advanced setting -- the default of 2 is almost always enough. Must be > 0. Default: 2.
init_mode Initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
seed A random seed. Set this value if you need your results to be reproducible across repeated calls.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments, see Details.
model A fitted K-means model returned by ml_kmeans()
dataset Dataset on which to calculate K-means cost

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the clustering estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning a clustering model.
tbl_spark, with formula or features specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the estimator. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model. This signature does not apply to ml_lda().
ml_compute_cost() returns the K-means cost (sum of squared distances of points to their nearest center) for the model on the given data.

See also

See http://spark.apache.org/docs/latest/ml-clustering.html for more information on the set of clustering algorithms. Other ml clustering algorithms: ml_bisecting_kmeans, ml_gaussian_mixture, ml_lda

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
  ml_kmeans(iris_tbl, Species ~ .)
}

Spark ML -- Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

ml_lda(x, formula = NULL, k = 10, max_iter = 20, doc_concentration = NULL, topic_concentration = NULL, subsampling_rate = 0.05, optimizer = "online", checkpoint_interval = 10, keep_last_checkpoint = TRUE, learning_decay = 0.51, learning_offset = 1024, optimize_doc_concentration = TRUE, seed = NULL, features_col = "features", topic_distribution_col = "topicDistribution", uid = random_string("lda_"), ...)

ml_describe_topics(model, max_terms_per_topic = 10)

ml_log_likelihood(model, dataset)

ml_log_perplexity(model, dataset)

ml_topics_matrix(model)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
k The number of clusters to create
max_iter The maximum number of iterations to use.
doc_concentration Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). See details.
topic_concentration Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
subsampling_rate (For Online optimizer only) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1]. Note that this should be adjusted in sync with max_iter so the entire corpus is used. Specifically, set both so that maxIterations * miniBatchFraction is greater than or equal to 1.
optimizer Optimizer or inference algorithm used to estimate the LDA model. Supported: "online" for Online Variational Bayes (default) and "em" for Expectation-Maximization.
checkpoint_interval Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10.
keep_last_checkpoint (Spark 2.0.0+) (For EM optimizer only) If using checkpointing, this indicates whether to keep the last checkpoint. If FALSE, then the checkpoint will be deleted. Deleting the checkpoint can cause failures if a data partition is lost, so set this bit with care. Note that checkpoints will be cleaned up via reference counting, regardless.
learning_decay (For Online optimizer only) Learning rate, set as an exponential decay rate. This should be between (0.5, 1.0] to guarantee asymptotic convergence. This is called "kappa" in the Online LDA paper (Hoffman et al., 2010). Default: 0.51, based on Hoffman et al.
learning_offset (For Online optimizer only) A (positive) learning parameter that downweights early iterations. Larger values make early iterations count less. This is called "tau0" in the Online LDA paper (Hoffman et al., 2010) Default: 1024, following Hoffman et al.
optimize_doc_concentration (For Online optimizer only) Indicates whether the doc_concentration (Dirichlet parameter for document-topic distribution) will be optimized during training. Setting this to true will make the model more expressive and fit the training data better. Default: FALSE
seed A random seed. Set this value if you need your results to be reproducible across repeated calls.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
topic_distribution_col Output column with estimates of the topic mixture distribution for each document (often called "theta" in the literature). Returns a vector of zeros for an empty document.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments, see Details.
model A fitted LDA model returned by ml_lda().
max_terms_per_topic Maximum number of terms to collect for each topic. Default value of 10.
dataset test corpus to use for calculating log likelihood or log perplexity

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the clustering estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning a clustering model.
tbl_spark, with formula or features specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the estimator. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model. This signature does not apply to ml_lda().
ml_describe_topics returns a DataFrame with topics and their top-weighted terms.
ml_log_likelihood calculates a lower bound on the log likelihood of the entire corpus.

Details

For `ml_lda.tbl_spark` with the formula interface, you can specify named arguments in `...` that will be passed to `ft_regex_tokenizer()`, `ft_stop_words_remover()`, and `ft_count_vectorizer()`. For example, to increase the default `min_token_length`, you can use `ml_lda(dataset, ~ text, min_token_length = 4)`.
Terminology for LDA:
"term" = "word": an element of the vocabulary
"token": instance of a term appearing in a document
"topic": multinomial distribution over terms representing some concept
"document": one piece of text, corresponding to one row in the input data
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Input data (features_col): LDA is given a collection of documents as input data, via the features_col parameter. Each document is specified as a Vector of length vocab_size, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as ft_tokenizer and ft_count_vectorizer can be useful for converting text to word count vectors.

Parameter details

doc_concentration

This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization). If not set by the user, then doc_concentration is set automatically. If set to singleton vector [alpha], then alpha is replicated to a vector of length k in fitting. Otherwise, the doc_concentration vector must be length k. (default = automatic)
Optimizer-specific parameter settings:
EM: Currently only supports symmetric distributions, so all values in the vector should be the same. Values should be greater than 1.0. Default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Online: Values should be greater than or equal to 0. Default = uniformly (1.0 / k), following the reference Online LDA implementation.

topic_concentration

This is the parameter to a symmetric Dirichlet distribution. Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009. If not set by the user, then topic_concentration is set automatically. (default = automatic)
Optimizer-specific parameter settings:
EM: Value should be greater than 1.0. Default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Online: Value should be greater than or equal to 0. Default = (1.0 / k), following the reference Online LDA implementation.

topic_distribution_col

This uses a variational approximation following Hoffman et al. (2010), where the approximate distribution is called "gamma." Technically, this method returns this approximation "gamma" for each document.

See also

See http://spark.apache.org/docs/latest/ml-clustering.html for more information on the set of clustering algorithms. Other ml clustering algorithms: ml_bisecting_kmeans, ml_gaussian_mixture, ml_kmeans

Examples

if (FALSE) {
  library(janeaustenr)
  library(dplyr)
  sc <- spark_connect(master = "local")
  lines_tbl <- sdf_copy_to(sc, austen_books()[c(1:30), ], name = "lines_tbl", overwrite = TRUE)

  # transform the data in a tidy form
  lines_tbl_tidy <- lines_tbl %>%
    ft_tokenizer(input_col = "text", output_col = "word_list") %>%
    ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words") %>%
    mutate(text = explode(wo_stop_words)) %>%
    filter(text != "") %>%
    select(text, book)

  lda_model <- lines_tbl_tidy %>%
    ml_lda(~text, k = 4)

  # vocabulary and topics
  tidy(lda_model)
}

Spark ML -- Linear Regression

Perform regression using linear regression.

ml_linear_regression(x, formula = NULL, fit_intercept = TRUE, elastic_net_param = 0, reg_param = 0, max_iter = 100, weight_col = NULL, loss = "squaredError", solver = "auto", standardization = TRUE, tol = 1e-06, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("linear_regression_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
fit_intercept Boolean; should the model be fit with an intercept term?
elastic_net_param ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
reg_param Regularization parameter (aka lambda)
max_iter The maximum number of iterations to use.
weight_col The name of the column to use as weights for the model fit.
loss The loss function to be optimized. Supported options: "squaredError" and "huber". Default: "squaredError"
solver Solver algorithm for optimization.
standardization Whether to standardize the training features before fitting the model.
tol Param for the convergence tolerance for iterative algorithms.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
  partitions <- mtcars_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
  mtcars_training <- partitions$training
  mtcars_test <- partitions$test
  lm_model <- mtcars_training %>%
    ml_linear_regression(mpg ~ .)
  pred <- ml_predict(lm_model, mtcars_test)
  ml_regression_evaluator(pred, label_col = "mpg")
}

Spark ML -- Logistic Regression

Perform classification using logistic regression.

ml_logistic_regression(x, formula = NULL, fit_intercept = TRUE, elastic_net_param = 0, reg_param = 0, max_iter = 100, threshold = 0.5, thresholds = NULL, tol = 1e-06, weight_col = NULL, aggregation_depth = 2, lower_bounds_on_coefficients = NULL, lower_bounds_on_intercepts = NULL, upper_bounds_on_coefficients = NULL, upper_bounds_on_intercepts = NULL, features_col = "features", label_col = "label", family = "auto", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("logistic_regression_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
fit_intercept Boolean; should the model be fit with an intercept term?
elastic_net_param ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
reg_param Regularization parameter (aka lambda)
max_iter The maximum number of iterations to use.
threshold in binary classification prediction, in range [0, 1].
thresholds Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
tol Param for the convergence tolerance for iterative algorithms.
weight_col The name of the column to use as weights for the model fit.
aggregation_depth (Spark 2.1.0+) Suggested depth for treeAggregate (>= 2).
lower_bounds_on_coefficients (Spark 2.2.0+) Lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression.
lower_bounds_on_intercepts (Spark 2.2.0+) Lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression.
upper_bounds_on_coefficients (Spark 2.2.0+) Upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression.
upper_bounds_on_intercepts (Spark 2.2.0+) Upper bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must be equal with 1 for binomial regression, or the number of classes for multinomial regression.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
family (Spark 2.1.0+) Param for the name of family which is a description of the label distribution to be used in the model. Supported options: "auto", "binomial", and "multinomial."
prediction_col Prediction column name.
probability_col Column name for predicted class conditional probabilities.
raw_prediction_col Raw prediction (a.k.a. confidence) column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

  partitions <- mtcars_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)

  mtcars_training <- partitions$training
  mtcars_test <- partitions$test

  lr_model <- mtcars_training %>%
    ml_logistic_regression(am ~ gear + carb)

  pred <- ml_predict(lr_model, mtcars_test)

  ml_binary_classification_evaluator(pred)
}

Extracts data associated with a Spark ML model

ml_model_data(object)

Arguments

object a Spark ML model

Value

A tbl_spark
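
As an illustrative sketch (not part of the original reference, but following the pattern of the other examples in this document), the Spark DataFrame underlying a fitted formula-interface model can be retrieved like this:

if (FALSE) {
  sc <- spark_connect(master = "local")
  mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

  # Fit any formula-interface model; ml_model_data() then returns the
  # tbl_spark that the model was fit against
  lm_model <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
  ml_model_data(lm_model)
}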

Spark ML -- Multilayer Perceptron

Classification model based on the Multilayer Perceptron. Each layer has a sigmoid activation function; the output layer uses softmax.

ml_multilayer_perceptron_classifier(x, formula = NULL, layers = NULL, max_iter = 100, step_size = 0.03, tol = 1e-06, block_size = 128, solver = "l-bfgs", seed = NULL, initial_weights = NULL, thresholds = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("multilayer_perceptron_classifier_"), ...)

ml_multilayer_perceptron(x, formula = NULL, layers, max_iter = 100, step_size = 0.03, tol = 1e-06, block_size = 128, solver = "l-bfgs", seed = NULL, initial_weights = NULL, features_col = "features", label_col = "label", thresholds = NULL, prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("multilayer_perceptron_classifier_"), response = NULL, features = NULL, ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
layers A numeric vector describing the layers -- each element in the vector gives the size of a layer. For example, c(4, 5, 2) would imply three layers, with an input (feature) layer of size 4, an intermediate layer of size 5, and an output (class) layer of size 2.
max_iter The maximum number of iterations to use.
step_size Step size to be used for each iteration of optimization (> 0).
tol Param for the convergence tolerance for iterative algorithms.
block_size Block size for stacking input data in matrices to speed up the computation. Data is stacked within partitions. If block size is more than remaining data in a partition then it is adjusted to the size of this data. Recommended size is between 10 and 1000. Default: 128
solver The solver algorithm for optimization. Supported options: "gd" (minibatch gradient descent) or "l-bfgs". Default: "l-bfgs"
seed A random seed. Set this value if you need your results to be reproducible across repeated calls.
initial_weights The initial weights of the model.
thresholds Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
features_col Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
probability_col Column name for predicted class conditional probabilities.
raw_prediction_col Raw prediction (a.k.a. confidence) column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.
response (Deprecated) The name of the response column (as a length-one character vector.)
features (Deprecated) The name of features (terms) to use for the model fit.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows. ml_multilayer_perceptron() is an alias for ml_multilayer_perceptron_classifier() for backwards compatibility.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  partitions <- iris_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)

  iris_training <- partitions$training
  iris_test <- partitions$test

  mlp_model <- iris_training %>%
    ml_multilayer_perceptron_classifier(Species ~ ., layers = c(4, 3, 3))

  pred <- ml_predict(mlp_model, iris_test)

  ml_multiclass_classification_evaluator(pred)
}

Spark ML -- Naive-Bayes

Naive Bayes classifiers. Supports Multinomial NB, which can handle finitely supported discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector a binary (0/1) vector, it can also be used as Bernoulli NB. The input feature values must be nonnegative.

ml_naive_bayes(x, formula = NULL, model_type = "multinomial", smoothing = 1, thresholds = NULL, weight_col = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("naive_bayes_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
model_type The model type. Supported options: "multinomial" and "bernoulli". (default = multinomial)
smoothing The (Laplace) smoothing parameter. Defaults to 1.
thresholds Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
weight_col (Spark 2.1.0+) Weight column name. If this is not set or empty, we treat all instance weights as 1.0.
features_col Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
probability_col Column name for predicted class conditional probabilities.
raw_prediction_col Raw prediction (a.k.a. confidence) column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  partitions <- iris_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)

  iris_training <- partitions$training
  iris_test <- partitions$test

  nb_model <- iris_training %>%
    ml_naive_bayes(Species ~ .)

  pred <- ml_predict(nb_model, iris_test)

  ml_multiclass_classification_evaluator(pred)
}

Spark ML -- OneVsRest

Reduction of multiclass classification to binary classification. Performs the reduction using the one-against-all strategy. For a multiclass classification problem with k classes, k models are trained (one per class). Each example is scored against all k models, and the model with the highest score is picked to label the example.

ml_one_vs_rest(x, formula = NULL, classifier = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("one_vs_rest_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
classifier Object of class ml_estimator. The base binary classifier to which the multiclass classification problem is reduced.
features_col Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_random_forest_classifier
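
This entry ships without an example, so here is a hedged sketch in the style of the other examples in this document, showing one plausible way to pass an (unfitted) logistic regression estimator as the base classifier:

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  # Use a logistic regression estimator as the base binary classifier
  ovr_model <- iris_tbl %>%
    ml_one_vs_rest(Species ~ ., classifier = ml_logistic_regression(sc))

  pred <- ml_predict(ovr_model, iris_tbl)
  ml_multiclass_classification_evaluator(pred)
}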

Feature Transformation -- PCA (Estimator)

PCA trains a model to project vectors to a lower dimensional space of the top k principal components.

ft_pca(x, input_col = NULL, output_col = NULL, k = NULL, uid = random_string("pca_"), ...)

ml_pca(x, features = tbl_vars(x), k = length(features), pc_prefix = "PC", ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
k The number of principal components
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.
features The columns to use in the principal components analysis. Defaults to all columns in x.
pc_prefix Length-one character vector used to prepend names of components.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark. ml_pca() is a wrapper around ft_pca() that returns a ml_model.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Examples

if (FALSE) {
  library(dplyr)

  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  iris_tbl %>%
    select(-Species) %>%
    ml_pca(k = 2)
}

Spark ML -- Random Forest

Perform classification and regression using random forests.

ml_random_forest_classifier(x, formula = NULL, num_trees = 20, subsampling_rate = 1, max_depth = 5, min_instances_per_node = 1, feature_subset_strategy = "auto", impurity = "gini", min_info_gain = 0, max_bins = 32, seed = NULL, thresholds = NULL, checkpoint_interval = 10, cache_node_ids = FALSE, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("random_forest_classifier_"), ...)

ml_random_forest(x, formula = NULL, type = c("auto", "regression", "classification"), features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", feature_subset_strategy = "auto", impurity = "auto", checkpoint_interval = 10, max_bins = 32, max_depth = 5, num_trees = 20, min_info_gain = 0, min_instances_per_node = 1, subsampling_rate = 1, seed = NULL, thresholds = NULL, cache_node_ids = FALSE, max_memory_in_mb = 256, uid = random_string("random_forest_"), response = NULL, features = NULL, ...)

ml_random_forest_regressor(x, formula = NULL, num_trees = 20, subsampling_rate = 1, max_depth = 5, min_instances_per_node = 1, feature_subset_strategy = "auto", impurity = "variance", min_info_gain = 0, max_bins = 32, seed = NULL, checkpoint_interval = 10, cache_node_ids = FALSE, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("random_forest_regressor_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
num_trees Number of trees to train (>= 1). If 1, then no bootstrapping is used. If > 1, then bootstrapping is done.
subsampling_rate Fraction of the training data used for learning each decision tree, in range (0, 1]. (default = 1.0)
max_depth Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree.
min_instances_per_node Minimum number of instances each child must have after split.
feature_subset_strategy The number of features to consider for splits at each tree node. See details for options.
impurity Criterion used for information gain calculation. Supported: "entropy" and "gini" (default) for classification and "variance" (default) for regression. For ml_decision_tree, setting "auto" will default to the appropriate criterion based on model type.
min_info_gain Minimum information gain for a split to be considered at a tree node. Should be >= 0, defaults to 0.
max_bins The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
seed Seed for random numbers.
thresholds Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
checkpoint_interval Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10.
cache_node_ids If FALSE, the algorithm will pass trees to executors to match instances with nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Defaults to FALSE.
max_memory_in_mb Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. Defaults to 256.
features_col Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
probability_col Column name for predicted class conditional probabilities.
raw_prediction_col Raw prediction (a.k.a. confidence) column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.
type The type of model to fit. "regression" treats the response as a continuous variable, while "classification" treats the response as a categorical variable. When "auto" is used, the model type is inferred based on the response variable type -- if it is a numeric type, then regression is used; classification otherwise.
response (Deprecated) The name of the response column (as a length-one character vector.)
features (Deprecated) The name of features (terms) to use for the model fit.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.
The supported options for feature_subset_strategy are:
"auto": Choose automatically for the task: if num_trees == 1, set to "all"; if num_trees > 1 (forest), set to "sqrt" for classification and to "onethird" for regression.
"all": use all features.
"onethird": use 1/3 of the features.
"sqrt": use sqrt(number of features).
"log2": use log2(number of features).
"n": when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features. (default = "auto")
ml_random_forest is a wrapper around ml_random_forest_regressor.tbl_spark and ml_random_forest_classifier.tbl_spark and calls the appropriate method based on model type.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  partitions <- iris_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)

  iris_training <- partitions$training
  iris_test <- partitions$test

  rf_model <- iris_training %>%
    ml_random_forest(Species ~ ., type = "classification")

  pred <- ml_predict(rf_model, iris_test)

  ml_multiclass_classification_evaluator(pred)
}

Spark ML -- Survival Regression

Fit a parametric survival regression model, the accelerated failure time (AFT) model (see Accelerated failure time model (Wikipedia)), based on the Weibull distribution of the survival time.

ml_aft_survival_regression(x, formula = NULL, censor_col = "censor", quantile_probabilities = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99), fit_intercept = TRUE, max_iter = 100L, tol = 1e-06, aggregation_depth = 2, quantiles_col = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("aft_survival_regression_"), ...)

ml_survival_regression(x, formula = NULL, censor_col = "censor", quantile_probabilities = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99), fit_intercept = TRUE, max_iter = 100L, tol = 1e-06, aggregation_depth = 2, quantiles_col = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("aft_survival_regression_"), response = NULL, features = NULL, ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
censor_col Censor column name. The value of this column could be 0 or 1. If the value is 1, it means the event has occurred i.e. uncensored; otherwise censored.
quantile_probabilities Quantile probabilities array. Values of the quantile probabilities array should be in the range (0, 1) and the array should be non-empty.
fit_intercept Boolean; should the model be fit with an intercept term?
max_iter The maximum number of iterations to use.
tol Param for the convergence tolerance for iterative algorithms.
aggregation_depth (Spark 2.1.0+) Suggested depth for treeAggregate (>= 2).
quantiles_col Quantiles column name. This column will output quantiles of corresponding quantileProbabilities if it is set.
features_col Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.
response (Deprecated) The name of the response column (as a length-one character vector.)
features (Deprecated) The name of features (terms) to use for the model fit.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows. ml_survival_regression() is an alias for ml_aft_survival_regression() for backwards compatibility.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_decision_tree_classifier, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
  library(survival)
  library(sparklyr)

  sc <- spark_connect(master = "local")
  ovarian_tbl <- sdf_copy_to(sc, ovarian, name = "ovarian_tbl", overwrite = TRUE)

  partitions <- ovarian_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)

  ovarian_training <- partitions$training
  ovarian_test <- partitions$test

  sur_reg <- ovarian_training %>%
    ml_aft_survival_regression(futime ~ ecog_ps + rx + age + resid_ds, censor_col = "fustat")

  pred <- ml_predict(sur_reg, ovarian_test)
  pred
}

Add a Stage to a Pipeline

Adds a stage to a pipeline.

ml_add_stage(x, stage)

Arguments

x A pipeline or a pipeline stage.
stage A pipeline stage.
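
A minimal illustrative sketch (not from the original reference): stages can be appended one at a time to a pipeline created with ml_pipeline().

if (FALSE) {
  sc <- spark_connect(master = "local")

  # Build a two-stage pipeline by adding stages explicitly
  pipeline <- ml_pipeline(sc) %>%
    ml_add_stage(ft_r_formula(sc, Species ~ .)) %>%
    ml_add_stage(ml_logistic_regression(sc))

  pipeline
}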

Spark ML -- ALS

Perform recommendation using Alternating Least Squares (ALS) matrix factorization.

ml_als(x, formula = NULL, rating_col = "rating", user_col = "user", item_col = "item", rank = 10, reg_param = 0.1, implicit_prefs = FALSE, alpha = 1, nonnegative = FALSE, max_iter = 10, num_user_blocks = 10, num_item_blocks = 10, checkpoint_interval = 10, cold_start_strategy = "nan", intermediate_storage_level = "MEMORY_AND_DISK", final_storage_level = "MEMORY_AND_DISK", uid = random_string("als_"), ...)

ml_recommend(model, type = c("items", "users"), n = 1)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details. The ALS model requires a specific formula format, please use rating_col ~ user_col + item_col.
rating_col Column name for ratings. Default: "rating"
user_col Column name for user ids. Ids must be integers. Other numeric types are supported for this column, but will be cast to integers as long as they fall within the integer value range. Default: "user"
item_col Column name for item ids. Ids must be integers. Other numeric types are supported for this column, but will be cast to integers as long as they fall within the integer value range. Default: "item"
rank Rank of the matrix factorization (positive). Default: 10
reg_param Regularization parameter.
implicit_prefs Whether to use implicit preference. Default: FALSE.
alpha Alpha parameter in the implicit preference formulation (nonnegative).
nonnegative Whether to apply nonnegativity constraints. Default: FALSE.
max_iter Maximum number of iterations.
num_user_blocks Number of user blocks (positive). Default: 10
num_item_blocks Number of item blocks (positive). Default: 10
checkpoint_interval Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10.
cold_start_strategy (Spark 2.2.0+) Strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: - "nan": predicted value for unknown ids will be NaN. - "drop": rows in the input DataFrame containing unknown ids will be dropped from the output DataFrame containing predictions. Default: "nan".
intermediate_storage_level (Spark 2.0.0+) StorageLevel for intermediate datasets. Pass in a string representation of StorageLevel. Cannot be "NONE". Default: "MEMORY_AND_DISK".
final_storage_level (Spark 2.0.0+) StorageLevel for ALS model factors. Pass in a string representation of StorageLevel. Default: "MEMORY_AND_DISK".
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; currently unused.
model An ALS model object
type What to recommend, one of items or users
n Maximum number of recommendations to return

Value

ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.

This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.

For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at https://doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here. Essentially, instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_als recommender object, which is an Estimator.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the recommender appended to the pipeline.
tbl_spark: When x is a tbl_spark, a recommender estimator is constructed then immediately fit with the input tbl_spark, returning a recommendation model, i.e. ml_als_model.

Details

ml_recommend() returns the top n users/items recommended for each item/user, for all items/users. The output has been transformed (exploded and separated) from the default Spark outputs to be more user friendly.

Examples

if (FALSE) {
  library(sparklyr)

  sc <- spark_connect(master = "local")

  movies <- data.frame(
    user   = c(1, 2, 0, 1, 2, 0),
    item   = c(1, 1, 1, 2, 2, 0),
    rating = c(3, 1, 2, 4, 5, 4)
  )
  movies_tbl <- sdf_copy_to(sc, movies)

  model <- ml_als(movies_tbl, rating ~ user + item)

  ml_predict(model, movies_tbl)

  ml_recommend(model, type = "item", 1)
}

Utility functions for LSH models

ml_approx_nearest_neighbors(model, dataset, key, num_nearest_neighbors, dist_col = "distCol")

ml_approx_similarity_join(model, dataset_a, dataset_b, threshold, dist_col = "distCol")

Arguments

model A fitted LSH model, returned by either ft_minhash_lsh() or ft_bucketed_random_projection_lsh().
dataset The dataset to search for nearest neighbors of the key.
key Feature vector representing the item to search for.
num_nearest_neighbors The maximum number of nearest neighbors.
dist_col Output column for storing the distance between each result row and the key.
dataset_a One of the datasets to join.
dataset_b Another dataset to join.
threshold The threshold for the distance of row pairs.
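
The sketch below is not part of the original reference; it assumes a bucketed random projection LSH model fit on an assembled feature column, and the column names, bucket_length, key vector, and threshold are illustrative only.

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  # Assemble the numeric columns into a single vector column
  iris_vec <- iris_tbl %>%
    ft_vector_assembler(
      input_cols = c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width"),
      output_col = "features"
    )

  # Fit an LSH model on the assembled features
  lsh <- ft_bucketed_random_projection_lsh(sc, input_col = "features",
                                           output_col = "hash", bucket_length = 2)
  lsh_model <- ml_fit(lsh, iris_vec)

  # Three rows closest to an illustrative query vector
  ml_approx_nearest_neighbors(lsh_model, iris_vec,
                              key = c(0.2, 1.4, 5.1, 3.5),
                              num_nearest_neighbors = 3)

  # Self-join rows whose pairwise distance is below the threshold
  ml_approx_similarity_join(lsh_model, iris_vec, iris_vec, threshold = 1.5)
}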

Frequent Pattern Mining -- FPGrowth

A parallel FP-growth algorithm to mine frequent itemsets.

ml_fpgrowth(x, items_col = "items", min_confidence = 0.8, min_support = 0.3, prediction_col = "prediction", uid = random_string("fpgrowth_"), ...)

ml_association_rules(model)

ml_freq_itemsets(model)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
items_col Items column name. Default: "items"
min_confidence Minimal confidence for generating Association Rule. min_confidence will not affect the mining for frequent itemsets, but will affect the association rules generation. Default: 0.8
min_support Minimal support level of the frequent pattern, in the range [0.0, 1.0]. Any pattern that appears more than (min_support * size-of-the-dataset) times will be output in the frequent itemsets. Default: 0.3
prediction_col Prediction column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; currently unused.
model A fitted FPGrowth model returned by ml_fpgrowth()
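
A hedged sketch (not from the original reference): ml_fpgrowth() expects an array column of items, which ft_tokenizer() can produce from a delimited string column. The transaction data and parameter values here are illustrative only.

if (FALSE) {
  library(sparklyr)
  library(dplyr)

  sc <- spark_connect(master = "local")

  # Hypothetical transactions; tokenizing the space-separated strings yields
  # the array-typed items column that FP-growth expects
  transactions <- data.frame(text = c("a b c", "a b", "a c", "b"))
  transactions_tbl <- sdf_copy_to(sc, transactions, overwrite = TRUE) %>%
    ft_tokenizer("text", "items")

  fp_model <- ml_fpgrowth(transactions_tbl, items_col = "items",
                          min_support = 0.5, min_confidence = 0.6)

  ml_freq_itemsets(fp_model)
  ml_association_rules(fp_model)
}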

Spark ML - Evaluators

A set of functions to calculate performance metrics for prediction models. Also see the Spark ML Documentation: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.package

ml_binary_classification_evaluator(x, label_col = "label", raw_prediction_col = "rawPrediction", metric_name = "areaUnderROC", uid = random_string("binary_classification_evaluator_"), ...)

ml_binary_classification_eval(x, label_col = "label", prediction_col = "prediction", metric_name = "areaUnderROC")

ml_multiclass_classification_evaluator(x, label_col = "label", prediction_col = "prediction", metric_name = "f1", uid = random_string("multiclass_classification_evaluator_"), ...)

ml_classification_eval(x, label_col = "label", prediction_col = "prediction", metric_name = "f1")

ml_regression_evaluator(x, label_col = "label", prediction_col = "prediction", metric_name = "rmse", uid = random_string("regression_evaluator_"), ...)

Arguments

x A spark_connection object or a tbl_spark containing label and prediction columns. The latter should be the output of sdf_predict.
label_col Name of the column that contains the true labels or values.
raw_prediction_col Raw prediction (a.k.a. confidence) column name.
metric_name The performance metric. See details.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; currently unused.
prediction_col Name of the column that contains the predicted label or value, NOT the scored probability. The column should be of type Double.

Value

The calculated performance metric

Details

The following metrics are supported:
Binary Classification: areaUnderROC (default) or areaUnderPR (not available in Spark 2.X).
Multiclass Classification: f1 (default), precision, recall, weightedPrecision, weightedRecall or accuracy; for Spark 2.X: f1 (default), weightedPrecision, weightedRecall or accuracy.
Regression: rmse (root mean squared error, default), mse (mean squared error), r2, or mae (mean absolute error).
ml_binary_classification_eval() is an alias for ml_binary_classification_evaluator() for backwards compatibility. ml_classification_eval() is an alias for ml_multiclass_classification_evaluator() for backwards compatibility.

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

  partitions <- mtcars_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)

  mtcars_training <- partitions$training
  mtcars_test <- partitions$test

  # for multiclass classification
  rf_model <- mtcars_training %>%
    ml_random_forest(cyl ~ ., type = "classification")
  pred <- ml_predict(rf_model, mtcars_test)
  ml_multiclass_classification_evaluator(pred)

  # for regression
  rf_model <- mtcars_training %>%
    ml_random_forest(cyl ~ ., type = "regression")
  pred <- ml_predict(rf_model, mtcars_test)
  ml_regression_evaluator(pred, label_col = "cyl")

  # for binary classification
  rf_model <- mtcars_training %>%
    ml_random_forest(am ~ gear + carb, type = "classification")
  pred <- ml_predict(rf_model, mtcars_test)
  ml_binary_classification_evaluator(pred)
}

Spark ML -- Bisecting K-Means Clustering

A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.

ml_bisecting_kmeans(x, formula = NULL, k = 4, max_iter = 20, seed = NULL, min_divisible_cluster_size = 1, features_col = "features", prediction_col = "prediction", uid = random_string("bisecting_bisecting_kmeans_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
k The number of clusters to create
max_iter The maximum number of iterations to use.
seed A random seed. Set this value if you need your results to be reproducible across repeated calls.
min_divisible_cluster_size The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0).
features_col Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments, see Details.

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the clustering estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning a clustering model.
tbl_spark, with formula or features specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the estimator. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model. This signature does not apply to ml_lda().

See also

See http://spark.apache.org/docs/latest/ml-clustering.html for more information on the set of clustering algorithms. Other ml clustering algorithms: ml_gaussian_mixture, ml_kmeans, ml_lda

Examples

if (FALSE) {
  library(dplyr)

  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  iris_tbl %>%
    select(-Species) %>%
    ml_bisecting_kmeans(k = 4, Species ~ .)
}

Wrap a Spark ML JVM object

Identifies the associated sparklyr ML constructor for the JVM object by inspecting its class and performing a lookup. The lookup table is specified by the `sparkml/class_mapping.json` files of sparklyr and the loaded extensions.

ml_call_constructor(jobj)

Arguments

jobj The jobj for the pipeline stage.

Chi-square hypothesis testing for categorical data.

Conduct Pearson's independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.

ml_chisquare_test(x, features, label)

Arguments

x A tbl_spark.
features The name(s) of the feature columns. This can also be the name of a single vector column created using ft_vector_assembler().
label The name of the label column.

Value

A data frame with one row for each (feature, label) pair with p-values, degrees of freedom, and test statistics.

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width")

  ml_chisquare_test(iris_tbl, features = features, label = "Species")
}

Spark ML - Clustering Evaluator

Evaluator for clustering results. The metric computes the Silhouette measure using the squared Euclidean distance. The Silhouette is a measure of the consistency within clusters. It ranges between -1 and 1, where a value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from the points of the other clusters.

ml_clustering_evaluator(x, features_col = "features", prediction_col = "prediction", metric_name = "silhouette", uid = random_string("clustering_evaluator_"), ...)

Arguments

x A spark_connection object or a tbl_spark containing label and prediction columns. The latter should be the output of sdf_predict.
features_col Name of features column.
prediction_col Name of the prediction column.
metric_name The performance metric. Currently supports "silhouette".
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; currently unused.

Value

The calculated performance metric

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  partitions <- iris_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)

  iris_training <- partitions$training
  iris_test <- partitions$test

  formula <- Species ~ .

  # Train the models
  kmeans_model <- ml_kmeans(iris_training, formula = formula)
  b_kmeans_model <- ml_bisecting_kmeans(iris_training, formula = formula)
  gmm_model <- ml_gaussian_mixture(iris_training, formula = formula)

  # Predict
  pred_kmeans <- ml_predict(kmeans_model, iris_test)
  pred_b_kmeans <- ml_predict(b_kmeans_model, iris_test)
  pred_gmm <- ml_predict(gmm_model, iris_test)

  # Evaluate
  ml_clustering_evaluator(pred_kmeans)
  ml_clustering_evaluator(pred_b_kmeans)
  ml_clustering_evaluator(pred_gmm)
}

Constructors for `ml_model` Objects

Functions for developers writing extensions for Spark ML. These functions are constructors for `ml_model` objects that are returned when using the formula interface.

new_ml_model_prediction(pipeline_model, formula, dataset, label_col, features_col, ..., class = character())

new_ml_model(pipeline_model, formula, dataset, ..., class = character())

new_ml_model_classification(pipeline_model, formula, dataset, label_col, features_col, predicted_label_col, ..., class = character())

new_ml_model_regression(pipeline_model, formula, dataset, label_col, features_col, ..., class = character())

new_ml_model_clustering(pipeline_model, formula, dataset, features_col, ..., class = character())

ml_supervised_pipeline(predictor, dataset, formula, features_col, label_col)

ml_clustering_pipeline(predictor, dataset, formula, features_col)

ml_construct_model_supervised(constructor, predictor, formula, dataset, features_col, label_col, ...)

ml_construct_model_clustering(constructor, predictor, formula, dataset, features_col, ...)

Arguments

pipeline_model The pipeline model object returned by `ml_supervised_pipeline()`.
formula The formula used for data preprocessing
dataset The training dataset.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
features_col Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by ft_r_formula.
class Name of the subclass.
predictor The pipeline stage corresponding to the ML algorithm.
constructor The constructor function for the `ml_model`.

Compute correlation matrix

ml_corr(x, columns = NULL, method = c("pearson", "spearman"))

Arguments

x A tbl_spark.
columns The names of the columns to calculate correlations of. If only one column is specified, it must be a vector column (for example, assembled using ft_vector_assembler()).
method The method to use, either "pearson" or "spearman".

Value

A correlation matrix organized as a data frame.

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width")

  ml_corr(iris_tbl, columns = features, method = "pearson")
}

Spark ML -- Tuning

Perform hyper-parameter tuning using either K-fold cross validation or train-validation split.

ml_sub_models(model)

ml_validation_metrics(model)

ml_cross_validator(x, estimator = NULL, estimator_param_maps = NULL, evaluator = NULL, num_folds = 3, collect_sub_models = FALSE, parallelism = 1, seed = NULL, uid = random_string("cross_validator_"), ...)

ml_train_validation_split(x, estimator = NULL, estimator_param_maps = NULL, evaluator = NULL, train_ratio = 0.75, collect_sub_models = FALSE, parallelism = 1, seed = NULL, uid = random_string("train_validation_split_"), ...)

Arguments

model A cross validation or train-validation-split model.
x A spark_connection, ml_pipeline, or a tbl_spark.
estimator A ml_estimator object.
estimator_param_maps A named list of stages and hyper-parameter sets to tune. See details.
evaluator A ml_evaluator object, see ml_evaluator.
num_folds Number of folds for cross validation. Must be >= 2. Default: 3
collect_sub_models Whether to collect a list of sub-models trained during tuning. If set to FALSE, then only the single best sub-model will be available after fitting. If set to TRUE, then all sub-models will be available. Warning: For large models, collecting all sub-models can cause OOMs on the Spark driver.
parallelism The number of threads to use when running parallel algorithms. Default is 1 for serial execution.
seed A random seed. Set this value if you need your results to be reproducible across repeated calls.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; currently unused.
train_ratio Ratio between train and validation data. Must be between 0 and 1. Default: 0.75

Value

The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_cross_validator or ml_train_validation_split object.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the tuning estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, a tuning estimator is constructed then immediately fit with the input tbl_spark, returning a ml_cross_validation_model or a ml_train_validation_split_model object.
For cross validation, ml_sub_models() returns a nested list of models, where the first layer represents fold indices and the second layer represents param maps. For train-validation split, ml_sub_models() returns a list of models, corresponding to the order of the estimator param maps.
ml_validation_metrics() returns a data frame of performance metrics and hyperparameter combinations.

Details

ml_cross_validator() performs k-fold cross validation while ml_train_validation_split() performs tuning on one pair of train and validation datasets.

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  # Create a pipeline
  pipeline <- ml_pipeline(sc) %>%
    ft_r_formula(Species ~ .) %>%
    ml_random_forest_classifier()

  # Specify hyperparameter grid
  grid <- list(
    random_forest = list(
      num_trees = c(5, 10),
      max_depth = c(5, 10),
      impurity = c("entropy", "gini")
    )
  )

  # Create the cross validator object
  cv <- ml_cross_validator(
    sc,
    estimator = pipeline,
    estimator_param_maps = grid,
    evaluator = ml_multiclass_classification_evaluator(sc),
    num_folds = 3,
    parallelism = 4
  )

  # Train the models
  cv_model <- ml_fit(cv, iris_tbl)

  # Print the metrics
  ml_validation_metrics(cv_model)
}

Default stop words

Loads the default stop words for the given language.

ml_default_stop_words(sc, language = c("english", "danish", "dutch", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish", "turkish"), ...)

Arguments

sc A spark_connection
language A character string.
... Optional arguments; currently unused.

Value

A list of stop words.

Details

Supported languages: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, swedish, turkish. Defaults to English. See http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/ for more details

See also

ft_stop_words_remover
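
A brief illustrative sketch (not from the original reference), combining ml_default_stop_words() with ft_stop_words_remover(); the sample sentence is hypothetical:

if (FALSE) {
  sc <- spark_connect(master = "local")

  # Inspect the default English stop words
  head(ml_default_stop_words(sc, language = "english"))

  # Use the default list explicitly when removing stop words
  sentences <- data.frame(text = c("The quick brown fox jumps over the lazy dog"))
  sentences_tbl <- sdf_copy_to(sc, sentences, overwrite = TRUE) %>%
    ft_tokenizer("text", "words") %>%
    ft_stop_words_remover("words", "filtered",
                          stop_words = ml_default_stop_words(sc, language = "english"))

  sentences_tbl
}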

Evaluate the Model on a Validation Set

Compute performance metrics.

ml_evaluate(x, dataset)

# S3 method for ml_model_logistic_regression
ml_evaluate(x, dataset)

# S3 method for ml_logistic_regression_model
ml_evaluate(x, dataset)

# S3 method for ml_model_linear_regression
ml_evaluate(x, dataset)

# S3 method for ml_linear_regression_model
ml_evaluate(x, dataset)

# S3 method for ml_model_generalized_linear_regression
ml_evaluate(x, dataset)

# S3 method for ml_generalized_linear_regression_model
ml_evaluate(x, dataset)

# S3 method for ml_evaluator
ml_evaluate(x, dataset)

Arguments

x An ML model object or an evaluator object.
dataset The dataset on which to validate the model.
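
The sketch below (not from the original reference) shows one plausible call pattern based on the logistic regression method listed above: evaluating a fitted model on held-out data returns a summary of metrics.

if (FALSE) {
  sc <- spark_connect(master = "local")
  mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

  partitions <- mtcars_tbl %>%
    sdf_random_split(training = 0.7, test = 0.3, seed = 1111)

  lr_model <- partitions$training %>%
    ml_logistic_regression(am ~ gear + carb)

  # Returns an evaluation summary for the held-out data
  ml_evaluate(lr_model, partitions$test)
}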

Spark ML - Feature Importance for Tree Models

ml_feature_importances(model, ...)

ml_tree_feature_importance(model, ...)

Arguments

model A decision tree-based model.
... Optional arguments; currently unused.

Value

For ml_model, a sorted data frame with feature labels and their relative importance. For ml_prediction_model, a vector of relative importances.
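
An illustrative sketch (not from the original reference) using a random forest fit through the formula interface:

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  rf_model <- iris_tbl %>%
    ml_random_forest(Species ~ ., type = "classification")

  # Sorted data frame of features and their relative importance
  ml_feature_importances(rf_model)
}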

Feature Transformation -- Word2Vec (Estimator)

Word2Vec transforms a word into a code for further natural language processing or machine learning processing.

ft_word2vec(x, input_col = NULL, output_col = NULL, vector_size = 100, min_count = 5, max_sentence_length = 1000, num_partitions = 1, step_size = 0.025, max_iter = 1, seed = NULL, uid = random_string("word2vec_"), ...)

ml_find_synonyms(model, word, num)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
vector_size The dimension of the code that you want to transform from words. Default: 100
min_count The minimum number of times a token must appear to be included in the word2vec model's vocabulary. Default: 5
max_sentence_length (Spark 2.0.0+) Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to max_sentence_length size. Default: 1000
num_partitions Number of partitions for sentences of words. Default: 1
step_size Param for Step size to be used for each iteration of optimization (> 0).
max_iter The maximum number of iterations to use.
seed A random seed. Set this value if you need your results to be reproducible across repeated calls.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.
model A fitted Word2Vec model, returned by ft_word2vec().
word A word, as a length-one character vector.
num Number of words closest in similarity to the given word to find.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark. ml_find_synonyms() returns a DataFrame of synonyms and cosine similarities.

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer
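
An illustrative sketch, not from the original page: it builds a tiny toy corpus (docs is an assumed name), fits the Word2Vec estimator explicitly with ml_fit() so the fitted model can be passed to ml_find_synonyms(), and uses small values for vector_size and min_count only because the corpus is tiny.

sc <- spark_connect(master = "local")
docs <- data.frame(text = c("spark is fast", "sparklyr connects r to spark", "r is fun"))
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)

tokenized_tbl <- docs_tbl %>% ft_tokenizer("text", "words")

w2v_model <- ml_fit(
  ft_word2vec(sc, input_col = "words", output_col = "vectors",
              vector_size = 16, min_count = 1),
  tokenized_tbl
)

ml_find_synonyms(w2v_model, "spark", num = 2)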

Spark ML -- Transform, fit, and predict methods (ml_ interface)

Methods for transformation, fit, and prediction. These are mirrors of the corresponding sdf-transform-methods.

is_ml_transformer(x)

is_ml_estimator(x)

ml_fit(x, dataset, ...)

ml_transform(x, dataset, ...)

ml_fit_and_transform(x, dataset, ...)

ml_predict(x, dataset, ...)

# S3 method for ml_model_classification
ml_predict(x, dataset, probability_prefix = "probability_", ...)

Arguments

x A ml_estimator, ml_transformer (or a list thereof), or ml_model object.
dataset A tbl_spark.
... Optional arguments; currently unused.
probability_prefix String used to prepend the class probability output columns.

Value

When x is an estimator, ml_fit() returns a transformer whereas ml_fit_and_transform() returns a transformed dataset. When x is a transformer, ml_transform() and ml_predict() return a transformed dataset. When ml_predict() is called on a ml_model object, additional columns (e.g. probabilities in case of classification models) are appended to the transformed output for the user's convenience.

Details

These methods are mirrors of the corresponding sdf-transform-methods.
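
An illustrative sketch, not from the original page, showing the estimator/transformer distinction on a simple pipeline and the iris data:

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_logistic_regression()

fitted <- ml_fit(pipeline, iris_tbl)       # estimator -> pipeline model (transformer)
ml_transform(fitted, iris_tbl)             # transformer -> transformed tbl_spark
ml_fit_and_transform(pipeline, iris_tbl)   # both steps in one call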

Spark ML -- Gaussian Mixture clustering.

This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite. Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than tol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.

ml_gaussian_mixture(x, formula = NULL, k = 2, max_iter = 100, tol = 0.01,
  seed = NULL, features_col = "features", prediction_col = "prediction",
  probability_col = "probability", uid = random_string("gaussian_mixture_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
k The number of clusters to create
max_iter The maximum number of iterations to use.
tol Param for the convergence tolerance for iterative algorithms.
seed A random seed. Set this value if you need your results to be reproducible across repeated calls.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
probability_col Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments, see Details.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the clustering estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning a clustering model. tbl_spark, with formula or features specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the estimator. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model. This signature does not apply to ml_lda().

See also

See http://spark.apache.org/docs/latest/ml-clustering.html for more information on the set of clustering algorithms. Other ml clustering algorithms: ml_bisecting_kmeans, ml_kmeans, ml_lda

Examples

if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

gmm_model <- ml_gaussian_mixture(iris_tbl, Species ~ .)
pred <- sdf_predict(iris_tbl, gmm_model)
ml_clustering_evaluator(pred)
}

Spark ML -- ML Params

Helper methods for working with parameters for ML objects.

ml_is_set(x, param, ...)

ml_param_map(x, ...)

ml_param(x, param, allow_null = FALSE, ...)

ml_params(x, params = NULL, allow_null = FALSE, ...)

Arguments

x A Spark ML object, either a pipeline stage or an evaluator.
param The parameter to extract or set.
... Optional arguments; currently unused.
allow_null Whether to allow NULL results when extracting parameters. If FALSE, an error will be thrown if the specified parameter is not found. Defaults to FALSE.
params A vector of parameters to extract.
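
An illustrative sketch, not from the original page: inspecting parameters on a logistic regression estimator created against a local connection (parameter names are the sparklyr snake_case names).

sc <- spark_connect(master = "local")
lr <- ml_logistic_regression(sc, max_iter = 25)

ml_param(lr, "max_iter")                    # a single parameter value
ml_params(lr, c("max_iter", "reg_param"))   # several parameters at once
ml_param_map(lr)                            # all parameters as a named list
ml_is_set(lr, "elastic_net_param")          # whether the parameter is set on the underlying Spark object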

Spark ML -- Isotonic Regression

Currently implemented using parallelized pool adjacent violators algorithm. Only univariate (single feature) algorithm supported.

ml_isotonic_regression(x, formula = NULL, feature_index = 0, isotonic = TRUE,
  weight_col = NULL, features_col = "features", label_col = "label",
  prediction_col = "prediction", uid = random_string("isotonic_regression_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
feature_index Index of the feature if features_col is a vector column (default: 0), no effect otherwise.
isotonic Whether the output sequence should be isotonic/increasing (true) or antitonic/decreasing (false). Default: true
weight_col The name of the column to use as weights for the model fit.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline. tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model. tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_gbt_classifier, ml_generalized_linear_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

partitions <- iris_tbl %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test

iso_res <- iris_tbl %>%
  ml_isotonic_regression(Petal_Length ~ Petal_Width)

pred <- ml_predict(iso_res, iris_test)

pred
}

Feature Transformation -- StringIndexer (Estimator)

A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels), ordered by label frequencies. So the most frequent label gets index 0. This function is the inverse of ft_index_to_string.

ft_string_indexer(x, input_col = NULL, output_col = NULL,
  handle_invalid = "error", string_order_type = "frequencyDesc",
  uid = random_string("string_indexer_"), ...)

ml_labels(model)

ft_string_indexer_model(x, input_col = NULL, output_col = NULL, labels,
  handle_invalid = "error", uid = random_string("string_indexer_model_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
handle_invalid (Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error"
string_order_type (Spark 2.3+) How to order labels of string column. The first label after ordering is assigned an index of 0. Options are "frequencyDesc", "frequencyAsc", "alphabetDesc", and "alphabetAsc". Defaults to "frequencyDesc".
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.
model A fitted StringIndexer model returned by ft_string_indexer()
labels Vector of labels, corresponding to indices to be assigned.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark. ml_labels() returns a vector of labels, corresponding to indices to be assigned.

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. ft_index_to_string Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
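
An illustrative sketch, not from the original page, using the iris data: the transformer form appends the indexed column directly, while fitting the estimator explicitly keeps the model so ml_labels() can recover the label ordering.

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# Applied to a tbl_spark, the indexed column is appended to the table
iris_tbl %>% ft_string_indexer("Species", "species_idx")

# Fit the estimator explicitly to keep the model and inspect its labels
indexer_model <- ml_fit(ft_string_indexer(sc, "Species", "species_idx"), iris_tbl)
ml_labels(indexer_model)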

Spark ML -- LinearSVC

Perform classification using linear support vector machines (SVM). This binary classifier optimizes the Hinge Loss using the OWLQN optimizer. Only supports L2 regularization currently.

ml_linear_svc(x, formula = NULL, fit_intercept = TRUE, reg_param = 0,
  max_iter = 100, standardization = TRUE, weight_col = NULL, tol = 1e-06,
  threshold = 0, aggregation_depth = 2, features_col = "features",
  label_col = "label", prediction_col = "prediction",
  raw_prediction_col = "rawPrediction", uid = random_string("linear_svc_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see ft_r_formula for details.
fit_intercept Boolean; should the model be fit with an intercept term?
reg_param Regularization parameter (aka lambda)
max_iter The maximum number of iterations to use.
standardization Whether to standardize the training features before fitting the model.
weight_col The name of the column to use as weights for the model fit.
tol Param for the convergence tolerance for iterative algorithms.
threshold in binary classification prediction, in range [0, 1].
aggregation_depth (Spark 2.1.0+) Suggested depth for treeAggregate (>= 2).
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
prediction_col Prediction column name.
raw_prediction_col Raw prediction (a.k.a. confidence) column name.
uid A character string used to uniquely identify the ML estimator.
... Optional arguments; see Details.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline. tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model. tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.

Details

When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.

See also

See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms. Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_gbt_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier

Examples

if (FALSE) {
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

partitions <- iris_tbl %>%
  filter(Species != "setosa") %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 1111)
iris_training <- partitions$training
iris_test <- partitions$test

svc_model <- iris_training %>%
  ml_linear_svc(Species ~ .)

pred <- ml_predict(svc_model, iris_test)

ml_binary_classification_evaluator(pred)
}

Spark ML -- Model Persistence

Save/load Spark ML objects.

ml_save(x, path, overwrite = FALSE, ...)

# S3 method for ml_model
ml_save(x, path, overwrite = FALSE, type = c("pipeline_model", "pipeline"), ...)

ml_load(sc, path)

Arguments

x A ML object, which could be a ml_pipeline_stage or a ml_model
path The path where the object is to be serialized/deserialized.
overwrite Whether to overwrite the existing path, defaults to FALSE.
... Optional arguments; currently unused.
type Whether to save the pipeline model or the pipeline.
sc A Spark connection.

Value

ml_save() serializes a Spark object into a format that can be read back into sparklyr or by the Scala or PySpark APIs. When called on ml_model objects, i.e. those that were created via the tbl_spark - formula signature, the associated pipeline model is serialized. In other words, the saved model contains both the data processing (RFormulaModel) stage and the machine learning stage. ml_load() reads a saved Spark object into sparklyr. It calls the correct Scala load method based on parsing the saved metadata. Note that a PipelineModel object saved from a sparklyr ml_model via ml_save() will be read back in as an ml_pipeline_model, rather than the ml_model object.
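
An illustrative sketch, not from the original page: saving a model fit via the formula interface to a temporary directory and reading it back.

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

lr_model <- ml_logistic_regression(iris_tbl, Species ~ .)

model_path <- file.path(tempdir(), "iris_lr_model")
ml_save(lr_model, model_path, overwrite = TRUE)

# Read back in as a ml_pipeline_model (see the note above)
reloaded <- ml_load(sc, model_path)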

Spark ML -- Pipelines

Create Spark ML Pipelines.

ml_pipeline(x, ..., uid = random_string("pipeline_"))

Arguments

x Either a spark_connection or a ml_pipeline_stage object.
... ml_pipeline_stage objects.
uid A character string used to uniquely identify the ML estimator.

Value

When x is a spark_connection, ml_pipeline() returns an empty pipeline object. When x is a ml_pipeline_stage, ml_pipeline() returns an ml_pipeline with the stages set to x and any transformers or estimators given in ....
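
An illustrative sketch, not from the original page: an empty pipeline is created from the connection and stages are appended with the pipe.

sc <- spark_connect(master = "local")

pipeline <- ml_pipeline(sc) %>%
  ft_string_indexer("Species", "label") %>%
  ft_vector_assembler(
    c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    "features"
  ) %>%
  ml_logistic_regression()

pipeline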

Spark ML -- Pipeline stage extraction

Extraction of stages from a Pipeline or PipelineModel object.

ml_stage(x, stage)

ml_stages(x, stages = NULL)

Arguments

x A ml_pipeline or a ml_pipeline_model object
stage The UID of a stage in the pipeline.
stages The UIDs of stages in the pipeline as a character vector.

Value

For ml_stage(): The stage specified. For ml_stages(): A list of stages. If stages is not set, the function returns all stages of the pipeline in a list.
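
An illustrative sketch, not from the original page: stages are listed with ml_stages() and a single stage is retrieved by its UID (obtained here via ml_uid()).

sc <- spark_connect(master = "local")

pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_logistic_regression()

stages <- ml_stages(pipeline)             # list of all stages
ml_stage(pipeline, ml_uid(stages[[1]]))   # a single stage, looked up by UID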

Standardize Formula Input for `ml_model`

Generates a formula string from user inputs, to be used in the `ml_model` constructor.

ml_standardize_formula(formula = NULL, response = NULL, features = NULL)

Arguments

formula The `formula` argument.
response The `response` argument.
features The `features` argument.

Spark ML -- Extraction of summary metrics

Extracts a metric from the summary object of a Spark ML model.

ml_summary(x, metric = NULL, allow_null = FALSE)

Arguments

x A Spark ML model that has a summary.
metric The name of the metric to extract. If not set, returns the summary object.
allow_null Whether null results are allowed when the metric is not found in the summary.

Spark ML -- UID

Extracts the UID of an ML object.

ml_uid(x)

Arguments

x A Spark ML object

Feature Transformation -- CountVectorizer (Estimator)

Extracts a vocabulary from document collections.

ft_count_vectorizer(x, input_col = NULL, output_col = NULL, binary = FALSE,
  min_df = 1, min_tf = 1, vocab_size = 2^18,
  uid = random_string("count_vectorizer_"), ...)

ml_vocabulary(model)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
binary Binary toggle to control the output vector values. If TRUE, all nonzero counts (after min_tf filter applied) are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default: FALSE
min_df Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer greater than or equal to 1, this specifies the number of documents the term must appear in; if this is a double in [0,1), then this specifies the fraction of documents. Default: 1.
min_tf Filter to ignore rare words in a document. For each document, terms with frequency/count less than the given threshold are ignored. If this is an integer greater than or equal to 1, then this specifies a count (of times the term must appear in the document); if this is a double in [0,1), then this specifies a fraction (out of the document's token count). Default: 1.
vocab_size Build a vocabulary that only considers the top vocab_size terms ordered by term frequency across the corpus. Default: 2^18.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.
model A ml_count_vectorizer_model.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark. ml_vocabulary() returns the vocabulary that was built, as a character vector.

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
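
An illustrative sketch, not from the original page, using a tiny toy corpus (docs is an assumed name): the transformer form appends the term-count vectors, and fitting the estimator explicitly exposes the learned vocabulary via ml_vocabulary().

sc <- spark_connect(master = "local")
docs <- data.frame(text = c("spark and sparklyr", "sparklyr and dplyr and spark"))
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)

tokenized_tbl <- docs_tbl %>% ft_tokenizer("text", "tokens")

# Applied to a tbl_spark, the term-count vectors are appended directly
tokenized_tbl %>% ft_count_vectorizer("tokens", "term_counts")

# Fit the estimator explicitly to inspect the learned vocabulary
cv_model <- ml_fit(ft_count_vectorizer(sc, "tokens", "term_counts"), tokenized_tbl)
ml_vocabulary(cv_model)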

Feature Transformation -- Binarizer (Transformer)

Apply thresholding to a column, such that values less than or equal to the threshold are assigned the value 0.0, and values greater than the threshold are assigned the value 1.0. Column output is numeric for compatibility with other modeling functions.

ft_binarizer(x, input_col, output_col, threshold = 0,
  uid = random_string("binarizer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
threshold Threshold used to binarize continuous features.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Examples

if (FALSE) {
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_binarizer(input_col = "Sepal_Length",
               output_col = "Sepal_Length_bin",
               threshold = 5) %>%
  select(Sepal_Length, Sepal_Length_bin, Species)
}

Feature Transformation -- Bucketizer (Transformer)

Similar to R's cut function, this transforms a numeric column into a discretized column, with breaks specified through the splits parameter.

ft_bucketizer(x, input_col = NULL, output_col = NULL, splits = NULL,
  input_cols = NULL, output_cols = NULL, splits_array = NULL,
  handle_invalid = "error", uid = random_string("bucketizer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
splits A numeric vector of cutpoints, indicating the bucket boundaries.
input_cols Names of input columns.
output_cols Names of output columns.
splits_array Parameter for specifying multiple splits parameters. Each element in this array can be used to map continuous features into buckets.
handle_invalid (Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error"
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Examples

if (FALSE) {
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_bucketizer(input_col = "Sepal_Length",
                output_col = "Sepal_Length_bucket",
                splits = c(0, 4.5, 5, 8)) %>%
  select(Sepal_Length, Sepal_Length_bucket, Species)
}

Feature Transformation -- ChiSqSelector (Estimator)

Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.

ft_chisq_selector(x, features_col = "features", output_col = NULL,
  label_col = "label", selector_type = "numTopFeatures", fdr = 0.05,
  fpr = 0.05, fwe = 0.05, num_top_features = 50, percentile = 0.1,
  uid = random_string("chisq_selector_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
output_col The name of the output column.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
selector_type (Spark 2.1.0+) The selector type of the ChisqSelector. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".
fdr (Spark 2.2.0+) The upper bound of the expected false discovery rate. Only applicable when selector_type = "fdr". Default value is 0.05.
fpr (Spark 2.1.0+) The highest p-value for features to be kept. Only applicable when selector_type = "fpr". Default value is 0.05.
fwe (Spark 2.2.0+) The upper bound of the expected family-wise error rate. Only applicable when selector_type = "fwe". Default value is 0.05.
num_top_features Number of features that selector will select, ordered by ascending p-value. If the number of features is less than num_top_features, then this will select all features. Only applicable when selector_type = "numTopFeatures". The default value of num_top_features is 50.
percentile (Spark 2.1.0+) Percentile of features that selector will select, ordered by statistics value descending. Only applicable when selector_type = "percentile". Default value is 0.1.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
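
An illustrative sketch, not from the original page: ft_r_formula() is used to produce the features and label columns that the selector expects, then the two most predictive features are kept.

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_r_formula(Species ~ .) %>%   # produces "features" and "label" columns
  ft_chisq_selector(
    features_col = "features", output_col = "selected",
    label_col = "label", num_top_features = 2
  )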

Feature Transformation -- Discrete Cosine Transform (DCT) (Transformer)

A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).

ft_dct(x, input_col = NULL, output_col = NULL, inverse = FALSE,
  uid = random_string("dct_"), ...)

ft_discrete_cosine_transform(x, input_col, output_col, inverse = FALSE,
  uid = random_string("dct_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
inverse Indicates whether to perform the inverse DCT (TRUE) or forward DCT (FALSE).
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

ft_discrete_cosine_transform() is an alias for ft_dct for backwards compatibility.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
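
An illustrative sketch, not from the original page: the numeric columns are assembled into a vector column first, since ft_dct() operates on a vector input.

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_vector_assembler(
    c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    "features"
  ) %>%
  ft_dct("features", "features_dct")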

Feature Transformation -- ElementwiseProduct (Transformer)

Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.

ft_elementwise_product(x, input_col = NULL, output_col = NULL,
  scaling_vec = NULL, uid = random_string("elementwise_product_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
scaling_vec the vector to multiply with input vectors
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
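
An illustrative sketch, not from the original page: two columns are assembled into a vector and then scaled element-wise, doubling the first component and halving the second.

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_vector_assembler(c("Sepal_Length", "Sepal_Width"), "features") %>%
  ft_elementwise_product("features", "scaled", scaling_vec = c(2, 0.5))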

Feature Transformation -- FeatureHasher (Transformer)

ft_feature_hasher(x, input_cols = NULL, output_col = NULL, num_features = 2^18,
  categorical_cols = NULL, uid = random_string("feature_hasher_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_cols Names of input columns.
output_col Name of output column.
num_features Number of features. Defaults to 2^18.
categorical_cols Numeric columns to treat as categorical features. By default only string and boolean columns are treated as categorical, so this param can be used to explicitly specify the numerical columns to treat as categorical.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector. The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.
- String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with drop_last=FALSE).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector). The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the num_features parameter; otherwise the features will not be mapped evenly to the vector indices.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
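
An illustrative sketch, not from the original page (ft_feature_hasher() requires Spark 2.3+): a string column and a numeric column are hashed into a single feature vector, with num_features kept a power of two as advised above.

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_feature_hasher(
    input_cols = c("Species", "Sepal_Length"),
    output_col = "hashed_features",
    num_features = 2^10
  )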

Feature Transformation -- HashingTF (Transformer)

Maps a sequence of terms to their term frequencies using the hashing trick.

ft_hashing_tf(x, input_col = NULL, output_col = NULL, binary = FALSE,
  num_features = 2^18, uid = random_string("hashing_tf_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
binary Binary toggle to control term frequency counts. If true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. (default = FALSE)
num_features Number of features. Should be greater than 0. (default = 2^18)
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
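
An illustrative sketch, not from the original page, using a tiny toy corpus (docs is an assumed name): text is tokenized and the tokens are hashed into term-frequency vectors.

sc <- spark_connect(master = "local")
docs <- data.frame(text = c("the quick brown fox", "the lazy dog"))
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)

docs_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_hashing_tf("tokens", "term_frequencies", num_features = 2^10)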

Feature Transformation -- IDF (Estimator)

Compute the Inverse Document Frequency (IDF) given a collection of documents.

ft_idf(x, input_col = NULL, output_col = NULL, min_doc_freq = 0,
  uid = random_string("idf_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
min_doc_freq The minimum number of documents in which a term should appear. Default: 0
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
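
An illustrative sketch, not from the original page: a typical TF-IDF chain on a tiny toy corpus (docs is an assumed name), where ft_hashing_tf() produces the term-frequency vectors that ft_idf() rescales.

sc <- spark_connect(master = "local")
docs <- data.frame(text = c("spark spark sparklyr", "r and spark"))
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)

docs_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_hashing_tf("tokens", "tf", num_features = 2^10) %>%
  ft_idf("tf", "tf_idf")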

Feature Transformation -- Imputer (Estimator)

Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. This function requires Spark 2.2.0+.

ft_imputer(x, input_cols = NULL, output_cols = NULL, missing_value = NULL,
  strategy = "mean", uid = random_string("imputer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_cols The names of the input columns
output_cols The names of the output columns.
missing_value The placeholder for the missing values. All occurrences of missing_value will be imputed. Note that null values are always treated as missing.
strategy The imputation strategy. Currently only "mean" and "median" are supported. If "mean", then replace missing values using the mean value of the feature. If "median", then replace missing values using the approximate median value of the feature. Default: mean
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
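
An illustrative sketch, not from the original page (Spark 2.2.0+): a small data frame with missing values (df is an assumed name) is copied to Spark and imputed with the median strategy.

sc <- spark_connect(master = "local")
df <- data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 40))
df_tbl <- sdf_copy_to(sc, df, name = "df_tbl", overwrite = TRUE)

df_tbl %>%
  ft_imputer(
    input_cols = c("a", "b"),
    output_cols = c("a_imputed", "b_imputed"),
    strategy = "median"
  )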

Feature Transformation -- IndexToString (Transformer)

A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes). This function is the inverse of ft_string_indexer.

ft_index_to_string(x, input_col = NULL, output_col = NULL, labels = NULL,
  uid = random_string("index_to_string_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
labels Optional param for array of labels specifying index-string mapping.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. ft_string_indexer Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
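
An illustrative sketch, not from the original page: the Species column is indexed and then mapped back to strings using the ML attributes carried by the indexed column.

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_string_indexer("Species", "species_idx") %>%
  ft_index_to_string("species_idx", "species_label")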

Feature Transformation -- Interaction (Transformer)

Implements the feature interaction transform. This transformer takes in Double and Vector type columns and outputs a flattened vector of their feature interactions. To handle interaction, we first one-hot encode any nominal features. Then, a vector of the feature cross-products is produced.

ft_interaction(x, input_cols = NULL, output_col = NULL,
  uid = random_string("interaction_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_cols The names of the input columns
output_col The name of the output column.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
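
An illustrative sketch, not from the original page: a vector column and a numeric column are crossed into a single interaction vector.

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_vector_assembler(c("Sepal_Length", "Sepal_Width"), "sepal_vec") %>%
  ft_interaction(c("sepal_vec", "Petal_Length"), "sepal_petal_interaction")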

Feature Transformation -- LSH (Estimator)

Locality Sensitive Hashing functions for Euclidean distance (Bucketed Random Projection) and Jaccard distance (MinHash).

ft_bucketed_random_projection_lsh(x, input_col = NULL, output_col = NULL,
  bucket_length = NULL, num_hash_tables = 1, seed = NULL,
  uid = random_string("bucketed_random_projection_lsh_"), ...)

ft_minhash_lsh(x, input_col = NULL, output_col = NULL, num_hash_tables = 1L,
  seed = NULL, uid = random_string("minhash_lsh_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
bucket_length The length of each hash bucket, a larger bucket lowers the false negative rate. The number of buckets will be (max L2 norm of input vectors) / bucketLength.
num_hash_tables Number of hash tables used in LSH OR-amplification. LSH OR-amplification can be used to reduce the false negative rate. Higher values for this param lead to a reduced false negative rate, at the expense of added computational complexity.
seed A random seed. Set this value if you need your results to be reproducible across repeated calls.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. ft_lsh_utils Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
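
An illustrative sketch, not from the original page: the numeric columns are assembled into a vector column and hashed with the Bucketed Random Projection (Euclidean distance) variant; bucket_length and num_hash_tables are arbitrary illustrative values.

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_vector_assembler(
    c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    "features"
  ) %>%
  ft_bucketed_random_projection_lsh(
    "features", "hashes",
    bucket_length = 2, num_hash_tables = 3
  )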

Feature Transformation -- MaxAbsScaler (Estimator)

Rescale each feature individually to range [-1, 1] by dividing through the largest maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.

ft_max_abs_scaler(x, input_col = NULL, output_col = NULL,
  uid = random_string("max_abs_scaler_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")

  iris_tbl %>%
    ft_vector_assembler(input_col = features, output_col = "features_temp") %>%
    ft_max_abs_scaler(input_col = "features_temp", output_col = "features")
}

Feature Transformation -- MinMaxScaler (Estimator)

Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.

ft_min_max_scaler(x, input_col = NULL, output_col = NULL, min = 0, max = 1,
  uid = random_string("min_max_scaler_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
min Lower bound after transformation, shared by all features. Default: 0.0
max Upper bound after transformation, shared by all features. Default: 1.0
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")

  iris_tbl %>%
    ft_vector_assembler(input_col = features, output_col = "features_temp") %>%
    ft_min_max_scaler(input_col = "features_temp", output_col = "features")
}

Feature Transformation -- NGram (Transformer)

A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

ft_ngram(x, input_col = NULL, output_col = NULL, n = 2,
  uid = random_string("ngram_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
n Minimum n-gram length, greater than or equal to 1. Default: 2, bigram features
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
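
As an illustration, a minimal sketch that tokenizes a toy text column and then produces bigrams; the sentences data frame is made up for this example:

library(sparklyr)
sc <- spark_connect(master = "local")

sentences <- data.frame(text = c("the quick brown fox",
                                 "jumps over the lazy dog"),
                        stringsAsFactors = FALSE)
sentences_tbl <- sdf_copy_to(sc, sentences, name = "sentences_tbl",
                             overwrite = TRUE)

# Tokenize into words, then build bigrams (n = 2, the default)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_ngram(input_col = "words", output_col = "bigrams", n = 2)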

Feature Transformation -- Normalizer (Transformer)

Normalize a vector to have unit norm using the given p-norm.

ft_normalizer(x, input_col = NULL, output_col = NULL, p = 2,
  uid = random_string("normalizer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
p Normalization in L^p space. Must be >= 1. Defaults to 2.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Feature Transformation -- OneHotEncoder (Transformer)

One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. Typically used with ft_string_indexer() to index a column first.

ft_one_hot_encoder(x, input_col = NULL, output_col = NULL, drop_last = TRUE,
  uid = random_string("one_hot_encoder_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
drop_last Whether to drop the last category. Defaults to TRUE.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
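
As an illustration, a minimal sketch of the typical pattern described above: index the string column first with ft_string_indexer(), then one-hot encode the resulting index column (the output column names are arbitrary):

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# Index the string column, then expand the index into a binary vector
iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  ft_one_hot_encoder(input_col = "species_idx", output_col = "species_vec")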

Feature Transformation -- OneHotEncoderEstimator (Estimator)

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].

ft_one_hot_encoder_estimator(x, input_cols = NULL, output_cols = NULL,
  handle_invalid = "error", drop_last = TRUE,
  uid = random_string("one_hot_encoder_estimator_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_cols Names of input columns.
output_cols Names of output columns.
handle_invalid (Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error"
drop_last Whether to drop the last category. Defaults to TRUE.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Feature Transformation -- PolynomialExpansion (Transformer)

Perform feature expansion in a polynomial space. For example, given a 2-variable feature vector (x, y), expanding it with degree 2 yields (x, x * x, y, x * y, y * y).

ft_polynomial_expansion(x, input_col = NULL, output_col = NULL, degree = 2,
  uid = random_string("polynomial_expansion_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
degree The polynomial degree to expand, which should be greater than or equal to 1. A value of 1 means no expansion. Default: 2
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Feature Transformation -- QuantileDiscretizer (Estimator)

ft_quantile_discretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the num_buckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.

ft_quantile_discretizer(x, input_col = NULL, output_col = NULL,
  num_buckets = 2, input_cols = NULL, output_cols = NULL,
  num_buckets_array = NULL, handle_invalid = "error",
  relative_error = 0.001, uid = random_string("quantile_discretizer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
num_buckets Number of buckets (quantiles, or categories) into which data points are grouped. Must be greater than or equal to 2.
input_cols Names of input columns.
output_cols Names of output columns.
num_buckets_array Array of number of buckets (quantiles, or categories) into which data points are grouped. Each value must be greater than or equal to 2.
handle_invalid (Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error"
relative_error (Spark 2.0.0+) Relative error (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a description). Must be in the range [0, 1]. Default: 0.001
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handle_invalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relative_error parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values. Note that the result may be different every time you run it, since the sampling strategy behind it is non-deterministic.

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. ft_bucketizer Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
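
As an illustration, a minimal sketch that bins a continuous mtcars column into four approximate quantile buckets; the column and bucket names are arbitrary:

library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# Bin horsepower into 4 approximate quantile buckets
mtcars_tbl %>%
  ft_quantile_discretizer(input_col = "hp", output_col = "hp_bucket",
                          num_buckets = 4)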

Feature Transformation -- RFormula (Estimator)

Implements the transforms required for fitting a dataset against an R model formula. Currently we support a limited subset of the R operators, including ~, ., :, +, and -. Also see the R formula docs here: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html

ft_r_formula(x, formula = NULL, features_col = "features",
  label_col = "label", force_index_label = FALSE,
  uid = random_string("r_formula_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
formula R formula as a character string or a formula. Formula objects are converted to character strings directly and the environment is not captured.
features_col Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by ft_r_formula.
label_col Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
force_index_label (Spark 2.1.0+) Whether to force indexing of the label column, whether it is of numeric or string type. Usually the label is indexed only when it is a string; if the formula is used by classification algorithms, indexing can be forced even for a numeric label by setting this parameter to TRUE. Default: FALSE
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

The basic operators in the formula are:
~ separate target and terms
+ concat terms; "+ 0" means removing the intercept
- remove a term; "- 1" means removing the intercept
: interaction (multiplication for numeric values, or binarized categorical values)
. all columns except the target

Suppose a and b are double columns. The following simple examples illustrate the effect of RFormula: y ~ a + b means the model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept and w1, w2 are coefficients. y ~ a + b + a:b - 1 means the model y ~ w1 * a + w2 * b + w3 * a * b, where w1, w2, w3 are coefficients.

RFormula produces a vector column of features and a double or string column of label. As when formulas are used in R for linear regression, string input columns will be one-hot encoded and numeric columns will be cast to doubles. If the label column is of type string, it will first be transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
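
As an illustration, a minimal sketch that builds the default features and label columns from an R formula on mtcars:

library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# Adds a "features" vector column (wt, cyl) and a "label" column (mpg)
mtcars_tbl %>%
  ft_r_formula(mpg ~ wt + cyl)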

Feature Transformation -- RegexTokenizer (Transformer)

A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or by repeatedly matching the regex (if gaps is FALSE). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.

ft_regex_tokenizer(x, input_col = NULL, output_col = NULL, gaps = TRUE,
  min_token_length = 1, pattern = "\\s+", to_lower_case = TRUE,
  uid = random_string("regex_tokenizer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
gaps Indicates whether regex splits on gaps (TRUE) or matches tokens (FALSE).
min_token_length Minimum token length, greater than or equal to 0.
pattern The regular expression pattern to be used.
to_lower_case Indicates whether to convert all characters to lowercase before tokenizing.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
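
As an illustration, a minimal sketch that splits a toy text column on runs of non-word characters; the data frame and the pattern are arbitrary choices:

library(sparklyr)
sc <- spark_connect(master = "local")

docs <- data.frame(text = c("Hello, Spark!", "sparklyr: R interface for Spark"),
                   stringsAsFactors = FALSE)
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)

# Split each string on runs of non-word characters
docs_tbl %>%
  ft_regex_tokenizer(input_col = "text", output_col = "tokens",
                     pattern = "\\W+")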

Feature Transformation -- StandardScaler (Estimator)

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. The "unit std" is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.

ft_standard_scaler(x, input_col = NULL, output_col = NULL,
  with_mean = FALSE, with_std = TRUE,
  uid = random_string("standard_scaler_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
with_mean Whether to center the data with mean before scaling. It will build a dense output, so take care when applying to sparse input. Default: FALSE
with_std Whether to scale the data to unit standard deviation. Default: TRUE
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Examples

if (FALSE) {
  sc <- spark_connect(master = "local")
  iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

  features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")

  iris_tbl %>%
    ft_vector_assembler(input_col = features, output_col = "features_temp") %>%
    ft_standard_scaler(input_col = "features_temp", output_col = "features",
                       with_mean = TRUE)
}

Feature Transformation -- StopWordsRemover (Transformer)

A feature transformer that filters out stop words from input.

ft_stop_words_remover(x, input_col = NULL, output_col = NULL,
  case_sensitive = FALSE,
  stop_words = ml_default_stop_words(spark_connection(x), "english"),
  uid = random_string("stop_words_remover_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
case_sensitive Whether to do a case sensitive comparison over the stop words.
stop_words The words to be filtered out.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. ml_default_stop_words Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
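
As an illustration, a minimal sketch that tokenizes a toy text column and then removes the default English stop words:

library(sparklyr)
sc <- spark_connect(master = "local")

docs <- data.frame(text = c("the quick brown fox", "this is a sentence"),
                   stringsAsFactors = FALSE)
docs_tbl <- sdf_copy_to(sc, docs, name = "docs_tbl", overwrite = TRUE)

# Tokenize, then drop the default English stop words
docs_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_stop_words_remover(input_col = "words", output_col = "filtered")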

Feature Transformation -- Tokenizer (Transformer)

A tokenizer that converts the input string to lowercase and then splits it by white spaces.

ft_tokenizer(x, input_col = NULL, output_col = NULL,
  uid = random_string("tokenizer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Feature Transformation -- VectorAssembler (Transformer)

Combine multiple vectors into a single row-vector; that is, where each row element of the newly generated column is a vector formed by concatenating each row element from the specified input columns.

ft_vector_assembler(x, input_cols = NULL, output_col = NULL,
  uid = random_string("vector_assembler_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_cols The names of the input columns
output_col The name of the output column.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_indexer, ft_vector_slicer, ft_word2vec

Feature Transformation -- VectorIndexer (Estimator)

Indexing categorical feature columns in a dataset of Vector.

ft_vector_indexer(x, input_col = NULL, output_col = NULL,
  max_categories = 20, uid = random_string("vector_indexer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
max_categories Threshold for the number of values a categorical feature can take. If a feature is found to have > max_categories values, then it is declared continuous. Must be greater than or equal to 2. Defaults to 20.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_slicer, ft_word2vec

Feature Transformation -- VectorSlicer (Transformer)

Takes a feature vector and outputs a new feature vector with a subarray of the original features.

ft_vector_slicer(x, input_col = NULL, output_col = NULL, indices = NULL,
  uid = random_string("vector_slicer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
input_col The name of the input column.
output_col The name of the output column.
indices A vector of indices to select features from a vector column. Note that the indices are 0-based.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_word2vec
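
As an illustration, a minimal sketch that assembles the iris features and then keeps only the first two entries of each vector (indices are 0-based); the output column names are arbitrary:

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")

# Keep only the first two entries (the sepal measurements) of each vector
iris_tbl %>%
  ft_vector_assembler(input_cols = features, output_col = "features") %>%
  ft_vector_slicer(input_col = "features", output_col = "sepal_only",
                   indices = c(0, 1))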

Feature Transformation -- SQLTransformer

Implements the transformations which are defined by a SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM __THIS__ ...' where '__THIS__' represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns.

ft_sql_transformer(x, statement = NULL,
  uid = random_string("sql_transformer_"), ...)

ft_dplyr_transformer(x, tbl, uid = random_string("dplyr_transformer_"), ...)

Arguments

x A spark_connection, ml_pipeline, or a tbl_spark.
statement A SQL statement.
uid A character string used to uniquely identify the feature transformer.
... Optional arguments; currently unused.
tbl A tbl_spark generated using dplyr transformations.

Value

The object returned depends on the class of x. spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects. ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline. tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark

Details

ft_dplyr_transformer() is a wrapper around ft_sql_transformer() that takes a tbl_spark instead of a SQL statement. Internally, ft_dplyr_transformer() extracts the dplyr transformations used to generate tbl as a SQL statement and then passes it on to ft_sql_transformer(). Note that only single-table dplyr verbs are supported; the sdf_ family of functions is not.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_quantile_discretizer, ft_r_formula, ft_regex_tokenizer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec
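
As an illustration, a minimal sketch of both forms: a raw SQL statement using the __THIS__ placeholder applied directly to a tbl_spark, and a single-table dplyr pipeline captured as a pipeline stage via ft_dplyr_transformer(); the derived column is arbitrary:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# SQL form: __THIS__ stands for the underlying table of the input dataset
iris_tbl %>%
  ft_sql_transformer(
    "SELECT *, Petal_Length / Petal_Width AS petal_ratio FROM __THIS__"
  )

# dplyr form: capture a single-table dplyr pipeline as a pipeline stage
transformed <- iris_tbl %>%
  mutate(petal_ratio = Petal_Length / Petal_Width)

pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(transformed)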

Compile Scala sources into a Java Archive (jar)

Compile the Scala source files contained within an R package into a Java Archive (jar) file that can be loaded and used within a Spark environment.

compile_package_jars(..., spec = NULL)

Arguments

... Optional compilation specifications, as generated by spark_compilation_spec. When no arguments are passed, spark_default_compilation_spec is used instead.
spec An optional list of compilation specifications. When set, this option takes precedence over arguments passed to ....

Read configuration values for a connection

Read configuration values for a connection.

connection_config(sc, prefix, not_prefix = list())

Arguments

sc spark_connection
prefix Prefix to read parameters for (e.g. spark.context., spark.sql., etc.)
not_prefix Prefix to not include.

Value

Named list of config parameters (note that if a prefix was specified then the names will not include the prefix)
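
As an illustration, a minimal sketch that reads all spark.sql.* settings of the current connection; the prefix is an arbitrary choice:

library(sparklyr)
sc <- spark_connect(master = "local")

# Named list of every parameter starting with "spark.sql."
# (the prefix itself is stripped from the returned names)
sql_settings <- connection_config(sc, prefix = "spark.sql.")
names(sql_settings)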

Downloads default Scala Compilers

compile_package_jars requires several versions of the Scala compiler to work, one for each Scala version used by Spark. To help set up your environment, this function downloads the required compilers under the default search path.

download_scalac(dest_path = NULL)

Arguments

dest_path The destination path where scalac will be downloaded to.

Details

See find_scalac for a list of paths searched and used by this function to install the required compilers.

Discover the Scala Compiler

Find the scalac compiler for a particular version of Scala, by scanning some common directories containing Scala installations.

find_scalac(version, locations = NULL)

Arguments

version The scala version to search for. Versions of the form major.minor will be matched against the scalac installation with version major.minor.patch; if multiple compilers are discovered the most recent one will be used.
locations Additional locations to scan. By default, the directories /opt/scala and /usr/local/scala will be scanned.

Access the Spark API

Access the commonly-used Spark objects associated with a Spark instance. These objects provide access to different facets of the Spark API.

spark_context(sc)
java_context(sc)
hive_context(sc)
spark_session(sc)

Arguments

sc A spark_connection.

Details

The Scala API documentation is useful for discovering what methods are available for each of these objects. Use invoke to call methods on these objects.

Spark Context

The main entry point for Spark functionality. The Spark Context represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

Java Spark Context

A Java-friendly version of the aforementioned Spark Context.

Hive Context

An instance of the Spark SQL execution engine that integrates with data stored in Hive. Configuration for Hive is read from hive-site.xml on the classpath. Starting with Spark >= 2.0.0, the Hive Context class has been deprecated -- it is superseded by the Spark Session class, and hive_context will return a Spark Session object instead. Note that both classes share a SQL interface, and therefore one can invoke SQL through these objects.

Spark Session

Available since Spark 2.0.0, the Spark Session unifies the Spark Context and Hive Context classes into a single interface. Its use is recommended over the older APIs for code targeting Spark 2.0.0 and above.
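
As an illustration, a minimal sketch of calling JVM methods on these objects with invoke(); the method names (version, sql, count) come from the Scala API:

library(sparklyr)
sc <- spark_connect(master = "local")

# Spark version, read straight from the underlying SparkContext
spark_context(sc) %>% invoke("version")

# With Spark 2.x, spark_session() exposes the SparkSession; for example,
# run a SQL statement and count the resulting rows
spark_session(sc) %>%
  invoke("sql", "SELECT 1 AS one") %>%
  invoke("count")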

Runtime configuration interface for Hive

Retrieves the runtime configuration interface for Hive.

hive_context_config(sc)

Arguments

sc A spark_connection.

Invoke a Method on a JVM Object

Invoke methods on Java object references. These functions provide a mechanism for invoking various Java object methods directly from R.

invoke(jobj, method, ...)
invoke_static(sc, class, method, ...)
invoke_new(sc, class, ...)

Arguments

jobj An R object acting as a Java object reference (typically, a spark_jobj).
method The name of the method to be invoked.
... Optional arguments, currently unused.
sc A spark_connection.
class The name of the Java class whose methods should be invoked.

Details

Use each of these functions in the following scenarios:
invoke: Execute a method on a Java object reference (typically, a spark_jobj).
invoke_static: Execute a static method associated with a Java class.
invoke_new: Invoke a constructor associated with a Java class.

Examples

sc <- spark_connect(master = "spark://HOST:PORT")

spark_context(sc) %>%
  invoke("textFile", "file.csv", 1L) %>%
  invoke("count")
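
The example above covers invoke(); for completeness, a minimal sketch of the static and constructor forms using standard JVM classes (java.lang.Math and java.math.BigInteger are arbitrary choices):

library(sparklyr)
sc <- spark_connect(master = "local")

# Static method call: java.lang.Math.hypot(3, 4)
invoke_static(sc, "java.lang.Math", "hypot", 3, 4)

# Constructor call, followed by a method call on the new object
big <- invoke_new(sc, "java.math.BigInteger", "1000000000")
invoke(big, "longValue")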

Register a Package that Implements a Spark Extension

Registering an extension package will result in the package being automatically scanned for Spark dependencies when a connection to Spark is created.

register_extension(package)
registered_extensions()

Arguments

package The package(s) to register.

Note

Packages should typically register their extensions in their .onLoad hook -- this ensures that their extensions are registered when their namespaces are loaded.
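
As an illustration, a minimal sketch of the .onLoad pattern described above; the package name is hypothetical:

# In the extension package's R/zzz.R (the package name "sparkdemo" is made up)
.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}

# The package would typically also provide a spark_dependencies() method so
# that sparklyr knows which jars and Spark packages to load when connecting.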

Define a Spark Compilation Specification

For use with compile_package_jars. The Spark compilation specification is used when compiling Spark extension Java Archives, and defines which versions of Spark, as well as which versions of Scala, should be used for compilation.

spark_compilation_spec(spark_version = NULL, spark_home = NULL,
  scalac_path = NULL, scala_filter = NULL, jar_name = NULL,
  jar_path = NULL, jar_dep = NULL)

Arguments

spark_version The Spark version to build against. This can be left unset if the path to a suitable Spark home is supplied.
spark_home The path to a Spark home installation. This can be left unset if spark_version is supplied; in such a case, sparklyr will attempt to discover the associated Spark installation using spark_home_dir.
scalac_path The path to the scalac compiler to be used during compilation of your Spark extension. Note that you should ensure the version of scalac selected matches the version of scalac used with the version of Spark you are compiling against.
scala_filter An optional R function that can be used to filter which scala files are used during compilation. This can be useful if you have auxiliary files that should only be included with certain versions of Spark.
jar_name The name to be assigned to the generated jar.
jar_path The path to the jar tool to be used during compilation of your Spark extension.
jar_dep An optional list of additional jar dependencies.

Details

Most Spark extensions won't need to define their own compilation specification, and can instead rely on the default behavior of compile_package_jars.

Default Compilation Specification for Spark Extensions

This is the default compilation specification used for Spark extensions, when used with compile_package_jars.

spark_default_compilation_spec(pkg = infer_active_package_name(),
  locations = NULL)

Arguments

pkg The package containing Spark extensions to be compiled.
locations Additional locations to scan. By default, the directories /opt/scala and /usr/local/scala will be scanned.

Retrieve the Spark Connection Associated with an R Object

Retrieve the spark_connection associated with an R object.

spark_connection(x, ...)

Arguments

x An R object from which a spark_connection can be obtained.
... Optional arguments; currently unused.

Runtime configuration interface for the Spark Context.

Retrieves the runtime configuration interface for the Spark Context.

spark_context_config(sc)

Arguments

sc A spark_connection.

Retrieve a Spark DataFrame

This S3 generic is used to access a Spark DataFrame object (as a Java object reference) from an R object.

spark_dataframe(x, ...)

Arguments

x An R object wrapping, or containing, a Spark DataFrame.
... Optional arguments; currently unused.

Value

A spark_jobj representing a Java object reference to a Spark DataFrame.

Define a Spark dependency

Define a Spark dependency consisting of a set of custom JARs and Spark packages.

spark_dependency(jars = NULL, packages = NULL, initializer = NULL,
  catalog = NULL, repositories = NULL, ...)

Arguments

jars Character vector of full paths to JAR files.
packages Character vector of Spark packages names.
initializer Optional callback function called when initializing a connection.
catalog Optional location where extension JAR files can be downloaded for Livy.
repositories Character vector of Spark package repositories.
... Additional optional arguments.

Value

An object of type `spark_dependency`

Set the SPARK_HOME environment variable

Set the SPARK_HOME environment variable. This slightly speeds up some operations, including the connection time.

spark_home_set(path = NULL, ...)

Arguments

path A string containing the path to the installation location of Spark. If NULL, the path to the latest installed Spark/Hadoop version is used.
... Additional parameters not currently used.

Value

The function is mostly invoked for the side-effect of setting the SPARK_HOME environment variable. It also returns TRUE if the environment was successfully set, and FALSE otherwise.

Examples

if (FALSE) {
  # Not run due to side-effects
  spark_home_set()
}

Retrieve a Spark JVM Object Reference

This S3 generic is used for accessing the underlying Java Virtual Machine (JVM) Spark objects associated with R objects. These objects act as references to Spark objects living in the JVM. Methods on these objects can be called with the invoke family of functions.

spark_jobj(x, ...)

Arguments

x An R object containing, or wrapping, a spark_jobj.
... Optional arguments; currently unused.

See also

invoke, for calling methods on Java object references.

Get the Spark Version Associated with a Spark Connection

Retrieve the version of Spark associated with a Spark connection.

spark_version(sc)

Arguments

sc A spark_connection.

Value

The Spark version as a numeric_version.

Details

Suffixes for e.g. preview versions, or snapshotted versions, are trimmed -- if you require the full Spark version, you can retrieve it with invoke(spark_context(sc), "version").

Apply an R Function in Spark

Applies an R function to a Spark object (typically, a Spark DataFrame).

spark_apply(x, f, columns = NULL, memory = !is.null(name), group_by = NULL,
  packages = NULL, context = NULL, name = NULL, ...)

Arguments

x An object (usually a spark_tbl) coercible to a Spark DataFrame.
f A function that transforms a data frame partition into a data frame. The function f has signature f(df, context, group1, group2, ...) where df is a data frame with the data to be processed, context is an optional object passed as the context parameter and group1 to groupN contain the values of the group_by values. When group_by is not specified, f takes only one argument. f can also be an rlang anonymous function; for example, ~ .x + 1 defines an expression that adds one to the given .x data frame.
columns A vector of column names or a named vector of column types for the transformed object. When not specified, a sample of 10 rows is taken to infer the output columns automatically; to avoid this performance penalty, specify the column types. The sample size is configurable using the sparklyr.apply.schema.infer configuration option.
memory Boolean; should the table be cached into memory?
group_by Column name used to group by data frame partitions.
packages Boolean to distribute .libPaths() packages to each node, a list of packages to distribute, or a package bundle created with spark_apply_bundle(). Defaults to TRUE or the sparklyr.apply.packages value set in spark_config(). For clusters using YARN cluster mode, packages can point to a package bundle created using spark_apply_bundle() and made available as a Spark file using config$sparklyr.shell.files. For clusters using Livy, packages can be manually installed on the driver node. For offline clusters where available.packages() is not available, manually download the packages database from https://cran.r-project.org/web/packages/packages.rds and set Sys.setenv(sparklyr.apply.packagesdb = "<path-to-rds>"); otherwise, all packages will be used by default. For clusters where R packages are already installed on every worker node, the spark.r.libpaths config entry can be set in spark_config() to the local packages library. To specify multiple paths, collapse them (without spaces) with a comma delimiter (e.g., "/lib/path/one,/lib/path/two").
context Optional object to be serialized and passed back to f().
name Optional table name while registering the resulting data frame.
... Optional arguments; currently unused.

Configuration

spark_config() settings can be specified to change the workers environment. For instance, to set additional environment variables to each worker node use the sparklyr.apply.env.* config, to launch workers without --vanilla use sparklyr.apply.options.vanilla set to FALSE, to run a custom script before launching Rscript use sparklyr.apply.options.rscript.before.

Examples

if (FALSE) {
  library(sparklyr)
  sc <- spark_connect(master = "local")

  # creates a Spark data frame with 10 elements, then multiplies it by 10 in R
  sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)
}
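
As an illustration of the group_by and columns arguments described above, a minimal sketch (the column names and declared types are illustrative choices):

library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# One result row per group; the grouping column is added back to the output
mtcars_tbl %>%
  spark_apply(
    function(df) data.frame(n = nrow(df), avg_mpg = mean(df$mpg)),
    group_by = "cyl"
  )

# Declaring the output schema up front skips the 10-row sampling step
sdf_len(sc, 10) %>%
  spark_apply(
    function(df) data.frame(id = as.numeric(df$id),
                            id_squared = as.numeric(df$id)^2),
    columns = c(id = "double", id_squared = "double")
  )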

Create Bundle for Spark Apply

Creates a bundle of packages for spark_apply().

spark_apply_bundle(packages = TRUE, base_path = getwd())

Arguments

packages List of packages to pack or TRUE to pack all.
base_path Base path used to store the resulting bundle.

Log Writer for Spark Apply

Writes data to log under spark_apply().

spark_apply_log(..., level = "INFO")

Arguments

... Arguments to write to log.
level Severity level for this entry; recommended values: INFO, ERROR or WARN.

Create a Spark Configuration for Livy

Create a Spark Configuration for Livy.

livy_config(config = spark_config(), username = NULL, password = NULL,
  negotiate = FALSE, custom_headers = list(`X-Requested-By` = "sparklyr"),
  ...)

Arguments

config Optional base configuration
username The username to use in the Authorization header
password The password to use in the Authorization header
negotiate Whether to use gssnegotiate method or not
custom_headers List of custom headers to append to http requests. Defaults to list("X-Requested-By" = "sparklyr").
... additional Livy session parameters

Value

Named list with configuration data

Details

Extends a spark_config() configuration with settings for Livy. For instance, username and password define the basic authentication settings for a Livy session. The default value of custom_headers is set to list("X-Requested-By" = "sparklyr") in order to facilitate connection to Livy servers with CSRF protection enabled. Additional parameters for Livy sessions are:
proxy_user User to impersonate when starting the session
jars Jars to be used in this session
py_files Python files to be used in this session
files Files to be used in this session
driver_memory Amount of memory to use for the driver process
driver_cores Number of cores to use for the driver process
executor_memory Amount of memory to use per executor process
executor_cores Number of cores to use for each executor
num_executors Number of executors to launch for this session
archives Archives to be used in this session
queue The name of the YARN queue to which the session is submitted
name The name of this session
heartbeat_timeout Timeout in seconds after which the session is orphaned
Note that queue is supported only by Livy 0.4.0 or newer. If you are using an older version, specify the queue via config (e.g., config = spark_config(spark.yarn.queue = "my_queue")).
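A hedged connection sketch using basic authentication; the Livy URL and credentials below are placeholders:

library(sparklyr)

conf <- livy_config(username = "user", password = "secret")

sc <- spark_connect(master = "http://livy-server:8998",
                    method = "livy",
                    config = conf)

spark_disconnect(sc)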

Start Livy

Starts the Livy service; livy_service_stop() stops the running instances of the Livy service.

livy_service_start(version = NULL, spark_version = NULL, stdout = "", stderr = "", ...)

livy_service_stop()

Arguments

version The version of livy to use.
spark_version The version of spark to connect to.
stdout, stderr Where output to 'stdout' or 'stderr' should be sent; accepts the same options as system2().
... Optional arguments; currently unused.
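An illustrative local workflow; the Livy and Spark versions are examples, and 8998 is assumed to be Livy's default port:

livy_service_start(version = "0.5.0", spark_version = "2.3.2")

sc <- spark_connect(master = "http://localhost:8998", method = "livy")
# ... run queries ...
spark_disconnect(sc)

livy_service_stop()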

Find Stream

Finds and returns a stream based on the stream's identifier.

stream_find(sc, id)

Arguments

sc The associated Spark connection.
id The stream identifier to find.

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

sdf_len(sc, 10) %>% spark_write_parquet(path = "parquet-in")

stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet("parquet-out")

stream_id <- stream_id(stream)
stream_find(sc, stream_id)
}

Generate Test Stream

Generates a local test stream, useful when testing streams locally.

stream_generate_test(df = rep(1:1000), path = "source", distribution = floor(10 + 1e+05 * stats::dbinom(1:20, 20, 0.5)), iterations = 50, interval = 1)

Arguments

df The data frame used as a source of rows for the stream; it will be cast to a data frame if needed. Defaults to a sequence of one thousand entries.
path Path to save stream of files to, defaults to "source".
distribution The distribution of rows to use over each iteration, defaults to a binomial distribution. The stream will cycle through the distribution if needed.
iterations Number of iterations to execute before stopping, defaults to fifty.
interval The interval in seconds used to write the stream, defaults to one second.

Details

This function requires the callr package to be installed.
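A sketch of generating a local test stream and consuming it, assuming callr is installed and that the generated files under the default "source" path can be read as CSV:

library(sparklyr)
sc <- spark_connect(master = "local")

stream_generate_test(iterations = 5)          # writes test files to "source/" in the background

stream <- stream_read_csv(sc, "source/") %>%
  stream_write_csv("source-out/")

stream_stop(stream)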

Spark Stream's Identifier

Retrieves the identifier of the Spark stream.

stream_id(stream)

Arguments

stream The spark stream object.

Spark Stream's Name

Retrieves the name of the Spark stream if available.

stream_name(stream)

Arguments

stream The spark stream object.

Read CSV Stream

Reads a CSV stream as a Spark dataframe stream.

stream_read_csv(sc, path, name = NULL, header = TRUE, columns = NULL, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL, options = list(), ...)

Arguments

sc A spark_connection.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
name The name to assign to the newly generated stream.
header Boolean; should the first row of data be used as a header? Defaults to TRUE.
columns A vector of column names or a named vector of column types.
delimiter The character used to delimit each column. Defaults to ','.
quote The character used as a quote. Defaults to '"'.
escape The character used to escape other characters. Defaults to '\'.
charset The character set. Defaults to "UTF-8".
null_value The character to use for null, or missing, values. Defaults to NULL.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)

csv_path <- file.path("file://", getwd(), "csv-in")

stream <- stream_read_csv(sc, csv_path) %>%
  stream_write_csv("csv-out")

stream_stop(stream)
}

Read JSON Stream

Reads a JSON stream as a Spark dataframe stream.

stream_read_json(sc, path, name = NULL, columns = NULL, options = list(), ...)

Arguments

sc A spark_connection.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
name The name to assign to the newly generated stream.
columns A vector of column names or a named vector of column types.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

dir.create("json-in")
jsonlite::write_json(list(a = c(1,2), b = c(10,20)), "json-in/data.json")

json_path <- file.path("file://", getwd(), "json-in")

stream <- stream_read_json(sc, json_path) %>%
  stream_write_json("json-out")

stream_stop(stream)
}

Read Kafka Stream

Reads a Kafka stream as a Spark dataframe stream.

stream_read_kafka(sc, name = NULL, options = list(), ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated stream.
options A list of strings with additional options.
... Optional arguments; currently unused.

Details

Please note that Kafka requires installing the appropriate Spark package; connect with a config setting where sparklyr.shell.packages is set, for Spark 2.3.2, to "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2".

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
config <- spark_config()

# The following package depends on the Spark version; for Spark 2.3.2:
config$sparklyr.shell.packages <- "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"

sc <- spark_connect(master = "local", config = config)

read_options <- list(kafka.bootstrap.servers = "localhost:9092", subscribe = "topic1")
write_options <- list(kafka.bootstrap.servers = "localhost:9092", topic = "topic2")

stream <- stream_read_kafka(sc, options = read_options) %>%
  stream_write_kafka(options = write_options)

stream_stop(stream)
}

Read ORC Stream

Reads an ORC stream as a Spark dataframe stream.

stream_read_orc(sc, path, name = NULL, columns = NULL, options = list(), ...)

Arguments

sc A spark_connection.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
name The name to assign to the newly generated stream.
columns A vector of column names or a named vector of column types.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

sdf_len(sc, 10) %>% spark_write_orc("orc-in")

stream <- stream_read_orc(sc, "orc-in") %>%
  stream_write_orc("orc-out")

stream_stop(stream)
}

Read Parquet Stream

Reads a parquet stream as a Spark dataframe stream.

stream_read_parquet(sc, path, name = NULL, columns = NULL, options = list(), ...)

Arguments

sc A spark_connection.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
name The name to assign to the newly generated stream.
columns A vector of column names or a named vector of column types.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")

stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet("parquet-out")

stream_stop(stream)
}

Read Socket Stream

Reads a Socket stream as a Spark dataframe stream.

stream_read_scoket(sc, name = NULL, columns = NULL, options = list(), ...)

Arguments

sc A spark_connection.
name The name to assign to the newly generated stream.
columns A vector of column names or a named vector of column types.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

# Start socket server from terminal, example: nc -lk 9999
stream <- stream_read_scoket(sc, options = list(host = "localhost", port = 9999))

stream
}

Read Text Stream

Reads a text stream as a Spark dataframe stream.

stream_read_text(sc, path, name = NULL, options = list(), ...)

Arguments

sc A spark_connection.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
name The name to assign to the newly generated stream.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

dir.create("text-in")
writeLines("A text entry", "text-in/text.txt")

text_path <- file.path("file://", getwd(), "text-in")

stream <- stream_read_text(sc, text_path) %>%
  stream_write_text("text-out")

stream_stop(stream)
}

Render Stream

Collects streaming statistics to render the stream as an 'htmlwidget'.

stream_render(stream = NULL, collect = 10, stats = NULL, ...)

Arguments

stream The stream to render
collect The interval in seconds to collect data before rendering the 'htmlwidget'.
stats Optional stream statistics collected using stream_stats(), when specified, stream should be omitted.
... Additional optional arguments.

Examples

if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")

dir.create("iris-in")
write.csv(iris, "iris-in/iris.csv", row.names = FALSE)

stream <- stream_read_csv(sc, "iris-in/") %>%
  stream_write_csv("iris-out/")

stream_render(stream)
stream_stop(stream)
}

Stream Statistics

Collects streaming statistics, usually to be used with stream_render() to render streaming statistics.

stream_stats(stream, stats = list())

Arguments

stream The stream to collect statistics from.
stats An optional stats object generated using stream_stats().

Value

A stats object containing streaming statistics that can be passed back to the stats parameter to continue aggregating streaming stats.

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

sdf_len(sc, 10) %>% spark_write_parquet(path = "parquet-in")

stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet("parquet-out")

stream_stats(stream)
}

Stops a Spark Stream

Stops processing data from a Spark stream.

stream_stop(stream)

Arguments

stream The spark stream object to be stopped.

Spark Stream Continuous Trigger

Creates a Spark structured streaming trigger to execute continuously. This mode is the most performant but not all operations are supported.

stream_trigger_continuous(checkpoint = 5000)

Arguments

checkpoint The checkpoint interval specified in milliseconds.

See also

stream_trigger_interval

Spark Stream Interval Trigger

Creates a Spark structured streaming trigger to execute over the specified interval.

stream_trigger_interval(interval = 1000)

Arguments

interval The execution interval specified in milliseconds.

See also

stream_trigger_continuous
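A hedged sketch of passing these triggers to a stream writer; the paths assume an existing text source and sc is an existing connection:

# Micro-batches every 5 seconds
stream <- stream_read_text(sc, "text-in") %>%
  stream_write_text("text-out", trigger = stream_trigger_interval(interval = 5000))

stream_stop(stream)

# Continuous execution is requested the same way, for sinks that support it
# (for example a Kafka sink): trigger = stream_trigger_continuous(checkpoint = 5000)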

View Stream

Opens a Shiny gadget to visualize the given stream.

stream_view(stream, ...)

Arguments

stream The stream to visualize.
... Additional optional arguments.

Examples

if (FALSE) {
library(sparklyr)
sc <- spark_connect(master = "local")

dir.create("iris-in")
write.csv(iris, "iris-in/iris.csv", row.names = FALSE)

stream_read_csv(sc, "iris-in/") %>%
  stream_write_csv("iris-out/") %>%
  stream_view() %>%
  stream_stop()
}

Watermark Stream

Ensures a stream has a watermark defined, which is required for some operations over streams.

stream_watermark(x, column = "timestamp", threshold = "10 minutes")

Arguments

x An object coercible to a Spark Streaming DataFrame.
column The name of the column that contains the event time of the row; if the column is missing, a column with the current time will be added.
threshold The minimum delay to wait for data to arrive late, defaults to ten minutes.
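A hedged sketch of adding a watermark before writing a stream; it assumes an existing connection sc and CSV source, and relies on the current-time column being added when the source has no event-time column:

stream <- stream_read_csv(sc, "csv-in") %>%
  stream_watermark(column = "timestamp", threshold = "10 minutes") %>%  # adds "timestamp" if missing
  stream_write_memory("watermarked")

stream_stop(stream)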

Write Console Stream

Writes a Spark dataframe stream into console logs.

stream_write_console(x, mode = c("append", "complete", "update"), options = list(), trigger = stream_trigger_interval(), ...)

Arguments

x A Spark DataFrame or dplyr operation
mode Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
options A list of strings with additional options.
trigger The trigger for the stream query, defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

sdf_len(sc, 10) %>%
  dplyr::transmute(text = as.character(id)) %>%
  spark_write_text("text-in")

stream <- stream_read_text(sc, "text-in") %>%
  stream_write_console()

stream_stop(stream)
}

Write CSV Stream

Writes a Spark dataframe stream into a tabular (typically, comma-separated) stream.

stream_write_csv(x, path, mode = c("append", "complete", "update"), trigger = stream_trigger_interval(), checkpoint = file.path(path, "checkpoint"), header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL, options = list(), ...)

Arguments

x A Spark DataFrame or dplyr operation
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger The trigger for the stream query, defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance.
header Should the first row of data be used as a header? Defaults to TRUE.
delimiter The character used to delimit each column, defaults to ,.
quote The character used as a quote. Defaults to '"'.
escape The character used to escape other characters, defaults to \.
charset The character set, defaults to "UTF-8".
null_value The character to use for null, or missing, values; defaults to NULL.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)

csv_path <- file.path("file://", getwd(), "csv-in")

stream <- stream_read_csv(sc, csv_path) %>%
  stream_write_csv("csv-out")

stream_stop(stream)
}

Write JSON Stream

Writes a Spark dataframe stream into a JSON stream.

stream_write_json(x, path, mode = c("append", "complete", "update"), trigger = stream_trigger_interval(), checkpoint = file.path(path, "checkpoints", random_string("")), options = list(), ...)

Arguments

x A Spark DataFrame or dplyr operation
path The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger The trigger for the stream query, defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

dir.create("json-in")
jsonlite::write_json(list(a = c(1,2), b = c(10,20)), "json-in/data.json")

json_path <- file.path("file://", getwd(), "json-in")

stream <- stream_read_json(sc, json_path) %>%
  stream_write_json("json-out")

stream_stop(stream)
}

Write Kafka Stream

Writes a Spark dataframe stream into a Kafka stream.

stream_write_kafka(x, mode = c("append", "complete", "update"), trigger = stream_trigger_interval(), checkpoint = file.path("checkpoints", random_string("")), options = list(), ...)

Arguments

x A Spark DataFrame or dplyr operation
mode Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger The trigger for the stream query, defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance.
options A list of strings with additional options.
... Optional arguments; currently unused.

Details

Please note that Kafka requires installing the appropriate Spark package; connect with a config setting where sparklyr.shell.packages is set, for Spark 2.3.2, to "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2".

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_memory, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
config <- spark_config()

# The following package depends on the Spark version; for Spark 2.3.2:
config$sparklyr.shell.packages <- "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2"

sc <- spark_connect(master = "local", config = config)

read_options <- list(kafka.bootstrap.servers = "localhost:9092", subscribe = "topic1")
write_options <- list(kafka.bootstrap.servers = "localhost:9092", topic = "topic2")

stream <- stream_read_kafka(sc, options = read_options) %>%
  stream_write_kafka(options = write_options)

stream_stop(stream)
}

Write Memory Stream

Writes a Spark dataframe stream into a memory stream.

stream_write_memory(x, name = random_string("sparklyr_tmp_"), mode = c("append", "complete", "update"), trigger = stream_trigger_interval(), checkpoint = file.path("checkpoints", name, random_string("")), options = list(), ...)

Arguments

x A Spark DataFrame or dplyr operation
name The name to assign to the newly generated stream.
mode Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger The trigger for the stream query, defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_orc, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)

csv_path <- file.path("file://", getwd(), "csv-in")

stream <- stream_read_csv(sc, csv_path) %>%
  stream_write_memory("csv-out")

stream_stop(stream)
}

Write an ORC Stream

Writes a Spark dataframe stream into an ORC stream.

stream_write_orc(x, path, mode = c("append", "complete", "update"), trigger = stream_trigger_interval(), checkpoint = file.path(path, "checkpoints", random_string("")), options = list(), ...)

Arguments

x A Spark DataFrame or dplyr operation
path The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger The trigger for the stream query, defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_parquet, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

sdf_len(sc, 10) %>% spark_write_orc("orc-in")

stream <- stream_read_orc(sc, "orc-in") %>%
  stream_write_orc("orc-out")

stream_stop(stream)
}

Write Parquet Stream

Writes a Spark dataframe stream into a parquet stream.

stream_write_parquet(x, path, mode = c("append", "complete", "update"), trigger = stream_trigger_interval(), checkpoint = file.path(path, "checkpoints", random_string("")), options = list(), ...)

Arguments

x A Spark DataFrame or dplyr operation
path The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger The trigger for the stream query, defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_text

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

sdf_len(sc, 10) %>% spark_write_parquet("parquet-in")

stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet("parquet-out")

stream_stop(stream)
}

Write Text Stream

Writes a Spark dataframe stream into a text stream.

stream_write_text(x, path, mode = c("append", "complete", "update"), trigger = stream_trigger_interval(), checkpoint = file.path(path, "checkpoints", random_string("")), options = list(), ...)

Arguments

x A Spark DataFrame or dplyr operation
path The destination path. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
mode Specifies how data is written to a streaming sink. Valid values are "append", "complete" or "update".
trigger The trigger for the stream query, defaults to micro-batches running every 5 seconds. See stream_trigger_interval and stream_trigger_continuous.
checkpoint The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance.
options A list of strings with additional options.
... Optional arguments; currently unused.

See also

Other Spark stream serialization: stream_read_csv, stream_read_json, stream_read_kafka, stream_read_orc, stream_read_parquet, stream_read_scoket, stream_read_text, stream_write_console, stream_write_csv, stream_write_json, stream_write_kafka, stream_write_memory, stream_write_orc, stream_write_parquet

Examples

if (FALSE) {
sc <- spark_connect(master = "local")

dir.create("text-in")
writeLines("A text entry", "text-in/text.txt")

text_path <- file.path("file://", getwd(), "text-in")

stream <- stream_read_text(sc, text_path) %>%
  stream_write_text("text-out")

stream_stop(stream)
}

Reactive spark reader

Given a spark object, returns a reactive data source for the contents of the spark object. This function is most useful to read Spark streams.

reactiveSpark(x, intervalMillis = 1000, session = NULL)

Arguments

x An object coercible to a Spark DataFrame.
intervalMillis Approximate number of milliseconds to wait to retrieve updated data frame. This can be a numeric value, or a function that returns a numeric value.
session The user session to associate this file reader with, or NULL if none. If non-null, the reader will automatically stop when the session ends.
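A hypothetical Shiny sketch; the ui/server wiring, the "csv-in" path, and the output names are illustrative and not part of the reference above:

library(shiny)
library(sparklyr)

sc <- spark_connect(master = "local")

ui <- fluidPage(tableOutput("table"))

server <- function(input, output, session) {
  # Re-query the stream-backed table roughly every second
  stream_data <- reactiveSpark(stream_read_csv(sc, "csv-in"),
                               intervalMillis = 1000, session = session)
  output$table <- renderTable(stream_data())
}

shinyApp(ui, server)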