Reference links: R4DataScience, R Basic Commands, Non-standard evaluation, Statistics Globe, DataScience Made Simple, Libraries for Python & R, sparklyr functions on non-tabular data, rlist (a set of tools for working with list objects), RList Tutorial, RSelenium Tutorial, RSelenium

study the sample() function
totalRows = 15
sample(fromPool, ChooseSize, replace=F)
# replace=F means values cannot repeat; if fromPool is smaller than ChooseSize the pool is too small,
# so allow repeats (replace=TRUE) to make it work, e.g.
sample(2, totalRows, replace = TRUE, prob=c(0.9,0.1))
sample(1:totalRows, totalRows/5, replace=F)
# 11 2 4
sample(1:3, 4, replace=F)
# Error: cannot take a sample larger than the population when 'replace = FALSE'

use of the sample_frac() function (sample_frac() comes from dplyr; ggplot2 only supplies the diamonds data)
library(ggplot2)
library(dplyr)
index1 = sample_frac(diamonds, 0.1)
str(index1)
# tibble [5,394 x 10] (S3: tbl_df/tbl/data.frame)
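A small reproducibility sketch (an addition to these notes, assuming repeatable draws are wanted): fix the RNG state with set.seed() before sampling.
set.seed(123)                 # any fixed seed makes the draws repeatable
sample(1:totalRows, 3)
set.seed(123)
sample(1:totalRows, 3)        # identical result to the call above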
Introduction to R
free books for R
Cookbook for R, RCookbook, bookdown R books, bookdown all books, bookdown r-programming books, R Programming for Data Science

Data Frame
A data frame is a list of vectors of equal length. To create a data frame:
n = c(2,3,5)
s = c('a','b','c')
b = c(TRUE, FALSE, FALSE)
df = data.frame(n,s,b)
Components of a data frame: header (column names), data rows, row names; a single cell is addressed with the single square bracket "[]" and a comma.
Functions: nrow(), ncol(), head()
Import data: read.table("mydata.txt"), read.csv("mydata.csv")
Retrieve a column vector with the double square bracket or the "$" operator:
mtcars[[9]]; mtcars[["am"]]; mtcars$am; mtcars[,"am"]
Retrieve a column slice with the single square bracket "[]":
mtcars[1]; mtcars["mpg"]; mtcars[c("mpg", "hp")]
Data frame row slice:
mtcars[24,]; mtcars[c(3,24),]; mtcars["camaro z28",]; mtcars[c("datsun 710","camaro z28"),]
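A small subsetting sketch (added here): combine a row condition with selected columns, using base subsetting and subset(), which appears later in these notes.
mtcars[mtcars$mpg > 25, c("mpg", "hp")]        # logical row filter plus named columns
subset(mtcars, mpg > 25, select = c(mpg, hp))  # the same selection via subset()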
# MLFundStat and Hangseng Fund Stat
#================= MLFundStat.html: the computation is long; it is possible to cut the time by adjusting the cutdate variable. This should be modified to a new version using r chart.

# Start Of R
#=================
Sys.setlocale(category = 'LC_ALL', 'Chinese')
Use the .Rprofile.site file to run R commands for all users when their R session starts, e.g. D:\R-3.5.1\etc\Rprofile.site or c:\R-4.2.1\etc\Rprofile.site. See: Initialization at startup.
#loadhistory("C:\Users\User\Desktop\.Rhistory")
Check the environment: Sys.getenv()
Set an environment variable: Sys.setenv(FAME="/opt/fame")

Start Of R Initialization — set the RStudio locale
To check the locale: Sys.getlocale()
LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936
To permanently change to English: open C:\R-4.0.3, edit the file Rprofile.site and add as the last line: Sys.setlocale("LC_ALL","English")

Startup : Initialization at Start of an R Session
Description
In R, the startup mechanism is as follows. Unless --no-environ was given on the command line, R searches for site and user files to process for setting environment variables. The name of the site file is the one pointed to by the environment variable R_ENVIRON; if this is unset, R_HOME/etc/Renviron.site is used (if it exists, which it does not in a "factory-fresh" installation). The name of the user file can be specified by the R_ENVIRON_USER environment variable; if this is unset, the files searched for are .Renviron in the current or in the user's home directory (in that order). See "Details" for how the files are read.

Then R searches for the site-wide startup profile file of R code unless the command line option --no-site-file was given. The path of this file is taken from the value of the R_PROFILE environment variable (after tilde expansion). If this variable is unset, the default is R_HOME/etc/Rprofile.site, which is used if it exists (it contains settings from the installer in a "factory-fresh" installation). This code is sourced into the base package. Users need to be careful not to unintentionally overwrite objects in base, and it is normally advisable to use local if code needs to be executed: see the examples.

Then, unless --no-init-file was given, R searches for a user profile, a file of R code. The path of this file can be specified by the R_PROFILE_USER environment variable (and tilde expansion will be performed). If this is unset, a file called .Rprofile is searched for in the current directory or in the user's home directory (in that order). The user profile file is sourced into the workspace. Note that when the site and user profile files are sourced only the base package is loaded, so objects in other packages need to be referred to by e.g. utils::dump.frames or after explicitly loading the package concerned.

R then loads a saved image of the user workspace from .RData in the current directory if there is one (unless --no-restore-data or --no-restore was specified on the command line).

Next, if a function .First is found on the search path, it is executed as .First(). Finally, function .First.sys() in the base package is run. This calls require to attach the default packages specified by options("defaultPackages"). If the methods package is included, this will have been attached earlier (by function .OptRequireMethods()) so that namespace initializations such as those from the user workspace will proceed correctly.

A function .First (and .Last) can be defined in appropriate .Rprofile or Rprofile.site files or have been saved in .RData. If you want a different set of packages than the default ones when you start, insert a call to options in the .Rprofile or Rprofile.site file. For example, options(defaultPackages = character()) will attach no extra packages on startup (only the base package) (or set R_DEFAULT_PACKAGES=NULL as an environment variable before running R). Using options(defaultPackages = "") or R_DEFAULT_PACKAGES="" enforces the R system default.

On front-ends which support it, the commands history is read from the file specified by the environment variable R_HISTFILE (default .Rhistory in the current directory) unless --no-restore-history or --no-restore was specified.

The command-line option --vanilla implies --no-site-file, --no-init-file, --no-environ and (except for R CMD) --no-restore. Under Windows, it also implies --no-Rconsole, which prevents loading the Rconsole file.
Details
Note that there are two sorts of files used in startup: environment files which contain lists of environment variables to be set, and profile files which contain R code. Lines in a site or user environment file should be either comment lines starting with #, or lines of the form name=value. The latter sets the environmental variable name to value, overriding an existing value. If value contains an expression of the form ${foo-bar}, the value is that of the environmental variable foo if that exists and is set to a non-empty value, otherwise bar. (If it is of the form ${foo}, the default is "".) This construction can be nested, so bar can be of the same form (as in ${foo-${bar-blah}}). Note that the braces are essential: for example $HOME will not be interpreted. Leading and trailing white space in value are stripped. value is then processed in a similar way to a Unix shell: in particular the outermost level of (single or double) quotes is stripped, and backslashes are removed except inside quotes.

On systems with sub-architectures (mainly Windows), the files Renviron.site and Rprofile.site are looked for first in architecture-specific directories, e.g. R_HOME/etc/i386/Renviron.site. And e.g. .Renviron.i386 will be used in preference to .Renviron.

See Also
For the definition of the "home" directory on Windows see the rw-FAQ Q2.14. It can be found from a running R by Sys.getenv("R_USER").

.Last for final actions at the close of an R session. commandArgs for accessing the command line arguments. There are examples of using startup files to set defaults for graphics devices in the help for windows.options, X11 and quartz. An Introduction to R for more command-line options: those affecting memory management are covered in the help file for Memory. readRenviron to read .Renviron files. For profiling code, see Rprof.

Examples
# NOT RUN {
## Example ~/.Renviron on Unix
R_LIBS=~/R/library
PAGER=/usr/local/bin/less

## Example .Renviron on Windows
R_LIBS=C:/R/library
MY_TCLTK="c:/Program Files/Tcl/bin"

## Example of setting R_DEFAULT_PACKAGES (from R CMD check)
R_DEFAULT_PACKAGES='utils,grDevices,graphics,stats'
# this loads the packages in the order given, so they appear on
# the search path in reverse order.

## Example of .Rprofile
options(width=65, digits=5)
options(show.signif.stars=FALSE)
setHook(packageEvent("grDevices", "onLoad"),
        function(...) grDevices::ps.options(horizontal=FALSE))
set.seed(1234)
.First = function() cat("\n Welcome to R!\n\n")
.Last = function() cat("\n Goodbye!\n\n")

## Example of Rprofile.site
local({
  # add MASS to the default packages, set a CRAN mirror
  old = getOption("defaultPackages"); r = getOption("repos")
  r["CRAN"] = "http://my.local.cran"
  options(defaultPackages = c(old, "MASS"), repos = r)
  ## (for Unix terminal users) set the width from COLUMNS if set
  cols = Sys.getenv("COLUMNS")
  if(nzchar(cols)) options(width = as.integer(cols))
  # interactive sessions get a fortune cookie (needs fortunes package)
  if (interactive()) fortunes::fortune()
})

## if .Renviron contains
FOOBAR="coo\bar"doh\ex"abc\"def'"
## then we get
# > cat(Sys.getenv("FOOBAR"), "\n")
# coo\bardoh\exabc"def'
# }

# Encoding Problems
To write text UTF8 encoding on Windows
First, set the encoding: options(encoding = "UTF-8"). To write UTF-8 text on Windows one has to use the useBytes=T option in writeLines (and specify the encoding in readLines):
txt = "在"
writeLines(txt, "test.txt", useBytes=T)
readLines("test.txt", encoding="UTF-8")
[1] "在"writeLines(wholePage, theFilename, useBytes=T)
The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess that a file is encoded in UTF-8. Normally a BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8 the BOM is unnecessary; according to the Unicode standard, a BOM in UTF-8 files is not recommended.

#================= # Encoding Problems
Sys.getlocale()
getOption("encoding")
options(encoding = "UTF-8")
Encoding(txtstring) = "UTF-8"
Encoding(txtstring)
txtstring
Sys.setlocale(category = 'LC_ALL', 'Chinese')
Sys.setlocale(category = "LC_ALL", locale = "chs")
Sys.setlocale(category = "LC_ALL", locale = "cht")   # fanti (traditional)
Note: the default is options("encoding" = "native.enc").
statTxtFile = "test.txt"
write("建设银行", statTxtFile, append=TRUE)   # result file is ANSI
Add options("encoding" = "UTF-8"), then
write("建设银行", statTxtFile, append=TRUE)   # result file is UTF-8
mytext = "this is my text"
Encoding(mytext)
options(encoding = "UTF-8"); getOption("encoding")
options(encoding = 'native.enc'); getOption("encoding")
iconvlist()

theHeader = "http://qt.gtimg.cn/r=2&q=r_hk"
onecode = "02009"
con = url(paste0(theHeader, onecode), encoding = "GB2312")
thepage = readLines(con)
close(con)
Info = unlist(strsplit(thepage, "~"))
codename = Info[2]
codename
Encoding(codename)

==================
readLines(textConnection("Z\u00FCrich", encoding="UTF-8"), encoding="UTF-8")
readLines(filename, encoding="UTF-8")
readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown", skipNul = FALSE)
# note! the Chinese encoding is OK inside R, but will be wrong when written to a file with the local PC locale; to solve the problem, set Sys.setlocale(category = 'LC_ALL', 'Chinese')
readLines(con = file("Unicode.txt", encoding = "UCS-2LE"))
close(con)
unique(Encoding(A))   # will most likely be UTF-8

==================
guess_encoding(pageHeader)
pageHeader = repair_encoding(pageHeader, from="utf-8")
pageHeader = repair_encoding(pageHeader, "UTF-8")
iconv(pageHeader, to="UTF-8")
Encoding(pageHeader) = "UTF-8"
Sys.getlocale("LC_ALL")
https://rpubs.com/mauriciocramos/encoding

==================
Read text as UTF-8 encoding: the following specifies the encoding twice and works, but the reason is unknown:
readLines(textConnection("Z\u00FCrich", encoding="UTF-8"), encoding="UTF-8")
[1] "Zürich"

==================
The page source claims to be using UTF-8 encoding: meta http-equiv="Content-Type" content="text/html; charset=utf-8". So the question is: are they really using a different enough encoding, or can we just convert to UTF-8, guessing that any errors will be negligible? A quick and dirty approach is to force UTF-8 using iconv:
TV_Audio_Video = read_html(iconv(page_source[[1]], to = "UTF-8"), encoding = "utf8")
In general this is a bad idea; it is better to specify the encoding the text is actually in. In this case the error may be theirs, so this quick and dirty approach might be OK.
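An alternative sketch (added here): write through a connection opened with an explicit encoding instead of using useBytes.
con <- file("test_utf8.txt", open = "w", encoding = "UTF-8")  # output is re-encoded to UTF-8
writeLines("建设银行", con)
close(con)
readLines("test_utf8.txt", encoding = "UTF-8")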
to remove leading zeros
substr(t, regexpr("[^0]", t), nchar(t))
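An equivalent regex sketch (added here), using sub():
t <- "000123"            # hypothetical input
sub("^0+", "", t)        # "123"; note an all-zero string becomes ""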
Pop up message in windows 8.1
Use the tcl/tk package in R to create a message box. A very simple example:
require(tcltk)
tkmessageBox(title = "Title of message box", message = "Hello, world!", icon = "info", type = "ok")
library(tcltk)
tk_messageBox(type='ok', message='I am a tkMessageBox!')
There are different types of message box (yesno, okcancel, etc.); see ?tk_messageBox.
Or use cmd:
system('CMD /C "ECHO The R process has finished running && PAUSE"')
Or use hta in one line:
mshta "about:<script>alert('Hello, world!');close()</script>"
or mshta "javascript:alert('message');close()"
or mshta.exe vbscript:Execute("msgbox ""message"",0,""title"":close")
mshta "about:<script src='file://%~f0'></script><script>close()</script>" %*
msg = paste0('mshta ', "\"about:<script>alert('Hello, world!');close()</script>\"")
To show a web page, use a script to create it.
#================= Pop up message in windows 8.1
c.bat:
start MessageBox.vbs "This will be shown in a popup."
MessageBox.vbs:
Set objArgs = WScript.Arguments
messageText = objArgs(0)
MsgBox messageText
In fact, save a file named test.vbs with the content MsgBox "some message"; double-clicking the file will run it directly.

# suppress scientific notation
# options("scipen"=999)
# format(xx, scientific=F)
# options("scipen"=100, "digits"=4)
# getOption("scipen")
# or as.integer(functionResult)

df = data.frame(matrix(ncol = 10000, nrow = 0))
colnames(df) = c("a", "b", "c")
rm(list=ls())

Extracting a Single, Simple Table
The first step is to load the "XML" package, then use the htmlParse() function to read the html document into an R object, and readHTMLTable() to read the table(s) in the document. The length() function indicates there is a single table in the document, simplifying our work.

The plot3d() function in the rgl package
library(rgl)
open3d()
attach(mtcars)
plot3d(disp, wt, mpg, col = rainbow(10))
#============
library(stringr)
library(htmltools)
library(threejs)
data(mtcars)
data = mtcars[order(mtcars$cyl),]
uv = tabulate(mtcars$cyl)
col = c(rep("red",uv[4]), rep("yellow",uv[6]), rep("blue",uv[8]))
row.names(mtcars)
# note: labels, size, flip.y and renderer belong to threejs::scatterplot3js(), not to the scatterplot3d package
scatterplot3js(data[,c(3,6,1)], labels=row.names(mtcars), size=mtcars$hp/100, flip.y=TRUE, color=col, renderer="canvas")

tabulate(bin, nbins = max(1, bin, na.rm = TRUE))
tabulate takes the integer-valued vector bin and counts the number of times each integer occurs in it.
tabulate(c(2,3,3,5), nbins = 10)
[1] 0 1 2 0 1 0 0 0 0 0
table(c(2,3,3,5))
2 3 5
1 2 1
tabulate(c(-2,0,2,3,3,5))   # -2 and 0 are ignored
[1] 0 1 2 0 1
tabulate(c(-2,0,2,3,3,5), nbins = 3)
[1] 0 1 2
tabulate(factor(letters[1:10]))
[1] 1 1 1 1 1 1 1 1 1 1

Scatterplot3d: 3D graphics - R software and data visualization
1 Install and load scatterplot3d
2 Prepare the data
3 The function scatterplot3d()
4 Basic 3D scatter plots
5 Change the main title and axis labels
6 Change the shape and the color of points
7 Change point shapes by groups
8 Change point colors by groups
9 Change the global appearance of the graph
10 Remove the box around the plot
11 Add grids on scatterplot3d
12 Add bars
13 Modification of scatterplot3d output
14 Add legends
15 Specify the legend position using xyz.convert()
16 Specify the legend position using keywords
17 Customize the legend position
18 Add point labels
19 Add regression plane and supplementary points

There are many packages in R (RGL, car, lattice, scatterplot3d, …) for creating 3D graphics. This tutorial describes how to generate a scatter plot in 3D space using R software and the package scatterplot3d. scatterplot3d is very simple to use and it can be easily extended by adding supplementary points or regression planes into an already generated graphic. It can be easily installed, as it requires only an installed version of R.
Install and load scatterplot3d
install.packages("scatterplot3d") # Install library("scatterplot3d") # load
Prepare the data
The iris data set will be used:
data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
The iris data set gives the measurements of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

The function scatterplot3d()
A simplified format is:
scatterplot3d(x, y=NULL, z=NULL)
x, y, z are the coordinates of the points to be plotted. The arguments y and z can be optional depending on the structure of x. In what cases are y and z optional?
Case 1: x is a formula of type zvar ~ xvar + yvar; xvar, yvar and zvar are used as the x, y and z variables (see the sketch below).
Case 2: x is a matrix containing at least 3 columns corresponding to the x, y and z variables, respectively.
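A minimal sketch of Case 1 (added here; it assumes the installed scatterplot3d version accepts a formula, as described above):
library(scatterplot3d)
# Petal.Length on the vertical axis, Sepal.Length and Sepal.Width on the base
with(iris, scatterplot3d(Petal.Length ~ Sepal.Length + Sepal.Width))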
Basic 3D scatter plots

# Basic 3d graphics
scatterplot3d(iris[,1:3])
# Change the angle of point view scatterplot3d(iris[,1:3], angle = 55)
Change the main title and axis labels
scatterplot3d(iris[,1:3], main="3D Scatter Plot", xlab = "Sepal Length (cm)", ylab = "Sepal Width (cm)", zlab = "Petal Length (cm)")
Change the shape and the color of points
The arguments pch and color can be used:
scatterplot3d(iris[,1:3], pch = 16, color="steelblue")
Read more on the different point shapes available in R : Point shapes in R
Change point shapes by groups
shapes = c(16, 17, 18) shapes = shapes[as.numeric(iris$Species)] scatterplot3d(iris[,1:3], pch = shapes)
Read more on the different point shapes available in R : Point shapes in R
Change point colors by groups
colors = c("#999999", "#E69F00", "#56B4E9") colors = colors[as.numeric(iris$Species)] scatterplot3d(iris[,1:3], pch = 16, color=colors)
Read more about colors in R: colors in R
Change the global appearance of the graph
The arguments below can be used:
grid: a logical value. If TRUE, a grid is drawn on the plot.
box: a logical value. If TRUE, a box is drawn around the plot.

Remove the box around the plot
scatterplot3d(iris[,1:3], pch = 16, color = colors, grid=TRUE, box=FALSE)
Note that the argument grid = TRUE plots only the grid on the xy plane. In the next section, we'll see how to add grids on the other facets of the 3D scatter plot.

Add grids on scatterplot3d
This section describes how to add xy-, xz- and yz-grids to scatterplot3d graphics. We'll use a custom function named addgrids3d(). The source code is available here: addgrids3d.r. The function is inspired by the discussion on this forum. A simplified format of the function is:
addgrids3d(x, y=NULL, z=NULL, grid = TRUE, col.grid = "grey", lty.grid=par("lty"))
x, y, and z are numeric vectors specifying the x, y, z coordinates of points. x can be a matrix or a data frame containing 3 columns corresponding to the x, y and z coordinates; in this case the arguments y and z are optional.
grid specifies the facet(s) of the plot on which grids should be drawn. Possible values are combinations of "xy", "xz" or "yz", e.g. grid = c("xy", "yz"). The default value is TRUE, which adds grids only on the xy facet.
col.grid, lty.grid: the color and the line type to be used for grids.
Add grids on the different facets of scatterplot3d graphics:
# 1. Source the function
source('http://www.sthda.com/sthda/RDoc/functions/addgrids3d.r')
# 2. 3D scatter plot
scatterplot3d(iris[, 1:3], pch = 16, grid=FALSE, box=FALSE)
# 3. Add grids
addgrids3d(iris[, 1:3], grid = c("xy", "xz", "yz"))
The problem in the above plot is that the grids are drawn over the points. In the R code below, we put the points in the foreground using the following steps: an empty scatterplot3d graphic is created and the result of scatterplot3d() is assigned to s3d; the function addgrids3d() is used to add grids; finally, the function s3d$points3d is used to add points on the 3D scatter plot.
# 1. Source the function
source('~/hubiC/Documents/R/function/addgrids3d.r')
# 2. Empty 3D scatter plot using pch=""
s3d = scatterplot3d(iris[, 1:3], pch = "", grid=FALSE, box=FALSE)
# 3. Add grids
addgrids3d(iris[, 1:3], grid = c("xy", "xz", "yz"))
# 4. Add points
s3d$points3d(iris[, 1:3], pch = 16)
The function points3d() is described in the next sections.

Add bars
The argument type = "h" is used. This is useful to see very clearly the x-y location of points.
scatterplot3d(iris[,1:3], pch = 16, type="h", color=colors)
Modification of scatterplot3d output
scatterplot3d returns a list of function closures which can be used to add elements to an existing plot. The returned functions are:
xyz.convert(): to convert 3D coordinates to the 2D parallel projection of the existing scatterplot3d. It can be used to add arbitrary elements, such as a legend, into the plot.
points3d(): to add points or lines to the existing plot
plane3d(): to add a plane to the existing plot
box3d(): to add or refresh a box around the plot

Add legends
Specify the legend position using xyz.convert()
The result of scatterplot3d() is assigned to s3d; the function s3d$xyz.convert() is used to specify the coordinates for the legend; the function legend() is used to add the legend to the plot.
s3d = scatterplot3d(iris[,1:3], pch = 16, color=colors)
legend(s3d$xyz.convert(7.5, 3, 4.5), legend = levels(iris$Species),
       col = c("#999999", "#E69F00", "#56B4E9"), pch = 16)
It's also possible to specify the position of the legend using the following keywords: "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right" and "center". Read more about legend in R: legend in R.
Specify the legend position using keywords
# "right" position s3d = scatterplot3d(iris[,1:3], pch = 16, color=colors) legend("right", legend = levels(iris$Species), col = c("#999999", "#E69F00", "#56B4E9"), pch = 16)
# Use the argument inset s3d = scatterplot3d(iris[,1:3], pch = 16, color=colors) legend("right", legend = levels(iris$Species), col = c("#999999", "#E69F00", "#56B4E9"), pch = 16, inset = 0.1)
What does the argument inset mean in the R code above? The argument inset is used to inset distance(s) from the margins as a fraction of the plot region when the legend is positioned by keyword (see ?legend). You can play with the inset argument using negative or positive values.
# "bottom" position
s3d = scatterplot3d(iris[,1:3], pch = 16, color=colors)
legend("bottom", legend = levels(iris$Species),
       col = c("#999999", "#E69F00", "#56B4E9"), pch = 16)
Using keywords to specify the legend position is very simple. However, sometimes there is an overlap between some points and the legend box or between the axis and the legend box. Is there any solution to avoid this overlap? Yes, there are several solutions using combinations of the following arguments of legend():
bty = "n": to remove the box around the legend. In this case the background color of the legend becomes transparent and the overlapping points become visible.
bg = "transparent": to change the background color of the legend box to a transparent color (this is only possible when bty != "n").
inset: to modify the distance(s) between the plot margins and the legend box.
horiz: a logical value; if TRUE, the legend is set horizontally rather than vertically.
xpd: a logical value; if TRUE, it enables the legend items to be drawn outside the plot.

Customize the legend position
# Custom point shapes s3d = scatterplot3d(iris[,1:3], pch = shapes) legend("bottom", legend = levels(iris$Species), pch = c(16, 17, 18), inset = -0.25, xpd = TRUE, horiz = TRUE)
# Custom colors s3d = scatterplot3d(iris[,1:3], pch = 16, color=colors) legend("bottom", legend = levels(iris$Species), col = c("#999999", "#E69F00", "#56B4E9"), pch = 16, inset = -0.25, xpd = TRUE, horiz = TRUE)
# Custom shapes/colors s3d = scatterplot3d(iris[,1:3], pch = shapes, color=colors) legend("bottom", legend = levels(iris$Species), col = c("#999999", "#E69F00", "#56B4E9"), pch = c(16, 17, 18), inset = -0.25, xpd = TRUE, horiz = TRUE)
In the R code above, you can play with the arguments inset, xpd and horiz to see the effects on the appearance of the legend box.
Add point labels
The function text() is used as follows:
s3d = scatterplot3d(iris[,1:3], pch = 16, color=colors)
text(s3d$xyz.convert(iris[, 1:3]), labels = rownames(iris), cex = 0.7, col = "steelblue")
Add regression plane and supplementary points
The result of scatterplot3d() is assigned to s3d. A linear model is calculated as lm(zvar ~ xvar + yvar), with the assumption that zvar depends on xvar and yvar. The function s3d$plane3d() is used to add the regression plane; supplementary points are added using the function s3d$points3d(). The trees data set will be used:
data(trees)
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
This data set provides measurements of the girth, height and volume of black cherry trees.
3D scatter plot with the regression plane:
# 3D scatter plot
s3d = scatterplot3d(trees, type = "h", color = "blue", angle=55, pch = 16)
# Add regression plane
my.lm = lm(trees$Volume ~ trees$Girth + trees$Height)
s3d$plane3d(my.lm)
# Add supplementary points
s3d$points3d(seq(10, 20, 2), seq(85, 60, -5), seq(60, 10, -10),
             col = "red", type = "h", pch = 8)
Links: scatterplot3d, interactive 3d scatterplots, Interactive 3D Scatterplots, complete guide to 3D visualization, Data Visualization 3D and 4D graph, Three.js Fundamentals
#============
# note: the examples below use threejs::scatterplot3js() (labels, size, flip.y and renderer are not scatterplot3d arguments)
scatterplot3js(data[,c(3,6,1)], labels=row.names(mtcars), size=mtcars$hp/100, flip.y=TRUE, color=col, renderer="canvas")

# Gumball machine
N = 100
i = sample(3, N, replace=TRUE)
x = matrix(rnorm(N*3), ncol=3)
lab = c("small", "bigger", "biggest")
scatterplot3js(x, color=rainbow(N), labels=lab[i], size=i, renderer="canvas")

# Example 1 from the scatterplot3d package (cf.)
z = seq(-10, 10, 0.1)
x = cos(z)
y = sin(z)
scatterplot3js(x, y, z, color=rainbow(length(z)),
               labels=sprintf("x=%.2f, y=%.2f, z=%.2f", x, y, z))

# Interesting 100,000 point cloud example, should run this with WebGL!
N1 = 10000
N2 = 90000
x = c(rnorm(N1, sd=0.5), rnorm(N2, sd=2))
y = c(rnorm(N1, sd=0.5), rnorm(N2, sd=2))
z = c(rnorm(N1, sd=0.5), rpois(N2, lambda=20)-20)
col = c(rep("#ffff00",N1), rep("#0000ff",N2))
scatterplot3js(x, y, z, color=col, size=0.25)

cat("\014")   # clears the console (like CLS)

# match returns a vector of the positions
v1 = c("a","b","c","d")
v2 = c("g","x","d","e","f","a","c")
x = match(v1, v2)
# 6 NA 7 3
v1 %in% v2
# TRUE FALSE TRUE TRUE
x = match(v1, v2, nomatch=-1)
# 6 -1 7 3
%in% returns a logical vector indicating if there is a match or not
#=============
# this checks whether an element is inside a group
v = c('a','b','c','e')
'b' %in% v

check whether a vector is included in another: 31:37 %in% 0:36
#============= 31:37 %in% 0:36 if(all(31:36 %in% 0:36)){cat("good")} # dmInfo=data.matrix(Info) # convert dataframe to matrix, but the row and column is exchanged # bob = data.frame(lapply(bob, as.character), stringsAsFactors=FALSE) #Change numeric to characters # write.csv(Info,quote=FALSE, row.names = FALSE) # write csv is the proper way to write the datafile # attach an excel file in R: 1: Install packages XLConnect and foreign and run both libraries 2: abcd = readWorksheet(loadWorkbook('file extension'),sheet=1) # allocate vector of size 1.7 Gb Try memory.limit() for the current memory limit Use memory.limit (size=50000) to increase memory limit. Try using a cloud based environment, try using package slam use factors Concatenate and Split Strings in R ================================== use the paste() function to concatenate strsplit() function to split pangram = "The quick brown fox jumps over the lazy dog" strsplit(pangram, " ") "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog" the unique elements unique() function unique(tolower(words)) "the" "quick" "brown" "fox" "jumps" "over" "lazy" "dog" # find duplicates # the intersect function is used for different set, not in inside a vector # instead, use the duplicated function will be OK. words = unlist(strsplit(pangram, " ")) words = tolower(words) duplicated(words) words[duplicated(words)] arr = sample(1:36,6,replace=TRUE) cat(arr, "\n") arr[duplicated(arr)] # test run remove duplicate items from a vector originalArr = c(1,1,3,4,5,5,6,7,8,8,8,8,9,9) cat(originalArr, "\n") # find out duplicates removeItems = unique(originalArr[duplicated(originalArr)]) # use unique to remove repeated duplicates cat(removeItems, "\n") finalArr = originalArr for(item in removeItems){ cat("remove this:", item," ") cat("they are:", which(finalArr == item)," ") finalArr = finalArr[-(which(finalArr == item))] cat("result vec:", finalArr, "\n") } # unique will not remove duplicates originalArr = unique(originalArr) # rmItems(fmList, itemList) remove itemList from fmList rmItems <- function(fmList, itemList){ commons = unique(fmList[fmList %in% itemList]) for(item in commons){fmList = fmList[-(which(fmList == item))]} return(fmList) } rmItems(originalArr, removeItems) # R base functions duplicated(): for identifying duplicated elements and unique(): for extracting unique elements, distinct() [dplyr package] to remove duplicate rows in a data frame. R split Function ================ split() function divides the data in a vector. unsplit() funtion do the reverse. split(x, f, drop = FALSE, ...) split(x, f, drop = FALSE, ...) = value unsplit(value, f, drop = FALSE) x: vector, data frame f: indices drop: discard non existing levels or not # R not recognizing Chinese characters # I have this saved as a script in RStudio: # this works without problem in windows 8.1 a <- "中文" cat("这是中文", a) aaa = readline(prompt="输入汉字:") cat(aaa) This seems to be a Windows/UTF-8 encoding problem. It works if you use eval(parse('test.R', encoding = 'UTF-8')) instead of source(). I try to use read_csv to read my csv file and the source code as follow: ch4sample <- "D:/Rcode/最近一年內.csv" ch4sample.exp1 <-read_csv(ch4sample, col_names = TRUE) Unfortunately, the R console was showing the error message You might use list.files() function to find out how R names these files, and refer to them that way. 
For example:
> list.files()
[1] "community_test"          "community-sandbox.Rproj"
[3] "poobär.r"
To source an .R file saved using UTF-8 encoding, first of all: Sys.setlocale(category = 'LC_ALL', 'Chinese') and then source(filename, encoding = 'UTF-8'), but remember to save the output file in UTF-8.

list objects in the working environment
ls()
data() will give you a list of the datasets of all loaded packages
help(package = "datasets")
Show the structure of the datasets:
dataStr = function(package="datasets", ...) {
  d = data(package=package, envir=new.env(), ...)$results[,"Item"]
  d = sapply(strsplit(d, split=" ", fixed=TRUE), "[", 1)
  d = d[order(tolower(d))]
  for(x in d){ message(x, ": ", class(get(x))); message(str(get(x))) }
}
dataStr()
#=============
x = read.csv("anova.csv", header=T, sep=",")
# anova.csv looks like:
Subtype,Gender,Expression
A,m,-0.54
A,m,-0.8
Split the "Expression" values into two groups based on the "Gender" variable, "f" for the female group and "m" for the male group:
> g = split(x$Expression, x$Gender)
> g
$f
[1] -0.66 -1.15 -0.30 -0.40 -0.24 -0.92 0.48 -1.68 -0.80 -0.55 -0.11 -1.26
$m
[1] -0.54 -0.80 -1.03 -0.41 -1.31 -0.43 1.01 0.14 1.42 -0.16 0.15 -0.62
Calculate the length and mean value of each group:
sapply(g, length)
  f   m
135 146
sapply(g, mean)
         f          m
-0.3946667 -0.2227397
You may use lapply; the return value is a list: lapply(g, mean)
The unsplit() function combines the groups: unsplit(g, x$Gender)

Apply
===== m = matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3) apply(m, 1, mean) a 1 in the second argument, giving the mean of each row. apply(m, 2, mean)giving the mean of each column. apply(m, 2, function(x) length(x[x<0])) # count -ve values apply(m, 2, function(x) is.matrix(x)) apply(m, 2, is.vector) apply(m, 2, function(x) mean(x[x>0])) #========= ma = matrix(c(1:4, 1, 6:8), nrow = 2) apply(ma, 1, table) apply(ma, 1, stats::quantile) apply(ma, 2, mean) apply(m, 2, function(x) length(x[x<0])) sapply lapply rollapply sapply(1:3, function(x) x^2) lapply return a list: lapply(1:3, function(x) x^2) use unlist with lapply to get a vector sapply(1:3, function(x, y) mean(y[,x]), y=m) A=matrix(1:9, 3,3) B=matrix(4:15, 4,3) C=matrix(8:10, 3,2) MyList=list(A,B,C) Z=sapply(MyList,"[", 1,1 ) #========== te=matrix(1:20,nrow=2) sapply(te,mean) # this is a vector, order arrange in matrix direction matrix(sapply(te,mean),nrow=2) # this is changed to matrix subset() apply() sapply() lapply() tapply() aggregate() apply apply a function to the rows or columns of a matrix M = matrix(seq(1,16), 4, 4) apply(M, 1, min) lapply apply a function to each element of a list in turn and get a list back x = list(a = 1, b = 1:3, c = 10:100) lapply(x, FUN = length) sapply apply a function to each element of a list in turn, but you want a vector back x = list(a = 1, b = 1:3, c = 10:100) sapply(x, FUN = length) vapply squeeze some more speed out of sapply x = list(a = 1, b = 1:3, c = 10:100) vapply(x, FUN = length, FUN.VALUE = 0L) mapply apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply Note: mApply(X, INDEX, FUN, …, simplify=TRUE, keepmatrix=FALSE) from Hmisc package is different from mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) Examples #Sums the 1st elements, the 2nd elements, etc. mapply(sum, 1:5, 1:5, 1:5) [1] 3 6 9 12 15 mapply(rep, 1:4, 4:1) mapply(rep, times = 1:4, x = 4:1) mapply(rep, times = 1:4, MoreArgs = list(x = 42)) mapply(function(x, y) seq_len(x) + y, c(a = 1, b = 2, c = 3), # names from first c(A = 10, B = 0, C = -10)) word = function(C, k) paste(rep.int(C, k), collapse = "") utils::str(mapply(word, LETTERS[1:6], 6:1, SIMPLIFY = FALSE)) mapply(function(x,y){x^y},x=c(2,3),y=c(3,4)) 8 81 values1 = list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)) values2 = list(a = c(10, 11, 12), b = c(13, 14, 15), c = c(16, 17, 18)) mapply(function(num1, num2) max(c(num1, num2)), values1, values2) a b c 12 15 18 Map A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list rapply For when you want to apply a function to each element of a nested list structure, recursively tapply For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor lapply is a list apply which acts on a list or vector and returns a list. sapply is a simple lapply (function defaults to returning a vector or matrix when possible) vapply is a verified apply (allows the return object type to be prespecified) rapply is a recursive apply for nested lists, i.e. lists within lists tapply is a tagged apply where the tags identify the subsets apply is generic: applies a function to a matrix's rows or columns by a "wrapper" for tapply. 
The power of by arises when we want to compute a task that tapply can't handle aggregate can be seen as another a different way of use tapply if we use it in such a way xx = c(1,3,5,7,9,8,6,4,2,1,5) duplicated(xx) xx[duplicated(xx)] Accessing dataframe by names: mtcars["mpg"] QueueNo = 12 mtcars[QueueNo,"mpg"] some functions to remember charToRaw(key) as.raw(key) A motion chart is a dynamic chart to explore several indicators over time. subset(airquality, Temp > 80, select = c(Ozone, Temp)) subset(airquality, Day == 1, select = -Temp) subset(airquality, select = Ozone:Wind) with(airquality, subset(Ozone, Temp > 80)) ## sometimes requiring a logical 'subset' argument is a nuisance nm = rownames(state.x77) start_with_M = nm %in% grep("^M", nm, value = TRUE) subset(state.x77, start_with_M, Illiteracy:Murder) # but in recent versions of R this can simply be subset(state.x77, grepl("^M", nm), Illiteracy:Murder) join 3 dataframes library("plyr") join() function names(gdp)[3] = "GDP" names(life_expectancy)[3] = "LifeExpectancy" names(population)[3] = "Population" gdp_life_exp = join(gdp, life_expectancy) development = join(gdp_life_exp, population) subset() function dev_2005 = subset(development, Year == 2005) dev_2005_big = subset(dev_2005, GDP >= 30000) development_motion = subset(development_complete, Country %in% selection) library(googleVis) gvisMotionChart() function motion_graph = gvisMotionChart(development_motion, idvar = "Country", timevar = "Year") plot(motion_graph) motion_graph = gvisMotionChart(development_motion, idvar = "Country", timevar = "Year", xvar = "GDP", yvar = "LifeExpectancy", sizevar = "Population") development_motion$logGDP = log(development_motion$GDP) motion_graph = gvisMotionChart(development_motion, idvar = "Country", timevar = "Year", xvar = "logGDP", yvar = "LifeExpectancy", sizevar = "Population") my_list[[1]] extracts the first element of the list my_list, and my_list[["name"]] extracts the element in my_list that is called name. If the list is nested you can travel down the heirarchy by recursive subsetting. mylist[[1]][["name"]] is the element called name inside the first element of my_list. A data frame is just a special kind of list, so you can use double bracket subsetting on data frames too. my_df[[1]] will extract the first column of a data frame and my_df[["name"]] will extract the column named name from the data frame. names() and str() is a great way to explore the structure of a list. i in 1:ncol(df) This is a pretty common model for a sequence: a sequence of consecutive integers designed to index over one dimension of our data. What might surprise you is that this isn't the best way to generate such a sequence, especially when you are using for loops inside your own functions. Let's look at an example where df is an empty data frame: df = data.frame() 1:ncol(df) for (i in 1:ncol(df)) { print(median(df[[i]])) } Our sequence is now the somewhat non-sensical: 1, 0. You might think you wouldn't be silly enough to use a for loop with an empty data frame, but once you start writing your own functions, there's no telling what the input will be. A better method is to use the seq_along() function. if you grow the for loop at each iteration (e.g. using c()), your for loop will be very slow. A general way of creating an empty vector of given length is the vector() function. It has two arguments: the type of the vector ("logical", "integer", "double", "character", etc.) and the length of the vector. 
Then, at each iteration of the loop you must store the output in the corresponding entry of the output vector, i.e. assign the result to output[[i]]. (You might ask why we are using double brackets here when output is a vector. It's primarily for generalizability: this subsetting will work whether output is a vector or a list.) A time series can be thought of as a vector or matrix of numbers, along with some information about what times those numbers were recorded. This information is stored in a ts object in R. read in some time series data from an xlsx file using read_excel(), a function from the readxl package, and store the data as a ts object. Use the read_excel() function to read the data from "exercise1.xlsx" into mydata. mydata = read_excel("exercise1.xlsx") Create a ts object called myts using the ts() function. myts = ts(mydata[,2:4], start = c(1981, 1), frequency = 4) The first step in any data analysis task is to plot the data. Graphs enable you to visualize many features of the data, including patterns, unusual observations, changes over time, and relationships between variables. The features that you see in the plots must then be incorporated into the forecasting methods that you use. Just as the type of data determines which forecasting method to use, it also determines which graphs are appropriate. You will use the autoplot() function to produce time plots of the data. In each plot, look out for outliers, seasonal patterns, and other interesting features. Use which.max() to spot the outlier in the gold series. library("fpp2") autoplot(a10) ggseasonplot(a10) An interesting variant of a season plot uses polar coordinates, where the time axis is circular rather than horizontal. ggseasonplot(a10, polar = TRUE) beer = window(a10, start=1992) autoplot(beer) ggseasonplot(beer) Use the window() function to consider only the ausbeer data from 1992 and save this to beer. Set a keyword start to the appropriate year. x = tryCatch( readLines("wx.qq.com/"), warning=function(w){ return(paste( "Warning:", conditionMessage(w)));}, error = function(e) { return(paste( "this is Error:", conditionMessage(e)));}, finally={print("This is try-catch test. check the output.")});x = c(sort(sample(1:20, 9)), NA)
#===================
x = c(sort(sample(1:20, 9)), NA)
y = c(sort(sample(3:23, 7)), NA)
union(x, y)
intersect(x, y)
setdiff(x, y)
setdiff(y, x)
setequal(x, y)
alist = readLines("alist.txt")
blist = readLines("blist.txt")
out = setdiff(blist, alist)
writeClipboard(out)
# subsetting a data frame named sample (not the sample() function):
newData = sample[sample$x > 0 & sample$y > 0.4, ]
#===================
# To skip the 3rd iteration and go to the next iteration
for(n in 1:5) {
  if(n==3) next
  cat(n)
}

googleVis chart
#=================== googleVis chart =============== library(googleVis) Line chart ========== df=data.frame(country=c("US", "GB", "BR"), val1=c(10,13,14), val2=c(23,12,32)) Line = gvisLineChart(df) plot(Line) Scatter chart ======================= # example 1 dat = data.frame(x=c(1,2,3,4,5), y1=c(0,3,7,5,2), y2=c(1,NA,0,3,2)) plot(gvisScatterChart(dat, options=list(lineWidth=2, pointSize=2, width=900, height=600))) # example 2, women Scatter = gvisScatterChart(women, options=list( legend="none", lineWidth=1, pointSize=2, title="Women", vAxis="{title:'weight (lbs)'}", hAxis="{title:'height (in)'}", width=900, height=600) ) plot(Scatter) # example 3 ex3dat = data.frame(x=c(1,2,3,4,5,6,7,8), y1=c(0,3,7,5,2,0,8,6), y2=c(1,NA,0,3,2,6,4,2)) ex3 = gvisScatterChart(ex3dat, options=list( legend="none", lineWidth=1, pointSize=2, title="ex3", vAxis="{title:'weight (lbs)'}", hAxis="{title:'height (in)'}", width=900, height=600) ) plot(ex3) # Note: to plot timeline chart, arrange the time in x axis, beginning with -ve and the last is 1 to show the sequencecat to a file using file(filename, open = "a")
cat("TITLE extra line", "2 3 5 7", "11 13 17", file = "data.txt", sep = "\n")cat append to a file, open file in "a" mode
#===================
textVector = c("First thing","Second thing","c")
catObj = file("theappend.txt", open = "a")
cat(textVector, file = catObj, sep="\n")
close(catObj)
#=================== install.packages("readr") library(readr) to read rectangular data (like csv, tsv, and fwf) readr is part of the core tidyverse library(tidyverse) readr supports seven file formats with seven read_ functions: read_csv(): comma separated (CSV) files read_tsv(): tab separated files read_delim(): general delimited files read_fwf(): fixed width files read_table(): tabular files where columns are separated by white-space. read_log(): web log filesiconv(keyword, "unknown", "GB2312")
#===================
iconv(keyword, "unknown", "GB2312")
#==========
Grabbing HTML Tags
<TAG\b[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML tag. Anything between the tags is captured into the first backreference. The question mark in the regex makes the star lazy, to make sure it stops before the first closing tag rather than before the last, as a greedy star would do. This regex will not properly match tags nested inside themselves, as in <TAG>one <TAG>two</TAG> one</TAG>.
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> will match the opening and closing pair of any HTML tag. Be sure to turn off case sensitivity. The key in this solution is the use of the backreference \1 in the regex. Anything between the tags is captured into the second backreference. This solution will also not match tags nested in themselves.
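A small R sketch of the second pattern (added here; the html string is a made-up example):
html <- "<b>bold</b> plain <i>italic</i>"
m <- gregexpr("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>", html, perl = TRUE)
regmatches(html, m)[[1]]   # "<b>bold</b>" "<i>italic</i>"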
#==========
# find the new item
theList = c("00700","02318","02007")
newList = c("03333","01398","02007")
newList[!(newList %in% theList)]
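The same result can be written with setdiff(), which is used elsewhere in these notes:
setdiff(newList, theList)   # items of newList that are not in theList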
formatting numbers
#==========
a = seq(1,101,25)
sprintf("%03d", a)
format(round(a, 2), nsmall = 2)
#==========
the match function: match(x, table, nomatch = NA_integer_, incomparables = NULL), and %in%
match returns a vector of the positions of (first) matches of its first argument in its second.
Corpus = c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon = c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')
Lexicon %in% Corpus
Lexicon[Lexicon %in% Corpus]

Machine Learning:
Links: Machine Learning with R and TensorFlow, machine-learning-in-r-step-by-step, An Introduction to Machine Learning with R, mxnet image classification, Image Recognition & Classification with Keras
#==========
Machine Learning: the caret package
caret contains wrapper functions that allow you to use the exact same functions for training and predicting with dozens of different algorithms. On top of that, it includes sophisticated built-in methods for evaluating the effectiveness of the predictions you get from the model. Use the Titanic dataset.
Training a model: train a bunch of different decision trees and have them vote. Random forests work pretty well in *lots* of different situations, so I often try them first.
Evaluating the model: cross-validation is a way to evaluate the performance of a model without needing any other data than the training data.
Making predictions on the test set. Improving the model. (A generic caret sketch follows below.)
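A minimal caret sketch (added here; it uses iris instead of the Titanic data mentioned above, and assumes the randomForest package is installed for method = "rf"):
library(caret)
set.seed(42)
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]
fit  <- train(Species ~ ., data = train_df, method = "rf",
              trControl = trainControl(method = "cv", number = 5))  # 5-fold cross-validation
pred <- predict(fit, newdata = test_df)
confusionMatrix(pred, test_df$Species)   # evaluate predictions on the held-out set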
#==========
# to handle error 404 when scraping: use tryCatch()
for (i in urls) {
  tmp = tryCatch(readLines(url(i), warn=F), error = function (e) NULL)
  if (is.null(tmp)) {
    next   # skip to the next url
  }
}
#==========
try(readLines(url), silent = TRUE)
tryCatch(readLines(url), error = function (e) conditionMessage(e))
write.table
#==========
write.table(matrixname, file = "outputname", append = FALSE, quote = FALSE, sep = "\t",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE, col.names = FALSE,
            qmethod = c("escape", "double"), fileEncoding = "")
write.table(finalTableList, theOutputname, row.names=FALSE, col.names=FALSE, quote = FALSE, sep = "\t")
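A round-trip sketch (added here), reading the tab-delimited file written above back in (theOutputname as in the call above):
readBack = read.table(theOutputname, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
str(readBack)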
Add Gaussian noise to vector
# Create a vector x
x <- 1:10
# Add Gaussian noise with mean 0 and standard deviation 0.1
noise <- rnorm(length(x), mean = 0, sd = 0.1)
# Add the noise to the vector x
x_noisy <- x + noise
# Print the original and noisy vectors
print(x)
print(x_noisy)
Generate Random Numbers
Method 1: generate one random number
# generate one random number between 1 and 20
runif(n=1, min=1, max=20)
Method 2: generate multiple random numbers
# generate five random numbers between 1 and 20
runif(n=5, min=1, max=20)
Method 3: generate one random integer from a sample pool
sample(1:20, 1)
Method 4: generate multiple random integers from a sample pool
# generate five random integers between 1 and 20 (sample with replacement)
sample(1:20, 5, replace=TRUE)
# generate five random integers between 1 and 20 (sample without replacement)
sample(1:20, 5, replace=FALSE)
# Generate random numbers from the uniform distribution
> runif(1)    # generates 1 random number
[1] 0.3984754
> runif(3)    # generates 3 random numbers
[1] 0.8090284 0.1797232 0.6803607
> runif(3, min=5, max=10)   # define the range between 5 and 10
[1] 7.099781 8.355461 5.173133
# Generate random numbers from the normal distribution
> rnorm(1)    # generates 1 random number
[1] 1.072712
> rnorm(3)    # generates 3 random numbers
[1] -1.1383656 0.2016713 -0.4602043
> rnorm(3, mean=10, sd=2)   # provide our own mean and standard deviation
[1] 9.856933 9.024286 10.822507
#==========
Four normal distribution functions: R - Normal Distribution
A distribution of data is normal when, on plotting a graph with the value of the variable on the horizontal axis and the count of the values on the vertical axis, we get a bell-shaped curve. The center of the curve represents the mean of the data set. In the graph, fifty percent of the values lie to the left of the mean and the other fifty percent lie to the right. This is referred to as the normal distribution in statistics.
R has four built-in functions to generate the normal distribution:
dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)

dnorm()
This function gives the height of the probability distribution at each point for a given mean and standard deviation.
# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
# Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)
# Give the chart file a name.
png(file = "dnorm.png")
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the plot saved as dnorm.png.
pnorm()
This function gives the probability of a normally distributed random number being less than the value of a given number. It is also called the "Cumulative Distribution Function".
# Create a sequence of numbers between -10 and 10 incrementing by 0.2.
x <- seq(-10, 10, by = .2)
# Choose the mean as 2.5 and standard deviation as 2.
y <- pnorm(x, mean = 2.5, sd = 2)
# Give the chart file a name.
png(file = "pnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the plot saved as pnorm.png.
qnorm()
This function takes a probability value and gives a number whose cumulative value matches the probability value.
# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
# Choose the mean as 2 and standard deviation as 1.
y <- qnorm(x, mean = 2, sd = 1)
# Give the chart file a name.
png(file = "qnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the plot saved as qnorm.png.
rnorm()
This function is used to generate random numbers whose distribution is normal. It takes the sample size as input and generates that many random numbers. We draw a histogram to show the distribution of the generated numbers. # Create a sample of 50 numbers which are normally distributed. y <-rnorm (50) # Give the chart file a name. png(file = "rnorm.png") # Plot the histogram for this sample. hist(y, main = "Normal DIstribution") # Save the file. dev.off() When we execute the above code, it produces the following result −RNORM Generates random numbers from normal distribution rnorm(n, mean, sd) rnorm(1000, 3, .25) Generates 1000 numbers from a normal with mean 3 and sd=.25 DNORM Probability Density Function(PDF) dnorm(x, mean, sd) dnorm(0, 0, .5) Gives the density (height of the PDF) of the normal with mean=0 and sd=.5. dnorm returns the value of the normal distribution given parameters for x, μ, and σ. # x = 0, mu = 0 and sigma = 0 dnorm(0, mean = 0, sd = 1) dnorm(1, mean = 1.2, sd = 0.5) # result: 0.7365403 change x to dataset dataset = seq(-3, 3, by = .1) dvalues = dnorm(dataset) plot(dvalues, # y = values and x = index xaxt = "n", # Don't label the x-axis type = "l", # Make it a line plot main = "pdf of the Standard Normal", xlab= "Data Set") compare the data with dnorm: dataset = c( 5, 1,2,5,3,5,6,4,7,4,5,4,8,6,3,3,6,5,4,3,4,3,4,3) plot(dvalues, # y = values and x = index xaxt = "n", # Don't label the x-axis type = "l", # Make it a line plot main = "pdf of the Standard Normal", xlab= "Data Set") to create a dnorm of a dataset to compare with current dataset make a cut index cutindex = seq(min(dataset),max(dataset),length = 10) yfit = dnorm(cutindex, mean=mean(dataset), sd=sd(dataset)) lines(cutindex, yfit) # Kernel Density Plot d = density(mtcars$mpg) # returns the density data plot(d) # plots the results # Filled Density Plot d = density(mtcars$mpg) plot(d, main="Kernel Density of Miles Per Gallon") polygon(d, col="red", border="blue") Kernel density estimation is a technique that let's you create a smooth curve given a set of data. PNORM Cumulative Distribution Function (CDF) pnorm(q, mean, sd) pnorm(1.96, 0, 1) Gives the area under the standard normal curve to the left of 1.96, i.e. ~0.975 QNORM Quantile Function – inverse of pnorm qnorm(p, mean, sd) qnorm(0.975, 0, 1) Gives the value at which the CDF of the standard normal is .975, i.e. ~1.96 Note that for all functions, leaving out the mean and standard deviation would result in default values of mean=0 and sd=1, a standard normal distribution.
pnorm students scoring higher than 84
#==========
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492
Answer: the percentage of students scoring 84 or more in the college entrance exam is 21.5%.
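A related sketch (added here): the probability of a score falling between the mean and 84 is the difference of two pnorm() values.
pnorm(84, mean=72, sd=15.2) - pnorm(72, mean=72, sd=15.2)   # about 0.285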
plot a histogram of 1000 draws from a normal distribution with mean 10, standard deviation 2
#==========
set.seed(1)   # any fixed seed
x = rnorm(1000, 10, 2)
plot(x)
hist(x)
Assess normality using a QQ plot:
qqnorm(x)
qqline(x)
In statistics, a Q–Q (quantile–quantile) plot is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other. First, the set of intervals for the quantiles is chosen. A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). Thus the line is a parametric curve whose parameter is the number of the interval for the quantile.
#==========
# format leading zeros
formatC(1, width = 2, format = "d", flag = "0")
"01"
formatC(125, width = 5, format = "d", flag = "0")
"00125"

library(pdftools)
#========== setwd("C:/Users/User/Desktop") library(pdftools) txt = pdf_text("a.pdf") str(txt) # 361 pages writeClipboard(txt[1]) Sys.setlocale(category = 'LC_ALL', 'Chinese') options("encoding" = "UTF-8") sink("war.txt") for(i in txt){ cat(i, sep="\n")} sink() shell("war.txt") txt1 = gsub(".*ORIGINATOR", "", txt) txt1 = gsub(" ", "", txt1) list = c(13:16, 19:22, 25:28, 31:34, 37:42, 45:48, 52:58, 62:68, 71:75, 78:85, 88:95, 98:105, 108:115, 118:124, 127:133, 136:142, 145:156, 159:169, 173:202, 206:221, 225:240, 244:258, 261:274, 277:290, 294:298, 302:308, 312:318, 323:331, 334:345, 348:359) txt1 = txt1[list] writeClipboard(txt1) pdf_info("a.pdf") pdf_text("a.pdf") pdf_fonts("a.pdf") pdf_attachments("a.pdf") pdf_toc("a.pdf") toc = pdf_toc("a.pdf") sink("test.txt") print(toc) sink() #========== library(pdftools) txt = pdf_text("a.pdf") str(txt) txtList = unlist(strsplit(txt, "\\s{2,}")) writeClipboard(txtList) pdftools.pdf pdftools Usage pdf_text(pdf)pdfimages
https://stackoverflow.com/questions/47133072/how-to-extract-images-from-a-scanned-pdf http://www.xpdfreader.com/pdfimages-man.html http://www.xpdfreader.com/download.html https://rdrr.io/cran/metagear/src/R/PDF_extractImages.R pdfimages a.pdf -j Quote a string to be passed to an operating system shell. Usage: shQuote(string, type = c("sh", "csh", "cmd", "cmd2")) #("PDF to PPM") files <- list.files(path = dest, pattern = "pdf", full.names = TRUE) lapply(files, function(i){ shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 300 ", i,".pdf", " ",i))) }) You could also just use the CMD prompt and type pdftoppm -f 1 -l 10 -r 300 stuff.pdf stuff.ppmOCR Extract Text from Images
download Using the Tesseract OCR engine in R library(tesseract) i = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/Chineselanguage.svg/1200px-Chineselanguage.svg.png" chi <- tesseract("chi_sim") text <- ocr(i, engine = chi) cat(text) # In love # text <- ocr(i) # for english, default engine library(tesseract) eng <- tesseract("eng") text <- tesseract::ocr("http://jeroen.github.io/images/testocr.png", engine = eng) cat(text) results <- tesseract::ocr_data("http://jeroen.github.io/images/testocr.png", engine = eng) # list the languages have installed. tesseract_info() $datapath [1] "/Users/jeroen/Library/Application Support/tesseract4/tessdata/" $available [1] "chi_sim" "eng" "osd" chinese character recognition using Tesseract OCR download chinese trained data (it will be a file like chi_sim.traineddata) and add it to your tessdata folder. C:/Users/User/AppData/Local/tesseract4/tesseract4/tessdata/ To download the file https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata library(tesseract) chi <- tesseract("chi_sim") datapath = "C:/Users/User/Desktop/testReact/" setwd(datapath) shell(shQuote("D:/XpdfReader-win64/xpdf-tools-win-4.03/bin64/pdfimages a.pdf -j")) allFiles <- list.files(path = datapath, pattern = "jpg", full.names = TRUE) allText = character() # for(i in allFiles){ for(file in 1:5){ i = allFiles[file] cat(i, "\n") text <- tesseract::ocr(i, engine = chi) allText = c(allText, text) } setwd(datapath) Sys.setlocale(category = 'LC_ALL', 'Chinese') options("encoding" = "UTF-8") sink("result.txt") cat(allText, sep="\n") sink() options("encoding" = "native.enc") thepage = readLines("result.txt", encoding="UTF-8") thepage = gsub(" ","", thepage) sink("resultNew.txt") cat(thepage, sep="\n") sink() thepage = readLines("resultNew.txt", encoding="UTF-8") thepage = gsub("。","。\n", thepage) sink("resultNew.txt") cat(thepage, sep="\n") sink() Tesseract-OCR 實用心得 Xpdf language support packages with XpdfViewer, XpdfPrint, XpdfText First, download whichever language support package(s) you need and unpack them. You can unpack them anywhere you like – in step 3, you'll set up the config file with the path to wherever you unpacked them. Create an xpdfrc configuration file (if you haven't done this already). All of the Glyph & Cog tools read an (optional) configuration file with various global settings. To use this config file with the Windows DLLs and COM components, simply create a text file called "xpdfrc" in the same directory as the DLL, COM component, or ActiveX control. This must be a plain text file (not Word or RTF) with no file name extension (correct: xpdfrc; incorrect: xpdfrc.txt). Documentation on the configuration settings, i.e., available commands for the xpdfrc file, can be found in the documentation for the DLL or COM component. Each language support package comes with a file called "add-to-xpdfrc". You need to insert the contents of that file into your own xpdfrc file (created in step 2). This information includes pointers to the various files installed when you unpacked the language support package – make sure you modify these paths to match your install directory. The GPG/PGP key used to sign the packages is available here, or from the PGP keyservers (search for xpdf@xpdfreader.com). 
https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html tesseract_info() to show environment remember to copy the train data to: C:/Users/william/AppData/Local/tesseract4/tesseract4/tessdata/ High Quality OCR in R Using the Tesseract OCR engine in R 實用心得 Tesseract-OCRtrain tessdata library
creating training data improve characters recognition lots of tessdata traindata tessdata langdata Traineddata Files Tesseract ocr train tessdata library on batch with lots of single character image If they are of same font, put them in a multi-page TIFF and conduct training on it. jTessBoxEditor can help you with the TIFF merging and box editing. jTessBoxEditor Here is a summary: 3. The more data, the better the OCR result, so repeat (1) and (2) until you have at least 4 pages. Limit is 32 4. Execute tesseract command to obtain the box files 5. Edit the box file using the bbTesseract editing tool 6. Execute tesseract command to generate the data files (clustering) 7. Rename files with "vie." prefix and copy the files to tessdata directory, overriding the existing data 8. Run OCR on the original images to validate your work. The accuracy rate should be in the high 90% So that the community can benefit from your work, please submit your data files. They will be posted in the VietOCR's Download page. Be sure to indicate the names of the fonts that you have trained for, so users can know which data set they should load into tessdata directory when OCRing their document. training tesseract models from scratchtesseract extra spaces in result when ocr chinese
# workaround to remove extra spaces in OCR result # https://github.com/tesseract-ocr/tesseract/issues/991, 988 and 1009 This fix can be applied via adding the following to the config file and then running combine_tessdata. preserve_interword_spaces 1 SetVariable("preserve_interword_spaces", false); these files need to be fixed: tessdata/chi_sim/chi_sim.config tessdata/chi_tra/chi_tra.config tessdata/jpn/jpn.config tessdata/tha/tha.config tessdata_best/chi_sim/chi_sim.config tessdata_best/chi_tra/chi_tra.config tessdata_best/jpn/jpn.config tessdata_best/tha/tha.config fixed tessdata_best/jpn_vert/jpn_vert.config which is included by tessdata_best/jpn/jpn.config
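If you only want to clean the extra spaces on the R side (instead of rebuilding the traineddata as above), a minimal post-processing sketch is below; the ocrText variable and the regex are assumptions for illustration, and the character range only covers the common CJK block.
# ocrText stands in for the string returned by tesseract::ocr() with the chi_sim engine
ocrText = "香 港 食 品 投 资"
# remove whitespace only when it sits between two CJK characters
cleaned = gsub("(?<=[\u4e00-\u9fff])\\s+(?=[\u4e00-\u9fff])", "", ocrText, perl = TRUE)
cleaned  # "香港食品投资"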
#========== The name of the site environment variable R_ENVIRON "R_HOME/etc/Renviron.site" the default is "R_HOME/etc/Rprofile.site" Sys.getenv("R_USER") Examples ## Example ~/.Renviron on Unix R_LIBS=~/R/library PAGER=/usr/local/bin/less ## Example .Renviron on Windows R_LIBS=C:/R/library MY_TCLTK="c:/Program Files/Tcl/bin" ## Example of setting R_DEFAULT_PACKAGES (from R CMD check) R_DEFAULT_PACKAGES='utils,grDevices,graphics,stats' # this loads the packages in the order given, so they appear on # the search path in reverse order. ## Example of .Rprofile options(width=65, digits=5) options(show.signif.stars=FALSE) setHook(packageEvent("grDevices", "onLoad"), function(...) grDevices::ps.options(horizontal=FALSE)) set.seed(1234) .First = function() cat("\n Welcome to R!\n\n") .Last = function() cat("\n Goodbye!\n\n") ## Example of Rprofile.site local({ # add MASS to the default packages, set a CRAN mirror old = getOption("defaultPackages"); r = getOption("repos") r["CRAN"] = "http://my.local.cran" options(defaultPackages = c(old, "MASS"), repos = r) ## (for Unix terminal users) set the width from COLUMNS if set cols = Sys.getenv("COLUMNS") if(nzchar(cols)) options(width = as.integer(cols)) # interactive sessions get a fortune cookie (needs fortunes package) if (interactive()) fortunes::fortune() }) ## if .Renviron contains FOOBAR="coo\bar"doh\ex"abc\"def'" ## then we get # > cat(Sys.getenv("FOOBAR"), "\n") # coo\bardoh\exabc"def'How to Convert Factor into Numerical?
#========== How to Convert Factor into Numerical? When you convert factors to numeric, first you should convert them into characters and then convert into numeric: as.numeric(as.character(X)) To get the internal level codes instead, use df$column = as.numeric(as.factor(df$column)) or as.integer(as.factor(region))
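A quick sketch of why the as.character() step matters (the example values are made up):
f = factor(c("10", "20", "20", "50"))
as.numeric(f)                # 1 2 2 3  -- the underlying level codes, not the values
as.numeric(as.character(f))  # 10 20 20 50 -- the actual numbers
as.numeric(levels(f))[f]     # 10 20 20 50 -- same result, faster for long factors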
#========== options(error=recover) recover {utils} Browsing after an Error This function allows the user to browse directly on any of the currently active function calls, and is suitable as an error option. The expression options(error = recover) will make this the error option. Usage recover() When called, recover prints the list of current calls, and prompts the user to select one of them. The standard R browser is then invoked from the corresponding environment; the user can type ordinary R language expressions to be evaluated in that environment. Turning off the options() debugging mode in R options(error=NULL)Extract hyperlink from Excel file in R
#========== library(XML) # rename file to .zip my.zip.file = sub("xlsx", "zip", my.excel.file) file.copy(from = my.excel.file, to = my.zip.file) # unzip the file unzip(my.zip.file) # unzipping produces a bunch of files which we can read using the XML package # assume sheet1 has our data xml = xmlParse("xl/worksheets/sheet1.xml") # finally grab the hyperlinks hyperlinks = xpathApply(xml, "//x:hyperlink/@display", namespaces="x") To repair Hyperlink address corrupted: copy file to desk top and rename to zip file open zip file and locate: \xl\worksheets\_rels open the sheet1.xml.rels with editor remove all text: D:\Users\Lawht\AppData\Roaming\Microsoft\Excel\Extract part of a string
#========== x = c("75 to 79", "80 to 84", "85 to 89") substr(x, start = 1, stop = 2) substr(x, start, stop) x = "1234567890" substr(x, 5, 7) "567"alter grades
#========== alter grades: locate the word, get the line location, then alter the score table (a short sketch of this workflow follows below) #========== locate the word v = c('a','b','c','e') 'b' %in% v ## returns TRUE match('b',v) ## returns the first location of 'b', in this case: 2 subv = c('a', 'f') subv %in% v ## returns a vector TRUE FALSE is.element(subv, v) ## returns a vector TRUE FALSE which() which('a' == v) # [1] 1 For finding all occurrences as a vector of indices, grep() returns a vector of integers, which indicate where matches are. yo = c("a", "a", "b", "b", "c", "c") grep("b", yo) # [1] 3 4 ROC="中華民國 – 維基百科,自由的百科全書" grep("中華民國",ROC) Partial String Matching pmatch("med", c("mean", "median", "mode")) # returns 2
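A minimal sketch of the alter-grades workflow; the file name, column names and the +5 adjustment are made-up assumptions:
# read the score table (assumed tab-separated with columns Name and Score)
scores = read.table("scores.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# locate the word (the student's name) and get the row location
row = which(scores$Name == "Mary")   # or grep("Mary", scores$Name) for partial matches
# alter the score table and write it back
scores$Score[row] = scores$Score[row] + 5
write.table(scores, "scores.txt", sep = "\t", row.names = FALSE, quote = FALSE)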
table, cut and barplot
atab = c(1,2,3,2,1,2,3,4,5,4)
table(atab)
# atab
# 1 2 3 4 5
# 2 3 2 2 1
cut(atab, 2)
table(cut(atab, 2))
counts = table(cut(atab, 4))
barplot(counts, main="Qty", xlab="grade")
Note:
testgroup_A = c('@','#','$','#','@')
testgroup_B = c('#','$','*','~','*')
table(testgroup_A, testgroup_B)
#             testgroup_B
# testgroup_A # $ * ~
#           # 0 1 0 1
#           $ 0 0 1 0
#           @ 1 0 1 0
testgroup_A = c('baby','boy','girl','boy','baby')
testgroup_B = c('boy','girl','baby','baby','baby')
table(testgroup_A, testgroup_B)
#             testgroup_B
# testgroup_A baby boy girl
#        baby    1   1    0
#        boy     1   0    1
#        girl    1   0    0
This is to compare freq of two groups
capture.output(cat(counts, sep = ","))V8 is an R interface JavaScript engine.
This package helps us execute javascript code in R #Loading both the required libraries library(rvest) library(V8) #URL with js-rendered content to be scraped link = 'https://food.list.co.uk/place/22191-brewhemia-edinburgh/' #Read the html page content and extract all javascript codes that are inside a list emailjs = read_html(link) %>% html_nodes('li') %>% html_nodes('script') %>% html_text() # Create a new v8 context ct = v8() #parse the html content from the js output and print it as text read_html(ct$eval(gsub('document.write','',emailjs))) %>% html_text() info@brewhemia.co.uk Thus we have used rvest to extract the javascript code snippet from the desired location (that is coded in place of email ID) and used V8 to execute the javascript snippet (with slight code formatting) and output the actual email (that is hidden behind the javascript code). #################### Getting email address through rvest You need a javascript engine here to process the js code. R has got V8. Modify your code after installing V8 package: library(rvest) library(V8) link = 'https://food.list.co.uk/place/22191-brewhemia-edinburgh/' page = read_html(link) name_html = html_nodes(page,'.placeHeading') business_adr = html_text(adr_html) tel_html = html_nodes(page,'.value') business_tel = html_text(tel_html) emailjs = page %>% html_nodes('li') %>% html_nodes('script') %>% html_text() ct = v8() read_html(ct$eval(gsub('document.write','',emailjs))) %>% html_text()extract protected pdf document
library(pdftools) setwd("C:/Users/User/Desktop") txt = pdf_text("a.pdf") str(txt) # 361 pages # copy page 1 writeClipboard(txt[1]) # copy page 2 writeClipboard(txt[2]) # copy page 3 writeClipboard(txt[3]) Convert unicode character to string format: remove "\u" theStr = "\u9999\u6e2f\u98df\u54c1\u6295\u8d44" # "香港食品投资" ============================= Sys.setlocale(category = 'LC_ALL', 'Chinese') library(pdftools) setwd("C:/Users/User/Desktop") txt = pdf_text("45.pdf") str(txt) chi1 = gsub('\\u' , '', txt[1]) chi2 = gsub('\\u' , '', txt[2]) chi3 = gsub('\\u' , '', txt[3]) sink("aaa.txt") cat(chi1) cat(chi2) cat(chi3) sink()Writing an R package
Develop Packages with RStudio rpackage_instructions.pdf Writing an R package from scratch Writing an R packagetable, cut and breaks
table(cut(as.numeric(resultTable[,3]), 10)) cut(as.numeric(resultTable[,3]),10) breaks = unique(c(seq(lower, 0, by = 5), seq(0, upper, by = 5))) # lower/upper are the data range bounds; unique() avoids a duplicated 0, which cut() rejects tableA = c(1,3,5,7,9) tableB = c(1,3,5,7,2,4,6,8) tableA = c(tableA, tableB) tableA = sort(tableA) table(tableA) table(cut(tableA, 3)) breaks = c(seq(1, 3, by = 1), 4, seq(5, 9, by = 2)) table(cut(tableA, breaks))
List the Files in a Directory/Folder list.files() list.dirs(R.home("doc")) list.dirs()best way is to run dos command from R
shell("dir /s >thisdir.txt") this will show all file details instead of only filenames in list.files() commandtest url exist
library(httr) http_error(theUrl) Load image from website download.file("url", destfile="tmp.png", mode="wb") url.exists {RCurl} returns TRUE or FALSE With httr, url_success() used to do this but is deprecated; prefer !http_error()
download.file
This function can be used to download a file from the Internet. download.file(url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE, extra = getOption("download.file.extra"), headers = NULL, ...) example: destfile <- "C:/Users/User/Desktop/aaaa.jpg" url <- "https://i.pinimg.com/originals/22/2d/b8/222db84256aecf2a7532dcb1a3bab9af.jpg" download.file(url, destfile, mode = "w", method='curl') method Method to be used for downloading files. Current download methods are "internal", "wininet" (Windows only) "libcurl", "wget" and "curl", and there is a value "auto": see ‘Details’ and ‘Note’. The method can also be set through the option "download.file.method": see options(). quiet If TRUE, suppress status messages (if any), and the progress bar. mode character. The mode with which to write the file. Useful values are "w", "wb" (binary), "a" (append) and "ab". Not used for methods "wget" and "curl". See also ‘Details’, notably about using "wb" for Windows. cacheOK logical. Is a server-side cached value acceptable? extra character vector of additional command-line arguments for the "wget" and "curl" methods. headers named character vector of HTTP headers to use in HTTP requests. It is ignored for non-HTTP URLs. The User-Agent header, coming from the HTTPUserAgent option (see options) is used as the first header, automatically. ... allow additional arguments to be passed, unused.Passing arguments to R script
Passing arguments to R script Rscript --vanilla testargument.R iris.txt newname To avoid Rscript.exe looping forever waiting for keyboard input, use: cat("a string please: "); a = readLines("stdin",n=1);
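A sketch of how testargument.R can pick up the arguments from the Rscript call above (the column handling is just an assumption):
# testargument.R
args = commandArgs(trailingOnly = TRUE)   # c("iris.txt", "newname") for the call above
infile  = args[1]
outname = args[2]
dat = read.table(infile, header = TRUE)
cat("read", nrow(dat), "rows; writing", paste0(outname, ".txt"), "\n")
write.table(dat, paste0(outname, ".txt"), row.names = FALSE)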
School Revision Papers
http://schoolsnetkenya.com/form-1-revision-papers-for-term-1-2018/ http://schoolsnetkenya.com/form-1-revision-papers-for-term-1-2017/ https://curriculum.gov.mt/en/Examination-Papers/Pages/list_secondary_papers.aspx http://www2.hkedcity.net/sch_files/a/hf1/hf1-lin/visitor_cabinet/67726/F1-2ndTest-Eng.pdf http://www2.hkedcity.net/sch_files/a/hf1/hf1-lin/visitor_cabinet/67726/F2-2ndTest-Eng.pdf http://www.sttss.edu.hk/parents_corner/pastpaper.php
difference between 1L and 1
L specifies an integer type rather than a double; an integer uses only 4 bytes per element. The "L" suffix is a shorthand for as.integer(). > str(1) num 1 > str(1L) int 1
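The size difference can be checked with object.size():
object.size(1:1000000)              # integer vector, roughly 4 MB
object.size(as.numeric(1:1000000))  # double vector, roughly 8 MB
identical(1L, as.integer(1))        # TRUE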
data.table FAQ ♦Data.Table Tutorial R Data.Table Tutorial Datatable Cheat Sheetsetkey does two things:
reorders the rows of the data.table DT by the column(s) provided (a, b) by reference, always in increasing order. marks those columns as key columns by setting an attribute called sorted to DT. The reordering is both fast (due to data.table's internal radix sorting) and memory efficient (only one extra column of type double is allocated). When is setkey() required? For grouping operations, setkey() was never an absolute requirement. That is, we can perform a cold-by or adhoc-by. A key is basically an index into a dataset, which allows for very fast and efficient sort, filter, and join operations. These are probably the best reasons to use data tables instead of data frames (the syntax for using data tables is also much more user friendly, but that has nothing to do with keys). library(data.table) dt=data.table(read.table("wAveTable.txt", header=TRUE, colClasses=c('character', 'numeric', 'numeric'))) colnames(dt) "Code" "WAve5" "WAve10" dt[WAve5 > 5, ] summary(dt[WAve5 = 5, ]) summary(dt[WAve5 %between% c(7,9), ]) data.table dt subset rows using i, and manipulate columns with j, grouped according to by dt[i, j, by] Create a data.table data.table(a = c(1, 2), b = c("a", "b")) convert a data frame or a list to a data.table setDT(df) or as.data.table(df) Subset data.table rows using i dt[1:2, ] subset data.table rows based on values in one or more columns dt[a > 5, ] data.table Logical Operators To Use In i >,<,<=,>=, |, !,&, is.na(),!is.na(), %in%, %like%, %between% data.table extract column(s) by number. Prefix column numbers with “-” to drop dt[, c(2)] data.table extract column(s) by name dt[, .(b, c)] create a data.table with new columns based on the summarized values of rows dt[, .(x = sum(a))] compute a data.table column based on an expression dt[, c := 1 + 2] compute a data.table column based on an expression but only for a subset of rows dt[a == 1, c := 1 + 2] compute a data.table multiple columns based on separate expressions dt[, `:=`(c = 1 , d = 2)] delete a data.table column dt[, c := NULL] convert the type of a data.table column using as.integer(), as.numeric(), as.character(), as.Date(), etc.. dt[, b := as.integer(b)] group data.table rows by values in specified column(s) dt[, j, by = .(a)] group data.table and simultaneously sort rows according to values in specified column(s) dt[, j, keyby = .(a)] summarize data.table rows within groups dt[, .(c = sum(b)), by = a] create a new data.table column and compute rows within groups dt[, c := sum(b), by = a] extract first data.table row of groups dt[, .SD[1], by = a] extract last data.table row of groups dt[, .SD[.N], by = a] perform a sequence of data.table operations by chaining multiple “[]” dt[…][…] reorder a data.table according to specified columns setorder(dt, a, -b), “-” for descending data.table’s functions prefixed with “set” and the operator “:=” work without “=” to alter data without making copies in memory df = as.data.table(df) setDT(df) extract unique data.table rows based on columns specified in “by”. Leave out “by” to use all columns unique(dt, by = c("a", "b")) return the number of unique data.table rows based on columns specified in “by” uniqueN(dt, by = c("a", "b")) rename data.table column(s) setnames(dt, c("a", "b"), c("x", "y")) data.table Syntax DT[ i , j , by], i refers to rows. j refers to columns. 
by refers to adding a group data.table Syntax arguments DT[ i , j , by], with, which, allow.cartesian, roll, rollends, .SD, .SDcols, on, mult, nomatch data.table fread() function to read data, mydata = fread("https://github.com/flights_2014.csv") data.table select only 'origin' column returns a vector dat1 = mydata[ , origin] data.table select only 'origin' column returns a data.table dat1 = mydata[ , .(origin)] or dat1 = mydata[, c("origin"), with=FALSE] data.table select column dat2 =mydata[, 2, with=FALSE] data.table select column Multiple Columns dat3 = mydata[, .(origin, year, month, hour)], dat4 = mydata[, c(2:4), with=FALSE] data.table Dropping Column adding ! sign, dat5 = mydata[, !c("origin"), with=FALSE] data.table Dropping Multiple Columns dat6 = mydata[, !c("origin", "year", "month"), with=FALSE] data.table select variables that contain 'dep' use %like% operator, dat7 = mydata[,names(mydata) %like% "dep", with=FALSE] data.table Rename Variables setnames(mydata, c("dest"), c("Destination")) data.table rename multiple variables setnames(mydata, c("dest","origin"), c("Destination", "origin.of.flight")) data.table find all the flights whose origin is 'JFK' dat8 = mydata[origin == "JFK"] data.table Filter Multiple Values dat9 = mydata[origin %in% c("JFK", "LGA")] data.table selects not equal to 'JFK' and 'LGA' dat10 = mydata[!origin %in% c("JFK", "LGA")] data.table Filter Multiple variables dat11 = mydata[origin == "JFK" & carrier == "AA"] data.table Indexing Set Key tells system that data is sorted by the key column data.table setting 'origin' as a key setkey(mydata, origin), 'origin' key is turned on. data12 = mydata[c("JFK", "LGA")] data.table Indexing Multiple Columns setkey(mydata, origin, dest), key is turned on. mydata[.("JFK", "MIA")] # First key 'origin' matches “JFK” second key 'dest' matches “MIA” data.table Indexing Multiple Columns equivalent mydata[origin == "JFK" & dest == "MIA"] data.table identify the column(s) indexed by key(mydata) data.table sort data using setorder() mydata01 = setorder(mydata, origin) data.table sorting on descending order mydata02 = setorder(mydata, -origin) data.table Sorting Data based on multiple variables mydata03 = setorder(mydata, origin, -carrier) data.table Adding Columns (Calculation on rows) use := operator, mydata[, dep_sch:=dep_time - dep_delay] data.table Adding Multiple Columns mydata002 = mydata[, c("dep_sch","arr_sch"):=list(dep_time - dep_delay, arr_time - arr_delay)] data.table IF THEN ELSE Method I mydata[, flag:= 1*(min < 50)] ,set flag= 1 if min is less than 50. Otherwise, set flag =0. data.table IF THEN ELSE Method II mydata[, flag:= ifelse(min < 50, 1,0)] ,set flag= 1 if min is less than 50. Otherwise, set flag =0. 
data.table build a chain DT[ ] [ ] [ ], mydata[, dep_sch:=dep_time - dep_delay][,.(dep_time,dep_delay,dep_sch)] data.table Aggregate Columns mean mydata[, .(mean = mean(arr_delay, na.rm = TRUE), data.table Aggregate Columns median median = median(arr_delay, na.rm = TRUE), data.table Aggregate Columns min min = min(arr_delay, na.rm = TRUE), data.table Aggregate Columns max max = max(arr_delay, na.rm = TRUE))] data.table Summarize Multiple Columns all the summary function in a bracket, mydata[, .(mean(arr_delay), mean(dep_delay))] data.table .SD operator implies 'Subset of Data' data.table .SD and .SDcols operators calculate summary statistics for a larger list of variables data.table calculates mean of two variables mydata[, lapply(.SD, mean), .SDcols = c("arr_delay", "dep_delay")] data.table Summarize all numeric Columns mydata[, lapply(.SD, mean)] data.table Summarize with multiple statistics mydata[, sapply(.SD, function(x) c(mean=mean(x), median=median(x)))] data.table Summarize by group 'origin mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), by = origin] data.table Summary by group useing keyby= operator mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), keyby = origin] data.table Summarize multiple variables by group 'origin' mydata[, .(mean(arr_delay, na.rm = TRUE), mean(dep_delay, na.rm = TRUE)), by = origin], or mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("arr_delay", "dep_delay"), by = origin] data.table remove non-unique / duplicate cases with unique() setkey(mydata, "carrier"), unique(mydata) data.table remove duplicated setkey(mydata, NULL), unique(mydata), Note : Setting key to NULL is not required if no key is already set. data.table Extract values within a group mydata[, .SD[1:2], by=carrier], selects first and second values from a categorical variable carrier. data.table Select LAST value from a group mydata[, .SD[.N], by=carrier] data.table window function frank() dt = mydata[, rank:=frank(-distance,ties.method = "min"), by=carrier], calculating rank of variable 'distance' by 'carrier'. data.table cumulative sum cumsum() dat = mydata[, cum:=cumsum(distance), by=carrier] data.table lag and lead with shift() shift(variable_name, number_of_lags, type=c("lag", "lead")), DT = data.table(A=1:5), DT[ , X := shift(A, 1, type="lag")], DT[ , Y := shift(A, 1, type="lead")] data.table %between% operator to define a range DT = data.table(x=6:10), DT[x %between% c(7,9)] data.table %like% to find all the values that matches a pattern DT = data.table(Name=c("dep_time","dep_delay","arrival"), ID=c(2,3,4)), DT[Name %like% "dep"] data.table Inner Join Sample Data: (dt1 = data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A")), (dt2 = data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A")), merge(dt1, dt2, by="A") data.table Left Join merge(dt1, dt2, by="A", all.x = TRUE) data.table Right Join merge(dt1, dt2, by="A", all.y = TRUE) data.table Full Join merge(dt1, dt2, all=TRUE) Convert a data.table to data.frame setDF(mydata) convert data frame to data table setDT(), setDT(X, key = "A") data.table Reshape Data dcast.data.table() and melt.data.table() data.table Calculate total number of rows by month and then sort on descending order mydata[, .N, by = month] [order(-N)], The .N operator is used to find count. 
data.table Find top 3 months with high mean arrival delay mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), by = month][order(-mean_arr_delay)][1:3] data.table Find origin of flights having average total delay is greater than 20 minutes mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("arr_delay", "dep_delay"), by = origin][(arr_delay + dep_delay) > 20] data.table Extract average of arrival and departure delays for carrier == 'DL' by 'origin' and 'dest' variables mydata[carrier == "DL", lapply(.SD, mean, na.rm = TRUE), by = .(origin, dest), .SDcols = c("arr_delay", "dep_delay")] data.table Pull first value of 'air_time' by 'origin' and then sum the returned values when it is greater than 300 mydata[, .SD[1], .SDcols="air_time", by=origin][air_time > 300, sum(air_time)]extract flickr image
seek the .context-thumb nodes, get the background-image URL from their style attribute, then convert the medium size _m.jpg to the large size _b.jpg, e.g. https://live.staticflickr.com/2941/15170815109_f81b1994d2_m.jpg -> https://live.staticflickr.com/2941/15170815109_f81b1994d2_b.jpg
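A sketch of those steps with rvest; the user-page URL is a made-up placeholder and Flickr's markup changes over time, so the selector and regex may need adjusting:
library(rvest)
page = read_html("https://www.flickr.com/photos/someuser/")   # hypothetical page containing .context-thumb nodes
styles = page %>% html_nodes(".context-thumb") %>% html_attr("style")
# pull the url(...) out of the background-image style
urls = sub(".*url\\(['\"]?([^'\")]+)['\"]?\\).*", "\\1", styles)
urls = ifelse(grepl("^//", urls), paste0("https:", urls), urls)  # fix protocol-relative URLs
# swap the medium thumbnail suffix for the large one
urls = sub("_m\\.jpg$", "_b.jpg", urls)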
get started xpath selectors R Web Scraping Rvest webscraping-using-readlines-and-rcurl xmlTreeParse, htmlTreeParse getting web data parsing xml html error in r no applicable method for xpathapply Parse and process XML (and HTML) with xml2 ================== web_page = readLines("http://www.interestingwebsite.com") web_page = read.csv("http://www.programmingr.com/jan09rlist.html") # General-purpose data wrangling library(tidyverse) # Parsing of HTML/XML files library(rvest) # String manipulation library(stringr) # Verbose regular expressions library(rebus) # Eases DateTime manipulation library(lubridate) ================== install.packages("RCurl", dependencies = TRUE) library("RCurl") library("XML") past = getURL("http://www.iciba.com/past", ssl.verifypeer = FALSE) # getURL cannot work webpage = read_html("http://www.iciba.com/past") # getURL cannot work jan09_parsed = htmlTreeParse(jan09) ================== http://www.iciba.com/past ul class="base-list switch_part" class library('rvest') library(tidyverse) url = 'http://www.iciba.com/past' webpage = readLines(url, warn=FALSE) webpage = read_html(webpage) grappedData = html_nodes(webpage,'.base-list switch_part') parseData = htmlTreeParse(webpage) rank_data = html_text(grappedData) html_node("#mw-content-text > div > table:nth-child(18)") html_table() the function htmlParse() which is equivalent to xmlParse(file, isHTML = TRUE) output = htmlParse(webpage) class(output) To parse content into an R structure : htmlTreeParse() which is equivalent to htmlParse(file, useInternalNodes = FALSE) output = htmlTreeParse(webpage) class(output) htmlTreeParse(file) especially suited for parsing HTML content returns class "XMLDocumentContent" (R data structure) equivalent to xmlParse(file, isHTML = TRUE, useInternalNodes = FALSE) htmlParse(file, useInternalNodes = FALSE) root =xmlRoot(output) xmlChildren(output) xmlChildren(xmlRoot(output)) XMLNodeList Functions for a given node Function Description xmlName() name of the node xmlSize() number of subnodes xmlAttrs() named character vector of all attributes xmlGetAttr() value of a single attribute xmlValue() contents of a leaf node xmlParent() name of parent node xmlAncestors() name of ancestor nodes getSibling() siblings to the right or to the left xmlNamespace() the namespace (if there’s one) to parse HTML tables using R sched = readHTMLTable(html, stringsAsFactors = FALSE) The html.raw object is not immediately useful because it literally contains all of the raw HTML for the entire webpage. We can parse the raw code using the xpathApply function which parses HTML based on the path argument, which in this case specifies parsing of HTML using the paragraph tag. 
html.raw=htmlTreeParse('http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27013300', useInternalNodes=T ) html.parse=xpathApply(html.raw, "//p", xmlValue) # evaluate input and convert to text txt = htmlToText(url) ================== url = 'http://www.iciba.com/past' webpage = readLines(url, warn=FALSE) scraping_wiki = read_html(webpage) scraping_wiki %>% html_nodes("h1") %>% html_text() url = 'testvibrate.html' webpage = readLines(url, warn=FALSE) x = read_xml(webpage) xml_name(x) =========== This cannot work in office library(rvest) Sys.setlocale(category = 'LC_ALL', 'Chinese') webpage = read_html("http://www.iciba.com/haunt") ullist = webpage %>% html_nodes("ul") content = ullist[2] %>% html_text() content = gsub("n.| |\n|adj.|adv.|prep.|vt.|vi.|&","",content) content = gsub(",|;"," ",content) %>% strsplit(split = " ") %>% unlist() %>% sort() %>% unique() paste0("past","\t",capture.output(cat(content)))R scraping html text example
scraping-html-text library(rvest) scraping_wiki = read_html("https://en.wikipedia.org/wiki/Web_scraping") scraping_wiki %>% html_nodes("h1") scraping_wiki %>% html_nodes("h2") scraping_wiki %>% html_nodes("h1") %>% html_text() scraping_wiki %>% html_nodes("h2") %>% html_text() p_nodes = scraping_wiki %>% html_nodes("p") length(p_nodes) p_text = scraping_wiki %>% html_nodes("p") %>% html_text() p_text[1] p_text[5] ul_text = scraping_wiki %>% html_nodes("ul") %>% html_text() length(ul_text) ul_text[1] substr(ul_text[2], start = 1, stop = 200) li_text = scraping_wiki %>% html_nodes("li") %>% html_text() length(li_text) li_text[1:8] li_text[104:136] all_text = scraping_wiki %>% html_nodes("div") %>% html_text() body_text = scraping_wiki %>% html_nodes("#mw-content-text") %>% html_text() # read the first 207 characters substr(body_text, start = 1, stop = 207) # read the last 73 characters substr(body_text, start = nchar(body_text)-73, stop = nchar(body_text)) # Scraping a specific heading scraping_wiki %>% html_nodes("#Techniques") %>% html_text() ## [1] "Techniques" # Scraping a specific paragraph scraping_wiki %>% html_nodes("#mw-content-text > p:nth-child(20)") %>% html_text() # Scraping a specific list scraping_wiki %>% html_nodes("#mw-content-text > div:nth-child(22)") %>% html_text() # Scraping a specific reference list item scraping_wiki %>% html_nodes("#cite_note-22") %>% html_text() # Cleaning up library(magrittr) scraping_wiki %>% html_nodes("#mw-content-text > div:nth-child(22)") %>% html_text() scraping_wiki %>% html_nodes("#mw-content-text > div:nth-child(22)") %>% html_text() %>% strsplit(split = "\n") %>% unlist() %>% .[. != ""] library(stringr) # read the last 700 characters substr(body_text, start = nchar(body_text)-700, stop = nchar(body_text)) # clean up text body_text %>% str_replace_all(pattern = "\n", replacement = " ") %>% str_replace_all(pattern = "[\\^]", replacement = " ") %>% str_replace_all(pattern = "\"", replacement = " ") %>% str_replace_all(pattern = "\\s+", replacement = " ") %>% str_trim(side = "both") %>% substr(start = nchar(body_text)-700, stop = nchar(body_text)) ################ # rvest tutorials https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/ https://blog.gtwang.org/r/rvest-web-scraping-with-r/ https://www.rdocumentation.org/packages/rvest/versions/0.3.4 https://www.datacamp.com/community/tutorials/r-web-scraping-rvest https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/ https://lmyint.github.io/post/dnd-scraping-rvest-rselenium/ ################ # parse guancha library(rvest) pageHeader="https://user.guancha.cn/main/content?id=181885" pagesource = read_html(pageHeader) ################ # parse RTHK and metroradio library(rvest) pageHeader = "http://news.rthk.hk/rthk/ch/latest-news.htm" pagesource = read_html(pageHeader) className = ".ns2-title" keywordList = html_nodes(pagesource, className) html_text(keywordList) pageHeader = "http://www.metroradio.com.hk/MetroFinance/News/NewsLive.aspx" pagesource = read_html(pageHeader) className = ".n13newslist" keywordList = html_nodes(pagesource, className) className = "a" keywordList = html_nodes(keywordList, className) html_text(keywordList) ################ # parse xhamster library(rvest) pageHeader = "https://xhamster.com/users/fredlake/photos" pagesource = read_html(pageHeader) className = ".xh-paginator-button" keywordList = html_nodes(pagesource, className) html_text(keywordList) html_name(keywordList) html_attrs(keywordList) thelist = unlist(html_attrs(keywordList)) 
length(keywordList) as.numeric(html_text(keywordList[length(keywordList)])) pagesource %>% html_nodes(className) %>% html_text() %>% as.numeric() for ( i in keywordList ) { qlink = html_nodes(s, ".gallery-thumb") cat("Title:", html_text(qlink), "\n") qviews = html_nodes(s, "name") cat("Views:", html_text(qviews), "\n") } ################ # parse text and href pageHeader = "http://news.rthk.hk/rthk/ch/latest-news.htm" pagesource = read_html(pageHeader) className = ".ns2-title" keywordList = html_nodes(pagesource, className) className = "a" a = html_nodes(keywordList, className) html_text(a) html_attr(a, "href") ################ # extract huanqiu.com gallery pageHeader = "https://china.huanqiu.com/gallery/9CaKrnQhXac" pagesource = read_html(pageHeader) className = "article" keywordList = html_nodes(pagesource, className) className = "img" img = html_nodes(keywordList, className) html_attr(img, "src") html_attr(img, "data-alt") ################ # html_nodes samples html_nodes(".a1.b1") html_nodes(".b1:not(.a1)") # Select class contains b1 not a1: html_nodes(".content__info__item__value") html_nodes("[class='b1']") html_nodes("center") html_nodes("font") html_nodes(ateam, "center") html_nodes(ateam, "center font") html_nodes(ateam, "center font b") html_nodes("table") %>% .[[3]] %>% html_table() html_nodes("td") html_nodes() returns all nodes html_nodes(pagesource, className) html_nodes(pg, "div > input:first-of-type"), "value") html_nodes(s, ".gallery-thumb") html_nodes(s, "name") html_nodes(xpath = '//*[@id="a"]') ateam %>% html_nodes("center") %>% html_nodes("td") ateam %>% html_nodes("center") %>% html_nodes("font") td = ateam %>% html_nodes("center") %>% html_nodes("td") td %>% html_nodes("font") if (utils::packageVersion("xml2") > "0.1.2") { td %>% html_node("font") } # To pick out an element at specified position, use magrittr::extract2 # which is an alias for [[ library(magrittr) ateam %>% html_nodes("table") %>% extract2(1) %>% html_nodes("img") ateam %>% html_nodes("table") %>% `[[`(1) %>% html_nodes("img") # Find all images contained in the first two tables ateam %>% html_nodes("table") %>% `[`(1:2) %>% html_nodes("img") ateam %>% html_nodes("table") %>% extract(1:2) %>% html_nodes("img") # XPath selectors --------------------------------------------- # If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = "//table//td")). 
# chaining with XPath is a little trickier - you may need to vary # the prefix you're using - // always selects from the root node # regardless of where you currently are in the doc ateam %>% html_nodes(xpath = "//center//font//b") %>% html_nodes(xpath = "//b") read_html() html_node() # to find the first node html_nodes(doc, "table td") # to find the all node html_nodes(doc, xpath = "//table//td")) html_name() # the name of the tag html_tag() # Extract the tag names html_text() # Extract all text inside the tag html_attr() Extract the a single attribute html_attrs() Extract all the attributes # html_attrs(keywordList) this cannot use id, just list all details # html_attr(keywordList, "id") this select the ids # html_attr(keywordList, "href") this select the hrefs html_nodes("#titleCast .itemprop span") html_nodes("#img_primary img") html_nodes("div.name > strong > a") html_attr("href") html_text(keywordList, trim = FALSE) html_name(keywordList) html_children(keywordList) html_attrs(keywordList) html_attr(keywordList, "[href]", default = NA_character_) parse with xml() then extract components using xml_node() xml_attr() xml_attrs() xml_text() and xml_name() Parse tables into data frames with html_table(). Extract, modify and submit forms with html_form() set_values() submit_form(). Detect and repair encoding problems with guess_encoding() Detect text encoding repair_encoding() repair text encoding Navigate around a website as if you’re in a browser with html_session() jump_to() follow_link() back() forward() Extract, modify and submit forms with html_form(), set_values() and submit_form() The toString() function collapse the list of strings into one. html_node(":not(#commentblock)") # exclude tags ######### demos ######### # Inspired by https://github.com/notesofdabbler library(rvest) library(tidyr) page = read_html("http://www.zillow.com/homes/for_sale/....") houses = page %>% html_nodes(".photo-cards li article") z_id = houses %>% html_attr("id") address = houses %>% html_node(".zsg-photo-card-address") %>% html_text() price = houses %>% html_node(".zsg-photo-card-price") %>% html_text() %>% readr::parse_number() params = houses %>% html_node(".zsg-photo-card-info") %>% html_text() %>% strsplit("\u00b7") beds = params %>% purrr::map_chr(1) %>% readr::parse_number() baths = params %>% purrr::map_chr(2) %>% readr::parse_number() house_area = params %>% purrr::map_chr(3) %>% readr::parse_number() ################ pagesource %>% html_nodes("table") %>% .[[3]] %>% html_table() read_html(doc) %>% html_nodes(".b1:not(.a1)") # Select class contains b1 not a1: # [1] text2 use the attribute selector: read_html(doc) %>% html_nodes("[class='b1']") # [1] text2 Select class contains both: read_html(doc) %>% html_nodes(".a1.b1") # this is 'and' operation # [1] text1 combine class and ID in CSS selector div#content.sectionA # this is 'and' operation ===================== select 2 classes in 1 tag Select class contains b1 not a1: read_html(doc) %>% html_nodes(".b1:not(.a1)") use the attribute selector: read_html(doc) %>% html_nodes("[class='b1']") Select class contains both: read_html(doc) %>% html_nodes(".a1.b1") # this is 'and' operation ===================== standard CSS selector specify either or both html_nodes(".content__info__item__value, skill") # the comma is 'or' operation {xml_nodeset (4)} [1] 5h 59m 42s [2] Beginner + Intermediate [3] September 26, 2013 [4] 82,552 # has both classes in_learning_page html_nodes(".content__info__item__value.skill") # this is 'and' operation {xml_nodeset (1)} [1] 
Beginner + Intermediate in_learning_page %>% html_nodes(".content__info__item__value") %>% str_subset(., "viewers") h = read_html(text) h %>% html_nodes(xpath = '//*[@id="a"]') %>% xml_attr("value") html_attr(html_nodes(pg, "div > input:first-of-type"), "value") ateam %>% html_nodes("center") %>% html_nodes("td") ateam %>% html_nodes("center") %>% html_nodes("font") td = ateam %>% html_nodes("center") %>% html_nodes("td") # When applied to a list of nodes, html_nodes() returns all nodes, # collapsing results into a new nodelist. td %>% html_nodes("font") # nodes, it returns a "missing" node if (utils::packageVersion("xml2") > "0.1.2") { td %>% html_nodes("font")sort() rank() order()
Rank references the position of the value in the sorted vector and is in the same order as the original sequence. Order returns the position of the original value and is in the order of the sorted sequence.
x = c(1, 8, 9, 4)
sort(x)  # 1 4 8 9
rank(x)  # 1 3 4 2  -- for each element, its position in the sorted vector
order(x) # 1 4 2 3  -- the original positions, listed in sorted order
Bioinformatics
Bioinformatics using R bioconductor Introduction to Bioconductor:Annotation and Analysis of Genomes and Genomics Assaysa list of dataframes, 3D data arrangement
d1 = data.frame(y1=c(1,2,3),y2=c(4,5,6)) d2 = data.frame(y1=c(3,2,1),y2=c(6,5,4)) d3 = data.frame(y1=c(7,8,9),y2=c(5,2,6)) mylist = list(d1, d2, d3) names(mylist) = c("List1","List2","List3") mylist[[1]] # same as mylist$List1 (mylist[1] would return a one-element list instead) mylist[[2]][1,2] # access an element inside a dataframe mylist[[2]][2,2] # same as mylist$List2[2,2] to concatenate another dataframe: d4 = data.frame(y1=c(2,5,8),y2=c(1,4,7)) mylist[[4]] = d4 to create an empty list: data = list()
format time string
Sys.time() sub(".* | .*", "", Sys.time()) format(Sys.time(), '%H:%M') gsub(":", "", format(Sys.time(), '%H:%M')) format(Sys.time(), '%H%M')extract 5 digit from string
library(stringr) # str_replace() comes from stringr activityListCode = str_replace(activityListCode, ".*\\b(\\d{5})\\b.*", "\\1")
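For example (the input string is a made-up illustration):
library(stringr)
x = "ActivityList 2020 code 12345 (final)"
str_replace(x, ".*\\b(\\d{5})\\b.*", "\\1")              # "12345"
regmatches(x, regexpr("\\b\\d{5}\\b", x, perl = TRUE))   # base R alternative, also "12345"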
access Components of a Data Frame
Components of data frame can be accessed like a list or like a matrix.
Accessing like a list
We can use either [ , [[ or $ operator to access columns of data frame.
> x["Name"] Name 1 John 2 Dora > x$Name [1] "John" "Dora" > x[["Name"]] [1] "John" "Dora" > x[[3]] [1] "John" "Dora"
Accessing with [[ or $ is similar. However, it differs for [ in that, indexing with [ will return us a data frame but the other two will reduce it into a vector.
Accessing like a matrix
Data frames can be accessed like a matrix by providing index for row and column. To illustrate this, we use datasets already available in R. Datasets that are available can be listed with the command library(help = "datasets"). We will use the trees dataset which contains Girth, Height and Volume for Black Cherry Trees. A data frame can be examined using functions like str() and head().
> str(trees) 'data.frame': 31 obs. of 3 variables: $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ... $ Height: num 70 65 63 72 81 83 66 75 80 75 ... $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ... > head(trees,n=3) Girth Height Volume 1 8.3 70 10.3 2 8.6 65 10.3 3 8.8 63 10.2
We can see that trees is a data frame with 31 rows and 3 columns. We also display the first 3 rows of the data frame. Now we proceed to access the data frame like a matrix.
> trees[2:3,] # select 2nd and 3rd row Girth Height Volume 2 8.6 65 10.3 3 8.8 63 10.2 > trees[trees$Height > 82,] # selects rows with Height greater than 82 Girth Height Volume 6 10.8 83 19.7 17 12.9 85 33.8 18 13.3 86 27.4 31 20.6 87 77.0 > trees[10:12,2] [1] 75 79 76
We can see in the last case that the returned type is a vector since we extracted data from a single column. This behavior can be avoided by passing the argument drop=FALSE as follows.
> trees[10:12,2, drop = FALSE] Height 10 75 11 79 12 76
# access first row by index, returns a data.frame x[1,] # access first row by "name", returns a data.frame > x["1",] # access first row returns a vector use as.numeric str(as.numeric(wAveTable["1",])) unlist which keeps the names. str(unlist(wAveTable["1",])) use transpose and as.vector str(as.vector(t(wAveTable["1",])[,1])) use only as.vector cannot convert to vector str(as.vector(wAveTable["1",])) # convert dataframe to matrix data.matrix(wAveTable)read.csv as character
wAveTable = read.csv("wAveTable.txt", sep="\t", colClasses=c('character', 'character', 'character'))frequency manipulation
grade = c("low", "high", "medium", "high", "low", "medium", "high") # using factor to count the frequency foodfac = factor(grade) summary(foodfac) max(summary(foodfac)) min(summary(foodfac)) levels(foodfac) nlevels(foodfac) summary(levels(foodfac)) # use of table to count frequency: table(grade) sort(table(grade)) table(grade)[1] max(table(grade)) summary(table(grade)) # this locate the max item: table(grade)[which(table(grade) == max(table(grade)))] # change to dataframe and find the max item: theTable = as.data.frame(table(grade)) theTable[which(theTable$Freq == max(theTable$Freq)),] # use of the count function in plyr: library(plyr) count(grade) count(mtcars, 'gear') # use of the which function: which(letters == "g") x = c(1,5,8,4,6) which(x == 5) which(x != 5)5 must have R programming tools
1) RStudio
2) lintr
If you come from the world of Python, you’ve probably heard of linting. Essentially, linting analyzes your code for readability. It makes sure you don’t produce code that looks like this: # This is some bad R code
if ( mean(x,na.rm=T)==1) { print("This code is bad"); } # Still bad code because this line is SO long
There are many things wrong with this code. For starters, the code is too long. Nobody likes to read code with seemingly endless lines. There are also no spaces after the comma in the mean() function, or any spaces around the == operator. Oftentimes data science is done hastily, but linting your code is a good reminder for creating portable and understandable code. After all, if you can't explain what you are doing or how you are doing it, your data science job is incomplete. lintr is an R package, growing in popularity, that allows you to lint your code. Once you install lintr, linting a file is as easy as lint("filename.R").
3) Caret
Caret, which you can find on CRAN, is central to a data scientist’s toolbox in R. Caret allows one to quickly develop models, set cross-validation methods and analyze model performance all in one. Right out of the box, Caret abstracts the various interfaces to user-made algorithms and allows you to swiftly create models from averaged neural networks to boosted trees. It can even handle parallel processing. Some of the models caret includes are: AdaBoost, Decision Trees & Random Forests, Neural Networks, Stochastic Gradient Boosting, nearest neighbors, support vector machines — among the most commonly used machine learning algorithms.4) Tidyverse
You may not have heard oftidyverse
as a whole, but chances are, you’ve used one of the packages in it. Tidyverse is a set of unified packages meant to make data science… easyr (classic R pun). These packages alleviate many of the problems a data scientist may run into when dealing with data, such as loading data into your workspace, manipulating data, tidying data or visualizing data. Undoubtedly, these packages make dealing with data in R more efficient. It’s incredibly easy to get Tidyverse, you just runinstall.packages("tidyverse")
and you get: ggplot2: A popular R package for creating graphics dplyr: A popular R package for efficiently manipulating data tidyr: An R package for tidying up data sets readr: An R package for reading in data purrr: An R package which extends R’s functional programming toolkit purrr Tutorial tibble: An R package which introduces the tibble (tbl_df), an enhancement of the data frame By and large, ggplot2 and dplyr are some of the most common packages in the R sphere today, and you’ll see countless posts on StackOverflow on how to use either package. (Fine Print: Keep in mind, you can’t just load everything withlibrary(tidyverse)
you must load each individually!)5) Jupyter Notebooks or R Notebooks
Data science MUST be transparent and reproducible. For this to happen, we have to see your code! The two most common ways to do this are through Jupyter Notebooks or R Notebooks. Essentially, a notebook (of either kind) allows you to run R code block by block, and show output block my block. We can see on the left that we are summarizing the data, then checking the output. After, we plot the data, then view the plot. All of these actions take place within the notebook, and it makes analyzing both output and code a simultaneous process. This can help data scientists collaborate and ease the friction of having to open up someone’s code and understand what it does. Additionally, notebooks also make data science reproducible, which gives validity to whatever data science work you do!Honorable Mention: Git
Last but not least, I want to mention Git. Git is a version control system. So why use it? Well, it’s in the name. Git allows you to keep versions of the code you are working on. It also allows multiple people to work on the same project and allows those changes to be attributed to certain contributors. You’ve probably heard of Github, undoubtedly one of the most popular git servers. You can visit my website at www.peterxeno.com and my Github at www.github.com/peterxenoR with Javascript
R and D3 tools for working with JavaScript in R R Connecting with Javascripterror handling: tryCatch
error handling code should read up on: (expr) — evaluates expression warning(…) — generates warnings stop(…) — generates errors result = tryCatch( expr = { 1 + 1 }, error = function(e){ message("error message.") }, warning = function(w){ message("warning message.") }, finally = { message("tryCatch is finished.") } ) Note that you can also place multiple expressions in the "expressions part" (argument expr of tryCatch()) if you wrap them in curly brackets. example # This doesn't let the user interrupt the code i = 1 while(i < 3) { tryCatch({ Sys.sleep(0.5) message("Try to escape") }, interrupt = function(x) { message("Try again!") i <<- i + 1 }) } example readUrl = function(url) { out = tryCatch( { message("This is the 'try' part") readLines(con=url, warn=FALSE) }, error = function(cond) { message(paste("URL does not exist:", url)) message("Here's the original error message:") message(cond) # Choose a return value in case of error return(NA) }, warning=function(cond) { message(paste("URL caused a warning:", url)) message("Here's the original warning message:") message(cond) # Choose a return value in case of warning return(NULL) }, finally={ # executed at the end regardless of success or error. # wrap them in curly brackets ({...}) to run more than one expression, otherwise just have 'finally={expression}' message("Some message at the end") } ) return(out) } e.g. x = tryCatch( readLines("wx.qq.com/"), warning=function(w){ return(paste( "Warning:", conditionMessage(w)));}, error = function(e) { return(paste( "this is Error:", conditionMessage(e)));}, finally={print("This is try-catch test. check the output.")} ); a retry function: retry = function(dothis, max = 10, init = 0){ suppressWarnings( tryCatch({ if(init < max) dothis() }, error = function(e){retry(dothis, max, init = init+1)} ) ) } dothis = function(){} # put the action to retry here
Download Image when opened with windows image viewer it also says it is corrupt. The reason for this is that you don't have specified the mode in the download.file statement. Try this: download.file(y,'y.jpg', mode = 'wb') download.file('http://78.media.tumblr.com/83a81c41926c1da585916a5c092b4789/tumblr_or0y0vdjOP1rttk8po1_1280.jpg','y.jpg', mode = 'wb') To view the image in R library(jpeg) jj = readJPEG("y.jpg",native=TRUE) plot(0:1,0:1,type="n",ann=FALSE,axes=FALSE) rasterImage(jj,0,0,1,1)testShiny
setwd("D:/KPC/testShiny") runApp("D:/KPC/testShiny")Error in file(filename, "r", encoding = encoding)
The error indicates that either the file doesn't exist or the source() command was given an incorrect path.
call another R program from R program
source("program_B.R")to view all the functions present in a package
To list all objects in the package use ls(): ls("package:Hmisc") Note that the package must be attached. To list only the functions use lsf.str(): lsf.str("package:dplyr") lsf.str("package:Hmisc") To see the list of currently loaded packages use search() Alternatively calling the help would also do, even if the package is not attached: help(package = dplyr) help(package = Hmisc) Finally, use RStudio which provides an autocomplete function. So, for instance, typing Hmisc:: in the console or while editing a file will result in a popup list of all Hmisc functions/objects.
Function like cut but left endpoints are inclusive. install.packages("Hmisc") library(Hmisc) alist = c(-15,18,2,5,4,-7,-5,-3,-1,0,2,1,5,4,6) breaks = c(-5,-3,-1,0,1,3,5) table(cut2(alist, breaks))Reference A Data Frame Column
with the double square bracket "[[]]" operator. LastDayTable[["Vol"]] or LastDayTable$Vol or LastDayTable[,"Vol"]Writing data to a file
Problem
You want to write data to a file.Solution
Writing to a delimited text file
The easiest way to do this is to use write.csv(). By default, write.csv() includes row names, but these are usually unnecessary and may cause confusion.
# A sample data frame data = read.table(header=TRUE, text=' subject sex size 1 M 7 2 F NA 3 F 9 4 M 11 ') # Write to a file, suppress row names write.csv(data, "data.csv", row.names=FALSE) # Same, except that instead of "NA", output blank cells write.csv(data, "data.csv", row.names=FALSE, na="") # Use tabs, suppress row names and column names write.table(data, "data.csv", sep="\t", row.names=FALSE, col.names=FALSE)
Saving in R data format
write.csv() and write.table() are best for interoperability with other data analysis programs. They will not, however, preserve special attributes of the data structures, such as whether a column is a character type or factor, or the order of levels in factors. In order to do that, it should be written out in a special format for R. Below are three primary ways of doing this: The first method is to output R source code which, when run, will re-create the object. This should work for most data objects, but it may not be able to faithfully re-create some more complicated data objects.
# Save in a text format that can be easily loaded in R dump("data", "data.Rdmpd") # Can save multiple objects: dump(c("data", "data1"), "data.Rdmpd") # To load the data again: source("data.Rdmpd") # When loaded, the original data names will automatically be used.
The next method is to write out individual data objects in RDS format. This format can be binary or ASCII. Binary is more compact, while ASCII will be more efficient with version control systems like Git.# Save a single object in binary RDS format saveRDS(data, "data.rds") # Or, using ASCII format saveRDS(data, "data.rds", ascii=TRUE) # To load the data again: data = readRDS("data.rds")
It’s also possible to save multiple objects into an single file, using the RData format.# Saving multiple objects in binary RData format save(data, file="data.RData") # Or, using ASCII format save(data, file="data.RData", ascii=TRUE) # Can save multiple objects save(data, data1, file="data.RData") # To load the data again: load("data.RData")
An important difference between saveRDS() and save() is that, with the former, when you readRDS() the data, you specify the name of the object, and with the latter, when you load() the data, the original object names are automatically used. Automatically using the original object names can sometimes simplify a workflow, but it can also be a drawback if the data object is meant to be distributed to others for use in a different environment.
Debugging a script or function
Problem
You want to debug a script or function.Solution
Insert this into your code at the place where you want to start debugging: browser() When the R interpreter reaches that line, it will pause your code and you will be able to look at and change variables. In the browser, typing these letters will do things:
c | Continue |
n (or Return) | Next step |
Q | quit |
Ctrl-C | go to top level |
By default, pressing Enter on an empty line in the browser is the same as typing n and then Enter. This can be annoying. To disable it use:
options(browserNLdisabled=TRUE)
To start debugging whenever an error is thrown, run this before your function which throws an error:
options(error=recover)
If you want these options to be set every time you start R, you can put them in your ~/.Rprofile file.
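A small sketch of browser() in action (the function and data here are made up):
buggy_mean = function(x) {
  total = sum(x)
  browser()        # execution pauses here; inspect total and x, then type c to continue
  total / length(x)
}
buggy_mean(c(1, 2, NA, 4))   # at the Browse[1]> prompt try: total, is.na(x), Q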
Even with the plyr package available, the base *apply functions remain useful and worth knowing.
# Two dimensional matrix
M = matrix(seq(1,16), 4, 4)
# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4
# apply max to columns
apply(M, 2, max)
[1] 4 8 12 16
# 3 dimensional array
M = array( seq(32), dim = c(4,4,2))
# Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension, look from top is an area
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144
# Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
[,1] [,2] [,3] [,4]
[1,] 18 26 34 42
[2,] 20 28 36 44
[3,] 22 30 38 46
[4,] 24 32 40 48
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
lapply - apply a function to each element of a list in turn and get a list back; it is the workhorse behind many of the other *apply functions, which often use lapply underneath.
x = list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91
lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005
sapply - like lapply, but it tries to simplify the result to a vector. If you find yourself typing unlist(lapply(...)), stop and consider sapply.
x = list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
a b c
1 3 91
sapply(x, FUN = sum)
a b c
1 6 5005
In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate.
For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
sapply(1:5,function(x) matrix(x,2,2))
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
vapply - use when you want sapply but perhaps need to squeeze some more speed out of your code. For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
x = list(a = 1, b = 1:3, c = 10:100)
# Note that since the advantage here is mainly speed, this
# example is only for illustration.
# We're telling R that
# everything returned by length() should be an integer of length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
mapply - a multivariate version of sapply: the supplied function is applied to the 1st elements of each argument, then the 2nd elements, and so on. It is multivariate in the sense that your function must accept multiple arguments.
#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
Map - a wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
[[4]]
[1] 12
[[5]]
[1] 15
rapply - applies a function recursively to the elements of a nested list. To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:
# Append ! to string, otherwise increment
myFun = function(x){
if(is.character(x)){
return(paste(x,"!",sep=""))
}
else{
return(x + 1)
}
}
#A nested list structure
l = list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
b = 3, c = "Yikes",
d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
# Result is named vector, coerced to character
rapply(l, myFun)
# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
tapply - apply a function to subsets of a vector, where the subsets are defined by some other vector (usually a factor). A simple vector:
x = 1:20
A factor (of the same length!) defining groups:
y = factor(rep(letters[1:5], each = 4))
Add up the values in x within each subgroup defined by y:
tapply(x, y, sum)
a b c d e
10 26 42 58 74
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors.
tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
Slice vector
We can use lapply() or sapply() interchangeably to slice a data frame.
We create a function, below_ave(), that takes a vector of numerical values and returns a vector that only contains the values that are strictly above the average.
above_ave = function(x) {
ave = mean(x)
return(x[x > ave])
}
Compare both results with the identical() function.
dataf_s = sapply(dataf, above_ave)
dataf_l = lapply(dataf, above_ave)
identical(dataf_s, dataf_l)
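The dataf object is not defined in these notes; as a minimal runnable sketch, assume it is any all-numeric data frame, e.g. the built-in cars data:
dataf = cars                        # assumption: any all-numeric data frame works here
dataf_s = sapply(dataf, above_ave)
dataf_l = lapply(dataf, above_ave)
identical(dataf_s, dataf_l)         # TRUE when the slices have unequal lengths, since sapply then cannot simplify and returns a list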
#install.packages("rvest")
library(rvest)
# Store web url
lego_movie = html("http://www.imdb.com/title/tt1490017/")
#Scrape the website for the movie rating
rating = lego_movie %>%
html_nodes("strong span") %>%
html_text() %>%
as.numeric()
rating
## [1] 7.8
# Scrape the website for the cast
cast = lego_movie %>%
html_nodes("#titleCast .itemprop span") %>%
html_text()
cast
## [1] "Will Arnett" "Elizabeth Banks" "Craig Berry"
## [4] "Alison Brie" "David Burrows" "Anthony Daniels"
## [7] "Charlie Day" "Amanda Farinos" "Keith Ferguson"
## [10] "Will Ferrell" "Will Forte" "Dave Franco"
## [13] "Morgan Freeman" "Todd Hansen" "Jonah Hill"
#Scrape the website for the url of the movie poster
poster = lego_movie %>%
html_nodes("#img_primary img") %>%
html_attr("src")
poster
## [1] "http://ia.media-imdb.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_SX214_AL_.jpg"
# Extract the first review
review = lego_movie %>%
html_nodes("#titleUserReviewsTeaser p") %>%
html_text()
review
## [1] "The stand out feature of the Lego Movie for me would be the way the Lego Universe was created.
The movie paid great attention to detail making everything appear as it would made from Lego, including the water and clouds, and the surfaces people walked on all had the circles sticking upwards a Lego piece would have.
Combined with all the yellow faces, and Lego part during building, I was convinced action took place in the Lego Universe.A combination of adult and child friendly humour should entertain all, the movie has done well to ensure audiences of all ages are catered to.
The voice cast were excellent, especially Liam Neeson's split personality police officer, making the 2 personalities sound distinctive, and giving his Bad Cop the usual Liam Neeson tough guy.
The plot is about resisting an over-controlling ruler, highlighted by the name of the hero's \"resistance piece\".
It is well thought through, well written, and revealing at the right times.
Full of surprises, The Lego Movie won't let You see what's coming.
Best animated film since Wreck it Ralph! Please let there be sequels."
# Submit the form on indeed.com for a job description and location using html_form() and set_values()
query = "data science"
loc = "New York"
session = html_session("http://www.indeed.com")
form = html_form(session)[[1]]
form = set_values(form, q = query, l = loc)
# The rvest submit_form function is still under construction and does not work for web sites
# which build URLs (i.e. GET requests); it does seem to work for POST requests.
#url = submit_form(session, indeed)
# Version 1 of our submit_form function
submit_form2 = function(session, form){
library(XML)
url = XML::getRelativeURL(form$url, session$url)
url = paste(url,'?',sep='')
values = as.vector(rvest:::submit_request(form)$values)
att = names(values)
if (tail(att, n=1) == "NULL"){
values = values[1:length(values)-1]
att = att[1:length(att)-1]
}
q = paste(att,values,sep='=')
q = paste(q, collapse = '&')
q = gsub(" ", "+", q)
url = paste(url, q, sep = '')
html_session(url)
}
# Version 2 of our submit_form function
library(httr)
# Appends element of a list to another without changing variable type of x
# build_url function uses the httr package and requires a variable of the url class
appendList = function (x, val)
{
stopifnot(is.list(x), is.list(val))
xnames = names(x)
for (v in names(val)) {
x[[v]] = if (v %in% xnames && is.list(x[[v]]) && is.list(val[[v]]))
appendList(x[[v]], val[[v]])
else c(x[[v]], val[[v]])
}
x
}
# Simulating submit_form for GET requests
submit_geturl = function (session, form)
{
query = rvest:::submit_request(form)
query$method = NULL
query$encode = NULL
query$url = NULL
names(query) = "query"
relativeurl = XML::getRelativeURL(form$url, session$url)
basepath = parse_url(relativeurl)
fullpath = appendList(basepath,query)
fullpath = build_url(fullpath)
fullpath
}
# Submit form and get new url
session1 = submit_form2(session, form)
# Get reviews of last company using follow_link()
session2 = follow_link(session1, css = "#more_9 li:nth-child(3) a")
reviews = session2 %>% html_nodes(".description") %>% html_text()
reviews
## [1] "Custody Client Services"
## [2] "An exciting position on a trading floor"
## [3] "Great work environment"
## [4] "A company that helps its employees to advance career."
## [5] "Decent Company to work for while you still have the job there."
# Get average salary for each job listing based on title and location
salary_links = html_nodes(session1, css = "#resultsCol li:nth-child(2) a") %>% html_attr("href")
salary_links = paste(session$url, salary_links, sep='')
salaries = lapply(salary_links, . %>% html() %>% html_nodes("#salary_display_table .salary") %>% html_text())
salary = unlist(salaries)
# Store web url
data_sci_indeed = session1
# Get job titles
job_title = data_sci_indeed %>%
html_nodes("[itemprop=title]") %>%
html_text()
# Get companies
company = data_sci_indeed %>%
html_nodes("[itemprop=hiringOrganization]") %>%
html_text()
# Get locations
location = data_sci_indeed %>%
html_nodes("[itemprop=addressLocality]") %>%
html_text()
# Get descriptions
description = data_sci_indeed %>%
html_nodes("[itemprop=description]") %>%
html_text()
# Get the links
link = data_sci_indeed %>%
html_nodes("[itemprop=title]") %>%
html_attr("href")
link = paste('[Link](https://www.indeed.com', link, sep='')
link = paste(link, ')', sep='')
indeed_jobs = data.frame(job_title,company,location,description,salary,link)
library(knitr)
kable(indeed_jobs, format = "html")
job_title | company | location | description | salary | link |
---|---|---|---|---|---|
Data Scientist | Career Path Group | New York, NY 10018 (Clinton area) | Or higher in Computer Science or related field. Design, develop, and optimize our data and analytics system…. | $109,000 | Link |
Data Scientist or Statistician | Humana | New York, NY | Experience with unstructured data analysis. Humana is seeking an experienced statistician with demonstrated health and wellness data analysis expertise to join… | $60,000 | Link |
Analyst | 1010data | New York, NY | Data providers can also use 1010data to share and monetize their data. 1010data is the leading provider of Big Data Discovery and data sharing solutions…. | $81,000 | Link |
Data Scientist & Visualization Engineer | Enstoa | New York, NY | 2+ years professional experience analyzing complex data sets, modeling, machine learning, and/or large-scale data mining…. | $210,000 | Link |
Data Scientist - Intelligent Solutions | JPMorgan Chase | New York, NY | Experience managing and growing a data science team. Data Scientist - Intelligent Solutions. Analyze communications data and Utilize statistical natural… | $109,000 | Link |
Analytics Program Lead | AIG | New York, NY | Lead the analytical team for Data Solutions. Graduate degree from a renowned institution in any advanced quantitative modeling oriented discipline including but… | $126,000 | Link |
Data Engineer | Standard Analytics | New York, NY | Code experience in a production environment (familiar with data structures, parallelism, and concurrency). We aim to organize the world’s scientific information… | $122,000 | Link |
Summer Intern - Network Science and Big Data Analytics | IBM | Yorktown Heights, NY | The Network Science and Big Data Analytics department at the IBM T. Our lab has access to large computing resources and data…. | $36,000 | Link |
Data Scientist | The Nielsen Company | New York, NY | As a Data Scientist in the Data Integration group, you will be involved in the process of integrating data to enable analyses of patterns and relationships… | $109,000 | Link |
Data Analyst, IM Data Science | BNY Mellon | New York, NY | The Data Analyst will support a wide variety of projects and initiatives of the Data Science Group, including the creation of back-end data management tools,… | $84,000 | Link |
# Attempt to crawl LinkedIn, requires useragent to access Linkedin Sites
uastring = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session = html_session("https://www.linkedin.com/job/", user_agent(uastring))
form = html_form(session)[[1]]
form = set_values(form, keywords = "Data Science", location="New York")
new_url = submit_geturl(session,form)
new_session = html_session(new_url, user_agent(uastring))
jobtitle = new_session %>% html_nodes(".job [itemprop=title]") %>% html_text
company = new_session %>% html_nodes(".job [itemprop=name]") %>% html_text
location = new_session %>% html_nodes(".job [itemprop=addressLocality]") %>% html_text
description = new_session %>% html_nodes(".job [itemprop=description]") %>% html_text
url = new_session %>% html_nodes(".job [itemprop=title]") %>% html_attr("href")
url = paste(url, ')', sep='')
url = paste('[Link](', url, sep='')
df = data.frame(jobtitle, company, location, url)
df %>% kable
jobtitle | company | location | url |
---|---|---|---|
Data Science Lead: Metis | Kaplan | New York City, NY, US | Link |
Data Science Lead: Metis | Kaplan Test Prep | New York, NY | Link |
Think Big Senior Data Scientist | Think Big, A Teradata Company | US-NY-New York | Link |
Think Big Principal Data Scientist | Think Big, A Teradata Company | US-NY-New York | Link |
Data Scientist - Professional Services Consultant (East … | MapR Technologies | Greater New York City Area | Link |
Think Big Senior Data Scientist | Teradata | New York City, NY, US | Link |
Think Big Principal Data Scientist | Teradata | New York City, NY, US | Link |
Sr. Software Engineer - Data Science - HookLogic | HookLogic, Inc. | New York City, NY, US | Link |
Think Big Data Scientist | Think Big, A Teradata Company | US-NY-New York | Link |
Director of Data Science Programs | DataKind | New York City, NY, US | Link |
Lead Data Scientist - VP - Intelligent Solutions | JPMorgan Chase & Co. | US-NY-New York | Link |
Senior Data Scientist for US Quantitative Fund, NYC | GQR Global Markets | Greater New York City Area | Link |
Google Cloud Solutions Practice, Google Data Solution … | PricewaterhouseCoopers | New York City, NY, US | Link |
Senior Data Scientist | Dun and Bradstreet | Short Hills, NJ, US | Link |
Senior data scientist | Mezzobit | New York City, NY, US | Link |
Think Big Data Scientist | Teradata | New York City, NY, US | Link |
Data Scientist - Intelligent Solutions | JPMorgan Chase & Co. | US-NY-New York | Link |
Technical Trainer EMEA | Datameer | New York | Link |
Elementary School Science Teacher | Success Academy Charter Schools | Greater New York City Area | Link |
Middle School Science Teacher | Success Academy Charter Schools | Greater New York City Area | Link |
Data Scientist (various levels) | Burtch Works | Greater New York City Area | Link |
Sr. Data Scientist – Big Data, Online Advertising, Search | Magnetic | New York, NY | Link |
Sr. Big Data Engineer FlexGraph | ADP | New York, NY | Link |
Data Science Lead Instructor - Data Science, Teaching | CyberCoders | New York City, NY | Link |
Director, Data Consulting | Havas Media | Greater New York City Area | Link |
# Attempt to crawl Columbia Lionshare for jobs
session = html_session("http://www.careereducation.columbia.edu/lionshare")
form = html_form(session)[[1]]
form = set_values(form, username = "uni")
#Below code commented out in Markdown
#pw = .rs.askForPassword("Password?")
#form = set_values(form, password = pw)
#rm(pw)
#session2 = submit_form(session, form)
#session2 = follow_link(session2, "Job")
#form2 = html_form(session2)[[1]]
#form2 = set_values(form2, PositionTypes = 7, Keyword = "Data")
#session3 = submit_form(session2, form2)
# Unable to scrape because the table containing the job data uses javascript and doesn't load soon enough for rvest to collect information
There isn't any equivalent to checking if the document finishes loading before scraping the data.
The general recommendation appears to be using something entirely different such as Selenium to scrape web data.
Selenium, automating web browsers
If you are webscraping with Python chances are that you have already tried urllib, httplib, requests, etc.
These are excellent libraries, but some websites don’t like to be webscraped.
In these cases you may need to disguise your webscraping bot as a human being.
Selenium is just the tool for that.
Selenium is a webdriver: it takes control of your browser, which then does all the work.
Hence what the website “sees” is Chrome or Firefox or IE; it does not see Python or Selenium.
That makes it a lot harder for the website to tell your bot from a human being.
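Since these notes are in R, here is a minimal, hedged sketch of the same idea using the RSelenium package (assumes RSelenium and a compatible browser are installed; rsDriver() manages the Selenium server and driver, and the port/browser choices below are just assumptions):
library(RSelenium)
# start a Selenium server plus a Firefox client
rD = rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr = rD$client
remDr$navigate("https://www.example.com")   # the browser, not R, fetches the page
remDr$getTitle()
elem = remDr$findElement(using = "css selector", "h1")
elem$getElementText()
remDr$close()
rD$server$stop()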
Selenium tutorial
> install.packages("XML")
> library(XML)
text = paste0("<bookstore><book>","<title>Everyday Italian</title>","<author>Giada De Laurentiis</author>","<year>2005</year>","</book></bookstore>")
Parse the XML file
xmldoc = xmlParse(text)
rootNode = xmlRoot(xmldoc)
rootNode[1]
xmlToDataFrame(nodes = getNodeSet(xmldoc, "//title"))
xmlToDataFrame(nodes = getNodeSet(xmldoc, "//author"))
xmlToDataFrame(nodes = getNodeSet(xmldoc, "//book"))
newdf = xmlToDataFrame(getNodeSet(xmldoc, "//book"))
newdf = xmlToDataFrame(getNodeSet(xmldoc, "//title"))
Another example, a CD record:
text = paste0("<CD>","<TITLE>Empire Burlesque</TITLE>","<ARTIST>Bob Dylan</ARTIST>","<COUNTRY>USA</COUNTRY>","<COMPANY>Columbia</COMPANY>","<PRICE>10.90</PRICE>","<YEAR>1985</YEAR>","</CD>")
xmldoc = xmlParse(text)
rootNode = xmlRoot(xmldoc)
rootNode[1]
Extract XML data:
> data = xmlSApply(rootNode,function(x) xmlSApply(x, xmlValue))
Convert the extracted data into a data frame:
> cd.catalog = data.frame(t(data),row.names=NULL)
Verify the results
The xmlParse
function returns an object of the XMLInternalDocument
class, which is a C-level internal data structure.
The xmlRoot()
function gets access to the root node and its elements.
We check the first element of the root node:
> rootNode[1]
$CD
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
To extract data from the root node, we use the xmlSApply()
function iteratively over all the children of the root node.
The xmlSApply
function returns a matrix.
To convert the preceding matrix into a data frame, we transpose the matrix using the t()
function.
We then extract the first two rows from the cd.catalog
data frame:
> cd.catalog[1:2,]
TITLE ARTIST COUNTRY COMPANY PRICE YEAR
1 Empire Burlesque Bob Dylan USA Columbia 10.90 1985
2 Hide your heart Bonnie Tyler UK CBS Records 9.90 1988
XML data can be deeply nested and hence can become complex to extract.
Knowledge of XPath
will be helpful to access specific XML tags.
R provides several functions such as xpathSApply
and getNodeSet
to locate specific elements.
> url = "http://en.wikipedia.org/wiki/World_population"
webpage = read_html(url)
# readHTMLTable() comes from the XML package, so convert the xml2 document to a string first
output = htmlParse(as.character(webpage), asText = TRUE)
tables = readHTMLTable(output)
world.pop = tables[[5]]
table.list = readHTMLTable(output, header = F)
u = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
webpage = read_html(u)
tables = readHTMLTable(htmlParse(as.character(webpage), asText = TRUE))
names(tables)
The readHTMLTable()
function parses the web page and returns a list
of all tables that are found on the page.
For tables that have an id
attribute, the function uses the id
attribute as the name of that list element.
We are interested in extracting the "10 most populous countries," which is the fifth table; hence we use tables[[5]]
.
> table = readHTMLTable(url,which=5)
Specify which
to get data from a specific table.
R returns a data frame.
<html>
<head>
<title>My page</title>
</head>
<body>
<h2>Welcome to my <a href="#">page</a></h2>
<p>This is the first paragraph.</p>
<!-- this is the end -->
</body>
</html>
XPath handles any XML/HTML document as a tree.
This tree's root node is not part of the document itself.
It is in fact the parent of the document element node (<html>
in case of the HTML above).
The original article shows the XPath tree for this document as a figure. The tree contains several kinds of nodes: element nodes, such as <a href="http://www.example.com">example</a>; comment nodes (<!-- … -->); and text nodes, such as the text example in <p>example</p>.
Distinguishing between these different types is useful to understand how XPath expressions work.
Now let's start digging into XPath.
Here is how we can select the title element from the page above using an XPath expression:
/html/head/title
This is what we call a location path.
It allows us to specify the path from the root of the tree down to the node we want to select, one step at a time; each step is evaluated relative to the node selected by the previous step, so the head node is the context node when the last step (title) is being evaluated.
However, we usually don't know or don’t care about the full explicit node-by-node path, we just care about the nodes with a given name.
We can select them using:
//title
Which means: look in the whole tree, starting from the root (//), and select only those nodes whose name matches title.
In this example, //title is the abbreviated syntax; if we expand it to the full syntax we get:
/descendant-or-self::node()/child::title
So, // in the abbreviated syntax is short for descendant-or-self::node()/, which means the current node or any node below it in the tree. The part of each step before the :: is called the axis: it defines where in the tree the node test should be applied. The part after it, here node(), is the node test: node() matches any kind of node. The next step uses the child axis, which means go to the child nodes from the current context, followed by another node test, which selects the nodes named title. So, the axis defines where in the tree the node test should be applied, and the nodes that match the node test are returned as a result.
You can test nodes against their name or against their type. Here are some examples of name tests:
Expression | Meaning |
---|---|
/html | Selects the node named html , which is under the root. |
/html/head | Selects the node named head , which is under the html node. |
//title | Selects all the title nodes from the HTML tree. |
//h2/a | Selects all a nodes which are directly under an h2 node. |
Expression | Meaning |
---|---|
//comment() | Selects only comment nodes. |
//node() | Selects any kind of node in the tree. |
//text() | Selects only text nodes, such as "This is the first paragraph". |
//* | Selects all nodes, except comment and text nodes. |
For example, //p/text() selects the text nodes inside p elements.
In the HTML snippet shown above, it would select "This is the first paragraph.".
Now, consider this HTML document:
<html>
<body>
<ul>
<li>Quote 1</li>
<li>Quote 2 with <a href="...">link</a></li>
<li>Quote 3 with <a href="...">another link</a></li>
<li><h2>Quote 4 title</h2> ...</li>
</ul>
</body>
</html>
Say we want to select only the first li
node from the snippet above.
We can do this with:
//li[position() = 1]
The expression surrounded by square brackets is called a predicate and it filters the node set returned by //li
(that is, all li
nodes from the document) using the given condition.
In this case it checks each node's position using the position()
function, which returns the position of the current node in the resulting node set (notice that positions in XPath start at 1, not 0).
We can abbreviate the expression above to:
//li[1]
Both XPath expressions above would select the following element:
<li class="quote">Quote 1</li>
Check out a few more predicate examples:
Expression | Meaning |
---|---|
//li[position()%2=0] | Selects the li elements at even positions. |
//li[a] | Selects the li elements which enclose an a element. |
//li[a or h2] | Selects the li elements which enclose either an a or an h2 element. |
//li[ a [ text() = "link" ] ] | Selects the li elements which enclose an a element whose text is "link". Can also be written as //li[ a/text()="link" ]. |
//li[last()] | Selects the last li element in the document. |
XPath expressions are composed of location steps separated by /, and each step can have an axis, a node test and a predicate.
Here we have an expression composed by two steps, each one with axis, node test and predicate:
//li[ 4 ]/h2[ text() = "Quote 4 title" ]
And here is the same expression, written using the non-abbreviated syntax:
/descendant-or-self::node()/child::li[position() = 4]/child::h2[child::text() = "Quote 4 title"]
We can also combine multiple XPath expressions in a single one using the union operator |.
For example, we can select all a
and h2
elements in the document above using this expression:
//a | //h2
Now, consider this HTML document:
<html>
<body>
<ul>
<li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
<li><a href="https://scrapinghub.com">Scrapinghub</a></li>
<li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
<li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
</ul>
</body>
</html>
Say we want to select only the a
elements whose link points to an HTTPS URL.
We can do it by checking their href attribute:
//a[starts-with(@href, "https")]
This expression first selects all the a elements from the document and, for each of those elements, checks whether their href attribute starts with "https".
We can access any node attribute using the @attributename
syntax.
Here we have a few additional examples using attributes:
Expression | Meaning |
---|---|
//a[@href="https://scrapy.org"] | Selects the a elements pointing to https://scrapy.org. |
//a/@href | Selects the value of the href attribute from all the a elements in the document. |
//li[@id] | Selects only the li elements which have an id attribute. |
<html>
<body>
<p>Intro paragraph</p>
<h1>Title #1</h1>
<p>A random paragraph #1</p>
<h1>Title #2</h1>
<p>A random paragraph #2</p>
<p>Another one #2</p>
A single paragraph, with no markup
<div id="footer"><p>Footer text</p></div>
</body>
</html>
Now we want to extract only the first paragraph after each of the titles.
To do that, we can use the following-sibling
axis, which selects all the siblings after the context node.
Siblings are nodes who are children of the same parent, for example all children nodes of the body
tag are siblings.
This is the expression:
//h1/following-sibling::p[1]
In this example, the context node where the following-sibling
axis is applied to is each of the h1
nodes from the page.
What if we want to select only the text that is right before the footer
? We can use the preceding-sibling
axis:
//div[@id='footer']/preceding-sibling::text()[1]
In this case, we are selecting the first text node before the div
footer ("A single paragraph, with no markup").
XPath also allows us to select elements based on their text content.
We can use such a feature, along with the parent
axis, to select the parent of the p
element whose text is "Footer text":
//p[ text()="Footer text" ]/..
The expression above selects <div id="footer"><p>Footer text</p></div>
.
As you may have noticed, we used ..
here as a shortcut to the parent
axis.
As an alternative to the expression above, we could use:
//*[p/text()="Footer text"]
It selects, from all elements, the ones that have a p
child which text is "Footer text", getting the same result as the previous expression.
You can find additional axes in the XPath specification: https://www.w3.org/TR/xpath/#axes
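To connect this back to R: rvest's html_nodes() accepts an xpath= argument, so the expressions above can be used directly. A minimal sketch on an inline snippet (the small footer example from above, pasted as a string here for self-containment):
library(rvest)
doc = read_html('<html><body><p>Intro</p><div id="footer"><p>Footer text</p></div></body></html>')
html_nodes(doc, xpath = '//p[text()="Footer text"]/..')   # the div with id "footer"
html_text(html_nodes(doc, xpath = "//p"))                 # text of all p elements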
The assignment operator <- can be switched with =.
The function assign()
can also be used:
assign('x', c(1, 2, 3, 4))
Assignments can also be made in the other direction:
c(1, 2, 3, 4) -> x
y = c(x, 0, x)
would assign a vector 1, 2, 3, 4, 0, 1, 2, 3, 4
to variable y
.
Vectors can be freely multiplied and added by constants:
v = 2*x + y + 1
Note that this operation is valid even when x
and y
are different lengths.
In this case, R will simply recycle x (sometimes fractionally) until it meets the length of y.
Since y is 9 numbers long and x is 4 units long, x will be repeated 2.25 times to match the length of y.
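A quick sketch of the recycling described above (R issues a warning because 9 is not a multiple of 4):
x = c(1, 2, 3, 4)
y = c(x, 0, x)      # length 9
v = 2*x + y + 1     # x is recycled (fractionally) to length 9, with a warning
v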
The arithmetic operators +
, -
, *
, /
, and ^
can all be used.
log
, exp
, sin
, cos
, tan
, sqrt
, and more can also be used.
max(x)
and min(x)
represent the largest and smallest elements of a vector x
, and length(x)
is the number of elements in x
.
sum(x)
gives the total of the elements in x
, and prod(x)
their product.
mean(x)
calculates the sample mean, and var(x)
returns the sample variance.
sort(x)
returns a vector of the same size as x with elements arranged in increasing order.
1:30
is the same as c(1, 2, …, 29, 30)
.
The colon has the highest priority in an arithmetic expression, so 2*1:15
will return c(2, 4, …, 28, 30)
instead of c(2, 3, …, 14, 15)
.
30:1 may be used to generate the sequence backwards.
The seq()
function can also be used to generate sequences.
seq(2,10)
returns the same vector as 2:10
.
In seq()
, one can also specify the length of the step in which to take: seq(1,2,by=0.5)
returns c(1, 1.5, 2)
.
A similar function is rep()
, which replicates an object in various ways.
For example, rep(x, times=5)
will return five copies of x
end-to-end.
val = x > 13
sets val
as a vector of the same length as x
with values TRUE
where the condition is met and FALSE
where the condition is not.
The comparison operators in R are <
, <=
, >
, >=
, ==
, and !=
, which mean less than, less than or equal to, greater than, greater than or equal to, equality, and inequality.
is.na(x)
returns a logical vector of the same size as x
with TRUE
if the corresponding element to x
is NA
.
x == NA
is different from is.na(x)
since NA
is not a value but a marker for an unavailable quantity.
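A one-line illustration of the difference:
x = c(1, NA, 3)
x == NA      # NA NA NA -- any comparison with NA is itself NA
is.na(x)     # FALSE TRUE FALSE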
A second type of ‘missing value’ is that which is produced by numerical computation, such as 0/0
.
In this case, NaN
(Not a Number) values are treated as NA
values; that is, is.na(x)
will return TRUE
for both NA
and NaN
values.
is.nan(x)
can be used only for identifying NaN
values.
y = x[!is.na(x)]
sets y
to the values of x
that are not equal to NA
or NaN
.
(x+1)[(!is.na(x)) & x>0] -> z
sets z
to the values of x+1
that are not NA
or NaN
and larger than 0.
A second method is with a vector of positive integral quantities.
In this case, the values must be in the set {1, 2, …, length(x)}
.
The corresponding elements of the vector are selected and concatenated in that order to form a result.
It is important to remember that unlike in other languages, the first index in R is 1 and not 0.
x[1:10]
returns the first 10 elements of x
, assuming length(x)
is not less than 10.
c('x', 'y')[rep(c(1,2,2,1), times=4)]
produces a character vector of length 16, in which 'x', 'y', 'y', 'x'
is repeated four times.
A vector of negative integral numbers specifies the values to be excluded rather than included.
y = x[-(1:5)]
sets y
to all but the first five values of x
.
Lastly, a vector of character strings can be used when an object has a names attribute to identify its components.
With fruit = c(1, 2, 3, 4)
, one can set the names of each index of the vector fruit with names(fruit) = c('mango', 'apple', 'banana', 'orange')
.
Then, one can call the elements by name with lunch = fruit[c('apple', 'orange')]
.
The advantage of this is that alphanumeric names can sometimes be easier to remember than indices.
Note that an indexed expression can also appear on the receiving end of an assignment, in which the assignment is only performed on those elements of a vector.
For example, x[is.na(x)] = 0
replaces all NA
and NaN
values in vector x
with the value 0
.
Another example: y[y<0] = -y[y<0]
has the same effect as y = abs(y)
.
The code simply replaces all the values that are less than 0 with the negative of that value.
A vector can be treated as an array by giving it a dim attribute.
If z
were a vector of 1500 elements, the assignment dim(z) = c(100, 5, 3)
would mean z
is now treated as a 100 by 5 by 3 array.
For example, an array a with dim = c(3, 4, 6) has its first value at a[1, 1, 1] and its last value at a[3, 4, 6].
a[,,]
represents the entire array; hence, a[1,1,]
takes the first row of the first 2-dimensional cross-section in a
.
Arrays can also be created with the array() function, e.g. x = array(1:20, dim = c(4,5)).
Arrays are specified by a vector of values and a vector of dimensions; values are filled column-wise (down the first column first, then across columns left to right).
array(1:4, dim = c(2,2))
would return
1 3
2 4
and not
1 2
3 4
Negative indices are not allowed in index matrices.
NA
and zero values are allowed.
If a and b are two numeric arrays, their outer product is an array whose dimension vector is obtained by concatenating the two dimension vectors and whose data vector is formed from all possible products of elements of the data vector of a with those of b.
The outer product is calculated with the operator %o%
:
ab = a %o% b
Another way to achieve this is
ab = outer(a, b, "*")
In fact, any function can be applied on two arrays using the outer() function.
Suppose we define a function f = function(x, y) cos(y)/(1 + x^2)
.
The function could be applied to two vectors x
and y
via z = outer(x, y, f)
.
aperm(a, perm)
can be used to permute an array a.
The argument perm must be the permutation of the integers {1,…, k} where k is the number of subscripts in a.
The result of the function is an array of the same size as a but with the old dimension given by perm[j]
becoming the new j-th
dimension.
An easy way to think about it is a generalization of transposition for matrices.
If A
is a matrix, then B
is simply the transpose of A
:
B = aperm(A, c(2, 1))
In these special cases the function t()
performs a transposition.
If A and B are square matrices of the same size, A*B
is the element-wise product of the two matrices.
A %*% B
is the dot product (matrix product).
If x is a vector, then x %*% A %*% x
is a quadratic form.
crossprod()
performs cross-products; thus, crossprod(X, y)
is the same as the operation t(X) %*% y
, but more efficient.
diag(v)
, where v
is a vector, gives a diagonal matrix with elements of the vector as the diagonal entries.
diag(M)
, where M
is a matrix, gives the vector of the main diagonal entries of M
(the same convention as in Matlab).
diag(k)
, where k
is a single numeric value, returns a k
by k
identity matrix.
Given a matrix A and a vector b, the vector x satisfying A %*% x == b is the solution of the linear equation system.
This can be solved quickly in R with
solve(A, b)
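A minimal sketch with a made-up 2 x 2 system:
A = matrix(c(2, 1, 1, 3), nrow = 2)
b = c(1, 2)
x = solve(A, b)   # solves A %*% x == b
A %*% x           # recovers b
solve(A)          # with a single argument, solve() returns the inverse of A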
eigen(Sm)
calculates the eigenvalues and eigenvectors of a symmetric matrix Sm.
The result is a list, with the first element named values and the second named vectors.
ev = eigen(Sm)
assigns this list to ev
.
ev$val
is the vector of eigenvalues of Sm
and ev$vec
the matrix of corresponding eigenvectors.
For large matrices, it is better to avoid computing the eigenvectors if they are not needed by using the expression
evals = eigen(Sm, only.values = TRUE)$values
svd(m)
takes an arbitrary matrix argument, m
, and calculates the singular value decomposition of m
.
This consists of a matrix of orthonormal columns U
with the same column space as m
, a second matrix of orthonormal columns V
whose column space is the row space of m
and a diagonal matrix of positive entries D
such that
m = U %*% D %*% t(V)
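A quick check of the decomposition on a small random matrix:
m = matrix(rnorm(6), nrow = 3)
s = svd(m)
all.equal(m, s$u %*% diag(s$d) %*% t(s$v))   # TRUE, up to floating-point error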
det(m)
can be used to calculate the determinant of a square matrix m
.
lsfit()
returns a list giving results of a least squares fitting procedure.
An assignment like
ans = lsfit(X, y)
gives results of a least squares fit where y is the vector of observations and X is the design matrix.
ls.diag()
can be used for regression diagnostics.
A closely related function is qr(). For a design matrix X, first compute its QR decomposition with Xplus = qr(X); then:
b = qr.coef(Xplus,y)
fit = qr.fitted(Xplus,y)
res = qr.resid(Xplus,y)
These compute the orthogonal projection of y
onto the range of X
in fit
, the projection onto the orthogonal complement in res
and the coefficient vector for the projection in b
.
Matrices can be built up from other vectors and matrices with the functions cbind() and rbind().
cbind()
forms matrices by binding matrices horizontally (column-wise), and rbind()
binds matrices vertically (row-wise).
In the assignment X = cbind(arg_1, arg_2, arg_3, …)
the arguments to cbind()
must be either vectors of any length, or matrices with the same column size (the same number of rows).
rbind()
performs a corresponding operation for rows.
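A small illustration:
m1 = matrix(1:4, nrow = 2)
m2 = matrix(5:8, nrow = 2)
cbind(m1, m2)   # 2 x 4: columns of m2 appended to m1
rbind(m1, m2)   # 4 x 2: rows of m2 appended below m1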
Scatter plots of the iris data can be made with the ggvis package, for example. First load the ggvis package:
# Load in `ggvis`
library(ggvis)
# Iris scatter plot
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>% layer_points()
head(iris)
or str(iris)
.
Note that the last command will help you to clearly distinguish the data type num
and the three levels of the Species
attribute, which is a factor.
This is very convenient, since many R machine learning classifiers require that the target feature is coded as a factor.
Remember that factor variables represent categorical variables in R.
They can thus take on a limited number of different values.
A quick look at the Species
attribute through table(iris$Species) tells you that the division of the species of flowers is 50-50-50.
On the other hand, if you want to check the percentual division of the Species
attribute, you can ask for a table of proportions:
# Division of `Species`
table(iris$Species)
# Percentual division of `Species`
round(prop.table(table(iris$Species)) * 100, digits = 1)
The round function rounds the values of its first argument, prop.table(table(iris$Species))*100,
to the specified number of digits, which is one digit after the decimal point.
You can easily adjust this by changing the value of the digits
argument.
You can also get a quick overview of the data with the summary() function.
This will give you the minimum value, first quantile, median, mean, third quantile and maximum value of the data set Iris for numeric data types.
For the class variable, the count of factors will be returned:
# Summary overview of `iris`
summary(iris)
# Refined summary overview
summary(iris[c("Petal.Width", "Sepal.Width")])
As you can see, the c()
function is added to the original command: the columns petal width
and sepal width
are concatenated and a summary is then asked of just these two columns of the Iris data set.
The fifth attribute, Species, will be the target variable, i.e. the variable that you want to predict in this example.
The knn() classifier used below comes from the class package, so load it:
library(class)
If you don't have this package yet, you can quickly and easily install it by typing the following line of code:
install.packages("<package name>")
any(grepl("<name of your package>", installed.packages()))
Before you start modelling, take another look at the output of the summary() function.
Look at the minimum and maximum values of all the (numerical) attributes.
If you see that one attribute has a wide range of values, you will need to normalize your dataset, because this means that the distance will be dominated by this feature.
For example, if your dataset has just two attributes, X and Y, and X has values that range from 1 to 1000, while Y has values that only go from 1 to 100, then Y’s influence on the distance function will usually be overpowered by X’s influence.
When you normalize, you actually adjust the range of all features, so that distances between variables with larger ranges will not be over-emphasised.
Tip: go back to the result of summary(iris)
and try to figure out if normalization is necessary.
The Iris data set doesn’t need to be normalized: the Sepal.Length
attribute has values that go from 4.3 to 7.9 and Sepal.Width
contains values from 2 to 4.4, while Petal.Length
’s values range from 1 to 6.9 and Petal.Width
goes from 0.1 to 2.5.
All values of all attributes are contained within the range of 0.1 and 7.9, which you can consider acceptable.
Nevertheless, it’s still a good idea to study normalization and its effect, especially if you’re new to machine learning.
You can perform feature normalization, for example, by first making your own normalize()
function.
You can then use this argument in another command, where you put the results of the normalization in a data frame through as.data.frame()
after the function lapply()
returns a list of the same length as the data set that you give in.
Each element of that list is the result of the application of the normalize
function to the corresponding column of the data set that served as input:
YourNormalizedDataSet = as.data.frame(lapply(YourDataSet, normalize))
Test this in the DataCamp Light chunk below!
# Build your own `normalize()` function
normalize = function(x) {
num = x - min(x)
denom = max(x) - min(x)
return (num/denom)
}
# Normalize the `iris` data
iris_norm = as.data.frame(lapply(iris[1:4], normalize))
# Summarize `iris_norm`
summary(iris_norm)
For the Iris dataset, you would have applied the normalize
function to the four numerical attributes of the Iris data set (Sepal.Length
, Sepal.Width
, Petal.Length
, Petal.Width
) and put the results in a data frame.
set.seed(1234)
Then, you want to make sure that your Iris data set is shuffled and that you have an equal amount of each species in your training and test sets.
You use the sample()
function to take a sample with a size that is set as the number of rows of the Iris data set, or 150.
You sample with replacement: you choose from a vector of 2 elements and assign either 1 or 2 to the 150 rows of the Iris data set.
The assignment of the elements is subject to probability weights of 0.67 and 0.33.
ind = sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))
Note that the replace argument is set to TRUE
: this means that you assign a 1 or a 2 to a certain row and then reset the vector of 2 to its original state.
This means that, for the next rows in your data set, you can either assign a 1 or a 2, each time again.
Because you sample with replacement, the probability of drawing a 1 or a 2 stays fixed at the specified weights for every row, rather than shifting as items are used up; that is why you specify probability weights.
Note also that, even though you don’t see it in the DataCamp Light chunk, the seed has still been set to 1234
.
You then use the ind vector to define your training and test sets:
# Compose training set
iris.training = iris[ind==1, 1:4]
# Inspect training set
head(iris.training)
# Compose test set
iris.test = iris[ind==2, 1:4]
# Inspect test set
head(iris.test)
Note that iris.training and iris.test contain only the first four attributes: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.
This is because you actually want to predict the fifth attribute, Species
: it is your target variable.
However, you do want to include it into the KNN algorithm, otherwise there will never be any prediction for it.
You therefore need to store the class labels in factor vectors and divide them over the training and test sets:
# Compose `iris` training labels
iris.trainLabels = iris[ind==1,5]
# Inspect result
print(iris.trainLabels)
# Compose `iris` test labels
iris.testLabels = iris[ind==2, 5]
# Inspect result
print(iris.testLabels)
To build your classifier you use the knn() function, which uses the Euclidean distance measure in order to find the k nearest neighbours to your new, unknown instance.
Here, the k parameter is one that you set yourself.
As mentioned before, new instances are classified by looking at the majority vote or weighted vote.
In case of classification, the data point with the highest score wins the battle and the unknown instance receives the label of that winning data point.
If there is an equal amount of winners, the classification happens randomly.
You store into iris_pred the result of the knn() function, which takes as arguments the training set, the test set, the train labels and the number of neighbours you want to find with this algorithm.
The result of this function is a factor vector with the predicted classes for each row of the test data.
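The call itself is not shown in these notes; a minimal sketch, assuming the objects created above (the value k = 3 is an assumption here, not from the original):
iris_pred = knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k = 3)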
When you print iris_pred, you'll get back the factor vector with the predicted classes for each row of the test data.
The next step is to compare iris_pred to the test labels that you had defined earlier:
# Put `iris.testLabels` in a data frame
irisTestLabels = data.frame(iris.testLabels)
# Merge `iris_pred` and `iris.testLabels`
merge = data.frame(iris_pred, iris.testLabels)
# Specify column names for `merge`
names(merge) = c("Predicted Species", "Observed Species")
# Inspect `merge`
merge
You see that the model makes reasonably accurate predictions, with the exception of one wrong classification in row 29, where “Versicolor” was predicted while the test label is “Virginica”.
This is already some indication of your model’s performance, but you might want to go even deeper into your analysis.
For this purpose, you can import the package gmodels
:
install.packages("package name")
However, if you have already installed this package, you can simply enter
library(gmodels)
Then you can make a cross tabulation or a contingency table.
This type of table is often used to understand the relationship between two variables.
In this case, you want to understand how the classes of your test data, stored in iris.testLabels
relate to your model that is stored in iris_pred
:
CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)
prop.chisq
indicates whether or not the chi-square contribution of each cell is included.
The chi-square statistic is the sum of the contributions from each of the individual cells and is used to decide whether the difference between the observed and the expected values is significant.
From this table, you can derive the number of correct and incorrect predictions: one instance from the testing set was labeled Versicolor
by the model, while it was actually a flower of species Virginica
.
You can see this in the first row of the “Virginica” species in the iris.testLabels
column.
In all other cases, correct predictions were made.
You can conclude that the model’s performance is good enough and that you don’t need to improve the model!
This is where the caret
package can come in handy: it’s short for “Classification and Regression Training” and offers everything you need to know to solve supervised machine learning problems: it provides a uniform interface to a ton of machine learning algorithms.
If you’re a bit familiar with Python machine learning, you might see similarities with scikit-learn
!
In the following, you’ll go through the steps as they have been outlined above, but this time, you’ll make use of caret
to classify your data.
Note that you have already done a lot of work if you’ve followed the steps as they were outlined above: you already have a hold on your data, you have explored it, prepared your workspace, etc.
Now it’s time to preprocess your data with caret
!
As you have done before, you can study the effect of the normalization, but you’ll see this later on in the tutorial.
You already know what’s next! Let’s split up the data in a training and test set.
In this case, though, you handle things a little bit differently: you split up the data based on the labels that you find in iris$Species
.
Also, the ratio is in this case set at 75-25 for the training and test sets.
# Load caret, which provides createDataPartition()
library(caret)
# Create index to split based on labels
index = createDataPartition(iris$Species, p=0.75, list=FALSE)
# Subset training set with index
iris.training = iris[index,]
# Subset test set with index
iris.test = iris[-index,]
You’re all set to go and train models now! But, as you might remember, caret
is an extremely large project that includes a lot of algorithms.
If you’re in doubt on what algorithms are included in the project, you can get a list of all of them.
Pull up the list by running names(getModelInfo())
, just like the code chunk below demonstrates.
Next, pick an algorithm and train a model with the train()
function:
# Overview of algos supported by caret
names(getModelInfo())
# Train a model
model_knn = train(iris.training[, 1:4], iris.training[, 5], method='knn')
Note that making other models is extremely simple when you have gotten this far; You just have to change the method
argument, just like in this example:
model_cart = train(iris.training[, 1:4], iris.training[, 5], method='rpart2')
Now that you have trained your model, it’s time to predict the labels of the test set that you have just made and evaluate how the model has done on your data:
# Predict the labels of the test set
predictions=predict(object=model_knn,iris.test[,1:4])
# Evaluate the predictions
table(predictions)
# Confusion matrix
confusionMatrix(predictions,iris.test[,5])
Additionally, you can try to perform the same test as before, to examine the effect of preprocessing, such as scaling and centering, on your model.
Run the following code chunk:
# Train the model with preprocessing
model_knn = train(iris.training[, 1:4], iris.training[, 5], method='knn', preProcess=c("center", "scale"))
# Predict values
predictions=predict.train(object=model_knn,iris.test[,1:4], type="raw")
# Confusion matrix
confusionMatrix(predictions,iris.test[,5])
caret
offers, to spark your machine learning.
But you can do so much more!
If you have experimented enough with the basics presented in this tutorial and other machine learning algorithms, you might want to find it interesting to go further into R and data analysis.
library(caTools)
"caTools" package provides a function "sample.split()" which helps in splitting the data.
sample.split(diamonds$price,SplitRatio = 0.65)->split_index
65% of the observations from the price column have been assigned the TRUE label and the remaining 35% the FALSE label.
subset(diamonds,split_index==T)->train
subset(diamonds,split_index==F)->test
All the observations which have "true" label have been stored in the "train" object and those observations having "false" label have been assigned to the "test" set.
Now that the splitting is done and we have our "train" and "test" sets, it's time to build the linear regression model on the training set.
We'll be using the "lm()" function to build the linear regression model on the "train" data.
We are determining the price of the diamonds with respect to all other variables of the data-set.
The built model is stored in the object "mod_regress".
lm(price~.,data = train)->mod_regress
Now, that we have built the model, we need to make predictions on the "test" set.
"predict()" function is used to get predictions.
It takes two arguments: the built model and the test set.
The predicted results are stored in the "result_regress" object.
predict(mod_regress,test)->result_regress
Let's bind the actual price values from the "test" data-set and the predicted values into a single data-set using the "cbind()" function.
The new data-frame is stored in "Final_Data"
cbind(Actual=test$price,Predicted=result_regress)->Final_Data
as.data.frame(Final_Data)->Final_Data
A glance at the "Final_Data" which comprises of actual values and predicted values:
(Final_Data$Actual- Final_Data$Predicted)->error
cbind(Final_Data,error)->Final_Data
A glance at the "Final_Data" which also comprises of the error in prediction:
rmse1=sqrt(mean(Final_Data$error^2))
rmse1
Let's build a second model, "mod_regress2", dropping the "y" and "z" columns from the predictors:
lm(price~.-y-z,data = train)->mod_regress2
The predicted results are stored in "result_regress2":
predict(mod_regress2,test)->result_regress2
Actual and Predicted values are combined and stored in "Final_Data2":
cbind(Actual=test$price,Predicted=result_regress2)->Final_Data2
as.data.frame(Final_Data2)->Final_Data2
Let's also add the error in prediction to "Final_Data2"
(Final_Data2$Actual- Final_Data2$Predicted)->error2
cbind(Final_Data2,error2)->Final_Data2
A glance at "Final_Data2":
rmse2=sqrt(mean(Final_Data2$error2^2))
library(caTools)
65% of the observations from the "Purchased" column will be assigned TRUE labels and the rest will be assigned FALSE labels.
sample.split(car_purchase$Purchased,SplitRatio = 0.65)->split_values
All those observations which have the TRUE label will be stored in the "train_data" set and those observations having the FALSE label will be assigned to the "test_data" set.
subset(car_purchase,split_values==T)->train_data
subset(car_purchase,split_values==F)->test_data
Time to build the Recursive Partitioning algorithm:
We'll start off by loading the "rpart" package:
library(rpart)
"Purchased" column will be the dependent variable and all other columns are the independent variables i.e. we are determining whether the person has bought the car or not with respect to all other columns.
The model is built on the "train_data" and the result is stored in "mod1".
rpart(Purchased~.,data = train_data)->mod1
Let's plot the result:
plot(mod1,margin = 0.1)
text(mod1,pretty = T,cex=0.8)
predict(mod1,test_data,type = "class")->result1
Let's evaluate the accuracy of the model using "confusionMatrix()" function from caret package.
library(caret)
confusionMatrix(table(test_data$Purchased,result1))
For k-means clustering we use only the four numeric columns of iris:
iris[1:4]->iris_k
Let us take the number of clusters to be 3.
The "kmeans()" function takes the input data and the number of clusters in which the data is to be clustered.
The syntax is: kmeans(data, k) where k is the number of cluster centers.
kmeans(iris_k,3)->k1
Analyzing the clustering:
str(k1)
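One quick way to see how the clusters line up with the actual species (cluster numbers are arbitrary labels):
table(k1$cluster, iris$Species)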
# remove na in r - test for missing values (is.na example)
test = c(1,2,3,NA)
is.na(test)
In the example above, is.na() will return a vector indicating which elements have a na value.
# remove na in r - remove rows - na.omit function / option
completerecords = na.omit(datacollected)
Passing your data frame through the na.omit() function is a simple way to purge incomplete records from your analysis.
It is an efficient way to remove na values in r.
# na in R - complete.cases example
fullrecords = collecteddata[complete.cases(collecteddata),]   # rows with no missing values
droprecords = collecteddata[!complete.cases(collecteddata),]  # rows containing at least one NA
Use explicit loops (for, while, and repeat) when the order of operations matters or one iteration depends on the previous one. lapply applies a function to each element of a list (or vector), collecting results in a list. sapply does the same, but will try to simplify the output if possible.
Lists are a very powerful and flexible data structure that few people seem to know about. Moreover, they are the building block for other data structures, like data.frame
and matrix
. To access elements of a list, you use the double square bracket, for example X[[4]]
returns the fourth element of the list X
. If you don’t know what a list is, we suggest you read more about them, before you proceed.
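A small illustration of list indexing:
X = list(a = 1:3, b = "hello", c = TRUE)
X[[2]]   # the element itself: "hello"
X[2]     # a one-element list containing b
X$a      # access by name: 1 2 3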
result = lapply(a list or vector, a function, ...)
This code will also return a list, stored in result
, with the same number of elements as the input X
.
first.step = lapply(X, first.function) second.step = lapply(first.step, next.function)
The challenge is to identify the parts of your analysis that stay the same and those that differ for each call of the function. The trick to using lapply
is to recognise that only one item can differ between different function calls.
It is possible to pass in a bunch of additional arguments to your function, but these must be the same for each call of your function. For example, let’s say we have a function test
which takes the path of a file, loads the data, and tests it against some hypothesised value H0. We can run the function on the file
“myfile.csv” as follows.
result = test("myfile.csv", H0=1)
We could then run the test on a bunch of files using lapply:
files = c("myfile1.csv", "myfile2.csv", "myfile3.csv") result = lapply(files, test, H0=1)
But notice that, in this example, the file names follow a simple pattern, so the files vector itself can also be built with lapply:
files = lapply(1:10, function(x){paste0("myfile", x, ".csv")})
result = lapply(files, test, H0=1)
The nice things about that piece of code is that it would extend as long as we wanted, to 10000000 files, if needed.
cities = c("Melbourne", "Sydney", "Brisbane", "Cairns")
The data are stored in a url scheme where the Sydney data is at
http://nicercode.github.io/guides/repeating-things/data/Sydney.csv and so on.
The URLs that we need are therefore:
urls =
sprintf("http://nicercode.github.io/guides/repeating-things/data/%s.csv",
cities) urls
[1] "http://nicercode.github.io/guides/repeating-things/data/Melbourne.csv"
[2] "http://nicercode.github.io/guides/repeating-things/data/Sydney.csv"
[3] "http://nicercode.github.io/guides/repeating-things/data/Brisbane.csv"
[4] "http://nicercode.github.io/guides/repeating-things/data/Cairns.csv"
We can write a function to download a file if it does not exist:
download.maybe = function(url, refetch=FALSE, path=".") {
dest = file.path(path, basename(url))
if (refetch || !file.exists(dest))
download.file(url, dest)
dest
}
and then run that over the urls:
path = "data" dir.create(path, showWarnings=FALSE) files = sapply(urls, download.maybe, path=path) names(files) = cities
Notice that we never specify which file is downloaded in which order; we just say "apply this function (download.maybe) to this list of urls". We also pass the path
argument to every function call. So it was as if we’d written
download.maybe(urls[[1]], path=path) download.maybe(urls[[2]], path=path) download.maybe(urls[[3]], path=path) download.maybe(urls[[4]], path=path)
but much less boring, and scalable to more files.
The first column, time
of each file is a string representing date and time, which needs processing into R’s native time format (dealing with times in R (or frankly, in any language) is a complete pain). In a real case, there might be many steps involved in processing each file. We can make a function like this:
load.file = function(filename) {
d = read.csv(filename, stringsAsFactors=FALSE)
d$time = as.POSIXlt(d$time)
d
}
that reads in a file given a filename, and then apply that function to each filename using lapply
:
data = lapply(files, load.file)
names(data) = cities
We now have a list of data.frames
of weather data:
head(data$Sydney)
time temp temp.min temp.max
1 2013-06-13 23:00:00 12.66 8.89 16.11
2 2013-06-14 00:00:00 15.90 12.22 20.00
3 2013-06-14 02:00:00 18.44 16.11 20.00
4 2013-06-14 03:00:00 18.68 16.67 20.56
5 2013-06-14 04:00:00 19.41 17.78 22.22
6 2013-06-14 05:00:00 19.10 17.78 22.22
We can use lapply
or sapply
to easily ask the same question of each element of this list. For example, how many rows of data are there?
sapply(data, nrow)
Melbourne Sydney Brisbane Cairns
97 99 99 80
What is the hottest temperature recorded by city?
sapply(data, function(x) max(x$temp))
Melbourne Sydney Brisbane Cairns
12.85 19.41 22.00 31.67
or, estimate the autocorrelation function for each set:
autocor = lapply(data, function(x) acf(x$temp, lag.max=24))
plot(autocor$Sydney, main="Sydney")
plot(autocor$Cairns, main="Cairns")
xlim = range(sapply(data, function(x) range(x$time)))
ylim = range(sapply(data, function(x) range(x[-1])))
plot(data[[1]]$time, data[[1]]$temp, ylim=ylim, type="n",
     xlab="Time", ylab="Temperature")
cols = 1:4
for (i in seq_along(data))
  lines(data[[i]]$time, data[[i]]$temp, col=cols[i])
plot(data[[1]]$time, data[[1]]$temp, ylim=ylim, type="n",
     xlab="Time", ylab="Temperature")
mapply(function(x, col) lines(x$time, x$temp, col=col),
       data, cols)
$Melbourne
NULL
$Sydney
NULL
$Brisbane
NULL
$Cairns
NULL
result = lapply(x, f) #apply f to x using a single core and lapply
library(parallel) result = mclapply(x, f) #same thing using all the cores in your machine (mclapply now lives in the parallel package, which replaced the old multicore package)
library(downloader)
if (!file.exists("seinfeld.csv"))
  download("https://raw.github.com/audy/smalldata/master/seinfeld.csv",
           "seinfeld.csv")
dat = read.csv("seinfeld.csv", stringsAsFactors=FALSE)
Columns are Season (number), Episode (number), Title (of the episode), Rating (according to IMDb) and Votes (to construct the rating).
head(dat)
Season Episode Title Rating Votes
1 1 2 The Stakeout 7.8 649
2 1 3 The Robbery 7.7 565
3 1 4 Male Unbonding 7.6 561
4 1 5 The Stock Tip 7.8 541
5 2 1 The Ex-Girlfriend 7.7 529
6 2 1 The Statue 8.1 509
Make sure it’s sorted sensibly
dat = dat[order(dat$Season, dat$Episode),]
Biologically, this could be Site / Individual / ID / Mean size /
Things measured.
Hypothesis: Seinfeld used to be funny, but got progressively less good as it became too mainstream. Or, does the mean episode rating per season decrease?
Now, we want to calculate the average rating per season:
mean(dat$Rating[dat$Season == 1])
[1] 7.725
mean(dat$Rating[dat$Season == 2])
[1] 8.158
and so on until:
mean(dat$Rating[dat$Season == 9])
[1] 8.323
As with most things, we could automate this with a for loop:
seasons = sort(unique(dat$Season))
rating = numeric(length(seasons))
for (i in seq_along(seasons))
  rating[i] = mean(dat$Rating[dat$Season == seasons[i]])
That's actually not that horrible to do, but it could be nicer. We first split the ratings into groups, one per season:
ratings.split = split(dat$Rating, dat$Season)
head(ratings.split)
$`1`
[1] 7.8 7.7 7.6 7.8
$`2`
[1] 7.7 8.1 8.0 7.9 7.8 8.5 8.7 8.5 8.0 8.0 8.4 8.3
$`3`
[1] 8.3 7.5 7.8 8.1 8.3 7.3 8.7 8.5 8.5 8.6 8.1 8.4 8.5 8.7 8.6 7.8 8.3
[18] 8.6 8.7 8.6 8.0 8.5 8.6
$`4`
[1] 8.4 8.3 8.6 8.5 8.7 8.6 8.1 8.2 8.7 8.4 8.3 8.7 8.5 8.6 8.3 8.2 8.4
[18] 8.5 8.4 8.7 8.7 8.4 8.5
$`5`
[1] 8.6 8.4 8.4 8.4 8.3 8.2 8.1 8.5 8.5 8.3 8.0 8.1 8.6 8.3 8.4 8.5 7.9
[18] 8.0 8.5 8.7 8.5
$`6`
[1] 8.1 8.4 8.3 8.4 8.2 8.3 8.5 8.4 8.3 8.2 8.1 8.4 8.6 8.2 7.5 8.4 8.2
[18] 8.5 8.3 8.4 8.1 8.5 8.2
Then use sapply to loop over this list, computing the mean
rating = sapply(ratings.split, mean)
Then if we wanted to apply a different function (say, compute the per-season standard error) we could just do:
se = function(x)
sqrt(var(x) / length(x))
rating.se = sapply(ratings.split, se)
plot(rating ~ seasons, ylim=c(7, 9), pch=19)
arrows(seasons, rating - rating.se, seasons, rating + rating.se,
       code=3, angle=90, length=0.02)
summarise.by.group = function(response, group, func) {
response.split = split(response, group)
sapply(response.split, func)
}
We can compute the mean rating by season again:
rating.new = summarise.by.group(dat$Rating, dat$Season, mean)
which is the same as what we got before:
identical(rating.new, rating)
[1] TRUE
Of course, we're not the first people to try this. This is exactly what the tapply function does (but with a few bells and whistles, especially around missing values, factor levels, additional arguments and multiple grouping factors at once).
tapply(dat$Rating, dat$Season, mean)
1 2 3 4 5 6 7 8 9
7.725 8.158 8.304 8.465 8.343 8.283 8.441 8.423 8.323
So using tapply
, you can do all the above manipulation in a single line.
There are a couple of limitations of tapply
.
The first is that getting the season out of tapply
is quite hard. We could do:
as.numeric(names(rating))
[1] 1 2 3 4 5 6 7 8 9
But that’s quite ugly, not least because it involves the conversion numeric -> string -> numeric.
Better could be to use
sort(unique(dat$Season))
[1] 1 2 3 4 5 6 7 8 9
But that requires knowing what is going on inside of tapply
(that unique levels are sorted and data are returned in that order).
I suspect that this approach:
first = function(x) x[[1]]
tapply(dat$Season, dat$Season, first)
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
is probably the most fool-proof, but it’s certainly not pretty.
However, the returned format is extremely flexible.
The aggregate function provides a simplified interface to tapply that avoids this issue. It has two interfaces: the first is similar to what we used before, but the grouping variable now must be a list or data frame:
aggregate(dat$Rating, dat["Season"], mean)
Season x
1 1 7.725
2 2 8.158
3 3 8.304
4 4 8.465
5 5 8.343
6 6 8.283
7 7 8.441
8 8 8.423
9 9 8.323
(note that dat["Season"]
returns a one-column data frame). The column ‘x’ is our response variable, Rating, grouped by season. We can get its name included in the column names here by specifying the first argument as a data.frame
too:
aggregate(dat["Rating"], dat["Season"], mean)
Season Rating
1 1 7.725
2 2 8.158
3 3 8.304
4 4 8.465
5 5 8.343
6 6 8.283
7 7 8.441
8 8 8.423
9 9 8.323
The other interface is the formula interface, that will be familiar from fitting linear models:
aggregate(Rating ~ Season, dat, mean)
Season Rating
1 1 7.725
2 2 8.158
3 3 8.304
4 4 8.465
5 5 8.343
6 6 8.283
7 7 8.441
8 8 8.423
9 9 8.323
This interface is really nice; we can get the number of votes here too.
aggregate(cbind(Rating, Votes) ~ Season, dat, mean)
Season Rating Votes
1 1 7.725 579.0
2 2 8.158 533.0
3 3 8.304 496.7
4 4 8.465 497.0
5 5 8.343 452.5
6 6 8.283 385.7
7 7 8.441 408.0
8 8 8.423 391.4
9 9 8.323 415.0
If you have multiple grouping variables, you can write things like:
aggregate(response ~ factor1 + factor2, dat, function)
to apply a function to each pair of levels of factor1
and factor2
.
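As a hypothetical illustration (not part of the Seinfeld example), using the built-in mtcars data:
# mean mpg for every combination of cylinder count and gear count
aggregate(mpg ~ cyl + gear, mtcars, mean)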
trial = function(n)
  sum(runif(n) < 0.5) # could have done a binomial draw...
You can run the trial a bunch of times:
trial(10)
[1] 4
trial(10)
[1] 4
trial(10)
[1] 6
and get a feel for the results. If you want to replicate the trial
100 times and look at the distribution of results, you could do:
replicate(100, trial(10))
[1] 4 4 5 6 8 5 5 7 3 5 6 4 4 3 5 3 6 7 2 6 6 4 5 4 4 4 4 5 6 5 4 2 6 5 6
[36] 5 6 8 5 6 4 5 4 5 5 5 4 7 3 5 5 6 4 6 4 6 4 4 4 6 3 5 5 7 6 7 5 3 4 4
[71] 5 6 8 5 6 2 5 7 6 3 5 9 3 7 6 4 5 3 7 3 3 7 6 8 5 4 6 7 4 3
and then you could plot these:
plot(table(replicate(10000, trial(50))))
for loops shine where the output of one iteration depends on the result of the previous iteration.
Suppose you wanted to model a random walk. Every time step, with 50% probability, move left or right.
Start at position 0
x = 0
Move left or right with probability p (0.5 = unbiased)
p = 0.5
Update the position
x = x + if (runif(1) < p) -1 else 1
Let’s abstract the update into a function:
step = function(x, p=0.5)
x + if (runif(1) < p) -1 else 1
Repeat a bunch of times:
x = step(x)
x = step(x)
To find out where we got to after 20 steps:
for (i in 1:20)
x = step(x)
If we want to collect where we’re up to at the same time:
nsteps = 200
x = numeric(nsteps + 1)
x[1] = 0 # start at 0
for (i in seq_len(nsteps))
  x[i+1] = step(x[i])
plot(x, type="l")
random.walk = function(nsteps, x0=0, p=0.5) {
x = numeric(nsteps + 1)
x[1] = x0
for (i in seq_len(nsteps))
x[i+1] = step(x[i])
x
}
We can then do 30 random walks:
walks = replicate(30, random.walk(100))
matplot(walks, type="l", lty=1, col=rainbow(nrow(walks)))
random.walk = function(nsteps, x0=0, p=0.5)
cumsum(c(x0, ifelse(runif(nsteps) < p, -1, 1)))
walks = replicate(30, random.walk(100))
matplot(walks, type="l", lty=1, col=rainbow(nrow(walks)))
Running the sequential version with 1e4 iterations, a quick look at the performance metrics in the task manager (Windows 7 OS) gives you an idea of how hard your computer is working to process the code.
My machine has eight processors and you can see that only a fraction of them are working while the sequential for loop is running.
Running the code using foreach will make full use of the computer's processors.
Individual chunks of the loop are sent to each processor so that the entire process can be run in parallel rather than in sequence.
Here's how to run the code with 1e4
iterations in parallel.
That is, each processor gets a finite set of the total number of iterations, i.e., iterations 1–100 go to processor one, iterations 101–200 go to processor two, etc. The output from each processor is then compiled after the iterations are completed.
#import packages
library(foreach)
library(doParallel)
iters=1e4 #number of iterations
#setup parallel backend to use 8 processors
cl=makeCluster(8)
registerDoParallel(cl)
#start time
strt=Sys.time()
#loop
ls=foreach(icount(iters)) %dopar% {
to.ls=rnorm(1e6)
to.ls=summary(to.ls)
to.ls
}
print(Sys.time()-strt)
stopCluster(cl)
#Time difference of 10.00242 mins
Running the loop in parallel decreased the processing time about four-fold.
Although the loop generally looks the same as the sequential version, several parts of the code have changed.
First, we are using the foreach
function rather than for
to define our loop.
The syntax for specifying the iterator is slightly different with foreach
as well, i.e., icount(iters)
tells the function to repeat the loop a given number of times based on the value assigned to iters
.
Additionally, the convention %dopar%
specifies that the code is to be processed in parallel if a backend has been registered (using %do%
will run the loop sequentially).
The functions makeCluster
and registerDoParallel
from the doParallel package are used to create the parallel backend.
Another important issue is the method for recombining the data after the chunks are processed. By default, foreach
will append the output to a list which we've saved to an object.
The default method for recombining output can be changed using the .combine
argument.
Also be aware that packages used in the evaluated expression must be included with the .packages
argument.
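A minimal sketch (not from the original post) combining .combine and .packages, using MASS::rlm on the built-in stackloss data:
library(foreach)
library(doParallel)
cl = makeCluster(2)
registerDoParallel(cl)
# rbind the coefficient vectors into a matrix; load MASS on each worker
coefs = foreach(i = 1:4, .combine = rbind, .packages = "MASS") %dopar% {
  fit = rlm(stack.loss ~ ., data = stackloss) # rlm() comes from MASS
  coef(fit)
}
stopCluster(cl)
coefs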
The processors should be working at full capacity if the loop is executed properly.
Note the difference here compared to the first for loop that was run in sequence.
A few other issues are worth noting when using the foreach package.
These are mainly issues I've encountered and I'm sure others could contribute to this list.
The foreach package does not work with all types of loops.
For example, I chose the above example to use a large number (1e6
) of observations with the rnorm
function.
I can't say for certain the exact type of data that works best, but I have found that functions that take a long time when run individually are generally handled very well.
Interestingly, decreasing the number of observations and increasing the number of iterations may cause the processors to not run at maximum efficiency (try rnorm(100)
with 1e5
iterations). I also haven't had much success running repeated models in parallel.
The functions work but the processors never seem to reach max efficiency.
The system statistics should cue you off as to whether or not the functions are working.
I also find it bothersome that monitoring progress is an issue with parallel loops.
A simple call using cat
to return the iteration in the console does not work with parallel loops.
The most practical solution I've found is described here, which involves exporting information to a separate file that tells you how far the loop has progressed.
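A minimal sketch of that idea (the log file name is made up); each worker appends a line to a shared file, which you can inspect while the loop runs:
cl = makeCluster(2)
registerDoParallel(cl)
res = foreach(i = 1:8) %dopar% {
  # writes from different workers may interleave, but it is good enough for progress tracking
  cat(sprintf("finished iteration %d\n", i), file = "progress.log", append = TRUE)
  sqrt(i)
}
stopCluster(cl)
readLines("progress.log") # or watch the file from another session or a shell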
Also, be very aware of your RAM when running processes in parallel.
I've found that it's incredibly easy to max out the memory, which not only causes the function to stop working correctly, but also makes your computer run like garbage.
Finally, I'm a little concerned that I might be destroying my processors by running them at maximum capacity.
The fan always runs at full blast leading me to believe that critical meltdown is imminent.
I'd be pleased to know if this is an issue or not.
That's it for now.
I have to give credit to this tutorial for a lot of the information in this post.
# A formula
d = y ~ x + b
The variable on the left-hand side of a tilde (~) is called the "dependent variable", while the variables on the right-hand side are called the "independent variables" and are joined by plus signs +.
You can access the elements of a formula with the help of the square brackets: [[ and ]].
f = y ~ x + b
# Retrieve the elements at index 1 and 2
f[[1]] ## "~"
f[[2]] ## y
f[[3]] ## x + b
y ~ x + a + b
## y ~ x + a + b
More complex formulas like the code chunk below:
Sepal.Width ~ Petal.Width | Species
## Sepal.Width ~ Petal.Width | Species
Where you mean to say "the sepal width is a function of petal width, conditioned on species"
"y ~ x1 + x2"
## [1] "y ~ x1 + x2"
h = as.formula("y ~ x1 + x2")
h = formula("y ~ x1 + x2")
# Create variables
i = y ~ x
j = y ~ x + x1
k = y ~ x + x1 + x2
# Concatenate
formulae = list(as.formula(i),as.formula(j),as.formula(k))
Use the lapply() function, where you pass in a vector with all of your formulas as a first argument and as.formula as the function that you want to apply to each element of that vector
# Join all with "c()"
l = c(i, j, k)
# Apply "as.formula" to all elements of "f"
lapply(l, as.formula)
[[1]] ## y ~ x
[[2]] ## y ~ x + x1
[[3]] ## y ~ x + x1 + x2
# Use multiple independent variables
y ~ x1 + x2 ## y ~ x1 + x2
# Ignore objects in an analysis
y ~ x1 - x2 ## y ~ x1 - x2
What if you want to actually perform an arithmetic operation inside a formula? You have a couple of solutions:
1. You can calculate and store all of the variables in advance
2. You can use the I() or "as-is" operator: y ~ x + I(x^2)
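A minimal sketch with made-up data showing the I() approach:
set.seed(1)
x = runif(50)
y = 2 + 3*x - 5*x^2 + rnorm(50, sd = 0.1)
fit = lm(y ~ x + I(x^2)) # I(x^2) is evaluated arithmetically, not as a formula term
coef(fit)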
m = formula("y ~ x1 + x2")
terms(m)
## y ~ x1 + x2
## attr(,"variables")
## list(y, x1, x2)
## attr(,"factors")
## x1 x2
## y 0 0
## x1 1 0
## x2 0 1
## attr(,"term.labels")
## [1] "x1" "x2"
## attr(,"order")
## [1] 1 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
class(m)
## [1] "formula"
typeof(m)
## [1] "language"
attributes(m)
## $class
## [1] "formula"
##
## $.Environment
## <environment: R_GlobalEnv>
If you want to know the names of the variables in the model, you can use all.vars.
print(all.vars(m))
## [1] "y" "x1" "x2"
To modify formulae without converting them to character you can use the update() function:
update(y ~ x1 + x2, ~. + x3)
## y ~ x1 + x2 + x3
y ~ x1 + x2 + x3
## y ~ x1 + x2 + x3
Double check whether your variable is a formula by passing it to the is.formula() function.
# Load "plyr"
library(plyr)
# Check "m"
is.formula(m)
## [1] TRUE
File | Who Controls | Level | Limitations |
.Rprofile | User or Admin | User or Project | None, sourced as R code. |
.Renviron | User or Admin | User or Project | Set environment variables only. |
Rprofile.site | Admin | Version of R | None, sourced as R code. |
Renviron.site | Admin | Version of R | Set environment variables only. |
rsession.conf | Admin | Server | Only RStudio settings, only single repository. |
repos.conf | Admin | Server | Only for setting repositories. |
.Rprofile
.Rprofile
files are user-controllable files to set options and environment variables.
.Rprofile
files can be either at the user or project level.
User-level .Rprofile
files live in the base of the user's home directory, and project-level .Rprofile
files live in the base of the project directory.
R will source only one .Rprofile
file.
So if you have both a project-specific .Rprofile
file and a user .Rprofile
file that you want to use, you explicitly source the user-level .Rprofile
at the top of your project-level .Rprofile
with source("~/.Rprofile")
.
.Rprofile
files are sourced as regular R code, so setting environment variables must be done inside a Sys.setenv(key = "value")
call.
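For example, a hypothetical user-level .Rprofile might look like this (the environment variable name is made up):
# ~/.Rprofile: plain R code that runs at the start of every session
options(repos = c(CRAN = "https://cloud.r-project.org"))
Sys.setenv(MY_DEFAULT_PROXY = "http://proxy.example.com:8080")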
One easy way to edit your .Rprofile
file is to use the usethis::edit_r_profile()
function from within an R session.
You can specify whether you want to edit the user or project level .Rprofile.
.Renviron
.Renviron
is a user-controllable file that can be used to create environment variables.
This is especially useful to avoid including credentials like API keys inside R scripts.
This file is written in a key-value format, so environment variables are created in the format:
Key1=value1
Key2=value2
...
And then Sys.getenv("Key1")
will return "value1"
in an R session.
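For example, a hypothetical .Renviron entry and how it is read back (the key name and value are made up):
# ~/.Renviron contents (key=value pairs, not R code):
#   MY_SERVICE_API_KEY=abc123
# then, in any new R session:
Sys.getenv("MY_SERVICE_API_KEY") # returns "abc123" without the key appearing in your scripts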
Like with the .Rprofile
file, .Renviron
files can be at either the user or project level.
If there is a project-level .Renviron
, the user-level file will not be sourced.
The usethis
package includes a helper function for editing .Renviron
files from an R session with usethis::edit_r_environ()
.
Rprofile.site
and Renviron.site
.Rprofile
and .Renviron
files have equivalents that apply server wide.
Rprofile.site and Renviron.site (no leading dot) files are managed by admins on RStudio Server and are specific to a particular version of R. The most common settings for these files involve access to package repositories.
For example, using the shared-baseline package management strategy is generally done from an Rprofile.site
.
Users can override settings in these files with their individual .Rprofile
files.
These files are set for each version of R and should be located in R_HOME/etc/
.
You can findR_HOME
by running the command R.home(component
= "home")
in a session of that version of R.
So, for example, if you find that R_HOME
is /opt/R/3.6.2/lib/R
, theRprofile.site
for R 3.6.2 would go in /opt/R/3.6.2/lib/R/etc/Rprofile.site
.
rsession.conf and repos.conf
RStudio Server administrators can also configure RStudio settings and package repositories through the rsession.conf and repos.conf files.
Only one repository can be configured in rsession.conf
.
If multiple repositories are needed, repos.conf
should be used.
Details on configuring RStudio Server with these files are in this support article.
# Create a new context
ct = v8()
# Evaluate some code
ct$eval("var foo = 123")
ct$eval("var bar = 456")
ct$eval("foo + bar")
[1] "579"
A major advantage over the other foreign language interfaces is that V8 requires no compilers, external executables or other run-time dependencies. The entire engine is contained within a 6MB package (2MB zipped) and works on all major platforms.
# Create some JSON
cat(ct$eval("JSON.stringify({x:Math.random()})"))
{"x":0.5580623043314792}
# Simple closure
ct$eval("(function(x){return x+1;})(123)")
[1] "124"
However note that V8 by itself is just the naked JavaScript engine. Currently, there is no DOM (i.e. no window object), no network or disk IO, not even an event loop. Which is fine because we already have all of those in R. In this sense V8 resembles other foreign language interfaces such as Rcpp or rJava, but then for JavaScript.
ct$source
method is a convenience function for loading JavaScript libraries from a file or url.
ct$source(system.file("js/underscore.js", package="V8"))
ct$source("https://cdnjs.cloudflare.com/ajax/libs/crossfilter/1.3.11/crossfilter.min.js")
ct$assign("mydata", mtcars)
ct$get("mydata")
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
...
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Alternatively use JS()
to assign the value of a JavaScript expression (without converting to JSON):
ct$assign("foo", JS("function(x){return x*x}"))
ct$assign("bar", JS("foo(9)"))
ct$get("bar")
[1] 81
ct$call
method calls a JavaScript function, automatically converting objects (arguments and return value) between R and JavaScript:
ct$call("_.filter", mtcars, JS("function(x){return x.mpg < 15}"))
mpg cyl disp hp drat wt qsec vs am gear carb
Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
It looks a bit like .Call
but then for JavaScript instead of C.
# Load some data
data(diamonds, package = "ggplot2")
ct$assign("diamonds", diamonds)
ct$console()
From here you can interactively work in JavaScript without typing ct$eval
every time:
var cf = crossfilter(diamonds)
var price = cf.dimension(function(x){return x.price})
var depth = cf.dimension(function(x){return x.depth})
price.filter([2000, 3000])
output = depth.top(10)
To exit the console, either press ESC
or type exit
. Afterwards you can retrieve the objects back into R:
output = ct$get("output")
print(output)
# A common typo
ct$eval('var foo <- 123;')
Error in context_eval(join(src), private$context, serialize): SyntaxError: Unexpected token '<'
JavaScript runtime exceptions are automatically propagated into R errors:
# Runtime errors
ct$eval("123 + doesnotexit")
Error in context_eval(join(src), private$context, serialize): ReferenceError: doesnotexit is not defined
Within JavaScript we can also call back to the R console manually using console.log
, console.warn
and console.error
. This allows for explicitly generating output, warnings or errors from within a JavaScript application.
ct$eval('console.log("this is a message")')
this is a message
ct$eval('console.warn("Heads up!")')
Warning: Heads up!
ct$eval('console.error("Oh no! An error!")')
Error in context_eval(join(src), private$context, serialize): Oh no! An error!
A example of using console.error
is to verify that external resources were loaded:
ct = v8()
ct$source("https://cdnjs.cloudflare.com/ajax/libs/crossfilter/1.3.11/crossfilter.min.js")
ct$eval('var cf = crossfilter || console.error("failed to load crossfilter!")')
A fresh context always contains a few objects: global (a reference to itself), console (for console.log and friends) and print (an alias of console.log needed by some JavaScript libraries).
ct = v8(typed_arrays = FALSE);
ct$get(JS("Object.keys(global)"))
[1] "print" "console" "global"
If typed arrays are enabled it contains some additional functions:
ct = v8(typed_arrays = TRUE);
ct$get(JS("Object.keys(global)"))
[1] "print" "console" "global"
A context always has a global scope, even when no name is set. When a context is initiated with global = NULL
, it can still be reached by evaluating the this
keyword within the global scope:
ct2 = v8(global = NULL, console = FALSE)
ct2$get(JS("Object.keys(this).length"))
[1] 1
ct2$assign("cars", cars)
ct2$eval("var foo = 123")
ct2$eval("function test(x){x+1}")
ct2$get(JS("Object.keys(this).length"))
[1] 4
ct2$get(JS("Object.keys(this)"))
[1] "print" "cars" "foo" "test"
To create your own global you could use something like:
ct2$eval("var __global__ = this")
ct2$eval("(function(){var bar = [1,2,3,4]; __global__.bar = bar; })()")
ct2$get("bar")
[1] 1 2 3 4
ct$validate("function foo(x){2*x}")
[1] TRUE
ct$validate("foo = function(x){2*x}")
[1] TRUE
This might be useful for all those R libraries that generate browser graphics via templated JavaScript. Note that JavaScript does not allow for defining anonymous functions in the global scope:
ct$validate("function(x){2*x}")
[1] FALSE
To check if an anonymous function is syntactically valid, prefix it with !
or wrap in ()
. These are OK:
ct$validate("(function(x){2*x})")
[1] TRUE
ct$validate("!function(x){2*x}")
[1] TRUE
The console.r API allows JavaScript code to call back into the R session. This is most easily demonstrated via the interactive console.
ctx = v8()
ctx$console()
From JavaScript we can read/write R objects via console.r.get
and console.r.assign
. The final argument is an optional list specifying arguments passed to toJSON
or fromJSON
.
// read the iris object into JS
var iris = console.r.get("iris")
var iris_col = console.r.get("iris", {dataframe : "col"})
//write an object back to the R session
console.r.assign("iris2", iris)
console.r.assign("iris3", iris, {simplifyVector : false})
To call R functions use console.r.call
. The first argument should be a string which evaluates to a function. The second argument contains a list of arguments passed to the function, similar to do.call
in R. Both named and unnamed lists are supported. The return object is returned to JavaScript via JSON.
//calls rnorm(n=2, mean=10, sd=5)
var out = console.r.call('rnorm', {n: 2,mean:10, sd:5})
var out = console.r.call('rnorm', [2, 20, 5])
//anonymous function
var out = console.r.call('function(x){x^2}', {x:12})
There is also an console.r.eval
function, which evaluates some code. It takes only a single argument (the string to evaluate) and does not return anything. Output is printed to the console.
console.r.eval('sessionInfo()')
Besides automatically converting objects, V8 also propagates exceptions between R, C++ and JavaScript up and down the stack. Hence you can catch R errors as JavaScript exceptions when calling an R function from JavaScript or vice versa. If nothing gets caught, exceptions bubble all the way up as R errors in your top-level R session.
//raise an error in R
console.r.call('stop("ouch!")')
//catch error from JavaScript
try {
console.r.call('stop("ouch!")')
} catch (e) {
console.log("Uhoh R had an error: " + e)
}
//# Uhoh R had an error: ouch!
# install package
install.packages("neuralnet")
Updating HTML index of packages in '.Library'
Making 'packages.html' ...
done
# creating training data set
TKS=c(20,10,30,20,80,30)
CSS=c(90,20,40,50,50,80)
Placed=c(1,0,0,0,1,1)
# Here, you will combine multiple columns or features into a single set of data
df=data.frame(TKS,CSS,Placed)
Let's build a NN classifier model using the neuralnet library.
First, import the neuralnet library and create NN classifier model by passing argument set of label and features, dataset, number of neurons in hidden layers, and error calculation.
# load library
require(neuralnet)
# fit neural network
nn=neuralnet(Placed~TKS+CSS,data=df, hidden=3,act.fct = "logistic",
linear.output = FALSE)
Here,
- Placed~TKS+CSS: Placed is the label and TKS and CSS are the features.
- df is the dataframe,
- hidden=3: a single hidden layer with 3 neurons.
- act.fct = "logistic": used for smoothing the result.
- linear.output = FALSE: set to FALSE to apply act.fct to the output, otherwise TRUE.
# plot neural network
plot(nn)
# creating test set
TKS=c(30,40,85)
CSS=c(85,50,40)
test=data.frame(TKS,CSS)
## Prediction using neural network
Predict=compute(nn,test)
Predict$net.result
0.9928202080
0.3335543925
0.9775153014
Now, convert the probabilities into binary classes.
# Converting probabilities into binary classes setting threshold level 0.5
prob = Predict$net.result
pred = ifelse(prob>0.5, 1, 0)
pred
1
0
1
The predicted results are 1, 0, and 1.
library("neuralnet")
Going to create a neural network to perform square rooting
Type ?neuralnet for more information on the neuralnet library
Generate 50 random numbers uniformly distributed between 0 and 100 And store them as a dataframe
traininginput = as.data.frame(runif(50, min=0, max=100))
trainingoutput = sqrt(traininginput)
Column bind the data into one variable
trainingdata = cbind(traininginput,trainingoutput)
colnames(trainingdata) = c("Input","Output")
Train the neural network. We use a single hidden layer with 10 neurons (hidden=10). Threshold is a numeric value specifying the threshold for the partial derivatives of the error function as stopping criteria.
net.sqrt = neuralnet(Output~Input,trainingdata, hidden=10, threshold=0.01)
Plot the neural network
plot(net.sqrt, rep = "best")
Test the neural network on some training data
testdata = as.data.frame((1:10)^2) #Generate some squared numbers
net.results = compute(net.sqrt, testdata) #Run them through the neural network
Let's see what properties net.results has
ls(net.results)
## [1] "net.result" "neurons"
Let's see the results
print(net.results$net.result)
## [,1]
## [1,] 0.995651087
## [2,] 2.004949735
## [3,] 2.997236258
## [4,] 4.003559121
## [5,] 4.992983838
## [6,] 6.004351125
## [7,] 6.999959828
## [8,] 7.995941860
## [9,] 9.005608807
## [10,] 9.971903887
Let's display a better version of the results
cleanoutput = cbind(testdata,sqrt(testdata),
as.data.frame(net.results$net.result))
colnames(cleanoutput) = c("Input","Expected Output","Neural Net Output")
print(cleanoutput)
## Input Expected Output Neural Net Output
## 1 1 1 0.995651087
## 2 4 2 2.004949735
## 3 9 3 2.997236258
## 4 16 4 4.003559121
## 5 25 5 4.992983838
## 6 36 6 6.004351125
## 7 49 7 6.999959828
## 8 64 8 7.995941860
## 9 81 9 9.005608807
## 10 100 10 9.971903887
Generate training data for the sin function:
x = sort(runif(50, min = 0, max = 4*pi))
y = sin(x)
data = cbind(x,y)
Create the neural network responsible for the sin function
library(neuralnet)
sin.nn = neuralnet(y ~ x, data = data, hidden = 5, stepmax = 100000, learningrate = 10e-6,
act.fct = 'logistic', err.fct = 'sse', rep = 5, lifesign = "minimal",
linear.output = T)
## hidden: 5 thresh: 0.01 rep: 1/5 steps: stepmax min thresh: 0.01599376894
## hidden: 5 thresh: 0.01 rep: 2/5 steps: 7943 error: 0.41295 time: 0.73 secs
## hidden: 5 thresh: 0.01 rep: 3/5 steps: 34702 error: 0.02068 time: 3.13 secs
## hidden: 5 thresh: 0.01 rep: 4/5 steps: 4603 error: 0.4004 time: 0.41 secs
## hidden: 5 thresh: 0.01 rep: 5/5 steps: 3582 error: 0.26375 time: 0.34 secs
## Warning: algorithm did not converge in 1 of 5 repetition(s) within the
## stepmax
Visualize the neural network
plot(sin.nn, rep = "best")
Generate data for prediction using the neural net:
testdata= as.data.frame(runif(10, min=0, max=(4*pi)))
testdata
## runif(10, min = 0, max = (4 * pi))
## 1 1.564816433
## 2 4.692188270
## 3 10.942269605
## 4 11.432769193
## 5 1.528565797
## 6 4.277983023
## 7 7.863112004
## 8 3.233025098
## 9 4.212822393
## 10 11.584672483
Calculate the real value using the sin
function
testdata.result = sin(testdata)
Make the prediction
sin.nn.result = compute(sin.nn, testdata)
sin.nn.result$net.result
## [,1]
## [1,] 1.04026644587
## [2,] -0.99122081475
## [3,] -0.77154683268
## [4,] -0.80702735515
## [5,] 1.03394587608
## [6,] -0.91997356615
## [7,] 1.02031970677
## [8,] -0.08226873533
## [9,] -0.89463523567
## [10,] -0.81283835083
Compare with the real values:
better = cbind(testdata, sin.nn.result$net.result, testdata.result, (sin.nn.result$net.result-testdata.result))
colnames(better) = c("Input", "NN Result", "Result", "Error")
better
## Input NN Result Result Error
## 1 1.564816433 1.04026644587 0.99998212049 0.040284325379
## 2 4.692188270 -0.99122081475 -0.99979597259 0.008575157839
## 3 10.942269605 -0.77154683268 -0.99857964177 0.227032809091
## 4 11.432769193 -0.80702735515 -0.90594290260 0.098915547446
## 5 1.528565797 1.03394587608 0.99910842368 0.034837452408
## 6 4.277983023 -0.91997356615 -0.90712021799 -0.012853348159
## 7 7.863112004 1.02031970677 0.99995831846 0.020361388309
## 8 3.233025098 -0.08226873533 -0.09130510334 0.009036368006
## 9 4.212822393 -0.89463523567 -0.87779026852 -0.016844967152
## 10 11.584672483 -0.81283835083 -0.83144207031 0.018603719479
Calculate the RMSE:
library(Metrics)
rmse(better$Result, better$`NN Result`)
## [1] 0.08095028855
Plot the results:
plot(x,y)
plot(sin, 0, (4*pi), add=T)
x1 = seq(0, 4*pi, by=0.1)
lines(x1, compute(sin.nn, data.frame(x=x1))$net.result, col="green")
Next, a classification example with the iris dataset.
data(iris)
iris.dataset = iris
Check what is inside the dataset:
head(iris.dataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Change the dataset so we are able to predict classes:
iris.dataset$setosa = iris.dataset$Species=="setosa"
iris.dataset$virginica = iris.dataset$Species == "virginica"
iris.dataset$versicolor = iris.dataset$Species == "versicolor"
Separate into train and test data:
train = sample(x = nrow(iris.dataset), size = nrow(iris)*0.5)
train
## [1] 116 3 137 124 100 48 28 123 99 54 129 128 96 11 97 115 53
## [18] 8 133 85 91 70 60 45 113 119 69 126 114 86 109 140 58 13
## [35] 77 57 7 61 9 111 141 39 120 98 104 88 83 106 20 147 74
## [52] 122 93 72 73 146 4 38 1 22 118 103 51 21 80 82 25 78
## [69] 148 143 14 50 23 84 40
iristrain = iris.dataset[train,]
irisvalid = iris.dataset[-train,]
print(nrow(iristrain))
## [1] 75
print(nrow(irisvalid))
## [1] 75
Build the Neural Network for the classification:
nn = neuralnet(setosa+versicolor+virginica ~ Sepal.Length + Sepal.Width, data=iristrain, hidden=3,
rep = 2, err.fct = "ce", linear.output = F, lifesign = "minimal", stepmax = 1000000)
## hidden: 3 thresh: 0.01 rep: 1/2 steps: 77918 error: 54.96826 time: 9.41 secs
## hidden: 3 thresh: 0.01 rep: 2/2 steps: 53687 error: 54.24648 time: 6.25 secs
Let’s check the neural network that we just built
plot(nn, rep="best")
Let’s try to make the prediction:
comp = compute(nn, irisvalid[-3:-8])
pred.weights = comp$net.result
idx = apply(pred.weights, 1, which.max)
pred = c('setosa', 'versicolor', 'virginica')[idx]
table(pred, irisvalid$Species)
##
## pred setosa versicolor virginica
## setosa 28 0 0
## versicolor 1 13 5
## virginica 0 9 19
AND = c(rep(0,3),1)
OR = c(0,rep(1,3))
binary.data = data.frame(expand.grid(c(0,1), c(0,1)), AND)
net = neuralnet(AND~Var1+Var2, binary.data, hidden=0, rep=10, err.fct="ce", linear.output=FALSE)
Now to validate the predictions:
input = data.frame(expand.grid(c(0,1), c(0,1)))
net.results = compute(net, input)
cbind(round(net.results$net.result), AND)
## AND
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
## [4,] 1 1
Practice Problem: Food Demand Forecasting Challenge | Predict the demand of meals for a meal delivery company
Practice Problem: HR Analytics Challenge | Identify the employees most likely to get promoted
Practice Problem: Predict Number of Upvotes | Predict number of upvotes on a query asked at an online question & answer platform
The boot package has elegant and powerful support for bootstrapping. In order to use it, you have to repackage your estimation function as follows.
R has very elegant and abstract notation in array indexes. Suppose
there is an integer vector OBS
containing the elements 2,
3, 7, i.e. that OBS = c(2,3,7);
. Suppose x is a
vector. Then the notation x[OBS]
is a vector containing
elements x[2], x[3] and x[7]. This beautiful notation works for x as a
dataset (data frame) also. Here are demos:
# For vectors --
> x = c(10,20,30,40,50)
> d = c(3,2,2)
> x[d]
[1] 30 20 20
# For data frames --
> D = data.frame(x=seq(10,50,10), y=seq(500,100,-100))
> t(D)
1 2 3 4 5
x 10 20 30 40 50
y 500 400 300 200 100
> D[d,]
x y
3 30 300
2 20 400
2.1 20 400
Now for the key point: how does the R boot package work? The R
package boot
repeatedly calls your estimation function,
and each time, the bootstrap sample is supplied using an integer
vector of indexes like above. Let me show you two examples of how you
would write estimation functions which are compatible with the
package:
samplemean = function(x, d) {
return(mean(x[d]))
}
samplemedian = function(x, d) {
return(median(x[d]))
}
The estimation function (that you write) consumes data
x
and a vector of indices d
. This function
will be called many times, one for each bootstrap replication. Every
time, the data `x' will be the same, and the bootstrap sample `d' will
be different.
At each call, the boot package will supply a fresh set of indices
d. The notation x[d] allows us to make a brand-new vector (the
bootstrap sample), which is given to mean() or median(). This reflects
sampling with replacement from the original data vector.
Once you have written a function like this, here is how you would
obtain bootstrap estimates of the standard deviation of the
distribution of the median:
b = boot(x, samplemedian, R=1000) # 1000 replications
The object `b' that is returned by boot() is interesting and
useful. Say ?boot to learn about it. For example, after making
b
as shown above, you can say:
print(sd(b$t[,1]))
Here, I'm using the fact that b$t is a matrix containing 1000 rows
which holds all the results of estimation. The 1st column in it is the
only thing being estimated by samplemedian(), which is the sample
median.
The default plot() operator does nice things when fed with this
object. Try it: say plot(b)
When your estimation function works on a data frame rather than a vector, the same indexing idea applies: write E = D[d,], which gives you a data frame E using the rows out of data frame D that are specified by the integer vector d.
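A minimal sketch (not from the original text) of that pattern, bootstrapping a regression slope from the built-in cars data frame:
library(boot)
samplebeta = function(D, d) {
  E = D[d, ] # the bootstrap sample of rows
  coef(lm(dist ~ speed, data = E))[["speed"]] # return the slope
}
b = boot(cars, samplebeta, R = 1000)
sd(b$t[, 1]) # bootstrap standard error of the slope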
For more, see the article on the boot package by Angelo J. Canty, which appeared in the December 2002 issue of R News.
Also see the web appendix to An R and S-PLUS Companion to
Applied Regression by John Fox [pdf],
and a tutorial by Patrick Burns [html].
We will use the RCurl and XML packages to help us with the scraping.
Let’s use the Eurovision_Song_Contest as an example.
The XML package has plenty of functions that allow us to scrape the data.
Usually we are extracting information based on the tags of the web pages.
##### SCRAPING CONTENT OFF WEBSITES #####
require(RCurl)
require(XML)
# XPath is a language for querying XML
# //Select anywhere in the document
# /Select from root
# @select attributes. Used in [] brackets
#### Wikipedia Example ####
url = "https://en.wikipedia.org/wiki/Eurovision_Song_Contest"
txt = getURL(url) # get the URL html code
# parsing html code into readable format
PARSED = htmlParse(txt)
# Parsing code using tags
xpathSApply(PARSED, "//h1")
# strips the markup and returns the content of the tag
xpathSApply(PARSED, "//h1", xmlValue) # h1 tag
xpathSApply(PARSED, "//h3", xmlValue) # h3 tag
xpathSApply(PARSED, "//a[@href]") # a tag with href attribute
# Go to url
# Highlight references
# right click, inspect element
# Search for tags
xpathSApply(PARSED, "//span[@class='reference-text']",xmlValue) # parse notes and citations
xpathSApply(PARSED, "//cite[@class='citation news']",xmlValue) # parse citation news
xpathSApply(PARSED, "//span[@class='mw-headline']",xmlValue) # parse headlines
xpathSApply(PARSED, "//p",xmlValue) # parsing contents in p tag
xpathSApply(PARSED, "//cite[@class='citation news']/a/@href") # parse links under citation. xmlValue not needed.
xpathSApply(PARSED, "//p/a/@href") # parse href links under all p tags
xpathSApply(PARSED, "//p/a/@*") # parse all atributes under all p tags
# Partial matches - subtle variations within or between pages.
xpathSApply(PARSED, "//cite[starts-with(@class, 'citation news')]",xmlValue) # parse citataion news that starts with..
xpathSApply(PARSED, "//cite[contains(@class, 'citation news')]",xmlValue) # parse citataion news that contains.
# Parsing tree like structure
parsed= htmlTreeParse(txt, asText = TRUE)
##### BBC Example ####
url = "https://www.bbc.co.uk/news/uk-england-london-46387998"
url = "https://www.bbc.co.uk/news/education-46382919"
txt = getURL(url) # get the URL html code
# parsing html code into readable format
PARSED = htmlParse(txt)
xpathSApply(PARSED, "//h1", xmlValue) # h1 tag
xpathSApply(PARSED, "//p", xmlValue) # p tag
xpathSApply(PARSED, "//p[@class='story-body__introduction']", xmlValue) # p tag body
xpathSApply(PARSED, "//div[@class='date date--v2']",xmlValue) # date, only the first is enough
xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content") # sometimes there is meta data.
##### Create a simple BBC scraper #####
# scrape title, date and content
BBCscrapper1= function(url){
txt = getURL(url) # get the URL html code
PARSED = htmlParse(txt) # Parse code into readable format
title = xpathSApply(PARSED, "//h1", xmlValue) # h1 tag
paragraph = xpathSApply(PARSED, "//p", xmlValue) # p tag
date = xpathSApply(PARSED, "//div[@class='date date--v2']",xmlValue) # date, only the first is enough
date = date[1]
return(cbind(title,date))
#return(as.matrix(c(title,date)))
}
# Use function that was just created.
BBCscrapper1("https://www.bbc.co.uk/news/education-46382919")
## title
## [1,] "Ed Farmer: Expel students who defy initiations ban, says dad"
## date
## [1,] "29 November 2018"
The plyr package helps to arrange the data in an organised way.
## Putting the title and date into a dataframe
require(plyr)
#url
url= c("https://www.bbc.co.uk/news/uk-england-london-46387998", "https://www.bbc.co.uk/news/education-46382919")
## ldply: For each element of a list, apply function then combine results into a data frame
#put into a dataframe
ldply(url,BBCscrapper1)
## title
## 1 Man murdered widow, 80, in London allotment row
## 2 Ed Farmer: Expel students who defy initiations ban, says dad
## date
## 1 29 November 2018
## 2 29 November 2018
# Install the packages that you don't have first.
library("RCurl") # Good package for getting things from URLs, including https
library("XML") # Has a good function for parsing HTML data
library("rvest") #another package that is good for web scraping. We use it in the Wikipedia example
#####################
### Get a table of data from Wikipedia
## all of this happens because of the read_html function in the rvest package
# First, grab the page source
us_states = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population") %>% # piping
# then extract the first node with class of wikitable
html_node(".wikitable") %>%
# then convert the HTML table into a data frame
html_table()
######################
url = "http://apps.saferoutesinfo.org/legislation_funding/state_apportionment.cfm"
funding=htmlParse(url) #get the data
# find the table on the page and read it into a list object
funding= XML::readHTMLTable(funding,stringsAsFactors = FALSE)
funding.df = do.call("rbind", funding) #flatten data
# Contain empty spaces previously.
colnames(funding.df)[1]= c("State") # shorten colname to just State.
# Match up the tables by State/Territory names
# so we have two data frames, x and y, and we're setting the columns we want to do the matching on by setting by.x and by.y
mydata = merge(us_states, funding.df, by.x="State, federal district, or territory", by.y="State")
# it looks pretty good, but note that we're down to 50 US States, because the others didn't match up by name
# e.g. "District of Columbia" in the us_states data, doesn't match "Dist. of Col." in the funding data
#Replace the total spend column name with a name that's easier to use.
colnames(mydata)[18] = "total_spend"
# We need to remove commas so that R can treat it as a number.
mydata[,"Population estimate, July 1, 2017[4]"] = gsub(",", ", mydata[,"Population estimate, July 1, 2017[4]"])
mydata[,"Population estimate, July 1, 2017[4]"] = as.numeric(mydata[,"Population estimate, July 1, 2017[4]"]) #this converts it to a number data type
# Now we have to do the same thing with the funding totals, which are in a format like this: $17,309,568
mydata[,"total_spend"] = gsub(",", ", mydata[,"total_spend"]) #this removes all commas
mydata[,"total_spend"] = gsub("\\$", ", mydata[,"total_spend"]) #this removes all dollar signs. We have a \\ because the dollar sign is a special character.
mydata[,"total_spend"] = as.numeric(mydata[,"total_spend"]) #this converts it to a number data type
# Now we can do the plotting
options(scipen=9999) #stop it showing scientific notation
plot(mydata[,"Population estimate, July 1, 2017[4]"], mydata[,"total_spend"])
## What's does the correlation between state funding and state population look like?
cor(mydata[,"Population estimate, July 1, 2017[4]"], mydata[,"total_spend"]) # 0.9924265 - big correlation!
## [1] 0.9885666
We can use the map_data function in the ggplot2 package to help us with that.
Again, with a bit of data manipulation, we can merge the data table that contains the longitude and latitude information together with the funding data across different states.
require(ggplot2)
all_states = map_data("state") # states
colnames(mydata)[1] = "state" # rename to states
mydata$state = tolower(mydata$state) #set all to lower case
Total = merge(all_states, mydata, by.x="region", by.y = 'state') # merge data
# we have data for delaware but not lat, long data in the maps
i = which(!unique(all_states$region) %in% mydata$state)
# Plot data
ggplot() +
geom_polygon(data=Total, aes(x=long, y=lat, group = group, fill=Total$total_spend),colour="white") +
scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar") +
theme_bw() +
labs(fill = "Funding for School" ,title = "Funding for School between 2005 to 2012", x=", y=") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border = element_blank(),
text = element_text(size=20))
tuber
install.packages("tuber")
library(tuber) # youtube API
library(magrittr) # Pipes %>%, %T>% and equals(), extract().
library(tidyverse) # all tidyverse packages
library(purrr) # package for iterating/extracting data
You will need the two credentials from your API application (client_id and client_secret).
client_id = "XXXXXXXXX"
client_secret = "XXXXXXXXX"
Use tuber's yt_oauth() function to authenticate your application.
I included the token as a blank string (token = ''
) because it kept looking for the .httr-oauth
in my local directory (and I didn’t create one).
# use the youtube oauth
yt_oauth(app_id = client_id,
app_secret = client_secret,
token = '')
Provided you did everything correct, this should open your browser and ask you to sign into the Google account you set everything up with (see the images below).
You’ll see the name of your application in place of “Your application name”.
After authenticating you should see the "Authentication complete. Please close this page and return to R." message.
We need the playlistId from the url to access the content from the videos.
Here is some information on the playlistId parameter:
The playlistId parameter specifies the unique ID of the playlist for which you want to retrieve playlist items. Note that even though this is an optional parameter, every request to retrieve playlist items must specify a value for either the id parameter or the playlistId parameter.
Dave Chappelle's playlist is in the url below. We pass it to the stringr::str_split() function to get the playlistId out of it.
dave_chappelle_playlist_id = stringr::str_split(
string = "https://www.youtube.com/playlist?list=PLG6HoeSC3raE-EB8r_vVDOs-59kg3Spvd",
pattern = "=",
n = 2,
simplify = TRUE)[ , 2]
dave_chappelle_playlist_id
[1] "PLG6HoeSC3raE-EB8r_vVDOs-59kg3Spvd"
Ok–we have a vector for Dave Chappelle’s playlistId
named dave_chappelle_playlist_id
, now we can use the tuber::get_playlist_items()
to collect the videos into a data.frame
.
DaveChappelleRaw = tuber::get_playlist_items(filter =
c(playlist_id = "PLG6HoeSC3raE-EB8r_vVDOs-59kg3Spvd"),
part = "contentDetails",
# set this to the number of videos
max_results = 200)
We should check these data to see if there is one row per video from the playlist (recall that Dave Chappelle had 200 videos).
# check the data for Dave Chappelle
DaveChappelleRaw %>% dplyr::glimpse(78)
Observations: 200
Variables: 6
$ .id <chr> "items1", "items2", "items3", "item…
$ kind <fct> youtube#playlistItem, youtube#playl…
$ etag <fct> "p4VTdlkQv3HQeTEaXgvLePAydmU/G-gTM9…
$ id <fct> UExHNkhvZVNDM3JhRS1FQjhyX3ZWRE9zLTU…
$ contentDetails.videoId <fct> oO3wTulizvg, ZX5MHNvjw7o, MvZ-clcMC…
$ contentDetails.videoPublishedAt <fct> 2019-04-28T16:00:07.000Z, 2017-12-3…
Using the video ids in contentDetails.videoId (not .id), we can create a function that extracts the statistics for each video on the playlist.
We'll start by putting the video ids in a vector and calling it dave_chap_ids.
dave_chap_ids = base::as.vector(DaveChappelleRaw$contentDetails.videoId)
dplyr::glimpse(dave_chap_ids)
chr [1:200] "oO3wTulizvg" "ZX5MHNvjw7o" "MvZ-clcMCec" "4trBQseIkkc" ...
tuber
has a get_stats()
function we will use with the vector we just created for the show ids.
# Function to scrape stats for all vids
get_all_stats = function(id) {
tuber::get_stats(video_id = id)
}
We'll iterate over these ids with the purrr package.
The purrr
package provides tools for ‘functional programming,’ but that is a much bigger topic for a later post.
For now, just know that the purrr::map_df()
function takes an object as .x
, and whatever function is listed in .f
gets applied over the .x
object.
Check out the code below:
# Get stats and convert results to data frame
DaveChappelleAllStatsRaw = purrr::map_df(.x = dave_chap_ids,
.f = get_all_stats)
DaveChappelleAllStatsRaw %>% dplyr::glimpse(78)
Observations: 200
Variables: 6
$ id <chr> "oO3wTulizvg", "ZX5MHNvjw7o", "MvZ-clcMCec", "4trBQse…
$ viewCount <chr> "4446789", "19266680", "6233018", "8867404", "7860341…
$ likeCount <chr> "48699", "150691", "65272", "92259", "56584", "144625…
$ dislikeCount <chr> "1396", "6878", "1530", "2189", "1405", "3172", "1779…
$ favoriteCount <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"…
$ commentCount <chr> "2098", "8345", "5130", "5337", "2878", "9071", "4613…
Fantastic! We have the DaveChappelleRaw
and DaveChappelleAllStatsRaw
in two data.frame
s we can export (and timestamp!)
# export DaveChappelleRaw
readr::write_csv(x = as.data.frame(DaveChappelleRaw),
path = paste0("data/",
base::noquote(lubridate::today()),
"-DaveChappelleRaw.csv"))
# export DaveChappelleAllStatsRaw
readr::write_csv(x = as.data.frame(DaveChappelleAllStatsRaw),
path = paste0("data/",
base::noquote(lubridate::today()),
"-DaveChappelleAllStatsRaw.csv"))
# verify
fs::dir_ls("data", regexp = "Dave")
Be sure to go through the following purrr
tutorials if you want to learn more about functional programming:
R for Data Science by H. Wickham & G.
Grolemund
purrr Tutorial by J.
Bryan
A purrr tutorial - useR! 2017 by C.
Wickham
Happy dev with {purrr} - by C. Fay
Also check out the previous post on using
APIs.
parallel package
Historically, there were two main packages for parallel processing in R: multicore and snow.
However, both were adopted in the base R installation and merged into the parallel
package.
library(parallel)
You can easily check the number of cores you have access to with detectCores
:
detectCores()
## [1] 4
The number of cores reported is not necessarily the number of physical processors you actually have, thanks to the concept of "logical CPUs".
For the most part, you can use this number as accurate.
Trying to use more cores than you have available won’t provide any benefit.
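For example (on some platforms the physical-core count cannot be determined and NA is returned):
detectCores() # logical CPUs, which may include hyperthreads
detectCores(logical = FALSE) # physical cores where the platform can report them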
Con: Forking can cause problems when R is already running in parallel in the background or when running in a GUI (such as RStudio).
This doesn’t come up often, but if you get odd behavior, this may be the case.
Pro: Faster than sockets.
Pro: Because it copies the existing version of R, your entire workspace exists in each process.
Pro: Trivially easy to implement.
In general, I’d recommend using forking if you’re not on Windows.
Note: These notes were compiled on OS X.
mclapply
The simplest way to use the forking approach is to change a call from lapply to mclapply.
(Note I’m using system.time
instead of profvis
here because I only care about running time, not profiling.)
library(lme4)
## Loading required package: Matrix
f = function(i) {
lmer(Petal.Width ~ .
- Species + (1 | Species), data = iris)
}
system.time(save1 <- lapply(1:100, f))
## user system elapsed
## 2.048 0.019 2.084
system.time(save2 <- mclapply(1:100, f))
## user system elapsed
## 1.295 0.150 1.471
If you were to run this code on Windows, mclapply
would simply call lapply
, so the code works but sees no speed gain.
mclapply
takes an argument, mc.cores
.
By default, mclapply
will use all cores available to it.
If you don't want to (either because you're on a shared system or you just want to save processing power for other purposes) you can set this to a value lower than the number of cores you have.
Setting it to 1 disables parallel processing, and setting it higher than the number of available cores has no effect.
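For example, reusing f from above (the choice of 2 cores is arbitrary):
system.time(save2b <- mclapply(1:100, f, mc.cores = 2))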
parLapply
The socket approach provides par*apply functions as replacements for the *apply functions.
Note that unlike mclapply, this is not a drop-in replacement: you first create a cluster, then do the work, and finally destroy the cluster (not strictly necessary, but best practice).
Start by creating a cluster with makeCluster, which takes in as an argument the number of cores:
numCores = detectCores()
numCores
## [1] 4
cl = makeCluster(numCores)
The function takes an argument type
which can be either PSOCK
(the socket version) or FORK
(the fork version).
Generally, mclapply
should be used for the forking approach, so there’s no need to change this.
If you were running this on a network of multiple computers as opposed to on your local machine, there are additional arguments you may wish to set, but generally the other defaults should be sufficient.
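A minimal sketch of the two cluster types (FORK clusters are not available on Windows):
cl_sock = makeCluster(numCores, type = "PSOCK") # the default socket cluster
cl_fork = makeCluster(numCores, type = "FORK") # forked workers, not on Windows
stopCluster(cl_sock)
stopCluster(cl_fork)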
You can run code on every worker with the clusterEvalQ function, which takes a cluster and any expression, and executes the expression on each process.
clusterEvalQ(cl, 2 + 2)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 4
##
## [[4]]
## [1] 4
Note the lack of inheritance:
x = 1
clusterEvalQ(cl, x)
## Error in checkForRemoteErrors(lapply(cl, recvResult)): 4 nodes produced errors; first error: object 'x' not found
We could fix this by wrapping the assignment in a clusterEvalQ
call:
clusterEvalQ(cl, y <- 1)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 1
clusterEvalQ(cl, y)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 1
y
## Error in eval(expr, envir, enclos): object 'y' not found
However, now y
doesn’t exist in the main process.
We can instead use clusterExport
to pass objects to the processes:
clusterExport(cl, "x")
clusterEvalQ(cl, x)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 1
The second argument is a vector of strings naming the variables to pass.
Finally, we can use clusterEvalQ
to load packages:
clusterEvalQ(cl, {
library(ggplot2)
library(stringr)
})
## [[1]]
## [1] "stringr" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[2]]
## [1] "stringr" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[3]]
## [1] "stringr" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[4]]
## [1] "stringr" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
Note that this helpfully returns a list of the packages loaded in each process.
par*apply
There are parallel versions of the three main apply statements: parApply, parLapply and parSapply for apply, lapply and sapply respectively.
They take an additional argument for the cluster to operate on.
parSapply(cl, Orange, mean, na.rm = TRUE)
## Tree age circumference
## NA 922.1429 115.8571
All the general advice and rules about the normal *apply functions apply to the par*apply versions as well.
stopCluster(cl)
This is not fully necessary, but is best practices.
If not stopped, the processes continue to run in the background, consuming resources, and any new processes can be slowed or delayed.
If you exit R, it should automatically close all processes also.
This does not delete the cl
object, just the cluster it refers to in the background.
Keep in mind that closing a cluster is equivalent to quitting R in each; anything saved there is lost and packages will need to be re-loaded.
cl = makeCluster(detectCores())
clusterEvalQ(cl, library(lme4))
## [[1]]
## [1] "lme4" "Matrix" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[2]]
## [1] "lme4" "Matrix" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[3]]
## [1] "lme4" "Matrix" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[4]]
## [1] "lme4" "Matrix" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
system.time(save3 <- parLapply(cl, 1:100, f))
## user system elapsed
## 0.095 0.017 1.145
stopCluster(cl)
Timing this is tricky - if we just time the parLapply
call we’re not capturing the time to open and close the cluster, and if we time the whole thing, we’re including the call to lme4.
To be completely fair, we need to include loading lme4
in all three cases.
I do this outside of this Markdown file to ensure no added complications.
The three pieces of code were, with a complete restart of R after each:
### lapply
library(parallel)
f = function(i) {
lmer(Petal.Width ~ .
- Species + (1 | Species), data = iris)
}
system.time({
library(lme4)
save1 = lapply(1:100, f)
})
### mclapply
library(parallel)
f = function(i) {
lmer(Petal.Width ~ .
- Species + (1 | Species), data = iris)
}
system.time({
library(lme4)
save2 = mclapply(1:100, f)
})
### parLapply
library(parallel)
f = function(i) {
lmer(Petal.Width ~ .
- Species + (1 | Species), data = iris)
}
system.time({
cl = makeCluster(detectCores())
clusterEvalQ(cl, library(lme4))
save3 = parLapply(cl, 1:100, f)
stopCluster(cl)
})
lapply | mclapply | parLapply |
---|---|---|
4.237 | 4.087 | 6.954 |
Once the cluster setup and shutdown are included, parLapply is the slowest of the three overall, even though the parLapply call by itself is faster.
Also known as "embarrassingly parallel", though I don't like that term.
In this situation, we would actually run slower because of the overhead!
The flexibility of this to work across computers is what allows massive servers made up of many computers to work in parallel.
As an example, we take Sepal.Length and Species from the iris dataset, subset it to 100 observations, and then iterate across 10,000 trials, each time resampling the observations with replacement.
We then run a logistic regression fitting species as a function of length, and record the coefficients for each trial to be returned.
x = iris[which(iris[,5] != "setosa"), c(1,5)]
trials = 10000
res = data.frame()
system.time({
trial = 1
while(trial <= trials) {
ind = sample(100, 100, replace=TRUE)
result1 = glm(x[ind,2]~x[ind,1], family=binomial(logit))
r = coefficients(result1)
res = rbind(res, r)
trial = trial + 1
}
})
## user system elapsed
## 20.031 0.458 21.220
The issue with this loop is that we execute each trial sequentially, which means that only one of our 8 processors on this machine are in use.
In order to exploit parallelism, we need to be able to dispatch our tasks as functions, with one task going to each processor.
To do that, we need to convert our task to a function, and then use the *apply()
family of R functions to apply that function to all of the members of a set.
In R, using apply
is often significantly faster than the equivalent code in a loop.
Here’s the same code rewritten to use lapply()
, which applies a function to each of the members of a list (in this case the trials we want to run):
x = iris[which(iris[,5] != "setosa"), c(1,5)]
trials = seq(1, 10000)
boot_fx = function(trial) {
ind = sample(100, 100, replace=TRUE)
result1 = glm(x[ind,2]~x[ind,1], family=binomial(logit))
r = coefficients(result1)
res = rbind(data.frame(), r)
}
system.time({
results = lapply(trials, boot_fx)
})
## user system elapsed
## 19.340 0.553 20.315
There are two common approaches:
Use the multiple cores on your local machine via mclapply.
Use multiple processors on local (and remote) machines using makeCluster and clusterApply. In this approach, one has to manually copy data and code to each cluster member using clusterExport. This is extra work, but sometimes gaining access to a large cluster is worth it.
The parallel library can be used to send tasks (encoded as function calls) to each of the processing cores on your machine in parallel.
This is done by using the parallel::mclapply
function, which is analogous to lapply
, but distributes the tasks to multiple processors.
mclapply
gathers up the responses from each of these function calls, and returns a list of responses that is the same length as the list or vector of input data (one return per input item).
library(parallel)
library(MASS)
starts = rep(100, 40)
fx = function(nstart) kmeans(Boston, 4, nstart=nstart)
numCores = detectCores()
numCores
## [1] 8
system.time(
  results <- lapply(starts, fx)
)
## user system elapsed
## 1.346 0.024 1.372
system.time(
  results <- mclapply(starts, fx, mc.cores = numCores)
)
## user system elapsed
## 0.801 0.178 0.367
Now let’s demonstrate with our bootstrap example:
x = iris[which(iris[,5] != "setosa"), c(1,5)]
trials = seq(1, 10000)
boot_fx = function(trial) {
ind = sample(100, 100, replace=TRUE)
result1 = glm(x[ind,2]~x[ind,1], family=binomial(logit))
r = coefficients(result1)
res = rbind(data.frame(), r)
}
system.time({
results = mclapply(trials, boot_fx, mc.cores = numCores)
})
## user system elapsed
## 25.672 1.343 5.003
A basic for loop in R looks like:
for (i in 1:3) {
print(sqrt(i))
}
## [1] 1
## [1] 1.414214
## [1] 1.732051
The foreach
method is similar, but uses the sequential %do%
operator to indicate an expression to run.
Note the difference in the returned data structure.
library(foreach)
foreach (i=1:3) %do% {
sqrt(i)
}
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
In addition, foreach
supports a parallelizable operator %dopar%
from the doParallel
package.
This allows each iteration through the loop to use different cores or different machines in a cluster.
Here, we demonstrate with using all the cores on the current machine:
library(foreach)
library(doParallel)
## Loading required package: iterators
registerDoParallel(numCores) # use multicore, set to the number of our cores
foreach (i=1:3) %dopar% {
sqrt(i)
}
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
# To simplify output, foreach has the .combine parameter that can simplify return values
# Return a vector
foreach (i=1:3, .combine=c) %dopar% {
sqrt(i)
}
## [1] 1.000000 1.414214 1.732051
# Return a data frame
foreach (i=1:3, .combine=rbind) %dopar% {
sqrt(i)
}
## [,1]
## result.1 1.000000
## result.2 1.414214
## result.3 1.732051
The doParallel vignette on CRAN shows a much more realistic example, where one can use %dopar% to parallelize a bootstrap analysis where a data set is resampled 10,000 times and the analysis is rerun on each sample, and then the results combined:
# Let's use the iris data set to do a parallel bootstrap
# From the doParallel vignette, but slightly modified
x = iris[which(iris[,5] != "setosa"), c(1,5)]
trials = 10000
system.time({
r = foreach(icount(trials), .combine=rbind) %dopar% {
ind = sample(100, 100, replace=TRUE)
result1 = glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})
## user system elapsed
## 24.117 1.303 4.944
# And compare that to what it takes to do the same analysis in serial
system.time({
r = foreach(icount(trials), .combine=rbind) %do% {
ind = sample(100, 100, replace=TRUE)
result1 = glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})
## user system elapsed
## 19.445 0.571 20.302
# When you're done, clean up the cluster
stopImplicitCluster()
A somewhat-hackish solution:
Read the lines from the VBS script into R using readLines():
vbs_lines <- readLines(con = "Msg_Script.vbs")
Edit the lines in R by finding and replacing specific text:
updated_vbs_lines <- gsub(x = vbs_lines,pattern = "[Insert Text Here]", replacement = "World", fixed =TRUE )
Create a new VBS script using the updated lines:
writeLines(text = updated_vbs_lines, con ="Temporary VBS Script.vbs" )
Run the script using a system command:
full_temp_script_path <- normalizePath("Temporary VBS Script.vbs")
system_command <- paste0("WScript ", '"', full_temp_script_path, '"')
system(command = system_command, wait = TRUE)
Delete the new script after you've run it:
file.remove("Temporary VBS Script.vbs" )
The colorspace package provides S4 classes and constructors for a range of color spaces: polarLUV() (= HCL), LUV(), polarLAB(), LAB(), XYZ(), RGB(), sRGB(), HLS(), HSV().
The HCL space (= polar coordinates in CIELUV) is particularly useful for specifying individual colors and color palettes as its three axes match those of the human visual system very well: Hue (= type of color, dominant wavelength), chroma (= colorfulness), luminance (= brightness).
The colorspace package provides three types of palettes based on the HCL model:
Qualitative: Designed for coding categorical information, i.e., where no particular ordering of categories is available and every color should receive the same perceptual weight.
Function: qualitative_hcl()
.
Sequential: Designed for coding ordered/numeric information, i.e., where colors go from high to low (or vice versa).
Function: sequential_hcl()
.
Diverging: Designed for coding ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes.
Function: diverging_hcl()
.
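For illustration, a few colors can be drawn from each palette type directly; the palette names below are examples of palettes shipped with colorspace:
library(colorspace)
qualitative_hcl(4, palette = "Dark 3")    # categorical information
sequential_hcl(5, palette = "Purples 3")  # ordered information from high to low
diverging_hcl(5, palette = "Blue-Red")    # ordered information around a neutral value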
To aid choice and application of these palettes there are: scales for use with ggplot2; shiny (and tcltk) apps for interactive exploration; visualizations of palette properties; accompanying manipulation utilities (like desaturation, lighten/darken, and emulation of color vision deficiencies).
More detailed overviews and examples are provided in the articles:
Color Spaces: S4 Classes and Utilities
HCL-Based Color Palettes
HCL-Based Color Scales for ggplot2
Palette Visualization and Assessment
Apps for Choosing Colors and Palettes Interactively
Color Vision Deficiency Emulation
Color Manipulation and Utilities
Approximating Palettes from Other Packages
Somewhere over the Rainbow
install.packages("colorspace")
The development version of colorspace is hosted on R-Forge at https://R-Forge.R-project.org/projects/colorspace/ in a Subversion (SVN) repository.
It can be installed via
install.packages("colorspace", repos = "http://R-Forge.R-project.org")
For Python users a beta re-implementation of the full colorspace package in Python 2/Python 3 is also available, see https://github.com/retostauffer/python-colorspace.
hcl_palettes()
function:
library("colorspace")
hcl_palettes(plot = TRUE)
A suitable vector of colors can be easily computed by specifying the desired number of colors and the palette name (see the plot above), e.g.,
q4 = qualitative_hcl(4, palette = "Dark 3")
q4
## [1] "#E16A86" "#909800" "#00AD9A" "#9183E6"
The functions sequential_hcl()
, and diverging_hcl()
work analogously.
Additionally, their hue/chroma/luminance parameters can be modified, thus allowing for easy customization of each palette.
Moreover, the choose_palette()
/hclwizard()
app provide convenient user interfaces to perform palette customization interactively.
Finally, even more flexible diverging HCL palettes are provided by divergingx_hcl()
.
The resulting hex color codes can be passed to base R graphics functions via the col argument. Here, the q4 vector created above is used in a time series display:
plot(log(EuStockMarkets), plot.type = "single", col = q4, lwd = 2)
legend("topleft", colnames(EuStockMarkets), col = q4, lwd = 3, bty = "n")
As another example for a sequential palette, we demonstrate how to create a spine plot displaying the proportion of Titanic passengers that survived per class.
The Purples 3 palette is used, which is quite similar to the ColorBrewer.org palette Purples. Here, only two colors are employed, yielding a dark purple and a light gray.
ttnc = margin.table(Titanic, c(1, 4))[, 2:1]
spineplot(ttnc, col = sequential_hcl(2, palette = "Purples 3"))
For use with ggplot2, the color scales are called via the scheme scale_<aesthetic>_<datatype>_<colorscale>(), where <aesthetic> is the name of the aesthetic (fill, color, colour), <datatype> is the type of the variable plotted (discrete or continuous), and <colorscale> sets the type of the color scale used (qualitative, sequential, diverging, divergingx).
To illustrate their usage two simple examples are shown using the qualitative Dark 3
and sequential Purples 3
palettes that were also employed above.
For the first example, semi-transparent shaded densities of the sepal length from the iris data are shown, grouped by species.
library("ggplot2")
ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_density(alpha = 0.6) +
scale_fill_discrete_qualitative(palette = "Dark 3")
And for the second example the sequential palette is used to code the cut levels in a scatter of price by carat in the diamonds data (or rather a small subsample thereof).
The scale function first generates six colors but then drops the first color because the light gray is too light here.
(Alternatively, the chroma and luminance parameters could also be tweaked.)
dsamp = diamonds[1 + 1:1000 * 50, ]
ggplot(dsamp, aes(carat, price, color = cut)) + geom_point() +
scale_color_discrete_sequential(palette = "Purples 3", nmax = 6, order = 2:6)
demoplot()
can display a palette (with arbitrary number of colors) in a range of typical and somewhat simplified statistical graphics.
hclplot()
converts the colors of a palette to the corresponding hue/chroma/luminance coordinates and displays them in HCL space with one dimension collapsed.
The collapsed dimension is the luminance for qualitative palettes and the hue for sequential/diverging palettes.
specplot()
also converts the colors to hue/chroma/luminance coordinates but draws the resulting spectrum in a line plot.
For the qualitative Dark 3
palette from above the following plots can be obtained.
demoplot(q4, "bar")
hclplot(q4)
specplot(q4, type = "o")
The bar plot is used as a typical application for a qualitative palette (in addition to the time series and density plots used above).
The other two displays show that luminance is (almost) constant in the palette while the hue changes linearly along the color “wheel”.
Ideally, chroma would have also been constant to completely balance the colors.
However, at this luminance the maximum chroma differs across hues so that the palette is fixed up to use less chroma for the yellow and green elements.
Note also that in a bar plot areas are shaded (and not just points or lines) so that lighter colors would be preferable.
In the density plot above this was achieved through semi-transparency.
Alternatively, luminance could be increased as is done in the "Pastel 1"
or "Set 3"
palettes.
Subsequently, the same types of assessment are carried out for the sequential "Purples 3"
palette as employed above.
s9 = sequential_hcl(9, "Purples 3")
demoplot(s9, "heatmap")
hclplot(s9)
specplot(s9, type = "o")
Here, a heatmap (based on the well-known Maunga Whau volcano data) is used as a typical application for a sequential palette.
The elevation of the volcano is brought out clearly, using dark colors to give emphasis to higher elevations.
The other two displays show that hue is constant in the palette while luminance and chroma vary.
Luminance increases monotonically from dark to light (as required for a proper sequential palette).
The chroma trajectory is triangular, which makes the middle colors of the palette easier to distinguish than a monotonic chroma trajectory would.
A dummy variable is a numeric representation of a category or level of a factor variable.
That is, it represents every group or level of the categorical variable as a single numeric entity.
Download the data-conversion.csv file and store it in the working directory of your R environment, install the dummies package, and then read the data:
install.packages("dummies")
library(dummies)
students = read.csv("data-conversion.csv")
Create dummies for all factors in the data frame:
students.new = dummy.data.frame(students, sep = ".")
names(students.new)
[1] "Age" "State.NJ" "State.NY" "State.TX" "State.VA"
[6] "Gender.F" "Gender.M" "Height" "Income"
The students.new
data frame now contains all the original variables and the newly added dummy variables. The dummy.data.frame()
function has created dummy variables for all four levels of the State
and two levels of Gender
factors. However, we will generally omit one of the dummy variables for State
and one for Gender
when we use machine-learning techniques.
We can use the optional argument all = FALSE
to specify that the resulting data frame should contain only the generated dummy variables and none of the original variables.
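As noted above, one dummy per factor is usually dropped as the reference level before modelling. A minimal sketch of one way to do this with base R, assuming the students data frame read above: model.matrix() applies treatment contrasts and therefore drops the first level of each factor automatically.
# The first level of each factor (e.g. the first State and the first Gender level)
# becomes the reference level and gets no column of its own.
X = model.matrix(~ Age + State + Gender + Height + Income, data = students)
head(X)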
The dummy.data.frame()
function creates dummies for all the factors in the data frame supplied. Internally, it uses another dummy()
function which creates dummy variables for a single factor. The dummy()
function creates one new variable for every level of the factor for which we are creating dummies. It appends the variable name with the factor level name to generate names for the dummy variables. We can use the sep
argument to specify the character that separates them—an empty string is the default:
dummy(students$State, sep = ".")
State.NJ State.NY State.TX State.VA
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 1 0 0 0
[4,] 0 0 0 1
[5,] 0 1 0 0
[6,] 0 0 1 0
[7,] 1 0 0 0
[8,] 0 0 0 1
[9,] 0 0 1 0
[10,] 0 0 0 1
Use the names argument to specify the column names of the variables we want dummies for:
students.new1 = dummy.data.frame(students, names = c("State","Gender") , sep = ".")
For example, consider a data set that contains a variable ‘Poll' with values ‘Yes' and ‘No'.
Now, in order to represent the two groups as numeric entries, we can create dummies of the same.
So, the transformed dataset would now have two more additional columns as ‘Poll.1' which would represent ‘yes' type values (would assign 1 to all the data rows that are associated with level yes) and ‘Poll.2' for ‘No' type values.
With the fastDummies package's dummy_cols() function, one can select the variables for which dummies need to be created.
Syntax:
dummy_cols(data, select_columns = 'columns')
Example:
In this example, we have made use of the Bank Loan Defaulter dataset.
You can find the dataset here.
Further, we have made use of dummy_cols() function to create dummy variables for the column ‘ed'.
rm(list = ls())
#install.packages('fastDummies')
library('fastDummies')
dta = read.csv("bank-loan.csv",header=TRUE)
dim(dta)
dum = dummy_cols(dta, select_columns = 'ed')
dim(dum)
Output:
As shown below, the data set initially has 9 columns. After creating the dummy variables, it has 14 columns: each of the 5 levels of the ed variable has been split out into a separate column. Only the rows that belong to a given category are set to 1; all other values are set to zero (0).
> dim(dta)
[1] 850 9
> dim(dum)
[1] 850 14
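If you want fastDummies to drop one level per variable for you (useful to avoid perfectly collinear dummies in regression), dummy_cols() also has a remove_first_dummy argument; a sketch:
# drops the first ed dummy, leaving 4 dummy columns instead of 5
dum2 = dummy_cols(dta, select_columns = 'ed', remove_first_dummy = TRUE)
dim(dum2)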
The dummies package also provides the dummy() function, which enables us to create dummy entries for a selected column.
Example:
In the below example, we have created dummy variables of the column ‘ed' using dummy() function.
rm(list = ls())
library('dummies')
dta = read.csv("bank-loan.csv",header=TRUE)
dim(dta)
dum = dummy(dta$ed)
dim(dum)
Output:
As seen below, each level has been split out into a separate column. Only the data rows that match the particular level are set to 1 in that column; otherwise the value is zero. For example, if a row has the level 'ed1', the corresponding column is set to 1 and the other ed columns are 0.
X and y are 800-by-2 and 800-by-1 data frames respectively, created in such a way that a linear classifier cannot separate the two classes.
Since the data is 2D, we can easily visualize it on a plot.
They are roughly evenly spaced and indeed a line is not a good decision boundary.
x_min <- min(X[,1])-0.2; x_max <- max(X[,1])+0.2
y_min <- min(X[,2])-0.2; y_max <- max(X[,2])+0.2
# lets visualize the data:
ggplot(data) + geom_point(aes(x=x, y=y, color = as.character(label)), size = 2) + theme_bw(base_size = 15) +
xlim(x_min, x_max) + ylim(y_min, y_max) +
ggtitle('Spiral Data Visualization') +
coord_fixed(ratio = 0.8) +
theme(axis.ticks=element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
axis.text=element_blank(), axis.title=element_blank(), legend.position = 'none')
The training function nnet() takes two matrices X and Y and returns a list of 4: W, b and W2, b2 (the weights and biases for each layer). I can specify step_size (the learning rate) and the regularization strength (reg, sometimes symbolized as λ).
The prediction function nnetPred() takes a matrix X (same columns as the training X, but possibly different rows) and the layer parameters as input. Its output is the column index of the maximum score in each row; in this example, that is simply the predicted class label.
Now we can print out the training accuracy.
nnetPred <- function(X, para = list()){
W <- para[[1]]
b <- para[[2]]
W2 <- para[[3]]
b2 <- para[[4]]
N <- nrow(X)
hidden_layer <- pmax(0, X%*% W + matrix(rep(b,N), nrow = N, byrow = T))
hidden_layer <- matrix(hidden_layer, nrow = N)
scores <- hidden_layer%*%W2 + matrix(rep(b2,N), nrow = N, byrow = T)
predicted_class <- apply(scores, 1, which.max)
return(predicted_class)
}
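The definition of the training function nnet() is not reproduced in these notes. A minimal sketch that is consistent with the description above (one ReLU hidden layer, softmax output, full-batch gradient descent; the argument names step_size, reg, h and niteration match the call below, and X is assumed to be an N x D numeric matrix with Y an N x K indicator matrix):
nnet <- function(X, Y, step_size = 0.5, reg = 0.001, h = 10, niteration = 2000) {
  N <- nrow(X); D <- ncol(X); K <- ncol(Y)
  # random initialisation of the weights, zero biases
  W  <- 0.01 * matrix(rnorm(D * h), nrow = D)
  b  <- matrix(0, nrow = 1, ncol = h)
  W2 <- 0.01 * matrix(rnorm(h * K), nrow = h)
  b2 <- matrix(0, nrow = 1, ncol = K)
  for (i in 0:niteration) {
    # forward pass: ReLU hidden layer, then class scores
    hidden_layer <- pmax(0, X %*% W + matrix(rep(b, N), nrow = N, byrow = TRUE))
    hidden_layer <- matrix(hidden_layer, nrow = N)
    scores <- hidden_layer %*% W2 + matrix(rep(b2, N), nrow = N, byrow = TRUE)
    # softmax probabilities and cross-entropy loss with L2 regularization
    exp_scores <- exp(scores)
    probs <- exp_scores / rowSums(exp_scores)
    data_loss <- sum(-log(probs[Y == 1])) / N
    reg_loss <- 0.5 * reg * (sum(W * W) + sum(W2 * W2))
    if (i %% 1000 == 0) print(paste("iteration", i, ": loss", data_loss + reg_loss))
    # backward pass
    dscores <- (probs - Y) / N
    dW2 <- t(hidden_layer) %*% dscores + reg * W2
    db2 <- colSums(dscores)
    dhidden <- dscores %*% t(W2)
    dhidden[hidden_layer <= 0] <- 0
    dW <- t(X) %*% dhidden + reg * W
    db <- colSums(dhidden)
    # gradient descent parameter update
    W  <- W  - step_size * dW
    b  <- b  - step_size * db
    W2 <- W2 - step_size * dW2
    b2 <- b2 - step_size * db2
  }
  list(W, b, W2, b2)
}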
nnet.model <- nnet(X, Y, step_size = 0.4,reg = 0.0002, h=50, niteration = 6000)
## [1] "iteration 0 : loss 1.38628868932674"
## [1] "iteration 1000 : loss 0.967921639616882"
## [1] "iteration 2000 : loss 0.448881467342854"
## [1] "iteration 3000 : loss 0.293036646147359"
## [1] "iteration 4000 : loss 0.244380009480792"
## [1] "iteration 5000 : loss 0.225211501612035"
## [1] "iteration 6000 : loss 0.218468573259166"
predicted_class <- nnetPred(X, nnet.model)
print(paste('training accuracy:',mean(predicted_class == (y))))
## [1] "training accuracy: 0.96375"
The data is scaled by max(X), and it is also split into two sets for cross validation.
Once again, we need to create a Y
matrix with dimension N
by K
.
This time the non-zero index in each row is offset by 1: label 0 will have entry 1 at index 1, label 1 will have entry 1 at index 2, and so on.
In the end, we need to convert it back.
(Another way is to put label 0 at index 10 and use no offset for the remaining labels.)
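A minimal sketch of building such an indicator matrix, assuming integer labels y in 0..(K-1) (e.g. digits 0..9 with K = 10):
N = length(y)
K = 10
Y = matrix(0, nrow = N, ncol = K)
Y[cbind(seq_len(N), y + 1)] = 1   # label 0 -> column 1, label 1 -> column 2, ...
# to convert a predicted column index back to the original label, subtract 1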
library(tidyverse)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
You can compute these counts manually with dplyr::count():
diamonds %>%
count(cut)
#> # A tibble: 5 x 2
#> cut n
#> <ord> <int>
#> 1 Fair 1610
#> 2 Good 4906
#> 3 Very Good 12082
#> 4 Premium 13791
#> 5 Ideal 21551
A variable is continuous if it can take any of an infinite set of ordered values. To examine the distribution of a continuous variable, use a histogram:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
You can compute this by hand by combining dplyr::count() and ggplot2::cut_width():
diamonds %>%
count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#> `cut_width(carat, 0.5)` n
#> <fct> <int>
#> 1 [-0.25,0.25] 785
#> 2 (0.25,0.75] 29498
#> 3 (0.75,1.25] 15977
#> 4 (1.25,1.75] 5313
#> 5 (1.75,2.25] 2002
#> 6 (2.25,2.75] 322
#> # … with 5 more rows
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin.
In the graph above, the tallest bar shows that almost 30,000 observations have a carat
value between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with the binwidth
argument, which is measured in the units of the x
variable.
You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.
For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
smaller = diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
If you wish to overlay multiple histograms in the same plot, use geom_freqpoly() instead of geom_histogram(). geom_freqpoly() performs the same calculation as geom_histogram(), but instead of displaying the counts with bars, it uses lines.
It’s much easier to understand overlapping lines than bars.
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
For example, take the distribution of the y variable from the diamonds dataset.
The only evidence of outliers is the unusually wide limits on the x-axis.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
To make it easy to see the unusual values, we can zoom to small values of the y-axis with coord_cartesian():
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
coord_cartesian()
also has an xlim()
argument for when you need to zoom into the x-axis.
(ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.)
This allows us to see that there are three unusual values: 0, ~30, and ~60.
We pluck them out with dplyr:
unusual = diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
#> # A tibble: 9 x 4
#> price x y z
#> <int> <dbl> <dbl> <dbl>
#> 1 5139 0 0 0
#> 2 6381 0 0 0
#> 3 12800 0 0 0
#> 4 15686 0 0 0
#> 5 18034 0 0 0
#> 6 2130 0 0 0
#> 7 2130 0 0 0
#> 8 2075 5.15 31.8 5.12
#> 9 12210 8.09 58.9 8.06
The y
variable measures one of the three dimensions of these diamonds, in mm.
We know that diamonds can’t have a width of 0mm, so these values must be incorrect.
We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!
It’s good practice to repeat your analysis with and without the outliers.
If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on.
However, if they have a substantial effect on your results, you shouldn’t drop them without justification.
You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
Explore the distribution of each of the x, y, and z variables in diamonds.
What do you learn? Think about a diamond and how you
might decide which dimension is the length, width, and depth.
Explore the distribution of price
.
Do you discover anything unusual
or surprising? (Hint: Carefully think about the binwidth
and make sure
you try a wide range of values.)
How many diamonds are 0.99 carat? How many are 1 carat? What
do you think is the cause of the difference?
Compare and contrast coord_cartesian()
vs xlim()
or ylim()
when
zooming in on a histogram.
What happens if you leave binwidth
unset?
What happens if you try and zoom so only half a bar shows?
If you've encountered unusual values and simply want to move on with your analysis, one option is to drop the entire row with the strange values:
diamonds2 = diamonds %>%
filter(between(y, 3, 20))
I don’t recommend this option because just because one measurement
is invalid, doesn’t mean all the measurements are.
Additionally, if you have low-quality data, by the time you've applied this approach to every variable you might find that you don't have any data left!
Instead, I recommend replacing the unusual values with missing values.
The easiest way to do this is to use mutate()
to replace the variable
with a modified copy.
You can use the ifelse()
function to replace
unusual values with NA
:
diamonds2 = diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
ifelse()
has three arguments.
The first argument test
should be a logical vector.
The result will contain the value of the second argument, yes
, when test
is TRUE
, and the value of the third argument, no
, when it is false.
As an alternative to ifelse(), you can use dplyr::case_when().
case_when()
is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables.
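For example, the ifelse() replacement above could be written with case_when(); a sketch using the same diamonds data:
diamonds2 = diamonds %>%
  mutate(y = case_when(
    y < 3  ~ NA_real_,   # implausibly small widths become missing
    y > 20 ~ NA_real_,   # implausibly large widths become missing
    TRUE   ~ y
  ))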
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing.
It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point()
#> Warning: Removed 9 rows containing missing values (geom_point).
To suppress that warning, set na.rm = TRUE:
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
Other times you want to understand what makes observations with missing values different to observations with recorded values.
For example, in nycflights13::flights
, missing values in the dep_time
variable indicate that the flight was cancelled.
So you might want to compare the scheduled departure times for cancelled and non-cancelled times.
You can do this by making a new variable with is.na()
.
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
What does na.rm = TRUE do in mean() and sum()?
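For reference, na.rm = TRUE tells these functions to drop missing values before computing; for example:
mean(c(1, 2, NA))                # NA
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5
sum(c(1, 2, NA), na.rm = TRUE)   # 3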
The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count.
That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape.
For example, let’s explore how the price of a diamond varies with its quality:
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
ggplot(diamonds) +
geom_bar(mapping = aes(x = cut))
To make the comparison easier, we can display density (the count standardised so that the area under each frequency polygon is one) instead of count:
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
Another way to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. Let's look at the distribution of price by cut using geom_boxplot():
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
cut
is an ordered factor: fair is worse than good, which is worse than very good and so on.
Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display.
One way to do that is with the reorder()
function.
For example, take the class
variable in the mpg
dataset.
You might be interested to know how highway mileage varies across classes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
To make the trend easier to see, we can reorder class based on the median value of hwy:
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
If you have long variable names, geom_boxplot() will work better if you flip it 90°.
You can do that with coord_flip()
.
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
coord_flip()
?
One problem with boxplots is that they were developed in an era of
much smaller datasets and tend to display a prohibitively large
number of “outlying values”.
One approach to remedy this problem is
the letter value plot.
Install the lvplot package, and try using
geom_lv()
to display the distribution of price vs cut.
What
do you learn? How do you interpret the plots?
Compare and contrast geom_violin()
with a facetted geom_histogram()
,
or a coloured geom_freqpoly()
.
What are the pros and cons of each
method?
If you have a small dataset, it’s sometimes useful to use geom_jitter()
to see the relationship between a continuous and categorical variable.
The ggbeeswarm package provides a number of methods similar to
geom_jitter()
.
List them and briefly describe what each one does.
To visualise the covariation between categorical variables, count the number of observations for each combination, for example with the built-in geom_count():
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
Another approach is to compute the count with dplyr:
diamonds %>%
count(color, cut)
#> # A tibble: 35 x 3
#> color cut n
#> <ord> <ord> <int>
#> 1 D Fair 163
#> 2 D Good 662
#> 3 D Very Good 1513
#> 4 D Premium 1603
#> 5 D Ideal 2834
#> 6 E Fair 224
#> # … with 29 more rows
Then visualise with geom_tile()
and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
geom_tile()
together with dplyr to explore how average flight
delays vary by destination and month of year.
What makes the
plot difficult to read? How could you improve it?
Why is it slightly better to use aes(x = color, y = cut)
rather
than aes(x = cut, y = color)
in the example above?
A great way to visualise the covariation between two continuous variables is to draw a scatterplot with geom_point().
You can see covariation as a pattern in the points.
For example, you can see an exponential relationship between the carat size and price of a diamond.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot. One way to fix the problem is to use the alpha aesthetic to add transparency:
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
Another solution is to bin. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension.
Now you’ll learn how to use geom_bin2d()
and geom_hex()
to bin in two dimensions.
geom_bin2d()
and geom_hex()
divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin.
geom_bin2d()
creates rectangular bins.
geom_hex()
creates hexagonal bins.
You will need to install the hexbin package to use geom_hex()
.
ggplot(data = smaller) +
geom_bin2d(mapping = aes(x = carat, y = price))
# install.packages("hexbin")
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
Another option is to bin one continuous variable so it acts like a categorical variable. For example, you could bin carat and then, for each group, display a boxplot:
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
cut_width(x, width)
, as used above, divides x
into bins of width width
.
By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarises a different number of points.
One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE
.
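For example, the binned boxplot above with variable widths:
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), varwidth = TRUE)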
Another approach is to display approximately the same number of points in each bin.
That’s the job of cut_number()
:
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of
the 2d distribution of carat
and price
?
Visualise the distribution of carat, partitioned by price.
How does the price distribution of very large diamonds compare to small
diamonds? Is it as you expect, or does it surprise you?
Combine two of the techniques you’ve learned to visualise the
combined distribution of cut, carat, and price.
Two dimensional plots reveal outliers that are not visible in one
dimensional plots.
For example, some points in the plot below have an
unusual combination of x
and y
values, which makes the points outliers
even though their x
and y
values appear normal when examined separately.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
The following code fits a model that predicts price from carat and then computes the residuals (the difference between the predicted value and the actual value).
The residuals give us a view of the price of the diamond, once the effect of carat has been removed.
library(modelr)
mod = lm(log(price) ~ log(carat), data = diamonds)
diamonds2 = diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)
Typically, the first one or two arguments to a function are so important that you should know them by heart.
The first two arguments to ggplot()
are data
and mapping
, and the first two arguments to aes()
are x
and y
.
In the remainder of the book, we won’t supply those names.
That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what’s different between plots.
That's a really important programming concern that we'll come back to in functions.
Rewriting the previous plot more concisely yields:
ggplot(faithful, aes(eruptions)) +
geom_freqpoly(binwidth = 0.25)
Sometimes we’ll turn the end of a pipeline of data transformation into a plot.
Watch for the transition from %>%
to +
.
I wish this transition wasn’t necessary but unfortunately ggplot2 was created before the pipe was discovered.
diamonds %>%
count(cut, clarity) %>%
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()
Tibbles are a modern take on R's traditional data.frames.
If this chapter leaves you wanting to learn more about tibbles, you might enjoy vignette("tibble")
.
library(tidyverse)
You can coerce a regular data frame to a tibble with as_tibble():
as_tibble(iris)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> # … with 144 more rows
You can create a new tibble from individual vectors with tibble()
.
tibble()
will automatically recycle inputs of length 1, and allows you to refer to variables that you just created, as shown below.
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
#> # A tibble: 5 x 3
#> x y z
#> <int> <dbl> <dbl>
#> 1 1 1 2
#> 2 2 1 5
#> 3 3 1 10
#> 4 4 1 17
#> 5 5 1 26
If you’re already familiar with data.frame()
, note that tibble()
does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
It's possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names. To refer to these variables, you need to surround them with backticks, `:
tb = tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
#> # A tibble: 1 x 3
#> `:)` ` ` `2000`
#> <chr> <chr> <chr>
#> 1 smile space number
You’ll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~
), and entries are separated by commas.
This makes it possible to lay out small amounts of data in easy to read form.
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
#> # A tibble: 2 x 3
#> x y z
#> <chr> <dbl> <dbl>
#> 1 a 2 3.6
#> 2 b 1 8.5
I often add a comment (the line starting with #
), to make it really clear where the header is.
There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting. Tibbles have a refined print method that shows only the first 10 rows and all the columns that fit on screen. In addition to its name, each column reports its type, a feature borrowed from str():
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
#> # A tibble: 1,000 x 5
#> a b c d e
#> <dttm> <date> <int> <dbl> <chr>
#> 1 2020-10-09 13:55:17 2020-10-16 1 0.368 n
#> 2 2020-10-10 08:00:26 2020-10-21 2 0.612 l
#> 3 2020-10-10 02:24:06 2020-10-31 3 0.415 p
#> 4 2020-10-09 15:45:23 2020-10-30 4 0.212 m
#> 5 2020-10-09 12:09:39 2020-10-27 5 0.733 i
#> 6 2020-10-09 23:10:37 2020-10-23 6 0.460 n
#> # … with 994 more rows
Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames.
But sometimes you need more output than the default display.
There are a few options that can help.
First, you can explicitly print()
the data frame and control the number of rows (n
) and the width
of the display.
width = Inf
will display all columns:
nycflights13::flights %>%
print(n = 10, width = Inf)
You can also control the default print behaviour by setting options:
options(tibble.print_max = n, tibble.print_min = m)
: if more than n
rows, print only m
rows.
Use options(tibble.print_min = Inf)
to always show all rows.
Use options(tibble.width = Inf)
to always print all columns, regardless of the width of the screen.
You can see a complete list of options by looking at the package help with package?tibble
.
A final option is to use RStudio’s built-in data viewer to get a scrollable view of the complete dataset.
This is also often useful at the end of a long chain of manipulations.
nycflights13::flights %>%
View()
If you want to pull out a single variable, you need some new tools: $ and [[.
[[
can extract by name or position; $
only extracts by name but is a little less typing.
df = tibble(
x = runif(5),
y = rnorm(5)
)
# Extract by name
df$x
#> [1] 0.73296674 0.23436542 0.66035540 0.03285612 0.46049161
df[["x"]]
#> [1] 0.73296674 0.23436542 0.66035540 0.03285612 0.46049161
# Extract by position
df[[1]]
#> [1] 0.73296674 0.23436542 0.66035540 0.03285612 0.46049161
To use these in a pipe, you’ll need to use the special placeholder .
:
df %>% .$x
#> [1] 0.73296674 0.23436542 0.66035540 0.03285612 0.46049161
df %>% .[["x"]]
#> [1] 0.73296674 0.23436542 0.66035540 0.03285612 0.46049161
Compared to a data.frame
, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
Some older functions don't work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:
class(as.data.frame(tb))
#> [1] "data.frame"
The main reason that some older functions don’t work with tibble is the [
function.
We don’t use [
much in this book because dplyr::filter()
and dplyr::select()
allow you to solve the same problems with clearer code (but you will learn a little about it in vector subsetting).
With base R data frames, [
sometimes returns a data frame, and sometimes returns a vector.
With tibbles, [
always returns another tibble.
How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame.)
Compare and contrast the following operations on a data.frame
and equivalent tibble.
What is different? Why might the default data frame behaviours cause you frustration?
df = data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
df[, c("abc", "xyz")]
If you have the name of a variable stored in an object, e.g. var = "mpg"
,
how can you extract the reference variable from a tibble?
Practice referring to non-syntactic names in the following data frame by:
Extracting the variable called 1
.
Plotting a scatterplot of 1
vs 2
.
Creating a new column called 3
which is 2
divided by 1
.
Renaming the columns to one
, two
and three
.
annoying = tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
What does tibble::enframe()
do? When might you use it?
What option controls how many additional column names are printed at the footer of a tibble?
read.table()
is generally used to read a file in table format and imports data as a data frame.
Several variants of this function are available for importing different file formats. Here, however, we will use the readLines function.
This function takes a file (or URL) as input and returns a vector containing as many elements as the number of lines in the file.
The readLines
function simply extracts the text from its input source and returns each line as a character string.
The n=
argument is useful to read a limited number (subset) of lines from the input source (Its default value is -1, which reads all lines from the input source).
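For example (the file name here is illustrative):
first_lines = readLines("mydata.txt", n = 5)   # read only the first five lines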
When using the filename in this function's argument, R assumes the file is in your current working directory (you can use the getwd()
function in R console to find your current working directory).
You can also choose the input file interactively, using the file.choose()
function within the argument.
The next step is to load that Vector as a Corpus.
In R, a Corpus is a collection of text document(s) to apply text mining or NLP routines on.
Details of using the readLines
function are sourced from: https://www.stat.berkeley.edu/~spector/s133/Read.html .
In your R script, add the following code to load the data into a corpus.
# Read the text file from local machine , choose file interactively
choose.files(default = "", caption = "Select files",
multi = TRUE, filters = Filters, index = nrow(Filters))
text = readLines(file.choose())
# Load the data as a corpus
TextDoc = Corpus(VectorSource(text))
Upon running this, you will be prompted to select the input file.
Navigate to your file and click Open as shown in Figure 2.
The first cleaning step is to use the tm_map() function to replace special characters like /, @ and | with a space.
The next step is to remove the unnecessary whitespace and convert the text to lower case.
Then remove the stopwords.
They are the most commonly occurring words in a language and have very little value in terms of gaining useful information.
They should be removed before performing further analysis.
Examples of stopwords in English are "the, is, at, on".
There is no single universal list of stop words used by all NLP tools.
The stopwords argument used with the tm_map() function supports several languages, including English, French, German, Italian, and Spanish.
Please note the language names are case sensitive.
I will also demonstrate how to add your own list of stopwords, which is useful in this Team Health example for removing non-default stop words like "team", "company", "health".
Next, remove numbers and punctuation.
The last step is text stemming.
It is the process of reducing the word to its root form.
The stemming process simplifies the word to its common origin.
For example, the stemming process reduces the words "fishing", "fished" and "fisher" to its stem "fish".
Please note stemming uses the SnowballC package.
(You may want to skip the text stemming step if your users indicate a preference to see the original "unstemmed" words in the word cloud plot)
In your R script, add the following code and run it to clean up the text data.
#Replacing "/", "@" and "|" with space
toSpace = content_transformer(function (x , pattern ) gsub(pattern, " ", x))
TextDoc = tm_map(TextDoc, toSpace, "/")
TextDoc = tm_map(TextDoc, toSpace, "@")
TextDoc = tm_map(TextDoc, toSpace, "\\|")
# Convert the text to lower case
TextDoc = tm_map(TextDoc, content_transformer(tolower))
# Remove numbers
TextDoc = tm_map(TextDoc, removeNumbers)
# Remove english common stopwords
TextDoc = tm_map(TextDoc, removeWords, stopwords("english"))
# Remove your own stop word
# specify your custom stopwords as a character vector
TextDoc = tm_map(TextDoc, removeWords, c("s", "company", "team"))
# Remove punctuations
TextDoc = tm_map(TextDoc, removePunctuation)
# Eliminate extra white spaces
TextDoc = tm_map(TextDoc, stripWhitespace)
# Text stemming - which reduces words to their root form
TextDoc = tm_map(TextDoc, stemDocument)
Using TermDocumentMatrix() from the text mining (tm) package, you can build a term-document matrix – a table containing the frequency of words.
In your R script, add the following code and run it to see the top 5 most frequently found words in your text.
# Build a term-document matrix
TextDoc_dtm = TermDocumentMatrix(TextDoc)
dtm_m = as.matrix(TextDoc_dtm)
# Sort by decreasing value of frequency
dtm_v = sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d = data.frame(word = names(dtm_v),freq=dtm_v)
# Display the top 5 most frequent words
head(dtm_d, 5)
The following table of word frequency is the expected output of the head
command on RStudio Console.
Next, you can look at word associations with findAssocs(). (Note: corlimit = 0.25 is the lower limit/threshold I have set; you can set it lower to see more words, or higher to see fewer.)
The output indicates that "integr" (the root of the word "integrity") and "synergi" (the root of the words "synergy", "synergies", etc.) occur 28% of the time with the word "good".
You can interpret this as the context around the most frequently occurring word ("good") is positive.
Similarly, the root of the word "together" is highly correlated with the word "work".
This indicates that most responses are saying that teams "work together" and can be interpreted in a positive context.
You can modify the above script to find terms associated with words that occur at least 50 times or more, instead of having to hard code the terms in your script.
# Find associations for words that occur at least 50 times
findAssocs(TextDoc_dtm, terms = findFreqTerms(TextDoc_dtm, lowfreq = 50), corlimit = 0.25)
The get_sentiment function accepts two arguments: a character vector (of sentences or words) and a method.
The selected method determines which of the four available sentiment extraction methods will be used.
The four methods are syuzhet
(this is the default), bing
, afinn
and nrc
.
Each method uses a different scale and hence returns slightly different results.
Please note that the outcome of the nrc method is more than just a numeric score; it requires additional interpretation and is out of scope for this article.
The descriptions of the get_sentiment
function has been sourced from : https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?
Add the following code to the R script and run it.
# regular sentiment score using get_sentiment() function and method of your choice
# please note that different methods may have different scales
syuzhet_vector = get_sentiment(text, method="syuzhet")
# see the first row of the vector
head(syuzhet_vector)
# see summary statistics of the vector
summary(syuzhet_vector)
Your results should look similar to Figure 7.
The scale for sentiment scores using the syuzhet method is decimal and ranges from -1 (most negative) to +1 (most positive).
Note that the summary statistics of the syuzhet vector show a median value of 1.6, which is above zero and can be interpreted as meaning that the overall average sentiment across all the responses is positive.
Next, run the same analysis for the remaining two methods and inspect their respective vectors.
Add the following code to the R script and run it.
# bing
bing_vector = get_sentiment(text, method="bing")
head(bing_vector)
summary(bing_vector)
#affin
afinn_vector = get_sentiment(text, method="afinn")
head(afinn_vector)
summary(afinn_vector)
Your results should resemble Figure 8.
bing
and afinn
vectors also show that the median value of the sentiment scores is above 0, which can be interpreted as meaning that the overall average sentiment across all the responses is positive.
Because these different methods use different scales, it's better to convert their output to a common scale before comparing them.
This basic scale conversion can be done easily using R's built-in sign function, which converts all positive numbers to 1, all negative numbers to -1, and leaves zeros as 0.
Add the following code to your R script and run it.
#compare the first row of each vector using sign function
rbind(
sign(head(syuzhet_vector)),
sign(head(bing_vector)),
sign(head(afinn_vector))
)
Figure 9 shows the results.
The next step is to use the get_nrc_sentiment function, which returns a data frame with each row representing a sentence from the original file.
The data frame has ten columns (one column for each of the eight emotions, one column for positive sentiment valence and one for negative sentiment valence).
The data in the columns (anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive) can be accessed individually or in sets.
The definition of get_nrc_sentiment
has been sourced from: https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?
Add the following line to your R script and run it, to see the data frame generated from the previous execution of the get_nrc_sentiment
function.
# run nrc sentiment analysis to return data frame with each row classified as one of the following
# emotions, rather than a score:
# anger, anticipation, disgust, fear, joy, sadness, surprise, trust
# It also counts the number of positive and negative emotions found in each row
d=get_nrc_sentiment(text)
# head(d,10) - to see top 10 lines of the get_nrc_sentiment dataframe
head (d,10)
The results should look like Figure 10.
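One simple way to inspect these counts is a bar plot of the column totals; a sketch, assuming d is the data frame returned above (its first eight columns are the emotion counts):
# total count of words associated with each of the eight emotions
barplot(colSums(d[, 1:8]), las = 2, col = rainbow(8),
        main = "Emotion counts across all responses")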
ws.send(str0)
(“ws” stands for “websocket”): this command lives in JS, and it sends a string to R every time it’s called.
The common usage is ws.send(JSON.stringify(my_complex_data)), where we convert the data into a JSON string using JSON.stringify; this works for any JS object or array.
ws.onmessage = function(msg) { ... }: this handler lives in JS.
It continuously monitors whether R has sent JS a message, and runs the code in its body whenever a message arrives.
The message contains many things other than your data, and we can use JSON.parse(msg.data)
to extract the data from it.
Your R function (well, clearly) lives in R.
It describes what R should do when JS sends R some data.
The input is assumed to be a named list, and the output must also be a named list.
The following is a common pattern to use.
It is very flexible; in fact, all examples in the package are created under this framework.
Additional patterns can be created, but we shall leave that to another tutorial.
ws.send("hi");
ws.onmessage = function(msg) {
var r_data = JSON.parse(msg.data);
console.log(r_data['r_msg']); // this prints the message in JS console
}
R
my_r_function = function(msg) {
print(msg) # this will print the message in R console
list(r_msg = msg) # return the message to JS
}
ws.send("JSON.stringify({x:3, y:4})"); // sends a named list in JS to R
ws.onmessage = function(msg) {
var r_data = JSON.parse(msg.data);
console.log(r_data['r_msg'], r_data['z']); // this prints the message in JS console
}
R
my_r_function = function(msg) {
print(msg) # this will print the message in R console
print(msg$x) # expects 3
print(msg$y) # expects 4
list(r_msg = msg, z = rnorm(1)) # return the message to JS
}
A slider has three key attributes: min, max and oninput. min and max refer to the minimum and maximum values the slider can take; oninput refers to a function that describes the desired behaviour when the slider is moved.
In html, most things are just containers with different defaults.
Containers are referred to as <div>
elements.
With the jsReact package, the code can be developed entirely in R.
Though as you get more experienced and the app gets more complicated, it is preferable to create the html, js and R files separately.
(Side note: this is where beginners, e.g. me, got tripped up, and this is partly why I created this package.)
We create an html page (create_html), add a title (add_title), a slider (add_slider) and another title, then add a container (add_div).
We give the container an id
as later we want to refer to it and update its content.
show_value(value) takes the slider value and sends it to R.
ws.onmessage(msg) takes a message from R and displays it on the <div>
container we created previously.
document.getElementById("_ID_")
is the easiest way to refer to a particular element in a html file.
We will use that quite often.
In JS, both function NAME(ARG) {...}
and NAME = function(ARG) {...}
are valid ways to create functions.
write_html_to_file
, create_app
and start_app
are three functions from the jsReact
package that help you build and run an app.
write_html_to_file
writes the html object we created in the previous section to hard-drive.
This is not needed if you supply your own html file.
create_app
links the html and the R function you provided (using the model presented in Diagram 2) and creates an app object.
insert_socket
is by default TRUE
; you could set it to FALSE
if you are not doing any R processing.
start_app
launches an R server to serve your website.
By default, the address is set to “localhost:9454”, and the website is shown in your viewer.
You can use the option browser = "browser"
to open the app with your browser instead.
jsReact
sets up a simple framework for this, and the three key functions to know are:
ws.send(str0)
, ws.onmessage(msg)
and your_r_function(named_list0) { named_list1 }
.
Along the way, we have also learnt about some useful functions for apps development:
for building the html interface, we have jsReact::add_title, jsReact::add_slider, jsReact::add_div
;
for JavaScript, we have document.getElementById('_ID_')
;
for running the app, we have jsReact::write_html_to_file(), jsReact::create_app(), jsReact::start_app()
.
I hope you successfully created an app in R, and I shall see you in the next tutorial!
https://kcf-jackson.github.io/jsReact/articles/index.html
#install apps: 'stocks', 'markdownapp' and 'nabel'
library(devtools)
install_github(c("stocks", "markdownapp", "nabel"), username="opencpu")
By convention, the web pages are placed in the /inst/www/
directory in the R package. To use an app locally, simply start the opencpu single-user server:
library(opencpu)
opencpu$browse("/library/stocks/www")
opencpu$browse("/library/nabel/www")
The same apps can be installed and accessed on a cloud server by navigating to /ocpu/library/[pkgname]/www/
:
https://cloud.opencpu.org/ocpu/library/stocks/www
https://cloud.opencpu.org/ocpu/library/markdownapp/www
https://cloud.opencpu.org/ocpu/library/nabel/www
One app in the public repository is called appdemo. This application contains some minimal examples to demonstrate basic functionality and help you get started with building apps using opencpu.js
.
opencpu.js
is available from github: https://github.com/jeroenooms/opencpu.js. The jQuery library must be included in your web page <script src="js/jquery.js"></script>
<script src="js/opencpu.js"></script>
<script src="js/app.js"></script>
It is recommended to ship a copy of the opencpu.js
library with your application or website (as opposed to hotlinking it from some public location). This because the JavaScript library is in active development (0.x version) and the latest version might (radically) change from time to time. Shipping a version of opencpu.js
with your app prevents it from breaking with upstream changes in the library. Also it is practical both for development and deployment if your app works offline.
Most functions in opencpu.js call out to $.ajax
and return the jqXHR object. Thereby you (the programmer) have full control over the request. Note that the A in Ajax stands for asynchronous: requests do not block, and you handle their results with the jqXHR.done, jqXHR.fail and jqXHR.always methods (see jqXHR).
You can also use the opencpu.js library from an external site that is not hosted on OpenCPU. In this case, we must specify the external OpenCPU server using ocpu.seturl()
:
//set page to communicate to with "mypackage" on server below
ocpu.seturl("//cloud.opencpu.org/ocpu/library/mypackage/R")
Cross domain requests are convenient for development and illustrative examples, see e.g: jsfiddle examples. However, when possible it is still recommended to include a copy of your web pages in the R package for every release of your app. That way you get a nice redistributable app and there is no ambiguity over version compatibility of the front-end (web pages) and back-end (R functions).
Also note that even when using CORS, the opencpu.js
library still requires that all R functions used by a certain application are contained in a single R package. This is on purpose, to force you to keep things organized. If you would like to use functionality from various R packages, you need to create an R package that includes some wrapper functions and formally declares its dependencies on the other packages. Writing an R package is really easy these days, so this should be no problem.
The opencpu.js library implements a jQuery plugin called rplot
which makes it easy to embed live plots in your webpage. For example, consider the R function smoothplot in the stocks package:
#The R function
function(ticker = "GOOG", from = "2013-01-01", to=Sys.time()){
mydata = yahoodata(ticker, from, to);
qplot(Date, Close, data = mydata, geom = c("line", "smooth"));
}
It defines three arguments, each of which is optional: ticker
, from
, and to
. These are the arguments that we can pass from the opencpu.js
client app. In this example, we only pass the first two arguments.
//JavaScript client code
var ticker = $("#ticker").val();
var req = $("#plotdiv").rplot("smoothplot", {
ticker : ticker,
from : "2013-01-01"
})
//optional: add custom callbacks
req.fail(function(){
alert("R returned an error: " + req.responseText);
});
This creates a plot widget in the #plotdiv
element (a div in your html). It calls the R function smoothplot
and passes argument values as specified, and displays the generated plot including PNG, PDF, and SVG export links. The final lines specify an error handler, which is optional but recommended. Have a look at the jsfiddle, or the full stocks app to see all of this in action!
The ocpu.rpc function calls an R function on the server, transferring both the arguments and the return value as JSON:
var mydata = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15];
//call R function: stats::sd(x=data)
var req = ocpu.rpc("sd",{
x : mydata
}, function(output){
alert("Standard Deviation equals: " + output);
});
//optional
req.fail(function(){
alert("R returned an error: " + req.responseText);
});
See it in action here. When calling ocpu.rpc
, the arguments as well as return value are transferred using JSON. On the R side, the jsonlite
package is used to convert between JSON and R objects. Hence, the above code is equivalent to the R code below. The output
object is a JSON string which is sent back to the client and parsed by JavaScript.
library(jsonlite)
#parse input from JSON into R
jsoninput = '{"x" : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]}'
fnargs = fromJSON(jsoninput)
#the actual function call
result = do.call(stats::sd, fnargs)
#convert result back to JSON
jsonoutput = toJSON(result)
Another example is available here: http://jsfiddle.net/opencpu/9nVd5/. This example calls the lowess
function in R, to smooth a bunch of values. This can be useful to remove outliers from noisy data. One difference with the previous example, is that lowess
does not return a single value, but a list with two vectors: x
and y
. See the lowess help page for more detail.
The ocpu.call function is the stateful equivalent of ocpu.rpc. It has the same arguments, but the difference is in the callback function. The ocpu.rpc
callback argument is a JSON object containing the data returned by the R function. The ocpu.call
callback argument is a Session object. The session object is a javascript class that stores the session ID; it does not contain any actual data. However, from the session object, we can asynchronously retrieve data, plots, files, stdout, etc. See this jsfiddle in action.
//toy example
var req = ocpu.call("rnorm", {n: 100}, function(session){
//read the session properties (just for fun)
$("#key").text(session.getKey());
$("#location").text(session.getLoc());
//retrieve session console (stdout) async
session.getConsole(function(outtxt){
$("#output").text(outtxt);
});
//retrieve the returned object async
session.getObject(function(data){
//data is the object returned by the R function
alert("Array of length " + data.length + ".\nFirst few values:" + data.slice(0,3));
});
})
We can also use the Session object to pass the R value returned by the function call as an argument to a subsequent function call, without ever retrieving the object. All state in OpenCPU is managed by controlling R objects in sessions on the server. This jsfiddle example continues the previous example and calculates the variance of the vector generated before, by passing the session object as an argument. A simpler example is shown here:
var req1 = ocpu.call("rnorm", {n: 100}, function(session1){
var req2 = ocpu.call("var", {x : session1}, function(session2){
session2.getObject(function(data){
alert("Variance equals: " + data);
});
});
});
In opencpu.js there are 4 types of arguments: a basic JavaScript value/object (automatically converted to R via JSON), a session object (representing an R value from a previous function call), a file, and a code snippet. We have already seen examples of the first two argument types. Below is an example of using a file as an argument; the file is automatically uploaded and used to call the R function. See it in action using this jsfiddle.
//This must be HTML5 <input type="file">
var myfile = $("#csvfile")[0].files[0];
var header = true;
//call read.csv in R. File is automatically uploaded
var req = ocpu.call("read.csv", {
"file" : myfile,
"header" : myheader
}, function(session){
//use output here
});
The final type of argument is a code snippet. This injects raw R code into the function call. It is usually recommended to use this type only when really needed, because it requires the client to understand R code, which kills interoperability. But this argument type is useful for example in applications that explicitly let the user do some R coding. See here for a basic example:
//create snippet argument
var x = new ocpu.Snippet($("#input").val());
//perform the request
var req = ocpu.call("mean", {
"x" : x
}, function(session){
//use output here
});
One interesting special case is using a code Snippet when calling the identity
function in R. This comes down to executing a raw block of code in a session. Try this jsfiddle to see this in action.
The return value of ocpu.call()
is always a session object. This object does not contain actual data; it just holds a session ID that can be used to retrieve output from the server. Session objects expose methods such as getKey(), getLoc(), getConsole() and getObject(), as used in the examples above.
Package description | Version | Requires |
---|---|---|
A portable solution for sending emails from R (contains a simple SMTP client). | 1.2-1 | R 3.0.0+ |
An easy-to-use package for sending emails from R. | 1.0 | R 2.0.0+ |
A wrapper around Apache Commons Email for sending emails from R. | 0.6 | N/A |
A package for creating and sending HTML emails from R through an SMTP server or the Mailgun API. | 0.2.1 | R 3.2.1+ |
A wrapper around Blat – a Windows command-line utility that sends emails via SMTP or posts to Usenet via NNTP (http://www.blat.net/). | 1.0.1 | N/A |
A package for sending emails via Gmail’s RESTful API. | 1.0.0 | R 3.0.0+ |
A package for sending emails via the Mailgun API. | 0.1.2 | N/A |
A package for sending emails from R via an SMTP server. | 0.1.1 | N/A |
A Windows-specific package for sending emails in R from the Outlook app. | 0.94-0 | N/A |
A package to automate email sending from R via Gmail (based on the gmailr package). | N/A | N/A |
install.packages("sendmailR",repos="http://cran.r-project.org")
Next, we create a data structure called Server, which is a map with a single key value pair – key: smtpServer
, value: smtp.example.io
:
Server=list(smtpServer= "smtp.example.io")
Now, let’s write a few R lines to send a simple email:
library(sendmailR)
from = sprintf("<user@sender.com>","The Sender") # the sender’s name is an optional value
to = sprintf("<user@recipient.com>")
subject = "Test email subject"
body = "Test email body"
sendmail(from,to,subject,body,control=list(smtpServer= "smtp.example.io"))
The following code sample is for sending an email to multiple recipients:
from = sprintf("<user@sender.com>","The Sender")
to = c("<user@recipient.com>", "<user2@recipient.com>", "<user3@recipient.com>")
subject = "Test email subject"
body = "Test email body"
sapply(to, function(x) sendmail(from, to = x, subject, body, control = list(smtpServer = "smtp.example.io")))
And now, let’s send an email with an attachment as well:
from = sprintf("<user@sender.com>","The Sender")
to = sprintf("<user@recipient.com>")
subject = "Test email subject"
body = "Test email body"
attachmentPath = "C:/.../Attachment.png"
attachmentName = "Attachment.png"
attachmentObject = mime_part(x = attachmentPath, name = attachmentName)
bodyWithAttachment = list(body, attachmentObject)
sendmail(from, to, subject, bodyWithAttachment, control = list(smtpServer = "smtp.example.io"))
NB: To send emails with the mailR package, first install it: install.packages("mailR",repos="http://cran.r-project.org")
Now, we can use the Mailtrap SMTP server that requires authentication to send an email:
library(mailR)
send.mail(from = "user@sender.com",
to = "user@recipient.com",
subject = "Test email subject",
body = "Test emails body",
smtp = list(host.name = "smtp.mailtrap.io", port = 25,
user.name = "********",
passwd = "******", ssl = TRUE),
authenticate = TRUE,
send = TRUE)
Insert your Mailtrap credentials (user.name
and passwd
) and pick any of the SMTP ports: 25, 465, 587, or 2525.
Here is how to send an email to multiple recipients:
library(mailR)
send.mail(from = "user@sender.com",
to = c("Recipient 1 <user1@recipient.com>", "Recipient 2 <user@recipient.com>"),
cc = c("CC Recipient <cc.user@recipient.com>"),
bcc = c("BCC Recipient <bcc.user@recipient.com>"),
replyTo = c("Reply to Recipient <reply-to@recipient.com>"),
subject = "Test email subject",
body = "Test emails body",
smtp = list(host.name = "smtp.mailtrap.io", port = 25,
user.name = "********",
passwd = "******", ssl = TRUE),
authenticate = TRUE,
send = TRUE)
Now, let’s add a few attachments to the email:
library(mailR)
send.mail(from = "user@sender.com",
to = c("Recipient 1 <user1@recipient.com>", "Recipient 2 <user@recipient.com>"),
cc = c("CC Recipient <cc.user@recipient.com>"),
bcc = c("BCC Recipient <bcc.user@recipient.com>"),
replyTo = c("Reply to Recipient <reply-to@recipient.com>"),
subject = "Test email subject",
body = "Test emails body",
smtp = list(host.name = "smtp.mailtrap.io", port = 25,
user.name = "********",
passwd = "******", ssl = TRUE),
authenticate = TRUE,
send = TRUE,
attach.files = c("./attachment.png", "https://dl.dropboxusercontent.com/u/123456/Attachment.pdf"),
file.names = c("Attachment.png", "Attachment.pdf"), #this is an optional parameter
file.descriptions = c("Description for Attachment.png", "Description for Attachment.pdf")) #this is an optional parameter
Finally, let’s send an HTML email from R:
library(mailR)
send.mail(from = "user@sender.com",
to = "user@recipient.com",
subject = "Test email subject",
body = "<html>Test <k>email</k> body</html>",
smtp = list(host.name = "smtp.mailtrap.io", port = 25,
user.name = "********",
passwd = "******", ssl = TRUE),
authenticate = TRUE,
send = TRUE)
You can also point to an HTML template by specifying its location, as follows:
body = "./Template.html",
install.packages("blastula",repos="http://cran.r-project.org")
and load it:
library(blastula)
Compose an email using Markdown formatting.
You can also employ the following string objects:
add_readable_time
– creates a nicely formatted date/time string for the current time
add_image
– transforms an image to an HTML string object
For example,
date_time = add_readable_time() # => "Thursday, November 28, 2019 at 4:34 PM (CET)"
img_file_path = "./attachment.png"
img_string = add_image(file = img_file_path) # => "<img cid=\"mtwhxvdnojpr__attachment.png\" src=\"data:image/png;base64,iVBORw0KG...g==\" width=\"520\" alt=\"\"/>\n"
When composing an email, you will need the c()
function to combine the strings in the email body and footer.
You can use three main arguments: body
, header
, and footer
.
If you have Markdown and HTML fragments in the email body, use the md()
function.
Here is what we’ve got:
library(blastula)
email =
compose_email(
body = md(
c("<html>Test <k>email</k> body</html>", img_string )
),
footer = md(
c("Test email footer", date_time, "." )
)
)
Preview the email using attach_connect_email(email = email)
Now, let’s send the email.
This can be done with the smtp_send()
function through one of the following ways:
Providing the SMTP credentials directly via the creds()
helper:
smtp_send(
email = email,
from = "user@sender.com",
to = "user@recipient.com",
credentials = creds(
host = "smtp.mailtrap.io",
port = 25,
user = "********"
)
)
Using a credentials key that you can generate with the create_smtp_creds_key()
function:
create_smtp_creds_key(
id = "mailtrap",
host = "smtp.mailtrap.io",
port = 25,
user = "********"
)
smtp_send(
email = email,
from = "user@sender.com",
to = "user@recipient.com",
credentials = creds_key("mailtrap")
)
Using a credentials file that you can generate with the create_smtp_creds_file()
function:
create_smtp_creds_file(
file = "mailtrap_file",
host = "smtp.mailtrap.io",
port = 25,
user = "********"
)
smtp_send(
email = email,
from = "user@sender.com",
to = "user@recipient.com",
credentials = creds_file("mailtrap_file")
)
NB: There is no way to programmatically specify a password for authentication.
The user will be prompted to provide one during code execution.
install.packages("remotes")
library(remotes)
remotes::install_github("datawookie/emayili")
Emayili has two classes at the core:
envelope
– to create emails
server
– to communicate with the SMTP server
Let’s create an email first:
library(emayili)
email = envelope() %>%
from("user@sender.com") %>%
to("user@recipient.com") %>%
subject("Test email subject") %>%
body("Test email body")
Now, configure the SMTP server:
smtp = server(host = "smtp.mailtrap.io",
port = 25,
username = "********",
password = "*********")
To send the email to multiple recipients, enhance your emails with Cc, Bcc, and Reply-To header fields as follows:
email = envelope() %>%
from("user@sender.com") %>%
to(c("Recipient 1 <user1@recipient.com>", "Recipient 2 <user@recipient.com>")) %>%
cc("cc@recipient.com") %>%
bcc("bcc@recipient.com") %>%
reply("reply-to@recipient.com") %>%
subject("Test email subject") %>%
body("Test email body")
You can also use the attachment()
method to add attachments to your email:
email = email %>% attachment(c("./attachment.png", "https://dl.dropboxusercontent.com/u/123456/Attachment.pdf"))
Finally, you can send your email with:
smtp(email, verbose = TRUE)
install.packages("gmailr", repos="http://cran.r-project.org")
and load in your R script:
library(gmailr)
Now, you can use your downloaded JSON credentials file.
Employ the use_secret_file()
function.
For example, if your JSON file is named GmailCredentials.json, this will look, as follows:
use_secret_file("GmailCredentials.json")
After that, create a MIME email object:
email = gm_mime() %>%
gm_to("user@recipient.com") %>%
gm_from("user@sender.com") %>%
gm_subject("Test email subject") %>%
gm_text_body("Test email body")
To create an HTML email, use markup to shape your HTML string, for example:
email = gm_mime() %>%
gm_to("user@recipient.com") %>%
gm_from("user@sender.com") %>%
gm_subject("Test email subject") %>%
gm_html_body("<html>Test <k>email</k> body</html>")
To add an attachment, you can:
use the gm_attach_file()
function, if the attachment has not been loaded into R.
You can specify the MIME type yourself using the type parameter or let it be automatically guessed by mime::guess_type
email = gm_mime() %>%
gm_to("user@recipient.com") %>%
gm_from("user@sender.com") %>%
gm_subject("Test email subject") %>%
gm_html_body("<html>Test <k>email</k> body</html>") %>%
gm_attach_file("Attachment.png")
use the gm_attach_part()
function to attach binary data to your message:
email = gm_mime() %>%
gm_to("user@recipient.com") %>%
gm_from("user@sender.com") %>%
gm_subject("Test email subject") %>%
gm_html_body("<html>Test <k>email</k> body</html>") %>%
gm_attach_part(part = charToRaw("attach me!"), name = "please")
If you need to include an image into HTML, you can use the <img src="cid:xy">
tag to reference the image.
First create a plot to send, and save it to AttachImage.png:
# 1. Use the built-in mtcars data set
my_data = mtcars
# 2. Open the file for writing
png("AttachImage.png", width = 350, height = 350)
# 3. Create the plot
plot(x = my_data$wt, y = my_data$mpg,
     pch = 16, frame = FALSE,
     xlab = "wt", ylab = "mpg", col = "#2E9FDF")
# 4. Close the file
dev.off()
Now, create an HTML email that references the plot as foobar
:
email = gm_mime() %>%
gm_to("user@recipient.com") %>%
gm_from("user@sender.com") %>%
gm_subject("Test email subject") %>%
gm_html_body(
'<html>Test <k>email</k> body</html>
<br><img src="cid:foobar">'
) %>%
gm_attach_file("AttachImage.png", id = "foobar")
Finally, you can send your email:
gm_send_message(email)
install.packages("RDCOMClient")
via devtools:
devtools::install_github("omegahat/RDCOMClient")
from the Windows command line:
R CMD INSTALL RDCOMClient
Warning: if you receive a message like “package ‘RDCOMClient’ is not available (for R version 3.5.1)” during installation from CRAN, try installing it from the omegahat repository instead: install.packages("RDCOMClient", repos = "http://www.omegahat.net/R")
Load the package, open Outlook, and create a simple email:
library(RDCOMClient)
Outlook = COMCreate("Outlook.Application")
Email = Outlook$CreateItem(0)
Email[["to"]] = "user@recipient.com"
Email[["subject"]] = "Test email subject"
Email[["body"]] = "Test email body"
If you need to change the default From:
field and send from a secondary mailbox, use:
Email[["SentOnBehalfOfName"]] = "user@sender.com"
Here is how you can specify multiple recipients, as well as Cc and Bcc headers:
Email[["to"]] = "user1@recipient.com, user2@recipient.com"
Email[["cc"]] = "cc.user@recipient.com"
Email[["bcc"]] = "bcc.user@recipient.com"
To create an HTML email, use [["htmlbody"]]
.
You can simply add your HTML in the R code as follows:
library(RDCOMClient)
Outlook = COMCreate("Outlook.Application")
Email = Outlook$CreateItem(0)
Email[["to"]] = "user@recipient.com"
Email[["subject"]] = "Test email subject"
Email[["htmlbody"]] =
"<html>Test <k>email</k> body</html>"
Let’s also add an attachment:
library(RDCOMClient)
Outlook = COMCreate("Outlook.Application")
Email = Outlook$CreateItem(0)
Email[["to"]] = "user@recipient.com"
Email[["subject"]] = "Test email subject"
Email[["htmlbody"]] =
"<html>Test <k>email</k> body</html>"
Email[["attachments"]]$Add("C:/.../Attachment.png")
Now, you can send the email:
Email$Send()
lastname,firstname,win_amount,email_address
SMITH,JOHN,1234,johnsmith@winner.com
LOCKWOOD,JANE,1234,janelockwood24@example.com
Now, let’s go through the main steps to create an R script for bulk emails.
Load the packages and files we need:
suppressPackageStartupMessages(library(gmailr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(purrr))
library(readr) # => if you don’t have it, run: install.packages("readr", repos="http://cran.r-project.org")
my_dat = read_csv("Variables.csv")
Create a data frame that will insert variables from the file into the email:
this_hw = "Lottery Winners"
email_sender = 'Best Lottery Ever <info@best-lottery-ever.com>'
optional_bcc = 'Anonymous <bcc@example.com>'
body = "Hi, %s.
Your lottery win is %s.
Thanks for betting with us!
"
edat = my_dat %>%
mutate(
To = sprintf('%s <%s>', firstname, email_address),
Bcc = optional_bcc,
From = email_sender,
Subject = sprintf('Lottery win for %s', win_amount),
body = sprintf(body, firstname, win_amount)) %>%
select(To, Bcc, From, Subject, body)
write_csv(edat, "data-frame.csv")
The data frame will be saved to data-frame.csv.
This will provide an easy-to-read record of the composed emails.
Now, convert each row of the data frame into a MIME object using the gmailr::mime()
function.
After that, purrr::pmap()
generates the list of MIME objects, one per row of the input data frame:
emails = edat %>%
pmap(mime)
str(emails, max.level = 2, list.len = 2)
If you prefer plyr (installed with install.packages("plyr")), you can do this as follows:
emails = plyr::dlply(edat, ~ To, function(x) mime(
To = x$To,
Bcc = x$Bcc,
From = x$From,
Subject = x$Subject,
body = x$body))
Specify your JSON credentials file:
use_secret_file("GmailCredentials.json")
And send emails with purrr::safely()
.
This will protect your bulk emails from failures in the middle:
safe_send_message = safely(send_message)
sent_mail = emails %>%
map(safe_send_message)
saveRDS(sent_mail,
paste(gsub("\\s+", "_", this_hw), "sent-emails.rds", sep = "_"))
Flag the recipients for which sending failed (TRUE indicates an error):
errors = sent_mail %>%
transpose() %>%
.$error %>%
map_lgl(Negate(is.null))
Take a look at the full code now:
suppressPackageStartupMessages(library(gmailr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(purrr))
library(readr) # => if you don’t have it, run: install.packages("readr", repos="http://cran.r-project.org")
my_dat = read_csv("Variables.csv")
this_hw = "Lottery Winners"
email_sender = 'Best Lottery Ever <info@best-lottery-ever.com>'
optional_bcc = 'Anonymous <bcc@example.com>'
body = "Hi, %s.
Your lottery win is %s.
Thanks for betting with us!
"
edat = my_dat %>%
mutate(
To = sprintf('%s <%s>', firstname, email_address),
Bcc = optional_bcc,
From = email_sender,
Subject = sprintf('Lottery win for %s', win_amount),
body = sprintf(body, firstname, win_amount)) %>%
select(To, Bcc, From, Subject, body)
write_csv(edat, "data-frame.csv")
emails = edat %>%
pmap(mime)
str(emails, max.level = 2, list.len = 2)
use_secret_file("GmailCredentials.json")
safe_send_message = safely(send_message)
sent_mail = emails %>%
map(safe_send_message)
saveRDS(sent_mail,
paste(gsub("\\s+", "_", this_hw), "sent-emails.rds", sep = "_"))
errors = sent_mail %>%
transpose() %>%
.$error %>%
map_lgl(Negate(is.null))
lastname; firstname; win_amount; email_address
SMITH; JOHN; 1234; johnsmith@winner.com
LOCKWOOD; JANE; 1234; janelockwood24@example.com
What you need to do next:
Build the HTML email body for a given recipient using the message_text
function:
message_text = function(x) sprintf('Hello %s %s!\nCongratulations on your win.\nYour prize is XXX.\nBet with the Best Lottery Ever!', x$firstname, x$lastname)
Load the package and read in the mail list:
library(mailR)
mail_list = read.csv2("Variables.csv",as.is=TRUE)
Values in the Variables.csv should be separated with a semicolon (;
).
You can configure settings to read the data frame using the read.table
or read.csv
functions.
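For instance, a minimal sketch of equivalent ways to read the same semicolon-separated Variables.csv with base R (the file name comes from the example above):
# read.csv2 assumes ';' as separator and ',' as decimal mark
mail_list = read.csv2("Variables.csv", as.is = TRUE)
# read.csv with an explicit separator
mail_list = read.csv("Variables.csv", sep = ";", as.is = TRUE)
# read.table with separator and header set manually
mail_list = read.table("Variables.csv", sep = ";", header = TRUE,
                       stringsAsFactors = FALSE)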
Create a file to write the information of each individual row in the mail_list
after each email is sent.
my_file = file("mail.out",open="w")
# … write data here
close(my_file)
Perform the batch emailing to all recipients in the mail list:
for (recipient in 1:nrow(mail_list)) {
body = message_text(mail_list[recipient,])
send.mail(from="info@best-lottery-ever.com",
to=as.character(mail_list[recipient,]$email_address),
subject="Lottery Winners",
body=body,
html=TRUE,
authenticate=TRUE,
smtp = list(host.name = "smtp.mailtrap.io",
user.name = "*****", passwd = "*****", ssl = TRUE),
encoding = "utf-8",send=TRUE)
print(mail_list[recipient,])
Sys.sleep(runif(n=1,min=3,max=6))
#write each recipient to a file
result_file = file("mail.out",open="a")
writeLines(text=paste0("[",recipient,"] ",
paste0(as.character(mail_list[recipient,]),collapse="\t")),
sep="\n",con=result_file)
close(result_file)
}
And here is the full code:
message_text = function(x) sprintf('Hello %s %s!\nCongratulations on your win.\nYour prize is XXX.\nBet with the Best Lottery Ever!', x$firstname, x$lastname)
library(mailR)
mail_list = read.csv2("Variables.csv",as.is=TRUE)
my_file = file("mail.out",open="w")
# … write data here
close(my_file)
for (recipient in 1:nrow(mail_list)) {
body = message_text(mail_list[recipient,])
send.mail(from="info@best-lottery-ever.com",
to=as.character(mail_list[recipient,]$email_address),
subject="Lottery Winners",
body=body,
html=TRUE,
authenticate=TRUE,
smtp = list(host.name = "smtp.mailtrap.io",
user.name = "*****", passwd = "*****", ssl = TRUE),
encoding = "utf-8",send=TRUE)
print(mail_list[recipient,])
Sys.sleep(runif(n=1,min=3,max=6))
#write each recipient to a file
result_file = file("mail.out",open="a")
writeLines(text=paste0("[",recipient,"] ",
paste0(as.character(mail_list[recipient,]),collapse="\t")),
sep="\n",con=result_file)
close(result_file)
}
Resource type | Methods |
---|---|
An unsent message that you can modify once created | create (creating a new draft), delete (removing the specified draft), get (obtaining the specified draft), list (listing drafts in the mailbox), send (sending the specified draft according to the To, Cc, and Bcc headers), update (updating the specified draft’s content) |
An immutable resource that you cannot modify | batchDelete (removing messages by message ID), batchModify (modifying labels on the specified messages), delete (removing the specified message), get (obtaining the specified message), import (importing the message into the mailbox, similar to receiving via SMTP), insert (inserting the message into the mailbox, similar to IMAP), list (listing messages in the mailbox), modify (modifying labels on the specified message), send (sending the specified message according to the To, Cc, and Bcc headers), trash (moving the specified message to the trash), untrash (moving the specified message out of the trash) |
A collection of messages within a single conversation | delete (removing the specified thread), get (obtaining the specified thread), list (listing threads in the mailbox), modify (modifying labels in the thread), trash (moving the specified thread to the trash), untrash (moving the specified thread out of the trash) |
A resource to organize messages and threads (for example, inbox, spam, trash) | create (creating a new label), delete (removing the specified label), get (obtaining the specified label), list (listing labels in the mailbox), patch (patching the specified label; supports patch semantics), update (updating the specified label) |
A collection of changes made to the mailbox | list (listing the history of all changes to the mailbox) |
Settings for Gmail features | getAutoForwarding / updateAutoForwarding (auto-forwarding setting), getImap / updateImap (IMAP settings), getLanguage / updateLanguage (language settings), getPop / updatePop (POP3 settings), getVacation / updateVacation (vacation responder settings) |
Create the src/main/resources/ directory. Then, copy the JSON file with credentials into this directory and replace the content of the build.gradle file with this code.
So, pay attention when preparing your project.
Installation:
go get -u google.golang.org/api/gmail/v1
go get -u golang.org/x/oauth2/google
Installation via Gradle
repositories {
mavenCentral()
}
dependencies {
compile 'com.google.api-client:google-api-client:1.30.2'
}
Installation:
gem install google-api-client
Installation via NuGet Package Manager Console:
Install-Package Google.Apis.Gmail.v1
Installation via npm:
npm install googleapis@39 --save
Installation via Composer:
composer require google/apiclient:"^2.0"
Installation:
pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
or
easy_install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
API client for JavaScript
API client for Objective-C
API client for Dart
go run quickstart.go
gradle run
ruby quickstart.rb
node .
php quickstart.php
python quickstart.py
python -m SimpleHTTPServer 8000
– for Python 2
python -m http.server 8000
– for Python 3+
It worked…or not.
Google will warn you about a probable failure of the sample you run to open a new window in your default browser.
If this happens, you’ll need to do it manually.
Copy the URL from the console and paste it in the browser.
It will look like this:
import base64
from email.mime.text import MIMEText

def create_message(sender, to, subject, message_text):
    message = MIMEText(message_text)
    message['to'] = to
    message['from'] = sender
    message['subject'] = subject
    raw_message = base64.urlsafe_b64encode(message.as_string().encode("utf-8"))
    return {'raw': raw_message.decode("utf-8")}

def create_draft(service, user_id, message_body):
    try:
        message = {'message': message_body}
        draft = service.users().drafts().create(userId=user_id, body=message).execute()
        print("Draft id: %s\nDraft message: %s" % (draft['id'], draft['message']))
        return draft
    except Exception as e:
        print('An error occurred: %s' % e)
        return None
and PHP
/**
* @param $sender string sender email address
* @param $to string recipient email address
* @param $subject string email subject
* @param $messageText string email text
* @return Google_Service_Gmail_Message
*/
function createMessage($sender, $to, $subject, $messageText) {
$message = new Google_Service_Gmail_Message();
$rawMessageString = "From: <{$sender}>\r\n";
$rawMessageString .= "To: <{$to}>\r\n";
$rawMessageString .= 'Subject: =?utf-8?B?' .
base64_encode($subject) .
"?=\r\n";
$rawMessageString .= "MIME-Version: 1.0\r\n";
$rawMessageString .= "Content-Type: text/html; charset=utf-8\r\n";
$rawMessageString .= 'Content-Transfer-Encoding: quoted-printable' .
"\r\n\r\n";
$rawMessageString .= "{$messageText}\r\n";
$rawMessage = strtr(base64_encode($rawMessageString), array('+' => '-', '/' => '_'));
$message->setRaw($rawMessage);
return $message;
}
/**
* @param $service Google_Service_Gmail an authorized Gmail API service instance.
* @param $user string User's email address or "me"
* @param $message Google_Service_Gmail_Message
* @return Google_Service_Gmail_Draft
*/
function createDraft($service, $user, $message) {
$draft = new Google_Service_Gmail_Draft();
$draft->setMessage($message);
try {
$draft = $service->users_drafts->create($user, $draft);
print 'Draft ID: ' .
$draft->getId();
} catch (Exception $e) {
print 'An error occurred: ' .
$e->getMessage();
}
return $draft;
}
Test Your Emails Now
def send_message(service, user_id, message):
    try:
        message = service.users().messages().send(userId=user_id, body=message).execute()
        print('Message Id: %s' % message['id'])
        return message
    except Exception as e:
        print('An error occurred: %s' % e)
        return None
and PHP
/**
* @param $service Google_Service_Gmail an authorized Gmail API service instance.
* @param $userId string User's email address or "me"
* @param $message Google_Service_Gmail_Message
* @return null|Google_Service_Gmail_Message
*/
function sendMessage($service, $userId, $message) {
try {
$message = $service->users_messages->send($userId, $message);
print 'Message with ID: ' .
$message->getId() .
' sent.';
return $message;
} catch (Exception $e) {
print 'An error occurred: ' .
$e->getMessage();
}
return null;
}
def send_message(service, user_id, message):
    try:
        message = service.users().messages().send(userId=user_id, body=message).execute()
        print('Message Id: %s' % message['id'])
        return message
    except Exception as e:
        print('An error occurred: %s' % e)
        return None
# Imports needed for the attachment helper below
import base64
import mimetypes
import os
from email.mime.audio import MIMEAudio
from email.mime.base import MIMEBase
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def create_message_with_attachment(sender, to, subject, message_text, file):
    message = MIMEMultipart()
    message['to'] = to
    message['from'] = sender
    message['subject'] = subject
    msg = MIMEText(message_text)
    message.attach(msg)
    content_type, encoding = mimetypes.guess_type(file)
    if content_type is None or encoding is not None:
        content_type = 'application/octet-stream'
    main_type, sub_type = content_type.split('/', 1)
    if main_type == 'text':
        fp = open(file, 'rb')
        msg = MIMEText(fp.read().decode("utf-8"), _subtype=sub_type)
        fp.close()
    elif main_type == 'image':
        fp = open(file, 'rb')
        msg = MIMEImage(fp.read(), _subtype=sub_type)
        fp.close()
    elif main_type == 'audio':
        fp = open(file, 'rb')
        msg = MIMEAudio(fp.read(), _subtype=sub_type)
        fp.close()
    else:
        fp = open(file, 'rb')
        msg = MIMEBase(main_type, sub_type)
        msg.set_payload(fp.read())
        fp.close()
    filename = os.path.basename(file)
    msg.add_header('Content-Disposition', 'attachment', filename=filename)
    message.attach(msg)
    raw_message = base64.urlsafe_b64encode(message.as_string().encode("utf-8"))
    return {'raw': raw_message.decode("utf-8")}
import base64
import email

def get_messages(service, user_id):
    try:
        return service.users().messages().list(userId=user_id).execute()
    except Exception as error:
        print('An error occurred: %s' % error)

def get_message(service, user_id, msg_id):
    try:
        return service.users().messages().get(userId=user_id, id=msg_id, format='metadata').execute()
    except Exception as error:
        print('An error occurred: %s' % error)

def get_mime_message(service, user_id, msg_id):
    try:
        message = service.users().messages().get(userId=user_id, id=msg_id, format='raw').execute()
        print('Message snippet: %s' % message['snippet'])
        msg_str = base64.urlsafe_b64decode(message['raw'].encode("utf-8")).decode("utf-8")
        mime_msg = email.message_from_string(msg_str)
        return mime_msg
    except Exception as error:
        print('An error occurred: %s' % error)
If the message contains an attachment, expand your code with the following:
def get_attachments(service, user_id, msg_id, store_dir):
    try:
        message = service.users().messages().get(userId=user_id, id=msg_id).execute()
        for part in message['payload']['parts']:
            if part['filename'] and part['body'] and part['body']['attachmentId']:
                attachment = service.users().messages().attachments().get(id=part['body']['attachmentId'], userId=user_id, messageId=msg_id).execute()
                file_data = base64.urlsafe_b64decode(attachment['data'].encode('utf-8'))
                path = ''.join([store_dir, part['filename']])
                f = open(path, 'wb')
                f.write(file_data)
                f.close()
    except Exception as error:
        print('An error occurred: %s' % error)
In terms of Gmail API quota units, a drafts.create call costs 10 units and a messages.send call costs 100 units.
Gmail API enforces standard daily mail sending limits.
Also, keep in mind that the maximum email size in Gmail is 25MB.
To call Python from R, install the reticulate package with install.packages("reticulate") and load it with library(reticulate).
To keep things simple, let's start with just two lines of Python code to import the NumPy package for basic scientific computing and create an array of four numbers.
The Python code looks like this:
import numpy as np
my_python_array = np.array([2,4,6,8])
And here’s one way to do that right in an R script:
py_run_string("import numpy as np")
py_run_string("my_python_array = np.array([2,4,6,8])")
The py_run_string()
function executes whatever Python code is within the parentheses and quotation marks.
If you run that code in R, it may look like nothing happened.
Nothing shows up in your RStudio environment pane, and no value is returned.
If you run print(my_python_array)
in R, you get an error that my_python_array
doesn't exist.
But if you run a Python print command inside the py_run_string()
function such as
py_run_string("for item in my_python_array: print(item)")
you should see a result.
It’s going to get annoying running Python code line by line like this, though, if you have more than a couple of lines of code.
So there are a few other ways to run Python in R and reticulate.
One is to put all the Python code in a regular .py file, and use the py_run_file()
function.
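A minimal sketch of that approach (my_script.py is a hypothetical file name):
library(reticulate)
# Run an entire Python script in one go (hypothetical file name)
py_run_file("my_script.py")
# Objects created by the script are then available from R via py$
py$my_python_array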
Another way I like is to use an R Markdown document.
R Markdown lets you combine text, code, code results, and visualizations in a single document.
You can create a new R Markdown document in RStudio by choosing File > New File > R Markdown.
Code chunks start with three backticks (```
) and end with three backticks, and they have a gray background by default in RStudio.
This first chunk is for R code—you can see that with the r
after the opening bracket.
It loads the reticulate package and then you specify the version of Python you want to use.
(If you don’t specify, it’ll use your system default.)
```{r setup, include=FALSE, echo=TRUE}
library(reticulate)
use_python("/usr/bin/python")
```
This second chunk below is for Python code.
You can type the Python like you would in a Python file.
The code below imports NumPy, creates an array, and prints the array.
```{python}
import numpy as np
my_python_array = np.array([2,4,6,8])
for item in my_python_array:
print(item)
```
Here’s the cool part: You can use that array in R by referring to it as py$my_python_array
(in general, py$objectname
).
In this next code chunk, I store that Python array in an R variable called my_r_array
.
And then I check the class of that array.
```{r}
my_r_array = py$my_python_array
class(my_r_array)
```
It’s a class “array,” which isn’t exactly what you’d expect for an R object like this.
But I can turn it into a regular vector with as.vector(my_r_array)
and run whatever R operations I’d like on it, such as multiplying each item by 2.
```{r}
my_r_vector = as.vector(py$my_python_array)
class(my_r_vector)
my_r_vector = my_r_vector * 2
```
Next cool part: I can use that R variable back in Python, as r.my_r_array
(more generally, r.variablename
), such as
```{python}
my_python_array2 = r.my_r_vector
print(my_python_array2)
```
Function | Description |
---|---|
remoteDriver(browserName = "firefox") | Create a Firefox remoteDriver object |
open() | Open the browser |
getPageSource() | Get the page source |
navigate() | Navigate to the specified URL |
close() | Close the current session |
quit() | Delete the session and close the browser |
getStatus() | Get the status of the Selenium server |
getCurrentUrl() | Get the URL of the current page |
getTitle() | Get the title of the current page |
getWindowHandles() | Get the Selenium window handles of all open pages |
getPageSource() | Get the source code of the current page |
mouseMoveToLocation() | Move the mouse: the x, y arguments move it by (x, y) relative to the current position, while the webElement argument moves it to the center of a page element. Using the webElement argument is usually more convenient |
click(buttonId = 0) | Click the mouse (buttonId = 0 for the left button, 1 for the middle button, 2 for the right button) |
doubleclick(buttonId = 0) | Double-click the mouse |
clickElement() | Click an element |
sendKeysToActiveElement(sendKeys) | Send text or keyboard input to the active element (usually the one just clicked). The input must be a list; keyboard keys must be given as key=. Example: remDr$sendKeysToActiveElement(list("数据分析", key="enter")) |
findElement(using=…, value=…) | Locate a single element. Example: remDr$findElement(using = "css", value = "#kw"). using is the locator strategy: "xpath", "css", "id", "name", "tag name", "class name", "link text", "partial link text"; value is the value to search for |
findElements(using=…, value=…) | Locate multiple elements |
refresh() | Refresh the page |
screenshot() | Take a screenshot; if display=FALSE and file is non-NULL, the screenshot is saved to the path given by file |
goBack() | Go back to the previous page |
goForward() | Go forward (the counterpart of goBack) |
maxWindowSize() | Maximize the current window |
closeWindow() | Close the current window (the session stays alive) |
switchToWindow() | Switch windows; takes a window handle as argument |
executeScript() | Inject synchronous JavaScript; script is the JS code, and args can simply be 1:2 when no special values are needed. Example (scroll to the bottom of the page): remDr$executeScript("window.scrollTo(0,document.body.scrollHeight)", args = 1:2) |
executeAsyncScript() | Inject asynchronous JavaScript |
Function | Description |
---|---|
webElem <- remDr$findElement(using = …, value = …) | Locate a page element and create an element object |
1. Get element information | |
describeElement() | Get a description of the element |
getElementText() | Get the element's inner text (the main way to extract data) |
getElementAttribute(attrName) | Get an element attribute (useful, for example, for scraping link URLs) |
isElementDisplayed() | Whether the element is displayed |
isElementSelected() | Whether the element is selected |
compareElement(otherElem) | Compare with another element to test whether they are the same element |
2. Send mouse and keyboard actions | |
clearElement() | Clear the content of a text input box |
clickElement() | Click the element |
highlightElement() | Flash/highlight the element, mainly to confirm that the located element is the right one |
sendKeysToElement() | Same usage as the sendKeysToActiveElement(sendKeys) method of remoteDriver |
submitElement() | Submit a <form> element |
setElementAttribute() | *Utility function: set an element attribute |
3. Locate child elements | |
findChildElement() | If the current element has child elements, use this to locate a single child element; usage is the same as remDr$findElement() |
findChildElements() | Locate multiple child elements |
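A minimal sketch tying these functions together, assuming a Selenium server is already running on port 4444 (the URL and CSS selector are only examples):
library(RSelenium)
# Connect to a running Selenium server (assumed at localhost:4444)
remDr = remoteDriver(browserName = "firefox", port = 4444L)
remDr$open()
remDr$navigate("https://www.r-project.org/")
remDr$getTitle()
# Locate an element and read its text (hypothetical selector)
webElem = remDr$findElement(using = "css", value = "h1")
webElem$getElementText()
remDr$close()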
<html>
<head> <title>PHP and R Integration Sample</title> </head>
<body>
<div id="r-output" style="width: 100%; padding: 25px;">
<?php
// Execute the R script within PHP code.
// It generates its output as the test.png image.
exec("Rscript sample.R");
?>
<img src="test.png?ver1.1" alt="R Graph" />
</div>
</body>
</html>
Now save the file as index.php under your /htdocs/PROJECT-NAME/index.php. Let's create a sample chart using R code. Write the following code and save it as sample.R:
x = rnorm(6, 0, 1)
png(filename = "test.png", width = 500, height = 500)
hist(x, col = "red")
dev.off()
The only downside of this code is that it creates the same test.png file for every incoming request. That means if you are creating charts based on user-specified inputs, there will always be a single test.png file reused for all purposes. Let's understand the code. As specified earlier, exec("Rscript sample.R"); executes the R script, which in turn generates the test.png graph image. On the very next line we use the HTML <img /> tag to display the image generated by the R program. We use src="test.png?ver1.1", where ver1.1 invalidates the browser cache and forces the new image to be downloaded from the server. All modern browsers support caching; you might have noticed that some websites load much faster on repeat visits, because the browser caches images and other static resources for a brief period of time. How do we serve concurrent requests? sample2.R:
args = commandArgs(TRUE)
cols = as.numeric(args[1])
fname = args[2]
x = rnorm(cols, 0, 1)
fname = paste(fname, "png", sep = ".")
png(filename = fname, width = 500, height = 500)
hist(x, col = "red")
dev.off()
Index.php:
<html>
<head> <title>PHP and R Integration Sample</title> </head>
<body>
<div id="r-output" style="width: 100%; padding: 25px;">
<?php
// Execute the R script within PHP code.
// It generates the chart as a uniquely named PNG image.
$filename = "samplefile".rand(1,100);
exec("Rscript sample2.R 6 ".$filename);
?>
<img src="<?php echo $filename; ?>.png?ver1.1" alt="R Graph" />
</div>
</body>
</html>
This eliminates the need to reuse the same test.png file name. Here $filename is "samplefile" plus a random suffix; rand(min, max) generates a random number, which fixes the file-overwriting issue. You can now handle concurrent requests and serve each one with its own unique image(s). You might also need to take care of removing old files. If you are on a Linux machine, you can set up a cron job that finds and deletes chart image files older than 24 hours. Here is the code to find and remove the files, Delete.php:
<?php
// set the path to your chart image directory
$dir = "images/temp/";
// loop through all the chart png files inside the directory
foreach (glob($dir."*.png") as $file) {
    // if the file is more than 24 hours old, delete it
    if (filemtime($file) < time() - 86400) {
        unlink($file);
    }
}
?>
Conclusion: making PHP communicate with R and showcase the result is very simple. You mainly need to understand the exec() function, plus some PHP code if you want to delete the residual files/images generated by your R program.
r.php
, with the following content:
<html>
<body>
<form action='r.php' method='get'>
Enter N: <input type='text' name='n' />
<input type='submit' />
</form>
<?php
if(isset($_GET['n'])) {
  $n = $_GET['n'];
  // Call R as an external command to draw the plot
  exec("Rscript script.R $n");
  // Generate a random number to defeat browser caching
  $nocache = rand();
  // Output the image
  echo("<img src='output/hist.png?$nocache' />");
}
?>
</body></html>
The top half of this code is a plain HTML form used to submit the parameter entered by the user. The bottom half is the PHP code: after receiving the user's n value, it runs an external program through PHP's exec. The Rscript program ships with R (it is available on the system as soon as R is installed) and is the tool dedicated to running R scripts.
Finally, after the R script has run, the resulting plot is displayed. Because the output image always has the same file name, a random number is appended to the image URL to force the browser to fetch the new image instead of using its cache, so that a new result is shown each time a new n value is submitted.
Here is the content of the script.R R script:
args = commandArgs(TRUE)
# Get the N value entered by the user
n = as.numeric(args[1])
# Generate the data
x = rnorm(n, 0, 1)
# Draw the histogram
png(filename="output/hist.png", width = 500, height = 300)
hist(x, col = "orange")
dev.off()
In R we use commandArgs to retrieve the arguments passed in from the shell; the first argument is the n value entered by the user, which is how the data sent from PHP is received.
The script then generates some normally distributed random data, draws a histogram, and saves the plot into the output directory, from which the web page reads the image file and returns the result to the user.
Here the output directory is dedicated to the generated image files. Because the R script runs with the permissions of the web-server user (usually www-data on Ubuntu Linux), make sure the directory permissions allow the server to write into it.
The result looks like this:
Running an external R script with exec is the simpler approach, but its drawback is that it requires a separate R script file. If you do not want to create a separate R file, you can use proc_open instead and write the R commands from PHP directly into an R process through a Linux pipe, which saves you the trouble of creating an R file. Here is a simple example:
<html><body>
<form action='r.php' method='get'>
Enter N: <input type='text' name='n' />
<input type='submit' />
</form>
<?php
if(isset($_GET['n'])) {
  $n = $_GET['n'];
  $descriptorspec = array(
    0 => array("pipe", "r"),                    // stdin
    1 => array("file", "/tmp/output.txt", "w"), // stdout
    2 => array("file", "/tmp/error.txt", "w")   // stderr
  );
  // Run R through a pipe to draw the plot
  $rproc = proc_open("R --vanilla", $descriptorspec, $pipes);
  if (is_resource($rproc)) {
    fwrite($pipes[0], "x = rnorm($n, 0, 1);");
    fwrite($pipes[0], "png(filename='output/hist.png', width = 500, height = 300);");
    fwrite($pipes[0], "hist(x, col = 'orange');");
    fwrite($pipes[0], "dev.off();");
    fclose($pipes[0]);
    proc_close($rproc);
    // Generate a random number to defeat browser caching
    $nocache = rand();
    // Output the image
    echo("<img src='output/hist.png?$nocache' />");
  }
}
?>
</body></html>
In this example, proc_open is used to start an R process from PHP. Before starting the new process, $descriptorspec configures its standard input, standard output, and standard error: the R process's stdin is set to a pipe so data can be written to it directly from PHP, while R's output and error messages are redirected to two temporary files, which is convenient during development for checking that the program ran correctly and for debugging.
After writing all the R commands into the R process, remember to close all the pipes before calling proc_close to terminate the R process, otherwise you may cause a deadlock.
Finally, the image is displayed on the page as before. Whether you integrate PHP and R with proc_open or with exec, the displayed result looks the same; only the internal structure of the program differs.
Control/Ctrl + 1
: Source editor (your script)
Control/Ctrl + 2
: Console
Control/Ctrl + 3
: Help
Control/Ctrl + 4
: History
Control/Ctrl + 5
: Files
Control/Ctrl + 6
: Plots
Control/Ctrl + 7
: Packages
Control/Ctrl + 8
: Environment
Control/Ctrl + 9
: Viewer
If you prefer to only have one pane in view at a time, add Shift
to any of the above commands to maximize the pane.
For example, enter Control/Ctrl + Shift + 1
to maximize the R script, notebook, or R Markdown file you are working in.
(Side note: The +
we show in the shortcuts means “and”, so there’s no need to actually type the +
key.)
But what if you want to return to the standard four-pane view? No problem! Enter Control/Ctrl + Shift + 0
:
You can view the full list of shortcuts from the menu via Tools > Keyboard Shortcuts Help
.
Another way to access RStudio keyboard shortcuts is with a shortcut! To access shortcuts, type Option + Shift + K
on a Mac, or Alt + Shift + K
on Linux and Windows.
Here are some of our favorite RStudio shortcuts:
Insert the = assignment operator with Option + -
on a Mac, or Alt + -
on Linux and Windows.
Insert the pipe operator %>%
with Command + Shift + M
on a Mac, or Ctrl + Shift + M
on Linux and Windows.
Run the current line of code with Command + Enter
on a Mac or Control + Enter
on Linux and Windows.
Run all lines of code with Command + A + Enter
on a Mac or Control + A + Enter
on Linux and Windows.
Restart the current R session and start fresh with Command + Shift + F10
on a Mac or Control + Shift + F10
on Linux and Windows.
Comment or uncomment lines with Command + Shift + C
on a Mac or Control + Shift + C
on Linux and Windows.
Trying to remember a command you submitted earlier? Search the command history from the Console with Command + [up arrow]
on a Mac or Control + [up arrow]
on Linux and Windows.
There are many more useful shortcuts available, but by mastering the shortcuts above, you’ll be on your way to becoming an RStudio power user!
Another great resource for RStudio shortcuts is the official RStudio cheat sheet available here.
Press return/Enter
to make your selection.
Alternatively, you can select the installed.packages()
function by typing part of the function name and then using the arrow keys to make the selection.
Next, we’ll use fuzzy matching to only enter instd
to narrow our selection further:
control/ctrl + .
to open the Go to File/Function
window and then use your fuzzy matching skills to narrow your selection:
RStudio
tab, navigate to Preferences > Appearance
to explore the many options available.
A nice feature of RStudio is that you can quickly click through the Editor theme
window to preview each theme.
Help
tab in the lower-right window, you’ll find handy links to the online documentation for R functions and R packages.
For example, if we search for information about the install.packages()
function using the search bar, the official documentation is returned:
Help
tab by prepending a package or function with ?
, (e.g. ?install.packages
) and running the command into the Console.
With either approach, RStudio auto-fills matching function names as you type!
Plots
tab in the lower-right window.
In this window, you can inspect your plots by zooming in and out.
If you want to save your plot, you can save the plot as a PDF or image file.
Environment
tab in the upper-right window, there is feature that enables you to import a dataset.
This feature supports a variety of formats:
View()
command, or by clicking the name of the dataset:
History
tab:
Preferences > General
and un-select the option to restore .RData
into workspace at startup.
Be sure to specify that you never want to save your workspace, like this:
File
tab in RStudio and select New Project...
.
You have the option to create your new project in a new directory, or an existing directory.
RStudio offers dedicated project types if you are working on an R package, or a Shiny Web Application.
RStudio Projects are useful when you need to share your work with colleagues.
You can send your project file (ending in .Rproj
) along with all supporting files, which will make it easier for your colleagues to recreate the working environment and reproduce the results.
But if you want seamless collaboration, you may need to introduce package management into your workflow.
Fortunately, RStudio offers a useful tool for package management, renv
, that is now compatible with RStudio projects.
We’ll cover renv
next.
renv
(“reproducible environment”) package from RStudio.
And now, RStudio includes built-in support for renv
.
We won’t get into the details of how to use renv
with RStudio projects in this blog because RStudio provides you with the info you need in the link we provided and in the vignette.
But using renv
with RStudio can make R package management much easier, so we wanted to let you know!
The renv
package is replacing the Packrat
package that RStudio used to maintain.
To use the renv
package with your RStudio projects, upgrade to the latest version of RStudio and then install the renv
package with install.packages("renv")
.
From there you will have the option to use renv
with all new projects:
renv
with an existing project navigate to Tools > Project Options > Environments
and check the box to enable renv
:
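Once renv is enabled, the day-to-day workflow revolves around a handful of functions; a minimal sketch (the installed package name is just an example):
library(renv)
renv::init()               # set up a project-local library and lockfile
install.packages("dplyr")  # install packages as usual (example package)
renv::snapshot()           # record the current package versions in renv.lock
renv::restore()            # recreate that library elsewhere from renv.lock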
RStudio also provides a lib
snippet that saves you a bit of typing when calling the library()
function to load an R package: when the snippet is triggered, the library()
call is inserted and the cursor is positioned so you can immediately begin typing the name of the package you want to load:
F2
(on a Mac you may need to enter fn + F2
).
This feature even works for functions loaded from any R packages you use.
control + option X
on a Mac, Ctrl + Alt + X
on Linux/Windows.
A pop-up will appear that will ask you to select a function name.
control + option V
on a Mac, Ctrl + Alt + V
on Linux/Windows.
control + shift + option + M
on a Mac, or Ctrl + Shift + Alt + M
on Linux/Windows.
option
on a Mac, or Alt
on Windows/Linux.
pip
and virtualenv
Create a Python environment in your RStudio project
Activate your Python environment
Install desired Python packages in your environment
Install and configure the R reticulate
package to use Python
This article provides the code you’ll need for the steps above.
We tried it out and were able to run python in RStudio in only a few minutes:
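A minimal sketch of those steps using reticulate's own helpers (the environment and package names are only examples):
library(reticulate)
# Create and populate a project-local virtual environment (example name)
virtualenv_create("my-rstudio-env")
virtualenv_install("my-rstudio-env", packages = c("numpy", "pandas"))
# Tell reticulate to use that environment
use_virtualenv("my-rstudio-env", required = TRUE)
# Check which Python reticulate is now using
py_config()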
The first approach is to query the database using the DBI
package from R.
You’ll start by generating an in-memory SQL database to use in all your SQL query examples.
You’ll generate a SQL database of the well-known “mtcars” dataset.
Here’s the code:
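The original post shows this step as a screenshot; here is a minimal sketch of the idea, assuming the RSQLite backend:
library(DBI)
# Create an in-memory SQLite database and load mtcars into it (illustrative)
con = dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# Run a first query through DBI
dbi_query = dbGetQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
head(dbi_query)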
This returns a standard R dataframe, dbi_query. The second approach is to write the SQL directly in a {sql}
code chunk in R Markdown.
Using the connection and database from the first example, run this code:
Set output.var = "mt_cars_df"
in the chunk header to save the results of your query to a dataframe.
This dataframe is a standard R dataframe that is identical to the one you generated in the previous example.
You can use this dataframe in R code chunks to perform analysis or to generate a ggplot, for example:
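For example, a small sketch using the mt_cars_df dataframe produced above (ggplot2 assumed to be installed):
library(ggplot2)
# mt_cars_df comes from the SQL chunk above (via output.var = "mt_cars_df")
ggplot(mt_cars_df, aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  labs(title = "4-cylinder cars", x = "Weight", y = "MPG")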
The third approach is to use the dbplyr
package to write standard dplyr
commands that get converted to SQL. Once again, using the connection and database from the first example, you can write a standard filter()
call to query the cars with four cylinders; this returns a list object:
To see the SQL that gets generated under the hood, use the show_query()
function from dbplyr
. Then use the collect()
function from dbplyr
to save your results as a dataframe:
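The post shows these steps as screenshots; a minimal sketch of the three calls, reusing the con connection created above:
library(dplyr)
library(dbplyr)
# Lazily reference the database table and filter it (no data pulled yet)
cars_db = tbl(con, "mtcars") %>% filter(cyl == 4)
# Inspect the SQL that dbplyr generates
show_query(cars_db)
# Pull the results into R as a tibble
mt_cars_tbl = collect(cars_db)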
Note that the dbplyr
method returns a tibble, whereas the first two methods return a standard R dataframe.
To learn more about querying SQL databases with RStudio, check out this article.
Help > Cheatsheets
.
Tools -> Modify Keyboard Shortcuts...
:
Backspace
to clear a single key combination, or Delete
to reset that binding to the original value it had when the widget was opened.
Commands can be bound either to a single 'key chord' (for example, Ctrl+Alt+F
) or to a sequence of keys, as in Ctrl+X Ctrl+F
.
You can also filter, based on the names of commands, by typing within the Filter...
search box at the top left, to more easily find commands of interest:
Apply
and the shortcuts will be applied to the current session and saved for future sessions.
Customized keybindings are stored under ~/.R/rstudio/keybindings/
-- you can find the bindings for the editor and RStudio itself there.
# example using built-in dataset
mtcars
t(mtcars)
id | time | x1 | x2 |
---|---|---|---|
1 | 1 | 5 | 6 |
1 | 2 | 3 | 5 |
2 | 1 | 6 | 1 |
2 | 2 | 2 | 4 |
# example of melt function
library(reshape)
# mydata matches the table above
mydata = data.frame(id = c(1, 1, 2, 2), time = c(1, 2, 1, 2),
                    x1 = c(5, 3, 6, 2), x2 = c(6, 5, 1, 4))
mdata = melt(mydata, id = c("id", "time"))
id | time | variable | value |
---|---|---|---|
1 | 1 | x1 | 5 |
1 | 2 | x1 | 3 |
2 | 1 | x1 | 6 |
2 | 2 | x1 | 2 |
1 | 1 | x2 | 6 |
1 | 2 | x2 | 5 |
2 | 1 | x2 | 1 |
2 | 2 | x2 | 4 |
# cast the melted data
# cast(data, formula, function)
subjmeans = cast(mdata, id~variable, mean)
timemeans = cast(mdata, time~variable, mean)
subjmeans:
id | x1 | x2 |
---|---|---|
1 | 4 | 5.5 |
2 | 4 | 2.5 |
timemeans:
time | x1 | x2 |
---|---|---|
1 | 5.5 | 3.5 |
2 | 2.5 | 4.5 |
Linear regression in R is done with the lm command. In the next example, use this command to model height based on the age of the child. First, import the library readxl to read Microsoft Excel files; the data can be in any format, as long as R can read it. To learn more about importing data into R, you can take this DataCamp course. The data for this tutorial can be downloaded here. Load the data into an object called ageandheight and then create the linear regression in the third line. The lm command takes the variables in the format:
lm([target variable] ~ [predictor variables], data = [data source])
With the command summary(lmHeight)
you can see detailed information on the model’s performance and coefficients.
library(readxl)
ageandheight = read_excel("ageandheight.xls", sheet = "Hoja2") #Upload the data
lmHeight = lm(height~age, data = ageandheight) #Create the linear regression
summary(lmHeight) #Review the results
lmHeight2 = lm(height~age + no_siblings, data = ageandheight) #Create a linear regression with two variables
summary(lmHeight2) #Review the results
Use the read_excel command to create a dataframe with the data, then create a linear regression with your new data. The command plot takes a data frame and plots the variables in it. In this case, it plots the pressure against the temperature of the material. Then, add the line produced by the linear regression with the command abline.
pressure = read_excel("pressure.xlsx") #Upload the data
lmTemp = lm(Pressure~Temperature, data = pressure) #Create the linear regression
plot(pressure, pch = 16, col = "blue") #Plot the results
abline(lmTemp) #Add a regression line
summary(lmTemp)
Call:
lm(formula = Pressure ~ Temperature, data = pressure)
Residuals:
Min 1Q Median 3Q Max
-41.85 -34.72 -10.90 24.69 63.51
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -81.5000 29.1395 -2.797 0.0233 *
Temperature 4.0309 0.4696 8.583 2.62e-05 ***
---
Signif.
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 42.66 on 8 degrees of freedom
Multiple R-squared: 0.902, Adjusted R-squared: 0.8898
F-statistic: 73.67 on 1 and 8 DF, p-value: 2.622e-05
Ideally, when you plot the residuals, they should look random. Otherwise, it means there may be a hidden pattern that the linear model is not capturing. To plot the residuals, use the command plot(lmTemp$residuals).
plot(lmTemp$residuals, pch = 16, col = "red")
lmTemp2 = lm(Pressure~Temperature + I(Temperature^2), data = pressure) #Create a linear regression with a quadratic coefficient
summary(lmTemp2) #Review the results
Call:
lm(formula = Pressure ~ Temperature + I(Temperature^2), data = pressure)
Residuals:
Min 1Q Median 3Q Max
-4.6045 -1.6330 0.5545 1.1795 4.8273
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.750000 3.615591 9.335 3.36e-05 ***
Temperature -1.731591 0.151002 -11.467 8.62e-06 ***
I(Temperature^2) 0.052386 0.001338 39.158 1.84e-09 ***
---
Signif.
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.074 on 7 degrees of freedom
Multiple R-squared: 0.9996, Adjusted R-squared: 0.9994
F-statistic: 7859 on 2 and 7 DF, p-value: 1.861e-12
Notice that the model improved significantly.
If you plot the residuals of the new model, they will look like this:
plot(lmTemp2$residuals, pch = 16, col = "red")
You can detect influential observations by computing Cook's distances with cooks.distance and then plotting these distances. Change a value on purpose to see how it looks on the Cook's distance plot. To change a specific value, you can directly point at it with ageandheight[row number, column number] = [new value]. In this case, the height in the second row is changed to 7.7:
ageandheight[2, 2] = 7.7
head(ageandheight)
age | height | no_siblings |
---|---|---|
18 | 76.1 | 0 |
19 | 7.7 | 2 |
20 | 78.1 | 0 |
21 | 78.2 | 3 |
22 | 78.8 | 4 |
23 | 79.7 | 1 |
Compute the distances with cooks.distance([linear model]) and then, if you want, you can plot these distances with the command plot.
lmHeight3 = lm(height~age, data = ageandheight)#Create the linear regression
summary(lmHeight3)#Review the results
plot(cooks.distance(lmHeight3), pch = 16, col = "blue") #Plot the Cooks Distances.
Call:
lm(formula = height ~ age, data = ageandheight)
Residuals:
Min 1Q Median 3Q Max
-53.704 -2.584 3.609 9.503 17.512
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.905 38.319 0.206 0.841
age 2.816 1.613 1.745 0.112
Residual standard error: 19.29 on 10 degrees of freedom
Multiple R-squared: 0.2335, Adjusted R-squared: 0.1568
F-statistic: 3.046 on 1 and 10 DF, p-value: 0.1115
The precision-recall curve can help us pick the right threshold value depending on what we need. For example, if our motive is to reduce Type 1 error, we need a threshold with high precision, whereas if we aim to minimize Type 2 error we should pick a threshold such that sensitivity (recall) is high.
Why does precision have a bumpy nature at the end?
Predicted results are likely to vary in each trial, while actual results are fixed.
The ROC curve plots the True Positive Rate (or sensitivity) against the False Positive Rate (or 1-specificity).
As the number of clusters increases, the WCSS value will start to decrease. The WCSS value is largest when K = 1.
A probability curve should take extremely low values at one end, extremely high values at the other end, and intermediate values in the middle. Both a sigmoid curve and a straight line satisfy this property. However, in logistic regression there are usually only two categories, and a straight, linear decision boundary may not work: it is not steep enough and will end up misclassifying points. In the sigmoid curve we have low values for a lot of points, then the values rise suddenly, after which we have a lot of high values. In a straight line, by contrast, the values rise from low to high very uniformly, so the "boundary" region, where the probabilities transition from low to high, is not really present. Hence we apply a sigmoid transformation, which gives a curve that is smooth (flat) at the extremes and almost linear in the middle (for moderate values).
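A minimal, purely illustrative sketch of that comparison: it plots a sigmoid transformation of a linear score next to the raw linear score rescaled to [0, 1], showing that the sigmoid is flat at the extremes and roughly linear in the middle.
z = seq(-6, 6, by = 0.1)                    # linear score (e.g. b0 + b1*x)
sigmoid = 1 / (1 + exp(-z))                 # logistic/sigmoid transformation
linear  = (z - min(z)) / (max(z) - min(z))  # straight line rescaled to [0, 1]
plot(z, sigmoid, type = "l", col = "blue", ylab = "probability-like value",
     main = "Sigmoid vs straight line")
lines(z, linear, col = "red", lty = 2)
legend("topleft", legend = c("sigmoid", "straight line"),
       col = c("blue", "red"), lty = c(1, 2))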
# Generate a plot of color names which R knows about.
#++++++++++++++++++++++++++++++++++++++++++++
# cl : a vector of colors to plots
# bg: background of the plot
# rot: text rotation angle
#usage=showCols(bg="gray33")
showCols = function(cl=colors(), bg = "grey",
cex = 0.75, rot = 30) {
n = length(cl); m = ceiling(sqrt(n))
length(cl) = m*m; cm = matrix(cl, m)
require("grid")
grid.newpage(); vp = viewport(w = .92, h = .92)
grid.rect(gp=gpar(fill=bg))
grid.text(cm, x = col(cm)/m, y = rev(row(cm))/m, rot = rot,
vp=vp, gp=gpar(cex = cex, col = cm))
}
The names of the first sixty colors are shown in the following chart :
# The first sixty color names
showCols(bg="gray20",cl=colors()[1:60], rot=30, cex=0.9)
# Barplot using color names
barplot(c(2,5), col=c("chartreuse", "blue4"))
showCols(cl= colors(), bg="gray33", rot=30, cex=0.75)
# Barplot using hexadecimal color code
barplot(c(2,5), col=c("#009999", "#0000FF"))
install.packages("RColorBrewer")
The RColorBrewer package creates nice-looking color palettes.
The color palettes associated with the RColorBrewer package can be drawn using the display.brewer.all() R function as follows:
library("RColorBrewer")
display.brewer.all()
# View a single RColorBrewer palette by specifying its name
display.brewer.pal(n = 8, name = 'RdBu')
# Hexadecimal color specification
brewer.pal(n = 8, name = "RdBu")
## [1] "#B2182B" "#D6604D" "#F4A582" "#FDDBC7" "#D1E5F0" "#92C5DE" "#4393C3" "#2166AC"
# Barplot using RColorBrewer
barplot(c(2,5,7), col=brewer.pal(n = 3, name = "RdBu"))
# Install
install.packages("wesanderson")
# Load
library(wesanderson)
The available color palettes are :
# simple barplot
barplot(c(2,5,7), col=wes_palette(n=3, name="GrandBudapest1")) # current wesanderson uses wes_palette() and "GrandBudapest1"; older code used wes.palette("GrandBudapest")
library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(size = 2) +
scale_color_manual(values = wes_palette(n=3, name="GrandBudapest1"))
# Use rainbow colors
barplot(1:5, col=rainbow(5))
# Use heat.colors
barplot(1:5, col=heat.colors(5))
# Use terrain.colors
barplot(1:5, col=terrain.colors(5))
# Use topo.colors
barplot(1:5, col=topo.colors(5))
# Use cm.colors
barplot(1:5, col=cm.colors(5))
This analysis has been performed using R (ver. 3.1.0).
Typical goal: Explanation | Typical goal: Prediction |
---|---|
Does X have an effect on Y? | What best predicts Y? |
Example: Does a low-carb diet lead to a reduced risk of heart attack? | Example: Given various clinical parameters, how can we use them to predict heart attacks? |
Task: Develop research design based on a theory about the data-generating process to identify the causal effect (via a randomized experiment, or an observational study with statistical control variables).
Don’t try out various model specifications until you get your desired result (better: pre-register your hypothesized model). | Task: Try out and tune many different algorithms in order to maximize predictive accuracy in new and unseen test datasets. A theory about the true data-generating process is useful but not strictly necessary, and often not available (think of, e.g., image recognition). |
Parameters of interest: Causal effect size, p-value. | Parameters of interest: Accuracy (%), precision/recall, sensitivity/specificity, … |
DON’T: Throw all kinds of variables into the model which might mask/bias your obtained effect (e.g., “spurious correlation”, “collider bias”). | Use whatever features are available and prove to be useful in predicting the outcome. |
Use all the data to calculate your effect of interest. After all, your sample was probably designed to be representative (e.g. a random sample) of a population. | DON’T: Use all data to train a model. Always reserve subsets for validation/testing in order to avoid overfitting. |
Library | Objective | Function | Parameter |
---|---|---|---|
randomForest | Create a Random forest | randomForest() | formula, ntree=n, mtry=FALSE, maxnodes = NULL |
caret | Create K folder cross validation | trainControl() | method = “cv”, number = n, search =”grid” |
caret | Train a Random Forest | train() | formula, df, method = “rf”, metric= “Accuracy”, trControl = trainControl(), tuneGrid = NULL |
caret | Predict out of sample | predict | model, newdata= df |
caret | Confusion Matrix and Statistics | confusionMatrix() | model, y test |
caret | variable importance | varImp() | model |
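A minimal sketch tying those caret/randomForest functions together (the iris dataset and the fold/seed choices are purely illustrative):
library(randomForest)
library(caret)
set.seed(123)
idx = createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df = iris[idx, ]
test_df  = iris[-idx, ]
trControl = trainControl(method = "cv", number = 5, search = "grid")  # K-fold cross validation
fit_rf = train(Species ~ ., data = train_df, method = "rf",
               metric = "Accuracy", trControl = trControl)            # train a random forest
pred = predict(fit_rf, newdata = test_df)                             # predict out of sample
confusionMatrix(pred, test_df$Species)                                # confusion matrix and statistics
varImp(fit_rf)                                                        # variable importance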
Library | Objective | Function | Class | Parameters | Details |
---|---|---|---|---|---|
rpart | Train classification tree in R | rpart() | class | formula, df, method | |
rpart | Train regression tree | rpart() | anova | formula, df, method | |
rpart | Plot the trees | rpart.plot() | fitted model | ||
base | predict | predict() | class | fitted model, type | |
base | predict | predict() | prob | fitted model, type | |
base | predict | predict() | vector | fitted model, type | |
rpart | Control parameters | rpart.control() | minsplit | Set the minimum number of observations in the node before the algorithm performs a split | |
minbucket | Set the minimum number of observations in the final node, i.e. the leaf | | | | |
maxdepth | Set the maximum depth of any node of the final tree. The root node is treated as depth 0 | | | | |
rpart | Train model with control parameter | rpart() | formula, df, method, control |
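A minimal sketch using those rpart functions (iris again as an illustrative dataset; the control values are arbitrary):
library(rpart)
library(rpart.plot)
ctrl = rpart.control(minsplit = 10, minbucket = 5, maxdepth = 4)       # control parameters
fit = rpart(Species ~ ., data = iris, method = "class", control = ctrl) # classification tree
rpart.plot(fit)                                                         # plot the tree
head(predict(fit, iris, type = "class"))                                # predicted classes
head(predict(fit, iris, type = "prob"))                                 # class probabilities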
What Is Regression? Regression is a statistical method that attempts to determine the strength and character of the relationship between a dependent variable and one or more independent variables. Excel is not well suited to polynomial regression: you can add a fitted trend line to a scatter plot, but you cannot judge how good the model parameters are or how to choose the number of terms, so it is better to use a programming language. It can be done in Python; here is the R version. Polynomial regression is a regression method that fits the relationship between the independent variable (input) and the dependent variable (output) with a polynomial function. The relationship is assumed to be approximated by a polynomial of the general form y = β0 + β1·x + β2·x² + … + βn·x^n + ε, where y is the dependent variable, x is the independent variable, β0, β1, …, βn are the regression coefficients, and ε is the error term.
Advantages and disadvantages of polynomial regression. Advantages: flexibility (it can fit complex data patterns, including non-linear relationships), and the model is relatively simple to understand, interpret and implement. Disadvantages: risk of overfitting (a high-degree polynomial may overfit the training data and perform poorly on new data), and computational cost, which grows as the degree increases. Simulating polynomial regression with R: first generate the data; for convenience the data are simulated directly in the software.
# set a random seed for reproducibility
set.seed(21)
# generate the independent variable x
x = runif(100, min = -10, max = 10)
# generate the dependent variable y, assuming a quadratic relationship with x
y = 2 + 3*x - 0.5*x^2 + rnorm(100, mean = 0, sd = 5)
# store the data in a data frame
data = data.frame(x = x, y = y)
First draw a simple plot to look at the overall distribution:
# plot the data distribution
plot(data$x, data$y, main = "Advertising Spend vs Sales", xlab = "Advertising Spend", ylab = "Sales", pch = 19, col = "blue")
From the plot, the relationship is not a straight line, so simple linear regression cannot be used directly. For model fitting we can still use the lm function; only the formula needs to be changed to a polynomial. Since we do not know the true degree here, start with a degree-3 polynomial and look at the result.
# fit the polynomial regression model
model = lm(y ~ poly(x, 3), data = data)
# view the model summary
summary(model)
Call: lm(formula = y ~ poly(x, 3), data = data) Residuals: Min 1Q Median 3Q Max -13.4946 -3.2138 0.0554 3.6756 9.0921 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -13.6000 0.4906 -27.719 <2e-16 *** poly(x, 3)1 166.6193 4.9063 33.960 <2e-16 *** poly(x, 3)2 -149.2056 4.9063 -30.411 <2e-16 *** poly(x, 3)3 6.8565 4.9063 1.397 0.165 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 4.906 on 96 degrees of freedom Multiple R-squared: 0.9559, Adjusted R-squared: 0.9545 F-statistic: 693.4 on 3 and 96 DF, p-value: < 2.2e-16
Key evaluation metrics: the Residual Standard Error is the root-mean-square error of the residuals and measures the average difference between the model's predictions and the observed values; the smaller, the better the fit. Here it is only 4.906, so the fit is quite good. Multiple R-squared is the coefficient of determination, the proportion of total variation explained by the model; it lies between 0 and 1, and closer to 1 means stronger explanatory power. Here it is 0.9559, which is very good. Adjusted R-squared additionally accounts for the number of parameters: adding unimportant variables can increase R-squared while decreasing Adjusted R-squared. Here it is 0.9545, still very good. The F-statistic tests the overall significance of the model; if its p-value (Pr(>F)) is below the significance level (usually 0.05), we reject the null hypothesis and conclude the model is significant overall. Here the p-value is 2.2e-16, so the model is significant. Coefficients gives the estimate, standard error, t value and p-value for each parameter; the t value tests whether a parameter is significantly different from zero, and a p-value below 0.05 usually indicates significance. Here the intercept and the degree-1 and degree-2 terms are highly significant, but the degree-3 term is not, so the model still needs adjusting.
Determining the best polynomial degree usually involves cross-validation or an information criterion such as AIC or BIC. The following code uses cross-validation to determine the best degree:
# use cross-validation to determine the best polynomial degree
library(boot)
# define a cross-validation error function
cv_error = function(formula, data, deg, K = 10) {
  model = glm(formula, data = data)
  return(cv.glm(data, model, K = K)$delta[1])
}
# compute the cross-validation error for each polynomial degree
cv_errors = sapply(1:10, function(deg) {
  formula = as.formula(paste("y ~ poly(x,", deg, ")", sep = ""))
  return(cv_error(formula, data, deg))
})
# find the degree with the smallest cross-validation error
best_deg = which.min(cv_errors)
# print the result
print(paste("Best polynomial degree is", best_deg))
"Best polynomial degree is 2" — among degrees 1 to 10 the procedure selects 2, which matches the degree we used to simulate the data. To make this more concrete, we can also store the fitted coefficients for every degree:
# create a data frame to store the model parameters for each polynomial degree
coefficients_table = data.frame(Degree = integer(0), Coefficients = character(0))
# compute the model parameters for each polynomial degree
for (deg in 1:10) {
  formula = as.formula(paste("y ~ poly(x,", deg, ")", sep = ""))
  model = glm(formula, data = data)
  # extract the model parameters
  coefficients = coef(model)
  coefficients = coefficients[2:(deg + 1)] # drop the intercept
  # store the results
  coefficients_table[deg, "Degree"] = deg
  coefficients_table[deg, "Coefficients"] = paste(round(coefficients, 2), collapse = ", ")
}
The result confirms that degree 2 is optimal. Next, visualize the fits for all 10 degrees:
par(mfrow = c(2, 5)) # set the plot layout to 2 rows x 5 columns
for (deg in 1:10) {
  formula = as.formula(paste("y ~ poly(x,", deg, ")", sep = ""))
  model = glm(formula, data = data)
  x_pred = seq(min(x), max(x), length.out = 100)
  y_pred = predict(model, newdata = data.frame(x = x_pred))
  plot(x, y, main = paste("Degree =", deg), xlab = "x", ylab = "y", col = "blue", pch = 19)
  lines(x_pred, y_pred, col = "red", lwd = 2)
}
Predicting on new data. Finally, we can predict on new data and plot the predictions:
# generate prediction values of x
x_pred = seq(min(x), max(x), length.out = 100)
y_pred = predict(model, newdata = data.frame(x = x_pred))
# plot the original data and the model predictions
plot(x, y, main = "Polynomial Regression", xlab = "x", ylab = "y", col = "blue", pch = 19)
lines(x_pred, y_pred, col = "red", lwd = 2)
legend("topleft", legend = c("Original Data", "Polynomial Regression"), col = c("blue", "red"), pch = c(19, NA), lty = c(NA, 1))
trimming video
The basic approach to trimming a video with ffmpeg would be something like this:
ffmpeg -i input.mp4 -ss 00:05:00 -to 00:10:00 -c copy output.mp4
To create a batch file, you can put the following in a text file, save it as something like "trimvideo.bat" and run it in the relevant folder:
@echo off
:: loop across all the mp4s in the folder, trim each one,
:: and write the result to original_trimmed.mp4
for %%A in (*.mp4) do ffmpeg -i "%%A" -ss 00:05:00 -to 00:10:00 -c copy "%%~nA_trimmed.mp4"
pause
If you wanted to do this through R, you could do something like:
# get a list of the files you're working with
x <- list.files(pattern = "*.mp4")
for (i in seq_along(x)) {
  cmd <- sprintf("ffmpeg -i %s -ss 00:05:00 -to 00:10:00 -c copy %s_trimmed.mp4",
                 x[i], sub(".mp4$", "", x[i]))
  system(cmd)
}
R call external program and return parameters
Have the external program write its results to a txt file, and loop in R until the value of the text file is not null or empty. The external program can also write its results to the clipboard. VBScript example:
Set WshShell = WScript.CreateObject("WScript.Shell")
WshShell.Run "cmd.exe /c echo hello world | clip", 0, TRUE
In R, loop until the value of the clipboard is not null.
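A minimal sketch of that polling loop (the result file name "result.txt" is an assumption; readClipboard() is Windows-only):
# poll a results file written by the external program
result = NULL
while (is.null(result) || length(result) == 0 || all(result == "")) {
  Sys.sleep(1)                                  # wait before checking again
  if (file.exists("result.txt")) {
    result = readLines("result.txt", warn = FALSE)
  }
}
# or, on Windows, poll the clipboard instead
# clip = NULL
# while (is.null(clip) || all(clip == "")) { Sys.sleep(1); clip = utils::readClipboard() }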
R POST a simple HTML form
With the httr library:
library(httr)
url <- "https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm"
fd <- list(
  submit = "Show Prices",
  priceDate.year = 2014,
  priceDate.month = 12,
  priceDate.day = 15
)
resp <- POST(url, body = fd, encode = "form")
content(resp)
The rvest library is really just a wrapper around httr. It looks like it doesn't do a good job of interpreting absolute URLs without the server name, so if you look at f1$url # [1] /GA-FI/FedInvest/selectSecurityPriceDate.htm you see that it just has the path and not the server name, which appears to confuse httr. If you do f1 <- set_values(f0[[2]], priceDate.year=2014, priceDate.month=12, priceDate.day=15); f1$url <- url; test <- submit_form(s, f1) that seems to work. Perhaps it's a bug that should be reported to rvest. Adding the style='POST' parameter to postForm (RCurl) does the trick as well.
Make HTTP request using httr package
In this article, we will learn how to make an HTTP request using the GET method in R with the httr library.
Installation
The httr library is used to make HTTP requests in R, as it provides a wrapper for the curl package. Install the httr package: install.packages("httr")
Making a simple HTTP request
library(httr) will import the httr package. To make an HTTP request we will use GET() from the httr package and pass a URL; GET() returns raw data, so we will store it in a variable and then print it using print(). Note: you need not use install.packages() if you have already installed the package once.
# installing packages
install.packages("httr")
# importing packages
library(httr)
# GET() will store the raw data in the response variable
response <- GET("https://geeksforgeeks.org")
# printing response/data
print(response)
Output: you might have noticed this output is not the exact URL data; that's because it is raw data.
Convert raw data to char format
To convert raw data to char format we need to use rawToChar() and pass variable_name$content to it, just like in this example.
# installing packages
install.packages("httr")
# importing packages
library(httr)
# GET() will store the raw data in the r variable
r <- GET("https://geeksforgeeks.org")
# rawToChar() will convert raw data to char and store it in the response variable
response <- rawToChar(r$content)
# print response
print(response)
Output:
R Principal Components Analysis Example
Principal components analysis is an unsupervised machine learning technique that seeks to find principal components – linear combinations of the original predictors – that explain a large portion of the variation in a dataset. For a given dataset with p variables, we could examine the scatterplots of each pairwise combination of variables, but the sheer number of scatterplots can become large very quickly. For p predictors, there are p(p-1)/2 scatterplots. So, for a dataset with p = 15 predictors, there would be 105 different scatterplots! Fortunately, PCA offers a way to find a low-dimensional representation of a dataset that captures as much of the variation in the data as possible. If we're able to capture most of the variation in just two dimensions, we could project all of the observations in the original dataset onto a simple scatterplot. The way we find the principal components is as follows: Given a dataset with p predictors X1, X2, …, Xp, calculate Z1, …, ZM to be the M linear combinations of the original p predictors, where Zm = Σj Φjm Xj for some constants Φ1m, Φ2m, …, Φpm, with m = 1, …, M. Z1 is the linear combination of the predictors that captures the most variance possible. Z2 is the next linear combination of the predictors that captures the most variance while being orthogonal (i.e. uncorrelated) to Z1. Z3 is then the next linear combination of the predictors that captures the most variance while being orthogonal to Z2. And so on. In practice, we use the following steps to calculate the linear combinations of the original predictors: 1. Scale each of the variables to have a mean of 0 and a standard deviation of 1. 2. Calculate the covariance matrix for the scaled variables. 3. Calculate the eigenvalues of the covariance matrix. Using linear algebra, it can be shown that the eigenvector that corresponds to the largest eigenvalue is the first principal component. In other words, this particular combination of the predictors explains the most variance in the data. The eigenvector corresponding to the second largest eigenvalue is the second principal component, and so on. This tutorial provides a step-by-step example of how to perform this process in R. Contents: Step 1: Load the Data; Step 2: Calculate the Principal Components; Step 3: Visualize the Results with a Biplot; Step 4: Find Variance Explained by Each Principal Component; Principal Components Analysis in Practice.
Step 1: Load the Data
First we'll load the tidyverse package, which contains several useful functions for visualizing and manipulating data: library(tidyverse) For this example we'll use the USArrests dataset built into R, which contains the number of arrests per 100,000 residents in each U.S. state in 1973 for Murder, Assault, and Rape. It also includes the percentage of the population in each state living in urban areas, UrbanPop. The following code shows how to load and view the first few rows of the dataset:
#load data
data("USArrests")
#view first six rows of data
head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Step 2: Calculate the Principal Components
After loading the data, we can use the R built-in function prcomp() to calculate the principal components of the dataset. Be sure to specify scale = TRUE so that each of the variables in the dataset is scaled to have a mean of 0 and a standard deviation of 1 before calculating the principal components. Also note that eigenvectors in R point in the negative direction by default, so we'll multiply by -1 to reverse the signs.
#calculate principal components
results <- prcomp(USArrests, scale = TRUE)
#reverse the signs
results$rotation <- -1*results$rotation
#display principal components
results$rotation
PC1 PC2 PC3 PC4
Murder 0.5358995 -0.4181809 0.3412327 -0.64922780
Assault 0.5831836 -0.1879856 0.2681484 0.74340748
UrbanPop 0.2781909 0.8728062 0.3780158 -0.13387773
Rape 0.5434321 0.1673186 -0.8177779 -0.08902432
We can see that the first principal component (PC1) has high values for Murder, Assault, and Rape, which indicates that this principal component describes the most variation in these variables. We can also see that the second principal component (PC2) has a high value for UrbanPop, which indicates that this principal component places most of its emphasis on urban population. Note that the principal components scores for each state are stored in results$x. We will also multiply these scores by -1 to reverse the signs:
#reverse the signs of the scores
results$x <- -1*results$x
#display the first six scores
head(results$x)
PC1 PC2 PC3 PC4
Alabama 0.9756604 -1.1220012 0.43980366 -0.154696581
Alaska 1.9305379 -1.0624269 -2.01950027 0.434175454
Arizona 1.7454429 0.7384595 -0.05423025 0.826264240
Arkansas -0.1399989 -1.1085423 -0.11342217 0.180973554
California 2.4986128 1.5274267 -0.59254100 0.338559240
Colorado 1.4993407 0.9776297 -1.08400162 -0.001450164
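As a cross-check of the three manual steps described above (scale, covariance matrix, eigenvalues), a minimal sketch that reproduces the prcomp() variances by hand:
X = scale(USArrests)            # step 1: scale to mean 0, sd 1
S = cov(X)                      # step 2: covariance matrix of the scaled data
e = eigen(S)                    # step 3: eigenvalues / eigenvectors
e$values                        # variances of the principal components
e$vectors                       # loadings (columns may differ in sign from prcomp)
prcomp(USArrests, scale = TRUE)$sdev^2   # matches e$values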
Step 3: Visualize the Results with a Biplot
Next, we can create a biplot – a plot that projects each of the observations in the dataset onto a scatterplot that uses the first and second principal components as the axes. Note that scale = 0 ensures that the arrows in the plot are scaled to represent the loadings.
biplot(results, scale = 0)
From the plot we can see each of the 50 states represented in a simple two-dimensional space. The states that are close to each other on the plot have similar data patterns with regard to the variables in the original dataset. We can also see that certain states are more highly associated with certain crimes than others. For example, Georgia is the state closest to the variable Murder in the plot. If we take a look at the states with the highest murder rates in the original dataset, we can see that Georgia is actually at the top of the list:
#display states with highest murder rates in original dataset
head(USArrests[order(-USArrests$Murder),])
Murder Assault UrbanPop Rape
Georgia 17.4 211 60 25.8
Mississippi 16.1 259 44 17.1
Florida 15.4 335 80 31.9
Louisiana 15.4 249 66 22.2
South Carolina 14.4 279 48 22.5
Alabama 13.2 236 58 21.2
Step 4: Find Variance Explained by Each Principal Component
We can use the following code to calculate the total variance in the original dataset explained by each principal component:
#calculate total variance explained by each principal component
results$sdev^2 / sum(results$sdev^2)
[1] 0.62006039 0.24744129 0.08914080 0.04335752
From the results we can observe the following: the first principal component explains 62% of the total variance in the dataset, the second principal component explains 24.7%, the third explains 8.9%, and the fourth explains 4.3%. Thus, the first two principal components explain a majority of the total variance in the data. This is a good sign because the previous biplot projected each of the observations from the original data onto a scatterplot that only took into account the first two principal components. Thus, it's valid to look at patterns in the biplot to identify states that are similar to each other. We can also create a scree plot – a plot that displays the total variance explained by each principal component – to visualize the results of PCA:
#calculate total variance explained by each principal component
var_explained = results$sdev^2 / sum(results$sdev^2)
#create scree plot
qplot(c(1:4), var_explained) + geom_line() + xlab("Principal Component") + ylab("Variance Explained") + ggtitle("Scree Plot") + ylim(0, 1)
Principal Components Analysis in Practice
In practice, PCA is used most often for two reasons:1. Exploratory Data Analysis – We use PCA when we’re first exploring a dataset and we want to understand which observations in the data are most similar to each other.2. Principal Components Regression – We can also use PCA to calculate principal components that can then be used in principal components regression. This type of regression is often used when multicollinearity exists between predictors in a dataset. The complete R code used in this tutorial can be found here. library(tidyverse) #load data data("USArrests") #view first six rows of data head(USArrests) #calculate principal components results <- prcomp(USArrests, scale = TRUE) #reverse the signs results$rotation <- -1*results$rotation #display principal components results$rotation #reverse the signs of the scores results$x <- -1*results$x #display the first six scores head(results$x) #create biplot to visualize results biplot(results, scale = 0) #calculate total variance explained by each principal component var_explained = results$sdev^2 / sum(results$sdev^2) #create scree plot qplot(c(1:4), var_explained) + geom_line() + xlab("Principal Component") + ylab("Variance Explained") + ggtitle("Scree Plot") + ylim(0, 1) library(magrittr) frank_txt <- readLines("frank.txt") frank_txt %>% paste(collapse="") %>% strsplit(split="") %>% unlist %>% `[`(!. %in% c("", " ", ".", ",")) %>% table %>% barplot Note that you can just stop at the table() and assign the result to a variable, which you can then manipulate however you want, e.g. by plotting it: char_counts <- frank_txt %>% paste(collapse="") %>% strsplit(split="") %>% unlist %>% `[`(!. %in% c("", " ", ".", ",")) %>% table barplot(char_counts) data <- head(iris) library(dplyr) select() selects columns from data filter() subsets rows of data group_by() aggregates data summarise() summarises data (calculating summary statistics) arrange() sorts data mutate() creates new variables library(dplyr) library(ISLR) newspecs <- mutate(auto_specs, hp_to_weight = horsepower / weight)Using Heatmap Data Visualization
1. Website Heatmaps
2. Grid Heatmaps
3. Clustered Heatmaps
Benefits of Heatmap Visualization
When to Use Heatmap Visualization
Best Practices for Using Heatmaps for Data Visualization
Different Tools for Generating Heatmaps
Microsoft Clarity
Google Analytics (Page Analytics)
Types of Data Visualization Techniques
R Heatmap Static and Interactive Visualization
R Packages/functions for drawing heatmaps
Data preparation
R base heatmap: heatmap()
Enhanced heat maps: heatmap.2()
Pretty heat maps: pheatmap()
Interactive heat maps: d3heatmap()
Enhancing heatmaps using dendextend
Complex heatmap
Simple heatmap
Splitting heatmap by rows
Heatmap annotation
Simple annotation
Complex annotation
Combining multiple heatmaps
Application to gene expression matrix
Visualizing the distribution of columns in matrix
Summary
Heatmap Data Visualization
1. Website Heatmaps
2. Grid Heatmaps
3. Clustered Heatmaps
Clustered heatmaps extend the functionality of standard grid heatmaps by incorporating hierarchical clustering to show relationships between rows and columns. This added dimension of information makes clustered heatmaps particularly valuable in fields like biology, where they are commonly used to visualize genetic data. Key characteristics: Hierarchical Clustering: clustered heatmaps use hierarchical clustering algorithms to group similar rows and columns together. This clustering is often displayed as dendrograms (tree-like diagrams) alongside the heatmap, indicating the similarity between different rows or columns. Color Encoding: as with standard heatmaps, the cell color represents the value of the data point. The color intensity or hue typically indicates the magnitude of the values, allowing for easy visual differentiation. Enhanced Patterns and Relationships: by clustering similar rows and columns together, clustered heatmaps make it easier to identify patterns, correlations, and relationships within the data. This can reveal underlying structures that might not be immediately apparent in a standard heatmap. Interactive Exploration: many software tools and libraries allow users to interact with clustered heatmaps, enabling them to zoom in on specific clusters, reorder rows and columns, and explore the data in greater detail.
Benefits of Heatmap Visualization
Heatmaps offer several advantages over traditional data visualization methods:Intuitive Understanding : Colors make it easy to grasp complex data at a glance.Pattern Recognition : Heatmaps help identify patterns and trends that might be missed in numerical data.Engagement : The use of color makes heatmaps visually appealing and engaging.Granularity : Heatmaps provide detailed insights into data, allowing for more granular analysis.When to Use Heatmap Visualization
Heatmaps are versatile and can be used in various scenarios:Website Optimization : To understand user behavior and optimize webpage design.Financial Analysis : To visualize performance metrics and identify areas needing improvement.Marketing : To track campaign performance and customer engagement.Scientific Research : To analyze genetic data and other complex datasets.Geographic Analysis : Visualizing spatial data such as population density, crime rates, or weather patterns.Sports Analytics : Analyzing player movements, game strategies, or performance metrics.Best Practices for Using Heatmaps for Data Visualization
To effectively use heatmaps, consider the following best practices:Choose the Right Color Scale : Selecting an appropriate color scale is crucial. Sequential scales are ideal for data that progresses in one direction, while diverging scales are suitable for data with a central neutral point and values that can be both positive and negative.Ensure Sufficient Data: Heatmaps require a large amount of data to be accurate. Analyzing heatmaps with insufficient data can lead to misleading conclusions.Combine with Other Analytics : Heatmaps should be used in conjunction with other analytics tools to provide a comprehensive understanding of the data. For example, combining heatmaps with form analytics can offer deeper insights into user behavior.Use Legends : Always include a legend to help interpret the color scale used in the heatmap. This ensures that viewers can accurately understand the data being presented.Highlight Key Areas : Use heatmaps to draw attention to important areas of the data. For example, in a website heatmap, highlight areas with the most user interaction to focus on optimizing those sections.When it comes to generating heatmaps, several tools stand out for their features, ease of use, and effectiveness. Here are some of the tools for generating heatmaps:
Different Tools for Generating Heatmaps
Microsoft Clarity: Microsoft Clarity is a free tool that offers heatmaps along with session recordings and other analytics features. It is designed to help users understand how visitors interact with their website and identify areas for improvement.
Google Analytics (Page Analytics): Google Analytics offers a heatmap feature through its Chrome extension, Page Analytics. This tool provides a visual representation of where visitors click on a webpage, helping to identify popular and underperforming elements.
Types of Data Visualization Techniques
Bar Charts: Ideal for comparing categorical data or displaying frequencies, bar charts offer a clear visual representation of values.Line Charts: Perfect for illustrating trends over time, line charts connect data points to reveal patterns and fluctuations.Pie Charts: Efficient for displaying parts of a whole, pie charts offer a simple way to understand proportions and percentages.Scatter Plots: Showcase relationships between two variables, identifying patterns and outliers through scattered data points.Histograms: Depict the distribution of a continuous variable, providing insights into the underlying data patterns.Heatmaps: Visualize complex data sets through color-coding, emphasizing variations and correlations in a matrix.Box Plots: Unveil statistical summaries such as median, quartiles, and outliers, aiding in data distribution analysis.Area Charts: Similar to line charts but with the area under the line filled, these charts accentuate cumulative data patterns.Bubble Charts: Enhance scatter plots by introducing a third dimension through varying bubble sizes, revealing additional insights.Treemaps: Efficiently represent hierarchical data structures, breaking down categories into nested rectangles.Violin Plots : Violin plots combine aspects of box plots and kernel density plots, providing a detailed representation of the distribution of data.Word Clouds : Word clouds are visual representations of text data where words are sized based on their frequency.3D Surface Plots : 3D surface plots visualize three-dimensional data, illustrating how a response variable changes in relation to two predictor variables.Network Graphs : Network graphs represent relationships between entities using nodes and edges. They are useful for visualizing connections in complex systems, such as social networks, transportation networks, or organizational structures.Sankey Diagrams : Sankey diagrams visualize flow and quantity relationships between multiple entities. Often used in process engineering or energy flow analysis.https://www.datanovia.com/en/lessons/heatmap-in-r-static-and-interactive-visualization/ A
R Heatmap Static and Interactive Visualization heatmap (orheat map ) is another way to visualize hierarchical clustering. It’s also called a false colored image, where data values are transformed to color scale. Heat maps allow us to simultaneously visualize clusters of samples and features. First hierarchical clustering is done of both the rows and the columns of the data matrix. The columns/rows of the data matrix are re-ordered according to the hierarchical clustering result, putting similar observations close to each other. The blocks of ‘high’ and ‘low’ values are adjacent in the data matrix. Finally, a color scheme is applied for the visualization and the data matrix is displayed. Visualizing the data matrix in this way can help to find the variables that appear to be characteristic for each sample cluster. Previously, we described how to visualize dendrograms. Here, we’ll demonstrate how to draw and arrange a heatmap in R.There are a multiple numbers of R packages and functions for drawing interactive and static heatmaps, including: heatmap() [R base function, stats package]: Draws a simple heatmap heatmap.2() [gplots R package]: Draws an enhanced heatmap compared to the R base function. pheatmap() [pheatmap R package]: Draws pretty heatmaps and provides more control to change the appearance of heatmaps. d3heatmap() [d3heatmap R package]: Draws an interactive/clickable heatmap Heatmap() [ComplexHeatmap R/Bioconductor package]: Draws, annotates and arranges complex heatmaps (very useful for genomic data analysis) Here, we start by describing the 5 R functions for drawing heatmaps. Next, we’ll focus on the ComplexHeatmap package, which provides a flexible solution to arrange and annotate multiple heatmaps. It allows also to visualize the association between different data from different sources.
R Packages/functions for drawing heatmaps We use mtcars data as a demo data set. We start by standardizing the data to make variables comparable:
Data preparation df <- scale(mtcars) The built-in R heatmap() function [in stats package] can be used. A simplified format is:
R base heatmap: heatmap() heatmap(x, scale = "row" )x : a numeric matrixscale : a character indicating if the values should be centered and scaled in either the row direction or the column direction, or none. Allowed values are in c(“row”, “column”, “none”). Default is “row”.# Default plot heatmap(df, scale ="none" )In the plot above, high values are in red and low values are in yellow. It’s possible to specify a color palette using the argument col, which can be defined as follow: Using custom colors:
col <- colorRampPalette(c("red", "white", "blue"))(256)
Or, using an RColorBrewer color palette:
library("RColorBrewer")
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
Additionally, you can use the arguments RowSideColors and ColSideColors to annotate rows and columns, respectively. For example, the R code below will customize the heatmap as follows: an RColorBrewer color palette name is used to change the appearance, and the arguments RowSideColors and ColSideColors are used to annotate rows and columns respectively. The expected values for these options are vectors containing color names specifying the classes for rows/columns.
# Use RColorBrewer color palette names library ("RColorBrewer" ) col <- colorRampPalette(brewer.pal(10 ,"RdYlBu" ))(256 ) heatmap(df, scale ="none" , col = col, RowSideColors = rep(c("blue" ,"pink" ), each =16 ), ColSideColors = c(rep("purple" ,5 ), rep("orange" ,6 )))![]()
The function heatmap.2() [in gplots package] provides many extensions to the standard R heatmap() function presented in the previous section.
Enhanced heat maps: heatmap.2() # install.packages("gplots") library ("gplots" ) heatmap.2(df, scale ="none" , col = bluered(100 ), trace ="none" , density.info ="none" )Other arguments can be used including: labRow, labCol hclustfun: hclustfun=function(x) hclust(x, method=“ward”) In the R code above, the bluered() function [in gplots package] is used to generate a smoothly varying set of colors. You can also use the following color generator functions: colorpanel(n, low, mid, high) n: Desired number of color elements to be generated low, mid, high: Colors to use for the Lowest, middle, and highest values. mid may be omitted. redgreen(n), greenred(n), bluered(n) and redblue(n)
First, install the pheatmap package: install.packages(“pheatmap”); then type this:
Pretty heat maps: pheatmap() library ("pheatmap" ) pheatmap(df, cutree_rows =4 )Arguments are available for changing the default clustering metric (“euclidean”) and method (“complete”). It’s also possible to annotate rows and columns using grouping variables.
First, install the d3heatmap package: install.packages(“d3heatmap”); then type this:
Interactive heat maps: d3heatmap() library ("d3heatmap" ) d3heatmap(scale(mtcars), colors ="RdYlBu" , k_row =4 ,# Number of groups in rows k_col =2 # Number of groups in columns )The d3heamap() function makes it possible to: Put the mouse on a heatmap cell of interest to view the row and the column names as well as the corresponding value. Select an area for zooming. After zooming, click on the heatmap again to go back to the previous display
The package dendextend can be used to enhance functions from other packages. The mtcars data is used in the following sections. We’ll start by defining the order and the appearance for rows and columns using dendextend. These results are used in others functions from others packages. The order and the appearance for rows and columns can be defined as follow:
Enhancing heatmaps using dendextend The arguments above can be used in the functions below: library (dendextend)# order for rows Rowv <- mtcars %>% scale %>% dist %>% hclust %>% as.dendrogram %>% set("branches_k_color" , k =3 ) %>% set("branches_lwd" ,1.2 ) %>% ladderize# Order for columns: We must transpose the data Colv <- mtcars %>% scale %>% t %>% dist %>% hclust %>% as.dendrogram %>% set("branches_k_color" , k =2 , value = c("orange" ,"blue" )) %>% set("branches_lwd" ,1.2 ) %>% ladderizeThe standard heatmap() function [in stats package]:
heatmap(scale(mtcars), Rowv = Rowv, Colv = Colv, scale = "none" )The enhanced heatmap.2() function [in gplots package]:
library (gplots) heatmap.2(scale(mtcars), scale ="none" , col = bluered(100 ), Rowv = Rowv, Colv = Colv, trace ="none" , density.info ="none" )The interactive heatmap generator d3heatmap() function [in d3heatmap package]:
library ("d3heatmap" ) d3heatmap(scale(mtcars), colors ="RdBu" , Rowv = Rowv, Colv = Colv)
Complex heatmap ComplexHeatmap is an R/bioconductor package, developed by Zuguang Gu, which provides a flexible solution to arrange and annotate multiple heatmaps. It allows also to visualize the association between different data from different sources. It can be installed as follow:if (!requireNamespace("BiocManager" , quietly = TRUE)) install.packages("BiocManager" ) BiocManager::install("ComplexHeatmap" )Simple heatmap
You can draw a simple heatmap as follow:library (ComplexHeatmap) Heatmap(df, name ="mtcars" ,#title of legend column_title ="Variables" , row_title ="Samples" , row_names_gp = gpar(fontsize =7 )# Text size for row names )Additional arguments:
show_row_names, show_column_names: whether to show row and column names, respectively. Default value is TRUE show_row_hclust, show_column_hclust: logical value; whether to show row and column clusters. Default is TRUE clustering_distance_rows, clustering_distance_columns: metric for clustering: “euclidean”, “maximum”, “manhattan”, “canberra”, “binary”, “minkowski”, “pearson”, “spearman”, “kendall”) clustering_method_rows, clustering_method_columns: clustering methods: “ward.D”, “ward.D2”, “single”, “complete”, “average”, … (see
To specify a custom colors, you must use the the colorRamp2() function [circlize package], as follow:?hclust ).It’s also possible to use library (circlize) mycols <- colorRamp2(breaks = c(-2 ,0 ,2 ), colors = c("green" ,"white" ,"red" )) Heatmap(df, name ="mtcars" , col = mycols)RColorBrewer color palettes:We can also customize the appearance of dendograms using the function color_branches() [dendextend package]: library ("circlize" )library ("RColorBrewer" ) Heatmap(df, name ="mtcars" , col = colorRamp2(c(-2 ,0 ,2 ), brewer.pal(n=3 , name="RdBu" )))library (dendextend) row_dend = hclust(dist(df))# row clustering col_dend = hclust(dist(t(df)))# column clustering Heatmap(df, name ="mtcars" , row_names_gp = gpar(fontsize =6.5 ), cluster_rows = color_branches(row_dend, k =4 ), cluster_columns = color_branches(col_dend, k =2 ))![]()
Splitting heatmap by rows
You can split the heatmap using either the k-means algorithm or a grouping variable. It’s important to use the set.seed() function when performing k-means so that the results obtained can be reproduced precisely at a later time. To split the dendrogram using k-means, type this:To split by a grouping variable, use the argument split. In the following example we’ll use the levels of the factor variable cyl [in mtcars data set] to split the heatmap by rows. Recall that the column cyl corresponds to the number of cylinders. # Divide into 2 groups set.seed(2 ) Heatmap(df, name ="mtcars" , k =2 )# split by a vector specifying rowgroups Heatmap(df, name ="mtcars" , split = mtcars$cyl, row_names_gp = gpar(fontsize =7 ))Note that, split can be also a data frame in which different combinations of levels split the rows of the heatmap.
# Split by combining multiple variables Heatmap(df, name ="mtcars" , split = data.frame(cyl = mtcars$cyl, am = mtcars$am), row_names_gp = gpar(fontsize =7 ))It’s also possible to combine km and split:
Heatmap(df, name = If you want to use other partitioning method, rather than k-means, you can easily do it by just assigning the partitioning vector to split. In the R code below, we’ll use pam() function [cluster package]. pam() stands for Partitioning of the data into k clusters “around medoids”, a more robust version of K-means."mtcars" , col = mycol, km =2 , split = mtcars$cyl)# install.packages("cluster") library ("cluster" ) set.seed(2 ) pa = pam(df, k =3 ) Heatmap(df, name ="mtcars" , col = mycol, split = paste0("pam" , pa$clustering))Heatmap annotation
The HeatmapAnnotation class is used to define annotation on row or column. A simplified format is:HeatmapAnnotation(df, name, col, show_legend) df : a data.frame with column namesname : the name of the heatmap annotationcol : a list of colors which contains color mapping to columns in df For the example below, we’ll transpose our data to have the observations in columns and the variables in rows.df <- t(df) Simple annotation
A vector, containing discrete or continuous values, is used to annotate rows or columns. We’ll use the qualitative variables cyl (levels = “4”, “5” and “8”) and am (levels = “0” and “1”), and the continuous variable mpg to annotate columns. For each of these 3 variables, custom colors are defined as follow:# Define colors for each levels of qualitative variables # Define gradient color for continuous variable (mpg) col = list(cyl = c("4" ="green" ,"6" ="gray" ,"8" ="darkred" ), am = c("0" ="yellow" ,"1" ="orange" ), mpg = circlize::colorRamp2(c(17 ,25 ), c("lightblue" ,"purple" )) )# Create the heatmap annotation ha <- HeatmapAnnotation( cyl = mtcars$cyl, am = mtcars$am, mpg = mtcars$mpg, col = col )# Combine the heatmap and the annotation Heatmap(df, name ="mtcars" , top_annotation = ha)It’s possible to hide the annotation legend using the argument show_legend = FALSE as follow:
ha <- HeatmapAnnotation( cyl = mtcars$cyl, am = mtcars$am, mpg = mtcars$mpg, col = col, show_legend = FALSE ) Heatmap(df, name = "mtcars" , top_annotation = ha)Complex annotation
In this section we’ll see how to combine heatmap and some basic graphs to show the data distribution. For simple annotation graphics, the following functions can be used: anno_points(), anno_barplot(), anno_boxplot(), anno_density() and anno_histogram(). An example is shown below:# Define some graphics to display the distribution of columns .hist = anno_histogram(df, gp = gpar(fill ="lightblue" )) .density = anno_density(df, type ="line" , gp = gpar(col ="blue" )) ha_mix_top = HeatmapAnnotation( hist = .hist, density = .density, height = unit(3.8 ,"cm" ) )# Define some graphics to display the distribution of rows .violin = anno_density(df, type ="violin" , gp = gpar(fill ="lightblue" ), which ="row" ) .boxplot = anno_boxplot(df, which ="row" ) ha_mix_right = HeatmapAnnotation(violin = .violin, bxplt = .boxplot, which ="row" , width = unit(4 ,"cm" ))# Combine annotation with heatmap Heatmap(df, name ="mtcars" , column_names_gp = gpar(fontsize =8 ), top_annotation = ha_mix_top) + ha_mix_right![]()
Combining multiple heatmaps
Multiple heatmaps can be arranged as follow:# Heatmap 1 ht1 = Heatmap(df, name ="ht1" , km =2 , column_names_gp = gpar(fontsize =9 ))# Heatmap 2 ht2 = Heatmap(df, name ="ht2" , col = circlize::colorRamp2(c(-2 ,0 ,2 ), c("green" ,"white" ,"red" )), column_names_gp = gpar(fontsize =9 ))# Combine the two heatmaps ht1 + ht2You can use the option width = unit(3, “cm”)) to control the size of the heatmaps. Note that when combining multiple heatmaps, the first heatmap is considered as the main heatmap. Some settings of the remaining heatmaps are auto-adjusted according to the setting of the main heatmap. These include: removing row clusters and titles, and adding splitting. The draw() function can be used to customize the appearance of the final image:
draw(ht1 + ht2, row_title = Legends can be removed using the arguments show_heatmap_legend = FALSE, show_annotation_legend = FALSE."Two heatmaps, row title" , row_title_gp = gpar(col ="red" ), column_title ="Two heatmaps, column title" , column_title_side ="bottom" ,# Gap between heatmaps gap = unit(0.5 ,"cm" ))In gene expression data, rows are genes and columns are samples. More information about genes can be attached after the expression heatmap such as gene length and type of genes.
Application to gene expression matrix expr <- readRDS(paste0(system.file(package = "ComplexHeatmap" ),"/extdata/gene_expression.rds" )) mat <- as.matrix(expr[, grep("cell" , colnames(expr))]) type <- gsub("s\\d+_" ,"" , colnames(mat)) ha = HeatmapAnnotation( df = data.frame(type = type), annotation_height = unit(4 ,"mm" ) ) Heatmap(mat, name ="expression" , km =5 , top_annotation = ha, show_row_names = FALSE, show_column_names = FALSE) + Heatmap(expr$length, name ="length" , width = unit(5 ,"mm" ), col = circlize::colorRamp2(c(0 ,100000 ), c("white" ,"orange" ))) + Heatmap(expr$type, name ="type" , width = unit(5 ,"mm" )) + Heatmap(expr$chr, name ="chr" , width = unit(5 ,"mm" ), col = circlize::rand_color(length(unique(expr$chr))))It’s also possible to visualize genomic alterations and to integrate different molecular levels (gene expression, DNA methylation, …). Read the vignette, on Bioconductor, for further examples.
Visualizing the distribution of columns in matrix densityHeatmap(scale(mtcars)) The dashed lines on the heatmap correspond to the five quantile numbers. The text for the five quantile levels are added in the right of the heatmap.
We described many functions for drawing heatmaps in R (from basic to complex heatmaps). A basic heatmap can be produced using either the R base function heatmap() or the function heatmap.2() [in the gplots package].
Summary
The pheatmap() function, in the package of the same name, creates pretty heatmaps, where ones has better control over some graphical parameters such as cell size. The Heatmap() function [in ComplexHeatmap package] allows us to easily, draw, annotate and arrange complex heatmaps. This might be very useful in genomic fields.find second (third...) highest/lowest value in vector
x <- c(12.45,34,4,0,-234,45.6,4) max( x[x!=max(x)] ) min( x[x!=min(x)] )difference between require() and library()
benefit of require() is that it returns a logical value by default. TRUE if the packages is loaded, FALSE if it isn't. test <- library("abc") Error in library("abc") : there is no package called 'abc' test Error: object 'test' not found test <- require("abc") Loading required package: abc Warning message: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, : there is no package called 'abc' test [1] FALSEABC analysis
ABC analysis is a very simple analysis method. In data analysis, Pareto Analysis and ABC Analysis are two commonly used classification tools, widely applied in inventory management, sales analysis and customer segmentation. Pareto analysis is also known as the 80/20 rule: in many situations, 80% of the effect comes from 20% of the causes. Not every case follows the 80/20 split strictly; sometimes it is 70/30 or 90/10, so the Pareto model should be applied flexibly according to the actual situation. ABC Analysis classifies inventory items into three categories based on their value and importance to the business: A (high-value items), B (medium-value items), and C (low-value items). The A items — the most important — should be managed with extra care and attention.
library(plyr)
# creates fake data
Part.Number = c(rep( letters[15:1], seq_along(letters[15:1]) ))
Price = c(rep( 1:15, seq_along(15:1) ))
Qty.Sold = sample(1:120)
z <- data.frame(Part.Number, Price, Qty.Sold)
z[90:120, ]$Qty.Sold <- z[90:120, ]$Qty.Sold * 10
# summarise Revenue
z.summary <- ddply(z, .(Part.Number), summarise, Revenue = sum(Price * Qty.Sold))
# classify Revenue
z.summary <- within(z.summary, {
  Percent.Revenue <- cumsum(rev(sort(Revenue)))/sum(Revenue)
  ABC <- ifelse(Percent.Revenue > 0.91, "C", ifelse(Percent.Revenue < 0.81, "A", "B"))
})
z.summary
# Part.Number Revenue Percent.Revenue ABC
# 1 a 140850 0.4461246 A
# 2 b 113960 0.8070784 A
# 3 c 21788 0.8760892 B
# 4 d 8220 0.9021250 B
# 5 e 7238 0.9250504 C
# 6 f 6390 0.9452900 C
Create Pareto Chart
A Pareto graph is a type of graph that displays the frequencies of the different categories with the cumulated frequencies of the categories.Step 1: Create the Data for Pareto Chart Let’s create a data frame with the product and its count. df <- data.frame( product=c('A', 'B', 'C', 'D', 'E', 'F'), count=c(40, 57, 50, 82, 17, 16)) df product count 1 A 40 2 B 57 3 C 50 4 D 82 5 E 17 6 F 16Step 2: Create the Pareto Chart Use of the pareto.chart() function from the qcc package, library(qcc) #create Pareto chart pareto.chart(df$count) Pareto chart analysis for df$count Frequency Cum.Freq. Percentage Cum.Percent. D 82.00000 82.00000 31.29771 31.29771 B 57.00000 139.00000 21.75573 53.05344 C 50.00000 189.00000 19.08397 72.13740 A 40.00000 229.00000 15.26718 87.40458 E 17.00000 246.00000 6.48855 93.89313 F 16.00000 262.00000 6.10687 100.00000The above table output displays the frequency and cumulative frequency of each product.
Step 3: Modify the Pareto Chart
We can make aesthetic changes in the Pareto chart:
pareto.chart(df$count, main='Pareto Chart', col=heat.colors(length(df$count)))
R Generate Plotly JSON in html/javascript
The below function outputs two files. Javascript file containing everything you need from Plotly Optional HTML file with appropriate code to draw the plot. rm(list = ls()) library(tidyverse) library(stringi) library(plotly) plotly_to_js <- function( plotly.object, div.id = 'plot1', output.html = FALSE, output.file = NULL, output.dir = NULL, output.url = NULL ){ if(is.null(output.file)){ output.file <- div.id %s+% '.js' } if(is.null(output.dir)){ js.filename <- getwd() %s+% '/' %s+% output.file }else{ js.filename <- output.dir %s+% '/' %s+% output.file } if(is.null(output.url)){ output.url <- div.id %s+% '.html' } json <- plotly_json(plotly.object,FALSE) js.output <- "(function(){ \n window.PLOTLYENV={'BASE_URL': 'https://plotly.com'}; \n \n var gd = document.getElementById('%div.id%') \n var resizeDebounce = null; \n function resizePlot() { \n var bb = gd.getBoundingClientRect(); \n Plotly.relayout(gd, { \n width: bb.width, \n height: bb.height \n }); \n } \n Plotly.plot(gd, \n %json% \n ); \n }()); \n " js.output <- gsub('%div.id%', div.id, js.output) js.output <- gsub('%json%', json, js.output) fileConn<-file(js.filename) writeLines(js.output, fileConn) close(fileConn) if(output.html){ output.html <- "<html> \n <head> \n <meta charset=\"utf-8\"/> \n </head> \n <body> \n \n <script src='https://cdn.plot.ly/plotly-latest.min.js'></script> \n \n <div id=\"%div.id%\" style=\"width: 100%; height: 100%;\" class=\"plotly-graph-div\"></div> \n <script type=\"text/javascript\" src=\"%js.filename%\"></script> \n </body>\n </html>\n" output.html <- gsub('%div.id%', div.id, output.html) output.html <- gsub('%js.filename%', js.filename, output.html) fileConn <- file(output.url) writeLines(output.html, fileConn) close(fileConn) } } x <- c(1:100) random_y <- rnorm(100, mean = 0) data <- data.frame(x, random_y) fig <- plot_ly(data, x = ~x, y = ~random_y, type = 'scatter', mode = 'lines') plotly_to_js (fig, output.html = TRUE)相关性分析
世间万事万物绝不是独立发展的,世间万事万物是相互联系的。 有些事物之间存在联系,有些事物之间不存在联系。 有些事物之间存在直接联系,有些事物之间存在间接联系。 有些事物之间存在的联系比较强,有些事物之间存在的联系比较弱。 我们的目的是找到哪些事物存在联系? 并作出判断,判断这种联系是相关关系呢? 还是因果关系呢? 这里我们主要关注的是相关关系。 如果事物之间存在相关关系,那么,这种相关关系是直接关系呢? 还是间接关系呢? 这种相关关系是强相关关系呢? 还是弱相关关系呢? 这种相关关系有没有统计学显著性呢? 相关分析的目的是研究事件之间是否存在某种相关关系? 如果事件之间确实存在某种相关关系,那么,需要进一步定量计算相关的方向和相关的强度。 这里需要注意的是:相关关系不是因果关系,相关关系中的事件之间没有先后顺序。 例如,在一个系统内,我们观察到了 A 事件和 B 事件,发现 A 事件和 B 事件同时变化,这就说明 A 事件和 B 事件之间可能存在相关关系。 在相关关系的基础上进一步深入研究,如果我们能说清楚是 A 事件导致了 B 事件,还是 B 事件导致了 A 事件,我们就得到了 A 事件和 B 事件之间的因果关系。 前面相关分析的定义已经说了,相关分析就是寻找相关关系,那么,什么是相关关系? 相关关系有哪些呢? 事件之间的相关关系可以分为两类:函数关系和统计关系。 那么,什么是函数关系呢? 什么是统计关系呢? 函数关系是指两个事件之间的取值能用数学函数来唯一描述,即两个事件之间的取值是一一对应关系。 例如,我们要卖衣服,卖衣服的销售总额与销售量之间就是函数关系。 销售总额等于销售量乘以销售单价。 函数关系不是我们关注的重点,我们重点关注的是统计关系。 统计关系是指两个事件之间的取值不能用数学函数来唯一描述,即两个事件之间的取值不是一一对应关系,但是两个事件之间的取值按照某种规律在一定范围内变化。 例如,子女身高与父母身高,子女身高和父母身高不能用一个函数关系一一对应,但是子女身高和父母身高确实存在一定规律,多数情况下,父母身高越高,子女身高就越高。 这种具有一定规律的关系,就是统计关系。 统计关系按照统计相关的表现形式,也可以分成三个不同的统计相关类型,分别是简单相关关系、偏相关关系、距离相关关系。 这里我们重点关注简单相关关系。 那么,什么是简单相关关系呢? 线性相关关系就是直线相关关系。 其实我们平时常说的相关关系,基本上都是指的线性相关关系,这种线性相关关系有方向,也有强度。 那么,怎么表征线性相关关系的方向呢? 怎么表征线性相关关系的强度呢? 线性相关关系的方向无非两种,分别是正向相关、负向相关。 表征线性相关关系方向的方法有三种,分别是散点图、相关系数、线性拟合。 第一种方式是散点图,一个事件的取值随着另一个事件的取值的增加而增加,这种线性相关关系就是正向相关,一个事件的取值随着另一个事件的取值的增加而减少,这种线性相关关系就是负向相关。 第二种方式是相关系数,相关系数是正值就是正向相关,相关系数是负值就是负向相关。 第三种方式是线性拟合,拟合系数是正值就是正向相关,拟合系数是负值就是负向相关。 表征线性相关关系强度的方法有一种,就是相关系数。 相关系数有三种,分别是 pearson 相关系数、spearman 相关系数、kendall 相关系数。 每种相关系数都有其特殊的使用条件,那么,三种相关系数的使用条件分别是什么呢? pearson 相关系数是最常使用的相关系数,pearson 相关系数等于 XY 的协方差除以 X 的标准差和 Y 的标准差。 两个变量都是连续型变量(自己判断); 两个变量是配对的,来自于同一个个体(自己判断); 两个变量之间存在线性关系(散点图/散点图矩阵判断); 两个变量没有明显的异常值,因为异常值对pearson相关性分析的结果影响很大(箱线图判断); 两个变量呈双变量正态分布或近似正态分布(Q-Q图判断、Shapiro-Wilk检验判断)。 这里需要注意第5个条件,两个变量呈双变量正态分布,这里说的双变量正态分布不是两个变量都是正态分布的意思,双变量正态分布是另一个统计学概念,可以参考以下资料网络资料1,网络资料2。 通俗地说,如果两个变量呈双变量正态分布,那么这两个变量一定都是正态分布。 如果两个变量都是正态分布,然而这两个变量不一定呈双变量正态分布。 一般情况下,我们都用“两个变量都是正态分布”这个条件代替“两个变量呈双变量正态分布”这个条件,因为SPSS统计软件不能检验双变量正态分布,这样的替代条件,我们目前还是可以接受的。 pearson 相关系数的取值范围是[-1, 1] 。 在实际应用过程中,我们往往会将 pearson 相关系数的取值划分为 4 个区间。
相关系数 | 相关程度 |
---|---|
0.8-1.0 | 高度相关 |
0.5-0.8 | 中度相关 |
0.3-0.5 | 低度相关 |
0.0-0.3 | 不相关 |
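A minimal sketch of computing the three coefficients in R (the built-in mtcars data set is used here as stand-in input; it is an assumption, not data from the text above):
data(mtcars)
x <- mtcars$disp
y <- mtcars$mpg
cor(x, y, method = "pearson")        # Pearson correlation coefficient
cor(x, y, method = "spearman")       # Spearman rank correlation
cor(x, y, method = "kendall")        # Kendall rank correlation
cor.test(x, y, method = "pearson")   # coefficient plus significance test (p value)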
bulk RNA-seq: PCA, corrplot, volcano plot, radar plot, heatmap, enrich plot, kegg map, PPI network
2 scRNA-seq: umap, heatmap, dotplot, vlnplot, monocle, CCI, gene set score, cell preference
r包推荐: ggsc
3 conjoint analysis: ssGSEA + survival, 富集弦图, 彗星图, 层级网络图
聚类应该聚成多少类?
确定聚类数(即K值)是聚类分析中的一个关键问题。 科学地确定合适的聚类数可以提高聚类结果的准确性和解释性。 以下介绍一些常用的方法和技术,用于确定最佳的聚类数。
① 肘部法则
② 经验逻辑
③ 轮廓系数
④ 卡林斯基-哈拉巴斯指数(Calinski-Harabasz Index)
⑤ 戴维斯-博尔丁指数(Davies-Bouldin Index)
⑥ Gap Statistic(Gap值)
⑦ 交叉验证(Cross-Validation)
⑧ 信息准则方法
Residual sum of squares / Sum of Squares: SST, SSR, SSE. The sum of squares total (SST), also called the total sum of squares (TSS), is a statistical measure of variability. It indicates the dispersion of data points around the mean and how much the dependent variable deviates from the predicted values in regression analysis. https://365datascience.com/tutorials/statistics-tutorials/sum-squares/
① 肘部法则
把握一个核心:类别内差异尽量小,类别间差异尽量大。 核心概念:误差平方和SSE 该值可用于测量各点与中心点的距离情况(类别内的差异大小),理论上是希望越小越好。 该指标可用于辅助判断聚类类别个数,如果发现比如从n个聚类到n+1个类别时SSE值减少幅度明显很大,此时选择n+1个聚类类别较好。 类似于SPSS中的碎石图。 SSE:每个点到其所属簇中心的距离平方和。 随着K值的增加,SSE通常会减小,因为样本会被划分得越来越精细。 肘方法的核心在于寻找SSE下降速度减缓的转折点,即“肘点”,这个点通常被认为是数据集中真实的聚类数量。 它的本质与主成分分析、因子分析中的"碎石图"并无差别,但这种方法主观性较强。![]()
(图为示例) 应用范围:最常用于K-means聚类算法。 实现工具: R(factoextra包):代码如下。
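The code itself is not reproduced in the text, so here is a minimal sketch of the elbow method with the factoextra package (the scaled built-in USArrests data stands in for real data; both the data choice and k.max are assumptions):
library(factoextra)
df <- scale(USArrests)                      # any numeric matrix / data frame works here
# total within-cluster sum of squares (SSE/WSS) for k = 1..10; look for the "elbow" in the curve
fviz_nbclust(df, kmeans, method = "wss", k.max = 10)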
(代码方案及结果可视化)
② 经验逻辑
从实际应用场景来看,分类数量不宜太多,以3-6类为主流的分类标准,主要考虑到在实际商业应用中分类过多并不能更有效指导商业成功,以有效作为第一性原则。
③ 轮廓系数
轮廓系数结合了内聚度和分离度两种因素,即同时考察了组内相似性和组间差异性。 对于每个样本,计算它与同簇其他样本的平均距离(a)和它与最近簇内所有样本的平均距离(b)。 轮廓系数的值介于[-1, 1]之间,接近1表示样本很好地匹配到了簇,接近-1则表示样本更匹配其他簇。 整体轮廓系数为所有样本轮廓系数的平均值。 对于不同的K值,计算对应的平均轮廓系数,选择使得轮廓系数最大的K值作为最佳聚类数。 应用范围:K-means聚类和系统聚类等均适用。 实现工具:R(cluster-silhouette) :代码如下。![]()
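The silhouette code is likewise only referenced, not shown; a minimal sketch using the cluster and factoextra packages (USArrests again serves as assumed demo data):
library(cluster)     # silhouette(), pam()
library(factoextra)  # fviz_nbclust()
df <- scale(USArrests)
# average silhouette width across k for k-means; pick the k that maximises it
fviz_nbclust(df, kmeans, method = "silhouette", k.max = 10)
# silhouette for one specific solution, e.g. k-medoids (PAM) with k = 3
pm  <- pam(df, k = 3)
sil <- silhouette(pm)
mean(sil[, "sil_width"])   # overall average silhouette width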
(代码方案及结果可视化,采用Kmeans时聚类数为2的时候系数最高)
(代码方案及结果可视化,采用K-Medoids算法时聚类数为3时系数最高)
④ 卡林斯基-哈拉巴斯指数(Calinski-Harabasz Index)
一种评估聚类效果的指标,它是簇间离散度与簇内离散度的比值。 该指数的值越大,表示簇间的差异性越大而簇内的差异性越小,聚类效果越好。 通过计算不同K值的CH指数并选择最大值对应的K值作为最佳聚类数。 应用范围:通常用于评估基于方差的聚类方法的聚类质量,最常见的是K均值聚类,层次聚类&DBSCAN也会用到。 实现工具:R(自定义函数) :内置函数似乎无法实现,需要自定义函数calinski_harabasz_index,代码有点复杂,如下。![]()
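Since the custom function is only described and its code screenshot is not included, the following is one possible (hedged) implementation of calinski_harabasz_index for k-means, built directly from the ratio of between- to within-cluster dispersion; the demo data and the 2:10 range of k are assumptions:
calinski_harabasz_index <- function(data, k) {
  data <- as.matrix(data)
  km   <- kmeans(data, centers = k, nstart = 25)
  n    <- nrow(data)
  # CH = [B/(k-1)] / [W/(n-k)], using kmeans' between- and within-cluster sums of squares
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}
df <- scale(USArrests)
ch <- sapply(2:10, function(k) calinski_harabasz_index(df, k))
plot(2:10, ch, type = "b", xlab = "k", ylab = "Calinski-Harabasz index")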
(代码方案及结果可视化,聚类数为2的时候指数最高)
⑤ 戴维斯-博尔丁指数(Davies-Bouldin Index)
是一种评估聚类效果的指标,它基于聚类内距离和聚类间距离的比率。 DB指数的值越小,表示聚类效果越好。 通过计算不同K值的DB指数并选择最小值对应的K值作为最佳聚类数。 戴维斯-博尔丁指数计算简单,易于理解,因此可以广泛应用于各种聚类方法的评估中。 应用范围:K-Means聚类、系统聚类、DBSCAN聚类等 实现工具:R(自定义函数) :网上说是cluster包有包含计算DBI的函数,但尝试失败,需要自定义函数davies_bouldin_index,代码有点复杂,如下。![]()
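Only the idea of a davies_bouldin_index function is given, so this is a hedged reconstruction rather than the original code; lower values indicate more compact, better-separated clusters (demo data and k range are assumptions):
davies_bouldin_index <- function(data, k) {
  data <- as.matrix(data)
  km   <- kmeans(data, centers = k, nstart = 25)
  centers <- km$centers
  # average distance of each cluster's points to its own centroid
  s <- sapply(seq_len(k), function(i) {
    pts <- data[km$cluster == i, , drop = FALSE]
    mean(sqrt(rowSums((pts - matrix(centers[i, ], nrow(pts), ncol(data), byrow = TRUE))^2)))
  })
  m <- as.matrix(dist(centers))                                   # distances between centroids
  d <- sapply(seq_len(k), function(i) max((s[i] + s[-i]) / m[i, -i]))
  mean(d)
}
df <- scale(USArrests)
db <- sapply(2:10, function(k) davies_bouldin_index(df, k))
plot(2:10, db, type = "b", xlab = "k", ylab = "Davies-Bouldin index")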
(代码方案及结果可视化,聚类数为2的时候指数最高)
⑥ Gap Statistic(Gap值)
通过比较实际数据的聚类结果与随机生成的数据的聚类结果来估计数据集中的聚类数目。 这种方法不依赖于特定的聚类算法,可以与多种聚类方法结合使用。 但要注意的是,Gap Statistic本身并不能确定最佳的聚类数目,而是通过比较不同聚类数目下的Gap Statistic值来帮助 选择最佳的聚类数目(最大值通常对应于最佳的聚类数目,因为它表示在这个数目下,原始数据的聚类结构与随机数据的聚类结构差异最大)。 因此,在使用Gap Statistic时,还需要结合其他方法来确定最佳的聚类数目。原理 : 对于真实数据集,聚类结果会形成紧密的簇。 而对于随机数据集,由于数据是随机生成的,因此不会形成明显的簇。 因此,如果聚类结果较好,真实数据集的聚类间隔应该比随机数据集的聚类间隔要大。 应用范围:K-means聚类(最初用途)、密度聚类、系统聚类等 实现工具:R(cluster): ![]()
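A minimal sketch with cluster::clusGap (USArrests as assumed demo data; B, the number of bootstrap reference data sets, is illustrative only):
library(cluster)
library(factoextra)
df  <- scale(USArrests)
gap <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
print(gap, method = "firstmax")   # suggested number of clusters
fviz_gap_stat(gap)                # plot the gap curve across k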
(代码方案及结果可视化1)
(代码方案及结果可视化2)
⑦ 交叉验证(Cross-Validation)
虽不常见,但交叉验证也可以用来确定聚类数。 通过将数据集分成几个部分,并在不同部分上重复聚类过程,然后评估聚类结果的一致性,可以帮助选择一个稳健的K值。
⑧ 信息准则方法
在二阶聚类的文章中,已对此方法有所介绍,此处不做详细展开。 用于模型选择和评估,帮助研究者在多个潜在模型中选择最佳模型(通过计算不同聚类数目下的特定值来选择最佳聚类数目),它试图平衡模型拟合优度和模型复杂度之间的矛盾。 传统聚类方法如K-means、层次聚类等,往往采用轮廓系数、Calinski-Harabasz指数、Davies-Bouldin指数等聚类评价指标来确定最优聚类数,而不是直接使用信息准则。 而对于那些可以形式化为概率模型的聚类方法(如高斯混合模型),信息准则就有明确价值。 信息准则方法主要有赤池信息准则(AIC) 和贝叶斯信息准则(BIC) : 关心泛化能力,更倾向于选择惩罚力度更大的BIC;而如果更关注解释现有数据,AIC是更好的选择。 应用范围:K-means聚类、密度聚类、系统聚类、高斯混合模型等。 必须样本量足够大,因为它考虑了样本大小。 实现工具:R: 赤池信息准则(AIC) 样本量不够大,数据会不稳定![]()
(代码方案及结果可视化)
贝叶斯信息准则(BIC) 样本量不够大,数据会不稳定
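As a hedged sketch of the information-criterion route for a model-based clustering (a Gaussian mixture model fitted with the mclust package; the demo data and the range of components are assumptions, and this is not the code behind the screenshots above):
library(mclust)
df  <- scale(USArrests)
fit <- Mclust(df, G = 1:10)           # fits GMMs with 1-10 components, selecting by BIC
fit$G                                 # number of components chosen by BIC
plot(fit, what = "BIC")               # BIC across candidate models
aic <- 2 * fit$df - 2 * fit$loglik    # AIC of the selected model, from its log-likelihood
aic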
(代码方案及结果可视化)
使用聚类分析来分割数据
机器学习(简称 ML)不仅仅是进行预测。 还有其他无监督过程,其中聚类尤为突出。 聚类分析能对相似数据组进行细分、分析和获取富有洞察力的信息什么是聚类?
简单来说,聚类是将相似的数据项分组在一起的同义词。 这可能就像在杂货店里将相似的水果和蔬菜组织在一起并放在一起一样。 让我们进一步阐述这个概念: 聚类是一种无监督学习任务: 一种广泛的机器学习方法,其中数据被假定为未标记或未预先分类,目的是发现其背后的模式或见解。 具体而言,聚类的目的是发现具有相似特征或属性的数据观察组。 这就是聚类在机器学习技术范围内的位置:为了更好地理解聚类的概念,可以想象一下在超市中寻找具有相似购物行为的顾客群体,或者将电子商务门户中的大量产品归类为类别或类似项目。 这些是涉及聚类过程的现实场景的常见示例。 存在各种用于聚类数据的方法。 三种最流行的方法系列是:
迭代聚类:
这些算法迭代地将数据点分配(有时重新分配)到各自的聚类中,直到它们收敛到“足够好”的解决方案。 最流行的迭代聚类算法是 k-means,它通过将数据点分配给由代表点(聚类质心)定义的聚类并逐渐更新这些质心直到实现收敛来进行迭代。层次聚类:
顾名思义,这些算法使用自上而下的方法(将数据点集拆分为所需数量的子组)或自下而上的方法(逐渐将类似气泡的数据点合并为越来越大的组)构建基于层次树的结构。 AHC(凝聚层次聚类)是自下而上的层次聚类算法的常见示例。基于密度的聚类:
这些方法识别数据点密度高的区域以形成聚类。 DBSCAN(基于密度的带噪声应用空间聚类)是此类别下的一种流行算法。聚类和聚类分析相同吗?
此时最热门的问题可能是: 聚类和聚类分析是指同一个概念吗? 毫无疑问,两者密切相关,但它们并不相同,并且它们之间存在细微的差异。 聚类是将相似数据分组的过程,使得同一组或簇中的任意两个对象彼此之间的相似性比不同组中的任意两个对象更高。 同时,聚类分析是一个更广泛的术语,不仅包括在特定领域背景下对数据进行分组(聚类)的过程,还包括对获得的簇进行分析、评估和解释。 下图说明了这两个经常混淆的术语之间的区别和关系。![]()
R/Rstudio/R包的更新
1.输入命令更新 R
1.首先打开 R,选择镜像(China 随便选一个就可)
2.运行三行代码
2.官网下载最新版本 R
3.更新 Rstudio
4.更换 Rstudio 的 R 版本
升级 R 包
1.输入命令更新 R 包
2.复制粘贴 R 包
1.输入命令更新 R
在 Rstudio/R 中更新 R,输入以下命令: install.packages('installr') library(installr) updateR() 在 R 中运行时:1.首先打开 R,选择镜像(China 随便选一个就可)
2.运行三行代码
会显示如下界面,如果已经有了新版本,点击确定不看新闻点击否
再次确认是否安装最新版本的 R,点击确定
接下来就开始安装了
安装成功即可。
2.官网下载最新版本 R
https://www.r-project.org/
最新版本在最下面,一般下载.gz 的压缩包,下载完安装即可。
3.更新 Rstudio
进入 Rstudio 官网:https://posit.co/downloads/ 点击 DOWNLOAD RSTUDIO ---> DOWNLOAD RSTUDIO DESKTOP FOR WINDOWS![]()
4.更换 Rstudio 的 R 版本
Tools--->Global Options点击 change 更换 R 版本
换完版本不要忘了点击两个 OK, 然后再重启一遍 RStudio 即可,这样就完全更新完了。
升级 R 包
1.输入命令更新 R 包
跑代码的时候可能会遇到 R 包版本不匹配或者附加包不匹配的情况,所以需要更新完 R 版本的同时需要升级我们的 R 包 输入命令直接升级所有 R 包(CRAN、Bioconductor、GitHub) ## 安装rvcheck包 install.packages("rvcheck") ## 加载rvcheck包 library("rvcheck") #检查R是否有更新 rvcheck::check_r() rvcheck::update_all(check_R = FALSE, which =c("CRAN","BioC","github"))2.复制粘贴 R 包
首先找到旧版本 R 包安装的路径,在命令行中输入.libPaths() 就可以找到 R 包的位置,此处输出的第一个路径为 R 包安装的位置 .libPaths() [1] "C:/Users/B/AppData/Local/R/win-library/4.4" "C:/Program Files/R/R-4.4.1/library" 打开路径可以发现有两个或三个文件夹,把 4.3 文件内的文件复制粘贴到 4.4 即可![]()
平常更新完 R 版本总是懒得重新安装 R 包,所以就会把自己之前安装的 R 包全部打包压缩好,这样就可以方便自己安装了。 10个无头浏览器——自动化测试、爬虫、RPA利器
无头浏览器的应用场景有哪些?无头浏览器指的是一系列无界面的浏览器,这种浏览器能够以编程方式与网页进行交互,可以减少甚至替代手动处理任务。
Puppeteer
Selenium WebDriver
Playwright
Chromedp
Headless Chrome Crawler
Splash
Splinter
Serverless-chrome
Ferrum
Surf无头浏览器的应用场景有哪些?
(1)数据提取 无头浏览器擅长网页内容抓取,能够在没有界面的环境下,导航网页、解析HTML和检索数据,从而有效地从网站中提取信息。 (2)自动化测试 无头浏览器在自动化测试领域可以扮演关键角色。 它们可以在无人工干预的情况下在Web应用程序上执行测试脚本,实现对Web的功能和性能测试。 这种方式加速了测试过程,保证了最终产品质量。 (3)性能指标优化 无头浏览器对于性能监控也很有价值。 它们可以测量网页加载时间、执行速度等关键指标,从而深入了解网站的效率。 这些基准测试有助于识别瓶颈,提高用户体验。 (4)创建网页快照 无头浏览器可以在任意时间以编程方式生成网页截图,用于帮助编制文档、调试和验证UI。 (5)模拟用户行为 实现用户交互自动化是无头浏览器最强大的功能之一。 它们可以模拟点击、表单提交和其他Web操作。 通过模仿真实的用户行为,对于测试复杂的工作流程以及确保流畅的用户体验至关重要。 下面重点推荐几个比较优秀的开源免费的无头浏览器,开发人员可以根据需要选型。Puppeteer
https://github.com/puppeteer/puppeteer GitHub Star: 88K 开发语言:Node/TypeScript/JavaScript Puppeteer是一个开源的Node.js库,它通过DevTools协议实现了一些API来控制Chrome或Chromium。 它可以实现浏览器任务的自动化,例如:Web抓取、自动测试和性能监控等。 Puppeteer支持无头模式,允许它在没有图形界面的情况下运行,并提供生成屏幕截图或者PDF,可以模拟用户交互和捕获性能指标等。 它因其功能强大且易于与Web项目集成而被广泛使用。 安装: npm i puppeteer 使用: import puppeteer from 'puppeteer'; (async () => { // Launch the browser and open a new blank page const browser = await puppeteer.launch(); const page = await browser.newPage(); // Navigate the page to a URL await page.goto('https://developer.chrome.com/'); // Set screen size await page.setViewport({width: 1080, height: 1024}); // Type into search box await page.type('.devsite-search-field', 'automate beyond recorder'); // Wait and click on first result const searchResultSelector = '.devsite-result-item-link'; await page.waitForSelector(searchResultSelector); await page.click(searchResultSelector); // Locate the full title with a unique string const textSelector = await page.waitForSelector( 'text/Customize and automate' ); const fullTitle = await textSelector?.evaluate(el => el.textContent); // Print the full title console.log('The title of this blog post is "%s".', fullTitle); await browser.close(); })();Selenium WebDriver
https://github.com/SeleniumHQ/selenium GitHub Star:30K 开发语言:支持Java、Python、Javascript、Ruby、.Net、C++、Rust...Selenium是一个封装了各种工具和库的浏览器自动化框架和生态系统。 用于实现Web浏览器自动化。 Selenium专门根据W3C WebDriver规范提供了一个能够与所有主要Web浏览器兼容,并且支持跨语言的编码接口。
Playwright
https://github.com/microsoft/playwright-python GitHub Star:11.4K+ 开发语言:Python Playwright是一个用于实现Web浏览器自动化的Python库。 支持端到端测试,提供强大的功能,支持多浏览器,包括:Chromium、Firefox和WebKit。 Playwright可以实现Web爬虫、自动化表单提交和UI测试等任务,提供了用户交互行为模拟和屏幕截图等工具。 提供了强大的API,能够有效地支持各种Web应用程序测试需求。 安装python依赖: pip install pytest-playwright playwright Demo: import re from playwright.sync_api import Page, expect def test_has_title(page: Page): page.goto("https://playwright.dev/") # Expect a title "to contain" a substring. expect(page).to_have_title(re.compile("Playwright")) def test_get_started_link(page: Page): page.goto("https://playwright.dev/") # Click the get started link. page.get_by_role("link", name="Get started").click() # Expects page to have a heading with the name of Installation. expect(page.get_by_role("heading", name="Installation")).to_be_visible()Chromedp
https://github.com/chromedp/chromedp GitHub Star:10.8K+ 开发语言:Golang Chromedp是一个可以快速驱动Chrome DevTools协议的浏览器的Golang库。 无需外部依赖。 可以查看Golang 的各种应用案例: https://github.com/chromedp/examples![]()
Headless Chrome Crawler
https://github.com/yujiosaka/headless-chrome-crawler GitHub Star:5.5K 开发语言:JavaScript 这项目提供了一个由无头Chrome驱动的分布式爬虫功能。 项目主要特征包括: 支持分布式爬行 可配置并发、延迟和重试 同时支持深度优先搜索和广度优先搜索算法 支持Redis缓存 支持CSV和JSON导出结果 达到最大请求时暂停,并随时恢复 自动插入jQuery进行抓取 保存截图作为抓取证据 模拟设备和用户代理 根据优先级队列提高爬行效率 服从 robots.txtSplash
https://github.com/scrapinghub/splash GitHub Star:4.1K 开发语言:Python Splash是一个支持JavaScript渲染的HTTP API服务。 是一个轻量级的浏览器,具有HTTP API,在Python 3中使用Twisted和QT5实现。 得益于它的快速、轻量级和无状态等特性,使其易于使用和推广。Splinter
https://github.com/cobrateam/splinter GitHub Star:2.7K 开发语言:PythonSplinter是一个基于Python的Web应用程序测试工具,可用于Web应用程序自动化,提供了简单且一致的API。 它可以自动执行浏览器操作,例如:导航到URL、填写表格以及与页面元素交互。 Splinter支持各种Web驱动程序,包括Selenium WebDriver、Google Chrome和Firefox等。 它提供了非常友好的API来控制浏览器,简化了自动化测试过程的开发,使其成为Web应用程序的开发人员和测试人员的宝贵工具。 主要特点包括: 易于学习:API的设计是直观和快速拿起。 更快的编码:快速且可靠地与浏览器自动交互,而无需与工具发生冲突。 强大:专为真实的世界用例而设计,可防止常见的自动化怪癖。 灵活:对较低级别工具的访问从不隐藏。 强大:支持多个自动化驱动程序(Selenium,Django,Flask,ZopeTestBrowser)。
Serverless-chrome
https://github.com/adieuadieu/serverless-chrome Github Star:2.9K 开发语言:JavaScript 这是一个无服务器Chrome 。 这个项目的目的主要是为在无服务器函数调用期间使用Headless Chrome提供框架。 Serverless-chrome负责构建和捆绑Chrome二进制文件,并确保在执行无服务器函数时Chrome正在运行。 此外,该项目还提供了一些常见模式的服务,例如:对页面进行屏幕截图、打印到PDF、页面抓取等。Ferrum
https://github.com/rubycdp/ferrum GitHub Star:1.7K 开发语言:Ruby Ferrum是一个用于实现Chrome自动化的Ruby库。 它提供了一种控制浏览器的方法,而不需要像Selenium这样的驱动程序。 Ferrum可以处理诸如浏览网页、与元素交互以及捕获屏幕截图等任务。 它对于Web抓取、自动化测试和模拟用户交互非常有用。 Ferrum支持在无头和非无头模式下运行,使其能够满足各种自动化需求。Surf
https://github.com/headzoo/surf GitHub Star:1.5K Surf是一个Golang库,Surf不仅仅是一个Web内容提取的Go解决方案,还实现了一个可以用于编程控制的虚拟Web浏览器。 Surf被设计成像Web浏览器一样,功能包括:cookie管理、历史记录、书签、用户代理、表单提交、通过jQuery样式的CSS选择器选择和遍历DOM、抓取图像、样式表等。 安装: go get gopkg.in/headzoo/surf.v1 Demo: package main import ( "gopkg.in/headzoo/surf.v1" "fmt" ) func main() { bow := surf.NewBrowser() err := bow.Open("http://golang.org") if err != nil { panic(err) } // Outputs: "The Go Programming Language" fmt.Println(bow.Title()) }ggplot2绘图简明手册
前言
1 数据类型
2 散点图
2.1 基础散点图
2.2.2 添加平滑曲线和趋势图
2.3 多个变量
2.5 小结
2.6 其他形式的散点图
2.6.1 二维密度估计的散点图
2.6.2 带有椭圆的散点图
2.6.3 3D散点图
2.6.4气泡图
2.6.5 曼哈顿图
2.6.6火山图
3 箱线图
3.1 基础箱线图
3.2 进阶箱线图
3.2.1颜色主题的选择
3.2.2推荐的颜色搭配
3.2.3添加P值和显著性水平
4 小提琴图
5 云雨图
6 条形图
6.1 基础条形图
6.2簇状条形图
6.3堆砌条形图
6.4 频次直方图
7 折线图
7.1 简单折线图
7.2 面积图
8 饼图
8.1 普通饼图
8.2 环图
8.3 玫瑰图
8.4 旭日图
9 网络图
9.1 二分网
9.2 微生物共现网络
10雷达图
11弦图
12 桑基图
13相关性热图
13.1 普通相关性热图
13.2 Mantel test 图
14树形图 Treemap
15 聚类树状图 Dendrogram
15.1 聚类树的构建
15.2使用ggdendro包绘制聚类树图
16森林图
16.1 普通森林图
16.2 亚组森林图
16.3 带有ROB的森林图
17 金字塔图
18韦恩图
19 词云图
20 地图绘制
前言
以下的例子中大部分都是用ggplot2,但还用了其他的包,比如meta、VennDiagram、wordcloud、maps包等等。 学习本期的内容涉及到两个方法: 第一,数据。 演示的数据一方面来自R和R包内置的数据集,比如iris、mtcars数据集等等,这些都是可以自己自行导入的;另一方面,部分数据是来自我自己的数据,这些数据我已经打包放在一个压缩包里面,文末有下载的链接,需要的同学可自行下载。 第二,代码。 代码在文档中都有,建议是基础的代码自己敲一遍,如果只是统一搬运过来,当然也不是不可以。1 数据类型
开展数据可视化之前,必须要了解R语言中的数据的类型。 这样才能根据数据类型选择合适的可视化图表。 我们在绘图时用到的最基础的数据类型可分为以下三类: (1)连续型变量 任意两个值之间具有无限个值的数值变量。 连续变量可以是数值变量,也可以是日期/时间变量。 比如,3.14,4.65, 5.28等等。 (2)离散变量 可计数的数值变量,比如,1,2,3等等。 (3)分类变量 R中称为因子(factor),比如性别(男/女),身高(低/中/高)等等。
2 散点图
散点图是最常见的图形之一。2.1 基础散点图
最基础的语法如下: library(ggplot2) data("mtcars") ggplot(aes(x =disp,y = mpg),data=mtcars)+geom_point()首先,是加载ggplot2包,还未下载该包的请提前安装“install.packages(“ggplot2”)”。 这里,我们可以看出ggplot2的语法很简单,而且是层层叠加的。 aes()规定了x轴和y轴;然后定义数据集,data= mtcars。 后面使用geom_point()定义呈现的是散点。
2.2 进阶散点图 2.2.1设置主题、坐标轴和颜色 基础语法掌握后,我们进行修改,主要包括: 主题、坐标轴字体和大小、散点大小和颜色、坐标轴标签 p1<-ggplot(aes(x =disp,y = mpg),data = mtcars)+ geom_point(color = "blue",size =3.0,alpha = 0.5 )+#调节散点的颜色、大小和显示度 theme(axis.text = element_text(family ="serif",size = 1))+#调节坐标轴字体为罗马字体,大小为14号 theme_bw()+#主题可有theme_test或者theme_classic等等 xlab("DISP")+ylab("MPG")#可改变x轴和y轴的标签 p1上述用的是主题theme_bw(),其他的主题的效果如下: p1.1<-ggplot(aes(x =disp,y = mpg),data = mtcars)+ geom_point(color = "blue",size =3.0,alpha = 0.5 )+ theme(axis.text = element_text(family ="serif",size = 1))+ theme_test()+ xlab("DISP")+ylab("MPG") p1.2<-ggplot(aes(x =disp,y = mpg),data = mtcars)+ geom_point(color = "blue",size =3.0,alpha = 0.5 )+ theme(axis.text = element_text(family ="serif",size = 1))+ theme_classic()+ xlab("DISP")+ylab("MPG") library(cowplot) plot_grid(p1.1,p1.2,labels = c('A','B'))
图A用的是theme_test()主题,图B用的是theme_classic()主题,看个人喜好选择了。
2.2.2 添加平滑曲线和趋势图
我们想要在散点图中添加趋势图,ggplot2中用geom_smooth()函数就能轻松完成这个目标。 默认的采用geom_smooth()是获得非参的平滑曲线估计: p1+geom_smooth() #`geom_smooth()` using method = 'loess' and formula = 'y ~ x'我们想要用线性回归,只需要在增加geom_smooth(method = “lm”) p1+geom_smooth(method = "lm") `geom_smooth()` using formula = 'y ~ x'
此外我们想在图片中加入回归分析中的P值和R方,这该怎么做呢? 这个时候我们得加载一个ggpmisc包,里面的stat_poly_eq()函数可以帮我们完成这一目标。 library(ggpmisc) p1+geom_smooth(method = "lm")+ stat_poly_eq(aes(label = paste(..eq.label.., ..adj.rr.label..,..p.value.label.., sep = '~~~~')),formula = y ~ x,parse = TRUE,label.x =0.2,label.y =0.9)
至此,我们从一个简单的双变量的散点图,了解了其点大小、颜色、坐标轴、字体、主题,并对此进行线性回归分析,并添加回归分析的公式、P值和R方。 一步一步,瞧!也没那么难,这是基础的图形的绘制,接下来我们进行多个变量的颜色、分面及其更高级的分析!
2.3 多个变量
一般来讲,二维的散点图所呈现的变量是有限的,我们用x和y轴表示两个向量,用颜色或形状表示第三个变量,我们来看看这个怎么操作: 数据还是用mtcars数据集 head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 我们以disp为x轴,mpg为y轴,以计数型的cyl为分组。 p2<-ggplot(aes(x =disp,y = mpg),data = mtcars)+ geom_point(aes(colour = cyl),size =3.0,alpha = 0.5 )+ theme_bw() p2可以看见,不同区间的cyl表示为连续型的颜色变化了,这是因为ggplot2中默认为连续型的变量。 我们要转换一下,把cyl变为因子(factor)。 p2.1<-ggplot(aes(x =disp,y = mpg),data = mtcars)+ geom_point(aes(colour = as.factor(cyl)),size =3.0,show.legend = TRUE)+ theme_bw() p2.1
这里我们语法结构为:ggplot(aes(x = ,y = ),data=),确定好数据集的x和y轴,使用geom_point()表示为散点图,然后在散点图中使用颜色映射分组信息,即geom_point(aes(colour=));show.legend是否显示图层,如果show.legend= FALSE表示不显示图层。 这个图有很多的地方可以优化,第一,颜色!个人感觉ggplot2自带的颜色红蓝绿三原色太low了,应该更换更高级的颜色;其次,这个图例太丑了,需要调整。 那我们来看看怎么升级一下这张图。
(1)颜色调整 scale_color_manual()是ggplot2中手动设置颜色映射的函数,适用于离散型和分组变量指定特定的颜色,如下所示: p2.2<-p2.1+scale_color_manual(values = c("#6495ED","#FFA500","#FF4500")) p2.2这些颜色代号都是自己在网上找的,大家可以根据自己的喜好手动设置。
(2)图例调整 主要有两点:修改图例标题和移动图例位置 p2.3<-p2.2+guides(colour=guide_legend(title=NULL))+theme(legend.position = c(0.8,0.8)) p2.3这里使用guides来修改图例标题,title=NULL表示删去图例标题;使用theme()函数中的legend.position来调整图例位置,大家可以试一下legend.position= "right"、"bottom"、"top"、"left"、"none"(无图例)来调整图例的上下左右。 在上述例子中,legend.position=c(0.8,0.8),表示为右上角区域。 c(0,1)表示左上角,c(1,0)表示右下角。
2.4 分面 上述的例子中,我们把3个变量在一张图中展示,此外还有一种展示方法,就是分面。 facet_wrap()是最常见的分面函数。 p2.4<-p2.2+facet_wrap(~cyl)+theme(legend.position = "none") p2.4上述的标度都是统一的,我们来设置一下x和y的标度在每个面板中都可以变化。 scales = "free" p2.5<-p2.2+facet_wrap(~cyl,scales = "free")+theme(legend.position = "none") p2.5
大家看,好像完全变成了另外一张图一样!
2.5 小结
至此,我们整理一下,一张比较优秀的散点图应该是这样: (1)颜色、大小和显示度合适 (2)坐标轴字体、大小合适 (3)图例、主题合适 (4)回归或者其他分析恰当且标注 P<-ggplot(aes(x =disp,y = mpg),data = mtcars)+ geom_point(aes(colour = as.factor(cyl)),size =3.0,show.legend = TRUE)+ scale_color_manual(values = c("#6495ED","#FFA500","#FF4500"))+ geom_smooth(method = "lm")+ stat_poly_eq(aes(label = paste(..eq.label.., ..adj.rr.label..,..p.value.label.., sep = '~~~~')),formula = y ~ x,parse = TRUE,label.x =0.2,label.y =0.9)+ guides(colour=guide_legend(title= 'Cyl'))+ theme_bw()+ xlab("DISP")+ylab("MPG") P `geom_smooth()` using formula = 'y ~ x'分面如下: P_wrap<-ggplot(aes(x =disp,y = mpg),data = mtcars)+ geom_point(aes(colour = as.factor(cyl)),size =3.0,show.legend =FALSE)+ scale_color_manual(values = c("#6495ED","#FFA500","#FF4500"))+ geom_smooth(method = "lm")+ stat_poly_eq(aes(label = paste(..adj.rr.label..,..p.value.label.., sep = '~~~~')),formula = y ~ x,parse = TRUE,label.x =0.1,label.y =0.9)+ facet_wrap(~cyl,scales = "free")+ theme_bw()+ xlab("DISP")+ylab("MPG") P_wrap `geom_smooth()` using formula = 'y ~ x'
可以看出,只有当cyl=4时,MPG和DISP才有显著性负关系!因此,分面能够提供给我们对数据具有更清晰的认识!
2.6 其他形式的散点图
除了上述发表在文章中的常见散点图外,还有其他形式的散点图,比如:密度估计散点图、带有椭圆的散点图、3D散点图等2.6.1 二维密度估计的散点图
为了在散点图进行密度估计,可以使用geom_density2d()方和geom_density2d_filled(),如下: p2.6.1.1<-p1+geom_density2d() p2.6.1.1p2.6.1.2<-p1+geom_density2d_filled() p2.6.1.2
2.6.2 带有椭圆的散点图
在数据集周围添加一个椭圆,可以使用stat_ellipse()函数。 可添加95%置信度水平上的置信区间 的椭圆,常见为PCA、PCoA图。 如下: p2.6.2<-p1+stat_ellipse(level = 0.95) p2.6.2![]()
2.6.3 3D散点图
要绘制3d的散点图,得用到scatterplot3d包,我们在R中就可以安装它: library("scatterplot3d") scatterplot3d(mtcars[,1:3])可以改变角度、散点形状和颜色 scatterplot3d(mtcars[,1:3],angle = 60,pch = 16, color="steelblue")
按照组别改变形状 shapes = c(16, 17) shapes <- shapes[as.numeric(as.factor(mtcars$am))] scatterplot3d(mtcars[,1:3], pch = shapes)
按照组别改变颜色 colors <- c("#32CD32", "#FF4500") colors <- colors[as.numeric(as.factor(mtcars$am))] scatterplot3d(mtcars[,1:3], pch = 16,color = colors)
该包还有很多种有趣的变化,详细可查阅”https://blog.csdn.net/m0_49960764/article/details/122249790“
2.6.4气泡图
气泡图是属于散点图的一种,在散点图的基础上改变点的形状,大小和颜色。 这里我们用自带的数据展示气泡图,其实很简单,就是在geom_point()添加第三变量,用颜色和点大小区分,这里我们用颜色(尺寸:绿到红)和散点大小来演示。 setwd("D:\\test\\ggplot2") df2.6.4<-read.csv("test1.csv",header = T) ggplot(aes(x = genus,y = abundance),data = df2.6.4)+ geom_point(aes(size = weight,color= weight))+ scale_colour_gradient(low="green",high="red")+ theme_bw()+coord_flip()![]()
2.6.5 曼哈顿图
曼哈顿(Manhattan)图实际就是点图,横坐标是chr,纵坐标是-log(Pvalue) ,原始P值越小,-log转化后的值越大,在图中就越高。 Manhattan图是GWAS分析的标配。 library(qqman) head(gwasResults)#内置数据集 SNP CHR BP P 1 rs1 1 1 0.9148060 2 rs2 1 2 0.9370754 3 rs3 1 3 0.2861395 4 rs4 1 4 0.8304476 5 rs5 1 5 0.6417455 6 rs6 1 6 0.5190959 # 使用manhattan函数绘制曼哈顿图 manhattan(gwasResults)# 调整参数 manhattan(gwasResults, main = "Manhattan Plot", #设置主标题 ylim = c(0, 10), #设置y轴范围 cex = 0.6, #设置点的大小 cex.axis = 0.9, #设置坐标轴字体大小 col = c("blue4", "orange3","red"), #设置散点的颜色 suggestiveline = F, genomewideline = F, #remove the suggestive and genome-wide significance lines chrlabs = c(paste0("chr",c(1:20)),"P","Q") #设置x轴染色体标签名 )
2.6.6火山图
火山图(volcano plot)是散点图的一种。 主要用于展示高通量实验(如基因表达谱分析、蛋白质组学研究)中的显著性和变化倍数(fold change)。 火山图结合了p值和变化倍数信息,可以直观地显示哪些基因或蛋白在实验条件下表现出显著变化。 下面我们使用ggvolcano函数制作一张普通的火山图。 还没安装该包的同学可以安装以下代码安装: devtools::install_github("BioSenior/ggVolcano") library(ggVolcano) data(deg_data)#内置数据集 data <- add_regulate(deg_data, log2FC_name = "log2FoldChange", fdr_name = "padj",log2FC = 1, fdr = 0.05) ggvolcano(data, x = "log2FoldChange", y = "padj", label = "row", label_number = 10, output = FALSE)1.
X轴 :通常表示变化倍数(fold change),是实验条件下某基因或蛋白质的表达量相对于对照条件的变化。 X轴的值可以是对数缩放的,如log2(fold change)。 2.Y轴 :表示显著性(p值)的负对数(通常是-log10(p-value)),所以Y轴上的值越高,表示结果越显著。 3.点 :每个点代表一个基因或蛋白质。 点的位置由其fold change和p值决定。 4.颜色或形状 :常用来标示显著变化的基因或蛋白质。 例如,红色点可以表示上调的基因,蓝色点可以表示下调的基因,而灰色点表示没有显著变化的基因。 5.左侧 :表示下调基因,fold change < 1(负值)。右侧 :表示上调基因,fold change > 1(正值)。顶部 :表示显著性高的基因,p值小。3 箱线图
箱线图是利用数据中的五个统计量:最小值、第一四分位数、中位数、第三四分位数与最大值来描述数据的一种方法,它也可以粗略地看出数据是否具有对称性、分布的分散程度等信息。
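The five statistics a boxplot is built on can be inspected directly in base R, for example:
fivenum(iris$Sepal.Length)        # minimum, lower hinge, median, upper hinge, maximum
boxplot.stats(iris$Sepal.Length)  # the same summary plus the points flagged as outliers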
3.1 基础箱线图
最基础的箱线图如下: 这里我们演示用鸢尾花的数据,箱线图用geom_boxplot() data("iris") head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa p3.1.1<-ggplot(aes(x = Species,y =Sepal.Length ),data = iris)+ geom_boxplot() p3.1.1从上面的箱线图中可以看得出,virginica的Sepal.Length数据中位数最大,其次是versicolor,最小是setosa。 virginica还有一个外溢的点,这个点可看做为异常值。 因此箱线图十分适合看分类变量之间的数据离散! 同样地,这张图需要升级改造一下,至少需要以下几点: (1)主题的选择,选择theme_bw()或者其他合适的主题; (2)颜色的选择,选择更为突出的颜色; (3)箱线图上下的须需要调整; (4)可结合散点图,突出点的离散。 (5)合适的统计学分析,比如单因素方差分析,且标注在图中。 (6)可选择在y轴添加核密度估计
3.2 进阶箱线图
3.2.1颜色主题的选择
主题已经在上述的散点图中简述过,这里主要介绍颜色搭配: p3.2.1.1<-ggplot(aes(x = Species,y =Sepal.Length,colour=Species),data = iris)+ geom_boxplot(size = 0.8, width = 0.8, alpha = 0)+#设置箱线尺寸、箱形宽度、异常点透明度 geom_jitter(position = position_jitter(0.1), alpha = 0.5, size=1.5)+#设置数据点的分散程度、透明度和大小 theme_test() p3.2.1.1
3.2.1.1 箱线图-点和线颜色
有两种方法更改ggplot2中的箱线图-点和线颜色的修改: (1)使用scale_color_manual() 函数手动更改颜色。 (2)使用scale_color_brewer() 函数使用 RColorBrewer 包的调色板。 注意一下:color:对点和线的颜色进行调整;当为柱状图或者空心散点时,color仅改变边框颜色。 例子如下: p3.2.1.2<- p3.2.1.1+scale_color_manual(values = c(c("red", "blue", "orange"))) p3.2.1.2 p3.2.1.3<- p3.2.1.1+scale_color_brewer(palette = "Set1") p3.2.1.3![]()
可以看出,无论是scale_color_manual() 还是scale_color_brewer() ,凡是带color都只是对点和横框的颜色进行修改。
3.2.1.2 箱线图-填充颜色
对ggplot2中的箱线图-填充颜色的修改: scale_fill_manual() 和 scale_fill_brewer() 注意:geom_boxplot(alpha=0),这个显示透明度的代码一定要删了,不然是认为是透明的,boxplot就不填充颜色了! ggplot(aes(x = Species,y =Sepal.Length,fill = Species),data = iris)+ geom_boxplot(size = 0.8, width = 0.8)+ geom_jitter(position = position_jitter(0.1), alpha = 0.5, size=1.5)+ scale_fill_brewer(palette = "Set1")+ theme_test()这里我要强调一下scale_fill_brewer()配色的搭配: 对于分类变量,有8种色系选择:Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3。 但是这里的颜色最多只有8个颜色,即最多只能有8个分类变量,超过8个就不显色了。 我比较喜欢Set1和Dark2这两个色系,原因是颜色区分大,颜色搭配也好看。 ggplot(aes(x = Species,y =Sepal.Length,fill = Species),data = iris)+ geom_boxplot(size = 0.8, width = 0.8)+ geom_jitter(position = position_jitter(0.1), alpha = 0.5, size=1.5)+ scale_fill_brewer(palette = "Dark2")+ theme_test()
3.2.2推荐的颜色搭配
我们总是抱怨,别人的文章颜色搭配高大上,而我们的确实一言难尽。 颜色搭配是很难的,要注意以下细节: 尽可能避免使用”red”, “green”, “blue”, “cyan”, “magenta”, “yellow”颜色。 使用相对柔和的颜色”firebrick”, “springgreen4”, “blue3”, “turquoise3”, “darkorchid2”, “gold2”,会让人觉得舒服。 推荐1:“#440154FF”,“#007896FF”,“#3AC96DFF”,“#FDE725FF” 推荐2:“#007896FF”,“#619CFF”,“#3AC96DFF”,“#FDE725FF” 推荐3:四个灰度:“#636363”, “#878787”, “#ACACAC”, “#D0D0D0” 推荐:4:六个灰度:“#787878”, “#8F8F8F”, “#A6A6A6”, “#BDBDBD”, “#D4D4D4”, “#EBEBEB” (1)推荐的包:ggsci ggsci 包提供科学期刊和科幻主题的调色板。 绘制的图更像发表在科学或自然中的颜色的主题。 library(ggsci) library(cowplot) p3.2.2.1<-p3.2.1.1+scale_color_aaas()+theme(legend.position = "none") p3.2.2.2<-p3.2.1.1+scale_color_npg()+theme(legend.position = "none") plot_grid(p3.2.2.1,p3.2.2.2)(2)推荐的包:ggthemes包 ggthemes包允许 R 用户访问 Tableau 颜色。 Tableau 是著名的可视化软件,具有众所周知的调色板。 library(ggthemes) Attaching package: 'ggthemes' The following object is masked from 'package:cowplot': theme_map p3.2.1.1+scale_color_tableau()+theme(legend.position = "none")
3.2.3添加P值和显著性水平
介绍一个自动添加p值的包:ggpubr包。 主要用到两个函数:compare_means():用于执行均值比较。 stat_compare_means():用于在ggplot图形中自动添加P值和显著性水平 library(ggpubr) p3.2.2.1<-p3.2.1.1+stat_compare_means() p3.2.2.1这里默认多组之间的比较,用的是Kruskal-Wallis比较,这是一种非参数检验的常用方法。 变为参数检验可以用ANOVA,具体如下: p3.2.2.2<-p3.2.1.1+stat_compare_means(method = "anova",label.y = 7.5) p3.2.2.2
对于两组之间的比较可以用t.test或者Wilcoxon test,具体如下:
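The two-group code is not shown above; a minimal sketch follows (keeping only two of the three iris species, so that exactly two groups remain, is an assumption made for illustration):
iris2 <- subset(iris, Species != "virginica")
iris2$Species <- droplevels(iris2$Species)
p_two <- ggplot(aes(x = Species, y = Sepal.Length, colour = Species), data = iris2) +
  geom_boxplot() +
  theme_test()
p_two + stat_compare_means(method = "t.test")       # parametric two-group comparison
p_two + stat_compare_means(method = "wilcox.test")  # non-parametric (Wilcoxon) comparison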
此外,想要进行组间的比较,该怎么做呢? compare_means(Sepal.Length ~ Species, data = iris) # A tibble: 3 × 8 .y. group1 group2 p p.adj p.format p.signif method <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> 1 Sepal.Length setosa versicolor 8.35e-14 1.7 e-13 8.3e-14 **** Wilcox… 2 Sepal.Length setosa virginica 6.40e-17 1.9 e-16 < 2e-16 **** Wilcox… 3 Sepal.Length versicolor virginica 5.87e- 7 5.90e- 7 5.9e-07 **** Wilcox… my_comparisons <- list( c("setosa", "versicolor"), c("versicolor", "virginica"), c("setosa", "virginica") )#两两比较的组别 p3.2.2.3<-p3.2.1.1+stat_compare_means(comparisons = my_comparisons) p3.2.2.3
可以看得出,三组之间的两两比较是存在显著性差异的。 如果我不想要这种方式,想要添加字母表示的,该怎么做呢? 一般用字母表示的带有误差棒的,用条形图比较合适,虽然还没具体介绍到条形图,也可以比较一下这两者(箱线图+散点图和条形图的区别) 这里得用到另外一个包—multcompView包,还未安装的同学可以install.package(“multcompView”)安装一下。 library(multcompView) fit<-aov(Sepal.Length ~ Species,data = iris)#单因素方差分析 summary(fit)#查看结果 Df Sum Sq Mean Sq F value Pr(>F) Species 2 63.21 31.606 119.3 <2e-16 *** Residuals 147 38.96 0.265 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 tukey<-TukeyHSD(fit)#组间多重比较 tukey Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = Sepal.Length ~ Species, data = iris) $Species diff lwr upr p adj versicolor-setosa 0.930 0.6862273 1.1737727 0 virginica-setosa 1.582 1.3382273 1.8257727 0 virginica-versicolor 0.652 0.4082273 0.8957727 0 abc<- multcompLetters4(fit,tukey)#显著性用字母表示 head(abc) $Species virginica versicolor setosa "a" "b" "c" library(tidyverse) df <-iris %>% group_by(Species) %>% summarise(w=mean(Sepal.Length), sd = sd(Sepal.Length)) %>% arrange(desc(w)) %>% ungroup() %>% left_join(.,as.data.frame.list(abc$Species) %>% select(1) %>% rownames_to_column("Species")) Joining with `by = join_by(Species)` head(df) # A tibble: 3 × 4 Species w sd Letters <chr> <dbl> <dbl> <chr> 1 virginica 6.59 0.636 a 2 versicolor 5.94 0.516 b 3 setosa 5.01 0.352 c ggplot(df, aes(x= Species,y = w,fill = Species)) + geom_bar(stat = "identity",aes(colour = Species),show.legend = FALSE,width=0.5) + geom_errorbar(aes(ymin = w-sd, ymax=w+sd), width = 0.1) + geom_text(aes(label = Letters, y = w + sd), vjust = -0.5)+ scale_fill_brewer(palette = "Set1")+ scale_color_brewer(palette = "Set1")+ theme_test()+ylab("Sepal.Length")
所以,萝卜青菜各有所爱,选择那么多不是一件坏事!
4 小提琴图
小提琴图是箱线图和核密度图的集合,其可通过箱线思维展示数据的各个百分位点,与此同时,还可使用核密度图展示数据分布的'轮廓'效果,'轮廓'越大,即意味着数据越集中于该处,反之则说明该处的数据越少。 如下图所示:我们看一下怎么用代码演示小提琴图: ggplot(aes(x = Species,y =Sepal.Length,colour=Species),data = iris)+ geom_violin(size = 0.8, width = 0.8, alpha = 0)+ geom_jitter(position = position_jitter(0.1), alpha = 0.5, size=1.5)+ scale_color_brewer(palette = "Set1")+ theme_test() Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0. ℹ Please use `linewidth` instead.
5 云雨图
所谓云雨图,基本形状由云朵和雨点组成,上面的云朵是数据的密度图,伞就是平铺的箱线图、雨就是下面的数据点。 本质上就是箱线图+散点图+核密度图,我们看一下怎么画云雨图 library(gghalves) p5.1<-ggplot(aes(x = Species,y =Sepal.Length,colour=Species,fill= Species),data = iris)+ scale_fill_manual(values = c("#8491B4FF", "#00A087FF", "#4DBBD5FF"))+ scale_colour_manual(values = c("#8491B4FF", "#00A087FF", "#4DBBD5FF"))+ theme_test() #先画一半小提琴图 p5.2<-p5.1+geom_half_violin(position=position_nudge(x=0.1,y=0), side='R',adjust=1.2,trim=F,color=NA,alpha=0.5) #添加散点图 #调整散点 p5.3<-p5.2+geom_half_point(position=position_nudge(x=-0.35,y=0),size =3, shape =19,range_scale = 0.5,alpha=0.5) #添加箱线图: p5.4<-p5.3+geom_boxplot(outlier.shape = NA, #隐藏离群点; width =0.1, alpha=0.5) #图形转置 p5.5<-p5.4+coord_flip() p5.5![]()
6 条形图
6.1 基础条形图
条形图是一种常见的数据可视化工具,用于显示分类数据的比较。 它使用水平或垂直的矩形条来代表数据的数值,每个条的长度或高度与其表示的数值成正比。 条形图通常用于比较不同类别之间的数量或频率,便于观察各类别之间的差异。 什么样的数据适合条形图呢? (1)分类数据和离散数据/连续数据 (2)想要看均值±SD的组别之间的差异 我们来看看基础的条形图代码: ggplot(aes(x = Species,y =Sepal.Length),data = iris)+geom_bar(stat='identity')条形图的语法为: geom_bar(mapping = NULL, data = NULL, stat = "count", width=0.9, position="stack") 要注意以下几点:
(1)stat: 设置统计方法,有效值是count(默认值) 和 identity,其中,count表示条形的高度是变量的数量,identity表示条形的高度是变量的值; 默认情况下,stat=“count”,这意味着每个条的高度等于每组中的数据的个数,并且,它与映射到y的图形属性不相容,所以,当设置stat=“count”时,不能设置映射函数aes()中的y参数。 如果设置stat=“identity”,这意味着条形的高度表示数据数据的值,而数据的值是由aes()函数的y参数决定的,就是说,把值映射到y,所以,当设置stat=“identity”时,必须设置映射函数中的y参数,把它映射到数值变量。 (2)position :位置调整,有效值是stack、dodge和fill ,默认值是stack(堆叠),是指两个条形图堆叠摆放,dodge是指两个条形图并行摆放,fill是指按照比例来堆叠条形图,每个条形图的高度都相等,但是高度表示的数量是不尽相同的。 对该条形图进行组别均值带有误差棒的单因素方差分析并且标注显著性字母的,请查阅上一章3.2.3添加p值和显著性水平。6.2簇状条形图
当分类变量出现两组时,就会出现簇状条形图。 此时可将分类变量映射到fill参数,并运行命令geom_bar(position=“dodge”),这可使得两组条形在水平方向上错开排列。 演示我们用gapminder数据集,该数据结构如下: library(gapminder) library(dplyr) head(gapminder) # A tibble: 6 × 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. df_6.2 <- gapminder %>% group_by(continent, year) %>% summarise( avgLifeExp = mean(lifeExp) ) `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument. head(df_6.2) # A tibble: 6 × 3 # Groups: continent [1] continent year avgLifeExp <fct> <int> <dbl> 1 Africa 1952 39.1 2 Africa 1957 41.3 3 Africa 1962 43.3 4 Africa 1967 45.3 5 Africa 1972 47.5 6 Africa 1977 49.6 ggplot(aes(x = continent,y = avgLifeExp,fill = as.factor(year)),data = df_6.2)+geom_bar(position = "dodge",stat="identity",colour="black")+theme_test()![]()
6.3堆砌条形图
我们演示一下不同位点下的微生物种类的差异的堆砌条形图: setwd("D:\\test\\ggplot2") df_6.3<-read.csv("bar_relative_abundance.csv",header = T) head(df_6.3) Site p__Chloroflexi p__Proteobacteria p__Actinobacteriota p__Bacteroidota 1 A1 24.99842 22.03019 20.54620 9.119543 2 A2 14.53592 23.70618 28.06998 9.251222 3 A3 31.63383 19.17744 13.31212 10.084735 4 A4 27.85878 20.77519 17.53587 11.705524 5 A5 16.49443 22.04119 12.40436 9.588936 6 A6 31.27888 20.65534 11.70475 7.769690 p__Nitrospirota p__Planctomycetota p__Patescibacteria p__Acidobacteriota 1 8.793537 5.047240 2.502761 2.600926 2 8.068257 5.252886 3.752516 3.052232 3 8.568813 2.834795 7.925106 1.772595 4 9.389788 3.117622 3.457717 1.614698 5 11.032492 17.851330 1.723117 3.585315 6 4.567493 10.888783 4.577289 3.924405 p__Firmicutes others 1 1.293612 3.067569 2 1.392112 2.918689 3 1.649580 3.040981 4 1.247601 3.297214 5 2.021840 3.256990 6 1.587912 3.045462 这样的数据我们称之为宽数据,在绘制堆砌条形图时,我们需要对数据进行转换一下:宽数据 转换为 长数据 这里我们用reshape2包中的melt函数: ##宽数据转变为长数据 library(reshape2) Attaching package: 'reshape2' The following object is masked from 'package:tidyr': smiths df_6.3_long<-melt(df_6.3, id.vars = c("Site"), #需保留的不参与聚合的变量列名 measure.vars = c(colnames(df_6.3)[2:11]),#需要聚合的变量 variable.name = c('phylum'),#聚合变量的新列名 value.name = 'value')#聚合值的新列名 head(df_6.3_long) Site phylum value 1 A1 p__Chloroflexi 24.99842 2 A2 p__Chloroflexi 14.53592 3 A3 p__Chloroflexi 31.63383 4 A4 p__Chloroflexi 27.85878 5 A5 p__Chloroflexi 16.49443 6 A6 p__Chloroflexi 31.27888 接着,我们进行堆砌条形图的绘制: colors_6.3 <- c("#9b3a74","#3cb346","#cc340c","#e4de00","#9ec417","#13a983", "#44c1f0","#3f60aa","#f88421","#156077") ggplot(aes(x= Site,y = value,fill = phylum),data = df_6.3_long)+ geom_bar(position='fill',stat='identity',width = 0.5)+ theme_test()+ theme(axis.text.x = element_text(angle = 30,vjust = 0.85,hjust = 0.75))+ scale_fill_manual(values=colors_6.3)+ scale_y_continuous(expand= c(0,0))![]()
6.4 频次直方图
当面对每行观测对应一个样本的数据集时,可利用频数绘制条形图。 此时不选择映射y参数,且参数默认被设定为stat=“bin” ggplot(aes(x = Species),data = iris)+geom_bar(fill = "lightblue")+theme_test()![]()
7 折线图
折线图是一种常用的数据可视化工具,主要用于显示数据随时间或其他连续变量变化的趋势。 它通过将数据点连接成线,帮助观察者识别数据的模式、趋势和波动。7.1 简单折线图
我们来看一下简单的折线图,用的数据集是R自带的economic数据集 head(economics) # A tibble: 6 × 6 date pce pop psavert uempmed unemploy <date> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1967-07-01 507. 198712 12.6 4.5 2944 2 1967-08-01 510. 198911 12.6 4.7 2945 3 1967-09-01 516. 199113 11.9 4.6 2958 4 1967-10-01 512. 199311 12.9 4.9 3143 5 1967-11-01 517. 199498 12.8 4.7 3066 6 1967-12-01 525. 199657 11.8 4.8 3018 ggplot(economics,aes(date,unemploy))+geom_line()+theme_test()如果我们掌握了散点图的绘制,对折线图而言,就相对简单了。 同理地,多条曲线、颜色都可以是和散点图一致的。 我们用ggplot2中的diamonds数据集进行演示一下: library(dplyr) diamonds2<-diamonds%>% filter(carat<=2)%>%mutate(lcarat = log2(carat),lprice=log2(price)) mod<-lm(lprice ~ lcarat ,data = diamonds2) diamonds2<-diamonds2%>%mutate(rel_price=resid(mod)) color_cut<-diamonds2 %>% group_by(color,cut) %>% summarise(price = mean(price),rel_price = mean(rel_price)) `summarise()` has grouped output by 'color'. You can override using the `.groups` argument. color_cut # A tibble: 35 × 4 # Groups: color [7] color cut price rel_price <ord> <ord> <dbl> <dbl> 1 D Fair 3939. -0.0755 2 D Good 3309. -0.0472 3 D Very Good 3368. 0.104 4 D Premium 3513. 0.109 5 D Ideal 2595. 0.217 6 E Fair 3516. -0.172 7 E Good 3314. -0.0539 8 E Very Good 3101. 0.0655 9 E Premium 3344. 0.0845 10 E Ideal 2564. 0.174 # ℹ 25 more rows ggplot(color_cut,aes(color,price))+geom_line(aes(group = cut),colour = "grey80")+geom_point(aes(colour = cut),size = 2.5)+theme_test()
接下来,我们来绘制一张折线图的另外一种形式:面积图
7.2 面积图
面积图主要用来展示数据随时间或类别变化的趋势。 面积图以其直观性和视觉吸引力,在数据可视化中非常受欢迎。 这里我用自己的数据集演示一下面积图的绘制,主要用到geom_area()函数: setwd("D:\\test\\ggplot2") dat_7.2<-read.csv("area_plot.csv",header = T) ggplot(dat_7.2, aes(x=site, y=value)) + geom_area()添加填充颜色,边界线和点,更换主题 ggplot(dat_7.2, aes(x=site, y=value)) + geom_area(fill="#69b3a2", alpha=0.6) + geom_line(color="black", size=1.5) + geom_point(size=3, color="red") + theme_minimal()
8 饼图
饼图(Pie Chart)是一种用于展示各部分与整体之间比例关系的图表。 它通过将一个圆形划分为不同的扇形,以直观地显示各部分所占的比例。8.1 普通饼图
R中的pie()函数就可以轻松绘制饼图: setwd("D:\\test\\ggplot2") dat_8.1<-read.csv("pie_plot1.csv",header = T) pie(dat_8.1$rel_abundance, labels=dat_8.1$phylum, radius = 1.0,clockwise=T, main = "Phylum(%)")ggplot2绘制饼图还是比较复杂:还需要用到另外一个包ggforce library(ggforce) ggplot()+ theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.ticks = element_blank(), axis.text.y = element_blank(), axis.text.x = element_blank(), legend.title=element_blank(), panel.border = element_blank(), panel.background = element_blank())+#去除没用的ggplot背景,坐标轴 xlab("")+ylab('')+#添加颜色 scale_fill_manual(values = c('#E5D2DD', '#53A85F', '#F1BB72', '#F3B1A0', '#D6E7A3', '#57C3F3', '#476D87', '#E59CC4', '#AB3282', '#23452F'))+ geom_arc_bar(data=dat_8.1, stat = "pie", aes(x0=0,y0=0,r0=0,r=2, amount=rel_abundance,fill=phylum) )+ annotate("text",x=1.6,y=1.5,label="25.00%",angle=-50)+ annotate("text",x=1.6,y=-1.5,label="22.03%",angle=45)+ annotate("text",x=0,y=-2.2,label="20.55%",angle=0)
![]()
8.2 环图
和上述的例子一样,想要获得环图(空心饼图),只需要将geom_arc_bar中的R0修改为R1即可: ggplot()+ theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.ticks = element_blank(), axis.text.y = element_blank(), axis.text.x = element_blank(), legend.title=element_blank(), panel.border = element_blank(), panel.background = element_blank())+#去除没用的ggplot背景,坐标轴 xlab("")+ylab('')+#添加颜色 scale_fill_manual(values = c('#E5D2DD', '#53A85F', '#F1BB72', '#F3B1A0', '#D6E7A3', '#57C3F3', '#476D87', '#E59CC4', '#AB3282', '#23452F'))+ geom_arc_bar(data=dat_8.1, stat = "pie", aes(x0=0,y0=0,r0=1,r=2, amount=rel_abundance,fill=phylum) )+ annotate("text",x=1.6,y=1.5,label="25.00%",angle=-50)+ annotate("text",x=1.6,y=-1.5,label="22.03%",angle=45)+ annotate("text",x=0,y=-2.2,label="20.55%",angle=0)总体来讲,ggplot2中绘制饼图是不友好的,需要手动设置参数,比较麻烦!
8.3 玫瑰图
玫瑰图(Rose Chart),也称为极坐标图或圆形柱状图,是一种用于展示数据分布的图表。 它通过在极坐标系统中绘制多个扇形,常用于表示周期性或方向性的数据。 数据我们用到第六章中6.2的数据: p_8.3 <- ggplot(df_6.2) + geom_col(aes(x = reorder(continent, avgLifeExp),y = avgLifeExp,fill = year), position = "stack",show.legend = TRUE,alpha = .9) p_8.3<-p_8.3+coord_polar()+scale_fill_gradient(low = "green", high = "red")+theme_minimal()+xlab(" ") p_8.3![]()
8.4 旭日图
旭日图(sunburst)是饼图的变形,简单来说是多个饼图的组合升级版。 饼图只能展示一层数据的占比情况,而旭日图不仅可以展示数据的占比情况,还能厘清多级数据之间的关系。 得用到moonBook包中的PieDonut函数 library(moonBook) library(webr) test=iris[,c(4,5)] test[,1]=ceiling(test[,1]) PieDonut(test,aes(pies=Petal.Width,donuts=Species))突出某个具体分类: PieDonut(test,aes(pies=Petal.Width,donuts=Species), selected=1,labelposition=1,explode=1,explodeDonut=TRUE)
9 网络图
网络图(Network Diagram)是一种用于表示节点(或顶点)和连接(或边)之间关系的图形。 它广泛应用于不同领域,如计算机网络、社交网络、交通系统等。 网络图十分繁杂,这里我演示两种常见的基础网络图:9.1 二分网
二分网络(Bipartite Network)是一种特殊的网络结构,其中节点分为两个不重叠的集合,边仅在这两个集合之间连接,而不在同一集合内连接。 二分网络常用于表示两个类型实体之间的关系,常用于生态学中的物种种间相互作用的分析。 主要用的是bipartite包: require(bipartite) setwd("D:\\test\\ggplot2") dat_9.1<-read.csv("Antplantdata.csv",header=T) head(dat_9.1) Tree.Number Date Collector Elevation Waypoints 1 700 - 0797 06-Jul-13 N.Plowman 700 820 2 700 - 0757 05-Jul-13 N.Plowman 700 780 3 700 - 0728 05-Jul-13 N.Plowman 700 751 4 700 - 0754 05-Jul-13 N.Plowman 700 777 5 700 - 0738 05-Jul-13 N.Plowman 700 761 6 700 - 0734 05-Jul-13 N.Plowman 700 757 New.Leaves.Accessible...Y.N. Estimated.Tree.Height..m. CBH..cm. 1 N 10.5 24.0 2 N 10.0 29.8 3 N 9.0 28.3 4 N 8.3 29.9 5 N 7.4 24.0 6 N 6.5 13.5 No..of.leaves.removed Photo.Numbers X0. X.5. X5.33. X.33. Family 1 3 8566-8568 NA NA NA NA Myristicaceae 2 0 266,267 NA NA NA NA Meliaceae 3 0 204,205 NA NA NA NA Meliaceae 4 0 261,262 NA NA NA NA Meliaceae 5 0 233,234 NA NA NA NA Meliaceae 6 2 224-225 NA NA NA NA Meliaceae Species Genus Species.code Field.Note X.specimens 1 Myristica sp. (brown) Anonychomyrma ANON009 50 2 Chisocheton lasiocarpus UNOCCUPIED UNOCCUPIED NA 3 Chisocheton lasiocarpus Anonychomyrma ANON002 28 4 Chisocheton lasiocarpus Anonychomyrma ANON009 15 5 Chisocheton lasiocarpus Anonychomyrma ANON009 30 6 Chisocheton lasiocarpus Podomyrma PODO003 8 pinned accession.no sep.for.barcoding barcoding.success 1 NA 2 NA 3 NA 4 NA 5 NA 6 NA #exclude unoccupied plants from network occ<-droplevels(dat_9.1[which(dat_9.1$Species.code!="UNOCCUPIED"),]) networkdata<-occ[which(occ$Species.code!="UNC"),c(4,16,18)] head(networkdata) Elevation Species Species.code 1 700 Myristica sp. (brown) ANON009 3 700 Chisocheton lasiocarpus ANON002 4 700 Chisocheton lasiocarpus ANON009 5 700 Chisocheton lasiocarpus ANON009 6 700 Chisocheton lasiocarpus PODO003 7 700 Ryparosa amplifolia ANON009 #colour palettes antpalette <- c("#a50026","#d73027","#f46d43","#fdae61", "#fee090","#e0f3f8","#abd9e9","#74add1", "#4575b4", "#313695") plantpalette<-c("#543005","#8c510a","#bf812d","#dfc27d","#f6e8c3","#f5f5f5","#c7eae5","#80cdc1", "#35978f","#01665e","#003c30") #bipartite network for all elevations combined networkdata3<-networkdata[,2:3] networkdata4<-table(networkdata3) plotweb(networkdata4, bor.col.interaction="grey80",low.lablength=21,high.lablength=10, text.rot=90,col.high=antpalette, col.low=plantpalette)![]()
9.2 微生物共现网络
微生物共现网络是用于分析和可视化不同微生物群落之间相互关系的一种图形表示方法。 它通过网络图展示微生物种类之间的共存或互作模式。 具有以下特点: 1.节点 :代表不同的微生物种类或OTUs(操作性分类单元)。 2.边 :表示微生物之间的共现关系,通常基于它们在相同样本中的共同出现。 3.共现分析 :通过统计方法(如皮尔逊相关系数或杰卡德指数)确定微生物之间的关系,建立网络。 我们用微生物OTU表的数据演示一下: library(WGCNA) library(psych) library(reshape2) library(igraph) setwd("D:\\test\\ggplot2") otu_table <- read.csv("Co_Net.csv",header = T,row.names = 1)#导入数据 #对OTU进行筛选 #(1)去掉平均相对丰度低于0.01% #(2)出现次数少于总样本量1/5的OTU #rel_abundance <- apply(otu_table, 2, function(x) x/sum(x)) # 计算相对丰度 mean_rel_abundance <- rowMeans(rel_abundance) # 计算各个OTU在每个样本中的相对丰度 low_rel_abundance_otu <- rownames(otu_table)[mean_rel_abundance < 0.0001] # 找到平均相对丰度小于0.01%的OTU otu_table_filtered <- otu_table[!(rownames(otu_table) %in% low_rel_abundance_otu), ] # 删除平均相对丰度低的OTU freq <- apply(otu_table_filtered, 1, function(x) sum(x > 0)/length(x)) keep <- freq >= 1/5 # 根据需要改边需要的出现频率 otu_table_filt <- otu_table_filtered[keep, ] # 仅保留出现频率大于设定阈值的OTU otu<-otu_table_filt cor = corAndPvalue(t(otu),y=NULL,use = "pairwise.complete.obs", alternative='two.sided',method='spearman') #OTU之间的Spearman相关系数和p值 r = cor$cor # 获取相关系数 p = cor$p #获取p值 p = p.adjust(p, method = 'BH') #对p值进行BH校正 r[p > 0.001 | abs(r) < 0.60] = 0 # 对相关性进行筛选,p值>0.001或|r|<0.60的将被去除(赋0值) write.csv(data.frame(r, check.names = FALSE), 'corr.matrix.csv') g = graph_from_adjacency_matrix(r,mode="undirected",weighted=TRUE,diag = FALSE) #根据相关系数矩阵创建一个加权无向图 g = delete.vertices(g, names(degree(g)[degree(g) == 0])) #删除度数为0的孤立节点 E(g)$corr = E(g)$weight #为网络的边属性赋值(权重) E(g)$weight = abs(E(g)$weight) #为网络的边属性赋值(权重) tax = read.csv('otu_tax.csv', row.names=1, header=T) #读取节点分类信息 tax = tax[as.character(V(g)$name), ] #为节点加上分类信息 V(g)$Kingdom = tax$Kingdom #界 V(g)$Phylum = tax$Phylum #门 V(g)$Class = tax$Class #纲 V(g)$Order = tax$Order #目 V(g)$Family = tax$Family #科 V(g)$Genus = tax$Genus #属 V(g)$Species = tax$Species #种 node_list = data.frame( label = names(V(g)), kingdom = V(g)$Kingdom, phylum = V(g)$Phylum, class = V(g)$Class, order = V(g)$Order, family = V(g)$Family, genus=V(g)$Genus, species = V(g)$Species) #创建节点列表 head(node_list) edge = data.frame(as_edgelist(g)) #创建边列表 edge_list = data.frame( source = edge[[1]], target = edge[[2]], weight = E(g)$weight, correlation = E(g)$corr ) write.graph(g, 'network.graphml', format = 'graphml') #后续在Gephi中可视化 后续在Gephi软件中进行优化绘图,具体可参考“微生物共现网络的构建及在Gephi中优化绘图“ 这一期的推文。 效果图如下:![]()
10雷达图
雷达图(Radar Chart),也称为蛛网图,是一种用于展示多变量数据的可视化工具。 它以中心点为起点,通过放射状的轴线表示不同的变量,适合于比较多个对象在不同维度上的表现。 具有以下特点: 1.多维数据展示 :能够在同一图表中显示多个变量,便于比较不同对象的特征。 2.直观可视化 :各个变量通过线连接,形成多边形,便于快速理解各对象的相对表现。 fmsb包是常用的绘制雷达图的包。 下面我们演示一下雷达图的绘制: # 安装和加载fmsb包 #install.packages("fmsb") library(fmsb) # 创建数据框 dat_10.1 <- data.frame( row.names = c("手机A", "手机B", "手机C"), 性能 = c(5, 4, 3), 摄像头 = c(4, 5, 3), 电池 = c(4, 3, 5), 显示屏 = c(3, 4, 5) ) # 添加最大和最小值 dat_10.1 <- rbind(rep(5, ncol(dat_10.1)), rep(1, ncol(dat_10.1)), dat_10.1) # 透明颜色函数 transp <- function(col, alpha) { rgb(t(col2rgb(col)), max = 255, alpha = alpha * 255, names = NULL) } # 绘制雷达图 radarchart(dat_10.1, axistype = 1, pcol = c("red", "blue", "green"), pfcol = c(transp("red", 0.5), transp("blue", 0.5), transp("green", 0.5)), plwd = 2, plty = 1, title = "手机特性比较" )使用fmsb包绘制雷达图特别需要注意数据的结果,前面两行是数据的最大值和最小,然后才是变量。 下来演示ggradar包绘制雷达图 ggradar是ggplot2的拓展包,调用ggplot2的语法绘制雷达图。 如果平时习惯ggplot2作图,那么这个包使用起来可能会比fmsb包更顺手,因为它的参数选项的名称和ggplot2的很像。 #通过连接入 github 安装 #install.packages('devtools') #devtools::install_github('ricardo-bion/ggradar', dependencies = TRUE) library(ggradar) #模拟数据 set.seed(1234) dat_10.2 <- data.frame( obj = c('obj1', 'obj2', 'obj3'), factor1 = runif(3, 0, 1), factor2 = runif(3, 0, 1), factor3 = runif(3, 0, 1), factor4 = runif(3, 0, 1), factor5 = runif(3, 0, 1)) #查看数据集结构 dat_10.2 obj factor1 factor2 factor3 factor4 factor5 1 obj1 0.1137034 0.6233794 0.009495756 0.5142511 0.2827336 2 obj2 0.6222994 0.8609154 0.232550506 0.6935913 0.9234335 3 obj3 0.6092747 0.6403106 0.666083758 0.5449748 0.2923158 #雷达图 ggradar(dat_10.2, background.circle.transparency = 0, group.colours = c('blue', 'red', 'green3'))
11弦图
弦图(chord diagram)又称和弦图。 可以显示不同实体之间的相互关系和彼此共享的一些共通之处,因此这种图表非常适合用来比较数据集或不同数据组之间的相似性。 需要用到circlize包,演示的数据是我们自带的数据集: library(circlize) setwd("D:\\test\\ggplot2") mat<-read.csv("hexiantu.csv",header = T) ma1<-mat[,c(1,3,5)] head(ma1) type phylum value 1 water Act 34.11 2 water Pro 26.08 3 water Cya 18.74 4 water Bac 7.55 5 water Pla 5.26 6 water Ver 1.99 chordDiagram(ma1)![]()
12 桑基图
桑基图(Sankey diagram)是一种常用的层次数据展示方法,它通过使用有向连接不同的节点来显示流动的路径和量级。 它以宽度不同的箭头或流线表示数据流的量,流量越大,数据越宽。 R语言中绘制桑基图主要用到两个包,ggplot2和ggalluvial。 library(ggplot2) library(ggalluvial) setwd("D:\\test\\ggplot2") dat_12<-read.csv("sankey_plot.csv",header = T) p_san<-ggplot(dat_12, aes(y = ab, axis1 =type, axis2 = phylum, axis3 = depth))+#定义图形绘制 theme_test()+ geom_alluvium(aes(fill = type),width = 0, reverse = FALSE)+#控制线条流向 scale_fill_manual(values = c("#FC4E07","#00AFBB") )+ geom_stratum(width = 1/3, reverse = FALSE) +#控制中间框的宽度 geom_text(stat = "stratum", aes(label = after_stat(stratum)),reverse = FALSE, size = 4,angle=0)+ #定义中间的文字 scale_x_continuous(breaks = 1:3, labels = c("type", "phylum", "depth"))+#定义X轴上图标排序 theme(legend.position = "none") p_san![]()
13相关性热图
13.1 普通相关性热图
相关关系图(Correlation Plot)是一种可视化工具,用于展示多个变量之间的相关性。 通过颜色和形状,可以直观地看到变量之间的正相关、负相关或无相关关系。 corrplot包绘制相关性热图: data(mtcars) cor(mtcars$disp,mtcars$hp)#简单看一下disp和hp两个变量的相关性 [1] 0.7909486 corr<- cor(mtcars)#求所有变量的相关性 library(corrplot)#加载所需要的包 corrplot 0.92 loaded corrplot(corr,method="pie")![]()
13.2 Mantel test 图
Mantel检验(Mantel Test)是一种统计方法,用于评估两个距离矩阵之间的相关性。 它常用于生态学和遗传学等领域,比较地理距离与遗传距离的相关性等。 Mantel检验通过计算两个距离矩阵的皮尔森相关系数,并通过置换检验(permutation test)来评估相关性的显著性。 下面我演示用linkET包绘制mantel test图: library(linkET) library(ggplot2) library(dplyr) #读取数据 #获得样品情况,作为环境因子表 setwd("D:\\test\\ggplot2") env <- read.csv("env.csv",row.names = 1,header = T) as_matrix_data(env) A matrix data object: Number: 1 Names: env Dimensions: 94 rows, 5 columns Row names: S01, S02, S03, S04, S05, S06, S07, S08, S09, S10, S11, S1... Column names: COD, DO, NH3.N, TP, petroeum as_md_tbl(env) # A tibble: 470 × 3 .rownames .colnames env * <chr> <chr> <dbl> 1 S01 COD 16 2 S02 COD 8 3 S03 COD 11 4 S04 COD 9 5 S05 COD 5 6 S06 COD 17 7 S07 COD 8 8 S08 COD 9 9 S09 COD 19 10 S10 COD 12 # ℹ 460 more rows ###微生物otu数据,丰度前20的门水平 t_water<-read.csv("water-phylum20.csv",header = T,row.names = 1) mantel <- mantel_test(t_water, env, spec_select = list(Spec01 = 1:5, Spec02 = 6:10, Spec03 = 11:15, Spec04 = 16:20)) %>% mutate(rd = cut(r, breaks = c(-Inf, 0.2, 0.4, Inf), labels = c("< 0.2", "0.2 - 0.4", ">= 0.4")), pd = cut(p, breaks = c(-Inf, 0.01, 0.05, Inf), labels = c("< 0.01", "0.01 - 0.05", ">= 0.05"))) `mantel_test()` using 'bray' dist method for 'spec'. `mantel_test()` using 'euclidean' dist method for 'env'. ##mantel test 绘图 qcorrplot(correlate(env), type = "lower", diag = FALSE) + geom_square() + geom_couple(aes(colour = pd, size = rd), data = mantel, curvature = nice_curvature()) + scale_fill_gradientn(colours = RColorBrewer::brewer.pal(11, "RdBu")) + scale_size_manual(values = c(0.5, 1, 2)) + scale_colour_manual(values = color_pal(3)) + guides(size = guide_legend(title = "Mantel's r", override.aes = list(colour = "grey35"), order = 2), colour = guide_legend(title = "Mantel's p", override.aes = list(size = 3), order = 1), fill = guide_colorbar(title = "Pearson's r", order = 3))![]()
14树形图 Treemap
树形图(Treemap)由一组矩形组成,这些矩形代表数据中的不同类别,其大小由与各自类别相关的数值定义。 这里演示的数据来自HistData的霍乱数据,同时需要加载treepmap包: library(HistData) library(treemap) library(dplyr) library(RColorBrewer) data("Cholera") str(Cholera)#查看数据集结构 'data.frame': 38 obs. of 15 variables: $ district : chr "Newington" "Rotherhithe" "Bermondsey" "St George Southwark" ... $ cholera_drate : int 144 205 164 161 181 153 68 120 97 75 ... $ cholera_deaths: int 907 352 836 734 349 539 437 1618 504 718 ... $ popn : int 63074 17208 50900 45500 19278 35227 64109 134768 51704 95954 ... $ elevation : int -2 0 0 0 2 2 2 3 4 8 ... $ region : Factor w/ 5 levels "West","North",..: 5 5 5 5 5 5 1 5 5 5 ... $ water : Factor w/ 3 levels "Battersea","New River",..: 1 1 1 1 1 1 1 1 1 2 ... $ annual_deaths : int 232 277 267 264 281 292 260 233 197 238 ... $ pop_dens : int 101 19 180 66 114 141 70 34 12 18 ... $ persons_house : num 5.8 5.8 7 6.2 7.9 7.1 8.8 6.5 5.8 6.8 ... $ house_valpp : num 3.79 4.24 3.32 3.08 4.56 ... $ poor_rate : num 0.075 0.143 0.089 0.134 0.079 0.076 0.039 0.072 0.038 0.081 ... $ area : int 624 886 282 688 169 250 917 4015 4342 5367 ... $ houses : int 9370 2420 6663 5674 2523 4659 6439 17791 6843 11995 ... $ house_val : int 207460 59072 155175 107821 90583 174732 238164 510341 180418 274478 ... 我们想创建一个树状图,其中我们有较大的矩形代表伦敦的区域,较小的矩形代表各自区域内的地区。 矩形的大小将告诉我们某一地区和地区霍乱造成的死亡率。 treemap(Cholera, index=c("region","district"), vSize="cholera_deaths", vColor = "region", type = "categorical", # formatting options: palette = brewer.pal(n = 5, name = "Accent"), align.labels=list( c("left", "top"), c("right", "bottom") ), border.col = "white", bg.labels = 255, position.legend = "none")![]()
15 聚类树状图 Dendrogram
树状图(Dendrogram)是一种展示数据集层次聚类结果的图形工具。 在聚类分析中,树状图通过逐步聚类的过程,将数据点按照相似性进行合并,并通过树形结构来表示合并的层次关系。 树状图不仅可以帮助我们了解数据点之间的相似性,还可以帮助我们决定适合的数据聚类数量。 数据用的是R自带的USArrests数据集,即1973年美国各个州每100000人名居民因谋杀、袭击和强奸被捕的人数。 head(USArrests) Murder Assault UrbanPop Rape Alabama 13.2 236 58 21.2 Alaska 10.0 263 48 44.5 Arizona 8.1 294 80 31.0 Arkansas 8.8 190 50 19.5 California 9.0 276 91 40.6 Colorado 7.9 204 78 38.715.1 聚类树的构建
1.计算距离矩阵 :对所有数据点对之间的距离进行计算,通常使用欧几里得距离。 2.初始聚类 :每个数据点作为一个独立的聚类。 3.合并聚类 :逐步合并距离最近的两个聚类,并更新距离矩阵。 4.重复步骤3 :直到所有的数据点合并成一个聚类。 # 计算距离矩阵,默认method = "euclidean"计算欧氏距离 dists <- dist(USArrests,method = "euclidean") # 进行层次聚类,method = "average"选择UPGMA聚类算法 hc <- hclust(dists, method = "ave") # 将hclust对象转换为dendrogram对象 dend1 <- as.dendrogram(hc) # 绘制聚类树图,默认type = "rectangle" plot(dend1, type = "rectangle",ylab="Height")水平放置聚类树 plot(dend1, nodePar = list(pch = 17:16, cex = 1.2:0.8, col = 2:3), horiz = TRUE)
nP <- list(col = 3:2, cex = c(2.0, 0.8), pch = 21:22, bg = c("light blue", "pink"), lab.cex = 0.8, lab.col = "tomato") plot(dend1, nodePar= nP, edgePar = list(col = "gray", lwd = 2), horiz = TRUE)
15.2使用ggdendro包绘制聚类树图
ggdendro是R语言中绘制谱系图的强大工具 # 安装并加载所需的R包 #install.packages('ggdendro') library(ggdendro) library(ggplot2) # 层次聚类 hc <- hclust(dist(USArrests), "ave") hc Call: hclust(d = dist(USArrests), method = "ave") Cluster method : average Distance : euclidean Number of objects: 50 ggdendrogram(hc)修改一下风格 hcdata <- dendro_data(hc, type = "triangle") ggdendrogram(hcdata, rotate = TRUE) + labs(title = "Dendrogram in ggplot2")
15.3 使用ggraph包绘制聚类树图 # 安装并加载所需的R包 #install.packages("ggraph") library(ggraph) library(igraph) library(tidyverse) library(RColorBrewer) theme_set(theme_void()) # 构建示例数据 # data: edge list d1 <- data.frame(from="origin", to=paste("group", seq(1,7), sep="")) d2 <- data.frame(from=rep(d1$to, each=7), to=paste("subgroup", seq(1,49), sep="_")) edges <- rbind(d1, d2) # 我们可以为每个节点添加第二个包含信息的数据帧! name <- unique(c(as.character(edges$from), as.character(edges$to))) vertices <- data.frame( name=name, group=c( rep(NA,8) , rep( paste("group", seq(1,7), sep=""), each=7)), cluster=sample(letters[1:4], length(name), replace=T), value=sample(seq(10,30), length(name), replace=T) ) #创建一个图形对象 mygraph <- graph_from_data_frame( edges, vertices=vertices) # 使用ggraph函数绘制聚类树图 ggraph(mygraph, layout = 'dendrogram') + geom_edge_diagonal()# 构建测试数据集 d1=data.frame(from="origin", to=paste("group", seq(1,10), sep="")) d2=data.frame(from=rep(d1$to, each=10), to=paste("subgroup", seq(1,100), sep="_")) edges=rbind(d1, d2) # 创建一个顶点数据框架。 层次结构中的每个对象一行 vertices = data.frame( name = unique(c(as.character(edges$from), as.character(edges$to))) , value = runif(111) ) # 让我们添加一个列,其中包含每个名称的组。 这将是有用的稍后颜色点 vertices$group = edges$from[ match( vertices$name, edges$to ) ] #让我们添加关于我们将要添加的标签的信息:角度,水平调整和潜在翻转 #计算标签的角度 vertices$id=NA myleaves=which(is.na( match(vertices$name, edges$from) )) nleaves=length(myleaves) vertices$id[ myleaves ] = seq(1:nleaves) vertices$angle= 90 - 360 * vertices$id / nleaves # 计算标签的对齐方式:向右或向左 #如果我在图的左边,我的标签当前的角度< -90 vertices$hjust<-ifelse( vertices$angle < -90, 1, 0) # 翻转角度BY使其可读 vertices$angle<-ifelse(vertices$angle < -90, vertices$angle+180, vertices$angle) # 查看测试数据 head(edges) from to 1 origin group1 2 origin group2 3 origin group3 4 origin group4 5 origin group5 6 origin group6 head(vertices) name value group id angle hjust 1 origin 0.08520282 <NA> NA NA NA 2 group1 0.80271034 origin NA NA NA 3 group2 0.34579104 origin NA NA NA 4 group3 0.84521720 origin NA NA NA 5 group4 0.85891928 origin NA NA NA 6 group5 0.48287801 origin NA NA NA # 创建一个图形对象 mygraph <- graph_from_data_frame( edges, vertices=vertices ) #绘图 ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + geom_edge_diagonal(colour="grey") + #设置节点边的颜色 # 设置节点的标签,字体大小,文本注释信息 geom_node_text(aes(x = x*1.15, y=y*1.15, filter = leaf, label=name, angle = angle, hjust=hjust*0.4, colour=group), size=2.5, alpha=1) + # 设置节点的大小,颜色和透明度 geom_node_point(aes(filter = leaf, x = x*1.07, y=y*1.07, colour=group, size=value, alpha=0.2)) + # 设置颜色的画板 scale_colour_manual(values= rep(brewer.pal(9,"Paired") , 30)) + # 设置节点大小的范围 scale_size_continuous(range = c(1,10) ) + theme_void() + theme(legend.position="none",plot.margin=unit(c(0,0,0,0),"cm"), ) + expand_limits(x = c(-1.3, 1.3), y = c(-1.3, 1.3))
16森林图
森林图(Forest Plot)是一种用于展示多个研究结果的图形方法,常用于系荟萃分析(Meta Analysis)。 森林图可以让读者直观地看到不同研究的结果,并通过综合多个研究的结果来得出总体结论。 森林图的关键组成部分有: (1)每项研究的结果 :每一行通常代表一项独立的研究。 每个研究的结果通常以点估计(如平均值或比值比)和其置信区间(通常是95%置信区间)表示。 (2)置信区间 :点估计的左右两侧会有水平线,表示该点估计的置信区间。 这条线的长度反映了结果的不确定性。 置信区间越长,表示该研究结果的不确定性越大。 (3)合并结果 :图中通常会有一条垂直线,表示无效值(如相对风险为1,或均值差为0)的参考线。 下方会有一个菱形或其他符号,表示合并后的总体结果及其置信区间。 菱形的中心表示合并的点估计,两端表示合并结果的置信区间。 下面,我们用meta和metafor包演示一下森林图的绘制,数据是内置的数据包: library(meta) library(metafor) # 加载数据 data(caffeine) head(caffeine) study year h.caf n.caf h.decaf n.decaf D1 D2 D3 D4 D5 rob 1 Amore-Coffea 2000 2 31 10 34 some some some some high low 2 Deliciozza 2004 10 40 9 40 low some some some high low 3 Kahve-Paradiso 2002 0 0 0 0 high high some low low low 4 Mama-Kaffa 1999 12 53 9 61 high high some high high low 5 Morrocona 1998 3 15 1 17 low some some low low low 6 Norscafe 1998 19 68 9 64 some some low some high high 可以看出,数据的结构有: 1.研究内容(study) 2.研究时间(year) 3.头疼的参与者人数-咖啡因组(h.caf) 4.参与者人数-咖啡因组(n.caf) 5.头疼的参与者人数-无咖啡因组(h.decaf) 6.参与者人数-无咖啡因组(n.decaf)16.1 普通森林图
可以使用meta::forest()函数会对任何类型的meta分析对象创建森林图: m1 <- metabin(h.caf, n.caf, h.decaf, n.decaf, sm = "OR", data = caffeine, studlab = paste(study, year)) Warning: Studies with non-positive values for n.e and / or n.c get no weight in meta-analysis. forest(m1)![]()
16.2 亚组森林图
通过创建亚组变量,将亚组变量添加到函数中,可以创建亚组森林图。 caffeine$subyear <- ifelse(caffeine$year < 2000, "Before2000", "After2000") m2 <- metabin(h.caf, n.caf, h.decaf, n.decaf, data=caffeine, sm = "OR", studlab=paste(study, " " ,year), common = TRUE, random = TRUE, subgroup = subyear) Warning: Studies with non-positive values for n.e and / or n.c get no weight in meta-analysis. Warning: Studies with non-positive values for n.e and / or n.c get no weight in meta-analysis. forest(m2)![]()
16.3 带有ROB的森林图
ROB,也就是risk of bias,偏倚风险。 偏倚风险评估图用于展示纳入研究的方法学质量,绿、黄、红3种颜色分别代表低、中、高风险,相对于表格更为直观。 rob1 <- rob(D1, D2, D3, D4, D5, overall = rob, data = m1, tool = "RoB1") forest(rob1)![]()
17 金字塔图
金字塔图(Pyramid Chart),也称人口金字塔,是一种用于显示人口分布或其他分层数据的图形。 金字塔图通常用于展示不同年龄组和性别的人口数量,但它也可以用于其他数据集,例如物种分布等。 本质上,金字塔图是柱形图的一种。 下面演示一下金字塔图的绘制: 我们自己创建一个数据集: library(ggplot2) library(dplyr) # 示例数据 age_groups <- c('0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80+') male_population <- c(2000, 2200, 2400, 2600, 2800, 3000, 3200, 3400, 3600, 3800, 4000, 4200, 4400, 4600, 4800, 5000, 5200) female_population <- c(2100, 2300, 2500, 2700, 2900, 3100, 3300, 3500, 3700, 3900, 4100, 4300, 4500, 4700, 4900, 5100, 5300) # 创建数据框 data <- data.frame( AgeGroup = rep(age_groups, 2), Population = c(male_population, female_population), Gender = rep(c('Male', 'Female'), each = length(age_groups)) ) # 为女性人口取负值,以便在左侧显示 data <- data %>% mutate(Population = ifelse(Gender == 'Female', -Population, Population)) 绘制金字塔图: # 绘制金字塔图 ggplot(data, aes(x = AgeGroup, y = Population, fill = Gender)) + geom_bar(stat = "identity", position = "identity") + coord_flip() + scale_y_continuous(labels = abs) + labs(title = "Population Pyramid", x = "Age Group", y = "Population") + theme_minimal() + scale_fill_manual(values = c("Male" = "blue", "Female" = "pink"))![]()
18韦恩图
韦恩图(Venn Diagram)是一种用于展示不同集合之间关系的图形工具。 它通过重叠的圆形表示集合之间的交集、并集和差集等关系。 在R语言中,可以使用VennDiagram 包来绘制韦恩图。 下面我们演示一下三集合的韦恩图的绘制: #install.packages("VennDiagram") # 加载VennDiagram包 library(VennDiagram) Loading required package: grid Loading required package: futile.logger Attaching package: 'VennDiagram' The following object is masked from 'package:ggpubr': rotate # 定义三个集合 set1 <- c("A", "B", "C", "D") set2 <- c("B", "C", "E", "F") set3 <- c("A", "C", "F", "G") # 绘制韦恩图 venn.plot <- venn.diagram( x = list(Set1 = set1, Set2 = set2, Set3 = set3), category.names = c("Set 1", "Set 2", "Set 3"), filename = NULL, output = TRUE, fill = c('#FFFFCC','#CCFFFF',"#FFCCCC"), alpha = 0.5, cat.pos = c(-20, 20, 0), cat.dist = c(0.05, 0.05, 0.05), cat.cex = 1.5, cat.col = "black", lwd = 2 ) # 显示韦恩图 grid.draw(venn.plot)使用
venn.diagram 函数绘制韦恩图,x 参数传入包含集合的列表,category.names 参数设置集合名称。 设置图形属性: filename = NULL:不保存为文件。 output = TRUE:输出图形对象。 fill:设置每个集合的填充颜色。 alpha:设置颜色透明度。 cat.pos:设置集合标签的位置。 cat.dist:设置集合标签与圆形的距离。 cat.cex:设置集合标签的字体大小。 cat.col:设置集合标签的颜色。 lwd:设置圆形边框的宽度。19 词云图
词云图(Word Cloud)是一种可视化工具,用于展示文本数据中词汇的频率和重要性。 词汇出现频率越高,显示的字体越大,通常用于文本分析和展示。 在R中,可以使用wordcloud 包来绘制词云图。 下面展示如何绘制简单的词云图。 同时需要安装”tm”包用于文本挖掘。 #install.packages("wordcloud") #install.packages("tm") library(wordcloud) library(tm) # 示例文本数据 text <- c("R programming", "data analysis", "data visualization", "machine learning", "statistical modeling", "data science", "big data", "data mining", "artificial intelligence", "R programming", "data analysis") # 创建文本数据集 docs <- Corpus(VectorSource(text)) # 文本预处理 docs <- tm_map(docs, content_transformer(tolower)) docs <- tm_map(docs, removePunctuation) docs <- tm_map(docs, removeNumbers) docs <- tm_map(docs, removeWords, stopwords("english")) docs <- tm_map(docs, stripWhitespace) # 创建词频表 dtm <- TermDocumentMatrix(docs) matrix <- as.matrix(dtm) word_freqs <- sort(rowSums(matrix), decreasing = TRUE) word_freqs <- data.frame(word = names(word_freqs), freq = word_freqs) # 绘制词云图 wordcloud(words = word_freqs$word, freq = word_freqs$freq, min.freq = 1, max.words = 100, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))1.
创建示例文本数据 :定义一个包含多个字符串的向量。 2.创建文本数据集 :使用Corpus 函数创建文本语料库。 3.文本预处理 : –转换为小写字母。 –移除标点符号和数字。 –移除停用词(如“the”、“is”等常用词)。 –清除多余空格。 4.创建词频表 :通过TermDocumentMatrix 生成词频矩阵,并将其转换为数据框。 5.绘制词云图 :使用wordcloud 函数绘制词云图,设置参数控制词汇的最小频率、最大词数、随机顺序和颜色。20地图绘制
地图的绘制十分复杂,而且有专门的软件,比如ArcGIS,这里我们演示世界地图的绘制和在世界地图上标注经纬度。 library(ggplot2) library(maps) Attaching package: 'maps' The following object is masked from 'package:purrr': map 使用maps 包提供的数据来绘制世界地图: # 加载世界地图数据 world_map <- map_data("world") # 使用ggplot2绘制世界地图 ggplot(world_map, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "lightblue", color = "black") + theme_minimal() + labs(title = "World Map")1.
加载世界地图数据 :使用map_data 函数加载世界地图数据,数据格式适用于ggplot2 绘图。 2.使用ggplot2绘制地图 : –aes(x = long, y = lat, group = group) :设置美学映射,long 和lat 分别代表经度和纬度,group 确保地图多边形正确绘制。 –geom_polygon(fill = "lightblue", color = "black") :绘制多边形,填充颜色为浅蓝色,边框颜色为黑色。 –theme_minimal() :使用简洁主题。 –labs(title = "World Map") :添加标题。 如何在地图上添加一些点,如城市位置 # 示例城市数据 cities <- data.frame( name = c("New York", "London", "Tokyo", "Sydney"), lat = c(40.7128, 51.5074, 35.6895, -33.8688), long = c(-74.0060, -0.1278, 139.6917, 151.2093) ) # 绘制世界地图并添加城市点 ggplot() + geom_polygon(data = world_map, aes(x = long, y = lat, group = group), fill = "lightblue", color = "black") + geom_point(data = cities, aes(x = long, y = lat), color = "red", size = 3) + geom_text(data = cities, aes(x = long, y = lat, label = name), vjust = -1, color = "black") + theme_minimal() + labs(title = "World Map with Cities")1.
加载世界地图数据 :使用map_data 函数加载世界地图数据,数据格式适用于ggplot2 绘图。 2.定义示例城市数据 :创建包含城市名称、纬度和经度的数据框。 3.绘制世界地图并添加城市点 : –使用ggplot() 函数开始绘图。 –geom_polygon(data = world_map, aes(x = long, y = lat, group = group), fill = "lightblue", color = "black") :绘制世界地图的多边形,填充颜色为浅蓝色,边框颜色为黑色,group 确保多边形正确绘制。 –geom_point(data = cities, aes(x = long, y = lat), color = "red", size = 3) :在地图上添加红色点,表示城市位置。 –geom_text(data = cities, aes(x = long, y = lat, label = name), vjust = -1, color = "black") :在城市点上方添加城市名称标签。 –theme_minimal() :使用简洁主题。 –labs(title = "World Map with Cities") :添加标题。 这样生成带有城市的世界地图,我们可以用我们的经纬度数据自己在世界地图上标注点。R 如何读取超大量的文件
现在遇到一个问题,我有一个8GB的文件,我的电脑内存只有16GB,这个时候用R常规读取这个文件简直就是跟和尚借梳子。常规的读取: dt <- read.table("asv_table_tax_subsample.txt", header = T, row.names = 1) 然后就会出现报错
该怎么处理这个问题呢?
方法一:使用 readLines() 分块读取 如果文件过大,可以使用 readLines() 按行读取文件并分块处理。 示例代码: con <- file("large_file.txt", "r") # 打开文件连接 chunk_size <- 10000 # 每次读取的行数 while (length(lines <- readLines(con, n = chunk_size, warn = FALSE)) > 0) { # 在此处处理每块数据 print(length(lines)) # 可将数据保存到文件或数据库中,避免占用过多内存 } close(con) # 关闭文件连接方法二:使用 data.table::fread() data.table 包中的 fread() 函数能够高效读取大文件,并且可以直接读取压缩文件。 它比 read.table() 更快,并且支持分块读取。 library(data.table) dt <- fread("large_file.txt", nrows = 10000, header = TRUE) # 你可以根据需要调整 nrows 参数的值来控制读取的行数方法三:使用 LaF 包 LaF (Large ASCII Files) 包可以用于逐行读取大文件,而不会将整个文件加载到内存中。 library(LaF) laf <- laf_open_csv("large_file.txt", column_types = c("character", "numeric", "integer")) # 使用 laf 对象处理文件 for (i in 1:nrow(laf)) { line <- next_block(laf, nrows = 10000) # 分块读取 # 在此处处理每块数据 } close(laf)方法四:使用 ff 包 ff 包允许你在不加载整个文件的情况下处理超大数据集,将数据存储在磁盘上并按需读取。 library(ff) large_file_ff <- read.table.ffdf(file="large_file.txt") # 你可以像操作普通数据框一样操作 large_file_ff方法五:使用 bigmemory 包 bigmemory 包提供了用于处理大数据的矩阵结构,可将大数据部分加载到内存中,部分存储在磁盘上。 library(bigmemory) big_matrix <- read.big.matrix("large_file.txt", type = "double", header = TRUE) # 使用 big_matrix 进行数据处理 根据你的文件大小和实际需求,可以选择其中一种或多种方法结合使用。 这些方法能够帮助你有效处理超大的文件,而不至于耗尽内存或导致系统崩溃。 如果用了以上的方法还是无法读取,那只有一个解决方法了:换电脑!store output of system command into a variable
a <- system("ls") # this doesn't work: a only receives the exit status, the listing is printed to the console a <- system("ls", intern = TRUE) # this works If intern = TRUE , a character vector giving the output of the command, one line per character string. system(command, intern = FALSE, ignore.stdout = FALSE, ignore.stderr = FALSE, wait = TRUE, input = NULL, show.output.on.console = TRUE, minimized = FALSE, invisible = TRUE, timeout = 0 )
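Related to intern = TRUE: system2() gives a slightly cleaner interface for capturing command output. A minimal sketch (the "ls -l" command is only an illustration and assumes a Unix-like shell):

```r
# capture standard output (and here also standard error) as a character vector,
# one element per line of command output
out <- system2("ls", args = "-l", stdout = TRUE, stderr = TRUE)
head(out)
```

R语言 24个高效操作技巧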
1. 修改默认提示语言
2. 查看R所消耗的内存大小
3. 查看特定数据集的内存大小
4. 代码中的换行操作
5. 边赋值边显示变量
6. 查看函数的源代码
7. 设置CRAN镜像
8. 显示更多数据行
9. 设置显示的小数位数
10. 管道操作
11. 拆分列数据
12. 默认加载包
13. 为R添加额外扩展包加载路径
14. 迁移R包
15. 列出R包中的函数
16. 不加载包使用其中函数
17. 快速获取颜色
18. 炸开数据
19. 巧用example函数学习绘图
20. 统计计算时间
21. 释放内存
22. 删除全部变量并释放内存
23. 恢复默认数据集
24. 快速获取函数选项参数1. 修改默认提示语言
在R中,默认的提示语言根据用户的系统语言设置而定。 若需要统一修改为英文,可通过以下步骤操作: Sys.getlocale() # 显示当前系统语言设置 Sys.setenv(LANG="en") # 设置默认语言为英文2. 查看R所消耗的内存大小
memory.size() 函数用于查看当前R会话消耗的内存大小,但此函数仅在Windows系统中有效。 memory.size() # 输出内存大小,单位为MB3. 查看特定数据集的内存大小
使用object.size()函数可以查看任意数据集的内存占用,单位默认为字节。 若需转换为KB,可以进行简单的除法运算: object.size(mtcars) # 显示mtcars数据集的内存大小,单位为字节 object.size(mtcars) / 1024 # 转换为KB4. 代码中的换行操作
在RStudio中,回车键默认执行代码。 若在编辑时需要换行而不执行,可以使用Shift + Enter。 function(x, y) { # 你的代码 }5. 边赋值边显示变量
在R中,你可以在赋值的同时直接显示变量的值,通过将赋值语句包含在括号中实现: (x <- runif(10)) # 赋值并显示x的值6. 查看函数的源代码
想查看某个R函数的源代码,可以直接输入函数名,不加括号: mean # 显示mean函数的源代码7. 设置CRAN镜像
为避免每次安装R包时弹出选择镜像的对话框,可以预先指定CRAN镜像: chooseCRANmirror(ind = 18) # 直接按编号选择适合你的镜像,这里以编号18为例8. 显示更多数据行
默认情况下,R显示1000行数据。 通过设置max.print可以调整这一限制: options(max.print = 2000) # 设置为显示2000行数据9. 设置显示的小数位数
默认情况下,R显示数字时保留7位小数。 通过调整digits选项可以修改这一设置: options(digits = 2) # 设置默认显示两位小数10. 管道操作
使用管道符号%>%可以让代码更加简洁,避免定义过多的中间变量。 在R中使用管道前需要加载相关的包: library(magrittr) # 加载magrittr包以使用管道 library(ggplot2) # 下面的示例绘图还需要ggplot2 mtcars %>% ggplot(aes(x = cyl, y = mpg, group = cyl)) + geom_boxplot()11. 拆分列数据
在使用数据集时,有时记不住列名或容易拼错。 使用attach()函数可以将数据集中的每一列变成一个独立的变量,方便直接调用: attach(mtcars) cyl # 显示cyl列的数据 mpg # 显示mpg列的数据12. 默认加载包
如果有经常使用的R包,可以通过修改.Rprofile文件设置R启动时自动加载这些包。 例如,自动加载ggplot2包: file.edit("~/.Rprofile") .First <- function() { library(ggplot2) }13. 为R添加额外扩展包加载路径
可以通过修改.libPaths()来添加额外的包安装路径,使R能够在新的目录中查找和安装包: .libPaths(new = "C:/Users/genom/Desktop/nparFiles/") # 添加新路径 .libPaths() # 显示当前所有库路径14. 迁移R包
当需要在不同设备之间迁移已安装的R包时,可以先在源设备上保存已安装包的列表,然后在目标设备上安装本机尚未安装的包: # 在源设备上 oldip <- installed.packages()[,1] save(oldip, file = "installedPackages.Rdata") # 在目标设备上 load("installedPackages.Rdata") for (i in setdiff(oldip, installed.packages()[,1])) { install.packages(i) }15. 列出R包中的函数
要查看某个R包中包含的所有函数,可以使用ls()函数指定包名: ls(package:base) # 列出base包中的所有函数16. 不加载包使用其中函数
在不加载整个R包的情况下使用其中的某个函数,可以使用“包名::函数名”的格式: dplyr::filter() # 使用dplyr包中的filter函数17. 快速获取颜色
在需要快速为图形设置颜色时,可以使用rainbow()函数快速生成多种颜色: rainbow(6) # 生成并显示6种不同的颜色18. 炸开数据
虽然使用attach()函数可以简化数据列的调用,但这可能导致环境变量混乱。 使用%$%特殊管道符可以更安全地实现相同效果: library(magrittr) women %$% plot(weight, height) # 使用“炸开”数据来绘图19. 巧用example函数学习绘图
example()函数运行R帮助文档中的示例代码,是学习函数使用方法的好助手: library(pheatmap) example("pheatmap") # 运行并展示pheatmap函数的示例20. 统计计算时间
使用system.time()函数可以测量一段代码的运行时间: system.time(runif(100000000)) # 测量生成一亿个随机数的时间21. 释放内存
在R中,即使删除了变量,内存也不会立即释放。 可以通过gc()函数手动触发垃圾回收,释放内存: memory.size() # 显示当前内存使用量 rm(list = ls()) # 删除所有变量 gc() # 执行垃圾回收 memory.size() # 再次显示内存使用量22. 删除全部变量并释放内存
ls() # 显示所有变量 rm(list = ls()) # 删除所有变量 gc() # 执行垃圾回收23. 恢复默认数据集
如果不慎删除或覆盖了内置数据集,可以通过data()函数恢复: data("mtcars") # 恢复mtcars数据集 head(mtcars) # 显示数据集的前几行24. 快速获取函数选项参数
使用args()函数可以快速查看任何R函数的参数列表,无需查阅帮助文档: args(heatmap) # 显示heatmap函数的参数列表RStudio 设置方法
初次打开RStudio会显示如下界面。RStudio基础设置
常规设置
code设置
Console设置
Appearance设置
Pane Layout设置
Packages设置各窗口的使用
Environment
History
Files
Plots
Packages
Help代码区 快捷键 可执行如下步骤新建一个R脚本。
界面就会变成下面这个样子
认识了RStudio界面后,再来认识下RStudio的基础设置。
RStudio基础设置
常规设置
其中涉及Restore 的几个选项会在打开新的RStudio窗口的时候自动打开之前使用过的R脚本以及RData数据,会减慢打开的速度。 可以将这几个选项关闭并使用手动保存的方式进行保存。
code设置
在Editing里的相关设置主要是为了让代码变得整洁,阅读方便。 其中Soft-wrap R source files可将过长的代码自动换行显示,这样可以避免代码过长,需要左右拖动窗口才能查看全部代码。 建议勾选。 但Continue comment when inserting new line不建议勾选,因为有时会出现一行注释已书写完毕,换行时是要输入代码而不是继续写注释。在Display里,Show margin可以取消勾选,这个意义不大,因为每个人的屏幕大小和分辨率是不一样的。
在Saving里,UTF-8位置处是当打开脚本文件出现乱码时要修改的地方,但一般不会用到。
Completion是自动补齐,这里的选项都不建议取消勾选,用自动补齐会加快代码的书写速度且不易出错。
在Diagnostics里Check usage of '<-' in function call可以取消勾选,可以用=替代,之所以会有这么个选项,是因为有时用=可能会出现一些未知的错误,但到现在为止我还没有遇到过。 其他的可根据自己的需要进行勾选,有的不勾选也没关系,反正运行代码的时候,如果某些变量未被定义是会报错提示的。
Console设置
其中Limit output line length to是限制控制台保留多少行数据,如果出现在控制台查看数据缺少前几行的时候可以通过调整这里的数值,使数据显示完全。 Discard pending console input on error一定要勾选,前面的代码运行报错,那后面的代码也大概率会报错,继续运行无意义。
Appearance设置
这里的设置比较简单,觉得界面哪里不合适就调哪里。 至于哪个主题好,挨个试。
Pane Layout设置
觉得布局不喜欢,就在这里调,总共有4个区,可以随意更换每个区的位置。
Packages设置
这里会设置CRAN的镜像,默认是global,建议更换成China的,离自己位置近的。到这里为止,再往下的选项卡都不需要调,用到的次数很少,刚开始学习R语言也用不到下面的内容,像Markdown有专业的Typora可以使用,虽然收费,但比在一个专业R语言编辑器RStudio里写Markdown文档要方便的多。 其他的用到的时候更少,可以在用到的时候再学如何设置。
各窗口的使用 Environment
其中可通过以下方式清空环境中的数据。 但在进行这个操作前一定要确认数据是否还有需要继续使用的。
History
History窗口记录你运行过的历史代码,有时代码修改了但又要用修改之前的,可以在这里快速找到之前运行的代码并快速复制重新运行。另外两个用处不多。
Files
可以在这里查看工作目录下的文件,不是很常用,最起码我用着不习惯,我还是喜欢直接去文件夹里查看。
Plots
你画的所有的图都会在这里显示,可以进行放大查看和保存等操作。 非常常用。
Packages
通过Install安装R包,通过每个R包后面的X进行卸载。 每个R包最前面的复选框可以勾选加载R包,取消勾选则取消加载R包,这也是查看R包有没有加载的方式。 查看R包的帮助文档可以直接点击R包的蓝色字部分进入。
Help
这个没啥好说的,当你运行?+R包名或函数名时会自动跳转到这个选项卡。 在这里会展示相应R包或函数的使用方法、参数设置以及会有示例代码帮助你快速了解函数的作用。 每次写代码不使用十次八次都不算正常。这里要学会查找替换这个功能,可能在word里这个功能用的很熟练,但在RStudio里也不要忘了,有时会遇到需要全局修改同一个变量名,查找替换就很实用。 还有就是记得保存,保存通用快捷键Ctrl+S。
代码区
前面或多或少的介绍了几个快捷键,这里再加几个。 tab:显示所有可以补全R包名称或函数名称的选项 Ctrl+Enter:运行选中的代码,未选中的话就会运行当前行的代码 Ctrl+shift+C:在文件夹里是直接复制文件的地址,在RStudio里是快速注释或取消注释。
快捷键 R入门: 向量
我们常说的数据操作其实就是对各种数据结构进行操作 ,你在平常碰到的绝大多数数据清理/整理等问题,说白了就是对数据框、向量、列表等各种结构进行处理,所以这部分内容非常重要。 因为不同的结构有不同的操作方法。 我们要做的就是对这个数据框进行各种操作。 R拥有许多用于存储数据的对象类型,包括标量、向量、矩阵、数组、数据框和列表等。 它们在存储数据的类型、创建方式、结构复杂度,以及对它们进行操作的方法等均有所不同。 下图给出了这些数据结构的一个示意图。R中的数据结构
Note R中有一些术语较为独特,可能会对新用户造成困扰。 在R中,对象 (object)是指可以赋值给变量的任何事物,包括常量、数据结构、函数,甚至图形。 对象都拥有某种模式,描述了此对象是如何存储的,以及某个类(class),像print() 这样的泛型函数表明如何处理此对象。 与其他标准统计软件(如SAS、SPSS和Stata)中的数据集类似,数据框 (dataframe )是R中用于存储数据的一种结构:列表示变量,行表示观测。 在同一个数据框中可以存储不同类型(如数值型、字符型)的变量。 数据框将是你用来存储数据集的主要数据结构。向量
向量,vector ,就是同一类型的多个元素构成的序列,可以是数值型、字符型、逻辑型等。创建向量
在R中,最基本的创建向量的方法是使用函数c() :# 创建一个名字是a的向量 a <- c(1, 2, 5, 3, 6, -2, 4) class(a) # 查看类型 ## [1] "numeric" # 创建一个名字是b的向量 b <- c("one", "two", "three") # 创建一个名字是d的向量,不用c是为了避免和函数 c() 混淆 d <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE) 这里,a 是数值型向量,b 是字符型向量,而d 是逻辑型向量。 向量中的值被称为元素 (element),比如向量a 的第一个元素是1 ,向量b 的第一个元素是"one" 。 注意,单个向量中的数据有相同的类型或模式(数值型、字符型或逻辑型)。 同一向量中无法混杂不同类型的数据。 比如:# 会都变成字符型 a <- c("a",1,TRUE) a ## [1] "a" "1" "TRUE" 除了通过c() 创建向量,还可以使用seq() (sequence的缩写)创建数值型的向量,比如,创建一个从1~20的向量,并且步长设置为2:# 从1到20,中间间隔2 seq(1, 20, 2) ## [1] 1 3 5 7 9 11 13 15 17 19 重复某个值也可以得到一个向量:# rep是replicate的缩写 rep(1:2, times = 3) # 重复1 2 三次 ## [1] 1 2 1 2 1 2 rep(1:2, each = 3) # 重复1三次,重复2三次 ## [1] 1 1 1 2 2 2 或者最简单的方法,使用数字和冒号,生成连续的数字:1:5 ## [1] 1 2 3 4 5 Tip 标量是只含一个元素的向量,例如f <- 3、g <- “US”和h <- TRUE。 它们用于保存常量。探索向量
查看向量长度:length(d) ## [1] 6 查看前6行/后6行:head(seq(1, 20, 2)) ## [1] 1 3 5 7 9 11 tail(seq(1, 20, 2)) ## [1] 9 11 13 15 17 19 查看唯一元素:a <- c(1,2,2,3,4,4,4) # 查看唯一元素 unique(a) ## [1] 1 2 3 4 查看一共有几种不同的元素,以及每个元素的个数,也就是计数:table(a) ## a ## 1 2 3 4 ## 1 2 1 3 根据位置选择向量元素
通过在方括号中指定元素的位置,我们可以访问(或者叫提取、查看)向量中的某个元素。 例如:a[c(2, 4)] 用于提取向量a 中的第二个和第四个元素。 更多示例如下:# 创建一个向量,取名为a a <- c(1, 2, 5, 3, 6, -2, 4) a[3] # 取第3个元素 ## [1] 5 a[c(1,3,5)] # 取第1,3,5个元素 ## [1] 1 5 6 a[c(1:3)] # 取第1到第3个元素 ## [1] 1 2 5 a[c(1, 2, 3)] # 和上面结果相同,也是取第1到第3个元素 ## [1] 1 2 5 如果提取不存在的位置,则会返回NA ,比如我们提取第10个元素:a[10] ## [1] NA NA 表示“Not Available”,NA 是R语言中一种特殊的类型,常用来表示数据缺失。 如何把提取出来的元素保存为另一个变量呢?比如把a 里面的第一个元素保存为变量b ?直接赋值即可:# 提取,赋值,即可 b <- a[1] b ## [1] 1 替换、删除、增加
如果要替换某个元素,直接提取这个元素并赋予要替换的值即可:a <- c(1, 2, 5, 3, 6, -2, 4) # 把向量a的第1个元素换成 m a[1] <- "m" a # 注意,此时全部变成字符型了哦! ## [1] "m" "2" "5" "3" "6" "-2" "4" # 同时替换多个元素,注意长度要相同,并且要使用c()放在一个向量中 a[c(1,3,4)] <- c("d","e","f") a ## [1] "d" "2" "e" "f" "6" "-2" "4" 如果要删除某个元素,直接在位置前加负号即可:a <- c(1, 2, 5, 3, 6, -2, 4) # 删除a的第一个元素,结果中第一个元素 1 就被删掉了 a[-1] ## [1] 2 5 3 6 -2 4 # 但此时你打印a会发现a还是1, 2, 5, 3, 6, -2, 4, a ## [1] 1 2 5 3 6 -2 4 # 如果要获得修改后的a,一定要重新赋值! a <- a[-1] a # 此时a就是修改后的了 ## [1] 2 5 3 6 -2 4 # 同时删除多个元素 a <- c(1, 2, 5, 3, 6, -2, 4) # 直接把要删除的元素位置放在c()中即可 a[c(-1,-2,-3)] ## [1] 3 6 -2 4 # 如果要获得修改后的a,一定要重新赋值! a <- a[c(-1,-2,-3)] a ## [1] 3 6 -2 4 如果要继续增加元素,直接使用c() 即可:# 在向量a中添加3个元素,并赋值给a1 # 注意由于"80", "89", "90"都加了引号,所以修改后的a都变成了字符型 a1 <- c(a, "80", "89", "90") a1 ## [1] "3" "6" "-2" "4" "80" "89" "90" 根据名字选择向量元素
还可以对向量中的每一个元素取一个名字,比如:# 创建一个命名向量 named_a <- c(age = 18, bmi = 22, weight = 65) named_a ## age bmi weight ## 18 22 65 此时,向量named_a 中的3个元素,都有一个独一无二的名字,此时我们还可以通过向量的名字来访问对应的元素:named_a["age"] ## age ## 18 named_a["bmi"] ## bmi ## 22 查看每个元素的名字(如果这是一个命名向量的话):names(named_a) ## [1] "age" "bmi" "weight" 替换元素的名字:# 替换第一个元素的名字,从age变为height names(named_a)[1] <- "height" named_a ## height bmi weight ## 18 22 65 # 同时替换多个元素的名字 names(named_a)[c(1,2)] <- c("height","gg") #names(named_a)[1:2] <- c("height","gg") named_a ## height gg weight ## 18 22 65 # 同时替换所有元素的名字 names(named_a) <- c("aa","bb","cc") named_a ## aa bb cc ## 18 22 65 移除元素的名字:# 移除元素的名字,注意不能只移除某个元素的名字,要一起移除 names(named_a) <- NULL named_a ## [1] 18 22 65 根据表达式选择向量元素
除了通过位置和名字选择元素外,还可以通过表达式(也就是TRUE 或者FALSE ):a <- c(1,2,3,10,11) a[a==10] # 选择等于10的元素 ## [1] 10 a[a<5] # 选择小于5的元素 ## [1] 1 2 3 a[a %in% c(2,3,11)] # 选择在(2,3,11)里面的元素,很常用 ## [1] 2 3 11 向量排序
如果要对向量排序:# 创建一个向量a a <- c(4,1,2,3) a ## [1] 4 1 2 3 # 排序,默认按照从小到大 sort(a) ## [1] 1 2 3 4 # 按照从大到小的顺序排列 sort(a, decreasing = T) ## [1] 4 3 2 1 # 反转顺序 rev(a) ## [1] 3 2 1 4 order 函数返回的是向量元素的一个排列索引,它不是直接对数据进行排序,而是告诉你如何对数据进行排序。a <- c(4,1,2,3) a ## [1] 4 1 2 3 order(a) ## [1] 2 3 4 1 order(a) 的结果中,第一个数字是2,意思是:原向量a 中的第2个元素(也就是1)应该放在第1位,第2个数字是3,意思是:原向量中的第3个元素(也就是2)应该放在第2位… 所以order 返回的是原始向量排序后的位置,我们就可以使用这些位置对向量进行排序:# 默认从小到大 a[order(a)] # 等价于sort(a) ## [1] 1 2 3 4 也可以从大到小:a[order(a, decreasing = T)] ## [1] 4 3 2 1 去重复
a <- c(1,2,2,3,4,4,4) # 查看是否有重复 duplicated(a) ## [1] FALSE FALSE TRUE FALSE FALSE TRUE TRUE ! 表示“非”,也就是反向选择:!duplicated(a) ## [1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE 通过反向选择的方式去重,非常重要的方法:# 通过反选的方式去重,很重要,!表示反选 a[!duplicated(a)] ## [1] 1 2 3 4 两个向量的操作
取两个向量的交集、并集、差集。 假设有两个向量如下:a <- c(1,2,3,4) b <- c(1,2,3,5,6) 取两个向量中共有的元素(交集):intersect(a,b) ## [1] 1 2 3 取并集:union(a,b) ## [1] 1 2 3 4 5 6 取向量a 有但是b 没有的元素(差集):setdiff(a,b) ## [1] 4 取向量b 有但是a 没有的元素(差集):setdiff(b,a) ## [1] 5 6 R入门:apply系列函数(apply、lapply、sapply、tapply)
https://space.bilibili.com/42460432/channel/collectiondetail?sid=3740949循环
for 循环是一个元素一个元素的操作,在R语言中这种做法是比较低效的,更好的做法是向量化操作,也就是同时对一整行/列进行操作,不用逐元素操作,这样可以大大加快运行速度。 apply函数家族就是这样的一组函数,专门实现向量化操作,可替代for循环。 先举个简单的例子,说明下什么是向量化。 假如你有如下一个向量a ,你想让其中的每个元素都加1,你不用把每个元素单独拎出来加1:a <- c(1,2,3,NA) a + 1 # 直接加1即可,是不是很方便? ## [1] 2 3 4 NA 再举个例子,下面这个数据框中有一些NA 除此之外还有一些空白,或者空格。 如何批量替换这些值?tmp <- data.frame(a = c(1,1,3,4), b = c("one","two","three","four"), d = c(""," ",NA,90), e = c(" ",NA, "",20) ) tmp ## a b d e ## 1 1 one ## 2 1 two <NA> ## 3 3 three <NA> ## 4 4 four 90 20 比如,让NA 都变成999。 常规的做法是:检查每一个值,确认它是不是NA ,如果是,就改成999,如果不是,就不改。 向量化的做法是:tmp[is.na(tmp)] <- 999 # tmp[tmp == NA] <- 999 # 错误的做法 tmp ## a b d e ## 1 1 one ## 2 1 two 999 ## 3 3 three 999 ## 4 4 four 90 20 再比如,让空白的地方变成NA :tmp[tmp == ""] <- NA tmp ## a b d e ## 1 1 one <NA> ## 2 1 two 999 ## 3 3 three 999 <NA> ## 4 4 four 90 20 为什么还有一些空白?因为有的空白是真空白,有的则是空格!tmp[tmp == " "] <- NA tmp ## a b d e ## 1 1 one <NA> <NA> ## 2 1 two <NA> 999 ## 3 3 three 999 <NA> ## 4 4 four 90 20 以上示例旨在告诉大家,有很多时候并不需要逐元素循环,向量化是更好的方式。apply
对数据框(或矩阵)按行或者按列执行某个操作。 下面使用一个例子演示。 示例数据是从TCGA官网下载的COAD的mrna的表达矩阵,一共有1000行,100列,每一行表示一个基因,每一列表示一个样本。load(file = "datasets/coad_mran_df.rdata") dim(coad_mrna_df) ## [1] 1000 100 class(coad_mrna_df) ## [1] "data.frame" coad_mrna_df[1:4,1:3] ## TCGA-5M-AAT6-01A-11R-A41B-07 TCGA-AA-3552-01A-01R-0821-07 ## MT-CO2 28026.23 32915.04 ## MT-CO3 29725.85 30837.60 ## MT-ND4 19509.82 22026.42 ## MT-CO1 23193.16 20924.84 ## TCGA-AA-3867-01A-01R-1022-07 ## MT-CO2 21030.00 ## MT-CO3 21997.99 ## MT-ND4 17171.58 ## MT-CO1 15485.43 如果要对表达矩阵进行log2 转换,无需单独对每个元素进行log2 ,直接对整个数据框进行log2 即可:coad_mrna_df <- log2(coad_mrna_df + 1) 如果要计算每一个基因在所有样本中的平均表达量,也就是计算每一行的平均值,使用apply 就非常简单:# apply主要是3个参数 # 第1个是你的数据框 # 第2个是选择行或者列,1表示行,2表示列 # 第3个是要执行的操作,可以是R自带函数,也可以是自编函数 # 自带函数不用加括号,直接写名字即可 tmp <- apply(coad_mrna_df, 1, mean) head(tmp) ## MT-CO2 MT-CO3 MT-ND4 MT-CO1 MT-ATP6 MT-ND3 ## 14.59276 14.43845 14.01330 14.04316 13.57397 13.40406 如果使用for 循环,就会显得很麻烦,运行时间也会长一点:tmp <- vector("numeric", nrow(coad_mrna_df)) for(i in 1:nrow(coad_mrna_df)){ tmp[i] <- mean(as.numeric(coad_mrna_df[i,])) } head(tmp) ## [1] 14.59276 14.43845 14.01330 14.04316 13.57397 13.40406 除了3个主要的参数,apply 还有一个... 参数,它表示:如果你要执行的操作中还有其他参数,可以直接往后写。 比如mean() 这个函数有一个na.rm 参数,表示要不要在计算时去除缺失值,你可以直接把这个参数写在后面:tmp <- apply(coad_mrna_df, 1, mean, na.rm = TRUE) # na.rm是mean的参数 head(tmp) ## MT-CO2 MT-CO3 MT-ND4 MT-CO1 MT-ATP6 MT-ND3 ## 14.59276 14.43845 14.01330 14.04316 13.57397 13.40406 如果要计算每一列的平均值,第2个参数就写2即可:# 1是行,2是列 tmp <- apply(coad_mrna_df, 2, mean, na.rm = TRUE) head(tmp) ## TCGA-5M-AAT6-01A-11R-A41B-07 TCGA-AA-3552-01A-01R-0821-07 ## 7.754459 7.921157 ## TCGA-AA-3867-01A-01R-1022-07 TCGA-AD-6895-01A-11R-1928-07 ## 8.131564 8.198273 ## TCGA-AA-3560-01A-01R-0821-07 TCGA-CM-6676-01A-11R-1839-07 ## 7.917137 8.056527 上面的示例只是为了演示apply 的用法,实际上在计算某一行/列的均值/加和时,R自带了几个函数,比如计算每一行的均值:tmp <- rowMeans(coad_mrna_df) head(tmp) ## MT-CO2 MT-CO3 MT-ND4 MT-CO1 MT-ATP6 MT-ND3 ## 14.59276 14.43845 14.01330 14.04316 13.57397 13.40406 其他几个类似函数:rowMeans(), rowSums(), colMeans(), colSums() 下面比较一下3种方法的运行时间:system.time({ # 最慢 tmp <- vector("numeric", nrow(coad_mrna_df)) for(i in 1:nrow(coad_mrna_df)){ tmp[i] <- mean(as.numeric(coad_mrna_df[i,])) } }) ## user system elapsed ## 0.39 0.00 0.40 system.time(tmp <- apply(coad_mrna_df, 1, mean)) ## user system elapsed ## 0.01 0.00 0.00 system.time(tmp <- rowMeans(coad_mrna_df)) # 最快 ## user system elapsed ## 0 0 0 要执行的操作除了可以是R自带的函数外,还可以是自编函数。 比如:筛选在所有样本中的表达量的加和大于800的基因:# 对每一行执行1个操作 # 计算每一行的加和,并和800进行比较 tmp <- apply(coad_mrna_df, 1, function(x){sum(x)>800}) head(tmp) ## MT-CO2 MT-CO3 MT-ND4 MT-CO1 MT-ATP6 MT-ND3 ## TRUE TRUE TRUE TRUE TRUE TRUE table(tmp) ## tmp ## FALSE TRUE ## 650 350 #coad_mrna_df[tmp,] 当然上面只是为了演示如何在apply 中使用自编函数,实际使用时还是用rowSums 更快更简单:tmp <- rowSums(coad_mrna_df) > 800 head(tmp) ## MT-CO2 MT-CO3 MT-ND4 MT-CO1 MT-ATP6 MT-ND3 ## TRUE TRUE TRUE TRUE TRUE TRUE table(tmp) ## tmp ## FALSE TRUE ## 650 350 再举个例子,选择方差大于1的行(方差小说明这个基因在所有样本中表达量都很接近,这种基因没有意义)tmp <- coad_mrna_df[apply(coad_mrna_df,1,function(x){var(x)>1}),] dim(tmp) ## [1] 178 100 lapply
对list 的每一个对象执行某个操作,或者对data.frame 的每一列执行某个操作,输出结果是list 。lapply 的首字母就是list 的首字母。 使用方法:lapply(X, FUN, ...) # x是你的数据框或者列表 # FUN是你要执行的操作 # ...和apply中的...一样 比如,选择方差大于1的列:# ?lapply # 和apply非常像,但是不用选择行或列,默认就是列 tmp <- lapply(coad_mrna_df, function(x){var(x)>1}) class(tmp) ## [1] "list" length(tmp) ## [1] 100 # coad_mrna_df[tmp,] 计算每一列的中位数:tmp <- lapply(coad_mrna_df, median) class(tmp) ## [1] "list" length(tmp) ## [1] 100 展开列表:class(unlist(tmp)) ## [1] "numeric" 查看列表中每个对象的长度:# 创建一个列表 g <- "My First List" # 字符串 h <- c(25, 26, 18, 39) # 数值型向量 j <- matrix(1:10, nrow=5) # 矩阵 k <- c("one", "two", "three") # 字符型向量 l <- list("apple",1,TRUE) # 列表 mylist <- list(title=g, ages=h, j, k, l) 查看每个对象的长度:lapply(mylist, length) ## $title ## [1] 1 ## ## $ages ## [1] 4 ## ## [[3]] ## [1] 10 ## ## [[4]] ## [1] 3 ## ## [[5]] ## [1] 3 unlist(lapply(mylist, length)) ## title ages ## 1 4 10 3 3 多个数据框的批量保存,lapply版本:df1 <- data.frame( patientID = c("甲","乙","丙","丁"), age = c(23,43,45,34), gender = c("男","女","女","男") ) df2 <- data.frame( patientID = c("甲","乙","戊","几","庚","丁"), hb = c(110,124,138,142,108,120), wbc = c(3.7,4.6,6.4,4.2,5.6,5.2) ) df3 <- data.frame( patientID = c("丙","乙","几","庚","丁"), rbc = c(4.5,4.3,4.5,3.4,4.2), plt = c(180,250,360,120,220)) df4 <- data.frame( patientID = c("丙","乙","几","庚","丁","甲","戊"), a = rnorm(7, 20), b = rnorm(7,10) ) df5 <- data.frame( patientID = c("丙","乙","甲","戊"), d = rnorm(4, 2), e = rnorm(4,1) ) df6 <- data.frame( patientID = c("乙","几","庚","丁"), f = rnorm(4, 2), g = rnorm(4,1) ) 使用lapply 的方式和for 循环非常像。 先把这些数据框放到一个列表中:dataframes <- list(df1,df2,df3,df4,df5,df6) 然后批量保存,和前面的for 循环比较一下,是不是基本一样?lapply(1:length(dataframes), function(x){ write.csv(dataframes[[x]], file = paste0("datasets/csvs/","df",x,".csv"), quote = F,row.names = F) }) ## [[1]] ## NULL ## ## [[2]] ## NULL ## ## [[3]] ## NULL ## ## [[4]] ## NULL ## ## [[5]] ## NULL ## ## [[6]] ## NULL 如果列表中的对象有名字,也可以像下面这样实现,还是和for 循环基本一样:dataframes <- list(df1,df2,df3,df4,df5,df6) # 放到1个列表中 names(dataframes) <- c("df1","df2","df3","df4","df5","df6") # 添加名字 names(dataframes) # 查看名字 ## [1] "df1" "df2" "df3" "df4" "df5" "df6" lapply(names(dataframes), function(x){ write.csv(dataframes[[x]], file = paste0("datasets/csvs/",x,".csv"), quote = F,row.names = F) }) ## [[1]] ## NULL ## ## [[2]] ## NULL ## ## [[3]] ## NULL ## ## [[4]] ## NULL ## ## [[5]] ## NULL ## ## [[6]] ## NULL 多个数据框的批量读取:allfiles <- list.files("datasets/csvs",full.names = T) allfiles ## [1] "datasets/csvs/df1.csv" "datasets/csvs/df2.csv" "datasets/csvs/df3.csv" ## [4] "datasets/csvs/df4.csv" "datasets/csvs/df5.csv" "datasets/csvs/df6.csv" # 1行代码解决,可以和前面的for循环对比下 dfs <- lapply(allfiles, read.csv) dfs[[1]] ## patientID age gender ## 1 甲 23 男 ## 2 乙 43 女 ## 3 丙 45 女 ## 4 丁 34 男 如果你没有使用全名,需要自己构建文件路径+文件名,借助paste0 即可:allfiles <- list.files("datasets/csvs") allfiles ## [1] "df1.csv" "df2.csv" "df3.csv" "df4.csv" "df5.csv" "df6.csv" # 自己写个函数即可 dfs <- lapply(allfiles, function(x){read.csv(paste0("datasets/csvs/",x))}) dfs[[1]] ## patientID age gender ## 1 甲 23 男 ## 2 乙 43 女 ## 3 丙 45 女 ## 4 丁 34 男 此时的x 就代指df1.csv 、df2.csv 这些名字。sapply
lapply 的简化版本,输出结果不是list 。 如果simplify=FALSE 和USE.NAMES=FALSE ,那么sapply 函数就等于lapply 函数了。 不如lapply 使用广泛。 选择方差大于1的列:tmp <- sapply(coad_mrna_df, function(x){var(x)>1}) # coad_mrna_df[tmp,] 计算每一列的中位数:tmp <- sapply(coad_mrna_df, median) class(tmp) ## [1] "numeric" length(tmp) ## [1] 100 head(tmp) ## TCGA-5M-AAT6-01A-11R-A41B-07 TCGA-AA-3552-01A-01R-0821-07 ## 7.632902 7.631332 ## TCGA-AA-3867-01A-01R-1022-07 TCGA-AD-6895-01A-11R-1928-07 ## 7.882883 8.042666 ## TCGA-AA-3560-01A-01R-0821-07 TCGA-CM-6676-01A-11R-1839-07 ## 7.730625 7.873826 tapply
分组操作。 根据某一个条件进行分组,然后对每一个组进行某种操作,最后进行汇总。 这种数据处理思想是非常出名的:split-apply-combine 。brca_clin <- read.csv("datasets/brca_clin.csv",header = T) dim(brca_clin) ## [1] 20 9 brca_clin[,4:5] ## sample_type initial_weight ## 1 Solid Tissue Normal 260 ## 2 Solid Tissue Normal 220 ## 3 Solid Tissue Normal 130 ## 4 Solid Tissue Normal 260 ## 5 Solid Tissue Normal 200 ## 6 Solid Tissue Normal 60 ## 7 Solid Tissue Normal 320 ## 8 Solid Tissue Normal 310 ## 9 Solid Tissue Normal 100 ## 10 Solid Tissue Normal 250 ## 11 Primary Tumor 130 ## 12 Primary Tumor 110 ## 13 Primary Tumor 470 ## 14 Primary Tumor 90 ## 15 Primary Tumor 200 ## 16 Primary Tumor 70 ## 17 Primary Tumor 130 ## 18 Primary Tumor 770 ## 19 Primary Tumor 200 ## 20 Primary Tumor 250 分别计算normal 组和tumor 组的weight的平均值:# 主要是3个参数 tapply(X = brca_clin$initial_weight, INDEX = brca_clin$sample_type, #组别是分类变量,不能数值型 FUN = mean) ## Primary Tumor Solid Tissue Normal ## 242 211 分别计算normal 组和tumor 组的age的中位数:tapply(brca_clin$age_at_index, brca_clin$sample_type, median) ## Primary Tumor Solid Tissue Normal ## 55.0 59.5 还有几个类似的函数,比如:aggregate 和by 。# 和tapply基本一样,但是第2个参数必须是list # 并支持根据多个变量进行分组 aggregate(brca_clin$age_at_index, list(brca_clin$sample_type), median) ## Group.1 x ## 1 Primary Tumor 55.0 ## 2 Solid Tissue Normal 59.5 aggregate(brca_clin$age_at_index, list(brca_clin$sample_type ,brca_clin$ajcc_pathologic_stage), median) ## Group.1 Group.2 x ## 1 Primary Tumor Stage I 56.0 ## 2 Solid Tissue Normal Stage I 68.5 ## 3 Primary Tumor Stage IA 49.0 ## 4 Solid Tissue Normal Stage IA 63.0 ## 5 Primary Tumor Stage IIA 67.5 ## 6 Solid Tissue Normal Stage IIA 78.0 ## 7 Primary Tumor Stage IIB 63.0 ## 8 Solid Tissue Normal Stage IIB 54.0 ## 9 Primary Tumor Stage IIIA 47.0 ## 10 Solid Tissue Normal Stage IIIA 39.0 ## 11 Primary Tumor Stage IIIC 36.0 by 也是一样的用法:组别需要是因子型或者列表:by(brca_clin$age_at_index, list(brca_clin$sample_type), median) ## : Primary Tumor ## [1] 55 ## : Solid Tissue Normal ## [1] 59.5 by(brca_clin$age_at_index, list(brca_clin$sample_type,brca_clin$ajcc_pathologic_stage), median) ## : Primary Tumor ## : Stage I ## [1] 56 ## : Solid Tissue Normal ## : Stage I ## [1] 68.5 ## : Primary Tumor ## : Stage IA ## [1] 49 ## : Solid Tissue Normal ## : Stage IA ## [1] 63 ## : Primary Tumor ## : Stage IIA ## [1] 67.5 ## : Solid Tissue Normal ## : Stage IIA ## [1] 78 ## : Primary Tumor ## : Stage IIB ## [1] 63 ## : Solid Tissue Normal ## : Stage IIB ## [1] 54 ## : Primary Tumor ## : Stage IIIA ## [1] 47 ## : Solid Tissue Normal ## : Stage IIIA ## [1] 39 ## : Primary Tumor ## : Stage IIIC ## [1] 36 ## : Solid Tissue Normal ## : Stage IIIC ## [1] NA 组别是因子型也可以(实测字符型也可以),比如:# 可以看到sample_type是字符型 str(brca_clin) ## 'data.frame': 20 obs. of 9 variables: ## $ bargr> ## $ patient : chr "TCGA-BH-A1FC" "TCGA-AC-A2FM" "TCGA-BH-A0DO" "TCGA-E2-A1BC" ... ## $ sample : chr "TCGA-BH-A1FC-11A" "TCGA-AC-A2FM-11B" "TCGA-BH-A0DO-11A" "TCGA-E2-A1BC-11A" ... ## $ sample_type : chr "Solid Tissue Normal" "Solid Tissue Normal" "Solid Tissue Normal" "Solid Tissue Normal" ... ## $ initial_weight : int 260 220 130 260 200 60 320 310 100 250 ... ## $ ajcc_pathologic_stage : chr "Stage IIA" "Stage IIB" "Stage I" "Stage IA" ... ## $ days_to_last_follow_up: int NA NA 1644 501 660 3247 NA NA 1876 707 ... ## $ gender : chr "female" "female" "female" "female" ... ## $ age_at_index : int 78 87 78 63 41 59 60 39 54 51 ... 
class(brca_clin$sample_type) ## [1] "character" by(brca_clin$age_at_index, brca_clin$sample_type, # 字符型也可以 median) ## brca_clin$sample_type: Primary Tumor ## [1] 55 ## brca_clin$sample_type: Solid Tissue Normal ## [1] 59.5 先把sample_type 变成因子型也可以:brca_clin$sample_type <- factor(brca_clin$sample_type) class(brca_clin$sample_type) # 变成因子型了 ## [1] "factor" # 也OK by(brca_clin$age_at_index, brca_clin$sample_type, # 字符型也可以 median) ## brca_clin$sample_type: Primary Tumor ## [1] 55 ## brca_clin$sample_type: Solid Tissue Normal ## [1] 59.5 其他apply函数
还有vapply、mapply、rapply、eapply,用的很少,不再介绍。 vapply类似于sapply,提供了FUN.VALUE参数,用来控制返回值的行名,这样可以让程序更清晰易懂。Reduce和do.call
Reduce
对多个对象进行累积操作。 比如,累加:Reduce("+", 1:100) ## [1] 5050 再比如,多个数据框的merge,merge 函数只能对两个数据框进行合并,但是如果有多个数据框需要合并怎么办?有100个怎么办? 批量读取多个数据框:# 6个数据框 allfiles <- list.files("datasets/csvs",full.names = T) allfiles ## [1] "datasets/csvs/df1.csv" "datasets/csvs/df2.csv" "datasets/csvs/df3.csv" ## [4] "datasets/csvs/df4.csv" "datasets/csvs/df5.csv" "datasets/csvs/df6.csv" # 1行代码解决 dfs <- lapply(allfiles, read.csv) # 查看其中1个 dfs[[2]] ## patientID hb wbc ## 1 甲 110 3.7 ## 2 乙 124 4.6 ## 3 戊 138 6.4 ## 4 几 142 4.2 ## 5 庚 108 5.6 ## 6 丁 120 5.2 6个数据框的merge:Reduce(merge, dfs) ## patientID age gender hb wbc rbc plt a b d e ## 1 乙 43 女 124 4.6 4.3 250 19.664 10.51165 2.474508 1.372298 ## f g ## 1 2.862749 -0.384265 如果想要使用merge 里面的参数怎么办?自己写函数即可:# 这个函数只能有两个参数 Reduce(function(x,y){merge(x,y, by = "patientID")}, dfs) ## patientID age gender hb wbc rbc plt a b d e ## 1 乙 43 女 124 4.6 4.3 250 19.664 10.51165 2.474508 1.372298 ## f g ## 1 2.862749 -0.384265 do.call
使用场景:你有很多个数据框,而且每个数据框的内容都一样,你想把这些数据框拼接到一起。df1 <- data.frame( patientID = 1:4, aa = rnorm(4,10), bb = rnorm(4,16) ) df2 <- data.frame( patientID = 5:8, aa = rnorm(4,10), bb = rnorm(4,16) ) df3 <- data.frame( patientID = 9:12, aa = rnorm(4,10), bb = rnorm(4,16) ) df4 <- data.frame( patientID = 13:16, aa = rnorm(4,10), bb = rnorm(4,16) ) 不断地重复写rbind ?没有必要。ll <- list(df1,df2,df3,df4) do.call(rbind, ll) ## patientID aa bb ## 1 1 9.574481 15.24356 ## 2 2 9.933919 15.83192 ## 3 3 10.675271 15.60532 ## 4 4 11.130001 16.94735 ## 5 5 10.068181 15.07117 ## 6 6 9.832190 18.76410 ## 7 7 8.944788 15.92174 ## 8 8 10.282279 17.53555 ## 9 9 9.580775 16.12769 ## 10 10 9.956511 16.31920 ## 11 11 11.776207 15.37159 ## 12 12 11.313994 14.55692 ## 13 13 10.306852 16.04596 ## 14 14 9.194999 13.03253 ## 15 15 8.295845 17.77535 ## 16 16 9.482168 16.35076 其实这种场景下使用Reduce 也可以,但是数据量比较大的话还是do.call 更快。Reduce(rbind, ll) ## patientID aa bb ## 1 1 9.574481 15.24356 ## 2 2 9.933919 15.83192 ## 3 3 10.675271 15.60532 ## 4 4 11.130001 16.94735 ## 5 5 10.068181 15.07117 ## 6 6 9.832190 18.76410 ## 7 7 8.944788 15.92174 ## 8 8 10.282279 17.53555 ## 9 9 9.580775 16.12769 ## 10 10 9.956511 16.31920 ## 11 11 11.776207 15.37159 ## 12 12 11.313994 14.55692 ## 13 13 10.306852 16.04596 ## 14 14 9.194999 13.03253 ## 15 15 8.295845 17.77535 ## 16 16 9.482168 16.35076 tidyverse
Load the packages
Learn the "pipe"
What is tidy data?
The core tidy data principles
A hypothetical clinical trial to explain variables
What's an observation?
What is the data table?
How the "tibble" is better than a table
Using the tidyr package
Using gather
Using key-value pairs
Using spread
https://www.storybench.org/getting-started-with-tidyverse-in-r/Load the packages
First, install tidyverse and then load tidyverse and magrittr. suppressWarnings(suppressMessages(install.packages("tidyverse"))) suppressWarnings(suppressMessages(library(tidyverse))) suppressWarnings(suppressMessages(library(magrittr)))Learn the "pipe"
We'll be using the "pipe" throughout this tutorial. The pipe makes your code read more like a sentence, branching from left to right. So something like this: f(x) becomes this: x %>% f and something like this: h(g(f(x))) becomes this: x %>% f %>% g %>% h The "pipe" comes from the magrittr package.
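As a quick, runnable illustration of the h(g(f(x))) pattern above, here is the same nested call written both ways on the built-in mtcars data (the particular functions are only an example):

```r
library(magrittr)

# nested form: with(subset(mtcars, cyl == 6), mean(mpg))
# the same computation written as a left-to-right pipe:
mtcars %>%
  subset(cyl == 6) %>%   # keep only the 6-cylinder cars
  with(mean(mpg))        # mean fuel economy of that subset
```

What is tidy data?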
"Tidy data" is a term that describes a standardized approach to structuring datasets to make analyses and visualizations easier. If you've worked with SQL and relational databases, you'll recognize most of these concepts.The core tidy data principles
There are three principles for tidy data: Variables make up the columns Observations make up the rows Values go into cells The third principle is almost a given if you've handled the first two, so we will focus on those.
A variable is any measurement that can take multiple values. Depending on the field a dataset comes from, variables can be referred to as an independent or dependent variables, features, predictors, outcomes, targets, responses, or attributes. Variables can generally fit into three categories: fixed variables (characteristics that were known before the data were collected), measured variables (variables containing information captured during a study or investigation), and derived variables (variables that are created during the analysis process from existing variables). Here's an example: Suppose clinicians were testing a new anti-hypertensive drug. They recruit 30 patients, all of whom are being treated for hypertension, and divide them randomly into three groups. The clinician gives one third of the patients the drug for eight weeks, another third gets a placebo, and the final third gets care as usual. At the beginning of the study, the clinicians also collect information about the patients. These measurements included the patient's sex, age, weight, height, and baseline blood pressure (pre BP). For patients in this hypothetical study, suppose the group they were randomized to (i.e the drug, control, or placebo group), would be considered a fixed variable. The measured pre BP (and post BP) would be considered the measured variables. Suppose that after the trial was over–and all of the data were collected–the clinicians wanted a way of identifying the number of patients in the trial with a reduced blood pressure (yes or no)? One way is to create a new categorical variable that would identify the patients with post BP less than 140 mm Hg (1 = yes, 0 = no). This new categorical variable would be considered a derived variable. The data for the fictional study I've described also contains an underlying dimension of time. As the description implies, each patient's blood pressure was measured before and after they took the drug (or placebo). So these data could conceivably have variables for date of enrollment (the date a patient entered the study), date of pre blood pressure measurement (baseline measurements), date of drug delivery (patient takes the drug), date of post blood pressure measurement (blood pressure measurement taken at the end of the study).What's an observation?
Observations are the unit of analysis or whatever the "thing" is that's being described by the variables. Sticking with our hypothetical blood pressure trial, the patients would be the unit of analysis. In a tidy dataset, we would expect each row to represent a single patient. Observations are a bit like nouns, in a sense that pinning down an exact definition can be difficult, and it often relies heavily on how the data were collected and what kind of questions you're trying to answer. Other terms for observations include records, cases, examples, instance, or samples.What is the data table?
Tables are made up of values. And as you have probably already guessed, a value is the thing in a spreadsheet that isn't a row or a column. I find it helpful to think of values as physical locations in a table – they are what lie at the intersection of a variable and an observation. For example, imagine a single number, 75, sitting in a table.
|       | Column 1 | Column 2 |
|-------|----------|----------|
| Row 1 |          |          |
| Row 2 |          | 75       |

|           | Col 1 | Pre_Dia_BP |
|-----------|-------|------------|
| Row 1     |       |            |
| patient_3 |       | 75         |

|           | meas_type | Dia_BP |
|-----------|-----------|--------|
| Row 1     |           |        |
| patient_3 | pre       | 75     |
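The outline above lists "Using gather", "Using key-value pairs" and "Using spread"; a minimal sketch of those steps on a made-up version of the blood-pressure table (the data frame, column names and patient IDs are hypothetical):

```r
library(tidyr)

# a tiny, made-up wide table of diastolic blood pressure measurements
bp <- data.frame(patient     = c("patient_1", "patient_2", "patient_3"),
                 pre_dia_bp  = c(88, 92, 75),
                 post_dia_bp = c(82, 85, 70))

# gather(): wide -> long; meas_type becomes the key, dia_bp the value
bp_long <- gather(bp, key = "meas_type", value = "dia_bp",
                  pre_dia_bp, post_dia_bp)
bp_long

# spread(): long -> wide again (the inverse of gather)
spread(bp_long, key = meas_type, value = dia_bp)
```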
install.packages("corrr")
library('corrr')
install.packages("ggcorrplot")
library(ggcorrplot)
install.packages("FactoMineR")
library("FactoMineR")
The data is loaded with the read.csv() function, and str() is then used to inspect its structure, which gives the output below.
protein_data <- read.csv("protein.csv")
str(protein_data)
We can see that the data set has 25 observations and 11 columns, and each variable is numerical except the first one.
colSums(is.na(protein_data))
The colSums() function combined with is.na() returns the number of missing values in each column. As we can see below, none of the columns have missing values.
numerical_data <- protein_data[,2:10]
head(numerical_data)
data_normalized <- scale(numerical_data)
head(data_normalized)
princomp() computes the PCA, and the summary() function shows the result.
data.pca <- princomp(data_normalized)
summary(data.pca)
data.pca$loadings[, 1:2]
The scree plot of the explained variance per component can be drawn with the fviz_eig() function (from the factoextra package).
fviz_eig(data.pca, addlabels = TRUE)
# Graph of the variables
fviz_pca_var(data.pca, col.var = "black")
The squared cosine (cos2) measures how well each variable is represented by a given component and can be visualized with the fviz_cos2 function. A low value means that the variable is not perfectly represented by that component. A high value, on the other hand, means a good representation of the variable on that component.
fviz_cos2(data.pca, choice = "var", axes = 1:2)
The code above computed the square cosine value for each variable with respect to the first two principal components.
From the illustration below, cereals, pulse nut oilseeds, eggs, and milk are the top four variables with the highest cos2, hence contributing the most to PC1 and PC2.
The biplot can be combined with the cos2 information by passing it to the fviz_pca_var function as follows:
fviz_pca_var(data.pca, col.var = "cos2",
gradient.cols = c("black", "orange", "green"),
repel = TRUE)
From the biplot below:
High cos2 attributes are colored in green: Cereals, pulses, oilseeds, eggs, and milk. Mid cos2 attributes have an orange color: white meat, starchy food, fish, and red meat.
Finally, low cos2 attributes have a black color: fruits and vegetables.
The primary objective of this study is to provide a method for obtaining uniform seed varieties from crop production, which is in the form of population, so the seeds are not certified as a sole variety. Thus, a computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. Each image contains multiple beans. The process of determining which pixels correspond to a particular bean is called segmentation.
描述性统计是数据分析的第一步,用来总结和概括数据的特征。 在生物信息学、医学统计等领域,它常用于对实验数据进行初步分析,比如基因表达数据的分布、样本之间的差异等。一、描述性统计的核心内容
描述性统计主要包括:集中趋势 (如均值、中位数)离散程度 (如方差、标准差)分布形状 (如偏度、峰度)频数分布 (如直方图、频率表) 接下来,我们用R实现这些统计内容。二、准备工作:安装和加载必要的包
为了方便操作,建议安装tidyverse和psych包。 # 安装必要包 install.packages("tidyverse") install.packages("psych") # 加载包 library(tidyverse) library(psych) 此外,我们用内置的mtcars数据集作为示例数据。 它包含32辆汽车的11个变量信息。三、集中趋势的计算
集中趋势是描述数据“集中”位置的指标,包括均值、中位数和众数。 # 查看数据结构 head(mtcars) # 计算某变量的均值和中位数 mean_mpg <- mean(mtcars$mpg) # 均值 median_mpg <- median(mtcars$mpg) # 中位数 # 众数计算 table_mpg <- table(mtcars$mpg) mode_mpg <- names(table_mpg[table_mpg == max(table_mpg)]) cat("均值:", mean_mpg, "\n中位数:", median_mpg, "\n众数:", mode_mpg, "\n")
四、离散程度的测量
离散程度指标包括方差、标准差、范围等。 # 方差和标准差 var_mpg <- var(mtcars$mpg) # 方差 sd_mpg <- sd(mtcars$mpg) # 标准差 # 极差(范围) range_mpg <- range(mtcars$mpg) range_diff <- diff(range_mpg) cat("方差:", var_mpg, "\n标准差:", sd_mpg, "\n范围:", range_mpg, "\n极差:", range_diff, "\n")
五、分布形状的描述:偏度和峰度
偏度和峰度用来描述数据分布的形状。 # 偏度和峰度计算 skewness_mpg <- psych::skew(mtcars$mpg) # 偏度 kurtosis_mpg <- psych::kurtosi(mtcars$mpg) # 峰度 cat("偏度:", skewness_mpg, "\n峰度:", kurtosis_mpg, "\n")六、频数分布与可视化
频数分布直观展示数据分布情况,配合可视化工具更加直观。 # 频率表 mpg_table <- table(cut(mtcars$mpg, breaks = 5)) mpg_table # 绘制直方图和密度曲线(直方图纵轴取密度,便于与密度曲线叠加) ggplot(mtcars, aes(x = mpg)) + geom_histogram(aes(y = after_stat(density)), binwidth = 2, fill = "skyblue", color = "black", alpha = 0.7) + geom_density(color = "red", size = 1) + labs(title = "MPG的直方图与密度曲线", x = "MPG", y = "密度")
七、总结多变量:描述性统计表
有时我们需要对多个变量同时总结,可以使用summary函数或更强大的工具如describe。 # 使用summary函数 summary(mtcars)# 使用psych包的describe函数 describe(mtcars)
R数据清洗
1.为什么数据清洗如此重要?
2.数据清洗的基本步骤
3.导入数据
4.处理缺失值
4.1 查找缺失值
4.2 删除缺失值
4.3 填充缺失值
5.处理重复值
6.处理数据类型不一致
6.1 转换日期格式
6.2 转换为分类变量
7.处理异常值
7.1 查找异常值
7.2 可视化异常值
7.3 删除或替换异常值
8.数据清洗后的可视化对比
8.1 示例代码:数据清洗前后的可视化
进行数据清洗的代码
1.
现实世界中的数据往往是不完美的。 你可能会遇到以下常见问题:缺失值:部分数据缺失,导致模型无法完整地利用所有信息。 重复值:数据集中包含多个相同的条目,影响分析结果的精确度。 不一致的格式:日期、时间、数值和分类数据可能采用不同的格式,导致分析时出现问题。 异常值:一些极端值会严重影响模型的表现,需要仔细处理。 数据清洗的主要目标是提高数据质量,减少模型偏差,并使分析结果更加准确和具有解释性。为什么数据清洗如此重要? 2.
下面,我们将通过R语言演示一些常见的数据清洗任务,包括处理缺失值、重复值、不一致格式和异常值。数据清洗的基本步骤 3.
首先,使用read.csv()函数导入数据集:# 读取数据集 data <- read.csv("your_dataset.csv", stringsAsFactors = FALSE)导入数据 4.
缺失值是数据清洗中的常见问题。 在R中,我们可以使用is.na()函数来检查缺失值,并使用多种方法进行处理。处理缺失值 4.1 查找缺失值
# 查看每一列中的缺失值 colSums(is.na(data))4.2 删除缺失值
如果某些行中的缺失值过多,可以删除这些行:# 删除包含缺失值的行 clean_data <- na.omit(data)4.3 填充缺失值
另一种处理方法是填补缺失值,可以使用均值或中位数:# 用均值填充缺失值 data$column_name[is.na(data$column_name)] <- mean(data$column_name, na.rm = TRUE)
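除了对单列用均值填充,也可以一次性对所有数值列用各自的中位数填充。下面是一个示意性的写法(沿用上面的 data 数据框,列名仅作演示):

```r
# 对 data 中所有数值型列,用各自的中位数填充缺失值(示意代码)
num_cols <- sapply(data, is.numeric)
data[num_cols] <- lapply(data[num_cols], function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
})
```

5.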
重复值会影响数据分析结果。 使用duplicated()函数可以查找和删除重复条目:# 查看重复行 duplicated_rows <- data[duplicated(data), ] # 删除重复行 data <- data[!duplicated(data), ]处理重复值 6.
数据中的日期、时间等字段可能格式不统一。 使用as.Date()可以将字符串转换为日期格式,或使用factor()将字符转换为分类变量。处理数据类型不一致 6.1 转换日期格式
# 将字符串转换为日期格式 data$date_column <- as.Date(data$date_column, format = "%Y-%m-%d")6.2 转换为分类变量
# 将字符型数据转换为因子 data$category_column <- as.factor(data$category_column)7.
异常值(outliers)是极端且可能不合理的值。 在进行模型分析前,通常需要处理这些值。处理异常值 7.1 查找异常值
使用summary()函数检查数据的分布情况,帮助找出异常值:# 检查数据分布 summary(data$numeric_column)7.2 可视化异常值
使用箱线图查看异常值的分布:# 生成箱线图 boxplot(data$numeric_column, main = "Boxplot for Numeric Column")7.3 删除或替换异常值
通过逻辑条件删除异常值:# 删除异常值 data <- data[data$numeric_column < upper_threshold & data$numeric_column > lower_threshold, ]
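上面代码中的 upper_threshold 和 lower_threshold 在前文并未定义,常见的一种取法是箱线图使用的 1.5 倍 IQR 规则。下面给出一个示意(阈值规则可按需调整):

```r
# 用 1.5*IQR 规则计算异常值的上下界(示意代码)
q <- quantile(data$numeric_column, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
lower_threshold <- q[1] - 1.5 * iqr
upper_threshold <- q[2] + 1.5 * iqr

# 按上下界过滤异常值
data <- data[data$numeric_column > lower_threshold &
             data$numeric_column < upper_threshold, ]
```

8.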
数据清洗不仅是修改数据,还需要通过可视化来直观地了解清洗前后的差异。 我们可以通过散点图来展示数据清洗前后的分布差异。数据清洗后的可视化对比 8.1 示例代码:数据清洗前后的可视化
# 加载必要的包 library(ggplot2) # 假设我们有一个数据集,包含缺失值和异常值 set.seed(123) raw_data <- data.frame( x = c(rnorm(100, mean = 50, sd = 10), NA, 200, 250), # 包含异常值和缺失值 y = c(rnorm(100, mean = 50, sd = 10), NA, 300, -100) # 包含异常值和缺失值 ) # 数据清洗前的可视化 ggplot(raw_data, aes(x = x, y = y)) + geom_point(color = "red") + ggtitle("数据清洗前的散点图") + theme_minimal()
进行数据清洗的代码 # 进行数据清洗:去除缺失值和异常值 clean_data <- na.omit(raw_data) clean_data <- clean_data[clean_data$x < 150 & clean_data$y > 0 & clean_data$y < 150, ] # 数据清洗后的可视化 ggplot(clean_data, aes(x = x, y = y)) + geom_point(color = "blue") + ggtitle("数据清洗后的散点图") + theme_minimal()数据清洗前:图中使用红色散点表示原始数据,其中包含缺失值和异常值。 可以明显看到一些极端的点与数据的主流分布相差较远,表明数据中存在异常值。 数据清洗后:清洗后,我们删除了缺失值,并过滤掉了极端的异常值。 蓝色散点显示了清洗后的数据,分布更加集中,异常值被移除,数据更具分析意义。
5行R语言做出聚类分析热图
什么是聚类分析?常用的聚类方法
层次聚类:
K均值聚类:
密度聚类(如DBSCAN)常用的距离度量
欧几里得距离
曼哈顿距离
相关性距离三、为什么将热图和聚类分析结合使用? 四、R语言绘制热图及聚类分析:实操教程
步骤 1:加载数据和所需包
步骤 2:计算相关性矩阵
步骤 3:绘制热图并进行聚类分析
步骤 4:个性化热图设置
什么是聚类分析? 聚类分析是一种无监督学习方法,旨在将数据集中的对象分为若干组,组内对象的相似性较高,而不同组间的差异较大。 聚类分析能够在不依赖于标签的情况下,揭示数据间的内在结构,应用广泛,如数据挖掘、市场细分、生物信息学等。聚类方法多种多样,常见的有:
常用的聚类方法 层次聚类: 构建层次结构,将样本逐步合并或拆分,形成树状图(树状图即聚类的树形图结构)。 它分为凝聚层次聚类(由个体逐步合并)和分裂层次聚类(由整体逐渐分裂)。K均值聚类: 将数据分为K个簇,使每个簇内的样本均方误差最小。 K均值聚类对数据的形状和数量较为敏感。密度聚类(如DBSCAN) :基于样本密度分组,能够发现任意形状的聚类,适用于非线性分布的聚类问题。聚类分析依赖于样本间的距离测量,常见的距离度量包括:
常用的距离度量 欧几里得距离 : 衡量点间的直线距离,适合处理数值型变量。曼哈顿距离 : 衡量各维度的绝对差值之和,适合离散数据或高维数据。相关性距离 : 基于相关系数的距离度量,在基因表达等高维数据分析中广泛应用。将热图与聚类分析结合在同一张图片中,不仅可以展示数据的分布和大小,还能按相似性将数据进行重新排列。 通过对行和列分别进行聚类分析,我们可以更直观地观察到变量或样本之间的相似性。 结合的聚类树状图(dendrogram)提供了群体间关系的结构化信息,使得热图中的分布模式更具解释性。
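上面提到的三种距离在R中都可以直接计算。下面用内置的mtcars数据给出一个简单示意(相关性距离这里用 1 - 相关系数 表示,仅作演示):

```r
x <- scale(mtcars)                       # 先标准化,避免量纲影响距离

d_euc <- dist(x, method = "euclidean")   # 欧几里得距离
d_man <- dist(x, method = "manhattan")   # 曼哈顿距离
d_cor <- as.dist(1 - cor(t(x)))          # 样本(行)之间的相关性距离
```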
三、为什么将热图和聚类分析结合使用? 接下来,我们将基于R语言内置的mtcars数据集,进行热图和聚类分析的绘制。 mtcars数据集记录了汽车的各类性能指标,是学习多变量分析的经典数据集。 我们将用pheatmap包来绘制热图和聚类树。
四、R语言绘制热图及聚类分析:实操教程 步骤 1:加载数据和所需包 在R中,使用mtcars数据集,并加载绘制热图的pheatmap包。 确保已安装pheatmap,否则先安装。 # 加载pheatmap包 library(pheatmap) # 查看mtcars数据集 data("mtcars") head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1步骤 2:计算相关性矩阵 我们首先计算mtcars数据集中变量的相关性矩阵。 这里使用皮尔逊相关系数来衡量各变量的相似性。 # 计算相关性矩阵 cor_matrix <- cor(mtcars)步骤 3:绘制热图并进行聚类分析 使用pheatmap函数绘制热图,同时对行和列进行聚类分析。 聚类将基于样本的相似性,将相似的变量和样本排列在一起。 # 绘制热图,进行行和列的聚类分析 pheatmap(cor_matrix, clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean", clustering_method = "complete", display_numbers = TRUE, color = colorRampPalette(c("blue", "white", "red"))(100))在上述代码中: clustering_distance_rows 和 clustering_distance_cols 参数指定行和列的距离度量,这里使用欧几里得距离(“euclidean”)。 clustering_method 指定了聚类方法,选择完全链接法(“complete”)。 display_numbers = TRUE 允许在热图中显示具体数值。 color 参数设置颜色梯度,从蓝到红反映数据相关性的高低。
步骤 4:个性化热图设置 通过自定义颜色、字体、图形边框等,可以使热图更加美观并适应实际需求。 # 个性化设置热图 pheatmap(cor_matrix, clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean", clustering_method = "average", display_numbers = TRUE, color = colorRampPalette(c("navy", "white", "firebrick3"))(100), fontsize = 10, fontsize_row = 8, fontsize_col = 8, main = "mtcars Data Correlation Heatmap", border_color = NA)
Clustering in Machine Learning
What is Clustering ? Types of Clustering
Hard Clustering:
Soft Clustering:Uses of Clustering Types of Clustering Algorithms
1.Centroid-based Clustering (Partitioning methods)
2.Density-based Clustering (Model-based methods)
3.Connectivity-based Clustering (Hierarchical clustering)
Divisive Clustering
Agglomerative Clustering
4.Distribution-based ClusteringApplications of Clustering in different fields
In real world, not every data we work upon has a target variable. This kind of data cannot be analyzed using supervised learning algorithms. We need the help of unsupervised algorithms. One of the most popular type of analysis under unsupervised learning is Cluster analysis. When the goal is to group similar data points in a dataset, then we use cluster analysis. In practical situations, we can use cluster analysis for customer segmentation for targeted advertisements, or in medical imaging to find unknown or new infected areas and many more use cases that we will discuss further in this article.This method is defined under the branch of Unsupervised Learning, which aims at gaining insights from unlabelled data points, that is, unlike supervised learning we don’t have a target variable. Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates the similarity based on a metric like Euclidean distance, Cosine similarity, Manhattan distance, etc. and then group the points with highest similarity score together. For Example, In the graph given below, we can clearly see that there are 3 circular clusters forming on the basis of distance.
What is Clustering ? Now it is not necessary that the clusters formed must be circular in shape. The shape of clusters can be arbitrary. There are many algortihms that work well with detecting arbitrary shaped clusters. For example, In the below given graph we can see that the clusters formed are not circular in shape.
Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:
Types of Clustering Hard Clustering: In this type of clustering, each data point belongs to a cluster completely or not. For example, Let’s say there are 4 data point and we have to cluster them into 2 clusters. So each data point will either belong to cluster 1 or cluster 2. Data Points Clusters A C1 B C2 C C2 D C1Soft Clustering: In this type of clustering, instead of assigning each data point into a separate cluster, a probability or likelihood of that point being that cluster is evaluated. For example, Let’s say there are 4 data point and we have to cluster them into 2 clusters. So we will be evaluating a probability of a data point belonging to both clusters. This probability is calculated for all data points. Data Points Probability of C1 Probability of C2 A 0.91 0.09 B 0.3 0.7 C 0.17 0.83 D 1 0Now before we begin with types of clustering algorithms, we will go through the use cases of Clustering algorithms. Clustering algorithms are majorly used for: Market Segmentation – Businesses use clustering to group their customers and use targeted advertisements to attract more audience. Market Basket Analysis – Shop owners analyze their sales and figure out which items are majorly bought together by the customers. For example, In USA, according to a study diapers and beers were usually bought together by fathers. Social Network Analysis – Social media sites use your data to understand your browsing behaviour and provide you with targeted friend recommendations or content recommendations. Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic images like X-rays. Anomaly Detection – To find outliers in a stream of real-time dataset or forecasting fraudulent transactions we can use clustering to identify them. Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete. Now, you may reduce a feature set’s whole feature set into its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID. Using the same principle, clustering data can make complex datasets simpler. There are many more use cases for clustering but there are some of the major and common use cases of clustering. Moving forward we will be discussing Clustering Algorithms that will help you perform the above tasks.
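As a concrete illustration of soft clustering, fuzzy c-means from the e1071 package returns a membership probability for every point in every cluster. A minimal sketch on the iris measurements (the package choice, number of clusters and fuzziness parameter are only an example):

```r
library(e1071)

x  <- scale(iris[, 1:4])            # standardise the four numeric columns
cm <- cmeans(x, centers = 3, m = 2) # m > 1 controls how "fuzzy" memberships are
head(cm$membership)                 # one membership probability per cluster
```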
Uses of Clustering At the surface level, clustering helps in the analysis of unstructured data. Graphing, the shortest distance, and the density of the data points are a few of the elements that influence cluster formation. Clustering is the process of determining how related the objects are based on a metric called the similarity measure. Similarity metrics are easier to locate in smaller sets of features. It gets harder to create similarity measures as the number of features increases. Depending on the type of clustering algorithm being utilized in data mining, several techniques are employed to group the data from the datasets. In this part, the clustering techniques are described. Various types of clustering algorithms are: Centroid-based Clustering (Partitioning methods) Density-based Clustering (Model-based methods) Connectivity-based Clustering (Hierarchical clustering) Distribution-based Clustering We will be going through each of these types in brief.
Types of Clustering Algorithms 1.Centroid-based Clustering (Partitioning methods)
Partitioning methods are the simplest clustering algorithms. They group data points on the basis of their closeness. Generally, the similarity measures chosen for these algorithms are Euclidean distance, Manhattan distance or Minkowski distance. The dataset is separated into a predetermined number of clusters, each referenced by a vector of values, and every input observation is assigned to the cluster whose reference vector it is closest to. The primary drawback of these algorithms is the requirement that we establish the number of clusters, "k," either intuitively or scientifically (using the Elbow Method) before the algorithm starts allocating the data points. Despite this, it is still the most popular type of clustering. K-means and K-medoids clustering are some examples of this type of clustering.
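A minimal R sketch of centroid-based clustering with k-means, including a simple elbow plot for choosing k (the mtcars data and the value k = 3 are only illustrative):

```r
set.seed(123)
x <- scale(mtcars)   # standardise so no single variable dominates the distance

# elbow method: total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

km <- kmeans(x, centers = 3, nstart = 25)
km$cluster           # hard cluster assignment for each observation
```

2.Density-based Clustering (Model-based methods)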
Density-based clustering, a model-based method, finds groups based on the density of data points. Contrary to centroid-based clustering, which requires that the number of clusters be predefined and is sensitive to initialization, density-based clustering determines the number of clusters automatically and is less susceptible to beginning positions. They are great at handling clusters of different sizes and forms, making them ideally suited for datasets with irregularly shaped or overlapping clusters. These methods manage both dense and sparse data regions by focusing on local density and can distinguish clusters with a variety of morphologies. In contrast, centroid-based grouping, like k-means, has trouble finding arbitrary shaped clusters. Due to its preset number of cluster requirements and extreme sensitivity to the initial positioning of centroids, the outcomes can vary. Furthermore, the tendency of centroid-based approaches to produce spherical or convex clusters restricts their capacity to handle complicated or irregularly shaped clusters. In conclusion, density-based clustering overcomes the drawbacks of centroid-based techniques by autonomously choosing cluster sizes, being resilient to initialization, and successfully capturing clusters of various sizes and forms. The most popular density-based clustering algorithm is DBSCAN.3.Connectivity-based Clustering (Hierarchical clustering)
A method for assembling related data points into hierarchical clusters is called hierarchical clustering. Each data point is initially treated as a separate cluster, and clusters are then successively combined with the clusters most similar to them until one large cluster contains all of the data points. Think about how you might arrange a collection of items based on how similar they are. In hierarchical clustering, each object begins as its own cluster at the base of the tree, and the method creates a dendrogram, a tree-like structure. The closest pairs of clusters are then combined into larger clusters as the algorithm examines how similar the objects are to one another. When every object is in one cluster at the top of the tree, the merging process has finished. Exploring various granularity levels is one of the nice things about hierarchical clustering: to obtain a given number of clusters, you can cut the dendrogram at a particular height. The more similar two objects are within a cluster, the closer they are. It's comparable to classifying items according to their family trees, where the nearest relatives are clustered together and the wider branches signify more general connections. There are 2 approaches for hierarchical clustering: Divisive Clustering follows a top-down approach: we consider all data points to be part of one big cluster, and this cluster is then divided into smaller groups. Agglomerative Clustering follows a bottom-up approach: we consider all data points to be individual clusters, and these clusters are then merged together to make one big cluster with all data points.
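A minimal sketch of agglomerative hierarchical clustering in base R (the distance metric, the complete linkage and the cut at k = 3 are illustrative choices):

```r
d  <- dist(scale(mtcars))             # Euclidean distances between observations
hc <- hclust(d, method = "complete")  # agglomerative clustering, complete linkage
plot(hc)                              # dendrogram
cutree(hc, k = 3)                     # cut the tree into 3 clusters
```

4.Distribution-based Clustering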
Using distribution-based clustering, data points are generated and organized according to their propensity to fall into the same probability distribution (such as a Gaussian, binomial, or other) within the data. The data elements are grouped using a probability-based distribution that is based on statistical distributions. Included are data objects that have a higher likelihood of being in the cluster. A data point is less likely to be included in a cluster the further it is from the cluster’s central point, which exists in every cluster. A notable drawback of density and boundary-based approaches is the need to specify the clusters a priori for some algorithms, and primarily the definition of the cluster form for the bulk of algorithms. There must be at least one tuning or hyper-parameter selected, and while doing so should be simple, getting it wrong could have unanticipated repercussions. Distribution-based clustering has a definite advantage over proximity and centroid-based clustering approaches in terms of flexibility, accuracy, and cluster structure. The key issue is that, in order to avoid overfitting, many clustering methods only work with simulated or manufactured data, or when the bulk of the data points certainly belong to a preset distribution. The most popular distribution-based clustering algorithm is Gaussian Mixture Model.
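A minimal sketch of distribution-based clustering with a Gaussian Mixture Model, using the mclust package (which selects the number of components by BIC); the iris data is only an example:

```r
library(mclust)

gmm <- Mclust(iris[, 1:4])   # fits several Gaussian mixture models, picks the best by BIC
summary(gmm)
head(gmm$z)                  # soft membership probabilities per component
gmm$classification           # hard assignment: the most probable component
```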
Applications of Clustering in different fields Marketing: It can be used to characterize & discover customer segments for marketing purposes.Biology: It can be used for classification among different species of plants and animals.Libraries: It is used in clustering different books on the basis of Clusteringtopics and information.Insurance: It is used to acknowledge the customers, their policies and identifying the frauds.City Planning: It is used to make groups of houses and to study their values based on their geographical locations and other factors present.Earthquake studies: By learning the earthquake-affected areas we can determine the dangerous zones.Image Processing : Clustering can be used to group similar images together, classify images based on content, and identify patterns in image data.Genetics: Clustering is used to group genes that have similar expression patterns and identify gene networks that work together in biological processes.Finance: Clustering is used to identify market segments based on customer behavior, identify patterns in stock market data, and analyze risk in investment portfolios.Customer Service: Clustering is used to group customer inquiries and complaints into categories, identify common issues, and develop targeted solutions.Manufacturing : Clustering is used to group similar products together, optimize production processes, and identify defects in manufacturing processes.Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases, which helps in making accurate diagnoses and identifying effective treatments.Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial transactions, which can help in detecting fraud or other financial crimes.Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak hours, routes, and speeds, which can help in improving transportation planning and infrastructure.Social network analysis: Clustering is used to identify communities or groups within social networks, which can help in understanding social behavior, influence, and trends.Cybersecurity: Clustering is used to group similar patterns of network traffic or system behavior, which can help in detecting and preventing cyberattacks.Climate analysis: Clustering is used to group similar patterns of climate data, such as temperature, precipitation, and wind, which can help in understanding climate change and its impact on the environment.Sports analysis: Clustering is used to group similar patterns of player or team performance data, which can help in analyzing player or team strengths and weaknesses and making strategic decisions.Crime analysis: Clustering is used to group similar patterns of crime data, such as location, time, and type, which can help in identifying crime hotspots, predicting future crime trends, and improving crime prevention strategies.Overview of clustering methods in R
Contents of the post:
What is clustering?
What is it good for? - Classification to groups, Anomaly detection, Data compression
Types of clustering methods - hierarchical, non-hierarchical
Centroid-based - K-means, K-medoids, the determination of the number of clusters, Elbow diagram
Model-based - Gaussian Mixture Models (GMM), EM, BIC
Density-based - DBSCAN, OPTICS, bananas (DBSCAN vs. K-means results)
Spectral clustering - typical use case
Hierarchical clustering
IRIS dataset use case
Connected data - DBSCAN, K-means, Gaussian model-based, and spectral clustering results
Other types of clustering methods
Conclusions
Clustering is a very popular technique in data science because of its unsupervised characteristic – we don’t need true labels of groups in data. In this blog post, I will give you a “quick” survey of various clustering methods applied to synthetic but also real datasets.
What is clustering?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a technique of unsupervised learning, so clustering is used when no a priori information about the data is available. This makes clustering a very strong technique for gaining insights into data and making more accurate decisions.
What is it good for?
Clustering is used:
- to gain insight into data, generate hypotheses, detect anomalies, and identify salient features,
- to identify the degree of similarity among objects (i.e. organisms),
- as a method for organizing the data and summarising it through cluster prototypes (compression).
Classification to groups
The first use case is to group data, e.g. classify them into groups. For explanation purposes, I will generate synthetic data from three normal distributions plus three outliers (anomalies). Let's load the needed packages, generate some random data, and show the first use case in a visualization:

library(data.table) # data handling
library(ggplot2)    # visualisations
library(gridExtra)  # visualisations
library(grid)       # visualisations
library(cluster)    # PAM - K-medoids

set.seed(54321)
data_example <- data.table(x = c(rnorm(10, 3.5, 0.1), rnorm(10, 2, 0.1), rnorm(10, 4.5, 0.1), c(5, 1.9, 3.95)),
                           y = c(rnorm(10, 3.5, 0.1), rnorm(10, 2, 0.1), rnorm(10, 4.5, 0.1), c(1.65, 2.9, 4.2)))

gg1 <- ggplot(data_example, aes(x, y)) +
  geom_point(alpha = 0.75, size = 8) +
  theme_bw()

kmed_res <- pam(data_example, 3)$clustering
data_example[, class := as.factor(kmed_res)]

gg2 <- ggplot(data_example, aes(x, y, color = class, shape = class)) +
  geom_point(alpha = 0.75, size = 8) +
  theme_bw()

define_region <- function(row, col){
  viewport(layout.pos.row = row, layout.pos.col = col)
}

grid.newpage()
# Create layout: nrow = 1, ncol = 2
pushViewport(viewport(layout = grid.layout(1, 2)))
# Arrange the plots
print(gg1, vp = define_region(1, 1))
print(gg2, vp = define_region(1, 2))

We can see three nicely divided groups of data.
Anomaly detection
Clustering can also be used as an anomaly detection technique; some clustering methods can automatically detect outliers (anomalies). Let's show visually what it looks like.

anom <- c(rep(1, 30), rep(0, 3))
data_example[, class := as.factor(anom)]
levels(data_example$class) <- c("Anomaly", "Normal")

ggplot(data_example, aes(x, y, color = class, shape = class)) +
  geom_point(alpha = 0.75, size = 8) +
  theme_bw()
Data compression
In an era of large amounts of data (the much-used buzzword: big data), we can have problems processing it in real time. Here clustering can help to reduce dimensionality through its compression feature. Created clusters, which incorporate multiple points (data), can be replaced by their representatives (prototypes), i.e. a single point. In the visualization below, each point is replaced by its cluster representative ("+"):

data_example[, class := as.factor(kmed_res)]
centroids <- data_example[, .(x = mean(x), y = mean(y)), by = class]

ggplot(data_example, aes(x, y, color = class, shape = class)) +
  geom_point(alpha = 0.75, size = 8) +
  geom_point(data = centroids, aes(x, y),
             color = "black", shape = "+", size = 18) +
  theme_bw()
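The compression step itself can then be a simple aggregation: one prototype row per cluster (plus, optionally, the cluster size) instead of all original points. A minimal sketch building on the objects above (the column name n_points is just an illustrative choice):

# one row per cluster: prototype coordinates and the number of points it represents
compressed <- data_example[, .(x = mean(x), y = mean(y), n_points = .N), by = class]
compressed  # 3 rows instead of the 33 original points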
Since cluster analysis has been around for more than 50 years, there is a large number of available methods. The basic classification of clustering methods is based on the objective to which they aim:
Types of clustering methods
- hierarchical,
- non-hierarchical.
Hierarchical clustering is a multi-level partitioning of a dataset and is a branch of classification (clustering). It has two types of access to data. The first one, divisive clustering, starts with one big cluster that is then divided into smaller clusters. The second one, agglomerative clustering, starts with individual objects as single-element clusters, which are then gradually merged. The whole process of hierarchical clustering can be expressed (visualized) as a dendrogram. Non-hierarchical clustering divides a dataset into a system of disjoint subsets (clusters), so that the intersection of any two clusters is an empty set. Clustering methods can also be divided in more detail based on the processes in the method (algorithm) itself:
Non-hierarchical:
- Centroid-based
- Model-based
- Density-based
- Grid-based
Hierarchical:
- Agglomerative
- Divisive
But which to choose in your use case? Let's dive deeper into the best-known methods and discuss their advantages and disadvantages.
Centroid-based
The most basic (maybe just for me) type of clustering method is centroid-based. This type of clustering creates prototypes of clusters: centroids or medoids. The best-known methods are:
- K-means
- K-medians
- K-medoids
- K-modes
K-means
Steps:
1. Create K random clusters (and compute centroids).
2. Assign points to the nearest centroids.
3. Update centroids.
4. Go to step 2 while the centroids are changing.

Pros and cons:
[+] Fast to compute.
[+] Easy to understand.
[-] Various initial clusters can lead to different final clusterings (see the nstart sketch below).
[-] Scale-dependent.
[-] Creates only convex (spherical) shapes of clusters.
[-] Sensitive to outliers.

K-means - example
It is very easy to try K-means in R (via the kmeans function); the only required parameter is the number of clusters.

km_res <- kmeans(data_example[, .(x, y)], 3)$cluster
data_example[, class := as.factor(km_res)]
centroids <- data_example[, .(x = mean(x), y = mean(y)), by = class]

ggplot(data_example, aes(x, y, color = class, shape = class)) +
  geom_point(alpha = 0.75, size = 8) +
  geom_point(data = centroids, aes(x, y),
             color = "black", shape = "+", size = 18) +
  theme_bw()

We can see an example of the situation where K-means fails most often: when there are outliers in the dataset.
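One way to mitigate the sensitivity to initialization is the nstart argument of kmeans(), which runs several random starts and keeps the solution with the lowest total within-cluster sum of squares. A minimal sketch (not part of the original example):

# 25 random initializations; the best run (lowest tot.withinss) is returned
km_best <- kmeans(data_example[, .(x, y)], centers = 3, nstart = 25)
km_best$tot.withinss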
K-medoids
K-medoids solves the problem with outliers, because its prototypes are medoids - actual members of the dataset. So, no artificially created centroids, which helps to handle outliers.

Pros and cons:
[+] Easy to understand.
[+] Less sensitive to outliers.
[+] Possibility to use any distance measure.
[-] Various initial clusters can lead to different final clusterings.
[-] Scale-dependent.
[-] Slower than K-means.

K-medoids - example
The K-medoids problem can be solved by the Partitioning Around Medoids (PAM) algorithm (function pam in the cluster package).

kmed_res <- pam(data_example[, .(x, y)], 3)
data_example[, class := as.factor(kmed_res$clustering)]
medoids <- data.table(kmed_res$medoids, class = as.factor(1:3))

ggplot(data_example, aes(x, y, color = class, shape = class)) +
  geom_point(alpha = 0.75, size = 8) +
  geom_point(data = medoids, aes(x, y, shape = class),
             color = "black", size = 11, alpha = 0.7) +
  theme_bw() +
  guides(shape = "none")

We can see that the medoids stayed nicely in the three main groups of data.
The determination of the number of clusters
The disadvantage of centroid-based methods is that the number of clusters needs to be known in advance (it is a parameter of the methods). However, we can determine the number of clusters with an internal validation index. The basic procedure is to compute some internal validation index for many values of K and to choose the K with the best index value. Many indexes exist:
- Silhouette
- Davies-Bouldin index
- Dunn index
- etc.
However, every index has a similar character:

( \frac{\text{within-cluster similarity}}{\text{between-clusters similarity}} ),

so it is the ratio of the average distances within clusters and between clusters.
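For example, the average silhouette width is easy to compute with the cluster package. A minimal sketch, assuming the data_example object from above (the helper names d and sil are just illustrative):

library(cluster)
d <- dist(data_example[, .(x, y)])                              # pairwise distances
sil <- silhouette(pam(data_example[, .(x, y)], 3)$clustering, d)
mean(sil[, "sil_width"])                                        # average silhouette width; higher is better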
Elbow diagram
The Elbow diagram is a simple method (rule) for determining the number of clusters: we compute the internal index for a set of K values and choose the K where the index improves the most (the "elbow"). As an example, I chose the Davies-Bouldin index implemented in the clusterCrit package. For our simple dataset, I will generate clusterings with 2-6 clusters and compute the index.

library(clusterCrit)
km_res_k <- lapply(2:6, function(i) kmeans(data_example[, .(x, y)], i)$cluster)
km_res_k

## [[1]]
## [1] 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2
##
## [[2]]
## [1] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 3 2 1
##
## [[3]]
## [1] 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 2 4 3
##
## [[4]]
## [1] 5 5 5 5 5 5 5 5 5 5 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 1 3
##
## [[5]]
## [1] 1 1 1 1 1 1 1 5 5 1 6 6 6 6 6 6 6 6 6 6 3 3 3 3 3 3 3 3 3 3 2 6 4

db_km <- lapply(km_res_k, function(j) intCriteria(data.matrix(data_example[, .(x, y)]), j, "Davies_bouldin")$davies_bouldin)

ggplot(data.table(K = 2:6, Dav_Boul = unlist(db_km)), aes(K, Dav_Boul)) +
  geom_line() +
  geom_point() +
  theme_bw()

data_example[, class := as.factor(km_res_k[[which.min(c(0, diff(unlist(db_km))))]])]

ggplot(data_example, aes(x, y, color = class, shape = class)) +
  geom_point(alpha = 0.75, size = 8) +
  theme_bw()
We can see that the Elbow diagram rule chose 4 clusters - makes sense to me actually… We can also try it with PAM - K-medoids.

kmed_res_k <- lapply(2:6, function(i) pam(data_example[, .(x, y)], i)$clustering)
db_kmed <- lapply(kmed_res_k, function(j) intCriteria(data.matrix(data_example[, .(x, y)]), j, "Davies_bouldin")$davies_bouldin)

ggplot(data.table(K = 2:6, Dav_Boul = unlist(db_kmed)), aes(K, Dav_Boul)) +
  geom_line() +
  geom_point() +
  theme_bw()
data_example[, class := as.factor(kmed_res_k[[which.min(c(0, diff(unlist(db_kmed))))]])]  # select K by the largest drop of the index (db_kmed, not db_km)

ggplot(data_example, aes(x, y, color = class, shape = class)) +
  geom_point(alpha = 0.75, size = 8) +
  theme_bw()
It is the same result.
Model-based
Model-based clustering methods are based on some probability distribution. It can be:
- Gaussian (normal) distribution
- Gamma distribution
- Student's t-distribution
- Poisson distribution
- etc.
Since we cluster multivariate data, model-based clustering uses multivariate distributions and a so-called mixture of models (mixtures -> clusters). When clustering with the Gaussian normal distribution, we are using the theory of Gaussian Mixture Models (GMM).
GMM
The target is to maximize the likelihood

( L(\mu_1, \dots, \mu_k, \Sigma_1, \dots, \Sigma_k \,|\, x_1, \dots, x_n) ).

Here, a cluster is represented by a mean ( \mathbf{\mu} ) and a covariance matrix ( \mathbf{\Sigma} ), so not just a centroid as in the case of K-means. This optimization problem is typically solved by the EM algorithm (Expectation-Maximization).

Pros and cons:
[+] Ellipsoidal clusters.
[+] Can be parameterized by the covariance matrix.
[+] Scale-independent.
[-] Very slow for high-dimensional data.
[-] Can be difficult to understand.

The EM algorithm with GMM is implemented in the mclust package. You can optimize various shapes of mixtures (clusters) with the modelNames parameter (check ?mclustModelNames for more details).

library(mclust)
res <- Mclust(data_example[, .(x, y)], G = 3, modelNames = "VVV", verbose = FALSE)
plot(res, what = "classification")

A pretty interesting red ellipse was created, but in general the clustering is OK.
BIC
The Bayesian Information Criterion (BIC) can be used with model-based clustering to choose the optimal number of clusters. In the mclust package, you can just pass multiple modelNames and it chooses the best one by BIC. We can also try to vary the structure of the covariance matrix ( \mathbf{\Sigma} ).

res <- Mclust(data_example[, .(x, y)], G = 2:6, modelNames = c("VVV", "EEE", "VII", "EII"), verbose = FALSE)
res

## 'Mclust' model object: (EII,6)
##
## Available components:
## [1] "call" "data" "modelName" "n" "d" "G" "BIC" "loglik" "df" "bic" "icl"
## [12] "hypvol" "parameters" "z" "classification" "uncertainty"

plot(res, what = "BIC")

The result:

plot(res, what = "classification")
So, the methodology chose 6 clusters - 3 main groups of data and all 3 anomalies in separate clusters.
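If you prefer to inspect the selection programmatically rather than from the plot, the fitted mclust object exposes the chosen model. A small sketch (not in the original post):

res$G              # number of mixture components chosen by BIC
res$modelName      # covariance structure of the winning model
summary(res$BIC)   # top-ranked models by BIC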
Density-based
Density-based clusters are based on maximally connected components of the set of points that lie within some defined distance from some core object. Methods:
- DBSCAN
- OPTICS
- HDBSCAN
- Multiple densities (multi-density) methods
DBSCAN
In the well-known method DBSCAN, density is defined as a neighborhood where points have to be reachable within a defined distance (the ( \epsilon ) distance - the first parameter of the method); in addition, clusters must contain at least some minimal number of points (the second parameter of the method). Points that were not connected to any cluster and did not pass the minimal-points criterion are marked as noise (outliers).

Pros and cons:
[+] Extracts outliers automatically.
[+] Fast to compute.
[+] Can find clusters of arbitrary shapes.
[+] The number of clusters is determined automatically based on the data.
[-] Parameters (( \epsilon ), minPts) must be set by the practitioner.
[-] Possible problem with neighborhoods - they can become connected.

DBSCAN is implemented in the package and function of the same name, so let's try it.

library(dbscan)
res <- dbscan(data_example[, .(x, y)], eps = 0.4, minPts = 5)
table(res$cluster)

##
##  0  1  2  3
##  3 10 10 10

data_example[, class := as.factor(res$cluster)]
levels(data_example$class)[1] <- c("Noise")

ggplot(data_example, aes(x, y, color = class, shape = class)) +
  geom_point(alpha = 0.75, size = 8) +
  theme_bw() +
  scale_shape_manual(values = c(3, 16, 17, 18))

We can see that DBSCAN found the 3 clusters and 3 outliers correctly when the parameters are wisely chosen.
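A common heuristic for choosing eps (not used in the original example, so treat this as an added suggestion) is the k-nearest-neighbor distance plot from the dbscan package: sort the distances to each point's k-th nearest neighbor and look for the "knee" of the curve.

library(dbscan)
# distances to the 5th nearest neighbor (k roughly equal to minPts);
# the knee of the curve suggests a reasonable eps value
kNNdistplot(data_example[, .(x, y)], k = 5)
abline(h = 0.4, lty = 2)  # the eps used above, for comparison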
Bananas - DBSCAN result
To demonstrate the strength of DBSCAN, researchers created many dummy artificial datasets, which are often called bananas.

bananas <- fread("_rmd/t7.10k.dat")
db_res <- dbscan(bananas, eps = 10, minPts = 15)
data_all <- data.table(bananas, class = as.factor(db_res$cluster))

library(ggsci)
ggplot(data_all, aes(V1, V2, color = class, shape = class)) +
  geom_point(alpha = 0.75) +
  scale_color_d3() +
  scale_shape_manual(values = c(3, rep(16, 9))) +
  theme_bw()
Bananas - K-means result
km_res <- kmeans(bananas, 9)
data_all[, class := as.factor(km_res$cluster)]

ggplot(data_all, aes(V1, V2, color = class)) +
  geom_point(alpha = 0.75) +
  scale_color_d3() +
  theme_bw()

K-means is obviously not a good choice here… but these datasets are also far from real-world ones.
Spectral clustering
Spectral clustering methods are based on the spectral decomposition of the data, i.e. on the computation of eigenvectors and eigenvalues. Steps (N = number of data points, d = dimension of the data):
1. ( \mathbf{A} ) = affinity matrix, ( A_{ij} = \exp(- \| data_i - data_j \|^2 / (2\sigma^2)) ) - an N by N matrix.
2. ( \mathbf{D} ) = diagonal matrix whose (i,i)-element is the sum of the i-th row of ( \mathbf{A} ) - an N by N matrix.
3. ( \mathbf{L} = \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2} ) - an N by N matrix.
4. ( \mathbf{X} ) = union of the k largest eigenvectors of ( \mathbf{L} ) - an N by k matrix.
5. Renormalise each row of ( \mathbf{X} ) to have unit length - an N by k matrix.
6. Run the K-means algorithm on ( \mathbf{X} ).
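For intuition, the steps above can be translated almost literally into a few lines of base R. This is only a sketch of the procedure with an ad-hoc sigma, not how the kernlab implementation below works internally; spectral_sketch and its defaults are my own illustrative choices.

# A compact, unoptimized illustration of the listed steps on data_example.
spectral_sketch <- function(data, k, sigma = 0.5) {
  data <- as.matrix(data)
  # 1. affinity matrix A with zero diagonal
  A <- exp(-as.matrix(dist(data))^2 / (2 * sigma^2))
  diag(A) <- 0
  # 2. + 3. normalized matrix L = D^{-1/2} A D^{-1/2}
  d_inv_sqrt <- 1 / sqrt(rowSums(A))
  L <- diag(d_inv_sqrt) %*% A %*% diag(d_inv_sqrt)
  # 4. k largest eigenvectors of L (eigen() sorts eigenvalues in decreasing order)
  X <- eigen(L, symmetric = TRUE)$vectors[, 1:k]
  # 5. renormalize rows of X to unit length
  X <- X / sqrt(rowSums(X^2))
  # 6. K-means on the embedded points
  kmeans(X, centers = k, nstart = 20)$cluster
}

spectral_sketch(data_example[, .(x, y)], k = 3)

Typical use case for spectral clustering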
We will try spectral clustering on the Spirals artificial dataset.

data_spiral <- fread("_rmd/data_spiral.csv")

ggplot(data_spiral, aes(x, y, color = as.factor(label), shape = as.factor(label))) +
  geom_point(size = 2) +
  theme_bw()

Spectral clustering is implemented in the kernlab package in the specc function.

library(kernlab)
res <- specc(data.matrix(data_spiral[, .(x, y)]), centers = 3)
data_spiral[, class := as.factor(res)]

ggplot(data_spiral, aes(x, y, color = class, shape = class)) +
  geom_point(size = 2) +
  theme_bw()
Let's try it on more advanced data - the compound dataset.

data_compound <- fread("_rmd/data_compound.csv")

ggplot(data_compound, aes(x, y, color = as.factor(label), shape = as.factor(label))) +
  geom_point(size = 2) +
  theme_bw()
res <- specc(data.matrix(data_compound[, .(x, y)]), centers = 6)
data_compound[, class := as.factor(res)]

ggplot(data_compound, aes(x, y, color = class, shape = class)) +
  geom_point(size = 2) +
  theme_bw()
This is not a good result, so let's try DBSCAN.

db_res <- dbscan(data.matrix(data_compound[, .(x, y)]), eps = 1.4, minPts = 5)
# db_res
data_compound[, class := as.factor(db_res$cluster)]

ggplot(data_compound, aes(x, y, color = class, shape = class)) +
  geom_point(size = 2) +
  theme_bw()
Again, a nice result for DBSCAN on an artificial dataset.
Hierarchical clustering
The result of hierarchical clustering is a dendrogram. The dendrogram can be cut at any height to form a partition of the data into clusters. There are multiple possible ways (linkages) in which data points can be connected in the dendrogram:
- Single-linkage
- Complete-linkage
- Average-linkage
- Centroid-linkage
- Ward's minimum variance method
- etc.
Criteria (for clusters ( A ) and ( B ), with distance ( d )):
- single-linkage: ( \min \{ d(a,b) : a \in A, b \in B \} )
- complete-linkage: ( \max \{ d(a,b) : a \in A, b \in B \} )
- average-linkage: ( \frac{1}{|A| |B|} \sum_{a \in A} \sum_{b \in B} d(a,b) )
- centroid-linkage: ( \| c_s - c_t \| ), where ( c_s ) and ( c_t ) are the centroids of clusters ( s ) and ( t ).
A minimal hclust example with some of these linkages follows below.
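The post shows no code for this part, so here is a small sketch using base R's hclust on the data_example object from above, comparing two of the listed linkages (object names are illustrative; "average", "centroid", and "ward.D2" are other valid method values):

d <- dist(data_example[, .(x, y)])             # Euclidean distance matrix
hc_single   <- hclust(d, method = "single")    # single-linkage
hc_complete <- hclust(d, method = "complete")  # complete-linkage

plot(hc_complete)                              # dendrogram; cut it to obtain clusters
clusters <- cutree(hc_complete, k = 3)         # partition into 3 clusters
table(clusters)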