[1] Hadley Wickham and Garrett Grolemund, R for Data Science, https://r4ds.had.co.nz [2](新西兰)哈德利·威克姆,(美)加勒特·格罗勒芒德著;陈光欣译.R数据科学[M].北京:人民邮电出版社.2018.
Welcome
1. Introduction
- It’s common to think about modeling as a tool for hypothesis confirmation, and visualization as a tool for hypothesis generation. But that’s a false dichotomy: models are often used for exploration, and with a little care you can use visualization for confirmation. The key difference is how often do you look at each observation: if you look only once, it’s confirmation; if you look more than once, it’s exploration.
经常有人认为建模是用来进行假设验证的工具,而可视化是用来进行假设生成的工具。这种简单的二分法是错误的:模型经常用于数据探索;只需稍作处理,可视化也可以用来进行假设验证。核心区别在于你使用每个观测的频率:如果只用一次,那么就是假设验证;如果多于一次,那么就是数据探索。
- For packages in the
tidyvesrse
, the easiest way to check is to runtidyverse_update()
.
对于
tidyverse
中的 R 包来说,检查版本的最简单方式是运行tidyverse_update()
函数。
- The shorter your code is, the easier it is to understand, and the easier it is to fix.
代码越短,越容易理解,问题也就越容易解决。
Explore
1. Introduction
- Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again. The goal of data exploration is to generate many promising leads that you can later explore in more depth.
数据探索是一门艺术,它可以审视数据,快速生成假设并进行检验,接着重复、重复、再重复。数据探索的目的是生成多个有分析价值的线索,以供后续进行更深入的研究。
2. Data visualisation
- To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside
aes()
.ggplot2
will automatically assign a unique level of the aesthetic to each unique value of the variable, a process known as scaling.ggplot2
will also add a legend that explains which levels correspond to which values.
要想将图形属性映射为变量,需要在函数
aes()
中将图形属性名称和变量名称关联起来。ggplot2
会自动为每个变量值分配唯一的图形属性水平,这个过程称为标度变换。ggplot2
还会添加一个图例,以表示图形属性水平和变量值之间的对应关系。
ggplot2
will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.
ggplot2
只能同时使用种形状。
- R has 25 built in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares.
The difference comes from the interaction of the
colour
andfill
aesthetics. The hollow shapes (0–14) have a border determined bycolour
; the solid shapes (15–20) are filled withcolour
; the filled shapes (21–24) have a border ofcolour
and are filled withfill
.
空心形状(0-14)的边界颜色由
color
决定;实心形状(15-20)的填充颜色由color
决定;填充形状(21-24)的边界颜色由color
决定,填充颜色由fill
决定。
3. Work flow: basics
- But while you should expect to be a little frustrated, take comfort in that it’s both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.
但是,当有了一些心理准备后,你就可以心安理得地接受这些挫折,知道这是正常的,也是暂时的:每个人都会遇到困难,克服困难的唯一方法就是不断尝试。
4. Data transformation
- Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on
==
, usenear()
:
计算机使用的是有限精度运算(显然无法存储无限位的数),因此请记住,你看到的每个数都是一个近似值。比较浮点数是否相等时,不能使用==,而应该使用near():
near(sqrt(2) ^ 2, 2)
#> [1] TRUE near(1 / 49 * 49, 1)
#> [1] TRUE
- One important feature of R that can make comparison tricky are missing values, or
NA
s (“not availables”).NA
represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.
R 的一个重要特征使得比较运算更加复杂,这个特征就是缺失值,或称NA(not > available,不可用)。NA > 表示未知的值,因此缺失值是“可传染的”。如果运算中包含了未知值,那么运算结果一般来说也是个未知值。
- All else being equal, I recommend using
log2()
because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
其他条件相同的情况下,我推荐使用
log2()
函数,因为很容易对其进行解释:对数标度的数值增加1 > 个单位,意味着初始数值加倍;减少1 个单位,则意味着初始数值减半。
- When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating
ggplot2
intodplyr
flows. It’s a bit painful that you have to switch from%>%
to+
, but once you get the hang of it, it’s quite convenient.
查看此类图形时,通常应该筛选掉那些观测数量非常少的分组,这样你就可以避免受到特别小的分组中的极端变动的影响,进而更好地发现数据模式。这就是以下代码要做的工作,同时还展示了将
ggplot2
集成到dplyr
工作流中的一种有效方式。从%>%
过渡到+
会令人感到不适应,但掌握其中的要领后,这种写法是非常方便的:
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
5. Work flow: scripts
- I recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see what packages they need to install. Note, however, that you should never include
install.packages()
orsetwd()
in a script that you share. It’s very antisocial to change settings on someone else’s computer!
我们建议你一直将所需要 R 包的语句放在脚本开头。这样一来,如果将代码分享给别人,他们就可以很容易地知道需要安装哪些R > 包。但注意,永远不要在分享的脚本中包括
install.packages()
函数或setwd()
函数。这种改变他人计算机设置的行为会引起众怒的。
6. Exploratory Data Analysis
- EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA(探索性数据分析)并不是具有严格规则的正式过程,它首先是一种思维状态。在EDA > 的初始阶段,你应该天马行空地发挥想象力,并考察和试验能够想到的所有方法。有些想法是行得通的,有些想法则会无疾而终。当探索更进一步时,你就可以锁定容易产生成果的几个领域,将最终想法整理成文,并与他人进行沟通。
- “There are no routine statistical questions, only questionable > statistical routines.” — Sir David Cox
没有一成不变的统计问题,统计上的一成不变都是有问题的。
- “Far better an approximate answer to the right question, which is > often vague, than an exact answer to the wrong question, which can > always be made precise.” — John Tukey
正确问题的近似答案通常是模糊的,但它远远胜过错误问题的确切答案,尽管后者总是很精确。
- It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
使用带有异常值和不带异常值的数据分别进行分析,是一种良好的做法。如果两次分析的结果差别不大,而你又无法说明为什么会有异常值,那么完全可以用缺失值替代异常值,然后继续进行分析。但如果两次分析的结果有显著差别,那么你就不能在没有正当理由的情况下丢弃它们。你需要弄清出现异常值的原因(如数据输入错误),并在文章中说明丢弃它们的理由。
- I recommend replacing the unusual values with missing values. The easiest way to do this is to use
mutate()
to replace the variable with a modified copy. You can use theifelse()
function to replace unusual values withNA
:
我们建议使用缺失值来代替异常值。最简单的做法就是使用
mutate()
函数创建一个新变量来代替原来的变量。你可以使用ifelse()
函数将异常值替换为NA
:
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
-
Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:
-
A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
-
Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.
-
A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
按分类变量的分组显示连续变量分布的另一种方式是使用箱线图。箱线图是对变量值分布的一种简单可视化表示,这种图在统计学家中非常流行。每张箱线图都包括以下内容。
- 一个长方形箱子,下面的边表示分布的第25 > 个百分位数,上面的边表示分布的第75 > 个百分位数,上下两边的距离称为四分位距。箱子的中部有一条横线,表示分布的中位数,也就是分布的第50 > 个百分位数。这三条线可以表示分布的分散情况,还可以帮助我们明确数据是关于中位数对称的,还是偏向某一侧。
- 圆点表示落在箱子上下两边1.5 > 倍四分位距外的观测,这些离群点就是异常值,因此需要单独绘出。
- 从箱子上下两边延伸出的直线(或称为须)可以到达分布中最远的非离群点处。
- Scatterplots become less useful as the size of your dataset grows,
because points begin to overplot, and pile up into areas of uniform black. You’ve already seen one way to fix the problem: using the
alpha
aesthetic to add transparency.
随着数据集规模的不断增加,散点图的用处越来越小,因为数据点开始出现过绘制,并堆积在一片黑色区域中。我们已经介绍了解决这个问题的一种方法,即使用
alpha
图形属性添加透明度:
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
7. Work flow: projects
- You should never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
千万不要在脚本中使用绝对路径,因为不利于分享:没有任何人会和你具有完全相同的目录设置。
- R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via projects.
R 专家将与项目相关的所有文件放在一起,其中包括输入数据、R > 脚本、分析结果以及图形。因为这是极其明智而又通用的做法,所以RStudio通过项目对这种做法提供了内置的支持。
Wrangle
1. Introduction
2. Tibbles
3. Data import
- UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
UTF-8 可以为现在人类使用的所有字符进行编码,同时还支持很多特殊字符(如表情符号!)。
write_rds()
andread_rds()
are uniform wrappers around the base functionsreadRDS()
andsaveRDS()
. These store data in R’s custom binary format called RDS.
write_rds()
和read_rds()
函数是对基础函数readRDS()
和saveRDS()
的统一包装。前者可以将数据保存为R自定义的二进制格式,称为RDS格式。
4. Tidy data
5. Relational data
- The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
最常用的连接是左连接:只要想从另一张表中添加数据,就可以使用左连接,因为它会保留原表中的所有观测,即使它没有匹配。左连接应该是你的默认选择,除非有足够充分的理由选择其他的连接方式。
6. Strings
- By default these matches are “greedy”: they will match the longest string possible. You can make them “lazy”, matching the shortest string possible by putting a
?
after them. This is an advanced feature of regular expressions, but it’s useful to know that it exists.
默认的匹配方式是“贪婪的”:正则表达式会匹配尽量长的字符串。通过在正则表达式后面添加一个?,你可以将匹配方式更改为“懒惰的”,即匹配尽量短的字符串。虽然这是正则表达式的高级特性,但知道这一点是非常有用的。
- A word of caution before we continue: because regular expressions are so powerful, it’s easy to try and solve every problem with a single regular expression. In the words of Jamie Zawinski:
Some people, when confronted with a problem, think “I know, I’ll use > regular expressions.” Now they have two problems.
继续以下内容前,我们需要先提醒你一下:因为正则表达式太强大了,所以我们很容易认为所有问题都可以使用一个正则表达式来解决。正如Jamie > Zawinski 下面所说:
当遇到一个问题时,有些人会这样想:“我可以用正则表达式来搞定它。”于是,原来的一个问题就变成了两个问题。
7. Factors
- Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to
unique(x)
, or after the fact, withfct_inorder()
:
有时你会想让因子的顺序与初始数据的顺序保持一致。在创建因子时,将水平设置为
unique(x)
,或者在创建因子后再对其使用fct_inorder()
函数,就可以达到这个目的:
f1 <- factor(x1, levels = unique(x1))
f1 #> [1] Dec Apr Jan Mar #> Levels: Dec Apr Jan Mar
f2 <- x1 %>% factor() %>% fct_inorder()
f2 #> [1] Dec Apr Jan Mar #> Levels: Dec Apr Jan Mar
- More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is
fct_recode()
. It allows you to recode, or change, the value of each level.
比修改因子水平顺序更强大的操作是修改水平的值。修改水平不仅可以使得图形标签更美观清晰,以满足出版发行的要求,还可以将水平汇集成更高层次的显示。修改水平最常用、最强大的工具是
fct_recode()
函数,它可以对每个水平进行修改或重新编码。
- If you want to collapse a lot of levels,
fct_collapse()
is a useful variant offct_recode()
. For each new variable, you can provide a vector of old levels:
如果想要合并多个水平,那么可以使用
fct_recode()
函数的变体fct_collapse()
函数。对于每个新水平,你都可以提供一个包含原水平的向量:
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) %>%
count(partyid)
#> # A tibble: 4 x 2 #> partyid n #> <fct> <int>
#> 1 other 548 #> 2 rep 5346 #> 3 ind 8409 #> 4 dem 7180
8. Dates and times
- You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones.
只要能够满足需要,你就应该使用最简单的数据类型。这意味着只要能够使用日期型数据,那么就不应该使用日期时间型数据。日期时间型数据要复杂得多,因为它需要处理时区。
- Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.
注意,当将日期时间型数据作为数值使用时(比如在直方图中),1表示1秒,因此分箱宽度86400 > 才能够表示1天。对于日期型数据,1则表示1天。
- You can pull out individual parts of the date with the accessor functions
year()
,month()
,mday()
(day of the month),yday()
(day of the year),wday()
(day of the week),hour()
,minute()
, andsecond()
.
如果想要提取出日期中的独立成分,可以使用以下访问器函数:
year()
、month()
、mday()
(一个月中的第几天)、yday()
(一年中的第几天)、wday()
(一周中的第几天)、hour()
、minute()
和second()
。
datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
#> [1] 2016 month(datetime)
#> [1] 7 mday(datetime)
#> [1] 8
yday(datetime)
#> [1] 190 wday(datetime)
#> [1] 6
- How do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
如何在时期、阶段和区间中进行选择呢?一如既往,选择能够解决问题的最简单的数据结构。如果只关心物理时间,那么就使用时期;如果还需要考虑人工时间,那么就使用阶段;如果需要找出人工时间范围内有多长的时间间隔,那么就使用区间。
- Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes a convenient representation for computation. Operations that combine date-times, like
c()
, will often drop the time zone.
除非使用了其他设置,lubridate 总是使用UTC(Coordinated Universal > Time,国际标准时间)。UTC > 是科技界使用的时区标准,基本等价于它的前身GMT(Greenwich Mean > Time,格林尼治标准时间)。因为没有夏时制,所以它非常适合计算。
Program
1. Introduction
- Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you’re not working with other people, you’ll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.
编程产出代码,代码是一种沟通工具。很显然,代码可以告诉计算机你想要做什么,同时也可以用于人与人之间的交流。将代码当作一种沟通工具是非常重要的,因为现在所有项目基本上都要靠协作才能完成。即使你现在单枪匹马地工作,也肯定要与未来的自己进行交流!代码清晰易懂特别重要,这样其他人(包括未来的你)才能理解你为什么要使用这种方式进行分析。因此,提高编程能力的同时也要提高沟通能力。随着时间的推移,你不仅会希望代码更易于编写,还会希望它更容易为他人所理解。
2. Pipes
- You may also worry that this form creates many copies of your data and takes up a lot of memory. Surprisingly, that’s not the case. First, note that proactively worrying about memory is not a useful way to spend your time: worry about it when it becomes a problem (i.e. you run out of memory), not before. Second, R isn’t stupid, and it will share columns across data frames, where possible.
你可能还会担心这种代码创建的多个数据副本会占用大量内存。出人意料的是,这种担心大可不必。首先,你不应该将时间花在过早担心内存上。当内存确实成为问题(即内存耗尽)时再担心便是,否则就是杞人忧天。其次,R > 是很智能的,它会尽量在数据框之间共享数据列。
- I’m not a fan of this operator(
%<>%
) because I think assignment is such a special operation that it should always be clear when it’s occurring. In my opinion, a little bit of duplication (i.e. repeating the name of the object twice) is fine in return for making assignment more explicit.
我不是很喜欢这个操作符(
%<>%
),因为我认为赋值是一种非常特殊的操作,如果需要进行赋值,那么就应该使赋值语句尽量清晰。我的看法是,一点小小的重复(即重复输入对象名称两次)是必要的,它可以更加明确地表示出赋值语句。
3. Functions
- Writing good functions is a lifetime journey. Even after using R for many years I still learn new techniques and better ways of approaching old problems.
编写优秀函数可以作为一个人的毕生事业。即使已经使用R多年,但我们还是对新技术和使用更好的方法解决老问题孜孜以求。
- As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good code style is like correct punctuation. Youcanmanagewithoutit, but it sure makes things easier to read! As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.
除了函数开发,本章还会对代码风格提出一些建议。良好的代码风格就像正确使用标点符号一样重要。不使用标点符号你也一样可以写文章,但使用标点符号肯定可以让文章更通俗易懂。代码风格种类繁多,各具特色。这里介绍的只是我们所使用的代码风格,最重要的是风格要保持一致。
- Note the overall process: I only made the function after I’d figured out how to make it work with a simple input. It’s easier to start with working code and turn it into a function; it’s harder to create a function and then try to make it work.
注意以上创建函数的整体过程。确定函数如何使用简单输入来运行后,我们才开始编写函数。从工作代码开始,再将其转换为函数是相对容易的;先创建函数,再让其正确运行则是比较困难的。
- This is an important part of the “do not repeat yourself” (or DRY) principle. The more repetition you have in your code, the more places you need to remember to update when things change (and they always do!), and the more likely you are to create bugs over time.
这是“不要重复自己”(do not repeat > yourself,DRY)这一原则中的重要部分。代码中的重复部分越多,当事情发生变化时(这是必然的),你需要修改的地方就越多,随着时间的推移,代码中的隐患也会越来越多。
- It’s important to remember that functions are not just for the computer, but are also for humans. R doesn’t care what your function is called, or what comments it contains, but these are important for human readers.
一定要牢记一点,函数不只是面向计算机的,同时也是面向人的。R > 并不在意函数的名称是什么,或者其中有多少注释,但是,这些对于人类读者来说都是非常重要的。
- Use comments, lines starting with
#
, to explain the “why” of your code. You generally should avoid comments that explain the “what” or the “how”. If you can’t understand what the code does from reading it, you should think about how to rewrite it to be more clear.
你应该使用注释(即由# 开头的行)来解释代码,需要解释的是“为什么”,一般不用解释“是什么”或“如何做”。如果通过阅读无法理解代码的行为,那么就应该考虑如何重写代码才能让它更清晰。
- You can use
||
(or) and&&
(and) to combine multiple logical expressions. These operators are “short-circuiting”: as soon as||
sees the firstTRUE
it returnsTRUE
without computing anything else. As soon as&&
sees the firstFALSE
it returnsFALSE
. You should never use|
or&
in anif
statement: these are vectorised operations that apply to multiple values (that’s why you use them infilter()
). If you do have a logical vector, you can useany()
orall()
to collapse it to a single value.
你可以使用
||
(或)和&&
(与)操作符来组合多个逻辑表达式。这些操作符具有“短路效应”:只要||
遇到第一个TRUE
,那么就会返回TRUE
,不再计算其他表达式;只要&&
遇到第一个FALSE
,就会返回FALSE
,不再计算其他表达式。不能在if
语句中使用|
或&
,它们是向量化的操作符,只可以用于多个值(这就是我们在filter()
函数中使用它们的原因)。如果一定要使用逻辑向量,那么你可以使用any()
或all()
函数将其转换为单个逻辑值。
- The default value should almost always be the most common value. The few exceptions to this rule are to do with safety. For example, it makes sense for
na.rm
to default toFALSE
because missing values are important. Even thoughna.rm = TRUE
is what you usually put in your code, it’s a bad idea to silently ignore missing values by default.
默认值应该几乎总是最常用的值。这种原则的例外情况非常少,除非出于安全考虑。例如,将
na.rm
的默认值设为FALSE
是情有可原的,因为缺失值有时是非常重要的。虽然代码中经常使用的是na.rm = TRUE
,但是通过默认设置不声不响地忽略缺失值并不是一种良好的做法。
- Notice that when you call a function, you should place a space around
=
in function calls, and always put a space after a comma, not before (just like in regular English). Using whitespace makes it easier to skim the function for the important components.
注意,在调用函数时,应该在其中
=
的两端都加一个空格。逗号后面应该总是加一个空格,逗号前面则不要加空格(与英文写法相同)。使用空格可以使得函数的重要部分更易读。
- Note that when using
stopifnot()
you assert what should be true rather than checking for what might be wrong.
注意,如果使用了
stopifnot()
函数,那么你实际上是断言了哪些参数必须为真,而不是检查哪些参数可能是错的。
4. Vectors
- Integers have one special value:
NA
, while doubles have four:NA
,NaN
,Inf
and-Inf
. All three special valuesNaN
,Inf
and-Inf
can arise during division:
整型数据有1个特殊值
NA
,而双精度型数据则有4个特殊值:NA
、NaN
、Inf
和-Inf
。其他3个特殊值都可以由除法产生:
c(-1, 0, 1) / 0 #> [1] -Inf NaN Inf
- Avoid using
==
to check for these other special values. Instead use the helper functionsis.finite()
,is.infinite()
, andis.nan()
:
0 | Inf | NA | NaN | |
---|---|---|---|---|
is.finite() |
x | |||
is.infinite() |
x | |||
is.na() |
x | x | ||
is.nan() |
x |
不要使用
==
来检查这些特殊值,而应该使用辅助函数is.finite()
、is.infinite()
和is.nan()
。
- It remains to describe the class, which controls how generic functions work. Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input.
泛型函数是R中实现面向对象编程的关键,因为它允许函数根据不同类型的输入而进行不同的操作。
5. Iteration
- Typically you’ll be modifying a list or data frame with this sort of loop, so remember to use
[[
, not[
. You might have spotted that I used[[
in all my for loops: I think it’s better to use[[
even for atomic vectors because it makes it clear that I want to work with a single element.
一般来说,你可以使用类似的循环来修改列表或数据框,要记住使用
[[
,而不是[
。你或许已经发现了,我们在所有for
循环中使用的都是[[
。我们认为甚至在原子向量中最好也使用[[
,因为它可以明确表示我们要处理的是单个元素。
- The idea of passing a function to another function is an extremely powerful idea, and it’s one of the behaviors that makes R a functional programming language. It might take you a while to wrap your head around the idea, but it’s worth the investment.
将函数作为参数传入另一个函数的这种做法是一种非常强大的功能,它是促使R > 成为函数式编程语言的因素之一。你需要花些时间才能真正理解这种思想,但这绝对是值得的。
- Once you master these functions, you’ll find it takes much less time to solve iteration problems. But you should never feel bad about using a for loop instead of a map function. The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work. The important thing is that you solve the problem that you’re working on, not write the most concise and elegant code (although that’s definitely something you want to strive towards!). Some people will tell you to avoid for loops because they are slow. They’re wrong! (Well at least they’re rather out of date, as for loops haven’t been slow for many years.) The chief benefits of using functions like
map()
is not speed, but clarity: they make your code easier to write and to read.
一旦掌握了这些函数,你就会发现可以在解决迭代问题时节省大量时间。但你无须因为使用了
for
循环,没有使用映射函数而感到内疚。映射函数是一种高度抽象,需要花费很长时间才能理解其工作原理。重要的事情是解决工作中遇到的问题,而不是写出最简洁优雅的代码(尽管肯定也应该为之努力!)可能有些人会告诉你不要使用
for
循环,因为它们很慢。这些人完全错了!(至少他们已经赶不上时代了,因为for
循环已经有很多年都不慢了。)使用map()
函数的主要优势不是速度,而是简洁:它们可以让你的代码更易编写,也更易读。
Model
1. Introduction
- The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the model will capture true “signals” (i.e. patterns generated by the phenomenon of interest), and ignore “noise” (i.e. random variation that you’re not interested in).
模型的作用是提供一个简单的、低维度的数据集摘要。理想情况下,模型可以捕获真正的“信号”(即由我们感兴趣的现象生成的模式),并忽略“噪声”(即我们不感兴趣的随机变动)。
- Traditionally, the focus of modelling is on inference, or for confirming that an hypothesis is true. Doing this correctly is not complicated, but it is hard.
通常来说,建模的重点在于推断或验证假设是否为真。正确地完成这些任务并不复杂,但相当困难。
- You can use an observation as many times as you like for exploration, but you can only use it once for confirmation. As soon as you use an observation twice, you’ve switched from confirmation to exploration.
在进行数据探索时,一个观测可以使用任意多次,但进行假设验证时,一个观测只能使用一次。一旦使用两次观测,假设验证就会变成数据探索。
- This is necessary because to confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis. Otherwise you will be over optimistic.
这点非常必要,因为要想验证假设,你必须使用与生成假设的数据无关的数据。否则,你就过于乐观了。
2. Model basics
- It’s important to understand that a fitted model is just the closest model from a family of models. That implies that you have the “best” model (according to some criteria); it doesn’t imply that you have a good model and it certainly doesn’t imply that the model is “true”. George Box puts this well in his famous aphorism:
All models are wrong, but some are useful.
拟合模型只是模型族中与数据最接近的一个模型,理解这一点非常重要。这意味着你找到了“最佳”模型(按照某些标准),但并不意味着你找到了良好的模型,而且也绝不代表这个模型是“真的”。George > Box 有一句名言说得很好:
所有模型都是错误的,但有些是有用的。
- The goal of a model is not to uncover truth, but to discover a simple approximation that is still useful.
模型的目标不是发现真理,而是获得简单但有价值的近似。
- You could imagine iteratively making the grid finer and finer until you narrowed in on the best model. But there’s a better way to tackle that problem: a numerical minimization tool called Newton-Raphson search. The intuition of Newton-Raphson is pretty simple: you pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and then repeat again and again, until you can’t go any lower.
可以设想不断细化网格来最终找出最佳模型。但还有一个更好的方法可以解决这个问题,这种方法是名为“牛顿—拉夫逊搜索”的数值最小化工具。牛顿—拉夫逊方法的直观解释非常简单:先选择一个起点,环顾四周找到最陡的斜坡,并沿着这个斜坡向下滑行一小段,然后不断重复这个过程,直到不能再下滑为止。
- We’re going to focus on understanding a model by looking at its predictions. This has a big advantage: every type of predictive model makes predictions (otherwise what use would it be?) so we can use the same set of techniques to understand any type of predictive model.
这里我们准备通过预测来理解模型。这种方法的一大优点是,每种类型的预测模型都要进行预测(否则还有什么用处?),因此我们可以使用同样的技术来理解任何类型的预测模型。
- It’s also useful to see what the model doesn’t capture, the so-called residuals which are left after subtracting the predictions from the data. Residuals are powerful because they allow us to use models to remove striking patterns so we can study the subtler trends that remain.
找出模型未捕获的信息也是非常有用的,即所谓的残差,它是数据去除预测值后剩余的部分。残差是非常强大的,因为它允许我们使用模型去除数据中显著的模式,以便对剩余的微妙趋势进行研究。
- When you add variables with
+
, the model will estimate each effect independent of all the others. It’s possible to fit the so-called interaction by using*
. For example,y ~ x1 * x2
is translated toy = a_0 + a_1 * x1 + a_2 * x2 + a_12 * x1 * x2
. Note that whenever you use*
, both the interaction and the individual components are included in the model.
如果使用
+
添加变量,那么模型会独立地估计每个变量的效果,不考虑其他变量。如果使用*
,那么拟合的就是所谓的交互项。例如,y ~ x1 * x2
会转换为y = a_0 + a_1 * x1 + a_2 *x2 + a_12 * x1 * x2
。注意,只要使用了*
,交互项及其各个组成部分就都要包括在模型中。
- Note that the model that uses
+
has the same slope for each line, but different intercepts. The model that uses*
has a different slope and intercept for each line.
注意,在使用
+
的模型中,每条直线都具有同样的斜率,但截距不同。在使用*
的模型中,每条直线的斜率和截距都不相同。
- If your transformation involves
+
,*
,^
, or-
, you’ll need to wrap it inI()
so R doesn’t treat it like part of the model specification.
如果想要使用
+
、*
、^
或-
进行变量转换,那么就应该使用I()
对其进行包装,以便R在处理时不将它当作模型定义的一部分。
- Notice that the extrapolation outside the range of the data is clearly bad. This is the downside to approximating a function with a polynomial. But this is a very real problem with every model: the model can never tell you if the behaviour is true when you start extrapolating outside the range of the data that you have seen. You must rely on theory and science.
注意,当使用模型在数据范围外进行推断时,效果明显非常差。这是使用多项式近似函数的一个缺点。但这是所有模型都具有的一个实际问题:当对未知数据进行外推时,模型无法确保结果的真实性。你必须依靠相关的科学理论。
3. Model building
- Our model is finding the mean effect, but we have a lot of big outliers, so mean tends to be far away from the typical value. We can alleviate this problem by using a model that is robust to the effect of outliers:
MASS::rlm()
.
模型寻找的是平均效应,但我们的数据中有大量数值很大的离群点,因此平均趋势与典型值之间的差别比较大。如果想要改善这个问题,可以使用对离群点健壮的模型:
MASS::rlm()
。
- Either approach is reasonable. Making the transformed variable explicit is useful if you want to check your work, or use them in a visualization. But you can’t easily use transformations (like splines) that return multiple columns. Including the transformations in the model function makes life a little easier when you’re working with many different datasets because the model is self contained.
每种方法都有其合理性。如果想对工作进行检查,或者想对工作结果进行可视化表示,那么你就应该明确表示变量转换。谨慎使用返回多个列的那些转换(比如样条法)。如果正在处理多个不同的数据集,那么将转换放在模型公式中可以使得工作更容易一些,因为这时的模型是自成一体的。
4. Many models
Communicate
1. Introduction
- Unfortunately, these chapters focus mostly on the technical mechanics of communication, not the really hard problems of communicating your thoughts to other humans. However, there are lot of other great books about communication, which we’ll point you to at the end of each chapter.
遗憾的是,各章重点关注的是沟通的技术机制,而不是人与人之间想法的实际沟通。但是,我们将在每章末尾介绍一些关于实际沟通的优秀著作。
2. R Markdown
- When you knit the document, R Markdown sends the .Rmd file to knitr, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, which is responsible for creating the finished file. The advantage of this two step workflow is that you can create a very wide range of output formats.
在生成文档时,R Markdown 先将.Rmd > 文件发送给knitr,knitr会执行所有代码段,并创建一个新的Markdown文件(.md),其中包含所有代码和输出。然后knitr生成的Markdown文件再由pandoc进行处理,并生成最终文件。这种两阶段工作流的优点是可以创建多种输出格式。
- The following table summarises which types of output each option suppresses:
Option | Run code | Show code | Output | Plots | Messages | Warnings |
---|---|---|---|---|---|---|
eval = FALSE |
- | - | - | - | - | |
include = FALSE |
- | - | - | - | - | |
echo = FALSE |
- | |||||
results = "hide" |
- | |||||
fig.show = "hide" |
- | |||||
message = FALSE |
- | |||||
warning = FALSE |
- |
- Normally, each knit of a document starts from a completely clean slate. This is great for reproducibility, because it ensures that you’ve captured every important computation in code. However, it can be painful if you have some computations that take a long time. The solution is
cache = TRUE
. When set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.
一般来说,R Markdown > 在每次生成文档时都是完全从头开始的。这对于文档的可重复性非常重要,因为这样可以确保不漏掉代码中的每一步重要计算。但是,如果有些计算需要花费大量时间,那么每次重新生成文档都会是一个非常痛苦的过程。我们对这个问题的解决方案是使用
cache = TRUE
。当使用这个选项时,R > Markdown > 会将代码段输出保存在磁盘上一个具有特殊名称的文件中。在此后的运行中,knitr会检查代码是否进行了修改,如果没有修改,则继续使用缓存结果。
dependson
should contain a character vector of every chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies have changed.
dependson
应该包含每个代码段的一个字符向量,其中包括缓存代码段依赖的所有代码段。只要knitr检测到某个依赖代码段被修改了,就会重新运行缓存代码段以更新结果。
- As your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with
knitr::clean_cache()
.
因为缓存策略会逐渐变得复杂,所以应该定期使用
knitr::clean_cache()
命令清除所有缓存。
3. Graphics for communication
- The purpose of a plot title is to summaries the main finding. Avoid titles that just describe what the plot is, e.g. “A scatterplot of engine displacement vs. fuel economy”.
使用图形标题的目的是概括主要成果。尽量不要使用那些只对图形进行描述的标题,如“发动机排量与燃油效率散点图”。
caption
adds text at the bottom right of the plot, often used to describe the source of the data.
caption
可以在图形右下角添加文本,常用于描述数据来源。
- Note the use of
hjust
andvjust
to control the alignment of the label. Figure 28.1 shows all nine possible combinations.
hjust
和vjust
是用于控制标签的对齐方式。给出了所有9种可能的组合。
4. R Markdown formats
5. R Markdown work flow
- R Markdown is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences.
R Markdown > 的重要性还在于,它可以将文本和代码紧密地集成在一起。这使得它既可以开发代码,又可以记录你的想法,是一种非常棒的分析式笔记本。自然科学研究中一般都会有个实验记录本,分析式笔记本的一些用途与实验记录本是基本相同的。
- If you discover an error in a data file, never modify it directly, but instead write code to correct the value. Explain why you made the fix.
如果在某个数据文件中发现了一个错误,千万不要直接修改,而是应该通过编写代码来修改错误值,并解释为什么要进行这个修改。