不專業學術閒聊 (Unprofessional Academic Chat): R | Calculate row means, SD and SE

Why insist on using R when Excel is right there?
Episode: "To draw a plot, I first have to work through the very basics. QQ"

・How to merge two data frames
・How to pick out the values you want from a variable in the data
・How to define a matrix from the data
・How to calculate the mean, standard deviation (SD) and standard error (SE) of each row

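We need to load the packages below. matrixStats provides rowSds(), which is used to calculate the SD; rowMeans(), which is used for the mean, is part of base R.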

library(dplyr)
library(data.table)
library(tidyverse)
library(matrixStats)

Define a data frame, df1
Use data.frame() function to create a data frame

ID <- c("A", "B", "C", "D", "E", "F", "G")
N1 <- c(2,2.6,4,5,8,10,9)
N2 <- c(3,2.8,5,6,6,7,8)
N3 <- c(2,3.6,3,6,6,10,7)
df1 <- data.frame(ID, N1, N2, N3)

ID N1 N2 N3
1 A 2.0 3.0 2.0
2 B 2.6 2.8 3.6
3 C 4.0 5.0 3.0
4 D 5.0 6.0 6.0
5 E 8.0 6.0 6.0
6 F 10.0 7.0 10.0
7 G 9.0 8.0 7.0

Define another data frame, df2

P1 <- c(5,3.8,2,5,10,7,8)
P2 <- c(3,2.6,6,7,6,8,9)
P3 <- c(3,4.8,6,7,8,7,10)
df2 <- data.frame(ID, P1, P2, P3)

ID P1 P2 P3
1 A 5.0 3.0 3.0
2 B 3.8 2.6 4.8
3 C 2.0 6.0 6.0
4 D 5.0 7.0 7.0
5 E 10.0 6.0 8.0
6 F 7.0 8.0 7.0
7 G 8.0 9.0 10.0

Define a third data frame, df3, in which two variables (K3 and Day) hold characters rather than numbers.

K1 <- c(2,5,7,6,7,9,8)
K2 <- c(6,2,5,7,6,10,8)
K3 <- c("2","5","6","5","8","9","7")
Day <- c("AM", "PM", "PM", "AM", "PM", "AM", "AM")
df3 <- data.frame(ID, K1, K2, K3, Day)

ID K1 K2 K3 Day
1 A 2 6 2 AM
2 B 5 2 5 PM
3 C 7 5 6 PM
4 D 6 7 5 AM
5 E 7 6 8 PM
6 F 9 10 9 AM
7 G 8 8 7 AM
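Since K3 looks numeric when printed, it can help to check the type of each column before doing any arithmetic. The following is a minimal sketch in base R; `col_classes` and `k3_num` are names made up for this illustration, and `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# Rebuild df3 so the example runs on its own
ID  <- c("A", "B", "C", "D", "E", "F", "G")
K1  <- c(2, 5, 7, 6, 7, 9, 8)
K2  <- c(6, 2, 5, 7, 6, 10, 8)
K3  <- c("2", "5", "6", "5", "8", "9", "7")
Day <- c("AM", "PM", "PM", "AM", "PM", "AM", "AM")
df3 <- data.frame(ID, K1, K2, K3, Day, stringsAsFactors = FALSE)

# Class of every column: K3 is character even though it prints like numbers
col_classes <- sapply(df3, class)
print(col_classes)

# If K3 should be treated as numbers, convert it explicitly
k3_num <- as.numeric(df3$K3)
print(k3_num)
```

The conversion is done on a copy here, so df3 itself is unchanged and the rest of the post still applies as written.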

Combine 2 data frames by the variable "ID"
The 2 data frames need a common variable to merge on; in this example it is ID.

Use merge() function to combine the 2 data frames

total_df <- merge(df1, df2, by = "ID")

ID N1 N2 N3 P1 P2 P3
1 A 2.0 3.0 2.0 5.0 3.0 3.0
2 B 2.6 2.8 3.6 3.8 2.6 4.8
3 C 4.0 5.0 3.0 2.0 6.0 6.0
4 D 5.0 6.0 6.0 5.0 7.0 7.0
5 E 8.0 6.0 6.0 10.0 6.0 8.0
6 F 10.0 7.0 10.0 7.0 8.0 7.0
7 G 9.0 8.0 7.0 8.0 9.0 10.0
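Because merge() matches rows by the value of ID rather than by position, it still works when the rows are in a different order or when some IDs are missing from one side. Here is a small sketch of that behaviour; the data frames `left` and `right` are hypothetical and only exist for this illustration.

```r
# Two small data frames: different row order, and not all IDs in common
left  <- data.frame(ID = c("A", "B", "C"), N1 = c(2, 2.6, 4),
                    stringsAsFactors = FALSE)
right <- data.frame(ID = c("C", "A", "D"), P1 = c(2, 5, 5),
                    stringsAsFactors = FALSE)

# Default merge keeps only IDs found in both data frames
inner <- merge(left, right, by = "ID")
print(inner)

# all = TRUE keeps every ID and fills the gaps with NA
outer <- merge(left, right, by = "ID", all = TRUE)
print(outer)
```

In the post's example every ID appears in both data frames, so the default behaviour and `all = TRUE` would give the same table.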

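Another way to combine data frames is cbind(), which simply attaches the second data frame to the right of the first, pairing rows by position. The two data frames therefore do not need a shared variable, unlike merge(), which joins on "ID". As a result, combining df1 and df2 with cbind() produces two "ID" columns.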

total_df2 <- cbind(df1, df2)

ID N1 N2 N3 ID P1 P2 P3
1 A 2.0 3.0 2.0 A 5.0 3.0 3.0
2 B 2.6 2.8 3.6 B 3.8 2.6 4.8
3 C 4.0 5.0 3.0 C 2.0 6.0 6.0
4 D 5.0 6.0 6.0 D 5.0 7.0 7.0
5 E 8.0 6.0 6.0 E 10.0 6.0 8.0
6 F 10.0 7.0 10.0 F 7.0 8.0 7.0
7 G 9.0 8.0 7.0 G 8.0 9.0 10.0

To get the same result as total_df with cbind(), drop the ID column from df2 first.
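One way to drop the ID column before calling cbind() is to select the remaining columns by name. This is a sketch rather than the only option (`df2[, -1]` would also work here); `total_cb` is a name chosen for this example.

```r
# Rebuild df1 and df2 so the example runs on its own
ID <- c("A", "B", "C", "D", "E", "F", "G")
df1 <- data.frame(ID,
                  N1 = c(2, 2.6, 4, 5, 8, 10, 9),
                  N2 = c(3, 2.8, 5, 6, 6, 7, 8),
                  N3 = c(2, 3.6, 3, 6, 6, 10, 7),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID,
                  P1 = c(5, 3.8, 2, 5, 10, 7, 8),
                  P2 = c(3, 2.6, 6, 7, 6, 8, 9),
                  P3 = c(3, 4.8, 6, 7, 8, 7, 10),
                  stringsAsFactors = FALSE)

# Keep only P1..P3 from df2, then attach them to df1 column-wise
total_cb <- cbind(df1, df2[, c("P1", "P2", "P3")])
print(total_cb)
```

This only gives the right answer because df1 and df2 list the IDs in the same order; cbind() does not check that the rows belong together.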

Next, two ways of working with this kind of data.

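1. Pick out the part of the data you want. filter(), introduced earlier, keeps the rows that satisfy a condition. Here the goal is to keep rows according to the values of a particular variable, using the syntax: var %in% c(" ")

%in% finds the entries of a variable that match any of the values listed in c(" ").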
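To see what %in% does by itself, it can be applied to plain vectors: it returns one TRUE or FALSE per element, and that logical vector is what decides which rows are kept. A minimal sketch, where `is_am` and `hit` are names made up for this illustration:

```r
ID  <- c("A", "B", "C", "D", "E", "F", "G")
Day <- c("AM", "PM", "PM", "AM", "PM", "AM", "AM")
K2  <- c(6, 2, 5, 7, 6, 10, 8)

# One logical value per element: is this Day equal to "AM"?
is_am <- Day %in% c("AM")
print(is_am)

# Works the same way with numbers and with several candidate values
hit <- K2 %in% c(2, 6, 10)
print(hit)

# Using the logical vector to pick the matching IDs
print(ID[hit])
```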

For how to use filter(), see this post: R | Data manipulation

Here is how to use it. First, merge all three data frames together.

total_df3 <- merge(total_df, df3, by = "ID")

>total_df3
ID N1 N2 N3 P1 P2 P3 K1 K2 K3 Day
1 A 2.0 3.0 2.0 5.0 3.0 3.0 2 6 2 AM
2 B 2.6 2.8 3.6 3.8 2.6 4.8 5 2 5 PM
3 C 4.0 5.0 3.0 2.0 6.0 6.0 7 5 6 PM
4 D 5.0 6.0 6.0 5.0 7.0 7.0 6 7 5 AM
5 E 8.0 6.0 6.0 10.0 6.0 8.0 7 6 8 PM
6 F 10.0 7.0 10.0 7.0 8.0 7.0 9 10 9 AM
7 G 9.0 8.0 7.0 8.0 9.0 10.0 8 8 7 AM

To pick out the IDs in total_df3 whose Day is "AM", the condition is: Day %in% c("AM")

t1 <- filter(total_df3, Day %in% c("AM"))

>t1
ID N1 N2 N3 P1 P2 P3 K1 K2 K3 Day
1 A 2 3 2 5 3 3 2 6 2 AM
2 D 5 6 6 5 7 7 6 7 5 AM
3 F 10 7 10 7 8 7 9 10 9 AM
4 G 9 8 7 8 9 10 8 8 7 AM

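The values in c(" ") do not have to be characters; they can also be numbers, and you can select two or more values of a variable. For example, to pick out 2, 6 and 10 from K2:

t2 <- filter(total_df3, K2 %in% c(2, 6, 10))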

>t2
ID N1 N2 N3 P1 P2 P3 K1 K2 K3 Day
1 A 2.0 3.0 2.0 5.0 3.0 3.0 2 6 2 AM
2 B 2.6 2.8 3.6 3.8 2.6 4.8 5 2 5 PM
3 E 8.0 6.0 6.0 10.0 6.0 8.0 7 6 8 PM
4 F 10.0 7.0 10.0 7.0 8.0 7.0 9 10 9 AM

Two conditions can also be set at once. For example, to keep the rows where K2 is 6 or Day is AM, use the | operator.

t3 <- filter(total_df3, K2 == 6 | Day %in% c("AM"))

>t3
ID N1 N2 N3 P1 P2 P3 K1 K2 K3 Day
1 A 2 3 2 5 3 3 2 6 2 AM
2 D 5 6 6 5 7 7 6 7 5 AM
3 E 8 6 6 10 6 8 7 6 8 PM
4 F 10 7 10 7 8 7 9 10 9 AM
5 G 9 8 7 8 9 10 8 8 7 AM
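The counterpart of | (or) is & (and), which keeps only the rows that meet both conditions. The sketch below uses base R logical indexing on a cut-down data frame so that it runs without dplyr; `small_df`, `either` and `both` are names invented for this example. Inside filter(), the same conditions can be joined with & or separated by a comma.

```r
# A cut-down version of total_df3 with just the columns needed here
small_df <- data.frame(
  ID  = c("A", "B", "C", "D", "E", "F", "G"),
  K2  = c(6, 2, 5, 7, 6, 10, 8),
  Day = c("AM", "PM", "PM", "AM", "PM", "AM", "AM"),
  stringsAsFactors = FALSE
)

# OR: K2 is 6, or Day is AM (same rows as t3)
either <- small_df[small_df$K2 == 6 | small_df$Day %in% c("AM"), ]
print(either)

# AND: K2 is 6 and Day is AM at the same time
both <- small_df[small_df$K2 == 6 & small_df$Day %in% c("AM"), ]
print(both)
```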

2. Calculate the mean and SD (standard deviation) of each row.

Calculate mean of each row (= rmeans)
Use mutate() function to assign a new variable
Use rowMeans() function to calculate means

In the code below, rmeans is the mean of each row: rowMeans() calculates it, and mutate() adds it as a new variable in the last column of the table.

total_RM <- total_df %>%
mutate(rmeans = rowMeans(total_df[,2:7])) %>%
as.data.frame %>%
print()

ID N1 N2 N3 P1 P2 P3 rmeans
1 A 2.0 3.0 2.0 5.0 3.0 3.0 3.000000
2 B 2.6 2.8 3.6 3.8 2.6 4.8 3.366667
3 C 4.0 5.0 3.0 2.0 6.0 6.0 4.333333
4 D 5.0 6.0 6.0 5.0 7.0 7.0 6.000000
5 E 8.0 6.0 6.0 10.0 6.0 8.0 7.333333
6 F 10.0 7.0 10.0 7.0 8.0 7.0 8.166667
7 G 9.0 8.0 7.0 8.0 9.0 10.0 8.500000

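total_df[, 2:7] above refers to columns 2 through 7 of total_df, so the mean of columns 2 to 7 is calculated for each row. (If you remember the basic syntax from the first post, the inside of [ , ] is [row, col].)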
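As a quick reminder of the [row, col] convention, here is a small sketch on a three-row data frame. The data frame `demo_df` and the names `row2`, `cols` and `cell` are made up for this illustration.

```r
demo_df <- data.frame(ID = c("A", "B", "C"),
                      N1 = c(2, 2.6, 4),
                      N2 = c(3, 2.8, 5),
                      stringsAsFactors = FALSE)

# Leaving the column part empty selects every column of row 2
row2 <- demo_df[2, ]
print(row2)

# Leaving the row part empty selects every row of columns 2 to 3
cols <- demo_df[, 2:3]
print(cols)

# Both parts filled in: a single cell (row 2, column "N1")
cell <- demo_df[2, "N1"]
print(cell)
```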

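Because you can specify which columns go into the mean, the same works for total_df3, as long as the last two columns, which hold characters, are left out; that is, calculate the mean over columns N1 through K2.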

total_RM <- total_df3 %>%
mutate(rmeans = rowMeans(total_df3[,2:9])) %>%
as.data.frame %>%
print()

ID N1 N2 N3 P1 P2 P3 K1 K2 K3 Day rmeans
1 A 2.0 3.0 2.0 5.0 3.0 3.0 2 6 2 AM 3.250
2 B 2.6 2.8 3.6 3.8 2.6 4.8 5 2 5 PM 3.400
3 C 4.0 5.0 3.0 2.0 6.0 6.0 7 5 6 PM 4.750
4 D 5.0 6.0 6.0 5.0 7.0 7.0 6 7 5 AM 6.125
5 E 8.0 6.0 6.0 10.0 6.0 8.0 7 6 8 PM 7.125
6 F 10.0 7.0 10.0 7.0 8.0 7.0 9 10 9 AM 8.500
7 G 9.0 8.0 7.0 8.0 9.0 10.0 8 8 7 AM 8.375

* For how to use mutate() and %>%, see: this post
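Counting column positions by hand gets error-prone as a table grows. One alternative, sketched here in base R, is to let R find the numeric columns and average only those. The data frame `demo_df` and the names `is_num` and `rmeans_auto` are made up for this illustration.

```r
# Two rows with a mix of numeric and character columns
demo_df <- data.frame(ID  = c("A", "B"),
                      N1  = c(2, 2.6),
                      K2  = c(6, 2),
                      K3  = c("2", "5"),
                      Day = c("AM", "PM"),
                      stringsAsFactors = FALSE)

# TRUE for numeric columns, FALSE for the rest
is_num <- sapply(demo_df, is.numeric)
print(names(demo_df)[is_num])

# Row means over the numeric columns only
rmeans_auto <- rowMeans(demo_df[, is_num])
print(rmeans_auto)
```

Note that this picks up every numeric column, so it is only appropriate when all of them are measurements that belong in the mean.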

Define a matrix, m1 (rowSds() needs a matrix, so the data frame has to be converted to one before calculating the standard deviation.)
Use as.matrix() function to turn the data frame into a matrix

m1 <- as.matrix(total_df[,2:7])

N1 N2 N3 P1 P2 P3
[1,] 2.0 3.0 2.0 5.0 3.0 3.0
[2,] 2.6 2.8 3.6 3.8 2.6 4.8
[3,] 4.0 5.0 3.0 2.0 6.0 6.0
[4,] 5.0 6.0 6.0 5.0 7.0 7.0
[5,] 8.0 6.0 6.0 10.0 6.0 8.0
[6,] 10.0 7.0 10.0 7.0 8.0 7.0
[7,] 9.0 8.0 7.0 8.0 9.0 10.0

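The t3 from above can also be turned into a matrix, but note that K3 holds characters. If it is included, every element of the matrix becomes character, so the matrix here can only extend to K2, excluding the K3 column.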

m2 <- as.matrix(t3[,2:9])

N1 N2 N3 P1 P2 P3 K1 K2
[1,] 2 3 2 5 3 3 2 6
[2,] 5 6 6 5 7 7 6 7
[3,] 8 6 6 10 6 8 7 6
[4,] 10 7 10 7 8 7 9 10
[5,] 9 8 7 8 9 10 8 8
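The effect of a character column on as.matrix() can be checked directly. This is a minimal sketch; `demo_df`, `m_chr`, `m_num` and `m_fixed` are names made up for this illustration.

```r
demo_df <- data.frame(K1 = c(2, 5, 7),
                      K3 = c("2", "5", "6"),
                      stringsAsFactors = FALSE)

# Including the character column turns the whole matrix into character
m_chr <- as.matrix(demo_df)
print(is.character(m_chr))

# Leaving it out keeps the matrix numeric
m_num <- as.matrix(demo_df["K1"])
print(is.numeric(m_num))

# Converting K3 to numbers first also gives a numeric matrix
demo_df$K3 <- as.numeric(demo_df$K3)
m_fixed <- as.matrix(demo_df)
print(is.numeric(m_fixed))
```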

Next, we can calculate the mean and SD for m1 and m2. In other words, you can pick out just the data you want, m2 in this example, and calculate the mean and SD for it.

Calculate standard deviation (SD)
Use rowSds() function to calculate standard deviation of each row

SD of each row: rowSds(matrix, na.rm = TRUE)

* Setting na.rm = TRUE omits NA values (our data frame has no NA, so it makes no difference here).
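To see when na.rm does matter, here is a sketch with one missing value. It uses base R's apply() with sd() as a stand-in so that it runs without matrixStats; the matrix `m_na` and the names `sd_keep` and `sd_drop` are made up for this illustration.

```r
# Row 1 has a missing value, row 2 is complete
m_na <- rbind(c(2, 3, NA, 5),
              c(1, 2, 3, 4))

# Without na.rm, any row containing NA gives NA
sd_keep <- apply(m_na, 1, sd)
print(sd_keep)

# With na.rm = TRUE, the NA is dropped and the SD uses the remaining values
sd_drop <- apply(m_na, 1, sd, na.rm = TRUE)
print(sd_drop)
```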

transform(m1, SD = rowSds(m1, na.rm = TRUE))

With transform(), the result is a table: m1 with the calculated SD appended as a new column.

N1 N2 N3 P1 P2 P3 SD
1 2.0 3.0 2.0 5.0 3.0 3.0 1.0954451
2 2.6 2.8 3.6 3.8 2.6 4.8 0.8710147
3 4.0 5.0 3.0 2.0 6.0 6.0 1.6329932
4 5.0 6.0 6.0 5.0 7.0 7.0 0.8944272
5 8.0 6.0 6.0 10.0 6.0 8.0 1.6329932
6 10.0 7.0 10.0 7.0 8.0 7.0 1.4719601
7 9.0 8.0 7.0 8.0 9.0 10.0 1.0488088

The new table can be assigned to m1_sd, like this:

m1_sd <- transform(m1, SD = rowSds(m1, na.rm = TRUE))

Typing m1_sd in the Console then prints the table above.

The SD can also be calculated directly, but the output is then just a vector of numbers rather than a table, as shown below.

total_sd = rowSds(m1, na.rm = TRUE)

> total_sd
[1] 1.0954451 0.8710147 1.6329932 0.8944272 1.6329932 1.4719601 1.0488088
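The row SDs can be cross-checked with base R by applying sd() to each row of the matrix. The sketch below rebuilds m1 from the values used in this post; `sd_base` is a name chosen for this example, and the comparison with matrixStats only runs if that package is installed.

```r
# Same values as m1 in the post
m1 <- cbind(
  N1 = c(2, 2.6, 4, 5, 8, 10, 9),
  N2 = c(3, 2.8, 5, 6, 6, 7, 8),
  N3 = c(2, 3.6, 3, 6, 6, 10, 7),
  P1 = c(5, 3.8, 2, 5, 10, 7, 8),
  P2 = c(3, 2.6, 6, 7, 6, 8, 9),
  P3 = c(3, 4.8, 6, 7, 8, 7, 10)
)

# MARGIN = 1 means "apply the function to each row"
sd_base <- apply(m1, 1, sd)
print(round(sd_base, 7))

# Optional: compare with rowSds() when matrixStats is available
if (requireNamespace("matrixStats", quietly = TRUE)) {
  print(all.equal(sd_base, matrixStats::rowSds(m1)))
}
```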

Calculate standard error (SE)

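The SE can be calculated as well. The formula is: SE = SD / sqrt(n)

n is the sample size of each row. In m1, for example, each ID (that is, each row) has six values: N1, N2, N3, P1, P2, P3.

n <- ncol(m1)  # 6 values per row
total_se <- total_sd / sqrt(n)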
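When some rows have missing values, n is no longer the same for every row. One way to handle that, sketched here in base R, is to count the non-missing values per row and divide by the square root of that count. The matrix `m_na` and the names `n_row`, `sd_row` and `se_row` are made up for this illustration.

```r
# Row 1 is complete (6 values), row 2 has one NA (5 usable values)
m_na <- rbind(c(2, 3, 2, 5, 3, 3),
              c(8, 6, NA, 10, 6, 8))

# Number of non-missing values in each row
n_row <- rowSums(!is.na(m_na))
print(n_row)

# Row SDs with the NA dropped, then SE = SD / sqrt(n) row by row
sd_row <- apply(m_na, 1, sd, na.rm = TRUE)
se_row <- sd_row / sqrt(n_row)
print(se_row)
```

With no missing values this reduces to the fixed n used in the post, since every row then has the same count.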

Calculate standard deviation (sd) and standard error (se)
Create new variables "sd" and "se" and assign a data frame

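To calculate the mean, SD and SE of each row in total_df and present them as a table, set it up as below. rmeans is the mean of each row, and sd and se are the corresponding SD and SE. The syntax was explained above; here mutate() just turns them into new variables, shown in the last three columns.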

total_df_sum <- total_df %>%
mutate(rmeans = rowMeans(total_df[, 2:7]),
sd = rowSds(m1, na.rm = TRUE),
se = sd / sqrt(6)) %>%
as.data.frame()

ID N1 N2 N3 P1 P2 P3 rmeans sd se
1 A 2.0 3.0 2.0 5.0 3.0 3.0 3.000000 1.0954451 0.4472136
2 B 2.6 2.8 3.6 3.8 2.6 4.8 3.366667 0.8710147 0.3555903
3 C 4.0 5.0 3.0 2.0 6.0 6.0 4.333333 1.6329932 0.6666667
4 D 5.0 6.0 6.0 5.0 7.0 7.0 6.000000 0.8944272 0.3651484
5 E 8.0 6.0 6.0 10.0 6.0 8.0 7.333333 1.6329932 0.6666667
6 F 10.0 7.0 10.0 7.0 8.0 7.0 8.166667 1.4719601 0.6009252
7 G 9.0 8.0 7.0 8.0 9.0 10.0 8.500000 1.0488088 0.4281744

You can also do the calculation for just the rows you picked out, for example the IDs in t3.

t3_sum <- t3 %>%
mutate(rmeans = rowMeans(t3[,2:9]),
sd = rowSds(m2, na.rm = TRUE),
se = sd / sqrt(8)) %>%
as.data.frame()
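Since the same three columns are added each time, the steps can be wrapped in a small helper that takes a data frame and the columns to summarise, and works out n by itself instead of hard-coding 6 or 8. This is only a sketch in base R: `row_summary()` is a hypothetical helper written for this illustration, not a function from dplyr or matrixStats.

```r
# Hypothetical helper: append row mean, SD and SE for the chosen columns
row_summary <- function(df, cols) {
  m <- as.matrix(df[, cols, drop = FALSE])    # numeric matrix of the chosen columns
  n <- rowSums(!is.na(m))                     # values available in each row
  sd_row <- apply(m, 1, sd, na.rm = TRUE)     # row-wise SD
  cbind(df,
        rmeans = rowMeans(m, na.rm = TRUE),
        sd = sd_row,
        se = sd_row / sqrt(n))
}

# Usage on two rows taken from total_df (IDs A and D)
demo_df <- data.frame(ID = c("A", "D"),
                      N1 = c(2, 5), N2 = c(3, 6), N3 = c(2, 6),
                      P1 = c(5, 5), P2 = c(3, 7), P3 = c(3, 7),
                      stringsAsFactors = FALSE)
out <- row_summary(demo_df, c("N1", "N2", "N3", "P1", "P2", "P3"))
print(out)
```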

Well, that's it for now.

Wednesday, September 13, 2017
