[R] Data manipulation functions

[reshape2 package] melt, acast, dcast

AI gina 2022. 4. 15. 03:45

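- The reshape2 package is a faster rewrite of the reshape package.

- Its melt() function converts wide data (many columns) into long data (many rows),

- and its cast functions, acast() and dcast(), convert long data back into wide form.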

 

1. Converting wide data to long: the melt() function

melt(data, id.vars = "id columns", measure.vars = "columns to melt")

> library(reshape2)
> names(airquality) <- tolower(names(airquality))
> melt_test <- melt(airquality, id.vars = c("month", "wind"), 
                    measure.vars = "ozone")
> head(melt_test)
  month wind variable value
1     5  7.4    ozone    41
2     5  8.0    ozone    36
3     5 12.6    ozone    12
4     5 11.5    ozone    18
5     5 14.3    ozone    NA
6     5 14.9    ozone    28
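
As a complement to the airquality example, here is a minimal sketch of melt() with more than one measure variable. The `scores` data frame and its column names are made up for illustration; `variable.name` and `value.name` are melt() arguments that rename the two generated columns.

```r
library(reshape2)

# Hypothetical data: one row per student, one column per subject
scores <- data.frame(
  student = c("A", "B", "C"),
  math    = c(90, 75, 60),
  english = c(80, 85, 70)
)

# id.vars stay as columns; measure.vars are stacked into name/value pairs
scores_long <- melt(scores,
                    id.vars       = "student",
                    measure.vars  = c("math", "english"),
                    variable.name = "subject",
                    value.name    = "score")

scores_long
nrow(scores_long)  # 3 students x 2 subjects = 6 rows
```

Each student now appears once per subject, which is the layout the cast functions expect as input.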

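2. Converting long data to wide: the cast functions

 - reshape2 has no single cast() function; the cast operation is split into two functions according to the type of output.

 - acast() : reshapes the data and returns a vector, matrix, or array

 - dcast() : reshapes the data and returns a data frame

dcast(data, id columns ~ column to spread)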

> aq_melt <- melt(airquality, id.vars = c("month","day"), na.rm = TRUE)
> head(aq_melt)
  month day variable value
1     5   1    ozone    41
2     5   2    ozone    36
3     5   3    ozone    12
4     5   4    ozone    18
6     5   6    ozone    28
7     5   7    ozone    23
> aq_dcast <- dcast(aq_melt, month + day ~ variable) #month and day identify the rows; each value of variable becomes a column
> head(aq_dcast)
  month day ozone solar.r wind temp
1     5   1    41     190  7.4   67
2     5   2    36     118  8.0   72
3     5   3    12     149 12.6   74
4     5   4    18     313 11.5   62
5     5   5    NA      NA 14.3   56
6     5   6    28      NA 14.9   66
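
The same reshaping can be sketched on a tiny, made-up long table, which makes it easier to see what the formula does. The `long` data frame below is hypothetical. Because its value column is not called `value`, the sketch names it explicitly with `value.var`; if `value.var` is left out, dcast() guesses the value column and prints a message.

```r
library(reshape2)

# Hypothetical long data: one row per (student, subject) pair
long <- data.frame(
  student = c("A", "B", "C", "A", "B", "C"),
  subject = c("math", "math", "math", "english", "english", "english"),
  score   = c(90, 75, 60, 80, 85, 70)
)

# Left of ~ : columns that identify a row of the result
# Right of ~: column whose distinct values become new columns
wide <- dcast(long, student ~ subject, value.var = "score")

wide
```

The result has one row per student and one column per subject, the inverse of what melt() produces.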

acast(data, row variable ~ column variable ~ third-dimension variable)

> head(acast(aq_melt, day ~ month ~ variable)) #day becomes the rows, month becomes the columns, and variable indexes the third dimension of the array
, , ozone

   5  6   7  8  9
1 41 NA 135 39 96
2 36 NA  49  9 78
3 12 NA  32 16 73
4 18 NA  NA 78 91
5 NA NA  64 35 47
6 28 NA  40 66 32

, , solar.r

    5   6   7  8   9
1 190 286 269 83 167
2 118 287 248 24 197
3 149 242 236 77 183
4 313 186 101 NA 189
5  NA 220 175 NA  95
6  NA 264 314 NA  92

- Arranging a data set as an array with acast() makes it easy to compare values item by item at a glance.
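
One practical consequence of getting an array back is that individual cells and slices can be looked up by name. The following is a small sketch on made-up data (the `long` data frame and its values are hypothetical), using the same `day ~ month ~ variable` formula as above.

```r
library(reshape2)

# Hypothetical long data: 2 days x 2 months x 2 variables
long <- expand.grid(day = 1:2, month = 5:6,
                    variable = c("ozone", "temp"),
                    stringsAsFactors = FALSE)
long$value <- c(41, 36, 10, 20, 67, 72, 78, 74)

# Rows = day, columns = month, third dimension = variable
arr <- acast(long, day ~ month ~ variable, value.var = "value")

dim(arr)                 # three dimensions of size 2 each
arr["2", "5", "ozone"]   # one cell: day 2, month 5, ozone
arr[, , "temp"]          # one slice: the day x month matrix for temp
```

The dimension names are character strings, so days and months are indexed as "2" and "5" rather than as numbers.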

 

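3. Summarizing data with the cast functions

 - A distinguishing feature of the cast functions is that they can also summarize data: pass an aggregation function as the third argument.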

> acast(aq_melt, month ~ variable, mean) #pass only the name of the summary function, without parentheses
     ozone  solar.r      wind     temp
5 23.61538 181.2963 11.622581 65.54839
6 29.44444 190.1667 10.266667 79.10000
7 59.11538 216.4839  8.941935 83.90323
8 59.96154 171.8571  8.793548 83.96774
9 31.44828 167.4333 10.180000 76.90000

> dcast(aq_melt, month ~ variable, sum) #sum of the values
  month ozone solar.r  wind temp
1     5   614    4895 360.3 2032
2     6   265    5705 308.0 2373
3     7  1537    6711 277.2 2601
4     8  1559    4812 272.6 2603
5     9   912    5023 305.4 2307

> dcast(aq_melt, month ~ variable, length) #number of values
  month ozone solar.r wind temp
1     5    26      27   31   31
2     6     9      30   30   30
3     7    26      31   31   31
4     8    26      28   31   31
5     9    29      30   30   30
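
The summaries above work because the NAs were already dropped in melt() with na.rm = TRUE. As a variation, the sketch below keeps the NAs when melting and handles them at aggregation time instead: extra arguments to dcast() are passed on to the aggregation function, and an anonymous function can be used in place of a function name. The object names (`aq`, `aq_long`, `monthly_mean`, `monthly_n`) are chosen here for illustration.

```r
library(reshape2)

# Work on a copy so the built-in airquality data set is left unchanged
aq <- airquality
names(aq) <- tolower(names(aq))

# Melt without na.rm, so missing values stay in the long data
aq_long <- melt(aq, id.vars = c("month", "day"))

# na.rm = TRUE is forwarded to mean()
monthly_mean <- dcast(aq_long, month ~ variable, mean, na.rm = TRUE)

# An anonymous function also works: count the non-missing values per cell
monthly_n <- dcast(aq_long, month ~ variable, function(x) sum(!is.na(x)))

monthly_mean
monthly_n
```

If several rows fall into one cell and no aggregation function is given, dcast() falls back to length and says so in a message, so it is usually clearer to name the function explicitly.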

 

[Reference]

reshape2: a reboot of the reshape package

Reshape2 is a reboot of the reshape package. It's been over five years since the first release of the package, and in that time I've learned a tremendous amount about R programming, and how to work with data in R. Reshape2 uses that knowledge to make a new package for reshaping data that is much more focussed and much much faster.

This version improves speed at the cost of functionality, so I have renamed it to reshape2 to avoid causing problems for existing users. Based on user feedback I may reintroduce some of these features.

What's new in reshape2:

  • considerably faster and more memory efficient thanks to a much better underlying algorithm that uses the power and speed of subsetting to the fullest extent, in most cases only making a single copy of the data.
  • cast is replaced by two functions depending on the output type: dcast produces data frames, and acast produces matrices/arrays.
  • multidimensional margins are now possible: grand_row and grand_col have been dropped: now the name of the margin refers to the variable that has its value set to (all).
  • some features have been removed such as the | cast operator, and the ability to return multiple values from an aggregation function. I'm reasonably sure both these operations are better performed by plyr.
  • a new cast syntax which allows you to reshape based on functions
    of variables (based on the same underlying syntax as plyr):
  • better development practices like namespaces and tests.

*Source: https://stat.ethz.ch/pipermail/r-packages/2010/001169.html