๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ(Data Preprocessing data)

chaerlo127 2022. 4. 25. 01:52

โœจ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •(Data Preprocessing Data)

์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ๊ทธ๋Œ€๋กœ Data mining ํ•˜์ง€ ์•Š๊ณ , ๋ถ„์„ํ•˜๊ธฐ ์ ํ•ฉํ•˜๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€๊ณตํ•˜๋Š” ์ž‘์—…

 

 

✨ dplyr package

A package that helps you process data quickly.

Let's look at the functions this package provides.

  • filter() : extract rows
  • select() : extract columns (variables)
  • arrange() : sort
  • mutate() : add variables
  • summarise() : compute summary statistics
  • group_by() : grouping, split the data by group
  • left_join() : combine data by adding columns (variables)
  • bind_rows() : combine data by adding rows (records)

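bind_rows() matches columns by name, so the data frames being combined should have the same variables with the same names. A column that is missing from one of the inputs is filled with NA.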
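
The difference between combining columns and combining rows is easier to see on a tiny example. The following is a minimal sketch; the data frames test1, test2, group_a, group_b and group_c are hypothetical and made up for illustration, not taken from the source data.

```r
library(dplyr)

# Hypothetical data frames, for illustration only
test1 <- data.frame(id = 1:3, midterm = c(60, 80, 70))
test2 <- data.frame(id = 1:3, final = c(70, 83, 65))

# left_join() adds columns (variables), matching rows on the key given in `by`
joined <- left_join(test1, test2, by = "id")

group_a <- data.frame(id = 1:2, math = c(50, 60))
group_b <- data.frame(id = 3:4, math = c(70, 80))
group_c <- data.frame(id = 5L, english = 90)

# bind_rows() stacks rows (records); columns are matched by name
stacked <- bind_rows(group_a, group_b)

# A column missing from one input is filled with NA
mismatched <- bind_rows(group_a, group_c)

joined
stacked
mismatched
```

Here joined has the columns id, midterm and final, while stacked has four rows with the same two columns as its inputs.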

summarise() and group_by() are mostly used together with aggregation functions. They play a role similar to GROUP BY in a database; the counterpart of HAVING is a filter() applied after summarise().

install.packages("dplyr")
library(dplyr)

 

✨ Using dplyr functions

Let's practice the dplyr functions using the data from the source linked at the end of this post.

 

📚 filter()

# filter: base R indexing vs. dplyr
exam[exam$class == 1, ]
exam %>% filter(class == 1)

The two lines above return exactly the same rows.

%>% is the pipe operator; in RStudio, pressing Ctrl + Shift + M inserts %>%.
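
To see what the pipe actually does, here is a small sketch: x %>% f(y) is evaluated as f(x, y). The data frame scores is hypothetical and exists only for this illustration.

```r
library(dplyr)

# Hypothetical data, for illustration only
scores <- data.frame(class = c(1, 1, 2), math = c(50, 60, 45))

# x %>% f(y) is evaluated as f(x, y), so these two calls are equivalent
piped  <- scores %>% filter(class == 1)
direct <- filter(scores, class == 1)

identical(piped, direct)
```

The piped form pays off once several steps are chained, because the code reads top to bottom in the order the steps run.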

 

exam %>% filter(math > 80 & english < 90)

 

You can also store the filtered result and pull values out of it.

class1 <- exam %>% filter(class == 1)
mean(class1$math)

 

filter() plays the same role as the WHERE clause in a database.
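
Like a WHERE clause, filter() accepts more than a single equality test. The sketch below uses a hypothetical data frame exam_demo (made up here, with columns similar to the exam data) to show OR, membership and inequality conditions.

```r
library(dplyr)

# Hypothetical stand-in for the exam data, for illustration only
exam_demo <- data.frame(
  id      = 1:6,
  class   = c(1, 1, 2, 2, 3, 3),
  math    = c(50, 90, 45, 78, 25, 85),
  english = c(98, 97, 86, 98, 80, 89)
)

# OR: rows where either condition holds
either <- exam_demo %>% filter(math >= 90 | english >= 98)

# Membership: rows whose class is one of the listed values
some_classes <- exam_demo %>% filter(class %in% c(1, 3))

# Inequality: every row except class 1
not_class1 <- exam_demo %>% filter(class != 1)
```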

 

 

📚 select()

exam %>% select(math)
exam %>% select(math, english, science)
exam %>% select(-math) # every column except math

exam %>% 
  select(id, math) %>% 
  head(10)

 

📚 filter() & select()

exam %>% 
  filter(class==1) %>% 
  select(math, english)

 

 

📚 arrange()

Sorts rows, like ORDER BY in SQL.

exam %>% arrange(math)
exam %>% arrange(id)
exam %>% arrange(class)
exam %>% arrange(id, class)
exam %>% arrange(desc(class)) %>% head(10)

To sort in descending order, wrap the variable in desc().
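
Ascending and descending keys can be mixed in a single arrange() call. A minimal sketch, again on a hypothetical exam_demo data frame that is not part of the source data:

```r
library(dplyr)

# Hypothetical stand-in for the exam data, for illustration only
exam_demo <- data.frame(
  id    = 1:6,
  class = c(1, 1, 2, 2, 3, 3),
  math  = c(50, 90, 45, 78, 25, 85)
)

# Sort by class ascending, then by math descending within each class
arranged <- exam_demo %>% arrange(class, desc(math))
arranged
```

Later keys only break ties left by the earlier ones, so the order of the arguments matters.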

 

📚 mutate()

Creates derived variables, like adding a column to a table in a database (it adds a variable, i.e. a column).

exam %>% mutate(total = english + math + science)
exam %>% mutate(total = english + math + science, mean = total/3)
exam %>% mutate(test = ifelse(science>=60, "P", "F")) %>% head
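
One thing worth noting is that mutate() returns a new data frame and leaves the original untouched, so the result has to be assigned to keep it. The sketch below shows this on a hypothetical exam_demo data frame, and uses dplyr's case_when() as one possible way to go beyond the two outcomes of ifelse(). The grade cutoffs are arbitrary assumptions chosen for the example.

```r
library(dplyr)

# Hypothetical stand-in for the exam data, for illustration only
exam_demo <- data.frame(
  id      = 1:6,
  math    = c(50, 90, 45, 78, 25, 85),
  english = c(98, 97, 86, 98, 80, 89)
)

# Assign the result to keep the new columns
with_grade <- exam_demo %>%
  mutate(
    total = math + english,
    # Conditions are checked in order; the cutoffs are arbitrary
    grade = case_when(
      total >= 180 ~ "A",
      total >= 140 ~ "B",
      TRUE         ~ "C"
    )
  )

# The original data frame has no total column
"total" %in% names(exam_demo)
```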

 

 

📚 arrange() & mutate()

exam %>%
  mutate(total = english + math + science) %>%
  arrange(desc(total)) %>% 
  head

 

📚 group_by() & summarise()

exam %>% summarise(math_mean = mean(math))

exam %>%
  group_by(class) %>%
  summarise(math_mean = mean(math))

exam %>%
  group_by(class) %>%
  summarise(mean_math = mean(math),
            sum_math = sum(math),
            median_math = median(math),
            n = n()) # n is the frequency, i.e. the number of rows in each group

 

 

Function    Meaning
mean()      mean
sd()        standard deviation
sum()       sum
median()    median
min()       minimum
max()       maximum
n()         frequency (number of rows)
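
The functions in the table can be combined freely inside a single summarise() call. The following sketch uses a hypothetical exam_demo data frame (not the source data). It also shows that a filter() placed after summarise() works on the aggregated rows, roughly like HAVING in SQL; the threshold of 60 is an arbitrary assumption.

```r
library(dplyr)

# Hypothetical stand-in for the exam data, for illustration only
exam_demo <- data.frame(
  id    = 1:6,
  class = c(1, 1, 2, 2, 3, 3),
  math  = c(50, 90, 45, 78, 25, 85)
)

# One row per class, one column per summary statistic
by_class <- exam_demo %>%
  group_by(class) %>%
  summarise(math_mean = mean(math),
            math_sd   = sd(math),
            math_min  = min(math),
            math_max  = max(math),
            n         = n())

# Keep only the classes whose mean math score is at least 60
high_classes <- by_class %>% filter(math_mean >= 60)

by_class
high_classes
```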

 

 

📚 Practice

library(ggplot2) # provides the mpg dataset used below
mpg_audi <- mpg %>% filter(manufacturer == "audi")
mpg_toyota <- mpg %>% filter(manufacturer == "toyota")
mean(mpg_audi$hwy)
mean(mpg_toyota$hwy)

mpg_new <- mpg %>% select(class, cty)
mpg_new

mpg %>% filter(manufacturer == "audi") %>% arrange(desc(hwy)) %>% head(5)

mpg %>% group_by(manufacturer) %>%
  filter(class == "suv") %>% 
  mutate(mean_y = (cty + hwy)/2) %>% 
  summarise(mean_total = mean(mean_y)) %>% 
  arrange(desc(mean_total)) %>% 
  head(5)

 

 

[Source]

https://rstudio-pubs-static.s3.amazonaws.com/382545_098d268806f449c496734236e0b97493.html 

 
