Input data format:
date | term | value |
2012-01-01 | term1 | 3 |
2012-01-01 | term2 | 4 |
2012-01-01 | term4 | 5 |
2012-01-01 | term5 | 6 |
2012-01-01 | term6 | 7 |
2012-01-02 | term1 | 3 |
2012-01-02 | term2 | 4 |
2012-01-02 | term3 | 5 |
2012-01-02 | term5 | 7 |
2012-01-02 | term6 | 8 |
2012-01-03 | term3 | 4 |
2012-01-03 | term4 | 3 |
Desired output format:
date | term1 | term2 | term3 | term4 | term5 | term6 |
2012-01-01 | 3 | 4 | 0 | 5 | 6 | 7 |
2012-01-02 | 3 | 4 | 5 | 0 | 7 | 8 |
2012-01-03 | 0 | 0 | 4 | 3 | 0 | 0 |
Instructions:
# first of all, let's import data from CSV file
> value_per_term <- read.csv("~/value_per_term.csv")
First with reshape function from base package:
# now, the tricky part - reshaping; let's review parameters one-by-one
# data=value_per_term: defines structure to work with
# timevar="term": defines column in long format whose values will serve as
# column names in wide format; in our case its "term"
# idvar="date": defines grouping criteria - in our case this is date
# direction="wide": either "long" or "wide"; in our case its "wide"
> wide_by_reshape = reshape(data=value_per_term, timevar="term", idvar="date", direction="wide")
# replace NA with 0
> wide_by_reshape[is.na(wide_by_reshape)] <- 0
> wide_by_reshape[is.na(wide_by_reshape)] <- 0
# import the package
> library("reshape")
# reshape with "cast" method
# data=value_per_term: defines structure to work with# formula=date~term: row names are taken from column "date",
# while column names are taken from column "term"
# value="value": column containing data to aggregate
# fun.aggregate=sum: name of the function to perform aggregation
# fill=0: defining filling in cases data is missing. NA was used in example above
> wide_by_cast = cast(data=clicks_per_keyword, formula=date~term, value="value", fun.aggregate=sum, fill=0)
Summary: though, both methods allow reshaping data, I see cast as more preferable and flexible.
No comments:
Post a Comment