Wednesday, April 11, 2012

R: from long to wide

While preparing data for Multiple Linear Regression, I have faced an interesting problem of conversion data from "long" format into "wide" format.

Input data format:

date term value
2012-01-01 term1 3
2012-01-01 term2 4
2012-01-01 term4 5
2012-01-01 term5 6
2012-01-01 term6 7
2012-01-02 term1 3
2012-01-02 term2 4
2012-01-02 term3 5
2012-01-02 term5 7
2012-01-02 term6 8
2012-01-03 term3 4
2012-01-03 term4 3


Desired output format:

date term1 term2 term3 term4 term5 term6
2012-01-01 3 4 0 5 6 7
2012-01-02 3 4 5 0 7 8
2012-01-03 0 0 4 3 0 0


Instructions:
# first of all, let's import data from CSV file
> value_per_term <- read.csv("~/value_per_term.csv")


First with reshape function from base package:
# now, the tricky part - reshaping; let's review parameters one-by-one
data=value_per_term: defines structure to work with
# timevar="term": defines column in long format whose values will serve as 
# column names in wide format; in our case its "term"
# idvar="date": defines grouping criteria - in our case this is date
# direction="wide": either "long" or "wide"; in our case its "wide"

> wide_by_reshape = reshape(data=value_per_term, timevar="term", idvar="date", direction="wide")

# replace NA with 0
> wide_by_reshape[is.na(wide_by_reshape)] <- 0

Let's redo the same with cast function from reshape package:
# import the package
> library("reshape")

# reshape with "cast" method
data=value_per_term: defines structure to work with
# formula=date~term: row names are taken from column "date", 
# while column names are taken from column "term"
# value="value": column containing data to aggregate
# fun.aggregate=sum: name of the function to perform aggregation
# fill=0: defining filling in cases data is missing. NA was used in example above
> wide_by_cast = cast(data=clicks_per_keyword, formula=date~term, value="value", fun.aggregate=sum, fill=0)

Summary: though, both methods allow reshaping data, I see cast as more preferable and flexible.

No comments: