Analyzing growth in two datasets
In this tutorial, we'll use the R library to extract two datasets, then calculate and compare the wage and population growth rates through a D3-based visualization package.
Before we get started, you'll need to register for a BEA (API key). It'll be needed to make API requests.
key = "your-BEA-key-goes-here"
As usual, the library needs to be called.
library(eu.us.opendata)
Get the data
We'll now look for two datasets that contain the same geographic units. As it turns out, Population by State/NUTS2 and Wage and salary employment by State/NUTS2 have similar geographic units. We'll save these as ‘dataset' and ‘dataset2'.
searchRel("state nuts2", asHTML = T)
dataset <- getRel("<JOINT#SAL_A_2>", beaKey = key)
## [1] "You have selected '<JOINT#SAL_A_2>'"
## [1] "A total of 32535 records were retrieved."
## [1] "EU = 27315, US = 5220"
dataset2 <- getRel("<JOINT#POP_A_2>", beaKey = key)
## [1] "You have selected '<JOINT#POP_A_2>'"
## [1] "A total of 32475 records were retrieved."
## [1] "EU = 27255, US = 5220"
Process the data
To process the data, we'll need to first convert all values into indices so that the levels are comparable. Below, we've written a simple function that uses as number of manipulation methods developed for the eu.us.opendata library:
- timeSync() extracts the data based on overlapping date ranges. Specifying option “2” will pull overlapping dates whereas specifying “1” will pull the latest year.
- wide() converts the data from long form to wide form such that each row contains a unique time period and each column contains the values for each geographic unit
- the remaining code transforms data so that values are expressed as an index of the latest year of data (e.g. 1 = index leve for most recent year of data)
#Standardize values to latest record
indexVal <- function(dataset){
data <- timeSync(dataset,2) #an eu.us.opendata function for extracting data by overlapping time ranges
data = wide(data) #an eu.us.opendata method for reshaping data wide
data <- data[order(data$TIME),]
time <- data$TIME
data = as.matrix(data)
data = as.data.frame(t(t(data)/data[nrow(data),])) #index values based on latest year
data$TIME <- time
return(data)
}
#apply indexVal() to datasets
out1 = indexVal(dataset)
out2 = indexVal(dataset2)
#example
plot(out1[,c(1,3)])
To calculate the period growth rate, we'll calculate the simple linear time trend using a regression (e.g. Y ~ B0 + B1*Time + E). Remember that each column now represents a distinct, which means we'll need to loop through each column. To make this process simple, we've written a function called growthCalc() and have applied it to each resulting dataset from the section above.
#Calculate growth rate in each column using linear time trend
growthCalc <- function(data){
#Set up a empty dataframe to collect results
val = data.frame()
#Loop through each column
for(k in 2:ncol(data)){
#If there are no blanks
if(sum(is.na(data[,k]))==0){
#Record the location, linear time trend coeficient, check if location ID is numeric or not
val = rbind(val,
data.frame(loc = gsub("OBS_VALUE.","",colnames(data)[k]),
val = lm(data[,c(k,1)])$coef[2]))
}
}
#fill in data_source columns with US vs EU location name
val$loc <- as.character(val$loc)
return(val)
}
##apply growthCalc to indexed values
grow1 <- growthCalc(out1)
grow2 <- growthCalc(out2)
As a last step, we'll merge the datasets using location, then infer the region based on if the location code is numeric (US) or alphanumeric (EU).
##Merge
join <- merge(grow1, grow2, by=c("loc"))
colnames(join) <- c("location","wage","pop")
##Label region
join$region <- "EU"
join$region[!is.na(as.numeric(join$location))] <- "US"
Visualize the relationships
Lastly, we'll use the scatterD3 library to create a color-coded, interactive scatter plot. The library can be installed using the following command.
install.packages("scatterD3")
To create a color-coded scatter plot, we'll need to plug in field names and options.
library(scatterD3)
scatterD3(data = join,
x = pop,
y = wage,
lab = location,
col_var = region,
labels_size = 9,
point_opacity = 0.8,
ellipses = TRUE)