library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
3 Ground data wrangling and cleaning
3.1 About
We use the term “ground data” to refer to rain gauge observations collected from meteorological stations managed by the Smithsonian Tropical Research Institute and the Panama Canal Authority (Autoridad del Canal de Panamá). First, we merged both datasets and addressed key questions to establish criteria for selecting the stations to be included in the analysis. We then calculated annual precipitation and January-to-April precipitation, the latter serving as a proxy for dry season precipitation. Data visualization was used to identify the sites that meet the minimum criteria. We used the lmer4 package to fit a linear mixed model, using year as a random effect and site as a fixed effect across the full time period. The time series was filtered for the 1970-2016 period, ensuring no site was duplicated. We retained all actual values and gap-filled missing data using the predicted values from the model.
3.2 Libraries
3.3 The data
The github repository contains a version of the data that has been cleaned and calculated into monthly values.Data provided by the Meteorological and Hydrological Branch of the Panama Canal Authority and the Physical Monitoring Program of the Smithsonian Tropical Research Institute.
<- "../data_ground/met_data/ACP_data/cleanedVV/ACP_data.csv"
ACP_DATA<-"../data_ground/met_data/STRI_data/STRI_monthlyPrecip.csv" STRI_DATA
We read in the data using base R function read.csv()
<- read.csv(ACP_DATA)
acpDATA<- read.csv(STRI_DATA) striDATA
First we bind the rows of both data sources and create year and month columns
<- acpDATA%>%bind_rows(striDATA)
monthlyPrecipData$year<- as.numeric(format(as.Date(monthlyPrecipData$month_year, format="%Y-%m-%d"),"%Y"))
monthlyPrecipData$month<- as.numeric(format(as.Date(monthlyPrecipData$month_year, format="%Y-%m-%d"),"%m")) monthlyPrecipData
How many stations there is in each of the years?
<- monthlyPrecipData %>% group_by(year)%>% filter(!is.na(monthly_precip))%>%summarize(nstations= n_distinct(site))
nstations
ggplot(nstations, aes(x=year,y=nstations))+
geom_point()+
theme_bw()+
labs(x="Year", y="Number of stations")
Calculate the response variables: Annual precipitation and January through April precipitation. We only keep the years that have 12 months of data.
<- monthlyPrecipData %>%
precipitation_datafilter(!is.na(monthly_precip))%>%
group_by(site, year)%>%filter(n() == 12)%>%
summarize(annualPrecip = sum(monthly_precip, na.rm = TRUE),
jfmaPrecip = sum(monthly_precip[month %in% c(1,2,3,4)], na.rm = TRUE))
Calculate long term averages per site
<-precipitation_data%>%group_by(site)%>%
long_term_averagessummarize(longterm_mean_annual= mean(annualPrecip),longterm_mean_jfma=mean(jfmaPrecip))
How many years of actual data does every site has?
= precipitation_data %>% filter(year>=1970 & year<=2016)%>%
nyearsgroup_by(site) %>% tally() %>%
rename(nyears= n)%>% full_join(long_term_averages,by='site')
ggplot(nyears, aes(x = longterm_mean_annual, y = nyears)) +
geom_point() +
geom_text(aes(label = site), vjust = -0.5, size = 2) +
labs(x = "Mean annual precipitation", y = "Number of Years")
ggsave("plots/meanannual_v_nyears.jpg", width = 10, height = 6, dpi = 300)
Comment: we can clearly see a group of sites that has 30 or more years of data. Also, we recognize the presence of 3 sites with very high mean annual precipitation falling out of scope of our paper.
3.4 QAQC for sites
There is met stations located very close by to each other. We need to ensure that the set of data aren’t repeated.
An example is the Barro Colorado Island: which has 3 MET stations. Two of them are located within meters of each other. We will run some basic metrics to ensure this arent replicated stations.
<- precipitation_data%>%
BCI_subsetfilter(site=="BCI"|site=="BCICLEAR"|site=="BCIELECT")%>%
filter(year>=1970 & year<=2016)
#test the differences between the three sets of data
<-BCI_subset%>%group_by(site)%>%summarize(mean_annual_precip=mean(annualPrecip),sd_annual_precip=sd(annualPrecip),n=n())
means_BCI<- aov(annualPrecip ~ site, data = BCI_subset)
anova_result <- TukeyHSD(anova_result)
tukey_results
#ANOVA results
print(means_BCI)
# A tibble: 3 × 4
site mean_annual_precip sd_annual_precip n
<chr> <dbl> <dbl> <int>
1 BCI 2538. 590. 47
2 BCICLEAR 2649. 562. 45
3 BCIELECT 2292. 538. 47
summary(anova_result)
Df Sum Sq Mean Sq F value Pr(>F)
site 2 3077427 1538713 4.842 0.0093 **
Residuals 136 43215268 317759
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(tukey_results)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = annualPrecip ~ site, data = BCI_subset)
$site
diff lwr upr p adj
BCICLEAR-BCI 110.2735 -168.3210 388.86788 0.6172843
BCIELECT-BCI -245.9634 -521.5129 29.58622 0.0905150
BCIELECT-BCICLEAR -356.2368 -634.8312 -77.64239 0.0081833
ggplot(BCI_subset, aes(x=year, y=annualPrecip, color=site))+
geom_line()+theme_bw()+labs(x="Year", y="Annual Precipitation (mm)")
The metadata indicates that both BCICLEAR and BCIELECT are both different sensor. BCICLEAR is an standard rain gauge that is emptied manually by technicians. On the other hand, BCIELECT is a electronic tipping bucket showing 356 mm less annual mean precipitation. Finally, the BCI station managed by the Panama canal authority is placed on the Gatun lake shore. We will keep the three sites
The GALETA station appears on the ACP monthly summary and it is also listed in the STRI datasets, here given the name of GALETASTRI. Are this different sensors? or repeated sets of data.
#GALETA AND GALETASTRI
<- precipitation_data%>% filter(site=="GALETA"|site=="GALETASTRI")%>%
GALETA_subsetfilter(year>=1970 & year<=2016)
ggplot(GALETA_subset, aes(x=year, y=annualPrecip, color=site))+geom_line()+theme_bw()+labs(x="Year", y="Annual Precipitation (mm)")
#anova
<-GALETA_subset%>%group_by(site)%>%summarize(mean_annual_precip=mean(annualPrecip),sd_annual_precip=sd(annualPrecip),n=n())
means_GALETA<- aov(annualPrecip ~ site, data = GALETA_subset)
anova_galeta<- TukeyHSD(anova_galeta)
tukey_results_galeta
#ANOVA results
print(means_GALETA)
# A tibble: 2 × 4
site mean_annual_precip sd_annual_precip n
<chr> <dbl> <dbl> <int>
1 GALETA 2933. 533. 42
2 GALETASTRI 2930. 585. 43
summary(anova_galeta)
Df Sum Sq Mean Sq F value Pr(>F)
site 1 333 333 0.001 0.974
Residuals 83 26062365 314004
print(tukey_results_galeta)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = annualPrecip ~ site, data = GALETA_subset)
$site
diff lwr upr p adj
GALETASTRI-GALETA -3.957544 -245.751 237.8359 0.9741082
The GALETA case shows very little to no difference in the yearly averages for most of the years. The metadata of GALETASTRI mentions the use of Limon Bay to fill in the gaps and the Horizontal version of the ACP monthly summaries lists GALETA as ‘GALETA(STRI)’. We conclude that the sets of data are repeated. We will keep GALETASTRI since this monthly summaries were calculated in this study
Remove the GALETA station
<- precipitation_data %>% filter(site!="GALETA") precipitation_data_subset
Filter the stations with more than 30 years of data between 1970 and 2016
<- merge(precipitation_data_subset, nyears, by.x= "site", by.y="site")
precipitation_data_30<- precipitation_data_30%>% filter(nyears>=30)
precipitation_data_subsetunique(precipitation_data_subset$site)
[1] "AGUACLARA" "ALHAJUELA" "BALBOAFAA" "BALBOAHTS" "BCI"
[6] "BCICLEAR" "BCIELECT" "CANDELARIA" "CANO" "CANONES"
[11] "CASCADAS" "CHICO" "CHORRO" "CIENTO" "CRISTOBAL"
[16] "DIABLO" "EMPIREHILL" "ESCANDALOSA" "GALETASTRI" "GAMBOA"
[21] "GATUN" "GUACHA" "HODGESHILL" "HUMEDAD" "MONTELIRIO"
[26] "PEDROMIGUEL" "PELUCA" "RAICES" "SALAMANCA" "SANMIGUEL"
[31] "SANTAROSA" "ZANGUENGA"
3.5 Gap fill the timeseries 1970-2016
Use the lme4 package to fit a linear mixed model with random effects for year and fixed effects for site for annual precipitation and January to April precipitation (Bates et al. 2015). Note that we have created the model using all the avaliable years.
<- lmer(annualPrecip~ site +(1|year), data=precipitation_data_subset)
model_annual<- lmer(jfmaPrecip~ site +(1|year), data=precipitation_data_subset) model_jfma
We then filter the data to only the desired years.
<- precipitation_data_subset%>%filter(year>=1970,year<=2016) precipitation_data_subset
Create an empty data frame to predict the yearly and January through April precipitation.
<- seq(1970,2016)
year_sequence<- unique(precipitation_data_subset$site)
sites<- expand.grid(year_sequence,sites)%>%rename(year="Var1",site="Var2")
precipitation_predicted
$predicted_annual<- predict(model_annual,newdata=precipitation_predicted)
precipitation_predicted $predicted_jfma<- predict(model_jfma,newdata=precipitation_predicted) precipitation_predicted
Gap fill the missing years using the predicted values to get a gap free time series.
<- precipitation_predicted %>%
precipitationfull_join(precipitation_data_subset, by=c("site","year"))%>%
mutate(gap_fill=if_else(is.na(annualPrecip),TRUE,FALSE),
annual=if_else(is.na(annualPrecip),predicted_annual,annualPrecip),
jfma=if_else(is.na(jfmaPrecip),predicted_jfma,jfmaPrecip))
Whats the percentage of the data that is gap filled
<- sum(precipitation$gap_fill == TRUE)
gap_fill_true <- sum(precipitation$gap_fill == FALSE)
gap_fill_false <- (gap_fill_true / (gap_fill_true + gap_fill_false)) * 100
percentage_gap_fill percentage_gap_fill
[1] 9.175532
We can visualize subsets of the final precipitation data frame. Note that some sites will have more instances of predicted response variables.
ggplot(precipitation %>% filter(site %in% sites[1:8]),
aes(x = year, y = annual, group = site,color=site,shape=gap_fill)) + geom_line() + geom_point(size=3) +
labs(x = "Year", y = "Annual Precipitation (mm)", color = "Site")
ggplot(precipitation %>% filter(site %in% sites[9:16]),
aes(x = year, y = annual, group = site,color=site,shape=gap_fill)) + geom_line() + geom_point(size=3) +
labs(x = "Year", y = "Annual Precipitation (mm)", color = "Site")
ggplot(precipitation %>% filter(site %in% sites[17:24]),
aes(x = year, y = annual, group = site,color=site,shape=gap_fill)) + geom_line() + geom_point(size=3) +
labs(x = "Year", y = "Annual Precipitation (mm)", color = "Site")
ggplot(precipitation %>% filter(site %in% sites[25:32]),
aes(x = year, y = annual, group = site,color=site,shape=gap_fill)) + geom_line() + geom_point(size=3) +
labs(x = "Year", y = "Annual Precipitation (mm)", color = "Site")
Finally, save the dataframe for its later use.
write.csv(precipitation,"../tables/precipitation.csv", row.names = FALSE)