Thursday 11 December 2014

Accessing, cleaning and plotting NOAA Temperature Data

In my previous post I said that my students are using data from NOAA for their research.
NOAA in fact provides daily averages of several environmental parameters for thousands of weather stations scattered across the globe, completely free.
In detail, NOAA provides the following data:
  • Mean temperature for the day in degrees Fahrenheit to tenths
  • Mean dew point for the day in degrees Fahrenheit to tenths
  • Mean sea level pressure for the day in millibars to tenths
  • Mean station pressure for the day in millibars to tenths
  • Mean visibility for the day in miles to tenths
  • Mean wind speed for the day in knots to tenths
  • Maximum sustained wind speed reported for the day in knots to tenths
  • Maximum wind gust reported for the day in knots to tenths
  • Maximum temperature reported during the day in Fahrenheit to tenths
  • Minimum temperature reported during the day in Fahrenheit to tenths
  • Total precipitation (rain and/or melted snow) reported during the day in inches and hundredths
  • Snow depth in inches to tenths

A description of the dataset can be found here: GSOD_DESC.txt
All these data are available from 1929 to 2014.

The problem with these data is that they require some processing before they can be used for any sort of computation.
For example, the station IDs and coordinates are stored in a text file separate from the data files, so each data file needs to be cross-referenced with this text file to extract the coordinates of the weather station.
Moreover, the data files are supplied either as one single .tar file or as a series of .gz files, which, once extracted, give a series of .op text files that can then be opened in R.
All these steps require some processing before the data can be used in R; for example, one could download the data files and extract them with 7zip.

In this blog post I will present a way of downloading the NOAA data, cleaning them (meaning defining outliers and extracting the coordinates of each location), and then plotting them to observe their location and spatial pattern.

The first thing we need to do is load the necessary packages:

library(raster)
library(XML)
library(plotrix)


Then I can define my working directory:

setwd("C:/")

At this point I can download and read the coordinates file from this address: isd-history.txt

The only way of reading it is with the function read.fwf, which is used to read "fixed width format" tables in R.
Typical entries of this table are provided as examples below:

024160 99999 NO DATA NO DATA
024284 99999 MORA SW +60.958 +014.511 +0193.2 20050116 20140330
024530 99999 GAVLE/SANDVIKEN AIR FORCE BAS SW +60.717 +017.167 +0016.0 20050101 20140403


Here the columns are named as follow:
  • USAF = Air Force station ID. May contain a letter in the first position
  • WBAN = NCDC WBAN number
  • CTRY = FIPS country ID
  • ST = State for US stations
  • LAT = Latitude in thousandths of decimal degrees
  • LON = Longitude in thousandths of decimal degrees
  • ELEV = Elevation in meters
  • BEGIN = Beginning Period Of Record (YYYYMMDD)
  • END = Ending Period Of Record (YYYYMMDD)

As you can see, the length of each line is identical but its content varies substantially from one line to the other, so there is no way of reading this file with the read.table function.
However, with read.fwf we can define the width of each column in the dataset so that we can create a data.frame out of it.
I used the following widths:
  • 6 - USAF
  • 1 - White space
  • 5 - WBAN
  • 1 - White space
  • 38 - CTRY and ST
  • 7 - LAT
  • 1 - White space
  • 8 - LON
  • 9 - White space, plus Elevation, plus another White space
  • 8 - BEGIN
  • 1 - White space
  • 8 - END

After this step I created a data.frame with just USAF, WBAN, LAT and LON.
The two lines of code for achieving all this are presented below:


coords.fwt <- read.fwf("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/isd-history.txt",widths=c(6,1,5,1,38,7,1,8,9,8,1,8),sep=";",skip=21,fill=T)
coords <- data.frame(ID=paste(as.factor(coords.fwt[,1])),WBAN=paste(as.factor(coords.fwt[,3])),Lat=as.numeric(paste(coords.fwt$V6)),Lon=as.numeric(paste(coords.fwt$V8)))

As you can see, I can work directly with the remote link; there is actually no need to have this file locally.

After this step I can download the data files for a particular year and extract them. I can use the base function download.file to download the .tar file, then the function untar to extract its contents.
The code for doing so is the following:

#Download Measurements
year = 2013
dir.create(paste(getwd(),"/Data",sep=""))
download.file(paste("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/",year,"/gsod_",year,".tar",sep=""),destfile=paste(getwd(),"/Data/","gsod_",year,".tar",sep=""),method="auto",mode="wb")

#Extract Measurements
dir.create(paste(getwd(),"/Data/Files",sep=""))
untar(paste(getwd(),"/Data/","gsod_",year,".tar",sep=""),exdir=paste(getwd(),"/Data/Files",sep=""))

I created a variable year so that I can change the year I want to investigate and the script will work in the same way.
I also included a call to dir.create to create a new folder to store the 12'511 data files included in the .tar file.
At this point I created a loop to iterate through the file names.


#Create Data Frame
files <- list.files(paste(getwd(),"/Data/Files/",sep=""))

classes <- c(rep("factor",3),rep("numeric",14),rep("factor",3),rep("numeric",2))

t0 <- Sys.time()
station <- data.frame(Lat=numeric(),Lon=numeric(),TempC=numeric())
for(i in 1:length(files)){
  data <- read.table(gzfile(paste(getwd(),"/Data/Files/",files[i],sep=""),open="rt"),sep="",header=F,skip=1,colClasses=classes)

  if(paste(unique(data$V1))!=paste("999999")){
    coords.sub <- coords[paste(coords$ID)==paste(unique(data$V1)),]

    if(nrow(data)>(365/2)){
      ST1 <- data.frame(TempC=(data$V4-32)/1.8,Tcount=data$V5)
      ST2 <- ST1[ST1$Tcount>12,]
      ST3 <- data.frame(Lat=coords.sub$Lat,Lon=coords.sub$Lon,TempC=round(mean(ST2$TempC,na.rm=T),2))
      station[i,] <- ST3
    }
  } else {
    coords.sub <- coords[paste(coords$WBAN)==paste(unique(data$V2)),]

    if(nrow(data)>(365/2)&coords.sub$Lat!=0&!is.na(coords.sub$Lat)){
      ST1 <- data.frame(TempC=(data$V4-32)/1.8,Tcount=data$V5)
      ST2 <- ST1[ST1$Tcount>12,]
      ST3 <- data.frame(Lat=coords.sub$Lat,Lon=coords.sub$Lon,TempC=round(mean(ST2$TempC,na.rm=T),2))
      station[i,] <- ST3
    }
  }

  print(i)
  flush.console()
}
t1 <- Sys.time()
t1-t0

I also included a time indication so that I could check the total time required for executing the loop, which is around 8 minutes.

Now let's look at the loop in a bit more detail. It starts by opening the data file from the list of file names (files) using the function read.table coupled with gzfile, which reads the content of the .gz file as text so that it can be parsed as a table.
Then I set up an if statement, because in some cases the station can be identified by its USAF value, while in other cases USAF is equal to 999999 and the station therefore needs to be identified by its WBAN value.
Inside this statement I put another if statement to exclude stations without coordinates or with fewer than 6 months of data.
I also included another quality check: alongside each measurement NOAA provides the number of observations used to calculate each daily average, so I used this information to exclude daily averages computed from fewer than 12 hourly observations.
In this case I was only interested in temperature data, so my data.frames extract only that variable from the data files (and convert it to Celsius). However, it would be fairly easy to focus on different environmental data.
The loop fills the empty data.frame named station, created just before the loop starts.
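As a quick illustration of how one could focus on a different variable (my own sketch, not part of the original script), the body of the loop could compute mean wind speed instead of temperature. I am assuming here that, with the colClasses defined above, column V14 holds the daily mean wind speed in knots and V15 the number of observations behind it:

#Hypothetical variant of the loop body: mean wind speed instead of temperature
#Assumption: V14 = mean wind speed (knots), V15 = observation count
ST1 <- data.frame(WindKn=data$V14,Wcount=data$V15)
ST1 <- ST1[ST1$WindKn<999,]      #999.9 marks missing wind speed in GSOD (my assumption)
ST2 <- ST1[ST1$Wcount>12,]
ST3 <- data.frame(Lat=coords.sub$Lat,Lon=coords.sub$Lon,WindMS=round(mean(ST2$WindKn*0.514444,na.rm=T),2))   #knots to m/s

The station data.frame would of course need to be initialised with a WindMS column instead of TempC.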
At this point I can exclude the NAs from the data.frame, use the package sp to assign coordinates and plot them on top of country polygons:


Temperature.data <- na.omit(station)
coordinates(Temperature.data)=~Lon+Lat

dir.create(paste(getwd(),"/Data/Polygons",sep=""))
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile=paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.zip",sep=""))
unzip(paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.zip",sep=""),exdir=paste(getwd(),"/Data/Polygons",sep=""))
polygons <- shapefile(paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.shp",sep=""))

plot(polygons)
points(Temperature.data,pch="+",cex=0.5,col=color.scale(Temperature.data$TempC,color.spec="hsv"))
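As an optional extra, not part of the original figure, a colour key could be added with the color.legend function from plotrix; this is just a sketch, and the rectangle coordinates are arbitrary longitude/latitude values chosen to sit in the bottom-left corner of the map:

#Optional colour key (sketch); coordinates are arbitrary map units (degrees)
legend.at <- round(seq(min(Temperature.data$TempC),max(Temperature.data$TempC),length.out=5),1)
color.legend(-170,-80,-100,-70,legend=legend.at,rect.col=color.scale(seq(min(Temperature.data$TempC),max(Temperature.data$TempC),length.out=100),color.spec="hsv"),gradient="x")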

The result is the plot presented below:


The nice thing about this script is that I just need to change the variable called year at the beginning of the script to change the year under investigation.
For example, let's say we want to look at how many stations met the criteria I set here in 1940. I can just change the year to 1940, run the script and wait for the results. In this case the wait is not long, because at that time there were only a few stations in the USA; see below:

However, if we skip forward 10 years to 1950, the picture changes.



The Second World War was just over, and now we have weather stations across the countries that were heavily involved in it. We have several stations in the UK, Germany, Australia, Japan, Turkey, Palestine, Iran and Iraq.
A similar pattern can be seen if we take a look at the dataset available in 1960:


Here the striking addition is the numerous stations in Vietnam and its surroundings. It is probably easy to understand the reason.

Sometimes we forget that our work can be used for good, but it can also be very useful in bad times!!

Here is the complete code to reproduce this experiment:







library(raster)
library(gstat)
library(XML)
library(plotrix)


setwd("D:/Wind Speed Co-Kriging - IPA Project")

#Extract Coordinates for each Station
coords.fwt <- read.fwf("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/isd-history.txt",widths=c(6,1,5,1,38,7,1,8,9,8,1,8),sep=";",skip=21,fill=T)
coords <- data.frame(ID=paste(as.factor(coords.fwt[,1])),WBAN=paste(as.factor(coords.fwt[,3])),Lat=as.numeric(paste(coords.fwt$V6)),Lon=as.numeric(paste(coords.fwt$V8)))


#Download Measurements
year = 2013
dir.create(paste(getwd(),"/Data",sep=""))
download.file(paste("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/",year,"/gsod_",year,".tar",sep=""),destfile=paste(getwd(),"/Data/","gsod_",year,".tar",sep=""),method="auto",mode="wb")
#Extract Measurements
dir.create(paste(getwd(),"/Data/Files",sep=""))
untar(paste(getwd(),"/Data/","gsod_",year,".tar",sep=""),exdir=paste(getwd(),"/Data/Files",sep=""))


#Create Data Frame
files <- list.files(paste(getwd(),"/Data/Files/",sep=""))

classes <- c(rep("factor",3),rep("numeric",14),rep("factor",3),rep("numeric",2))

t0 <- Sys.time()
station <- data.frame(Lat=numeric(),Lon=numeric(),TempC=numeric())
for(i in 1:length(files)){
  data <- read.table(gzfile(paste(getwd(),"/Data/Files/",files[i],sep=""),open="rt"),sep="",header=F,skip=1,colClasses=classes)

  if(paste(unique(data$V1))!=paste("999999")){
    coords.sub <- coords[paste(coords$ID)==paste(unique(data$V1)),]

    if(nrow(data)>(365/2)){
      ST1 <- data.frame(TempC=(data$V4-32)/1.8,Tcount=data$V5)
      ST2 <- ST1[ST1$Tcount>12,]
      ST3 <- data.frame(Lat=coords.sub$Lat,Lon=coords.sub$Lon,TempC=round(mean(ST2$TempC,na.rm=T),2))
      station[i,] <- ST3
    }
  } else {
    coords.sub <- coords[paste(coords$WBAN)==paste(unique(data$V2)),]

    if(nrow(data)>(365/2)&coords.sub$Lat!=0&!is.na(coords.sub$Lat)){
      ST1 <- data.frame(TempC=(data$V4-32)/1.8,Tcount=data$V5)
      ST2 <- ST1[ST1$Tcount>12,]
      ST3 <- data.frame(Lat=coords.sub$Lat,Lon=coords.sub$Lon,TempC=round(mean(ST2$TempC,na.rm=T),2))
      station[i,] <- ST3
    }
  }

  print(i)
  flush.console()
}
t1 <- Sys.time()
t1-t0

Temperature.data <- na.omit(station)
coordinates(Temperature.data)=~Lon+Lat

dir.create(paste(getwd(),"/Data/Polygons",sep=""))
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile=paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.zip",sep=""))
unzip(paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.zip",sep=""),exdir=paste(getwd(),"/Data/Polygons",sep=""))
polygons <- shapefile(paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.shp",sep=""))


jpeg(paste(getwd(),"/Data/",year,".jpg",sep=""),2000,1500,res=300)
plot(polygons,main=paste("NOAA Database ",year,sep=""))
points(Temperature.data,pch="+",cex=0.5,col=color.scale(Temperature.data$TempC,color.spec="hsv"))
dev.off()

Wednesday 10 December 2014

World Point Grid

These days I am following a couple of master projects dealing with renewable energy potentials at the global scale. We wanted to try and compute these potentials using only free data and see how far we could go.
For wind speed there are plenty of resources freely available on the web. In particular my students downloaded the daily averages from NOAA: http://gis.ncdc.noaa.gov/geoportal/catalog/search/resource/details.jsp?id=gov.noaa.ncdc%3AC00516

One of the problems they encountered was that, obviously, these data are available only for discrete locations, i.e. where meteorological stations are located, so if we want to create a map out of them we need to use some form of estimation. Time was limited, so we opted for one of the simplest methods out there, i.e. kriging. Using other forms of estimation optimized for wind, like dispersion or other physical modelling, would have been almost impossible to pull off at the global scale: these models need too much time, and we are talking about months of computation for estimating wind speed in one single country.
Anyway, ETH is an ESRI development centre and the students therefore normally do their work in ArcGIS; this project was no exception, so we used Geostatistical Analyst in ArcGIS for the interpolation. The process was fast considering that we were dealing with a global-scale interpolation; even when we used two covariates and tested co-kriging, the entire process was completed in less than 10 minutes, more or less.
The problem is that the output of Geostatistical Analyst is a sort of contour raster, which cannot be used directly to export its values onto a point grid. In this project we were planning to work at a 1 km scale, meaning that we would have needed a mean wind speed value for each point on a 1 km grid. In theory this can be done directly from ArcGIS: you can use the function "Create Fishnet" to create the point grid and then the prediction function in Geostatistical Analyst to estimate an interpolated value for each point in the grid. I say in theory because in practice creating a point grid for the whole world is almost impossible with standard PCs. In our student labs we have 4 GB of RAM in each PC and with that there is no way to create the grid. We also tried a sort of "raw parallelization", meaning that we used 10 PCs to each create a small part of the grid, and then extracted just the parts that overlap with land (excluding the oceans). Even with this process (which is by no means practical, elegant or fast) we were unable to finish: even 1/10 of the world grid is enough to fill 4 GB of RAM.
In the end we decided to work at a 10 km scale because we simply did not have time to try alternative routes. However, the whole thing made me think about possible ways of creating a point grid for the whole world in R. We could use the function spsample to create point grids from polygon shapefiles, but unless you have a ton of RAM there is no way of doing it in one go. So I thought about iterating through the country polygons, creating a small grid for each and exporting its coordinates to an external txt file. Even this process did not work: as soon as the loop reached Argentina it filled the 8 GB of RAM of my PC and stopped.
The solution was to use even smaller-scale polygons. I downloaded the Natural Earth 1:10m cultural dataset with states and provinces and iterated through that. The process takes quite a long time on one CPU, but it can probably be parallelized (see the sketch after the script below). The final size of the file, with just the two coordinate columns, is 4.8 GB. It can be used directly by loading it as an ff data.frame (read.table.ffdf) or by loading only small chunks of it iteratively whenever you need it for estimations.
The R script I created takes care of the whole process: it downloads the dataset from Natural Earth, excludes Antarctica (which is too big), runs the loop and saves a TXT file in the working directory.

This is my code, I hope it can help you somehow:


library(raster)
library(ff)


setwd("C:/")

download.file("http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_1_states_provinces.zip",destfile="ne_10m_admin_1_states_provinces.zip")
unzip("ne_10m_admin_1_states_provinces.zip",exdir="StateProvinces")

setwd("C:/StateProvinces")

polygons <- shapefile(list.files(pattern=".shp"))
polygons <- polygons[! paste(polygons$name) %in% paste("Antarctica"),]

names(polygons)

i=1
while(i<=nrow(polygons)){
  if(paste(polygons[i,]$name)!=paste("NA")){
    grids <- try(spsample(polygons[i,],cellsize=0.01,type="regular"))

    if(inherits(grids, "try-error")){
      i = i+1
    } else {
      write.table(as.data.frame(grids),"Grid_1Km.txt",sep=" ",row.names=F,col.names=F,append=T)

      print(i)
      flush.console()
      i = i+1
    }
  } else { i = i+1 }
}

#file.remove("Grid_1Km.txt")

grid <- read.table.ffdf(file="Grid_1Km.txt",nrows=1000,sep=" ")
names(grid)
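As mentioned above, the loop can probably be parallelized. Below is a rough, untested sketch using the parallel package, in which each worker samples a subset of the polygons and writes to its own text file (named after the worker's process ID) to avoid concurrent writes; the per-worker files can then be concatenated. The number of cores is an arbitrary choice:

library(parallel)

#Untested sketch: each worker samples some polygons and writes its own file
sample_polygon <- function(i){
 if(!is.na(polygons[i,]$name)){
  grids <- try(spsample(polygons[i,],cellsize=0.01,type="regular"),silent=TRUE)
  if(!inherits(grids,"try-error")){
   write.table(as.data.frame(grids),paste("Grid_1Km_core",Sys.getpid(),".txt",sep=""),sep=" ",row.names=F,col.names=F,append=T)
  }
 }
 invisible(i)
}

cl <- makeCluster(4)                 #number of cores is an assumption
clusterEvalQ(cl,library(raster))     #raster also attaches sp, which provides spsample
clusterExport(cl,"polygons")
parLapply(cl,1:nrow(polygons),sample_polygon)
stopCluster(cl)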

Friday 28 November 2014

R Object-oriented Programming - Book Review



I have been asked to review the book “R Object-oriented Programming” by Kelly Black, published by Packt Publishing (£14.45 for the e-book, £27.99 for print + e-book).

The scope of the book is “to provide a resource for programming using the R language” and it can therefore be seen as a good and practical introduction to the most commonly used parts of R. The first two chapters deal with data types and data organization in R: they quickly review how to handle each type of data (such as integers and doubles) and how to organize it into R objects. The third chapter deals with reading data from files and saving it; it gives a pretty good introduction to reading and writing every sort of data, even binaries, from a variety of sources, including the web. Chapter 4 provides an introduction to the R commands for generating random numbers; in particular, it gives a thorough overview of the sample command. Chapters 5 and 6 give a good background on using R to manipulate strings and time variables. Of particular interest throughout the book is the handling of data gathered from public sources on the web. For these data, skills in string manipulation become crucial, both for handling web addresses and for extracting the actual data from the information returned by the server. For this reason I think this book does a good job of introducing these important aspects of the R language.
Chapter 7 introduces some basic programming concepts, such as if statements and loops. Chapters 8 and 9 provide a complete overview of the S3 and S4 classes, and finally chapters 10 and 11 are two hands-on examples of how to put together all the concepts learned in the book to solve very practical problems. In these examples the reader is guided through the creation of powerful R programs to grade students and to perform a Monte Carlo simulation.
The book is written in a very practical form, meaning that not much time is spent explaining each function in detail; readers can browse the help pages of each function for more information. This means that this book is probably not for newbies to programming languages. Most of the learning is done by exploring the lines of code provided, and for this reason I think the best readers would be people already familiar with a programming language, even though I do not think readers necessarily need any familiarity with R. However, as stated on the website, the target audience of this book is beginners who want to become more “fluent” with the language.
Overall, I think this book does a good job of providing the reader with a strong and neat introduction to all the bits of coding required to become comfortable writing advanced scripts. For example, at the end of chapter 2 the author discusses the use of the apply set of commands. These are crucial milestones for every individual who wants to switch from a mundane use of R to a more advanced and rigorous use of the language. In my personal experience, when I began using R I would often create very long scripts with lots of loops and if statements, which tend to greatly decrease execution speed. As soon as I learned to master the apply set of commands I was able to reduce my code, and crucially I was also able to substantially increase its execution speed. Personally, I would have loved to have access to such a book back then! The use of web sources for data manipulation is also a very nice addition that, as far as I know, is not common in other introductory texts. Nowadays gathering data from the web has become the norm, so I think it is important to provide beginners with tools to handle this type of data.
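As a toy illustration of that point (my own example, not taken from the book), here is the same column-mean computation written with an explicit loop and with sapply:

#Toy comparison (not from the book): column means with a loop vs. sapply
df <- data.frame(a=rnorm(1000),b=rnorm(1000),c=rnorm(1000))

#Explicit loop
means.loop <- numeric(ncol(df))
for(i in 1:ncol(df)){
 means.loop[i] <- mean(df[,i])
}

#apply-style version: shorter and usually faster
means.apply <- sapply(df,mean)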
The strength of this book, however, is in chapters 8 and 9, which provide an extensive introduction to the use of the S3 and S4 classes; I think these two chapters alone would justify the price of the book. As far as I know these concepts are generally not treated with the right attention in books for beginners. Such books may explain that when you load a package the functions you normally use, such as plot, may change their behaviour and options, but I never found an introductory book that provides such an exhaustive explanation of how to fully control these classes to create advanced programs. Of particular interest are also the two examples provided in chapters 10 and 11. These are practical exercises that put together all the concepts learned in the previous chapters with the purpose of creating R programs that can be easily implemented and shared. Chapter 10, for example, describes a neat and powerful way to create a new R program to grade students. In this chapter the reader uses all the basic programming concepts learned during the course of the book and puts them together to create an R program that imports grades from csv files, manipulates them and creates summary statistics and plots.
In conclusion, I see a variety of uses for this book. Clearly it is targeted at post-beginners who need a short way to unlock the full power of R for their daily statistical routines. However, the book does not lose its purpose once we have learned to use the language properly: it is written in such a way that even for experienced R users it is a useful way to quickly look up functions and methods that they may not use very often. I sometimes forget how to use certain functions, and having such a book on my office bookshelf will certainly help me in these frustrating situations. So I think it will become part of the set of references that future R users will consult on a regular basis.

Wednesday 24 September 2014

Changing the Light Azimuth in Shaded Relief Representation by Clustering Aspect

Some time ago I published an article in "The Cartographic Journal" regarding a method to automatically change the light azimuth in shaded relief representations.
This method is based on clustering the aspect derivative of the DTM. The method was originally developed in R and then translated into ArcGIS, with the use of ModelBuilder, so that it could be imported as a toolbox.
The ArcGIS toolbox is available here: www.fabioveronesi.net/Cluster_Shading.html

Below is the Abstract of the article for more info:

Abstract
Manual shading, traditionally produced manually by specifically trained cartographers, is still considered superior to automatic methods, particularly for mountainous landscapes. However, manual shading is time-consuming and its results depend on the cartographer and as such difficult to replicate consistently. For this reason there is a need to create an automatic method to standardize its results. A crucial aspect of manual shading is the continuous change of light direction (azimuth) and angle (zenith) in order to better highlight discrete landforms. Automatic hillshading algorithms, widely available in many geographic information systems (GIS) applications, do not provide this feature. This may cause the resulting shaded relief to appear flat in some areas, particularly in areas where the light source is parallel to the mountain ridge. In this work we present a GIS tool to enhance the visual quality of hillshading. We developed a technique based on clustering aspect to provide a seamless change of lighting throughout the scene. We also provide tools to change the light zenith according to either elevation or slope. This way the cartographer has more room for customizing the shaded relief representation. Moreover, the method is completely automatic and this guarantees consistent and reproducible results. This method has been embedded into an ArcGIS toolbox.

Article available here: The Cartographic Journal



Today I decided to go back to R to distribute the original R script I used for developing the method.
The script is downloadable from here: Cluster_Shading_RSAGA.R

Both the ArcGIS toolbox and the R script are also available on my ResearchGate profile:
ResearchGate - Fabio Veronesi


Basically it replicates all the equations and methods presented in the paper (if you cannot access the paper I can send you a copy privately). The script loads a DTM, calculates slope and aspect with the two dedicated functions in the raster package, and then clusters aspect, creating 4 sets.
At this point I used RSAGA to perform the majority filter and the mean filter. Then I applied a sine wave equation to automatically calculate the azimuth value for each pixel in the raster, based on the clusters.
Zenith can be computed in 3 ways: constant, elevation and slope.
Constant is the same as the classic method, when the zenith value does not change in space. For elevation and slope, zenith changes according to weights calculated from these two parameters.
This allows the creation of maps with a white tone in the valleys (similar to the combined shading presented by Imhof) and a black tone that increases with elevation or slope.
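The script itself implements the equations from the paper, but the general idea can be sketched with basic raster functions. The snippet below is a simplified illustration, not the published method: the input file name is hypothetical, the per-cluster azimuths are arbitrary values rather than the sine-wave result, and the RSAGA majority/mean filters are omitted.

#Simplified illustration of the idea (not the published algorithm):
#cluster aspect into 4 sets and shade each set with a different light azimuth
library(raster)

dtm <- raster("dtm.tif")             #hypothetical input DTM
slope <- terrain(dtm,opt="slope")    #radians, as required by hillShade
aspect <- terrain(dtm,opt="aspect")

#k-means clustering of aspect into 4 sets
v <- values(aspect)
km <- kmeans(na.omit(v),centers=4)
v[!is.na(v)] <- km$cluster
clusters <- aspect
values(clusters) <- v

#one arbitrary azimuth per cluster (the paper derives these with a sine wave equation)
azimuths <- c(315,355,275,335)

#shade each cluster with its own azimuth and combine the pieces
shades <- list()
for(k in 1:4){
 shades[[k]] <- mask(hillShade(slope,aspect,angle=45,direction=azimuths[k]),clusters,maskvalue=k,inverse=TRUE)
}
shading <- shades[[1]]
for(k in 2:4){ shading <- cover(shading,shades[[k]]) }
plot(shading,col=grey(0:100/100))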

Tuesday 19 August 2014

Transform point shapefile to SpatStat object

Today I wanted to do some point pattern analysis in R using the fantastic package spatstat.
The problem was that I only had a point shapefile, so I googled a way to transform a shapefile into a ppp object (which is the point pattern object used by spatstat).
I found a method that involves the use of as.ppp(X) to transform both spatial points and spatial points data frames into ppp objects. The problem is that when I tested it with my dataset I received an error and was not able to perform the transformation.

So I decided to do it myself, and I now want to share my two lines of code for doing it; maybe someone has encountered the same problem and does not know how to solve it. Is this not the purpose of these blogs?

First of all, you need to create the window for the ppp object, which I think is something like a bounding box. To do that you need the function owin.
This function takes three arguments: xrange, yrange and unitname.

Because I assumed you need to give spatstat a sort of bounding box for your data, I imported a polygon shapefile with the border of my area for creating the window.
The code therefore looks like this:

library(raster)
library(spatstat)

border <- shapefile("Data/britain_UTM.shp")

window <- owin(xrange=c(bbox(border)[1,1],bbox(border)[1,2]),
yrange=c(bbox(border)[2,1],bbox(border)[2,2]),
unitname=c("metre","metres"))

 

Then I loaded my datafile (i.e. WindData) and used the window object to transform it into a point pattern object, like so:

WindData <- shapefile("Data/WindMeanSpeed.shp")

WindDataPP <- ppp(x=WindData@coords[,1],
y=WindData@coords[,2],
marks=WindData@data$MEAN,
window=window)

 

Now I can use all the functions available in spatstat to explore my dataset.

summary(WindDataPP)
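For instance, two of the standard exploratory tools can be produced directly (assuming the MEAN mark is numeric):

plot(density(WindDataPP))   #kernel density surface of the station locations
plot(Smooth(WindDataPP))    #spatially smoothed surface of the marks (mean wind speed)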



@fveronesi_phd

Wednesday 11 June 2014

Extract Coordinates and Other Data from KML in R

KML files are used to visualize geographical data in Google Earth. These files are written in XML and allow you to visualize places and to attach additional data in HTML format.

These days I am working with the MIDAS database of wind measuring stations across the world, which can be freely downloaded here:

First of all, the file is in KMZ format, which is a compressed KML. In order to use it you need to extract its contents. I used 7zip for this purpose.
The file has numerous entries, one for each point on the map. Each entry generally looks like the one below:

 <Placemark>  
    <visibility>0</visibility>  
    <Snippet>ABERDEEN: GORDON BARRACKS</Snippet>  
    <description>  
       <![CDATA[  
       <table>  
       <tr><td><b>src_id:</b><td>14929  
       <tr><td><b>Name:</b><td>ABERDEEN: GORDON BARRACKS  
       <tr><td><b>Area:</b><td>ABERDEENSHIRE  
       <tr><td><b>Start date:</b><td>01-01-1956  
       <tr><td><b>End date:</b><td>31-12-1960  
       <tr><td><b>Postcode:</b><td>AB23 8  
       </table>  
       <center><a href="http://badc.nerc.ac.uk/cgi-bin/midas_stations/station_details.cgi.py?id=14929">Station details</a></center>  
       ]]>  
    </description>  
    <styleUrl>#closed</styleUrl>  
    <Point>  
       <coordinates>-2.08602,57.1792,23</coordinates>  
    </Point>  
 </Placemark>  


This chunk of XML code is used to show one point on Google Earth. The coordinates and the elevation of the point are enclosed in the <coordinates> tag. The <styleUrl> tag tells Google Earth to visualize this point with the style declared earlier in the KML file, which in this case is a red circle because the station is no longer recording.
If someone clicks on this point, the HTML information wrapped in the CDATA section is shown. The user then has access to the source ID of the station, its name, location, start date, end date and postcode, plus a link to view more info about it.

In this work I am interested in extracting the coordinates of each point, plus its ID and the name of the station. I need to do this because I then have to match the ID in this file with the ID written in the txt file with the wind measurements, which contains just the ID without coordinates.

In maptools there is a function to extract coordinates and elevation, called getKMLcoordinates.
My problem was that I also needed the other information I mentioned above, so I decided to tweak the source code of this function a bit to solve my problem.

#Extracting Coordinates and ID from KML  
kml.text <- readLines("midas_stations.kml")  
   
re <- "<coordinates> *([^<]+?) *<\\/coordinates>"  
coords <- grep(re,kml.text)  
   
re2 <- "src_id:"  
SCR.ID <- grep(re2,kml.text)  
   
re3 <- "<tr><td><b>Name:</b><td>"  
Name <- grep(re3,kml.text)  
   
kml.coordinates <- matrix(0,length(coords),4,dimnames=list(c(),c("ID","LAT","LON","ELEV")))  
kml.names <- matrix(0,length(coords),1)  
   
for(i in 1:length(coords)){  
    sub.coords <- coords[i]  
    temp1 <- gsub("<coordinates>"," ",kml.text[sub.coords])  
    temp2 <- gsub("</coordinates>"," ",temp1)  
    coordinates <- as.numeric(unlist(strsplit(temp2,",")))  
   
    sub.ID <- SCR.ID[i]  
    ID <- as.numeric(gsub("<tr><td><b>src_id:</b><td>"," ",kml.text[sub.ID]))  
   
    sub.Name <- Name[i]  
    NAME <- gsub(paste("<tr><td><b>Name:</b><td>"),"",kml.text[sub.Name])  
   
    kml.coordinates[i,] <- matrix(c(ID,coordinates),ncol=4)  
    kml.names[i,] <- matrix(c(NAME),ncol=1)  
}  
   
   
write.table(kml.coordinates,"KML_coordinates.csv",sep=";",row.names=F)  

The first thing I had to do was import the KML in R. The function readLines imports the KML file and stores it as a large character vector, with one element for each line of text.
For example, if we look at the KML code shown above, the vector will look like this:

 kml.text <- c("<Placemark>", "<visibility>0</visibility>",   
 "<Snippet>ABERDEEN: GORDON BARRACKS</Snippet>", ...  

So if I want to access the tag <Placemark>, I need to subset the first element of the vector:
kml.text[1]

This allows me to locate the elements of the vector (and therefore the lines of the KML) where a certain word is present.
I can create the object re and use the function grep to locate the line where the tag <coordinates> is written. This method was taken from the function getKMLcoordinates.

By using other keywords I can locate the lines of the KML that contain the ID and the name of the station.

Then I can just run a loop for each element in the coords vector and collect the results into a matrix with ID and coordinates.


Conclusions
I am sure that this is a rudimentary effort and that there are other, more elegant ways of doing it, but this was quick and easy to implement and it does the job perfectly.


NOTE
In this work I am interested only in stations that are still collecting data, so I had to manually filter the file by deleting all the <Placemark> entries for non-working stations (such as the one shown above).
It would be nice to find an easy way of filtering a file like this by ignoring the whole <Placemark> chunk if R finds this line: <styleUrl>#closed</styleUrl>
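One rough, untested idea in R would be to locate the <Placemark> and </Placemark> lines and drop every block that contains the closed style before running the extraction above (this assumes each <Placemark> has a matching </Placemark> on its own line):

#Untested sketch: drop whole <Placemark> blocks whose style is "#closed"
starts <- grep("<Placemark>",kml.text)
ends <- grep("</Placemark>",kml.text)

keep <- rep(TRUE,length(kml.text))
for(i in 1:length(starts)){
 block <- kml.text[starts[i]:ends[i]]
 if(any(grepl("<styleUrl>#closed</styleUrl>",block,fixed=TRUE))){
  keep[starts[i]:ends[i]] <- FALSE
 }
}
kml.open <- kml.text[keep]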


Any suggestions?

Wednesday 2 April 2014

Merge .ASC grids with R

A couple of years ago I found online a script to merge several .asc grids into a single file in R.
I do not remember where I found it but if you have the same problem, the script is the following:

 setwd("c:/temp")  
 library(rgdal)  
 library(raster) 

 
 # make a list of file names, perhaps like this:  
 f <-list.files(pattern = ".asc")  


 # turn these into a list of RasterLayer objects  
 r <- lapply(f, raster) 

 
 # as you have the arguments as a list call 'merge' with 'do.call'  
 x <- do.call("merge",r) 

 
 #Write Ascii Grid  
 writeRaster(x,"DTM_10K_combine.asc")  

It is a simple and yet very effective script.
To use this script you need to put all the .asc grids into the working directory; the script will take all the files with extension .asc in the folder, turn them into raster layers, merge them together and save the combined file.

NOTE:
if there are other files with ".asc" in their name, the function
list.files(pattern = ".asc") 

will consider them as well and it may create errors later on. For example, if you are using ArcGIS to visualize your files, it will create pyramid files that have the exact same name as the ASCII grid plus another extension.
I need to delete these files and keep only the original .asc files for this script and the following one to work properly.



A problem I found with this script is that if the raster grids are not properly aligned it will not work.
The function merge from the raster package has a workaround for this eventuality: using the option tolerance, it is possible to merge two grids that are not aligned.
This morning, for example, I had to merge several ASCII grids in order to form the DTM shown below:


The standard script will not work in this case, so I created a loop to use the tolerance option.
This is the whole script to use with non-aligned grids:

 setwd("c:/temp")  
 library(rgdal)  
 library(raster)  
   
 # make a list of file names, perhaps like this:  
 f <-list.files(pattern = ".asc")  
   
 # turn these into a list of RasterLayer objects  
 r <- lapply(f, raster)  
   
   
 ##Approach to follow when the asc files are not aligned  
 for(i in 2:length(r)){  
   x <- merge(x=r[[1]],y=r[[i]],tolerance=5000,overlap=T)  
   r[[1]] <- x  
 }  
   
 #Write Ascii Grid  
 writeRaster(r[[1]],"DTM_10K_combine.asc")  

The loop merges the first ASCII grid with all the others iteratively, each time overwriting the first element of the list with the newly created grid. This way I was able to create the DTM in the image above.

Monday 3 March 2014

Plotting an Odd number of plots in single image

Sometimes I have the need to reduce the number of images for a presentation or an article. A good way of doing it is putting multiple plots in the same tif or jpg file.
R has multiple functions to achieve this objective and a nice tutorial for this topic can be reached at this link: http://www.statmethods.net/advgraphs/layout.html

The most common function is par. This function lets the user create a table of plots by defining the number of rows and columns.
An example, found in the website above, is:

attach(mtcars)
par(mfrow=c(3,1))
hist(wt)
hist(mpg)
hist(disp)

In this case I create a table with 3 rows and 1 column and therefore each of the 3 plots will occupy a single row in the table.

The limitation of this method is that I can only create ordered tables of plots. So for example, if I need to create an image with 3 plots, my options are limited:

A plot per line, created with the code above, or a table of 2 columns and 2 rows:

attach(mtcars)
par(mfrow=c(2,2))
hist(wt)
hist(mpg)
hist(disp)



However, for my taste this is not appealing. I would rather have an image with 2 plots on top and 1 in the line below but centered.
To do this we can use the function layout. Let us see how it can be used:

First of all I created a fake dataset:

 data<-data.frame(D1=rnorm(500,mean=2,sd=0.5),  
 D2=rnorm(500,mean=2.5,sd=1),  
 D3=rnorm(500,mean=5,sd=1.3),  
 D4=rnorm(500,mean=3.5,sd=1),   
 D5=rnorm(500,mean=4.3,sd=0.8),  
 D6=rnorm(500,mean=5,sd=0.4),  
 D7=rnorm(500,mean=3.3,sd=1.3))  
I will use this data frame to create 3 identical boxplots.
The lines of code to create a single boxplot are the following:

 boxplot(data,par(mar = c(10, 5, 1, 2) + 0.1),   
 ylab="Rate of Change (%)",  
 cex.lab=1.5, names=c("24/01/2011","26/02/2011",  
 "20/03/2011","25/04/2011","23/05/2011",  
 "23/06/2011","24/07/2011"),  
 col=c("white","grey","red","blue"),  
 at=c(1,3,5,7,9,11,13),  
 yaxt="n",  
 las=2)  
   
 axis(side=2,at=seq(0,8,1),las=2)  
   
 abline(0,0)  
   
 mtext("Time (days)",1,line=8,at=7)  
   
 mtext("a)",2,line=2,at=-4,las=2,cex=2)  
This creates the following image:


I used the same options I explored in one of my previous posts about box plots: BoxPlots

Notice however how the label of the y axis is bigger than the labels on the x axis. This was done by using the option cex.lab = 1.5 in the boxplot function.

Also notice that the label on the x axis ("Time (days)") is two lines below the names. This was done by increasing the line parameter in the mtext call.

These two elements are crucial for producing the final image, because when we plot the three boxplots together in a jpg file, all these elements will appear natural. Try different options to see the differences.

Now we can put the 3 plots together with the function layout.
This function uses a matrix to identify the position of each plot; in my case I use the function with the following options:

layout(matrix(c(1,1,1,1,1,0,2,2,2,2,2,0,0,0,3,3,3,3,3,0,0,0), 2, 11, byrow = TRUE))

This creates a 2x11 matrix that looks like this:

1  1  1  1  1  0  2  2  2  2  2
0  0  0  3  3  3  3  3  0  0  0
what this tells the function is:
  • create a plotting window with 2 rows and 11 columns
  • populate the first 5 cells of the first row with plot number 1
  • create a space (that's what the 0 means)
  • populate the remaining 5 spaces of the first row with plot number 2
  • in the second row create 3 spaces
  • add plot number 3 and use 5 spaces to do so
  • finish with 3 spaces
The result is the image below:



The script is available here: Multiple_Plots_Script.r
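To make the layout example self-contained, here is a minimal runnable sketch that reproduces the same arrangement (2 plots on top, 1 centred below) with three histograms instead of the boxplots above:

#Minimal self-contained version of the layout above, with three histograms
data <- data.frame(D1=rnorm(500,mean=2,sd=0.5),
 D2=rnorm(500,mean=2.5,sd=1),
 D3=rnorm(500,mean=5,sd=1.3))

layout(matrix(c(1,1,1,1,1,0,2,2,2,2,2,0,0,0,3,3,3,3,3,0,0,0), 2, 11, byrow = TRUE))

hist(data$D1,main="a)")
hist(data$D2,main="b)")
hist(data$D3,main="c)")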