Wednesday 11 June 2014

Extract Coordinates and Other Data from KML in R

KML files are used to visualize geographical data in Google Earth. These files are written in XML and allow to visualize places and to attach additional data in HTML format.

In these days I am working with the MIDAS database of wind measuring stations across the world, which can be freely downloaded here:

First of all, the file is in KMZ format, which is a compressed KML. In order to use it you need to extract its contents. I used 7zip for this purpose.
The file has numerous entries, one for each point on the map. Each entry generally looks like the one below:

 <Placemark>  
    <visibility>0</visibility>  
    <Snippet>ABERDEEN: GORDON BARRACKS</Snippet>  
    <description>  
       <![CDATA[  
       <table>  
       <tr><td><b>src_id:</b><td>14929  
       <tr><td><b>Name:</b><td>ABERDEEN: GORDON BARRACKS  
       <tr><td><b>Area:</b><td>ABERDEENSHIRE  
       <tr><td><b>Start date:</b><td>01-01-1956  
       <tr><td><b>End date:</b><td>31-12-1960  
       <tr><td><b>Postcode:</b><td>AB23 8  
       </table>  
       <center><a href="http://badc.nerc.ac.uk/cgi-bin/midas_stations/station_details.cgi.py?id=14929">Station details</a></center>  
       ]]>  
    </description>  
    <styleUrl>#closed</styleUrl>  
    <Point>  
       <coordinates>-2.08602,57.1792,23</coordinates>  
    </Point>  
 </Placemark>  


This chunk of XML code is used to show one point on Google Earth. The coordinates and the elevation of the points are shown between the <coordinates> tag. The <styleUrl> tag tells Google Earth to visualize this points with the style declared earlier in the KML file, which in this case is a red circle because the station is no longer recording.
If someone clicks on this point the information in HTML tagged as CDATA will be shown. The user will then have access to the source ID of the station, the name, the location, the start date, end date, postcode and link from which to view more info about it.

In this work I am interested in extracting the coordinates of each point, plus its ID and the name of the station. I need to do this because then I have to correlate the ID of this file with the ID written in the txt with the wind measures, which has just the ID without coordinates.

In maptools there is a function to extract coordinates and elevation, called getKMLcoordinates.
My problem was that I also needed the other information I mentioned above, so I decided to teak the source code of this function a bit to solve my problem.

#Extracting Coordinates and ID from KML  
kml.text <- readLines("midas_stations.kml")  
   
re <- "<coordinates> *([^<]+?) *<\\/coordinates>"  
coords <- grep(re,kml.text)  
   
re2 <- "src_id:"  
SCR.ID <- grep(re2,kml.text)  
   
re3 <- "<tr><td><b>Name:</b><td>"  
Name <- grep(re3,kml.text)  
   
kml.coordinates <- matrix(0,length(coords),4,dimnames=list(c(),c("ID","LAT","LON","ELEV")))  
kml.names <- matrix(0,length(coords),1)  
   
for(i in 1:length(coords)){  
    sub.coords <- coords[i]  
    temp1 <- gsub("<coordinates>"," ",kml.text[sub.coords])  
    temp2 <- gsub("</coordinates>"," ",temp1)  
    coordinates <- as.numeric(unlist(strsplit(temp2,",")))  
   
    sub.ID <- SCR.ID[i]  
    ID <- as.numeric(gsub("<tr><td><b>src_id:</b><td>"," ",kml.text[sub.ID]))  
   
    sub.Name <- Name[i]  
    NAME <- gsub(paste("<tr><td><b>Name:</b><td>"),"",kml.text[sub.Name])  
   
    kml.coordinates[i,] <- matrix(c(ID,coordinates),ncol=4)  
    kml.names[i,] <- matrix(c(NAME),ncol=1)  
}  
   
   
write.table(kml.coordinates,"KML_coordinates.csv",sep=";",row.names=F)  

The first thing I had to do was import the KML in R. The function readLines imports the KML file and stores it as a large character vector, with one element for each line of text.
For example, if we look at the KML code shown above, the vector will look like this:

 kml.text <- c("<Placemark>", "<visibility>0</visibility>",   
 "<Snippet>ABERDEEN: GORDON BARRACKS</Snippet>", ...  

So if I want to access the tag <Placemark>, I need to subset the first element of the vector:
kml.text [1]

This allows to locate the elements of the vector (and therefore the rows of the KML) where a certain word is present.
I can create the object re and use the function grep to locate the line where the tag <coordinates> is written. This method was taken from the function getKMLcoordinates.

By using other key words I can locate the lines on the KML that contains the ID and the name of the station.

Then I can just run a loop for each element in the coords vector and collect the results into a matrix with ID and coordinates.


Conclusions
I am sure that this is a rudimentary effort and that there are other, more elegant ways of doing it, but this was quick and easy to implement and it does the job perfectly.


NOTE
In this work I am interested only in stations that are still collecting data, so I had to manually filter the file by deleting all the <Placemark> for non-working stations (such as the one shown above).
It would be nice to find an easy way of filtering a file like this by ignoring the whole <Placemark> chunk if R finds this line: <styleUrl>#closed</styleUrl>


Any suggestions?