
John Murray mashing geospatial data at #DataMash

Categories: Defra digital

John Murray's work with #opendefra LiDAR yields 'actionable insights' – by mashing geospatial data from a range of sources, he shows how social patterns follow physical features. Everything happens somewhere.

John Murray's work with LiDAR data has sought (and succeeded) to extract information from the 3D topography of our built environment that is of use to insurers, utilities providers and other services. In a session that captures the spirit of #DataMash – our two-day un/conference with Ordnance Survey – here he shows how multiple open data sources offer clues to the impact of location on people's lives. Always careful to consider the human context, this talk makes a particular case for considering the nuances of the interplay between the physical and the social in understanding places and the people in them.

There's a transcript of this video below.

So, we're talking about a use case and presenting a vision of a data-driven organisation. I'm going to present some live case studies and talk about the infrastructure that I've used, built using Ordnance Survey, Defra and other data.

I just want to put this thought forward: 'A truly data-driven organisation needs a flexible and an extensible data-driven tech infrastructure, to drive innovation...' My view is that without the right infrastructure in place it's very hard to build the data-driven organisation, because just as you try to break down barriers culturally, you need to break down barriers within your data.

This is what Mesh is – the system underneath, which is something I've been working on for about four years. Mesh is a database infrastructure, supported by software, that provides a flexible data structure. It's based largely on location, but it doesn't have to be – it supports multiple views. In rethinking the database, I feel very strongly that having predefined relationships, having to enforce referential integrity all the time, is a constraint that limits your approach – you haven't got data, you've got missing items, you can't deal with it... you have to be able to handle everything...

This is the kind of compromise: in the 'big data' world – I'm not a big fan of the term 'big data' – you have this idea of a data lake, which is basically – I call it a data dustbin – where you dump everything, don't keep it with any structure, you just keep it raw. I think that's also a mistake, because it actually works against you: you need to do a certain amount of normalisation and cleansing, but you also don't want to do too much. So I'm taking, as far as you need to, an object-oriented approach to entities: you treat entities as real-world objects, spatial data forms the glue behind the database, and the links between entities are – mostly – derived on the fly, as needed by the application that you're actually running at the time.

You don't have predefined relationships; they're defined as you go. Here's the background to how this came about. This is actually near where I live in Chester. I don't know whether anyone's familiar with the Mosaic product (Experian's)... or Acorn (CACI's)? They're all classifications of neighbourhoods. Now, how are they generated? I actually built Prism, which is now Personicx, Acxiom's product, and one of the things these systems generally do, when they haven't got enough data, is a grid-based aggregation or a radial aggregation until they've got enough data. They take these postcode centroids – the red dots. Say we're building a classification at postcode level: if we haven't got enough data in a postcode, we just radiate out, pulling in data until we have sufficient. But the problem with radiating is shown by the demography and topography here: we have a select development of million-pound houses, where footballers and directors live; similarly here we've got a council estate; we've got some two-up two-downs here; and we've got lots more seventies houses here – with a deprived area in the middle. So if you take a radial approach, you simply get an averaging. That's how the initial concept behind Mesh came about.
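The radial approach John criticises can be sketched in a few lines – starting from a target postcode centroid and pulling in the nearest postcodes, regardless of what kind of neighbourhood they belong to, until a minimum sample size is reached. The postcodes, coordinates and household counts below are invented for illustration:

```python
import math

def radial_aggregate(target, postcodes, min_sample):
    """Gather postcode records outward from `target` until the combined
    household count reaches `min_sample`. Each record is
    (postcode, (easting, northing), households)."""
    tx, ty = target
    # Sort every postcode by straight-line distance from the target centroid.
    by_distance = sorted(
        postcodes, key=lambda p: math.hypot(p[1][0] - tx, p[1][1] - ty)
    )
    gathered, total = [], 0
    for code, _, households in by_distance:
        gathered.append(code)
        total += households
        if total >= min_sample:
            break
    return gathered, total

postcodes = [
    ("CH1 1AA", (0, 0), 12),
    ("CH1 1AB", (100, 0), 8),
    ("CH1 2XY", (300, 50), 25),  # a very different neighbourhood, but nearby as the crow flies
]
codes, n = radial_aggregate((0, 0), postcodes, min_sample=20)
```

The averaging problem is visible in the structure: the sort key is pure distance, so a million-pound street and a deprived estate get pooled the moment they are geometrically close.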

So what I wanted to do was, instead of aggregating data radially, aggregate it topologically – going along the road. I've created buffer zones along roads here, and I've brought in the open map polygons behind them, so I can pull in data relating to households, and I can pull in attributes. Instead of going out a long distance, essentially I walk along the buffer pulling in data until I have sufficient – and actually, in the University we've been testing this: we get a much better homogeneity of the data when we're aggregating, and much better granular accuracy. So that was the concept behind Mesh: to create a structure that would easily enable that. We have real-world objects – houses, local government; we have infrastructure – railways, transport, rivers... So it's about creating real-world objects in the database. The layers within Mesh – I don't really like the term 'layers', but that's probably the best analogy – include infrastructure layers, held spatially using their polylines: road, rail, waterways, utilities. It can bring in private data; for example, I'm working with a utility company who provided me with the geometries of their own network, so I've put that into Mesh – OK, it's not public, but it's held for that purpose. Then you've got containers: political boundaries, physical boundaries – coastlines – statistical boundaries, land parcels; objects – buildings, facilities, natural objects like trees and vegetation; and user-defined ones – you might have your own objects, say a retail chain may give me a set of their outlets and geometries, major points of interest...
You've got attributes – attribute tables – so in many ways it's like a GIS system: you've got objects of various types, you've got containers. But it's how it's applied that matters, and that leads on to other sources.
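The topological alternative John describes – walking along the road network rather than radiating outward – can be sketched as a breadth-first walk over road segments, accumulating household data attached to each segment until the sample is big enough. The road graph and counts below are invented for illustration:

```python
from collections import deque

def aggregate_along_roads(start, adjacency, households, min_sample):
    """Walk the road graph breadth-first from `start`, pulling in the
    household count attached to each segment until `min_sample` is met."""
    seen, queue = {start}, deque([start])
    gathered, total = [], 0
    while queue and total < min_sample:
        seg = queue.popleft()
        gathered.append(seg)
        total += households.get(seg, 0)
        # Continue along segments connected to this one - not across the map.
        for nxt in adjacency.get(seg, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return gathered, total

adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
households = {"A": 10, "B": 15, "C": 30}
segs, n = aggregate_along_roads("A", adjacency, households, min_sample=20)
```

Because the walk follows connectivity, it stays on one side of physical barriers (railways, rivers, main roads) that a radial buffer would happily cross – which is where the better homogeneity comes from.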

One of the things with imagery... imagery is a problem: holding it. There are vast amounts – take LiDAR, take satellite imagery, and the other aerial imagery that's available – and it's just not really practical to hold it as blob objects in a database; it's just too big. It's also very difficult to spatially index – it just takes up so much space. For example, we took the whole of the Environment Agency one-metre LiDAR, and if I put it into PostGIS or another spatial database, I estimate that, indexed, it would be a hundred terabytes. So it's not really logistically practical – I tried it on a sample area, the SJ grid square, and it's also not very fast. So we've actually developed our own engine, which some of you may have seen, which is a query engine for raster and gridded data, so we can link to these. We've got LiDAR data; we've also got social media – we can scrape social media feeds with a geotag; we've got prescription data... so we've got lots of other things.
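The core idea behind a raster query engine of this kind is to leave the gridded data in tiles and resolve a coordinate to a (tile, row, column) address on demand, rather than storing blobs in the database. This is a minimal sketch of that address arithmetic only; the 1 km tile size, 1 m cell size and tile naming scheme are assumptions for illustration, not Mesh's actual layout:

```python
TILE_SIZE = 1000  # assumed: 1 km tiles
CELL_SIZE = 1     # assumed: 1 m cells (matching the EA one-metre LiDAR)

def locate_cell(easting, northing):
    """Map a national-grid coordinate to a tile key and a cell index within it."""
    # Snap down to the tile's south-west corner.
    tile_e = (easting // TILE_SIZE) * TILE_SIZE
    tile_n = (northing // TILE_SIZE) * TILE_SIZE
    # Offset within the tile, in cells.
    col = (easting - tile_e) // CELL_SIZE
    row = (northing - tile_n) // CELL_SIZE
    return f"tile_{tile_e}_{tile_n}", int(row), int(col)

key, row, col = locate_cell(340512, 366207)
```

Because the tile key is computable from the coordinate, a height lookup is a single file seek instead of a spatial-index traversal over terabytes of blobs.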

So, the Mesh data sources – this is what is currently in Mesh. Ordnance Survey is a key dataset; it forms the backbone of everything. It gives us the basic infrastructure – the roads, the railways and everything – and key objects: schools, places of interest. ONS gives us the census; from DWP we've got benefit claims; from NHS Digital we've got all of these attributes – GP surgeries – and because we've got addresses we can make a location out of that, so we can start doing some analytics. We've got list attributes, GP lists in five-year bands, prescriptions, obesity, hospital admissions – these things you can all assign by location. We've got Defra data in, and we've got Land Registry, Companies House, HMRC, and police – I'll come onto police in a minute – DVLA, DECC, Ofcom. So we're actually starting to get, in one database, a picture of people's lives – the lives of the average household and the things that are available to them. We've taken down every barrier; we don't have any constraints: if you query Mesh it will just give you a profile of everything available.

We've also built something we've found very useful. Using Ordnance Survey's Open Roads layer, I've written an algorithm that creates polygons of all the minimal-size plots of land that are completely enclosed by roads – they're just coloured in random colours here to show you. These are incredibly useful containers, as they can tell you where something is, give you a real-world location, whether something's on one side of the road or the other, or adjacent – looking at concepts of neighbourhoods. It's based on the United States' TIGER system, if anyone's familiar with that – looking at block geometries. And we've got LiDAR data – easily recognisable – that's some Environment Agency LiDAR data that I've visualised at an oblique angle.
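Production road-enclosed-plot algorithms work by planar polygonisation of the road network (for example Shapely's `polygonize`); as a toy illustration of the same idea, the sketch below rasterises roads into a grid of `#` cells and flood-fills the open cells, keeping only regions that never touch the map edge – i.e. plots completely enclosed by roads. The grid is invented:

```python
from collections import deque

def enclosed_blocks(grid):
    """Return the cell regions fully enclosed by '#' (road) cells."""
    rows, cols = len(grid), len(grid[0])
    seen, blocks = set(), []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "#" or (r, c) in seen:
                continue
            # Flood-fill one open region, noting whether it leaks off the map.
            region, touches_edge = [], False
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_edge = True
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] != "#" and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            if not touches_edge:
                blocks.append(region)
    return blocks

grid = [
    "#####",
    "#...#",
    "#####",
]
blocks = enclosed_blocks(grid)
```

The "never touches the edge" test is what makes the plots *completely* enclosed – open countryside bounded by roads on only three sides is discarded.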
And this is actually prescriptions – I've taken GP prescriptions. The drug code is a classification code which is hierarchical: every two digits gives you part of the body, then a subclass, then further subclasses in twos – it's a 12-digit number. So this is all antidepressants, and what I've done is take the number of items for antidepressants on the prescription list of a GP surgery, as a percentage of patients in a month – that's what this is – and this is the Wirral. Green is very low, red is very high. If you look at the numbers underneath this, there are some quite big differences... and it is actually affluence: poorer areas generally, but not entirely, have high rates of prescription. That's something that's very useful.
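The two-digits-per-level structure John describes means the code's classification ancestry can be read off as successive prefixes – useful for rolling prescriptions up from an individual drug to "all antidepressants". The example code below is invented, not a real classification entry:

```python
def code_levels(code, width=2):
    """Return the code's ancestry: each successively longer prefix,
    `width` characters at a time, from broadest class to full code."""
    return [code[: i + width] for i in range(0, len(code), width)]

levels = code_levels("040301000000")
```

Selecting a whole class is then just a prefix match – e.g. every code whose first four digits equal the target section.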

Now, looking at crime data from the police. This is something that's interesting – this is Chester. We've got a very big crime density in the city centre, and we've got an estate here with quite a density. Now, when we actually looked at that, we took the Defra data – the Food Standards Agency data – and looked at fast-food outlets, and it's all around the fast-food outlets that you get the crime. And actually there is a correlation – this is something that came out – between low-performance, one- or two-star takeaways and crime. There are pubs in the FSA data too, so you can start using the FSA data to normalise crime data: what crime could I expect? It's not that that is a bad area; it's just that it is near a pub. So by meshing Defra FSA data with police crime data, using Ordnance Survey as the backbone, we've actually been able to find anomalies and model our crime dataset: 'what would we expect at this location, given these attributes?'
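The "what crime would we expect here?" step can be sketched as a simple linear expectation from location attributes, with the anomaly being the gap between actual and expected counts. The weights and counts below are invented placeholders, not fitted coefficients:

```python
def expected_crime(area, weights):
    """Linear expectation: each nearby attribute contributes its
    weighted count to the crimes we'd expect in this area."""
    return sum(weights[attr] * count for attr, count in area.items())

# Assumed illustrative weights: crimes expected per nearby premise per month.
weights = {"pubs": 5.0, "low_rated_takeaways": 3.0}

area = {"pubs": 2, "low_rated_takeaways": 1}
actual = 20
# Positive anomaly: more crime than the pubs and takeaways alone explain.
anomaly = actual - expected_crime(area, weights)
```

An area with two pubs isn't flagged for the crime those pubs predict; only the residual above the expectation marks it as genuinely anomalous.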

This is another project I'm working on at the moment, with Northern Rail, where we're looking at development opportunities along railway lines. These – it looks like a hairy caterpillar – are stations along the mid-Cheshire line. I've created one-mile, two-mile and three-mile bands; these are postcode dots – postcode head counts – so we're able to pull in other data, and where we've got overlap, we assign to the nearest station. Northern is interested in: what is the state of the catchment? Rural houses, urban... what are the attributes of deprivation? Because we're also bringing in – not just people – but, using Ordnance Survey data, schools and employment areas. We can bring in a lot of objects and actually ask: 'where do people go?' It's not just where people live and get on the train... it's where do they go when they get on the train?
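The overlap rule above – assign each postcode to its nearest station, then bucket it into a mile band – can be sketched as follows. The station names and coordinates (in miles, for simplicity) are invented for illustration:

```python
import math

def assign_to_station(postcode_xy, stations):
    """Return (nearest station, distance band) for a postcode centroid.
    Bands are 1-mile rings: <=1 mile -> band 1, <=2 miles -> band 2, ..."""
    px, py = postcode_xy
    name, dist = min(
        ((n, math.hypot(px - x, py - y)) for n, (x, y) in stations.items()),
        key=lambda t: t[1],
    )
    band = math.ceil(dist) if dist > 0 else 1
    return name, band

stations = {"Northwich": (0.0, 0.0), "Knutsford": (6.0, 0.0)}
station, band = assign_to_station((1.5, 0.0), stations)
```

Summing postcode head counts per (station, band) pair then gives the catchment profile – and the same assignment works for schools and employment areas, answering "where do they go?" as well as "where do they live?".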
