Browsable Science Museum Data
This is a browsable version of the Science Museum, National Media Museum and National Railway Museum collections data recently released by the National Museums of Science & Industry; a total of 218822 objects with 40595 images from three collections.
My main interest in all of this was the train stuff from the NRM. But I figured it'd be best to get a decent feel for the shape of the data before just looking at the NRM bits. So I took the object and media spreadsheets available from the Science Museum API and forced them into something roughly MySQL shaped using more brute force than cleverness. I also normalised out what I could whilst trying to keep the data accurate to the source, without correcting typos in case they weren't typos. Which I think is why there are four collections rather than three. For reference the original data (stripped of wrapping quotes) is shown at the bottom of each object page.
So far I haven't done anything with the event data because it doesn't (as yet) link to the objects so it didn't seem worth it. I've also left all the list views unpaginated partly through laziness and partly because it's easier to see what's going wrong that way. The best place to start browsing is probably with the collections because every object is in a collection (except these duplicate Urethral forceps! and another duplicate record) and they're fairly obviously organised.
Anyway, it's definitely worth a browse round (but I would say that). Some of the nooks and crannies are fascinating. Although some bits are rather like rooting through your mad auntie's attic...
Some stuff I noticed
- Some of the objects have neither title nor name.
- The object ID numbers are all unique if you're assuming case sensitivity but if you're working with case insensitive code there's one (and only one) duplication: a602770 is an artificial arm, A602770 is a rubber ball massager.
- As mentioned above, a trade card found its way into a top level collection called SxCM, which I assume was a typo.
- I did think something was going wrong with the materials normalisation but it seems someone at the railway museum is just obsessed with typing asbestos.
- Similar problem with manufacturers where someone at the science museum went into an et al loop.
- And a similar problem with places which looks like someone has confused the use of commas and semicolons as address separators.
- I had been planning to make the place data browsable but the structure was a little haphazard; sometimes comma separated address style, sometimes just a place name. So you get Oldham and Oldham, England, United Kingdom and Oldham, Greater Manchester, England, United Kingdom and Oldham, Greater Manchester, England, United Kingdom, Lancashire and Oldham, Manchester, England, United Kingdom and Oldham, Oldham borough, Greater Manchester, England, United Kingdom. One Oldham is quite enough...
- Similar problems with manufacturers where you get Acorn Computers Limited and Acorn Computers Ltd. eg.
- The manufacturer field has been overloaded. Sometimes with birth / death dates of people which means if you assume semi-colons are used to separate different manufacturers (which in 99% of cases is true) you end up with 1832-1907 as a manufacturer. And sometimes with the role the manufacturer played in the making so you end up with Brunel, Isambard Kingdom, 1806-1859 and Brunel, Isambard Kingdom (designer) as manufacturers.
- The materials field has also been overloaded with the part of the object using that material. So there's Iron and internal pan, iron eg.
- Of the 40595 images, 32095 match to records in the object sheets. Which means 8500 don't. I assume they're images of objects in collections which haven't yet been released?!? There's a full list of unmatched images here.
- Sometimes the image API works; sometimes it doesn't. (eg row 39558 of the media spreadsheet has an image for object A626844 with a media key of 181972 but putting that into the image API just returns 'Error: an unsupported input format was requested.' Not sure what the 'wm' in the API means here; maybe that's the problem...)
- I didn't do anything with the date stuff, partly because I wasn't sure how to expose that facet as an aggregation and partly because the syntax used is again inconsistent. Sometimes as day/month/year, sometimes as year/month/day, sometimes as year ranges, sometimes as year/month...
- Somewhere along the line my character encoding went wrong, but I never did get character encoding...
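The case sensitivity point above is easy to check for yourself. Here's a minimal sketch in Python (not what I actually ran; the function name is made up and the list of IDs is assumed to have been pulled out of the object spreadsheet already) that groups IDs differing only by letter case:

```python
from collections import defaultdict


def case_insensitive_collisions(object_ids):
    """Return groups of IDs that are distinct but identical once case is ignored."""
    groups = defaultdict(set)
    for object_id in object_ids:
        # casefold() gives a case-insensitive key for comparison
        groups[object_id.casefold()].add(object_id)
    # Keep only the keys that more than one distinct ID collapses onto
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


# The two colliding IDs mentioned above, plus one that doesn't collide
ids = ["a602770", "A602770", "A626844"]
print(case_insensitive_collisions(ids))
```

Anything this returns would need a case sensitive column (or a different key) if you're loading the data into a database with a case insensitive collation.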
In general it looks like a fairly typical archive data set that's been added to over time. From working with similar data in the past, the problems tend to be the same. Once the data model is set up it's usually considered too expensive to change it, so instead cataloguers tend to syntactically overload the fields they have been given. As old employees leave and new employees arrive the syntax changes with use. Which just makes it very difficult to get a computer to understand it. There also seems to be a lack of controlled vocabularies / reference data / lookup tables for the various facets (unless that's not yet been released).
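To make the overloading concrete, here's a rough sketch of how you might unpick the manufacturer field. It's a guess at the conventions rather than a description of the real data model: it assumes semicolons separate makers, that a bare year range following a semicolon belongs to the maker before it, and that a trailing bracketed word is a role. The function and pattern names are all invented for the example.

```python
import re

# A part that is nothing but a year range, e.g. "1832-1907"
LIFE_DATES = re.compile(r"^\d{4}\s*-\s*\d{4}$")
# A name followed by a bracketed role, e.g. "Brunel, Isambard Kingdom (designer)"
ROLE = re.compile(r"^(?P<name>.*?)\s*\((?P<role>[^()]+)\)$")
# A name followed by a year range, e.g. "Brunel, Isambard Kingdom, 1806-1859"
TRAILING_DATES = re.compile(r"^(?P<name>.*?),\s*(?P<dates>\d{4}\s*-\s*\d{4})$")


def parse_manufacturers(field):
    """Split an overloaded manufacturer field into name / dates / role records."""
    makers = []
    for part in field.split(";"):
        part = part.strip()
        if not part:
            continue
        if LIFE_DATES.match(part):
            # Assumption: a bare date range describes the previous maker
            if makers and makers[-1]["dates"] is None:
                makers[-1]["dates"] = part
            else:
                makers.append({"name": None, "dates": part, "role": None})
            continue
        maker = {"name": part, "dates": None, "role": None}
        match = ROLE.match(maker["name"])
        if match:
            maker["name"], maker["role"] = match.group("name"), match.group("role")
        match = TRAILING_DATES.match(maker["name"])
        if match:
            maker["name"], maker["dates"] = match.group("name"), match.group("dates")
        makers.append(maker)
    return makers


print(parse_manufacturers("Brunel, Isambard Kingdom, 1806-1859"))
print(parse_manufacturers("Brunel, Isambard Kingdom (designer)"))
```

Both calls give the same name, "Brunel, Isambard Kingdom", with the dates or the role pulled out into their own slots. Which is roughly what the data model would have looked like if it had been cheap to change.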
It would probably be worth running the original spreadsheets through Google Refine to tidy up some of the mess, but I'm not sure how far you'd get without a domain / museum expert. But I think you'd be able to tidy up most of it in a couple of days...
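For a flavour of the sort of tidying involved, here's a small sketch loosely modelled on key collision clustering: reduce each name to a normalised "fingerprint" and group the names that share one. The abbreviation table is entirely hypothetical (a real one would want a museum expert to sign it off) and the function names are made up.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table, for illustration only
ABBREVIATIONS = {"ltd": "limited", "co": "company"}


def fingerprint(name):
    """Lowercase, strip punctuation, expand abbreviations, then sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", " ", name.casefold()).split()
    tokens = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(sorted(set(tokens)))


def cluster(names):
    """Return lists of names that collapse onto the same fingerprint."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(cluster(["Acorn Computers Limited", "Acorn Computers Ltd.", "Oldham"]))
```

This would catch the Acorn example but not the many Oldhams, since those differ by more than punctuation and word order. That's the bit where you need someone who knows the collection.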
General linked-data-ish thoughts
If you could get the facet data tidied linking it up should be relatively easy (manufacturers and people to DBpedia, places to DBpedia / GeoNames). But linking the actual objects would be much more difficult because it's such a bizarre mish-mash of stuff.
Some of the objects (like the Mallard and Flying Scotsman) are famous in their own right so could link to DBpedia. But many of the objects are instances of famous classes of (mass-manufactured) things. So Wikipedia has a page for Concordes in general but not for Concorde 002. So maybe you'd need to use some kind of product ontology? And some objects are parts from famous objects or instances of famous objects so you'd need a way to describe components...
Other objects (like Charles Babbage's scribbling book) are in the collection because they were owned by / used by famous scientists / engineers. And other objects just seem more Victorian freak show than science museum. I have no idea why there's a Specimen jar containing piece of William Burke's brain for instance...
We did talk about objects in the context of the Science ontology we've been working on and it seemed to break down roughly into objects needed as equipment in experiments and objects that owe at least part of their existence to scientific theory. But the latter felt more like theory > influences > engineering > leads to > invention > leads to object so we left that bit out. But obviously there are engineering inventions as a result of scientific theory which go on to be used as equipment in other experiments. Which as Silver pointed out is not utterly dissimilar to the foodstuff > ingredient > recipe > foodstuff > ingredient > recipe model we have in the food ontology (not yet published). Or maybe we just like circular models...
As a final thought I do wonder if a lightweight, white label, graph based collection management tool would be useful for museums and galleries? Something that allowed the addition of new facets without too much pain and provided easy maintenance and linking of reference data...