Data.gov - Empowering People

Cleaning Up Metadata Messiness

Posted on Wed, 2012-05-02 10:57 by Chris Musialek

Recently, I took some time to do a little analysis of our geospatial metadata in geo.data.gov. The results are extremely interesting, and they highlight a difficult challenge facing Data.gov as we work to improve the ways our users search for and discover Federal geospatial datasets, both of which rely heavily on the quality of our metadata.

Background

Almost all agencies publishing metadata on geo.data.gov currently use the FGDC format, which is named after the interagency committee that established the standard in 1994. The latest version of the format has been around for close to 15 years, is very highly structured, and has many required elements. Most importantly for metadata geeks, many of the fields allow the use of free text, and do not enforce strict vocabularies.

I decided to look at one important FGDC metadata element in particular: the publisher name (the actual XML tag is <publish>), which is usually the agency that provided the data. Both our public consumers and our data providers are interested in filtering geospatial results by agency, and I wanted to see how feasible it would be to index this metadata field in order to create a filter-by-agency capability. We continually get requests to provide current counts of datasets per agency, to let agencies track dataset publishing status for just their organization, and to give data journalists another facet by which they can gauge individual agency participation in open data and open government. This is the field best suited to satisfy those needs.

The Data

Below is a list of the unique values ranked by frequency of occurrence based on my original query parameters. Specifically, I queried for all records that have been created or updated within the last year that are also approved for release on Data.gov. You can interact with this list directly by searching/filtering.



As you can see, we describe our agencies’ names in many, many different ways. My personal favorites are the number of different ways that the agencies USGS and NOAA are described. USGS describes themselves as “U.S. Geological Survey”, “USGS”, “U.S. Geological Survey (USGS)”, and three others. In their defense, they are also the ones who probably have the greatest number of groups publishing geospatial metadata, so they’re also the most likely to be at the top of the list.
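A tally like the one described above could be produced along the following lines. This is only a minimal sketch, not the query actually run against geo.data.gov: it assumes the records are already available as XML strings, and the function name and sample records are made up for illustration. Only the <publish> tag and the USGS spellings come from the analysis itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily trimmed sample records. Real FGDC records carry many
# more elements; only the <publish> element matters for this tally.
SAMPLE_RECORDS = [
    "<metadata><idinfo><citation><citeinfo><pubinfo>"
    "<publish>USGS</publish>"
    "</pubinfo></citeinfo></citation></idinfo></metadata>",
    "<metadata><idinfo><citation><citeinfo><pubinfo>"
    "<publish>U.S. Geological Survey</publish>"
    "</pubinfo></citeinfo></citation></idinfo></metadata>",
    "<metadata><idinfo><citation><citeinfo><pubinfo>"
    "<publish>USGS</publish>"
    "</pubinfo></citeinfo></citation></idinfo></metadata>",
    "<metadata><idinfo><citation><citeinfo><pubinfo>"
    "<publish>U.S. Geological Survey (USGS)</publish>"
    "</pubinfo></citeinfo></citation></idinfo></metadata>",
]


def count_publishers(xml_docs):
    """Count how often each distinct <publish> value occurs across records."""
    counts = Counter()
    for doc in xml_docs:
        root = ET.fromstring(doc)
        # iter() finds <publish> at any depth, so the sketch does not
        # depend on the exact nesting of a given record.
        for elem in root.iter("publish"):
            name = (elem.text or "").strip()
            if name:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Print the distinct spellings, most frequent first.
    for name, n in count_publishers(SAMPLE_RECORDS).most_common():
        print(n, name)
```

Because the values are compared as exact strings, every spelling of the same agency shows up as its own row, which is precisely the messiness described above.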

Improving search

Faceted search is highly dependent on metadata quality. In order for us to be able to provide a way to filter results by agency, we need a standard way of describing agency names, or a way to map the different labels representing the same thing. Most search engines don’t expose very many facets, but providing the most common ones can make a huge difference in terms of better search and discovery.

At this point, we are looking at two ways to provide a short-term solution to this problem while the community looks longer term at replacing the FGDC metadata standard with something better suited (one of the standards gaining a lot of traction in the geo community today is ISO 19115-2). The first is to use entity resolution technology to converge the names onto a controlled vocabulary list of agencies that we manage. The second is to require agencies, when publishing their metadata to Data.gov, to reference a unique identifier from a controlled vocabulary for this element.
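To make the entity resolution option a little more concrete, here is a deliberately simple sketch: normalize each free-text label, then look it up in an alias table that points to one canonical name. The vocabulary, the alias lists, and the function names are all assumptions for illustration. Real entity resolution tooling would do far more (fuzzy matching, sub-agency handling, human review of misses).

```python
import re

# Hypothetical controlled vocabulary: canonical label -> known aliases.
CONTROLLED_VOCAB = {
    "U.S. Geological Survey": [
        "USGS",
        "U.S. Geological Survey",
    ],
    "National Oceanic and Atmospheric Administration": [
        "NOAA",
        "National Oceanic and Atmospheric Administration",
    ],
}


def normalize(label):
    """Reduce a free-text label to a comparable key."""
    label = re.sub(r"\([^)]*\)", " ", label)           # drop "(USGS)"-style suffixes
    label = re.sub(r"[^a-z0-9 ]", "", label.lower())   # drop punctuation
    return " ".join(label.split())                     # collapse whitespace


# Build the lookup once: normalized alias -> canonical label.
ALIAS_INDEX = {
    normalize(alias): canonical
    for canonical, aliases in CONTROLLED_VOCAB.items()
    for alias in aliases
}


def resolve(label):
    """Return the canonical agency name, or None if the label is unknown."""
    return ALIAS_INDEX.get(normalize(label))


if __name__ == "__main__":
    for raw in ["USGS", "U.S. Geological Survey (USGS)", "Unknown Bureau"]:
        print(repr(raw), "->", resolve(raw))
```

Labels that fail to resolve return None rather than a guess, so they can be queued for manual review and added to the alias table over time.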

Our current preference is to require agencies to reference a controlled vocabulary URI in their metadata. This moves us in the right direction of metadata standardization rather than passing the issue on to each application owner who will invariably need to deploy their own custom solution. Of the agencies that we’ve spoken to, many have said that it wouldn’t be too onerous a change for them to update any new records to reference a controlled vocabulary listing available via an HTTP URI.
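As a rough illustration of what consuming such records might look like, the sketch below treats the <publish> value as a URI and looks it up in a vocabulary listing. Everything specific here is hypothetical: the example.org URIs are placeholders rather than real identifiers, and nothing has been decided about how the URI would actually be carried in a record.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary listing: permanent URI -> agency label.
# These URIs are placeholders, not identifiers published by Data.gov.
AGENCY_URIS = {
    "http://example.org/vocab/agency/usgs": "U.S. Geological Survey",
    "http://example.org/vocab/agency/noaa": (
        "National Oceanic and Atmospheric Administration"
    ),
}


def publisher_labels(xml_text):
    """Return (raw value, resolved label or None) for each <publish> element."""
    root = ET.fromstring(xml_text)
    results = []
    for elem in root.iter("publish"):
        value = (elem.text or "").strip()
        # An exact URI match resolves directly; free text resolves to None.
        results.append((value, AGENCY_URIS.get(value)))
    return results


if __name__ == "__main__":
    with_uri = (
        "<metadata><publish>http://example.org/vocab/agency/usgs"
        "</publish></metadata>"
    )
    free_text = "<metadata><publish>U.S. Geological Survey (USGS)</publish></metadata>"
    print(publisher_labels(with_uri))
    print(publisher_labels(free_text))
```

With a URI in place, filtering by agency becomes an exact lookup, and any record that still carries free text is easy to flag for follow-up.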

In either solution, it’s clear that we need a controlled vocabulary using permanent URIs that describe each of the Federal agencies. We hope that our effort to create a vocabulary publishing site called vocab.data.gov will help to fill this gap.

How do you think we should solve this problem? Please send us your comments.

Chris Musialek is the Chief Software Architect on Data.gov and the product manager for the geo.data.gov and geoplatform.gov projects.


Comments

Agency names

I wonder if one way of creating a standard for agency names (and thus data coming from them) is to model on NYSE and NASDAQ by having unique 4 or 5 letter tickers...USGS and NOAA for example are set up already. Of course it would require that sub-sets of those agencies get on board.

 

Issue can be easily solved

Faceted search is highly dependent on metadata quality. And to top that off you have to spell it correctly. That is the biggest problem with boolean searches.

Excellent Data

I will say that it's great having ready access to such large data sets. I just wish that there was an API or some other way for developers to access it. That way we could easily build a website and make it available to thousands of users.

Data from government sites

It is very difficult to get data from government sites. Is it possible to collect data from government sites?

Great Article

I really appreciate the work you guys are doing!

The structure of metadata

The structure of metadata needs to be standardized for this to become more efficient and useable.

-Jay

Great Example of Data Services

This community will help everyone collect data. This is a great resource.

Will always be a problem

As Chris notes, one of the problems is in the naming conventions. Unless we globally conform to a particular set of data, we will never achieve perfection. And to top that off you have to spell it correctly. That is the biggest problem with boolean searches. Things are getting better, but words by themselves have no real meaning. There is no "data" behind them. This is a huge challenge, especially in our health care system. Keep up the research!

Bob@scconcierge.net
http://scconcierge.net

 

Search is important

Improving search is so important to help users find data more quickly.

Great article

Faceted search is highly dependent on metadata quality. In order for us to be able to provide a way to filter results by agency, we need a standard way of describing agency names, or a way to map the different labels representing the same thing.
