[This Transcript is Unedited]

DEPARTMENT OF HEALTH AND HUMAN SERVICES

NATIONAL COMMITTEE ON VITAL AND HEALTH STATISTICS

SUBCOMMITTEE ON POPULATIONS

WORKSHOP ON DATA LINKAGES TO IMPROVE HEALTH OUTCOMES

September 18, 2006

Renaissance Hotel
999 9th Street, NW
Washington, D.C.

Proceedings by:
CASET Associates, Ltd.
10201 Lee Highway
Fairfax, Virginia 22030
(703)352-0091

P R O C E E D I N G S (9:10 a.m.)

Agenda Item: Call to Order, Welcome and Introductions

DR. STEINWACHS: I am Don Steinwachs. I have the pleasure of chairing the Populations Subcommittee of the National Committee on Vital and Health Statistics. We welcome you today to a two-day workshop on data linkages to improve health outcomes.

I think everyone has a copy of the agenda that spans two days. After saying a couple of words about NCVHS and what the Populations Subcommittee does, I will be turning to other members of the committee to talk about the motivation and the direction that this program has been shaped to try and answer some key questions for us about best practices and successes in data linkages, and identifying where there are barriers, and overcoming those barriers to improve data linkages, to provide information necessary for improving the health of the public.

As many of you may know, the National Committee on Vital and Health Statistics is an advisory body to the Secretary of Health and Human Services on health information and different data policy. The Populations Subcommittee has a particular focus within that, and that focus is on population health measurement. In looking at population health, we tend to emphasize the distribution of health characteristics in the population, trying to identify disparities in subpopulations.

The program today is shaped in many ways trying to look at those sources of data that provide potentially important information on risk factors for health, socioeconomic status, education, income and other factors, as well as potentially open doors to understanding the health of subpopulations, because the data are more comprehensive and may exist in national surveys and interviews that are done.

There is a picture that all of us have on the committee for what health statistics might be. About five years ago, a report came out of the subcommittee before any of us who are here today were members, but Marjorie was there and some others, so can talk about it authoritatively.

It is a report on the vision for health statistics in the 21st century. That vision talked of bringing together the variety of information that is needed to look at health risks and health outcomes, to look at disparities in health and the provision of health services, a very ambitious agenda, one I don't think that we can attain without exploiting fully the capacity for data linkages for existing data, as well as trying to fill the data gaps.

Another piece of work this Populations Subcommittee has taken on is trying to look at health in minority populations. We found very clearly that in many minorities, we have insufficient data to be able to say something about the health status and disparities in health, whether you are talking about Native Hawaiians, Pacific Islanders, you are talking about the tribes in the American Indians, and other groups. So when we take on the Secretary's goal of reducing disparities in health and health services, we find in many areas we just don't have the information. Again, there is hope that possibly through linking data sets and identifying areas for strengthening data collection, we can improve that.

Before turning to the introductions specifically for the two-day program, I would like to go around and have members of the committee introduce themselves, so you know who is on the Populations Subcommittee and on the staff.

I am Don Steinwachs. I am at Johns Hopkins University, Bloomberg School of Public Health and Chair of the subcommittee.

(Whereupon, introductions were performed.)

DR. STEINWACHS: I should also let you know that this two-day workshop is being broadcast on the Internet, so I ask for you to speak into the microphone. I know that some of the people who are attending around the table do not have a microphone. In that case, if there is not a microphone easily accessible to the individual to ask a question, I would ask that the person who is receiving the question restate the question. Oh, we have a portable mike. I should have never doubted that technology would be here.

Why don't we take advantage and also ask the audience here to introduce themselves for the benefit of people on the Internet.

(Whereupon, introductions were performed.)

DR. STEINWACHS: I would like to turn now to Gene Steuerle and Nancy Breen, who along with Joan Turek shaped the agenda, and many of you got to talk to as the agenda came into place for the workshop and have been trying to think through these issues.

Gene, I don't know whether you are going to start or Nancy, to make some comments about the specific expectations for the workshop.

Agenda Item: The Importance of Data Linkages – C. Eugene Steuerle

DR. STEUERLE: I would like to also extend my thanks to all of you for coming. I realize that any activity like this is always time consuming. I can see people's brains now wandering back to what they have got sitting on their desks that is not being done right now, so I really appreciate that.

I know we have got a couple of speakers that are still in the audience. This is really meant to be a roundtable. The configuration of the room is what it is, but please come up to the table. We are not just asking the speakers around this time, but anybody who is a speaker in a future session, please come up, because among the things we want to have is a dialogue. We want someone from say Census to turn to somebody from say IRS and say, if we could only have merged these data sets, we could have achieved the following. Or, we have a problem just like that with respect to privacy, or something like that. So we really do want it to be a dialogue.

I want to be clear that the National Committee on Vital and Health Statistics, of which I am a member and Don is the Chair, our role is mainly advisory. The main person we advise, although it doesn't have to be the only person, but the main person we advise is the Secretary of HHS. That indeed is one of our final products as we write these letters to the Secretary and say the following things should be done.

The purpose of this meeting is not just to talk about what data linkages are possible or are being done, but we really want the data advisory group to say, here are things that maybe aren't being done, or here are things that with a little bit of additional resources could be done, or here are some real constraints blocking the following linkage.

I ask all of you to think about it in the following sense. Our ultimate goal of course is to improve the health of the U.S. population. We are not linking data sets or we are not raising the issue because we are researchers and we like bigger data sets, which maybe we do, but hopefully we are doing it because we have a very specific reason.

I ask you to think about that also in terms of the power of the advice. If the advice says here are some data sets that are linked, it says one thing. If we could link these data sets, it is likely we could understand better the following. That would lead to an improvement in health for the U.S. population. It carries a lot more weight.

I realize that a number of you, particularly from the agencies, are resource constrained. That is one of the issues you will probably raise. You will say, we could do the following were it not for additional resources. There is probably a certain extent to which politically you are constrained from saying my boss or someone else in the Administration or Congress was too dumb to realize I needed more resources, but you can say here are the benefits if we could do the following, and we can let other people worry about whether those costs can be met.

We already know and anticipate that two of the biggest constraints that people will raise once they come up with ideas are with respect to resources, but also privacy and confidentiality. If you get to that issue, we would like you to help us also tease out whether there are alternative ways around the problem. Maybe there are ideal ways we want to share data sets, which I think is crucial to the extent we can, but even if we can't, to what extent are we able to bring in outsiders into agencies, to what extent are we blocked.

I know in dealing with groups that I have advised, at times they are blocked from bringing people inside the agencies for a lot of reasons that have little to do with the law or anything else, but just -- I don't want to say narrow, but almost Catch-22 types of rules. We would like to identify those. So really, to ask you again to think about the ultimate goal; it is to improve the health of the population and to help us sort out what could be done and not done.

In all the sessions we have tried to leave a fair amount of time for discussion and dialogue. So again, I would like to ask all of you who are not the formal speakers to jump in. People back here who are too modest to come up to the table, we really want your advice and input.

I think what we find as we go through is that all of us probably know -- because we are talking about information and data, something in which we can have an infinite demand, if we could have an infinite supply barring all costs. So a lot of us know an example here and an example there. The hope of this hearing is, we can go beyond an example we might gather one by one and put them together to be able to weave from the examples a broader story. So that is again the reason we encourage and hope that you will participate. I myself, and I know I speak for other members of the committee, are deeply appreciative of the time you have given to us.

Nancy?

Agenda Item: Nancy Breen

DR. BREEN: I would also like to welcome everybody, and thank you for taking the time to come here to this workshop on data linkages.

We feel it is really important. Gene and I didn't really know each other until we both came up with this idea that we should start to investigate what was going on in the federal government more systematically on data linkages, and try to determine what are the best practices, and also see if we could do more along those lines. The examples that both of us knew for data linkages were so fruitful and were bearing such great research, that it seemed like this was something that we should try to move forward more systematically. So once again, I am delighted that people have taken the time to come.

In the course of developing this, I want to take a minute to thank Joan Turek, who is sitting next to me. As we were talking about this at the committee, Gene and I had some ideas, and we had a list, but we didn't know everybody in the federal government that was doing this. Joan, it turned out, did.

DR. STEUERLE: And does.

DR. BREEN: So she filled the gaps for us. I would really like to thank Joan for what she has given to this. Thank you, Joan.

MS. TUREK: You see what happens when you work for Uncle Sam for 37 years.

DR. BREEN: And so productively, thank you. One of the things that she did was to make us recognize that the Census Bureau is doing a whole lot in this area, which I had no idea. I worked at the Census Bureau years ago, and they were collecting survey data and they were collecting the census. They were collecting a lot of different data, but as far as I knew, they weren't doing much in administrative records.

At the National Cancer Institute where I work, we started working with them a few years ago on the national mortality longitudinal survey, so I knew that they were starting to work with administrative records, but Joan made us realize that there is a whole group of people who were systematically working through how they could use survey and administrative data to improve not only the quantity but even more important perhaps, the quality of the data that we use.

In conversations -- can I take this as a transition to start to introduce the first speakers?

DR. STEINWACHS: Please do.

DR. BREEN: Okay. I guess I will start by introducing the speakers and then say a little bit about our conversations and what I think they are going to be talking about, just to bring out some of the high points, and then they will provide more detail on what they are actually doing.

I would like to introduce Sally Obenski. I think you have been reorganized recently. The title that she had about a month ago, and also that Ron Prevost who is sitting next to her, had about a month ago has been changed. This is such a timely and hot topic that Census has reorganized so that there is now a division which is the Data Integration Division.

David Johnson, are you chief of that division?

MR. JOHNSON: No, no.

DR. BREEN: Would you like to be promoted to chief of that division today? You are in that division?

MR. JOHNSON: No, I am the director in my division, the new division, the demographic director.

DR. BREEN: So you are here. One thing I might want to mention because people may be interested in this, it wasn't the impetus for this, but an issue that is related to access and that a number of researchers have been concerned about was the defunding of SIPP, the Survey of Income and Program Participation, a really important panel survey that the Census Bureau was doing, which gave us up-to-date information on employment, on who was in and out of the labor force. We learned a lot about patterns and long term trends that we could never figure out from any of the other data sets that we had.

David Johnson -- I hope I have this part right, David -- is the program manager on the dynamics of economic well-being system, the DEWS, which is going to be engineered in order to provide the information that SIPP was providing to us. So he is here and available to answer questions related to that, so I want to thank him for that.

Then the third speaker, the formal speaker from the Census here today is Gerry Gates, who is the Chief Privacy Officer. He is here at the end of the table.

Of course, we want to know here today how we can improve data systems by improving data linkages, and how we can improve interagency collaboration to do that. As you will see, Census is working with all kinds of agencies within the federal government in order to try to do that, and quite effectively.

Another question that Gene brought up is, how can we ensure adequate access. This is a big issue. As Sally said, we don't have all the answers, but we are very committed to working with stakeholders to figure out how we can best do this. So it is really important that people around this table, at home, on the Internet and subsequently work with Census in order to try to understand how we can best get access, because that is a big question that is remaining.

This information is confidential. We don't want to violate anybody's confidentiality, of course. They won't give us additional data if we do. But we do want the information to be available to people, because otherwise it is not helpful in improving population health and reducing and eliminating health disparities.

Sally is going to give us an overview. Let me just give you Sally's new title. Both of them are now in the Data Integration Division, and they are both assistant division chiefs. Sally is the assistant chief for administrative records applications and Ron is the assistant division chief for data management. Sally will provide an overview as I said, and then Ron will talk about -- he has done some interesting research, in which he has found by looking at survey data and administrative data there is a big mismatch. So coming from that point, how can we use this finding to improve the data sets through the linkage mechanism. It is a real quality improvement question, which when Gene and I started thinking about this wasn't really where we were thinking it would go. We were thinking along the lines of just getting more information.

So I want, without further ado, to introduce the speakers and let them talk about these issues. We will let all three of the speakers speak, Sally, Ron and Gerry, and then we will have some time for questions after that. So I want to turn the mike over to Sally. Thank you.

Agenda Item: Census Bureau

Sally Obenski, Assistant Chief for Administrative Records Research

MS. OBENSKI: Thank you. As Nancy mentioned, what I would like to do is provide an overview of the technical infrastructure that is enabling the expanded use of administrative records and record linkage at the Census Bureau. It also will provide the 30,000-foot view of projects that are using linked data sets. We are also going to talk a little bit about operational and technical constraints, whereas the policy constraints are going to be discussed both by Ron and Gerry.

As many of you know and others may not, Title 13, which is our guiding mandate, states that we are to use administrative records as extensively as possible. Also, our strategic plan calls for the use of administrative records to reduce both reporting burden and minimize costs, and also to come up with innovative data sources. We are also governed under a plethora of legal guidance and protections that include Title 13 and Title 26, Privacy Act, and so on and so forth.

Whenever any of us discuss using administrative records, we always make sure that the audience is aware that we have commitments to our data providers, to our data users, and to the public. In order to use other parties' data we take it very seriously. We have a stringent infrastructure known as the data stewardship program that Gerry is going to talk about, in which the use of administrative records nests. We ensure that there is a consistent application of policies, and we have numerous administrative controls including, before a project is approved it undergoes quite a bit of scrutiny. We have checklists. We have to make sure that we are protecting the privacy and always the confidentiality of the respondents.

Another point that is important, I will make it later, is that although we acquire data sets that have personally identifiable information on them, that identifiable information is stripped from the data sets before they are used, and it is replaced with a protected identification key, which is an anonymous key that allows us to do record linkage while maintaining privacy.
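The replacement of identifiers with a protected key can be pictured with a minimal sketch. The transcript does not say how the key is generated, so the keyed hash below, the field names (`name`, `ssn`, `address`), and the `SECRET_KEY` value are all assumptions made for illustration, not the Census Bureau's method.

```python
import hashlib
import hmac

# Hypothetical secret held by the linkage unit; an assumption for this sketch.
SECRET_KEY = b"illustrative-secret-not-a-real-key"

# Hypothetical names for the personally identifiable fields.
PII_FIELDS = {"name", "ssn", "address"}


def make_key(ssn: str) -> str:
    """Derive an anonymous, repeatable key from an identifier (assumed approach)."""
    digest = hmac.new(SECRET_KEY, ssn.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record with PII removed and an anonymous key added."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    cleaned["pik"] = make_key(record["ssn"])
    return cleaned
```

Because the same identifier always yields the same key, two stripped files can still be joined on `pik` even though neither carries a name, number, or address.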

What this gives you is a snapshot of the program evolution, the administrative records program evolution. The Census Bureau has used administrative records since the 1940 census, where we used it to identify the first differential undercount. However, in terms of a formal program, it really came at the aftermath of the 1990 census, in which we showed a severe and serious differential undercount. Out of that came 17 program designs, of which one was, could we supplant direct enumeration using administrative data.

So a small group of people began researching. There was a privacy conference that was held in '93. There were numerous surveys to look at the feasibility of such an undertaking. About the mid-90s we began developing a prototype of what is known as a statistical administrative record system, or STARS.

Now I would like to give you some of what I consider to be the enabling technology and methods that have allowed us to expand the use of administrative records. Early on in the program, it became obvious that in order to do anything seriously with administrative data, we would have to have a database made up of national files.

So STARS is such a database. It was first prototyped in 1999 to be tested in the Census 2000 evaluation experimentation program. STARS is comprised of seven national files, including IRS files, HUD, Indian Health Service, Medicare and Selective Service. As you see in terms of the record count, it is quite a large file. The way that STARS was designed was to conform to the short form census, in that it has short form demographic characteristics, age, sex, race, Hispanic origin. Its address part conforms to our master address file. So this database was created as a prototype in 1999 and tested in 2000.

It was tested in what we call the administrative record experiment or AREX. What came out of this experiment even surprised the developers of STARS. It was not so much that it could ever supplant a census, but it validated the conformance of STARS to census, in the fact that we had demonstrated that we had captured 85 percent of the census addresses and 95 percent of the persons. So STARS has been recreated every year ever since, and improvements continue to be made on it.

Let me talk a little bit and digress a tiny bit about the Numident. The Numident is an incredibly important file that we acquired in the late '90s from the Social Security Administration, and it is the transaction file that has every SSN that has ever been requested. This is about 803 million records.

What we have done is to collapse this file into the best, into unique records. We use it for two things. Initially, the file did provide our demographic data for STARS. Secondly and incredibly importantly is that it provides our verification and validation system.
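As a rough illustration of collapsing a transaction file into one record per number: the transcript does not say how the "best" record is chosen, so the rule below (keep the most recent transaction) and the field names `ssn` and `entry_date` are assumptions for this sketch only.

```python
def collapse_transactions(transactions: list) -> list:
    """Reduce a transaction file to one record per SSN.

    Assumed rule: the most recent transaction (by ISO-format date string)
    is treated as the best record for that number.
    """
    best = {}
    for txn in transactions:
        current = best.get(txn["ssn"])
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if current is None or txn["entry_date"] > current["entry_date"]:
            best[txn["ssn"]] = txn
    return list(best.values())
```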

Before I talk about the validation system I do want to swerve back to the demographic data. As I said, initially we used in STARS demographic data from the Numident. The problem with the race data in particular was that it only captured white, black and other, and had no ethnicity. Furthermore, race data on children is no longer being captured on social security records when children are born.

So to build on this, we in fact built on work that was done by Barry Biden in the late 1990s, in which he linked the current population survey to the Numident to start building a race model. We went the next step. We linked the Census 2000 to the Numident. Where we had a match we brought over race and Hispanicity and where we didn't have a match we modeled the differences. What this has allowed us now is that we have a hybrid system that we have Census 2000 race and Hispanicity and we have the very, very excellent high quality Numident age and sex.
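The hybrid approach described here, taking race and Hispanic origin from a census match where one exists and from a model otherwise, can be sketched as follows. The field names, the `race_source` flag, and the `model` callable are hypothetical; the transcript does not describe the actual model.

```python
from typing import Callable


def build_hybrid(numident: list, census_by_key: dict,
                 model: Callable[[dict], dict]) -> list:
    """Combine Numident age and sex with census or modeled race and Hispanic origin."""
    hybrid = []
    for person in numident:
        row = {"key": person["key"], "age": person["age"], "sex": person["sex"]}
        census = census_by_key.get(person["key"])
        if census is not None:
            # Matched: bring over the census responses directly.
            row["race"] = census["race"]
            row["hispanic"] = census["hispanic"]
            row["race_source"] = "census"
        else:
            # Unmatched: fall back to a model (a stand-in callable here).
            predicted = model(person)
            row["race"] = predicted["race"]
            row["hispanic"] = predicted["hispanic"]
            row["race_source"] = "modeled"
        hybrid.append(row)
    return hybrid
```

Keeping a source flag on each row lets a later user tell reported values from modeled ones.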

So the result of this fixed a very substantial weakness in STARS, which was the race data, and it is currently being used by a number of Census Bureau programs, including the intercensal estimates.

Next is the person identification validation system, which we call the PVS. The Social Security Administration, with whom we have worked very closely over the years in a number of venues, requires that any file that is going to be linked to their data must be validated. Originally we worked with SSA on their validation system to do such activities. Over the last few years we have been working with them in order to develop our own completely automated system, which has been approved by them. That is what this PVS is.

The importance of this system can't be overstated, because it is in fact our record linkage infrastructure, if you will. We use the Numident, the SSA file, as our reference file. When we have a file come in the door, regardless of whether it is going to be linked to SSA data or not, we run it as part of a huge quality control check against their reference file. We link addresses, and we match on name, address and date of birth. We search within the address, and then if we don't find the person in the address, we go ahead and search by name. Then we append the record with a unique protected identification key, or PIK. It is this PIK that is used by the Census Bureau record linkers in order to do their work throughout the Census Bureau. There is no identifiable information passed to anyone.
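The two-pass search described here, first within the address and then by name, can be illustrated with a small sketch. It assumes exact matching on normalized strings and hypothetical field names; a production system would more likely use probabilistic matching, and nothing below is taken from the actual PVS.

```python
from collections import defaultdict
from typing import Optional


def norm(text: str) -> str:
    """Normalize case and whitespace so trivial differences do not block a match."""
    return " ".join(text.upper().split())


class ReferenceFile:
    """A toy reference file indexed by address and by name."""

    def __init__(self, records: list):
        self.by_address = defaultdict(list)
        self.by_name = defaultdict(list)
        for rec in records:
            self.by_address[norm(rec["address"])].append(rec)
            self.by_name[norm(rec["name"])].append(rec)

    def find_key(self, incoming: dict) -> Optional[str]:
        """Return the reference key for an incoming record, or None if not validated."""
        name, dob = norm(incoming["name"]), incoming["dob"]
        # Pass 1: look for the person among records at the same address.
        for cand in self.by_address.get(norm(incoming["address"]), []):
            if norm(cand["name"]) == name and cand["dob"] == dob:
                return cand["key"]
        # Pass 2: search by name and date of birth across all addresses.
        matches = [c for c in self.by_name.get(name, []) if c["dob"] == dob]
        if len(matches) == 1:  # accept only an unambiguous match
            return matches[0]["key"]
        return None
```

A record that moved would fail the first pass but could still be found in the second; one that matches nothing, or matches more than one person by name, comes back as `None`.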

Finally, the other major enabler was the implementation of the American Community Survey. As you all know, the ACS provides a large timely sample of essentially decennial long form data. It is essential in order to start getting data at smaller levels of geography.

What we have been working on with some of our researchers is the idea of using the ACS to model the model. We have gotten the rules from deeper surveys such as the SIPP or the CPS, but then in order to push it down to lower levels of geography we used the ACS.

Now I am going to talk very, very quickly in general about how the files that we process and anonymize are used across the Census Bureau by numerous important programs, including the intercensal estimates and the small area income and poverty estimates. I would like to remind folks that these systems are administrative records based and are responsible for the allocation of billions of dollars of federal funds. Also, as some of you are familiar with the national longitudinal mortality study, we support them with our processing, and also the LEHD program.

Now I would like to talk a little bit about some research that we are involved with that is looking at uses in the decennial census. We have three major programs underway that are being evaluated in the 2006 census test. One of them is to see if we can assist, not replace but assist, the hot deck imputation method by using administrative records to assign age, race, sex, Hispanic origin when we can match a record.

What this does is, it reduces the work load that falls to the hot deck, which improves its standard error. This looked very promising, and we are checking to see if it is operationally feasible in a production environment as we speak.
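The idea of assisting, not replacing, the hot deck can be sketched as below. The transcript does not specify the hot deck variant, so this uses a simple sequential hot deck (the last reported value serves as the donor) as an assumption, with hypothetical field names and a single variable, age.

```python
def impute_age(records: list, admin_age_by_key: dict) -> list:
    """Fill missing ages: administrative record first, hot deck donor second."""
    last_donor = None
    out = []
    for rec in records:
        row = dict(rec)
        if row["age"] is not None:
            row["source"] = "reported"
            last_donor = row["age"]  # reported values become donors
        elif row.get("key") in admin_age_by_key:
            # A matched administrative record supplies the value directly.
            row["age"] = admin_age_by_key[row["key"]]
            row["source"] = "admin"
        else:
            # Only the remaining cases fall to the hot deck.
            # If no donor has been seen yet, age stays None.
            row["age"] = last_donor
            row["source"] = "hot_deck"
        out.append(row)
    return out


def hot_deck_workload(imputed: list) -> int:
    """Count how many records still had to be filled by the hot deck."""
    return sum(1 for r in imputed if r["source"] == "hot_deck")
```

Every missing value resolved from an administrative record is one fewer case left for the hot deck, which is the workload reduction described above.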

The second use was to use administrative records to identify households with coverage problems. What this is, is a systemic problem where a number of misses, omissions, come at the within-household. In a given household we tend to miss people. This is for a whole lot of reasons.

So to ameliorate this, we send out a major field operation, which is the coverage followup operation, and it is very expensive. We have a research project underway in which our STARS system was remodeled. We used some modeling to develop probabilities that certain types of housing units in certain types of areas would be undercovered. This is also currently being evaluated.

Finally, we have been involved for several years now looking to see if we can enhance the group quarters frame, which was a very challenging endeavor in Census 2000. We are looking at using Info USA, which is the yellow pages, and also the ES 202, which is the business register for states. We also have evaluated seven states and their co-op list, and that program proved so successful that it is being expanded nationally.

What are some other kinds of survey improvements? Ron is going to talk about one very important case study, but just a few others that are worth mentioning. Bob Faye has been doing some very exciting work in seeing if he can use our STARS database to develop survey controls for reducing ICS small area variance. This is looking highly promising.

A second body of research we have had underway was to take a look at a STARS to CPS match and to look at non-respondents and to see if there is any difference between them and the responders. The linkages occurred, but the analysis is still underway.

Another thing that was very exciting was our response to the aftermath of Katrina. As you all know, the effect on the federal statistical system highlighted an inability to react in a real-time basis. As luck would have it, we were in the process of acquiring the national change of address file from the U.S. Postal Service.

So we got together with the data linkage experts. Everybody said this is our single best chance to get something out there. So they graciously agreed to give us an abstract, and we used that file to come up with some alternative survey controls and later on we produced county level tallies using NCOA. We also were very interested in getting FEMA's files, but we didn't have them in time for the immediate response.

So as an outcome of this, we are looking into the feasibility of developing the next generation of STARS, which maybe could produce some real-time measurements.

Here I have given you a very brief snapshot of record linkage and some exciting projects. What are the constraints? In order to use third party data, this requires an extraordinarily complex memorandum of understanding in order to ensure that all parties are protected.

Just to give you an example, after Katrina we had immediate discussions with FEMA. OMB wanted us to have the FEMA files, we wanted them. It should have been a slam dunk; it took nine months. So the other problem is that when you deal with some of the federal poverty data, they tend to be state based. Dealing state by state, as I think Julia could attest, is quite an undertaking.

Also, there are differences in content definition, quality and program rules. We have come up against this in some of our projects when we are trying to build eligibility models, and they sometimes differ at the county level, let alone the state.

Also, lag time. This is a big, big problem. Most of our files lag by about a year. For example, one of our more important health ones lags by about four years before we get the national file. As we all are seeing, more applications require more near-time, real-time response.

Technical constraints, getting the right data in the right format. Even if we write these very detailed specifications, it many times takes our analyst many conversations and the data going back and forth to get it right, because it is coming from two different views, the administrative view versus the survey integration view.

Also, we do have varying rates of validation, for example, Medicare very high, Medicaid lower. Just to note, if the record does not validate, we do not use that record. If we can't put a PIK on it, we do not use that record in our projects.
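The rule stated here, that a record without a key is dropped, together with the validation rate it implies, can be shown in a few lines. The `pik` field name is an assumption, and the function makes no claim about actual rates for any source.

```python
def keep_validated(records: list) -> tuple:
    """Drop records that have no key and report the share that validated."""
    kept = [rec for rec in records if rec.get("pik")]
    rate = len(kept) / len(records) if records else 0.0
    return kept, rate
```

Applied per source file, the returned rate gives the kind of comparison mentioned above between sources that validate well and those that validate less well.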

Also, something we came up with in terms of looking at SIPP administrative data is the coarseness of administrative data compared to the nuances of teasing out what we want from the survey. Finally, measuring error, what does it mean, how do we put a confidence interval around an integrated data set. It is very challenging.

What are we doing to overcome the constraints? Resolving file acquisition issues, especially among state data, may require OMB or Congressional assistance. The lag time for general demographics is, we believe, largely addressed by the national change of address file and possibly moving to this enhanced STARS.

We have under the new Data Integration Division completely standardized and centralized file acquisition. The next bullet is speaking to continual improvements in our person validation system, including converting it to SAS, which has made it much more accessible to our analysts. We have identified a data quality standards team that is going to be looking at measuring error in integrated data sets.

What are our conclusions? New files and innovations are clearly leading to this expansion of administrative records uses, but new challenges continue to arise. The idea of having to regularly update a file like STARS that has got hundreds of millions of records, and then it is going to be updated on a quarterly basis, is very, very difficult. Also, just understanding what integrated data sets are. But I believe, everyone here would believe from the Census Bureau, that we are at the incipience of a new generation of products and services.

That concludes my talk.

Agenda Item: Ronald C. Prevost, Special Assistant, Demographic Surveys Division

MR. PREVOST: I guess I will go into my part of the presentation now, which is talking about health related administrative records research at the Census Bureau.

What I am going to do is, I will start with a brief commercial announcement about the small area health insurance estimates program. We will also talk about the Medicaid undercount study description, its preliminary results, what our next steps are. We will discuss a little bit about related research we are doing, the benefits of integrated data sets, and the policy challenges and conclusions.

The U.S. Census Bureau has embarked upon a small area health insurance estimates program that produces a consistent set of estimates of health insurance for all counties in the United States. The intent of the program is to have published estimates for these counties and states by age, under 18 and total, with confidence intervals. Right now, we are investigating model improvements and expanding the age categories for which we can estimate.

Here is just a couple of examples. This is just a brief display. Policy is implemented locally, and there are significant differences in local area ability to have insurance coverage. This slide here is a quick snapshot, I don't expect you to read it or understand it all, but it is just a blurb to see what the differences are in the U.S. for the total population without health insurance. We have a similar situation for children without health insurance.

These health insurance coverage estimates are created by combining survey data with population estimates and administrative records. Currently we are working on race, ethnicity, age, sex and income categories that are being investigated for counties and states. The state level estimates such as the uninsured black or African-American population under age 18 and less than or equal to 200 percent of poverty. We are also looking at county level estimates such as the uninsured children under the age of 18, again under that poverty constraint. This project is partially funded by the Centers for Disease Control and Prevention's national breast and cervical cancer early detection program.

What is forthcoming is health insurance coverage estimates in 2007. We are going to be providing updated county and state level estimates by age, state level estimates by race, ethnicity, age, sex and income categories. Then depending on future funding, the SAHIE program plans to produce county and state level model based estimates as an annual series.

This is just one way in which we have used administrative records at the Census Bureau. We use them as Sally showed earlier for inter-censal estimates, for the smaller poverty estimates, et cetera, and so this is an example of modeling.

What we are coming up with now for the next part of the presentation will be the Medicaid undercount project. I see that there are a couple of the collaborators in the audience, Mike Davern and Dave Baugh. They are the experts here, I am merely the rapporteur. We have had a great collaboration I believe between our agencies, between the Centers for Medicare and Medicaid Services, the states who have helped provide their data. We have the Assistant Secretary for Planning and Evaluation, and those of us at the Census Bureau. Can't forget our sponsors. Our sponsors are the Robert Wood Johnson Foundation and also ASPE.

I think this is a really great project because of the type of collaboration that we have. When you are integrating data sets, you need to bring in expertise from both sides, who is expert in the survey data, who is expert in the administrative data and the use of those data.

What is the Medicaid undercount? I'm sure you all are familiar with this. Survey estimates of Medicaid enrollment are well below the administrative data enrollment figures. Why is that? In preliminary numbers for calendar year 2000, the current population survey estimated -- and this is using the more conservative estimates, not the published numbers -- 25 million persons that were in the system. The Medicaid statistical information system or what Sally referred to earlier as MSIS estimated that there were 38.8 million persons enrolled. I'll show you how we get to these numbers a little bit later on. There is a substantial undercount therefore in the CPS relative to the MSIS. In this case the CPS estimate is about 64 percent of the MSIS count.
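As a rough check on the two figures just quoted, the 64 percent is consistent with dividing the CPS estimate by the MSIS count. The following is a minimal sketch in Python; the variable and function names are illustrative and are not taken from any Census Bureau system.

```python
# Figures quoted in the talk (millions of persons, calendar year 2000).
cps_estimate = 25.0   # CPS survey estimate of Medicaid enrollment
msis_count = 38.8     # MSIS administrative enrollment count


def coverage_ratio(survey, admin):
    """Share of the administrative count that the survey captures."""
    return survey / admin


ratio = coverage_ratio(cps_estimate, msis_count)
shortfall = 1.0 - ratio  # the part of the administrative count the survey misses

print(f"CPS as a share of MSIS: {ratio:.1%}")
print(f"Implied shortfall:      {shortfall:.1%}")
```

Read this way, the survey captures roughly two thirds of the administrative count, and the remaining third or so is the undercount to be explained.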

Why do we care? We care because we want to better serve our customers. We want to improve our surveys, and we want to enhance performance indicators and provide feedback loops. These numbers are used for policy simulations by federal and state governments. They are the only source for the number of uninsured. They are also the only source for the Medicaid eligible but uninsured population, et cetera. So this undercount calls the validity of survey estimates into question, and this study is intended to understand the causes.

What could explain the undercount? There are universe differences between the administrative records that are collected at the states and the survey data collected by the Census Bureau. There is measurement error. There are administrative and survey data processing, editing and imputation errors, and there are survey sample coverage errors and survey non-response biases.

So we came up with a bunch of hypotheses. These hypotheses were drilling down into these sources of error. Why would there be persons included in the MSIS but not in the CPS? There are persons living in group quarters, and group quarters are defined differently by the two agencies CMS and Census Bureau. There are persons who don't have a usual residence, but still receive services. There are persons who receive Medicaid in two or more states. This occurs obviously because if you move within a given month and you apply in your new state, you are going to show up in both states, and that is what we have to work with.

We have to look at what is meaningful health insurance coverage. There are persons with restricted Medicaid benefits. There are persons that have only Medicaid coverage for one or a few months. So what does it mean to be insured?

On the CPS side we have respondent knowledge. Because of plan names, enrollees in Medicaid prepaid plans may not know they have Medicaid coverage. We also are hypothesizing that Medicaid enrollees who didn't use Medicaid services may not consider themselves covered by Medicaid. We also have the issue of proxy responses for other members of the household that may be incorrect, especially for non-family members or when there are households that have multiple families, and the respondent for the survey is answering for both.

We developed this study in several phases. The first phase was to develop a validated national CMS enrollment file; to determine the coverage and validation differences between the MSIS and the MEDB, which is the Medicare enrollment database; to determine the characteristics of these databases; and to look at dual eligibles, which we thought might be a factor here. It was also to conduct a national Medicaid to CPS person match. This way we would determine why the Medicaid and CPS differ so widely on enrollment status, and then we would build a suite of tables detailing explanatory factors and characteristics.

I mean to say that this study is a longitudinal data file that we are building. We are collecting data from calendar year 2000 through 2002. These are the first results from 2001.

In later phases, in order to enhance the study, we determined that there were some variables that we would like to receive that weren't in the national files. So we are currently working with several states to get their local information to see how we can improve the information we have on the national files, and how that can be used to conduct a variety of researches on both our master address file, CPS, the American Community Survey, otherwise known as the SS-01 in this case for the 2001, and to also look at Medicaid addresses to see if there was any specific frame bias.

Finally, we want to take a look at the impact that the state data has on the national data to see if it provides any further explanatory factors that could potentially apply in the national environment.

The later phase of the study, the fourth phase, and we have about a year left in the study, give or take a year, we will be matching the national Medicaid system to the national health interview survey as well to look at person coverage, and then at each step along the way we will be documenting the results in papers.

Preliminary explanations. We have enforced the CPS group quarters definitions on the MSIS data where we have administrative data address information. That is, we were able to locate a person at a specific address and in the Census master address file. If that address was defined as being a group quarters we eliminated these people, because the CPS is a household-based survey that covers only the civilian non-institutionalized population, so there are components of the population that are not included by definition in the universe. Then also looking at duplicative persons in different states, and understanding the covariates for the misreporting.

Here is a brief graphic that shows that there is a major overlap in the universe, but there are folks on both sides that are not included, particularly those who are under group quarters, those who were deceased during the time period, and obviously we couldn't survey them in the CPS, those that did not have valid records on either side, and those that were in two states. So we had to build a common universe out of that in order to conduct the study.

We removed the dual eligible cases defined as a group quarters by Census, and then we ran the data through the Census Bureau's personal I.D. validation system that Sally had discussed earlier. We removed these duplicative valid records and then we removed the MSIS enrollees that were not enrolled in full benefits.
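The exclusion steps just described can be sketched as a simple filter chain. This is an illustration only, written in Python: the `Enrollee` record, its field names, and the rule of keeping the first record seen for a duplicated person are assumptions made for the example, not a description of the Census Bureau's actual processing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enrollee:
    person_id: str        # hypothetical validated person identifier
    state: str            # state that submitted the enrollment record
    group_quarters: bool  # address classified as group quarters
    validated: bool       # record passed person ID validation
    full_benefits: bool   # enrolled in full rather than restricted benefits


def build_common_universe(records):
    """Apply the exclusions in the order described in the talk."""
    kept = [r for r in records if not r.group_quarters]
    kept = [r for r in kept if r.validated]
    # Keep one record per validated person, dropping multi-state duplicates.
    # Simplification: the first record seen for a person is the one retained.
    seen = set()
    unique = []
    for r in kept:
        if r.person_id not in seen:
            seen.add(r.person_id)
            unique.append(r)
    return [r for r in unique if r.full_benefits]


records = [
    Enrollee("A", "MD", False, True, True),
    Enrollee("A", "VA", False, True, True),   # same person in a second state
    Enrollee("B", "CA", True, True, True),    # group quarters address
    Enrollee("C", "MT", False, False, True),  # failed validation
    Enrollee("D", "MD", False, True, False),  # restricted benefits only
    Enrollee("E", "MD", False, True, True),
]
universe = build_common_universe(records)
print([r.person_id for r in universe])
```

Only persons A and E survive all four exclusions in this toy input, which mirrors how each step narrows the administrative file toward a universe comparable with the CPS.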

So how does this break down? We started with 44.3 million MSIS records in the year 2000. We had one and a half million of those records that were in more than one state or were in a group quarters, and we had four million that had partial benefits. Partial benefits, this is a situation where you only had received Medicaid for a day, or where you only had use of family planning services, et cetera. I know there are a whole variety of these things, and Dave Baugh, he is the master here, he can tell us if you have more questions on this.
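The subtraction behind this breakdown can be written out directly. The sketch below, in Python, uses only the rounded figures quoted in the talk; the variable names are illustrative.

```python
# Rounded figures quoted in the talk, in millions of MSIS records (year 2000).
starting_records = 44.3
multi_state_or_group_quarters = 1.5   # "one and a half million"
partial_benefits = 4.0                # "four million"

remaining = starting_records - multi_state_or_group_quarters - partial_benefits

# Round to one decimal place, the precision of the quoted inputs.
print(f"Remaining enrollees: {round(remaining, 1)} million")
```

The result, 38.8 million, matches the MSIS enrollment figure quoted earlier in the talk, which suggests these two exclusions account for the gap between the raw record count and that figure.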

On the sample loss side, nine percent of all MSIS records did not have a valid record, and were not eligible to be linked into the CPS. On the current population survey side, 6.1 percent of the respondents' records were not validated, but more importantly, roughly 22 percent refused to have their data linked. What this meant was, they refused to provide a social security number. Therefore, the way that we interpreted it was, if you do not provide an SSN, we will not link your data.

However, the effectiveness of the ID validation system that has been developed by Census Bureau has allowed us to change the method of collection of information from respondents, so that we are not asking for social security number anymore. There really is a two-part question. One is, what is your social security number and the other is, can I link your data. We really should separate those two out. And because we can link data without a social security number, and we know that is particularly sensitive, we have eliminated that from our demographic surveys.

So in the future, hopefully we will be able to link more data that way. I'm sure Gerry will be talking about the privacy implications and how Census Bureau is addressing that.

Here is an example of the validation differences that you see across the United States in the records coming from the Medicaid system. Those areas in red and those areas in black, you can't see the black too well, but there is a number of them in California and also up in Montana, are areas of the country where we had the worst validation rates. So if one was conducting a study in the state of Montana or the state of California and attempting to apply that study to the United States, you would get a very different view.

In California in some cases, I believe they are up to almost one third of the records that were not validated. California has different rules for their Medicaid systems. They serve a slightly different population than the rest of the United States, so we are hoping that the state data that they provide us will assist us in being able to do further linkages and improve it so that we don't have these validation issues.

In the state of Montana, the reason we had issues linking the data was that in some cases, I think it was particularly children, they used state IDs, case numbers, in the social security number field, at least in this year. I don't know how long that continued.

We matched the respondents together with the reported data only. We have 12,341 CPS person records that matched into the MSIS; 1,906 had imputed or edited CPS data, which was about 15 percent of the total.

If you are looking now at why the disjuncture between what we saw in the administrative records and what we saw in the survey, 60 percent of the respondents in CPS responded that they had Medicaid, nine percent responded that they had some other public type of coverage, but not Medicaid, even though the administrative record showed that they were in Medicaid. Seventeen percent responded that they had some type of private coverage but not Medicaid, and 15 percent responded that they were uninsured.
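The quoted shares can be tabulated to see how much of the matched group reported something other than Medicaid. This is a minimal sketch in Python using only the rounded percentages and record counts given in the talk; the category labels are paraphrases chosen for the example.

```python
# Rounded percentages quoted for matched CPS respondents who were
# enrolled in Medicaid according to the administrative record.
reported_coverage = {
    "Medicaid": 60,
    "Other public coverage, not Medicaid": 9,
    "Private coverage, not Medicaid": 17,
    "Uninsured": 15,
}

total = sum(reported_coverage.values())          # exceeds 100 because of rounding
not_medicaid = total - reported_coverage["Medicaid"]
misreport_share = not_medicaid / total           # normalized for the rounding

# Matched record counts quoted just before the percentages.
matched_records = 12_341
imputed_or_edited = 1_906
imputed_share = imputed_or_edited / matched_records

print(f"Sum of quoted percentages: {total}")
print(f"Share not reporting Medicaid: {misreport_share:.1%}")
print(f"Share with imputed or edited CPS data: {imputed_share:.1%}")
```

The quoted percentages sum to 101 rather than 100, which is what rounding each category separately would produce. On these figures, about four in ten enrollees reported a coverage status other than Medicaid.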

So basically what you can gather from this is that people really don't know the source of their health insurance coverage.

The factors that were associated with this error were the length of time that they were enrolled in the system and how recently they were enrolled. For example, and we have seen this in other studies as well, when you ask the question, have you been enrolled in X for the last year or over the last calendar year, if they are currently enrolled in that month and you ask them the question, you get a very good response. We had a similar study in food stamps, and if it was within one or two months of the survey month that the persons had been enrolled and participated in the program, they were showing only a ten to 20 percent response error. If it had been six months since they had received benefits from the program, we were showing response errors in the range of 60 to 80 percent.

Poverty status impacts Medicaid reporting, but it does not impact the percent reporting that they are uninsured. This gets back to that stigma issue. Stigma does not seem to be a factor here. As a matter of fact, it is the folks that are at the higher levels of income who still qualify for the program who are most likely to misreport. They think that they are getting private health insurance and not Medicaid.

Adults 18 to 44 are less likely to report dis-enrollments, and adults 18 to 44 are more likely to report being uninsured. Overall, the CPS rate of those with Medicaid reporting that are uninsured is higher than in other studies, and the CPS rate of those with Medicaid reporting Medicaid is lower than with other studies.

The work that is remaining. I already discussed briefly the other phases of the project, where we will be bringing in the state files. We will be using the MSIS data to enhance the study. We will also be bringing in the analytical extract, the MAX file, to take a look at differences between those who are enrolled and those who are enrolled and receiving benefits, which may have a big difference in the way that the responses are occurring.

We hope to soon be working with the national health interview survey, and then we will be doing a comparison measure of error in the CPS to the state survey experiments, and then also looking at how well the NHIS does, because the NHIS asks the question very different than the CPS does. The question we have is, if the NHIS does better, is that telling us that we need to change the way in which we are asking the question.

We will also be evaluating how well the CPS edits and imputations work both at the micro level and overall macro level. We will be evaluating additional state level Medicaid, and then we will be looking at the coverage error and survey non-response bias.

As I said, these are preliminary results. They are subject to change after further investigation. At the moment, we conclude that the survey measurement error is playing the most significant role in producing the undercount. Some Medicaid enrollees answer that they have other types of coverage, and some answer that they are uninsured. The overall goal of this project though is to improve the CPS for supporting health policy analysis, especially refining the estimates of the uninsured.

The use of integrated data sets is really an important growth that we are seeing here. You couldn't get the sort of analysis that we are doing by looking at aggregate data from administrative records or aggregate data from the current population survey, or any other survey for that matter. You have to look at the unit level to determine what the causes are of why misreporting or disjuncture is occurring between the two systems.

The administrative data, while they may provide the experience, it is really two sets of truth. The administrative data show you the information and the experience that federal agencies have and state agencies have working with a given set of individuals. They only collect that administrative data, they don't talk about all the demographics and social and economic information that you really want to have in order to have a really complete picture for use of a multitude of areas, including the development of policy, the implementation of policy and the evaluation of your activities.

There are some related research examples I just wanted to share with you briefly. We had a similar experience, where we started down this path three years ago with the Maryland food stamp study, where we worked with some folks at the Jacob France Institute in Maryland. We matched the American Community Survey data to the food stamp recipient data, and we found that we had 50 percent response error. That is, the Maryland food stamp recipients were 50 percent higher than we were showing in our estimates. I understand there are other surveys out there that have come up with similar results.

We were able to explain in our linkage study 85 percent of the discrepancy by looking at these individual records. In fact, the misreporting was 63 percent of that discrepancy, much of it due to the temporal biases that we were seeing that I mentioned earlier, and also to the fact that there seemed to be a serious disjuncture when there was a person who was not receiving the benefit who was the survey respondent, or who had not applied for the benefit.
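One way to read these two percentages is as a decomposition of the whole discrepancy. The sketch below, in Python, assumes that the 63 percent is a share of the total discrepancy rather than of the explained portion; the talk does not say which, so the split shown here is only one possible reading.

```python
# Percentages of the survey-versus-administrative discrepancy, as quoted.
explained_total = 85   # share explained by the record-level linkage
misreporting = 63      # assumed here to be a share of the whole discrepancy

other_explained = explained_total - misreporting  # explained by other factors
unexplained = 100 - explained_total               # not accounted for

breakdown = {
    "Misreporting": misreporting,
    "Other explained factors": other_explained,
    "Unexplained": unexplained,
}

for component, share in breakdown.items():
    print(f"{component}: {share} percent")
```

Under that reading, misreporting accounts for nearly three times as much of the gap as all other identified factors combined.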

Other related research that we are doing in integrated data. We are working with the University of Chicago Chapin Hall Center for Children on a child care subsidy study. The other thing that this integrated data set does that you would never get from administrative data alone is that it allows you to develop eligibility models, those people who are eligible for a program but are not necessarily participating. So we are looking at developing eligibility models in the study, and then the researchers will examine the effects of this on employment and self support, that is, the outcome measures at our research data centers.
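The point about eligibility models can be made concrete with a toy example. In the Python sketch below, the `LinkedPerson` record, its fields, and the 200 percent of poverty cutoff are all hypothetical, chosen only to show why both the survey side and the administrative side are needed; they do not describe the child care subsidy study itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedPerson:
    income_to_poverty: float  # survey side: income as a ratio to the poverty line
    participates: bool        # administrative side: appears in program records


ELIGIBILITY_CUTOFF = 2.0  # hypothetical rule: income at or below 200% of poverty


def is_eligible(person):
    """Eligibility is inferred from survey characteristics alone."""
    return person.income_to_poverty <= ELIGIBILITY_CUTOFF


def eligible_nonparticipants(people):
    """People the rule deems eligible who are absent from program records."""
    return [p for p in people if is_eligible(p) and not p.participates]


def take_up_rate(people):
    """Participants as a share of the eligible; None if nobody is eligible."""
    eligible = [p for p in people if is_eligible(p)]
    if not eligible:
        return None
    return sum(p.participates for p in eligible) / len(eligible)


people = [
    LinkedPerson(0.8, True),
    LinkedPerson(1.5, False),
    LinkedPerson(1.9, True),
    LinkedPerson(2.5, False),
    LinkedPerson(1.2, False),
]
print(len(eligible_nonparticipants(people)), take_up_rate(people))
```

Administrative records alone would show only the two participants. The survey side is what reveals the two eligible people who are not enrolled.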

If we were to add data, we were asked earlier what data would you add to the study if you could to improve research on health outcomes. We have identified a few data sets that we think are of particular importance that if we could bring it together with this three-year longitudinal file we already have, that would be really important. It would be the WIC files, files from the NHANES survey, the MEPS survey. We have not yet linked in the CPS food security supplement, but that would be great. Also, to bring in relationship data from the SS-5 information that Social Security Administration collects. The SS-5 is your application for a Social Security card; this currently is not on our extract.

There are other federal health insurance data sets that would be important to look at. I think there are future things that we could look at like co-insurance and its effect on things, to bring in the VA medical health data, to bring in TRICARE information, to bring in the Indian Health Service information. I think if you could get a good picture of what was happening in the federal sphere, then we could take it to the next step perhaps and see if there were any data out there for private health care coverage.

So the value of integrated data sets is that they really provide a more robust and accurate picture of what is going on. It builds on both views of the world, what the agency is seeing and what the persons are experiencing, but it controls for both their weaknesses. It provides better statistics for input into simulations, for predictions and for funds distribution, and as the demand for data increases, and frankly we are all experiencing a budget decrease, data reuse may be the only cost effective option for moving forward in the future.

So in doing this, we have a bunch of policy challenges, that is, communicating the benefits of integrating these data sets versus the privacy concerns, the need for interagency teams to ensure accurate results. I think this team that we had on the Medicaid study was a great picture of that, because it wasn't until you had all the expertise from every side brought to bear on one specific problem that you could really address it.

To look at interagency agreements, as Sally said, what ends up happening with many of these research activities is that you set a research plan for two years, and you spend a year and a half trying to get the data. There needs to be a more effective way of moving forward.

Then once the data are put together, who owns them? Big question. Everybody owns them. Then there is the potential growth of possible disclosure risks. If you were to try to blend these data sets together and then release them as a data file, how do you do so without the administrative agency who had provided you that data being able to back out the survey respondents' information? This is certainly something that we can't allow to happen. I'm sure Gerry will talk about that.

Then there is the need for these longitudinal databases to find an anonymized person at an address at a specific point in time. That really should be our vision in the future. But how do you do that and balance it with privacy?

In conclusion, integrated data architectures are the future of American statistics. There was recently a paper released at the UN ECE conference. The Norwegians, who have been working in similar systems -- well, a number of European countries have, but the Norwegians presented a paper that said they are changing their approach to the way they are doing work, and it is not going to be register based anymore. They believe that the blending and the integration of data is the way that they want to work. So I think we are in good company there.

As I said earlier, as the demand for data increases and budgets decrease, data reuse may be the only cost effective option for us. We have to overcome technical and policy related challenges, and this approach will support evidence based public policy research and decisions.

Thank you.

Agenda Item: Gerry Gates, Chief Privacy Officer

MR. GATES: Good morning. As you heard, my name is Gerry Gates. I am the chief privacy officer at the Census Bureau. It is a position I have held for a little over a year now. Prior to that I was chief of the policy office, and for over ten years prior to that I was the administrative records program officer at the Census Bureau, responsible for coordinating administrative records access and use. I have had some fairly close relationships with several people in this audience over the years, acquiring administrative data for our programs.

What I want to talk to you today about is some of the policy issues associated with acquiring and using administrative data for statistical purposes. I think the key policy issues related to these uses revolve around trust. I have identified four trust relationships which I think are critical in determining how we will be allowed to use administrative data for statistical programs.

The first is between the administrative data provider and the statistical data collector. In reaching agreements on using administrative data there is a consideration of several key issues. The first is whether the providing agency trusts the statistical agency to protect the data and use it according to legal and formal agreements. The second issue relates to whether the providing agency believes that they have the legal authority to provide that data.

Another issue involves how the providing agency will be protected from any public backlash associated with the sharing of that information. Another issue involves what are the risks that the uses of these data may reflect negatively on the data provider, in terms of whether or not the quality of that data is sufficient, and whether the program goals are being met.

Another issue involves whether the provider has any say in how the published data are protected, and what role they will play in that. Finally, there is the issue associated with whether the provider is able to make use of the results. In addition to cost reimbursement, will there be any quid pro quo, will any data be returned to the data provider.

The second trust relationship is between the statistical data collector and the respondents to our surveys and censuses. This raises issues about what consent was reached in terms of permitting the data collected in surveys and censuses to be linked with administrative data, was that agreement a specific consent or was it a notification, was it an opt in or an opt out agreement, were there any conditions associated with that agreement and finally, how is the process transparent to the public at large; are these linkages being done in such a way that they are well known, or are they not known.

The third relationship is between the administrative data provider and the program recipient. This relationship is very similar to the relationship between the statistical agency and the survey respondent. It is an agreement as to how that information they provide will be used.

Finally, there is a relationship between the statistical agency and the users of the data. These data are only valuable if they are being made publicly available or being made available for research purposes. This involves an agreement as to the quality of the data for the intended use, as well as how accessible those data are.

So that frames where I want to go with this discussion today, to talk a little bit about how a statistical agency makes policy decisions in reflection of these trust relationships that they have to accommodate.

The Census Bureau's mission includes a prominent statement about our responsibility to honor privacy and protect confidentiality, as you can see here. This demonstrates that when we use and collect information, we find it critically important that we address our relationship of trust with our respondent.

Our law specifically says that we can use the information that people furnish us for no purpose other than the statistical purpose for which it is supplied, and that we cannot make any publication where the data furnished by any individual could be identified. So this sets the ground rules. When these data are collected they must be protected, and they must only be used for statistical purposes.

So when the data come in, they come in from an administrative agency, and for our statistical purposes they cannot go back out for administrative purposes. The law prohibits it.

The Census Bureau has decided to address these trust relationships through its commitment to what we call data stewardship. It is a formalized program we have established since 2001 reflecting our management commitment to comply with our legal requirement and to acknowledge and address our ethical requirements in terms of professional statisticians.

We have established a formal structure to accomplish this. It is titled our data stewardship executive policy committee. Through that committee and its subcommittees, we set policy related to the acquisition of data, the protection of data and the use of the data. Three formal committees report to our senior level committee, which is chaired by our deputy director. One of those committees is our privacy, policy and research committee, one is our administrative records planning committee, and another one is our disclosure review board. So as you can see, administrative records are a focal point of this data stewardship program.

As I said, the program is built around our foundation of values and principles. Management decisions reflect -- I think what is key to this is, management decisions reflect not only our legal requirements, but consideration of the ethical obligations to individuals.

We have established controls to formalize and ensure that our policies are met. These are done through the implementation of privacy impact assessments, a requirement of the E-Government Act of 2002. What we have done is, we have linked all our data stewardship policies to the assessment, so that managers have to acknowledge the compliance with specific policies in addressing risk to privacy throughout the process, from the initiation, the collection and the processing of information.

As I said, we have established many new policies that respond to the issues associated with this, and we continue to develop new policies in response to our changing environment. Basically, data stewardship is making commitments. It is a commitment to our data provider to manage and safeguard their information in accordance with legal and policy requirements. It is a commitment to our data user community that the administrative records will result in high quality data products. Also, it is a commitment to the public that we will maintain confidentiality of personal information, and also make sure that it will only be used for statistical purposes.

We have legal guidance and protections that are in place. Title 13 of the United States Code is the basic legal framework for the Census Bureau programs. As Sally mentioned, Section 6 of Title 13 specifically authorizes us to obtain records from existing sources, rather than collecting that information again. So we obtain our legal authority for the collection of administrative records from that section of the law.

Section 9 of Title 13 is the provision that we must keep the information confidential. Title 26 becomes very important to the Census Bureau, because that is the IRS code. The Census Bureau makes use of administrative records from the Internal Revenue Service, something we have done for over 40 years. The authority for that is in Section 6103(j) of the IRS code that permits the Census Bureau to obtain tax information for its statistical surveys and censuses.

We also have specific recognition in the Privacy Act that statistics are a routine use for information collected by the federal government. The Privacy Act specifically singles out the Census Bureau as a routine use, so administrative agencies can provide information to the Census Bureau under that provision.

We also are guided by the Paperwork Reduction Act, which instructs that we collect the information to the minimum extent possible, and use information that is already available. This is a companion to the Section 6 of the Title 13.

Let me start with some definitions I think it is important to understand. We talk about confidentiality and we talk about privacy, and they both are important to the use of administrative data for statistical programs. I think it is important to understand this distinction.

I get this definition from the IRB Guide Book. Confidentiality pertains to the treatment of information that an individual has disclosed in a relationship of trust -- again we go with the trust -- and with the expectation that it will not be divulged to others in ways that are inconsistent with the understanding of the original disclosure without permission.

As I said, there are some legal requirements for confidentiality. The basic one is Title 13. There are also reflections of confidentiality in the security guidelines established by the Government Information Security Reform Act and the Federal Information Security Management Act, which is also part of the E-Government Act of 2002, and confidentiality is also impacted by the Federal Information Processing Standard FIPS 199. So these all provide a framework under which we must protect and secure the information that we collect, not only from our respondents, but also data that we obtain from administrative agencies.

Now, information privacy on the other hand is defined here by Alan Westin in 1967 as the claim of individuals, groups or institutions to determine for themselves when, how and to what extent information about themselves is communicated to others. So this is the individual's control over their information and how that information is used. The requirements for ensuring privacy come from Title 13, in the sense that Title 13 says that we can only use information for statistical purposes. We have to tell people that that is how we are going to use their information. The Privacy Act instructs us that we have to tell people about the authority we have to collect the information, the purpose for the collection, how the information will be used, and whether our asking for that information is mandatory or voluntary.

We are also guided by the Freedom of Information Act, which says that private information cannot be disclosed to requesters. Also, HIPAA has a role to play here, as well as the E-Government Act of 2002, which as I said establishes requirements that agencies must conduct privacy impact assessments to ensure that information is protected.

A little bit about Census Bureau policies as they relate to administrative record use. These policies are all associated with our privacy principles, which are the basis for our data stewardship program. We as an agency have acknowledged these principles: that whatever we do must be necessary for our mission, and that we will be open and transparent about what we do. We will have respect for the individuals who provide us the information, and we will provide confidentiality for any information we gather. So these are the overarching principles of our data stewardship program.

We have established policies to implement those principles or to ensure that we are in compliance with those principles. The highlighted policies, as you can see under mission necessity, are linkage of decennial census records and, number five, record linkage. These are policies that we have established related to our linkage of information. I am going to talk a little bit more about the record linkage policy in a minute. Then there are some policies related to multiple principles, including collaborative arrangements with agencies, which certainly impact arrangements with administrative agencies. Finally, there are the administrative records policies and procedures, which is a handbook for how we will establish agreements to use administrative data and how we will manage and handle the administrative data that we gather.

So those are the four guiding policies which impact administrative records.

Now a little bit about policy on record linkage. This policy establishes six principles for conducting Census Bureau projects that use record linkage techniques.

The first is mission necessity. What that says is that the linkage must be necessary and consistent with the Census Bureau's legal authority and mission.

The second principle is best alternative. That principle says the Census Bureau will examine alternatives for meeting the project objectives and determine that record linkage is the best alternative, given considerations of cost, respondent burden, timeliness and data quality.

The third principle is public good determination. The Census Bureau will weigh the public benefits to be gained from the information resulting from the record linkage against any risk to individual privacy that may be created by the linkage, and determine that the benefits clearly outweigh any risks.

Next is sensitivity. The Census Bureau will assess the public perception of the level of risk to individual privacy from the particular linkage and create an appropriate level of review and tracking.

Openness. The Census Bureau will communicate with the public about its record linkage activities, how they are conducted and the purposes and benefits derived from them.

Finally, consistent review and tracking. Record linkage activities will undergo a consistent review process using the criteria set forth in this policy, and centralized tracking by the Census Bureau.

Now, the policy establishes a checklist of questions that have to be asked related to each of these principles, and risk points are assigned. If the risk points are high enough, it is considered a highly sensitive record linkage and it needs specific approval from our data stewardship executive policy committee. So there is a thought process that goes into this, and an assessment is made that yes, this is a high-, moderate- or low-sensitivity project.
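As a rough illustration of the checklist and risk-point idea just described, the following Python sketch sums points for "yes" answers and classifies a project as high, moderate or low sensitivity. The checklist questions, the point values and the two thresholds are hypothetical, invented only for this example; the actual Census Bureau checklist is not given in the testimony.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    principle: str    # which of the six principles the question relates to
    question: str
    risk_points: int  # points added when the answer is "yes"


# Hypothetical checklist: questions and point values are assumptions.
CHECKLIST = [
    ChecklistItem("sensitivity", "Does the linkage involve tax data?", 3),
    ChecklistItem("sensitivity", "Does the linkage involve health data?", 3),
    ChecklistItem("openness", "Is the linkage not yet publicly described?", 2),
    ChecklistItem("best alternative", "Was no alternative to linkage examined?", 2),
]

HIGH_THRESHOLD = 6      # assumed cutoff for "high" sensitivity
MODERATE_THRESHOLD = 3  # assumed cutoff for "moderate" sensitivity


def assess(answers):
    """Return (points, level, needs_committee_approval) for a project.

    `answers` maps a question string to True ("yes") or False ("no");
    unanswered questions are treated as "no".
    """
    points = sum(
        item.risk_points
        for item in CHECKLIST
        if answers.get(item.question, False)
    )
    if points >= HIGH_THRESHOLD:
        level = "high"
    elif points >= MODERATE_THRESHOLD:
        level = "moderate"
    else:
        level = "low"
    # Only highly sensitive linkages go to the executive committee.
    return points, level, level == "high"
```

In this sketch a project that answers "yes" to both sensitivity questions would be routed for committee approval, while a project with no "yes" answers would be classified as low sensitivity.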

There are controls that we have established that support our administrative record uses. Sally has mentioned this, but I think this comes under her purview, centralized data acquisition and agreements. We have one focal point within the Census Bureau for establishing the contacts with the administrative agency to acquire data and to establish agreements. We also have a centralized review process, so that all projects using administrative records go through a formal review to ensure that they are compliant with the legal requirements and the agreement between the agencies. We require need-to-know access. Only those people who need to have access to this information are permitted access. We remove identifiable information immediately, and replace it with what we call a PIK.

We have an administrative records tracking system, which is a computerized system, which basically allows us to control a project from its inception to completion, to ensure that anyone who works with that project understands what the rules are, how that information can be used.

We have file receipt logs and audit trails to ensure that we are complying with the agreements. When our data have to be used off site, we do independent site reviews to make sure that the required security is in place.
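The step of removing identifiable information and replacing it with a protected key can be sketched as follows. This is only an illustration under stated assumptions: the testimony does not say how the Census Bureau derives its key, so the keyed hash used here, the field names and the 16-character key length are all hypothetical choices.

```python
import hashlib
import hmac

# Assumed names of the directly identifying fields on an incoming record.
IDENTIFYING_FIELDS = ("ssn", "name", "address")


def protect_record(record, secret_key):
    """Return a copy of `record` with identifiers replaced by a protected key.

    The key is a keyed hash (HMAC-SHA256) of the SSN, so the same person
    gets the same key on every file, but the key cannot be recomputed
    without `secret_key`. This derivation is an assumption for the sketch.
    """
    digest = hmac.new(
        secret_key, record["ssn"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    protected = {
        field: value
        for field, value in record.items()
        if field not in IDENTIFYING_FIELDS
    }
    protected["protected_key"] = digest[:16]
    return protected
```

Because the key is stable across files, two files processed with the same secret can still be linked on `protected_key` after the names and numbers themselves are gone.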

Finally, we do security and confidentiality training. All of our employees on an annual basis must take confidentiality training under Title 13, and training specific to Title 26, which is tax data, and also IT security training. So this is a mandatory annual training.

Let me talk just a minute now about some unique privacy and confidentiality concerns related to our administrative records use. A lot of this involves perception as much as it does reality, I think. That is why these issues get very complicated. We have to be sure that what we are doing is perceived to be the right thing to do.

The first concern is whether consent is needed for the statistical use of administrative records, through the data provider or through the survey collector, and what the conditions for that consent are. Is it opt-in or opt-out? Does it matter whether the survey is voluntary or mandatory? These are important decisions that we have to make, to make sure that it is transparent as to what we are doing and the permissions that are being given to do this.

Another unique issue involves, will the public accept a near real-time system to respond to immediate statistical needs. Past uses of administrative records are quite different than they are today with the new systems that we have developed. The data that we have had before gets old very quickly. We have had matched systems from data that is five years old. Today that becomes much more current information, so that raises additional concerns for how that information could be used, not how it will be used, but how it could be used.

The next is, does the public trust the protections around the interagency data sharing. Today's climate is different than it was prior to 9/11, in terms of acceptance of record linkage and data sharing activities among government agencies.

Next is how can we be more transparent about record linkage activities. Events in Canada several years ago led us to conclude that these things cannot be done in secret. They have to be very, very publicly known, because the public will react very negatively to information they find out about something that is happening that wasn't publicly acknowledged. So we have to be very transparent about what we are doing.

Finally, how can we continue to meet the statistical needs while assuring confidentiality. That is what I want to talk about next. It is critically important that we make these linked data accessible for research purposes. That is where the value is. We have to determine our options.

As Ron mentioned, administrative records linked to survey data raise unique confidentiality concerns, because that administrative data exists somewhere else. So we are limited in the number of public use micro data sample files that we can publish. We don't want to discount them, but we have to understand that they are not going to be of as much value as they would have been if they were just the survey data. So we have to look at other options.

We have been assessing many other options for providing access. As Sally mentioned, we have a network of research data centers, and we provide access to qualified researchers for work supporting the Census Bureau's programs through these research data centers. So this provides an opportunity for those people who are in geographic proximity to the regional data centers and have projects that meet these criteria to access some of these linked data. We are looking at options for streamlining and enhancing those centers.

We also are actively researching techniques like those that have been used for the Luxembourg Income Study for many years, which allow users to develop programs to run on the matched data. They don't actually access the matched data, but they can submit those programs, and they are run on the matched data and products are released to them. So that is something NCHS has done a lot of work on too, and the Census Bureau is continuing its research.
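The remote-submission approach described here can be illustrated with a small Python sketch: the user names a tabulation, it is run on data the user never sees, and only a screened product is returned. The minimum cell size of five and the rule of blanking small cells are assumptions made for the example, not the disclosure rules of the Census Bureau, NCHS or the Luxembourg Income Study.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # assumed disclosure threshold for this sketch


def run_submitted_tabulation(restricted_rows, group_by):
    """Count restricted records by `group_by` and screen the result.

    The caller receives only the screened table. Any cell with fewer
    than MIN_CELL_SIZE records is suppressed (returned as None).
    """
    counts = Counter(row[group_by] for row in restricted_rows)
    return {
        group: (n if n >= MIN_CELL_SIZE else None)
        for group, n in counts.items()
    }
```

A real system would also have to screen regression output and combinations of tables, which is part of why the speaker describes this as ongoing research.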

We are also researching the development of synthetic data. John Abowd, who is currently working with the Census Bureau, has done quite a bit of research on synthetic data, which is modeled data, so that we may be able to release linked data that are not the actual data, but maintain many of the properties of the original data.
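A toy version of the synthetic data idea, sketched in Python: fit a simple model to the real values and release draws from the model instead of the values themselves. The per-group normal model and the `(group, income)` record layout are assumptions for illustration only; they are far simpler than the methods used in the research the speaker refers to.

```python
import random
import statistics


def synthesize_income(records, seed=0):
    """Replace each (group, income) pair with a modeled draw.

    For every group, fit a mean and standard deviation to the real
    incomes, then draw the same number of synthetic incomes from a
    normal distribution with those parameters. Group sizes and rough
    group-level distributions are kept; individual values are not.
    """
    rng = random.Random(seed)
    by_group = {}
    for group, income in records:
        by_group.setdefault(group, []).append(income)

    synthetic = []
    for group, incomes in by_group.items():
        mean = statistics.mean(incomes)
        spread = statistics.pstdev(incomes)
        # Caution: a group with one record has spread 0 and would be
        # reproduced exactly, so a real synthesizer needs more care.
        for _ in incomes:
            synthetic.append((group, rng.gauss(mean, spread)))
    return synthetic
```

The trade-off the sketch makes visible is the one the speaker names: the released file is not the actual data, and it keeps only those properties that the model was built to preserve.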

Finally and probably most important, we have to continue the dialogue we have established with researchers and program evaluators and program implementers about what their needs are and how best we can meet those needs under the constraints that we have to live with.

So in conclusion, what I wanted to say today is that the Census Bureau is committed to meeting the needs of its customers, by enhancing the reuse of statistical data through our use of administrative records and survey and census data sets. We are also committed to reducing the cost and respondent burden of developing statistics, and ensuring the trust of the public, our data providers and our data users, and finally, continuing a public dialogue on the advantages and cautions surrounding our use of administrative data.

With that, I want to thank you.

DR. STEINWACHS: Nancy and I and all of us want to thank all three of you very, very much. This has been exciting. It opened my eyes to a lot of things I didn't know, but probably other people around the table did, and that is part of the sharing.

Nancy and I discussed, maybe we could take a five-minute break to let people have a human break, at least stand up and stretch, and come back and start the dialogue, both questions, comments and interchange.

(Brief recess.)

DR. STEINWACHS: If we get started, we will get a chance to get a discussion going and answer people's questions.

Let's get started. Let's open it up to questions and comments, both from those around the table and those in the audience. We have a microphone that we can pass around.

MS. GENTZLER: I am Jenny Gentzler from Food and Nutrition Service at USDA. I would like to hear some more discussion about the feasibility of getting food stamp and TANF data from umpty-ump state agencies, and sometimes even from county agencies where they administer the program. I know that is a linchpin of the new DEWS system. So I would really like to hear some more discussion about whether we could have all of those matches done and have the data available before 2020.

Thank you.

DR. STEINWACHS: Who would like to take that?

MS. OBENSKI: Even though there are technical challenges with different state data, our experience to date is that we can overcome the technical challenges.

By far the single biggest hindrance or obstacle is all the legal requirements around acquiring the data set. For example, in our child care subsidy program, which is a pilot project that we are undertaking with researchers from Texas, Illinois and Maryland, it took a year and a half. This is working with researchers who have established relationships with the state entities. This wasn't the Census Bureau calling up cold.

So although there are technical data challenges, I think I can speak for my analysts that those they can overcome, but the policy and legal obstacles or challenges to make sure that all parties are protected are quite formidable.

MR. PREVOST: I just wanted to add, we were involved in a child care subsidy conference last week, where a similar question came to us. I think if you look at the way that data are submitted to CMS as an example, an agency would almost have to have a centralized focus within the federal government that somehow related to the regulations around a program and perhaps tied to the funding to those states, to be able to collect or to require the collection of that information in the standardized format in order for this to occur, because otherwise it becomes voluntary. I guarantee you, out of the 51 or more entities out there, somebody will not want to participate for one reason or another.

MR. LOCALIO: You gave us an interesting response. Let me just tell you what happened last week. The full committee had a meeting. One of the issues had to do with the inability -- I guess it was NCHS -- or the decreasing ability of NCHS to get vital birth and death data from the states and local entities that produce those data.

We had a speaker from New York City, Steve Schwartz. He called himself Dr. No, because he tells so many people that they can't get access to the data. Then we had a presentation that indicated some of the difficulties that this poses for people who previously got data but now they cannot on births and deaths. Obviously this is very important for a researcher, who needs to know how many people in the study have died. That is your outcome, survival.

So what people didn't hear is what I asked him after we broke for lunch. I introduced myself and I said, suppose Health and Human Services told New York City that their further receipt of any Medicaid funding from the federal government was contingent upon their cooperating with NCHS in providing the data as they used to. His reaction was very defensive. He said, oh no, Tenth Amendment would prohibit that because that has traditionally been a state and local activity. Then I reminded him that in the early ‘70s, every state had a 55 mile an hour speed limit because they were told 55 miles an hour or you don't get any highway funds.

So I would submit that a lot of the way around some of these obstacles which are most unfortunate and not in the best interests of the country as a whole is to do what I think you alluded to. That is, to suggest that more cooperation will be forthcoming if these data systems are linked to certain line items that provide appropriate incentives.

Now, the other thing that came up which I think was not nearly as threatening as what I mentioned is to sponsor model state and local laws that would be appropriate and consistent with the needs of everyone and have uniformity, and to say, we have this model, this requires somebody to get out there and think about it and write it, and that is not always easy. But if you have model laws, it then becomes a little bit easier to convince state and local entities, many of which have a lot of difficulty passing a budget every year, that this is worthwhile if you do some of the work for them.

So I just wanted to relate that to people, that this is my conversation on Thursday of last week. I am wondering if anything was said about me back in New York City on Friday.

DR. STEINWACHS: Anyone representing New York City here?

DR. MADANS: Lucky for you, Steve was still in Hyattsville on Friday, and it hadn't gotten back to New York.

I just wanted to clarify a little bit about what is happening there. This is really a data release issue that we are facing with vital statistics. There are some items we are not getting, but that is a different issue.

NCHS is getting the information. What has changed is what and how we are allowed to re-release it. It does feed directly into the issues that have been brought up. I have a feeling some of the presentations you are going to hear today and tomorrow will get boring, because if you take out Census and put in NCHS, other than examples, the issues are all the same. But if you nod off, we will understand.

This whole issue about what is safe to release and what are our requirements, what are the various parties' requirements, how do they mesh, is one of the major issues. So something like a model law, or some way that we can ease the way that we work together, is really important.

Every time we start a different linkage project, it is as if no one has ever linked data. We start from scratch, dealing with the same issues over and over again. Nothing changes, again, substitute the different data sets. So getting agreement on best practices, on what is reasonable to do, what is acceptable to do jointly I think would make our lives very easy, because after a while this is just work. We are not getting anywhere. We are just redoing the same thing over and over and over again.

DR. STEINWACHS: Let me ask a question that falls into one of the areas this subcommittee has been dealing with, and that is trying to improve the reporting of race and ethnicity. It sounded to me as if I just had to come to the Census and I had both good reporting and filling in the missing blanks by statistical means.

Let me just ask the question as part of my education. You have that data well developed and are updating it and monitoring it. On the flip side of that, there are people who analyze the Medicare claims data, administrative data, to look at disparities in health care services. But mainly they have what Sally was saying, they have white, black, other, because it dates back to the -- the source of the data dates back through Social Security or original enrollment in the Medicare program.

What would be involved if CMS came to you and said, could we get from you an estimate of race and ethnicity that we could link to each of these people and we could update that. I was trying to get a sense of how does that fall in privacy and confidentiality. Does that raise issues? In a sense, they have it but they don't have the measure that we use today. They don't have the categories. But it would help me to get some sense of what the possibilities are, maybe have a little discussion of something like that kind of reverse request, where CMS was coming to you.

MR. GATES: Maybe I will start this discussion. I think the issues associated with assigning race and ethnicity have to do with -- if you are going to do it on an individual level, how good it is. If you are assigning race and ethnicity based on some models that are a hundred percent accurate, then you have got confidentiality concerns, because of the fact that that information may have been -- where you are deriving that information from, if it is derived from the Census, let's say, that is protected, and you are creating this model that is a perfect model, then you have got a real issue in terms of, did you really comply with the confidentiality requirements. So that would be a major concern.

So I think the issues associated with this have to do with perception about, have you violated the confidentiality.

DR. STEINWACHS: Let me ask it another way. What if there were some auspices under which you took Medicare data into your data centers and did a linkage? What would be the issues if you went the other way?

MR. PREVOST: I think I can speak to that a little bit. Part of it is, that is what we have done. The auspices under which that would work, we have our research data centers, which would provide an environment that was protected, where the individual, if they had a project that had been approved, could come in, could work on that project and the data that were ensconced in that data center, and then there would be a confidentiality filter, being people who man the research data center, who would take a look at the results to see if those results were presentable outside. By that, I mean that they would not be disclosing any given individual's information.

So we have worked under that. We have also worked under situations like in our current areas, where we are jointly conducting this research with our partners. We do confidentiality research to make sure we are not disclosing anything, and then we provide it as a group.

MR. GIBSON: I apologize for interrupting. I'm just going to add some technical details here. We have had the privilege of trying to work with Sally and Ron before. My name is Dave Gibson, and I am representing Spike Duzor, who is out this week, and he has asked me to speak on his behalf.

DR. STEINWACHS: Thank you.

DR. BREEN: CMS.

MR. GIBSON: I'm sorry, I'm from CMS, I apologize. We have tried to establish some relationships with Census. It wasn't for want of trying on Ron and Sally's part. We really appreciate their professional demeanor and everything that they were trying to bring.

We have a lot of problems with administrative data. I'll be talking about that some this afternoon. But I would like to correct something, the fact that you are saying that some of the data for the other race and ethnicity categories are not broken out.

We have done a better job in terms of especially the Indian Health Service. We have a relationship with them where they populate our data, and if you are a member of a recognized tribe, that information will trump anything that is in our database. We have had mailings out to people in that other category, who are we think mostly Hispanic, and people who have Spanish and Hispanic surnames. We have done mailings to them and tried to get information back to populate the expansion of our race codes to go beyond just white, black and other.

That was a constraint that as you know was based on the old SS-5, and data that is coming in currently they are collecting it more, but we are collapsing it back down in some cases, but in our enrollment databases we are trying to keep that information there.

So there is some effort going on to try to improve the race and ethnicity data. I think the thing that you have to worry more about is the fact that with enumeration of SSNs at birth being done by the states, many of the states now are refusing to give SSA the race and ethnicity of the child. So whereas most of the Medicare population, 85 percent, are people who are obviously aged, there are disabled in that category, and some of the ESRD especially could be children. You are going to find a lot more of these unknowns creeping into the data over time.

So I don't know if that is helpful or not, but just little technical details there.

DR. BREEN: Could I ask David a followup to that? I was wondering, the techniques you said you were using working with the Indian Health Service, which of course would be a really good source, but wouldn't it be more cost effective to work with the Census in order to get information from them rather than to send out a mailing? That sounds like an expensive way to do it.

MR. GIBSON: I believe we did it as a one-shot. I don't recall. I would have to talk to folks in the enrollment area to see if they have had a followup on that. But my awareness of it was only that it was done one time. Yes, it probably would be more cost effective to try to do some sort of a -- you are talking about some sort of a statistical match?

DR. BREEN: Yes.

MR. GIBSON: That would probably be much more helpful, I agree.

DR. BREEN: And you would probably end up with better results.

MR. GIBSON: I would suspect we would.

MS. TUREK: Gerry, is this a Title 13 issue? If they were to do a match, can you give them your race data from the Census to match into their files? An exact match.

MR. GATES: Are you saying if we gave it to them?

MS. TUREK: To CMS; may you do that?

MR. GATES: No, we can't. It would have to be done by us.

MS. TUREK: Then you couldn't share the final data set with them. You couldn't give them the micro data, right?

MR. GATES: No, not the micro data. We could give them back tabular data, but we could not give them back micro data.

DR. IAMS: If I may point out, I am from Social Security. Starting in the mid-80s, the social security number has been issued at birth. We don't have race and ethnicity, so your black-white is disappearing from the data set. At some point in time people will be on Medicare, and you won't have anything.

DR. BREEN: Is it not possible to fill in the blanks and return the data?

DR. IAMS: I'm not sure what you are suggesting.

DR. BREEN: If there were missing information, to update that information, or if there were incorrect information, to update that information at the Census Bureau with the CMS data, and then return the data corrected.

MR. GATES: No, that would not be possible under our statute. We could not disclose that personal information to CMS.

MR. PREVOST: While it would be technically possible, and we have done it within the constraints of the Census Bureau, delivering that data back to the agency, we would not be capable of doing that.

MS. TUREK: Title 13; they don't give that to anybody.

DR. STEINWACHS: Let's continue. I want to get everyone in.

DR. STEUERLE: Continue this conversation. I don't want to interrupt.

DR. STEINWACHS: Others who want to join in on this conversation?

DR. DAVERN: Mike Davern from the University of Minnesota. Would it be feasible, and I know you guys have talked in the past about building an imputation model, in which you could give CMS the information it needs to take a best guess at what the race might be on their file, based solely on Medicare information?

MR. PREVOST: Yes, it is certainly possible to do that. We would have to conduct a wide variety of research to make sure that we weren't in particular cases re-identifying somebody.

DR. STEINWACHS: Other comments on this?

MR. J. SCANLON: Aside from the data acquisition issue, which will probably be with us for a long time, you mentioned on the dissemination end you had research data centers, you had other ways of releasing the data. It seems to me that is probably an area to focus in as well.

Do you have any indication of to what extent research data centers are utilized, and when you do talk to customers what kind of ideas do they give you? Apparently you can't use remote access for a number of reasons, but someone can request certain tabulations or regressions or other things. You can provide the product itself. But how much is it utilized? What sort of models are you thinking of for the future, and how do you relate to the way other agencies are approaching this?

DR. DAVERN: I think one of the issues on the research data centers is the process involved in getting approval for the particular project, especially with administrative records data. It has to go through an approval process with the disclosure review board and all those types of things, so it takes a long time to get approval for the research data centers.

I know there is a large interest in people using them, and there are a lot of people who are using the research data centers. I don't know what the breakout is between those using administrative data and those just using some of the other Title 13 data. I think there are two different ways to do it. But the research data centers are used heavily. There are only 12, 13 research data centers, and they are looking at expanding the access to those.

MR. J. SCANLON: Do you physically have to present yourself?

DR. DAVERN: You physically have to go to the research data center, correct.

DR. SONDIK: I am doing the wrong thing because I am reading ahead here.

DR. STEINWACHS: I think that is against procedural rules, isn't it?

DR. SONDIK: It probably is, but I noticed this before, and given that you just asked the question it can come up again. But in the talk that is going to be given by Julia Lane, there is a slide that says, research data centers, drawbacks: low and declining utilization. It said, fewer than a hundred active projects.

DR. STEINWACHS: Let's call on Julia.

DR. SONDIK: I was struck by this. There is a judgment there that that is low, but it struck me as -- I don't know what it means across 12.

Agenda Item: Using Linked Micro Data: Julia Lane

DR. LANE: My background is, I was at the Census Bureau for seven years, and then I was at the National Science Foundation, in the economics program, which continues to fund the research data centers.

That judgment is not mine, by the way. That is the judgment of Brad Jensen, who used to be the director of the research data centers, and that is a direct quote of his testimony to CNSTAT. NSF had very similar concerns, because NSF funds it. When I was there, there was a lot of concern about the little usage, which last year was under a hundred projects spread over eight, count nine if you think of the California one.

So the big issues that we had when we put it before review panels were, first of all, there was a delay in the review process, which could take up to a year, which is a problem for graduate students and so on. Physically having to go on site to do the work is a big deterrent for research purposes, and certainly for program agencies. Joan might want to speak to that.

So we at NSF came to the conclusion that it is very important to support it as a niche approach, but it is certainly not going to be a broad-based response. That is a little bit what I will be talking about this afternoon at Joan and Gene's request.

MS. TUREK: Let me give an example of very intensive data use. I manage the transfer income model which is used to look at the impact of changing federal programs. It has been around since the ‘60s, so it has probably got $70, $80 million in it over all that time.

We use the CPS, fortunately, and not SIPP, or we would be in real trouble. We run the micro database through that model and assign people to different buckets, depending upon their characteristics. We also can change the code in that model daily when we are doing something for the Hill. At one time we had an open line to one of the subcommittees and were doing runs while the subcommittee was in process.

When they change the programs, we have to write new code. So for us to use the matched data we would need to be able to have our model sitting out somewhere where we could change code, do runs and get the results in real time.

I think that is going to require some major changes in law. I wouldn't want you to do that to any of my data sets that I am using until after -- I would want to keep what I have until you have got the rules in place that I could use the new, or you are going to affect one of the major policy functions of our office.

Do you want to say something about social security? Because you had SIPP based models.

DR. IAMS: We continue to have SIPP based models. We are a data center that follows the rules and regulations of Census and the Internal Revenue Service, and treat it as if it is gold at Fort Knox. We are highly protective of our security and privacy, and we do run these models. We have an ongoing relationship on social security reform with the White House, that is receiving it on a flow basis. But that is facilitated by having Social Security Administration being a data center that permits us to use the restricted data.

MS. TUREK: Which we would not be, because we are a policy office.

DR. IAMS: Yes, you are a different agency.

DR. BREEN: Is there any access by the public or researchers to your data center or that data that you are talking about, Howard?

DR. IAMS: Yes and no. The two main models we use with restricted data, one is called MINT, modeling income in the near term, and the other is our SSI financial eligibility model we call FEM. We have had researchers funded through our retirement research center grant system use MINT data for various descriptive projects. I think people from the Urban Institute have done that as grants. But that is really a statistical adjustment to five SIPP panels, and I am not sure how broadly based the Social Security Administration wants to make that available. We get into an agency proprietary aspect.

DR. STEUERLE: Howard, just quickly, you and Tom Petska do at times -- I'm not saying it is done as much as you like, but you do also tend to have joint projects. You have had external research combined with internal research.

DR. IAMS: Yes, we have had joint projects with external researchers, some of whom we fund. Offhand I can't think of anyone we don't fund.

We have researchers with our retirement centers. We fund projects through a cooperative agreement to three centers. The Social Security Administration funds projects at the National Bureau of Economic Research, at the Michigan RRC, and at Boston College. We have had researchers come and use matched data, getting a grant from us and using matched data at our center, and we have CBO who has come and used matched data.

They are using the administrative data matched to the Census Bureau data. I am speaking about that this afternoon. We have an array of administrative files that are matched to it.

On a slightly different note, I have felt that the Bureau in the past could have made more use of these matched data to improve the data quality of their missing data and imputed data. Our data on being paid a social security benefit, that is what Treasury sent them, and our data on an SSI check, that is what we sent them. In the CPS or in the SIPP, when you have a matched record and someone fails to provide an answer, you could provide a statistical assignment from our administrative record that is going to be better than using a hot deck from some other person, who may or may not be related.

We found in our studies that SSI gets misreported in CPS and in SIPP. People call it social security, so you have more social security beneficiaries in these surveys than we have as beneficiaries for matched people, because some of those people are getting SSI and say it is social security. Then you have an under count in SSI.

If you were to do that assignment, you probably could carry it further, because with SSI usually comes Medicaid. So you could assign Medicaid based on SSI. But there are a number of things.

I believe in the new economic well-being survey system, they are planning to make more use of administrative records in improving the information collected in the survey than has been done in the past. There is not a case where you have to go get privacy from 51 states. We do have it, and we do have an exchange and a series of MOUs and exchange data with Census for statistical uses.

DR. W. SCANLON: I just had a question for clarification. What are the restrictions, or what are the requirements for an agency to be a data center, in terms of social security, qualifying for it to be one? Is it conceivable that CMS could qualify to be one?

DR. IAMS: If I could be colloquial, we basically take your first-born child. I am joshing, since this may be recorded.

There is a series of background checks that are required. There is a series of sworn statements that have to be made. There is a training program that IRS maintains that people have to go through. There is a training program that Census maintains that people have to go through. Then our data center walls off the data; people can't print from the data, they have to print to a printer, and one of our employees goes and reviews the output that has been printed, and you can't take the data away.

Then to get to our data center, you have to go through two armed guards on two different floors to get to it. Then you would have to know what you are doing when you got to the computer, in terms of getting access to the file and all the standard protections.

So there is a set of requirements that Treasury has laid out for being a restricted data center that protects their data. You have to go through these safeguards, and users have to go through these types of things. I don't know if Tom Petska tomorrow is planning on touching on this at all. But it really is a very tight control, in terms of getting access and use. I think the Census Bureau maintains the same type of thing at their RDCs.

DR. DAVERN: Let me say a little bit about that. We started the data center network probably about 12 years ago through our Center for Economic Studies. The process for identifying new research data centers is handled through an arrangement between the Census Bureau and the National Science Foundation to determine where the best place to locate the centers is, based on the research community associated in that area, and how we best can establish a partnership with an academic institution or whoever we decide to partner with to establish a secure environment for accessing these data.

As Howard said, there is a very tight process for determining what projects will be accepted through the research data centers. You should know that people who access these data are considered -- we call them special sworn status persons, and they are given special sworn status because they are helping the Census Bureau conduct its activities under Title 13. So we go through a process of identifying how that project is going to support our programs by improving our data, improving our knowledge about how the survey is functioning. So that is all part of this process.

But there is a formal process for identifying new research data centers. The arrangement that Howard is talking about between the Social Security Administration and Census Bureau was established in 1967, way before we started formally introducing research data centers. It was in recognition of a collaboration that these two agencies had for better working together on these projects that were of great interest to the Census Bureau in working on its income programs and other programs.

So I think it is important to understand how this process has come to be where it is.

MR. LOCALIO: I just want to make a couple of comments on data centers from the perspective of a working statistician.

I am probably currently the statistician on 25 projects. I have meetings every day, even when I am not in the office, sometimes by phone. To do an analysis and to do it right sometimes takes years. It is impossible to get funding out of anybody to finance staying in a hotel room for six months and doing work.

So I would suspect that a lot of work that could be done is not getting done, and I would suspect that a lot of the work that is being done is not very good. I can tell you right now, putting on another hat, that is as an associate editor for a journal where I act as a statistical reviewer, a lot of this stuff is not very good and I send it back to the federal agency as well as other academic institutions. I say, it has to go back and you have to redo this analysis. So if you are working in a data center, you would have to make another arrangement to go back to the data center and rerun everything and rearrange your schedule.

So I can understand why there are a hundred current projects and it is dwindling. That is no surprise to me. I am amazed there are a hundred active projects.

If you think the data center or a dozen of them is an answer to one of the issues, it is just not. It won't work. What it does is, it trivializes statistical science to the notion that people are writing a couple of lines of code in SAS, and then running it. You may be writing thousands of lines of code in SAS or R or S-Plus or some of the other programs, thousands, and you are rewriting them and rewriting them to get it right. These data sets are complex. There is misclassified data and missing data.

See all this gray hair? This is not an easy life. It doesn't pay very well, either.

So I think we have to think much more seriously about what types of projects can be done at data centers and on what terms and conditions. It is better than nothing, but if it is to work, there has to be much more thought given to increased flexibility as well as adding to their numbers.

By the way, my views are not new. Marjorie, remember that e-mail I sent in response to NCHS's data center request for comments? That was in the spring of 2005. It caused some laughter because I tend to try to kid around a little bit when I am serious.

So those are the real issues when you are working with data. By the way, the data sets also tend to be big, and everything is bad.

MS. OBENSKI: I'd like to partially respond. We know that that is a problem, and we certainly can appreciate the fact that these data sets are extraordinarily complex. I have watched the very best analysts run and rerun, et cetera.

One of the models that ameliorates it, does not fix it, is for example in the Chapin Hall project, which is very complicated; it is TANF, UI wage, child care subsidy and the American Community Survey. We are working very closely with the researchers, but we are doing the data analyses and the runs and the reruns and getting it right. What we are delivering to them in the RDC is a fairly pristine data set that they should be able, once they get access, to do pretty good research on without the runs and reruns. We intend to keep involved with them to help expedite that.

Again, it is not a fix, but we believe that it should help things in the RDC.

MR. PREVOST: I just wanted to add a little bit more to it. I think one of the other things Howard said was important. What could be done is for edits and imputes of the data to occur to the PUMS micro data sets themselves, therefore affording access to individuals as they would any other PUMS file.

But as Gerry I'm sure can say, and I said in my speech earlier, this certainly raises the disclosure issues from being able to reidentify people. I think PUMS data sets as a whole -- as computer technology and everything else increases, the ability to match back to an individual and to reidentify them is growing. It is certainly a concern that we all have as statisticians.

So anyway, certainly one model in the short run could be doing this with edited data in a PUMS if it passed disclosure proofing. But then also to provide the differences that we are seeing between the survey data and what is being collected in the administrative data to enhance the modeling that is being done by researchers for policy reasons. I think you need both prongs of that in order to make that function work.

DR. STEINWACHS: We are going to need to wrap up here, so one comment, Gene, and then we will close up and go to lunch.

MR. GIBSON: I am afraid in some ways, the changes that CMS is planning may necessitate something along the lines of what we are talking about with a data center, and whether we would qualify, I don't know.

We have been used to giving out our files as flat files for both the claims as well as the unloaded EDB or else the denominator file. We intend to go to an integrated data repository, which is basically a relational database with everything in it. There would be only one source of that data, and you would go to that data, so there wouldn't be these multiple copies of flat files that we would be giving out as either an EBCDIC or a SAS file to our users.

So that has us as researchers inside concerned, because we have found that when we go against relational databases like the NMUD which is used for our claims -- and even we don't have access to them, but we have seen when you use queries against them, it can take a long time to get a response back. That is typically caused because the records are very wide, they are variable length packed decimals.

We would like to entertain the notion of shortening them, trimming them off on the front and back and coming up with what I would call core research files, which would be very, very deep, but very, very narrow. Trying to get our data processing friends to buy into that is another story, but that would be another possibility.

We could work with the groups who know how to handle data such as Census. Where we are typically hurt is by the lack of meta data associated with the instructions from our CWF friends in policy when they implement policy, or our CWF friends in OIS, that is our data processing group, when they go out to our standard systems and come up with procedures for fiscal intermediaries, carriers and providers. The meta data somehow gets lost, and it is not associated with the claims.

So if there could be some way of working with agencies who know how to handle meta data all the way from the policy to the systems to the manuals to the actual claims and enrollment database, such as with the Census folks or SSA or NCHS, and then think about trimming off these humongous files for the variable length packed decimal stuff that is not really needed, we could do a lot, lot more in terms of being more responsive. A relational database, we found, if it is relatively narrow you can get answers out of it fairly quickly.

DR. STEUERLE: I have had the fortune of being associated with a number of agencies that are very interested in data from SOI to Social Security now to HHS, and on numbers of occasions have made contact to Census.

It is probably unfair, but I would say that probably the view of the statistical community, at least around Washington and maybe around the country, is that Census is king. At times it ends up to be the lion king and other times it ends up to be King Kong. It is just big, it has some really good people doing wonderful things, but at times the nature of the agency is such that it accidentally steps on people that it doesn't want to, or knocks over buildings by accident or something.

As an economist, I tend to think of this as very much based on -- I ask myself, what are the incentives of each agency. It was interesting in your talks, because a couple of things stood out. One was, the word counting came up numerous times in your analysis. Even when you listed the reasons you would do things, I think the public good was listed third or fourth on a list. They weren't necessarily hierarchical lists, so maybe that is unfair, but it wasn't listed first, it was listed third or fourth.

I think the reason for that is, the Constitution gives you authority on counting, and that gives you great sway if you go to Congress, because Congressmen care about how many people live in their districts.

If you could justify things on the basis of counting, you can often go out and do a number of things. The further you get from counting and the closer you get to research, including research that might show policies as being ineffective or policies as varying by states, and in some states being effective and ineffective, the harder it is for you to get the easy justification.

The other thing I sense, and this is not Census, I think part of my point about making the king statement is, I think the reason people turn to Census is just because you are larger than anybody else. They hope you are going to solve the problem, even if it really is an IRS problem or a Social Security problem or an HHS problem.

This question about public acquiescence and public buy-in. That is a tough one. It reminds me of the story of when the private person asks to do something that might create some harm, so you do something that might at some level reveal somebody's privacy. It is called a type one error. One person objects, and somebody in the press plays it up, and it gets bad news.

The consensus of many agencies in government of course is, you don't want bad publicity. It is almost like that is the first incentive. The second incentive is in some cases to serve the public.

However, there are times when that is overcome. That is when type two error becomes publicly visible enough. The case that you gave was real time analysis. All of a sudden there is a hurricane, we have -- really, let's be honest, we have not done a good job on integrating data sets. We did not adequately serve the public. Again, not to blame it on FEMA. I don't even know who would be responsible. But all of a sudden, the type two error gets large enough that people say, privacy concern is nothing when we couldn't serve the population of New Orleans.

This leads me to asking a last question. What do you see as ways of cutting through the Catch-22s, the dilemmas that many people have pointed out here with some specific examples, on where we could use a better integrated data set? By the way, I don't agree fully, Russell, that these data centers aren't useful, because I think it has been an attempt of these agencies to -- it is the one way they figured they could cut the ice. I can give some examples where I think it has made a difference, even though for every one example I could give you where it has worked, you could give me ten where it doesn't.

But it is a step. So what we are encouraging you to do and what I hope you will do is give us advice on what are some steps we could do. Do we need more people working on state protocols? Do we need advocates within agencies who would be research advocates, not having the advocacy coming from the outside, but somebody inside saying this would serve the public good, let's figure if there is not a way to solve this problem and get around our 47 constraints, or dealing with Tom Petska at IRS, or dealing with Russell.

I am really asking you, what do you see that we could advise policy makers on how to deal with these issues? I have been involved with too many drafting of laws, and I know what happens. It is what former press secretary Jim Brady used to call the BOGSAT method, a bunch of guys sitting around the table. One says I think this looks good, now all of a sudden it is the law, but it is not that it was written in stone or was even well analyzed in the way it was drafted. Do we need people in agencies who work on how we can redraft the laws to deal with these things? Do we need advocates for the public good? That is the one thing we can use to give the type two error its importance.

It does seem to me that there are cases, and I don't want to use real-time analysis to be the only example, there are cases where we are dis-serving the health of the public by not doing the things that as analysts or researchers we know we can do, but as bureaucrats or law abiding citizens we know we can't do.

DR. STEINWACHS: He asks an easy question, doesn't he?

MR. GATES: That gave us a lot to think about. I would like to separate that into two areas, if I could. One is the issue about whether we can do the linkages to support this research. The other is, can we provide the data to the researchers to make it happen.

I think in terms of whether we can do the linkages, there are a lot of issues we can deal with, but the perception issues are the really critical issues, about how do we convince people, convince everybody, that this is the right thing to do. We should be linking these data for these purposes, because we can do it in a very safe and controlled way.

I think we have to have more of a public discussion about that, about the fact that these records for statistical purposes under these controlled conditions are a good thing.

The other issue about whether we can provide these data for the researchers is also a critical issue. It comes down to, as I said before, how do we encourage the research and still maintain the trust that we have built up. We did build up trust, we don't want to lose that trust. We cannot lose the trust of the public that gave us the information. Either they gave it to the administrative agencies or they gave it to the statistical agencies.

So we have to figure out under current legal requirements whether or not we need to modify the legal requirements. I don't know the answer to that, but that has got to be worked out in terms of how we can do it in a way that keeps the information safe.

I think what we have tried to do, at least with these research centers and moving towards other ways of permitting remote access or doing synthetic data that may give micro data that is useful to researchers, are ways of accommodating what we think are the best ways to get the data out to the research community in a way that is safe.

There are probably other ways that we could consider too, but I think that is really going to be the hard question, and it is a question that we have to address.

I'm not sure what the right answer is, but the more we can discuss this, and the more we can think about, is there something else besides those things that gets those data that we need for the research, those linked data that we need for the research out to the researchers. I don't know.

DR. DAVERN: I think what we need is, we need researchers to be patient with us, and work as hard as they can to get access to the research data centers to use the data, and show that it makes the difference.

Our experience with the SIPP, when we came out and said we are going to get rid of the SIPP, we are not going to collect it, we are going to do a new dynamic system, a lot of people out there, the research community agencies, pointed to a lot of research that was used heavily and widely and said, we need this data, these data are important.

We have a couple of projects out there in the research data centers that are doing these linkages, are doing matches to evaluate that, and we need more of those out there doing this to show results that get articles published or articles used by agencies to show that these things are working, to show that we need more of these research data centers or increased ability to access these, and to work with us to get those out there and to keep working on that.

We have also contracted with CNSTAT to do a panel that looks at these administrative data linkages, particularly as it relates to this new dynamic system. One of the key questions is, how do you get these data out there to the public to be used, and how do you deal with the privacy concerns, and are there other ways like synthetic files or other types of things that can be looked at.

So we are working with CNSTAT to look at these issues, but I think what we need really are people who are willing to go through that process, no matter how difficult it is, but that is the process we have, to show that this is an important thing to look at.

MR. PREVOST: I just wanted to add on, I think there is a third part. The third part comes beforehand, it is the prequel, if you wish. Gerry was talking about getting the data and linking it, and then providing access to the researchers.

But what I would submit is that there is a government business model out there. We are all working to try to serve our clients, the general public, as well as we can. The way that the agencies are set up right now is that we are all working within these stovepipes of our own business models. I think some of the things that have come out of E-Government have started to take a look at generalized data systems and processes around the federal government.

One of the things that could be done is to come up with standardized agreements and standardized processes. I don't care if it is with Census or not; if it is somebody sharing data with National Center for Health Statistics, for example. We should have a streamlined process, so if you have a project that is two years long, you don't spend 90 percent of your time trying to get the data. That is the first part, and I think that is something that we could deal with.

MS. PARKER: I am Jennifer Parker from the National Center for Health Statistics. I just wanted to say something about the research data centers. I went to three epidemiology conferences this year, and by and large nobody wanted to use the research data centers. The graduate students in particular said they didn't have money, and the academics said they didn't have the money to place somebody in the centers.

If we want people to use them, we have to identify funding, because people can't write grants if they don't know what they are going to do. So they need preliminary exploratory analysis to get the job done. The graduate students, they don't have any money. So if this community wanted to promote that, then we are going to have to come up with money for people to use them.

Thanks.

MS. OBENSKI: I would like to respond from a different perspective. I think that there are big, big questions that are arising, because we are in a new day here with these integrated data sets. We are addressing policy problems and questions that we really haven't had to address in the past.

But I think that if you talk about how do we get the word out there, I think of a project like Mike Davern's project on the Medicaid under count study, where we are working collaboratively with experts from multiple entities, all coming at it from a different vantage point, but all trying to answer the same very complex question.

I think it is somewhat unique. As the director said, we met with ASPE and our team last November. It is probably a unique experience to bring in these other agency entities as part of the team. What it has done is, it has enabled us to do a project that the Census Bureau never could have done as well on its own.

So I think more of these, this model, is the way I believe it is going to be in the future.

DR. STEINWACHS: Ed, you get the final word before lunch.

DR. SONDIK: Then it had better be brief. I think it is really interesting that this discussion focused on the data centers and not what I thought it was going to focus on, which was technical issues related to matching administrative data to survey data or whatever. I think that is very important, and we really should take note of that.

Second, I completely agree with Jennifer about the cost issue. We are thinking about that in NCHS. There are several reasons for cost. One is recovery and the other is regulating demand, if you will. So we are rethinking that.

Third, we have a way of doing remote access with our data center. You are going to talk about that. So I think that is really important. That is an issue of trying to deal with Russell's point, easing the access.

Gene's point struck me that you are raising a very fundamental question. It is this tradeoff between access, increasing access, and at the same time preserving the confidentiality. The way we work it in this country, as we all know, we have a set of different agencies that have different rules, but basically they are pretty much the same. But there are alternatives.

The National Center for Education Statistics licenses the data. What has always troubled me with the licensing, even though there are penalties for violation of the license in some way, is what is the impact of that, what is the cost of that to the individuals involved, to the agency, or whatever.

If we are going to rethink this, I think we have to think very fundamentally about this, and understand what risk really is. I find it fascinating. I would be surprised if any of the agencies had any quantitative measure of risk. I know we don't have a quantitative measure of risk. By that, I mean what is the probability that we have a leak, whatever it is, or a successful hack or hacking, whatever the term would be. I think if we are going to rethink this, we have to think about that.

But the point about the need for all of this in time of an emergency actually raises a very interesting question in terms of our preparation for dealing with emergencies. I think that one of the points that have come out of this is that we are not necessarily -- we are not prepared for that to the extent that we should be. It may very well be that we can prepare and have maybe even special authorities that can be invoked in time of particular emergencies, which again would bring this back to the legal side.

A lot of food for thought here.

DR. STEINWACHS: I very much want to thank Sally, Ron and Gerry. Thank you for leading off. You have certainly gotten this discussion going. I hope you may be able to stick around. I think it says very much that there is a lot of common ground here and certainly common interest in trying to reach the same ends.

In the information you have there is an identification of places to eat. I think there are about four of them listed in the food court. Originally we had you back by one. If you could come back by 1:10.

(The meeting recessed for lunch at 12:18 p.m., to reconvene at 1:10 p.m.)


A F T E R N O O N S E S S I O N (1:10 p.m.)

DR. STEINWACHS: We are very fortunate to have Julia Lane with us, even more so because I understand she needs to rush off and catch a flight due to the loss of a friend or family member; your uncle died. So we appreciate your fitting this into not an easy time and an easy schedule.

Julia is a senior vice president at NORC, the National Opinion Research Center. Julia is going to talk about using linked micro data. Julia, the floor is yours.

Agenda Item: Using Linked Micro Data – Julia Lane

DR. LANE: Thanks very much. When Gene and Joan asked me to do this, I was happy to be part of it. I think what we are doing is really important.

I spend a lot of time matching administrative and survey data with my partner in crime there, Ron Prevost. Seven years of matching data from five different agencies, SSA, IRS, Census, DOL and HHS, and then 50 states. So I have the scars to show it; it took a very long time.

What I wanted to talk to you a little bit about, I am a health economist, just to give you the benefits of what we saw in terms of putting linked micro data together, the generic issues and then some of the challenges, and then very much in light of the discussion we had before lunch, maybe some suggested solutions.

Obviously much of the stuff that was discussed before lunch, clearly when you have linked administrative and survey data, you can get much better analysis of existing data. So for example when you are trying to explain earnings and employment outcomes on individuals, when you have linked employer-employee data, instead of just being able to explain 30 percent of earnings and employment variation, you are able to explain 85 to 90 percent. So the explanatory power of existing data is far enhanced.

You can also do re-analyses that you thought you might not be able to do before. When you put multiple sources together, the feasible set of research increases. In my case for example, when we had information on firms, rather than just the supply side of the labor market, we could look at the demand side. In health, if you were interested in looking at just patients, you could also look at health care providers and geographic information and so on. So there is a rich new set of analyses that can be done.

The other thing in my time as a rotator at the National Science Foundation, it turns out that the capacity to capture new sources of information is increasing. We now have MRIs on individuals, we have got biomarkers. At NORC we are doing work for NIA, which captures biomarker information on the NSHAP survey. You can capture people in video and text. All of a sudden it starts to be able to explain other aspects that you couldn't explain from just admin or even survey records per se, just the straight data that were captured. So for example, it might be that increased earnings potential is due to excessive testosterone or something, just hypothetically. At least, that is what my husband tells me. That is what makes him more productive than me. That is why he makes more.

This came up a little bit in the discussion before. This is simply good government. Enormous amounts of energy and taxpayers' dollars have been used to collect these data. The more that you can leverage that investment in data collection, the better off the taxpayer is.

Of course, all of that information that is collected -- this is a Toles cartoon that you might have seen -- might cause some privacy concerns and privacy issues. You may recall the stuff about medical ID chip.

With the data collection, and I thought Gerry and Sally and Ron did a very nice job of describing what those challenges are, we really have a serious challenge with providing access to the data. If you think about data utility as being a function of not just the quality of data that are collected, but also the number of people who access it and the number of times it is used, the big challenge is not just to collect a fabulous data set, but to have people use it and use it in the way for which it was intended, for which taxpayers are paying those millions of dollars.

It used to be that statistical agencies handled this by producing public use data. The problem is that the increased likelihood of reidentification of public use data, together with an increased understanding of the consequences of public use and the quality of analysis that is done, and I'll give you an example of that in just a minute, means that not only is the quality of public use files declining, but the likelihood is that fewer public use data sets will be released.

In particular when you think about public use data set for health issues, the very skewness of the distribution means that the types of approaches that are used to protect the data, in particular the top coding, will have serious implications for the quality of analysis that is done.

So for example, I was trying to remember this morning, I should have looked it up before I came, but I think either David Cutler or David Meltzer shows that there is five percent of the population that is responsible for 80 percent of the Medicare expenditures. So if you cut off the top spenders because you are worried about top coding, you are not going to have much analytical capacity associated with that.
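The top-coding point can be illustrated with a minimal sketch. The spending figures below are made up so that the top five percent of records hold 80 percent of the total, matching the figure recalled above; the cap of 10,000 is likewise an assumption for illustration, not a value used by any agency.

```python
def top_code(values, cap):
    """Replace every value above `cap` with `cap` (simple top coding)."""
    return [min(v, cap) for v in values]


def top_share(values, fraction):
    """Share of the total accounted for by the top `fraction` of records."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical file: 95 low spenders and 5 very high spenders.
spending = [1_000] * 95 + [76_000] * 5

# Same file after top coding at a hypothetical cap.
capped = top_code(spending, cap=10_000)

share_before = top_share(spending, 0.05)    # concentration in the original file
share_after = top_share(capped, 0.05)       # concentration after top coding
total_retained = sum(capped) / sum(spending)  # fraction of total dollars still visible
```

In this toy file, the top-coded version understates both the concentration of spending and the total, which is the loss of analytical capacity being described.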

We have heard a lot about synthetic data this morning. Synthetic data also certainly has an enormous potential to help researchers. But one problem that you have with synthetic data is obviously by its very nature, because it is using a distribution to synthetically impute values and replace existing values with imputed values, it is going to get rid of outliers. If outliers are the ones that are defining what is happening in the economy and what is happening with health care expenditures, as in the example I just gave, you have really affected the utility and the quality of the data.

These issues are particularly exacerbated with linked data. Clearly once you add in admin records, they are wonderful because they add so much richness to the data, but you immediately have a much increased risk of reidentification. The more information you get on people, the more likely you are to be able to reidentify someone, and then you go to jail, which is not a very good thing.
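One way to see why each added variable raises reidentification risk is to count how many records become unique on the combination of variables held. The sketch below is illustrative only; the field names and records are hypothetical.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears only once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Hypothetical linked file: survey fields plus one field from an admin record.
records = [
    {"age_band": "60-69", "state": "MD", "sex": "F", "employer_size": "small"},
    {"age_band": "60-69", "state": "MD", "sex": "F", "employer_size": "large"},
    {"age_band": "60-69", "state": "MD", "sex": "M", "employer_size": "small"},
    {"age_band": "60-69", "state": "MD", "sex": "M", "employer_size": "small"},
]

survey_only = unique_fraction(records, ["age_band", "state", "sex"])
with_admin = unique_fraction(records, ["age_band", "state", "sex", "employer_size"])
```

Here no record is unique on the survey fields alone, but half of them become unique once the administrative field is added.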

Then the other thing, and this was also pointed out this morning, admin records are often received from enforcement agencies. So as soon as you link in the survey data which are protected, and you have got to protect the confidentiality of the respondents, once you add in admin data, the enforcement agencies retain those, and they can very easily reidentify the individual.

So that is a problem.

I wanted to take a little bit of a detour and just give you a sense of the impact of top coding, something as simple as top coding on the quality of analysis that is done. What I want you to think about is that the data utility is going to be a function not only of the quality of the data, but also of the number of researchers that are going to look at it.

Public use files have the advantage that lots of people have access, but the quality of the data isn't very good. So what happens here, here is an example of an earnings regression, where earnings have been top coded. This is on the current population survey. You want to look at the effect of the black-white earnings differential over time. Because it was top coded at different levels over time, the impact of the estimate on the returns on the black-white earnings differential are quite different. The same thing in terms of the returns to education.

I am not going to go through it in any detail, but basically the black-white earnings differential, depending on how you correct for that top coding, how you statistically correct for the top coding, you might say that the black-white earnings differential is .35 log points or .63 log points. If you correct in different ways, you might say that the change in the gap was .06 or .15. Those are huge differences, right?
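The sensitivity to the top-code level can be sketched with a toy calculation. The earnings values and caps below are hypothetical and are not the Current Population Survey figures discussed above; the sketch only shows that the same underlying data yield different log-earnings gaps under different caps.

```python
import math


def mean_log_gap(group_a, group_b, cap=None):
    """Difference in mean log earnings between two groups, optionally top coded."""
    def mean_log(values):
        capped = [min(v, cap) if cap is not None else v for v in values]
        return sum(math.log(v) for v in capped) / len(capped)
    return mean_log(group_a) - mean_log(group_b)


# Hypothetical earnings; only group_a has a very high earner.
group_a = [20_000, 40_000, 150_000]
group_b = [20_000, 30_000, 40_000]

gap_true = mean_log_gap(group_a, group_b)                   # no top coding
gap_cap_100k = mean_log_gap(group_a, group_b, cap=100_000)  # one cap level
gap_cap_50k = mean_log_gap(group_a, group_b, cap=50_000)    # a lower cap level
```

If the cap changes from year to year, part of any measured change in the gap reflects the cap rather than the labor market.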

So it is left to policy makers to decide: is the racial earnings gap big, or still big but twice the size? Is the earnings gap closing rapidly or is it closing slowly? And because of the increased noise in the data, which is by definition what they do to protect public use data files, because that biases coefficients down, do you know whether a perceived vanishing of let's say a racial discrimination coefficient or a dummy on race, do you know whether that is really the case in the underlying data or is it just that more noise has been added all the time? The same thing happens when you look at the return to education, which is obviously another policy area.
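The downward bias from added noise can be sketched with a small simulation. This assumes the textbook case in which independent noise is added to the explanatory variable (classical measurement error); the true slope of 2.0, the noise variance, and the sample size are all arbitrary choices for illustration.

```python
import random


def ols_slope(x, y):
    """Slope from a one-variable least squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

x = [rng.gauss(0, 1) for _ in range(n)]              # true regressor, variance 1
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]         # outcome with true slope 2.0
x_noisy = [xi + rng.gauss(0, 1) for xi in x]         # regressor after adding noise, variance 1

slope_clean = ols_slope(x, y)        # close to the true slope
slope_noisy = ols_slope(x_noisy, y)  # shrunk toward zero by roughly var(x) / (var(x) + var(noise))
```

With equal signal and noise variances the estimated slope is roughly halved, so a coefficient can appear to shrink simply because more noise was added.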

So those are two labor economics issues, but you can see how the same issues would come up in health. So here we have a situation where lots of people have access to data, but you don't know what is happening to the quality of the data over time, and you don't know what is done to it. Typically the agency can't tell you what is done to it, because they don't want people to back out their disclosure protection techniques.

So one alternative approach is to say, let's have really high quality data at the Census Bureau or at research data centers. Then what happens is, people can go into the research data centers and they get the good data, and they can do the analysis on that.

I have to say, I am a big supporter of any access in general, so the line that you picked up I think was a little bit taken out of context. But I wanted to make it clear that there were issues with this approach as well.

For those of you who are not aware of it, researchers and many statistical institutes throughout the world picked up this. The Census Bureau pioneered it. Researchers physically go, they are monitored by employees. This is supported. It is a very expensive program. I think the Census Bureau alone puts in about three million a year for eight or nine sites.

There is an elaborate project approval process. By law all these projects must provide a benefit to Census Bureau programs. They are required to have special sworn status and there are penalties if they reidentify individuals. There is all kinds of access constraints. I'm not going to go through those, since they are in your handout.

As you eloquently pointed out, there are lots of problems associated with that. You have got very high quality data, but the N of researchers that access the data is very small. Furthermore, one of the issues that you run into is that by and large at least at NSF, what we found is that the absolutely top quality researchers would not go to the research data centers because they had other things to do with their time. So the best researchers are not doing the analyses.

Danish data in Statistics Denmark, for example, in this particular case they set up a very nice remote access system. As Brad said, this procedure is expensive, very fragile and very tenuous. One of the issues is the length of the review process. Ron Prevost and Sally mentioned the work that they are doing with Bob Goerge at Chapin Hall. That has taken two and a half years, and they are still not approved yet. Similar things that we had with NSF-supported researchers take a very long time to get through. Then it is expensive in terms of time and in terms of money.

The panel was concerned because of disparate use, so you have got well-endowed institutions who can afford to set up the research data centers, but if you are a University of West Virginia or if you are Iowa State or in Oklahoma, it is much more difficult for you to have access, so there was a disparate impact. Then the concern was no remote access.

As a researcher I was concerned about this. My colleague within the Census Bureau, Pat Doyle, was very, very concerned about the impact on the data quality, the quality of the research and the quality of the policy analysis that was done. Ironically she was focused very much on the SIPP.

At her memorial session at the ASA meetings, they asked me to write a paper on it. What I wrote down was many of Pat's ideas, except informed by some of my time at the National Science Foundation. What I thought was, why don't we -- given that we know there are all these problems, Pat and I co-edited a book in 2001 in which we said this is a real issue. It is not just for health, it is for just about every area in social sciences. This is a major issue. How might one go about thinking about how to structure this?

Let's think about it in terms of learning from other disciplines. I got put in charge of doing the cyber infrastructure initiative for social behavioral and economic sciences, so I met lots of computer scientists and engineers and so on. It turns out that obviously in other areas of NSF, for example in the computer science area, there is this whole research funding associated with cyber trust. That is funded very much by agencies like DARPA.

What they are worried about is setting up secure computer access for extraordinarily sensitive information. Let me give you an example of how that is applied. When you are Joint Chiefs of Staff for example at DoD, you need to think about troop movements in Iraq. They don't go to a research data center to look at the data. They log online with tight protocols to access the data. They have gone through training and so on, but the cyber trust initiative has invested substantially on setting up secure protocols whereby the Joint Chiefs of Staff for example can access troop movements online.

I should say that the cyber trust doesn't just look at the physical computer security. Any computer scientist, everyone I talked to at NSF, would say anyone can hack into a system. So what they also tried to do was to set up human protocols as well, so figure out adoptable protocols. You are going to be surprised to hear that computer scientists worry about this, but they learned -- you know when they tell you to make up a password, they say you have to use some exclamation marks and some numbers and some capital letters and some small case letters. So you are making up this completely unmemorizable password. So you write it down and you stick it to your computer.

So thinking out the humanly adoptable protocols as well as the computer science protocols is a very large portion of what they are worrying about, and they are funding that. Joan Feigenbaum, who many of you might know at Yale, has been heavily involved in the PORTIA project. There are lots of commercial applications. Financial services, very sensitive financial information gets accessed through the web. People don't physically go on site to look at very, very sensitive financial information. They have figured out how to solve these problems.

So thinking about, can we think a little bit out of the box here and think of rather than just one magic bullet, let's think about a portfolio approach. Let's think about setting up a set of computer protections. At DoD for example there are actually three levels of computer protections. The cyber trust people can tell you more about that. Try and minimize the amount of statistical protection, take off obvious identifiers, but try not to muck around with the quality of the data too much unless you can document what the impact is on the quality of the analysis.

This is going to work differently for different agencies. Particularly with the Census Bureau we have got Title 13, public use or not public use. But you can have degrees of statistical protection, the law says within a reasonable doubt. So you adjust the statistical protection by adjusting the screening which you have for people coming in to use the data.

Kathy Wallman asked me to go and talk to the Conference of European Statisticians a couple of years ago because they were worried about confidentiality protection. So I did the opening speech. There were about a hundred chief statisticians from around the world, and over and over again they said, we will give the data to people we will trust. In other words, there is a very big difference between giving a lot of people access, the great unwashed, and giving access to a subset of people who have gone through a screening process, who maybe have an institutional bond and so on. So think about putting those sets of requirements together.

Having been and still am a researcher, instead of assuming that researchers know how to protect data, actually go through an intensive training class. Thinking back on how I treated data before I went to the Census Bureau, I don't want to tell you about it because I know this is being taped. I never did anything wrong. But when you go to the Census Bureau and you work with statistical agencies, there is very clearly a culture of confidentiality that gets inculcated in you. So figuring out ways to train researchers so that they understand that this is part of the mandate.

In other words, think about setting this portfolio approach up. What you might have is multiple access modalities. You might have public use files and synthetic data so you can get all your code run and muck around a little bit. Then if you work through the remote access in a subset of cases, and then if you wanted to go on site and work from the research data centers, go on site and do that.

Instead of thinking about the silver bullet, one approach, think about an integrated approach, legal, statistical, operational and educational, and think about as we were saying having a consortium of agencies as Ron and Sally and Gerry said. This shouldn't have to be invented de novo by each agency.

You might think about developing a set of legal options, a set of statistical options, a set of operational and a set of educational options, and then different agencies and different studies within agencies. For example, the CPS within the Census Bureau will have a different set of rules in the annual survey of manufacturers, which is business data. So you might have different sets of options that are chosen.

You could think about how the remote access might work. Instead of having the event where you physically have to go on site, use 21st century technology and think about encrypted connections and smart cards. You can restrict user access from specified predefined IP addresses. You can have something like Citrix technology, which is becoming increasingly used, or you could have your own developed approaches.
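Restricting access to predefined IP addresses, as mentioned above, can be sketched as a simple allow-list check. This is a minimal illustration and not any agency's implementation; the networks listed are reserved documentation addresses standing in for a hypothetical set of approved institutions.

```python
import ipaddress

# Hypothetical allow list: one approved campus network and one approved workstation.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_allowed(address):
    """Return True if `address` falls inside any approved network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ALLOWED_NETWORKS)


on_campus = is_allowed("192.0.2.45")     # inside the approved /24
workstation = is_allowed("198.51.100.17")  # the single approved host
elsewhere = is_allowed("203.0.113.9")    # not on the list
```

In the portfolio approach described here, a check like this would be only one layer, sitting alongside encrypted connections, smart cards, and screening of the users themselves.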

For statistical protection you might remove obvious identifiers. You limit access to the data that is approved and you could have the statistical techniques chosen by the agency.

I strongly believe in training researchers. I think it makes a big difference. The Title 13-Title 26 IT security training that they do at the Census Bureau is very good, but that is web based. You would probably need maybe a two-day training class before people are allowed to access the data, and then maybe refresher things. But just basic principles of confidentiality, why is this done, basic information about the laws governing the agency and why those laws work.

So I really do think that -- it has been five years since Pat and I edited that book on disclosure protection by statistical agencies. I don't see that there has been much new that has been generated since then. Yet the pressures are increasing. The pressures for high quality analytical work, the enormous promise of matching administrative and survey data is there. We have the potential to do what I think is needed to be able to understand what is going on in different parts of the economy. I would suggest that we might want to think about using both the cyber trust and human cyber infrastructure. The San Diego Super Computer Center is doing some work, we are doing some work, and so on.

So now I will shut up.

DR. STEINWACHS: Julia, thank you very much. I think you may have at most five minutes to answer any questions or comments. So do people have questions or comments for Julia?

DR. BREEN: You mentioned the portfolio approach. By that, do you mean an array of different kinds of things that amongst them would provide protection for the data and appropriate training and all of that stuff. It is a concept, isn't it?

DR. LANE: Yes. Because public use files did so well for so long, we have been very focused on statistical protection for public use type data sets. What I am saying is that the problem of public use all by itself is that you have to protect against any onslaught. So what you need to do is to think about a portfolio where you don't just use statistical protections, you also think about limiting access to authorized researchers, setting up computer security protocols like Statistics Denmark has done very, very successfully, so that people can't hack into the data, and you limit their ability to reidentify the data by matching to information that is on the Internet, and legal protocols so that if they do go ahead and try to reidentify, you have got some legal framework within which to prosecute them. There is an institutionally binding agreement.

DR. STEUERLE: Julia, thanks again for doing this. You are one of the many stars we have here, and I really appreciate your coming.

I have two questions that are related to the ones I asked earlier. In your experience at Census, are there really advocates within agencies for trying to promote the research? I'm not saying researchers that want to do it, but I'm thinking in the legal side, advocates who take the public good interest?

It seems to me that when you get down to the ultimate legal question, even when I go through your matrix, somewhere, sometime, somehow, there is somebody who could be identified. If I tell you eight characteristics down to the letter that are observation in my data file and you can match that up and find the ninth one, there is probably no circumstance at which at some level somebody couldn't be identified.

So the incentive of the legal community, if you are given the question of, should you allow this to happen, is almost always going to be against trying to figure out how to provide access, no matter to whom, even though we know that at another level, a government worker could lose a veteran's file. So the chance of something happening is not necessarily related to people doing research. So I am just curious, how to get around this.

DR. LANE: In most agencies it says reasonable means. So the big question is whether a portfolio approach would match anyone's definition of reasonable means, and I would argue that it would.

I am not a statistician, we have got Tom Petska here who knows far more than I, but my reaction would be, the law is reasonably clear as to what the purposes are for which you can use the data. So you have to fit within each agency's mandate. It has to be authorized purpose. Without that you are breaking the law, so you don't want to go to jail.

The question is the degree to which -- the interpretation that is put on that. You raised the very good point earlier that a very narrow approach would just be, you can only count. A broader approach is to -- and if you take a look at the IRS Census criteria agreement, there are nine categories, and some of them are quite broadly written, under which you can do research as long as its predominant purpose is to improve economic and demographic census and population estimates.

So with a broad view, I think a lot of work can be done, but you still have to find that authorized purpose. Is that a fair statement, Tom?

DR. STEINWACHS: My big job is time watcher, so I worry about you and the plane. Do you think this is a good time?

DR. LANE: I think I had better run.

DR. STEINWACHS: Again, thank you very much.

DR. LANE: Thank you very much for inviting me.

DR. STEINWACHS: I am going to ask Gene Steuerle to introduce the next panel.

DR. STEUERLE: We have a stellar lineup from our own Health and Human Services agency, with whom we work closely in the National Committee on Vital and Health Statistics. I have to confess, I know a couple of you but I don't know all of you, so I am going to be reading your titles, I apologize.

Jennifer Madans, Associate Director for Science. David Gibson of CMS. Steve Cohen, the Director of the Center for Financing Access and Cost Trends at AHRQ. This is going to be our first panel. Our second panel, depending on how long we take to get to the presentations, is going to be Martin Brown, Chief of the Health Services and Economics Branch at NCI; Gerald Riley is a social science research analyst at CMS. John Drabek is an economist with the Office of the Assistant Secretary for Planning and Evaluation, DHHS.

We are going to start with Jennifer and David and Christine.

Agenda Item: Health and Human Services Agencies

Jennifer Madans, Associate Director for Science, National Center for Health Statistics

DR. MADANS: We'll start with me. I am going to start this and then hand it over to Chris who runs a linkage unit, and then pick up at the end on some of the policy issues.

What we are going to do is just briefly go over the NCHS data linkage programs, past, present and future, look at our current access procedures, some of the challenges that we see in conducting linkages now and in the future, and maybe some ideas about how to solve them.

Obviously we are doing record linkage the same reason everyone else is doing record linkage. It certainly increases the accuracy and the detail of the data that we can collect.

We generally use it to augment the information that we collect on our major surveys. One of the early reasons for doing the linkages was to longitudinalize what are really cross-sectional surveys. By doing the linkages we can follow people over time without spending the money to recontact them. So it reduces the burden on the respondent, it is cheaper, and it should get us better information. So I think we are all doing it for basically the same very good reasons.

Just to give you an idea of some of the potential that has come out of our record linkages, this is just a handful of examples, a lot of them having to do with linking our population based surveys with morbidity and mortality outcome. We can talk about some of these databases that these studies are based on.

We do two types of record linkage. I think most of the time we are talking about the first type here, and that is what we are going to focus on for this talk, but I did want to mention the second one very briefly. This is where we are linking at the person level or in some of our provider based surveys at the facility level. So we find some external data source that generally is a census of some kind, whether it is the Census, which we don't link to, or all of the CMS records or all of the information from the American Hospital Association, where we can do a one to one link with our unit record on our survey, which is the person or the facility, with a record or set of records on the external file. We have done this at the person level and as I say at the facility level.

But a lot of our linkages deal with contextual data. All of our surveys are geocoded, and we do a large amount of linkage of geographic data from a variety of sources with these population based studies and also the facility based studies. So this is where we might use census tract characteristics from the Census. We can link those into the population based surveys. We have data from EPA. We get information at the state level on things like Medicaid eligibility requirements. So it is not at the individual level, but it is more the contextual or geographic level.
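The contextual kind of linkage described above amounts to attaching area-level characteristics to each survey record by its geocode. A minimal sketch follows; the tract codes, field names, and values are all hypothetical.

```python
def attach_context(survey_records, tract_table, key="tract"):
    """Copy area-level fields onto each survey record that shares the geocode."""
    linked = []
    for record in survey_records:
        context = tract_table.get(record[key], {})  # empty if the tract is not in the table
        linked.append({**record, **context})
    return linked


# Hypothetical geocoded survey records.
survey_records = [
    {"person_id": 1, "tract": "T001", "has_asthma": True},
    {"person_id": 2, "tract": "T002", "has_asthma": False},
]

# Hypothetical tract-level characteristics from an external source.
tract_table = {
    "T001": {"median_income": 41_000, "pct_poverty": 18.5},
    "T002": {"median_income": 67_000, "pct_poverty": 6.2},
}

linked = attach_context(survey_records, tract_table)
```

Because many respondents share a tract, this is a many-to-one link, unlike the one-to-one person or facility links described earlier.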

A lot of the same issues in terms of access and identifiability that you have on the first kind of linkage also apply to the second kind of linkage, but the examples we give for today will be at the person level.

Now I'll turn it over to Chris.

Agenda Item: Christine Cox

DR. COX: The NCHS record linkage program got its start back in the early 1980s, which is about when Jennifer and I arrived at NCHS, if that is okay to say. We had both just graduated from college.

We began bringing together survey data with health care utilization data by linking the national medical care utilization expenditure survey to Medicare MADARS data, the Medicare Automated Data Retrieval System, for those of you who have been around for a very long time as well. We did the same thing with the epidemiologic followup study to the first national health and nutrition examination survey.

We were pretty much pioneers in linking to the national death index with the NHEFS survey. In fact that survey had respondent verification through either proxy interviews or death certificate collection, so it has become what we call at NCHS a truth data set, and we use it to develop our probabilistic record linkage algorithms.

We expanded the record linkage program in the 1990s to include a routinized approach to linking the national health interview survey to the national death index to collect mortality data, to allow for as Jennifer said the longitudinal study of our cross-sectional data.

Then around 2000 NCHS developed an organization -- reorganized and created a data linkage unit where these services were centralized and systematized, so we could gain from the expertise in that unit and conduct most of our record linkages.

We were very productive. As of the year 2006 we have linked a variety of our large national public health surveys to three focus areas, mortality data collected from the national death index, Medicare, CMS data, and that would be denominator and the standard analytic file data covering the period 1991 through 2000, and retirement and disability benefits data from the Social Security Administration. That actually goes from 1962 through 2003. So it takes data back before the survey contact and extends the period of followup beyond the survey.

The surveys we were able to link to, for those of you who aren't speaking NCHS acronym today, that is the national health interview survey, the longitudinal study of aging, the first, second and third national health and nutrition examination surveys and the 1985 national nursing home survey, which is a facility survey.

We have completed other data linkages. We link infant death records to the birth certificates to allow for study of birth characteristics of infants who die within the first year of life. We link information on facility characteristics from the American Hospital Association annual survey of hospitals to the national hospital discharge survey.

We are trying to develop an ongoing program. These agreements we think we have mostly standardized, if such a thing exists. It is pretty easy to standardize an agreement to match with the national death index outside NCHS. We take it easy on each other, we only filed about 30 forms back and forth to share data within the organization. But we intend to keep linking the HIS from 1986 through our current year, which is about year 2005 available at the moment, to NDI data. Our HANES surveys are now yearly surveys, so we will continue linking them, and our national nursing home surveys will be linked to mortality information. That is cause of death and date of death data.

We also will continue hopefully linking to Medicare enrollment and utilization data from CMS. Right now we do have the HIS 1994 through 1998 link to that data. We hope to extend that through more current years of HIS, and the same for the HANES data.

We would like to start adding the national nursing home surveys to that agreement. And of course we will continue to link birth and infant death files.

We plan lots of interesting linkages. I was congratulating Ron and Sally earlier; I think we are about to have a one-year anniversary on the negotiation of an agreement to add HIS to the CMS Medicaid project that they discussed earlier. We are nearing that one-year mark, on trying to get that agreement finalized between our two agencies to share that data, so we are very excited. We think it will probably happen this year.

We also intend to be linking the 2004 national nursing home survey to the CMS minimum data set, and that will allow us to pick up characteristics of the facilities the residents are staying in, as well as characteristics of the residents in those facilities. We hope to be linking the current NHANES to food assistance programs such as food stamps and other food assistance programs. We'll see.

We don't only do record linkage. We also spend a fair amount of resources developing user tools, documentation, methodologic reports, where we describe how the linkage took place and what kind of probabilistic matching algorithms we use. We conduct bias analyses so that when researchers use this data they understand what the limitations are and where we are not matching people and what kind of people we won't have linked data for. We also continually evaluate and try and improve our record matching algorithms to make them just a little tighter and reduce type one and type two errors.
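The NCHS matching algorithms themselves are not described here. As a general illustration of probabilistic record linkage, the sketch below uses the standard Fellegi-Sunter style of scoring, in which each field adds a weight depending on whether it agrees. The fields, the m and u probabilities, and the two thresholds are all hypothetical.

```python
import math

# Hypothetical (m, u) per field: m = P(agree | true match), u = P(agree | non-match).
FIELDS = {
    "ssn": (0.95, 0.001),
    "last_name": (0.90, 0.01),
    "birth_year": (0.85, 0.02),
    "sex": (0.98, 0.5),
}


def match_weight(survey_rec, admin_rec):
    """Sum of log2 agreement or disagreement weights across the compared fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if survey_rec[field] == admin_rec[field]:
            total += math.log2(m / u)              # agreement adds evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts evidence
    return total


def classify(weight, upper=10.0, lower=0.0):
    """Assign a pair to link, non-link, or review using two hypothetical cutoffs."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "clerical review"


survey = {"ssn": "123", "last_name": "DOE", "birth_year": 1940, "sex": "F"}
admin_same = dict(survey)
admin_ssn_typo = {**survey, "ssn": "132"}
admin_other = {"ssn": "999", "last_name": "ROE", "birth_year": 1955, "sex": "M"}
```

Raising the upper cutoff reduces false links at the cost of more missed matches, which is the trade-off between the two error types mentioned above.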

I'm turning it back over to Jennifer.

DR. MADANS: As you can tell, there are no problems with data access for a linked file.

I'm sorry Julia is not here, because we were having a conversation in the hall about this remote access system. I'll talk about ours just very briefly. We have the same kind of confidentiality concerns that the folks in the Census Bureau discussed this morning. As everyone else has mentioned, our general files are becoming harder and harder to release through public access because of not only the kind of information we collect, but the kind of information everybody else is putting out on the web. So we are having a reduction in what we can put out as public use. But the linked files are particularly vulnerable and we worry about them more than we worry about many of the other files.

We try to put out as much as we can on public use. There is the problem of reidentification by the people who gave us the data, which is an added problem, but we know that there will be files that are not appropriate for public use. This is the kind of thing where we make an announcement, the data are on the web. We have no restrictions. There is no control, there is nothing.

We also have a data center. There are three means of access in our data center. We only have one data center in Hyattsville, so anything that you can say about the nine that the Census Bureau has, it is much worse when you only have one.

I will report, I can announce that after many years of negotiation we have reached an agreement with the Census Bureau to as a pilot project allow access to our data in their data centers. So we feel like we have expanded access nine times over what we had before so it is very discouraging to hear that nobody likes their nine any more than they like our one. But it is still better. I agree that there are problems with data centers, but they aren't going to go away, and I think we have to make them work better.

Our data center like the Census data center does have on site access. We have a nice little room in NCHS. You have to go through two guards as well, and it has all the other operational characteristics that the Census Bureau's have.

We also started a remote system when we started our data center in ‘98. This I would call a first generation remote system. It is automated, so it is remote and automated. It is called ANDRE, and I never remember what that stands for. You submit programs through e-mail. Those are collected by the system, the system evaluates them for illegal operations. You can't do things like list cases, and it won't run them if it sees something illegal. It puts them through a batch processing. The output is again looked at in an automated way to see if there is any disclosure risk for what is sent back, and if there is none it sends the output to the person requesting it.
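
The workflow described here (screen the submitted program, run it in batch, then screen the output before release) can be sketched roughly as follows. This is an illustrative outline only, not the actual ANDRE implementation: the banned patterns, the `MIN_CELL` threshold, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical patterns a remote system might refuse to run, such as
# statements that list individual cases. These are NOT the real ANDRE rules.
BANNED_PATTERNS = [r"\bproc\s+print\b", r"\blist\b", r"\bput\s+_all_\b"]

# Assumed minimum cell count for released tables; real thresholds are policy.
MIN_CELL = 5


def screen_program(source: str) -> list[str]:
    """Return the banned patterns found in a submitted program."""
    lowered = source.lower()
    return [p for p in BANNED_PATTERNS if re.search(p, lowered)]


def screen_output(cells: dict[str, int]) -> list[str]:
    """Return labels of output cells that are nonzero but below MIN_CELL."""
    return [label for label, n in cells.items() if 0 < n < MIN_CELL]


def handle_submission(source, run):
    """Screen the program, run it, then screen the output before release.

    `run` stands in for the batch processor: it takes the program text and
    returns a mapping of table-cell labels to counts.
    """
    violations = screen_program(source)
    if violations:
        return {"status": "rejected", "reason": violations}
    cells = run(source)
    risky = screen_output(cells)
    if risky:
        return {"status": "withheld", "reason": risky}
    return {"status": "released", "output": cells}
```

A submission is rejected before it runs if it contains a banned statement, and output is withheld if any cell is too small. A real system would need far richer checks at both stages.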

A lot of the things Julia was talking about in terms of the front end are very similar. You have to be a registered user. We will only send it to certain e-mail addresses, all of those things. For our system we want to get to the second generation, have a web-based system. We have to worry about the firewalls and all of those things.

What she didn't talk about, and I was curious to hear her talk about, is, for us the real challenge is not the front end of that system, it is the back end of that system. It is how do you do the appropriate kind of disclosure in that kind of environment. We have a lot more control when somebody is sitting in our facility. We have a lot less control when they are across the country. The response of most of the systems is that they let you do less remotely than they will let you do if you are actually sitting there.

One does have to worry about things, like how do you deal with multiple submissions of programs, where each one is fine but together they are not. You have some of this when you are on site as well, but it is not as bad as when it is remote. If I can do my commercial here, I'm not sure these problems are insurmountable, but I am sure that we haven't put the kind of resources that we need to into solving them.
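
The multiple-submission problem can be illustrated with a small sketch. Two outputs each pass a minimum-cell-size rule on their own, yet subtracting one from the other isolates a very small group. The threshold and the counts below are made up for illustration, and the function names are hypothetical.

```python
MIN_CELL = 5  # assumed minimum releasable cell count


def passes_alone(count: int) -> bool:
    """A single released count is acceptable if it meets the threshold."""
    return count >= MIN_CELL


def differencing_risk(count_a: int, count_b: int) -> bool:
    """True if the difference of two released counts is a small nonzero cell."""
    diff = abs(count_a - count_b)
    return 0 < diff < MIN_CELL


# Hypothetical counts from two separate submissions on the same file.
all_adults = 120        # e.g. all adults in some area
adults_under_85 = 118   # same area, restricted to people under 85

each_ok = passes_alone(all_adults) and passes_alone(adults_under_85)
together_risky = differencing_risk(all_adults, adults_under_85)
```

Here each count is releasable by itself, but together they reveal that only two people fall in the excluded group. That is why an automated reviewer would have to keep track of what has already been released to a user.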

My other favorite soapbox is that we are reaching a point where we spend a lot of money on data collection, but we missed a balance in terms of resources for data collection versus data dissemination. The questions that we are being faced with now in terms of dissemination are going to require some more funding than we have been giving them in the past, and unfortunately if you have a flat budget, if you are lucky enough to have a flat budget, if you are spending more on dissemination you are not collecting as much data as you were before.

I think our users have to understand there are tradeoffs here. We can make it easier to get access, but there will be less to have access to. That is a fact. The question is, what do you cut, and that leads to many, many conversations.

Let me also say about data centers, I think all of us who are running these data centers do understand some of the frustrations that our users are facing, and are trying to solve some of those, especially the front-end stuff, getting proposals approved, facilitating peoples' access as they wander the maze of the data center.

I sometimes say we are going to get a Walmart greeter. Everyone will be assigned a person whose job it is to make their stay in our data center a happy one. I think we are all moving towards that. It will solve some of the problems. It certainly won't solve all of them. You still have to come to Hyattsville or one of the other seven data centers. So I personally think a remote system is going to be more beneficial, but it is going to be a lot more expensive to develop.

We also have staff in our RDC that can provide on-site programming and help people out, but I know most researchers don't like that, like that the least, because they want to get down with the data, they want to be able to play with it and see what is happening. But it is an option. So I think we are beginning to get that portfolio that Julia was talking about, different ways of accessing data.

I think the main theme we are worrying about is the disclosure review, and how do we do that in a way where -- to answer your question, Gene, I think there is an issue of due diligence. I think if you force any of us into a corner, we will agree that we can't one hundred percent protect, although that is what we try to do, but when push comes to shove, we have to be comfortable that we did everything that we could, and that an outside person looking at what was done to try to protect the data will say that we exercised that due diligence. Obviously it is a judgment call, but I think that is where we are comfortable in saying that we have done what we can.

So where are the challenges? These have also been mentioned. First we have to get the informed consent. We have to get the informed consent from the person who is giving us the data. So we have to meet their needs, and we have to do that in a way that will satisfy NCHS' institutional requirements for permission to link, what do we need to get from the person so that we can go to our IRB and make a case that we should be able to do this linkage. We need to satisfy the provider's institutional requirements for permission to link, and we have to be able to get that informed consent from the -- in our case, from the subject whether it is a person or a facility, where we on the one hand can tell them about the importance, also be fair about what the risks are and how we are going to protect them.

So those are hard things to do. One of our concerns is to get a more systematized way of doing that in a way that will be acceptable to the community. Up until now we have all been doing this on our own. There is not as much collaboration among the agencies, both the providers and the receivers of these administrative data, as there could be. Every time we start a new linkage, it is as if we have never done a link before. We start out from square one; do you have the right permission, is it the right permission for us. To get the lawyers involved is always a bad thing. So we really need a better way of working this out.

This just shows you what is happening to our ability to get adults to give us their social security number, similar to Census Bureau. If we got the number, we took that as approval to link, and it is going down. There was a change in how we did this in 2001, but even at 41 percent which is the better number, that is not enough to have a viable data set. So using this as a mechanism for getting permission isn't going to get us very far. We are also thinking of asking a more explicit question about linkage, because most people don't want to give you the social security number when they don't want to link. But how one does that not only so that we will get the number, but that we are doing it in a way that is protective of the subject and acknowledges their rights not to be linked. These surveys are not mandatory, so we have to provide the true informed consent and get their permission.

As has been mentioned many times, the institutional requirements among those of us who are receiving and providing data are very complex. We have some differences in our privacy requirements and how our privacy requirements are interpreted, the legislative mandates. Two years, we don't see that as odd. It takes a long time to develop these, but once you develop them you can extend them. But it would make life a lot easier for everyone if this was done in a more collaborative way.

Again, the balancing of the resources. These linked files are fairly expensive to produce, to provide the right user documentation. They often don't come in a way that is easy to analyze. You have to transform them into analytical data as opposed to administrative data. You have to tell people what is good and bad about them and what mistakes they can make. We sometimes don't have the right expertise to do that, so we have to work closely with the provider. Then what kind of resources are appropriate to put into assisting users with these kinds of data versus other things we are doing, like collecting new data.

We try again to create the public files from the linked data. It is very difficult. Once you create a file, when we get data for example from CMS, and if we create a file from that data that CMS can't take and reidentify, there is not much left on that file, but there are some things, and we need input from users on what that something should be. We are making decisions about what to take off and what to leave on and how do we satisfy the most use.

If you can give someone a public use file, that may go a long way to making their stay in the data center much more pleasant, because they have experience with the file. We also use sworn agent status, and we are trying to think of ways that we can use that authority for different kinds of data to use different kinds of access mechanisms. We are also looking at whether or not we can use perturbation on these linked files. I think we are also looking at trying to get synthetic files.

So again, we really want to share the knowledge and experience across agencies so that we are not reinventing the wheel, and we can learn from each other's experiences.

The standardization should increase efficiency. We may be able to do it faster, it will be cheaper. It might help us do the user documentation if the providers are thinking about possible other uses for their data. We saw big changes in the CMS data when we first started linking back in the ‘80s. I remember, we couldn't figure out how to read the file, the first one we got, to now when things are much, much easier. And again, the development of standards and best practices for linking, for data handling and how to get the extracts and the documentation.

We would like to increase collaboration and communication among agencies. I think this has to be done, maybe under OMB, I don't know, but there has to be some way of allowing us to work together without violating certain ways of interpreting our authorizing legislation. We have to find safe havens where certain things can be shared in the development.

We want to develop more linkage projects. If we see a file that is Census, we are going to try to incorporate it into our data systems. I think if we can expand the access to RDCs through the development of new disclosure methodologies so that we are more comfortable in using remote access systems, we have more control over being able to evaluate whether something is a risk, I think that would go a long way into improving user access.

Thank you.

DR. STEINWACHS: One way might be to have a few questions now or comments, and -- it's up to you, Gene.

DR. STEUERLE: Let's go ahead and take some questions now. Then we will have the next set of speakers, then we will have questions there. Somewhere in between we need to take a break.

MS. TUREK: I'm just curious, on the re-disclosure, why can't you have an agreement with them that they won't do that? That would seem to be relatively straightforward.

DR. MADANS: I think all of us would interpret our requirements that we can't allow a re-disclosure to take place, because if we were to do that, we could let anybody who would say they wouldn't try to re-disclose have privacy access to data.

MS. TUREK: But presumably the only person that could do it would be the agency that supplied it, right?

DR. MADANS: That particular file. But we have many data files. The only way you could re-disclose it -- most of our files, this may not be true for Census, on their own, if we took out the names and the addresses, those straight identifiers, when we get them out of the field they are not identifiable. They are only identifiable if you link them to something else that is in either the public or private domain.

In the case of a university user, we don't give them our linked file with the NDI, because they would be able to do re-disclosure. But if they signed something that says we won't do it, it should be the same thing as another agency doing it. If an agency that has programmatic or administrative responsibility did do the re-disclosure, there is harm to the subject. I guess our bottom line is, do no harm to people who have done their civic duty.

I think this is this due diligence thing. If there was a breach, we would feel in that case it would be our fault there is a breach. So whether or not you can say they can have anything, or we won't put out anything, it is what steps can you go through. There are legal requirements that if somebody at CMS would do this that they could be fired, fined, whatever. We try to do statistical perturbation so they can't. But I think it is the responsibility of those of us who have made a pledge to the people allowing us to link their data and giving us sometimes very sensitive data, that we will use a variety of mechanisms to make sure that they are not harmed. We have thought that someone signing a piece of paper that says they won't do it is not sufficient protection for that person.

DR. STEINWACHS: You were mentioning about the declining willingness to give the social security number. Your explanation sounds not surprising. Sally or Ron were talking about a capacity now to match to unique individuals without the social security number. I thought whether or not there was any possibility here of a short cross conversation, is there any way NCHS could use what you have, or is that somehow not in the feasible domain. I can already see, Gerry is ready to take this.

DR. MADANS: We are having that conversation. I don't know. That is one of the things we are trying to work out. We can do similar things and we do link to the NDI without the social security number. We can do that. It is not as good a link, but we don't have access to as much information.

So we started these dialogues, but we are on new ground here, whether or not some of these things are possible. We are hoping that we can do things more jointly. We have lots of joint projects. It does seem a reasonable thing to do. So maybe next year when you have this we will have signed all these things and can move forward.

But I think it is important for us to do it in a proper way. So we try to work out what are we asking for, what are the problems, what are some of the reasons you wouldn't want to do it, and then have an open conversation.

DR. STEUERLE: A lot of your comments on disclosure seemed to me to be related to surveys, where you had to get permission or you felt implicit that you have got permission for people to not use the data in certain ways.

But among the other linkages that we are talking about here are linkages of administrative data sets. I can imagine some private researcher going out and gathering some data from hospitals somehow, if there was even such a data set that exists, he wants to link to Medicare data on hospital payments or something like that.

I am thinking there are a lot of pieces of data where we have never actually gone to the public and asked for permission, but we may have gathered the data in some form they followed in order to apply for a program.

There, correct me if I am wrong, but the companies uses a different one. It is what we interpret the law as allowing us to do. Are you also saying that there is often an implicit agreement in these administrative data sets that if I filed for whatever it is, if I filed for Medicare, that that data is therefore usable by some other agency who also has administrative data records, that they could be merged?

DR. MADANS: Most of what we are linking are survey data. We have one data set that is registry data, where our relationship is with the states.

I think where this becomes an issue is how is the owner of that administrative file, how did they view their responsibilities to the people who are in their file. That is why we have these very long negotiations. So if we want to link to the Medicare file, we think we are in a very good position, because we ask people and they said we could link to Medicare. But then we go to the agency and they say, we view our requirements as not allowing you to link. We never had that conversation with Medicare, but we have with others. Even though we have an explicit statement that says we are going to link to these kinds of records, and sometimes we even mention the kind of record that we want to link to, their legal interpretation is that they have to get approval, even though we have approval.

So it takes a while, especially when this is the first time it is happening with that particular agency or that file, to work through why do you think -- like IRB discussions; what is the risk, what is the benefit, how can you protect. It takes a while to go through that.

But I think our interpretation has always been, if we have permission, then we should be able to go into an administrative file and pull out the records for that person.

Now, in terms of linkages where it is not based on a survey, you are just taking two administrative files and linking them, we don't do that very much. So I don't know much about it.

DR. STEUERLE: There may be other questions for the group when HHS finishes. There are questions about administrative data sets we don't combine at all, independent of the ones that you are thinking about combining at this point.

Other questions?

DR. BREEN: Jennifer, you said that -- thank you very much for your talk, both of you -- you mentioned that you would put more resources into data dissemination, that you thought that was an area where you could use more resources. But what exactly would you do with the resources? Have you thought about what you would like to do and how you would like to expand that?

You said you were going to increase the number of data access locations to utilize those of the Census Bureau, but knowing you, you probably have some other ideas on what you want to do.

DR. MADANS: I think we should be spending more on this. I don't know how Ed feels about this. For me there needs to be some basic science methodologic development on the disclosure review process. That is where I would put my first $200,000 or whatever it is going to be.

We are committed to trying to use the Census data centers. There are costs associated with that. I think there are costs associated with our own data center, it has been underfunded. So that is two very easy things, to more fully staff the data centers so that the process is simpler, easier for the user, and then to develop these disclosure review processes.

I think making it easier on the front end is a staffing question that is not rocket science. We have some fewer requirements than the Census Bureau does in terms of how do you qualify to use our data center. We have found it very difficult to say people can't use the data center. I think that Julia mentioned something about, we give it to people we trust. That is not a way we can work. We have to have very clear rules about who gets it, and we can't have a lot of judgment about who can access the data if they have a reasonable proposal.

But I think we can fix that. I think that is a staffing and money issue. I think it is an increased knowledge issue to deal with how do you do disclosure review in a faster way.

Then we didn't talk about this, to really use the administrative data and use it in a way so that it really does augment the survey data, you have to figure out a way to get the administrative data faster. That was brought up as well. So that is money that we are not going to spend, but I think money that our users need to get someone to spend for the providers of the data so that they can get it out quicker.

But I think in general, within each of our data divisions we need to put more resources, more staff primarily and more developmental money into doing different products, easier access to the products, more technical assistance. As they get more complicated, people get frustrated because they didn't quite understand how to use them. There are some things I don't know how to solve, some I do. If we had the money we could solve them, they are pretty straightforward. But the thing we don't know so much about is how to do disclosure quickly and efficiently.

DR. SONDIK: I completely agree with that. I think one of the areas where we could really put some more resources is into smarter front ends to the surveys and to the vital statistics, for that matter. I view that as part of dissemination.

I think trying to deal with some of these issues in the RDCs that have come up -- I just want to make one brief comment. I was struck by the prior presentation and this one, but maybe it is a bias on my part, but I had a sense that in the prior presentation there was a sense of, it is useful to disseminate the data, to have the RDCs, but not necessarily the primary purpose of the data collection that was being addressed. Maybe that is a little harsh.

That is not true with NCHS. I think the primary purpose of NCHS is to disseminate the information that we are collecting as effectively as we can. In some sense it is the same problem, but in another sense I think maybe it is a little bit different emphasis. But if we collect NHANES or HIS and we sit on it, then we are not doing our mission.

So the idea of making this data as accessible as possible is really a principal drive for us. Jennifer has said on many occasions that this is our principal mission, and I agree with that. That is in a way why this is so important.

MR. LOCALIO: I am citing here the Institute of Medicine report on expanding access to research data. This is page 12. One of the sets of research that they cite has to do with evidence of increased public concern about privacy and distrust of government, assurances of confidentiality. In other words, if you don't get people participating in these interviews, it becomes a problem.

I just want to tell you, I read this waiting to see a physician. I had to wait a couple of hours. In order to get to see a physician today, you have to give your social security number.

But anyway, one of the things that struck me while I was reading this, in these studies it doesn't appear that people, when they are asked about their attitudes about trust, were asked about the various reasons that concerned them. Are they concerned that the information they are going to give is going to be turned over to the Immigration and Naturalization Service, very valid concern these days, to the Homeland Security Department, to the Internal Revenue Service, to law enforcement? I would say that those are concerns that a lot of people have. Having their information linked, de-identified and then given to some researcher who could care less about who these people are is a very different level of concern.

It seems to me that you have a one size fits all set of guidelines here, that we need to partition. So I would encourage people to do more research, more methodological research on why people feel constrained about participating in interviews, for example, exactly what the reasons are, and then design policies based on those reasons, in terms of whom you give access to data to.

MS. GENTZLER: I wanted to ask Jennifer, you had touched briefly on your experience trying to link with food and nutrition assistance programs, given that they are state administered mostly.

DR. MADANS: We have just started that. There was that recent IOM or somebody's report that said we should link HANES to the food stamp and WIC and a variety of other things. So we have something jointly going with the Census Bureau to try to do a pilot on that. We have been talking with the folks at the Food and Nutrition Service to figure out how to do that link. That is one of the ones where it is going to have to work through some conflicting interpretations about how you do it, and the state involvement and what kind of approvals do you need and who has to give the approvals and where do you get them.

So we have started that, and again, next year we can tell you what happened. I think there are those basic concerns, there are a lot of other concerns with doing something on a state basis. It is not so bad for us because we are only in a certain number of states every year, but you have to negotiate with every state every time you do something; it gets to be a little time consuming.

If we were in all 50 states, this wouldn't probably be a way to do it, HANES isn't in all 50 states every year. But I think that this is again where we are at the beginning. It probably won't take us three years, but it probably will take us a little time to figure out whether or not this kind of linkage is going to work, and the data when it is linked with the survey data is useful.

Again, there is a cost to doing these linkages. The final reason for doing it is, after you do it for a while, is anybody using it, is it helping them. That sometimes is a function of the quality of the data, or not so much the quality, but is it appropriate for the research questions that you can address with that data set. It may be that that is not the case, so why put everyone through the process. But yes, that is one where we are at the beginning.

MS. TUREK: I am struck by the fact that the confidentiality and privacy laws seem to be agency specific. I think my big example of this is, the HIS only puts out range for its total family income question, and you have to go to the research center to get the dollar amount, where on the CPS I can get the dollars of income for detailed sources of income on 20 sources. I don't see where the risk on the two surveys is all that different.

It seems to me that this is a case where some consistency across agencies would also be helpful.

DR. MADANS: I don't disagree with you on that particular issue. I think that there would be consistency.

But it is not that item again. When you make a public use file, it is a tradeoff of where do you want your details. So when those files were created, it was decided because it is a health survey to give up some detail on that item, as opposed to detail on another item. I don't know where the tradeoff was for that, but it is a balancing act.

I think the folks who create the files try to do the balance in the best way they can with their knowledge of what the data are going to be used for, and they try to create the files in a way that they will meet the needs of the greatest number of people, which sometimes means people like you don't get exactly what you need.

But in this particular instance, we have heard you. They are going back to look at that to see if we can change that and would it affect the other items. But when our disclosure review board looks at a file, they look at a whole file. They may go back to the program and say, okay, you can have this or you can have this, but you can't have both. In that decision it might have been where top coding comes.

DR. STEUERLE: I think we are going to have to cut here and go to our next speakers. I have got the list here, but I don't know how many people we have got left from HHS speaking. So just so I know how many presentations we have, can you raise your hands so I can count? Five formal presentations.

Why don't we go then immediately to the next presentations. That would be David Gibson and Steve Cohen.

Agenda Item: David Gibson, CMS, Office of Research Dissemination and Information

MR. GIBSON: My name is David Gibson. I work in CMS Office of Research Dissemination and Information. I am here to replace Spike Duzor, who is the project officer on this particular project that you asked me to speak to you about today. To the extent that I can answer questions, I will certainly try to do so.

First of all, you have to realize that when Medicare and to a large extent Medicaid came into being, we saw our mission as not to do research; it was basically to pay claims. Originally it was oriented to acute care, at least on the Medicare side. It was oriented towards cost based or reasonable charge reimbursement. There was little concern for the impact of administrative decisions on research and how to use the data.

Consequently, we organized our databases in ways that often made it difficult to do research. But as time went on, we saw that we wanted to come up with more rational ways to make payment, for instance, to bundle the payment for a number of services together rather than paying for each service individually. We also saw that we wanted to match the services that we were providing and that we were covering to what the needs of the population were.

We also realized that when we were in this mode of trying to pay claims quickly, that we often made payments when it was inappropriate or even fraudulent. We were often in what they call a pay and chase mode. We made the payment, and then we had to go back and get the dollars, and that was often hard when a home health agency or a DME was using a former BP station as their place of business.

What are some of the barriers that we have in using CMS administrative and other data to study health outcomes? Spike asked me to go through some of the list of items. Many of you probably know many, many more; I don't pretend that these are exhaustive. Then I want to describe to you a database that we are building that we hope will address some of them, and it is not going to be all of them, unfortunately.

One of the big problems that we have in the Medicare and Medicaid systems is a lack of unique identifiers within programs across types of data. The biggest one we have is the health insurance claim number or HIC that we use to identify beneficiaries in the Medicare program. We use an identifier that as most of you know is a combination of a CAN, a claim account number, which is the social security number or an RRB, Railroad Retirement Board number, for a beneficiary, and then a beneficiary identification code that relates that person to the bennie.

Unfortunately, some of these identifiers change over time. You have situations where a wife gets benefits under her husband's account, and over time she may earn enough benefits under her own account to get benefits. She will change the CAN portion of her health insurance claim number.
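
One way to picture the identifier problem is as a two-part key plus a cross-reference from old keys to new ones. The sketch below is hypothetical: the `HIC` tuple follows the CAN-plus-BIC structure described above, but the cross-reference table, the `resolve` function, and all of the values are made up for illustration and do not reflect any actual CMS file layout.

```python
from typing import NamedTuple


class HIC(NamedTuple):
    can: str  # claim account number (an SSN or Railroad Retirement Board number)
    bic: str  # beneficiary identification code


def resolve(hic: HIC, xref: dict[HIC, HIC]) -> HIC:
    """Follow an old-to-new cross-reference until no newer HIC is listed."""
    seen = set()
    while hic in xref and hic not in seen:
        seen.add(hic)          # guard against accidental cycles in the table
        hic = xref[hic]
    return hic


# Made-up example: a beneficiary first listed under a spouse's account later
# becomes entitled on her own account, so the CAN portion changes.
old = HIC(can="111223333", bic="B")
new = HIC(can="444556666", bic="A")
xref = {old: new}

current = resolve(old, xref)
```

Claims filed under either identifier can then be grouped under the resolved value, so the person is not lost when the CAN portion changes.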

This causes problems in following people longitudinally. If you have a situation where you are working with a five percent sample, you already have a fairly small sample. It is robust, but think about if you want to follow a cohort over a long period of time. If one to two percent of the people cross reference out of your five percent sample, think of multiplying .05 times .98 times .98, and you keep seeing the effect of that diminishing of the population that you are interested in.
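
The arithmetic behind that shrinkage is a simple compounding calculation. The sketch below assumes a constant two percent loss per year, the upper end of the range mentioned above; the function name and the choice of years are illustrative.

```python
def retained_fraction(sample_rate: float, annual_loss: float, years: int) -> float:
    """Share of the full population still traceable in the sample after `years`."""
    return sample_rate * (1 - annual_loss) ** years


# Five percent sample, two percent cross-referencing out each year.
after_2_years = retained_fraction(0.05, 0.02, 2)    # .05 * .98 * .98
after_10_years = retained_fraction(0.05, 0.02, 10)
```

After two years the sample covers about 4.8 percent of the population instead of 5 percent, and after ten years about 4.1 percent, which means roughly 18 percent of the original cohort can no longer be followed.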

So we have that problem of unique identifiers for beneficiaries. We have the same problem with some of our providers. If you look at some of our data that is provider oriented, say hospital oriented, you will see that the number of hospitals has supposedly dropped drastically over time. What that is, is a recategorization of those hospitals as critical access instead of short stay.

Second, we have a lack of unique identifiers across programs. We have Medicare, we have Medicaid. As the speaker before me mentioned, we have assessments now for home health, for skilled nursing facilities, for rehab facilities and for swing bed. These numbers are assigned by the state, not by the Medicare or the Medicaid programs, so you have a different set of identifiers for bennies for this very rich source of information that will help to give more information about the facility and about the patient.

What are some other barriers that we have? We have the separation of billing and associated diagnostic and therapeutic care into separate bill types. By this we mean, if you look at the services that we have, you can see that many times if you wanted to create an episode you would have to go to as many as seven different bill types to try to find the information that is related to that particular episode.

We also have the use of what we call ruleout or unconfirmed diagnoses on certain types of bills. Some of the bills that we have are not at the point where final diagnosis has been made. So if you are looking at the principal diagnosis and even sometimes the secondary diagnosis, you are going to be seeing the diagnoses that the physician is trying to determine whether the patient really has that particular condition or not. So that diagnosis and the presence of that diagnosis in the field doesn't necessarily mean that you would want to include this person in a particular category.

This causes us many times if we are doing some cross sectional analyses, that we are looking at patients who have diabetes or whatever, that we are in fact including bills there where the physician was trying to confirm or not whether the patient really had diabetes or not.

So what does this cause? This causes problems with trying to identify persons by diagnosis in a cross sectional analysis. Just getting prevalence rates is often difficult for us, because we are not sure that we really are looking at those people who in a cross sectional time frame have a particular diagnosis. It also causes problems with us trying to find when certain conditions started, the incidence rates associated with particular diagnoses.

We also have problems with the fact that we use different coding systems. Many of you have encountered this. If you are looking at someone who had a surgical stay in the hospital, you will find that on the hospital side we use the ICD-9 procedure codes. On the physician or professional side you will see that we use what we call HCPCS, the Healthcare Common Procedure Coding System, which is basically based on the CPT-4 that the AMA puts out.

Other problems are the lack of what many people think is most crucial, clinical information that determines and differentiates personal level critical pathways. In other words, if a physician has decided that a person has a particular disease and they want to adopt a particular regimen or protocol for that patient, it would be helpful for us if we had that clinical information. We do not have that.

Next is the lack of information on the cause of disability for the disabled. About 15 percent of our people are disabled. They are under 65. We can be looking at the diagnoses on the claims to try to figure out what was the cause of the disability that they have, but we do not actually know the cause for disability for this individual. So we do not have from SSA the actual cause for why the person is entitled to Medicare.

Another problem that we have with our data is, you will notice that most of the time the data that you will get from us will be listed by what we call MSC code, Medicare status code, which will just tell if they are aged, if they are disabled, or if they have ESRD. This will not tell you for the aged population if they formerly had a disabled status. For instance, many of our aged population aged in, if they survived long enough, into our aged population. They are quite different, we find, from the aged folks who come in, many times postponing to come in when they are eligible at 65, if they are covered by a working aged situation. They may be 69, 70 or even older before they sign up. They have very, very few causes for disability. Yet our data tends to group them together if you just look at things like age or Medicare status code.

What else? Lack of comprehensive, and by that I mean both breadth and depth, of person level data on primary and secondary health insurance coverage. We group people who are working aged and are covered by their employer-sponsored plan with those people who use Medicare as their primary source of insurance. Consequently you may miss a lot of their utilization if you go out looking in the claims, because the provider if they are going to Blue Cross Blue Shield may not even bother to send the information in. So when you compute rates and whatnot, you find that you often underestimate what the actual cost is for those people who use Medicare as their primary, because you are grouping people in who you shouldn't be including.

Next is the lack of information on socioeconomic status. That would be an ideal thing to add in. We also are unable to get the cause of death, and that has been discussed earlier today in other presentations.

This one is an odd one. Since we have just gone into Part D, we consider that we would be supplementing the Part A and the Part B benefits with our Part D event data. Unfortunately we are unable to do that. The law prohibits us from doing that. We have a notification that we are hoping to get out soon that will allow us to start linking the Part D drug events for those people. It is not all the people who had drug benefits that we are going to be getting their drug events. If they are in a situation where they are getting coverage by their employer, their employer may get a subsidy, but that doesn't mean we are going to get the drug events. We are going to only hopefully be getting the events from the folks who are going through PDPs.

Next is the size of the sample. Traditionally we give out the five percent sample, but for many conditions -- and I will be talking about the chronic condition warehouse in just a couple of minutes -- it would be ideal if we had a larger sample. But right now we have a five percent sample that we are dealing with, and that is primarily because of the size of the physician claims. There are about 800 to 900 million physician supplier claims a year, and these are very complex records going often over 2,000 and 3,000 characters per record, and it often becomes difficult to do large scale analyses where you would like to do small area analysis.

Next is the inability to disaggregate the program payments, payments by other payors such as some claims for the working aged, and the beneficiary payments. If the services are all bundled such as for the DRG under the inpatient PPS system, to disaggregate them to get accurate revenue functions for the providers. If you wanted to disaggregate an entire stay that involved 20 or 30 different revenue centers and different types of services, you have all the information that is only stored on what we call the fixed portion of the claim.

Next is the inability to link specific services on claims with provider costs to derive provider cost functions. I think this was mentioned earlier. Wouldn't it be nice for institutional providers to link the claim back to the cost report, take the payment amount, and using some methodology, whether it is cost to charge ratios or some other methodology, allocate the provider's cost for providing that service to all the bundled services within that entire stay. On the physician side, it would be good to have information at the office setting for determining the cost to the physician for providing a service. We do not have that.

Next is the inconsistent use of the UPIN or the unique physician identification number. We have a problem, that many times the physicians in a practice will all use the UPIN number for the main doc. Especially we see this with radiologists, pathologists and to some extent anesthesiologists, where they all use the UPIN for one doc. So you may see one service only for many physicians, or you may see 25,000 for a limited set of physicians.

Now, admittedly for pathology they may see a large number of samples, but many times we think this is the result of using the same UPIN number for all the physicians in a practice.

DR. STEUERLE: David, if I could interrupt you for a second, this is partly my fault, but we have seven speakers on this session.

MR. GIBSON: I am going too long, I apologize.

DR. STEUERLE: It is not entirely your fault. We ran over in the morning meeting, and we haven't enforced on anybody yet, so we are starting to enforce on you.

MR. GIBSON: I apologize. Let me talk about Section 723. There are a lot of these barriers to using data. I would like to mention a couple of them that we are trying to address with Section 723. That is the final one down here.

723 was created by the Medicare Modernization Act. It was signed in December of 2003. It made a lot of major changes in the Medicare program. I talked about the outpatient prescription drug benefit, basic changes to Medicare Advantage, to the managed care program, but it also mandated studies and demonstrations to improve the effectiveness of the Medicare program and the quality of program recipients.

What that law did was establish a research database for the chronically ill. It was recognized that the chronically ill account for the great bulk of program payments. Before, the program had been geared towards acute episodes, but it was recognized that the chronically ill represent a large proportion of the dollars. In fact, they estimated that it is closer to 80 percent of the dollars are accounted for by the chronically ill.

So what they wanted to do in this database that would be geared towards the chronically ill, they wanted to provide the capacity for improving quality of care, for reducing cost of care, et cetera.

So what we did is, we went out, we talked with some clinicians. We developed algorithms for defining 21 chronic conditions that we could use the claims for to identify these individuals. These are just some statistics that I just quoted to you. The idea here was to identify individuals who had these chronic conditions, and to eliminate some of the barriers that I mentioned earlier, such as, we developed a system to do away with the problem of cross referencing, and the fact that many people change numbers over time.

We developed a methodology that would also allow us to link data in these different systems. There is a diagram here that describes that. It is called the enterprise cross reference system. The idea here was to link Medicare, Medicaid and assessment data. We assigned one unique number within each of the systems, and then across the systems we allowed a methodology that will identify individuals in both systems and pool their information together and put it into the chronic condition warehouse.
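
The idea of collapsing changed or system-specific numbers onto one person-level number can be sketched as follows. This is a minimal illustration only, not the enterprise cross reference system itself: it assumes the links between identifiers are already known as pairs, and the function name and the identifiers in the usage note are made up for the example.

```python
def assign_person_ids(identifiers, cross_references):
    """Give every identifier that belongs to the same person one stable number.

    identifiers: iterable of identifier strings seen in any system.
    cross_references: iterable of (id_a, id_b) pairs, each recording that two
        identifiers refer to the same person (a changed number, or a match
        established between two systems).
    Returns a dict mapping each identifier to an integer person number.
    """
    parent = {}

    def find(x):
        # Follow links up to the representative identifier for this person.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the chain as we go
            x = parent[x]
        return x

    for ident in identifiers:
        find(ident)
    for id_a, id_b in cross_references:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a

    person_numbers = {}
    result = {}
    for ident in sorted(parent):
        root = find(ident)
        if root not in person_numbers:
            person_numbers[root] = len(person_numbers) + 1
        result[ident] = person_numbers[root]
    return result
```

With hypothetical identifiers, `assign_person_ids(["HIC-A1", "HIC-B1", "MCAID-77", "HIC-C1"], [("HIC-A1", "HIC-B1"), ("HIC-B1", "MCAID-77")])` puts the first three under one person number and leaves `"HIC-C1"` on its own, so records filed under any of the three can be pooled.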

This allows us to create patient level profiles. It allows us to identify and keep a history of people who have these 21 chronic conditions, but in addition it allows us if we want to search the database for other conditions or do ad hocs, it will allow us also to do that as well.

The difference that you see in this database versus the other that we had is that most of the databases that we have worked with in the past have been flat files. This is a relational database.

I am trying here to pick up a couple of points that might be of particular relevance. I think the idea that I want to emphasize in this database is that it will allow us to link data across the different programs, Medicare and Medicaid and the assessment data, pull it together into a relational database and allow us to query it and develop statistics off of it. That is what we are in the process of doing right now.

It is based on a five percent sample. We have loaded from '99 to 2004 at the five percent level. We are testing to see if we can go a hundred percent with the 2005 data, and if so, we are going to load the 2006 as well. One of the things that is particularly appealing to me is that instead of waiting for the SAF files, we are going to use the claims as they come right out of our system.

I talked about the ECR a little bit. Core research files, that is the idea of paring down our records so that they are not so long. I mentioned the algorithms that we have gotten for the 21 chronic conditions.

The Iowa Foundation for Medical Care, that is one of the QIOs. It is a recognized data center for QIOs. They are also the maintainer of our assessment data.

Moving on, we are working also with the University of Minnesota. They have a group out there called Resdac, that works with researchers to try to help them to see if the database is of particular use to them.

I think that is it. I hope on the last part I didn't go through too quickly, but I got the feeling I was running a little over time on that.

DR. STEUERLE: Again, I have to apologize both to you and to our main speakers. I think we probably crammed way too many speakers into this session. But to make sure that all the speakers are able to give their talks, I'll ask the remaining speakers to stay within maybe ten minutes.

Agenda Item: Steven Cohen

MR. COHEN: Given that I have a half hour presentation, I am really going to cut to the highlights. I have some handouts, so you could get into more detail here.

In addition to everything that we have covered so far, this presentation takes a little bit more of an intensity of integrated survey designs, analytical enhancements through the integration of surveys, both with administrative but also other survey data.

I will go over briefly some activities in the department and in AHRQ in terms of implementing integrated survey designs to enhance analytical capacity and see data quality enhancements, and see some efficiencies in the collection of data, and how this model is a good model to help also improve the accuracy of survey data. There will be a few examples at the end in terms of the application of this AHRQ data portfolio, in terms of improving health outcomes and limitations of the approach. Much of the limitations in terms of some of the confidentiality issues have already been covered. The data portfolio activity fits in very nicely with the mission of the agency to improve the quality and the safety, the efficiency and the effectiveness of health care for all Americans.

On to this integrated survey design model, the model itself looks at core health survey, and it looks to other existing larger surveys or administrative databases from which the survey could be derived. Rather than having an independent large scale screener which would be very costly, this integrated survey design makes use of ongoing surveys, so one can have that predispositional information and use it in a very cost efficient manner. It also makes use of the capability of linking secondary data, and if there was a prior host survey, record of call information in terms of where there were reluctant respondents, where there were multiple contacts to get the higher response rate, and after the fact quite a bit of detail in terms of sociodemographic factors for non-response adjustments.

In terms of the core survey and the department that has used this, it is the medical expenditure panel survey. It is linked to the health interview survey. The medical expenditure panel survey is designed to provide estimates of health care use, expenditures, sources of payment.

Some of the core issues Julia Lane talked about was outlier cases and in the expenditure distribution, what percent of the population is tied to 25 percent of the expenditures. So in terms of confidentiality issues and linkage, we would opt to do everything possible to give the greatest accuracy on those high expenditure cases and be constrained in terms of some geographical dimensions for data linkage, so one can get at the core analytical variables, one could go to either the AHRQ data center or the NCHS research data center, and I will talk about how the health interview survey and the MEPS are linked.

In addition to those linkage enhancements, we have a survey much like MEPS which is core, and you have a predispositional survey such as the health interview survey, a general purpose public health survey, 40,000 households, 110,000 individuals. In addition to screening perspectives, you have an extra time point to enhance predispositional information, longitudinal analyses.

So we draw up the health interview survey with all its strengths and analytical capacities. It facilitates a very timely effective over sample of cases, joined to the medical expenditure panel survey for two years. I am going fast, but there are a few other compelling additions that have helped both surveys tremendously.

NCHS has been on the hook to get data to us fast track, really fast track, so we can get the information and then sub sample. From that they have developed a fast release of insurance coverage estimates. They are putting it out. I think the 2005 estimates came out roughly like June of this year, and I think they do quarterly estimates. So both organizations benefitted by demands to get data out quickly to make it more usable to the research community. Again, very efficient over samples of policy relevant groups.

So that is the first core example of the integrated design. MEPS recognizes that while households could give you quite a bit of very accurate information on their use, on their insurance coverage, on their access to care and their perception, there are a lot of things that are problems with it in terms of health expenditures, very, very demanding questions. So the underpinnings of the design make use of the recognition of a lot of item non-response, and it links to other follow-back surveys. So you get permission to go to medical providers to get details on health expenditures and the diagnoses and the event dates.

We have an insurance component. There are two stages of linkage that I will go into later, but this gives the depth of information on what we are spending for health care, where the asymmetry is in terms of those expenditures, how that impacts on take-up and overall access to care, and how that translates to health outcomes. The insurance component is linked to the household component to help support what the premium costs are.

A little bit more of this integrated design. As I said, the health interview survey serves as the sample frame for the household component of the MEPS.

We also benefit by a Census Bureau business register, which has tremendous timeliness, is the best frame available in terms of having good coverage rates, great information for modeling and over sampling, an annual survey of roughly 40,000 establishments. It builds off all the precursor information and allows for tremendous modeling design work. Then the linkage from the core surveys to follow-back surveys, and as we have seen on other examples, linkage to secondary data.

Instead of a typical survey where you just have information at the county level or the MSA level, where you are drawing in the analytical units, here you have precursor information from the health survey on the sociodemographics, so you can form non-response adjustments. A few equations have to be part of any presentation.

But in terms of one other thing about the linkage between one survey and other survey, this goes into one of the very core outcome measures rather than the SF-12 that we also have in the MEPS. This looks at in the last year of the MEPS a self assessment of health, and then it looks at of those people, were they ever uninsured over a two-year period, were they consecutively uninsured for two years, and through linkage between MEPS and HIS, MEPS can give you two years of coverage with several rounds of data collection to minimize recall.

At the first round, if a person is not covered, it has a question with a two-year recall, very, very risky to put that out without any verification. With the linkage to the health interview survey where they have a point in time and they also go back in time, we can make edits, and now we can put out estimates of the long term uninsured in MEPS. You can see how that could factor into certain analyses. Again, on expenditure estimates, one other dimension in terms of precursor information from HIS to MEPS.

Concentration of expenditures is critical. Persistence of the concentration of expenditures is critical. So year one, year two. A lot of analysts come to us, groups like Kaiser Permanente, they are looking at the top one percent of the population and what can be done in terms of more efficient use of services. By having HIS information, we can look at two prior years in terms of predicting a third year. Two surveys married, and the whole is greater than the sum of its parts.
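
Both measures mentioned here, concentration and its persistence from one year to the next, can be sketched simply. This is an unweighted illustration with made-up function names; it assumes person-level spending totals are already in hand, and real MEPS estimates would apply survey weights, which are left out.

```python
def top_share(expenditures, top_fraction):
    """Share of total spending accounted for by the highest spenders.

    expenditures: list of per-person spending amounts.
    top_fraction: 0.01 for the top one percent, 0.05 for the top five, and so on.
    """
    if not expenditures:
        raise ValueError("need at least one person")
    ordered = sorted(expenditures, reverse=True)
    n_top = max(1, round(len(ordered) * top_fraction))
    total = sum(ordered)
    return sum(ordered[:n_top]) / total if total else 0.0


def persistence(year1, year2, top_fraction):
    """Fraction of year-one top spenders who are top spenders again in year two.

    year1, year2: dicts mapping a person identifier to that year's spending.
    """
    def top_ids(spend):
        n_top = max(1, round(len(spend) * top_fraction))
        ranked = sorted(spend, key=spend.get, reverse=True)
        return set(ranked[:n_top])

    first, second = top_ids(year1), top_ids(year2)
    return len(first & second) / len(first)
```

Adding a prior year from the host survey would give a third dictionary of the same shape, so the same comparison can be run across more than two years.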

A little more detail on the medical provider survey, compensating for item non-response on expenditures, gold standard for expenditure estimates, greater accuracy and supports imputation. For our households, we get permission to go to their hospitals, their associated physicians in the hospital, office based physicians, home health agencies and associated pharmacies. The pharmacy verification component is critical as we wait for CMS to come on line with the Part D data to inform a number of the comparative effectiveness analyses. In the meanwhile, MEPS for both Medicare beneficiaries and the population in general, the civilian non-institutionalized population, has a pharmacy verification survey.

We have units of building blocks from the household and the provider. We take the provider when we have it from both sides. We take the provider when we only have it from the provider. We take the household data when it is from the household, but we recalibrate based on those two points in time.
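
The order of precedence just described can be written out as a short rule. This is a sketch under stated assumptions: the recalibration step is not spelled out in the talk, so a simple ratio of provider to household amounts, taken over events reported by both, stands in for it here, and the field names are hypothetical.

```python
def reconcile(events):
    """Choose one expenditure amount per event.

    events: list of dicts with keys 'provider' and 'household', each holding
        an amount or None when that source did not report the event.
    Returns a list of (amount, source) tuples in the same order.
    """
    both = [e for e in events
            if e["provider"] is not None and e["household"] is not None]
    household_total = sum(e["household"] for e in both)
    # ASSUMPTION: recalibrate household-only reports by the overall ratio of
    # provider to household amounts where both sources are available.
    ratio = (sum(e["provider"] for e in both) / household_total
             if household_total else 1.0)

    chosen = []
    for e in events:
        if e["provider"] is not None:
            # Provider data wins whether or not the household also reported.
            chosen.append((e["provider"], "provider"))
        elif e["household"] is not None:
            chosen.append((e["household"] * ratio, "household_recalibrated"))
        else:
            chosen.append((None, "missing"))
    return chosen
```

Events left as "missing" here are the ones that would go on to imputation.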

I am going very, very quickly. I hope this is helpful.

I am going to slow up a little bit. The non-response is based on imputations from our medical provider survey. Hot deck imputation; we build on provider expenditure data. We use model based predictors in terms of defining the hot deck cells. We look at predictors of expenditures, predictors of non-response. The intersection of the two is what we prioritize on. Then we go to factors associated with expenditures. Then if we have room for other variables, we will go into the item non-response.
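
A hot deck of the general kind described can be sketched as below. This is a simplified illustration, assuming the cells are formed directly from a few categorical variables; the talk describes using model based predictors to define the cells, which is not reproduced here, and the function and field names are hypothetical.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, cell_vars, target, seed=0):
    """Fill missing values of `target` with a value drawn from a donor in the same cell.

    records: list of dicts; records[i][target] is None when the item is missing.
    cell_vars: names of the variables that define the imputation cells.
    Returns new records with an added 'imputed' flag; originals are not modified.
    """
    rng = random.Random(seed)

    # Collect reported values (donors) by cell.
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[tuple(rec[v] for v in cell_vars)].append(rec[target])

    completed = []
    for rec in records:
        new_rec = dict(rec)
        new_rec["imputed"] = False
        if rec[target] is None:
            pool = donors.get(tuple(rec[v] for v in cell_vars))
            if pool:  # leave the item missing if the cell has no donors
                new_rec[target] = rng.choice(pool)
                new_rec["imputed"] = True
        completed.append(new_rec)
    return completed
```

Keeping the `imputed` flag on each record lets an analyst separate reported from imputed amounts later.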

On the pharmacy verification component, the household gives us entree to the pharmacies. We get details on the pharmacy itself and the use. Then the pharmacy gives us the medication name, the NDC code, quantity dispensed, strength and form, source of payment, amount paid.

Fictitious person so there is no confidentiality violation. This is Sandy King. She might exist, but she doesn't exist at this address.

You can see some of the wealth of data. With that expansion from the household data to get entree to the pharmacy to get all this data. It sequences, where we can then link that NDC code to other proprietary databases that can get us more granularity in terms of the therapeutic class and the sub-classes. So it really enhances analytical capacity. We can link to databases like the FDA, the year of approval for the drug, whether it is a brand or a generic indicator.

Some of the outcome analyses. We look at trends in out of pocket burdens across all major population subgroups, prevalence of potentially inappropriate prescribing patterns. We also look for substitution effects over time, new higher priced medications, but maybe there is some cut utilization overall, looking at outcomes. Trends and use by therapeutic category, and more and more work on predicting models of future year expenditures.

We just had a meeting with the Medicaid chronic disease directors. Very concerned in terms of looking at chronic disease, Medicaid expenses, looking at things that can inform them in terms of cost avoidance. I don't know if there are any cost savings, but just something that is informative.

On administrative data, we work very closely with CMS. This is not a linkage exercise, but it shows how administrative data would have national estimates. To be reconciled to MEPS, they cover infrastructure, research and development, they cover the nursing home and institutionalized population.

We have to bring the national health accounts down to MEPS, so we can see if the two data sources are providing concurrent information. If they are not, if the reconciliation shows areas of disparity, we have to look to see how each of those data sources as they speak to one another at a higher level, can be improved.

We also link to CMS data through our IRB, getting information from Medicare to validate household reports on the use and going into the details of the type of service and some complexities in patients on separate billing doctors.

A little bit more on our establishment survey. The estimates that come out from this look at differentials nationally and state by state in terms of take-up of coverage, the cost of the coverage. These estimates go into health update, GDP estimates of the premium cost for the provision of health insurance coverage. We benefit by a linkage to the business registry. With that comes a lot of conditions. We cannot release the micro data to the public. This data is residing at the research data center, gets tremendous use, but it is in a secure environment at the RDC. But we do release tabular data.

The integrated design allows for the detail of the sampling to optimize sample designs to minimize variance for fixed cost constraints. Serves as an imputation source for editing, for small area estimation models, for table production and also just like HIS and MEPS, non-response adjustments.

These integrated results, rather than more surveys, would have some palliative effects on reduction in respondent burden, sample precision and improvements, and modeling research.

Just as we close out, let me just give a little flavor to some of the other elements in the AHRQ data portfolio. We are having some discussions, and this might feed into some of the questions in terms of what is permissible on administrative data.

This is a different part of the data portfolio, the Healthcare Cost and Utilization Project; 37 state partners, roughly 90 percent of all payor hospital discharges in the U.S., inpatient data, ambulatory surgery and emergency department databases. The patient enters the hospital, gets billing records. The state partners send it to the major data contractor tied to this, which produces a standardized administrative data set.

The standard linkages on the administrative data would be the AHA data, more details on characteristics of the hospitals, information on the providers and secondary linkages that we have heard from all the other speakers so far.

The identifiers are encrypted. There are some limitations with this encryption, different states do different things, so we don't have it across the board. But some states are better than others, and you can have some more episodic, I wouldn't say longitudinal, but episodic type analyses across the databases. Some of the state partners in their own right make use of linkage to vital records, to disease registries, to state program files to enhance the analytical capacity.

Some of the challenges as I said are, the encryption methods are not uniform. You don't have consistency across time, and some sensitivity to some of these supplemental linkages.

Some of the analytical outcomes with this database would be looking on racial and ethnic disparities and readmissions for diabetes, the incident cost of motorcycle injury to inform decisions on state helmet laws, financial status of safety net hospitals, impact of motor vehicle exhaust on pediatric asthma admissions, certainly marrying a whole bunch of disparate data sources for an informative study.

In terms of pushing further, the department needs more information, particularly with Secretary Leavitt's transparency initiative, and looking further in terms of quality metrics. Building on the capacity when it does become more visible on electronic medical records and supplements to the claims data, better links across the different states that are uniform. More information on hospitals that are missing right now to help in terms of decision making by consumers on organizational culture, clinical integration and the availability of data from health information technology, and more quality metrics and nursing staffing data is always critical.

Where we need to be. There is an initiative in the Department called Joining Forces to expand on existing administrative data for consumer choice. We need information that is more timely, that is less expensive, that is actionable. Right now we have quite a bit of administrative data, but a lot more work needs to be done, and dealing with some of the confidentiality issues as well.

Some of the examples we are pushing on would be more HIT applications for timeliness, more augmentation on clinical detail, on what condition was present on admission. I think our colleagues from CMS said that would be very, very helpful, too. And some lab values and more cross-site data, and promising all the confidentiality commitments are adhered to.

We have some staff from HCUP, so if you have some other questions, they are here to answer some questions.

Just before I close out, the agency with CMS and others in the Department have quite a bit of activities going on with MMA. The agency in particular has a role in terms of studies on comparative effectiveness analysis. One of the programs at the agency is the DEcIDE research network. That focuses on developing evidence to inform decisions and effectiveness. The purpose of this program is to expeditiously develop valid scientific evidence about the outcomes, comparative clinical effectiveness, safety and appropriateness of health care items and services.

Several well-known academic centers are a part of the network. The onus of the DEcIDE network is to analyze existing health care databases for comparisons of effectiveness and outcomes of treatment, analyze existing disease, device and other registries, and improve the accuracy of those data sources. We have spoken on ancillary and secondary data sources for some of the linkage.

We have a data center. It is the second one in the Department. Some of the models that NCHS are pushing on for getting more and more information in the hands of the analysts to inform policy and practice directions AHRQ intends to go in as well. There is quite a bit of coordination in the Department on that front.

Some linkage variables that you can look at; it is probably hard to read on this screen. I think the presentations gave great justice to some of the limitations in the availability, so I am not going to repeat that.

Let me just close. I might have done 25 minutes in ten, I don't know, but I went very quickly. I did cover the capacity of integrated survey designs, the ability to reduce non-response, related enhancements and data quality, analytical capacity, some attention to MEPS but some of the other data resources, both within AHRQ and the Department.

That's it.

DR. STEUERLE: Steven, that was not only a brilliant presentation, but a tour de force on your speaking ability and speed.

Agenda Item: Gerald Riley, Social Science Research Analyst, Centers for Medicare and Medicaid Services and Martin L. Brown, Chief, Health Services and Economics Branch, National Cancer Institute

MR. RILEY: Good afternoon. Martin and I are going to team up to describe the SEER Medicare linked database, which represents a joint effort of the National Cancer Institute and CMS. We will also briefly discuss some other linkage projects. I think we are planning to go about ten minutes each, if that is okay. We will try to make it a little shorter.

I will give some background information on the linkage, and Martin will discuss some of the uses of the data and some of the access to the database for outside researchers.

Before I begin, I should acknowledge Joan Warren of the National Cancer Institute, who helped prepare this presentation. She has played a very central role in developing and improving the linked database.

SEER Medicare consists of cancer registry data from NCI's Surveillance, Epidemiology and End Results, or SEER, program linked to Medicare records on an individual basis. The linked database has been in existence since 1991, and has become a significant resource for cancer related health services research.

Under SEER, NCI contracts with individual cancer registries to collect and identify information on all incident cancer cases in their reporting areas, with the exceptions of non-melanoma skin cancer and in situ cervical cancer.

The participating registries attempt to identify cases treated in all settings, and they examine death certificate records and autopsy records to identify additional cases. Incident cases only are reported, and recurrences are not captured. The SEER program began in 1973 and has covered 11 geographic areas since 1992, representing 14 and a half percent of the U.S. population. The program expanded to four new areas in 2001 and now covers about a quarter of the U.S. population.

This map shows the SEER reporting areas. Until 2001, the program covered five states and six metropolitan areas. In 2001 the states of New Jersey, Kentucky, Louisiana and the remainder of California were added.

SEER areas were not selected to be representative of the U.S. population, but were selected for the quality of their cancer registries and the diversity of their populations. Analyses of the elderly population in SEER areas have shown that there are lower percentages of whites and people living in poverty, and a higher percentage of urban dwellers compared to the U.S. in general.

This slide shows some of the detailed clinical data that are collected under SEER. Each individual is assigned a unique case number. If an individual is diagnosed with more than one primary cancer while residing in the SEER area, information on each cancer is recorded separately under that person's case number. Information on month and year of diagnosis is collected, as well as site of cancer, histologic type and extent of disease at diagnosis.

The SEER program staff uses information on extent of disease to assign a stage of diagnosis. SEER also captures type of cancer related surgery if any, and any radiation therapy that is given or planned as part of the first course of treatment. Vital status is also tracked over time and includes cause of death for most cases.

Most of the Medicare administrative records are included in the linked database. Enrollment data provide information on entitlement, demographics, Medicaid buy-in status and managed care enrollment. Individual claim records are included for all kinds of covered services.

Most claims records are available from 1991 forward. The continuous Medicare history sample file contains longitudinal data on a five percent sample of Medicare beneficiaries from 1974 onwards. In addition, a five percent random sample of Medicare enrollees has been identified who are not in SEER but who reside in a SEER reporting area. These individuals serve as a cancer-free comparison group for studies on cancer screening and other topics.

It is our intention to add enrollment in Part D prescription drug plans in future updates of the database, assuming those data are available. We hope to add Part D drug claims if this becomes feasible. As Dave said, we are working to try to get access to that now.

The SEER and Medicare data complement each other in providing information on a variety of cancer control activities. The data are useful for patterns of care studies that control for the effects of comorbidities. Post diagnostic surveillance can be measured for many years after the initial course of therapy, and recurrences can often be identified from the claims data. End of life care can also be analyzed, including the use of hospice services, which are covered by Medicare.

I will briefly describe our linkage activities. NCI receives files with personal identifiers for cancer cases directly from the SEER registries. NCI checks the files, and if they appear satisfactory, they are forwarded to a CMS contractor, who matches the data to the Medicare enrollment database. For all cases that are successfully matched, the unique health insurance claim numbers or HICs are identified and all Medicare claims for those individuals are extracted. The contractor then removes all identifiers including the HICs, and the claims and enrollment data are sent to NCI's contractor for creation of analytic files. All the analytic files retain arbitrarily assigned SEER case numbers to distinguish the individual cases.
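The hand-off described above, in which identifiers are stripped and only the arbitrary SEER case number is kept, can be sketched in a few lines. This is a minimal illustration only, not the contractor's actual procedure: the field names (`hic`, `name`, `ssn`, `seer_case_number`), the `crosswalk` mapping and the function name are all hypothetical, since the talk does not describe file layouts.

```python
# Hypothetical identifier fields; the real Medicare file layouts are not
# described in the talk.
IDENTIFIER_FIELDS = {"hic", "name", "ssn"}


def build_analytic_claims(crosswalk, claims):
    """Return de-identified claims for matched SEER cases.

    crosswalk: dict mapping a HIC to an arbitrarily assigned SEER case number
               (only successfully matched cases appear in it).
    claims:    list of dicts, each carrying a "hic" key plus claim details.
    """
    analytic = []
    for claim in claims:
        case_number = crosswalk.get(claim["hic"])
        if case_number is None:
            continue  # not a matched SEER case, so the claim is not extracted
        # Drop every identifier, including the HIC itself.
        record = {k: v for k, v in claim.items() if k not in IDENTIFIER_FIELDS}
        # Keep only the arbitrary case number to tie records together.
        record["seer_case_number"] = case_number
        analytic.append(record)
    return analytic
```

The point of the design is that the analytic file can still group all records for one person through the case number, while nothing in it leads back to the Medicare identifier.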

The linkage is updated every three years, with the next update scheduled to begin in August 2007. We do this every three years because it is a pretty time consuming complicated process to do the updates.

MR. BROWN: And when we update, we update the entire thing retrospectively because of these changes in enrollment status that were discussed.

MR. RILEY: The current files contain SEER cases diagnosed through 2002, so the next update will carry us through 2005.

I will just briefly describe the matching algorithm that is used to match the SEER and the Medicare records. In most cases, social security number is reported by the registry. That is the most important variable that we use in matching. We do require some agreement on corroborating variables such as first and last name, month of birth and sex. Agreement on sex is relatively important to prevent us from inadvertently matching records for husbands and wives. If SSN is not available or doesn't match, we use first and last name, date of birth, middle initial and date of death to match records.
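The two-tier rule just described can be sketched as follows. This is a rough illustration under stated assumptions, not the algorithm actually used for SEER Medicare: the talk does not say how many corroborating variables must agree, so the sketch assumes that sex must agree and that at least two of first name, last name and month of birth must agree when SSN matches. It also assumes that, without an SSN match, names and date of birth must agree exactly while middle initial and date of death must not conflict. All names in the code are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    ssn: Optional[str]
    first_name: str
    last_name: str
    middle_initial: Optional[str]
    birth_date: str  # "YYYY-MM-DD"
    sex: str
    death_date: Optional[str] = None


def _same(a, b):
    """True only when both values are present and equal, ignoring case."""
    return a is not None and b is not None and a.strip().lower() == b.strip().lower()


def _no_conflict(a, b):
    """True when either value is missing or the two values agree."""
    return a is None or b is None or _same(a, b)


def is_match(seer, medicare, min_corroborating=2):
    if _same(seer.ssn, medicare.ssn):
        # Assumed: sex is required, to avoid matching husbands and wives.
        if not _same(seer.sex, medicare.sex):
            return False
        agreements = sum([
            _same(seer.first_name, medicare.first_name),
            _same(seer.last_name, medicare.last_name),
            seer.birth_date[5:7] == medicare.birth_date[5:7],  # month of birth
        ])
        return agreements >= min_corroborating  # assumed threshold
    # No usable SSN: stricter criteria to limit false positive matches.
    return (
        _same(seer.first_name, medicare.first_name)
        and _same(seer.last_name, medicare.last_name)
        and seer.birth_date == medicare.birth_date
        and _no_conflict(seer.middle_initial, medicare.middle_initial)
        and _no_conflict(seer.death_date, medicare.death_date)
    )
```

With these assumed thresholds, two records that share an SSN but differ on sex are rejected, which is the husband and wife case mentioned above.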

Our matching criteria in the absence of SSN are rather strict, because it is relatively easy to get false positive matches with such large databases involved. Our match rates for persons aged 65 and over are quite good. We have been able to find a Medicare record for about 94 percent of the elderly in the SEER database. This varies somewhat by race and ethnicity. The match rate for Hispanics is only about 88 percent, and that for Asians is 90 percent. The match rates for persons under age 65 is rather meaningless because we don't expect to match most SEER cases in that age range.

The next table shows the number of linked cases for some of the most common sites of cancer. There are about 300,000 prostate cancer cases and over 200,000 cases of breast, lung and colorectal cancers, so this is a very large database. These top four cancer sites account for 60 percent of all linked cases. The database is large enough to support detailed studies of many less common types of cancer as well.

Before Martin describes some of the applications of SEER and Medicare, I will briefly describe some other database linkages involving SEER and Medicare data. SEER has been linked to CMS' Health Outcomes Survey, or HOS. The HOS gathers data on health status measures among the Medicare managed care population. The survey is administered annually to 1,000 Medicare enrollees in each Medicare Advantage plan. The survey is used to assess the ability of plans to maintain or improve the physical and mental health of their Medicare members over time. The linkage with SEER will permit better studies of quality of life for cancer patients in Medicare managed care.

There are also current plans to link SEER with the Medicare Consumer Assessment of Healthcare Providers and Systems, or CAHPS, survey. CAHPS measures the experiences of beneficiaries in their health plans, with fee for service Medicare, and with providers like hospitals and nursing homes. The survey is used to monitor quality of care and to measure the performance of Medicare health plans and providers. So the linkage of CAHPS with SEER would permit more detailed studies of patient experiences with cancer care.

Medicare claims and enrollment records have been linked to many other databases besides SEER. Medicare administrative data are routinely linked to the Medicare current beneficiary survey. They have also been linked to other surveys, like the national long term care survey and health and retirement study data. Data have been linked to social security administrative records under interagency agreement between CMS and SSA, and Medicare data have been linked to several NCHS surveys under an interagency agreement, as Jennifer mentioned a little while ago.

These data linkages have greatly enhanced the value of survey data by adding information on use of health care services that is difficult to obtain via survey.

I will mention one attempted linkage with SEER that did not prove very useful. NCI considered linking SEER with Medicaid data to get better information on health care use and costs for cancer patients with low incomes. This link is difficult to begin with because the Medicaid data are state specific, which complicates the privacy and the technical issues involved.

The validation project was conducted by NCI with SEER and Medi-Cal data from California. Medi-Cal claims were found to be not very useful, primarily because of heavy enrollment in managed care plans. That is, the state does not give claims data for Medi-Cal managed care enrollees, so the information available on them is minimal. NCI has therefore dropped its effort to link SEER with Medi-Cal data because the claims are not complete enough.

So Martin is going to talk about some of the specific applications of SEER Medicare data, and also some of the conditions of access to the data.

MR. BROWN: There are three or four contexts in which we use the SEER Medicare data. This whole project was instituted ten or more years ago, because at NCI we are associated with the SEER program that Gerry mentioned. This was seen as a natural extension of our mission to do surveillance research and monitoring.

Part of what we do is what we call inhouse research at a research institute. We are an extramural branch, but we do inhouse research as part of our federal mandate to do surveillance.

Secondly, we have done a lot of work to publicize the existence of the SEER Medicare database, to provide technical assistance through various mechanisms such as a very extensive webpage, conferences, outreach, Joan is up at Harvard today, giving an extensive seminar, for example, and a funding mechanism. We have developed a large stable of extramural researchers who use SEER Medicare data.

Then we have hybrid studies in which we have developed over the years this partnership with extramural researchers. Quite often we will have a need to do a particular analysis at NCI. We will get some of the clinical or health services expertise from the outside and we will get together and partner a study. We oftentimes have people come to us and say, can you do this for me. If we think it is a worthwhile thing we will say yes. For example, we have been working with the Association of American Medical Colleges' Center for Workforce Studies to do a study of supply and demand for the oncology workforce over the next 20 years or so.

So that is the context in which these things have come about. We now have over 200 published articles that have used SEER Medicare data.

A couple of examples that fall in these various categories are trends in use of cancer related services, procedures, resources and costs; descriptions of disparities in use of cancer care; patient, physician and health services determinants of patterns of care; and volume outcome studies.

For example, this paper by Gerry Riley looked at differences between HMOs and fee for service settings and stage of diagnosis and treatment approaches for early stage breast cancer. This was done in 1999 when there was a lot of heat and emotion about whether the quality of care was substandard in so-called managed care or HMO settings versus fee for service settings. So at least by these parameters, that certainly is not the case.

The next one is an example of a study done by an extramural researcher which looks at trends in use of adjuvant androgen deprivation therapy among men with prostate cancer. This was a treatment that was shown in clinical trials to have potential benefit in the early ‘90s. So the question is, what is going on with this treatment, and this is one way we can track the dissemination of this kind of treatment. This is the kind of treatment we can't capture very well in our routine SEER data collection, for example.

Then the next one is dear to my heart. This is an example of estimates of cost of treatment as defined by Medicare payments. We can have a long discussion of whether Medicare payments are a true or not true proxy for cost, but compared to almost anything else that is available they are pretty good.

This is an example done by an extramural researcher. We do a lot of inhouse research on this topic. I won't go into details, but it is interesting because it shows that the incremental costs for treating colorectal cancer relative to if that person had lived and not been diagnosed with colorectal cancer, lived and died of another disease, it is inversely related to stage. The more severe stage the lower the cost. In fact, it is a negative cost compared to not having this disease. So in the context of cost effectiveness analyses this has some obvious implications that are oftentimes not well appreciated.

This is an example of health disparities done by a fellow who worked with me, which showed a pretty robust disparity in African-American men in receiving -- not men, just African-Americans, in receiving surveillance colonoscopy after initial treatment for colorectal cancer. The level of detail isn't shown on this diagram. Not only were African-Americans less likely to receive a surveillance colonoscopy, but they were much less likely to receive a colonoscopy exam versus a sigmoidoscopy plus a barium enema exam. That level of detail you can get with the Medicare data that would be hard to come by otherwise.

Finally, Deb Schrag who is in the audience is the first author of this paper, which is one of many studies that have used SEER Medicare data to look at the issue of volume outcome for cancer specifically. The marginal is very interesting, because they show that there is a hospital volume effect, there is also an individual surgeon volume effect, and the two of them together explain more than each one of them separately. There has been some very good descriptive work done using SEER Medicare data, but also some very good methodological work that suggests some of the limitations of these analyses as well.

Those are some examples of the types of studies that have been done.

This shows that there has been a growing interest and use in the SEER Medicare data by the number of data requests and also the number of publications.

I don't know if we need to repeat this one. The advantages of SEER Medicare are obvious from what you have heard from several speakers today. The link to SEER does provide very rich clinical information at the date of diagnosis which you couldn't get from Medicare alone. Going to Medicare allows us -- despite all of the problems that were mentioned earlier, nevertheless it provides a pretty good longitudinal set of data that we can then follow from the original incident cohort that we define in the SEER data.

The limitations are legion. Most of those have been mentioned. It is limited to individuals over the age of 64. Our cancer diagnoses are not. We can link someone diagnosed prior to the age of 65 once they get into the Medicare program later on. We have used that to great effect, actually.

There is a problem with HMO enrollees that you know about, and Part D, we will see what happens. A lot of people would like to use SEER Medicare to look at issues of cancer screening. There are limitations about that, because the Medicare codes for things like mammography and colonoscopy don't always allow us to accurately identify whether the exam was a screening exam or some other type of exam. The SEER registry areas are not totally representative, although they are large, they are national.

Probably the biggest limitation in our viewpoint, and this has also been mentioned, is the time lag. We have a significant time lag between when the events occur and when we can make the data available, three or four years. We would definitely like to do something to shorten that time lag, and we are doing everything we can to explore potential ways to do that.

We are in touch with CMS and with NIH about various efforts to improve the timeliness and efficiency of using CMS data for research purposes. We also have other efforts that are complementary. For example, we do have a large HMO based research network. We had a grant application and we recently received a fundable score, and its basic purpose is to create an HMO parallel system to the SEER Medicare data using these large HMOs.

This maybe speaks to some of the early concerns about how do you get SEER Medicare data. I won't go into any detail on this, but maybe we can talk about it in the discussion. This is not a public use database, but it is data that is available to outside users upon filing a data use agreement that has a certain amount of requirements that we then enforce. We have not found it to be a big problem so far. We have had lots and lots of users and we haven't had any cases of the kind of abuse that would cause problems.

DR. STEUERLE: Thank you both for rushing through and taking time to let your colleague John make the last presentation. So John, we are going to let you go next.

Agenda Item: John Drabek, Economist, Office of the Assistant Secretary for Planning and Evaluation

DR. DRABEK: My name is John Drabek. I work in OASPE at HHS. I have a few comments about one of the studies that was displayed in Jennifer and Chris' slides. That concerns the linkage of the national health interview survey data with CMS and social security data.

This project has taken a number of years to bring to this stage. It first of all required people from four agencies, NCHS, CMS, Social Security and OASPE to meet and discuss how to accomplish such a match, that would satisfy the privacy concerns of all the agencies, but yet yielded a database that was usable for researchers. We were able to do that, and we were able to agree on the technical details of how the match needed to be conducted and how the files needed to be sent back and forth between the agencies.

The data are now available at the NCHS data center. There is extensive documentation on the NCHS website under the data linkage page. There is a large amount of data that are available. It is four years of the national health interview survey linked with extensive Social Security and Medicare histories.

The data are useful from the standpoint of linking people in the HIS to those administrative record sources, but surveys that are built off the HIS are also available to be analyzed, particularly the LSOA with its followup interviews of aged respondents. Although MEPS started being linked with the national health interview survey late in the process, the potential to link to MEPS is there as well.

There are a couple of things I want to draw attention to. One is the fact that while the HIS is well documented in terms of user friendly files and lots of help in terms of being able to analyze the data, the Social Security and Medicare files are more oriented around program operations. So learning how to use the Social Security and Medicare files is a level of complexity that is considerable on top of the HIS.

To help in this process, our agency, OASPE, provided funds for the development of analytical files for the four HIS merged surveys, and also developed documentation. We are providing additional funds this year to make further support available to users.

Gerry Riley, who just presented the SEER data linkage, has done one study with a linked database, looking at people in the waiting period, those who have received SSDI and are awaiting their two-year interval before they are eligible for Medicare, and looking at their characteristics and seeing how they differ from other individuals. So that is a unique project that could only be done with this type of database.

We are also in the process of awarding a research contract to use the database to look at three areas. One is understanding the interaction of the population with disabilities and the SSI and SSDI programs. There are many more people who have disabilities than those who are in the Social Security programs, and what their characteristics are and severity of disability and things like that are important.

The second area is to take advantage of the fact that we have a survey, say the disability survey conducted in 1994. We can go up to 2003 and look at what happened to people after that in terms of their interactions with Social Security. So we can see how many people who had conditions in ‘94 applied to receive disability benefits in later years.

The third area is understanding family and caregiving support for people with SSI and SSDI and other disabilities. There are extensive questions about disability and use of services, household structure and things like that in the ‘94 and ‘95 HIS that go beyond the normal core data in the HIS.

We have supported this, because we think it is essential to have good data for policy research and program evaluation. Linking databases in this manner is a relatively new activity, and the design and use of the products is not well understood at this point. We feel that only by conducting analyses with the linked data will we understand the benefits and the limitations of what has been done so far.

In particular I think we need to get a better handle on the different time frame available with various parts of the database. We have the health interview survey in 1994, but we have extensive data before and after from the case history files. Figuring out how to summarize those case histories and integrate it with the survey data is a challenge that hasn't been faced before. Similarly, if you have longitudinal interviews, how you integrate that with case history data. So hopefully we will learn better how to do that in the future.

Thank you.

DR. STEUERLE: Thank you again for helping us to get back onto schedule. I now have to apologize to everybody; we will probably have to let you take your own break again today, because I want to make sure -- we have Howard on at the last session, and I want to make sure we give him his full time. So I am going to go right to questions, and if people feel you need to step out, please do so, and we are just going to continue straight on through.

I am going to take the liberty of asking the first question, and I can address it to any of you. As I heard your presentations, one thing that stood out in my mind, and I was comparing your presentation to those of the Census Bureau, it seemed to me that one thing that seemed to make a huge difference was whether the particular part of your agency or particular program you are in or whatever had positive incentive to do something.

So the Secretary decides he wants a Decide program. That creates an enormous weight within the Department to go out and make things happen, which had he not done that might have much more left it up to some outside researcher pressing to have something done, which maybe the inside lawyers would say we might get into trouble for doing this.

The SEER program, I'm not sure what the underlying legislative impetus was there, but cancer has always been a hot topic. So it appears that if in the system somewhere there is somebody who creates this incentive that information should be provided on something, that it often is the motivation for breaking through so many of these barriers we are talking about. Am I correct or incorrect?

MR. COHEN: You really are on target, but let me give you two examples where you are on target. The integration of the MEPS and the HIS was part of a major initiative within DHHS, I think it was ‘95, as part of Vice President Gore's REGO II activity, reinventing government. There was an initiative to look at all the departmental surveys in a very short period of time, and that created opportunities and synergies, and that was a catalyst.

More recent activities, there are Secretarial initiatives. The Department gets a lot more efficient by bringing the core forces together and getting it rolling.

With that said, it has a momentum of its own. There are other opportunities, once these things are put in play, that other users see things that weren't on the table before, but because of this coalescing there are these opportunities that present themselves.

I would say there are probably other cases where you have independent forces coming together, but there is a framework that makes this all jell when it is a mandate or an initiative and you get the support that is needed from the leadership to bring people together, because we are all so busy meeting all our objectives. That is first and foremost, but many times you get these opportunities.

DR. MADANS: I guess we have had a different experience, although the MEPS-HIS link clearly was something that was started by the Department and certainly encouraged.

I think the other linkages that we have done actually haven't been a real push. We have had a lot of support from the Department, particularly from OASPE, to do them. The earlier links with Medicare data and the original thinking about it I think was coming from the agencies. We have a lot of conversations among the agencies in looking for enhancements.

But that said, it is much easier to do if there is a structure within the departments that helps you do it and encouragement for it, so that has been very useful.

MR. BROWN: I think the fact that we have a surveillance mandate at NCI was important. We said should this mandate just cover incidence and survival or should it cover expenditures. It was expenditures that was first published, because we also had constant demands from Congress and still have constant demands from Congress to answer questions like how much does cancer cost, what is the economic burden of cancer. As great as MEPS is, and we use MEPS actually and that linkage for quite a few studies, it doesn't have the numbers to get at specific cancer sites.

Some of us, and Gerry was there, Ed was there and other people were there, said this is something we need to produce. Then later on, once it was on its own, and it was a struggle to get this thing moving to the point where it is now, the whole quality of care tsunami struck. Then it turned out there was a huge interest and a lot of extramural researchers have taken advantage of this to use the SEER Medicare database not for what we thought was going to be its main purpose of looking at expenditures, but looking at cancer care and health disparities.

MS. TUREK: One interesting point. You can get the linked NHIS Medicare file and do analysis, so you can analyze how similar questions were answered on the two surveys. I get it for income. I thought it was wonderful as a user, and it is in public use form. I just wanted to commend both organizations for making that available in a way that protected confidentiality, but really enriched the data set a lot.

DR. STEUERLE: The speakers are also free to ask questions or comment on these others if they have something that they felt was missing or needed to be added.

DR. SCHRAG: I am an end user, so I am going to ask my question from that perspective. We heard about two examples of linked data sets, HIS with Medicare and SEER with Medicare. But of course, what end users really want is to be able to link all three.

So two questions to put on the table. We want to link data sources. There are many examples where two agencies have gotten together and created a useful link, and Medicare is most frequently at the nexus, but we want to link three and four, which gets a bit more complicated.

The other challenge I would put on the table is, we in the extramural, extra government, extra academic research community often want to link agency data with data that I will call -- they are non-governmental, but they are extremely informative. We heard a couple of examples, AHA data, data from the American Board of Internal Medicine, about where different kinds of doctors are located and what their qualifications are. Those would be the biggest examples probably.

But there are whole different sets of rules that govern those private data resources from the governmental resources. As an extramural researcher, it is very difficult to navigate those waters when you are trying to link data sets that cross public-private boundaries. I think that is where some of our greatest challenges lie.

The only time we have been successful at it is when we find cooperative investigators within the agencies, like Martin's branch in particular for those of us in cancer, who recognize the importance of those private sources and accomplish the linkages for us, so that we can get our grubby greedy little hands in there, carefully of course, with confidentiality.

Those are some of the challenges that we face. It is great to hear about two, but we don't want two sources. We want three, four.

DR. MADANS: The HIS link is the three, because it is SSA, CMS and us. So it is going across.

We did do SEER once. It was the HANES-1 I think. But SEER isn't everywhere. So by the time we did the link we were down to no cases. So we don't have a big enough sample in any state. That is part of the problem. There are lots of data sets, but there are probably not that many that would meet your particular needs.

I think you are right about the proprietary data. If someone comes to us and says we want to take your survey and we want to link in our proprietary data, they can do that in our data center. But they have to make the arrangements with the other entity, and that means you can't share it, because they are paying for it.

That would be an area where it would be nice to have some government wide leadership so that we are not negotiating those things over and over again. Some of the AHA stuff we buy so we can put it out. We know other people have linked in. But you're right, it gets very complicated.

DR. STEUERLE: Any other questions?

DR. BREEN: I have one question.

DR. STEUERLE: Yes, please.

DR. BREEN: This is a question for Steve. You said that the sample frame for MEPS from the NHIS was at the household level. I always thought it was at the person level, so that you had all of the data that you asked of sample adults included in your MEPS database. Is that not true?

MR. COHEN: It is actually a household based survey. It is person specific measures that would be factors that we would look at. But MEPS is a household survey, so you could reconfigure once we take the household in from HIS to person based analysis, to health insurance eligibility units, to tax filing units. You could reconfigure nuclear families, secondary families.

I think one of the difficulties and challenges is when one gets to something like a condition oversample, where you have a sample adult in the household. It is a bit tricky that everybody in the HIS didn't get the same set of questions, so it is a little different than sampling on what is in their core, in terms of some of the demographic, the insurance, some of the other measures. But that said, once we sample a household, everybody comes along for the ride.

DR. BREEN: Thanks.

DR. SONDIK: To add a point to the answer to your question, I think one of the things is that all of these areas have a very strong analytical staff behind them. So this is not something that is, here it is, here is the directive, go do it, and you get people whose interest isn't primarily in this area and using it. I think that is really important.

The other point is that Martin mentioned, and it is certainly true with AHRQ, unfortunately it isn't true with NCHS, but I wish it were, which is a support of research funds that can be earmarked to build a user community which can then spread. I think that is really important, because the evaluation of these things is in the doing. Even though we have strong staffs involved, the real evaluation is when you get people outside using it as well. I think that is really an important point.

So if we did get such a directive, I would want to be sure that the directive has the promise of funds along with it. It doesn't have to be a lot. We are talking about tiny amounts of money here, relatively speaking, that can seed the research community.

DR. STEINWACHS: You were talking about linkages. Medicare is an ideal population, Medicaid for some reasons.

One of the health issues that keeps coming up over and over again is the effects of people going to war, which is potentially the veterans. I know the HIS asks you about veteran status, but I don't know whether anyone has explored trying to identify who the veterans are that are part of surveys.

Then there is also VA utilization, which probably would be a small piece if you linked with HIS, I don't know. But they are among the people many times with heavy disabilities if you were focusing on individual disabilities.

So it is a question whether any one of you who have been dealing with health issues have tried to bring the VA into this, either on cancer or any other issues.

DR. COX: I can sort of answer it. Before the VA event, we were in negotiations with them to try and link with their population of VA data to the HIS. I think we had some analysis, about ten to 15 percent of the HIS had served in the armed forces at some point in their lifetime, and we were going to add data about the dates of service and whether there was active duty in war situations.

We were very close. We were actually talking about the amount of money it would take to do that kind of work, and we stopped. So we are letting VA get their IT policies into place and recover from the situation they got themselves in. I think it will come back at some point. But they need to get their house in order first.

MR. BROWN: We have two experiences. One is, we have a project called CanCORS. It is a large study that is recruiting 5,000 newly diagnosed colorectal, 5,000 newly diagnosed lung cancer patients, and have been following them intensely over a one-year period. VA is one of our sites for recruiting patients.

I don't know if we have any specific information on their status in regard to their military experience, but they are a participating site in that study.

The other, over the years we have had several discussions not with the VA, but with the military health care system, the Tricare system. It has been disappointing mainly, because of this problem of not being able to identify the enrollment status in secondary and tertiary payors. So it is hard to set up anything that is any kind of longitudinal study if you don't know who your denominator is or where they are in any given time. But we have some continuing discussions scheduled on that topic.

DR. STEINWACHS: Just out of curiosity, I assume in the SEER registries there are veterans that may or may not have been contributed by the VA as a source. Is there any information collected in SEER about veteran status as described here as having military service or service during wartime?

MR. BROWN: Not that I know of. But you're right. Some of the reporting units to SEER are part of VA, but there have been some issues lately about that as well, which we are negotiating.

MR. PREVOST: At Census prior to the event, we were having lots of discussions with the VA as well. If one were to use our surveys in trying to attribute status in a linked environment, there were many concerns that the VA had about how well veterans are responding to our questionnaires and saying that they were veterans, particularly if they were in a reinstatement status.

So we were going to begin to start doing some quality assessment work as well, to see how well the veterans were responding to those questionnaires and if they could be used. But anyway, I am hoping that at some point in the near future we will be able to re-start those talks.

DR. STEINWACHS: There are Americans at any given time, a lot of them overseas, and some of them are there for periods of time. Do they get captured at all? A household interview would not, you have to be a resident of the United States, but where you talk about births and deaths, they have to occur in the United States, too. But if you are in the military overseas, would that be back into the U.S. in terms of data if you die?

I was just wondering what that link was, because some people are over there serving us, working for us. Some people are over there on their own choice.

MR. COHEN: For some of those core surveys they would have had to have been unfortunately an injury, and they would have had to return to the States.

But I can tell you, issues like this are coming to the table in health care, although it is very, very rare. You are hearing of people going out of the country to get health care. There were tradeoffs in terms of quality, but there were some costs, to several disparate countries. If this becomes a bit more prevalent, I'm sure we will want to track this and get a sense of this. So right now, we don't have the mechanisms.

DR. STEUERLE: I would like to thank each of you, and just remind you that this committee forum is pretty much of an open forum. We invite you to stay for other sessions. Also, remember to keep in mind that the goal of our committee is mainly to advise. We mainly advise the Secretary.

So if there are things you think we should be advising, please let us know. I have worked many years in government, many years outside. I have often viewed the role of advisory groups or consultants as finding the information that is already inside, and just rewriting it and replaying it a little bit. The good ideas that you have had that haven't quite made it to the top, our role is to help them get there. So thank you again for your patience, and thank you for especially dealing with the fact that we crowded you into this one session.

Howard, I think that leaves you the end game.

Agenda Item: Social Security Administration – Howard Iams, Research Associate

DR. IAMS: Yes. It is always a challenge to be the last on a long day. I will try and cover the topic and do it relatively briefly.

I need to alert everybody in truth in advertising that I am an advocate of using linked survey data. I have been actively doing this since 1984, and I started using the new beneficiary survey in 1986, SIPP matched survey. The phrase end user, I have been an end user and co-authored at least a dozen analyses, a number of which are published in the Social Security Bulletin.

I work for an agency that is a proponent, a strong advocate. The Social Security Administration in the 1960s started a survey called the retirement history survey, which had administrative data from benefits and earnings linked to it, and followed people into retirement over a decade.

The Census matched data was started in the early '70s. I believe Fritz Scheuren who is in the audience and Dorothy Projector were heavily involved in starting that work, and it was continued by Denny Vaughan, and I continued it after Denny Vaughan left the agency.

We just awarded our ninth retirement research consortium. We awarded roughly 70 or 80 projects, about $5.5 million to three centers at NBER, the University of Michigan and Boston College. At least half of those studies are using the health and retirement survey and a fair portion of them end up using matched survey data. So my agency not only started it, it continues to fund and support it, and we have an analytical group that continues working on it. Kalman Rupp who is in the audience has been active with the financial eligibility model which we developed with matched data. We use that for modeling SSI, for doing QMB, and we recently were doing estimates on Medicare extra help eligibility and what was the participation rate of people in Medicare extra help.

I have been actively involved in using SIPP matched data to project the future retiree population, the baby boom, which has been extended through the 21st century. I call it modeling income in the near term. John Sabelhaus at CBO says it ought to be modeling income in the late term, MILT, but anyway, this has been providing social security reform information to the White House in designing alternatives for over two years, and also to the House Ways and Means Committee, and to the GAO. We are actively supporting it. We think it does a lot of things.

We currently have linkages to the Census, SIPP and CPS. You have heard a lot about those linkages and how that works. We have supported data improvements for SIPP and CPS. We are currently supporting a Census effort to get more matched data for the 2004 model panel using administrative address record process, which you heard about this morning. They are in the process of contacting people, and we are contributing funds to support that.

We also enhance the health and retirement survey and try to promote it as a data set that is of use to the community. We have funded enhancements. We are currently funding a projection of lifetime pension wealth and lifetime social security wealth using administrative records, which will be released to the user community for people in the early baby boom. A similar activity was done for people born in the Depression.

We have studied bias from using only matched data. We find there is a bias, and we are going to try and promote or create something for the health and retirement survey to allow the analytical community to deal with that bias.

So I think we are very active in pushing this and supporting it. We have tried to create a user community. I'm not sure I mentioned, but at least 158 HRS studies have been funded by us, and many of them are using this stuff.

What future directions? That was a question that was raised, future directions I see. We are trying synthetic data. There has been an effort with John Abowd in Census to create administrative data for the SIPP that would be released publicly. It involves all the SIPP panels for the '90s.

I'm not sure how well it will work. I suspect it will work for some things and not for others. I am somewhat skeptical about its use for the decision to retire and take up social security benefits, for example, which can be an idiosyncratic kind of thing, as opposed to measuring general wealth and well-being.

I think we are going to see -- I mentioned the assignment of linkage from government address files which Census is doing. That is being tried for the first time in the SIPP or CPS. That was done on ACS, but I don't think it was done on SIPP or CPS.

There is a broader health and retirement survey consent form that allows us to release more administrative data. Previously the lawyers who negotiated that kind of stuff made things very narrow, and would only allow release up to the moment that the consent was signed. We now have an agreement that the consent will include matching through 2021, which means that we will be able to follow these people forward into the future, which is what we currently are doing with the SIPP.

The 1984 SIPP, for example, those folks were interviewed a long time ago. We have earnings records, yearly earnings records through 2004. We have social security benefits through December of 2005. We have SSI benefits paid by Social Security through December of 2005. We keep updating our administrative file. Each spring we go and pull in another year of data, so we are constantly creating a very useful longitudinal data set from our administrative records.

Possible future linkages. I would love to see a linkage to the 1040 tax records, so we can identify the validity of some of the income being reported. I would love to see a match to the national compensation survey, which is a survey of employers about their pension plans. Potentially Medicare bills.

I did that once with the new beneficiary surveys. As Gerry said it was a very painful process because of the dirty data set. So I don't know, I would probably have to be pushed to do Medicare bills.

We currently have linked a lot of SSA records. We have a master beneficiary record, which is our Title 2. That is what most people call social security. For the disabled, we have the latest disability insurance information, including the first and second diagnosis of why someone was disabled. This is close to the ICD-9. Social Security has some slight differences.

I would just point out that in order to get disability from Social Security you have to have been examined by a doctor and proved that you are unable to work, that this is a medically determinable condition that lasts 12 months or will kill you.

It is a growth industry. We now have eight million beneficiaries, 6.2 million of them are disabled workers, 154,000 are spouses and 1.6 million are children of workers.

So this is a fairly concrete establishment. If somebody has poor health at this point, roughly a third of the folks are mentally impaired. Back in the old days when we did the new beneficiary survey it was ten percent. So that is quite a shift.

The SSI record, Title 16, we have every month the benefits since the program started. We have seven million SSI people. Four million are working age, two million are aged, and about a million are kids.

We have the Numident. Census tells you about the Numident for giving you the social security number. We use it for date of death. That is where death is recorded so we can tell death when people have departed our data set.

Summary earnings record, that is the record of -- actually, you never depart our data set. You always are in our data set. You depart active updating, shall we say. The summary earnings record is the record of earnings under social security taxes. That is what these people will receive their benefits on. That is what they are going to get paid from, so you have the real thing.

The detailed earnings record is our phrase for what is the W2 tax record. We have the summary of earnings record from 1951 through 2004; we have the detailed earnings record from 1982 through 2004. Recently researchers have discovered that they can pull out the amount of money that has been deferred for 401K type pensions. It is measurable on the form. They are starting to measure it, and it is a much more valid source of information on participation in defined contribution pensions than self reports and retrospective information might be.

We do have Medicare Part D extra help application information. It is feasible, but we haven't actually matched it up. I think that basically is it.

So we have what people are being paid by an agency that pays 46 million people benefits in Title 2 and in supplemental, seven million. We have what they really have earned, according to what is filed with the tax authorities. This is a fairly useful data set, and we run it longitudinally for all of the surveys we deal with, which is SIPP and CPS and HRS.

Rule for use. You have heard the Census rules for use fairly extensively, so I won't mention that.

The HRS is widely used because the researchers can do it in their own location. They have to have an approved research project. They have to have a federal grant or use a secured data center at Michigan; most get a federal grant. They have to have an isolated computer, they have to agree to restrictions on access and public release. They can't use geography with the SSA records except at the Michigan secure site. Social Security funds the HRS to send out a contractor to evaluate whether on an unannounced visit the use of the computer and the data are being complied with. They do get visited even at government employment.

Validity and reliability. I think that it is hard to doubt that our record of what you got paid for SSI and our record of what the Treasury issued to you for payments in terms of social security is better, is more reliable than the self reports. We have done several studies which are in the SIPP working paper series about the extent of bias and misreporting. It is not extensive, but it does have an effect, particularly on the SSI.

I think another thing that the data provide is that people are ignorant of a number of the details of the social security program. For people doing retirement research, sometimes those details are very, very important. I highlighted one thing that the man on the street would say is important, how much money did you contribute to your pension account. I doubt there are that many people who remember the dollar amount in the last 20 years.

Another reason is, it is a fairly high cost thing to collect longitudinal data. These are fairly high quality data, earnings 1951 to current year, W2s 1982. The SSA benefits start in 1962, the SSI benefits in 1974.

What is new? We have got this publicly synthesized administrative data which is somewhere in disclosure review. If the world works right it will be available in the next couple of months. For the health and retirement survey, we have a much more complete release of administrative data. A past practice had been to release a selected set of items. I think it was way too minimal.

We had evolved at Social Security a whole set of files that we developed for the SIPP and for our modeling. We are now releasing most of that information with the health and retirement survey. It was sent out to Michigan for the 2004 interview in the spring. There are some users who are starting to use it now.

That's it on that.

We were supposed to talk about barriers. I didn't put barriers in my thing, I would just as soon not write it down. I don't think there are any barriers internal to my organization. The only issue with my organization is that these records come from administering a program. They are not created for research, they are created to administer the program. So in order to use the records very well, you have to understand the program, and there are not very many people in the research community who understand the details of the program.

I think if you stick to how much was the payment and how much were the earnings, and when did the benefits start, everybody is going to be okay, but they do have their idiosyncrasies. We are doing internally an effort to document all these files and all these data. The last version was yay thick. We are updating it, and we should have this in the next six months. This effort has been going on for about nine months, involving specialists in the MBR, specialists in the SSR, specialists in these different files, and we will make that available to the research community who wants it.

Another barrier is, there is a public reluctance to provide social security numbers in these surveys. The last HRS got a 50 percent response rate for this, the SIPP got a 60 percent. We are going to fund HRS to do a better job and perhaps that will enhance things, but it is a problem. These are such valuable data, it would be tragic not to have them.

There is a difficulty getting other agencies to let us use their data in our matched data records. I think I expressed this morning, I consider our data security equivalent to breaking into Fort Knox. I really am serious about that. My agency is obsessed that the public can't get access to peoples' earnings and benefits records. I think that the computer security is incredibly difficult to break through. We take much more advanced methods. Anyway, it is just very, very hard.

So I have a hard time understanding why we wouldn't be able to bring in a data set and match it up for statistical purposes. For three years we have been trying to get the Bureau of Labor Statistics to let us use their national compensation survey. We first tried to get a copy to statistically match, not an exact match but a statistical match. We did it on our own. We worked through Census with CIPSEA. That all has failed, so now they are drawing up some agreement and our lawyers are talking to their lawyers, and I don't know if that will ever occur.

For three years we have been trying to get the 1040 records for a year of the CPS and a year of SIPP, so we can assess the validity of reports of asset income. Our statistical measures believe that there is an under report. The percentage reporting asset income is under reported by 15 percentage points. This is a very big deal to Social Security, because there is a statistic we calculate about how many people rely entirely upon Social Security for their income. If you just had one dollar in asset income you wouldn't be in that statistic, and we think it would cut our level in half if we had valid information. But we have been unable to get that, for some reason.

Disclosure you know is a problem. We use secure files. We support financially the synthetic data. We have made every effort to try and get the synthetic data to work. It would be nice if it would, I don't know if it will. We expect to have a contractor evaluate its utility when it finally is made available. So I think those are the barriers that we face, and those are the strengths and weaknesses of where things stand. I will stop.

DR. STEUERLE: Questions, comments?

MR. LOCALIO: You made an interesting comment about, their lawyers are talking to your lawyers. One of the problems I have felt since I have been on the committee for the last four years is that the lawyers never talk to the data people. There are some good reasons for this. Lawyers tend to hate data, they don't understand data. On the end of my tag here, what I used to do in my former life, before I climbed the data mountaintop and saw the promised land.

But I am finding that there are these lawyers out there who make pronouncements and write rules and regulations and policies and do not touch base with the people who have to use the data. Then they don't understand the practical problems or the implications of what they do. I don't think we have any agency lawyers here today, is that correct?

DR. IAMS: I want to defend our agency lawyers. We have been having to go through the agency lawyers on interagency agreements now for about two years. We are able to communicate with them fairly well.

This last one with BLS, maybe you are right, because I don't think they understand CIPSEA at all. It is partly under CIPSEA and it is using CIPSEA language. But they were able to call over and talk to each other and come up with some sort of agreement.

I noticed that BLS wants to review every piece of information that is created by my agency using their data. If that means that they have to approve what goes to the White House, I don't think we will have a deal.

It has been a very long and drawn-out process. For the most part I can understand what you are saying.

MS. TUREK: When you say synthetic data, I think of something polyester.

DR. IAMS: It's close.

MS. TUREK: What happens? How do you create synthetic data? Do you try to maintain the same distribution? If a person is in a program, do they stay in it? I suspect that we could have a whole conference on what do you mean by synthetic data.

DR. IAMS: When it was first sold to us four years ago, they were going to statistically change the survey data and keep all of our administrative records intact and whole.

What is now being reviewed by the disclosure board has statistically made up every single one of our administrative records, and all of the survey data except three data items. So you will have to go review the work of the statisticians who did the econometric modeling, that says that they think they retained the relationships.

As a multivariate analyst type, I am dubious and have been dubious, which is why I wanted what they originally said they were going to do. But in order to meet disclosure review, they got pushed and pushed and pushed until now it is all statistically made up, for the most part.

DR. STEUERLE: I should indicate, at Treasury for years we developed -- and still we had to use, I don't think they called them synthetic data, but they were statistically matched data sets. So they weren't exact matches, they were statistically matched. Tom Petska who will talk to you in a second has been involved with that somewhat.

The notion was, you preserve both the variance and covariance matrix for the original administrative data set and say a survey data set you had, but the link you used some sort of transportation algorithm. You have a cost function, you minimize the distance between wages or interest, stuff like that. Then you hope that the things you didn't minimize on were okay, which is where I think Howard was becoming quite skeptical.

But sometimes, if you are in the government and you have to make cost estimates, or you are in perhaps Medicare or some other place, you have to have a more complete data set. Sometimes we have no other choice.
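
[Editor's illustration, not part of the spoken record: the statistical match Dr. Steuerle describes pairs records from two files by minimizing a distance-based cost rather than linking on an identifier. The sketch below is a minimal version of that idea under stated assumptions. The records, the two matching variables (wages and interest), the absolute-difference cost function, and the names `distance` and `statistical_match` are all hypothetical. It solves a small unweighted one-to-one assignment by brute force; production matches of the kind described would more likely use a weighted transportation solver, and nothing here should be read as Treasury's actual method.]

```python
from itertools import permutations

# Hypothetical records: (wages, interest), in thousands of dollars.
admin_records = [(52.0, 1.0), (18.5, 0.0), (95.0, 6.5)]
survey_records = [(20.0, 0.2), (90.0, 5.0), (50.0, 1.5)]


def distance(a, b):
    """Cost of pairing two records: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def statistical_match(admin, survey):
    """Return (total_cost, pairs) for the cheapest one-to-one pairing.

    Brute force over every permutation, so only practical for tiny
    inputs. Assumes both files have the same number of records.
    """
    best_cost, best_perm = None, None
    for perm in permutations(range(len(survey))):
        cost = sum(distance(admin[i], survey[j]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(enumerate(best_perm))


total_cost, pairs = statistical_match(admin_records, survey_records)
# Each pair is (admin index, survey index) for records judged closest
# on the matching variables only.
```

[The pairing is only as good as the variables in the cost function. Any variable left out of `distance` is carried along with no guarantee that its relationship to the others is preserved, which is the source of the skepticism expressed above.]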

DR. IAMS: I think it will work very well for certain kinds of research questions. If you want to know if someone earned a lot of money in their lifetime and worked a lot of years, these data should be just fine. They are probably pretty close to the original.

If you are trying to do what a lot of labor economists are doing, which is predict exactly when someone applies for social security, when do they leave the labor force, that point of decision is going to be statistically smudged. I'm not sure it will be as good to research that subject.

DR. STEUERLE: Another way to say this, we have no idea what the standard deviation on any variable you create is.

DR. PETSKA: I'd just like to say a couple of things. Gene, that is exactly right. Treasury is still doing statistical matches. Treasury has a very strong interest in the CPS, and they cannot get the identifiable version of that, so they have been doing statistical matches for many years, back when Gene was there in the Tax Reform Act of 1986 and so on.

But in regard to your comments about access to 1040 data, I don't know if you were trying three years ago or for three years or whatever, but I would certainly be willing to talk more about that. As I think you know, I will be talking tomorrow morning at 9 o'clock in the first presentation. Access to tax data is closely regulated by Title 26, 6103 of the Internal Revenue Code. It requires an authorized statutory purpose.

Social Security is clearly in the code, but the question is, what about the purpose? Even though you get access to certain files, what about other files and the content? We agonize with Census Bureau repeatedly, and Gerry can vouch for this, about not only what files, but literally an item by item basis. There is a regs request to expand the items that is pending right now. The Bureau of Economic Analysis has the same thing.

So we can certainly talk about that off line and see where it is now. It has gone from your lawyers to our lawyers, and that is why it hasn't gone anywhere. Maybe reason can prevail.

The last comment I just wanted to make very briefly was, for such things as W2 data which go into earnings histories, there is a joint custodianship relationship. This is shared by the agencies. So if you appeal to Social Security to get access to the earnings histories, that has to come to us as well because those data are co-owned by the two agencies.

DR. IAMS: We have had a very amicable and longstanding relationship in dealing with those.

MR. RILEY: One additional on the synthetic data. As we try to match administrative records data to survey data, this is the problem. We can either release no administrative record, we can force people to come to an RDC or come to the area or figure out some way to get remote access, or we can create these synthetic files, which is one of the things I think they were trying to do with the SIPP.

One of the big problems with creating a synthetic file after the data set is released is, you haven't done any top coding to begin with, and you are trying to add these other data on, and it makes it much more complicated. So I think this is one of the things we are trying to do with the new system, is design it such that we know beforehand what is going to be released and what is not going to be released.

This is also one of the issues the CNSTAT panel, we have asked them to look at, are synthetic data good, what is going on with them, can we assess the quality of them in addition to the administrative data. So this is one of the things that really need to be looked at, which is why we need to get the data set out, so people can review this and get some information.

DR. STEINWACHS: I was just curious from the researcher side. Are there examples in which the analyses of synthetic data sets in terms of policy analysis have been done and published in the peer review literature? Or is there a pushback from reviewers about, once you indicate it is a synthetic data set versus quote-unquote a real data set?

MR. RILEY: I think there are some examples. John Abowd has a number of papers. They have used both the synthetic data and the actual data.

We currently have a variable that we have looked at, the multi-benefit analysis, which is from the security records, matched to the 2004 SIPP data to compare the mean. The key is the distribution in the covariances. You look at the means by about 20 different characteristics, and they are very similar, the administrative data versus the actual data.

So this is the stuff we are looking at. But I think if you go to the LEHD site on the Census website, you will see some papers that have tried to evaluate the synthetic data.
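
[Editor's illustration, not part of the spoken record: the check Mr. Riley describes, comparing means across subgroups in the synthetic and actual files, can be sketched as follows. The group labels, benefit amounts, and the function names `group_means` and `relative_gaps` are hypothetical and are not drawn from any SIPP or SSA file.]

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (group, monthly benefit) records for the two files.
actual = [("under65", 900.0), ("under65", 1100.0),
          ("65plus", 1400.0), ("65plus", 1600.0)]
synthetic = [("under65", 950.0), ("under65", 1070.0),
             ("65plus", 1380.0), ("65plus", 1650.0)]


def group_means(records):
    """Mean benefit for each group label."""
    groups = defaultdict(list)
    for group, value in records:
        groups[group].append(value)
    return {group: mean(values) for group, values in groups.items()}


def relative_gaps(actual_records, synthetic_records):
    """Relative difference (synthetic - actual) / actual per shared group."""
    a = group_means(actual_records)
    s = group_means(synthetic_records)
    return {g: (s[g] - a[g]) / a[g] for g in a if g in s}


gaps = relative_gaps(actual, synthetic)
```

[Close agreement in subgroup means is a necessary check, not a sufficient one. As noted above, the key is the distribution and the covariances, and two files can agree on every subgroup mean while differing in how variables move together.]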

MR. LOCALIO: I want to follow up on that. I have seen the technical papers in the field, but I don't think I have seen an applied paper presented using synthetic data.

I can tell you right now, if I saw one, I would be open minded about it only because I have read the technical data. But I would say that it is well beyond the capability of most journals to review papers like that. It is too new. It hasn't been tried out enough.

MR. RILEY: Applied journals.

MR. LOCALIO: Applied journals, that is correct.

DR. STEINWACHS: This may take you in a little bit different direction, but when we all talk, we always think about the social security number as the thing to link on if you can. There have been a number of comments about the difficulty of collecting it and so on.

I am just wondering how good the social security number is. If you say there is one person for the entire life, I always think of the witness protection program, where I assume it changes. I know that is a very small number hopefully of people. But there must be things that look funny when you look at social security numbers, where people have picked them up, used them, maybe don't use them. Some people have two or more of them eventually in their lifetime. Is that a tenth of a percent that you think are funny numbers, or is this more pervasive in our system?

DR. IAMS: I'm not in a position to say. I think the Office of the Inspector General at Social Security has had some statements or studies made about the validity of social security numbers. You know that it is a concern at the Homeland Security Administration.

I would say, we have something called the earnings suspense file, which has millions if not billions of dollars sitting in it, which comes about when someone's tax record comes in and something is wrong with the record. It doesn't quite match the name with the number, and it can't get deposited in the account.

Now, that is not an area I know a great deal about, but this was going on when I first went to work for the agency back in the 1980s when our agency lost a quarter of its work force. The suspense file went way up. I recall, at one point IRS refused to work on it and SSA refused to work on it, because they were both short of administrative funds in the '80s. Anyway, I would just go anecdotal.

It empirically exists. It is not trivial. If you go to the Inspector General reports, I know they have had work done on the earnings suspense file. That might give you a sense.

DR. COX: I can say that when we link the HIS data to SSA data, we did do a validation step first. While I don't think it is a huge problem, we see two areas of concern. One is elderly women who are using their husband's social security numbers because they don't have their own, or they receive benefits under his work history. We do see lower validation rates for SSNs provided by Hispanics.

DR. STEUERLE: Howard, turning to an area that you and I have worked on over the years, one thing you do have in Social Security is an ability to try to measure lifetime earnings. Although there are gaps, it is probably one of the best data sources anywhere to look at relative status of populations over long periods of time.

In fact, the former head of CMS did a study right before he took that job in which he tried to estimate whether Medicare was distributed in a progressive manner according to income. I can't remember the study, but I remember looking at it and thinking, if anything is a synthetic data set, because that is what he had, he had that.

You have Social Security which if linked to Medicare could come much closer to answering that question. So beyond answering that narrow question of that linkage of Medicare data to lifetime earnings to see studies on the distribution of Medicare benefits, are there other efforts being made to try to figure out how to use a lifetime earnings measure of well-being for a whole variety of health outcomes and measures?

DR. IAMS: I think that our retirement research consortium has a number of papers and analyses that use lifetime earnings. Because the Health and Retirement Study is in the public domain with minimal restrictions, there is a lot of use.

The economists who study that type of thing were using this when everything was in the public domain back in the 1960s, before Watergate led to the restrictions of the Privacy Act. They carried on that type of work.

Now, sometimes they have health as an outcome. Many times they are doing labor force retirement. Let me ask, Kalman Rupp, do you know of -- you do more in the disability and health area, are some of the people studying the outcome of disablement or health impairment or something along that line, Medicare, using a predictor of lifetime earnings?

MR. RUPP: (Comments off mike.)

DR. IAMS: I guess you have tapped a subject for the user research community to get into.

DR. STEUERLE: Does the SEER data file have lifetime earnings as a prognosticator of cancer?

DR. IAMS: You haven't matched it up to earnings records, have you?

DR. BREEN: We actually tried to link it with Social Security Administration records at one point. Maybe we tried to be too detailed, but not all of the data were electronic, so they weren't available for all the years.

Largely, and I think it will change as baby boomers retire, but it was available pretty much for white males. There weren't earnings histories for most of the people.

DR. IAMS: That sounds weird. I have been using these data for 20 years, and that has not been the case when you match it up to SIPP. We have something like 85, 90 percent match to SIPP and 75 percent match to CPS, and we don't have the kind of holes you are talking about.

DR. BREEN: What is the age range? Because cancer is a disease of older people.

DR. IAMS: Our data don't start until 1951, so we have earnings on people after 1950. If you were looking at people born in the First World War, there would be a lot of missing earnings.

DR. STEINWACHS: I am told that our contractual arrangements for the room end this afternoon at 4:45. You may be thankful for that.

I do want to thank all the speakers this afternoon. This has really been fantastic, a wonderful learning experience and a great exchange. Thank you very much.

DR. STEUERLE: Can I just add one thank you to Cynthia Sidney, who helped organize this?

DR. STEINWACHS: Hear, hear. I hope you will be here at 9 o'clock in the morning tomorrow morning when we will learn from the IRS how to solve all these problems. We will move on to the Veterans Administration, education and some other areas. Thank you very much.

(Whereupon, the meeting was recessed at 4:46 p.m., to reconvene Tuesday, September 19, 2006 at 9:00 a.m.)