Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries - Chapter 2b

2. Methods and Data (continued)

2.5.2 Running the GeoCode Program

In testing the GeoCode program, we found that its performance was erratic, and the support staff at GeoLytics were unable to explain the variation. The primary problem was a lookup error, "Failed to open data member" (eFOM), which was returned for between two and six percent of the addresses we tested. On examination, we could find no syntax errors that would have prevented these records from being coded, and GeoLytics technical support could not explain why the errors occurred. However, when we ran the addresses that had received the eFOM error code through the GeoCode CD program a second time by themselves, they matched at a 100 percent success rate.
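
The rerun described above can be sketched in a few lines. This is only an illustration of the two-pass workflow, not the GeoCode CD interface: `geocode_batch` is a hypothetical callable standing in for one run of the program, and the record layout (a dict with `ADDRESS` and `ACCURACY` entries) is an assumption.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def geocode_with_retry(
    addresses: List[str],
    geocode_batch: Callable[[List[str]], List[Record]],  # hypothetical runner
    retry_code: str = "eFOM",
) -> List[Record]:
    """Run one pass, then rerun only the records that returned retry_code."""
    first_pass = geocode_batch(addresses)
    # Collect the addresses whose first-pass result carries the retry code.
    retry = [r["ADDRESS"] for r in first_pass if retry_code in r["ACCURACY"]]
    if not retry:
        return first_pass
    # Second pass: the failed addresses are submitted by themselves.
    second_pass = {r["ADDRESS"]: r for r in geocode_batch(retry)}
    results = []
    for r in first_pass:
        if retry_code in r["ACCURACY"] and r["ADDRESS"] in second_pass:
            results.append(second_pass[r["ADDRESS"]])
        else:
            results.append(r)
    return results
```

Keeping the first-pass order means the combined output still lines up with the input file, whichever pass produced each result.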

The GeoLytics GeoCode CD program lets the user choose among options that trade off completeness of address coverage against processing speed. To obtain maximum coverage, and thereby match as many addresses as possible, we ran the GeoCode CD program with the following options turned on:

  1. Allow phonetic match of state name.
    – The geocoder phonetically matches the full state name in an address (but not an abbreviation).
  2. Allow place-based ZIP code match.
    – If a street is not found in a ZIP, the geocoder scans other ZIP codes associated with the place (typically a city or a town) for a match.
  3. Allow phonetic match of street name.
    – The geocoder uses a phonetic match for street names (e.g., an input address with the street name "Maine St." is considered a match with Main St. in the database).
  4. Disregard parity for address match.
    – Normally, the geocoder matches even/odd addresses with even/odd address ranges. This option disregards this practice.
  5. Allow closest address match.
    – The geocoder finds the closest address range to match the house number (rather than an exact one).
  6. Allow fuzzy street type match.
    – The geocoder will match addresses with the same street name, even if the street types are different (e.g., Greenwood Drive is considered a match with Greenwood Road).
  7. Geocode no matter what.
    – If it cannot find an exact match, the geocoder will assign to the address the census coordinates associated with the center of a ZIP code (ZIP centroid; see footnote 12), or the center of a state (state centroid).
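
Options 4 and 5 are easiest to see with a small example. The following is a sketch of how parity and closest-range matching might behave. It is not GeoLytics's actual algorithm, and `AddressRange`, `match_house_number`, and the two flag names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AddressRange:
    low: int    # lowest house number on this side of the street segment
    high: int   # highest house number on this side of the street segment
    block: str  # identifier assigned to addresses in this range


def same_parity(a: int, b: int) -> bool:
    return a % 2 == b % 2


def match_house_number(
    number: int,
    ranges: List[AddressRange],
    disregard_parity: bool = False,  # hypothetical flag for option 4
    allow_closest: bool = False,     # hypothetical flag for option 5
) -> Optional[AddressRange]:
    # Exact match: the number lies inside a range and, unless parity is
    # disregarded, has the same even/odd parity as the range.
    for r in ranges:
        in_range = r.low <= number <= r.high
        if in_range and (disregard_parity or same_parity(number, r.low)):
            return r
    if allow_closest and ranges:
        # Fallback: choose the range whose nearest endpoint is closest.
        return min(
            ranges,
            key=lambda r: min(abs(number - r.low), abs(number - r.high)),
        )
    return None
```

With only an odd-numbered range 101-199 on file, house number 150 fails under the default settings and matches once parity is disregarded. House number 250 matches only when the closest-range fallback is allowed.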

The GeoCode program outputs two files as it runs: a text file (*.txt) summarizing the geocoder performance, the accuracy codes, and the error codes; and a database file (*.dbf) containing the fields selected by the user. For each database file, we selected the following fields (see footnote 13):

Field       Description
SEQNO       Sequential Number
ADDRESS     Input Address
ACCURACY    Accuracy and Error Codes
BLOCK       Matched Block Code
PLACE       Place FIPS Code
MCD         MCD (Minor Civil Division) Code
STATE       State FIPS Code
ZIP         ZIP Code for 2003
PLACENAME   Matched Place Name
AreaKey     Block Group Code

The sequential number field contains a number between 1 and n, where n is the total number of records processed by the program. The input address is the address in the STREET, CITY, STATE ZIP format constructed and output by the address cleaning SAS program. Accuracy and error codes are explained below. The matched block code is a string of fifteen digits that indicates, in order, an individual's state (2-digit FIPS code), county (3-digit FIPS code), census tract (6-digit FIPS code), and block (4-digit FIPS code, whose first digit indicates the block group). The full string constitutes a unique, block-level identifier: all persons living within the same block have the same matched block code. Place indicates the city or town FIPS code, and MCD indicates the Minor Civil Division code. The area key is the first twelve digits of the matched block code, rather than the full fifteen, and constitutes a unique block group-level identifier.
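
The layout of the matched block code described above can be illustrated with a short sketch. The function name and the sample value are hypothetical; only the digit positions come from the text.

```python
from typing import Dict


def split_block_code(block: str) -> Dict[str, str]:
    """Split a 15-digit matched block code into its components."""
    if len(block) != 15 or not block.isdigit():
        raise ValueError(f"expected 15 digits, got {block!r}")
    return {
        "state": block[0:2],      # 2-digit state code
        "county": block[2:5],     # 3-digit county code
        "tract": block[5:11],     # 6-digit census tract code
        "block": block[11:15],    # 4-digit block code
        "block_group": block[11], # first digit of the block code
        "area_key": block[0:12],  # block group-level identifier (AreaKey)
    }


# Illustrative value only, not taken from the EDB.
parts = split_block_code("371830501001004")
# parts["area_key"] == "371830501001", parts["block_group"] == "1"
```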


2.5.3 Summary of GeoCode Program Accuracy Codes

Failure details. The geocoding process can fail for a number of reasons, including setup or programmatic errors, a missing database entry, or an invalid input address. Failures fall under two general categories: syntax/lookup errors and programmatic/setup errors. Failed GeoCode results are indicated by error codes, which are summarized in Tables 2.6 and 2.7.

Table 2.6 GeoCode program syntax and lookup errors

Error Code  Error Message
eIHN        Missing or invalid house number*
eISt        Missing or invalid street name*
eITy        Missing or invalid street type
eINa        Missing or invalid city name
eISN        Missing or invalid state name/abbrev*
eIZI        Missing or invalid ZIP code*
eIAd        Incomplete or malformed address*
eUAF        Unknown address format
eMiA        Missing address
eNZI        Failed to lookup ZIP code
eANF        Address not found
eSNF        Street not found

*Errors encountered while geocoding EDB addresses.

Source: GeoLytics Incorporated of East Brunswick, New Jersey—GeoCode CD program 2003, Version 1.02.

Table 2.7 GeoCode program programmatic and setup errors

Error Code  Error Message
eGNO        GeoCode has not been opened
eFOD        Failed to open database
eFOF        Failed to open data file NAME
eFOM        Failed to open data member NAME*
eMiF        Missing file NAME
eGOF        General open failure, file NAME
eFA1        Failed to allocate memory
eNAS        No address data for state NAME*
eNSZ        No data for state-zip NAME
eSSO        String size overflow
eOKI        Output file kind invalid NAME
eOF1        Output failure NAME
eOLI        Output field list invalid NAME

*Errors encountered while geocoding EDB addresses.

Source: GeoLytics Incorporated of East Brunswick, New Jersey—GeoCode CD program 2003, Version 1.02.

Success details. The GeoCode program also indicates how successful it has been in matching addresses to FIPS codes. In addition to flagging accurate (exact) matches, it reports the kinds of "adjustments" it made in order to match an address to a place with a FIPS code. Successful match details are presented in Table 2.8. Some successful results generate accuracy codes indicating that the geocoder could code the address only by using one of the fallback matching options described above. It is worth noting that GeoCode CD may employ more than one of these fallback matching options to find a match for a particular address.

Table 2.8 GeoCode program accuracy codes and messages

Accuracy    Accuracy Message
aNP1        Place not found*
aNPa        Address match with no parity*
aCAd        Closest address match*
aFTy        Fuzzy street type match*
aPhM        Phonetic match*
aNMa        No match found
aNMP        No match performed
aPBZ        Place-based ZIP match*
aSpC        Spelling corrected*
aStC        State centroid used*
aSEn        Street end used*
aZIC        ZIP centroid used*
aInD        Inaccurate direction*

*Accuracy options encountered while geocoding EDB addresses.

Source: GeoLytics Incorporated of East Brunswick, New Jersey—GeoCode CD program 2003, Version 1.02.
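
In Tables 2.6 through 2.8, every error code begins with "e" and every accuracy code begins with "a", so the prefix alone separates failures from adjusted matches. The sketch below uses that pattern. It assumes the ACCURACY field holds one or more codes separated by whitespace, which the source does not state, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def split_codes(accuracy: str) -> Tuple[List[str], List[str]]:
    """Return (error_codes, accuracy_codes) found in one ACCURACY value."""
    codes = accuracy.split()  # assumption: codes are whitespace-separated
    errors = [c for c in codes if c.startswith("e")]
    adjustments = [c for c in codes if c.startswith("a")]
    return errors, adjustments


def tally_codes(values: Iterable[str]) -> Counter:
    """Count every code across records.

    One record can carry several codes, so the total of the counts
    can exceed the number of records.
    """
    counts: Counter = Counter()
    for value in values:
        counts.update(value.split())
    return counts
```

A tally of this kind counts codes rather than records, so category totals need not sum to the number of addresses geocoded.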

Test results using the GeoCode program on the CAHPS sample addresses. Table 2.9 below summarizes the error and accuracy results from the CAHPS sample test file. Of the 830,728 CAHPS sample addresses taken from the EDB, 8.4 percent were dropped because the GeoCode program could not code them, very often because they had a box number instead of a street address. Of the remaining 760,961 addresses (91.6 percent of the original total), all but 0.4 percent were successfully geocoded. Overall, the process we followed in this test matched 91.2 percent of the EDB addresses to Census block group-level FIPS codes.

Table 2.9 Summary of GeoCode error and accuracy results for the CAHPS test file

CAHPS/EDB Test File

Results                                                   Number      Percent
Original number of records                                830,728     100.0
Number of records dropped (uncodeable)                    69,767      8.4
Addresses processed                                       760,961     91.6
...Successfully geocoded (first iteration)                719,220     94.5
...Successfully geocoded eFOM records (second iteration)  38,322      5.0
...Total failed                                           3,419       0.4
GeoCode success rate                                      757,542     99.6
Percent total test file records matched                               91.2

Success details*
Accurate Match                                            477,746     62.8
Place Not Found                                           77,273      10.2
Address match with no parity                              5,931       0.8
Closest address match                                     37,984      5.0
Fuzzy street type match                                   86,701      11.4
Phonetic match                                            37,847      5.0
Place-based ZIP match                                     16,519      2.2
Spelling corrected                                        0           0.0
State centroid used                                       905         0.1
Street end used                                           3,871       0.5
ZIP centroid used                                         63,031      8.3
Inaccurate direction                                      20,525      2.7

Failure details
Failed due to syntax error                                3,418       0.4
...Missing or invalid house number                        3,367       0.4
...Missing or invalid state name/abbreviation             0           0.0
...Missing or invalid ZIP code                            47          0.0
...Incomplete or malformed address                        4           0.0
Failed due to lookup error                                38,323      5.0
...Failed to open data member (eFOM)                      38,322      5.0
...No address data for state                              1           0.0

*Note: Success detail categories reflect distribution of accuracy codes. These codes are NOT mutually exclusive. Some addresses can have up to four accuracy codes associated with them.

Source: Result of running GeoCode CD program 2003 Version 1.02 on addresses from Medicare EDB from mid-2003 for respondents to the Medicare CAHPS fee-for-service, managed care enrollee, and disenrollee surveys for 2000-2002.
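
The headline percentages in Table 2.9 follow from four counts in the table. A short check of the arithmetic, with variable names chosen here for illustration:

```python
# Counts taken from Table 2.9.
original = 830_728
dropped = 69_767
first_pass = 719_220
second_pass = 38_322  # eFOM records matched on the second iteration

processed = original - dropped
geocoded = first_pass + second_pass
failed = processed - geocoded


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as in the table."""
    return round(100 * part / whole, 1)


summary = {
    "processed": processed,                              # 760,961
    "geocoded": geocoded,                                # 757,542
    "failed": failed,                                    # 3,419
    "dropped_pct": pct(dropped, original),               # 8.4
    "processed_pct": pct(processed, original),           # 91.6
    "success_rate_pct": pct(geocoded, processed),        # 99.6
    "matched_of_total_pct": pct(geocoded, original),     # 91.2
}
```

The success rate (99.6 percent) uses the addresses processed as its denominator, while the overall match rate (91.2 percent) uses the original number of records.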


2.5.4 Application of the GeoCode Program Processing to the Full EDB

We obtained the 10 segments of the full unloaded EDB from CMS in mid-2003. Because each segment contained more than four million beneficiary records, we processed each segment separately: we first extracted the addresses and other necessary identification variables from the EDB, then corrected the addresses using the SAS programs we developed, and finally ran them through the GeoCode program. The program took from 16 to 36 hours to process and match the records in each segment. As in the test on the CAHPS sample addresses, it was necessary to rerun the addresses that returned an eFOM error on the first iteration, and virtually all of them were successfully matched on the second iteration.

Run EDB segments through the GeoCode program. The results of the GeoCode program processing are summarized in Table 2.10 for all 10 segments of the unloaded EDB combined. The results were very similar across the 10 segments. Overall, 87.5 percent of the 41,742,407 addresses of Medicare beneficiaries were processed through the GeoCode program. Of the addresses processed, 99.2 percent (36,223,053) were successfully matched to a FIPS code that included the block group, which is 86.8 percent of all EDB addresses. As Table 2.10 shows, 61 percent of the matches were exact matches to the input addresses.

Import GeoCode output files and merge with EDB records. We used PROC IMPORT in SAS 8.2 to transform the database (*.dbf) files produced by the GeoCode program into SAS data files (*.sas7bdat). Using the ADDRESS field that we prepared from the EDB as input to the GeoCode program as the common key (common to the EDB and the GeoCode output), we merged the output files (which contain Census-based geographic identifiers, including the AreaKey number string that identifies block groups) onto the EDB records.
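
The merge step was carried out in SAS. As a language-neutral illustration of the same idea, the sketch below joins geocoder output onto beneficiary rows using ADDRESS as the key. It is not the authors' program: the function name is hypothetical, and any EDB field other than ADDRESS is an assumption.

```python
from typing import Dict, Iterable, List


def merge_on_address(
    edb_rows: Iterable[Dict[str, str]],
    geocode_rows: Iterable[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Left-join GeoCode output onto EDB rows using ADDRESS as the key."""
    # Index the geocoder output by address. Identical input addresses are
    # assumed to yield identical geographic identifiers.
    by_address = {row["ADDRESS"]: row for row in geocode_rows}
    merged = []
    for row in edb_rows:
        geo = by_address.get(row["ADDRESS"], {})
        # EDB fields take precedence if a field name appears in both inputs.
        merged.append({**geo, **row})
    return merged
```

Because the join is on the address string, every EDB row is kept, and rows whose address has no geocoder result simply carry no geographic identifiers.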


2.5.5 Results of Geocoding the Sample of 1.96 Million Medicare Beneficiaries

The sample of 1.96 million Medicare fee-for-service beneficiaries is a subset of the beneficiaries geocoded from the mid-2003 EDB. The results of the geocoding for the 1.96 million are presented in Table 2.11. Although the table indicates that 81 percent (1,588,607 out of 1,960,121) of the addresses for the sample members were successfully geocoded, this figure allows the use of ZIP code and state centroids when there was no other way to match the input address to a Census-listed address. It should be noted that we reran the unmatched addresses from the mid-2003 EDB, as well as addresses that had changed since mid-2003, through the GeoCode CD program in the hope of geocoding sample members more completely and correctly.

We know from analyses performed in sub-task one of this task order that most of the state centroid matches (4,090) are not true matches at all; the GeoCode CD program forced foreign addresses to a state centroid. The same may be true of some of the ZIP centroid matches (159,217) as well. However, based on our validation of address block group matching against the Census, we are confident that the true match rate at the block group level for the sample is most likely at least 75 percent.

Table 2.10 Summary of GeoCode error and accuracy codes for the 10 segments of the EDB combined

Results                                                   Sums        Percent
Original number of records                                41,742,407  100.0
Number of records dropped (uncodeable)                    5,223,766   12.5
Addresses processed                                       36,518,641  87.5
...Successfully geocoded (first iteration)                35,108,329  96.1
...Successfully geocoded eFOM records (second iteration)  1,114,724   3.1
...Total failed                                           295,588     0.8
GeoCode success rate                                      36,223,053  99.2
Percent total EDB records matched                                     86.8

Success details*
Accurate Match                                            20,028,633  61.0
Place Not Found                                           3,216,868   9.8
Address match with no parity                              281,554     0.9
Closest address match                                     1,821,893   5.5
Fuzzy street type match                                   3,919,792   11.9
Phonetic match                                            1,752,858   5.3
Place-based ZIP match                                     799,836     2.4
Spelling corrected                                        10          0.0
State centroid used                                       47,252      0.1
Street end used                                           181,270     0.6
ZIP centroid used                                         2,972,274   9.0
Inaccurate direction                                      1,027,377   3.1

Failure details
Failed due to syntax error                                262,176     0.8
...Missing or invalid house number                        175,561     0.5
...Missing or invalid state name/abbreviation             4           0.0
...Missing or invalid ZIP code                            86,335      0.3
...Incomplete or malformed address                        276         0.0
Failed due to lookup error                                1,022,267   3.4
...Failed to open data member (eFOM)                      1,018,483   3.4
...No address data for state                              3,784       0.0

*Note: Success detail categories reflect distribution of accuracy codes. These codes are NOT mutually exclusive. Some addresses can have up to four accuracy codes associated with them.

Source: Result of running GeoCode CD program 2003 Version 1.02 on addresses from Medicare EDB from mid-2003 for respondents to the Medicare CAHPS fee-for-service, managed care enrollee, and disenrollee surveys for 2000-2002.

Table 2.11 Success with Geocoding of the Medicare Beneficiaries Included in the RTI Sample of 1.96 Million

Sample                    Number
Total Sample              1,960,121
Successfully geocoded     1,588,607
GeoCoding Success Rate    81.0%

Success Details
Exact Match               920,390
Other Accuracy Code       504,910
Zip Centroid              159,217
State Centroid            4,090

Source: Result for sample of 1.96 million of running GeoCode CD program 2003, Version 1.02 on addresses from Medicare EDB from mid-2003.
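
The figures in Table 2.11 can be cross-checked with simple arithmetic. The sketch below sums the four success categories, recomputes the success rate, and shows what share of the successful matches relied on a centroid fallback. Variable names are illustrative.

```python
# Counts taken from Table 2.11.
total_sample = 1_960_121
success_details = {
    "Exact Match": 920_390,
    "Other Accuracy Code": 504_910,
    "Zip Centroid": 159_217,
    "State Centroid": 4_090,
}

# The four categories add up to the "Successfully geocoded" row.
geocoded = sum(success_details.values())                  # 1,588,607
success_rate = round(100 * geocoded / total_sample, 1)    # 81.0

# Share of successful matches that used a ZIP or state centroid.
centroids = success_details["Zip Centroid"] + success_details["State Centroid"]
centroid_share = round(100 * centroids / geocoded, 1)     # 10.3
```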


12. The centroid of a 5-digit ZIP code area is the balance point of the polygon formed by its boundaries. The centroid is calculated based on the coordinate extremes of the polygon.

13. One field we did not include, the MATCH field, contained the full address that the GeoCode search engine determined to be the closest match to the input address. We had intended to include this field, but during the testing phase, we discovered problems with the MATCH field that led to major problems when trying to transform the *.dbf files into SAS files.



Current as of January 2008
Internet Citation: Chapter 2b: Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries - Chapter 2b. January 2008. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/research/findings/final-reports/medicareindicators/medicareindicators2b.html