Public Accessibility to EPA Statistical Data and its Usefulness


1. Tell the user more about the technical aspects of using the data.
2. Make the data easier to use.
3. Create and make available time series data.
4. Provide a more useful context for the data.

I. Discussion of the Pfiesteria Problem and Data Needs
This case study is built around a fictitious organization that represents chicken farmers of the Mid-Atlantic states. The Chicken Farmer's Association (CFA) looks out for the interests of the chicken farmers, provides support to them, and lobbies on their behalf. An issue of great importance to these farmers is the Mid-Atlantic problem of pfiesteria that has occurred in several rivers in this area (generally, EPA Region 3). We assume the role of statisticians who have been hired by the CFA to explore other possible causes for pfiesteria.
Some believe that the pfiesteria outbreaks are due to the manure-based fertilizers used on farms in the area. This manure is chicken-based, a byproduct of the chicken farming industry located there. The sale of manure provides additional income to farmers and also helps them dispose of chicken waste. The manure is rich in nitrogen and phosphorus, which are thought to fuel pfiesteria outbreaks. These nutrients are also detrimental to waterways because, when introduced in great quantities, they feed algae growth, which depletes oxygen in the water.
Several states, including Maryland and Virginia, have proposed legislation which would require farmers to adopt tighter regulations on waste products. This legislation would be costly to farmers in terms of both capital and operating expenses. The CFA argues that there is little evidence that the pfiesteria outbreak is the result of chicken manure. Pollutants could also come from a variety of industrial, municipal, and other sources.
To test our theory we want data on a river where a pfiesteria outbreak has occurred. For that river, we want to examine data on discharges from sources other than chicken farmers. We want data on excessive water pollutant discharges, and in particular, those that occurred during the periods of the pfiesteria outbreaks.
We think a good case study is the Pocomoke River, where a major
pfiesteria outbreak occurred. The Pocomoke is also a good choice
because it originates in Delaware and forms part of the border
between Maryland and Virginia. Thus, it represents a regional problem
and provides initial insight into the comparability of the
data between states. Additionally, some important
wetlands and ecosystems are directly tied to the health of the
river.

To relate other contaminants to pfiesteria, we need time series data that match water discharges in the area to the dates of pfiesteria outbreaks. Moreover, while water discharges may be a leading contributor to the pfiesteria problem, part of the problem originates in air and land media.
The most accessible data is obtained through the Internet. The least costly way for us to do research is to use the data collected by the EPA under its Permit Compliance System (PCS). PCS data is now available to the public from sites maintained by the EPA (Envirofacts) and by a private group (RTK.NET, from the Unison Institute).
Both sites provide about 15 different types of publicly available data sets, PCS among them. Both offer drag and click choices for identifying the region of a data request. Our initial reaction was that the RTK.NET data was the more easily accessible of the two.
Both systems, however, are oriented towards a more experienced Web user. We believe both sites need an interface that lets users target their data choices more precisely. The data from RTK.NET comprises over 100 variables, most of which are nominal and irrelevant to the question we pose.

III. Results
This section discusses right-to-know data, where to find it,
and how to get it.
A. Using Right-to-Know Data
We attempted to use environmental data that is available to the public through legislation that has been called "Right-to-Know." RTK.NET is the "Right-to-Know" Web site, run by a private organization affiliated with OMB Watch and the Unison Institute. RTK.NET downloads its information from the EPA site, so the information that RTK.NET possesses appears essentially the same, covering the same categories of reporting. The only difference is perhaps in time coverage: the RTK.NET time series may be longer than that of the EPA. According to an interview with the head of RTK.NET, EPA has deleted some old information due to computer storage limitations which are not faced by his organization.
Our impression is that another level of sorting options must be made available to the user. User choices could be expanded and other "user-friendly" features added. Here is a brief description of the process of getting data.
The RTK.NET web site is very accessible, and the options for sorting by geographic locale are quite easy. Actually managing and utilizing the data is another matter. The system does have a nice feature which sends the data for the chosen geographic locale to your email address.
The data is an ASCII file, which you can
easily download from email using the "save as" option. The file is
the entire data set (more than 100 categories) and can be
transmitted in either TAB or COMMA delimited format. From there
one can import the files into Quattro Pro or Excel (and presumably
SPSS and SAS) for data analysis.
B. What Does the Data Tell Us?
We collected PCS data for three counties bordering the Pocomoke, including both major and minor facilities: Somerset and Worcester Counties in Maryland and Accomack County in Virginia. Worcester had about 100 reporting facilities, Accomack 70, and Somerset 50.
The types of facilities that may also be contributing to the pfiesteria incidents are: Industry, Federal, Municipal, and Other. The types of industries that contribute pollution to the river include construction, asphalt, shipyards, container, fuel spill, and lumber.
The key data obtained regards the recorded violations of the facilities, and there are several categories that give this violation data. These include indicators such as the number of single event violations at the facility and the number of quarters the facility was found to be in non-compliance.
One problem is that the data provided covers the life of the facility, which itself can differ, and is not broken down by time or, for that matter, by owner (although such data does exist). While environmental problems are often the result of the aggregation of pollutants, and it may well be so with respect to pfiesteria, correlating pollution to health problems requires the use of time series data. Without time series data, any evaluation of the data is severely limited.
A more serious problem is in the summing of the data for the aggregates. These variables appear toward the end of the 100-plus categories reported for the PCS data. Thus, if any categories are missing or extra in the download, the column alignments in a spreadsheet such as Quattro Pro will be off. This misalignment in reading in the MOST critical variables renders them useless in any analysis. In fact, we found misalignment to be a common occurrence.
Where might the
pfiesteria problem actually lie if additional sources are
considered? In the data that are available, there were 27 instances of
single event violations and 17 quarters where facilities were not
in compliance in Somerset County. This gives Somerset County the
highest rate of violations among its
facilities. In terms of the absolute number of total violations,
Worcester County is the largest contributor with 73, well
above the 48 in Accomack County and the 44 in Somerset County.
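The comparison between absolute counts and rates can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the facility counts are the approximate ones given earlier ("about 100", 70, and 50), and the Somerset total follows the text in adding 27 single event violations to 17 quarters of non-compliance. The function and variable names are illustrative only.

```python
# Figures quoted in the text; facility counts are approximate ("about").
counties = {
    "Worcester": {"facilities": 100, "violations": 73},
    "Accomack": {"facilities": 70, "violations": 48},
    "Somerset": {"facilities": 50, "violations": 27 + 17},  # single events + quarters
}


def violations_per_facility(record):
    """Total recorded violations divided by the number of reporting facilities."""
    return record["violations"] / record["facilities"]


rates = {name: violations_per_facility(rec) for name, rec in counties.items()}

# Rank counties from the highest rate to the lowest.
ranked = sorted(rates, key=rates.get, reverse=True)

for name in ranked:
    print(f"{name}: {counties[name]['violations']} violations, "
          f"{rates[name]:.2f} per facility")
```

On these figures Somerset ranks first by rate (0.88 per facility) even though Worcester has the most violations in absolute terms. Because the facility counts are rounded, the rates should be read as rough indications only.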

C. Comparable and Additional EPA Data
Through examination of comparable and additional data, it is
possible to provide context for the pfiesteria case.
1. Comparable Data on the Web: Envirofacts
We attempted to access the same data set using EPA's
Envirofacts (EF) Web system, as an initial effort to compare data
sets on water discharges. As a comparative test we
attempted to retrieve the same PCS data for Somerset County,
Maryland. Envirofacts, like RTK.NET, has a click
and drag menu feature which is relatively easy to use. The EF data
comes with a greater degree of informational support than RTK.NET. This
support information is embedded within other related
features, including a useful mapping feature which creates maps of
the areas of interest to the user.
A cursory study of the informational records uncovers a high
degree of overlap in the types of data available on the EF and RTK.NET
sites. This overlap is not, however, across the board. RTK.NET
reports more data than EF, and some of that additional data is
critical to successfully completing this case study. EF reports on
the permit level per facility, but nowhere could we find reports on
the number of permit violations at the facility level as we did in
RTK.NET. In either system there appears to be no readily accessible
method for evaluating the actual levels of discharge, only whether or
not discharge levels represented a violation.

For both RTK.NET and EF data sets, we chose to obtain report information on all facilities, both major and minor. As noted, RTK.NET reported about 50 in Somerset County. However, EF reported only 19 major or minor facilities for Somerset County. Since the data is derived from a common source, the difference may well lie in the percentage of total records made available by the two systems. As a purely hypothetical example, one system might report on facilities with more than 10 employees and the other on facilities with more than 20.
The size of the reported sample is very important in terms of making reasonable claims of statistical significance. A statistician may be unconcerned about this since the conditions for meeting sampling criteria would not appreciably change. However, the general public cannot be expected to comprehend probability or inferential statistics. They will construct descriptive frequency counts and claim them as empirical proof.
This difference clearly displays the clash between the goals of
statistical purity and the institutional process. RTK.NET may have
chosen to target the more sophisticated user, who wants data for
analysis. EF may be more public-oriented thus providing less data
and more assistance in understanding it.
2. Additional Data on the Web
a. EPA's Surf Your Watershed
Data on water discharges is only part of the overall story in explaining other factors present in the pfiesteria outbreak on the Pocomoke River. In probing deeper into these other issues, we would also need to obtain data from Storet(x) which details the volume of water discharges and therefore provides context for the PCS discharge data.
Watershed data (Surf Your Watershed) may also provide critical background information. The level of aggregation for watershed data is, however, geographically too inclusive to provide additional insight into conditions on the Pocomoke River. For example, the watershed area for Somerset County is the lower Chesapeake Bay, but there is no "Pocomoke River" watershed or method by which to indicate it as an area of focus.
Here the need is for watersheds to be defined on a variety of levels of aggregation. The current levels cover large areas, but in fact, public interest is likely to center around much more specific areas. People will want to know about the rivers and streams in their back yards where their children play.
b. FedStat
We also used FedStat, which is a collection of databases from many Federal agencies, to seek out causes for the pfiesteria outbreak. Through the site search engine we input the search word "chicken".
The search revealed sites where chicken data (6 data sets) and chicken reports (2) were available. Most of these site references originate from the U.S. Department of Agriculture, from the NASS and ERS data bases.
One site we discovered through this system was a survey of Maryland chicken farmers provided by the Maryland Department of Agriculture. The survey found that two-thirds of Maryland farmers focused on poultry (67 percent). The remaining animal operations focused on swine and cattle, each at about 18 percent. The survey covered the Pocomoke watershed, or parts of Worcester, Wicomico and Somerset counties in Maryland.
Agriculture is the major industry in the area. About 85 percent of farmers apply manure to their fields, and 62 percent get that manure from another farmer. However, only 42 percent of cropland in the state receives manure. Therefore, it would be prudent to identify the leading farmers who apply manure to their fields and where that manure comes from.
The survey found that "there are no extraordinary conditions in the Pocomoke Watershed. Most farmers are protecting water quality in an appropriate manner, using current technology." This statement does not entirely correspond with the data from the IDEA dataset. According to IDEA data, Perdue, Holly, and Hudson were all cited in violation during the last three years. Were these firms included in the survey? If so, are they not extraordinary conditions?
3. Additional Data Not on the Web
Publicly available data on water discharges only indicates that
violations have occurred at facilities. We asked for more
information on possible sources of pfiesteria, beyond what was
available on the Web at the right-to-know sites maintained by the
Unison Institute (RTK.NET) and the U.S. EPA (Envirofacts).
We received a description of the IDEA dataset and four related data files for the three counties in the PFIESTERIA study. The IDEA dataset covers all types of emissions (from 12 differing datasets) by facility. There is one page of description plus several hundred pages of dreary field identifier printouts. I know nothing about IDEA, except for data retrieval, from this document.
Somerset County raw files show PCS data on inspections and violations by facility. Even with this added layer of data depth, there is still other data that is not being reported through either Envirofacts or IDEA. In all, there are 215 data fields in PCS. We have not seen the full extent of that data yet (IDEA User's Guide), and IDEA admits this is the case. We suspect there is more PCS data for Somerset County than what we have been given. In fact the report begins with the words "Unrestricted Dissemination."
Data are shown by facility, with violations for calendar years 1995-97. There are several types of violations, but no explanation of what is meant by the fields. The data do not seem to match the descriptors contained in the noted dreary field descriptions, despite their length (see Table 1).
Table 1
IDEA Reporting Fields
| | AllViols | EffVios | Insp | NOVs | AAs | JAs |
|---|---|---|---|---|---|---|
| CY 1995 | | | | | | |
| CY 1996 | | | | | | |
| CY 1997 | | | | | | |
Table 2 shows selected IDEA data for violations in the period 1995-97.
These violations do not represent a consistent time line because of
changing regulations. They can, however, provide some sense of direction over time.
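Table 2
Selected Data from IDEA
| Facility | Measure | 1995 | 1996 | 1997 |
|---|---|---|---|---|
| Pocomoke City Sewage Treatment | Violations | 1 | 3 | 1 |
| | Inspections | 2 | 1 | 11 |
| Crisfield Sewage Treatment | Violations | 1 | 3 | 1 |
| | Inspections | 2 | 0 | 1 |
| Perdue Farms | Violations | 2 | 2 | 5 |
| | Inspections | 5 | 3 | 3 |
| Hudson Foods | Violations | 19 | 19 | 27 |
| | Inspections | 13 | 5 | 8 |
| Holly Farms | Violations | 0 | 4 | 3 |
| | Inspections | 0 | 0 | 0 |
| Somerset County | Violations | 10 | 13 | 4 |
| | Inspections | 3 | 2 | 2 |
| Ocean City Waste Treatment Plant | Violations | 3 | 4 | 2 |
| | Inspections | 4 | 1 | 1 |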
For Somerset County, the following sites were found to be in violation ("all"). Note that this is the number of violations, but not any measure of magnitude.
There was also a RCRIS report on Carvel Hall Cutlery, but it contained no reports of either violations or inspections. If there were no data, why include it in the report? Further, RCRIS data is available for 1992-97.
Across all IDEA facilities, there were 100 reporters from Somerset County. According to a related enforcement action report, Pocomoke City Sewage Treatment had an "Administrative Consent Order" on 5/20/1991 and was closed and brought back into compliance via a consent decree on 2/13/97.
Further detail is provided in some instances. The Crisfield Sewage Plant had fecal coliform violations for 1996 and 1997. There were also 1996 violations for residue and flow. Perdue submitted PCS and RCRIS reports and Hudson Foods PCS and AFS reports. Holly Farms has two different PCS discharge permits and one TRI reporting file. Carvel Hall also reported emissions level data for various chemicals and mixtures including lead, nickel, chromium, cyanides and others, beginning in 1987.
Note how sharply the relation between inspections and violations differs between the two cities of Pocomoke and Crisfield and the sites with private industry.
The same data for Worcester County, Maryland, looks quite different. There was some reporting of minimal air emissions, but no violations or inspections occurred in the county. The disparity with Somerset County is startling.
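One way to make the relation between inspections and violations concrete is to total each facility's counts over 1995-97 and divide. The sketch below does this with the Table 2 values as printed, on the assumption that the first row for each facility gives violations and the second gives inspections. The ratio is a descriptive aid only and says nothing about the magnitude of any violation.

```python
# Table 2 values as printed: (violations, inspections) for 1995, 1996, 1997.
# Assumption: the first row for each facility is violations.
table2 = {
    "Pocomoke City Sewage Treatment": ([1, 3, 1], [2, 1, 11]),
    "Crisfield Sewage Treatment": ([1, 3, 1], [2, 0, 1]),
    "Perdue Farms": ([2, 2, 5], [5, 3, 3]),
    "Hudson Foods": ([19, 19, 27], [13, 5, 8]),
    "Holly Farms": ([0, 4, 3], [0, 0, 0]),
    "Somerset County": ([10, 13, 4], [3, 2, 2]),
    "Ocean City Waste Treatment Plant": ([3, 4, 2], [4, 1, 1]),
}


def summarize(violations, inspections):
    """Return total violations, total inspections, and violations per inspection.

    The ratio is None when no inspections were recorded, since dividing
    by zero would be meaningless.
    """
    total_v = sum(violations)
    total_i = sum(inspections)
    ratio = total_v / total_i if total_i else None
    return total_v, total_i, ratio


summary = {name: summarize(v, i) for name, (v, i) in table2.items()}

for name, (total_v, total_i, ratio) in summary.items():
    shown = "n/a (no inspections)" if ratio is None else f"{ratio:.2f}"
    print(f"{name}: {total_v} violations, {total_i} inspections, ratio {shown}")
```

Holly Farms stands out because violations were recorded with no inspections at all, so no ratio can be formed for it.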
We were also provided a file that showed the Judicial Docket Data for the two counties in Maryland and the one in Virginia. This file has a curious cadre of violators, outside of the others noted above, including:
two violators from the state of Delaware,
a violator from West Virginia,
the University of Maryland,
Maryland State Highway Administration,
Chesapeake & Potomac Telephone Co.,
Berlin Baptist Learning Center,
McReady Memorial Hospital,
7-11 Store,
about 50 violations with no recorded information,
the U.S. Department of Interior,
Westover School,
Eastern Correctional Institution,
Maryland Police Barrack V,
Goddard Space Flight Center,
Accomack nursing home,
a laundromat, and
NASA Launch facility.
The files indicated that the Somerset County Sanitary Facility was
fined $27,000 for violations. These reports covered 454
facilities.
Finally, there is a summary report that provides macro-information for the three counties and attributes of the IDEA database. First, here is the reporting for the counties by reporting mechanism.
Table 3 shows IDEA data by type of reporting mechanism.
| Program | Number |
|---|---|
| AFS | 24 |
| CER | 3 |
| DCK | 29 |
| DUN | 0 |
| FFI | 6 |
| FIN | 372 |
| LST | 0 |
| RCR | 184 |
| SET | 0 |
| TRI | 14 |
The data are also broken out by the Standard Industrial Classification (SIC) of each facility. Table 4 ranks the leading SIC codes for the region that includes the three counties.
Table 4
SIC Codes and Facilities
| SIC | Type | Number |
|---|---|---|
| 2092 | Frozen Seafood | 50 |
| 4952 | Sewerage Systems | 24 |
| 2091 | Canned/Cured Seafood | 10 |
| 0913 | Shellfish | 7 |
| 2048 | Animal Foods | 7 |
| 4911 | Electrical Services | 7 |
| 2015 | Poultry Slaughter | 6 |
TRI data is also indicated in the aggregate for the three counties. The data, however, are of little solace for those seeking consistent data trends. Ammonia emissions doubled between 1988 and 1994, and then dropped by one-half. Arsenic mysteriously is the same value for the years 1989-91. Chlorine emissions jump from 500 in 1989 to 3,600 in 1993. By 1995, the emissions had fallen to 541.
Some data is obviously wrong. Copper compounds show an erratic path (see Table 5). The trend shows increases between 1988 and 1992, followed by values that are largely unbelievable.
Table 5
Copper Compounds
| Year | Emissions |
|---|---|
| 1988 | 250 |
| 1989 | 502 |
| 1990 | 510 |
| 1991 | 510 |
| 1992 | 760 |
| 1993 | 0 |
| 1994 | 509 |
| 1995 | 3 |
The story for Ethylene Glycol is even more ominous in terms of trend reliability. It is clearly a statistician's nightmare: the data show absolutely no variation. Between 1991 and 1995 the emissions held constant at 250 units per annum.
Table 6
Ethylene Glycol
| Year | Emissions |
|---|---|
| 1991 | 250 |
| 1992 | 250 |
| 1993 | 250 |
| 1994 | 250 |
| 1995 | 250 |
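The kinds of irregularity described for copper compounds and ethylene glycol can be screened for mechanically. The sketch below is one minimal way to do so, using the values from Tables 5 and 6. The function name, the flag wording, and the threshold (a year-over-year fall to less than one-tenth of the prior value) are assumptions chosen for illustration, not an established screening rule.

```python
# Values as printed in Tables 5 and 6 (units are not stated in the source).
copper = {1988: 250, 1989: 502, 1990: 510, 1991: 510,
          1992: 760, 1993: 0, 1994: 509, 1995: 3}
ethylene_glycol = {1991: 250, 1992: 250, 1993: 250, 1994: 250, 1995: 250}


def screen(series, drop_fraction=0.1):
    """Return a list of plain-language flags for a {year: value} series.

    Flags a series with no variation at all, and any year in which the
    value falls below drop_fraction times the previous year's value.
    """
    years = sorted(series)
    flags = []
    if len(set(series.values())) == 1:
        flags.append("no variation")
    for prev, cur in zip(years, years[1:]):
        if series[prev] > 0 and series[cur] < drop_fraction * series[prev]:
            flags.append(f"abrupt drop {prev} to {cur}")
    return flags


print("Copper compounds:", screen(copper))
print("Ethylene glycol:", screen(ethylene_glycol))
```

A flag is only a prompt to look more closely. A reported zero, for example, could reflect a missing report rather than a real change in emissions.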
According to the data, sulfuric acid has been eliminated for the three counties. Between 1988 and 1994 there were no sulfuric acid emissions reported.

IV. Recommendations on Problems Associated with Carrying out the Case Study
Here are four recommendations about improving public access
to and use of right-to-know environmental data.
1. Tell the user more about the technical aspects of using the data.
There is simply not enough
information available on the process of downloading and utilizing
data from either Web site. We use Quattro Pro on the AU system.
The default on the RTK.NET system is tab delimited format, whereas
Quattro Pro supports a comma delimited format. We unfortunately
discovered this the hard way. There should be an explanation of
how to actually manage the data in various software packages as
well as introductory instruction in analyzing it. At the user end,
there should be a user-friendly choice of downloading the data in
readily accessible formats (for example, Quattro Pro, Excel, Word,
etc.).
2. Make the data easier to use.
The data is presented in a seemingly random order that confuses the user as
to which type of information is which. For example, the data fields in the
downloaded files are not accompanied by their headers when
imported, which means the headers must be imported from another file or
typed in by hand. The e-mailed data set includes a
hyperlink to the header categories, but this step is an
additional obstacle for the user as well as another
potential source of error in data use.
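The two problems described above, a delimiter that differs from what the software expects and data rows that arrive without headers, can be handled together when the file is read. The sketch below is a minimal illustration using Python's standard csv module. The three field names come from the coding key for this case study, but the sample rows and NPDES IDs are hypothetical, and a real PCS download has more than 100 fields.

```python
import csv
import io


def load_rows(text, field_names, delimiter="\t"):
    """Pair each data row with its field names.

    Rows whose number of fields differs from the number of names are set
    aside with their line number, rather than being silently misaligned.
    """
    good, misaligned = [], []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, row in enumerate(reader, start=1):
        if len(row) == len(field_names):
            good.append(dict(zip(field_names, row)))
        else:
            misaligned.append((line_no, len(row)))
    return good, misaligned


# Hypothetical three-field excerpt in tab delimited form.
field_names = ["npdes_id", "county", "num_single_viol"]
sample = (
    "MD0000001\tSomerset\t2\n"
    "MD0000002\tSomerset\n"        # one field missing: would shift columns
    "MD0000003\tWorcester\t0\n"
)

good, misaligned = load_rows(sample, field_names)
print(len(good), "usable rows")
print("misaligned (line, field count):", misaligned)
```

Checking the field count before attaching labels is what guards the violation counts at the end of each record against the spreadsheet misalignment described earlier.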
3. There needs to be readily usable time series data available.
There should also be a means by which to discriminate data by
time, as that is a feature which will be of constant concern. Data
will naturally need to be examined in terms of periodicity. This
information is determinable, but is not easily attainable in the
current data offering on the Web sites.
4. Provide a more useful context for the data.
There is context for the data, but it is often at levels too disparate from the level of data. In the Pfiesteria case study, there was a context, but the specific locations of the point source data could not link up to the eco-system level data of the context. There must be some discrimination in eco-system levels and scopes to provide a link to the point-source data.

Data Field Explanations
Coding Key for RTK.net Data Case Study: Pfiesteria
A=npdes_id (A unique alphanumeric which identifies either a permit or a facility)
B=region (Two digit code for EPA region in which the facility is located)
C=state (FIPS alphabetic state code (generated by PCS system))
D=permit_ind_cat_tr (Translation of permit_ind cat field)
E=inactive_status (Code indicating whether facility is currently active (I=inactive, A=active))
F=facility_name_1 (Official or legal name of facility (1st segment))
G=facility_name_2 (Official or legal name of facility (2nd segment))
H=facility_name_3 (Official or legal name of facility (3rd segment))
I=facility_name_4 (Official or legal name of facility (4th segment))
J=major_facility (Code indicating that the facility is a major discharger, M=major)
K=sic (Four-digit Standard Industrial Classification code for facility)
L=sic_tr (Translation of sic field)
M=major_rating (Numerical total of ranking points used to delineate major and minor facilities)
N=county (Name of the county in which the facility is located)
O=owner_type (Code for ownership classification)
P=owner_type_tr (Translation of owner type field)
Q=appl_type (Indicates the type of application form that the facility submitted)
R=appl_type_tr (Translation of appl_type field)
S=priority_epa_hq (Management tool used by EPA headquarters to assign priorities to facilities)
T=epa_or_state_perm (Indicates whether EPA (=E) or the state (=S) issued the permit)
U=facility_name (Name of entity located at facility's physical address)
V=facility_street_1 (First Line of address of physical location of facility)
W=facility_street_2 (Second line of address of physical location of facility)
X=facility_city (Name of the city or town in which the facility is physically located)
Y=facility_state (State or territory in which the facility is physically located)
Z=facility_zip (Zip code for address of physical location of facility)
AA=facility_phone (Telephone number of the facility)
AB=name_mail (facility name in the primary mailing address)
AC=street_1_mail (First line of primary mailing address of facility)
AD=street_2_mail (Second line of primary mailing address of facility)
AE=city_mail (City in the primary mailing address for the facility)
AF=state_mail (State in the primary mailing address of the facility)
AG=zip_mail (Zip code in the primary mailing address of the facility)
AH=hearing_status (Indicates evidentiary hearing anticipated or in progress for permit (I or A))
AI=hearing_file_num (EPA file number identifying the evidentiary hearing)
AJ=hearing_docket (Legal case number identifying evidentiary hearing)
AK=hearing_issue_1 (First of 3 codes for central issue causing evidentiary hearing)
AL=hearing_issue_1_tr (Translation of hearing_issue_1 field)
AM=hearing_issue_2 (Second of 3 codes for central issue causing evidentiary hearing)
AN=hearing_issue_2_tr (Translation of hearing_issue_2 field)
AO=hearing_issue_3 (Third of 3 codes for central issue causing evidentiary hearing)
AP=hearing_issue_3_tr (Translation of hearing_issue_3 field)
AQ=contact_name (Name/department of permittee's representative responsible for DMRs)
AR=contact_phone (Telephone number of the permittee's representative responsible for DMRs)
AS=issue_date (Date the first permit was issued for a facility)
AT=river_basin_tr (Translation of river basin field)
AU=river_segment (River segment or sub-basin (extension to river basin code))
AV=inactive_date (Date on which the facility became inactive or active)
AW=number_reissues (The number of times the permit has been re-issued)
AX=new_facility (Code indicating a new facility with no previous discharge permit)
AY=new_facility_tr (Translation of new_facility field)
AZ=new_date (Date that new source or new discharge began operation)
BA=receiving_waters (Name of river, stream, lake, or other body of water which receives discharge)
BB=grant_indicator (Identifies POTW with SIC code 4952 which obtained federal grant money (=$))
BC=final_limits (Indicates facility on final limits; when treatment construction complete (=F))
BD=latitude (Latitude of facility (degrees to tenths of seconds & direction DDMMSSTD))
BE=longitude (Longitude of facility (degrees to tenths of seconds & direction DDMMSSTD))
BF=design_flow (Average design flow for a facility (in million gallons per day))
BG=pretreat_req (Code indicating whether municipality is required to develop pretreatment prog)
BH=pretreat_req_tr (Translation of pretreat_req field)
BI=water_qual_limits (Indicates whether permit contains water quality based limits (Y=yes))
BJ=state_permit_num (Space available to state user to classify permits)
BK=nmp_schedule (Indicates whether Municipal Compliance Plan schedule made in accord with NMP)
BL=nmp_schedule_tr (Translation of nmp_schedule field)
BM=nmp_financial (Indicate financial fitness of POTW to comply with MCP in accord with NMP)
BN=nmp_quarter (Indicates fiscal quarter during which MCP schedule planned to be established)
BO=nmp_quarter_tr (Translation of NMP_quarter field)
BP=owner (Legal name of the person, firm or entity that owns the facility)
BQ=street_1_owner (First line of the address of the owner of the facility)
BR=street_2_owner (Second line of the address of the owner of the facility)
BS=city_owner (Name of the city or town in the address of the owner of the facility)
BT=state_owner (State or territory of the address of the owner of the facility)
BU=zip_owner (Zip code in the address of the owner of the facility)
BV=phone_owner (Telephone number of the owner of the facility)
BW=operator (Name of the person, firm, or entity that legally operates the facility)
BX=street_1_operator (First line of the street address of the operator of the facility)
BY=street_2_operator (Second line of the street address of the operator of the facility)
BZ=city_operator (Name of the city or town in which the facility's operator is located)
CA=state_operator (State or territory in which the facility's operator is located)
CB=zip_operator (Zip code in the address of the operator of the facility)
CC=phone_operator (Telephone number of the operator of the facility)
CD=control_auth_id (Control authority for enforcing pretreatment regulations)
CE=potw_id (NPDES ID of POTW that receives discharge (monitored by PPETS))
CF=hq01 (1st EPA Headquarters defined data field)
CG=dry_sludge_amount (Amount of sludge a facility produces in DMT/year, dry weight)
CH=sludge_class_ind (Classification assigned to facility producing sludge)
CI=sludge_cls_ind_tr (Translation of sludge_class_ind field)
CJ=sludge_fac_ind (Indicator identifying the type of sludge facility)
CK=sludge_fac_ind_tr (Translation of the sludge_fac_ind field)
CL=industrial_cat_tr (Translation of industrial category code)
CM=facility_type_tr (Translation of facility type code)
CN=epa_id (EPA ID for facility)
CO=water_basin
CP=num_enforcement (Number of enforcement actions for this permit)
CQ=num_dmr_viol (Number of DMR measurement records with violations (effluent or non-reporting))
CR=num_inspection (Number of inspections of this facility)
CS=num_limit (Number of limit records with this record's NPDES ID)
CT=num_outfall (Number of outfalls regulated under this permit)
CU=num_single_viol (Number of single event violations for this permit)
CV=num_compsched_viol (Number of compliance schedule violations for this permit)
CW=num_nc_quarter (Number of quarter years that facility was in noncompliance)
CX=city (City in which facility is located (updated by EPA))
**
Fields in bold are empty in the data set so there is no way
to match columns with labels.
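The coding key identifies each field by its spreadsheet column letters (A through CX). When the same file is read by a program rather than a spreadsheet, those letters have to be turned into column positions. The sketch below shows the standard conversion; the function name is illustrative.

```python
def column_number(letters):
    """Convert spreadsheet column letters to a 1-based column number.

    Letters work like a base-26 numeral with no zero: A=1 ... Z=26, AA=27.
    """
    number = 0
    for ch in letters.upper():
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


# Columns from the coding key that carry the violation data.
for letters, field in [("CQ", "num_dmr_viol"), ("CU", "num_single_viol"),
                       ("CW", "num_nc_quarter"), ("CX", "city")]:
    print(f"{letters} = column {column_number(letters)} ({field})")
```

By this count the key runs to 102 columns, and the violation fields sit at positions 95 through 101, which is why any missing or extra field earlier in a record shifts them out of place. With a zero-based list of fields, the value for column CU would be at index `column_number("CU") - 1`.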

The Next Case Study
We think one way to explore this case study is to follow up
on this trail of discovery by turning attention away from
sophisticated use by researchers and toward the problems of providing
accessible data that can be used. Therefore, we suggest the case
study continue, but this time with a focus on the Pfiesteria case
study within EPA itself, specifically within the Center for
Environmental Information and Statistics (CEIS), and on issues about
right-to-know data.

Nine Key Questions
This case study constitutes a good basis from which to answer the "Nine Key Questions" which form the basis for review of PCS and other EPA databases. What I do not find in this document is any clear identification of what type of user one should assume when answering these questions. If the user in mind is the average citizen, then perhaps we should use surveys or focus groups. We believe that the required level of accuracy is that of proof, both for attaining reasonable scientific findings and for the legal reasons that flow from scientific findings, especially those based on statistics. We will answer these questions in the context of an academic researcher, one therefore whose findings would need to be sufficient to stand as expert testimony in a court case or as proof of a statistical relationship. In this context, the data is either good or bad. We will assume good. We also assume that the data is publicly available, and we began with the data offered by a non-profit user of EPA data.
1. How Comprehensive is the Database?
Unknown. As a case study, comprehensiveness was antithetical to the scope of the research.
2. Can the Database Be Used for Spatial Analysis?
Maybe. There are spatial variables in the database. However, it is unknown as to their geographic exactness to produce cause and effect. Is the address on the report for the site of an event, the site of the nearest post office, or the corporate headquarters filing the report? Likewise, do municipal variables refer to the location of the event or the government office responding to the request? Furthermore, there are distinct state-by-state reporting characteristics that were found in this case that need to be addressed.
3. Can the Database be Used for Temporal Analysis?
No. Publicly available PCS data is inadequate for even constructing a time series, a key finding of our report. This data does exist, but the data on the Web has only limited time series indications. This is a function of both funding and protection of business and private interests.
4. How Consistent Are the Variables Over Space and Time?
Not enough. Time is distorted and space may be limited in the dataset.
5. Can Data Be Linked with Information from other Databases?
Absolutely. We were able to use PCS along with other data through facility reports provided. However the other data was not publicly-available.
6. How Accurate are the Data?
We did not investigate this, but we have proposed a follow-up project that would use RCRIS data to examine its relation to pfiesteria outbreaks.
7. What are the Limitations?
Is the data that is now available on the Internet of sufficient quality for scientific examination? At the moment, the answer is no.
8. How Can I Get Information?
Any Internet account with a search engine can find the data. We did not investigate ordering the data by phone in hard copy.
9. Is There Documentation?
Yes, but not very accessible.

