Abstract
This chapter describes the major data sets used in this monograph: the Human Mortality Database, data from the National Center for Health Statistics of the United States for the analysis of causes of death, and the individual-level, longitudinal data of the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute of the United States. The SEER data are used to illustrate the dynamics of cancer survival.
3.1 Data
3.1.1 Human Mortality Database
Most of our analyses are based on data from the Human Mortality Database ("HMD", 2017), which can be freely accessed after registration at http://www.mortality.org. The database is a collaborative project of research teams from the Department of Demography at the University of California, Berkeley (USA) and the Max Planck Institute for Demographic Research in Rostock (Germany). It contains aggregate mortality statistics such as death counts, population estimates, exposure-to-risk estimates, and life tables, as well as some other statistics, for more than 35 countries (see Table 3.1). Further distinctions into subpopulations are possible for some countries such as Germany (East and West Germany), the United Kingdom (England and Wales, Northern Ireland, Scotland) or New Zealand (Maori, Non-Maori). The database focuses on highly developed countries.
Since its launch in 2002, the HMD has become the gold standard for the aggregate-level (demographic) analysis of mortality. Apart from the diligent collection of data, its widespread adoption can mainly be attributed to two reasons: (1) Rigorous quality checks are conducted before new data are added to the database. (2) The biggest asset of the HMD is that it does not simply publish processed data. Instead, the HMD estimates life tables and other statistics itself from raw data, applying the same set of methods throughout. Thus, differences over time or across regions cannot be attributed to differing methodologies, for instance, in how the life table was closed (HMD 2007).
As some life tables in the HMD are smoothed at ages 80 and higher, we did not rely on life table estimates at all. Instead, we used exclusively the death counts and the corresponding exposures from the HMD on a 1-calendar-year by 1-age-year grid to estimate death rates. Most of our analyses deal with mortality developments since 1950. We selected this threshold year because more data are available than for earlier periods. Furthermore, it also marks the beginning of a new era: Most gains in life expectancy are nowadays due to survival improvements among the elderly (Christensen et al. 2009), a development that was virtually non-existent before the middle of the twentieth century. Kannisto (1994), for instance, estimated that the onset of sustained decline in old-age mortality occurred for women in Switzerland, Belgium and Sweden in 1956.
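The estimation of death rates from death counts and exposures can be sketched in a few lines. This is a minimal illustration in Python, not the code used for the monograph (which was written in R); the function name `death_rates` and all numbers are hypothetical and are not HMD values. It assumes that the death rate in each age-year cell is simply deaths divided by exposure.

```python
# Hypothetical death counts and exposures on a 1-age-year by 1-calendar-year
# grid, keyed by (age, year). The numbers are invented for illustration only.
deaths = {
    (80, 1950): 120.0,
    (80, 1951): 110.0,
    (81, 1950): 130.0,
    (81, 1951): 125.0,
}
exposures = {
    (80, 1950): 1000.0,
    (80, 1951): 1000.0,
    (81, 1950): 1000.0,
    (81, 1951): 0.0,  # a cell without any exposure
}


def death_rates(deaths, exposures):
    """Return deaths / exposure per (age, year) cell.

    Cells with missing or zero exposure get None instead of a rate,
    so that no division by zero occurs.
    """
    rates = {}
    for cell, d in deaths.items():
        e = exposures.get(cell, 0.0)
        rates[cell] = d / e if e > 0 else None
    return rates


rates = death_rates(deaths, exposures)
for (age, year), m in sorted(rates.items()):
    print(age, year, m)
```

Working from counts and exposures rather than from published life tables keeps any smoothing at high ages out of the estimated rates.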
As shown in Table 3.1, total deaths range from barely 100,000 (Iceland) to more than 130 million in the United States. We analyzed all countries except Chile and the Maori population of New Zealand, which we excluded because of problematic data quality (Jdanov et al. 2008) and, in the case of Chile, the low number of years covered. Nevertheless, we did not include figures for all countries and both sexes, as this would have added hundreds of pages to the monograph. Instead, we typically restricted ourselves to a few examples that feature interesting characteristics.
3.1.2 Cause-Specific Death Counts in the United States
The National Center for Health Statistics of the United States provides a unique collection: Individual death counts by sex, age at death, year of death, cause of death, and many more characteristics can be freely downloaded from its web page. The data have been available in annual files since 1968. Additionally, the website of the National Bureau of Economic Research (NBER) provides data from 1959 onward, which we used in our analyses. The last year in our analysis is 2014. With the exception of 1972, when only a 50% sample was taken, each file contains all deaths in the United States. In the analysis by cause of death in later chapters of this volume, we simply multiplied the number of deaths for a given age, sex, and cause in the year 1972 by a factor of 2.
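The adjustment for the 1972 sample amounts to a single rescaling step. The following Python sketch illustrates it under stated assumptions: the function name `adjust_for_sampling`, the key layout `(year, age, sex, cause)`, and the counts are hypothetical and chosen only for illustration.

```python
def adjust_for_sampling(counts, sampled_year=1972, factor=2):
    """Return a copy of `counts` with the sampled year scaled up.

    `counts` maps (year, age, sex, cause) to a number of deaths.
    Only cells from `sampled_year` are multiplied by `factor`;
    all other years are copied unchanged.
    """
    return {
        key: (n * factor if key[0] == sampled_year else n)
        for key, n in counts.items()
    }


# Invented counts for illustration; these are not NCHS values.
raw_counts = {
    (1971, 65, "f", "Heart"): 400,
    (1972, 65, "f", "Heart"): 195,  # from the 50% sample
    (1973, 65, "f", "Heart"): 380,
}
adjusted_counts = adjust_for_sampling(raw_counts)
print(adjusted_counts[(1972, 65, "f", "Heart")])
```

Doubling treats the 50% sample as representative within each age, sex, and cause cell, so counts for 1972 carry more sampling noise than those of neighboring years.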
Causes of death are coded by the so-called "International Classification of Diseases" (ICD). Since its introduction in the late nineteenth century, the system has been revised at irregular intervals (Meslé 2006). The tenth revision is currently used. During the first years of our analysis, ICD-7 was used. ICD-8 was in effect in the United States between 1968 and 1978, followed by ICD-9 from 1979 until 1998.
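For processing the annual files, the revision in force can be looked up from the year of death. The sketch below is a hypothetical Python helper, not part of the original analysis. The periods for ICD-8 and ICD-9 are those stated above; the ranges 1959 to 1967 for ICD-7 and 1999 to 2014 for ICD-10 are inferred from the surrounding text and should be read as assumptions.

```python
def icd_revision(year):
    """Return the ICD revision assumed for US deaths in `year`.

    Covers only the observation period 1959-2014 of this analysis.
    ICD-8 (1968-1978) and ICD-9 (1979-1998) follow the text; the
    ICD-7 and ICD-10 ranges are inferred and therefore assumptions.
    """
    if not 1959 <= year <= 2014:
        raise ValueError("year outside the observation period 1959-2014")
    if year <= 1967:
        return 7
    if year <= 1978:
        return 8
    if year <= 1998:
        return 9
    return 10


for y in (1959, 1968, 1979, 1999, 2014):
    print(y, "ICD-%d" % icd_revision(y))
```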
Obtaining consistent time series of causes of death across ICD revisions requires meticulous work and care (e.g., Meslé and Vallin 1996; Pechholdová 2009). We therefore decided to use only very broad categories for causes of death and followed primarily the coding of Janssen et al. (2003) and of Meslé and Vallin (2006a). Both papers include an appendix with detailed ICD codes across the four revisions required in our analysis.
Table 3.2 is split into two halves. The upper panel provides the ICD codes we used to extract the causes of death, whereas the lower panel lists the number of deaths in absolute and relative terms for the selected causes by sex.
Nr. | Cause | Total: Counts | Total: % | Female: Counts | Female: % | Male: Counts | Male: % |
---|---|---|---|---|---|---|---|
(1) | All causes | 118,678,283 | (100.00) | 56,432,184 | (100.00) | 62,246,099 | (100.00) |
(2) | Circulatory dis. | 52,668,448 | (44.38) | 25,985,900 | (46.05) | 26,682,548 | (42.87) |
(3) | Heart | 40,342,012 | (33.99) | 19,072,073 | (33.80) | 21,269,939 | (34.17) |
(4) | Cerebrovasc. | 9,381,071 | (7.90) | 5,430,076 | (9.62) | 3,950,995 | (6.35) |
(5) | Other | 2,945,365 | (2.48) | 1,483,751 | (2.63) | 1,461,614 | (2.35) |
(6) | Cancers | 25,722,893 | (21.67) | 12,096,049 | (21.43) | 13,626,844 | (21.89) |
(7) | Breast | 2,067,878 | (1.74) | 2,050,192 | (3.63) | 17,686 | (0.03) |
(8) | Lung | 6,393,007 | (5.39) | 2,260,023 | (4.00) | 4,132,984 | (6.64) |
(9) | Colorectum | 2,884,519 | (2.43) | 1,458,772 | (2.59) | 1,425,747 | (2.29) |
(10) | Other | 14,377,489 | (12.11) | 6,327,062 | (11.21) | 8,050,427 | (12.93) |
(11) | Resp. diseases | 9,566,798 | (8.06) | 4,457,141 | (7.90) | 5,109,657 | (8.21) |
(12) | Motor vehicle acc. | 2,538,449 | (2.14) | 742,599 | (1.32) | 1,795,850 | (2.89) |
(13) | Other | 28,181,695 | (23.75) | 13,150,495 | (23.30) | 15,031,200 | (24.15) |
Our database consists of more than 118 million deaths. Although we selected very few causes, they account for about three quarters of all deaths (Category 13, "Other", is 23.75%). A bit more than 44% of all deaths are classified as originating from circulatory diseases. Within that category, heart diseases account for about one third of all deaths for women and men alike. The almost 10 million deaths from cerebrovascular diseases between 1959 and 2014 represent about eight percent of all deaths. The most common cerebrovascular disease is stroke. Malignant neoplasms ("cancer") are the second largest chapter in the ICD. Regardless of the sex of the decedent, about one in five deaths belongs to that category. We selected three prominent cancer sites: breast, lung, and colorectum. Note that while far more women than men die from breast cancer, more than 17,000 men also died from it during the 56 years of our observation period. At approximately 8% of all deaths, respiratory diseases are slightly more common than cerebrovascular diseases. Although motor vehicle accidents are not a major cause of death (2%), we included them because they turned out to be an interesting case study for seasonality in deaths, which we analyze in Chap. 9.
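The shares quoted in this paragraph can be recomputed from the top-level counts in Table 3.2. The following Python sketch does so for both sexes combined; the variable names are illustrative, and the counts are copied from the "Total" column of the table.

```python
# Top-level causes from Table 3.2 (both sexes combined).
total_deaths = 118_678_283
selected_causes = {
    "Circulatory diseases": 52_668_448,
    "Cancers": 25_722_893,
    "Respiratory diseases": 9_566_798,
    "Motor vehicle accidents": 2_538_449,
}
other_causes = 28_181_695

# The selected causes plus "Other" must add up to all deaths.
selected_total = sum(selected_causes.values())
assert selected_total + other_causes == total_deaths

# Percentage share of each selected cause among all deaths.
shares = {
    cause: 100 * n / total_deaths for cause, n in selected_causes.items()
}
selected_share = 100 * selected_total / total_deaths

for cause, share in shares.items():
    print(f"{cause}: {share:.2f}%")
print(f"All selected causes: {selected_share:.2f}%")
```

The combined share of the selected causes is the complement of the "Other" category, which is where the figure of about three quarters comes from.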
3.1.3 SEER Cancer Register Data 1973–2011
The Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute of the United States gives researchers access to individual-level longitudinal data on the incidence of cancer, including information about the survival of patients. The data coverage (the SEER data start in 1973) and the large size of the data, combined with the ease of access, make the SEER data an ideal instrument for the analysis of cancer survival by age over calendar time. We used data that were released in April 2014 with a follow-up cutoff date of December 31, 2011 (Surveillance, Epidemiology, and End Results (SEER) Program 2014). The SEER data do not cover all cancer diagnoses in the United States; they are a collection of data from several registries. With the exception of Seattle (Puget Sound) and Metropolitan Atlanta, which started in 1974 and 1975, respectively, we only used registries that covered the whole time span from 1973 until the end of 2011. Although we used less data than we could have, we thought that a heterogeneous set of registries would have caused problems for the analysis over time. The registries included in our analysis were: San Francisco-Oakland SMSA, Connecticut, Metropolitan Detroit, Hawaii, Iowa, New Mexico, Utah as well as Seattle and Metropolitan Atlanta.
In our analysis of cancer survival in Chap. 10, starting on page 123, we selected five cancer sites: breast cancer; cancer of the lung and bronchus; cancer of the colon, rectum, and anus; pancreatic cancer; and prostate cancer. As shown in Table 3.3, those five cancer sites constitute about 53% of all cancer diagnoses for women and about 55% for men out of the 4.5 million cases recorded during our observation period. The largest categories are by far breast cancer for women (30.44%) and prostate cancer for men (25.79%). The absolute and relative frequencies of the other cancer sites as well as their respective ICD codes are given in Table 3.3. While ICD-8 was in use at the beginning of the observation period in 1973 and cancer cases are typically coded by the ICD-O standard, all ICD codes were converted to ICD-10 by SEER.
 | Cancer site | Incidence, female: Counts | Incidence, female: in % | Incidence, male: Counts | Incidence, male: in % |
---|---|---|---|---|---|
(1) | All | 2,328,116 | (100.00) | 2,195,983 | (100.00) |
(2) | Breast | 708,696 | (30.44) | 4,680 | (0.21) |
(3) | Bronchus and lung | 224,927 | (9.66) | 332,974 | (15.16) |
(4) | Colon, rectum, and anus | 257,406 | (11.06) | 263,050 | (11.98) |
(5) | Pancreas | 51,712 | (2.22) | 51,440 | (2.34) |
(6) | Prostate | N/A | (N/A) | 566,311 | (25.79) |
(7) | Rest | 1,085,375 | (46.62) | 977,528 | (44.51) |
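The combined share of the five selected sites follows directly from the counts in Table 3.3. The Python sketch below recomputes it separately for women and men; the dictionary layout and variable names are illustrative, and prostate cancer is omitted for women because the table lists it as not applicable.

```python
# Incidence counts copied from Table 3.3, by sex and cancer site.
incidence = {
    "female": {
        "Breast": 708_696,
        "Bronchus and lung": 224_927,
        "Colon, rectum, and anus": 257_406,
        "Pancreas": 51_712,
        "Rest": 1_085_375,
    },
    "male": {
        "Breast": 4_680,
        "Bronchus and lung": 332_974,
        "Colon, rectum, and anus": 263_050,
        "Pancreas": 51_440,
        "Prostate": 566_311,
        "Rest": 977_528,
    },
}

five_site_share = {}
for sex, counts in incidence.items():
    all_cases = sum(counts.values())
    selected_cases = all_cases - counts["Rest"]
    # Share of the selected sites among all diagnoses, in percent.
    five_site_share[sex] = 100 * selected_cases / all_cases

total_cases = sum(sum(counts.values()) for counts in incidence.values())

for sex, share in five_site_share.items():
    print(f"{sex}: {share:.2f}% of diagnoses at the selected sites")
print(f"All recorded cases: {total_cases:,}")
```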
3.2 Software
All analyses were conducted and all figures were produced using R (Version 3.2.3), a free software environment for statistical computing and graphics (R Development Core Team 2015). The surface maps were created with the image() function, and contour lines were added with the contour() function. To facilitate the creation of surface maps of rates of mortality improvement for other researchers, an R package called ROMIplot has been created and uploaded to CRAN, the general archive of R packages. Installation and usage of this package are explained in Appendix "Software: R package ROMIplot" (p. 161).
References
Christensen, K., Doblhammer, G., Rau, R., & Vaupel, J. (2009). Ageing populations: The challenges ahead. The Lancet, 374(9696), 1196–1208.
Janssen, F., Nusselder, W. J., Looman, C., Mackenbach, J. P., & Kunst, A. E. (2003). Stagnation in mortality decline among elders in the Netherlands. Gerontologist, 43(5), 722–734.
Jdanov, D. A., Jasilionis, D., Soroko, E. L., Rau, R., & Vaupel, J. W. (2008). Beyond the Kannisto-Thatcher database on old age mortality: An assessment of data quality at advanced ages. MPIDR Working Paper WP-2008-013, Max Planck Institute for Demographic Research, Rostock, Germany.
Kannisto, V. (1994). Development of oldest-old mortality, 1950–1990: Evidence from 28 developed countries (Monographs on Population Aging, Vol. 1). Odense: Odense University Press.
Meslé, F. (2006). Medical causes of death. In G. Caselli, J. Vallin, & G. Wunsch (Eds.), Demography. Analysis and synthesis (Vol. II, Chap. 42, pp. 29–44). Amsterdam: Elsevier.
Meslé, F., & Vallin, J. (1996). Reconstructing long-term series of causes of death. Historical Methods, 29, 72–87.
Meslé, F., & Vallin, J. (2006a). Diverging trends in female old-age mortality: The United States and the Netherlands versus France and Japan. Population and Development Review, 32, 123–145.
Pechholdová, M. (2009). Results and observations from the reconstruction of continuous time series of mortality by cause of death: Case of West Germany, 1968–1997. Demographic Research, 21, 535–568.
R Development Core Team. (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org, ISBN:3-900051-07-0.
Surveillance, Epidemiology, and End Results (SEER) Program. (2014). Research data (1973–2011). Available online at www.seer.cancer.gov. National Cancer Institute, DCCPS, Surveillance Research Program, released April 2014, based on the November 2013 submission.
University of California, Berkeley (USA), and Max Planck Institute for Demographic Research, Rostock, (Germany). (2007). Methods protocol for the human mortality database. Available at http://www.mortality.org/Public/Docs/MethodsProtocol.pdf.
University of California, Berkeley (USA), and Max Planck Institute for Demographic Research, Rostock, (Germany). (2017). Human mortality database. Available at http://www.mortality.org.
Rights and permissions
Open Access This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license, and any changes made are indicated.
The images or other third party material in this book are included in the work's Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work's Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.
Copyright information
© 2018 The Author(s)
About this chapter
Cite this chapter
Rau, R., Bohk-Ewald, C., Muszyńska, M.M., Vaupel, J.W. (2018). Data and Software. In: Visualizing Mortality Dynamics in the Lexis Diagram. The Springer Series on Demographic Methods and Population Analysis, vol 44. Springer, Cham. https://doi.org/10.1007/978-3-319-64820-0_3
DOI: https://doi.org/10.1007/978-3-319-64820-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64818-7
Online ISBN: 978-3-319-64820-0
eBook Packages: Social Sciences, Social Sciences (R0)