ۥ-!@ -
G'SBjBjjBjBjBjBjBxBEEEE"E(JEvEESFWF:FFFFFFFFFFFFFF4F3FjBFFCHAPTER 1
POPULATION DATA ANALYSIS
__________________________________________________________
ANALYTICAL TECHNIQUES/TOOLS USED:
Straight line curve-fitting--least squares
Exponential curve-fitting
Exponential curve-fitting with a lower bound
Logistic curve-fitting
Gompertz curve-fitting
__________________________________________________________
DATA TYPE: ABUNDANT LONGITUDINAL DATA
transfer of data from data base to spreadsheet
cleaning, analysis, and graphing of transferred data
__________________________________________________________
Overview of Data
Population data sets were chosen as the initial sets of data on which to illustrate curve fitting because there are so many data sets concerning population and because there is a vast systematic literature already available concerning the analysis of these data. Indeed, demography is the science of the statistical analysis of human populations, focusing on size and density, distribution, and vital statistics. It is a long-standing science, certainly dating from the time of Malthus; Malthus's basic law states that in the absence of constraints on resources or other factors, one would naturally expect exponential population growth. A larger population generates yet a larger one, in much the way that the compounding of interest causes a bank account to grow exponentially.
The science of population dynamics includes the study of births, deaths, and population age cohort changes; in and out-migration among countries; and, rural/urban migration within countries. A particularly exciting approach, cast in the current context of analyzing the human dimensions of global change, is to consider the relationship between population dynamics and environmental degradation. Some of the topics within the idea of environmental degradation include adverse toxic environmental impact; consumption of non-renewable resources (such as oil) at rates that exceed the capacity to replace them with an alternative in the future; and, the consumption of renewable resources at rates in excess of replacement--non-sustainable consumption. A variety of other components contribute to environmental degradation, including those coming from economic and other development; in some cases populations dynamics is the dominant factor, in others, development is the driving force behind environmental degradation. The data sets chosen in this book are typical of the sorts of data sets that might contribute significantly to analyzing the relationship between population dynamics and environmental degradation. Thus, the reader who is learning to fit curves, in any disciplinary context, also has the opportunity to see how data might be used creatively in the emerging science of population dynamics.
Various international organizations publish data on population in an electronic format. There is no global standard for census taking: the collection of data concerning human populations. Thus, there will be variation in censuses on numerous bases: from the national level, to the state level, to the county level.
With any data set (presented in electronic or paper format), it is important first to examine the set for interesting or unusual patterns in the display. These patterns often influence decisions in choosing subsets of data and tools to analyze subsets.
__________________________________________________________
PATTERNS IN DATA--WHAT TO LOOK FOR
1. What is the general organizational scheme of the entire set? Is it arranged alphabetically, numerically, or in some other fashion?
2. Are the real-world entries in the Table (nations, states, counties) expressed as comparable units? For example, county data and national data are generally not comparable.
3. Are the numerical entries in the Table expressed in comparable units? For example, data in one column might measure percentages while data in another column might measure thousands of dollars--these columns would not be comparable.
4. Are there gaps in the data? If so, what is their significance to the questions you wish to have the data answer?
__________________________________________________________
The World Resources Institute data base contains information, derived from World Bank data, for most countries of the world. This information is partitioned on the basis of over 500 variables, many of which are related to population. There is longitudinal data available, in electronic format, for many of the population indicators. With this electronic data base, as with others, the content of interest can be selected and transferred to commercial packages of choice. This feature makes these databases extremely useful for analysis; occasionally, computer users have some difficulty implementing these data bases. Thus, to begin, we illustrate in detail (by example), how to transfer information from the World Resources Institute data base into a Lotus 1-2-3 spreadsheet. Then, we suggest how to analyze data within a spreadsheet, and finally, how to portray some elements of the data set on a map. To illustrate that the general strategies employed are not peculiarities of the software selected, we then select another data base and mapping package and run through the analysis again.
Data base from the World Resources Institute
A traditional idea in studying population is that an increase in population will result in an increase in births. Indeed, Jonathan Swift carried this Malthusian notion to an extreme in Gulliver's Travels. Suppose, though, that one wished to investigate this theme, of the relationship between total population and total births, from a systematic standpoint. One logical beginning might be with a major international data base.
Two of the over 500 variables in the World Resources Institute data base, derived from public domain information, concern total population by country, expressed in units of thousands, and crude birth rate per thousand population by country. For the sake of example, we chose to look at these two variables for the country of Bangladesh (formerly East Pakistan)--a country known to have an interesting history in terms of population growth.
Transfer of information from data base to spreadsheet
Most electronic databases are very easy to use--the online help guides even the shyest computer user through the mechanics of selecting precisely the desired information for the desired countries. If difficulties in using these databases arises, it is usually, as with any software, in the seam of the interface linking one piece of software to another.
Remember, depending on the operating system in use, that it is often useful to run a directory on the software to see what is available, prior to starting an analysis of data. In a PC environment, the command, at the C prompt of the subdirectory holding the software, of dir/w will cause the content of the directory for the software to be displayed in table-format across the width of the screen. The command dir/p will cause the content of the directory for the software to be displayed in column-format, accompanied by the size of the file and the date of last use, on the screen a page at a time. It is easy to forget how to start databases--especially in environments loaded with different software, some of which the user employs only occasionally. Looking up names in a directory removes the need to remember commands specific to individual software packages and replaces it, instead, with the universal idea of looking it up. Directories are powerful tools.
A set of data chosen to examine some theme is a start. The next step is to prepare to analyze the data. Many databases have some analytical capabilities built-in. They are often easy to use but are currently likely not to have the power of the commercial database programs that have been refined for years to present the user with sophisticated analysis and an online help facility that is easy to use (in any package, try F1 if it is not clear how to gain access to the online help). It may also be the case that graphics displayed in databases on the screen may not print out directly on a printer. In brief, there may be a number of shortcomings with the analysis segment of a data base; the primary function of the data base is as a convenient source of data, often expressed at the scale of individual nations or groupings of nations, concerning a number of different variables.
To transfer data from an electronic data base to a spreadsheet, it is necessary to "export" the data from the database to the spreadsheet. The data base should have online help that is easy to use to create a file to export. If there is any difficulty, it is likely to come in knowing how to name the file so that it appears in the target spreadsheet software when that is next opened up. In the World Resources Institute data base, the following strategy works well. When the time comes to name the file in the process of exporting, the user is offered a line of text including c:\wrd\....... . The dots indicate that the user is to supply a file name. If the file name is simply appended, the file will not go directly to the spreadsheet. Suppose the spreadsheet software is Lotus 1-2-3, release 2.3, stored in subdirectory 123r23. Suppose the name of the spreadsheet is to be "sample-1.wk1--the extension "wk1" is specific to the spreadsheet software. Now, when the database software presents the line c:\wrd\......., use the backspace to erase "wrd\" and substitute 123r23\sample-1. Now the line of text will read c:\123r23\sample-1. When the type of file this is to be stored as is selected as a "wk1" file, the data in "sample-1" will now appear as one spreadsheet that can be selected directly from Lotus 1-2-3, release 2.3.
We have chosen to use this particular spreadsheet with the World Resources Institute data base. Choose your source of data first; then determine which other analytical tools it can interact with. Then try the various interfaces, determine which will work best for you, and proceed with analysis. Minimal facility with various spreadsheets and analytical software is important in making a good selection, just as exposure to various natural, artificial, and formal languages can foster a broad, general understanding of mathematical and natural language.
Cleaning of data transferred from an electronic database
The four steps enumerated in the box above may seem a bit obvious but are very important in making sure that results are as clear and as accurate as possible. When dealing with data that has been difficult to collect, it is important not to introduce any errors in the analysis. One reason direct transferral of data from electronic data base to spreadsheet (or other software) is important is that no re-keying of the data is involved and therefore no new errors can be introduced from any extra data entry. However, when different variables, or different countries are chosen from the electronic data base, the entries may have to be rearranged or manipulated in some other way so that desired analyses and comparisons can be performed.
To illustrate the idea of "cleaning" data, suggested in steps 1-4 above, we chose a very simple example--one country and two variables. With more geographical units and more variables, the increase in difficulty of cleaning is more likely exponential than linear! The necessity to clean data is of critical and fundamental importance.
1. What is the general organizational scheme of the entire set? Is it arranged alphabetically, numerically, or in some other fashion?
The entire set of World Resources Institute Data can be sorted first by variable and then by country, or first by country and then by variable. The capability to sort the data in both ways makes this database one that is quite flexible and one that can be used to interpret a variety of themes.
2. Are the real-world entries in the Table (nations, states, counties) expressed as comparable units? For example, county data and national data are generally not comparable.
This data base contains data for most countries of the world; some variables have data for more countries than do others. Thus, when looking at more than one variable at a national level, one must be certain that exactly the same set of countries is being dealt with, prior to making comparisons. There is also data grouped a regional levels--that is as groupings of countries, such as "oil producing exporters." Such groupings are nice to have, but they can introduce bias by suggesting what to study.
TABLE 1.1
Bangladesh; actual data, 1955-1990, projected data, 1995-2025.
Total population and crude births in millions
Source: World Resources Institute Data Base.
3. Are the numerical entries in the Table expressed in comparable units? For example, data in one column might measure percentages while data in another column might measure thousands of dollars--these columns would not be comparable.
The entries in this data base are expressed both as raw data and as rates. Thus, one must be careful not to compare a rate to raw data, or to average a set of rates (producing meaningless results). The variable concerning total population is expressed as raw data; the variable concerning crude births is expressed as a rate per 1000. To make the variables comparable, it is an easy matter simply to multiply each of the crude birth rates by the number of thousands of population per country in order to get a figure on total crude births. The data in Table 1.1 has been adjusted, decimally, so that the total population figures in the second column show millions of total population for the country of Bangladesh over a seventy year period, and so that the crude birth figures show millions of total crude births for Bangladesh for the same seventy year period.
4. Are there gaps in the data? If so, what is their significance to the questions you wish to have the data answer?
In Table 1.1, the first column shows years for which there is data. In fact, more was available. From the period for 1970 to 1990, data was presented on an annual basis. When fitting curves, it is generally appropriate to have evenly-spaced units on the x-axis represent evenly-spaced units in the data. To introduce even spacing in this data set, one might choose only to look at the data for which there is annual data, or one might choose only to look at data every five years. We chose the latter course.
Note also in the first column of Table 1.1, that the years from 1955 to 1990 are positioned differently than those from 1995 to 2000. This offset pattern is presented to remind the user of the Table that the data to 1990 is actual data, and that beyond 1990 is projected data.
Predictions of any sort are at best tenuous; when funding is allocated based on projections, questions should be asked. The more analytic support one has for asking questions, including charts, graphs, maps, and other visual devices, the more likely one is to achieve some sort of constructive, interactive communication.
Analysis of spreadsheet data
There are a variety of default analyses that can be performed easily using spreadsheets; bar charts can be drawn, pie charts can be created, and curves can be drawn. When curves are drawn, the default curve is simply a curve formed by joining discrete values with line segments--a follow-the-dots approach. Figure 1.1 shows default curves when the actual data, from 1955 to 1990, from Table 1.1 are graphed. Default graphs offer an easy and effective visual display. They do not have an equation(s) associated with them, so they cannot be referred to in notation (without additional work) and therefore they cannot be used to make projections from actual data. Consequently, their only use is as visual backup--when there are also equations behind the curves, then they serve as analytic tools, as well. Thus, it becomes important to be able to associate equations, which can then generate suitable curves, with actual data.
Figure 1.1. Total population and total crude births, Bangladesh, 1955-1990.
The graph in Figure 1.1 shows a curved rise in the population of Bangladesh over the time period from 1955 to 1990. One might therefore suppose that projections made on the basis of this actual data would continue to reflect this trend. The crude birth data also shows a slight, but steady, rise during this time period, too. The database from which these values were extracted also contains data projected to the year 2025 for these variables. When these are graphed, Figure 1.2, the population clearly continues to rise, although the rate at which it rises appears to taper off a bit. The birth rate data is more difficult to see at the scale of the graph in Figure 1.2. When the scale is enlarged, Figure 1.3, it becomes clear that a steadily rising trend of actual data is projected as one that rises for a short time in the near-future and then drops off sharply at the year 2005. These default graphs, alone, suggest the importance of questioning why there is an assumption made about a decline in births in Bangladesh starting in 2005. Is it the result of some current programs involving education and family planning that are already in place, or is it a wished-for result that could come about IF certain proposals involving education and family planning are funded? The difference is substantial in terms of policy implications; to understand which it is, it is critical to know on what basis this drop is forecast.
Figure 1.2. Total actual and projected population and crude births, 1955-2025.
The mere default graphing of data can suggest critical questions; graphing beyond the default level can provide even more insight. We consider a few additional ways to look at data using tools readily available in spreadsheets. First each variable is considered graphically and then the theme of the relationship between total population and births is returned to.
Figure 1.3. Bangladesh, actual and projected total crude births: 1955-2025.
Straight line curve fitting--least squares analysis--birth data
The graph of the variable, "total crude births," suggests that a straight line might fit the actual data quite well. In Table 1.2, a straight line has been fit to the actual data from 1955 to 1990, using least squares "regression" analysis. The fit is quite tight, as is indicated by the r-squared value of 0.99 (Table 1.2). If a different fit had been tried, resulting in a value of r-squared smaller than this, one might conclude that the curve with the value of r-squared closest to 1 gives the tightest fit to the scatter of dots. What that means would depend on the kind of data used as inputs; experimental data of unknown reliability produces r-squared values of questionable meaning, as does data that is based on averages.
The procedure for obtaining this line is straightforward and uses only the capability of a good spreadsheet. Select the regression feature from the spreadsheet. Enter the column, from 1955-1990, for "year" as the x-axis value; enter the crude births values for 1955-1990 as the next variable (the A-range in Lotus 1-2-3, release 2.3). Choose a blank location of the spreadsheet as the output range. Then, go with the regression, and the output should be similar to that part of Table 1.2 below the entry for the year 2025. Typically, the last two lines, giving the phrase "linear fit" and the equation of the regression line are not part of the output. The user needs to understand the content of the output enough to know that generally, the equation of the line of least squares will be of the form
y=mx+b
where m is the slope of the line and b is its intercept on the y-axis. The "X Coefficient(x)" in Table 1.2, 0.0825614617, is the x coefficient--m--in the displayed equation above. The "constant" in Table 1.2, -159.38724963, is the y-intercept and so is the b-value in the displayed equation.
As a specific sample, fit a straight line to the data in Table 1.2--using Lotus 1-2-3 release 2.3, in which the year 1955 appears in column A, row 5 (cell A5) of the spreadsheet. A reader unfamiliar with using a spreadsheet might wish to actually try fitting a straight line by following the instructions below, step by step.
---------------------------------------------------------------------------------------
STRAIGHT LINE FIT TO THE DATA OF TABLE 1.2
(Refer to Table 1.2 and Figure 1.4)
1. Enter the years 1955-1990, for which there is ACTUAL data, in spreadsheet column (A) to be used as the x-axis input.
2. Enter the ACTUAL data that varies over time, 1955-1990, in another column (spreadsheet column B--total crude births in this case), to be used as y values
3. Choose the regression feature from the software, with the x values as in step 1 and the y values as in step 2.
4. Choose the output range as a blank area in the spreadsheet. Then proceed with the calculation as directed by the software; the output from the regression will appear in a form similar to the one in Table 1.2, bottom half (produced in Lotus 1-2-3, release. 2.3).
5. The equation below the output range must generally be derived by the user from the regression output. The slope-intercept form for the equation of a straight line (y=mx+b) is used. The "X Coefficient" from the regression output is used as "m". The "Constant" from the regression output is used as b. Thus, the equation, that fits the ACTUAL data from 1955-1990, can be read directly from the regression output.
6. In a separate column of the spreadsheet, labelled "projected straight line vals" in Table 1.2 (Column E), enter the equation derived from the regression: in cell E5, enter the formula 0.082561*A5-159.387. The value 2.019755 should appear at the top of the "projected straight line vals" column. Then, copy the cell content from E2 to the 7 cells below it; this should produce the numerical range, from 1955-1990, shown in the "projected straight line vals" column.
7. Graph the results; select an XY-graph. Put the 8 entries for the years in the X-range, and the eight actual values in the A range. Enter the 8 entries of the projected linear data, from 1955-1990, in an additional range, B, of the spreadsheet (Figure 1.4). The result should look like the left half of Figure 1.4.
__________________________________________________________
Often, spreadsheets will present the user with more decimal places than desired. In that case, the user must decide whether to truncate the long decimal or round it up or down. Each strategy has merit; making a rational decision, employing it consistently throughout an analysis, and telling the reader what decision has been made and why it has been made is a good approach. In this example, decimals were truncated because it appeared consistent with what the computer does when the column is expanded; there is less chance of introducing an error by merely erasing digits--when old ones are altered the risk increases.
__________________________________________________________
EXTRAPOLATION
To extrapolate the linear curves fit to the actual data, simply add data to spreadsheet columns and graph the extended columns. In the case here:
1. Extend the number of years to cover a period of time up to and including the year 2025 (in five-year intervals) as in Table 2.1, column A.
2. Enter WRD data for the entries from 1995 to 2025.
TABLE 1.2
Least squares regression line fit to actual crude birth data, 1955-1990.
Right-hand column shows fit of line to actual data
and extrapolation to cover 1995-2025.
3. Copy the formula in cell E12 into cells E13 to E18. The Table should now be identical to Table 1.2.
4. Graph the extended data; simply open the previous graph and extend the columns for X, A, and B ranges to cover the values associated with the years 1995 to 2025 as well as those from 1955 to 1990. The resulting graph should appear as that in Figure 1.4. Alter labels to suit individual tastes.
__________________________________________________________
The equation obtained from fitting a straight line to the actual data may then be used to calculate values for projected data. Table 1.2 shows projected values in the right hand column, obtained by entering the value 0.082561*x-159.387 into the first row of that column and replacing the value "x" with "1955" or with the cell address, A5, of the range containing "1955." Subsequent values in this column were calculated by inserting appropriate values for the "Year" column into the formula derived from the regression line. Individual values for x may be inserted by hand; however, when the cell address is used instead, the formula may simply be copied from one range to another and new values are then calculated automatically--in Lotus 1-2-3, simply use the "copy" command from the main menu and copy the content of the top cell in the right-hand column to the remaining cells in that column. This latter strategy is easier and there is less room for the introduction of error.
When the data are graphed, it becomes clear that the linear fit is quite good for the actual data (Figure 1.4) and is good on the projected data until about the year 2000. After 2005, there is a sharp drop away from the least squares line. The straight line fit offers concrete, analytical evidence on which to question this drop: "why does the projection veer off from the straight line which naturally fits this data?"
The sorts of follow-up questions that might ensue could include questioning where WRD data comes from; if it comes directly from the Bangladesh government, one might then try to determine the basis on which they make projections and the sorts of education and family planning projects that are already in place. Is this projected, radical downswing a result of carefully orchestrated planning that can be documented, or is it based on wishful thinking about what must take place for success in caring for this segment of the world's population? Indeed, it might be prudent to fit a curve to this projected segment from 2005, although any sort of projection is a very risky, but often used, strategy.
Figure 1.4. Least squares line--fit to actual and projected WRD data.
---------------------------------------------------------------------------------------
BLACK BOX SUMMARY
see Introduction for theoretical explanation
LEAST SQUARES REGRESSION LINE
y=mx+b
where
m is the slope of the line, or the "x-coefficient"
b is the y-intercept, or the "constant."
---------------------------------------------------------------------------------------
Exponential curve-fitting--crude birth data
The total crude birth curve in Figure 1.4, to the right of 2000 or 2005 is one that shows a sharp decline. To capture it in an equation, it seems natural to fit an exponential curve to it. The following strategy illustrates how to fit an exponential to a set of data; the reader might wish to follow along, step by step, using a computer and Lotus 1-2-3, release 2.3.
__________________________________________________________
EXPONENTIAL CURVE FIT TO THE DATA OF TABLE 1.3
(Refer to Table 1.3 and Figure 1.5)
1. Enter the years in a spreadsheet column (A) to be used as the x-axis input. The first entry, for 1955, is in cell A5.
2. Enter the data that varies over time in another column (spreadsheet column B--WRD projection in this case), to be used as y values.
3. Take the natural log of the y values (other bases for the logarithms work, too). These appear in column C and are obtained from column B by writing @LN(B5) in cell C5. Copy the formula in C5 to cells C6 through C10.
4. Choose the regression feature from the software, with the x values as in step 1 and the y values as in step 3--the regression will be run on the x values and the LN y values (to obtain the exponential form).
5. Choose the output range as a blank area in the spreadsheet. Then proceed with the calculation as directed by the software; the output from the regression will appear in a form similar to the one in Table 1.3, bottom half.
6. The equations below the output range must generally be derived by the user from the regression output. The slope-intercept form for the equation of a straight line (LN y=mx+b) is used. The "X Coefficient" from the regression output is used as "m". The "Constant" from the regression output is used as b. Thus, the first equation can be read directly from the regression output.
7. The second equation, that is an exponential, is derived from the first equation by raising both sides to e, the base of the natural logarithms. This is the equation that will be used to enter projected exponential values, starting with the year 2000.
TABLE 1.3
Exponential curve fit to projected crude birth data, 2000-2025.
Horizontal asymptote at y=0.
(Source: World Resources Institute)
8. In a separate column of the spreadsheet, labelled "exp. proj. 2000" in Table 1.3 (Column D), enter the equation derived from the regression: in cell D5, enter the formula @EXP(-0.00986*A5+21.51354). The value 5.9857885876 should appear at the top of the "exp. proj. 2000" column. (The number of decimal places displayed will depend upon the width of the column.) Then, copy the cell content from D5 to the 5 cells below it; this should produce the entire numerical range shown in the D column.
9. Graph the results; select an XY-graph. Put the 6 entries for the years in the X-range, and the six WRD projected values in the A range. Enter the 6 entries of the projected exponential data in an additional range, B, of the spreadsheet (Figure 1.5). The result should look like Figure 1.5.
__________________________________________________________
Figure 1.5. Graph of exponential fit to projected crude birth data, 2000-2025; the x-axis is a horizontal asymptote for the exponential curve.
TABLE 1.4
Exponential curve fit to projected crude birth data, 2005-2025.
Horizontal asymptote at y=0.
(Source: World Resources Institute)
The exponential curve found in this case fits the WRD projections worst in the time interval from 2000 to 2005; thus, a logical next step to improve the fit might be to repeat the strategy using WRD data from 2005 to 2025. Table 1.4 shows the results of executing a regression on the years 2005 to 2025 and the associated natural logs of the WRD projections. These results are displayed in column D of the spreadsheet and are labelled "exp. proj. 2005" -- in contrast to the projected exponential starting in 2000. The spreadsheet in Table 1.4 shows the WRD projections, the exponential projections beginning in 2000, and the exponential projections beginning in 2005. The regression analysis calculated an R-squared value of 0.96 using the data beginning in 2005. This R-squared value is closer to 1 than is that of 0.90 derived from the regression beginning in 2000. The 2005 fit is tighter than is the 2000 fit.
Figure 1.6. Graphs of exponential fits to projected crude birth data, 2000-2025, and 2005-2025; the x-axis is a horizontal asymptote for each exponential curve.
When all three curves are plotted (Figure 1.6), it also appears that the 2005 exponential projection offers a better fit to the WRD projection than does the 2000 exponential projection. In addition, the visual display suggests that better exponential fits than either of these might be available--it might be possible to adjust the R-squared value upward.
The exponentials above were fit to the data assuming the x-axis as a lower bound--as x approaches infinity, the curve will approach the value y=0. An exponential fit assuming y=0 as a lower bound is the simplest sort of exponential fit. An exponential curve can be fit given any real number, n, with y=n as a lower bound.
In an effort to make the fitted curves track more accurately the WRD projections, raise the lower bound, to y=4 (suggested by the data). The idea is to force the exponentials to begin tapering off toward the lower bound (horizontal asymptote). The summary below, when used in conjunction with Table 1.5, shows how to achieve this style of fit using y=4, instead of y=0, as a lower bound.
__________________________________________________________
EXPONENTIAL CURVE FIT TO THE DATA OF TABLE 1.5
(Refer to Table 1.5 and Figure 1.7)
1. Enter the years in spreadsheet column (A) to be used as the x-axis input. The first entry, for 2000, is in cell A5.
2. Enter the data that varies over time in another column (spreadsheet column B--WRD projection in this case), to be used as y values.
3. Take the natural log of -4 plus the y values (other bases for the logarithms work, too). These appear in column C and are obtained from column B by writing @LN(-4+B5) in cell C5. Copy the formula in C5 to cells C6 through C10. The reason the value of -4 is introduced is because of the lower bound, y=4.
4. Choose the regression feature from the software, with the x values as in step 1 and the y values as in step 3--the regression will be run on the x values and the LN (-4+y) values (to obtain the exponential form).
5. Choose the output range as a blank area in the spreadsheet. Then proceed with the calculation as directed by the software; the output from the regression will appear in a form similar to the one in Table 1.5, bottom half.
6. The equations below the output range must generally be derived by the user from the regression output. The slope-intercept form for the equation of a straight line (LN (-4+y)=mx+b) is used. The "X Coefficient" from the regression output is used as "m". The "Constant" from the regression output is used as b. Thus, the first equation can be read directly from the regression output.
TABLE 1.5
Exponential curve fit to projected crude birth data, 2000-2025.
Horizontal asymptote at y=4.
(Source: World Resources Institute)
7. The second equation, that is an exponential, is derived from the first equation by raising both sides to e, the base of the natural logarithms, and isolating y on the left side. This is the equation that will be used to enter projected exponential values, starting with the year 2000.
8. In a separate column of the spreadsheet, labelled "exp. proj. 2000" in Table 1.5 (Column D), enter the equation derived from the regression: in cell D5, enter the formula @EXP(-0.04283*A5+86.40702)+4. The value 6.1107007471 should appear at the top of the "exp. proj. 2000" column. (The number of decimal places displayed will depend upon the width of the column.) Then, copy the cell content from D5 to the 5 cells below it; this should produce the entire numerical range shown in the D column.
9. Graph the results; select an XY-graph. Put the 6 entries for the years in the X-range, and the six WRD projected values in the A range. Enter the 6 entries of the projected exponential data in an additional range, B, of the spreadsheet (Figure 1.7). The result should look like Figure 1.7.
__________________________________________________________
Figure 1.7. Graph of exponential fit to projected crude birth data, 2000-2025; the line y=4 is a horizontal asymptote for the exponential curve.
TABLE 1.6
Exponential curve fit to projected crude birth data, 2005-2025.
Horizontal asymptote at y=4.
(Source: World Resources Institute)
The R-squared value for the 2000 exponential fit, with y=4 as a lower bound, is 0.916, which is closer to 1 than is the corresponding R-squared value for the 2000 exponential fit with y=0. Indeed, the graph in Figure 1.7 reflects not only this slight improvement, but more to the point, it shows the exponential fit tracking the WRD projections more closely; thus, one might ask whoever made the projections what sorts of assumptions they made concerning lower bounds and what basis in fact, plans, or policy led to the assumptions on which they made the lower bound choice.
Thus, one might expect an improved fit, reflected in the R-squared value, by fitting an exponential with a lower bound of y=4 to the data starting with the year 2005. Table 1.6 shows the results of performing that exponential fit. The R-squared is 0.989, an improvement over all previous fits. Indeed, when the WRD projection is graphed along with the two exponential projections using a lower bound of y=4, it is clear that the exponential beginning in 2005 tracks both the actual values of the WRD curve, as well as its general shape (Figure 1.8). One might try to fine-tune the fit even more; however, these fits are sufficient to direct some provocative questions with adequate visual support as back-up.
Figure 1.8. Graphs of exponential fits to projected crude birth data, 2000-2025, and 2005-2025; the line y=4 is a horizontal asymptote for each exponential curve.
---------------------------------------------------------------------------------------
BLACK BOX SUMMARY
see Introduction for theoretical explanation
EXPONENTIAL CURVE
y=Ceax + b
where
a < 0
and
y=b is the lower bound of the exponential;
C is a constant.
---------------------------------------------------------------------------------------
Logistic curve-fitting--Bangladesh total population data
Figure 1.1, which graphs total population data for Bangladesh using spreadsheet default curves, suggests that some sort of increasing curve might fit the actual total population data fairly well. Table 1.7 shows a straight line fit to that data, and Table 1.8 shows an exponential fit to that data. The straight line fit to the actual data is reasonable--r-squared is 0.978. The exponential fit is better--r-squared is 0.999. When the default curve, from the actual data, is graphed with the linear and the exponential fit for 1955-1990 (years of the actual data), it is also evident that the exponential offers a tight fit (Figure 1.9). One might suppose that projected values of total population were obtained using an exponential close to this one. When the WRD projections are graphed together with the linear and exponential projections, derived from extending the curves that actually fit the data, it is clear that some curve other than these was used for forecasting (Figure 1.10). The WRD projections lie between the linear and the exponential and the curve they trace out may well represent an exponential that has been damped by environmental factors, forcing it, eventually, to taper off horizontally. The curve that permits this sort of damping is the logistic curve--an S-shaped curve that exhibits exponential rise initially and then tapers off symmetrically around the center point of the "S," at which the slope of the curve (of the line tangent to the curve at that point) is steepest.
The mechanics of fitting a logistic curve to a table of values is straightforward to execute in a spreadsheet (again, reference is made to Lotus 1-2-3, release 2.3). A sequence of steps for doing so is presented below; all that is required is to select a starting point (usually a time, such as a year), an endpoint (again, often in time), an upper bound for the population (such as carrying capacity of the environment) and an assumption that the distribution be logistic in character.
TABLE 1.7
Least squares regression line fit to actual population data, 1955-1990.
Right-hand column shows fit of line to actual data
and extrapolation to cover 1995-2025.
(Source: World Resources Institute)
TABLE 1.8
Exponential curve fit to actual and projected population data,
1955-2025.
Horizontal asymptote at y=0.
(Source: World Resources Institute)
Figure 1.9. Linear and exponential curve fits to actual Bangladesh population data.
The general form for the logistic equation used here is:
y=q/(1+aebx)
where b<0, q is selected prior to making any analysis and is the value of the upper bound selected by the user on carrying capacity or other bases, and a and b are constants to be determined depending on the values selected for q and the beginning and ending times chosen. The point of steepest slope is halfway between the lines y=0 and y=q; thus the choice of q influences directly where the steepest rate of increase in the curve occurs. There are numerous equivalent forms that this logistic equation might assume.
In the example below, there is data for the period from 1955 to 2025; the data from 1955 to 1990 is actual data, and that from 1995 to 2025 is projected. To see if a logistic was used to make the projection, we shall try to fit curves to all the data; since it is necessary only to choose one point as an end, it might well be that planners notice an exponential trend in actual data, as we have here, and, guessing that a purely exponential increase is not likely to remain, choose some value at a future time, as in 2025, that seems likely to them based on their accumulated wisdom and field evidence. The exponential part of the logistic will fit the earlier, exponentially rising years, and the years for projection, from the halfway point of the logistic, to the final year, are filled in by values projected from the curve. Whether or not this manner of forecasting is acceptable is a concern each time it is done; we illustrate with the case of total population in Bangladesh.
Figure 1.10. Curve showing WRD projections for population beyond 1990 to 2025. The linear fit and the exponential fit shown in Figure 1.9 are extrapolated here--the exponential overfits the WRD projections and the linear underfits the WRD projections.
__________________________________________________________
LOGISTIC CURVES FIT TO THE DATA OF TABLE 1.9
(Refer to Table 1.9 and Figure 1.11)
Refer to Introduction for theoretical background
ASSUMPTIONS: Q=300; TIME PERIOD 1955-2025
1. Enter labels ranging from 0 to 14, one for each year, in column A--the first entry is in cell A5.
2. Enter the years for which there is data, both actual and projected--1955 to 2025 in this case--in column B.
3. Enter the data that varies over time in another column (spreadsheet column C--WRD projection in this case), to be used as y values.
4. The variable t will play the role of x in the logistic equation. To begin, assume that q=300 (million). Thus, the logistic equation is now y=300(1+aebt).
5. Find the constant a:
In 1955, when t=0, y=45.486 (cell C5).
Thus, the logistic equation now becomes, at that time,
45.486=300/(1+aeb*0), or more simply, 45.486=300/(1+a).
The right hand equation can be solved for the single variable a, by isolating that variable on the left as:
a=-1+300/45.486=5.595435.
The logistic equation is now: y=300/(1+5.595435ebt).
6. Now, find b--use information from the other endpoint selected--t=14, in 2025. When t=14, y=234.987. So, at this time, the logistic equation yields, substituting in values for y and t:
234.987=300/(1+5.595435e14b).
This is an equation in a single variable; isolate the variable b on the left. Thus, e14b=(-1+(300/234.987))/5.595435. The arithmetic of the right hand side reduces easily using a computer or calculator, so that e14b = 0.049445. Taking the natural logarithm of both sides produces 14b=ln(0.049445) so that b=-0.21477. The b value has been determined and it is negative, as the theory tells us it must be. The logistic equation fit to the actual data is:
y=300/(1+5.595435e-0.21477*t).
TABLE 1.9
Logistic curve fits to total population with various upper bounds of
q=300, q=350, q=375.
(Source: World Resources Institute)
7. In a separate column of the spreadsheet, labelled "q=300" in Table 1.9 (Column D), enter the equation derived from the logistic equation: in cell D5, enter the formula 300/(1+5.95435*@EXP(-0.21477*A5)). The value 45.486006609 should appear at the top of the "q=300" column. (The number of decimal places displayed will depend upon the width of the column.) Then, copy the cell content from D5 to the 14 cells below it; this should produce the entire numerical range shown in the D column. The second value in the q=300 column is 64.633496693.
Figure 1.11. Logistic curve fits to total population in Bangladesh, projected and actual, from 1955-2025. In Fit 1, q=300; in Fit 2, q=350; in Fit 3, q=375.
8. Graph the results; select an XY-graph. Put the 14 entries for the years (column B) in the X-range, and the fourteen WRD projected values in the A range. Enter the 14 entries of the q=300 logistic data in an additional range, B, of the spreadsheet (Figure 1.11). The result should look like the curve labelled "Fit 1" in Figure 1.11.
__________________________________________________________
Adjustment of the q value may afford an even closer fit; we illustrate the effects of doing so, first by taking q=350, and then by taking q=375. Table 1.9 shows the results of doing so, and these results are also graphed on Figure 1.11. The detail of the mechanics required to obtain these values is reproduced below.
__________________________________________________________
LOGISTIC CURVES FIT TO THE DATA OF TABLE 1.9
(Refer to Table 1.9 and Figure 1.11)
Refer to Introduction for theoretical background
ASSUMPTIONS: Q=350; TIME PERIOD 1955-2025
1. Enter labels ranging from 0 to 14, one for each year, in column A--the first entry is in cell A5.
2. Enter the years for which there is data, both actual and projected--1955 to 2025 in this case--in column B.
3. Enter the data that varies over time in another column (spreadsheet column C--WRD projection in this case), to be used as y values.
4. The variable t will play the role of x in the logistic equation. To begin, assume that q=350 (million). Thus, the logistic equation is now y=350(1+aebt).
5. Find the constant a:
In 1955, when t=0, y=45.486 (cell C5).
Thus, the logistic equation now becomes, at that time,
45.486=350/(1+aeb*0), or more simply, 45.486=350/(1+a).
The right hand equation can be solved for the single variable a, by isolating that variable on the left as:
a=-1+350/45.486=6.694675.
The logistic equation is now: y=350/(1+6.694675ebt).
6. Now, find b--use information from the other endpoint selected--t=14, in 2025. When t=14, y=234.987. So, at this time, the logistic equation yields, substituting in values for y and t:
234.987=350/(1+6.694675e14b).
This is an equation in a single variable; isolate the variable b on the left. Thus, e14b=(-1+(350/234.987))/6.694675. The arithmetic of the right hand side reduces easily using a computer or calculator, so that e14b = 0.0731095 Taking the natural logarithm of both sides produces 14b=ln(0.0731095) so that b=-0.18684. The b value has been determined and it is negative, as the theory tells us it must be. The logistic equation fit to the actual data is:
y=350/(1+6.694675e-0.18684*t).
7. In a separate column of the spreadsheet, labelled "q=350" in Table 1.9 (Column E), enter the equation derived from the logistic equation: in cell E5, enter the formula 350/(1+6.694675*@EXP(-0.18684*A5)). The value 45.486001683 should appear at the top of the "q=350" column. (The number of decimal places displayed will depend upon the width of the column.) Then, copy the cell content from E5 to the 14 cells below it; this should produce the entire numerical range shown in the E column. The second value in the q=350 column is 53.404581971.
8. Graph the results; select an XY-graph. Put the 14 entries for the years (column B) in the X-range, and the fourteen WRD projected values in the A range. Enter the 14 entries of the q=350 logistic data in an additional range, C, of the spreadsheet (Figure 1.11). The result should look like the curve labelled "Fit 2" in Figure 1.11.
__________________________________________________________
The second logistic fit appears to fit the WRD data more tightly than did the first fit; thus, one might be tempted to increase the q value a bit more. When q=375, a third logistic curve is fit to the data; it appears to be a better fit than the first curve, but not as good a fit as the second curve (Table 1.9, Figure 1.11). The detail of the adjustment is given below.
__________________________________________________________
LOGISTIC CURVES FIT TO THE DATA OF TABLE 1.9
(Refer to Table 1.9 and Figure 1.11)
Refer to Introduction for theoretical background
ASSUMPTIONS: Q=375; TIME PERIOD 1955-2025
1. Enter labels ranging from 0 to 14, one for each year, in column A--the first entry is in cell A5.
2. Enter the years for which there is data, both actual and projected--1955 to 2025 in this case--in column B.
3. Enter the data that varies over time in another column (spreadsheet column C--WRD projection in this case), to be used as y values.
4. The variable t will play the role of x in the logistic equation. To begin, assume that q=375 (million). Thus, the logistic equation is now y=375(1+aebt).
5. Find the constant a:
In 1955, when t=0, y=45.486 (cell C5).
Thus, the logistic equation now becomes, at that time,
45.486=375/(1+aeb*0), or more simply, 45.486=350/(1+a).
The right hand equation can be solved for the single variable a, by isolating that variable on the left as:
a=-1+375/45.486=7.244294.
The logistic equation is now: y=375/(1+7.244294ebt).
6. Now, find b--use information from the other endpoint selected--t=14, in 2025. When t=14, y=234.987. So, at this time, the logistic equation yields, substituting in values for y and t:
234.987=375/(1+7.244294e14b).
This is an equation in a single variable; isolate the variable b on the left. Thus, e14b=(-1+(375/234.987))/7.244294. The arithmetic of the right hand side reduces easily using a computer or calculator, so that e14b = 0.0822486. Taking the natural logarithm of both sides produces 14b=ln(0.0822486) so that b=-0.17842. The b value has been determined and it is negative, as the theory tells us it must be. The logistic equation fit to the actual data is:
y=375/(1+7.244294e-0.17842*t).
7. In a separate column of the spreadsheet, labelled "q=375" in Table 1.9 (Column F), enter the equation derived from the logistic equation: in cell F5, enter the formula 375/(1+7.244294*@EXP(-0.17842*A5)). The value 45.48600523 should appear at the top of the "q=375" column. (The number of decimal places displayed will depend upon the width of the column.) Then, copy the cell content from F5 to the 14 cells below it; this should produce the entire numerical range shown in the F column. The second value in the q=375 column is 53.112302854.
8. Graph the results; select an XY-graph. Put the 14 entries for the years (column B) in the X-range, and the fourteen WRD projected values in the A range. Enter the 14 entries of the q=375 logistic data in an additional range, D, of the spreadsheet (Figure 1.11). The result should look like the curve labelled "Fit 3" in Figure 1.11.
__________________________________________________________
Logistic curves have often been used as models to describe the diffusion of biological, physical, and even social phenomena. In the latter vein, Trsten Hgerstrand (a Swedish geographer at the University of Lund) employed Monte Carlo simulation, and logistic curves, to study the diffusion of an innovation within a population; the results of his elegant study suggest that, generally speaking, people tend not to move far from their origins (see references). This result also fits with common sense, as we reflect on the perpetuation of cultural enclaves as well as on the perhaps surprisingly high percentage of our classmates who remain near their secondary schools.
---------------------------------------------------------------------------------------
BLACK BOX SUMMARY
see Introduction for theoretical explanation
LOGISTIC CURVE
y=q/(1+aebx)
where
b < 0
and
y=q is the upper bound of the curve, chosen ahead of time;
a and b are constants, calculated from the data.
---------------------------------------------------------------------------------------
Gompertz curve--Bangladesh total population
The process of fitting a logistic curve to a set of data is quite different from the process of fitting a straight line, using least squares, or an exponential to the same data set. With the logistic fit, two points and a line (the q value) control the shape of the S; indeed, the q-value functions as a ceiling pressing down on exponential growth, much as a living room ceiling inhibits the "natural" growth of tall indoor plants. We might wonder, then, if there are various modifications of the logistic curve that allow for creating S-shapes using a variation in how the ceiling value is chosen.
One variant of the logistic curve, in which the S-shape appears flatter is the Gompertz curve; it is used to model growth of various kinds, from financial to population. The reason the curve is flatter become evident when the logistic equation is written as a differential equation, dP/dt = P(a-b*P), and the Gompertz is also written in an equivalent manner, as dP/dt = P(a-b*ln P)--the logarithmic function tends to flatten out the curve and make the S-shape less curved than would a logistic fit.
Gompertz curves are fit to the Bangladesh total population values, assuming upper bounds of 300, 350, and 375 million (Table 1.10); the graphs of these fits are shown in Figure 1.12. A brief glance at these curves shows that they all lie above the actual data and that they are too straight--the Gompertz is not as good a choice as a logistic curve in fitting this data. One of the assumptions in fitting S-shaped curves is on the upper bound; another, in the previous case, is that the curve follow a logistic equation. Here, we alter the logistic assumption and fit curves to the same years, population values, and upper bounds. The mechanics of fitting the curves is given in the sequence of steps below.
The general form for the Gompertz equation used here is:
y=q*e-ce-bx
where q is selected prior to making any analysis and is the value of the upper bound selected by the user on carrying capacity or other bases, and b and c are constants to be determined depending of the values selected for q and the beginning and ending times chosen. There are numerous equivalent forms that this Gompertz equation might assume.
__________________________________________________________
GOMPERTZ CURVES FIT TO THE DATA OF TABLE 1.10
(Refer to Table 1.10 and Figure 1.12)
Refer to Introduction for theoretical background
ASSUMPTIONS: Q=300; TIME PERIOD 1955-2025
1. Enter labels ranging from 0 to 14, one for each year, in column A--the first entry is in cell A5.
2. Enter the years for which there is data, both actual and projected--1955 to 2025 in this case--in column B.
3. Enter the data that varies over time in another column (spreadsheet column C--WRD projection in this case), to be used as y values.
4. The variable t will play the role of x in the Gompertz equation.
To begin, assume that q=300 (million). Thus, the Gompertz
equation is now
y=300*e-ce-bx .
TABLE 1.10
Gompertz curve fits to total population with various
upper bounds of
q=300, q=350, q=375.
(Source: World Resources Institute)
5. Find the constant c:
In 1955, when t=0, y=45.486 (cell C5).
Thus, the logistic equation now becomes, at that time,
45.486=300*e-c.
The right hand equation can be solved for the single variable c, by isolating that variable on the left as:
c=ln(300/45.486)=1.8863779.
The Gompertz equation is now:
y=300*e-1.8863779e-bx .
6. Now, find b--use information from the other endpoint selected--t=14, in 2025. When t=14, y=234.987. So, at this time, the Gompertz equation yields, substituting in values for y and t:
234.987=300*e-1.8863779e-14b .
This is an equation in a single variable; isolate the variable b on the left. Thus, -1.8863779e-14b=ln(234.987/300)=-0.2442523. Thus, 0.1294822=e-14b so that -b = 1/14 ln(0.2442523/1.8863779) = -0.1460152. The Gompertz equation fit to the actual data is:
y=300*e-1.8863779e-0.1460152x
7. In a separate column of the spreadsheet, labelled "q=300" in Table 1.10 (Column D), enter the equation derived from the Gompertz equation: in cell D5, enter the formula 300*(@EXP(-1.8863779*@EXP(-0.1460152*A5))). The value 45.48599947 should appear at the top of the "q=300" column. (The number of decimal places displayed will depend upon the width of the column.) Then, copy the cell content from D5 to the 14 cells below it; this should produce the entire numerical range shown in the D column. The second value in the q=300 column is 58.77805652 (less than the corresponding logistic value, and closer to the WRD value).
8. Graph the results; select an XY-graph. Put the 14 entries for the years (column B) in the X-range, and the fourteen WRD projected values in the A range. Enter the 14 entries of the q=300 Gompertz data in an additional range, B, of the spreadsheet (Figure 1.12). The result should look like the curve labelled "Fit 1" in Figure 1.12.
__________________________________________________________
Adjustment of the q value may afford an even closer fit; we illustrate the effects of doing so, first by taking q=350, and then by taking q=375. Table 1.10 shows the results of doing so, and these results are also graphed on Figure 1.12. The detail of the mechanics required to obtain these values is reproduced below.
Figure 1.12. Gompertz curve fits to total population in Bangladesh, projected and actual, from 1955-2025. In Fit 1, q=300; in Fit 2, q=350; in Fit 3, q=375.
__________________________________________________________
GOMPERTZ CURVES FIT TO THE DATA OF TABLE 1.10
(Refer to Table 1.10 and Figure 1.12)
Refer to Introduction for theoretical background
ASSUMPTIONS: Q=350; TIME PERIOD 1955-2025
1. Enter labels ranging from 0 to 14, one for each year, in column A--the first entry is in cell A5.
2. Enter the years for which there is data, both actual and projected--1955 to 2025 in this case--in column B.
3. Enter the data that varies over time in another column (spreadsheet column C--WRD projection in this case), to be used as y values.
4. The variable t will play the role of x in the Gompertz equation.
To begin, assume that q=350 (million). Thus, the Gompertz
equation is now
y=350*e-ce-bx .
5. Find the constant c:
In 1955, when t=0, y=45.486 (cell C5).
Thus, the logistic equation now becomes, at that time,
45.486=350*e-c.
The right hand equation can be solved for the single variable c, by isolating that variable on the left as:
c=ln(350/45.486)=2.0405286.
The Gompertz equation is now:
y=350*e-2.0405286e-bx .
6. Now, find b--use information from the other endpoint selected--t=14, in 2025. When t=14, y=234.987. So, at this time, the Gompertz equation yields, substituting in values for y and t:
234.987=300*e-2.0405286e-14b .
This is an equation in a single variable; isolate the variable b on the left. Thus, -2.0405286e-14b=ln(234.987/350)=-0.398403. Thus, -b = 1/14 ln(0.398403/2.0405286) = -0.1166786. The Gompertz equation fit to the actual data is:
y=350*e-2.0405286e-0.1166786x
7. In a separate column of the spreadsheet, labelled "q=350" in Table 1.10 (Column E), enter the equation derived from the Gompertz equation: in cell E5, enter the formula 350*(@EXP(-2.0405286*@EXP(-0.1166786*A5))). The value 45.485998553 should appear at the top of the "q=350" column. (The number of decimal places displayed will depend upon the width of the column.) Then, copy the cell content from E5 to the 14 cells below it; this should produce the entire numerical range shown in the E column. The second value in the q=350 column is 56.947262424 (less than the corresponding logistic value, and closer to the WRD value).
8. Graph the results; select an XY-graph. Put the 14 entries for the years (column B) in the X-range, and the fourteen WRD projected values in the A range. Enter the 14 entries of the q=350 Gompertz data in an additional range, C, of the spreadsheet (Figure 1.12). The result should look like the curve labelled "Fit 2" in Figure 1.12.
__________________________________________________________
__________________________________________________________
GOMPERTZ CURVES FIT TO THE DATA OF TABLE 1.10
(Refer to Table 1.10 and Figure 1.12)
Refer to Introduction for theoretical background
ASSUMPTIONS: Q=375; TIME PERIOD 1955-2025
1. Enter labels ranging from 0 to 14, one for each year, in column A--the first entry is in cell A5.
2. Enter the years for which there is data, both actual and projected--1955 to 2025 in this case--in column B.
3. Enter the data that varies over time in another column (spreadsheet column C--WRD projection in this case), to be used as y values.
4. The variable t will play the role of x in the Gompertz equation.
To begin, assume that q=375 (million). Thus, the Gompertz
equation is now
y=375*e-ce-bx .
5. Find the constant c:
In 1955, when t=0, y=45.486 (cell C5).
Thus, the logistic equation now becomes, at that time,
45.486=375*e-c.
The right hand equation can be solved for the single variable c, by isolating that variable on the left as:
c=ln(375/45.486)=2.1095214.
The Gompertz equation is now:
y=375*e-2.2.1095214e-bx .
6. Now, find b--use information from the other endpoint selected--t=14, in 2025. When t=14, y=234.987. So, at this time, the Gompertz equation yields, substituting in values for y and t:
234.987=300*e-2.21095214e-14b .
This is an equation in a single variable; isolate the variable b on the left. Thus, -2.21095214e-14b=ln(234.987/375)=-0.4673958. Thus, -b = 1/14 ln(0.4673958/2.1095214) = -0.1076457. The Gompertz equation fit to the actual data is:
y=375*e-2.2109521e-0.1076457t .
7. In a separate column of the spreadsheet, labelled "q=375" in Table 1.10 (Column F), enter the equation derived from the Gompertz equation: in cell F5, enter the formula 375*(@EXP(-2.2109521*@EXP(-0.1076457*A5))). The value 45.486001804 should appear at the top of the "q=375" column. (The number of decimal places displayed will depend upon the width of the column.) Then, copy the cell content from F5 to the 14 cells below it; this should produce the entire numerical range shown in the F column. The second value in the q=375 column is 56.412479021 (just slightly less than the corresponding logistic value, and closer to the WRD value).
8. Graph the results; select an XY-graph. Put the 14 entries for the years (column B) in the X-range, and the fourteen WRD projected values in the A range. Enter the 14 entries of the q=375 Gompertz data in an additional range, D, of the spreadsheet (Figure 1.12). The result should look like the curve labelled "Fit 3" in Figure 1.12.
__________________________________________________________
These Gompertz curves appear generally not to fit the WRD data as well as do the logistic curves; however, the Gompertz fit is better to the early data, where the increase is steep--then, the relative flatness of the Gompertz curves carries them, from a steep start, to overfit the later WRD values. Again, these observations suggest the importance of understanding the various ways projections might be made--apparently subtle differences can translate into major differences when regional policies, involving large projects and sums of investment, are based on forecasts.
---------------------------------------------------------------------------------------
BLACK BOX SUMMARY
GOMPERTZ CURVE
y=q*e-ce-bt
where
y=q is the upper bound of the curve, chosen ahead of time;
b and c are constants, calculated from the data.
---------------------------------------------------------------------------------------
Total population and crude births
In looking at the relationship between total population and total crude births, a starting point might be to fit a straight line to "total population" (independent variable) and "total crude births" (dependent variable). Table 1.11 shows the needed data; when the regression feature of Lotus 1-2-3 is used to fit a straight line to the actual data, from 1955 to 1990, a good fit is established, with R-squared 0.9797306479. The right hand column uses the linear equation derived from the regression, noted at the bottom of Table 1.11, to project values for the crude birth variable from the population variable. The linear data closely track the actual WRD data as is reflected in the left-hand side of the Figure 1.13. When the linear fit is projected to the future, it is different from the WRD projection; indeed, the R-squared value for the fit of all the data, 1955-2025, is 0.594527, and the R-squared value for the fit of the projected data, alone (from 1990-2025), is 0.141413!
Figure 1.13. Regression line fit to actual data , 1955-1990, for Bangladesh. Linear and WRD projections appear to the right of the heavy vertical line.
TABLE 1.11
(Source: World Resources Institute)
A good fit of actual data by one function does not necessarily mean that that function will be used to make projections; there are an infinite number of functions that fit any finite display of data points. In this case, the background work suggests that the unexpected drop occurs in the projected birth data and that therefore one might need to do further research to determine why the decline is projected. Often it is helpful to separate data sets in order to understand why particular patterns occur.
References
1. Bogue, D. J. Principles of Demography, Wiley, New York, 1969.
2. Freedman, D., Pisani, R., Purves, R. Statistics, W. W. Norton, New
York, 1978.
3. Hgerstrand, T. Innovation Diffusion as a Spatial Process,
University of Chicago Press, Chicago, 1967.
4. Keyfitz, N. Introduction to the Mathematics of Population,
Addison-Wesley, Reading, Massachusetts, 1968.
5. Ness, G. D., Drake, W. D., Brechin, S. R. (eds.) Population-
Environment Dynamics: Ideas and Observations, The University of
Michigan Press, Ann Arbor, Michigan, 1993.
6. Zill, D. G. A First Course in Differential Equations (fifth edition),
Boston, PWS-Kent, 1993.
PAGE58 Chapter 1
Population Data Analysis PAGE59
PAGE13
v
}~#``
B C <tR**33t33!4#4==AAAAAGGH"HIIJJ!JbJUNVNQQSSSS.T0TTTAUDUOVQVWWYYo_q_|_m`o`ffggjjkkvkkkkllnnnngoioooppPqQq
]Qqqqrr
ttuuuuuuuuwwyyzz!z~~_c?ADEG}WZ=?ىCEiӋՋ̍ht(.^quw)c@Kalբע"$JXHJI B]BEHJȮʮӮ֮
1mƴԴegik>7:=?ֻٻȼ˼~A04*,dgKMVY-0"&(PR].>GI([
sHJhoruwyHJjw!&02ESTDH>A{}07BFGILZL %&($135>3:=@BDFhuv}&([nr_d8:F
ALrx&Y&\`
+
S
! <\,.c
<>y{km "tvZR
#S(***-/ //0촴쭦
900r1r3t333334!4#4%4'4)4+4-4/41434547494;4=4?4A4C4E4G4I4K4M4O4Q4S4U4W4Y4[4]4_4a4c4S5U58879?;]<_<===rAtAvAxAzA|A~AAAAAAAAAAAAAA
JAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGG"H$HIIIIIIIIIIIIIIIIIIIIIIIIIIIJ!JbJIMwPPQSaSSSS.T N.TTAUOVWY[L[N[]^^^^^8_o_q_|___!`#`%`'`)`+`-`/`1`3`5`7`9`;`=`?`A`C`E`G`I`K`M`O`Q`S`U`W`Y`[`]`_`a`c`e`g`i`k`m`o``bBbDbǳ @Db'fgjjjjjjjjjjjjjjjjjjjjjjjjjjjjjktkvkkkkkkkkklIlllllllVnnnnnngoopqr
tuuuɿ Auu_u}uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuwyPyRyTyVyXyZy\y^y`ybydyfyhyjylynypyrytyvyxyzy|y~yyyyz!zĽ J!zbzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz~~~~~~~~~~~~~~~~~~~~~~~~~acЀ?AE KE}W=ىۉ'Ekmoqsuwy{}Ӌ̍468:<>@BDFHJLNPRTVXZ\^` E`bdfhGeÐŐǐɐː͐ϐ
.\^q}Η')c VcR@KȠ!#%')+-/13579;=?ACEGIKMOQSUWY[]_alס
!#%')+-/13579;=?ACEGIKMZMOQSUWY[]_acegikmoqsuwy{}ע٢'1 "$&(*,.02468:<>@BDFHJJLNPRDFɿ HHb̭
wӮ
3579;=?ACEGIKMOQSUWY[]_acegikĺ착
B´ĴƴgiCEGٸ
9;=WlȼwUWAg0
5*OV6&P,.>LSZ_&(V
xz@Bĺ
3BnpHVhy{} "$&(*,.02468:<>@BDFHbOqj0쿵쫿 <0
!#%')+-/13579;=?ACE#%T{D>-j/I(ԙԙ <13 E
9;!3D^Kmhv&}[npr_bdﺺΖΏ 5kmoqsuwy{}:Flnprtvxz|~
bt*^
W
g
F |
0'13=fBF.KT]^i(ksqtxzP[S{BɠҨ0
R |
8/jj- K!"#$%Y&'()*<+,-.O/m0T1h2345,6*789:;DQWQqBL&0A.TDbu!zE`cMB0
Tms Rmn Symbol"HelvCG Times (E1)
"NewsGothicPerpetua"Univers (E1)"DomCasual1LetterGothicTimesNewRomanPSRoman|5:=DILW!!!h8,3William D. DrakeWilliam D. Drake