2.1 Raw Data
When data are collected, the information obtained from each member of a population or sample is recorded in the sequence in which it becomes available. This sequence of data recording is random and unranked. Such data, before they are grouped or ranked, are called raw data.
Definition of Raw Data
Data recorded in the sequence in which they are collected and before they are processed or ranked are called raw data.
Suppose we collect information on the ages (in years) of 50 students selected from a university. The data values, in the order they are collected, are recorded in Table 2.1. For instance, the first student’s age is 21, the second student’s age is 19 (second number in the first row), and so forth. The data in Table 2.1 are quantitative raw data.
Table 2.1 Ages of 50 Students
21 19 24 25 29 34 26 27 37 33
18 20 19 22 19 19 25 22 25 23
25 19 31 19 23 18 23 19 23 26
22 28 21 20 22 22 21 20 19 21
25 23 18 37 27 23 21 25 21 24
Suppose we ask the same 50 students about their student status. The responses of the students are recorded in Table 2.2. In this table, F, SO, J, and SE are the abbreviations for freshman, sophomore, junior, and senior, respectively. This is an example of qualitative (or categorical) raw data.
Table 2.2 Status of 50 Students
J F SO SE J J SE J J J
F F J F F F SE SO SE J
J F SE SO SO F J F SE SE
SO SE J SO SO J J SO F SO
SE SE F SE J SO F J SO SO
The data presented in Tables 2.1 and 2.2 are also called ungrouped data. An ungrouped data set contains information on each member of a sample or population individually.
2.2 Organizing and Graphing Qualitative Data
This section discusses how to organize and display qualitative (or categorical) data. Data sets are organized into tables, and data are displayed using graphs.
2.2.1 Frequency Distributions
A sample of 100 students enrolled at a university were asked what they intended to do after graduation. Forty-four said they wanted to work for private companies/businesses, 16 said they wanted to work for the federal government, 23 wanted to work for state or local governments, and 17 intended to start their own businesses. Table 2.3 lists the types of employment and the number of students who intend to engage in each type of employment. In this table, the variable is the type of employment, which is a qualitative variable. The categories (representing the type of employment) listed in the first column are mutually exclusive. In other words, each of the 100 students belongs to one and only one of these categories. The number of students who belong to a certain category is called the frequency of that category. A frequency distribution exhibits how the frequencies are distributed over various categories. Table 2.3 is called a frequency distribution table or simply a frequency table.
Table 2.3 Type of Employment Students Intend to Engage In
Type of Employment Number of Students
Private companies/businesses 44
Federal government 16
State/local government 23
Own business 17
Sum = 100
Definition of Frequency Distribution for Qualitative Data
A frequency distribution for qualitative data lists all categories and the number of elements that belong to each of the categories.
Example 2-1
A sample of 30 employees from large companies was selected, and these employees were asked how stressful their jobs were. The responses of these employees are recorded below, where very represents very stressful, somewhat means somewhat stressful, and none stands for not stressful at all.
somewhat none somewhat very very none
very somewhat somewhat very somewhat somewhat
very somewhat none very none somewhat
somewhat very somewhat somewhat very none
somewhat very very somewhat none somewhat
Construct a frequency distribution table for these data.
Solution
Note that the variable in this example is how stressful an employee’s job is. This variable is classified into three categories: very stressful, somewhat stressful, and not stressful at all. We record these categories in the first column of Table 2.4. Then we read each employee’s response from the given data and mark a tally, denoted by the symbol |, in the second column of Table 2.4 next to the corresponding category. For example, the first employee’s response is that his or her job is somewhat stressful. We show this in the frequency table by marking a tally in the second column next to the category somewhat. Note that the tallies are marked in blocks of five for counting convenience. Finally, we record the total of the tallies for each category in the third column of the table. This column is called the column of frequencies and is usually denoted by f. The sum of the entries in the frequency column gives the sample size or total frequency. In Table 2.4, this total is 30, which is the sample size.
Table 2.4 Frequency Distribution of Stress on Job
Stress on Job Tally Frequency (f)
Very ||||| ||||| 10
Somewhat ||||| ||||| |||| 14
None ||||| | 6
Sum = 30
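The tallying procedure of Example 2-1 can also be sketched in a few lines of Python. This is only an illustration: the variable names are our own, and the standard-library Counter stands in for the hand tally.

```python
from collections import Counter

# Responses from Example 2-1, in the order they were recorded.
responses = """
somewhat none somewhat very very none
very somewhat somewhat very somewhat somewhat
very somewhat none very none somewhat
somewhat very somewhat somewhat very none
somewhat very very somewhat none somewhat
""".split()

# Counter tallies how many times each category occurs.
frequency = Counter(responses)

# Print the categories in the same order as Table 2.4.
for category in ("very", "somewhat", "none"):
    print(f"{category:<10}{frequency[category]:>3}")
print(f"{'Sum':<10}{sum(frequency.values()):>3}")
```

The printed frequencies should agree with the third column of Table 2.4, and their sum with the sample size.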
2.2.2 Relative Frequency and Percentage Distributions
The relative frequency of a category is obtained by dividing the frequency of that category by the sum of all frequencies. Thus, the relative frequency shows what fractional part or proportion of the total frequency belongs to the corresponding category. A relative frequency distribution lists the relative frequencies for all categories.
Calculating Relative Frequency of a Category
Relative frequency of a category = Frequency of that category / Sum of all frequencies
The percentage for a category is obtained by multiplying the relative frequency of that category by 100. A percentage distribution lists the percentages for all categories.
Calculating Percentage
Percentage = (Relative frequency) x 100
Example 2-2
Determine the relative frequency and percentage distributions for the data of Table 2.4.
Solution
The relative frequencies and percentages from Table 2.4 are calculated and listed in Table 2.5. Based on this table, we can state that 0.333 or 33.3% of the employees said that their jobs are very stressful. By adding the percentages for the first two categories, we can state that 80% of the employees said that their jobs are very or somewhat stressful. The other numbers in Table 2.5 can be interpreted the same way.
Notice that the sum of the relative frequencies is always 1.00 (or approximately 1.00 if the relative frequencies are rounded), and the sum of the percentages is always 100 (or approximately 100 if the percentages are rounded).
Table 2.5 Relative Frequency and Percentage Distributions of Stress on Job
Stress on Job Relative Frequency Percentage
Very 10/30 = 0.333 0.333 x 100 = 33.3
Somewhat 14/30 = 0.467 0.467 x 100 = 46.7
None 6/30 = 0.200 0.200 x 100 = 20.0
Sum = 1.000 Sum = 100
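As a rough check on Table 2.5, the two formulas of this section can be applied directly to the frequencies of Table 2.4. The dictionary names below are illustrative choices, not part of the text.

```python
# Frequencies from Table 2.4.
frequency = {"Very": 10, "Somewhat": 14, "None": 6}
total = sum(frequency.values())

# Relative frequency = frequency of the category / sum of all frequencies.
relative_frequency = {c: f / total for c, f in frequency.items()}
# Percentage = relative frequency x 100.
percentage = {c: rf * 100 for c, rf in relative_frequency.items()}

for c in frequency:
    print(f"{c:<10}{relative_frequency[c]:>7.3f}{percentage[c]:>8.1f}")
```

Adding the percentages for Very and Somewhat reproduces the 80% quoted in the solution of Example 2-2.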
2.2.3 Graphical Presentation of Qualitative Data
All of us have heard the adage “a picture is worth a thousand words.” A graphic display can reveal at a glance the main characteristics of a data set. The bar graph and the pie chart are two types of graphs used to display qualitative data.
Bar Graphs
To construct a bar graph (also called a bar chart), we mark the various categories on the horizontal axis as in Figure 2.1. Note that all categories are represented by intervals of the same width. We mark the frequencies on the vertical axis. Then we draw one bar for each category such that the height of the bar represents the frequency of the corresponding category. We leave a small gap between adjacent bars. Figure 2.1 gives the bar graph for the frequency distribution of Table 2.4.
Figure 2.1 Bar graph for the frequency distribution of Table 2.4
Definition of Bar Graph
A graph made of bars whose heights represent the frequencies of respective categories is called a bar graph.
The bar graphs for relative frequency and percentage distributions can be drawn simply by marking the relative frequencies or percentages, instead of the class frequencies, on the vertical axis. Sometimes a bar graph is constructed by marking the categories on the vertical axis and the frequencies on the horizontal axis.
BUS 3507 BUSINESS STATISTICS
Pie Charts
A pie chart is more commonly used to display percentages, although it can be used to display frequencies or relative frequencies. The whole pie (or circle) represents the total sample or population. Then we divide the pie into different portions that represent the different categories.
Definition of Pie Chart
A circle divided into portions that represent the relative frequencies or percentages of a population or a sample belonging to different categories is called a pie chart.
As we know, a circle contains 360 degrees. To construct a pie chart, we multiply 360 by the relative frequency of each category to obtain the degree measure or size of the angle for the corresponding category. Table 2.6 shows the calculation of angle sizes for the various categories of Table 2.5.
Table 2.6 Calculating Angle Sizes for the Pie Chart
Stress on Job Relative Frequency Angle Size
Very 10/30 = 0.333 360 x 0.333 = 119.88
Somewhat 14/30 = 0.467 360 x 0.467 = 168.12
None 6/30 = 0.200 360 x 0.200 = 72.00
Sum = 1.000 Sum = 360
Figure 2.2 shows the pie chart for the percentage distribution of Table 2.5, which uses the angle sizes calculated in Table 2.6.
Figure 2.2 Pie chart for the percentage distribution of Table 2.5.
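The angle calculation of Table 2.6 can be sketched as follows. The code computes each angle twice: once from the relative frequency rounded to three decimals, as the table does, and once from the unrounded relative frequency. The second version is our own addition for comparison.

```python
# Frequencies from Table 2.4.
frequency = {"Very": 10, "Somewhat": 14, "None": 6}
total = sum(frequency.values())

# Angle = 360 x relative frequency, with the relative frequency
# rounded to three decimals as in Table 2.6.
angle_rounded = {c: 360 * round(f / total, 3) for c, f in frequency.items()}

# The same angle from the unrounded relative frequency.
angle_exact = {c: 360 * f / total for c, f in frequency.items()}

for c in frequency:
    print(f"{c:<10}{angle_rounded[c]:>8.2f}{angle_exact[c]:>8.2f}")
```

The two versions differ only slightly (119.88 versus 120 degrees for Very), and both sets of angles add up to 360 degrees.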
2.3 Organizing and Graphing Quantitative Data
In the previous section we learned how to group and display qualitative data. This section explains how to group and display quantitative data.
2.3.1 Frequency Distributions
Table 2.7 gives the weekly earnings of 100 employees of a large company. The first column lists the classes, which represent the (quantitative) variable weekly earnings. For quantitative data, an interval that includes all the values that fall within two numbers, the lower and upper limits, is called a class. Note that the classes always represent a variable. As we can observe, the classes are nonoverlapping; that is, each value on earnings belongs to one and only one class. The second column in the table lists the number of employees who have earnings within each class. For example, nine employees of this company earn $401 to $600 per week. The numbers listed in the second column are called the frequencies, which give the number of values that belong to different classes. The frequencies are denoted by f.
Table 2.7 Weekly Earnings of 100 Employees of a Company
Weekly Earnings (dollars) Number of Employees (f)
401 to 600 9
601 to 800 22
801 to 1000 39
1001 to 1200 15
1201 to 1400 9
1401 to 1600 6
For quantitative data, the frequency of a class represents the number of values in the data set that fall in that class. Table 2.7 contains six classes. Each class has a lower limit and an upper limit. The values 401, 601, 801, 1001, 1201, and 1401 give the lower limits, and the values 600, 800, 1000, 1200, 1400, and 1600 are the upper limits of the six classes, respectively. The data presented in Table 2.7 are an illustration of a frequency distribution table for quantitative data. Whereas the data that list individual values are called ungrouped data, the data presented in a frequency distribution table are called grouped data.
Definition of Frequency Distribution for Quantitative Data
A frequency distribution for quantitative data lists all the classes and the number of values that belong to each class. Data presented in the form of a frequency distribution are called grouped data.
To find the midpoint of the upper limit of the first class and the lower limit of the second class in Table 2.7, we divide the sum of these two limits by 2. Thus, this midpoint is
(600 + 601) / 2 = 600.5
The value 600.5 is called the upper boundary of the first class and the lower boundary of the second class. By using this technique, we can convert the class limits of Table 2.7 to class boundaries, which are also called real class limits. The second column of Table 2.8 lists the boundaries for Table 2.7.
Definition of Class Boundary
The class boundary is given by the midpoint of the upper limit of one class and the lower limit of the next class.
The difference between the two boundaries of a class gives the class width. The class width is also called the class size.
Finding Class Width
Class width = Upper boundary - Lower boundary
Thus, in Table 2.8,
Width of the first class = 600.5 - 400.5 = 200
The class widths for the frequency distribution of Table 2.7 are listed in the third column of Table 2.8. Each class in Table 2.8 (and Table 2.7) has the same width of 200.
The class midpoint or mark is obtained by dividing the sum of the two limits (or the two boundaries) of a class by 2.
Calculating Class Midpoint or Mark
Class Midpoint or Mark = (Lower Limit + Upper Limit ) / 2
Thus, the midpoint of the first class in Table 2.7 or Table 2.8 is calculated as follows:
Midpoint of the 1st class = (401 + 600) / 2 = 500.5
The class midpoints for the frequency distribution of Table 2.7 are listed in the fourth column of Table 2.8.
Table 2.8 Class Boundaries, Class Widths, and Class Midpoints for Table 2.7
Class Limits Class Boundaries Class Width Class Midpoint
401 to 600 400.5 to less than 600.5 200 500.5
601 to 800 600.5 to less than 800.5 200 700.5
801 to 1000 800.5 to less than 1000.5 200 900.5
1001 to 1200 1000.5 to less than 1200.5 200 1100.5
1201 to 1400 1200.5 to less than 1400.5 200 1300.5
1401 to 1600 1400.5 to less than 1600.5 200 1500.5
Note that in Table 2.8, when we write classes using class boundaries, we write to less than to ensure that each value belongs to one and only one class. As we can see, the upper boundary of the preceding class and the lower boundary of the succeeding class are the same.
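The columns of Table 2.8 can be derived from the class limits of Table 2.7 with a short sketch such as the one below. It assumes, as Table 2.8 does, that the first lower boundary and the last upper boundary lie half a gap beyond the outermost limits. The variable names are illustrative.

```python
# Class limits from Table 2.7 as (lower limit, upper limit) pairs.
limits = [(401, 600), (601, 800), (801, 1000),
          (1001, 1200), (1201, 1400), (1401, 1600)]

# Gap between the upper limit of one class and the lower limit of the
# next (1 for these data); each boundary lies half of this gap beyond a limit.
gap = limits[1][0] - limits[0][1]

rows = []
for lower, upper in limits:
    lower_boundary = lower - gap / 2
    upper_boundary = upper + gap / 2
    width = upper_boundary - lower_boundary   # class width
    midpoint = (lower + upper) / 2            # class midpoint or mark
    rows.append((lower_boundary, upper_boundary, width, midpoint))

for row in rows:
    print(row)
```

Each printed row gives the lower boundary, upper boundary, width, and midpoint of one class, in the order used in Table 2.8.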
2.3.2 Constructing Frequency Distribution Tables
When constructing a frequency distribution table, we need to make the following three major decisions.
Number of Classes
Usually the number of classes for a frequency distribution table varies from 5 to 20, depending mainly on the number of observations in the data set. It is preferable to have more classes as the size of a data set increases. The decision about the number of classes is arbitrarily made by the data organizer.
Class Width
Although it is not uncommon to have classes of different sizes, most of the time it is preferable to have the same width for all classes. To determine the class width when all classes are the same size, first find the difference between the largest and the smallest values in the data. Then, the approximate width of a class is obtained by dividing this difference by the number of desired classes.
Calculation of Class Width
Approximate class width = ( Largest Value - Smallest Value ) / Number of Classes
Usually this approximate class width is rounded to a convenient number, which is then used as the class width. Note that rounding this number may slightly change the number of classes initially intended.
Lower Limit of the First Class or the Starting Point
Any convenient number that is equal to or less than the smallest value in the data set can be used as the lower limit of the first class.
Example 2-3 illustrates the procedure for constructing a frequency distribution table for quantitative data.
Example 2-3
Table 2.9 gives the total home runs hit by all players of each of the 30 Major League Baseball teams during the 2004 season. Construct a frequency distribution table.
Solution
In these data, the minimum value is 135 and the maximum value is 242. Suppose we decide to group these data using five classes of equal width. Then,
Approximate width of each class = (242 - 135) / 5 = 21.4
Now we round this approximate width to a convenient number—say, 22. The lower limit of the first class can be taken as 135 or any number less than 135. Suppose we take 135 as the lower limit of the first class. Then our classes will be
135–156, 157–178, 179–200, 201–222, and 223–244
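The steps above can be sketched in Python. Rounding the approximate width up with math.ceil is our assumption for "a convenient number"; any other convenient width could be substituted.

```python
import math

smallest, largest = 135, 242
number_of_classes = 5

# Approximate class width = (largest value - smallest value) / number of classes.
approximate_width = (largest - smallest) / number_of_classes

# Assumption: round the approximate width up to the next whole number.
class_width = math.ceil(approximate_width)

# With whole-number data, each class covers class_width consecutive integers,
# starting from the chosen lower limit of the first class.
lower = smallest
classes = []
while lower <= largest:
    classes.append((lower, lower + class_width - 1))
    lower += class_width

print(approximate_width, class_width)
print(classes)
```

Starting from 135 with a width of 22, the loop stops after five classes because the fifth class already contains the largest value, 242.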
One rule to help decide on the number of classes is Sturges’ formula:
c = 1 + 3.3 log n
where c is the number of classes and n is the number of observations in the data set. The value of log n can be obtained by entering the value of n on the calculator and pressing the log key.
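A small sketch of this formula follows, using the base-10 logarithm that the log key on a calculator provides. The function name is a hypothetical choice made for this illustration.

```python
import math

def sturges_classes(n):
    """Number of classes suggested by c = 1 + 3.3 log n (log to base 10)."""
    return 1 + 3.3 * math.log10(n)

# For the 30 teams of Example 2-3.
c = sturges_classes(30)
print(round(c, 2))
```

For n = 30 the formula gives a value a little below 6, so it would suggest about six classes. Example 2-3 uses five, which is consistent with the remark that the final choice rests with the data organizer.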
Table 2.9 Home Runs Hit by Major League Baseball Teams During the 2004 Season
Team Home Runs Team Home Runs
Arizona 135 Milwaukee 135
Atlanta 178 Minnesota 191
Baltimore 169 Montreal (now Washington) 151
Boston 222 New York Mets 185
Chicago Cubs 235 New York Yankees 242
Chicago White Sox 242 Oakland 189
Cincinnati 194 Philadelphia 215
Cleveland 184 Pittsburgh 142
Colorado 202 St. Louis 214
Detroit 201 San Diego 139
Florida 148 San Francisco 183
Houston 187 Seattle 136
Kansas City 150 Tampa Bay 145
Now we read each value from the given data and mark a tally in the second column of Table 2.10 next to the corresponding class. The first value in our original data is 135, which belongs to the 135–156 class. To record it, we mark a tally in the second column next to the 135–156 class. We continue this process until all the data values have been read and entered in the tally column. Note that tallies are marked in blocks of five for counting convenience. After the tally column is completed, we count the tally marks for each class and write those numbers in the third column. This gives the column of frequencies. These frequencies represent the number of teams that belong to each of the five different classes representing the total home runs. For example, 10 of the 30 Major League Baseball teams hit a total of 135–156 home runs during the 2004 season.
Table 2.10 Frequency Distribution for the Data of Table 2.9
Total Home Runs Tally Frequency (f)
135 – 156 ||||| ||||| 10
157 – 178 ||| 3
179 – 200 ||||| || 7
201 – 222 ||||| | 6
223 – 244 |||| 4
Sum = 30
In Table 2.10, we can denote the frequencies of the five classes by f1, f2, f3, f4, and f5, respectively. Therefore,
f1 = Frequency of the first class = 10
Similarly, f2 = 3, f3 = 7, f4 = 6, and f5 = 4. Hence, the sum of the frequencies of all classes
= f1 + f2 + f3 + f4 + f5
= 10 + 3 + 7 + 6 + 4
= 30
The number of observations in a sample is usually denoted by n. The number of observations in a population is denoted by N. Consequently, the sum of the frequencies, Σf, is equal to n for sample data and to N for population data. Because the data set on the total home runs by Major League Baseball teams in Table 2.10 is for all 30 teams, it represents the population, so Σf = N = 30.
Note that when we present the data in the form of a frequency distribution table, as in Table 2.10, we lose the information on individual observations. We cannot know the exact number of home runs hit by any particular Major League Baseball team from Table 2.10. All we know is that the home runs hit by 10 of these teams during the 2004 season are between 135 and 156, and so forth.
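The tallying step of Example 2-3 amounts to finding, for each value, the class whose limits contain it. The sketch below illustrates this on the 13 values in the left-hand column of Table 2.9 only, so its counts are for illustration and are not those of Table 2.10, which covers all 30 teams.

```python
# Classes chosen in Example 2-3 as (lower limit, upper limit) pairs.
classes = [(135, 156), (157, 178), (179, 200), (201, 222), (223, 244)]

# Left-hand column of Table 2.9 (13 teams, Arizona through Kansas City).
values = [135, 178, 169, 222, 235, 242, 194, 184, 202, 201, 148, 187, 150]

# One counter per class; each value adds one tally to the class containing it.
frequency = [0] * len(classes)
for value in values:
    for i, (lower, upper) in enumerate(classes):
        if lower <= value <= upper:
            frequency[i] += 1
            break

for (lower, upper), f in zip(classes, frequency):
    print(f"{lower}-{upper}  {f}")
```

Because the classes are nonoverlapping, every value is counted exactly once, and the frequencies add up to the number of values read.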
2.3.3 Relative Frequency and Percentage Distributions
Using Table 2.10, we can compute the relative frequency and percentage distributions the same way we did for qualitative data in Section 2.2.2. The relative frequencies and percentages for a quantitative data set are obtained as follows.
Relative frequency of a class = Frequency of that class / Sum of all frequencies
Percentage = (Relative frequency) x 100
Example 2-4
Calculate the relative frequencies and percentages for Table 2.10.
Solution
The relative frequencies and percentages for the data in Table 2.10 are calculated and listed in the third and fourth columns, respectively, of Table 2.11 here. Note that the class boundaries are listed in the second column of Table 2.11.
Table 2.11 Relative Frequency and Percentage Distributions for Table 2.10
Total Home Runs Class Boundaries Relative Frequency Percentage
135–156 134.5 to less than 156.5 0.333 33.3
157–178 156.5 to less than 178.5 0.100 10.0
179–200 178.5 to less than 200.5 0.233 23.3
201–222 200.5 to less than 222.5 0.200 20.0
223–244 222.5 to less than 244.5 0.133 13.3
Sum = 0.999 Sum = 99.9%
Using Table 2.11, we can make statements about the percentage of teams with home runs within a certain interval. For example, 33.3% of the Major League Baseball teams in this population hit total home runs between 135–156 during the 2004 season. By adding the percentages for the first two classes, we can state that about 43.3% of these teams hit home runs between 135–178 during the 2004 season. Similarly, by adding the percentages of the last two classes, we can state that about 33.3% of these teams hit home runs between 201–244 during the 2004 season.
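The relative frequencies and percentages of Table 2.11 can be reproduced from the frequencies of Table 2.10 with a sketch like the following. It rounds each entry before summing, which is why the totals come out as 0.999 and 99.9 rather than exactly 1 and 100.

```python
# Frequencies from Table 2.10.
classes = ["135-156", "157-178", "179-200", "201-222", "223-244"]
frequency = [10, 3, 7, 6, 4]
total = sum(frequency)

# Round each entry as it would be printed in the table.
relative = [round(f / total, 3) for f in frequency]
percent = [round(100 * f / total, 1) for f in frequency]

for c, rf, p in zip(classes, relative, percent):
    print(f"{c}  {rf:.3f}  {p:.1f}")
print(f"Sum  {sum(relative):.3f}  {sum(percent):.1f}")
```

Summing the unrounded values f / total instead would give 1, up to floating-point error.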
2.3.4 Graphing Grouped Data
Grouped (quantitative) data can be displayed in a histogram or a polygon. This section describes how to construct such graphs. We can also draw a pie chart to display the percentage distribution for a quantitative data set. The procedure to construct a pie chart is similar to the one for qualitative data explained in Section 2.2.3; it will not be repeated in this section.
Histograms
A histogram can be drawn for a frequency distribution, a relative frequency distribution, or a percentage distribution. To draw a histogram, we first mark classes on the horizontal axis and frequencies (or relative frequencies or percentages) on the vertical axis. Next, we draw a bar for each class so that its height represents the frequency of that class. The bars in a histogram are drawn adjacent to each other with no gap between them. A histogram is called a frequency histogram, a relative frequency histogram, or a percentage histogram depending on whether frequencies, relative frequencies, or percentages are marked on the vertical axis.
Definition of Histogram
A histogram is a graph in which classes are marked on the horizontal axis and the frequencies, relative frequencies, or percentages are marked on the vertical axis. The frequencies, relative frequencies, or percentages are represented by the heights of the bars. In a histogram, the bars are drawn adjacent to each other.
Figures 2.3 and 2.4 show the frequency and the relative frequency histograms, respectively, for the data of Tables 2.10 and 2.11 of Sections 2.3.2 and 2.3.3. The two histograms look alike because they represent the same data. A percentage histogram can be drawn for the percentage distribution of Table 2.11 by marking the percentages on the vertical axis.
The symbol –//– used in the horizontal axes of Figures 2.3 and 2.4 represents a break, called the truncation, in the horizontal axis. It indicates that the entire horizontal axis is not shown in these figures. Notice that the 0 to 134.5 portion of the horizontal axis has been omitted in each figure.
Figure 2.3 Frequency histogram for Table 2.10.
Figure 2.4 Relative frequency histogram for Table 2.11.
Polygons
A polygon is another device that can be used to present quantitative data in graphic form. To draw a frequency polygon, we first mark a dot above the midpoint of each class at a height equal to the frequency of that class. This is the same as marking the midpoint at the top of each bar in a histogram. Next we mark two more classes, one at each end, and mark their midpoints. Note that these two classes have zero frequencies. In the last step, we join the adjacent dots with straight lines. The resulting line graph is called a frequency polygon or simply a polygon.
A polygon with relative frequencies marked on the vertical axis is called a relative frequency polygon. Similarly, a polygon with percentages marked on the vertical axis is called a percentage polygon.
Definition of Polygon
A graph formed by joining the midpoints of the tops of successive bars in a histogram with straight lines is called a polygon.
Figure 2.5 shows the frequency polygon for the frequency distribution of Table 2.10.
Figure 2.5 Frequency polygon for Table 2.10.
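The points joined in a frequency polygon can be listed with a short sketch. It assumes the two added end classes have the same width (22) as the other classes; the names xs, ys, and points are our own.

```python
# Classes and frequencies from Table 2.10.
classes = [(135, 156), (157, 178), (179, 200), (201, 222), (223, 244)]
frequency = [10, 3, 7, 6, 4]
width = 22

# A dot is marked above the midpoint of each class.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Add one zero-frequency class at each end so the polygon meets the axis.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequency + [0]

# Joining these points in order with straight lines gives the polygon.
points = list(zip(xs, ys))
print(points)
```

The first and last points sit on the horizontal axis, at the midpoints of the two added classes.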
2.5 Cumulative Frequency Distributions
Consider again Example 2–3 of Section 2.3.2 about the home runs hit by Major League Baseball teams. Suppose we want to know how many teams hit a total of 200 or fewer home runs during the 2004 season. Such a question can be answered using a cumulative frequency distribution. Each class in a cumulative frequency distribution table gives the total number of values that fall below a certain value. A cumulative frequency distribution is constructed for quantitative data only.
Definition of Cumulative Frequency Distribution
A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class.
In a cumulative frequency distribution table, each class has the same lower limit but a different upper limit. Example 2–5 illustrates the procedure to prepare a cumulative frequency distribution.
Example 2-5
Using the frequency distribution of Table 2.10, reproduced here, prepare a cumulative frequency distribution for the home runs hit by Major League Baseball teams during the 2004 season.
Total Home Runs f
135–156 10
157–178 3
179–200 7
201–222 6
223–244 4
Solution
Table 2.12 gives the cumulative frequency distribution for the home runs hit by Major League Baseball teams. As we can observe, 135 (which is the lower limit of the first class in Table 2.10) is taken as the lower limit of each class in Table 2.12. The upper limits of all classes in Table 2.12 are the same as those in Table 2.10. To obtain the cumulative frequency of a class, we add the frequency of that class in Table 2.10 to the frequencies of all preceding classes. The cumulative frequencies are recorded in the third column of Table 2.12. The second column of this table lists the class boundaries.
Table 2.12 Cumulative Frequency Distribution of Home Runs by Baseball Teams
Class Limits Class Boundaries Cumulative Frequency
135–156 134.5 to less than 156.5 10
135–178 134.5 to less than 178.5 10 + 3 = 13
135–200 134.5 to less than 200.5 10 + 3 + 7 = 20
135–222 134.5 to less than 222.5 10 + 3 + 7 + 6 = 26
135–244 134.5 to less than 244.5 10 + 3 + 7 + 6 + 4 = 30
From Table 2.12, we can determine the number of observations that fall below the upper limit or boundary of each class. For example, 20 Major League Baseball teams hit a total of 200 or fewer home runs.
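The running totals in the third column of Table 2.12 can be obtained with a sketch like this one, which uses accumulate from Python's standard itertools module in place of the hand additions.

```python
from itertools import accumulate

# Upper limits and frequencies of the classes in Table 2.10.
upper_limits = [156, 178, 200, 222, 244]
frequency = [10, 3, 7, 6, 4]

# Each cumulative frequency is the frequency of a class plus the
# frequencies of all preceding classes.
cumulative = list(accumulate(frequency))

for upper, cf in zip(upper_limits, cumulative):
    print(f"135-{upper}  {cf}")
```

The last cumulative frequency always equals the total number of observations, here 30.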
The cumulative relative frequencies are obtained by dividing the cumulative frequencies by the total number of observations in the data set. The cumulative percentages are obtained by multiplying the cumulative relative frequencies by 100.
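Calculating Cumulative Relative Frequency and Cumulative Percentage
Cumulative relative frequency = Cumulative frequency of a class / Total observations in the data set
Cumulative percentage = (Cumulative relative frequency) x 100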
Table 2.13 contains both the cumulative relative frequencies and the cumulative percentages for Table 2.12. We can observe, for example, that 66.7% of the Major League Baseball teams hit 200 or fewer home runs during the 2004 season.
Table 2.13 Cumulative Relative Frequency and Cumulative Percentage Distributions for Home Runs Hit by Baseball Teams
Class Limits Cumulative Relative Frequency Cumulative Percentage
135–156 10/30 = 0.333 33.3
135–178 13/30 = 0.433 43.3
135–200 20/30 = 0.667 66.7
135–222 26/30 = 0.867 86.7
135–244 30/30 = 1.000 100.0
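The two columns of this table follow from the cumulative frequencies of Table 2.12, as the sketch below illustrates. The list names are illustrative.

```python
# Cumulative frequencies from Table 2.12.
cumulative = [10, 13, 20, 26, 30]
n = cumulative[-1]   # total number of observations

# Cumulative relative frequency = cumulative frequency / total observations.
cum_rel = [round(cf / n, 3) for cf in cumulative]
# Cumulative percentage = cumulative relative frequency x 100.
cum_pct = [round(100 * cf / n, 1) for cf in cumulative]

for rf, p in zip(cum_rel, cum_pct):
    print(f"{rf:.3f}  {p:.1f}")
```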
Ogives
When plotted on a diagram, the cumulative frequencies give a curve that is called an ogive. Figure 2.6 gives an ogive for the cumulative frequency distribution of Table 2.12. To draw the ogive in Figure 2.6, the variable, which is total home runs, is marked on the horizontal axis and the cumulative frequencies on the vertical axis. Then the dots are marked above the upper boundaries of various classes at the heights equal to the corresponding cumulative frequencies. The ogive is obtained by joining consecutive points with straight lines. Note that the ogive starts at the lower boundary of the first class and ends at the upper boundary of the last class.
Figure 2.6 Ogive for the cumulative frequency distribution of Table 2.12
Definition of Ogive
An ogive is a curve drawn for the cumulative frequency distribution by joining with straight lines the dots marked above the upper boundaries of classes at heights equal to the cumulative frequencies of respective classes.
One advantage of an ogive is that it can be used to approximate the cumulative frequency for any interval. For example, we can use Figure 2.6 to find the number of Major League Baseball teams with 188 or fewer home runs. First, draw a vertical line from 188 on the horizontal axis up to the ogive. Then draw a horizontal line from the point where this line intersects the ogive to the vertical axis. This point gives the cumulative frequency of the class 135–188. In Figure 2.6, this cumulative frequency is (approximately) 16 as shown by the dashed line. Therefore, approximately 16 baseball teams had 188 or fewer home runs during the 2004 season.
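Because the ogive joins its points with straight lines, reading a value off the graph is the same as linear interpolation between two class boundaries. The sketch below illustrates this under that assumption; the function name ogive_value is hypothetical.

```python
# Class boundaries and the cumulative frequency reached at each boundary
# (0 at the lower boundary of the first class), from Table 2.12.
boundaries = [134.5, 156.5, 178.5, 200.5, 222.5, 244.5]
cumulative = [0, 10, 13, 20, 26, 30]

def ogive_value(x):
    """Cumulative frequency read off the ogive at x by linear interpolation."""
    if x <= boundaries[0]:
        return 0.0
    if x >= boundaries[-1]:
        return float(cumulative[-1])
    for i in range(1, len(boundaries)):
        if x <= boundaries[i]:
            x0, x1 = boundaries[i - 1], boundaries[i]
            y0, y1 = cumulative[i - 1], cumulative[i]
            # Point on the straight segment joining (x0, y0) and (x1, y1).
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

estimate = ogive_value(188)
print(round(estimate, 2))
```

For 188 home runs the interpolated value is slightly above 16, which matches the reading of approximately 16 from Figure 2.6. It is an estimate only, since the grouped data no longer record the individual team totals.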
We can draw an ogive for cumulative relative frequency and cumulative percentage distributions the same way we did for the cumulative frequency distribution.