Regression Modeling Analysis:
Determining the Number of Days a Patient is in the Hospital

Rochester Institute of Technology
Executive Masters of Business Administration
CLASS XX/TEAM III
CHRIS HAJECKI, CHUCK D’AGNOSTINO,
BARBARA STEPHENS, MARY KATE SCANLON
NOVEMBER 17th, 2012

Abstract:
Objectives: To identify factors associated with the number of days a patient spends in the hospital. Design: A multiple regression model applied to sample data. We chose this design because it estimates, in a single model, how each of the independent variables relates to the number of days a patient spends in the hospital. We began by running a regression model using all variables. We then removed the variables with p-values over .05, which indicated that they were not statistically significant. We were surprised that “age at first claim” was not significant. We ran correlations between the significant independent variables, which showed that they were not correlated with one another. Then we ran another regression model with only the variables that had p-values under .05. For both regression models, the p-value of the F-test showed that at least one of the independent variables had a nonzero coefficient. Next, we considered that there must be some relationship with “age at first claim”. We sorted the data set into two subsets, one with 0-39 year olds and one with 40+ year olds. For each age group we ran a regression with “visit count” and “drug count”, and then another with “visit count,” “drug count” and “age at first claim”. Breaking the data into subsets improved the goodness of fit (adjusted R-squared) for the younger group and decreased it by .1 for the older group. Again, adding “age at first claim” did not produce a significant effect. We then ran correlations between “visit count” and “drug count” within the two subsets and found no correlation between them. The p-value of the F-test is below .05, so the model is statistically significant. For each additional visit count there...
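The age split and the correlation check between “visit count” and “drug count” described in the abstract could be sketched in Python roughly as follows. This is only an illustration: the field names and the eight patient records are made up, and only the 0-39 / 40+ cutoff comes from the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def split_by_age(records, cutoff=40):
    """Split records into (under cutoff, cutoff and over) by age at first claim."""
    younger = [r for r in records if r["age_at_first_claim"] < cutoff]
    older = [r for r in records if r["age_at_first_claim"] >= cutoff]
    return younger, older


# Hypothetical records, invented for illustration only (not the study data).
patients = [
    {"age_at_first_claim": 25, "visit_count": 2, "drug_count": 1},
    {"age_at_first_claim": 31, "visit_count": 5, "drug_count": 3},
    {"age_at_first_claim": 38, "visit_count": 3, "drug_count": 4},
    {"age_at_first_claim": 19, "visit_count": 1, "drug_count": 2},
    {"age_at_first_claim": 45, "visit_count": 4, "drug_count": 2},
    {"age_at_first_claim": 52, "visit_count": 6, "drug_count": 5},
    {"age_at_first_claim": 67, "visit_count": 8, "drug_count": 3},
    {"age_at_first_claim": 71, "visit_count": 3, "drug_count": 6},
]

younger, older = split_by_age(patients)
for label, group in (("0-39", younger), ("40+", older)):
    visits = [p["visit_count"] for p in group]
    drugs = [p["drug_count"] for p in group]
    print(f"{label}: n={len(group)}, r(visits, drugs)={pearson_r(visits, drugs):.3f}")
```

A correlation near zero within each subset would support keeping both predictors in the same model; a value near 1 or -1 would suggest they carry overlapping information.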

...criterion we get the best model: y = -124.382 + 0.296X1 + 0.048X2 + 1.306X3 + 0.5198X4. This model contains all four predictor variables X1, X2, X3 and X4. It is selected as the best model by the MaxR criterion because it has the largest R-Square, 0.9629, which is larger than 0.9615 (model containing 3 variables), 0.9330 (model containing 2 variables) and 0.8047 (model containing 1 variable).
Below is a SAS output of the MaxR criterion.
The “best” model obtained from the MaxR criterion differs from that obtained from the Stepwise and Backward Elimination methods. The reason is that the methods use different selection rules. In the Stepwise and Backward Elimination methods, the F-statistic decides whether a variable is selected: a variable is added only if its F-statistic is significant at the SLENTRY level, and a variable already in the model is kept only if its F-statistic is significant at the SLSTAY level. The MaxR method, by contrast, selects the variable or combination of variables that produces the largest R-square, and at each step it makes the switch that produces the largest increase in R-square.
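Because R-square can never decrease when a predictor is added, the largest model always has the largest R-square, which is one reason it helps to compare with adjusted R-square as well. The Python sketch below applies the standard adjusted R-square formula to the four R-square values quoted above. The sample size is not stated in this text, so `N_OBS = 25` is purely an assumption for illustration, and the resulting ranking depends on it.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-square: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


# R-square values quoted in the text, keyed by number of predictors.
r2_by_size = {1: 0.8047, 2: 0.9330, 3: 0.9615, 4: 0.9629}

# ASSUMPTION: the sample size is not given in the text; 25 is illustrative only.
N_OBS = 25

adj_by_size = {p: adjusted_r2(r2, N_OBS, p) for p, r2 in r2_by_size.items()}

for p in sorted(r2_by_size):
    print(f"{p} predictor(s): R2={r2_by_size[p]:.4f}  adj R2={adj_by_size[p]:.4f}")

best_by_r2 = max(r2_by_size, key=r2_by_size.get)
best_by_adj = max(adj_by_size, key=adj_by_size.get)
print("largest R2:", best_by_r2, "predictors")
print("largest adjusted R2:", best_by_adj, "predictors")
```

Under the assumed sample size, the gain from 0.9615 to 0.9629 is too small to offset the penalty for the fourth predictor, so the two criteria can point to different models.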
Appendix
Code:
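/* Read the response y and the four predictors x1-x4 from the raw data file. */
data job;
  infile "C:\Users\sandra\Desktop\CH09PR10.txt";
  input y x1 x2 x3 x4;
run;
/* Stepwise selection: enter at the .05 level, stay at the .10 level. */
proc reg data=job;
  model y=x1 x2 x3 x4 / selection=stepwise slstay=.10 slentry=.05;
  title "Stepwise Selection";
run;
/* Rank the candidate models by adjusted R-square. */
proc reg data=job;
  model y=x1 x2 x3 x4 / selection=adjrsq;
run;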
proc reg data=job;
model y=x1 x2 x3...

...
Q1: What is the t-statistic used for? What is the p-value used for? What is the difference between the use of the t-statistic and the p-value?
Ans: The t-statistic is the ratio of the departure of an estimated parameter from its notional value to its standard error. It is used in hypothesis testing.
Let $\hat{\beta}$ be an estimator of parameter $\beta$ in some statistical model. Then a t-statistic for this parameter is any quantity of the form
$$t_{\hat{\beta}} = \frac{\hat{\beta} - \beta_0}{\mathrm{s.e.}(\hat{\beta})}$$
where $\beta_0$ is a non-random, known constant, and $\mathrm{s.e.}(\hat{\beta})$ is the standard error of the estimator $\hat{\beta}$. By default, statistical packages report the t-statistic with $\beta_0 = 0$ (these t-statistics are used to test the significance of the corresponding regressor). However, when a t-statistic is needed to test a hypothesis of the form $H_0: \beta = \beta_0$, a non-zero $\beta_0$ may be used.
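As a small illustration of the definition, the t-statistic is the estimate minus the hypothesised value, divided by the standard error. The function name and the numbers in this Python sketch are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """Return (estimate - null_value) / std_error.

    null_value plays the role of beta_0; it defaults to 0, which matches
    the significance test that statistical packages report by default.
    """
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    return (estimate - null_value) / std_error


# Hypothetical coefficient estimate and standard error.
beta_hat = 2.0
se_beta_hat = 0.5

t_default = t_statistic(beta_hat, se_beta_hat)       # tests H0: beta = 0
t_shifted = t_statistic(beta_hat, se_beta_hat, 1.0)  # tests H0: beta = 1

print("t for H0: beta = 0 ->", t_default)
print("t for H0: beta = 1 ->", t_shifted)
```

The same estimate gives a smaller t-statistic against the hypothesis beta = 1 than against beta = 0, because it lies closer to 1 than to 0 when measured in standard errors.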
Uses of the t-statistic:
Most frequently, t-statistics are used in Student's t-tests, a form of statistical hypothesis testing, and in the computation of certain confidence intervals.
The key property of the t-statistic is that it is a pivotal quantity: while it is defined in terms of the sample mean, its sampling distribution does not depend on the population parameters, and thus it can be used regardless of what these may be.
P-value & its uses:
“The level of marginal significance within a statistical hypothesis test, representing the probability of the occurrence of a given event.” The p-value is...

...correlation exists between the time period the workers have worked in the industry and their health effects.
Analysis will be carried out using the following 5 variables:
* Worker ID
* Age
* Department
* Length of service
* Percentage of cell damage
The observations are independent both within and between the samples. To obtain an accurate analysis of the data, normality, box plots, the straight-line relationship and independence will be checked. The null hypothesis will be rejected or retained on the basis of a statistical analysis of the median percentage of damaged cells obtained from the brick and tile operations.
Table 1: Descriptive statistics of the percentage of damaged cells for brick and tile operation workers
Variable | N | N* | Mean | SE Mean | St. Dev. | Minimum | Q1 | Median | Q3 | Maximum |
% Damaged cells of Tile operation | 27 | 0 | 1.337 | 0.210 | 1.090 | 0.200 | 0.600 | 1.100 | 1.500 | 4.700 |
% Damaged cells of Brick operation | 38 | 0 | 1.532 | 0.179 | 1.106 | 0.200 | 0.536 | 1.370 | 2.189 | 4.562 |
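The SE Mean column in Table 1 can be cross-checked from the other columns, since the standard error of the mean is the standard deviation divided by the square root of N. The Python sketch below recomputes it from the N and St. Dev. values in the table; the function and dictionary names are illustrative.

```python
from math import sqrt


def standard_error_of_mean(std_dev, n):
    """Standard error of the mean: s / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std_dev / sqrt(n)


# N, St. Dev. and reported SE Mean copied from Table 1.
table_1 = {
    "Tile operation": {"n": 27, "st_dev": 1.090, "se_mean": 0.210},
    "Brick operation": {"n": 38, "st_dev": 1.106, "se_mean": 0.179},
}

for group, row in table_1.items():
    computed = standard_error_of_mean(row["st_dev"], row["n"])
    print(f"{group}: computed SE = {computed:.3f}, reported SE = {row['se_mean']:.3f}")
```

Both recomputed values agree with the reported SE Mean to three decimal places.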
Table 1 gives descriptive statistics for the workers in the two operations. As seen in the table above, the percentage of damaged cells is higher on average among the brick operation workers than among the tile operation workers. The median percentage for the brick operation workers is 1.370, which is higher as...

...necessary statistical data from firms and government agencies. The concepts of validity and reliability were kept in mind during this process. The data gathered were checked for accuracy and consistency, and to confirm that they were indeed the data needed in this study.
Finally, the collected data were statistically treated using EViews software so that proper interpretations, conclusions and recommendations could be formulated.
D. Statistical Analysis of Data
It is important that the collected data be treated with the most accurate and appropriate method. Statistics allows for the detection and evaluation of group differences that are small compared to individual differences. It also makes evaluations more objective, but it does not guarantee correct decisions every time.
In this study, inferential statistics will be used. Inferential statistics uses patterns in the sample data to draw inferences about the population represented, accounting for randomness. These inferences may take the form of: answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation), describing associations within the data (correlation) and modeling relationships within the data (for example, using regression analysis). Inference can extend to forecasting, prediction and estimation of unobserved values either in or associated with the population being...

...
November 19, 2010
NAME: The Statistics of Poverty and Inequality
TYPE: Sample
SIZE: 97 observations, 8 variables
DESCRIPTIVE ABSTRACT:
For 97 countries in the world, data are given for birth rates, death
rates, infant death rates, life expectancies for males and females, and
Gross National Product.
SOURCES:
Day, A. (ed.) (1992), _The Annual Register 1992_, 234, London:
Longmans.
_U.N.E.S.C.O. 1990 Demographic Year Book_ (1990), New York: United
Nations.
VARIABLE DESCRIPTIONS:
Columns
1 - 6 Live birth rate per 1,000 of population
7 - 14 Death rate per 1,000 of population
15 - 22 Infant deaths per 1,000 of population under 1 year old
23 - 30 Life expectancy at birth for males
31 - 38 Life expectancy at birth for females
39 - 46 Gross National Product per capita in U.S. dollars
47 - 52 Country Group
1 = Eastern Europe
2 = South America and Mexico
3 = Western Europe, North America, Japan, Australia, New Zealand
4 = Middle East
5 = Asia
6 = Africa
53 - 74 Country
Values are aligned and delimited by blanks.
Missing values are denoted with *.
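Because country names occupy a fixed column range and may themselves contain blanks, one way to read a record is by column position rather than by splitting on whitespace. The Python sketch below uses the column ranges listed above and maps `*` to `None`. The field names, the helper `make_line`, and the sample record for "Exampleland" are invented for illustration and are not taken from the data set.

```python
# Column ranges from the variable descriptions (1-based, inclusive).
COLUMNS = [
    ("birth_rate", 1, 6, float),
    ("death_rate", 7, 14, float),
    ("infant_death_rate", 15, 22, float),
    ("life_exp_male", 23, 30, float),
    ("life_exp_female", 31, 38, float),
    ("gnp_per_capita", 39, 46, float),
    ("country_group", 47, 52, int),
    ("country", 53, 74, str),
]


def parse_record(line):
    """Parse one fixed-width line into a dict; '*' becomes None (missing)."""
    record = {}
    for name, start, end, convert in COLUMNS:
        raw = line[start - 1:end].strip()
        record[name] = None if raw == "*" else convert(raw)
    return record


def make_line(values):
    """Build a sample line by padding each value to its column width."""
    parts = []
    for (_, start, end, convert), value in zip(COLUMNS, values):
        width = end - start + 1
        text = str(value)
        # Text is left-aligned and numbers right-aligned here; the parser
        # strips blanks, so it does not depend on this choice.
        parts.append(text.ljust(width) if convert is str else text.rjust(width))
    return "".join(parts)


# Hypothetical record with a missing GNP value (not real data).
sample = make_line([30.1, 8.2, 45.0, 62.3, 66.8, "*", 5, "Exampleland"])
row = parse_record(sample)
print(row)
```

Records with a `None` field can then be dropped or handled explicitly before any rates are compared across country groups.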
The Statistics of Poverty and Inequality
This paper describes a case study based on data taken from the U.N.E.S.C.O. 1990 Demographic Year Book and The Annual Register 1992 giving birth rates, death rates, life expectancies, and Gross National Products for 97 countries.
When reviewing the...

...DEPARTMENT OF SOCIOLOGY, PSYCHOLOGY AND SOCIAL WORK
SOCI 1005 (SY16C) - INTRODUCTORY STATISTICS FOR THE
BEHAVIOURAL SCIENCES
SUMMER SCHOOL 2012/2013 - COURSE OUTLINE
Lecturer: Ayesha Facey
Office: Room 46, Faculty of Social Sciences
Office #: 970-6324
E-mail: ayeshafcy@yahoo.com
COURSE OBJECTIVE
This course aims to introduce students to basic univariate and bivariate statistics. A student who successfully completes this course will possess a reasonable level of knowledge of basic statistics and their interpretations.
LEARNING OUTCOMES
At the end of the course, students should be able to:
• Adequately define statistical concepts
• Distinguish between descriptive statistics and inferential statistics
• Distinguish between qualitative data and quantitative data
• Classify data with respect to the four levels of measurement: nominal, ordinal, interval, and ratio
• Create grouped frequency distributions
• Compute measures of central tendency and variation and use them to analyze data
• Calculate and interpret the correlation coefficient and equation of the least-squares regression line for bivariate data and use the results to make predictions.
• Solve probability problems
• Compute binomial distributions
• Use the normal distribution to interpret z scores and compute probabilities
• Estimate a...

...SECTION A (You should attempt all 10 questions)
A1. Regression analysis ____________________________________.
A) describes the strength of this linear relationship.
B) describes the mathematical relationship between two variables.
C) describes the pattern of the data.
D) describes the characteristic of independent variable.
A2. __________________ is used to illustrate any relationship between two variables.
A) Histogram
B) Pie chart
C) Scatter diagram
D) Frequency polygon
Questions A3 to A5 relate to the following information.
Suppose a firm fed the values of turnover, y, and advertising expenditure, x, (both in $000) for the past eight years, into a computer and obtained the regression relationship y = 26.7 + 8.5x.
A3. What is the dependent variable?
A) Number of computers
B) Size of the firm
C) Turnover
D) Advertising expenditure
A4. What is the independent variable?
A) Number of computers
B) Size of the firm
C) Turnover
D) Advertising expenditure
A5. If the advertising expenditure is $5000 in a particular year, estimate the turnover for that year.
A) $69,200
B) $42,526.70
C) $26.7
D) $69.20
A6. Explain what the following sample correlation coefficients tell you about the relationship between the x and y values in the sample:
r = - 0.8
A) No correlation.
B) Perfect negative...