Wednesday, January 30, 2013

SPSS-Statistical techniques to explore relationships among variables


Correlation



1) Correlation is used when you wish to describe the strength and direction of the relationship between two variables (usually continuous). 

2) It can also be used when one of the variables is dichotomous, that is, it has only two values (e.g. sex: males/females).

3) The statistic obtained is Pearson’s product-moment correlation (r). The statistical significance of r is also provided.
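As a rough illustration of what r measures, here is a minimal Python sketch (not SPSS syntax) that computes Pearson's r from deviations about the means. The function name pearson_r and the sample scores are hypothetical and are not taken from any SPSS data file.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores: hours studied and exam mark for five students
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 70, 72]
r = pearson_r(hours, marks)
print(round(r, 3))
```

The sign of r gives the direction of the relationship and its absolute value gives the strength. The significance test that SPSS reports alongside r is not reproduced in this sketch.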

Partial correlation


4) Partial correlation is used when you wish to explore the relationship between two variables while statistically controlling for a third variable. 

5) This is useful when you suspect that the relationship between your two variables of interest may be influenced, or confounded, by the impact of a third variable. 

6) Partial correlation statistically removes the influence of the third variable, giving a cleaner picture of the actual relationship between your two variables.
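The idea of "removing" a third variable can be sketched in Python with the standard first-order partial correlation formula, which combines the three pairwise Pearson correlations. This is an illustrative sketch, not SPSS output; the names pearson_r and partial_r and the data are hypothetical. The data are constructed so that x and y are related only through z.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation between x and y after controlling for z."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data: x and y are each z plus unrelated noise
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 2, 5, 6, 5, 6, 9]
y = [2, 3, 2, 3, 4, 5, 8, 9]

r_xy = pearson_r(x, y)          # strong zero-order correlation
r_xy_given_z = partial_r(x, y, z)  # close to zero once z is controlled
print(round(r_xy, 2), round(r_xy_given_z, 2))
```

In this constructed example the zero-order correlation between x and y is high, but it disappears once z is controlled for, which is the confounding pattern described above.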

Multiple regression


7) Multiple regression allows prediction of a single dependent continuous variable from a group of independent variables. 

8) It can be used to test the predictive power of a set of variables and to assess the relative contribution of each individual variable.
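For reference, the model behind multiple regression can be written as follows. The notation is the conventional one and is an assumption here, not taken from the source: Y is the dependent variable, X_1 to X_k are the independent variables, and b_0 to b_k are the estimated coefficients.

```latex
\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k
```

The size of each coefficient, once standardized, is what is used to judge the relative contribution of each independent variable.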

Logistic regression


9) Logistic regression is used instead of multiple regression when your dependent variable is categorical. 

10) It can be used to test the predictive power of a set of variables and to assess the relative contribution of each individual variable.
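For the simplest case of a dependent variable with two categories, the logistic model can be sketched as below. The notation is conventional and assumed here, not taken from the source: p is the probability of belonging to the category coded 1, and b_0 to b_k are the estimated coefficients.

```latex
\ln\!\left(\frac{p}{1-p}\right) = b_0 + b_1 X_1 + \dots + b_k X_k,
\qquad
p = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \dots + b_k X_k)}}
```

The predictors are combined linearly, as in multiple regression, but they predict the log-odds of category membership rather than a continuous score.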

Factor analysis


11) Factor analysis is used when you have a large number of related variables (e.g. the items that make up a scale) and you wish to explore the underlying structure of this set of variables. 

12) It is useful in reducing a large number of related variables to a smaller, more manageable, number of dimensions or components. 

SPSS-Computing New Variables




1) Using a wide variety of mathematical functions, you can compute new variables based on highly complex equations. In this example, however, we will simply compute a new variable that is the difference between the values of two existing variables.


2) The data file demo.sav contains a variable for the respondent's current age and a variable for the number of years at current job. It does not, however, contain a variable for the respondent's age at the time he or she started that job. We can create a new variable that is the computed difference between current age and number of years at current job, which should be the approximate age at which the respondent started that job.  

From the menus in the Data Editor window choose:
Transform > Compute Variable...
For Target Variable, enter jobstart.
Select Age in years [age] in the source variable list and click the arrow button to copy it to the Numeric Expression text box.
Click the minus (–) button on the calculator pad in the dialog box (or press the minus key on the keyboard).
Select Years with current employer [employ] and click the arrow button to copy it to the expression.



Click OK to compute the new variable.

The new variable is displayed in the Data Editor. Since the variable is added to the end of the file, it is displayed in the far right column in Data View and in the last row in Variable View.
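The computation itself is a simple per-case subtraction. The Python sketch below mirrors it outside SPSS. The three cases are hypothetical stand-ins rather than real rows from demo.sav, and None is used to mark a missing value, on the assumption that a missing input should give a missing result.

```python
# Hypothetical cases; the values are invented and not read from demo.sav
cases = [
    {"age": 55, "employ": 23},
    {"age": 28, "employ": 4},
    {"age": 44, "employ": None},  # None marks a missing value
]


def compute_jobstart(case):
    """Return age minus years with current employer, or None if either is missing."""
    age = case.get("age")
    employ = case.get("employ")
    if age is None or employ is None:
        return None
    return age - employ


for case in cases:
    # The new variable is added after the existing ones, as in the Data Editor
    case["jobstart"] = compute_jobstart(case)

jobstart_values = [case["jobstart"] for case in cases]
print(jobstart_values)  # [32, 24, None]
```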



3) You can also use predefined functions in expressions. More than 70 built-in functions are available, including:
• Arithmetic functions
• Statistical functions
• Distribution functions
• Logical functions
• Date and time aggregation and extraction functions
• Missing-value functions
• Cross-case functions
• String functions




Using Functions in Expressions
======================

4) Functions are organized into logically distinct groups, such as a group for arithmetic operations and another for computing statistical metrics. 



5) For convenience, a number of commonly used system variables, such as $TIME (current date and time), are also included in appropriate function groups.
 
6) A brief description of the currently selected function (in this case, SUM) or system variable is displayed in a reserved area in the Compute Variable dialog box.
 

7) To paste a function into an expression:
Position the cursor in the expression at the point where you want the function to appear.


Select the appropriate group from the Function group list. The group labeled All provides a listing of all available functions and system variables.



Double-click the function in the Functions and Special Variables list (or select the function and click the arrow adjacent to the Function group list).




8) The function is inserted into the expression. If you highlight part of the expression and then insert the function, the highlighted portion of the expression is used as the first argument in the function.



9) The function is not complete until you enter the arguments, represented by question marks in the pasted function. The number of question marks indicates the minimum number of arguments required to complete the function.


Highlight the question mark(s) in the pasted function.
Enter the arguments. If the arguments are variable names, you can paste them from the variable list.



Using Conditional Expressions
=====================

10) You can use conditional expressions (also called logical expressions) to apply transformations to selected subsets of cases. A conditional expression returns a value of true, false, or missing for each case. If the result of a conditional expression is true, the transformation is applied to that case. If the result is false or missing, the transformation is not applied to the case.



11) To specify a conditional expression:
Click If in the Compute Variable dialog box. This opens the If Cases dialog box.




Select Include if case satisfies condition.



Enter the conditional expression.


12) Most conditional expressions contain at least one relational operator, as in:
age>=21
or
income*3<100

13) In the first example, only cases with a value of 21 or greater for Age [age] are selected.


14) In the second example, Household income in thousands [income] multiplied by 3 must be less than 100 for a case to be selected.



15) You can also link two or more conditional expressions using logical operators, as in:
age>=21 | ed>=4
or
income*3<100 & ed=5

16) In the first example, cases that meet either the Age [age] condition or the Level of education [ed] condition are selected.



17) In the second example, both the Household income in thousands [income] and Level of education [ed] conditions must be met for a case to be selected.
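The four example conditions can be sketched in Python to show how each case ends up as true, false, or missing. The helper names and the cases are hypothetical. The handling of missing values in the | and & helpers (true wins in an OR, false wins in an AND, otherwise missing) is an assumption about three-valued logic, not something stated in the text above.

```python
def ge(value, threshold):
    """value >= threshold, or None (missing) when value is missing."""
    return None if value is None else value >= threshold


def lt(value, threshold):
    return None if value is None else value < threshold


def eq(value, target):
    return None if value is None else value == target


def logical_or(a, b):
    # Assumed rule: one true operand is enough, even if the other is missing
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def logical_and(a, b):
    # Assumed rule: one false operand is enough, even if the other is missing
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


# Hypothetical cases (income in thousands); only age has a missing value here
cases = [
    {"age": 30, "ed": 2, "income": 20},
    {"age": 18, "ed": 5, "income": 30},
    {"age": 19, "ed": 1, "income": 50},
    {"age": None, "ed": 3, "income": 25},
]

# age>=21 | ed>=4
first = [logical_or(ge(c["age"], 21), ge(c["ed"], 4)) for c in cases]
# income*3<100 & ed=5
second = [logical_and(lt(c["income"] * 3, 100), eq(c["ed"], 5)) for c in cases]

# A transformation is applied only where the result is true
selected_first = [c for c, result in zip(cases, first) if result is True]
print(first, second, len(selected_first))
```

The last case shows why the distinction matters: its age is missing and its education level does not meet the condition, so the first expression evaluates to missing and the case is not selected.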

 













SPSS-Creating a Categorical Variable from a Scale Variable


1) Several categorical variables in the data file demo.sav are, in fact, derived from scale variables in that data file. For example, the variable inccat is simply income grouped into four categories.



2) This categorical variable uses the integer values 1–4 to represent the following income categories (in thousands): less than $25, $25–$49, $50–$74, and $75 or higher.

3) To create the categorical variable inccat:
From the menus in the Data Editor window choose:
Transform > Visual Binning...

 

5) Since Visual Binning relies on actual values in the data file to help you make good binning choices, it needs to read the data file first. Since this can take some time if your data file contains a large number of cases, this initial dialog box also allows you to limit the number of cases to read ("scan").
 

6) This is not necessary for our sample data file. Even though it contains more than 6,000 cases, it does not take long to scan that number of cases.
 
Drag and drop Household income in thousands [income] from the Variables list into the Variables to Bin list, and then click Continue.

In the main Visual Binning dialog box, select Household income in thousands [income] in the Scanned Variable List.


A histogram displays the distribution of the selected variable (which in this case is highly skewed).


Enter inccat2 for the new binned variable name and Income category [in thousands] for the variable label.


Click Make Cutpoints.



Select Equal Width Intervals.


Enter 25 for the first cutpoint location, 3 for the number of cutpoints, and 25 for the width.

The number of binned categories is one greater than the number of cutpoints. So in this example, the new binned variable will have four categories, with the first three categories each containing ranges of 25 (thousand) and the last one containing all values above the highest cutpoint value of 75 (thousand).

Click Apply.

The values now displayed in the grid represent the defined cutpoints, which are the upper endpoints of each category. Vertical lines in the histogram also indicate the locations of the cutpoints.










 By default, these cutpoint values are included in the corresponding categories. For example, the first value of 25 would include all values less than or equal to 25.

But in this example, we want categories that correspond to less than 25, 25–49, 50–74, and 75 or higher.

In the Upper Endpoints group, select Excluded (<).

 

Then click Make Labels.


This automatically generates descriptive value labels for each category. Since the actual values assigned to the new binned variable are simply sequential integers starting with 1, the value labels can be very useful.

You can also manually enter or change cutpoints and labels in the grid, change cutpoint locations by dragging and dropping the cutpoint lines in the histogram, and delete cutpoints by dragging cutpoint lines off of the histogram.

Click OK to create the new, binned variable.

The new variable is displayed in the Data Editor. Since the variable is added to the end of the file, it is displayed in the far right column in Data View and in the last row in Variable View.
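The binning logic in this example can be sketched in Python. The function names and the income values are hypothetical. The sketch generates the three equal-width cutpoints and shows how the Included (<=) and Excluded (<) settings for upper endpoints change the category of a value that falls exactly on a cutpoint.

```python
import bisect


def equal_width_cutpoints(first, count, width):
    """Cutpoints starting at `first`, `count` of them, spaced `width` apart."""
    return [first + i * width for i in range(count)]


def bin_value(value, cutpoints, upper_included=True):
    """Return the 1-based category for value, given sorted cutpoints."""
    if upper_included:
        # A value equal to a cutpoint stays in the lower category (<=)
        return bisect.bisect_left(cutpoints, value) + 1
    # A value equal to a cutpoint moves to the next category (<)
    return bisect.bisect_right(cutpoints, value) + 1


cutpoints = equal_width_cutpoints(25, 3, 25)

# Hypothetical household incomes in thousands
incomes = [9, 25, 49, 50, 74.5, 75, 120]
inccat2 = [bin_value(v, cutpoints, upper_included=False) for v in incomes]
print(cutpoints, inccat2)
```

With three cutpoints there are four categories. With upper endpoints excluded, an income of exactly 25 falls in category 2 (25–49) rather than category 1, which matches the categories wanted in this example.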



SPSS-Distribution of scores and suggested transformations





Often when you check the distribution of scores on a scale or measure (e.g. self-esteem, anxiety) you will find that the scores do not fall in a nice, normally distributed curve. Sometimes scores will be positively skewed, where most of the respondents record low scores on the scale (e.g. depression).

Sometimes you will find a negatively skewed distribution, where most scores are at the high end (e.g. self-esteem). Given that many of the parametric statistical tests assume normally distributed scores, what do you do about these skewed distributions?

One of the choices you have is to abandon the use of parametric statistics (e.g. Pearson correlation, analysis of variance) and instead choose to use non-parametric alternatives (e.g. Spearman’s rho, Kruskal-Wallis). SPSS includes a number of useful non-parametric techniques in its package.

Another alternative, when you have a non-normal distribution, is to ‘transform’ your variables. This involves mathematically modifying the scores using various formulas until the distribution looks more normal. There are a number of different types of transformation, depending on the shape of your distribution. There is considerable controversy concerning this approach in the literature, with some authors strongly supporting, and others arguing against, transforming variables to better meet the assumptions of the various parametric techniques.







Procedure for transforming variables


1. From the menu at the top of the screen, click on Transform, then click on Compute Variable.

2. Target Variable. In this box, type in a new name for the variable. Try to include an indication of the type of transformation and the original name of the variable. For example, for a variable called tnegaff I would name the new variable sqtnegaff if I had performed a square root transformation. Be consistent in the abbreviations that you use for each of your transformations.

3. Functions. Listed are a wide range of possible actions you can use. You need to choose the most appropriate transformation for your variable. Look at the shape of your distribution; compare it with those in the above Figure. Take note of the formula listed next to the picture that matches your distribution. This is the one that you will use.

4. Transformations involving square root or logarithm. In the Function group box, click on Arithmetic, and scan down the list that shows up in the bottom box until you find the formula you need (e.g. Sqrt or Lg10). Highlight the one you want and click on the up arrow. This moves the formula into the Numeric Expression box. You will need to tell it which variable you want to recalculate. Find it in the list of variables and click on the arrow to move it into the Numeric Expression box. If you prefer, you can just type the formula in yourself without using the Functions or Variables list. Just make sure you spell everything correctly.

5. Transformations involving Reflect. You need to find the value K for your variable. This is the largest value that your variable can have (see your codebook) + 1. Type this number in the Numeric Expression box. Complete the remainder of the formula using the Functions box, or alternatively type it in yourself.

6. Transformations involving Inverse. To calculate the inverse, you need to divide your scores into 1. So, in the Numeric Expression box type in 1, then type / and then your variable or the rest of your formula (e.g. 1/tslfest).

7. Check the final formula in the Numeric Expression box. Write this down in your codebook next to the name of the new variable you created.

8. Click on the button Type and Label. Under Label, type in a brief description of the new variable (or you may choose to use the actual formula you used).

9. Check in the Target Variable box that you have given your new variable a new name, not the original one. If you accidentally put the old variable name, you will lose all your original scores. So, always double-check.

10. Click on OK (or on Paste if you wish to paste this command to the Syntax Editor window). To execute it after pasting to the Syntax Editor, highlight the command and select Run from the menu. A new variable will be created and will appear at the end of your data file.

11. Run Analyze, Frequencies to check the skewness and kurtosis values for your old and new variables. Have they improved?

12. Under Frequencies, click on the Charts button and select Histogram to inspect the distribution of scores on your new variable. Has the distribution improved? If not, you may need to consider a different type of transformation.
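The transformations in steps 4 to 6 and the skewness check in step 11 can be sketched in Python. Because the figure listing the formulas is not reproduced here, the reflected forms below (K minus the score, followed by the square root, log, or inverse) are an assumption based on step 5. The variable names and scores are hypothetical. The skewness function uses the bias-adjusted sample formula, which is assumed to be comparable to the value reported by Frequencies.

```python
import math


def skewness(values):
    """Bias-adjusted sample skewness."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)


def transform(values, kind, k=None):
    """Apply 'sqrt', 'lg10', or 'inverse'; if k is given, reflect first (k - score)."""
    result = []
    for v in values:
        x = k - v if k is not None else v
        if kind == "sqrt":
            result.append(math.sqrt(x))
        elif kind == "lg10":
            result.append(math.log10(x))
        elif kind == "inverse":
            result.append(1 / x)
        else:
            raise ValueError(f"unknown transformation: {kind}")
    return result


# Hypothetical positively skewed scores (most respondents score low)
tnegaff = [1, 1, 2, 2, 3, 4, 9, 16, 25]
sqtnegaff = transform(tnegaff, "sqrt")
lgtnegaff = transform(tnegaff, "lg10")

# Hypothetical negatively skewed scores; largest possible value is 10, so K = 11
tslfest = [10, 10, 9, 9, 8, 7, 5, 2]
rsqtslfest = transform(tslfest, "sqrt", k=11)

for name, data in [("tnegaff", tnegaff), ("sqtnegaff", sqtnegaff),
                   ("lgtnegaff", lgtnegaff), ("tslfest", tslfest),
                   ("rsqtslfest", rsqtslfest)]:
    print(name, round(skewness(data), 2))
```

Comparing the skewness before and after each transformation is the same check as step 11: a value closer to zero suggests the transformation has helped. Note that a reflected variable runs in the opposite direction to the original, so high scores become low scores.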

SPSS-Summary Statistics Using Descriptives


The Descriptives procedure is useful for obtaining summary comparisons of approximately normally distributed scale variables and for easily identifying unusual cases across those variables by computing z scores.

Using Descriptives to Study Quantitative Data
================================
 A telecommunications company maintains a customer database that includes, among other things, information on how much each customer spent on long distance, toll-free, equipment rental, calling card, and wireless services in the previous month.

This information is collected in telco.sav. See the topic Sample Files for more information. Use Descriptives to study customer spending to determine which services are most profitable.


Running the Analysis
===============

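To run a Descriptives analysis, from the menus choose:
Analyze > Descriptive Statistics > Descriptives...
Select Long distance last month, Toll free last month, Equipment last month, Calling card last month, and Wireless last month as analysis variables.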




These selections generate the following command syntax:

DESCRIPTIVES
  VARIABLES=longmon tollmon equipmon cardmon wiremon
  /STATISTICS=MEAN STDDEV MIN MAX .
• The procedure analyzes the variables longmon, tollmon, equipmon, cardmon, and wiremon.

• The STATISTICS subcommand requests the mean, standard deviation, minimum, and maximum.


To recode 0s as missing values, from the menus choose:
Transform > Recode into Same Variables...
Select Long distance last month, Toll free last month, Equipment last month, Calling card last month, and Wireless last month as numeric variables.
Type 0 as the Old Value.
Select System-missing New Value.
Click Continue.
These selections generate the following command syntax:
RECODE
  longmon tollmon equipmon cardmon wiremon  (0=SYSMIS)  .
EXECUTE .
Click Options in the Descriptives dialog box.


Deselect Minimum and Maximum.
Select Skewness and Kurtosis.
Click Continue.
Click OK in the Descriptives dialog box.



These selections generate the following command syntax:
DESCRIPTIVES
  VARIABLES=longmon tollmon equipmon cardmon wiremon
  /STATISTICS=MEAN STDDEV SKEWNESS KURTOSIS .
• The STATISTICS subcommand now requests the skewness and kurtosis instead of the minimum and maximum.
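The effect of recoding zeros to missing before summarizing can be sketched in Python. The spending values are hypothetical and are not taken from telco.sav. None stands in for system-missing, and missing values are left out of the summary in the way the Descriptives procedure leaves them out.

```python
import math


def recode_zero_to_missing(values):
    """Replace 0 with None (missing), leaving other values unchanged."""
    return [None if v == 0 else v for v in values]


def describe(values):
    """Mean and sample standard deviation over the non-missing values."""
    valid = [v for v in values if v is not None]
    n = len(valid)
    mean = sum(valid) / n
    stddev = math.sqrt(sum((v - mean) ** 2 for v in valid) / (n - 1))
    return {"n": n, "mean": mean, "stddev": stddev}


# Hypothetical monthly wireless spending; 0 means the customer lacks the service
wiremon = [0, 0, 30.0, 0, 50.0, 40.0]

before = describe(wiremon)
after = describe(recode_zero_to_missing(wiremon))
print(before, after)
```

Before the recode the mean is averaged over all customers, including those without the service. After the recode it is the mean spending among customers who actually have the service.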

Descriptive Statistics
===============

When the analysis is conditional upon the customer's actually having the service, the results are dramatically different.

 

Wireless and equipment rental services bring in far more revenue per customer than other services.



Moreover, while wireless service remains a highly variable prospect, equipment rental has one of the lowest standard deviations.

 


This hasn't solved the problem of who purchases these services, but it does point you in the direction of which services deserve greater marketing.

 

Finding Unusual Cases
================
You can find customers who spend much more or much less than other customers on each service by studying the standardized values (or z scores) of the variables.

However, a requirement for using z scores is that each variable's distribution is not markedly non-normal. The skewness and kurtosis values reported in the statistics table are all quite large, showing that the distributions of these variables are definitely not normal.

One possible remedy, because the variables all take positive values, is to study the z scores of the log-transformed variables. The log-transformed variables have already been computed and entered into the data file; you can use the Descriptives procedure to compute the z scores.


Running the Analysis
===============
To obtain z scores for the log-transformed variables, recall the Descriptives dialog box.


Deselect Long distance last month through Wireless last month as analysis variables.
Select Log-long distance through Log-wireless as analysis variables.
Select Save standardized values as variables.
Click OK.

These selections generate the following command syntax:
DESCRIPTIVES
  VARIABLES=loglong logtoll logequi logcard logwire
  /STATISTICS=MEAN STDDEV SKEWNESS KURTOSIS 
  /SAVE .
• The SAVE subcommand specifies that z-scores for each of the variables on the VARIABLES subcommand should be saved to the active dataset.
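The standardization that SAVE performs can be sketched in Python: each value has the variable's mean subtracted and is divided by the sample standard deviation. The spending values are hypothetical. The source does not say which log base was used for the log-transformed variables, so the sketch tries both the natural log and log base 10 and shows that the z scores come out the same.

```python
import math


def z_scores(values):
    """Standardize values using the mean and sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


# Hypothetical long-distance spending last month (all positive)
longmon = [3.7, 9.6, 12.1, 25.0, 61.4]

loglong_ln = [math.log(v) for v in longmon]
loglong_l10 = [math.log10(v) for v in longmon]

zloglong = z_scores(loglong_ln)
zloglong_l10 = z_scores(loglong_l10)
print([round(z, 3) for z in zloglong])
```

Changing the log base only rescales the variable by a constant, and standardizing removes that constant, so the choice of base does not affect which cases look unusual.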


Descriptive statistics table
==================
With the exception of Log-toll free, the skewness and kurtosis are considerably smaller for the log-transformed variables.

 

The log-transformed toll-free service may continue to have a large skewness and kurtosis because a customer spent an unusually large amount last month. Check boxplots to verify this.



Boxplots of Z Scores
===============
To visually scan the z scores and find unusual values, from the menus choose:
Select Summaries of separate variables.
Select Zscore: Log-long distance through Zscore: Log-wireless as the variables the boxes represent.
Click Options.




Click Continue.
Click OK in the Define Simple Boxplot dialog box.


EXAMINE VARIABLES=Zloglong Zlogtoll Zlogequi Zlogcard Zlogwire 
  /COMPARE VARIABLE
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL
  /MISSING=PAIRWISE.


Boxplots of the z scores show that customer 567 spent much more than the average customer on toll-free service last month. This should account for the larger skewness and kurtosis observed in Toll free last month.
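The rule a boxplot uses to single out such a case can be sketched in Python. Values more than 1.5 box lengths beyond the edge of the box are flagged as outliers, and values more than 3 box lengths beyond are flagged as extreme. The case numbers and z scores below are hypothetical. The quartiles come from Python's statistics.quantiles, which only approximates the hinges SPSS uses for its boxplots.

```python
import statistics


def flag_unusual(z_by_case, inner=1.5, outer=3.0):
    """Return {case: 'outlier' or 'extreme'} using box-length fences."""
    values = list(z_by_case.values())
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1
    flags = {}
    for case, z in z_by_case.items():
        # Distance beyond the nearer edge of the box (0 if inside the box)
        distance = max(q1 - z, z - q3, 0)
        if distance > outer * box_length:
            flags[case] = "extreme"
        elif distance > inner * box_length:
            flags[case] = "outlier"
    return flags


# Hypothetical z scores of log toll-free spending, keyed by case number
zlogtoll = {1: -0.9, 2: -0.5, 3: -0.2, 4: 0.0, 5: 0.1,
            6: 0.3, 7: 0.6, 8: 0.8, 9: 6.5}

flags = flag_unusual(zlogtoll)
print(flags)
```

A single flagged case like this one is the kind of value that can inflate skewness and kurtosis for the whole variable.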







Summary
=======
You have determined that equipment rental and wireless services have a high return per customer, although wireless has greater variability. You still need to determine whether these services can be effectively marketed to your customer base in order to fully assess their profitability.
You have also found that one customer, compared to other customers, spent an unusually large amount on toll-free services last month. This should be investigated to determine whether this spending was a one-time event or will be ongoing. 


Related Procedures
=============

The Descriptives procedure is a useful tool for summarizing and standardizing scale variables. 

•  You can alternatively use the Frequencies procedure to summarize scale variables. Frequencies also provides statistics for summarizing categorical variables.

•  The Means procedure provides descriptive statistics and an ANOVA table for studying relationships between scale and categorical variables. 

•  The Summarize procedure provides descriptive statistics and case summaries for studying relationships between scale and categorical variables. 

•  The OLAP Cubes procedure provides descriptive statistics for studying relationships between scale and categorical variables. 

•  The Correlations procedure provides summaries describing the relationship between two scale variables. 

