1 Entering, importing, summarizing and subsetting data

1.1 Learning goals

1.1.1 Conceptual

Understand the meaning of unit of analysis
Understand and apply common measures of central tendency

1.1.2 SPSS

Enter new data
Import and work with existing data
Create frequency tables
Select subsets of data using Select cases

1.2 Entering data

In an SPSS spreadsheet, rows correspond to observations on the unit of analysis (e.g., mice, people, countries), and columns correspond to variables (e.g., fur colour, occupation, GDP per capita).

Let’s create a dataset of student names and ages:

names	age
Matthew	18
Mark	19
Luke	18
John	20

1.3 Measures of central tendency

In this course, we will consider three measures of central tendency.

1.3.1 The arithmetic mean

This is usually represented by one of two formulae:

\(\bar{x} = \frac{1}{n} \sum\limits_{i = 1}^{n}x_i\)
\(\bar{x} = \frac{\sum_{i = 1}^{n}x_i}{n}\)

Straightforwardly: sum all scores and divide by the total number of scores. Our students have a mean age of 18.75.

Strictly speaking, the mean can only be sensibly calculated for interval/ratio level data.
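
As a quick check of the arithmetic, the calculation can be sketched outside SPSS. The following is a minimal Python illustration (not SPSS syntax) that applies the formula to the four student ages entered above; the variable names `ages`, `n` and `mean_age` are chosen here purely for illustration.

```python
# Ages of the four students from the example dataset.
ages = [18, 19, 18, 20]

# Sum all scores and divide by the total number of scores.
n = len(ages)
mean_age = sum(ages) / n

print(mean_age)  # 18.75
```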

1.3.2 The median

This is the 50th percentile in the dataset (i.e., it divides the data into a lower half and an upper half). It is obtained by ranking scores and finding the value which divides the dataset in two.

For data \(x\) with \(n\) elements, sorted in ascending order so that \(x_{k}\) is the value ranked \(k\)th:

if \(n\) is odd, \(med(x) = x_{(n+1)/2}\)
if \(n\) is even, \(med(x) = \frac{x_{(n/2)} + x_{((n/2)+1)}}{2}\)

For example, in a sorted dataset with 11 values, the median is the value ranked 6th. In a sorted dataset with 12 values, however, the median is the arithmetic mean of the values ranked 6th and 7th. The median age in our newly created dataset is 18.5.

The median can be used to summarize both interval/ratio and ordinal variables.
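
The odd/even rule can be made concrete with a short sketch. The Python function below is a hypothetical helper written for illustration (it is not part of SPSS); it ranks the scores and then applies the rule for odd or even \(n\). Note that Python counts positions from 0, so the value ranked \(k\)th sits at index \(k - 1\).

```python
def median(values):
    """Return the median by ranking the scores and applying the odd/even rule."""
    ranked = sorted(values)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the value ranked (n + 1) / 2, which is index `middle` from 0.
        return ranked[middle]
    # Even n: the mean of the values ranked n / 2 and (n / 2) + 1.
    return (ranked[middle - 1] + ranked[middle]) / 2


ages = [18, 19, 18, 20]
print(median(ages))  # 18.5
```

With 11 values the function returns the value ranked 6th, and with 12 values it returns the mean of the values ranked 6th and 7th, matching the example above.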

1.3.3 The mode

Lastly, we turn to the mode, which is simply the most frequently occurring value. In our dataset, this is 18.

Datasets can have multiple modes; for example, let’s imagine we added a new student aged 20 to the dataset. In this case, the modes would be 18 and 20.

The mode can be calculated for all measurement levels about which you have learned, but it is typically most informative when there are relatively few values which the data can take. For example, imagine trying to calculate modal household income within a country; you would likely end up with countless unique values and multiple modes, rendering this measure fairly useless for the purpose of summarizing.
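
To illustrate how a dataset can have one mode or several, here is a minimal Python sketch (again, not SPSS). The helper `modes` is a hypothetical name introduced for this example; it counts how often each value occurs and returns every value tied for the highest count.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


ages = [18, 19, 18, 20]
print(modes(ages))         # [18]
print(modes(ages + [20]))  # [18, 20]
```

If every value in a dataset is unique, this function returns all of them, which shows why the mode says little about variables such as household income.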

1.4 SPSS explainer video

Entering data
Frequencies
Central tendency
Selecting cases (ESS11)

See video ‘ISA1’ here. (UvA login required)

Note 1.1: Use of logical operators when selecting cases

You will probably rely mostly on the following:

= - tests for equality, whether the data are numerical or character strings (text).
~= - the complement of =. E.g., cntry ~= "Netherlands" means “return all respondents who are NOT from the Netherlands”.
& (AND) - use this when you want the selected responses to satisfy multiple conditions. E.g., cntry = "Netherlands" & agea > 22 means “return all respondents from the Netherlands who are also older than 22” (> means “greater than”).
| (OR) - use this if you want to select responses which match at least one of the specified conditions. E.g., cntry = "Netherlands" | agea > 22 will return respondents who are from the Netherlands, older than 22, or both. Respondents who meet neither criterion will be filtered out.
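
The logic of these four operators can be traced on a small example. The sketch below is written in Python rather than SPSS, and the four respondents are invented for illustration only (they are not taken from the ESS data). Python spells the operators differently (`==`, `!=`, `and`, `or`), so each line is labelled with the corresponding Select cases condition in a comment.

```python
# Hypothetical respondents; the values are invented for illustration only.
respondents = [
    {"id": 1, "cntry": "Netherlands", "agea": 30},
    {"id": 2, "cntry": "Netherlands", "agea": 20},
    {"id": 3, "cntry": "Belgium", "agea": 45},
    {"id": 4, "cntry": "Belgium", "agea": 19},
]


def select_ids(condition):
    """Return the ids of the respondents for whom the condition is true."""
    return [r["id"] for r in respondents if condition(r)]


# cntry = "Netherlands"
equal = select_ids(lambda r: r["cntry"] == "Netherlands")
# cntry ~= "Netherlands"
not_equal = select_ids(lambda r: r["cntry"] != "Netherlands")
# cntry = "Netherlands" & agea > 22
both = select_ids(lambda r: r["cntry"] == "Netherlands" and r["agea"] > 22)
# cntry = "Netherlands" | agea > 22
either = select_ids(lambda r: r["cntry"] == "Netherlands" or r["agea"] > 22)

print(equal)      # [1, 2]
print(not_equal)  # [3, 4]
print(both)       # [1]
print(either)     # [1, 2, 3]
```

Respondent 4 meets neither condition of the OR example and is therefore the only one filtered out of that selection.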