1 Entering, importing, summarizing and subsetting data
1.1 Learning goals
1.1.1 Conceptual
- Understand the meaning of unit of analysis
- Understand and apply common measures of central tendency
1.1.2 SPSS
- Enter new data
- Import and work with existing data
- Create frequency tables
- Select subsets of data using Select cases
1.2 Entering data
In an SPSS spreadsheet, rows correspond to observations on the unit of analysis (e.g., mice, people, countries), and columns correspond to variables (e.g., fur colour, occupation, GDP per capita).
Let’s create a dataset of student names and ages:

| names | age |
|---|---|
| Matthew | 18 |
| Mark | 19 |
| Luke | 18 |
| John | 20 |
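Outside SPSS, the same rows-as-observations, columns-as-variables layout can be sketched in Python. This is only an illustration, assuming a list of dictionaries as the container; the name `students` is hypothetical, and the values are those of the student dataset.

```python
# Each dictionary is one row: one observation on the unit of analysis (a student).
# Each key is one column: one variable.
students = [
    {"names": "Matthew", "age": 18},
    {"names": "Mark", "age": 19},
    {"names": "Luke", "age": 18},
    {"names": "John", "age": 20},
]

n_rows = len(students)                    # number of observations
columns = list(students[0])               # variable names
ages = [row["age"] for row in students]   # one column, read across all rows

print(n_rows, columns, ages)
```

Reading one key across all rows, as in `ages`, is the equivalent of looking down a single column of the spreadsheet.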
1.3 Measures of central tendency
In this course, we will consider three measures of central tendency.
1.3.1 The arithmetic mean
This is usually represented by one of two formulae:
\(\bar{x} = \frac{1}{n} \sum\limits_{i = 1}^{n}x_i\)
\(\bar{x} = \frac{\sum_{i = 1}^{n}x_i}{n}\)
Straightforwardly: sum all scores and divide by the total number of scores. Our students have a mean age of 18.75.
Strictly speaking, the mean can only be sensibly calculated for interval/ratio level data.
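As a quick check of the arithmetic, the mean of the four student ages can be computed in Python. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from statistics import mean

ages = [18, 19, 18, 20]  # ages from the student dataset

# Sum all scores and divide by the total number of scores.
mean_by_formula = sum(ages) / len(ages)

# The standard library gives the same result.
mean_by_library = mean(ages)

print(mean_by_formula, mean_by_library)  # 18.75 18.75
```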
1.3.2 The median
This is the 50th percentile in the dataset (i.e., it divides the data into a lower half and an upper half). It is obtained by ranking scores and finding the value which divides the dataset in two.
For data \(x\) with \(n\) elements, sorted in ascending order so that \(x_k\) is the value ranked \(k\)th:
if \(n\) is odd, \(med(x) = x_{(n+1)/2}\)
if \(n\) is even, \(med(x) = \frac{x_{(n/2)} + x_{((n/2)+1)}}{2}\)
For example, in a sorted dataset with 11 values, the median is the value ranked 6th. In a sorted dataset with 12 values, however, the median is the arithmetic mean of the values ranked 6th and 7th. The median age in our student dataset is 18.5: the sorted ages are 18, 18, 19, 20, and the mean of the 2nd and 3rd values is (18 + 19) / 2.
The median can be used to summarize both interval/ratio and ordinal variables.
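The odd/even rule can be sketched in Python as follows. The function name `median_by_rank` is hypothetical; note that ranks in the formula start at 1, whereas Python list positions start at 0, hence the `- 1` adjustments.

```python
from statistics import median

def median_by_rank(values):
    """Return the median using the odd/even rule on the sorted values."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        # Odd n: the single middle value, ranked (n + 1) / 2.
        return x[(n + 1) // 2 - 1]
    # Even n: the mean of the values ranked n / 2 and (n / 2) + 1.
    return (x[n // 2 - 1] + x[n // 2]) / 2

ages = [18, 19, 18, 20]  # ages from the student dataset

print(median_by_rank(ages))         # 18.5
print(median_by_rank(range(1, 12))) # 6 (11 values: the 6th-ranked value)
print(median(ages))                 # 18.5, the standard library agrees
```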
1.3.3 The mode
Lastly, we turn to the mode, which is simply the most frequently occurring value. In our dataset, this is 18.
Datasets can have multiple modes; for example, let’s imagine we added a new student aged 20 to the dataset. In this case, the modes would be 18 and 20.
The mode can be calculated for all measurement levels about which you have learned, but it is typically most informative when there are relatively few values which the data can take. For example, imagine trying to calculate modal household income within a country; you would likely end up with countless unique values and multiple modes, rendering this measure fairly useless for the purpose of summarizing.
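The single-mode and two-mode cases described above can be reproduced with a short Python sketch. It assumes Python 3.8 or later, where `statistics.multimode` is available; the variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

ages = [18, 19, 18, 20]  # ages from the student dataset

# One value occurs most often, so there is a single mode.
print(multimode(ages))  # [18]

# Adding a new student aged 20 produces a tie between 18 and 20.
ages_with_new_student = ages + [20]
print(multimode(ages_with_new_student))  # [18, 20]

# A frequency count shows why: 18 and 20 each occur twice.
frequencies = Counter(ages_with_new_student)
print(frequencies[18], frequencies[19], frequencies[20])  # 2 1 2
```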
1.4 SPSS explainer video
- Entering data
- Frequencies
- Central tendency
- Selecting cases (ESS11)
See the video ‘ISA1’ (UvA login required).
When selecting cases, you will probably rely mostly on the following operators:
- `=` tests for equality, whether the data are numerical or character strings (text).
- `~=` is the complement of `=`. E.g., `cntry ~= "Netherlands"` means “return all respondents who are NOT from the Netherlands”.
- `&` (AND) is used when you want the selected responses to satisfy multiple conditions. E.g., `cntry = "Netherlands" & agea > 22` means “return all respondents from the Netherlands who are also older than 22” (`>` means “greater than”).
- `|` (OR) is used when you want to select responses which match at least one of the specified conditions. E.g., `cntry = "Netherlands" | agea > 22` will return participants who are either in the Netherlands, or older than 22, or both. Participants who meet neither criterion will be filtered out.
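To see which cases each condition keeps, the same selection logic can be sketched in Python. This is an illustration only, not SPSS syntax: Python writes equality as `==`, inequality as `!=`, and uses `and` / `or` for the logical operators. The four respondents and the helper `select` are hypothetical; only the variable names `cntry` and `agea` come from the examples above.

```python
# Hypothetical respondents, invented for illustration.
respondents = [
    {"id": 1, "cntry": "Netherlands", "agea": 30},
    {"id": 2, "cntry": "Netherlands", "agea": 20},
    {"id": 3, "cntry": "Belgium", "agea": 40},
    {"id": 4, "cntry": "Belgium", "agea": 19},
]

def select(cases, condition):
    """Return the ids of the cases for which the condition is true."""
    return [case["id"] for case in cases if condition(case)]

# NOT from the Netherlands
not_nl = select(respondents, lambda r: r["cntry"] != "Netherlands")

# From the Netherlands AND older than 22
nl_and_over_22 = select(
    respondents, lambda r: r["cntry"] == "Netherlands" and r["agea"] > 22
)

# From the Netherlands OR older than 22 (or both)
nl_or_over_22 = select(
    respondents, lambda r: r["cntry"] == "Netherlands" or r["agea"] > 22
)

print(not_nl)          # [3, 4]
print(nl_and_over_22)  # [1]
print(nl_or_over_22)   # [1, 2, 3]
```

Respondent 4 meets neither criterion, so the OR condition filters them out, while the AND condition keeps only respondent 1.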