How to Merge Data with PSPP

Merging data files is a routine part of data processing and it refers to combining files from multiple datasets. Additional variables often come from a different file, for example.

psppire does not currently provide dialogs for merging datasets. MATCH FILES, ADD FILES, and UPDATE are standard PSPP features, but the commands must be run from PSPP syntax, either in the Syntax Editor or from the command line.

Note that on Windows, when PSPP is installed, it sets up a PSPP icon. That icon opens psppire, not the command line pspp, which is also installed but requires running at a command line (CMD.EXE perhaps).

In PSPP, merging usually means adding variables from one file to the cases in another file. The files must share a key variable (such as ID, PATID, or another unique identifier) so PSPP knows which records belong together. The MATCH FILES command performs this type of merge.

Other situations involve stacking files on top of each other (adding more cases). This is done with ADD FILES. Updating existing cases with new values uses UPDATE. Each command handles a different kind of data combination.

 

MATCH FILES

The following example shows how to merge two datasets that share a common key variable (PATID). One file contains demographic variables; the other contains psychological measures. MATCH FILES combines them into a single dataset by matching cases on PATID.

Example datasets

demo.sav

PATIDAGESEX
101341
102292
103411

psych.sav

PATIDSCORE1SCORE2
1011218
1021520
1031117

Merged Output

PATIDAGESEX SCORE1SCORE2
101341 1218
102292 1520
103411 1117
MATCH FILES
  /FILE='demo.sav'
  /TABLE='psych.sav'
  /BY PATID.
EXECUTE.

SAVE OUTFILE='merged.sav'.

PSPP matches cases from both files using PATID. Variables from demo.sav and psych.sav appear together in the merged dataset. If a case appears in one file but not the other, MATCH FILES still produces a case, but variables from the missing file are system-missing.

MATCH FILES requires that both datasets be sorted by the BY variable(s) before merging (SORT CASES). If the file aren't sorted, PSPP will stop with an error.

The dataset named on /FILE becomes the active dataset after the merge. Variables from each /TABLE file are added to it. In this example, demo.sav is the base file. After MATCH FILES completes, the active dataset contains the merged result, which should be saved (for example, with SAVE OUTFILE='merged.sav'). PSPP system files contain the data set with all its variable attributes.

PSPP always has exactly one active file, referred to in commands as * (asterisk). Commands such as GET FILE, MATCH FILES, and ADD FILES replace the active file with their result. Because the active file is overwritten whenever a new dataset is read or created, save any results desired to be kept before running another command that changes the active file.

What happens when there is not a one-to-one merge? Subject 104 has been added to psych.sav but not to demo.sav. In this case, PATID 104 is merged but that subject's demographic variables are all missing.

Revised psych.sav

PATIDSCORE1SCORE2
1011218
1021520
1031117
1041419

Merged Output

PATIDAGESEX SCORE1SCORE2
101341 1218
102292 1520
103411 1117
104.. 1419

If a PATID value appears in one file but not the other, MATCH FILES still creates a case in the merged dataset. This is not a missing file or a missing variable — it is a missing case in one of the datasets. PSPP has no values to supply for that side of the merge, so the variables from the file where the case is absent are set to system-missing. This is the safest behavior: it preserves the case without inventing zeros or placeholder values.

Flags created with /IN= track which file each case came from. An /IN variable is set to 1 if the case was present in that file, and 0 if it was not. This makes it easy to identify unmatched cases after the merge.

MATCH FILES
  /FILE='demo.sav'  /IN=indemo
  /TABLE='psych.sav' /IN=inpsych
  /BY PATID.

SAVE OUTFILE='merged.sav'.

Merged Output with /IN= Flags

PATIDAGESEX SCORE1SCORE2 indemoinpsych
101341 1218 11
102292 1520 11
103411 1117 11
104.. 1419 01

In addition to MATCH FILES, PSPP also provides ADD FILES to append data from multiple files to the active data set, and UPDATE, which updates a master file with modifications from a transaction file.

ADD FILES

ADD FILES adds data sets vertically. It can combine multiple datasets. No BY statement is needed. If it is used it enforces sorting of the dataset.

The following example shows two small data files combined into one larger file. It also illustrates sorting cases.

Files to combine.

GET DATA/TYPE=TXT
 /FILE='C:\Users\file1.txt'
 /DELIMITERS=" "
 /ARRANGEMENT=DELIMITED
 /VARIABLES =
   ID F3.0
   AGE F3.

SORT CASES BY ID.
LIST.
Data List
╭──┬───╮
│ID│AGE│
├──┼───┤
│ 1│ 56│
│ 2│ 62│
│ 3│ 38│
│ 4│ 29│
╰──┴───╯

* Save the system file before starting the next one.
SAVE OUTFILE='C:\Users\file1.sav'.

GET DATA  /TYPE=TXT
 /FILE='C:\Users\file2.txt'
 /DELIMITERS=" "
 /ARRANGEMENT=DELIMITED
 /VARIABLES =
     ID F3
     AGE F3
 .

SORT CASES BY ID.

LIST.

Data List
╭──┬───╮
│ID│AGE│
├──┼───┤
│ 5│ 25│
│ 6│ 33│
│ 7│ 67│
╰──┴───╯

DISPLAY DICTIONARY.
                                    Variables
╭────┬────────┬─────────────────┬─────┬─────┬─────────┬────────────┬────────────╮
│Name│Position│Measurement Level│ Role│Width│Alignment│Print Format│Write Format│
├────┼────────┼─────────────────┼─────┼─────┼─────────┼────────────┼────────────┤
│ID  │       1│Nominal          │Input│    8│Right    │F3.0        │F3.0        │
│AGE │       2│Scale            │Input│    8│Right    │F3.0        │F3.0        │
╰────┴────────┴─────────────────┴─────┴─────┴─────────┴────────────┴────────────╯

SAVE OUTFILE='C:\Users\file2.sav'.

Add the two files.

ADD FILES
  /FILE='C:\Users\file1.sav'
  /FILE='C:\Users\file2.sav'
  .
LIST.
Data List
╭──┬───╮
│ID│AGE│
├──┼───┤
│ 1│ 56│
│ 2│ 62│
│ 3│ 38│
│ 4│ 29│
│ 5│ 25│
│ 6│ 33│
│ 7│ 67│
╰──┴───╯
display dictionary.
                                    Variables
╭────┬────────┬─────────────────┬─────┬─────┬─────────┬────────────┬────────────╮
│Name│Position│Measurement Level│ Role│Width│Alignment│Print Format│Write Format│
├────┼────────┼─────────────────┼─────┼─────┼─────────┼────────────┼────────────┤
│ID  │       1│Nominal          │Input│    8│Right    │F3.0        │F3.0        │
│AGE │       2│Scale            │Input│    8│Right    │F3.0        │F3.0        │
╰────┴────────┴─────────────────┴─────┴─────┴─────────┴────────────┴────────────╯

The output shows that ADD FILES stacked file1 (4 cases) and file2 (3 cases) into an active data set that has 7 cases.

UPDATE Files

Another type of PSPP data merge is UPDATE. Like its name implies it is used to update a master dataset with values in a transaction dataset. It won't overwrite values of cases it's not updating.

At least two /FILE commands are needed. The first is the master file, the second and succeeding are transaction files. The BY is also required for matching cases. Sort cases by the BY variables to ensure proper matching.

UPDATE differs from RECODE. RECODE changes case values within a file. UPDATE applies changes from another (transaction) file. The two commands serve different purposes.

main2.txt master file

1  
2 F
3  
4 F

fix2.txt transaction file

1 M 46
2 F 32
3 M 25
4 F 38
5 F 49

Update example code

ECHO '===== Read and Save main2.txt'.

GET DATA/TYPE=TXT
 /FILE='main2.txt'
 /DELIMITERS=" "
 /ARRANGEMENT=DELIMITED
 /VARIABLES =
   ID F3.0
    GENDER A2.  

SORT CASES BY ID.    
LIST.
 
DISPLAY DICTIONARY.

* Save the system file before starting the next one.
SAVE OUTFILE='main2.sav'.
     
ECHO '===== Read and Save fix2.txt'.

**     File B (fix.sav).
GET DATA  /TYPE=TXT
 /FILE='fix2.txt'
 /DELIMITERS=" "
 /ARRANGEMENT=DELIMITED
 /VARIABLES =
     ID F3 
     GENDER A2
     AGE F3
 .

SORT CASES BY ID.
LIST.
DISPLAY DICTIONARY.

SAVE OUTFILE='fix2.sav'.

ECHO '===== Read main2.sav system file'.

* New Active File.
GET FILE='main2.sav'.

LIST.

ECHO '====== Update run ======'.

UPDATE
  /FILE='main2.sav'
  /FILE='fix2.sav'
  /BY ID.
EXECUTE.

ECHO '====== List updated data'.

* List the updated data.
LIST.

DISPLAY DICTIONARY.

Output from the run

===== Read and Save main2.txt

 Data List
+--+------+
|ID|GENDER|
+--+------+
| 1|      |
| 2|F     |
| 3|      |
| 4|F     |
+--+------+

                                   Variables
+------+--------+----------------+-----+-----+---------+----------+-----------+
|      |        |   Measurement  |     |     |         |   Print  |   Write   |
|Name  |Position|      Level     | Role|Width|Alignment|  Format  |   Format  |
+------+--------+----------------+-----+-----+---------+----------+-----------+
|ID    |       1|Nominal         |Input|    8|Right    |F3.0      |F3.0       |
|GENDER|       2|Nominal         |Input|    2|Left     |A2        |A2         |
+------+--------+----------------+-----+-----+---------+----------+-----------+

===== Read and Save fix2.txt

   Data List
+--+------+---+
|ID|GENDER|AGE|
+--+------+---+
| 1|M     | 46|
| 2|M     | 32|
| 3|M     | 25|
| 4|F     | 38|
| 5|F     | 49|
+--+------+---+

                                   Variables
+------+--------+----------------+-----+-----+---------+----------+-----------+
|      |        |   Measurement  |     |     |         |   Print  |   Write   |
|Name  |Position|      Level     | Role|Width|Alignment|  Format  |   Format  |
+------+--------+----------------+-----+-----+---------+----------+-----------+
|ID    |       1|Nominal         |Input|    8|Right    |F3.0      |F3.0       |
|GENDER|       2|Nominal         |Input|    2|Left     |A2        |A2         |
|AGE   |       3|Scale           |Input|    8|Right    |F3.0      |F3.0       |
+------+--------+----------------+-----+-----+---------+----------+-----------+

===== Read main2.sav system file

 Data List
+--+------+
|ID|GENDER|
+--+------+
| 1|      |
| 2|F     |
| 3|      |
| 4|F     |
+--+------+

====== Update run ======

====== List updated data

   Data List
+--+------+---+
|ID|GENDER|AGE|
+--+------+---+
| 1|M     | 46|
| 2|M     | 32| The GENDER was F in the original master data
| 3|M     | 25|
| 4|F     | 38|
| 5|F     | 49|
+--+------+---+

                                   Variables
+------+--------+----------------+-----+-----+---------+----------+-----------+
|      |        |   Measurement  |     |     |         |   Print  |   Write   |
|Name  |Position|      Level     | Role|Width|Alignment|  Format  |   Format  |
+------+--------+----------------+-----+-----+---------+----------+-----------+
|ID    |       1|Nominal         |Input|    8|Right    |F3.0      |F3.0       |
|GENDER|       2|Nominal         |Input|    2|Left     |A2        |A2         |
|AGE   |       3|Scale           |Input|    8|Right    |F3.0      |F3.0       |
+------+--------+----------------+-----+-----+---------+----------+-----------+

In this example, UPDATE replaces existing values and fills missing values when the transaction file(s) provide new data for a case matched with the BY variable(s). The second GENDER value changed from F to M because the transaction file contained the updated value. The first and third missing GENDER values were filled from the transaction file. ID number 5 was added because it appeared only in the transaction file, and the AGE variable was added because it existed only in the transaction file.

This makes UPDATE good for keeping records of transactions that change over time. For example, a club membership file can be updated each month to add new members, dropped members can be removed or flagged, and existing members can have their information updated.

A Note About Copy-and-Paste Merging:

The /IN= flags above show exactly what goes wrong when the data in two files doesn't match perfectly. A statistical merge can make these mismatches visible and preserve the structure of the data but only if you use the tools designed for that purpose. A copy‑and‑paste merge cannot. In a spreadsheet, unmatched cases, misaligned rows, and missing variables are all silent — you won't see them, and you won't know they happened. Copy/paste may appear to “work” on perfectly clean data, but the first time the files differ even slightly, the structure is destroyed and information is lost. This is why merges should always be done by a statistical package like PSPP, not by hand. Understand your data, understand its structure, and always verify your merges.