How to Merge Data with PSPP
Merging data files is a routine part of data processing and it refers to combining files from multiple datasets. Additional variables often come from a different file, for example.
psppire does not currently provide dialogs for merging datasets.
MATCH FILES,ADD FILES, andUPDATEare standard PSPP features, but the commands must be run from PSPP syntax, either in the Syntax Editor or from the command line.Note that on Windows, when PSPP is installed, it sets up a PSPP icon. That icon opens psppire, not the command line pspp, which is also installed but requires running at a command line (CMD.EXE perhaps).
In PSPP, merging usually means adding variables from one file to the cases in another file. The files must share a key variable (such as
ID,PATID, or another unique identifier) so PSPP knows which records belong together. TheMATCH FILEScommand performs this type of merge.Other situations involve stacking files on top of each other (adding more cases). This is done with
ADD FILES. Updating existing cases with new values usesUPDATE. Each command handles a different kind of data combination.
MATCH FILES
The following example shows how to merge two datasets that share a common key variable (
PATID). One file contains demographic variables; the other contains psychological measures.MATCH FILEScombines them into a single dataset by matching cases onPATID.Example datasets
demo.sav
PATID AGE SEX 101 34 1 102 29 2 103 41 1 psych.sav
PATID SCORE1 SCORE2 101 12 18 102 15 20 103 11 17 Merged Output
PATID AGE SEX SCORE1 SCORE2 101 34 1 12 18 102 29 2 15 20 103 41 1 11 17 MATCH FILES /FILE='demo.sav' /TABLE='psych.sav' /BY PATID. EXECUTE. SAVE OUTFILE='merged.sav'.PSPP matches cases from both files using
PATID. Variables fromdemo.savandpsych.savappear together in the merged dataset. If a case appears in one file but not the other,MATCH FILESstill produces a case, but variables from the missing file are system-missing.MATCH FILES requires that both datasets be sorted by the
BYvariable(s) before merging (SORT CASES). If the file aren't sorted, PSPP will stop with an error.The dataset named on
/FILEbecomes the active dataset after the merge. Variables from each/TABLEfile are added to it. In this example,demo.savis the base file. After MATCH FILES completes, the active dataset contains the merged result, which should be saved (for example, withSAVE OUTFILE='merged.sav'). PSPP system files contain the data set with all its variable attributes.PSPP always has exactly one active file, referred to in commands as
*(asterisk). Commands such asGET FILE,MATCH FILES, andADD FILESreplace the active file with their result. Because the active file is overwritten whenever a new dataset is read or created, save any results desired to be kept before running another command that changes the active file.What happens when there is not a one-to-one merge? Subject 104 has been added to psych.sav but not to demo.sav. In this case, PATID 104 is merged but that subject's demographic variables are all missing.
Revised psych.sav
PATID SCORE1 SCORE2 101 12 18 102 15 20 103 11 17 104 14 19 Merged Output
PATID AGE SEX SCORE1 SCORE2 101 34 1 12 18 102 29 2 15 20 103 41 1 11 17 104 . . 14 19 If a PATID value appears in one file but not the other,
MATCH FILESstill creates a case in the merged dataset. This is not a missing file or a missing variable — it is a missing case in one of the datasets. PSPP has no values to supply for that side of the merge, so the variables from the file where the case is absent are set to system-missing. This is the safest behavior: it preserves the case without inventing zeros or placeholder values.Flags created with
/IN=track which file each case came from. An/INvariable is set to 1 if the case was present in that file, and 0 if it was not. This makes it easy to identify unmatched cases after the merge.MATCH FILES /FILE='demo.sav' /IN=indemo /TABLE='psych.sav' /IN=inpsych /BY PATID. SAVE OUTFILE='merged.sav'.Merged Output with /IN= Flags
PATID AGE SEX SCORE1 SCORE2 indemo inpsych 101 34 1 12 18 1 1 102 29 2 15 20 1 1 103 41 1 11 17 1 1 104 . . 14 19 0 1 In addition to
MATCH FILES, PSPP also providesADD FILESto append data from multiple files to the active data set, andUPDATE, which updates a master file with modifications from a transaction file.ADD FILES
ADD FILESadds data sets vertically. It can combine multiple datasets. No BY statement is needed. If it is used it enforces sorting of the dataset.The following example shows two small data files combined into one larger file. It also illustrates sorting cases.
Files to combine.
GET DATA/TYPE=TXT /FILE='C:\Users\file1.txt' /DELIMITERS=" " /ARRANGEMENT=DELIMITED /VARIABLES = ID F3.0 AGE F3. SORT CASES BY ID. LIST. Data List ╭──┬───╮ │ID│AGE│ ├──┼───┤ │ 1│ 56│ │ 2│ 62│ │ 3│ 38│ │ 4│ 29│ ╰──┴───╯ * Save the system file before starting the next one. SAVE OUTFILE='C:\Users\file1.sav'. GET DATA /TYPE=TXT /FILE='C:\Users\file2.txt' /DELIMITERS=" " /ARRANGEMENT=DELIMITED /VARIABLES = ID F3 AGE F3 . SORT CASES BY ID. LIST. Data List ╭──┬───╮ │ID│AGE│ ├──┼───┤ │ 5│ 25│ │ 6│ 33│ │ 7│ 67│ ╰──┴───╯ DISPLAY DICTIONARY. Variables ╭────┬────────┬─────────────────┬─────┬─────┬─────────┬────────────┬────────────╮ │Name│Position│Measurement Level│ Role│Width│Alignment│Print Format│Write Format│ ├────┼────────┼─────────────────┼─────┼─────┼─────────┼────────────┼────────────┤ │ID │ 1│Nominal │Input│ 8│Right │F3.0 │F3.0 │ │AGE │ 2│Scale │Input│ 8│Right │F3.0 │F3.0 │ ╰────┴────────┴─────────────────┴─────┴─────┴─────────┴────────────┴────────────╯ SAVE OUTFILE='C:\Users\file2.sav'.Add the two files.
ADD FILES /FILE='C:\Users\file1.sav' /FILE='C:\Users\file2.sav' . LIST. Data List ╭──┬───╮ │ID│AGE│ ├──┼───┤ │ 1│ 56│ │ 2│ 62│ │ 3│ 38│ │ 4│ 29│ │ 5│ 25│ │ 6│ 33│ │ 7│ 67│ ╰──┴───╯ display dictionary. Variables ╭────┬────────┬─────────────────┬─────┬─────┬─────────┬────────────┬────────────╮ │Name│Position│Measurement Level│ Role│Width│Alignment│Print Format│Write Format│ ├────┼────────┼─────────────────┼─────┼─────┼─────────┼────────────┼────────────┤ │ID │ 1│Nominal │Input│ 8│Right │F3.0 │F3.0 │ │AGE │ 2│Scale │Input│ 8│Right │F3.0 │F3.0 │ ╰────┴────────┴─────────────────┴─────┴─────┴─────────┴────────────┴────────────╯The output shows that
ADD FILESstacked file1 (4 cases) and file2 (3 cases) into an active data set that has 7 cases.UPDATE Files
Another type of PSPP data merge is
UPDATE. Like its name implies it is used to update a master dataset with values in a transaction dataset. It won't overwrite values of cases it's not updating.At least two
/FILEcommands are needed. The first is the master file, the second and succeeding are transaction files. TheBYis also required for matching cases. Sort cases by theBYvariables to ensure proper matching.
UPDATEdiffers fromRECODE.RECODEchanges case values within a file.UPDATEapplies changes from another (transaction) file. The two commands serve different purposes.main2.txt master file
1 2 F 3 4 Ffix2.txt transaction file
1 M 46 2 F 32 3 M 25 4 F 38 5 F 49Update example code
ECHO '===== Read and Save main2.txt'. GET DATA/TYPE=TXT /FILE='main2.txt' /DELIMITERS=" " /ARRANGEMENT=DELIMITED /VARIABLES = ID F3.0 GENDER A2. SORT CASES BY ID. LIST. DISPLAY DICTIONARY. * Save the system file before starting the next one. SAVE OUTFILE='main2.sav'. ECHO '===== Read and Save fix2.txt'. ** File B (fix.sav). GET DATA /TYPE=TXT /FILE='fix2.txt' /DELIMITERS=" " /ARRANGEMENT=DELIMITED /VARIABLES = ID F3 GENDER A2 AGE F3 . SORT CASES BY ID. LIST. DISPLAY DICTIONARY. SAVE OUTFILE='fix2.sav'. ECHO '===== Read main2.sav system file'. * New Active File. GET FILE='main2.sav'. LIST. ECHO '====== Update run ======'. UPDATE /FILE='main2.sav' /FILE='fix2.sav' /BY ID. EXECUTE. ECHO '====== List updated data'. * List the updated data. LIST. DISPLAY DICTIONARY.Output from the run
===== Read and Save main2.txt Data List +--+------+ |ID|GENDER| +--+------+ | 1| | | 2|F | | 3| | | 4|F | +--+------+ Variables +------+--------+----------------+-----+-----+---------+----------+-----------+ | | | Measurement | | | | Print | Write | |Name |Position| Level | Role|Width|Alignment| Format | Format | +------+--------+----------------+-----+-----+---------+----------+-----------+ |ID | 1|Nominal |Input| 8|Right |F3.0 |F3.0 | |GENDER| 2|Nominal |Input| 2|Left |A2 |A2 | +------+--------+----------------+-----+-----+---------+----------+-----------+ ===== Read and Save fix2.txt Data List +--+------+---+ |ID|GENDER|AGE| +--+------+---+ | 1|M | 46| | 2|M | 32| | 3|M | 25| | 4|F | 38| | 5|F | 49| +--+------+---+ Variables +------+--------+----------------+-----+-----+---------+----------+-----------+ | | | Measurement | | | | Print | Write | |Name |Position| Level | Role|Width|Alignment| Format | Format | +------+--------+----------------+-----+-----+---------+----------+-----------+ |ID | 1|Nominal |Input| 8|Right |F3.0 |F3.0 | |GENDER| 2|Nominal |Input| 2|Left |A2 |A2 | |AGE | 3|Scale |Input| 8|Right |F3.0 |F3.0 | +------+--------+----------------+-----+-----+---------+----------+-----------+ ===== Read main2.sav system file Data List +--+------+ |ID|GENDER| +--+------+ | 1| | | 2|F | | 3| | | 4|F | +--+------+ ====== Update run ====== ====== List updated data Data List +--+------+---+ |ID|GENDER|AGE| +--+------+---+ | 1|M | 46| | 2|M | 32| The GENDER was F in the original master data | 3|M | 25| | 4|F | 38| | 5|F | 49| +--+------+---+ Variables +------+--------+----------------+-----+-----+---------+----------+-----------+ | | | Measurement | | | | Print | Write | |Name |Position| Level | Role|Width|Alignment| Format | Format | +------+--------+----------------+-----+-----+---------+----------+-----------+ |ID | 1|Nominal |Input| 8|Right |F3.0 |F3.0 | |GENDER| 2|Nominal |Input| 2|Left |A2 |A2 | |AGE | 3|Scale |Input| 8|Right |F3.0 |F3.0 | +------+--------+----------------+-----+-----+---------+----------+-----------+In this example,
UPDATEreplaces existing values and fills missing values when the transaction file(s) provide new data for a case matched with theBYvariable(s). The second GENDER value changed from F to M because the transaction file contained the updated value. The first and third missing GENDER values were filled from the transaction file. ID number 5 was added because it appeared only in the transaction file, and the AGE variable was added because it existed only in the transaction file.This makes
UPDATEgood for keeping records of transactions that change over time. For example, a club membership file can be updated each month to add new members, dropped members can be removed or flagged, and existing members can have their information updated.A Note About Copy-and-Paste Merging:
The
/IN=flags above show exactly what goes wrong when the data in two files doesn't match perfectly. A statistical merge can make these mismatches visible and preserve the structure of the data but only if you use the tools designed for that purpose. A copy‑and‑paste merge cannot. In a spreadsheet, unmatched cases, misaligned rows, and missing variables are all silent — you won't see them, and you won't know they happened. Copy/paste may appear to “work” on perfectly clean data, but the first time the files differ even slightly, the structure is destroyed and information is lost. This is why merges should always be done by a statistical package like PSPP, not by hand. Understand your data, understand its structure, and always verify your merges.