Anonymizer definitions are used by HL7Viewer and HL7Script to quickly anonymize or de-identify one or more messages that contain PHI, making them suitable for replay in a test environment. When anonymizing a series of messages, the changed data is persisted to keep the messages consistent. For example, if PID.3 (the patient ID) is changed from "12345" to "TEST001" in the first message, "12345" is changed to "TEST001" in all PID.3 fields.
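The consistency guarantee described above can be pictured as a lookup table keyed by field and original value. The following is a minimal Python sketch of that idea only, not the program's actual implementation; the class name `ConsistentReplacer` and the sequential "TEST001" style generator are hypothetical stand-ins for a real value generator.

```python
from itertools import count


class ConsistentReplacer:
    """Hypothetical helper: one stable replacement per (field key, original value)."""

    def __init__(self, prefix="TEST", width=3):
        self._seen = {}           # (field_key, original) -> replacement
        self._counter = count(1)  # stand-in for a real value generator
        self._prefix = prefix
        self._width = width

    def replace(self, field_key, original):
        key = (field_key, original)
        if key not in self._seen:
            # First time this value is seen: generate and remember a replacement.
            number = next(self._counter)
            self._seen[key] = f"{self._prefix}{number:0{self._width}d}"
        return self._seen[key]


replacer = ConsistentReplacer()
first = replacer.replace("PID.3", "12345")   # new value, generated
other = replacer.replace("PID.3", "67890")   # different original, different result
again = replacer.replace("PID.3", "12345")   # same original, same result as `first`
```

Keying on both the field and the original value mirrors the example database schema later in this document, where the primary key is (fieldkey, origdata).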
A sample definition, Generic.anon.ini, is included with the release. It is a good starting point, but should not be considered an authoritative guide to anonymization. It covers the segments regularly encountered in the author's experience. Test it with your own messages and adjust it until all instances of actual PHI are anonymized, including any in custom Z-segments.
A definition includes three (sometimes four) sections:
The [Global] section appears at the top and is used to set global options.
The [Values] section defines how to generate replacement values for various string, numeric, and date/time types.
The [Fields] section lists all the fields that require anonymization, and which value generator should be used for each. A field may also be replaced with another previously anonymized field.
The [Increments] section is maintained by the program if the SaveIncrements global option is enabled.
Below is a snippet of an example definition file:
; This is a comment
[Global]
Alphabet=BCDFGHJKLMNPQRSTVWXYZ
Persist=1
SaveIncrements=1
DataStore=D:\HL7\Example.anon.data
NamedFields=D:\HL7\HL7NamedFields.2.7.txt

[Values]
anyST=ST
anyNM=NM
anyDT=DT SameAge=1
MRN=NM Min=100000001 Increment=1 Prefix=M
Name=ST Min=3 Max=12
Street=ST Constant="123 ANON ST"
Zip=ST Constant=12345
Phone=ST Constant=(800)555-1212
Email=ST Constant=email@example.com
SSN=NM Min=999101000 Increment=1 Mask=999-99-9999 Ignore=000-00-0000|999-99-9999

[Fields]
PID.3=MRN ;End-of-line comment
PID.5.1.1=Name
PID.5.1.2=Name
PID.5.2=Name
PID.5.3=anyST
PID.7=anyDT
PID.11.1=Street
PID.11.2=Blank
PID.11.3=Name
PID.11.5=Zip
PID.13.1=Phone
PID.13.4=Email
PID.14.1=Phone
PID.14.4=Email
PID.18.1=PID.3
PID.19=SSN
Blank lines are ignored. Comments start with a semicolon (;) and may be whole-line or end-of-line comments.
The following options may be specified in the [Global] section:
A database may be used instead of the ini and datastore file for storing increment values and persisted data, respectively. The following global settings are used only when using a database, and override the file-based settings when provided.
The Anonymizer Database Schema section contains an example of how to create tables and procedures for working with anonymization data.
There are three types of value generators: strings (ST), numbers (NM), and dates (DT). Each value definition starts with a unique name, an equal sign, and one of the types.
; A bare minimum value definition
anyST=ST
Only the name and type are required, but there are numerous options to help generate an interesting value. Options are separated by spaces and are given in Option=Value format. If a value contains spaces or semicolons, enclose it in double quotes (").
; Quote option values that contain spaces or semicolons
Street=ST Constant="123 ANON ST"
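To make the Option=Value syntax concrete, here is a minimal Python sketch of how such a line could be split into a name, a type, and an options dictionary. It is an illustration of the syntax described above, not the program's own parser; `parse_value_definition` is a hypothetical name, and the sketch assumes comments have already been removed from the line.

```python
import re

# A pair is key=value, where the value is either a double-quoted string
# (which may contain spaces or semicolons) or a run of non-space characters.
_PAIR = re.compile(r'(\S+?)=("[^"]*"|\S*)')


def parse_value_definition(line):
    """Split 'Name=TYPE Option=Value ...' into (name, type, options)."""
    pairs = []
    for key, value in _PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        pairs.append((key, value))
    if not pairs:
        raise ValueError(f"not a value definition: {line!r}")
    name, value_type = pairs[0]   # first pair is always Name=TYPE
    return name, value_type, dict(pairs[1:])


street = parse_value_definition('Street=ST Constant="123 ANON ST"')
ssn = parse_value_definition(
    "SSN=NM Min=999101000 Increment=1 Mask=999-99-9999 Ignore=000-00-0000|999-99-9999"
)
```

Here `street` is `("Street", "ST", {"Constant": "123 ANON ST"})`, and `ssn` carries the four options Min, Increment, Mask, and Ignore as strings.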
There are two built-in value generators that are always available: Blank and Null. Blank sets the value to an empty string, and Null sets it to the HL7 null value ("").
The available options vary based on the value type. If an option has a default value other than blank, it is shown in parentheses. Boolean values use 0 for False and 1 for True.
Date values generate a random date based on the options. If the input contains a time, the time remains unchanged.
The following options apply to all types, even when the value is a Constant. Ignore is always checked first to determine if the value should remain unchanged. After generating the value using the type-specific options, the general options are applied in the order they are provided in the definition. Each option may be specified only once.
Right-justifies/overlays a string of characters (usually digits) into a format string. Especially handy for phone number/SSN formatting, but it could conceivably be used on any type of input.
Examples:
FormatDigits('6025551212', '(099)999-9999') -> '(602)555-1212'
FormatDigits('5551212', '(099)999-9999') -> '555-1212'
FormatDigits('6025551212', '999.999.9999') -> '602.555.1212'
FormatDigits('foo', 'bar') -> 'foobar'
All digit characters are always output even if the format string is shorter or blank. Output stops when you run out of digit characters, even though there may be more format string remaining.
Format string rules:
9 = Replace this character with a character from the digit string.
0 = Same as 9 but always includes the next format character to the left, even if you have run out of digit characters.
* = Any other character is copied to the output as a literal.
To output a literal 0 or 9, precede it with the escape character. The default escape character is a backslash, but it can be changed if you need backslashes in your output.
Example: FormatDigits('123456', '999\0999') -> '1230456'
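The format rules above can be sketched in Python as follows. This is one reading of the rules that reproduces the documented examples, not the program's actual FormatDigits code. The name `format_digits` is hypothetical, and the handling of a 0 placeholder (emit the literal to its left only when it consumed the last digit character) is an assumption about the intended behavior.

```python
def format_digits(digits, fmt, escape="\\"):
    """Right-justify `digits` into `fmt` (sketch of the rules described above)."""
    # Tokenize left to right so that escaped characters become literals.
    tokens = []  # (kind, char) where kind is "9", "0", or "lit"
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == escape and i + 1 < len(fmt):
            tokens.append(("lit", fmt[i + 1]))
            i += 2
            continue
        tokens.append((ch if ch in "90" else "lit", ch))
        i += 1

    out = []             # output characters, collected right to left
    d = len(digits) - 1  # index of the next digit character to place
    t = len(tokens) - 1  # index of the next format token to process
    while t >= 0 and d >= 0:  # stop once the digit characters run out
        kind, ch = tokens[t]
        t -= 1
        if kind == "lit":
            out.append(ch)
            continue
        out.append(digits[d])
        d -= 1
        # Assumed meaning of "0": if this placeholder used the last digit
        # character, still emit the literal immediately to its left.
        if kind == "0" and d < 0 and t >= 0 and tokens[t][0] == "lit":
            out.append(tokens[t][1])
            t -= 1

    # Digit characters that did not fit the format are always output.
    out.extend(reversed(digits[: d + 1]))
    return "".join(reversed(out))


phone = format_digits("6025551212", "(099)999-9999")
short = format_digits("5551212", "(099)999-9999")
```

Processing the format from right to left is what makes a seven-digit number fall naturally into the "999-9999" tail of a phone format while the area-code portion is dropped.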
The Fields section contains a list of all fields, components, and subcomponents that require anonymization. Each line consists of a field key, an equal sign, and the name of a value generator or a previously anonymized field key to copy.
If a Named Fields file has been specified, named fields may be used in field definitions. Numeric keys are always valid, even when a Named Fields file has been loaded.
The following example applies the value generator called "MRN" to PID.3:
PID.3=MRN
This example copies the value generated for PID.3 into PID.18. Note that PID.3 must be defined in the Fields list before PID.18 to do this.
PID.18=PID.3
All repetitions in all like segments will be anonymized unless the key provides specific segment sequence and/or repetition indexes, e.g. NK1#1.5, PID.3~1. One example of a reason to include a specific repetition index would be if a sender always uses the third repetition of PID.13 for the email address. You would list the regular PID.13 anonymization first, then the specific repetition.
PID.13.1=Phone
PID.13~3.1=Email ;Vendor always puts email here
If copying a previously anonymized field and the value should be copied from the same segment sequence and/or repetition that is currently being anonymized, wildcards can be used. The wildcard character is a question mark (?) and can follow either a segment sequence (#) or repetition (~) marker. The question marks will be replaced with the appropriate indexes for the current field. Without a wildcard, the first such segment (#1) and repetition (~1) are assumed.
PID.18=PID#?.3~?.1 ;Copies PID.3.1 from the same segment and repetition of this PID.18
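The wildcard substitution can be illustrated with a short Python sketch. It assumes numeric keys of the form SEG[#seq].field[~rep][.component[.subcomponent]], as seen in the examples in this section; named fields are not handled. `resolve_copy_source` is a hypothetical name, not part of the product.

```python
import re

# Segment name, optional #sequence, field number, optional ~repetition,
# then any component/subcomponent numbers.
_KEY = re.compile(
    r"^(?P<seg>[A-Z][A-Z0-9]{2})"
    r"(?:#(?P<seq>\d+|\?))?"
    r"\.(?P<field>\d+)"
    r"(?:~(?P<rep>\d+|\?))?"
    r"(?P<rest>(?:\.\d+)*)$"
)


def resolve_copy_source(key, current_seq, current_rep):
    """Return the fully indexed source key for a copy assignment."""
    m = _KEY.match(key)
    if m is None:
        raise ValueError(f"unsupported field key: {key!r}")
    # A missing index means the first segment or repetition; "?" means
    # the index of the field currently being anonymized.
    seq = m["seq"] or "1"
    rep = m["rep"] or "1"
    seq = current_seq if seq == "?" else int(seq)
    rep = current_rep if rep == "?" else int(rep)
    return f"{m['seg']}#{seq}.{m['field']}~{rep}{m['rest']}"


with_wildcards = resolve_copy_source("PID#?.3~?.1", current_seq=2, current_rep=3)
without_wildcards = resolve_copy_source("PID.3", current_seq=2, current_rep=3)
```

While anonymizing the third repetition in the second PID segment, the wildcard key resolves to `PID#2.3~3.1`, whereas the plain key falls back to `PID#1.3~1`.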
A field definition may also include one or more of the following options:
If more flexibility is required in choosing a replacement for a field, conditional logic in IF-THEN-ELSE format can be used to select the correct value generator or field to copy:
PID.13.1=IF PID#?.13~?.2 == "NET" THEN Email ELSE Phone
; If the SSN starts with "X" don't change it:
PID.19.1=IF PID.19.1 ~= "X" THEN IGNORE ELSE SSN
"IF " must immediately follow the equal sign. The IF portion of the expression uses the same syntax as HL7Script IF statements. The THEN and ELSE parts are both required, and must provide either a value generator name, a field key to copy, or the word IGNORE to leave the field unchanged. Segment and repetition wildcards work as they do in non-conditional assignments.
The THEN and ELSE parts can also nest additional conditional logic expressions within parentheses:
PID.13.1=IF PID#?.13~?.2=="NET" THEN Email ELSE (IF PID#?.13~?.3=="CP" THEN CellPhone ELSE Phone)
Nesting is effectively unlimited, but the entire expression must be contained on a single line.
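The structure of these expressions can be sketched as a small recursive parser in Python. This is an illustration only: it treats the condition as an opaque string (the HL7Script IF syntax is not interpreted), it expects just the text after the equal sign, and it does not handle trailing Persist/Ignore options or comments. The function names are hypothetical.

```python
def _find_keyword(text, keyword, start=0):
    """Index of ' KEYWORD ' outside quotes and parentheses, or -1."""
    needle = f" {keyword} "
    depth = 0
    in_quotes = False
    for i in range(start, len(text)):
        ch = text[i]
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and text.startswith(needle, i):
                return i
    return -1


def parse_assignment(expr):
    """Return a generator/field name, or a nested (condition, then, else) tuple."""
    expr = expr.strip()
    if expr.startswith("(") and expr.endswith(")"):
        expr = expr[1:-1].strip()  # unwrap a parenthesized nested expression
    if not expr.startswith("IF "):
        return expr                # value generator, field key, or IGNORE
    then_at = _find_keyword(expr, "THEN")
    else_at = _find_keyword(expr, "ELSE", then_at + 1) if then_at >= 0 else -1
    if then_at < 0 or else_at < 0:
        raise ValueError("THEN and ELSE are both required")
    condition = expr[3:then_at].strip()
    then_part = expr[then_at + len(" THEN "):else_at]
    else_part = expr[else_at + len(" ELSE "):]
    return (condition, parse_assignment(then_part), parse_assignment(else_part))


tree = parse_assignment(
    'IF PID#?.13~?.2=="NET" THEN Email ELSE '
    '(IF PID#?.13~?.3=="CP" THEN CellPhone ELSE Phone)'
)
```

For the nested example above, `tree` is a three-element tuple whose last element is itself a (condition, then, else) tuple for the inner expression.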
The Persist and Ignore options are still available when using conditional logic, and must be the last options on the line when present.
Here is an example of a possible database schema for persisting anonymization data, including Global Options tailored to work with it.
If multiple threads or processes could anonymize data simultaneously, develop a thread-safe design using sequences, generators, or identity columns instead. Those constructs guarantee that no two connections retrieve the same increment value.
-- Persisted data: one anonymized value per field key and original value
CREATE TABLE AnonStore (
  fieldkey nvarchar(50) NOT NULL,
  origdata nvarchar(250) NOT NULL,
  anondata nvarchar(250) NOT NULL,
  CONSTRAINT pk_AnonStore PRIMARY KEY (fieldkey, origdata)
)
GO

-- Last increment used by each value generator
CREATE TABLE AnonInc (
  valuename NVARCHAR(50) NOT NULL PRIMARY KEY,
  lastincrement BIGINT NOT NULL
)
GO

-- Returns the next increment for a value generator, starting at @min
CREATE PROCEDURE AnonIncrement(@valuename NVARCHAR(50), @inc BIGINT, @min BIGINT)
AS
BEGIN
  DECLARE @last BIGINT
  SELECT @last = lastincrement FROM AnonInc WHERE valuename = @valuename;
  IF @last IS NULL
    INSERT INTO AnonInc (valuename, lastincrement) VALUES (@valuename, @min);
  ELSE
  BEGIN
    SET @last = @last + @inc;
    UPDATE AnonInc SET lastincrement = @last WHERE valuename = @valuename;
  END
  SELECT lastincrement FROM AnonInc WHERE valuename = @valuename;
END
GO
Database=(your connection name here)
IncrementSQL=EXEC AnonIncrement :ValueName, :ValueInc, :ValueMin;
DataReadSQL=SELECT anondata FROM AnonStore WHERE fieldkey = :FieldKey AND origdata = :OrigData;
DataSaveSQL=INSERT INTO AnonStore (fieldkey, origdata, anondata) VALUES (:FieldKey, :OrigData, :AnonData);
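To show how the read and save statements might work together, here is a minimal Python sketch using an in-memory SQLite database as a stand-in for the real one. The read-then-save order and the helper name `persisted_value` are assumptions made for illustration; the program's actual call sequence is not documented here. The column types are simplified to SQLite's TEXT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the configured database
conn.execute(
    """CREATE TABLE AnonStore (
           fieldkey TEXT NOT NULL,
           origdata TEXT NOT NULL,
           anondata TEXT NOT NULL,
           PRIMARY KEY (fieldkey, origdata))"""
)

# Same statements as the DataReadSQL and DataSaveSQL settings above.
DATA_READ_SQL = (
    "SELECT anondata FROM AnonStore "
    "WHERE fieldkey = :FieldKey AND origdata = :OrigData"
)
DATA_SAVE_SQL = (
    "INSERT INTO AnonStore (fieldkey, origdata, anondata) "
    "VALUES (:FieldKey, :OrigData, :AnonData)"
)


def persisted_value(field_key, original, generate):
    """Return the stored replacement, generating and saving one if needed."""
    params = {"FieldKey": field_key, "OrigData": original}
    row = conn.execute(DATA_READ_SQL, params).fetchone()
    if row is not None:
        return row[0]          # already anonymized earlier: reuse it
    anon = generate()          # hypothetical value generator callback
    conn.execute(DATA_SAVE_SQL, {**params, "AnonData": anon})
    return anon


first = persisted_value("PID.3", "12345", lambda: "M100000001")
second = persisted_value("PID.3", "12345", lambda: "M100000002")
```

The second call finds the stored row and returns the first replacement, so the generator passed to it is never invoked.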