Credit union interest in Big Data is at an all-time high. The promise of predictive analytics and other Big Data opportunities will be a key part of helping the industry compete more effectively with traditional banks and fintech upstarts.
However, where does the data for Big Data come from? The answer is simple: from the credit unions themselves. For example, the loan loss forecasts produced by CECL models depend on data from many credit unions to improve their predictive accuracy.
While credit unions are eager to cash in on the Big Data boom, one of the costs is the “contribution” of their own data to the Big Data “lake.” A data lake is a virtual “storehouse in the cloud” that holds a vast amount of data that can be used for Big Data analytics.
At this point, credit union decision makers often turn sour on Big Data. Why? The cost of “contribution” is too high. The credit union is obligated to protect the sensitive member data in its care. This data cannot simply leave the credit union’s firewall perimeter and be uploaded to the Data Lake.
The healthcare industry faced a similar conundrum regarding electronic medical records. As medical records evolved from paper to an electronic format, the opportunity to perform analytics on this data was gigantic. Yet the Health Insurance Portability and Accountability Act (HIPAA), which governs the privacy of patients’ medical records, stood in the way.
To take advantage of this opportunity but still adhere to HIPAA, healthcare analytics companies devised processes to “de-identify” the sensitive data in medical records. In this way, no specific patient could be uniquely identified while analysts gleaned insights from millions of medical records uploaded by thousands of healthcare providers.
Credit union member data can be handled in a similar way. In fact, the same method for protecting patient privacy can be adapted to the data of credit union members.
In a 2015 publication from the National Institute of Standards and Technology (NIST), the concept of “de-identification” of data is explained. It is defined as, “…a tool that organizations can use to remove personal information from data that they collect, use, archive, and share with other organizations.”
The document describes the HIPAA Safe Harbor method, which specifies 18 types of data to be de-identified. The list below has been adapted to replace healthcare data types with credit union data types. The 18 types are:
- Names
- All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:
  - The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and
  - The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people are changed to 000
- All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
- Telephone numbers
- Fax numbers
- Email addresses
- Social security or other tax identification numbers
- Member or customer numbers (replacing medical record numbers)
- Member or customer numbers (replacing health plan beneficiary numbers)
- Account or application numbers
- Operator, employee, or officer numbers (replacing certificate/license numbers)
- Vehicle identifiers and serial numbers, including license plate numbers
- Device identifiers and serial numbers
- Web Universal Resource Locators (URLs)
- Internet Protocol (IP) addresses
- Biometric identifiers, including finger and voiceprints
- Full-face photographs and any comparable images
- Any other unique identifying number, characteristic, or code
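Several items in the list are generalized rather than removed outright: ZIP codes are cut to three digits, dates are reduced to the year, and ages over 89 are grouped together. The following is a minimal Python sketch of those three rules, not a compliance-ready implementation. The function names are hypothetical, and the set of low-population ZIP prefixes holds placeholder values only; a real list would have to be derived from current Census data.

```python
from datetime import date

# Placeholder values for illustration only. A real list of 3-digit ZIP
# prefixes covering 20,000 or fewer people must come from Census data.
LOW_POPULATION_ZIP3 = {"036", "102"}


def generalize_zip(zip_code: str) -> str:
    """Keep the first three ZIP digits, or return 000 for low-population areas."""
    zip3 = zip_code.strip()[:3]
    return "000" if zip3 in LOW_POPULATION_ZIP3 else zip3


def generalize_date(d: date) -> int:
    """Drop the month and day, keeping only the year."""
    return d.year


def generalize_age(age: int) -> str:
    """Group all ages over 89 into a single '90+' category."""
    return "90+" if age >= 90 else str(age)
```

A member record would pass its ZIP code, date fields, and age through these helpers before leaving the credit union, while fields such as telephone numbers or email addresses would be dropped or replaced entirely.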
An important consideration is how this data is de-identified. Removal of Direct Identifiers is at the heart of de-identification. The NIST document defines Direct Identifiers as “data that directly identifies a single individual. Examples of direct identifiers include names, social security numbers, and email addresses.”
The document notes numerous ways Direct Identifiers can be de-identified:
- The direct identifiers can be removed.
- The direct identifiers can be replaced with either category names or data that are obviously generic. For example, names can be replaced with the phrase “PERSON NAME”, addresses with the phrase “123 ANY ROAD, ANY TOWN, USA”, and so on.
- The direct identifiers can be replaced with symbols such as “*****” or “XXXXX”.
- The direct identifiers can be replaced with random values. If the same identity appears twice, it receives two different values. This preserves the form of the original data, allowing for some kinds of testing, but makes it harder to re-associate the data with individuals.
- The direct identifiers can be systematically replaced with pseudonyms, allowing records referencing the same individual to be matched.
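The five options above can be sketched in a few lines of Python. This is an illustrative sketch only: the record layout, field names, and function names are assumptions, and the keyed hash used for pseudonyms is one common choice, not a method prescribed by the NIST document.

```python
import hashlib
import hmac
import secrets


def remove(record: dict, field: str) -> dict:
    """Option 1: drop the direct identifier entirely."""
    return {k: v for k, v in record.items() if k != field}


def replace_generic(record: dict, field: str, placeholder: str) -> dict:
    """Option 2: substitute an obviously generic value such as 'PERSON NAME'."""
    return {**record, field: placeholder}


def mask(record: dict, field: str, symbol: str = "X") -> dict:
    """Option 3: overwrite every character with a masking symbol."""
    return {**record, field: symbol * len(str(record[field]))}


def replace_random(record: dict, field: str) -> dict:
    """Option 4: random digits of the same length, drawn fresh on every call."""
    original = str(record[field])
    digits = "".join(secrets.choice("0123456789") for _ in original)
    return {**record, field: digits}


def pseudonymize(record: dict, field: str, key: bytes) -> dict:
    """Option 5: keyed hash, so the same identifier always maps to the same pseudonym."""
    digest = hmac.new(key, str(record[field]).encode(), hashlib.sha256).hexdigest()
    return {**record, field: digest[:16]}
```

Only the last function lets records for the same member be matched after de-identification, because it is deterministic for a given key. The random replacement deliberately gives no such guarantee.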
“Pseudonymization” is an extremely important topic in de-identification. Unlike the techniques above, it allows “linking information belonging to an individual across multiple data records or information systems, provided that all direct identifiers are systematically pseudonymized.”
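In layman’s terms, this means that records belonging to the same member can still be matched after de-identification and, if the credit union retains the mapping between each pseudonym and its original value, authorized parties can restore the data to its original form. For example, suppose Member Number is de-identified via pseudonymization. The data is not comprehensible to any unauthorized party. However, when the data returns to the credit union, the pseudonymization can be reversed and the data integrated back into the database.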
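One way to make pseudonymization reversible is to keep a lookup table inside the credit union that maps each pseudonym back to the original value. The sketch below assumes that approach; the class name `PseudonymVault` and its methods are hypothetical, and a production system would also need secure storage and access control for the table.

```python
import secrets


class PseudonymVault:
    """Maps member numbers to random pseudonyms and back.

    Hypothetical helper for illustration. The lookup tables must never
    leave the credit union, since they are what allows reversal.
    """

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}  # member number -> pseudonym
        self._reverse: dict[str, str] = {}  # pseudonym -> member number

    def pseudonymize(self, member_number: str) -> str:
        """Return the same pseudonym every time for a given member number."""
        if member_number not in self._forward:
            token = secrets.token_hex(8)
            while token in self._reverse:  # guard against a rare collision
                token = secrets.token_hex(8)
            self._forward[member_number] = token
            self._reverse[token] = member_number
        return self._forward[member_number]

    def restore(self, token: str) -> str:
        """Recover the original member number; raises KeyError if unknown."""
        return self._reverse[token]
```

Data sent to the data lake would carry only the pseudonyms. When results come back, the credit union calls `restore` on each pseudonym to reattach them to the right member, while anyone without the table sees only random tokens.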
Credit unions that can understand the what and how of data de-identification will be better prepared to take advantage of Big Data opportunities.