Privacy Design Methodologies in Big Data World

In this blog post, I am documenting two talks about privacy and design patterns that grabbed my attention.

Raghu Gollamudi in his talk “Privacy Ethics – A Big Data Problem” broadly covers best practices concerning Data Management aspects, applying Data Protection rules and “Privacy by design” methodologies to maintain good privacy hygiene.

Second speaker, Lars Albertsson in his talk “Privacy by design” discusses privacy, personal integrity and design patterns.

Data is exploding

Data is being acquired/purchased at a rapid pace
IOT data is ending up into massive data lakes (e.g., Strava exposing U.S. military bases)
Application logs (e.g., collecting everything)
Biased algorithms (e.g., offering new products to assumingly a pregnant girl)

Privacy vs. security

Privacy is:

A fundamental right of the individual
Focused on use of data based on bossiness and content
Open to interpretation based on context and risk
Under-invested market

Security is:

A framework to protect information
Focused on confidentiality and accessibility
Clearly defined with explicit controls to monitor for and prevent authorized access
Matured market

Privacy requirements from engineer’s perspective

Right to be forgotten
Limited collection
Limited retention
Limited access
Consent for processing
Right for explanations
Right to correct data
User data enumeration
User data export

Solutions

Culture

Customer/Employees first
Performance reviews

Security

Encrypt, Encrypt, Encrypt: Transit, Rest, Backups
Robust Authentication mechanism
Anonymization and/or Pseudoanonymization
Store credentials in Vault storage
Audit logs

Design

Data integrity
Authorization
Obscurity by design
Conservative approach to privacy settings by default
Visibility and Transparency
Privacy by Design The 7 Foundational Principles by Ann Cavoukian

Process

Privacy related tasks part of scrum/sprint
Data minimization
Self service user management portal
Honor user consent when processing user data
Retention

Automation

Discovery: Rest and in-motion (tagging, classifying, confidence scores; data sources, data formats, performance)
Rules: Regulations, consent and contracts
Issue management: Rule violations, issue resolution, subject access requests

Data classification

Personal information (PII) classification

Red - sensitive data (messages, GPS location, preferences)
Yellow - personal data (ID, name, email, address, IP address)
Green - insensitive data (not related to persons)
Grey zone - birth date, zip code, recommendation models

PI arithmetics

red + green = red
sum( red / yellow ) = green?
yellow + yellow + yellow = red?

Make privacy visible at ground level

In dataset names hdfs://red/crm/…
In field names: user.y_name

Privacy protection patterns

Privacy protection at ingress

Scramble on arrival
Simple to implement
Limits incoming data = limited value extraction
Deanonymization possible

_{Privacy by design by Lars Albertsson}

Privacy protection at egress

Processing in opaque box
Enabling
Strict operations required
Exploratory analytics need explicit egress/classification

_{Privacy by design by Lars Albertsson}

Right to be forgotten - data deletion

Anonymisation

Discard all PII (e.g., User id)
No link between records or datasets

_{Privacy by design by Lars Albertsson}

Pseudonymisation

Records are linked
Hash PII

_{Privacy by design by Lars Albertsson}

Re-computation

Rerun all your datasets to remove customers
Computationally expensive
No versioning in tools
No data model changes required

Ejected record pattern

Mapping table, you keep PII in one table
Ones user wants to be forgotten, delete one record
It includes extra join in all selects
PII table key needed (e.g., hash)

Lost key pattern

PII fields encrypted
Per-user decryption key table
Extra join to decrypt
Single dataset leak -> no PII leak
Ones user wants to be forgotten, delete a key
Instead of deleting a key, you can store a key in another place

_{Privacy by design by Lars Albertsson}

Data model deadly sins

Using PII data as key (username, email)
Publishing entity ids containing PII data (e.g. user shared resources including username
Publishing pseudonymized datasets (can be de-pseudonymized with external data)