Privacy Design Methodologies in Big Data World

data-platform notes software

In this blog post, I am documenting two talks about privacy and design patterns that grabbed my attention.

Raghu Gollamudi in his talk “Privacy Ethics – A Big Data Problem” broadly covers best practices concerning Data Management aspects, applying Data Protection rules and “Privacy by design” methodologies to maintain good privacy hygiene.

Second speaker, Lars Albertsson in his talk “Privacy by design” discusses privacy, personal integrity and design patterns.

Data is exploding

  • Data is being acquired/purchased at a rapid pace
  • IOT data is ending up into massive data lakes (e.g., Strava exposing U.S. military bases)
  • Application logs (e.g., collecting everything)
  • Biased algorithms (e.g., offering new products to assumingly a pregnant girl)

Privacy vs. security

Privacy is:

  • A fundamental right of the individual
  • Focused on use of data based on bossiness and content
  • Open to interpretation based on context and risk
  • Under-invested market

Security is:

  • A framework to protect information
  • Focused on confidentiality and accessibility
  • Clearly defined with explicit controls to monitor for and prevent authorized access
  • Matured market

Privacy requirements from engineer’s perspective

  • Right to be forgotten
  • Limited collection
  • Limited retention
  • Limited access
  • Consent for processing
  • Right for explanations
  • Right to correct data
  • User data enumeration
  • User data export



  • Customer/Employees first
  • Performance reviews


  • Encrypt, Encrypt, Encrypt: Transit, Rest, Backups
  • Robust Authentication mechanism
  • Anonymization and/or Pseudoanonymization
  • Store credentials in Vault storage
  • Audit logs



  • Privacy related tasks part of scrum/sprint
  • Data minimization
  • Self service user management portal
  • Honor user consent when processing user data
  • Retention


  • Discovery: Rest and in-motion (tagging, classifying, confidence scores; data sources, data formats, performance)
  • Rules: Regulations, consent and contracts
  • Issue management: Rule violations, issue resolution, subject access requests

Data classification

Personal information (PII) classification

  • Red - sensitive data (messages, GPS location, preferences)
  • Yellow - personal data (ID, name, email, address, IP address)
  • Green - insensitive data (not related to persons)
  • Grey zone - birth date, zip code, recommendation models

PI arithmetics

  • ​red + green = red
  • sum( red / yellow ) = green?
  • yellow + yellow + yellow = red?

Make privacy visible at ground level

  • In dataset names hdfs://red/crm/…
  • In field names: user.y_name

Privacy protection patterns

Privacy protection at ingress

  • Scramble on arrival
  • Simple to implement
  • Limits incoming data = limited value extraction
  • Deanonymization possible

Privacy by design by Lars Albertsson

Privacy protection at egress

  • Processing in opaque box
  • Enabling
  • Strict operations required
  • Exploratory analytics need explicit egress/classification

Privacy by design by Lars Albertsson

Right to be forgotten - data deletion


  • Discard all PII (e.g., User id)
  • No link between records or datasets

Privacy by design by Lars Albertsson


  • ​Records are linked
  • Hash PII

Privacy by design by Lars Albertsson


  • ​Rerun all your datasets to remove customers
  • Computationally expensive
  • No versioning in tools
  • No data model changes required

Ejected record pattern

  • Mapping table, you keep PII in one table
  • Ones user wants to be forgotten, delete one record
  • It includes extra join in all selects
  • PII table key needed (e.g., hash)

Lost key pattern

  • PII fields encrypted
  • Per-user decryption key table
  • Extra join to decrypt
  • Single dataset leak -> no PII leak
  • Ones user wants to be forgotten, delete a key
  • Instead of deleting a key, you can store a key in another place

Privacy by design by Lars Albertsson

Data model deadly sins

  • ​Using PII data as key (username, email)
  • Publishing entity ids containing PII data (e.g. user shared resources including username
  • Publishing pseudonymized datasets (can be de-pseudonymized with external data)

I'm Valdas Maksimavicius. I write about data, cloud technologies and personal development. You can find more about me here.