Redact, Hash, or Fake: Which Anonymization Strategy to Use for Each Column

A practical decision guide for anonymizing CSV columns. When to redact, when to hash, when to replace with fake data, and why picking the wrong strategy either leaks information or destroys the analysis.

You have a CSV with sensitive data. You need to share it with someone outside your team. You already know “anonymize it before sending” is the right answer. The harder question is: which kind of anonymization, for which column?

The three main options (redact, hash, replace with fake data) all sound like they do the same thing. They do not. Picking the wrong one for a given column either leaks information you thought you protected, or destroys the analysis the recipient is trying to do.

This guide is a practical decision tree. For each common column type in a typical RevOps, sales, or marketing CSV, here is which strategy fits and why.

What each strategy actually does

Before walking through columns, here is what the three core strategies mean in practice.

Redaction replaces the value with a blank or a fixed placeholder like REDACTED or xxxx. Every row in the column now looks the same. The column still exists in the schema, but the values carry no information.

When to reach for it: when the column is genuinely not needed by the recipient, and you do not want them to see even the shape of the data.

Hashing runs each value through a one-way function (SHA-256, for example) and stores the fingerprint. The same input always produces the same output, and for practical purposes different inputs produce different outputs. You cannot run a hash backwards to recover the original value, although anyone who can guess the input can confirm the guess by hashing it.

When to reach for it: when the recipient needs to count unique values, deduplicate rows, or join two files on a column, but does not need to read the actual value.
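As a minimal sketch of what hashing a column looks like, here is SHA-256 applied with Python's standard hashlib module. The helper name hash_value and the example.com addresses are placeholders invented for illustration.

```python
import hashlib


def hash_value(value: str) -> str:
    """Return the SHA-256 hex digest of a value, encoded as UTF-8."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Placeholder contacts; the first and third rows are the same person.
emails = ["ana@example.com", "bo@example.com", "ana@example.com"]
hashed = [hash_value(e) for e in emails]

# Same input, same fingerprint: the duplicate is still detectable.
print(hashed[0] == hashed[2])  # True

# Unique counts survive hashing: two distinct contacts before and after.
print(len(set(emails)), len(set(hashed)))  # 2 2
```

The recipient can still count and deduplicate, but sees only 64-character hex strings.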

Replacing with fake data substitutes each real value with a realistic-but-invented one. John Doe becomes Alex Rivera. A real email address becomes an invented one. The column still has the shape and feel of real data, but none of it refers to anything real.

When to reach for it: when the recipient needs the file to look real (for a demo, a screenshot, a deck) or when realistic values make the analysis more useful than blank or hashed ones.

There are also a few secondary strategies you will see in the column-by-column rules below:

  • Rounding or bucketing for numeric data: replace 47,283 with 47,000 (rounding) or with a range such as $10k to $50k (bucketing). Preserves distribution, hides exact values.
  • Date shifting: move every date in the file forward or backward by the same number of days. Preserves intervals, breaks calendar alignment.
  • Truncation: keep only the first two digits of a postcode, the year of a date, the country code of a phone number. Reduces precision, keeps some signal.
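The three secondary strategies above are each a one-line transformation. The sketch below is illustrative only: the function names, the step size, and the sample postcode are assumptions, not part of any particular tool.

```python
from datetime import date, timedelta


def round_to(value: float, step: int) -> int:
    """Round a number to the nearest multiple of step."""
    return int(round(value / step)) * step


def shift_date(d: date, days: int) -> date:
    """Move a date by a fixed number of days (negative moves it back)."""
    return d + timedelta(days=days)


def truncate_postcode(postcode: str, keep: int = 2) -> str:
    """Keep only the first `keep` characters of a postcode."""
    return postcode.strip()[:keep]


print(round_to(47_283, 1_000))                 # 47000
print(shift_date(date(2025, 1, 15), 30))       # 2025-02-14
print(truncate_postcode("SW1A 1AA"))           # SW
```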

The decision: three questions per column

For any column, the strategy is determined by three questions:

  1. Does the recipient need to read this column’s values? If yes, you cannot redact or hash. You need fake data, or you need to keep it.
  2. Does the recipient need to count unique values, dedupe, or join on this column? If yes, you cannot fake data (because fakes are random per row and will not match across files). You need to hash, or to keep it.
  3. Is the column itself sensitive, or only the values? If only the values are sensitive and the column name and shape are fine to share, you transform the values. If the column itself reveals a confidential category, you drop it entirely.
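One way to read the three questions is as a small decision function. This is a sketch of that reading, not a definitive rule set: the function name, the parameter names, and the choice to return "keep" when the recipient needs both to read and to match values are assumptions.

```python
def choose_strategy(
    needs_to_read: bool,
    needs_to_match: bool,
    column_is_confidential: bool,
) -> str:
    """Pick a default strategy for one column from the three questions."""
    if column_is_confidential:
        # Question 3: the column's existence is the leak, so remove it.
        return "drop"
    if needs_to_read and needs_to_match:
        # Neither hashing (unreadable) nor per-row fakes (unmatchable) works.
        return "keep"
    if needs_to_read:
        # Question 1: readable values are required, so fake them.
        return "fake"
    if needs_to_match:
        # Question 2: counts, dedupes, and joins survive hashing.
        return "hash"
    return "redact"


print(choose_strategy(False, True, False))   # hash
print(choose_strategy(True, False, False))   # fake
print(choose_strategy(False, False, False))  # redact
```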

Walking those three questions over every column gives you the strategy without much thinking. Here is what that looks like in practice.

Personal identifiers

Email addresses

What the recipient might want to do with it: count unique contacts, join two files on email, see realistic-looking data.

Need | Strategy
Count or dedupe | Hash
Join with another file | Hash (same function in both files)
Show realistic data | Fake (an invented address)
None of the above | Redact

One subtle thing: hashing only the local part and keeping the @acmecorp.com visible is not anonymization. The domain often identifies the company, which is usually the most sensitive part of a B2B record. If the domain matters, keep it; otherwise hash or fake the whole address.

Full names

Need | Strategy
The file is a demo or screenshot | Fake
Analysis (counts, segments, ratios) | Redact
Join across files | Do not use names for joining; hash a stable ID instead

Names rarely need to be hashed, because nobody reliably joins on Full Name (too many duplicates, variants, “John Smith” everywhere).

Phone numbers

Need | Strategy
Detect duplicates | Hash
Realistic-looking file | Fake, using a range reserved for fiction (in North America, 555-0100 to 555-0199)
Nothing specific | Redact

If you fake phone numbers, use a clearly fictional range so the values cannot accidentally be dialed.
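Here is a small sketch of generating fake numbers inside the North American 555-0100 to 555-0199 block, which is set aside for fictional use. The function name is hypothetical, and the area code is picked arbitrarily; only the 555-01XX part is what makes the number fictional.

```python
import random


def fake_us_phone(rng: random.Random) -> str:
    """Return a phone number whose last seven digits are 555-01XX."""
    area = rng.randint(200, 999)  # arbitrary three-digit prefix (assumption)
    line = rng.randint(0, 99)     # XX in 555-01XX
    return f"+1-{area}-555-01{line:02d}"


rng = random.Random()
fake_numbers = [fake_us_phone(rng) for _ in range(3)]
print(fake_numbers)
```

Other countries reserve their own fictional ranges, so check the relevant numbering plan before reusing this pattern elsewhere.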

Addresses

Need | Strategy
Geographic analysis | Keep country and region, redact street and unit
Postcode-level analysis | Truncate postcode to the first two or three characters
Realistic demo data | Fake the full address
Nothing specific | Redact

Full street addresses identify individuals. Country and region usually do not. Postcode falls in between, depending on the country (UK postcodes are extremely precise; US ZIPs less so).

National IDs, social security numbers, tax IDs

There is only one default: redact, or remove the column entirely. These should not be in a file you are sharing externally under any circumstance you are likely to encounter. If you genuinely need to show that two rows refer to the same individual, hash the value with a secret salt the recipient does not have, because the space of valid ID numbers is small enough to brute-force an unsalted hash. Never share a partial or formatted version.

Internal identifiers

Customer IDs, account IDs, deal IDs, lead IDs

This is the most commonly mishandled column. Internal IDs feel safe because they are short numeric or alphanumeric strings. They are not safe: anyone with access to your internal systems can map them back to real entities.

Need | Strategy
Join across files | Hash, using the same function across all files in the set
Count unique entities | Hash
Look up in production | Should never be the goal of a shared file; do not share

Hashing is almost always right for internal IDs. It preserves the joins (which is usually the whole reason the column is in the file) without exposing the values.

Employee or owner IDs

Treat the same as internal customer IDs. Hash if the recipient needs to see “deals per owner” or similar, redact otherwise.

Commercial-sensitive columns

Company names

In B2B data, this is often more sensitive than personal names. Your customer list is competitive intelligence.

Need | Strategy
Realistic demo or report | Fake
Count unique companies | Hash
Distribution analysis (by segment, size) | Hash, or keep the segment column and redact the company column
Public communication | Keep, but only with explicit permission

A common mistake: shuffling real customer names across rows. This is not anonymization. It still discloses the list of customers, just not which deal belongs to which. Use invented names, not shuffled real ones.
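The difference between shuffling and faking is easy to check in a few lines. In this sketch every company name, real or invented, is a placeholder made up for the example.

```python
import random

# Stand-ins for a real customer column.
real = ["Acme Corp", "Globex", "Initech", "Acme Corp"]
rng = random.Random()

# Shuffling only reorders rows: the set of customer names is unchanged.
shuffled = real[:]
rng.shuffle(shuffled)
print(set(shuffled) == set(real))  # True

# Faking draws from a pool of invented names instead.
invented_pool = ["Bluepeak Systems", "Harbor Vale", "Quillon Group"]
faked = [rng.choice(invented_pool) for _ in real]

# No real customer name survives in the faked column.
print(set(faked) & set(real))  # set()
```

Note that per-row random fakes like these do not preserve unique counts, which is why the table above points to hashing when counts matter.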

Industry, segment, size band

Need | Strategy
Almost any analysis | Keep
Confidential segmentation taxonomy | Redact or remap to generic labels

These are categorical, low-cardinality, and rarely identifying. Keep them unless your segmentation itself is confidential (some companies treat their segment definitions as competitive intel).

Deal value, revenue, contract amount

Hashing financial numbers is almost always wrong: it destroys ordering and arithmetic, which breaks every interesting analysis. Use bucketing or rounding instead.

Need | Strategy
Distribution analysis | Round to nearest 1,000 or 10,000
Segment comparison | Bucket into ranges (<$10k, $10k to $50k, etc.)
Preserve relative scale | Multiply every value by the same random factor (0.7 to 1.3)
Recipient does not need numbers | Redact

Be careful about combining low-cardinality dimensions with exact deal values. A unique deal value at a known close date is often enough to identify the customer.
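Bucketing and single-factor scaling from the table above might look like the sketch below. The bucket boundaries above $50k and the sample deal values are assumptions chosen for the example.

```python
import random


def bucket(amount: float) -> str:
    """Map an amount to a coarse range label."""
    if amount < 10_000:
        return "<$10k"
    if amount < 50_000:
        return "$10k to $50k"
    if amount < 250_000:
        return "$50k to $250k"
    return "$250k+"


deal_values = [8_500.0, 47_283.0, 120_000.0]
print([bucket(v) for v in deal_values])
# ['<$10k', '$10k to $50k', '$50k to $250k']

# Draw ONE factor for the whole file, then apply it to every row.
factor = random.Random().uniform(0.7, 1.3)
scaled = [v * factor for v in deal_values]

# Ratios between deals are unchanged, so relative scale is preserved.
print(scaled[1] / scaled[0], deal_values[1] / deal_values[0])
```

One caveat with scaling: a recipient who already knows one real value can recover the factor and, with it, every other value. Bucketing does not have that weakness.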

Discount, margin, cost columns

Same as deal value: round or bucket. These are often even more sensitive than revenue, because margin reveals pricing strategy and unit economics.

Time columns

Created date, modified date, closed date

Dates by themselves are rarely sensitive. The combination of a date and another column (a deal value, a customer ID) often is, because the combination can be unique.

Need | Strategy
Time-series analysis | Keep, if the rest of the file is well anonymized
Break calendar alignment | Shift every date in the file by the same number of days
Reduce precision | Truncate to month or quarter
Not needed | Drop

The key thing if you shift dates: apply the same shift to every date column in the file. If created_date shifts by 90 days and closed_date shifts by 60 days, the deal length (closed minus created) is corrupted, and any duration-based analysis is broken.
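A short sketch of a consistent shift, assuming rows are dictionaries and the date columns are named created_date and closed_date as in the example above. The 90-day offset and the sample dates are arbitrary.

```python
from datetime import date, timedelta

rows = [
    {"deal": "d1", "created_date": date(2025, 1, 10), "closed_date": date(2025, 3, 1)},
    {"deal": "d2", "created_date": date(2025, 2, 5), "closed_date": date(2025, 2, 20)},
]

DATE_COLUMNS = ("created_date", "closed_date")
SHIFT = timedelta(days=90)  # one offset for every date column in the file

shifted = [
    {**row, **{col: row[col] + SHIFT for col in DATE_COLUMNS}}
    for row in rows
]

# Deal length (closed minus created) is identical before and after.
for before, after in zip(rows, shifted):
    print(
        before["closed_date"] - before["created_date"],
        after["closed_date"] - after["created_date"],
    )
```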

Timestamps (login, last activity, event time)

Treat the same as dates, but consider truncating to the hour or the day. Second-level precision on a login event combined with even a hashed user ID is more identifying than people expect.
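Truncating a timestamp is a matter of zeroing the fields below the precision you want to keep. A minimal sketch with hypothetical helper names:

```python
from datetime import datetime


def truncate_to_hour(ts: datetime) -> datetime:
    """Drop minutes, seconds, and microseconds."""
    return ts.replace(minute=0, second=0, microsecond=0)


def truncate_to_day(ts: datetime) -> datetime:
    """Drop the time of day entirely."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


login = datetime(2025, 6, 3, 14, 37, 52)
print(truncate_to_hour(login))  # 2025-06-03 14:00:00
print(truncate_to_day(login))   # 2025-06-03 00:00:00
```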

Free-text columns

Notes, descriptions, comments, transcripts

There is essentially one strategy: drop the column unless you have manually reviewed every row.

Free text contains anything: real names, real email addresses, real account details, real complaints, real internal-only context. No automatic anonymization can reliably catch all of it. Pattern-matching for emails and phone numbers is necessary but not sufficient.
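To see why pattern-matching is necessary but not sufficient, here is a rough scan for emails and phone numbers. The two regular expressions are deliberately simple approximations, and the notes are invented.

```python
import re

# Loose patterns for illustration; real-world formats vary far more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def looks_sensitive(text: str) -> bool:
    """Flag text that contains something shaped like an email or phone."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


notes = [
    "Call back on +1 202 555 0143 after the renewal",
    "Send the quote to jane@example.com",
    "Spoke to Maria from the Lyon office about her invoice dispute",
]
print([looks_sensitive(n) for n in notes])  # [True, True, False]
```

The third note names a person and describes a dispute, yet nothing flags it. That gap is what manual review has to cover.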

If the recipient genuinely needs the notes, you have to read every row and rewrite anything sensitive by hand. That is expensive, which is the right signal: free-text columns should be excluded from external shares by default.

Quick reference table

Column type | Default strategy | When to deviate
Email | Hash | Fake for demos, redact if not needed
Full name | Redact | Fake for demos
Phone | Hash | Fake for demos, redact if not needed
Street address | Redact | Fake for demos; keep country and region
National ID, SSN | Redact (or remove column) | Hash only if rows must be matched; never share partial or formatted values
Internal customer / account ID | Hash | Redact if no joins or counts are needed
Owner / employee ID | Hash | Redact otherwise
Company name | Fake | Hash for counts; redact when not needed
Industry, segment | Keep | Redact only if taxonomy is confidential
Deal value | Round or bucket | Redact if not needed; never hash
Discount, margin | Bucket | Redact when especially sensitive
Created / closed date | Shift all dates by same amount | Truncate to month for less precision
Timestamps | Truncate to day or hour | Same as dates
Free text | Drop the column | Manual review only when essential

Common mistakes to avoid

Mistake 1: applying the same strategy to every column. “I’ll just hash everything” sounds safe but destroys the analysis. “I’ll fake everything” breaks joins and unique counts. Each column needs its own answer.

Mistake 2: shuffling values within a column. Taking the real customer names and randomly reassigning them to rows is not anonymization. The list of customers is still disclosed, only the row-level mapping is changed. Use invented names, not shuffled real ones.

Mistake 3: hashing with a guessable algorithm and small value space. Hashing a boolean column, or a column with 50 possible values, with a public algorithm is reversible by anyone with five minutes and a script. They precompute the hashes of every possible value and look up the matches. If the value space is small, do not hash; redact or generalize.
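The precompute-and-look-up attack fits in a dozen lines. This sketch assumes the attacker knows the column holds country codes and that plain SHA-256 was used; the candidate list is an invented example.

```python
import hashlib


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# What the recipient receives: a "protected" country column.
hashed_column = [sha256_hex(v) for v in ["FR", "DE", "FR", "US"]]

# What the recipient does: hash every plausible value once...
candidates = ["FR", "DE", "US", "GB", "ES"]
lookup = {sha256_hex(v): v for v in candidates}

# ...and read the originals straight back out.
recovered = [lookup.get(h) for h in hashed_column]
print(recovered)  # ['FR', 'DE', 'FR', 'US']
```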

Mistake 4: leaving combinations of “safe” columns that re-identify rows. A row with country = France, industry = Pharma, deal value = $1,247,392, closed in March 2026 may be globally unique, even though no single column on its own is sensitive. Watch for combinations, not just columns.
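A quick way to spot risky combinations is to count how many rows share each combination of the "safe" columns. In this sketch the rows, the column names, and the choice of which columns to combine are all assumptions.

```python
from collections import Counter

rows = [
    {"country": "France", "industry": "Pharma", "close_month": "2026-03"},
    {"country": "France", "industry": "Retail", "close_month": "2026-03"},
    {"country": "France", "industry": "Retail", "close_month": "2026-03"},
    {"country": "Spain", "industry": "Retail", "close_month": "2026-02"},
]
COMBINED_COLUMNS = ("country", "industry", "close_month")


def combo(row: dict) -> tuple:
    """Return the row's values for the columns being checked together."""
    return tuple(row[col] for col in COMBINED_COLUMNS)


counts = Counter(combo(r) for r in rows)

# Rows whose combination appears exactly once stand out in the file.
singletons = [r for r in rows if counts[combo(r)] == 1]
print(len(singletons))  # 2
```

Rows that come back as singletons are candidates for coarser buckets or truncated dates.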

Mistake 5: forgetting that the schema itself leaks information. A column called vip_customer_flag or internal_credit_risk_score reveals something even if every value is redacted. If the column name is sensitive, rename it before sharing or drop it entirely.

Where Asphorem fits in

The Data Anonymizer implements the column-by-column model in this article. Upload a CSV, see every column with sample values, pick a strategy per column (redact, hash, replace with fake data, round, bucket, shift), and download the result.

The whole transformation runs in your browser. The file is not uploaded to any server, which is the only sensible default when the entire point is to control where the data goes.

Anonymization configurations are saved per file shape, so the next time you need to send an updated extract to the same recipient, you re-apply the same per-column rules in one click. The decisions you make once here become a workflow.

If you also need to standardize the file before anonymizing (clean column names, normalize picklist values, fix date formats), the CSV Normalizer handles that side, and the two tools share the same client-side model.

Frequently asked questions

Is hashing the same as encryption?

No. Encryption is reversible if you have the key; hashing is one-way, so not even the person who applied it can decrypt the output. For anonymization, you want one-way (no key to leak, no way for the recipient to decrypt the originals). Hashing is also deterministic: the same input always produces the same output, which is why two files hashed the same way join cleanly on the hashed column.

Can a hashed column be re-identified?

If the column has a small or guessable value space (booleans, country codes, phone numbers, national IDs), yes. A determined recipient can hash every possible value and reverse the lookup. For high-cardinality columns like emails or customer IDs, the practical re-identification risk is much lower, but it is not zero, especially if the recipient already knows, or can guess, some of the values. For strict cases, use a salted hash with a secret salt the recipient does not have.
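The salted-hash idea can be sketched with HMAC-SHA256 from Python's standard library. HMAC is swapped in here as a standard keyed construction; it is not something this article prescribes, and the names keyed_hash and secret_key are hypothetical.

```python
import hashlib
import hmac
import secrets


def keyed_hash(value: str, key: bytes) -> str:
    """Return the HMAC-SHA256 hex digest of value under a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Generated once, kept on your side, never sent with the file.
secret_key = secrets.token_bytes(32)

a = keyed_hash("FR", secret_key)
b = keyed_hash("FR", secret_key)
plain = hashlib.sha256(b"FR").hexdigest()

print(a == b)      # True: still deterministic, so counts and joins work
print(a == plain)  # False: a precomputed table of plain hashes no longer matches
```

Every file in a set must use the same key if the hashed columns are meant to join.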

What if the recipient says “just send me the raw data, I’ll be careful”?

The recipient being careful is not the threat model. The threat model is: their laptop is lost, their email is phished, their cloud drive is misconfigured, they leave the engagement and the file stays on a personal device. Anonymization is defense in depth against incidents that are not the recipient’s fault.

How do I anonymize files that need to be re-joined later?

Use the same hash function (and the same salt, if you use one) across every file in the set. As long as the recipient uses the hashed columns to join (not the original ones), the joins behave identically. Apply the same anonymization configuration to every file, and the relationships are preserved without the originals being exposed.
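Here is a small sketch of two files that still join after the ID column is hashed the same way in both. The account IDs, column names, and helper names are invented for the example.

```python
import hashlib


def hash_id(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def anonymize(rows: list[dict]) -> list[dict]:
    """Replace account_id with its hash, leaving other columns as they are."""
    return [{**r, "account_id": hash_id(r["account_id"])} for r in rows]


accounts = [
    {"account_id": "A-1001", "segment": "Enterprise"},
    {"account_id": "A-1002", "segment": "SMB"},
]
deals = [
    {"account_id": "A-1002", "amount_band": "<$10k"},
    {"account_id": "A-1001", "amount_band": "$10k to $50k"},
    {"account_id": "A-1001", "amount_band": "<$10k"},
]

shared_accounts = anonymize(accounts)
shared_deals = anonymize(deals)

# The recipient joins on the hashed column, never on the original IDs.
segment_by_hash = {a["account_id"]: a["segment"] for a in shared_accounts}
joined = [{**d, "segment": segment_by_hash[d["account_id"]]} for d in shared_deals]

print([j["segment"] for j in joined])  # ['SMB', 'Enterprise', 'Enterprise']
```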

Does “anonymous” under GDPR mean the same thing as “anonymized” in this article?

Not quite. Under GDPR, “anonymous” data is data that cannot be re-identified by any reasonable means, and it falls outside the scope of the regulation. Most practical anonymization (hashing IDs, faking names) is technically pseudonymization, which is still in scope, because re-identification is possible if you have access to other data. For more on the regulatory side, see GDPR and CSV exports: what you can and can’t share with a third party.
