Redact, Hash, or Fake: Which Anonymization Strategy to Use for Each Column
A practical decision guide for anonymizing CSV columns. When to redact, when to hash, when to replace with fake data, and why picking the wrong strategy either leaks information or destroys the analysis.
You have a CSV with sensitive data. You need to share it with someone outside your team. You already know “anonymize it before sending” is the right answer. The harder question is: which kind of anonymization, for which column?
The three main options (redact, hash, replace with fake data) all sound like they do the same thing. They do not. Picking the wrong one for a given column either leaks information you thought you protected, or destroys the analysis the recipient is trying to do.
This guide is a practical decision tree. For each common column type in a typical RevOps, sales, or marketing CSV, here is which strategy fits and why.
What each strategy actually does
Before walking through columns, here is what the three core strategies mean in practice.
Redaction replaces the value with a blank or a fixed placeholder like REDACTED or xxxx. Every row in the column now looks the same. The column still exists in the schema, but the values carry no information.
When to reach for it: when the column is genuinely not needed by the recipient, and you do not want them to see even the shape of the data.
Hashing runs each value through a one-way function (SHA-256, for example) and stores the fingerprint. The same input always produces the same output. Different inputs produce different outputs (collisions are possible in theory, but with SHA-256 you will not meet one in practice). You cannot reverse a hash to recover the original value.
When to reach for it: when the recipient needs to count unique values, deduplicate rows, or join two files on a column, but does not need to read the actual value.
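As a minimal sketch of that property, the snippet below hashes a small column with Python's standard `hashlib` and counts unique contacts from the fingerprints alone. The helper name `hash_value` and the sample addresses are illustrative assumptions, not part of any particular tool.

```python
import hashlib

def hash_value(value: str) -> str:
    """Return the SHA-256 hex digest of a value (UTF-8 encoded)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Invented sample column with one duplicate contact.
emails = ["ana@example.com", "bo@example.com", "ana@example.com"]
hashed = [hash_value(e) for e in emails]

# Same input, same fingerprint: duplicates are still detectable
# even though the original addresses are no longer readable.
unique_contacts = len(set(hashed))
```

The recipient can count and dedupe on `hashed` without ever seeing an address.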
Replacing with fake data substitutes each real value with a realistic-but-invented one. John Doe becomes Alex Rivera, and a real email address becomes an invented one. The column still has the shape and feel of real data, but none of it refers to anything real.
When to reach for it: when the recipient needs the file to look real (for a demo, a screenshot, a deck) or when realistic values make the analysis more useful than blank or hashed ones.
There are also a few secondary strategies you will see in the column-by-column rules below:
- Rounding or bucketing for numeric data: replace 47,283 with 47,000 or $10k to $50k. Preserves distribution, hides exact values.
- Date shifting: add or subtract the same number of days from every date in the file. Preserves intervals, breaks calendar alignment.
- Truncation: keep only the first two digits of a postcode, the year of a date, the country code of a phone number. Reduces precision, keeps some signal.
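The rounding, bucketing, and truncation rules above can be sketched in a few lines. This is an illustration under assumptions: the function names, the bucket edges beyond the two mentioned in the list, and the sample postcode are all hypothetical.

```python
from datetime import date

def round_to(value: float, step: int = 1000) -> int:
    """Round to the nearest multiple of `step`."""
    return int(round(value / step)) * step

def bucket(value: float) -> str:
    """Map an amount to a coarse range (edges are illustrative)."""
    if value < 10_000:
        return "<$10k"
    if value < 50_000:
        return "$10k to $50k"
    return ">=$50k"

def truncate_postcode(postcode: str, keep: int = 2) -> str:
    """Keep only the first `keep` characters of a postcode."""
    return postcode.strip()[:keep]

def year_only(d: date) -> int:
    """Reduce a date to its year."""
    return d.year

rounded = round_to(47_283)           # nearest thousand
band = bucket(47_283)                # coarse range instead of exact value
area = truncate_postcode("SW1A 1AA")
```

Each helper throws away precision on purpose; how much to discard depends on what the recipient needs to analyze.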
The decision: three questions per column
For any column, the strategy is determined by three questions:
- Does the recipient need to read this column’s values? If yes, you cannot redact or hash. You need fake data, or you need to keep it.
- Does the recipient need to count unique values, dedupe, or join on this column? If yes, you cannot fake data (because fakes are random per row and will not match across files). You need to hash, or to keep it.
- Is the column itself sensitive, or only the values? If only the values are sensitive and the column name and shape are fine to share, you transform the values. If the column itself reveals a confidential category, you drop it entirely.
Walking those three questions over every column gives you the strategy without much thinking. Here is what that looks like in practice.
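One way to read the three questions is as a small function. The sketch below is an interpretation, not a formal rule: the parameter names, the order of the checks, and the returned labels are assumptions made for illustration.

```python
def choose_strategy(needs_to_read: bool,
                    needs_to_match: bool,
                    column_is_sensitive: bool) -> str:
    """Pick a strategy for one column from the three questions."""
    # Question 3: the column's existence is itself confidential.
    if column_is_sensitive:
        return "drop"
    # Readable AND matchable rules out redact, hash, and fake,
    # so the only option left is keeping the real values.
    if needs_to_read and needs_to_match:
        return "keep"
    # Question 1: readable values rule out redaction and hashing.
    if needs_to_read:
        return "fake"
    # Question 2: counting, deduping, or joining rules out fakes.
    if needs_to_match:
        return "hash"
    return "redact"

plan = {
    "email": choose_strategy(False, True, False),
    "full_name": choose_strategy(False, False, False),
    "vip_customer_flag": choose_strategy(False, False, True),
}
```

A "keep" result is a prompt to reconsider whether the column should be shared at all, not permission to share it.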
Personal identifiers
Email addresses
What the recipient might want to do with it: count unique contacts, join two files on email, see realistic-looking data.
| Need | Strategy |
|---|---|
| Count or dedupe | Hash |
| Join with another file | Hash (same function in both files) |
| Show realistic data | Fake (an invented address) |
| None of the above | Redact |
One subtle thing: hashing only the local part and keeping the @acmecorp.com visible is not anonymization. The domain often identifies the company, which is usually the most sensitive part of a B2B record. If the domain matters, keep it; otherwise hash or fake the whole address.
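The difference is easy to see in code. The sketch below hashes the whole address after trimming and lowercasing it, and contrasts that with the local-part-only approach. The normalization step is an assumption added here so that case and whitespace variants of the same address still match; the sample address is invented.

```python
import hashlib

def hash_email(email: str) -> str:
    """Hash the whole address after trimming whitespace and lowercasing."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def hash_local_part_only(email: str) -> str:
    """The pattern to avoid: the domain stays readable."""
    local, _, domain = email.strip().lower().partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()
    return f"{digest[:12]}@{domain}"

a = hash_email("Jane.Doe@AcmeCorp.com ")
b = hash_email("jane.doe@acmecorp.com")
leaky = hash_local_part_only("jane.doe@acmecorp.com")
```

Here `a` and `b` are identical, so joins survive inconsistent capitalization, while `leaky` still ends in `@acmecorp.com` and discloses the company.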
Full names
| Need | Strategy |
|---|---|
| The file is a demo or screenshot | Fake |
| Analysis (counts, segments, ratios) | Redact |
| Join across files | Do not use names for joining; hash a stable ID instead |
Names rarely need to be hashed, because nobody reliably joins on Full Name (too many duplicates, variants, “John Smith” everywhere).
Phone numbers
| Need | Strategy |
|---|---|
| Detect duplicates | Hash |
| Realistic-looking file | Fake, using a reserved fictional format (+1-555-01XX) |
| Nothing specific | Redact |
If you fake phone numbers, use a clearly fictional range so the values cannot accidentally be dialed.
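A minimal sketch of such a generator, assuming the `+1-555-01XX` shape from the table above (555-0100 through 555-0199 is the block set aside for fictional use in North America). The function name and the seeded generator are illustrative choices.

```python
import random

def fake_phone(rng: random.Random) -> str:
    """Return a number in the fictional 555-01XX range."""
    return f"+1-555-01{rng.randint(0, 99):02d}"

rng = random.Random(7)  # seeded so the demo file is reproducible
phones = [fake_phone(rng) for _ in range(5)]
```

With only 100 possible values, repeats are expected, so this suits demos and screenshots rather than duplicate detection.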
Addresses
| Need | Strategy |
|---|---|
| Geographic analysis | Keep country and region, redact street and unit |
| Postcode-level analysis | Truncate postcode to the first two or three characters |
| Realistic demo data | Fake the full address |
| Nothing specific | Redact |
Full street addresses identify individuals. Country and region usually do not. Postcode falls in between, depending on the country (UK postcodes are extremely precise; US ZIPs less so).
National IDs, social security numbers, tax IDs
There is only one right answer: redact, or remove the column entirely. These should not be in a file you are sharing externally under any circumstance you are likely to encounter. If you genuinely need to show that two rows refer to the same individual, hash the value with a secret salt (these ID formats have a small enough value space that an unsalted hash can be brute-forced); never share a partial or formatted version.
Internal identifiers
Customer IDs, account IDs, deal IDs, lead IDs
This is the most commonly mishandled column. Internal IDs feel safe because they are short numeric or alphanumeric strings. They are not safe: anyone with access to your internal systems can map them back to real entities.
| Need | Strategy |
|---|---|
| Join across files | Hash, using the same function across all files in the set |
| Count unique entities | Hash |
| Look up in production | Should never be the goal of a shared file; do not share |
Hashing is almost always right for internal IDs. It preserves the joins (which is usually the whole reason the column is in the file) without exposing the values.
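To illustrate how the join survives, the sketch below anonymizes the ID column in two small files with the same function and then joins them on the hashed value. The file contents, column names, and helper names are invented for the example.

```python
import hashlib

def hash_id(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def anonymize(rows):
    """Replace account_id with its hash in every row."""
    return [{**row, "account_id": hash_id(row["account_id"])} for row in rows]

# Two hypothetical extracts that share an internal account ID.
deals = [
    {"account_id": "ACC-1001", "amount_band": "$10k to $50k"},
    {"account_id": "ACC-1002", "amount_band": "<$10k"},
]
tickets = [{"account_id": "ACC-1001", "priority": "high"}]

shared_deals = anonymize(deals)
shared_tickets = anonymize(tickets)

# The recipient joins on the hashed column only.
tickets_by_account = {t["account_id"]: t for t in shared_tickets}
joined = [
    {**d, **tickets_by_account[d["account_id"]]}
    for d in shared_deals
    if d["account_id"] in tickets_by_account
]
```

One caveat: short, sequential IDs form a small value space, so a plain hash of them can be guessed by enumeration. A keyed or salted hash is the safer choice when that matters.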
Employee or owner IDs
Treat the same as internal customer IDs. Hash if the recipient needs to see “deals per owner” or similar, redact otherwise.
Commercial-sensitive columns
Company names
In B2B data, this is often more sensitive than personal names. Your customer list is competitive intelligence.
| Need | Strategy |
|---|---|
| Realistic demo or report | Fake |
| Count unique companies | Hash |
| Distribution analysis (by segment, size) | Hash, or keep the segment column and redact the company column |
| Public communication | Keep, but only with explicit permission |
A common mistake: shuffling real customer names across rows. This is not anonymization. It still discloses the list of customers, just not which deal belongs to which. Use invented names, not shuffled real ones.
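The contrast can be sketched as follows. Shuffling leaves the set of names untouched, while a mapping to invented labels removes them. The company names below are fictional stand-ins, and the generic `Company 001` labels are an assumption; a fake-data generator could supply more realistic ones.

```python
import random

# Fictional stand-ins for a real customer column.
names = ["Acme Corp", "Globex", "Acme Corp", "Initech"]

def shuffle_names(values, seed=0):
    """The mistake: same names, different rows."""
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

def invent_names(values):
    """Map each distinct real name to a stable invented label."""
    mapping = {}
    for name in values:
        mapping.setdefault(name, f"Company {len(mapping) + 1:03d}")
    return [mapping[name] for name in values]

shuffled = shuffle_names(names)
invented = invent_names(names)
```

`set(shuffled)` equals `set(names)`, so the customer list is fully disclosed. `invented` shares no value with the original, and because the mapping is stable within the file, repeated companies still line up.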
Industry, segment, size band
| Need | Strategy |
|---|---|
| Almost any analysis | Keep |
| Confidential segmentation taxonomy | Redact or remap to generic labels |
These are categorical, low-cardinality, and rarely identifying. Keep them unless your segmentation itself is confidential (some companies treat their segment definitions as competitive intel).
Deal value, revenue, contract amount
Hashing financial numbers is almost always wrong: it destroys ordering and arithmetic, which breaks every interesting analysis. Use bucketing or rounding instead.
| Need | Strategy |
|---|---|
| Distribution analysis | Round to nearest 1,000 or 10,000 |
| Segment comparison | Bucket into ranges (<$10k, $10k to $50k, etc.) |
| Preserve relative scale | Multiply every value by the same random factor (0.7 to 1.3) |
| Recipient does not need numbers | Redact |
Be careful about combining low-cardinality dimensions with exact deal values. A unique deal value at a known close date is often enough to identify the customer.
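For the "preserve relative scale" row in the table above, a minimal sketch looks like this. The function name, the seed, and the sample amounts are assumptions; the 0.7 to 1.3 range comes from the table.

```python
import random

def scale_all(values, rng):
    """Multiply every value by one shared random factor."""
    factor = rng.uniform(0.7, 1.3)  # drawn once for the whole column
    return [round(v * factor, 2) for v in values], factor

values = [12_000.0, 48_000.0, 96_000.0]
scaled, factor = scale_all(values, random.Random(42))

# Ordering and ratios survive; absolute amounts do not.
ratio_before = values[1] / values[0]
ratio_after = scaled[1] / scaled[0]
```

Note the limit of this approach: a recipient who knows the true value of any single row can recover the factor and undo the scaling for the entire column, so discard the factor and prefer bucketing when that risk is real.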
Discount, margin, cost columns
Same as deal value: round or bucket. These are often even more sensitive than revenue, because margin reveals pricing strategy and unit economics.
Time columns
Created date, modified date, closed date
Dates by themselves are rarely sensitive. The combination of a date and another column (a deal value, a customer ID) often is, because the combination can be unique.
| Need | Strategy |
|---|---|
| Time-series analysis | Keep, if the rest of the file is well anonymized |
| Break calendar alignment | Shift every date in the file by the same number of days |
| Reduce precision | Truncate to month or quarter |
| Not needed | Drop |
The key thing if you shift dates: apply the same shift to every date column in the file. If created_date shifts by 90 days and closed_date shifts by 60 days, the deal length (closed minus created) is corrupted, and any duration-based analysis is broken.
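A short sketch shows why a single offset matters. The row, column names, and the 90-day offset are illustrative.

```python
from datetime import date, timedelta

def shift_row(row, date_columns, offset_days):
    """Shift every listed date column by the same number of days."""
    shifted = dict(row)
    for col in date_columns:
        shifted[col] = row[col] + timedelta(days=offset_days)
    return shifted

deal = {"created_date": date(2026, 1, 10), "closed_date": date(2026, 3, 1)}
shifted = shift_row(deal, ["created_date", "closed_date"], offset_days=90)

original_length = (deal["closed_date"] - deal["created_date"]).days
shifted_length = (shifted["closed_date"] - shifted["created_date"]).days
```

Both lengths come out equal, because the offset cancels in the subtraction. With different offsets per column, the difference between the two offsets would be added to every deal length.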
Timestamps (login, last activity, event time)
Treat the same as dates, but consider truncating to the hour or the day. Second-level precision on a login event combined with even a hashed user ID is more identifying than people expect.
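Truncation is a one-liner with the standard `datetime` module. The helper names and the sample timestamp are illustrative.

```python
from datetime import datetime

def truncate_to_hour(ts: datetime) -> datetime:
    """Drop minutes, seconds, and microseconds."""
    return ts.replace(minute=0, second=0, microsecond=0)

def truncate_to_day(ts: datetime) -> datetime:
    """Drop the time of day entirely."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

login = datetime(2026, 3, 14, 9, 26, 53)
by_hour = truncate_to_hour(login)
by_day = truncate_to_day(login)
```

After truncation, many events share the same value, which makes a single login much harder to match against an external log.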
Free-text columns
Notes, descriptions, comments, transcripts
There is essentially one strategy: drop the column unless you have manually reviewed every row.
Free text contains anything: real names, real email addresses, real account details, real complaints, real internal-only context. No automatic anonymization can reliably catch all of it. Pattern-matching for emails and phone numbers is necessary but not sufficient.
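A small sketch makes the gap visible. The two patterns below are deliberately simple assumptions, not production-grade detectors, and the note is invented.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Mask anything that looks like an email address or phone number."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

note = "Spoke to Maria at Globex, reach her at maria@example.com or +1-555-0142."
cleaned = scrub(note)
# cleaned == "Spoke to Maria at Globex, reach her at [EMAIL] or [PHONE]."
```

The address and number are masked, but the person's name and the company name pass straight through, which is exactly the kind of leak pattern matching cannot catch.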
If the recipient genuinely needs the notes, you have to read every row and rewrite anything sensitive by hand. That is expensive, which is the right signal: free-text columns should be excluded from external shares by default.
Quick reference table
| Column type | Default strategy | When to deviate |
|---|---|---|
| Email | Hash | Fake for demos, redact if not needed |
| Full name | Redact | Fake for demos |
| Phone | Hash | Fake for demos, redact if not needed |
| Street address | Redact | Fake for demos; keep country and region |
| National ID, SSN | Redact (or remove column) | Salted hash only if rows must be matched; never share partial values |
| Internal customer / account ID | Hash | Redact if no joins or counts are needed |
| Owner / employee ID | Hash | Redact otherwise |
| Company name | Fake | Hash for counts; redact when not needed |
| Industry, segment | Keep | Redact only if taxonomy is confidential |
| Deal value | Round or bucket | Redact if not needed; never hash |
| Discount, margin | Bucket | Redact when especially sensitive |
| Created / closed date | Shift all dates by same amount | Truncate to month for less precision |
| Timestamps | Truncate to day or hour | Same as dates |
| Free text | Drop the column | Manual review only when essential |
Common mistakes to avoid
Mistake 1: applying the same strategy to every column. “I’ll just hash everything” sounds safe but destroys the analysis. “I’ll fake everything” breaks joins and unique counts. Each column needs its own answer.
Mistake 2: shuffling values within a column. Taking the real customer names and randomly reassigning them to rows is not anonymization. The list of customers is still disclosed, only the row-level mapping is changed. Use invented names, not shuffled real ones.
Mistake 3: hashing with a guessable algorithm and small value space. Hashing a boolean column, or a column with 50 possible values, with a public algorithm is reversible by anyone with five minutes and a script. They precompute the hashes of every possible value and look up the matches. If the value space is small, do not hash; redact or generalize.
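The attack described in Mistake 3 takes only a few lines. In this sketch the column holds country codes; the candidate list and sample values are illustrative.

```python
import hashlib

def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# A "hashed" low-cardinality column as the recipient would see it.
shared_column = [sha256_hex(v) for v in ["FR", "DE", "FR", "US"]]

# The recipient precomputes the hash of every plausible value...
candidates = ["FR", "DE", "US", "GB", "ES", "IT"]
lookup = {sha256_hex(c): c for c in candidates}

# ...and reads the originals straight back.
recovered = [lookup.get(h, "?") for h in shared_column]
```

`recovered` matches the original column exactly. Nothing was "broken" in SHA-256; the value space was simply small enough to enumerate.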
Mistake 4: leaving combinations of “safe” columns that re-identify rows. A row with country = France, industry = Pharma, deal value = $1,247,392, closed in March 2026 may be globally unique, even though no single column on its own is sensitive. Watch for combinations, not just columns.
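A quick way to look for the problem in Mistake 4 is to count how often each combination of "safe" columns occurs and flag the rows whose combination appears only once. The function name, column names, and rows below are hypothetical.

```python
from collections import Counter

def unique_combinations(rows, columns):
    """Return the rows whose combination of `columns` occurs exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] == 1]

rows = [
    {"country": "France", "industry": "Pharma", "close_month": "2026-03"},
    {"country": "France", "industry": "Retail", "close_month": "2026-03"},
    {"country": "France", "industry": "Retail", "close_month": "2026-03"},
]
at_risk = unique_combinations(rows, ["country", "industry", "close_month"])
```

Rows in `at_risk` are candidates for further generalization (coarser dates, wider bands) or removal. Uniqueness within the file is only a rough proxy for uniqueness in the real world.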
Mistake 5: forgetting that the schema itself leaks information. A column called vip_customer_flag or internal_credit_risk_score reveals something even if every value is redacted. If the column name is sensitive, rename it before sharing or drop it entirely.
Where Asphorem fits in
The Data Anonymizer implements the column-by-column model in this article. Upload a CSV, see every column with sample values, pick a strategy per column (redact, hash, replace with fake data, round, bucket, shift), and download the result.
The whole transformation runs in your browser. The file is not uploaded to any server, which is the only sensible default when the entire point is to control where the data goes.
Anonymization configurations are saved per file shape, so the next time you need to send an updated extract to the same recipient, you re-apply the same per-column rules in one click. The decisions you make once here become a workflow.
If you also need to standardize the file before anonymizing (clean column names, normalize picklist values, fix date formats), the CSV Normalizer handles that side, and the two tools share the same client-side model.
Frequently asked questions
Is hashing the same as encryption?
No. Encryption is reversible if you have the key; hashing is one-way and cannot be reversed even by the person who applied it. For anonymization, you want one-way (no key to leak, no way for the recipient to ever recover the originals). Hashing is also deterministic: the same input always produces the same output, which is why two files hashed the same way join cleanly on the hashed column.
Can a hashed column be re-identified?
If the column has a small value space (booleans, country codes, fewer than a few thousand possible values), yes. A determined recipient can hash every possible value and reverse the lookup. For high-cardinality columns like emails or customer IDs, the practical re-identification risk is much lower, but it is not zero, especially if the recipient already knows some of the values. For strict cases, use a salted hash with a salt the recipient does not have.
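One common way to get a salted hash with a secret the recipient does not hold is a keyed hash (HMAC), available in Python's standard library. This is a sketch of that technique under assumptions, not a description of how any specific tool does it; the function and variable names are illustrative.

```python
import hashlib
import hmac
import secrets

def keyed_hash(value: str, key: bytes) -> str:
    """HMAC-SHA256 of a value under a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# The key stays with the sender and is never shipped with the file.
key = secrets.token_bytes(32)

plain = hashlib.sha256(b"FR").hexdigest()  # guessable by enumeration
keyed = keyed_hash("FR", key)              # not guessable without the key
same_again = keyed_hash("FR", key)         # still deterministic for joins
```

Without the key, precomputing hashes of candidate values no longer works. The same key must be reused across every file in a set for joins to hold, and losing it means the files can never be re-joined with later extracts.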
What if the recipient says “just send me the raw data, I’ll be careful”?
The recipient being careful is not the threat model. The threat model is: their laptop is lost, their email is phished, their cloud drive is misconfigured, they leave the engagement and the file stays on a personal device. Anonymization is defense in depth against incidents that are not the recipient’s fault.
How do I anonymize files that need to be re-joined later?
Use a consistent hash function across every file in the set. As long as the recipient uses the hashed columns to join (not the original ones), the joins behave identically. Apply the same anonymization configuration to every file, and the relationships are preserved without the originals being exposed.
Does “anonymous” under GDPR mean the same thing as “anonymized” in this article?
Not quite. Under GDPR, “anonymous” data is data that cannot be re-identified by any reasonable means, and it falls outside the scope of the regulation. Most practical anonymization (hashing IDs, faking names) is technically pseudonymization, which is still in scope, because re-identification is possible if you have access to other data. For more on the regulatory side, see GDPR and CSV exports: what you can and can’t share with a third party.