Merging 8,600 Records Into 3,100 Usable Contacts with AI Data Cleanup
Three Systems, One Customer, Zero Agreement
A Tampa staffing agency with 40 employees ran three systems for tracking clients: a CRM the sales team half-used, a spreadsheet the account managers maintained, and an email folder system the founder had built over 12 years. The same client might appear as "Johnson Manufacturing," "Johnson Mfg LLC," and "Bill Johnson - manufacturing" across the three systems.
The consequences showed up in embarrassing ways. A sales rep pitched a prospect who was already a client. An account manager sent a renewal offer at last year's rate because the CRM had outdated pricing. The founder missed a follow-up with a $200K annual contract because the reminder lived in a spreadsheet tab nobody checked.
They'd tried manual cleanup twice. Both times, a junior employee spent two weeks merging records, got 60% through the list, and quit from boredom. The data drifted back within three months.
Building an AI Data Reconciliation Pipeline
We built the solution in three stages: extraction, matching, and ongoing deduplication.
Stage one pulled all records from the CRM (4,200 contacts), the master spreadsheet (2,800 rows), and the founder's email contacts (1,600 entries). Total: 8,600 records representing an unknown number of actual unique businesses.
Stage two used fuzzy matching to identify duplicates. Simple string matching catches "Johnson Manufacturing" and "Johnson Manufacturing Inc." AI-powered matching catches "Bill J. - that manufacturing company in Brandon" in the spreadsheet notes and connects it to the formal CRM record. The system scored each potential match on a 0-100 confidence scale: exact matches merged automatically, probable matches (85-99% confidence) went to a review queue, and anything below that was flagged for a human decision.
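The tiered routing can be sketched with stdlib fuzzy matching. This is a minimal illustration, not the production matcher: the suffix list, normalization rules, and the exact 85/100 cutoffs are assumptions, and the real system layered AI scoring on top of string similarity.

```python
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to strip before comparing names
SUFFIXES = {"llc", "inc", "co", "corp", "ltd"}

def normalize(name: str) -> str:
    # Lowercase, drop trailing punctuation, strip common legal suffixes
    tokens = [t.strip(".,") for t in name.lower().split()]
    return " ".join(t for t in tokens if t not in SUFFIXES)

def match_confidence(a: str, b: str) -> float:
    # 0-100 similarity between two normalized company names
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() * 100

def route(score: float) -> str:
    # The three tiers described above; cutoffs are illustrative
    if score == 100:
        return "auto-merge"
    if score >= 85:
        return "review-queue"
    return "flag-for-human"

route(match_confidence("Johnson Manufacturing", "Johnson Manufacturing Inc."))
# -> "auto-merge" (names are identical after suffix stripping)
```

Pure string similarity alone would never connect "Bill J. - that manufacturing company in Brandon" to a CRM record, which is where the AI pass earned its keep.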
The AI analyzed phone numbers, email domains, addresses, contact names, and contextual notes. It learned that "Tammy Johnson" and "T. Johnson, VP" at the same company domain were the same person. It distinguished between "Bay Area Electric" in Tampa and "Bay Area Electric" in San Francisco by cross-referencing area codes and addresses.
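Multi-signal scoring is what separates the two "Bay Area Electric" records. A rough sketch of the idea, with field names and weights that are assumptions rather than the production model:

```python
def signal_score(a: dict, b: dict) -> int:
    # Accumulate weighted evidence that two records are the same business.
    # Weights are illustrative guesses, not the deployed values.
    score = 0
    if a.get("name") and a["name"].lower() == b.get("name", "").lower():
        score += 40
    if a.get("email_domain") and a["email_domain"] == b.get("email_domain"):
        score += 30
    if a.get("phone") and a["phone"][:3] == b.get("phone", "")[:3]:
        score += 15  # same area code
    if a.get("city") and a["city"].lower() == b.get("city", "").lower():
        score += 15
    return score

tampa = {"name": "Bay Area Electric", "email_domain": "baytampa.example",
         "phone": "813-555-0101", "city": "Tampa"}
sf = {"name": "Bay Area Electric", "email_domain": "baesf.example",
      "phone": "415-555-0188", "city": "San Francisco"}

signal_score(tampa, sf)  # -> 40: same name, but area code, domain, and city all disagree
```

A name-only matcher would have merged these two companies; the disagreeing signals keep the score well below any merge threshold.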
Results After the First Pass
8,600 records became 3,100 unique businesses. The other 5,500 records were duplicates, partial entries, or outdated contacts.
| Metric | Before | After |
|---|---|---|
| Total records | 8,600 | 3,100 unique |
| Duplicate rate | Unknown | 64% (measured during merge) |
| Time spent on data entry/week | ~6 hours | ~1.5 hours |
| Contact info accuracy | ~58% | 94% |
| Orphaned pipeline deals | 47 deals with no owner | 0 |
The 47 orphaned pipeline deals were the surprise. These were prospects where conversations had started but nobody followed up because the contact lived in a system nobody checked. The sales team recovered 12 of them in the first month, closing $31,000 in new contracts from leads they'd already generated but lost track of.
Ongoing Deduplication
Stage three was the piece that made the cleanup stick. We installed an AI-powered deduplication layer that runs every time a new contact is added. It checks the incoming record against existing entries, then merges it with a confident match, surfaces a borderline match for confirmation, or creates a new entry. No more drift.
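The insert-time check follows the same tiers as the cleanup pass. A minimal sketch, assuming a 0-100 scoring function and the same 85/100 cutoffs (the remembered-alias feature described below is not shown):

```python
def upsert_contact(new: dict, existing: list, score_fn) -> tuple:
    # Find the strongest existing match before creating a record.
    best, best_score = None, 0
    for rec in existing:
        s = score_fn(new, rec)
        if s > best_score:
            best, best_score = rec, s
    if best_score == 100:
        best.update({k: v for k, v in new.items() if v})  # silent merge
        return "merged", best
    if best_score >= 85:
        return "confirm", best  # surface "Is this the same company?"
    existing.append(new)
    return "created", new

# Usage with a naive name-equality scorer (stand-in for the AI scorer)
contacts = [{"name": "TGC LLC"}]
exact = lambda a, b: 100 if a["name"] == b["name"] else 0
upsert_contact({"name": "TGC LLC", "phone": "813-555-0199"}, contacts, exact)
# -> ("merged", ...) and contacts still holds one record
```

The key design point is that the check happens at write time, so duplicates are caught before they exist rather than swept up months later.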
When a sales rep enters "Tampa General Contractors" and the system already has "TGC LLC," it surfaces the match and asks: "Is this the same company?" One click to merge. The system remembers the association for next time.
Monthly data quality reports flag records with missing phone numbers, outdated email addresses (bounced in the last 90 days), or contacts who haven't been touched in over a year. The account team reviews the report in 20 minutes and updates the flagged records.
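The three flag rules are simple enough to express directly. A sketch assuming a unified record schema with `phone`, `last_bounce`, and `last_touched` fields (the field names are assumptions):

```python
from datetime import date

def quality_flags(record: dict, today: date) -> list:
    # Flag rules from the monthly report: missing phone,
    # email bounced in the last 90 days, untouched for over a year.
    flags = []
    if not record.get("phone"):
        flags.append("missing-phone")
    if record.get("last_bounce") and (today - record["last_bounce"]).days <= 90:
        flags.append("bounced-email")
    if record.get("last_touched") and (today - record["last_touched"]).days > 365:
        flags.append("stale-contact")
    return flags

quality_flags({"phone": "", "last_bounce": date(2024, 5, 1),
               "last_touched": date(2022, 1, 1)}, date(2024, 6, 1))
# -> ["missing-phone", "bounced-email", "stale-contact"]
```

Running rules like these over 3,100 records produces a short enough list that a 20-minute monthly review stays realistic.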
What Made This Work
Two decisions during planning shaped the outcome.
First, we prioritized the founder's email contacts as the richest data source, not the CRM. Counterintuitive, but the founder had 12 years of relationship context in email threads that the CRM had never captured. Notes like "met at Chamber event, interested in Q3" contained information no database field could hold. The AI extracted these details and attached them to the unified record.
Second, we built the review queue for borderline matches instead of auto-merging everything. This cost an extra 4 hours of staff review in week one, but it prevented the system from merging two different "Bay Area Electric" companies into one record. Trust in the system meant people actually used it.
Cost and Timeline
Setup took four weeks. Week one: data export and schema mapping across all three systems. Week two: matching algorithm training on a sample of 500 known duplicates. Week three: full merge with human review queue. Week four: ongoing deduplication integration and staff training.
Build cost: $9,500. Monthly operating cost: $180 (API usage for real-time deduplication). Annual savings: roughly $14,000 in recovered labor plus $31,000 in recovered pipeline from the first pass. The ongoing deduplication prevents the data from degrading again, which is worth more than any single cleanup sprint.