CarneiroTech/Content/Cases/en/cnpj-migration-database.md


---
title: Alphanumeric CNPJ Migration - 100 Million Records
slug: cnpj-migration-database
summary: Execution of a massive CNPJ migration from numeric to alphanumeric in a database with ~100M records, using a phased-commit strategy to avoid database locks.
client: Collection Agency
industry: Collections & Financial Services
timeline: In execution
role: Database Architect & Tech Lead
image:
tags:
  - SQL Server
  - Database Migration
  - CNPJ
  - Performance Optimization
  - Batch Processing
  - Big Data
featured: true
order: 4
date: 2024-11-01
seo_title: Alphanumeric CNPJ Migration - 100M Records | Carneiro Tech
seo_description: Case study of a massive CNPJ migration in a database with 100 million records using phased commits and performance optimizations.
seo_keywords: database migration, SQL Server, CNPJ, batch processing, performance optimization, phased commits
---

Overview

A collection agency that works with databases of transitory data (no proprietary software) needs to adapt its systems to the new Brazilian alphanumeric CNPJ format.

Main challenge: Migrate ~100 million records, converting BIGINT and NUMERIC CNPJ columns to VARCHAR, without locking the production database.

Status: Project in execution (migration script preparation).
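For reference, the new alphanumeric format keeps 14 positions: a 12-character base (which may contain letters) plus two numeric check digits. Per the published rules, each character is mapped to its ASCII code minus 48 before the usual mod-11 weighting, so the algorithm reduces to the classic one for all-numeric CNPJs. A minimal Python sketch of the check-digit calculation (illustrative only, not part of the project's migration scripts):

```python
# Sketch: check-digit calculation for the alphanumeric CNPJ format.
# Assumption (per the published format rules): each character maps to its
# ASCII code minus 48, then the traditional mod-11 weights apply, so for
# all-digit inputs this matches the classic numeric algorithm exactly.

def cnpj_check_digits(base12: str) -> str:
    """Compute the two check digits for a 12-character CNPJ base."""
    def dv(chars: str, weights: list) -> str:
        total = sum((ord(c) - 48) * w for c, w in zip(chars, weights))
        r = total % 11
        return "0" if r < 2 else str(11 - r)

    d1 = dv(base12, [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2])
    d2 = dv(base12 + d1, [6, 5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2])
    return d1 + d2

# Classic numeric example: 11.222.333/0001-81
print(cnpj_check_digits("112223330001"))  # -> "81"
```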


Challenge

Massive Data Volume

Company context:

  • Collection agency (does not develop proprietary software)
  • Works with transitory data (high turnover)
  • SQL Server database with critical volume

Initial analysis revealed:

| Table | Column | Current Type | Records | Size |
|---|---|---|---|---|
| Debtors | CNPJ_Debtor | BIGINT | 8,000,000 | 60 GB |
| Transactions | CNPJ_Payer | NUMERIC(14) | 90,000,000 | 1.2 TB |
| Companies | CNPJ_Company | BIGINT | 2,500,000 | 18 GB |
| TOTAL | - | - | ~100,000,000 | ~1.3 TB |

Identified problems:

  1. Tables with 8M+ rows using BIGINT for CNPJ
  2. 90 million records in transactions table
  3. CNPJ as primary key in some tables
  4. Foreign keys relating multiple tables
  5. Extended downtime is not an option (24/7 operation)
  6. Disk space restrictions (requires efficient strategy)

Strategic Decision: Phased Commits

Why NOT do ALTER COLUMN directly?

Naive approach (DOESN'T work):

-- NEVER DO THIS ON LARGE TABLES
ALTER TABLE Transactions
ALTER COLUMN CNPJ_Payer VARCHAR(18);

Problems:

  • Locks entire table during conversion
  • Can take hours/days on large tables
  • Blocks all operations (INSERT, UPDATE, SELECT)
  • Risk of timeout or failure mid-operation
  • Complex rollback if something goes wrong

Chosen Strategy: Column Swap with Phased Commits

Based on previous experience, I decided to use a gradual approach:

┌─────────────────────────────────────────────┐
│  1. Create new VARCHAR column at END        │
│     (fast operation, doesn't lock table)    │
└─────────────────────────────────────────────┘
                    ▼
┌─────────────────────────────────────────────┐
│  2. UPDATE in batches (phased commits)      │
│     - 100k records at a time                │
│     - Pause between batches (avoid lock)    │
└─────────────────────────────────────────────┘
                    ▼
┌─────────────────────────────────────────────┐
│  3. Remove PKs and FKs                      │
│     (after 100% migrated)                   │
└─────────────────────────────────────────────┘
                    ▼
┌─────────────────────────────────────────────┐
│  4. Rename columns (swap)                   │
│     - CNPJ → CNPJ_Old                       │
│     - CNPJ_New → CNPJ                       │
└─────────────────────────────────────────────┘
                    ▼
┌─────────────────────────────────────────────┐
│  5. Recreate PKs/FKs with new column        │
└─────────────────────────────────────────────┘
                    ▼
┌─────────────────────────────────────────────┐
│  6. Validation and old column deletion      │
└─────────────────────────────────────────────┘

Why this approach?

  • No complete table lock (incremental operation)
  • Can pause/resume at any time
  • Real-time progress monitoring
  • Simple rollback (just drop the new column)
  • Minimizes production impact (small commits)

Decision based on:

  • Previous experience with large volume migrations
  • Knowledge of SQL Server locks
  • Need for zero downtime

Note: This decision was made without consulting AI - based purely on practical experience from previous projects.


Implementation Details

Phase 1: Create New Column

-- Fast operation (metadata change only)
ALTER TABLE Transactions
ADD CNPJ_Payer_New VARCHAR(18) NULL;

-- Temporary filtered index to speed up finding unmigrated rows
CREATE NONCLUSTERED INDEX IX_Temp_CNPJ_New
ON Transactions(CNPJ_Payer_New)
WHERE CNPJ_Payer_New IS NULL;

Estimated time: ~1 second (independent of table size)


Phase 2: Batch Migration (Core Strategy)

-- Migration script with phased commits
DECLARE @BatchSize INT = 100000;  -- 100k records per batch
DECLARE @RowsAffected INT = 1;
DECLARE @TotalProcessed INT = 0;
DECLARE @StartTime DATETIME = GETDATE();

WHILE @RowsAffected > 0
BEGIN
    BEGIN TRANSACTION;

    -- Update batch of 100k records not yet migrated
    UPDATE TOP (@BatchSize) Transactions
    SET CNPJ_Payer_New = RIGHT('00000000000000' + CAST(CNPJ_Payer AS VARCHAR(14)), 14)
    WHERE CNPJ_Payer_New IS NULL;

    SET @RowsAffected = @@ROWCOUNT;
    SET @TotalProcessed = @TotalProcessed + @RowsAffected;

    COMMIT TRANSACTION;

    -- Progress log
    PRINT 'Processed: ' + CAST(@TotalProcessed AS VARCHAR) + ' rows. Batch: ' + CAST(@RowsAffected AS VARCHAR);
    PRINT 'Elapsed time: ' + CAST(DATEDIFF(SECOND, @StartTime, GETDATE()) AS VARCHAR) + ' seconds';

    -- Pause between batches (reduces contention)
    WAITFOR DELAY '00:00:01';  -- 1 second between batches
END;

PRINT 'Migration completed! Total rows: ' + CAST(@TotalProcessed AS VARCHAR);
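The RIGHT('00000000000000' + CAST(...), 14) expression zero-pads the numeric CNPJ to its 14-character string form. The same conversion, expressed in Python for clarity (the function name is illustrative):

```python
def numeric_cnpj_to_varchar(cnpj: int) -> str:
    """Zero-pad a numeric CNPJ to its 14-character string form,
    mirroring the T-SQL RIGHT('00000000000000' + CAST(...), 14) expression."""
    return str(cnpj).zfill(14)

print(numeric_cnpj_to_varchar(191))             # -> "00000000000191"
print(numeric_cnpj_to_varchar(11222333000181))  # -> "11222333000181"
```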

Configurable parameters:

  • @BatchSize: 100k (balanced between performance and lock time)
    • Too small = many transactions, overhead
    • Too large = prolonged lock, production impact
  • WAITFOR DELAY: 1 second (gives time for other queries to run)

Time estimates:

| Records | Batch Size | Estimated Time |
|---|---|---|
| 8,000,000 | 100,000 | ~2-3 hours |
| 90,000,000 | 100,000 | ~20-24 hours |

Advantages:

  • Doesn't freeze application
  • Other queries can run between batches
  • Can pause (Ctrl+C) and resume later (the `WHERE ... IS NULL` filter picks up where it left off)
  • Real-time progress log
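The loop is resumable precisely because of the `WHERE ... IS NULL` filter: already-migrated rows are never touched again. A minimal Python simulation (an in-memory list stands in for the table; no real database involved) illustrates the mechanics:

```python
# Simulation of the phased-commit loop against an in-memory "table".
# None marks an unmigrated row, mirroring CNPJ_Payer_New IS NULL.

def migrate_in_batches(rows: list, batch_size: int) -> int:
    """Migrate rows whose new value is None, batch_size at a time.
    Returns the number of batches executed (one commit per batch)."""
    batches = 0
    while True:
        # "UPDATE TOP (@BatchSize) ... WHERE CNPJ_Payer_New IS NULL"
        pending = [i for i, (old, new) in enumerate(rows) if new is None][:batch_size]
        if not pending:
            break
        for i in pending:
            old, _ = rows[i]
            rows[i] = (old, str(old).zfill(14))  # the conversion itself
        batches += 1
    return batches

rows = [(n, None) for n in range(250)]
print(migrate_in_batches(rows, 100))  # -> 3 (batches of 100 + 100 + 50)
# Interrupt-and-resume: a second run finds nothing pending.
print(migrate_in_batches(rows, 100))  # -> 0
```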

Phase 3: Constraint Removal

-- Identify PK constraints on the column (FKs can be listed via sys.foreign_keys)
SELECT kc.name
FROM sys.key_constraints kc
JOIN sys.index_columns ic
    ON ic.object_id = kc.parent_object_id
   AND ic.index_id = kc.unique_index_id
WHERE kc.type = 'PK'
  AND kc.parent_object_id = OBJECT_ID('Transactions')
  AND COL_NAME(ic.object_id, ic.column_id) = 'CNPJ_Payer';

-- Remove PKs
ALTER TABLE Transactions
DROP CONSTRAINT PK_Transactions_CNPJ;

-- Remove FKs (tables that reference)
ALTER TABLE Payments
DROP CONSTRAINT FK_Payments_Transactions;

Estimated time: ~10 minutes (depends on how many constraints exist)


Phase 4: Column Swap (Renaming)

-- Rename old column to _Old
EXEC sp_rename 'Transactions.CNPJ_Payer', 'CNPJ_Payer_Old', 'COLUMN';

-- Rename new column to original name
EXEC sp_rename 'Transactions.CNPJ_Payer_New', 'CNPJ_Payer', 'COLUMN';

-- Change to NOT NULL (after validating 100% populated)
ALTER TABLE Transactions
ALTER COLUMN CNPJ_Payer VARCHAR(18) NOT NULL;

Estimated time: ~1 second (metadata change)


Phase 5: Constraint Recreation

-- Recreate PK with new VARCHAR column
ALTER TABLE Transactions
ADD CONSTRAINT PK_Transactions_CNPJ
PRIMARY KEY CLUSTERED (CNPJ_Payer);

-- Recreate FKs
ALTER TABLE Payments
ADD CONSTRAINT FK_Payments_Transactions
FOREIGN KEY (CNPJ_Payer) REFERENCES Transactions(CNPJ_Payer);

Estimated time: ~30-60 minutes (depends on volume)


Phase 6: Validation and Cleanup

-- Validate that 100% was migrated
SELECT COUNT(*)
FROM Transactions
WHERE CNPJ_Payer IS NULL OR CNPJ_Payer = '';

-- Validate referential integrity
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;

-- If everything OK, remove old column
ALTER TABLE Transactions
DROP COLUMN CNPJ_Payer_Old;

-- Remove temporary index
DROP INDEX IX_Temp_CNPJ_New ON Transactions;

CNPJ Fast Process Customization

Differences vs. Original Process

The original CNPJ Fast process was restructured for this client:

Main changes:

| Aspect | Original CNPJ Fast | Client (Customized) |
|---|---|---|
| Focus | Applications + DB | DB only (no proprietary software) |
| Discovery | App inventory | Schema analysis only |
| Execution | Multiple applications | Massive SQL scripts |
| Batch Size | 50k-100k | 100k (optimized for volume) |
| Monitoring | Manual + tools | Real-time SQL logs |
| Rollback | Complex process | Simple (DROP COLUMN) |

Reason for restructuring:

  • Client has no proprietary applications (only consumes data)
  • 100% focus on database optimization
  • Much larger volume than typical cases (100M vs ~10M)

Tech Stack

  • SQL Server
  • T-SQL
  • Batch Processing
  • Performance Tuning
  • Database Optimization
  • Migration Scripts
  • Phased Commits
  • Index Optimization
  • Constraint Management


Key Decisions & Trade-offs

Why 100k per batch?

Performance tests:

| Batch Size | Time/Batch | Lock Duration | Contention |
|---|---|---|---|
| 10,000 | 2s | Low | Minimal |
| 50,000 | 8s | Medium | Acceptable |
| 100,000 | 15s | Medium | Balanced |
| 500,000 | 90s | High | Production impact |
| 1,000,000 | 180s | Very high | Unacceptable |

Choice: 100k offers best balance between performance and impact.
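As a sanity check, the raw loop time for any batch size follows from simple arithmetic (the production estimate of ~20-24 h for 90M rows is higher because it also absorbs index maintenance, transaction-log growth, and contention pauses). A small Python sketch using the measured per-batch times from the table:

```python
import math

def total_hours(records: int, batch_size: int, secs_per_batch: float,
                delay: float = 1.0) -> float:
    """Raw wall-clock estimate: number of batches times (batch time + delay)."""
    batches = math.ceil(records / batch_size)
    return batches * (secs_per_batch + delay) / 3600

# 90M rows in 100k batches at the measured 15 s/batch:
print(total_hours(90_000_000, 100_000, 15))  # -> 4.0 (raw loop time only)
# The 1 s inter-batch delay costs 1/(15+1) ≈ 6% of that wall-clock time.
```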


Why create column at END?

SQL Server internals:

  • Add column at end = metadata change (fast)
  • Add in middle = page rewrite (slow)
  • For large tables, position matters

Why WAITFOR DELAY of 1 second?

Without delay:

  • Batch processing consumes 100% of I/O
  • Application queries slow down
  • Lock escalation may occur

With 1s delay:

  • Other queries have window to execute
  • Distributed I/O
  • User experience preserved

Trade-off: The delay adds ~1s to each ~15s batch (~7% more wall-clock time), but the system remains responsive.


Current Status & Next Steps

Current Status (December 2024)

Preparation Phase:

  • Discovery complete (100M records identified)
  • Migration scripts developed
  • Tests in staging environment
  • Performance validation in progress
  • Awaiting production maintenance window

Next Steps

  1. Complete production backup
  2. Production execution (24/7 environment)
  3. Real-time monitoring during migration
  4. Post-migration validation (integrity, performance)
  5. Lessons learned documentation

Lessons Learned (So Far)

1. Previous Experience is Gold

Decision to use phased commits came from practical experience in previous projects, not from documentation or AI.

Similar previous situations:

  • E-commerce data migration (50M records)
  • Encoding conversion (UTF-8 in 100M+ rows)
  • Historical table partitioning

2. "Measure Twice, Cut Once"

Before executing in production:

  • Exhaustive tests in staging
  • Scripts validated and reviewed
  • Rollback tested
  • Time estimates confirmed

Preparation time: 3 weeks
Execution time: estimated at 48 hours

Ratio: 10:1 (preparation vs execution)


3. Customization > One-Size-Fits-All

The original CNPJ Fast process needed to be restructured for this client.

Lesson: Processes should be:

  • Structured enough to repeat
  • Flexible enough to adapt

4. Monitoring is Crucial

Scripts with detailed progress logs allow:

  • Estimate remaining time
  • Identify bottlenecks
  • Pause/resume with confidence
  • Report status to stakeholders

-- Log example
Processed: 10,000,000 rows. Batch: 100,000
Elapsed time: 3600 seconds (10% complete, ~9h remaining)
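The "~9h remaining" figure is a linear extrapolation from the rows processed so far. A helper like this (illustrative, not part of the production script) reproduces the estimate:

```python
def eta_seconds(processed: int, total: int, elapsed_secs: float) -> float:
    """Linear extrapolation: time for the remaining rows at the observed rate."""
    return (total - processed) * elapsed_secs / processed

# 10M of 100M rows done in 3600 s -> 9 more hours at the same throughput
print(eta_seconds(10_000_000, 100_000_000, 3600) / 3600)  # -> 9.0
```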

Performance Optimizations

Optimizations Implemented

  1. Temporary index WHERE NULL

    • Speeds up lookup of unmigrated records
    • Removed after completion
  2. Optimized batch size

    • Balanced between performance and lock time
  3. Transaction log management

    -- Check log growth
    DBCC SQLPERF(LOGSPACE);
    
    -- Adjust recovery model (if allowed)
    ALTER DATABASE MyDatabase SET RECOVERY SIMPLE;
    
  4. Execution during low-load hours

    • Overnight maintenance window
    • Weekend (if possible)

Expected result: Migration of 100 million records in ~48 hours, without significant downtime and with possibility of fast rollback.

Need to migrate massive data volumes? Get in touch