CSV Column Auto-Recognition Patterns - Designing and Choosing Between Heuristic and LLM-Based Approaches

Tadashi Shigeoka · Mon, March 23, 2026

In e-commerce back-office development, CSV import functionality is unavoidable. Each data source — Shopify, Amazon, WooCommerce, or your own ERP — uses different column naming conventions. “product_code”, “item_code”, “SKU”, “article_number” — all refer to the same field, but unless the system can auto-recognize this, users must manually configure mappings every time.

This article walks through designing and implementing a system that auto-detects data types and column mappings from CSV headers, using two approaches:

  1. Heuristic approach: Deterministic matching with alias dictionaries and scoring
  2. LLM-based approach: Flexible inference powered by large language models

We present both implementations in TypeScript and provide guidance on when to use each, based on accuracy, speed, cost, and maintainability.

The Problem: CSV Imports in E-Commerce

A typical e-commerce back-office regularly imports CSVs of the following data types:

| Data Type | Typical Sources | Example Column Names |
|---|---|---|
| Product master | ERP, PIM | product_code, SKU, item_id |
| Orders | Marketplace API exports | order_no, order_id, purchase_number |
| Inventory | WMS, warehouse systems | stock_qty, available, quantity |
| Customers | CRM exports | customer_name, full_name, buyer |
| Shipments | Carrier CSVs | tracking_no, tracking_number, waybill |

The core problem is that column names differ across source systems for the same data type. Add in mixed languages (English/Japanese in our case), full-width/half-width characters, and parenthetical annotations, and the variation becomes enormous.
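For example, all of the following raw headers can plausibly denote the same product identifier field (the Japanese and full-width variants are illustrative):

product_code, 商品コード, SKU, ＳＫＵ, item_code (SKU), article_number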

Approach 1: Heuristic-Based Column Recognition

Design Philosophy

The core idea is to pre-define known system profiles — field definitions with alias dictionaries for each data type — and score how well input CSV headers match each profile.

The processing flow has three steps:

graph LR
    A[CSV Headers] --> B[Normalize]
    B --> C[Score Against Profiles]
    C --> D[Select Best Match]
    D --> E[Mapping Result]

Step 1: Define Known System Profiles

// known-ec-systems.ts
 
export type EcDataType =
  | "product_master"
  | "orders"
  | "inventory"
  | "customers"
  | "shipments";
 
export type KnownSystemField = {
  targetField: string;
  aliases: string[];
};
 
export type KnownSystemProfile = {
  dataType: EcDataType;
  systemName: string;
  fields: KnownSystemField[];
};
 
export const KNOWN_SYSTEM_PROFILES: readonly KnownSystemProfile[] = [
  {
    dataType: "product_master",
    systemName: "Product Master",
    fields: [
      {
        targetField: "product_id",
        aliases: [
          "product code",
          "product id",
          "SKU",
          "item code",
          "article number",
          "JAN",
          "EAN",
          "UPC",
          "ASIN",
        ],
      },
      {
        targetField: "product_name",
        aliases: [
          "product name",
          "item name",
          "name",
          "title",
          "product title",
        ],
      },
      {
        targetField: "price",
        aliases: [
          "price",
          "unit price",
          "selling price",
          "retail price",
          "msrp",
          "list price",
        ],
      },
      {
        targetField: "category",
        aliases: [
          "category",
          "product category",
          "genre",
          "classification",
          "department",
        ],
      },
      {
        targetField: "description",
        aliases: [
          "description",
          "product description",
          "detail",
          "long description",
        ],
      },
      {
        targetField: "stock_quantity",
        aliases: [
          "stock",
          "quantity",
          "stock qty",
          "inventory",
          "available qty",
        ],
      },
      {
        targetField: "brand",
        aliases: ["brand", "manufacturer", "maker", "vendor"],
      },
      {
        targetField: "weight",
        aliases: ["weight", "weight kg", "weight g", "mass"],
      },
    ],
  },
  {
    dataType: "orders",
    systemName: "Order Data",
    fields: [
      {
        targetField: "order_id",
        aliases: [
          "order id",
          "order no",
          "order number",
          "purchase id",
          "transaction id",
        ],
      },
      {
        targetField: "order_date",
        aliases: [
          "order date",
          "purchase date",
          "ordered at",
          "created at",
          "transaction date",
        ],
      },
      {
        targetField: "customer_name",
        aliases: [
          "customer name",
          "buyer name",
          "purchaser",
          "full name",
          "name",
        ],
      },
      {
        targetField: "customer_email",
        aliases: [
          "email",
          "customer email",
          "buyer email",
          "mail",
          "contact email",
        ],
      },
      {
        targetField: "total_amount",
        aliases: [
          "total",
          "total amount",
          "order total",
          "grand total",
          "amount",
        ],
      },
      {
        targetField: "status",
        aliases: [
          "status",
          "order status",
          "state",
          "fulfillment status",
        ],
      },
      {
        targetField: "shipping_address",
        aliases: [
          "shipping address",
          "delivery address",
          "ship to",
          "address",
        ],
      },
      {
        targetField: "payment_method",
        aliases: [
          "payment method",
          "payment type",
          "payment",
          "pay method",
        ],
      },
    ],
  },
  // ... inventory, customers, shipments profiles omitted for brevity
];

The key is to register synonyms, abbreviations, and naming variations exhaustively for each field. The richness of this dictionary directly determines heuristic recognition accuracy.
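For instance, in a mixed English/Japanese environment you would also register Japanese header variants. The entries below are illustrative additions, not part of the profile shown above:

// Illustrative: Japanese variants for product_id. NFKC normalization
// (Step 2) already folds full-width Latin ("ＳＫＵ" → "sku"), so only
// genuinely different spellings need their own alias entries.
{
  targetField: "product_id",
  aliases: [
    "product code",
    "SKU",
    "商品コード", // Japanese: "product code"
    "品番",       // Japanese: "item number"
  ],
},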

Step 2: Normalization and Scoring

Before comparing CSV headers against aliases, we normalize both strings — converting to lowercase, unifying delimiters, and applying Unicode NFKC normalization (which converts full-width characters to half-width, important for Japanese CSVs).

// column-recognizer.ts
 
import {
  KNOWN_SYSTEM_PROFILES,
  type EcDataType,
  type KnownSystemProfile,
} from "./known-ec-systems";
 
export function normalize(value: string): string {
  return value
    .toLowerCase()
    .normalize("NFKC")
    .replace(/[_\-/]+/g, " ")
    .replace(/[()[\]]+/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}
 
function scoreAlias(header: string, alias: string): number {
  const h = normalize(header);
  const a = normalize(alias);
 
  // Exact match: highest score
  if (h === a) return 1;
 
  // Substring containment: high score
  if (h.includes(a) || a.includes(h)) return 0.8;
 
  // Token overlap: medium score
  const headerTokens = new Set(h.split(" "));
  const aliasTokens = a.split(" ");
  const overlap = aliasTokens.filter((t) => headerTokens.has(t)).length;
  if (overlap === 0) return 0;
  return Math.min(0.7, overlap / aliasTokens.length);
}

The scoring function uses three tiers:

| Match Type | Score | Example |
|---|---|---|
| Exact match | 1.0 | "product code" vs "product code" |
| Substring | 0.8 | "product code (JAN)" vs "product code" |
| Token overlap | 0.0 – 0.7 | "product mgmt code" vs "product code" |
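A quick sanity check (run inside column-recognizer.ts, since scoreAlias is module-private; the full-width header is illustrative):

console.log(normalize("Ｐｒｏｄｕｃｔ＿Ｃｏｄｅ（ＪＡＮ）"));
// → "product code jan" (NFKC folds full-width, delimiters become spaces)

console.log(scoreAlias("product code (JAN)", "product code")); // → 0.8 (substring)
console.log(scoreAlias("product mgmt code", "product code"));  // → 0.7 (capped token overlap)
console.log(scoreAlias("retail (incl. tax)", "retail price")); // → 0.5 (1 of 2 alias tokens)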

Step 3: Profile Matching and Best-Match Selection

For each header, we select the field with the highest score, then compute an overall confidence for the profile.

export type ColumnMapping = Record<string, string | null>;
 
export type ColumnRecognitionResult = {
  dataType: EcDataType;
  confidence: number;
  columnMapping: ColumnMapping;
  unmappedColumns: string[];
  detectedSystem: string;
};
 
function pickMappingForHeader(
  header: string,
  profile: KnownSystemProfile
): string | null {
  let bestField: string | null = null;
  let bestScore = 0;
 
  for (const field of profile.fields) {
    for (const alias of field.aliases) {
      const score = scoreAlias(header, alias);
      if (score > bestScore) {
        bestScore = score;
        bestField = field.targetField;
      }
    }
  }
 
  // 0.55 sits between a weak token overlap (0.5) and the capped
  // overlap score (0.7), so single-token coincidences are rejected.
  return bestScore >= 0.55 ? bestField : null;
}
 
function scoreProfile(headers: string[], profile: KnownSystemProfile) {
  const columnMapping: ColumnMapping = {};
  let matchedHeaders = 0;
 
  for (const header of headers) {
    const matchedField = pickMappingForHeader(header, profile);
    columnMapping[header] = matchedField;
    if (matchedField) matchedHeaders += 1;
  }
 
  const headerCoverage =
    headers.length > 0 ? matchedHeaders / headers.length : 0;
  // Approximation: matchedHeaders counts headers, not distinct fields,
  // so two headers mapping to the same field slightly inflate this.
  const fieldCoverage =
    profile.fields.length > 0
      ? matchedHeaders / Math.min(profile.fields.length, headers.length)
      : 0;
 
  return {
    profile,
    columnMapping,
    matchedHeaders,
    confidence: Number(
      ((headerCoverage * 0.7 + fieldCoverage * 0.3) * 100).toFixed(1)
    ),
    score: matchedHeaders,
  };
}

Confidence is a weighted average of two coverage metrics:

  • Header coverage (weight 0.7): What fraction of input CSV headers were mapped? This directly affects whether users perceive “too many unrecognized columns.”
  • Field coverage (weight 0.3): What fraction of the profile’s defined fields were covered? This affects data type detection accuracy.

Header coverage gets the higher weight because, in practice, “few unmapped columns” matters most for user experience.
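As a worked example: if 4 of 7 headers map against a profile that defines 8 fields, both coverages are 4/7 ≈ 0.571, so confidence = (0.571 × 0.7 + 0.571 × 0.3) × 100 ≈ 57.1 — exactly the non-standard CSV case shown below.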

Finally, we match against all profiles and select the highest-scoring one:

export function recognizeColumns(
  headers: string[]
): ColumnRecognitionResult {
  const candidates = KNOWN_SYSTEM_PROFILES.map((profile) =>
    scoreProfile(headers, profile)
  );
 
  const best = candidates.sort((a, b) => {
    if (b.score !== a.score) return b.score - a.score;
    return b.confidence - a.confidence;
  })[0];
 
  return {
    dataType: best.profile.dataType,
    confidence: best.confidence,
    columnMapping: best.columnMapping,
    unmappedColumns: Object.entries(best.columnMapping)
      .filter(([, v]) => v === null)
      .map(([k]) => k),
    detectedSystem: best.profile.systemName,
  };
}

Worked Example

Given a standard Shopify order export:

Order Number, Created At, Customer Name, Email, Total, Status, Shipping Address, Payment Method
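Calling the recognizer is a one-liner (sketch; we split the header row ourselves):

import { recognizeColumns } from "./column-recognizer";

const headers =
  "Order Number, Created At, Customer Name, Email, Total, Status, Shipping Address, Payment Method".split(
    ", "
  );
console.log(JSON.stringify(recognizeColumns(headers), null, 2));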

Result:

{
  "dataType": "orders",
  "confidence": 100,
  "detectedSystem": "Order Data",
  "columnMapping": {
    "Order Number": "order_id",
    "Created At": "order_date",
    "Customer Name": "customer_name",
    "Email": "customer_email",
    "Total": "total_amount",
    "Status": "status",
    "Shipping Address": "shipping_address",
    "Payment Method": "payment_method"
  },
  "unmappedColumns": []
}

But with a non-standard CSV format:

item_no, title, retail (incl. tax), stock_count, genre, maker_name, memo

Result:

{
  "dataType": "product_master",
  "confidence": 57.1,
  "detectedSystem": "Product Master",
  "columnMapping": {
    "item_no": null,
    "title": "product_name",
    "retail (incl. tax)": null,
    "stock_count": "stock_quantity",
    "genre": "category",
    "maker_name": "product_name",
    "memo": null
  },
  "unmappedColumns": ["item_no", "retail (incl. tax)", "memo"]
}

item_no fails because the alias dictionary has "item code" but not "item no"; retail (incl. tax) only reaches a token-overlap score of 0.5 against "retail price", below the 0.55 threshold; and maker_name even lands on the wrong field, because the generic alias "name" substring-matches product_name before "maker" can claim it for brand. These gaps highlight the fundamental limitation of heuristics: you can't pre-define every naming variation or resolve every ambiguity. This is where the LLM approach shines.

Approach 2: LLM-Based Column Recognition

Design Philosophy

The LLM approach passes column headers and sample data as a prompt, leveraging natural language understanding to infer data types and column mappings.

graph LR
    A[CSV Headers<br/>+ Sample Rows] --> B[LLM Inference]
    B --> C[Structured JSON]
    C --> D[Mapping Result]

Implementation

// llm-column-recognizer.ts
 
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod/v4";
 
const LlmMappingSchema = z.object({
  dataType: z.enum([
    "product_master",
    "orders",
    "inventory",
    "customers",
    "shipments",
    "unknown",
  ]),
  confidence: z.number().min(0).max(100),
  detectedSystem: z.string(),
  columnMapping: z.record(z.string(), z.string().nullable()),
  reasoning: z.string(),
});
 
type LlmMappingResult = z.infer<typeof LlmMappingSchema>;
 
const TARGET_FIELDS: Record<string, string[]> = {
  product_master: [
    "product_id", "product_name", "price", "category",
    "description", "stock_quantity", "brand", "weight",
  ],
  orders: [
    "order_id", "order_date", "customer_name", "customer_email",
    "total_amount", "status", "shipping_address", "payment_method",
  ],
  // ... other data types
};
 
function buildPrompt(
  headers: string[],
  sampleRows: string[][]
): string {
  const sampleText = sampleRows
    .slice(0, 3)
    .map((row) => row.join(" | "))
    .join("\n");
 
  return `You are an e-commerce data analysis expert.
Given the following CSV headers and sample data, determine the data type and map each column to the appropriate target field.
 
## CSV Headers
${headers.join(" | ")}
 
## Sample Data (up to 3 rows)
${sampleText}
 
## Target Field Definitions
${Object.entries(TARGET_FIELDS)
  .map(([type, fields]) => `- ${type}: ${fields.join(", ")}`)
  .join("\n")}
 
## Output Format (JSON only)
{
  "dataType": "one of the above data types, or unknown",
  "confidence": 0-100,
  "detectedSystem": "description of the likely data source",
  "columnMapping": { "original column name": "target field name or null" },
  "reasoning": "brief explanation of the classification"
}
 
Output only valid JSON.`;
}
 
export async function recognizeColumnsWithLlm(
  headers: string[],
  sampleRows: string[][]
): Promise<LlmMappingResult> {
  // Reads ANTHROPIC_API_KEY from the environment by default
  const client = new Anthropic();
 
  const message = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [
      { role: "user", content: buildPrompt(headers, sampleRows) },
    ],
  });
 
  const text =
    message.content[0].type === "text" ? message.content[0].text : "";
 
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (!jsonMatch) {
    throw new Error("LLM response did not contain valid JSON");
  }
 
  const parsed = JSON.parse(jsonMatch[0]);
  return LlmMappingSchema.parse(parsed);
}
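A minimal invocation sketch — it assumes ANTHROPIC_API_KEY is set in the environment and that the CSV has already been parsed; the code values echo the non-standard CSV from earlier, while the row contents are hypothetical:

import { recognizeColumnsWithLlm } from "./llm-column-recognizer";

const headers = ["item_no", "title", "retail (incl. tax)", "stock_count"];
const sampleRows = [
  ["A001", "Wireless Mouse", "12800", "42"], // hypothetical rows
  ["A002", "USB-C Cable 1m", "1480", "310"],
];

const result = await recognizeColumnsWithLlm(headers, sampleRows);
console.log(result.dataType, result.confidence, result.reasoning);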

Why the LLM Approach Works Better for Edge Cases

The LLM successfully maps item_no to product_id because it understands that “no” is an abbreviation for “number” and that “item number” semantically means a product identifier. It also uses sample data values (A001 as a code pattern, 12800 as a price) as additional signals — something the heuristic approach cannot do.

Head-to-Head Comparison

| Criterion | Heuristic | LLM-Based |
|---|---|---|
| Latency | < 5 ms | 500 ms – 3 s |
| API cost | $0 | ~$0.003 / request |
| Offline operation | ✅ Yes | ⚠️ Cloud API requires network (local LLM possible) |
| Known-pattern accuracy | Very high (if in dictionary) | High (but probabilistic) |
| Unknown-pattern handling | Cannot recognize | Can infer from context |
| Determinism | Same input → same output | May vary slightly |
| Maintenance cost | Ongoing dictionary expansion | Prompt refinement only |
| Debuggability | Score tracing is straightforward | Reasoning can be opaque |

In production, a hybrid strategy combining both approaches is most practical.

// hybrid-recognizer.ts
 
import {
  recognizeColumns,
  type ColumnRecognitionResult,
} from "./column-recognizer";
import { recognizeColumnsWithLlm } from "./llm-column-recognizer";
 
export type RecognitionMethod = "heuristic" | "llm" | "heuristic-fallback";
 
// The LLM may also answer "unknown", so the hybrid result widens dataType.
export type HybridRecognitionResult = Omit<ColumnRecognitionResult, "dataType"> & {
  dataType: ColumnRecognitionResult["dataType"] | "unknown";
  recognitionMethod: RecognitionMethod;
};
 
export async function recognizeColumnsHybrid(
  headers: string[],
  sampleRows: string[][]
): Promise<HybridRecognitionResult> {
  // Phase 1: Fast heuristic attempt
  const heuristicResult = recognizeColumns(headers);
 
  // If confidence is high enough, return immediately
  if (heuristicResult.confidence >= 80) {
    return { ...heuristicResult, recognitionMethod: "heuristic" };
  }
 
  // Phase 2: Fall back to LLM for low-confidence cases
  try {
    const llmResult = await recognizeColumnsWithLlm(headers, sampleRows);
 
    return {
      dataType: llmResult.dataType,
      confidence: llmResult.confidence,
      columnMapping: llmResult.columnMapping,
      unmappedColumns: Object.entries(llmResult.columnMapping)
        .filter(([, v]) => v === null)
        .map(([k]) => k),
      detectedSystem: llmResult.detectedSystem,
      recognitionMethod: "llm",
    };
  } catch {
    // If the LLM call fails, fall back to the heuristic result
    return { ...heuristicResult, recognitionMethod: "heuristic-fallback" };
  }
}
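Usage mirrors the LLM recognizer (a sketch; sample rows are only consumed on the LLM path):

const result = await recognizeColumnsHybrid(headers, sampleRows);
console.log(result.recognitionMethod, result.confidence);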

Flow Diagram

graph TD
    A[CSV Upload] --> B[Heuristic Recognition]
    B -->|confidence ≥ 80%| C[Return result<br/>instant, free]
    B -->|confidence < 80%| D[LLM Recognition]
    D -->|Success| E[Return LLM result]
    D -->|Failure| F[Fall back to<br/>heuristic result]

Key design principles:

  1. Most requests never call the LLM: Repeatedly imported standard CSVs match the dictionary, so heuristics alone handle them. In practice, 80%+ of requests resolve here.
  2. LLM costs stay minimal: The LLM is only called for “first-seen formats,” so API costs don’t scale linearly with traffic.
  3. Graceful degradation: On LLM API errors or timeouts, the system still returns heuristic results — lower accuracy, but functional.

Feedback Loop: Growing the Dictionary from LLM Results

To further evolve the hybrid strategy, feed successful LLM mappings back into the heuristic dictionary:

// feedback-loop.ts (conceptual)
 
import { normalize } from "./column-recognizer";
import type { EcDataType } from "./known-ec-systems";
// ProfileStore is an abstract persistence layer; a minimal sketch follows below.
import type { ProfileStore } from "./profile-store";
 
type MappingFeedback = {
  header: string;
  targetField: string;
  dataType: EcDataType;
  confirmedByUser: boolean;
};
 
async function learnFromMapping(
  feedback: MappingFeedback,
  profileStore: ProfileStore
): Promise<void> {
  if (!feedback.confirmedByUser) return;
 
  const profile = profileStore.getProfile(feedback.dataType);
  const field = profile.fields.find(
    (f) => f.targetField === feedback.targetField
  );
 
  if (field && !field.aliases.includes(normalize(feedback.header))) {
    field.aliases.push(normalize(feedback.header));
    await profileStore.save(profile);
  }
}
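The ProfileStore above is left abstract. A minimal in-memory sketch (hypothetical, for illustration; a real implementation would persist learned aliases to a database) could look like this:

// profile-store.ts (illustrative in-memory implementation)

import {
  KNOWN_SYSTEM_PROFILES,
  type EcDataType,
  type KnownSystemProfile,
} from "./known-ec-systems";

export class ProfileStore {
  // Clone the static profiles so learned aliases can be added safely.
  private profiles = new Map<EcDataType, KnownSystemProfile>(
    KNOWN_SYSTEM_PROFILES.map(
      (p) => [p.dataType, structuredClone(p)] as [EcDataType, KnownSystemProfile]
    )
  );

  getProfile(dataType: EcDataType): KnownSystemProfile {
    const profile = this.profiles.get(dataType);
    if (!profile) throw new Error(`No profile for data type: ${dataType}`);
    return profile;
  }

  async save(profile: KnownSystemProfile): Promise<void> {
    // In-memory only; a database write would go here.
    this.profiles.set(profile.dataType, profile);
  }
}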

When users approve an LLM mapping suggestion, the header name is added to the alias dictionary. Next time, the same header is recognized by heuristics alone.

First time:

graph LR
    A[CSV] --> B[Heuristic<br/>low confidence]
    B --> C[LLM inference]
    C --> D[User approves]
    D -->|Learn| E[Add to alias dictionary]

Next time:

graph LR
    A[CSV] --> B[Heuristic<br/>high confidence]
    B --> C[Return immediately]

Summary

CSV column auto-recognition may seem like a minor feature, but it significantly impacts data import UX.

  • Heuristic approach: Fast, free, deterministic, and highly reliable for known patterns. The alias dictionary design — normalization, scoring thresholds, weighted coverage — is the key to accuracy.
  • LLM approach: Flexible with unknown column names and formats, but comes with latency and cost trade-offs. Including sample data in the prompt improves accuracy.
  • Hybrid strategy: Combines both strengths, optimizing cost and latency while maintaining adaptability to unknown patterns.

Add a feedback loop, and you build a self-improving column recognition system where LLM inference continuously strengthens the heuristic dictionary.

Start with heuristics only, monitor the unmapped column rate, and introduce LLM fallback as needed — that’s the pragmatic roadmap.

That’s all for CSV column auto-recognition implementation patterns — from the gemba.