Krishna G. Lodha

Chapter Content

Fixed-Length vs. Variable-Length

PostgreSQL offers three core text types:

Type	Description	Maximum Length
`CHAR(n)`	Fixed-length, blank-padded	1 to 10,485,76
`VARCHAR(n)`	Variable-length with limit	1 to 10,485,76
`TEXT`	Variable unlimited length	Unlimited

Use Cases:

CHAR(n): ISO country codes (CHAR(2))
VARCHAR(n): Email addresses with known max length
TEXT: Descriptions, comments, logs

CREATE TABLE addresses (
  id SERIAL PRIMARY KEY,
  country_code CHAR(2),
  email VARCHAR(255),
  notes TEXT
);

INSERT INTO addresses (country_code, email, notes)
VALUES
  ('US', 'alice@example.com', 'First contact.'),
  ('IN', 'bob@example.in', 'Follow-up pending.');

Case-Insensitive Text with CITEXT

The citext extension provides case-insensitive text comparisons:

CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE users (
  username CITEXT PRIMARY KEY,
  full_name TEXT
);

INSERT INTO users (username, full_name)
VALUES ('Alice', 'Alice Smith');

-- This matches regardless of case:
SELECT * FROM users WHERE username = 'alice';

Benefit: Simplifies comparisons and uniqueness.

Domain Types for Validation

Wrap text in a domain with a regex check, e.g., email format:

CREATE DOMAIN email_addr AS VARCHAR(255)
  CHECK (VALUE ~* '^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$');

CREATE TABLE contacts (
  id SERIAL,
  email email_addr
);

INSERT INTO contacts (email) VALUES ('valid@example.com');
-- Invalid addresses will be rejected.

Altering Column Types

Convert between types safely:

-- TEXT to VARCHAR(100)
ALTER TABLE addresses
  ALTER COLUMN notes TYPE VARCHAR(100)
  USING LEFT(notes,100);

Advice: Use USING to trim or cast values during conversion.

Full-Text Search Basics

PostgreSQL supports full-text search via tsvector and tsquery:

CREATE TABLE articles (
  id SERIAL,
  title TEXT,
  body TEXT
);

INSERT INTO articles (title, body)
VALUES ('Postgres Tips', 'Learn about text types and full-text search.');

-- Create a tsvector column for search:
ALTER TABLE articles ADD COLUMN document tsvector;
UPDATE articles SET document = to_tsvector(body);

-- Search for keyword 'text':
SELECT id, title
FROM articles
WHERE document @@ to_tsquery('text');

Indexing Text Columns

B-tree on VARCHAR/TEXT for equality
GIN on tsvector for full-text search

-- Equality index:
CREATE INDEX idx_email ON contacts (email);

-- Full-text index:
CREATE INDEX idx_article_search ON articles USING GIN(document);

Performance and Storage

CHAR(n) always stores n bytes (plus overhead)
VARCHAR(n) and TEXT store actual length + 1–4 bytes
CITEXT adds overhead (~5–10 bytes per value)

Use the smallest type that satisfies your needs to save space.

Character Encoding & Collation

What is Character Encoding?

Character encoding determines how characters are stored and interpreted in bytes. PostgreSQL supports many encodings, with UTF8 (Unicode) being the most common.

Why Encoding Matters:

Without proper encoding, international text like Hindi, Japanese, or emoji can break.
Text input/output needs consistent encoding across database, client, and application layers.

Check Current Encoding:

SHOW SERVER_ENCODING;

Set Encoding (only at database creation):

CREATE DATABASE international_db WITH ENCODING = 'UTF8';

What is Collation?

Collation defines the rules for sorting and comparing text, such as alphabetic order, case sensitivity, and locale preferences.

Check Column Collation:

SELECT column_name, collation_name
FROM information_schema.columns
WHERE table_name = 'contacts';

Set Collation on Column/Table:

CREATE TABLE cities_de (
  name TEXT COLLATE "de_DE"
);

CREATE TABLE cities_en (
  name TEXT COLLATE "en_US"
);

List Available Collations:

SELECT * FROM pg_collation;

Encoding vs. Collation Summary

Feature	Purpose	Set At	Affects
Encoding	Byte representation of characters	Database creation	Input/output
Collation	Sort and comparison rules	Column/table	ORDER BY, LIKE, =

Practical Examples

-- Incorrect sorting:
SELECT name FROM cities ORDER BY name;

-- Correct with collation:
SELECT name FROM cities ORDER BY name COLLATE "en_US";

Best Practices

Always use UTF8 encoding unless working with legacy systems.
Set collation explicitly if your application supports multiple languages/locales.
Avoid mixing collations across tables unless needed — it can cause query errors or performance issues.
Set default collation at database creation:

CREATE DATABASE mydb
  LC_COLLATE = 'en_US.UTF-8'
  LC_CTYPE = 'en_US.UTF-8'
  ENCODING = 'UTF8';

Text Data Types

The Ultimate PostGIS course

Chapter Info