Using DISTINCT to Eliminate Duplicates

PostgreSQL Queries

Video Locked

Please log in to watch this video

Log In
Chapter Info
Course The Ultimate PostGIS course
Module PostgreSQL Queries
Chapter Using DISTINCT to Eliminate Duplicates

Chapter Content

When working with databases, duplicate values in query results can clutter your output and make data analysis more difficult. PostgreSQL's DISTINCT keyword provides a powerful solution for eliminating duplicate rows from your result sets. This guide covers everything you need to know about using DISTINCT effectively in your SQL queries.

Understanding the DISTINCT Keyword

The DISTINCT keyword is used in SQL to remove duplicate rows from the result set. When you're querying data, especially from large tables, you may encounter duplicate rows that aren't what you need for your analysis or reporting.

You place DISTINCT directly after SELECT:

SELECT DISTINCT column1 FROM table_name;

This tells PostgreSQL to return only unique values for the specified column(s).

Setting Up Sample Data

Let's create an employees table to demonstrate DISTINCT functionality:

CREATE TABLE employees (
  id SERIAL PRIMARY KEY,
  name TEXT,
  department TEXT,
  salary NUMERIC(10,2),
  city TEXT
);

INSERT INTO employees (name, department, salary, city) VALUES
('Alice', 'Engineering', 75000.00, 'New York'),
('Bob', 'HR', 50000.00, 'Chicago'),
('Charlie', 'Engineering', 72000.00, 'New York'),
('Diana', 'Marketing', 65000.00, 'Los Angeles'),
('Eve', 'HR', 51000.00, 'Chicago'),
('Frank', 'Marketing', 62000.00, 'New York'),
('Grace', 'Engineering', 78000.00, 'San Francisco');

Basic DISTINCT Usage

Single Column DISTINCT

Let's say we want to know all departments in our company. Without DISTINCT:

SELECT department FROM employees;

Result:

Engineering
HR
Engineering
Marketing
HR
Marketing
Engineering

With DISTINCT:

SELECT DISTINCT department FROM employees;

Result:

Engineering
HR
Marketing

Much cleaner and more useful for analysis!

Multiple Examples

-- Get all unique cities
SELECT DISTINCT city FROM employees;

-- Get all unique salary levels
SELECT DISTINCT salary FROM employees ORDER BY salary DESC;

-- Count unique departments
SELECT COUNT(DISTINCT department) AS department_count FROM employees;
Query Purpose Example Use Case
Unique categories SELECT DISTINCT department FROM employees Dropdown lists, filters
Unique values with sorting SELECT DISTINCT salary FROM employees ORDER BY salary Salary ranges, price lists
Count unique values SELECT COUNT(DISTINCT city) FROM employees Analytics, reporting

DISTINCT with Multiple Columns

You can use DISTINCT with multiple columns to find unique combinations:

SELECT DISTINCT department, city FROM employees;

This query removes rows where the combination of department and city is the same, not just individual columns.

Result:

Engineering | New York
HR | Chicago  
Marketing | Los Angeles
Marketing | New York
Engineering | San Francisco

Understanding Multi-Column DISTINCT

Concept Explanation Example
Row-level uniqueness Entire row combination must be unique ('Engineering', 'New York') vs ('Engineering', 'Chicago')
Not column-level Individual columns can still have duplicates 'Engineering' appears multiple times but with different cities
Order matters Column order affects result presentation DISTINCT dept, city vs DISTINCT city, dept

Practical Multi-Column Examples

-- Unique department-salary combinations
SELECT DISTINCT department, salary 
FROM employees 
ORDER BY department, salary DESC;

-- Unique city-department pairs with employee count
SELECT 
  department, 
  city, 
  COUNT(*) as employee_count
FROM employees 
GROUP BY department, city
ORDER BY department, city;

DISTINCT vs GROUP BY

While DISTINCT and GROUP BY can sometimes produce similar results, they serve different purposes:

Aspect DISTINCT GROUP BY
Purpose Remove duplicates Group rows for aggregation
Aggregation Limited (COUNT only) Full aggregation support
Performance Generally faster for simple deduplication Better for complex aggregations
Syntax SELECT DISTINCT col FROM table SELECT col FROM table GROUP BY col

When to Use Each

-- Use DISTINCT for simple deduplication
SELECT DISTINCT department FROM employees;

-- Use GROUP BY when you need aggregation
SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;

-- DISTINCT with COUNT
SELECT COUNT(DISTINCT department) FROM employees;

-- GROUP BY equivalent (more complex)
SELECT COUNT(*) FROM (
  SELECT department FROM employees GROUP BY department
) subquery;

Advanced DISTINCT Techniques

DISTINCT with Expressions

-- Distinct calculated values
SELECT DISTINCT 
  EXTRACT(YEAR FROM hire_date) as hire_year
FROM employees
ORDER BY hire_year;

-- Distinct string manipulations
SELECT DISTINCT 
  UPPER(LEFT(name, 1)) as first_letter
FROM employees
ORDER BY first_letter;

-- Distinct conditional values
SELECT DISTINCT
  CASE 
    WHEN salary >= 70000 THEN 'High'
    WHEN salary >= 60000 THEN 'Medium'
    ELSE 'Low'
  END as salary_category
FROM employees;

DISTINCT in Subqueries

-- Find employees in departments with multiple locations
SELECT name, department
FROM employees
WHERE department IN (
  SELECT DISTINCT department
  FROM employees
  GROUP BY department
  HAVING COUNT(DISTINCT city) > 1
);

-- Employees with unique salary in their department
SELECT name, department, salary
FROM employees e1
WHERE salary = (
  SELECT DISTINCT salary
  FROM employees e2
  WHERE e2.department = e1.department
  AND salary = (
    SELECT MAX(salary)
    FROM employees e3
    WHERE e3.department = e1.department
  )
);

Performance Considerations

Impact on Query Performance

Scenario Performance Impact Optimization Strategy
Small result sets Minimal impact Use freely
Large tables Can be expensive Add indexes on DISTINCT columns
Multiple columns Higher cost Consider composite indexes
Complex expressions Significant overhead Use functional indexes

Optimization Examples

-- Create index to speed up DISTINCT queries
CREATE INDEX idx_employees_department ON employees(department);
CREATE INDEX idx_employees_city_dept ON employees(city, department);

-- Use EXPLAIN to analyze performance
EXPLAIN ANALYZE
SELECT DISTINCT department FROM employees;

-- Compare with and without index
EXPLAIN ANALYZE
SELECT DISTINCT department, city FROM employees;

Real-World Use Cases

Data Analysis and Reporting

-- Customer segmentation
SELECT DISTINCT
  CASE 
    WHEN total_purchases > 1000 THEN 'VIP'
    WHEN total_purchases > 500 THEN 'Regular'
    ELSE 'New'
  END as customer_segment
FROM customer_summary;

-- Product categories for filters
SELECT DISTINCT category
FROM products
WHERE active = true
ORDER BY category;

-- Geographic analysis
SELECT DISTINCT 
  state,
  COUNT(*) OVER (PARTITION BY state) as location_count
FROM store_locations
ORDER BY location_count DESC;

Application Development

-- Populate dropdown menus
SELECT DISTINCT department as value, department as label
FROM employees
WHERE active = true
ORDER BY department;

-- User preference options
SELECT DISTINCT 
  preference_category,
  preference_value
FROM user_preferences
WHERE preference_category = 'notification_type';

-- Data validation
SELECT DISTINCT email_domain
FROM users
WHERE email_domain NOT IN ('gmail.com', 'yahoo.com', 'outlook.com');

Common Pitfalls and Solutions

Pitfall 1: Unnecessary DISTINCT Usage

-- Unnecessary - ID is already unique
SELECT DISTINCT id, name FROM employees;

-- Better - Remove DISTINCT when not needed
SELECT id, name FROM employees;

Pitfall 2: DISTINCT with ORDER BY

-- This might not work as expected
SELECT DISTINCT department 
FROM employees 
ORDER BY salary;  -- Error: salary not in SELECT

-- Correct approaches
SELECT DISTINCT department 
FROM employees 
ORDER BY department;

-- Or include the ORDER BY column
SELECT DISTINCT department, salary
FROM employees 
ORDER BY salary;

Pitfall 3: Performance Issues

-- Slow on large tables
SELECT DISTINCT expensive_function(column) FROM large_table;

-- Better - use GROUP BY with proper indexing
SELECT expensive_function(column)
FROM large_table
GROUP BY expensive_function(column);

Testing and Validation

Verify DISTINCT Results

-- Check if DISTINCT is working correctly
SELECT 
  COUNT(*) as total_rows,
  COUNT(DISTINCT department) as unique_departments
FROM employees;

-- Validate multi-column DISTINCT
SELECT 
  department,
  city,
  COUNT(*) as occurrence_count
FROM employees
GROUP BY department, city
HAVING COUNT(*) > 1;  -- Should return no rows if DISTINCT worked

Best Practices

  1. Use DISTINCT purposefully: Don't add it by default—verify that duplicates actually exist
  2. Consider alternatives: Sometimes GROUP BY with aggregation is more appropriate
  3. Index appropriately: Create indexes on columns used with DISTINCT for better performance
  4. Test with real data: Verify that DISTINCT produces the expected results
  5. Monitor performance: Use EXPLAIN ANALYZE to understand query costs
  6. Document intent: Make it clear why DISTINCT is necessary in your queries

Practice Exercises

Try these exercises with the sample data:

  1. List all unique salaries in the employees table
  2. Show all unique pairs of department and city
  3. Find the number of unique departments
  4. Get unique cities where Engineering employees work
  5. List unique salary ranges (High: >70k, Medium: 60-70k, Low: <60k)

Exercise Solutions

-- 1. Unique salaries
SELECT DISTINCT salary FROM employees ORDER BY salary DESC;

-- 2. Unique department-city pairs
SELECT DISTINCT department, city FROM employees ORDER BY department, city;

-- 3. Count unique departments
SELECT COUNT(DISTINCT department) as unique_departments FROM employees;

-- 4. Cities with Engineering employees
SELECT DISTINCT city FROM employees WHERE department = 'Engineering';

-- 5. Salary ranges
SELECT DISTINCT
  CASE 
    WHEN salary > 70000 THEN 'High'
    WHEN salary >= 60000 THEN 'Medium'
    ELSE 'Low'
  END as salary_range
FROM employees;