SQL grammar in EBNF, the logical data-flow that explains every rule, and the practical fluency to write correct queries — a progressive learning arc, not a feature dump.
SQL is written in one order and executed in another. The grammar says what's legal;
the pipeline says why — why an alias works in ORDER BY but not WHERE,
why aggregates are banned in WHERE but allowed in HAVING.
WHERE (2) runs before SELECT (5),
so a SELECT alias is invisible to WHERE/HAVING — repeat the full expression. ORDER BY (7) runs after
SELECT, so it can use the alias.LIMIT/OFFSET aren't standard (portable: OFFSET … FETCH).
② DISTINCT ON (k) keeps the first row per key — PG-only.
③ FROM may be omitted entirely.Every column, expression and comparison has a type. Choosing well prevents silent bugs (float money, naive timestamps) and unlocks the right operators and indexes.
| Type | Use for | Note |
|---|---|---|
smallint/int/bigint | Whole numbers | 2 / 4 / 8 bytes |
numeric(p,s) | Money, exact | Arbitrary precision, no rounding error |
real / double precision | Scientific | Float — never for currency |
serial / bigserial | Auto-increment | Legacy; prefer GENERATED … IDENTITY |
| Type | Use for | Note |
|---|---|---|
text | Default choice | Unlimited; no perf penalty in PG |
varchar(n) | Length-capped | Enforces a max; otherwise = text |
char(n) | Fixed width | Space-padded; rarely wanted |
text and varchar perform identically in PG — reach for text unless you need a hard length limit.| Type | Stores |
|---|---|
date | Calendar day, no time |
time [tz] | Time of day |
timestamp | Date+time, no zone (naive) |
timestamptz | Zone-aware — use this |
interval | Duration ('2 days 3 hours') |
timestamptz. It stores UTC and converts on display; timestamp drops zone context and causes off-by-hours bugs.| Type | Values / use |
|---|---|
boolean | TRUE / FALSE / NULL |
uuid | 128-bit id; gen_random_uuid() |
bytea | Raw binary |
enum | CREATE TYPE mood AS ENUM(…) |
jsonb | Indexed JSON; -> ->> @> |
type[] | Array: int[], text[] |
tsvector / int4range | Full-text / range types |
5/2 = 2. Force real division by casting one side: 5::numeric/2 = 2.5.WHERE keeps a row only when its condition is TRUE. NULL turns Boolean logic from
two values into three — the source of more silent bugs than any other SQL feature.
| Predicate | Matches | Example |
|---|---|---|
= <> < > <= >= | Comparison | price >= 100 |
BETWEEN a AND b | Inclusive range | age BETWEEN 18 AND 65 |
IN (…) | Any of a set | state IN ('AP','TS') |
LIKE / ILIKE | Pattern (% _); ILIKE case-insensitive | name ILIKE 'kal%' |
~ / ~* | POSIX regex / case-insensitive | code ~ '^[A-Z]{3}$' |
IS [NOT] NULL | Null test | deleted_at IS NULL |
IS [NOT] DISTINCT FROM | NULL-safe equality | a IS NOT DISTINCT FROM b |
Any comparison with NULL yields UNKNOWN. A row passes WHERE only on TRUE.
| AND | T | F | NULL |
|---|---|---|---|
| T | T | F | NULL |
| F | F | F | F |
| NULL | NULL | F | NULL |
| OR | T | F | NULL |
|---|---|---|---|
| T | T | T | T |
| F | T | F | NULL |
| NULL | T | NULL | NULL |
| Trap | Fix |
|---|---|
x = NULL never true | x IS NULL |
NOT IN (…,NULL) → 0 rows | Filter NULLs / NOT EXISTS |
COUNT(col) skips NULLs | COUNT(*) for total rows |
a = b false when both NULL | IS NOT DISTINCT FROM |
'x'||NULL → NULL | COALESCE(col,'') |
NULLS NOT DISTINCT (PG 15+).SUM(CASE WHEN region='East' THEN sales ELSE 0 END). Cleaner:
SUM(sales) FILTER (WHERE region='East').The working toolkit, grouped by what you're transforming. Scalar functions act per row; aggregates collapse many rows into one.
| Function | Result |
|---|---|
a || b | Concatenate |
length(s) | Character count |
lower / upper / initcap | Case transforms |
trim / ltrim / rtrim | Strip whitespace |
substring(s,from,len) | Extract slice |
replace(s,old,new) | Substitute |
split_part(s,delim,n) | Nth field |
position(sub in s) | Index of substring |
format('%s=%s',k,v) | printf-style build |
string_agg(x,',') | aggregate: join rows |
| Function | Result |
|---|---|
now() / current_date | Current moment / day |
age(ts) | Interval from now |
date_trunc('month',ts) | Round down to unit |
extract(year from ts) | Pull a field |
ts + interval '1 day' | Date arithmetic |
to_char(ts,'YYYY-MM') | Format to text |
to_date(s,'DD/MM/YYYY') | Parse from text |
generate_series(a,b,step) | Row-set of values |
| Function | Result |
|---|---|
round(x,n) / trunc | Round / truncate |
ceil / floor / abs | Standard math |
mod(a,b) / a % b | Remainder |
power(a,b) / sqrt | Exponent / root |
random() | 0 ≤ x < 1 |
x::type / cast(x as t) | Convert type |
| Function | Result |
|---|---|
count(*) / count(col) | All rows / non-NULL |
sum / avg / min / max | Standard rollups |
array_agg(x) | Collect into array |
string_agg(x,sep) | Collect into string |
jsonb_agg(x) | Collect into JSON |
bool_and / bool_or | Logical rollup |
percentile_cont(0.5) WITHIN GROUP (ORDER BY x) | Median |
SUM(CASE WHEN paid THEN amt END), write SUM(amt) FILTER (WHERE paid) — clearer and
NULL-clean. Any aggregate accepts FILTER (WHERE …).COUNT(DISTINCT user_id) counts unique values;
plain COUNT(user_id) counts non-NULL occurrences — they differ whenever duplicates exist.Three ways to combine relations: JOINs widen rows (more columns), set operations stack rows (same columns), subqueries nest one query inside another.
| Join | Keeps | Unmatched rows |
|---|---|---|
INNER | Matching pairs only | Dropped both sides |
LEFT OUTER | All left + matches | Left kept, right = NULL |
RIGHT OUTER | All right + matches | Right kept, left = NULL |
FULL OUTER | Everything | Either side padded NULL |
CROSS | Every combination | Cartesian product |
LATERAL | Per-row correlated subquery | Right may reference left's columns |
ON a.id=b.id — explicit, both columns survive.
USING (id) — equal-named cols, merged to one.
NATURAL JOIN — auto-matches all shared names; convenient but fragile — a new shared column silently changes results.
Same column count & compatible types. Default dedups; ALL keeps dups. INTERSECT binds tighter than UNION/EXCEPT.
NOT IN (subquery) returns zero rows if the
subquery yields a single NULL. NOT EXISTS has no such trap and usually plans better for anti-joins.Same arithmetic, opposite shapes. GROUP BY folds many rows into one summary. A window function keeps every row and attaches a value computed over a window of related rows.
| dept | total |
|---|---|
| East | 400 |
| West | 800 |
Employees gone — you can't also select employee; no single value per group.
| emp | dept | sales | dt |
|---|---|---|---|
| Asha | East | 100 | 400 |
| Bina | East | 300 | 400 |
| Deepa | West | 600 | 800 |
Every row survives and knows its group total. Detail and aggregate together.
| Function | Returns | Needs ORDER BY |
|---|---|---|
ROW_NUMBER() | Unique sequential index | Yes |
RANK() / DENSE_RANK() | Rank with / without gaps after ties | Yes |
LAG(x,n) / LEAD(x,n) | Value n rows back / ahead | Yes |
SUM(x) OVER(ORDER BY …) | Running / cumulative total | Yes |
NTILE(k) | Split into k buckets | Yes |
FIRST_VALUE / LAST_VALUE / NTH_VALUE | Value at frame edge / position | Usually |
GROUP BY.
Every row plus a peer-computed value (rank, running total, share, diff-from-avg)? → window OVER (…).
Windowing runs after grouping, so the two can coexist.A Common Table Expression names a subquery up front, so complex logic reads top-to-bottom instead
of inside-out. RECURSIVE extends this to hierarchies and graphs — the standard way to walk a tree.
Name the intermediate result once, reuse it. Chains of CTEs read as sequential steps.
PG may inline a CTE into the outer query for speed. Force "compute once"
with AS MATERIALIZED, or allow inlining with NOT MATERIALIZED. Use MATERIALIZED when the
CTE is expensive and referenced multiple times.
Two parts joined by UNION ALL: a base case (anchor) and a
recursive case that references the CTE itself. PG iterates until the recursive part returns no new rows.
| id | name | mgr |
|---|---|---|
| 1 | Asha | — |
| 2 | Bina | 1 |
| 3 | Chandu | 1 |
| 4 | Deepa | 2 |
| tree | depth |
|---|---|
| Asha | 1 |
| Bina | 2 |
| Chandu | 2 |
| Deepa | 3 |
UNION … CYCLE col SET is_cycle USING path; otherwise track a visited
path array and filter.WITH moved AS (DELETE FROM old WHERE … RETURNING *) INSERT INTO archive SELECT * FROM moved; —
delete and re-insert atomically in one statement.DML changes rows · DDL changes structure · TCL controls transactions · DCL controls access.
UPDATE/DELETE hits every row. Wrap risky writes in
BEGIN; … ROLLBACK; to preview the row count before committing.Levels: READ COMMITTED (default) · REPEATABLE READ · SERIALIZABLE. ACID is per-transaction.
priv = SELECT/INSERT/UPDATE/DELETE/ALL. Row-level security via CREATE POLICY.
An index is a shortcut the planner may use. Knowing which index suits which query — and reading the plan to confirm it's used — is the difference between a 5ms and a 5s query.
| Type | Best for | Example |
|---|---|---|
| B-tree (default) | Equality & range on scalars; sorting | WHERE price > 100 ORDER BY price |
| Hash | Equality only | WHERE id = 42 |
| GIN | Multi-value: jsonb, arrays, full-text | WHERE tags @> '{sql}' |
| GiST | Geometric, ranges, nearest-neighbour | WHERE geom && box |
| BRIN | Huge, naturally-ordered tables (time-series) | WHERE created_at > … |
| SP-GiST | Non-balanced: quadtrees, IP ranges | WHERE inet && '10.0.0.0/8' |
• Function on the column: WHERE lower(email)=… skips a plain email index — index the expression instead.
• Leading wildcard: LIKE '%x' can't use B-tree.
• Type mismatch forces a cast.
• Tiny table: a seq scan is genuinely faster.
• Stale stats: run ANALYZE.
| Plan node | Meaning | Concern |
|---|---|---|
Seq Scan | Read whole table | Bad on large tables w/ selective filter |
Index Scan / Index Only Scan | Use index; "Only" = no heap fetch | Good |
Bitmap Heap Scan | Many matching rows via index | Fine for medium selectivity |
Nested Loop | Probe inner per outer row | Blows up if outer rows large |
Hash Join / Merge Join | Build hash / merge sorted inputs | Good for large joins |
rows=1 but got
rows=50000, its choices rested on wrong stats — run ANALYZE, then match the index to the
actual WHERE/JOIN/ORDER BY columns.