SQL grammar in EBNF, the logical data-flow that explains every rule, the practical fluency to write correct queries, and a rigorous PostGIS spatial extension — a progressive learning arc, not a feature dump.
SQL is written in one order and executed in another. The grammar says what's legal;
the pipeline says why — why an alias works in ORDER BY but not WHERE,
why aggregates are banned in WHERE but allowed in HAVING.
WHERE (2) runs before SELECT (5),
so a SELECT alias is invisible to WHERE/HAVING — repeat the full expression. ORDER BY (7) runs after
SELECT, so it can use the alias.LIMIT/OFFSET aren't standard (portable: OFFSET … FETCH).
② DISTINCT ON (k) keeps the first row per key — PG-only.
③ FROM may be omitted entirely.Every column, expression and comparison has a type. Choosing well prevents silent bugs (float money, naive timestamps) and unlocks the right operators and indexes.
| Type | Use for | Note |
|---|---|---|
smallint/int/bigint | Whole numbers | 2 / 4 / 8 bytes |
numeric(p,s) | Money, exact | Arbitrary precision, no rounding error |
real / double precision | Scientific | Float — never for currency |
serial / bigserial | Auto-increment | Legacy; prefer GENERATED … IDENTITY |
| Type | Use for | Note |
|---|---|---|
text | Default choice | Unlimited; no perf penalty in PG |
varchar(n) | Length-capped | Enforces a max; otherwise = text |
char(n) | Fixed width | Space-padded; rarely wanted |
text and varchar perform identically in PG — reach for text unless you need a hard length limit.| Type | Stores |
|---|---|
date | Calendar day, no time |
time [tz] | Time of day |
timestamp | Date+time, no zone (naive) |
timestamptz | Zone-aware — use this |
interval | Duration ('2 days 3 hours') |
timestamptz. It stores UTC and converts on display; timestamp drops zone context and causes off-by-hours bugs.| Type | Values / use |
|---|---|
boolean | TRUE / FALSE / NULL |
uuid | 128-bit id; gen_random_uuid() |
bytea | Raw binary |
enum | CREATE TYPE mood AS ENUM(…) |
jsonb | Indexed JSON; -> ->> @> |
type[] | Array: int[], text[] |
tsvector / int4range | Full-text / range types |
5/2 = 2. Force real division by casting one side: 5::numeric/2 = 2.5.WHERE keeps a row only when its condition is TRUE. NULL turns Boolean logic from
two values into three — the source of more silent bugs than any other SQL feature.
| Predicate | Matches | Example |
|---|---|---|
= <> < > <= >= | Comparison | price >= 100 |
BETWEEN a AND b | Inclusive range | age BETWEEN 18 AND 65 |
IN (…) | Any of a set | state IN ('AP','TS') |
LIKE / ILIKE | Pattern (% _); ILIKE case-insensitive | name ILIKE 'kal%' |
~ / ~* | POSIX regex / case-insensitive | code ~ '^[A-Z]{3}$' |
IS [NOT] NULL | Null test | deleted_at IS NULL |
IS [NOT] DISTINCT FROM | NULL-safe equality | a IS NOT DISTINCT FROM b |
Any comparison with NULL yields UNKNOWN. A row passes WHERE only on TRUE.
| AND | T | F | NULL |
|---|---|---|---|
| T | T | F | NULL |
| F | F | F | F |
| NULL | NULL | F | NULL |
| OR | T | F | NULL |
|---|---|---|---|
| T | T | T | T |
| F | T | F | NULL |
| NULL | T | NULL | NULL |
| Trap | Fix |
|---|---|
x = NULL never true | x IS NULL |
NOT IN (…,NULL) → 0 rows | Filter NULLs / NOT EXISTS |
COUNT(col) skips NULLs | COUNT(*) for total rows |
a = b false when both NULL | IS NOT DISTINCT FROM |
'x'||NULL → NULL | COALESCE(col,'') |
NULLS NOT DISTINCT (PG 15+).SUM(CASE WHEN region='East' THEN sales ELSE 0 END). Cleaner:
SUM(sales) FILTER (WHERE region='East').The working toolkit, grouped by what you're transforming. Scalar functions act per row; aggregates collapse many rows into one.
| Function | Result |
|---|---|
a || b | Concatenate |
length(s) | Character count |
lower / upper / initcap | Case transforms |
trim / ltrim / rtrim | Strip whitespace |
substring(s,from,len) | Extract slice |
replace(s,old,new) | Substitute |
split_part(s,delim,n) | Nth field |
position(sub in s) | Index of substring |
format('%s=%s',k,v) | printf-style build |
string_agg(x,',') | aggregate: join rows |
| Function | Result |
|---|---|
now() / current_date | Current moment / day |
age(ts) | Interval from now |
date_trunc('month',ts) | Round down to unit |
extract(year from ts) | Pull a field |
ts + interval '1 day' | Date arithmetic |
to_char(ts,'YYYY-MM') | Format to text |
to_date(s,'DD/MM/YYYY') | Parse from text |
generate_series(a,b,step) | Row-set of values |
| Function | Result |
|---|---|
round(x,n) / trunc | Round / truncate |
ceil / floor / abs | Standard math |
mod(a,b) / a % b | Remainder |
power(a,b) / sqrt | Exponent / root |
random() | 0 ≤ x < 1 |
x::type / cast(x as t) | Convert type |
| Function | Result |
|---|---|
count(*) / count(col) | All rows / non-NULL |
sum / avg / min / max | Standard rollups |
array_agg(x) | Collect into array |
string_agg(x,sep) | Collect into string |
jsonb_agg(x) | Collect into JSON |
bool_and / bool_or | Logical rollup |
percentile_cont(0.5) WITHIN GROUP (ORDER BY x) | Median |
SUM(CASE WHEN paid THEN amt END), write SUM(amt) FILTER (WHERE paid) — clearer and
NULL-clean. Any aggregate accepts FILTER (WHERE …).COUNT(DISTINCT user_id) counts unique values;
plain COUNT(user_id) counts non-NULL occurrences — they differ whenever duplicates exist.Three ways to combine relations: JOINs widen rows (more columns), set operations stack rows (same columns), subqueries nest one query inside another.
| Join | Keeps | Unmatched rows |
|---|---|---|
INNER | Matching pairs only | Dropped both sides |
LEFT OUTER | All left + matches | Left kept, right = NULL |
RIGHT OUTER | All right + matches | Right kept, left = NULL |
FULL OUTER | Everything | Either side padded NULL |
CROSS | Every combination | Cartesian product |
LATERAL | Per-row correlated subquery | Right may reference left's columns |
ON a.id=b.id — explicit, both columns survive.
USING (id) — equal-named cols, merged to one.
NATURAL JOIN — auto-matches all shared names; convenient but fragile — a new shared column silently changes results.
Same column count & compatible types. Default dedups; ALL keeps dups. INTERSECT binds tighter than UNION/EXCEPT.
NOT IN (subquery) returns zero rows if the
subquery yields a single NULL. NOT EXISTS has no such trap and usually plans better for anti-joins.Same arithmetic, opposite shapes. GROUP BY folds many rows into one summary. A window function keeps every row and attaches a value computed over a window of related rows.
| dept | total |
|---|---|
| East | 400 |
| West | 800 |
Employees gone — you can't also select employee; no single value per group.
| emp | dept | sales | dt |
|---|---|---|---|
| Asha | East | 100 | 400 |
| Bina | East | 300 | 400 |
| Deepa | West | 600 | 800 |
Every row survives and knows its group total. Detail and aggregate together.
| Function | Returns | Needs ORDER BY |
|---|---|---|
ROW_NUMBER() | Unique sequential index | Yes |
RANK() / DENSE_RANK() | Rank with / without gaps after ties | Yes |
LAG(x,n) / LEAD(x,n) | Value n rows back / ahead | Yes |
SUM(x) OVER(ORDER BY …) | Running / cumulative total | Yes |
NTILE(k) | Split into k buckets | Yes |
FIRST_VALUE / LAST_VALUE / NTH_VALUE | Value at frame edge / position | Usually |
GROUP BY.
Every row plus a peer-computed value (rank, running total, share, diff-from-avg)? → window OVER (…).
Windowing runs after grouping, so the two can coexist.A Common Table Expression names a subquery up front, so complex logic reads top-to-bottom instead
of inside-out. RECURSIVE extends this to hierarchies and graphs — the standard way to walk a tree.
Name the intermediate result once, reuse it. Chains of CTEs read as sequential steps.
PG may inline a CTE into the outer query for speed. Force "compute once"
with AS MATERIALIZED, or allow inlining with NOT MATERIALIZED. Use MATERIALIZED when the
CTE is expensive and referenced multiple times.
Two parts joined by UNION ALL: a base case (anchor) and a
recursive case that references the CTE itself. PG iterates until the recursive part returns no new rows.
| id | name | mgr |
|---|---|---|
| 1 | Asha | — |
| 2 | Bina | 1 |
| 3 | Chandu | 1 |
| 4 | Deepa | 2 |
| tree | depth |
|---|---|
| Asha | 1 |
| Bina | 2 |
| Chandu | 2 |
| Deepa | 3 |
UNION … CYCLE col SET is_cycle USING path; otherwise track a visited
path array and filter.WITH moved AS (DELETE FROM old WHERE … RETURNING *) INSERT INTO archive SELECT * FROM moved; —
delete and re-insert atomically in one statement.DML changes rows · DDL changes structure · TCL controls transactions · DCL controls access.
UPDATE/DELETE hits every row. Wrap risky writes in
BEGIN; … ROLLBACK; to preview the row count before committing.Levels: READ COMMITTED (default) · REPEATABLE READ · SERIALIZABLE. ACID is per-transaction.
priv = SELECT/INSERT/UPDATE/DELETE/ALL. Row-level security via CREATE POLICY.
An index is a shortcut the planner may use. Knowing which index suits which query — and reading the plan to confirm it's used — is the difference between a 5ms and a 5s query.
| Type | Best for | Example |
|---|---|---|
| B-tree (default) | Equality & range on scalars; sorting | WHERE price > 100 ORDER BY price |
| Hash | Equality only | WHERE id = 42 |
| GIN | Multi-value: jsonb, arrays, full-text | WHERE tags @> '{sql}' |
| GiST | Geometric, ranges, nearest-neighbour | WHERE geom && box |
| BRIN | Huge, naturally-ordered tables (time-series) | WHERE created_at > … |
| SP-GiST | Non-balanced: quadtrees, IP ranges | WHERE inet && '10.0.0.0/8' |
• Function on the column: WHERE lower(email)=… skips a plain email index — index the expression instead.
• Leading wildcard: LIKE '%x' can't use B-tree.
• Type mismatch forces a cast.
• Tiny table: a seq scan is genuinely faster.
• Stale stats: run ANALYZE.
| Plan node | Meaning | Concern |
|---|---|---|
Seq Scan | Read whole table | Bad on large tables w/ selective filter |
Index Scan / Index Only Scan | Use index; "Only" = no heap fetch | Good |
Bitmap Heap Scan | Many matching rows via index | Fine for medium selectivity |
Nested Loop | Probe inner per outer row | Blows up if outer rows large |
Hash Join / Merge Join | Build hash / merge sorted inputs | Good for large joins |
rows=1 but got
rows=50000, its choices rested on wrong stats — run ANALYZE, then match the index to the
actual WHERE/JOIN/ORDER BY columns.PostGIS turns PostgreSQL into a spatial database: new column types, ~300 ST_ functions, and
spatial indexes. Everything starts with two decisions — which geometry type, and geometry vs geography.
| Type | Is | WKT example |
|---|---|---|
POINT | One location | POINT(77.6 14.4) |
LINESTRING | Ordered vertices | LINESTRING(0 0, 1 1, 2 1) |
POLYGON | Ring(s); first = outer | POLYGON((0 0,4 0,4 4,0 4,0 0)) |
MULTIPOINT | Many points | MULTIPOINT((0 0),(1 1)) |
MULTILINESTRING | Many lines | … |
MULTIPOLYGON | Many polygons (islands) | … |
GEOMETRYCOLLECTION | Mixed bag | GEOMETRYCOLLECTION(POINT(0 0),…) |
Append Z (elevation), M (measure), or ZM:
POINTZ, LINESTRINGM, POINTZM.
| Form | Meaning |
|---|---|
| WKT | Well-Known Text (human-readable) |
| WKB | Well-Known Binary (storage/wire) |
| EWKT/EWKB | PostGIS extended: embeds SRID + Z/MSRID=4326;POINT(77.6 14.4) |
Every geometry carries an SRID (Spatial Reference ID) declaring how its numbers map to the
Earth. Stored in the spatial_ref_sys table. The two you'll meet constantly:
| SRID | System | Units | Use |
|---|---|---|---|
4326 | WGS 84 lon/lat | degrees | GPS, global storage, web data |
3857 | Web Mercator | metres (projected) | Tile maps (Google/OSM/Leaflet) |
326xx/327xx | UTM zone N/S | metres | Accurate regional measurement |
ST_Distance(a,b) on
SRID 4326 vs SRID 0 raises Operation on mixed SRID geometries. Reproject one with
ST_Transform(geom, 3857) first. Use ST_SetSRID only to label an unlabelled geometry — it does not reproject.| geometry | geography | |
|---|---|---|
| Math | Flat / Cartesian (planar) | Curved / geodesic (spheroid) |
| Units | Whatever the SRID says — 4326 ⇒ degrees | Always metres |
| Speed | Fast | Slower (great-circle math) |
| Function library | Full (~300 funcs) | Subset: ST_Distance, ST_DWithin, ST_Area, ST_Length, ST_Intersects, ST_Covers… |
| Best for | Projected/local data; heavy processing | Global lon/lat, correct distances out of the box |
geometry column is the #1 production bug.
ST_Distance on two geometry(Point,4326) values returns a number in degrees — a
"1 km" radius search silently matches points 111 km away. Fix: store as geography(Point,4326), or cast
per-query: ST_Distance(a::geography, b::geography) → metres. A degree of longitude is ~111 km at the
equator but shrinks to 0 at the poles, so degree "distance" is meaningless.Nearly all spatial functions are prefixed ST_ (Spatial Type, from the OGC standard).
Grouped by purpose: construct geometries, inspect them, test relationships,
measure, process, and output.
| Function | Builds geometry from |
|---|---|
ST_MakePoint(x,y[,z]) | Raw coordinates (no SRID!) |
ST_Point(x,y,srid) | Coordinates + SRID |
ST_GeomFromText(wkt,srid) | WKT string |
ST_GeogFromText(ewkt) | Text → geography |
ST_GeomFromGeoJSON(json) | GeoJSON |
ST_SetSRID(geom,srid) | Labels SRID (no reproject) |
ST_MakePoint returns SRID 0. Always wrap:
ST_SetSRID(ST_MakePoint(lon,lat),4326).| Function | Returns |
|---|---|
ST_X(pt) / ST_Y(pt) | Coordinate (geometry only) |
ST_GeometryType(g) | ST_Point … |
ST_SRID(g) | The SRID |
ST_NPoints(g) | Vertex count |
ST_IsValid(g) | OGC validity (self-intersect?) |
ST_AsText(g) / ST_AsEWKT(g) | Back to WKT / EWKT |
ST_MakeValid(geom) before area/overlay operations.| Function | True when |
|---|---|
ST_Intersects(a,b) | They share any point (most common test) |
ST_Contains(a,b) / ST_Within(a,b) | a fully contains b / b within a |
ST_Covers(a,b) / ST_CoveredBy | Like contains, boundary-inclusive (preferred) |
ST_DWithin(a,b,d) | Within distance d (metres for geography) — index-aware |
ST_Touches / ST_Crosses / ST_Overlaps | Boundary-only / partial / same-dim overlap |
ST_Disjoint(a,b) | Share no point (not index-aware — negation) |
| Function | Returns |
|---|---|
ST_Distance(a,b) | Min distance (SRID units / m for geog) |
ST_Length(line) | Length |
ST_Area(poly) | Area |
ST_Perimeter(poly) | Boundary length |
ST_Azimuth(a,b) | Bearing in radians |
| Function | Produces |
|---|---|
ST_Buffer(g,d) | Zone within d (geometry) |
ST_Transform(g,srid) | Reproject to new CRS |
ST_Union(g) / ST_Intersection | Merge / overlap |
ST_Centroid / ST_Envelope | Center / bounding box |
ST_Simplify(g,tol) | Fewer vertices |
ST_AsGeoJSON / ST_AsMVT | Web output / vector tiles |
ST_MakePoint (defaults to 0), storing lon/lat in geometry instead of
geography, and passing lat/lon reversed — PostGIS expects (longitude, latitude), i.e. (X, Y).A spatial index can't index irregular shapes directly, so it indexes each geometry's bounding box in a GiST R-Tree. Queries then run in two phases: a fast index filter, then an exact refine.
USING GIST and you get a B-tree. A B-tree tries to index the whole geometry and
errors with index row requires N bytes, maximum size is 8191 — and even when it builds, it can't answer
spatial queries. The USING GIST clause is mandatory for spatial columns.ST_Intersects) runs only on candidates, confirming real matches. → resultIndex-aware functions (ST_Intersects, ST_Contains, ST_Covers,
ST_DWithin, ST_Within …) inject phase 1 automatically. ST_DWithin internally expands
the bounding box by the distance and applies && on both sides — which is why it's vastly faster than
filtering on ST_Distance(...) < d.
| Op | Meaning | Use |
|---|---|---|
&& | Bounding boxes overlap/touch (2-D) | Manual index prefilter; pair with exact test |
<-> | Distance between geometries (KNN) | ORDER BY geom <-> pt — index-ordered nearest-neighbour |
<#> | Distance between bounding boxes (KNN) | Approximate nearest, cheapest |
~ / @ | Box contains / contained by | Pure box containment |
ST_Relate & ST_Distance stay slow. They are not index-aware — no
automatic && prefilter. For a bounding-box search without an exact predicate, add && yourself:
WHERE a.geom && b.geom. For nearest-neighbour, prefer the <-> KNN operator over sorting by
ST_Distance.| Index | When |
|---|---|
| GiST | Default. R-Tree over bounding boxes; handles all geometry & geography |
| SP-GiST | Point-heavy data with strong spatial clustering (space-partitioned) |
| BRIN | Huge tables already physically sorted by location; tiny index, weaker filtering — poor for trajectories |
Seq Scan: run ANALYZE (stale stats), check the function is
index-aware, confirm both sides share one SRID/type, and ensure you didn't wrap the column in a function
(ST_Transform(geom,…) in the WHERE defeats the index — index the transformed expression instead, or store
a second projected column).