caduh

How Git Actually Works (Under the Hood) — blobs, trees, commits


A gentle tour of Git’s content‑addressed storage: blobs, trees, commits, refs, and the index—so complex commands like rebase, cherry‑pick, and reset feel predictable.

TL;DR

  • Git is a content‑addressed database. Everything (files, folders, commits, tags) is an object addressed by a hash of its content (SHA‑1 by default; SHA‑256 in repos created with --object-format=sha256).
  • A commit points to a tree (snapshot of the repo) plus parent commit(s) and metadata (author, message). Trees point to blobs (file contents) and other trees (subfolders).
  • Branches/HEAD are just pointers (refs) to commits. Moving a pointer (fast‑forward, reset) changes which history you see without touching any stored object; commands that rewrite history (rebase, amend) write new commits first and then move the pointer.
  • The three areas explain every “weird” Git moment: HEAD (last committed snapshot), index (staging area), working tree (your files).
  • “Dangerous” commands are predictable when you think in objects: merge creates a commit with 2 parents, rebase copies commits on top of another base, cherry‑pick copies one commit, reset moves pointers and optionally index/working tree.

The mental model (60 seconds)

commit (hash: c3) ──┐
  tree ─────────────┼─▶ blobs/trees (your files at that commit)
  parent: c2        └─▶ metadata (author, message, timestamp)

c1 ──▶ c2 ──▶ c3   (a simple linear history)
^
ref "main" points here (e.g., c3); HEAD points to "main"
  • A commit is a Merkle node: its hash depends on its tree hash, its parent hash(es), and its metadata (author, committer, timestamps, message). Change any file → new tree hash → new commit hash.
  • Refs (e.g., refs/heads/main) are tiny text files containing a hash (or lines in .git/packed-refs once they have been packed). HEAD is usually a symbolic ref pointing at a branch (ref: refs/heads/main).
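
To make the pointer idea concrete, here is a minimal sketch that builds a throwaway repository and reads HEAD and the branch ref back. The temp directory, the file name, and the "Demo" identity are placeholders chosen for illustration, and the last line assumes the default files-based ref storage.

```shell
# Throwaway repo so nothing here touches real work.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main        # name the initial branch "main"
git config user.name "Demo" && git config user.email "demo@example.com"

echo "v1" > file.txt
git add file.txt && git commit -q -m "c1"

head_target=$(git symbolic-ref HEAD)         # the branch HEAD points at
main_hash=$(git rev-parse refs/heads/main)   # the commit id stored in the branch ref
head_hash=$(git rev-parse HEAD)              # HEAD resolved through the branch

echo "HEAD -> $head_target"
echo "main =  $main_hash"
cat .git/HEAD                                # the symbolic ref as stored on disk
```

HEAD and main resolve to the same commit because HEAD only names the branch; the branch ref is what holds the hash.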

The objects: blobs, trees, commits (and tags)

| Type | What it stores | You can peek with |
|---|---|---|
| blob | Raw file content (no filename) | git cat-file -p <blob> |
| tree | Directory entries (name, mode, type, hash) | git ls-tree <tree> |
| commit | Pointer to a tree + parent(s) + metadata | git cat-file -p <commit> |
| tag | (Annotated tag) message + signature + target object | git cat-file -p <tag> |

Explore your repo

git rev-parse HEAD          # the current commit id
git cat-file -p HEAD        # see commit: tree + parent + message
git ls-tree -r --name-only HEAD   # list files in the commit snapshot

Objects live under .git/objects/ (loose) or in packfiles (.git/objects/pack/*.pack) after GC/clone to save space.
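
A quick way to convince yourself that object ids really are content hashes: a blob id is the hash of a short header ("blob", the byte length, a NUL byte) followed by the raw bytes. The sketch below assumes the default SHA‑1 object format and a sha1sum binary (coreutils; on macOS, shasum plays the same role).

```shell
# Ask Git for the blob id of the six bytes "hello\n" (nothing is written to disk).
git_id=$(printf 'hello\n' | git hash-object --stdin)

# Recompute it by hand: "blob <size>\0" header followed by the content.
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

echo "git:    $git_id"
echo "manual: $manual_id"
```

Both lines print the same id, and any repository that stores those six bytes will use that same id for them.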


The three areas (a.k.a. why “it didn’t commit!”)

HEAD (last commit)  ↔  INDEX (staged)  ↔  WORKING TREE (files)
      git checkout        git add               edit files
  • Working tree: your current files.
  • Index/staging area: what will go into the next commit.
  • HEAD: what’s in the current commit.

Common operations:

  • git add: copy changes from working tree → index.
  • git commit: write an object for the index’s tree, then a commit pointing to it.
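  • git restore --staged path: reset the index entry for path back to HEAD (unstage; your edits stay in the working tree).
  • git restore path: restore the working-tree file from the index, discarding unstaged edits; add --source=HEAD to take the version from HEAD instead.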
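
The operations above can be watched from the outside with two diffs: git diff compares working tree against index, and git diff --cached compares index against HEAD. This is a minimal sketch in a throwaway repo; the file name and identity are placeholders, and git restore assumes Git 2.23 or newer.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"
echo "v1" > notes.txt
git add notes.txt && git commit -q -m "c1"

echo "version 2" > notes.txt                   # edit: working tree now differs from index
unstaged_before=$(git diff --name-only)        # lists notes.txt

git add notes.txt                              # index now holds the new content
staged=$(git diff --cached --name-only)        # lists notes.txt
unstaged_after_add=$(git diff --name-only)     # empty: working tree matches index

git restore --staged notes.txt                 # index entry goes back to HEAD's version
staged_after_restore=$(git diff --cached --name-only)   # empty again
content=$(cat notes.txt)                       # the edit is still in the file
```

Nothing in this sequence changes HEAD; only the index moves back and forth between the committed and the edited content.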

What a commit really contains

Example (abridged):

$ git cat-file -p HEAD
tree a9f3e2...
parent 7c1b6d...
author You <you@example.com> 1693920000 +0200
committer You <you@example.com> 1693920000 +0200

Add feature X

Now inspect the tree:

$ git ls-tree a9f3e2
100644 blob 3b18e5  README.md
040000 tree 8c2a1a  src

Each tree entry lists a mode (100644 regular file, 100755 executable, 040000 subtree), an object type, the object's hash, and a name.
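
The same walk from commit to tree to blob can be scripted with revision syntax instead of copying hashes by hand. A minimal sketch in a throwaway repo (file names and identity are illustrative): HEAD^{tree} names the commit's root tree, and HEAD:README.md names the blob that tree records for that path.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"
mkdir src
echo "# demo" > README.md
echo "placeholder" > src/app.txt
git add . && git commit -q -m "Add feature X"

tree_id=$(git rev-parse 'HEAD^{tree}')        # root tree of the commit
blob_id=$(git rev-parse HEAD:README.md)       # blob reached through that tree

git cat-file -t "$tree_id"                    # object type of the tree id
git cat-file -t "$blob_id"                    # object type of the blob id
git ls-tree "$tree_id"                        # one blob entry (README.md), one tree entry (src)
git cat-file -p "$blob_id"                    # the file's bytes, with no filename attached
```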


Branches & HEAD (moving pointers)

  • A branch is just a ref pointing to a commit. New commits advance the branch pointer.
  • Fast‑forward merge: move the pointer forward to the other commit (no new commit).
  • Merge commit: create a new commit with two parents when histories diverged.
Before merge:
main:     A ── B
           ╲
feature:    C

After merge commit M:
main:     A ── B ── M
           ╲       ╱
feature:    C ────╯     (M has parents B and C)
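
A small sketch that builds a diverged history like this one and checks the parents of the resulting merge commit. Branch and file names are illustrative, the repo is a throwaway, and git switch assumes Git 2.23 or newer.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo" && git config user.email "demo@example.com"

echo a > a.txt && git add a.txt && git commit -q -m "A"
git switch -q -c feature
echo c > c.txt && git add c.txt && git commit -q -m "C"
git switch -q main
echo b > b.txt && git add b.txt && git commit -q -m "B"

git merge -q --no-edit feature                # histories diverged, so Git writes commit M
parents=$(git rev-list --parents -n 1 HEAD)   # "<M> <first parent> <second parent>"
parent_count=$(( $(echo "$parents" | wc -w) - 1 ))
first_parent=$(git rev-parse 'HEAD^1')        # where main was before the merge (B)
second_parent=$(git rev-parse 'HEAD^2')       # tip of the merged branch (C)
echo "M has $parent_count parents"
```

Had main still been at A, the same git merge would have fast‑forwarded and written no new commit at all.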

Detached HEAD:

git checkout <commit>   # HEAD now points directly to a commit, not a branch

Create a branch to keep work:

git switch -c experiment

Rebase vs merge vs cherry‑pick (copying commits)

  • Merge: keeps both histories; one new commit ties them together.
  • Rebase: copy commits onto a new base (new hashes). Linear history, new commit ids.
  • Cherry‑pick: copy one commit onto the current branch.
Rebase (concept):
Before:
main:     A ─ B ─ D
               ╲
feature:        C1 ─ C2

git rebase main   (run while on feature)

After:
main:     A ─ B ─ D
                   ╲
feature:            C1' ─ C2'   (same changes, new base → new hashes)

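Because rebase rewrites history, a branch you have already pushed will no longer fast‑forward on the remote. Update it with git push --force-with-lease, which refuses to overwrite remote commits you have not fetched.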
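
To see "same changes, new hashes" directly, the sketch below rebases a one‑commit branch and compares commit ids and patch ids before and after. It runs in a throwaway repo with illustrative names; git patch-id --stable hashes the diff itself, so it is used here as a stand‑in for "the change is the same".

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo" && git config user.email "demo@example.com"

echo a > a.txt && git add a.txt && git commit -q -m "A"
git switch -q -c feature
echo c1 > c1.txt && git add c1.txt && git commit -q -m "C1"
git switch -q main
echo d > d.txt && git add d.txt && git commit -q -m "D"
git switch -q feature

old_tip=$(git rev-parse HEAD)                 # C1, based on A
old_patch=$(git show "$old_tip" | git patch-id --stable | cut -d' ' -f1)

git rebase -q main                            # replay C1 on top of D as C1'

new_tip=$(git rev-parse HEAD)                 # C1', a different commit object
new_patch=$(git show "$new_tip" | git patch-id --stable | cut -d' ' -f1)
new_parent=$(git rev-parse 'HEAD~1')          # should now be D, the tip of main
```

The old commit is not deleted by this; it simply stops being reachable from the branch.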


git reset demystified (what moves where)

| Command | Moves branch? | Moves index? | Moves working tree? | Use it for |
|---|---|---|---|---|
| git reset --soft X | ✅ to X | ❌ | ❌ | “Undo last commit but keep changes staged” |
| git reset --mixed X (default) | ✅ | ✅ (to X) | ❌ | “Unstage changes; keep edits” |
| git reset --hard X | ✅ | ✅ | ✅ | Discard everything and go to X |

X can be anything that resolves to a commit: a hash, a branch name, or a relative expression such as HEAD~1.
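
The three modes can be compared side by side by resetting to the same commit three times and checking what each one left behind. A minimal sketch in a throwaway repo; the file name and contents are placeholders.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"
echo "v1" > f.txt && git add f.txt && git commit -q -m "c1"
echo "v2 longer" > f.txt && git add f.txt && git commit -q -m "c2"
c1=$(git rev-parse 'HEAD~1')
c2=$(git rev-parse HEAD)

git reset -q --soft "$c1"                      # branch moves; index and file keep c2's content
soft_staged=$(git diff --cached --name-only)   # f.txt shows up as staged

git reset -q "$c2" && git reset -q --mixed "$c1"   # start again from c2, then mixed reset
mixed_staged=$(git diff --cached --name-only)  # empty: index matches c1
mixed_unstaged=$(git diff --name-only)         # f.txt: the edit survives in the working tree

git reset -q --hard "$c1"                      # index and working tree both match c1
hard_content=$(cat f.txt)
```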


Merges & conflicts (index stages)

When Git can’t auto‑merge, files enter conflict with index stages:

  • Stage 1: base version
  • Stage 2: --ours (current)
  • Stage 3: --theirs (incoming)

Resolve, then:

git add <file>
git commit   # completes the merge
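
The index stages are visible while a conflict is open. The sketch below forces a conflict in a throwaway repo (names and contents are illustrative), lists the staged versions with git ls-files -u, reads each one with the :N:path syntax, and then completes the merge.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo" && git config user.email "demo@example.com"

echo "base" > f.txt && git add f.txt && git commit -q -m "base"
git switch -q -c feature
echo "their change" > f.txt && git commit -q -am "theirs"
git switch -q main
echo "our edit" > f.txt && git commit -q -am "ours"

git merge --no-edit feature >/dev/null 2>&1 || true   # expected to stop on the conflict

stage_count=$(git ls-files -u | wc -l)        # one index entry per stage for f.txt
stage1=$(git show :1:f.txt)                   # common ancestor
stage2=$(git show :2:f.txt)                   # ours (current branch)
stage3=$(git show :3:f.txt)                   # theirs (incoming branch)

echo "resolved" > f.txt
git add f.txt                                 # collapses the three stages into stage 0
git commit -q --no-edit                       # writes the merge commit
remaining=$(git ls-files -u | wc -l)
```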

Plumbing: build a commit by hand (tiny demo)

echo "hello" > hello.txt
git hash-object -w hello.txt             # writes a blob, prints its id
git update-index --add hello.txt         # stage it (index)
tree_id=$(git write-tree)                # write tree from index
parent=$(git rev-parse --verify HEAD 2>/dev/null || true)   # empty in a brand-new repo
commit_id=$(echo "first commit" | git commit-tree "$tree_id" ${parent:+-p "$parent"})
# commit-tree prints the new commit id; point a branch at it:
git update-ref refs/heads/main "$commit_id"

This is what git add + git commit automate for you.


Where space savings come from

  • Git stores snapshots, but uses delta compression in packfiles, so repeated content across versions is efficient.
  • git gc repacks loose objects; git count-objects -vH shows size.
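  • Renames are not stored; Git detects them heuristically at diff time by comparing content similarity. Identical content always has an identical blob hash, so a renamed or duplicated file costs no extra blob.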
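
The "identical content, identical blob" point is easy to check: commit two files with the same bytes and compare their blob ids. A minimal sketch in a throwaway repo with placeholder names; the loose-object count assumes nothing has been packed yet.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

echo "same bytes" > one.txt
cp one.txt two.txt
git add . && git commit -q -m "two names, one blob"

id_one=$(git rev-parse HEAD:one.txt)
id_two=$(git rev-parse HEAD:two.txt)
loose_count=$(git count-objects -v | awk '/^count:/ {print $2}')

echo "one.txt -> $id_one"
echo "two.txt -> $id_two"
echo "loose objects: $loose_count"            # one shared blob + one tree + one commit
```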

Safety nets: reflog & friends

  • Reflog records where refs/HEAD pointed recently in your local repo. If you lost commits by resetting, rebasing, or deleting a branch:
git reflog
git checkout -b rescue <old-hash-from-reflog>
  • git fsck --lost-found can locate orphaned objects.
  • Many commands accept the @{-1} syntax (previous branch).
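
A sketch of the rescue in practice: throw a commit away with a hard reset, then find it again through the reflog. It runs in a throwaway repo with illustrative names; the lost variable exists only so the result can be checked, since in a real recovery you would read the hash from git reflog.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"
echo "first" > f.txt && git add f.txt && git commit -q -m "c1"
echo "second edit" > f.txt && git commit -q -am "c2"
lost=$(git rev-parse HEAD)                    # c2, recorded here only for verification

git reset -q --hard 'HEAD~1'                  # c2 is no longer reachable from any branch
git reflog                                    # newest entry first: the reset, then the c2 commit

previous=$(git rev-parse 'HEAD@{1}')          # where HEAD pointed one move ago
git branch rescue "$previous"                 # a new branch makes the commit reachable again
```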

Quick mapping: commands → data structure moves

| You run… | Under the hood it… |
|---|---|
| git add . | updates index entries (staged snapshot) |
| git commit | writes a tree from index; writes a commit pointing to that tree (+ parent) |
| git merge | computes new tree; writes a merge commit with 2 parents |
| git rebase | copies commits, writing new ones on a new base; moves branch ref |
| git checkout/switch | moves HEAD (and updates index/working tree to match) |
| git tag -a v1 | writes a tag object pointing at a commit + message/signature |


Quick checklist (become unflappable)

  • [ ] Think snapshots, not diffs: commits point to trees.
  • [ ] Remember the three areas: HEAD ↔ index ↔ working tree.
  • [ ] Branches are pointers; moving them is cheap and reversible (reflog).
  • [ ] Rebase/cherry‑pick copy commits (new ids). Merge adds a tie‑point.
  • [ ] Use git cat-file, ls-tree, reflog to debug anything scary.

One‑minute hands‑on plan

  1. Run git cat-file -p HEAD and git ls-tree -r --name-only HEAD.
  2. Make a tiny change; inspect git diff (working tree), then git add it and inspect git diff --cached (index).
  3. Commit; run git log --graph --oneline --decorate to see pointers move.
  4. Try git switch -c demo, add a commit, then git rebase main to watch hashes change.
  5. Create a conflict on purpose; resolve and complete the merge; inspect .git/MERGE_HEAD and stages.