Blogging Application Like Medium

These are deep structured notes, with a mindmap-style breakdown, distilled from the lecture transcript. Database design decisions and caching strategies are covered separately, highlighting where each decision impacts scaling, concurrency, delegation, and communication (the broader NFRs).

🏛 System Design Decisions for Blogging Platform (Lecture Deep Notes)

1. Database Design Decisions

1.1 Deletion Strategy

Hard Delete
- Row physically removed from DB.
- ✅ Pros: storage is freed for reuse right away (how soon disk space is actually returned depends on the engine).
- ❌ Cons:
  - No recoverability for users.
  - Loss of historical/audit/legal evidence.
  - Loss of analytics/ML training data.
  - Can be expensive: every delete also updates the table's indexes, and B+Tree pages may need merging or rebalancing on disk.
Soft Delete
- Mark the row as deleted by setting a deleted_at timestamp; the row itself stays in the table.
- ✅ Pros:
  - Recoverability ("Trash Bin" feature).
  - Auditing / compliance (legal evidence preserved).
  - Archival possible → ML/analytics on “deleted” data.
  - Deletes become single-column updates, which typically avoids the index rebalancing that physical deletes can trigger.
- ❌ Cons:
  - Every read query needs an extra filter (WHERE deleted_at IS NULL).
  - Storage overhead if not purged.
Recommended Hybrid
- Soft delete immediately.
- Hard delete later in batched jobs (off-peak).

🔑 Takeaway: In user-generated platforms, always default to soft delete (compliance + recoverability + analytics).
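
The hybrid approach above can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the lecture's implementation. The blogs table, the helper names (soft_delete, restore, list_active, purge_batch) and the batch size are assumptions made for the example.

```python
import sqlite3
import time

# In-memory database with a hypothetical blogs table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blogs (id INTEGER PRIMARY KEY, title TEXT, deleted_at INTEGER)"
)
conn.executemany(
    "INSERT INTO blogs (id, title, deleted_at) VALUES (?, ?, NULL)",
    [(1, "first"), (2, "second"), (3, "third")],
)


def soft_delete(conn, blog_id, now=None):
    """Mark the row as deleted; the data stays recoverable."""
    now = int(time.time()) if now is None else now
    conn.execute(
        "UPDATE blogs SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now, blog_id),
    )


def restore(conn, blog_id):
    """'Trash bin' recovery: clear the deletion marker."""
    conn.execute("UPDATE blogs SET deleted_at = NULL WHERE id = ?", (blog_id,))


def list_active(conn):
    """Every read path has to filter out soft-deleted rows."""
    rows = conn.execute(
        "SELECT id FROM blogs WHERE deleted_at IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]


def purge_batch(conn, older_than, batch_size=1000):
    """Hard-delete up to batch_size rows that were soft-deleted before older_than."""
    cur = conn.execute(
        "DELETE FROM blogs WHERE id IN ("
        " SELECT id FROM blogs"
        " WHERE deleted_at IS NOT NULL AND deleted_at < ?"
        " LIMIT ?)",
        (older_than, batch_size),
    )
    return cur.rowcount
```

An off-peak job would call purge_batch repeatedly until it returns 0, so each transaction stays small.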

1.2 Large Text Columns (Blog Body)

Row-oriented tables store all of a row's columns together on disk → the engine reads the entire row even if you select only a few columns.
Problem: Large text fields (5KB–50KB) bloat row size → expensive table scans.
DB Optimization:
- For long text (TEXT/BLOB):
  - DB stores large content in separate location.
  - Row contains a pointer/reference.
  - ✅ Keeps rows compact → fast scans for metadata queries (title, author, views).
  - ❌ Fetching body requires an extra disk read (only when needed).
- For short text (VARCHAR, CHAR):
  - Stored inline with row.
  - ✅ Efficient if frequently queried together.
Design Pattern:
- Split into two tables if necessary:
  - blogs_metadata (id, title, author, views, published_at, deleted_at)
  - blogs_content (blog_id, body, type)
- Prevents slowing down common “list views”.

🔑 Takeaway: Always minimize row size for high-throughput queries. Store heavy fields out-of-line or in separate tables.
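
A rough sketch of the two-table split, again using sqlite3 purely for illustration. The table and column names come from the notes above; the column types, the 'markdown' value for type, and the helper function names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE blogs_metadata (
        id           INTEGER PRIMARY KEY,
        title        TEXT NOT NULL,
        author       TEXT NOT NULL,
        views        INTEGER NOT NULL DEFAULT 0,
        published_at INTEGER,
        deleted_at   INTEGER
    );
    CREATE TABLE blogs_content (
        blog_id INTEGER PRIMARY KEY REFERENCES blogs_metadata(id),
        body    TEXT NOT NULL,
        type    TEXT NOT NULL
    );
    """
)


def create_blog(conn, blog_id, title, author, body, published_at):
    """Insert metadata and body in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO blogs_metadata (id, title, author, published_at)"
            " VALUES (?, ?, ?, ?)",
            (blog_id, title, author, published_at),
        )
        conn.execute(
            "INSERT INTO blogs_content (blog_id, body, type)"
            " VALUES (?, ?, 'markdown')",  # 'markdown' is an assumed type value
            (blog_id, body),
        )


def list_titles(conn):
    """List view: touches only the compact metadata table."""
    rows = conn.execute(
        "SELECT id, title FROM blogs_metadata"
        " WHERE deleted_at IS NULL ORDER BY published_at DESC"
    )
    return rows.fetchall()


def read_body(conn, blog_id):
    """Detail view: the heavy body is read only when a post is opened."""
    row = conn.execute(
        "SELECT body FROM blogs_content WHERE blog_id = ?", (blog_id,)
    ).fetchone()
    return row[0] if row else None
```

The list query never reads blogs_content, so the size of the bodies does not affect it.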

1.3 Timestamps / Datetime Columns

Option 1: DATETIME / TIMESTAMP type
- ✅ Convenient (ORM auto-converts).
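- ❌ Extra conversion work (parsing date literals, time-zone conversion for TIMESTAMP) → some overhead versus plain integer comparison. Most engines store these types in binary form, not as strings.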
Option 2: Epoch Integer (Unix timestamp)
- ✅ Efficient: integers are CPU-optimized.
- ✅ Small fixed storage size (4 bytes, or 8 bytes to stay safe past 2038, when signed 32-bit epoch seconds overflow).
- ✅ Indexing & range queries super fast.
- ❌ Not human-readable.
- ❌ Timezone handling must be external (store UTC).
Option 3: Custom Integer Format (e.g., YYYYMMDD)
- ✅ Efficient integer.
- ✅ Human-readable.
- ✅ Works well for date-range queries.
- ❌ Loses granularity (no time part).
Real Example (RedBus 2016):
- Migrated huge tables suffering from datetime perf bottleneck.
- Switched to integer YYYYMMDD.
- ✅ Saw significant query speedup.

🔑 Takeaway:

Use epoch int for efficiency & large-scale range queries.
Use native DATETIME/TIMESTAMP types only when readability and built-in date/time-zone functions matter more than raw speed.
For massive datasets with date-based queries, consider YYYYMMDD integer trick.
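
To make the integer options concrete, here is a small sketch of converting between datetimes, epoch seconds, and the YYYYMMDD integer. The function names are illustrative. It assumes timezone-aware datetimes and that the YYYYMMDD value uses the UTC calendar date.

```python
from datetime import datetime, timedelta, timezone


def to_epoch(dt):
    """Aware datetime -> whole seconds since the Unix epoch (UTC)."""
    return int(dt.timestamp())


def from_epoch(seconds):
    """Epoch seconds -> aware datetime in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def to_yyyymmdd(dt):
    """Aware datetime -> integer such as 20240315, using the UTC calendar date."""
    d = dt.astimezone(timezone.utc)
    return d.year * 10000 + d.month * 100 + d.day


published = datetime(2024, 3, 15, 18, 30, tzinfo=timezone.utc)
epoch = to_epoch(published)
day = to_yyyymmdd(published)

# A date-range filter on the integer column is a plain numeric comparison.
in_march = 20240301 <= day <= 20240331

# Time-zone pitfall: 23:30 at UTC-5 is already the next day in UTC.
late_evening = datetime(2024, 3, 15, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
```

Because the digits are ordered year, month, day, numeric order on the YYYYMMDD integer matches calendar order, which is what makes the range comparison valid.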

2. Caching Strategies

2.1 What Caching Really Is

Not just Redis/in-memory.
Definition: Any mechanism that saves an expensive operation (I/O, network, compute) by reusing precomputed results.
Examples:
- Browser cache.
- CDN (Cloudflare, Akamai).
- Disk cache on API server.
- Redis / Memcached (classic).
- Even DB buffer pool = cache.

🔑 Mental Model: At every layer (client → CDN → API → DB → disk), ask: Can I avoid recomputing/reading by caching here?

2.2 Standard Remote Cache Pattern

Flow:
- Request → Check cache → Hit → Return.
- Miss → Query DB → Populate cache → Return.
✅ Saves DB load for hot content.
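
The flow above (often called cache-aside) can be sketched as follows. TTLCache is an in-process stand-in for a remote cache such as Redis. FAKE_DB, the stats counter, the key format, and the 60-second TTL are assumptions for the example.

```python
import time


class TTLCache:
    """Stand-in for a remote cache: a dict whose entries expire."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)


FAKE_DB = {1: "body of blog 1"}  # pretend database
stats = {"db_reads": 0}


def load_from_db(blog_id):
    stats["db_reads"] += 1
    return FAKE_DB.get(blog_id)


def get_blog(cache, blog_id):
    key = f"blog:{blog_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached  # hit: no DB work
    value = load_from_db(blog_id)  # miss: query the DB
    if value is not None:
        cache.set(key, value)  # populate for later requests
    return value
```

In this sketch missing blogs are not cached, so repeated requests for an unknown id reach the DB every time.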

2.3 Advanced Techniques

(a) Debouncing / Thundering Herd Protection

Problem:
- Popular blog cache miss → 10K requests hit DB simultaneously.
- DB collapses.
Solution:
- The first request to miss sets a "populating" flag (a local variable or a Redis key) and goes to the DB.
- Other requests wait for cache to be filled instead of hitting DB.
- Once filled, all get data from cache.
Implementation:
- Local (per server): in-memory flag.
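- Distributed: Redis lock via SET key value NX EX ttl. The older SETNX command also works, but the lock then needs a separately set expiry so a crashed populator cannot block everyone.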
Analogy: One person buys 50 movie tickets for group; others wait.

🔑 Prevents redundant work and DB meltdowns under load.
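
A minimal sketch of the local (per-server) variant, assuming a threaded server. The first thread to miss becomes the "leader" and loads the value; the others wait on an event instead of querying the DB. SingleFlightCache and slow_loader are illustrative names; a distributed version would replace the in-memory flag with a Redis lock.

```python
import threading
import time


class SingleFlightCache:
    """Per-process cache where only one thread loads a missing key."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._inflight = {}  # key -> Event set when the load finishes
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    # This thread sets the "populating" flag.
                    event = threading.Event()
                    self._inflight[key] = event
            if not leader:
                event.wait()  # wait for the leader, then re-check the cache
                continue
            try:
                value = self._loader(key)  # the only DB call for this miss
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                # Clear the flag even if the load failed, so a waiter can retry.
                with self._lock:
                    del self._inflight[key]
                event.set()


calls = []


def slow_loader(key):
    """Pretend DB query that takes a while."""
    calls.append(key)
    time.sleep(0.05)
    return f"value for {key}"
```

If the leader's load raises, the waiters wake up, find the cache still empty, and one of them becomes the new leader.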

(b) Load Leaking

Problem:
- A high cache-hit ratio (>95%) means the DB sees very little traffic and looks underutilized.
- Ops team scales DB down too aggressively.
- If cache fails → DB overwhelmed instantly (cold caches).
Solution:
- Intentionally leak 5–10% of traffic to DB even on cache hits.
- Keeps:
  - DB buffer pools warm.
  - DB load baseline steady.
  - System resilient to cache failure.
Implementation:
- On a cache hit, issue an asynchronous DB query with a small probability (matching the 5–10% leak ratio); the response is still served from the cache.

🔑 Cache should be an optimization, not a single point of failure.
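
One possible shape for load leaking is sketched below. LeakyReader, its parameters, and the use of a small thread pool for the asynchronous query are assumptions for the example; the cache is any dict-like object.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class LeakyReader:
    """Serve from cache, but send a fraction of hits to the DB anyway."""

    def __init__(self, cache, load_from_db, leak_ratio=0.05, rng=None):
        self.cache = cache  # dict-like cache
        self.load_from_db = load_from_db
        self.leak_ratio = leak_ratio  # e.g. 0.05 leaks about 5% of hits
        self.rng = rng or random.Random()
        self.pool = ThreadPoolExecutor(max_workers=2)

    def get(self, key):
        value = self.cache.get(key)
        if value is not None:
            if self.rng.random() < self.leak_ratio:
                # Fire-and-forget: the caller does not wait for this query.
                self.pool.submit(self._refresh, key)
            return value
        return self._refresh(key)  # ordinary miss: load synchronously

    def _refresh(self, key):
        value = self.load_from_db(key)
        if value is not None:
            self.cache[key] = value  # leaked reads also refresh the entry
        return value
```

The leaked query runs in the background, so hit latency is unchanged while the DB keeps seeing a steady trickle of real traffic.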

2.4 Golden Rule

Cache = performance enhancement.
System should degrade gracefully if cache dies, not collapse.
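
As a rough illustration of graceful degradation, the read path can treat any cache failure as a miss and fall back to the DB. CacheUnavailable is a hypothetical exception standing in for whatever error a real cache client raises; cache_get, cache_set, and load_from_db are caller-supplied functions.

```python
import logging

logger = logging.getLogger("blog.cache")


class CacheUnavailable(Exception):
    """Hypothetical error raised by the cache client when the cache is down."""


def get_with_fallback(cache_get, cache_set, load_from_db, key):
    """Read through the cache, but never fail just because the cache failed."""
    try:
        cached = cache_get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        logger.warning("cache read failed for %s; falling back to DB", key)

    value = load_from_db(key)

    try:
        if value is not None:
            cache_set(key, value)
    except CacheUnavailable:
        logger.warning("cache write failed for %s; serving DB result", key)
    return value
```

The request is slower when the cache is down, but it still succeeds as long as the DB can absorb the load.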

3. Key Takeaways for Blogging Platform

Deletion: Always soft delete first. Batch hard delete later.
Large Text: Store blog body out-of-line / separate table for performance.
Datetime: Prefer integers (epoch/YYYYMMDD) for scale. Be mindful of timezone & query patterns.
Caching:
- Think beyond Redis.
- Implement debouncing to prevent thundering herd.
- Use load leaking to avoid DB “atrophy.”
- Cache at every possible layer (client, CDN, API, DB buffer).

🧩 Mindmap Structure (Textual)

System Design (Blogging Platform)
├── Database
│   ├── Deletion
│   │   ├── Hard Delete
│   │   └── Soft Delete → recommended + batch hard delete
│   ├── Large Text
│   │   ├── Inline (short text)
│   │   └── Out-of-line / Separate table (long text)
│   ├── Timestamps
│   │   ├── DATETIME
│   │   ├── Epoch Integer
│   │   └── YYYYMMDD Integer (RedBus example)
│   └── Compliance
│       ├── Archival
│       ├── Auditing (police/legal cases)
│       └── Recoverability
├── Caching
│   ├── Concept: any shortcut to avoid expensive ops
│   ├── Layers
│   │   ├── Client cache
│   │   ├── CDN
│   │   ├── API disk cache
│   │   ├── Redis/Memcached
│   │   └── DB buffer pools
│   ├── Techniques
│   │   ├── Debouncing → avoid thundering herd
│   │   └── Load leaking → keep DB warm
│   └── Principle: Cache = enhancement, not SPOF
└── NFR Alignment
    ├── Scaling → caching layers, batch deletes
    ├── Concurrency → debouncing
    ├── Delegation → CDN, async cache fill
    └── Communication → logs, audit trails

