Blogging Application Like Medium
Perfect, thanks for sharing the transcript in full. Now I see exactly what the instructor covered. Let me distill this into deep structured notes plus a mindmap-style breakdown you can carry forward. I'll cover database design decisions and caching strategies separately, and highlight where decisions impact scaling, concurrency, delegation, and communication (the broader NFRs).
System Design Decisions for Blogging Platform (Lecture Deep Notes)
1. Database Design Decisions
1.1 Deletion Strategy
- Hard Delete
  - Row physically removed from DB.
  - Pros: space is freed for reuse right away.
  - Cons:
    - No recoverability for users.
    - Loss of historical/audit/legal evidence.
    - Loss of analytics/ML training data.
    - Can be expensive: every delete touches the on-disk B+Tree indexes and may trigger page merges/rebalancing.
- Soft Delete
  - Mark the row with a deleted_at timestamp instead of removing it.
  - Pros:
    - Recoverability ("Trash Bin" feature).
    - Auditing / compliance (legal evidence preserved).
    - Archival possible → ML/analytics on "deleted" data.
    - Deletes become cheap in-place updates (avoids frequent rebalancing).
  - Cons:
    - Every read query needs an extra filter (WHERE deleted_at IS NULL).
    - Storage overhead if not purged.
- Recommended Hybrid
  - Soft delete immediately.
  - Hard delete later in batched jobs (off-peak).
Takeaway: In user-generated platforms, default to soft delete (compliance + recoverability + analytics).
1.2 Large Text Columns (Blog Body)
- Tables store rows together on disk → a scan reads the entire row even if you select only a few columns.
- Problem: Large text fields (5KB–50KB) bloat row size → expensive table scans.
- DB Optimization:
  - For long text (TEXT/BLOB):
    - Many engines store large values in a separate location (exact thresholds vary by engine).
    - Row contains a pointer/reference.
    - Pro: Keeps rows compact → fast scans for metadata queries (title, author, views).
    - Con: Fetching body requires an extra disk read (only when needed).
  - For short text (VARCHAR, CHAR):
    - Stored inline with row.
    - Pro: Efficient if frequently queried together.
- Design Pattern:
  - Split into two tables if necessary:
    blogs_metadata (id, title, author, views, published_at, deleted_at)
    blogs_content (blog_id, body, type)
  - Prevents slowing down common "list views".
Takeaway: Always minimize row size for high-throughput queries. Store heavy fields out-of-line or in separate tables.
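As a rough sketch of the two-table split, the example below uses sqlite3 with the table and column names from the notes. The column types, the `markdown` default, and the helper function names are assumptions for illustration only.

```python
import sqlite3

# Column names follow the notes; the types are assumed for this sketch.
SCHEMA = """
CREATE TABLE blogs_metadata (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    author TEXT NOT NULL,
    views INTEGER NOT NULL DEFAULT 0,
    published_at INTEGER,
    deleted_at INTEGER
);
CREATE TABLE blogs_content (
    blog_id INTEGER PRIMARY KEY REFERENCES blogs_metadata(id),
    body TEXT NOT NULL,
    type TEXT NOT NULL
);
"""


def create_blog(conn, title, author, body, body_type="markdown"):
    """Insert the small metadata row and the heavy body row separately."""
    cur = conn.execute(
        "INSERT INTO blogs_metadata (title, author) VALUES (?, ?)",
        (title, author),
    )
    blog_id = cur.lastrowid
    conn.execute(
        "INSERT INTO blogs_content (blog_id, body, type) VALUES (?, ?, ?)",
        (blog_id, body, body_type),
    )
    return blog_id


def list_view(conn):
    """List page: reads only the compact metadata table, never the body."""
    return conn.execute(
        "SELECT id, title, author, views FROM blogs_metadata"
        " WHERE deleted_at IS NULL ORDER BY id"
    ).fetchall()


def read_body(conn, blog_id):
    """Detail page: the extra lookup is paid only when the body is needed."""
    row = conn.execute(
        "SELECT body FROM blogs_content WHERE blog_id = ?", (blog_id,)
    ).fetchone()
    return row[0] if row else None
```

The schema can be created with `conn.executescript(SCHEMA)`. The list query never touches `blogs_content`, which is the point of the split.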
1.3 Timestamps / Datetime Columns
- Option 1: DATETIME / TIMESTAMP type
  - Pro: Convenient (ORM auto-converts).
  - Con: Per the lecture, comparisons and conversions cost more than plain integer comparisons. Most engines store these types in a compact binary form rather than as strings, so measure before assuming this is the bottleneck.
- Option 2: Epoch Integer (Unix timestamp)
  - Pro: Efficient: integers are CPU-optimized.
  - Pro: Small fixed storage size (e.g., 4–8 bytes; a signed 4-byte value overflows in 2038).
  - Pro: Indexing & range queries are very fast.
  - Con: Not human-readable.
  - Con: Timezone handling must be external (store UTC).
- Option 3: Custom Integer Format (e.g., YYYYMMDD)
  - Pro: Efficient integer.
  - Pro: Human-readable.
  - Pro: Works well for date-range queries.
  - Con: Loses granularity (no time part).
- Real Example (RedBus 2016):
  - Migrated huge tables suffering from a datetime performance bottleneck.
  - Switched to integer YYYYMMDD.
  - Result: saw significant query speedup.
Takeaway:
- Use epoch int for efficiency & large-scale range queries.
- Use DATETIME only if timezone handling & readability are critical.
- For massive datasets with date-based queries, consider the YYYYMMDD integer trick.
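To make the two integer encodings concrete, here is a small sketch of the conversions in Python. The function names are made up for illustration. The epoch helper insists on timezone-aware datetimes, reflecting the "store UTC" advice above.

```python
from datetime import date, datetime, timezone


def to_epoch(dt):
    """Convert an aware datetime to integer seconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return int(dt.timestamp())


def from_epoch(seconds):
    """Convert epoch seconds back to an aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def to_yyyymmdd(d):
    """Encode a date as an integer such as 20160301 (time part is dropped)."""
    return d.year * 10000 + d.month * 100 + d.day


def from_yyyymmdd(n):
    """Decode an integer such as 20160301 back into a date."""
    return date(n // 10000, (n // 100) % 100, n % 100)
```

Because the YYYYMMDD encoding preserves chronological order, a date-range filter becomes a plain integer comparison, for example `WHERE travel_date BETWEEN 20160301 AND 20160331` (the column name here is hypothetical).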
2. Caching Strategies
2.1 What Caching Really Is
- Not just Redis/in-memory.
- Definition: Any mechanism that saves an expensive operation (I/O, network, compute) by reusing precomputed results.
- Examples:
  - Browser cache.
  - CDN (Cloudflare, Akamai).
  - Disk cache on API server.
  - Redis / Memcached (classic).
  - Even the DB buffer pool is a cache.
Mental Model: At every layer (client → CDN → API → DB → disk), ask: Can I avoid recomputing/reading by caching here?
2.2 Standard Remote Cache Pattern
- Flow:
  - Request → Check cache → Hit → Return.
  - Miss → Query DB → Populate cache → Return.
- Pro: Saves DB load for hot content.
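The hit/miss flow above can be sketched in a few lines. In this illustration an in-process dict stands in for the remote cache, and `load_from_db` is a hypothetical callable that performs the real query. The TTL is an added assumption, since the notes do not discuss expiry.

```python
import time


class CacheAside:
    """Check the cache first; on a miss, query the DB and populate the cache."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db      # hypothetical DB accessor
        self._ttl = ttl_seconds        # assumed expiry policy
        self._clock = clock
        self._store = {}               # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]            # hit: return without touching the DB
        value = self._load(key)        # miss: query the DB
        self._store[key] = (self._clock() + self._ttl, value)
        return value
```

With a real remote cache the dict operations would become network calls, but the control flow stays the same.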
2.3 Advanced Techniques
(a) Debouncing / Thundering Herd Protection
- Problem:
  - Cache miss on a popular blog → 10K requests hit DB simultaneously.
  - DB collapses.
- Solution:
  - First request sets a "populating" flag (local variable or Redis key).
  - Other requests wait for the cache to be filled instead of hitting DB.
  - Once filled, all get data from cache.
- Implementation:
  - Local (per server): in-memory flag.
  - Distributed: Redis lock via SET with the NX option (the older SETNX command), with an expiry so a crashed holder does not block others forever.
- Analogy: One person buys 50 movie tickets for the group; others wait.
Prevents redundant work and DB meltdowns under load.
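A per-server version of this idea might look like the sketch below. It uses a `threading.Event` per key as the "populating" flag; the class name and the `load_from_db` callable are hypothetical. A distributed variant would need a shared lock instead, which this sketch does not cover.

```python
import threading


class SingleFlightCache:
    """Only one thread per key loads from the DB; the rest wait for it."""

    def __init__(self, load_from_db):
        self._load = load_from_db   # hypothetical DB accessor
        self._cache = {}
        self._inflight = {}         # key -> Event acting as the populating flag
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._cache:
                    return self._cache[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    # First request for this key: set the populating flag.
                    event = threading.Event()
                    self._inflight[key] = event
            if leader:
                try:
                    value = self._load(key)
                    with self._lock:
                        self._cache[key] = value
                    return value
                finally:
                    # Clear the flag and wake waiters, even if the load failed.
                    with self._lock:
                        del self._inflight[key]
                    event.set()
            # Follower: wait for the leader, then loop to re-check the cache.
            event.wait()
```

If the leader's load raises, waiters wake up, find the cache still empty, and one of them becomes the new leader, so a single failure does not strand the others.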
(b) Load Leaking
- Problem:
  - High cache-hit ratio (>95%) = DB looks underutilized.
  - Ops team scales DB down too aggressively.
  - If cache fails → DB overwhelmed instantly (its own buffer pool is cold).
- Solution:
  - Intentionally leak 5–10% of traffic to DB even on cache hits.
  - Keeps:
    - DB buffer pools warm.
    - DB load baseline steady.
    - System resilient to cache failure.
- Implementation:
  - On cache hit, probabilistically issue an async DB query.
Cache should be an optimization, not a single point of failure.
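One possible shape for the probabilistic leak is sketched below. The class name, the `leak_ratio` default, and the background-thread helper are all assumptions for illustration. Here the leaked query also refreshes the cached value, which is a design choice the lecture does not specify.

```python
import random
import threading


def _run_in_background(fn, *args):
    # Fire-and-forget helper; a real service would likely use a worker pool.
    threading.Thread(target=fn, args=args, daemon=True).start()


class LeakyCache:
    """Serve hits from cache, but send a small fraction of them to the DB too."""

    def __init__(self, load_from_db, leak_ratio=0.05,
                 rng=random.random, submit=_run_in_background):
        self._load = load_from_db    # hypothetical DB accessor
        self._leak_ratio = leak_ratio
        self._rng = rng
        self._submit = submit
        self._cache = {}

    def get(self, key):
        if key in self._cache:
            if self._rng() < self._leak_ratio:
                # Leak: issue the DB read asynchronously; the caller is not
                # blocked and still receives the cached value.
                self._submit(self._refresh, key)
            return self._cache[key]
        return self._refresh(key)

    def _refresh(self, key):
        value = self._load(key)
        self._cache[key] = value
        return value
```

Injecting `rng` and `submit` keeps the leak deterministic in tests while defaulting to random, non-blocking behavior in normal use.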
2.4 Golden Rule
- Cache = performance enhancement.
- System should degrade gracefully if cache dies, not collapse.
3. Key Takeaways for Blogging Platform
- Deletion: Always soft delete first. Batch hard delete later.
- Large Text: Store blog body out-of-line / separate table for performance.
- Datetime: Prefer integers (epoch/YYYYMMDD) for scale. Be mindful of timezone & query patterns.
- Caching:
  - Think beyond Redis.
  - Implement debouncing to prevent thundering herd.
  - Use load leaking to avoid DB "atrophy."
  - Cache at every possible layer (client, CDN, API, DB buffer).
Mindmap Structure (Textual)
System Design (Blogging Platform)
├── Database
│   ├── Deletion
│   │   ├── Hard Delete
│   │   └── Soft Delete → recommended + batch hard delete
│   ├── Large Text
│   │   ├── Inline (short text)
│   │   └── Out-of-line / Separate table (long text)
│   ├── Timestamps
│   │   ├── DATETIME
│   │   ├── Epoch Integer
│   │   └── YYYYMMDD Integer (RedBus example)
│   └── Compliance
│       ├── Archival
│       ├── Auditing (police/legal cases)
│       └── Recoverability
├── Caching
│   ├── Concept: any shortcut to avoid expensive ops
│   ├── Layers
│   │   ├── Client cache
│   │   ├── CDN
│   │   ├── API disk cache
│   │   ├── Redis/Memcached
│   │   └── DB buffer pools
│   ├── Techniques
│   │   ├── Debouncing → avoid thundering herd
│   │   └── Load leaking → keep DB warm
│   └── Principle: Cache = enhancement, not SPOF
└── NFR Alignment
    ├── Scaling → caching layers, batch deletes
    ├── Concurrency → debouncing
    ├── Delegation → CDN, async cache fill
    └── Communication → logs, audit trails
Would you like me to now expand this into a polished visual mindmap (XMind/Markdown-based) so you can study from it, or keep it as text + bullets?