v1.2.0-stable

ENGINEERING
DATA SOVEREIGNTY.

Reverse-engineers the Notion Internal API to automate migration.
Recursive crawling, checkpoint resume, and document merging, with automated upload to Confluence.

zsh - node-1
➜ ~ python crawl_notion_api.py
[INFO] Initializing reverse API connection...
[INFO] Authenticated as user_8a2d...
[INFO] Found 517 pages in workspace.
> Processing Page: 'API Specification' (ID: a1b2...)
> Recursively fetching children... [OK]

// README.md: THE_WHY

"In the era of AI & Vibe Coding, building tools is faster than ever. But Token costs and Time are real."

PERFORMANCE_METRICS

benchmark_result.json
MANUAL_MIGRATION (est.)
Time: 129.0 hrs
NOTION_CRAWLER (v1.2)
Time: 0.5 hrs

> Speedup Factor: 258x
> Data Integrity: 100%
> Memory Usage: <150MB (Streaming)

Exponential Efficiency

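Based on a real migration test of 500+ pages. Manual copying is extremely time-consuming (129 hours by the estimate above) and prone to breaking the page hierarchy. Notion Crawler uses parallel processing and automated conversion to cut weeks of work down to under 1 hour.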

517
PAGES PROCESSED
0
DATA LOSS
258x
FASTER

SYSTEM_ARCHITECTURE

Notion Crawler System Architecture Diagram showing Recursive Crawler, Markdown Transpiler, and Confluence Integrator data flow

Recursive Crawler Engine

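  • Reverse Internal API: calls `loadPageChunk` directly to fetch structured Block data, about 10x faster than Playwright.
  • Recursive traversal: automatically resolves sub-pages and Database Rows, accurately reconstructing hierarchies of unlimited depth.
  • Anti-detection: implements Exponential Backoff with Jitter (random delay) to avoid 429 Rate Limit responses.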
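The backoff bullet above can be sketched as follows. This is a minimal illustration of exponential backoff with "full jitter", not the project's actual code: `RateLimited`, `backoff_delays`, and `fetch_with_backoff` are hypothetical names, and the `base` and `cap` values are assumptions.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by the caller's fetch function on HTTP 429."""


def backoff_delays(retries, base=1.0, cap=60.0, rng=random.random):
    """Yield jittered delays: uniform in [0, min(cap, base * 2**attempt))."""
    for attempt in range(retries):
        yield rng() * min(cap, base * 2 ** attempt)


def fetch_with_backoff(fetch, retries=5, sleep=time.sleep, **kwargs):
    """Call fetch(); on RateLimited, wait a jittered delay and try again."""
    for delay in backoff_delays(retries, **kwargs):
        try:
            return fetch()
        except RateLimited:
            sleep(delay)
    # Final attempt: if it is still rate limited, let the error propagate.
    return fetch()
```

Passing `sleep` and `rng` as parameters keeps the retry logic testable without real waiting, in line with the pure-function testing approach described in the Test Suite section.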

Markdown Transpiler

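  • AST parsing: converts the raw JSON into an abstract syntax tree so that complex elements such as Table, Callout, and Code Block render accurately.
  • Knowledge Stitcher: automatically detects and stitches together scattered API parameter pages (Input/Output/Schema), reassembling them into a single source of truth.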
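To make the transpiling step concrete, here is a small sketch of rendering block JSON to Markdown. The block shape is an assumption: a `type` plus `properties.title` holding `[text, decorations]` segments. The function names `rich_text_to_md` and `block_to_md` are hypothetical, and only a few block types are covered.

```python
FENCE = "`" * 3  # built programmatically to keep this example easy to embed


def rich_text_to_md(segments):
    """Render [[text, [[decoration, ...], ...]], ...] segments to Markdown."""
    wrappers = {"b": "**", "i": "*", "c": "`"}
    out = []
    for segment in segments:
        text = segment[0]
        decorations = segment[1] if len(segment) > 1 else []
        for deco in decorations:
            if deco[0] == "a":          # link: ["a", url]
                text = f"[{text}]({deco[1]})"
            elif deco[0] in wrappers:   # bold, italic, inline code
                mark = wrappers[deco[0]]
                text = f"{mark}{text}{mark}"
        out.append(text)
    return "".join(out)


def block_to_md(block):
    """Render one block dict to a Markdown string (assumed type names)."""
    props = block.get("properties", {})
    title = rich_text_to_md(props.get("title", []))
    kind = block.get("type")
    if kind == "header":
        return f"# {title}"
    if kind == "bulleted_list":
        return f"- {title}"
    if kind == "callout":
        return f"> {title}"
    if kind == "code":
        lang = rich_text_to_md(props.get("language", [])).lower()
        return f"{FENCE}{lang}\n{title}\n{FENCE}"
    return title  # plain text and unknown types fall through unchanged
```

For example, `block_to_md({"type": "header", "properties": {"title": [["API "], ["Spec", [["b"]]]]}})` returns `# API **Spec**`.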

Resilience & Failover

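  • Dual-Domain Failover: prefers `notion.so` and automatically switches to the `notion.site` fallback domain on failure, for 99.9% availability.
  • Connection Pooling: uses `requests.Session` to maintain a TCP connection pool, reducing TLS handshake overhead and speeding up large crawls.
  • Granular Checkpoints: records the status of each Page ID in SQLite/JSON, enabling 100% checkpoint resume.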
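A per-page checkpoint store along these lines could look like the sketch below. It uses the standard-library `sqlite3` module. The `CheckpointStore` class, the table layout, and the status strings are assumptions for illustration, not the project's real schema.

```python
import sqlite3


class CheckpointStore:
    """Track per-page status so an interrupted crawl can resume."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "page_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )

    def mark(self, page_id, status):
        # Insert or overwrite the row, then commit right away so a crash
        # loses at most the page that was in flight.
        self.db.execute(
            "INSERT OR REPLACE INTO pages (page_id, status) VALUES (?, ?)",
            (page_id, status),
        )
        self.db.commit()

    def pending(self, page_ids):
        """Return the ids that still need work, preserving input order."""
        rows = self.db.execute(
            "SELECT page_id FROM pages WHERE status = 'success'"
        )
        done = {row[0] for row in rows}
        return [pid for pid in page_ids if pid not in done]
```

On restart, a crawler would pass the full page list to `pending()` and process only what comes back, so failed and never-attempted pages are retried while successful ones are skipped.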

Confluence Integrator

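  • BFS Traversal: uses breadth-first search so that parent pages are always created before their children, avoiding Orphan Pages.
  • Smart Transform: automatically converts Mermaid blocks into Confluence Macros and repairs cross-page references (Internal Links).
  • Auto-Root Management: automatically creates `Notion_KB` at the Space root and supports `--clean` for recursive deletion, allowing a clean redeploy.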
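The parent-before-child guarantee of the BFS bullet can be illustrated with a short sketch. The `children` mapping (page id to child ids) and the function name `bfs_upload_order` are assumptions; the real integrator presumably works on its own page tree type.

```python
from collections import deque


def bfs_upload_order(children, root):
    """Return page ids level by level, so each parent precedes its children."""
    order = []
    queue = deque([root])
    seen = {root}
    while queue:
        page = queue.popleft()
        order.append(page)
        for child in children.get(page, []):
            if child not in seen:  # guard against accidental cycles
                seen.add(child)
                queue.append(child)
    return order
```

Creating pages in this order means the parent's Confluence id is always known by the time a child is uploaded.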

Legacy Mode (Fallback)

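  • Playwright Renderer: uses browser emulation and derives paths from the DOM Breadcrumb, handling special edge cases the API cannot reach.
  • Interactive Crawling: supports Auto-Scroll to trigger Lazy Loading, and automatically expands Toggles and loads Databases.
  • Stealth Mode: uses Headed mode with randomized delays (3-10s) to get past Cloudflare verification.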

Test Suite (Quality Gate)

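  • 130 Unit Tests: pytest covers the parsing, rendering logic, and merge strategies of the three core modules, so each change is checked against existing behavior.
  • Zero-IO Pure Testing: RichText conversion, Block rendering, and title disambiguation are tested as pure functions, with no need to mock external services.
  • Filesystem Isolation: merge and tree-building tests use the pytest `tmp_path` fixture, so they never pollute the real filesystem.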

DEV_EXPERIENCE (DX)

Developer-friendly mechanisms for fast iteration.
// DEBUG MODE: OFFLINE
enable --offline-replay
> Loading snapshot `dump_20250131.json`
> Mocking API responses...
> Ready. (0ms latency)
Offline Replay

When developing parsing logic, load a local Snapshot directly. No network access is needed, and iteration is about 100x faster.
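One way to picture offline replay is a client that answers from a JSON snapshot instead of the network. The snapshot layout assumed here (a single JSON object keyed by page id) and the names `ReplayClient` and `load_page_chunk` are hypothetical; the actual `dump_*.json` format is not documented on this page.

```python
import json
from pathlib import Path


class ReplayClient:
    """Serve recorded responses from a local snapshot file."""

    def __init__(self, snapshot_path):
        text = Path(snapshot_path).read_text(encoding="utf-8")
        self.responses = json.loads(text)

    def load_page_chunk(self, page_id):
        # Fail loudly on a miss instead of silently going to the network.
        try:
            return self.responses[page_id]
        except KeyError:
            raise LookupError(f"page {page_id!r} is not in the snapshot") from None
```

Because the parser only sees the returned dict, the same parsing code can run against either a live client or this replay client.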

// PRE-FLIGHT CHECK
run --dry-run
> Simulating write operations...
> [SKIP] POST /wiki/rest/api/content
> No changes applied.
Dry Run Mode

Simulates the transpile and upload flow, printing logs without writing anything, to ensure there are no conflicts with existing Confluence data.
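A dry-run switch of this kind is often implemented as a thin wrapper around the write call. The sketch below is an assumption about the shape, not the project's code: `ConfluenceWriter` and its `post` callable are hypothetical, and only the endpoint path is taken from the log above.

```python
class ConfluenceWriter:
    """Wrap a POST function so writes can be skipped in dry-run mode."""

    CONTENT_PATH = "/wiki/rest/api/content"

    def __init__(self, post, dry_run=False, log=print):
        self.post = post        # callable(path, payload) that performs the request
        self.dry_run = dry_run
        self.log = log

    def create_page(self, payload):
        if self.dry_run:
            # Report what would have been sent, but never call post().
            self.log(f"[SKIP] POST {self.CONTENT_PATH}")
            return None
        return self.post(self.CONTENT_PATH, payload)
```

Keeping the check in one place means every write path honors the flag, rather than each caller having to remember it.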

// ERROR RECOVERY
status --checkpoints
> Pending: 42 pages
> Failed: 3 pages (Rate Limited)
> Resuming from last success...
Smart Resume

On startup the program automatically reads the Checkpoint, skips pages already completed (Success), and retries only the failed ones.

CLI_COMMANDS

Uses the internal API for fast crawling, with no browser required. Suited to large batch migrations.

INPUT
python crawl_notion_api.py --token $NOTION_TOKEN_V2 --page $ROOT_PAGE_ID

Merges the fragmented files into API documents and starts a local MkDocs server for preview.

INPUT
python build_knowledge_base.py && mkdocs serve

Automatically reads the output directory and uploads it to the target Space. `--source all` uploads everything.

INPUT
python upload_to_confluence.py --source all --space ENGINEERING

FALLBACK_STRATEGY

PLAN B: WHEN API FAILS

Why We Need a Fallback

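Notion's internal API (loadPageChunk) is an undocumented endpoint that may change or gain additional verification at any time. Once the crawler is identified as an automated tool, API requests are rejected outright with 403 or 429.
For this reason the project ships a built-in Playwright browser emulation as a complete fallback. It renders pages in a real browser and bypasses the API layer entirely, so that the data migration can complete in any situation.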

Headed Browser Mode
Renders with non-headless Chromium, fully executing JavaScript and passing Cloudflare Bot Detection.
Randomized Delay (3-10s)
Inserts a random 3-10 second delay between requests to mimic human browsing and avoid detection of a fixed rhythm.
Auto-Scroll & Expand
Automatically triggers Lazy Loading, expands Toggle Blocks, and loads the full Database list so that no content is missed.
Breadcrumb Path Resolution
Reads the Breadcrumb navigation from the page DOM to accurately reconstruct the page hierarchy and storage path.
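As a rough sketch of breadcrumb path resolution, the function below maps a breadcrumb string such as `Project > Auth > Login` (the format shown in the `crawl_notion.py` log) to an output path. The helper names, the `>` separator handling, and the character-sanitizing rule are assumptions for illustration.

```python
import re
from pathlib import PurePosixPath


def sanitize_segment(name):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", name).strip(" .")
    return cleaned or "untitled"


def breadcrumb_to_path(breadcrumb, root="output", sep=">"):
    """Turn 'A > B > C' into root/A/B/C.md."""
    parts = [sanitize_segment(part.strip()) for part in breadcrumb.split(sep)]
    # Append the extension by hand so dots inside a title are preserved.
    return PurePosixPath(root, *parts[:-1], parts[-1] + ".md")
```

With the defaults, `str(breadcrumb_to_path("Project > Auth > Login"))` gives `output/Project/Auth/Login.md`.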
crawl_notion.py - Playwright
$ python crawl_notion.py
[INFO] Launching Chromium (headed mode)...
[INFO] User-Agent: Chrome/131.0 (Windows)
[INFO] Navigating to notion.site/...
[INFO] Waiting 6.2s (random delay)...
[INFO] Auto-scrolling page content...
[INFO] Expanding 3 toggle blocks...
[INFO] Breadcrumb: Project > Auth > Login
[INFO] Saved: output/Project/Auth/Login.md
[INFO] Waiting 4.8s (random delay)...
[INFO] Processing next page...
API vs Playwright Comparison
                    API Mode            Playwright
Speed               ~10 min             ~3 hrs
Anti-detection      Header spoofing     Real browser
API dependency      Undocumented API    No API dependency
Cloudflare          May be blocked      Fully bypassed
Checkpoint resume
Memory              <50MB               ~500MB

TEST_SUITE

pytest: 130 tests across 3 core modules
130
TEST CASES
3
MODULES COVERED
0.3s
EXECUTION TIME
100%
PASS RATE
pytest -v --tb=short
$ pytest -v --tb=short
======================== test session starts ========================
collected 130 items
 
test_crawl_notion_api.py::TestSanitizeFilename::test_basic PASSED
test_crawl_notion_api.py::TestRichTextConvert::test_bold PASSED
test_crawl_notion_api.py::TestApplyDecorations::test_equation PASSED
test_crawl_notion_api.py::TestBlockToMarkdown::test_table_2x2 PASSED
test_build_knowledge_base.py::TestStripH1::test_removes_h1 PASSED
test_build_knowledge_base.py::TestMergeStandardSplit::test_basic PASSED
test_upload_to_confluence.py::TestEscapeXml::test_ampersand PASSED
test_upload_to_confluence.py::TestConvertMd::test_mermaid PASSED
... 122 more passed ...
 
======================== 130 passed in 0.32s ========================
MODULE COVERAGE
crawl_notion_api.py 44 tests
RichTextConverter • BlockToMarkdown • sanitize_filename • page ID helpers
upload_to_confluence.py 34 tests
_escape_xml • PageNode tree • title conflicts • MD→Confluence conversion
build_knowledge_base.py 32 tests
strip_h1 • extract_api_meta • Mermaid generators • merge functions (tmp_path)
TEST CATEGORIES
Pure Functions ~50
Rich Text Parsing ~27
Block Rendering ~16
File Merge (I/O) ~9
Tree Building ~12
MD→Confluence ~5
RUN TESTS
pip install pytest && pytest -v --tb=short