Teachable Scraper: Lesson Content Page Extraction

Available since: v1.0.17

The Teachable import engine fetches each lesson's full content page and extracts its content elements: body text, embedded video iframes, inline images, and downloadable attachment links. The platform can then reconstruct each lesson faithfully without manual migration.

What Gets Extracted

When the scraper processes a Teachable school, it visits every individual lesson page and pulls out the following:

Content Type	Description
HTML body text	The full prose and formatted text content of the lesson
Embedded video iframes	Video players embedded via Vimeo, YouTube, Wistia, or Teachable's native video host
Image tags	Inline images appearing within lesson content
Downloadable attachment links	Links to PDFs, documents, and other files learners can download
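
As an illustration of how an embedded video iframe might be attributed to one of the hosts listed above, the sketch below classifies an iframe src by hostname. The function name classify_embed and the hostname suffixes are assumptions made for this example, not the scraper's actual list. Teachable's native video host is not named in this document, so it falls under "other" here.

```python
from urllib.parse import urlparse

# Hostname suffixes are assumptions for illustration only.
PROVIDER_SUFFIXES = {
    "youtube": ("youtube.com", "youtu.be", "youtube-nocookie.com"),
    "vimeo": ("vimeo.com",),
    "wistia": ("wistia.com", "wistia.net"),
}


def classify_embed(src: str) -> str:
    """Return a provider label for an iframe src, or "other" if unrecognised."""
    # urlparse also handles protocol-relative URLs such as "//player.vimeo.com/...".
    host = (urlparse(src).hostname or "").lower()
    for provider, suffixes in PROVIDER_SUFFIXES.items():
        # Match the suffix exactly or as a parent domain, never as a bare substring.
        if any(host == s or host.endswith("." + s) for s in suffixes):
            return provider
    return "other"
```

Matching on the domain boundary rather than on a substring keeps a host such as notvimeo.com from being mistaken for Vimeo.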

Content Structure Preservation

The scraper preserves the original content order as it appears in the Teachable lesson page. Text blocks, video embeds, images, and attachment links are extracted in sequence, so the reconstructed lesson inside the platform mirrors the source layout.

This means:

No manual reordering of content elements is needed after import.
Media and attachments are associated with the correct lesson automatically.
Text content retains its structural context (paragraphs, headings, etc.) from the original HTML.
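
A minimal sketch of order-preserving extraction, using only Python's standard html.parser, is shown below. It is not the scraper's implementation: the Element type, the OrderedExtractor class, and the attachment extension list are hypothetical names chosen for this example. Because the parser emits events in document order, appending to a single list keeps text, video, image, and attachment elements in their source sequence.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Element:
    kind: str   # "text", "video", "image", or "attachment"
    value: str  # the text itself, or the src/href reference


# Hypothetical: link targets with these extensions count as attachments.
ATTACHMENT_EXTS = (".pdf", ".doc", ".docx", ".zip")


class OrderedExtractor(HTMLParser):
    """Collect content elements in the order they appear in the HTML."""

    def __init__(self):
        super().__init__()
        self.elements: list[Element] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "iframe" and a.get("src"):
            self.elements.append(Element("video", a["src"]))
        elif tag == "img" and a.get("src"):
            self.elements.append(Element("image", a["src"]))
        elif tag == "a" and (a.get("href") or "").lower().endswith(ATTACHMENT_EXTS):
            self.elements.append(Element("attachment", a["href"]))

    def handle_data(self, data):
        # Whitespace-only runs between tags carry no content.
        text = data.strip()
        if text:
            self.elements.append(Element("text", text))


def extract_in_order(html: str) -> list[Element]:
    parser = OrderedExtractor()
    parser.feed(html)
    parser.close()
    return parser.elements
```

This sketch flattens text to plain strings, so an attachment's link label also appears as a following text element. Retaining paragraph and heading context, as described above, would require recording the enclosing tag as well.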

How It Fits Into the Import Pipeline

1. Course structure is traversed: sections and lessons are enumerated.
2. Each lesson page is fetched individually.
3. Content elements (text, video iframes, images, attachments) are extracted and stored.
4. Reconstruction uses the extracted data to rebuild the lesson inside the platform.

This release covers steps 2 and 3 (fetching each lesson page and extracting its content), delivering the raw extracted content that the reconstruction layer consumes.
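
The fetch and extract stages could be wired together roughly as follows. This is a sketch under stated assumptions: extract_all_lessons, fetch, and extract are hypothetical names, and both stages are passed in as plain callables so the loop can be exercised without network access.

```python
from typing import Callable, Dict, Iterable, List


def extract_all_lessons(
    lesson_urls: Iterable[str],          # output of the traversal stage
    fetch: Callable[[str], str],         # lesson URL -> page HTML
    extract: Callable[[str], List],      # page HTML -> ordered content elements
) -> Dict[str, List]:
    """Fetch each lesson page individually and store its extracted elements."""
    extracted: Dict[str, List] = {}
    for url in lesson_urls:
        html = fetch(url)
        extracted[url] = extract(html)
    return extracted


# Usage with in-memory stand-ins for the real fetcher and extractor.
fake_pages = {"/lessons/1": "<p>One</p>", "/lessons/2": "<p>Two</p>"}
result = extract_all_lessons(list(fake_pages), fake_pages.__getitem__, lambda html: [html])
```

Keeping the stages as separate callables mirrors the pipeline boundaries: the returned mapping is the raw extracted content handed to reconstruction.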

Notes

Extraction targets content rendered in the Teachable lesson body. Content outside the lesson body (e.g. navigation chrome, course sidebar) is not captured.
Video extraction identifies <iframe> elements and preserves only the embed references; the actual video files are not downloaded.
Image tags are extracted as references; binary image assets are handled separately by the asset copy pipeline.
Attachment links are extracted as URLs pointing to Teachable-hosted downloadable files.
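
The first note, limiting extraction to the lesson body, can be sketched with a depth counter that ignores everything outside one container element. The container id "lesson-body" used in the usage line is purely hypothetical, since the actual Teachable markup is not described here. The sketch collects image references only and assumes well-formed markup with matching close tags.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the nesting depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class BodyScopedImages(HTMLParser):
    """Collect <img> src values only inside the element with a given id."""

    def __init__(self, container_id: str):
        super().__init__()
        self.container_id = container_id
        self.depth = 0  # 0 means the parser is outside the container
        self.images: list[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self.depth == 0:
            # Start tracking only when the container itself opens.
            if a.get("id") == self.container_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag == "img" and a.get("src"):
            self.images.append(a["src"])
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def images_in_body(html: str, container_id: str) -> list[str]:
    parser = BodyScopedImages(container_id)
    parser.feed(html)
    parser.close()
    return parser.images
```

For example, images_in_body(page_html, "lesson-body") would skip a logo in the navigation and a thumbnail in the sidebar while keeping images nested anywhere inside the container.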