How to move off Medium and port your content to another platform using Markdown

22/09/2022 — JavaScript, Automation, Technology — 9 min read

Woman standing in front of house with moving boxes — Photo by Zachary Kadolph on Unsplash

I’ve been on Medium for four years now. I’ve published close to 200 posts and I’m part of the Partner Program, which earns me enough each month to keep me in coffee.

I enjoy writing on Medium. It allows me to quickly write and get my posts out there and then find related content, but I’m aware I’m putting all my eggs in one basket and opening myself up to the risk that Medium changes something and I no longer have a means to freely write.

Having my posts on a platform I can control allows me to better structure my writing, using more than three levels of typographical hierarchy and displaying images, text and interactive elements in order to better engage with my audience.

The question then was how to get the content I’d built up over the last four years out of Medium, easily build a blog from it, and make it easy enough to continue adding content to that blog.

Thanks to the EU and the GDPR’s requirement for data portability it’s easy to export data from Medium: you just request a .zip file of your data from the Settings screen (it’s under the Account section).

In order to keep writing content as enjoyable as Medium makes it while also giving me that flexibility, I decided to use the Markdown format, for which there are many editors that provide a Medium-like UI but save as .md files.

Gatsby, a framework I’ve used before for my personal website as well as for a few of my start-up ideas, supports Markdown and MDX (which extends Markdown to allow web components), so it made sense to continue using that.

Looking through the Gatsby ‘starters’ (a term for a scaffold project you can use to get you started) I found a nice blog template by Lekoarts which would pull in posts from a specific directory and used some metadata (called frontmatter in Gatsby) to provide functionality like date ordering and tagging.

This meant I had to clean up the Medium export and add this frontmatter data, which is no easy task when you have to do this almost 200 times!

Filtering out unwanted posts

The Medium export contains a lot of information related to your account but the main folder of interest is the posts folder which contains your drafts and published posts as well as any comments you’ve left or responses you’ve made to comments.

Removing draft posts is pretty easy to do as the filename starts with the draft_ prefix so you can just delete these in your file manager.
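
If you would rather not delete the drafts by hand, the same filter can be applied in code. This is only a minimal sketch: removeDrafts is a hypothetical helper, and the example filenames are made up to illustrate the draft_ prefix rather than copied from a real export.

```javascript
// Hypothetical helper: keep only exported HTML files that are not drafts.
// Assumes drafts start with the "draft_" prefix described above.
function removeDrafts(filenames) {
  return filenames.filter(
    (name) => name.endsWith('.html') && !name.startsWith('draft_')
  )
}

// Made-up filenames for illustration
const exportedFiles = [
  'draft_My-unfinished-post.html',
  '2022-09-22_How-to-move-off-Medium.html',
]

console.log(removeDrafts(exportedFiles))
```

The result could then be fed into the conversion step instead of the full directory listing.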

There’s no easy way to filter out the comment posts so you’ll need to step through each file in that folder and view the content to determine if it’s a proper post or just a comment you left. If you’re on a Mac or use one of the better File Managers on Linux then you can preview these files to make your job easier, but this is still a manual task.

Once the comment files were removed I then chose to remove any posts I didn’t think would make sense as standalone posts on my personal site. I tend to do monthly round up posts on Medium which make sense if you’re following me on that site but won’t if you’re just viewing my list of posts on my personal website.

Once you have the list of posts you want on your website, the next step is to turn the Medium export into the more manageable Markdown format and clean up the markup to remove any publication links and the export footer.

Converting the HTML export to Markdown

It’ll be easier to clean up the rest of the post’s content once it’s in Markdown, so the first step is to take the HTML that Medium exports its posts as and turn it into Markdown.

This process isn’t perfect but it’ll get you far enough to save you time, and depending on your content it might be OK.

To do this I created a script to iterate through the files in the posts directory of the Medium export and convert them from HTML to Markdown, while also doing the following tasks:

Setting the frontmatter metadata needed by Gatsby to create the post correctly
Downloading any images the post contains locally so I can safely delete these from Medium
Converting embedded GitHub Gists into Markdown codeblocks
Removing the extra elements that the Medium export contains, such as the export header and footer
Keeping the HTML structure of the figure tags, as Markdown doesn’t support these natively
Removing the [NBSP], [HSP] and [BS] characters that Medium seems to put into its markup

The script uses the unified library and a number of utility functions to convert the HTML to an Abstract Syntax Tree (AST), and then iterates over the nodes in that tree and performs tasks based on the node type.
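
To give a feel for what iterating over the nodes of an AST means, here is a small dependency-free sketch. The tree is hand-written to mimic the shape that rehype produces (elements with a tagName and children, text nodes with a value), and walk and summarise are hypothetical helpers rather than part of unified.

```javascript
// A hand-written tree in the same general shape as a rehype (hast) tree.
const sampleTree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'h1',
      properties: {},
      children: [{ type: 'text', value: 'Hello' }],
    },
    {
      type: 'element',
      tagName: 'p',
      properties: {},
      children: [
        { type: 'text', value: 'Some ' },
        {
          type: 'element',
          tagName: 'em',
          properties: {},
          children: [{ type: 'text', value: 'text' }],
        },
      ],
    },
  ],
}

// Depth-first walk that calls visitor(node, parent) for every node.
function walk(node, visitor, parent = null) {
  visitor(node, parent)
  for (const child of node.children || []) {
    walk(child, visitor, node)
  }
}

// Branch on the node type: collect tag names for elements, text for text nodes.
function summarise(tree) {
  const tags = []
  let text = ''
  walk(tree, (node) => {
    if (node.type === 'element') tags.push(node.tagName)
    if (node.type === 'text') text += node.value
  })
  return { tags, text }
}

console.log(summarise(sampleTree))
```

The plugins in the rest of this post follow the same idea, except that they change the nodes they find instead of only reading them.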

Adding the frontmatter metadata for each post

Medium’s HTML export contains the following sections:

Title: The title of the post (xpath: html/body/article/header/h1)
Subtitle: The short form description of the post (xpath: html/body/article/section[@data-field="subtitle"])
Description: Not always available, this only appears on posts from when Medium allowed a short and long description for posts (xpath: html/body/article/section[@data-field="description"])
Body: The actual content of the post (xpath: html/body/article/section[@data-field="body"])
Footer: A footer that Medium adds to each post with the date published and information about when the export was created (xpath: html/body/article/footer)

In order to populate the frontmatter we need the following information:

Slug — The URL for the post (we can use a hyphenated version of the title for this)
Date — The date the post was published (we can use the date in the footer for this)
Title — The title of the post (we can use the post title for this)
Excerpt — A summary of the post, used for listings (we can use the subtitle for this)

There is also other information we can add to this frontmatter section, such as a list of tags for the post, but Medium doesn’t include the tags used for the posts in the export so these need to be added manually.
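
To make the target concrete, this sketch builds a frontmatter block from those four fields. It makes a few assumptions: simpleSlug is a simplified stand-in for the slugify package (it will not match its output for every title), buildFrontmatter is a hypothetical helper, and JSON.stringify is used for quoting because a JSON string is also a valid double-quoted YAML string for typical titles.

```javascript
// Simplified stand-in for the slugify package: lower-case, hyphen-separated.
function simpleSlug(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything that is not a letter or digit
    .replace(/^-+|-+$/g, '') // trim leading and trailing hyphens
}

// Build the frontmatter block from the title, date and excerpt.
// JSON.stringify escapes any double quotes inside the values.
function buildFrontmatter({ title, date, excerpt }) {
  return [
    '---',
    `slug: ${JSON.stringify(simpleSlug(title))}`,
    `date: ${JSON.stringify(date)}`,
    `title: ${JSON.stringify(title)}`,
    `excerpt: ${JSON.stringify(excerpt.trim())}`,
    '---',
  ].join('\n')
}

console.log(
  buildFrontmatter({
    title: 'How to move off Medium',
    date: '2022-09-22',
    excerpt: ' Porting your content using Markdown ',
  })
)
```

Escaping the values matters because a title containing a double quote would otherwise produce frontmatter that fails to parse.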

This function reads the values from the HTML and sets them on the file object so they can be accessed at a later stage. This is required because we read the values from the tree when it’s in an HTML format but write the frontmatter when it’s in a Markdown format.

import { read } from 'to-vfile'
import { unified } from 'unified'
import rehypeParse from 'rehype-parse'
import rehypeRemark from 'rehype-remark'
import remarkStringify from 'remark-stringify'
import frontmatter from 'remark-frontmatter'
import slugify from 'slugify'

// Reads the frontmatter values from the HTML tree and stores them on the file
function gatherFrontmatterData() {
  return (tree, file) => {
    const title = tree.children.find(x => x.tagName === 'title').children[0].value
    const articleContent = tree.children.find(x => x.tagName === 'article').children
    const subtitle = articleContent.find(x => x.properties && x.properties.dataField === 'subtitle')
    const footerContent = articleContent.find(x => x.tagName === 'footer').children
    // The published date is held in a time element inside a link in the footer
    const date = footerContent.reduce((acc, node) => {
      const links = node.children ? node.children.filter(x => x.tagName === 'a') : []
      links.forEach(link => {
        const time = link.children.find(x => x.tagName === 'time')
        if (time) {
          acc = time.properties.dateTime.split('T')[0] // keep only the YYYY-MM-DD part
        }
      })
      return acc
    }, '')
    file.frontmatter = {
      slug: slugify(title, { lower: true }),
      date,
      title,
      excerpt: subtitle ? subtitle.children[0].value : ''
    }
  }
}

// Prepends a yaml node holding the gathered values to the Markdown tree
function setFrontmatter() {
  return (tree, file) => {
    const { slug, date, title, excerpt } = file.frontmatter
    tree.children.unshift({
      type: 'yaml',
      value: [
        `slug: "${slug}"`,
        `date: "${date}"`,
        `title: "${title}"`,
        `excerpt: "${excerpt.trim()}"`
      ].join('\n')
    })
  }
}

async function convertPost() {
  const file = await unified()
    .use(rehypeParse, { fragment: true }) // Parse the HTML into an AST
    .use(gatherFrontmatterData) // Read the frontmatter values from the HTML and set them on the file
    .use(rehypeRemark) // Convert the HTML AST to a Markdown AST
    .use(frontmatter) // Allow remark to work with frontmatter (yaml) nodes
    .use(setFrontmatter) // Set the frontmatter values in the Markdown AST
    .use(remarkStringify)
    .process(await read('example.html'))

  console.log(String(file))
}

Gather frontmatter values from HTML and set them in MD

The remark-frontmatter library is needed in order for remark to work with frontmatter but once that’s in the chain then we can prepend the frontmatter into the tree.

Downloading images and updating src and alt attributes

To remove the need to rely on hotlinking from the original Medium post we’ll need to automate downloading the images from Medium and then updating the src attribute to use the relative url.

Additionally because of the way that Medium used figure and figcaption tags previously to provide the description of the image (they’ve only recently allowed for alt text to be set) we need a means to use that description for the alt text where it’s not set.

Downloading the images is simple as we can use the src attribute on the img tag to get the URL and then, once the image is downloaded locally, update that src attribute value to point to the local file and Gatsby will take care of the rest.

Generating the alt attribute is a little more complicated as this requires traversing up the tree to the parent figure tag in order to access the figcaption sibling for the image. Additionally, as Markdown doesn’t support figure elements we’ll need to find a means to include that element as HTML in the final output.

Including the HTML in the final output can be achieved by passing a handler for the figure tag into the settings of the rehypeRemark stage of the transformation. This stage converts the HTML AST into a Markdown AST and the handler allows you to define a custom behaviour for that tag and its children (this conversion essentially converts that sub-tree into a string representation of the HTML, so you have to do it at this level).

This function first downloads the images within the HTML, saves them locally and updates the src attribute to point to those files before overriding the behaviour of the figure element to return the sub-tree as a single HTML element in the final Markdown.

import fetch from 'node-fetch'
import { read } from 'to-vfile'
import { unified } from 'unified'
import rehypeParse from 'rehype-parse'
import rehypeRemark from 'rehype-remark'
import remarkStringify from 'remark-stringify'
import { promises as fs } from 'fs'
import { selectAll } from 'hast-util-select'
import { toHtml } from 'hast-util-to-html'
import path from 'path'

// Downloads the image and returns the local path, or the original URL if it fails
async function downloadImage(url, pathName) {
  try {
    const imageResp = await fetch(url)
    const imageData = await imageResp.arrayBuffer()
    // Use the promise-based writeFile so the await actually waits for the write
    await fs.writeFile(pathName, Buffer.from(imageData))
    console.log(`Wrote ${pathName}`)
    return pathName
  } catch (error) {
    console.error('Failed to download', url, error)
    return url
  }
}

// Medium image URLs don't always have an extension so default to .jpg
function getFilename(src) {
  const filename = src.split('/').pop()
  if (filename.split('.').length > 1) {
    return filename
  } else {
    return `${filename}.jpg`
  }
}

function rehypeDownloadImages() {
  return async (tree, file) => {
    const nodes = selectAll('img', tree)
    await Promise.all(nodes.map(async (node) => {
      const filename = getFilename(node.properties.src)
      const outputPath = path.join(file.outputFolder, filename)
      const savedPath = await downloadImage(node.properties.src, outputPath)
      // Only point at the local file if the download succeeded
      if (savedPath === outputPath) {
        node.properties.src = filename
      }
      return node
    }))
    return tree
  }
}

function createDirectoryForPost(options) {
  return async (tree, file) => {
    const fullPath = path.join(options.outputFolder, `${file.frontmatter.date}-${file.frontmatter.slug}`)
    file.outputFolder = fullPath
    try {
      // recursive: true means an existing folder is not treated as an error
      await fs.mkdir(fullPath, { recursive: true })
    } catch (error) {
      console.error(`Failed to create output folder at ${fullPath}`, error)
    }
  }
}

const paddingNode = {
  type: 'text',
  value: '\n',
}

function getCaption(node) {
  switch (node.children.length) {
    case 0:
      return ''
    case 1:
      return node.children[0].value
    default: // if the figcaption content is not a single text node then we need to construct the text from the children
      return node.children.reduce((acc, child) => {
        if (child.type === 'text') {
          acc = `${acc} ${child.value}`
        }
        if (child.tagName === 'a') {
          acc = `${acc} ${child.children[0].value}`
        }
        return acc
      }, '')
  }
}

async function convertHtmlToMarkdown(filePath, outputFolder) {
  const file = await unified()
    .use(rehypeParse, { fragment: true })
    .use(gatherFrontmatterData) // covered in the previous section
    .use(createDirectoryForPost, { outputFolder }) // Required so we have somewhere to save the images
    .use(rehypeDownloadImages)
    .use(rehypeRemark, {
      handlers: { // defines how to handle specific HTML tags
        figure(h, node) {
          const captionNode = node.children.find(child => child.tagName === 'figcaption')
          const caption = captionNode ? getCaption(captionNode) : ''
          const cleansedChildren = node.children.map((child) => {
            // Add the figcaption text to the img so when converted to Markdown it will use that
            if (child.tagName === 'img') {
              return {
                ...child,
                properties: {
                  src: child.properties.src,
                  alt: child.properties.alt || caption
                }
              }
            }
            if (child.tagName === 'figcaption') {
              return {
                ...child,
                properties: {} // remove any classes and attributes Medium sets
              }
            }
            return child
          })
          const cleansedNode = {
            ...node,
            properties: {}, // remove any classes and attributes Medium sets
            children: cleansedChildren.reduce((acc, child) => {
              acc.push(child)
              acc.push(paddingNode) // This adds new lines between the child nodes in order to keep markdown formatting
              return acc
            }, [paddingNode])
          }
          return h(cleansedNode, 'html', toHtml(cleansedNode, { closeSelfClosing: true })) // closeSelfClosing is needed to close img tags
        },
      },
    })
    .use(remarkStringify)
    .process(await read(filePath))

  console.log(String(file))
}

Download the images in the HTML, save them and then use the figcaption to set the alt text for images
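
One thing to watch for is the filename that comes out of the image URL. As a hedged variation on getFilename, the sketch below drops any query string and replaces characters that some file systems reject. safeFilename is a hypothetical helper, and the example URLs only approximate the shape of Medium image URLs, so check them against your own export.

```javascript
// Hypothetical variation on getFilename: strips the query string and
// replaces characters outside a conservative safe set with underscores.
function safeFilename(src, fallbackExtension = 'jpg') {
  const lastSegment = src.split('?')[0].split('/').pop()
  const cleaned = lastSegment.replace(/[^a-zA-Z0-9._-]/g, '_')
  // Fall back to a default extension when the URL does not include one
  return cleaned.includes('.') ? cleaned : `${cleaned}.${fallbackExtension}`
}

// Example URLs are assumptions for illustration only
console.log(safeFilename('https://cdn-images-1.medium.com/max/800/1*abc123.png'))
console.log(safeFilename('https://cdn-images-1.medium.com/max/800/0*xyz?q=20'))
```

Because the src attribute is rewritten to the same filename that is saved to disk, any renaming like this has to be applied in both places.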

Converting embedded GitHub Gists into codeblocks

Similar to downloading the images, I also wanted to bring the embedded code I had in my Medium posts back ‘in-house’ so I could take advantage of the codeblock presentation that Gatsby offers.

This embedded code was hosted on GitHub as a Gist and then the URL of that Gist was pasted into Medium when writing a post. In the Medium export the code is a script tag inside of a figure element.

The URL of the Gist script is pretty easy to convert to the URL for the raw version of the Gist code: you just need to replace the .js extension with /raw.
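
As a small illustration of that URL change, here is a sketch of a helper that does the replacement. toRawGistUrl is a hypothetical name, the user and Gist id in the example are made up, and dropping a trailing ?file= query string is an assumption about how some embeds are written.

```javascript
// Hypothetical helper: turn a Gist embed script URL into its raw URL.
// Anything from the ".js" extension onwards (including a query string) is dropped.
function toRawGistUrl(scriptSrc) {
  return scriptSrc.replace(/\.js(\?.*)?$/, '') + '/raw'
}

// Made-up user and Gist id for illustration
console.log(toRawGistUrl('https://gist.github.com/example-user/abc123.js'))
```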

Converting the script tag into a Markdown codeblock was the bigger challenge as you need to change the node type from element to text and then wrap the code in the Markdown codeblock syntax as well as add newlines in order for the Markdown to parse the block correctly.

import fetch from 'node-fetch'
import { read } from 'to-vfile'
import { unified } from 'unified'
import rehypeParse from 'rehype-parse'
import rehypeRemark from 'rehype-remark'
import remarkStringify from 'remark-stringify'
import { selectAll } from 'hast-util-select'
import { toHtml } from 'hast-util-to-html'

// Fetches the raw code for a Gist, returns false if it can't be downloaded
async function downloadGistCode(url) {
  try {
    const rawUrl = `${url.split('.js')[0]}/raw`
    const codeResp = await fetch(rawUrl)
    if (!codeResp.ok) {
      throw new Error(`Unexpected response ${codeResp.status}`)
    }
    return await codeResp.text()
  } catch (error) {
    console.error('Failed to fetch', url, error)
    return false
  }
}

function rehypeInlineGistScript() {
  return async (tree) => {
    const nodes = selectAll('script', tree)
    await Promise.all(nodes.map(async (node) => {
      const src = node.properties.src
      // we only want to process gists, other embeds like tweets also use the script tag
      if (src && src.includes('gist')) {
        const code = await downloadGistCode(src)
        if (code === false) return node // leave the embed as it is if the download failed
        node.properties = {}
        node.type = 'text'
        node.value = '\n```\n' + code + '\n```\n' // This will create codeblock syntax and add newlines to correctly format the codeblock
      }
      return node
    }))
    return tree
  }
}

const paddingNode = {
  type: 'text',
  value: '\n',
}

async function convertHtmlToMarkdown(filePath, outputFolder) {
  const file = await unified()
    .use(rehypeParse, { fragment: true })
    .use(rehypeInlineGistScript)
    .use(rehypeRemark, {
      handlers: { // defines how to handle specific HTML tags
        figure(h, node) {
          const cleansedNode = {
            ...node,
            properties: {},
            children: node.children.reduce((acc, child) => {
              acc.push(child)
              acc.push(paddingNode)
              return acc
            }, [paddingNode])
          }
          return h(cleansedNode, 'html', toHtml(cleansedNode, { closeSelfClosing: true }))
        },
      },
    })
    .use(remarkStringify)
    .process(await read(filePath))

  console.log(String(file))
}

Downloading the code from a Gist and converting to a Markdown code block
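
One edge case worth considering is a Gist whose code itself contains a run of backticks, which could end a three-backtick fence early. The sketch below is one possible way around that, not something the script above does: toCodeBlock is a hypothetical helper that picks a fence longer than any run of backticks in the code.

```javascript
// Hypothetical helper: wrap code in a Markdown fence that is always longer
// than the longest run of backticks found inside the code.
function toCodeBlock(code, language = '') {
  const runs = code.match(/`+/g) || []
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0)
  const fence = '`'.repeat(Math.max(3, longest + 1))
  // Newlines around the fence keep the block separate from the surrounding content
  return `\n${fence}${language}\n${code}\n${fence}\n`
}

console.log(toCodeBlock('const a = 1', 'js'))
```

Passing a language is optional, but it would let Gatsby apply syntax highlighting if you know what language the Gist was written in.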

Cleaning up the output

After automating the conversion of the different elements that make up my posts, the final step was to clean up the markup a little by removing unneeded content and converting special characters that Medium uses to more standard ones.

The Medium export creates the following sections:

Header — Contains the title of the post
Subtitle — Contains the subtitle for the post
Body — Contains the post content
Footer — Contains the date published and when the export was generated

We only need the body section as everything else contains data that we’re already adding to the Gatsby frontmatter metadata so we can remove these extra sections.

We also need to clean up the body section content a little bit. The Medium export adds a horizontal rule and a post title to the body content so these need to be removed.

Lastly there are a few special characters that Medium uses in the post that can trip Gatsby up when it tries to parse the Markdown content. These are [NBSP], [HSP] and [BS], and the [BS] in particular has caused me issues. Luckily running a string replace for these troublesome characters on the final output is enough to fix the issue.

import { read } from 'to-vfile'
import { unified } from 'unified'
import rehypeParse from 'rehype-parse'
import rehypeRemark from 'rehype-remark'
import remarkStringify from 'remark-stringify'
import { visit } from 'unist-util-visit'

// True if the node has the given class set on it
function hasClass(node, className) {
  return (node.properties.className || []).includes(className)
}

function removeMediumExtras() {
  return (tree) => {
    // Keep only the body section of the article
    const article = tree.children.find(x => x.tagName === 'article')
    article.children = article.children.filter((node) => node.properties && node.properties.dataField === 'body')
    visit(tree, { tagName: 'hr' }, (node, index, parent) => {
      if (hasClass(node, 'section-divider')) {
        parent.children.splice(index, 1)
        return index // continue from the same index as the siblings have shifted
      }
    })
    visit(tree, { tagName: 'h3' }, (node, index, parent) => {
      if (hasClass(node, 'graf--title')) {
        parent.children.splice(index, 1)
        return index
      }
    })
  }
}

async function convertHtmlToMarkdown(filePath, outputFolder) {
  const file = await unified()
    .use(rehypeParse, { fragment: true })
    .use(removeMediumExtras)
    .use(rehypeRemark)
    .use(remarkStringify)
    .process(await read(filePath))

  // Replace the [NBSP] (U+00A0), [HSP] (U+200A) and [BS] (U+0008) characters with a regular space
  console.log(String(file).replace(/[\u00A0\u200A\u0008]/g, ' '))
}

Remove the unneeded sections, the hr and title within the body content and remove any special characters Medium uses in its content
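
Because these characters are invisible in most editors, it can help to spell them out as Unicode escapes. This is a minimal sketch which assumes [NBSP] is the no-break space (U+00A0), [HSP] is the hair space (U+200A) and [BS] is the backspace control character (U+0008). cleanSpecialCharacters is a hypothetical helper name.

```javascript
// Assumed code points: U+00A0 no-break space, U+200A hair space, U+0008 backspace
const MEDIUM_SPECIAL_CHARACTERS = /[\u00A0\u200A\u0008]/g

// Replace each special character with a regular space
function cleanSpecialCharacters(markdown) {
  return markdown.replace(MEDIUM_SPECIAL_CHARACTERS, ' ')
}

console.log(JSON.stringify(cleanSpecialCharacters('non\u00A0breaking and\u200Ahair spaces')))
```

If your export contains other unusual characters, they can be added to the character class in the same way.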

Automating the conversion across the Medium export

When all the scripts are combined (the final script is on my GitHub, it’s left out here for brevity), a folder is created for each post with the format [YYYY-MM-DD]-slug, the images are downloaded into that folder and the final Markdown version of the post is saved under the filename index.mdx within it.

The final script follows these steps:

Create an output directory
Get a list of HTML files in the posts directory of the export
Map over the HTML files and convert them to Markdown using the full conversion process
Save the converted Markdown to a file within the folder for the post

import { read } from 'to-vfile'
import { unified } from 'unified'
import rehypeParse from 'rehype-parse'
import rehypeRemark from 'rehype-remark'
import remarkStringify from 'remark-stringify'
import { promises as fs } from 'fs'
import { toHtml } from 'hast-util-to-html' // used by the figure handler
import frontmatter from 'remark-frontmatter'
import path from 'path'

async function convertHtmlToMarkdown(filePath, outputFolder) {
  const file = await unified()
    .use(rehypeParse, { fragment: true })
    .use(gatherFrontmatterData)
    .use(createDirectoryForPost, { outputFolder })
    .use(rehypeDownloadImages)
    .use(rehypeInlineGistScript)
    .use(removeMediumExtras)
    .use(rehypeRemark) // pass the figure handler from the images section here to keep figures as HTML
    .use(frontmatter)
    .use(setFrontmatter)
    .use(remarkStringify)
    .process(await read(filePath))

  const fullPath = path.join(outputFolder, `${file.frontmatter.date}-${file.frontmatter.slug}`, 'index.mdx')
  // Replace the [NBSP] (U+00A0), [HSP] (U+200A) and [BS] (U+0008) characters with a regular space
  const cleanedFileContent = String(file).replace(/[\u00A0\u200A\u0008]/g, ' ')
  try {
    // Use the promise-based writeFile so the await actually waits for the write
    await fs.writeFile(fullPath, cleanedFileContent)
  } catch (error) {
    console.error(`Failed to write file at ${fullPath}`, error)
  }
}

async function processFilesInDirectory(directory) {
  try {
    const cwd = path.resolve()
    const fullPath = path.join(cwd, directory)
    const outputPath = path.join(cwd, 'gatsby-posts')
    const files = await fs.readdir(fullPath)
    // recursive: true means re-running the script won't fail if the folder exists
    await fs.mkdir(outputPath, { recursive: true })
    const htmlFiles = files.filter(file => path.extname(file) === '.html')
    await Promise.all(htmlFiles.map(async (file) => {
      const filePath = path.join(fullPath, file)
      await convertHtmlToMarkdown(filePath, outputPath)
    }))
  } catch (error) {
    console.error('Failed to process files in directory', error)
  }
}

processFilesInDirectory('posts')

I’ve not included the transformer functions themselves as they are detailed in the sections above
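
With close to 200 posts, Promise.all starts every conversion, and therefore every image and Gist download, at the same time. If that turns out to be too much for your connection, one option is to work through the files in smaller batches. This is only a sketch under that assumption: chunk and processInBatches are hypothetical helpers and the batch size of 10 in the usage comment is arbitrary.

```javascript
// Split an array into batches of at most `size` items.
function chunk(items, size) {
  const batches = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

// Run the async worker over one batch at a time, waiting for each batch
// to finish before starting the next one.
async function processInBatches(items, size, worker) {
  const results = []
  for (const batch of chunk(items, size)) {
    results.push(...(await Promise.all(batch.map(worker))))
  }
  return results
}

// Possible usage inside processFilesInDirectory, replacing the Promise.all call:
// await processInBatches(htmlFiles, 10, (file) =>
//   convertHtmlToMarkdown(path.join(fullPath, file), outputPath))

console.log(chunk(['a.html', 'b.html', 'c.html'], 2))
```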

After the script has been run and has created the posts there are a few manual steps that might be needed, for instance if your posts have been included in a publication on Medium.

Manual edits to content

If your post has been picked up by a publication then you’ll likely find that the amendments the publication made to advertise itself have also been included in your exported post.

There’s no means to automate removing the publication edits as there’s nothing in the exported HTML markup that indicates where these changes have been made or if the article is part of a publication.

The best means I found to speed up the removal of these edits was to review the list of publications in your Medium account and search for the URL of the publication within the files.
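
That search can be sketched as a simple filter over the converted files. Here findPostsMentioning is a hypothetical helper, the posts are represented as plain objects with a name and content, and the publication URL is made up for the example.

```javascript
// Hypothetical helper: list the posts whose content mentions a publication URL.
function findPostsMentioning(posts, publicationUrl) {
  return posts
    .filter((post) => post.content.includes(publicationUrl))
    .map((post) => post.name)
}

// Made-up posts and publication URL for illustration
const convertedPosts = [
  { name: '2021-03-01-first-post', content: 'Originally published in https://medium.com/example-publication' },
  { name: '2021-04-01-second-post', content: 'A post with no publication edits' },
]

console.log(findPostsMentioning(convertedPosts, 'https://medium.com/example-publication'))
```

This only narrows down which files need a manual review, the edits themselves still have to be removed by hand.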

Setting up Gatsby to use the converted content

My conversion script is built with a specific theme in mind which requires the ‘folder per post’ format but you should be able to amend the script to output the files in a single directory if you want to use something else.

The theme I’ve used for my Gatsby site is the @lekoarts/gatsby-theme-minimal-blog starter. You can create the skeleton via gatsby new minimal-blog LekoArts/gatsby-starter-minimal-blog and then copy the generated output from the conversion script into the content/posts folder within the file structure.

Once I had the theme set up I then had to go and manually add tags to the frontmatter of each post so that I could use the tag functionality within the theme but this is optional.
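
If you already know which tags belong to which post, adding them can be scripted too. The sketch below assumes the frontmatter is delimited by --- lines and that the theme reads tags as a YAML list. addTags is a hypothetical helper and it does not check for an existing tags entry.

```javascript
// Hypothetical helper: insert a YAML list of tags at the end of the frontmatter.
function addTags(markdown, tags) {
  const lines = markdown.split('\n')
  if (lines[0] !== '---') return markdown // no frontmatter, leave untouched
  const end = lines.indexOf('---', 1) // the closing frontmatter delimiter
  if (end === -1) return markdown
  const tagLines = ['tags:', ...tags.map((tag) => `  - ${JSON.stringify(tag)}`)]
  lines.splice(end, 0, ...tagLines)
  return lines.join('\n')
}

const examplePost = '---\ntitle: "An example post"\n---\n\nThe body of the post'
console.log(addTags(examplePost, ['JavaScript', 'Automation']))
```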

Gotchas for the gatsby-theme-minimal-blog set up

GIF support needs setting up manually

The gatsby-theme-minimal-blog theme is a little awkward to work with if you want to include GIFs in your posts, as by default the underlying gatsby-plugin-mdx library only supports JPEG and PNG files.

You can override the theme’s gatsby-plugin-mdx configuration, however, by setting the mdx flag in the theme’s configuration to false and then defining your own gatsby-plugin-mdx configuration later on. This configuration worked for me (you’ll need to add the gatsby-remark-copy-linked-files and gatsby-remark-gifs libraries to your package.json):

// a lot of config removed for brevity

module.exports = {
  plugins: [
    {
      resolve: `@lekoarts/gatsby-theme-minimal-blog`,
      options: {
        mdx: false,
      }
    },
    {
      resolve: `gatsby-plugin-mdx`,
      options: {
        extensions: [`.mdx`, `.md`],
        gatsbyRemarkPlugins: [
          {
            resolve: `gatsby-remark-images`,
            options: {
              maxWidth: 960,
              quality: 90,
              linkImagesToOriginal: false,
              backgroundColor: `transparent`,
            },
          },
          `gatsby-remark-copy-linked-files`,
          `gatsby-remark-gifs`,
        ],
        plugins: [
          {
            resolve: `gatsby-remark-images`,
            options: {
              maxWidth: 960,
              quality: 90,
              linkImagesToOriginal: false,
              backgroundColor: `transparent`,
            },
          },
          `gatsby-remark-copy-linked-files`,
          `gatsby-remark-gifs`,
        ],
      },
    },
  ].filter(Boolean),
}

With mdx set to false you can define your own gatsby-plugin-mdx config

Shadowing doesn’t work well if you turn off mdx in theme config

Shadowing is a way within Gatsby to define overrides for a theme’s components and content by defining your own version under the package structure of the original.

The gatsby-theme-minimal-blog theme uses this mechanism to allow you to define your own content for the hero and bottom text on the homepage by creating src/@lekoarts/gatsby-theme-minimal-blog/texts/bottom.mdx and src/@lekoarts/gatsby-theme-minimal-blog/texts/hero.mdx files.

However when you disable the mdx flag for the theme in your config file then the shadowing of the bottom.mdx and hero.mdx files stops working. With the setting on it does work as intended but my posts have GIFs so I had to find a workaround.

I fixed this by shadowing the src/@lekoarts/gatsby-theme-minimal-blog/components/homepage.tsx component and then redefining the imports for the Hero and Bottom components in there to point to the shadowed files.

Peer dependencies are tricky in npm 8+

The theme does mention this in its README but it’s easy to get caught out. When installing with npm 8+ you need to use the --legacy-peer-deps flag. I found that using Yarn was a better option, though I did need to configure Netlify to use Yarn instead of npm when building my site.

Summary

The export that Medium gives you isn’t in the best shape for portability but using unified, rehype and remark you can produce files that make it easy enough to bring them into another platform or website builder.

Now that I’ve ported my articles to Markdown and Gatsby I have a very portable version of the content I had in my Medium account, and I can start building up the capabilities of my website to make the most of the content I’ve already written.