How to move off Medium and port your content to another platform using Markdown
— JavaScript, Automation, Technology — 9 min read
I’ve been on Medium for four years now. I’ve published close to 200 posts and I’m part of the partner program, which earns me enough each month to keep me in coffee.
I enjoy writing on Medium. It lets me quickly write and publish my posts and then find related content, but I’m aware that I’m putting all my eggs in one basket and exposing myself to the risk that Medium changes something and I no longer have a means to write freely.
Having my posts on a platform I control lets me structure my writing with more than three levels of typographic hierarchy and display images, text and interactive elements to better engage my audience.
The question, then, was how to get the content I’d built up over the last four years out of Medium, easily build a blog from it, and make it easy to keep adding content to that blog.
Thanks to the EU and the GDPR’s requirement for data portability it’s easy to export data from Medium: you just request a .zip file of your data from the Settings screen (it’s under the Account section).
In order to keep writing content as enjoyable as Medium makes it while also giving me that flexibility, I decided to use the Markdown format, for which there are many editors that provide a Medium-like UI but save as .md files.
Gatsby, a framework I’ve used before for my personal website as well as for a few of my start-up ideas, supports Markdown and MDX (which extends Markdown to allow JSX components), so it made sense to continue using that.
Looking through the Gatsby ‘starters’ (a term for a scaffold project you can use to get you started) I found a nice blog template by LekoArts which pulls in posts from a specific directory and uses some metadata (called frontmatter in Gatsby) to provide functionality like date ordering and tagging.
This meant I had to clean up the Medium export and add this frontmatter data, which is no easy task when you have to do this almost 200 times!
Filtering out unwanted posts
The Medium export contains a lot of information related to your account but the main folder of interest is the posts folder which contains your drafts and published posts as well as any comments you’ve left or responses you’ve made to comments.
Removing draft posts is pretty easy as their filenames start with the draft_ prefix, so you can just delete these in your file manager.
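If you would rather script this step than delete the files by hand, the prefix check can be sketched as below. This is only an illustration: `withoutDrafts` is a hypothetical helper and the file names are made up, with the single assumption (taken from the text above) that draft files start with `draft_`.

```javascript
// Hypothetical helper: keep only exported HTML files that are not drafts.
// Assumes draft files start with "draft_", as described above.
function withoutDrafts(fileNames) {
  return fileNames.filter(
    (name) => name.endsWith(".html") && !name.startsWith("draft_")
  );
}

// Made-up file names, for illustration only.
const kept = withoutDrafts([
  "draft_Unfinished-idea-abc123.html",
  "2021-03-04_My-first-post-def456.html",
  "notes.txt",
]);
console.log(kept);
```

Comment files still need to be checked by eye, as nothing in the file name marks them out.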
There’s no easy way to filter out the comment posts so you’ll need to step through each file in that folder and view the content to determine if it’s a proper post or just a comment you left. If you’re on a Mac or use one of the better File Managers on Linux then you can preview these files to make your job easier, but this is still a manual task.
Once the comment files were removed I then chose to remove any posts I didn’t think would make sense as standalone posts on my personal site. I tend to do monthly round up posts on Medium which make sense if you’re following me on that site but won’t if you’re just viewing my list of posts on my personal website.
Once you have the list of posts you want on your website, the next step is to turn the Medium export into the more manageable Markdown format and clean up the markup to remove any publication links and the export footer.
Converting the HTML export to Markdown
It’ll be easier to clean up the rest of the post’s content once it’s in Markdown, so the first step is to convert the HTML that Medium exports its posts as into Markdown.
This process isn’t perfect, but it’ll get you far enough to save you time and, depending on your content, it might be OK.
To do this I created a script to iterate through the files in the posts directory of the Medium export and convert them from HTML to Markdown, while also doing the following tasks:
- Setting the frontmatter metadata needed by Gatsby to create the post correctly
- Downloading any images the post contains locally so I can safely delete these from Medium
- Converting embedded GitHub Gists into Markdown codeblocks
- Removing the extra elements that the Medium export contained such as the export footer and header
- Keeping the HTML structure of the figure tags as Markdown doesn’t support these natively
- Removing the [NBSP], [HSP] and [BS] elements that Medium seems to put into their markup
The script uses the unified library and a number of util functions to convert the HTML to an Abstract Syntax Tree (AST) and then iterate over the nodes in that tree and perform tasks based on the node type.
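To make the idea of iterating over the nodes and acting on their type concrete, here is a minimal sketch that uses a hand-written tree rather than the real libraries. The `visit` function is a simplified stand-in for a visitor utility such as `unist-util-visit`, and the tree is a made-up example in the HTML AST (hast) shape, not real output from the Medium export.

```javascript
// Minimal stand-in for a tree visitor such as unist-util-visit.
// Nodes follow the hast shape: { type, tagName?, properties?, children?, value? }.
function visit(node, callback) {
  callback(node);
  for (const child of node.children || []) {
    visit(child, callback);
  }
}

// A tiny hand-written tree standing in for a parsed Medium post.
const tree = {
  type: "root",
  children: [
    {
      type: "element",
      tagName: "h1",
      properties: {},
      children: [{ type: "text", value: "My post" }],
    },
    {
      type: "element",
      tagName: "img",
      properties: { src: "https://example.com/a.png" },
      children: [],
    },
  ],
};

// Collect the tag name of every element node, ignoring text nodes.
const tagNames = [];
visit(tree, (node) => {
  if (node.type === "element") tagNames.push(node.tagName);
});
console.log(tagNames);
```

Each of the tasks in the list above is a variation on this pattern: walk the tree, match on `type` or `tagName`, then read or change the node.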
Adding the frontmatter metadata for each post
Medium’s HTML export contains the following sections:
- Title — The title of the post (xpath: html/body/article/header/h1)
- Subtitle — The short form description of the post (xpath: html/body/article/section[@data-field="subtitle"])
- Description — Not always available; this is only present on posts from when Medium allowed both a short and a long description for posts (xpath: html/body/article/section[@data-field="description"])
- Body — The actual content of the post (xpath: html/body/article/section[@data-field="body"])
- Footer — This is a footer that Medium adds to each post with the date published and information about when the export was created (xpath: html/body/article/footer)
In order to populate the frontmatter we need the following information:
- Slug — The URL for the post (we can use a hyphenated version of the title for this)
- Date — The date the post was published (we can use the date in the footer for this)
- Title — The title of the post (we can use the post title for this)
- Excerpt — A summary of the post, used for listings (we can use the subtitle for this)
There is also other information we can add to this frontmatter section, such as a list of tags for the post, but Medium doesn’t include the tags used for the posts in the export so these need to be added manually.
The function for this step reads the values from the HTML and sets them on the file object so they can be accessed at a later stage. This is required because we read the values from the tree while it’s in an HTML format but write the frontmatter once it’s in a Markdown format.
The remark-frontmatter library is needed in order for remark to work with frontmatter, but once that’s in the chain we can prepend the frontmatter to the tree.
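As a rough sketch of what the frontmatter for a post could look like once the values have been read, the snippet below builds the block as a plain string. It does not use remark-frontmatter. `slugify` and `buildFrontmatter` are hypothetical helpers, the key names simply mirror the list above, and the input values are invented, so check which keys your theme actually expects.

```javascript
// Turn a post title into a hyphenated slug, e.g. "Hello, World!" -> "hello-world".
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Build a YAML frontmatter block from values read out of the HTML export.
// JSON.stringify quotes and escapes the strings; YAML accepts that
// double-quoted form.
function buildFrontmatter({ title, subtitle, publishedAt }) {
  // Keep only the YYYY-MM-DD part of the (UTC) timestamp.
  const date = new Date(publishedAt).toISOString().slice(0, 10);
  return [
    "---",
    `title: ${JSON.stringify(title)}`,
    `slug: ${slugify(title)}`,
    `date: ${date}`,
    `excerpt: ${JSON.stringify(subtitle)}`,
    "---",
    "",
  ].join("\n");
}

// Invented values standing in for what was read from the export.
const frontmatter = buildFrontmatter({
  title: "How to move off Medium!",
  subtitle: 'A "quick" guide',
  publishedAt: "2021-03-04T10:00:00.000Z",
});
console.log(frontmatter);
```

Tags are left out on purpose, since they are not part of the export and have to be added by hand.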
Downloading images and updating src and alt attributes
To remove the need to rely on hotlinking from the original Medium post we’ll need to automate downloading the images from Medium and then updating the src attribute to use the relative URL.
Additionally, because of the way that Medium previously used figure and figcaption tags to provide the description of the image (they’ve only recently allowed alt text to be set), we need a means to use that description for the alt text where it’s not set.
Downloading the images is simple as we can use the src attribute on the img tag to get the URL and then, once the image is downloaded locally, update that src attribute value to point to the local file and Gatsby will take care of the rest.
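The src rewrite can be sketched as follows. The download itself is left out so the example stays self-contained. `localImagePath` and `relinkImage` are hypothetical names, the image URL is invented, and the sketch assumes the last path segment of the URL is usable as a file name.

```javascript
// Derive a local, relative path for an image from its remote URL.
// Assumption: the last path segment of the URL is usable as a file name.
function localImagePath(remoteUrl) {
  const { pathname } = new URL(remoteUrl);
  const fileName = pathname.split("/").filter(Boolean).pop();
  return `./${decodeURIComponent(fileName)}`;
}

// Point an <img> node (hast shape) at the local copy and return the remote
// URL so the caller knows what to download.
function relinkImage(imgNode) {
  const remoteUrl = imgNode.properties.src;
  imgNode.properties.src = localImagePath(remoteUrl);
  return remoteUrl;
}

// Invented image node, for illustration only.
const img = {
  type: "element",
  tagName: "img",
  properties: { src: "https://cdn-images-1.medium.com/max/800/example-image.png" },
  children: [],
};
const toDownload = relinkImage(img);
console.log(toDownload, "->", img.properties.src);
```

Saving the file next to the post is what lets the relative path resolve when Gatsby builds the page.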
Generating the alt attribute is a little more complicated as this requires traversing up the tree to the parent figure tag in order to access the figcaption sibling for the image. Additionally, as Markdown doesn’t support figure elements, we’ll need to find a means to include that element as HTML in the final output.
Including the HTML in the final output can be achieved by passing a handler for the figure tag into the settings of the rehypeRemark stage of the transformation. This stage converts the HTML AST into a Markdown AST and the handler allows you to define a custom behaviour for that tag and its children (this conversion essentially converts that sub-tree into a string representation of the HTML, so you have to do it at this level).
This function first downloads the images within the HTML, saves them locally and updates the src attribute to point to those files before overriding the behaviour of the figure element to return the sub-tree as a single HTML element in the final Markdown.
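The alt text fallback described above can be illustrated on its own. The sketch below works from the figure node downwards and assumes the img and figcaption are direct children of the figure, which may not hold for every exported post. `textOf` and `applyCaptionAsAlt` are hypothetical helpers and the figure is a hand-written example.

```javascript
// Gather the text content of a node and all of its descendants.
function textOf(node) {
  if (node.type === "text") return node.value;
  return (node.children || []).map(textOf).join("");
}

// For a <figure> node, copy the <figcaption> text onto any <img> child
// that has no alt text of its own. Existing alt text is left untouched.
function applyCaptionAsAlt(figure) {
  const children = figure.children || [];
  const caption = children.find((child) => child.tagName === "figcaption");
  if (!caption) return figure;
  const altText = textOf(caption).trim();
  for (const child of children) {
    if (child.tagName !== "img") continue;
    child.properties = child.properties || {};
    if (!child.properties.alt) child.properties.alt = altText;
  }
  return figure;
}

// Hand-written figure, for illustration only.
const figure = {
  type: "element",
  tagName: "figure",
  properties: {},
  children: [
    { type: "element", tagName: "img", properties: { src: "./cat.png" }, children: [] },
    {
      type: "element",
      tagName: "figcaption",
      properties: {},
      children: [{ type: "text", value: " A cat on a keyboard " }],
    },
  ],
};
applyCaptionAsAlt(figure);
console.log(figure.children[0].properties.alt);
```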
Converting embedded GitHub Gists into codeblocks
Similar to downloading the images, I also wanted to bring the embedded code I had in my Medium posts back ‘in-house’ so I could take advantage of the codeblock presentation that Gatsby offers.
This embedded code was hosted on GitHub as a Gist and then the URL of that Gist was pasted into Medium when writing a post. In the Medium export the code is a script tag inside of a figure element.
The URL of the Gist script is pretty easy to convert to the URL for the raw version of the Gist code: you just need to replace the .js extension with /raw.
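That replacement is a one-liner. In this sketch `gistRawUrl` is a hypothetical helper and the Gist URL is invented. It assumes the script URL ends in `.js` with nothing after it.

```javascript
// Convert a Gist embed script URL into the URL of the raw Gist content.
// Assumes the URL ends in ".js" with no query string after it.
function gistRawUrl(scriptSrc) {
  return scriptSrc.replace(/\.js$/, "/raw");
}

// Invented Gist URL, for illustration only.
const rawUrl = gistRawUrl("https://gist.github.com/someuser/0123456789abcdef.js");
console.log(rawUrl);
```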
Converting the script tag into a Markdown codeblock was the bigger challenge as you need to change the node type from element to text and then wrap the code in the Markdown codeblock syntax as well as add newlines in order for the Markdown to parse the block correctly.
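A minimal sketch of that node swap, on a hand-written tree, might look like this. `replaceWithCodeBlock` is a hypothetical helper and the code string stands in for the downloaded Gist content. How the real Markdown serialiser treats such a text node (for instance whether it escapes any characters) is not covered here.

```javascript
// Build the fence from single backticks so this example does not contain
// a literal fence of its own.
const FENCE = "`".repeat(3);

// Replace the child at `index` (the Gist <script> element) with a text node
// holding the code wrapped in a fenced block, padded with newlines.
function replaceWithCodeBlock(parent, index, code, language = "") {
  parent.children[index] = {
    type: "text",
    value: `\n${FENCE}${language}\n${code.trimEnd()}\n${FENCE}\n`,
  };
  return parent;
}

// Hand-written figure with an invented Gist script, for illustration only.
const gistFigure = {
  type: "element",
  tagName: "figure",
  properties: {},
  children: [
    {
      type: "element",
      tagName: "script",
      properties: { src: "https://gist.github.com/someuser/0123456789abcdef.js" },
      children: [],
    },
  ],
};

replaceWithCodeBlock(gistFigure, 0, "console.log(1);\n", "js");
console.log(gistFigure.children[0].value);
```

The leading and trailing newlines are what keep the fence on its own lines once the text is written into the Markdown file.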
Cleaning up the output
After automating the conversion of the different elements that make up my posts, the final step was to clean up the markup a little by removing unneeded content and converting special characters that Medium uses to more standard ones.
The Medium export creates the following sections:
- Header — Contains the title of the post
- Subtitle — Contains the subtitle for the post
- Body — Contains the post content
- Footer — Contains the date published and when the export was generated
We only need the body section as everything else contains data that we’re already adding to the Gatsby frontmatter metadata so we can remove these extra sections.
We also need to clean up the body section content a little bit. The Medium export adds a horizontal rule and a post title to the body content so these need to be removed.
Lastly there are a few special characters that Medium uses in the post that can trip Gatsby up when it tries to parse the Markdown content. These are [NBSP], [HSP] and [BS], and [BS] has caused me issues in particular. Luckily, running a string replace for these troublesome characters on the final output is enough to fix the issue.
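A string replace along those lines could look like the sketch below. The mapping from the placeholders to code points is an assumption on my part: [NBSP] as U+00A0 (non-breaking space), [HSP] as U+200A (hair space) and [BS] as U+0008 (backspace). `cleanSpecialCharacters` is a hypothetical helper, so adjust the pattern to whatever your own export contains.

```javascript
// Assumed mapping for the placeholders above:
//   [NBSP] = U+00A0, [HSP] = U+200A, [BS] = U+0008.
// The two space variants become ordinary spaces; the backspace is dropped.
function cleanSpecialCharacters(markdown) {
  return markdown
    .replace(/[\u00a0\u200a]/g, " ")
    .replace(/\u0008/g, "");
}

const cleaned = cleanSpecialCharacters("a\u00a0b\u200ac\u0008d");
console.log(JSON.stringify(cleaned));
```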
Automating the conversion across the Medium export
When all the scripts are combined (the final script is on my GitHub, omitted here for brevity), a folder is created for each post with the format [YYYY-MM-DD]-slug, the images are downloaded into that folder and the final Markdown version of the post is saved under the filename index.mdx within that folder.
The final script follows these steps:
- Create an output directory
- Get a list of HTML files in the posts directory of the Medium export
- Map over the HTML files and convert them to Markdown using the full conversion process
- Save the converted Markdown to a file within the folder for the post
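The ‘folder per post’ naming used when saving can be sketched as a small path builder. This assumes the square brackets in the format above are placeholder notation rather than literal characters. `postOutputPath` is a hypothetical helper, and plain string joining is used only to keep the example free of imports.

```javascript
// Build the output path for a post: <outputDir>/<YYYY-MM-DD>-<slug>/index.mdx.
// Assumes the brackets in "[YYYY-MM-DD]-slug" are placeholder notation.
function postOutputPath(outputDir, publishedAt, slug) {
  // Keep only the YYYY-MM-DD part of the (UTC) timestamp.
  const date = new Date(publishedAt).toISOString().slice(0, 10);
  return `${outputDir}/${date}-${slug}/index.mdx`;
}

// Invented values, for illustration only.
const outputPath = postOutputPath("output", "2021-03-04T10:00:00.000Z", "my-post");
console.log(outputPath);
```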
After the script has been run and created the posts there are a few manual steps that might be needed, for instance if your posts have been included in a publication on Medium.
Manual edits to content
If your post has been picked up by a publication then you’ll likely find that the amendments the publication made to advertise themselves have also been included in your exported post.
There’s no means to automate removing the publication edits as there’s nothing in the exported HTML markup that indicates where these changes have been made or if the article is part of a publication.
The best means I found to speed up the removal of these edits was to review the list of publications in your Medium account and search for the URL of the publication within the files.
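If your editor’s search isn’t convenient, the same check is easy to script. In this sketch `postsMentioning` is a hypothetical helper, and the posts and publication URL are invented. Reading the files from disk is left out.

```javascript
// Return the names of the posts whose content mentions the publication URL.
function postsMentioning(posts, publicationUrl) {
  return posts
    .filter((post) => post.content.includes(publicationUrl))
    .map((post) => post.name);
}

// Invented posts and publication URL, for illustration only.
const flagged = postsMentioning(
  [
    { name: "post-a/index.mdx", content: "Published in https://medium.com/some-publication" },
    { name: "post-b/index.mdx", content: "No publication edits here" },
  ],
  "https://medium.com/some-publication"
);
console.log(flagged);
```

The flagged files still need editing by hand, since the export does not mark where the publication’s changes begin or end.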
Setting up Gatsby to use the converted content
My conversion script is built with a specific theme in mind which requires the ‘folder per post’ format but you should be able to amend the script to output the files in a single directory if you want to use something else.
The theme I’ve used for my Gatsby site is the @lekoarts/gatsby-theme-minimal-blog starter. You can create the skeleton via gatsby new minimal-blog LekoArts/gatsby-starter-minimal-blog and then copy the generated output from the conversion script into the content/posts folder within the file structure.
Once I had the theme set up I then had to go and manually add tags to the frontmatter of each post so that I could use the tag functionality within the theme but this is optional.
Gotchas for the gatsby-theme-minimal-blog set up
GIF support needs setting up manually
The gatsby-theme-minimal-blog theme is a little awkward to work with if you want to include GIFs in your posts, as by default the underlying gatsby-plugin-mdx library only supports JPEG and PNG files.
You can override the theme’s gatsby-plugin-mdx configuration, however, by setting the mdx flag in the theme’s configuration to false and then defining your own gatsby-plugin-mdx configuration later on. This configuration worked for me (you’ll need to add the gatsby-remark-copy-linked-files and gatsby-remark-gifs libraries to your package.json).
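As a rough sketch of such a setup in gatsby-config.js, assuming a version of gatsby-plugin-mdx that takes the gatsbyRemarkPlugins option: the option values below (file extensions, maxWidth, plugin order) are illustrative assumptions rather than the exact configuration, so check them against the README of each plugin.

```javascript
// gatsby-config.js (sketch: option values are assumptions, not a verified setup)
module.exports = {
  plugins: [
    {
      resolve: "@lekoarts/gatsby-theme-minimal-blog",
      options: {
        // Turn off the theme's built-in gatsby-plugin-mdx configuration.
        mdx: false,
      },
    },
    {
      // Our own gatsby-plugin-mdx configuration, replacing the theme's.
      resolve: "gatsby-plugin-mdx",
      options: {
        extensions: [".mdx", ".md"],
        gatsbyRemarkPlugins: [
          { resolve: "gatsby-remark-images", options: { maxWidth: 960 } },
          { resolve: "gatsby-remark-copy-linked-files" },
          { resolve: "gatsby-remark-gifs" },
        ],
      },
    },
  ],
};
```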
Shadowing doesn’t work well if you turn off mdx in theme config
Shadowing is a way within Gatsby to define overrides for a theme’s components and content by defining your own version under the package structure of the original.
The gatsby-theme-minimal-blog theme uses this mechanism to allow you to define your own content for the hero and bottom text on the homepage by creating src/@lekoarts/gatsby-theme-minimal-blog/texts/bottom.mdx and src/@lekoarts/gatsby-theme-minimal-blog/texts/hero.mdx files.
However, when you disable the mdx flag for the theme in your config file, the shadowing of the bottom.mdx and hero.mdx files stops working. With the setting on it does work as intended, but my posts have GIFs so I had to find a workaround.
I fixed this by shadowing the src/@lekoarts/gatsby-theme-minimal-blog/components/homepage.tsx component and then redefining the imports for the Hero and Bottom components in there to point to the shadowed files.
Peer dependencies are tricky in npm 8+
The theme does mention this in its README but it’s easy to get caught out. When installing with npm 8+ you need to use the --legacy-peer-deps flag. I found that using Yarn was a better option, though I did need to configure Netlify to use Yarn instead of npm when building my site.
Summary
The export that Medium gives you isn’t in the best shape for portability but using unified, rehype and remark you can produce files that make it easy enough to bring them into another platform or website builder.
Now that I’ve ported my articles to Markdown and Gatsby I have a very portable version of the content I had in my Medium account, and I can start building up the capabilities of my website to make the most of the content I’ve already written.