From Newsletter to Podcast

By Connor Taffe | Published .

Recently, I subscribed to a new newsletter and podcast called Journal Club, a daily email in which Malcolm Diggs walks through a recently published paper related to the field of computer science, often involving machine learning. It contains a transcript and links to an audio recording and the paper. Unfortunately, this isn't how I like to consume podcasts. Instead, I use the Apple Podcasts app on my iPhone.

Is there a way to go from a series of emails in my iCloud account to an iTunes podcast?

Journal Club Podcast in Apple Podcasts on iPhone

Email

I use a custom domain with iCloud mail to receive mail at connor.zip addresses. Since I don't control my mail server, I can't use existing filter languages like Sieve to move or otherwise process emails.

MailRules

Instead, I wrote a simple mail filtering utility which connects via IMAP and listens for new messages to process. mailrules takes simple text rules such as:

if to ~ "^marketing[\\+\\.]"
    then move "Marketing";

This rule allows me to give out the address marketing+llbean@connor.zip, and when those emails arrive, whatever their From address, they'll be delivered to the Marketing folder. Usually iCloud would return mail for a bogus address to the sender, but with the catch-all setting enabled it is delivered to my main address.

I use a goyacc-generated parser to implement the rule language; it takes tokens from the lexer as input. The lexer is a modified version of Eli Bendersky's A Faster Lexer in Go [1]. The parser builds a list of rules from the input rules file and matches those rules, in order, against each email based on the metadata fetched from the mail server.

For instance, the above rule would become the tokens:

IF IDENTIFIER TILDE QUOTE THEN MOVE QUOTE SEMICOLON

Let's walk through the relevant yacc rules:

Which results in [MoveRule(FieldPredicate("to", /^marketing[\+\.]/), "Marketing")].

I then use the go-imap package to interact with the mail server. First, we fetch metadata from the server (the following is condensed):

c, _ := client.DialTLS("imap.mail.me.com:993", nil)
c.Login("username", "password")
mbox, _ := c.Select("INBOX", false)

// within processMailbox
seqset := new(imap.SeqSet)
seqset.AddRange(1, 0)
messages := make(chan *imap.Message, 10)
done := make(chan error, 1)
go func() {
    done <- c.UidFetch(seqset, []imap.FetchItem{imap.FetchUid, imap.FetchEnvelope}, messages)
}()

We use UIDs instead of sequence numbers because the sequence number of a message will change if a message with a lower sequence number is moved out of the inbox, which can lead to strange behavior. The envelope contains just enough metadata to apply our rules, without pulling the entire body and attachments.

Then, we apply rules to each of the emails.

for msg := range messages {
    for _, rule := range rules {
        rule.Message(msg)
    }
}

The rules are then applied to all emails they matched in the order of the rules file:

for _, rule := range rules {
    err := rule.Action(c)
    if err != nil {
        log.Println("Apply rule:", err)
    }
}

After the first pass, we wait for updates to our mailbox and re-process when one arrives:

for {
    processMailbox(c, mbox, rules)

    log.Println("Listening...")

    // Create a channel to receive mailbox updates
    updates := make(chan client.Update)
    c.Updates = updates

    // Start idling
    stop := make(chan struct{})
    done := make(chan error, 1)
    go func() {
        done <- c.Idle(stop, nil)
    }()

    // Listen for updates
    for {
        select {
        case update := <-updates:
            switch update := update.(type) {
            case *client.MailboxUpdate:
                if update.Mailbox.Name != "INBOX" {
                    break
                }
                log.Println("Saw change to Inbox")

                // stop idling
                close(stop)
                close(updates)
                c.Updates = nil
            }
        case err := <-done:
            if err != nil {
                log.Fatal(err)
            }
            goto Process
        }
    }
Process:
}

The rule keeps track of which messages matched and resets its internal state within Action. For instance, the Message match function for the move rule looks like:

func (r MoveRule) Message(msg *imap.Message) {
	if r.Predicate.MatchMessage(msg) {
		log.Printf("Moving '%s' to '%s'", msg.Envelope.Subject, r.Mailbox)
		r.messages.AddNum(msg.Uid)
	}
}

Here r.messages is an imap.SeqSet, which is used to represent a set of message UIDs. Also note that the predicate is pluggable: the parser swaps in an implementation depending on whether the predicate is a simple regex or equivalence match, or a more complex boolean logic statement.

Stream

To keep mailrules a generic IMAP email processing tool, I added a new stream command, which can be plugged into any number of backends. The rule looks like this:

if from ~ "^members@journalclub.io$"
    then stream rfc822 "curl --silent --show-error --fail-with-body --header \"Content-Type: message/rfc822\" --header \"Accept: application/json\" --data-binary @- http://email2rss/journalclub/email";

When the from address matches our regular expression, this rule sends the entire RFC 822 formatted email message to the standard input of the provided command. The command can be anything; in this case we use curl to send the email to a sibling service running on the same Kubernetes cluster, email2rss.

To fetch the full representation of the email, StreamRule's Action function:

Since this rule asks for the rfc822 representation of a message instead of html, we don't attempt to parse the body of the message.

Podcasts

Apple Podcasts supports ingesting RSS feeds as long as they meet its requirements, which mostly involve the use of the itunes namespace and the recently standardized podcast namespace. See also Apple's required tags page and their sample feed.

Here's an example of what we need to produce:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
  xmlns:podcast="https://podcastindex.org/namespace/1.0" >
  <channel>
    <title>Journal Club</title>
    <link>https://journalclub.io/</link>
    <atom:link href="{REDACTED}" rel="self" type="application/rss+xml" />
    <language>en-us</language>
    <copyright>&#169; 2024 JournalClub.io</copyright>
    <itunes:author>Journal Club</itunes:author>
    <description> Journal Club is a premium daily newsletter and podcast authored and hosted by Malcolm Diggs. Each episode is lovingly crafted by hand, and delivered to your inbox every morning in text and audio form.</description>
    <itunes:image href="https://www.journalclub.io/cdn-cgi/image/width=1000/images/journals/journal-splash.png"/>
    <itunes:category text="Science" />
    <itunes:explicit>false</itunes:explicit>
    <item>
        <title>Employing deep learning in crisis management and decision making through prediction using time series data in Mosul Dam Northern Iraq</title>
        <description>
          <![CDATA[
          <p>Today's article comes from the PeerJ Computer Science journal. The authors are Khafaji et al., from the University of Sfax, in Tunisia. In this paper they attempt to develop machine learning models that can predict the water-level fluctuations within a dam in Iraq. If they succeed, it will help the dam operators prevent a catastrophic collapse. Let's see how well they did.</p>]]>
        </description>
        <guid isPermaLink="false">1b1dd75f-e37e-4c55-b759-dea3b1dbba3a</guid>
        <pubDate>Sun, 03 Nov 2024 13:55:35 UTC</pubDate>
        <enclosure url="{REDACTED}" length="12926609" type="audio/mpeg" />
        <itunes:image href="https://embed.filekitcdn.com/e/3Uk7tL4uX5yjQZM3sj7FA5/sSM8ecFNXywfm7M3qy1tWu" />
        <itunes:explicit>false</itunes:explicit>
    </item>
  </channel>
</rss>

Podcasts are RSS 2.0 feeds; in 2023 Apple deprecated the use of Atom feeds.

| Object | Field | Description |
| --- | --- | --- |
| <channel> | <title> | The title of the feed; we use the name of the newsletter. |
| <channel> | <link> | A link to the source of the information in the feed. Since this feed is based on an email newsletter and not a website, we use the newsletter's homepage. |
| <channel> | <atom:link> | A self-link in the atom namespace, back-porting this feature from the Atom feed specification to RSS 2.0. We place the URL of the feed file itself here. |
| <channel> | <language> | The language of the content, in the same format as the Accept-Language HTTP header. |
| <channel> | <copyright> | Who owns the rights to the content in this file; we use the copyright statement from the homepage. |
| <channel> | <itunes:author> | The author of the content. This is the first of the itunes namespaced fields. |
| <channel> | <description> | A description of the content; we use the one available on the website. |
| <channel> | <itunes:image> | The image to use as cover art. |
| <channel> | <itunes:category> | The category of the podcast; this can also contain a subcategory. |
| <channel> | <itunes:explicit> | Whether or not this podcast contains explicit content. |
| <item> | <title> | The title of a podcast episode, extracted from the Subject line of the email. |
| <item> | <description> | A description of the podcast episode, taken from the first paragraph of the body of the email. This field can contain HTML tags such as paragraphs and links by using a CDATA block. |
| <item> | <guid> | A globally unique id. We use X-Apple-UUID, so we must set isPermaLink to false since it's not a URL to the content. |
| <item> | <pubDate> | The date the podcast episode was published; we use the Date field from the email. This won't work for back-dated episodes: for instance, JournalClub has a mechanism to resend old episodes, and those emails would have a renewed send date. |
| <item> | <enclosure> | The audio of the podcast episode: a URL along with its MIME type and file size. |
| <item> | <itunes:image> | The image to use for a specific podcast episode. We use the paper image, but it's so small that Apple's podcast app ignores it. |
| <item> | <itunes:explicit> | Whether this particular episode is explicit. |

We can use the W3C Feed Validation Service and the Podbase Podcast Validator for podcast-specific validation.

Email2RSS

At this point, mailrules has shelled out to curl, which has sent our Journal Club email to the sibling email2rss service deployed in the same Kubernetes cluster as mailrules. This service has two relevant endpoints:

POST /{feed}/email

The POST endpoint first needs to parse the input email to find the HTML representation from which we'll pull the relevant information. To do that, we parse the RFC 822 message with mail.ReadMessage(req.Body) from Go's net/mail package. Then we extract msg.Header.Date(), which, formatted as RFC 3339, becomes the key for our state in cloud storage.

MIME [2]

To find the HTML, we use the MessageMIME function:

// MessageMIME finds and parses a portion of the message based on the MIME type
func MessageMIME(message *mail.Message, contentType string) (io.Reader, error) {
	mediaType, params, err := mime.ParseMediaType(message.Header.Get("Content-Type"))
	if err != nil {
		return nil, fmt.Errorf("parse message content type: %w", err)
	}
	if !strings.HasPrefix(mediaType, "multipart/") {
		return nil, fmt.Errorf("expected multipart message but found %s", mediaType)
	}
	reader := multipart.NewReader(message.Body, params["boundary"])
	if reader == nil {
		return nil, fmt.Errorf("could not construct multipart reader for message")
	}
	for {
		part, err := reader.NextPart()
		if err != nil {
			return nil, fmt.Errorf("could not find %s part of message: %w", contentType, err)
		}
		mediaType, _, err := mime.ParseMediaType(part.Header.Get("Content-Type"))
		if err != nil {
			return nil, fmt.Errorf("parse multipart message part content type: %w", err)
		}
		if mediaType == contentType {
			enc := strings.ToLower(part.Header.Get("Content-Transfer-Encoding"))
			switch enc {
			case "base64":
				return base64.NewDecoder(base64.StdEncoding, part), nil
			case "quoted-printable":
				return quotedprintable.NewReader(part), nil
			default:
				return part, nil
			}
		}
	}
}

The function parses the Content-Type header of the message to determine whether it is multipart/. If so, we read the boundary string used to separate the parts and iterate through them using a multipart.Reader. As we iterate over each part, we again parse the Content-Type, looking for our target text/html. Each of these message parts could be another multipart message (in which case we could recurse) or even an entire email (message/rfc822), but for our purposes we only expect a single level in the tree. Once we've found the appropriate part, we check Content-Transfer-Encoding; in our case the email is quoted-printable [3] encoded, which looks like:

<h3 style=3D"font-weight:bold;font-style:normal;font-size:1em;margin:0;font=
-size:1.17em;margin:1em 0;font-family:Charter, Georgia, Times New Roman, se=
rif;font-size:28px;color:#12363f;font-weight:400;letter-spacing:0;line-heig=
ht:1.5;text-transform:none;margin-top:0;margin-bottom:0" class=3D"">Employi=
ng deep learning in crisis management and decision making through predictio=
n using time series data in Mosul Dam Northern Iraq</h3>

Notice the trailing = for soft line-breaks and =3D to encode literal equal signs.
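To see both escapes undone, here is a short sketch using the standard library's mime/quotedprintable reader on a fragment of the sample above. DecodeQuotedPrintable is a hypothetical wrapper for illustration.

```go
package main

import (
	"fmt"
	"io"
	"mime/quotedprintable"
	"strings"
)

// DecodeQuotedPrintable decodes a quoted-printable string, joining soft
// line breaks and turning =XX escapes back into bytes.
func DecodeQuotedPrintable(s string) (string, error) {
	b, err := io.ReadAll(quotedprintable.NewReader(strings.NewReader(s)))
	if err != nil {
		return "", fmt.Errorf("decode quoted-printable: %w", err)
	}
	return string(b), nil
}

func main() {
	encoded := "class=3D\"\">Employi=\r\nng deep learning"
	decoded, err := DecodeQuotedPrintable(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded) // prints: class="">Employing deep learning
}
```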

I may rewrite this in the future to leverage a more general library like go-message.

Parsing the HTML

Next, we extract information from the message using regular expressions [4]:

| Expression | Description |
| --- | --- |
| "(https?://[^ ]+\.mp3)" | The audio recording link included in each email, which becomes the <enclosure> url field. |
| <img src="(https?://[^ ]*)" | An image of the first page of the paper we can use as the podcast episode <itunes:image>. Unfortunately this is too low-resolution to be used by the Podcast app. |
| Hi[ ]+Connor, (.*)</p> | The <description> of each episode, which begins with a salutation specific to each subscriber. |
| <a [^>]*href="(https?://(\w+\.)?doi.org[^"]*)"[^>]*> | The link to the paper, using the DOI. This becomes part of the <description>. |
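The sketch below applies those four expressions to a small, made-up HTML fragment and keeps the first capture group of each. The expressions are copied from the list above; the Extracted type, the Extract and firstGroup helpers, and the sample HTML with its example.com URLs are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// The four expressions listed above, compiled once at package level.
var (
	audioRe = regexp.MustCompile(`"(https?://[^ ]+\.mp3)"`)
	imageRe = regexp.MustCompile(`<img src="(https?://[^ ]*)"`)
	descRe  = regexp.MustCompile(`Hi[ ]+Connor, (.*)</p>`)
	paperRe = regexp.MustCompile(`<a [^>]*href="(https?://(\w+\.)?doi.org[^"]*)"[^>]*>`)
)

// Extracted holds the first capture group of each expression.
type Extracted struct {
	AudioURL, ImageURL, Description, PaperURL string
}

// firstGroup returns the first capture group, or "" when nothing matches.
func firstGroup(re *regexp.Regexp, html string) string {
	if m := re.FindStringSubmatch(html); m != nil {
		return m[1]
	}
	return ""
}

// Extract pulls the episode fields out of the HTML part of an email.
func Extract(html string) Extracted {
	return Extracted{
		AudioURL:    firstGroup(audioRe, html),
		ImageURL:    firstGroup(imageRe, html),
		Description: firstGroup(descRe, html),
		PaperURL:    firstGroup(paperRe, html),
	}
}

// sampleHTML is a made-up fragment shaped like the newsletter body.
const sampleHTML = `<p>Hi Connor, Today's paper is about dams.</p>
<p><a href="https://example.com/audio/episode.mp3">Listen</a></p>
<p><img src="https://example.com/paper.png" width="100"></p>
<p><a class="doi" href="https://doi.org/10.7717/peerj-cs.2416">Paper</a></p>`

func main() {
	e := Extract(sampleHTML)
	fmt.Println(e.AudioURL)
	fmt.Println(e.ImageURL)
	fmt.Println(e.Description)
	fmt.Println(e.PaperURL)
}
```

Returning an empty string for a missing field lets the caller decide which fields are optional, such as the paper link.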

Apple requires the <enclosure> length field contain the number of bytes within the file, so we send a HEAD request to the audio URL and record the Content-Length to populate this field.

We also extract some information from headers:

| Header | Description |
| --- | --- |
| Subject | Populates <title>, but needs UTF-8 characters decoded according to RFC 2047 using mime.WordDecoder's DecodeHeader. |
| Date | Used for the file name in cloud storage. |
| X-Apple-UUID | Used for the <guid> tag. |
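For the Subject header, the RFC 2047 decoding step looks roughly like this. DecodeSubject is a hypothetical wrapper around the standard library's mime.WordDecoder, and the encoded subject is a made-up example.

```go
package main

import (
	"fmt"
	"mime"
)

// DecodeSubject decodes any RFC 2047 encoded-words in a header value and
// leaves plain ASCII text untouched.
func DecodeSubject(raw string) (string, error) {
	dec := new(mime.WordDecoder)
	decoded, err := dec.DecodeHeader(raw)
	if err != nil {
		return "", fmt.Errorf("decode header: %w", err)
	}
	return decoded, nil
}

func main() {
	subject, err := DecodeSubject("=?UTF-8?Q?Caf=C3=A9_society?=")
	if err != nil {
		panic(err)
	}
	fmt.Println(subject) // prints: Café society
}
```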

Once we've extracted this information, we format our state into a JSON blob and write it to storage:

{
  "uuid": "1b1dd75f-e37e-4c55-b759-dea3b1dbba3a",
  "subject": "Employing deep learning in crisis management and decision making through prediction using time series data in Mosul Dam Northern Iraq",
  "description": "Today's article comes from the PeerJ Computer Science journal. The authors are Khafaji et al., from the University of Sfax, in Tunisia. In this paper they attempt to develop machine learning models that can predict the water-level fluctuations within a dam in Iraq. If they succeed, it will help the dam operators prevent a catastrophic collapse. Let's see how well they did.",
  "date": "2024-11-03T13:55:35Z",
  "imageURL": "https://embed.filekitcdn.com/e/3Uk7tL4uX5yjQZM3sj7FA5/sSM8ecFNXywfm7M3qy1tWu",
  "audioURL": "{REDACTED}",
  "audioSize": 12926609,
  "paperURL": "http://dx.doi.org/10.7717/peerj-cs.2416"
}

Then, all state files are read in from cloud storage and passed through a template to generate the new Podcast RSS feed:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
  xmlns:podcast="https://podcastindex.org/namespace/1.0" >
  <channel>
    <title>Journal Club</title>
    <link>https://journalclub.io/</link>
    <atom:link href="{REDACTED}" rel="self" type="application/rss+xml" />
    <language>en-us</language>
    <copyright>&#169; 2024 JournalClub.io</copyright>
    <itunes:author>Journal Club</itunes:author>
    <description> Journal Club is a premium daily newsletter and podcast authored and hosted by Malcolm Diggs. Each episode is lovingly crafted by hand, and delivered to your inbox every morning in text and audio form.</description>
    <itunes:image href="https://www.journalclub.io/cdn-cgi/image/width=1000/images/journals/journal-splash.png"/>
    <itunes:category text="Science" />
    <itunes:explicit>false</itunes:explicit>
    {{- range . }}
    <item>
        <title>{{.Subject}}</title>
        <description>
          <![CDATA[
          <p>{{- .Description -}}</p>
          {{- if .PaperURL -}}
            <p>Want the paper? This <a href="{{.PaperURL}}">link</a> will take you to the original DOI for the paper (on the publisher's site). You'll be able to grab the PDF from them directly.</p>
          {{- end -}}
          ]]>
        </description>
        <guid isPermaLink="false">{{.UUID}}</guid>
        <pubDate>{{ rfc2822 .Date }}</pubDate>
        <enclosure
            url="{{.AudioURL}}"
            length="{{.AudioSize}}"
            type="audio/mpeg"
            />
        <itunes:image href="{{.ImageURL}}" />
        <itunes:explicit>false</itunes:explicit>
    </item>
    {{- end }}
  </channel>
</rss>

And the final output is cached in cloud storage.

GET /{feed}/feed.xml

Using the portable blob package, we avoid coupling ourselves to a specific cloud storage backend, and can even write tests using a mem or file backend. Then, we use http.ServeContent to handle the finicky logic around Last-Modified/If-Modified-Since and friends. Here's the implementation of GetFeed:

func (s *Server) GetFeed(w http.ResponseWriter, req *http.Request) {
	ctx := req.Context()
	key := fmt.Sprintf("%s/feed.xml", req.PathValue("feed"))
	attrs, err := s.bucket.Attributes(ctx, key)
	if err != nil {
		http.Error(w, "Could not fetch feed attributes", http.StatusInternalServerError)
		log.Printf("fetch object attributes: %v", err)
		return
	}
	blobReader, err := s.bucket.NewReader(ctx, key, nil)
	if err != nil {
		http.Error(w, "Could not fetch feed", http.StatusInternalServerError)
		log.Printf("construct object reader: %v", err)
		return
	}
	defer blobReader.Close()

	w.Header().Add("Content-Type", "application/rss+xml; charset=UTF-8")
	w.Header().Add("Content-Disposition", "inline")
	w.Header().Add("Cache-Control", "no-cache")
	w.Header().Add("ETag", attrs.ETag)
	http.ServeContent(w, req, "", blobReader.ModTime(), blobReader)
}

We can finally put the entire flow together:

[Diagram: the overall flow. Inside the Kubernetes cluster, the mailrules loop reads its rules, talks to the mail server over IMAP, and invokes the StreamRule, which sends the RFC 822 message to email2rss; email2rss writes state to Cloud Storage. On the iPhone, the Podcasts app fetches the RSS feed from email2rss by way of pfSense and Traefik.]

Using

Once email2rss's GET /{feed}/feed.xml endpoint has been published to the internet and mailrules is sending emails to the service to populate the RSS feed, we can finally open the Podcasts app and see what we've accomplished.

Journal Club Podcast in Apple Podcasts on Mac

  1. See also Lexical Scanning in Go by Rob Pike, which I've used as the basis for several previous projects involving lexers. ↩︎

  2. MIME was introduced as an email standard in 1992, see a brief history in The MIME guys: How two Internet gurus changed e-mail forever. Its use in the Web through Content-Type and later Accept has not been without hiccups, necessitating that browsers sniff content and even a MIME Sniffing standard. ↩︎

  3. Email predates the Internet, and is always ASCII encoded. ASCII was introduced as a 7-bit standard for telegraphs in the 60s, so to represent 8-bit UTF-8 characters, we need a way to encode the 8-bit characters into 7-bit ASCII. The quoted-printable scheme (RFC 2045 §6.7) does this by using an = sign followed by two hex digits. A literal equals sign in the original text must be encoded as =3D, 3D being the hex for the ASCII code for =. Quoted-printable also requires that lines be a maximum of 76 characters long; if the original text is longer, a soft line break can be inserted, which places an = before the \r\n sequence. The only mention of line length in RFC 822 is when referencing long headers:

    "Long" is commonly interpreted to mean greater than 65 or 72 characters. The former length serves as a limit, when the message is to be viewed on most simple terminals which use simple display software; however, the limit is not imposed by this standard.

    Long ago, email was delivered on UNIX machines using UUCP (UNIX-to-UNIX Copy) to user-specific mailboxes and viewed with a command such as mail. In fact, email addresses looked much different before DNS; for instance, a UUCP bang path represented the full route to the sender or recipient. As an example, Brian Reid's essay on Interpress was sent from the address decwrl!glacier!reid. ↩︎

  4. These were ported from my first attempt at a solution, which involved integrating the HTML parser into mailrules and generating item.xml intermediate files which were globbed into a final feed.xml, all via shell script. The html command input is still an option for the stream rule in place of rfc822. I then used gsutil rsync to copy files from the mailrules pod's shell script workspace to cloud storage, and a simple static server container using BusyBox's httpd, with the same gsutil rsync in an init container and in a loop as a sidecar using a shared empty volume. ↩︎