From Newsletter to Podcast

Recently, I subscribed to a new newsletter and podcast called Journal Club, a daily email in which Malcolm Diggs walking through a recently published paper related to the field of computer science -- often involving machine learning. It contains a transcript and links to an audio recording and the paper. Unfortunately, this isn't how I like to consume podcasts. Instead, I use the Apple Podcasts app on my iPhone.

Is there a way to go from a series of emails in my iCloud account to an iTunes podcast?

Journal Club Podcast in Apple Podcasts on iPhone

Email

I use a custom domain with iCloud mail to receive mail at connor.zip addresses. Since I don't control my mail server, I can't use existing filter languages like Sieve to move or otherwise process emails.

MailRules

Instead, I wrote a simple mail filtering utility which connects via IMAP and listens for new messages to process. mailrules takes simple text rules such as:

if to ~ "^marketing[\\+\\.]"
    then move "Marketing";

This rule allows me to give out the address marketing+llbean@connor.zip, and when those emails arrive from any From address, they'll be delivered to the Marketing folder. Usually bogus email addresses would be returned to sender by iCloud, but with the catch all setting enabled they'll be delivered to my main address.

When mailrules starts, it applies its rules to all emails the inbox.
Then, it waits for additional events such as incoming emails and applies any rules to that email.

I use a goyacc generated parser to implement the rule language, which uses as input tokens from the lexer. The lexer is a modified version of Eli Bendersky's A Faster Lexer in Go¹. The parser builds a list of rules from the input rules file, and matches those rules in order to each email based on the metadata fetched from the email server.

For instance, the above rule would become the tokens:

IF IDENTIFIER TILDE QUOTE THEN MOVE QUOTE SEMICOLON

Let's walk through the relevant yacc rules:

We start at the rules rule, defined as either a single rule or a series of semicolon-delimited rules. Here we strip SEMICOLON off the end and rule must be IF IDENTIFIER TILDE QUOTE THEN MOVE QUOTE.
```
rules: rule SEMICOLON
    { $$ = append($$, $1) }
    ...
```
One of the options for a rule is an if ... then predicate followed by a move action. Here we break our tokens apart into a IDENTIFIER TILDE QUOTE and MOVE QUOTE to fill in the blanks.
```
rule: IF condition THEN move
    {
        $4.Predicate = $2
        $$ = $4
    }
    ...
```
Focusing on the latter part of the if ... then, the move is a simple keyword followed by a string. Notice its first argument, the predicate, is empty; it's assigned within the if ... then rule once the condition is resolved. With MOVE covered, string must be QUOTE.
```
move: MOVE string
    { $$ = rules.NewMoveRule(nil, $2) }
```
Within the if ... then, the condition can be a simple comparison, or it can contain and, or, not, etc. We're still handling IDENTIFIER TILDE QUOTE at this point.
```
condition: comparison
    { $$ = $1 }
    ...
```

The comparison we use here is the ~ regular expression match. With IDENTIFIER TILDE covered, string must be QUOTE.

comparison:
    IDENTIFIER TILDE string
    {
        rexp, err := regexp.Compile($3)
        if err != nil {
            yylex.Error(fmt.Sprintf("malformed regex '%s' in predicate: %v", $3, err))
            return -1
        }
        $$, err = rules.NewFieldPredicate($1, rexp)
        if err != nil {
            yylex.Error(err.Error())
            return -1
        }
    }
    ...

And as we expect, string is a QUOTE atom where we've handled normalizing escaped quotes:

string: QUOTE
    { $$ = strings.ReplaceAll(strings.ReplaceAll($1[1:len($1)-1], "\\\"", "\""), "\\\\", "\\") }

Which results in [MoveRule(FieldPredicate("to", /^marketing[\+\.]/), "Marketing")].

I then use the go-imap package to interact with the mail server. First we fetch metadata from the input server, the following is condensed:

Connect using TLS to our mail server
Login using our application credentials
Select the INBOX as our active mailbox
Use an infinite range to select all emails

c, _ := client.DialTLS("imap.mail.me.com:993", nil)
c.Login("username", "password")
mbox, _ := c.Select("INBOX", false)

// within processMailbox
seqset := new(imap.SeqSet)
seqset.AddRange(1, 0)
messages := make(chan *imap.Message, 10)
done := make(chan error, 1)
go func() {
    done <- c.UidFetch(seqset, []imap.FetchItem{imap.FetchUid, imap.FetchEnvelope}, messages)
}()

We use UIDs instead of sequence numbers because the sequence number of a message will change if a message with a lower sequence number is moved out of the inbox, which can lead to strange behavior. The envelope contains just enough metadata to apply our rules, without pulling the entire body and attachments.

Then, we apply rules to each of the emails.

for msg := range messages {
    for _, rule := range rules {
        rule.Message(msg)
    }
}

The rules are then applied to all emails they matched in the order of the rules file:

for _, rule := range rules {
    err := rule.Action(c)
    if err != nil {
        log.Println("Apply rule:", err)
    }
}

After the first pass, we wait for additional emails regarding our mailbox and at that point re-process:

for {
    processMailbox(c, mbox, rules)

    log.Println("Listening...")

    // Create a channel to receive mailbox updates
    updates := make(chan client.Update)
    c.Updates = updates

    // Start idling
    stop := make(chan struct{})
    done := make(chan error, 1)
    go func() {
        done <- c.Idle(stop, nil)
    }()

    // Listen for updates
    for {
        select {
        case update := <-updates:
            switch update := update.(type) {
            case *client.MailboxUpdate:
                if update.Mailbox.Name != "INBOX" {
                    break
                }
                log.Println("Saw change to Inbox")

                // stop idling
                close(stop)
                close(updates)
                c.Updates = nil
            }
        case err := <-done:
            if err != nil {
                log.Fatal(err)
            }
            goto Process
        }
    }
Process:
}

The rule keeps track of which messages matched and resets its internal state within Action. For instance, the Message match function for the move rule looks like:

func (r MoveRule) Message(msg *imap.Message) {
	if r.Predicate.MatchMessage(msg) {
		log.Printf("Moving '%s' to '%s'", msg.Envelope.Subject, r.Mailbox)
		r.messages.AddNum(msg.Uid)
	}
}

Here r.messages is an imap.SeqSet, which is used to represent a set of message UIDs. Also note that the predicate is pluggable and is swapped in by the parser matching logic based on whether the predicate is a simple regex or equivalence match or a more complex boolean logic statement.

Stream

To keep mailrules a generic IMAP email processing tool, I added a new stream command, which can be plugged into any number of backends. The rule looks like this:

if from ~ "^members@journalclub.io$"
    then stream rfc822 "curl --silent --show-error --fail-with-body --header \"Content-Type: message/rfc822\" --header \"Accept: application/json\" --data-binary @- http://email2rss/journalclub/email";

When the from address matches our regular expression, this rule sends the entire RFC 822 formatted email message into the input of the command provided. The command can be anything, in this case we use curl to send the body of the email to a sibling service running on the same Kubernetes cluster, email2rss.

To fetch the full representation of the email, StreamRule's Action function:

Initiates a Fetch using the UID set constructed in its Message matching logic, in which it asks for UID, RFC822.HEADER, and RFC822.TEXT.
Finds the header and text portions of the response for a given message and concatenates them together.
Executes the command with the stdin set to the message.

Since this rule asks for the rfc822 representation of a message instead of html, we don't attempt to parse the body of the message.

Podcasts

Apple Podcasts supports ingesting RSS feeds as long as they meet its requirements, which mostly involves the use of the itunes namespace and the recently standardized podcast namespace. See also Apple's required tags page and their sample feed.

Here's an example of what we need to produce:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
  xmlns:podcast="https://podcastindex.org/namespace/1.0" >
  <channel>
    <title>Journal Club</title>
    <link>https://journalclub.io/</link>
    <atom:link href="{REDACTED}" rel="self" type="application/rss+xml" />
    <language>en-us</language>
    <copyright>&#169; 2024 JournalClub.io</copyright>
    <itunes:author>Journal Club</itunes:author>
    <description> Journal Club is a premium daily newsletter and podcast authored and hosted by Malcolm Diggs. Each episode is lovingly crafted by hand, and delivered to your inbox every morning in text and audio form.</description>
    <itunes:image href="https://www.journalclub.io/cdn-cgi/image/width=1000/images/journals/journal-splash.png"/>
    <itunes:category text="Science" />
    <itunes:explicit>false</itunes:explicit>
    <item>
        <title>Employing deep learning in crisis management and decision making through prediction using time series data in Mosul Dam Northern Iraq</title>
        <description>
          <![CDATA[
          <p>Today's article comes from the PeerJ Computer Science journal. The authors are Khafaji et al., from the University of Sfax, in Tunisia. In this paper they attempt to develop machine learning models that can predict the water-level fluctuations within a dam in Iraq. If they succeed, it will help the dam operators prevent a catastrophic collapse. Let's see how well they did.</p>]]>
        </description>
        <guid isPermaLink="false">1b1dd75f-e37e-4c55-b759-dea3b1dbba3a</guid>
        <pubDate>Sun, 03 Nov 2024 13:55:35 UTC</pubDate>
        <enclosure url="{REDACTED}" length="12926609" type="audio/mpeg" />
        <itunes:image href="https://embed.filekitcdn.com/e/3Uk7tL4uX5yjQZM3sj7FA5/sSM8ecFNXywfm7M3qy1tWu" />
        <itunes:explicit>false</itunes:explicit>
    </item>
  </channel>
</rss>

Podcasts are RSS 2.0 feeds, in 2023 Apple deprecated the use of Atom feeds.

Object	Fields	Description
`<channel>`	`<title>`	The title of the feed, we use the name of the newsfeed.
`<channel>`	`<link>`	A link to the source of the information in the feed. Since this feed is based on an email newsletter and not a website, we use the homepage of the feed.
`<channel>`	`<atom:link>`	A self-link in the `atom` namespace, back-porting this feature from the Atom feed specification to RSS 2.0. We place the URL of the feed file itself here.
`<channel>`	`<language>`	The language of the content, in the same format of the `Accept-Language` HTTP header.
`<channel>`	`<copyright>`	Who owns the rights to the content in this file, we use the copyright statement from the homepage.
`<channel>`	`<itunes:author>`	This is the first of the `itunes` namespaced fields, the author of the content.
`<channel>`	`<description>`	A description of the content, we use the one available on the website.
`<channel>`	`<itunes:image>`	The image to use as cover art.
`<channel>`	`<itunes:category>`	The category of the podcast, this can also contain a subcategory.
`<channel>`	`<itunes:explicit>`	Whether or not this podcast contains explicit content.
`<item>`	`<title>`	The title of a podcast episode, extracted from the `Subject` line of the email.
`<item>`	`<description>`	A description of the podcast episode, taken from the first paragraph of the body of the email. This field can contain HTML tags such as paragraphs and links by using a `CDATA` block.
`<item>`	`<guid>`	A globally unique id, we use `X-Apple-UUID` so we must set `isPermaLink` to false since it's not a URL to the content.
`<item>`	`<pubDate>`	The date the podcast episode was published, we use the `Date` field from the email. This won't work in the case of back-dated episodes, for instance JournalClub has a mechanism to resend old episodes and those emails would have a renewed send date.
`<item>`	`<enclosure>`	The audio of the podcast episode, a URL along with its MIME type and file size.
`<item>`	`<itunes:image>`	The image to use for a specific podcast episode. We use the paper image, but it's so small Apple's podcast app ignores it.
`<item>`	`<itunes:explicit>`	Whether this particular episode is explicit.

We can use the W3C Feed Validation Service and the Podbase Podcast Validator for podcast-specific validation.

Email2RSS

At this point, mailrules has shelled out to curl which has sent the body of our Journal Club email to a sibling email2rss service, in the same Kubernetes cluster which mailrules is deployed within. This service has two relevant endpoints:

GET /{feed}/feed.xml which fetches the generated Podcast RSS.
POST /{feed}/email which accepts an RFC 822 formatted email and updates the Podcast RSS.

`POST /{feed}/email`

The POST endpoint needs to first parse the input email to find the HTML representation we'll be pulling relevant information from. To do that, we first need to parse the RFC 822 message body using Go's net/mail package using mail.ReadMessage(req.Body). Then, we extract the msg.Header.Date() which (RFC 3339 formatted) becomes the key for our state in cloud storage.

MIME²

To find the HTML, we use the MessageMIME method:

// MessageMIME finds and parses a portion of the message based on the MIME type
func MessageMIME(message *mail.Message, contentType string) (io.Reader, error) {
	mediaType, params, err := mime.ParseMediaType(message.Header.Get("Content-Type"))
	if err != nil {
		return nil, fmt.Errorf("parse message content type: %w", err)
	}
	if !strings.HasPrefix(mediaType, "multipart/") {
		return nil, fmt.Errorf("expected multipart message but found %s", mediaType)
	}
	reader := multipart.NewReader(message.Body, params["boundary"])
	if reader == nil {
		return nil, fmt.Errorf("could not construct multipart reader for message")
	}
	for {
		part, err := reader.NextPart()
		if err != nil {
			return nil, fmt.Errorf("could not find %s part of message: %w", contentType, err)
		}
		mediaType, _, err := mime.ParseMediaType(part.Header.Get("Content-Type"))
		if err != nil {
			return nil, fmt.Errorf("parse multipart message part content type: %w", err)
		}
		if mediaType == contentType {
			enc := strings.ToLower(part.Header.Get("Content-Transfer-Encoding"))
			switch enc {
			case "base64":
				return base64.NewDecoder(base64.StdEncoding, part), nil
			case "quoted-printable":
				return quotedprintable.NewReader(part), nil
			default:
				return part, nil
			}
		}
	}
}

The method parses the Content-Type header of the message to determine if it is multipart/, if so we need to determine the boundary string used to split each portion and iterate through each of the multiple parts using a multipart.Reader. As we iterate over each part, we again parse the Content-Type looking for our target text/html. Each of these message parts could be another multipart message (in which case we could recurse) or even an entire email (message/rfc822); but for our purposes we only expect a single level in the tree. Once we've found the appropriate portion, we check Content-Transfer-Encoding; in our case the email is quoted-printable³ encoded, which looks like:

<h3 style=3D"font-weight:bold;font-style:normal;font-size:1em;margin:0;font=
-size:1.17em;margin:1em 0;font-family:Charter, Georgia, Times New Roman, se=
rif;font-size:28px;color:#12363f;font-weight:400;letter-spacing:0;line-heig=
ht:1.5;text-transform:none;margin-top:0;margin-bottom:0" class=3D"">Employi=
ng deep learning in crisis management and decision making through predictio=
n using time series data in Mosul Dam Northern Iraq</h3>

Notice the trailing = for soft line-breaks and =3D to encode literal equal signs.

I may rewrite this in the future to leverage a more general library like go-message.

Parsing the HTML

Next, we extract information form the message using regular expressions⁴:

Expression	Description
`"(https?://[^ ]+\.mp3)"`	The audio recording link included in each email, which becomes the `<enclosure>` `url` field.
`<img src="(https?://[^ ]*)"`	An image of the first page of the paper we can use as the podcast episode `<itunes:image>`. Unfortunately this is too low-resolution to be used by the Podcast app.
`Hi[ ]+Connor, (.*)</p>`	The `<description>` of each episode, which begins with a salutation specific to each subscriber.
`<a [^>]href="(https?://(\w+\.)?doi.org[^"])"[^>]*>`	The link to the paper, using the DOI. This becomes part of the `<description>`.

Apple requires the <enclosure> length field contain the number of bytes within the file, so we send a HEAD request to the audio URL and record the Content-Length to populate this field.

We also extract some information from headers:

Header	Description
`Subject`	Populates `<title>`, but needs UTF-8 characters decoded according to RFC 2047 using `mime.WordDecoder`'s `DecodeHeader`
`Date`	Used for the file name in cloud storage.
`X-Apple-UUID`	Used for the `<guid>` tag.

Once we've extracted this information, we format our state into a JSON blob and write it to storage:

{
  "uuid": "1b1dd75f-e37e-4c55-b759-dea3b1dbba3a",
  "subject": "Employing deep learning in crisis management and decision making through prediction using time series data in Mosul Dam Northern Iraq",
  "description": "Today's article comes from the PeerJ Computer Science journal. The authors are Khafaji et al., from the University of Sfax, in Tunisia. In this paper they attempt to develop machine learning models that can predict the water-level fluctuations within a dam in Iraq. If they succeed, it will help the dam operators prevent a catastrophic collapse. Let's see how well they did.",
  "date": "2024-11-03T13:55:35Z",
  "imageURL": "https://embed.filekitcdn.com/e/3Uk7tL4uX5yjQZM3sj7FA5/sSM8ecFNXywfm7M3qy1tWu",
  "audioURL": "{REDACTED}",
  "audioSize": 12926609,
  "paperURL": "http://dx.doi.org/10.7717/peerj-cs.2416"
}

Then, all state files are read in from cloud storage and passed through a template to generate the new Podcast RSS feed:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
  xmlns:podcast="https://podcastindex.org/namespace/1.0" >
  <channel>
    <title>Journal Club</title>
    <link>https://journalclub.io/</link>
    <atom:link href="{REDACTED}" rel="self" type="application/rss+xml" />
    <language>en-us</language>
    <copyright>&#169; 2024 JournalClub.io</copyright>
    <itunes:author>Journal Club</itunes:author>
    <description> Journal Club is a premium daily newsletter and podcast authored and hosted by Malcolm Diggs. Each episode is lovingly crafted by hand, and delivered to your inbox every morning in text and audio form.</description>
    <itunes:image href="https://www.journalclub.io/cdn-cgi/image/width=1000/images/journals/journal-splash.png"/>
    <itunes:category text="Science" />
    <itunes:explicit>false</itunes:explicit>
    {{- range . }}
    <item>
        <title>{{.Subject}}</title>
        <description>
          <![CDATA[
          <p>{{- .Description -}}</p>
          {{- if .PaperURL -}}
            <p>Want the paper? This <a href="{{.PaperURL}}">link</a> will take you to the original DOI for the paper (on the publisher's site). You'll be able to grab the PDF from them directly.</p>
          {{- end -}}
          ]]>
        </description>
        <guid isPermaLink="false">{{.UUID}}</guid>
        <pubDate>{{ rfc2822 .Date }}</pubDate>
        <enclosure
            url="{{.AudioURL}}"
            length="{{.AudioSize}}"
            type="audio/mpeg"
            />
        <itunes:image href="{{.ImageURL}}" />
        <itunes:explicit>false</itunes:explicit>
    </item>
    {{- end }}
  </channel>
</rss>

And the final output is cached in cloud storage.

`GET /{feed}/feed.xml`

Using the portable blob package, we can avoid coupling ourselves to a specific cloud storage backend, and even write tests using a mem or file backend. Then, we use http.ServeContent to handle the finicky logic around Last-Mofied/If-Modified-Since and friends. Here's the implementation of GetFeed:

func (s *Server) GetFeed(w http.ResponseWriter, req *http.Request) {
	ctx := req.Context()
	key := fmt.Sprintf("%s/feed.xml", req.PathValue("feed"))
	attrs, err := s.bucket.Attributes(ctx, key)
    if err != nil {
		http.Error(w, "Could not fetch feed attributes", http.StatusInternalServerError)
		log.Printf("fetch object attributes: %v", err)
		return
	}
	blobReader, err := s.bucket.NewReader(ctx, key, nil)
	if err != nil {
		http.Error(w, "Could not fetch feed", http.StatusInternalServerError)
		log.Printf("construct object reader: %v", err)
		return
	}
	defer blobReader.Close()

	w.Header().Add("Content-Type", "application/xml+rss;charset=UTF-8")
	w.Header().Add("Content-Disposition", "inline")
	w.Header().Add("Cache-Control", "no-cache")
	w.Header().Add("ETag", attrs.ETag)
	http.ServeContent(w, req, "", blobReader.ModTime(), blobReader)
}

We can finally put the entire flow together:

Using

Once the email2rss's GET /{feed}/feed.xml endpoint has been published to the internet, and mailrules is sending emails to the service to populate the RSS feed; we can finally open the Podcasts app and see what we've accomplished.

On mobile, navigate in the Apple Podcasts app on iPhone to Library > ..., then Follow a Show by URI..., then paste the full URL of our feed.xml endpoint.
On desktop, navigate to File > Follow a Show by URI... (or command+shift+N).

Journal Club Podcast in Apple Podcasts on Mac

See also Lexical Scanning in Go by Rob Pike, which I've used as the basis for several previous projects involving lexers. ↩︎
MIME was introduced as an email standard in 1992, see a brief history in The MIME guys: How two Internet gurus changed e-mail forever. Its use in the Web through Content-Type and later Accept has not been with hiccups necessitating that browsers sniff content and even a MIME Sniffing standard. ↩︎
Email predates the Internet, and is always ASCII encoded. ASCII was introduced as a 7-bit standard for telegraphs in the 60s, so to represent 8-bit UTF-8 characters, we need a way to encode the 8-bit characters into 7-bit ASCII. The quoted-printable scheme (RFC 2045 §6.7) does this by using an = sign followed by two hex digits. In the case of a literal equals sign in the original text, it must be encoded =3D, 3D being the hex for the ASCII code for =. Quoted printable also requires that lines be a maximum of 76 characters long, if the original text is a longer a soft line break can be inserted which results in a = before the \r\n sequence. The only mention of line length in RFC 822 is when referencing long headers:

"Long" is commonly interpreted to mean greater than 65 or 72 characters. The former length serves as a limit, when the message is to be viewed on most simple terminals which use simple display software; however, the limit is not imposed by this standard.

Long ago, email was delivered on UNIX machines using UUCP (UNIX-to-UNIX Copy), to user-specific mailboxes and viewed with a command such as mail. In fact email addresses looked much different before DNS, for instance a UUCP bang path representing the full route to the sender or recipient. As an example, Brian Ried's essay on Interpress was sent from the address decwrl!glacier!reid. ↩︎
These were ported from my first attempt at a solution, which involved integrating the HTML parser into mailrules and generating item.xml intermediate files which were globbed into a final feed.xml all via shell script. The html command input is still an option for the stream rule in place of rfc822. I then used gsutil rsync to copy files from the mailrules pod shell script workspace to cloud storage, and a simple static server container using BusyBox's httpd and the same gsutil rsync in an init container and in a loop as a sidecar using a shared empty volume. ↩︎