MarkMail Takes Mailing List Archives to the Next Level
...MarkMail provides all this and more. MarkLogic has stored approximately 5.5 million email messages across over 700 plus open source mailing lists -- all of the Apache, MySQL, Mozilla, and PHP lists, plus a smattering of others, with more to be added over time (hopefully soon) -- and provided an interface that beats Googling. It's as fast or faster, but more importantly, you have built-in data mining capabilities that, I trust, will eventually make their way into more traditional email systems...Where MarkMail really shines is in managing large mail archives. And that's why, of course, MarkLogic has put up MarkMail for free. They know that there are potential corporate clients who have huge mail archives that they want to mine. And the performance of their existing systems (not to mention their interfaces) just won't cut it.
InfoQ sat down with Jason Hunter of MarkLogic to find out more about the site and where it is heading in the future.
Could you explain what an XML content server is in a nutshell for readers that might be unfamiliar?
An XML content server is a database built for content (as opposed to one built for data). Content is stuff that's textual, hierarchical, and commonly without a repeating structure. It's frequently encoded in XML. Some examples are books, articles, web pages, blog entries, and email messages.
We built MarkMail on top of MarkLogic Server, a commercial XML content server. MarkLogic Server has a transactional store but instead of thinking in tables it thinks in XML. In MarkMail every email is represented and held as an XML document. All the text searches, faceted navigation, analytic queries and HTML page renderings you see are performed by a single MarkLogic Server machine against millions of XML documents.
Why is XML a better format for email than an RDBMS?
I've worked on relational database-based email search systems before and I can tell you it's a real challenge because of the nature of email. Email is messy. It's textual, irregular, and hierarchical. It's all the things that challenge traditional relational-centric databases.
Email headers are fairly well structured, but not perfectly, and there's no inherent size cap on any field. The email body itself may seem like just flat text (what you'd call unstructured), but really there's more to it. There are paragraphs, nested quote blocks (where person A quotes person B), initial greetings, and trailing footers. There are also attachments, inside of which there are pages and paragraphs and things like that, all hierarchically laid out.
With MarkMail we load every email into MarkLogic Server as an XML document. It's a very natural format for email. We mark up the headers, the body paragraphs, the quote blocks, the greetings, the footers, and everything. Even the attachments and their contents. Then, using XQuery as an XML-aware query language, we can build an application that uses the structure.
We use the internal XML structure for many purposes. For example, when performing a basic search we exclude all the boilerplate footer text. We can do that because we recognize it for what it is.
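As a rough illustration of the idea (in Python rather than the XQuery that MarkMail actually uses), here is a minimal sketch of a search that ignores footer text. The element names (message, body, para, quote, footer) and the helper functions are assumptions made up for this example, not MarkMail's real schema or code.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element names are illustrative only.
EMAIL_XML = """
<message>
  <headers>
    <from>alice@example.org</from>
    <subject>Build question</subject>
    <date>2007-11-02</date>
  </headers>
  <body>
    <greeting>Hi all,</greeting>
    <para>The nightly build fails on the parser module.</para>
    <quote>
      <para>Did anyone try the new release?</para>
    </quote>
    <footer>To unsubscribe, send mail to the release list.</footer>
  </body>
</message>
"""


def searchable_text(message, skip=("footer",)):
    """Collect body text, leaving out any subtree whose tag is in `skip`."""
    parts = []

    def walk(node):
        if node.tag in skip:
            return  # the whole subtree is boilerplate for this search
        if node.text and node.text.strip():
            parts.append(node.text.strip())
        for child in node:
            walk(child)
            # Tail text belongs to the parent, so keep it even if the child was skipped.
            if child.tail and child.tail.strip():
                parts.append(child.tail.strip())

    walk(message.find("body"))
    return " ".join(parts)


def matches(message, term, skip=("footer",)):
    """Case-insensitive substring match against the non-skipped body text."""
    return term.lower() in searchable_text(message, skip).lower()


message = ET.fromstring(EMAIL_XML)
print(matches(message, "release"))      # True: the term appears in a quoted paragraph
print(matches(message, "unsubscribe"))  # False: the term appears only in the footer
```

Passing skip=() to matches searches everything, footer included, which is the behavior a footer-oriented lookup would want.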
Of course there are times you may want to include the footer text, in fact maybe only the footer text, like with a contact information lookup service. Imagine a service where you give us a person's name, and we'll find their contact information for you by extracting their signature footers, and showing you the most recent at the top. We can do this because we can mix the (irregular and unpredictable but still present) structure in the email body with the (more regular but still not fully predictable) structure of the metadata telling us who posted the mail and when it was sent.
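A contact lookup of this kind could be sketched roughly as follows, again in Python for illustration. The archive layout, the element names, the sample people, and the latest_footers helper are all hypothetical; the point is only that sender and date metadata can be combined with footer structure in one pass.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical archive: element names and contents are invented for this sketch.
ARCHIVE_XML = """
<archive>
  <message>
    <headers><from>Pat Example</from><date>2006-03-14</date></headers>
    <body><para>Patch attached.</para>
      <footer>Pat Example, Old Corp, +1 555 0100</footer></body>
  </message>
  <message>
    <headers><from>Sam Sample</from><date>2007-01-05</date></headers>
    <body><para>Looks good to me.</para>
      <footer>Sam Sample, Some Org</footer></body>
  </message>
  <message>
    <headers><from>Pat Example</from><date>2007-09-30</date></headers>
    <body><para>New address below.</para>
      <footer>Pat Example, New Corp, +1 555 0199</footer></body>
  </message>
</archive>
"""


def latest_footers(archive, sender):
    """Return the sender's signature footers, most recent message first."""
    found = []
    for msg in archive.iter("message"):
        if msg.findtext("headers/from") != sender:
            continue
        sent = date.fromisoformat(msg.findtext("headers/date"))
        for footer in msg.iter("footer"):
            found.append((sent, footer.text.strip()))
    found.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in found]


archive = ET.fromstring(ARCHIVE_XML)
for footer in latest_footers(archive, "Pat Example"):
    print(footer)
```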
Another example is our opt:noquote search option. It indicates you want to find text matches in original text only, not quoted text. So take this search as an example:
"godwin's law" opt:noquote
It finds emails where the sender wrote the phrase "godwin's law" and excludes emails where the sender only quoted someone else writing the phrase. Because we see quote structure in emails, this is easy. We use the quote structure for other things too, like to color code the email messages when we display them.
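To make the mechanics concrete, here is a small Python sketch of how an option like opt:noquote might be applied once quote blocks are marked up. The query parser, the element names, and the sample messages are assumptions for illustration; MarkMail's own implementation is in XQuery and is not shown in this interview.

```python
import xml.etree.ElementTree as ET

# Two hypothetical messages: one uses the phrase directly, one only quotes it.
MESSAGES = [
    "<message id='1'><body><para>This thread just hit Godwin's law.</para></body></message>",
    "<message id='2'><body><quote><para>This thread just hit Godwin's law.</para></quote>"
    "<para>Please stay on topic.</para></body></message>",
]


def parse_query(query):
    """Split a query into a lowercase phrase and a set of opt: flags."""
    tokens = query.split()
    options = {t[len("opt:"):] for t in tokens if t.startswith("opt:")}
    phrase = " ".join(t for t in tokens if not t.startswith("opt:"))
    return phrase.strip('"').lower(), options


def text_of(node, skip_tags):
    """Concatenate text under `node`, omitting subtrees whose tag is skipped."""
    if node.tag in skip_tags:
        return ""
    pieces = [node.text or ""]
    for child in node:
        pieces.append(text_of(child, skip_tags))
        pieces.append(child.tail or "")
    return " ".join(pieces)


def search(query, messages):
    """Return ids of messages whose (optionally quote-free) body contains the phrase."""
    phrase, options = parse_query(query)
    skip = {"quote"} if "noquote" in options else set()
    hits = []
    for raw in messages:
        msg = ET.fromstring(raw)
        if phrase in text_of(msg.find("body"), skip).lower():
            hits.append(msg.get("id"))
    return hits


print(search('"godwin\'s law" opt:noquote', MESSAGES))  # ['1']
print(search('"godwin\'s law"', MESSAGES))              # ['1', '2']
```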
For another example, we use structure for our handling of attachment files. Try this search:
It finds emails containing attachments that have a PPT extension and relate somehow to axis (an Apache web services project). You'll see with the email results that not only do we know which attachments include an "axis" term hit but on how many slides, and when you view the attachment we underline the slides with the hits. That's all possible because what seems unstructured actually has lots of structure.
In developing MarkMail what was the most interesting thing you learned about MarkLogic Server?
When developing MarkMail we decided to push the envelope a little and build the site without Java, Ruby, Python, Perl, or any of the traditional web languages. Instead we developed with XQuery, an XML-centric language from the World Wide Web Consortium (W3C).
Think about how a classic web-based Java app works. Your data resides in relational tables, you program things in objects, and you convert objects to structured HTML for the web browser. It's a double impedance mismatch, it's painful, and it's why there's a new ORM tool and web framework invented every month promising to take away the pain.
By using XQuery, we kept a unified XML-centric view at all layers. As our model, the millions of XML documents holding emails. As our controller, XQuery with its native support for XML. As our view, a set of XQuery libraries that could transform any email or email fragment to produce XHTML.
One of the nice perks we discovered along the way is that life's better when the rendered node is "live", or still tied to its native placement. So if I say to a render library "here, render this footer from an email" it can still walk up the tree to find the date, sender, subject, or whatever it might want; it can even query to see if it's the person's latest footer or not. When you interact with "dead" data you have to pass the view logic everything it might need. Sounds nice, but over time that changes, meaning you have to adjust your "handshake" (or more commonly the view just doesn't do something it could because it's too hard to give it the data it needs). Then you have times where something's expensive to calculate and only needed *sometimes* by the view, so what do you do? Pay the price every time? No, often you try to get fancy about passing it selectively. Well, I'm happy I don't need to worry about that again.
Of course, the fundamental overwhelming reason we did it with pure XQuery is that mixed content (like email here) is really, really hard to represent as anything other than XML. So the model kind of demands to be XML. And as much as I love JDOM, it's a weak substitute for XQuery when writing a controller to work against the XML. When your model is XML and your view produces XML (as XHTML), it's nice when the language you use to pull the model and produce the view is also XML-centric.
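The "live" node idea can be approximated in Python as a sketch. XQuery and XPath provide parent and ancestor axes directly; Python's standard ElementTree does not keep parent links, so this example builds a child-to-parent map first. The document shape and the render_footer function are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical message markup, invented for this sketch.
DOC = ET.fromstring(
    "<message>"
    "<headers><from>Pat Example</from><date>2007-09-30</date></headers>"
    "<body><para>See attached.</para><footer>Pat Example, New Corp</footer></body>"
    "</message>"
)

# ElementTree nodes have no parent pointer, so build a child -> parent lookup.
PARENTS = {child: parent for parent in DOC.iter() for child in parent}


def enclosing_message(node):
    """Walk up the tree from `node` until the enclosing <message> is reached."""
    while node is not None and node.tag != "message":
        node = PARENTS.get(node)
    return node


def render_footer(footer):
    """Render a footer as an XHTML fragment, pulling context from its ancestors."""
    msg = enclosing_message(footer)
    sender = escape(msg.findtext("headers/from"))
    sent = escape(msg.findtext("headers/date"))
    return f'<div class="footer" title="{sender}, {sent}">{escape(footer.text)}</div>'


print(render_footer(DOC.find("body/footer")))
```

The caller hands over only the footer node; the renderer finds the sender and date on its own, which is the "no handshake" property described above.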
MarkMail features a number of different analytic views during search. What drove the current views and what has been suggested in terms of future enhancements?
We believe people don't just want to search content, they want to interact with content. Sometimes you want to find the needle, and sometimes you want to see what the haystack looks like. Other times you just want to explore.
We find the histogram charting helps people get their bearings with what they're looking at. It's also a great way to spot trends and explore what some call "the collective intelligence".
Is discussion of a topic trending up or down? Does discussion come and go? Is there a pattern to it? As a fun example, search for "javaone". You'll see spikes each year right around the month of the conference. You also see, despite what you might think, that discussion has increased year over year, especially in 2006.
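The counting behind a chart like this can be sketched in a few lines of Python. The dates below are invented sample data, not MarkMail results, and monthly_histogram is a hypothetical helper that assumes ISO-formatted date strings.

```python
from collections import Counter


def monthly_histogram(dates):
    """Count messages per month, given ISO date strings such as '2006-05-19'."""
    return Counter(d[:7] for d in dates)  # 'YYYY-MM' prefix identifies the month


# Invented sample: dates of messages matching some search term.
hit_dates = ["2006-05-02", "2006-05-19", "2006-06-01", "2007-05-11"]

histogram = monthly_histogram(hit_dates)
for month in sorted(histogram):
    print(month, "#" * histogram[month])  # one '#' per matching message
```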
What's coming next with MarkMail analytics? Interaction with the chart, new charts, and new visualizations. That's as specific as I'll be right now.
The website makes heavy use of Ajax. What frameworks did you consider and why did you pick the one you did?
I wish I could say we found an off-the-shelf Ajax framework that satisfied our every need, gave us all the widgets we could ever want, and made everything as easy as a walk in the park. The truth is we used MochiKit for the low-level Ajax interactions, but developed almost all our own widgetry. The left/right slide, the facet select boxes, the thumbnail popup viewer, and the interactive results list, everything we developed ourselves.
What was the most challenging UI feature to implement?
The left/right slide when you navigate between "result set" and "thread/message" views took a few tries to get right. Here's the background story from Ryan Grimm, who invented the magic to get that working:
"First, I tried using three div's and setting each of them to float left so they would appear next to each other. The plan was to then adjust the left margin of the first div and have the others slide left along with it. This proved to be quite difficult to get working in the major browsers.
So next I said forget this fancy CSS layout, I'll try a table-based layout. One table row with three table cells, one for each section. I wanted to just adjust the position of the table. But as soon as I thought of this I realized that I could combine the previous two solutions into one.
So I ended up putting all three of the content div's inside one main div (with an id of 'please' because I was really hoping that this would work). Then by adjusting the left margin on the 'please' div I could get all three of them to slide together.
One of the side effects of this sliding trick is figuring out what to do when the user resizes their browser. When the user makes their browser bigger we need to adjust a few things, namely the width of the search results div so the message div doesn't sneak itself into the view. When working on this we realized that some of the newer displays are actually wide enough to show all three columns. So when resizing we're constantly checking to see if the user's window is large enough to show all three.
I don't adjust the visibility on any of the sliding columns. They are all still visible (just off screen) and the width of the content doesn't actually change."
We have in fact developed one feature that's even more technically challenging than the slider, although we haven't yet released it on the public site. It's a new rendering technique that makes it easier to read mail messages having multiple search term hits. We'll be happy to talk more about that once we've put it out into the public eye.
What is on the agenda for future versions of MarkMail? Do you see it becoming competitive with GMane or Nabble in terms of archiving most open source mailing lists?
GMane was designed as a mail to news gateway, and I'd recommend it for anyone wanting to view mailing lists as news feeds. Nabble does a lot with hosting forums for people, and if you want to do that, they're a good place for it.
At MarkMail we're focusing on search, analytics, and performance -- keeping a strong focus on helping communities get more out of their histories. We've loaded two million more emails since launch and will continue to load additional mailing lists, both for open source projects and others. We're at 6 million now. We prioritize loads based on community requests, so if anyone would like their lists added, let us know at http://markmail.org/docs/feedback.xqy.