CardBot, A History

Several years ago, one of my friends told me that he had made an IRC channel for the Magic: The Gathering subreddit. He said that when they were talking about cards, it was often hard for people in the channel because they didn't have the text of those cards in front of them. He then asked me if I would be willing to write a bot for the channel to provide card text. I had never written an IRC bot, and that seemed a bit like yak-shaving to me, so I found an existing bot in Python that I could add modules to. I went about writing up a module so that people could get card text.

Version 1

The first incarnation of CardBot was a REST API driven module. I had stumbled upon the very nice Tutor project, written by David Chambers. At the time, it was a JavaScript app that provided a RESTful API for Wizards of the Coast's Gatherer, which holds the official errata'd text (Oracle text) for all Magic cards. Whenever something sent a request to the Tutor server, it would find the correct page in Gatherer, scrape it, and return the data. This worked pretty well, although you were limited in what you could do by the underlying API. Unfortunately, Chambers updated Tutor to be a JavaScript API rather than a REST API. This meant that I could no longer use it with my Python IRC modules. (I realize that I could have tried to switch to using [Hubot][hubot], which is JavaScript-based, but I was far more comfortable with Python.)

Version 2

Version 2 was supposed to be a stop-gap until I wrote something better. As usual, though, it lived a lot longer than intended. [Yawgatog][8] has, for quite a long time, made a text file full of Oracle text available on his web site. I wrote up a script that searched through this text file for the card name that people wanted and then read the subsequent lines until it hit a line with no text on it. This approach had some problems:

1. The text of every card was held in memory forever.
2. The program had to scan every line of the file for every search.
3. It didn't look back at previous lines. This caused some humorous interactions when searching for card names that were also keywords in the rules text of other cards. For example, there is a card named Flash and an ability named Flash. When someone searched for Flash, the card, they would just end up with part of the card text for the first card listed with the ability Flash.
4. Searching was not flexible. In order to let users do more complicated things, I had to make sure to account for all edge cases in parsing the text, or run into more problems like #3.
5. It was missing a lot of data that was available from Gatherer, like which sets the cards were part of and what the flavor text for each card was.
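
The Flash mix-up is easiest to see in code. What follows is a minimal sketch, not the original script: it assumes a file in which cards are separated by blank lines and each entry starts with the card name. The sample data is abbreviated, and the function names `naive_lookup` and `block_lookup` are made up for illustration.

```python
# Abbreviated, illustrative entries in a blank-line-separated text dump.
SAMPLE = """\
Ambush Viper
1G
Creature -- Snake
Flash
Deathtouch

Flash
1U
Instant
You may put a creature card from your hand onto the battlefield.
"""


def naive_lookup(text, name):
    """Return lines from the first line equal to `name` up to the next blank line."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == name.lower():
            block = []
            for following in lines[i:]:
                if not following.strip():
                    break
                block.append(following)
            return block
    return None


def block_lookup(text, name):
    """Split on blank lines and compare `name` only to each entry's first line."""
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if lines and lines[0].strip().lower() == name.lower():
            return lines
    return None


print(naive_lookup(SAMPLE, "Flash"))  # picks up the tail of Ambush Viper
print(block_lookup(SAMPLE, "Flash"))  # finds the card actually named Flash
```

Matching only the first line of each entry avoids the keyword collision, but it still scans the whole file on every search and keeps everything in memory, so it does nothing for the other problems on the list.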

I had intended to just write a version of Tutor myself that parsed Gatherer, but I did not really want to deal with keeping a scraper updated to match their HTML, especially since they had just recently updated it. I ended up sitting on this temporary version for a while before I came up with an idea.

Version 3

I knew that a number of programs existed to let you play with cards digitally and that each of these got new set lists in the form of an XML file or similarly formatted document. While sometimes these documents were scraped from Gatherer, sometimes they were made by hand as well. I knew that the community demand for them would always be high enough that someone would make one for the new sets. Several of these programs exist, so even if one disappeared, I could always rely on another. I decided that a tiered approach would make more sense if I wanted to account for changing data sources, so I set about designing an architecture more like this:

1. Data Source (XML, JSON, etc.)
2. Data Source Specific Parser
3. Card Database (Probably via an ORM like SQLAlchemy)
4. Data API
5. IRC Module

In this way, if I needed to change data sources, I only needed to rewrite the parser, and if I needed to change IRC bots, I only needed to rewrite the IRC module. Of course, changing the database might affect other layers, but it's hard to be completely decoupled. I ended up using the wonderful MTG JSON project as my data source. Its creator was responsive, the project had a huge amount of information, and it was open source as well.
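
To make the layering concrete, here is a rough sketch of how the tiers could fit together. It is not CardBot's actual code. The JSON shape is a simplified assumption rather than the real MTG JSON schema, the standard library's `sqlite3` stands in for an ORM like SQLAlchemy so the example stays self-contained, and every name in it (`parse_json_source`, `CardDatabase`, `CardAPI`, `handle_irc_command`, the `!card` command) is hypothetical.

```python
import json
import sqlite3

# Assumed, simplified JSON shape; the real data source is much richer.
RAW_JSON = (
    '[{"name": "Flash", "type": "Instant", '
    '"text": "You may put a creature card from your hand onto the battlefield."}]'
)


def parse_json_source(raw):
    """Data-source-specific parser: convert one source format into plain dicts."""
    return [
        {"name": c["name"], "type": c["type"], "text": c.get("text", "")}
        for c in json.loads(raw)
    ]


class CardDatabase:
    """Card database layer (sqlite3 here as a stand-in for an ORM)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cards "
            "(name TEXT PRIMARY KEY, type TEXT, text TEXT)"
        )

    def load(self, cards):
        with self.conn:  # commit on success, roll back on error
            self.conn.executemany(
                "INSERT OR REPLACE INTO cards VALUES (:name, :type, :text)", cards
            )

    def find(self, name):
        return self.conn.execute(
            "SELECT name, type, text FROM cards WHERE name = ? COLLATE NOCASE",
            (name,),
        ).fetchone()


class CardAPI:
    """Data API: what callers use; it hides the storage details."""

    def __init__(self, db):
        self.db = db

    def describe(self, name):
        row = self.db.find(name)
        return None if row is None else "{} | {} | {}".format(*row)


def handle_irc_command(api, message):
    """IRC module: the only layer that knows about chat command syntax."""
    if not message.startswith("!card "):
        return None
    name = message[len("!card "):].strip()
    return api.describe(name) or "No card named {!r}".format(name)


db = CardDatabase()
db.load(parse_json_source(RAW_JSON))
api = CardAPI(db)
print(handle_irc_command(api, "!card flash"))
```

Swapping the data source would mean writing a new function alongside `parse_json_source` that returns the same dicts, and moving to a different bot would mean replacing only `handle_irc_command`.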

This approach actually worked pretty well. I was able to add a ton of features to the bot that did not exist before, like displaying image links, showing rules issues related to cards, and printing the flavor text on different editions of the card. During this time, CardBot was invited to two other MTG-related channels and performed really well. I wrote up some deployment scripts to help make updating the database easier, and I wrote up a few tests for the code. I did run into some issues, though.

Problems

Unicode proved to be a real problem. The bot was written in Python 2, which has some challenges with handling Unicode. Most Magic cards use only ASCII characters, but quite a few do not. When I was manually testing, everything appeared to be fine, because I was not testing with Unicode characters. Then, after deployment, someone would show me something that broke. To complicate matters further, the terminal on my local machine, the terminal on my server, and IRC all supported different character sets. Characters that showed up fine on my local terminal didn't work on the server terminal. Things that showed up fine in IRC raised exceptions when written to logs. Eventually I got things squared away by using Unicode types everywhere and specifying the encoding of many of my source files.

There were other Unicode-related issues, too. For example, how do you search for a card like Jötun Grunt? I could have people type it out, but the most likely scenario is that they type Jotun Grunt. I ended up finding a package called unidecode that makes the most common substitutions for characters like that, and I used it to create a search_name field in the database. This also helped when communicating with other services (to get card price information, for example), since a number of them do not use Unicode in card names.
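
As an illustration of the search_name idea, here is a small sketch in Python 3. It uses the standard library's `unicodedata` as a stand-in for unidecode so that it runs without extra packages. That covers accented letters like ö, but unlike a transliteration library it simply drops characters that have no decomposition, such as Æ. The function name `make_search_name` and the choice to lowercase the result are assumptions for the example.

```python
import unicodedata


def make_search_name(name):
    """Fold a card name to lowercase ASCII for a search_name column."""
    # NFKD splits "ö" into "o" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard anything still outside ASCII, then lowercase.
    return stripped.encode("ascii", "ignore").decode("ascii").lower()


print(make_search_name("Jötun Grunt"))
print(make_search_name("Jotun Grunt") == make_search_name("Jötun Grunt"))
```

Running the user's query through the same function before comparing it to search_name means that both spellings find the card.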