CardBot, A History

Several years ago, one of my friends told me that he had made an IRC channel for the Magic: The Gathering subreddit. He said that when they were talking about cards, it was often hard for people in the channel because they didn't have the text of those cards in front of them. He then asked me if I would be willing to write a bot for the channel to provide card text. I had never written an IRC bot, and that seemed a bit like yak-shaving to me, so I found an existing bot in Python that I could add modules to. I went about writing up a module so that people could get card text.

Version 1

The first incarnation of CardBot was a REST API driven module. I had stumbled upon the very nice Tutor project, written by David Chambers. At the time, it was a JavaScript app that provided a RESTful API for Wizards of the Coast's Gatherer, which holds the official errata'd text (Oracle text) for all Magic cards. Whenever something sent a request to the Tutor server, it would find the correct page in Gatherer, scrape it, and return the data. This worked pretty well, although you were limited in what you could do by the underlying API. Unfortunately, Chambers updated Tutor to be a JavaScript API rather than a REST API. This meant that I could no longer use it with my Python IRC modules. (I realize that I could have tried to switch to using [Hubot][hubot], which is JavaScript-based, but I was far more comfortable with Python.)

Version 2

Version 2 was supposed to be a stop-gap until I wrote something better. As usual, though, it lived a lot longer than intended. [Yawgatog][8] has, for quite a long time, made a text file full of Oracle text available on his web site. I wrote up a script that searched through this text file for the card name that people wanted and then read the subsequent lines until it hit a line with no text on it. This approach had some problems:

  1. The text of every card was held in memory forever.
  2. The program had to scan every line of the file for every search.
  3. It didn't look back at previous lines. This caused some humorous interactions when searching for card names that were also keywords on the rules text of the card. For example, there is a card named Flash and an ability named Flash. When someone would search for Flash, the card, they would just end up with part of the card text for the first card listed with the ability Flash.
  4. Searching was not flexible. In order to let users do more complicated things, I had to make sure to account for all edge cases in parsing the text, or run into more problems like #3.
  5. It was missing a lot of data that was available from Gatherer like which sets the cards were part of and what the flavor text for each card was.
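
To make problem #3 concrete, here is a minimal sketch of that kind of lookup. It is not the original script: the sample data, the blank-line-separated layout, and the function names `naive_lookup` and `block_lookup` are all assumptions made for illustration, and the card text is a placeholder.

```python
# Made-up sample in an assumed layout: one card per block, blocks separated
# by a blank line, card name on the first line of each block.
SAMPLE = """\
Example Bird
2W
Creature - Bird
Flash
Flying

Flash
1U
Instant
(rules text of the card named Flash)
"""


def naive_lookup(text, name):
    """Return lines from the first line equal to `name` up to the next blank line."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line == name:
            block = []
            for following in lines[i:]:
                if not following.strip():
                    break
                block.append(following)
            return block
    return None


def block_lookup(text, name):
    """Only match `name` when it is the first line of a block."""
    for block in text.split("\n\n"):
        lines = block.strip().splitlines()
        if lines and lines[0] == name:
            return lines
    return None


print(naive_lookup(SAMPLE, "Flash"))  # stops inside Example Bird's rules text
print(block_lookup(SAMPLE, "Flash"))  # finds the card whose name is Flash
```

The naive version matches the keyword line inside the first card and returns only the tail of that card, which is the behavior described in #3. Checking that the match sits at the start of a block avoids it, though the whole file is still scanned on every search.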

I had intended to just write a version of Tutor myself that parsed Gatherer, but I did not really want to deal with keeping a scraper updated to match their HTML, especially since they had just recently updated it. I ended up sitting on this temporary version for a while before I came up with an idea.

Version 3

I knew that a number of programs existed to let you play with cards digitally and that each of these got new set lists in the form of an XML file or similarly formatted document. While sometimes these documents were scraped from Gatherer, sometimes they were made by hand as well. I knew that the community demand for them would always be high enough that someone would make one for the new sets. Several of these programs exist, so even if one disappeared, I could always rely on another. I decided that a tiered approach would make more sense if I wanted to account for changing data sources, so I set about designing an architecture more like this:

  1. Data Source (XML, JSON, etc)
  2. Data Source Specific Parser
  3. Card Database (Probably via an ORM like SQLAlchemy)
  4. Data API
  5. IRC Module

In this way, if I needed to change data sources, I only needed to rewrite the parser and if I needed to change IRC bots, I would only need to rewrite the IRC module. Of course, changing the database might affect other layers, but it's hard to be completely decoupled. I ended up using the wonderful MTG JSON project as my data source. The creator of it was responsive, it had a huge amount of information, and was open source as well.
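
The layering above can be sketched roughly as follows. This is an illustration rather than the actual CardBot code: the `Card`, `CardParser`, `JsonParser`, `CardDatabase`, and `irc_reply` names are hypothetical, the JSON shape is made up (it is not the real MTG JSON schema), and the standard library's `sqlite3` stands in for an ORM so the example runs on its own.

```python
import json
import sqlite3
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Protocol


@dataclass
class Card:
    name: str
    text: str


class CardParser(Protocol):
    """Layer 2: anything that turns one raw data source into Card objects."""

    def parse(self, raw: str) -> Iterator[Card]: ...


class JsonParser:
    """Hypothetical parser for a JSON list of {"name": ..., "text": ...} objects."""

    def parse(self, raw: str) -> Iterator[Card]:
        for entry in json.loads(raw):
            yield Card(name=entry["name"], text=entry.get("text", ""))


class CardDatabase:
    """Layers 3 and 4: storage plus a small lookup API."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cards (name TEXT PRIMARY KEY, text TEXT)"
        )

    def load(self, cards: Iterable[Card]) -> None:
        with self.conn:  # one transaction for the whole load
            self.conn.executemany(
                "INSERT OR REPLACE INTO cards VALUES (?, ?)",
                ((c.name, c.text) for c in cards),
            )

    def find(self, name: str) -> Optional[Card]:
        row = self.conn.execute(
            "SELECT name, text FROM cards WHERE name = ?", (name,)
        ).fetchone()
        return Card(*row) if row else None


def irc_reply(db: CardDatabase, query: str) -> str:
    """Layer 5: the only part that knows about IRC's single-line output."""
    card = db.find(query)
    return f"{card.name} | {card.text}" if card else f"No card named {query!r}"


db = CardDatabase()
db.load(JsonParser().parse('[{"name": "Example Card", "text": "Draw a card."}]'))
print(irc_reply(db, "Example Card"))
```

Swapping the data source means writing another class with a `parse` method, and swapping the chat front end means replacing `irc_reply`. Neither change touches the database layer.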

This approach actually worked pretty well. I was able to add a ton of features to the bot that did not exist before like displaying image links, showing rules issues related to cards, and printing the flavor text on different editions of the card. During this time, CardBot was invited to two other MTG related channels and performed really well. I wrote up some deployment scripts to help make updating the database easier and I wrote up a few tests for the code. I did run into some issues.

Problems

Unicode proved to be a real problem. The bot was in Python 2, which has some challenges with handling Unicode. Most Magic cards use only ASCII characters, but a lot do not. When I was manually testing, everything appeared to be fine, because I was not testing Unicode characters. Then after deployment, someone would show me something that broke. To complicate matters further, the terminal on my local machine, the terminal on my server, and IRC all supported different character sets. Characters that would show up fine on my local terminal didn't work on the server terminal. Things that showed up fine in IRC raised exceptions when writing to logs. Eventually I got things squared away by using Unicode types everywhere and specifying the encoding of many of my source files. There were other Unicode related issues, too. For example, how do you search for a card like Jötun Grunt? I could have people type it out, but the most likely scenario is that they type Jotun Grunt. I ended up finding a package called unidecode that makes the most common substitutions for characters like that, and I used it to create a search_name field in the database. This also helped when communicating with other services (to get card price information, for example) since a number of them do not use Unicode in card names.
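
As a rough illustration of the search_name idea, here is a sketch that uses only the standard library's `unicodedata` module as a stand-in for unidecode. It is an approximation of the approach, not the bot's actual code: the `make_search_name` function and the small `_EXTRA` table are assumptions, and unidecode handles far more characters than this does.

```python
import unicodedata

# Characters with no Unicode decomposition need an explicit mapping.
# This tiny table is illustrative only.
_EXTRA = {"Æ": "AE", "æ": "ae"}


def make_search_name(name: str) -> str:
    """Build an ASCII-friendly, case-insensitive key for a card name."""
    replaced = "".join(_EXTRA.get(ch, ch) for ch in name)
    # NFKD splits "ö" into "o" plus a combining diaeresis...
    decomposed = unicodedata.normalize("NFKD", replaced)
    # ...and dropping the combining marks leaves the plain letter behind.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(make_search_name("Jötun Grunt"))  # jotun grunt
print(make_search_name("Jotun Grunt"))  # jotun grunt
```

Running the same function over both the stored card name and the user's query means either spelling finds the card.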

Other problems I ran into:

  1. I was using an outdated library called Elixir, which is a declarative layer on top of SQLAlchemy.

    • Since I had originally used Elixir back in college, SQLAlchemy had released its own declarative layer and Elixir's development had effectively stopped. It depended on a very old version of SQLAlchemy.
  2. It maxed out my server's memory when parsing card data.

    • I think this is because I was doing everything as one large unit of work, so it effectively created the whole database in-memory and then wrote it to disk. In any case, I had to make sure that nothing else was running on my server when parsing card data.
  3. Card Types were not always in the correct order.

    • Each card has a number of game types associated with it. For example, a creature is a type of card. Each type has a number of subtypes associated with it. A creature might be a goblin, for example. The problem I ran into is that some cards have multiple subtypes, "Goblin Mercenary", for example. Since I was storing each subtype as its own database record, they didn't necessarily come back in the correct order. When the parser came across a subtype, it would first see if a record already existed. If it did, it just created a relationship. If it didn't, it created the record. What this meant is that the card might say "Mercenary Goblin" instead of "Goblin Mercenary". While not a huge problem, as it didn't change what the card does, it did throw some people off.
  4. The IRC code was in the same repository as the database code.

    • This makes it harder to write other front ends for the database, which fights against some of the decoupling that I was originally aiming for.
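
One possible way to deal with the subtype ordering in #3 is to store each subtype's position on the relationship itself instead of relying on the order in which the subtype records happen to exist. The sketch below is an assumption about how that could look, not a description of the bot's schema: the table names, column names, and helper functions are hypothetical, and `sqlite3` is used in place of SQLAlchemy to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE subtypes (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE card_subtypes (
        card_name  TEXT,
        subtype_id INTEGER REFERENCES subtypes(id),
        position   INTEGER  -- where this subtype appears on the card
    );
    """
)


def add_card_subtypes(conn, card_name, subtype_names):
    """Reuse existing subtype rows, but record the printed order per card."""
    for position, name in enumerate(subtype_names):
        conn.execute("INSERT OR IGNORE INTO subtypes (name) VALUES (?)", (name,))
        (subtype_id,) = conn.execute(
            "SELECT id FROM subtypes WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO card_subtypes VALUES (?, ?, ?)",
            (card_name, subtype_id, position),
        )


def subtype_line(conn, card_name):
    """Return the card's subtypes in printed order."""
    rows = conn.execute(
        "SELECT s.name FROM card_subtypes cs "
        "JOIN subtypes s ON s.id = cs.subtype_id "
        "WHERE cs.card_name = ? ORDER BY cs.position",
        (card_name,),
    )
    return " ".join(name for (name,) in rows)


# "Mercenary" gets its record first, so it has the lower id.
add_card_subtypes(conn, "Example Mercenary", ["Mercenary"])
add_card_subtypes(conn, "Example Goblin", ["Goblin", "Mercenary"])
print(subtype_line(conn, "Example Goblin"))  # Goblin Mercenary
```

Sorting by subtype id here would give "Mercenary Goblin", because the Mercenary record was created first. Sorting by the stored position keeps the printed order.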

I also ran into some problems when I tried to port the bot to Discord. Aside from the current version of the Discord Python API Wrapper heavily using async and await, I also realized that the output format was very IRC-centric. The IRC protocol doesn't allow for multiple lines, for example, so the output included | instead of a line break. This ended up looking terrible in Discord.
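
A small sketch of what a less IRC-centric output layer might look like: the card is reduced to a list of fields, and each front end decides how to join them. The `CardView` class and the `render_irc` and `render_discord` functions are hypothetical names, and the card shown is made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CardView:
    name: str
    type_line: str
    text_lines: List[str]


def card_fields(card: CardView) -> List[str]:
    """Front-end-neutral representation: one entry per displayed line."""
    return [card.name, card.type_line, *card.text_lines]


def render_irc(card: CardView) -> str:
    """IRC messages are a single line, so fields are separated with ' | '."""
    return " | ".join(card_fields(card))


def render_discord(card: CardView) -> str:
    """Discord accepts line breaks and Markdown, so use both."""
    name, *rest = card_fields(card)
    return "\n".join([f"**{name}**", *rest])


card = CardView("Example Goblin", "Creature - Goblin Mercenary", ["Haste"])
print(render_irc(card))
print(render_discord(card))
```

The point of the split is that the database and search code never produce a formatted string, so adding another front end only means adding another render function.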

As such, I am now working on yet another version of the bot. This one will:

  1. Separate the database and related utilities (searching, parsing, etc.) into their own module.
  2. Upgrade to more modern software: Python 3 and SQLAlchemy.
  3. Be more front end agnostic. This is supported by #1.
    • The idea here is that I can have an IRC version, a Discord version, a CLI client, and even perhaps a website down the road.
  4. Use fewer resources.
    • My hope is that since I now understand the unit-of-work pattern better, I'll be able to more easily control the amount of resources used.
  5. Give better feedback during parsing. (AKA Add Progress Bars)
  6. Have better documentation such that other people can actually use it.
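
For goals #4 and #5, one approach I could take is to write cards in fixed-size batches, committing after each batch and reporting progress as I go, rather than building everything as one unit of work. The sketch below uses `sqlite3` to stay self-contained. With SQLAlchemy, the rough equivalent would be committing the session periodically. The `batched` and `load_cards` names, the table layout, and the batch size are all assumptions.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items without materializing the input."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


def load_cards(conn, cards, batch_size=500, report=print):
    """Insert (name, text) pairs, committing and reporting once per batch."""
    total = 0
    for batch in batched(cards, batch_size):
        with conn:  # commits when the block exits, so only one batch is pending
            conn.executemany("INSERT INTO cards (name, text) VALUES (?, ?)", batch)
        total += len(batch)
        report(f"{total} cards written")
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (name TEXT, text TEXT)")
# A generator keeps the full card list from ever being held in memory.
load_cards(conn, ((f"Card {i}", "") for i in range(1200)))
```

The `report` callback is where a real progress bar would plug in. Here it just prints a running count.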

Current work can be found at the GitHub repository.