Literary Clock - A constantly ticking clock made of literary quotations.

Data Mining

As mentioned in the previous post, I had around a third of the day covered. Now for the other 1000 minutes. First up, I have a collection of ebooks, but they are in formats such as epub and mobi that are not easy for a program to read. I want to convert them to plain text files, which are easy for my computer to read en masse. Step forward Calibre: not only a fantastic e-book manager, it also comes with a command line tool to convert a host of book formats into other file types. Working with an illustrated Sherlock Holmes collection mobi file (4MB), I used the following command:

./ebook-convert "Arthur Conan Doyle - Sherlock Holmes.mobi" \
 "Arthur Conan Doyle - Sherlock Holmes.txt"

to convert it to a 600KB txt file in under a minute on my 2009 MacBook Pro (using Calibre version 2.85.1). It also gives a nice play-by-play of what is happening in the conversion:

Found KF8 MOBI of type 'joint'
Parsing all content...
34% Running transforms on e-book...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Cleaning up manifest...
Trimming unused files from manifest...
Trimming u'images/00027.jpeg' from manifest
Creating TXT Output...
67% Running TXT Output plugin
Converting XHTML to TXT...
TXT output written to Arthur Conan Doyle - Sherlock Holmes.txt
Output saved to   Arthur Conan Doyle - Sherlock Holmes.txt

This works great with the more popular ebook formats, as well as Word doc and docx formats, though PDFs take much longer.

Having left this running overnight in a nice Python wrapper, convert_books_wrapper.py, I now had my library of 400 books in txt format as well. So how to read through these rapidly to get literary quotes with the time of day in them?
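A wrapper along these lines can be quite short. The following is only a minimal sketch of the idea, not the actual convert_books_wrapper.py: the function names, the list of suffixes and the assumption that ebook-convert is on the PATH are all mine.

```python
import subprocess
from pathlib import Path

# Assumed set of input formats, taken from the formats mentioned in this post.
BOOK_SUFFIXES = {".epub", ".mobi", ".doc", ".docx", ".pdf"}


def build_command(book, converter="ebook-convert"):
    """Return the ebook-convert argument list that writes a .txt next to the book."""
    book = Path(book)
    return [converter, str(book), str(book.with_suffix(".txt"))]


def find_books(library_dir):
    """List convertible books under library_dir that have no .txt version yet."""
    return sorted(
        p for p in Path(library_dir).rglob("*")
        if p.suffix.lower() in BOOK_SUFFIXES and not p.with_suffix(".txt").exists()
    )


def convert_library(library_dir, converter="ebook-convert"):
    """Convert every pending book one by one and return the ones that failed."""
    failed = []
    for book in find_books(library_dir):
        # Passing a list (not a shell string) keeps spaces in file names safe.
        result = subprocess.run(build_command(book, converter))
        if result.returncode != 0:
            failed.append(book)
    return failed
```

Skipping books that already have a txt file next to them means an interrupted overnight run can simply be started again.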

First of all, I had the times I needed as input in 24-hour digital format, for example 14:54. The first step was to convert each one into the myriad ways that time can be written in a book: 12-hour or 24-hour; with or without a colon or full stop; in words or digits; with or without minutes or hours; past/after or to/before; with or without hyphens. In short, I wrote code, get_times.py, to translate times like 14:54 into words (with special rules for the numbers 0 to 13, 15, 18, 20, 30, 40 and 50, from which the rest can be built up) and permute them into all the possible ways they might be transcribed:

'14:54', '14.54', '1454', '2:54', '2.54', 'fifty-four past two', 'fifty-four minutes past two', '54 minutes past two', '54 past two', 'fifty-four after two', 'fifty-four minutes after two', '54 minutes after two', '54 after two', 'fifty-four past 14', 'fifty-four minutes past 14', '54 minutes past 14', '54 past 14', 'fifty-four after 14', 'fifty-four minutes after 14', '54 minutes after 14', '54 after 14', 'fifty-four past 2', 'fifty-four minutes past 2', '54 minutes past 2', '54 past 2', 'fifty-four after 2', 'fifty-four minutes after 2', '54 minutes after 2', '54 after 2', 'six minutes to three', 'six to three', '6 minutes to three', '6 to three', 'six minutes before three', 'six before three', '6 minutes before three', '6 before three', 'six minutes to 3', 'six to 3', '6 minutes to 3', '6 to 3', 'six minutes before 3', 'six before 3', '6 minutes before 3', '6 before 3', 'two fifty-four', 'two-fifty-four', '2 fifty-four', '2-fifty-four'.

Admittedly some border on the nonsensical, but I did this for all the times I had missing (including o'clock, half past and quarter to/past), so I had to follow some quite broad rules.
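The translate-and-permute step can be sketched roughly as follows. This is an illustration under my own assumptions rather than the real get_times.py: the names to_words and time_variants are hypothetical, and the half past and quarter to/past forms are left out to keep it short.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty"}


def to_words(n):
    """Spell out an integer from 0 to 59, hyphenating compounds like fifty-four."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    word = TENS[tens * 10]
    return word if ones == 0 else f"{word}-{ONES[ones]}"


def unit_for(n):
    """Return ' minute' for one and ' minutes' for everything else."""
    return " minute" if n == 1 else " minutes"


def time_variants(time_24h):
    """Return a set of ways a 24-hour 'HH:MM' time might be written in prose."""
    hour, minute = (int(part) for part in time_24h.split(":"))
    hour12 = hour % 12 or 12          # 14 -> 2, 0 -> 12
    next12 = (hour + 1) % 12 or 12    # the hour used by "to"/"before" forms
    variants = {time_24h, f"{hour:02d}.{minute:02d}", f"{hour:02d}{minute:02d}",
                f"{hour12}:{minute:02d}", f"{hour12}.{minute:02d}"}
    if minute == 0:
        variants |= {f"{h} o'clock" for h in (to_words(hour12), str(hour12))}
        return variants
    # "fifty-four minutes past two", "54 after 14", ...
    for m in (to_words(minute), str(minute)):
        for unit in ("", unit_for(minute)):
            for link in ("past", "after"):
                for h in (to_words(hour12), str(hour12), str(hour)):
                    variants.add(f"{m}{unit} {link} {h}")
    # "six minutes to three", "6 before 3", ...
    remaining = 60 - minute
    for m in (to_words(remaining), str(remaining)):
        for unit in ("", unit_for(remaining)):
            for link in ("to", "before"):
                for h in (to_words(next12), str(next12)):
                    variants.add(f"{m}{unit} {link} {h}")
    # "two fifty-four", "2-fifty-four"
    for h in (to_words(hour12), str(hour12)):
        for sep in (" ", "-"):
            variants.add(f"{h}{sep}{to_words(minute)}")
    return variants
```

Calling time_variants("14:54") returns a set containing entries such as 'six minutes to three' and '54 past 14'. Collecting the results in a set means duplicates (for example when the 12-hour and 24-hour forms of the hour coincide) disappear on their own.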

Having got this exhaustive list of times, the next step was to actually read through my converted library and pick out sentences that may contain one of the missing times. For this I used Spark, specifically the Spark Python API (PySpark). Simply put, this lets me make full use of my computer's resources by reading through the books in parallel and extracting the sentences that have a time in them. This can be quite memory intensive: with the 8GB of RAM on my personal computer, I have an upper limit of 40MB per folder of books, and if the process crashes, I reduce the number of times I am looking for. The result from find_times.py is a tab-separated file, with each line containing the time, quote, book title and author.

14:54	The waitress with the perfect neck has finished her shift – the clock says six minutes to three – and changed out of her uniform.	number9dream	David Mitchell
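To give a flavour of the search, here is a minimal sketch of how such a job could look. It is not the actual find_times.py: the helper names, the single combined regular expression and the 'Author - Title.txt' file naming are assumptions of mine. The matching helpers are plain Python, and only run_with_spark needs PySpark installed.

```python
import re


def split_sentences(text):
    """Split on '. ' (full stop and space), putting back the full stop it removes."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    parts = flat.split(". ")
    sentences = [p + "." for p in parts[:-1]] + [parts[-1]]
    return [s for s in sentences if s]


def build_matcher(variants_by_time):
    """Compile one regex for all variants and map each variant back to its time."""
    lookup = {v.lower(): t for t, vs in variants_by_time.items() for v in vs}
    longest_first = sorted(lookup, key=len, reverse=True)
    # The lookarounds stop '2:54' from matching inside '12:54'.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(v) for v in longest_first) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern, lookup


def find_times(text, title, author, pattern, lookup):
    """Yield one tab-separated line (time, quote, title, author) per time found."""
    for sentence in split_sentences(text):
        for time in sorted({lookup[m.lower()] for m in pattern.findall(sentence)}):
            yield "\t".join([time, sentence, title, author])


def title_and_author(path):
    """Assume files are named 'Author - Title.txt' (a naming convention I made up)."""
    stem = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    author, _, title = stem.partition(" - ")
    return title, author


def run_with_spark(folder, output_dir, variants_by_time):
    """Scan every .txt file in folder in parallel and save the matching lines."""
    from pyspark import SparkContext  # imported here so the helpers work without Spark

    pattern, lookup = build_matcher(variants_by_time)
    sc = SparkContext("local[*]", "find_times")  # use every local core
    books = sc.wholeTextFiles(folder + "/*.txt")  # RDD of (path, full text) pairs
    lines = books.flatMap(
        lambda book: find_times(book[1], *title_and_author(book[0]), pattern, lookup)
    )
    lines.saveAsTextFile(output_dir)
    sc.stop()
```

Because wholeTextFiles loads each book as a single string, memory use grows with the size of the folder, which fits with keeping each folder of books small.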

Not quite done. The quote is a sentence, or at least the string of text containing the time and bookended on either side by '. ' (a full stop and a space). First up, I need to check that this is actually a time in the context of the sentence. There are a lot of false positives from years (1454, the year Richard Plantagenet, Duke of York becomes and is dismissed as Protector for the insane and then sane King Henry VI of England), ratios and odds (2:11), exact numbers (14.54) and other miscellany ('two eleven metre poles', for example).

Next, if it is a time, I need to check from the context of the book whether it happened in the morning or the afternoon, or is ambiguous and can go in either. Finally, I try to extract a nice quote around the time to give it context.

The waitress with the perfect neck has finished her shift – the clock says six minutes to three – and changed out of her uniform. She is wearing a purple sweater and white jeans. She looks drop-dead cool.

number9dream by David Mitchell
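Choosing the final quote is a judgement call, but pulling out the neighbouring sentences as a starting point is easy to automate. The sketch below is only an assumption about how that could be done: quote_with_context and its before/after parameters are hypothetical names, and it reuses the same '. ' splitting rule described above.

```python
def quote_with_context(text, sentence, before=0, after=2):
    """Return the matched sentence plus its neighbours, or None if it is not found."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    parts = flat.split(". ")
    # Splitting on '. ' drops the full stop from all but the last piece.
    sentences = [p + "." for p in parts[:-1]] + [parts[-1]]
    if sentence not in sentences:
        return None
    i = sentences.index(sentence)
    return " ".join(sentences[max(0, i - before): i + after + 1])
```

With after=2 and the sentence found for 14:54, this would give back the three-sentence quote shown above, which can then be trimmed or extended by hand.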

My library would give us a further 400 or so times. So just a third of the day to go; unfortunately these are the least popular times of the day (the early morning is particularly empty in places). In the next post I will discuss how I collected 50,000 out-of-copyright works from the wonderful Project Gutenberg to run my time-finding tools on.
