danvk.org blog 2016-12-15T16:13:15+00:00 https://danvk.github.io/ Dan Vanderkam danvdk@gmail.com AlphaGo vs. Lee Sedol 2016-03-15T00:00:00+00:00 https://danvk.github.io//2016/03/15/alphago-vs-lee-sedol <p>I was dimly aware of the <a href="https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol">ongoing competition between AlphaGo and Lee Sedol</a>, but I hadn&#39;t paid much attention until I saw this chart <a href="https://www.reddit.com/r/dataisbeautiful/comments/4a8336/lee_sedol_vs_alphago_4th_game_thinking_time_in/">on reddit</a>:</p> <p><img src="http://i.imgur.com/3HcJKbk.png" border=0 width=700 alt="time per move during game 4"></p> <p>It&#39;s hard to read &quot;Lee Sedol&#39;s brilliant attack (78th)&quot; and not get curious! This led me into a deep dive on the competition. You can read more about <a href="https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/">the move</a> or watch a <a href="https://www.youtube.com/watch?v=G5gJ-pVo1gs">15 minute summary</a> of the match. The full <a href="https://www.youtube.com/watch?v=yCALyQRN3hw&amp;feature=youtu.be&amp;t=3h10m20s">6 hour match</a>, including a press conference afterwards, is also online. You can learn a lot about Go from listening to the YouTube commentators. One of them is <a href="https://en.wikipedia.org/wiki/Michael_Redmond_(Go_player)">Michael Redmond</a>, the all-time top-rated American Go player. Even if you don&#39;t understand how to play Go (I barely do), it&#39;s fun to watch <a href="https://www.youtube.com/watch?v=SMqjGNqfU6I&amp;feature=youtu.be&amp;t=1h25m32s">the experts react</a> to this move: &quot;Oh, this [is] very creative.&quot;</p> <p><img src="/images/alphago.png" width=231 height=190 style="float: right; margin-left: 1em;"> <a href="https://en.wikipedia.org/wiki/Lee_Sedol">Lee Sedol</a>&#39;s win in the fourth match is being celebrated as a victory for humankind. But it&#39;s surprising that we&#39;re here at all. AlphaGo, Google DeepMind&#39;s computer Go program, had already won the first three games and hence the best-of-five competition. This is a coming of age moment for the neural net community. Over the past ten years, thanks to the emergence of <a href="https://www.cs.toronto.edu/%7Ekriz/cifar.html">large data sets</a> and <a href="http://www.nvidia.com/object/what-is-gpu-computing.html">GPUs</a>, the whole field has experienced a renaissance. But most of the great results have been on toy problems like image classification or traditional signal processing problems like speech recognition. Beating an elite Go player for the first time is a marquee result that transcends the field. As long as I&#39;ve worked in software, Go has been the one game at which computers couldn&#39;t compete. Everyone thought that this result was years away.</p> <p>Last fall, AlphaGo competed against <a href="https://en.wikipedia.org/wiki/Fan_Huio">Fan Hui</a>, a lower-ranked Go professional. It beat him 5-0. This was the first time that a computer Go program had defeated a professional. What happened next is suggestive. <a href="http://www.wired.com/2016/03/sadness-beauty-watching-googles-ai-play-go/">According to Wired</a>, he began consulting with the DeepMind team:</p> <blockquote> <p>As he played match after match with AlphaGo over the past five months, he watched the machine improve. But he also watched himself improve. The experience has, quite literally, changed the way he views the game. When he first played the Google machine, he was ranked 633rd in the world. Now, he is up into the 300s. In the months since October, AlphaGo has taught him, a human, to be a better player. He sees things he didn’t see before. And that makes him happy. “So beautiful,” he says. “So beautiful.”</p> </blockquote> <p>We learn most quickly from our betters—people who can review your work, say what you did well and point out what would have been better. But if you&#39;re the best Go player in the world, who do you learn from? I wouldn&#39;t be surprised if this result prompts humans to discover new styles of play. Computers may get the best of us at Go in the long run, but we&#39;ll get better at it in the process.</p> <p>The fifth and final game happens tonight. If Lee Sedol wins, you could make the case that he just took a few games to figure out how to get the best of AlphaGo. If not, it means that it took completely brilliant play from the best in the world to beat a computer, and it&#39;s likely to be the last time a human ever pulls this off. The match is being <a href="https://www.youtube.com/watch?v=mzpW10DPHeQ">streamed live on YouTube</a>.</p> Extending the Grid to Add 1,000 Photos to OldNYC 2016-01-19T00:00:00+00:00 https://danvk.github.io//2016/01/19/oldnyc-update <p>I recently added around 1,000 new photos to the map on OldNYC. Read on to find out how!</p> <div class="center"> <img src="/images/stuytown-update.gif" width=300 height=275 alt="OldNYC update around Stuytown"> </div> <p>At its core, OldNYC is based on <a href="https://en.wikipedia.org/wiki/Geocoding">geocoding</a>: the process of going from textual addresses like &quot;9th Street and Avenue A&quot; to numeric latitudes and longitudes. There&#39;s a bit of a mismatch here. The NYPL photos have 1930s addresses and cross-streets, but geocoders are built to work with contemporary addresses. OldNYC makes an assumption that contemporary geocoders will produce accurate results for these old addresses. For NYC, this is usually a good assumption! The street grid hasn&#39;t changed too much in the past 150 years. But it is an assumption, and it doesn&#39;t always pan out.</p> <p>Two of the most noticeable problem spots are Stuytown and Park Avenue South:</p> <div class="center"> <img src="/images/stuytown-pas-missing.png" width=600 height=550 alt="Stuytown and Park Avenue South have no images"> </div> <p>The lettered Avenues (A, B, C, D) used to continue above 14th street. This was the <a href="https://en.wikipedia.org/wiki/Gashouse_District">Gas House district</a>. But in the 1940s, this area was destroyed to make way for the super-blocks of <a href="https://en.wikipedia.org/wiki/Stuyvesant_Town%E2%80%93Peter_Cooper_Village">Stuyvesant Town</a>. Intersections like &quot;15th and A&quot; do no exist in the contemporary Manhattan grid and geocoders can&#39;t make sense of them. But there <em>are</em> <a href="http://www.oldnyc.org/#711319f-a">photos there</a>!</p> <p>The problem for Park Avenue South is different. <a href="https://en.wikipedia.org/wiki/Park_Avenue#History">Until 1959</a>, it was known as 4th Avenue. So photographs from the 1930s are recorded as being at, for example, &quot;4th Avenue and 17th street&quot;, an interesection which no longer exists. Again, contemporary geocoders can&#39;t make sense of this.</p> <p>The frustrating thing here is that it&#39;s perfectly obvious where all of these interesctions <em>should</em> be. Manhattan has a <a href="https://en.wikipedia.org/wiki/Commissioners%27_Plan_of_1811">regular street grid</a>, after all. So I set out to build my own Manhattan street grid geocoder.</p> <p>To begin with, I gathered lat/lons for <a href="https://docs.google.com/spreadsheets/d/1Kv0xXYHXUrlFmyIF92NsWTMuuRWAl-heBEI_OejcBA0/edit#gid=0">every intersection</a> that I could. With <a href="https://github.com/danvk/oldnyc/blob/9aa15e0a4ce46d746531c7f9531ff4a50e17a53f/grid/gold.py#L21-L41">some simple logic</a>, this handled the Avenue renaming issue.</p> <p>My initial idea to geocode unknown interesections was to interpolate on the avenues. For example, to find where the intersection of 18th Street and Avenue A should be, you can assume that the intersections of numbered streets and Avenue A are evenly spaced and then find where the 18th street intersection would fall:</p> <div class="center"> <img src="/images/extrapolation.png" width=552 height=428 alt="Extrapolation of intersections"> </div> <p>Mathematically, you fit linear regressions from cross-street to latitude and longitude. This feels like it should work but, because the streets aren&#39;t all perfectly spaced, it winds up producing results that don&#39;t quite look right.</p> <p>While I was playing around with this approach, I realized that I was checking the results using a different technique: continuing the straight lines of the streets until they intersected:</p> <div class="center"> <img src="/images/continued-lines.png" width=403 height=329 alt="Extrapolation of streets"> </div> <p>Mathematically, this means that you fit a linear regression to the latitude→longitude mapping for each Street and Avenue. To find an intersection, you find the point where these lines intersect. This works so long as the Streets and Avenues are straight. Fortunately, with a few exceptions like Avenue C and the West Village, they are (r<sup>2</sup>&gt;0.99).</p> <p>This approach produced very good results. The oddities which remained were as likely to be problems with the data as with the geocoder (<a href="http://digitalcollections.nypl.org/items/510d47dd-07a6-a3d9-e040-e00a18064a99">one image</a> was non-sensically labeled as &quot;25th &amp; D&quot;, which extrapolates to somewhere in the East River).</p> <p>While Stuytown and Park Avenue South were clear winners, new photos appeared all over the map:</p> <div class="center"> <img src="/images/oldnyc-update.gif" width=400 height=400 alt="Update for lower Manhattan"> </div> <p>It even helped uptown:</p> <div class="center"> <img src="/images/oldnyc-uptown-update.gif" width=640 height=400 alt="Update for Uptown"> </div> <p>All told, there are about 1,000 new images on the map. Go check them out and! And please help transcribe the text on the back of them. My <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">OCR system</a> didn&#39;t run on the new images, so they&#39;re sorely lacking descriptions.</p> <p>Here are a few favorites:</p> <div class="center"> <a href="https://www.oldnyc.org/#716426f-a"><img src="/images/716426f-a.jpg" width=600 height=421></a> <br> <i>Looking down Avenue B from 15th to 17th Street. This Avenue no longer exists.</i> </div> <div class="center"> <a href="https://www.oldnyc.org/#708194f-a"><img src="/images/708194f-a.jpg" width=482 height=348></a> <br> <i>The Everett House hotel at Union Square in 1906. This building no longer exists, but there's a new hotel (the W) at the same location.</i> </div> My takeaways from NIPS 2015 2015-12-12T00:00:00+00:00 https://danvk.github.io//2015/12/12/nips-2015 <p>I’ve just wrapped up my trip to <a href="https://nips.cc/Conferences/2015">NIPS 2015</a> in Montreal and thought I’d jot down a few things that struck me this year:</p> <ul> <li><p><strong>Saddle Points vs Local Minima</strong></p> <p>I heard this point repeated in a talk almost every day. In low-dimensional spaces (i.e. the ones we can visualize) local minima are the major impediment to optimizers reaching the global minimum. But this doesn’t generalize. In high-dimensional spaces, local minima are almost non-existent. Instead, there are saddle points: points which are a minimum in some directions but a maximum in others. Intuitively, this makes sense: in N dimensions, the odds of the curvatures all going the same way at a point is (1/2)^N. As Yoshua Bengio said, &quot;it&#39;s hard to build an n-dimensional wall.&quot; This gives an intuition for why procedures like <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a> are effective at optimizing the thousands of weights in a neural net: they won’t get stuck in a local optimum. And it gives an intuition for why <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum">momentum</a> is helpful: it helps gradient descent escape from saddle points.</p></li> <li><p><strong>Model Compression</strong></p> <p>The tutorial on <a href="http://nips2015.sched.org/event/4FAy/high-performance-hardware-for-machine-learning"><em>Hardware for Deep Learning</em></a> was less about new hardware and more about how to make your software get the most out of existing hardware. Due to the high cost of uncached, off-chip memory reads, reducing the memory footprint of your models can be a huge performance win. Bill Dally presented a result on model pruning that I found interesting: by iteratively removing small weights from a model and retraining, they were able to remove 90+% of the weights with zero loss of precision. This parallels an observation from <a href="http://www.ttic.edu/dl/dark14.pdf">transfer learning</a>, that small networks are most effectively trained using the output of larger networks. It would be nice if we could train these smaller networks directly. See the <a href="http://arxiv.org/abs/1510.00149">Deep Compression</a> paper.</p></li> <li><p><strong>The importance of canonical data sets / problems</strong></p> <p>Over and over, talks and posters referenced the same canonical data sets: the <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a> set of handwritten digits, the <a href="https://www.cs.toronto.edu/%7Ekriz/cifar.html">CIFAR</a> and <a href="http://www.image-net.org/">ImageNet</a> images, the <a href="https://catalog.ldc.upenn.edu/LDC93S1">TIMIT</a> speech corpus and the <a href="http://www.arcadelearningenvironment.org/">Atari/Arcade Learning Environment</a> (ALE). These have given researchers in their fields a shared problem on which to experiment, compete, collaborate and measure their progress. If you want to push a field forward, built a good challenge problem.</p></li> <li><p><strong>One-shot Learning</strong></p> <p>There was much high-level talk about how the human brain is very good at learning to perform new tasks quickly. Contrast this with neural nets, which require thousands or millions of training examples to reach human performance. This comparison is somewhat unfair because adult humans have years of experience interacting with the real world from which to draw on. There seems to be a great deal of interest in getting machines to do a better job of transferring general knowledge to specific tasks.</p></li> <li><p><strong>This conference has gotten huge!</strong></p> <p>I’ve read that registrations have gone up significantly over the last few years. This was palpable at the conference. Many of the workshop rooms were packed to the gills and I watched most of the larger talks in overflow rooms. This is probably good thing for the ML community. I’m not sure if it’s a good thing for NIPS.</p></li> </ul> <p>A few smaller bits that struck me:</p> <ul> <li><p><a href="http://arxiv.org/abs/1505.00387">Highway Networks</a> are cool. They let you train very deep networks, where the depth has to be learned. They found that depth &gt; 20 was not helpful for MNIST, but was for CIFAR.</p></li> <li><p><a href="http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">AlexNet</a> seems to have become a canonical neural net for experimentation. I saw it referenced repeatedly, e.g. on the <a href="http://arxiv.org/abs/1510.00149">Deep Compression</a> poster.</p></li> <li><p>As an example of the above, I really enjoyed the <a href="http://arxiv.org/abs/1407.5104"><em>Pixels to Voxels</em></a> talk. Pulpit Agarwal &amp; co showed that the activations of individual layers of AlexNet correlate to activations in regions of the brain. They were able to use this correspondence to learn what some of the <a href="https://en.wikipedia.org/wiki/Visual_cortex#V4">mid-level regions</a> of the visual cortex are doing.</p></li> <li><p>We heard several speakers say that “backprop is not biologically plausible.” I assume this is because we don’t consume nearly enough labels for it to be practical at the scale of a human brain?</p></li> <li><p>I asked someone at the NVIDIA booth whether the ML industry is large enough to drive GPU design &amp; sales (as opposed to the game industry). It is. The GPUs designed for ML tend to be more robust than those used in games. They have error correcting codes built-in. In a game, if an arithmetic unit makes a mistake, it’ll be fixed on the next frame. When you’re training a neural net, that mistake can propagate.</p></li> </ul> <p>I do <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">relatively little</a> machine learning in my day-to-day. NIPS is always a bit over my head, but it&#39;s a good way to rekindle my interest in the field.</p> Dan writes on HammerLab 2015-10-21T18:03:00+00:00 https://danvk.github.io//2015/10/21/hammerlab-posts <p>I haven&#39;t written a substantial blog post on danvk.org <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">since January</a>. Instead, I&#39;ve been writing over on my groups&#39;s blog at <a href="http://www.hammerlab.org/">hammerlab.org</a>.</p> <p>Here are the posts I&#39;ve written or edited:</p> <ul> <li><p><a href="http://www.hammerlab.org/2015/10/13/svg-canvas-the-pileup-js-journey/">SVG→Canvas, the pileup.js Journey</a> (13 Oct 2015)</p> <p>In which I explain why we changed from using SVG to using canvas for our genome browser (spoiler: it&#39;s performance). This also introduces <a href="https://github.com/hammerlab/data-canvas">data-canvas</a>, which compensates for some of the drawbacks of canvas, e.g. difficulty in tracking clicks and writing tests.</p></li> <li><p><a href="http://www.hammerlab.org/2015/07/09/bundling-and-distributing-complex-es6-libraries-in-an-es5-world/">Bundling and Distributing Complex ES6 Libraries in an ES5 World</a> (09 Jul 2015)</p> <p>I helped edit this post, which was written by <a href="http://arman.aksoy.org/">Arman Aksoy</a>. It outlines our approach to writing ES6 JavaScript while distributing/bundling it for ES5 clients.</p></li> <li><p><a href="http://www.hammerlab.org/2015/06/19/introducing-pileup-js-a-browser-based-genome-viewer/">Introducing pileup.js, a Browser-based Genome Viewer</a> (19 Jun 2015)</p> <p>This post introduces pileup.js, the in-browser genome visualizer I&#39;ve been working on for most of the past year.</p></li> <li><p><a href="http://www.hammerlab.org/2015/02/14/testing-react-web-apps-with-mocha/">Testing React Web Apps with Mocha</a> (14 Feb 2015)</p></li> <li><p><a href="http://www.hammerlab.org/2015/02/21/testing-react-web-apps-with-mocha-part-2/">Testing React Web Apps with Mocha (Part 2)</a> (21 Feb 2015)</p> <p>A two-parter in which I explain how we set up testing for our React web app using Mocha, rather than Jest. Since writing this post, I&#39;ve completely changed my mind about how web testing should be done. If your code is intended to be run in the browser, then you should test it in the browser, rather than using Node as these posts suggest.</p></li> <li><p><a href="http://www.hammerlab.org/2015/01/23/faster-pileup-loading-with-bai-indices/">Faster Pileup Loading with BAI Indices</a> (23 Jan 2015)</p> <p>BAM is a very widely used bioinformatics file format which stores aligned reads from a genome. These files can be huge, so the BAM Index (BAI) format was created to speed up retrieval. We found that the BAI was <em>also</em> too large for convenient access on the web, so we created an index of the index.</p></li> <li><p><a href="http://www.hammerlab.org/2014/12/05/igv-httpfs/">Streaming from HDFS with igv-httpfs</a> (05 Dec 2014)</p> <p>How we got data from HDFS into our genome viewer of choice by building a small piece of infrastructure. We&#39;ve since stopped using this. Nowadays we have an NFS mount for our HDFS file system and serve from that using nginx.</p></li> </ul> Launched: OldNYC 2015-06-04T02:24:00+00:00 https://danvk.github.io//2015/06/04/launched-oldnyc <p><a href="http://www.oldnyc.org/"><img src="/images/oldnyc-big.jpg" width=300 height=236 style="float:right; margin-left: 20px;"></a> Two weeks ago I <a href="https://twitter.com/danvdk/status/601381598525796352">launched</a> my latest side project, <a href="http://www.oldnyc.org/">OldNYC</a>. It&#39;s a collaboration with NYPL Labs which places around 40,000 historical photos of New York City on a map. Avid readers of this blog know that I&#39;ve been <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">working on</a> this <a href="http://www.danvk.org/wp/2013-02-09/finding-pictures-in-pictures/">for years</a>.</p> <p>The response to OldNYC has been completely overwhelming. Hundreds of thousands of people have used the site. Millions of images have been viewed. Users have left nearly a thousand comments and fixed thousands of typos in the <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">OCR&#39;d descriptions</a>.</p> <p>Rather than say more about the project myself, I&#39;ll let you pick your write-up of choice. There have been many!</p> <ul> <li>New York Times: <a href="http://cityroom.blogs.nytimes.com/2015/05/26/new-york-today-new-views-of-the-past/?_r=0">New York Today: New Views of the Past</a></li> <li>Gothamist: <a href="http://gothamist.com/2015/05/26/old_nyc_photo_map.php">This Photo Map Will Bring You Back To Old NYC, Block By Block</a></li> <li>The Guardian: <a href="http://www.theguardian.com/us-news/2015/may/21/new-york-historical-photo-archive-lets-residents-track-changes-to-a-city-in-sepia">New York vintage photo archive lets you track a century of change to your block</a></li> <li>Citylab: <a href="http://www.citylab.com/design/2015/05/pictures-map-old-new-york-historical/393944/">Mapping the New York That Once Was</a></li> <li>Pix11 News (TV): <a href="http://pix11.com/2015/05/26/website-gives-a-virtual-tour-of-old-nyc-through-photo-maps/">Website gives a virtual tour of old NYC through photo maps</a></li> <li>Library Journal: <a href="http://lj.libraryjournal.com/2015/06/digital-content/developer-maps-library-photo-archives/">Developer Maps Library Photo Archives</a></li> <li>Kottke: <a href="http://kottke.org/15/05/mapping-photos-of-old-nyc">Mapping photos of old NYC</a></li> <li>Gizmodo: <a href="http://gizmodo.com/here-are-40-000-photos-of-old-new-york-plotted-on-a-cit-1706342527">Here Are 40,000 Photos Of Old New York Plotted on a City Map</a></li> <li>DNAInfo: <a href="http://www.dnainfo.com/new-york/20150522/midtown/see-old-photos-of-your-neighborhood-with-this-interactive-map">See Old Photos of Your Neighborhood With This Interactive Map</a></li> <li>West Side Rag: <a href="http://www.westsiderag.com/2015/05/21/throwback-thursday-new-tool-lets-you-see-photos-of-your-street-through-the-years">Throwback Thursday: New Tool Lets You See Photos of Your Street Through the Years</a></li> <li>Daily Mail: <a href="http://www.dailymail.co.uk/sciencetech/article-3093512/Step-time-OldNY-Interactive-map-allows-users-explore-New-York-40-000-old-photos.html">The Big Apple back in time: Forty thousand historic photographs on an interactive map offer slice of life from old New York</a></li> <li>Business Insider: <a href="http://www.businessinsider.com/nypl-old-photos-interactive-map-2015-5">This incredible map lets New Yorkers see vintage photos of their street corners</a></li> <li>EV Grieve: <a href="http://evgrieve.com/2015/05/immerse-yourself-in-archival-photos-of.html">Immerse yourself in archival photos of NYC</a></li> <li>The Awl: <a href="http://www.theawl.com/2015/05/old-new-york-mapped">Old New York, Mapped</a></li> <li>The Week: <a href="http://theweek.com/speedreads/556593/check-amazing-archive-historic-new-york-pictures">Check out this amazing archive of historic New York pictures</a></li> <li>Fast Company: <a href="http://www.fastcodesign.com/3046606/see-what-times-square-and-wall-street-looked-like-100-years-ago">See What Times Square And Wall Street Looked Like 100 Years Ago</a></li> <li>Free Williamsburg: <a href="http://freewilliamsburg.com/see-photos-of-old-williamsburg-thanks-to-the-ny-public-library/">See photos of old Williamsburg thanks to the NY Public Library</a></li> <li>PetaPixel: <a href="http://petapixel.com/2015/05/25/oldsf-and-oldnyc-historical-photos-plotted-on-maps/">OldSF and OldNYC: Historical Photos Plotted on Maps</a></li> <li>Brooklyn Magazine: <a href="http://www.bkmag.com/2015/05/21/here-are-thousands-of-historical-photos-of-new-york-city-all-on-one-interactive-map/">Here are Thousands of Historical Photos of New York City, All on One Interactive Map</a></li> <li>Mental Floss: <a href="http://mentalfloss.com/article/64253/tour-old-new-york-your-computer">Tour Old New York on Your Computer</a></li> </ul> PyCon 2015: Make web development awesome with visual diffing tools 2015-04-12T15:05:00+00:00 https://danvk.github.io//2015/04/12/pycon-2015-make-web-development-awesome-with-visual-diffing-tools <p>Here&#39;s the <a href="https://youtu.be/jUUTqgzNR3M">video</a> of my talk from PyCon 2015, <a href="https://us.pycon.org/2015/schedule/presentation/395/">Make web development awesome with visual diffing tools</a>:</p> <iframe width="640" height="360" src="https://www.youtube.com/embed/jUUTqgzNR3M" frameborder="0" allowfullscreen></iframe> <p>Here are the <a href="http://bit.ly/pycon2015-visual-diffs">slides</a> for the talk:</p> <iframe src="https://docs.google.com/presentation/embed?id=1C6TcdHSBQNcLEIH6SFmmAOLJ7cy1gvIbmCj3cyd7fxE&amp;start=false&amp;loop=false&amp; frameborder="0" width="520" height="405"></iframe> <p>The two tools referenced are:</p> <ul> <li><a href="https://github.com/bslatkin/dpxdt">dpxdt</a> for generating screenshots</li> <li><a href="https://github.com/danvk/webdiff">webdiff</a> for viewing image diffs</li> </ul> <p>I used <a href="http://www.comparea.org/">comparea</a> and <a href="http://www.dygraphs.com/">dygraphs</a> as sample apps for the demos.</p> <p>If you enjoyed my talk, you might also enjoy <a href="http://www.onebigfluke.com/">Brett&#39;s</a> talk from a few years ago, <a href="https://www.youtube.com/watch?v=1wHr-O6gEfc">The Secret of Safe Continuous Deployment: Perceptual Diffs</a>. It goes into more depth on how screenshots can facilitate rapid deployment.</p> Training an Ocropus OCR model 2015-01-11T04:58:00+00:00 https://danvk.github.io//2015/01/11/training-an-ocropus-ocr-model <style> .bordered { border: 1px solid rgb(204, 204, 204); } </style> <p>In the <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">last post</a>, we walked through the steps in the <a href="https://github.com/tmbdev/ocropy">Ocropus</a> OCR pipeline. We extracted text from images like this:</p> <div class="center"> <img src="/images/ocropus/start.png" width="608" height="137"> </div> <p>The results using the default model were passable but not great:</p> <div class="highlight"><pre><code class="language-" data-lang="">O1inton Street, aouth from LIYingston Street. Auguat S, 1934. P. L. Sperr. NO REPODUCTIONS. </code></pre></div> <p>Over the <a href="http://digitalgallery.nypl.org/nypldigital/dgdivisionbrowseresult.cfm?div_id=hh">larger corpus</a> of images, the error rate was around 10%. The default model has never seen typewriter fonts, nor has it seen ALLCAPS text, both of which figure prominently in this collection. So its poor performance comes as no surprise.</p> <p>In this post I&#39;ll walk through the process of training an Ocropus model to recognize the typewritten text in this collection. By the end of this post, the performance will be extremely good.</p> <h3>Generating truth data</h3> <p>Ocropus trains its model using <a href="http://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a>: it requires images of lines along with correct transcriptions. If you&#39;re trying to recognize a known font, you can generate arbitrary amounts of labeled data (using <code>ocropus-linegen</code>). But in our case, we have to label some images by hand.</p> <p>This is tedious and involves a lot of typing. Amazon&#39;s <a href="https://www.mturk.com/mturk/welcome">Mechanical Turk</a> is a popular way of farming out small tasks like this, but I prefer to do the transcription myself using <a href="http://www.danvk.org/wp/2013-04-20/generating-training-data/index.html">localturk</a>. It doesn&#39;t take as long as you might think (I typed 800 lines in about an hour and 20 minutes). And it has the benefit of forcing you to look at a large sample of your data, something that&#39;s likely to lead to insights.</p> <div class="center"> <img src="/images/ocropus/localturk.png" width="474" height="401"> <br><i>(localturk in action)</i> </div> <p>I used <a href="https://gist.github.com/danvk/abec6e0c7657e2c8e86f">this template</a> for the transcription. Ocropus expects truth data to be in <code>.gt.txt</code> files with the same name as the PNG files for the lines. For example:</p> <ul> <li><code>book/0001/010001.png</code></li> <li><code>book/0001/010001.gt.txt</code></li> </ul> <p>It&#39;s important that you transcribe <em>lines</em>, not entire pages. I initially transcribed pages and tried to have Ocropus learn on them, but this doesn&#39;t work at all.</p> <h3>Training a model</h3> <p>Ocropus trains a model by learning from its mistakes. It transcribes the text in a line, then adjusts the weights in the Neural Net to compensate for the errors. Then it does this again for the next line, and the next, and so on. When it gets to the last line of labeled data, it starts over again. As it loops through the training data over and over again, the model gets better and better.</p> <div class="highlight"><pre><code class="language-" data-lang="">ocropus-rtrain -o modelname book*/????/*.bin.png </code></pre></div> <p>This produces lots of output like this:</p> <div class="highlight"><pre><code class="language-" data-lang="">2000 70.56 (1190, 48) 715641b-crop-010002.png TRU: u'504-508 West 142nd Street, adjoining and west of Hamilton' ALN: u'504-5088 West 422nd Street, adjoining and west of Hammilton' OUT: u'3od-iS est 4nd Street, doning nd est of Sarilton' 2001 32.38 (341, 48) 726826b-crop-010003.png TRU: u'NO REPRODUCTIONS' ALN: u'NO REPRODUCTIONS' OUT: u'sO EROCoOri' ... </code></pre></div> <p><code>TRU</code> is the truth data. <code>OUT</code> is the output of the model. <code>ALN</code> is a variant of the model output which is aligned to the truth data. It&#39;s used to adjust the model weights more precisely. It typically looks better than the model output, especially in early iterations. It lets you know that you&#39;re making progress.</p> <p>Here&#39;s a video that Thomas, the Ocropus developer, put together. It shows the network&#39;s output for a single image as it learns (see the <a href="https://www.youtube.com/watch?v=czG5Jk9iC7c">YouTube page</a> for explanations of the different charts): </p> <div class="center"> <iframe width="560" height="315" src="//www.youtube.com/embed/czG5Jk9iC7c" frameborder="0" allowfullscreen></iframe> </div> <p>For my first model, I used 400 of the labeled lines as training data and held out the other 400 as test data. Ocropus saves models to disk every 1000 iterations, so it&#39;s simple to evaluate the model&#39;s performance as it learns:</p> <div class="center"> <img src="/images/ocropus/err-train-test.png" width="552" height="326"> </div> <p>The error rate starts high (over 50%) but quickly comes down to about 2% after 10,000 iterations, eventually hitting a minimum of 0.96% at 16,000 iterations.</p> <p>The error rate on the test set is consistently about 3% higher than that on the training set. The best error rate on the test set was 4.20%.</p> <p>There&#39;s a lot of variation in the error rate. You might expect it to slowly decrease over time, but that&#39;s not at all the case. I&#39;m not quite sure how to interpret this. Does the error rate spike at 17,000 iterations because the model tries to jolt itself out of a local minimum? Is it just randomness?</p> <p>In any case, it&#39;s important to generate a chart like this. Choosing the wrong model could lead to needlessly bad performance.</p> <h3>Training with more data.</h3> <p>You&#39;d expect that training on more data would yield a better model. So for my next model, I trained on all 800 labeled images (rather than just 400). I didn&#39;t have a test set. Here&#39;s what the error rate looked like:</p> <div class="center"> <img src="/images/ocropus/err-train800.png" width="472" height="294"> </div> <p>This doesn&#39;t make much sense to me. The lowest error rate on the 800 training images is 3.59%. But the model from the previous section achieved an error rate of 2.58% on the same data set (average of 0.96% and 4.20%). And it only saw half the data! How is that possible? Maybe this model just had bad luck.</p> <p>There&#39;s the same pattern as before of occasional spikes in error rate. More disturbing, after around 40,000 iterations, I started seeing lots of <a href="https://github.com/tmbdev/ocropy/issues/5">FloatingPointErrors</a>. It&#39;s unclear to me exactly what this means. Perhaps the model is diverging?</p> <p>Here&#39;s another model that I trained for even longer:</p> <div class="center"> <img src="/images/ocropus/err-typewriter.png" width="522" height="275"> </div> <p>It achieves an error rate of 0.89% at iteration 33,000, then spikes to over 15% at 37,000. It eventually gets back down to 0.85% after 53,000 iterations, then starts spiking again. By the time I stopped it, I was again seeing lots of <code>FloatingPointErrors</code>.</p> <p>The point of all this is that the error rates are quite erratic, so you need to look at them before choosing which model you use!</p> <h3>Training with the default model</h3> <p>So far we&#39;ve built our models from scratch. But you can also build on top of an existing model.</p> <p>Even though it&#39;s never seen typewriter text or ALLCAPS, the <a href="http://www.tmbdev.net/">default Ocropus model</a> presumably knows a lot about Latin characters and the relationship between them in English words. And I trust the Ocropus developers to build a good Ocropus model far more than I trust myself.</p> <p>You train on top of an existing model using the <code>--load</code> option:</p> <div class="highlight"><pre><code class="language-" data-lang="">ocropus-rtrain --load en-default.pyrnn.gz -o my-model *.png </code></pre></div> <p>Here&#39;s what the error rate looks like:</p> <div class="center"> <img src="/images/ocropus/err-default.png" width="468" height="286"> </div> <p>Now we&#39;re getting somewhere: the error rate gets all the way down to 0.277%!</p> <p>Something interesting happens when you get the error rate significantly below 1%. The &quot;mistakes&quot; that the model makes are quite likely to be errors that <em>you</em> made while transcribing truth data! I noticed that I misspelled some words and even hallucinated new words like &quot;the&quot; into some of the lines.</p> <p>Even crazier, there were typos in the original images that I subconsciously corrected:</p> <div class="center"> <img src="/images/ocropus/milstein-typo.png" width="594" height="30"> </div> <p>(Look at the second to last word.)</p> <p>A model with a 0.2% error rate is good enough to produce readable text. For example, here&#39;s what it produces for the image from the <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">last post</a>:</p> <p><img class="bordered" src="/images/ocropus/line1.png" width="610" height="35"><br> → <code>Clinton Street, south from Livingston Street.</code> <br> <img class="bordered" src="/images/ocropus/line2.png" width="163" height="25"> <br> → <code>P. L. Sperr.</code> <br> <img class="bordered" src="/images/ocropus/line3.png" width="233" height="24"> <br> → <code>NO REPRODUCTIONS.</code> <br> <img class="bordered" src="/images/ocropus/line4.png" width="208" height="27"> <br> → <code>August 5, 1934.</code></p> <p>i.e. it&#39;s perfect. Here&#39;s the output of the Neural Net for the last line:</p> <div class="center"> <img src="/images/ocropus/new-network.png" width="699" height="216"> </div> <p>Compare that to what it was before:</p> <div class="center"> <img src="/images/ocropus/network-labeled.png" width="699" height="216"> </div> <p>There&#39;s still some ambiguity around <code>5</code>/<code>S</code>, but it makes the right call. The <code>a</code> vs <code>s</code> error is completely gone.</p> <h3>Conclusions</h3> <p>At this point the model is good enough. If I were to improve it further, I&#39;d either improve my <a href="http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html">image cropper</a> or incorporate some kind of spell checking as a post-processing step.</p> <p>The behavior of the models as they&#39;re trained is sometimes inscrutable. Finding a good one involves a lot of trial and error. To avoid flailing, measure your performance constantly and keep a list of ideas to explore. &quot;Train a model starting with the pre-built one&quot; was item #6 on my list of ideas and it took me a while to get around to trying it. But it was the solution!</p> <p>If you&#39;re feeling lost or frustrated, go generate some more training data. At least you&#39;ll be doing something useful.</p> <p>At the end of the day, I&#39;m very happy with the OCR model I built. Ocropus has some rough edges, but it&#39;s simple enough that you can usually figure out what&#39;s going on and how to fix problems as they come up. And the results speak for themselves!</p> Extracting text from an image using Ocropus 2015-01-09T22:45:00+00:00 https://danvk.github.io//2015/01/09/extracting-text-from-an-image-using-ocropus <style> .bordered { border: 1px solid rgb(204, 204, 204); } </style> <p>In the <a href="http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html">last post</a>, I described a way to crop an image down to just the part containing text. The end product was something like this:</p> <div class="center"> <img src="/images/ocropus/start.png" width="608" height="137"> </div> <p>In this post, I&#39;ll explain how to extract text from images like these using the <a href="https://github.com/tmbdev/ocropy">Ocropus</a> <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">OCR</a> library. Plain text has a number of advantages over images of text: you can search it, it can be stored more compactly and it can be reformatted to fit seamlessly into web UIs.</p> <p>I don&#39;t want to get too bogged down in the details of why I went with Ocropus over its more famous cousin, <a href="https://code.google.com/p/tesseract-ocr/">Tesseract</a>, at least not in this post. The gist is that I found it to be:</p> <ol> <li>more transparent about what it was doing.</li> <li>more hackable</li> <li>more robust to <a href="http://stackoverflow.com/questions/27592430/how-can-i-tell-tesseract-that-my-font-has-a-particular-size">character segmentation issues</a></li> </ol> <p>This post is a bit long, but there are lots of pictures to help you get through it. Be strong!</p> <h3>Ocropus</h3> <p>Ocropus (or Ocropy) is a collection of tools for extracting text from scanned images. The basic pipeline looks like this:</p> <div class="center"> <img src="/images/ocropus/ocropus-pipeline.png" width="304" height="341"> </div> <p>I&#39;ll talk about each of these steps in this post. But first, we need to install Ocropus!</p> <h3>Installation</h3> <p>Ocropus uses the <a href="http://www.scipy.org/about.html">Scientific Python</a> stack. To run it, you&#39;ll need <code>scipy</code>, <code>PIL</code>, <code>numpy</code>, <code>OpenCV</code> and <code>matplotlib</code>. Setting this up is a bit of a pain, but you&#39;ll only ever have to do it once (at least until you get a new computer).</p> <p>On my Mac running Yosemite, I set up <a href="http://brew.sh">brew</a>, then ran:</p> <div class="highlight"><pre><code class="language-" data-lang="">brew install python brew install opencv brew install homebrew/python/scipy </code></pre></div> <p>To make this last step work, I had to follow the workaround described in <a href="https://github.com/Homebrew/homebrew/issues/16016#issuecomment-42912638">this comment</a>:</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">cd</span> /usr/local/Cellar/python/2.7.6_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages rm cv.py cv2.so ln -s /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv.py cv.py ln -s /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv2.so cv2.so </code></pre></div> <p>Then you can follow the instructions on the <a href="https://github.com/tmbdev/ocropy">ocropy site</a>. You&#39;ll know you have things working when you can run <code>ocropus-nlbin --help</code>.</p> <h3>Binarization</h3> <p>The first step in the Ocropus pipeline is <em>binarization</em>: the conversion of the source image from grayscale to black and white.</p> <p>There are many ways to do this, some of which you can read about <a href="https://docs.google.com/presentation/d/1N1scoKZhmneH_qyLCjdVcAWKqqL65T3ahKrk2-1Tvcg/edit#slide=id.i39">in this presentation</a>. Ocropus uses a form of <a href="http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html">adaptive thresholding</a>, where the cutoff between light and dark can vary throughout the image. This is important when working with scans from books, where there can be variation in light level over the page.</p> <p>Also lumped into this step is <em>skew estimation</em>, which tries to rotate the image by small amounts so that the text is truly horizontal. This is done more or less through brute force: Ocropy tries 32 different angles between +/-2&deg; and picks the one which maximizes the variance of the row sums. This works because, when the image is perfectly aligned, there will be huge variance between the rows with text and the blanks in between them. When the image is rotated, these gaps are blended.</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash">ocropus-nlbin -n 703662b.crop.png -o book </code></pre></div> <div class="center"> <img src="/images/ocropus/binarized.png" width="608" height="137"> </div> <p>The <code>-n</code> tells Ocropus to suppress page size checks. We&#39;re giving it a small, cropped image, rather than an image of a full page, so this is necessary.</p> <p>This command produces two outputs:</p> <ul> <li><code>book/0001.bin.png</code>: binarized version of the first page (above)</li> <li><code>book/0001.nrm.png</code>: a &quot;flattened&quot; version of the image, before binarization. This isn&#39;t very useful.</li> </ul> <p>(The Ocropus convention is to put all intermediate files in a <code>book</code> working directory.)</p> <h3>Segmentation</h3> <p>The next step is to extract the individual lines of text from the image. Again, there are many ways to do this, some of which you can read about in this <a href="https://docs.google.com/presentation/d/1N1scoKZhmneH_qyLCjdVcAWKqqL65T3ahKrk2-1Tvcg/edit#slide=id.i151">presentation on segmentation</a>.</p> <p>Ocropus first estimates the &quot;scale&quot; of your text. It does this by finding connected components in the binarized image (these should mostly be individual letters) and calculating the median of their dimensions. This corresponds to something like the <a href="http://en.wikipedia.org/wiki/X-height">x-height</a> of your font.</p> <p>Next it tries to find the individual lines of text. The sequence goes something like this:</p> <ol> <li>It removes components which are too big or too small (according to <em>scale</em>). These are unlikely to be letters.</li> <li>It applies the <a href="http://www.cs.cornell.edu/courses/CS6670/2011sp/lectures/lec02_filter.pdf">y-derivative of a Gaussian kernel</a> (p. 42) to detect top and bottom edges of the remaining features. It then blurs this horizontally to blend the tops of letters on the same line together.</li> <li>The bits between top and bottom edges are the lines.</li> </ol> <p>A picture helps explain this better. Here&#39;s the result of step 2 (the edge detector + horizontal blur):</p> <div class="center"> <img src="/images/ocropus/edge-blur.png" width="608" height="137"> </div> <p>The white areas are the tops and the black areas are the bottoms.</p> <p>Here&#39;s the another view of the same thing:</p> <div class="center"> <img src="/images/ocropus/edge-boxes.png" width="608" height="137"> </div> <p>Here the blue boxes are components in the binarized image (i.e. letters). The wispy green areas are tops and the red areas are bottoms. I&#39;d never seen a Gaussian kernel used this way before: its derivative is an edge detector.</p> <p>Here are the detected lines, formed by expanding the areas between tops and bottoms:</p> <div class="center"> <img src="/images/ocropus/line-components.png" width="608" height="137"> </div> <p>It&#39;s interesting that the lines needn&#39;t be simple rectangular regions. In fact, the bottom two components have overlapping y-coordinates. Ocropus applies these regions as masks before extracting rectangular lines:</p> <p><img class="bordered" src="/images/ocropus/line1.png" width="610" height="35"><br> <img class="bordered" src="/images/ocropus/line2.png" width="163" height="25"><br> <img class="bordered" src="/images/ocropus/line3.png" width="233" height="24"><br> <img class="bordered" src="/images/ocropus/line4.png" width="208" height="27"></p> <p>Here&#39;s the command I used (the <code>g</code> in <code>ocropus-gpageseg</code> stands for &quot;gradient&quot;):</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash">ocropus-gpageseg -n --maxcolseps 0 book/0001.bin.png </code></pre></div> <p>The <code>--maxcolseps 0</code> tells Ocropus that there&#39;s only one column in this image. The <code>-n</code> suppresses size checks, as before.</p> <p>This has five outputs:</p> <ul> <li><code>book/0001.pseg.png</code> encodes the segmentation. The color at each pixel indicates which column and line that pixel in the original image belongs to.</li> <li><code>book/0001/01000{1,2,3,4}.bin.png</code> are the extracted line images (above).</li> </ul> <h3>Character Recognition</h3> <p>After all that prep work, we can finally get to the fun part: character recognition using a <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">Neural Net</a>.</p> <p>The problem is to perform this mapping:</p> <div class="center"> <img class="bordered" src="/images/ocropus/line4.png" width="208" height="27"> → <code>August 5, 1934.</code> </div> <p>This is challenging because each line will have its own quirks. Maybe binarization produced a darker or lighter image for this line. Maybe skew estimation didn&#39;t work perfectly. Maybe the typewriter had a fresh ribbon and produced thicker letters. Maybe the paper got water on it in storage.</p> <p>Ocropus uses an <a href="http://en.wikipedia.org/wiki/Long_short_term_memory">LSTM Recurrent Neural Net</a> to learn this mapping. The default model has 48 inputs, 200 nodes in a hidden layer and 249 outputs.</p> <p>The inputs to the network are columns of pixels. The columns in the image are fed into the network, one at a time, from left to right. The outputs are scores for each possible letter. As the columns for the <code>A</code> in the image above are fed into the net, we&#39;d hope to see a spike from the <code>A</code> output.</p> <p>Here&#39;s what the output looks like:</p> <div class="center"> <img src="/images/ocropus/network.png" width="700" height="330"> </div> <p>The image on the bottom is the output of the network. Columns in the text and the output matrix correspond to one another. Each row in the output corresponds to a different letter, reading alphabetically from top to bottom. Red means a strong response, blue a weaker response. The red streak under the <code>A</code> is a strong response in the <code>A</code> row.</p> <p>The responses start somewhere around the middle to right half of each letter, once the net has seen enough of it to be confident it&#39;s a match. To extract a transcription, you look for maxima going across the image.</p> <p>In this case, the transcription is <code>Auguat S, 1934.</code>:</p> <div class="center"> <img src="/images/ocropus/network-labeled.png" width="699" height="216"> </div> <p>It&#39;s interesting to look at the letters that this model gets wrong. For example, the <code>s</code> in <code>August</code> produces the strongest response on the <code>a</code> row. But there&#39;s also a (smaller) response on the correct <code>s</code> row. There&#39;s also considerable ambiguity around the <code>5</code>, which is transcribed as an <code>S</code>.</p> <p>My #1 <a href="https://github.com/tmbdev/ocropy/issues/16">feature request</a> for Ocropus is for it to output more metadata about the character calls. While there might not be enough information in the image to make a clear call between <code>Auguat</code> and <code>August</code>, a post-processing step with a dictionary would clearly prefer the latter.</p> <p>The transcriptions with the default model are:</p> <p><img class="bordered" src="/images/ocropus/line1.png" width="610" height="35"><br> → <code>O1inton Street, aouth from LIYingston Street.</code> <br> <img class="bordered" src="/images/ocropus/line2.png" width="163" height="25"> → <code>P. L. Sperr.</code> <br> <img class="bordered" src="/images/ocropus/line3.png" width="233" height="24"> → <code>NO REPODUCTIONS.</code> <br> <img class="bordered" src="/images/ocropus/line4.png" width="208" height="27"> → <code>Auguat S, 1934.</code></p> <p>This is passable, but not great. The Ocropus site explains why:</p> <blockquote> <p>There are some things the currently trained models for ocropus-rpred will not handle well, largely because they are nearly absent in the current training data. That includes all-caps text, some special symbols (including &quot;?&quot;), typewriter fonts, and subscripts/superscripts. This will be addressed in a future release, and, of course, you are welcome to contribute new, trained models.</p> </blockquote> <p>We&#39;ll fix this in the <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">next post</a> by training our own model.</p> <p>The command to make predictions is:</p> <div class="highlight"><pre><code class="language-" data-lang="">ocropus-rpred -m en-default.pyrnn.gz book/0001/*.png </code></pre></div> <p>I believe the <code>r</code> stands for &quot;RNN&quot; as in &quot;Recurrent Neural Net&quot;.</p> <p>The outputs are <code>book/0001/01000{1,2,3,4}.txt</code>.</p> <p>If you want to see charts like the one above, pass <code>--show</code> or <code>--save</code>.</p> <h3>Extracting the text</h3> <p>We&#39;re on the home stretch!</p> <p>One way to get a text file out of Ocropus is to concatenate all the transcribed text files:</p> <div class="highlight"><pre><code class="language-" data-lang="">cat book/????/??????.txt &gt; ocr.txt </code></pre></div> <p>The files are all in alphabetical order, so this should do the right thing.</p> <p>In practice, I found that I often disagreed with the line order that Ocropus chose. For example, I&#39;d say that <code>August 5, 1934.</code> is the second line of the image we&#39;ve been working with, not the fourth.</p> <p>Ocropus comes with an <code>ocropus-hocr</code> tool which converts its output to <a href="http://en.wikipedia.org/wiki/HOCR">hOCR format</a>, an HTML-based format designed by Thomas Breuel, who also developed Ocropus.</p> <p>We can use it to get bounding boxes for each text box:</p> <div class="highlight"><pre><code class="language-" data-lang="">$ ocropus-hocr -o book/book.html book/0001.bin.png $ cat book/book.html ... &lt;div class='ocr_page' title='file book/0001.bin.png'&gt; &lt;span class='ocr_line' title='bbox 3 104 607 133'&gt;O1inton Street, aouth from LIYingston Street.&lt;/span&gt;&lt;br /&gt; &lt;span class='ocr_line' title='bbox 3 22 160 41'&gt;P. L. Sperr.&lt;/span&gt;&lt;br /&gt; &lt;span class='ocr_line' title='bbox 1 1 228 19'&gt;NO REPODUCTIONS.&lt;/span&gt;&lt;br /&gt; &lt;span class='ocr_line' title='bbox 377 67 579 88'&gt;Auguat S, 1934.&lt;/span&gt;&lt;br /&gt; &lt;/div&gt; ... </code></pre></div> <p>Ocropus tends to read text more left to right than top to bottom. Since I know my images only have one column of text, I&#39;d prefer to emphasize the top-down order. I wrote a <a href="https://github.com/danvk/oldnyc/blob/master/ocr/tess/extract_ocropy_text.py">small tool</a> to reorder the text in the way I wanted.</p> <h3>Conclusions</h3> <p>Congrats on making it this far! We&#39;ve walked through the steps of running the Ocropus pipeline.</p> <p>The overall results aren&#39;t good (~10% of characters are incorrect), at least not yet. In the <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">next post</a>, I&#39;ll show how to train a new LSTM model that completely destroys this problem.</p> Finding blocks of text in an image using Python, OpenCV and numpy 2015-01-07T00:00:00+00:00 https://danvk.github.io//2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy <p>As part of an <a href="http://www.danvk.org/wp/2013-02-09/finding-pictures-in-pictures/">ongoing project</a> with the New York Public Library, I&#39;ve been attempting to <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">OCR</a> the text on the back of the <a href="http://digitalgallery.nypl.org/nypldigital/dgdivisionbrowseresult.cfm?div_id=hh">Milstein Collection</a> images. Here&#39;s what they look like:</p> <div class="center"> <img src="/images/milstein-backing.jpg" width="332" height="512"> </div> <p>A few things to note:</p> <ul> <li>There&#39;s a black border around the whole image, gray backing paper and then white paper with text on it.</li> <li>Only a small portion of the image contains text.</li> <li>The text is written with a tyepwriter, so it&#39;s monospace. But the typewriter font isn&#39;t <a href="http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?trg=1&amp;strucID=397983&amp;imageID=708193b&amp;total=683&amp;num=160&amp;word=4002&amp;s=1&amp;notword=&amp;d=&amp;c=&amp;f=13&amp;k=3&amp;lWord=&amp;lField=&amp;sScope=Name&amp;sLevel=&amp;sLabel=Brown%20Brothers%20%28New%20York%29&amp;sort=&amp;imgs=20&amp;pos=163&amp;e=w&amp;cdonum=0">always consistent</a> across the collection. Sometimes a <a href="http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?trg=1&amp;strucID=362388&amp;imageID=701471b&amp;total=534&amp;num=180&amp;word=Pumps&amp;s=3&amp;notword=&amp;d=&amp;c=&amp;f=2&amp;k=1&amp;lWord=&amp;lField=&amp;sScope=&amp;sLevel=&amp;sLabel=&amp;sort=&amp;imgs=20&amp;pos=184&amp;e=w&amp;cdonum=0">single image</a> has two fonts!</li> <li>The image is slightly rotated from vertical.</li> <li>The images are ~4x the resolution shown here (2048px tall)</li> <li>There are ~34,000 images: too many to affordably <a href="https://www.mturk.com/mturk/welcome">turk</a>.</li> </ul> <p>OCR programs typically have to do some sort of <a href="http://en.wikipedia.org/wiki/Document_layout_analysis">page-layout analysis</a> to find out where the text is and carve it up into individual lines and characters. When you hear &quot;OCR&quot;, you might think about fancy Machine Learning techniques like <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">Neural Nets</a>. But it&#39;s a dirty secret of the trade that page layout analysis, a much less glamorous problem, is at least as important in getting good results.</p> <p>The most famous OCR program is <a href="https://code.google.com/p/tesseract-ocr/">Tesseract</a>, a remarkably long-lived open source project developed over the past 20+ years at HP and Google. I quickly noticed that it performed much better on the Milstein images when I manually cropped them down to just the text regions first:</p> <div class="center"> <img src="/images/cropping.png" width="620" height="512"> </div> <p>So I set out to write an image cropper: a program that could automatically find the green rectangle in the image above. This turned out to be surprisingly hard!</p> <p><a href="http://en.wikipedia.org/wiki/Computer_vision">Computer Vision</a> problems like this one are difficult because they&#39;re so incredibly easy for humans. When you looked at the image above, you could immediately isolate the text region. This happened instantaneously, and you&#39;ll never be able to break down exactly how you did it.</p> <p>The best we can do is come up with ways of breaking down the problem in terms of operations that are simple for computers. The rest of this post lays out a way I found to do this.</p> <p>First off, I applied the <a href="http://en.wikipedia.org/wiki/Canny_edge_detector">canny edge detector</a> to the image. This produces white pixels wherever there&#39;s an edge in the original image. It yields something like this:</p> <div class="center"> <img src="/images/edges.png" width="332" height="512"> </div> <p>This removes most of the background noise from the image and turns the text regions into bright clumps of edges. It turns the borders into long, crisp lines.</p> <p>The sources of edges in the image are the borders and the text. To zero in on the text, it&#39;s going to be necessary to eliminate the borders.</p> <p>One really effective way to do this is with a <a href="http://scikit-image.org/docs/dev/auto_examples/applications/plot_rank_filters.html">rank filter</a>. This essentially replaces a pixel with something like the median of the pixels to its left and right. The text areas have lots of white pixels, but the borders consist of just a thin, 1 pixel line. The areas around the borders will be mostly black, so the rank filter will eliminate them. Here&#39;s what the image looks like after applying a vertical and horizontal rank filter:</p> <div class="center"> <img src="/images/edges-rankfilter.png" width="332" height="512"> </div> <p>The borders are gone but the text is still there! Success!</p> <p>While this is effective, it still leaves bits of text <em>outside</em> the borders (look at the top left and bottom right). That may be fine for some applications, but I wanted to eliminate these because they&#39;re typically uninteresting and can confuse later operations. So instead of applying the rank filter, I found the <a href="http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_contours/py_contours_begin/py_contours_begin.html">contours</a> in the edge image. These are sets of white pixels which are connected to one another. The border contours are easy to pick out: they&#39;re the ones whose <a href="http://en.wikipedia.org/wiki/Minimum_bounding_box">bounding box</a> covers a large fraction of the image:</p> <div class="center"> <img src="/images/edges-bordercontour.png" width="332" height="512"> </div> <p>With polygons for the borders, it&#39;s easy to black out everything outside them.</p> <p>What we&#39;re left with is an image with the text and possibly some other bits due to smudges or marks on the original page.</p> <p>At this point, we&#39;re looking for a crop <code>(x1, y1, x2, y2)</code> which:</p> <ol> <li>maximizes the number of white pixels inside it and</li> <li>is as small as possible.</li> </ol> <p>These two goals are in opposition to one another. If we took the entire image, we&#39;d cover all the white pixels. But we&#39;d completely fail on goal #2: the crop would be unnecessarily large. This should sound familiar: it&#39;s a classic <a href="http://en.wikipedia.org/wiki/Precision_and_recall">precision/recall tradeoff</a>:</p> <ul> <li>The recall is the fraction of white pixels inside the cropping rectangle.</li> <li>The precision is the fraction of the image outside the cropping rectangle.</li> </ul> <p>A fairly standard way to solve precision/recall problems is to optimize the <a href="http://en.wikipedia.org/wiki/F1_score">F1 score</a>, the harmonic mean of precision and recall. This is what we&#39;ll try to do.</p> <p>The set of all possible crops is quite large: <em>W</em><sup>2</sup><em>H</em><sup>2</sup>, where <em>W</em> and <em>H</em> are the width and height of the image. For a 1300x2000 image, that&#39;s about 7 trillion possibilities!</p> <p>The saving grace is that most crops don&#39;t make much sense. We can simplify the problem by finding individual chunks of text. To do this, we apply <a href="http://homepages.inf.ed.ac.uk/rbf/HIPR2/dilate.htm">binary dilation</a> to the de-bordered edge image. This &quot;bleeds&quot; the white pixels into one another. We do this repeatedly until there are only a few connected components. Here&#39;s what it looks like:</p> <div class="center"> <img src="/images/edges-dilated.png" width="560" height="373"> </div> <p>As we hoped, the text areas have all bled into just a few components. There are five connected components in this image. The white blip in the top right corresponds to the &quot;Q&quot; in the original image.</p> <p>By including some of these components and rejecting others, we can form good candidate crops. Now we&#39;ve got a <a href="http://en.wikipedia.org/wiki/Subset_sum_problem">subset sum problem</a>: which subset of components produces a crop which maximizes the F1 score?</p> <p>There are 2<sup><em>N</em></sup> possible combinations of subsets to examine. In practice, though, I found that a greedy approach worked well: order the components by the number of white pixels they contain (in the original image). Keep adding components while it increases the F1 score. When nothing improves the score, you&#39;re done!</p> <p>Here&#39;s what that procedure produces for this image:</p> <div class="center"> <img src="/images/edges-chosen.png" width="560" height="373"> </div> <p>The components are ordered as described above. Component #1 contains the most white pixels in the original image. The first four components are accepted and the fifth is rejected because it hurts the F1 score:</p> <ol> <li>Accept #1, F1 Score → 0.886</li> <li>Accept #2, F1 Score → 0.931</li> <li>Accept #3, F1 Score → 0.949</li> <li>Accept #4, F1 Score → 0.959</li> <li>Reject #5 (F1 Score → 0.888)</li> </ol> <p>Applying this crop to the original image, you get this:</p> <div class="center"> <img src="/images/milstein-cropped.jpg" width="875" height="233"> </div> <p>That&#39;s 875x233, whereas the original was 1328x2048. That&#39;s a <strong>92.5% decrease in the number of pixels, with no loss of text</strong>! This will help any OCR tool focus on what&#39;s important, rather than the noise. It will also make OCR run faster, since it can work with smaller images.</p> <p>This procedure worked well for my particular application. Depending on how you count, I&#39;d estimate that it gets a perfect crop on about 98% of the images, and its errors are all relatively minor.</p> <p>If you want to try using this procedure to crop your own images, you can find the <a href="https://github.com/danvk/oldnyc/blob/master/ocr/tess/crop_morphology.py">source code here</a>. You&#39;ll need to install <code>OpenCV</code>, <code>numpy</code> and <code>PIL</code> to make it work.</p> <p>I tried several other approaches which didn&#39;t work as well. Here are some highlights:</p> <ul> <li><p>I ran the image through Tesseract to find areas which contained letters. These should be the areas that we crop to! But this is a bit of a chicken and the egg problem. For some images, Tesseract misses the text completely. Cropping fixes the problem. But we were trying to find a crop in the first place!</p></li> <li><p>I tried running the images through <a href="https://www.flameeyes.eu/projects/unpaper">unpaper</a> first, to remove noise and borders. But this only worked some of the time and I found unpaper&#39;s interface to be quite opaque and hard to tweak.</p></li> <li><p>I ran canny, then calculated row and column sums to optimize the x- &amp; y-coordinates of the crop independently. The text regions did show up clearly in charts of the row sums: <img src="/images/milstein-rowsums.png" width="379" height="258"><br> The four spikes are the tops and bottoms of the two borders. The broad elevated region in the middle is the text. Making this more precise turned out to be hard. You lose a lot of structure when you collapse a dimension—this problem turned out to be easier to solve as a single 2D problem than as two 1D problems.</p></li> </ul> <p>In conclusion, I found this to be a surprisingly tricky problem, but I&#39;m happy with the solution I worked out.</p> <p>In the next post, I&#39;ll talk about my experience running OCR tools over these cropped images.</p> Choosing an iOS Podcasting App 2014-12-29T00:00:00+00:00 https://danvk.github.io//2014/12/29/choosing-an-ios-podcasting-app <p>I recently switched back to iOS after a few years using Android. One aspect of Android that always bothered me was that I had trouble finding a great Podcasting app. I&#39;d happily used <a href="http://www.macstories.net/reviews/instacast-3-review/">Instacast</a> on iOS, but I couldn&#39;t find anything quite like it for Android.</p> <p>I eventually settled on <a href="https://play.google.com/store/apps/details?id=com.bambuna.podcastaddict&amp;hl=en">Podcast Addict</a>. It epitomizes a stereotype of Android apps: tons of features, gajillions of options and a UI that was clearly designed by an engineer.</p> <p>After switching back to iOS, I had a surprisingly hard time finding a Podcasting app which could do everything I wanted. For me, the hard requirements were:</p> <ol> <li>A view of podcasts ordered by most-recently updated</li> <li>Stream by default—don&#39;t download episodes unless I ask!</li> <li>An option to disable play-through.</li> </ol> <p>In the last few years, <strong>Instacast</strong> received a <a href="http://vemedio.com/blog/posts/instacast-5-available-today">major update</a>. This introduced a new player interface which doesn&#39;t work on the iOS lock screen. As a result, you can&#39;t use the play/pause/volume features on your earbuds!</p> <p>It also lacks the ability to sort podcasts by most-recently-updated. I subscribe to a mix of podcasts that update <a href="http://www.npr.org/rss/podcast/podcast_detail.php?siteId=5183215">frequently</a> and <a href="http://www.dancarlin.com/hardcore-history-series/">infrequently</a>, so I find this to be a far more useful view than a complete list of all episodes. In particular, I don&#39;t want an hourly podcast to crowd out less frequently updated shows.</p> <p>So I set out to find a new Podcasting app. I really wish the iOS App Store had a try-before-you-buy option. Order-by-date is a fairly obscure feature, and it was impossible to determine if an app supported it without buying it.</p> <p>Since I spent about $10 trying out apps and deleting them, here&#39;s my guide to spare you that chore!</p> <h4>Instacast</h4> <p>As I mentioned, it&#39;s missing the order by most recently updated feature and the ability to control playback via the lock screen.</p> <h4>Overcast</h4> <p>Overcast is the best looking of the bunch. I appreciate its freemium model and functional shoutouts to alternative apps—they made moving my list of feeds between apps painless.</p> <p>That being said, I had a few problems with Overcast:</p> <ul> <li>It doesn’t support streaming</li> <li>Disabling play-through is a paid feature (which is fine, but it&#39;s worth noting!)</li> <li>It makes a distinction between all episodes and unplayed episodes for each Podcast which I don&#39;t find to be very helpful. I&#39;m not a completist—I tend to pick and choose episodes.</li> <li>It&#39;s very aggressive about downloading new episodes!</li> </ul> <h4>Podcasts (built-in app)</h4> <p>Podcasts is a barebones app for listening to Podcasts. I couldn&#39;t tell if it supported streaming (I don&#39;t think it does) and it definitely doesn&#39;t support showing your podcasts ordered by when they were last updated. Next!</p> <h4>PodCruncher</h4> <p>This app came up when I searched for Downcast on the App store and looked promising. Unfortunately, it hasn&#39;t been updated for the new &quot;flat&quot; look in iOS 7 and it doesn&#39;t support the larger screen on the iPhone 6. It also lacks the most-recently-updated view.</p> <h4>Downcast</h4> <p>Finally I found <a href="http://www.downcastapp.com/">Downcast</a>, which meets all my Podcasting needs. It lets you order your podcasts by most-recently updated, though the option is tricky to find (Edit→Sort→Publication Date):</p> <p><img src="/images/sort-by-mru.png" width="375" height="327" alt="Sort by Publication Date option"></p> <p>And there&#39;s an option to disable play-through on the player screen, if you can recognize it:</p> <p><img src="/images/disable-playthrough.png" width="375" height="172" alt="Option to disable playthrough"></p> <p>Next on my list to try were Pocket Casts and Castro, but I stopped when I found that I was happy with Downcast.</p> <p>After all this, I&#39;m coming to appreciate that everyone has different ways of using their podcasting app, which makes designing them hard. I don&#39;t care at all about Smart Playlists (which many apps emphasize) and I care about slightly obscure features like ordering my podcasts by when they were updated. This is a case of <a href="http://prog21.dadgum.com/160.html">dangling by a trivial feature</a>.</p> JavaScript String slice, substr, substring: which to use? 2014-11-17T00:00:00+00:00 https://danvk.github.io//2014/11/17/javascript-string-splice-substr-substring-which-to-use <p>I recently read Douglas Crockford&#39;s <a href="http://www.amazon.com/JavaScript-Good-Parts-Douglas-Crockford/dp/0596517742"><em>JavaScript: The Good Parts</em></a>. It&#39;s a classic (published in 2008) which is credited with reviving respect for JavaScript as a programming language. Given its title, it&#39;s also famously short.</p> <p>One very specific thing it cleared up for me was what to do with all of JavaScript&#39;s various substring methods:</p> <ul> <li><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr">String.prototype.substr</a>(start[, length])</li> <li><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substring">String.prototype.substring</a>(start[, stop])</li> </ul> <p>One method takes an offset and a length. The other takes two offsets. The names don&#39;t reflect this distinction in any way, so they&#39;re impossible to remember. I resorted to writing an <a href="http://support.alfredapp.com/tutorials:clipboard-snippets">Alfred Snippet</a> with the syntaxes, to quickly look it up (since I always had to).</p> <p>They also have some other differences in behavior: <code>substring</code> doesn&#39;t allow its offset to be negative, but <code>substr</code> does. This is arbitrary. It&#39;s impossible to remember because there&#39;s no rhyme or reason to it.</p> <p>Crockford cleared this all up. The solution is to never use <em>either</em> method!</p> <p>Instead, you should use the <code>slice</code> method:</p> <ul> <li><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/slice">String.prototype.slice</a>(start[, stop])</li> </ul> <p>This takes two offsets. Either can be negative. And it&#39;s the exact same as the corresponding <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/slice">Array <code>slice</code></a> method.</p> <p>So: stop using <code>substr</code> and <code>substring</code>. Use <code>slice</code>!</p> GitHub integration and Image Diff improvements headline webdiff 0.8 2014-11-07T00:00:00+00:00 https://danvk.github.io//2014/11/07/github-integration-and-image-diff-improvements-headline-webdiff-0-8 <p>I&#39;ve released <a href="https://pypi.python.org/pypi/webdiff/0.8.0">webdiff 0.8.0</a>, which you can install via:</p> <div class="highlight"><pre><code class="language-" data-lang="">pip install --upgrade webdiff </code></pre></div> <p>The most interesting new features are GitHub pull request integration and expanded image diffing modes.</p> <p>You can view a GitHub Pull Request in webdiff by running something like:</p> <div class="highlight"><pre><code class="language-" data-lang="">webdiff https://github.com/hammerlab/cycledash/pull/175 </code></pre></div> <p>Any github Pull Request URL will do. This will pull down the files from GitHub to local disk and then diff them in the standard webdiff UI. My main use case for this is looking at screenshot diffs and thinking &quot;I want to see bigger images in this PR diff&quot;. Speaking of which…</p> <p><img src="/images/webdiff-swipe.gif" width="700" height="451"></p> <p>This version includes a few improvements to the image diff mode:</p> <ol> <li>A &quot;shrink to fit&quot; option, which is enabled by default. This shrinks large images to fit in your browser window.</li> <li>Consistent use of red/green borders for before/after images.</li> <li>&quot;Onion Skin&quot; diff mode, which fades one image into the other.</li> <li>&quot;Swipe&quot; diff mode, which lets you lets you drag a dividing line between the images.</li> </ol> <p>These are based on <a href="https://github.com/blog/817-behold-image-view-modes">GitHub&#39;s image view modes</a>. I still find &quot;blink&quot; to be the most helpful for spotting small changes, but now you&#39;ve got choices!</p> <p>There were a few smaller changes as well. Full release notes are <a href="https://pypi.python.org/pypi/webdiff/0.8.0">on PyPI</a>.</p> Life after Google, Six Months In 2014-10-31T00:00:00+00:00 https://danvk.github.io//2014/10/31/life-after-google-six-months-in <p>It&#39;s been almost exactly six months since I ended an eight year run at Google. One of the biggest reasons to do this was to come back up to speed with the open source ecosystem and to experience a different working environment (sample size 1→2!).</p> <p>When I joined Google, it seemed like our tech stack was years ahead of anything else. This included tools like <a href="http://google-engtools.blogspot.com/2011/08/build-in-cloud-how-build-system-works.html">BUILD files</a> for managing dependencies and compilation, <a href="http://www.quora.com/What-is-Borg-at-Google">borg</a> for running jobs in a data center, and <a href="https://developers.google.com/closure/compiler/">closure compiler</a> for dealing with JavaScript&#39;s idiosyncracies.</p> <p>The open source world has come a long way since 2006. It&#39;s solved many of the same problems that Google did, but in different ways. There&#39;s the hadoop stack for distributed work. And there are polyfills and CommonJS for JavaScript. These aren&#39;t necessarily better or worse than Google&#39;s solutions, but they are different. And they&#39;re the tools that new developers are learning.</p> <p>Long term, this is a big problem for Google. &quot;Ahead of the field&quot; can rapidly turn into &quot;Not Invented Here&quot;. &quot;Better&quot; can become merely &quot;different&quot;.</p> <p>A simple example of this is <a href="http://docs.closure-library.googlecode.com/git/namespace_goog.html"><code>goog.bind</code></a>. It works around some confusing behavior involving <code>this</code> in JavaScript. It was a great tool in 2004. But modern JavaScript has its own solution (<a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Function/bind">Function.prototype.bind</a>). It&#39;s been around for years and is supported in 90+% of browsers. Google will still be writing <code>goog.bind</code> long after IE8 ceases to be relevant.</p> <p>Facebook has a better solution to this: they always use the latest version of the JavaScript standard, and then <a href="https://github.com/facebook/jstransform">transpile</a> to something that older browsers will understand. This way they stay on the mainstream of technological development.</p> <p>Here are a few other things I&#39;ve found notable:</p> <ul> <li><p><strong>Almost any Google technology has an open-source equivalent.</strong><br> Examples include Travis-CI (similar to TAP), the Hadoop stack (MapReduce, CNS, Dremel, …), CommonJS (Closure modules).</p></li> <li><p><strong>Package managers</strong><br> For most Google engineers, something like 95+% of the code you work with is first-party (i.e. written at Google). For a smaller group, this ratio is going to be dramatically different. As a result, third party package managers are much more important. And the good news is that they&#39;ve gotten much better over the last eight years! Leaving Google has finally forced me to learn how to use tools like NPM and pip. I&#39;ve found this incredibly empowering. I used to avoid external dependencies for personal projects. I suspect many Google engineers do the same.</p></li> <li><p><strong>Markdown is more pervasive than I&#39;d realized</strong><br> I was vaguely aware of <a href="http://daringfireball.net/projects/markdown/">Markdown</a> while I was at Google, but didn&#39;t really see the point. Now that I&#39;m out, I&#39;ve been surprised to see how pervasive it is. I suspect that much of this comes from GitHub, where you use Markdown for READMEs, issues and code review comments. You also use it with GitHub pages, so I&#39;m typing in it right now! It&#39;s worthwhile to learn Markdown a bit better. It&#39;s like HTML with infinitely less boilerplate.</p></li> <li><p><strong>Knowledge of git</strong><br> It&#39;s completely reasonable for someone with ten years of experience at Google to never have run git. I&#39;m happy I was involved enough with open source projects that I have years of basic experience with it. But others have clearly gone much deeper. My git skills have gotten much better since leaving Google. I know how to use <code>git rebase</code> now!</p></li> <li><p><strong>sysadmin knowledge</strong><br> borg meant that I never had to learn any systems administration. It&#39;s a particular blind spot for me. For example, <a href="http://upstart.ubuntu.com/">upstart</a> was released over eight years ago, just after I joined Google. It&#39;s incredibly widely used. I&#39;d never heard of it.</p></li> <li><p><strong>Google uses a lot of email</strong><br> We use almost no email in my new group. HipChat takes the place of a group mailing list and your coworkers @mention you when they want you to chime in. The chat format keeps things short. I still get emails for code reviews, but I could imagine this going through GitHub notifications instead.</p></li> <li><p><strong>Buying your own lunch isn&#39;t that big a deal.</strong><br> As it turns out, there are people in NYC who will make you food in exchange for money, or even deliver it! The variety is much greater than what you get at a Google Cafe, and it&#39;s nice to have a good reason to go outside during the day. I also like the more flexible eating schedule. Almost everyone in my office eats lunch very late, perhaps at 3 or 4.</p></li> </ul> Fully Migrated to GitHub Pages 2014-10-23T00:00:00+00:00 https://danvk.github.io//2014/10/23/fully-migrated-to-github-pages <p><img src="/images/jekyll.png" alt="Jekyll Logo" width="249" height="115" style="float:right"></p> <p>My danvk.org site is now fully hosted on GitHub pages. I changed the DNS entry last night.</p> <p>My hope was to do this without breaking anything. That didn&#39;t prove to be possible, but I came close. And overall, the process wasn&#39;t too bad! It was helpful to make a <a href="https://github.com/danvk/danvk.github.io/issues/1">census of material</a> that was on my old site using access logs. This turned up a few redirects I wouldn&#39;t have thought of, and also reminded me of the many features my old site accumulated over the years. Some of these are now accessible under the &quot;Features&quot; menu on the new site.</p> <p>There were a few pain points:</p> <ul> <li><p><strong>Redirects</strong></p> <p>It would have been enormously helpful if GitHub pages supported something like <a href="https://en.wikipedia.org/wiki/Rewrite_engine">mod_rewrite</a>. As it was I had to kill a few old links because I was <a href="http://stackoverflow.com/questions/26374707/can-jekyll-serve-content-based-on-a-url-parameter">completely unable to generate 301/302 redirects</a>. I wound up <a href="https://github.com/danvk/danvk.github.io/commit/f99fa0d6ef808a2ba468587d3f7eab800d448f1e">hard-coding JavaScript redirects</a> instead. It&#39;s not ideal, and I&#39;ll probably lose some pagerank, but it&#39;s the best I could do.</p></li> <li><p><strong>Migrating the domain</strong></p> <p>I really hate DNS. It&#39;s impossible to know whether your site isn&#39;t working because you&#39;ve misconfigured your DNS, or because the new records haven&#39;t propagated out yet. I&#39;m surprised more sites don&#39;t go down because of DNS problems. The <a href="https://www.whatsmydns.net/">Global DNS Propagation Tracker</a> was indispensible as a sanity check.</p></li> </ul> <p>One thing that worked really well was migrating my old WordPress blog. I&#39;d expected this to be a complete pain, but it was nothing of the sort. I used <a href="http://www.httrack.com/">httrack</a> to mirror the rendered version of my blog site to a folder on local disk. Then I checked that folder into GitHub. Done.</p> <p>Finally, I learned a lot about <a href="http://jekyllrb.com/">Jekyll</a> from this process. It really is a static content generator. It&#39;s not a serving system. This is why it can&#39;t do things like 301/302 redirects. GitHub pages would be a much more powerful serving system if it included support for things like mod_rewrite. But then it would be less Jekyll-y. The beauty of the system is that it&#39;s pure static content, and hence insanely fast and simple.</p> Filtering JSON with pyjsonselect and jss 2014-10-13T00:00:00+00:00 https://danvk.github.io//2014/10/13/filtering-json-with-pyjsonselect-and-jss <p>The data store for <a href="http://www.comparea.org/">Comparea</a> is a <a href="https://github.com/danvk/comparea/blob/master/comparea/static/data/comparea.geo.json">giant</a> 23MB <a href="http://geojson.org/">GeoJSON</a> file. Most of the space in that file is taken up by the giant lists of coordinates which define the boundaries of each shape. But there&#39;s also some interesting metadata hidden amongst all those latitudes and longitudes:</p> <div class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w"> </span><span class="nt">"features"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"geometry"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polygon"</span><span class="p">,</span><span class="w"> </span><span class="nt">"coordinates"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="mf">-69.89912109375001</span><span class="p">,</span><span class="w"> </span><span class="mf">12.452001953124963</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="mf">-69.89570312500004</span><span class="p">,</span><span class="w"> </span><span class="mf">12.422998046875009</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Feature"</span><span class="p">,</span><span class="w"> </span><span class="nt">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba is an island."</span><span class="p">,</span><span class="w"> </span><span class="nt">"wikipedia_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://en.wikipedia.org/wiki/Aruba"</span><span class="p">,</span><span class="w"> </span><span class="nt">"area_km2"</span><span class="p">:</span><span class="w"> </span><span class="mf">154.67007756254557</span><span class="p">,</span><span class="w"> </span><span class="nt">"population"</span><span class="p">:</span><span class="w"> </span><span class="mi">103065</span><span class="p">,</span><span class="w"> </span><span class="nt">"population_year"</span><span class="p">:</span><span class="w"> </span><span class="s2">"???"</span><span class="p">,</span><span class="w"> </span><span class="nt">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nt">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ABW"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div> <p>I&#39;d hoped that I could use <a href="http://stedolan.github.io/jq/">jq</a> to filter out all the coordinates and just look at the metadata. But I got bogged down reading through its extensive <a href="http://stedolan.github.io/jq/manual/">manual</a>. At the end of the day, I didn&#39;t want to learn an ad-hoc language just for filtering JSON files.</p> <p>The maddening thing was that there&#39;s already a great language for selecting elements in trees: <strong>CSS Selectors</strong>! I did some searching and learned that there&#39;s already a standard for applying CSS-like selectors to JSON called <a href="http://jsonselect.org/#overview">JSONSelect</a>. It dates from 2011. It has a spec and <a href="https://github.com/lloyd/JSONSelectTests">conformance tests</a>, and it&#39;s been implemented in a <a href="http://jsonselect.org/#code">number of languages</a>.</p> <p>So I picked my language of choice (Python) and began implementing a new command line tool for filtering JSON files.</p> <p>The first issue I ran into: the standard <a href="https://github.com/mwhooker/jsonselect">Python implementation</a> didn&#39;t conform to the standard! It only implemented 2/3 levels of CSS selectors from the spec, and many of the interesting selectors are in level 3.</p> <p>The reference <a href="https://github.com/lloyd/JSONSelect/blob/master/src/jsonselect.js">JavaScript implementation</a> was only 572 lines of code and, with all those tests, I figured it wouldn&#39;t be too hard to port it directly to Python. This was a fun project—there&#39;s something very zen about coding against a spec, getting test after test to pass. I learned about a few nuances of JavaScript and Python by doing this:</p> <ul> <li>Their regular expressions differ in how they specify <a href="http://stackoverflow.com/questions/3835917/how-do-i-specify-a-range-of-unicode-characters">unicode ranges</a></li> <li>the reference implementation made use of the <code>null</code> vs. <code>undefined</code> distinction</li> <li>JavaScript&#39;s <code>typeof</code> function is quite odd</li> <li>JavaScript&#39;s <code>Array.prototype.concat</code> method is quite subtle in its behavior</li> </ul> <p>I wound up re-implementing all of these quirks these in Python.</p> <p>At the end of the day, I published <a href="https://github.com/danvk/pyjsonselect/">pyjsonselect</a>, the first fully-conformant JSONSelect implementation in Python. A small win for the open source world!</p> <h2>jss</h2> <p>So, how does the tool work? You can read about installation and basic usage <a href="https://github.com/danvk/jss">on github</a>, but here are a few motivating examples.</p> <p>jss is a JSON→JSON converter. It supports three modes:</p> <ol> <li>select: find all the values that match a selector (1→N)</li> <li>filter out (<code>-v</code>): remove all values which match a selector (1→1)</li> <li>filter in (<code>-k</code>): keep only values which match a selector (1→1)</li> </ol> <p>Here&#39;s how the filter out mode works:</p> <div class="highlight"><pre><code class="language-" data-lang="">$ jss -v '.coordinates' comparea.geo.json </code></pre></div><div class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w"> </span><span class="nt">"features"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"geometry"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polygon"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Feature"</span><span class="p">,</span><span class="w"> </span><span class="nt">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba is an island."</span><span class="p">,</span><span class="w"> </span><span class="nt">"wikipedia_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://en.wikipedia.org/wiki/Aruba"</span><span class="p">,</span><span class="w"> </span><span class="nt">"area_km2"</span><span class="p">:</span><span class="w"> </span><span class="mf">154.67007756254557</span><span class="p">,</span><span class="w"> </span><span class="nt">"population"</span><span class="p">:</span><span class="w"> </span><span class="mi">103065</span><span class="p">,</span><span class="w"> </span><span class="nt">"population_year"</span><span class="p">:</span><span class="w"> </span><span class="s2">"???"</span><span class="p">,</span><span class="w"> </span><span class="nt">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nt">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ABW"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="err">...</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div> <p>That knocked out all the <code>coordinates</code> keys from the GeoJSON file!</p> <p>I eventually did figure out how to do this in jq. Here&#39;s what it looks like:</p> <div class="highlight"><pre><code class="language-" data-lang="">$ jq 'del(..|.coordinates?| select(. != []))' comparea.geo.json (same output) </code></pre></div> <p>To come up with that incantation, I had to dig through jq&#39;s <a href="https://github.com/stedolan/jq/issues/319">github issues</a>. It&#39;s certainly not something I could re-type from memory! The <code>jss</code> version is clear as could be.</p> <p>It&#39;s also significantly faster. For the 23MB <code>comparea.geo.json</code> file, the <code>jss</code> command runs in 1.7s on my laptop vs. 12.9s for <code>jq</code>. The trick to this speed is <a href="http://stackoverflow.com/questions/26221309/preorder-traversal-using-python-generators-with-a-mechanism-to-ignore-subtrees">appropriate pruning</a> of the selector search.</p> <p>Here&#39;s how the &quot;select&quot; mode works:</p> <div class="highlight"><pre><code class="language-" data-lang="">$ jss '.name' comparea.geo.json </code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">"Aruba" "Afghanistan" "Angola" "Anguilla" "Albania" "Andorra" "United Arab Emirates" "Argentina" "Armenia" "American Samoa" ... </code></pre></div> <p>Unlike &quot;filter out&quot;, which maps one JSON object to another JSON object, &quot;select&quot; extracts multiple values from a single object. Each line of output is its own JSON object. This is why it&#39;s 1→N, vs 1→1 for the other modes. It&#39;s useful if you want to do more processing using grep, sed and other familiar line-oriented tools.</p> <h2>Fancy selectors</h2> <p>You can specify as operations as you like. Here&#39;s a more complex invocation:</p> <div class="highlight"><pre><code class="language-" data-lang="">$ jss -v .coordinates -k '.features&gt;*:has(:contains("ZAF"))' comparea.geo.json </code></pre></div><div class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w"> </span><span class="nt">"features"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"geometry"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polygon"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Feature"</span><span class="p">,</span><span class="w"> </span><span class="nt">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ZAX"</span><span class="p">,</span><span class="w"> </span><span class="nt">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"South Africa, officially the Republic of South Africa, is a country located at the southern tip of Africa. It has 2,798 kilometres of coastline that stretches along the South Atlantic and Indian oceans."</span><span class="p">,</span><span class="w"> </span><span class="nt">"population_source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"World Factbook"</span><span class="p">,</span><span class="w"> </span><span class="nt">"sov_a3"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ZAF"</span><span class="p">,</span><span class="w"> </span><span class="nt">"freebase_mid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/m/0hzlz"</span><span class="p">,</span><span class="w"> </span><span class="nt">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"South Africa"</span><span class="p">,</span><span class="w"> </span><span class="nt">"population_source_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://www.cia.gov/library/publications/the-world-factbook/fields/2119.html"</span><span class="p">,</span><span class="w"> </span><span class="nt">"area_km2_source_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://www.cia.gov/library/publications/the-world-factbook/fields/2147.html"</span><span class="p">,</span><span class="w"> </span><span class="nt">"population_date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"July 2014"</span><span class="p">,</span><span class="w"> </span><span class="nt">"wikipedia_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://en.wikipedia.org/wiki/South_Africa"</span><span class="p">,</span><span class="w"> </span><span class="nt">"area_km2"</span><span class="p">:</span><span class="w"> </span><span class="mi">1214470</span><span class="p">,</span><span class="w"> </span><span class="nt">"area_km2_source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"World Factbook"</span><span class="p">,</span><span class="w"> </span><span class="nt">"population"</span><span class="p">:</span><span class="w"> </span><span class="mi">48375645</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div> <p>After filtering out the <code>coordinates</code> fields, it keeps only elements directly under the <code>features</code> key (i.e. a top-level feature) which contains &quot;ZAF&quot; somewhere (the &quot;sov_a3&quot; field, in this case).</p> <p>Isn&#39;t this just as complicated as the jq syntax? Sure! But at least you learned something useful. If you get better at writing CSS selectors as a result of filtering JSON files, then that&#39;s great! You&#39;ve become a better web developer in the process.</p> <p>You can install <code>jss</code> with <code>pip</code>. Read more <a href="https://github.com/danvk/jss">on github</a>!</p> Facebook (non-)Insights 2014-10-13T00:00:00+00:00 https://danvk.github.io//2014/10/13/facebook-non-insights <p>There was a spike of traffic to <a href="https://www.comparea.org">Comparea</a> over the weekend:</p> <p><img src="/images/comparea-spike.png" alt="Spike of traffic to comparea.org" title="Spike of traffic to comparea.org"></p> <p>Awesome! All of it came from Facebook and went to a comparison of <a href="https://www.comparea.org/FXX+AUS">France vs Australia</a>:</p> <p><img src="/images/comparea-all-facebook.png" alt="All traffic coming from Facebook" title="All traffic coming from Facebook"></p> <p>I can easily get insight into who <a href="https://twitter.com/TeachingIdeas/status/520656010391093248">tweeted it</a>. But Facebook is a big black box. In theory, I can use <a href="https://www.facebook.com/FacebookInsights">Facebook Insights</a> to track this. It claims that 50 actions have led to ~2400 visits to my site, but declines to say anything more:</p> <p><img src="/images/comparea-facebook-non-insight.png" alt="All information is suppressed" title="All information is suppressed"></p> <p>I understand that Facebook is more private than Twitter, but this is frustrating. I&#39;d hope to at least see which country the shares are happening in. Are these French people coming to Comparea? Australians? Facebook won&#39;t even reveal that all the recent visits are to a single URL! I&#39;d think that thousands of clicks would be enough to provide anonymity there.</p> <p>I&#39;m not sure exactly what sorts of insights Facebook can share without revealing identities, but I&#39;d very much hoped for more than this.</p> <p>Has anyone had positive experiences with Facebook Insights?</p> Trying out GitHub Pages 2014-10-01T00:00:00+00:00 https://danvk.github.io//2014/10/01/post <p>I&#39;m going to try hosting my site and blog on GitHub pages. My hope is that blogging using GitHub and Markdown will lower the barrier to writing, and that GitHub pages will eliminate any worries about performance and security while hosting my own site.</p> <p>This is all very much a work in progress, so feedback is welcome!</p>