>> From the Library of Congress in Washington, D.C. >> Good morning and welcome to the Library of Congress and to Collections as Data. Please welcome Kate Zwaard, Chief of National Digital Initiatives at the Library of Congress [applause]. >> Kate Zwaard: Hey everybody. Thank you so much for coming. We're so excited you could be here both in person and via the live stream. I don't have my clicker. Is there a clicker? >> Is it - oh, the clicker? >> Kate Zwaard: Yeah, I need a clicker. Oh, it's here. Sorry. So we're so excited to have you here in person and also via the live stream. Welcome to Collections as Data. We had an event last year you may remember, Collections as Data, where we discussed the challenges and opportunities inherent in doing computational analysis on digital collections. And we were really overwhelmed by the response. People afterwards came up to me and talked to me about how inspired they were, how they thought maybe they could do this work in their home institutions and - but we found that we still, to this day, are struggling to explain in a tangible way what Collections as Data means to our colleagues who don't have direct experience. So we thought we'd have this event, Collections as Data Impact, to invite luminaries in the field to share their stories about how they are providing - about how they're using data to better the world and their communities. So with that in mind, I'd like to share with you a short story that is from the time when computers were people doing calculating, not when they were things in our pockets. And it's my hope that this story will connect the work you'll hear about today to a longer history. So this is a little bit of a window into my high school bedroom. You know, these beautiful people up here [laughter]. So as you may remember the Federalist Papers were written by John Jay, Alexander Hamilton, and Madison. They were published under a pseudonym in newspapers, Publius. 
And then when changing public opinion converted those documents from anonymous trolling to foundational to the democracy, we started to get a sense of the authorship. And poor John Jay, only five. So at some point the dust settled. We knew who wrote most of them but there were twelve that remained in dispute. Madison said that he didn't have anything to do with them. Hamilton said that they were joint papers. And everybody kind of figured that it was probably Madison. Why was there so much finger pointing? Well, these were propaganda pieces. And sometimes the authors took positions in the papers for argument's sake that were very different from the ones they held publicly. So our story really starts in 1944 when an American academic, Douglass Adair, gathered all the historical evidence and determined it was probably Madison that wrote the disputed papers. But he considered the historical evidence modest, so he sought another way of making an analysis. He contacted two statisticians, Frederick Mosteller and David Wallace, to see if there was a computational way of making a determination. And I really love this quote: "...blunder into historical and literary controversy, merciless slaughter is imminent." If this story strikes a chord with you, I really encourage you to read the book. They're both just kind of a joy to spend time with. So they did a - they decided to do a computational analysis on the authorship of the papers to see if they could have an independent assessment. So they took a stack of known Hamilton papers and known Madison papers, and they thought maybe calculating sentence average would be a way to make a determination. So they laboriously counted in these known papers how long each sentence was. And then they did some analysis. For example, they had to determine - they had to - sorry. They had to determine whether or not quoted sentences counted toward the average. 
Once they finished all this work, they found an average of 34.5 words per sentence for Hamilton and 34.6 words for Madison [laughter]. Not going to work. So then they decided maybe standard deviation would be the thing. So they did the same very laborious counting and calculation. They thought maybe although average sentence length was the same, one of the authors wrote very average-length sentences and the other author wrote very teeny, tiny sentences and lots of really long sentences. So they did the same kind of calculation and analysis and came out with the same result for both of them. And in retrospect this kind of makes sense. Both authors were trained in a similar style that was popular during the period. Lots of long clauses, very kind of - I mean, we've all read the Federalist Papers; you know what I'm talking about [laughter]. So they shelved the project. It was a year's worth of work that didn't have any fruit. And they kind of just went on to other things. A few years later, Douglass Adair reached back out to say that he found something interesting. He found that while Hamilton uses the word "while", Madison uses the word "whilst." And that fact in itself is not enough to determine authorship. The word only occurs once in every thousand words, so it doesn't even appear in all the papers - all the disputed papers. Additionally, it could have been introduced during the editing process. So it's not enough to be a determinant. But it gave them somewhere to start. So they took that sample set of known Madison papers and known Hamilton papers and counted the occurrence of every word in all those sample sets. They then correlated them with each other to see which words were favored by one author versus the other. And they scoped out words that they called "dangerously contextual" which I really like. I'm going to use that, I don't know, later on today. 
Dangerously contextual; those were words that were correlated with a certain author's favorite subjects, not necessarily with their writing style. They came up with a total of 117 words - and this figure is from their book - 117 words that they could use to make a determination. They then used the Bayesian analysis -- Tommy Bayes; give it up for Tommy Bayes, everybody - [applause] to determine - they determined that their data made an independent assessment that the papers were most likely written by Madison in the sense of degree of belief, which I love that because it's very Bayesian terminology. And what I really love about this story is that none of this is digital. So this was all ink on paper. But what does digital do for us? It democratizes this type of work. I mean, you could see how much time and energy went into this. And part - and that time and energy was both in inventing this analysis but also in the labor needed to do the calculations. And digital makes being wrong much less expensive which is great because I'm wrong a lot, and I need it to be very cheap. So to bring this talk around to Harry Potter, which is fun to do in any kind of academic situation, this type of linguistic analysis is now very common. So a few years ago, Patrick Juola, who's a computer scientist, got a call from a reporter who had a tip that the semi-obscure mystery writer, Robert Galbraith, might actually be J.K. Rowling. And they wanted to know if he could do some computation to prove that that was true. So he ran some numbers and he found out yes, in fact, he could confirm the linguistic style of J.K. Rowling and Robert Galbraith were very similar. And so they broke the story together. And that brings us to today. What I'm so excited about with digital is that it makes this sort of Collections as Data analysis much more possible to more people. 
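The word-counting approach described in this story can be sketched in a few lines of Python. This is a minimal illustration and not Mosteller and Wallace's actual procedure: it assumes a simple Poisson model for each marker word and equal prior odds for the two authors. The three marker words, the sample sentences, and all function names are invented for the example; none of the text comes from the Federalist Papers.

```python
import math
import re
from collections import Counter

# Hypothetical marker words; the real study settled on 117 of them.
MARKERS = ("while", "whilst", "upon")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rates_per_thousand(text, markers=MARKERS, smoothing=0.5):
    """Smoothed rate of each marker word per 1,000 words of known text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    # Smoothing keeps a word that never appears from getting a zero rate.
    return {w: (counts[w] + smoothing) / len(tokens) * 1000 for w in markers}


def log_odds(disputed, rates_a, rates_b):
    """Log-likelihood ratio (author A over author B) for a disputed text.

    Each marker count is modeled as Poisson with the author's rate scaled
    to the length of the disputed text. The k! terms cancel in the ratio.
    """
    tokens = tokenize(disputed)
    counts = Counter(tokens)
    thousands = len(tokens) / 1000
    total = 0.0
    for word in rates_a:
        k = counts[word]
        lam_a = rates_a[word] * thousands
        lam_b = rates_b[word] * thousands
        total += k * math.log(lam_a / lam_b) - (lam_a - lam_b)
    return total


# Invented toy samples standing in for papers of known authorship.
hamilton_like = ("while the convention sat the states waited upon its decision "
                 "while the people argued upon every clause")
madison_like = ("whilst the convention sat the states waited on its decision "
                "whilst the people argued on every clause")
disputed = "whilst the states deliberate on the plan the people wait on the result"

rates_h = rates_per_thousand(hamilton_like)
rates_m = rates_per_thousand(madison_like)

# Positive score favors the Madison-like sample, negative the Hamilton-like one.
score = log_odds(disputed, rates_m, rates_h)
# With equal prior odds, the log odds convert to a "degree of belief".
posterior = 1 / (1 + math.exp(-score))
print(f"log odds: {score:.2f}, degree of belief: {posterior:.2f}")
```

On real papers the known samples would be far longer and the marker list would be chosen by comparing word rates across both authors while leaving out the "dangerously contextual" words.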
I think that it brings Collections as Data more available to what I consider a core constituency of the Library of Congress which is the informed and curious. We're inviting a lot of academic luminaries here today, and we're so excited that they could share with us their work. We're also learning a lot from our colleagues in academic libraries including the IMLS grant-funded project Always Already Computational: Library Collections as Data. But a number of our speakers here and a number of you out there are -- you know, are doing work not connected to a large, well-funded library; institutional library. And I hope that part of this event will be to invite you to consider the Library of Congress an intellectual home for exploration. My group, National Digital Initiatives, are really inspired by our new boss, Dr. Carla Hayden. She is leading us strongly in the direction of opening up the collections. And we consider it part of our job to help to do that for the digital collections. And with that in mind, I'd like to share with you a few of the things we have going on. The first is crowdsourcing. And so what you see here is a screenshot of an application built by Tong Wang who's a developer here in the repository development center at the library. While he was an Innovator in Residence at NDI he had a short - short time with us. And it's an application built on the Scribe platform created by New York Public Library that invites people to identify photos and cartoons in historic newspapers and to update the captions. You can see some really awful OCR here. That helps us with findability. But what's so exciting about it to me is people engaging with the collections. Like, looking through historic newspapers really getting a sense of what's available. And also that we can create these data sets. So we can create maybe a gallery of World War I-era cartoons that I think is really useful for scholarship and interesting to page through. 
This is not the only crowdsourcing project we're working on, but I'll save those details for another time. And it's our hope that this will blend into a portfolio of projects that the Library of Congress is working on that helps us engage with and learn from our users including Flickr, the American Archive of Public Broadcasting game, and efforts in the Law Library and the World Digital Library. Bud Barton, our CIO, announced recently that we'll be doing a congressional data challenge. He announced this at the Legislative Data and Transparency Conference. We'd like to invite people to play with congressional data and find ways to make our democracy, you know, even more robust using the information we're making available. We're working on the details, and we hope to have something to announce about that soon. I am thrilled to give you guys a sneak peek into what we're launching in a few months, and that's labs.loc.gov. I'm really excited about this. It's going to be a platform for play and discovery with the digital collections. It's my hope that this will reduce the friction to innovation, and we'll be able to use it also as a home for NDI. So it'll make our work a little bit more easy to follow and provide us a place to host the results of Innovators in Residence and challenge grants and information about our hackathons. This one's pictured here. I love this picture because when you watch movies about people writing code it's all, like, head down, dak-a-dak-a-dak-a-dak-a-dak-a. But when I write code it's mostly me staring at a screen trying to figure out why it's doing that [laughter], pictured here. So while I have you trapped here - because we've got some amazing speakers so I know you're not going to leave - I'd like to selfishly go through a few things. The first is jobs. There are some really cool jobs posted right now, and I have a feeling there'll be more really cool jobs posted in the near future. So please check this out. 
We need your good brains to come here. So if you find jobs that, like, your friend might be good for, please send it and check back very often. The second, the Kluge Fellowship in Digital Studies. So if you don't want to come here for a lifetime, which you should; but if you don't want to, you can come here for just a visit. These are paid fellowships to bring people into the library to study the digital revolution on society using library collections. And if you come here and do one of these, I will hang out with you. I will. But don't not - don't not apply because you're worried about that [laughter]. Like, I'll just call you; and if you don't want to do that, you don't have to. But it's really fun, and we'll have two Kluge scholars presenting in the lightning round. So if you're inspired by their work, please come and apply. They can award more than one a year, so share it with your friends. Don't hoard it. You can send it out to people. The third is Innovators in Residence. So we've been working hard on an Innovators in Residence program here to bring people into the library for short-term, high-impact technical projects. Sorry. And the idea behind this is to bring some fresh ideas, new blood, great brains to create new access points to the collection. Innovators in Residence projects have included the crowdsourcing project that I referenced earlier, a MARC parser to help people interact with the bibliographic records the Library of Congress recently released, and we'll have more information about this year's soon. So thanks for coming. I'm so thrilled you're here. As I mentioned when we launch labs.loc.gov it'll be much easier to follow what we're working on. In the meantime, please check the blog. You can sign up for e-mail alerts. We won't have time for questions today. It's wall-to-wall information into your brains. But we're encouraging you to please interact with the speakers directly on social media. You can also comment on the blog with any questions. 
Thank you so much. [applause] >> This has been a presentation of the Library of Congress. Visit us at loc.gov.