>> From the Library of Congress in Washington, D.C. >> Next we have Nick Adams, a sociologist and research fellow at the Berkeley Institute for Data Science. He's going to talk about how social scientists use the Congressional Record as a data source. Welcome, Nick. >> Nick Adams: Hello, hello. Library of Congress, archivists, librarians, journalists, data nerds, fellow social scientists and digital humanists. I'm very honored to be here today. This is really exciting for me, kind of a mecca for someone like me. I've done a lot of work with a lot of textual data, and to be able to speak here is truly an honor. For many of us, archives evoke images of old leather-bound books, [inaudible] sweet acrid odors of oxidizing pulp, and they arouse the excitement of discovering some long-lost history as though it were a buried treasure. But my particular excitement about archives is a bit different. Like all of my fellow scholars, I swoon at the thought of uncovering old worlds in newspapers or policy tracts, meeting minutes, contracts or even receipts. These are the sort of documents that most people throw away with the recycling, but we really cherish these data as unspoiled records of human behavior. They tell the stories of who we were economically, politically, socially and culturally. So I want to first say thank you to all the librarians and archivists who make this data available. It really allows us to do our work, and we completely depend on you. So in the way that I can't, because I only have two hands, can you all help me give a round of applause to the people that make this happen? ^M00:01:47 [ Applause ] ^M00:01:50 A little louder. Okay. Maybe I should end there, some people might be thinking. I said everything that needs to be said.
But like I said, there's a little more -- my love for archives is a bit different than typical, because I think archives can be far more powerful than most of my fellow scholars have yet imagined, and I want to show all of you, to the extent I can, how we can bring about a future where archives are used much more often and much more intensively to give us much deeper understanding of human behavior. So right now social scientists and digital humanists are at the beginning of what I see as a decade-long process of updating the way we work with documentary evidence. We're learning to read right alongside computers. We're beginning to understand and trust how they can read differently and much faster than we can, and find patterns in minutes that we really couldn't find without months of close reading. So there's a bunch of new computational text analysis approaches that people can use. We can [inaudible] sentences by grammar. We can find named entities like people and locations. We can model topics of what people are talking about across documents. We can model networks of individuals and how they relate. We can even do the kind of thing that classical qualitative ethnographic researchers do, like grounded theory, with the assistance of computers. There's so much we can do right now, but these powerful tools require very different ways of organizing collections, and that's what I really want to talk about today. So most of us here are familiar with traditional practices of archiving: the preservation, organization, maintenance and the careful control and protection of documents. And from the researcher's standpoint, when we want to pull from an archive, we consult a librarian who helps us pull the records, and then we look at the documents usually one document at a time, often wearing gloves or viewing through some pane of glass. And these archives are really helpful.
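To make the "patterns in minutes" idea concrete, here is a minimal, invented sketch of machine reading with only Python's standard library: counting word frequencies across documents and checking which terms appear where. The document names, snippets, and term list are made up for illustration; real projects would use full text-analysis libraries.

```python
from collections import Counter
import re

# Two invented snippets standing in for archival documents.
docs = {
    "hearing_1995": "The committee will consider the hypothesis that funding drives innovation.",
    "hearing_2005": "Members debated correlation versus causation in the climate data.",
}

# Tokenize each document and count word frequencies.
counts = {name: Counter(re.findall(r"[a-z]+", text.lower()))
          for name, text in docs.items()}

# Which science-process terms appear in which document?
science_terms = ["hypothesis", "correlation", "causation"]
for name, c in counts.items():
    found = [t for t in science_terms if c[t] > 0]
    print(name, found)
```

A computer can run this same comparison over thousands of documents in seconds, which is the scale advantage the talk describes.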
They allow us to ask important specific questions of specific documents, questions like: what happened when? Did he or she really say or do that? Who paid whom? And who voted which way on which bill? But all of these questions belong to a kind of forensic style of research, where we're looking for particular details at a particular date occurring in a particular document. But now we've entered this age of digitization. So what's new? What's different? Obviously, no archivist or librarian would want me to thumb through their most aged and cherished records at 40 pages per second, applying an array of highlighter colors directly to the collections. I can hear some of you wincing at the thought. But that sort of thing can really easily be done with digital records. So how is that going to change research, and how are we going to adapt to that? Exciting. So did we go too fast? We went past. Oh, there it goes. Okay. So there are some digital archiving projects that are really plowing forward. Chronicling America, as a bunch of people have mentioned, is allowing researchers to skim through thousands of newspapers using digital approaches. And in this talk series a while back, Matthew Weber's Archives Unleashed program was discussed, and it's generating valuable findings hacking away at what he has called well-curated archives, but these are not really the norm. Actually, what's normal is quite a bit different. As Elizabeth Lorang mentioned during her talk in this series, one of the most prominent challenges cited by respondents in their use of digital collections was the inability to search effectively through the collection materials, and this is really common.
To give you an example, the Government Publishing Office -- the GPO website has a lot of really impressive holdings describing the activities of our government: the Congressional Record, hearings, all kinds of votes. And these data are really essential to democracy and to the GPO's mission of keeping America informed. The data are really valuable. They're important. They're totally available to the public, which is great, and they're even digitized in an ideal kind of machine-readable format. Yet they're extremely difficult to research using the state-of-the-art text analysis techniques that I just discussed. Let me give you a sense of what that looks like. Here's the GPO website. So if I wanted to understand how my congressperson is dealing with science, I could go to a particular Congress. I could go to the House hearings. I could look for the Committee on Science and Technology, and then I could read the text of every last committee hearing and try to follow along with what my congressperson was doing. So we select a file, we read it in the viewer, in this case a web browser, and we view it without touching it, without labeling it, without scribbling on it, and we have to take our notes in a separate document. So in a lot of ways, even though this text is fully digital, we're still wearing the gloves. We're still looking at a document from behind a pane of glass. Now to be clear, I don't believe this is the state of affairs because archivists are somehow overprotective or uninterested or just ornery. Maybe some of you are. But this is a very well-organized set of links that makes perfect sense for a human looking for information about a particular hearing. And we organize our digital records this way, the same way we would organize our physical records, because these systems are so sensible for traditional research and so expedient for archivists' fundamental tasks of preservation and publication.
So if we don't want to click and click and click and download each file, what do social scientists want? I know this is the question of Washington. Every funding conversation, every policy conversation: what do the social scientists want? I can dream. I can dream. Well, we want data that are well structured. That's what we want. And I'll have a slide in a moment that kind of lays that out, but let me offer you a sense just through some examples. Perhaps we want to see all the materials in the Congressional Record -- or it could be for any archive -- we want to see all the materials that are authored by men or women or people from this or that geographic region. All instances where [inaudible] particular phrase like "interstate commerce," and the date of the utterance, who said it, the district they represent if we're looking at the Congressional Record. Even adjectives that are marshalled to describe particular nouns like veteran, teacher, doctor or child. How are these people described? Perhaps I want to assess the comfort of Congress members with the scientific process. So I want to do some sort of search, and this is just a little bit of pseudocode here. I want to see if I can find all instances where people say the word hypothesis or falsify, correlation, causation, statistical significance, and if I find that, then I want to see the speech displayed along with the identification of the speaker, the party they're associated with, etc. That would be really nice, and I'd be able to compare across the entire Congressional Record how people are talking about science. But that's not really what we have right now. That's just not possible on the Government Publishing Office site, and those sorts of questions that dig into both the metadata and the content of the text itself are not really possible right now. So we can't search all the text of all the science hearings.
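The pseudocode query above could look something like the following sketch, assuming speeches have already been chunked and tagged with speaker metadata as described later in the talk. The speech records, names, and parties here are invented for the example.

```python
import re

# Invented speech records; in a real archive these fields would come
# from the structured, linked database the talk describes.
speeches = [
    {"speaker": "Rep. Smith", "party": "D", "date": "2011-03-04",
     "text": "Our hypothesis is that the program reduces costs."},
    {"speaker": "Rep. Jones", "party": "R", "date": "2011-03-04",
     "text": "I thank the chairman for holding this hearing."},
]

# Terms signalling engagement with the scientific process.
pattern = re.compile(
    r"\b(hypothesis|falsif\w+|correlation|causation|statistical significance)\b",
    re.IGNORECASE)

# Every matching speech, displayed with the speaker's identification and party.
hits = [(s["speaker"], s["party"], pattern.search(s["text"]).group(0))
        for s in speeches if pattern.search(s["text"])]
print(hits)
```

The point is that once speaker and party are attached to each speech, a query like this runs over the whole record at once instead of one document at a time.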
We can't easily identify particular speeches and speakers, even though as a reader, if I read through that document, I know when Member Jones is speaking and when Member Smith is speaking. And I can't immediately know the party identification of the person, even though this is really common knowledge. This stuff is not actually connected to the data right there when I, as a researcher, want to query it. So what do social scientists do with archives like the GPO's? Well, the answer is not much, because it's just too hard to look through all that data one document at a time, or to learn all the skills to figure out how to bring it together so that you can search it the way you want to search it. But I'm here today -- that's [inaudible] the talk either. We don't have to sigh and walk away with our heads held down. I'm here to talk about an effort that's trying to help change that: the Capitol Query Project. So a few years ago, I founded the computational text analysis working group at UC Berkeley. And a key [inaudible] of that group's mission was to not just teach researchers how to fish, but to teach a research team how to whale. No marine mammals have been harmed, just so you know. But to teach a research team how to organize -- how to do the massive amount of work necessary to acquire, clean, process, analyze, interpret and report on massive textual archives. Now we're partnering with the Goodly Labs and the Social Science Research Council so that we can not only convert the Government Publishing Office data into a queryable record of what Congress is up to, but also so that we can create tutorials showing anyone how to do this for their own digital collections. I just want to recognize some of my teammates here quickly before I say a little bit more about the project. And any of you could join this slide. We'll talk about that later. So here's the project in pictures.
First thing we're going to do is gather all the data, all the documents across all those links, into one place. And then next, we're finding and labeling structures in the text. So maybe we're looking for particular speech acts or events or locations or ad hoc groups, and we can add that structure to the text through annotations, through XML. And then we take the existing structure that was already there, along with the newly created structure, and link it with external data sources that enable powerful queries answering questions that we couldn't even ask before. So let me say a little bit about well-structured, research-ready data. The hard product of this project is to create a database, and for the record, we should know what well-structured, research-ready data looks like. First of all, it should be digital text. It should be machine readable, so that a computer can look for different tokens and compare across various different documents. It should be queryable with that token search, but also across a set of documents that the researcher has defined. So if they want everything from Arizona, or they want everything from a decade, they should be able to look at that immediately. This is really important, and a lot of people, when they move into a digital archive and then try to make it public, make this mistake: it's really important to retain the original structure and formatting of the documents. So I'm going to ding ProQuest here. I don't think anyone's going to be too upset. ProQuest takes the Congressional Record, and they strip it of all the white space, all the newline characters and the paragraph breaks, and the subheadings that help us read through the document and understand, oh, this is just the table of contents, or this is a portion where they're going to do some front matter, but they're not really talking about the hearing yet.
It's important to retain the original structure that the original authors used to keep track of which portions of the text are doing what sort of conceptual work. And then, as much as possible, we want to add in some supplemental and searchable annotations on the text -- things like I said, looking for speech acts or speakers or locations, etc. And finally, we want to be able to link the data to other data that describe the same objects. So that hearing document is going to talk about Congress member Smith, but there's a lot I know about Congress member Smith in another database that tells me about her age and her district, the population and demographics of her district, which might actually influence the way she's behaving. So in all of this, we're not just creating a queryable database; there's a huge learning opportunity here for anyone that wants to take their digital archive and make it more accessible to a larger audience of researchers and the public. So we're going to be creating a kind of step-by-step how-to guide with tutorial notebooks. We'll be using Jupyter notebooks, which allow us to write in plain English what we're doing right next to a cell that runs executable code. So for all of you who are a little afraid of programming, the idea is to take you step by step through the process of doing this. So phase one: gathering the data together. Phase two: finding and adding structure to it. And then phase three: linking it to other useful data. To dig into these a little bit -- phase one, bringing all your data together, is just trying to get around pointing, clicking and downloading a thousand times. And so we can actually train a computer to do this, or we can train people to tell a computer how to do it.
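As a small illustration of what "adding structure through XML annotations" can mean in practice, here is a sketch using Python's standard library. The tag names, speaker turns, and hearing ID are invented; a real project would design its schema around the documents at hand.

```python
import xml.etree.ElementTree as ET

# Invented fragment of hearing text, already split into speaker turns.
turns = [
    ("Mr. SMITH", "The committee will come to order."),
    ("Ms. JONES", "Thank you, Mr. Chairman."),
]

# Wrap each turn in a <speech> element carrying the speaker as metadata,
# so the speaker is machine-queryable instead of implicit in the prose.
hearing = ET.Element("hearing", id="example-hearing")
for speaker, text in turns:
    speech = ET.SubElement(hearing, "speech", speaker=speaker)
    speech.text = text

xml_out = ET.tostring(hearing, encoding="unicode")
print(xml_out)
```

Once speeches are marked up this way, finding "everything Member Smith said" becomes a simple element query rather than a close reading.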
So we'll be training people in these notebooks to use regular expressions, to use [inaudible], so that they can look on a website, and with a web browser automation tool like [inaudible], instead of doing all that pointing and clicking yourself, you can have a computer do it for you. Phase two is really where we're trying to move the current state of the art from a kind of craft to a science. Finding structure in texts takes a lot of working back and forth with the text, knowing what is theoretically important that you're trying to find there, and then using the computer to help you find that structure, to chunk it out, to label it, and to add structure where there wasn't any before. So we take people through the very basics of regular expressions and how to use XML, and we show them some of the text analysis techniques that I talked about at the top of the talk, and we're even using crowdsourced annotation software called Text [inaudible], which is also a technology created by the Goodly Labs. All of these things allow us to get the humans and the computers working together to find that structure in the text. And then, in order to do this really efficiently with a programming script, you need to know some basic programming architecture that's not super hard to learn, but we'll teach it. Finally, phase three is about linking all of that data. So it's identifying relevant data sources, effectively structuring your database so that you're anticipating the queries of the research community, and you're also reducing redundancies in the data storage. Paul, who spoke just before me, really set this up. If we can use SQL and relational databases, we can compute much more quickly over this sort of multilayered structured text. So to close, really, I just want to ask you all to get research ready.
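Phases two and three can be sketched together in a few lines: a regular expression chunks raw transcript text into speeches, and a relational join links each speech to member metadata. Everything here is invented for illustration -- the transcript fragment, the member table, and the all-caps speaker convention it assumes -- and uses only Python's standard library with an in-memory SQLite database.

```python
import re
import sqlite3

# Invented raw transcript; all-caps surnames introduce each speaker turn.
raw = "Mr. SMITH. I move to proceed. Ms. JONES. I object."

# Phase two: chunk the text into (speaker, speech) pairs with a regex.
speeches = [(name, text.strip()) for name, text in re.findall(
    r"(M[rs]\. [A-Z]+)\. (.*?)(?=M[rs]\. [A-Z]+\.|$)", raw)]

# Phase three: link speeches to member metadata with a relational join.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE members (name TEXT, party TEXT, state TEXT)")
db.execute("CREATE TABLE speeches (name TEXT, text TEXT)")
db.executemany("INSERT INTO members VALUES (?, ?, ?)",
               [("Mr. SMITH", "D", "CA"), ("Ms. JONES", "R", "TX")])
db.executemany("INSERT INTO speeches VALUES (?, ?)", speeches)

# Every speech now comes back with party and state attached.
rows = db.execute("""SELECT s.name, m.party, m.state, s.text
                     FROM speeches s JOIN members m ON s.name = m.name""").fetchall()
print(rows)
```

Storing member attributes once in their own table, rather than repeating them on every speech, is the redundancy reduction the talk mentions, and it is what lets SQL answer metadata-plus-content questions quickly.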
We'll be pushing some of this out in November, and we're hoping that people will give us feedback and let us know, like, your phase one tutorials were great; your phase two tutorials were quite confusing, and you need to go back to the drawing board. We're really looking for engagement from the Library of Congress and any libraries or archives or researchers who are interested in getting in on this. And I don't know if they cut me off, but it seems they cut me off. Oh, here we go. Yeah. So if you'd like to join in this effort or follow along, here are some links. What I was going to show you, if I could, is that the entire project is completely open to the public. You can go to the Open Science Framework that's hosted by the Center for Open Science, and you can watch and follow along with our project. If you want to jump in, you can actually go to our [inaudible] page and just start doing pull requests. That would be exciting, although feel free to contact us first, and maybe we can lead you to the work that's most pressing at the moment. But I really appreciate everything that everyone in this room is doing, and I'm really looking forward to getting to this kind of blue-sky space in the future. So help us out as you can. Thanks a lot. ^M00:18:48 [ Applause ] ^M00:18:50 >> This has been a presentation of the Library of Congress. Visit us at loc.gov. ^E00:18:58