>> From the Library of Congress in Washington, DC. ^M00:00:02 ^M00:00:24 >> Beacher Wiggins: Good morning, everyone. I'm Beacher Wiggins, Director for Acquisitions and Bibliographic Access here at the Library, and it's again a pleasure to start off. Actually, this is our first session for the year of LC's Digital Future and You!, which is coordinated by Angela Kinney and Judith Cannan, two of the chiefs in ABA, who have been doing this for a little while now. We try to find interesting topics and presenters to come and help keep you informed about some of the latest developments related to automation applications and automated techniques. This morning, I am particularly pleased because we are focusing on BIBFRAME, the model that the Library of Congress has been developing under the auspices of the Network Development and MARC Standards Office. Sally McCallum and her capable staff have been developing what we expect to be the replacement for the MARC format. I have been working with our presenters this morning, Tiziana Possemato and Michele Casalini, for a while now. We have been involved with Casalini as one of our leading acquisitions vendors, and in the past few years with their exploration of and work with BIBFRAME as a model, helping us to prove that BIBFRAME can be expanded and is scalable. Their involvement with BIBFRAME has been related to their development of the shared virtual discovery environment in linked data, what they are dubbing SHARE-VDE, and they will be talking about that this morning. SHARE-VDE is a community-driven research and development project whose chief goal is bringing BIBFRAME into a practice that libraries, librarians, and library patrons can put to good use. So, they'll give us some sense of what they've been doing with this, and the Library of Congress has been one of the institutions working with them on the SHARE-VDE approach. Tiziana is the Chief Information Officer of Casalini, and she's also Director of @CULT. She holds a degree in philosophy from La Sapienza in Rome and diplomas in Archival Science and Library Science from the Vatican Library Schools, and she also has a master's degree in Library Science from the University of Florence. She has led numerous projects during her career, both for library automation, for analysis, mapping, and conversion of data, and for the transformation of data into linked open data publication formats for numerous institutions. Michele Casalini is the Managing Director of Casalini, which produces authority data and bibliographic data in addition to supplying books and journals for its customer base. Following studies in modern languages and literature at the University of Florence and a period working at the publishing company La Nuova Italia, Michele specialized in the field of Information and Technology Management. Michele is also a board member of the Council on Library and Information Resources. So, we have very well-informed presenters this morning, and I look forward to hearing what they have to say, even though I am familiar with some of it, and I especially want you as LC staff to hear what's been going on. So, Tiziana and Michele, I'll let you decide who goes first, and maybe you can play off each other. ^M00:04:56 ^M00:05:03 >> Michele Casalini: Thank you very much, Beacher, and thank you all for being here.
For Tiziana and me, it's a great privilege to be here today and to have the opportunity to talk to you about the SHARE Virtual Discovery Environment project. As you know, Casalini Libri is one of the main suppliers of European publications, cataloging, and technical services to academic libraries across the world. We have been working with the Library of Congress for decades, we are a member of the PCC, the Program for Cooperative Cataloging, participating in NACO and SACO, and our original cataloging division received valuable input from the Library during the implementation of RDA, which began on day one, March 31st of 2015. After the first year of RDA, we began to collect input from the library community on the application and tools of the emerging BIBFRAME data model. The input received from the wider library community and, in particular, the Library of Congress made it clear that in the future we would be expected to supply data not only in MARC format but also according to the new framework. In response, we embarked on a strategy for the progressive implementation of BIBFRAME in close collaboration with @CULT, our long-standing technological partner with extensive experience in technological solutions for cultural heritage institutions. @CULT has also been a technological partner of [inaudible], a European Community-funded project for the automatic publication of library data as linked data. Following a series of initial studies and a feasibility analysis, in January 2016 we presented the first results and a plan for action at the Library of Congress BIBFRAME Update Forum at ALA Midwinter in Boston. Considering that this evolution affects many aspects of the traditional library catalog, for example the format, organization, and retrieval of data, the objectives, standards, and interpretations will be in continuous transformation for quite some time, and each organization will have its own specific workflow. We were very pleased in the summer of 2016 to receive the approval and support of 12 North American institutions for the SHARE-VDE project. We are particularly grateful to our contacts and colleagues at Stanford University Libraries for their valuable trust and encouragement. We feel a great responsibility and ever-growing enthusiasm for this research and development initiative, which aims to make real steps toward the effective use of BIBFRAME. Seeing not only the practical results but also the incredible potential of linked data for providing discoverability has been extremely satisfying. Indeed, using linked data in library catalogs can provide a number of advantages. It gives visibility to information and resources that would otherwise be hidden. It helps to connect and improve interoperability among the library, archive, and museum domains. It is a basis for new cooperative initiatives in the sector, and it improves accessibility for library patrons and the wider public, with the ultimate benefit of safeguarding cultural heritage. Of course, some problems have emerged during our work on the project, and we will see some of these during the presentation. The work has, since the beginning, been based on the requirements and expectations of the library community, and this will be the driving force for the definition of the activities and priorities to be put into production in Phase 3, modularizing and parametrizing individual components as necessary. I hope there will be enough time later for questions and for a discussion.
With that in mind, let's now look in detail at the results of Phases 1 and 2. I will cover only the first part of the presentation; the majority will be handled by Tiziana Possemato, who is the Director of the project. This is the index of the topics that we will cover this morning. ^M00:10:10 I'm sorry. You already know who we are; we are based in Florence and Rome, and I mentioned a little of the background of our companies before. These were the starting points of the design of the project: first of all, the enrichment of MARC records with URIs, in order to simplify the BIBFRAME conversion and make it possible and more effective, of course; the use of a framework to automate the conversion from MARC into RDF, using the BIBFRAME vocabulary; and the creation of a BIBFRAME layered platform prototype, starting from bibliographic and authority records, to test and to demonstrate the advantages that the BIBFRAME data model brings. This last point was, for us, a surprising element at the beginning, because we understood that in order to discuss and share future plans with the community, it is really necessary to touch the potential advantages that the new data model can bring. The main goals of Phases 1 and 2 of this research and development project have been: the reconciliation and clusterization of variant forms of the same entity, a key element and starting point for several further activities; the enrichment of MARC records with URIs, with the development of detection procedures for entity identification, including relator terms; the conversion, supply, and management of authority and bibliographic data in BIBFRAME, taking into account the complexity of the long and heterogeneous transition period on both the library and the data producer sides; and the publication, as I mentioned, of a BIBFRAME three-layered platform prototype. Phase 1, in particular, saw the contribution of the participating libraries with their datasets, with records from two imprint years, 1985 and 2015. The reason for having these two different imprint years was to have records based on different rules: RDA records for 2015 and, of course, pre-RDA records for 1985. A total of approximately 2 million bibliographic records and 3 million authority records were processed, converted at that time into BIBFRAME 1, and then released on the SHARE-VDE portal. This phase ended one year ago. In Phase 2 of the project, each of the participating libraries sent us their entire catalog, so we handled approximately 100 million bibliographic and authority records. A relationship database that registers the relationships between entities (persons, works, instances, subjects, publishers) was established in order to assure a more precise identification rate for each entity and to reach a higher quality of results without human intervention. And of course, we worked on the refinement of data, for example for co-authors or editors, where there is a variety of ways in which they are identified in library records. Phase 2 also handled the export of data in RDF format, in BIBFRAME 2, filtering by each library's preferred URIs, and we included URI sources from a number of projects.
We also made an analysis for the creation of relationships between subject terms and strings in different languages, and we worked on the Provenance Declaration, update management, and built-in instances, which are also key elements and prerequisites for the production phase. So, we have now ended Phase 2, and we are discussing with the library community the priorities and use cases for a modular implementation and introduction of the various components in Phase 3. That will, of course, vary from institution to institution, because workflows are different, habits are different, and so there is absolutely not one working schema that works for all. This is a list of the participating libraries in Phase 1 and/or Phase 2 of the project. So, I will now ask Tiziana to move on with a very brief overview of the theoretical context. >> Tiziana Possemato: Thank you. The theoretical context will be brief, of course, because you all know these environments where we are. We started from MARC 21, and we analyzed the factors of a conversion using RDA records, trying, of course, to keep in mind also the International Cataloguing Principles and the FRBR model, in order to arrive at a conversion into Semantic Web linked data, using specifically, but not exclusively, BIBFRAME. So, our focus is on BIBFRAME, but, of course, each library can choose to add additional ontologies to complement BIBFRAME, which is used as the core, the heart, of the project. Why do we focus on RDA, linked open data, and BIBFRAME? Because, if you think about it, these guidelines, formats, and models are all related to exactly the same model, that is, an entity-relationship model; they all have in mind, in any case, the entities and the relationships between them. So, these different cultural environments have something in common. And so I can try to share with you the process overview of SHARE-VDE. This picture summarizes some of the important steps of the project. Starting from the left, we indicated the different types and formats of data records: not only MARC 21 but other members of the MARC family, or other formats that are not specifically MARC. The records are handled using two main processes: the similarity score process, which is a set of algorithms to identify similarity in headings, for example, and to find similarity in the records and so on; and Authify. Authify is important because it's a tool, a process, that makes use of external sources, such as the Library of Congress Subject Headings, the NAF, the Library of Congress authority file, ISNI, and so on, to enrich information and to simplify the clusterization process. So, both processes, the similarity score and Authify, produce the entity detection, to identify specifically each entity that is present in a description; the enrichment of information with URIs, but also with additional attributes in different formats; and the reconciliation and clusterization process, which is really, we believe, the key concept of our project. So, starting from this output, we produce a database of relationships, because, as you know, MARC records are full of relationships, not clearly expressed, but each record contains many relationships between entities.
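A minimal sketch, in Python, of what lifting those implicit MARC relationships into explicit rows for a relationship database could look like. The reduced record structure, the tags chosen (100 for the creator, 260/264 for the publisher, 650 for a subject), and the extract_relationships() helper are illustrative assumptions, not SHARE-VDE code:

    from typing import Dict, List, Tuple

    # A MARC record reduced to a plain dict for illustration:
    # tag -> list of field occurrences, each a dict of subfield code -> value.
    MarcRecord = Dict[str, List[Dict[str, str]]]

    sample: MarcRecord = {
        "100": [{"a": "Mann, Thomas,", "d": "1875-1955."}],
        "245": [{"a": "Der Zauberberg :", "b": "Roman /"}],
        "264": [{"b": "S. Fischer,"}],
        "650": [{"a": "Sanatoriums", "v": "Fiction."}],
    }

    def extract_relationships(work_id: str, record: MarcRecord) -> List[Tuple[str, str, str]]:
        """Turn relationships that are only implicit in the record into explicit rows."""
        rows: List[Tuple[str, str, str]] = []
        for occ in record.get("100", []):                           # creator of the work
            rows.append((occ.get("a", ""), "isCreatorOf", work_id))
        for occ in record.get("264", []) + record.get("260", []):   # publisher
            if "b" in occ:
                rows.append((occ["b"], "isPublisherOf", work_id))
        for occ in record.get("650", []):                           # topical subject
            rows.append((work_id, "hasSubject", occ.get("a", "")))
        return rows

    # The rows below are the kind of material a relationship database would register.
    for row in extract_relationships("work:der-zauberberg", sample):
        print(row)

Rows of this shape can then be stored in a relational table and reused later when building clusters and the BIBFRAME output.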
So, we record these, and we are creating a database of relationships, a knowledge base of clusters, because the clusterization and reconciliation processes produce a lot of relationships, for example, between variant forms that identify a name, a person; and then the data goes through Lodify, which is the commercial tool we produce. ^M00:20:25 We release the records, the data, in RDF, to be published autonomously by each library or to be published in the SHARE-VDE portal. Now, some focus on specific processes. We usually start from authority records. In fact, as Michele said, we receive authority records not from all libraries but from the libraries that manage this type of record, and we process the authority records using the similarity score and Authify tools. We use the clusters knowledge base as a bridge to relate, save, and store the information that we need, for example, all the variant forms of the heading for a person, and we produce the clusters knowledge base in RDF, as linked data. The second step is to manage the bibliographic records. The bibliographic records go, again, through the similarity score and the Authify tools and enrich the Postgres database, and finally we produce an enriched MARC record, that is, MARC enriched with URIs coming from different sources. In process 2, we start from the clusters knowledge base to convert the records using Lodify and to store them in a triplestore. We mention here Blazegraph, which is an open source triplestore, even if we are considering other solutions more powerful than Blazegraph; in any case, the important thing is the concept of the triplestore. And then we publish the clusters knowledge base. We produce and publish the knowledge base and, second, the bibliographic records that pass through the similarity process and Lodify into the triplestore, and we produce a dataset for each library, related to the clusters knowledge base. I will try to explain later the sense of this relation, and we also produce the MARC enriched records in a [inaudible] format or other formats. So, just to give you a very brief idea of our technology: as you can see, the clusters knowledge base is the heart of the project, because we think the power of the project is to have the entities reconciled and precisely identified. We are now working on a new function to create new clusters starting from the already existing ones, but we will explain this at the end of the presentation. And just a brief look at our technology: we try to put together different types of technology, because there are things a triplestore can do, but not as well as a relational database, or not as well as an inverted index such as Solr. So, we try to put together all these types of technologies to use the best of each of them. Now I will try to explain to you the entity identification, reconciliation, and data enrichment processes. Of course, I will skip this slide, which some of you have probably seen before, but our first question is how to identify an entity, starting from a standard record or a standard description, and the identification problem is something that concerns all people in the world and all types of information. In this picture, you can see different versions, different pictures, of the same person, Albert Camus, and each of these representations is a valid representation of him. No one of them is more real than another.
All the representations are valid, and they represent the same person, Albert Camus. So, in our process, we try to identify Albert Camus as a person, as a real-world object, as a real person, and to give him a unique identification, but at the same time to use the unique identifications that each project in the world has assigned to the same person, to bring together information about exactly and uniquely the person that is Albert Camus. But, as I told you, the importance of identification is not only part of the cataloging tradition; it is something that concerns all people and the whole world. For example, here are two pictures taken at the City Lights Bookstore in San Francisco, where a door is expressly marked with a note, 'I am a Door,' or 'Corporations are not People.' So, the [inaudible] identified the entities perfectly; it is something that concerns the whole world. So, with the online presence of different catalogs and authority files that are available in various formats, following various guidelines, and in open mode, the concept of authority control, or of the union catalog, has evolved into the grouping of an entity's identifying attributes from different sources. This process is best known as reconciliation and consists of creating a cluster, a group of data, that all refer to the same entity. And so, we start exactly from the record, which is our tradition in cataloging. We start from a record and arrive at identifying each entity that is present in the record. For example, just to give you a brief idea about this, we can derive [inaudible] from a record; we can derive, when we can, of course, a work from a MARC record, as well as the attributes that identify an instance, and so on. So, we try to identify each entity that is defined and recorded in a record. But, as you know, for each of these entities we can have different expressions, different descriptions, in different languages, following different rules. So, we need to put together these variant forms and consider them as different attributes of the same entity. And for this, we thought about a shared project, to put together information coming from different sources, to precisely identify entities, and to put them in relationship; to bring together and make available data from different sources in what we can consider a democratic way, so that there is not one form that is the best for the whole world, but each cataloger, each person, has the form that they think is best for them. And so, we try to identify the entity in question, and this is the basis of our SHARE-VDE project. So, we start from traditional records, created before and after RDA, and we notice, of course, some important differences in converting data before and after RDA, but we need to convert all the data that we have. So, finally, we try to be able to identify, for example, a work with its relationships: a work related to the subjects that are managed, a work with its creator, with its publications, and so on. Or we can define, for example, a subject, Thomas Mann in this case, as the subject of something, of all the books that speak about Thomas Mann. Or, we hope, we will be able to define and find a publisher and know its relationships, for example [inaudible] in relation with places, with other forms that are related to the same publisher, perhaps some of the people who published with this publisher, the subjects, and so on.
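A minimal sketch, in Python, of the reconciliation idea just described: variant forms are normalized and grouped into clusters, and each cluster receives a single identifying URI. The normalization rules, the example forms, and the example.org cluster URIs are illustrative assumptions, not the actual SHARE-VDE algorithms or identifiers:

    import unicodedata
    from collections import defaultdict

    def normalize(heading: str) -> str:
        """Reduce a heading to a comparison key: no diacritics, no punctuation, upper case."""
        decomposed = unicodedata.normalize("NFKD", heading)
        no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
        spaced = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
        return " ".join(spaced.upper().split())

    def cluster(variant_forms):
        """Group variant forms that share the same normalized key under one URI."""
        groups = defaultdict(list)
        for form in variant_forms:
            groups[normalize(form)].append(form)
        return {
            "https://example.org/cluster/{}".format(i): forms   # hypothetical cluster URIs
            for i, forms in enumerate(groups.values(), start=1)
        }

    forms = [
        "Vivaldi, Antonio, 1678-1741",
        "Vivaldi, Antonio, 1678-1741.",
        "VIVALDI ANTONIO 1678 1741",
        "Camus, Albert, 1913-1960",
        "Camus, Albert",   # stays separate here; a fuller matcher would also compare subfields
    ]

    for uri, members in cluster(forms).items():
        print(uri, members)

A real reconciliation also has to weigh partial matches (missing dates, initials only) instead of requiring an identical key, which is what the similarity score and the Authify services described in the following examples take care of.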
^M00:30:11 Here is an example of clusterization and, at the same time, an example of collaboration and sharing. This is the result for Antonio Vivaldi: all the forms coming from the local authority files, all the forms coming from the VIAF project, all the forms coming from cross references taken from authority records, and probably also the forms that are used in bibliographic records and that do not match an authority, because we can imagine that all these forms are used by someone to search for an entity and to define an entity. And this is an example of Work/Instance reconciliation. This is a work of Vivaldi, the Cimento dell'armonia e dell'inventione, and all its possible instances. So, a different title identifies an instance related to the same work, to the same intellectual production of a person; or, for example, the same instance is present in different libraries and is put together in one group, so that the end user can simply find the information and choose what he needs. So, how can reconciliation and enrichment work? We usually work using two different processes: one is an automated process, and the second is a manual process, because, as you know, Casalini produces records for different libraries and so has the opportunity to work on the records during production. Of course, we take into consideration that automated processes produce a high level of reconciliation and clusterization but a low level of results validation, because they are automated; on the contrary, manual processes produce a low level of reconciliation and clustering and a high level of results validation. So, we think that putting them together will produce the best result. And this is again the same picture as before, showing the automated reconciliation and enrichment process, using in this case Antonio Vivaldi as an entity and enriching this entity with all the forms coming from different projects and so on. And the heart of this process is Authify. I'll try to explain to you briefly what Authify is, as a general description. Authify is a RESTful module that offers several search and detection services. The project started, at the very beginning, to overcome some limitations in using the public VIAF Web API. We started the project using the API available from VIAF, but, of course, VIAF, being a public project, doesn't allow massive invocation of its API, while we need to manage more than 100 million records for this type of project. So, we found that this type of use of the Web API was not feasible, and we tried to find a different way: we took the data from VIAF and implemented a local, powerful search engine to use this source. And of course, this was just the first model; over time we added a lot of different sources to the same tool. So, here is a first list of the different services. The cluster search services, for example: this class of Authify services provides a full-text search service over the names and works clusters. The search Web API uses, behind the scenes, an "invisible queries" approach in order to try to find a match, as precise as possible, within the managed clusters. The invisible queries approach makes everything transparent to the caller.
So, on top of a single search request, the system executes a chain of different search strategies with different priorities, and the first strategy that produces a match will populate the response that is returned. For debugging purposes, the response also includes the matching strategy that produced the result, so the technician can check it and improve the search method. The system has been built with extensibility in mind, so as to manage different sources and different formats and to add new services, and the composition of the chain is fully configurable by the project manager. For instance, here is a brief description of the current configuration when searching the name clusters. First, there is subfield matching, so we can define which subfields of each tag are used in searching. Then there is the input heading exact match: first we try to find something that exactly matches the heading; otherwise we pass to a full-text search. If an exact match is not possible, then a regular full-text search is executed, with things like a proximity search for names and for other entities. As a last chance, the system executes a search by "initials," in order to find a valid match in those cases where the input string, or the indexed heading, contains the name in its short form, that is, when the initials are usually used to identify the person. So, in this case, we try to find and identify the person also using the initials of the name. And this is the type of response where the system finds that Meyer, Bertrand is exactly the same entity as Bertrand Meyer, or Meyer, Bertrand with or without dates. Another important service in Authify is the "relator term detection." Starting from a MARC record, the system analyzes the tags that contain a name and, for each of them, tries to figure out, using the statement of responsibility of the input record, but also using, for example, note fields such as the content notes, what the corresponding role is within the work represented by the given record. So, for instance, on the following input you can see a statement of responsibility with two authors, two creators, and "Prefazione all'edizione italiana di G. Biorci" [preface to the Italian edition by G. Biorci], and the three headings reported in this record. In this case, the relator term detection was able to identify two authors for the same work, and someone as having an additional role, not as a co-author. In this case we used "other," which is an unclassified role, when we are not able to identify the role precisely. What kind of role? We cannot say, but declaring that it is not the co-author role is a first step of identification. And this is another example of detection, entity detection, analyzing the statement "From Chaucer to the present, John Stephens and Ruth Waterhouse," with the fields lacking a subfield e or a subfield 4; in this case, the Authify detection response identifies each of them as an author. And this is another example, "edited and introduced by," and this is the response: in this case the tool correctly identified that Lena Jayyusi is an editor. ^M00:40:18 Of course, we also find critical cases. See, in this case, "Edizione anastatica, introduzione e appendice a cura di; con la collaborazione di" [anastatic edition, introduction and appendix edited by; with the collaboration of]: very many roles, many roles for the same person. So, in this case, we prefer for now to define the role as "other."
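A minimal sketch, in Python, of the chained "invisible queries" idea just described: strategies are tried in priority order (exact match, full text, token proximity, initials), the first hit populates the response, and the name of the winning strategy is kept for debugging. The in-memory index, the strategy names, and the search() function are illustrative assumptions, not the Authify API:

    from typing import Optional

    # Toy cluster index: cluster id -> known variant forms.
    index = {
        "cluster:42": ["Meyer, Bertrand", "Bertrand Meyer", "Meyer, Bertrand, 1950-"],
    }

    def exact(q, forms):
        return any(q == form for form in forms)

    def full_text(q, forms):
        return any(q.lower() in form.lower() for form in forms)

    def tokens(q, forms):
        wanted = set(q.lower().replace(",", " ").split())
        return any(wanted <= set(form.lower().replace(",", " ").split()) for form in forms)

    def initials(q, forms):
        parts = q.replace(",", " ").split()
        if len(parts) < 2:
            return False
        surname, first = parts[0].lower(), parts[1].lower()
        return any(form.lower().startswith(surname) and " " + first[0] in form.lower()
                   for form in forms)

    STRATEGIES = [("exactMatch", exact), ("fullText", full_text),
                  ("proximity", tokens), ("initials", initials)]

    def search(query: str) -> Optional[dict]:
        for name, strategy in STRATEGIES:          # priority order; first hit wins
            for cluster_id, forms in index.items():
                if strategy(query, forms):
                    return {"cluster": cluster_id, "query": query, "matchedBy": name}
        return None                                # no strategy matched

    print(search("Meyer, Bertrand"))   # found by exactMatch
    print(search("Meyer, B."))         # falls through to the initials strategy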
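And a minimal sketch, in Python, of the relator term detection idea: cue phrases around the name in the statement of responsibility, in more than one language, suggest a role, and anything unrecognized falls back to an unclassified value. The cue list and the detect_role() helper are illustrative assumptions, far simpler than the knowledge base described above:

    ROLE_CUES = {
        "editor": ["edited", "a cura di", "herausgegeben von"],
        "translator": ["translated by", "traduzione di"],
        "author of introduction": ["introduction by", "prefazione", "introduzione"],
    }

    def detect_role(name: str, statement: str) -> str:
        """Guess the role of `name` from the statement of responsibility."""
        text = statement.lower()
        surname = name.split(",")[0].split()[-1].lower()
        pos = text.find(surname)
        if pos == -1:
            return "other"                     # name not mentioned: leave it unclassified
        window = text[max(0, pos - 40):pos]    # cue phrases usually precede the name
        for role, cues in ROLE_CUES.items():
            if any(cue in window for cue in cues):
                return role
        return "author"                        # mentioned with no special cue: treat as creator

    print(detect_role("Jayyusi, Lena", "edited and introduced by Lena Jayyusi"))
    # -> editor
    print(detect_role("Stephens, John",
                      "From Chaucer to the present, John Stephens and Ruth Waterhouse"))
    # -> author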
So, this makes it possible for a person to check the record using this response and to correct it or add input. Of course, we are trying to improve the knowledge base with all the expressions found in statements of responsibility in different languages, in order to have a larger and larger knowledge base; and, of course, we still need to take some further steps to arrive at something perfect, or at least perfectly working. Now I will show you a brief name cluster process, a simple name cluster process. In this case, we have a form coming from an authority record for Jose de Lucio and different forms coming from different bibliographic records, and the similarity score in Authify identifies these variant forms as forms related to the same person and also assigns a weight, a weight that is important in deciding whether or not to accept the cluster. For the massive clusterization processes, these are the main steps: the authority headings analysis and processing in Postgres, as the first step; the data enrichment with external sources, because it is very important that we can use international projects already available to identify the entities; the MARC bibliographic process; the entity detection process; the name heading-to-authority name association and, because not all libraries have an authority system, also the use of the authority authors from all similar bibliographic records; the name heading-to-variant name association; and the clusters check: if the cluster exists, the heading is added to it, and if it doesn't exist, a new cluster is created. But we also have a manual process, and we try to follow the PCC directives, because the PCC identifies and addresses policy issues on the use of identifiers in MARC, and many items are now under discussion in the PCC groups: for example, developing guidelines to include identifiers in MARC bibliographic and authority records; the use of multiple identifiers for the same entity and how to use them; determining the entities for which identifiers should be provided in an initial implementation; and identifying automated methods for populating and maintaining new and existing records with identifiers. Also in this case, given the importance of identification and detection in the Semantic Web, to enrich a MARC record with URIs Casalini Libri uses the "URI Management System" that is included in the cataloging module of OLISuite, the cataloging module developed by @CULT and used at Casalini. This also simplifies the reconciliation of variant forms of the same entity, with the development of detection procedures for entity identification, and the conversion to BIBFRAME. The URI Management System allows the management of multiple identifiers for each access point or heading; the use of external sources via APIs and web services or, in some cases, via dumps of data; and the association of a heading with the URIs that identify it in each of those projects. This is just an example of the URI Management System in OLISuite, where the cataloger can work on the URIs for each heading: the cataloger can add, check, modify, or delete a URI coming from different sources, or can search in different sources and check the preferred URI for the heading. But why multiple identifiers associated with the same heading or access point? First of all, because Casalini produces records for many libraries, and each library has defined its preferred form of URI, its preferred source, or a different order of preference among the different URIs.
So, Casalini, starting from the same record, has to produce different records as final output for different libraries. For this reason, we built a system that makes it possible to assign URIs not in the original MARC record but in a separate table, with a different system. And so, for each library, Casalini can choose how many URIs to make available for each heading, how to associate them with the heading, and how to show them in relation to data uses and formats, following, of course, the preferences declared by each library. The cataloging system is related and linked to the Adempiere system, which is the ERP that Casalini uses to manage all these preferences and the customer profiles. The most important section is the customer profile. This is, for example, the customer profile for Harvard College Library, with a URI mapping, and each library can define the mapping; for each tag they can define what the preferred URI is, if it exists. In this case, for example, uno, sorry, 1, "uno" is Italian; okay, 1 is for ISNI, 2 is for VIAF, 3 is for the Library of Congress NAF. So, if the entity is in ISNI, put the ISNI URI; if it doesn't exist in ISNI, put the VIAF URI, and so on. Each library can define a different mapping, so that, starting from exactly the same record in our local database, we can produce different records with different URIs for each library. And of course, each library can define how to receive this data: for example, as enrichment in tag 100, choosing the subfield; or, for example, as enrichment in the authority file, choosing the preferred and correct tag, and so on. But of course, when we work with URIs, with identifiers, we also need to think about how to manage them, because when we publish our data enriched with URIs, we expect that other people, other projects, will make use of our data. So, we need to provide something that makes possible the use of this data in the future. And so, we thought about a URI Registry, that is, the history of each URI when something changes in it: for example, when we reorganize a cluster because we discover something that is not correct; for example, we have put two people together in the same cluster and then we discover that they are not the same person but two different people. So, we need to split the cluster and assign a new URI to the second person, and the URI Registry records this change. It records, for example, a resource added to a cluster, but also modified or removed from it; the date of the update; the particular operation performed; and the status of a URI, for example whether it is valid or no longer valid, or the URI aliases, that is, "use this URI for this entity," and so on. So, finally, we can start the conversion into RDF and, in this case, into BIBFRAME. I won't give you the details of Lodify, but just to explain: LODify is the evolution of ALIADA, a conversion tool coming from a European project. ^M00:50:02 ALIADA is transparent; it can manage, like LODify, different ontologies, but LODify is the evolution that we produced to manage BIBFRAME. About the importance of LODify, let me give you just one particular point: the possibility to work with atomic pieces, atomic processors. This makes it possible to work on a small part of the overall task, and it means that you can change the handling of each specific piece, a specific class or property of an ontology, and so on. This makes the tool very elastic, very powerful, and easy to manage.
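A minimal sketch, in Python, of the preference-ordered fallback described above for the customer profiles: each library declares an ordered list of sources (for example 1 = ISNI, 2 = VIAF, 3 = LC NAF), and the first source for which a URI exists for the heading is the one delivered to that library. The headings, the placeholder URIs, the profile names, and the choose_uri() helper are illustrative assumptions, not real identifiers or Casalini systems:

    from typing import Dict, List, Optional

    # Placeholder URIs only; real records would carry actual ISNI/VIAF/LC identifiers.
    available_uris: Dict[str, Dict[str, str]] = {
        "Eco, Umberto": {
            "ISNI": "http://isni.org/isni/0000000000000001",
            "VIAF": "http://viaf.org/viaf/00000001",
            "LCNAF": "http://id.loc.gov/authorities/names/n00000001",
        },
        "Possemato, Tiziana": {
            # no ISNI in this toy dataset, so the fallback order matters
            "VIAF": "http://viaf.org/viaf/00000002",
        },
    }

    library_profiles: Dict[str, List[str]] = {
        "library-A": ["ISNI", "VIAF", "LCNAF"],   # prefers ISNI, falls back to VIAF, then LC NAF
        "library-B": ["LCNAF", "VIAF"],           # wants LC NAF first and never ISNI
    }

    def choose_uri(heading: str, preference_order: List[str]) -> Optional[str]:
        """Return the first available URI following the library's declared preference order."""
        uris = available_uris.get(heading, {})
        for source in preference_order:
            if source in uris:
                return uris[source]
        return None

    for library, profile in library_profiles.items():
        for heading in available_uris:
            print(library, "|", heading, "->", choose_uri(heading, profile))

The chosen URI would then be written into the agreed subfield of the relevant MARC field (for example in tag 100) or into the authority file, as each library prefers.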
This is a conversion template: LODify converts each incoming record by means of conversion templates, and each template associates a MARC record belonging to the incoming datastream with a set of rules, associated with one or more ontologies, to produce the final output. And finally, trust and provenance: when you publish your data, or when you put together data coming from different sources, you need to think about the provenance of your data, about its quality. You need to assure the quality of your data, and you need to guarantee the accuracy of the information. So, we use the concept of provenance, that is, the origin, the authorship of a piece of information, to extend trust to the information. Creating a link between information and its source has become essential for the purpose of guaranteeing the authority of the information itself. In this way the triple becomes a quadruple, because we add the node of the source to the claim: it is not a universal value, but someone says that this is blue, that this is the color, that this is the name, and so on. Now, we come to the SHARE-VDE Phase 2 deliverables that, as Michele told you, have been more or less finalized and finished. We still have something to complete, but we are mostly ready, and we have four different deliverables. Deliverable 1 is the dataset in BIBFRAME 2.0 of the entire catalog of each institution, with the triples, or "tuples," derived directly from the MARC records, delivered both as triples and as quadruples with the addition of the provenance, and with SHARE-VDE URIs. This first deliverable is strictly related to Deliverable 2, which is the knowledge base of clusters accessible in RDF, and we think it is the added value of the project, because each library can receive its own data, but all related to the same knowledge base, which also demonstrates the power of sharing data and of a cooperative project. Deliverable 3 is more autonomous: it is independent of the knowledge base, it is out of [inaudible], and in Deliverable 3 you will have the URIs from external sources, so not the SHARE-VDE URIs but the internationally defined URIs. So, each library can use Deliverable 3 to publish its own data autonomously with respect to the other libraries. And Deliverable 4 is the MARC 21 records for each institution, enriched with URIs. So, to summarize these deliverables [inaudible]: Deliverables 1 and 2 are strictly related to each other, while Deliverables 3 and 4 are autonomous, and each library can use them autonomously. Where are we now with BIBFRAME? With BIBFRAME we are a sort of pioneer project, so each day we receive news, new information, upgrades, and it's not so easy to follow; but it's a very, very interesting and special phase of our project. Now, we are managing the Library of Congress BIBFRAME extension to properly handle the MARC 21 tags, and we underline here the classes that we already manage in our current state of the art, but we will continue, of course, to include additional classes and so on. So, this is an example of Deliverable 1, the RDF conversion; see here the use of the LC vocabulary extension to manage some specific information, the contents of the record. And this is Deliverable 2, the knowledge base of clusters. I will give you just an idea of the clusterization process for the "forename" heading type. See here an example, Bridget, of Sweden, Saint, with a date. This is an example of the algorithm used to identify a person.
First, the selection of the interesting subfields: we start from the full tag, and afterwards we go back to find each piece of information. The first step is the normalization of the text string, removing diacritics, accents, and so on. So, this field becomes this one, because we put everything in upper case, and then we search the string among the variant forms and cross references in the database. The field becomes something like this, and maybe a cluster is found, because we use this heading to match the string; if no cluster is found, the subfields are analyzed, comparing, for example, subfield a with other existing forms; or comparing only the numeric part of subfield d when the subfield a is the same; or comparing subfield c for "Saint" or "Santa" or other forms in other languages, and so on, to finally arrive at identifying the entity. And this produces the knowledge base, and we use, as I told you before, Postgres as a bridge. We don't publish the Postgres database; we will publish the clusters knowledge base in RDF, but we keep a working database in Postgres because, in some cases, the relational database is able to do something that the [inaudible] is not able to do. So, this is the final result of all the forms that we found in the libraries for this entity; we recorded these in Postgres, and finally we published the results in this form on a portal, because the goal of this project is having [inaudible] libraries using BIBFRAME and new technology, but also, of course, helping the end user, using this information to improve the search experience and the use of data. This is another component used to produce a deliverable: a table of external sources, a sort of map of all the external sources that we use to enrich the records with URIs. And of course, each library can select and choose its preferred sources. And so, we arrive at the enriched MARC 21. This is the original record, and this is the record [inaudible] enriched with the URIs. Of course, we also have, in this step, a lot of doubts that we try to resolve with the help of Sally McCallum and her group, who help us understand what is hard to understand, also for MARC 21. For example, we have some doubts about whether to use subfield 0 or subfield 1 to enrich the record, and what kind of source is best to use in which cases. So, this project is interesting because it gives us the opportunity to share our thoughts with very expert, skilled people and to improve our knowledge about this. So, I will ask Michele to introduce the last three slides, which are about the possible steps for the future. ^M01:00:09 ^M01:00:13 >> Michele Casalini: We started last summer at ALA, during a SHARE-VDE workshop, to discuss in a preliminary way what a production phase could be, and it came out immediately that the scene is very heterogeneous, because the priorities are different from institution to institution, and there are, of course, different workflows to cover, in both directions, in the sense of going from traditional records into linked data. But there is also, and this will be more frequent in the coming future, data created originally with a BIBFRAME editor that sometimes needs to be converted later, because the operational systems in libraries will still need, for a foreseeably long period of time, MARC records to place orders and to handle a number of services. So, we started to work on a list of candidate use cases, and we had a very important meeting here on October 6th at the Library of Congress.
We were very grateful that this meeting could be hosted here, where we analyzed mostly the original and copy cataloging use cases. Our next meeting will be in Denver, and the next couple of slides give some details about original and copy cataloging. Maybe we also need a new name and definition for copy cataloging in the future. So far, the candidate use cases for a production phase foresee, on one side, the publication of the entire datasets on a portal that should, of course, no longer be a prototype, together with batch or automated data flows, in order to allow libraries to send daily the new data they create in MARC or with a BIBFRAME editor and, of course, to receive back the data reconciled, enriched, and according to their preferences. At the same time, considering this as the basis for a production phase, there are a number of further steps and further use cases that you see here listed as candidate Phase 3b. Before we move on to some further details about two of these use cases, I would like to join Tiziana in expressing deep gratitude to Sally McCallum and the colleagues of the Standards Office, because without the many answers that we received, what we have done would not really have been possible. But let's go into some detail here, Tiziana, in explaining this and the next slide, please. >> Tiziana Possemato: This is, of course, something that we are studying with the group, so I cannot answer in detail, because we still need to define the details. But the general idea is to make the SHARE-VDE project something more than a publication layer, more than a publication portal, and to make it possible for the richness of the project to become a richness for libraries, using, for example, the clusters knowledge base during the cataloging process, and so on. So, the first use case tries to analyze a possible flow that has the SHARE-VDE project at the center, with the possibility to use this important, powerful knowledge base of data for copy cataloging or to define the carrier and holdings: okay, yes, there is a copy of this book, and I can report it using the SHARE-VDE portal and the SHARE-VDE processes. So, we are thinking of allowing contributions by libraries, to have something that is useful in the cataloging process, a workflow that makes use of this richness; and exactly the same, but with different flows, of course, for original cataloging, because this makes possible the use of different tools, for example, the tools that the Library of Congress proposes for creating or editing data. Also because we are trying to think of a new environment where libraries will not only publish data in RDF but will produce authority data and original data in RDF: a new environment that will use all these tools that come from our project but also from other projects, for example the Library of Congress projects, to put together and give to libraries a new instrument to produce and manage data in RDF. Of course, in this case, we need to think of another scenario where the conversion will be from RDF to MARC, for example, because a lot of libraries will continue to use MARC 21 as a cataloging format, as a format to exchange data. So, the direction changes, and now we are trying to imagine these processes starting from RDF and ending with MARC 21.
But I hope that some of you are interested in these use cases and will contribute to developing them, as we are now starting to work on them with all the librarians that are part of this project. So, I jump to the conclusion, because I think you can already imagine it: how important it is to share and to reuse information resources; how important it could be, not only for librarians but also for end users, to have enriched information, linked information, to have the possibility to receive information also from the online dictionaries and online encyclopedias that they usually use, and to have a common knowledge base, common shared information, with all the world. So, thank you for your attention, and if you have questions, we are available. ^M01:08:25 [ Audience Applause ] ^M01:08:37 >> Beacher Wiggins: Well, you've heard a lot. What questions do you have? Aaron? >> Would there be a way that we could get data from somebody who doesn't plan on sharing it with us? I'll give an example. I do Italian law. Many of the authors are magistrates. I discovered that if I put in the name, the word Magistrato, and then even nato or nata, I almost always get the Ministry of Justice appointment list where the person was appointed or promoted, which gives their birth date and place of birth, which is very useful, since it seems many authors in the world like to have the same name as other authors. But of course, the Ministry of Justice is not producing this list to help catalogers. Another example, in Italy: I also discovered there's a union list of Italian law libraries that has full cataloging for records where the Casalini record in LC or OCLC is just a core acquisitions record; they're beautiful records by law libraries, but there's no way to get the data other than cutting and pasting. Is there a way, from all of this, that we could get data without people having produced it for us? >> Michele Casalini: Thank you, because this is absolutely a very interesting domain that needs progressively to be covered, and Tiziana can certainly help me in replying. ^M01:10:17 This can be done in various ways, using various data sources. I will give an example that does not completely overlap with what you were saying, but just to tell you that there is the intention to address these kinds of topics. In Italy, there is a national agency for the evaluation of research output [inaudible], and we were involved in setting up a working group to study the possibility of making, let's say, better use of the information and of having a database that can then be shared with the community in a more efficient way. There are also some attempts to make use of the ORCID database, in which a lot of Italian researchers are now obliged to be registered. As for what you mention, certainly we have to double-check and put the Ministry of Justice on our list, and of course, to the extent that they publish their data and are willing to make it available, that list data could then be included to allow a more precise identification. >> Tiziana Possemato: And probably also interesting is the role of publishers, because publishers are usually the first people who meet the author, and they can often have a lot of information that the librarian cannot have.
So, also in this case, in Italy there is an initiative to involve the organization of publishers and to include them more strongly in this type of project, because they are really the corporate bodies, the entities, that can have the information needed to identify and to disambiguate a person. ^M01:12:57 ^M01:13:05 >> A question. >> Yes. >> When will you release the dataset and publish it to the various libraries? >> The participating libraries. >> For Phase 2? >> Yeah. >> Tiziana Possemato: Okay, we started to release in October, and we are progressively releasing to all the libraries. We left the Library of Congress for last because it is, of course, a bit more complicated, and because we preferred to be sure first; but we are working on defining the last requirements coming from the extensions, and we have started sending very, very small files to the group, the BIBFRAME group, to check the quality. So, we think that within one month we will finish this process. But, and this is important, Phase 2, the release of these deliverables, is something that will continue progressively, because each day we discover new requirements coming from BIBFRAME but also from the PCC URI group. For example, we are waiting for the final requirements to manage works in subfields 7 and 8, because we know that something is under discussion, and we are taking part in this discussion. When a decision is finally taken, we will need to build the MARC 21 again with the new requirements and give it to the libraries. So, I think the libraries understand that this is a delicate phase of transition and that every day something happens that lets us improve the process. ^M01:15:12 ^M01:15:17 >> Beacher Wiggins: Needless to say, I am encouraged that BIBFRAME is a workable and viable model based on what you have shared. I guess one of the questions that comes out is this: you're working with a discrete group of libraries, so how do you perceive this being shared more broadly? Is it all going to rely on each of the individual member libraries receiving its data and then sharing that data through a publishing mechanism, making it available for searching, as opposed to a single source somewhere that an interested library or cultural institution can link into as part of a linked data environment? Obviously, I want to share, but I also realize that you are an entity that has to worry about all the input that goes into it. What do you say about that? And I know you will be meeting with colleagues in Denver on Friday, which I won't be attending, probably the first time in 25 years. But what are your thoughts on that? >> Michele Casalini: Thank you for asking the question, because it's, of course, a very crucial one. First of all, the data is all your data, so you can use it locally, or you can use it in a shared repository; it's totally up to you. As for the future of this initiative, it is now ready, in the very short term, to come out of the research and development phase and to go into production; but what this production environment will be is really up to you, in the sense that we intentionally don't have a product to propose. Otherwise, it would not make sense. So, from our perspective, it must really be designed and based on your expectations and your priorities. Another aspect to mention is that everything is totally independent of the library's local system.
So, our feeling so far is, and this was one of the goals of Phases 1 and 2 of SHARE-VDE, that a certain number of tools have started to be mature enough to be put into production, and these can then be used in ways that your community will really shape together, as an environment, to do certain things. A consensus of a certain number of institutions is needed, because otherwise it would make little sense, and there are therefore aspects related to the availability of the [inaudible] in RDF of the knowledge base, in order to have unique identifiers across the community, and aspects related to the production or conversion of data, after all the reconciliation and enrichment processes, into linked open data. And, let's say, we hope that there are now enough ingredients on the table to play with and to allow you to express your vision of what best meets your needs; this is the intention of the coming meeting on Friday, and, of course, the discussion will continue. From the side of Casalini Libri and our original cataloging department, we are now almost ready to start supplying MARC records enriched with URIs, or to supply to libraries the equivalent BIBFRAME datasets. And this is, of course, as an original cataloging agency, one of our responsibilities, as we cover your cataloging for the Romance-language countries. So, I'm sorry, Beacher, if my answer was not a complete answer. ^M01:20:20 >> Beacher Wiggins: Your answer is about as good as I would expect you to give us on the spot, but I wanted to hear you thinking like that and to see if there was any reaction to it. Other comments or questions? Melanie? >> Actually, I had a question that bounces off the last one, because one of the benefits of the system we have now is that we can benefit from others' cataloging. For example, we put in an authority record: I make the authority record with a baseline of information, but the next cataloger who works with that person's name may have more information that can be added to it, and everybody benefits from that. I was wondering, where does your work take place in the system, since it is quite complex? Is there any way that other catalogers or institutions who work with the data you've supplied could have any changes they make feed back into your system? Would it be reasonable to do it in the same way? And not just catalogers; perhaps other users, patrons, who might be able to say, I'm looking at this book as I'm reading it, oh, I could actually improve that catalog record [inaudible]. So, [inaudible] that this person was the writer of this? ^M01:21:35 [ Audience Laughter ] ^M01:21:42 >> Michele Casalini: Regarding the feedback, I will certainly answer very superficially, so please, I beg your pardon. The feedback of more precise information than what is already available or, of course, the addition of certain data elements, added with the full Provenance Declaration, of course, is absolutely part of this two-way data flow that foresees cooperation among the institutions. So, there are institutions that certainly foresee keeping data locally but that can take advantage of these cloud datasets and the knowledge base, and there are other libraries that foresee relying more, let's say, on a cloud approach. But in any case, this aspect of interaction with the data is absolutely critical and very important.
>> Tiziana Possemato: This is so important that it was the first request we received in our first project in Italy, SHARE Catalogue, which we built with eight university libraries in the south of Italy. So, also in this case, one of the use cases is the opportunity to have a tool for sharing: in the first step of production, to share suggestions for correction, not to edit the heading directly without the correct authorization level, but to have a system to share suggestions, so that end users too, and when we think about end users we think not only of students but also of professors, of experts in the field and so on, can give their contribution to the project. So, for instance, they can send information saying that, about a manifestation or a work, we handle two different entities but this is the same person, this is the same work, this is the same subject in different languages, and so on. So, there will be collaborative and cooperative tools with different steps or grades of authorization, up to the last level, which will be the ability, if authorized, to change the heading and to correct, for example, the cluster, and so on. This is already present in our use cases. ^M01:25:05 ^M01:25:10 >> As a clarification along the same lines: from what you have just expressed, there is an ability for institution-specific levels of access, so that, let's say, a specific vocabulary like the Getty Thesaurus that won't be shared back out into the larger community can be imported and just used in the same framework. Is that correct, just to confirm? >> Yes. This is correct. Yes. >> Okay, I just wanted to make sure I understood. Thank you. >> Beacher Wiggins: Well, thank you. It is 11:30, and again, thank you for sharing this. I'm hoping that by the time we wrap up BIBFRAME pilot test 2 in June and get our assessment, and you can get feedback from external entities that are using BIBFRAME, we'll be ready to say that the Library is moving forward in this brave new world. I was interested in the mention of pioneers. So, again, on behalf of the Library of Congress, thank you so much for your time and your effort this morning. >> Tiziana Possemato: Thank you. ^M01:26:31 [ Audience Applause ] ^M01:26:39 >> This has been a presentation of the Library of Congress. Visit us at www.loc.gov. ^E01:26:45