Biblioteca 2.0/Curso

Taxonomies, folksonomies and ontologies (2008/09)

Module of the doctoral course I0703-Web semántica (Semantic Web), Doctoral Programme in Information Systems, Facultad de Ingeniería-ESIDE, Universidad de Deusto, March 2009, taught by Josuka Díaz Labrador and Joseba Abaitua.

Contextualization

  • UD's new library
  • Resource discovery and knowledge extraction
    • Online open access
    • Content processing, knowledge extraction and aggregation

Questions

  1. Why are scientific papers the main literary or communicative format used by the scientific community?
  2. What is their structure? What are the best means for dealing with them?
  3. Which tools can we use to discover, manage and extract information from them?
  4. How can these tools be combined to make them more efficient?
  5. What are the future directions of the field? Which field or fields?

Planning

  1. March 2. Introduction
    1. Resource discovery
    2. Scientific publications: sources, problems, and perspectives
  2. March 3. SRM and Open Access
    1. Scholar reference management (practical exercise 1)
    2. Open Access, zOAZ
  3. March 4. Information extraction and knowledge management
    1. Information extraction (practical exercise 2)
    2. Knowledge representation: Review of YAGO (IE, AI, HAL 9000, OWL/RDFS)
    3. Project proposal and NLP/KR tools and demo systems (practical exercise 3)
  4. March 5. Knowledge construction and dissemination
    1. Project proposal: PaperSqueezer
    2. Presentation of slides (practical exercises 4 and 5)
    3. Discussion: What can we do for UD's new library?

Exercises

  • Exercise 1. How to deal with scientific references
    • March 2. Select a cloud computing tool for managing scientific references
    • March 3. Make a full reference of a scientific paper
    • March 3. Include short references of six more papers, both forward and backward citations of your main paper (by means of citation search engines)
  • Exercise 2. Information extraction from scientific papers
    • March 4. Identify in your paper entities, relations, and facts
    • March 4. Make a small taxonomy of entities (classes, subclasses)
    • March 4. Test your taxonomy against SUMO http://www.ontologyportal.org/
  • Exercise 3. Demo and tool testing
  • Exercise 4. Knowledge dissemination
    • March 4. Include at least one short review of the papers in your SRM account
    • March 4. Add three slides [4] with a summary of your paper
    • March 5. Add three slides [5] with a shallow evaluation of the tested tools
    • March 5. Add one slide [6] with a contribution proposal to the new university Library
    • March 5. Present your slides [7]
  • Exercise 5. Aggregation of shared experience

Resource discovery

What do I do when I am looking for related work or documentation on a topic of my research? Do I use catalogues or search engines? Which type of engines? Do I try other resources?

Resources: reference

Catalogues (mainly book-oriented)

Online Publishing (journal and conference papers)

Catalogues (scientific paper oriented)

  • ISI Web of Knowledge (WoK) http://www.accesowok.fecyt.es/login/
    • formerly known as ISI, now nicknamed TMAC (The Mother of All Catalogues)
    • because of the "impact factor" (JCR, Journal Citation Reports)
    • in any case, it is only a database: it offers abstracts and reference metadata, but no full text, not even under TA (toll access)

Online (self-)publishing (authors, groups)

  • DELi publications: a bad example, not updated for a long time
  • Many researchers and groups now have a publications page, often with full-text papers
  • This is very good, but it leads to several problems:
    • publisher copyright (because of this, authors sometimes publish draft versions)
    • extreme fragmentation of resources (it's the Web!): you have to know the author to find his or her publications
    • metadata missing or difficult to process by machine

Resources: full-text

A reference (author, title) and an abstract are not enough to make use of the previous work of others. We need the full text.

This conclusion seems very clever, but it is trivial. Actually, references are not merely insufficient; they are almost nothing. We know or learn because of the content. A 10-line abstract lets us decide whether or not we want to know more about a piece of work, but we need the content to actually learn from it. In the world of scientific databases, catalogues, journal listings, references, etc., we get used to thinking that there is no life outside them (outside the references).

In the past, we used the Library's catalogue to look for book titles of interest, and then we went to the bookcase to read the books (not only the titles). In the computer and Internet age, we use digital catalogues to look for paper titles of interest, and then... reading the papers runs into several difficulties. The thing to appreciate is that the digital full text of a paper undoubtedly exists: every journal and conference now asks authors to submit a PDF (or another text-preserving format). There are, among others, several possibilities:

  • You can reach the TA agent (publisher, organization) responsible for publishing the paper. If you are lucky, as we are at the UD, you can access the full-text digital libraries of ACM, IEEE Computer Society, SIAM, and several others listed before. In other cases, we are not so lucky.
  • You could look for the author's or group's homepage, to see if they have full-text versions of their work. Sometimes, it works.
  • There are now sites that aggregate sizeable document collections, most of them publicly accessible and with full-text content (see later Google Scholar, Citeseer, DBLP and others).

The relevant point is that, partly in reaction to the TA (closed access) policies of publishers and other organizations, there is a movement that seeks and promotes open access to scientific publications, much like other well-known phenomena of the digital era such as open source software, free software, Creative Commons licenses, P2P networks, and many others. The open access movement shares a number of issues with these, such as:

  • payment for a resource
  • copyright problems
  • shift from hard (physical) media to digital media, and others

but it also has distinctive ones of its own.

Open Access (OA)

Technology behind Open Access

Open content access

Open access has many benefits. First of all, the universal dissemination of knowledge and all that it implies, as has been stressed in the references above. But there is another side to the word open, related to the supporting medium. For example, if a document has been digitized as an image, the content is open to the human eye, but closed in every other respect. Unfortunately, the vast majority of scientific work is now disseminated in PDF, which is not an information representation format; but at least PDF preserves textual content as such. We can therefore assume that full text also means the capability of accessing the textual content.

This is the final key question, because full text is the key to full-text resource discovery. As you may have concluded by now, many of the databases, catalogues and search engines listed above allow for reference-based, or metadata-based, resource discovery, but they miss the content in most cases. Imagine that Google could only give results based on the title and meta elements of web pages (you may think that the World Wide Web would not be such a mess if it did, but that is another question).

The analogy is very relevant, because the head element of a web page plays exactly the same role as the reference or metadata associated with a document, although its expressiveness is very limited. But you may remember the reasons for Google's success ten years ago: before Google, there were mainly web directories (Yahoo, Lycos, etc.; bear in mind that directory management was a task performed by humans), that is, structured metadata-based resource discovery. Google is an automatic full-text resource discoverer, although it also uses metadata information (the head element).
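The contrast between metadata-based and full-text discovery can be illustrated with a minimal sketch. The two-document corpus, the field names `title` and `text`, and the helper functions below are all hypothetical, chosen only for illustration; real search engines use far more elaborate indexing and ranking.

```python
from collections import defaultdict

# Hypothetical mini-corpus: each document has reference metadata (title)
# and full text. The contents are invented for this example.
DOCS = {
    "d1": {"title": "Linked data on the web",
           "text": "RDF triples describe resources and link them across sites"},
    "d2": {"title": "Tag clouds revisited",
           "text": "Folksonomies emerge from social tagging of resources"},
}

def build_index(docs, field):
    """Map each lowercased word of the chosen field to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for word in doc[field].lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of documents whose indexed field contains the word."""
    return sorted(index.get(word.lower(), set()))

title_index = build_index(DOCS, "title")  # metadata-based discovery
text_index = build_index(DOCS, "text")    # full-text discovery

print("title index:", search(title_index, "folksonomies"))
print("text index: ", search(text_index, "folksonomies"))
```

A query for a term that appears only in the body of a paper finds nothing in the title index, while the full-text index retrieves the document.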

Anyway, the point is that if full-text documents (PDF included) are placed on any publicly accessible web page, we have at our disposal the whole rich set of capabilities and tools that now exist on the web to perform discovery and knowledge extraction (you may think again: as if the web were not enough of a mess already, let's enrich it with the whole scientific production of the planet; well, actually, it is being done). As a first example, since Google indexes PDF documents, we can use Google to perform the first step of discovery (and, as a corollary, reference-based engines are mostly condemned to disappear).

Indeed, from a strictly scientific point of view, this is very significant. On the one hand, when research begins, it is essential to know and mention the previous work on the subject ("the shoulders of giants", as Newton said). Now, by means of open content access, this scientific premise can be fulfilled better than ever.

On the other hand, a piece of research is relevant if it becomes a "giant's shoulder" itself, that is, if others can use and extend it. In that case, the research gets cited. The quantitative equation of citation count with quality may be questioned (it is the origin of the impact factor, but also of the PageRank mechanism used by Google, as you know), but the qualitative idea is there. The point is that your work can only be cited if it is discovered, so enhancing public access to the content of your work undoubtedly increases the probability of its being cited, provided it is relevant.
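To make the difference between counting citations and a PageRank-style measure concrete, here is a minimal sketch on a toy citation graph. The graph `CITES`, the function names, and the parameter values (damping 0.85, 50 iterations) are assumptions for illustration; this is a simplified power iteration, not Google's actual implementation nor the way the impact factor is computed.

```python
# Hypothetical citation graph: each paper maps to the papers it cites.
CITES = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}

def citation_counts(cites):
    """Count incoming citations for every paper in the graph."""
    counts = {paper: 0 for paper in cites}
    for targets in cites.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts

def pagerank(cites, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over the citation graph."""
    papers = sorted(set(cites) | {t for ts in cites.values() for t in ts})
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            targets = cites.get(p, [])
            if targets:
                # A paper passes its rank on to the papers it cites.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A paper that cites nothing spreads its rank evenly.
                share = damping * rank[p] / n
                for t in papers:
                    new_rank[t] += share
        rank = new_rank
    return rank

counts = citation_counts(CITES)
ranks = pagerank(CITES)
for paper in sorted(ranks):
    print(paper, counts[paper], round(ranks[paper], 3))
```

In this toy graph, C has the most citations, yet D, cited only once but by the most-cited paper, ends up with the higher rank: the two measures do not always agree about what counts as a "giant shoulder".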

Discovery

There is a large body of scientific and academic information available on the Web that serves our research purposes. The question is how we discover the most relevant materials, and how we filter them to make the optimal selection.

We can benefit from several tools for the discovery, selection and management of the information.

Citation harvesters

Scholar search engines (public)

Social tagging

Scholar reference management

Comparison of reference management software. (2009, March 4). In Wikipedia, The Free Encyclopedia. Retrieved 09:29, March 4, 2009, from http://en.wikipedia.org/w/index.php?title=Comparison_of_reference_management_software&oldid=274835993

Social networks

Knowledge extraction

What is knowledge? Can epistemology help us find out?

Knowledge is defined in the Oxford English Dictionary as "(i) expertise, and skills acquired by a person through experience or education; the theoretical or practical understanding of a subject, (ii) what is known in a particular field or in total; facts and information or (iii) awareness or familiarity gained by experience of a fact or situation".

Ballard's (2004) descriptive formula "knowledge = theory + information" is a core principle underlying theory-based semantic technologies.

Artificial Intelligence

Artificial intelligence (AI) is the intelligence of machines and the branch of computer science which aims to create it. Major AI textbooks define the field as "the study and design of intelligent agents," where an intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success. John McCarthy, who coined the term in 1956, defines it as "the science and engineering of making intelligent machines."

The problem of simulating (or creating) intelligence has been broken down into a number of specific sub-problems. These consist of particular traits or capabilities that researchers would like an intelligent system to display. The traits described below have received the most attention:

  1. Perception: ability to use input from sensors (such as cameras, microphones, sonar and other more exotic ones) to deduce aspects of the world.
  2. Learning: ability to find patterns in a stream of input.
  3. Natural language processing: ability to read and understand the languages that human beings speak.
  4. Knowledge representation: representation of objects, properties, categories and relations between objects; situations, events, states and time; causes and effects; knowledge about knowledge (what we know about what other people know); and many other, less well researched domains.
  5. Social intelligence: ability to predict the actions of others, by understanding their motives and emotional states.
  6. Deduction, reasoning, problem solving: algorithms that imitate the step-by-step reasoning that human beings use when they solve puzzles, play board games or make logical deductions.
  7. Creativity: both theoretically (from a philosophical and psychological perspective) and practically (via specific implementations of systems that generate outputs that can be considered creative).
  8. Planning: ability to set goals and achieve them.
  9. Motion and manipulation: to handle such tasks as object manipulation and navigation, with sub-problems of localization (knowing where you are), mapping (learning what is around you) and motion planning (figuring out how to get there).
  10. General intelligence: ability to combine all the skills above and to exceed human abilities at most or all of them.

Artificial intelligence. (2009, March 2). In Wikipedia, The Free Encyclopedia. Retrieved 09:09, March 4, 2009, from http://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=274447760

Examples:

  • 2001: A Space Odyssey (film) by Arthur C. Clarke and Stanley Kubrick Wikipedia
  • The Shutting Down of HAL 9000 YouTube, [9]

Knowledge management

Knowledge management comprises a range of practices used in an organisation to identify, create, represent, distribute and enable adoption of insights and experiences. Such insights and experiences comprise knowledge, either embodied in individuals or embedded in organisational processes or practice [10].

The literature provides many definitions of knowledge, most of which build the concept from data, to information, to knowledge. Some of the literature even takes this one step further and expands knowledge to understanding and wisdom (Ackoff 1989; Kannegieter 2001; Stewart 1999); however, there is little agreement on a precise definition of knowledge (Biggam 2001, p. 2; Håkanson 2001, p. 3). Unfortunately, data and information are often used interchangeably, and information and knowledge are used as synonyms (Durant-Law Consulting Pty Limited 2004).

An established discipline since 1995, Knowledge Management (KM) includes courses taught in the fields of business administration, information systems, management, and library and information sciences (Alavi & Leidner 1999). More recently, other fields, including those focused on information and media, computer science, public health, and public policy, have also started contributing to KM research.

KM efforts can help individuals and groups to share valuable organisational insights, to reduce redundant work, to avoid reinventing the wheel, to reduce training time for new employees, to retain intellectual capital as employees turn over in an organisation, and to adapt to changing environments and markets (McAdam & McCreedy 2000; Thompson & Walsham 2004).

A basic expectation of scientific method is to document, archive and share all data and methodology so they are available for careful scrutiny by other scientists, thereby allowing other researchers the opportunity to verify results by attempting to reproduce them. This practice, called full disclosure, also allows statistical measures of the reliability of these data to be established.

Readings

  • Thompson, Mark P.A. & Geoff Walsham (2004), "Placing Knowledge Management in Context", Journal of Management Studies 41 (5): 725-747


Related topics

  • Data, information, knowledge
  • What is a "knowledge unit"? Google Scholar
  • Can we build new knowledge on top of aggregated shared knowledge units?
  • How can we parametrize or rank our knowledge units based on:
    • peer review (Scholar, Conferences, OAI-PMH)
    • cross references (Scholar, CiteSeer)
    • other notions of authority? Google Scholar

Content annotation

Documents may not be the best means for the transmission of knowledge in the semantic web (Chris Welty and J. William Murdock 2006). But they are still the standard method for sharing and disseminating knowledge within the scientific community. It is not clear how that knowledge can be made explicit to computers. There are a number of mechanisms for knowledge representation:

  • Formal languages, propositional logic, semantic networks, frames
  • Markup, categories, metadata, social tags
  • Ontologies, taxonomies, folksonomies
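As an illustration of the simplest of these mechanisms, a taxonomy can be sketched as a mapping from each class to its parent class. The class names below are hypothetical, loosely inspired by the entities one might find in a scientific paper; they are not taken from SUMO or any other published ontology.

```python
# Hypothetical taxonomy: each class points to its direct superclass.
SUBCLASS_OF = {
    "JournalArticle": "ScientificPaper",
    "ConferencePaper": "ScientificPaper",
    "ScientificPaper": "Document",
    "Book": "Document",
    "Document": "Entity",
    "Author": "Person",
    "Person": "Entity",
}

def ancestors(cls):
    """Return the chain of superclasses of cls, from nearest to most general."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain

def is_subclass(cls, other):
    """True if cls is the same class as other or a (transitive) subclass of it."""
    return cls == other or other in ancestors(cls)

print(ancestors("JournalArticle"))
print(is_subclass("ConferencePaper", "Document"))
print(is_subclass("Author", "Document"))
```

A structure like this only captures the subclass relation; an ontology would add further relations and constraints between the classes.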

Readings

Markup languages

After the publication of the official XML specification in 1998, XML-based markup languages became an efficient way of making the interpretation of data explicit through metadata. RDF (Resource Description Framework) is a general method for the conceptual description of information available on the Web.
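A minimal sketch of what an RDF description looks like: statements are (subject, predicate, object) triples, written here in the N-Triples serialization without any external library. The predicates come from the Dublin Core Terms vocabulary; the subject URI under `example.org` and the helper functions are assumptions made for this example.

```python
# Dublin Core Terms namespace, a widely used metadata vocabulary.
DCTERMS = "http://purl.org/dc/terms/"

def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"

def literal(value):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def to_ntriples(triples):
    """Serialize (subject, predicate, object) triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

# Hypothetical identifier for the document being described.
paper = uri("http://example.org/papers/linked-data")

triples = [
    (paper, uri(DCTERMS + "title"), literal("Linked data")),
    (paper, uri(DCTERMS + "creator"), literal("Tim Berners-Lee")),
    (paper, uri(DCTERMS + "date"), literal("2006")),
]

print(to_ntriples(triples))
```

Each printed line is a self-contained statement, which is what makes such metadata easy for machines to merge and process.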

Content aggregation

Aggregation is a necessary remedy for information overload that complements information selection and filtering, particularly when information is redundant. Redundancy, however, can help us detect information that is possibly more relevant. Techniques:

  • Tag clouds [17]
  • RSS aggregation [18]
  • Linked data [19]
  • Natural language processing
    • Summarization
    • Named entity recognition
    • Terminology extraction
    • Automatic ontology construction
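The first technique in the list, tag clouds, can be sketched in a few lines: tags are counted across items and each count is scaled linearly to a font size. The tagged items, the function name `tag_cloud`, and the size range (10 to 30) are hypothetical; this is a plain frequency-based cloud, not the self-organising map approach described by Salonen (2007).

```python
from collections import Counter

def tag_cloud(tagged_items, min_size=10, max_size=30):
    """Map each tag to a font size scaled linearly by how often it is used."""
    counts = Counter(tag for tags in tagged_items for tag in tags)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for tag, count in counts.items():
        if span == 0:
            # All tags are equally frequent: use the middle of the range.
            sizes[tag] = (min_size + max_size) // 2
        else:
            sizes[tag] = min_size + round((count - low) / span * (max_size - min_size))
    return sizes

# Hypothetical tags that three users assigned to papers in a shared library.
items = [
    ["ontology", "rdf"],
    ["rdf", "folksonomy"],
    ["rdf", "ontology"],
]

cloud = tag_cloud(items)
for tag in sorted(cloud, key=cloud.get, reverse=True):
    print(tag, cloud[tag])
```

The redundancy of tags across items is exactly what the cloud exploits: the more often a tag recurs, the more prominent it becomes.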

Readings

  • Salonen, J (2007). Self-organising map based tag clouds - Creating spatially meaningful representations of tagging data. Proceedings of the 1st OPAALS conference, 26-27 November 2007, Rome, Italy
  • Tim Berners-Lee (2006). Linked data. Retrieved 2009 March 2 from http://www.w3.org/DesignIssues/LinkedData.html

Natural language processing

Extracting explicit semantic content from text has attracted the scientific community for decades. The main computing and Internet companies have strong research groups that work in the field:

Readings

NLP tools

Semantic searching

Content aggregators

Related projects

Documentation

  • Silviu Cucerzan. (2007). Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2007), Prague, Czech Republic. [28]


See also