info retrieval
view markdownSome notes on information retrieval, based on UVA”s Info Retrieval course.
introduction
- building blocks of search engines
    
- search (user initiates)
 - reccomendations - proactive search engine (program initiates e.g. pandora, netflix)
 - information retrieval - activity of obtaining info relevant to an information need from a collection of resources
 - information overload - too much information to process
 - memex - device which stores records so it can be consulted with exceeding speed and flexibility (search engine)
 
 - IR pieces
    
- Indexed corpus (static)
        
- crawler and indexer - gathers the info constantly, takes the whole internet as input and outputs some representation of the document
            
- web crawler - automatic program that systematically browses web
 
 - document analyzer - knows which section has what -takes in the metadata and outputs the index (condensed), manage content to provide efficient access of web documents
 
 - crawler and indexer - gathers the info constantly, takes the whole internet as input and outputs some representation of the document
            
 - User
        
- query parser - parses the search terms into managed system representation
 
 - Ranking
        
- ranking model -takes in the query representation and the indices, sorts according to relevance, outputs the results
 - also need nice display
 - query logs - record user’s search history
 - user modeling - assess user’s satisfaction
 
 
 - Indexed corpus (static)
        
 - steps 
    - repository -> document representation
 - query -> query representation
 - ranking is performed between the 2 representations and given to the user
 - evaluation - by users
 
 - information retrieval:
    
- reccomendation
 - question answering
 - text mining
 - online advertisement
 
 
related fields
they are all getting closer, database approximate search and information extraction converts unstructed data to structured:
| database systems | information retrieval | 
|---|---|
| structured data | unstructured data | 
| semantics are well-defined | semantics are subjective | 
| structured query languages (ex. SQL) | simple keyword queries | 
| exact retrieval | relevance-drive retrieval | 
| emphasis on efficiency | emphasis on effectiveness | 
- natural language processing - currently the bottleneck
    
- deep understainding of language
 - cognitive approaches vs. statistical
 - small scale problems vs. large
 
 - developing areas
    
- currently mobile search is big - needs to use less data, everything needs to be more summarized
 - interactive retrieval - like a human being, should collaborate
 
 - core concepts
    
- information need - desire to locate and obtain info to satisfy a need
 - query - a designed representation of user’s need
 - document - representation of info that could satisfy need
 - relevance - relatedness between documents and need, this is vague
        
- multiple perspectives: topical, semantic, temporal, spatial (ex. gas stations shouldn’t be behind you)
 
 
 - Yahoo used to have system where you browsed based on structure (browsing), but didn’t have queries (querying)
    
- better when user doesn’t know keywords, just wants to explore
 - push mode - systems push relevant info to users without a query
 - pull mode - users pull out info using keywords