comp neuro
- cognitive maps (tolman 1940s) - idea that rats in mazes learn spatial maps
- place cells (o’keefe 1971) - in the hippocampus - fire to indicate one’s current location
- remap to new locations
- grid cells (moser & moser 2005) - in the entorhinal cortex (provides inputs to the hippocampus) - encode not particular locations but rather a hexagonal coordinate system
- grid cells fire if the mouse is in any location at the vertex (or center) of one of the hexagons
- there are grid cells with larger/smaller hexagons, different orientations, different offsets
- can look for grid cells signature in fmri: https://www.nature.com/articles/nature08704
- other places with grid cell-like behavior
- eye movement task
- some evidence for “time cells” like place cells for time
- sound frequency task https://www.nature.com/articles/nature21692
- 2d “bird space” task
high-dimensional computing
- high-level overview
- current inspiration has all come from single neurons at a time - hd computing is going past this
- the brain’s circuits are high-dimensional
- elements are stochastic not deterministic
- can learn from experience
- no 2 brains are alike yet they exhibit the same behavior
- basic question of comp neuro: what kind of computing can explain behavior produced by spike trains?
- recognizing ppl by how they look, sound, or behave
- learning from examples
- remembering things going back to childhood
- communicating with language
- HD computing overview paper
- in these high dimensions, most points are close to equidistant from one another (L1 distance), and are approximately orthogonal (dot product is 0)
- memory
- heteroassociative - can return stored X based on its address A
- autoassociative - can return stored X based on a noisy version of X (since it is a point attractor), maybe with some iteration
- this adds robustness to the memory
- this also removes the need for addresses altogether
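- a minimal numpy sketch of an autoassociative (cleanup) memory, assuming random ±1 codewords and simple nearest-neighbor recall (the sizes and the non-iterative recall are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality (10k, as in the examples below)

# item memory: a few stored random +1/-1 codewords, one per row
stored = rng.choice([-1, 1], size=(5, D))

def cleanup(noisy, memory):
    """Autoassociative recall: return the stored vector most similar to `noisy`."""
    sims = memory @ noisy           # dot-product similarity to every stored item
    return memory[np.argmax(sims)]

# corrupt one stored pattern by flipping 30% of its coordinates
x = stored[2].copy()
x[rng.random(D) < 0.3] *= -1

recovered = cleanup(x, stored)
print(np.array_equal(recovered, stored[2]))  # True: the clean pattern is recovered
```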
definitions
- what is hd computing?
- compute with random high-dim vectors
- ex. 10k-dimensional vectors A, B of +1/-1 (also extends to real / complex vectors)
- 3 operations
- addition: A + B = (0, 0, 2, 0, 2,-2, 0, ….)
- multiplication: A * B = (-1, -1, -1, 1, 1, -1, 1, …) - this is XOR
- want this to be invertible, distribute over addition, preserve distance, and be dissimilar to the vectors being multiplied
- the number of -1s after multiplication (= number of ones under the 0/1 XOR view) is the hamming distance between the two original vectors
- can represent a dissimilar set vector by using multiplication
- permutation: shuffles values
- ex. rotate (bit shift with wrapping around)
- multiply by a permutation matrix (where each row and col contain exactly one 1); rotation is the cyclic-shift special case
- can think of permutation as a list of numbers 1, 2, …, n in permuted order
- many properties similar to multiplication
- random permutation randomizes
- basic operations
- weighting by a scalar
- similarity = dot product (sometimes normalized)
- A $\cdot$ A = 10k
- A $\cdot$ B = 0 for a different random vector B (approximately orthogonal)
- in high-dim spaces, almost all pairs of vectors are dissimilar A $\cdot$ B = 0
- goal: similar meanings should have large similarity
- normalization
- for binary vectors, just take the sign
- for non-binary vectors, scale by a scalar (e.g. divide by the norm)
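- a small numpy sketch of the three operations plus dot-product similarity and sign normalization (my own illustration of the definitions above, using 10k-dimensional ±1 vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000
A = rng.choice([-1, 1], size=D)
B = rng.choice([-1, 1], size=D)

S = A + B                  # addition (bundling): similar to both arguments
P = A * B                  # multiplication (binding): componentwise product = XOR for +1/-1
assert np.array_equal(P * B, A)   # invertible: unbinding with B recovers A exactly
rA = np.roll(A, 1)         # permutation: here a cyclic rotation

def sim(x, y):
    return x @ y           # dot-product similarity

print(sim(A, A))           # 10000
print(sim(A, B))           # ~0: random vectors are approximately orthogonal
print(sim(S, A))           # ~10000: the sum stays similar to its arguments
print(sim(P, A))           # ~0: the product is dissimilar to its arguments
print(sim(rA, A))          # ~0: permutation randomizes

S_norm = np.sign(S)        # normalization back to +1/-1 (zeros would be broken randomly in practice)
```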
- data structures
- these operations allow for encoding all normal data structures: sets, sequences, lists, databases
- set - can represent with a sum (since the sum is similar to all the vectors)
- can find a stored set using any element
- if we don’t store the sum, can probe with the sum and keep subtracting the vectors we find
- multiset = bag (stores set with frequency counts) - can store things with multiplicity by adding them multiple times, but hard to actually retrieve the frequencies
- sequence - could have each element be an address pointing to the next element
- problem - hard to represent sequences that share a subsequence (could have pointers which skip over the subsequence)
- soln: index elements based on permuted sums
- can look up an element based on previous element or previous string of elements
- could do some kind of weighting also
- pairs - could just multiply (XOR), but then get some weird things, e.g. A * A = 0
- instead, permute then multiply
- can use these to index (address, value) pairs and make more complex data structures
- named tuples - have smth like (name: x, date: m, age: y) and store as holistic vector $H = N*X + D * M + A * Y$
- individual attribute value can be retrieved using vector for individual key
- representation substitution is a little trickier…
- we blur what is a value and what is a variable
- can do this for a pair or for a named tuple with new values
- this doesn’t always work
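- a sketch of the record encoding $H = N*X + D*M + A*Y$ with retrieval by unbinding + cleanup, and of a sequence stored as a permuted sum (role/filler names and the cleanup step are illustrative, not from a specific paper):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000
rv = lambda: rng.choice([-1, 1], size=D)

# roles (keys) and fillers (values) for the named tuple (name: x, date: m, age: y)
NAME, DATE, AGE = rv(), rv(), rv()
fillers = {"x": rv(), "m": rv(), "y": rv()}

# holistic record vector H = N*X + D*M + A*Y
H = NAME * fillers["x"] + DATE * fillers["m"] + AGE * fillers["y"]

# retrieve the filler bound to NAME: unbind with the role, then clean up against the item memory
probe = NAME * H                                        # = x + noise terms
print(max(fillers, key=lambda k: fillers[k] @ probe))   # "x"

# sequence (a, b, c) as a sum of increasingly permuted elements
a, b, c = rv(), rv(), rv()
seq = np.roll(a, 2) + np.roll(b, 1) + c
print(np.roll(a, 2) @ seq)    # large: 'a' is recognizable in its position
```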
- examples
- context vectors
- standard practice (e.g. LSA): make matrix of word counts, where each row is a word, and each column is a document
- HD computing alternative: each row is a word, but each document is assigned a few ~10 columns at random
- thus, the number of columns doesn’t scale with the number of documents
- can also do this randomness for the rows (so the number of rows < the number of words)
- can still get semantic vector for a row/column by adding together the rows/columns which are activated by that row/column
- this example still only uses bag-of-words (but can be extended to more)
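- a rough sketch of random indexing for context vectors (the tiny corpus, the number of nonzero index entries, and the ±1 values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 10_000   # fixed number of columns, independent of the number of documents
docs = [["cat", "dog", "pet"], ["dog", "bone"], ["stock", "market", "bank"]]

def index_vector():
    """Sparse random index vector for a document: ~10 random +1/-1 entries, rest 0."""
    v = np.zeros(D)
    idx = rng.choice(D, size=10, replace=False)
    v[idx] = rng.choice([-1, 1], size=10)
    return v

doc_index = [index_vector() for _ in docs]

# a word's context vector = sum of the index vectors of the documents it occurs in
vocab = {w for d in docs for w in d}
context = {w: np.zeros(D) for w in vocab}
for d, div in zip(docs, doc_index):
    for w in d:
        context[w] += div

# words that share documents end up with similar context vectors
print(context["cat"] @ context["dog"], context["cat"] @ context["stock"])
```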
- learning rules by example
- particular instance of a rule is a rule (e.g. mother-son-baby $\to$ grandmother)
- as we get more examples and average them, the rule gets better
- doesn’t always work (especially when things collapse to identity rule)
- analogies from pairs
- ex. what is the dollar of mexico?
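- a sketch of the “dollar of mexico” analogy using record vectors (the three-attribute records follow Kanerva’s standard example; the specific vectors are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10_000
rv = lambda: rng.choice([-1, 1], size=D)

NAM, CAP, MON = rv(), rv(), rv()          # roles: name, capital, monetary unit
USA, WDC, DOL = rv(), rv(), rv()          # fillers for the USA record
MEX, MXC, PES = rv(), rv(), rv()          # fillers for the Mexico record

U = NAM * USA + CAP * WDC + MON * DOL     # record for the USA
M = NAM * MEX + CAP * MXC + MON * PES     # record for Mexico

F = U * M                                 # pairing ("mapping") vector between the two records

# "what is the dollar of mexico?" -> unbind DOL through the mapping, then clean up
query = DOL * F
items = {"USA": USA, "WDC": WDC, "DOL": DOL, "MEX": MEX, "MXC": MXC, "PES": PES}
print(max(items, key=lambda k: items[k] @ query))   # "PES" (the peso)
```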
ex. identify the language
- paper: LANGUAGE RECOGNITION USING RANDOM INDEXING (joshi et al. 2015)
- benefits - very simple and scalable - only go through data once
- equally easy to use 4-grams vs. 5-grams
- data
- train: given a million bytes of text per language (in the same alphabet)
- test: new sentences for each language
- training: compute a 10k profile vector for each language and for each test sentence
- could encode each letter with a seed vector which is 10k-dimensional
- instead encode trigrams with rotate and multiply
- 1st letter vec rotated twice * 2nd letter vec rotated once * 3rd letter vec (unrotated)
- ex. THE = r(r(T)) * r(H) * E
- approximately orthogonal to all the letter vectors and all the other possible trigram vectors…
- profile = sum of all trigram vectors (taken sliding)
- ex. banana = ban + ana + nan + ana
- profile is like a histogram of trigrams
- testing
- compare each test sentence to profiles via dot product
- clusters similar languages - cool!
- gets 97% test acc
- can query the letter most likely to follow “TH”
- form query vector $Q = r(r(T)) * r(H)$
- query by unbinding: X = Q * english-profile-vec
- find closest letter vecs to X - yields “e”
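- a toy sketch of the whole pipeline (random letter vectors, rotate-and-multiply trigrams, sliding-window profile, dot-product comparison, and the “what follows TH” query); the two tiny training strings are made up, the real paper uses ~a million bytes per language:

```python
import numpy as np

rng = np.random.default_rng(4)
D = 10_000
letters = {c: rng.choice([-1, 1], size=D) for c in "abcdefghijklmnopqrstuvwxyz "}

def trigram(a, b, c):
    # rotate-and-multiply encoding: r(r(a)) * r(b) * c
    return np.roll(letters[a], 2) * np.roll(letters[b], 1) * letters[c]

def profile(text):
    # sliding-window sum of trigram vectors = histogram-like profile of the text
    return sum(trigram(*text[i:i + 3]) for i in range(len(text) - 2))

train = {"english": "the quick brown fox jumps over the lazy dog",
         "pseudo":  "zxq vrpl mnt kwq zxq vrpl gth"}
profiles = {lang: profile(t) for lang, t in train.items()}

# classify a test sentence by comparing its profile to each language profile
test = "the dog jumps"
scores = {lang: p @ profile(test) for lang, p in profiles.items()}
print(max(scores, key=scores.get))                    # "english"

# which letter most likely follows "th"?  Q = r(r(t)) * r(h), then unbind from the profile
Q = np.roll(letters["t"], 2) * np.roll(letters["h"], 1)
X = Q * profiles["english"]
print(max(letters, key=lambda c: letters[c] @ X))     # "e"
```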
details
- mathematical background
- randomly chosen vecs are dissimilar
- sum vector is similar to its argument vectors
- product vector and permuted vector are dissimilar to their argument vectors
- multiplication distributes over addition
- permutation distributes over both additions and multiplication
- multiplication and permutations are invertible
- addition is approximately invertible
- comparison to DNNs
- both do statistical learning from data
- data can be noisy
- both use high-dim vecs, although DNNs struggle with very high dims (e.g. 100k)
- HD is founded on rich mathematical theory
- new codewords are made from existing ones
- HD memory is a separate func
- HD algos are transparent, incremental (on-line), scalable
- somewhat closer to the brain…cerebellum anatomy seems to match HD
- HD: holistic (distributed repr.) is robust
- different names
- Tony plate: holographic reduced representation
- ross gayler: multiply-add-permute arch
- gayler & levi: vector-symbolic arch
- gallant & okaywe: matrix binding with additive terms
- fourier holographic reduced representations (FHRR; Plate)
- …many more names
- theory of sequence indexing and working memory in RNNs
- trying to make key-value pairs
- VSA as a structured approach for understanding neural networks
- reservoir computing = state-dependent network = echo-state network = liquid state machine - tries to represent sequential temporal data - builds representations on the fly
papers
- text classification (najafabadi et al. 2016)
- Classification and Recall With Binary Hyperdimensional Computing: Tradeoffs in Choice of Density and Mapping Characteristics
- note: for sparse vectors, might need some threshold before computing mean (otherwise will have too many zeros)
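- a small sketch of that thresholding point, under the assumption that “mean” here refers to bundling sparse binary vectors by summing counts and binarizing (the density, count, and threshold values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
D, p, n = 10_000, 0.02, 20                     # dimension, density, number of vectors bundled
vecs = (rng.random((n, D)) < p).astype(int)    # sparse binary vectors

counts = vecs.sum(axis=0)
majority = (counts > n / 2).astype(int)        # majority vote: almost every bit ends up 0
bundle = (counts > 1).astype(int)              # a lower threshold keeps a usable number of active bits

print(majority.sum(), bundle.sum())
print(bundle @ vecs[0], bundle @ (rng.random(D) < p))   # more similar to its arguments than to a random sparse vector
```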
dnns with memory
- Neural Statistician (Edwards & Storkey, 2016) summarises a dataset by averaging over their embeddings
- kanerva machine
- like a VAE where the prior is derived from an adaptive memory store
visual sampling
- Emergence of foveal image sampling from learning to attend in visual scenes (cheung, weiss, & olshausen, 2017) - using neural attention model, learn a retinal sampling lattice
- can figure out what parts of the input the model focuses on
dynamic routing between capsules
- hinton 1981 - reference frames require structured representations
- mapping units vote for different orientations, sizes, positions based on basic units
- mapping units gate the activity from other types of units - weight is dependent on if mapping is activated
- top-down activations give info back to mapping units
- this is a hopfield net with three-way connections (between input units, output units, mapping units)
- reference frame is a key part of how we see - need to vote for transformations
- olshausen, anderson, & van essen 1993 - dynamic routing circuits
- ran simulations of such things (hinton said it was hard to get simulations to work)
- learn things in object-based reference frames
- inputs -> outputs has weight matrix gated by control
- zeiler & fergus 2013 - visualizing things at intermediate layers - deconv (by dynamic routing)
- save indexes of max pooling (these would be the control neurons)
- when you do deconv, assign max value to these indexes
- arathorn 02 - map-seeking circuits
- tenenbaum & freeman 2000 - bilinear models
- trying to separate content + style
- hinton et al 2011 - transforming autoencoders - trained neural net to learn to shift image
- sabour et al 2017 - dynamic routing between capsules
- units output a vector (represents info about reference frame)
- matrix transforms reference frames between units
- recurrent control units settle on some transformation to identify reference frame
- notes from this blog post
- problems with cnns
- pooling loses info
- don’t account for spatial relations between image parts
- can’t transfer info to new viewpoints
- capsule - vector specifying the features of an object (e.g. position, size, orientation, hue, texture) and its likelihood
- ex. an “eye” capsule could specify the probability it exists, its position, and its size
- magnitude (i.e. length) of vector represents probability it exists (e.g. there is an eye)
- direction of vector represents the instantiation parameters (e.g. position, size)
- hierarchy
- capsules in later layers are functions of the capsules in lower layers, and since capsule has extra properties can ask questions like “are both eyes similarly sized?”
- equivariance = capsule outputs change in the same amount/direction as the viewpoint transformation, which lets the net generalize across viewpoints
- active capsules at one level make predictions for the instantiation parameters of higher-level capsules
- when multiple predictions agree, a higher-level capsule is activated
- steps in a capsule (e.g. one that recognizes faces)
- receives an input vector (e.g. representing eye)
- apply affine transformation - encodes spatial relationships (e.g. between eye and where the face should be)
- apply a weighted sum using the c weights, learned by the routing algorithm
- these weights are learned to group similar outputs to make higher-level capsules
- vectors are squashed so their magnitudes are between 0 and 1
- outputs a vector
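- the squashing non-linearity from sabour et al. 2017, as a standalone sketch (routing weights and affine transforms omitted):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squash a capsule's raw output so its length lies in (0, 1).

    Direction (the instantiation parameters) is preserved; length plays the
    role of the probability that the entity is present.
    """
    sq_norm = np.sum(s ** 2)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

print(np.linalg.norm(squash(np.array([0.1, 0.0]))))   # short input -> length near 0
print(np.linalg.norm(squash(np.array([10.0, 0.0]))))  # long input  -> length near 1
```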
hierarchical temporal memory (htm)
- binary synapses and learns by modeling the growth of new synapses and the decay of unused synapses
- separate aspects of brains and neurons that are essential for intelligence from those that depend on brain implementation
neocortical structure
- evolution leads to physical/logical hierarchy of brain regions
- neocortex is like a flat sheet
- neocortex regions are similar and do similar computation
- Mountcastle 1978: vision regions are vision because they receive visual input
- number of regions / connectivity seems to be genetic
- before the neocortex, brain regions were homogenous: spinal cord, brain stem, basal ganglia, …
principles
- common algorithms across neocortex
- hierarchy
- sparse distributed representations (SDR) - vectors with thousands of bits, mostly 0s
- bits of representation encode semantic properties
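- a tiny sketch of why SDR overlap is a meaningful similarity measure (2048 bits with ~2% active is a typical HTM-style choice, not from a specific paper):

```python
import numpy as np

rng = np.random.default_rng(6)
D, k = 2048, 40                              # 2048-bit SDR with 40 active bits (~2%)

def random_sdr():
    v = np.zeros(D, dtype=int)
    v[rng.choice(D, size=k, replace=False)] = 1
    return v

a, b = random_sdr(), random_sdr()
print(int(a @ b))                            # unrelated SDRs: overlap of only ~0-3 bits by chance

# an SDR sharing half of a's active bits has a large, unambiguous overlap
c = a.copy()
c[rng.choice(np.flatnonzero(a), size=k // 2, replace=False)] = 0       # drop half the shared bits
c[rng.choice(np.flatnonzero(a == 0), size=k // 2, replace=False)] = 1  # add new random bits
print(int(a @ c))                            # 20 bits: far above chance
```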
- inputs
- data from the senses
- copy of the motor commands
- “sensory-motor” integration - perception is stable while the eyes move
- patterns are constantly changing
- neocortex tries to control old brain regions which control muscles
- learning: region accepts stream of sensory data + motor commands
- learns from changes in the inputs
- outputs motor commands
- only knows how its output changes its input
- must learn how to control behavior via associative linking
- sensory encoders - takes input and turns it into an SDR
- engineered systems can use non-human senses
- behavior needs to be incorporated fully
- temporal memory - is a memory of sequences
- everything the neocortex does is based on memory and recall of sequences of patterns
- on-line learning
- prediction is compared to what actually happens and forms the basis of learning
- minimize the error of predictions
papers
- “A Theory of How Columns in the Neocortex Enable Learning the Structure of the World”
- network model that learns the structure of objects through movement
- object recognition
- over time individual columns integrate changing inputs to recognize complete objects
- through existing lateral connections
- within each column, neocortex is calculating a location representation
- locations relative to each other = allocentric
- much more motion involved
- multiple columns - integrate spatial inputs - make things fast
- single column - integrate touches over time - represent objects properly
- “Why Neurons Have Thousands of Synapses, A Theory of Sequence Memory in Neocortex”
- learning and recalling sequences of patterns
- neuron with lots of synapses can learn transitions of patterns
- network of these can form robust memory
forgetting
- Continual Lifelong Learning with Neural Networks: A Review
- main issue is catastrophic forgetting / stability-plasticity dilemma
- 2 types of plasticity
- Hebbian plasticity (Hebb 1949), which by itself leads to positive-feedback instability
- compensatory homeostatic plasticity which stabilizes neural activity
- approaches: regularization, dynamic architectures (e.g. add more nodes after each task), memory replay
deeptune-style
- ponce_19_evolving_stimuli: https://www.cell.com/action/showPdf?pii=S0092-8674%2819%2930391-5
- bashivan_18_ann_synthesis
- adept paper
- use kernel regression from CNN embedding to calculate distances between preset images
- select preset images
- verified with macaque v4 recording
- currently the only study that optimizes firing rates of multiple neurons
- pick next stimulus in closed-loop (“adaptive sampling” = “optimal experimental design”)
- J. Benda, T. Gollisch, C. K. Machens, and A. V. Herz, “From response to stimulus: adaptive sampling in sensory physiology”
- find the smallest number of stimuli needed to fit parameters of a model that predicts the recorded neuron’s activity from the stimulus
- maximizing firing rates via genetic algorithms
- maximizing firing rate via gradient ascent
- [C. DiMattina and K. Zhang, “Adaptive stimulus optimization for sensory systems neuroscience”](https://www.frontiersin.org/articles/10.3389/fncir.2013.00101/full)
- 2 general approaches: gradient-based approaches + genetic algorithms
- can put constraints on stimulus space
- stimulus adaptation
- might want iso-response surfaces
- maximally informative stimulus ensembles (Machens, 2002)
- model-fitting: pick to maximize info-gain w/ model params
- using fixed stimulus sets like white noise may be deeply problematic for efforts to identify non-linear hierarchical network models due to continuous parameter confounding (DiMattina and Zhang, 2010)
- use for model selection
population coding
- saxena_19_pop_cunningham: “Towards the neural population doctrine”
- correlated trial-to-trial variability
- Ni et al. showed that the correlated variability in V4 neurons during attention and learning — processes that have inherently different timescales — robustly decreases
- ‘choice’ decoder built on neural activity in the first PC performs as well as one built on the full dataset, suggesting that the relationship of neural variability to behavior lies in a relatively small subspace of the state space.
- decoding
- more neurons only helps if neuron doesn’t lie in span of previous neurons
- encoding
- can train dnn goal-driven or train dnn on the neural responses directly
- testing
- important to be able to test population structure directly
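- a hedged synthetic-data sketch of the point above about the first PC: when the choice-related signal lives in a low-dimensional shared subspace, a decoder on PC1 matches one trained on the full population (everything here is simulated; the actual result is from the work cited in the review):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n_trials, n_neurons = 400, 50
choice = rng.integers(0, 2, n_trials)

# choice information lives along one shared population axis, plus private noise per neuron
shared_axis = rng.normal(size=n_neurons)
X = np.outer(choice - 0.5, shared_axis) + rng.normal(size=(n_trials, n_neurons))

full_acc = cross_val_score(LogisticRegression(max_iter=1000), X, choice, cv=5).mean()
pc1 = PCA(n_components=1).fit_transform(X)
pc1_acc = cross_val_score(LogisticRegression(max_iter=1000), pc1, choice, cv=5).mean()
print(round(full_acc, 2), round(pc1_acc, 2))   # comparable accuracies
```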
- population vector coding - ex. neurons coded for direction sum to get final direction
- reduces uncertainty
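- a sketch of population vector decoding: each neuron votes with its preferred direction, weighted by its (mean-subtracted) firing rate; the cosine tuning and the rate numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
n_neurons = 100
preferred = rng.uniform(0, 2 * np.pi, n_neurons)   # each neuron's preferred direction

true_dir = np.deg2rad(60)
# cosine-tuned firing rates with some noise
rates = np.maximum(0, 10 + 8 * np.cos(true_dir - preferred) + rng.normal(0, 2, n_neurons))

# population vector: sum of preferred-direction unit vectors, weighted by mean-subtracted rate
weights = rates - rates.mean()
pop_vec = (weights[:, None] * np.column_stack([np.cos(preferred), np.sin(preferred)])).sum(axis=0)
decoded = np.degrees(np.arctan2(pop_vec[1], pop_vec[0]))
print(decoded)    # roughly 60 degrees
```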
- correlation coding - correlations between spikes carry extra info
- independent-spike coding - each spike is independent of other spikes within the spike train
- position coding - want to represent a position
- for grid cells, very efficient
- sparse coding
- hard when noise between neurons is correlated
- measures of information
- eda
- plot neuron responses
- calc neuron covariances
interesting misc papers
- berardino 17 eigendistortions
- Fisher info matrix under certain assumptions = $J^T J$ (pixels x pixels), where $J$ is the Jacobian of the function $f$ acting on the pixels $x$
- most and least noticeable distortion directions corresponding to the eigenvectors of the Fisher info matrix
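- a numerical sketch of the eigendistortion recipe on a toy model $f(x) = \tanh(Wx)$ (the model, the 16 “pixels”, and the finite-difference Jacobian are stand-ins for the paper’s trained networks):

```python
import numpy as np

rng = np.random.default_rng(9)
n_pix, n_out = 16, 8
W = rng.normal(size=(n_out, n_pix))
f = lambda x: np.tanh(W @ x)          # toy "perceptual" model acting on flattened pixels

x = rng.normal(size=n_pix)            # base image
eps = 1e-5

# finite-difference Jacobian of f at x: shape (n_out, n_pix)
J = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n_pix)])

F = J.T @ J                           # Fisher information approximation (pixels x pixels)
eigvals, eigvecs = np.linalg.eigh(F)  # ascending eigenvalues

most_noticeable = eigvecs[:, -1]      # distortion direction the model is most sensitive to
least_noticeable = eigvecs[:, 0]      # distortion direction the model barely sees
print(eigvals[-1], eigvals[0])
```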
- gao_19_v1_repr
- don’t learn from images - v1 repr should come from motion like it does in the real world
- repr
- vector of local content
- matrix of local displacement
- why is this repr nice?
- separate reps of static image content and change due to motion
- disentangled rotations
- learning
- predict next image given current image + displacement field
- predict next image vector given current frame vectors + displacement
- kietzmann_18_dnn_in_neuro_rvw
- friston_10_free_energy