SnoozyBiscuit

Interview Question: Transformers

Can someone answer the question below? I was asked it in a data scientist (8 YOE) interview.

Why do large language models need a multi-headed attention layer as opposed to a single attention layer?

Follow-up question: during training, why do the different attention layers get tuned to have different weights?

11mo ago
SparklyRaccoon

Many sentences are ambiguous, and a single attention layer might not capture the intended meaning.
Multi-head attention addresses this by letting each head focus on different parts of the sentence. E.g., 'I saw a man with binoculars.' has two readings: the man is holding the binoculars, or I am looking through them. Each head can attend to a different part of the sentence and support one interpretation, and by combining the heads' outputs the model can decide which meaning better suits the context. Because the heads specialise in different relationships, they end up tuned with different weights.
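Here's a rough numpy sketch of the mechanism (the function name `multi_head_attention` and the tiny dimensions are just for illustration, not from any real library): each head gets its own query/key/value projections, attends over the whole sentence, and the per-head outputs are concatenated and mixed by an output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); Wq/Wk/Wv: (n_heads, d_model, d_head); Wo: (n_heads*d_head, d_model)."""
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        Q, K, V = X @ wq, X @ wk, X @ wv          # per-head projections
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product
        A = softmax(scores, axis=-1)              # every word attends to every word
        heads.append(A @ V)                       # head output: (seq_len, d_head)
    return np.concatenate(heads, axis=-1) @ Wo    # combine all heads

# Toy setup: 6-token "sentence", d_model=8, 2 heads of size 4.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 6, 8, 2, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)  # (6, 8)
```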

SnoozyBiscuit
PayTM · 11mo

I agree that a sentence can have multiple interpretations, but they are usually limited to 2 or 3, if not 1. Why do LLMs have 12 to 50 full self-attention layers?

When you say one attention head focuses on a part of the sentence, what do you mean by that? In a single full self-attention head, the weighted relation of a word with all the other words in the sentence is already captured.
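To make my point concrete, here is a toy sketch of what a single head computes (random numbers standing in for real embeddings, no tokenizer): a softmax-normalised weight for every pair of words, so each word already relates to every other word in the sentence.

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["I", "saw", "a", "man", "with", "binoculars"]
d = 4
Q = rng.normal(size=(len(words), d))  # stand-ins for projected embeddings
K = rng.normal(size=(len(words), d))

scores = Q @ K.T / np.sqrt(d)                           # scaled dot-product
A = np.exp(scores - scores.max(axis=-1, keepdims=True)) # softmax, row-wise
A /= A.sum(axis=-1, keepdims=True)

# Row i = how much word i attends to every word in the sentence.
print(np.round(A[words.index("with")], 2))  # weights over all 6 words
print(A.sum(axis=-1))                       # each row sums to 1.0
```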

SparklyRaccoon

The number of layers to use has been determined empirically, so that the model can extract more complex, deeper relationships. One head here might focus on the relationship between 'man' and 'binoculars', another on two other words, and so on.

SwirlySushi

Multi-head attention learns the various relationships between different words in different latent spaces; this way the heads capture better semantic and syntactic relationships.

Different layers learn different kinds of relationships, since each receives different inputs, and each learns its own weights, i.e. its own key, query, and value projections.
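A toy illustration of the "different latent spaces" point (shapes and names are made up for the example): give two heads independently initialised projections, feed them the same sentence, and they produce different attention patterns.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model, d_head = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))  # the same input goes to every head

def attention_pattern(X, Wq, Wk):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wk.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Two heads with independently initialised projections -> two different
# attention patterns over the very same sentence.
for head in range(2):
    Wq = rng.normal(size=(d_model, d_head))
    Wk = rng.normal(size=(d_model, d_head))
    print(f"head {head} row 0:", np.round(attention_pattern(X, Wq, Wk)[0], 2))
```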

SnoozyBiscuit
PayTM · 11mo

Thanks for your answer. Naive question: why do they learn different semantic relations? Why don't the different heads converge to the same weights?

GroovyWaffle

Bro, ignore these kinds of questions. They're mostly asked by people who have no idea what they want the hire to do. It's been a trend recently where the interviewer wants to show they know stuff, but it's very unlikely to come up in anything you'd do on the job. It doesn't even test ability, just knowledge.
