SnoozyBiscuit

Interview Question: Transformers

Can someone answer the question below? I was asked this in a data scientist (8 YOE) interview.

Why do large language models need a multi-headed attention layer as opposed to a single attention layer?

Follow-up question: during training, why do the different attention layers get tuned to have different weights?

16mo ago
SparklyRaccoon

Many sentences are ambiguous, and a single attention layer might not capture the intended meaning of the sentence.
Multi-head attention addresses this by allowing each head to focus on different parts of the sentence. For example, 'I saw a man with binoculars.' has two readings. Multi-head attention lets each head focus on one part of the sentence and one interpretation of it, and by combining that information the model can decide which meaning better suits the context, based on the different weights. That is why the heads are tuned to have different weights.
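The split-and-combine idea described above can be sketched in NumPy. This is a toy, untrained example: all the matrices are random placeholders for weights that would normally be learned, and the head count and dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model). Each head works on its own slice of the
    projected Q/K/V, so each head can attend to different token pairs."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outputs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        # each head computes its own attention pattern over the sentence
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V[:, s])
    # concatenate the per-head outputs and mix them back together
    return np.concatenate(outputs, axis=-1) @ Wo

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 8, 2, 5
X = rng.normal(size=(seq_len, d_model))          # toy token embeddings
Ws = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads)
print(out.shape)  # (5, 8)
```

The "combining the information" step in the answer corresponds to the final concatenation plus output projection `Wo`, which is where the model learns how much to trust each head's interpretation.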

SnoozyBiscuit
PayTM · 16mo

I agree that a sentence can have multiple interpretations, but they are usually limited to 2 or 3, if not 1. Then why do LLMs have 12 to 50 full self-attention layers?

When you say one attention head focuses on a part of the sentence, what do you mean by that? In a single full self-attention head, the weighted relation of a word with all other words of the sentence is captured.
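That full pairwise relation can be seen numerically. Below is a toy single head with random (untrained) projection matrices: the attention matrix has one weight for every word pair, and each row is a probability distribution over all words in the sentence.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 5, 8
X = rng.normal(size=(n_tokens, d))   # toy embeddings, one row per word
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))

# one full self-attention head: every word is scored against every word
A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
print(A.shape)        # (5, 5): a weight for every word pair
print(A.sum(axis=1))  # each row sums to (approximately) 1
```

So "focuses on a part of the sentence" doesn't mean a head only sees some words; it means the learned projections can make some of those pairwise weights much larger than others, concentrating the row's probability mass on particular words.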

SparklyRaccoon

The number of layers to use has been determined empirically, so that the model can extract complex, deeper understanding. One head here will focus on the relationship between 'man' and 'binoculars', another on two other words, and so on.

SwirlySushi

Multi-head attention learns the various relationships between different words in different latent spaces; this way it captures better semantic and syntactic relationships.

Different layers receive different inputs and learn different kinds of relationships, i.e. different key, query, and value weights.

SnoozyBiscuit
PayTM · 16mo

Thanks for your answer. Naive question: why do they learn different semantic relations? Why don't the different heads converge to the same weights?
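For what it's worth, the standard explanation is symmetry breaking by random initialisation: heads that start with identical weights receive identical gradients at every step and remain clones, while heads that start differently follow different gradient paths. A minimal NumPy sketch, with two tanh "heads" standing in for attention heads (the data and architecture are toy assumptions, not a real transformer):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(16, 4))   # toy inputs
y = rng.normal(size=(16, 1))   # toy targets

def train_two_heads(W1, W2, steps=200, lr=0.1):
    """Two parallel 'heads' whose outputs are summed, standing in for
    parallel attention heads feeding one output projection."""
    for _ in range(steps):
        h1, h2 = np.tanh(X @ W1), np.tanh(X @ W2)
        err = (h1 + h2) - y
        # backprop through tanh: each head's gradient depends on its own weights
        W1 -= lr * X.T @ (err * (1 - h1**2)) / len(X)
        W2 -= lr * X.T @ (err * (1 - h2**2)) / len(X)
    return W1, W2

# identical init -> identical gradients at every step -> heads stay clones
W = rng.normal(size=(4, 1))
a_same, b_same = train_two_heads(W.copy(), W.copy())
print(np.allclose(a_same, b_same))  # True

# different random init breaks the symmetry -> heads specialise differently
a, b = train_two_heads(rng.normal(size=(4, 1)), rng.normal(size=(4, 1)))
print(np.allclose(a, b))  # False
```

Nothing forces the heads apart; they simply start in different places, and gradient descent has no reason to pull them back together when differing is useful for lowering the loss.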

GroovyWaffle

Bhai, ignore these kinds of questions. They're mostly asked by people who have no idea what they want the hire to do. It's been a trend recently where the interviewer wants to show they know stuff. But it's very unlikely to come up in anything you'd do on the job. It doesn't even test ability, just knowledge.
