The first approach is not a good choice: using the [CLS] token embedding directly may not yield a good sentence representation if the BERT model was fine-tuned for a task other than similarity matching.
- Task-specific embeddings: the [CLS] token embedding is shaped by the objective the BERT model was trained on, so it is not guaranteed to capture general sentence meaning.
- Averaging: taking the mean of all token embeddings gives a more general representation of the input, since it balances the contextual embeddings of every token.
Either averaging the token embeddings or pooling them (e.g. passing them through another dense layer) will work; a sketch of mean pooling follows below.
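A minimal sketch of mean pooling, assuming Hugging Face `transformers`, PyTorch, and the `bert-base-uncased` checkpoint; the function name and example sentences are illustrative, not from the original:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def mean_pooled_embeddings(sentences):
    # Tokenize with padding so sentences of different lengths can be batched.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    token_embeddings = outputs.last_hidden_state           # (batch, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
    # Zero out padding positions, then average only over real tokens.
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts                                 # (batch, hidden)

# Example: cosine similarity between two mean-pooled sentence embeddings.
emb = mean_pooled_embeddings(["The cat sat on the mat.", "A dog lay on the rug."])
similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(similarity.item())
```

Masking out padding tokens before averaging matters when sentences are batched; otherwise the padding embeddings would dilute the mean.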