Introduction

How is it possible, my curious reader, that we can share our thoughts? How can it be that we discuss generative modeling, probability theory, or other interesting concepts? The answer is simple: Language. We communicate because the human species developed a rather distinctive trait that allows us to formulate sounds in a very complex manner to express our ideas and experiences. At some point in our history, some people realized that we forget, we lie, and we can shout as loudly as we can, yet we will not be understood farther than a few hundred meters away. The solution was a huge breakthrough: Writing. This whole mumbling on my side could be summarized in one word: Text. We know how to write (and read), and we can use the word "text" to mean "language" or "natural language" to avoid any confusion with artificial languages like Python or formal languages.
Communication is an essential component of our lives. Therefore, it is no surprise that processing text, or natural language in general, attracted a lot of attention from day one of AI. Since the 1950s, people have been fascinated by the possibility of building communicating bots. Thanks to Alan Turing, we even have a famous test named after him, which states that if a machine can "talk" to a human being without being recognized as a machine, then we can call it "intelligent". The first chatbot was introduced as early as the 1960s. ELIZA, for that was its name, matched patterns to mimic an "intelligible" conversation. But it was far from being "intelligent" and, after all, it was unable to generate responses that would fool good Dr. Turing. It took us over 60 years (nothing in the history of humankind but eons in the history of AI) to get to the point where we can debate intelligence, originality, novelty, and many other aspects that were once exclusive to us, humans. And all of this is possible due to generative modeling.
Obviously, we communicate not only with text but also with images, audio, diagrams, and all sorts of "languages" (e.g., chemical reactions, mathematics). That is why multimodality plays a crucial role in developing the next level of AI systems. In the following, we will talk about natural language processing (NLP) and how it was influenced by deep learning. The combination of deep learning and NLP gave rise to their baby called a Large Language Model (LLM). Eventually, we will outline how LLMs have changed our way of thinking about AI systems and, as a result, talk about Generative AI systems (GenAISys).
Oh, and before we start: an LLM is not equivalent to Generative AI. Generative AI is about all modalities, while LLMs are about processing language. I am not a "term nazi", I am really flexible, but it is quite sad that even experts nowadays conflate the two. It is hurtful for the field and the industry. Alright, enough of being an old prick, let us delve into LLMs and GenAISys!
Natural Language Processing and Deep Learning

Natural Language Processing (NLP) is the field that focuses on building machines that can manipulate human language. NLP aims at developing methods and techniques for processing written language (i.e., text). The tasks of NLP range from text classification (e.g., sentiment analysis, spam detection), text correction (e.g., grammatical error correction), machine translation (e.g., translating with sequence-to-sequence models, a.k.a. seq2seq), semantic analysis (e.g., topic modeling), text generation, text summarization, named entity recognition (NER), and information retrieval (IR) to question answering and chatbots. The main question in NLP, though, is how to represent text such that it is useful for downstream tasks.
Classical NLP methods focused heavily on syntax, i.e., grammar and sentence structure (Chomsky 1965). However, semantics is extremely important for proper communication, and this remained a challenge for a long time. The first attempts to grasp the meaning of a piece of text relied on formulating tokenizers, specialized modules that represent text in a machine-readable manner. The first tokenizers counted words (so-called bag-of-words) or n-grams (sequences of n words) in a document, giving rise to a numerical representation of text. However, such a representation depends on the document length, so it is more appropriate to look at normalized quantities like term frequency (tf), namely, the number of occurrences of a word in a document divided by the number of words in that document, and inverse document frequency (idf), which corresponds to the importance of a term in the whole corpus and is calculated as the logarithm of the number of documents in the corpus divided by the number of documents that contain the term. Typically, the product of the two, the tf-idf representation, is used. These representations provide a lot of information about documents and can be easily manipulated by machines. However, they are rather limited by the choice of n-grams and heavily depend on the available corpus of documents. Simpler tokenizers replace characters (or n-grams of characters) with integers so that text can be treated as a sequence of numbers. These tokenizers do not capture any meaning of the text but play a crucial role in contemporary NLP methods, as we will see shortly.
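To make the bag-of-words and tf-idf recipe concrete, here is a minimal sketch on a toy corpus (the corpus and helper functions are made up for illustration; in practice one would rather use, e.g., scikit-learn's TfidfVectorizer):

```python
# A minimal sketch of bag-of-words counts and tf-idf weights on a toy corpus (pure Python).
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

tokenized = [doc.split() for doc in corpus]
vocab = sorted({w for doc in tokenized for w in doc})

def term_frequency(doc):
    # number of occurrences of a word divided by the number of words in the document
    counts = Counter(doc)
    return {w: counts[w] / len(doc) for w in vocab}

def inverse_document_frequency():
    # log of (number of documents / number of documents containing the term)
    n_docs = len(tokenized)
    return {w: math.log(n_docs / sum(w in doc for doc in tokenized)) for w in vocab}

idf = inverse_document_frequency()
tf_idf = [{w: tf_w * idf[w] for w, tf_w in term_frequency(doc).items()} for doc in tokenized]
print(tf_idf[0]["cat"])  # non-zero: "cat" appears only in the first document
```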
I think you feel what is coming, my curious reader. I can almost read your mind. What are you whispering? It would be amazing to have a method that can learn semantics from data. And, ideally, by applying neural networks? You are right! The big breakthrough in NLP came with the Word2Vec method (Mikolov et al., 2013), which allows learning word embeddings from raw text by training a neural network to predict a word from its context (or the context from the word). In some sense, this paper opened Pandora's box (but in a good sense) that led to the development of language models parameterized by neural networks. Since these neural networks consisted of millions of weights, and later on (and nowadays) even billions of weights, many researchers and practitioners started calling them Large Language Models to distinguish them from more classical language models.
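To give a feeling for the idea (not the exact algorithm of Mikolov et al., which additionally relies on tricks like negative sampling), below is a toy skip-gram sketch: two small layers are trained to predict a context word from a center word, and the input embedding table becomes our word vectors.

```python
# A minimal sketch of the skip-gram idea behind Word2Vec: predict context words from a center
# word. The toy corpus, window size, and hyperparameters are purely illustrative.
import torch
import torch.nn as nn

corpus = "we discuss generative modeling and probability theory".split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}

# (center, context) pairs within a window of size 1
pairs = [(word2idx[corpus[i]], word2idx[corpus[j]])
         for i in range(len(corpus)) for j in (i - 1, i + 1) if 0 <= j < len(corpus)]
centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])

emb_in = nn.Embedding(len(vocab), 16)   # the word embeddings we are after
emb_out = nn.Linear(16, len(vocab))     # predicts the context word
optimizer = torch.optim.Adam(list(emb_in.parameters()) + list(emb_out.parameters()), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    optimizer.zero_grad()
    logits = emb_out(emb_in(centers))   # B x V
    loss = loss_fn(logits, contexts)
    loss.backward()
    optimizer.step()

word_vectors = emb_in.weight.detach()   # one 16-dimensional vector per word
```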
General architectures of LLMs

Before we talk more about details like how to parameterize LLMs (i.e., what kind of neural networks to use) and how to train them, let us briefly discuss their general architectures. No matter what, each LLM requires a tokenizer that turns text into numbers (e.g., integers) and an embedding that eventually turns those numbers into real-valued vectors. Sometimes both modules are treated as one (e.g., vectorizers in scikit-learn). A popular choice of tokenizer these days is byte pair encoding, which greedily merges commonly occurring sub-strings based on their frequency (Gage, 1994). The embedding module serves a single purpose, namely, to map a token represented as a one-hot vector to a real-valued vector of size $D$. Then, after processing the embeddings with a neural network, the output must be de-tokenized back to a string.
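To illustrate the greedy merging step of byte pair encoding, here is a toy BPE learner (a sketch only; production tokenizers work at the byte level, handle special tokens, and are far more efficient):

```python
# A toy byte pair encoding (BPE) learner in the spirit of (Gage, 1994): repeatedly merge the
# most frequent pair of adjacent symbols, starting from characters.
from collections import Counter

def learn_bpe(words, num_merges=10):
    vocab = Counter(tuple(w) for w in words)   # each word is a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # merge the best pair into one symbol
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "lower", "lowest", "newest", "widest"], num_merges=5))
```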
In general, we can distinguish three types of LLMs:
- encoders, which map a sequence of tokens to a (contextualized) representation (e.g., BERT);
- decoders, which generate text autoregressively, token by token (e.g., GPT);
- encoder-decoders, which first encode an input sequence and then decode an output sequence from it (e.g., T5).
In Figure 1.A, you can see a schematic representation of an encoder or a decoder, and in Figure 1.B an encoder-decoder is presented. Now the question is how to parameterize LLMs.
Figure 1. (A) A schematic representation of an encoder or a decoder. (B) A schematic representation of an encoder-decoder.
Parameterizations

The key component of any LLM is its parameterization. No matter whether it is an encoder or a decoder, it is large because it uses hierarchical, (very) deep neural networks. But we cannot just use any neural network, because we deal with text (i.e., sequences). As a result, the neural network must process sequences of vectors and, for decoders, output new text in the so-called causal manner (i.e., without looking "into the future").
If we go back to autoregressive models, we know that we can use recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for processing text. RNNs seem to fit language modeling perfectly due to their intrinsically sequential structure. They were used by (Mikolov et al., 2010) to formulate one of the first RNN-based decoders and then RNN-based encoder-decoders (Cho et al., 2014). As shown in (Jozefowicz et al., 2016), combining CNNs and RNNs brought an improvement in language modeling. This was no surprise, because the first successful encoders were based entirely on convolutional layers (Kalchbrenner et al., 2014) before convolutions were used for decoders (Dauphin et al., 2017) and encoder-decoders (Kalchbrenner et al., 2016). However, RNN-based and CNN-based language models suffered from either forgetting or scaling issues.
The big breakthrough in LLMs came with the introduction of transformers (Vaswani et al., 2017) and their use of (multi-head) attention layers. In all fairness, the transformer also introduced a few important implementation tricks, such as multi-head attention to learn multiple patterns across input tokens, layer normalization to stabilize training and prevent exploding gradients, and positional embeddings to ensure that tokens are treated as a sequence and not as a set. Transformers turned out to be great at scaling up, resulting in models with billions of weights. However, transformers require quadratic time and quadratic memory in the number of tokens (although, with FlashAttention (Dao et al., 2022), the memory footprint becomes linear).
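As a reminder, the core operation is the scaled dot-product attention of (Vaswani et al., 2017), applied in parallel over multiple heads:

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V},$$

where $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are linear projections of the token embeddings and $d_k$ is the dimensionality of the keys. The $N \times N$ matrix $\mathbf{Q}\mathbf{K}^{\top}$ is precisely the source of the quadratic cost mentioned above.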
Recently, many people working on LLMs have focused on making them leaner (we can refer to such models as L3Ms: Lean Large Language Models). There are three main routes to obtain an L3M: (i) quantization of LLMs (i.e., fewer bits per weight, which leads to a smaller memory footprint), (ii) faster training, and (iii) faster inference.
Quantization is a long-standing problem in deep learning, and it poses new challenges for various architectures. For transformers, there have been notable successes, such as quantization-aware training with a modified self-attention (Bondarenko et al., 2024). Recently, it was shown that transformers whose linear-layer weights take values in $\{-1, 0, 1\}$ are much leaner (around $3\times$ lower memory requirements and around $2\times$ lower latency) while achieving performance comparable to float16 models (Ma et al., 2024).
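To give a flavor of what such ternarization could look like, here is a rough sketch of absmean-style weight quantization in the spirit of (Ma et al., 2024); it ignores activation quantization and the actual training recipe.

```python
# A rough sketch of ternarizing a weight matrix to values in {-1, 0, 1}: scale by the mean
# absolute weight, round, and clip. Illustration only, not the full training procedure.
import torch

def ternarize(weight: torch.Tensor, eps: float = 1e-8):
    scale = weight.abs().mean()                        # per-tensor scaling factor
    w_ternary = (weight / (scale + eps)).round().clamp_(-1, 1)
    return w_ternary, scale                            # use w_ternary * scale at inference

W = torch.randn(4, 4)
W_t, s = ternarize(W)
print(W_t)                          # entries in {-1., 0., 1.}
print((W - W_t * s).abs().mean())   # average quantization error
```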
For faster training and inference, various methods have been proposed. For instance, FlashAttention focuses on speeding up attention layers by accounting for I/O operations (reads and writes) between different levels of GPU memory (Dao et al., 2022). A different approach rephrases the attention mechanism as a linear dot-product of kernel feature maps. The resulting linear transformer (Katharopoulos et al., 2020) exploits the associativity of matrix products to reduce the quadratic complexity to linear complexity in the length of the input sequence (sketched below). However, a transformer-based language model is not all we need (or have)! There are other classes of LLMs that operate like RNNs at inference time yet are fully parallelizable during training. Selective State Space Models (S4Ms) are built on top of state space models, i.e., linear dynamical systems with hidden variables/states (Gu et al., 2021). The adjective "selective" comes from making the state-space parameters input-dependent, which lets the model selectively propagate or forget information along the sequence (a role similar to that of attention). One of the extensions of S4Ms, Mamba (Gu & Dao, 2023), outperformed some transformers. Other related models like Retentive Networks (RetNets) (Sun et al., 2023) or RWKV (Peng et al., 2024) successfully compete with the largest transformer-based models while maintaining linear training complexity and constant per-token inference complexity. An interesting LLM, Jamba (Lieber et al., 2024), combines Mamba layers with transformer layers. Such a hybrid approach seems a promising future direction.
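The associativity trick behind the linear transformer is easiest to see in code. The sketch below follows the non-causal formulation of (Katharopoulos et al., 2020) with the feature map $\phi(x) = \mathrm{elu}(x) + 1$; the causal (decoder) variant replaces the sums with running prefix sums.

```python
# Linear attention (non-causal) via the kernel trick of (Katharopoulos et al., 2020):
# instead of softmax(Q K^T) V, compute phi(Q) (phi(K)^T V), which never materializes an
# N x N attention matrix. Shapes: Q, K, V are N x D.
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    phi_q = F.elu(Q) + 1.0                   # N x D, positive feature map
    phi_k = F.elu(K) + 1.0                   # N x D
    kv = phi_k.transpose(0, 1) @ V           # D x D (associativity: linear in N)
    normalizer = phi_q @ phi_k.sum(dim=0, keepdim=True).transpose(0, 1)  # N x 1
    return (phi_q @ kv) / (normalizer + eps)

N, D = 1024, 64
Q, K, V = torch.randn(N, D), torch.randn(N, D), torch.randn(N, D)
out = linear_attention(Q, K, V)              # N x D
```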
Another important extension of current LLMs is based on the idea of Mixture-of-Experts (MoE) (Fedus et al., 2022), a concept well known in machine learning (Jacobs et al., 1991). Instead of learning a single model, we train multiple experts (in practice, typically separate feed-forward blocks within each layer) that can specialize in certain topics. Then, for given input tokens, a router selects a few experts, and the final outcome is calculated as a weighted average of the outputs of the selected experts. An example of an MoE-based LLM is Mixtral-of-Experts (Jiang et al., 2024).
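Conceptually, a sparse MoE layer is just a router plus a set of feed-forward experts. The sketch below is a minimal top-k version (the routing and expert details of Mixtral differ, and load-balancing losses are omitted):

```python
# A minimal sketch of a sparse Mixture-of-Experts layer: a router scores all experts per token,
# the top-k experts are evaluated, and their outputs are combined with renormalized weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)])

    def forward(self, x):                          # x: B x T x D
        scores = self.router(x)                    # B x T x E
        weights, indices = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (indices[..., slot] == e)   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE(dim=32)
y = moe(torch.randn(2, 10, 32))                    # same shape as the input: 2 x 10 x 32
```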
An overview of various parameterizations of LLMs is presented in Table 1, where $N$ denotes the length of the input sequence.
Architecture | Inference: Time | Inference: Memory | Parallel | Training: Time | Training: Memory |
---|---|---|---|---|---|
RNNs | $O(1)$ | $O(1)$ | NO | $O(N)$ | $O(N)$ |
Transformer | $O(N)$ | $O(N)$ | YES | $O(N^2)$ | $O(N)$ |
Linear Transformer | $O(1)$ | $O(1)$ | YES | $O(N)$ | $O(N)$ |
S4M | $O(1)$ | $O(1)$ | YES | $O(N\log N)$ | $O(N)$ |
RWKV/RetNet/Mamba | $O(1)$ | $O(1)$ | YES | $O(N)$ | $O(N)$ |
Learning

As you can imagine, my curious reader, training LLMs is probably a bit more complicated than training ordinary language models (or other models, especially small ones). The answer is yes and no. No, because, after all, they are models that either encode or generate, so nothing we have not dealt with before. Yes, because LLMs are large and we need a lot of data. Very often LLMs are seen as foundation models (FMs) (Bommasani et al., 2021) that assume two phases of training:
- pre-training, in which the model is trained on a large, general corpus of (unlabeled) text;
- fine-tuning, in which the pre-trained model is adapted to a specific task or dataset, typically a much smaller one.
These two steps are quite general and fully depend on the task at hand. For instance, the first Generative Pretrained Transformers (GPTs) (Radford et al., 2018) were pre-trained using the negative log-likelihood, or models were initialized by training with a masked loss, as in Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). Eventually, GPTs were fine-tuned with the negative log-likelihood loss. It is also possible to combine various losses, e.g., $\ell_{\alpha}(\theta) = \ell(\theta) + \alpha\, \ell_{\text{masked}}(\theta)$, which could be seen as a penalized negative log-likelihood objective. This idea was utilized in pre-training LLMs for various problems at once (Dong et al., 2019) or, with $\alpha=1$, for pre-training an LLM for molecules (Izdebski et al., 2023).
The problem with fine-tuning LLMs lies in their size. Ideally, fine-tuning should be quick and cheap, but that is hard to achieve if we deal with models with billions of weights. A possible solution is a technique known as Low-Rank Adaptation (LoRA) (Hu et al., 2021). The idea is relatively simple but crazily smart! Instead of updating a full-rank matrix of weights $\mathbf{W}$ (a side note: LoRA typically updates only the attention matrices since they pose the greatest bottleneck), we introduce a new set of learnable matrices, $\mathbf{A}$ and $\mathbf{B}$, during fine-tuning while keeping $\mathbf{W}$ frozen. As a result, the forward pass looks as follows:

$$\mathbf{h} = \mathbf{W}\mathbf{x} + \mathbf{B}\mathbf{A}\,\mathbf{x} .$$
Where does the whole magic come from then? From the sizes of the matrices! If $\mathbf{W}$ is a $D \times d$ matrix, then $\mathbf{B}$ is a $D \times k$ matrix and $\mathbf{A}$ is a $k \times d$ matrix, where $k \ll d$. As a result, we have a knob, the dimensionality $k$, that controls the number of weights in $\mathbf{A}$ and $\mathbf{B}$, while the resulting matrix $\mathbf{B}\mathbf{A}$ has the same size as $\mathbf{W}$. Hence, after applying LoRA, only a small fraction of the original number of weights is updated during fine-tuning. Typically, enlarging an LLM by even fewer than $1\%$ new trainable weights is enough to achieve results on par with fine-tuning the whole LLM.
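A LoRA-style linear layer could look like the sketch below (an illustration only, not the reference implementation of (Hu et al., 2021); in particular, the scaling factor and dropout used in practice are omitted):

```python
# A minimal sketch of a LoRA-style linear layer: the pretrained weight W is frozen and only
# the low-rank matrices A (k x d) and B (D x k) are trained, so the effective weight is W + BA.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, k: int = 8):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad_(False)          # freeze W
        if self.linear.bias is not None:
            self.linear.bias.requires_grad_(False)
        D, d = linear.out_features, linear.in_features
        self.A = nn.Parameter(torch.randn(k, d) * 0.01)   # k x d
        self.B = nn.Parameter(torch.zeros(D, k))          # D x k, zero-init => no change at start

    def forward(self, x):
        return self.linear(x) + x @ self.A.t() @ self.B.t()   # W x + B A x

layer = LoRALinear(nn.Linear(512, 512), k=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")     # only a few percent of the weights
```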
Famous LLMs

There is a plethora of LLMs, a few of which have already been mentioned here. I do not even dare to start listing them all. In Table 2, I gather only (very subjectively!) selected LLMs.
Name | Type | Size | Reference/Date |
---|---|---|---|
BERT | Encoder | 0.11B - 0.34B | (Devlin et al., 2018) |
BioGPT | Decoder | 0.35B | (Luo et al., 2022) |
Claude-3 | Decoder | 20B - ~2000B | Mar 4, 2024 |
Falcon | Decoder | 7B - 180B | (Almazrouei et al., 2023) |
GPT-2 | Decoder | ~1.5B | (Radford et al., 2019) |
GPT-3 | Decoder | ~175B | September, 2020 |
GPT-4 | Decoder | ~1000B | March 14, 2023 |
LLaMA-2 | Decoder | 7B - 70B | (Touvron et al., 2023) |
LLaMA-3 | Decoder | 8B - 70B | April 18, 2024 |
Mamba | Decoder | 3B | (Gu & Dao, 2023) |
Mistral | Decoder | 7B | September 27, 2023 |
Mixtral | Decoder | 8$\times$7B | (Jiang et al., 2024) |
PaLM | Decoder | 8B - 540B | (Chowdhery et al., 2023) |
RoBERTa | Encoder | 0.125B - 0.355B | (Liu et al., 2019) |
RWKV-5 (Eagle) | Decoder | 0.4B - 7B | (Peng et al., 2024) |
RWKV-6 (Finch) | Decoder | 1.6B - 3B | (Peng et al., 2024) |
T5 | Encoder-Decoder | 0.06B - 11B | (Raffel et al., 2020) |
XLNet | Decoder | 0.11B - 0.34B | (Yang et al., 2019) |
Ufff... that was a lot of text, but what can you expect from a discussion about Large Language Models if not a lot of text? Without further ado, let us delve into some code! In the following, we will build our own GPT model, which we can call teenyGPT, an (extremely) tiny implementation of a decoder LLM. We focus on a small dataset of newspaper headlines and a simple tokenizer working at the character level (sketched below). Our data are batches of shape batch × tokens, where each entry is a token value (an integer), and our loss function is simply the negative log-likelihood.
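A character-level tokenizer can be as simple as the sketch below; reserving the last index as a start-of-sequence token is my own assumption for illustration, chosen to match how sampling later starts from index num_token_vals - 1.

```python
# A minimal character-level tokenizer: map each character to an integer and back.
class CharTokenizer:
    def __init__(self, corpus):
        chars = sorted(set("".join(corpus)))
        self.char2idx = {c: i for i, c in enumerate(chars)}
        self.idx2char = {i: c for c, i in self.char2idx.items()}
        self.sos = len(chars)                  # start-of-sequence token (an assumption here)
        self.vocab_size = len(chars) + 1       # plays the role of num_token_vals

    def encode(self, text):
        return [self.sos] + [self.char2idx[c] for c in text]

    def decode(self, indices):
        return "".join(self.idx2char[i] for i in indices if i in self.idx2char)

tok = CharTokenizer(["breaking news", "more headlines"])
print(tok.encode("news"))
print(tok.decode(tok.encode("news")))          # "news"
```

With text mapped to integer sequences, we can implement the negative log-likelihood loss.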
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class LossFun(nn.Module):
    def __init__(self):
        super().__init__()
        # NLLLoss expects log-probabilities, which is exactly what teenyGPT outputs (log_softmax)
        self.loss = nn.NLLLoss(reduction='none')

    def forward(self, y_model, y_true, reduction='sum'):
        # y_model: B(atch) x T(okens) x V(alues), log-probabilities
        # y_true: B x T, token indices
        B, T, V = y_model.size()
        y_model = y_model.view(B * T, V)
        y_true = y_true.view(B * T,)
        loss_matrix = self.loss(y_model, y_true)  # B*T
        if reduction == 'sum':
            return torch.sum(loss_matrix)
        elif reduction == 'mean':
            loss_matrix = loss_matrix.view(B, T)
            return torch.mean(torch.sum(loss_matrix, 1))
        else:
            raise ValueError('Reduction could be either `sum` or `mean`.')
The essential component is the transformer block. We define it using the PyTorch implementation of the multi-head attention layer; note that we build an explicit causal mask so that a token cannot attend to future tokens.
class TransformerBlock(nn.Module):
    def __init__(self, num_emb, num_neurons, num_heads=4):
        super().__init__()
        # hyperparams
        self.D = num_emb
        self.H = num_heads
        self.neurons = num_neurons
        # components
        self.msha = nn.MultiheadAttention(embed_dim=self.D, num_heads=self.H, batch_first=True)
        self.layer_norm1 = nn.LayerNorm(self.D)
        self.layer_norm2 = nn.LayerNorm(self.D)
        self.mlp = nn.Sequential(nn.Linear(self.D, self.neurons * self.D),
                                 nn.GELU(),
                                 nn.Linear(self.neurons * self.D, self.D))

    def forward(self, x, causal=True):
        # Multi-Head Self-Attention with an explicit causal mask
        # (True entries above the diagonal are masked out, i.e., no looking "into the future")
        if causal:
            T = x.shape[1]
            attn_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        else:
            attn_mask = None
        x_attn, _ = self.msha(x, x, x, attn_mask=attn_mask, need_weights=False)
        # LayerNorm
        x = self.layer_norm1(x_attn + x)
        # MLP
        x_mlp = self.mlp(x)
        # LayerNorm
        x = self.layer_norm2(x_mlp + x)
        return x
Finally, we need to define our teenyGPT with a forward pass through the transformer blocks and a sampling procedure. Additionally, we can define an auxiliary metric, top-1 reconstruction accuracy, which takes the most probable token at each position and checks whether it matches the next token in the input.
class teenyGPT(nn.Module):
    def __init__(self, num_tokens, num_token_vals, num_emb, num_neurons, num_heads=2,
                 dropout_prob=0.1, num_blocks=10, device='cpu'):
        super().__init__()
        # Remember, always credit the author, even if it's you ;)
        print('teenyGPT by JT.')
        # hyperparams
        self.device = device
        self.num_tokens = num_tokens
        self.num_token_vals = num_token_vals
        self.num_emb = num_emb
        self.num_blocks = num_blocks
        # embedding layer
        self.embedding = torch.nn.Embedding(num_token_vals, num_emb)
        # positional embedding
        self.positional_embedding = nn.Embedding(num_tokens, num_emb)
        # transformer blocks
        self.transformer_blocks = nn.ModuleList()
        for _ in range(num_blocks):
            self.transformer_blocks.append(TransformerBlock(num_emb=num_emb, num_neurons=num_neurons, num_heads=num_heads))
        # output layer (logits; log_softmax is applied in transformer_forward)
        self.logits = nn.Sequential(nn.Linear(num_emb, num_token_vals))
        # dropout layer
        self.dropout = nn.Dropout(dropout_prob)
        # loss function
        self.loss_fun = LossFun()

    def transformer_forward(self, x, causal=True, temperature=1.0):
        # x: B(atch) x T(okens)
        # embedding of tokens
        x = self.embedding(x)  # B x T x D
        # embedding of positions
        pos = torch.arange(0, x.shape[1], dtype=torch.long).unsqueeze(0).to(self.device)
        pos_emb = self.positional_embedding(pos)
        # dropout of embedding of inputs
        x = self.dropout(x + pos_emb)
        # transformer blocks
        for i in range(self.num_blocks):
            x = self.transformer_blocks[i](x, causal=causal)
        # output logits -> log-probabilities
        out = self.logits(x)
        return F.log_softmax(out / temperature, 2)

    @torch.no_grad()
    def sample(self, batch_size=4, temperature=1.0):
        # start every sequence with the special token (the last token value)
        x_seq = np.asarray([[self.num_token_vals - 1] for i in range(batch_size)])
        # sample next tokens
        for i in range(self.num_tokens - 1):
            xx = torch.tensor(x_seq, dtype=torch.long, device=self.device)
            # process x and calculate log_softmax
            x_log_probs = self.transformer_forward(xx, temperature=temperature)
            # sample i-th tokens
            x_i_sample = torch.multinomial(torch.exp(x_log_probs[:, i]), 1).to(self.device)
            # update the batch with new samples
            x_seq = np.concatenate((x_seq, x_i_sample.to('cpu').detach().numpy()), 1)
        return x_seq

    @torch.no_grad()
    def top1_rec(self, x, causal=True):
        # pick the most probable next token and compare it with the actual next token
        x_prob = torch.exp(self.transformer_forward(x, causal=True))[:, :-1, :].contiguous()
        _, x_rec_max = torch.max(x_prob, dim=2)
        return torch.sum(torch.mean((x_rec_max.float() == x[:, 1:].float().to(self.device)).float(), 1).float())

    def forward(self, x, causal=True, temperature=1.0, reduction='mean'):
        # get log-probabilities
        log_prob = self.transformer_forward(x, causal=causal, temperature=temperature)
        # negative log-likelihood: predict token t+1 from tokens up to t
        return self.loss_fun(log_prob[:, :-1].contiguous(), x[:, 1:].contiguous(), reduction=reduction)
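For completeness, a training loop could look like the sketch below; the dummy random data, data sizes, learning rate, and number of epochs are placeholders for illustration, while the architecture hyperparameters follow the setup described next.

```python
# A minimal training loop for teenyGPT on dummy data (placeholders only, not the exact setup
# behind Figure 2). Real training would use tokenized headlines instead of random tokens.
from torch.utils.data import DataLoader, TensorDataset

num_tokens, num_token_vals = 64, 128                 # placeholder sequence length / vocab size
dummy_data = torch.randint(0, num_token_vals, (256, num_tokens))
train_loader = DataLoader(TensorDataset(dummy_data), batch_size=32, shuffle=True)

model = teenyGPT(num_tokens=num_tokens, num_token_vals=num_token_vals, num_emb=32,
                 num_neurons=128, num_heads=8, num_blocks=4, device='cpu')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for (batch,) in train_loader:                    # batch: B x T token indices
        optimizer.zero_grad()
        loss = model(batch, reduction='mean')        # negative log-likelihood
        loss.backward()
        optimizer.step()
    print(f'epoch {epoch}: nll = {loss.item():.3f}')

samples = model.sample(batch_size=4, temperature=1.0)   # generate new token sequences
```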
You can find the full code here: Link. For $128$ neurons in the MLPs, $8$ attention heads, $4$ transformer blocks, and an embedding size equal to $32$, we get a model with about $1$ million weights and the performance presented in Figure 2.
Figure 2. (A, B) Performance of teenyGPT (plots not reproduced here). (C) Examples of generated headlines with less stochastic sampling:
No. | Generation |
---|---|
1 | british announces star market star market contract coronavirus case |
2 | british announces show announces set series |
3 | breaking say court coronavirus test positive |
4 | coronavirus death trump court star reade say say announcement say |
5 | coronavirus death test positive coronavirus case state state state state state state court |
6 | coronavirus case start star reade country test positive |
7 | british announces set coronavirus test positive |
8 | coronavirus case show death committee |
9 | michigan pro recovered country coronavirus test positive |
10 | coronavirus case star set state state state state state state state state allegation |
(D) Examples of generated headlines with more stochastic sampling:
No. | Generation |
---|---|
1 | waterment share brand fabst dailey available hystering |
2 | pm rate murder test personn hirr kannish reported homeless tested decision |
3 | torotteney headlines sentencer everyone |
4 | milamican combinize give test pair catch attack |
5 | ugc harry first bitcoin boat wuhan chip man end masking precain |
6 | true fansed andrap new full forecasts |
7 | galaxy nigeria check police liverpool dam new deal threatens order |
8 | senator addeedly discharge tonie |
9 | grime everdille get improves reveals fan talk new revela false |
10 | uk coronavirus delivery due five aviversau despite return horror year |
As you can notice, my curious reader, the teeny LLM trains successfully, even with only a small amount of data and around 1 million weights! Comparing the samples in Figure 2.C and Figure 2.D, we can clearly see that sampling too stochastically results in nonsensical headlines. Overall, even though the model does not work perfectly, it has learned to combine characters in such a way that they constitute words (most of the time), and some headlines even make sense.
There are so many topics in LLMs at the moment that it is nearly impossible to cover them all. In the following, I point out some developments that are important and deserve a separate write-up, but I leave these to others. Here are some highlights of the LLM community:
- In-context learning: LLMs can pick up new tasks from a few examples provided in the prompt, without any weight updates (Dong et al., 2022).
- Alignment with human preferences: fine-tuning LLMs to follow instructions with reinforcement learning from human feedback, e.g., using PPO (Ouyang et al., 2022; Schulman et al., 2017), or directly with Direct Preference Optimization (Rafailov et al., 2024).
- Code generation: LLMs pre-trained on programming languages (Feng et al., 2020) power tools like GitHub Copilot (GitHub Copilot).
- Combining LLMs with knowledge graphs to ground their outputs in structured knowledge (Pan et al., 2024).
- Safety: even aligned LLMs remain vulnerable to universal and transferable adversarial attacks (Zou et al., 2023).
These topics constitute only the tip of the iceberg. The speed at which LLMs evolve is unprecedented. There are multiple extremely interesting new concepts, but, in my opinion, some directions are definitely overhyped. Nevertheless, Generative AI owes a lot to LLMs because they have changed the AI landscape and brought generative modeling to a completely new level.
(Almazrouei et al., 2023) Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., ... & Penedo, G. (2023). The falcon series of open language models. arXiv preprint arXiv:2311.16867.
(Bommasani et al., 2021) Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
(Bondarenko et al., 2024) Bondarenko, Y., Nagel, M., & Blankevoort, T. (2024). Quantizable transformers: Removing outliers by helping attention heads do nothing. Advances in Neural Information Processing Systems, 36.
(Cho et al., 2014) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
(Chomsky 1965) Chomsky, N. (1965). Aspects of the theory of syntax. Special technical report. Research laboratory of electronics. Massachusetts Institute of Technology.
(Chowdhery et al., 2023) Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., ... & Fiedel, N. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1-113.
(Dao et al., 2022) Dao, T., Fu, D., Ermon, S., Rudra, A., & Ré, C. (2022). Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35, 16344-16359.
(Dauphin et al., 2017) Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language modeling with gated convolutional networks. In International conference on machine learning (pp. 933-941). PMLR.
(Devlin et al., 2018) Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
(Dong et al., 2019) Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., ... & Hon, H. W. (2019). Unified language model pre-training for natural language understanding and generation. Advances in neural information processing systems, 32.
(Dong et al., 2022) Dong, Q., Li, L., Dai, D., Zheng, C., Wu, Z., Chang, B., ... & Sui, Z. (2022). A survey on in-context learning. arXiv preprint arXiv:2301.00234.
(Fedus et al., 2022) Fedus, W., Dean, J., & Zoph, B. (2022). A review of sparse expert models in deep learning. arXiv preprint arXiv:2209.01667.
(Feng et al., 2020) Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., ... & Zhou, M. (2020). Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155.
(Gage, 1994) Gage, P. (1994). A new algorithm for data compression. The C Users Journal, 12(2), 23-38.
(GitHub Copilot) “GitHub Copilot · Your AI pair programmer.” [Online]. Available: https://copilot.github.com/, Accessed: April 23, 2024
(Gu et al., 2021) Gu, A., Goel, K., and Re, C. (2021). Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations
(Gu & Dao, 2023) Gu, A., & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
(Hu et al., 2021) Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
(Izdebski et al., 2023) Izdebski, A., Weglarz-Tomczak, E., Szczurek, E., & Tomczak, J. M. (2023). De Novo Drug Design with Joint Transformers. arXiv preprint arXiv:2310.02066.
(Jacobs et al., 1991) Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural computation, 3(1), 79-87.
(Jiang et al., 2024) Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., et al. (2024). Mixtral of experts. arXiv preprint arXiv:2401.04088.
(Jozefowicz et al., 2016) Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
(Kalchbrenner et al., 2014) Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
(Kalchbrenner et al., 2016) Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. V. D., Graves, A., & Kavukcuoglu, K. (2016). Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
(Katharopoulos et al., 2020) Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning (pp. 5156-5165). PMLR.
(Lieber et al., 2024) Lieber, O., Lenz, B., Bata, H., Cohen, G., Osin, J., Dalmedigos, I., ... & Shoham, Y. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv preprint arXiv:2403.19887.
(Liu et al., 2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
(Luo et al., 2022) Luo, R., Sun, L., Xia, Y., Qin, T., Zhang, S., Poon, H., & Liu, T. Y. (2022). BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in bioinformatics, 23(6), bbac409.
(Ma et al., 2024) Ma, S., Wang, H., Ma, L., Wang, L., Wang, W., Huang, S., ... & Wei, F. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv preprint arXiv:2402.17764.
(Mikolov et al., 2010) Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech (Vol. 2, No. 3, pp. 1045-1048).
(Mikolov et al., 2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
(Ouyang et al., 2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, 27730-27744.
(Pan et al., 2024) Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., & Wu, X. (2024). Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering.
(Peng et al., 2024) Peng, B., Goldstein, D., Anthony, Q., Albalak, A., Alcaide, E., Biderman, S., ... & Zhu, R. J. (2024). Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence. arXiv preprint arXiv:2404.05892.
(Radford et al., 2018) Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
(Radford et al., 2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
(Rafailov et al., 2024) Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
(Raffel et al., 2020) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140), 1-67.
(Schulman et al., 2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
(Sun et al., 2023) Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J. and Wei, F. (2023). Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621.
(Touvron et al., 2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., ... & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
(Vaswani et al., 2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
(Yang et al., 2019) Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32.
(Zou et al., 2023) Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.