OpenAI develops a deep model that predicts the next piece of text, image, or sound in a sequence
News from: iThome & OpenAI.
OpenAI's deep neural network model, the Sparse Transformer, uses an improved attention algorithm to extract more patterns from sequences and predict the next piece of text, image, or sound in a sequence.
OpenAI recently developed a deep neural network model called the Sparse Transformer, which uses an improved attention algorithm to extract more patterns from sequences and predict the next piece of text, image, or sound. OpenAI points out that an existing challenge in AI research is modeling complex data with long-range, hard-to-detect interdependencies, such as images, videos, or audio. The Sparse Transformer incorporates a reworked self-attention mechanism, along with several other improvements, to tackle this challenge.
In the past, models used to predict such data were either designed for a single domain or difficult to scale to many different kinds of sequences. In contrast, the deep neural network OpenAI has now developed can use hundreds of layers to model sequences of tens of thousands of elements, and can be applied across multiple domains. OpenAI will use this model to help build AI systems that better understand the world.
In the Transformer model, every output element is connected to every input element, and the weight between each input and output is dynamically recalculated depending on the circumstances, a process called attention. Although this mechanism is believed to make Transformers more flexible than models with fixed connectivity patterns, in practice every layer must generate an N x N attention matrix, so applying it to data types with many elements, such as images or raw audio files, consumes enormous amounts of memory and compute.
One way to reduce memory usage is to recompute the attention matrix from checkpoints during backpropagation, a technique widely used in deep learning to cut memory consumption. Applied to the Transformer's attention matrix computation, it makes the memory cost independent of the number of layers, so OpenAI can now train far deeper networks than before; in OpenAI's experiments, Transformers reached up to 128 layers. To train these deeper models, OpenAI also adjusted the ordering of operations in the Transformer and its initialization scheme, and has published the detailed research as a paper.
However, even computing a single attention matrix becomes impractical for very large inputs. OpenAI therefore switched to sparse attention patterns, in which each output position computes weights only from a subset of input positions. When the subset is small relative to the full input set, the attention computation remains tractable even for very long sequences.
To implement this approach, OpenAI first visualized the learned attention patterns of Transformer models used for image prediction and found many interpretable, structured sparsity patterns. When the attended inputs concentrate on small subsets and show a high degree of regularity, the layer is amenable to sparsification. However, while many layers exhibit sparse structure, some layers clearly show dynamic attention spread across the entire image. To preserve the model's ability to learn these kinds of patterns, OpenAI applied a two-dimensional factorization to the attention matrix, so that the model can still reach every position in the image through sparse attention.
---------------------------------------------------------------
Generative Modeling with Sparse Transformers
We’ve developed the Sparse Transformer, a deep neural network which sets new records at predicting what comes next in a sequence—whether text, images, or sound. It uses an algorithmic improvement of the attention mechanism to extract patterns from sequences 30x longer than possible previously.
One existing challenge in AI research is modeling long-range, subtle interdependencies in complex data like images, videos, or sounds. The Sparse Transformer incorporates a reformulation of the Transformer self-attention mechanism, along with several other improvements, to apply it directly to these rich data types. Previously, models used on these data were specifically crafted for one domain or difficult to scale to sequences more than a few thousand elements long. In contrast, our model can handle sequences with tens of thousands of elements using hundreds of layers, achieving state-of-the-art performance across multiple domains. At OpenAI, we’re using it to help us build AI systems that possess a greater ability to understand the world.
Deep Attention
In Transformers, every output element is connected to every input element, and the weightings between them are dynamically calculated based upon the circumstances, a process called attention. While it is believed that this allows Transformers to be more flexible than models with fixed connectivity patterns, in practice it requires the creation of an attention matrix for every layer and attention head, which can consume large amounts of memory when applied to data types with many elements, like images or raw audio.
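To make the memory issue concrete, here is a minimal single-head attention sketch in NumPy (an illustration only, not OpenAI's implementation; the `dense_attention` helper is hypothetical). The N x N matrix `A` is what every layer and head must materialize:

```python
import numpy as np

def dense_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a length-N sequence
    (causal masking omitted for brevity)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # each (N, d)
    A = Q @ K.T / np.sqrt(K.shape[-1])                # the (N, N) attention matrix
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A @ V                                      # (N, d) outputs

# Small demo; for a 64x64x3 image flattened to N = 12,288 positions,
# A alone would hold N*N ~ 1.5e8 entries per layer and head.
N, d = 1024, 64
X = np.random.randn(N, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = dense_attention(X, Wq, Wk, Wv)
```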
One way to reduce this is by recomputing the attention matrix from checkpoints during backpropagation, a well-established technique in deep learning for reducing memory usage at the cost of more computation. When done for the attention matrix in Transformers, it means the largest memory cost becomes independent of the number of layers, letting us train networks with substantially greater depth than possible previously. In practice, we found that Transformers with depth up to 128 layers outperformed shallower networks on benchmark tasks like CIFAR-10.
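A hedged PyTorch sketch of the same recompute-instead-of-store trade-off (not the authors' code; the `CheckpointedBlock` class is illustrative): wrapping a block in `torch.utils.checkpoint` discards its activations, including the attention matrix, after the forward pass and recomputes them during backpropagation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """A Transformer block whose activations (including the attention
    matrix) are recomputed in the backward pass instead of stored."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def _forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + out)

    def forward(self, x):
        # use_reentrant=False is the currently recommended checkpoint mode
        return checkpoint(self._forward, x, use_reentrant=False)

x = torch.randn(2, 256, 128, requires_grad=True)
block = CheckpointedBlock(d_model=128, n_heads=8)
block(x).sum().backward()   # attention is recomputed here, not cached
```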
To train these models with increased depth, we made several adjustments to the ordering of operations in the transformer and modified the initialization scheme. Full details can be seen in our paper.
Sparse Attention
Even computing a single attention matrix, however, can become impractical for very large inputs. We instead use sparse attention patterns, where each output position only computes weightings from a subset of input positions. When the subset is small relative to the full set of inputs (say, √N elements instead of N elements), the resulting attention computation becomes tractable even for very long sequences, with an algorithmic complexity of O(N√N) instead of O(N²).
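A minimal sketch of what "computing weightings from a subset of input positions" means (plain NumPy, not the released kernels; `subset_attention` and `idx` are illustrative names). Each row of `idx` lists the handful of input positions that one output position is allowed to attend to:

```python
import numpy as np

def subset_attention(Q, K, V, idx):
    """Attend each query i only to the key positions in idx[i].
    With idx of shape (N, k) and k ~ sqrt(N), the cost is roughly
    O(N*sqrt(N)) rather than the O(N^2) of full attention."""
    N, d = Q.shape
    out = np.empty_like(Q)
    for i in range(N):
        Ki, Vi = K[idx[i]], V[idx[i]]                 # (k, d) gathers
        s = Ki @ Q[i] / np.sqrt(d)                    # k scores, not N
        w = np.exp(s - s.max())
        w = w / w.sum()                               # softmax over the subset
        out[i] = w @ Vi
    return out

# Example subset: the k most recent positions (clamped at 0 near the start).
N, d, k = 4096, 64, 64
idx = np.maximum(np.arange(N)[:, None] - np.arange(k - 1, -1, -1)[None, :], 0)
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = subset_attention(Q, K, V, idx)
```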
To assess the feasibility of the approach, we first visualized the learned attention patterns for deep Transformers on images, finding that many showed interpretable and structured sparsity patterns. Each of the below images shows which input pixels (highlighted in white) are attended to by a given attention head in order to predict the next value in the image. When the input portions are focused on small subsets and show a high degree of regularity, the layer is amenable to sparsification. A sampling of them is displayed here for a 128-layer model on CIFAR-10 images:
While many layers displayed sparse structure, some layers clearly display dynamic attention that stretches over the entirety of the image. In order to preserve the ability of our network to learn such patterns, we implemented a two-dimensional factorization of the attention matrix, where the network can attend to all positions through two steps of sparse attention.
The first version, strided attention, is roughly equivalent to each position attending to its row and its column, and is similar to the attention pattern learned by the network above. (Note that the column attention can be equivalently formulated as attending to the row of the transposed matrix.) The second version, fixed attention, attends to a fixed column and the elements after the latest column element, a pattern we found useful when the data doesn't fit into a two-dimensional structure (like text). For more details, we refer readers to our paper.
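The two factorizations can be sketched as boolean masks, based on the description in the paper (a simplified sketch: the `strided_masks`/`fixed_masks` helpers are hypothetical, the fixed pattern here uses a single summary column per block, and the real implementation uses block-sparse kernels rather than dense masks):

```python
import numpy as np

def strided_masks(N, stride):
    """Two-step strided factorization: step 1 attends to the previous
    `stride` positions (the 'row'), step 2 to every stride-th position
    behind the query (the 'column').  Composed, a query can reach any
    earlier position in two hops."""
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    causal = j <= i
    local  = causal & (i - j < stride)                # row / local window
    column = causal & ((i - j) % stride == 0)         # strided column
    return local, column

def fixed_masks(N, stride):
    """Two-step fixed factorization: step 1 attends within the current
    block, step 2 attends to a fixed 'summary' column at the end of
    each previous block."""
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    causal = j <= i
    block   = causal & (j // stride == i // stride)   # within-block
    summary = causal & (j % stride == stride - 1)     # fixed columns
    return block, summary

local, column  = strided_masks(64, 8)
block, summary = fixed_masks(64, 8)
```

The strided pair suits data with a natural 2-D layout such as images (stride ≈ image width), while the fixed pair is the fallback for 1-D data like text.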
Experimental results
Sparse Transformers set new state-of-the-art scores for density estimation of CIFAR-10, Enwik8, and Imagenet 64.
We also found that sparse attention achieved lower loss than full attention, in addition to being significantly faster (see our paper for comparisons). This may point to a useful inductive bias from our sparsity patterns, or an underlying optimization issue with dense attention.
Generating images
Transformers that use sparse attention seem to have a notion of global structure, which can be qualitatively evaluated by looking at image completions. Here we visualize a model trained on ImageNet:
We also generated fully unconditional samples with an unadjusted softmax temperature of 1.0. These models are trained using the maximum likelihood objective, which is well-known to cover all modes of the data (including potentially nonexistent ones) instead of increasing fidelity of a smaller portion of the data. Sampling from these models with unadjusted temperature lets us see the full distribution of images that the model believes exists in the world. As a result, some samples can appear strange.
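For reference, the softmax temperature simply rescales the model's logits before sampling, and a temperature of 1.0 leaves the learned distribution untouched. A minimal sketch (the `sample` helper is illustrative, not from the released code):

```python
import numpy as np

def sample(logits, temperature=1.0, rng=np.random.default_rng()):
    """Draw one token/pixel value from the logits; T = 1.0 samples the
    raw model distribution, T < 1.0 sharpens it toward likely modes."""
    z = logits / temperature
    p = np.exp(z - z.max())
    p = p / p.sum()
    return rng.choice(len(p), p=p)

print(sample(np.array([2.0, 1.0, 0.1])))   # unadjusted, T = 1.0
```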
Generating raw audio waveforms
Sparse Transformers can also be adapted to generate raw audio instead of images by simply changing the position embeddings. As deep learning expands to novel data types, we believe the ease of specifying inductive biases with this class of networks will be a useful tool.
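A hedged sketch of what "changing the position embeddings" might look like (class and parameter names here are illustrative, not from the released code): images get factorized row/column embeddings, while raw audio gets a plain 1-D timestep embedding.

```python
import torch
import torch.nn as nn

class ImagePositionEmbedding(nn.Module):
    """Factorized 2-D positions for an H x W image flattened row-major."""
    def __init__(self, H, W, d_model):
        super().__init__()
        self.W = W
        self.row = nn.Embedding(H, d_model)
        self.col = nn.Embedding(W, d_model)

    def forward(self, positions):            # positions: (N,) flat indices
        return self.row(positions // self.W) + self.col(positions % self.W)

class AudioPositionEmbedding(nn.Module):
    """Plain 1-D timestep embedding for raw waveform samples."""
    def __init__(self, max_len, d_model):
        super().__init__()
        self.step = nn.Embedding(max_len, d_model)

    def forward(self, positions):
        return self.step(positions)

H, W, d_model = 64, 64, 256
img_emb = ImagePositionEmbedding(H, W, d_model)(torch.arange(H * W))
aud_emb = AudioPositionEmbedding(65000, d_model)(torch.arange(1000))
```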
This model was trained on raw classical music clips and uses sparse attention to generate sequences of length 65,000. This corresponds to ~5 seconds of raw audio, and we have concatenated several samples together in each of the clips below.
Code release
Normally, implementing sparse attention would involve slicing query and key matrices into blocks, so to ease experimentation we implemented a set of block-sparse kernels which efficiently perform these operations on the GPU. We open-source these kernels and provide example sparse attention functions in this repository.
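As a rough picture of what such kernels compute (a pure-NumPy reference sketch, far slower than fused GPU kernels; `block_sparse_scores` is an illustrative name): queries and keys are split into fixed-size blocks, and scores are formed only for the (query-block, key-block) pairs marked in a block layout.

```python
import numpy as np

def block_sparse_scores(Q, K, layout, block):
    """Compute Q @ K.T only for block pairs where layout[qb, kb] is True.
    Q, K: (N, d); layout: (N//block, N//block) boolean block mask."""
    N, d = Q.shape
    nb = N // block
    scores = np.full((N, N), -np.inf)          # dense here only for illustration
    for qb in range(nb):
        for kb in range(nb):
            if layout[qb, kb]:
                qs = slice(qb * block, (qb + 1) * block)
                ks = slice(kb * block, (kb + 1) * block)
                scores[qs, ks] = Q[qs] @ K[ks].T / np.sqrt(d)
    return scores

# Lower-triangular block layout: causal attention at block granularity.
N, d, block = 256, 64, 32
layout = np.tril(np.ones((N // block, N // block), dtype=bool))
Q, K = np.random.randn(N, d), np.random.randn(N, d)
scores = block_sparse_scores(Q, K, layout, block)
```

A real block-sparse kernel never materializes the dense score matrix at all; it only allocates and computes the blocks the layout marks as nonzero, which is what makes the memory and compute savings possible.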
Future work and limitations
- The sparse attention patterns we introduced are only preliminary steps in the direction of efficient modeling of long sequences. We think exploring different patterns and combinations of sparsity is useful, and that learning sparse patterns is a particularly promising avenue of research for the next generation of neural network architectures.
- Even with the improvements we described above, autoregressive sequence generation still seems impractical for very high resolution images or video. The optimized attention operations we have introduced, however, may be useful primitives to combine with other approaches to modeling high dimensional data, like multi-scale approaches.
If you are interested in advancing AI capabilities and helping further our mission of ensuring they benefit humanity, we’re hiring!