Transformers

An example of how to integrate HuggingFace's transformers library with fastai

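In this tutorial, we will see how to fine-tune a pretrained transformer model from HuggingFace's transformers library using fastai. We will use the mid-level API to gather the data. Although this tutorial is self-contained, it may help to check the imagenette tutorial for a second look at the mid-level API in computer vision (it also includes a gentle introduction using the high-level API).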

Importing a transformers pretrained model

First, we need the transformers library. If you haven't installed it yet, install it now:

!pip install -Uq transformers

Then we import what we need. Here we will fine-tune the pretrained GPT2 model on wikitext-2. For this, we need GPT2LMHeadModel (since we want a language model) and GPT2TokenizerFast to prepare the data.

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

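Several versions of this GPT2 model are available; see the transformers documentation for details. Here we will use the basic version (which already takes up a lot of memory!). You can change the model used by changing the content of pretrained_weights (if it's not a GPT2 model, you will of course need to change the classes used for the model and the tokenizer).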

pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)

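Before we move on to the fine-tuning part, let's have a look at this tokenizer and this model. The tokenizers in HuggingFace usually do the tokenization and the numericalization in one step (we ignore the padding warning for now):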

ids = tokenizer.encode('This is an example of text, and')
ids
[1212, 318, 281, 1672, 286, 2420, 11, 290]

Like fastai Transforms, the tokenizer has a decode method that turns ids back into text:

tokenizer.decode(ids)
'This is an example of text, and'
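
To make the one-step behaviour concrete, here is a toy sketch in plain Python. It uses a hypothetical whitespace tokenizer and a made-up four-word vocabulary (`ToyTokenizer` and its vocabulary are assumptions for illustration, not GPT-2's byte-level BPE). It only shows that `encode` amounts to tokenizing followed by numericalizing, and that `decode` reverses the process.

```python
# Toy illustration only: a whitespace tokenizer with a made-up vocabulary.
# The real GPT2TokenizerFast uses byte-level BPE and a far larger vocabulary.
class ToyTokenizer:
    def __init__(self, vocab):
        self.tok2id = {tok: i for i, tok in enumerate(vocab)}
        self.id2tok = {i: tok for tok, i in self.tok2id.items()}

    def tokenize(self, text):
        # Step 1: split the text into tokens.
        return text.split()

    def convert_tokens_to_ids(self, toks):
        # Step 2: numericalize, i.e. map each token to its integer id.
        return [self.tok2id[t] for t in toks]

    def encode(self, text):
        # Both steps at once.
        return self.convert_tokens_to_ids(self.tokenize(text))

    def decode(self, ids):
        # Map ids back to tokens and join them into text.
        return " ".join(self.id2tok[i] for i in ids)


toy = ToyTokenizer(["This", "is", "an", "example"])
toy_ids = toy.encode("This is an example")
print(toy_ids)              # [0, 1, 2, 3]
print(toy.decode(toy_ids))  # This is an example
```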

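The model can be used to generate predictions (it is pretrained). It has a generate method that expects a batch of prompts, so we feed it our ids with a batch dimension added (there is another padding warning, which we can also ignore):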

import torch
t = torch.LongTensor(ids)[None]
preds = model.generate(t)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

By default, the predictions have a total length of 20 tokens, prompt included:

preds.shape,preds[0]
(torch.Size([1, 20]),
 tensor([1212,  318,  281, 1672,  286, 2420,   11,  290,  340,  338,  407,  257,
          922,  530,   13,  198,  198,  464,  717, 1517]))
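
The shape of `preds` makes more sense once you see that generation appends to the prompt one token at a time until a length limit is reached. Below is a minimal sketch of that loop in plain Python. The `next_token` function is a hypothetical stand-in for the model (it just returns the last id plus one), and greedy, one-token-at-a-time decoding is an assumption made for illustration; the real generate method supports several decoding strategies.

```python
def next_token(ids):
    # Hypothetical stand-in for a language model: a real model would score
    # every vocabulary entry and one of them would be selected.
    return ids[-1] + 1


def toy_generate(prompt_ids, max_length=20):
    # Keep appending predicted tokens until prompt + continuation
    # reaches max_length tokens in total.
    ids = list(prompt_ids)
    while len(ids) < max_length:
        ids.append(next_token(ids))
    return ids


prompt = [1212, 318, 281, 1672, 286, 2420, 11, 290]  # the 8 prompt ids
out = toy_generate(prompt)
print(len(out))                     # 20: 8 prompt tokens + 12 generated ones
print(out[:len(prompt)] == prompt)  # True: the prompt is kept at the front
```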

We can decode them with the decode method (which prefers a numpy array to a tensor):

tokenizer.decode(preds[0].numpy())
"This is an example of text, and it's not a good one.\n\nThe first thing"

Bridging the gap with fastai

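Now let's see how we can use fastai to fine-tune this model on wikitext-2, using all the training utilities (learning rate finder, 1cycle policy, etc.). First, we import all the text utilities: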

from fastai.text.all import *

Preparing the data

Then we download the dataset (if it is not already present); it comes as two csv files:

path = untar_data(URLs.WIKITEXT_TINY)
path.ls()
(#2) [Path('/home/jhoward/.fastai/data/wikitext-2/test.csv'),Path('/home/jhoward/.fastai/data/wikitext-2/train.csv')]

Let's have a look at what those csv files look like:

df_train = pd.read_csv(path/'train.csv', header=None)
df_valid = pd.read_csv(path/'test.csv', header=None)
df_train.head()
0
0 \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z...
1 \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re...
2 \n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit...
3 \n = New Year 's Eve ( Up All Night ) = \n \n " New Year 's Eve " is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch...
4 \n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers ....

We gather all the texts in one numpy array (since it will be easier to use this way with fastai):

all_texts = np.concatenate([df_train[0].values, df_valid[0].values])

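To process this data to train a model, we need to build a Transform that will be applied lazily. In this case we could do the preprocessing once and for all and only use the transform for decoding (we will see how shortly), but the fast tokenizer from HuggingFace is, as its name indicates, fast, so applying it lazily doesn't really hurt performance.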

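In a fastai Transform you can define:

  • an encodes method that is applied when you call the transform (a bit like the forward method in a nn.Module)
  • a decodes method that is applied when you call the decode method of the transform, if you need to decode anything for showing purposes (like converting ids to text here)
  • a setups method that sets up some inner state of the Transform (not needed here, so we skip it)
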
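class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        # Tokenize, then numericalize, skipping the extra processing done by tokenizer.encode
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    # Turn ids back into text, wrapped in a TitledStr so fastai can show it
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))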

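Two comments on the code above:

  • in encodes we don't use the tokenizer.encode method, since it does some additional preprocessing for the model after tokenizing and numericalizing (the part that threw a warning earlier). We don't need any post-processing here, so it's fine to skip it.
  • in decodes we return a TitledStr object rather than a plain string. That's a fastai class that adds a show method to the string, which lets us use all the fastai show methods.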
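
The encodes/decodes pattern and the idea of lazy application can be sketched without fastai. The classes below (`ToyTransform`, `LazyList`) are hypothetical stand-ins written for this illustration, not fastai's Transform or TfmdLists; they only show that the raw items are stored untouched and that the transform runs when an item is requested.

```python
class ToyTransform:
    """Hypothetical stand-in for a fastai Transform (illustration only)."""
    def __init__(self):
        self.n_calls = 0  # counts how many times encodes has run

    def encodes(self, x):
        # Applied when the transform is called: text -> list of character codes.
        self.n_calls += 1
        return [ord(c) for c in x]

    def decodes(self, ids):
        # Applied for display: list of character codes -> text.
        return "".join(chr(i) for i in ids)

    def __call__(self, x):
        return self.encodes(x)

    def decode(self, x):
        return self.decodes(x)


class LazyList:
    """Stores raw items and applies the transform only on indexing."""
    def __init__(self, items, tfm):
        self.items, self.tfm = items, tfm

    def __getitem__(self, i):
        return self.tfm(self.items[i])


tfm = ToyTransform()
lazy = LazyList(["abc", "hello"], tfm)
print(tfm.n_calls)        # 0: nothing has been encoded yet
first = lazy[0]
print(first)              # [97, 98, 99]
print(tfm.n_calls)        # 1: encoded on access
print(tfm.decode(first))  # abc
```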

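You can then group your data with this Transform using a TfmdLists. It has an s in its name because it contains the training and validation sets. We indicate the indices of the training set and the validation set with splits (here the first len(df_train) indices are the training set and the remaining indices are the validation set):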

splits = [range_of(df_train), list(range(len(df_train), len(all_texts)))]
tls = TfmdLists(all_texts, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)
Token indices sequence length is longer than the specified maximum sequence length for this model (4576 > 1024). Running this sequence through the model will result in indexing errors

We specify dl_type=LMDataLoader for when we convert this TfmdLists to DataLoaders: since this is a language modeling problem, we will use an LMDataLoader rather than the usual fastai TfmdDL.
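
Going back to the splits argument: it is simply a pair of index lists into the array of texts. A minimal sketch with made-up sizes (`n_train = 3` and `n_valid = 2` are hypothetical, chosen only to keep the output short):

```python
# Made-up sizes for illustration; in the tutorial they come from
# len(df_train) and len(df_valid).
n_train, n_valid = 3, 2
texts = [f"train text {i}" for i in range(n_train)] + \
        [f"valid text {i}" for i in range(n_valid)]

# Training indices come first, validation indices cover the rest.
toy_splits = [list(range(n_train)), list(range(n_train, len(texts)))]
print(toy_splits)   # [[0, 1, 2], [3, 4]]

train_items = [texts[i] for i in toy_splits[0]]
valid_items = [texts[i] for i in toy_splits[1]]
print(valid_items)  # ['valid text 0', 'valid text 1']
```

Because the two index lists never overlap, no text ends up in both sets.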

In a TfmdLists you can access the elements of the training or validation set quite easily:

tls.train[0],tls.valid[0]
(tensor([220, 198, 796,  ..., 198, 220, 198]),
 tensor([220, 198, 796,  ..., 198, 220, 198]))

They look the same only because they begin and end the same way. We can see that their shapes differ:

tls.tfms(tls.train.items[0]).shape, tls.tfms(tls.valid.items[0]).shape
(torch.Size([4576]), torch.Size([1485]))

And we can have a look at both decoded texts using show_at:

show_at(tls.train, 0)
 
 = 2013 – 14 York City F.C. season = 
 
 The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club, a professional football club based in York, North Yorkshire, England. Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two. The season ran from 1 July 2013 to 30 June 2014. 
 Nigel Worthington, starting his first full season as York manager, made eight permanent summer signings. By the turn of the year York were only above the relegation zone on goal difference, before a 17 @-@ match unbeaten run saw the team finish in seventh @-@ place in the 24 @-@ team 2013 – 14 Football League Two. This meant York qualified for the play @-@ offs, and they were eliminated in the semi @-@ final by Fleetwood Town. York were knocked out of the 2013 – 14 FA Cup, Football League Cup and Football League Trophy in their opening round matches. 
 35 players made at least one appearance in nationally organised first @-@ team competition, and there were 12 different <unk>. Defender Ben Davies missed only five of the fifty @-@ two competitive matches played over the season. Wes Fletcher finished as leading scorer with 13 goals, of which 10 came in league competition and three came in the FA Cup. The winner of the <unk> of the Year award, voted for by the club's supporters, was <unk> Oyebanjo. 
 
 = = Background and pre @-@ season = = 
 
 The 2012 – 13 season was York City's first season back in the Football League, having won the Conference Premier play @-@ offs in 2011 – 12 after <unk> years in the Football Conference. Manager Gary Mills was sacked in March 2013 following an 11 @-@ match run without a victory, and was replaced by former Northern Ireland manager Nigel Worthington. Despite being in the relegation zone with three matches remaining, Worthington led the team to safety from relegation after a 1 – 0 win away to Dagenham & Redbridge on the final day of the season. York finished the season in 17th @-@ place in the 2012 – 13 League Two table. 
 Following the previous season's conclusion Lee <unk>, Jon <unk>, Chris <unk>, Ben Everson, Scott Kerr, David <unk>, Patrick <unk>, Michael Potts, Jamie Reed and Jason Walker were released by York, while <unk> Blair departed for Fleetwood Town. David McGurk, <unk> Oyebanjo, Danny Parslow, Tom Platt and Chris Smith signed new contracts with the club. New players signed ahead of the start of the season were goalkeeper Chris <unk> on a season @-@ long loan from Blackpool, defender Ben Davies on loan from Preston North End, midfielders Craig Clay from Chesterfield and Lewis Montrose from Gillingham, winger <unk> Puri from St <unk> and strikers Ryan Bowman from Hereford United, Richard Cresswell from Sheffield United, Wes Fletcher from Burnley and Ryan Jarvis from Torquay United. Defender Mike Atkinson and striker Chris Dickinson entered the first @-@ team squad from the youth team after agreeing professional contracts. 
 York retained the previous season's home and away kits. The home kit comprised red shirts with white sleeves, light blue shorts and white socks. The away kit included light blue shirts with white sleeves, white shorts and light blue socks. <unk> Health continued as shirt sponsors for the second successive season. 
 
 = = Review = = 
 
 
 = = = August = = = 
 
 York began the season with a 1 – 0 home win over the previous season's play @-@ off finalists, Northampton Town, with <unk> Jarvis scoring the winning goal in the 90th @-@ minute. However, defeat came in York's match against Championship side Burnley in the first round of the League Cup, going down 4 – 0 at home. The team endured their first league defeat of the season in the following game after being beaten 2 – 0 away by Dagenham & Redbridge, the home team scoring in each half. York then held Hartlepool United to a 0 – 0 home draw, before being beaten 3 – 2 away by Bristol Rovers, in which Jarvis scored twice before John @-@ Joe O 'Toole scored the winning goal for the home team in the 67th @-@ minute. Two signings were made shortly before the transfer deadline ; defender George Taft was signed on a one @-@ month loan from Leicester City, while Middlesbrough midfielder Ryan Brobbel joined on a one @-@ month loan. <unk> John <unk>, who had been told he had no future with the club, departed after signing for FC Halifax Town. Jarvis gave York the lead away at Exeter City before Alan <unk> scored in each half to see the home team win 2 – 1. 
 
 = = = September = = = 
 
 York suffered their first home league defeat of the season after AFC Wimbledon won 2 – 0, with Michael Smith scoring in each half. Former Ipswich Town midfielder Josh Carson, who had a spell on loan with York the previous season, signed a contract until the end of 2013 – 14 and Sheffield United midfielder Elliott <unk> signed on a one @-@ month loan. Brobbel opened the scoring in the second minute of his home debut against Mansfield Town, although the away team went on to score twice to win 2 – 1. York's run of four defeats ended following a 1 – 1 draw away to Wycombe Wanderers, in which McGurk gave York the lead before the home team levelled through Dean Morgan. Taft was sent back to Leicester after he fell behind McGurk, Parslow and Smith in the pecking order for a central defensive berth. York achieved their first win since the opening day of the season after beating Portsmouth 4 – 2 at home, with Fletcher ( 2 ), Montrose and Jarvis scoring. 
 
 = = = October = = = 
 
 Defender Luke O 'Neill was signed from Burnley on a 28 @-@ day emergency loan. He made his debut in York's 3 – 0 win away at Torquay, which was the team's first successive win of the season. York were knocked out of the Football League Trophy in the second round after being beaten 3 – 0 at home by League One team Rotherham United, before their winning streak in the league was ended with a 3 – 0 defeat away to Newport County. York drew 2 – 2 away to Chesterfield, having taken a two @-@ goal lead through O 'Neill and Jarvis, before the home team fought back through Armand <unk> and Jay O <unk>. The team then hosted Fleetwood Town, and the visitors won 2 – 0 with goals scored in each half by Gareth Evans and <unk> Matt. Scunthorpe United were beaten 4 – 1 at home to end York's three @-@ match run without a win, with all the team's goals coming in the first half from Carson, Fletcher and Brobbel ( 2 ). 
 
 = = = November = = = 
 
 Bowman scored his first goals for York away to Cheltenham Town, as York twice fought back from behind to draw 2 – 2. York drew 3 – 3 away to Bristol Rovers to earn a first round replay in the FA Cup, taking the lead through Jarvis before Eliot Richards equalised for the home team. Carson scored a 30 yard volley to put York back in the lead, and after Bristol Rovers goals from Matt <unk> and Chris <unk>, Fletcher scored an 86th @-@ minute equaliser for York. Bowman scored with a header from an O 'Neill cross to open the scoring at home to Plymouth Argyle, which was the first goal the visitors had conceded in 500 minutes of action. However, Plymouth equalised 11 minutes later through <unk> <unk> and the match finished a 1 – 1 draw. York were knocked out of the FA Cup after losing 3 – 2 at home to Bristol Rovers in a first round replay ; the visitors were 3 – 0 up by 50 @-@ minutes before Fletcher pulled two back for York with a penalty and a long @-@ range strike. 
 Defender Keith Lowe, of Cheltenham, and goalkeeper Nick Pope, of Charlton Athletic, were signed on loan until January 2014. They both played in York's first league defeat in four weeks, 2 – 1 away, to Southend United. <unk> <unk> gave Southend the lead early into the match and Bowman equalised for York with a low strike during the second half, before Luke Prosser scored the winning goal for the home side in stoppage time. With Pope preferred in goal, <unk> returned to Blackpool on his own accord, although his loan agreement would stay in place until January 2014. York then drew 0 – 0 away to Morecambe. After Pope was recalled from his loan by Charlton, York signed Wolverhampton Wanderers goalkeeper Aaron McCarey on loan until January 2014. McCarey kept a clean sheet in York's 0 – 0 home draw with Rochdale. 
 
 = = = December = = = 
 
 Cresswell retired from playing as a result of an eye complaint and a knee injury. York drew 1 – 1 away to Burton Albion, with an own goal scored by Shane <unk> @-@ <unk> giving York the lead in the 64th @-@ minute before the home team equalised eight minutes later through Billy <unk>. Atkinson was released after failing to force himself into the first team and signed for Scarborough Athletic, with whom he had been on loan. York drew 0 – 0 at home with second @-@ placed Oxford United, in which Carson came closest to scoring with a volley that <unk> across the face of the goal. This was followed by another draw after the match away to Accrington Stanley finished 1 – 1, with the home team <unk> 10 minutes after a Fletcher penalty had given York the lead in the 35th @-@ minute. Striker <unk> McDonald, who had been released by Peterborough United, was signed on a contract until the end of the season. York's last match of 2013 was a 2 – 1 defeat away at Bury, a result that ended York's run of consecutive draws at five. The home team were 2 – 0 up by the 19th @-@ minute, before Michael Coulson scored York's goal in the 73rd @-@ minute. This result meant York would begin 2014 in 22nd @-@ position in the table, only out of the relegation zone on goal difference. 
 
 = = = January = = = 
 
 Jarvis scored the only goal in York's first win since October 2013, a 1 – 0 home victory over Morecambe on New Year's Day. McCarey was recalled by Wolverhampton Wanderers due to an injury to one of their <unk>, while O 'Neill was recalled by Burnley to take part in their FA Cup match. York achieved back @-@ to @-@ back wins for the first time since October 2013 after Dagenham & Redbridge were beaten 3 – 1 at home, with Bowman opening the scoring in the second half before Fletcher scored twice. Adam Reed, who had a spell on loan with York in the previous season, was signed on a contract until the end of the season after parting company with Burton. Davies'loan was extended, while Brobbel and <unk> returned to their parent clubs. Cheltenham club captain Russell Penn, a midfielder, was signed on a two @-@ and @-@ a @-@ half @-@ year contract for an undisclosed fee. Lowe was subsequently signed permanently from Cheltenham on a two @-@ and @-@ a @-@ half @-@ year contract for an undisclosed fee. Having been allowed to leave the club on a free transfer, Ashley Chambers signed for Conference Premier club Cambridge United. 
 York achieved three successive wins for the first time in 2013 – 14 after beating Northampton 2 – 0 away, with Bowman and Fletcher scoring in three @-@ second half minutes. Defender John McCombe was signed on a two @-@ and @-@ a @-@ half @-@ year contract following his release from Mansfield, before Clay and Jamal <unk> left York by mutual consent. Pope returned to York on loan from Charlton for the remainder of the season. York's run of wins ended with a 0 – 0 draw at home to Bristol Rovers, before their first defeat of the year came after losing 2 – 0 away to Hartlepool. Preston winger Will Hayhurst, a Republic of Ireland under @-@ 21 international, was signed on a one @-@ month loan. York fell to a successive defeat for the first time since September 2013 after being beaten 2 – 0 at home by Chesterfield. Shortly after the match, Smith left the club by mutual consent to pursue first @-@ team football. 
 
 = = = February = = = 
 
 Fletcher scored a 90th @-@ minute winner for York away to Fleetwood in a 2 – 1 win, a result that ended Fleetwood's five @-@ match unbeaten run. York then drew 0 – 0 at home to fellow mid @-@ table team Cheltenham, before beating Plymouth 4 – 0 away with goals from Fletcher, McCombe ( 2 ) and Carson as the team achieved successive away wins for the first time in 2013 – 14. York went without scoring for a fourth consecutive home match after drawing 0 – 0 with Southend. Having worn the <unk> since an injury to McGurk, Penn was appointed captain for the rest of the season, a position that had earlier been held by Smith and Parslow. 
 
 = = = March = = = 
 
 York achieved their first home win in five matches after beating Exeter 2 – 1, with first half goals scored by McCombe and Coulson. Hayhurst's loan was extended to the end of the season, having impressed in his six appearances for the club. Coulson scored again with the only goal, a 41st @-@ minute header, in York's 1 – 0 away win over AFC Wimbledon. Bowman scored the only goal with a 32nd @-@ minute penalty as York won 1 – 0 away against Mansfield, in which Fletcher missed the opportunity to extend the lead when his stoppage time penalty was saved by Alan Marriott. York moved one place outside the play @-@ offs with a 2 – 0 home win over Wycombe, courtesy of a second Bowman penalty in as many matches and a Carson goal from the edge of the penalty area. Coulson scored York's only goal in a 1 – 0 away win over struggling Portsmouth with a low volley in the fifth @-@ minute ; this result meant York moved into the play @-@ offs in seventh @-@ place with eight fixtures remaining. 
 Striker Calvin Andrew, who had been released by Mansfield in January 2014, was signed on a contract for the remainder of the season. He made his debut as a substitute in York's 1 – 0 home win over bottom of the table Torquay, in which Hayhurst scored the only goal in the 11th @-@ minute with an 18 yard shot that <unk> off Aaron <unk>. Middlesbrough winger Brobbel rejoined on loan until the end of the season, following an injury to Carson. York's run of successive wins ended on six matches after a 0 – 0 home draw with Burton, and this result saw York drop out of the play @-@ offs in eighth @-@ place. With the team recording six wins and one draw in March 2014, including six clean sheets, Worthington was named League Two Manager of the Month. 
 
 = = = April = = = 
 
 Pope made a number of saves as York held league leaders Rochdale to a 0 – 0 away draw, with a point being enough to lift the team back into seventh @-@ place. York were prevented from equalling a club record of eight consecutive clean sheets when Accrington scored a stoppage time equaliser in a 1 – 1 home draw, in which York had taken earlier taken the lead with a Coulson penalty. A 1 – 0 win away win over Oxford, which was decided by a second half Coulson penalty, resulted in York moving one place above their opponents and back into seventh @-@ place. York consolidated their place in a play @-@ off position after beating Bury 1 – 0 at home with a fifth @-@ minute goal scored by Lowe from a Hayhurst corner. The result meant York opened up a five @-@ point lead over eighth @-@ placed Oxford with two fixtures remaining. A place in the League Two play @-@ offs was secured following a 1 – 0 win over Newport at home, in which Coulson scored the only goal in the 77th @-@ minute with a 25 yard free kick. Pope earned a nomination for League Two Player of the Month for April 2014, having conceded only one goal in five matches in that period. 
 
 = = = May = = = 
 
 The league season concluded with an away match against divisional runners @-@ up Scunthorpe ; having gone two goals down York fought back to draw 2 – 2 with goals scored by Brobbel and Andrew. This result meant York finished the season in seventh @-@ place in League Two, and would thus play fourth @-@ placed Fleetwood in the play @-@ off semi @-@ final on the back of a 17 @-@ match unbeaten run. York lost 1 – 0 to Fleetwood in the first leg at <unk> Crescent ; the goal came from former York player <unk> Blair in the 50th @-@ minute, who scored from close range after Antoni <unk>'s shot was blocked on the line. A 0 – 0 draw away to Fleetwood in the second leg meant York were eliminated 1 – 0 on aggregate, ending the prospect of a second promotion in three seasons. At an awards night held at York Racecourse, Oyebanjo was voted <unk> of the Year for 2013 – 14. 
 
 = = Summary and aftermath = = 
 
 York mostly occupied the bottom half of the table before the turn of the year, and dropped as low as 23rd in September 2013. During February 2014 the team broke into the top half of the table and with one match left were in sixth @-@ place. York's defensive record was the third best in League Two with 41 goals conceded, bettered only by Southend ( 39 ) and Chesterfield ( 40 ). Davies made the highest number of appearances over the season, appearing in 47 of York's 52 matches. Fletcher was York's top scorer in the league and in all competitions, with 10 league goals and 13 in total. He was the only player to reach double figures, and was followed by Jarvis with nine goals. 
 After the season ended York released Tom Allan, Andrew, Dickinson, McDonald, Puri and Reed, while McGurk retired from professional football. Bowman and Oyebanjo left to sign for Torquay and Crawley Town respectively while Coulson signed a new contract with the club. York's summer signings included goalkeeper Jason <unk> from Tranmere Rovers, defenders <unk> <unk> from Dagenham, Marvin McCoy from Wycombe and Dave Winfield from Shrewsbury Town, midfielders <unk> <unk> from Mansfield, Anthony <unk> from Southend and Luke <unk> from Shrewsbury and striker Jake Hyde from <unk>. 
 
 = = Match details = = 
 
 League positions are sourced by <unk>, while the remaining information is referenced individually. 
 
 = = = Football League Two = = = 
 
 
 = = = League table ( part ) = = = 
 
 
 = = = FA Cup = = = 
 
 
 = = = League Cup = = = 
 
 
 = = = Football League Trophy = = = 
 
 
 = = = Football League Two play @-@ offs = = = 
 
 
 = = <unk> = = 
 
 
 = = = In = = = 
 
 <unk> around club names denote the player's contract with that club had expired before he joined York. 
 
 = = = Out = = = 
 
 <unk> around club names denote the player joined that club after his York contract expired. 
 
 = = = Loan in = = = 
 
 
 = = = Loan out = = = 
 
 
 = = Appearances and goals = = 
 
 Source : 
 Numbers in parentheses denote appearances as substitute. 
 Players with names struck through and marked left the club during the playing season. 
 Players with names in italics and marked * were on loan from another club for the whole of their season with York. 
 Players listed with no appearances have been in the <unk> squad but only as unused <unk>. 
 Key to positions : <unk> – <unk> ; <unk> – Defender ; <unk> – <unk> ; <unk> – Forward 
 
show_at(tls.valid, 0)
 
 = Tropical Storm <unk> ( 2008 ) = 
 
 Tropical Storm <unk> was the tenth tropical storm of the 2008 Atlantic hurricane season. <unk> developed out of a strong tropical wave which moved off the African coast on August 31. The wave quickly became organized and was declared Tropical Depression Ten while located 170 mi ( 270 km ) to the south @-@ southeast of the Cape Verde Islands on September 2. The depression was quickly upgraded to Tropical Storm <unk> around noon the same day. Over the next several days, <unk> moved in a general west @-@ northwest direction and reached its peak intensity early on September 3. Strong wind shear, some due to the outflow of Hurricane Ike, and dry air caused the storm to weaken. On September 6, the combination of wind shear, dry air, and cooling waters caused <unk> to weaken into a tropical depression. <unk> deteriorated into a remnant low shortly after as convection continued to dissipate around the storm. The low ultimately dissipated while located 520 mi ( 835 km ) east of <unk> on September 10. However, the remnant moisture led to minor flooding on the island of St. Croix. 
 
 = = Meteorological history = = 
 
 Tropical Storm <unk> formed as a tropical wave that emerged off the west coast of Africa near the end of August 2008. It tracked south of Cape Verde and slowly developed, and on September 2 the disturbance became Tropical Depression Ten while located south @-@ southeast of the Cape Verde islands. As the depression became more organized, an eye @-@ like feature developed in the upper levels of the system. The depression was upgraded to Tropical Storm <unk> six hours after forming. <unk> was located in an area which was supportive for rapid intensification but was not forecast to intensify quickly. 
 <unk> continued to intensify throughout the afternoon as the storm became more symmetrical. However, due to the location of the storm, there was a lack of accurate wind speed readings, and the National Hurricane Center was uncertain of its actual intensity. Despite the lack of wind shear around the storm, the center became slightly exposed and ceased further intensification. The storm was also heading into an area where shear was <unk> to significantly increase due to an upper @-@ level trough diving southward. Despite convection being partially removed from the center of <unk>, the storm intensified slightly in the early morning hours on September 3 as thunderstorm activity to the south of the center became more organized. The intensification was forecast to be short in duration as the trough to the north was deepening, causing the wind shear to the west to become stronger. 
 <unk> reached its peak intensity of 65 mph ( 100 km / h ) around 8 a.m. ( <unk> ) as it continued to become more organized. However, there were indications that it had already begun to weaken. <unk> towards the north was becoming restricted and arc clouds began emanating from the storm, a sign that dry air was entering the system. During the afternoon hours, the structure of <unk> began to rapidly deteriorate as strong wind shear and dry air took their toll. By the late night, the center was almost completely exposed and only a band of convection persisted near the center. 
 Despite continuing effects from the strong wind shear, a large, deep burst of convection formed in the northern <unk> of <unk>. The center was found to have shifted towards the new convection leading to an increase in intensity. The forecast showed a slight decrease in wind shear as <unk> continued westward and no change in intensity over the 5 @-@ day forecast was predicted. However, the convection decreased once more and the low became completely exposed by the late morning hours and <unk> weakened again. By the afternoon, the center of <unk> was only a <unk> of clouds, devoid of convection. During the overnight hours on September 4 into the morning of September 5, convection associated with <unk> began to <unk> somewhat, mostly to the north of the circulation, due to the strong <unk> wind shear. By mid @-@ morning, <unk> re @-@ intensified slightly due to the redevelopment of some convection. However, the redevelopment was short lived and wind shear again took its toll on <unk> by late morning. The convection around the system became <unk> from the center and <unk> weakened slightly. 
 The weakening trend continued through the afternoon as the storm was being affected by strong <unk> shear. <unk> became almost fully devoid of any convection by mid @-@ afternoon and the storm weakened to 40 mph ( 65 km / h ), barely holding on to tropical storm status. <unk> regained a small amount of convection in the late night hours, but not enough to still be classified a tropical storm. Due to the lack of convection, <unk> was downgraded to a Tropical Depression at <unk> ( <unk> ) with winds of 35 mph ( 55 km / h ). Since there was no convection around the system, it would have normally been classified a remnant low but, due to the possibility of the storm <unk> over the next several days, it was considered a tropical depression. The next morning, <unk> was downgraded to a remnant low as strong wind shear and dry air caused the demise of the storm. No redevelopment was expected with <unk> as it began to move over colder waters and remain under strong wind shear until it dissipated. 
 However, the remnant low associated with <unk> began to show signs of redevelopment during the afternoon on September 7. <unk> around the system increased significantly and the low was no longer exposed. On September 8, wind shear took over the system again. <unk> around the remnant low was torn away and the low was exposed once more. The National Hurricane Center did not state the chance of regeneration once the low became exposed. Finally, on September 9, wind shear and dry air led to the remnants of <unk> deteriorating into an open wave. However, on September 10, the remnants of <unk> redeveloped and global models picked up on the reformed system. Once more, the chance of regeneration was possible as the remnants of <unk> headed towards the Bahamas. However, on September 14, dry air and wind shear caused the remnants to dissipate entirely. 
 
 = = Impact = = 
 
 As <unk> passed to the south of the Cape Verde islands on September 2, outer rain bands produced minor rainfall, totaling around 0 @.@ 55 inches ( 14 mm ). There were no reports of damage or flooding from the rain and overall effects were minor. 
 Several days after the low dissipated, the remnant moisture from <unk> brought showers and thunderstorms to St. Croix where up to 1 in ( 25 @.@ 4 mm ) of rain fell. The heavy rains led to minor street flooding and some urban flooding. No known damage was caused by the flood. 
 

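The fastai library expects the data to be assembled in a DataLoaders object (an object that contains a training and a validation dataloader). We can get one by using the dataloaders method. We just have to specify a batch size and a sequence length. We'll train with sequences of length 256 (GPT2 used a sequence length of 1024, but not everyone has enough GPU memory for that).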

bs,sl = 4,256
dls = tls.dataloaders(bs=bs, seq_len=sl)

Note that you may need to reduce the batch size depending on your GPU memory.

In fastai, as soon as we have a DataLoaders, we can use show_batch to have a look at the data (here the input is text and the target is the same text shifted by one token to the right).

dls.show_batch(max_n=2)
text text_
0 \n = Jacqueline Fernandez = \n \n Jacqueline Fernandez ( born 11 August 1985 ) is a Sri Lankan actress, former model, and the winner of the 2006 Miss Universe Sri Lanka pageant. As Miss Universe Sri Lanka she represented her country at the 2006 world Miss Universe pageant. She graduated with a degree in mass communication from the University of Sydney, and worked as a television reporter in Sri Lanka. \n While on a modelling assignment in India in 2009, Fernandez successfully auditioned for <unk> <unk>'s fantasy drama <unk>, which marked her acting debut. Fernandez'breakthrough role was in <unk> <unk>'s psychological thriller Murder 2 ( 2011 ), her first commercial success. This was followed by glamorous roles in the ensemble @-@ comedy Housefull 2 ( 2012 ) and its sequel Housefull 3, and the action thriller Race 2 ( 2013 ), all of which were box @-@ office \n = Jacqueline Fernandez = \n \n Jacqueline Fernandez ( born 11 August 1985 ) is a Sri Lankan actress, former model, and the winner of the 2006 Miss Universe Sri Lanka pageant. As Miss Universe Sri Lanka she represented her country at the 2006 world Miss Universe pageant. She graduated with a degree in mass communication from the University of Sydney, and worked as a television reporter in Sri Lanka. \n While on a modelling assignment in India in 2009, Fernandez successfully auditioned for <unk> <unk>'s fantasy drama <unk>, which marked her acting debut. Fernandez'breakthrough role was in <unk> <unk>'s psychological thriller Murder 2 ( 2011 ), her first commercial success. This was followed by glamorous roles in the ensemble @-@ comedy Housefull 2 ( 2012 ) and its sequel Housefull 3, and the action thriller Race 2 ( 2013 ), all of which were box @-@ office successes.
1 small farms in between small residential subdivisions. In the community of Freeland, M @-@ 47 runs near the <unk> International Airport off Freeland Road. North of town, M @-@ 47 leaves Midland Road and becomes a freeway near <unk> Park. The freeway section of M @-@ 47 runs through rural farm land. There is a diamond interchange with <unk> Road before the terminal interchange at US 10. \n As part of its maintenance duties, the Michigan Department of Transportation ( MDOT ) tracks the volume of traffic on the highways it maintains. This number is expressed in terms of annual average daily traffic ( AADT ), a calculation of the average traffic for a segment of roadway on any average day of the year. In 2009, the department measured a peak of 19 @,@ <unk> vehicles daily on the stretch north of <unk> Road. The section south of the farms in between small residential subdivisions. In the community of Freeland, M @-@ 47 runs near the <unk> International Airport off Freeland Road. North of town, M @-@ 47 leaves Midland Road and becomes a freeway near <unk> Park. The freeway section of M @-@ 47 runs through rural farm land. There is a diamond interchange with <unk> Road before the terminal interchange at US 10. \n As part of its maintenance duties, the Michigan Department of Transportation ( MDOT ) tracks the volume of traffic on the highways it maintains. This number is expressed in terms of annual average daily traffic ( AADT ), a calculation of the average traffic for a segment of roadway on any average day of the year. In 2009, the department measured a peak of 19 @,@ <unk> vehicles daily on the stretch north of <unk> Road. The section south of the US
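
The relationship between the two columns can be sketched in plain Python. This is only an illustration, not the LMDataLoader implementation: the helper name lm_batches and the toy token ids are assumptions made for the example. It cuts one long token stream into windows of seq_len tokens and pairs each window with the same window shifted by one token.

```python
def lm_batches(stream, seq_len):
    """Yield (input, target) pairs; the target is the input shifted by one token."""
    # The last token has no successor, so only len(stream) - 1 positions are usable.
    usable = len(stream) - 1
    for start in range(0, usable, seq_len):
        end = min(start + seq_len, usable)
        yield stream[start:end], stream[start + 1:end + 1]

stream = list(range(10))  # stand-in for the concatenated token ids of all texts
pairs = list(lm_batches(stream, seq_len=4))
for x, y in pairs:
    print(x, "->", y)
```

Each target starts one token later than its input, which is why the second column above ends with one extra word compared to the first.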

Another way to gather the data is to preprocess the texts once and for all, and only use the transform to decode the tensors back to texts:

def tokenize(text):
    toks = tokenizer.tokenize(text)
    return tensor(tokenizer.convert_tokens_to_ids(toks))

tokenized = [tokenize(t) for t in progress_bar(all_texts)]
100.00% [662/662 00:12<00:00]

Now we change the previous Tokenizer like this:

class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): 
        return x if isinstance(x, Tensor) else tokenize(x)
        
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

In the encodes method, we still account for the case where we get something that is not already tokenized, in case we build a dataset of new texts with this transform.

tls = TfmdLists(tokenized, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)
dls = tls.dataloaders(bs=bs, seq_len=sl)
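
The pass-through branch in encodes can be illustrated without fastai or transformers. The sketch below is a toy: toy_tokenize is a hypothetical stand-in for the real tokenizer, and plain lists play the role of tensors. It only shows that pre-tokenized items are returned untouched while raw strings are tokenized on the fly.

```python
def toy_tokenize(text):
    # Hypothetical stand-in for the real tokenizer: one id per whitespace-separated word.
    return [len(word) for word in text.split()]

def encode(x):
    # Already-tokenized items (lists of ids here) pass through unchanged;
    # raw strings are tokenized when they are first seen.
    return x if isinstance(x, list) else toy_tokenize(x)

pretokenized = toy_tokenize("a unicorn is a magical creature")
same_object = encode(pretokenized) is pretokenized  # no second tokenization pass
from_text = encode("hello world")                   # tokenized on the fly
print(same_object, from_text)
```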

And we can check that it still works properly for showing purposes:

dls.show_batch(max_n=2)
text text_
0 \n = Otra Nota = \n \n Otra Nota ( English : Another Note ) is the debut album by American singer Marc Anthony that was released on January 26, 1993, by RMM Records. Produced by Sergio George, it was the first album by Anthony to record in salsa after starting his career as a freestyle musician. Recording of the album began after Anthony asked RMM president Ralph Mercado to record Juan Gabriel's " Hasta Que Te Conocí " in salsa after hearing it on the radio during a taxi ride. Recorded on a low budget, the album peaked at No. 2 on the Billboard Tropical Albums chart and reached No. 30 on the Billboard Top Latin Albums chart. \n The album was well received by critics who complimented George's production and Anthony's youthful voice. Anthony received two awards for " Best New Artists " at the Billboard Latin \n = Otra Nota = \n \n Otra Nota ( English : Another Note ) is the debut album by American singer Marc Anthony that was released on January 26, 1993, by RMM Records. Produced by Sergio George, it was the first album by Anthony to record in salsa after starting his career as a freestyle musician. Recording of the album began after Anthony asked RMM president Ralph Mercado to record Juan Gabriel's " Hasta Que Te Conocí " in salsa after hearing it on the radio during a taxi ride. Recorded on a low budget, the album peaked at No. 2 on the Billboard Tropical Albums chart and reached No. 30 on the Billboard Top Latin Albums chart. \n The album was well received by critics who complimented George's production and Anthony's youthful voice. Anthony received two awards for " Best New Artists " at the Billboard Latin Music
1 reactions and prejudices ", which leaves no room for any further interest. Donoghue complained that Lessing has not made up her mind on whether her characters are " the salt of the earth or its <unk> ". In a review in the Chicago Tribune, Kuehn felt that the work has little impact and is not memorable. He said Lessing's real interest is character development, but complained that the characters are " trivial or two @-@ dimensional or crippled by self @-@ <unk> ". \n The Good Terrorist was shortlisted for the 1985 Booker Prize, and in 1986 won the <unk> Prize and the <unk> Smith Literary Award. In 2007 Lessing was awarded the Nobel Prize in Literature for being " part of both the history of literature and living literature ". In the award ceremony speech by Swedish writer Per <unk>, The Good Terrorist was cited as " an and prejudices ", which leaves no room for any further interest. Donoghue complained that Lessing has not made up her mind on whether her characters are " the salt of the earth or its <unk> ". In a review in the Chicago Tribune, Kuehn felt that the work has little impact and is not memorable. He said Lessing's real interest is character development, but complained that the characters are " trivial or two @-@ dimensional or crippled by self @-@ <unk> ". \n The Good Terrorist was shortlisted for the 1985 Booker Prize, and in 1986 won the <unk> Prize and the <unk> Smith Literary Award. In 2007 Lessing was awarded the Nobel Prize in Literature for being " part of both the history of literature and living literature ". In the award ceremony speech by Swedish writer Per <unk>, The Good Terrorist was cited as " an in

Fine-tuning the model

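The HuggingFace model returns a tuple as output, containing the actual predictions and some additional activations (should we want to use them in some regularization scheme). To work inside the fastai training loop, we need to drop those extra values with a Callback: callbacks are what we use to alter the behavior of the training loop.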

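Here we need to write the after_pred event and replace self.learn.pred (which contains the predictions that will be passed to the loss function) by just its first element. In callbacks, there is a shortcut that lets you access any of the underlying Learner attributes, so we can write self.pred[0] instead of self.learn.pred[0]. That shortcut only works for read access, not write, so the left-hand side of the assignment has to be written as self.learn.pred (otherwise we would set a pred attribute on the Callback itself).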

class DropOutput(Callback):
    def after_pred(self): self.learn.pred = self.pred[0]
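
The read-only nature of that shortcut can be reproduced with a toy example. This is a minimal sketch using Python's __getattr__, not fastai's actual delegation code; ToyLearner and ToyCallback are hypothetical names introduced for the illustration.

```python
class ToyLearner:
    def __init__(self):
        self.pred = ("logits", "extra activations")

class ToyCallback:
    def __init__(self, learn):
        self.learn = learn

    def __getattr__(self, name):
        # Only called when normal lookup fails, so reads fall through to the learner.
        return getattr(self.learn, name)

# Writing through the shortcut creates a new attribute on the callback
# and leaves the learner's prediction untouched.
wrong = ToyCallback(ToyLearner())
wrong.pred = wrong.pred[0]
print(wrong.learn.pred)

# Writing through self.learn updates the learner, which is what the loss function sees.
right = ToyCallback(ToyLearner())
right.learn.pred = right.pred[0]
print(right.learn.pred)
```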

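Of course we could make this a bit more complex and add some penalty to the loss using the other part of the tuple of predictions, like the RNNRegularizer does.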

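Now we are ready to create our Learner, a fastai object that groups the data, the model and the loss function, and handles model training or inference. Since we are in a language model setting, we pass perplexity as a metric, and we need to use the callback we just defined. Lastly, we use mixed precision to save every bit of memory we can (and if you have a modern GPU, it will also make training faster):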

learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=Perplexity()).to_fp16()

We can check how good the model is without any fine-tuning step (spoiler alert, it's pretty good!)

learn.validate()
(#2) [3.2537169456481934,25.88637924194336]

This lists the validation loss and the metric (so a perplexity of about 25.9, which is remarkably good).
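
Perplexity is the exponential of the average cross-entropy loss, so the two numbers returned by learn.validate() are linked. The short check below uses only the standard library; the loss value is copied from the output above.

```python
import math

valid_loss = 3.2537169456481934   # first value returned by learn.validate()
perplexity = math.exp(valid_loss) # perplexity = exp(cross-entropy loss)
print(round(perplexity, 2))
```

The result agrees with the second value in the output of learn.validate(), up to floating-point precision.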

Now that we have a Learner, we can use all the capabilities of the fastai training loop: learning rate finder, training with 1cycle, etc.

learn.lr_find()
SuggestedLRs(lr_min=0.017378008365631102, lr_steep=0.14454397559165955)

The learning rate finder curve suggests picking something between 1e-4 and 1e-3.

learn.fit_one_cycle(1, 1e-4)
epoch train_loss valid_loss perplexity time
0 2.986238 2.721945 15.209874 04:56
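
To give an idea of what the 1cycle policy does with the learning rate passed above, here is a rough sketch of the schedule. It assumes a cosine warm-up followed by a cosine decay and the values div=25, div_final=1e5 and pct_start=0.25, which are believed to be the fit_one_cycle defaults but should be checked against the fastai documentation. The function names are hypothetical, and the momentum schedule is left out.

```python
import math

def cos_anneal(start, end, pos):
    # Cosine interpolation from start (pos=0) to end (pos=1).
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pct, lr_max=1e-4, div=25.0, div_final=1e5, pct_start=0.25):
    # pct is the fraction of training completed, between 0 and 1.
    if pct < pct_start:
        # Warm-up: lr_max / div up to lr_max.
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing: lr_max down to lr_max / div_final.
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.25, 0.5, 1.0):
    print(f"{pct:.2f} -> {one_cycle_lr(pct):.2e}")
```

Under these assumptions, the learning rate starts well below 1e-4, peaks at 1e-4 a quarter of the way through training, and then decays to almost zero.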

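With just one epoch of fine-tuning and not much regularization, the validation perplexity drops from about 25.9 to about 15.2, starting from a pretrained model that was already very good. To have a look at some generated text, let's take a prompt that looks like a Wikipedia article: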

df_valid.head(1)
0
0 \n = Tropical Storm <unk> ( 2008 ) = \n \n Tropical Storm <unk> was the tenth tropical storm of the 2008 Atlantic hurricane season . <unk> developed out of a strong tropical wave which moved off the African coast on August 31 . The wave quickly became organized and was declared Tropical Depression Ten while located 170 mi ( 270 km ) to the south @-@ southeast of the Cape Verde Islands on September 2 . The depression was quickly upgraded to Tropical Storm <unk> around noon the same day . Over the next several days , <unk> moved in a general west @-@ northwest direction and reached its peak...

Articles seem to begin with a new line and the title between = signs, so we will mimic that:

prompt = "\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn"

The prompt needs to be tokenized and numericalized, so we use the same function as before to do this, before calling the generate method of the model.

prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None].cuda()
inp.shape
torch.Size([1, 21])
preds = learn.model.generate(inp, max_length=40, num_beams=5, temperature=1.5)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
tokenizer.decode(preds[0].cpu().numpy())
'\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn @-@ shaped head. It is a member of the <unk> family of <unk'