From 2b8cfd769a3e829c8421be580913510d0a15b512 Mon Sep 17 00:00:00 2001
From: Sofia Guerra
Date: Thu, 4 Aug 2022 17:14:12 -0300
Subject: [PATCH 1/5] Adding blog post

---
 ...-xeon-scalable-processors-with-bfloat16.md | 76 ++++++++++++++++++
 ...xeon-scalable-processors-with-bfloat16.png | Bin 0 -> 33115 bytes
 2 files changed, 76 insertions(+)
 create mode 100644 _posts/2022-8-4-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md
 create mode 100644 assets/images/empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.png

diff --git a/_posts/2022-8-4-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md b/_posts/2022-8-4-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md
new file mode 100644
index 000000000000..4d991e8e2430
--- /dev/null
+++ b/_posts/2022-8-4-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md
@@ -0,0 +1,76 @@
---
layout: blog_detail
title: "Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16"
author: Mingfei Ma (Intel), Vitaly Fedyunin (Meta), Wei Wei (Meta)
featured-img: '\assets\images\empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.png'
---

## Overview

In recent years, the growing complexity of AI models has placed ever-increasing compute demands on hardware. Reduced-precision numeric formats have been proposed to address this problem. Bfloat16 is a custom 16-bit floating point format for AI which consists of one sign bit, eight exponent bits, and seven mantissa bits. Because it has the same dynamic range as float32, bfloat16 doesn't require special handling such as loss scaling. Therefore, bfloat16 is a drop-in replacement for float32 when running deep neural networks for both inference and training.

The 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake) is the first general-purpose x86 CPU with native bfloat16 support. Three new bfloat16 instructions were introduced in Intel® Advanced Vector Extensions-512 (Intel® AVX-512): VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions perform conversion from float32 to bfloat16, and the last one performs a dot product of bfloat16 pairs. Theoretical bfloat16 compute throughput is doubled over float32 on Cooper Lake. On the next generation of Intel® Xeon® Scalable Processors, bfloat16 compute throughput will be further enhanced through the Advanced Matrix Extensions (Intel® AMX) instruction set extension.

Intel and Meta previously collaborated to enable bfloat16 in PyTorch, and the related work was published in an earlier [blog](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659) during the launch of Cooper Lake. In that blog, we introduced the hardware advancement for native bfloat16 support and showcased a 1.4x to 1.6x performance boost of bfloat16 over float32 on DLRM, ResNet-50 and ResNeXt-101-32x4d.

In this blog, we introduce the latest software enhancements for bfloat16 in PyTorch 1.12, which apply to a much broader scope of user scenarios and deliver an even higher performance boost.

## Native Level Optimization on Bfloat16

On the PyTorch CPU bfloat16 path, the compute-intensive operators, e.g., convolution, linear and bmm, use oneDNN (oneAPI Deep Neural Network Library) to achieve optimal performance on Intel CPUs with AVX512_BF16 or AMX support.
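For illustration, here is a minimal sketch (assuming a PyTorch 1.12+ CPU build with oneDNN; the layer shape and input size are arbitrary) that runs a compute-intensive operator directly on bfloat16 tensors. If the build routes through oneDNN, launching the script with the environment variable `ONEDNN_VERBOSE=1` set typically prints which oneDNN primitives and ISA were dispatched:

```python
import torch
import torch.nn as nn

# True if this PyTorch build ships with oneDNN (formerly MKL-DNN)
print(torch.backends.mkldnn.is_available())

# A bfloat16 convolution; with ONEDNN_VERBOSE=1 in the environment, oneDNN
# logs the implementation it picked (e.g., an avx512_core_bf16 kernel on
# hardware with native bfloat16 support).
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1).to(dtype=torch.bfloat16)
x = torch.randn(1, 3, 224, 224, dtype=torch.bfloat16)
with torch.no_grad():
    y = conv(x)
print(y.dtype)  # torch.bfloat16
```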
The other operators, such as tensor operators and neural network operators, are optimized at the PyTorch native level. We have extended bfloat16 kernel-level optimizations to the majority of operators on dense tensors, applicable to both inference and training (sparse tensor bfloat16 support will be covered in future work), specifically:

- **Bfloat16 vectorization**: Bfloat16 is stored as an unsigned 16-bit integer, so it must be cast to float32 for arithmetic operations such as add, mul, etc. Specifically, each bfloat16 vector is converted to two float32 vectors, processed accordingly, and then converted back. For non-arithmetic operations such as cat, copy, etc., it is a straight memory copy and no data type conversion is involved.
- **Bfloat16 reduction**: Reductions on bfloat16 data use float32 as the accumulation type to guarantee numerical stability, e.g., sum, BatchNorm2d, MaxPool2d, etc.
- **Channels Last optimization**: For vision models, Channels Last is the preferable memory format over Channels First from a performance perspective. We have implemented fully optimized CPU kernels for all the commonly used CV modules on the Channels Last memory format, covering both float32 and bfloat16.

## Run Bfloat16 with Auto Mixed Precision

To run a model in bfloat16, a user can either explicitly convert the data and model to bfloat16, for example:

```python
# with explicit conversion
input = input.to(dtype=torch.bfloat16)
model = model.to(dtype=torch.bfloat16)
```

or utilize the torch.amp (Automatic Mixed Precision) package. The autocast instance serves as a context manager or decorator that allows regions of your script to run in mixed precision, for example:

```python
# with AMP
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(input)
```

Generally, the explicit conversion approach and the AMP approach deliver similar performance. Even so, we recommend running bfloat16 models with AMP, because:

- **Better user experience with automatic fallback**: If your script includes operators that don't have bfloat16 support, autocast implicitly falls back to float32 for them, whereas the explicitly converted model will raise a runtime error.

- **Mixed data type for activation and parameters**: Unlike explicit conversion, which converts all the model parameters to bfloat16, AMP mode runs in mixed data types. To be specific, input/output tensors are kept in bfloat16 while parameters, e.g., weight/bias, are kept in float32. The mixed data types of activations and parameters help improve performance while maintaining accuracy, as shown in the sketch below.
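As a small illustration of this mixed-data-type behavior (a minimal sketch; the layer and tensor shapes are arbitrary), the parameters stay in float32 while the outputs produced inside the autocast region come out in bfloat16:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 64)   # parameters are created in float32
x = torch.randn(8, 64)      # float32 input

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(model.weight.dtype)   # torch.float32 -- parameters remain in float32
print(y.dtype)              # torch.bfloat16 -- activations are produced in bfloat16
```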
## Performance Gains

We benchmarked the inference performance of TorchVision models on an Intel® Xeon® Platinum 8380H CPU @ 2.90GHz (codenamed Cooper Lake), with a single instance per socket (batch size = 2 x the number of physical cores). The results show that bfloat16 has a 1.4x to 2.2x performance gain over float32.

[figure: bfloat16 vs. float32 inference performance gains on TorchVision models]
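For readers who want to try this kind of measurement themselves, below is a rough sketch of a single-instance inference run that combines bfloat16 autocast with the Channels Last memory format. The model choice, batch size, and iteration counts are illustrative only (56 is roughly 2 x the 28 physical cores of one 8380H socket), core binding (e.g., via numactl) is omitted, and this is not the exact harness used for the numbers above:

```python
import time
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
model = model.to(memory_format=torch.channels_last)   # Channels Last weights

# illustrative batch: 2 x 28 physical cores = 56 images
batch = torch.randn(56, 3, 224, 224).to(memory_format=torch.channels_last)

def throughput(n_iter=30, warmup=10):
    with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        for _ in range(warmup):      # warm-up iterations are not timed
            model(batch)
        start = time.time()
        for _ in range(n_iter):
            model(batch)
    return n_iter * batch.size(0) / (time.time() - start)   # images / second

print(f"throughput: {throughput():.1f} images/sec")
```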
## The performance boost of bfloat16 over float32 primarily comes from 3 aspects:

- The compute-intensive operators take advantage of the new native bfloat16 instruction VDPBF16PS, which doubles the hardware compute throughput.
- Bfloat16 has only half the memory footprint of float32, so memory-bandwidth-intensive operators can theoretically run up to twice as fast.
- On Channels Last, we intentionally keep the same parallelization scheme across all the memory-format-aware operators (this isn't possible on Channels First), which increases data locality when passing one layer's output to the next: the data stays closer to the CPU cores and largely remains in cache. Bfloat16 achieves a higher cache hit rate than float32 in such scenarios because of its smaller memory footprint.

## Conclusion & Future Work

In this blog, we introduced the recent bfloat16 software optimizations included in PyTorch 1.12. Results on the 3rd Gen Intel® Xeon® Scalable processor show that bfloat16 delivers a 1.4x to 2.2x performance gain over float32 on TorchVision models. Further improvement is expected on the next generation of Intel® Xeon® Scalable Processors with AMX instruction support. Although the performance numbers for this blog were collected with TorchVision models, the benefit applies broadly across model topologies. We will continue to extend the bfloat16 optimization effort to an even broader scope in the future!

## Acknowledgement

The results presented in this blog are a joint effort of the Meta and Intel PyTorch teams. Special thanks to Vitaly Fedyunin and Wei Wei from Meta, who spent precious time and gave substantial assistance! Together we made one more step on the path of improving the PyTorch CPU ecosystem.

## Reference

- [The bfloat16 numerical format](https://cloud.google.com/tpu/docs/bfloat16?hl=en)
- [https://pytorch.org/docs/master/amp.html#torch.autocast](https://pytorch.org/docs/master/amp.html#torch.autocast)
- [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost's new BFloat16 capability](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659)
\ No newline at end of file

diff --git a/assets/images/empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.png b/assets/images/empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.png
new file mode 100644
index 0000000000000000000000000000000000000000..688d117520f3f4be4278bdc0a9b0b1437a2c18a3
GIT binary patch
literal 33115
[binary image data omitted]
zo~rS?U}_34aL8jygrNs7Z!VW_&H#F6Ut@Ge1}v_pl(H{tm%J)Z_dvaumqJy45quue zdhgrptkv!`JDYg`!@t_ARUK3jM8hwp;`t!T7zsLSa@&HYSoJW7GiCPvwZsO|ea`d@ zOz+ypTaM)sqsLsI;JaTMZB2~T7m^MD01OCwK6~D7(j>uM%tVyBPQ=mFV<)lri1luJ zdT#1IDQxz(T|KsbYy!hAk}Otk0WO5@oJ>3gm(johSC4JWNMq1P1N0p3-bE!l{}5Jj z%q$(Ho_DZR)AO8JaRCvQaskbF@f2%D|I^LUoE*N+M|6%Z5?-iSq~-j`5C#b-{%F_A*_)rp$1rD%3$muQH}W+rw#W zC=j0>@$)%-nQ8e|t&+`k(+8C2AcRXt28H)JLfg{O*1F>!Z+WCxCvzT_w@ER4QW>4p3jBLbcIevK(;WU*?Zgd)1*S8ruXnOoPn zFN0M6&iuH3!nk`Re=7@2U*u6(AY5y*l@=K4m=YH{vDW z4&$v9?l8y{DCtmF=-IynL20C|?h+x5J5@*qZgKn~x1KegvHkEPv5U)b2}>7_oYf-+ zgy2@ZSy%Co)I*j?;hO_K813z6e4Yi-aNsb>zc9*sEtb~}3-JP?vtn?`wt?WQshL1C u*x-gN@ACh-L^4SI|C@NTJ%1IowROJOf5!L2oxeoD=T|!?+p^PsN&f-c(n?bR literal 0 HcmV?d00001 From 8bfb3e89fbd8505041a1cd5f0a40117c5a1860e4 Mon Sep 17 00:00:00 2001 From: Sofia Guerra Date: Tue, 16 Aug 2022 09:51:06 -0300 Subject: [PATCH 2/5] Updating Date --- ...ng-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename _posts/{2022-8-4-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md => 2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md} (100%) diff --git a/_posts/2022-8-4-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md b/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md similarity index 100% rename from _posts/2022-8-4-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md rename to _posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md From e9244afc061f7ff56f4996870b5ab2a513f03d0a Mon Sep 17 00:00:00 2001 From: Sofia Guerra Date: Tue, 16 Aug 2022 13:26:03 -0300 Subject: [PATCH 3/5] Updating Superscript --- ...rch-on-intel-xeon-scalable-processors-with-bfloat16.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md b/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md index 4d991e8e2430..01bb5084f700 100644 --- a/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md +++ b/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md @@ -1,6 +1,6 @@ --- layout: blog_detail -title: "Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16" +title: "Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16" author: Mingfei Ma (Intel), Vitaly Fedyunin (Meta), Wei Wei (Meta) featured-img: '\assets\images\empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.png' --- @@ -9,7 +9,7 @@ featured-img: '\assets\images\empowering-pytorch-on-intel-xeon-scalable-processo Recent years, the growing complexity of AI models have been posing requirements on hardware for more and more compute capability. Reduced precision numeric format has been proposed to address this problem. Bfloat16 is a custom 16-bit floating point format for AI which consists of one sign bit, eight exponent bits, and seven mantissa bits. With the same dynamic range as float32, bfloat16 doesn’t require a special handling such as loss scaling. Therefore, bfloat16 is a drop-in replacement for float32 when running deep neural networks for both inference and training. -The 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake), is the first general purpose x86 CPU with native bfloat16 support. 
Three new bfloat16 instructions were introduced in Intel® Advanced Vector Extensions-512 (Intel® AVX-512): VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions perform conversion from float32 to bfloat16, and the last one performs a dot product of bfloat16 pairs. Bfloat16 theoretical compute throughput is doubled over float32 on Cooper Lake. On the next generation of Intel® Xeon® Scalable Processors, bfloat16 compute throughput will be further enhanced through Advanced Matrix Extensions (Intel® AMX) instruction set extension. +The 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake), is the first general purpose x86 CPU with native bfloat16 support. Three new bfloat16 instructions were introduced in Intel® Advanced Vector Extensions-512 (Intel® AVX-512): VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions perform conversion from float32 to bfloat16, and the last one performs a dot product of bfloat16 pairs. Bfloat16 theoretical compute throughput is doubled over float32 on Cooper Lake. On the next generation of Intel® Xeon® Scalable Processors, bfloat16 compute throughput will be further enhanced through Advanced Matrix Extensions (Intel® AMX) instruction set extension. Intel and Meta previously collaborated to enable bfloat16 on PyTorch, and the related work was published in an earlier [blog](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659) during launch of Cooper Lake. In that blog, we introduced the hardware advancement for native bfloat16 support and showcased a performance boost of 1.4x to 1.6x of bfloat16 over float32 from DLRM, ResNet-50 and ResNext-101-32x4d. @@ -63,7 +63,7 @@ We benchmarked inference performance of TorchVision models on Intel® Xeon® Pla ## Conclusion & Future Work -In this blog, we introduced recent software optimizations on bfloat16 introduced in PyTorch 1.12. Results on the 3rd Gen Intel® Xeon® Scalable processor show that bfloat16 has 1.4x to 2.2x performance gain over float32 on the TorchVision models. Further improvement is expected on the next generation of Intel® Xeon® Scalable Processors with AMX instruction support. Though the performance number for this blog is collected with TorchVision models, the benefit is broad across all topologies. And we will continue to extend the bfloat16 optimization effort to a broader scope in the future! +In this blog, we introduced recent software optimizations on bfloat16 introduced in PyTorch 1.12. Results on the 3rd Gen Intel® Xeon® Scalable processor show that bfloat16 has 1.4x to 2.2x performance gain over float32 on the TorchVision models. Further improvement is expected on the next generation of Intel® Xeon® Scalable Processors with AMX instruction support. Though the performance number for this blog is collected with TorchVision models, the benefit is broad across all topologies. And we will continue to extend the bfloat16 optimization effort to a broader scope in the future! 
## Acknowledgement @@ -73,4 +73,4 @@ The results presented in this blog is a joint effort of Meta and Intel PyTorch t - [The bfloat16 numerical format](https://cloud.google.com/tpu/docs/bfloat16?hl=en) - [https://pytorch.org/docs/master/amp.html#torch.autocast](https://pytorch.org/docs/master/amp.html#torch.autocast) -- [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659) \ No newline at end of file +- [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659) \ No newline at end of file From f8d26486e80b304955bee912036e18266d3a8a8b Mon Sep 17 00:00:00 2001 From: Sofia Guerra Date: Tue, 16 Aug 2022 13:43:28 -0300 Subject: [PATCH 4/5] Updating superscript --- ...pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md b/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md index 01bb5084f700..117f586a7d68 100644 --- a/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md +++ b/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md @@ -1,6 +1,6 @@ --- layout: blog_detail -title: "Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16" +title: "Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16" author: Mingfei Ma (Intel), Vitaly Fedyunin (Meta), Wei Wei (Meta) featured-img: '\assets\images\empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.png' --- @@ -73,4 +73,4 @@ The results presented in this blog is a joint effort of Meta and Intel PyTorch t - [The bfloat16 numerical format](https://cloud.google.com/tpu/docs/bfloat16?hl=en) - [https://pytorch.org/docs/master/amp.html#torch.autocast](https://pytorch.org/docs/master/amp.html#torch.autocast) -- [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659) \ No newline at end of file +- [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659) \ No newline at end of file From 5518c929fee3b6069aff946546baa5ca6b7ff009 Mon Sep 17 00:00:00 2001 From: Sofia Guerra Date: Thu, 18 Aug 2022 11:22:15 -0300 Subject: [PATCH 5/5] Fixing featured image --- ...pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md b/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md index 117f586a7d68..f074757c24c2 
100644 --- a/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md +++ b/_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md @@ -2,7 +2,7 @@ layout: blog_detail title: "Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16" author: Mingfei Ma (Intel), Vitaly Fedyunin (Meta), Wei Wei (Meta) -featured-img: '\assets\images\empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.png' +featured-img: '/assets/images/empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.png' --- ## Overview @@ -52,7 +52,7 @@ Generally, the explicit conversion approach and AMP approach have similar perfor We benchmarked inference performance of TorchVision models on Intel® Xeon® Platinum 8380H CPU @ 2.90GHz (codenamed Cooper Lake), single instance per socket (batch size = 2 x number of physical cores). Results show that bfloat16 has 1.4x to 2.2x performance gain over float32.
## The performance boost of bfloat16 over float32 primarily comes from 3 aspects: