
dotnet C#: baseline performance benchmarks on machines with different CPU models

Updated: at 12:43,Created: at 23:29

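This post records the results of running the same dotnet Benchmark code, written by me, on several machines with different CPU models, to see how well each CPU model handles C# workloads. This is not a rigorous test, and the numbers are only meaningful relative to each other.

My results are below. The test code is on GitHub; see the end of this post for how to download it.

I strongly recommend pulling the code and running it on your own device to measure its performance. Before you start, you should already know the basics of performance testing, to avoid drawing strange conclusions.

The tests aim to cover as many basic CPU instructions and basic logic behaviors as possible. Raw CPU instruction performance has already been measured by many others, so the focus here is on the combined performance of the many CPU instructions invoked by higher-level C# application code. The tests also cover CPU caching, branch prediction hits, and passing method arguments on the stack. The point is not to compare different C# implementations of the same functionality, but to compare the same code across different CPU models, memory, and operating systems. For that reason this is not a rigorous test, and the resulting numbers only have relative meaning.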

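Array creation

Intel 13th Gen Core i7-13700K

The following was run on my development machine with several hundred processes open, so there is a fair amount of interference. That is not a big problem, because the i7-13700K is still far ahead. I will rerun on an idle machine later for more accurate numbers.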

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
Job-AXOZTJ : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
RunStrategy=Throughput
| Method | ArraySize | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|
| NewArray | 10 | 3.873 ns | 0.1146 ns | 0.2417 ns | 3.777 ns | 1.00 | 0.00 |
| GCZeroInitialized | 10 | 12.234 ns | 0.2815 ns | 0.4382 ns | 12.168 ns | 3.15 | 0.21 |
| GCZeroUninitialized | 10 | 4.470 ns | 0.1491 ns | 0.4056 ns | 4.354 ns | 1.14 | 0.13 |
| NewArrayWithRandomVisit | 10 | 12.012 ns | 0.2679 ns | 0.2506 ns | 11.941 ns | 3.09 | 0.18 |
| NewArrayWithOrdinalVisit | 10 | 9.839 ns | 0.3379 ns | 0.9803 ns | 9.635 ns | 2.58 | 0.26 |
| NewArray | 100 | 11.875 ns | 0.1932 ns | 0.2444 ns | 11.813 ns | 1.00 | 0.00 |
| GCZeroInitialized | 100 | 21.980 ns | 0.4524 ns | 0.8931 ns | 21.820 ns | 1.88 | 0.08 |
| GCZeroUninitialized | 100 | 12.126 ns | 0.2769 ns | 0.5201 ns | 11.953 ns | 1.04 | 0.05 |
| NewArrayWithRandomVisit | 100 | 47.344 ns | 0.9635 ns | 2.1351 ns | 46.572 ns | 4.03 | 0.24 |
| NewArrayWithOrdinalVisit | 100 | 75.207 ns | 1.4285 ns | 1.3363 ns | 75.364 ns | 6.33 | 0.15 |
| NewArray | 1000 | 110.197 ns | 2.1602 ns | 2.0206 ns | 109.619 ns | 1.00 | 0.00 |
| GCZeroInitialized | 1000 | 116.560 ns | 2.0796 ns | 1.8435 ns | 116.604 ns | 1.06 | 0.03 |
| GCZeroUninitialized | 1000 | 33.476 ns | 0.5921 ns | 0.5538 ns | 33.643 ns | 0.30 | 0.01 |
| NewArrayWithRandomVisit | 1000 | 208.835 ns | 4.1962 ns | 8.8512 ns | 205.699 ns | 1.92 | 0.09 |
| NewArrayWithOrdinalVisit | 1000 | 620.850 ns | 11.5406 ns | 10.7951 ns | 619.304 ns | 5.64 | 0.15 |
| NewArray | 10000 | 996.853 ns | 21.9389 ns | 61.8790 ns | 970.393 ns | 1.00 | 0.00 |
| GCZeroInitialized | 10000 | 996.704 ns | 20.8764 ns | 58.5397 ns | 974.900 ns | 1.00 | 0.08 |
| GCZeroUninitialized | 10000 | 63.200 ns | 1.0544 ns | 0.9863 ns | 63.315 ns | 0.06 | 0.00 |
| NewArrayWithRandomVisit | 10000 | 1,242.151 ns | 24.2642 ns | 38.4856 ns | 1,233.944 ns | 1.21 | 0.07 |
| NewArrayWithOrdinalVisit | 10000 | 6,068.245 ns | 90.8508 ns | 84.9819 ns | 6,076.727 ns | 5.79 | 0.34 |
| NewArray | 100000 | 7,381.046 ns | 137.9635 ns | 147.6194 ns | 7,372.520 ns | 1.00 | 0.00 |
| GCZeroInitialized | 100000 | 7,214.089 ns | 85.2068 ns | 71.1515 ns | 7,209.220 ns | 0.97 | 0.02 |
| GCZeroUninitialized | 100000 | 7,347.661 ns | 146.3643 ns | 174.2363 ns | 7,306.838 ns | 1.00 | 0.03 |
| NewArrayWithRandomVisit | 100000 | 8,456.669 ns | 164.5726 ns | 219.6997 ns | 8,517.366 ns | 1.14 | 0.05 |
| NewArrayWithOrdinalVisit | 100000 | 129,749.709 ns | 2,408.4302 ns | 2,773.5518 ns | 128,963.159 ns | 17.57 | 0.55 |
| NewArray | 1000000 | 59,752.036 ns | 1,194.7579 ns | 1,929.3113 ns | 59,414.325 ns | 1.00 | 0.00 |
| GCZeroInitialized | 1000000 | 60,008.303 ns | 1,188.0164 ns | 1,778.1671 ns | 59,378.000 ns | 1.01 | 0.04 |
| GCZeroUninitialized | 1000000 | 58,868.279 ns | 1,023.4279 ns | 957.3151 ns | 58,724.731 ns | 0.97 | 0.04 |
| NewArrayWithRandomVisit | 1000000 | 56,399.609 ns | 1,068.5479 ns | 999.5204 ns | 56,296.948 ns | 0.93 | 0.03 |
| NewArrayWithOrdinalVisit | 1000000 | 1,314,841.960 ns | 26,155.6618 ns | 27,986.2651 ns | 1,313,674.414 ns | 21.92 | 1.00 |

Zhaoxin KaiXian KX-U6780A

BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
Job-YPUGMN : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
RunStrategy=Throughput
| Method | ArraySize | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|
| NewArray | 10 | 40.20 ns | 0.977 ns | 1.491 ns | 39.98 ns | 1.00 | 0.00 |
| GCZeroInitialized | 10 | 141.12 ns | 2.996 ns | 6.051 ns | 139.67 ns | 3.54 | 0.18 |
| GCZeroUninitialized | 10 | 48.72 ns | 0.849 ns | 0.663 ns | 48.91 ns | 1.19 | 0.05 |
| NewArrayWithRandomVisit | 10 | 195.75 ns | 1.082 ns | 0.845 ns | 195.65 ns | 4.77 | 0.16 |
| NewArrayWithOrdinalVisit | 10 | 72.42 ns | 1.513 ns | 2.400 ns | 72.45 ns | 1.80 | 0.08 |
| NewArray | 100 | 135.07 ns | 2.892 ns | 6.100 ns | 135.41 ns | 1.00 | 0.00 |
| GCZeroInitialized | 100 | 228.42 ns | 4.662 ns | 10.135 ns | 228.83 ns | 1.70 | 0.11 |
| GCZeroUninitialized | 100 | 137.26 ns | 2.939 ns | 5.519 ns | 136.70 ns | 1.02 | 0.06 |
| NewArrayWithRandomVisit | 100 | 572.02 ns | 11.660 ns | 19.157 ns | 568.34 ns | 4.26 | 0.27 |
| NewArrayWithOrdinalVisit | 100 | 467.29 ns | 9.357 ns | 13.117 ns | 464.49 ns | 3.47 | 0.21 |
| NewArray | 1000 | 1,037.70 ns | 20.377 ns | 54.742 ns | 1,031.50 ns | 1.00 | 0.00 |
| GCZeroInitialized | 1000 | 1,127.93 ns | 22.581 ns | 59.091 ns | 1,125.79 ns | 1.09 | 0.07 |
| GCZeroUninitialized | 1000 | 653.93 ns | 6.239 ns | 4.871 ns | 652.04 ns | 0.60 | 0.02 |
| NewArrayWithRandomVisit | 1000 | 2,375.21 ns | 47.088 ns | 100.349 ns | 2,352.11 ns | 2.27 | 0.13 |
| NewArrayWithOrdinalVisit | 1000 | 4,474.90 ns | 87.887 ns | 107.933 ns | 4,453.16 ns | 4.19 | 0.28 |
| NewArray | 10000 | 9,586.62 ns | 189.501 ns | 369.608 ns | 9,657.74 ns | 1.00 | 0.00 |
| GCZeroInitialized | 10000 | 9,767.26 ns | 194.643 ns | 462.590 ns | 9,811.53 ns | 1.02 | 0.07 |
| GCZeroUninitialized | 10000 | 4,093.63 ns | 80.993 ns | 143.965 ns | 4,026.86 ns | 0.43 | 0.02 |
| NewArrayWithRandomVisit | 10000 | 13,908.10 ns | 202.573 ns | 169.158 ns | 13,928.15 ns | 1.47 | 0.06 |
| NewArrayWithOrdinalVisit | 10000 | 43,057.16 ns | 854.132 ns | 1,495.943 ns | 42,914.21 ns | 4.50 | 0.25 |
| NewArray | 100000 | 63,542.13 ns | 576.256 ns | 510.836 ns | 63,519.28 ns | 1.00 | 0.00 |
| GCZeroInitialized | 100000 | 66,357.64 ns | 1,312.089 ns | 2,118.779 ns | 66,043.66 ns | 1.03 | 0.03 |
| GCZeroUninitialized | 100000 | 63,638.29 ns | 1,241.493 ns | 1,477.909 ns | 63,270.73 ns | 1.01 | 0.03 |
| NewArrayWithRandomVisit | 100000 | 76,609.50 ns | 1,501.442 ns | 1,729.063 ns | 75,958.21 ns | 1.21 | 0.03 |
| NewArrayWithOrdinalVisit | 100000 | 665,286.65 ns | 9,295.620 ns | 7,762.264 ns | 662,915.19 ns | 10.47 | 0.16 |
| NewArray | 1000000 | 461,130.99 ns | 9,000.698 ns | 10,004.252 ns | 461,306.23 ns | 1.00 | 0.00 |
| GCZeroInitialized | 1000000 | 459,810.29 ns | 8,893.401 ns | 10,586.961 ns | 455,791.25 ns | 1.00 | 0.03 |
| GCZeroUninitialized | 1000000 | 456,245.03 ns | 8,819.606 ns | 12,363.856 ns | 452,252.89 ns | 0.99 | 0.04 |
| NewArrayWithRandomVisit | 1000000 | 497,132.01 ns | 9,841.562 ns | 12,796.810 ns | 490,990.22 ns | 1.08 | 0.03 |
| NewArrayWithOrdinalVisit | 1000000 | 6,742,537.03 ns | 48,986.470 ns | 38,245.414 ns | 6,732,321.64 ns | 14.51 | 0.31 |

Phytium D2000

BenchmarkDotNet v0.13.12, Kylin V10 SP1
Phytium,D2000/8 E8C, 8 logical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
Job-NHRLJG : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
RunStrategy=Throughput
| Method | ArraySize | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|
| NewArray | 10 | 22.18 ns | 0.149 ns | 0.132 ns | 1.00 | 0.00 |
| GCZeroInitialized | 10 | 92.43 ns | 0.564 ns | 0.440 ns | 4.17 | 0.02 |
| GCZeroUninitialized | 10 | 25.68 ns | 0.248 ns | 0.243 ns | 1.16 | 0.01 |
| NewArrayWithRandomVisit | 10 | 108.25 ns | 0.299 ns | 0.250 ns | 4.88 | 0.03 |
| NewArrayWithOrdinalVisit | 10 | 34.55 ns | 0.126 ns | 0.112 ns | 1.56 | 0.01 |
| NewArray | 100 | 76.35 ns | 0.941 ns | 0.880 ns | 1.00 | 0.00 |
| GCZeroInitialized | 100 | 163.69 ns | 0.952 ns | 0.743 ns | 2.14 | 0.03 |
| GCZeroUninitialized | 100 | 80.21 ns | 0.528 ns | 0.468 ns | 1.05 | 0.02 |
| NewArrayWithRandomVisit | 100 | 421.53 ns | 1.679 ns | 1.402 ns | 5.52 | 0.06 |
| NewArrayWithOrdinalVisit | 100 | 300.66 ns | 1.274 ns | 1.130 ns | 3.94 | 0.05 |
| NewArray | 1000 | 640.11 ns | 4.059 ns | 3.598 ns | 1.00 | 0.00 |
| GCZeroInitialized | 1000 | 672.06 ns | 3.242 ns | 3.032 ns | 1.05 | 0.01 |
| GCZeroUninitialized | 1000 | 483.70 ns | 2.202 ns | 1.952 ns | 0.76 | 0.01 |
| NewArrayWithRandomVisit | 1000 | 1,765.24 ns | 6.469 ns | 5.402 ns | 2.76 | 0.02 |
| NewArrayWithOrdinalVisit | 1000 | 2,850.39 ns | 12.971 ns | 12.133 ns | 4.45 | 0.03 |
| NewArray | 10000 | 5,219.58 ns | 36.810 ns | 32.631 ns | 1.00 | 0.00 |
| GCZeroInitialized | 10000 | 5,280.52 ns | 27.550 ns | 24.422 ns | 1.01 | 0.01 |
| GCZeroUninitialized | 10000 | 2,640.52 ns | 44.642 ns | 34.853 ns | 0.51 | 0.01 |
| NewArrayWithRandomVisit | 10000 | 8,992.89 ns | 20.367 ns | 19.052 ns | 1.72 | 0.01 |
| NewArrayWithOrdinalVisit | 10000 | 26,983.43 ns | 355.773 ns | 297.086 ns | 5.17 | 0.05 |
| NewArray | 100000 | 45,506.61 ns | 431.868 ns | 403.970 ns | 1.00 | 0.00 |
| GCZeroInitialized | 100000 | 45,543.14 ns | 432.449 ns | 404.513 ns | 1.00 | 0.01 |
| GCZeroUninitialized | 100000 | 44,461.84 ns | 331.168 ns | 309.775 ns | 0.98 | 0.01 |
| NewArrayWithRandomVisit | 100000 | 57,232.01 ns | 318.770 ns | 298.178 ns | 1.26 | 0.01 |
| NewArrayWithOrdinalVisit | 100000 | 445,380.51 ns | 2,904.888 ns | 2,425.713 ns | 9.78 | 0.10 |
| NewArray | 1000000 | 318,862.16 ns | 1,899.267 ns | 1,683.651 ns | 1.00 | 0.00 |
| GCZeroInitialized | 1000000 | 319,510.71 ns | 4,669.274 ns | 3,645.462 ns | 1.00 | 0.01 |
| GCZeroUninitialized | 1000000 | 314,884.17 ns | 5,637.859 ns | 4,401.669 ns | 0.99 | 0.02 |
| NewArrayWithRandomVisit | 1000000 | 357,843.40 ns | 3,063.527 ns | 2,865.625 ns | 1.12 | 0.01 |
| NewArrayWithOrdinalVisit | 1000000 | 4,547,465.54 ns | 15,355.309 ns | 12,822.379 ns | 14.28 | 0.05 |
| NewArray | 1000000000 | 1,541,406,672.88 ns | 35,733,853.844 ns | 102,527,125.216 ns | 1.000 | 0.00 |
| GCZeroInitialized | 1000000000 | 1,548,370,215.42 ns | 38,407,327.571 ns | 110,197,822.498 ns | 1.009 | 0.10 |
| GCZeroUninitialized | 1000000000 | 1,486,735.21 ns | 28,605.254 ns | 26,757.372 ns | 0.001 | 0.00 |
| NewArrayWithRandomVisit | 1000000000 | 1,590,271,119.60 ns | 33,473,585.461 ns | 96,041,991.522 ns | 1.036 | 0.09 |
| NewArrayWithOrdinalVisit | 1000000000 | 3,861,833,983.54 ns | 2,367,487.064 ns | 1,976,958.923 ns | 2.546 | 0.16 |

The last group of results above for the Phytium D2000 (ArraySize 1000000000) is probably abnormal.
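The benchmark source for this section is not reproduced in the post, so the following is only a guess at what the three allocation methods named in the tables might look like. It assumes that `GCZeroInitialized` and `GCZeroUninitialized` correspond to `GC.AllocateArray` and `GC.AllocateUninitializedArray`, and that the element type is `int`; both are assumptions based purely on the method names. The class name `ArrayCreationSketch` is hypothetical, and the two "Visit" variants are not sketched because their access pattern cannot be inferred from the tables.

```csharp
using System;

// Hypothetical sketch: the mapping from benchmark names to APIs is an assumption.
public static class ArrayCreationSketch
{
    // Plain allocation; the runtime zeroes the memory.
    public static int[] NewArray(int size) => new int[size];

    // Allocation through the GC API; the contents are zeroed.
    public static int[] GCZeroInitialized(int size) => GC.AllocateArray<int>(size);

    // Allocation through the GC API; the runtime is allowed to skip zeroing,
    // so the contents must be treated as unspecified until written.
    public static int[] GCZeroUninitialized(int size) => GC.AllocateUninitializedArray<int>(size);
}
```

If this mapping is right, skipping the zeroing step would be one plausible reason why `GCZeroUninitialized` is much cheaper than `NewArray` at some sizes in the tables above, but that explanation is not confirmed by the post.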

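Array copy

Test dimensions

The benchmark covers three ways of copying an Int32 array, matching the Method column in the tables below: CopyByFor (an element-by-element for loop), Memcpy (the standard C memcpy called through P/Invoke), and CopyBlockUnaligned (dotnet's Unsafe.CopyBlockUnaligned).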
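As a rough illustration of the three copy strategies that appear in the Method column of the tables in this section, here is a minimal sketch. It is not the benchmark's actual source. The class name `CopySketch`, the P/Invoke declaration, and the library name `"libc"` are assumptions (on Windows the C runtime would be reached under a different name such as `msvcrt`), and the sketch assumes `dest` is at least as long as `source`.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical sketch of the three copy strategies; not the benchmark's code.
public static class CopySketch
{
    // Assumption: the C runtime is loadable as "libc". The ref parameters are
    // pinned by the marshaller for the duration of the call.
    [DllImport("libc", EntryPoint = "memcpy")]
    private static extern IntPtr NativeMemcpy(ref byte dest, ref byte src, nuint count);

    // Element-by-element copy with a plain for loop.
    public static void CopyByFor(int[] source, int[] dest)
    {
        for (int i = 0; i < source.Length; i++)
        {
            dest[i] = source[i];
        }
    }

    // Copy through the C memcpy function via P/Invoke.
    public static void Memcpy(int[] source, int[] dest)
    {
        if (source.Length == 0) return;
        nuint byteCount = (nuint)source.Length * (nuint)sizeof(int);
        NativeMemcpy(
            ref Unsafe.As<int, byte>(ref dest[0]),
            ref Unsafe.As<int, byte>(ref source[0]),
            byteCount);
    }

    // Copy through Unsafe.CopyBlockUnaligned, which takes a byte count as uint.
    public static void CopyBlockUnaligned(int[] source, int[] dest)
    {
        if (source.Length == 0) return;
        uint byteCount = checked((uint)((long)source.Length * sizeof(int)));
        Unsafe.CopyBlockUnaligned(
            ref Unsafe.As<int, byte>(ref dest[0]),
            ref Unsafe.As<int, byte>(ref source[0]),
            byteCount);
    }
}
```

Only the `Memcpy` variant crosses the managed/native boundary, so it is the one that pays the P/Invoke transition cost on every call.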

Intel 13th Gen Core i7-13700K

Small arrays

For arrays shorter than 1000 elements, P/Invoke overhead interferes heavily with the measurement, so the minimum array length is set to 1000.

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
Job-GCHWHL : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
RunStrategy=Throughput
| Method | source | dest | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|---|
| CopyByFor | Int32[10000] | Int32[10000] | 1,958.98 ns | 8.391 ns | 7.007 ns | 1.000 |
| Memcpy | Int32[10000] | Int32[10000] | 609.35 ns | 3.266 ns | 3.055 ns | 0.311 |
| CopyBlockUnaligned | Int32[10000] | Int32[10000] | 577.84 ns | 1.391 ns | 1.301 ns | 0.295 |
| CopyByFor | Int32[1000] | Int32[1000] | 202.09 ns | 0.376 ns | 0.352 ns | 0.103 |
| Memcpy | Int32[1000] | Int32[1000] | 32.21 ns | 0.323 ns | 0.302 ns | 0.016 |
| CopyBlockUnaligned | Int32[1000] | Int32[1000] | 19.19 ns | 0.067 ns | 0.059 ns | 0.010 |

The data above shows that even with small amounts of data, memcpy and Unsafe.CopyBlockUnaligned are still faster than the for loop.

Large arrays

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
Job-DBDADP : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
RunStrategy=Throughput
| Method | source | dest | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| CopyByFor | Int32[100000000] | Int32[100000000] | 41,348,684.32 ns | 751,207.515 ns | 1,028,261.326 ns | 41,102,646.15 ns | 1.000 | 0.00 |
| Memcpy | Int32[100000000] | Int32[100000000] | 27,086,427.67 ns | 738,121.867 ns | 2,057,588.736 ns | 26,318,143.75 ns | 0.675 | 0.05 |
| CopyBlockUnaligned | Int32[100000000] | Int32[100000000] | 24,020,801.37 ns | 467,035.642 ns | 458,691.448 ns | 23,894,810.94 ns | 0.579 | 0.02 |
| CopyByFor | Int32[10000000] | Int32[10000000] | 3,800,486.40 ns | 69,523.151 ns | 162,508.123 ns | 3,748,857.23 ns | 0.092 | 0.01 |
| Memcpy | Int32[10000000] | Int32[10000000] | 2,313,413.90 ns | 75,362.059 ns | 208,827.911 ns | 2,248,826.17 ns | 0.058 | 0.01 |
| CopyBlockUnaligned | Int32[10000000] | Int32[10000000] | 2,005,075.29 ns | 55,131.653 ns | 149,989.727 ns | 1,925,467.19 ns | 0.049 | 0.00 |
| CopyByFor | Int32[1000000] | Int32[1000000] | 201,416.81 ns | 1,630.278 ns | 1,524.963 ns | 200,902.27 ns | 0.005 | 0.00 |
| Memcpy | Int32[1000000] | Int32[1000000] | 104,570.31 ns | 3,304.068 ns | 9,319.184 ns | 100,412.65 ns | 0.003 | 0.00 |
| CopyBlockUnaligned | Int32[1000000] | Int32[1000000] | 99,385.15 ns | 1,824.888 ns | 1,617.716 ns | 99,135.09 ns | 0.002 | 0.00 |
| CopyByFor | Int32[10000] | Int32[10000] | 1,958.87 ns | 4.267 ns | 3.783 ns | 1,959.42 ns | 0.000 | 0.00 |
| Memcpy | Int32[10000] | Int32[10000] | 624.06 ns | 4.451 ns | 4.164 ns | 622.60 ns | 0.000 | 0.00 |
| CopyBlockUnaligned | Int32[10000] | Int32[10000] | 581.32 ns | 2.044 ns | 1.912 ns | 581.53 ns | 0.000 | 0.00 |
| CopyByFor | Int32[1000] | Int32[1000] | 201.05 ns | 0.678 ns | 0.635 ns | 201.05 ns | 0.000 | 0.00 |
| Memcpy | Int32[1000] | Int32[1000] | 32.12 ns | 0.638 ns | 0.683 ns | 32.10 ns | 0.000 | 0.00 |
| CopyBlockUnaligned | Int32[1000] | Int32[1000] | 21.02 ns | 0.090 ns | 0.085 ns | 21.04 ns | 0.000 | 0.00 |

Zhaoxin KaiXian KX-U6780A

Small arrays

BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
Job-SBDPDU : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
RunStrategy=Throughput
| Method | source | dest | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| CopyByFor | Int32[10000] | Int32[10000] | 14.814 us | 0.1734 us | 0.1537 us | 14.785 us | 1.00 | 0.00 |
| Memcpy | Int32[10000] | Int32[10000] | 15.329 us | 0.2950 us | 0.5167 us | 15.313 us | 1.04 | 0.04 |
| CopyBlockUnaligned | Int32[10000] | Int32[10000] | 13.125 us | 0.5590 us | 1.6482 us | 13.188 us | 0.94 | 0.09 |
| CopyByFor | Int32[1000] | Int32[1000] | 1.127 us | 0.0226 us | 0.0211 us | 1.127 us | 0.08 | 0.00 |
| Memcpy | Int32[1000] | Int32[1000] | 2.152 us | 0.0571 us | 0.1675 us | 2.197 us | 0.13 | 0.02 |
| CopyBlockUnaligned | Int32[1000] | Int32[1000] | 2.297 us | 0.0453 us | 0.0863 us | 2.279 us | 0.16 | 0.01 |

Large arrays

BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
Job-KKBWNV : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
RunStrategy=Throughput
| Method | source | dest | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| CopyByFor | Int32[100000000] | Int32[100000000] | 334,741.708 μs | 7,661.2780 μs | 22,469.2022 μs | 332,150.996 μs | 1.000 | 0.00 |
| Memcpy | Int32[100000000] | Int32[100000000] | 164,233.004 μs | 3,256.7894 μs | 8,406.8134 μs | 161,660.880 μs | 0.493 | 0.04 |
| CopyBlockUnaligned | Int32[100000000] | Int32[100000000] | 164,128.312 μs | 3,671.9104 μs | 10,826.7108 μs | 162,440.250 μs | 0.492 | 0.05 |
| CopyByFor | Int32[10000000] | Int32[10000000] | 33,404.753 μs | 663.0404 μs | 1,687.6494 μs | 32,963.932 μs | 0.100 | 0.01 |
| Memcpy | Int32[10000000] | Int32[10000000] | 23,405.518 μs | 1,142.2886 μs | 3,350.1346 μs | 24,879.320 μs | 0.070 | 0.01 |
| CopyBlockUnaligned | Int32[10000000] | Int32[10000000] | 24,981.451 μs | 498.7301 μs | 899.3133 μs | 24,921.681 μs | 0.075 | 0.00 |
| CopyByFor | Int32[1000000] | Int32[1000000] | 5,036.027 μs | 100.2153 μs | 195.4623 μs | 5,014.961 μs | 0.015 | 0.00 |
| Memcpy | Int32[1000000] | Int32[1000000] | 2,585.947 μs | 51.0945 μs | 106.6533 μs | 2,601.145 μs | 0.008 | 0.00 |
| CopyBlockUnaligned | Int32[1000000] | Int32[1000000] | 2,529.769 μs | 50.4126 μs | 98.3259 μs | 2,516.467 μs | 0.008 | 0.00 |
| CopyByFor | Int32[10000] | Int32[10000] | 13.663 μs | 0.2509 μs | 0.2224 μs | 13.680 μs | 0.000 | 0.00 |
| Memcpy | Int32[10000] | Int32[10000] | 10.112 μs | 0.1976 μs | 0.2957 μs | 10.131 μs | 0.000 | 0.00 |
| CopyBlockUnaligned | Int32[10000] | Int32[10000] | 10.010 μs | 0.1742 μs | 0.1630 μs | 9.964 μs | 0.000 | 0.00 |
| CopyByFor | Int32[1000] | Int32[1000] | 1.088 μs | 0.0058 μs | 0.0045 μs | 1.089 μs | 0.000 | 0.00 |
| Memcpy | Int32[1000] | Int32[1000] | 1.358 μs | 0.0266 μs | 0.0364 μs | 1.355 μs | 0.000 | 0.00 |
| CopyBlockUnaligned | Int32[1000] | Int32[1000] | 1.349 μs | 0.0267 μs | 0.0461 μs | 1.334 μs | 0.000 | 0.00 |

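Data notes

Comparing the Intel and Zhaoxin results above, the Int32[10000] data set readily shows Intel to be around 10 times faster than Zhaoxin, as shown in the chart below.

The chart below compares Intel and Zhaoxin when copying larger arrays; here too the Intel platform can be around 10 times faster than Zhaoxin.

The detailed comparison is as follows (times in ns).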

| Method | Array length | Intel | Zhaoxin | Intel / Zhaoxin | Zhaoxin / Intel |
|---|---|---|---|---|---|
| CopyByFor | Int32[100000000] | 41,348,684.32 | 334,741,708.00 | 0.1235241481 | 8.095583052 |
| Memcpy | Int32[100000000] | 27,086,427.67 | 164,233,004.00 | 0.1649268235 | 6.063295094 |
| CopyBlockUnaligned | Int32[100000000] | 24,020,801.37 | 164,128,312.00 | 0.1463537953 | 6.832757553 |
| CopyByFor | Int32[10000000] | 3,800,486.40 | 33,404,753.00 | 0.1137708278 | 8.789599405 |
| Memcpy | Int32[10000000] | 2,313,413.90 | 23,405,518.00 | 0.0988405341 | 10.11730672 |
| CopyBlockUnaligned | Int32[10000000] | 2,005,075.29 | 24,981,451.00 | 0.0802625632 | 12.45910871 |
| CopyByFor | Int32[1000000] | 201,416.81 | 5,036,027.00 | 0.0399951807 | 25.00301241 |
| Memcpy | Int32[1000000] | 104,570.31 | 2,585,947.00 | 0.0404379169 | 24.72926589 |
| CopyBlockUnaligned | Int32[1000000] | 99,385.15 | 2,529,769.00 | 0.0392862550 | 25.45419512 |
| CopyByFor | Int32[10000] | 1,958.87 | 13,663.00 | 0.1433704165 | 6.974939634 |
| Memcpy | Int32[10000] | 624.06 | 10,112.00 | 0.0617147943 | 16.20357017 |
| CopyBlockUnaligned | Int32[10000] | 581.32 | 10,010.00 | 0.0580739261 | 17.21943164 |
| CopyByFor | Int32[1000] | 201.05 | 1,088.00 | 0.1847886029 | 5.411589157 |
| Memcpy | Int32[1000] | 32.12 | 1,358.00 | 0.0236524300 | 42.27895392 |
| CopyBlockUnaligned | Int32[1000] | 21.02 | 1,349.00 | 0.0155819125 | 64.17697431 |

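Looking at Zhaoxin more closely: when copying small arrays, the for loop is faster than the standard C memcpy function and about on par with dotnet's Unsafe.CopyBlockUnaligned. On the Intel platform, both the standard C memcpy and dotnet's Unsafe.CopyBlockUnaligned are several times faster than the for loop. In other words, whatever instruction-level optimizations memcpy and CopyBlockUnaligned contain act as a pessimization on Zhaoxin for small arrays.

With larger amounts of data, the advantage of memcpy and CopyBlockUnaligned over the for loop on the Intel platform keeps shrinking, as the following data shows (times in ns).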

| Array length | CopyByFor | Memcpy | CopyBlockUnaligned | CopyByFor / Memcpy | CopyByFor / CopyBlockUnaligned |
|---|---|---|---|---|---|
| 1000 | 201.05 | 32.12 | 21.02 | 6.259339975 | 9.564700285 |
| 10000 | 1,958.87 | 624.06 | 581.32 | 3.138912925 | 3.369693112 |
| 1000000 | 201,416.81 | 104,570.31 | 99,385.15 | 1.926137639 | 2.026628827 |
| 10000000 | 3,800,486.40 | 2,313,413.90 | 2,005,075.29 | 1.642804342 | 1.895433263 |
| 100000000 | 41,348,684.32 | 27,086,427.67 | 24,020,801.37 | 1.52654624 | 1.721369894 |

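My guess is that as the array length grows, the data gradually exceeds the Intel CPU's cache, which causes the ratio to drop. Either way, memcpy and CopyBlockUnaligned are still an improvement over the for loop on Intel.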
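The ratio columns are simply the for-loop mean divided by the mean of the faster method. As a small worked example, the sketch below recomputes them from the Intel mean times reported in this post; the class and member names (`RatioSketch`, `IntelMeans`, `Speedup`) are made up for illustration.

```csharp
using System;

// Hypothetical helper that recomputes the speed-up ratios from the Intel means.
public static class RatioSketch
{
    // Mean times in nanoseconds, copied from the Intel results in this post.
    public static readonly (int Length, double CopyByFor, double Memcpy, double CopyBlockUnaligned)[] IntelMeans =
    {
        (1_000, 201.05, 32.12, 21.02),
        (10_000, 1_958.87, 624.06, 581.32),
        (1_000_000, 201_416.81, 104_570.31, 99_385.15),
        (10_000_000, 3_800_486.40, 2_313_413.90, 2_005_075.29),
        (100_000_000, 41_348_684.32, 27_086_427.67, 24_020_801.37),
    };

    // How many times faster the candidate is than the baseline.
    public static double Speedup(double baselineNs, double candidateNs) => baselineNs / candidateNs;

    // Prints one line per array length with both ratios.
    public static void Print()
    {
        foreach (var row in IntelMeans)
        {
            Console.WriteLine(
                $"{row.Length,10}: for/memcpy = {Speedup(row.CopyByFor, row.Memcpy):F3}, " +
                $"for/CopyBlockUnaligned = {Speedup(row.CopyByFor, row.CopyBlockUnaligned):F3}");
        }
    }
}
```

For the 1000-element row this gives 201.05 / 32.12 ≈ 6.26 and 201.05 / 21.02 ≈ 9.56, matching the table.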

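This is why, for large arrays such as length 100000000, Zhaoxin takes 6.06 times as long as Intel with the same Memcpy method, far less than the 42.28 times at length 1000. So Zhaoxin is not entirely to blame: Intel's optimization for small copies is simply very strong, which makes the gap look large.

With large arrays, memcpy is faster than the for loop on Zhaoxin as well, and memcpy and CopyBlockUnaligned perform about the same. In other words, for large amounts of data, dotnet's built-in Unsafe.CopyBlockUnaligned is well worth using: it is fast and comparatively safe. For small amounts of data, CopyBlockUnaligned still does not cause a significant performance loss.

Phytium D2000

Large arrays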

BenchmarkDotNet v0.13.12, Kylin V10 SP1
Phytium,D2000/8 E8C, 8 logical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
Job-QEJWOH : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
RunStrategy=Throughput

| Method | source | dest | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|---|
| CopyByFor | Int32[100000000] | Int32[100000000] | 161,848,301.3 ns | 275,376.77 ns | 229,952.07 ns | 1.000 |
| Memcpy | Int32[100000000] | Int32[100000000] | 139,057,784.0 ns | 493,850.72 ns | 437,785.80 ns | 0.859 |
| CopyBlockUnaligned | Int32[100000000] | Int32[100000000] | 137,746,376.7 ns | 740,242.45 ns | 618,135.97 ns | 0.851 |
| CopyByFor | Int32[10000000] | Int32[10000000] | 15,514,977.7 ns | 33,694.59 ns | 29,869.38 ns | 0.096 |
| Memcpy | Int32[10000000] | Int32[10000000] | 14,492,865.7 ns | 40,272.32 ns | 35,700.37 ns | 0.090 |
| CopyBlockUnaligned | Int32[10000000] | Int32[10000000] | 14,497,063.8 ns | 38,595.84 ns | 30,133.09 ns | 0.090 |
| CopyByFor | Int32[1000000] | Int32[1000000] | 1,240,798.0 ns | 15,140.32 ns | 14,162.26 ns | 0.008 |
| Memcpy | Int32[1000000] | Int32[1000000] | 1,046,522.7 ns | 20,519.03 ns | 19,193.52 ns | 0.006 |
| CopyBlockUnaligned | Int32[1000000] | Int32[1000000] | 1,032,201.1 ns | 19,159.44 ns | 17,921.75 ns | 0.006 |
| CopyByFor | Int32[10000] | Int32[10000] | 8,930.6 ns | 9.04 ns | 8.02 ns | 0.000 |
| Memcpy | Int32[10000] | Int32[10000] | 3,058.1 ns | 12.31 ns | 11.51 ns | 0.000 |
| CopyBlockUnaligned | Int32[10000] | Int32[10000] | 3,199.1 ns | 16.51 ns | 12.89 ns | 0.000 |
| CopyByFor | Int32[1000] | Int32[1000] | 886.8 ns | 0.66 ns | 0.59 ns | 0.000 |
| Memcpy | Int32[1000] | Int32[1000] | 250.3 ns | 0.36 ns | 0.30 ns | 0.000 |
| CopyBlockUnaligned | Int32[1000] | Int32[1000] | 235.8 ns | 0.29 ns | 0.25 ns | 0.000 |

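Data notes and comparison

The Phytium,D2000/8 E8C (8 logical cores) does not score high. In array copying its largest gap to Intel reaches about 10 times (11.2 times in the worst row below), and the smallest is about 4 times. In my tests, however, Phytium is faster than Zhaoxin, by roughly 2 times on average, as the following comparison shows (times in ns).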

| Method | Array length | Intel | Zhaoxin | Phytium | Intel / Zhaoxin | Zhaoxin / Intel | Phytium / Intel | Zhaoxin / Phytium |
|---|---|---|---|---|---|---|---|---|
| CopyByFor | Int32[100000000] | 41,348,684.32 | 334,741,708.00 | 161,848,301.30 | 0.1235241481 | 8.0955830519 | 3.9142309837 | 2.0682435670 |
| Memcpy | Int32[100000000] | 27,086,427.67 | 164,233,004.00 | 139,057,784.00 | 0.1649268235 | 6.0632950938 | 5.1338547000 | 1.1810414295 |
| CopyBlockUnaligned | Int32[100000000] | 24,020,801.37 | 164,128,312.00 | 137,746,376.70 | 0.1463537953 | 6.8327575534 | 5.7344621679 | 1.1915254392 |
| CopyByFor | Int32[10000000] | 3,800,486.40 | 33,404,753.00 | 15,514,977.70 | 0.1137708278 | 8.7895994050 | 4.0823663255 | 2.1530648413 |
| Memcpy | Int32[10000000] | 2,313,413.90 | 23,405,518.00 | 14,492,865.70 | 0.0988405341 | 10.1173067215 | 6.2647093544 | 1.6149682530 |
| CopyBlockUnaligned | Int32[10000000] | 2,005,075.29 | 24,981,451.00 | 14,497,063.80 | 0.0802625632 | 12.4591087051 | 7.2301842591 | 1.7232076333 |
| CopyByFor | Int32[1000000] | 201,416.81 | 5,036,027.00 | 1,240,798.00 | 0.0399951807 | 25.0030124099 | 6.1603497742 | 4.0587001269 |
| Memcpy | Int32[1000000] | 104,570.31 | 2,585,947.00 | 1,046,522.70 | 0.0404379169 | 24.7292658882 | 10.0078377888 | 2.4709898791 |
| CopyBlockUnaligned | Int32[1000000] | 99,385.15 | 2,529,769.00 | 1,032,201.10 | 0.0392862550 | 25.4541951187 | 10.3858685125 | 2.4508489673 |
| CopyByFor | Int32[10000] | 1,958.87 | 13,663.00 | 8,930.60 | 0.1433704165 | 6.9749396336 | 4.5590570074 | 1.5299084048 |
| Memcpy | Int32[10000] | 624.06 | 10,112.00 | 3,058.10 | 0.0617147943 | 16.2035701695 | 4.9003300965 | 3.3066282986 |
| CopyBlockUnaligned | Int32[10000] | 581.32 | 10,010.00 | 3,199.10 | 0.0580739261 | 17.2194316383 | 5.5031652102 | 3.1290050327 |
| CopyByFor | Int32[1000] | 201.05 | 1,088.00 | 886.80 | 0.1847886029 | 5.4115891569 | 4.4108430739 | 1.2268831755 |
| Memcpy | Int32[1000] | 32.12 | 1,358.00 | 250.30 | 0.0236524300 | 42.2789539228 | 7.7926525529 | 5.4254894127 |
| CopyBlockUnaligned | Int32[1000] | 21.02 | 1,349.00 | 235.80 | 0.0155819125 | 64.1769743102 | 11.2178877260 | 5.7209499576 |

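Geometric calculation on points

Code and benchmark design

The following code measures the performance differences between devices during compute-intensive work. The core benchmark code is as follows.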

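[Benchmark()]
[ArgumentsSource(nameof(GetArgument))]
public void Test(Point[] source, double[] result)
{
    // Walk over every interior point b together with its neighbours a and c
    for (int i = 1; i < source.Length - 1; i++)
    {
        var a = source[i - 1];
        var b = source[i];
        var c = source[i + 1];

        // Components of the vectors ab and ac
        var abx = b.X - a.X;
        var aby = b.Y - a.Y;
        var acx = c.X - a.X;
        var acy = c.Y - a.Y;

        // 2D cross product of ab and ac, and its absolute value
        var cross = abx * acy - aby * acx;
        var abs = Math.Abs(cross);

        // Length of ac
        var acl = Math.Sqrt(acx * acx + acy * acy);

        // Store the result so the computation is not optimized away
        result[i] = abs / acl;
    }
}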

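In the benchmark above, Point[] source is the input data and double[] result holds the output. The output exists only so that the computed values have somewhere to go, which keeps the JIT happy.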
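Read geometrically, each iteration appears to compute the perpendicular distance from the middle point b to the line through its neighbours a and c. That interpretation is mine and is not stated in the post; written out with the same names as the code, the stored value is:

```latex
\[
\mathrm{result}[i]
  = \frac{\lvert (b - a) \times (c - a) \rvert}{\lVert c - a \rVert}
  = \frac{\lvert (b_x - a_x)(c_y - a_y) - (b_y - a_y)(c_x - a_x) \rvert}
         {\sqrt{(c_x - a_x)^2 + (c_y - a_y)^2}}
\]
```

Each iteration therefore costs a handful of subtractions and multiplications, one square root, and one division. If a and c coincide, both the numerator and the denominator are zero and the division yields NaN.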

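This benchmark puts heavy demands on memory access prediction (that is, CPU cache hits) and on floating-point computation. In practice Intel turns out to be very well optimized here, while Zhaoxin has a lot of room for improvement.

Intel 13th Gen Core i7-13700K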

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 9.0.100-preview.5.24307.3
[Host] : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
Job-UGRNFG : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
RunStrategy=Throughput
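| Method | source | result | Mean | Error | StdDev |
|---|---|---|---|---|---|
| 万点 (10,000 points) | Point[10000] | Double[10000] | 19.622 μs | 0.0914 μs | 0.0810 μs |
| 千点 (1,000 points) | Point[1000] | Double[1000] | 1.974 μs | 0.0108 μs | 0.0101 μs |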

Zhaoxin KaiXian KX-U6780A

BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
Job-BBRJWB : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
RunStrategy=Throughput
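| Method | source | result | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|---|
| 万点 (10,000 points) | Point[10000] | Double[10000] | 475.13 us | 8.295 us | 15.782 us | 467.05 us |
| 千点 (1,000 points) | Point[1000] | Double[1000] | 50.89 us | 1.230 us | 3.626 us | 50.81 us |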

Phytium D2000

BenchmarkDotNet v0.13.12, Kylin V10 SP1
Phytium,D2000/8 E8C, 8 logical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
Job-JCFXCW : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
RunStrategy=Throughput

| Method | source | result | Mean | Error | StdDev |
|---|---|---|---|---|---|
| 万点 (10,000 points) | Point[10000] | Double[10000] | 147.96 us | 0.015 us | 0.014 us |
| 千点 (1,000 points) | Point[1000] | Double[1000] | 14.76 us | 0.004 us | 0.003 us |

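Data notes and comparison

The comparison is in the table below (times in μs). Zhaoxin is about 25 times slower than Intel and about 3 times slower than Phytium.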

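| Method | source | result | Intel | Zhaoxin | Phytium | Intel / Zhaoxin | Zhaoxin / Intel | Phytium / Intel | Zhaoxin / Phytium |
|---|---|---|---|---|---|---|---|---|---|
| 万点 (10,000 points) | Point[10000] | Double[10000] | 19.622 | 475.13 | 147.96 | 0.041298171 | 24.21414739 | 7.540515748 | 3.211205731 |
| 千点 (1,000 points) | Point[1000] | Double[1000] | 1.974 | 50.89 | 14.76 | 0.038789546 | 25.78014184 | 7.477203647 | 3.447831978 |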

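The comparison above suggests that, for basic compute-intensive work, Zhaoxin seems to have been optimized in the wrong direction.

Code

The code for this post is on GitHub and Gitee, and can be pulled with the commands below.

First create an empty folder, then cd into it on the command line and enter the following commands to get the code for this post.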

git init
git remote add origin https://gitee.com/lindexi/lindexi_gd.git
git pull origin 1e20b4c8ef64b17604e1ee92f41f7ac25ad08d26

The commands above use the Gitee remote. If Gitee is not accessible, switch to the GitHub remote by continuing with the following commands.

git remote remove origin
git remote add origin https://github.com/lindexi/lindexi_gd.git
git pull origin 1e20b4c8ef64b17604e1ee92f41f7ac25ad08d26

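After getting the code, enter the BulowukaileFeanayjairwo folder to find the source code.

Special thanks

Special thanks to https://github.com/mjebrahimi/Performance-Wars-Benchmarking-CSharp for the code it provides.

References

C# standard performance testing

C# standard performance testing: advanced usage

dotnet 6 array copy performance comparison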

Benchmark scores

ARM Phytium,D2000/8 E8C 8 Core 2300 MHz

D2000 high-efficiency desktop CPU score: https://www.cpubenchmark.net/cpu.php?cpu=ARM+Phytium%2CD2000%2F8+E8C+8+Core+2300+MHz&id=4862

Comparison with the Intel i7-13700K: https://www.cpubenchmark.net/compare/4862vs5060/ARM-Phytium,D20008-E8C-8-Core-2300-MHz-vs-Intel-i7-13700K

Another comparison with the Intel i7-13700K: https://openbenchmarking.org/vs/Processor/Phytium+D2000,Intel+Core+i7-13700K


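Creative Commons license

Original link: http://blog.lindexi.com/post/dotnet-C-%E5%9C%A8%E4%B8%8D%E5%90%8C%E7%9A%84%E6%9C%BA%E5%99%A8-CPU-%E5%9E%8B%E5%8F%B7%E4%B8%8A%E7%9A%84%E5%9F%BA%E5%87%86%E6%80%A7%E8%83%BD%E6%B5%8B%E8%AF%95

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You are welcome to repost, use, and republish it, but you must keep the attribution to the author 林德熙 (including the link: https://blog.lindexi.com ), it may not be used for commercial purposes, and works modified from this post must be released under the same license. If you have any questions, please contact me.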