Efficient use of the GPU: mining and general-purpose computing on video cards

Cryptocurrency mining is discussed especially actively today, and many users want to know where to start mining coins and how the whole process works. The popularity of this industry has already had a tangible impact on the GPU market, and many people now associate a powerful video card not with demanding games but with crypto farms. In this article we will explain how to organize this whole process from scratch and start mining on your own farm, what to use for it, and what will not work.

What is mining on a video card

Mining on a video card is the process of mining cryptocurrency using graphics processing units (GPUs). For this, people use either a powerful video card in a home computer or a specially assembled farm of several cards in one system. If you are wondering why GPUs are used for this, the answer is quite simple: video cards are designed to process large volumes of data by performing the same type of operation over and over, as happens in video processing. The same picture emerges in cryptocurrency mining, because hashing is exactly this kind of uniform, repetitive work.
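To make the "same operation over many inputs" point concrete, here is a deliberately simplified sketch in CUDA: every thread hashes the same block header with its own nonce and checks the result against a target. The toy_hash function is a placeholder invented for illustration, not a real mining algorithm, and no real miner works exactly like this.

// Placeholder mixing function, NOT a real cryptographic hash.
__device__ unsigned int toy_hash(unsigned int header, unsigned int nonce) {
    unsigned int h = header ^ (nonce * 2654435761u);
    h ^= h >> 13; h *= 0x5bd1e995u; h ^= h >> 15;
    return h;
}

// Each thread tries one nonce; all threads perform exactly the same operations,
// which is why thousands of simple GPU cores can attack the problem at once.
__global__ void search(unsigned int header, unsigned int target, unsigned int *found) {
    unsigned int nonce = blockIdx.x * blockDim.x + threadIdx.x;
    if (toy_hash(header, nonce) < target)
        atomicMin(found, nonce);   // remember the smallest winning nonce
}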

For mining, full-fledged discrete video cards are used. Laptop GPUs and chips integrated into the processor are not suitable. There are also articles online about mining on an external video card, but this does not work in every case either and is not the best solution.

What video cards are suitable for mining

As for the choice of video card, the usual practice is to buy an AMD RX 470, RX 480, RX 570, RX 580 or an Nvidia GTX 1060, 1070 or 1080 Ti. Cards such as the R9 280X, R9 290, GTX 1050 and GTX 1060 will also work, but will not bring large profits. Mining on a weak video card such as a GeForce GTX 460, GTS 450 or GTX 550 Ti will definitely not be profitable. As for memory, it is better to take at least 2 GB; even 1 GB may not be enough, let alone 512 MB. Professional (workstation) video cards earn roughly the same as ordinary ones or even less, and given their price this is unprofitable, although you can mine with them if you already own them.

It is also worth noting that almost any graphics card can get a performance boost by raising the clock speeds beyond the values set by the manufacturer. This process is called overclocking. However, it is not risk-free: it usually voids the warranty, and the card may fail, for example by starting to show artifacts. You can overclock video cards, but study the material on the subject first and proceed with caution. Do not try to push all values to the maximum right away; better still, find examples of successful overclocking settings for your specific card on the Internet.

The most popular video cards for mining in 2020

Below is a comparison of video cards. The table lists the most popular devices and their maximum power consumption. Note that the figures may vary depending on the specific model, its manufacturer, the memory used and some other characteristics. There is no point in covering outdated scenarios such as mining Litecoin on a video card, so only the three algorithms most popular for GPU farms are shown.

Video card | Ethash | Equihash | CryptoNight | Power usage
AMD Radeon R9 280X | 11 MH/s | 290 H/s | 490 H/s | 230 W
AMD Radeon RX 470 | 26 MH/s | 260 H/s | 660 H/s | 120 W
AMD Radeon RX 480 | 29.5 MH/s | 290 H/s | 730 H/s | 135 W
AMD Radeon RX 570 | 27.9 MH/s | 260 H/s | 700 H/s | 120 W
AMD Radeon RX 580 | 30.2 MH/s | 290 H/s | 690 H/s | 135 W
Nvidia GeForce GTX 750 Ti | 0.5 MH/s | 75 H/s | 250 H/s | 55 W
Nvidia GeForce GTX 1050 Ti | 13.9 MH/s | 180 H/s | 300 H/s | 75 W
Nvidia GeForce GTX 1060 | 22.5 MH/s | 270 H/s | 430 H/s | 90 W
Nvidia GeForce GTX 1070 | 30 MH/s | 430 H/s | 630 H/s | 120 W
Nvidia GeForce GTX 1070 Ti | 30.5 MH/s | 470 H/s | 630 H/s | 135 W
Nvidia GeForce GTX 1080 | 23.3 MH/s | 550 H/s | 580 H/s | 140 W
Nvidia GeForce GTX 1080 Ti | 35 MH/s | 685 H/s | 830 H/s | 190 W

Is it possible to mine on one video card?

If you do not want to build a full-fledged farm out of many GPUs, or you just want to try the process on your home computer, you can mine with a single video card. There is no fundamental difference, and in general the number of devices in the system does not matter. You can even install cards with different chips or from different manufacturers; you simply run two mining programs in parallel, one for each vendor's chips. Recall once again that mining is not done on an integrated video card.

What cryptocurrencies can be mined on video cards

You can mine any cryptocurrency on a GPU, but you should understand that performance on the same card will differ from coin to coin. Older algorithms are poorly suited to graphics processors and will no longer bring any profit. This is due to the appearance of new devices on the market, the so-called ASICs. They are far more productive and greatly increase the difficulty of the network, but they are expensive, running into thousands of dollars. Therefore, mining coins on SHA-256 (Bitcoin) or Scrypt (Litecoin, Dogecoin) at home was already a bad idea in 2018.

In addition to LTC and DOGE, ASICs have made GPU mining of Bitcoin (BTC), Dash and other currencies pointless. A much better choice is cryptocurrencies that use ASIC-resistant algorithms. For example, on a GPU you can mine coins based on CryptoNight (Karbovanets, Monero, Electroneum, Bytecoin), Equihash (ZCash, Hush, Bitcoin Gold) and Ethash (Ethereum, Ethereum Classic). The list is far from complete, and new projects based on these algorithms appear constantly, including both forks of more popular coins and completely new developments. Occasionally new algorithms appear that are designed to solve particular problems and can use different hardware. Below we will also discuss how to find out the hashrate of a video card.

What you need for mining on a video card

Below is a list of what you will need to create a farm:

  • The video cards themselves. The choice of specific models depends on your budget and on what you already have. Old AGP devices, of course, will not work, but any mid-range or top-end card from recent years will do. You can refer back to the performance table above to make a suitable choice.
  • A computer to install them in. It is not necessary to use top-end hardware and build the farm on high-performance components: an old AMD Athlon, a few gigabytes of RAM and a hard disk for the operating system and the necessary programs will suffice. The motherboard is also important: it should have enough PCI-E slots for your farm. There are special mining boards with 6-8 slots, and in some cases it is more profitable to use them than to build several PCs. Special attention should be paid only to the power supply, because the system will run under high load around the clock. Take a PSU with a power reserve, preferably with an 80 Plus certificate. It is also possible to join two units into one using special adapters, but this solution is a subject of debate online. It is better not to use a case at all: for better cooling it is recommended to make or buy a special open frame. The video cards are brought out on it using special adapters called risers, which can be bought in specialized stores or on AliExpress.
  • A well-ventilated, dry area. The farm should be placed in non-residential space, or better yet in a separate room. This spares you the discomfort caused by the noisy cooling systems and the heat they give off. If that is not possible, choose video cards with the quietest possible cooling; you can learn more from reviews on the Internet, for example on YouTube. Also think about air circulation and ventilation to keep temperatures as low as possible.
  • A miner program. GPU mining is done with special software that can be found on the Internet. AMD (ATI Radeon) and Nvidia cards use different programs, and the same goes for different algorithms.
  • Equipment maintenance. This is a very important point, since not everyone realizes that a mining farm requires constant care. The user needs to monitor temperatures, replace thermal paste and clean the cooling system of dust. Remember safety precautions and regularly check that the system is healthy.

How to set up mining on a video card from scratch

In this section we will walk through the entire mining process, from choosing a currency to withdrawing funds. Note that the details may differ slightly between pools, programs and chips.

How to choose a video card for mining

We recommend reviewing the table above and the section on calculating potential earnings. This will let you estimate your approximate income, decide what hardware you can afford, and understand the payback period of the investment. Do not forget about the compatibility of the power connectors on the video card and the power supply; if they differ, obtain the appropriate adapters in advance. All of this can be bought cheaply in Chinese online stores, or from local sellers at some markup.

Choosing a cryptocurrency

Now it is important to decide which coin you are interested in and what goals you want to achieve. If you want income right away, choose the currencies that are most profitable at the moment and sell them as soon as you receive them. You can also mine the most popular coins and hold them until the price rises. There is also a kind of strategic approach: pick a little-known but, in your opinion, promising currency and invest your hash power in it, hoping its value will grow significantly in the future.

Choosing a mining pool

Pools also differ from one another. Some require registration, while others only need your wallet address to get started. The former usually hold the funds you earn until you reach the minimum payout amount, or until you withdraw the money manually. A good example of such a pool is Suprnova.cc: it offers many cryptocurrencies, and to work in any of its pools you only need to register on the site once. The service is easy to set up and well suited for beginners.

A similarly simplified system is offered by the Minergate website. If you do not want to register on a site and store your earnings there, choose a pool from the official thread of the coin you are interested in on the BitcoinTalk forum. Simple pools only require you to specify a payout address, and later you can use that same address to look up your mining statistics.

Create a cryptocurrency wallet

You can skip this step if you use a pool that requires registration and has a built-in wallet. If you want to receive payouts automatically to your own wallet, read about creating one in the article about the corresponding coin; the process can vary significantly between projects.

You can also simply specify a deposit address from one of the exchanges, but keep in mind that not all trading platforms accept transactions from pools. The best option is to create a wallet directly on your computer, but if you work with a large number of currencies, storing all the blockchains will be inconvenient. In that case, look for reliable online wallets or lightweight clients that do not require downloading the entire blockchain.

Choosing and installing a mining program

The choice of mining program depends on the chosen coin and its algorithm. Practically all developers of such software have threads on BitcoinTalk where you can find download links and information on setup and launch. Almost all of these programs have versions for both Windows and Linux. Most miners are free, but for a certain percentage of the time they mine to the developer's pool; this is effectively a commission for using the software. In some cases it can be disabled, but that usually limits functionality.

Setting up the program comes down to specifying the mining pool, your wallet address or login, a password (if there is one) and other options. It is recommended, for example, to set a maximum temperature limit at which the farm shuts down so as not to damage the video cards. Fan speeds and other finer settings can also be adjusted, although beginners are unlikely to need them.

If you do not know which software to choose, check out our material on the subject or read the instructions on the pool website. There is usually a getting-started section that lists the programs you can use and provides sample configurations for .bat files. With it, you can quickly figure out the settings and start mining on a discrete graphics card. You can create batch files for all the currencies you plan to work with right away, so that switching between them is more convenient later.

We start mining and monitor the statistics

After launching the .bat file with your settings, a console window opens showing a log of what is happening; the same log is also written to the folder with the executable. In the console you can see the current hashrate and the temperature of the card, and hotkeys usually let you bring up the latest figures.

You will also see a warning if the device is not finding any shares. This happens when something is configured incorrectly, the wrong software has been chosen for the coin, or the GPU is not working properly. Many miners also use remote access tools to monitor the farm when they are away from where it is installed.

We withdraw cryptocurrency

If you use pools like Suprnova, all funds simply accumulate in your account and you can withdraw them at any time. Most other pools credit funds automatically to the specified wallet once the minimum withdrawal amount has been reached. You can usually see how much you have earned on the pool website; you only need to enter your wallet address or log in to your personal account.

How much can you earn?

How much you can earn depends on the market situation and, of course, on the total hashrate of your farm. The strategy you choose also matters: you do not have to sell everything you mine right away. You can, for example, wait for the price of a mined coin to jump and make several times more profit. However, nothing is guaranteed, and predicting how events will unfold is simply unrealistic.

Payback of video cards

A special online calculator will help you estimate the payback period. There are many of them on the Internet; we will look at the process using the WhatToMine service as an example. It gives current profit figures based on your farm's data: you simply select the video cards you have and enter the cost of electricity in your area, and the site calculates how much you can earn per day.
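The arithmetic behind such calculators is straightforward and can be sketched in a few lines of host code. All the numbers below (hashrate, income per MH/s, power draw, electricity price, card price) are made-up placeholders for illustration, not current market data; substitute your own values.

#include <cstdio>

int main() {
    // Illustrative placeholder figures, not real market data.
    double hashrate_mhs      = 30.0;   // farm hashrate on Ethash, MH/s
    double usd_per_mhs_day   = 0.07;   // income per MH/s per day (from a calculator or pool)
    double power_watts       = 135.0;  // power draw under load
    double usd_per_kwh       = 0.10;   // local electricity price
    double hardware_cost_usd = 250.0;  // price of the card

    double revenue_per_day = hashrate_mhs * usd_per_mhs_day;
    double power_cost_per_day = power_watts / 1000.0 * 24.0 * usd_per_kwh;
    double profit_per_day = revenue_per_day - power_cost_per_day;

    printf("Revenue per day: %.2f USD\n", revenue_per_day);
    printf("Power per day:   %.2f USD\n", power_cost_per_day);
    printf("Profit per day:  %.2f USD\n", profit_per_day);
    if (profit_per_day > 0)
        printf("Payback period:  %.0f days\n", hardware_cost_usd / profit_per_day);
    else
        printf("Never pays back at these prices\n");
    return 0;
}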

It should be understood that the calculator only reflects the current state of the market, and the situation can change at any time. The rate may fall or rise, the mining difficulty may change, or new projects may appear. For example, Ether mining may stop because of the network's possible transition to proof-of-stake. If Ethereum mining stops, farms will have to send their idle power somewhere else, for example to mining ZCash on GPUs, which will affect that coin's economics. There are many such scenarios, and it is important to understand that today's picture may not hold for the entire payback period of the equipment.

Today, news about the use of GPUs for general-purpose computing can be heard on every corner. Words such as CUDA, Stream and OpenCL have become some of the most quoted terms on the IT Internet in just two years. However, far from everyone knows what these words mean and what technologies stand behind them. And for Linux users, who are used to being left out, all of this looks like a dark forest.

Birth of GPGPU

We are all used to thinking that the only component of a computer capable of executing any code it is given is the central processing unit. For a long time, almost all mainstream PCs were equipped with a single processor that handled every conceivable calculation, including operating system code, all of our software, and viruses.

Later, multi-core processors and multi-processor systems appeared, in which there were several such components. This allowed the machines to perform multiple tasks at the same time, and the overall (theoretical) performance of the system rose exactly as many times as there were cores installed in the machine. However, it turned out that it was too difficult and expensive to manufacture and design multi-core processors.

Each core had to host a full-fledged processor with the complex and intricate x86 architecture, with its own (rather large) cache, instruction pipeline, SSE blocks, many optimization blocks, and so on. The pace of adding cores therefore slowed considerably, and university researchers, for whom two or four cores were clearly not enough, found a way to harness other computing power for their scientific calculations, power that was available in abundance on the video card (as a result, the BrookGPU tool even appeared, emulating an additional processor via DirectX and OpenGL calls).

GPUs, free of many of the central processor's shortcomings, turned out to be excellent and very fast calculating machines, and very soon GPU manufacturers themselves began to take a close look at the work of these scientific minds (nVidia went as far as hiring many of the researchers). The result was nVidia's CUDA technology, which defines an interface for offloading the computation of complex algorithms onto the GPU without any crutches. It was later followed by ATI (AMD) with its own variant called Close to Metal (now Stream), and somewhat later by OpenCL, originally proposed by Apple, which became the vendor-neutral standard.

GPU is our everything?

Despite all its advantages, the GPGPU technique has several problems. The first is its very narrow scope. GPUs have moved far ahead of the central processor in terms of computing power growth and total number of cores (video cards carry computing units consisting of more than a hundred cores), but such high density is achieved by simplifying the design of the chip as much as possible.

In essence, the main task of the GPU comes down to mathematical calculations using simple algorithms that take fairly small amounts of predictable data as input. For this reason, GPU cores have a very simple design, meager cache sizes and a modest instruction set, which ultimately makes them cheap to produce and allows very dense placement on the chip. GPUs are like a Chinese factory with thousands of workers: they do simple things quite well (and, most importantly, quickly and cheaply), but if you entrust them with assembling an aircraft, the best you will get is a hang glider.

So the first limitation of the GPU is its focus on fast mathematical calculations, which restricts its scope to assisting multimedia applications and any programs that do heavy data processing (for example, archivers and encryption systems, as well as software for fluorescence microscopy, molecular dynamics, electrostatics and other things of little interest to Linux users).

The second problem with GPGPU is that not every algorithm can be adapted to run on the GPU. Individual GPU cores are quite slow, and their power only shows when they work together. That means an algorithm will only be as efficient as the programmer's ability to parallelize it. In most cases only a good mathematician can handle such work, and there are very few of them among software developers.

And thirdly, GPUs work with memory installed on the video card itself, so every time the GPU is used there are two additional copy operations: the input data is copied from the application's RAM into video memory, and the results are copied from video memory back into application memory. It is not hard to guess that this can wipe out any gain in application run time (as happens with the FlacCL tool, which we will look at later).
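These extra copies are easy to see in the structure of a typical CUDA host program. The sketch below (a trivial element-wise scaling kernel invented for this example, not taken from any of the tools discussed here) shows how the actual computation is sandwiched between a host-to-device and a device-to-host transfer; for small inputs those transfers can dominate the total time.

#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;               // the "useful" work
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));

    // copy #1: application RAM -> video memory
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);

    // copy #2: video memory -> application RAM
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return 0;
}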

But that is not all. Despite the existence of a generally accepted standard in the form of OpenCL, many programmers still prefer vendor-specific implementations of GPGPU. CUDA has proved especially popular: although it offers a more flexible programming interface (incidentally, OpenCL in nVidia drivers is implemented on top of CUDA), it ties the application tightly to video cards from a single manufacturer.

KGPU, or a Linux kernel accelerated by the GPU

Researchers at the University of Utah have developed KGPU, a system that allows some Linux kernel functions to run on a GPU using the CUDA framework. It uses a modified Linux kernel and a special user-space daemon that listens for kernel requests and passes them to the video card driver via the CUDA library. Interestingly, despite the significant overhead such an architecture creates, the authors of KGPU managed to build an AES implementation that speeds up encryption of the eCryptfs file system by a factor of six.

What is available now?

Because of its youth, and also because of the problems described above, GPGPU has not become a truly widespread technology, but useful software that uses its capabilities does exist (albeit in meager amounts). Crackers of various hashes were among the first to appear, since their algorithms are very easy to parallelize.

Multimedia applications were also born, for example the FlacCL encoder, which transcodes audio tracks into the FLAC format. Some pre-existing applications have gained GPGPU support as well, the most notable being ImageMagick, which can now offload part of its work to the graphics processor using OpenCL. There are also projects to port archivers and other data compression systems to CUDA/OpenCL (Unixoids with ATI cards get little love here). We will look at the most interesting of these projects in the following sections of the article, but first let's figure out what we need to get all of this up and running reliably.

GPUs have long outperformed x86 processors in performance

· First, you need a reasonably recent video card from nVidia or AMD that supports the corresponding GPGPU technology (CUDA, Stream or OpenCL).

· Secondly, the system must have the latest proprietary drivers for the video card installed; they provide support both for the card's native GPGPU technologies and for the open OpenCL.

· And thirdly, since distribution maintainers do not yet ship application packages with GPGPU support, we will have to build the applications ourselves, and for that we need the official SDKs from the manufacturers: the CUDA Toolkit or the ATI Stream SDK. They contain the header files and libraries needed to build the applications.

Install CUDA Toolkit

Follow the link above and download the CUDA Toolkit for Linux (there are several versions to choose from, for the Fedora, RHEL, Ubuntu and SUSE distributions, in both x86 and x86_64 variants). In addition, you also need to download the developer driver kits there (Developer Drivers for Linux; they are first in the list).

Run the SDK installer:

$ sudo sh cudatoolkit_4.0.17_linux_64_ubuntu10.10.run

When the installation is completed, proceed to install the drivers. To do this, shut down the X server:

$ sudo /etc/init.d/gdm stop

Switch to a console and run the driver installer:

$ sudo sh devdriver_4.0_linux_64_270.41.19.run

After the installation is completed, start X again:

$ sudo /etc/init.d/gdm start

In order for applications to work with CUDA/OpenCL, we write the path to the directory with CUDA libraries in the LD_LIBRARY_PATH variable:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Or, if you installed the 32-bit version:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib32

You also need to specify the path to the CUDA header files so that the compiler can find them at the application build stage:

$ export C_INCLUDE_PATH=/usr/local/cuda/include

That's it, now you can start building CUDA/OpenCL software.

Install ATI Stream SDK

The Stream SDK does not require installation: simply unpack the archive downloaded from the AMD website into any directory (/opt is the best choice) and set the path to it in the same LD_LIBRARY_PATH variable:

$ wget http://goo.gl/CNCNo

$ sudo tar -xzf ~/AMD-APP-SDK-v2.4-lnx64.tgz -C /opt

$ export LD_LIBRARY_PATH=/opt/AMD-APP-SDK-v2.4-lnx64/lib/x86_64/

$ export C_INCLUDE_PATH=/opt/AMD-APP-SDK-v2.4-lnx64/include/

As with the CUDA Toolkit, x86_64 needs to be replaced with x86 on 32-bit systems. Now change to the root directory and unpack the icd-registration.tgz archive (it serves as a kind of free license key):

$ sudo tar -xzf /opt/AMD-APP-SDK-v2.4-lnx64/icd-registration.tgz -C /

We check that the package is installed and working correctly using the clinfo tool:

$ /opt/AMD-APP-SDK-v2.4-lnx64/bin/x86_64/clinfo

ImageMagick and OpenCL

Support for OpenCL appeared in ImageMagick a long time ago, but it is not enabled by default in any distribution. Therefore, we will have to build IM ourselves from source. There is nothing complicated about this, everything you need is already in the SDK, so the assembly will not require the installation of any additional libraries from nVidia or AMD. So, download / unpack the archive with the sources:

$ wget http://goo.gl/F6VYV

$ tar -xjf ImageMagick-6.7.0-0.tar.bz2

$ cd ImageMagick-6.7.0-0

$ sudo apt-get install build-essential

Run the configure script and grep its output for OpenCL support:

$ LDFLAGS=-L$LD_LIBRARY_PATH ./configure | grep -e cl.h -e OpenCL

The correct output of the command should look something like this:

checking CL/cl.h usability... yes

checking CL/cl.h presence... yes

checking for CL/cl.h... yes

checking OpenCL/cl.h usability... no

checking OpenCL/cl.h presence... no

checking for OpenCL/cl.h... no

checking for OpenCL library... -lOpenCL

The word "yes" should appear in either the first three lines or the second three (or both). If it does not, the C_INCLUDE_PATH variable was most likely not initialized correctly. If the word "no" marks the last line, the problem is in the LD_LIBRARY_PATH variable. If everything is OK, start the build/install process:

$ sudo make install clean

Verify that ImageMagick was indeed compiled with OpenCL support:

$ /usr/local/bin/convert --version | grep Features

Features: OpenMP OpenCL

Now let's measure the resulting gain in speed. The ImageMagick developers recommend using the convolve filter for this:

$ time /usr/bin/convert image.jpg -convolve "-1, -1, -1, -1, 9, -1, -1, -1, -1" image2.jpg

$ time /usr/local/bin/convert image.jpg -convolve "-1, -1, -1, -1, 9, -1, -1, -1, -1" image2.jpg

Some other operations, such as resizing, should now also work much faster, but you should not expect ImageMagick to suddenly start processing graphics at breakneck speed: so far only a very small part of the package has been optimized with OpenCL.

FlacCL (Flacuda)

FlacCL is a FLAC audio encoder that takes advantage of OpenCL. It is part of the CUETools package for Windows, but thanks to Mono it can also be used on Linux. To get the archive with the encoder, execute the following commands:

$ mkdir flaccl && cd flaccl

$ wget www.cuetools.net/install/flaccl03.rar

$ sudo apt-get install unrar mono

$ unrar x flaccl03.rar

So that the program can find the OpenCL library, we make a symbolic link:

$ ln -s $LD_LIBRARY_PATH/libOpenCL.so libopencl.so

Now let's start the encoder:

$ mono CUETools.FLACCL.cmd.exe music.wav

If the error message "Error: Requested compile size is bigger than the required workgroup size of 32" appears on the screen, the video card in the system is too weak, and the number of cores used should be reduced to the reported number with the '--group-size XX' flag, where XX is the desired number of cores.

It must be said right away that because of the long OpenCL initialization time, a noticeable gain is only obtained on sufficiently long tracks. FlacCL processes short audio files at almost the same speed as the traditional version of the encoder.

oclHashcat or quick brute force

As I already said, developers of various crackers and password brute-force systems were among the first to add GPGPU support to their products. For them, the new technology became a real holy grail that made it easy to shift their naturally parallelizable code onto the shoulders of fast GPU processors. It is therefore not surprising that dozens of different implementations of such programs now exist. But in this article I will talk about only one of them: oclHashcat.

oclHashcat is a cracker that can brute-force passwords from their hashes at extremely high speed using GPU power via OpenCL. According to measurements published on the project website, the speed of MD5 password guessing on an nVidia GTX 580 reaches 15,800 million combinations per second, which allows oclHashcat to find an eight-character password of average complexity in just 9 minutes.

The program supports OpenCL and CUDA and the following algorithms: MD5, md5($pass.$salt), md5(md5($pass)), vBulletin < v3.8.5, SHA1, sha1($pass.$salt), MySQL hashes, MD4, NTLM, Domain Cached Credentials and SHA256. It also supports distributed password cracking using the power of several machines.

Download the archive from the project website and unpack it:

$ 7z x oclHashcat-0.25.7z

$ cd oclHashcat-0.25

And run the program (we will use a trial list of hashes and a trial dictionary):

$ ./oclHashcat64.bin example.hash ?l?l?l?l example.dict

oclHashcat will display the text of the user agreement, which you must accept by typing "YES". After that the enumeration process begins; the current status can be called up, and the process paused and resumed, using the hotkeys listed in the program prompt. You can also run a pure brute-force attack (for example, from aaaaaaaa to zzzzzzzz):

$ ./oclHashcat64.bin hash.txt ?l?l?l?l ?l?l?l?l

There are also various modifications of the dictionary and mask attacks, as well as combinations of them (you can read about this in the docs/examples.txt file). In my case, going through the entire dictionary took 11 minutes, while pure brute force (from aaaaaaaa to zzzzzzzz) took about 40 minutes; the average GPU speed (an RV710 chip) was 88.3 million combinations per second. The numbers are consistent: 26^8 is roughly 2.1 x 10^11 lowercase candidates, which at 88.3 million per second is about 2,400 seconds, or around 40 minutes.

Conclusions

Despite its many limitations and the complexity of developing software for it, GPGPU is the future of high-performance desktop computing. And the most important thing is that you can use this technology right now, and that applies not only to Windows machines but to Linux as well.


What software is needed to mine cryptocurrency? What to consider when choosing equipment for mining? How to mine bitcoins and ethereum using a video card on a computer?

It turns out that powerful video cards are needed not only by fans of spectacular computer games. Thousands of users around the world use graphics cards to earn cryptocurrency! From several cards with powerful processors, miners build farms: computing centers that extract digital money almost out of thin air!

Denis Kuderin is with you, the HeatherBober magazine's expert on finance and how to multiply it wisely. I will tell you what mining on a video card is in 2017-2018, how to choose the right device for earning cryptocurrency, and why mining bitcoins on video cards is no longer profitable.

You will also learn where to buy the most productive and powerful video cards for professional mining, and get expert tips on improving the efficiency of your mining farm.

1. Mining on a video card - easy money or unjustified expenses

A good video card is not just a digital signal adapter but also a powerful processor capable of solving complex computing problems, including calculating the hash codes for the block chain (blockchain). This makes graphics cards ideal for mining, that is, cryptocurrency mining.

Question: Why the graphics processor? After all, every computer has a central processing unit. Isn't it logical to do the calculations with it?

Answer: The CPU can also calculate blockchain hashes, but it does so hundreds of times slower than a graphics processor (GPU). It is not that one is better and the other worse; they simply work differently. And if you combine several video cards, the power of such a computing center increases several times over.

For those who have no idea how digital money is mined, a small primer. Mining is the main, and sometimes the only, way to produce cryptocurrency.

Since no one mints or prints this money, and it is not a material substance but a digital code, someone has to compute that code. This is what miners do, or rather, their computers.

In addition to code calculations, mining performs several more important tasks:

  • supporting the decentralization of the system: the absence of ties to central servers is the foundation of the blockchain;
  • confirming transactions: without mining, operations cannot enter a new block;
  • forming new blocks of the system and entering them into a single registry shared by all computers.

I want to cool the ardor of novice miners right away: the mining process becomes more difficult every year. Mining Bitcoin on a video card, for example, has long been unprofitable.

Bitcoins are now mined on GPUs only by stubborn amateurs, since specialized ASIC chips have replaced video cards. These chips consume less electricity and are more efficient at the computations. All well and good, but they cost on the order of 130,000-150,000 rubles.

The powerful Antminer S9 model

Fortunately for miners, Bitcoin is not the only cryptocurrency on the planet but one of hundreds. Other digital money, such as Ethereum, Zcash, Expanse, Dogecoin and others, can still be mined profitably with video cards. The rewards are stable, and the equipment pays for itself in roughly 6-12 months.

But there is another problem: a shortage of powerful video cards. The hype around cryptocurrency has driven up the prices of these devices, and buying a new video card suitable for mining in Russia is not so easy.

Novice miners have to order video adapters from online stores (including foreign ones) or buy them second-hand. By the way, I do not recommend the latter: mining equipment becomes obsolete and wears out at a fantastic rate.

On Avito, people even sell entire cryptocurrency mining farms.

There are many reasons: some miners have already had enough of extracting digital money and have moved on to more profitable cryptocurrency activities (in particular, exchange trading); others realized they could not compete with powerful Chinese clusters built right next to power plants; still others have switched from video cards to ASICs.

However, the niche still brings some profit, and if you start mining with a video card right now, you will still have time to jump onto the bandwagon of the train leaving for the future.

The catch is that there are more and more players in this field, while the total number of digital coins does not grow because of it. On the contrary, the reward per block keeps getting smaller.

Six years ago, the reward for one block in the Bitcoin network was 50 coins; now it is only 12.5 BTC. The difficulty of the calculations has meanwhile grown ten-thousand-fold. True, the price of bitcoin itself has also increased many times over during this time.

2. How to mine cryptocurrency using a video card - step by step instructions

There are two mining options: solo or as part of a pool. Solo mining is difficult: you need an enormous amount of hashrate (computing power) for your calculations to have a realistic chance of successfully closing a block.

99% of all miners work in pools: communities that distribute the computing tasks among their members. Joint mining removes the element of chance and guarantees a stable profit.

One miner I know put it this way: I have been mining for three years, and in all that time I have not spoken to anyone who mines alone.

Such prospectors resemble the gold prospectors of the 19th century. You can search for years for your nugget (in our case, bitcoin) and never find it. That is, the blockchain will never be closed, and you will receive no reward.

"Lone hunters" have slightly better chances with Ether and some other crypto-coins.

Because of its particular hashing algorithm, ETH is not mined with specialized processors (none have been invented yet); only video cards are used for it. It is Ethereum and other altcoins that keep today's numerous farmers afloat.

One video card is not enough to build a full-fledged farm: four cards is the "subsistence minimum" for a miner counting on a stable profit. A powerful cooling system for the video adapters is no less important. And do not lose sight of a cost item such as the electricity bill.

The step-by-step instructions below will protect you from mistakes and speed up the setup process.

Step 1. Choose a pool

The world's largest cryptocurrency pools are located in China, as well as in Iceland and the United States. Formally, these communities do not have a state affiliation, but Russian-language pool sites are a rarity on the Internet.

Since you will most likely be mining Ethereum on your video card, you will need to choose a community that mines this currency. Although Ethereum is a relatively young altcoin, there are many pools for it. The size and stability of your income largely depend on your choice of community.

We select a pool according to the following criteria:

  • performance;
  • uptime;
  • reputation among cryptocurrency miners;
  • positive reviews on independent forums;
  • convenience of withdrawing money;
  • the size of the commission;
  • the method of calculating rewards.

The cryptocurrency market changes daily. This applies to exchange-rate fluctuations and to the appearance of new digital money, including Bitcoin forks. There are global changes as well.

For example, it recently became known that Ether will in the near future move to a fundamentally different system of reward distribution (proof-of-stake). In a nutshell, income in the Ethereum network will go to those who already hold a lot of coins, while novice miners will either have to close up shop or switch to other currencies.

But such "little things" have never stopped enthusiasts. Moreover, there is a program called Profitable Pool that automatically tracks the most profitable altcoins to mine at any given moment. There are also search services for the pools themselves, along with real-time ratings of them.

Step 2. Install and configure the program

After registering on the pool website, you need to download a special miner program; you are not going to calculate the hashes by hand on a calculator. There are plenty of such programs: for Bitcoin it is 50Miner or CGMiner, for Ether it is Ethminer.

Setting up requires care and certain skills. For example, you need to know what scripts are and be able to enter them on your computer's command line. I advise checking the technical details with practicing miners, since each program has its own installation and configuration quirks.

Step 3. Registering a wallet

If you do not yet have a Bitcoin wallet or Ethereum storage, you need to set them up. Download wallets from the official websites.

Sometimes the pools themselves provide assistance in this matter, but not free of charge.

Step 4. Start mining and monitor statistics

All that remains is to start the process and wait for the first earnings. Be sure to also download an auxiliary program that monitors the state of your computer's main components: load, temperature, and so on.

Step 5. Withdraw cryptocurrency

The computers run around the clock and automatically, calculating the code. All you have to do is make sure the cards and other components do not fail. Cryptocurrency will flow into your wallet at a rate directly proportional to your hashrate.

How do you convert digital currency into fiat money? That question deserves a separate article. In short, the fastest way is through exchange offices. They take a percentage for their services, and your task is to find the most favorable rate with the minimum commission. A professional service for comparing exchangers will help you do this.

- the best resource of this kind in Runet. This monitoring service compares the performance of more than 300 exchange offices and finds the best quotes for the currency pairs you are interested in. Moreover, it shows each exchanger's cryptocurrency reserves. The monitoring lists contain only proven and reliable exchange services.

3. What to look for when choosing a video card for mining

Choose your video card wisely. The first one you come across, or whatever is already in your computer, will mine too, but its output, even for Ether, will be negligible.

The main indicators are as follows: performance (power), power consumption, cooling, overclocking prospects.

1) Power

Everything is simple here: the higher the processor's performance, the better for calculating hash codes. Good results are delivered by cards with more than 2 GB of memory. And choose devices with a 256-bit memory bus; a 128-bit bus is not suitable for this job.

2) Energy consumption

Power, of course, is great: high hashrate and all that. But do not forget the power consumption figures. Some productive farms "eat up" so much electricity that the costs barely pay off, or do not pay off at all.

3) Cooling

A standard farm consists of 4-16 cards. It produces an excess of heat that is harmful to the hardware and unpleasant for the farmer himself. Living and working in a one-room apartment next to it without air conditioning will be, to put it mildly, uncomfortable.

High-quality processor cooling is an indispensable condition for successful mining

Therefore, when choosing between two cards with the same performance, give preference to the one with the lower thermal design power (TDP). The best cooling parameters are demonstrated by Radeon cards, and the same devices last longer than other cards under constant load.

Additional coolers will not only remove excess heat from the processors, but also extend their life.

4) Ability to overclock

Overclocking is a forced increase in the performance of a video card. The ability to "overclock the card" depends on two parameters − GPU frequencies and video memory frequencies. These are the ones you will overclock if you want to increase computing power.

Which video cards should you get? You will need devices of the latest generation, or at least graphics accelerators released no more than 2-3 years ago. Miners use AMD Radeon and Nvidia GeForce GTX cards.

Take a look at the payback table for video cards (the data is current at the end of 2017):

4. Where to buy a video card for mining - an overview of the TOP-3 stores

As I said, with the growing popularity of mining, video cards have become a scarce commodity. Buying the right device takes a fair amount of time and effort.

Our review of the best online sales points will help you.

1) TopComputer

Moscow hypermarket specializing in computer and household appliances. It has been operating on the market for more than 14 years, delivering goods from all over the world almost at producer prices. There is a prompt delivery service, free for Muscovites.

At the time of writing, it has AMD and Nvidia cards (8 GB) and other models suitable for mining on sale.

2) Mybitcoinshop

A specialized shop that deals exclusively in mining goods. Here you will find everything for building a home farm: video cards in the required configuration, power supplies, adapters, and even ASIC miners (for miners of the new generation). There is paid delivery and pickup from a warehouse in Moscow.

The company has repeatedly received the unofficial title of the best shop for miners in the Russian Federation. Prompt service, friendly attitude to customers, advanced equipment are the main components of success.

3) Ship Shop America

Purchase and delivery of goods from the USA. An intermediary company for those who need truly exclusive and most advanced mining products.

A direct partner of Nvidia, the leading manufacturer of video cards for gaming and mining. The maximum waiting time for goods is 14 days.

5. How to increase the income from mining on a video card - 3 useful tips

Impatient readers who want to start mining right now and receive income from tomorrow morning will certainly ask: how much do miners actually earn?

Earnings depend on the equipment, the cryptocurrency rate, the efficiency of the pool, the capacity of the farm, the hashrate and a host of other factors. Some manage to earn up to 70,000 rubles a month; others are content with 10 dollars a week. This is an unstable and unpredictable business.

Useful tips will help you increase your income and optimize your expenses.

Tip 1. Mine a currency that is growing in price

If you mine a currency that is rapidly rising in price, you will earn more. For example, Ether is now worth about 300 dollars and Bitcoin more than 6,000. But you need to take into account not only the current value but also the rate of growth over the week.

Tip 2. Use the mining calculator to select the optimal equipment

The mining calculator on the pool website or on another specialized service will help you choose the best program and even a video card for mining.

AMD/ATI Radeon Architecture Features

This is similar to the birth of new biological species: as living beings develop new habitats, they evolve to improve their adaptation to the environment. In the same way the GPU, which began with accelerating the rasterization and texturing of triangles, developed additional abilities for executing the shader programs that color those same triangles. And those abilities turned out to be in demand for non-graphics computing, where in some cases they give a significant performance gain over traditional solutions.

We draw analogies further - after a long evolution on land, mammals penetrated into the sea, where they pushed out ordinary marine inhabitants. In the competitive struggle, mammals used both new advanced abilities that appeared on the earth's surface and those specially acquired for adaptation to life in the water. In the same way, GPUs, based on the advantages of the architecture for 3D graphics, are increasingly acquiring special functionality useful for non-graphics tasks.

So what allows the GPU to claim its own sector in the field of general-purpose programs? The GPU microarchitecture is built very differently from that of conventional CPUs, and it has certain advantages baked in from the start. Graphics tasks involve independent parallel processing of data, and the GPU is natively multi-threaded. For it, this parallelism is pure joy: the microarchitecture is designed to exploit the large number of threads available for execution.

The GPU consists of several dozen processor cores (30 in the Nvidia GT200, 20 in Evergreen, 16 in Fermi), which are called Streaming Multiprocessors in Nvidia terminology and SIMD Engines in ATI terminology. Within this article we will call them miniprocessors, because each of them executes several hundred program threads and can do almost everything a regular CPU can, but still not everything.

Marketing names are confusing: for greater effect, they count the number of functional units that can add and multiply, for example "320 vector cores". These "cores" are more like grains. It is better to think of the GPU as a multi-core processor with a few dozen cores, each of which executes many threads simultaneously.

Each miniprocessor has local memory: 16 KB in the GT200, 32 KB in Evergreen and 64 KB in Fermi (essentially a programmable L1 cache). Its access time is similar to that of the L1 cache of a conventional CPU, and it performs the similar function of delivering data to the functional units as quickly as possible. In the Fermi architecture, part of the local memory can be configured as a normal cache. In the GPU, local memory is used for fast data exchange between executing threads. One of the usual schemes for a GPU program is as follows: first, data from the GPU's global memory is loaded into local memory. Global memory is just ordinary video memory, located (like system memory) separately from "its" processor; on a video card it is soldered as several chips on the card's PCB. Next, several hundred threads work on this data in local memory and write their results to global memory, after which the results are transferred to the CPU. It is the programmer's responsibility to write the instructions that load and unload data to and from local memory; in essence, this is the partitioning of the data of a specific task for parallel processing. The GPU also supports atomic read/write instructions to global memory, but they are inefficient and are usually only needed at the final stage, for "gluing together" the results computed by all the miniprocessors.
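The load-into-local-memory, compute, write-back pattern described above looks roughly like the CUDA sketch below. The kernel is a simple three-point smoothing filter invented purely for illustration; the name, block size and padding scheme are arbitrary choices, not part of any real library.

#include <cuda_runtime.h>

#define BLOCK 256   // threads per block; launch as smooth<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(...)

// Each block stages a tile of the input in fast on-chip shared ("local") memory,
// lets its threads work on it, then writes the results back to global memory.
__global__ void smooth(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2];                 // per-miniprocessor local memory
    int g = blockIdx.x * blockDim.x + threadIdx.x;

    // 1) global memory -> local memory (with zero padding at the edges)
    tile[threadIdx.x + 1] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[BLOCK + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                                  // wait until the whole tile is staged

    // 2) every thread computes on data sitting in fast on-chip memory,
    // 3) then writes its result back to global memory for the later copy to the CPU
    if (g < n)
        out[g] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;
}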

Local memory is shared by all the threads running in a miniprocessor, which is why in Nvidia terminology it is even called shared memory, while the term local memory means exactly the opposite: a private area of global memory belonging to an individual thread, visible and accessible only to it. Besides local memory, the miniprocessor has another memory area, which in all architectures is about four times larger. It is divided equally among all executing threads and holds the registers used for storing variables and intermediate results. Each thread gets several dozen registers; the exact number depends on how many threads the miniprocessor is running. This number matters a great deal, because the latency of global memory is very high, hundreds of cycles, and in the absence of caches there is nowhere else to keep intermediate results.

And one more important feature of the GPU: "soft" vectorization. Each miniprocessor has a large number of compute units (8 in the GT200, 16 in the Radeon, 32 in Fermi), but they can all execute only the same instruction, at the same program address. The operands, however, can differ: different threads have their own. For example, take an instruction that adds the contents of two registers: it is executed simultaneously by all the compute units, but each takes different registers. It is assumed that all threads of a GPU program, processing data in parallel, generally move through the program code in lockstep, so all the compute units are loaded evenly. If threads diverge in their execution path because of branches in the program, so-called serialization occurs: not all compute units are used, since the threads are submitting different instructions for execution, and a block of compute units can execute, as we said, only an instruction with one address. Performance, of course, then drops relative to the maximum.

The advantage is that vectorization is completely automatic; it is not programming with SSE, MMX and the like, and the GPU handles the divergences itself. In theory you can write GPU programs without thinking about the vector nature of the execution units, but the speed of such a program will not be very high. The downside is the large vector width: it is greater than the nominal number of functional units, 32 for Nvidia GPUs and 64 for Radeon. Threads are processed in blocks of the corresponding size. Nvidia calls this block of threads a warp, AMD a wave front, which is the same thing. Thus, on 16 compute units, a wave front 64 threads long is processed in four cycles (assuming the usual instruction length). The author prefers the term warp here because of the association with the nautical term, a rope made of twisted strands: the threads "twist together" into a single bundle. The "wave front", though, can also be associated with the sea: instructions arrive at the execution units the way waves roll onto the shore, one after another.

If all the threads have progressed equally through the program (they are at the same place) and are therefore executing the same instruction, everything is fine; if not, things slow down. In that case, threads from the same warp or wave front are at different places in the program, and they are split into groups of threads that share the same value of the instruction pointer. As before, only the threads of one group are executed at a time: they all execute the same instruction, but with different operands. As a result, the warp runs as many times slower as the number of groups it is split into, and the number of threads in a group does not matter: even a group of a single thread takes as long to execute as a full warp. In hardware this is implemented by masking off certain threads; instructions are formally executed, but their results are not written anywhere and are not used later.
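A minimal illustration of such divergence, as a hypothetical CUDA kernel written only to show the effect:

// Threads of one warp take different branches depending on their index.
// The hardware runs the two branches one after the other, masking off the
// threads that do not belong to the branch currently being executed, so this
// kernel does roughly twice the instruction work of a branch-free equivalent.
__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // even lanes of the warp: first pass
    else
        data[i] = data[i] + 1.0f;   // odd lanes: executed in a second pass
}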

Although each miniprocessor (Streaming Multiprocessor or SIMD Engine) executes instructions belonging to only one warp (a bundle of threads) at any given moment, it has several dozen active warps in its pool. After executing an instruction of one warp, the miniprocessor executes not the next instruction of that warp's threads but an instruction of some other warp. That warp may be at a completely different place in the program; this does not affect speed, since only within a warp must all threads' instructions be the same for full-speed execution.

In this case, each of the 20 SIMD Engines has four active wave fronts, each with 64 threads. Each thread is indicated by a short line. Total: 64×4×20=5120 threads

Thus, given that each warp or wave front consists of 32-64 threads, the miniprocessor has several hundred active threads that are executing almost simultaneously. Below we will see what architectural benefits such a large number of parallel threads promise, but first we will consider what limitations the miniprocessors that make up GPUs have.

The main limitation is that the GPU has no stack in which function parameters and local variables could be stored. With so many threads, there is simply no room for a stack on the chip. Indeed, since the GPU executes around 10,000 threads simultaneously, at a stack size of 100 KB per thread the total would be 1 GB, equal to the standard amount of all video memory. Nor is there any way to place a stack of meaningful size in the GPU core itself: even 1,000 bytes of stack per thread would require 1 MB of memory for a single miniprocessor, which is almost five times the combined size of the miniprocessor's local memory and the memory allocated for registers.

Therefore, there is no recursion in GPU programs, and you cannot do much with function calls either: all functions are inlined directly into the code when the program is compiled. This limits the GPU's scope to computational tasks. For recursive algorithms with a known, small recursion depth it is sometimes possible to emulate a limited stack in global memory, but this is not a typical GPU application. Doing so requires specially designing the algorithm and exploring the feasibility of the implementation, with no guarantee of acceleration over the CPU.

Fermi was the first to introduce the ability to use virtual functions, but again their use is limited by the lack of a large, fast per-thread cache: 1,536 threads share 48 KB or 16 KB of L1, so virtual functions can be used only relatively rarely in a program; otherwise the stack will also end up in slow global memory, which will slow execution and most likely bring no benefit compared to the CPU version.

Thus, the GPU is best thought of as a computational coprocessor: data is loaded into it, processed by some algorithm, and a result is produced.

Benefits of Architecture

On the other hand, the GPU is very fast, and its extreme multithreading helps it here. The large number of active threads makes it possible to partly hide the high latency of the separately located global video memory, which is about 500 cycles. This works especially well for code with a high density of arithmetic operations. As a result, the transistor-expensive L1-L2-L3 cache hierarchy is not needed; instead, many compute units can be placed on the chip, providing outstanding arithmetic performance. While the instructions of one thread or warp are being executed, the other hundreds of threads quietly wait for their data.

Fermi introduced a second-level cache of about 1 MB, but it cannot be compared with the caches of modern CPUs: it is intended more for communication between cores and various software tricks. Divided among tens of thousands of threads, it leaves each thread a negligible amount.

Besides global-memory latency, there are many more latencies in the device that need to be hidden: the latency of moving data within the chip from the compute units to the first-level cache, that is, the GPU's local memory, and to the registers, as well as the instruction cache latency. The register file and the local memory sit apart from the functional units, and access to them takes about a dozen cycles. Again, a large number of threads and active warps can effectively hide this latency. Moreover, the total bandwidth of access to the local memory of the whole GPU, given the number of miniprocessors it contains, is much higher than the bandwidth of access to the first-level cache in modern CPUs. The GPU can process significantly more data per unit of time.

We can say right away that if the GPU is not supplied with a large number of parallel threads, its performance will be close to zero: it will run at the same pace as if fully loaded while doing far less work. For example, if only one thread remains instead of 10,000, performance drops by roughly a thousand times, because not only will most compute blocks sit idle, but every latency will be felt in full.

The problem of hiding latencies is acute for modern high-frequency CPUs as well; sophisticated methods are used to eliminate it: deep pipelining and out-of-order execution of instructions. These require complex instruction schedulers, various buffers and so on, which take up space on the chip. All of that is needed for the best single-threaded performance.

The GPU needs none of this: it is architecturally faster on computational tasks with a large number of threads, converting multithreading into performance the way a philosopher's stone turns lead into gold.

The GPU was originally designed to execute shader programs for triangle pixels optimally; these are obviously independent and can run in parallel. From that starting point it evolved, by adding various features (local memory, addressable access to video memory, a richer instruction set), into a very powerful computing device, which can still be applied effectively only to algorithms that allow a highly parallel implementation using a limited amount of local memory.

Example

One of the most classic GPU problems is computing the interaction of N bodies that create a gravitational field. But if, for example, we need to calculate the evolution of the Earth-Moon-Sun system, the GPU is a poor helper: there are too few objects. For each object, interactions with all other objects must be calculated, and there are only two of them. In the case of the motion of the entire solar system with all its planets and moons (a couple of hundred objects), the GPU is still not very efficient. A multi-core processor, however, will also not be able to show its full power due to the high overhead of thread management and will effectively work in single-threaded mode. But if you also need to calculate the trajectories of comets and asteroid-belt objects, this becomes a task for the GPU, since there are enough objects to create the required number of parallel computation threads.

The GPU will also perform well if it is necessary to calculate the collision of globular clusters of hundreds of thousands of stars.

Another way to use the power of the GPU in the N-body problem arises when you need to compute many separate problems, each with a small number of bodies, for example the evolution of one system for many different sets of initial velocities. Then the GPU can be used effectively without any difficulty.
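To make this concrete, here is a hedged CUDA sketch of the all-pairs force computation at the heart of the N-body problem. The float4 layout (x, y, z, mass), the softening parameter eps2 and all names are assumptions of this example rather than anything from the article.

```cuda
// Each thread computes the acceleration of one body from all n bodies.
__global__ void nbody_forces(const float4 *pos, float3 *acc, int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 a = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {                   // walk over every other body
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2;  // softening avoids r = 0
        float invR = rsqrtf(r2);
        float s = pj.w * invR * invR * invR;            // m_j / r^3
        a.x += dx * s; a.y += dy * s; a.z += dz * s;
    }
    acc[i] = a;
}
```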

AMD Radeon microarchitecture details

We have considered the basic principles of GPU organization; they are common to video accelerators of all manufacturers, since all of them initially targeted one task: shader programs. Nevertheless, manufacturers have found room to disagree on the details of the microarchitectural implementation, just as CPUs from different vendors sometimes differ greatly even while remaining compatible, such as Pentium 4 versus Athlon or Core. Nvidia's architecture is already widely known; now we will look at Radeon and highlight the main differences in the approaches of these two vendors.

AMD graphics cards have had full support for general-purpose computing since the Evergreen family, which also pioneered the DirectX 11 specification. Cards of the 47xx family have a number of significant limitations, which are discussed below.

Differences in local memory size (32 KB for Radeon versus 16 KB for GT200 and 64 KB for Fermi) are generally not fundamental, nor is the wave-front size of 64 threads for AMD versus 32 threads per warp for Nvidia. Almost any GPU program can easily be reconfigured and tuned to these parameters. Performance can change by tens of percent, but for a GPU this is not so important: a GPU program usually runs either ten times faster than its CPU counterpart, or ten times slower, or does not work at all.

More important is AMD's use of VLIW (Very Long Instruction Word) technology. Nvidia uses simple scalar instructions operating on scalar registers; its accelerators implement a simple, classic RISC. AMD graphics cards have the same number of registers as the GT200, but the registers are 128-bit vectors. Each VLIW instruction operates on several four-component 32-bit registers, which resembles SSE, but VLIW is much more capable. It is not SIMD (Single Instruction Multiple Data) like SSE: here the operations for each pair of operands can be different and even dependent. For example, let the components of register A be named a1, a2, a3, a4, and similarly for register B. A single instruction executing in one cycle can then compute, for example, the number a1×b1+a2×b2+a3×b3+a4×b4 or the two-dimensional vector (a1×b1+a2×b2, a3×b3+a4×b4).

This is made possible by the GPU's lower clock frequency compared to the CPU and by the strong shrinking of process nodes in recent years. No scheduler is required; almost everything executes in one clock.

Thanks to vector instructions, Radeon's peak single-precision performance is very high, reaching the teraflop range.

One vector register can store one double-precision number instead of four single-precision ones, and one VLIW instruction can either add two pairs of doubles, or multiply two numbers, or multiply two numbers and add a third. Thus, peak performance in double is about five times lower than in float. For older Radeon models, it matches the double-precision performance of Nvidia Tesla on the new Fermi architecture and is much higher than that of cards based on the GT200 architecture. In consumer Geforce cards based on Fermi, the maximum speed of double-precision computation was cut by a factor of four.


Schematic diagram of Radeon operation. Only one of the 20 miniprocessors running in parallel is shown.

GPU manufacturers, unlike CPU manufacturers (above all the x86-compatible ones), are not bound by compatibility issues. A GPU program is first compiled into some intermediate code, and when the program is launched, the driver compiles this code into model-specific machine instructions. As described above, GPU manufacturers took advantage of this by inventing convenient ISAs (Instruction Set Architectures) for their GPUs and changing them from generation to generation. In any case, dropping the decoder (as unnecessary) added some percentage of performance. But AMD went even further and invented its own format for arranging instructions in machine code: they are laid out not sequentially (following the program listing) but in sections.

First comes a section of conditional-jump instructions, which contain links to sections of continuous arithmetic instructions corresponding to the different branches. The latter are called VLIW bundles (bundles of VLIW instructions). These sections contain only arithmetic instructions with data from registers or local memory. Such an organization simplifies the instruction stream and its delivery to the execution units, which is all the more useful given that VLIW instructions are relatively large. There are also sections for memory access instructions.

Conditional branch instruction sections
Section 0   Branch 0   link to section #3 of continuous arithmetic instructions
Section 1   Branch 1   link to section #4
Section 2   Branch 2   link to section #5

Sections of continuous arithmetic instructions
Section 3   VLIW instruction 0   VLIW instruction 1   VLIW instruction 2   VLIW instruction 3
Section 4   VLIW instruction 4   VLIW instruction 5
Section 5   VLIW instruction 6   VLIW instruction 7   VLIW instruction 8   VLIW instruction 9

GPUs from both manufacturers (Nvidia and AMD) also have built-in instructions for quickly computing basic mathematical functions (square root, exponent, logarithm, sine and cosine) for single-precision numbers in a few cycles. There are special compute blocks for this; they "came" from the need for a fast approximation of these functions in geometry shaders.

Even someone who did not know that GPUs are used for graphics and only read the technical specifications could guess from this that these computing coprocessors originated as video accelerators, just as certain traits of marine mammals led scientists to conclude that their ancestors were land animals.

But a more obvious feature betraying the graphical origin of the device is the blocks for reading two-dimensional and three-dimensional textures with support for bilinear interpolation. They are widely used in GPU programs, as they provide faster and easier reading of read-only data arrays. One standard behavior of a GPU application is to read arrays of initial data, process them in the computational cores, and write the result to another array, which is then transferred back to the CPU. Such a scheme is standard and common because it suits the GPU architecture well. Tasks that require intensive reads and writes to one large region of global memory, and thus contain data dependencies, are difficult to parallelize and implement efficiently on the GPU; their performance will also depend heavily on the latency of global memory, which is very large. But if a task is described by the pattern "read data, process, write result", you can almost certainly get a big boost from running it on the GPU.

For texture data in the GPU, there is a separate hierarchy of small caches of the first and second levels. It also provides acceleration from the use of textures. This hierarchy originally appeared in GPUs in order to take advantage of the locality of access to textures: obviously, after processing one pixel, a neighboring pixel (with a high probability) will require closely spaced texture data. But many algorithms for conventional computing have a similar nature of data access. So texture caches from graphics will be very useful.

Although the L1-L2 cache sizes in Nvidia and AMD cards are roughly the same, which is evidently dictated by the requirements of game graphics, the access latency of these caches differs significantly. Nvidia's access latency is higher, so texture caches in Geforce primarily help reduce the load on the memory bus rather than directly speed up data access. This is not noticeable in graphics programs but matters for general-purpose programs. In Radeon, the texture-cache latency is lower, but the latency of the miniprocessors' local memory is higher. For example, for optimal matrix multiplication on Nvidia cards it is better to use local memory, loading the matrix into it block by block, while for AMD it is better to rely on the low-latency texture cache and read matrix elements as needed. But this is already a rather subtle optimization, applied to an algorithm that has already been fundamentally ported to the GPU.
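A sketch of the "load the matrix into local (shared) memory block by block" approach mentioned for Nvidia cards. TILE, the kernel name and the assumption that the matrices are square, row-major and a multiple of TILE in size are all illustrative.

```cuda
#define TILE 16

// Launch with dim3 block(TILE, TILE) and dim3 grid(n / TILE, n / TILE).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // each thread loads one element of the current A tile and B tile
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                       // the whole tile must be loaded
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // before the tile is overwritten
    }
    C[row * n + col] = sum;
}
```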

This difference also shows up when 3D textures are used. One of the first GPU-computing benchmarks to show a serious advantage for AMD used exactly 3D textures, since it worked with a three-dimensional data array. Texture access latency in Radeon is significantly lower, and the 3D case is additionally optimized in hardware.

To get maximum performance from the hardware of different vendors, some tuning of the application for a specific card is needed, but it is an order of magnitude less significant than developing the algorithm for the GPU architecture in the first place.

Radeon 47xx Series Limitations

In this family, support for GPU computing is incomplete. Three points matter. First, there is no local memory; that is, it physically exists but lacks the universal access required by the modern standard of GPU programs, so it is emulated in global memory, and using it brings no benefit compared to a full-featured GPU. Second, support for the various atomic memory operations and synchronization instructions is limited. Third, the instruction cache is quite small: beyond a certain program size, the speed drops several times. There are other minor restrictions as well. In short, only programs ideally suited to the GPU will work well on this card: even though in simple test programs that operate only on registers the card can show a good result in gigaflops, programming something complex for it efficiently is problematic.

Advantages and disadvantages of Evergreen

If we compare AMD and Nvidia products in terms of GPU computing, the 5xxx series looks like a very powerful GT200, so powerful that it surpasses Fermi in peak performance by about two and a half times, especially after the parameters of the new Nvidia cards were cut and the number of cores reduced. But the appearance of the L2 cache in Fermi simplifies the implementation of some algorithms on the GPU, thus expanding its scope. Interestingly, for CUDA programs well optimized for the previous-generation GT200, Fermi's architectural innovations often changed nothing: they sped up in proportion to the increase in the number of compute modules, that is, by less than a factor of two (for single-precision numbers), or even less, because memory bandwidth did not grow (or for other reasons).

And in tasks that map well onto the GPU architecture and have a pronounced vector nature (for example, matrix multiplication), Radeon shows performance relatively close to its theoretical peak and overtakes Fermi, to say nothing of multi-core CPUs. This is especially true for problems with single-precision numbers.

At the same time, Radeon has a smaller die area, lower heat dissipation and power consumption, higher yield and, accordingly, lower cost. And directly in 3D-graphics tasks, the Fermi gain, if any, is much smaller than the difference in die area. This is largely because Radeon's compute architecture, with 16 compute units per miniprocessor, a 64-thread wave front and vector VLIW instructions, is perfect for its main task: computing graphics shaders. For the vast majority of ordinary users, gaming performance and price are the priorities.

From the standpoint of professional and scientific programs, the Radeon architecture provides the best price-performance ratio, performance per watt, and absolute performance in tasks that are in principle well suited to the GPU architecture and allow parallelization and vectorization.

For example, in the fully parallel, easily vectorizable key-search problem, Radeon is several times faster than Geforce and several tens of times faster than a CPU.

This fits the general AMD Fusion concept, according to which GPUs should complement the CPU and in the future be integrated into the CPU core itself, just as the math coprocessor was once moved from a separate chip into the processor core (that happened about twenty years ago, before the first Pentium processors appeared). The GPU will be both the integrated graphics core and a vector coprocessor for streaming tasks.

Radeon uses a clever technique of interleaving instructions from different wave fronts when they are executed by the functional units. This is easy to do since the instructions are completely independent; the principle is similar to the pipelined execution of independent instructions by modern CPUs. Apparently, this is what makes it possible to efficiently execute the complex, multi-byte vector VLIW instructions. On a CPU this would require either a sophisticated scheduler to identify independent instructions or hyper-threading, which supplies the CPU with instructions from different threads that are known to be independent.

cycle 0        cycle 1        cycle 2        cycle 3        cycle 4        cycle 5        cycle 6        cycle 7        VLIW module
wave front 0   wave front 1   wave front 0   wave front 1   wave front 0   wave front 1   wave front 0   wave front 1
instr. 0       instr. 0       instr. 16      instr. 16      instr. 32      instr. 32      instr. 48      instr. 48      VLIW0
instr. 1       instr. 1       instr. 17      instr. 17      instr. 33      instr. 33      instr. 49      instr. 49      VLIW1
instr. 2       instr. 2       instr. 18      instr. 18      instr. 34      instr. 34      instr. 50      instr. 50      VLIW2
instr. 3       instr. 3       instr. 19      instr. 19      instr. 35      instr. 35      instr. 51      instr. 51      VLIW3
instr. 4       instr. 4       instr. 20      instr. 20      instr. 36      instr. 36      instr. 52      instr. 52      VLIW4
instr. 5       instr. 5       instr. 21      instr. 21      instr. 37      instr. 37      instr. 53      instr. 53      VLIW5
instr. 6       instr. 6       instr. 22      instr. 22      instr. 38      instr. 38      instr. 54      instr. 54      VLIW6
instr. 7       instr. 7       instr. 23      instr. 23      instr. 39      instr. 39      instr. 55      instr. 55      VLIW7
instr. 8       instr. 8       instr. 24      instr. 24      instr. 40      instr. 40      instr. 56      instr. 56      VLIW8
instr. 9       instr. 9       instr. 25      instr. 25      instr. 41      instr. 41      instr. 57      instr. 57      VLIW9
instr. 10      instr. 10      instr. 26      instr. 26      instr. 42      instr. 42      instr. 58      instr. 58      VLIW10
instr. 11      instr. 11      instr. 27      instr. 27      instr. 43      instr. 43      instr. 59      instr. 59      VLIW11
instr. 12      instr. 12      instr. 28      instr. 28      instr. 44      instr. 44      instr. 60      instr. 60      VLIW12
instr. 13      instr. 13      instr. 29      instr. 29      instr. 45      instr. 45      instr. 61      instr. 61      VLIW13
instr. 14      instr. 14      instr. 30      instr. 30      instr. 46      instr. 46      instr. 62      instr. 62      VLIW14
instr. 15      instr. 15      instr. 31      instr. 31      instr. 47      instr. 47      instr. 63      instr. 63      VLIW15

128 instructions of two wave fronts, each consisting of 64 operations, are executed by 16 VLIW modules in eight cycles. The execution is interleaved, and each module effectively has two cycles to execute a whole instruction, provided that on the second cycle it starts a new one in parallel. This probably helps to execute quickly a VLIW instruction like a1×a2+b1×b2+c1×c2+d1×d2, that is, eight such instructions in eight cycles (formally, one per clock).

Nvidia apparently does not have this technology. And in the absence of VLIW, high performance with scalar instructions requires a high clock frequency, which automatically increases heat dissipation and places high demands on the process technology (to force the circuit to run at a higher frequency).

Radeon's disadvantage for GPU computing is its strong dislike of branching. GPUs in general do not favor branching because of the execution technique described above: instructions are issued for a whole group of threads with one program address. (Incidentally, this technique is called SIMT: Single Instruction, Multiple Threads, by analogy with SIMD, where one instruction performs one operation on different data.) Clearly, if the program is not completely vector, then the larger the warp or wave front, the worse: when the paths of neighbouring threads through the program diverge, more groups are formed that must be executed sequentially (serialized). Suppose all the threads have diverged: with a warp size of 32 threads, the program will run 32 times slower, and with a size of 64, as in Radeon, 64 times slower.
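A hedged CUDA sketch of this serialization effect (the kernel names are illustrative): in the first kernel even and odd threads of the same warp take different branches, so the two paths run one after another; in the second the condition is uniform within a warp, so the penalty is minimal.

```cuda
__global__ void divergent(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                 // even and odd threads of one warp diverge
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}

__global__ void uniform(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)    // all threads of a warp take the same branch
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}
```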

Divergence is a noticeable, but not the only, manifestation of this "dislike". In Nvidia cards, each functional unit, otherwise called a CUDA core, has a special branch-processing unit, whereas in Radeon cards there are only two branch control units per 16 compute units (they are derived from the domain of arithmetic units). So even simple processing of a conditional branch instruction, even when its result is the same for all threads in the wave front, takes additional time, and the speed drops.

AMD also manufactures CPUs, and its position is that for programs with a lot of branching the CPU is still better suited, while the GPU is intended for purely vector programs.

So Radeon is, overall, harder to program efficiently, but it offers better price-performance in many cases. In other words, fewer programs can be migrated from the CPU to Radeon efficiently (profitably) than can be run effectively on Fermi, but those that can be ported will in many respects run more efficiently on Radeon.

API for GPU Computing

Radeon's technical specifications by themselves look attractive, even if GPU computing should not be idealized or absolutized. But no less important for performance is the software needed to develop and run a GPU program: compilers from a high-level language and the runtime, that is, the driver that mediates between the part of the program running on the CPU and the GPU itself. It is even more important than in the case of the CPU: the CPU needs no driver to manage data transfers, and from the compiler's point of view the GPU is more finicky. For example, the compiler must make do with a minimum number of registers for storing intermediate results and neatly inline function calls, again using a minimum of registers. After all, the fewer registers a thread uses, the more threads can be launched and the more fully the GPU is loaded, hiding memory access time better.

And so software support for Radeon products still lags behind the hardware. (In contrast to the situation with Nvidia, where the hardware release was delayed and the product came out in a stripped-down form.) Until recently, AMD's OpenCL compiler was in beta status, with many flaws: it too often generated erroneous code, refused to compile correct source code, or crashed with an internal error. Only at the end of spring did a release with high performance appear. It is not free of errors either, but there are significantly fewer of them, and they usually show up on the margins, when trying to program something on the verge of correctness. For example, they affect work with the uchar4 type, which describes a four-component 4-byte variable. This type is in the OpenCL specification, but it is not worth using on Radeon, because the registers are 128-bit: the same four components, but 32-bit each. Such a uchar4 variable will still occupy a whole register, and additional packing and per-byte access operations will be required on top. A compiler should not have bugs, but there is no compiler without bugs; even the Intel compiler, after 11 versions, has compilation errors. The identified bugs are fixed in the next release, which will come out closer to autumn.

But there are still many things that need improvement. For example, the standard Radeon GPU driver still does not support GPU computing with OpenCL; the user must download and install an additional special package.

But the most important issue is the absence of any function libraries. For double-precision real numbers there is not even a sine, cosine or exponent. Well, this is not needed for matrix addition and multiplication, but if you want to program something more complex, you have to write all the functions from scratch. Or wait for a new SDK release. ACML (AMD Core Math Library) with support for basic matrix functions should be released soon for the Evergreen GPU family.

At the moment, in the author's opinion, the realistic API for programming Radeon video cards is Direct Compute 5.0, naturally taking its limitations into account: it is tied to the Windows 7 and Windows Vista platforms. Microsoft has a lot of experience making compilers, and a fully functional release can be expected very soon; Microsoft is directly interested in this. But Direct Compute is focused on the needs of interactive applications: compute something and immediately visualize the result, for example the flow of a liquid over a surface. This does not mean it cannot be used purely for computation, but that is not its natural purpose. For example, Microsoft does not plan to add library functions to Direct Compute, exactly the ones AMD currently lacks. That is, what can now be computed effectively on Radeon (not overly sophisticated programs) can also be implemented in Direct Compute, which is much simpler than OpenCL and should be more stable. Plus, it is completely portable and will run on both Nvidia and AMD, so the program only needs to be compiled once, whereas Nvidia's and AMD's OpenCL SDK implementations are not exactly compatible. (In the sense that if you develop an OpenCL program on an AMD system using the AMD OpenCL SDK, it may not run as easily on Nvidia; you may need to recompile the same source with the Nvidia SDK, and vice versa.)

Moreover, OpenCL carries a lot of redundant functionality, since it is meant to be a universal programming language and API for a wide range of systems: GPU, CPU, and Cell. So if you just want to write a program for a typical user system (a processor plus a video card), OpenCL does not feel, so to speak, "highly productive". Each function takes ten parameters, nine of which must be set to 0, and to set each parameter you must call a special function that also has parameters.

And the most important current advantage of Direct Compute is that the user does not need to install a special package: everything that is needed is already in DirectX 11.

Problems of development of GPU computing

If we take the field of personal computers, the situation is as follows: there are not many tasks that require a lot of computing power yet are severely starved for it on a conventional dual-core processor. It is as if big, gluttonous but clumsy monsters had crawled out of the sea onto land, and there was almost nothing to eat there, while the original inhabitants of the land are shrinking in size and learning to consume less, as always happens when natural resources are scarce. If there were the same need for performance today as 10-15 years ago, GPU computing would be accepted with a bang. As it is, compatibility problems and the relative complexity of GPU programming come to the fore: it is better to write a program that runs on all systems than one that is fast but runs only on the GPU.

The outlook for GPUs is somewhat better in professional applications and the workstation sector, where the demand for performance is higher. Plugins for 3D editors with GPU support are appearing: for example, for rendering with ray tracing, not to be confused with regular GPU rendering. Something is also showing up for 2D and presentation editors, with faster creation of complex effects. Video-processing programs are gradually gaining GPU support as well. These tasks, given their parallel nature, fit the GPU architecture well, but a very large code base has already been created, debugged and optimized for everything the CPU can do, so it will take time for good GPU implementations to appear.

In this segment, the GPU's weak sides also show up, such as the limited amount of video memory: about 1 GB for ordinary GPUs. One of the main factors reducing the performance of GPU programs is the need to exchange data between the CPU and GPU over a slow bus, and the limited amount of memory means more data has to be transferred. Here AMD's concept of combining the GPU and CPU in one module looks promising: one can sacrifice the high bandwidth of graphics memory in exchange for easy, low-latency access to shared memory. The high bandwidth of today's GDDR5 video memory is far more in demand in actual graphics programs than in most GPU-computing programs. In general, shared memory between the GPU and CPU would significantly expand the scope of the GPU, making it possible to use its computing capabilities in small subtasks of programs.

GPUs are most in demand in scientific computing. Several GPU-based supercomputers have already been built and show very high results in matrix-operation benchmarks. Scientific problems are so diverse and numerous that there is always a set that fits the GPU architecture perfectly, for which the GPU makes it easy to obtain high performance.

If one task had to be chosen among everything modern computers do, it would be computer graphics: an image of the world we live in. An architecture optimal for this purpose cannot be bad. This task is so important and fundamental that hardware specially designed for it is bound to be universal and to turn out optimal for various other tasks. Moreover, video cards continue to evolve successfully.

There are never too many cores...

Modern GPUs are monstrously fast beasts capable of chewing through gigabytes of data. However, humans are cunning: however much computing power grows, they come up with ever harder tasks, so the moment comes when you have to admit with sadness that optimization is needed 🙁

This article describes the basic concepts, to make it easier to navigate the theory of GPU optimization, and the basic rules, so that these concepts need to be consulted less often.

The reasons why GPUs are effective for dealing with large amounts of data that require processing:

  • they have great opportunities for parallel execution of tasks (many, many processors)
  • high memory bandwidth

Memory bandwidth is how much information (bits or gigabytes) can be transferred per unit of time, be it a second or a processor cycle.

One of the goals of optimization is to use the maximum throughput: to raise the achieved performance throughput (ideally it should equal the memory bandwidth).

To improve bandwidth usage:

  • increase the amount of information transferred per request, using the bandwidth to the full (for example, have each thread work with float4)
  • reduce latency - the delay between operations

Latency is the interval between the moment the controller requests a particular memory cell and the moment the data becomes available to the processor for executing instructions. We cannot influence the delay itself in any way: the limitation is at the hardware level. But it is precisely thanks to this delay that the processor can serve several threads at once: while thread A has requested memory, thread B can compute something, and thread C can wait for its requested data to arrive.

How to reduce latency if synchronization is used:

  • reduce the number of threads in a block
  • increase the number of blocks (work groups)

Using GPU resources to the full - GPU Occupancy

In highbrow conversations about optimization, the term GPU occupancy or kernel occupancy often comes up; it reflects how efficiently the video card's resources and capacities are used. Note separately that using all the resources does not yet mean using them correctly.

The computing power of a GPU is hundreds of processors hungry for work, and when writing a program, a kernel, the burden of distributing the load across them falls on the programmer's shoulders. A mistake can leave most of these precious resources idle for no reason. Now I will explain why; we have to start from afar.

Let me remind you that a warp (warp in NVidia terminology, wavefront in AMD terminology) is a set of threads that simultaneously execute the same kernel function on a processor. Threads, grouped by the programmer into blocks, are split into warps by the thread scheduler (separately for each multiprocessor): while one warp is running, another waits for its memory requests to be processed, and so on. If some threads of a warp are still computing while others have already finished, the compute resource is used inefficiently, which is popularly called idle power.

Every synchronization point, every branch of logic can create such an idle situation. The maximum divergence (branching of execution logic) depends on the warp size: for NVidia GPUs it is 32, for AMD 64.

To reduce multiprocessor downtime during warp execution:

  • minimize the time spent waiting at barriers
  • minimize the divergence of execution logic in the kernel function

To solve this effectively, it helps to understand how warps are formed (for the multi-dimensional case). The order is actually simple: first along X, then along Y, and last along Z.

If a kernel is launched with 64×16 blocks, the threads are split into warps in X, Y, Z order: the first 64 elements form two warps, then the next 64, and so on.

If the kernel is launched with 16×64 blocks, the first and second groups of 16 elements go into the first warp, the third and fourth groups into the second warp, and so on.
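A small CUDA sketch of this flattening rule (the helper names are illustrative): the three-dimensional thread index is linearized X-first and only then cut into warps.

```cuda
// Linear index of a thread inside its block: X varies fastest, then Y, then Z.
__device__ int linear_thread_id()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// Consecutive groups of warpSize linear indices form one warp (32 on NVidia;
// AMD wave fronts group 64 threads).
__device__ int warp_id()
{
    return linear_thread_id() / warpSize;
}
```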

How to reduce divergence (remember: branching is not always the cause of a critical performance loss)

  • when adjacent threads have different execution paths (many conditions and jumps depending on them), look for ways to restructure the code
  • look for an unbalanced thread load and remove it decisively (this is when we not only have conditions, but because of these conditions the first thread always computes something while the fifth never falls into the condition and sits idle)

How to get the most out of GPU resources

GPU resources, unfortunately, also have their limits, and strictly speaking, before launching a kernel function it makes sense to determine those limits and take them into account when distributing the load. Why does this matter?

Video cards have limits on the total number of threads one multiprocessor can execute, the maximum number of threads in one block, the maximum number of warps on one processor, restrictions on the different kinds of memory, and so on. All of this information can be queried both programmatically, through the corresponding API, and in advance using utilities from the SDK (the deviceQuery sample for NVidia devices, CLInfo for AMD video cards).
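A minimal sketch of the programmatic route using the CUDA runtime API (for AMD/OpenCL the analogous call would be clGetDeviceInfo); device 0 and the selected fields are just examples.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("warp size:                  %d\n", prop.warpSize);
    printf("shared memory per block:    %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:        %d\n", prop.regsPerBlock);
    return 0;
}
```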

General practice:

  • the number of thread blocks/workgroups must be a multiple of the number of stream processors
  • block/workgroup size must be a multiple of the warp size

At the same time, bear in mind the absolute minimum of 3-4 warps/wavefronts spinning on each processor at the same time; wise guides advise assuming at least seven wavefronts. And do not forget the hardware restrictions!

Keeping all these details in your head quickly gets tedious, so for calculating GPU occupancy NVidia offered an unexpected tool: an Excel(!) calculator full of macros. You enter the maximum number of threads per SM, the number of registers and the size of the shared memory available on the stream processor, plus the launch parameters of your functions, and it reports the percentage of resource-use efficiency (and you tear your hair out realizing that to use all the cores you are short of registers).

usage information:
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#calculating-occupancy

GPU and memory operations

Video cards are optimized for 128-bit memory operations, i.e. ideally each memory transaction should change four 4-byte values at once. The main annoyance for the programmer is that modern GPU compilers are not able to optimize such things automatically: this has to be done right in the function code and, on average, brings fractions of a percent of performance gain. The frequency of memory requests has a far greater impact on performance.

The problem is as follows: each request returns a piece of data whose size is a multiple of 128 bits, while each thread uses only a quarter of it (in the case of an ordinary four-byte variable). When adjacent threads simultaneously work with data located sequentially in memory, the total number of memory accesses goes down. This phenomenon is called coalesced reads and writes (coalesced access - good! both read and write), and with the right organization of the code (strided access to a contiguous chunk of memory - bad!) it can significantly improve performance. When organizing your kernel, remember: contiguous access works within the elements of one row of memory, while working with the elements of a column is no longer as efficient. Want more details? I liked this pdf, or google for "memory coalescing techniques".
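A hedged CUDA sketch of the difference (kernel names and the stride pattern are illustrative): in the first kernel a warp reads a contiguous block, and with float4 each thread moves 16 bytes per access; in the second, accesses are scattered by a stride and the warp generates many separate memory transactions.

```cuda
__global__ void copy_coalesced(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                        // neighbouring threads, neighbouring cells
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((size_t)i * stride) % n]; // scattered reads: poor coalescing
}
```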

The leading position in the "bottleneck" nomination is held by another memory operation: copying data from host memory to the GPU. The copy does not happen just anyhow, but from a memory area specially allocated by the driver and the system: when a request to copy data is made, the system first copies the data there and only then uploads it to the GPU. The data transfer rate is limited by the bandwidth of the PCI Express xN bus (where N is the number of lanes) through which modern video cards communicate with the host.

However, the extra copy through slow host memory is sometimes an unjustified overhead. The way out is to use so-called pinned memory: a specially marked memory area that the operating system cannot touch (for example, swap it out or move it at its discretion). Data transfer from the host to the video card then happens without the participation of the operating system, asynchronously, via DMA (direct memory access).
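A sketch of using pinned (page-locked) host memory so that transfers can go over DMA asynchronously; buffer names and sizes are illustrative.

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;
    float *h_pinned, *d_buf;
    cudaMallocHost((void **)&h_pinned, bytes);   // page-locked host allocation
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);
    // asynchronous DMA transfer: returns immediately, completes in stream s
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s);
    // ... kernel launches in stream s, or other CPU work, could go here ...
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```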

And finally, a little more about memory. Shared memory on a multiprocessor is usually organized as memory banks containing 32-bit words of data. The number of banks traditionally varies from one GPU generation to another (16 or 32). If each thread requests data from a separate bank, everything is fine; otherwise several read/write requests hit one bank and we get a conflict (shared memory bank conflict). Such conflicting accesses are serialized and therefore executed sequentially rather than in parallel. If all threads access the same word in one bank, a broadcast response is used and there is no conflict. There are several ways of dealing effectively with access conflicts; I liked the description of the main techniques for getting rid of shared-memory bank conflicts.
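One of those techniques in a hedged CUDA sketch: the classic padding trick in a tile transpose, where an extra column shifts each row to a different bank. The kernel assumes a 32×32 thread block and n being a multiple of TILE; all names are illustrative.

```cuda
#define TILE 32

__global__ void transpose_padded(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE + 1];     // +1 column breaks the bank conflict

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;       // transposed block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // column read, no conflicts
}
```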

How to make mathematical operations even faster? Remember that:

  • double-precision calculations are a heavy operation: the cost of fp64 is much higher than fp32
  • constants of the form 3.14 in the code are interpreted as fp64 by default unless you explicitly write 3.14f (see the sketch after this list)
  • to optimize the math, it is worth checking the guides for suitable compiler flags
  • vendors ship functions in their SDKs that exploit device features to achieve performance (often at the expense of portability)
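A tiny CUDA sketch of the constant-suffix point and of a vendor-specific intrinsic (the kernel name is illustrative; __sinf is a CUDA-only fast intrinsic, i.e. exactly the kind of non-portable SDK feature mentioned above).

```cuda
__global__ void scale_fp32(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // v[i] = v[i] * 3.14;  // the literal is double, so the multiply is promoted to fp64
    v[i] = v[i] * 3.14f;    // with the f suffix everything stays in single precision
    v[i] = __sinf(v[i]);    // hardware intrinsic: faster but less accurate than sinf()
}
```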

CUDA developers should pay close attention to the concept of a cuda stream, which lets you run several kernel functions on one device at once or overlap asynchronous copying of data from host to device with kernel execution. OpenCL does not yet provide such functionality 🙁
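A minimal sketch of overlapping copies and kernels with CUDA streams. The process kernel, the chunking scheme and the assumption that h_pinned was allocated with cudaMallocHost and d_buf holds two chunks are all illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void process(float *chunk, int n) { /* placeholder kernel body */ }

void run_chunks(float *h_pinned, float *d_buf, int n_chunks, int chunk_elems)
{
    size_t chunk_bytes = chunk_elems * sizeof(float);
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < n_chunks; ++c) {
        cudaStream_t st = s[c % 2];                         // ping-pong between streams
        float *h = h_pinned + (size_t)c * chunk_elems;
        float *d = d_buf + (size_t)(c % 2) * chunk_elems;   // two device chunk slots
        cudaMemcpyAsync(d, h, chunk_bytes, cudaMemcpyHostToDevice, st);
        process<<<(chunk_elems + 255) / 256, 256, 0, st>>>(d, chunk_elems);
        cudaMemcpyAsync(h, d, chunk_bytes, cudaMemcpyDeviceToHost, st);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```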

Profiling tools:

NVidia Visual Profiler is an interesting utility that analyzes both CUDA and OpenCL kernels.

P.S. As a longer optimization guide, I can recommend googling the various best-practices guides for OpenCL and CUDA.

