How to choose the number of convolution layers and filters in a CNN

Let us start with the more straightforward part: the input and output layers. Every network has a single input layer and a single output layer. The number of neurons in the input layer equals the number of input variables in the data being processed, and the number of neurons in the output layer equals the number of outputs associated with each input.
The real challenge is choosing the number of hidden layers and the number of neurons in each of them.
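As a small illustration of how the input and output layer sizes follow directly from the data, here is a minimal sketch in plain Python. The toy dataset and the helper name `layer_sizes` are made up for this example, and it assumes a classification problem with one output neuron per class; a regression problem with a single target would have one output neuron instead.

```python
def layer_sizes(samples, labels):
    """Return (input neurons, output neurons) for a classification dataset.

    Assumption: one input neuron per input variable and one output
    neuron per distinct class label.
    """
    n_inputs = len(samples[0])      # number of input variables per sample
    n_outputs = len(set(labels))    # number of distinct classes
    return n_inputs, n_outputs


# Hypothetical toy data: 4 samples, 3 input variables each, 3 classes.
X = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
]
y = ["cat", "dog", "cat", "bird"]

n_input_neurons, n_output_neurons = layer_sizes(X, y)
print("input neurons:", n_input_neurons)
print("output neurons:", n_output_neurons)
```

Nothing comparable exists for the hidden layers, which is why the rest of this answer is about experimentation.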

The short answer is that you cannot analytically calculate the number of layers, or the number of nodes per layer, that an artificial neural network needs for a specific real-world predictive modeling problem.
The number of layers and the number of nodes in each layer are model hyperparameters: you must specify them yourself before training, because they are not learned from the data.
You have to find good values using a robust test harness and controlled experiments. Regardless of the heuristics you might encounter, every answer comes back to careful experimentation to see what works best for your specific dataset.
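To make "controlled experiments" a bit more concrete, here is a rough sketch of a test harness that tries every combination of layer count and filter count, repeats each run a few times, and ranks the configurations by mean validation score. All names here (`run_experiments`, `placeholder_train_and_score`) are hypothetical. In particular, the scoring function is only a stand-in that returns a made-up number so the sketch runs on its own; in practice you would replace it with code that builds, trains, and validates your real CNN.

```python
import itertools
import random
import statistics


def run_experiments(layer_options, filter_options, train_and_score,
                    n_repeats=3, seed=0):
    """Evaluate every (layers, filters) pair and rank by mean score."""
    rng = random.Random(seed)
    results = []
    for n_layers, n_filters in itertools.product(layer_options, filter_options):
        # Repeat each configuration with different seeds, because a single
        # training run can be misleading.
        scores = [
            train_and_score(n_layers, n_filters, rng.randrange(10**6))
            for _ in range(n_repeats)
        ]
        results.append({
            "layers": n_layers,
            "filters": n_filters,
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        })
    # Best configuration first.
    results.sort(key=lambda r: r["mean"], reverse=True)
    return results


def placeholder_train_and_score(n_layers, n_filters, run_seed):
    """Stand-in for real training. The formula below is invented.

    Replace this with code that trains your model on the training split
    and returns its accuracy on a held-out validation split.
    """
    noise = random.Random(run_seed).uniform(-0.001, 0.001)
    return 0.9 - 0.02 * abs(n_layers - 3) - 0.0005 * abs(n_filters - 32) + noise


results = run_experiments([1, 2, 3, 4], [16, 32, 64], placeholder_train_and_score)
for r in results[:3]:
    print(r["layers"], r["filters"], round(r["mean"], 4), round(r["stdev"], 4))
```

The important parts are the ones that carry over to a real setup: every configuration is scored the same way on the same data splits, and each one is run more than once so you can see how much of a difference is just noise.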

The filter size is another such hyperparameter that you must specify before training your network.
For an image recognition problem, if you think a large number of pixels is needed for the network to recognize the object, use large filters (such as 11×11 or 9×9). If you think objects are distinguished by small, local features, use small filters (3×3 or 5×5).
These are rules of thumb, not hard rules.
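One way to get a feel for the filter-size trade-off is to do the arithmetic. The sketch below compares a single 11×11 convolution layer with a stack of five 3×3 layers in terms of weight count and how many input pixels one output value can "see" (the receptive field). It assumes stride 1, no dilation, no pooling between layers, and 64 input and output channels throughout; the channel count and the helper names are arbitrary choices for illustration.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of trainable parameters in one 2D conv layer (weights + biases)."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def receptive_field(kernel_sizes):
    """Receptive field (in pixels per side) of stacked stride-1 conv layers."""
    # Each k x k layer with stride 1 widens the field by (k - 1) pixels.
    return 1 + sum(k - 1 for k in kernel_sizes)


channels = 64  # arbitrary choice for this example

one_large = conv_params(11, channels, channels)
five_small = sum(conv_params(3, channels, channels) for _ in range(5))

rf_large = receptive_field([11])
rf_small = receptive_field([3] * 5)

print("one 11x11 layer :", one_large, "params, receptive field", rf_large)
print("five 3x3 layers :", five_small, "params, receptive field", rf_small)
```

Under these assumptions, both options cover the same 11×11 region of the input, but the stack of small filters uses fewer parameters. That does not settle which one works better on your data; it only shows that "large context" and "large filters" are not the same thing, so both are worth including in your experiments.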

There are many other tricks for increasing the accuracy of a deep learning model. For more on this, see "Improve deep learning model performance".

Hope this will help you.