PCI DSS: What's That and How to Get Certified? Bonus: Our Lessons Learnt
— Good. Now show your static code analyzer.
— Sure. This is Peter.
— Nice to meet you, Peter, but...
— Well, Peter is our static code analyzer actually.
When handling payment data, you have to maintain a certain level of security, which is specified in the Payment Card Industry Data Security Standard (PCI DSS) developed by Visa, Mastercard, and other payment systems. The standard regulates the activities of all parties involved in handling cardholder data and, importantly, introduces additional requirements for service providers. It comprises 12 sections, ranging from the security team's obligation to monitor access changes and collect the pass cards of departing employees to regulations on where and how various logs are to be written.
I’ll tell you how we got our cloud platform certified and how much that frayed our nerves.
In brief, what’s the hitch?
Initially, each payment system, such as Visa, Mastercard, American Express, JCB and Discover, had its own program with a minimum set of security requirements for handling cardholder data, and these programs overlapped. The payment systems then aligned them and released PCI DSS 1.0 back in 2004, with minimum requirements for everyone who stores, processes or transfers cardholder data. In 2006 they formed the Payment Card Industry Security Standards Council (PCI SSC), which has maintained the standard ever since.
In the late 2000s, cloud computing took off, and it turned out that the then-current version of PCI DSS did not cover some peculiarities of clouds as a new type of hosting, for example, an architecture with a constantly changing set and number of components, or the way some technical solutions are implemented. Moreover, some requirements couldn't be applied to us because we simply didn't have the relevant processes. For example, Section 4 requires payment data to be encrypted when transferred over public networks; in our case, that was solely the customer's responsibility. Otherwise, hosting platforms and customer virtual infrastructures were mostly subject to the same requirements.
In general, all of them made sense and were clear enough at first sight. A simple action plan, isn't it? However, once you proceed, you start facing challenges and forks in interpretation. Take the standard's requirement to have a static code analyzer. Most of our code was in Python, which doesn't have static code analyzers at the moment. OK, technically it does (take Bandit), but it doesn't fit our case. That's why a human being called Peter analyzed code security and made our static analyzer claims true to life: Peter the human satisfied the analyzer requirement. We dealt with some other requirements in a similar way.
Why certify the cloud?
Our customers use the cloud to get their own tasks done. Just an example: a bank launches a new service and six months later finds out that it consumes too many resources in the production environment and needs new hardware. Hardware takes time to arrive, procurement is costly, and the approval procedure is a pain in the neck. Naturally, a cloud option is much more convenient, as everyone has already got accustomed to paying as they go and scaling in no time.
However, cloud adoption is not just an easy trick, because a cloud provider's site (i.e. our site in this case) where customers keep, process and transfer Mastercard and Visa cardholder data must comply with PCI DSS. Otherwise, such a site can't be used. If a customer does not work with such data, it does not need the certification.
Customers can get certified themselves on any infrastructure but, in practice, it is dramatically hard and sometimes almost impossible. To get certified, a customer needs the cloud provider to be actively engaged in establishing processes and implementing technical solutions that ensure PCI DSS compliance at the cloud infrastructure level, the very place where cardholder data will be processed. It's far easier and much more cost-effective to find a provider who has already completed their part of the certification. That's why we decided to take the plunge.
For our customers, this certification accelerates migration and their own audits, while ensuring compliance with all the requirements for a data center (data centers, in our case) and cloud infrastructure, including all technical and organizational processes. In practice, customers also spend less time running their own audit and drafting supporting documentation, part of which we have already prepared.
How did the audit go?
It took us more than a year to get ready for the audit.
We had built the entire cloud from scratch and initially didn't take PCI DSS requirements into account. Although the cloud was based on KVM, a hypervisor backed by Red Hat, KVM was far from its only component. Our developers had to write tons of code and tweak and tune legacy technology to make sure it all worked flawlessly.
PCI DSS has an entire section on how to develop software that is subject to the standard. It's all about home-grown cardholder data processing software, apps, web interfaces, etc. In our case, the entire platform code development lifecycle fell under the regulations. The audit covered three main groups: data center operations, cloud operations, and cloud development, plus integrations with certain policies and processes of the company.
For all the tricky moments, the story was the same: there were certain requirements to follow, and compliance was a must, no matter how the system was engineered. If the interpretation is unclear or you don't know how to make your infrastructure compliant, ask the auditor whether your solution is acceptable. Very often we had to argue with the auditors, explaining the platform's specifics and convincing them that our choice of solution was right, which was not always evident. Moreover, the certification requires annual penetration tests. For our cloud, we selected several attacker models: an external intruder and a compromised (malicious) employee or customer.
One of the pressing cloud issues is that you need to maintain a DMZ. In our design, it was extremely difficult to identify the DMZ because, in the traditional sense, it was not part of the platform. So we had to create micro DMZs on each server that was accessible from the Internet or bordered third-party LANs.
We moved step by step, requirement by requirement.
For example, you must document all system components related to the platform, including networking, monitoring, all infrastructure services, protections, etc. For each component, you must specify its location, software version, and IP address. No problem if you have just 50 components. But what if you have 500? Moreover, the number of components (both servers and roles on each server) is constantly changing. Cloud is all about roles: a new server may appear with a role, or a new role may be added to an existing server, and all this diversity keeps changing and usually growing. Evidently, don't try to do it manually: the list will change several times while you are still drafting it. So we had to automate, building a home-grown cloud-wide data collection system that delivers detailed reporting on all components at any time.
Networking was the same pain in the neck. PCI DSS requires all components to be shown on a diagram at layers 2 and 3 of the OSI model, but a single diagram can hardly accommodate 500+ components, and even if it can, it will be barely readable. The diagram was a must, so you have to have it even though it seemed almost impossible to draw, and we tried grouping everything in various ways. In the end, we coped with that difficulty as well.
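Going back to the component inventory: here is a minimal Python sketch of what such automated reporting could look like. It is not our actual collection system. The shape of the per-host facts (the `host`, `location`, `ip` and `roles` keys) and the names `Component`, `build_inventory` and `diff_inventory` are assumptions made up for this illustration. The idea is simply to flatten whatever each server reports into one row per role and to diff two snapshots, so the list never has to be maintained by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One inventory row: a single role running on a single server."""
    host: str
    role: str
    location: str
    version: str
    ip: str


def _sort_key(component):
    # Stable ordering for reports and diffs.
    return (component.host, component.role, component.version)


def build_inventory(host_facts):
    """Flatten per-host facts into one row per (host, role) pair.

    `host_facts` is a hypothetical structure that an agent on each
    server might report; the key names are assumptions of this sketch.
    """
    rows = []
    for facts in host_facts:
        for role, version in facts["roles"].items():
            rows.append(Component(facts["host"], role, facts["location"],
                                  version, facts["ip"]))
    return sorted(rows, key=_sort_key)


def diff_inventory(previous, current):
    """Return (added, removed) rows between two inventory snapshots."""
    before, after = set(previous), set(current)
    added = sorted(after - before, key=_sort_key)
    removed = sorted(before - after, key=_sort_key)
    return added, removed


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    snapshot = build_inventory([
        {"host": "node-01", "location": "DC1/rack 4", "ip": "10.0.0.11",
         "roles": {"compute": "1.4.2", "monitoring-agent": "0.9"}},
    ])
    for row in snapshot:
        print(row.host, row.role, row.version, row.ip, row.location)
```

Because a version change shows up as one removed row plus one added row, the same diff can double as a change log between two reporting dates.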
Monitoring brought challenges of its own: extended event logs and an integrity control system for all components.
As per the standard, you have to log all user actions, collect the logs of all critical components on an external server, and analyze them. Even with multiple components, the first part of the task is not rocket science, as we know how to manage large-scale infrastructures. The analysis part, however, was a challenge: processed manually, the heap of incoming logs would take weeks or perhaps even months. That's why we implemented a special log collection and analysis engine that monitored logs in real time and sent trigger-based alarms. Naturally, the triggers gave us a hard time.
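To give a feel for what a trigger is, here is a small Python sketch. It is only an illustration under our own assumptions, not the engine we deployed: the `Trigger` class, the `scan` function, the threshold and the sample log lines are all hypothetical. Each trigger watches for a regular expression and raises an alarm once the pattern has matched a given number of times within a sliding time window.

```python
import re
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Trigger:
    """Alarm when `pattern` matches `threshold` times within `window` seconds."""
    name: str
    pattern: str
    threshold: int
    window: float
    _hits: deque = field(default_factory=deque, init=False, repr=False)

    def feed(self, timestamp, line):
        """Return True if this line pushes the trigger over its threshold."""
        if not re.search(self.pattern, line):
            return False
        self._hits.append(timestamp)
        # Drop matches that have fallen out of the sliding window.
        while self._hits and timestamp - self._hits[0] > self.window:
            self._hits.popleft()
        if len(self._hits) >= self.threshold:
            self._hits.clear()  # reset so one burst raises one alarm
            return True
        return False


def scan(events, triggers):
    """Run (timestamp, line) events through all triggers; return the alarms."""
    alarms = []
    for timestamp, line in events:
        for trigger in triggers:
            if trigger.feed(timestamp, line):
                alarms.append((timestamp, trigger.name))
    return alarms


if __name__ == "__main__":
    # Made-up log lines; timestamps are seconds since an arbitrary start.
    events = [
        (0, "sshd: Failed password for root from 203.0.113.7"),
        (10, "sshd: Failed password for root from 203.0.113.7"),
        (30, "sshd: Failed password for root from 203.0.113.7"),
    ]
    print(scan(events, [Trigger("ssh-bruteforce", r"Failed password", 3, 60)]))
```

Most of the hard time went into the part this sketch treats as given: choosing patterns, thresholds and windows that catch real incidents without drowning the team in alarms.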
The second challenge was the integrity control mechanism. First of all, we had to identify the critical configurations and logs to be monitored, and then create and upload the relevant role-based templates. It should have been a rather trivial task but, at our volumes, legitimate automated processes kept setting off the triggers. Initially, we received tons of false positives from all components and spent many hours tuning it all.
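The idea behind role-based integrity templates can be sketched in a few lines of Python. This is a simplified illustration and not the product we used: the template format, the file paths and the function names (`snapshot`, `changed_files`) are assumptions. Each role lists the files to watch, a server's watch list is the union over its roles, and a change is any file whose SHA-256 digest differs from the baseline. The `ignore` parameter stands in for the tuning step, where files known to be rewritten by legitimate automation are excluded to cut false positives.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's contents as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(root, roles, templates):
    """Hash every file watched for the given roles.

    `templates` maps a role name to relative paths (a hypothetical format).
    Files that do not exist are recorded as None, so that a deleted or
    newly created file also shows up as a change.
    """
    watched = sorted({rel for role in roles for rel in templates[role]})
    result = {}
    for rel in watched:
        full = Path(root) / rel
        result[rel] = file_digest(full) if full.is_file() else None
    return result


def changed_files(baseline, current, ignore=()):
    """Paths whose digest differs between two snapshots, minus ignored ones."""
    paths = baseline.keys() | current.keys()
    return sorted(
        p for p in paths
        if p not in ignore and baseline.get(p) != current.get(p)
    )
```

A typical cycle would take a baseline right after an approved deployment, take a fresh snapshot on a schedule, and alert on whatever `changed_files` returns.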
Yet another requirement concerned internal vulnerability scanning and penetration testing, both always conducted in the production rather than the testing infrastructure, so that the results don't deviate dramatically from what is actually live. However, there was a serious risk of overloading the system and degrading performance, so everything had to be done with great care.
Before the first internal scan, you need to configure scanning profiles very carefully and scan your testing environment multiple times to check the profiles and see whether any deviations from the normal operation of services show up. The calculations part wasn't easier. The same was true for penetration testing: you need to have everything on the table and everyone on the same page before the testers start breaking your production environment from the outside. During the testing, we had to keep an eye on the monitoring systems to detect any deviations.
Internal firewall deployment was a particular nightmare for us. The platform architecture required firewalling all connections on each server with a 'deny everything unless explicitly permitted' policy. We had to deploy a centralized management engine capable of configuring a firewall in line with a component's role, bearing in mind that each server may have several roles. Surely, we had to develop a custom solution. God only knows how much time and nerves it took us to configure policies for each role! Moreover, deployment in production was like a first date: anything could go wrong at any time. So we did everything step by step, role by role, fussing over every single piece.
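The role-to-policy logic can be illustrated with a short Python sketch. It is not our custom solution: the `ROLE_RULES` table, the ports, the subnets and the `rules_for` function are hypothetical, and iptables is used here only as a familiar output format. The sketch merges the allow-lists of every role assigned to a server and renders them as rules, with the default-drop policy placed last so that applying the list in order does not cut off the session that is applying it.

```python
# Hypothetical per-role allow-lists: (protocol, destination port, source subnet).
ROLE_RULES = {
    "common": [("tcp", 22, "10.0.1.0/24")],
    "api": [("tcp", 443, "0.0.0.0/0")],
    "compute": [("tcp", 16509, "10.0.0.0/24"), ("tcp", 22, "10.0.1.0/24")],
}


def rules_for(roles, role_rules):
    """Render inbound rules for a server that carries several roles."""
    allowed = set()
    for role in roles:
        allowed.update(role_rules[role])  # the set removes duplicate entries

    commands = [
        # Loopback traffic and replies to established connections stay open.
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for proto, port, source in sorted(allowed):
        commands.append(
            f"iptables -A INPUT -p {proto} -s {source} --dport {port} -j ACCEPT"
        )
    # Everything that was not explicitly permitted above is dropped.
    commands.append("iptables -P INPUT DROP")
    return commands


if __name__ == "__main__":
    for command in rules_for(["common", "compute"], ROLE_RULES):
        print(command)
```

The function only builds text and applies nothing, which suits a role-by-role rollout: the generated rules for one role can be reviewed before they get anywhere near production.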
And so on. In general, as I've already mentioned, all requirements are clear and straightforward until the very moment you start implementing them in a real infrastructure. And even then you can't see all the pitfalls at once. As a first step, you should clearly draw the border between the cloud provider's and the customer's areas of responsibility, which helps build the list of requirements to comply with. Easy to say, hard to do. Only when that's done can you settle down to work. In practice, most of the requirements concern both the site and the customer. For example, both the cloud and the customer's virtual infrastructure must have firewalls. The difference lies in who is responsible and how it is implemented.
All’s well that ends well
Ultimately, the project took a year to complete, with a total of 470 tickets raised and closed in Jira. We had our auditor on speed dial and asked him millions of questions. He explained things and shared global best practices. Tadaaam! Our KVM-based cloud is certified at last, and a few customers are already willing to hop on and process their payment data there.
Like any other thorough audit, this one benefited the entire company, as it covered many corporate processes and brought many other things in line with global best practices in security. Sure, some business units now have more work to do to maintain annual PCI DSS compliance, but in general things are much better now.