Posts Tagged ‘blocking and tackling’

PayPal Outage Points To CIO Failure

Wednesday, September 2nd, 2009
Paypal's CIO Hasn't Been Doing His Job Correctly

The basic job of a CIO is to ensure that a company’s IT infrastructure operates smoothly and allows the company to conduct business. On Monday, August 3, 2009, PayPal’s CIO failed at this most basic of jobs.

A quick check of PayPal’s senior management structure reveals that they don’t have a CIO position (which in and of itself is rather amazing), but Ryan D. Downs is their Senior Vice President, Worldwide Operations, and so he’s their de facto CIO. What went wrong, Ryan?

The Facts Behind The Failure

On Monday, August 3rd, PayPal experienced a worldwide outage that affected all of their customer-facing systems. The effect of this outage was that millions of PayPal’s customers who rely on them to approve and complete financial transactions were unable to do so. This was a long outage: it started at 1:30 pm EST and lasted until at least 6:30 pm EST.

PayPal is attributing this outage to “internal” issues.

PayPal is a huge business. In the most recent quarter, PayPal handled $16.7B in customer online commerce transactions. In the past, the company has stated that they normally handle $2,000 in online transactions every second. Just in case you are doing the math, this means that a five-hour outage prevented at least $36M worth of business from happening.
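For anyone who does want to do the math, here is a minimal sketch of the estimate. The dollar figures and outage times come from the paragraph above; the 91-day quarter used for the cross-check is my own assumption, and the function names are just for illustration.

```python
# Back-of-the-envelope estimate of the business blocked by the outage.
DOLLARS_PER_SECOND = 2_000      # PayPal's stated normal transaction rate
OUTAGE_HOURS = 5                # 1:30 pm to 6:30 pm EST, the minimum reported
SECONDS_PER_HOUR = 3_600


def blocked_volume(dollars_per_second: int, hours: int) -> int:
    """Dollar volume that would normally flow during an outage of this length."""
    return dollars_per_second * hours * SECONDS_PER_HOUR


estimate = blocked_volume(DOLLARS_PER_SECOND, OUTAGE_HOURS)
print(f"${estimate:,}")  # prints "$36,000,000"

# Cross-check: does $16.7B per quarter agree with roughly $2,000 per second?
# ASSUMPTION: a 91-day quarter with transactions spread evenly around the clock.
QUARTER_VOLUME = 16_700_000_000
QUARTER_SECONDS = 91 * 24 * SECONDS_PER_HOUR
implied_rate = QUARTER_VOLUME / QUARTER_SECONDS
print(f"Implied rate: ${implied_rate:,.0f} per second")
```

The implied rate comes out a little above $2,000 per second, so the two figures the company has quoted are consistent with each other and the $36M number is, if anything, slightly conservative.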

What The CIO Did Wrong

I have no magic insights into what went wrong at PayPal, but it’s pretty easy to make a guess. Back in 2005, customers got shut out of PayPal for about 5 days when a software update went very, very wrong. I’m willing to bet that some sort of update process got away from them once again. This is just sloppy IT work.

This is exactly the type of basic “blocking & tackling” that CIOs have to take care of as part of building a solid IT foundation. Clearly this has not been done at PayPal.

The reason that this is such a scandal is that it’s happened at PayPal before. Once a problem is known, the CIO needs to step in and make sure that it will never happen again. We’re not just talking about establishing a fail-safe update process, but also making whatever changes are needed to the PayPal infrastructure in order to make sure that problems like this can’t ripple throughout the system.

Additionally, creating a process for rolling back changes is critical. If a bad change slips through the system and starts to go into production, you need to have the ability to get the system back to the way that it used to be.
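To make the rollback idea concrete, here is a minimal sketch of what such a process might look like in code. This is not how PayPal does it (I have no idea how they do it); the function name, the `activate` hook, and the `is_healthy` check are all hypothetical stand-ins for whatever deployment tooling a shop actually uses.

```python
from typing import Callable, List


def deploy_with_rollback(
    current_version: str,
    new_version: str,
    activate: Callable[[str], None],   # hypothetical hook that makes a version live
    is_healthy: Callable[[], bool],    # hypothetical post-deployment health check
) -> str:
    """Activate new_version, reverting to current_version if anything goes wrong.

    Returns the version that is live when the function finishes.
    """
    try:
        activate(new_version)
        if is_healthy():
            return new_version
    except Exception:
        # A failed activation is treated the same way as a failed health check.
        pass
    # Get the system back to the way that it used to be.
    activate(current_version)
    return current_version


# Usage: simulate an update that fails its health check.
history: List[str] = []
live = deploy_with_rollback(
    "v1", "v2", activate=history.append, is_healthy=lambda: False
)
print(live, history)  # prints "v1 ['v2', 'v1']"
```

The point of the sketch is the shape of the process: the known-good version is recorded before the change goes out, and the path back to it is exercised automatically rather than improvised in the middle of an outage.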

Final Thoughts

Major outages like this reflect badly on all CIOs. There is no reason that an outage like this should be allowed to happen, especially since PayPal has had problems like this in the past. PayPal can’t claim that they didn’t have enough funding to prevent this problem – they are the fastest growing part of the eBay corporation.

In the end it all comes down to planning. Finding the time to gather the right people to run through “what if” scenarios and then following through with the recommendations that come out of these meetings is what every CIO needs to do. If Ryan takes the time to do this, then he will have found a way to apply IT to enable the rest of the company to grow quicker, move faster, and do more.

What We’ll Be Talking About Next Time

Hewlett-Packard is a huge IT products and services company that lives and dies by the actions of its sales teams. Making sure that the sales teams get paid should be a simple task, right? Think again…

Citi Shows How NOT To Run An IT Department

Monday, January 5th, 2009
Citigroup's IT department is NOT doing a good job of keeping basic applications up and running.

The news is always filled with IT departments that are winning awards for being innovative, reducing costs, or saving the day. That’s why a recent article in the Wall Street Journal about how Citigroup’s IT department is blowing it was so interesting…

The article was titled “Computer Glitch Slows Citi” (if you worked in Citi’s IT department, you’d know that this couldn’t be a good thing). It turns out that the retail bank part of the huge Citigroup corporation had computer problems this past Tuesday. These computer problems ended up leaving lots of customers and employees in a bind – they couldn’t access information about bank accounts and mortgages.

It’s bad enough to have problems like this; however, this problem lingered until Wednesday morning. Now in all fairness to Citi, it appears as though customers were still able to deposit, withdraw, and transfer funds during this period.

In many other businesses, this type of outage would be no big deal. However, when you are one of the largest consumer banks in the country, this is most definitely a no-no.

Just to make a bad thing worse, in the article Citigroup employees stated that their company seems especially plagued by crashes.

So who’s to blame for this IT mess? It turns out that Citigroup Chief Executive Vikram Pandit has stepped up. He has promised to upgrade and integrate the company’s computer systems. This effort is going to take several years and will probably end up costing billions of dollars.

What’s missing from all of this is any word from Kevin Kessinger, who is Citi’s Chief Operations & Technology Officer. This is a fancy title for someone who has CIO responsibilities. At the end of the day, this mess is Kevin’s responsibility.

The ability to keep a firm’s basic applications up and running is so fundamental that we often refer to it as “blocking and tackling”. This is an American football term for the fundamentals of the game, and it simply means that you need to get the basics right before you spend any time focusing on anything fancier.

There is NO WAY that the Citi IT department should be working on anything else if their apps are not staying up. I’m sure that many of us spend time throwing rocks at our own IT departments for not being innovative enough; however, hopefully most of us do a good job of taking care of the basics.

Kevin has been in his job since 2005 and so he really does not have any excuse for not having already taken care of this problem. It’s easy to throw stones at Kevin for not doing his job. However, perhaps it would be more valuable to take a look at what he should be doing right now to fix this issue:

  • Make App Stability THE Top Priority: Communicating to the entire IT department that keeping apps up and running is job #1 would send a clear message to everyone that this is what they need to be working on.
  • Appoint A Stability Czar: Kevin needs to pick out an up-and-coming IT manager and put him / her in charge of working across the IT department to make sure that all of the apps become stable. This could be a career maker / breaker for this individual.
  • Change How Apps Are Developed: The current problems are caused by how the current set of apps was developed. Clearly, a new set of design procedures and / or tests needs to be put into place.

Kevin probably needs to do a lot more than just these basic steps, but this is how he needs to start. The CIO is responsible for how the digital side of the company operates. Let’s see if Kevin ends up doing the right thing…

Have you ever had a problem where a production application was not staying up? Who was responsible for fixing this problem? How did they go about fixing the problem? In the end, were they able to fix the problem(s)? Leave me a comment and let me know what you are thinking.