We've had quite a bit of talk about NSA surveillance and the unconstitutionality of its mass data collection program. Many have stated quite simply that this collection in and of itself is a violation, since the concept of a warrant requires a specific target - and how can any general collection also be targeted?
I've had quite a bit of back and forth with Armando and others on Twitter in recent days about this, but I want to specifically address the biggest problem with splitting the database and/or letting it remain with its source organizations instead of being moved into a government collection as it is now. A number of these options have recently been suggested by the President, as has storing it at some offsite facility to ensure that unauthorized and random mining of the data does not occur.
There are very good reasons, from a technical standpoint, why it's better and more efficient to perform this organized merge of all the sets of data - which come from multiple databases running in multiple formats on different servers - into one place.
Details over the flip.
If you want to enter one question - say, a phone number - and get back a single logical answer when that user may be on any of a variety of providers, and if you want to go across a "hop" and see what numbers each of the people he dialed has called or received calls from - which, again, may be spread across a variety of service providers - you need access to numbers from all those providers, ultimately arranged and formatted consistently.
You could have a situation where one system records its customer records one way and its contact records in a different table, but another system where this information is combined or duplicated. Fortunately these differences are likely to remain mostly static until the system goes through an upgrade or reformat, so once you've figured out the map of where the data is, the easiest thing to do is copy all the data to a common database where everything is in the same format.
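A minimal sketch of that copy-into-a-common-format idea, using in-memory SQLite with made-up schemas and numbers purely for illustration:

```python
# Hypothetical schemas: two providers store call records differently;
# normalizing them into one common table makes a single query possible.
import sqlite3

db = sqlite3.connect(":memory:")

# Provider A: one combined call-record table.
db.execute("CREATE TABLE provider_a (caller TEXT, callee TEXT, ts TEXT)")
db.execute("INSERT INTO provider_a VALUES ('555-0001', '555-0002', '2014-01-01')")

# Provider B: customers and their contacts split across two tables.
db.execute("CREATE TABLE b_customers (cust_id INTEGER, phone TEXT)")
db.execute("CREATE TABLE b_calls (cust_id INTEGER, dialed TEXT, ts TEXT)")
db.execute("INSERT INTO b_customers VALUES (1, '555-0003')")
db.execute("INSERT INTO b_calls VALUES (1, '555-0001', '2014-01-02')")

# Common format: every record becomes (caller, callee, ts).
db.execute("CREATE TABLE merged (caller TEXT, callee TEXT, ts TEXT)")
db.execute("INSERT INTO merged SELECT caller, callee, ts FROM provider_a")
db.execute("""INSERT INTO merged
              SELECT c.phone, k.dialed, k.ts
              FROM b_customers c JOIN b_calls k ON c.cust_id = k.cust_id""")

# One query now answers "who has 555-0001 talked to?" across both providers.
rows = db.execute(
    "SELECT caller, callee FROM merged WHERE caller='555-0001' OR callee='555-0001'"
).fetchall()
print(rows)
```

Once the mapping work is done up front, every later question is a single query against one schema.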
But if you can't or don't do that, this is what you have to deal with.
http://www.unityjdbc.com/...
Multiple Databases on Multiple Servers from Different Vendors
The most challenging case is when the databases are on different servers and some of the servers run different database software. For example, the customers database may be hosted on machine X on Oracle, and the orders database may be hosted on machine Y with Microsoft SQL Server. Even if both databases are hosted on machine X but one is on Oracle and the other on Microsoft SQL Server, the problem is the same: somehow the information in these databases must be shared across the different platforms. Many commercial databases support this feature using some form of federation, integration components, or table linking (e.g. IBM, Oracle, Microsoft), but support in the open-source databases (HSQL, MySQL, PostgreSQL) is limited.
There are various techniques to handling this problem:
1. Table Linking and Federation - link tables from one source into another for querying
2. Custom Code - write code and multiple queries to manually combine the data
3. Data Warehousing/ETL - extract, transform, and load the data into another source
4. Mediation Software - write one query that is translated by a mediator to extract the data required
Let's look at the options one by one.
1. Table Linking and Federation
Several database products, such as Microsoft Access, Microsoft SQL Server, Sybase, IBM, and Oracle, have the ability to "link" a table from another source into the current server instance. Then you can write queries as if that table was actually stored on the particular server. This is a very useful feature that will often allow you to avoid loading data into another source while still being able to execute one query. Unfortunately, not all database vendors, especially the open source products MySQL and PostgreSQL, support table linking.
Let's assume, just for the sake of discussion, that every telecom vendor happens to store their data in a linkable database. In order for the NSA to run one of their queries and then start checking the hops, all of them would have to be linked all the time, or else something would be missing and your query would either fail or be incomplete.
For various reasons, even if technically possible, this is not a good idea, as various competitors would then be sharing their data with each other in real time. That has exploitation potential at everything from a corporate espionage level to an antitrust level. Not good.
2. Custom Code
The most straightforward approach is to manually combine the data from the sources. For example, you could write a program that uses a database connection interface (ODBC, JDBC, etc.) to pull the data from each source into the program and then write your own code to perform the query processing to get the merged result. Although this works for small cases, the whole benefit of the relational model and SQL is that the declarative nature of the language makes queries easier to write and optimize. How likely are you to write code with no errors? How about with good performance? How general will the code be or will you have to write individual solutions for each query? There is a reason why the query processing and optimization system is the most complicated part of a database - it is challenging to get it correct and efficient. In all but the simplest cases, you will spend way more time than necessary by doing it yourself.
Another technique is to write code that extracts data from one database and migrates it to another. This is basically rolling your own ETL software (discussed in the next section). Once again, you will likely invest a lot of effort to produce a non-general system that does not have the required efficiency.
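Here's a rough sketch (all data, schemas, and numbers hypothetical) of what that custom-code approach looks like: one connection per provider, with the merging done in application code rather than by any database. Two in-memory SQLite databases stand in for two vendors' servers.

```python
# "Custom code" federation sketch: query every provider separately,
# then merge and deduplicate the results yourself.
import sqlite3

def make_provider(records):
    # Each provider is a separate connection, simulating a separate server.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calls (caller TEXT, callee TEXT)")
    conn.executemany("INSERT INTO calls VALUES (?, ?)", records)
    return conn

providers = [
    make_provider([("555-0001", "555-0002")]),
    make_provider([("555-0003", "555-0001")]),
]

def contacts_of(number):
    # One round trip per provider; the application, not the database,
    # does the merging -- exactly the work a warehouse would do once.
    merged = set()
    for conn in providers:
        for caller, callee in conn.execute(
                "SELECT caller, callee FROM calls WHERE caller=? OR callee=?",
                (number, number)):
            merged.add(callee if caller == number else caller)
    return merged

print(contacts_of("555-0001"))  # contacts found across both providers
```

Even this toy version has to hard-code knowledge of every source; a real implementation would also have to handle each provider's differing schema, formats, and outages.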
This is the option I most envision when people state that the government can't touch any data until they have a specific number or datapoint, get a warrant, and then start their query. Only in this case, they would have to query each and every database individually - saving the results from each while doing some kind of format standardization - then run the next query on the next database, and the next, and the next. AND if they happen to want to search the next hop in the chain, they would have to start over and repeat this entire process from the top. Then again on the next hop.
Imagine if you were searching for something and you had to individually check Google, Bing, Yahoo, Amazon, and eBay for that item. Then take all your results, which could be a considerable list from each, and re-enter each result into each of those search engines again just to get through one "hop" of answers.
Does that sound like fun? No, because it isn't. Nor is it fun, or effective, to write a program to do that.
From a systems design aspect, this is entirely messed up. It would be horrible. Each of these queries would consume processing time on each of these servers, competing against their normal use - including customers' use, and each company trying to look at the phone metadata for its own purposes. It would be slow as molasses.
Sorry, I'm getting a timeout on my server because the NSA is trying to look up Aunt Jenny's numbers again. Remember how bad the ObamaCare website rollout went? This could be worse.
Also, and this is true of option #1 as well, these searches would only have access to the data currently on each of these systems. If a company only retains this data for 90 days or so, that's all you would get from that source, and it's very likely that each provider would have slightly different retention periods.
Next we have what the NSA actually does.
3. Data Warehousing/ETL
A common solution to handle multiple data sources is to produce a single data warehouse that contains data from multiple sources. This data has been extracted from its original source, translated, summarized, and aggregated into a form suitable for efficient querying and reporting, and loaded into another database. The major benefit of data warehousing is an organization now has an integrated repository suitable for reporting that contains information from many locations. This article cannot go into depth on data warehousing and the steps involved, but some good references can be found here: Data Warehouse Introduction, Course Notes. If there is a significant amount of data to be analyzed (> 1 TB), then data warehousing will guarantee the best performance as the data is loaded once and placed in a system optimized for query performance.
And that is why they do it that way. Here are the drawbacks.
The major issue with data warehousing is that it represents a significant infrastructure investment that is time-consuming to produce. Designing a data warehouse is a complicated procedure that requires buy-in from many organization stakeholders. It may take months or years to produce a data warehouse and involve outside consultants and significant personnel time. The tools to perform ETL functions and data warehouse management are expensive. Once produced, a data warehouse must be maintained (personnel, licensing costs, maintenance, etc.). A data warehouse is not an investment to be considered lightly.
In other words, it's a major investment of time and resources. However, this investment is one-time only; once the system is in place it would be far more efficient and accurate than the first two options, where you are essentially building the warehouse ON THE FLY in your code. Retention issues are no longer a problem.
The other major issue with a data warehouse is that it needs to be periodically refreshed with current updates. According to FISC rulings released by Glenn Greenwald, this occurs for the metadata program approximately every 90 days.
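The extract-transform-load flow described above can be sketched minimally - the formats and field names here are invented for illustration:

```python
# Minimal ETL sketch: extract from each source, transform into the
# warehouse's common record shape, load once, then query only the
# warehouse at analysis time -- never the sources.
import sqlite3

# "Extract": rows as each provider exports them (different field names).
source_a = [{"from": "555-0001", "to": "555-0002", "when": "2014-01-01"}]
source_b = [{"dialed": "555-0001", "subscriber": "555-0003", "date": "2014-01-02"}]

# "Transform": one function per source maps its format to (caller, callee, ts).
def transform_a(row):
    return (row["from"], row["to"], row["when"])

def transform_b(row):
    return (row["subscriber"], row["dialed"], row["date"])

# "Load": one warehouse table in one common format.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE calls (caller TEXT, callee TEXT, ts TEXT)")
warehouse.executemany("INSERT INTO calls VALUES (?, ?, ?)",
                      [transform_a(r) for r in source_a] +
                      [transform_b(r) for r in source_b])

# Analysis now hits only the warehouse, once per query.
rows = warehouse.execute(
    "SELECT callee FROM calls WHERE caller='555-0001'").fetchall()
print(rows)
```

The periodic-refresh issue is just this load step re-run on a schedule - which is why the warehouse is only as current as its last refresh.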
And yet there is another way.
4. Mediation Software
The idea of mediation software is to leave the data where it resides and only extract the required data on demand. The user writes one query submitted to the mediation software that is responsible for optimizing the query to determine an efficient execution plan, translating each query to extract the relevant data from each source, and merging the results from sources into a single answer. From the user perspective, one query produces one answer from a single "virtual" database. Mediator systems have a long history in the database research community. There are also several commercial products that provide this functionality.
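A toy illustration of the mediator idea (all class and field names here are hypothetical, not any real product's API): one logical query fans out to per-source adapters, each speaking its own backend's "format", and the partial results merge into a single answer while the data stays at the sources.

```python
# Mediation sketch: the caller writes ONE logical query; adapters
# translate it for each backend and the mediator merges the results.

class ListAdapter:
    """Backend storing (caller, callee) tuples in a list."""
    def __init__(self, records):
        self.records = records
    def lookup(self, number):
        return {c if a == number else a
                for a, c in self.records if number in (a, c)}

class DictAdapter:
    """Backend storing an adjacency dict -- a different 'format'."""
    def __init__(self, adjacency):
        self.adjacency = adjacency
    def lookup(self, number):
        return set(self.adjacency.get(number, ()))

class Mediator:
    def __init__(self, adapters):
        self.adapters = adapters
    def contacts(self, number):
        # One logical query fans out to every source; results merge here.
        out = set()
        for adapter in self.adapters:
            out |= adapter.lookup(number)
        return out

mediator = Mediator([
    ListAdapter([("555-0001", "555-0002")]),
    DictAdapter({"555-0001": ["555-0003"]}),
])
print(mediator.contacts("555-0001"))
```

Note that the data never leaves the adapters until query time - which is both the privacy appeal and, at scale, the performance cost.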
When I worked for the California DCA we used a solution similar to this, as well as a 24-hour data warehouse, to bring all of their licensing data onto the web and merge it with their complaints data without giving web queries direct access to the source databases.
There are advantages to this.
No new systems are required as they leave the data on the data sources (no data movement).
No data source modifications are required. Avoids security issues related to federated or linking servers at the database server level.
Data is always up-to-date and real-time as it comes directly from the source.
No database vendor lock-in compared to using linked servers.
Query translation allowing portable SQL and code. Can migrate database servers and application is not affected.
Rapid deployment and installation.
Very good performance and scalability. Can create multiple federation servers separate from the database servers so data sources are not overloaded.
Easy to use. Integration happens in a standard SQL query.
So that sounds pretty good, but there are disadvantages too.
Lower performance than data warehousing for very large data sets as the data must be migrated from each system to answer a query. This issue is present in all approaches that do not centralize the data as large amounts of data may need to be extracted and combined from the sources.
May not be required for simple situations that can be solved by linking a few tables.
As we all know, we're dealing with a potentially large amount of data here - so large that it very well might make this option, as well as options 1 and 2, completely nonviable technically. Or to put it another way, in my technical and professional opinion as a Database Administrator and Systems Analyst, with the amount of information we're talking about here -
ONLY OPTION 3 IS LIKELY TO WORK AT ALL.
In general, the technical considerations that lead to creating a merged database - one that first holds and stores the data before any queries are made of it - would apply not just to the phone metadata, but to any aggregate information the NSA would like to be able to search, or "roll back" through time.
The question of whether simply merging the data into a central database is the same as "collecting it" in a query into a usable set of results is a central one. To many it seems a no-brainer, but I don't think it's really that simple. Like a tree falling in the forest, if you have data "collected" in a sealed black box that you can't look at until you have a "seed" number to start with, you don't really have it any more than police "have" your private data on a floppy they don't have a drive for.
I bring all this up because looking at the issue constitutionally and legally without understanding why the technology is what it is, and what its limitations are, is not seeing the entire picture. Moving the data to multiple servers, as the President has suggested, would create a fairly large technical Gordian knot of complication, while at the same time not really solving the underlying problem of maintaining secure and auditable access controls to the data. If anything, several of the alternatives listed above would make the security and privacy situation far worse than it already is.
Also, again, even on separate servers some of that data is still foreign, so it shouldn't really require a warrant to access it. A better argument than warrant requirements would be to say that any program of this type violates the Electronic Communications Privacy Act, which prohibits corporations from sharing generalized data with the government.
The best and most comprehensive outline of how the NSA functions that I've seen or read to date is the detailed PCLOB analysis (PDF), in which they argue that the metadata program is unlawful and should be discontinued, as they don't see that it's specifically authorized under Section 215 of the Patriot Act.
To be sure, detailed rules currently in place limit the NSA’s use of the telephone records it collects. These rules offer many valuable safeguards designed to curb the intrusiveness of the program. But in our view, they cannot fully ameliorate the implications for privacy, speech, and association that follow from the government’s ongoing collection of virtually all telephone records of every American.
Any governmental program that entails such costs requires a strong showing of efficacy. We do not believe the NSA’s telephone records program conducted under Section 215 meets that standard.
For many technical reasons I disagree with them, and here's why. The collection of this data is not done in a way that allows it to be viewed "willy-nilly"; an NSA analyst must first submit a "Reasonable, Articulable Suspicion" (or "RAS") that a specific number has some connection to terrorism. Also, the fact is that for foreign surveillance they don't need a warrant, and they don't even need Section 215. The NSA already has the authority to do all of this on foreign soil; the only real protection or limit in place is for surveillance of American citizens and/or calls made in the U.S.
The thing here is that they generally know, when implementing the initial RAS, that they're dealing with a foreign number; what they don't know is whether one of those "hops" will lead them to an American phone call.
The other problem the NSA has is that even if you filter by area code, there are foreign nationals on U.S. soil who may be making calls on their internationally based cell phones, and there may be U.S. persons making calls from overseas in the opposite manner. The limitations of the records mean that they don't always have location data, and they don't know whether either of these conditions (or some other variation) is the case until they drill down into the details and find out for sure.
Years before Snowden and his revelations, the standing procedure for this, as whistleblower Russell Tice described back in 2006, was that any hint of phone data or other signals data originating in the U.S. meant that data had to be deleted. (As I will describe further, this is mostly still the case today.) Back then anything else was against regulations and flatly against the law, until - under the Bush Administration - procedures changed and Tice began to speak out.
Now, during the Bush Administration they weren't even halfway pretending they weren't surveilling Americans. They were. Tice describes how NSA systems were used to datamine for information on U.S. corporations, members of the U.S. Congress - including not-even-yet-Senator Obama - high-ranking members of the military such as General Petraeus, and even members of the Supreme Court.
Tice: I held Justice Alito's paperwork - the numbers associated - that someone has used to spy on Judge Alito... These numbers were being done at night-time on the sly. A high level NSA Official told me that it was being directed by the Vice-President's (Cheney) Office.
That's what Tice talked about back then, and if you want to talk worst nightmare, we've already been there. That's what Bush brought us. But following these revelations, Congress did act with the FISA Amendments Act (FAA) in 2008 and brought the Court back into the picture.
Now, if they happen to hit an American number on one of their "hops" they again, as before, have to delete the information. But there are now a few FISC-implemented exceptions to that rule, and a set of "minimization" procedures that have been put in place, first openly reported by Glenn Greenwald based on the FISC memo he published.
http://www.theguardian.com/...
A communication identified as a domestic communication will be destroyed upon recognition unless the Director (or Acting Director) of NSA specifically determines, in writing, that: (S)
(1) the communication is reasonably believed to contain significant foreign intelligence information. Such communication may be provided to the Federal Bureau of Investigation (FBI) (including United States person identities) for possible dissemination by the FBI in accordance with its minimization procedures; (S)
(2) the communication does not contain foreign intelligence information but is reasonably believed to contain evidence of a crime that has been, is being, or is about to be committed. Such communication may be disseminated (including United States person identities) to appropriate Federal law enforcement authorities, in accordance with 50 U.S.C. 1806(b) and 1825(c), Executive Order No. 12333, and, where applicable, the reporting procedures set out in the August 1995 "Memorandum of Understanding: Reporting of Information Concerning Federal Crimes," or any successor document. Such communications may be retained by NSA for a reasonable period of time, not to exceed six months unless extended in writing by the Attorney General, to permit law enforcement agencies to determine whether access to original recordings of such is required for law enforcement purposes; (S)
(3) the communication is reasonably believed to contain technical data base information, as defined in Section or information necessary to understand or assess a communications security vulnerability. Such communication may be provided to the FBI and/or disseminated to other elements of the United States Government. Such communications may be retained for a period sufficient to allow a thorough exploitation and to permit access to data that are, or are reasonably believed likely to become, relevant to a current or future foreign intelligence requirement. Sufficient duration may vary with the nature of the exploitation.
Or to put it another way, they have to delete ALL U.S. communications unless they find evidence that the data involves a person who is a spy involved in foreign intelligence, an active or impending crime, or a black-hat hacker. Generally speaking I don't have a moral objection to these exceptions, as long as these are the only exceptions.
Some would argue they would still need a warrant for this, but again - these are the rules the Court gave them to follow, in a situation that is essentially an exigent circumstance. This wasn't the foreign number they were looking for, but while doing analysis of that number and its contacts they became a witness to a crime in progress - just as a police officer hearing someone cry for help doesn't need a warrant to break down the door. They just go in, and we want them to go in under those circumstances.
If we wanted to create a warrant requirement that would allow for the above scenario while tracking hops in the call chain, then every foreign number placed on a RAS query would require a warrant, just in case something like this comes up. But IMO that's going quite a bit above and beyond what the Constitution requires. Some might disagree, but that's the point of a debate.
What I will say again about the PCLOB report is that they indicate that, as of their analysis, since the implementation of the FAA and the reintroduction of the FISC and its minimization procedures there have been no (further) abuses of the system like those described by Russell Tice. Although they clearly see the danger of it occurring again, they view it as only a remote possibility.
Beyond such individual privacy intrusions, permitting the government to routinely collect the calling records of the entire nation fundamentally shifts the balance of power between the state and its citizens. With its powers of compulsion and criminal prosecution, the government poses unique threats to privacy when it collects data on its own citizens. Government collection of personal information on such a massive scale also courts the
ever-present danger of “mission creep.” An even more compelling danger is that personal information collected by the government will be misused to harass, blackmail, or intimidate, or to single out for scrutiny particular individuals or groups. To be clear, the Board has seen no evidence suggesting that anything of the sort is occurring at the NSA and the agency’s incidents of non-compliance with the rules approved by the FISC have generally involved unintentional misuse. Yet, while the danger of abuse may seem remote, given historical abuse of personal information by the government during the twentieth century, the risk is more than merely theoretical.
The reason that the PCLOB knows with confidence that the system is not being abused is that they periodically audit the RAS requests and verify that each was legitimate. Usually only about 300 RAS "seed" queries are entered every year. Those that show up as unauthorized have generally been honest mistakes and errors, not part of any PLOT to gain the personal secrets of Americans. Well, not anymore.
And I think they could be right that known, authorized NSA analysts - with the minimization and audit procedures currently enforced and monitored by the FISA Court, which has been very grumpy when it's found any discrepancies - aren't likely to repeat the kind of abuse that Tice described. However, as I've said before, a sysadmin like Snowden would probably have the access and authority to bypass the RAS requirement and its subsequent audit trail, so it's not like I'm suggesting there's no specific reason to be concerned. There still is.
They also feel that, despite the Administration's claims that NSA metadata has aided in about 54 cases, they could count only one case where it helped even minimally and the suspect wasn't already known.
Based on the information provided to the Board, including classified briefings and documentation, we have not identified a single instance involving a threat to the United States in which the program made a concrete difference in the outcome of a counterterrorism investigation. Moreover, we are aware of no instance in which the program directly contributed to the discovery of a previously unknown terrorist plot or the disruption of a terrorist attack.
And we believe that in only one instance over the past seven years has the program arguably contributed to the identification of an unknown terrorism suspect.
So in their judgment the program's risk is too high - although they admit new procedures have mitigated much of that risk - and its benefits too small. However, again I think they miss another important factor. Their focus was entirely on how the 215 program did or didn't aid the FBI. The infrastructure of this program also supports the gathering of entirely foreign intelligence, which is fed to the CIA. It is made repeatedly clear from the FISA documents and even the PCLOB data that this data is somewhat co-mingled. Foreign phone calls (or internet pages and emails) are routed through U.S. hubs all the time. The biggest problem the NSA has is sorting foreign from domestic sources once this happens, and the risk of calling for a blanket shutoff of all these wide-scope programs - as the PCLOB report does - is that you may begin to lose some of the foreign data they should be looking for in the effort to protect the domestic info.
Things have massively improved since Tice originally came out on this subject, but I still believe it can get better. I think there are probably technical solutions to all of these issues that can tighten the safeguards, improve the filtering, and allow foreign surveillance to continue without getting into Angela Merkel's cellphone or who General Petraeus is banging this week - but I don't think that wild-eyed panic, or even relatively justified outrage, is going to help all that much.
It's going to take a little more understanding of these issues, thought, and planning than that.
Vyan
11:27 AM PT: I didn't expect this view to be popular, but that's the way it goes. One point I didn't think of in my initial draft: the recurring problem I see is that the merged data warehouse has an intermingling of foreign and domestic data. I think it would be a great improvement to perform that filtering with the aid of the source clients, so that U.S.-related data is withheld and not co-mingled with foreign data in the merged collection until after a certain number, IP, or email address shows up in a "hop". You could consider this Option #5, a mixture of Option #3 (the status quo) and Option #4. Whether the NSA should then be required to go back to the FISA Court for a more specific warrant to re-merge that specific U.S. customer's data, or be allowed to use exigent circumstances to view it, as opposed to scrapping the entire system as the PCLOB suggests, would be a good topic for debate.
4:56 PM PT: The Option #5 of pre-filtering out the domestic information prior to its being transferred to the data warehouse would completely eliminate the warrant problem and the issue of unauthorized off-hours snooping that Tice mentioned was occurring back in 2006, when the NSA was spying on journalists critical of the Bush Administration like James Risen and Christiane Amanpour. But it would also re-introduce the retention consistency problems I noted originally with Options 1 & 2. Not insurmountable, and I think a workable solution both technically and constitutionally.
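A crude sketch of what that pre-filtering might look like (the classification rule and all numbers here are deliberately naive placeholders - as noted above, deciding what "looks domestic" is the genuinely hard part):

```python
# "Option #5" sketch: filter out records whose endpoints both look
# domestic BEFORE loading into the warehouse, so U.S.-to-U.S. data
# never enters the merged collection.

def looks_domestic(number):
    # Illustrative test only: real classification is much harder, since
    # area codes and international roaming make origin ambiguous.
    return number.startswith("+1")

def prefilter(records):
    kept, withheld = [], []
    for caller, callee in records:
        if looks_domestic(caller) and looks_domestic(callee):
            withheld.append((caller, callee))   # stays with the source
        else:
            kept.append((caller, callee))       # eligible for the warehouse
    return kept, withheld

records = [
    ("+1-202-555-0001", "+1-202-555-0002"),   # domestic-to-domestic: withheld
    ("+44-20-5550-0003", "+1-202-555-0001"),  # foreign leg: kept
]
kept, withheld = prefilter(records)
print(len(kept), len(withheld))  # 1 1
```

The withheld records would remain at the source providers, retrievable only after a hop (and possibly a warrant) justifies re-merging them.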
5:13 PM PT: Here's a blast from the past: Russell Tice on Keith Olbermann in 2009.
5:23 PM PT: Part two of the Tice interview - which is one of the primary reasons I don't have my hair on fire about anything Snowden has told us; we were all already informed about it years ago. Snowden essentially provided the evidence and proof for what Tice had been describing. However, he also provided evidence that the FISA Court was now involved in the process, had re-introduced some limits, and was auditing the NSA to ensure it abided by those limits. And the PCLOB report confirms that they are doing just that, at least so far...
8:03 PM PT: The AP has reported that both the WaPo and WSJ report that NSA phone metadata collection only includes about 20-30% of all U.S. calls because of the expansion of cell phone use.
http://talkingpointsmemo.com/...
The Post said the NSA takes in less than 30 percent of all call data; the Journal said it is about or less than 20 percent. In either case, the figures are far below the amount of phone data collected in 2006, when the government extracted nearly all of U.S. calling records, both newspapers reported. NSA officials intend to press for court authorization to broaden their coverage of cellphone providers to return the government to near-total coverage of Americans' calling data, the newspapers said.
Verizon and AT&T said last December that they would provide figures this year on data requested by the government in law enforcement and intelligence investigations. But the Journal reported last year that several major cellphone entities including Verizon Wireless and T-Mobile were not part of the NSA's bulk metadata collection. It is not clear why cellphone providers would not be covered by the NSA legal authority.
So it seems that if your primary phone is your cell, the NSA isn't tracking you at all, and that isn't going to change until the FISC agrees to issue 90-day collection warrants, similar to the ones revealed by Edward Snowden, for each and every individual cell phone provider. I also found this interesting.
In a related development, the secretive Foreign Intelligence Surveillance Court in Washington on Thursday authorized two major changes in the phone collection program that Obama committed to in January. The court agreed to require judicial approval for each internal NSA search of telephone data for terrorist connections and it will narrow the numbers of American phone users whose records can be scanned during each search, the DNI reported.
If I'm understanding what they mean by "internal," this may indicate that FISA will begin to approve and issue specific warrants prior to further examination when a domestic number pops up in the system, as I wondered about in my Option #5 pondering.
The report also states that the NSA is taking the President's request to change the architecture of the system seriously, but even they don't yet know exactly how to switch from what they have to another configuration that will accomplish their detection goals.
On Friday, Office of the Director of National Intelligence, or DNI, posted a government website appeal to private companies to develop ways for the government to continue its phone record searches without storing a massive inventory of phone data. The posting, on FedBizOpps.gov, said the DNI is "investigating whether existing commercially available capabilities can provide for a new approach to the government's telephony metadata collection program."
The Associated Press reported last month that the DNI is already funding five research teams across the country in an effort to develop an encrypted search technique that could be used by the NSA to securely scan phone databanks held elsewhere.
My thinking is that an existing commercially available solution for this probably doesn't exist - and if it did, the DNI would already have an idea of what it might be. Additionally, the fact that they are funding five research teams to find a secure encrypted solution indicates that they probably don't have one outside of what they're already doing. Which was the first point I've been making throughout this entire diary.