As part of a series focusing on the latest innovations in data quality technology, we recently interviewed the CTO of X88 Software, Tony Rodriguez.
The X88 team are based in the UK and provide software that focuses on three main areas of data management functionality: data profiling/discovery, data integration and data governance.
Data Quality Pro: For the benefit of our readers Tony, can you please describe your background as I believe you’re no stranger to the data quality technology sector?
Tony Rodriguez: My background is Systems Programming, Technical Architecture and Consultancy and I cut my teeth working on parallel large-scale Telecoms billing systems. In the mid-nineties, I joined Evolutionary Technologies International (ETI), one of the ETL pioneers, dealing with the complex challenges of large scale Data Integration projects. This led me to form Avellino Technologies in 1997 solely providing specialist Data Integration consultancy skills on some of the most complex Data Integration projects, everything from Data Warehousing to SAP implementations. We quickly grew to 50 people, winning awards such as the Fast Track 100.
Data Profiling was an emerging technology at the time, with only Evoke’s Axio (now Informatica Data Explorer) providing a weak and expensive solution. In 1999, I designed and built Avellino Discovery. Technologically it was very advanced at the time, and pioneered now commonplace features such as Drilldown, Metaphones and Format Patterns. Discovery became the market leader, and Avellino was acquired by Trillium Software in 2004.
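Format Patterns, mentioned above, remain a staple of profiling tools: each value is abstracted into a pattern (digits become "9", letters become "A") so that the dominant formats in a column, and the outliers, become visible at a glance. A minimal sketch of the idea, with illustrative function names that are not taken from any X88 product:

```python
from collections import Counter

def format_pattern(value: str) -> str:
    """Abstract a value into a format pattern: digits -> 9, letters -> A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # punctuation and spaces are kept literally
    return "".join(out)

def profile_column(values):
    """Count how often each format pattern occurs in a column."""
    return Counter(format_pattern(v) for v in values)

phones = ["01234 567890", "0207-946-0958", "not known"]
print(profile_column(phones))
# rare patterns such as 'AAA AAAAA' flag likely bad values for review
```

In practice the profiler would report pattern frequencies per column, so a telephone-number field dominated by `99999 999999` immediately exposes the handful of rows that do not match.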
I joined Informatica in 2005 in an advisory capacity where once again I was working with customers on real projects and was shocked to see they were still facing the same challenges as before, albeit with a host of new drivers such as MDM and Compliance initiatives. Data volumes were increasing exponentially, yet existing products were stuck with their old technology bases, and still only solving a small part of the problem, namely that of Data Profiling. It amazed me that the solutions had not progressed into addressing the whole Data Integration design problem.
I formed X88 Software in 2007 where I am CTO and lead developer/designer.
Data Quality Pro: It’s a number of years since you created Avellino Discovery, do you think the market has changed since then with regards to customer needs?
Tony Rodriguez: Data volumes have increased in many cases beyond the capabilities of current technologies, and economic realities mean that customers are having to do a lot more with a lot less, both in terms of reducing costs and in reducing risk and dependence on other parties for successful implementation.
Data Integration means you are re-purposing existing data for a new task. Data Profiling helps you to understand existing data to ensure you do not fall foul of inconsistencies and format issues whilst providing the often missing business understanding. It tells you what data you have, in the context of how you currently use it, but does not tell you how to make that existing data fit what is typically a very differently structured new target. Critical to any successful implementation is an understanding of the issues surrounding the reconstitution of existing data, i.e. what rules must be applied to transform that data, and relate it to other sources to populate the target. This work has always been done by creating manual mapping or build specifications in the likes of Word or Excel for other people to implement.
This is something of a conundrum for customers – in order to see whether a Data Integration movement will work, they have to build it. There is no mechanism to trial or “prototype” it. Whenever you see a vendor presentation on the benefits of Data Profiling, you will often see how the build phase rework was reduced from a typical 10 times rework to just 3. Why 3? Why not just 1?
The reason is because the people who understand the data and produce the specifications are generally NOT the people who would build the movement using ETL tools. Quite often they are not even in the same company (e.g. offshore in an outsourced development). The interests of cost reduction are being addressed to an extent by using cheaper labour, but perversely they are still building solutions based on poorly assembled, assumptive designs. Hence the risk of late delivery has actually gone up! It is amazing to think that no other major IT development would entertain building anything from an incomplete design, yet somehow Data Integration projects continue to do exactly that.
Data Quality Pro: Can you give our readers an overview of your Pandora product?
Tony Rodriguez: It was clear that what the market really needed was the ability for non-technical data-savvy users to be able to take data from anywhere and apply in simple terms the rules that were necessary to build a new target. A key objective was to take people out of a prescriptive and technical design paradigm typical of metadata driven solutions, and instead allow them to work directly with the data, understanding, shaping and transforming it until they produce a valid target structure. Then the technology could simply work out the most optimal validated design to move data from where they started, to where they ended up. In contrast to simple Data Profiling, this approach would completely remove the divide between the people who understand the data and the business, from those building the Data Integration processes.
This is where the idea for X88 Pandora came from. It allows enterprises to achieve much earlier delivery and higher quality results on all data-dependent projects. In order to do this, it has to embrace functionality from a number of disciplines including Data Profiling, ETL, Databases and Business Intelligence tools, and present it in a manner such that non-technical users can use every capability and work in a collaborative way. It’s very simple to use, but incredibly powerful at the same time, having none of the inherent limitations of, for example, existing Data Profiling and Discovery technologies.
Pandora lets multiple users take data at full volumes from anywhere and apply any number of transformation, validation and quality rules without any technical expertise, to produce new target structures. Regardless of how many steps the user takes, Pandora works out automatically the optimal source-to-target data movement and automatically generates the ETL specification in easy-to-understand terms. It’s very forgiving, accepting whatever data you throw at it, regardless of quality and as a matter of course provides all the capabilities you would expect of simpler Data Profiling products, but takes those to a completely new level of performance and capability.
A key ingredient to Pandora’s technical prowess is that it is built on top of our Panorama database product.
Data Quality Pro: You mentioned that Pandora makes use of the Panorama repository to store data, are there security issues with having so much potentially sensitive data from across the enterprise available in one location?
Tony Rodriguez: X88 Panorama is a self-indexing, self-optimising, administration-free database. Although it supports SQL, it is not relational in nature. It is in fact value-oriented: it only stores each value once, no matter how many times that value is used. It also has a very intelligent structure that essentially means you can think of every value as being automatically indexed. You can ask critical questions such as:
- “show me everywhere in the enterprise where there is any data that resembles a telephone number“
- “where else do I have this product code, in any system“
- “which fields contain the same monetary amount“.
These all result in an instant response regardless of data volumes or numbers of tables and columns. The database is designed for massive volumes, and to provide instant drilldown to data rows.
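One way to picture a value-oriented store is as an inverted index from each distinct value to every location where it occurs, so that "where else does this value appear, in any system?" becomes a single lookup rather than a scan of every table. The toy sketch below illustrates that concept only; it is an assumption for explanatory purposes, not Panorama's actual implementation:

```python
from collections import defaultdict

# value -> set of (table, column) locations; each distinct value keyed once
index = defaultdict(set)

def load(table, rows):
    """Index every value in every column of a table."""
    for row in rows:
        for column, value in row.items():
            index[value].add((table, column))

load("customers", [{"id": "C001", "phone": "01234 567890"}])
load("orders",    [{"cust": "C001", "amount": "49.99"}])

# "where else do I have this customer code, in any system?"
print(sorted(index["C001"]))
# [('customers', 'id'), ('orders', 'cust')]
```

Because the lookup cost depends on the number of distinct values rather than the number of rows, this shape of storage is what makes cross-system questions answerable without pre-defined joins.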
In tandem with this incredible power, it was designed from the beginning as a multi-user system with advanced security, allowing you to control who can see what, even visually encrypting data where necessary.
Given the nature of the storage, you cannot glean any sensible information from the database files. Nowhere is there anything that resembles a row, or a column, and any information that is there is encrypted.
Data Quality Pro: The market for data profiling and discovery products is now becoming quite crowded, what sets your approach apart from other companies?
Tony Rodriguez: Data profiling is a small, but essential part of what we do. Data Profiling has always been sold as a tool that usefully helps you improve the process of producing a mapping or build specification. Pandora goes way beyond this – we actually provide everything required to produce and validate those mapping specifications.
Once data is loaded into Pandora, it is already automatically profiled and analysed. Analysts are then able to specify their business rules to validate or transform the data and instantly see the results on-screen. Pandora provides similar functionality to an ETL tool in this respect but is trivially simple to use with no programming required. Pandora handles it all, and does it quickly with full data volumes.
Finally, at the press of a button, the analyst can automatically generate a completely validated Mapping Specification, along with the data that is the result of that specification, perfect for testing.
Yes, we can tick the Data Profiling box more comprehensively than anyone else, but as you can see what we are really about is solving Data Integration design, which is way beyond Data Profiling or Data Discovery.
Data Quality Pro: What type of business scenarios do you envisage people adopting Pandora for?
Tony Rodriguez: Clearly our approach revolutionises the way people go about any Data Integration project. Pandora is also very effective as a Data Warehouse for the quality of your data, with capabilities such as the Metadata Repository, Business Glossary and interactive Data Quality Analysis.
We have also provided our customers with the ability to tackle a whole series of other problems which were difficult, expensive or impossible to solve previously.
A good example is a customer who is embarking on a compliance initiative. This typically requires that they build an application in order to see how their data relates, where it is used and propagated, how consistent it is and so on.
Pandora can answer an awful lot of these questions out of the box simply by loading the data in. It automatically relates everything together thanks to the power of the Panorama database engine. The rest of the job can then be done easily using the simple point-and-click capabilities.
We have people doing data inventory for MDM, data quality evaluation as part of outsourcing “due diligence”, and of course Data Warehouse and Data Migration projects.
Another example that we address easily is that of responsibility. A customer can use Pandora’s test result datasets to immediately determine whether an implementation team (internal or external) has delivered what they were asked to. It is often a yes/no decision, and the time and money saved is enormous.
Outside of this, customers have also identified that Pandora/Panorama provides a very powerful ad-hoc data analysis solution. It is so typical today to find business users trying to manipulate existing data in the likes of Excel to get a quick answer to an urgent business question. They are constrained by the responsiveness of their IT departments and their general lack of technical skills.
With Pandora, they can quickly turn raw data into intelligent information, without the need for any specialist IT resources.
Data Quality Pro: Do you have plans to extend your products into additional areas, cleansing for example?
Tony Rodriguez: Most of what customers need in terms of data cleansing is covered by the transformation and conditional logic already present in Data Integration solutions, and of course we allow such rules to be prototyped and validated.
We also have a range of standardisation, cleansing, parsing and fuzzy matching capabilities which have been used on initiatives to produce master data and translation tables for data migrations. In some situations we can provide an attractive alternative to the specialist cleansing solutions particularly with respect to analysing and cleansing Master data, such as product information.
We have no intention of providing specialist Name and Address matching.
Data Quality Pro: What type of user are your products aimed at? Are they geared towards the technical or business community?
Tony Rodriguez: We have deliberately targeted data-savvy people. If you are comfortable using Excel, you will be more than comfortable with Pandora. We believe we have produced a solution that appeals to both technical and non-technical people and have shifted the focus in everything you do with the product to the data. As I mentioned earlier, we’re putting the data back into the hands of the people that understand it. Workshops with technical and business people are usually the most effective way of getting results.
Data Quality Pro: Looking at the profiles of your management team, you have a previous background in delivering services as well as technology, how has this helped shape your technology?
Tony Rodriguez: The fact that we have several seasoned practitioners has been key. We know what problems customers have because we have personally been solving them for years. This has allowed us to build in at the start a lot of things that are usually forgotten such as structure and usability as well as integrating project management and control features such as role based security, activity audit trails, shared documentation and so on. There were no commercial constraints when we embarked on building Pandora and Panorama, and we are able to get the design right without having to rush to bring it to market. Consequently our development is proceeding at a fantastic rate and we are able to focus purely on adding more functionality with each release.
Data Quality Pro: Why did you feel the need to create a completely new repository (Panorama) for managing data storage and manipulation when there are numerous alternatives freely available?
Tony Rodriguez: I have always been of the buy-not-build persuasion. Indeed at Avellino, we bought in the technology for data storage for the Discovery product. Quite frankly, you would have to be a bit insane to embark on writing your own database! We certainly did NOT want to do that. We had some very clear goals. Firstly, it must be self-administering and not require a DBA or ongoing admin. Secondly, it must scale to full volumes, and take advantage of modern 64-bit architectures and multiple cores. Thirdly, it had to be FAST and work on commodity hardware.
I spent a great deal of time researching and implementing other products. No existing commercial database could come close to providing what we needed. Some of the embedded database technologies provided answers to the first two goals, but ultimately they all fell down when it came to performance. I did perhaps start to think that what we were trying to do was just too difficult a problem to solve. I decided to forget what I knew and instead think of an entirely different way of storing data in a manner that really suited this application. A database focused on the actual values, not the structure or metadata. It was relatively easy to come up with this concept, but the stumbling block was then being able to rapidly get data into the database. What I came up with is really radically different to anything else. It is certainly too long a subject to go into here, but needless to say that Pandora would have been impossible to produce without Panorama.
There is an added secondary benefit in that everything we do is produced in-house. Currently we have no third-party dependencies, which is a good thing these days. Informatica and IBM (to name but two) have both dealt blows to competitors by buying up technologies those competitors had embedded in their products.
Data Quality Pro: Being able to analyse a massive amount of information, albeit with a custom-built repository, poses the issue of physical resources – are there any special hardware requirements for processing such large volumes using Panorama/Pandora?
Tony Rodriguez: A key design consideration was a completely scalable platform that is resource-sensitive and works on modern multi-core commodity hardware, and we achieved exactly that. Pandora runs on anything from a laptop upwards. A modest departmental box would service an enterprise with terabytes of data. We actually use far less disk space than an indexed relational database would.
Data Quality Pro: Can we discuss the “ETL Prototyping” feature some more? What issues were you aiming to address with this approach?
Tony Rodriguez: We built this in order to address three issues: firstly, the inability of analysts to validate the specifications they write; secondly, the difficulty they have in making those specifications complete, unambiguous and ready for the implementation people; and lastly, the inability to rigorously and objectively measure whether the implementation actually corresponds to the specifications – everybody hates trying to come up with test cases, and frankly they never do the job properly.
Pandora allows the analysts to specify and execute the rules they want against real data, and to see the results immediately on-screen. They can instantly validate that the rules do what they expect, and that they don’t miss out any cases.
A press of a button generates the associated specification document, the first of our outputs.
Downstream, we simply compare the correct target data, our second output, with the data that is produced by the implementation; we do this in an automated way of course. If the result is the same, the implementation has succeeded, if not, the team can investigate the differences, using Pandora of course.
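The downstream check described here can be pictured as a keyed, row-by-row diff between the analyst's expected output and what the implementation actually produced. The sketch below illustrates that general idea and is not X88's actual comparison mechanism; the function and column names are hypothetical:

```python
def compare(expected, actual, key):
    """Diff two datasets (lists of dicts) on a key column.
    Returns keys missing from actual, unexpected extras, and mismatches."""
    exp = {row[key]: row for row in expected}
    act = {row[key]: row for row in actual}
    missing = sorted(exp.keys() - act.keys())
    extra = sorted(act.keys() - exp.keys())
    mismatched = sorted(k for k in exp.keys() & act.keys() if exp[k] != act[k])
    return missing, extra, mismatched

expected = [{"id": 1, "name": "ACME"}, {"id": 2, "name": "Widgets Ltd"}]
actual   = [{"id": 1, "name": "ACME"}, {"id": 2, "name": "WIDGETS LTD"}]
missing, extra, mismatched = compare(expected, actual, "id")
print(missing, extra, mismatched)
# a clean run is [] [] []; here id 2 differs, so the team investigates it
```

An empty result on all three lists is the "yes" in the yes/no acceptance decision; anything else pinpoints exactly which rows the implementation team needs to look at.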
If we are dealing with reference data, translation tables or lookup tables we can take a short-cut. Rather than trying to get the implementation team to program a combination of rules and “manual” decisions to de-duplicate a file of products, for example, the analyst and the business experts can simply sit down and do the work in Pandora, interactively building a table which shows the relationships between the various products. Pandora has hundreds of functions, including cleansing, comparison and value translation, which they can use in a drag-and-drop graphical way. Then you just give the data file to the implementation team for them to use.
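Building such a translation table often comes down to grouping variant spellings under one canonical value. A toy sketch of that idea using crude normalisation as the grouping key (illustrative only; Pandora's actual cleansing and matching functions are not shown here):

```python
import re

def normalise(name: str) -> str:
    """Crude matching key: lowercase, collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def translation_table(raw_names):
    """Map each raw variant to the first variant seen with the same key."""
    canonical = {}
    table = {}
    for name in raw_names:
        key = normalise(name)
        canonical.setdefault(key, name)  # first spelling becomes canonical
        table[name] = canonical[key]
    return table

print(translation_table(["Widget-100", "WIDGET 100", "Sprocket 9"]))
# {'Widget-100': 'Widget-100', 'WIDGET 100': 'Widget-100', 'Sprocket 9': 'Sprocket 9'}
```

In the interactive workflow described above, the business expert would review and override these machine-proposed groupings before the finished table is handed to the implementation team.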
Data Quality Pro: I see that Pandora now includes a business terms glossary, can you explain more about this feature and what prompted you to implement this?
Tony Rodriguez: Business people look at data in a different way to technical people. They have business areas, departments and business processes. We have databases, tables and columns. At the most granular level we are talking about the same piece of information, but we get there via different routes. The Glossary allows business people to map out the world as they see it, and then to associate the Business Terms with the actual data that implements them, and hence with information on its quality, location, overlap, etc. What is so compelling about our Glossary is that it is directly tied to the data. You are not working in another paradigm, or a different product. This truly allows you to come at the same data from either the physical (technical) route or the logical (business) route. Either way the user ends up looking at the real data.
We had great input and advice on our implementation from industry guru Bonnie O’Neil, and are planning on greatly extending the functionality in this area moving forward.
Data Quality Pro: Do you have any plans to release the Panorama engine as an independent product which organisations or other technology suppliers can build additional services or solutions around?
Tony Rodriguez: An interesting design consideration with Panorama was to internally use SQL as the query language. We made that decision early on because we do indeed intend to provide Panorama as a standalone engine. The first step in this direction will be a JDBC driver which will allow external applications to sit directly on top of Panorama (and hence Pandora). This is currently in development along with enhanced batch and direct APIs. We do not think the world needs yet another generic database product, but there is definitely a gap for an embedded database that provides super-fast data-oriented performance out of the box without the technical and management headache associated with indexed relational structures or low-level embedded databases. We are investigating partnerships with subject-area specialists, such as counter-fraud, and would happily welcome any enquiries from interested customers or vendors alike. We are also quite happy to license the Data Profiling capabilities to other vendors.
Data Quality Pro: What next for X88, can you share news on any forthcoming releases or plans for the direction of the company?
Tony Rodriguez: Well, there are a few paths of logical progression that we will go down, and we have plenty of exciting innovation coming downstream. As our technology is rather wide-reaching, we have key developments in lots of different areas, including further expanding our Data Integration, ad-hoc analysis and Business Intelligence capabilities. Other than that I will be deliberately vague so that I can share in your future excitement when you get to see our next innovations!
Suffice to say we will keep on making it easier for people to turn their raw data into information. And we will continue to treat every customer like they were our first, with the simple aim of turning all our customers into strong, happy references.
For more information on X88 visit: http://www.x88software.com.
If you have questions for Tony relating to this interview please use the comments section below or contact him directly.