Monday, July 11, 2011

Data Generation

We’ve had the ‘Ice Age’, ‘Bronze Age’, ‘Iron Age’ and now we are firmly entrenched in the technology era otherwise known as the ’Information Age’. We’ve had generation X and generation Y and even the ‘Generation Game’ with a plethora of celebrity hosts.

I guess I could affectionately refer to ‘now’ as the ‘Data Generation’.

This is a subject I have spent quite a bit of time on recently. Everyone knows how hard it is to create the datasets we need for testing our application(s). It’s all quite okay if we have a mature application or scrambled ‘Production system’ data we can draw upon.

But!

What about those new systems and new modules? Really, do you want to manually input 200 locations for that new website you have created? Do you want to try and come up with 50 unique customer names?

I bet $100 that everyone has test data like ‘Mr Smith’ or ‘Tom and Jerry’, who live in ‘My Street’ or ‘The White House’. Sure, some of us may have written a program or two to create datasets like ‘Location 001’ through ‘Location 999’. These approaches are all well and good, but what about when you want to demonstrate your new functionality to the CEO, or perhaps to a prospective client if you are a software house?

Sylvester Stallone, 15 Kick Arse Street, Disneyland, BH 90210

is not going to cut the mustard, and nor is

Mr Me, 21a My Street, My Town, My City, ZP12345.

“Just use the live data”, I hear you say.

Well, this would work: the data would survive most (if not all) of your validations, look good and be 100% realistic. But it doesn’t help with your marketing screen prints or your online and in-house demonstrations, not to mention the risk that someone presses the ‘send email’ button and Mr Smith from Browns Bay, Auckland, New Zealand gets a letter about an account overdraft or a product he hasn’t ordered.

So there are just as many pitfalls with real-world datasets, and I’ve only brushed over privacy laws and data etiquette.

Then there is application stress testing and performance!

Do you really want to enter 10m records into a database table by hand? I needed to achieve this the other week to test out some SQL. Are you really going to write programs to populate these datasets and have expensive programming resources create your data for you? I hope not. Are you going to have multiple people performing this role in your organisation?
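
Just to show what the ‘write a program’ route involves, here is a minimal T-SQL sketch (BigTable is an invented, illustrative name) that cross-joins a ten-row list of digits seven times to insert 10 million sequential ids:

-- Ten digits cross-joined seven times = 10^7 = 10,000,000 rows.
;WITH Digits AS (
    SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(n)
)
INSERT INTO BigTable (Id)
SELECT d1.n
     + d2.n * 10
     + d3.n * 100
     + d4.n * 1000
     + d5.n * 10000
     + d6.n * 100000
     + d7.n * 1000000
FROM Digits AS d1
CROSS JOIN Digits AS d2
CROSS JOIN Digits AS d3
CROSS JOIN Digits AS d4
CROSS JOIN Digits AS d5
CROSS JOIN Digits AS d6
CROSS JOIN Digits AS d7;

It works, but every new shape of test data means writing and debugging another script like it.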

• Imagine having a centralised tool that could help you roll out data to multiple databases via direct connections or ODBC.
• Imagine creating realistic datasets from pre-defined lists or ‘Customer lists’.
• Imagine being able to put 10m records into a database table in under 5 minutes.

Well, this is not Hollywood fantasy or a pipe dream from a developer living on the edge of reason.

It’s reality.

After quite a few years away from day-to-day programming, I recently had a need to enter test data into a simple table. I wanted to enter 70 records into a table which had a numeric surrogate key, a description, a date and an amount. This was to test some basic functionality I had written. The 70-record limit was to ensure I had gone beyond the default of 64 records for CA Plex BlockFetch functions.

Using SQL Server Management Studio and wanting to key in sequential, meaningful data, it took me the best part of 15 to 20 minutes to enter records like:-

1, Grid 1 - Description 01, 2001/01/01, 1.11
2, Grid 1 - Description 02, 2001/01/02, 2.22

etc.
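
With hindsight, even this could have been scripted. Something like the following minimal T-SQL sketch (the table and column names are my own illustration, not the real ones) would generate all 70 rows from a recursive numbers CTE:

;WITH Numbers AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM Numbers WHERE n < 70
)
INSERT INTO Grid1 (Id, Description, EntryDate, Amount)
SELECT n,                                                        -- sequential surrogate key
       'Grid 1 - Description ' + RIGHT('0' + CAST(n AS VARCHAR(2)), 2),
       DATEADD(DAY, n - 1, '20010101'),                          -- 2001/01/01 onwards
       n * 1.11                                                  -- 1.11, 2.22, 3.33 ...
FROM Numbers
OPTION (MAXRECURSION 70);

Of course, at the time I didn’t know the syntax, which is rather the point of what follows.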

I then needed to do this in Grid 2 as I was testing some dual grid code I had written.

1, Grid 2 - Description 01, 2001/01/01, 1.11
2, Grid 2 - Description 02, 2001/01/02, 2.22

So, I copied the data to the second table and manually changed ‘Grid 1’ to ‘Grid 2’. Perhaps if I had better MS SQL knowledge (certainly I intend to improve it) then I might have been able to do this with a relatively simple UPDATE. In fact, the answer is yes, I now can. However, the point is that I was dealing with a database that was new to me. CA Plex allows us to target many platforms, so I could have come unstuck in MS SQL Server just as easily as on mySQL via ODBC, or any other database including IBM i DB2.
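
For completeness, the SQL I had in mind looks roughly like this (again a sketch, using the same illustrative table and column names as above): copy the rows across, then rewrite the descriptions in place with REPLACE.

-- Copy the Grid 1 rows into the second table...
INSERT INTO Grid2 (Id, Description, EntryDate, Amount)
SELECT Id, Description, EntryDate, Amount
FROM Grid1;

-- ...then swap the literal 'Grid 1' for 'Grid 2' in every description.
UPDATE Grid2
SET Description = REPLACE(Description, 'Grid 1', 'Grid 2');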

Do I want to become an expert on all these DBMSs? The simple answer is Yes and No.

• Yes. I want to become more familiar with these DBMSs: how they perform, how to tune application code, and how to support the day-to-day administration and running of the systems.
• No. I don’t want to manually create data with SQL scripts, imports etc., and have to know the syntax for many DBMS types just for entering some ‘poxy’ test data.

So, I want to channel my learning into what is important to me.

So, how did I solve my issue?

Well, I asked a dear friend of mine who I chat to most days: Mr Google. I entered ‘Data Generators’ and downloaded 3 or 4 of the tools that appeared on the first page. Nobody goes to page 2 anymore, do they?

I looked at three products: Datanamic, Red Gate and SQLEdit. (Note I link to home pages rather than product pages, as over time companies rebrand products or the links change and the blog post goes out of date.) Suffice to say they all do data generation among other things.

I quickly discarded datanamic’s tool. Compared to the others it just didn’t stack up functionality-wise. I’d go as far as saying it wasn’t worth the effort of getting my corporate firewall opened for it.

http://www.redgate.com/ and http://www.sqledit.com/ were different stories. Both tools are excellent and easy to use. I would say that Red Gate’s tool looks more polished, and its flow and interface were easier to understand. SQLEdit’s tool caters for a larger number of databases natively and many more via ODBC, whereas Red Gate’s is for MS SQL Server only. If you are targeting just MS SQL Server I’d go for that one, as Red Gate will be able to integrate tightly with SQL Server versions, and with a whole host of products in their range they will be kept honest by Microsoft’s changes.

But!!!!! I use CA Plex and I needed to target other DBMSs as well (two currently): MS SQL Server and of course IBM i (DB2/400). I am at a (originally) 2E shop and therefore need to reach that platform too. I have also recently worked on mySQL with Plex via ODBC, so the need to hit these DBMSs was real and present. Therefore, I purchased the SQLEdit tool.

With both tools I quickly spotted a bug or two, or had questions for the support teams about functionality and use. Considering I was on the trial version (not that this should matter), the support and service I received from both companies was absolutely first class. I feel like I can add Paul and Lynda of Red Gate, and Igor of SQLEdit, to my Christmas card list (the personal one).

Fixes were created and sent to me within days (by both companies), or, in the case of CYYMMDD data generation support for my traditional 2E DTE formats, in about 5 hours from Igor and his team. I was simply blown away by the agility of both companies.

The tools are priced within $100 USD of each other for the top-end features, and at $249 USD each for the comparable versions these tools are a steal, as is the exchange rate here in NZ.

I will never ever ever ever ever ever ever (just getting my point across) manually create datasets again.

For me these tools have the following benefits, and I haven’t even fully explored some of the other functionality within them, e.g. generation of referentially-sound datasets.

• Quick, large datasets
• Pattern-, import- and rules-based generation
• Cheaper than doing it manually
• Can leverage existing data (e.g. SQLEdit’s scramble feature; see the sketch after this list)
• Ability to create a centralised data pool for multiple developers, systems etc.
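
To illustrate the scrambling idea, here is my own minimal T-SQL sketch (an invented Customer table, not SQLEdit’s implementation): shuffle the surnames among the rows so that real combinations are broken while the data remains realistic enough to demo.

;WITH Shuffled AS (
    -- Every surname paired with a random row position.
    SELECT LastName,
           ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM Customer
),
Ordered AS (
    -- Every customer paired with a deterministic row position.
    SELECT Id,
           ROW_NUMBER() OVER (ORDER BY Id) AS rn
    FROM Customer
)
UPDATE c
SET LastName = s.LastName
FROM Customer AS c
JOIN Ordered  AS o ON o.Id = c.Id
JOIN Shuffled AS s ON s.rn = o.rn;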

Remember that realistic datasets also help us to identify system limits (overflowing arrays etc.) and performance bottlenecks during the development cycle, and not post-implementation where the cost is significantly higher in terms of ‘budget to fix’ and ‘reputation’.

These tools (or something similar) should be on every developer’s tool belt. If you haven’t got one, or are hiring someone who hasn’t, think again. If you are having software developed for you and your provider doesn’t have these tools in their development methodology, get a new supplier!

Perhaps you should contact me!

Thanks for reading.
Lee.

Post Edit: I have also just discovered another mainstream generator called Data Maker from http://www.grid-tools.com/. I have had to request pricing, which to me suggests that it’ll be outside the scope of the tools mentioned in this blog, and outside many of our budgets. If I can get an evaluation and form a meaningful opinion I will post about it in the future.