I spent a lot of time making DB/2 work on Linux and trying to make it work with the Network Surveyor dataset. I've put these notes together for others who have limited experience with DB/2 and would like to see the issues we/I ran into.
I have a fair amount of experience building parallel database systems, several (2-3) of them as a matter of fact. Some were research tools, but one was a fully SQL-standard-compliant production system with multi-terabyte capacity.
Getting DB/2 to work with the Surveyor data set was quite interesting. It was somewhat disappointing to discover that DB/2 wasn't as polished in some regards as those systems.
One good thing about DB/2 is that David knows lots of people there, and we could ask them how to track down problems. If we had been stuck with the normal communication channels we would not have gotten anywhere.
Digging through the DB/2 documentation to track things down was amazingly slow. I'm talking about spending hours to fix one problem, then needing to move on only to spend more hours on the next. Sometimes the documentation did not address the issue at all. IBM's technique of "emailing using a web page", instead of real email, is a ridiculous method of communication. And they kept harping on using the stupid web page instead of letting people communicate like civilized human beings.
Admittedly some problems were of our own doing, such as trying to track down the licenses.
A good thing about DB/2 is that it is a more-or-less production-quality database. That means it actually logs info about what it is doing, so you can ask people about it and have a chance of tracking down problems. The bad thing is that the logs often don't provide enough info to let you fix the problem yourself without contacting IBM to find out what the error really means.
The Surveyor workload and tables are a bit different from what you would find in a typical relational system. For example, our largest table, the dreaded multi-terabyte Times data, has payload entries of only 16 bytes, compared to the 100-200 byte (or larger) rows you would find in a conventional SQL application. Trying to load this table, we ran up against the physical limitations of the DB/2 database, such as the maximum number of records per page and pages per tablespace, which caused a lot of problems.
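To get a feel for the numbers, here is a rough sketch; the 255-records-per-page figure is from my recollection of the DB/2 limits, so treat it as approximate:

255 records/page x 16 bytes/record = ~4 KB of payload per page, so anything bigger than a 4 KB page is mostly wasted space for this table
1 TB of 16-byte records = ~6.3e10 records = ~2.5e8 pages at 255 records/page

That is the scale at which the per-tablespace page limits discussed further down start to bite.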
Another problem with DB/2 is that it does not allow tables to be replicated across multiple physical nodes. The Surveyor "meta-information" tables are relatively small, and query performance would improve substantially if a copy could exist on every node in the system. Without that mirroring, data has to be shipped between nodes for many queries that could otherwise run locally.
The lack of global command-and-control capabilities was interesting. A parallel DB/2 database is really a collection of independent, communicating SQL database systems. The problem is that if you want to change a database configuration option, you have to perform that change individually on each node; there is no way to issue the change once and have it apply across the whole database.
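The closest thing to a workaround I know of is to script the change across the nodes. A sketch, assuming I remember db2_all correctly (it ships with EEE and runs its argument on every node listed in db2nodes.cfg; the database name "surveyor" here is made up):

db2_all "db2 update db cfg for surveyor using logfilsiz 64000"

Even then you are pushing a command out to every node, not configuring the database as a whole.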
Another thing DB/2 lacks is distributed catalogs. This is a parallel database which relies on NFS-mounting one node's files on all the other nodes so that catalog data can be accessed!
There are a few ways to get data into DB/2, and each has problems. db2 load doesn't work with DB/2 EEE (Enterprise Extended Edition), because the load is per-node, not global: in EEE it expects a split file with hash info, not a file you can load directly.
db2 import works OK, but it is horribly slow and requires a big transaction log. I believe it essentially does insert into _table_ values(...).
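One thing that can ease the log pressure, if I recall the import options correctly, is to have import commit every N rows instead of holding the whole load in a single transaction:

db2 import from _file_ of del commitcount 10000 insert into _tablename_

It is still row-at-a-time underneath, so this doesn't make import fast, just less likely to blow out the log.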
That leaves db2atld, the autoloader. It is quite fast and doesn't require a lot of log space. However, if you have a primary key on a table, it will still need to rebuild the index after every autoload to ensure consistency.
PRIMARY KEY: watch out! The index which implements the primary key needs to be rebuilt every time you complete an insert or bulk load, and if your table is huge this can use more resources than you have. The solution is to create and load the table without the PRIMARY KEY, and then use ALTER TABLE to add the key after the data is loaded.
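A minimal sketch of that workaround; the table and column names here are invented, not the real Surveyor schema:

CREATE TABLE times_load (ts TIMESTAMP NOT NULL, val INTEGER NOT NULL)
-- bulk load the data here (autoload/import), with no index to maintain
ALTER TABLE times_load ADD PRIMARY KEY (ts)

Adding the key at the end builds its index exactly once, instead of after every load.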
Delimited ("del") files use a comma as the column delimiter by default; to change it, add modified by coldelX, where X is the delimiter character. For example, to import a file which uses a column delimiter of "|":
db2 import from _file_ of del modified by coldel| insert into _tablename_
If a load fails and leaves the table stuck in load-pending state, a terminating load against /dev/null clears it:
db2 load from /dev/null of del terminate into _tablename_
To check the state and space usage of your tablespaces:
list tablespaces show detail
When updating DBM CFG or DB CFG parameters there are a few things to keep in mind: in EEE the change has to be made on every node (see above), and most changes don't take effect until the instance is restarted or all connections to the database are dropped. For example, to set the default database path and to enlarge the transaction log that db2 import burns through:
update dbm cfg using DFTDBPATH _path_
update db cfg for _database_ using logfilsiz 64000
update db cfg for _database_ using logprimary 64
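For scale, and assuming I have the units right (logfilsiz is counted in 4 KB pages), those settings work out to roughly:

logfilsiz 64000 = 64000 x 4 KB = ~250 MB per log file
logprimary 64 = 64 x ~250 MB = ~16 GB of primary log space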
To use a page size other than the default, you first need a bufferpool with that page size:
CREATE BUFFERPOOL _name_
ALL NODES
SIZE _in_pages_
PAGESIZE _of_pages_
Then a tablespace in system-managed (SMS) storage:
CREATE TABLESPACE _name_
PAGESIZE _of_pages_
MANAGED BY SYSTEM
USING ('name_of_file_in_db2_file_tree_for_db')
BUFFERPOOL _bufferpool_name_
Or one in database-managed (DMS) storage:
CREATE TABLESPACE _name_
[PAGESIZE _of_pages_]
MANAGED BY DATABASE
USING (file 'name_of_file_in_db2_file_tree_for_db' _size_)
[BUFFERPOOL _bufferpool_name_]
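As a concrete example of how the pieces fit together, a 16 KB bufferpool and a 2 GB database-managed tablespace that uses it would look something like this (the names and sizes are invented, not anything from the Surveyor setup):

CREATE BUFFERPOOL bp16k ALL NODES SIZE 5000 PAGESIZE 16384
CREATE TABLESPACE ts16k
PAGESIZE 16384
MANAGED BY DATABASE
USING (FILE 'ts16k_container' 2G)
BUFFERPOOL bp16k

The tablespace's PAGESIZE has to match a bufferpool with the same page size, which is why the bufferpool gets created first.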
The default is MANAGED BY SYSTEM. Typically this means the database info is put into ordinary OS files. If you would like the database to manage the space itself, performing its own allocation, you can do that with MANAGED BY DATABASE.
The USING (FILE '_filename_' _size_) stanza tells the system what file to use, and what its maximum size should be. _size_ can be expressed in bytes directly, or you can use the appropriate K, M, or G suffix to denote kilobytes, megabytes, or gigabytes.
If you stick with the default MANAGED BY SYSTEM space, something funky happens with record IDs, and you can actually store more than 2^24 pages worth of data. That 2^24-page limitation does apply, however, if you specify the page size of a tablespace, whether in DATABASE or SYSTEM managed storage.
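Taking the 2^24 figure at face value, the per-tablespace capacity (per node, I believe) works out to:

2^24 pages x 4 KB/page = ~64 GB per tablespace
2^24 pages x 32 KB/page = ~512 GB per tablespace

so for a multi-terabyte table you either rely on the funky default space or spread the data over a lot of tablespaces and nodes.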
The USING clause can specify a file or a device. If a relative name is specified, DB/2 puts it in an appropriate location for that database.
Given the 2^24-page quirk above, however, it is possible to create a file in the funky default space that you can't create an index for! Sigh.
db2atld has one serious problem which prevents it from being used efficiently. You can set up the autoloader to split the data across the network and store the split data on the machine it will be loaded on. That is all well and good. However, you can't have the autoloader load the data on each node from the split file already sitting there! Instead, the split file has to be written back to a shared location, then read from disk and passed across the network to its ultimate destination. And of course the shared disk is NFS, so the traffic goes across it again when the split file is stored.
Yes, it works, but with just a little bit of work it could be so much better! If you are performing partially declustered splitting this is less of an issue, since the data needs to be shared anyway. However, when you are running fully declustered it is typically because you have so much data that you need/want to spread it across every node, which is exactly when the extra trip through NFS hurts the most.