# Publications of Katona, G.O.H.

Horizons of combinatorics

Based on papers presented at "The conference Horizons of Combinatorics during the period July 17-21, 2006 at Balatonalmádi (Lake Balaton, Hungary)

Functional dependencies distorted by errors

A relational database D is given with Q as the set of attributes. We assume that the rows (tuples, data of one individual) are transmitted through a noisy channel (or, as many times in case of the data mining applications, the observed data is distorted from the real values in a manner which we cannot know). In case of low probability of the error it may be supposed that at most one data in a row is changed by the transmission or observation. We say that A -> b (A subset of Omega, b epsilon Omega) is an error-correcting functional dependency if the data in A uniquely determine the data in b in spite of this error. We investigate the problem how much larger a minimal error-correcting functional dependency can be than the original one. We will give upper and lower bounds showing that it can be considerably larger than the original sizes, but the growth is only polynomial. (C) 2007 Elsevier B.V. All rights reserved.

Codes that attain minimum distance in every possible direction

The following problem motivated by investigation of databases is studied. Let C be a q-ary code of length n with the properties that C has minimum distance at least n – k + 1, and for any set of k – 1 coordinates there exist two codewords that agree exactly there. Let f(q, k) be the maximum n for which such a code exists. f(q, k) is bounded by linear functions of k and q, and the exact values for special k and q are determined.

No four subsets forming an N

We survey results concerning the maximum size of a family F of subsets of an n -element set such that a certain configuration is avoided. When F avoids a chain of size two, this is just Sperner's theorem. Here we give bounds on how large Y can be such that no four distinct sets A, B, C, D is an element of F, satisfy A subset of B, C subset of B, C subset of D. In this case, the maximum size satisfies [GRAPHICS] which is very similar to the best-known bounds for the more restrictive problem of F avoiding three sets B, C, D such that C subset of B, C subset of D. (C) 2007 Elsevier Inc. All rights reserved.

Bounds on maximal families of sets not containing three sets with A boolean AND B subset of C, A not subset of B

Lower and upper estimates are given on the size of a family of subsets of an n- element set containing no three distinct sets satisfying A boolean AND B subset of C, A not subset of B. This is a sharpening of an earlier result where the same question was solved under the condition that there are no three distinct sets such that A boolean AND B subset of C.

On the security of individual data

We will consider the following problem in this paper: Assume that there are n numerical data {x(1),x(2),.., x(n)} (like salaries of n individuals) stored in a database and some subsums of these numbers are made public or just available for persons not eligible to learn the original data. Our motivating question is: At most how many of these subsums may be disclosed such that none of the numbers x1; x2;...; xn can be uniquely determined from these sums. These types of problems arise in the cases when certain tasks concerning a database are done by subcontractors who are not eligible to learn the elements of the database, but naturally should be given some data to fulfill there task. In database theory such examples are called statistical databases as they are used for statistical purposes and no individual data are supposed to be obtained using a restricted list of SUM queries. This problem was originally introduced by [ 1], originally solved by Miller et al. [ 7] and revisited by Griggs [ 4, 5]. It was shown in [ 7] that no more than ((n)(n/2)) subsums of a given set of secure data may be disclosed without disclosing at least one of the data, which upper bound is sharp as well. To calculate a subsum, it might need some operations whose number is limited. This is why it is natural to assume that the disclosed subsums of the original elements of the database will contain only a limited number of elements, say at most k. The goal of the present paper is to determine the maximum number of subsums of size at most k which can be disclosed without making possible to calculate any of the individual data x(i). The maximum is exactly determined for the case when the number of data is much larger than the size restriction k.

On the number of independent functional dependencies

We will investigate the following question: what can be the maximum number of independent functional dependencies in a database of n attributes, that is the maximum cardinality of a system of dependencies which which do not follow from the Armstrong axioms and none of them can be derived from the remaining ones using the Armstrong axioms. An easy and for long time believed to be the best construction is the following: take the maximum possible number of subsets of the attributes such that none of them contains the other one (by the wellknown theorem of Sperner [8] their number is (n / n/2)) and let them all determine all the further values. However, we will show ( n by a specific construction that it is possible to give more than (n / n/2) independent dependencies (the construction will give (I + 1 / n(2))(n / n/2) of them) and – on the other hand – the upper bound is 2(n) – 1, which is roughly root n(n / n/2).

Some contributions to the minimum representation problem of key systems

Some new and improved results on the minimum representation problem for key systems will be presented. By improving a lemma of the second author we obtain better or new results on badly representable key systems, such as showing the most badly representable key system known, namely of size 2(n(1-c.log n/ log n)), where n is the number of attributes. We also make an observation on a theorem of J. Demetrovics, Z. Furedi and the first author and give some new well representable key systems as well.

2-bases of quadruples

Let Beta(n, <= 4) denote the subsets of [n] := {1, 2,..., n} of at most 4 elements. Suppose that F is a set system with the property that every member of B can be written as a union of (at most) two members of F. (Such an F is called a 2-base of B.) Here we answer a question of Erdos proving that [GRAPHICS] and this bound is best possible for n >= 8.

Constructions via Hamiltonian theorems

Demetrovics et al [Design type problems motivated by database theory, J. Statist. Plann. Inference 72 (1998) 149-164] constructed a decomposition of the family of all k-element subsets of an n-element set into disjoint pairs (A, B) (A boolean AND B = 0, vertical bar A vertical bar = vertical bar B vertical bar = k) where two such pairs are relatively far from each other in some sense. The paper invented a proof method using a Hamiltonian-type theorem. The present paper gives a generalization of this tool, hopefully extending the power of the method. Problems where the method could be also used are shown. Moreover, open problems are listed which are related to the Hamiltonian theory. In these problems a cyclic permutation is to be found when certain restrictions are given by a family of k-element subsets. (c) 2005 Elsevier B.V. All rights reserved.

Two-part and k-Sperner families: New proofs using permutations

This is a paper about the beauty of the permutation method. New and shorter proofs are given for the theorem [ P. L. Erd. os and G. O. H. Katona, J. Combin. Theory. Ser. A, 43 ( 1986), pp. 58 – 69; S. Shahriari, Discrete Math., 162 ( 1996), pp. 229 – 238] determining all extremal two-part Sperner families and for the uniqueness of k-Sperner families of maximum size [ P. Erd. os, Bull. Amer. Math. Soc., 51 ( 1945), pp. 898 – 902].

Largest family without A boolean OR B subset of C boolean AND D

Let F be a family of subsets of an n-element set not containing four distinct members such that A boolean OR B subset of C boolean AND D. It is proved that the maximum size of F under this condition is equal to the sum of the two largest binomial coefficients of order n. The maximum families are also characterized. A LYM-type inequality for such families is given, too. (c) 2005 Elsevier Inc. All rights reserved.

Preface

The 2002 Korea-Hungary Joint Workshop on Combinatorics and the 2002 Com(2)MaC Conference on Graphs and Combinatorics

On the security of individual data

We will consider the following problem in this paper: Assume there are n numeric data {x(1), x(2),..., x(n)} (like salaries of n individuals) stored in a database and some subsums of these numbers axe disclosed by making public or just available for persons not eligible to learn the original data. Our motivating question is: at most how many of these subsums may be disclosed such that none of the numbers x(1), x(2),...,x(n) can be uniquely determined from these sums. These types of problems arise in the cases when certain tasks concerning a database are done by subcontractors who are not eligible to learn the elements of the database, but naturally should be given some data to fulfill there task. In database theory such examples are called statistical databases as they are used for statistical purposes and no individual data are supposed to be obtained using a restricted list of SUM queries. This problem was originally introduced by Chin and Ozsoyoglu [1], originally solved by Miller et al. [5] and revisited by Griggs [4]. It turned out [5] that the problem is equivalent to the following question: If there are n real, non-zero numbers X = {x(1), x(2),..., x(n)} given, what is the maximum number of 0 subsums of it, that is, what is the maximum number of the subsets of X whose elements sum up to 0. This approach, together with the Sperner theorem shows that no more than ((n)(n/2)) subsums of a given set of secure data may be disclosed without disclosing at least one of the data, which upper bound is sharp as well. However, it is natural to assume that the disclosed subsums of the original elements of the database will contain only a limited number of elements, say at most k (in the applications databases are usually huge, while the number of operations is in most of the cases limited). We have now the same question: at most how many of these subsums of at most k members may be given such that none of the numbers x(1), x(2),...x(n) can be uniquely determined from these sums. The main result of this paper gives an upper bound on this number, which turns out to be sharp if we allow subsums of only k or k – 1 members and asymptotically sharp in case of subsums of at most k members.

Strong qualitative independence

The subsets A, B of the n-element X are said to be s-strongly separating if the two sets divide X into four sets of size at least s. The maximum number h(n,s) of pairwise s-strongly separating subsets was asymptotically determined by Frankl (Ars Combin. 1 (1976) 53) for fixed s and large n. A new proof is given. Also, estimates for h(n, en) are found where c is a small constant. (C) 2003 Elsevier B.V. All rights reserved.

New type of coding problem motivated by database theory

The present paper is intended to survey the interaction between relational database theory and coding theory. In particular it is shown how an extremal problem for relational databases gives rise to a new type of coding problem. The former concerns minimal representation of branching dependencies that can be considered as a data mining type question. The extremal configurations involve d-distance sets in the space of disjoint pairs of k-element subsets of an n-element set X. Let X be an n-element finite set, 0 < k < n/2 an integer. Suppose that {A(1), B-1} and {A(2), B-2} are pairs of disjoint k-element subsets of X (that is, \A(1)\ = \B-1\ = \A(2)\ = \B-2\ = k, A(1) boolean AND B-1 = 0, A(2) boolean AND B-2 = 0). Define the distance of these pairs by d({A(1), B-1}, {A(2), B-2}) = min{\A(1) – A(2)\ + \B-1 – B-2\, \A(1) – B-2\ + \B-1 – A(2)\). (C) 2004 Elsevier B.V. All rights reserved.

Recent combinatorial results in the theory of relational databases

Recent results of the authors-some of which are joint with Thalheim and Seleznjevin the area of combinatorial investigations of the relational database model are presented here. In relational databases keys-combinations of attributes uniquely identifying the records-play an important role. The structure and size of keys have been widely investigated. Here, after a short review of the earlier results, we discuss two generalizations: the (average) structure and size of keys in a random database and the concept of error-correcting keys in case of unreliable data collection. (C) 2003 Elsevier Ltd. All rights reserved.

Semantics in databases

Semantics in databases: second international workshop, Dagstuhl Castle, Germany, January 7-12, 2001 : revised papers

Search with small sets in presence of a liar

One unknown element of an n-element set is sought by asking if it is contained in given subsets. It is supposed that the question sets are of size at most k and all the questions are decided in advance, the choice of the next question cannot depend on previous answers. At most I of the answers can be incorrect. The minimum number of such questions is determined when the order of magnitude of k is n(alpha) with alpha < 1. The problem can be formulated as determination of the maximum sized l-error-correcting code (of length n) in which the number of ones in a given position is at most k. (C) 2002 Elsevier Science B.V. All rights reserved.

On the average size of sets in intersecting Sperner families

We show that the average size of subsets of [n] forming an intersecting Sperner family of cardinality not less than ((n-1)(k-1)) is at least k provided that k less than or equal to n/2 – rootn/2 + 1. The statement is not true if n/2 greater than or equal to k > n/2 – root8n+ 1/8 + 9/8. (C) 2002 Elsevier Science B.V. All rights reserved.

Intersecting balanced families of sets

Suppose that any t members (t greater than or equal to 2) of a regular family on and element set have ill least k common elements. It is proved that the largest member of the family has at least k(1/2)n(1-1/t) elements. The same holds for balanced families. which is a generalization of the regularity. The estimate is asymptotically sharp. (C) 2001 Academic Press.

A new type of coding problem

Let X be an n-element finite set, and 0 < k < n/2 an integer. Suppose that {A1, B1} and {A(2), B-2} are pairs of disjoint k-element subsets of X (that is, \A(1)\ = \B-1\= \A(2)\ = \B-2\ = k, A(1) boolean AND B-1 = phi, A(2) boolean AND B-2 = circle divide). Define the distance between these pairs by d({A(1), B-1}, {A(2), B-2}) = min{\A(1) – A(2) \ + \B-1 – B-2\ + \A(1) – B-2\ + \B-1 – A(2)\}. It is known ([2]) that the family of all k-element subsets of X can be paired (with one exception if their number is odd) in such a way that the distance between any two pairs is at least k. Here we answer questions arising for distances larger than k.

Error-correcting keys in relational databases

Suppose that the entries of a relational database are collected in an unreliable way, that is the actual database may differ from the true database in at most one data of each individual. An error-correcting key is such a set of attributes, that the knowledge of the actual data of an individual in this set of attributes uniquely determines the individual. It is showed that if the minimal keys are of size at most ii, then the smallest sizes of the minimal error-correcting keys can be ck(3) and this is the best possible, all minimal error-correcting keys have size at most 3k(3).

Low discrepancy allocation of two-dimensional data

Fast browsing and retrieval of geographically referenced information requires the allocation of data on different storage devices for concurrent retrieval. By dividing the two-dimensional space into tiles, a system can allow users to specify regions of interest using a query rectangle and then retrieving information related to tiles covered by the query. Suppose that there are m I/O devices. A tile is labeled by i if the data corresponding to this area is stored in the ith I/O device. A labeling is efficient if the difference of the numbers of occurrences of distinct labels in any given rectangle is small. Except for some simple cases this discrepancy exceeds i. In the present paper constructions are given to make this discrepancy small relative to m. The constructions use latin squares and a lower bound is given, which shows that the constructions are best possible under certain conditions.

Asymptotic properties of keys and functional dependencies in random databases

Practical database applications give the impression that sets of constraints are rather small and that large sets are unusual and are caused by bad design decisions. Theoretical investigations, however, show that minimal constraint sets are potentially very large. Their size can be estimated to be exponential in terms of the number of attributes. The gap between observation in practice and theory results in the rejection of theoretical results. However, practice is related to average cases and is not related to worst cases. The theory used until now considered the worst-case complexity. This paper aims to develop a theory for the average-case complexity. Several probabilistic models and asymptotics of corresponding probabilities are investigated for random databases formed by independent random tuples with a common discrete distribution. Poisson approximations are studied for the distributions of some characteristics for such databases where the number of tuples is sufficiently large. We intend to prove that the exponential complexity of key sets and sets of functional dependencies is rather unusual and almost all minimal keys in a relation have a length which depends mainly on the size of the relation.

Design type problems motivated by database theory

Let k less than or equal to n, p less than or equal to g<m be positive integers. Suppose that the m x n matrix M satisfies the following two properties: for any choice of k distinct columns c(1),c(2)....,ck, there are q + 1 rows such that the number of different entries in c(i) (1 less than or equal to i less than or equal to k-1) in these rows is at most p, while all q + 1 entries of c(k) in these rows are different; this is true for no choice of k + 1 distinct columns. We review results minimizing m, given n,p,q,k. Two of the results are new. The optimal or nearly optimal constructions can be considered as n partitions of the m-element set satisfying certain conditions. This version leads to the orthogonal double covers, also surveyed here. (C) 1998 Elsevier Science B.V. All rights reserved.

A simple proof of a theorem of Milner

A new short proof is given for the following theorem of Milner: An intersecting, inclusion-free family of subsets of all n-element set has at most [GRAPHICS] members. (C) 1998 Academic Press.

External problems for finite sets and convex hulls – A survey

Let F be a family of distinct subsets of an n-element set. Define p(i)(F) (0 less than or equal to i less than or equal to n) as the number of i-element members of F. Consider the profile vectors (p(0)(F),...,p(n)(F)) for all families F belonging to a certain class A (e.g. A can be the class of all families where any two members have a non-empty intersection). Let epsilon(A) denote the set of extreme points of the convex hull of the set of these profile vectors. Results determining epsilon(A) for some classes A are surveyed. Facets and edges of these convex hulls are also described for some A. Connections to the classical extremal problems are shown.

Preface

Selected papers in honour of Paul Erdos on the occasion of his 80th birthday

The average length of keys and functional dependencies in (random) databases

Practical database applications engender the impression that sets of constraints are rather small and that large sets are unusual and caused by bad design decisions. Theoretical investigations show, however, that minimal constraint sets are potentially very large. Their size can be estimated to be exponential in terms of the number of attributes. The gap between belief and theory causes non-acceptance of theoretical results. However, beliefs are related to average cases. The theory known so far considered worst case complexity. This paper aims at developing a theory of average case complexity. Several statistic models and asymptotics of corresponding probabilities are investigated for random databases. We show that exponential complexity of independent key sets and independent sets of functional dependencies is rather unusual. Depending on the size of relations almost all minimal keys have a length which mainly depends on the size. The number of minimal keys of other length is exponentially small compared with the number of minimal keys of the derived length. Further, if a key is valid in a relation then it is probably the minimal key. The same results hold for functional dependencies.

The Largest Component In A Random Subgraph Of The N-Cycle

Let M denote the order of the largest component in a random subgraph H of the n-cycle C(n), where H has the same vertex set as C(n) and its edge set is defined by independently selecting, with the same constant probability, each of the edges of C(n). The probability that M is equal to k is known for k = 1 and for n greater-than-or-equal-to k greater-than-or-equal-to [n/2]. Here we obtain the exact result for k = 2 and comment on the cases [n/2]>k>2.

A survey of some combinatorial results concerning functional dependencies in database relations

A databaseR has some obvious and less obvious parameters such as the number of attributes, the size |r|, the maximum size of a domain, the number of some special functional dependencies (e.g. the minimal keys), and so on. The main aim of this paper is to survey some of the results giving connections and inequalities among these parameters. The methods are of a combinatorial nature. A generalization of the numerical dependency is also considered.

Optimization Of The Reliability Polynomial In Presence Of Mediocre Elements

Some connections between extremal set theory and the optimization of the reliability polynomial are shown. Then the concept of the reliability polynomial is generalized for the case when the elements can have three different states: good, mediocre and bad. The state of the device can be described by a 0,1,2 sequence. Such a state is called operative if the device operates when its elements are in the states described by the sequence. The maximum of the generalized reliability polynomial is studied under the condition that the set of operative states forms an antichain.

Partial Dependencies In Relational Databases And Their Realization

Weakening the functional dependencies introduced by Amstrong we get the notion of the partial dependencies defined on the relational databases. We show that the partial dependencies can be characterized by the closure operations of the poset formed by the partial functions on the attributes of the databases. On the other hand, we give necessary and sufficient conditions so that for such a closure operation one can find on the given set of attributes a database whose partial dependencies generate the given closure operation. We also investigate some questions about how to realize certain structures related to databases by a database of minimal number of rows, columns or elements.

The Characterization Of Branching Dependencies

A new type of dependencies in a relational database model is introduced. If b is an attribute, A is a set of attributes then it is said that b (p,q)-depends on A, in notation [GRAPHICS], in a database r if there are no q + 1 rows in r such that they have at most p different values in A, but q + 1 different values in b. (1,1)-dependency is the classical functional dependency. Let J(A) denote the set [GRAPHICS]. The set function J(A) is characterized if p=1, 1<q; p=2, 3<q; 2<p, p2-p-1<q. Implications among (p,q)-dependencies are also determined.

Combinatorial And Algebraic Results For Database Relations

A database R has some obvious and less obvious Parameters like the number of attributes, the size Absolute value of R, the maximum size of a domain, the number of some special functional dependencies (e.g. the minimal keys), and so on. The main aim of the paper is to survey some of the results giving connections, inequalities among these parameters- Results of this type give tools to guess the structure of the database having little a priori information. The methods are of combinatorial nature.

On The Number Of Databases And Closure Operations

Closure operations are considered as models of databases. Estimates on the number of closure operations on n elements (or equivalently, on the number of databases with n attributes) are given.

Minimal 2-coverings of a finite affine space based on GF(2)

Let EG(m, 2) denote the m-dimensional finite Euclidean space (or geometry) based on GF(2), the finite field with elements 0 and 1. Let T be a set of points in this space, then T is said to form a q-covering (where q is an integer satisfying 1less-than-or-equals, slantqless-than-or-equals, slantm) of EG(m, 2) if and only if T has a nonempty intersection with every (m-q)-flat of EG(m, 2). This problem first arose in the statistical context of factorial search designs where it is known to have very important and wide ranging applications. Evidently, it is also useful to study this from the purely combinatorial point of view. In this paper, certain fundamental studies have been made for the case when q=2. Let N denote the size of the set T. Given N, we study the maximal value of m.