SELECT DISTINCT over JDBC connection

Posted by Jason Bailey on 26-Apr-2016 11:47

Hi all, 

While I am certainly not new to databases, I am new to OpenEdge. I have a 10.2B database I'm querying via the SQL Client access driver (JDBC) and I'm having some troubles getting the data the way I'd like it. Hopefully one of you can give me some direction.

I'm actually querying a table that contains customer names and various contact methods that have been employed to send them notifications. If a given customer has been contacted via email and via text message, their name will appear twice, as two different rows. The problem is, I don't want duplicates. In this example, I only want one mention of this customer in my result set. I don't care of it is the email response record or the text message response record -- I just need one or the other, but not both. 

A "SELECT DISTINCT" query on only one field does the trick with a traditional SQL database. In OpenEdge, it appears that a SELECT DISTINCT only applies if ALL of the fields are exactly the same in one or more records (meaning duplicate rows). 

Consider this example:

+----------+----------------+----------+-------------------------+
| CUST_NBR |      CUST_NAME | CATEGORY |               TIMESTAMP |
+----------+----------------+----------+-------------------------+
|     3456 |       John Doe |   Static | 2016-04-25 08:17:30.134 |
+----------+----------------+----------+-------------------------+
|     2789 |  George Orwell |  Dynamic | 2016-04-25 08:17:32.945 |
+----------+----------------+----------+-------------------------+
|     3456 |       John Doe |   Static | 2016-04-25 08:17:37.256 |
+----------+----------------+----------+-------------------------+

Suppose I want to grab only one record per customer (so only one John Doe). The problem is, because the timestamp (which I need) is different from the next row, I still get both John Doe records in my result set. How would I get just one John Doe record (I don't care which)?

Hopefully all that makese sense. Could any of you give me any insight into this problem

All Replies

Posted by Libor Laubacher on 26-Apr-2016 12:25

I don't understand the remark about needing a timestamp and not caring about what John Doe record you get ? I mean, sound like you don't care about timestamp either, as they are different ?

SQLExplorer>select distinct name from pub.customer;

name

----------------

John Doe

Someone

<end>

Posted by Jason Bailey on 26-Apr-2016 13:14

On its face, it may not make a ton of sense, but there really is a reason.

This particular table is used to keep track of customer notifications. Based on the contact preferences of the user (email, sms/text, automated phone call, etc), every notice sent to them could result in 1 or more entries in this table. My query doesn't care if one customer was only emailed and another customer was sent an email, sms/text and call. All my query cares about is who was notified -- at all.

Right now I'm eliminating duplicate entries at the script/application level after the data has been pulled from the database (I only want to process one notification per customer). I'd like to have the database do that for me, if possible. It would greatly simplify things for me. Basically I only want to eliminate duplicates on one field/column, not on the entire record. Maybe OpenEdge won't do that.

Once I've processed the table, I'm recording the notification's timestamp, which helps me identify later which notifications I have already analyzed and processed. If the table were my creation, I could adjust the format of the table to suit my needs, but in this case, it's not (it's actually part of a product being provided by a vendor of ours).

Does that make a little more sense? My database experience is largely in MySQL and PostgreSQL (I'm a server/network guy, not a DBA), and I can easily accomplish what I want with that -- I'm just not very experienced with OpenEdge, so I don't know how to formulate the question in OpenEdge lingo.

This may make better sense if I put it this way:

+----------+-----------+----------+-------------------------+ 
| CUST_NBR | CUST_NAME |     TYPE | TIMESTAMP               | +----------+-----------+----------+-------------------------+ 
|     4424 |      Mary |    email | 2016-04-25 08:16:23.238 | +----------+-----------+----------+-------------------------+ 
|     3456 |      John |    email | 2016-04-25 08:17:30.134 | +----------+-----------+----------+-------------------------+ 
|     2789 |    George |    email | 2016-04-25 08:17:32.945 | +----------+-----------+----------+-------------------------+ 
|     4424 |      Mary | text/sms | 2016-04-25 08:24:54.113 | +----------+-----------+----------+-------------------------+ 
|     3456 |      John | text/sms | 2016-04-25 08:17:37.256 | +----------+-----------+----------+-------------------------+ 
|     4424 |      Mary |     call | 2016-04-25 08:27:33.468 | +----------+-----------+----------+-------------------------+

A result set like this from the above table would suit my purposes:

+----------+-----------+-------------------------+ 
| CUST_NBR | CUST_NAME |               TIMESTAMP | 
+----------+-----------+-------------------------+ 
|     2789 |    George | 2016-04-25 08:17:32.945 | +----------+-----------+-------------------------+ 
|     3456 |      John | 2016-04-25 08:17:37.256 | 
+----------+-----------+-------------------------+ 
|     4424 |      Mary | 2016-04-25 08:27:33.468 | 
+----------+-----------+-------------------------+

Thank you!

Posted by Thomas Mercer-Hursh on 26-Apr-2016 13:20

Seems like, at the very least you would want the latest notification ...

Posted by Jason Bailey on 26-Apr-2016 13:30

Are you telling me there's no way to do this in OpenEdge?

Posted by steve pittman on 26-Apr-2016 13:40

Hi Jason,
 
If I understand your example correctly, you want to use standard sql  GROUP BY  functionality to get what you want:
                Select  cust_nbr, category,  min(timestamp)
                    From my_table
                    Group by cust_nbr, category
 
This will give you one cust with one category and use the smallest timestamp for the various records for the cust, category pair.
 
Hope this helps,     ….steve pittman
 
 

Posted by Jason Bailey on 26-Apr-2016 15:54

Steve,

That does group things together, as one would expect with a group statement, but using my example above, I want no rows with duplicate CUST_NUM values. In the above example, there are three rows with a CUST_NUM value of 4424 (Mary) and two rows with a CUST_NUM value of 3456 (John).

I'd like to consolidate those results so that I'm only getting one row for 4424 (Mary) and only one row for 3456 (John). In a PostgreSQL or MySQL world, a 'SELECT DISTINCT' would do the trick. The problem is that (as I understand it), because OpenEdge thinks in terms of rows and not data sets, it just doesn't process this way.

Does that make sense?

Posted by Thomas Mercer-Hursh on 26-Apr-2016 16:01

Did you try Steve's code?  ABL tends to be row oriented, although there are more set-oriented features now, but ABL is *entirely* separate from SQL and SQL is SQL.  It may be missing some feature of SQL that you are used to in another implementation, but it acts like SQL.

Posted by steve pittman on 26-Apr-2016 16:22

OK - looked more closely at your example:

             Select  cust_nbr, category,  min(timestamp)

             From my_table

             Group by cust_nbr,  name

That should give, I believe, your preferred result in your example.

This assumes that each cust# has exactly one name value.  Their are not multiple names for a given cust#.

note - select distinct and grouping are logically equivalent if grouping by all the output columns with no aggregation.

....steve pittman

Posted by gus on 28-Apr-2016 10:14

> On Apr 26, 2016, at 5:02 PM, Thomas Mercer-Hursh wrote:

>

> It may be missing some feature of SQL that you are used to in another implementation, but it acts like SQL.

The beauty of stadards. Among many other things, when you use SELECT DISTINCT, the SQL standard leaves to the implementation what happens to the other column values. Some do it this way and others that way.

This thread is closed