A short overview of steps to be taken to improve the display of search results within EDsync.
There have been complaints about the usability and usefulness of EDsync’s search. As I wasn’t happy with the search myself, I disabled the feature in one of the earlier versions and left it to the users to decide whether to re-enable the search and take it as it was. In this post I want to explain how searching the GBV catalogues works and how I am implementing it while preparing an improved version of the search feature for a future version of EDsync.
Good news first:
- All libraries within the GBV are using the same schema to search their catalogues.
- There is an xml interface to retrieve search results. It is described in a wiki page of the GBV. The state of the interface is said to be “experimental” and “in development”.
- Yes, it’s xml!
- The API you may talk to in order to get down to the search results is rudimentary.
- The xml values are intended to be displayed in a browser that recognises their encoding. Otherwise you have to be aware of the values’ string encoding yourself (different libraries use different encodings).
- There is exactly one relevant value within the xml.
- This one relevant value is the grand concatenation of all the values you’d wish to be separate ones.
Now let’s not monkey around but get into the matter.
To get a list of results, the GBV wiki suggests calling the servers with a request in the following format:
The DB is the library’s database you want to search in (here a cached example for Hamburg, in case they move it again). The XML=1.0 indicates that xml output is activated. The action ACT is a search. The IKT (whatever it may stand for) is decoded here for the University of Hamburg. We’re sorting (SRT) by YOP, the year of publication, and the search term TRM is ‘linux’.
This is a real life example.
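To make the parameter list concrete, here is a minimal Python sketch of how such a request could be assembled. It is not EDsync’s code. The host name, the path layout with DB and XML=1.0 as path segments, the CMD endpoint, and the example values for ACT and IKT are all assumptions for illustration; only the parameter names come from the description above.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, db: str, term: str,
                     act: str = "SRCHA",   # assumed value for "search"
                     ikt: str = "1016",    # assumed search-key code
                     sort: str = "YOP") -> str:
    """Assemble a catalogue search URL from the parameters described above.

    The path layout (/DB=<db>/XML=1.0/CMD?...) is an assumption.
    """
    query = urlencode({"ACT": act, "IKT": ikt, "SRT": sort, "TRM": term})
    return f"{base_url}/DB={db}/XML=1.0/CMD?{query}"


# catalogue.example.org is a placeholder host, not a real catalogue.
print(build_search_url("https://catalogue.example.org", "1", "linux"))
```

Going through urlencode keeps search terms with spaces or umlauts safe to send.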
As a result we receive something like this:
<?xml version="1.0" encoding="UTF-8" ?> // this looks good
<SET nr="12" type="0" hits="577" >
<SHORTTITLE nr="1" PPN="642805822" matstring="MAT_B" matcode="Aaua" // IMHO, this is marmelade
format="text" available="no">SOFSEM 2011 : theory and practice of computer science : 37th Conference on Current Trends in Theory and Practice of Computer Science, Nový Smokovec, Slovakia, January 22-28, 2011 ; proceedings
/ Ivana Černá. - Heidelberg [u.a.] : Springer, 2011</SHORTTITLE>
<SHORTTITLE nr="2" PPN="633913847" matstring="MAT_B" matcode="Aaukf"
format="text" available="no">Multicore application programming : f ...
I said this:
<?xml version="1.0" encoding="UTF-8" ?>
looked good. Except that in other libraries it may look like this:
<?xml version="1.0" encoding="ISO-8859-1" ?>
That’s not so good. It reminds me of Switzerland. At best you speak German, French and Italian and if you want to be a good citizen please take Rhaeto-Romanic lessons, too. As a matter of fact, to get back to searching, it forced me to introduce a new attribute to my library objects, holding the string encoding. Each library object is now aware of the encoding it has to apply to its search result lists.
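The idea of a library object that knows its own encoding might be sketched like this in Python. The class and field names are hypothetical and not taken from EDsync; the point is only that the encoding travels with the library instead of being guessed per response.

```python
from dataclasses import dataclass


@dataclass
class Library:
    """Hypothetical library record; the field names are illustrative only."""
    name: str
    db: str
    encoding: str  # e.g. "utf-8" or "iso-8859-1"

    def decode(self, payload: bytes) -> str:
        # Decode a raw response with this library's own encoding.
        # Undecodable bytes become U+FFFD instead of raising an error.
        return payload.decode(self.encoding, errors="replace")


utf8_library = Library("Library A", "1", "utf-8")
latin1_library = Library("Library B", "2", "iso-8859-1")

title = "Linux Hochverfügbarkeit"
print(utf8_library.decode(title.encode("utf-8")))
print(latin1_library.decode(title.encode("iso-8859-1")))
```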
To summarise, by now we have all information from
<SHORTTITLE nr="1" PPN="642805822" matstring="MAT_B" matcode="Aaua"
and some arbitrary text containing html tags. The attributes of SHORTTITLE become important when it comes to retrieving the details for a given item in the result list, and the text will be used as the title of each result in our list. Of the attributes, we have the PPN as well as the nr, which I will refer to as the index from here on. We could search for the PPN to fetch a single search result for further information retrieval, but the index comes in very handy, as I will show you now.
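Pulling the index, the PPN and the title text out of such a result list could look roughly like the following Python sketch. The sample is a shortened, cleaned-up version of the response shown earlier, and the ShortTitle class is a hypothetical container, not part of EDsync.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SAMPLE = """<SET nr="12" type="0" hits="577">
<SHORTTITLE nr="1" PPN="642805822" matstring="MAT_B" matcode="Aaua"
 format="text" available="no">SOFSEM 2011 : theory and practice of computer science
 / Ivana Černá. - Heidelberg [u.a.] : Springer, 2011</SHORTTITLE>
</SET>"""


@dataclass
class ShortTitle:
    index: int   # the nr attribute
    ppn: str     # the PPN attribute
    title: str   # the concatenated text of the element


def parse_short_titles(xml_text: str) -> list[ShortTitle]:
    root = ET.fromstring(xml_text)
    results = []
    for element in root.iter("SHORTTITLE"):
        # Collapse line breaks and repeated spaces inside the title text.
        text = " ".join("".join(element.itertext()).split())
        results.append(ShortTitle(int(element.get("nr")), element.get("PPN"), text))
    return results


for item in parse_short_titles(SAMPLE):
    print(item.index, item.ppn, item.title)
```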
When you search for a term in the catalogue, it will try to place some cookies. These cookies record the database you were searching as well as the search itself. This is where the index takes on a prominent role: we can now use SHW?FRST=index to obtain the detail page for the given index.
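Because the detail request depends on the cookies set by the search, both requests have to go through the same cookie-aware session. A minimal Python sketch of that flow follows. The host and the path layout are assumptions; only SHW?FRST=index comes from the text above, and fetch_detail is defined but not called here because it needs a live catalogue.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_session() -> urllib.request.OpenerDirector:
    # An opener that stores the cookies the catalogue sets and sends them back.
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))


def detail_url(base_url: str, db: str, index: int) -> str:
    # The /DB=<db>/XML=1.0/ prefix is an assumed layout.
    return f"{base_url}/DB={db}/XML=1.0/SHW?FRST={index}"


def fetch_detail(session: urllib.request.OpenerDirector, search_url: str,
                 base_url: str, db: str, index: int) -> bytes:
    # Run the search first so the catalogue can remember it in its cookies,
    # then ask for the detail page of the result at the given index.
    session.open(search_url).close()
    with session.open(detail_url(base_url, db, index)) as response:
        return response.read()


print(detail_url("https://catalogue.example.org", "1", 3))
```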
It outputs something like this:
...<TR>PPN: <TD>585163421<a href='/DB=1/PPNSET?PPN=585163421' target=_blank><img src="https://katalog.b.tu-harburg.de:443/img_psi/2.0/icons/zitierlink.gif" style="margin-left:10px;" alt="Zitierlink" title="Zitierlink" border="0"></a><br /><TR>Titel: <TD>Linux Hochverfügbarkeit : Einsatzszen
aal 1: TI - Technische Informatik<TR>Signatur: <TD>TIH-800
Actually <TR> and <TD> look like &lt;TR&gt; and &lt;TD&gt; as we receive them. The <a href='/DB=1/P is something like &lt;a href='/DB=1/P. We can also observe that many &lt;TR&gt; tags are preceded by <br />, which is actually received like that. Anyone familiar with html will recognise the <table> elements table row (<tr>) and table data (<td>), as well as a hyperlink starting with <a href=… So we ask ourselves: how come? And why, for god’s sake? In particular, why do some tags arrive in a cryptic form like &lt;TR&gt; while others look like <br />? To be honest with you, I have no clue (yes I do, but leave it like this for now ;)). But I guess all data is processed at least twice: once when it is retrieved from the database and a second time when the xml is produced. I have reasons to believe that the <br /> tags are introduced in the latter process.
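One way to cope with this mix of escaped and literal tags is to unescape the text once and then cut it at the row and cell markers. The following is a rough Python sketch under the assumption that every row has the shape “label: value”. It is not the approach EDsync ends up using, and the sample string is a shortened stand-in for the real response.

```python
import html
import re

RAW = ("&lt;TR&gt;PPN: &lt;TD&gt;585163421"
       "&lt;a href='/DB=1/PPNSET?PPN=585163421' target=_blank&gt;"
       "&lt;img src=\"zitierlink.gif\"&gt;&lt;/a&gt;<br />"
       "&lt;TR&gt;Titel: &lt;TD&gt;Linux Hochverfügbarkeit")


def parse_detail_rows(raw: str) -> list[tuple[str, str]]:
    text = html.unescape(raw)  # turns &lt;TR&gt; back into <TR>
    rows = []
    # Everything before the first <TR> is not a row, so skip it.
    for chunk in re.split(r"<TR>", text, flags=re.IGNORECASE)[1:]:
        parts = re.split(r"<TD>", chunk, maxsplit=1, flags=re.IGNORECASE)
        if len(parts) != 2:
            continue  # a row without a <TD> cell carries no value
        label = parts[0].strip().rstrip(":")
        # Drop remaining markup such as links, images and <br />.
        value = re.sub(r"<[^>]+>", "", parts[1]).strip()
        rows.append((label, value))
    return rows


for label, value in parse_detail_rows(RAW):
    print(f"{label}: {value}")
```

Stripping tags with a regular expression is crude, but it is enough for flat rows like these.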
However, we have to assume that someone has hacked the information about a media item into a database. We would assume that there is some kind of form that the librarian fills out and submits to the database. This form may define fields detailing the title, the isbn, the author, and so on. Let’s name the form or application used to gather this information winIBW, and the language used to specify the entries in the form PICA. So all information is stored in separate database fields in their pure beauty. Of course, it would be a lot of overhead to request an xml containing all possible fields for each single search result. Instead I would suggest either an API to ask for specified fields or an xml containing the field values as a stack of key/value pairs, one for each relevant (not empty and relevant for display) database field. It might look like this:
<?xml version="1.0" encoding="UTF-8" ?>
......<DESCRIPTION>This is some info about the author</DESCRIPTION>
......<DESCRIPTION>This is some info about the title</DESCRIPTION>
We could easily generate tableview cells, for example.
for each DETAIL in RESULTDETAILS
do Create tableview cell: "FIELDNAME: CONTENT"
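Spelled out in Python, that loop might look like the sketch below. The xml layout is purely hypothetical: the element names RESULTDETAILS, DETAIL, FIELDNAME and CONTENT are borrowed from the pseudocode above, and no catalogue actually delivers this format.

```python
import xml.etree.ElementTree as ET

# Hypothetical key/value response; this is the wished-for format, not a real one.
WISHED_FOR = """<RESULTDETAILS>
  <DETAIL><FIELDNAME>Author</FIELDNAME><CONTENT>Ivana Černá</CONTENT></DETAIL>
  <DETAIL><FIELDNAME>Year</FIELDNAME><CONTENT>2011</CONTENT></DETAIL>
</RESULTDETAILS>"""


def cell_labels(xml_text: str) -> list[str]:
    # One "FIELDNAME: CONTENT" string per DETAIL element, in document order.
    root = ET.fromstring(xml_text)
    return [f"{detail.findtext('FIELDNAME')}: {detail.findtext('CONTENT')}"
            for detail in root.iter("DETAIL")]


for label in cell_labels(WISHED_FOR):
    print(label)
```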
At this point I should stop dreaming and get back to reality. I’ll call it the nasty xml reality.
Next time I’ll talk about my current efforts to make the search result details a bit more comfortable to use in EDsync.